The Cray X1 First in a Series of Extreme Performance Systems

The Cray X1 First in a series of extreme performance systems Steve Scott Cray X1 Chief Architect [email protected] Cray X1 Overview, Copyright 2002-3, Cray Inc. 1 Cray Supercomputer Evolution and Roadmap sustained petaflops High BW MPP DARPA HPCS BW++ BW+ BlackWidow X1e Microprocessor-based MPP X1 1000’s of processors UNICOS/mk 1350 1200 T3E T3D SV1ex SV1 T90 J90 C90 Y-MP High Bandwidth X-MP Cray-1 Vector 1-32 processors Cray X1 Overview, Copyright 2002-3, Cray Inc. 2 Architecture Cray X1 Overview, Copyright 2002-3, Cray Inc. 3 Cray X1 Cray PVP Cray T3E • Powerful single processors • Distributed shared memory • Very high memory bandwidth • High BW scalable network • Non-unit stride computation • Optimized communication • Special ISA features and synchronization features • Modernized the ISA and • Improved via custom microarchitecture processors Cray X1 Overview, Copyright 2002-3, Cray Inc. 4 Cray X1 Instruction Set Architecture New ISA – Much larger register set (32x64 vector, 64+64 scalar) – All operations performed under mask – 64- and 32-bit memory and IEEE arithmetic – Integrated synchronization features Advantages of a vector ISA – Compiler provides useful dependence information to hardware – Very high single processor performance with low complexity ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr) – Localized computation on processor chip – large register state with very regular access patterns – registers and functional units grouped into local clusters (pipes) ⇒ excellent fit with future IC technology – Latency tolerance and pipelining to memory/network ⇒ very well suited for scalable systems – Easy extensibility to next generation implementation Cray X1 Overview, Copyright 2002-3, Cray Inc. 5 Not your father’s vector machine • New instruction set • New system architecture • New processor microarchitecture • “Classic” vector machines were programmed differently – Classic vector: Optimize for loop length with little regard for locality – Scalable micro: Optimize for locality with little regard for loop length • The Cray X1 is programmed like a parallel micro-based machine – Rewards locality: register, cache, local memory, remote memory – Decoupled microarchitecture performs well on short loop nests –(however, does require vectorizable code) Cray X1 Overview, Copyright 2002-3, Cray Inc. 6 Cray X1 Node P P P P P P P P P P P P P P P P $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ M M M M M M M M M M M M M M M M mem mem mem mem mem mem mem mem mem mem mem mem mem mem mem mem IO IO • Four multistream processors (MSPs), each 12.8 Gflops • High bandwidth local shared memory (128 Direct Rambus channels) • 32 network links and four I/O links per node Cray X1 Overview, Copyright 2002-3, Cray Inc. 7 NUMA Scalable up to 1024 Nodes Interconnection Network • 32 parallel networks for bandwidth • Global shared memory across machine Cray X1 Overview, Copyright 2002-3, Cray Inc. 8 Network for Small Systems M0 M15 Node 0 M0 M15 M0 M15 M0 M15 Node 3 Slice 0 Slice 15 • Up to 16 CPUs, can connect directly via M chip ports Cray X1 Overview, Copyright 2002-3, Cray Inc. 9 Scalable Network Building Block M0 M15 Node 0 M0 M15 M0 M15 Re M0 Re M15 M0 Ro M15 Ro Hypercube/ 2D Torus M0 M15 Network M0 M15 M0 M15 Node 7 Slice 0One 32-processor “stack” Slice 15 Cray X1 Overview, Copyright 2002-3, Cray Inc. 10 Multicabinet Configuration • Scales as a 2D torus of “stacks” - ~third dimension within a stack (32 links between stacks) - two stacks per cabinet 32 links Bird’s Eye View 8 cabinets, 512 processors, 6.5 Tflops Cray X1 Overview, Copyright 2002-3, Cray Inc. 11 A 40 Tflop Configuration 50 cabinets, 3200 processors Bisection = 2 TB/s (4 TB/s global bandwidth) Cray X1 Overview, Copyright 2002-3, Cray Inc. 12 Designed for Scalability T3E Heritage • Distributed shared memory (DSM) architecture – Low latency, load/store access to entire machine (tens of TBs) • Decoupled vector memory architecture for latency tolerance – Thousands of outstanding references, flexible addressing • Very high performance network – High bandwidth, fine-grained transfers – Same router as Origin 3000, but 32 parallel copies of the network • Architectural features for scalability – Remote address translation – Global coherence protocol optimized for distributed memory – Fast synchronization • Parallel I/O scales with system size Cray X1 Overview, Copyright 2002-3, Cray Inc. 13 Decoupled Microarchitecture • Decoupled access/execute and decoupled scalar/vector • Scalar unit runs ahead, doing addressing and control – Scalar and vector loads issued early – Store addresses computed and saved for later use – Operations queued and executed later when data arrives • Hardware dynamically unrolls loops – Scalar unit goes on to issue next loop before current loop has completed – Full scalar register renaming through 8-deep shadow registers – Vector loads renamed through load buffers – Special sync operations keep pipeline full, even across barriers This is key to making the system like short-VL code Cray X1 Overview, Copyright 2002-3, Cray Inc. 14 Maintaining Decoupling Past Synchronization Points Lsyncs ensure Msyncs ensure ordering among the ordering within a processing units of an MSP processing unit P0 P1 P2 P3 …. …. …. …. …. …. St Vx St Vx St Vx St Vx St V …. …. …. …. …. Msync Msync Msync Msync Lsync V,S …. …. …. …. Ld S Ld Ld Ld Ld Want to protect against hazards, but not drain memory pipeline. Vector store addresses computed early (before data is available): – run past scalar Dcache to invalidate any matching entries – sent out to the Ecache – provides ordering with later scalar load or loads from other P chips Cray X1 Overview, Copyright 2002-3, Cray Inc. 15 Address Translation 63 62 6148 47 3231 16 15 0 VA: MBZ Virtual Page # Page Offset Memory region: Possible page boundaries: useg, kseg, kphys 64 KB to 4 GB 47 4645 36 35 0 PA: Node Offset Physical address space: Main memory, MMR, I/O Max 64 TB physical memory • High translation bandwidth: scalar + four vector translations per cycle per P • Source translation using 256 entry TLBs with support for multiple page sizes • Remote (hierarchical) translation: – allows each node to manage its own memory (eases memory mgmt.) – TLB only needs to hold translations for one node ⇒ scales Cray X1 Overview, Copyright 2002-3, Cray Inc. 16 Cache Coherence Global coherence, but only cache memory from local node (8-32 GB) – Supports SMP-style codes up to 4 MSP (4-way sharing) – References outside this domain converted to non-allocate – Scalable codes use explicit communication anyway – Keeps directory entry and protocol simple Explicit cache allocation control – Per instruction hint for vector references – Per page hint for scalar references – Use non-allocating refs for explicit communication or to avoid cache pollution Coherence directory stored on the M chips (rather than in DRAM) – Low latency and really high bandwidth to support vectors • Typical CC system: 1 directory update per proc per 100 or 200 ns •Cray X1: 3.2 dir updates per MSP per ns (factor of several hundred!) Cray X1 Overview, Copyright 2002-3, Cray Inc. 17 System Software Cray X1 Overview, Copyright 2002-3, Cray Inc. 18 Cray X1 I/O Subsystem SPC PCI-X PCI-X Channels BUSES Cards 200 MBytes/s FC-AL 2 x 1.2 GB/s 800 MB/s RS200 FC SB IOCA SB CB FC SB I Chip SB FC IOCA Gigabit Ethernet or FC CNS Hippi Network I/O Board RS200 SB X1 Node FC SB IOCA CB SB FC SB I Chip FC IP over Fiber CPES IOCA Gigabit Ethernet or FC CNS Hippi Network I/O Board Cray X1 Overview, Copyright 2002-3, Cray Inc. 19 Cray X1 UNICOS/mp UNIX kernel executes Single System Image on each Application node UNICOS/mp (somewhat like Chorus™ Global resource microkernel on UNICOS/mk) manager (Psched) schedules Provides: SSI, scaling & applications on resiliency 256 CPU Cray X1 Cray X1 nodes Commands launch System Service nodes provide applications like on file services, process manage- T3E (mpprun) ment and other basic UNIX functionality (like /mk servers). User commands execute on System Service nodes. Cray X1 Overview, Copyright 2002-3, Cray Inc. 20 Programming Models • Traditional shared memory applications – OpenMP, pthreads – 4 MSPs – Single node memory (8-32 GB) – Very high memory bandwidth – No data locality issues • Distributed memory applications – MPI, shmem(), UPC, Co-array Fortran – Rewards optimizations for cluster/NUMA micro-based machines • work and data decomposition • cache blocking – But less worry about communication/computation ratio, strides and bandwidth • multiple GB/s network bandwidth between nodes • scatter/gather and large-stride support Cray X1 Overview, Copyright 2002-3, Cray Inc. 21 Mechanical Design Cray X1 Overview, Copyright 2002-3, Cray Inc. 22 Packaging - node board lower side Field-Replaceable Memory Daughter Cards Spray Cooling Caps CPU MCM 8 chips Network PCB Interconnect 17” x 22’’ Air Cooling Manifold Cray X1 Overview, Copyright 2002-3, Cray Inc. 23 Cray X1 Node Module Cray X1 Overview, Copyright 2002-3, Cray Inc. 24 Node Module (Power Converter side) Cray X1 Overview, Copyright 2002-3, Cray Inc. 25 Cray X1 Chassis Cray X1 Overview, Copyright 2002-3, Cray Inc. 26 64 Processor Cray X1 System ~820 Gflops Cray X1 Overview, Copyright 2002-3, Cray Inc. 27 256 Processor Cray X1 System ~ 3.3 Tflops Cray X1 Overview, Copyright 2002-3, Cray Inc. 28 Cray X1 Summary • Extreme system capability – Tens of TFLOPS in a Single System Image (SSI) MPP – Efficient operation due to balanced system design – Focus on sustained performance on the most challenging problems Achieved via: – Very powerful single processors • High instruction-level parallelism • High bandwidth memory system – Best in the world scalability • Latency tolerance via decoupled vector processors • Very high-performance, tightly integrated network • Scalable address translation and communication protocols • Robust, highly scalable operating system (Unicos/mp) Cray X1 Overview, Copyright 2002-3, Cray Inc.

The Cray X1 First in a Series of Extreme Performance Systems

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support