Big Challenges for Big Systems
Ter@tech Forum, June 16th, 2010
Robert Uebelmesser, HPC Director, SGI EMEA

Agenda
Big Challenges for Big HPC Systems
– The SGI ICE 8400 system and its IB-based, enhanced system interconnect
– The SGI Altix UV and its Numalink-based system interconnect
– (The recently announced SGI "Mojo", an accelerator-based "1 Petaflop in a Cabinet" system)
Company Confidential

(Some of the) Big Challenges for Exascale and Big Next-Generation HPC Systems
– Physical challenges: power, floorspace, cooling
– Architectural challenges: many-core processors, very large memories, system interconnects, accelerators
– Software challenges: OS, languages, applications
Supercomputer System Approaches
Integrated Cluster Systems (commodity interconnect):
– Each node has its own memory and OS
– Node bandwidth and latency issues
– More network interfaces
– Lower application efficiency
+ Lower hardware cost
+ Heterogeneity and node autonomy
+ Easier to deploy and administer

Globally Shared Memory Systems (SGI NUMAlink interconnect):
+ All nodes operate on one large shared memory space under one operating system
+ Eliminates data passing between nodes
+ Big data sets fit in memory; less memory per node required
+ Higher application efficiency
+ Increased reliability
– More expensive hardware
– Considered less scalable
SGI Supercomputer Lines
SGI ICE 8400: Highly Integrated Cluster System
SGI Altix® UV: Partitioned Globally Addressable Memory System
SGI Altix ICE 8400
SGI Altix ICE 8400
– Blade-based architecture
– AMD- and Intel-based processor blades
– Diskless blade operation
– Integrated management network
– Hierarchical system management
– Single-plane or dual-plane 4x QDR InfiniBand interconnect
– SGI enhanced hypercube or fat tree networks
– Integrated switch topology simplifies scaling from 32 to 65,536 nodes (1,024 racks)
– Up to 128 processor sockets per rack
– 4 or more DIMMs per socket
– Optional 2.5” SSD or HD for local storage
SGI Altix ICE 8400: Designed for High-Performance Computing
Breakthrough performance density: up to 128 sockets per rack
SGI® Altix® ICE compute blade: up to 12 cores, 96 GB, 2 IB (Intel) or up to 24 cores, 128 GB, 2 IB (AMD)
Altix ICE rack:
• 42U rack (30” W x 40” D)
• 4 cable-free blade enclosures, each with up to 16 blades (16 2-socket nodes)
• Up to 128 DP AMD Opteron or Intel® Xeon® sockets
• Single-plane or dual-plane IB 4x QDR interconnect
• Minimal switch topology simplifies scaling to 1000s of nodes
SGI ICE Blade for Intel Westmere (Type 1) IP-101: One Single-Port QDR HCA
(6) DDR3 RDIMMs; read BW per socket at 1333 MHz: 32 GB/s
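The 32 GB/s figure follows from the DDR3 channel arithmetic; a quick sanity check (assuming the standard three 64-bit memory channels per Westmere socket):

```python
# Peak DDR3-1333 read bandwidth per socket:
# transfers/s x bytes per transfer x channels.
MEGA_TRANSFERS_PER_S = 1333   # DDR3-1333
BYTES_PER_TRANSFER = 8        # 64-bit memory channel
CHANNELS_PER_SOCKET = 3       # three DDR3 channels per Westmere socket

bw_mb_s = MEGA_TRANSFERS_PER_S * BYTES_PER_TRANSFER * CHANNELS_PER_SOCKET
print(f"{bw_mb_s / 1000:.0f} GB/s per socket")  # ~32 GB/s
```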
SGI ICE Blade for Intel Westmere (Type 2) IP-105: Two Single-Port QDR HCAs (Dual Plane)
(6) DDR3 RDIMMs; read BW per socket at 1333 MHz: 32 GB/s
Two independent Mellanox ConnectX-2 HCAs for 2x off-blade IB interconnect bandwidth.

SGI ICE Node Blade for AMD Magny-Cours: One Dual-Ported QDR HCA
SGI ICE Node Blade for AMD Magny-Cours: Two Single-Ported QDR HCAs
Two independent IB HCAs for 2x off-blade IB interconnect bandwidth.
SGI Altix ICE 8400EX Blade Container
[Diagram: blade container with internal hypercube links and external service ports]
Comparison: Fat Tree vs. Hypercube
[Diagram: two 36-node systems. The hypercube side needs 36 external IB ports, while the fat tree side needs 216 external IB ports (8 x 27) across its external switches.]
Flexibility in Networking Topologies
Hypercube topology:
- Lowest network infrastructure cost
- Well suited for "nearest neighbor" type MPI communication patterns
Enhanced hypercube topology:
- Increased bisection bandwidth per node at only a small increase in cost
- Well suited for larger node-count MPI jobs
All-to-all topology:
- Maximum bandwidth at lowest latency for up to 128 nodes
- Well suited for "all-to-all" MPI communication patterns
Fat tree topology:
- Highest network infrastructure cost; requires external switches
- Well suited for "all-to-all" type MPI communication patterns
The robust integrated switch blade design enables industry-leading bisection bandwidth at ultra-low latency.
SGI’s Enhanced Hypercube
With the standard hypercube topology, only one cable is used for each dimension link, which leaves many IB switch ports unused. With SGI’s Enhanced Hypercube topology, the available ports are put to use by adding redundant links at the lower dimensions to improve the bandwidth of the interconnect.
[Diagram: hypercube with single links vs. enhanced hypercube with redundant links]
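As a rough illustration of the idea (a toy model with an assumed port-filling scheme, not SGI's exact cabling), spare switch ports can be spent as redundant cables on the lower dimensions:

```python
def enhanced_link_counts(dims, ports_per_switch):
    """Toy model of enhanced-hypercube cabling (assumed scheme):
    start with one cable per dimension as in a standard hypercube,
    then spend the remaining switch ports on redundant cables in the
    lower half of the dimensions, round-robin."""
    links = [1] * dims                  # standard hypercube: 1 cable/dimension
    spare = ports_per_switch - dims
    low = max(1, dims // 2)             # only the lower dimensions get extras
    i = 0
    while spare > 0:
        links[i % low] += 1
        i += 1
        spare -= 1
    return links

# 4D hypercube on switches with 8 fabric ports:
print(enhanced_link_counts(4, 8))  # → [3, 3, 1, 1]
```

The low dimensions carry the densest traffic in many MPI patterns, which is why the redundant cables go there first.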
The Strengths of SGI’s Hypercube
Higher connection capabilities and resources at the node:
– Hypercube topologies add an "orthogonal" dimension of interconnect with every doubling in system size.
– Each dimension of the hypercube interconnect scales linearly as the system size increases.
– Hypercube topologies are best described by the aggregate bisection bandwidth across all dimensions of the interconnect.
– Minimum bisection bandwidth requirements defined for fat tree topologies do not capture the full capabilities of SGI’s hypercube fabric.
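The scaling claim above can be made concrete: cutting an N-node hypercube across any one dimension severs N/2 links, and the total link count grows with every added dimension. A small sketch of that arithmetic:

```python
import math

def hypercube_links(n_nodes):
    """Total links in an n-node hypercube: each of the N nodes has
    log2(N) links, and each link is shared by two nodes."""
    d = int(math.log2(n_nodes))
    return n_nodes * d // 2

def bisection_links(n_nodes):
    """Links crossing a cut along any single dimension: N/2."""
    return n_nodes // 2

for n in (128, 1024, 32768):
    print(n, hypercube_links(n), bisection_links(n))
```

The single-dimension bisection (N/2 links) already scales linearly with node count; the aggregate across all log2(N) dimensions is what the slide argues a fat-tree-style minimum-bisection metric understates.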
Hypercube vs. 3D Torus
[Chart: hop count vs. nodes in system (8 to 32,768) for hypercube and 3D torus; the hypercube's hop count grows far more slowly with system size.]
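The hop-count gap in the chart follows from the topology diameters: log2(N) for the hypercube versus, for an idealized perfectly cubic 3D torus, half the ring length in each of the three dimensions. A sketch reproducing the trend:

```python
import math

def hypercube_max_hops(n_nodes):
    # One hop per dimension: diameter = log2(N).
    return int(math.log2(n_nodes))

def torus3d_max_hops(n_nodes):
    # Assume a cubic torus with side s = N^(1/3); worst case is
    # half the ring length in each of the three dimensions.
    side = round(n_nodes ** (1 / 3))
    return 3 * (side // 2)

for n in (512, 4096, 32768):
    print(n, hypercube_max_hops(n), torus3d_max_hops(n))
```

At 32,768 nodes the hypercube diameter is 15 hops, while the cubic torus is several times that; real torus machines use non-cubic shapes, so the chart's exact values may differ from this idealization.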
MPI Bisection Bandwidth Measurements
[Chart: MPI bandwidth measurements for HC_1, EHC_1, HC_2, and EHC_2, relative performance (100% = HC_1), on Exchange and Double Random Ring tests at 16 to 1024 threads.]
MPT 1.23. Exchange: simple bisection BW experiment. D-R Ring: Double Random Ring, a more accurate bisection BW estimation.

MPI_Alltoall Communications
[Chart: MPI_Alltoall() performance for HC_1, EHC_1, HC_2, and EHC_2, relative performance (100% = single-plane hypercube), at 128 to 1024 threads.]
MPT 1.23 – MPI_Alltoall() – buffer size = 700 KB
HPCC Benchmarks
[Chart: HPCC performance on Nehalem/CB+, hypercube vs. enhanced hypercube with single and dual rail (HC-1, EHC-1, HC-2, EHC-2), relative performance (100% = single-rail HC-1), on FFT, GUPs, ptrans, PP BW, and RR BW at 128 to 1024 threads.]
HPCC version 1.x – MPT 1.23. PP BW: PingPong bandwidth. RR bandwidth: Random Ring bandwidth.

SGI Enhanced Hypercube vs. Fat Tree Latencies
Estimated Altix ICE system half ping-pong MPI latency
[Chart: latency (nsec) vs. nodes in system (16 to 32,768) for ICE 8200 hypercube (DDR with InfiniHost), ICE 8200 LX/EX hypercube (DDR with ConnectX), and ICE 8400 hypercube (QDR with ConnectX-2), compared with a fat tree.]
12D Hypercube
12D element – 12D links (2,048) – 32,768 nodes – 512 racks
SGI System Interconnect at NASA/Ames
SGI ICE System at GENCI/CINES
• SGI Altix ICE
• 23,040 processor cores
• 267.9 Tflops peak performance
• 238 Tflops Linpack
• #18 on the June 2010 Top500 list
• Highest-ranked system in France; 3rd-ranked in Europe
Altix® UV
The SGI Altix Ultraviolet (UV) System
Evolution from ccNUMA shared memory (SGI Origin), to partitioned globally addressable shared memory (SGI Altix 4700), to a HW-accelerated partitioned globally addressable system (SGI Altix UV).
SGI Altix® UV
Partitioned Globally Addressable Memory System
– Advanced, SGI-enhanced bladed architecture
– Intel Nehalem-EX processors
– SGI NUMAlink interconnect for the shared memory implementation
– Built-in MPI offload engine
UV Node Architecture
[Diagram: UV node with two Intel Nehalem-EX sockets and their memory, connected via QPI to the UV HUB; the HUB implements the SGI NUMA protocol (PI, GRU, AMU, NI, and MI blocks), global cache coherence with a coherence directory in memory, and NUMAlink 5 links to other UV nodes.]
UV Interconnect Architecture
[Diagram: two UV nodes, each with two Intel sockets, global memory, an Intel IOH, and a UV HUB with its own coherence directory; the HUBs run the SGI NUMAlink protocol and connect to each other and to further nodes over NUMAlink 5.]

Globally Shared Memory System
NUMAlink® 5 is the glue of Altix® UV 100/1000
[Diagram: four Altix UV blades, each with a HUB, two CPUs, and 2 x 64 GB of memory, joined through a NUMAlink router into 512 GB of shared memory; the design scales up to 16 TB of global shared memory.]
UV Architectural Scalability
16,384 nodes (scaling supported by the NUMAlink 5 node ID)
– 16,384 UV_HUBs
– 32,768 sockets / 262,144 cores (with 8 cores per socket)
Coherent shared memory
– Xeon: 16 TB (44-bit socket physical address)
– 8 petabytes of coherent get/put memory (53-bit PA with the GRU)
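The capacity limits follow directly from the physical-address widths; a quick check:

```python
TB = 2 ** 40
PB = 2 ** 50

coherent_shared = 2 ** 44   # 44-bit socket physical address
get_put_space = 2 ** 53     # 53-bit PA through the GRU

print(coherent_shared // TB, "TB coherent shared memory")  # 16 TB
print(get_put_space // PB, "PB get/put space")             # 8 PB
```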
UV_HUB/Node Controller Technologies
Globally Addressable Memory
• Large shared address space (8 PB)
• Extremely large coherent get/put space
• Atomic memory operations (AMOs) in coherent memory
• Message queues in coherent memory
• Coherence directory

Active Memory Unit
• Rich set of atomic operations (e.g. HW barrier support)
• Multicast
• Page initialization

GRU (Global Reference Unit)
• For MPI data movement
• For PGAS support
• High-BW, low-latency socket communication
• Update cache for many AMOs
• Scatter/gather operations
• BCOPY operations
• External TLB with large-page support

RAS
• x4 DRAM correction
• Redundant real-time clock
• Failure isolation between partitions
• Built-in debug and performance monitors
• Internal/external datapath protection
NOTE: UV HUB memory management functions do not interfere with fast on-node memory access
Altix® UV Characteristics
1. Scalability
2. Performance
SGI Altix UV Scalability: Single System Images
Single system image scales to 256 Intel® Xeon® “Nehalem-EX” sockets (2,048 cores) and 16 TB of memory
– Intel coherence within a blade
– SGI coherence between blades
– 16 TB is the global shared memory limit of the “Nehalem-EX” processor
Investment protection
– Start as small as four sockets and scale up over time
– Start with a large SSI and partition later as required
SGI Altix UV Scalability: Architectural Limits
Altix® UV’s architecture supports scaling to the petaflop level. The upper limit on scaling is the Altix UV hub, which is capable of connecting 32,768 sockets.
[Diagram: petaflop system built from 256-socket fat tree building blocks (4-rack groups). Each red and green torus link shown is 2 links per L1R, with 8 L1Rs per plane; 8 of 16 L1R ports support the fat tree and the other 8 support 16 copies of the torus, with green links interleaved across the aisles and red links interleaved down the ranks.]
SGI Altix UV Performance: SPEC Benchmarks
World record SPECint_rate and SPECfp_rate performance with only 64 sockets populated! – SPECint_rate_2006: #1 on any architecture – SPECfp_rate_2006: #1 on x86 architecture, #2 behind SGI Altix 4700 with eight times as many processors
SGI Altix UV Performance: SPEC Benchmarks
SPECint_rate_base2006:
#1: SGI Altix UV 1000, 512c Xeon X7560 – 10,400
#2: SGI Altix 4700 Bandwidth System, 1024c Itanium – 9,030
#3: Sun Blade 6048 Chassis, 768c Opteron 8384 (cluster) – 8,840
#4: ScaleMP vSMP Foundation, 128c Xeon X5570 – 3,150
#5: SGI Altix 4700 Density System, 256c Itanium – 2,890

SPECfp_rate_base2006:
#1: SGI Altix 4700 Bandwidth System, 1024c Itanium – 10,600
#2: SGI Altix UV 1000, 512c Xeon X7560 – 6,840
#3: Sun Blade 6048 Chassis, 768c Opteron 8384 (cluster) – 6,500
#4: SGI Altix 4700 Bandwidth System, 256c Itanium – 3,420
#5: ScaleMP vSMP Foundation, 128c Xeon X5570 – 2,550

Source: www.spec.org (March 2010)
SGI Altix UV Performance
Shared memory capacity per SSI (max. 16 TB)
– Massive speed-up for memory-bound applications
MPI Offload Engine (MOE) frees CPU cycles and improves MPI performance
– MPI reductions 2-3x faster than competitive clusters/MPPs
– Barriers up to 80x+ faster
SGI Altix UV MPI Performance Acceleration
Altix UV offers up to 3x improvement in MPI reduction operations over standard IB networks.
[Chart: HPCC benchmarks; barrier latency on UV with MOE is dramatically better than on competing platforms (up to 80 times).]
SGI Altix UV Performance Acceleration with MPI Offload Engine (MOE)
[Chart: HPCC benchmark simulations comparing UV with MOE against UV with MOE disabled, showing substantial improvements with the MPI Offload Engine.]
Source: SGI Engineering projections

SGI Altix® UV
World’s Fastest with
– World-record SPECint_rate and SPECfp_rate performance
– High-speed NUMAlink® 5 interconnect (15 GB/sec)
– MPI offload engines that maximize efficiency
World’s Most Scalable with
– Single system image scaling up to 2,048 cores and 16 TB of memory
– Direct access to global data sets up to 16 TB
World’s Most Flexible in
– Investment protection: start with four sockets and scale up over time, or start with a 2,048-core SSI and partition over time as needed
– Compelling performance regardless of the type of application
Open Platform which
– Leverages Intel® Xeon® 7500 (“Nehalem-EX”) processors
– Runs industry-standard x86 operating systems and application code
Altix UV Graphics and GP-GPU Packaging
NVIDIA® Tesla™ or Quadro® Plex enclosures cable directly to the UV external PCIe mezzanine riser over two PCIe Gen2 x16 cables:
– 3U NVIDIA Tesla unit: 4 GPUs, two x16 links
– 1U NVIDIA Quadro Plex unit: 2 GPUs + GSync, one x16 link
Each UV 100 or UV 1000 blade can connect to one NVIDIA Tesla or Quadro Plex enclosure (the Altix UV 10 uses NVIDIA host cards to achieve similar connectivity).
Summary
Big Challenges for Big HPC Systems
– The SGI ICE 8400 system and its IB-based, enhanced system interconnect • Significantly improves interconnect bandwidth without adding cost
– The SGI Altix UV and its Numalink-based system interconnect • Significantly improves interconnect latencies and performance on complex MPI operations • Allows very large memories without performance degradation
– (The recently announced SGI "Mojo", an accelerator-based "1 Petaflop in a Cabinet" system)

Thank You

SGI “Mojo” System
“1 Peta-Flop-in-a-Cabinet”
Introduction to “Mojo”
Internal product name “Mojo”: 1 Peta-Flop-in-a-Cabinet
Create an industry-leading PCIe infrastructure for accelerator deployment:
– Complete integration into the industry-leading SGI Altix ICE system
– Highest density in the industry
– Flexible power and cooling solutions
– Scales to several petaflops
– Delivers 1 peta-flop in a cabinet, single-precision peak, using ATI GPUs
– Also supports NVIDIA, ATI, Tilera and other PCIe-based solutions in high-density packaging
Customer Deliveries Q4 CY2010
Mojo Software Overview
Open-standards based
Full integration into Altix ICE system administration
OpenCL support
SGI AEE (Accelerator Execution Environment):
– Accelerator resource management
– System accounting for accelerator use
– PCIe BW monitor
– Diagnostics
PCIe Infrastructure Requirements
– 4 PCIe x16 Gen2 interfaces to support full bandwidth to four accelerators per node
– Dual QDR IB interconnect for optimized cluster integration
– Unique PCIe-bandwidth-optimized motherboard for integration into the Altix ICE framework
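For reference, the "full bandwidth" a x16 Gen2 slot can feed each accelerator works out from the standard PCIe 2.0 signaling figures:

```python
GT_PER_LANE = 5.0    # PCIe Gen2: 5 GT/s per lane
ENCODING = 8 / 10    # 8b/10b line coding overhead
LANES = 16

gbit_s = GT_PER_LANE * ENCODING * LANES   # 64 Gbit/s per direction
gbyte_s = gbit_s / 8                      # 8 GB/s per direction
print(gbyte_s, "GB/s per direction per x16 slot")
```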
Basic Building Block: Node Board
Colgate IP112 node board study:
[Diagram: node board, front view, airflow front to rear, 16.4” max width, 13.3” depth, 2.1” PCIe riser boards, four single-slot-wide passively cooled GPU boards.]
Features:
– 2-socket G34 node
– 16 DDR3 DIMMs (8 per socket), supporting up to 128 GB (using 8 GB DIMMs)
– 2 x SR5670, 2 x SR5650, 1 x SP5100
– Support for up to 4 full-size x16 PCIe Gen2 single-wide card slots (up to 4 GPUs)
– 1 x dual-port GigE NIC
– 2 x single-port QDR IB HCA
– 1.8” SATA disk (HDD or SSD) off the SP5100, for swap/scratch

Node Boards Scaling into an IRU
[Diagram: Mojo chassis enclosure study, plan and front views; eight node boards on a power and control backplane with power extender boards and UV power supplies. Differentiated density while supporting high-end 225 W single-wide GPU cards.]

IRUs Scale into Enclosures
[Diagram: full 128-GPU deployment in a standard 42U rack (2U head node, 1U 36-port InfiniBand switches, ICE CMCs, UV power supplies, and groups of 64 GPGPUs in 16 nodes) alongside a 256-GPU deployment in an M-Rack optimized enclosure.]
One Peta-Flop-in-a-Cabinet
Based on 4 x 256 GPUs and 1.03 TF/s (single precision) per GPU.

Example: NVIDIA Linpack
NVIDIA M2050:
– 1.03 TF/s single-precision FP (peak)
– 515 GF/s double-precision FP (peak)
– 330 GF/s Linpack (DP)
– 148 GB/s bandwidth
– 225 W TDP
About 60% Linpack efficiency; the earlier Tesla generation was around 80%.
Using the M2050 and 330 GF/s Linpack per GPU, 1 petaflop of double-precision Linpack would require about 3,032 GPUs.
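The cabinet and GPU-count figures can be checked against the slide's own numbers (1.03 TF/s SP peak and 330 GF/s DP Linpack per M2050):

```python
GPUS_PER_CABINET = 4 * 256   # from the slide: 4 x 256 GPUs
SP_PEAK_TF = 1.03            # single-precision peak per GPU (TF/s)
DP_LINPACK_GF = 330.0        # double-precision Linpack per GPU (GF/s)

cabinet_sp_tf = GPUS_PER_CABINET * SP_PEAK_TF
print(f"{cabinet_sp_tf:.0f} TF/s SP peak per cabinet")     # ~1055 TF/s ≈ 1 PF

gpus_for_1pf_dp = 1_000_000 / DP_LINPACK_GF
print(f"~{gpus_for_1pf_dp:.0f} GPUs for 1 PF DP Linpack")  # just over 3,000
```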
Mojo in a Container Environment
Up to 3,072 GPUs and 768 nodes in a single container
Up to 6 PFLOP single precision, 1.2 PFLOP double precision
[Diagram: container plan view with I/O equipment space and blower sections at each end.]
~1 MW total power!
Summary
Big Challenges for Big HPC Systems
– The SGI ICE 8400 system and its IB-based, enhanced system interconnect
– The SGI Altix UV and its Numalink-based system interconnect
– The recently announced SGI "Mojo", an accelerator-based "1 Petaflop in a Cabinet" system