Big Challenges for Big Systems
Ter@tech Forum, June 16th, 2010

Robert Uebelmesser, HPC Director, SGI EMEA

Agenda

• Big Challenges for Big HPC Systems

– The SGI ICE 8400 system and its IB-based, enhanced system interconnect

– The SGI Altix UV and its NUMAlink-based system interconnect

– (The recently announced SGI "Mojo", an accelerator-based "1 Petaflop in a Cabinet" system)

(Some of the) Big Challenges for Exascale Systems

• Physical Challenges: Power, Floorspace, Cooling
• Architectural Challenges: Many-core processors, Very large memories, System interconnects, Accelerators
• Software Challenges: OS, Languages, Applications

(Some of the) Big Challenges for Big Next-Generation HPC Systems

• Physical Challenges: Power, Floorspace, Cooling
• Architectural Challenges: Many-core processors, Very large memories, System interconnects, Accelerators
• Software Challenges: OS, Languages, Applications

System Approaches

Integrated Cluster Systems (commodity interconnect):
[Diagram: many systems, each with its own memory and OS, linked by a commodity interconnect]
± Each system has its own memory and OS
– Node bandwidth and latency issues
– More network interfaces
– Lower application efficiency
+ Lower hardware cost
+ Heterogeneity
+ Node autonomy
+ Increased reliability

Globally Shared Memory Systems (SGI NUMAlink interconnect):
[Diagram: many systems operating on one global shared memory under a single operating system]
+ All nodes operate on one large shared memory space
+ Eliminates data passing between nodes
+ Big data sets fit in memory
+ Less memory per node required
+ Higher application efficiency
+ Easier to deploy and administer
– More expensive hardware
– Considered less scalable

SGI Supercomputer Lines

SGI ICE 8400: Highly Integrated Cluster System
SGI Altix® UV: Partitioned Globally Addressable Memory System

SGI Altix ICE 8400

– Blade-based architecture
– AMD- and Intel-based processor blades
– Diskless blade operation
– Integrated management network
– Hierarchical system management
– Single-plane or dual-plane 4x QDR InfiniBand interconnect
– SGI enhanced Hypercube or Fat Tree networks
– Integrated switch topology simplifies scaling from 32 to 65,536 nodes (1024 racks)
– Up to 128 processor sockets
– 4 or more DIMMs per socket
– Optional 2.5" SSD or HDD for local storage

SGI Altix ICE 8400: Designed for High-Performance Computing

Breakthrough performance density: up to 128 sockets per rack

SGI® Altix® ICE compute blade: up to 12 cores, 96 GB, 2 IB
SGI® Altix® ICE compute blade: up to 24 cores, 128 GB, 2 IB

Altix ICE rack:
• 42U rack (30" W x 40" D)
• 4 cable-free blade enclosures, each with up to 16 blades (2-socket nodes)
• Up to 128 DP AMD Opteron or Intel® sockets
• Single-plane or dual-plane IB 4x QDR interconnect
• Minimal switch topology simplifies scaling to 1000s of nodes

SGI ICE Blade for Intel Westmere (Type 1) IP101: One Single-Port QDR HCA

(6) DDR3 RDIMMs; read BW per socket: 1333 MHz = 32 GB/s
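The 32 GB/s figure is consistent with three DDR3-1333 memory channels per socket; a quick back-of-the-envelope check (the three-channel count and 8-byte transfer width are my assumptions for Westmere-EP, not stated on the slide):

```c
#include <stdio.h>

/* Rough check of the quoted per-socket read bandwidth,
 * assuming 3 DDR3-1333 channels per socket, 8 bytes per transfer. */
int main(void) {
    const double transfers_per_sec = 1333e6;  /* DDR3-1333: 1333 MT/s */
    const double bytes_per_transfer = 8.0;    /* 64-bit channel */
    const int channels_per_socket = 3;        /* assumption: Westmere-EP */

    double gbps = transfers_per_sec * bytes_per_transfer
                  * channels_per_socket / 1e9;
    printf("Per-socket peak read BW: %.1f GB/s\n", gbps);  /* ~32.0 GB/s */
    return 0;
}
```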

SGI ICE Blade for Intel Westmere (Type 2) IP-105: Two Single-Port QDR HCAs (Dual Plane)

(6) DDR3 RDIMMs; read BW per socket: 1333 MHz = 32 GB/s

Two independent Mellanox ConnectX-2 HCAs for 2x off-blade IB interconnect bandwidth.

SGI ICE Node Blade for AMD Magny Cours: One Dual-Ported QDR HCA

SGI ICE Node Blade for AMD Magny Cours: Two Single-Ported QDR HCAs

Two independent IB HCAs for 2x off-blade IB interconnect bandwidth.

SGI Altix ICE 8400EX Blade Container

[Diagram: blade enclosure showing internal hypercube links and external service ports]

Comparison: Fat Tree vs. Hypercube

[Diagram: two 36-node configurations, one with 36 external IB ports and one with 216 external IB ports (8 x 27); switch port groupings of 27, 18, and 9 shown]

Flexibility in Networking Topologies

• Hypercube Topology:
  – Lowest network infrastructure cost
  – Well suited for "nearest neighbor" type MPI communication patterns
• Enhanced Hypercube Topology:
  – Increased bisection bandwidth per node at only a small increase in cost
  – Well suited for larger node count MPI jobs
• All-to-All Topology:
  – Maximum bandwidth at lowest latency for up to 128 nodes
  – Well suited for "all-to-all" MPI communication patterns
• Fat Tree Topology:
  – Highest network infrastructure cost; requires external switches
  – Well suited for "all-to-all" type MPI communication patterns

Robust integrated switch blade design enables industry-leading bisection bandwidth at ultra-low latency!

SGI's Enhanced Hypercube

With the standard Hypercube topology, only one cable is used for each dimension link, which leaves many IB switch ports unused. With SGI's Enhanced Hypercube topology, the available ports are put to use by adding redundant links at the lower dimensions to improve the bandwidth of the interconnect.

[Diagram: Hypercube with single links vs. Enhanced Hypercube with redundant links]

The Strengths of SGI's Hypercube

• Higher connection capabilities & resources at the node.
• Hypercube topologies add an 'orthogonal' dimension of interconnect with every doubling in system size.
• Each dimension of the Hypercube interconnect scales linearly as the system size increases.
• Hypercube topologies are best described by the aggregate bisection bandwidth across all dimensions of the hypercube interconnect.
  – Minimum bisection bandwidth requirements defined for Fat Tree topologies do not capture the full capabilities of SGI's Hypercube fabric.

Hypercube vs. 3D Torus

[Chart: hop count (0–35) vs. nodes in system (8 to 32,768) for Hypercube and 3D Torus topologies]
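The shape of the chart follows from the topology math: an n-node hypercube needs at most log2(n) hops, while a k x k x k torus needs up to 3*(k/2). A minimal sketch of that model (my own illustration; SGI's figures may count switch hops differently):

```c
#include <math.h>
#include <stdio.h>

/* Worst-case hop counts: hypercube vs. 3D torus (illustrative model). */
int main(void) {
    for (int nodes = 8; nodes <= 32768; nodes *= 2) {
        int hc_hops = (int)round(log2((double)nodes)); /* one hop per dimension */
        int k = (int)ceil(cbrt((double)nodes));        /* torus edge length (rounded up) */
        int torus_hops = 3 * (k / 2);                  /* up to k/2 hops per axis */
        printf("%6d nodes: hypercube <= %2d hops, 3D torus <= %2d hops\n",
               nodes, hc_hops, torus_hops);
    }
    return 0;
}
```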

MPI Bisection Bandwidth Measurements

[Chart: MPI bandwidth measurements – relative performance (100% = HC_1, scale to 400%) for HC_1, EHC_1, HC_2, and EHC_2 on the D-R Ring and Exchange tests at 16, 32, 64, 128, 256, 512, and 1024 threads]

MPT 1.23. Exchange: simple bisection BW experiment. D-R Ring: Double Random Ring – a more accurate bisection BW estimation.
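A minimal sketch of the kind of pairwise exchange behind such a bisection-bandwidth experiment (generic MPI, not SGI MPT's internal benchmark; message size, iteration count, and the fixed half-to-half pairing are my choices; the Double Random Ring variant would permute the pairing instead):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Each rank in the lower half trades a buffer with its partner in the
 * upper half, stressing the bisection of the fabric. Run with an even
 * number of ranks. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nbytes = 1 << 20;               /* 1 MB per message (assumption) */
    const int iters  = 100;
    char *sendbuf = malloc(nbytes), *recvbuf = malloc(nbytes);
    memset(sendbuf, 1, nbytes);
    int partner = (rank < size / 2) ? rank + size / 2 : rank - size / 2;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Sendrecv(sendbuf, nbytes, MPI_BYTE, partner, 0,
                     recvbuf, nbytes, MPI_BYTE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double t = MPI_Wtime() - t0;

    /* Aggregate traffic crossing the bisection: every rank sends nbytes per iteration. */
    double local_bw = (double)nbytes * iters / t / 1e9, total_bw;
    MPI_Reduce(&local_bw, &total_bw, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Aggregate exchange bandwidth: %.1f GB/s\n", total_bw);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```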

MPI_Alltoall Communications

[Chart: MPI_Alltoall() performance – relative performance (100% = HyperCube single plane, scale to 400%) for HC_1, EHC_1, HC_2, and EHC_2 at 128, 256, 512, and 1024 threads]

MPT 1.23 – MPI_Alltoall() – buffer size = 700 KB
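For reference, a minimal MPI_Alltoall() timing loop in the spirit of this measurement (generic MPI; treating the slide's 700 KB as the per-destination buffer size, and the iteration count, are my assumptions):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Time MPI_Alltoall() with a fixed per-destination buffer. Note that the
 * buffers grow linearly with the number of ranks. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 700 * 1024;                    /* 700 KB per destination */
    const int iters = 20;
    char *sendbuf = malloc((size_t)count * size);
    char *recvbuf = malloc((size_t)count * size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Alltoall(sendbuf, count, MPI_BYTE,
                     recvbuf, count, MPI_BYTE, MPI_COMM_WORLD);
    double t = (MPI_Wtime() - t0) / iters;

    if (rank == 0)
        printf("MPI_Alltoall, %d ranks, 700 KB/pair: %.3f s per call\n", size, t);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```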

HPCC Benchmarks

[Chart: HPCC performance on Nehalem/CB+ – HyperCube vs. Enhanced HyperCube, single and dual rail (HC-1, EHC-1, HC-2, EHC-2); relative performance (100% = HC single rail, scale to 400%) for FFT, GUPs, ptrans, PP BW, and RR bandwidth at 128, 256, 512, and 1024 threads]

HPCC version 1.x – MPT 1.23. PP BW: PingPong Bandwidth. RR Bandwidth: Random Ring bandwidth.

SGI Enhanced Hypercube vs. Fat Tree Latencies

[Chart: estimated Altix ICE system half ping-pong MPI latency (nsec, 0–5000) vs. nodes in system (16–32,768); curve annotations: DDR with InfiniHost, DDR with ConnectX, QDR with ConnectX-2, Fat Tree; legend: ICE 8200 Hypercube, ICE 8200 LX/EX Hypercube, ICE 8400 Hypercube]
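Half ping-pong latency is simply half of a timed message round trip between two ranks; a generic sketch of the metric (not SGI's measurement code; message size and iteration count are my choices):

```c
#include <mpi.h>
#include <stdio.h>

/* Half ping-pong latency between rank 0 and rank 1: time a 1-byte round
 * trip and divide by two. Needs at least two ranks. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    char byte = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double half_latency_ns = (MPI_Wtime() - t0) / iters / 2.0 * 1e9;
    if (rank == 0)
        printf("Half ping-pong latency: %.0f ns\n", half_latency_ns);
    MPI_Finalize();
    return 0;
}
```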

12D Hypercube

[Diagram: 12D hypercube element with 12D links (2048) – 32,768 nodes in 512 racks]

SGI System Interconnect at NASA/Ames

SGI ICE System at GENCI/CINES

• SGI Altix ICE with 23,040 processor cores
• 267.88 Tflops peak performance
• 238 Tflops Linpack
• #18 on the June 2010 Top500 list
• Highest-ranked system in France; 3rd-ranked in Europe

Altix® UV

Ter@tech Forum, June 16, 2010

The SGI Altix Ultraviolet (UV) System

Evolution from ccNUMA Shared Memory (SGI Origin) to Partitioned Globally Addressable Shared Memory (SGI Altix 4700) to HW Accelerated Partitioned Globally Addressable System (SGI Altix UV)

SGI Altix® UV

• Partitioned Globally Addressable Memory System
• Advanced, SGI-enhanced bladed architecture
• Intel Nehalem-EX processors
• SGI NUMAlink interconnect for shared memory implementation
• Built-in MPI offload engine

UV Node Architecture

[Diagram: UV node – two Intel Nehalem-EX sockets with global memory, linked by QPI (Intel cache coherence), connected to the UV HUB (PI, GRU, AMU, NI, MI blocks) with its coherence directory memory; NUMAlink 5 ports carry the SGI NUMA protocol to other UV nodes]

UV Interconnect Architecture

[Diagram: two UV nodes, each with two Nehalem-EX sockets (Intel cache coherence, Intel IOH), global memory, and a UV HUB (PI, GRU, AMU, NI, MI) with coherence directory memory; the hubs are joined by NUMAlink 5 links running the SGI NUMAlink protocol]

Globally Shared Memory System

• NUMAlink® 5 is the glue of Altix® UV 100/1000

[Diagram: four Altix UV blades, each with a HUB, two CPUs, and 64 GB of memory per CPU (512 GB shared memory shown), connected through a NUMAlink router – up to 16 TB of global shared memory]

UV Architectural Scalability

• 16,384 nodes (scaling supported by the NUMAlink 5 node ID)
  – 16,384 UV_HUBs
  – 32,768 sockets / 262,144 cores (with 8 cores per socket)
• Coherent shared memory
  – Xeon: 16 TB (44-bit socket PA)
• 8 petabytes of coherent get/put memory (53-bit PA with GRU)
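The headline numbers follow directly from the figures above (two sockets per UV_HUB is implied by the 16,384-hub / 32,768-socket pairing); a quick check:

```c
#include <stdio.h>

/* Scalability numbers derived from the address-bit and per-socket figures. */
int main(void) {
    long long nodes   = 16384;                /* NUMAlink 5 node ID space */
    long long sockets = nodes * 2;            /* two sockets per UV_HUB/node */
    long long cores   = sockets * 8;          /* 8 cores per Nehalem-EX socket */
    unsigned long long coherent = 1ULL << 44; /* 44-bit socket PA  -> 16 TB */
    unsigned long long getput   = 1ULL << 53; /* 53-bit PA via GRU ->  8 PB */

    printf("%lld sockets, %lld cores\n", sockets, cores);   /* 32768, 262144 */
    printf("coherent: %llu TB, get/put: %llu PB\n",
           coherent >> 40, getput >> 50);                   /* 16 TB, 8 PB */
    return 0;
}
```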

UV_HUB/Node Controller Technologies

Globally Addressable Memory
• Large shared address space (8 PB)
• Extremely large coherent get/put space
• Atomic Memory Operations (AMOs) in coherent memory
• Coherence directory

Active Memory Unit
• Rich set of atomic operations (e.g. HW barrier support)
• Multicast
• Message queues in coherent memory
• Page initialization

GRU (Global Reference Unit)
• For MPI data movement
• For PGAS support
• High-BW, low-latency socket communication
• Update cache for many AMOs
• Scatter/gather operations
• BCOPY operations
• External TLB with large page support

RAS
• x4 DRAM correction
• Redundant real-time clock
• Failure isolation between partitions
• Built-in debug and performance monitors
• Internal/external datapath protection

NOTE: UV HUB memory management functions do not interfere with fast on-node memory access
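The get/put space and AMOs are exactly what a PGAS programming model exposes; as a rough illustration of the style of one-sided access the hub accelerates, here is a generic OpenSHMEM sketch (OpenSHMEM 1.4 API, not SGI's specific SHMEM or GRU interface):

```c
#include <shmem.h>
#include <stdio.h>

/* Generic OpenSHMEM sketch of one-sided access and an atomic memory
 * operation on globally addressable memory. Illustrative only; SGI's
 * SHMEM/GRU interfaces differ in detail. */
int main(void) {
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: the same remotely accessible buffer on every PE. */
    long *counter = shmem_malloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();

    /* Every PE atomically increments the counter that lives on PE 0. */
    shmem_long_atomic_inc(counter, 0);
    shmem_barrier_all();

    if (me == 0)
        printf("counter on PE 0 = %ld (expected %d)\n", *counter, npes);

    shmem_free(counter);
    shmem_finalize();
    return 0;
}
```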

Altix® UV Characteristics

1. Scalability
2. Performance

SGI Altix UV Scalability: Single System Images

• Scales to 256 Intel® Xeon® "Nehalem-EX" sockets (2048 cores) & 16 TB memory
  – Intel coherence within a blade
  – SGI coherence between blades
  – 16 TB is the global shared memory limit of the "Nehalem-EX" processor
• Investment protection
  – Start as small as four sockets and scale up over time
  – Start with a large SSI and partition later as required

SGI Altix UV Scalability: Architectural Limits

• Altix® UV's architecture supports scaling to the Petaflop level
• The upper limit on scaling is the Altix UV hub, capable of connecting 32,768 sockets

[Diagram: Petaflop system topology – 256-socket fat tree building blocks (4-rack groups) joined by red and green torus links; each torus link shown is (2) links per L1R, with (8) L1Rs per plane; 8 of 16 ports per L1R support the fat tree and 8 of 16 ports support 2 copies of the torus, giving (16) copies of the torus per plane; green links interleaved across the aisles, red links interleaved down the ranks]

SGI Altix UV Performance: SPEC Benchmarks

• World-record SPECint_rate and SPECfp_rate performance with only 64 sockets populated!
  – SPECint_rate_2006: #1 on any architecture
  – SPECfp_rate_2006: #1 on x86 architecture, #2 behind an SGI Altix 4700 with eight times as many processors

SGI Altix UV Performance: SPEC Benchmarks

SPECint_rate_base2006:
#1: SGI Altix UV 1000, 512c Xeon X7560 – 10,400
#2: SGI Altix 4700 Bandwidth System, 1024c – 9,030
#3: Sun Blade 6048 Chassis, 768c Opteron 8384 (cluster) – 8,840
#4: ScaleMP vSMP Foundation, 128c Xeon X5570 – 3,150
#5: SGI Altix 4700 Density System, 256c Itanium – 2,890

SPECfp_rate_base2006:
#1: SGI Altix 4700 Bandwidth System, 1024c Itanium – 10,600
#2: SGI Altix UV 1000, 512c Xeon X7560 – 6,840
#3: Sun Blade 6048 Chassis, 768c Opteron 8384 (cluster) – 6,500
#4: SGI Altix 4700 Bandwidth System, 256c Itanium – 3,420
#5: ScaleMP vSMP Foundation, 128c Xeon X5570 – 2,550

Source: www.spec.org (March 2010)

SGI Altix UV Performance

• Shared memory capacity per SSI (max. 16 TB)
  – Massive speed-up for memory-bound applications
• MPI Offload Engine (MOE)
  – Frees CPU cycles and improves MPI performance
  – MPI reductions 2-3X faster than competitive clusters/MPPs
  – Barriers up to 80X+ faster

SGI Altix UV MPI Performance Acceleration

• Altix UV offers up to 3X improvement in MPI reduction operations over standard IB networks
• Barrier latency is dramatically better than competing platforms (up to 80 times)

[Chart: HPCC benchmarks – bars labeled "UV with MOE"]
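These are collective-latency effects that are easy to observe with a simple micro-benchmark; a generic sketch of how barrier and small-reduction latency would be timed (my own example, not SGI's HPCC setup):

```c
#include <mpi.h>
#include <stdio.h>

/* Measure what MOE accelerates: average MPI_Barrier and small
 * MPI_Allreduce time per call. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    double x = 1.0, sum;

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double barrier_us = (MPI_Wtime() - t0) / iters * 1e6;

    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double reduce_us = (MPI_Wtime() - t0) / iters * 1e6;

    if (rank == 0)
        printf("barrier: %.2f us, 8-byte allreduce: %.2f us\n",
               barrier_us, reduce_us);
    MPI_Finalize();
    return 0;
}
```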

SGI Altix UV Performance Acceleration with MPI Offload Engine (MOE)

• HPCC benchmark simulations show substantial improvements with the MPI Offload Engine (MOE)

[Chart: HPCC benchmarks – paired bars comparing "UV with MOE" against "UV, MOE disabled"]

Source: SGI Engineering projections

SGI Altix® UV

• World's Fastest, with
  – World-record SPECint_rate and SPECfp_rate performance
  – High-speed NUMAlink® 5 interconnect (15 GB/sec)
  – MPI offload engines to maximize efficiency
• World's Most Scalable, with
  – Single system image scaling up to 2048 cores & 16 TB memory
  – Direct access to global data sets up to 16 TB
• World's Most Flexible, with
  – Investment protection:
    • Start with four sockets and scale up over time if needed
    • Start with a 2048-core SSI and partition over time if needed
  – Compelling performance regardless of type of application
• Open Platform, which
  – Leverages Intel® Xeon® 7500 ("Nehalem-EX") processors
  – Runs industry-standard x86 operating systems & application code

Altix UV Graphics and GP-GPU Packaging

NVIDIA® Tesla™ or Quadro® Plex enclosures cable directly to the UV external PCIe mezzanine riser.

[Diagram: NVIDIA Tesla unit = 4 GPUs, two x16 links; NVIDIA Quadro Plex unit = 2 GPUs + G-Sync, one x16 link (2 units shown); Altix UV 100/1000 connects via (2) PCIe Gen2 x16 cables; 1U, 3U, and 4U enclosure heights indicated]

Each UV 100 or UV 1000 blade can connect to up to one NVIDIA Tesla or Quadro Plex enclosure (Altix UV 10 uses NVIDIA host cards to achieve similar connectivity).

Summary

• Big Challenges for Big HPC Systems

– The SGI ICE 8400 system and its IB-based, enhanced system interconnect
  • Significantly improves interconnect bandwidth without adding cost

– The SGI Altix UV and its NUMAlink-based system interconnect
  • Significantly improves interconnect latencies and performance on complex MPI operations
  • Allows very large memories without performance degradation

– (The recently announced SGI "Mojo", an accelerator-based "1 Petaflop in a Cabinet" system)

Thank You

SGI "Mojo" System

"1 Peta-Flop-in-a-Cabinet"

Introduction to "Mojo"

• Internal product name "Mojo": 1 Peta-Flop-in-a-Cabinet

• Create industry-leading PCIe infrastructure for accelerator deployment
  – Complete integration into the industry-leading SGI ALTIX ICE system
  – Highest density in the industry
  – Flexible power & cooling solutions
  – Scales to several Petaflops
  – Delivers 1 Peta-Flop in a Cabinet single-precision peak using ATI GPUs
  – Also supports NVIDIA, ATI, Tilera and other PCIe-based solutions in high-density packaging

• Customer deliveries: Q4 CY2010

Mojo Software Overview

• Open-standards based
• Full integration into ALTIX ICE system administration
• OpenCL support
• SGI AEE (Accelerator Execution Environment)
  – Accelerator resource management
  – System accounting for accelerator use
  – PCIe BW monitor
  – Diagnostics

PCIe Infrastructure Requirements

• 4 PCIe x16 Gen2 interfaces to support full bandwidth to four accelerators per node

• Dual QDR IB interconnect for optimized cluster integration

• Unique PCIe-bandwidth-optimized motherboard for integration into the ALTIX ICE framework
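For context, the bandwidth this implies, using standard PCIe Gen2 figures (a back-of-the-envelope check, not an SGI specification):

```c
#include <stdio.h>

/* PCIe Gen2: 5 GT/s per lane, 8b/10b encoding -> 500 MB/s per lane per direction. */
int main(void) {
    const double lane_gbps = 5.0 * 8.0 / 10.0 / 8.0;  /* GB/s per lane, one direction */
    double per_slot = lane_gbps * 16;                  /* x16 slot */
    double per_node = per_slot * 4;                    /* four accelerators per node */
    printf("per x16 slot: %.1f GB/s, per node: %.1f GB/s (each direction)\n",
           per_slot, per_node);                        /* 8 GB/s and 32 GB/s */
    return 0;
}
```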

Basic Building Block: Node Board

Colgate IP112 Node Board Study:

[Diagram: node board layout (16.4" x 13.3") – two G34 sockets, SR5650/SR5670/SP5100 chipset, BMC, two IB HCAs, GigE, 2.1" PCIe riser boards, four single-slot-wide passively cooled GPU boards, and a 1.8" SATA disk drive (HDD or SSD); front view, airflow front to rear]

Features:
• 1.8-inch SATA disk (HDD or SSD) off the SP5100, for swap/scratch
• 2-socket G34 node
• 16 DDR3 DIMMs (8 DIMMs per socket), supporting up to 128 GB (using 8 GB DIMMs)
• 2 x SR5670, 2 x SR5650, 1 x SP5100
• Support for up to 4 full-size x16 PCIe Gen 2 single-wide card slots (up to 4 GPUs)
• 1 x dual-port GigE NIC
• 2 x single-port QDR IB HCA

Node Boards Scaling into an IRU

Mojo Chassis Enclosure Study

• Differentiated density, while supporting high-end 225 W single-wide GPU cards

[Diagram: chassis plan view and front view – node boards (two G34 sockets, BMC, 1.8" SATA drive (HDD or SSD), SR5650/SR5670/SP5100, GigE, IB HCAs, and GPU cards) plugging into a power & control backplane with power extender boards (8), UV power supplies, and an ICE CMC]

IRUs Scale into Enclosures…

Full 128-GPU deployment in a standard 42U rack vs. 256-GPU deployment in an M-Rack optimized enclosure

[Diagram: rack elevations – head node (2U), 36-port InfiniBand switches (1U), ICE CMCs (1U), UV power supplies, and blocks of 64 GPGPUs in 16 nodes each]

One Peta-Flop-in-a-Cabinet

Based on 4 x 256 GPUs and 1.03 TF/s (single precision) per GPU

Example: NVIDIA Linpack

• NVIDIA M2050
  – 1.03 TF/s single-precision FP (peak)
  – 515 GF/s double-precision FP (peak)
  – 330 GF/s Linpack (DP)
  – 148 GB/s bandwidth
  – 225 W TDP

• ~60% Linpack efficiency; the previous TESLA generation was around 80% efficient

• Using the M2050 at 330 GFLOP/s Linpack, 1 Petaflop of double-precision Linpack would require 3032 GPUs
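The arithmetic behind the two headline numbers, using only the figures quoted on these slides (straight division gives 3031 GPUs; the quoted 3032 presumably rounds up to a whole number of 4-GPU nodes, i.e. 758 x 4, which is my assumption):

```c
#include <math.h>
#include <stdio.h>

/* Arithmetic behind the cabinet peak and the DP-Linpack GPU count. */
int main(void) {
    /* Cabinet peak: 4 x 256 GPUs at 1.03 TF/s single precision each. */
    double cabinet_tf = 4 * 256 * 1.03;
    printf("Cabinet peak: %.0f TF/s single precision (~1 PF)\n", cabinet_tf);

    /* DP Linpack: 1 PF sustained at 330 GF/s Linpack per M2050. */
    double gpus = ceil(1.0e6 / 330.0);     /* 1 PF = 1,000,000 GF/s */
    printf("GPUs for 1 PF DP Linpack: %.0f\n", gpus);  /* 3031; slide quotes 3032 */
    return 0;
}
```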

Mojo in a Container Environment

Up to 3072 GPUs & 768 nodes in a single container
Up to 6 PFLOP single precision, 1.2 PFLOP double precision

[Diagram: container plan view – I/O and equipment space at each end, with rows of blowers along both sides]

~1 MW total power!

Summary

• Big Challenges for Big HPC Systems

– The SGI ICE 8400 system and its IB-based, enhanced system interconnect

– The SGI Altix UV and its NUMAlink-based system interconnect

– The recently announced SGI "Mojo", an accelerator-based "1 Petaflop in a Cabinet" system
