High-end HPC architectures

Mithuna Thottethodi
School of Electrical and Computer Engineering, Purdue University

What makes a Supercomputer
• Top 500 (www.top500.org)
  – Processor family: Intel/AMD/Power families
    • 96% of top 500
  – Operating systems
    • Linux/Unix/BSD dominate
  – Scale
    • Ranges from 128 to 128K processors
  – Interconnect
    • Also varies significantly
    • Is this really surprising?
    • Better interconnect can scale to more CPUs

#1 and #500 over time

The Pyramid

• 3 of top 5 and 13 of top 50 are BlueGene solutions
• MPPs (60% of top 50 vs. 21% of top 500)
• Clusters (~75% of top 500)
  – with high-performance interconnect (#8, #9 in top 10)
  – with Gigabit Ethernet (41% of top 500, only 1 in top 50)
(Pyramid figure: number of systems vs. cost/performance)

Outline

• Interconnects?
  – Connectivity: MPP vs. cluster
  – Latency, Bandwidth, Bisection
• When is computer A faster than computer B?
  – Algorithm Scaling, Concurrency, Communication
• Other issues
  – Storage I/O, Failures, Power
• Case Studies

Connectivity

• How fast/slow can the processor get information on/off the network
  – How far is the on-ramp/the exit from the source/destination?
  – Can limit performance even if network is fast

Massively Parallel Processing (MPP)
• Network interface typically close to processor
  – Memory bus: locked to specific processor architecture/bus protocol
  – Registers/cache: only in research machines
• Time-to-market is long
  – processor already available, or work closely with processor designers
• Maximize performance (and cost)
• Examples: NUMAlink, BlueGene network
(Figure: network interface attached to the memory bus next to the processor/cache; I/O bridge, I/O bus, main memory, and disk controller below)

Clusters
• Network interface on I/O bus
• Standards (e.g., PCI-X) => longer life, faster to market
• Slow to access network interface (interrupts)
• Examples: Myrinet, Infiniband, GigE
(Figure: network interface on the I/O bus behind the chip set, alongside disk and graphics controllers)
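The "MPI latency" and per-link bandwidth figures quoted in the Link Speeds table below are typically obtained with a ping-pong microbenchmark between two nodes. The sketch below is a minimal version of that idea in C with MPI, not the benchmark the vendors actually used; the message size is taken from the command line.

```c
/* Minimal MPI ping-pong sketch: rank 0 and rank 1 bounce a message back and
 * forth; half the average round-trip time approximates one-way latency, and
 * bytes / one-way time approximates bandwidth for large messages.
 * Illustrative only, not the benchmark behind the table below. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000;
    size_t bytes = (argc > 1) ? (size_t)atol(argv[1]) : 8;  /* message size */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way_us = (MPI_Wtime() - t0) / (2.0 * iters) * 1e6;

    if (rank == 0)
        printf("%zu bytes: ~%.2f usec one-way, ~%.1f MB/s\n",
               bytes, one_way_us, bytes / one_way_us);  /* bytes/usec == MB/s */

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Run with two ranks placed on different nodes: an 8-byte message approximates the short-message latency column, a multi-megabyte message approximates the bandwidth column.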

Link Speeds

Technology                 Vendor     MPI latency (usec, short msg)   Bandwidth per link (unidirectional, MB/s)
NUMAlink 4 (Altix)         SGI        1                               3200
RapidArray (XD1)           Cray       1.8                             2000
QsNet II                   Quadrics   2                               900
Infiniband                 Voltaire   3.5                             830
High Performance Switch    IBM        5                               1000
Myrinet XP2                Myricom    5.7                             495
SP Switch 2                IBM        18                              500
Ethernet                   Various    30                              100

Source: http://www.sgi.com/products/servers/altix/numalink.html
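A first-order way to compare these links is the simple latency-plus-bandwidth model, T(n) ≈ latency + n / bandwidth. The sketch below plugs a few rows of the table into that model; it ignores topology, contention, and software overhead, so treat the output as illustrative only.

```c
/* First-order message-time model: T(n) ~= alpha + n / beta, where alpha is
 * the short-message MPI latency and beta the per-link bandwidth from the
 * table above.  Illustrative only. */
#include <stdio.h>

struct link { const char *name; double alpha_us; double beta_MBps; };

int main(void)
{
    const struct link links[] = {
        { "NUMAlink 4",       1.0,  3200.0 },
        { "Infiniband",       3.5,   830.0 },
        { "Gigabit Ethernet", 30.0,  100.0 },
    };
    const double sizes_B[] = { 8.0, 1024.0, 1048576.0 };   /* 8 B, 1 KB, 1 MB */

    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            /* bandwidth in MB/s equals bytes per microsecond */
            double t_us = links[i].alpha_us + sizes_B[j] / links[i].beta_MBps;
            printf("%-18s %9.0f B : %10.1f usec\n", links[i].name, sizes_B[j], t_us);
        }
    }
    return 0;
}
```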

Topology
• Link speeds alone are not sufficient
• Topology matters
  – Bisection (see sketch below)
    • Weakest links
    • Most likely spot for traffic jams and unnecessary serialization
  – Not cost-neutral
    • Cost-performance is important
• 64K nodes in BlueGene/L
  – No node farther than 64 hops from any other
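The bisection bullet can be made concrete for a torus: cutting an X x Y x Z torus in half across its longest dimension severs 2 * X * Y links (the factor of 2 comes from the wraparound links), and the per-node share of that bisection bandwidth shrinks as the machine grows. The dimensions and link bandwidth below are assumptions used only to illustrate the arithmetic; the slides do not give them.

```c
/* Bisection sketch for an X x Y x Z torus (even dimensions, Z the longest):
 * 2 * X * Y links cross the cut because of the wraparound links.  The
 * per-node share of bisection bandwidth shrinks as the machine grows, which
 * is why the bisection is the likely congestion point.  The BlueGene/L-like
 * 32 x 32 x 64 layout and the link bandwidth are assumed example values. */
#include <stdio.h>

int main(void)
{
    int X = 32, Y = 32, Z = 64;          /* assumed layout of 64K nodes    */
    double link_MBps = 175.0;            /* assumed per-link bandwidth     */

    long nodes = (long)X * Y * Z;
    long cut_links = 2L * X * Y;         /* links crossing the Z bisection */
    double bisect_MBps = cut_links * link_MBps;

    printf("nodes: %ld\n", nodes);
    printf("links crossing bisection: %ld\n", cut_links);
    printf("bisection bandwidth: %.0f MB/s (%.2f MB/s per node)\n",
           bisect_MBps, bisect_MBps / nodes);
    return 0;
}
```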

Outline

• Interconnects?
  – Connectivity: MPP vs. cluster
  – Latency, Bandwidth, Bisection
• When is computer A faster than computer B?
  – Algorithm Scaling, Concurrency, Communication
• Other issues
  – Storage I/O, Failures, Power
• Case Studies

Caveat: Methodology

• When is computer A faster than computer B?

• Before answering the above question
• Which is better: a car or a bus?
  – If metric = cost AND typical payload = 2
    • Car wins
  – If metric = persons delivered per unit time and cost AND typical payload = 30
    • Bus wins
• Back to the original question

Caveat: Methodology

• When is computer A faster than computer B?
• Top500.org answer
  – Flops on LINPACK (see sketch below)
  – Rewards scaling and interconnect performance
• Other cases?
  – Application does not scale (~ only 2 people ride)
  – Application scales even without better interconnect
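For reference, the Top500 ranking compares measured LINPACK flops (Rmax) against a machine's theoretical peak, Rpeak = nodes x cores per node x clock x flops per cycle. The numbers in the sketch below are made-up examples, not any machine on the list.

```c
/* Rpeak vs. Rmax arithmetic used in Top500-style rankings.
 * All numbers here are made-up examples, not measurements. */
#include <stdio.h>

int main(void)
{
    double nodes = 1024, cores_per_node = 2;
    double clock_GHz = 2.4, flops_per_cycle = 4;    /* assumed values */

    double rpeak_GF = nodes * cores_per_node * clock_GHz * flops_per_cycle;
    double rmax_GF  = 0.70 * rpeak_GF;              /* assumed LINPACK efficiency */

    printf("Rpeak = %.0f GF, assumed Rmax = %.0f GF (%.0f%% efficiency)\n",
           rpeak_GF, rmax_GF, 100.0 * rmax_GF / rpeak_GF);
    return 0;
}
```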

Case 2
• Independent task parallelism (see sketch below)
  – Run 1000 simulations with different parameters
  – Scenario 1: run 100 simulations on 100 machines
    • Repeat 10 times
  – Scenario 2: run 200 simulations on 200 machines
    • Repeat 5 times
• Do not need to parallelize application
• Do not need MPP
  – Large cluster adequate
  – More expensive interconnect: waste of money
  – Storage I/O may still be bottleneck
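A minimal sketch of the independent-task-parallelism case: each rank simply runs its share of the simulations, with no communication between them, so a commodity cluster is enough. run_simulation() here is a hypothetical stand-in for the real (sequential) application.

```c
/* Static task farming for independent simulations: rank r runs every
 * nranks-th parameter.  No communication between simulations, so a plain
 * cluster (even with a slow interconnect) is enough.  run_simulation() is a
 * hypothetical placeholder for the actual application. */
#include <mpi.h>
#include <stdio.h>

static void run_simulation(int param)
{
    /* placeholder for the actual (sequential) simulation */
    printf("running simulation with parameter %d\n", param);
}

int main(int argc, char **argv)
{
    int rank, nranks, nsims = 1000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    for (int i = rank; i < nsims; i += nranks)   /* round-robin assignment */
        run_simulation(i);

    MPI_Finalize();
    return 0;
}
```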

Case 1
• Application does not scale
• Cost exceeds benefit (see sketch below)
  – Cost of increased parallelism
  – Benefit of concurrency
• Need new scalable parallel algorithm or parallelization
  – Why pay for aggressive machine?
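"Cost exceeds benefit" can be quantified with Amdahl's law: if a fraction p of the work parallelizes, speedup on N processors is 1 / ((1 - p) + p/N), while cost grows roughly linearly with N. The parallel fraction below is an assumed example value.

```c
/* Amdahl's law: speedup(N) = 1 / ((1 - p) + p / N).
 * p (parallel fraction) is an assumed example value. */
#include <stdio.h>

int main(void)
{
    double p = 0.95;                               /* assumed parallel fraction */
    int sizes[] = { 1, 16, 128, 1024, 16384 };

    for (int i = 0; i < 5; i++) {
        int N = sizes[i];
        double speedup = 1.0 / ((1.0 - p) + p / N);
        printf("N = %5d  speedup = %6.1f  (cost grows ~linearly with N)\n",
               N, speedup);
    }
    return 0;
}
```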

Storage
• Can compute only as fast as data can be fed
• Large data-set HPC is often disk-bound
• Top 500 does not report storage subsystem statistics

• Anecdotes of disk array racks being moved physically

Miscellany
• Interaction of scale and MTTF
  – 64K-processor system: failures every 6 days
  – Time to repair = n days
    • effective performance reduced to 6/(6+n) of peak? (see sketch below)
• Power
  – 1.2 MW for a 64K-node system
  – Cooling
• Multicore
  – Increasing number of multicore-based platforms in top 500
  – Anecdotes of users using only one core
    • Using the other core degrades performance by ~20%
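The 6/(6+n) figure above is just the availability ratio MTTF / (MTTF + MTTR). A quick sketch, with a few assumed repair times:

```c
/* Effective performance fraction when the machine fails every mttf days and
 * takes n days to repair: uptime fraction = mttf / (mttf + n).
 * The 6-day MTTF is the slide's figure; the repair times are assumed. */
#include <stdio.h>

int main(void)
{
    double mttf_days = 6.0;
    double repair_days[] = { 0.1, 0.5, 1.0, 2.0 };   /* assumed values of n */

    for (int i = 0; i < 4; i++) {
        double frac = mttf_days / (mttf_days + repair_days[i]);
        printf("n = %.1f days -> effective performance = %.0f%% of peak\n",
               repair_days[i], 100.0 * frac);
    }
    return 0;
}
```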

Case Studies
• High-end MPP: Blue Gene/L (at LLNL)
  – #1 for two years (four ranking cycles)
  – 3 of top 5
  – 13 of top 50
• MPP with support for global shared memory
  – SGI Altix 4700
• Clusters
  – 74% of top 500

The BlueGene Interconnect

• 3D Torus
  – Cube with wraparound links
  – 64K nodes
  – No node is more than 64 hops away (see sketch below)
  – Clever tricks to create subtorus
• Multiple networks
  – Global reduce
  – Tree-based network
  – GigE
Source: llnl.gov
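The 64-hop bound follows from the torus geometry: along each dimension the wraparound link caps the distance at half the dimension length, so a 32 x 32 x 64 layout (one plausible factorization of the 64K nodes, assumed here since the slide does not give the exact dimensions) has diameter 16 + 16 + 32 = 64. A small check:

```c
/* Hop distance in an X x Y x Z torus: along each dimension the wraparound
 * link caps the distance at half the dimension, so the diameter is
 * X/2 + Y/2 + Z/2.  The 32 x 32 x 64 layout is an assumption; the slide
 * only states 64K nodes. */
#include <stdio.h>
#include <stdlib.h>

static int torus_dist(int a, int b, int k)
{
    int d = abs(a - b);
    return d < k - d ? d : k - d;        /* shorter way around the ring */
}

int main(void)
{
    int X = 32, Y = 32, Z = 64;

    /* distance between "opposite corners": the worst case */
    int hops = torus_dist(0, X / 2, X)
             + torus_dist(0, Y / 2, Y)
             + torus_dist(0, Z / 2, Z);

    printf("nodes = %d, diameter = %d hops\n", X * Y * Z, hops);  /* 65536, 64 */
    return 0;
}
```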

SGI Altix
• Altix 4700
• Global shared memory
  – Up to 128 TB
  – Very fast MPI performance
• Fat-tree topology with NUMAlink links
  – At the memory bus (MPP)
  – Hub chip catches remote loads/stores
    • Translates them to network traffic
• Support for FPGA acceleration
  – RASC
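To illustrate the programming-model difference: on a global-shared-memory machine like the Altix, touching remote data is written as an ordinary load or store, and the hub chip turns the resulting memory traffic into NUMAlink transactions; on a cluster the same exchange needs explicit messages (e.g. MPI_Send/MPI_Recv). The snippet below is a purely schematic, locally runnable sketch of the shared-memory style; the array and its placement are hypothetical.

```c
/* Schematic contrast: with global shared memory a "remote" access is just a
 * load; the hardware (the hub chip, on Altix) turns the cache miss into
 * network traffic.  Here the array is an ordinary local allocation used
 * purely for illustration. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1 << 20;
    double *data = malloc(n * sizeof *data);   /* could span nodes on a GSM machine */

    for (size_t i = 0; i < n; i++)
        data[i] = (double)i;

    /* shared-memory style: reading data[k] is the same source code whether
     * the page happens to be local or in another node's memory */
    size_t k = n - 1;
    printf("data[%zu] = %.1f\n", k, data[k]);

    free(data);
    return 0;
}
```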

Clusters
• Most accessible platform
• Great with fast interconnect
• Under-representation in top 10 may not be relevant for many application domains
  – Clusters as fast as BlueGene/L if communication is minimal
• Cost-performance leader

Question 1

Question 2