High-end HPC architectures
Mithuna Thottethodi
School of Electrical and Computer Engineering, Purdue University

What makes a Supercomputer a Supercomputer
• Top 500 (www.top500.org)
  – Processor family: Intel/AMD/Power families
    • 96% of the top 500
  – Operating systems
    • Linux/Unix/BSD dominate
  – Scale
    • Ranges from 128 to 128K processors
  – Interconnect
    • Also varies significantly
    • Is this really surprising?
    • A better interconnect can scale to more CPUs

#1 and #500 over time

The Pyramid
• 3 of the top 5 and 13 of the top 50 are BlueGene solutions
• MPPs (60% of the top 50 vs. 21% of the top 500)
• Clusters (~75%)
  – with a high-performance interconnect (#8, #9 in the top 10)
  – with Gigabit Ethernet (41% of the top 500, only 1 in the top 50)
[Pyramid figure: Number of Systems vs. Cost/Performance]

Outline
• Interconnects
  – Connectivity: MPP vs. cluster
  – Latency, bandwidth, bisection
• When is computer A faster than computer B?
  – Algorithm scaling, concurrency, communication
• Other issues
  – Storage I/O, failures, power
• Case studies

Connectivity
• How fast/slow can the processor get information on/off the network?
  – How far is the on-ramp/exit from the source/destination?
  – Can limit performance even if the network is fast

Massively Parallel Processing (MPP)
• Network interface typically close to the processor
  – Memory bus
    • Locked to a specific processor architecture/bus protocol
  – Registers/cache
    • Only in research machines
• Time-to-market is long
  – Processor already available, or work closely with the processor designers
• Maximizes performance and cost
• NUMAlink, BlueGene network
[Block diagram labels: processor, cache, network interface (on the memory bus), memory bus, I/O bridge, I/O bus, main memory, disk controller, disks]

Clusters
• Network interface on the I/O bus
• Standards (e.g., PCI-X) => longer life, faster to market
• Slow to access the network interface
• Quadrics, Myrinet, Infiniband, GigE
[Block diagram labels: processor core, cache, chip set, main memory, I/O bus, disk controller, graphics controller, network interface (on the I/O bus), disks, graphics, network, interrupts]

Link Speeds

  Technology                Vendor     MPI latency         Bandwidth per link
                                       (usec, short msg)   (unidirectional, MB/s)
  NUMAlink 4 (Altix)        SGI         1                  3200
  RapidArray (XD1)          Cray        1.8                2000
  QsNet II                  Quadrics    2                   900
  Infiniband                Voltaire    3.5                 830
  High Performance Switch   IBM         5                  1000
  Myrinet XP2               Myricom     5.7                 495
  SP Switch 2               IBM        18                   500
  Ethernet                  Various    30                   100

  Source: http://www.sgi.com/products/servers/altix/numalink.html

Topology
• Link speeds alone are not sufficient
• Topology matters
  – Bisection
    • Weakest links
    • Most likely spot for traffic jams and unnecessary serialization
  – Not cost-neutral
    • Cost-performance is important
• 64K nodes in BlueGene/L
• No node farther than 64 hops from any other
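To make the latency, bandwidth, and hop-count figures above concrete, here is a minimal back-of-the-envelope sketch (not from the slides). It assumes a 64 x 32 x 32 node arrangement for the 64K-node torus, which is consistent with the 64-hop worst case quoted above, and it takes the NUMAlink 4 and Ethernet rows from the Link Speeds table; the transfer-time model (latency + bytes/bandwidth) deliberately ignores per-hop routing delay and contention.

```c
/* Illustrative sketch only:
 *  - worst-case hop count in a 3D torus, assuming a 64 x 32 x 32 layout
 *  - first-order message time = MPI latency + bytes / per-link bandwidth,
 *    using two rows of the Link Speeds table; per-hop delay and
 *    contention are ignored.
 */
#include <stdio.h>

/* Worst-case hops along one torus dimension of size d (wraparound links). */
static int max_hops_dim(int d) { return d / 2; }

int main(void) {
    int x = 64, y = 32, z = 32;   /* assumed 64K-node arrangement */
    printf("nodes = %d, worst-case hops = %d\n",
           x * y * z, max_hops_dim(x) + max_hops_dim(y) + max_hops_dim(z));

    double msg_bytes = 1.0e6;     /* 1 MB message */
    struct { const char *name; double lat_us; double bw_MBps; } nets[] = {
        { "NUMAlink 4", 1.0,  3200.0 },
        { "Ethernet",  30.0,   100.0 },
    };
    for (int i = 0; i < 2; i++) {
        /* 1 MB/s = 1 byte/usec, so bytes / MB/s gives microseconds. */
        double t_us = nets[i].lat_us + msg_bytes / nets[i].bw_MBps;
        printf("%-10s ~%.0f usec for a 1 MB message\n", nets[i].name, t_us);
    }
    return 0;
}
```

For short messages the latency term dominates (which is why the table reports it separately); for bulk transfers the bandwidth term dominates, and the slower link comes out roughly 30x worse in this example.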
Outline
• Interconnects
  – Connectivity: MPP vs. cluster
  – Latency, bandwidth, bisection
• When is computer A faster than computer B?
  – Algorithm scaling, concurrency, communication
• Other issues
  – Storage I/O, failures, power
• Case studies

Caveat: Methodology
• When is computer A faster than computer B?
• Before answering the above question: which is better, a car or a bus?
  – If the metric is cost AND the typical payload is 2
    • The car wins
  – If the metric is persons delivered per unit time and cost AND the typical payload is 30
    • The bus wins
• Back to the original question

Caveat: Methodology
• When is computer A faster than computer B?
• Top500.org answer
  – Flops on LINPACK
  – Rewards scaling and interconnect performance
• Other cases?
  – The application does not scale (~ only 2 people ride)
  – The application scales even without a better interconnect

Case 2
• Independent task parallelism
  – Run 1000 simulations with different parameters
  – Scenario 1: run 100 simulations on 100 machines
    • Repeat 10 times
  – Scenario 2: run 200 simulations on 200 machines
    • Repeat 5 times
  (a back-of-the-envelope sketch of this arithmetic appears after the case studies)
• No need to parallelize the application
• No need for an MPP
  – A large cluster is adequate
  – A more expensive interconnect is a waste of money
  – Storage I/O may still be the bottleneck

Case 1
• The application does not scale
• Cost exceeds benefit
  – Cost of increased parallelism
  – Benefit of concurrency
• Need a new scalable parallel algorithm or parallelization
  – Why pay for an aggressive machine?

Storage
• Can only compute as fast as data can be fed
• Large data-set HPC is often disk-bound
• The Top 500 does not report storage subsystem statistics
• Anecdotes of disk array racks being moved physically

Miscellany
• Interaction of scale and MTTF
  – 64K-processor system: failures every 6 days
  – Time to repair?
    • n days
    • Effective performance scaled by 6/(6+n)? (see the sketch after the case studies)
• Power
  – 1.2 MW for 64K nodes
  – Cooling
• Multicore
  – Increasing number of multicore-based platforms in the top 500
  – Anecdotes of users using only one core
    • Using the other core degrades performance by 20%

Case Studies
• High-end MPP: Blue Gene/L (at LLNL)
  – #1 for two years (four ranking cycles)
  – 3 of the top 5
  – 13 of the top 50
• MPP with support for global shared memory
  – SGI Altix 4700
• Clusters
  – 74% of the top 500

The BlueGene Interconnect
• 3D torus
  – Cube with wraparound links
  – 64K nodes
  – No node is more than 64 hops away
  – Clever tricks to create sub-tori
• Multiple networks
  – Global reduce
  – Tree-based network
  – GigE
Source: llnl.gov

SGI Altix
• Altix 4700
• Global shared memory
  – Up to 128 TB
  – Very fast MPI performance
• Fat-tree topology with NUMAlink links
  – At the memory bus (MPP)
  – A hub chip catches remote loads/stores
    • Translates them into network traffic
• Support for FPGA acceleration
  – RASC

Clusters
• Most accessible platform
• Great with a fast interconnect
• Under-representation in the top 10 may not be relevant for many application domains
  – Clusters are as fast as BlueGene/L if communication is minimal
• Cost-performance leader
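As promised above, a similarly rough sketch of the Case 2 and Miscellany arithmetic. It assumes every simulation takes the same made-up time of one hour (the slides give no per-simulation time) and uses the 6-day failure interval quoted on the Miscellany slide; the numbers are illustrative, not measurements.

```c
/* Illustrative sketch only:
 *  1. Case 2 (independent task parallelism): wall-clock time for 1000
 *     simulations under the two scenarios, assuming each simulation
 *     takes T hours (T = 1 is an assumed placeholder).
 *  2. Miscellany: fraction of peak delivered if the machine fails every
 *     6 days and takes n days to repair, i.e. uptime / (uptime + repair).
 */
#include <stdio.h>

int main(void) {
    /* 1. Independent tasks: 1000 simulations, T hours each. */
    double T = 1.0;                        /* assumed per-simulation time */
    int total = 1000;
    int machines[] = { 100, 200 };
    for (int i = 0; i < 2; i++) {
        int m = machines[i];
        int rounds = (total + m - 1) / m;  /* 10 rounds or 5 rounds */
        printf("%4d machines: %2d rounds -> %.0f hours wall-clock\n",
               m, rounds, rounds * T);
    }

    /* 2. Effective performance with a failure every 6 days, n-day repair. */
    double mtbf_days = 6.0;
    for (int n = 1; n <= 3; n++) {
        double frac = mtbf_days / (mtbf_days + (double)n);
        printf("repair = %d day(s): effective performance ~ %.0f%% of peak\n",
               n, frac * 100.0);
    }
    return 0;
}
```

The point of the first loop is that doubling the machine count halves the batch time without parallelizing any single simulation; the second loop shows how quickly slow repair erodes the advantage of a very large system.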