Present and Future Supercomputer Architectures
Hong Kong, China, 13-15 Dec. 2004
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory (12/12/2004)

A Growth-Factor of a Billion in Performance in a Career
[Chart: peak performance from 1950 to 2010 on a log scale from 1 KFlop/s to 1 PFlop/s, tracing the scalar era (EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600), the super scalar era (IBM 360/195, CDC 7600), the vector era (Cray 1, Cray X-MP, Cray 2, TMC CM-2), and the super scalar/vector/parallel era (TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, IBM BG/L); transistors per chip double every 1.5 years.]

Year  | Performance (Floating Point operations / second, Flop/s)
1941  | 1
1945  | 100
1949  | 1,000 (1 KiloFlop/s, KFlop/s)
1951  | 10,000
1961  | 100,000
1964  | 1,000,000 (1 MegaFlop/s, MFlop/s)
1968  | 10,000,000
1975  | 100,000,000
1987  | 1,000,000,000 (1 GigaFlop/s, GFlop/s)
1992  | 10,000,000,000
1993  | 100,000,000,000
1997  | 1,000,000,000,000 (1 TeraFlop/s, TFlop/s)
2000  | 10,000,000,000,000
2003  | 35,000,000,000,000 (35 TFlop/s)

The TOP500 (H. Meuer, H. Simon, E. Strohmaier, & JD)
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP; Ax=b, dense problem (TPP performance rate versus size); a sketch of how this rate is computed appears below
- Updated twice a year: SC'xy in the States in November, meeting in Mannheim, Germany in June
- All data available from www.top500.org

What is a Supercomputer?
♦ A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
♦ Over the last 10 years the performance range of the Top500 has increased faster than Moore's Law:
  - 1993: #1 = 59.7 GFlop/s, #500 = 422 MFlop/s
  - 2004: #1 = 70 TFlop/s, #500 = 850 GFlop/s
♦ Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security, in particular for cryptanalysis and for simulating nuclear weapons, to name a few.
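To make the TOP500 yardstick above concrete, here is a minimal C sketch of how a LINPACK-style rate is derived from a timed dense Ax = b solve, using the standard operation count of 2/3 n^3 + 2 n^2. The matrix order and solve time in it are illustrative placeholders, not measurements from any listed system.

    /* Minimal sketch: deriving a LINPACK-style Gflop/s rate from a timed
     * dense Ax = b solve.  The problem size and wall-clock time below are
     * illustrative placeholders, not measurements of any real system. */
    #include <stdio.h>

    int main(void)
    {
        double n       = 100000.0;  /* hypothetical matrix order          */
        double seconds = 250.0;     /* hypothetical solve time in seconds */

        /* Standard LINPACK operation count: LU factorization plus solve. */
        double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;

        printf("Rate = %.1f Gflop/s\n", flops / seconds / 1.0e9);
        return 0;
    }

Rmax is simply this rate for the problem size that maximizes it on a given machine.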
TOP500 Performance – November 2004
[Chart: TOP500 performance from 1993 to 2004 on a log scale from 100 Mflop/s to 1 Pflop/s. SUM rises from 1.167 TF/s (1993) to 1.127 PF/s (2004); N=1 rises from 59.7 GF/s (1993) to 70.72 TF/s (IBM BlueGene/L, 2004); N=500 rises from 0.4 GF/s to 850 GF/s. Systems marked along the N=1 line include the Fujitsu 'NWT' (NAL), Intel ASCI Red (Sandia), IBM ASCI White (LLNL), and the NEC Earth Simulator; "My Laptop" is shown for reference.]

Vibrant Field for High Performance Computers
♦ Cray X1
♦ SGI Altix
♦ IBM Regatta
♦ IBM Blue Gene/L
♦ IBM eServer
♦ Sun
♦ HP
♦ Bull NovaScale
♦ Fujitsu PrimePower
♦ Hitachi SR11000
♦ NEC SX-7
♦ Dawning
♦ Lenovo
♦ Apple
♦ Coming soon …
  - Cray RedStorm
  - Cray BlackWidow
  - NEC SX-8
  - Galactic Computing

Architecture/Systems Continuum (tightly coupled to loosely coupled)
♦ Custom processor with custom interconnect (Cray X1, NEC SX-7, IBM Regatta, IBM Blue Gene/L)
  - Best performance for codes that are not "cache friendly"
  - Good communication performance
  - Simplest programming model
  - Most expensive
♦ Commodity processor with custom interconnect (SGI Altix with Intel Itanium 2, Cray Red Storm with AMD Opteron)
  - Good communication performance
  - Good scalability
♦ Commodity processor with commodity interconnect (clusters of Pentium, Itanium, Opteron, or Alpha processors with GigE, Infiniband, Myrinet, or Quadrics; NEC TX7; IBM eServer; Dawning)
  - Best price/performance (for codes that work well with caches and are latency tolerant)
  - More complex programming model
[Chart: share of Top500 systems in the Custom, Hybrid, and Commodity categories, June 1993 to June 2004, 0-100%.]

Top500 Performance by Manufacturer (11/04)
[Pie chart: IBM 49%, HP 21%, others 14%, SGI 7%, NEC 4%; Fujitsu, Cray, Sun, Hitachi, and Intel each account for roughly 2% or less.]

Commodity Processors
♦ Intel Pentium Nocona
  - 3.6 GHz, peak = 7.2 Gflop/s
  - Linpack 100 = 1.8 Gflop/s
  - Linpack 1000 = 3.1 Gflop/s
♦ AMD Opteron
  - 2.2 GHz, peak = 4.4 Gflop/s
  - Linpack 100 = 1.3 Gflop/s
  - Linpack 1000 = 3.1 Gflop/s
♦ Intel Itanium 2
  - 1.5 GHz, peak = 6 Gflop/s
  - Linpack 100 = 1.7 Gflop/s
  - Linpack 1000 = 5.4 Gflop/s
♦ HP PA RISC
♦ Sun UltraSPARC IV
♦ HP Alpha EV68
  - 1.25 GHz, 2.5 Gflop/s peak
♦ MIPS R16000

Commodity Interconnects
♦ Gig Ethernet (bus)
♦ Myrinet (Clos)
♦ Infiniband (fat tree)
♦ QsNet (fat tree)
♦ SCI (torus)
(A minimal MPI ping-pong sketch of the kind of measurement behind the latency and bandwidth columns appears after the TOP10 list below.)

Interconnect       | Switch topology | NIC cost | Switch cost/node | Node cost | MPI latency (µs) / 1-way (MB/s) / bi-dir (MB/s)
Gigabit Ethernet   | Bus             | $50      | $50              | $100      | 30 / 100 / 150
SCI                | Torus           | $1,600   | $0               | $1,600    | 5 / 300 / 400
QsNetII (R)        | Fat Tree        | $1,200   | $1,700           | $2,900    | 3 / 880 / 900
QsNetII (E)        | Fat Tree        | $1,000   | $700             | $1,700    | 3 / 880 / 900
Myrinet (D card)   | Clos            | $595     | $400             | $995      | 6.5 / 240 / 480
Myrinet (E card)   | Clos            | $995     | $400             | $1,395    | 6 / 450 / 900
IB 4x              | Fat Tree        | $1,000   | $400             | $1,400    | 6 / 820 / 790

24th List: The TOP10
Rank | Manufacturer | Computer                                | Rmax [TF/s] | Installation Site                       | Country | Year | #Proc
1    | IBM          | BlueGene/L β-System                     | 70.72       | DOE/IBM                                 | USA     | 2004 | 32768
2    | SGI          | Columbia, Altix, Infiniband             | 51.87       | NASA Ames                               | USA     | 2004 | 10160
3    | NEC          | Earth-Simulator                         | 35.86       | Earth Simulator Center                  | Japan   | 2002 | 5120
4    | IBM          | MareNostrum, BladeCenter JS20, Myrinet  | 20.53       | Barcelona Supercomputer Center          | Spain   | 2004 | 3564
5    | CCD          | Thunder, Itanium2, Quadrics             | 19.94       | Lawrence Livermore National Laboratory  | USA     | 2004 | 4096
6    | HP           | ASCI Q, AlphaServer SC, Quadrics        | 13.88       | Los Alamos National Laboratory          | USA     | 2002 | 8192
7    | Self Made    | X, Apple XServe, Infiniband             | 12.25       | Virginia Tech                           | USA     | 2004 | 2200
8    | IBM/LLNL     | BlueGene/L DD1, 500 MHz                 | 11.68       | Lawrence Livermore National Laboratory  | USA     | 2004 | 8192
9    | IBM          | pSeries 655                             | 10.31       | Naval Oceanographic Office              | USA     | 2004 | 2944
10   | Dell         | Tungsten, PowerEdge, Myrinet            | 9.82        | NCSA                                    | USA     | 2003 | 2500

399 systems > 1 TFlop/s; 294 machines are clusters; the top 10 average 8K processors.
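The MPI latency and bandwidth columns in the interconnect table above are usually obtained with a two-rank ping-pong test. Below is a minimal sketch of such a microbenchmark, assuming an MPI installation and at least two ranks; in practice latency is taken from very small messages and bandwidth from large ones, while this sketch uses a single 1 MB message size for brevity.

    /* Minimal MPI ping-pong sketch of the kind used to obtain the latency
     * and bandwidth figures in the interconnect table above.  Run with at
     * least two ranks; message size and repetition count are arbitrary. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int reps  = 1000;
        const int bytes = 1 << 20;          /* 1 MB messages */
        char *buf = malloc(bytes);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);

        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {                /* send, then wait for the echo */
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {         /* echo the message back */
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;

        if (rank == 0)
            printf("one-way time: %.2f us, one-way bandwidth: %.1f MB/s\n",
                   dt / reps / 2.0 * 1.0e6,
                   (double)bytes * reps * 2.0 / dt / 1.0e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }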
IBM BlueGene/L System
Full system: 131,072 processors (64 racks, 64x32x32).
- Chip (BlueGene/L Compute ASIC, 2 processors): 2.8/5.6 GF/s, 4 MB (cache)
- Compute Card (2 chips, 2x1x1): 4 processors, 5.6/11.2 GF/s, 1 GB DDR
- Node Card (32 chips, 4x4x2; 16 compute cards): 64 processors, 90/180 GF/s, 16 GB DDR
- Rack (32 node boards, 8x8x16): 2,048 processors, 2.9/5.7 TF/s, 0.5 TB DDR
- System (64 racks, 64x32x32): 131,072 processors, 180/360 TF/s, 32 TB DDR
"Fastest Computer": BG/L at 700 MHz with 32K processors (16 racks); peak 91.7 Tflop/s, Linpack 70.7 Tflop/s (77% of peak).

BlueGene/L Interconnection Networks
3-Dimensional Torus
- Interconnects all compute nodes (65,536)
- Virtual cut-through hardware routing
- 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
- 1 µs latency between nearest neighbors, 5 µs to the farthest
- 4 µs latency for one hop with MPI, 10 µs to the farthest
- Communications backbone for computations
- 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
(A hop-count sketch for this torus appears after the Top500-in-Asia chart below.)
Global Tree
- Interconnects all compute and I/O nodes (1024)
- One-to-all broadcast functionality
- Reduction operations functionality
- 2.8 Gb/s of bandwidth per link
- Latency of one-way tree traversal 2.5 µs
- ~23 TB/s total binary-tree bandwidth (64K machine)
Ethernet
- Incorporated into every node ASIC
- Active in the I/O nodes (1:64)
- All external communication (file I/O, control, user interaction, etc.)
Low-Latency Global Barrier and Interrupt
- Latency of round trip 1.3 µs
Control Network

NASA Ames: SGI Altix Columbia, 10,240-Processor System
♦ Architecture: hybrid technical server cluster
♦ Vendor: SGI, based on Altix systems
♦ Deployment: today
♦ Node:
  - 1.5 GHz Itanium-2 processor
  - 512 procs/node (20 cabinets)
  - Dual FPUs per processor
♦ System:
  - 20 Altix NUMA systems @ 512 procs/node = 10,240 procs
  - 320 cabinets (estimate 16 per node)
  - Peak: 61.4 Tflop/s; LINPACK: 52 Tflop/s
♦ Interconnect:
  - FastNumaFlex (custom hypercube) within a node
  - Infiniband between nodes
♦ Pluses:
  - Large and powerful DSM nodes
♦ Potential problems (gotchas):
  - Power consumption: 100 kW per node (2 MW total)

Performance Projection
[Chart: TOP500 SUM, N=1, and N=500 performance from 1993 extrapolated to 2015, on a log scale from 100 Mflop/s to 1 Eflop/s, with markers for BlueGene/L, DARPA HPCS, and "My Laptop".]

Power: Watts/Gflop (smaller is better)
[Chart: Watts per Gflop for the Top 20 systems, 0-120 W/Gflop, based on processor power rating only. Systems compared include BlueGene/L DD2 beta-System (0.7 GHz PowerPC 440); SGI Altix 1.5 GHz, Voltaire Infiniband; Earth-Simulator; eServer BladeCenter JS20+ (PowerPC 970 2.2 GHz), Myrinet; Intel Itanium2 Tiger4 1.4 GHz, Quadrics; ASCI Q, AlphaServer SC45, 1.25 GHz; 1100 Dual 2.3 GHz Apple XServe/Mellanox Infiniband 4X/Cisco GigE; BlueGene/L DD1 Prototype (0.5 GHz PowerPC 440 w/Custom); eServer pSeries 655 (1.7 GHz Power4+); PowerEdge 1750, P4 Xeon 3.06 GHz, Myrinet; eServer pSeries 690 (1.9 GHz Power4+); LNX Cluster, Xeon 3.4 GHz, Myrinet; RIKEN Super Combined Cluster; BlueGene/L DD2 Prototype (0.7 GHz PowerPC 440); Integrity rx2600 Itanium2 1.5 GHz, Quadrics; Dawning 4000A, Opteron 2.2 GHz, Myrinet; Opteron 2 GHz, Myrinet; MCR Linux Cluster, Xeon 2.4 GHz, Quadrics; ASCI White, SP Power3 375 MHz.]

Top500 in Asia (Numbers of Machines)
[Chart: number of Top500 systems installed in Asia, June 1993 to June 2004, 0-120, stacked by Japan, China, South Korea, India, and others.]
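As a companion to the BlueGene/L torus latency figures above (1 µs to a nearest neighbor, about 5 µs to the farthest node), here is a minimal C sketch that counts hops on a 3-D torus of the full system's 64 x 32 x 32 shape. The node coordinates are arbitrary examples, and the hop count is only the network distance, not a latency model.

    /* Minimal sketch: hop-count distance on a 3-D torus such as
     * BlueGene/L's full-system 64 x 32 x 32 network.  Wrap-around links
     * mean the distance along each dimension is at most half the ring
     * length, which keeps even the farthest node only tens of hops away. */
    #include <stdio.h>
    #include <stdlib.h>

    static int ring_dist(int a, int b, int len)
    {
        int d = abs(a - b) % len;
        return d > len / 2 ? len - d : d;   /* take the shorter way around */
    }

    static int torus_hops(const int p[3], const int q[3], const int dim[3])
    {
        int hops = 0;
        for (int i = 0; i < 3; i++)
            hops += ring_dist(p[i], q[i], dim[i]);
        return hops;
    }

    int main(void)
    {
        const int dim[3]      = {64, 32, 32};   /* full-system torus shape     */
        const int home[3]     = {0, 0, 0};      /* arbitrary source node       */
        const int neighbor[3] = {1, 0, 0};      /* a nearest neighbor          */
        const int farthest[3] = {32, 16, 16};   /* diametrically opposite node */

        printf("nearest neighbor: %d hop(s)\n", torus_hops(home, neighbor, dim));
        printf("farthest node:    %d hops\n",   torus_hops(home, farthest, dim));
        return 0;
    }

For the 64x32x32 system the farthest node is 32 + 16 + 16 = 64 hops away, which is why the worst-case hardware latency stays within a few microseconds of the nearest-neighbor case.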
17 Chinese Sites on the Top500
Rank | Installation site                                   | Manufacturer | Computer                                  | Area     | Year | Rmax [GFlop/s] | Procs
17   | Shanghai Supercomputer Center                       | Dawning      | Dawning 4000A, Opteron 2.2 GHz, Myrinet   | Research | 2004 | 8061           | 2560
38   | Chinese Academy of Science                          | Lenovo       | DeepComp 6800, Itanium2 1.3 GHz, QsNet    | Academic | 2003 | 4193           | 1024
61   | Institute of Scientific Computing/Nankai University | IBM          | xSeries Xeon 3.06 GHz, Myrinet            | Academic | 2004 | 3231           | 768
132  | Petroleum Company (D)                               | IBM          | BladeCenter Xeon 3.06 GHz, Gig-Ethernet   | Industry | 2004 | 1923           | 512
184  | Geoscience (A)                                      | IBM          | BladeCenter Xeon 3.06 GHz, Gig-Ethernet   | Industry | 2004 | 1547           | 412
209  | University of Shanghai                              | HP           | DL360G3 Xeon 3.06 GHz, Infiniband         | Academic | 2004 | 1401           | 348
225  | Academy of Mathematics and System Science           | Lenovo       | DeepComp 1800, P4 Xeon 2 GHz, Myrinet     | Academic | 2002 | 1297           | 512
229  | Digital China Ltd.
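As a small worked example over the table above, the sketch below divides the published Rmax (GFlop/s) by the processor count for the three highest-ranked Chinese systems; it is plain arithmetic over the listed figures, nothing more.

    /* Small worked example: per-processor Rmax for the top three Chinese
     * systems in the table above, computed from the published Rmax
     * (GFlop/s) and processor counts. */
    #include <stdio.h>

    int main(void)
    {
        const struct { const char *system; double rmax_gflops; int procs; } top[] = {
            { "Dawning 4000A (Shanghai Supercomputer Center)", 8061.0, 2560 },
            { "DeepComp 6800 (Chinese Academy of Science)",    4193.0, 1024 },
            { "xSeries Xeon (Nankai University)",              3231.0,  768 },
        };

        for (int i = 0; i < 3; i++)
            printf("%-48s %.2f Gflop/s per processor\n",
                   top[i].system, top[i].rmax_gflops / top[i].procs);
        return 0;
    }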