Survey of “High Performance Machines”
Survey of "High Performance Machines"
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory

Overview
♦ Processors
♦ Interconnect
♦ Look at the 3 Japanese HPCs
♦ Examine the Top131

History of High Performance Computers
[Chart: aggregate systems performance, single-CPU performance, and CPU frequencies (flop/s and Hz vs. year, 1980-2010), showing increasing parallelism; systems range from the X-MP, VP-200, and S-810/20 through the Earth Simulator, ASCI Q, and ASCI White.]

Vibrant Field for High Performance Computers
♦ Cray X1
♦ SGI Altix
♦ IBM Regatta
♦ Sun
♦ HP
♦ Bull
♦ Fujitsu PrimePower
♦ Hitachi SR11000
♦ NEC SX-7
♦ Apple
♦ Coming soon …
¾ Cray RedStorm
¾ Cray BlackWidow
¾ NEC SX-8
¾ IBM Blue Gene/L

Architecture/Systems Continuum
♦ Loosely coupled: commodity processor with commodity interconnect
¾ Clusters: Pentium, Itanium, Opteron, Alpha, or PowerPC processors with GigE, Infiniband, Myrinet, Quadrics, or SCI interconnect
¾ NEC TX7
¾ HP Alpha
¾ Bull NovaScale 5160
♦ Commodity processor with custom interconnect
¾ SGI Altix (Intel Itanium 2)
¾ Cray Red Storm (AMD Opteron)
¾ IBM Blue Gene/L (?) (IBM PowerPC)
♦ Tightly coupled: custom processor with custom interconnect
¾ Cray X1
¾ NEC SX-7
¾ IBM Regatta

Commodity Processors
♦ AMD Opteron
¾ 2 GHz, 4 Gflop/s peak
♦ HP Alpha EV68
¾ 1.25 GHz, 2.5 Gflop/s peak
♦ HP PA-RISC
♦ IBM PowerPC
¾ 2 GHz, 8 Gflop/s peak
♦ Intel Itanium 2
¾ 1.5 GHz, 6 Gflop/s peak
♦ Intel Pentium Xeon, Pentium EM64T
¾ 3.2 GHz, 6.4 Gflop/s peak
♦ MIPS R16000
♦ Sun UltraSPARC IV

Itanium 2
♦ Floating point bypasses the level 1 cache
♦ Bus is 128 bits wide and operates at 400 MHz, for 6.4 GB/s
♦ 4 flops/cycle
♦ 1.5 GHz Itanium 2 (theoretical peak 6 Gflop/s)
¾ Linpack numbers:
¾ 100: 1.7 Gflop/s
¾ 1000: 5.4 Gflop/s

Pentium 4 IA32
♦ Processor of choice for clusters
♦ 1 flop/cycle; with Streaming SIMD Extensions 2 (SSE2), 2 flops/cycle
♦ Intel Xeon 3.2 GHz, 400/533 MHz bus, 64 bits wide (3.2/4.2 GB/s)
¾ Linpack numbers (theoretical peak 6.4 Gflop/s):
¾ 100: 1.7 Gflop/s
¾ 1000: 3.1 Gflop/s
♦ Coming soon: "Pentium 4 EM64T"
¾ 800 MHz bus, 64 bits wide
¾ 3.6 GHz, 2 MB L2 cache
¾ Peak 7.2 Gflop/s using SSE2

High Bandwidth vs Commodity Systems
♦ High bandwidth systems have traditionally been vector computers
¾ Designed for scientific problems
¾ Capability computing
♦ Commodity processors are designed for web servers and the home PC market (we should be thankful that the manufacturers keep 64-bit floating point)
¾ Used for cluster-based computers, leveraging their price point
♦ Scientific computing needs are different
¾ They require a better balance between data movement and floating point operations, which results in greater efficiency.

                          Earth Simulator  Cray X1       ASCI Q       MCR          VT Big Mac
                          (NEC)            (Cray)        (HP EV68)    (Dual Xeon)  (Dual IBM PPC)
Year of Introduction      2002             2003          2002         2002         2003
Node Architecture         Vector           Vector        Alpha        Pentium      PowerPC
Processor Cycle Time      500 MHz          800 MHz       1.25 GHz     2.4 GHz      2 GHz
Peak Speed per Processor  8 Gflop/s        12.8 Gflop/s  2.5 Gflop/s  4.8 Gflop/s  8 Gflop/s
Bytes/flop (main memory)  4                2.6           0.8          0.44         0.5
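The per-processor peaks in the table follow directly from clock rate and flops per cycle, and the bytes/flop ratios translate into an implied main-memory bandwidth per processor. Below is a minimal Python sketch of that arithmetic; the flops-per-cycle values are not stated in the table but are derived here from the quoted peaks and clock rates, and the "implied bandwidth" is simply bytes/flop times peak.

# Back-of-the-envelope balance arithmetic using figures from the table above.
# peak (Gflop/s) = clock (GHz) * flops per cycle
# implied memory bandwidth (GB/s) = bytes/flop * peak (Gflop/s)

machines = {
    # name: (clock in GHz, flops per cycle, bytes per flop)
    "Earth Simulator (NEC)": (0.5, 16, 4.0),
    "Cray X1": (0.8, 16, 2.6),
    "ASCI Q (HP EV68)": (1.25, 2, 0.8),
    "MCR (Dual Xeon)": (2.4, 2, 0.44),
    "VT Big Mac (Dual IBM PPC)": (2.0, 4, 0.5),
}

for name, (ghz, fpc, bpf) in machines.items():
    peak_gflops = ghz * fpc            # per-processor peak
    mem_bw = bpf * peak_gflops         # GB/s of main-memory bandwidth at that balance
    print(f"{name:27s}  peak {peak_gflops:5.1f} Gflop/s,  ~{mem_bw:5.1f} GB/s memory bandwidth")

Run as written, this reproduces the 8 and 12.8 Gflop/s vector peaks with roughly 32 GB/s of memory bandwidth behind each processor, versus about 2 GB/s behind the commodity Xeon: the balance point of the preceding slide.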
Commodity Interconnects
♦ Gig Ethernet
♦ Myrinet
♦ Infiniband
♦ QsNet
♦ SCI

                    Switch topology  NIC cost  Switch cost/node  Total cost/node  MPI latency (µs) / bandwidth (MB/s)
Gigabit Ethernet    Bus              $50       $50               $100             30 / 100
SCI                 Torus            $1,600    $0                $1,600           5 / 300
QsNetII             Fat tree         $1,200    $1,700            $2,900           3 / 880
Myrinet (D card)    Clos             $700      $400              $1,100           6.5 / 240
IB 4x               Fat tree         $1,000    $400              $1,400           6 / 820

Quick Look at …
♦ Fujitsu PrimePower2500
♦ Hitachi SR11000
♦ NEC SX-7

Fujitsu PRIMEPOWER HPC2500
♦ 1.3 GHz SPARC processors, 5.2 Gflop/s per processor
♦ System board: 8 CPUs, 41.6 Gflop/s, 8.36 GB/s memory bandwidth per board
♦ SMP node: up to 16 system boards (8-128 CPUs), 666 Gflop/s, 133 GB/s total memory bandwidth, with a crossbar network for uniform memory access within the node
♦ Nodes connected by a high-speed optical interconnect (4 GB/s x 4), up to 128 nodes
♦ Peak (128 nodes): 85 Tflop/s
♦ DTU: Data Transfer Unit
[Diagram: system-board-based architecture; system boards, DTU boards, and a PCIBOX with channel adapters to I/O devices attach to the node crossbar and to the high-speed optical interconnect.]

Latest Installations of FUJITSU HPC Systems (user name: configuration)
Japan Aerospace Exploration Agency (JAXA): PRIMEPOWER 128CPU x 14 cabinets (9.3 Tflop/s)
Japan Atomic Energy Research Institute (ITBL Computer System): PRIMEPOWER 128CPU x 4 + 64CPU (3 Tflop/s)
Kyoto University: PRIMEPOWER 128CPU (1.5 GHz) x 11 + 64CPU (8.8 Tflop/s)
Kyoto University (Radio Science Center for Space and Atmosphere): PRIMEPOWER 128CPU + 32CPU
Kyoto University (Grid System): PRIMEPOWER 96CPU
Nagoya University (Grid System): PRIMEPOWER 32CPU x 2
National Astronomical Observatory of Japan (SUBARU Telescope System): PRIMEPOWER 128CPU x 2
Japan Nuclear Cycle Development Institute: PRIMEPOWER 128CPU x 3
Institute of Physical and Chemical Research (RIKEN): IA-Cluster (Xeon 2048CPU) with InfiniBand & Myrinet
National Institute of Informatics (NAREGI System): IA-Cluster (Xeon 256CPU) with InfiniBand; PRIMEPOWER 64CPU
Tokyo University (The Institute of Medical Science): IA-Cluster (Xeon 64CPU) with Myrinet; PRIMEPOWER 26CPU x 2
Osaka University (Institute of Protein Research): IA-Cluster (Xeon 160CPU) with InfiniBand

Hitachi SR11000
♦ Based on the IBM POWER4+
♦ SMP with 16 processors/node
¾ 109 Gflop/s per node (6.8 Gflop/s per processor)
¾ IBM uses 32 processors per node in their own machine
♦ IBM Federation switch
¾ Hitachi: 6 planes for 16 processors/node
¾ IBM uses 8 planes for 32 processors/node
♦ Pseudo vector processing features
¾ No hardware enhancements
¾ Unlike the SR8000
♦ Hitachi's compiler effort is separate from IBM's
¾ No plans for HPF
♦ 3 customers for the SR11000; largest system is 64 nodes (7 Tflop/s)
¾ National Institute for Materials Science, Tsukuba: 64 nodes (7 Tflop/s)
¾ Okazaki Institute for Molecular Science: 50 nodes (5.5 Tflop/s)
¾ Institute of Statistical Mathematics: 4 nodes
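The node and system peaks quoted for the PRIMEPOWER HPC2500 and the SR11000 are simply products of processor count and per-processor peak. A short Python sketch of that arithmetic, using only numbers quoted on the slides above:

# System peak = nodes * processors per node * peak per processor.

def peak_tflops(nodes, procs_per_node, gflops_per_proc):
    return nodes * procs_per_node * gflops_per_proc / 1000.0

# Fujitsu PRIMEPOWER HPC2500: 5.2 Gflop/s per 1.3 GHz SPARC, up to 128 CPUs/node, 128 nodes
print("HPC2500 node peak:   %5.0f Gflop/s" % (128 * 5.2))                 # ~666 Gflop/s
print("HPC2500 system peak: %5.1f Tflop/s" % peak_tflops(128, 128, 5.2))  # ~85 Tflop/s

# Hitachi SR11000: 6.8 Gflop/s per POWER4+ processor, 16 processors/node, 64-node largest system
print("SR11000 node peak:   %5.0f Gflop/s" % (16 * 6.8))                  # ~109 Gflop/s
print("SR11000 system peak: %5.1f Tflop/s" % peak_tflops(64, 16, 6.8))    # ~7 Tflop/s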
NEC SX-7/160M5
Total memory: 1280 GB
Peak performance: 1412 Gflop/s
Number of nodes: 5
PEs per node: 32
Memory per node: 256 GB
Peak performance per PE: 8.83 Gflop/s
Vector pipes per PE: 4
Data transfer rate between nodes: 8 GB/s

♦ SX-6: 8 processors/node
¾ 8 Gflop/s per processor, 16 GB per node
¾ processor-to-memory bandwidth
♦ SX-7: 32 processors/node
¾ 8.825 Gflop/s per processor, 256 GB per node
¾ processor-to-memory bandwidth
♦ Rumors of the SX-8: 8 CPUs/node, 26 Gflop/s per processor

After 2 Years, Still a Tour de Force in Engineering (the Earth Simulator)
♦ Homogeneous, centralized, proprietary, expensive!
♦ Target applications: CFD for weather, climate, and earthquakes
♦ 640 NEC SX-6 nodes (modified)
¾ 5120 CPUs with vector operations
¾ Each CPU 8 Gflop/s peak
♦ 40 Tflop/s (peak)
♦ H. Miyoshi: mastermind & director
¾ NAL, RIST, ES
¾ Fujitsu AP, VP400, NWT, ES
♦ ~1/2 billion $ for machine, software, & building
♦ Footprint of 4 tennis courts
♦ Expected to be on top of the Top500 for at least another year!
♦ From the Top500 (November 2003)
¾ Performance of the ESC > Σ of the next top 3 computers

The Top131
♦ Focus on machines that are at least 1 Tflop/s on the Linpack benchmark
♦ Pros
¾ One number
¾ Simple to define and rank
¾ Allows problem size to change with machine and over time
♦ Cons
¾ Emphasizes only "peak" CPU speed and number of CPUs
¾ Does not stress local bandwidth
¾ Does not stress the network
¾ Does not test gather/scatter
¾ Ignores Amdahl's Law (only does weak scaling)
¾ …
♦ 1993:
¾ #1 = 59.7 Gflop/s
¾ #500 = 422 Mflop/s
♦ 2003:
¾ #1 = 35.8 Tflop/s
¾ #500 = 403 Gflop/s
(See the peak-versus-Linpack efficiency sketch at the end of this survey.)

Number of Systems on the Top500 > 1 Tflop/s Over Time
[Bar chart: growth from a handful of systems in the late 1990s to 131 by November 2003.]

Factoids on Machines > 1 Tflop/s
♦ 131 systems > 1 Tflop/s
♦ 80 clusters (61%)
♦ Average rate: 2.44 Tflop/s
♦ Median rate: 1.55 Tflop/s
♦ Sum of processors in the Top131: 155,161
¾ Sum for the Top500: 267,789
♦ Average processor count: 1184
♦ Median processor count: 706
♦ Numbers of processors
¾ Most processors: 9632 (ASCI Red)
¾ Fewest processors: 124 (Cray X1)
[Charts: year of introduction for the 131 systems (most, 86, introduced in 2003, 31 in 2002, the remainder in 1998-2001); processor count vs. Top131 rank.]

Percent of the 131 Systems > 1 Tflop/s Which Use the Following Processors
♦ About half are based on a 32-bit architecture
♦ 9 (11) machines have a vector instruction set
[Pie charts: breakdown of the 131 systems by processor family (Pentium-based systems account for roughly half, at 48%, with IBM, Itanium, Alpha, NEC, Cray, Sparc, Hitachi, SGI, and AMD making up the rest) and a cut by manufacturer of system (IBM, Dell, Linux Networx, self-made, SGI, Fujitsu, NEC, Hitachi, Visual Technology, Promicro, Legend Group, Atipa Technology, HPTi, and Cray Inc.)]
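As a closing illustration of the peak-versus-Linpack gap raised in the Top131 discussion, the ratio of achieved Linpack rate to theoretical peak can be computed directly from figures quoted earlier in this deck. A minimal Python sketch follows; it attributes the November 2003 #1 Linpack result of 35.8 Tflop/s to the Earth Simulator (consistent with the Earth Simulator slide) and reuses the per-processor Linpack 1000 numbers from the Itanium 2 and Xeon slides.

# Efficiency = achieved Linpack rate / theoretical peak, using numbers quoted in this deck.

cases = [
    # (description, achieved, peak) -- both values in the same units
    ("Earth Simulator, Top500 Rmax vs. 40 Tflop/s peak", 35.8, 40.0),
    ("Itanium 2 1.5 GHz, Linpack 1000 vs. 6 Gflop/s peak", 5.4, 6.0),
    ("Xeon 3.2 GHz, Linpack 1000 vs. 6.4 Gflop/s peak", 3.1, 6.4),
]

for description, achieved, peak in cases:
    print(f"{description:52s} {100.0 * achieved / peak:5.1f}% of peak")

# The high-bandwidth vector system sustains close to 90% of peak, while the
# commodity Xeon reaches roughly half -- the balance argument made earlier.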