
Survey of "High Performance Machines"

Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory

Overview

♦ Processors
♦ Interconnect
♦ Look at the 3 Japanese HPCs
♦ Examine the Top131

History of High Performance

[Figure: aggregate systems performance and single-CPU performance (flop/s) by year, 1980-2010. Aggregate performance climbs from roughly 100 Mflop/s toward 1 Pflop/s through increasing parallelism (X-MP, VP-200, and S-810/20 through the NWT, CM-5, T3E, ASCI Red, ASCI Blue, ASCI White, ASCI Q, and SX-6), while single-CPU frequencies rise from 10 MHz toward 10 GHz.]

Vibrant Field for High Performance Computers

♦ SGI
♦ IBM Regatta
♦ Sun
♦ HP
♦ Bull
♦ Power
♦ Hitachi SR11000
♦ NEC SX-7
♦ Apple
♦ Coming soon …
  ¾ Cray RedStorm
  ¾ Cray BlackWidow
  ¾ NEC SX-8
  ¾ IBM Blue Gene/L

Architecture/Systems Continuum

Loosely coupled:
♦ Commodity processor with commodity interconnect
  ¾ Clusters
  ¾ Opteron, Alpha, PowerPC
  ¾ GigE, Infiniband, Myrinet, Quadrics, SCI
  ¾ NEC TX7
  ¾ HP Alpha
  ¾ Bull NovaScale 5160

♦ Commodity processor with custom interconnect
  ¾ SGI Altix (Itanium 2)
  ¾ Cray (AMD Opteron)
  ¾ IBM Blue Gene/L (?) (IBM PowerPC)

Tightly coupled:
♦ Custom processor with custom interconnect
  ¾ Cray X1
  ¾ NEC SX-7
  ¾ IBM Regatta

Commodity Processors

♦ AMD Opteron
  ¾ 2 GHz, 4 Gflop/s peak
♦ HP Alpha EV68
  ¾ 1.25 GHz, 2.5 Gflop/s peak
♦ HP PA-RISC
♦ IBM PowerPC
  ¾ 2 GHz, 8 Gflop/s peak
♦ Intel Itanium 2
  ¾ 1.5 GHz, 6 Gflop/s peak
♦ Intel Pentium, Pentium EM64T
  ¾ 3.2 GHz, 6.4 Gflop/s peak
♦ MIPS R16000
♦ Sun UltraSPARC IV

Itanium 2

♦ Floating point bypasses the level 1 cache; the bus is 128 bits wide and operates at 400 MHz, for 6.4 GB/s
♦ 4 flops/cycle
♦ 1.5 GHz Itanium 2
  ¾ Linpack numbers (theoretical peak 6 Gflop/s):
  ¾ Linpack 100: 1.7 Gflop/s
  ¾ Linpack 1000: 5.4 Gflop/s
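A minimal Python sketch (not part of the original slides) of the arithmetic behind these figures: theoretical peak is the clock rate times the flops completed per cycle, and Linpack efficiency is the achieved rate divided by that peak. The numbers are the Itanium 2 figures quoted above.

    def peak_gflops(clock_ghz, flops_per_cycle):
        # Theoretical peak in Gflop/s: clock rate (GHz) x flops per cycle.
        return clock_ghz * flops_per_cycle

    peak = peak_gflops(1.5, 4)  # 6.0 Gflop/s for a 1.5 GHz Itanium 2
    for name, achieved in [("Linpack 100", 1.7), ("Linpack 1000", 5.4)]:
        print(f"{name}: {achieved} Gflop/s, {achieved / peak:.0%} of peak")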

Pentium 4 IA32

♦ Processor of choice for clusters
♦ 1 flop/cycle; Streaming SIMD Extensions 2 (SSE2): 2 flops/cycle
♦ Intel Xeon 3.2 GHz, 400/533 MHz bus, 64 bits wide (3.2/4.2 GB/s)
  ¾ Linpack numbers (theoretical peak 6.4 Gflop/s):
  ¾ Linpack 100: 1.7 Gflop/s
  ¾ Linpack 1000: 3.1 Gflop/s
♦ Coming soon: "EM64T"
  ¾ 3.6 GHz, 2 MB L2 cache
  ¾ 800 MHz bus, 64 bits wide
  ¾ Peak 7.2 Gflop/s using SSE2

High Bandwidth vs Commodity Systems

♦ High-bandwidth systems have traditionally been vector computers
  ¾ Designed for scientific problems
  ¾ Capability computing
♦ Commodity processors are designed for web servers and the home PC market
  ¾ (We should be thankful that the manufacturers keep 64-bit floating point)
  ¾ Used for cluster-based computers, leveraging their price point
♦ Scientific computing needs are different
  ¾ They require a better balance between data movement and floating-point operations, which results in greater efficiency.

                           Earth Simulator   Cray X1        ASCI Q        MCR            VT Big Mac
                           (NEC)             (Cray)         (HP EV68)     (Dual Xeon)    (Dual IBM PPC)
Year of introduction       2002              2003           2002          2002           2003
Node architecture          Vector            Vector         Alpha         Pentium        PowerPC
Processor cycle time       500 MHz           800 MHz        1.25 GHz      2.4 GHz        2 GHz
Peak speed per processor   8 Gflop/s         12.8 Gflop/s   2.5 Gflop/s   4.8 Gflop/s    8 Gflop/s
Bytes/flop (main memory)   4                 2.6            0.8           0.44           0.5
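A minimal sketch (not from the slides) of how the bytes-per-flop balance in the table is obtained: main-memory bandwidth per processor divided by peak floating-point rate. The bandwidth figures below are inferred from the ratios in the table, not taken from vendor documentation.

    def bytes_per_flop(mem_bw_gbytes_s, peak_gflop_s):
        # Balance: bytes of main-memory bandwidth available per peak flop.
        return mem_bw_gbytes_s / peak_gflop_s

    print(bytes_per_flop(32.0, 8.0))   # Earth Simulator CPU: 4.0 B/flop
    print(bytes_per_flop(2.1, 4.8))    # dual-Xeon MCR processor (inferred bandwidth): ~0.44 B/flop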

Commodity Interconnects

♦ Gigabit Ethernet (bus)
♦ Myrinet (Clos)
♦ Infiniband (fat tree)
♦ QsNet (fat tree)
♦ SCI (torus)

                     Topology    $ NIC     $ Switch/node   $ Node    Latency (us) / BW (MB/s) (MPI)
Gigabit Ethernet     Bus         $50       $50             $100      30 / 100
SCI                  Torus       $1,600    $0              $1,600    5 / 300
QsNetII              Fat tree    $1,200    $1,700          $2,900    3 / 880
Myrinet (D card)     Clos        $700      $400            $1,100    6.5 / 240
IB 4x                Fat tree    $1,000    $400            $1,400    6 / 820
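A minimal sketch (not from the slides) of a simple latency/bandwidth model applied to the table above: the time for an MPI message is roughly the latency plus the message size divided by the bandwidth. Small messages are dominated by latency, large ones by bandwidth.

    def msg_time_us(size_bytes, latency_us, bw_mbytes_s):
        # Time in microseconds; bytes divided by MB/s conveniently yields microseconds.
        return latency_us + size_bytes / bw_mbytes_s

    for name, lat, bw in [("GigE", 30, 100), ("Myrinet", 6.5, 240),
                          ("QsNetII", 3, 880), ("IB 4x", 6, 820)]:
        print(f"{name}: {msg_time_us(8192, lat, bw):.1f} us for an 8 KB message")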

Quick Look at …

♦ Fujitsu PRIMEPOWER HPC2500
♦ Hitachi SR11000
♦ NEC SX-7

Fujitsu PRIMEPOWER HPC2500

[Figure: system architecture. Up to 128 SMP nodes of 8-128 CPUs each are connected by a high-speed optical interconnect (4 GB/s x 4 per node); within a node, a crossbar network gives uniform memory access (SMP) across up to 16 system boards, with channel adapters, a PCIBOX for I/O devices, and a DTU (Data Transfer Unit) to the optical interconnect.]

♦ 1.3 GHz SPARC CPU: 5.2 Gflop/s per processor
♦ System-board based architecture: 8 CPUs per board, 41.6 Gflop/s per system board, 8.36 GB/s memory bandwidth per board (133 GB/s total per node)
♦ 666 Gflop/s per 128-CPU node
♦ Peak for a 128-node system: 85 Tflop/s
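A minimal sketch (not from the slides) of how the quoted peaks compose, assuming 4 flops per cycle for the 1.3 GHz SPARC and the board/node/system counts shown above.

    cpu_peak = 1.3 * 4                    # 1.3 GHz x 4 flops/cycle = 5.2 Gflop/s per CPU
    board_peak = cpu_peak * 8             # 8 CPUs per system board = 41.6 Gflop/s
    node_peak = board_peak * 16           # 16 boards per node: ~666 Gflop/s
    system_peak = node_peak * 128 / 1000  # 128 nodes: ~85 Tflop/s
    print(cpu_peak, board_peak, node_peak, system_peak)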

Latest Installations of FUJITSU HPC Systems

User Name                                                              Configuration
Japan Aerospace Exploration Agency (JAXA)                              PRIMEPOWER 128 CPU x 14 cabinets (9.3 Tflop/s)
Japan Atomic Energy Research Institute (ITBL System)                   PRIMEPOWER 128 CPU x 4 + 64 CPU (3 Tflop/s)
Kyoto University                                                       PRIMEPOWER 128 CPU (1.5 GHz) x 11 + 64 CPU (8.8 Tflop/s)
Kyoto University (Radio Science Center for Space and Atmosphere)       PRIMEPOWER 128 CPU + 32 CPU
Kyoto University (Grid System)                                         PRIMEPOWER 96 CPU
Nagoya University (Grid System)                                        PRIMEPOWER 32 CPU x 2
National Astronomical Observatory of Japan (SUBARU Telescope System)   PRIMEPOWER 128 CPU x 2
Japan Nuclear Cycle Development Institute                              PRIMEPOWER 128 CPU x 3
Institute of Physical and Chemical Research (RIKEN)                    IA-Cluster (Xeon 2048 CPU) with InfiniBand & Myrinet
National Institute of Informatics (NAREGI System)                      IA-Cluster (Xeon 256 CPU) with InfiniBand; PRIMEPOWER 64 CPU
Tokyo University (The Institute of Medical Science)                    IA-Cluster (Xeon 64 CPU) with Myrinet; PRIMEPOWER 26 CPU x 2
Osaka University (Institute of Protein Research)                       IA-Cluster (Xeon 160 CPU) with InfiniBand

Hitachi SR11000

♦ Based on the IBM POWER4+
♦ SMP with 16 processors/node
  ¾ 109 Gflop/s per node (6.8 Gflop/s per processor)
  ¾ IBM uses 32 processors/node in their own machine
♦ IBM Federation switch
  ¾ Hitachi: 6 planes for 16 processors/node
  ¾ IBM uses 8 planes for 32 processors/node
♦ Pseudo vector processing features
  ¾ No hardware enhancements
  ¾ Unlike the SR8000

♦ Hitachi's effort is separate from IBM's
  ¾ No plans for HPF

♦ 3 customers for the SR11000
  ¾ Largest system: 64 nodes (7 Tflop/s)

♦ National Institute for Materials Science, Tsukuba: 64 nodes (7 Tflop/s)
♦ Okazaki Institute for Molecular Science: 50 nodes (5.5 Tflop/s)
♦ Institute of Statistical Mathematics: 4 nodes

NEC SX-7/160M5

Total memory                         1280 GB
Peak performance                     1412 Gflop/s
# nodes                              5
# PEs per node                       32
Memory per node                      256 GB
Peak performance per PE              8.83 Gflop/s
# vector pipes per PE                4
Data transfer rate between nodes     8 GB/s

Rumors of the SX-8: 8 CPUs/node, 26 Gflop/s per processor

♦ SX-6: 8 processors/node
  ¾ 8 Gflop/s per processor, 16 GB per node
  ¾ processor-to-memory bandwidth: …
♦ SX-7: 32 processors/node
  ¾ 8.825 Gflop/s per processor, 256 GB per node
  ¾ processor-to-memory bandwidth: …

After 2 Years, Still a Tour de Force in Engineering

♦ Homogeneous, centralized, proprietary, expensive!
♦ Target applications: CFD, weather, climate, earthquakes
♦ 640 NEC SX-6 nodes (modified)
  ¾ 5120 CPUs with vector operations
  ¾ Each CPU: 8 Gflop/s peak
♦ 40 Tflop/s (peak)
♦ H. Miyoshi: mastermind and director
  ¾ NAL, RIST, ES
  ¾ Fujitsu AP, VP400, NWT, ES
♦ ~1/2 billion dollars for the machine, …, and building

♦ Footprint of 4 tennis courts
♦ Expected to stay at the top of the Top500 for at least another year!

♦ From the Top500 (November 2003)
  ¾ Performance of the ESC > Σ of the next top 3 computers

The Top131

♦ Focus on machines that achieve at least 1 Tflop/s on the Linpack benchmark

♦ Pros
  ¾ One number
  ¾ Simple to define and rank
  ¾ Allows the problem size to change with the machine and over time
♦ Cons
  ¾ Emphasizes only "peak" CPU speed and number of CPUs
  ¾ Does not stress local bandwidth
  ¾ Does not stress the network
  ¾ Does not test gather/scatter
  ¾ Ignores Amdahl's Law (only does weak scaling)
  ¾ …
♦ 1993:
  ¾ #1 = 59.7 Gflop/s
  ¾ #500 = 422 Mflop/s
♦ 2003:
  ¾ #1 = 35.8 Tflop/s
  ¾ #500 = 403 Gflop/s
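A minimal sketch (not from the slides) of the Amdahl's Law point in the cons list: with a fixed problem size (strong scaling), even a small serial fraction caps the achievable speedup, which Linpack sidesteps by growing the problem with the machine (weak scaling).

    def amdahl_speedup(p, serial_fraction):
        # Strong-scaling speedup on p processors with a fixed serial fraction.
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

    for p in (16, 256, 4096):
        print(p, round(amdahl_speedup(p, 0.01), 1))  # 1% serial work caps speedup near 100x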

Number of Systems on the Top500 > 1 Tflop/s Over Time

[Figure: bar chart of the number of Top500 systems exceeding 1 Tflop/s on each list from Nov-96 through May-04, growing from a handful in the late 1990s to 46, 59, and finally 131 on the most recent lists.]

Factoids on Machines > 1 Tflop/s

♦ 131 systems > 1 Tflop/s
♦ 80 clusters (61%)

♦ Average rate: 2.44 Tflop/s
♦ Median rate: 1.55 Tflop/s
♦ Sum of processors in the Top131: 155,161
  ¾ Sum for the Top500: 267,789
♦ Average processor count: 1,184
♦ Median processor count: 706
♦ Number of processors
  ¾ Most processors: 9,632 (ASCI Red)
  ¾ Fewest processors: 124 (Cray X1)

[Figure: year of introduction for the 131 systems (bar chart, 1998-2003; most systems were introduced in 2002-2003, with 31 and 86 systems respectively) and number of processors versus Top131 rank (log scale, roughly 100 to 10,000 processors).]
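A minimal sketch (not from the slides) of how such summary statistics are computed; the Rmax values below are made-up placeholders, not the actual Top131 data.

    import statistics

    rmax_tflops = [35.8, 13.9, 10.3, 8.6, 6.6, 5.0, 3.1, 2.2, 1.6, 1.3, 1.1]  # placeholder values
    print(sum(rmax_tflops))                # aggregate performance
    print(statistics.mean(rmax_tflops))    # average rate
    print(statistics.median(rmax_tflops))  # median rate, less skewed by the few largest systems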

Percent of 131 Systems (> 1 Tflop/s) Which Use the Following Processors

♦ About half are based on a 32-bit architecture
♦ 9 (11) machines have a vector instruction set

[Figure: two pie charts cutting the 131 systems by processor family (Pentium 48%, IBM 24%, Itanium 9%, with smaller shares for Alpha, AMD, Cray, Hitachi, NEC, SGI, and Sparc) and by system manufacturer (IBM, HP, Dell, Linux Networx, SGI, Self-made, Cray Inc., Atipa Technology, Fujitsu, HPTi, Hitachi, Legend Group, NEC, Promicro, and Visual Technology). Note: this cut of the data distorts the manufacturer counts, e.g. HP (14), IBM > 24%.]

Percent Breakdown by Classes

♦ Proprietary processor with proprietary interconnect: 33%
♦ Commodity processor with commodity interconnect: 61%
♦ Commodity processor with proprietary interconnect: 6%

What About Efficiency?

♦ Talking about Linpack
♦ What should the efficiency of a machine on the Top131 be?
  ¾ Percent of peak for Linpack: > 90%? > 80%? > 70%? > 60%? …
♦ Remember this is O(n^3) operations on O(n^2) data
  ¾ Mostly matrix multiply
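A minimal sketch (not from the slides) of why Linpack flatters machines: it performs about 2/3 n^3 + 2 n^2 floating-point operations on only n^2 matrix entries, so the flops-per-data ratio grows with n; efficiency is simply the achieved rate over the theoretical peak (the Earth Simulator's published Rmax and Rpeak are used as the example).

    def linpack_flops(n):
        # Approximate operation count for the Linpack benchmark at order n.
        return 2.0 / 3.0 * n**3 + 2.0 * n**2

    def efficiency(rmax, rpeak):
        return rmax / rpeak

    n = 100_000
    print(linpack_flops(n) / n**2)   # ~66,669 flops per matrix element at this size
    print(efficiency(35.86, 40.96))  # Earth Simulator: ~0.875, i.e. 87.5% of peak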

[Figure: Linpack efficiency versus Top131 rank for systems > 1 Tflop/s, plotted twice: once colored by processor family (AMD, Alpha, Cray X1, Hitachi, IBM, Itanium, NEC SX, Pentium, SGI, Sparc) and once by interconnect (GigE 44, Infiniband 3, Myrinet 19, Quadrics 12, Proprietary 52, SCI 1), alongside a performance-versus-rank plot. Labeled systems include the Earth Simulator, ASCI Q, ASCI White, VT-Apple, NCSA, PNNL, LANL Lightning, LLNL MCR, and NERSC.]

Interconnects Used

♦ Proprietary: 39%
♦ GigE: 34%
♦ Myrinet: 15%
♦ Quadrics: 9%
♦ Infiniband: 2%
♦ SCI: 1%

Efficiency for Linpack

                  Largest node count   Min    Max    Average
GigE              1024                 17%    63%    37%
SCI               120                  64%    64%    64%
QsNetII           2000                 68%    78%    74%
Myrinet           1250                 36%    79%    59%
Infiniband 4x     1100                 58%    69%    64%
Proprietary       9632                 45%    98%    68%

Country Percent by Total Performance

[Figure: pie chart of the Top131 share of total performance by country. The United States accounts for 63%, Japan 15%, and the United Kingdom 5%, with small shares for Australia, Canada, China, Finland, France, India, Israel, Italy, South Korea, Malaysia, Mexico, the Netherlands, New Zealand, Saudi Arabia, Sweden, and Switzerland.]

KFlop/s per Capita (Flops/Pop)

[Figure: bar chart of KFlop/s per capita (Flop/s per person), on a scale of 0 to 1000, for the countries above.]
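A minimal sketch (not from the slides) of the per-capita metric: a country's aggregate Linpack performance divided by its population. The inputs below are rough, assumed values for illustration only.

    def kflops_per_capita(total_rmax_tflops, population):
        # 1 Tflop/s = 1e9 Kflop/s.
        return total_rmax_tflops * 1e9 / population

    print(round(kflops_per_capita(200.0, 290e6)))  # assumed 200 Tflop/s over 290M people: ~690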

Special Purpose: GRAPE-6

♦ The 6th generation of the GRAPE (Gravity Pipe) project
♦ Gravity (N-body) calculation for many particles, at 31 Gflop/s per chip
♦ 32 chips per board: 0.99 Tflop/s per board
♦ A 64-board full system is installed at the University of Tokyo: 63 Tflop/s
♦ On each board, all particle data are loaded into SRAM memory, each target particle's data is injected into the pipeline, and the acceleration is then calculated (sketched below)
  ¾ No software!
♦ Gordon Bell Prize at SC for a number of years (Prof. Makino, U. Tokyo)
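A minimal sketch (not from the slides) of the direct-summation gravitational acceleration that the GRAPE pipelines evaluate in hardware; the softening parameter eps and the sample particles are illustrative assumptions.

    import math

    def acceleration(i, pos, mass, G=1.0, eps=1e-3):
        # a_i = sum_j G * m_j * (r_j - r_i) / (|r_j - r_i|^2 + eps^2)^(3/2)
        ax = ay = az = 0.0
        xi, yi, zi = pos[i]
        for j, (xj, yj, zj) in enumerate(pos):
            if j == i:
                continue
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            r2 = dx * dx + dy * dy + dz * dz + eps * eps
            s = G * mass[j] / (r2 * math.sqrt(r2))
            ax += dx * s
            ay += dy * s
            az += dz * s
        return ax, ay, az

    # Three unit-mass particles on a line; acceleration on the first one.
    print(acceleration(0, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 1.0, 1.0]))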

Sony PlayStation2

♦ Superscalar MIPS core at 300 MHz + vector units + graphics/DRAM
♦ 6 Gflop/s peak
♦ 32-bit floating point; not IEEE
♦ 8 KB D-cache; 32 MB memory, not expandable (the OS goes here as well)
♦ 2.4 GB/s to memory (0.38 B/flop)
♦ Potentially 20 floating-point ops/cycle
  ¾ FPU with FMAC + FDIV
  ¾ VPU1 with 4 FMAC + FDIV
  ¾ VPU2 with 4 FMAC + FDIV
  ¾ EFU with FMAC + FDIV
♦ About $200
♦ 529M sold

High-Performance Chips, Embedded Applications

♦ The driving market is gaming (PCs and game consoles)
  ¾ which is the main motivation for almost all the technology developments.

♦ Demonstrate that arithmetic is quite cheap.
♦ Not clear that they do much for scientific computing.

♦ Today there are three big problems with these apparently non-standard "off-the-shelf" chips:
  ¾ Most of these chips have very limited memory bandwidth and little if any support for inter-node communication.
  ¾ Integer, or only 32-bit floating point.
  ¾ No software support to map scientific applications to these processors.
  ¾ Poor memory capacity for program storage.

♦ Developing "custom" software is much more expensive than developing custom hardware.

Real Crisis With HPC Is With The Software

♦ Programming is stuck
  ¾ Arguably it hasn't changed since the 70s
♦ It's time for a change
  ¾ Complexity is rising dramatically
  ¾ Highly parallel and distributed systems
  ¾ From 10 to 100 to 1,000 to 10,000 to 100,000 processors!!
  ¾ Multidisciplinary applications
♦ Applications and software are usually much longer-lived than hardware
  ¾ Hardware life is typically five years at most.
  ¾ Fortran and C are the main programming models
♦ Software is a major cost component of modern technologies.
  ¾ The tradition in HPC system procurement is to assume that the software is free.

Some Current Unmet Needs

♦ Performance / portability
♦ Fault tolerance
♦ Better programming models
  ¾ Global shared address space
  ¾ Visible locality
♦ Maybe coming soon (since incremental, yet offering real benefits):
  ¾ Global Address Space (GAS) languages: UPC, Co-Array Fortran, Titanium
  ¾ "Minor" extensions to existing languages
  ¾ More convenient than MPI
  ¾ Have performance transparency via explicit remote memory references
♦ The critical cycle of prototyping, assessment, and commercialization must be a long-term, sustaining investment, not a one-time crash program.

Thanks for the Memories and Cycles

ACRI Digital Equipment Corp Alpha farm Meiko CS-1 series Alex AVX 2 Elxsi Meiko CS-2 series Alliant ETA Systems Multiflow Alliant FX/2800 Evans and Sutherland Computer Division American Supercomputer Myrias Ametek Fujitsu AP1000 nCUBE 2S Applied Dynamics Fujitsu VP 100-200-400 NEC Cenju-3 Astronautics Fujitsu VPP300/700 NEC Cenju-4 Avalon A12 Fujitsu VPP500 series NEC SX-3R BBN Fujitsu VPP5000 series BBN TC2000 Fujitsu VPX200 series NEC SX-4 Burroughs BSP Galaxy YH-1 NEC SX-5 Cambridge Parallel Processing DAP Gamma Goodyear Aerospace MPP Numerix C-DAC PARAM 10000 Openframe Gould NPL Parsys SN9000 series C-DAC PARAM 9000/SS Guiltech Parsys TA9000 series C-DAC PARAM Openframe Hitachi S-3600 series Parsytec CC series CDC Hitachi S-3800 series Parsytec GC/Power Plus Convex Hitachi SR2001 series Prisma Convex SPP-1000/1200/1600 Hitachi SR2201 series S-1 Cray Computer HP/Convex C4600 Saxpy Cray Computer Corp Cray-2 IBM RP3 Scientific Computer Systems (SCS) Cray Computer Corp Cray-3 IBM GF11 SGI Origin 2000 IBM ES/9000 series Siemens-Nixdorf VP2600 series Cray Research IBM SP1 series PowerChallenge Cray Research Cray Y-MP, Cray Y-MP M90 ICL DAP Stern Computing Systems SSP Cray Research Inc APP XP SUN E10000 Starfire Intel Scientific Computers Supercomputer Systems (SSI) Classic International Parallel Machines Supertek J Machine Suprenum Cray Y-MP C90 The AxilSCC Culler Scientific Kendall Square Research KSR2 The HP Exemplar V2600 Culler-Harris Key Computer Laboratories Thinking Machines Cydrome Kongsberg Informasjonskontroll SCALI TMC CM-2(00) Dana/Ardent/Stellar/Stardent MasPar TMC CM-5 DEC AlphaServer 8200 & 8400 MasPar MP-1, MP-2 Vitesse Electronics Denelcor HEP Meiko Matsushita ADENART