33rd TOP500 List
ISC’09, Hamburg

Agenda
• Welcome and Introduction (H.W. Meuer)
• TOP10 and Awards (H.D. Simon)
• Highlights of the 33rd TOP500 (E. Strohmaier)
• HPC Power Consumption (J. Shalf)
• Multicore and Manycore and their Impact on HPC (J.J. Dongarra)
• Discussion

33rd List: The TOP10

Rank | Site | Manufacturer | Computer | Country | Cores | Rmax [Tflops] | Power [MW]
1 | DOE/NNSA/LANL | IBM | Roadrunner, BladeCenter QS22/LS21 | USA | 129,600 | 1,105.0 | 2.48
2 | Oak Ridge National Laboratory | Cray Inc. | Jaguar, Cray XT5 QC 2.3 GHz | USA | 150,152 | 1,059.0 | 6.95
3 | Forschungszentrum Juelich (FZJ) | IBM | Jugene, Blue Gene/P Solution | Germany | 294,912 | 825.50 | 2.26
4 | NASA/Ames Research Center/NAS | SGI | Pleiades, SGI Altix ICE 8200EX | USA | 51,200 | 487.0 | 2.09
5 | DOE/NNSA/LLNL | IBM | BlueGene/L, eServer Blue Gene Solution | USA | 212,992 | 478.2 | 2.32
6 | University of Tennessee | Cray | Kraken, Cray XT5 QC 2.3 GHz | USA | 66,000 | 463.30 |
7 | Argonne National Laboratory | IBM | Intrepid, Blue Gene/P Solution | USA | 163,840 | 458.61 | 1.26
8 | TACC/U. of Texas | Sun | Ranger, SunBlade x6420 | USA | 62,976 | 433.2 | 2.0
9 | DOE/NNSA/LANL | IBM | Dawn, Blue Gene/P Solution | USA | 147,456 | 415.70 | 1.13
10 | Forschungszentrum Juelich (FZJ) | Sun/Bull SA | JUROPA, NovaScale/Sun Blade | Germany | 26,304 | 274.80 | 1.54

The TOP500 Project
• Listing the 500 most powerful computers in the world
• Yardstick: Rmax of Linpack
– Solve Ax=b, dense problem, matrix is random
• Updated twice a year:
– ISC’xy in June in Germany
– SCxy in November in the U.S.
• All information available from the TOP500 web site at: www.top500.org

TOP500 Status
• 1st Top500 List in June 1993 at ISC’93 in Mannheim
• …
• 30th Top500 List on November 13, 2007 at SC07 in Reno
• 31st Top500 List on June 18, 2008 at ISC’08 in Dresden
• 32nd Top500 List on November 18, 2008 at SC08 in Austin
• 33rd Top500 List on June 24, 2009 at ISC’09 in Hamburg
• 34th Top500 List on November 18, 2009 at SC09 in Portland
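The Linpack yardstick above (solve a dense random Ax=b and count the floating-point operations) can be sketched in a few lines. This is an illustrative toy, not the tuned, distributed HPL benchmark the list actually uses; the problem size `n` is an arbitrary assumption:

```python
import time
import numpy as np

# Toy version of the Linpack yardstick: solve dense Ax=b for a random
# matrix and report the achieved flop rate. The real benchmark (HPL)
# is a tuned distributed LU factorization; this sketch only shows the
# accounting: ~2/3*n^3 + 2*n^2 floating-point operations for the solve.
def linpack_toy(n: int = 2000) -> float:
    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random(n)
    t0 = time.perf_counter()
    x = np.linalg.solve(a, b)        # LU factorization + triangular solves
    elapsed = time.perf_counter() - t0
    assert np.allclose(a @ x, b)     # residual check, as in HPL
    flops = 2.0 / 3.0 * n**3 + 2.0 * n**2
    return flops / elapsed / 1e9     # Gflop/s

print(f"{linpack_toy():.1f} Gflop/s")
```

Rmax is the best rate achieved over the problem sizes a site chooses to run; the single-node rate printed here is many orders of magnitude below the TOP10 numbers above.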
• Acknowledged by HPC users, manufacturers and media

33rd List: Highlights
• The Roadrunner system, which broke the petaflop/s barrier one year ago, held on to its No. 1 spot. It is still one of the most energy-efficient systems on the TOP500.
• The most powerful system outside the U.S. is an IBM BlueGene/P system at the German Forschungszentrum Juelich (FZJ) at No. 3. FZJ has a second system in the TOP10, a mixed Bull and Sun system at No. 10.
• Intel dominates the high-end processor market with 79.8 percent of all systems and 87.7 percent of quad-core based systems.
• Intel Core i7 (Nehalem) makes its first appearance on the list with 33 systems.
• Quad-core processors are used in 76.6 percent of the systems. Their use accelerates performance growth at all levels.
• Other notable systems are:
– An IBM BlueGene/P system at the King Abdullah University of Science and Technology (KAUST) in Saudi Arabia at No. 14
– The Chinese-built Dawning 5000A at the Shanghai Supercomputer Center at No. 15. It is the largest system that can be operated with Windows HPC 2008.
• Hewlett-Packard kept a narrow lead over IBM in market share by total systems, but IBM still stays ahead by overall installed performance.
• Cray’s XT system series is very popular with big customers: 10 systems in the TOP50 (20 percent).

33rd List: Notable (New) Systems
• Shaheen, an IBM BlueGene/P system at the King Abdullah University of Science and Technology (KAUST) in Saudi Arabia at No. 14
• The Chinese-built Dawning 5000A at the Shanghai Supercomputer Center at No. 15. It is the largest system that can be operated with Windows HPC 2008.
• The (new) Earth Simulator at No. 22 (SX-9E based – the only vector system) – it’s back!
• Koi, a Cray internal system using Shanghai six-cores at No. 128.
• A Grape-DR system at the National Astronomical Observatory (NAO/CfCA) at No. 259 – target size 2 PF/s peak!

Multi-Core and Many-Core
• Power consumption of chips and systems has increased tremendously because of ‘cheap’ exploitation of Moore’s Law.
– Free lunch has ended
– The stall of frequencies forces increasing concurrency levels: Multi-Cores
– Optimal core sizes/power are smaller than current ‘rich cores’, which leads to Many-Cores
• Many-Cores: more (10–100x) but smaller cores:
– Intel Polaris – 80 cores,
– Clearspeed CSX600 – 96 cores,
– nVidia G80 – 128 cores, or
– CISCO Metro – 188 cores

Performance Development
[Chart: Performance Development, 1993–2009 – SUM grows from 1.17 TFlop/s to 22.9 PFlop/s, N=1 from 59.7 GFlop/s to 1.1 PFlop/s, N=500 from 400 MFlop/s to 17.08 TFlop/s]

Performance Development
[Chart: Performance Development with extrapolation to 2020 – trend lines for SUM, N=1 and N=500 projected toward the Eflop/s range]

Replacement Rate
[Chart: number of systems replaced on each new list, 1993–2009 – average: 226]

Vendors / System Share
[Pie chart: vendor shares of systems – HP 46%, IBM 40%; the remainder split among Cray Inc. (3%), Dell (4%), SGI (4%), Sun, Appro and Fujitsu]

Vendors
[Chart: systems per vendor, 1993–2009 – IBM, HP, Cray Inc., SGI, Sun, Fujitsu, NEC, Intel, TMC, Hitachi, Dell, Self-made, Others]

Vendors (TOP50)
[Chart: TOP50 systems per vendor, 1993–2009 – IBM, Cray Inc., Fujitsu, NEC, Hitachi, HP, SGI, TMC, Intel, Dell, LNXI, Self-made, Sun, Bull, Appro, Others]

Customer Segments
[Chart: systems per customer segment, 1993–2009 – Industry, Research, Academic, Classified, Vendor, Government, Others]

Customer Segments
[Chart: performance share per customer segment, 1993–2009 – Industry, Research, Academic, Classified, Vendor, Government]

Customer Segments (TOP50)
[Chart: TOP50 systems per customer segment, 1993–2009 – Industry, Research, Academic, Vendor, Classified, Government]

Continents
[Chart: systems per continent, 1993–2009 – Americas, Europe, Asia, Others]

Countries
[Chart: systems per country, 1993–2009 – US, Japan, Germany, UK, France, Italy, Korea, China, India, Others]

Countries (TOP50)
[Chart: TOP50 systems per country, 1993–2009 – US, Japan, Germany, UK, France, Italy, Korea, China, India, Others]

Countries / System Share
[Pie chart: country shares of systems – United States 58%; United Kingdom, Germany, France, Japan, China, Italy, Sweden, India, Russia, Spain and Poland share the rest]

Asian Countries
[Chart: systems in Asian countries, 1993–2009 – Japan, Korea (South), China, India, Others]

European Countries
[Chart: systems in European countries, 2000–2009 – Germany, United Kingdom, France, Italy, Netherlands, Switzerland, Sweden, Spain, Russia, Poland, Others]

Processor Architecture / Systems
[Chart: systems per processor architecture, 1993–2009 – Scalar, Vector, SIMD]

Operating Systems
[Chart: systems per operating system family, 1993–2008 – Unix, Linux, Mixed, Windows, Mac OS, BSD Based, N/A]

Architectures
[Chart: systems per architecture, 1993–2009 – MPP, Cluster, Constellations, SMP, Single Proc., SIMD]

Architectures (TOP50)
[Chart: TOP50 systems per architecture, 1993–2009 – MPP, Cluster, Constellations, SMP, Single Proc., SIMD]

Processors / Systems
[Pie chart: processor families by systems – Xeon 54xx (Harpertown) 53%; Xeon 53xx (Clovertown), Xeon 55xx (Nehalem-EP), Xeon 51xx (Woodcrest), Opteron Quad Core, Opteron Dual Core, PowerPC 440, PowerPC 450, POWER6 and Others at 2–10% each]

Processors / Performance
[Pie chart: processor families by performance share – Xeon 54xx (Harpertown) 32%; Xeon 53xx (Clovertown), Xeon 55xx (Nehalem-EP), Xeon 51xx (Woodcrest), Opteron Quad Core, Opteron Dual Core, PowerXCell 8i, PowerPC 450, PowerPC 440, POWER6 and Others make up the rest]

Core Count
[Chart: systems by total core count, 1993–2009 – bands from 1 core up to 128k+ cores]

Cores per Socket
[Pie chart: cores per socket – 4 cores 76.6%, 2 cores 21.4%; 1, 6, 9 and 16 cores make up the remainder]

Cores per Socket
[Chart: systems by cores per socket over time – 1, 2, 4, 9, Others]

Cluster Interconnects
[Chart: cluster interconnects over time – GigE, Myrinet, Infiniband, Quadrics]

Interconnect Family
[Chart: systems per interconnect family, 1993–2009 – N/A, Crossbar, GigE, SP Switch, Myrinet, Cray, Infiniband, Fat Tree, Proprietary, Quadrics, Others]

Interconnect Family (TOP50)
[Chart: TOP50 systems per interconnect family, 1993–2009 – N/A, Crossbar, SP Switch, Myrinet, Cray, Infiniband, Fat Tree, Proprietary, Quadrics, GigE, Others]

Linpack Efficiency
[Scatter plot: Linpack efficiency versus TOP500 rank, 0–100%]
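Linpack efficiency is the fraction of a system’s theoretical peak (Rpeak) that it actually delivers on Linpack (Rmax). A minimal sketch; the Rmax value is from the TOP10 table above, while the Rpeak figure is an approximate, rounded assumption, not taken from this document:

```python
def linpack_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Return Linpack efficiency Rmax/Rpeak as a fraction of peak."""
    return rmax_tflops / rpeak_tflops

# Roadrunner: Rmax 1,105.0 TFlop/s (from the TOP10 table);
# Rpeak ~1,456.7 TFlop/s is an approximate value assumed here.
eff = linpack_efficiency(1105.0, 1456.7)
print(f"Roadrunner efficiency: {eff:.1%}")  # roughly 76%
```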
Bell’s Law (1972)
“Bell's Law of Computer Class formation was discovered about 1972. It states that technology advances in semiconductors, storage, user interface and networking advance every decade enable a new, usually lower priced computing platform to form. Once formed, each class is maintained as a quite independent industry structure. This explains mainframes, minicomputers, workstations and Personal computers, the web, emerging web services, palm and mobile devices, and ubiquitous interconnected networks. We can expect home and body area networks to follow this path.”
From Gordon Bell (2007), http://research.microsoft.com/~GBell/Pubs.htm

HPC Computer Classes and Bell’s Law
• Bell’s Law states that:
– Important classes of computer architectures come in cycles of about 10 years.
– It takes about a decade for each phase:
– Early research
– Early adoption and maturation
– Prime usage
– Phase out past its prime
• Can we use Bell’s Law to classify computer architectures in the TOP500?

HPC Computer Classes and Bell’s Law
• Gordon Bell (1972): 10 year cycles for computer classes
• Computer classes in HPC based on the TOP500:
• Data Parallel Systems:
– Vector (Cray Y-MP and X1, NEC SX, …)
– SIMD (CM-2, …)
• Custom Scalar Systems:
– MPP (Cray T3E and XT3, IBM SP, …)
– Scalar SMPs and Constellations (Cluster of big SMPs)
• Commodity Cluster: NOW, PC cluster, Blades, …
• Power-Efficient Systems (BG/L as first example of low-power / embedded systems = potential new class?)
– Tsubame with Clearspeed, Roadrunner with Cell, ?
– This is a fundamental rupture and a new paradigm for computing at all levels!

Bell’s Law
[Chart: Computer Classes / Systems, 1993–2008 – Vector/SIMD, Custom Scalar, Commodity Cluster, Embedded/Accelerated]

Bell’s Law
[Chart: Computer Classes, refined / Systems, 1993–2008 – MPP, SMP/Constellations, Commodity Cluster, BG/L or BG/P]

Bell’s Law
HPC Computer Classes

Class                 | Early Adoption starts | Prime Use starts | Past Prime Usage starts
Data Parallel Systems | Mid 70’s              | Mid 80’s         | Mid 90’s
Custom Scalar Systems | Mid 80’s              | Mid 90’s         | Mid 2000’s
Commodity Cluster     | Mid 90’s              | Mid 2000’s       | Mid 2010’s ???
BG/L or BG/P          | Mid 2000’s            | Mid 2010’s ???   | Mid 2020’s ???

Power Consumption Data
• Have been planning this for years.
• Started in June 2008.
• Independent from the Green500, but we try to learn from each other.
• Collect power consumption for:
– Linpack as workload
– Including all essential parts of a system (processor, memory, …)
– Excluding features related to the machine room (most disk, UPS, …)
– Full system or a part large enough to include all shared components (fans, power supplies, …)
• Analyze these data carefully!

Power Ranking and How Not to do it!
• To rank objects by “size” one needs extensive properties:
– Weight or Volume
– Rmax (TOP500)
• A ‘larger’ system should have a larger Rmax.
• The ratio of 2 extensive properties is an intensive one:
– (weight/volume = density)
– Performance / Power Consumption = Power_efficiency
• One cannot ‘rank’ objects with densities BY SIZE:
– Density does not tell anything about the size of an object
– A piece of lead is not heavier or larger than a piece of wood.
• Linpack (sub-linear) / Power (linear) will always sort smaller systems before larger ones!

[Chart: power efficiency [MFlops/W], 0–600, versus TOP500 rank]
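The argument above can be checked directly with numbers from the TOP10 table: dividing Rmax by power produces an intensive quantity that re-orders systems by efficiency, not size. A minimal sketch using three systems from this list:

```python
# Rmax [Tflop/s] and power [MW] for three TOP10 systems from this list.
systems = {
    "Roadrunner": (1105.0, 2.48),  # rank 1 by Rmax
    "Jaguar":     (1059.0, 6.95),  # rank 2 by Rmax
    "Dawn":       (415.7,  1.13),  # rank 9 by Rmax
}

# Intensive property: performance per power.
# Tflop/s divided by MW is numerically MFlops/W, so no conversion needed.
efficiency = {name: rmax / mw for name, (rmax, mw) in systems.items()}

by_size = sorted(systems, key=lambda n: systems[n][0], reverse=True)
by_density = sorted(efficiency, key=efficiency.get, reverse=True)

print(by_size)     # ['Roadrunner', 'Jaguar', 'Dawn']
print(by_density)  # ['Roadrunner', 'Dawn', 'Jaguar'] – smaller Dawn jumps ahead
```

Jaguar (152 MFlops/W) drops below the far smaller Dawn (368 MFlops/W), illustrating why a ratio of two extensive properties cannot serve as a size ranking.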
Absolute Power Levels
[Chart: absolute power consumption [kWatt], 0–7000, versus TOP500 rank]

Power Efficiency related to Processors

Frequencies and Power Efficiency

Power Efficiencies of different Systems

Power Efficiency related to Interconnects
[Bar chart: power efficiency [MFlops/W], 0–600, by interconnect – Infiniband, Proprietary, Myrinet, Gigabit Ethernet]

Power Efficiencies of different Systems
[Bar chart: power efficiency [MFlops/W], 0–600, by processor type – (8+1) core, Embedded, Quadcore, Dualcore]

Power Efficiencies of different Systems
[Bar chart: power efficiency [MFlops/W], 0–300, for BladeCenter HS21, SGI Altix ICE 8200 and Cluster Platform 3000 BL 2x220 – annotations: Low Power Efficient (L5420), Low Power HT, Linpack Efficient, Double Density Blade]
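The per-system efficiencies compared above follow from the measurement methodology on the Power Consumption Data slide: power is measured under a Linpack run on the full system, or on a part large enough to include all shared components, then scaled. A minimal sketch of that accounting; all numeric values below are hypothetical illustrations, not measured data:

```python
# Sketch of the power-data methodology described earlier: measure power
# under Linpack on a subsystem that includes all shared components
# (fans, power supplies, ...), scale linearly to the full machine, then
# form the intensive quantity MFlops/W.

def full_system_power_kw(measured_kw: float,
                         nodes_measured: int,
                         nodes_total: int) -> float:
    """Scale a partial-system power measurement to the full system."""
    return measured_kw * nodes_total / nodes_measured

def efficiency_mflops_per_w(rmax_tflops: float, power_kw: float) -> float:
    """Power efficiency: (Tflop/s * 1e6 MFlops) / (kW * 1e3 W)."""
    return rmax_tflops * 1000.0 / power_kw

# Hypothetical example: one rack of 128 nodes draws 35 kW under Linpack;
# the machine has 1024 nodes and reaches 250 Tflop/s.
power = full_system_power_kw(35.0, 128, 1024)     # 280 kW
print(f"{efficiency_mflops_per_w(250.0, power):.0f} MFlops/W")
```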