33rd TOP500 List

ISC’09, Hamburg – Agenda
• Welcome and Introduction (H.W. Meuer)
• TOP10 and Awards (H.D. Simon)
• Highlights of the 33rd TOP500 (E. Strohmaier)
• HPC Power Consumption (J. Shalf)
• Multicore and Manycore and their Impact on HPC (J.J. Dongarra)
• Discussion

33rd List: The TOP10

Rank | Site | Manufacturer | Computer | Country | Cores | Rmax [Tflops] | Power [MW]
1 | DOE/NNSA/LANL | IBM | Roadrunner – BladeCenter QS22/LS21 | USA | 129,600 | 1,105.0 | 2.48
2 | Oak Ridge National Laboratory | Cray Inc. | Jaguar – Cray XT5 QC 2.3 GHz | USA | 150,152 | 1,059.0 | 6.95
3 | Forschungszentrum Juelich (FZJ) | IBM | Jugene – Blue Gene/P Solution | Germany | 294,912 | 825.50 | 2.26
4 | NASA/Ames Research Center/NAS | SGI | Pleiades – SGI ICE 8200EX | USA | 51,200 | 487.0 | 2.09
5 | DOE/NNSA/LLNL | IBM | BlueGene/L – eServer Blue Gene Solution | USA | 212,992 | 478.2 | 2.32
6 | University of Tennessee | Cray | Kraken – Cray XT5 QC 2.3 GHz | USA | 66,000 | 463.30 |
7 | Argonne National Laboratory | IBM | Intrepid – Blue Gene/P Solution | USA | 163,840 | 458.61 | 1.26
8 | TACC/U. of Texas | Sun | Ranger – SunBlade x6420 | USA | 62,976 | 433.2 | 2.0
9 | DOE/NNSA/LANL | IBM | Dawn – Blue Gene/P Solution | USA | 147,456 | 415.70 | 1.13
10 | Forschungszentrum Juelich (FZJ) | Sun/Bull SA | JUROPA – NovaScale/Sun Blade | Germany | 26,304 | 274.80 | 1.54

The TOP500 Project
• Listing the 500 most powerful computers in the world
• Yardstick: Rmax of Linpack – solve Ax=b, dense problem, matrix is random (a short illustrative sketch follows the TOP500 Status slide below)
• Update twice a year:
  – ISC’xy in June in Germany
  – SCxy in November in the U.S.
• All information available from the TOP500 web site at: www.top500.org

TOP500 Status
• 1st Top500 List in June 1993 at ISC’93 in Mannheim
• …
• 30th Top500 List on November 13, 2007 at SC07 in Reno
• 31st Top500 List on June 18, 2008 at ISC’08 in Dresden
• 32nd Top500 List on November 18, 2008 at SC08 in Austin
• 33rd Top500 List on June 24, 2009 at ISC’09 in Hamburg
• 34th Top500 List on November 18, 2009 at SC09 in Portland
• Acknowledged by HPC users, manufacturers and media
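The Linpack yardstick mentioned above boils down to timing the solution of a dense random system Ax=b and dividing the standard operation count by the elapsed time. Below is a minimal illustrative sketch in Python/NumPy, not the actual HPL benchmark; the problem size n = 2000 and the single-call timing are simplifying assumptions:

```python
import time
import numpy as np

def linpack_like(n: int = 2000) -> tuple[float, float]:
    """Toy Linpack-style run: solve a dense random Ax = b and report GFlop/s.

    Uses the standard HPL operation count of 2/3*n^3 + 2*n^2 flops
    (LU factorization plus the triangular solves).
    """
    rng = np.random.default_rng(0)
    A = rng.random((n, n))        # dense random matrix, as the benchmark requires
    b = rng.random(n)

    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)     # LU factorization + forward/back substitution
    elapsed = time.perf_counter() - t0

    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    gflops = flops / elapsed / 1e9
    residual = float(np.linalg.norm(A @ x - b))   # sanity check on the solution
    return gflops, residual

if __name__ == "__main__":
    gflops, res = linpack_like()
    print(f"~{gflops:.1f} GFlop/s (toy run), residual {res:.2e}")
```

Real TOP500 submissions use the distributed HPL implementation at much larger problem sizes; the sketch only illustrates where the Rmax number comes from.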

33rd List: Highlights

• The Roadrunner system, which broke the petaflop/s barrier one year ago, held on to its No. 1 spot. It is still one of the most energy-efficient systems on the TOP500.
• The most powerful system outside the U.S. is an IBM BlueGene/P system at the German Forschungszentrum Juelich (FZJ) at No. 3. FZJ has a second system in the TOP10, a mixed Bull and Sun system at No. 10.
• Intel dominates the high-end processor market with 79.8 percent of all systems and 87.7 percent of quad-core based systems.
• Intel Core i7 (Nehalem) makes its first appearance in the list with 33 systems.
• Quad-core processors are used in 76.6 percent of the systems. Their use accelerates performance growth at all levels.
• Other notable systems are:
  – An IBM BlueGene/P system at the King Abdullah University of Science and Technology (KAUST) in Saudi Arabia at No. 14
  – The Chinese-built Dawning 5000A at the Shanghai Supercomputer Center at No. 15. It is the largest system which can be operated with Windows HPC 2008.
• Hewlett-Packard kept a narrow lead over IBM in market share by total systems, but IBM still stays ahead by overall installed performance.
• Cray’s XT system series is very popular with big customers: 10 systems in the TOP50 (20 percent).

33rd List: Notable (New) Systems
• Shaheen, an IBM BlueGene/P system at the King Abdullah University of Science and Technology (KAUST) in Saudi Arabia at No. 14
• The Chinese-built Dawning 5000A at the Shanghai Supercomputer Center at No. 15. It is the largest system which can be operated with Windows HPC 2008.
• The (new) Earth Simulator at No. 22 (SX-9E based – the only vector system) – it’s back!
• Koi, a Cray internal system using Shanghai six-cores, at No. 128.
• A Grape-DR system at the National Astronomical Observatory (NAO/CfCA) at No. 259 – target size 2 PF/s peak!

Multi-Core and Many-Core
• Power consumption of chips and systems has increased tremendously, because of the ‘cheap’ exploitation of Moore’s Law.
  – The free lunch has ended.
  – The stall of clock frequencies forces increasing concurrency levels: Multi-Cores.
  – Optimal core sizes/power are smaller than current ‘rich cores’, which leads to Many-Cores.
• Many-Cores, more (10–100x) but smaller cores:
  – Intel Polaris – 80 cores,
  – Clearspeed CSX600 – 96 cores,
  – nVidia G80 – 128 cores, or
  – CISCO Metro – 188 cores

Performance Development

[Chart: Performance development 1993–2009, log scale. June 2009: SUM = 22.9 PFlop/s, N=1 = 1.1 PFlop/s, N=500 = 17.08 TFlop/s. June 1993: SUM = 1.17 TFlop/s, N=1 = 59.7 GFlop/s, N=500 = 400 MFlop/s.]

Performance Development

[Chart: Performance development projected to about 2020 (log scale, up to 1 EFlop/s), with trend lines for SUM, N=1 and N=500 extrapolated from the 1993–2008 data.]

Replacement Rate

[Chart: Replacement rate, 1993–2009 – number of systems replaced on each new list (scale 0–350); a value of 226 is labeled in the chart.]

Vendors / System Share

[Chart: Vendors / system share (pie). The two largest labeled shares are HP at 46% and IBM at 40%; Cray Inc., Dell and SGI sit at roughly 3–4% each; Sun, Appro and Fujitsu are also shown.]

Vendors

[Chart: Vendors / systems, 1993–2009 (stacked, 0–500 systems). Vendors shown: IBM, HP, Cray Inc., SGI, Sun, Fujitsu, NEC, Intel, TMC, Hitachi, Dell, Self-made, Others.]

Vendors (TOP50)

[Chart: Vendors / systems in the TOP50, 1993–2009 (stacked, 0–50 systems). Vendors shown: IBM, Cray Inc., Fujitsu, NEC, Hitachi, HP, SGI, TMC, Intel, Dell, LNXI, Self-made, Sun, Bull, Appro, Others.]

Customer Segments

[Chart: Customer segments / systems, 1993–2009 (stacked, 0–500 systems). Segments: Industry, Research, Academic, Classified, Vendor, Government, Others.]

Customer Segments

[Chart: Customer segments / performance share, 1993–2009 (stacked, 0–100%). Segments: Industry, Research, Academic, Classified, Vendor, Government.]

Customer Segments (TOP50)

[Chart: Customer segments / systems in the TOP50, 1993–2009 (stacked, 0–50 systems). Segments: Industry, Research, Academic, Vendor, Classified, Government.]

Continents

[Chart: Continents / systems, 1993–2009 (stacked, 0–500 systems). Regions: Americas, Europe, Asia, Others.]

Countries

[Chart: Countries / systems, 1993–2009 (stacked, 0–500 systems). Countries: US, Japan, Germany, UK, France, Italy, Korea, China, India, Others.]

Countries (TOP50)

[Chart: Countries / systems in the TOP50, 1993–2009 (stacked, 0–50 systems). Countries: US, Japan, Germany, UK, France, Italy, Korea, China, India, Others.]

Countries / System Share

[Chart: Countries / system share (pie). United States 58%; United Kingdom, France, Germany, Japan, China, Italy, Sweden, India, Russia, Spain and Poland make up the rest at roughly 1–9% each.]

Asian Countries

[Chart: Asian countries / systems, 1993–2009 (stacked, 0–120 systems). Countries: Japan, Korea (South), China, India, Others.]

European Countries

[Chart: European countries / systems, 2000–2009 (stacked, 0–200 systems). Countries: Germany, United Kingdom, France, Italy, Netherlands, Switzerland, Sweden, Spain, Russia, Poland, Others.]

Processor Architecture / Systems

[Chart: Processor architecture / systems, 1993–2009 (stacked, 0–500 systems). Architectures: Scalar, Vector, SIMD.]

Operating Systems

[Chart: Operating systems / systems, 1993–2008 (stacked, 0–500 systems). Families: Unix, Linux, Mixed, Windows, Mac OS, BSD Based, N/A.]

Architectures

[Chart: Architectures / systems, 1993–2009 (stacked, 0–500 systems). Categories: MPP, Cluster, Constellations, SMP, Single Proc., SIMD.]

Architectures (TOP50)

[Chart: Architectures / systems in the TOP50, 1993–2009 (stacked, 0–50 systems). Categories: MPP, Cluster, Constellations, SMP, Single Proc., SIMD.]

Processors / Systems

[Chart: Processors / system share (pie). Xeon 54xx (Harpertown) holds the largest share at 53%; also shown: Xeon 53xx (Clovertown), Xeon 55xx (Nehalem-EP), Xeon 51xx (Woodcrest), Opteron Quad Core, Opteron Dual Core, PowerPC 440, PowerPC 450, POWER6 and Others, at roughly 2–10% each.]

Processors / Performance

[Chart: Processors / performance share (pie). Categories: Xeon 54xx (Harpertown), Xeon 53xx (Clovertown), Xeon 55xx (Nehalem-EP), Xeon 51xx (Woodcrest), Opteron Quad Core, Opteron Dual Core, PowerXCell 8i, PowerPC 450, PowerPC 440, POWER6, Others; the two largest labeled shares are 32% and 18%.]

Core Count

[Chart: Core count per system, 1993–2009 (stacked, 0–500 systems), binned from single cores up to more than 128k cores (1, 2, 3–4, 5–8, 9–16, 17–32, 33–64, 65–128, 129–256, 257–512, 513–1024, 1025–2048, 2049–4096, 4k–8k, 8k–16k, 16k–32k, 32k–64k, 64k–128k, 128k+).]

Cores per Socket

[Chart: Cores per socket / system share (pie). 4 cores 76.6%, 2 cores 21.4%; 1, 6, 9 and 16 cores each below about 1%.]

Cores per Socket

[Chart: Cores per socket / systems over time (stacked, 0–500 systems). Categories: 1, 2, 4, 9 cores, and Others.]

Cluster Interconnects

[Chart: Cluster interconnects over time (0–300 systems). Interconnects: GigE, Myrinet, Infiniband, Quadrics.]

Interconnect Family

[Chart: Interconnect family / systems, 1993–2009 (stacked, 0–500 systems). Families: N/A, Crossbar, GigE, SP Switch, Myrinet, Cray, Infiniband, Fat Tree, Proprietary, Quadrics, Others.]

Interconnect Family (TOP50)

[Chart: Interconnect family / systems in the TOP50, 1993–2009 (stacked, 0–50 systems). Families: N/A, Crossbar, SP Switch, Myrinet, Cray, Infiniband, Fat Tree, Proprietary, Quadrics, GigE, Others.]

Linpack Efficiency

[Chart: Linpack efficiency (0–100%) plotted against TOP500 ranking (1–500).]
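The efficiency plotted in these charts is the ratio of measured Linpack performance to theoretical peak, Rmax / Rpeak. A tiny illustrative sketch follows; the Rpeak figure used below is an assumption for illustration, only the Rmax value appears in the TOP10 table above:

```python
def linpack_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Linpack efficiency as plotted here: measured Rmax over theoretical Rpeak."""
    return rmax_tflops / rpeak_tflops

# Hypothetical illustration: a system with Rmax = 433.2 TFlop/s (Ranger's Rmax
# from the TOP10 table) and an assumed Rpeak of 579.4 TFlop/s plots at about 75%.
print(f"{linpack_efficiency(433.2, 579.4):.1%}")
```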

Linpack Efficiency

[Chart: Linpack efficiency (0–100%) plotted against TOP500 ranking (1–500).]

Bell’s Law (1972)

“Bell's Law of Computer Class formation was discovered about 1972. It states that technology advances in semiconductors, storage, user interface and networking advance every decade enable a new, usually lower priced computing platform to form. Once formed, each class is maintained as a quite independent industry structure. This explains mainframes, minicomputers, workstations and Personal computers, the web, emerging web services, palm and mobile devices, and ubiquitous interconnected networks. We can expect home and body area networks to follow this path.”

From Gordon Bell (2007), http://research.microsoft.com/~GBell/Pubs.htm

HPC Computer Classes and Bell’s Law
• Bell’s Law states that:
• Important classes of computer architectures come in cycles of about 10 years.
• It takes about a decade for each phase:
  – Early research
  – Early adoption and maturation
  – Prime usage
  – Phase out past its prime
• Can we use Bell’s Law to classify computer architectures in the TOP500?

HPC Computer Classes and Bell’s Law

• Gordon Bell (1972): 10-year cycles for computer classes
• Computer classes in HPC based on the TOP500:
• Data Parallel Systems:
  – Vector (Cray Y-MP and X1, NEC SX, …)
  – SIMD (CM-2, …)
• Custom Scalar Systems:
  – MPP (Cray T3E and XT3, IBM SP, …)
  – Scalar SMPs and Constellations (clusters of big SMPs)
• Commodity Cluster: NOW, PC cluster, Blades, …
• Power-Efficient Systems (BG/L as first example of low-power / embedded systems = potential new class?)
  – Tsubame with Clearspeed, Roadrunner with Cell, ?
  – This is a fundamental rupture and a new paradigm for computing at all levels!

Bell’s Law

[Chart: Computer classes / systems, 1993–2009 (stacked, 0–500 systems). Classes: Vector/SIMD, Custom Scalar, Commodity Cluster, Embedded/Accelerated.]

Bell’s Law

[Chart: Computer classes (refined) / systems, 1993–2008 (stacked, 0–500 systems). Classes: MPP, SMP/Constellations, Commodity Cluster, BG/L or BG/P.]

Bell’s Law

HPC Computer Classes

Class                 | Early Adoption starts | Prime Use starts | Past Prime Usage starts
Data Parallel Systems | Mid 70’s              | Mid 80’s         | Mid 90’s
Custom Scalar Systems | Mid 80’s              | Mid 90’s         | Mid 2000’s
Commodity Cluster     | Mid 90’s              | Mid 2000’s       | Mid 2010’s ???
BG/L or BG/P          | Mid 2000’s            | Mid 2010’s ???   | Mid 2020’s ???

Power Consumption Data
• Have been planning this for years.
• Started in June 2008.
• Independent from the Green500, but we try to learn from each other.
• Collect power consumption for:
  – Linpack as workload
  – Including all essential parts of a system (processor, memory, …)
  – Excluding features related to the machine room (most disk, UPS, …)
  – Full system or part large enough to include all shared components (fans, power supplies, …)
• Analyze these data carefully!

Power Ranking and How Not to do it!

• To rank objects by “size” one needs extensive properties:
  – Weight or Volume
  – Rmax (TOP500)
• A ‘larger’ system should have a larger Rmax.
• The ratio of 2 extensive properties is an intensive one:
  – (weight/volume = density)
  – Performance / Power Consumption = Power_efficiency (see the worked example below)
• One cannot ‘rank’ objects with densities BY SIZE:
  – Density does not tell anything about the size of an object.
  – A piece of lead is not heavier or larger than one piece of wood.
• Linpack (sub-linear) / Power (linear) will always sort smaller systems before larger ones!

[Chart: Power efficiency [MFlops/Watt] (0–600) plotted against TOP500 rank (1–500).]
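As a concrete illustration of this intensive ratio, the sketch below derives MFlops/Watt from the Rmax and power columns of the TOP10 table earlier in the deck; the code itself is only illustrative, the input numbers are taken from that table:

```python
# Rmax [TFlop/s] and power [MW] copied from the TOP10 table earlier in the deck.
top10_subset = {
    "Roadrunner (No. 1)": (1105.0, 2.48),
    "Jaguar (No. 2)":     (1059.0, 6.95),
    "Jugene (No. 3)":     (825.5,  2.26),
    "BlueGene/L (No. 5)": (478.2,  2.32),
}

for name, (rmax_tflops, power_mw) in top10_subset.items():
    # Power efficiency = performance / power consumption, an intensive quantity.
    # 1 TFlop/s = 1e6 MFlop/s and 1 MW = 1e6 W, so the conversion factors cancel.
    mflops_per_watt = rmax_tflops / power_mw
    print(f"{name:20s} {mflops_per_watt:6.1f} MFlops/Watt")
```

Note that the resulting ratios say nothing about how large the systems are, which is exactly the point of this slide: Jaguar is the second-largest system on the list yet has the lowest ratio of these four.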

Power Ranking and How Not to do it!
• To rank objects by “size” one needs extensive properties:
  – Weight or Volume
  – Rmax (TOP500)
• A ‘larger’ system should have a larger Rmax.
• The ratio of 2 extensive properties is an intensive one:
  – (weight/volume = density)
  – Performance / Power Consumption = Power_efficiency
• One cannot ‘rank’ objects with densities BY SIZE:
  – Density does not tell anything about the size of an object.
  – A piece of lead is not heavier or larger than one piece of wood.
• Linpack (sub-linear) / Power (linear) will always sort smaller systems before larger ones!

Multi-Core and Many-Core
• Power consumption of chips and systems has increased tremendously, because of the ‘cheap’ exploitation of Moore’s Law.
  – The free lunch has ended.
  – The stall of clock frequencies forces increasing concurrency levels: Multi-Cores.
  – Optimal core sizes/power are smaller than current ‘rich cores’, which leads to Many-Cores.
• Many-Cores, more (10–100x) but smaller cores:
  – Intel Polaris – 80 cores,
  – Clearspeed CSX600 – 96 cores,
  – nVidia G80 – 128 cores, or
  – CISCO Metro – 188 cores

Absolute Power Levels

[Chart: Absolute power levels – power [kWatt] (0–7000) plotted against TOP500 rank (1–500).]

Power Efficiency related to Processors

Frequencies and Power Efficiency

Power Efficiencies of different Systems

Power Efficiency related to Interconnects

[Chart: Power efficiency by interconnect (scale 0–600): Infiniband, Proprietary, Myrinet, Gigabit Ethernet.]

Power Efficiencies of different Systems

[Chart: Power efficiency of different system types (scale 0–600): (8+1) core, Embedded, Quadcore, Dualcore.]

Power Efficiencies of different Systems

[Chart: Power efficiency of selected systems (scale 0–300): BladeCenter HS21, Low Power HT, SGI Altix ICE 8200, Cluster Platform 3000 BL 2x220; annotated “Low Power Efficient L5420”, “Linpack Efficient” and “Double Density Blade”.]