CS 594 Spring 2007 Lecture 4: Overview of High-Performance Computing

Jack Dongarra Computer Science Department University of Tennessee

1

Top 500 Computers

- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK benchmark (TPP performance): solve Ax = b, dense problem

Updated twice a year: at the SC'xy conference in the States in November, and at the meeting in Germany in June.

2

What is a Supercomputer?

X A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.

X Over the last 14 years the range of the Top500 has increased faster than Moore's Law:
 ¾ 1993: #1 = 59.7 GFlop/s, #500 = 422 MFlop/s
 ¾ 2007: #1 = 280 TFlop/s, #500 = 2.73 TFlop/s
X Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security (in particular cryptanalysis and simulating nuclear weapons), to name a few.

Architecture/Systems Continuum

Tightly Coupled (Custom):
X Custom processor with custom interconnect
 ¾ Cray X1
 ¾ NEC SX-8
 ¾ IBM Regatta
 ¾ IBM Blue Gene/L
X Best processor performance for codes that are not "cache friendly"
X Good communication performance
X Simpler programming model
X Most expensive

Hybrid:
X Commodity processor with custom interconnect
 ¾ SGI Altix
  » Intel Itanium 2
 ¾ Cray XT3
  » AMD Opteron
X Good communication performance
X Good scalability

Loosely Coupled (Commod):
X Commodity processor with commodity interconnect
 ¾ Clusters
  » Pentium, Itanium, Opteron, Alpha
  » GigE, Infiniband, Myrinet, Quadrics
 ¾ NEC TX7
 ¾ IBM eServer
 ¾ Dawning
X Best price/performance (for codes that work well with caches and are latency tolerant)
X More complex programming model

[Chart: share of Top500 systems that are Custom, Hybrid, and Commod, Jun-93 through Dec-04]

Architectures / Systems

[Chart: number of Top500 systems by architecture class (SIMD, Single Proc., SMP, Const., Cluster, MPP), 1993-2006]

5

Supercomputing Changes Over Time 500 Fastest Systems Over the Past 14 Years

[Chart: performance of the 500 fastest systems, 1993-2006, on a log scale from 100 Mflop/s to 1 Pflop/s. Lines: the SUM of the 500 fastest computers (3.54 PF/s in 2006); the fastest computer (59.7 GF/s Fujitsu 'NWT' in 1993, later Intel ASCI Red, IBM ASCI White at 1.167 TF/s, NEC Earth Simulator, and IBM BlueGene/L at 280.6 TF/s); the computer at position 500 (0.4 GF/s in 1993, 2.74 TF/s in 2006); and "My Laptop". The #500 system trails the #1 system by roughly 6-8 years.]

A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.

28th List: The TOP10

Rank  Manufacturer  Computer                             Rmax [TF/s]  Installation Site                      Country  #Proc    Year  Arch
 1    IBM           BlueGene/L, eServer Blue Gene          280.6      DOE/NNSA/LLNL                          USA      131,072  2005  Custom
 2    Sandia/Cray   Cray XT3                               101.4      NNSA/Sandia                            USA       26,544  2006  Hybrid
 3    IBM           BGW, eServer Blue Gene                  91.29     IBM Thomas Watson                      USA       40,960  2005  Custom
 4    IBM           ASC Purple, eServer pSeries p575        75.76     DOE/NNSA/LLNL                          USA       12,208  2005  Custom
 5    IBM           MareNostrum, JS21 Cluster, Myrinet      62.63     Barcelona Supercomputer Center         Spain     12,240  2006  Commod
 6    Dell          Thunderbird, PowerEdge 1850, IB         53.00     NNSA/Sandia                            USA        9,024  2005  Commod
 7    Bull          Tera-10, NovaScale 5160, Quadrics       52.84     CEA                                    France     9,968  2006  Commod
 8    SGI           Columbia, Altix, Infiniband             51.87     NASA Ames                              USA       10,160  2004  Hybrid
 9    NEC/Sun       Tsubame, Fire x4600, ClearSpeed, IB     47.38     GSIC / Tokyo Institute of Technology   Japan     11,088  2006  Commod
10    Cray          Jaguar, Cray XT3                        43.48     ORNL                                   USA       10,424  2006  Hybrid

IBM BlueGene/L #1

X 131,072 processors (64 racks, 64x32x32)
X Total of 18 BlueGene systems, all in the Top100
X 1.6 MWatts (about 1600 homes); 43,000 ops/s/person

Packaging hierarchy (peak, memory):
 ¾ Chip (2 processors): 2.8/5.6 GF/s, 4 MB (cache)
 ¾ Compute Card (2 chips, 2x1x1 = 4 processors): 5.6/11.2 GF/s, 1 GB DDR
 ¾ Node Board (32 chips, 4x4x2; 16 compute cards = 64 processors): 90/180 GF/s, 16 GB DDR
 ¾ Rack (32 node boards, 8x8x16 = 2048 processors): 2.9/5.7 TF/s, 0.5 TB DDR
 ¾ Full system (64 racks, 64x32x32 = 131,072 processors): 180/360 TF/s, 32 TB DDR

X The compute node ASICs include all networking and processor functionality. Each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores).
X "Fastest Computer": BG/L at 700 MHz, 131K processors, 64 racks. Peak: 367 Tflop/s. Linpack: 281 Tflop/s, 77% of peak (run of about 13K seconds, roughly 3.6 hours, with n = 1.8M).

Performance Projection

[Chart: projected Top500 performance, 1993-2015, extrapolating the SUM, N=1, and N=500 trend lines on a log scale from 100 Mflop/s to 1 Eflop/s; the annotations mark lags of roughly 6-8 and 8-10 years between the lines.]

Customer Segments / Performance

[Chart: share of Top500 performance by customer segment (Government, Classified, Vendor, Academic, Industry, Research), 0-100%, 1993-2006]

10

Processor Types

[Chart: number of Top500 systems by processor type (SIMD, Vector, other scalar, AMD, Sparc, MIPS, Intel, HP, Power, Alpha), 1993-2006]

11

Processors Used in Each of the 500 Systems

92% of systems use Intel (51%), AMD (22%), or IBM (19%) processors:
 ¾ Intel EM64T 22%
 ¾ Intel IA-32 22%
 ¾ Intel IA-64 7%
 ¾ AMD x86_64 22%
 ¾ IBM Power 19%
 ¾ HP PA-RISC 4%
 ¾ Sun Sparc 1%
 ¾ NEC 1%
 ¾ HP Alpha 1%
 ¾ Cray 1%

Interconnects / Systems

[Chart: number of Top500 systems by interconnect (Others, Cray Interconnect, SP Switch, Crossbar, Quadrics, Infiniband (78), Myrinet (79), Gigabit Ethernet (211), N/A), 1993-2006]

GigE + Infiniband + Myrinet = 74%

Processors per System - Nov 2006

[Histogram: number of systems (0-200) by processors per system, in bins 33-64, 65-128, 129-256, 257-512, 513-1024, 1025-2048, 2049-4096, 4k-8k, 8k-16k, 16k-32k, 32k-64k, 64k-128k]

KFlop/s per Capita (Flops/Pop), Based on the November 2005 Top500

[Bar chart, 0-6000 KFlop/s per capita by country, including the United States, New Zealand, Israel, the United Kingdom, Japan, Germany, the Netherlands, Sweden, Switzerland, Australia, Canada, France, Spain, Taiwan, Korea (South), Italy, Saudi Arabia, Mexico, Russia, China, Brazil, and India]

Hint: Peter Jackson had something to do with this: WETA Digital (Lord of the Rings). It has nothing to do with the 47.2 million sheep in NZ.

Environmental Burden of PC CPUs

Source: Cool Chips & Micro 32

Power Consumption of World's CPUs

Year    Power (MW)   # CPUs (millions)
1992        180             87
1994        392            128
1996        959            189
1998      2,349            279
2000      5,752            412
2002     14,083            607
2004     34,485            896
2006     87,439          1,321

17

Power is an Industry Wide Problem

X Google facilities
 ¾ leveraging hydroelectric power
  » old aluminum plants
 ¾ >500,000 servers worldwide
("Hiding in Plain Sight, Google Seeks More Power", by John Markoff, The New York Times, June 14, 2006)

New Google plant in The Dalles, Oregon (from the NYT, June 14, 2006)

And Now We Want Petascale …

High-Speed Train: 10 Megawatts.   Conventional Power Plant: 300 Megawatts.

X What is a conventional petascale machine? ¾ Many high-speed bullet trains … a significant start to a conventional power plant. ¾ “Hiding in Plain Sight, Google Seeks More Power,” The New York Times, June 14, 2006.

19

Top Three Reasons for “Eliminating” Global Climate Warming in the Machine Room

3. HPC Contributes to Global Climate Warming
 ¾ "I worry that we, as HPC experts in global climate modeling, are contributing to the very thing that we are trying to avoid: the generation of greenhouse gases."
2. Electrical Power Costs $$$
 ¾ Japanese Earth Simulator
  » Power & Cooling: 12 MW/year → $9.6 million/year?
 ¾ Lawrence Livermore National Laboratory
  » Power & Cooling of HPC: $14 million/year
  » Power-up of ASC Purple → "Panic" call from the local electrical company.
1. Reliability & Availability Impact Productivity
 ¾ California: State of Electrical Emergencies (July 24-25, 2006)
  » 50,538 MW: a load not expected to be reached until 2010!

Reliability & Availability of HPC

System          CPUs      Reliability & Availability
ASCI Q          8,192     MTBI: 6.5 hrs. 114 unplanned outages/month. HW outage sources: storage, CPU, memory.
ASCI White      8,192     MTBF: 5 hrs. (2001) and 40 hrs. (2003). HW outage sources: storage, CPU, 3rd-party HW.
NERSC Seaborg   6,656     MTBI: 14 days. MTTR: 3.3 hrs. SW is the main outage source. Availability: 98.74%.
PSC Lemieux     3,016     MTBI: 9.7 hrs. Availability: 98.33%.
Google         ~15,000    20 reboots/day; 2-3% of machines replaced/year. HW outage sources: storage, memory. Availability: ~100%.

MTBI: mean time between interrupts; MTBF: mean time between failures; MTTR: mean time to restore. Source: Daniel A. Reed, RENCI

Fuel Efficiency: GFlops/Watt

[Bar chart: GFlops/Watt, roughly 0.1 to 0.9, for each of the top 20 systems; the x-axis labels give each system and its processor clock rate]

Top 20 systems. Based on processor power rating only (3, >100, >800).

Top500 Conclusions

X Microprocessor-based systems have brought a major change in accessibility and affordability.
X MPPs continue to account for more than half of all installed high-performance computers worldwide.

23

With All the Hype on the PS3, We Became Interested

X The PlayStation 3's CPU is based on a "Cell" processor
X Each Cell contains a PowerPC processor and 8 SPEs (an SPE is a processing element: SPU + DMA engine)
 ¾ An SPE is a self-contained vector processor which acts independently from the others
  » 4-way SIMD floating point units capable of a total of 25.6 Gflop/s per SPE @ 3.2 GHz
 ¾ 204.8 Gflop/s peak!
 ¾ The catch is that this is for 32-bit floating point (single precision, SP)
 ¾ 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!!
  » Divide the SP peak by 14: a factor of 2 because of DP and 7 because of latency issues
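The peak figures above can be checked with a little arithmetic (a sketch; the assumption that each SIMD lane retires a fused multiply-add, i.e. 2 flops per cycle, is mine and is not stated on the slide):

  4 lanes x 2 flops x 3.2 GHz  = 25.6 Gflop/s per SPE
  8 SPEs x 25.6 Gflop/s        = 204.8 Gflop/s single-precision peak
  204.8 / 14                   ≈ 14.6 Gflop/s double precision, total for all 8 SPEs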

24

Increase Clock Rate & Transistor Density; Lower Voltage

We have seen increasing number of gates on a chip and increasing clock speed.

Heat is becoming an unmanageable problem: Intel processors now exceed 100 Watts.

We will not see the dramatic increases in clock speeds in the future.

However, the number of gates on a chip will continue to increase.

Intel will double the processing power on a per-watt basis.

Intel Prediction of Microprocessor Frequency (ca. 2001)

[Chart: clock frequency in MHz (log scale, 0.1 to 10,000) vs. year (1970-2010) for the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, and P6 processors; frequency doubles every 2 years]

Adapted from a presentation by S. Borkar, Intel

26

Intel Prediction of Microprocessor Power Consumption (ca. 2001)

[Chart: power in Watts (log scale, 0.1 to 100) vs. year (1971-2000) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, and Pentium processors]

Adapted from a presentation by S. Borkar, Intel

27

Moore's Law for Power (P ∝ V²f)

[Chart: chip maximum power in watts/cm² (log scale, 1 to 1000) vs. process generation (1.5μ down to 0.07μ, roughly 1985-2001): I486 at ~2 watts, Pentium at ~14 watts, Pentium II and Pentium III at ~35 watts each, later chips at ~75 and ~130 watts (Itanium); the curve has surpassed a heating plate (~30 watts) and is not too long from reaching a nuclear reactor]

Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, MICRO32 and Transmeta

[Chart: operations per second for serial code vs. additional operations per second available to concurrent code. One axis shows a single core scaling from 3 GHz to 6, 12, and 24 GHz: the free lunch for traditional software (it just runs twice as fast every 18 months with no change to the code). The other axis shows 3 GHz parts with 1, 2, 4, and 8 cores: no free lunch for traditional software (without highly concurrent software it won't get any faster), since the additional operations per second are available only if the code can take advantage of concurrency.]

What is Multicore?

X A single chip
X Multiple distinct processing engines
X Multiple, independent threads of control (i.e., program counters: MIMD)

30

Integration is Efficient

X Discrete chips: bandwidth ~2 GB/s, latency ~60 ns
X Multicore: bandwidth > 20 GB/s, latency < 3 ns

Parallelism and interconnect efficiency enable harnessing the power of n: n cores can yield an n-fold increase in performance.

Power Cost of Frequency

X Power ∝ Voltage² x Frequency (V²F)
X Frequency ∝ Voltage
X Therefore Power ∝ Frequency³
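A small worked consequence (a sketch based only on the proportionalities above): trading frequency for cores saves power at a constant work rate.

  Power ∝ V²F and V ∝ F  ⇒  Power ∝ F³
  One core at frequency F:        performance ∝ F,      power ∝ F³
  Two cores at frequency 0.75 F:  performance ∝ 1.5 F,  power ∝ 2 x (0.75 F)³ ≈ 0.84 F³

So two slower cores can deliver 1.5x the throughput for roughly 16% less power, which is the power argument behind multicore.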

32


Interconnect Options

34

Many Changes

X Many changes in our hardware over the past 30 years
 ¾ Superscalar, Vector, Distributed Memory, Shared Memory, Multicore, …

[Chart: Top500 Systems/Architectures, 1993-2006: Const., Cluster, MPP, SMP, SIMD, Single Proc.]

X Today’s memory hierarchy is much more complicated.

35

Distributed and Parallel Systems

[Diagram: a spectrum of systems ranging from distributed, heterogeneous systems (e.g., SETI@home, Grid-based computing, clusters) to massively parallel, homogeneous systems]

Distributed systems:
X Gather (unused) resources
X Steal cycles
X System SW manages resources
X System SW adds value
X 10%-20% overhead is OK
X Resources drive applications
X Time to completion is not critical
X Time-shared

Massively parallel systems:
X Bounded set of resources
X Apps grow to consume all cycles
X Application manages resources
X System SW gets in the way
X 5% overhead is maximum
X Apps drive purchase of equipment
X Real-time constraints
X Space-shared

Virtual Environments

[The slide shows a wall of raw floating-point values from a simulation output]

Do they make any sense?

37

38

The Power of Optimal Algorithms

X Advances in algorithmic efficiency rival advances in hardware architecture
X Consider Poisson's equation ∇²u = f on a cube of size N = n³

Year   Method        Reference                 Storage   Flops
1947   GE (banded)   Von Neumann & Goldstine   n^5       n^7
1950   Optimal SOR   Young                     n^3       n^4 log n
1971   CG            Reid                      n^3       n^3.5 log n
1984   Full MG       Brandt                    n^3       n^3

X If n = 64, this implies an overall reduction in flops of ~16 million

Algorithms and Moore's Law

X This advance took place over a span of about 36 years, or 24 doubling times for Moore's Law
X 2^24 ≈ 16 million ⇒ the same as the factor from the algorithms alone!

[Chart: relative speedup vs. year for Moore's Law and for the successive algorithms]

Different Architectures

X Parallel computing: single systems with many processors working on the same problem
X Distributed computing: many systems loosely coupled to work on related problems
X Grid computing: many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems

41

Types of Parallel Computers

X The simplest and most useful way to classify modern parallel computers is by their memory model: ¾ shared memory ¾ distributed memory

42

Shared vs. Distributed Memory

X Shared memory: single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)
 [Diagram: processors P connected by a BUS to a single Memory]
X Distributed memory: each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)
 [Diagram: processor-memory (P-M) pairs connected by a Network]

43

Shared Memory: UMA vs. NUMA

X Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors (SMPs). (Ex: Sun E10000)
X Non-uniform memory access (NUMA): time for memory access depends on the location of the data. Local access is faster than non-local access. Easier to scale than SMPs. (Ex: SGI Origin)

44

Distributed Memory: MPPs vs. Clusters

X Processor-memory nodes are connected by some type of interconnect network
 ¾ Massively Parallel Processor (MPP): tightly integrated, single system image
 ¾ Cluster: individual computers connected by software

[Diagram: CPU + MEM nodes connected by an interconnect network]

Processors, Memory, & Networks

X Both shared and distributed memory systems have:
 1. processors: now generally commodity processors
 2. memory: now generally commodity DRAM
 3. network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)

46

Interconnect-Related Terms

X Latency: How long does it take to start sending a "message"? Measured in microseconds. (Also in processors: How long does it take to output results of some operations, such as floating point add, divide etc., which are pipelined?) X Bandwidth: What data rate can be sustained once the message is started? Measured in Mbytes/sec.

47

Interconnect-Related Terms

Topology: the manner in which the nodes are connected. ¾ Best choice would be a fully connected network (every processor to every other). Unfeasible for cost and scaling reasons. ¾ Instead, processors are arranged in some variation of a grid, torus, or hypercube.

3-d hypercube 2-d mesh 2-d torus

48

Standard Uniprocessor Memory Hierarchy

X Intel Pentium 4 processor, 2 GHz (P7 Prescott, socket 478)
X On chip:
 ¾ 8 Kbytes of 4-way assoc. L1 instruction cache with 32-byte lines
 ¾ 8 Kbytes of 4-way assoc. L1 data cache with 32-byte lines
 ¾ 256 Kbytes of 8-way assoc. L2 cache with 32-byte lines
 ¾ 400 MB/s bus speed
 ¾ SSE2 provides a peak of 4 Gflop/s

[Diagram: Level-1 Cache, Level-2 Cache, Bus, System Memory]

X Each flop requires 3 words of data.
X At 4 Gflop/s that needs 12 GW/s of bandwidth, but the bus has only 0.5 GW/s.
X So if driven from memory, the processor runs 24 times off of the peak rate!
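The 24x figure follows from a simple balance calculation (a sketch using the slide's numbers, counting bandwidth in words per second):

  required bandwidth   = 4 Gflop/s x 3 words/flop = 12 GW/s
  available bandwidth  ≈ 0.5 GW/s
  slowdown             = 12 / 0.5 = 24x below peak when operands stream from memory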

Locality and Parallelism

[Diagram: conventional storage hierarchy: each Proc with its Cache, L2 Cache, L3 Cache, and Memory, connected by interconnects with potential for remote access]

X Large memories are slow, fast memories are small
X Storage hierarchies are large and fast on average
X Parallel processors, collectively, have large, fast caches ($)
 ¾ the slow accesses to "remote" data we call "communication"
X Algorithms should do most work on local data

51

Amdahl’s Law

Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below:

t_N = (f_p/N + f_s) t_1          (effect of multiple processors on run time)

S = 1 / (f_s + f_p/N)            (effect of multiple processors on speedup)

where f_s = serial fraction of code, f_p = parallel fraction of code = 1 - f_s, and N = number of processors

52

Amdahl's Law: Theoretical Maximum Speedup of Parallel Execution

X speedup = 1/(P/N + S)
 » P = parallel code fraction, S = serial code fraction, N = number of processors
 ¾ Example: image processing
  » 30 minutes of preparation (serial)
  » one minute to scan a region
  » 30 minutes of cleanup (serial)

X Speedup increases with the number of processors, but it is restricted by the serial portion (see the sketch below).
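A minimal C sketch of Amdahl's Law applied to the image-processing example; the number of regions (100) is a hypothetical value, since the slide does not give one.

/* Amdahl's Law for the image-processing example: 30 + 30 minutes of serial
   prep/cleanup plus one minute per region, with the scans run in parallel.
   The region count is an assumed value chosen for illustration. */
#include <stdio.h>

int main(void) {
    const double serial  = 30.0 + 30.0;     /* serial fraction: prep + cleanup (minutes) */
    const double regions = 100.0;           /* assumed number of 1-minute scans          */
    const double t1 = serial + regions;     /* time on one processor                     */

    for (int N = 1; N <= 1024; N *= 4) {
        double tN = serial + regions / N;   /* only the scans speed up                   */
        printf("N = %4d   time = %6.1f min   speedup = %.2f\n", N, tN, t1 / tN);
    }
    /* Speedup approaches t1/serial = 160/60 = 2.67 no matter how large N gets. */
    return 0;
}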

53

Illustration of Amdahl's Law

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.

[Chart: speedup vs. number of processors (0-250) for f_p = 1.000, 0.999, 0.990, and 0.900; "What's going on here?": the curves flatten quickly once f_p is even slightly below 1]

Amdahl's Law vs. Reality

Amdahl's Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.

[Chart: speedup vs. number of processors (0-250) for f_p = 0.99, comparing the Amdahl's Law prediction with reality; the measured curve falls increasingly below the prediction]

Overhead of Parallelism

X Given enough parallel work, this is the biggest barrier to getting desired speedup
X Parallelism overheads include:
 ¾ cost of starting a thread or process
 ¾ cost of communicating shared data
 ¾ cost of synchronizing
 ¾ extra (redundant) computation
X Each of these can be in the range of milliseconds (= millions of flops) on some systems
X Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work

56

Locality and Parallelism

[Diagram: conventional storage hierarchy: each Proc with its Cache, L2 Cache, L3 Cache, and Memory, connected by interconnects with potential for remote access]

X Large memories are slow, fast memories are small
X Storage hierarchies are large and fast on average
X Parallel processors, collectively, have large, fast caches ($)
 ¾ the slow accesses to "remote" data we call "communication"
X Algorithms should do most work on local data

Load Imbalance

X Load imbalance is the time that some processors in the system are idle due to
 ¾ insufficient parallelism (during that phase)
 ¾ unequal size tasks
X Examples of the latter
 ¾ adapting to "interesting parts of a domain"
 ¾ tree-structured computations
 ¾ fundamentally unstructured problems
X Algorithm needs to balance load

58

29 What is Ahead?

X Greater instruction level parallelism? X Bigger caches? X Multiple processors per chip? X Complete systems on a chip? (Portable Systems)

X High performance LAN, Interface, and Interconnect

59

Directions

X Move toward shared memory
 ¾ SMPs and Distributed Shared Memory
 ¾ Shared address space with a deep memory hierarchy
X Clustering of shared memory machines for scalability
X Efficiency of message passing and data parallel programming
 ¾ Helped by standards efforts such as MPI and HPF

60

Question:

X Suppose we want to compute, using four-digit decimal arithmetic:
 ¾ S = 1.000 + 1.000 x 10^4 - 1.000 x 10^4
 ¾ What's the answer?
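The same effect is easy to reproduce in binary floating point (a sketch, not the four-digit decimal arithmetic of the question): with finite precision, addition is not associative.

/* Rounding makes floating-point addition non-associative: adding 1 to a
   number that is too large to "feel" it loses the 1 entirely. */
#include <stdio.h>

int main(void) {
    float one = 1.0f;
    float big = 1.0e8f;                  /* 1 ulp at 1e8 is 8 in single precision */

    float left  = (one + big) - big;     /* the 1 is rounded away: prints 0 */
    float right = one + (big - big);     /* big - big is exact:     prints 1 */

    printf("(1 + 1e8) - 1e8 = %g\n", left);
    printf("1 + (1e8 - 1e8) = %g\n", right);
    return 0;
}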

61

Defining Floating Point Arithmetic

X Representable numbers
 ¾ Scientific notation: +/- d.d…d x r^exp
 ¾ sign bit +/-
 ¾ radix r (usually 2 or 10, sometimes 16)
 ¾ significand d.d…d (how many base-r digits d?)
 ¾ exponent exp (range?)
 ¾ others?
X Operations:
 ¾ arithmetic: +, -, x, /, …
  » how to round the result to fit in the format
 ¾ comparison (<, =, >)
 ¾ conversion between different formats
  » short to long FP numbers, FP to integer
 ¾ exception handling
  » what to do for 0/0, 2*largest_number, etc.
 ¾ binary/decimal conversion
  » for I/O, when the radix is not 10

31 IEEE Floating Point Arithmetic Standard 754 (1985) - Normalized Numbers

X Normalized Nonzero Representable Numbers: +-1.d…d x 2^exp
 ¾ Macheps = machine epsilon = 2^(-#significand bits) = relative error in each operation; the smallest number ε such that fl(1 + ε) > 1
 ¾ OV = overflow threshold = largest number
 ¾ UN = underflow threshold = smallest number

Format            #bits   #significand bits   macheps              #exponent bits   exponent range
Single              32        23+1            2^-24  (~10^-7)            8          2^-126 to 2^127    (~10^±38)
Double              64        52+1            2^-53  (~10^-16)          11          2^-1022 to 2^1023  (~10^±308)
Double Extended   >=80       >=64            <=2^-64 (~10^-19)        >=15          2^-16382 to 2^16383 (~10^±4932)
                  (80 bits on all Intel machines)

X +- Zero: +-, significand and exponent all zero
 ¾ Why bother with -0? (discussed later)
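For reference, the single- and double-precision parameters in the table can be queried from <float.h> (a sketch; note the C macros use the "gap between 1 and the next representable number" convention, which is twice the rounding-error macheps defined above).

/* Print machine epsilon and the overflow/underflow thresholds (OV, UN). */
#include <stdio.h>
#include <float.h>

int main(void) {
    printf("single: eps = %e  OV = %e  UN = %e\n",
           (double)FLT_EPSILON, (double)FLT_MAX, (double)FLT_MIN);
    printf("double: eps = %e  OV = %e  UN = %e\n",
           DBL_EPSILON, DBL_MAX, DBL_MIN);
    return 0;
}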

IEEE Floating Point Arithmetic Standard 754 - “Denorms”

X Denormalized Numbers: +-0.d…d x 2^min_exp
 ¾ sign bit, nonzero significand, minimum exponent
 ¾ fills in the gap between UN and 0
X Underflow Exception
 ¾ occurs when the exact nonzero result is less than the underflow threshold UN
 ¾ Ex: UN/3
 ¾ return a denorm, or zero

64

32 IEEE Floating Point Arithmetic Standard 754 - +- Infinity

X +- Infinity: sign bit, zero significand, maximum exponent
X Overflow Exception
 ¾ occurs when the exact finite result is too large to represent accurately
 ¾ Ex: 2*OV
 ¾ return +- infinity
X Divide-by-zero Exception
 ¾ return +- infinity = 1/+-0
 ¾ sign of zero important!
X Also return +- infinity for
 ¾ 3+infinity, 2*infinity, infinity*infinity
 ¾ the result is exact, not an exception!

65

IEEE Floating Point Arithmetic Standard 754 - NAN (Not A Number)

X NAN: sign bit, nonzero significand, maximum exponent
X Invalid Exception
 ¾ occurs when the exact result is not a well-defined real number
 ¾ 0/0
 ¾ sqrt(-1)
 ¾ infinity-infinity, infinity/infinity, 0*infinity
 ¾ NAN + 3
 ¾ NAN > 3?
 ¾ return a NAN in all these cases
X Two kinds of NANs
 ¾ Quiet: propagates without raising an exception
 ¾ Signaling: generates an exception when touched
  » good for detecting uninitialized data

66

33 Error Analysis

X Basic error formula
 ¾ fl(a op b) = (a op b)*(1 + d), where
  » op is one of +, -, *, /
  » |d| <= macheps
  » assuming no overflow, underflow, or divide by zero
X Example: adding 4 numbers
 ¾ fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3)
     = x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3)
       + x3*(1+d2)*(1+d3) + x4*(1+d3)
     = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4)
   where each |ei| <~ 3*macheps
 ¾ we get the exact sum of slightly changed summands xi*(1+ei)
 ¾ Backward Error Analysis: an algorithm is called numerically stable if it gives the exact result for slightly changed inputs
 ¾ Numerical stability is an algorithm design goal

Backward error

X Approximate solution is exact solution to modified problem. X How large a modification to original problem is required to give result actually obtained? X How much data error in initial input would be required to explain all the error in computed results? X Approximate solution is good if it is exact solution to “nearby” problem.

[Diagram: the approximate function f' maps the input x to f'(x), while the true f maps x to f(x); x' is the nearby input with f(x') = f'(x). The backward error is the distance from x to x'; the forward error is the distance from f(x) to f'(x).]

34 Sensitivity and Conditioning X Problem is insensitive or well conditioned if relative change in input causes commensurate relative change in solution. X Problem is sensitive or ill-conditioned, if relative change in solution can be much larger than that in input data.

Cond = |Relative change in solution| / |Relative change in input data| = |[f(x’) – f(x)]/f(x)| / |(x’ – x)/x|

X Problem is sensitive, or ill-conditioned, if cond >> 1.

X When the function f is evaluated for approximate input x' = x + h instead of the true input value x:
X Absolute error = f(x + h) - f(x) ≈ h f'(x)
X Relative error = [f(x + h) - f(x)] / f(x) ≈ h f'(x) / f(x)

Sensitivity: 2 Examples cos(π/2) and 2-d System of Equations

X Consider the problem of computing the cosine function for arguments near π/2.
X Let x ≈ π/2 and let h be a small perturbation to x. Then:
   absolute error = cos(x+h) - cos(x) ≈ -h sin(x) ≈ -h
   relative error ≈ -h tan(x) ≈ ∞
  (in general: absolute error = f(x+h) - f(x) ≈ h f'(x); relative error = [f(x+h) - f(x)]/f(x) ≈ h f'(x)/f(x))
X So a small change in x near π/2 causes a large relative change in cos(x), regardless of the method used.
X cos(1.57079) = 0.63267949 x 10^-5
X cos(1.57078) = 1.63267949 x 10^-5
X The relative change in output is a quarter million times greater than the relative change in input.
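A quick check of the quarter-million figure (a sketch; it simply recomputes the two cosine values above and forms the ratio of relative changes):

/* Condition estimate for cos(x) near pi/2, using the two arguments on the slide. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double x1 = 1.57079, x2 = 1.57078;
    double rel_out = (cos(x2) - cos(x1)) / cos(x1);   /* relative change in solution */
    double rel_in  = (x2 - x1) / x1;                  /* relative change in input    */
    printf("cos(%.5f) = %.8e\n", x1, cos(x1));
    printf("cos(%.5f) = %.8e\n", x2, cos(x2));
    printf("condition estimate = %.0f\n", fabs(rel_out / rel_in));   /* about 2.5e5 */
    return 0;
}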

Sensitivity: 2 Examples, cos(π/2) and a 2-d System of Equations

X Second example: a 2-d system of equations
   a*x1 + b*x2 = f
   c*x1 + d*x2 = g
(the cosine example from the previous slide is repeated alongside it: cos(1.57079) = 0.63267949 x 10^-5, cos(1.57078) = 1.63267949 x 10^-5; the relative change in output is a quarter million times greater than the relative change in input)

Example: Polynomial Evaluation Using Horner’s Rule

X Horner's rule to evaluate p = Σ_{k=0}^{n} c_k * x^k
 ¾ p = c_n; for k = n-1 down to 0: p = x*p + c_k
X Numerically stable
X Apply to (x-2)^9 = x^9 - 18*x^8 + … - 512, written in its expanded form
X Evaluated around x = 2

function HornerPoly(c, n, x)
begin
  p := c[n];
  for k := n-1 downto 0 do
    p := p*x + c[k];
  end { for }
  HornerPoly := p;
end { HornerPoly }

36 Example: polynomial evaluation (continued)

X (x-2)^9 = x^9 - 18*x^8 + … - 512
X We can compute error bounds using
 ¾ fl(a op b) = (a op b)*(1+d)
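A small C sketch of the experiment the slide describes: evaluating (x-2)^9 near x = 2 from its expanded coefficients with Horner's rule and comparing with the factored form. The expanded form loses almost all significant digits near the root.

/* Coefficients of (x-2)^9 = sum_{k=0}^{9} c[k] * x^k */
#include <stdio.h>
#include <math.h>

static const double c[10] = { -512.0, 2304.0, -4608.0, 5376.0, -4032.0,
                               2016.0,  -672.0,  144.0,   -18.0,    1.0 };

static double horner(double x) {
    double p = c[9];
    for (int k = 8; k >= 0; k--)
        p = p * x + c[k];                 /* p = p*x + c[k], as in HornerPoly above */
    return p;
}

int main(void) {
    for (double x = 1.99; x <= 2.011; x += 0.002)
        printf("x = %6.3f   expanded = % .3e   (x-2)^9 = % .3e\n",
               x, horner(x), pow(x - 2.0, 9));
    return 0;
}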

73

Exception Handling

X What happens when the "exact value" is not a real number, or is too small or too large to represent accurately?
X 5 exceptions:
 ¾ Overflow: exact result > OV, too large to represent
 ¾ Underflow: exact result nonzero and < UN, too small to represent
 ¾ Divide-by-zero: nonzero/0
 ¾ Invalid: 0/0, sqrt(-1), …
 ¾ Inexact: you made a rounding error (very common!)
X Possible responses
 ¾ Stop with an error message (unfriendly, not the default)
 ¾ Keep computing (the default, but how?)

74

37 Summary of Values Representable in IEEE FP

Value                         Sign   Exponent           Significand
+- Zero                       +-     0…0                0……0
Normalized nonzero numbers    +-     not 0 or all 1s    anything
Denormalized numbers          +-     0…0                nonzero
+- Infinity                   +-     1…1                0……0
NANs                          +-     1…1                nonzero

 ¾ Signaling and quiet NANs
 ¾ Many systems have only quiet

75

Assuming x and y are non-negative

a = max(x, y),   b = min(x, y)

z = a * sqrt(1 + (b/a)²)   if a > 0
z = 0                      if a = 0

76

38 This Week’s Assignment

77

Hazards of Parallel and Heterogeneous Computing

X What new bugs arise in parallel floating point programs?
X Ex 1: Nonrepeatability
 ¾ makes debugging hard!
X Ex 2: Different exception handling
 ¾ can cause programs to hang
X Ex 3: Different rounding (even on IEEE FP machines)
 ¾ can cause hanging, or wrong results with no warning
X See www.netlib.org/lapack/lawns/lawn112.ps
X IBM RS6K and Java

Types of Parallel Computers

X The simplest and most useful way to classify modern parallel computers is by their memory model: ¾ shared memory ¾ distributed memory

79

Standard Uniprocessor Memory Hierarchy

X Intel Pentium 4 processor, 2 GHz (P7 Prescott, socket 478)
X On chip:
 ¾ 8 Kbytes of 4-way assoc. L1 instruction cache with 32-byte lines
 ¾ 8 Kbytes of 4-way assoc. L1 data cache with 32-byte lines
 ¾ 256 Kbytes of 8-way assoc. L2 cache with 32-byte lines
 ¾ 400 MB/s bus speed
 ¾ SSE2 provides a peak of 4 Gflop/s

[Diagram: Level-1 Cache, Level-2 Cache, Bus, System Memory]

80

Shared Memory / Local Memory

X Usually think in terms of the hardware X What about a software model? X How about something that works like cache? X Logically shared memory

81

Parallel Programming Models

X Control
 ¾ how is parallelism created
 ¾ what orderings exist between operations
 ¾ how do different threads of control synchronize
X Naming
 ¾ what data is private vs. shared
 ¾ how logically shared data is accessed or communicated
X Set of operations
 ¾ what are the basic operations
 ¾ what operations are considered to be atomic
X Cost
 ¾ how do we account for the cost of each of the above

Trivial Example: compute the sum Σ_{i=0}^{n-1} f(A[i])

X Parallel decomposition:
 ¾ each evaluation and each partial sum is a task
X Assign n/p numbers to each of p procs
 ¾ each computes independent "private" results and a partial sum
 ¾ one (or all) collects the p partial sums and computes the global sum

=> Classes of data
X Logically shared
 ¾ the original n numbers, the global sum
X Logically private
 ¾ the individual function evaluations
 ¾ what about the individual partial sums?

Programming Model 1

X Shared Address Space
 ¾ program consists of a collection of threads of control,
 ¾ each with a set of private variables
  » e.g., local variables on the stack
 ¾ collectively with a set of shared variables
  » e.g., static variables, shared common blocks, global heap
 ¾ threads communicate implicitly by writing and reading shared variables
 ¾ threads coordinate explicitly by synchronization operations on shared variables
  » writing and reading flags
  » locks, semaphores
X Like concurrent programming on a uniprocessor

[Diagram: processors P … P with shared variables x, y, A and private variables i, res in each thread]

Model 1

X A shared memory machine
X Processors all connected to a large shared memory
X "Local" memory is not (usually) part of the hardware
 ¾ Sun, DEC, Intel "SMPs" (symmetric multiprocessors) in Millennium; SGI Origin
X Cost: much cheaper to access data in cache than in main memory

[Diagram: P1, P2, …, Pn, each with a cache ($), connected by a network to memory]

X Machine model 1a: A Shared Address Space Machine
 ¾ replace caches by local memories (in the abstract machine model)
 ¾ this affects the cost model: repeatedly accessed data should be copied
 ¾ Cray T3E

Shared Memory code for computing a sum

Thread 1                              Thread 2

[s = 0 initially]                     [s = 0 initially]
local_s1 = 0                          local_s2 = 0
for i = 0, n/2-1                      for i = n/2, n-1
  local_s1 = local_s1 + f(A[i])         local_s2 = local_s2 + f(A[i])
s = s + local_s1                      s = s + local_s2

What could go wrong?

86

43 Pitfall and solution via synchronization

° Pitfall in computing a global sum s = local_s1 + local_s2:

Thread 1 (initially s = 0)                 Thread 2 (initially s = 0)
load s       [from mem to reg]             load s       [from mem to reg; initially 0]
s = s+local_s1   [= local_s1, in reg]      s = s+local_s2   [= local_s2, in reg]
store s      [from reg to mem]             store s      [from reg to mem]
(time runs downward)

° Instructions from different threads can be interleaved arbitrarily
° What can the final result s stored in memory be?
° Race Condition
° Possible solution: Mutual Exclusion with Locks

Thread 1             Thread 2
lock                 lock
load s               load s
s = s+local_s1       s = s+local_s2
store s              store s
unlock               unlock

° Locks must be atomic (execute completely without interruption)
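A runnable C/pthreads sketch of the two-thread sum with the lock fix; the array contents, f(), and sizes are placeholders, not part of the slide.

/* Two-thread shared-memory sum with a mutex protecting the global update. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double A[N];
static double s = 0.0;                        /* logically shared global sum   */
static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

static double f(double x) { return x * x; }   /* stand-in for the real f()     */

struct range { int lo, hi; };

static void *partial_sum(void *arg) {
    struct range *r = arg;
    double local = 0.0;                       /* logically private partial sum */
    for (int i = r->lo; i < r->hi; i++)
        local += f(A[i]);
    pthread_mutex_lock(&s_lock);              /* without the lock: race on s   */
    s += local;
    pthread_mutex_unlock(&s_lock);
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1.0 / (i + 1);
    struct range r1 = { 0, N / 2 }, r2 = { N / 2, N };
    pthread_t t1, t2;
    pthread_create(&t1, NULL, partial_sum, &r1);
    pthread_create(&t2, NULL, partial_sum, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %f\n", s);
    return 0;
}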

Programming Model 2

X Message Passing
 ¾ program consists of a collection of named processes
  » thread of control plus local address space
  » local variables, static variables, common blocks, heap
 ¾ processes communicate by explicit data transfers
  » matching pair of send & receive by source and destination process
 ¾ coordination is implicit in every communication event
 ¾ logically shared data is partitioned over local processes
X Like distributed programming
° Program with standard libraries: MPI, PVM

[Diagram: processes P0 … Pn, each with its own A, i, res, s; "send P0,X" on one process matches "recv Pn,Y" on another]

Model 2

X A distributed memory machine
 ¾ Cray T3E, IBM SP2, Clusters
X Processors all connected to their own memory (and caches)
 ¾ cannot directly access another processor's memory
X Each "node" has a network interface (NI)
 ¾ all communication and synchronization done through this

[Diagram: nodes (P + NI + memory) connected by an interconnect]

Computing s = x(1)+x(2) on each processor

° First possible solution:

Processor 1 [xlocal = x(1)]          Processor 2 [xlocal = x(2)]
send xlocal, proc2                   receive xremote, proc1
receive xremote, proc2               send xlocal, proc1
s = xlocal + xremote                 s = xlocal + xremote

° Second possible solution: what could go wrong?

Processor 1 [xlocal = x(1)]          Processor 2 [xlocal = x(2)]
send xlocal, proc2                   send xlocal, proc1
receive xremote, proc2               receive xremote, proc1
s = xlocal + xremote                 s = xlocal + xremote

° What if send/receive act like the telephone system? The post office?
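A sketch of the exchange in MPI. MPI_Sendrecv pairs each send with its receive inside the library, so the code cannot deadlock regardless of whether sends behave like the telephone system (synchronous) or the post office (buffered).

/* Two-process exchange and sum, s = x(1) + x(2) on each processor. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double xlocal = (rank == 0) ? 1.0 : 2.0;   /* x(1) on proc 0, x(2) on proc 1 */
    double xremote;
    int other = 1 - rank;                      /* assumes exactly 2 processes    */

    MPI_Sendrecv(&xlocal, 1, MPI_DOUBLE, other, 0,
                 &xremote, 1, MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("proc %d: s = %f\n", rank, xlocal + xremote);
    MPI_Finalize();
    return 0;
}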

Programming Model 3

X Data Parallel
 ¾ single sequential thread of control consisting of parallel operations
 ¾ parallel operations applied to all (or a defined subset) of a data structure
 ¾ communication is implicit in parallel operators and "shifted" data structures
 ¾ elegant and easy to understand and reason about
 ¾ not all problems fit this model
X Like marching in a regiment
X Example: A = array of all data; fA = f(A); s = sum(fA)
° Think of Matlab

Model 3

X Vector Computing
 ¾ one instruction executed across all the data in a pipelined fashion
 ¾ parallel operations applied to all (or a defined subset) of a data structure
 ¾ communication is implicit in parallel operators and "shifted" data structures
 ¾ elegant and easy to understand and reason about
 ¾ not all problems fit this model
X Like marching in a regiment
X Example: A = array of all data; fA = f(A); s = sum(fA)
° Think of Matlab

Model 3

X A SIMD (Single Instruction Multiple Data) machine
X A large number of small processors
X A single "control processor" issues each instruction
 ¾ each processor executes the same instruction
 ¾ some processors may be turned off on any instruction

[Diagram: a control processor driving nodes (P + NI + memory) connected by an interconnect]

X Machines not popular (CM2), but the programming model is
 ¾ implemented by mapping n-fold parallelism to p processors
 ¾ mostly done in the compilers (HPF = High Performance Fortran)

Model 4

X Since small shared memory machines (SMPs) are the fastest commodity machines, why not build a larger machine by connecting many of them with a network?
X CLUMP = Cluster of SMPs
X Shared memory within one SMP, message passing outside
X Clusters, ASCI Red (Intel), …
X Programming model?
 ¾ Treat the machine as "flat", always using message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy)
 ¾ Expose two layers: shared memory (OpenMP) and message passing (MPI): higher performance, but ugly to program

Programming Model 5

X Bulk Synchronous Processing (BSP) - L. Valiant
X Used within the message passing or shared memory models as a programming convention
X Phases separated by global barriers
 ¾ Compute phases: all operate on local data (in distributed memory)
  » or read access to global data (in shared memory)
 ¾ Communication phases: all participate in rearrangement or reduction of global data
X Generally all doing the "same thing" in a phase
 ¾ all do f, but may all do different things within f
X Simplicity of data parallelism without its restrictions

Summary so far

X Historically, each parallel machine was unique, along with its programming model and programming language
X You had to throw away your software and start over with each new kind of machine - ugh
X Now we distinguish the programming model from the underlying machine, so we can write portably correct code that runs on many machines
 ¾ MPI is now the most portable option, but it can be tedious
X Writing portably fast code requires tuning for the architecture
 ¾ The algorithm design challenge is to make this process easy
 ¾ Example: picking a blocksize, not rewriting the whole algorithm

48