Jack Dongarra
University of Tennessee / Oak Ridge National Laboratory / University of Manchester
1/14/11

Top 10 systems (TOP500, November 2010):

Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | Mflops/Watt
1 | Nat. SuperComputer Center in Tianjin | NUDT YH Cluster, X5670 2.93 GHz 6C, NVIDIA GPU | China | 186,368 | 2.57 | 55 | 4.04 | 636
2 | DOE/OS Oak Ridge Nat Lab | Jaguar / Cray XT5, six-core 2.6 GHz | China† USA | 224,162 | 1.76 | 75 | 7.0 | 251
3 | Nat. Supercomputer Center in Shenzhen | Nebulae / Dawning TC3600 Blade, Intel X5650, Nvidia C2050 GPU | China | 120,640 | 1.27 | 43 | 2.58 | 493
4 | GSIC Center, Tokyo Institute of Technology | Tsubame 2.0, HP ProLiant SL390s G7, Xeon 6C X5670, Nvidia GPU | Japan | 73,278 | 1.19 | 52 | 1.40 | 850
5 | DOE/SC/LBNL/NERSC | Hopper, Cray XE6 12-core 2.1 GHz | USA | 153,408 | 1.054 | 82 | 2.91 | 362
6 | Commissariat a l'Energie Atomique (CEA) | Tera-100, Bull bullx super-node S6010/S6030 | France | 138,368 | 1.050 | 84 | 4.59 | 229
7 | DOE/NNSA Los Alamos Nat Lab | Roadrunner / IBM BladeCenter QS22/LS21 | USA | 122,400 | 1.04 | 76 | 2.35 | 446
8 | NSF/NICS/U of Tennessee | Kraken / Cray XT5, six-core 2.6 GHz | USA | 98,928 | 0.831 | 81 | 3.09 | 269
9 | Forschungszentrum Juelich (FZJ) | Jugene / IBM Blue Gene/P Solution | Germany | 294,912 | 0.825 | 82 | 2.26 | 365
10 | DOE/NNSA Los Alamos Nat Lab | Cielo, Cray XE6 8-core 2.4 GHz | USA | 107,152 | 0.817 | 79 | 2.95 | 277

Peak vs. "Linpack" performance:

Name | Peak Pflop/s | "Linpack" Pflop/s | Country | Vendor: architecture
Tianhe-1A | 4.70 | 2.57 | China | NUDT: Hybrid Intel/Nvidia/Self
Nebulae | 2.98 | 1.27 | China | Dawning: Hybrid Intel/Nvidia/IB
Jaguar | 2.33 | 1.76 | US | Cray: AMD/Self
Tsubame 2.0 | 2.29 | 1.19 | Japan | HP: Hybrid Intel/Nvidia/IB
RoadRunner | 1.38 | 1.04 | US | IBM: Hybrid AMD/Cell/IB
Hopper | 1.29 | 1.054 | US | Cray: AMD/Self
Tera-100 | 1.25 | 1.050 | France | Bull: Intel/IB
Mole-8.5 | 1.14 | 0.207 | China | CAS: Hybrid Intel/Nvidia/IB
Kraken | 1.02 | 0.831 | US | Cray: AMD/Self
Cielo | 1.02 | 0.817 | US | Cray: AMD/Self
JuGene | 1.00 | 0.825 | Germany | IBM: BG-P/Self
[Charts: TOP500 performance by region over time, shown as a four-step slide build adding US, EU, Japan, and China; log scale from 1 to 100,000.]
¨ China has 3 Pflops systems:
  NUDT, Tianhe-1A, located in Tianjin
    Dual-Intel 6-core + Nvidia Fermi w/ custom interconnect
    Budget 600M RMB: MOST 200M RMB, Tianjin Government 400M RMB
  CIT, Dawning 6000, Nebulae, located in Shenzhen
    Dual-Intel 6-core + Nvidia Fermi w/ QDR InfiniBand
    Budget 600M RMB: MOST 200M RMB, Shenzhen Government 400M RMB
  Mole-8.5 Cluster: 320 x 2 Intel QC Xeon E5520 2.26 GHz + 320 x 6 Nvidia Tesla C2050, QDR InfiniBand
¨ Fourth one planned for Shandong
¨ The interconnect on the Tianhe-1A is a proprietary fat-tree.
¨ The router and network interface chips were designed by NUDT.
¨ It has a bi-directional bandwidth of 160 Gb/s (double that of QDR InfiniBand), a per-hop latency of 1.57 microseconds, and an aggregate bandwidth of 61 Tb/s.
¨ At the MPI level, the bandwidth is 6.3 GB/s (one direction) / 9.3 GB/s (bi-directional), and the latency is 2.32 us.

US systems on the path to exascale:
¨ DOE funded: Titan at ORNL, based on a Cray design with accelerators, 20 Pflop/s, 2012
¨ DOE funded: Sequoia at Lawrence Livermore Nat. Lab, based on IBM's BG/Q, 20 Pflop/s, 2012
¨ DOE funded: BG/Q at Argonne National Lab, based on IBM's BG/Q, 10 Pflop/s, 2012
¨ NSF funded: Blue Waters at University of Illinois UC, based on IBM's Power 7 processor, 10 Pflop/s, 2012

[Chart: TOP500 performance development and projection, 1996-2020; SUM, N=1, and N=500 trend lines plus Gordon Bell winners, from 100 Mflop/s to 1 Eflop/s.]

¨ Town Hall Meetings, April-June 2007
¨ Scientific Grand Challenges Workshops, November 2008 - October 2009: Climate Science (11/08), High Energy Physics (12/08), Nuclear Physics (1/09), Fusion Energy (3/09), Nuclear Energy (5/09), Biology (8/09), Material Science and Chemistry (8/09), National Security (10/09) (with NNSA)
¨ Cross-cutting workshops: Architecture and Technology (12/09); Architecture, Applied Math and CS (2/10)
¨ Meetings with industry (8/09, 11/09)
¨ External Panels: ASCAC Exascale Charge (FACA); Trivelpiece Panel

MISSION IMPERATIVES: "The key finding of the Panel is that there are compelling needs for exascale computing capability to support the DOE's missions in energy, national security, fundamental sciences, and the environment. The DOE has the necessary assets to initiate a program that would accelerate the development of such capability to meet its own needs and by so doing benefit other national interests. Failure to initiate an exascale program could lead to a loss of U.S. competitiveness in several critical technologies." - Trivelpiece Panel Report, January 2010

DOE Progress toward Exascale
IESP October 18-19, 2010
Proposals processed in exascale-related topic areas:
  Applied Math: Uncertainty Quantification (90 proposals requesting ~$45M/year; 6 funded at $3M/yr)
  Computer Science: Advanced Architectures (28 proposals requesting ~$28M/year; 6 funded at $5M/yr)
  Computer Science: X-Stack (55 proposals requesting ~$40M/year; 11 funded at $8.5M/yr)
  Computational Partnerships: Co-Design (21 proposals requesting ~$160M/year; 5 funded at $?)
Exascale coordination meetings with other Federal Departments and Agencies
International Exascale Software Project: DOE & NSF funded; the community coming together to build a software roadmap for exascale
Formalizing partnership with the National Nuclear Security Administration (NNSA) within DOE
European Exascale Software Initiative (EESI)
18 months of funding; 640K € EC funding
Participants: EDF, GENCI, EPSRC, Juelich, BSC, NCF, ARTTIC, CINECA
Coordination of "disciplinary working groups" at the European level
• Four groups on "Application Grand Challenges"
  • (Transport, Energy), (Climate, Weather & Earth Sciences), (Chemistry & Physics), (Life Sciences)
• Four groups on "Enabling technologies for Petaflop/Exaflop computing"
  • (HW & vendors), (SW eco-systems), (Numerical libraries), (Scientific SW Engineering)
Synthesis, dissemination, and recommendations to the European Commission

G8 Countries and EC Related
G8 has a call out for an "Interdisciplinary Program on Application Software towards Exascale Computing for Global Scale Issues"
  10 million € over three years
  An initiative between research councils from Canada, France, Germany, Japan, Russia, the UK, and the USA
  78 preproposals submitted, 25 selected; expect to fund 6-10
  Full proposals were due August 25th; winners and losers notified last month
EC FP7: Exascale computing, software and simulation
  Announcement out September 28, 2010; proposals being written today
  25 million €
  2 or 3 integrated projects to be funded
www.exascale.org

Exascale (10^18 Flop/s) Systems: Two possible paths (swim lanes)
Light-weight processors (think BG/P)
  ~1 GHz processor (10^9)
  ~1 Kilo cores/socket (10^3)
  ~1 Mega sockets/system (10^6)
  Socket level: cores scale out for planar geometry
Hybrid system (think GPU-based)
  ~1 GHz processor (10^9)
  ~10 Kilo FPUs/socket (10^4)
  ~100 Kilo sockets/system (10^5)
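Both swim-lane budgets multiply out to the same exaflop target; a quick sanity check of the arithmetic (assuming, as the per-level counts above implicitly do, one flop per core/FPU per cycle):

```python
# Each path: clock (Hz) x parallel units per socket x sockets per system,
# under the assumption of one flop per unit per cycle.
light_weight = 1e9 * 1e3 * 1e6   # BG/P-style: many slow, simple cores
hybrid       = 1e9 * 1e4 * 1e5   # GPU-style: fewer sockets, fatter sockets

assert light_weight == hybrid == 1e18   # both land at 1 Eflop/s
```

The two paths trade the same 10^18 factor differently between on-socket parallelism and system-level socket count.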
  Node level: 3D packaging

• Steepness of the ascent from terascale to petascale to exascale
• Extreme parallelism and hybrid design
  • Preparing for million/billion-way parallelism
• Tightening memory/bandwidth bottleneck
  • Limits on power/clock speed; implications for multicore
  • Reducing communication will become much more intense
  • Memory per core changes; the byte-to-flop ratio will change
• Necessary fault tolerance
  • MTTF will drop
  • Checkpoint/restart has limitations
Software infrastructure does not exist today.

Commodity processor vs. accelerator (GPU):

          | Intel Xeon      | Nvidia C2050 "Fermi"
Cores     | 8 cores         | 448 "CUDA cores"
Clock     | 3 GHz           | 1.15 GHz
Ops/cycle | 8 * 4 ops/cycle | 448 ops/cycle
Peak (DP) | 96 Gflop/s      | 515 Gflop/s

Interconnect: PCI-x16 lane, 64 Gb/s (1 GW/s)

Different classes of chips - home, games/graphics, business, scientific - trending toward many floating-point cores.
+ 3D stacked memory

• Most likely a hybrid design
  • Think standard multicore chips plus accelerators (GPUs)
• Today, accelerators are attached; the next generation will be more integrated
• Intel's "Knights Corner" and "Knights Ferry" to come
  . 48 x86 cores
• AMD's Fusion in 2011 - 2013
  . Multicore with embedded ATI graphics
• Nvidia's Project Denver plans an integrated chip using the ARM architecture in 2013
NVIDIA GT200 (previous generation):
• NVIDIA-speak
  . 240 CUDA cores (ALUs)
• Generic speak
  . 30 processing cores, with 8 CUDA cores (SIMD functional units) per core
  . 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/cycle)
  . Best case theoretically: 240 mul-adds + 240 muls per cycle
• 1.296 GHz clock
  • 30 * 8 * (2 + 1) * 1.296 ≈ 933 Gflop/s peak
  . Best case in reality: 240 mul-adds per clock
    • Just able to do the mul-add, so 2/3 of peak, or ~622 Gflop/s
  . All this is single precision
    • Double precision is 78 Gflop/s peak (a factor of 8 below achievable SP; exploit mixed precision)
  . 141 GB/s bus, 1 GB memory
  . 4 GB/s via PCIe (we see: T = 11 us + Bytes / 3.3 GB/s)
  . In SP, SGEMM performance is 375 Gflop/s

NVIDIA Fermi (C2050):
• NVIDIA-speak
  . 448 CUDA cores (ALUs)
• Generic speak
  . 14 processing cores, with 32 CUDA cores (SIMD functional units) per core
  . 1 mul-add (2 flops) per ALU (2 flops/cycle)
  . Best case theoretically: 448 mul-adds per cycle
• 1.15 GHz clock
  • 14 * 32 * 2 * 1.15 = 1.03 Tflop/s peak
  . All this is single precision
    • Double precision is half this rate, 515 Gflop/s
  . In SP, SGEMM performance is 640 Gflop/s
  . In DP, DGEMM performance is 300 Gflop/s
  . Interface: PCI-x16
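The peak rates quoted above are just cores x SIMD units x flops/cycle x clock; a small sketch reproducing the arithmetic (clock rates as assumed above):

```python
def peak_gflops(cores, units_per_core, flops_per_cycle, clock_ghz):
    """Theoretical peak = cores x SIMD units/core x flops/cycle x clock (GHz)."""
    return cores * units_per_core * flops_per_cycle * clock_ghz

gt200_sp = peak_gflops(30, 8, 3, 1.296)   # mul-add + mul: 3 flops/cycle
fermi_sp = peak_gflops(14, 32, 2, 1.15)   # fused mul-add: 2 flops/cycle
fermi_dp = fermi_sp / 2                   # Fermi DP runs at half the SP rate

assert round(gt200_sp) == 933             # ~933 Gflop/s SP peak
assert round(fermi_dp, 1) == 515.2        # ~515 Gflop/s DP peak
```

The same formula applied to the Xeon column (8 cores x 4 ops/cycle x 3 GHz) gives its 96 Gflop/s DP figure.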
[Chart: Linpack efficiency (0-120%) across the TOP500, rank 1-500; shown as a three-step slide build.]
18 GPU-based systems

• Must rethink the design of our software
  . Another disruptive technology, similar to what happened with cluster computing and message passing
  . Rethink and rewrite the applications, algorithms, and software
• Numerical libraries, for example, will change
  . Both LAPACK and ScaLAPACK will undergo major changes to accommodate this

1. Effective use of many-core and hybrid architectures
  . Break fork-join parallelism
  . Dynamic data-driven execution
  . Block data layout
2. Exploiting mixed precision in the algorithms
  . Single precision is 2x faster than double precision
  . With GP-GPUs, 10x
  . Power saving issues
3. Self-adapting / auto-tuning of software
  . Too hard to do by hand
4. Fault-tolerant algorithms
  . With 1,000,000s of cores, things will fail
5. Communication-reducing algorithms
  . For dense computations, from O(n log p) to O(log p) communications
  . Asynchronous iterations
  . k-step GMRES: compute (x, Ax, A^2 x, ..., A^k x) in bulk

[Figure: pipelined execution across Step 1, Step 2, Step 3, Step 4, ...]
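Item 2 above is typically realized as mixed-precision iterative refinement: solve in fast single precision, then recover double-precision accuracy with cheap residual corrections. A minimal NumPy sketch (the function name is illustrative; a real implementation would factor once in single precision and reuse the LU factors rather than re-solving):

```python
import numpy as np

def mixed_precision_solve(A, b, iters=3):
    A32 = A.astype(np.float32)                  # single-precision copy (fast path)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                           # residual in double precision
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d                                  # correct the single-precision solve
    return x

rng = np.random.default_rng(0)
n = 300
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
assert np.linalg.norm(A @ x - b) <= 1e-9 * np.linalg.norm(b)
```

The expensive O(n^3) work runs at single-precision speed, while each refinement step costs only an O(n^2) residual and triangular solve; convergence requires the matrix to be reasonably well conditioned.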
• Fork-join, bulk synchronous processing
• Break into smaller tasks and remove dependencies
  * LU does block pairwise pivoting
• Objectives
  . High utilization of each core
  . Scaling to large numbers of cores
  . Shared or distributed memory
• Methodology
  . Dynamic DAG scheduling
  . Explicit parallelism
  . Implicit communication
  . Fine granularity / block data layout
• Arbitrary DAG with dynamic scheduling (vs. fork-join parallelism)
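The tile decomposition can be illustrated with a serial sketch of tiled Cholesky in NumPy (the function name is illustrative, not PLASMA's API): each tile operation below, named after its LAPACK/BLAS kernel, is one node of the DAG that a dynamic runtime would dispatch to a core as soon as its dependencies resolve.

```python
import numpy as np

def tiled_cholesky(A, nb):
    """Serial tiled Cholesky; each tile update below is one DAG task."""
    A = A.copy()
    n = A.shape[0]
    assert n % nb == 0
    t = n // nb
    T = lambda i, j: A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]   # view of tile (i, j)
    for k in range(t):
        # POTRF task: factor the diagonal tile
        T(k, k)[:] = np.linalg.cholesky(T(k, k))
        for i in range(k + 1, t):
            # TRSM task: T(i,k) <- T(i,k) * L(k,k)^{-T}
            T(i, k)[:] = np.linalg.solve(T(k, k), T(i, k).T).T
        for i in range(k + 1, t):
            for j in range(k + 1, i + 1):
                # GEMM/SYRK task: trailing update, T(i,j) -= T(i,k) T(j,k)^T
                T(i, j)[:] -= T(i, k) @ T(j, k).T
    return np.tril(A)

rng = np.random.default_rng(0)
M = rng.standard_normal((128, 128))
S = M @ M.T + 128 * np.eye(128)        # symmetric positive definite input
L = tiled_cholesky(S, nb=16)
assert np.allclose(L @ L.T, S)
```

Only the tasks touching the same tiles are ordered; tasks on disjoint tiles within and across iterations of k can run concurrently, which is exactly the parallelism a DAG scheduler exploits.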
[Figure: execution traces over time - fork-join parallelism vs. DAG-scheduled parallelism.]

TAMU, UTK, Virginia Tech; part of the NSF MuMI project.
Dual-core 1.8 GHz AMD Opteron, 4 cores in total; matrix size 16000 x 16000.

[Figure: power (Watts) over time (seconds) for system, CPU, memory, disk, and motherboard; LAPACK vs. tile Cholesky.]

LAPACK used 38.72 kJ; the DAG-based tile Cholesky used 27.18 kJ.

[Chart: performance vs. matrix size.]

• Goal: algorithms that communicate as little as possible
• Jim Demmel and company have been working on algorithms that obtain a provable minimum of communication.
• Direct methods (BLAS, LU, QR, SVD, other decompositions)
  • Communication lower bounds for all these problems
  • Algorithms that attain them (all dense linear algebra, some sparse)
  • Mostly not in LAPACK or ScaLAPACK (yet)
• Iterative methods - Krylov subspace methods for Ax=b, Ax=λx
  • Communication lower bounds, and algorithms that attain them (depending on sparsity structure)
  • Not in any libraries (yet)
• For QR factorization they can show:
[Figure, shown as a five-step slide build: Tile CAQR on a tall-skinny matrix partitioned into four domains D0-D3; each domain runs an independent Domain_Tile_QR producing R0-R3, and the R factors are merged pairwise up a binary tree to form the final R.]

A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610-1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.

Communication Reducing QR Factorization
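The CAQR/TSQR reduction sketched above fits in a few lines of NumPy (a serial sketch with a single reduction level; a parallel TSQR runs the block QRs concurrently and merges the R factors up a tree, with one message per merge):

```python
import numpy as np

def tsqr_r(A, nblocks=4):
    """R factor of a tall-skinny A via one TSQR reduction level:
    independent block QRs, then one QR of the stacked small R factors."""
    blocks = np.array_split(A, nblocks, axis=0)       # the "domains" D0..D3
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]  # independent Domain_Tile_QR
    return np.linalg.qr(np.vstack(Rs), mode='r')      # merge step

A = np.random.default_rng(1).standard_normal((4096, 32))
R = tsqr_r(A)
R_ref = np.linalg.qr(A, mode='r')
# R is unique up to row signs, so compare absolute values
assert np.allclose(np.abs(R), np.abs(R_ref))
```

Each block QR touches only local data, so the factorization needs O(log p) messages instead of one per column, which is the communication saving the section describes.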
Quad-socket, quad-core machine: Intel Xeon EM64T E7340 at 2.39 GHz; theoretical peak is 153.2 Gflop/s with 16 cores. Matrix size 51200 x 3200.

Experimental Results
• On two cluster machines and a Cray XT5 system
  • Cluster 1: 2 cores per node (Grig at UTK)
  • Cluster 2: 8 cores per node (Newton at UTK)
  • Cray XT5: 12 cores per node (Jaguar at ORNL)
• Compared with the vendors' ScaLAPACK library
• Takes as input a tall and skinny matrix
• Strong scalability experiment
• Weak scalability experiment
• Crossover point for general-size matrices
Strong Scalability on Jaguar
• Fixed-size input for an increasing number of cores
• Each node has 2 sockets, with a 6-core 2.6 GHz AMD Opteron per socket

[Chart: GFLOPS vs. number of cores (1-384); Tile CA-QR vs. ScaLAPACK.]
Weak Scalability on Jaguar
• Increase the input size as the number of cores increases
• Each node has 2 sockets, with a 6-core AMD Opteron per socket
Applying Tile CAQR to General-Size Matrices
• Each node has 2 sockets, with a 6-core AMD Opteron per socket
• Using 16 nodes on Jaguar (i.e., 192 cores)
[Figure: CAQR reduction across processes P0-P7, merging local R factors into the final R; idle processes: P0 and P1.]

Figure: Crossover point on matrices with 512 rows
GMRES speedups on 8-core Clovertown
(Chart legend: LAPACK - serial; MGS - threaded modified Gram-Schmidt.)

Communication-avoiding iterative methods
• Iterative solvers: the dominant cost of many apps (up to 80+% of runtime).
• Exascale challenges for iterative solvers:
  • Collectives, synchronization.
  • Memory latency/BW.
  • Not viable on exascale systems in present forms.
• Communication-avoiding (s-step) iterative solvers:
  • Idea: perform s steps in bulk (s = 5 or more):
    • s times fewer synchronizations.
    • s times fewer data transfers: better latency/BW.
  • Problem: numerical accuracy of the orthogonalization.
• TSQR implementation:
  • 2-level parallelism (inter- and intra-node).
  • Memory hierarchy optimizations.
  • Flexible node-level scheduling via Intel Threading Building Blocks.
  • Generic scalar data type: supports mixed and extended precision.
• TSQR capability:
  • Critical for exascale solvers.
  • Part of the Trilinos scalable multicore capabilities.
  • Helps all iterative solvers in Trilinos (available to external libraries, too).
  • Part of the Trilinos 10.6 release, Sep 2010.
• Staffing: Mark Hoemmen (lead, post-doc, UC-Berkeley), M. Heroux.

Match algorithmic requirements to the architectural strengths of the hybrid components:
• Multicore: small tasks/tiles
• Accelerator: large data-parallel tasks
Algorithms as DAGs (small tasks/tiles for multicore) lead to the current hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs).
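The s-step idea above hinges on a matrix-powers kernel: compute the Krylov basis (x, Ax, ..., A^s x) in one pass, then orthogonalize the whole block at once (TSQR's job in parallel) instead of vector by vector. A minimal NumPy sketch (it uses the numerically fragile monomial basis; production s-step solvers use better-conditioned bases, which is the accuracy problem noted above):

```python
import numpy as np

def matrix_powers(A, x, s):
    """Krylov basis [x, Ax, ..., A^s x]; in an s-step solver this replaces
    s separate SpMV-then-orthogonalize rounds and their s synchronizations."""
    V = np.empty((x.size, s + 1))
    V[:, 0] = x
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) / 10   # scaled so powers stay bounded
x = rng.standard_normal(50)
V = matrix_powers(A, x, s=5)
Q, _ = np.linalg.qr(V)                   # one block orthogonalization
assert np.allclose(Q.T @ Q, np.eye(6))
assert np.allclose(V[:, 3], A @ A @ A @ x)
```

For sparse A, the kernel can be blocked so that each processor computes its piece of all s vectors with a single halo exchange, which is where the communication saving comes from.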