Jack Dongarra

University of Tennessee, Oak Ridge National Laboratory, University of Manchester

1/14/11

Top 10 systems on the TOP500 list:

Rank  Site                                         Computer                                                        Country  Cores    Rmax [Pflop/s]  % of Peak  Power [MW]  Mflops/Watt
1     Nat. Supercomputer Center in Tianjin         Tianhe-1A: NUDT YH Cluster, X5670 2.93 GHz 6C, NVIDIA GPU       China    186,368  2.57            55         4.04        636
2     DOE / OS, Oak Ridge Nat Lab                  Jaguar: Cray XT5, six-core 2.6 GHz                              USA      224,162  1.76            75         7.0         251
3     Nat. Supercomputer Center in Shenzhen        Nebulae: Dawning TC3600 Blade, Intel X5650, NVIDIA C2050 GPU    China    120,640  1.27            43         2.58        493
4     GSIC Center, Tokyo Institute of Technology   Tsubame 2.0: HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU   Japan    73,278   1.19            52         1.40        850
5     DOE / SC / LBNL / NERSC                      Hopper: Cray XE6, 12-core 2.1 GHz                               USA      153,408  1.054           82         2.91        362
6     Commissariat a l'Energie Atomique (CEA)      Tera-100: Bull bullx super-node S6010/S6030                     France   138,368  1.050           84         4.59        229
7     DOE / NNSA, Los Alamos Nat Lab               Roadrunner: IBM BladeCenter QS22/LS21                           USA      122,400  1.04            76         2.35        446
8     NSF / NICS / U of Tennessee                  Kraken: Cray XT5, six-core 2.6 GHz                              USA      98,928   0.831           81         3.09        269
9     Forschungszentrum Juelich (FZJ)              Jugene: IBM Blue Gene/P Solution                                Germany  294,912  0.825           82         2.26        365
10    DOE / NNSA, Los Alamos Nat Lab               Cielo: Cray XE6, 8-core 2.4 GHz                                 USA      107,152  0.817           79         2.95        277

Name         Peak [Pflop/s]  "Linpack" [Pflop/s]  Country   Vendor / Architecture
Tianhe-1A    4.70            2.57                 China     NUDT: Hybrid Intel/Nvidia/Self
Nebulae      2.98            1.27                 China     Dawning: Hybrid Intel/Nvidia/IB
Jaguar       2.33            1.76                 US        Cray: AMD/Self
Tsubame 2.0  2.29            1.19                 Japan     HP: Hybrid Intel/Nvidia/IB
RoadRunner   1.38            1.04                 US        IBM: Hybrid AMD/Cell/IB
Hopper       1.29            1.054                US        Cray: AMD/Self
Tera-100     1.25            1.050                France    Bull: Intel/IB
Mole-8.5     1.14            0.207                China     CAS: Hybrid Intel/Nvidia/IB
Kraken       1.02            0.831                US        Cray: AMD/Self
Cielo        1.02            0.817                US        Cray: AMD/Self
JuGene       1.00            0.825                Germany   IBM: BG-P/Self

[Charts: projected performance development over time (log scale, 1 to 100,000), shown first for US systems and then adding the EU, Japan, and China]

• China has 3 Pflop/s-class systems:
  . NUDT Tianhe-1A, located in Tianjin: dual Intel 6-core + Nvidia Fermi with a custom interconnect. Budget 600M RMB (MOST 200M RMB, Tianjin Government 400M RMB).
  . CIT Dawning 6000 "Nebulae", located in Shenzhen: dual Intel 6-core + Nvidia Fermi with QDR InfiniBand. Budget 600M RMB (MOST 200M RMB, Shenzhen Government 400M RMB).
  . Mole-8.5 Cluster: 320 x 2 Intel quad-core Xeon E5520 2.26 GHz + 320 x 6 C2050, QDR InfiniBand.
• A fourth system is planned for Shandong.

• The interconnect on Tianhe-1A is a proprietary fat-tree.

• The router and network interface chips were designed by NUDT.

• It has a bidirectional bandwidth of 160 Gb/s (double that of QDR InfiniBand), a node-hop latency of 1.57 microseconds, and an aggregate bandwidth of 61 Tb/s.

• At the MPI level, the bandwidth is 6.3 GB/s (one direction) / 9.3 GB/s (bidirectional), and the latency is 2.32 us.

• DOE funded: Titan at ORNL, based on a Cray design with accelerators, 20 Pflop/s, 2012
• DOE funded: Sequoia at Lawrence Livermore Nat. Lab, based on IBM's BG/Q, 20 Pflop/s, 2012
• DOE funded: BG/Q at Argonne National Lab, based on IBM's BG/Q, 10 Pflop/s, 2012
• NSF funded: Blue Waters at the University of Illinois Urbana-Champaign, based on IBM's Power 7 processor, 10 Pflop/s, 2012

[Chart: TOP500 performance development, 1996-2020, on a log scale from 100 Mflop/s to 1 Eflop/s, showing the Sum, N=1, N=500, and Gordon Bell winners trend lines]

• Town Hall Meetings, April-June 2007
• Scientific Grand Challenges Workshops, November 2008 - October 2009
  . Climate Science (11/08), High Energy Physics (12/08), Nuclear Physics (1/09), Fusion Energy (3/09), Nuclear Energy (5/09), Biology (8/09), Material Science and Chemistry (8/09), National Security (10/09, with NNSA)
• Cross-cutting workshops
  . Architecture and Technology (12/09)
  . Architecture, Applied Math and CS (2/10)
• Meetings with industry (8/09, 11/09)
• External panels
  . ASCAC Exascale Charge (FACA)
  . Trivelpiece Panel

MISSION IMPERATIVES
• "The key finding of the Panel is that there are compelling needs for exascale computing capability to support the DOE's missions in energy, national security, fundamental sciences, and the environment. The DOE has the necessary assets to initiate a program that would accelerate the development of such capability to meet its own needs and by so doing benefit other national interests. Failure to initiate an exascale program could lead to a loss of U.S. competitiveness in several critical technologies." (Trivelpiece Panel Report, January 2010)

DOE Progress toward Exascale

IESP October 18-19, 2010

• Proposals processed in exascale-related topic areas:
  . Applied Math: Uncertainty Quantification (90 proposals requesting ~$45M/year; 6 funded at $3M/yr)
  . Computer Science: Advanced Architectures (28 proposals requesting ~$28M/year; 6 funded at $5M/yr)
  . Computer Science: X-Stack (55 proposals requesting ~$40M/year; 11 funded at $8.5M/yr)
  . Computational Partnerships: Co-Design (21 proposals requesting ~$160M/year; 5 funded at $?)
• Exascale coordination meetings with other Federal departments and agencies
• International Exascale Software Project (DOE & NSF funded)
  . Community effort to build a software roadmap for exascale
• Formalizing a partnership with the National Nuclear Security Administration (NNSA) within DOE

European Exascale Software Initiative (EESI)

• 18 months of funding; 640K € of EC funding
• Participants: EDF, GENCI, EPSRC, Juelich, BSC, NCF, ARTTIC, CINECA
• Coordination of "disciplinary working groups" at the European level
  . Four groups on "Application Grand Challenges": (Transport, Energy), (Climate, Weather & Earth Sciences), (Chemistry & Physics), (Life Sciences)
  . Four groups on "Enabling technologies for Petaflop/Exaflop computing": (HW & vendors), (SW eco-systems), (Numerical libraries), (Scientific SW Engineering)

• Synthesis, dissemination and recommendation to the European Commission

G8 Countries and EC Related

• The G8 has a call out for an "Interdisciplinary Program on Application Software towards Exascale Computing for Global Scale Issues"
  . 10 million € over three years
  . An initiative between research councils from Canada, France, Germany, Japan, Russia, the UK, and the USA
  . 78 preproposals submitted, 25 selected, expect to fund 6-10
  . Full proposals due August 25th; winners and losers notified last month
• EC FP7: Exascale computing, software and simulation
  . Announcement out September 28, 2010; proposals being written today
  . 25 million €
  . 2 or 3 integrated projects to be funded

www.exascale.org

Exascale (10^18 Flop/s) Systems: Two possible paths (swim lanes)

• Light-weight processors (think BG/P)
  . ~1 GHz processor (10^9)
  . ~1 Kilo cores/socket (10^3)
  . ~1 Mega sockets/system (10^6)
  . Socket level: cores scale out for planar geometry
• Hybrid systems (think GPU based)
  . ~1 GHz processor (10^9)
  . ~10 Kilo FPUs/socket (10^4)
  . ~100 Kilo sockets/system (10^5)
  . Node level: 3D packaging
Both paths multiply out to 10^18 flop/s (see the sketch after the list below).

• Steepness of the ascent from terascale to petascale to exascale
• Extreme parallelism and hybrid design
• Preparing for million/billion-way parallelism
• Tightening memory/bandwidth bottleneck
• Limits on power/clock speed and their implications for multicore
• Reducing communication will become much more important
• Memory per core changes; the byte-to-flop ratio will change
• Necessary fault tolerance
• MTTF will drop
• Checkpoint/restart has limitations
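A quick back-of-envelope check that both swim lanes above reach 10^18 flop/s (a minimal sketch, assuming one flop per core/FPU per cycle; not from the slides):

```python
# Exascale swim-lane arithmetic: clock rate x units per socket x sockets per system.
clock_hz = 1e9                           # ~1 GHz

light_weight = clock_hz * 1e3 * 1e6      # ~1K cores/socket x ~1M sockets/system
hybrid       = clock_hz * 1e4 * 1e5      # ~10K FPUs/socket x ~100K sockets/system

print(light_weight, hybrid)              # 1e+18 1e+18 -> 1 Eflop/s on each path
```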

Software infrastructure does not exist today

Commodity versus accelerator (GPU):

                 Commodity: Intel Xeon    Accelerator (GPU): NVIDIA C2050 "Fermi"
  Cores          8 cores                  448 "CUDA cores"
  Clock          3 GHz                    1.15 GHz
  Ops/cycle      8 * 4 ops/cycle          448 ops/cycle
  Peak (DP)      96 Gflop/s               515 Gflop/s

  Interconnect: PCI-X 16 lane, 64 Gb/s (1 GW/s)

Different classes of chips (Home, Games / Graphics, Business, Scientific) with many floating-point cores

+ 3D stacked memory

• Most likely a hybrid design: think standard multicore chips plus accelerators (GPUs)
• Today accelerators are attached; the next generation will be more integrated
• Intel's "Knights Ferry" and "Knights Corner" to come
  . 48 cores
• AMD's Fusion in 2011-2013
  . Multicore with embedded (ATI) graphics
• Nvidia's Project Denver plans an integrated chip using the ARM architecture in 2013

A 240-CUDA-core GPU (previous generation):
• NVIDIA-speak
  . 240 CUDA cores (ALUs)
• Generic speak
  . 30 processing cores, with 8 CUDA cores (SIMD functional units) per core
  . 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/cycle)
  . Best case theoretically: 240 mul-adds + 240 muls per cycle
    • 1.3 GHz clock
    • 30 * 8 * (2 + 1) * 1.3 ≈ 933 Gflop/s peak
  . Best case in reality: 240 mul-adds per clock
    • Just able to do the mul-add, so 2/3 of that, or 624 Gflop/s
  . All of this is single precision
    • Double precision is 78 Gflop/s peak (a factor of 8 from SP; exploit mixed precision)
  . 141 GB/s bus, 1 GB memory
  . 4 GB/s via PCIe (we see: T = 11 us + Bytes / 3.3 GB/s)
  . In SP, SGEMM performance is 375 Gflop/s

The Fermi C2050:
• NVIDIA-speak
  . 448 CUDA cores (ALUs)
• Generic speak
  . 14 processing cores, with 32 CUDA cores (SIMD functional units) per core
  . 1 mul-add (2 flops) per ALU (2 flops/cycle)
  . Best case theoretically: 448 mul-adds per cycle
    • 1.15 GHz clock
    • 14 * 32 * 2 * 1.15 = 1.03 Tflop/s peak
  . All of this is single precision
    • Double precision is half this rate, 515 Gflop/s
  . In SP, SGEMM performance is 640 Gflop/s
  . In DP, DGEMM performance is 300 Gflop/s
  . Interface: PCI-x16
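The peak figures above follow from simple arithmetic; a small check in Python that just re-multiplies the numbers quoted on the slides:

```python
# Peak rate = processing cores x functional units per core x flops/cycle x clock (GHz).
def peak_gflops(num_cores, units_per_core, flops_per_cycle, ghz):
    return num_cores * units_per_core * flops_per_cycle * ghz

# 240-CUDA-core GPU: 30 cores x 8 units, mul-add + mul (3 flops/cycle), ~1.3 GHz
print(peak_gflops(30, 8, 3, 1.30))   # ~936 Gflop/s SP, matching the ~933 Gflop/s quoted
print(peak_gflops(30, 8, 2, 1.30))   # ~624 Gflop/s SP counting only the mul-add

# Fermi C2050: 14 cores x 32 units, 1 fused mul-add (2 flops/cycle), 1.15 GHz
print(peak_gflops(14, 32, 2, 1.15))  # ~1030 Gflop/s SP; DP is half, ~515 Gflop/s

# Intel Xeon from the earlier comparison: 8 cores x 4 ops/cycle x 3 GHz
print(peak_gflops(8, 1, 4, 3.0))     # 96 Gflop/s DP
```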

[Charts: Linpack efficiency (%) versus TOP500 rank (0-500), noting the 18 GPU-based systems]

• Must rethink the design of our software
  . Another disruptive technology, similar to what happened with cluster computing and message passing
  . Rethink and rewrite the applications, algorithms, and software
• Numerical libraries, for example, will change
  . Both LAPACK and ScaLAPACK will undergo major changes to accommodate this

1. Effective use of many-core and hybrid architectures
  . Break fork-join parallelism
  . Dynamic data-driven execution
  . Block data layout
2. Exploiting mixed precision in the algorithms
  . Single precision is 2x faster than double precision
  . With GP-GPUs, 10x
  . Power saving issues
3. Self-adapting / auto-tuning of software
  . Too hard to do by hand
4. Fault-tolerant algorithms
  . With millions of cores, things will fail
5. Communication-reducing algorithms

  . For dense computations, reduce the number of communications from O(n log p) to O(log p)
  . Asynchronous iterations
  . k-step GMRES: compute ( x, Ax, A^2 x, ..., A^k x ) in one pass
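To make the k-step idea concrete, here is a minimal sketch of the "matrix powers" computation it relies on (illustrative numpy/scipy code with a hypothetical tridiagonal test matrix; production s-step solvers use better-conditioned bases and a communication-avoiding SpMV):

```python
# Compute the Krylov block [x, Ax, A^2 x, ..., A^k x] in one sweep, so k SpMVs
# share one communication/synchronization phase instead of one per step.
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    V = np.empty((x.size, k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]      # monomial basis; Newton/Chebyshev in practice
    return V

A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(1000, 1000), format="csr")
x = np.ones(1000)
V = matrix_powers(A, x, k=5)           # one phase's worth of basis vectors
print(V.shape)                         # (1000, 6)
```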

• Fork-join, bulk synchronous processing
• Break into smaller tasks and remove dependencies

* LU does block pairwise pivoting
• Objectives
  . High utilization of each core
  . Scaling to a large number of cores
  . Shared or distributed memory
• Methodology
  . Dynamic DAG scheduling
  . Explicit parallelism
  . Implicit communication
  . Fine granularity / block data layout
• Arbitrary DAG with dynamic scheduling
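As a concrete, deliberately simplified illustration of the fine-granularity, block-data-layout formulation described above, here is a serial numpy sketch of a tile Cholesky factorization. Each kernel call is one task; a dynamic DAG-scheduling runtime would launch tasks as soon as their input tiles are ready instead of following this loop order. This is a sketch, not the PLASMA code:

```python
import numpy as np

def tile_cholesky(T):
    """In-place tile Cholesky (lower) of an SPD matrix stored as a p x p list of tiles."""
    p = len(T)
    for k in range(p):
        T[k][k] = np.linalg.cholesky(T[k][k])                  # POTRF task
        for i in range(k + 1, p):
            T[i][k] = np.linalg.solve(T[k][k], T[i][k].T).T    # TRSM task
        for i in range(k + 1, p):
            T[i][i] -= T[i][k] @ T[i][k].T                     # SYRK task
            for j in range(k + 1, i):
                T[i][j] -= T[i][k] @ T[j][k].T                 # GEMM task

# Small check against the unblocked factorization.
rng = np.random.default_rng(0)
n, nb = 8, 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
p = n // nb
T = [[A[i*nb:(i+1)*nb, j*nb:(j+1)*nb].copy() for j in range(p)] for i in range(p)]
tile_cholesky(T)
L = np.zeros_like(A)
for i in range(p):
    for j in range(i + 1):
        L[i*nb:(i+1)*nb, j*nb:(j+1)*nb] = T[i][j]
print(np.allclose(L, np.linalg.cholesky(A)))                   # True
```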

[Figure: execution traces over time comparing fork-join parallelism with DAG-scheduled parallelism]

TAMU, UTK, Virginia Tech; part of the NSF MuMI project.
Dual-core 1.8 GHz AMD Opterons, 4 cores in total; the matrix size is 16000 x 16000.

[Charts: measured system, CPU, memory, disk, and motherboard power (Watts) versus time (seconds) for the LAPACK Cholesky and for the tile (DAG-based) Cholesky]

LAPACK used 38.72 kJ; the tile (DAG-based) Cholesky used 27.18 kJ.
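The kJ totals are the measured power traces integrated over the run time; a minimal sketch of that computation (with made-up sample values, not the measured data):

```python
import numpy as np

t = np.arange(0.0, 200.0, 1.0)                   # sample times in seconds (hypothetical)
power_watts = np.where(t < 120, 200.0, 150.0)    # hypothetical power trace
energy_kj = np.trapz(power_watts, t) / 1e3       # integrate P dt, convert J -> kJ
print(energy_kj)
```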

• Goal: algorithms that communicate as little as possible
• Jim Demmel and company have been working on algorithms that obtain provably minimum communication.
  . Direct methods (BLAS, LU, QR, SVD, other decompositions)
    • Communication lower bounds for all these problems
    • Algorithms that attain them (all dense linear algebra, some sparse)
    • Mostly not in LAPACK or ScaLAPACK (yet)
  . Iterative methods: Krylov subspace methods for Ax = b and Ax = λx
    • Communication lower bounds, and algorithms that attain them (depending on sparsity structure)
    • Not in any libraries (yet)
• For QR factorization they can show:

[Figure: communication-reducing (tile CAQR) factorization of a tall-skinny matrix; each domain D0-D3 is factored independently by Domain_Tile_QR into a local factor R0-R3, and the local R factors are then merged into the final R]

A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610-1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.

Communication Reducing QR Factorization
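A minimal numpy sketch of the domain-then-merge idea pictured above: factor independent block rows of a tall-skinny matrix, stack the local R factors, and merge them with one more QR. (Illustrative only; the library versions use a reduction tree and also form or apply Q.)

```python
import numpy as np

def tsqr_r(A, num_domains):
    """R factor of a tall-skinny A via per-domain QR plus a flat merge of the Rs."""
    blocks = np.array_split(A, num_domains, axis=0)
    local_Rs = [np.linalg.qr(B, mode="r") for B in blocks]   # Domain_Tile_QR -> R0..R3
    return np.linalg.qr(np.vstack(local_Rs), mode="r")       # merge step -> R

A = np.random.default_rng(0).standard_normal((12000, 64))
R_tsqr = tsqr_r(A, num_domains=4)
R_ref = np.linalg.qr(A, mode="r")
print(np.allclose(np.abs(R_tsqr), np.abs(R_ref)))            # equal up to row signs
```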

Quad-socket, quad-core machine Intel Xeon EMT64 E7340 at 2.39 GHz. Theoretical peak is 153.2 Gflop/s with 16 cores.

Matrix size 51200 by 3200

Experimental Results

• On two cluster machines and a Cray XT5 system
  . Cluster 1: 2 cores per node (Grig at UTK)
  . Cluster 2: 8 cores per node (Newton at UTK)
  . Cray XT5: 12 cores per node (Jaguar at ORNL)
• In comparison with the vendors' ScaLAPACK library
• Takes as input a tall and skinny matrix
• Strong scalability experiment
• Weak scalability experiment
• Crossover point for general-size matrices

Strong Scalability on Jaguar

• Fixed-size input for an increasing number of cores
• Each node has 2 sockets, with a 6-core 2.6 GHz AMD Opteron per socket

[Chart: Gflop/s versus number of cores (1 to 384) for Tile CA-QR and ScaLAPACK]

Weak Scalability on Jaguar

• Increase the input size as the number of cores increases
• Each node has 2 sockets, with a 6-core AMD Opteron per socket

Applying Tile CAQR to General-Size Matrices

• Each node has 2 sockets, with a 6-core AMD Opteron per socket
• Using 16 nodes on Jaguar (i.e., 192 cores)

[Figure: tile CAQR reduction across processes P0-P7, each domain producing a local R factor that is merged into the final R; idle processes: P0 and P1]

Figure: crossover point on matrices with 512 rows

GMRES speedups on 8-core Clovertown

[Chart legend: LAPACK - serial; MGS - threaded modified Gram-Schmidt]

Communication-avoiding iterative methods
• Iterative solvers are the dominant cost of many apps (up to 80+% of runtime).
• Exascale challenges for iterative solvers:
  . Collectives, synchronization.
  . Memory latency/BW.
  . Not viable on exascale systems in present forms.
• Communication-avoiding (s-step) iterative solvers:
  . Idea: perform s steps in bulk (s = 5 or more):
    • s times fewer synchronizations.
    • s times fewer data transfers: better latency/BW.
  . Problem: numerical accuracy of the orthogonalization.
• TSQR implementation:
  . 2-level parallelism (inter- and intra-node).
  . Memory hierarchy optimizations.
  . Flexible node-level scheduling via Intel Threading Building Blocks.
  . Generic scalar data type: supports mixed and extended precision.
• TSQR capability:
  . Critical for exascale solvers.
  . Part of the Trilinos scalable multicore capabilities.
  . Helps all iterative solvers in Trilinos (available to external libraries, too).
  . Staffing: Mark Hoemmen (lead, post-doc, UC-Berkeley), M. Heroux.
  . Part of the Trilinos 10.6 release, Sep 2010.

Match algorithmic requirements to the architectural strengths of the hybrid components:
• Multicore: small tasks/tiles
• Accelerator: large data-parallel tasks
From algorithms as DAGs (small tasks/tiles for multicore) to current hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs):

• e.g. split the computation into tasks; define a critical path that "clears" the way for the other large data-parallel tasks; properly schedule the task execution (a sketch of this CPU/GPU split follows below)
• Design algorithms with a well-defined "search space" to facilitate auto-tuning
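A minimal sketch of that CPU/GPU split for a one-sided factorization (blocked Cholesky here): the small panel tasks on the critical path stay on the multicore host, while the large data-parallel trailing-matrix update is the part that would be offloaded to the accelerator. Plain numpy is used throughout, and gpu_gemm is a hypothetical stand-in for a cuBLAS/MAGMA call, not a real API:

```python
import numpy as np

def gpu_gemm(C, A, B):
    # stand-in for the accelerator's large data-parallel GEMM update (C -= A @ B)
    C -= A @ B

def hybrid_cholesky(A, nb=64):
    """Blocked Cholesky (lower); panel on the 'CPU', trailing update on the 'GPU'."""
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # CPU: small, latency-sensitive panel tasks on the critical path
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            A[e:, k:e] = np.linalg.solve(A[k:e, k:e], A[e:, k:e].T).T
            # "GPU": large trailing-matrix update, overlappable with the next panel
            gpu_gemm(A[e:, e:], A[e:, k:e], A[e:, k:e].T)
    return np.tril(A)

rng = np.random.default_rng(0)
M = rng.standard_normal((512, 512))
A = M @ M.T + 512 * np.eye(512)
print(np.allclose(hybrid_cholesky(A.copy()), np.linalg.cholesky(A)))  # True
```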

Parallel performance of the hybrid SPOTRF (4 Opteron 1.8 GHz CPUs and 4 GPU Tesla C1060 1.44 GHz):

[Chart: Gflop/s versus matrix size (0 to 25000) for 1 CPU-1 GPU, 2 CPUs-2 GPUs, 3 CPUs-3 GPUs, and 4 CPUs-4 GPUs]

Performance of tile LU factorization in single precision arithmetic

GPU: Fermi C2050 (14 processors @ 1.14 GHz); CPU: dual-socket hexa-core Intel Nehalem @ 2.67 GHz
Tile size: 2,496; blocking size: 96

[Chart: Gflop/s versus matrix order (0 to 25000) for 1, 2, and 3 GPUs]

Exploiting Mixed Precision Computations

• Single precision is faster than double precision because:
  . Higher parallelism within floating point units
    • 4 ops/cycle (usually) instead of 2 ops/cycle
  . Reduced data motion
    • 32-bit data instead of 64-bit data
  . Higher locality in cache
    • More data items in cache
• Exploit 32-bit floating point as much as possible
  . Especially for the bulk of the computation
• Correct or update the solution with selective use of 64-bit floating point to provide a refined result
• Intuitively:
  . Compute a 32-bit result,
  . Calculate a correction to the 32-bit result using selected higher precision, and
  . Perform the update of the 32-bit result with the correction using higher precision.

• Iterative refinement for dense systems, Ax = b, can work this way:

  L U = lu(A)                      SINGLE   O(n^3)
  x = L\(U\b)                      SINGLE   O(n^2)
  r = b - Ax                       DOUBLE   O(n^2)
  WHILE || r || not small enough
      z = L\(U\r)                  SINGLE   O(n^2)
      x = x + z                    DOUBLE   O(n)
      r = b - Ax                   DOUBLE   O(n^2)
  END

  . Wilkinson, Moler, Stewart, & Higham provide error bounds for single precision results when using double precision refinement.
  . It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
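A runnable sketch of this refinement loop using numpy/scipy (illustrative only; PLASMA/MAGMA use their own tuned LU kernels, and the tolerance and iteration limit here are arbitrary choices):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Solve Ax = b: O(n^3) LU in float32, O(n^2) refinement in float64."""
    lu, piv = lu_factor(A.astype(np.float32))                     # SINGLE, O(n^3)
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                             # DOUBLE, O(n^2)
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = lu_solve((lu, piv), r.astype(np.float32))             # SINGLE, O(n^2)
        x = x + z                                                 # DOUBLE, O(n)
    return x

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)                   # well-conditioned test
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))              # ~double-precision level
```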

• Requires extra storage; the total is 1.5 times normal
• The O(n^3) work is done in lower precision
• The O(n^2) work is done in high precision
• Problems if the matrix is ill-conditioned in single precision (condition number around O(10^8))

Setup: two dual-core 1.8 GHz AMD Opteron processors; theoretical peak 14.4 Gflop/s per node; DGEMM using 4 threads: 12.94 Gflop/s; PLASMA 2.3.1 with GotoBLAS2.
Experiments: PLASMA LU solver in double precision and in mixed precision. N = 8400, using 4 cores.

                                                         PLASMA DP   PLASMA Mixed Precision
  Time to solution (s)                                   39.5        22.8
  Gflop/s                                                10.01       17.37
  Accuracy  ||Ax - b|| / ((||A|| ||x|| + ||b||) N eps)   2.0E-02     1.3E-01
  Iterations                                             -           7
  System energy (kJ)                                     10852.8     6314.8

[Charts: LU solver performance (Gflop/s) versus matrix size (960 to 13120) in single precision, double precision, and mixed precision]

Tesla C2050, 448 CUDA cores (14 multiprocessors x 32) @ 1.15 GHz, 3 GB memory, connected through PCIe to a quad-core Intel @ 2.5 GHz.

• Writing high performance software is hard
• Ideal: get a high fraction of peak performance from one algorithm
• Reality: the best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler, ...
  . The best choice can depend on knowing a lot of applied mathematics and computer science
  . It changes with each new hardware or compiler release
• Automatic performance tuning
  . Use machine time in place of human time for tuning
  . Search over possible implementations
  . Use performance models to restrict the search space
  . Past successes: ATLAS, FFTW, Spiral, Open MPI
• Many parameters in the code need to be optimized
• Software adaptivity is the key for applications to effectively use available resources whose complexity is increasing exponentially

[Diagram: the ATLAS empirical tuning loop - "Detect Hardware Parameters" (L1 size, NR, MulAdd, L* latency) feeds the "ATLAS Search Engine" (MMSearch), which drives the "ATLAS MM Code Generator" (MMCase) over the parameters NB, MU, NU, KU, xFetch, MulAdd, and latency; the generated MiniMMM source is compiled, executed, and measured (MFLOPS), and the measurements feed back into the search]

The best algorithm implementation can depend strongly on the problem, computer architecture, compiler, ...

There are two main approaches:
  . Model-driven optimization [analytical models for various parameters; heavily used in the compiler community; may not give optimal results]
  . Empirical optimization [generate a large number of code versions and run them on a given platform to determine the best performing one; effectiveness depends on the chosen parameters to optimize and the search heuristics used]
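A toy version of the empirical approach: time a blocked matrix multiply for a handful of candidate block sizes on the machine at hand and keep the fastest. (Real autotuners such as ATLAS search many more parameters and generate code; the sizes below are arbitrary choices for the sketch.)

```python
import time
import numpy as np

def blocked_matmul(A, B, nb):
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, nb):
        for k in range(0, n, nb):
            for j in range(0, n, nb):
                C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
    return C

def tune_block_size(n=768, candidates=(32, 64, 96, 128, 192, 256)):
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    timings = {}
    for nb in candidates:                       # empirical search over one parameter
        t0 = time.perf_counter()
        blocked_matmul(A, B, nb)
        timings[nb] = time.perf_counter() - t0
    return min(timings, key=timings.get), timings

best_nb, timings = tune_block_size()
print("best block size:", best_nb)
```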

A natural approach is to combine them in a hybrid approach [first a model-driven phase to limit the search space for a second, empirical phase]. [Another aspect is adaptivity: to treat cases where tuning cannot be restricted to optimizations at design, installation, or compile time.]

• Major challenges are ahead for extreme computing
  . Parallelism
  . Hybrid
  . Fault tolerance
  . Power
  . ... and many others not discussed here
• We will need completely new approaches and technologies to reach the exascale level
• This opens up many new opportunities for applied mathematicians and computer scientists
• For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
• This strategy needs to be rebalanced: barriers to progress are increasingly on the software side.
• Moreover, the return on investment is more favorable to software.
  . Hardware has a half-life measured in years, while software has a half-life measured in decades.
• The high performance ecosystem is out of balance
  . Hardware, OS, compilers, software, algorithms, applications
• No Moore's Law for software, algorithms, and applications

Published in the January 2011 issue of The International Journal of High Performance Computing Applications


"We can only see a short distance ahead, but we can see plenty there that needs to be done."
  - Alan Turing (1912-1954)

• www.exascale.org
• Looking for people to help with:
  . Manycore/GPU linear algebra libraries
  . Performance evaluation
  . Distributed memory software

• Contact: Jack Dongarra, [email protected]
