An Overview of HPC and the Changing Rules at Exascale

Jack Dongarra

University of Tennessee, Oak Ridge National Laboratory, University of Manchester

Outline

• Overview of High Performance Computing
• Look at some of the adjustments that are needed with Extreme Computing

State of Supercomputing Today
• Pflop/s (> 10^15 Flop/s) computing is fully established, with 95 systems.
• Three technology architecture possibilities, or "swim lanes," are thriving:
  • Commodity (e.g. Intel)
  • Commodity + accelerator (e.g. GPUs) (93 systems)
  • Special-purpose lightweight cores (e.g. ShenWei, ARM, Intel's Knights Landing)
• Interest in supercomputing is now worldwide and growing in many new markets (around 50% of TOP500 computers are used in industry).
• Exascale (10^18 Flop/s) projects exist in many countries and regions.
• Intel processors have the largest share, 91%, followed by AMD at 3%.

The TOP500
H. Meuer, H. Simon, E. Strohmaier, & JD - Listing of the 500 most powerful computers in the world.
- Yardstick: Rmax from LINPACK MPP

- Ax=b, dense problem (TPP performance); a timed dense solve is sketched below
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June
- All data available from www.top500.org
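The Rmax yardstick comes from timing exactly this kind of dense solve. As a rough illustration of how such a number is produced (a minimal sketch, not the actual HPL code; it assumes a LAPACKE installation such as MKL or Netlib LAPACK, and n is just an illustrative size):

    /* Minimal sketch (not the actual HPL code): time a dense Ax=b solve and
     * report Gflop/s using the 2/3*n^3 + 2*n^2 flop count that the
     * LINPACK/HPL benchmark uses when reporting Rmax.                      */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <lapacke.h>

    int main(void)
    {
        const int n = 4096;                       /* illustrative problem size */
        double     *A    = malloc((size_t)n * n * sizeof *A);
        double     *b    = malloc((size_t)n * sizeof *b);
        lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);

        srand(1);                                 /* random matrix for the demo */
        for (size_t i = 0; i < (size_t)n * n; i++) A[i] = rand() / (double)RAND_MAX;
        for (int i = 0; i < n; i++)                b[i] = rand() / (double)RAND_MAX;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        LAPACKE_dgesv(LAPACK_COL_MAJOR, n, 1, A, n, ipiv, b, n);   /* LU + solve */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs  = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        double flops = 2.0 / 3.0 * n * (double)n * n + 2.0 * (double)n * n;
        printf("n = %d: %.2f s, %.1f Gflop/s\n", n, secs, flops / secs / 1e9);

        free(A); free(b); free(ipiv);
        return 0;
    }

HPL itself distributes the LU factorization across the whole machine with MPI; the 2/3 n^3 + 2 n^2 operation count is the convention used to report Rmax.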

Performance Development of HPC over the Last 24 Years from the Top500

[Chart: TOP500 performance development, 1994–2016, on a log scale from 100 Mflop/s to 1 Eflop/s. Current values: Sum = 567 PFlop/s, N=1 = 93 PFlop/s, N=500 = 286 TFlop/s; the corresponding starting values were 1.17 TFlop/s, 59.7 GFlop/s, and 400 MFlop/s. Reference points: "My Laptop" at 70 Gflop/s, "My iPhone & iPad" at 4 Gflop/s. Roughly 6–8 years separate the N=1 and N=500 curves.]

PERFORMANCE DEVELOPMENT
[Chart: the Sum, N=1, N=10, and N=100 trend lines extrapolated from 1994 to 2020, marking when Tflop/s and Pflop/s were achieved and asking when Eflop/s will be achieved.]

June 2016: The TOP 10 Systems

Rank | Site | Computer | Country | Cores | Rmax [Pflop/s] | % of Peak | Power [MW] | GFlops/Watt
1 | National Super Computer Center in Wuxi | TaihuLight, SW26010 (260C) + Custom | China | 10,649,600 | 93.0 | 74 | 15.4 | 6.04
2 | National Super Computer Center in Guangzhou | Tianhe-2, NUDT, Xeon (12C) + Intel Xeon Phi (57C) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1.91
3 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14C) + Custom | USA | 560,640 | 17.6 | 65 | 8.21 | 2.14
4 | DOE / NNSA, Livermore Nat Lab | Sequoia, BlueGene/Q (16C) + Custom | USA | 1,572,864 | 17.2 | 85 | 7.89 | 2.18
5 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8C) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 0.827
6 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16C) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2.07
7 | DOE / NNSA, Los Alamos & Sandia | Trinity, Cray XC40, Xeon (16C) + Custom | USA | 301,056 | 8.10 | 80 | 4.23 | 1.92
8 | Swiss CSCS | Piz Daint, Cray XC30, Xeon (8C) + Nvidia Kepler (14C) + Custom | Switzerland | 115,984 | 6.27 | 81 | 2.33 | 2.69
9 | HLRS Stuttgart | Hazel Hen, Cray XC40, Xeon (12C) + Custom | Germany | 185,088 | 5.64 | 76 | 3.62 | 1.56
10 | KAUST | Shaheen II, Cray XC40, Xeon (16C) + Custom | Saudi Arabia | 196,608 | 5.54 | 77 | 2.83 | 1.96
500 | Internet company | Inspur, Intel (8C) + Nvidia | China | 5,440 | 0.286 | 71 | |

Countries Share

China has 1/3 of the systems, while the number of systems in the US has fallen to the lowest point since the TOP500 list was created.
[Chart: countries' share by number of systems.]

Performance / Country

Sunway TaihuLight (http://bit.ly/sunway-2016)
• SW26010 processor
  • Chinese design, fab, and ISA
  • 1.45 GHz
• Node = 260 cores (1 socket)
  • 4 core groups (CGs)
  • 64 CPEs per CG, no cache, 64 KB scratchpad each
  • 1 MPE per CG with 32 KB L1 data cache & 256 KB L2 cache
  • 32 GB memory total, 136.5 GB/s
  • ~3 Tflop/s (~22 flops/byte; see the arithmetic below)
• Cabinet = 1024 nodes
  • 4 supernodes = 32 boards (4 cards/board, 2 nodes/card)
  • ~3.14 Pflop/s
• 40 cabinets in the system
  • 40,960 nodes total
  • 125 Pflop/s total peak
  • 10,649,600 cores total
  • 1.31 PB of primary memory (DDR3)
  • 93 Pflop/s HPL, 74% of peak
  • 0.371 Pflop/s HPCG, 0.3% of peak (http://tiny.cc/hpcg)
  • 15.3 MW, water cooled
  • 6.07 Gflop/s per Watt
• 3 of the 6 Gordon Bell Award finalists at SC16
• 1.8B RMB (~$270M) for the building, hardware, apps, software, ...
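As a quick check, the ~22 flops/byte machine balance quoted above is just the ratio of node peak to node memory bandwidth:

\[
\frac{\text{node peak}}{\text{node bandwidth}} \approx \frac{3\,\text{Tflop/s}}{136.5\,\text{GB/s}} \approx 22\ \text{flops per byte}\ (\approx 176\ \text{flops per 8-byte double}),
\]

so a kernel must perform on the order of a hundred flops for every double it loads from memory to run anywhere near peak.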

Many Other Benchmarks
• TOP500
• Livermore Loops
• Green 500
• EuroBen
• Graph 500
• NAS Parallel Benchmarks
• Sustained Petascale Performance
• Genesis
• RAPS
• HPC Challenge
• SHOC
• Perfect
• LAMMPS
• ParkBench
• Dhrystone
• SPEC-hpc
• Whetstone
• Big Data Top100
• I/O Benchmarks

HPCG Snapshot (hpcg-benchmark.org)
• High Performance Conjugate Gradients (HPCG).
• Solves Ax=b, with A large and sparse, b known, x computed.
• An optimized implementation of PCG contains essential computational and communication patterns that are prevalent in a variety of methods for the discretization and numerical solution of PDEs.

• Patterns (a minimal CG sketch appears after the table below):
  • Dense and sparse computations.
  • Dense and sparse collectives.
  • Multi-scale execution of kernels via MG (truncated) V cycle.
  • Data-driven parallelism (unstructured sparse triangular solves).
• Strong verification (via spectral properties of PCG).

HPCG with 80 Entries

Rank (HPL rank) | Site | Computer | Cores | Rmax [Pflop/s] | HPCG [Pflop/s] | HPCG/HPL | HPCG as % of Peak
1 (2) | NSCC / Guangzhou | Tianhe-2, NUDT, Xeon 12C 2.2 GHz + Intel Xeon Phi 57C + Custom | 3,120,000 | 33.86 | 0.580 | 1.7% | 1.1%
2 (5) | RIKEN AICS | K computer, SPARC64 VIIIfx 2.0 GHz, custom | 705,024 | 10.51 | 0.554 | 5.3% | 4.9%
3 (1) | NSCC / Wuxi | Sunway TaihuLight, SW26010, Sunway | 10,649,600 | 93.01 | 0.371 | 0.4% | 0.3%
4 (4) | DOE NNSA / LLNL | Sequoia, IBM BlueGene/Q + custom | 1,572,864 | 17.17 | 0.330 | 1.9% | 1.6%
5 (3) | DOE SC / ORNL | Titan, Cray XK7, Opteron 6274 16C 2.2 GHz, custom, NVIDIA K20x | 560,640 | 17.59 | 0.322 | 1.8% | 1.2%
6 (7) | DOE NNSA / LANL & SNL | Trinity, Cray XC40, Intel E5-2698v3 + custom | 301,056 | 8.10 | 0.182 | 2.3% | 1.6%
7 (6) | DOE SC / ANL | Mira, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | 786,432 | 8.58 | 0.167 | 1.9% | 1.7%
8 (11) | TOTAL | Pangea, Intel Xeon E5-2670, InfiniBand FDR | 218,592 | 5.28 | 0.162 | 3.1% | 2.4%
9 (15) | NASA / Mountain View | Pleiades, SGI ICE X, Intel E5-2680, E5-2680v2, E5-2680v3 + InfiniBand | 185,344 | 4.08 | 0.155 | 3.8% | 3.1%
10 (9) | HLRS / U of Stuttgart | Hazel Hen, Cray XC40, Intel E5-2680v3 + custom | 185,088 | 5.64 | 0.138 | 2.4% | 1.9%
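For readers who have not seen the algorithm HPCG is built around, the sketch below is a minimal, unpreconditioned CG loop in C on a 1-D Poisson (tridiagonal) matrix. It is not the HPCG reference code, which adds a multigrid preconditioner, MPI+OpenMP, and the verification machinery, but it exercises the bandwidth-bound patterns the benchmark stresses: sparse matrix-vector products, dot products, and vector updates.

    /* Minimal sketch, not the HPCG reference code: unpreconditioned CG on a
     * 1-D Poisson (tridiagonal) matrix, run for a fixed number of steps.   */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    static void spmv(int n, const double *x, double *y)  /* y = A*x, A = tridiag(-1,2,-1) */
    {
        for (int i = 0; i < n; i++) {
            double v = 2.0 * x[i];
            if (i > 0)     v -= x[i - 1];
            if (i < n - 1) v -= x[i + 1];
            y[i] = v;
        }
    }

    static double dot(int n, const double *x, const double *y)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += x[i] * y[i];
        return s;
    }

    int main(void)
    {
        const int n = 1000;
        double *x = calloc(n, sizeof *x), *r = malloc(n * sizeof *r);
        double *p = malloc(n * sizeof *p), *Ap = malloc(n * sizeof *Ap);

        for (int i = 0; i < n; i++) r[i] = p[i] = 1.0;   /* b = 1, x0 = 0 => r0 = b */
        double rr = dot(n, r, r);

        for (int it = 0; it < 500; it++) {               /* fixed iteration count for the demo */
            spmv(n, p, Ap);
            double alpha = rr / dot(n, p, Ap);           /* step length */
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            double rr_new = dot(n, r, r);
            double beta = rr_new / rr;                   /* conjugation coefficient */
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }
        printf("residual norm after 500 CG iterations = %.3e\n", sqrt(rr));
        free(x); free(r); free(p); free(Ap);
        return 0;
    }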

Bookends: Peak, HPL, and HPCG
[Chart: Peak and HPL Rmax (Pflop/s, log scale from 0.001 to 1000) for systems at TOP500 ranks 1 through 349.]
[Chart: the same systems with HPCG (Pflop/s) added as a third series.]

Apps Running on Sunway TaihuLight

Peak Performance - Per Core

Floating point operations per cycle per core:
• Most of the recent computers have FMA (fused multiply-add), i.e. x ← x + y*z in one cycle.
• Earlier Intel Xeon models and AMD Opteron have SSE2: 2 flops/cycle DP & 4 flops/cycle SP.
• Intel Xeon Nehalem ('09) & Westmere ('10) have SSE4: 4 flops/cycle DP & 8 flops/cycle SP.
• Intel Xeon Sandy Bridge ('11) & Ivy Bridge ('12) have AVX: 8 flops/cycle DP & 16 flops/cycle SP.
• Intel Xeon Haswell ('13) & Broadwell ('14) have AVX2: 16 flops/cycle DP & 32 flops/cycle SP.
• Xeon Phi (per core) is at 16 flops/cycle DP & 32 flops/cycle SP.
• Intel Xeon Skylake (server) and Knights Landing have AVX-512: 32 flops/cycle DP & 64 flops/cycle SP. ← We are here (almost.)
A per-node peak calculation is sketched below.
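These per-core rates translate into node peaks via peak = cores × clock × flops/cycle. A minimal calculator (the example values below are a 16-core, 2.6 GHz machine with AVX, i.e. 8 DP flops/cycle; they are illustrative, not a measurement):

    /* Minimal sketch: theoretical peak = cores x clock x flops/cycle.
     * Example values: 16 cores at 2.6 GHz with AVX (8 DP flops/cycle);
     * AVX2 with FMA would double flops_per_cycle to 16.                */
    #include <stdio.h>

    int main(void)
    {
        int    cores           = 16;    /* e.g. dual-socket, 8 cores per socket */
        double ghz             = 2.6;   /* clock frequency in GHz               */
        int    flops_per_cycle = 8;     /* DP flops per cycle per core (AVX)    */

        double peak_gflops = cores * ghz * flops_per_cycle;
        printf("theoretical DP peak: %.1f Gflop/s (%.1f Gflop/s per core)\n",
               peak_gflops, peak_gflops / cores);
        return 0;
    }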

CPU Access Latencies in Clock Cycles
[Chart: access latencies, in clock cycles, across the memory hierarchy.]
In 167 cycles a core can do 2,672 DP flops.

Classical Analysis of Algorithms May Not Be Valid

• Processors are over-provisioned for floating point arithmetic.
• Data movement is extremely expensive (a simple time model is sketched below).
• Operation count is not a good indicator of the time to solve a problem.
• Algorithms that do more ops may actually take less time.
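A simple way to see why operation counts mislead is a two-term time model (a rough roofline-style bound, not a precise machine model), with peak rate P in flop/s and memory bandwidth B in bytes/s:

\[
t \;\gtrsim\; \max\!\left(\frac{\#\text{flops}}{P},\; \frac{\#\text{bytes moved}}{B}\right).
\]

Whenever a kernel's flops-per-byte ratio is below the machine balance P/B (about 22 flops/byte on a TaihuLight node), run time is set by data movement, and extra flops that avoid memory traffic can be effectively free.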

Singular Value Decomposition
LAPACK version (1991), using Level 1, 2, & 3 BLAS; the first stage costs 8/3 n³ ops.

[Chart: three generations of software compared (square matrices, with vectors): speedup over EISPACK for LAPACK QR with the BLAS running in parallel on 16 cores, LAPACK QR on 1 core (1991), LINPACK QR (1979), and EISPACK QR (1975), for matrix sizes N×N up to 20k. "QR" refers to the QR algorithm for computing the eigenvalues. Machine: dual-socket, 8-core Intel Sandy Bridge, 2.6 GHz (8 flops per core per cycle).]

Bottleneck in the Bidiagonalization
The Standard Bidiagonal Reduction: xGEBRD
Two steps: factor the panel, then update the trailing matrix.

[Sparsity plots of the panel/update steps (nz = 3035, 275, 2500, 225): factor panel k, then update the trailing matrix, then factor panel k+1. The panel factorization requires 2 GEMVs.]

Characteristics of the reduction Q^H A P to bidiagonal form:
• Total cost: 8n³/3.
• Too many Level 2 BLAS operations: 4/3 n³ from GEMV and 4/3 n³ from GEMM.
• Performance is limited to 2× the performance of GEMV (see the derivation below).
• It is a memory-bound algorithm.
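The 2× GEMV limit follows directly from the even split of the 8n³/3 flops, using the GEMV and GEMM rates P_gemv and P_gemm that appear in the analysis later:

\[
t_{\text{1-stage}} \;\approx\; \frac{\tfrac{4}{3}n^3}{P_{\text{gemv}}} + \frac{\tfrac{4}{3}n^3}{P_{\text{gemm}}} \;\ge\; \frac{\tfrac{4}{3}n^3}{P_{\text{gemv}}}
\qquad\Rightarrow\qquad
P_{\text{1-stage}} \;=\; \frac{\tfrac{8}{3}n^3}{t_{\text{1-stage}}} \;\le\; 2\,P_{\text{gemv}}.
\]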

Hardware: 16-core Intel Sandy Bridge, 2.6 GHz, 20 MB shared L3 cache; theoretical double-precision peak of 20.4 Gflop/s per core. Compiled with icc, using MKL 2015.3.187.

Recent Work on 2-Stage Algorithm

[Diagram: the two-stage reduction. The first stage reduces the full matrix (nz = 3600) to band form (nz = 605); the second stage (bulge chasing) reduces the band to bidiagonal form (nz = 119).]

Characteristics:
• Stage 1 (to band):
  • Fully Level 3 BLAS
  • Dataflow, asynchronous execution
• Stage 2 (bulge chasing, to bidiagonal):
  • Level "BLAS-1.5"
  • Asynchronous execution
  • Cache-friendly kernel (reduced communication)

Recent work on developing new 2-stage algorithm

Flop-count analysis of the two stages, with nt = n/n_b tiles of width n_b.

First stage (reduction to band):
\[
\mathrm{flops}_{\text{first stage}} \approx
\sum_{s=1}^{n/n_b}\Big[2n_b^3 + (nt{-}s)\,3n_b^3 + (nt{-}s)\,\tfrac{10}{3}n_b^3 + (nt{-}s)^2\,5n_b^3\Big]
+ \sum_{s=1}^{n/n_b}\Big[2n_b^3 + (nt{-}s{-}1)\,3n_b^3 + (nt{-}s{-}1)\,\tfrac{10}{3}n_b^3 + (nt{-}s)(nt{-}s{-}1)\,5n_b^3\Big]
\approx \tfrac{10}{3}n^3 \quad (\text{gemm}). \tag{7}
\]

Second stage (bulge chasing to bidiagonal):
\[
\mathrm{flops}_{\text{second stage}} = 6\,n_b\,n^2 \quad (\text{gemv}). \tag{8}
\]
In total this is about 25% more flops than the original one-stage reduction's 8/3 n³.

Maximum performance bound:
\[
P_{\max} = \frac{\text{reference number of operations}}{\text{minimum time } t_{\min}}
= \frac{\tfrac{8}{3}n^3}{t_{\min}\!\big(\tfrac{10}{3}n^3 \text{ flops in gemm}\big) + t_{\min}\!\big(6 n_b n^2 \text{ flops in gemv}\big)}
= \frac{8 n^3\, P_{\text{gemm}} P_{\text{gemv}}}{10 n^3\, P_{\text{gemv}} + 18 n_b n^2\, P_{\text{gemm}}}, \tag{9}
\]
\[
\Rightarrow\; \tfrac{8}{40}\, P_{\text{gemm}} \;\le\; P_{\max} \;\le\; \tfrac{8}{12}\, P_{\text{gemm}}
\quad \text{if } P_{\text{gemm}} \approx 20\, P_{\text{gemv}} \text{ and } 120 \le n_b \le 240.
\]

Speedup over the one-stage algorithm:
\[
\text{speedup} = \frac{\text{time of one-stage}}{\text{time of two-stage}}
= \frac{4n^3/(3P_{\text{gemv}}) + 4n^3/(3P_{\text{gemm}})}{10n^3/(3P_{\text{gemm}}) + 6 n_b n^2 / P_{\text{gemv}}}
\;\Rightarrow\; \tfrac{84}{70} \le \text{speedup} \le \tfrac{84}{15},
\text{ i.e. } 1.2 \le \text{speedup} \le 5.6, \tag{10}
\]
if P_gemm is about 20× P_gemv and 120 ≤ n_b ≤ 240.
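To see what (10) predicts for concrete values, divide its numerator and denominator by n³/(3 P_gemm): speedup = (4r + 4) / (10 + 18 n_b r / n), with r = P_gemm/P_gemv. The small sketch below evaluates this; the r = 20 ratio and the n_b range are the slide's assumptions, while n = 8000 is just an illustrative size.

    /* Minimal sketch: evaluate the one-stage vs. two-stage speedup model from
     * equations (9)-(10), rewritten as speedup = (4r + 4) / (10 + 18*nb*r/n)
     * with r = Pgemm/Pgemv.  The values are illustrative, not measurements.  */
    #include <stdio.h>

    int main(void)
    {
        double r = 20.0;      /* GEMM rate / GEMV rate (slide assumes about 20x) */
        double n = 8000.0;    /* matrix size (illustrative)                      */

        for (double nb = 120.0; nb <= 240.0; nb += 60.0) {
            double speedup = (4.0 * r + 4.0) / (10.0 + 18.0 * nb * r / n);
            printf("nb = %3.0f: predicted speedup = %.2f\n", nb, speedup);
        }
        return 0;
    }

For these inputs the predicted speedups fall between about 4 and 5.5, inside the 1.2–5.6 range above.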

Analysis of BRD using 2 stages (working notes, June 10, 2015): these notes repeat the same flop counts and performance model as equations (7)–(10) above, there numbered (1)–(4).

Recent work on developing new 2-stage algorithm — measured results:
[Chart: speedup of the 2-stage reduction relative to MKL DGEBRD for matrix sizes 2k–26k on 16 Sandy Bridge cores at 2.6 GHz.]
• With P_gemm about 22× P_gemv and 120 ≤ n_b ≤ 240, the model gives 1.8 ≤ speedup ≤ 7.
• Bottom line: about 25% more flops, and 1.8–6 times faster in practice.

Critical Issues at Peta & Exascale for Algorithm and Software Design
• Synchronization-reducing algorithms
  § Break the fork-join model.
• Communication-reducing algorithms
  § Use methods which have lower bounds on communication.
• Mixed-precision methods
  § 2× speed for the operations and 2× speed for the data movement (a sketch of the idea follows this list).
• Autotuning
  § Today's machines are too complicated; build "smarts" into the software to adapt to the hardware.
• Fault-resilient algorithms
  § Implement algorithms that can recover from failures/bit flips.
• Reproducibility of results
  § Today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
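As one concrete illustration of the mixed-precision idea (a minimal sketch, not the MAGMA/PLASMA implementation; it assumes LAPACKE and CBLAS are available, e.g. via MKL): do the O(n³) factorization in single precision, then recover double-precision accuracy with a few O(n²) iterative-refinement steps whose residuals are computed in double precision.

    /* Minimal sketch of mixed-precision iterative refinement (not MAGMA/PLASMA
     * code): single-precision LU, double-precision residuals and updates.     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <lapacke.h>
    #include <cblas.h>

    int main(void)
    {
        const int n = 1000;
        double     *A  = malloc((size_t)n * n * sizeof *A);   /* original matrix (double)    */
        double     *x  = malloc(n * sizeof *x);                /* solution, kept in double    */
        double     *b  = malloc(n * sizeof *b);
        double     *r  = malloc(n * sizeof *r);                /* residual r = b - A*x        */
        float      *As = malloc((size_t)n * n * sizeof *As);   /* single-precision copy       */
        float      *cs = malloc(n * sizeof *cs);               /* correction, solved in single */
        lapack_int *ipiv = malloc(n * sizeof *ipiv);

        srand(2);                                   /* random, diagonally dominant test matrix */
        for (size_t i = 0; i < (size_t)n * n; i++) A[i] = rand() / (double)RAND_MAX;
        for (int i = 0; i < n; i++) { A[(size_t)i * n + i] += n; b[i] = 1.0; x[i] = 0.0; }
        for (size_t i = 0; i < (size_t)n * n; i++) As[i] = (float)A[i];

        /* The O(n^3) work happens here, at single-precision speed. */
        LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As, n, ipiv);

        for (int it = 0; it < 10; it++) {
            /* r = b - A*x in double precision (O(n^2) work). */
            for (int i = 0; i < n; i++) r[i] = b[i];
            cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, -1.0, A, n, x, 1, 1.0, r, 1);

            double rnorm = 0.0;
            for (int i = 0; i < n; i++) rnorm = fmax(rnorm, fabs(r[i]));
            printf("iter %d: ||r||_inf = %.3e\n", it, rnorm);
            if (rnorm < 1e-12) break;

            /* Solve A*c = r with the single-precision LU factors (O(n^2)). */
            for (int i = 0; i < n; i++) cs[i] = (float)r[i];
            LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, cs, n);
            for (int i = 0; i < n; i++) x[i] += (double)cs[i];
        }

        free(A); free(x); free(b); free(r); free(As); free(cs); free(ipiv);
        return 0;
    }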

Collaborators and Support
• MAGMA team: http://icl.cs.utk.edu/magma
• PLASMA team: http://icl.cs.utk.edu/plasma
• Collaborating partners:
  University of Tennessee, Knoxville
  Lawrence Livermore National Laboratory, Livermore, CA
  University of California, Berkeley
  University of Colorado, Denver
  INRIA, France (StarPU team)
  KAUST, Saudi Arabia