
With Extreme Scale the Rules Have Changed

Jack Dongarra

University of Tennessee / Oak Ridge National Laboratory / University of Manchester

• Overview of High Performance Computing
• With Extreme Computing the "rules" for computing have changed

• Create systems that can apply exaflops of computing power to exabytes of data.
• Keep the United States at the forefront of HPC capabilities.
• Improve HPC application developer productivity.
• Make HPC readily available.
• Establish hardware technology for future HPC systems.

[Chart: Top500 performance development, 1994-2015, on a log scale from 100 Mflop/s to 1 Eflop/s, showing the Sum, N=1, and N=500 lines. November 2015: Sum = 420 PFlop/s, N=1 = 33.9 PFlop/s, N=500 = 206 TFlop/s (June 2015: 362 PFlop/s, 33.9 PFlop/s, 166 TFlop/s). Starting points in 1994: 1.17 TFlop/s, 59.7 GFlop/s, 400 MFlop/s. For comparison, "my laptop" is about 70 Gflop/s and "my iPhone" about 4 Gflop/s.]

[Chart: projected performance development, extrapolating the Sum, N=1, and N=500 trend lines out to 2020, approaching 1 Eflop/s.]

• Pflops (> 10^15 Flop/s) computing is fully established, with 81 systems.
• Three technology architecture possibilities, or "swim lanes," are thriving:
  • Commodity
  • Commodity + accelerator (e.g., GPUs) (104 systems)
  • Special-purpose lightweight cores (e.g., IBM BG, ARM, Knights Landing)
• Interest in supercomputing is now worldwide and growing in many new markets (over 50% of the Top500 are in industry).
• Exascale (10^18 Flop/s) projects exist in many countries and regions.
• Intel processors have the largest share, 89%, followed by AMD at 4%.

June 2015 Top500 list (top 10):

Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt

1 | National Super Computer Center in Guangzhou | Tianhe-2, NUDT, Intel Xeon 12C 2.2 GHz + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1905
2 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 65 | 8.3 | 2120
3 | DOE / NNSA, L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 17.2 | 85 | 7.9 | 2063
4 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
5 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2066
6 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | Switzerland | 115,984 | 6.27 | 81 | 2.3 | 2726
7 | KAUST | Shaheen II, Cray XC30, Xeon 16C + Custom | Saudi Arabia | 196,608 | 5.54 | 77 | 4.5 | 1146
8 | Texas Advanced Computing Center | Stampede, Dell, Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 5.17 | 61 | 4.5 | 1489
9 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | Germany | 458,752 | 5.01 | 85 | 2.30 | 2178
10 | DOE / NNSA, L Livermore Nat Lab | Vulcan, BlueGene/Q, Power BQC 16C 1.6 GHz + Custom | USA | 393,216 | 4.29 | 85 | 1.97 | 2177
500 (422) | Software Comp. | HP Cluster | USA | 18,896 | .309 | 48 | |

November 2015 Top500 list (top 10):

Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt

1 | National Super Computer Center in Guangzhou | Tianhe-2, NUDT, Xeon 12C + Intel Xeon Phi (57c) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1905
2 | DOE / OS, Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom | USA | 560,640 | 17.6 | 65 | 8.3 | 2120
3 | DOE / NNSA, L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 17.2 | 85 | 7.9 | 2063
4 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 827
5 | DOE / OS, Argonne Nat Lab | Mira, BlueGene/Q (16c) + Custom | USA | 786,432 | 8.16 | 85 | 3.95 | 2066
6 | DOE / NNSA, Los Alamos & Sandia | Trinity, Cray XC40, Xeon 16C + Custom | USA | 301,056 | 8.10 | 80 | |
7 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | Switzerland | 115,984 | 6.27 | 81 | 2.3 | 2726
8 | HLRS Stuttgart | Hazel Hen, Cray XC40, Xeon 12C + Custom | Germany | 185,088 | 5.64 | 76 | |
9 | KAUST | Shaheen II, Cray XC30, Xeon 16C + Custom | Saudi Arabia | 196,608 | 5.54 | 77 | 4.5 | 1146
10 | Texas Advanced Computing Center | Stampede, Dell, Intel (8c) + Intel Xeon Phi (61c) + IB | USA | 204,900 | 5.17 | 61 | 4.5 | 1489
500 (368) | Karlsruher | MEGAWARE Intel | Germany | 10,800 | .206 | 95 | |

♦ US DOE planning to deploy O(100) Pflop/s systems for 2017-2018 ($525M in hardware)
♦ Oak Ridge Lab and Lawrence Livermore Lab to receive IBM- and Nvidia-based systems
♦ Argonne Lab to receive an Intel-based system
! After this: exaflops

♦ The US Dept of Commerce is preventing some Chinese groups from receiving Intel technology
! Citing concerns about nuclear research being done with the systems; February 2015
! On the blockade list:
  ! National SC Center Guangzhou, site of Tianhe-2
  ! National SC Center Tianjin, site of Tianhe-1A
  ! National University for Defense Technology, developer
  ! National SC Center Changsha, location of NUDT

♦ Sunway BlueLight MPP, ShenWei SW1600 975 MHz, Infiniband QDR
  ! Site: National Supercomputing Center in Jinan
  ! Cores: 137,200; Linpack performance (Rmax): 795.9 TFlop/s; theoretical peak (Rpeak): 1,070.16 TFlop/s
  ! Processor: ShenWei SW1600 16C 975 MHz (Alpha architecture)
  ! Interconnect: Infiniband QDR
♦ Rumored to have a 100 Pflop/s system

Peak Performance per Core

Floating-point operations per cycle per core:
! Most recent processors have FMA (fused multiply-add), i.e., x ← x + y*z in one cycle
! Intel Xeon (earlier models) and AMD Opteron with SSE2: 2 flops/cycle DP, 4 flops/cycle SP
! Intel Xeon Nehalem ('09) and Westmere ('10) with SSE4: 4 flops/cycle DP, 8 flops/cycle SP
! Intel Xeon Sandy Bridge ('11) and Ivy Bridge ('12) with AVX: 8 flops/cycle DP, 16 flops/cycle SP
! Intel Xeon Haswell ('13) and Broadwell ('14) with AVX2: 16 flops/cycle DP, 32 flops/cycle SP (we are here); Intel Xeon Phi is also at 16 flops/cycle DP, 32 flops/cycle SP per core
! Intel Xeon Skylake (server, '15) with AVX-512 (coming soon): 32 flops/cycle DP, 64 flops/cycle SP

CPU Access Latencies in Clock Cycles

[Chart: CPU access latencies, in clock cycles, from registers and caches out to main memory.]

In the roughly 167 cycles of a main-memory access, a core can do 2,672 DP flops (167 cycles x 16 flops/cycle).
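A back-of-the-envelope check of these numbers (illustrative only; the clock rate and latency below are assumed values, not measurements from the talk):

    # Rough peak-vs-latency arithmetic for a single core (illustrative sketch).
    flops_per_cycle_dp = 16      # e.g., a Haswell core: AVX2 + FMA, double precision
    clock_ghz = 2.6              # assumed core clock
    mem_latency_cycles = 167     # assumed main-memory access latency, in cycles

    peak_gflops_per_core = flops_per_cycle_dp * clock_ghz
    flops_per_memory_access = flops_per_cycle_dp * mem_latency_cycles

    print(f"per-core DP peak: {peak_gflops_per_core:.1f} Gflop/s")
    print(f"DP flops possible during one memory access: {flops_per_memory_access}")
    # 167 cycles x 16 flops/cycle = 2672, matching the figure quoted above.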

♦ Processors are over-provisioned for floating-point arithmetic
♦ Data movement is extremely expensive
♦ Operation count is not a good indicator of the time to solve a problem
♦ Algorithms that do more ops may actually take less time

Singular Value Decomposition: LAPACK version (1991), using Level 1, 2, & 3 BLAS; the first stage costs 8/3 n^3 ops.

[Chart: speedup over EISPACK for the SVD of a square matrix, with vectors, for matrix sizes up to 20k columns, on an Intel Sandy Bridge at 2.6 GHz (8 flops per core per cycle): LAPACK QR (multicore), LAPACK QR (1 core), and EISPACK (1 core). "QR" refers to the QR algorithm for computing the eigenvalues.]

Bottleneck in the Bidiagonalization
The standard bidiagonal reduction, xGEBRD, has two steps: factor the panel, then update the trailing matrix.

[Spy plots of the reduction: factor panel k, update the trailing matrix, then factor panel k+1, and so on. Each panel factorization requires 2 GEMVs.]

Characteristics (reduction to bidiagonal form, Q^H * A * P):
• Total cost 8n^3/3
• Too many Level 2 BLAS operations: 4/3 n^3 in GEMV and 4/3 n^3 in GEMM
• Performance limited to 2x the performance of GEMV
• Memory-bound algorithm
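To see why this matters, a small NumPy micro-benchmark (a sketch; the matrix size is arbitrary) compares the achievable GEMV and GEMM rates on whatever machine it runs on; xGEBRD is capped near twice the GEMV rate because half of its flops live in GEMV:

    # Compare GEMV vs GEMM throughput to illustrate the memory-bound bottleneck.
    import time
    import numpy as np

    n = 4000
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    x = np.random.rand(n)

    t = time.perf_counter(); _ = A @ x; gemv_s = time.perf_counter() - t
    t = time.perf_counter(); _ = A @ B; gemm_s = time.perf_counter() - t

    gemv_gflops = 2 * n**2 / gemv_s / 1e9       # GEMV does 2n^2 flops
    gemm_gflops = 2 * n**3 / gemm_s / 1e9       # GEMM does 2n^3 flops
    print(f"GEMV: {gemv_gflops:5.1f} Gflop/s   GEMM: {gemm_gflops:7.1f} Gflop/s")
    # xGEBRD does 4/3 n^3 of its 8/3 n^3 flops in GEMV, so its rate is
    # limited to roughly 2x the GEMV rate, far below the GEMM rate.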

Hardware: 16-core Intel Sandy Bridge, 2.6 GHz, 20 MB shared L3; theoretical double-precision peak of 20.4 Gflop/s per core; compiled with icc and linked against MKL 2015.3.187.

Recent Work on a 2-Stage Algorithm

[Spy plots: the first stage reduces the full matrix to band form; the second stage (bulge chasing) reduces the band to bidiagonal form.]

Characteristics
• Stage 1 (to band): fully Level 3 BLAS; dataflow, asynchronous execution
• Stage 2 (bulge chasing, to bidiagonal): Level "BLAS-1.5"; asynchronous execution; cache-friendly kernel (reduced communication)

Recent work on developing the new 2-stage algorithm

First stage (reduction to band form): summing the per-panel costs over the n/n_b panels,

    flops_first stage ≈ (10/3) n^3 + O(n_b n^2) ≈ (10/3) n^3   (essentially all in GEMM).

Second stage (bulge chasing, band to bidiagonal):

    flops_second stage = 6 n_b n^2   (in GEMV).

The two-stage approach therefore does more flops than the original one-stage reduction, which does 8/3 n^3.

Maximum performance bound (crediting the 8/3 n^3 reference operation count):

    P_max = reference number of operations / minimum time t_min
          = (8/3 n^3) / [ t_min((10/3) n^3 flops in GEMM) + t_min(6 n_b n^2 flops in GEMV) ]
          = 8 n^3 P_gemm P_gemv / (10 n^3 P_gemv + 18 n_b n^2 P_gemm)

    =>  (8/40) P_gemm <= P_max <= (8/12) P_gemm,
        if P_gemm is about 20x P_gemv and 120 <= n_b <= 240.

Speedup over the one-stage algorithm:

    speedup = time of one-stage / time of two-stage
            = [ 4n^3/(3 P_gemv) + 4n^3/(3 P_gemm) ] / [ 10n^3/(3 P_gemm) + 6 n_b n^2 / P_gemv ]

    =>  84/70 <= speedup <= 84/15,  i.e.  1.2 <= speedup <= 5.6,
        if P_gemm is about 20x P_gemv and 120 <= n_b <= 240.
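A small numerical check of these two formulas (a sketch; the GEMM and GEMV rates below are placeholder values to be replaced with measured ones):

    # Evaluate the two-stage performance bound and the speedup model above.
    def p_max(n, nb, p_gemm, p_gemv):
        """Bound on attainable Gflop/s, crediting the 8/3 n^3 reference flops."""
        ref_flops = 8 * n**3 / 3
        t_min = (10 * n**3 / 3) / p_gemm + (6 * nb * n**2) / p_gemv
        return ref_flops / t_min

    def speedup(n, nb, p_gemm, p_gemv):
        """Model of one-stage time divided by two-stage time."""
        t_one = (4 * n**3 / 3) / p_gemv + (4 * n**3 / 3) / p_gemm
        t_two = (10 * n**3 / 3) / p_gemm + (6 * nb * n**2) / p_gemv
        return t_one / t_two

    p_gemv, p_gemm = 15.0, 300.0          # placeholder rates (Gflop/s), GEMM ~ 20x GEMV
    for nb in (120, 240):
        print(f"nb={nb}: P_max = {p_max(20000, nb, p_gemm, p_gemv):.0f} Gflop/s, "
              f"speedup = {speedup(20000, nb, p_gemm, p_gemv):.1f}x")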

With P_gemm about 22x P_gemv and 120 <= n_b <= 240, the same bound gives 1.8 <= speedup <= 7.

[Chart: measured speedup of the 2-stage reduction over MKL DGEBRD for matrix sizes from 2k to 26k.]

Parallelization of LU and QR: parallelize the update (dgemm)
• Easy and done in any reasonable software.
• This is the 2/3 n^3 term in the flop count.
• Can be done efficiently with LAPACK + multithreaded BLAS.

[Diagram: right-looking blocked LU. Factor the panel (dgetf2), apply dtrsm (+ dswp) to the block row, then a dgemm update of the trailing matrix; repeat on the next panel. Executed as fork-join, bulk synchronous processing: all cores synchronize after each step, leaving cores idle between the join and the next fork.]

Alternatively, the numerical program generates tasks and a runtime system executes the tasks, respecting the data dependences.

[DAG of tasks for a tiled factorization: xGETF2 panel factorizations, xTRSM triangular solves, and xGEMM updates, connected by their data dependences.]

Sparse / dense matrix system: DAG-based factorization and batched LA (see the sketch after this list):
! LU, QR, or Cholesky on small diagonal matrices
! TRSMs, QRs, or LUs
! TRSMs, TRMMs
! Updates (Schur complement): GEMMs, SYRKs, TRMMs
! And many other BLAS/LAPACK operations, e.g., for application-specific solvers, preconditioners, and matrices
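A toy illustration of the batched idea, using NumPy's stacked (batched) linear algebra rather than a real batched BLAS such as the vendor or MAGMA batched routines; the batch and block sizes are arbitrary:

    # Batched LA sketch: factor many small matrices at once with stacked kernels.
    import numpy as np

    batch, nb = 1000, 32
    rng = np.random.default_rng(0)
    R = rng.standard_normal((batch, nb, nb))
    A = R @ R.transpose(0, 2, 1) + nb * np.eye(nb)   # a batch of small SPD blocks

    # One batched call instead of 1000 tiny ones (NumPy broadcasts over the batch axis).
    L = np.linalg.cholesky(A)                        # batched Cholesky factorizations
    B = rng.standard_normal((batch, nb, 4))
    X = np.linalg.solve(A, B)                        # batched solves

    print(L.shape, X.shape)                          # (1000, 32, 32) (1000, 32, 4)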

[Task DAG excerpt: xGETF2, xTRSM, and xGEMM tasks for successive panels.]

♦ Objectives
! High utilization of each core
! Scaling to a large number of cores
! Synchronization-reducing algorithms
♦ Methodology
! Dynamic DAG scheduling
! Explicit parallelism
! Implicit communication
! Fine granularity / block data layout
♦ Arbitrary DAGs with dynamic scheduling (a blocked-LU sketch of the task structure follows below)
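To make the task structure concrete, here is a minimal blocked, right-looking LU sketch (no pivoting, no runtime system; each commented step is one task that a dynamic DAG scheduler would launch as soon as its inputs are ready):

    # Blocked right-looking LU without pivoting: the task structure of the DAG.
    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_lu(A, nb=64):
        """Return packed LU factors (unit-lower L below the diagonal, U on/above it)."""
        A = A.copy()
        n = A.shape[0]
        for k in range(0, n, nb):
            e = min(k + nb, n)
            # Task GETF2: unpivoted LU of the diagonal block (panel factorization).
            for j in range(k, e):
                A[j+1:e, j] /= A[j, j]
                A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
            if e < n:
                L_kk = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
                U_kk = np.triu(A[k:e, k:e])
                # Task TRSM: update the block row and block column of the panel.
                A[k:e, e:] = solve_triangular(L_kk, A[k:e, e:], lower=True, unit_diagonal=True)
                A[e:, k:e] = solve_triangular(U_kk.T, A[e:, k:e].T, lower=True).T
                # Task GEMM: Schur-complement update of the trailing submatrix.
                A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
        return A

    n = 512
    A = np.random.rand(n, n) + n * np.eye(n)     # diagonally dominant, safe without pivoting
    LU = blocked_lu(A)
    L, U = np.tril(LU, -1) + np.eye(n), np.triu(LU)
    print("relative error:", np.linalg.norm(L @ U - A) / np.linalg.norm(A))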

[Execution traces: fork-join parallelism versus DAG-scheduled parallelism. Note the synchronization penalty of fork-join in the presence of heterogeneity.]

Algorithmic Bombardment

• Interested in solving a large sparse non-symmetric system on a parallel computer.
• Many algorithms to choose from:
  – CG Squared (CGS)
  – Bi-CG Stabilized (BiCGSTAB)
  – QMR
  – GMRES
  – etc.
• The best choice is not clear, and some may not converge.
• Develop a "poly-algorithm" combining multiple algorithms in one, leveraging data movement.

Poly-algorithm Bombardment: combine three iterative methods into a multi-iterative solver, iterating in lock step.

Amount of work per iteration:
Method    | α = x^T y | y = αx + y | y = Ax
CGS       | 2         | 6          | 2
Bi-CGSTAB | 4         | 6          | 2
QMR       | 2         | 12         | 2

Number of communications per iteration, and storage:
Method      | α = x^T y | y = αx + y | y = Ax | x = M^-1 y | Storage
CGS         | 2         | 0          | 2      | 2          | A + 6n
Bi-CGSTAB   | 3         | 0          | 2      | 2          | A + 6n
QMR         | 2         | 0          | 2      | 2          | A + 16n
Bombardment | 3         | 0          | 2      | 3          | A + 26n

Algorithmic Bombardment using a Multi-Iterative Method

• Which is the iteration method of choice for a given problem?

• Combine three iterative solvers into one multi-iterative solver: Bombardment runs CGS, QMR, and BiCGSTAB in one method

• Target architecture: Graphics processing units (NVIDIA K40 GPU)

• Bombardment can be optimized for this architecture (a simplified sketch follows below):
  • Benefits from blocking the distinct SpMVs into one SpMM
  • Benefits from blocking the global reductions
  • Benefits from overlapping kernels
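A much-simplified sketch of the bombardment idea with SciPy's Krylov solvers: run the candidate methods on the same system and keep whatever converges. (The real multi-iterative solver advances all three methods in lock step and fuses their SpMVs and reductions, which this sketch does not attempt.)

    # Bombardment sketch: run several Krylov methods on the same sparse system.
    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import qmr, cgs, bicgstab

    n = 10000
    A = sp.diags([-1.0, 2.5, -1.2], [-1, 0, 1], shape=(n, n), format="csr")  # toy nonsymmetric matrix
    b = np.ones(n)

    for name, solver in (("QMR", qmr), ("CGS", cgs), ("BiCGSTAB", bicgstab)):
        x, info = solver(A, b, maxiter=5000)
        res = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
        status = "converged" if info == 0 else f"did not converge (info={info})"
        print(f"{name:9s} {status}, relative residual {res:.2e}")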

• 311 Sparse matrices from the University of Florida Matrix Collection • One of the solvers converges with < 5,000 iterations • Iteration limit: 100,000 iterations • Runtime is target metric

(Test matrices: https://www.cise.ufl.edu/research/sparse/matrices/)

Algorithmic Bombardment using a Multi-Iterative Method

[Chart: time-to-solution of QMR, CGS, and BiCGSTAB, normalized to the Bombardment time, for each of the 311 matrices (by matrix ID).]

Stats for the 311 test matrices:
                     | QMR       | CGS       | BiCGSTAB
Fastest method       | 38 (12%)  | 93 (30%)  | 181 (58%)
Converges            | 299 (96%) | 262 (84%) | 248 (80%)
Fails                | 12 (4%)   | 49 (16%)  | 63 (20%)
Bombardment speedup* | 2.42x     | 1.11x     | 1.66x

*only considering converging test cases

[Chart: the same normalized runtimes, plotted against the Bombardment runtime in seconds (10^-3 to 10^2 s).]

♦ "Responsibly Reckless" Algorithms
! Try a fast (possibly unstable) algorithm that might fail, but rarely
! Check for instability
! If needed, recompute with a stable algorithm

♦ To solve Ax = b:
! Compute A_r = U^T A V, with U and V random matrices
! Factorize A_r without pivoting (GENP)
! Solve A_r y = U^T b, then x = V y
♦ U and V are recursive butterfly matrices
! Randomization is cheap (O(n^2) operations)
! GENP is fast (can take advantage of the GPU)
! Accuracy is in practice similar to GEPP (with iterative refinement), but...

Think of this as a preconditioner step. The goal is to transform A into a matrix that is sufficiently "random" that, with probability close to 1, pivoting is not needed (see the sketch below).
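A toy NumPy sketch of the idea, using a single-level random butterfly rather than the full recursive construction, and a hand-written Gaussian elimination with no pivoting; the diagonal scaling used for the butterflies is an arbitrary choice for illustration:

    # Random butterfly transformation (depth 1) + Gaussian elimination, no pivoting.
    import numpy as np

    def butterfly(n, rng):
        """One-level random butterfly: B = [[R0, R1], [R0, -R1]] / sqrt(2), R0, R1 random diagonal."""
        m = n // 2
        r0 = np.diag(rng.uniform(0.5, 1.5, m))
        r1 = np.diag(rng.uniform(0.5, 1.5, m))
        return np.block([[r0, r1], [r0, -r1]]) / np.sqrt(2)

    def genp(A):
        """LU factorization with NO pivoting (GENP); returns the packed factors."""
        A = A.copy()
        for k in range(A.shape[0] - 1):
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        return A

    rng = np.random.default_rng(0)
    n = 512                                   # n even, for the toy butterfly
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    U, V = butterfly(n, rng), butterfly(n, rng)
    Ar = U.T @ A @ V                          # randomized matrix; pivoting rarely needed now
    LU = genp(Ar)
    L = np.tril(LU, -1) + np.eye(n)
    R = np.triu(LU)
    y = np.linalg.solve(R, np.linalg.solve(L, U.T @ b))   # two triangular solves
    x = V @ y

    print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))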

• Mixed precision: use the lowest precision required to achieve a given accuracy outcome
# Improves runtime, reduces power consumption, lowers data movement
# Reformulate to find a correction to the solution rather than the solution itself: Δx rather than x

• Exploit 32-bit floating point as much as possible
  # Especially for the bulk of the computation
• Correct or update the solution with selective use of 64-bit floating point to provide a refined result
• Intuitively:
  # Compute a 32-bit result,
  # Calculate a correction to the 32-bit result using selected higher precision, and
  # Apply the correction to the 32-bit result using the higher precision.

• Iterative refinement for dense systems, Ax = b, can work this way:

    L U = lu(A)                     SINGLE   O(n^3)
    x = L\(U\b)                     SINGLE   O(n^2)
    r = b - A*x                     DOUBLE   O(n^2)
    WHILE || r || not small enough
        z = L\(U\r)                 SINGLE   O(n^2)
        x = x + z                   DOUBLE   O(n^1)
        r = b - A*x                 DOUBLE   O(n^2)
    END

# Wilkinson, Moler, Stewart, & Higham provide error bounds for single-precision results refined using double precision.
# It can be shown that using this approach we can compute the solution to 64-bit floating point accuracy.
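A runnable NumPy/SciPy sketch of this loop: the O(n^3) factorization is done in float32 and the residual/update in float64 (a sketch of the idea, not the LAPACK mixed-precision routine itself):

    # Mixed-precision iterative refinement: factor in float32, refine in float64.
    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(1)
    n = 2000
    A = rng.standard_normal((n, n)) + n * np.eye(n)        # well-conditioned test matrix
    b = rng.standard_normal(n)

    lu32, piv = lu_factor(A.astype(np.float32))            # O(n^3) work in single precision
    x = lu_solve((lu32, piv), b.astype(np.float32)).astype(np.float64)

    for it in range(10):
        r = b - A @ x                                      # residual in double precision
        if np.linalg.norm(r) <= 1e-12 * np.linalg.norm(b):
            break
        z = lu_solve((lu32, piv), r.astype(np.float32))    # correction, single precision
        x = x + z                                          # update in double precision

    print(f"relative residual after refinement: {np.linalg.norm(b - A @ x) / np.linalg.norm(b):.2e}")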

• Requires extra storage: total is 1.5 times normal
• O(n^3) work is done in lower precision
• O(n^2) work is done in higher precision
• Problems if the matrix is ill-conditioned in single precision (condition number around 10^8)

Solving general dense linear systems using mixed precision iterative refinement

[Chart: performance (Gflop/s) versus matrix size for the SP solve and the DP solve. Hardware: GPU K20c (13 MP @ 0.7 GHz, peak 1165 GFlop/s); CPU Genuine Intel (2 x 8 cores @ 2.60 GHz, peak 333 GFlop/s).]

Solving general dense linear systems using mixed precision iterative refinement

[Chart: the same comparison with the mixed-precision iteratively refined DP solve added (SP solve, DP solve with MP iterative refinement, DP solve), on the same K20c GPU / Intel CPU system, over a range of matrix sizes.]

[Chart: performance versus matrix size (2,500 to 20,000) for CPOSV, ZCPOSV (mixed precision), and ZPOSV, annotated with a 26x speedup, on a GPU TITAN X (3,072 CUDA cores @ 1.076 GHz, Maxwell; Z/C GEMM peak ~190 / 5,600 GFlop/s) with an Intel Xeon CPU (2 x 6 cores).]

Methods for Matrix Approximation via Random Sampling

• Random sampling algorithms address the size problem by generating much smaller approximate representations of large matrices
  – These are then solved using classic linear algebra algorithms, on platforms that would not otherwise be usable for this purpose
• They deliver results that are approximate solutions
  – They are faster, easier to code, easier to map to modern computer architectures, often more numerically robust, and display a higher level of parallelism than traditional algorithms

Computing a rank-k approximation of A, A ≈ A_k = U_k V_k^T, plays an important role in many applications with different constraints:
– SVD
  • A_k = U_k Σ_k V_k^T to minimize ||A - A_k||
  • Used for latent semantic indexing and population clustering [SC'15]
– CX, and then rank-revealing QR
  • A_k = A(:, J_{k+l}) X_{k+l}, and then A P ≈ Q_k R_k
  • With datasets from the HapMap project [SC'15]
– CUR
  • A_k = A(:, J_k) U_k A(I_k, :) by ACA, etc.
  • Data analysis, linear solvers, etc.
– Matrix completion
  • Minimize Σ_{i,j} (A(i,j) - A_k(i,j))^2 over A(i,j) ≠ 0, by ALS, SGD, etc.
  • Recommendation systems, missing data, etc.
– etc.
• Different algorithms exist for computing these approximations, e.g., for the SVD
• Increasing interest in randomized algorithms
  – Demand from architecture (e.g., high cost of data access)
  – Demand from data (e.g., processing "Big Data")
  – Aim: extract as much information as possible from each data access
  – May not work in all cases; good for low-rank matrices where a full factorization is overkill

Random sampling framework: normalized block power iteration

• Iterate to improve the approximation
  – Sample A and A^T (e.g., SpMM) to approximate the left and right singular vectors
• Block to improve convergence
  – Q has k+l columns, with oversampling parameter l
• "Normalize" to maintain stability
  – Many options, e.g., different algorithms, one-sided
• Potential to integrate different randomization/sampling schemes
  – Gaussian random sampling (SpMM/GEMM with a Gaussian random matrix Q)
  – "Pruned" FFT instead of GEMM [SC'14]
  – Sampling based on a probabilistic distribution (e.g., population clustering by SNP, leverage scores)
• Nice relation to "traditional" algorithms, e.g., block (CA) Lanczos for bidiagonalization [IEEE BigData workshop'14]
• If sampling converges to the desired accuracy in a couple of iterations, it can outperform the others, especially if data access is expensive (a sketch follows below)
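A compact NumPy sketch of the normalized block power iteration with Gaussian sampling, oversampling, and QR as the normalization step (illustrative only; parameter values are arbitrary):

    # Randomized rank-k approximation via normalized block power iteration.
    import numpy as np

    def randomized_lowrank(A, k, oversample=8, iters=3, seed=None):
        rng = np.random.default_rng(seed)
        m, n = A.shape
        Q = rng.standard_normal((n, k + oversample))     # Gaussian random sampling block
        for _ in range(iters):
            Q, _ = np.linalg.qr(A @ Q)                   # sample A, then normalize
            Q, _ = np.linalg.qr(A.T @ Q)                 # sample A^T, then normalize
        Y = A @ Q                                        # project A onto the sampled subspace
        Uy, s, Vt = np.linalg.svd(Y, full_matrices=False)
        return Uy[:, :k], s[:k], (Vt @ Q.T)[:k, :]       # approximate truncated SVD of A

    A = np.random.default_rng(0).standard_normal((2000, 500)) @ np.diag(1.0 / np.arange(1, 501))
    U, s, Vt = randomized_lowrank(A, k=20)
    err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
    print(f"relative error of rank-20 approximation: {err:.2e}")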

Singular value distribution of an SNP matrix (a tall-skinny matrix, O(100K x 500))

[Chart: singular values (roughly 10^2 to 10^4) versus singular value number, 1 to 500.]

• Approximately "low-rank-plus-shift"
  – Big enough gap to obtain the desired (low) accuracy in a couple of iterations
  – Small enough gap for stability (e.g., CholQR for orthogonalization)
• Many matrices of interest seem to have such distributions

Case studies: population clustering and latent semantic indexing

[Left: population clustering of HapMap SNP data (JPT, CHB, MEX, CHD populations), average clustering correlation coefficient = 0.65. Right: latent semantic indexing, average precision versus number of documents (roughly 532 to 1032) for the Recompute, Update, Update-inc, and Sample-2 strategies.]

• Desired accuracy obtained after a couple of iterations

Performance studies on a hybrid CPU/GPU architecture

[Charts: time breakdowns for the Update and Random-1 variants (components include SVD, TSQR, GEMM, SpMM, QRCP/QP3, QR, PRNG, sampling, iteration GEMM, iteration orthogonalization, other), with speedups of 13.4x on 3 GPUs, 14.1x on 12 GPUs, and 7.7x on 48 GPUs; and time versus number of rows m, up to 50,000.]

– Sparse truncated SVD on a supercomputer: each node with two six-core Intel Xeon CPUs and three K20Xm GPUs
– Dense truncated RRQR on one node with two eight-core Intel Sandy Bridge CPUs and three NVIDIA Tesla K40c GPUs

♦ Classical analysis of algorithms is no longer valid: the number of floating-point operations ≠ computation time.
♦ Algorithms and software must take advantage of this by reducing data movement.
  ! Need latency tolerance in our algorithms
♦ Communication- and synchronization-reducing algorithms and software are critical as parallelism grows.
♦ Hardware presents a dynamically changing environment (Turbo Boost, OS jitter).
♦ Many existing algorithms can't fully exploit the features of modern architectures.

• Major challenges are ahead for extreme computing:
  # Parallelism: O(10^9)
    • Programming issues; hybrid architectures
  # Peak and HPL may be very misleading
    • Nowhere near close to peak for most applications
  # Fault tolerance
    • Today the Sequoia BG/Q node failure rate is 1.25 failures/day
  # Power
    • Need 50 Gflops/W (today at 2 Gflops/W)
• We will need completely new approaches and technologies to reach the exascale level