
CS 267: Dense Linear Algebra – Parallel Gaussian Elimination
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr16

Outline
• Review Gaussian Elimination (GE) for solving Ax=b
• Optimizing GE for caches on sequential machines
  - using matrix-matrix multiplication (BLAS and LAPACK)
• Minimizing communication for sequential GE
  - Not LAPACK, but Recursive LU minimizes bandwidth (latency possible)
• Data layouts on parallel machines
• Parallel Gaussian Elimination (ScaLAPACK)
• Minimizing communication for parallel GE
  - Not ScaLAPACK (yet), but "Comm-Avoiding LU" (CALU)
  - Same idea for minimizing bandwidth and latency in sequential case
• Summarize rest of dense linear algebra
• Dynamically scheduled LU for Multicore
• LU for Heterogeneous computers (CPU + GPU)


Gaussian Elimination (GE) for solving Ax=b
• Add multiples of each row to later rows to make A upper triangular
• Solve the resulting triangular system Ux = c by substitution
• Initial version:

    … for each column i
    … zero it out below the diagonal by adding multiples of row i to later rows
    for i = 1 to n-1
       … for each row j below row i
       for j = i+1 to n
          … add a multiple of row i to row j
          tmp = A(j,i)
          for k = i to n
             A(j,k) = A(j,k) - (tmp/A(i,i)) * A(i,k)

Refine GE (1)
• Remove computation of the constant tmp/A(i,i) from the inner loop:

    for i = 1 to n-1
       for j = i+1 to n
          m = A(j,i)/A(i,i)
          for k = i to n
             A(j,k) = A(j,k) - m * A(i,k)

(Figure: sparsity pattern of A after i=1, i=2, i=3, …, n-1; zeros fill in below the diagonal, column by column.)


Refine GE Algorithm (2)
• Last version:

    for i = 1 to n-1
       for j = i+1 to n
          m = A(j,i)/A(i,i)
          for k = i to n
             A(j,k) = A(j,k) - m * A(i,k)

• Don't compute what we already know: the zeros below the diagonal in column i

    for i = 1 to n-1
       for j = i+1 to n
          m = A(j,i)/A(i,i)
          for k = i+1 to n
             A(j,k) = A(j,k) - m * A(i,k)

(Do not compute the zeros.)

Refine GE Algorithm (3)
• Last version:

    for i = 1 to n-1
       for j = i+1 to n
          m = A(j,i)/A(i,i)
          for k = i+1 to n
             A(j,k) = A(j,k) - m * A(i,k)

• Store the multipliers m below the diagonal, in the zeroed entries, for later use

    for i = 1 to n-1
       for j = i+1 to n
          A(j,i) = A(j,i)/A(i,i)
          for k = i+1 to n
             A(j,k) = A(j,k) - A(j,i) * A(i,k)

(Store m here, in the zeroed-out part of column i.)

Refine GE Algorithm (4)
• Last version:

    for i = 1 to n-1
       for j = i+1 to n
          A(j,i) = A(j,i)/A(i,i)
          for k = i+1 to n
             A(j,k) = A(j,k) - A(j,i) * A(i,k)

• Split the j loop: compute all multipliers of column i first, then update the rest of the matrix

    for i = 1 to n-1
       for j = i+1 to n
          A(j,i) = A(j,i)/A(i,i)
       for j = i+1 to n
          for k = i+1 to n
             A(j,k) = A(j,k) - A(j,i) * A(i,k)

(Store all m's here before updating the rest of the matrix.)

Refine GE Algorithm (5)
• Last version: the split-loop code above
• Express using matrix operations (BLAS):

    for i = 1 to n-1
       A(i+1:n,i) = A(i+1:n,i) * ( 1 / A(i,i) )        … BLAS 1 (scale a vector)
       A(i+1:n,i+1:n) = A(i+1:n , i+1:n )
                        - A(i+1:n , i) * A(i , i+1:n)   … BLAS 2 (rank-1 update)
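As a concrete rendering of this BLAS2 formulation, here is a minimal NumPy sketch (the function name ge_in_place is mine, not from LAPACK); it assumes no zero pivots are hit, since there is no pivoting yet:

    import numpy as np

    def ge_in_place(A):
        """Unpivoted GE: overwrite A with the multipliers (strictly lower part,
        unit diagonal implicit) and U (upper triangle)."""
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                                 # BLAS 1: scale a vector
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])     # BLAS 2: rank-1 update
        return A

    A = np.array([[4., 3.], [6., 3.]])
    ge_in_place(A)   # lower part now holds the multipliers, upper part holds U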

What GE really computes

    for i = 1 to n-1
       A(i+1:n,i) = A(i+1:n,i) / A(i,i)                 … BLAS 1 (scale a vector)
       A(i+1:n,i+1:n) = A(i+1:n , i+1:n )
                        - A(i+1:n , i) * A(i , i+1:n)    … BLAS 2 (rank-1 update)

• Call the strictly lower triangular matrix of multipliers M, and let L = I+M
• Call the upper triangle of the final matrix U
• Lemma (LU Factorization): if the above algorithm terminates (does not divide by zero), then A = L*U
• Solving A*x=b using GE:
  - Factorize A = L*U using GE (cost = 2/3 n^3 flops)
  - Solve L*y = b for y, using substitution (cost = n^2 flops)
  - Solve U*x = y for x, using substitution (cost = n^2 flops)
• Thus A*x = (L*U)*x = L*(U*x) = L*y = b as desired

Problems with basic GE algorithm
• What if some A(i,i) is zero? Or very small?
  - The result may not exist, or be "unstable", so we need to pivot
• The current computation is all BLAS 1 or BLAS 2, but we know that BLAS 3 (matrix multiply) is fastest (earlier lectures…)
(Figure: BLAS 3 performance approaches machine peak, well above BLAS 2 and BLAS 1.)
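A short companion sketch of the two substitution steps, reusing the hypothetical ge_in_place from above together with SciPy's triangular solver:

    import numpy as np
    from scipy.linalg import solve_triangular

    def lu_solve(LU, b):
        """Solve A x = b given the packed factors produced by ge_in_place(A)."""
        n = LU.shape[0]
        L = np.tril(LU, -1) + np.eye(n)             # unit lower triangular factor
        U = np.triu(LU)                             # upper triangular factor
        y = solve_triangular(L, b, lower=True)      # forward substitution: L y = b
        return solve_triangular(U, y, lower=False)  # back substitution:    U x = y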


Pivoting in Gaussian Elimination
• A = [ 0 1 ; 1 0 ] fails completely because we can't divide by A(1,1)=0
• But solving Ax=b should be easy!
• When the diagonal entry A(i,i) is tiny (not just zero), the algorithm may terminate but get a completely wrong answer
  - Numerical instability
  - Roundoff error is the cause
• Cure: pivot (swap rows of A) so that A(i,i) is large

Gaussian Elimination with Partial Pivoting (GEPP)
• Partial pivoting: swap rows so that A(i,i) is the largest entry in its column

    for i = 1 to n-1
       find and record k where |A(k,i)| = max{i ≤ j ≤ n} |A(j,i)|
          … i.e. largest entry in the rest of column i
       if |A(k,i)| = 0
          exit with a warning that A is singular, or nearly so
       elseif k ≠ i
          swap rows i and k of A
       end if
       A(i+1:n,i) = A(i+1:n,i) / A(i,i)       … each |quotient| ≤ 1
       A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) - A(i+1:n , i) * A(i , i+1:n)

• Lemma: this algorithm computes A = P*L*U, where P is a permutation matrix
• This algorithm is numerically stable in practice
• For details see the LAPACK code at http://www.netlib.org/lapack/single/sgetf2.f
• Standard approach – but what about communication costs?
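A minimal NumPy rendering of the GEPP pseudocode (the name gepp and the return convention are mine, not LAPACK's); it also accepts a tall m x n panel, and returns a row-index permutation p such that A[p] = L*U:

    import numpy as np

    def gepp(A):
        """Partial-pivoted LU of an m-by-n matrix (m >= n): returns permutation p and
        A overwritten by the multipliers (below the diagonal) and U (on and above)."""
        A = np.array(A, dtype=float)
        m, n = A.shape
        p = np.arange(m)
        for i in range(min(m - 1, n)):
            k = i + np.argmax(np.abs(A[i:, i]))     # largest entry in rest of column i
            if A[k, i] == 0:
                raise ValueError("matrix is singular, or nearly so")
            if k != i:                              # swap rows i and k (of A and of p)
                A[[i, k]] = A[[k, i]]
                p[[i, k]] = p[[k, i]]
            A[i+1:, i] /= A[i, i]                   # each |quotient| <= 1
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return p, A

Given p and the packed factors, A_original[p] equals (np.tril(packed, -1) + np.eye(m)) @ np.triu(packed), i.e. P*A = L*U.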


Converting BLAS2 to BLAS3 in GEPP
• Blocking
  - Used to optimize matrix multiplication
  - Harder here because of the data dependencies in GEPP
• BIG IDEA: delayed updates
  - Save the updates to the "trailing matrix" from several consecutive BLAS2 (rank-1) updates
  - Apply many updates simultaneously in one BLAS3 (matmul) operation
• The same idea works for much of dense linear algebra
  - Not eigenvalue problems or the SVD – those need more ideas
• First approach: need to choose a block size b
  - The algorithm will save and apply b updates
  - b should be small enough that an active submatrix consisting of b columns of A fits in cache
  - b should be large enough to make BLAS3 (matmul) fast

Blocked GEPP (www.netlib.org/lapack/single/sgetrf.f)

    for ib = 1 to n-1 step b        … Process the matrix b columns at a time
       end = ib + b-1               … Point to the end of the block of b columns
       apply the BLAS2 version of GEPP to get A(ib:n , ib:end) = P' * L' * U'
       … let LL denote the strict lower triangular part of A(ib:end , ib:end) + I
       A(ib:end , end+1:n) = LL^-1 * A(ib:end , end+1:n)   … update next b rows of U
       A(end+1:n , end+1:n) = A(end+1:n , end+1:n)
            - A(end+1:n , ib:end) * A(ib:end , end+1:n)
       … apply the delayed updates with a single matrix multiply with inner dimension b

(For a correctness proof, see the on-line notes from CS267 / 1996.)

Efficiency of Blocked GEPP (all parallelism "hidden" inside the BLAS)
(Figure: bars of Efficiency = Speed(LAPACK LU)/Speed(best effort), Speed(Matmul)/HW Peak, and Speed(LAPACK LU)/Speed(MatMul) on Cnvx C4 (1 p), Cnvx C4 (4 p), Cray C90 (1 p), Cray C90 (16 p), Alpha, RS6000, SGI, and PC.)
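A NumPy sketch of the delayed-update (blocked) idea, reusing the gepp sketch above as the panel factorization. Unlike the condensed pseudocode, it also applies each panel's row swaps across the full rows, as the real routine does; it is an illustration of the structure, not sgetrf itself:

    import numpy as np

    def blocked_gepp(A, b=64):
        """Right-looking blocked LU with partial pivoting: returns perm p and packed factors."""
        A = np.array(A, dtype=float)
        n = A.shape[0]
        p = np.arange(n)
        for ib in range(0, n, b):
            end = min(ib + b, n)
            pp, panel = gepp(A[ib:, ib:end])          # BLAS2 GEPP on the b-column panel
            A[ib:, :] = A[ib:, :][pp]                 # apply the panel's row swaps to full rows
            p[ib:] = p[ib:][pp]
            A[ib:, ib:end] = panel
            LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
            # Update the next b rows of U:  A(ib:end, end+1:n) = LL^-1 * A(ib:end, end+1:n)
            A[ib:end, end:] = np.linalg.solve(LL, A[ib:end, end:])
            # Apply the delayed updates with a single matrix multiply of inner dimension b
            A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
        return p, A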



Communication Lower Bound for GE
• Matrix multiplication can be "reduced to" GE
• Not a good way to do matmul, but it shows that GE needs at least as much communication as matmul:

    [ I  0  -B ]     [ I        ]   [ I  0  -B  ]
    [ A  I   0 ]  =  [ A  I     ] · [    I  A·B ]
    [ 0  0   I ]     [ 0  0  I  ]   [        I  ]

• Does blocked GEPP minimize communication?

Does LAPACK's GEPP Minimize Communication?
(the blocked GEPP code from the previous slide)
• Case 1: n ≥ M – huge matrix – attains the lower bound
  - b = M^(1/2) is optimal, dominated by matmul
• Case 2: n ≤ M^(1/2) – small matrix – attains the lower bound
  - The whole matrix fits in fast memory; any algorithm attains the lower bound
• Case 3: M^(1/2) < n < M – medium-size matrix – not optimal
  - Can't choose b to simultaneously optimize matmul and the BLAS2 GEPP of the n x b submatrix
  - Worst case: exceeds the lower bound by a factor of M^(1/6) when n = M^(2/3)
• Detailed counting on backup slides
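A quick NumPy check of that block identity, relying on the earlier ge_in_place sketch; unpivoted GE on this block matrix never meets a zero pivot, and A·B shows up in the middle-right block of U:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 3
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    I, Z = np.eye(n), np.zeros((n, n))
    T = np.block([[I, Z, -B], [A, I, Z], [Z, Z, I]])
    ge_in_place(T)                                         # unpivoted GE, as sketched earlier
    assert np.allclose(np.triu(T)[n:2*n, 2*n:], A @ B)     # A*B appears inside U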


Alternative cache-oblivious GE formulation (1/2)
• Toledo (1997)
  - Described without pivoting for simplicity
  - "Do the left half of the matrix, then the right half"

    function [L,U] = RLU (A)       … assume A is m by n
    if (n=1)  L = A/A(1,1), U = A(1,1)
    else
       [L1,U1] = RLU( A(1:m , 1:n/2) )          … do the left half of A
          … let L11 denote the top n/2 rows of L1
       A( 1:n/2 , n/2+1:n ) = L11^-1 * A( 1:n/2 , n/2+1:n )
          … update the top n/2 rows of the right half of A
       A( n/2+1:m , n/2+1:n ) = A( n/2+1:m , n/2+1:n )
            - A( n/2+1:m , 1:n/2 ) * A( 1:n/2 , n/2+1:n )
          … update the rest of the right half of A
       [L2,U2] = RLU( A(n/2+1:m , n/2+1:n) )    … do the right half of A
       return [ L1, [0;L2] ] and [ U1, [ A(.,.) ; U2 ] ]

Alternative cache-oblivious GE formulation (2/2)
• Cost recurrence:

    W(m,n) = W(m,n/2) + O(max(m·n, m·n^2/M^(1/2))) + W(m-n/2, n/2)
           ≤ 2 · W(m,n/2) + O(max(m·n, m·n^2/M^(1/2)))
           = O(m·n^2/M^(1/2) + m·n·log M)
           = O(m·n^2/M^(1/2))  if M^(1/2)·log M = O(n)

• Still doesn't minimize latency, but that is fixable
• CLASS PROJECT
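The same recursion as a runnable NumPy sketch (the function name rlu is mine); no pivoting, so it assumes no zero pivots are encountered and that m ≥ n:

    import numpy as np
    from scipy.linalg import solve_triangular

    def rlu(A):
        """Recursive unpivoted LU of an m-by-n matrix (m >= n): returns (L, U)."""
        m, n = A.shape
        if n == 1:
            return A / A[0, 0], A[:1, :1].copy()
        h = n // 2
        L1, U1 = rlu(A[:, :h])                               # do the left half of A
        L11 = L1[:h, :]                                      # top h rows of L1
        U12 = solve_triangular(L11, A[:h, h:], lower=True, unit_diagonal=True)
        S = A[h:, h:] - L1[h:, :] @ U12                      # update rest of right half
        L2, U2 = rlu(S)                                      # do the right half
        L = np.hstack([L1, np.vstack([np.zeros((h, n - h)), L2])])
        U = np.vstack([np.hstack([U1, U12]),
                       np.hstack([np.zeros((n - h, h)), U2])])
        return L, U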

Explicitly Parallelizing Gaussian Elimination
• Parallelization steps
  - Decomposition: identify enough parallel work, but not too much
  - Assignment: load balance work among threads
  - Orchestrate: communication and synchronization
  - Mapping: which processors execute which threads (locality)
• Decomposition
  - In the BLAS 2 algorithm nearly every flop in the inner loop can be done in parallel, so with n^2 processors we need 3n parallel steps, O(n log n) with pivoting

    for i = 1 to n-1
       A(i+1:n,i) = A(i+1:n,i) / A(i,i)                 … BLAS 1 (scale a vector)
       A(i+1:n,i+1:n) = A(i+1:n , i+1:n )
                        - A(i+1:n , i) * A(i , i+1:n)    … BLAS 2 (rank-1 update)

  - This is too fine-grained; prefer calls to local matmuls instead
  - Need to use parallel matrix multiplication
• Assignment and Mapping
  - Which processors are responsible for which submatrices?

Different Data Layouts for Parallel GE
1) 1D Column Blocked Layout – bad load balance: P0 is idle after the first n/4 steps
2) 1D Column Cyclic Layout – load balanced, but can't easily use BLAS3
3) 1D Column Block Cyclic Layout – can trade load balance against BLAS3 performance by choosing b, but the factorization of each block column is a bottleneck
4) Block Skewed Layout – complicated addressing; may not want full parallelism in each column and row
5) 2D Row and Column Blocked Layout – bad load balance: P0 is idle after the first n/2 steps
6) 2D Row and Column Block Cyclic Layout – the winner!
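To make layout 6) concrete, here is a tiny helper (names are mine, not ScaLAPACK's) that computes which process in a prow x pcol grid owns matrix entry (i, j) under a 2D block cyclic layout with b x b blocks:

    def block_cyclic_owner(i, j, b, prow, pcol):
        """Return (process row, process column) that owns global entry (i, j)
        in a 2D block cyclic layout with b-by-b blocks on a prow-by-pcol grid."""
        return (i // b) % prow, (j // b) % pcol

    # Example: with b=2 on a 2x2 grid, blocks cycle over the four processes
    # exactly as in the "2D Row and Column Block Cyclic Layout" picture.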

Distributed GE with a 2D Block Cyclic Layout
(Figure: the steps of blocked GEPP on a 2D block cyclic layout; the trailing-matrix update is a distributed matrix multiply: green = green - pink * blue.)


Review of Parallel MatMul
• Want a large problem size per processor
• PDGEMM = PBLAS matrix multiply
• Observations:
  - For fixed N, as P increases, Mflops increases, but with less than 100% efficiency
  - For fixed P, as N increases, Mflops (efficiency) rises
• DGEMM = BLAS routine for matrix multiply
• Maximum speed for PDGEMM = #procs * speed of DGEMM
• Observations (from the data):
  - Efficiency is always at least 48%
  - For fixed N, as P increases, efficiency drops
  - For fixed P, as N increases, efficiency increases

PDGESV = ScaLAPACK Parallel LU
• Since it can run no faster than its inner loop (PDGEMM), we measure: Efficiency = Speed(PDGESV)/Speed(PDGEMM)
• Observations:
  - Efficiency is well above 50% for large enough problems
  - For fixed N, as P increases, efficiency decreases (just as for PDGEMM)
  - For fixed P, as N increases, efficiency increases (just as for PDGEMM)
  - From the bottom table, the cost of solving Ax=b is about half that of matrix multiply for large enough matrices
  - From the flop counts we would expect it to be (2·n^3)/(2/3·n^3) = 3 times faster, but communication makes it a little slower


Does ScaLAPACK Minimize Communication?
• Lower bound: O(n^2 / P^(1/2)) words sent in O(P^(1/2)) messages
  - Attained by Cannon and SUMMA (nearly) for matmul
• ScaLAPACK:
  - O(n^2 log P / P^(1/2)) words sent – close enough
  - O(n log P) messages – too large
  - Why so many? One reduction costs O(log P) per column to find the maximum pivot, times n = #columns
• Need to abandon partial pivoting to reduce #messages
  - Suppose we have an n x n matrix on a P^(1/2) x P^(1/2) processor grid
  - Goal: for each panel of b columns spread over P^(1/2) procs, identify b "good" pivot rows in one reduction
  - Call this factorization TSLU = "Tall Skinny LU"
  - Several natural but bad (numerically unstable) ways were explored; a good way exists
• SC08, "Communication Avoiding GE", D., Grigori, Xiang

Choosing Rows by "Tournament Pivoting"

    W (n x b) = [ W1 ]   [ P1·L1·U1 ]   Choose b pivot rows of W1, call them W1'
                [ W2 ] = [ P2·L2·U2 ]   Choose b pivot rows of W2, call them W2'
                [ W3 ]   [ P3·L3·U3 ]   Choose b pivot rows of W3, call them W3'
                [ W4 ]   [ P4·L4·U4 ]   Choose b pivot rows of W4, call them W4'

    [ W1' ; W2' ] = P12·L12·U12          Choose b pivot rows, call them W12'
    [ W3' ; W4' ] = P34·L34·U34          Choose b pivot rows, call them W34'

    [ W12' ; W34' ] = P1234·L1234·U1234  Choose b pivot rows

• Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting)
• Not the same pivot rows as chosen by GEPP
• Need to show it is numerically stable (D., Grigori, Xiang, '11)
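A small NumPy sketch of tournament pivoting for one n x b panel, reusing the earlier gepp sketch to pick b candidate rows at each node of a binary reduction tree (the function names are mine; this illustrates the idea, not the CALU code, and assumes every block has at least b rows):

    import numpy as np

    def select_b_rows(W, rows, b):
        """Pick b pivot rows of the block W by GEPP; return their original indices."""
        p, _ = gepp(W)                   # gepp sketch from earlier: W[p] = L*U
        return rows[p[:b]]

    def tournament_pivot_rows(W, b, nblocks=4):
        """Tournament pivoting on an n-by-b panel: b pivot rows via a reduction tree."""
        chunks = np.array_split(np.arange(W.shape[0]), nblocks)
        cand = [select_b_rows(W[c], c, b) for c in chunks]     # leaves of the tree
        while len(cand) > 1:                                   # pairwise reduction
            nxt = []
            for j in range(0, len(cand), 2):
                rows = np.concatenate(cand[j:j+2])
                nxt.append(select_b_rows(W[rows], rows, b))
            cand = nxt
        return cand[0]     # indices of the b "good" pivot rows of W

The b winners are then moved to the top of the panel, and the panel is factored without further pivoting, as on the slide.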

Minimizing Communication in TSLU
• W = [ W1 ; W2 ; W3 ; W4 ]
  - Parallel: do LU of each Wi independently, then pairwise LUs of the selected rows up a binary reduction tree
  - Sequential: do LU of W1, then fold in W2, W3, W4 one at a time (a flat tree)
  - Dual Core: a hybrid of the two trees
• Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
  - Can choose the reduction tree dynamically

Same idea for QR of a Tall-skinny matrix (TSQR)
• Replace LU by QR at every node of the same parallel, sequential, and dual-core reduction trees
• First step of the SVD of a Tall-Skinny matrix
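A compact NumPy sketch of one level of the parallel TSQR tree: the local QRs are independent, and the QR of the stacked R factors reproduces the R of the whole tall-skinny matrix up to row signs (the function name tsqr_r is mine):

    import numpy as np

    def tsqr_r(W, nblocks=4):
        """R factor of a tall-skinny W via a one-level TSQR reduction
        (the local QRs could run in parallel)."""
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(B, mode='r') for B in blocks]    # independent local QRs
        return np.linalg.qr(np.vstack(Rs), mode='r')        # QR of the stacked R factors

    W = np.random.default_rng(1).standard_normal((1000, 8))
    R1 = tsqr_r(W)
    R2 = np.linalg.qr(W, mode='r')
    assert np.allclose(np.abs(R1), np.abs(R2))              # equal up to row signs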

Performance vs ScaLAPACK LU
• TSLU
  - IBM Power 5: up to 4.37x faster (16 procs, 1M x 150)
  - Cray XT4: up to 5.52x faster (8 procs, 1M x 150)
• CALU
  - IBM Power 5: up to 2.29x faster (64 procs, 1000 x 1000)
  - Cray XT4: up to 1.81x faster (64 procs, 1000 x 1000)
• See INRIA Tech Report 6523 (2008), paper at SC08

TSQR Performance Results
• Parallel
  - Intel Clovertown: up to 8x speedup (8 cores, dual socket, 10M x 10)
  - Pentium III cluster, Dolphin Interconnect, MPICH: up to 6.7x speedup (16 procs, 100K x 200)
  - BlueGene/L: up to 4x speedup (32 procs, 1M x 50)
  - Tesla C2050 / Fermi: up to 13x (110,592 x 100)
  - Grid: 4x on 4 cities vs 1 city (Dongarra, Langou et al)
  - Cloud: ~2 map-reduces (Gleich and Benson)
• Sequential
  - "Infinite speedup" for out-of-core on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
    • LAPACK with virtual memory never finished
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
• Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson

CALU speedup prediction for a Petascale machine – up to 81x faster
• P = 8192
• Petascale machine with 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s
• γ = 2·10^-12 s, α = 10^-5 s, β = 2·10^-9 s/word
(Figure: predicted CALU speedup over ScaLAPACK LU across matrix sizes.)

Summary of dense sequential O(n^3) algorithms attaining communication lower bounds
• References are from Table 3.1 in "Communication lower bounds and optimal algorithms for numerical linear algebra", Ballard et al, 2014
• #words moved = Ω(n^3/M^(1/2)), #messages = Ω(n^3/M^(3/2))
• Cache-oblivious, Ours, LAPACK, Randomized

    Computation | 2-Level Memory:            | Multiple Levels:
                | Min #Words    Min #Msgs    | Min #Words   Min #Msgs
    BLAS-3      | [1,2]         [1,2]        | [1,2]        [1,2]
    Cholesky    | [3,4,5,6]     [3,5,6]      | [3,5,6]      [3,5,6]
    LU          | [6,7,8,9]     [7,8]        | [6,7,9]      [7]
    Sym Indef   | [10]          [10]         | [10]         [10]
    QR          | [7,11,12,13]  [7,11,13]    | [7,12,13]    [7,13]
    Eig(A=A^T)  | [14,15]       [14,15]      | [14]         [14]
    SVD         | [14,15,16]    [14,15,16]   | [14,16]      [14,16]
    Eig(A)      | [14]          [14]         | [14]         [14]

Summary of dense parallel O(n^3/p) algorithms attaining communication lower bounds
• Assume n x n matrices on p processors, minimum memory per proc: M = O(n^2/p)
• References are from Table 3.2 in "Communication lower bounds and optimal algorithms for numerical linear algebra", Ballard et al, 2014
• #words moved = Ω(n^2/p^(1/2)), #messages = Ω(p^(1/2))
• Ours, ScaLAPACK, Randomized
• ScaLAPACK sends > n/p^(1/2) times too many messages (except Cholesky)

    Computation           | Minimizes #Words | Minimizes #Messages
    BLAS3                 | [1,2,3,4]        | [1,2,3,4]
    Cholesky              | [2]              | [2]
    LU                    | [2,5,10,11]      | [5,10,11]
    Symmetric Indefinite  | [2,6,9]          | [6,9]
    QR                    | [2,7]            | [7]
    Eig(A=A^T) and SVD    | [2,8,9]          | [8,9]
    Eig(A)                | [8]              | [8]

• CLASS PROJECTS

Can we do even better?
• Assume n x n matrices on p processors
• Use c copies of the data: M = O(c·n^2/p) per processor
• Increasing M reduces the lower bounds:
  - #words_moved = Ω( (n^3/p) / M^(1/2) ) = Ω( n^2 / (c^(1/2)·p^(1/2)) )
  - #messages    = Ω( (n^3/p) / M^(3/2) ) = Ω( p^(1/2) / c^(3/2) )
• Attainable for Matmul
• Not attainable for LU, Cholesky, QR
  - Thm: #words_moved * #messages = Ω( n^2 )
  - Lowering #words by a factor c^(1/2) must increase #messages by the same factor
  - Cor: perfect strong scaling is impossible for LU, Cholesky, QR
• Both bounds attainable for Cholesky, LU, QR (via CholeskyQR):
  - #words_moved = Ω( n^2 / (c^(1/2)·p^(1/2)) )
  - #messages = Ω( c^(1/2)·p^(1/2) )
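Purely illustrative arithmetic with the formulas above (function name mine), showing that c-fold replication lowers the bandwidth bound by c^(1/2) while the message count of the attainable algorithms grows by the same factor:

    import math

    def bounds_25d(n, p, c):
        """Asymptotic 2.5D quantities (constants dropped) for c replicas of the data."""
        words_lb = n**2 / math.sqrt(c * p)     # Omega(n^2 / (c^(1/2) p^(1/2)))
        msgs_lb = math.sqrt(p) / c**1.5        # Omega(p^(1/2) / c^(3/2))
        msgs_attained = math.sqrt(c * p)       # per the last bullet above
        return words_lb, msgs_lb, msgs_attained

    for c in (1, 4, 16):
        print(c, bounds_25d(n=2**17, p=8192, c=c))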

LU Speedups from 2.5D vs 2D LU
(Figure: LU on 16,384 nodes of BG/P, n = 131,072; time in seconds split into communication, idle, and compute for NO-pivot 2D, NO-pivot 2.5D, CA-pivot 2D, and CA-pivot 2.5D; 2.5D is about 2x faster both without pivoting and with CA-pivoting.)

Tournament Pivoting and 2.5D, With and Without Pivoting
(Figure: 2.5D LU with CA-pivoting on BG/P, n = 65,536; percentage of machine peak vs #nodes (256, 512, 1024, 2048) for 2.5D LU (CA-pvt), 2D LU (CA-pvt), and ScaLAPACK PDGETRF.)

Dense Linear Algebra on Recent Architectures
• Multicore
  - How do we schedule all the parallel tasks to minimize idle time?
• GPUs
  - Heterogeneous computer: consists of functional units (CPU and GPU) that are good at different tasks
  - How do we divide the work between the GPU and CPU to take maximal advantage of both?
  - Challenging now; will get more so as platforms become more heterogeneous

Multicore: Expressing Parallelism with a DAG
• DAG = Directed Acyclic Graph
  - S1 → S2 means statement S2 "depends on" statement S1
  - Can execute in parallel any Si without input dependencies
• For simplicity, consider Cholesky A = LL^T, not LU
  - N by N matrix, numbered from A(0,0) to A(N-1,N-1)
  - "Left looking" code: at step k, completely compute column k of L

    for k = 0 to N-1
       for n = 0 to k-1
          A(k,k) = A(k,k) - A(k,n)*A(k,n)
       A(k,k) = sqrt(A(k,k))
       for m = k+1 to N-1
          for n = 0 to k-1
             A(m,k) = A(m,k) - A(m,n)*A(k,n)
          A(m,k) = A(m,k) / A(k,k)


Expressing Parallelism with a DAG – Cholesky

    for k = 0 to N-1
       for n = 0 to k-1
          S1(k,n):   A(k,k) = A(k,k) - A(k,n)*A(k,n)
       S2(k):        A(k,k) = sqrt(A(k,k))
       for m = k+1 to N-1
          for n = 0 to k-1
             S3(k,m,n):   A(m,k) = A(m,k) - A(m,n)*A(k,n)
          S4(k,m):        A(m,k) = A(m,k) · A(k,k)^-1

• The DAG has ≈ N^3/6 vertices:
  - S1(k,n) → S2(k)        for n = 0:k-1
  - S3(k,m,n) → S4(k,m)    for n = 0:k-1
  - S2(k) → S4(k,m)        for m = k+1:N
  - S4(k,m) → S3(k',m,k)   for k' > k
  - S4(k,m) → S3(m,m',k)   for m' > m

Expressing Parallelism with a DAG – Block Cholesky
• Each A[i,j] is a b-by-b block

    for k = 0 to N/b-1
       for n = 0 to k-1
          SYRK:   S1(k,n):   A[k,k] = A[k,k] - A[k,n]*A[k,n]^T
       POTRF:     S2(k):     A[k,k] = unblocked_Cholesky(A[k,k])
       for m = k+1 to N/b-1
          for n = 0 to k-1
             GEMM:   S3(k,m,n):   A[m,k] = A[m,k] - A[m,n]*A[k,n]^T
          TRSM:      S4(k,m):     A[m,k] = A[m,k] · A[k,k]^-1

• Same DAG, but only ≈ (N/b)^3/6 vertices
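A NumPy sketch of the tiled Cholesky, with each tile operation spelled out so that every call corresponds to one DAG vertex (SYRK, POTRF, GEMM, TRSM). Names and structure are mine, not PLASMA's; it assumes n is a multiple of b and A is symmetric positive definite:

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def tiled_cholesky(A, b):
        """In-place lower-triangular Cholesky of A, one b-by-b tile task at a time."""
        T = A.shape[0] // b

        def tile(i, j):
            return A[i*b:(i+1)*b, j*b:(j+1)*b]

        for k in range(T):
            for m in range(k):                                  # SYRK: S1(k,m)
                tile(k, k)[:] -= tile(k, m) @ tile(k, m).T
            tile(k, k)[:] = cholesky(tile(k, k), lower=True)    # POTRF: S2(k)
            for i in range(k + 1, T):
                for m in range(k):                              # GEMM: S3(k,i,m)
                    tile(i, k)[:] -= tile(i, m) @ tile(k, m).T
                # TRSM: S4(k,i):  A[i,k] = A[i,k] * A[k,k]^-T
                tile(i, k)[:] = solve_triangular(tile(k, k), tile(i, k).T, lower=True).T
        return np.tril(A)

Verification sketch: L = tiled_cholesky(A.copy(), b) should satisfy np.allclose(L @ L.T, A).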


Sample Cholesky DAG
• #blocks in any row or column = N/b = 5
• Note the implied order of summation, from left to right
  - Not necessary for correctness, but it does reflect what the sequential code does
• Can process the DAG in any order respecting the dependences
(Slide courtesy of Jakub Kurzak, UTK)

Scheduling options
• Static (pre-assign tasks to processors) vs Dynamic (idle processors grab ready jobs from a work-queue)
  - If dynamic, does the scheduler take user hints/priorities?
• Respect locality (e.g. a processor must have some task data in its cache) vs not
• Build and store the entire DAG to schedule it (which may be very large, (N/b)^3), vs build just the next few "levels" at a time (smaller, but less information for the scheduler)
• Programmer builds the DAG & schedule vs depend on a compiler or run-time system
  - Ease of programming, vs not exploiting user knowledge
  - If a compiler, how conservative is its detection of parallelism?
  - Generally useful, not just for linear algebra
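A toy dynamic scheduler in Python to make the "idle processors grab ready jobs" option concrete: tasks are submitted to a thread pool as soon as their predecessors complete. Names and structure are mine, not those of any scheduler discussed here, and it assumes the task graph is acyclic:

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def run_dag(tasks, deps, nworkers=4):
        """tasks: {name: callable}; deps: {name: set of prerequisite names}."""
        remaining = {t: set(deps.get(t, ())) for t in tasks}
        done, running = set(), {}
        with ThreadPoolExecutor(max_workers=nworkers) as pool:
            while remaining or running:
                ready = [t for t, d in remaining.items() if d <= done]
                for t in ready:                      # submit every currently ready task
                    running[pool.submit(tasks[t])] = t
                    del remaining[t]
                finished, _ = wait(running, return_when=FIRST_COMPLETED)
                for f in finished:                   # mark completed tasks as done
                    done.add(running.pop(f))

For the tiled Cholesky above, each SYRK/POTRF/GEMM/TRSM tile call would become one entry in tasks, with deps given by the S1–S4 rules.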



Schedulers tested
• Cilk
  - programmer-defined parallelism
  - spawn – creates independent tasks
  - sync – synchronizes a sub-branch of the tree
• SMPSs
  - dependency-defined parallelism
  - pragma-based annotation of tasks (directionality of the parameters)
• PLASMA (Static Pipeline)
  - programmer-defined (hard-coded)
  - a-priori processing order
  - stalling on dependencies

Measured Results for Tiled Cholesky
• Measured on Intel Tigerton 2.4 GHz
• Cilk 1D: one task is a whole panel, but with "look ahead"
• Cilk 2D: tasks are blocks; the scheduler steals work, little locality
• PLASMA works best
(Figure: PLASMA Gflop/s curve. Slides courtesy of Jakub Kurzak, UTK)

More Measured Results for Tiled Cholesky
• Measured on Intel Tigerton 2.4 GHz
(Figure: Gflop/s curves for Cilk, SMPSs, and PLASMA (Static Pipeline).)

Still More Measured Results for Tiled Cholesky
• PLASMA (static pipeline) – best
• SMPSs – somewhat worse
• Cilk 2D – inferior
• Cilk 1D – still worse
• quad-socket, quad-core (16 cores total) Intel Tigerton 2.4 GHz
(Slides courtesy of Jakub Kurzak, UTK)

Intel's Clovertown Quad Core
• 3 implementations of LU factorization; quad core with 2 sockets per board, with 8 threads
  1. LAPACK (BLAS fork-join parallelism)
  2. ScaLAPACK (message passing using memory copy)
  3. DAG-based (dynamic scheduling)
• 8-core experiments
(Figure: Mflop/s vs problem size, 1000 to 15000; the DAG-based version is fastest, then ScaLAPACK, then LAPACK. Source: Jack Dongarra)

Scheduling on Multicore – Next Steps
• PLASMA 2.8.0 released 12/2015
  - Includes BLAS, Cholesky, QR, LU, LDL^T, eig, svd
  - icl.cs.utk.edu/plasma/
• Future of PLASMA
  - Continue adding functions
  - Add dynamic scheduling
    • QUARK dynamic scheduler released 12/2011
    • DAGs for eigenproblems are too complicated to do by hand
  - Plan to adopt OpenMP 4.0 DAG scheduling features
  - Still assumes homogeneity of the available cores
• What about GPUs, or mixtures of CPUs and GPUs?
  - MAGMA: icl.cs.utk.edu/magma
• CLASS PROJECTS

Motivation
• NVIDIA released CUBLAS 1.0 in 2007, which is BLAS for GPUs
• This enables a straightforward port of LAPACK to the GPU
• Consider single precision only
• 2007 results (GeForce 8800 GTX vs Core2 Quad 2.4 GHz with MKL 10.0):
  - Impressive sheer peak in compute power (a*b+c)
  - Not so great in matrix-matrix multiply (BLAS SGEMM: CUBLAS 1.1)
  - Disappointing performance in (naive) LU factorization (LAPACK SGETRF)
(Figure: Gflop/s bars, 0 to 350.)

Dense Linear Algebra on GPUs
• Source: Vasily Volkov's SC08 paper
  - Best Student Paper Award (729 citations)
• New challenges
  - More complicated memory hierarchy
  - Not like "L1 inside L2 inside …"
    • Need to choose which memory to use carefully
    • Need to move data manually
  - The GPU does some operations much faster than the CPU, but not all
  - CPU and GPU are fastest using different data layouts
• Goal: understand bottlenecks in the dense linear algebra kernels
  - Requires detailed understanding of the GPU architecture
  - Result 1: new coding recommendations for high performance on GPUs
  - Result 2: new, fast variants of LU, QR, Cholesky, and other routines

GPU Memory Hierarchy
• The register file is the fastest and the largest on-chip memory
  - Constrained to vector operations only
• Shared memory permits indexed and shared access
  - However, 2-4x smaller and 4x lower bandwidth than registers
• Only 1 operand in shared memory is allowed versus 4 register operands
  - Some instructions run slower if using shared memory
(Figure: 64 KB vector register file (64 lanes), 16 KB shared store (16 lanes), crossbar.)

(Some new) NVIDIA coding recommendations
• Minimize communication with CPU memory
• Keep as much data in registers as possible
  - Largest, fastest on-GPU memory
  - Vector-only operations
• Use as little shared memory as possible
  - Smaller and slower than registers; use it for communication and sharing only
  - Speed limit: 66% of peak with one shared-memory argument
• Use vector length VL = 64, not the max VL = 512
  - Strip-mine longer vectors into shorter ones
• The final matmul code is similar to Cray X1 or IBM 3090 vector codes

    __global__ void sgemmNN( const float *A, int lda, const float *B, int ldb,
                             float* C, int ldc, int k, float alpha, float beta )
    {
        // Compute pointers to the data
        A += blockIdx.x * 64 + threadIdx.x + threadIdx.y*16;
        B += threadIdx.x + ( blockIdx.y * 16 + threadIdx.y ) * ldb;
        C += blockIdx.x * 64 + threadIdx.x + (threadIdx.y + blockIdx.y * ldc ) * 16;

        // Declare the on-chip storage
        __shared__ float bs[16][17];
        float c[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

        const float *Blast = B + k;
        do
        {
            // Read next B's block
            #pragma unroll
            for( int i = 0; i < 16; i += 4 )
                bs[threadIdx.x][threadIdx.y+i] = B[i*ldb];
            B += 16;
            __syncthreads();

            // The bottleneck: read A's columns, do rank-1 updates
            #pragma unroll
            for( int i = 0; i < 16; i++, A += lda )
            {
                c[0]  += A[0]*bs[i][0];  c[1]  += A[0]*bs[i][1];  c[2]  += A[0]*bs[i][2];  c[3]  += A[0]*bs[i][3];
                c[4]  += A[0]*bs[i][4];  c[5]  += A[0]*bs[i][5];  c[6]  += A[0]*bs[i][6];  c[7]  += A[0]*bs[i][7];
                c[8]  += A[0]*bs[i][8];  c[9]  += A[0]*bs[i][9];  c[10] += A[0]*bs[i][10]; c[11] += A[0]*bs[i][11];
                c[12] += A[0]*bs[i][12]; c[13] += A[0]*bs[i][13]; c[14] += A[0]*bs[i][14]; c[15] += A[0]*bs[i][15];
            }
            __syncthreads();
        } while( B < Blast );

        // Store C's block to memory
        for( int i = 0; i < 16; i++, C += ldc )
            C[0] = alpha*c[i] + beta*C[0];
    }

New code vs. CUBLAS 1.1
• Performance in multiplying two NxN matrices on GeForce 8800 GTX:
  - multiply-and-add with an operand in shared memory: 66% of peak
  - our implementation: 60%
  - CUBLAS 1.1: 37%
(Figure: fraction of peak vs N, for N = 64 to 4096.)

The Progress So Far
• Achieved predictable performance in SGEMM
  - Which does the O(N^3) work in LU factorization
• But LU factorization (naive SGETRF) still underperforms
  - Must be due to the rest of the work, O(N^2), done in BLAS1 and BLAS2
  - Why does the O(N^2) work take so much time?
(Figure: Gflop/s for BLAS SGEMM (CUBLAS 1.1, our implementation – now in CUBLAS 2.0, Core2 Quad) and LAPACK SGETRF (naive w/CUBLAS 2.0, Core2 Quad) against the peaks in registers and using shared memory; annotations: "arithmetic runs slower if using shared memory", "good compared to the new, smaller peak", "where does the time go?")

Row-Pivoting in LU Factorization
• Exchange two rows of an NxN matrix (SSWAP in CUBLAS 2.0):
  - matrix in column-major layout (stride-N access) vs row-major layout (aligned unit-stride access)
• Row pivoting in column-major layout on the GPU is very slow
• This alone consumes half of the runtime in naive SGETRF
(Figure: microseconds per row exchange vs N on GeForce 8800 GTX; column-major is roughly 40x slower than row-major.)

BLAS1 Performance
• Scale a column of an NxN matrix that fits in GPU memory (assumes aligned, unit-stride access)
• GeForce 8600 GTS, peak = 32 GB/s
• GeForce GTX 280, peak = 141 GB/s
• Peak bandwidth of these GPUs differs by a factor of 4.4, but the runtimes are similar
• Small tasks on the GPU are overhead bound
(Figure: microseconds vs N.)

Panel Factorization
• Factorizing an Nx64 matrix in GPU memory using LAPACK's SGETF2
• The bound assumes 4 µs overhead per BLAS call and 127 GB/s bandwidth in memory access (these are the best sustained numbers)
• Invoking small BLAS operations on the GPU from the CPU is slow
• Can we call a sequence of BLAS operations from the GPU?
  - Requires barrier synchronization after each parallel BLAS operation
  - A barrier is possible but requires sequential consistency for correctness
(Figure: Gflop/s vs N for the bound, GTX280, GeForce 8600, and Core2 Quad, including transfer to the CPU and back.)

Design of fast matrix factorizations on GPU
• Use the GPU for matmul only, not BLAS2 or BLAS1
• Factor panels on the CPU
• Use "look-ahead" to overlap CPU and GPU work
  - The GPU updates the matrix while the CPU factors the next panel
• Use row-major layout on the GPU, column-major on the CPU
  - Convert on the fly
• Substitute triangular solves LX = B with multiplication by L^-1
  - For stability the CPU needs to check || L^-1 ||
• Use variable-sized panels for load balance
• For two GPUs with one CPU, use a column-cyclic layout on the GPUs

Raw Performance of Factorizations on GPU
(Figure: Gflop/s vs order of matrix, 64 to 16384, for QR, Cholesky, and LU, with 51%, 49%, and 78% annotations on the curves.)

Speedup of Factorizations on GPU over CPU
• The GPU is only useful on large enough matrices
(Figure: speedup vs Core2 Quad as a function of matrix order for QR, Cholesky, and LU on GTX280 and 8800GTX; up to 4.4x on GTX280 and 2.7x on 8800GTX.)

Where does the time go?
• Time breakdown for LU on the 8800 GTX
(Figure: fraction of time vs matrix order (448 to 11264), split into GPU, CPU/GPU overlap, CPU, look-ahead, and CPU-GPU transfer.)


Importance of various optimizations on GPU
• Slowdown when omitting one of the optimizations on GTX 280
(Figure: slowdown vs matrix order when omitting CPU/GPU overlap, the transposed (row-major) matrix, TRSM via GEMM, or batch pivoting.)

Results for matmul, LU on NVIDIA
(Figure: Gflop/s on GeForce 8800 GTX vs Core2 Quad; peak in registers, peak in a*b+c if using shared memory, BLAS SGEMM (CUBLAS 1.1, our implementation – now in CUBLAS 2.0, Core2 Quad), LAPACK SGETRF (naive, our implementation, Core2 Quad).)
• What we've achieved:
  - Identified the realistic peak speed of the GPU architecture
  - Achieved a large fraction of this peak in matrix multiply
  - Achieved a large fraction of the matrix multiply rate in dense factorizations

Class Projects
• Pick one (of many) functions/algorithms
• Pick a target parallel platform
• Pick a "parallel programming framework"
  - LAPACK – all parallelism in the BLAS
  - ScaLAPACK – distributed memory using MPI
  - PLASMA – DAG scheduling on multicore
    • Parallel Linear Algebra for Scalable Multi-core Architectures
    • http://icl.cs.utk.edu/plasma/
  - MAGMA – DAG scheduling for heterogeneous platforms
    • Matrix Algebra on GPU and Multicore Architectures
    • http://icl.cs.utk.edu/magma/
  - Spark, Elemental, …
• Design, implement, measure, model and/or compare performance
  - Can be missing entirely on the target platform
  - May exist, but with a different programming framework
