Past the BLAS

DSLs for DLA: Past the BLAS Robert van de Geijn Department of Computer Science Institute for Computation Engineering and Sciences 1 Motivation l Throughout we will use Cholesky factorization as a motivating example: ¡ Given Symmetric Positive Definite A compute L such that A = L LT ¡ L typically overwrites A. 2 Blocked Right-looking Algorithm 3 A11 A 21 A22 4 A11 := Chol(A11) A 21 A22 5 A11 := Chol(A11) −T A21 := A21A11 A22 6 A11 := Chol(A11) −T A21 := A21A11 T A22 := A22 − A21A21 7 Done Done 8 A Brief and Incomplete History 9 State-of-the-art in 1980 l LINPACK: dchdc(a,lda,n,work,jpvt,job,info) J. J. Dongarra, C. B. Moler, J. R. Bunch, and G. W. Stewart. LINPACK Users' Guide, 1979. l MATLAB: L = chol( A ) Cleve Moler. MATLAB-an interactive matrix laboratory. 1980. 10 Layering in 1980 MATLAB/Applications LINPACK (Level-1) BLAS 11 (Level-1) BLAS l C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic Linear Algebra Subprograms for Fortran Usage. ACM TOMS, 1979. ¡ Provide portable high performance ¡ Improve readability sscal, dscal, cscal, zscal scopy, dcopy, ccopy, zcopy sdot, ddot, cdot, zdot saxpy, daxpy, caxpy, zaxpy … ¡ Early example of a Domain Specific Language 12 LINPACK-like Cholesky factorization do j=1, n A( j,j ) = sqrt( A( j,j ) ) call dscal( n-j, 1.0d00 / A( j,j ), A( j+1, j ), 1 ) do k=j+1,n call daxpy( n-k+1, -A( k,j ), A(k,j), 1, A( k, k ), 1 ); enddo enddo 13 The State-of-the-Art in 1990 l LAPACK: DPOTRF( UPLO, N, A, LDA, INFO ) Anderson et al. LAPACK: A portable linear algebra library for high-performance computers. Supercomputing 1990. 14 Layering in 1990 MATLAB/Applications LAPACK Blocked Algorithms LAPACK Unblocked Algorithms BLAS1 BLAS2 BLAS3 15 Level-2 BLAS l Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM TOMS. 1988. ¡ sgemv, dgemv, cgemv, zgemv sger, dger, cger, zger ssyr, dsyr, csyr, zsyr strsv, dtrsv, ctrsv, ztrsv … 16 LAPACK-like Unblocked Cholesky do j=1, n A( j,j ) = sqrt( A( j,j ) ) call dscal( n-j, 1.0d00 / A( j,j ), A( j+1, j ), 1 ) do k=j+1,n callcall dsyrdaxpy( `Lower( n-k+1,`, n-k+1,-A( k,j ),-1.0d00, A(k,j), 1,A( A( j+1,j k, k), ),1, 1 A( ); j+1, j+1 ), lda ); enddo enddo Improved performance and readability 17 Level-3 BLAS l J. J. Dongarra, Jeremy Du Croz, Sven Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM TOMS. 1990. ¡ sgemm, dgemm, cgemm, zgemm ssymm, dsymm, csymm, zsymm ssyrk, dsyrk, csyrk, zsyrk strsm, dtrsm, ctrsm, ztrsm … 18 LAPACK-like Blocked Cholesky DO 20 J = 1, N, NB JB = MIN( NB, N-J+1 ) CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO ) CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit', $ N-J-JB+1, JB, ONE, A( J, J ), LDA, $ A( J+JB, J ), LDA ) CALL DSYRK( 'Lower', 'No transpose', $ N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA, $ ONE, A( J+JB, J+JB ), LDA ) ENDDO Improved performance Improved readability??? 19 Performance One thread 10 8 6 GFlops 4 2 Hand optimized BLAS3 BLAS2 BLAS1 triple loops 0 0 200 400 600 800 1000 1200 1400 1600 1800 2000 20 n The State-of-the-Art in 1996 l ScaLAPACK: LDA PDPOTRF( UPLO, N, A, IA, JA, IDESC, INFO ) S. Blackford et al. ScaLAPACK Users’ Guide. 1997. 21 DLA State-of-the-Art in 1996 ScaLAPACK PBLAS Basic LAPACK Linear BLAS Algebra Communication Subprograms Blackford et al. ScaLAPACK Users’ Guide. SIAM Press. 1997. 22 PBLAS psdot, etc. psgemv, etc. psgemm, etc. … J. Choi, et. All. “ScaLAPACK: A portable linear algebra library for distributed memory computers --- Design issues and performance”, PARA '95, 1996. DSL in the spirit of BLAS 23 ScaLAPACK Cholesky Factorization DO 20 J = JN+1, JA+N-1, DESCA( NB_ ) JB = MIN( N-J+JA, DESCA( NB_ ) ) I = IA + J – JA CALL PDPOTF2( UPLO, JB, A, I, J, DESCA, INFO ) IF( INFO.NE.0 ) THEN INFO = INFO + J - JA GO TO 30 END IF * IF( J-JA+JB+1.LE.N ) THEN CALL PDTRSM( 'Right', ’Lower', 'Transpose', 'Non-Unit', N-J-JB+JA, $ JB, ONE, A, I, J, DESCA, A, I+JB, J, DESCA ) CALL PDSYRK( UPLO, 'No Transpose', N-J-JB+JA, JB, -ONE, $ A, I+JB, J, DESCA, ONE, A, I+JB, J+JB, $ DESCA ) * END IF 20 CONTINUE 24 The “State-of-the-Art” in 1997 l PLAPACK: PLA_Chol( uplo, A ) P. Alpatov, G. Baker, C. Edwards, J. Gunnels, G. Morrow, J. Overfelt, R. van de Geijn, Y.-J. Wu. PLAPACK: parallel linear algebra package design overview. Supercomputing, 1997. Robert van de Geijn. Using PLAPACK. MIT Press. 1997. 25 Parallel Linear Algebra Package (PLAPACK) l Conceived in Spring 1996. ¡ CS395T Parallel Computing at UT Austin ¡ Users’ manual for fictitious distributed memory parallel dense linear algebra (DLA) package (SL – Simple Library) ¡ Class implemented parts of the package 26 PLAPACK Cholesky Factorization PLA_Obj_view_all( A, &ABR ); while ( TRUE ) { PLA_Obj_split_size( ABR, PLA_SIDE_TOP, &size_top, &owner_top ); PLA_Obj_split_size( ABR, PLA_SIDE_LEFT, &size_left, &owner_left ) if ( 0 == ( size = min( min( size_top, size_left), nb_alg ) ) ) break; PLA_Obj_split_4( ABR, size, size, &A11, PLA_DUMMY, &A21, &ABR ); PLA_Local_chol( PLA_LOWER_TRIANGULAR, A11 ); PLA_Trsm( PLA_SIDE_RIGHT, PLA_LOWER_TRIANGULAR, PLA_TRANSPOSE, PLA_NONUNIT_DIAG, one, A11, A21 ); PLA_Syrk( PLA_LOWER_TRIANGULAR, minus_one, A21, one, ABR ); } Code devoid of indexing. 27 A11 := Chol(A11) −T A21 := A21A11 T A22 := A22 − A21A21 CALLCALL PDTRSM(PDTRSM( ''LeftLeft',', ’’LowerLower',', ''TransposeTranspose',', 'Non-Unit’,'Non-Unit’, JB,JB, N-J-JB+JA,N-J-JB+JA, && ONE,ONE, A,A, II,, J,J, DESCA,DESCA, A,A, II,, J+JB,J+JB, DESCADESCA )) PLA_Trsm( PLA_SIDE_RIGHT, PLA_LOWER_TRIANGULAR, PLA_TRANSPOSE, PLA_NONUNIT_DIAG, one, A11, A21 ); 28 PLAPACK Innovations l Object-based API (DSL) inspired by MPI. l Describe data and its distribution l Many new parallel DLA algorithms. l Families of algorithms. l Sensible user interface. Great product, lousy marketing department… 29 PLAPACK’s Legacy: 1996 - 2016 l FLAME l FLAME-it ¡ Notation l SuperMatrix ¡ Methodology l FLAPACK ¡ APIs l Elemental l FLAMEC l FLAME@lab l BLIS l FLAMEPy l LAC/LAP l FLaTeX l DxT ¡ libflame l LAFF l ROTE l TBLIS 30 What do we learn from our experience? 31 Developing a DSL for HPC l Reflect: How do we reason about algorithms? l Abstract: How can we express algorithms so the algorithm captures how we reason about them? l Derive: How do we make reasoning about algorithms systematic? l Encode: How do we represent algorithms in code? l (Re)layer: How do we layer our solution to attain portable high performance? l Build: How do we develop useable software? l Preach: How do we get the word out? How do we build a user/developer community? 32 Overview l Reflect l Abstract l Derive l Encode l (Re)layer l Build l Preach 33 Reflect “The idea of knowledge as cumulative — a ladder, or a tower of stones, rising higher and higher — existed only as one possibility among many. For several hundred years, scholars of scholarship had considered that they might be like dwarves seeing farther by standing on the shoulders of giants, but they tended to believe more in rediscovery than in progress.” - James Gleick 34 34 Reflect 35 35 Reflect 36 Reflect 37 Reflect 38 Reflect 39 Reflect 40 Reflect 41 Reflect 42 Reflect 43 Reflect 44 Reflect 45 Reflect 46 Reflect 47 Reflect 48 Reflect 49 Reflect 50 Reflect 51 Reflect 52 Reflect 53 Reflect 54 Reflect 55 Reflect 56 Reflect 57 Overview l Reflect l Abstract l Derive l Encode l (Re)layer l Build l Preach 58 59 60 61 62 63 64 65 66 67 68 FLAME Notation X. Sun, E. S. Quintana, G. Quintana, and R. van de Geijn. A Note on Parallel Matrix Inversion. SISC, 2001. 69 Insight l It starts with the right notation: Express algorithms at the level of abstraction at which we reason. 70 Overview l Reflect l Abstract l Derive l Encode l Layer l Build l Preach 71 72 Step Algorithm: A := Chol unb var3(A) 1a A = A 4 b where 2 3 while do 2,3 ^ 5a where 6 8 5b 7 2 endwhile 2,3 () ^¬ 73 1b A = L LLT = A ^ b Loop invariant l A loop invariant is a predicate that is true before and after each iteration of the loop. done done done partially updated 74 Loop invariant l A loop invariant is a predicate that is true before and after each iteration of the loop. LTL * T LBL ABR - LBLLBL Important: We can determine the loop invariant a priori75 StepStep Algorithm:Algorithm:AA:=:=:=CholChol unbunb var3var3((A)) Step Algorithm: 1aStep1a AAAlgorithm:==AA A := Chol unb var3(A) The FLAME Methodology: 1a 1a A = AAATLTLTLAATRTRTR LLTLTL LLTRTR 44 AA bb ,,,LL ! 0 1 ! 0 1 ! 0AABLAABR1 ! 0LLBL LLBR 1 b BL BR BL BR 4 where A and L are 0 0 where@@ ATLTLTLandAALTLTLTLare@@ 0 0 A ⇥⇥ TT ATLTL ATRTR LTLTL ATRTR LTLTLL T = ATLTL AwhereTL ATR LTL ATR LTLLTLTL = ATL 22 == 0 where 1 0 TT 1 ^ TT 0AABLAABR1 0LLBL AABR LLBLLBLT 1 ^ LBLLLTLTLT ==AABL Our weapon of 2 BL BR BL BR −bb BL BL BL TL bbBL @ A @ − A 32 @while m((ATLTLA)) <m@ ((A)) do A 3 while m(ATL) <m(A) dob b b TT b ATLTL ATRTR LTLTL ATRTR LTLTLL T = ATLTL 3 whileATL ATR LdoTL ATR LTLLTLTL = ATL 2,3 = TL m((ATLTL)) <m((A)) Math induction 2,33 while = do m(ATL) <m(A) 0 1 0 TT 1 ^ TT ^ 0 ABL ABR1 0 LBL ABR LBLLBLT 1 ^ LBLLTLTLT = ABL ^ 2,3 ABL ABR LBL ABR −bLBLLBL LBLLTL = bABL @ A @ −b A b 2,3 @ A @ A0000 a0101 A0202 A L0000 ll0101 L0202 ^ A00 ab01 A02 Lb00 l01 L02 ATLTL ATRTR b LTLTL LTRTR b ^ for the war on error 5a ATL ATR 0 aTT ↵ aTT 1,, LTL LTR 0 llTT λ llTT 1 5a 0 T1010 1111 T1212 1, 0 1010T 1111 1212T 1 0A A 1 ! a10 ↵11 a12 0L L 1 ! l10 λ11 l12 0A BLA BR1 ! B C 0LBL LBR 1 !B C 5a BL BR BBA2020 a2121 A2222C BL BR BBL2020 ll2121 L2222CC 5a @ A BBA20 a21 A22 C @ A BBL20 l21 L22CC @ where ↵A1111 andB@λ1111 are 1 1CA @ A @B AC where ↵11 and@λ11 are 1 ⇥

Load more