DSLs for DLA: Past the BLAS

Robert van de Geijn
Department of Computer Science
Institute for Computational Engineering and Sciences

1 Motivation

- Throughout, we will use Cholesky factorization as a motivating example:
  - Given a symmetric positive definite matrix A, compute a lower triangular matrix L such that

        A = L L^T

  - L typically overwrites (the lower triangle of) A.
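To fix ideas, here is a minimal sketch (ours, not from the slides) of the computation in plain C: the kind of "triple loops" implementation that the performance graph later compares against. Column-major storage with leading dimension lda is assumed.

    #include <math.h>

    /* Unblocked, right-looking Cholesky factorization: the lower-triangular
       factor L overwrites the lower triangle of the n x n matrix A. */
    void chol_unb( int n, double *A, int lda )
    {
    #define A(i,j) A[ (j)*(lda) + (i) ]
        for ( int j = 0; j < n; j++ ) {
            A(j,j) = sqrt( A(j,j) );            /* alpha11 := sqrt( alpha11 ) */
            for ( int i = j+1; i < n; i++ )     /* a21 := a21 / alpha11       */
                A(i,j) /= A(j,j);
            for ( int k = j+1; k < n; k++ )     /* A22 := A22 - a21 a21^T     */
                for ( int i = k; i < n; i++ )   /* (lower triangle only)      */
                    A(i,k) -= A(i,j) * A(k,j);
        }
    #undef A
    }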

2-7 Blocked Right-looking Algorithm

Partition

    A → ( A11   *  )
        ( A21  A22 )

where A11 is the current diagonal block. Slides 2-7 build up the updates one at a time:

    A11 := Chol( A11 )
    A21 := A21 A11^{-T}
    A22 := A22 - A21 A21^T

The process then continues with the updated A22 until done. (A sketch in code follows below.)
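As a preview of where the talk is headed, here is a hedged sketch (ours, not from the slides) of the blocked algorithm in C, with the two updates cast as Level-3 BLAS calls through the CBLAS interface; chol_unb is the unblocked sketch above.

    #include <cblas.h>

    void chol_unb( int n, double *A, int lda );   /* sketched earlier */

    /* Blocked right-looking Cholesky; column-major storage. */
    void chol_blk( int n, double *A, int lda, int nb )
    {
        for ( int j = 0; j < n; j += nb ) {
            int jb = ( nb < n - j ) ? nb : n - j;

            /* A11 := Chol( A11 ) */
            chol_unb( jb, &A[ j*lda + j ], lda );

            if ( j + jb < n ) {
                /* A21 := A21 * inv( A11^T )   (triangular solve) */
                cblas_dtrsm( CblasColMajor, CblasRight, CblasLower,
                             CblasTrans, CblasNonUnit,
                             n - j - jb, jb, 1.0,
                             &A[ j*lda + j ], lda,
                             &A[ j*lda + (j + jb) ], lda );

                /* A22 := A22 - A21 * A21^T    (symmetric rank-k update) */
                cblas_dsyrk( CblasColMajor, CblasLower, CblasNoTrans,
                             n - j - jb, jb, -1.0,
                             &A[ j*lda + (j + jb) ], lda,
                             1.0, &A[ (j + jb)*lda + (j + jb) ], lda );
            }
        }
    }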

8 A Brief and Incomplete History

9 State-of-the-Art in 1980

- LINPACK: dchdc( a, lda, n, work, jpvt, job, info )

J. J. Dongarra, C. B. Moler, J. R. Bunch, and G. W. Stewart. LINPACK Users' Guide, 1979.

- MATLAB: L = chol( A )

Cleve Moler. MATLAB-an interactive matrix laboratory. 1980.

10 Layering in 1980

MATLAB/Applications

LINPACK

(Level-1) BLAS

11 (Level-1) BLAS

- C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic Linear Algebra Subprograms for Fortran Usage. ACM TOMS, 1979.
  - Provide portable high performance
  - Improve readability

        sscal, dscal, cscal, zscal
        scopy, dcopy, ccopy, zcopy
        sdot,  ddot,  cdot,  zdot
        saxpy, daxpy, caxpy, zaxpy
        ...

  - Early example of a Domain Specific Language

12 LINPACK-like Cholesky Factorization

    do j=1, n
       A( j,j ) = sqrt( A( j,j ) )
       call dscal( n-j, 1.0d00 / A( j,j ), A( j+1, j ), 1 )
       do k=j+1,n
          call daxpy( n-k+1, -A( k,j ), A( k,j ), 1, A( k, k ), 1 )
       enddo
    enddo
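For reference (ours, not on the slide): with unit strides, daxpy(n, alpha, x, 1, y, 1) computes y := alpha*x + y, so the inner loop above subtracts a multiple of the tail of column j from column k. A minimal C equivalent:

    /* y := alpha*x + y, the Level-1 BLAS axpy operation (unit strides). */
    void daxpy_unit( int n, double alpha, const double *x, double *y )
    {
        for ( int i = 0; i < n; i++ )
            y[ i ] += alpha * x[ i ];
    }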

13 The State-of-the-Art in 1990

- LAPACK: DPOTRF( UPLO, N, A, LDA, INFO )

Anderson et al. LAPACK: A portable linear algebra library for high-performance computers. Supercomputing 1990.

14 Layering in 1990

MATLAB/Applications

LAPACK Blocked Algorithms

LAPACK Unblocked Algorithms

BLAS1 BLAS2 BLAS3

15 Level-2 BLAS

- Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACM TOMS, 1988.

        sgemv, dgemv, cgemv, zgemv   (matrix-vector product)
        sger,  dger,  cger,  zger    (rank-1 update)
        ssyr,  dsyr,  csyr,  zsyr    (symmetric rank-1 update)
        strsv, dtrsv, ctrsv, ztrsv   (triangular solve)
        ...

16 LAPACK-like Unblocked Cholesky

    do j=1, n
       A( j,j ) = sqrt( A( j,j ) )
       call dscal( n-j, 1.0d00 / A( j,j ), A( j+1, j ), 1 )
       call dsyr( 'Lower', n-j, -1.0d00, A( j+1, j ), 1,
  $               A( j+1, j+1 ), lda )
    enddo

(The inner daxpy loop of the LINPACK-like version is replaced by a single call to the Level-2 dsyr.)
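For reference (ours, not on the slide): dsyr('Lower', n, alpha, x, 1, A, lda) performs the symmetric rank-1 update A := alpha*x*x^T + A on the lower triangle, absorbing the entire daxpy loop into one call. A minimal C equivalent:

    /* Lower triangle of A := alpha*x*x^T + A, the Level-2 BLAS syr
       operation (unit stride, column-major A). */
    void dsyr_lower_unit( int n, double alpha, const double *x,
                          double *A, int lda )
    {
        for ( int j = 0; j < n; j++ )
            for ( int i = j; i < n; i++ )
                A[ j*lda + i ] += alpha * x[ i ] * x[ j ];
    }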

Improved performance and readability.

17 Level-3 BLAS

- J. J. Dongarra, Jeremy Du Croz, Sven Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM TOMS, 1990.

        sgemm, dgemm, cgemm, zgemm   (matrix-matrix product)
        ssymm, dsymm, csymm, zsymm   (symmetric matrix-matrix product)
        ssyrk, dsyrk, csyrk, zsyrk   (symmetric rank-k update)
        strsm, dtrsm, ctrsm, ztrsm   (triangular solve, multiple right-hand sides)
        ...

18 LAPACK-like Blocked Cholesky

    DO 20 J = 1, N, NB
       JB = MIN( NB, N-J+1 )
       CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
       CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
  $                N-J-JB+1, JB, ONE, A( J, J ), LDA,
  $                A( J+JB, J ), LDA )
       CALL DSYRK( 'Lower', 'No transpose',
  $                N-J-JB+1, JB, -ONE, A( J+JB, J ), LDA,
  $                ONE, A( J+JB, J+JB ), LDA )
 20 CONTINUE

Improved performance. Improved readability???

19 Performance

(Performance graph: Cholesky factorization, one thread; GFlops (0-10) vs. problem size n (0-2000), comparing hand-optimized code and implementations cast in terms of the BLAS3, the BLAS2, the BLAS1, and triple loops.)

20 The State-of-the-Art in 1996

- ScaLAPACK: PDPOTRF( UPLO, N, A, IA, JA, IDESC, INFO )

S. Blackford et al. ScaLAPACK Users’ Guide. 1997.

21 DLA State-of-the-Art in 1996

ScaLAPACK

PBLAS

LAPACK    BLAS    Basic Linear Algebra Communication Subprograms (BLACS)

Blackford et al. ScaLAPACK Users' Guide. SIAM Press. 1997.

22 PBLAS

psdot, etc. psgemv, etc. psgemm, etc. …

J. Choi, et al. "ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance," PARA '95, 1996.

DSL in the spirit of the BLAS.

23 ScaLAPACK Cholesky Factorization

    DO 20 J = JN+1, JA+N-1, DESCA( NB_ )
       JB = MIN( N-J+JA, DESCA( NB_ ) )
       I = IA + J - JA

       CALL PDPOTF2( UPLO, JB, A, I, J, DESCA, INFO )
       IF( INFO.NE.0 ) THEN
          INFO = INFO + J - JA
          GO TO 30
       END IF
*
       IF( J-JA+JB+1.LE.N ) THEN
          CALL PDTRSM( 'Right', 'Lower', 'Transpose', 'Non-Unit',
  $                    N-J-JB+JA, JB, ONE, A, I, J, DESCA,
  $                    A, I+JB, J, DESCA )
          CALL PDSYRK( UPLO, 'No Transpose', N-J-JB+JA, JB, -ONE,
  $                    A, I+JB, J, DESCA, ONE, A, I+JB, J+JB,
  $                    DESCA )
       END IF
 20 CONTINUE

24 The "State-of-the-Art" in 1997

- PLAPACK: PLA_Chol( uplo, A )

P. Alpatov, G. Baker, C. Edwards, J. Gunnels, G. Morrow, J. Overfelt, R. van de Geijn, Y.-J. Wu. PLAPACK: parallel linear algebra package design overview. Supercomputing, 1997.

Robert van de Geijn. Using PLAPACK. MIT Press. 1997.

25 Parallel Linear Algebra Package (PLAPACK)

- Conceived in Spring 1996.
  - CS395T Parallel Computing at UT Austin
  - Users' manual for a fictitious distributed memory parallel dense linear algebra (DLA) package (SL - Simple Library)
  - The class implemented parts of the package

26 PLAPACK Cholesky Factorization

    PLA_Obj_view_all( A, &ABR );
    while ( TRUE ) {
       PLA_Obj_split_size( ABR, PLA_SIDE_TOP,  &size_top,  &owner_top );
       PLA_Obj_split_size( ABR, PLA_SIDE_LEFT, &size_left, &owner_left );
       if ( 0 == ( size = min( min( size_top, size_left ), nb_alg ) ) )
          break;

       PLA_Obj_split_4( ABR, size, size, &A11, PLA_DUMMY,
                                         &A21, &ABR );

       PLA_Local_chol( PLA_LOWER_TRIANGULAR, A11 );

       PLA_Trsm( PLA_SIDE_RIGHT, PLA_LOWER_TRIANGULAR,
                 PLA_TRANSPOSE, PLA_NONUNIT_DIAG,
                 one, A11, A21 );

       PLA_Syrk( PLA_LOWER_TRIANGULAR, minus_one, A21, one, ABR );
    }

Code devoid of indexing.

27 The triangular solve update, side by side:

    A11 := Chol(A11)
    A21 := A21 A11^{-T}
    A22 := A22 - A21 A21^T

ScaLAPACK:

    CALL PDTRSM( 'Left', 'Lower', 'Transpose', 'Non-Unit', JB, N-J-JB+JA,
   &             ONE, A, I, J, DESCA, A, I, J+JB, DESCA )

PLAPACK:

    PLA_Trsm( PLA_SIDE_RIGHT, PLA_LOWER_TRIANGULAR,
              PLA_TRANSPOSE, PLA_NONUNIT_DIAG,
              one, A11, A21 );

28 PLAPACK Innovations

- Object-based API (DSL) inspired by MPI.
- Describe data and its distribution.
- Many new parallel DLA algorithms.
- Families of algorithms.
- Sensible user interface.

Great product, lousy marketing department…

29 PLAPACK's Legacy: 1996 - 2016

- FLAME
  - Notation
  - Methodology
  - APIs: FLAMEC, FLAME@lab, FLAMEPy, FLaTeX
  - libflame
- FLAME-it
- SuperMatrix
- FLAPACK
- Elemental
- BLIS
- LAC/LAP
- DxT
- LAFF
- ROTE
- TBLIS

30 What do we learn from our experience?

31 Developing a DSL for HPC

- Reflect: How do we reason about algorithms?
- Abstract: How can we express algorithms so that the expression captures how we reason about them?
- Derive: How do we make reasoning about algorithms systematic?
- Encode: How do we represent algorithms in code?
- (Re)layer: How do we layer our solution to attain portable high performance?
- Build: How do we develop usable software?
- Preach: How do we get the word out? How do we build a user/developer community?

32 Overview

- Reflect
- Abstract
- Derive
- Encode
- (Re)layer
- Build
- Preach

33 Reflect

“The idea of knowledge as cumulative — a ladder, or a tower of stones, rising higher and higher — existed only as one possibility among many. For several hundred years, scholars of scholarship had considered that they might be like dwarves seeing farther by standing on the shoulders of giants, but they tended to believe more in rediscovery than in progress.” - James Gleick

34-56 Reflect (image-only slides)

57 Overview: Reflect, Abstract, Derive, Encode, (Re)layer, Build, Preach

58-67 (image slides)

68 FLAME Notation

X. Sun, E. S. Quintana, G. Quintana, and R. van de Geijn. A Note on Parallel Matrix Inversion. SISC, 2001.

69 Insight

- It starts with the right notation:

Express algorithms at the level of abstraction at which we reason.

70 Overview: Reflect, Abstract, Derive, Encode, Layer, Build, Preach

71-72 (image slides)

73 The worksheet for deriving A := Chol_unb_var3(A):

    Step  Annotated algorithm: A := Chol_unb_var3(A)
    1a    { A = Â }
    4     Partition A → ( ATL  ATR ; ABL  ABR ), where ATL is 0 × 0
    2     { loop invariant }
    3     while m(ATL) < m(A) do
    2,3      { loop invariant ∧ guard }
    5a       Repartition, exposing α11, a21, ..., where α11 is 1 × 1
    6        { state before the update }
    8        update
    5b       Continue with the repartitioned quadrants
    7        { state after the update }
    2        { loop invariant }
          endwhile
    2,3   { loop invariant ∧ ¬ guard }
    1b    { A = L ∧ L L^T = Â }

74 Loop invariant

- A loop invariant is a predicate that is true before and after each iteration of the loop.

(Picture: a 2 × 2 partitioning of A with the first quadrants marked "done" and the bottom-right quadrant marked "partially updated".)

75 Loop invariant

- A loop invariant is a predicate that is true before and after each iteration of the loop.
- For this variant of Cholesky factorization, the invariant is

      ( ATL  ATR )   ( LTL   *               )
      ( ABL  ABR ) = ( LBL   ÂBR - LBL LBL^T )

  with LTL LTL^T = ÂTL and LBL LTL^T = ÂBL.

Important: We can determine the loop invariant a priori.

76 The FLAME Methodology

(The worksheet for Chol_unb_var3, with partitionings, predicates, and updates filled in systematically.)
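A sketch of the standard FLAME reasoning behind that claim: partition the postcondition A = L L^T to obtain the Partitioned Matrix Expression (PME); loop invariants are then chosen as partially computed states of the PME.

\[
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}
\begin{pmatrix} L_{TL}^T & L_{BL}^T \\ 0 & L_{BR}^T \end{pmatrix}
\quad\Rightarrow\quad
\left\{
\begin{aligned}
A_{TL} &= L_{TL} L_{TL}^T \\
A_{BL} &= L_{BL} L_{TL}^T \\
A_{BR} &= L_{BL} L_{BL}^T + L_{BR} L_{BR}^T
\end{aligned}
\right.
\]

The invariant above keeps the first two equalities fully computed and defers the L_{BR} L_{BR}^T term of the third; that choice is exactly variant 3.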

The completed algorithm, extracted from the worksheet:

Algorithm: A := Chol_unb_var3(A)

    Partition
        A → ( ATL  ATR )
            ( ABL  ABR )
    where ATL is 0 × 0
    while m(ATL) < m(A) do
        Repartition
            ( ATL  ATR )    ( A00    a01  A02   )
            ( ABL  ABR ) →  ( a10^T  α11  a12^T )
                            ( A20    a21  A22   )
        where α11 is 1 × 1

        α11 := sqrt(α11)
        a21 := a21 / α11
        A22 := A22 - a21 a21^T   (update only the lower triangular part)

        Continue with
            ( ATL  ATR )    ( A00    a01  A02   )
            ( ABL  ABR ) ←  ( a10^T  α11  a12^T )
                            ( A20    a21  A22   )
    endwhile

77 A Systematic Methodology

- But what about more complex matrix computations? They may involve many matrices.
- Can the method be automated?
- Diego Fabregat-Traver and Paolo Bientinesi: Cl1ck

78 Related Publications

- John A. Gunnels. A Systematic Approach to the Design and Analysis of Linear Algebra Algorithms. Ph.D. Dissertation, Nov. 2001.
- John Gunnels, Fred G. Gustavson, Greg M. Henry, Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software (TOMS), 2001.
- Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Orti, Robert A. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Transactions on Mathematical Software (TOMS), 2005.
- Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms. Ph.D. Dissertation, Aug. 2006.
- Robert A. van de Geijn and Enrique S. Quintana-Orti. The Science of Programming Matrix Computations. www.lulu.com, 2008.
- Paolo Bientinesi, John Gunnels, Maggie Myers, Enrique Quintana-Orti, Tyler Rhodes, Robert van de Geijn, and Field Van Zee. Deriving Linear Algebra Libraries. Formal Aspects of Computing, 2012.
- Tze Meng Low. A Calculus of Loop Invariants for Dense Linear Algebra Optimization. Ph.D. Dissertation, Computer Science, December 2013.
- ...

79 Insight

- Notation facilitates systematic reasoning.
- In our case: systematic to the point where it has been automated.
- In our case: we derive families of algorithms for each operation.
- Pick the best algorithm for the situation.

80 Overview: Reflect, Abstract, Derive, Encode, (Re)layer, Build, Preach

81 Encoding Algorithms

- How do we translate correct algorithms to correct code?

FLAME APIs
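To give a flavor of what those API slides showed, here is a hedged sketch of unblocked Cholesky (variant 3) in the style of the FLAME/C API. The partitioning routines (FLA_Part_2x2, FLA_Repart_2x2_to_3x3, FLA_Cont_with_3x3_to_2x2) mirror the derived algorithm one-to-one; the update routines named below (FLA_Sqrt, FLA_Inv_scal, FLA_Syr) are representative and may differ across libflame versions.

    #include "FLAME.h"

    int Chol_unb_var3( FLA_Obj A )
    {
      FLA_Obj ATL, ATR,   A00,  a01,     A02,
              ABL, ABR,   a10t, alpha11, a12t,
                          A20,  a21,     A22;

      FLA_Part_2x2( A,   &ATL, &ATR,
                         &ABL, &ABR,   0, 0, FLA_TL );

      while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {
        FLA_Repart_2x2_to_3x3( ATL, ATR,   &A00,  &a01,     &A02,
                                           &a10t, &alpha11, &a12t,
                               ABL, ABR,   &A20,  &a21,     &A22,
                               1, 1, FLA_BR );

        FLA_Sqrt( alpha11 );                  /* alpha11 := sqrt( alpha11 ) */
        FLA_Inv_scal( alpha11, a21 );         /* a21 := a21 / alpha11       */
        FLA_Syr( FLA_LOWER_TRIANGULAR,        /* A22 := A22 - a21 a21^T     */
                 FLA_MINUS_ONE, a21, A22 );

        FLA_Cont_with_3x3_to_2x2( &ATL, &ATR,   A00,  a01,     A02,
                                                a10t, alpha11, a12t,
                                  &ABL, &ABR,   A20,  a21,     A22,
                                  FLA_TL );
      }
      return FLA_SUCCESS;
    }

Note how the loop, like the PLAPACK code earlier, is devoid of indexing: the partitioned views track the computation.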

82-85 (image slides)

86 Related Publications

- John Gunnels, Fred G. Gustavson, Greg M. Henry, Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software (TOMS), 2001.
- Paolo Bientinesi, Enrique S. Quintana-Orti, Robert A. van de Geijn. Representing linear algebra algorithms in code: the FLAME application program interfaces. ACM Transactions on Mathematical Software (TOMS), 2005.
- Margaret Myers, Pierce van de Geijn, Robert van de Geijn. Linear Algebra: Foundations to Frontiers - Notes to LAFF. MOOC offered by edX.
- Field G. Van Zee. libflame: The Complete Reference. www.lulu.com, 2009.

87 Insights

- APIs as Domain Specific Languages (DSLs).
- Notation for algorithms should mirror how we reason.
- Implementations in code should closely resemble the algorithm.
- Our approach is language agnostic: an API can be created for most (any?) language.

88 Overview: Reflect, Abstract, Derive, Encode, (Re)layer, Build, Preach

89 Traditional Layering

MATLAB/Applications

LAPACK Blocked Algorithms

LAPACK Unblocked Algorithms

BLAS1 BLAS2 BLAS3

90 BLAS-Like Library Instantiation Framework

- Framework for rapid instantiation of the BLAS
  - Level 1, 2, and 3 BLAS
- Alternative APIs
- Set of building blocks for BLAS-like operations

Field G. Van Zee, Robert A. van de Geijn. BLIS: A Framework for Rapidly Instantiating BLAS Functionality. ACM Transactions on Mathematical Software (TOMS), 2015.
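A hedged sketch of what calling one of those alternative APIs looks like (the "typed" BLIS API; names and types follow BLIS releases of this era and should be checked against current documentation). The distinctive feature is that every operand carries both a row stride and a column stride, so row-major, column-major, and general strided matrices are all supported.

    #include "blis.h"

    int main( void )
    {
        dim_t  m = 4, n = 4, k = 4;
        double alpha = 1.0, beta = 0.0;
        double A[ 16 ], B[ 16 ], C[ 16 ];

        for ( int i = 0; i < 16; i++ ) { A[ i ] = 1.0; B[ i ] = 1.0; C[ i ] = 0.0; }

        /* C := alpha * A * B + beta * C, all operands column-major:
           row stride 1, column stride = number of rows. */
        bli_dgemm( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE,
                   m, n, k,
                   &alpha, A, 1, m,
                           B, 1, k,
                   &beta,  C, 1, m );

        return 0;
    }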

91 (Figure: the GotoBLAS/BLIS algorithm for C := A B + C, drawn as five loops around the micro-kernel. The 5th loop partitions C and B into column panels of width nC. The 4th loop partitions the k dimension into panels of depth kC and packs the block Bp into a contiguous buffer kept in the L3 cache. The 3rd loop partitions A into row panels of height mC and packs the block Ai into a contiguous buffer kept in the L2 cache. The 2nd and 1st loops march over nR-wide and mR-tall micro-panels, with the current micro-panel of packed Bp streaming from the L1 cache. The micro-kernel updates an mR × nR block of C held in registers.)

92 (Performance graph: Intel Xeon "Dunnington", one core.)
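A minimal, self-contained sketch (ours, not BLIS itself) of the loop structure in the figure. Packing is omitted, the blocking parameters are illustrative rather than tuned, and m, n, k are assumed to be multiples of the blocking parameters; column-major storage.

    enum { NC = 256, KC = 128, MC = 64, NR = 4, MR = 4 };

    /* The micro-kernel: update an MR x NR block of C with a sequence of
       rank-1 updates.  In BLIS this is the one routine written in assembly. */
    static void micro_kernel( int kc, const double *A, int lda,
                                      const double *B, int ldb,
                                      double       *C, int ldc )
    {
        for ( int p = 0; p < kc; p++ )
            for ( int j = 0; j < NR; j++ )
                for ( int i = 0; i < MR; i++ )
                    C[ j*ldc + i ] += A[ p*lda + i ] * B[ j*ldb + p ];
    }

    void gemm_five_loops( int m, int n, int k,
                          const double *A, int lda,
                          const double *B, int ldb,
                          double       *C, int ldc )
    {
        for ( int jc = 0; jc < n; jc += NC )        /* 5th loop (nC)              */
          for ( int pc = 0; pc < k; pc += KC )      /* 4th loop (kC); packs Bp    */
            for ( int ic = 0; ic < m; ic += MC )    /* 3rd loop (mC); packs Ai    */
              for ( int jr = jc; jr < jc + NC; jr += NR )     /* 2nd loop (nR)    */
                for ( int ir = ic; ir < ic + MC; ir += MR )   /* 1st loop (mR)    */
                  micro_kernel( KC,
                                &A[ pc*lda + ir ], lda,
                                &B[ jr*ldb + pc ], ldb,
                                &C[ jr*ldc + ir ], ldc );
    }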

93 (Performance graph: Intel Xeon Phi, 240 threads; 84% of peak.)

T. M. Smith, R. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee. Anatomy of High-Performance Many-Threaded Matrix Multiplication. IPDPS 2014.

94 (Performance graph: IBM.)

T. M. Smith, R. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee. Anatomy of High-Performance Many-Threaded Matrix Multiplication. IPDPS 2014.

95 Relayering

MATLAB/Applications

LAPACK Blocked Algorithms

LAPACK Unblocked Algorithms

BLAS1 BLAS2 BLAS3

vecvec µ-kernels    matvec µ-kernels    matmat µ-kernels

96 Relayering

Applications

Blocked Algorithms

Unblocked Algorithms

BLAS1 BLAS2 BLAS3

vecvec µ-kernels    matvec µ-kernels    matmat µ-kernels

98 BLIS Primitives as Building Blocks

- Strassen

99-100 (Figure, shown on two slides: Strassen implemented inside the BLIS five-loop structure. The packing routines are modified to pack sums of submatrices, e.g. Vp + Wp into the packed Bp buffer and Xi + Yi into the packed Ai buffer, and each micro-kernel invocation updates two blocks of the output, Cij and Dij, instead of one.)

101 J. Huang, T. Smith, G. Henry, R. van de Geijn. Strassen's Algorithm Reloaded. SC'16. Numerical Algorithms, Part II. Wednesday 3:30pm-5pm, Room: 355-E.

102-103 (Figure: the five loops around the micro-kernel, annotated on the second slide with the C++ template component that implements each loop.)

104 Tensor contractions are matrix-matrix multiplications

The five-loop structure, expressed as a composition of C++ templates:

    using GotoGEMM =
        partition_gemm_nc<
        partition_gemm_kc<
        pack_b<
        partition_gemm_mc<
        pack_a<
        partition_gemm_nr<
        partition_gemm_mr<
        gemm_micro_kernel>>>>>>>;

105

    using GotoGEMM =                    using TensorGEMM =
        partition_gemm_nc<                  partition_gemm_nc<
        partition_gemm_kc<                  partition_gemm_kc<
        pack_b<                             pack_b<
        partition_gemm_mc<                  partition_gemm_mc<
        pack_a<                             pack_a<
        partition_gemm_nr<                  partition_gemm_nr<
        partition_gemm_mr<                  partition_gemm_mr<
        gemm_micro_kernel>>>>>>>;           gemm_micro_kernel>>>>>>>;

The same composition of partitioning loops and micro-kernel is reused for tensor contraction; the packing components are instantiated to pack blocks of a tensor into the contiguous micro-panels the micro-kernel expects.

106 Devin Matthews, "High-Performance Tensor Contraction without BLAS," Research Poster #35.

A blocked Cholesky factorization coded with templated building blocks in the same style:

    template <typename T>
    void cholesky(matrix<T>& A)
    {
        partition_2x2 A_2x2(A);
        partition_3x3 A_3x3(A);

        while (A_2x2.m_TL < A.m) {
            A_3x3.repartition_down(A_2x2);

            auto& A11 = A_3x3.A11();
            auto& A21 = A_3x3.A21();
            auto& A22 = A_3x3.A22();

            // A11 = chol( A11 )
            cholesky(A11);

            // A21 = A21 * inv( tril( A11 )' )
            trsm(T(1), A11, A21);

            // A22 = A22 - A21 * A21'
            herk(T(-1), A21, T(1), A22);

            A_2x2.continue_with(A_3x3);
        }
    }

107 Related Publications

- J. A. Gunnels, G. M. Henry, and R. A. van de Geijn. A Family of High-Performance Matrix Multiplication Algorithms. ICCS 2001.
- Kazushige Goto, Robert A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS), 2008.
- Kazushige Goto, Robert van de Geijn. High-performance implementation of the level-3 BLAS. ACM Transactions on Mathematical Software (TOMS), 2008.
- Field G. Van Zee, Robert A. van de Geijn. BLIS: A Framework for Rapidly Instantiating BLAS Functionality. ACM Transactions on Mathematical Software (TOMS), 2015.
- Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee. Anatomy of High-Performance Many-Threaded Matrix Multiplication. IPDPS 2014.
- Field G. Van Zee, Tyler Smith, Bryan Marker, Tze Meng Low, Robert A. van de Geijn, Francisco D. Igual, Mikhail Smelyanskiy, Xianyi Zhang, Michael Kistler, Vernon Austel, John Gunnels, Lee Killough. The BLIS Framework: Experiments in Portability. ACM Transactions on Mathematical Software (TOMS), 2016.
- Devin A. Matthews. High-Performance Tensor Contraction without BLAS. Submitted to SISC.
- J. Huang, T. Smith, G. Henry, R. van de Geijn. Strassen's Algorithm Reloaded. SC'16.
- ...

108 Insights

- BLIS: careful refactoring of the GotoBLAS
- Less code to optimize
- More loops to parallelize
- Provides building blocks for innovative optimization
- Enables a more flexible DSL for DLA than the BLAS

109 Overview: Reflect, Abstract, Derive, Encode, Layer, Build, Preach

110 New DLA Software Stack

- Funded by NSF's Software Infrastructure for Sustained Innovation program and gifts from industry.
  - Functionality: BLAS and LAPACK
  - DSL being designed

111 (image slide)

112 Overview: Reflect, Abstract, Derive, Encode, Layer, Build, Preach

113 Building a User/Developer Community

- Usual means:
  - Software / publications / talks / workshops
- Pedagogical outreach:
  - BLISlab: classroom activity based on BLIS matrix-matrix multiplication
  - Linear Algebra: Foundations to Frontiers (LAFF): MOOC on edX
  - LAFF-On Programming for Correctness: MOOC on edX (April 2017)
  - LAFF-On Programming for Performance: MOOC based on BLISlab (maybe...)

114 LAFF–On Programming for Correctness

https://www.youtube.com/watch?v=1B0PPMNQsP4

Coming in Spring 2017 to edX.org.

115 Conclusion

- A Domain Specific Language starts with Domain Specific Notation.
- Notation supports systematic reasoning.
- The DSL facilitates code that closely resembles algorithms.
- Proper layering enables optimization across layers.

116

“Elegance is not a dispensable luxury but a quality that decides between success and failure.” -- Edsger W. Dijkstra

117 The Team

118 Funding

- NSF
  - Award ACI-1550493: Sustaining Innovation in the Linear Algebra Software Stack for Computational Chemistry and other Sciences. 2016 - 2018
  - Award ACI-1148125: A Linear Algebra Software Infrastructure for Sustained Innovation in Computational Chemistry and other Sciences. 2012 - 2016
  - Many other NSF grants, 2002 - present

More Information

- http://shpc.ices.utexas.edu
- https://github.com/flame/
- https://github.com/flame/blis/
- https://github.com/flame/libflame/
- https://github.com/devinamatthews/tblis
