Software Engineering

The libflame Library for Dense Matrix Computations

Field G. Van Zee, Ernie Chan, and Robert A. van de Geijn, The University of Texas at Austin
Enrique S. Quintana-Ortí and Gregorio Quintana-Ortí, Universidad Jaime I de Castellón

Researchers from the Formal Linear Algebra Method Environment (Flame) project have developed new methodologies for analyzing, designing, and implementing linear algebra libraries. These solutions, which have culminated in the libflame library, seem to solve many of the programmability problems that have arisen with the advent of multicore and many-core architectures.

How do we convince people that in programming simplicity and clarity—in short: what mathematicians call “elegance”—are not a dispensable luxury, but a crucial matter that decides between success and failure?
—Edsger W. Dijkstra

Over the past decade, the University of Texas at Austin and Universidad Jaime I de Castellón have collaborated on the Formal Linear Algebra Method Environment (Flame) project, developing a unique methodology, notation, tools, and API set for deriving and representing linear algebra libraries. To better promote the Flame project’s characteristic techniques, we’ve implemented a functional library—libflame—that demonstrates findings and insights from our 10 years of research.

The primary purpose of libflame is to give the scientific and numerical computing communities a modern, high-performance dense linear algebra library that is extensible, easy to use, and available under an open source license. We’ve published two books, numerous papers, and even more working notes over the last decade documenting the challenges and motivations that led to the libflame library’s APIs and implementations (see www.cs.utexas.edu/users/flame/publications). Seasoned users in scientific and numerical computing circles will quickly recognize libflame’s target functionality set. In short, in libflame, our goal is to provide not only a framework for developing dense linear algebra solutions, but also a ready-made library that is, by almost any metric, easier to use and offers competitive (and in many cases superior) real-world performance when compared to the more traditional Basic Linear Algebra Subprograms (BLAS)1 and Linear Algebra Package (Lapack) libraries.2

Here, we briefly introduce both the library itself and its underlying philosophy. Using performance results from different architectures, we show how easily it can be retargeted to “hostile” environments. Our hope is that the combination of libflame’s functionality and performance will lead the scientific computing community to investigate our other publications.

What Makes libflame Different?
Adopting libflame makes sense for numerous reasons. In addition to expected attractions—such as a detailed user manual3 that we routinely update and cross-platform support for both GNU/Linux and …—libflame offers users and software developers several key advantages.

A Solution Grounded in CS Fundamentals
Flame advocates a new approach to developing linear algebra libraries. It starts with a more stylized notation for expressing loop-based linear algebra algorithms.4 As Figure 1 shows, the notation closely resembles the way pictures naturally illustrate matrix algorithms. The notation also facilitates rigorous formal algorithm derivations, which guarantees that the resulting algorithms are correct.5 Moreover, it yields an algorithm family, so that developers can choose the best option for their situations (based on problem size or architecture, for example).

Algorithm: A := CHOL_L_BLK_VAR2( A )
  Partition A → [ ATL ATR ; ABL ABR ], where ATL is 0 × 0
  while m(ATL) < m(A) do
    Determine block size b
    Repartition [ ATL ATR ; ABL ABR ] → [ A00 A01 A02 ; A10 A11 A12 ; A20 A21 A22 ],
      where A11 is b × b

      A11 := A11 − A10 A10^T      (SYRK)
      A21 := A21 − A20 A10^T      (GEMM)
      A11 := CHOL( A11 )          (CHOL)
      A21 := A21 A11^−T           (TRSM)

    Continue with [ ATL ATR ; ABL ABR ] ← [ A00 A01 A02 ; A10 A11 A12 ; A20 A21 A22 ]
  endwhile

Figure 1. Blocked Cholesky factorization (variant 2) expressed as a Flame algorithm. Subproblems annotated as SYRK, GEMM, and TRSM correspond to level-three Basic Linear Algebra Subprograms (BLAS) operations; CHOL is the recursive Cholesky factorization subproblem.

Object-Based Abstractions and API
The BLAS, Lapack, and Scalable Lapack6 projects place backward compatibility as a high priority, which hinders adoption of modern software engineering principles such as object abstraction. We built libflame around opaque structures that hide matrices’ implementation details (such as data layout), and libflame exports object-based programming interfaces to operate upon these structures. Likewise, Flame algorithms are expressed (and coded) in terms of smaller operations on the matrix operands’ subpartitions. This abstraction facilitates programming without array or loop indices, which lets users avoid painful index-related programming errors altogether.
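To make this concrete, here is a minimal sketch of what a calling application looks like when it wraps an existing column-major buffer in a matrix object and factors it. The routine names follow the Flame/C interface (FLA_Obj_create_without_buffer, FLA_Obj_attach_buffer, FLA_Chol), but exact signatures have varied across libflame releases, so treat this as an illustration rather than an excerpt from the library's documentation.

#include "FLAME.h"

int main( void )
{
  /* A small symmetric positive definite matrix stored in column-major
     order: [ 4 2 ; 2 3 ], with a column stride (leading dimension) of 2. */
  double  buf[] = { 4.0, 2.0,
                    2.0, 3.0 };
  FLA_Obj A;

  FLA_Init();

  /* Wrap the existing buffer in a matrix object.  From here on, the
     object, not the calling code, carries the dimensions and layout. */
  FLA_Obj_create_without_buffer( FLA_DOUBLE, 2, 2, &A );
  FLA_Obj_attach_buffer( buf, 1, 2, &A );  /* row stride 1, column stride 2 */

  /* Factor the lower triangle in place: A = L * L^T. */
  FLA_Chol( FLA_LOWER_TRIANGULAR, A );

  /* buf now holds L in its lower triangle. */
  FLA_Obj_free_without_buffer( &A );
  FLA_Finalize();

  return 0;
}

Note that no leading dimensions, loop bounds, or index arithmetic appear anywhere in the calling code; those details live inside the object.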
Figure 2 compares the coding styles of libflame and Lapack, highlighting the inherent elegance of Flame code and its striking resemblance to the corresponding Flame algorithm in Figure 1. This similarity is quite intentional, preserving the clarity of the original algorithm as it would be illustrated on a whiteboard or in a publication.

Educational Value
In addition to introducing students to formal algorithm derivation, educators have successfully used Flame to teach linear algebra algorithms in a classroom setting. Also, Flame’s API affords clean abstractions, which makes it ideally suited for teaching high-performance linear algebra courses at the undergraduate and graduate level.
Historically, instructors have used the BLAS/Lapack coding style in these pedagogical settings. However, we believe that this coding style obscures the algorithms; students often get bogged down debugging frustrating errors caused by indexing directly into arrays that represent the matrices. Using Flame greatly reduces the line of students who need help with coding during office hours.

Dense Linear Algebra Framework
Like Lapack, libflame provides ready-made implementations of common linear algebra operations. libflame’s implementations mirror many of those in BLAS and Lapack. However, libflame differs from Lapack in two important ways. First, as mentioned, it provides algorithm families for each operation, so developers can choose the one that best suits their needs. Second, it provides a framework for building complete custom linear algebra codes. This makes it a more useful environment because it lets users quickly choose and/or prototype a linear algebra solution to fit the application’s needs.

High Performance
In our publications and performance graphs, we do our best to dispel the myth that user- and programmer-friendly linear algebra codes can’t yield high performance. As we show later, our Flame implementations of operations such as Cholesky factorization and triangular matrix inversion often outperformed Lapack’s corresponding implementations.7
Currently, libflame relies on only a core set of highly optimized, unblocked routines to perform the small subproblems found in linear algebra algorithm implementations. However, as we’ve recently demonstrated, we can automatically translate algorithms represented with the Flame/C API (Figure 2a) into a more traditional code implementation (Figure 2b) and thus attain the same performance. We’ve prototyped a source-to-source translator, FlameS2S, and are working to perfect it. There’s thus no longer a valid reason to code linear algebra algorithms in the traditional BLAS/Lapack style as opposed to a high level of abstraction.

Figure 2. Implementations of Figure 1’s algorithm. (a) The Flame/C API, which represents libflame’s coding style (FLA_Chol_l_blk_var2), and (b) Fortran-77, obtained from Lapack (DPOTRF).
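The flavor of the Flame/C listing in Figure 2a can be conveyed by the following sketch of the blocked Cholesky algorithm from Figure 1. The partitioning helpers (FLA_Part_2x2, FLA_Repart_2x2_to_3x3, FLA_Cont_with_3x3_to_2x2) and the level-3 BLAS wrappers are standard Flame/C interfaces; the argument lists shown here are a best-effort approximation and may differ in small details from the published figure.

FLA_Error FLA_Chol_l_blk_var2( FLA_Obj A, int nb_alg )
{
  FLA_Obj ATL, ATR,    A00, A01, A02,
          ABL, ABR,    A10, A11, A12,
                       A20, A21, A22;
  int b, mb;

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    /* b = min( remaining rows, algorithmic block size ) */
    mb = (int) FLA_Obj_length( ABR );
    b  = ( mb < nb_alg ? mb : nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, ATR,    &A00, &A01, &A02,
                                        &A10, &A11, &A12,
                           ABL, ABR,    &A20, &A21, &A22,
                           b, b, FLA_BR );
    /*------------------------------------------------------------*/
    /* A11 := A11 - A10 * A10^T                            (SYRK) */
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A10, FLA_ONE, A11 );

    /* A21 := A21 - A20 * A10^T                            (GEMM) */
    FLA_Gemm( FLA_NO_TRANSPOSE, FLA_TRANSPOSE,
              FLA_MINUS_ONE, A20, A10, FLA_ONE, A21 );

    /* A11 := chol( A11 )                                  (CHOL) */
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );

    /* A21 := A21 * inv( A11 )^T                           (TRSM) */
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    /*------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, &ATR,    A00, A01, A02,
                                             A10, A11, A12,
                              &ABL, &ABR,    A20, A21, A22,
                              FLA_TL );
  }

  return FLA_SUCCESS;
}

Each statement in the loop body corresponds one-for-one to an update in the Figure 1 worksheet, which is precisely the resemblance the Flame/C API is designed to preserve.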

Dependency-Aware Thread-Level Parallelism
Until recently,8 BLAS and Lapack authors advocated exploiting shared memory parallelism from Lapack routines by simply linking to multithreaded BLAS. Although this low-level solution requires no changes to Lapack code, it suffers from sharp efficiency and scalability limitations for small- and medium-sized matrix problems (as we show later).
The fundamental bottleneck to introducing parallelism directly within many algorithms is the web of data dependencies that frequently exists between subproblems. The libflame project has developed a runtime system to detect and analyze dependencies within algorithms-by-blocks (algorithms whose subproblems operate only on block operands). Once it identifies data dependencies, the system schedules suboperations to independent execution threads. Our runtime system is completely abstracted from the parallelized algorithm and requires virtually no changes to the algorithm code, yet it exposes abundant high-level parallelism.

Support for Hierarchical Storage-by-Blocks
Storing matrices by blocks often yields performance gains through improved spatial locality.9 Instead of representing matrices as a single linear data array with a prescribed leading dimension—as legacy libraries require (for column- or row-major order)—a block storage scheme is encoded into the matrix object. So, in this scheme, internal elements refer recursively to child objects that represent submatrices.
Currently, libflame provides a subset of the conventional API that supports hierarchical matrices, letting users create and manage such matrix objects as well as convert between storage-by-blocks and conventional “flat” storage schemes.
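libflame exposes this hierarchical functionality through a parallel set of interfaces conventionally prefixed FLASH_. The following sketch, whose routine names and signatures are assumptions that may vary across libflame releases, creates a one-level hierarchical copy of a conventional flat matrix, factors it with an algorithm-by-blocks, and flattens the result back into the original storage:

#include "FLAME.h"

/* Convert a conventional ("flat") column-major matrix into hierarchical
   storage-by-blocks, operate on it, and copy the result back.  One level
   of blocking with 256 x 256 blocks is assumed for illustration. */
void factor_by_blocks( FLA_Obj A_flat )
{
  FLA_Obj A_hier;
  dim_t   blocksize = 256;   /* storage/algorithmic block size (assumed) */

  /* Create a hierarchical copy: internally, each "element" of A_hier is
     itself an FLA_Obj that refers to one 256 x 256 block. */
  FLASH_Obj_create_hier_copy_of_flat( A_flat, 1, &blocksize, &A_hier );

  /* Algorithms-by-blocks operate directly on the hierarchical object. */
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A_hier );

  /* Copy the result back into the original flat storage and clean up. */
  FLASH_Obj_flatten_to_flat( A_hier, A_flat );
  FLASH_Obj_free( &A_hier );
}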

Advanced Build System
From its early revisions, libflame distributions have been bundled with a robust build system, featuring automatic makefile creation and a configuration script conforming to GNU standards (which lets users run the ./configure; make; make install sequence common to many open source projects).
Without any user input, the configure script searches for and chooses compilers based on a predefined preference order for each architecture. Users can request specific compilers via the configure interface or enable other nondefault libflame features, such as custom memory alignment, multithreading (via Posix threads or OpenMP), compiler options (debugging symbols, warnings, and optimizations), and memory leak detection. The reference BLAS and Lapack libraries provide no configuration support and require users to manually modify a makefile with appropriate references to compilers and compiler options, depending on the host architecture.

Backward Compatibility with Lapack
We understand that many developers have invested considerable time into their current dense linear algebra applications. We thus provide a set of compatibility routines that map conventional Lapack invocations to their corresponding libflame implementations. By simply linking to the liblapack2flame compatibility layer, you can take advantage of Flame’s performance benefits with virtually no changes to your application. (Currently, any operation called through the liblapack2flame compatibility layer will execute sequentially; to invoke our parallelized implementations, you must use native Flame interfaces.)
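In practice, this means that existing application code such as the following, an ordinary call to Lapack's dpotrf routine through its Fortran-style interface, keeps working untouched; only the link line changes so that the liblapack2flame layer is picked up before (or instead of) a conventional Lapack library, with the exact flags depending on the installation:

/* Existing application code: a conventional Lapack call through the
   Fortran-style interface.  Nothing here is libflame-specific. */
extern void dpotrf_( const char* uplo, const int* n,
                     double* a, const int* lda, int* info );

void factor_spd( double* a, int n )
{
  int info;

  /* Cholesky-factor the lower triangle of the n x n column-major matrix a.
     When the program is linked against the liblapack2flame compatibility
     layer, this same call is serviced by libflame. */
  dpotrf_( "L", &n, a, &n, &info );

  /* info == 0 on success, > 0 if the matrix is not positive definite. */
}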
Language Independence
We chose C as libflame’s implementation language because it simplifies the interface that lets Lapack users transparently use libflame instead. However, our approach is language independent: you can easily define APIs for various languages, including Matlab’s M-script, LabView’s G graphical programming language, and Fortran-77.
Indeed, Flame’s very methodology advocates creating APIs that mirror the notation used to present the algorithms. This differs from, say, the C++ Matrix Template Library,10 which provides an object-oriented interface to matrix operations that hides the matrices’ storage details.

Performance
The scientific community expects performance. For decades, it’s been assumed that the price for solving the programmability problem was diminished performance, and thus the kinds of abstractions that underlie libflame are “not allowed.” As we now show, however, these abstractions in fact yield performance in the same conditions under which traditional libraries thrive—as well as flexibly supporting high performance in what are generally considered more hostile circumstances.

Sequential Performance via Optimized BLAS
Traditional libraries like Lapack are layered upon the BLAS for portable performance. In its simplest mode, libflame is merely an alternative way to program algorithm families for operations—such as the LU, QR, and Cholesky factorizations—still in terms of traditional BLAS operations and linked to traditional BLAS libraries.
Figure 3a shows the performance of three sequential implementations of LU factorization with partial pivoting:

• dgetrf from the Intel Math Kernel Library (MKL is highly tuned, with performance representative of other optimized BLAS and Lapack libraries);
• FLA_LU_piv from libflame; and
• dgetrf from netlib’s Lapack reference implementation.

For this experiment, the target platform is a single Intel Itanium2 processor.
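For reference, invoking the FLA_LU_piv implementation listed above looks much like the earlier Cholesky example, except that the pivot vector is itself passed as an object. The sketch below uses the object-creation signature of recent libflame releases, which may differ from the 2009 API:

#include "FLAME.h"

/* LU factorization with partial pivoting of an m x n matrix object A.
   The pivot indices are returned in an integer-typed object p. */
void lu_with_pivoting( FLA_Obj A )
{
  FLA_Obj p;
  dim_t   m      = FLA_Obj_length( A );
  dim_t   n      = FLA_Obj_width( A );
  dim_t   min_mn = ( m < n ? m : n );

  /* Pivot vector: one integer per factored column. */
  FLA_Obj_create( FLA_INT, min_mn, 1, 0, 0, &p );

  FLA_LU_piv( A, p );        /* A is overwritten with L and U. */

  FLA_Obj_free( &p );
}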

Figure 3. Representative performance of various library implementations of LU factorization with partial pivoting (y-axes in gigaflops). The top of each graph represents the architecture’s theoretical peak. (a) Sequential LU factorization on a single Itanium2 CPU with Math Kernel Library (MKL) 10.1; curves: serial MKL, Flame + serial MKL, and Lapack + serial MKL. (b) Multithreaded LU factorization on 16 Itanium2 CPUs with MKL 8.1 (nb = 192); curves: SuperMatrix + serial MKL, multithreaded MKL, Flame + multithreaded MKL, and Lapack + multithreaded MKL.

As noted, the MKL implementation is highly optimized and attains excellent performance. The other implementations, which are available under open source licenses, are highly competitive.

Traditional Parallelism via Multithreaded Kernels
With the advent of parallel computers such as symmetric multiprocessing (SMP), nonuniform memory access (NUMA), and multicore architectures, the simplest approach to parallelizing libraries such as Lapack and libflame is to link to multithreaded BLAS kernels so that the parallel implementations and the original sequential code remain identical.
Figure 3b shows MKL, libflame, and Lapack performance for three implementations of LU factorization with partial pivoting. For this experiment, the target platform is a 16-CPU Itanium2 system. Extracting parallelism only within the BLAS limits both the Lapack and libflame implementations’ performance.

Scheduling Algorithms-by-Blocks to Multiple Threads
There are several problems with extracting parallelism via multithreaded BLAS only:

• each call to a BLAS operation ends with a thread synchronization (so-called fork-and-join parallelism);
• factorization of the “current panel” introduces an operation in the critical path that reduces the opportunity for parallelism; and
• implementing compute operations out of order to facilitate more parallelism greatly complicates the code.

To overcome these problems, we added several abstractions to libflame:

• We extended the API to support matrix storage-by-blocks. Although traditional linear algebra libraries enforce column-major storage both in how matrices are mapped to memory and in how the algorithm code is explicitly written, the object-based Flame/C API lets matrix elements themselves describe matrices, thus providing a convenient way of expressing matrices that are hierarchically stored by blocks.11
• We introduced the algorithms-by-blocks concept. These algorithms view each matrix element as a block and express the target computation as operations between those blocks.
• In some cases, we had to invent new algorithms to fit the algorithms-by-blocks concept. For example, in their traditional incarnations, the LU factorization with partial pivoting and QR factorization via Householder transformations require columns of blocks to collaborate to identify a pivot or perform an inner product. We created two concepts—incremental pivoting and incremental QR factorization—to overcome this bottleneck.11
• When we view blocks as data units and operations with blocks as computation units, we can expose parallelism by expressing the algorithm as a directed acyclic graph (DAG) of suboperations. In sequential mode, libflame implementations—coded as Figure 2a shows—execute suboperations immediately. However, abstractions within the library allow a nearly identical implementation to instead build a DAG as it executes, deferring computation until the DAG is complete. This provides the opportunity for suboperations to be dispatched to threads for parallel execution even when subproblems have many interdependencies.11 Thus, libflame algorithms can stay the same as developers and users move from a sequential to a shared memory parallel environment.
• We developed a runtime system, SuperMatrix, to analyze and dynamically schedule the suboperations captured in algorithms-by-blocks. SuperMatrix implements in software many hardware-level superscalar processor techniques, such as out-of-order execution.

Together, these abstractions let us exploit thread-level parallelism by elegantly and efficiently scheduling algorithms with many dependencies. Figure 3b shows the benefits of this approach: the SuperMatrix curve reports the performance of an algorithm-by-blocks for LU factorization with incremental pivoting, which we implemented with SuperMatrix.
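In terms of driver code, enabling SuperMatrix is a matter of configuring the runtime rather than rewriting the algorithm. The sketch below relies on the FLASH_Queue_* control calls that libflame provides for this purpose; exactly when the queued suboperations execute (inside the operation itself or at a later synchronization point) depends on the library version and build options, so this is an illustration of the idea rather than a definitive recipe:

#include "FLAME.h"

/* Factor a hierarchical (stored-by-blocks) matrix A_hier in parallel.
   The algorithm code itself (FLASH_Chol) is identical to the sequential
   case; only the runtime configuration changes. */
void parallel_chol( FLA_Obj A_hier, unsigned int n_threads )
{
  /* Tell the SuperMatrix runtime how many worker threads to use and
     make sure task queuing is enabled. */
  FLASH_Queue_set_num_threads( n_threads );
  FLASH_Queue_enable();

  /* The same algorithm-by-blocks call as in the sequential case: the
     runtime records the block suboperations, builds the DAG of
     dependencies, and dispatches ready tasks to the worker threads. */
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A_hier );
}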

Figure 4. Representative performance on a system with four Tesla accelerators (y-axes in gigaflops). (a) Matrix multiplication on four Tesla graphics processing units versus Math Kernel Library 10.0.1; curves: SuperMatrix + serial CuBLAS (GPU = 4), serial CuBLAS (GPU = 1), and multithreaded MKL on an Intel Xeon quadcore (CPU = 4). (b) Out-of-core Cholesky factorization on a single Tesla GPU; curves: out-of-core SuperMatrix + CuBLAS (GPU = 1), in-core SuperMatrix + CuBLAS (GPU = 1), and in-core MKL on two Intel Xeon quadcores (CPU = 8).

Exploiting Hardware Accelerators
High-performance computing recently experienced the arrival of hardware accelerators, such as the IBM Cell Broadband Engine and general-purpose graphics processing units (GPGPUs). As part of a libflame prototype, we used the same mechanism we developed to target shared memory parallel architectures, this time scheduling suboperations to hardware accelerators instead of CPU threads. This let us achieve stunning performance almost effortlessly on systems with multiple hardware accelerators.12 As Figure 4a shows, four Nvidia Tesla accelerators achieved a rate of more than a trillion single-precision floating-point operations per second (teraflops).

Solving Medium-Sized Problems on a Desktop
In 1991, the Linpack benchmark executing on the world’s fastest machine achieved 13.9 gigaflops (billions of floating-point operations per second) while factoring what was then a large problem: a 25,000 × 25,000 matrix that fit in 512 processors’ combined memories. A modern desktop computer can now store such a matrix in its memory and solve it in seconds. But that problem is now a small one. We can use the same mechanism that lets us schedule algorithms-by-blocks to shared memory parallel architectures to schedule matrix blocks stored on hard drives. Figure 4b shows how this lets us factor medium-sized symmetric positive definite problems of size 100,000 × 100,000 in about 23 minutes via an out-of-core implementation using a single hardware accelerator.
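As a rough consistency check on these numbers, using the standard n^3/3 flop count for Cholesky factorization:

\[
\tfrac{1}{3}n^{3}\Big|_{n=10^{5}} \approx 3.3\times10^{14}\ \text{flops},
\qquad
\frac{3.3\times10^{14}\ \text{flops}}{23\times 60\ \text{s}} \approx 2.4\times10^{11}\ \text{flops/s} \approx 240\ \text{gigaflops}
\]

of sustained performance. The matrix itself occupies between 4 × 10^10 and 8 × 10^10 bytes (40 to 80 Gbytes, depending on precision), far more than a 2009 desktop's main memory, which is what makes an out-of-core approach necessary.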

Nov em b er /Decem b er 2009 61 Many opportunities and much work remain. Conf. Supercomputing, ACM Press, 2007, We are continuously expanding libflame’s func­ pp. 116–125. tionality and prototyping new ideas. We hope the 11. G. Quintana-Ortí et al., “Programming Matrix scientific computing community not only benefits Algorithms-by-Blocks for Thread-Level Parallelism,” from libflame, but also contributes feedback so ACM Trans. Math. Software, vol. 36, no. 3, 2009. that we can continue to improve the library. To 12. G. Quintana-Ortí et al., “Solving Dense Linear find out more about Flame, see www.cs.utexas. Algebra Problems on Platforms with Multiple Hard- edu/users/flame. ware Accelerators,” Proc. ACM Sigplan Symp. Principles and Practice Parallel Programming, ACM Press, 2009, Acknowledgments pp. 121–130. US National Science Foundation (NSF) grants CCF- 0540926 and CCF-0702714 partially sponsored Field G. Van Zee is a researcher and software de- this work. Universidad Jaime I researchers were sup- veloper for the Formal Linear Algebra Methods En- ported by projects CICYT TIN2008-06570-C04-01 vironment (Flame) group in the University of Texas and Funds for Development of European Regions, as at Austin’s Department of Computer Sciences, where well as P1B-2007-19, P1B-2007-32 of the Fundación he leads libflame maintenance. His research inter- Caixa-Castellón/Bancaixa and Universidad Jaime I. ests are in scientific, high-performance, and parallel Microsoft also provided significant support. We thank computing. Van Zee has an MS in computer sciences Nvidia for donating some of the hardware used in from the University of Texas at Austin. Contact him our experimental evaluation. All opinions, findings, at [email protected]. conclusions, and recommendations expressed here are solely those of the authors, and don’t necessarily Ernie Chan is a PhD student in the Department of reflect the views of the NSF. Computer Sciences at the University of Texas at Austin, where he researches the runtime scheduling of matrix References computations for shared memory parallelism. Chan 1. J.J. Dongarra et al., “A Set of Level 3 Basic Linear has an MS in computer science from the University of Algebra Subprograms,” ACM Trans. Math. Software, Texas at Austin. Contact him at [email protected]. vol. 16, no. 1, 1990, pp. 1–17. 2. E. Anderson et al., Lapack Users’ Guide, SIAM, 1999. Enrique S. Quintana-Ortí is an associate professor 3. F.G. Van Zee, libflame: The Complete Reference, Lulu, of computer architecture at the University Jaime I of 2009; www.lulu.com/content/5915632. Castellón, Spain. His research interests are in high- 4. J.A. Gunnels et al., “Flame: Formal Linear Algebra performance computing (multicore processors and Methods Environment,” ACM Trans. Math. Software, GPU programming), scientific computing (such as con- vol. 27, no. 4, 2001, pp. 422–455. trol theory), and linear algebra. Quintana-Ortí has a 5. P. Bientinesi et al., “The Science of Deriving Dense PhD in computer science from the Polytechnic Univer- Linear Algebra Algorithms,” ACM Trans. Math Soft- sity of Valencia. Contact him at [email protected]. ware, vol. 31, no. 1, 2005, pp. 1–26. 6. L.S. Blackford et al., ScaLapack Users’ Guide, SIAM, Gregorio Quintana-Ortí is an associate profes- 1997. sor of computer science at the University Jaime I of 7. P. Bientinesi, B. Gunter, and R.A. van de Geijn, Castellón, Spain. 
His research interests are in paral- “Families of Algorithms Related to the Inversion of lel and high-performance computing and numerical a Symmetric Positive Definite Matrix,” ACM Trans. linear algebra. Quintana-Ortí has a PhD in computer Math. Software, vol. 35, no. 1, 2008. science from the Polytechnic University of Valencia. 8. E. Gullo et al., “Numerical Linear Algebra on Contact him at [email protected]. Emerging Architectures: The Plasma and Magma Projects,” J. Physics: Conf. Series, vol. 180, 2009; Robert A. van de Geijn is a professor of computer sci- www.iop.org/EJ/article/1742-6596/180/1/012037/ ence and member of the Institute for Computational jpconf9_180_012037.pdf. Engineering and Sciences at the University of Texas 9. E. Elmroth et al., “Recursive Blocked Algorithms at Austin. His research interests include parallel com- and Hybrid Data Structures for Dense Matrix puting, high-performance computing, linear algebra Library Software,” SIAM Rev., vol. 46, no. 1, 2004, libraries, numerical analysis, and formal derivation pp. 3–45. of algorithms. van de Geijn has a PhD in applied 10. P. Gottschling, D.S. Wise, and M.D. Adams, mathematics from the University of Maryland. He’s “Representation-Transparent Matrix Algorithms a member of the IEEE and the ACM. Contact him at with Scalable Performance,” Proc. 21st ACM Int’l [email protected].
