Software Engineering

The libflame Library for Dense Matrix Computations

Field G. Van Zee, Ernie Chan, and Robert A. van de Geijn, The University of Texas at Austin
Enrique S. Quintana-Ortí and Gregorio Quintana-Ortí, Universidad Jaime I de Castellón

Researchers from the Formal Linear Algebra Method Environment (Flame) project have developed new methodologies for analyzing, designing, and implementing linear algebra libraries. These solutions, which have culminated in the libflame library, seem to solve many of the programmability problems that have arisen with the advent of multicore and many-core architectures.

How do we convince people that in programming simplicity and clarity—in short: what mathematicians call “elegance”—are not a dispensable luxury, but a crucial matter that decides between success and failure?
—Edsger W. Dijkstra

Over the past decade, the University of Texas at Austin and Universidad Jaime I de Castellón have collaborated on the Formal Linear Algebra Method Environment (Flame) project, developing a unique methodology, notation, tools, and API set for deriving and representing linear algebra libraries. To better promote the Flame project’s characteristic techniques, we’ve implemented a functional library—libflame—that demonstrates findings and insights from our 10 years of research.

The primary purpose of libflame is to give the scientific and numerical computing communities a modern, high-performance dense linear algebra library that is extensible, easy to use, and available under an open source license. We’ve published two books, numerous papers, and even more working notes over the last decade documenting the challenges and motivations that led to the libflame library’s APIs and implementations (see www.cs.utexas.edu/users/flame/publications). Seasoned users in scientific and numerical computing circles will quickly recognize libflame’s target functionality set. In short, in libflame, our goal is to provide not only a framework for developing dense linear algebra solutions, but also a ready-made library that is, by almost any metric, easier to use and offers competitive (and in many cases superior) real-world performance when compared to the more traditional Basic Linear Algebra Subprograms (BLAS)1 and Linear Algebra Package (Lapack) libraries.2

Here, we briefly introduce both the library itself and its underlying philosophy. Using performance results from different architectures, we show how easily it can be retargeted to “hostile” environments. Our hope is that the combination of libflame’s functionality and performance will lead the scientific computing community to investigate our other publications.

What Makes libflame Different?
Adopting libflame makes sense for numerous reasons. In addition to expected attractions—such as a detailed user manual3 that we routinely update and cross-platform support for both GNU/Linux and …—libflame offers users and software developers several key advantages.

A Solution Grounded in CS Fundamentals
Flame advocates a new approach to developing linear algebra libraries. It starts with a more stylized notation for expressing loop-based linear algebra algorithms.4 As Figure 1 shows, the notation closely resembles the way pictures naturally illustrate matrix algorithms. The notation also facilitates rigorous formal algorithm derivations, which guarantees that the resulting algorithms are correct.5 Moreover, it yields an algorithm family, so that developers can choose the best option for their situations (based on problem size or architecture, for example).

Algorithm: A := CHOL_L_BLK_VAR2( A )
  Partition A → [ ATL ATR ; ABL ABR ], where ATL is 0 × 0
  while m(ATL) < m(A) do
    Determine block size b
    Repartition [ ATL ATR ; ABL ABR ] → [ A00 A01 A02 ; A10 A11 A12 ; A20 A21 A22 ],
      where A11 is b × b

      A11 := A11 − A10 A10^T      (SYRK)
      A21 := A21 − A20 A10^T      (GEMM)
      A11 := CHOL( A11 )          (CHOL)
      A21 := A21 A11^−T           (TRSM)

    Continue with [ ATL ATR ; ABL ABR ] ← [ A00 A01 A02 ; A10 A11 A12 ; A20 A21 A22 ]
  endwhile

Figure 1. Blocked Cholesky factorization (variant 2) expressed as a Flame algorithm. Subproblems annotated as SYRK, GEMM, and TRSM correspond to level-three Basic Linear Algebra Subprograms (BLAS) operations; CHOL is the recursive Cholesky factorization subproblem.

Object-Based Abstractions and API
The BLAS, Lapack, and Scalable Lapack6 projects place backward compatibility as a high priority, which hinders adoption of modern software engineering principles such as object abstraction. We built libflame around opaque structures that hide matrices’ implementation details (such as data layout), and libflame exports object-based programming interfaces to operate upon these structures. Likewise, Flame algorithms are expressed (and coded) in terms of smaller operations on the matrix operands’ subpartitions. This abstraction facilitates programming without array or loop indices, which lets users avoid painful index-related programming errors altogether.
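To make this concrete, here is a minimal sketch of what a calling application looks like when it wraps an existing column-major buffer in a matrix object and factors it. The routine names follow the Flame/C interface (FLA_Obj_create_without_buffer, FLA_Obj_attach_buffer, FLA_Chol), but exact signatures have varied across libflame releases, so treat this as an illustration rather than an excerpt from the library's documentation.

#include "FLAME.h"

int main( void )
{
  /* A small symmetric positive definite matrix stored in column-major
     order: [ 4 2 ; 2 3 ], with a column stride (leading dimension) of 2. */
  double  buf[] = { 4.0, 2.0,
                    2.0, 3.0 };
  FLA_Obj A;

  FLA_Init();

  /* Wrap the existing buffer in a matrix object.  From here on, the
     object, not the calling code, carries the dimensions and layout. */
  FLA_Obj_create_without_buffer( FLA_DOUBLE, 2, 2, &A );
  FLA_Obj_attach_buffer( buf, 1, 2, &A );  /* row stride 1, column stride 2 */

  /* Factor the lower triangle in place: A = L * L^T. */
  FLA_Chol( FLA_LOWER_TRIANGULAR, A );

  /* buf now holds L in its lower triangle. */
  FLA_Obj_free_without_buffer( &A );
  FLA_Finalize();

  return 0;
}

Note that no leading dimensions, loop bounds, or index arithmetic appear anywhere in the calling code; those details live inside the object.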
Figure 2 compares the coding styles of libflame and Lapack, highlighting the inherent elegance of Flame code and its striking resemblance to the corresponding Flame algorithm in Figure 1. This similarity is quite intentional, preserving the clarity of the original algorithm as it would be illustrated on a whiteboard or in a publication.

Educational Value
In addition to introducing students to formal algorithm derivation, educators have successfully used Flame to teach linear algebra algorithms in a classroom setting. Also, Flame’s API affords clean abstractions, which makes it ideally suited for teaching high-performance linear algebra courses at the undergraduate and graduate level.
Historically, instructors have used the BLAS/Lapack coding style in these pedagogical settings. However, we believe that this coding style obscures the algorithms; students often get bogged down debugging frustrating errors caused by indexing directly into arrays that represent the matrices. Using Flame greatly reduces the line of students who need help with coding during office hours.

Dense Linear Algebra Framework
Like Lapack, libflame provides ready-made implementations of common linear algebra operations. libflame’s implementations mirror many of those in BLAS and Lapack. However, libflame differs from Lapack in two important ways. First, as mentioned, it provides algorithm families for each operation, so developers can choose the one that best suits their needs. Second, it provides a framework for building complete custom linear algebra codes. This makes it a more useful environment because it lets users quickly choose and/or prototype a linear algebra solution to fit the application’s needs.

High Performance
In our publications and performance graphs, we do our best to dispel the myth that user- and programmer-friendly linear algebra codes can’t yield high performance. As we show later, our Flame implementations of operations such as Cholesky factorization and triangular matrix inversion often outperformed Lapack’s corresponding implementations.7
Currently, libflame relies on only a core set of highly optimized, unblocked routines to perform the small subproblems found in linear algebra algorithm implementations. However, as we’ve recently demonstrated, we can automatically translate algorithms represented with the Flame/C API (Figure 2a) into a more traditional code implementation (Figure 2b) and thus attain the same performance. We’ve prototyped a source-to-source translator, FlameS2S, and are working to perfect it. There’s thus no longer a valid reason to code linear algebra algorithms in the traditional BLAS/Lapack style as opposed to a high level of abstraction.

Figure 2. Implementations of Figure 1’s algorithm. (a) The Flame/C API, which represents libflame’s coding style (FLA_Chol_l_blk_var2), and (b) Fortran-77, obtained from Lapack (DPOTRF).
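The flavor of the Flame/C listing in Figure 2a can be conveyed by the following sketch of the blocked Cholesky algorithm from Figure 1. The partitioning helpers (FLA_Part_2x2, FLA_Repart_2x2_to_3x3, FLA_Cont_with_3x3_to_2x2) and the level-3 BLAS wrappers are standard Flame/C interfaces; the argument lists shown here are a best-effort approximation and may differ in small details from the published figure.

FLA_Error FLA_Chol_l_blk_var2( FLA_Obj A, int nb_alg )
{
  FLA_Obj ATL, ATR,    A00, A01, A02,
          ABL, ABR,    A10, A11, A12,
                       A20, A21, A22;
  int b, mb;

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,    0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    /* b = min( remaining rows, algorithmic block size ) */
    mb = (int) FLA_Obj_length( ABR );
    b  = ( mb < nb_alg ? mb : nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, ATR,    &A00, &A01, &A02,
                                        &A10, &A11, &A12,
                           ABL, ABR,    &A20, &A21, &A22,
                           b, b, FLA_BR );
    /*------------------------------------------------------------*/
    /* A11 := A11 - A10 * A10^T                            (SYRK) */
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A10, FLA_ONE, A11 );

    /* A21 := A21 - A20 * A10^T                            (GEMM) */
    FLA_Gemm( FLA_NO_TRANSPOSE, FLA_TRANSPOSE,
              FLA_MINUS_ONE, A20, A10, FLA_ONE, A21 );

    /* A11 := chol( A11 )                                  (CHOL) */
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );

    /* A21 := A21 * inv( A11 )^T                           (TRSM) */
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    /*------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, &ATR,    A00, A01, A02,
                                             A10, A11, A12,
                              &ABL, &ABR,    A20, A21, A22,
                              FLA_TL );
  }

  return FLA_SUCCESS;
}

Each statement in the loop body corresponds one-for-one to an update in the Figure 1 worksheet, which is precisely the resemblance the Flame/C API is designed to preserve.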

Dependency-Aware Thread-Level Parallelism
Until recently,8 BLAS and Lapack authors advocated exploiting shared memory parallelism from Lapack routines by simply linking to multithreaded BLAS. Although this low-level solution requires no changes to Lapack code, it suffers from sharp efficiency and scalability limitations for small- and medium-sized matrix problems (as we show later).
The fundamental bottleneck to introducing parallelism directly within many algorithms is the web of data dependencies that frequently exists between subproblems. The libflame project has developed a runtime system to detect and analyze dependencies within algorithms-by-blocks (algorithms whose subproblems operate only on block operands). Once it identifies data dependencies, the system schedules suboperations to independent execution threads. Our runtime system is completely abstracted from the parallelized algorithm and requires virtually no changes to the algorithm code, yet it exposes abundant high-level parallelism.

Support for Hierarchical Storage-by-Blocks
Storing matrices by blocks often yields performance gains through improved spatial locality.9 Instead of representing matrices as a single linear data array with a prescribed leading dimension—as legacy libraries require (for column- or row-major order)—a block storage scheme is encoded into the matrix object. So, in this scheme, internal elements refer recursively to child objects that represent submatrices.
Currently, libflame provides a subset of the conventional API that supports hierarchical matrices, letting users create and manage such matrix objects as well as convert between storage-by-blocks and conventional “flat” storage schemes.
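libflame exposes this hierarchical functionality through a parallel set of interfaces conventionally prefixed FLASH_. The following sketch, whose routine names and signatures are assumptions that may vary across libflame releases, creates a one-level hierarchical copy of a conventional flat matrix, factors it with an algorithm-by-blocks, and flattens the result back into the original storage:

#include "FLAME.h"

/* Convert a conventional ("flat") column-major matrix into hierarchical
   storage-by-blocks, operate on it, and copy the result back.  One level
   of blocking with 256 x 256 blocks is assumed for illustration. */
void factor_by_blocks( FLA_Obj A_flat )
{
  FLA_Obj A_hier;
  dim_t   blocksize = 256;   /* storage/algorithmic block size (assumed) */

  /* Create a hierarchical copy: internally, each "element" of A_hier is
     itself an FLA_Obj that refers to one 256 x 256 block. */
  FLASH_Obj_create_hier_copy_of_flat( A_flat, 1, &blocksize, &A_hier );

  /* Algorithms-by-blocks operate directly on the hierarchical object. */
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A_hier );

  /* Copy the result back into the original flat storage and clean up. */
  FLASH_Obj_flatten_to_flat( A_hier, A_flat );
  FLASH_Obj_free( &A_hier );
}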

Advanced Build System
From its early revisions, libflame distributions have been bundled with a robust build system, featuring automatic makefile creation and a configuration script conforming to GNU standards (which lets users run the ./configure; make; make install sequence common to many open source projects).
Without any user input, the configure script searches for and chooses compilers based on a predefined preference order for each architecture. Users can request specific compilers via the configure interface or enable other nondefault libflame features, such as custom memory alignment, multithreading (via Posix threads or OpenMP), compiler options (debugging symbols, warnings, and optimizations), and memory leak detection. The reference BLAS and Lapack libraries provide no configuration support and require users to manually modify a makefile with appropriate references to compilers and compiler options, depending on the host architecture.

Backward Compatibility with Lapack
We understand that many developers have invested considerable time into their current dense linear algebra applications. We thus provide a set of compatibility routines that map conventional Lapack invocations to their corresponding libflame implementations. By simply linking to the liblapack2flame compatibility layer, you can take advantage of Flame’s performance benefits with virtually no changes to your application. (Currently, any operation called through the liblapack2flame compatibility layer will execute sequentially; to invoke our parallelized implementations, you must use native Flame interfaces.)
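In practice, this means that existing application code such as the following, an ordinary call to Lapack's dpotrf routine through its Fortran-style interface, keeps working untouched; only the link line changes so that the liblapack2flame layer is picked up before (or instead of) a conventional Lapack library, with the exact flags depending on the installation:

/* Existing application code: a conventional Lapack call through the
   Fortran-style interface.  Nothing here is libflame-specific. */
extern void dpotrf_( const char* uplo, const int* n,
                     double* a, const int* lda, int* info );

void factor_spd( double* a, int n )
{
  int info;

  /* Cholesky-factor the lower triangle of the n x n column-major matrix a.
     When the program is linked against the liblapack2flame compatibility
     layer, this same call is serviced by libflame. */
  dpotrf_( "L", &n, a, &n, &info );

  /* info == 0 on success, > 0 if the matrix is not positive definite. */
}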
Language Independence
We chose C as libflame’s implementation language because it simplifies the interface that lets Lapack users transparently use libflame instead. However, our approach is language independent: you can easily define APIs for various languages, including Matlab’s M-script, LabView’s G graphical programming language, and Fortran-77.
Indeed, Flame’s very methodology advocates creating APIs that mirror the notation used to present the algorithms. This differs from, say, the C++ Matrix Template Library,10 which provides an object-oriented interface to matrix operations that hides the matrices’ storage details.

Performance
The scientific community expects performance. For decades, it’s been assumed that the price for solving the programmability problem was diminished performance, and thus the kinds of abstractions that underlie libflame are “not allowed.” As we now show, however, these abstractions in fact yield performance in the same conditions under which traditional libraries thrive—as well as flexibly supporting high performance in what are generally considered more hostile circumstances.

Sequential Performance via Optimized BLAS
Traditional libraries like Lapack are layered upon the BLAS for portable performance. In its simplest mode, libflame is merely an alternative way to program algorithm families for operations—such as the LU, QR, and Cholesky factorizations—still in terms of traditional BLAS operations and linked to traditional BLAS libraries.
Figure 3a shows the performance of three sequential implementations of LU factorization with partial pivoting:

• dgetrf from the Intel Math Kernel Library (MKL is highly tuned, with performance representative of other optimized BLAS and Lapack libraries);
• FLA_LU_piv from libflame; and
• dgetrf from netlib’s Lapack reference implementation.

For this experiment, the target platform is a single Intel Itanium2 processor.
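For reference, invoking the FLA_LU_piv implementation listed above looks much like the earlier Cholesky example, except that the pivot vector is itself passed as an object. The sketch below uses the object-creation signature of recent libflame releases, which may differ from the 2009 API:

#include "FLAME.h"

/* LU factorization with partial pivoting of an m x n matrix object A.
   The pivot indices are returned in an integer-typed object p. */
void lu_with_pivoting( FLA_Obj A )
{
  FLA_Obj p;
  dim_t   m      = FLA_Obj_length( A );
  dim_t   n      = FLA_Obj_width( A );
  dim_t   min_mn = ( m < n ? m : n );

  /* Pivot vector: one integer per factored column. */
  FLA_Obj_create( FLA_INT, min_mn, 1, 0, 0, &p );

  FLA_LU_piv( A, p );        /* A is overwritten with L and U. */

  FLA_Obj_free( &p );
}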

Figure 3. Representative performance of various library implementations of LU factorization with partial pivoting (y-axes in gigaflops). The top of each graph represents the architecture’s theoretical peak. (a) Sequential LU factorization on a single Itanium2 CPU with Math Kernel Library (MKL) 10.1; curves: serial MKL, Flame + serial MKL, and Lapack + serial MKL. (b) Multithreaded LU factorization on 16 Itanium2 CPUs with MKL 8.1 (nb = 192); curves: SuperMatrix + serial MKL, multithreaded MKL, Flame + multithreaded MKL, and Lapack + multithreaded MKL.

As noted, the MKL implementation is highly optimized and attains excellent performance. The other implementations, which are available under open source licenses, are highly competitive.

Traditional Parallelism via Multithreaded Kernels
With the advent of parallel computers such as symmetric multiprocessing (SMP), nonuniform memory access (NUMA), and multicore architectures, the simplest approach to parallelizing libraries such as Lapack and libflame is to link to multithreaded BLAS kernels so that the parallel implementations and the original sequential code remain identical.
Figure 3b shows MKL, libflame, and Lapack performance for three implementations of LU factorization with partial pivoting. For this experiment, the target platform is a 16-CPU Itanium2 system. Extracting parallelism only within the BLAS limits both the Lapack and libflame implementations’ performance.

Scheduling Algorithms-by-Blocks to Multiple Threads
There are several problems with extracting parallelism via multithreaded BLAS only:

• each call to a BLAS operation ends with a thread synchronization (so-called fork-and-join parallelism);
• factorization of the “current panel” introduces an operation in the critical path that reduces the opportunity for parallelism; and
• implementing compute operations out of order to facilitate more parallelism greatly complicates the code.

To overcome these problems, we added several abstractions to libflame:

• We extended the API to support matrix storage-by-blocks. Although traditional linear algebra libraries enforce column-major storage both in how matrices are mapped to memory and in how the algorithm code is explicitly written, the object-based Flame/C API lets matrix elements themselves describe matrices, thus providing a convenient way of expressing matrices that are hierarchically stored by blocks.11
• We introduced the algorithms-by-blocks concept. These algorithms view each matrix element as a block and express the target computation as operations between those blocks.
• In some cases, we had to invent new algorithms to fit the algorithms-by-blocks concept. For example, in their traditional incarnations, the LU factorization with partial pivoting and QR factorization via Householder transformations require columns of blocks to collaborate to identify a pivot or perform an inner product. We created two concepts—incremental pivoting and incremental QR factorization—to overcome this bottleneck.11
• When we view blocks as data units and operations with blocks as computation units, we can expose parallelism by expressing the algorithm as a directed acyclic graph (DAG) of suboperations. In sequential mode, libflame implementations—coded as Figure 2a shows—execute suboperations immediately. However, abstractions within the library allow a nearly identical implementation to instead build a DAG as it executes, deferring computation until the DAG is complete. This provides the opportunity for suboperations to be dispatched to threads for parallel execution even when subproblems have many interdependencies.11 Thus, libflame algorithms can stay the same as developers and users move from a sequential to a shared memory parallel environment.
• We developed a runtime system, SuperMatrix, to analyze and dynamically schedule the suboperations captured in algorithms-by-blocks. SuperMatrix implements in software many hardware-level superscalar processor techniques, such as out-of-order execution.

Together, these abstractions let us exploit thread-level parallelism by elegantly and efficiently scheduling algorithms with many dependencies. Figure 3b shows the benefits of this approach: the SuperMatrix curve reports the performance of an algorithm-by-blocks for LU factorization with incremental pivoting, which we implemented with SuperMatrix.
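In terms of driver code, enabling SuperMatrix is a matter of configuring the runtime rather than rewriting the algorithm. The sketch below relies on the FLASH_Queue_* control calls that libflame provides for this purpose; exactly when the queued suboperations execute (inside the operation itself or at a later synchronization point) depends on the library version and build options, so this is an illustration of the idea rather than a definitive recipe:

#include "FLAME.h"

/* Factor a hierarchical (stored-by-blocks) matrix A_hier in parallel.
   The algorithm code itself (FLASH_Chol) is identical to the sequential
   case; only the runtime configuration changes. */
void parallel_chol( FLA_Obj A_hier, unsigned int n_threads )
{
  /* Tell the SuperMatrix runtime how many worker threads to use and
     make sure task queuing is enabled. */
  FLASH_Queue_set_num_threads( n_threads );
  FLASH_Queue_enable();

  /* The same algorithm-by-blocks call as in the sequential case: the
     runtime records the block suboperations, builds the DAG of
     dependencies, and dispatches ready tasks to the worker threads. */
  FLASH_Chol( FLA_LOWER_TRIANGULAR, A_hier );
}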

Figure 4. Representative performance on a system with four Tesla accelerators (y-axes in gigaflops). (a) Matrix multiplication on four Tesla graphics processing units versus Math Kernel Library 10.0.1; curves: SuperMatrix + serial CuBLAS (GPU = 4), serial CuBLAS (GPU = 1), and multithreaded MKL on an Intel Xeon quadcore (CPU = 4). (b) Out-of-core Cholesky factorization on a single Tesla GPU; curves: out-of-core SuperMatrix + CuBLAS (GPU = 1), in-core SuperMatrix + CuBLAS (GPU = 1), and in-core MKL on two Intel Xeon quadcores (CPU = 8).

Exploiting Hardware Accelerators
High-performance computing recently experienced the arrival of hardware accelerators, such as the IBM Cell Broadband Engine and general-purpose graphics processing units (GPGPUs). As part of a libflame prototype, we used the same mechanism we developed to target shared memory parallel architectures, this time scheduling suboperations to hardware accelerators instead of CPU threads. This let us achieve stunning performance almost effortlessly on systems with multiple hardware accelerators.12 As Figure 4a shows, four Nvidia Tesla accelerators achieved a rate of more than a trillion single-precision floating-point operations per second (teraflops).

Solving Medium-Sized Problems on a Desktop
In 1991, the Linpack benchmark executing on the world’s fastest machine achieved 13.9 gigaflops (billions of floating-point operations per second) while factoring what was then a large problem: a 25,000 × 25,000 matrix that fit in 512 processors’ combined memories. A modern desktop computer can now store such a matrix in its memory and solve it in seconds. But that problem is now a small one. We can use the same mechanism that lets us schedule algorithms-by-blocks to shared memory parallel architectures to schedule matrix blocks stored on hard drives. Figure 4b shows how this lets us factor medium-sized symmetric positive definite problems of size 100,000 × 100,000 in about 23 minutes via an out-of-core implementation using a single hardware accelerator.
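As a rough consistency check on these numbers, using the standard n^3/3 flop count for Cholesky factorization:

\[
\tfrac{1}{3}n^{3}\Big|_{n=10^{5}} \approx 3.3\times10^{14}\ \text{flops},
\qquad
\frac{3.3\times10^{14}\ \text{flops}}{23\times 60\ \text{s}} \approx 2.4\times10^{11}\ \text{flops/s} \approx 240\ \text{gigaflops}
\]

of sustained performance. The matrix itself occupies between 4 × 10^10 and 8 × 10^10 bytes (40 to 80 Gbytes, depending on precision), far more than a 2009 desktop's main memory, which is what makes an out-of-core approach necessary.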

Nov em b er /Decem b er 2009 61 Many opportunities and much work remain. Conf. Supercomputing, ACM Press, 2007, We are continuously expanding libflame’s func­ pp. 116–125. tionality and prototyping new ideas. We hope the 11. G. Quintana-Ortí et al., “Programming Matrix scientific computing community not only benefits Algorithms-by-Blocks for Thread-Level Parallelism,” from libflame, but also contributes feedback so ACM Trans. Math. Software, vol. 36, no. 3, 2009. that we can continue to improve the library. To 12. G. Quintana-Ortí et al., “Solving Dense Linear find out more about Flame, see www.cs.utexas. Algebra Problems on Platforms with Multiple Hard- edu/users/flame. ware Accelerators,” Proc. ACM Sigplan Symp. Principles and Practice Parallel Programming, ACM Press, 2009, Acknowledgments pp. 121–130. US National Science Foundation (NSF) grants CCF- 0540926 and CCF-0702714 partially sponsored Field G. Van Zee is a researcher and software de- this work. Universidad Jaime I researchers were sup- veloper for the Formal Linear Algebra Methods En- ported by projects CICYT TIN2008-06570-C04-01 vironment (Flame) group in the University of Texas and Funds for Development of European Regions, as at Austin’s Department of Computer Sciences, where well as P1B-2007-19, P1B-2007-32 of the Fundación he leads libflame maintenance. His research inter- Caixa-Castellón/Bancaixa and Universidad Jaime I. ests are in scientific, high-performance, and parallel Microsoft also provided significant support. We thank computing. Van Zee has an MS in computer sciences Nvidia for donating some of the hardware used in from the University of Texas at Austin. Contact him our experimental evaluation. All opinions, findings, at [email protected]. conclusions, and recommendations expressed here are solely those of the authors, and don’t necessarily Ernie Chan is a PhD student in the Department of reflect the views of the NSF. Computer Sciences at the University of Texas at Austin, where he researches the runtime scheduling of matrix References computations for shared memory parallelism. Chan 1. J.J. Dongarra et al., “A Set of Level 3 Basic Linear has an MS in computer science from the University of Algebra Subprograms,” ACM Trans. Math. Software, Texas at Austin. Contact him at [email protected]. vol. 16, no. 1, 1990, pp. 1–17. 2. E. Anderson et al., Lapack Users’ Guide, SIAM, 1999. Enrique S. Quintana-Ortí is an associate professor 3. F.G. Van Zee, libflame: The Complete Reference, Lulu, of computer architecture at the University Jaime I of 2009; www.lulu.com/content/5915632. Castellón, Spain. His research interests are in high- 4. J.A. Gunnels et al., “Flame: Formal Linear Algebra performance computing (multicore processors and Methods Environment,” ACM Trans. Math. Software, GPU programming), scientific computing (such as con- vol. 27, no. 4, 2001, pp. 422–455. trol theory), and linear algebra. Quintana-Ortí has a 5. P. Bientinesi et al., “The Science of Deriving Dense PhD in computer science from the Polytechnic Univer- Linear Algebra Algorithms,” ACM Trans. Math Soft- sity of Valencia. Contact him at [email protected]. ware, vol. 31, no. 1, 2005, pp. 1–26. 6. L.S. Blackford et al., ScaLapack Users’ Guide, SIAM, Gregorio Quintana-Ortí is an associate profes- 1997. sor of computer science at the University Jaime I of 7. P. Bientinesi, B. Gunter, and R.A. van de Geijn, Castellón, Spain. 
His research interests are in paral- “Families of Algorithms Related to the Inversion of lel and high-performance computing and numerical a Symmetric Positive Definite Matrix,” ACM Trans. linear algebra. Quintana-Ortí has a PhD in computer Math. Software, vol. 35, no. 1, 2008. science from the Polytechnic University of Valencia. 8. E. Gullo et al., “Numerical Linear Algebra on Contact him at [email protected]. Emerging Architectures: The Plasma and Magma Projects,” J. Physics: Conf. Series, vol. 180, 2009; Robert A. van de Geijn is a professor of computer sci- www.iop.org/EJ/article/1742-6596/180/1/012037/ ence and member of the Institute for Computational jpconf9_180_012037.pdf. Engineering and Sciences at the University of Texas 9. E. Elmroth et al., “Recursive Blocked Algorithms at Austin. His research interests include parallel com- and Hybrid Data Structures for Dense Matrix puting, high-performance computing, linear algebra Library Software,” SIAM Rev., vol. 46, no. 1, 2004, libraries, numerical analysis, and formal derivation pp. 3–45. of algorithms. van de Geijn has a PhD in applied 10. P. Gottschling, D.S. Wise, and M.D. Adams, mathematics from the University of Maryland. He’s “Representation-Transparent Matrix Algorithms a member of the IEEE and the ACM. Contact him at with Scalable Performance,” Proc. 21st ACM Int’l [email protected].
