SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks

Ernie Chan    Field G. Van Zee    Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712
{echan,field,rvdg}@cs.utexas.edu

Paolo Bientinesi
Department of Computer Science
Duke University
Durham, NC 27708
[email protected]

Enrique S. Quintana-Ortí    Gregorio Quintana-Ortí
Departamento de Ingeniería y Ciencia de Computadores
Universidad Jaume I
12.071–Castellón, Spain
{quintana,gquintan}@icc.uji.es

Abstract

This paper describes SuperMatrix, a runtime system that parallelizes matrix operations for SMP and/or multi-core architectures. We use this system to demonstrate how code described at a high level of abstraction can achieve high performance on such architectures while completely hiding the parallelism from the library programmer. The key insight entails viewing matrices hierarchically, consisting of blocks that serve as units of data where operations over those blocks are treated as units of computation. The implementation transparently enqueues the required operations, internally tracking dependencies, and then executes the operations utilizing out-of-order execution techniques inspired by superscalar microarchitectures. This separation of concerns allows library developers to implement algorithms without concerning themselves with the parallelization aspect of the problem. Different heuristics for scheduling operations can be implemented in the runtime system independent of the code that enqueues the operations. Results gathered on a 16 CPU ccNUMA Itanium2 server demonstrate excellent performance.

Categories and Subject Descriptors  D.m [Software]: Miscellaneous

General Terms  Algorithms, Performance

Keywords  algorithms-by-blocks, dependency analysis, dynamic scheduling, out-of-order execution

PPoPP'08, February 20–23, 2008, Salt Lake City, Utah, USA. Copyright © 2008 ACM 978-1-59593-960-9/08/0002.

1. Introduction

Architectures capable of simultaneously executing many threads of computation will soon become commonplace. This fact has brought about a renewed interest in studying how to intelligently schedule sub-operations to expose maximum parallelism. Specifically, it is possible to schedule some computations earlier, particularly those operations residing along the critical path of execution, to allow more of their dependent operations to execute in parallel. This insight was recognized in the 1960s at the individual instruction level, which led to the adoption of out-of-order execution in many computer microarchitectures [32]. For the dense linear algebra operations on which we will concentrate in this paper, many researchers in the early days of distributed-memory computing recognized that "compute-ahead" techniques could be used to improve parallelism. However, the coding complexity required of such an effort proved too great for these techniques to gain wide acceptance. In fact, compute-ahead optimizations are still absent from linear algebra packages such as ScaLAPACK [12] and PLAPACK [34].

Recently, there has been a flurry of interest in reviving the idea of compute-ahead [1, 25, 31]. Several efforts view the problem as a collection of operations and dependencies that together form a directed acyclic graph (DAG). One of the first to exploit this idea was the Cilk runtime system, which parallelizes divide-and-conquer algorithms rather effectively [26]. Workqueuing [35], when allowed to be nested, achieves a similar effect as Cilk and has been proposed as an extension to OpenMP in the form of the taskq pragma already supported by Intel compilers. The importance of handling more complex dependencies is recognized in the findings reported by the CellSs project [4], and we briefly discuss this related work in Section 7. Other efforts to apply these techniques to dense matrix computations on the Cell processor [23] are also being pursued [24].

A number of insights gained from the FLAME project allow us to propose, in our opinion, elegant solutions to parallelizing dense linear algebra operations within multithreaded environments. We begin by introducing a notation, illustrated in Figure 1, for expressing linear algebra algorithms. This notation closely resembles the diagrams that one would draw to illustrate how the algorithm progresses through the matrix operands [19]. Furthermore, the notation enabled a systematic and subsequently mechanical methodology for generating families of loop-based algorithms given an operation's recursive mathematical definition [5, 6]. Algorithms are implemented using the FLAME/C API [8], which mirrors the notation used to express the algorithms, thereby abstracting away most implementation details such as array indexing. We realized that since the API encapsulates matrix data into typed objects, hypermatrices (matrices of matrices) could easily be represented by allowing elements in a matrix to refer to submatrices rather than only scalar values [11, 13, 14, 21, 29]. This extension to the FLAME/C API, known as FLASH [27], greatly reduced the effort required to code algorithms when matrices are stored by blocks, even with multiple levels of hierarchy [22, 33], instead of traditional row-major and column-major orderings.¹ Storing matrices by blocks led to the observation that blocks and the computation associated with each block could be viewed as basic units of data and computation, respectively [2, 20]. This abstraction allows us to apply superscalar scheduling techniques on operations over blocks via the SuperMatrix runtime system.

¹ For a survey of more traditional approaches to expressing and coding recursive algorithms when matrices are stored by blocks, see [16].

Figure 1. Blocked algorithms (Variants 1–3) for Cholesky factorization, inversion of a triangular matrix, and triangular matrix multiplication by its transpose. Each algorithm partitions the matrix, exposes a b × b diagonal block A11 in every iteration of its loop, and updates the surrounding submatrices.

The basic idea behind SuperMatrix was introduced in [9], in which we use the Cholesky factorization as a motivating example. Programmer productivity issues and a demonstration that the system covers multiple architectures were the topic of a second paper [10]. For that paper, two of the coauthors implemented in one weekend all cases and all algorithmic variants of an important set of matrix-matrix operations, the level-3 Basic Linear Algebra Subprograms (BLAS) [15], and showed performance improvements on a number of different architectures. This current (third) paper makes the following additional contributions:

• We show in more detail how the FLASH API facilitates careful layering of libraries and how the interface allows SuperMatrix to be invoked transparently by the library developer.
• We discuss the existence of anti-dependencies within linear algebra operations and how SuperMatrix resolves them.
• We illustrate how different algorithmic variants for computing the same operation produce roughly the same schedule with the application of the SuperMatrix runtime system.
• We use a much more complex operation, the inversion of a symmetric positive definite (SPD) matrix, to illustrate these issues.

Together, these contributions further the understanding of the benefits of out-of-order execution through algorithms-by-blocks.

The rest of the paper is organized as follows. In Section 2 we describe inversion of a symmetric positive definite matrix, which we use as a motivating example for the SuperMatrix runtime system in Section 3. We discuss different types of dependencies in Section 4. In Section 5 we show the effect of different algorithmic variants when using SuperMatrix. Section 6 provides performance results. We conclude the paper in Section 7.

2. Inversion of a Symmetric Positive Definite Matrix

Let A ∈ R^{n×n} be an SPD matrix.² The traditional approach to implement inversion of an SPD matrix (SPD-INV), e.g., as employed by LAPACK [3], is to compute:

(1) the Cholesky factorization A → U^T U (CHOL), where U is an upper triangular matrix,
(2) the inversion of a triangular matrix R := U^{-1} (TRINV), and
(3) the triangular matrix multiplication by its transpose A^{-1} := R R^T (TTMM).

² For the purpose of this discussion, A being SPD means that all the Cholesky factorization algorithms execute to completion, generating a unique nonsingular factor U.
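For concreteness, the sketch below expresses this three-stage sequence directly in terms of the corresponding LAPACK routines (dpotrf, dtrtri, dlauum). It is an illustrative sketch only, not the FLAME/SuperMatrix code developed in this paper; it assumes column-major storage, the upper-triangular case, and the common underscore-suffixed Fortran naming convention for the LAPACK symbols.

  /* Sketch: SPD-INV as three LAPACK sweeps (upper triangular case).
     Assumes column-major storage and underscore-suffixed Fortran symbols;
     error handling beyond the returned info codes is omitted. */
  extern void dpotrf_( const char *uplo, const int *n, double *a,
                       const int *lda, int *info );
  extern void dtrtri_( const char *uplo, const char *diag, const int *n,
                       double *a, const int *lda, int *info );
  extern void dlauum_( const char *uplo, const int *n, double *a,
                       const int *lda, int *info );

  int spd_inv_upper( int n, double *A, int lda )
  {
    int info;
    dpotrf_( "U", &n, A, &lda, &info );        /* (1) A -> U^T U      */
    if ( info != 0 ) return info;
    dtrtri_( "U", "N", &n, A, &lda, &info );   /* (2) R := U^{-1}     */
    if ( info != 0 ) return info;
    dlauum_( "U", &n, A, &lda, &info );        /* (3) A^{-1} := R R^T */
    return info;
  }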

If only the upper triangular part of A is originally stored, each stage can overwrite that upper triangular part of the matrix. In the remainder of this section we briefly discuss these three operations (sweeps) separately. For a more thorough discussion, we refer to [7], where the inversion of an SPD matrix is used to illustrate that the ideal choice of algorithm for a given operation is greatly affected by the characteristics of the target platform. The present paper uses the same operation to illustrate issues related to algorithms-by-blocks and out-of-order execution. We compare and contrast these two papers in the conclusion.

It is well understood that in order to attain high performance, matrix algorithms of this kind must be cast in terms of blocked computations so that the bulk of the computation is in matrix-matrix multiplication (level-3 BLAS). In Figure 1 we give three blocked algorithmic variants for each of the three sweeps. The algorithms use what has become standard FLAME notation. The thick and thin lines have semantic meaning and capture how the algorithms move through the matrices, exposing submatrices on which computation occurs. Each algorithm overwrites the original matrix A. All three algorithms are captured concisely in one figure, easily allowing us to compare and contrast. Thus, to understand what computation is performed in the body of the loop by, e.g., blocked Variant 3 for Cholesky factorization, one looks under the column marked CHOL in the row marked Variant 3, ignoring all other operations.

A few comments are in order. In many of the operations it is implicitly assumed that a matrix is upper triangular and/or only the upper triangular part of a matrix is updated. For any operation of the form Y := B^{-T} Y it is implicitly assumed that B is upper triangular and that Y is updated by the solution of B^T X = Y, also known as a triangular solve with multiple right-hand sides. A similar comment holds for Y := Y B^{-1}.

In [7] we show how appropriately chosen variants for each of these three sweeps can be combined into a single-sweep algorithm by rearranging the order of computations. This rearrangement yields better load balance on distributed-memory architectures. Since the SuperMatrix system rearranges operations automatically, there is no need to discuss that in the current paper.

3. Algorithms-By-Blocks

The algorithms in Figure 1 move through the matrix, exposing a new b × b submatrix A11 during each iteration. If the matrix is viewed as a matrix of b × b submatrices, each stored as a block, then the algorithms move through the matrix by exposing whole blocks. In general, all operations that are performed can be viewed as sub-operations (tasks) on blocks.
For example, in the Cholesky factorization Variant 3:

• A11 := CHOL(A11) requires a Cholesky factorization of the block A11.
• A12 := A11^{-T} A12 requires independent triangular solves with multiple right-hand sides (TRSM) with A11 and the blocks that comprise A12.
• A22 := A22 − A12^T A12 requires each of the blocks on or above the diagonal of A22 to be updated by a matrix-matrix multiplication (GEMM) for the off-diagonal blocks and a symmetric rank-k update (SYRK) for the diagonal blocks.

We visualize the tasks performed in the second iteration of Cholesky factorization Variant 3 on a 5 × 5 matrix of blocks in Figure 2, along with the input arguments for each task.³ The subscripts on each operation label denote the order in which the tasks are sequentially executed. Next, we illustrate the directed acyclic graph formed by the dependencies of Cholesky factorization Variant 3 on a 4 × 4 matrix of blocks in Figure 3. Notice that the 4 × 4 matrix of blocks used in Figure 3 corresponds to the bottom-right 4 × 4 matrix of blocks in Figure 2. The tasks depicted in Figure 2 also correspond to the tasks in the first three levels of the DAG.

³ A1,1 refers to the matrix block A[1,1], whereas A11 represents the matrix view used in FLAME notation.

Figure 2. The tasks performed in the second iteration of Cholesky factorization Variant 3 on a 5 × 5 matrix of blocks along with the input arguments for each task.

Figure 3. The directed acyclic graph formed by the dependencies of Cholesky factorization Variant 3 on a 4 × 4 matrix of blocks.

In Figure 4 we show how the FLASH API for computing over matrices stored by blocks allows the details of the implementation to be hidden. A few comments are due:

• The API is designed and the code is typeset so that the implementation closely mirrors the algorithms in Figure 1.
• The calls FLASH_Chol, FLASH_Trinv, and FLASH_Ttmm each enqueue a single task and internally register their dependencies.⁴
• The routines FLASH_Trsm, FLASH_Syrk, FLASH_Gemm, and FLASH_Trmm are coded in the same style as Figure 4. These routines decompose themselves into their component tasks, which are all enqueued.

⁴ Current research entails supporting recursive decomposition of these subproblems to expose parallelism at a finer granularity.

FLA_Error FLASH_SPD_inv_u_op( int op, FLA_Obj A )
{
  FLA_Obj ATL, ATR,    A00, A01, A02,
          ABL, ABR,    A10, A11, A12,
                       A20, A21, A22;

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3(
        ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
      /* ************* */    /* ******************** */
                             &A10, /**/ &A11, &A12,
        ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
        1, 1, FLA_BR );
    /*------------------------------------------------------------*/
    switch ( op )
    {
      case FLASH_Chol_op:    /* Variant 3 */
        FLASH_Chol( FLA_UPPER_TRIANGULAR, A11 );
        FLASH_Trsm( FLA_LEFT, FLA_UPPER_TRIANGULAR,
                    FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                    FLA_ONE, A11, A12 );
        FLASH_Syrk( FLA_UPPER_TRIANGULAR, FLA_TRANSPOSE,
                    FLA_MINUS_ONE, A12, FLA_ONE, A22 );
        break;
      case FLASH_Trinv_op:   /* Variant 3 */
        FLASH_Trsm( FLA_LEFT, FLA_UPPER_TRIANGULAR,
                    FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                    FLA_MINUS_ONE, A11, A12 );
        FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                    FLA_ONE, A01, A12, FLA_ONE, A02 );
        FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
                    FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                    FLA_ONE, A11, A01 );
        FLASH_Trinv( FLA_UPPER_TRIANGULAR,
                     FLA_NONUNIT_DIAG, A11 );
        break;
      case FLASH_Ttmm_op:    /* Variant 1 */
        FLASH_Syrk( FLA_UPPER_TRIANGULAR, FLA_NO_TRANSPOSE,
                    FLA_ONE, A01, FLA_ONE, A00 );
        FLASH_Trmm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
                    FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                    FLA_ONE, A11, A01 );
        FLASH_Ttmm( FLA_UPPER_TRIANGULAR, A11 );
        break;
    }
    /*------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(
        &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                               A10, A11, /**/ A12,
      /* ************** */     /* ****************** */
        &ABL, /**/ &ABR,       A20, A21, /**/ A22,
        FLA_TL );
  }
  return FLA_SUCCESS;
}

Figure 4. SuperMatrix implementation of selected variants for all three operations required for inversion of a symmetric positive definite matrix.

The details of enqueueing and the specification of dependencies are determined when the aforementioned FLASH_* routines are called. The inversion of an SPD matrix may now be performed with:

  FLASH_SPD_inv_u_op( FLASH_Chol_op, A );
  FLASH_SPD_inv_u_op( FLASH_Trinv_op, A );
  FLASH_SPD_inv_u_op( FLASH_Ttmm_op, A );
  FLASH_Queue_exec( );

Once all tasks are enqueued, FLASH_Queue_exec triggers the dependency analysis, which then allows the SuperMatrix runtime system to execute tasks out-of-order.
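The runtime mechanism driven by FLASH_Queue_exec is never exposed to the programmer, but its general shape can be sketched. The following is a minimal, hypothetical illustration: the types, fields, and helper functions are invented for exposition and are not the actual libflame/SuperMatrix internals. Each enqueued task counts its unsatisfied dependencies, and worker threads repeatedly execute tasks whose counts have reached zero, releasing their dependents as they finish.

  #include <pthread.h>
  #include <stddef.h>

  /* Hypothetical sketch of a SuperMatrix-style task record and worker loop. */
  #define MAX_DEPENDENTS 64

  typedef struct task {
    void        (*kernel)( void *args ); /* e.g., a serial blocked TRSM      */
    void         *args;                  /* operand blocks for the kernel    */
    int           n_unsatisfied;         /* tasks that must complete first   */
    struct task  *dependents[MAX_DEPENDENTS];
    int           n_dependents;
    struct task  *next;                  /* link within the ready queue      */
  } task_t;

  typedef struct {
    task_t         *head;
    pthread_mutex_t lock;
  } ready_queue_t;

  static void ready_push( ready_queue_t *q, task_t *t )
  {
    pthread_mutex_lock( &q->lock );
    t->next = q->head;  q->head = t;
    pthread_mutex_unlock( &q->lock );
  }

  static task_t *ready_pop( ready_queue_t *q )
  {
    pthread_mutex_lock( &q->lock );
    task_t *t = q->head;
    if ( t != NULL ) q->head = t->next;
    pthread_mutex_unlock( &q->lock );
    return t;
  }

  /* Executed by every idle thread once the dependency analysis is complete. */
  void worker_loop( ready_queue_t *ready, int *n_tasks_left )
  {
    while ( __sync_fetch_and_add( n_tasks_left, 0 ) > 0 ) {
      task_t *t = ready_pop( ready );
      if ( t == NULL ) continue;                 /* nothing ready yet: retry  */
      t->kernel( t->args );                      /* run the block operation   */
      for ( int i = 0; i < t->n_dependents; i++ ) {
        task_t *d = t->dependents[i];
        if ( __sync_sub_and_fetch( &d->n_unsatisfied, 1 ) == 0 )
          ready_push( ready, d );                /* all inputs now available  */
      }
      __sync_sub_and_fetch( n_tasks_left, 1 );
    }
  }

In the actual system the threads themselves are provided by OpenMP (Section 6.1); the mutex-protected list and GCC atomic builtins above are used purely to keep this sketch self-contained.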


In Figure 5, we show the simulated SuperMatrix execution of SPD-INV on a 5 × 5 matrix of blocks using four threads. Each column represents tasks that execute on separate threads. For illustration purposes, this simulation assumes that each thread performs lock-step synchronization after executing a single task, which we denote as a single stage. Even with this restrictive simulation, SuperMatrix achieves near-perfect load balance among threads. Such high efficiency is possible because the SuperMatrix runtime system exposes concurrency between the three component operations, which is unattainable by only parallelizing the three sweeps separately.

Figure 5. Simulated SuperMatrix execution of inversion of a symmetric positive definite matrix on a 5 × 5 matrix of blocks using four threads. Each column represents the tasks executed on separate threads; the simulation assumes lock-step synchronization after each task (a single stage).

Figure 5 depicts a prime example of out-of-order scheduling. Stage 21 contains a TTMM task corresponding to a sub-operation normally found in the third sweep. This TTMM task is scheduled to execute before a TRINV task in Stage 22 that originated from the second sweep.

4. Flow vs. Anti-Dependencies

With respect to its application of superscalar execution techniques, SuperMatrix raises the level of abstraction by treating sub-operations over submatrix blocks as the fundamental unit of computation. Similar in concept to instruction-level parallelism, dependencies between tasks that read from and write to blocks determine the potential for concurrency. When parallelizing linear algebra operations, it is natural to focus on flow dependencies and ignore anti-dependencies.

Even though the SuperMatrix runtime system automatically detects all three types of dependencies, we take special care to identify these dependencies in algorithms coded using the FLAME/C API for the illustrative purposes of this section.

In all the level-3 BLAS operations, the last operand is the only matrix that is overwritten by the operation, whereas all preceding operands are strictly inputs. LAPACK functions also follow this convention. We leverage this feature of the interface when detecting dependencies.

4.1 Flow dependencies

Flow dependencies, often referred to as "true" dependencies, create situations where a block is read by one task after it is written by a previous task (read-after-write).

  S1: A = B + C;
  S2: D = A + E;

In this simple example, statement S1 must complete execution and output the value of A before S2 can begin.

We illustrate the flow dependencies between operations from the first two iterations of Cholesky factorization Variant 3 in Figure 6. Intra-iterational flow dependencies are easily detectable in operations implemented using the FLAME/C API since the same submatrix view is used in multiple code locations. In Figure 6, A11 is first written by FLASH_Chol and then read by FLASH_Trsm in both the first and second iterations.

Figure 6. Flow dependencies between operations from the first two iterations of Cholesky factorization Variant 3.

Inter-iterational flow dependencies can be more difficult to recognize systematically because different submatrix views may reference the same block across multiple iterations.
We show the blocks comprising the different views in the first two iterations of a typical FLAME algorithm in Figure 7. The top left block of A22 becomes A11 in the next iteration, which leads us to the flow dependency between FLASH_Syrk0 and FLASH_Chol1 in Figure 6.

Figure 7. First two iterations of a FLAME algorithm on a 5 × 5 matrix of blocks.

Figure 8 shows flow and anti-dependencies within the three sweeps of SPD-INV. The plain arrows denote flow dependencies, and the arrows with a dash denote anti-dependencies. Since all sub-operations in Figure 8 lie within the while loop of Figure 4, we can capture both intra- and inter-iterational dependencies more concisely than by unrolling the loop as in Figure 6. The backward-pointing arrows in Figure 8(a) may seem counter-intuitive, but those dependencies correspond to the inter-iterational dependencies shown in Figure 6. Also notice that the bottom-right 3 × 3 matrix of blocks in A22 from the first iteration becomes A22 in the second iteration. Therefore, a flow dependency exists between FLASH_Syrk0 and FLASH_Syrk1 in Figure 6. Though not shown, this dependency forms a self loop on FLASH_Syrk in Figure 8(a). A similar self loop occurs within the FLASH_Syrk in Figure 8(c).

Figure 8. Dependencies within the while loop of each operation shown in Figure 4: (a) Cholesky factorization, (b) inversion of a triangular matrix, (c) triangular matrix multiplication by its transpose. The plain arrows denote flow dependencies, and the arrows with a dash denote anti-dependencies.

The DAG in Figure 3 corresponds to the dependencies shown in Figure 8(a) where all the sub-operations are decomposed into their component tasks when executed on a 4 × 4 matrix of blocks. We can view the DAG as the dynamic loop unrolling of tasks specified by the implementation of Cholesky factorization in Figure 4.

Detecting dependencies within FLAME/C algorithms can be laborious, as it requires complete knowledge of matrix partitioning to determine exactly which regions of the matrix are being referenced. However, the FLASH extension to FLAME/C provides a distinct benefit by clearly delimiting the blocks referenced by each submatrix view. Identifying dependencies within algorithms coded using traditional indexing would likely be difficult and error-prone, at best.

4.2 Anti-dependencies

Anti-dependencies occur when a task must read a block before another task can write to the same block (write-after-read).

  S3: F = A + G;
  S4: A = H + I;

Here, S3 must read the value of A before S4 overwrites it.

In the operations examined in [9, 10], we only encountered flow dependencies. However, TRINV and TTMM exhibit both flow and anti-dependencies. Both types of dependencies must be obeyed in order to correctly parallelize SPD-INV.

As with flow dependencies, intra-iterational anti-dependencies are straightforward to locate. A01 is read by FLASH_Syrk but then overwritten by FLASH_Trmm in Figure 8(c).

However, inter-iterational anti-dependencies can be quite difficult to identify. In Figure 7, we can see that the leftmost block of A12 becomes A01 in the next iteration. This example results in an anti-dependency between FLASH_Gemm and FLASH_Trsm in Figure 8(b).

The SuperMatrix runtime system detects anti-dependencies while tasks are being enqueued. For each block, the runtime system maintains a temporary queue that tracks which tasks must read the block's data. When the system detects that a task writes to the block in question, all previously inspected tasks on the temporary queue are marked with anti-dependencies. The block's temporary queue is cleared after the anti-dependencies are recorded.
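To make this bookkeeping concrete, the following sketch illustrates per-block dependency tracking of the kind described above. It is a hypothetical, serial illustration: the structure and function names are invented for exposition, they are not the actual SuperMatrix data structures, and locking for concurrent enqueueing is omitted. It exploits the operand convention noted at the start of this section: the last operand of each task is the one written, and all others are read.

  /* Hypothetical per-block dependency tracking (names invented). Each block
     remembers its last writer and the readers seen since that write.        */
  #define MAX_READERS 64

  typedef struct task task_t;            /* as in the earlier sketch          */

  typedef struct {
    task_t *last_writer;                 /* most recent task that wrote block */
    task_t *readers[MAX_READERS];        /* tasks that read since that write  */
    int     n_readers;
  } block_deps_t;

  /* add_edge(from, to): "to" may not start until "from" completes.           */
  extern void add_edge( task_t *from, task_t *to );

  /* Called once per operand as a task is enqueued.                           */
  static void record_access( block_deps_t *b, task_t *t, int is_written )
  {
    if ( b->last_writer != NULL )
      add_edge( b->last_writer, t );     /* flow or output dependency         */

    if ( is_written ) {
      for ( int i = 0; i < b->n_readers; i++ )
        if ( b->readers[i] != t )
          add_edge( b->readers[i], t );  /* anti-dependency (write-after-read) */
      b->n_readers   = 0;                /* clear the temporary reader queue  */
      b->last_writer = t;
    }
    else if ( b->n_readers < MAX_READERS ) {
      b->readers[b->n_readers++] = t;    /* remember reader for future writers */
    }
  }

A full runtime would additionally merge duplicate edges and protect this bookkeeping when several threads enqueue tasks concurrently.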
mic variants for computing a particular linear algebra operation of- As with flow dependencies, intra-iterational anti-dependencies ten yield significantly different performance results. The better per- are straightforward to locate. A01 is read by FLASH Syrk but then forming algorithms make efficient use of the architecture’s mem- overwritten by FLASH Trmm in Figure 8(c). ory hierarchy. When linking the same algorithms to multithreaded However, inter-iterational anti-dependencies can be quite dif- BLAS, the difference in performance is even more pronounced. ficult to identify. In Figure 7, we can see that the leftmost block of A12 becomes A01 in the next iteration. This example results in 5.1 Multithreaded BLAS an anti-dependency between FLASH Gemm and FLASH Trsm in Fig- ure 8(b). We show the different problem instances of symmetric rank-k up- The SuperMatrix runtime system detects anti-dependencies date encountered in Cholesky factorization Variant 1 and Variant 3 while tasks are being enqueued. For each block, the runtime sys- in Figure 9(a) and (b), respectively, on a 4×4 matrix of blocks. The tem maintains a temporary queue that tracks which tasks must read computation shown in Figure 9(a) takes the form of a generalized the block’s data. When the system detects that a task writes to the blocked dot product, which does not parallelize well. After decom- block in question, all previously inspected tasks on the temporary posing this problem instance into its component tasks, flow depen- queue are marked with anti-dependencies. The block’s temporary dencies exist between every task pair because each blocked sub- queue is cleared after the anti-dependencies are recorded. operation overwrites the single output matrix block. By contrast, the rank-k update in Figure 9(b) can be parallelized quite easily. Here, each component task overwrites a separate block of the out- 4.3 Output dependencies put matrix, allowing each task to execute independently. The bulk Output dependencies, the final class of dependencies, create situa- of the computation in Cholesky factorization occurs in the symmet- tions where the result of one operation is overwritten by the result ric rank-k update, so implementations of Variant 3 usually attain of subsequent computation (write-after-write). much higher performance than those of Variant 1. S5: A = J + K; Identifying algorithmic variants with subproblems that paral- S6: A = L + M; lelize well requires careful analysis. A poor choice of algorithm Chol Itanium2 p=16 Trinv Itanium2 p=16


5.2 SuperMatrix

SuperMatrix essentially performs a dynamic loop unrolling of all tasks executing on individual submatrix blocks. We can view different algorithmic variants simply as permutations of subproblems. For example, Cholesky factorization Variant 1 performs TRSM followed by SYRK and then CHOL, whereas Variant 3 rearranges the order such that CHOL occurs at the top of the loop. The problem instances of SYRK shown in Figure 9 no longer exist in isolation but are decomposed into their component tasks with dependencies between tasks from other subproblems.

The differences between variants are limited to the order in which subproblems are called and the submatrix views that are referenced. The tasks executed on each submatrix block remain the same since the dependencies remain invariant across all algorithmic variants. As a result, SuperMatrix produces nearly identical schedules regardless of the algorithmic variant being used to implement an operation.

Performance may vary slightly among different algorithmic variants implemented using SuperMatrix. Tasks are enqueued onto a global task queue where the dependencies that form the DAG are embedded into the task structures. Even though different variants form similar DAGs, the order in which tasks appear in the task queue will vary drastically. As a result, the order in which tasks are executed, and therefore the order in which their operands' submatrix blocks are referenced, will also vary. Several secondary effects such as level-2 cache performance may also affect the overall performance of different algorithmic variants.

We intend to explore this topic more formally in a future paper in the context of scheduling tasks in a DAG.

6. Performance

In this section, we show that SuperMatrix yields a performance improvement for each of the three sweeps. We observe further performance improvements when the three sweeps are combined and tasks are scheduled out-of-order.

6.1 Target architecture

All experiments were performed on an SGI Altix 350 server using double-precision floating-point arithmetic. This ccNUMA architecture consists of eight nodes, each with two 1.5 GHz Intel Itanium2 processors, providing a total of 16 CPUs and a peak performance of 96 GFLOPs/sec. (96 × 10^9 floating-point operations per second). The nodes are connected via an SGI NUMAlink connection ring and collectively provide 32 GB (32 × 2^30 bytes) of general-purpose physical RAM. The OpenMP implementation provided by the Intel C Compiler 8.1 served as the underlying threading mechanism used by SuperMatrix. Performance was measured by linking to two different high-performance implementations of the BLAS: the GotoBLAS 1.15 [17] and Intel MKL 8.1 libraries.

6.2 Implementations

We report the performance (in GFLOPs/sec.) of three different implementations for CHOL, TRINV, TTMM, and SPD-INV. An algorithmic block size of 192 was used for all experiments.

• SuperMatrix + serial BLAS
The SuperMatrix implementation of each operation is given in Figure 4. For the execution of individual tasks on each thread, we linked to serial BLAS libraries. The results reflect our simplest SuperMatrix mechanism, which is explained in detail in [10]. We did not use advanced scheduling techniques such as data affinity, where a thread performs all tasks that write to a particular block to improve locality of data to processors [9].

• FLAME + multithreaded BLAS
The sequential implementations of FLA_Chol, FLA_Trinv, and FLA_Ttmm provided by libFLAME 1.0 are linked with multithreaded BLAS libraries. libFLAME implements Variant 3 for CHOL, Variant 3 for TRINV, and Variant 1 for TTMM, which are the same algorithms implemented with SuperMatrix.

• LAPACK + multithreaded BLAS
We linked the sequential implementations of dpotrf, dtrtri, and dlauum provided by LAPACK 3.0 with multithreaded BLAS libraries. We modified the routines to use an algorithmic block size of 192 instead of the block sizes provided by LAPACK's ilaenv routine. LAPACK implements Variant 2 for CHOL, Variant 1 for TRINV, and Variant 2 for TTMM.

We performed experiments computing with both the upper and lower triangular parts of an SPD matrix but report only the upper triangular results here, because these provided the best performance from the LAPACK implementations. Lower triangular results were reported in [9, 10].

Figure 10. Performance on the Itanium2 machine using 16 CPUs linked with the GotoBLAS 1.15 and MKL 8.1 libraries: (a) Chol, (b) Trinv, (c) Ttmm, (d) SPD-inv. Each panel plots GFLOPs/sec. against matrix size for SuperMatrix + serial BLAS, FLAME + multithreaded BLAS, and LAPACK + multithreaded BLAS.

6.3 Results

Performance results when linking to the GotoBLAS and MKL libraries are reported in Figure 10. Several comments are in order:

• In this multithreaded setting, LAPACK implements sub-optimal algorithms for each of the three sweeps. By simply choosing a different algorithmic variant, the corresponding experiments combining FLAME with multithreaded BLAS provide greater performance for each sweep.

• Since each implementation used a block size of 192, experiments run with GotoBLAS and MKL generally yield similar performance curves, with the exception of TTMM in LAPACK, shown in Figure 10(c). In [9] we tuned the block size for each problem size and found that serial MKL outperformed GotoBLAS on small matrices, which are used heavily in SuperMatrix, while multithreaded GotoBLAS provided higher performance for large matrices. In [10] we reported performance results across several different computer architectures linked with various vendors' BLAS libraries. Notably, those results demonstrated that GotoBLAS typically provides the best performing serial and multithreaded BLAS implementations. The one exception to this observation is Intel's MKL, which narrowly but consistently outperforms GotoBLAS on Itanium2 architectures.

• In order to attain high performance, BLAS libraries pack matrices into contiguous buffers to obtain stride-one access during execution [18]. For large matrices, this packing cost is amortized over enough computation to minimize the overall impact on performance. This cost is more visible when performing operations over blocks, as occurs in SuperMatrix. Making matters worse, many blocks are accessed by several tasks, requiring those blocks to be repeatedly packed and unpacked. In [28] the authors show how to avoid these redundant packing operations in the context of general matrix-matrix multiplication. We suspect these techniques can be extended to SuperMatrix.
• The SuperMatrix mechanism employs a single FIFO queue of ready tasks that all threads access to acquire work. Current research investigates sorting the ready and available tasks according to different heuristics to reduce the latency of tasks on the critical path of execution, resulting in better load balance. One simple heuristic is to use the height of each task within the DAG; a minimal sketch of such a priority ordering appears at the end of this section. This sorting of tasks subsumes the concept of compute-ahead [1, 25, 31]. Furthermore, given the modular nature of SuperMatrix, these advanced scheduling techniques are completely abstracted away from the end-user.

• SuperMatrix performance ramps up much faster for all operations, especially for SPD-INV in Figure 10(d). The asymptotic performance of the best sequential implementations linked with multithreaded BLAS libraries exceeds SuperMatrix since parallelism is abundant within subproblems given sufficiently large problem sizes. We expect to find the crossover point at a larger problem size as more processors are used to perform computation. With better scheduling techniques and further optimizations to the SuperMatrix runtime system, we intend to provide better performance across all problem sizes.

Figure 11 shows results for every variant of Cholesky factorization using SuperMatrix and FLAME linked with serial and multithreaded GotoBLAS, respectively. We can clearly see that LAPACK implements Variant 2, which demonstrates that the difference in performance between libFLAME and LAPACK only lies with the choice of algorithm. All three variants of Cholesky factorization implemented using SuperMatrix share the same performance signature, which provides empirical evidence of the principles explained in Section 5. We have chosen to omit similar results for TRINV and TTMM.

Figure 11. Performance of all three variants for Cholesky factorization on the Itanium2 machine using 16 CPUs linked with the GotoBLAS 1.15 library (GFLOPs/sec. vs. matrix size). The performance curve of LAPACK overlaps with the FLAME implementation of Variant 2. All three variants of Cholesky factorization implemented using SuperMatrix share the same performance signature.
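As a concrete illustration of the height-based heuristic mentioned in the list above, the sketch below orders ready tasks by their height in the DAG so that tasks on or near the critical path are dispatched first. It is hypothetical: the task_t fields and helper routines are invented for exposition and do not correspond to the actual SuperMatrix queue implementation.

  /* Hypothetical priority ordering of ready tasks by DAG height.
     height(t) = length of the longest path from t to a sink task; tasks with
     larger heights lie closer to the critical path and are dispatched first. */
  #include <stddef.h>

  typedef struct task task_t;
  struct task {
    int      height;                    /* computed once after enqueueing     */
    int      n_dependents;
    task_t  *dependents[64];
    /* ... kernel, arguments, dependency counts as in the earlier sketch ...  */
  };

  /* Bottom-up height computation over an array of tasks that is already in
     reverse topological order (sinks first).                                 */
  void compute_heights( task_t **tasks, int n_tasks )
  {
    for ( int i = 0; i < n_tasks; i++ ) {
      task_t *t = tasks[i];
      t->height = 0;
      for ( int j = 0; j < t->n_dependents; j++ )
        if ( t->dependents[j]->height + 1 > t->height )
          t->height = t->dependents[j]->height + 1;
    }
  }

  /* Comparison function for ordering the ready queue (e.g., with qsort or a
     binary heap): taller tasks come out first, subsuming compute-ahead.      */
  int by_height_desc( const void *a, const void *b )
  {
    const task_t *ta = *(task_t *const *) a;
    const task_t *tb = *(task_t *const *) b;
    return tb->height - ta->height;
  }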
7. Conclusion

Using the FLASH extension to the FLAME/C API, library developers can write algorithms to compute linear algebra operations and leave the details of parallelization to the SuperMatrix runtime system. The high-level abstractions used by SuperMatrix also allow parallelism to be extracted across subroutine boundaries, a task that has proven intractable with conventional methods.

Despite the difficulty in identifying different types of dependencies, SuperMatrix implements and completely abstracts this process from the user. Given any sequential algorithm for computing a linear algebra operation, SuperMatrix can meet or exceed the performance of almost any other algorithm implemented with multithreaded BLAS. This fact frees us from needing to find a family of algorithms, which is often required in order to identify the algorithm with the best possible performance on specific target platforms.

Related work

In [7], all algorithms for each of the three sweeps of SPD-INV were enumerated, as shown in Figure 1, in order to find the best performing algorithmic variant for each sweep. The computation was then rearranged into one sweep, which allowed the authors to apply manual intra-iterational optimizations using PLAPACK [34]. The work in [7] was intended to derive parallelism from the BLAS implementations, with some further optimizations applicable only to distributed-memory architectures. By comparison, SuperMatrix runs on shared-memory architectures, exposes parallelism between BLAS subproblems, and does so automatically via dependency analysis.

CellSs [4] also applies the ideas of out-of-order execution for thread-level parallelism. The key difference between CellSs and SuperMatrix is the API for identifying tasks. CellSs uses annotations, similar to the compiler directives in OpenMP, which are placed around function calls to denote different tasks.
A source-to-source C compiler uses these annotations to insert code that performs the dependency analysis and subsequent out-of-order execution at runtime. The load balancing of tasks can vary greatly according to the computational runtime of each function. On the other hand, the computational runtime of each task in SuperMatrix is tied to the size of each submatrix block created using the FLASH API [27].

Another difference between the current project and CellSs relates to the scheduling of tasks. CellSs uses a scheduling method similar to work-stealing where tasks are assigned to certain threads during the analysis phase but can migrate between threads during execution. SuperMatrix employs a simpler concept, without sacrificing flexibility, where idle threads dequeue ready and available tasks from a global task queue.

Since SuperMatrix is highly portable, we also have tentative plans to adapt it to the Cell processor.

Additional information

For additional information on FLAME visit http://www.cs.utexas.edu/users/flame/.

Acknowledgments

We thank the other members of the FLAME team for their support. This research was partially sponsored by NSF grants CCF–0540926 and CCF–0702714.

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
References

[1] C. Addison, Y. Ren, and M. van Waveren. OpenMP issues arising in the development of parallel BLAS and LAPACK libraries. Scientific Programming, 11(2), 2003.
[2] R. C. Agarwal and F. G. Gustavson. Vector and parallel algorithms for Cholesky factorization on IBM 3090. In SC '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pages 225–233, New York, NY, USA, 1989.
[3] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, Jack J. Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, and D. Sorensen. LAPACK Users' Guide (third ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1999.
[4] Pieter Bellens, Josep M. Perez, Rosa M. Badia, and Jesus Labarta. CellSs: A programming model for the Cell BE architecture. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, pages 5–15, Tampa, FL, USA, November 2006.
[5] Paolo Bientinesi. Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms. PhD thesis, The University of Texas at Austin, 2006.
[6] Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Transactions on Mathematical Software, 31(1):1–26, March 2005.
[7] Paolo Bientinesi, Brian Gunter, and Robert van de Geijn. Families of algorithms related to the inversion of a symmetric positive definite matrix. ACM Transactions on Mathematical Software. To appear.
[8] Paolo Bientinesi, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. Representing linear algebra algorithms in code: The FLAME application programming interfaces. ACM Transactions on Mathematical Software, 31(1):27–59, March 2005.
[9] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116–125, San Diego, CA, USA, June 2007.
[10] Ernie Chan, Field G. Van Zee, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. Satisfying your dependencies with SuperMatrix. In Proceedings of the 2007 IEEE International Conference on Cluster Computing, pages 91–99, Austin, TX, USA, September 2007.
[11] S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast matrix multiplication. IEEE Transactions on Parallel and Distributed Systems, 13(11):1105–1123, 2002.
[12] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 120–127. IEEE Computer Society Press, 1992.
[13] Timothy Collins and James C. Browne. Matrix++: An object-oriented environment for parallel high-performance matrix computations. In Proceedings of the Hawaii International Conference on Systems and Software, pages 202–211, 1995.
[14] Timothy Scott Collins. Efficient Matrix Computations through Hierarchical Type Specifications. PhD thesis, The University of Texas at Austin, 1996.
[15] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff. A set of level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16(1):1–17, March 1990.
[16] Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kagstrom. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3–45, 2004.
[17] Kazushige Goto. http://www.tacc.utexas.edu/resources/software.
[18] Kazushige Goto and Robert A. van de Geijn. Anatomy of a high-performance matrix multiplication. ACM Transactions on Mathematical Software, 34(3), 2008.
[19] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4):422–455, December 2001.
[20] Fred G. Gustavson, Lars Karlsson, and Bo Kagstrom. Three algorithms for Cholesky factorization on distributed memory using packed storage. In Applied Parallel Computing. State of the Art in Scientific Computing, pages 550–559. Springer Berlin / Heidelberg, 2007.
[21] Greg Henry. BLAS based on block data structures. Theory Center Technical Report CTC92TR89, Cornell University, February 1992.
[22] José Ramón Herrero. A framework for efficient execution of matrix computations. PhD thesis, Polytechnic University of Catalonia, Spain, 2006.
[23] James Kahle, Michael Day, Peter Hofstee, Charles Johns, Theodore Maeurer, and David Shippy. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4/5):589–604, September 2005.
[24] Jakub Kurzak, Alfredo Buttari, and Jack Dongarra. Solving systems of linear equations on the Cell processor using Cholesky factorization. Technical Report UT-CS-07-596, Innovative Computing Laboratory, University of Tennessee, April 2007.
[25] Jakub Kurzak and Jack Dongarra. Implementing linear algebra routines on multi-core processors with pipelining and a look ahead. LAPACK Working Note 178, Technical Report UT-CS-06-581, University of Tennessee, September 2006.
[26] Charles Leiserson and Aske Plaat. Programming parallel applications in Cilk. SINEWS: SIAM News, 31, 1998.
[27] Tze Meng Low and Robert van de Geijn. An API for manipulating matrices stored by blocks. FLAME Working Note #12, TR-2004-15, Department of Computer Sciences, The University of Texas at Austin, May 2004.
[28] Bryan Marker, Field G. Van Zee, Kazushige Goto, Gregorio Quintana-Ortí, and Robert A. van de Geijn. Toward scalable matrix multiply on multithreaded architectures. In Euro-Par '07: Proceedings of the Thirteenth International European Conference on Parallel and Distributed Computing, pages 748–757, Rennes, France, August 2007.
[29] N. Park, B. Hong, and V. K. Prasanna. Tiling, block data layout, and memory hierarchy performance. IEEE Transactions on Parallel and Distributed Systems, 14(7):640–654, 2003.
[30] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR factorization algorithms on SMP and multi-core architectures. In PDP '08: Proceedings of the Sixteenth Euromicro International Conference on Parallel, Distributed and Network-based Processing, Toulouse, France, February 2008. To appear.
[31] Peter Strazdins. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. International Journal of Parallel and Distributed Systems and Networks, 4(1):26–35, June 2001.
[32] R. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11(1), 1967.
[33] Vinod Valsalam and Anthony Skjellum. A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurrency and Computation: Practice and Experience, 14(10):805–840, 2002.
[34] Robert A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 1997.
[35] Field G. Van Zee, Paolo Bientinesi, Tze Meng Low, and Robert A. van de Geijn. Scalable parallelization of FLAME code via the workqueuing model. ACM Transactions on Mathematical Software, 34(2), 2008.