Managing the Complexity of Lookahead for LU Factorization with Pivoting

Ernie Chan and Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712
{echan,rvdg}@cs.utexas.edu

Andrew Chapman
Microsoft Corporation
One Microsoft Way
Redmond, Washington 98052
[email protected]

ABSTRACT
We describe parallel implementations of LU factorization with pivoting for multicore architectures. Implementations that differ in two dimensions are discussed: (1) using classical partial pivoting versus recently proposed incremental pivoting and (2) extracting parallelism only within the Basic Linear Algebra Subprograms versus building and scheduling a directed acyclic graph of tasks. Performance comparisons are given on two different systems.

Categories and Subject Descriptors
D.1.3 [Software]: Concurrent Programming

General Terms
Performance

Keywords
LU factorization with partial pivoting, algorithm-by-blocks, directed acyclic graph, lookahead

1. INTRODUCTION
LU factorization with partial pivoting is simultaneously perhaps the most important operation for solving linear systems and often the most difficult one to parallelize due to the pivoting step. In this paper, we compare different strategies for exploiting shared-memory parallelism when implementing this operation. A simple approach is to link to multithreaded Basic Linear Algebra Subprograms (BLAS) [11] libraries. A strategy that requires nontrivial changes to libraries like the Linear Algebra PACKage (LAPACK) [2] is to add lookahead to classical LU factorization with partial pivoting. A recently proposed algorithm-by-blocks with incremental pivoting [5, 26] changes the pivoting strategy to increase opportunities for parallelism, at some expense to the numerical stability of the algorithm. To manage the resulting complexity, we introduced the SuperMatrix runtime system [8] as a general solution for parallelizing LU factorization with pivoting, which maps an algorithm-by-blocks to a directed acyclic graph (DAG) and schedules the tasks from the DAG in parallel. This approach solves the programmability issue that faces us with the introduction of multicore architectures by separating the generation of a DAG to be executed from the scheduling of its tasks.

The contributions of the present paper include:

• An implementation of classical LU factorization with partial pivoting within a framework that separates programmability issues from the runtime scheduling of a DAG of tasks.

• A comparison of different pivoting strategies for LU factorization.

Together these contributions provide further evidence that the SuperMatrix runtime system solves the problem of programmability while providing impressive performance.

In our previous SPAA paper [7], we first introduced the concept of using out-of-order scheduling to parallelize computation, using the Cholesky factorization, an operation which maps directly to an algorithm-by-blocks, as a motivating example. LU factorization with partial pivoting, on the other hand, does not map easily to an algorithm-by-blocks. Our solution addresses programmability since we can use the same methodology to parallelize this more complex operation without adding any extra complexity to the code that implements LU factorization with partial pivoting.

The rest of the paper is organized as follows. In Section 2, we present LU factorization with partial pivoting and several traditional methods for parallelizing the operation. We describe the SuperMatrix runtime system in Section 3. In Section 4, we describe LU factorization with incremental pivoting and its counterpart for QR factorization. Section 5 provides performance results, and we conclude the paper in Section 6.

2. LU FACTORIZATION WITH PARTIAL PIVOTING
We present the right-looking unblocked and blocked algorithms for computing the LU factorization with partial pivoting using standard Formal Linear Algebra Methods Environment (FLAME) notation [16] in Figure 1. The thick and thin lines have semantic meaning and capture how the algorithms move through the matrix, where the symbolic partitions reference the different submatrices on which computation occurs within each iteration of the loop.

We first describe the updates performed within the loop of the unblocked algorithm. The SWAP routine takes the vector (α11; a21), finds the index of the element with the largest magnitude in that vector, stores that index in π1, and exchanges that element with α11. Next, the pivot is applied (PIV), where the rest of the π1-th row is interchanged with the corresponding entries of (a10^T a12^T). Finally, a21 is scaled by 1/α11, and a rank-one update is performed over A22.
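To make the unblocked loop concrete, the following is a minimal sketch of Figure 1 (left) on a flat, column-major matrix. It is our own illustration rather than the paper's FLAME code: the routine name lu_piv_unb is hypothetical, CBLAS routines stand in for SWAP, PIV, and the updates, and the full row exchange below combines the SWAP and PIV steps into one dswap.

#include <stddef.h>
#include <cblas.h>

/* Hypothetical sketch of the unblocked algorithm of Figure 1 (left).
 * A is n x n, column major with leading dimension lda; p receives the
 * 0-based pivot row chosen at each step. */
static void lu_piv_unb(int n, double *A, int lda, int *p)
{
    for (int j = 0; j < n; j++) {
        double *col_j = &A[(size_t)j * lda];
        /* SWAP: index of the largest-magnitude element of (alpha11; a21). */
        int piv = j + (int)cblas_idamax(n - j, &col_j[j], 1);
        p[j] = piv;
        /* Exchange the entire pivot row with row j (SWAP plus PIV). */
        if (piv != j)
            cblas_dswap(n, &A[j], lda, &A[piv], lda);
        if (j < n - 1) {
            /* a21 := a21 / alpha11 */
            cblas_dscal(n - j - 1, 1.0 / col_j[j], &col_j[j + 1], 1);
            /* A22 := A22 - a21 * a12^T (rank-one update) */
            cblas_dger(CblasColMajor, n - j - 1, n - j - 1, -1.0,
                       &col_j[j + 1], 1,
                       &A[(size_t)(j + 1) * lda + j], lda,
                       &A[(size_t)(j + 1) * lda + j + 1], lda);
        }
    }
}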

Figure 1: The right-looking unblocked and blocked algorithms (left and right, respectively) for computing the LU factorization with partial pivoting. Here the matrix is pivoted in the same way LAPACK pivots it, so that PIV(p, A) = LU upon completion. In this figure, Lii denotes the unit lower triangular matrix stored over Aii, and n(A) stands for the number of columns of A.

In the blocked algorithm, the LU factorization (LUPIV) subproblem calls the unblocked algorithm, which updates the column panel (A11; A21) and stores all of the pivot indices in p1. We then apply all of those pivots (PIV) to the left and to the right of the current column panel. Next, a triangular solve with multiple right-hand sides (TRSM) is performed over A12 with L11, the unit lower triangular matrix stored in A11. Finally, A22 is updated with a general matrix-matrix multiplication (GEMM). Both TRSM and GEMM are examples of level-3 BLAS operations [11].

The problem instances of TRSM and GEMM incurred within this right-looking blocked algorithm are quite easily parallelized. The bulk of the computation in each iteration lies in the GEMM call, so straightforward implementations (e.g., LAPACK's blocked implementation dgetrf) can exploit parallelism by simply linking to multithreaded BLAS libraries and thus attain high performance. Even so, many opportunities for parallelism are lost since implicit synchronization points exist between consecutive calls to a parallelized BLAS routine.

2.1 Algorithm-by-blocks
By storing matrices hierarchically [12] and viewing submatrix blocks as the unit of data and operations with blocks (tasks) as the unit of computation, we reintroduced the concept of algorithms-by-blocks [19].
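Before turning to the algorithm-by-blocks formulation, the blocked algorithm just described can be sketched for a flat, column-major matrix. This is our own hypothetical illustration using LAPACKE and CBLAS (LAPACKE_dgetrf, LAPACKE_dlaswp, cblas_dtrsm, cblas_dgemm), not the paper's code; note that the pivot indices it stores are relative to the current panel.

#include <stddef.h>
#include <lapacke.h>
#include <cblas.h>

/* Hypothetical sketch of one iteration of the right-looking blocked
 * algorithm of Figure 1 (right).  A is n x n, column major with leading
 * dimension lda, k is the index of the current panel, and b is the
 * algorithmic block size.  ipiv[k..k+jb-1] receives panel-relative,
 * 1-based pivot indices. */
static void lu_piv_blk_iteration(int n, int b, double *A, int lda,
                                 lapack_int *ipiv, int k)
{
    int jb = (n - k < b) ? (n - k) : b;            /* width of the panel  */
    double *A11 = &A[(size_t)k * lda + k];         /* top of [ A11; A21 ] */

    /* [ A11; A21 ], p1 := LUPIV( [ A11; A21 ] ): factor the column panel. */
    LAPACKE_dgetrf(LAPACK_COL_MAJOR, n - k, jb, A11, lda, &ipiv[k]);

    /* Apply the panel's pivots to the blocks to its left and right (PIV). */
    if (k > 0)
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, k, &A[k], lda, 1, jb, &ipiv[k], 1);
    if (k + jb < n) {
        double *A12 = &A[(size_t)(k + jb) * lda + k];
        double *A21 = &A[(size_t)k * lda + k + jb];
        double *A22 = &A[(size_t)(k + jb) * lda + k + jb];

        LAPACKE_dlaswp(LAPACK_COL_MAJOR, n - k - jb, A12, lda, 1, jb,
                       &ipiv[k], 1);
        /* A12 := inv( L11 ) * A12 (TRSM with the unit lower triangle of A11). */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasUnit, jb, n - k - jb, 1.0, A11, lda, A12, lda);
        /* A22 := A22 - A21 * A12 (GEMM). */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n - k - jb, n - k - jb, jb,
                    -1.0, A21, lda, A12, lda, 1.0, A22, lda);
    }
}

Here dgetrf is used for the panel in place of the unblocked routine sketched earlier; either choice produces the same factorization of the panel.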
We reformulate the blocked algorithm presented in Figure 1 (right) as an algorithm-by-blocks in Figure 2 (left) using the FLASH [21] extension to the FLAME application programming interface (API) for the C programming language [3] for creating and accessing hierarchical matrices. Notice that the FLAME/C and FLASH application programming interfaces were typeset to closely resemble the FLAME notation and thus easily facilitate the translation from algorithm to implementation. We assume that the matrix A and the vector p are both stored hierarchically with one level of blocking. This storage scheme has the additional benefit that spatial locality is maintained when accessing the contiguously stored submatrices.

The matrix object A in Figure 2 (left) is itself encoded as a matrix of matrices in FLASH where the top-level object consists of references to the submatrix blocks. We stride through the matrix using a unit block size while decomposing the subproblems into operations on individual blocks. The algorithmic block size b in Figure 1 (right) now manifests itself as the storage block size of each contiguously stored submatrix block.

In Figure 2 (right), we illustrate the tasks that overwrite each block in a 3 × 3 matrix of blocks within each iteration of the loop in Figure 2 (left). We use the notation Ai,j to denote the (i, j)-th block within the matrix of blocks.

In the first iteration, we perform the task LUPIV0 on the entire left column panel of the matrix, where the symbolic partition A11 references A0,0 and A21 references A1,0 and A2,0. For convenience, we choose to make each LUPIV task updating a column panel of blocks an atomic operation because it cannot easily be partitioned into finer-grained tasks operating on individual blocks. For example, if we call the unblocked algorithm to perform LUPIV0, a pivot may occur within A2,0 in one iteration of its loop and then within A1,0 in the next.

The FLASH implementation shown as Figure 2 (left) is the following:

FLA_Error FLASH_LU_piv_blk( FLA_Obj A, FLA_Obj p )
{
  FLA_Obj ATL, ATR,    A00, A01, A02,
          ABL, ABR,    A10, A11, A12,
                       A20, A21, A22,
          AB0, AB1, AB2;
  FLA_Obj pT,          p0,
          pB,          p1,
                       p2;

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );
  FLA_Part_2x1( p,    &pT,
                      &pB,            0, FLA_TOP );

  while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ******************** */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    FLA_Repart_2x1_to_3x1( pT,                  &p0,
                        /* ** */              /* ** */
                                                &p1,
                           pB,                  &p2,
                           1, FLA_BOTTOM );
    /*------------------------------------------------------------*/
    FLA_Merge_2x1( A11,
                   A21,    &AB1 );
    FLASH_LU_piv( AB1, p1 );

    FLA_Merge_2x1( A10,
                   A20,    &AB0 );
    FLASH_Apply_pivots( FLA_LEFT, FLA_NO_TRANSPOSE, p1, AB0 );

    FLA_Merge_2x1( A12,
                   A22,    &AB2 );
    FLASH_Apply_pivots( FLA_LEFT, FLA_NO_TRANSPOSE, p1, AB2 );

    FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
                FLA_ONE, A11, A12 );

    FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 );
    /*------------------------------------------------------------*/
    FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,       A00, A01, /**/ A02,
                                                     A10, A11, /**/ A12,
                           /* ************** */    /* ****************** */
                              &ABL, /**/ &ABR,       A20, A21, /**/ A22,
                              FLA_TL );
    FLA_Cont_with_3x1_to_2x1( &pT,                   p0,
                                                     p1,
                           /* ** */                /* ** */
                              &pB,                   p2,
                              FLA_TOP );
  }
  return FLA_SUCCESS;
}

[Figure 2 (right) shows, for each iteration, the tasks that overwrite each block: iteration 1 performs LUPIV0, PIV1, PIV2, TRSM3, TRSM4, and GEMM5 through GEMM8; iteration 2 performs LUPIV9, PIV10, PIV11, TRSM12, and GEMM13; iteration 3 performs LUPIV14, PIV15, and PIV16.]


Figure 2: Left: The FLASH implementation of LU factorization with partial pivoting. Right: The tasks that overwrite each block in every iteration within LU factorization with partial pivoting on a 3 × 3 matrix of blocks. The subscripts denote the order in which the tasks can be sequentially executed within the algorithm-by-blocks. Certain blocks have two tasks overwriting them, in which case the task pictured on top must be executed first.
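To connect Figure 2 (right) with code, here is a hedged sketch, ours rather than the paper's, of how one iteration decomposes into per-block updates when the matrix is stored as a grid of contiguously stored b × b blocks. The blocked_matrix type, the block_of helper, and the direct CBLAS calls are invented for this illustration; the discussion that follows identifies which of these per-block updates become independent tasks.

#include <cblas.h>

/* One level of blocking: an mb x nb grid of references to contiguously
 * stored, column-major b x b submatrix blocks (in the spirit of FLASH). */
typedef struct {
    int      b, mb, nb;
    double **blk;                     /* blk[i + j*mb] is block A_{i,j} */
} blocked_matrix;

static double *block_of(blocked_matrix *A, int i, int j)
{
    return A->blk[i + j * A->mb];
}

/* Per-block updates of iteration k (compare Figure 2 (right) with k = 0).
 * LUPIV on the current panel and the PIV tasks that apply its pivots to
 * the other column panels are panel-wide, atomic tasks and are assumed to
 * have been performed already; only the per-block TRSM and GEMM updates
 * are shown. */
static void trailing_update_by_blocks(blocked_matrix *A, int k)
{
    int b = A->b;
    for (int j = k + 1; j < A->nb; j++)      /* one TRSM task per A_{k,j} */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasUnit, b, b, 1.0, block_of(A, k, k), b,
                    block_of(A, k, j), b);
    for (int j = k + 1; j < A->nb; j++)      /* one GEMM task per A_{i,j} */
        for (int i = k + 1; i < A->mb; i++)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        b, b, b, -1.0, block_of(A, i, k), b,
                        block_of(A, k, j), b, 1.0, block_of(A, i, j), b);
}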

The application of the pivots is partitioned into separate tasks, PIV1 and PIV2, each updating one of the two column panels to the right of the current one. Just like the LU factorization subproblem, the application of the pivots to a column panel is done atomically.

The TRSM update of A12 is partitioned into two independent TRSM tasks. TRSM3 overwrites A0,1, which was previously updated by PIV1, so PIV1 must complete execution before TRSM3 can begin. This situation is an example of a flow dependency (read-after-write) between two tasks, where one task reads the output of another; the same occurs for PIV2 and TRSM4 on A0,2. The update of A22 is decomposed into four independent GEMM tasks overwriting A1,1, A1,2, A2,1, and A2,2. The same flow dependencies occur between the GEMM and PIV tasks.

Clearly, this approach generalizes to matrices partitioned into a large number of blocks.

2.2 Lookahead
The difficulty in exploiting a coarser granularity of parallelism from LU factorization with partial pivoting lies with each LUPIV task and the resulting pivoting across the entire matrix, which become bottlenecks within each iteration. In order to alleviate this problem, lookahead is used, which is also called compute-ahead in the literature [1, 27]. An early study of lookahead for distributed-memory parallelization of LU factorization can already be found in [13]. For this operation, the update of A22 is subdivided and partially computed so that the LUPIV of the next iteration can be performed ahead of the current iteration, in parallel with the rest of the update to A22.

For example, in Figure 2 (right), PIV1 and PIV2 are independent and thus can be executed in parallel along with TRSM3 and TRSM4. GEMM5, GEMM6, GEMM7, and GEMM8 are also all independent of each other. In order to apply lookahead, we first schedule PIV1, TRSM3, and then GEMM5 and GEMM6 to execute. Once those tasks are complete, we can schedule LUPIV9 to execute alongside PIV2, TRSM4, and then GEMM7 and GEMM8.

The difficulty with traditional approaches for implementing lookahead is that the code becomes obfuscated and quite complex. Applying this technique to a wide range of different linear algebra algorithms has not been done because this solution does not inherently address programmability. First, finding the inherent bottleneck to computation is nontrivial. Second, partitioning the rest of the computation in order to allow the next iteration to start is often quite difficult. Moreover, lookahead can be applied to more than one iteration, which further complicates the code.

3. SUPERMATRIX RUNTIME SYSTEM
Instead of exposing the details of parallelization, like lookahead, within the code that implements the linear algebra operation, we developed the SuperMatrix runtime system around a clear separation of concerns in which we divide the process of exploiting parallelism into two phases: analyzer and dispatcher. During the analyzer phase, the execution of tasks is delayed, and instead the DAG is constructed dynamically; only the input and output matrix operands of each task are needed to perform the dependence analysis. Tasks represent the nodes of the graph, and data dependencies between tasks represent the edges [7, 8]. Once the analyzer is done, the dispatcher phase is invoked, which dispatches and schedules tasks to threads.

3.1 Analyzer
The FLASH code in Figure 2 (left) invokes the analyzer phase, which generates the DAG of tasks. Only flow dependencies occur between the tasks of LU factorization with partial pivoting. For example, given a 3 × 3 matrix of blocks, the analyzer stores the tasks depicted in Figure 2 (right) and constructs the resulting DAG shown in Figure 3 (left).

In nearly all of the linear algebra operations we have studied thus far [26], each operation is decomposed into tasks where each matrix operand consists of a single block, such as TRSM and GEMM. By contrast, LUPIV and PIV require that a matrix operand be a column panel of blocks, which was not previously supported by the SuperMatrix runtime system. We therefore introduce the concept of macroblocks, where a macroblock is a matrix partition representing several blocks, so that a task's matrix operand can be either a single block or a macroblock. For instance, LUPIV0 updates the macroblock consisting of A0,0, A1,0, and A2,0. This macroblock mechanism is needed in order to properly detect data dependencies between tasks and thus correctly construct the DAG. Once the DAG is built, the dispatcher does not need to know whether a task updates a single block or a macroblock, which provides a further separation of concerns.

3.2 Dispatcher
Once the DAG is constructed, the dispatcher is invoked in order to dynamically schedule the tasks in parallel. In [6], we discussed several different scheduling algorithms and heuristics. In this paper we only use a single-queue implementation from which all threads enqueue and dequeue ready tasks. We define a ready task as one for which all of the tasks it depends upon have been executed, so every task residing on the queue can be executed in parallel.

LUPIV0 is the only initial ready task within the DAG in Figure 3 (left) and thus is enqueued. Once a thread dequeues that task from the single queue and executes it, PIV1 and PIV2, both of which are dependent tasks of LUPIV0, become ready and are enqueued. The dispatcher continues this process until all tasks have been executed.

A typical scheduling of tasks uses a simple first-in first-out (FIFO) queue ordering. Since a thread can choose any task on the queue to dequeue instead of strictly the one at the head of the queue, we can apply different heuristics to improve either the load balance or the data locality of the tasks executed by each thread.

One particular heuristic is sorting the queue according to the heights of the tasks within the DAG. The height of a task is the distance between itself and its farthest leaf, where a leaf is a task in the DAG without any dependent tasks. Conversely, a root is a task that does not depend upon any other tasks. In Figure 3 (left), LUPIV0 is the only root, and PIV15 and PIV16 are leaves. The height of a root in the DAG represents the critical path of execution. By sorting the queue with this heuristic and having each thread dequeue from the head of the queue, we attempt to schedule tasks on the critical path of execution and thus reduce the time to execute all tasks.

This height-sorting heuristic mimics the concept of lookahead. In Figure 3 (left), GEMM5 and GEMM6 both have a height of six while GEMM7 and GEMM8 have a height of five. By sorting the tasks, threads will execute GEMM5 and GEMM6 first and then potentially execute LUPIV9 in parallel with GEMM7 and GEMM8, since all three of those tasks have the same height.

An algorithm-by-tiles similar to our algorithm-by-blocks had already been proposed for a parallel out-of-core LU factorization in [28]. In that implementation, parallelism is extracted within operations on tiles that are brought in from disk rather than across operations on tiles, and a DAG of tasks is neither built nor scheduled. Their approach is also similar to ours in that it implements classical LU factorization with partial pivoting.

4. ALTERNATIVES
We present two alternatives to using LU factorization with partial pivoting as a solution for solving linear systems.

4.1 LU factorization with incremental pivoting
We now review an approach, incremental pivoting, that avoids many of the dependencies exhibited in the classical LU factorization with partial pivoting [23, 24, 26].

While we refer the reader to the aforementioned papers for details, we very briefly give a flavor of how incremental pivoting works. Recall that Gaussian elimination is one formulation of LU factorization. As elements below the diagonal are eliminated in Gaussian elimination, we can choose to swap the current row, where a zero is to be introduced, with the row being used to eliminate that row. Thus, at each step, the row among such a pair of rows that has the largest-magnitude pivot element can be swapped into place and used to introduce a zero in the other row. Incremental pivoting works similarly but with blocks.

In Figure 3 (right), we present the DAG for LU factorization with incremental pivoting on a 3 × 3 matrix of blocks. Here, each LUPIV task updates the symbolic partition A11, which consists of only a single block as opposed to a macroblock. The TRSM tasks update the blocks composing A12, similar to TRSM within LU factorization with partial pivoting, but these tasks are prepended with the application of the pivots. LUSA and FSSA are "structurally aware" kernels that update A21 and A22, respectively, according to the LUPIV task performed within each iteration. Notice that the DAG produced from incremental pivoting has a shorter critical path of execution and more opportunities for exploiting parallelism.

Strictly speaking, LU factorization with partial pivoting is numerically unstable due to the potential for element growth on the order of 2^n, where n is the matrix dimension. In practice, it is considered to be stable based on decades of experience. Incremental pivoting is inherently less stable than partial pivoting [23] and has not been in use for decades, so it is not known whether it is stable in practice.

Figure 3: The directed acyclic graphs for LU factorization with partial (left) and incremental (right) pivoting on a 3 × 3 matrix of blocks.

4.2 QR factorization
QR factorization based on orthogonal transformations, such as Householder transformations, is a numerically stable alternative, but it requires twice the number of floating point operations of LU factorization.

In [25], two Householder-based algorithms-by-blocks for QR factorization were studied. One was implemented using a classical high-performance QR factorization while the other implemented an incremental scheme, first proposed for out-of-core computation [17], not unlike LU factorization with incremental pivoting. As with LU factorization, the incremental scheme was shown to exhibit more parallelism. The difference is that this incremental QR factorization approach is guaranteed to be stable. Thus, if stability is a concern, this alternative solution method can be employed despite the extra computational cost.

5. PERFORMANCE
We compare the performance of LU factorization with partial and incremental pivoting within the SuperMatrix runtime system against other high-performance libraries on two different platforms.

5.1 Target architectures
We performed experiments on a single SMP node of the large distributed-memory cluster ranger.tacc.utexas.edu, which contains 3,936 nodes and runs a 2.6.18.8 Linux kernel. Each node consists of four sockets with 2.3 GHz AMD Opteron Quad-Core 64-bit processors for a total of 16 cores, providing a theoretical peak performance of 147.2 GFLOPS. Each node contains 32 GB of memory, and each socket has a 2 MB L3 cache that is shared between its four cores. The OpenMP implementation provided by the Intel C Compiler 10.1 served as the underlying threading mechanism used by SuperMatrix. We linked to GotoBLAS2 1.00 and Intel MKL 10.0.

We also gathered results on a 16-core UMA machine running Windows Server 2008 R2 Enterprise, which consists of four sockets with 2.4 GHz Intel Xeon E7330 Quad-Core processors, providing a theoretical peak performance of 153.6 GFLOPS. This machine contains 8 GB of memory, and each socket has two 3 MB L2 caches, each shared between a pair of cores. We compiled using Microsoft Visual C++ 2010 and Intel Visual Fortran 11.1. We linked to GotoBLAS2 1.00 and Intel MKL 10.2.

5.2 Implementations
We report the performance (in GFLOPS, billions of floating point operations per second) of several implementations of LU factorization using double-precision floating point arithmetic. We used an operation count of (2/3)n^3 of useful computation for each implementation presented to calculate the rate of execution. We tuned the storage and algorithmic block sizes for each problem size when possible, and we mapped 16 threads, one to each of the 16 cores, on both machines for all experiments.

SuperMatrix:
We used the SuperMatrix runtime system embedded within the libflame open source library [29]. Both LU factorization with partial pivoting and LU factorization with incremental pivoting are implemented using SuperMatrix. For the partial pivoting implementation, we used the scheduling heuristic that sorts tasks according to their heights within the DAG, as described in Section 3. For the incremental pivoting implementation, we used the cache affinity scheduling algorithm described in [6], which attempts to balance data locality and load balance simultaneously. SuperMatrix requires that the matrices be stored hierarchically. We call a serial BLAS library for the execution of tasks on a single thread.

LAPACK + multithreaded BLAS:
We linked the sequential implementation of dgetrf provided by LAPACK 3.0 to multithreaded BLAS routines from GotoBLAS and MKL. Parallelism is only exploited within each call to a multithreaded BLAS routine. dgetrf assumes that the matrices are stored in traditional "flat" column-major order.

Multithreaded GotoBLAS/MKL:
We linked to the highly optimized multithreaded implementations of dgetrf within GotoBLAS and MKL. These implementations exploit parallelism internally.

5.3 Results
Performance results are reported in Figure 4. Several comments are in order:

• For smaller problem sizes with SuperMatrix, LU factorization with incremental pivoting ramps up in performance more quickly than partial pivoting because there are more opportunities for parallelism within the DAG, as shown in Figure 3.

Despite having many more bottlenecks to parallelism, the SuperMatrix implementation of LU factorization with partial pivoting achieves much better performance for asymptotically large problem sizes because the kernels invoked by partial pivoting are much more efficient than the ones for incremental pivoting. The bulk of the computation in partial pivoting lies with calls to GEMM, which is a highly tuned kernel, whereas incremental pivoting predominantly calls the structurally-aware tasks LUSA and FSSA, which we have hand coded. In order to implement these tasks, we use an inner algorithmic block size to stride through the matrix operands. For instance, the storage block size is typically 192 × 192, yet we use an inner block size of around 48. As we stride through the matrix operands of each task, we perform computation on submatrix partitions as in a typical FLAME algorithm. BLAS operations are highly tuned for larger problem sizes, such as GEMM performed with 192 × 192 blocks, as opposed to multiple calls to GEMM and TRSM on much smaller submatrices [14].

Another difference in the asymptotic performance of LU with partial pivoting versus incremental pivoting is due to the fact that the former amortizes O(n^2) operations related to pivoting over O(n^3) operations while the latter amortizes O(b^2) operations related to pivoting over O(b^3) operations, where n is the total matrix size and b is the storage block size. Since b is typically fixed as n increases, asymptotically the performance of the algorithm that uses incremental pivoting falls below that of partial pivoting.

As the problem sizes grow asymptotically large, the use of sorting to provide lookahead achieves good load balance between threads. Since all the computational kernels invoked by partial pivoting are significantly faster than those invoked by incremental pivoting, partial pivoting outperforms incremental pivoting despite incremental pivoting having better parallel efficiency. Unfortunately, partial pivoting does not scale as well as incremental pivoting when using more threads because of the inherent bottlenecks within LU factorization with partial pivoting.

• In order to perform the subproblem for LU factorization with partial pivoting, we copy the macroblock from storage-by-blocks to a flat matrix, execute the subproblem using dgetrf, and then copy the result back into the original macroblock. We made a small optimization of not copying the matrix for the LU subproblem if there is only one block within the macroblock. To apply the pivots to a macroblock, we hand coded a kernel that is structurally aware of the storage-by-blocks instead of copying into a flat matrix and calling the optimized LAPACK implementation dlaswp.

Performing the copy for the LU subproblem does not incur a significant performance penalty because that task performs O(nb^2) operations on O(nb) data, so the copying is essentially amortized over the computational cost of the task. Also, the LU subproblem is only invoked once per iteration of the loop, on a single column panel of the hierarchically stored matrix A. The application of the pivots, by contrast, is performed on all of the other column panels within every iteration of the loop. The relative cost of copying a column panel into a flat matrix would be too high due to the small amount of computation done by the application of the pivots.

This hand coded implementation for applying the pivots is not as highly optimized as the implementation of dlaswp provided by both GotoBLAS and MKL, so a performance penalty is incurred when compared to the multithreaded implementations of dgetrf provided by both. The SuperMatrix implementations could attain better performance if we developed optimized structurally-aware kernels.

• The sequential implementation of dgetrf linked to multithreaded BLAS implementations does not perform as well as the others because of the many lost opportunities for parallelism.

If numerical stability is not an issue, then we can employ incremental pivoting for smaller problem sizes and switch to partial pivoting for large problem sizes to provide the best performance with SuperMatrix.

In Figure 4 (top left), we also compare the performance of the SuperMatrix implementation of LU factorization with partial pivoting where we use a simple FIFO ordering of tasks as opposed to sorting the tasks according to their heights within the DAG. As we can clearly see, sorting tasks, which mimics lookahead, nearly doubles the performance for SuperMatrix. This scheduling heuristic is completely subsumed within the runtime system and is not exposed within the code that implements this operation.

[Figure 4 appears here: four panels plotting GFLOPS against matrix size (0 to 10000). The top panels show LU factorization on 4 x AMD Opteron Quad-Core under Linux with GotoBLAS2 1.00 (left) and MKL 10.0 (right); the bottom panels show LU factorization on 4 x Intel Xeon E7330 Quad-Core under Windows with GotoBLAS2 1.00 (left) and MKL 10.2 (right). Each panel compares SuperMatrix partial pivoting and SuperMatrix incremental pivoting with a serial BLAS, LAPACK dgetrf with multithreaded BLAS, and multithreaded dgetrf; the top-left panel additionally shows SuperMatrix partial pivoting without sorting.]

Figure 4: Performance of different implementations of LU factorization on two different platforms.

6. CONCLUSION
In this paper, we have shown a solution for parallelizing LU factorization with partial pivoting that also addresses programmability by separating the runtime system from the code that implements the operation. By using this separation of concerns, we are able to subsume the idea of lookahead seamlessly. As such, this strategy is highly competitive with finely tuned, high-performance implementations provided by commercial libraries.

We believe the SuperMatrix runtime system is the only solution for parallelizing matrix computations that addresses programmability. Hierarchically Tiled Arrays (HTA) [18] and Unified Parallel C (UPC) [20] both provide programming language support for blocked computation, but neither performs dependence analysis in order to exploit parallelism between operations within algorithms; UPC instead uses lookahead embedded within its code to parallelize LU factorization with partial pivoting. SMP Superscalar (SMPSs) [22] is a general-purpose runtime system that also constructs a DAG using the input and output operands of tasks, but it does not focus on developing algorithms-by-blocks in order to parallelize matrix computations. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) [4, 5] uses a similar DAG scheduling methodology to parallelize matrix computations, but the details of parallelization are not separated from the code, and so far PLASMA has implemented LU factorization with incremental pivoting rather than partial pivoting. SuperLU [9] and High-Performance LINPACK (HPL) [10] both use lookahead in order to parallelize LU factorization with partial pivoting for distributed-memory computer architectures, yet neither addresses programmability since the lookahead strategy is embedded directly within the code that implements the operation. Communication-avoiding LU factorization [15] is a fundamentally different pivoting strategy from partial and incremental pivoting since it was designed to limit communication between the nodes of distributed-memory architectures.

Additional information
For additional information on FLAME visit http://www.cs.utexas.edu/users/flame/.

7. ACKNOWLEDGMENTS
This research is sponsored by Microsoft Corporation and NSF grants CCF–0540926 and CCF–0702714. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF). We thank the Texas Advanced Computing Center (TACC) for access to their equipment.

8. REFERENCES
[1] C. Addison, Y. Ren, and M. van Waveren. OpenMP issues arising in the development of parallel BLAS and LAPACK libraries. Scientific Programming, 11(2):95–104, April 2003.
[2] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, J. J. Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, and D. Sorensen. LAPACK Users' Guide (Third ed.). SIAM, Philadelphia, 1999.
[3] P. Bientinesi, E. S. Quintana-Ortí, and R. A. van de Geijn. Representing linear algebra algorithms in code: The FLAME application programming interfaces. ACM Transactions on Mathematical Software, 31(1):27–59, March 2005.
[4] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. Parallel tiled QR factorization for multicore architectures. Concurrency and Computation: Practice and Experience, 20(13):1573–1590, September 2008.
[5] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing, 35(1):38–53, January 2009.
[6] E. Chan. Runtime data flow scheduling of matrix computations. Technical Report TR-09-27, The University of Texas at Austin, Department of Computer Sciences, August 2009.
[7] E. Chan, E. S. Quintana-Ortí, G. Quintana-Ortí, and R. van de Geijn. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA '07: Proceedings of the Nineteenth ACM Symposium on Parallelism in Algorithms and Architectures, pages 116–125, San Diego, CA, USA, June 2007.
[8] E. Chan, F. G. Van Zee, P. Bientinesi, E. S. Quintana-Ortí, G. Quintana-Ortí, and R. van de Geijn. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In PPoPP '08: Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123–132, Salt Lake City, UT, USA, February 2008.
[9] J. W. Demmel, J. R. Gilbert, and X. S. Li. An asynchronous parallel supernodal algorithm for sparse Gaussian elimination. SIAM Journal on Matrix Analysis and Applications, 20(4):915–952, October 1999.
[10] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart. LINPACK Users' Guide. SIAM, Philadelphia, 1979.
[11] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1–17, March 1990.
[12] E. Elmroth, F. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3–45, 2004.
[13] A. Gerasoulis and I. Nelken. Scheduling linear algebra parallel algorithms on MIMD architectures. In Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing, pages 68–95, Philadelphia, PA, USA, 1990.
[14] K. Goto and R. A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software, 34(3):12:1–12:25, May 2008.
[15] L. Grigori, J. W. Demmel, and H. Xiang. Communication avoiding Gaussian elimination. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–12, Austin, TX, USA, November 2008.
[16] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal linear algebra methods environment. ACM Transactions on Mathematical Software, 27(4):422–455, December 2001.
[17] B. C. Gunter and R. A. van de Geijn. Parallel out-of-core computation and updating of the QR factorization. ACM Transactions on Mathematical Software, 31(1):60–78, March 2005.
[18] J. Guo, G. Bikshandi, B. B. Fraguela, M. J. Garzaran, and D. Padua. Programming with tiles. In PPoPP '08: Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 111–122, Salt Lake City, UT, USA, February 2008.
[19] F. G. Gustavson. New generalized matrix data structures lead to a variety of high-performance algorithms. In Proceedings of the IFIP TC2/WG2.5 Working Conference on Software Architectures for Scientific Computing Applications, pages 211–234, Ottawa, ON, Canada, October 2000.
[20] P. Husbands and K. Yelick. Multi-threading and one-sided communication in parallel LU factorization. In SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pages 1–10, Reno, NV, USA, November 2007.
[21] T. Meng Low and R. van de Geijn. An API for manipulating matrices stored by blocks. FLAME Working Note #12, TR-04-15, The University of Texas at Austin, Department of Computer Sciences, May 2004.
[22] J. M. Perez, R. M. Badia, and J. Labarta. A dependency-aware task-based programming environment for multi-core architectures. In Cluster '08: Proceedings of the 2008 IEEE International Conference on Cluster Computing, pages 142–151, Tsukuba, Japan, September 2008.
[23] E. S. Quintana-Ortí and R. A. van de Geijn. Updating an LU factorization with pivoting. ACM Transactions on Mathematical Software, 35(2):11:1–11:16, July 2008.
[24] G. Quintana-Ortí, E. S. Quintana-Ortí, E. Chan, R. van de Geijn, and F. G. Van Zee. Design of scalable dense linear algebra libraries for multithreaded architectures: The LU factorization. In MTAAP '08: Proceedings of the 2008 Workshop on Multithreaded Architectures and Applications, pages 1–8, Miami, FL, USA, April 2008.
[25] G. Quintana-Ortí, E. S. Quintana-Ortí, E. Chan, R. A. van de Geijn, and F. G. Van Zee. Scheduling of QR factorization algorithms on SMP and multi-core architectures. In PDP '08: Proceedings of the Sixteenth Euromicro International Conference on Parallel, Distributed and Network-Based Processing, pages 301–310, Toulouse, France, February 2008.
[26] G. Quintana-Ortí, E. S. Quintana-Ortí, R. A. van de Geijn, F. G. Van Zee, and E. Chan. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1–14:26, July 2009.
[27] P. Strazdins. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. International Journal of Parallel and Distributed Systems and Networks, 4(1):26–35, June 2001.
[28] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on Matrix Analysis and Applications, 18(4):1065–1081, October 1997.
[29] F. G. Van Zee. libflame: The Complete Reference. http://www.lulu.com/content/5915632/, 2009.