A Supernodal Approach to Sparse Partial Pivoting

James W. Demmel*    Stanley C. Eisenstat†    John R. Gilbert‡
Xiaoye S. Li*    Joseph W. H. Liu§

July 10, 1995
Abstract

We investigate several ways to improve the performance of sparse LU factorization with partial
pivoting, as used to solve unsymmetric linear systems. To perform most of the numerical
computation in dense matrix kernels, we introduce the notion of unsymmetric supernodes. To
better exploit the memory hierarchy, we introduce unsymmetric supernode-panel updates and
two-dimensional data partitioning. To speed up symbolic factorization, we use Gilbert and
Peierls's depth-first search with Eisenstat and Liu's symmetric structural reductions. We have
implemented a sparse LU code using all these ideas. We present experiments demonstrating
that it is significantly faster than earlier partial pivoting codes. We also compare performance
with Umfpack, which uses a multifrontal approach; our code is usually faster.

Keywords: sparse matrix algorithms; unsymmetric linear systems; supernodes; column elimination
tree; partial pivoting.

AMS(MOS) subject classifications: 65F05, 65F50.

Computing Reviews descriptors: G.1.3 [Numerical Analysis]: Numerical Linear Algebra --
Linear systems (direct and iterative methods), Sparse and very large systems.
*Computer Science Division, University of California, Berkeley, CA 94720
({demmel,xiaoye}@cs.berkeley.edu). The research of these authors was supported in part by NSF
grant ASC-9313958, DOE grant DE-FG03-94ER25219, UT Subcontract No. ORA4466 from ARPA Contract
No. DAAL03-91-C0047, DOE grant DE-FG03-94ER25206, and NSF Infrastructure grants CDA-8722788 and
CDA-9401156.

†Department of Computer Science, Yale University, P.O. Box 208285, New Haven, CT 06520-8285
([email protected]). The research of this author was supported in part by NSF grant CCR-9400921.

‡Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, California 94304
([email protected]). The research of this author was supported in part by the Institute for
Mathematics and Its Applications at the University of Minnesota. Copyright (c) 1994, 1995 by
Xerox Corporation. All rights reserved.

§Department of Computer Science, York University, North York, Ontario, Canada M3J 1P3
([email protected]). The research of this author was supported in part by the Natural Sciences and
Engineering Research Council of Canada under grant A5509.
for column j = 1 to n do
    f = A(:, j);
    Symbolic factor: determine which columns of L will update f;
    for each updating column r do
        Col-col update: f = f - f(r) * L(:, r);
    end for;
    Pivot: interchange f(j) and f(k), where |f(k)| = max |f(j:n)|;
    Separate L and U: U(1:j, j) = f(1:j); L(j:n, j) = f(j:n);
    Divide: L(:, j) = L(:, j) / L(j, j);
    Prune symbolic structure based on column j;
end for;

Figure 1: LU factorization with column-column updates.

1 Introduction

The problem of solving sparse symmetric positive definite systems of linear equations on
sequential and vector processors is fairly well understood. Normally, the solution process is
broken into two phases: first, symbolic factorization to determine the nonzero structure of the
Cholesky factor; second, numeric factorization and solution. Elimination trees [24] and
compressed subscripts [30] reduce the time and space for symbolic factorization to a low-order
term. Supernodal [5] and multifrontal [11] elimination allow the use of dense vector operations
for nearly all of the floating-point computation, thus reducing the symbolic overhead in numeric
factorization to a low-order term. Overall, the megaflop rates of modern sparse Cholesky codes
are nearly comparable to those of dense solvers [26].

For unsymmetric systems, where pivoting is required to maintain numerical stability, progress
has been less satisfactory. Recent research has concentrated on two basic approaches:
submatrix-based methods and column-based (or row-based) methods. Submatrix methods typically use
some form of Markowitz pivoting, in which each stage's pivot element is chosen from the
uneliminated submatrix by criteria that attempt to balance numerical quality and preservation of
sparsity. Recent submatrix codes include Ma48 from the Harwell subroutine library [10], and
Davis and Duff's unsymmetric multifrontal code Umfpack [6].

Column methods, by contrast, typically use ordinary partial pivoting.¹
The pivot is chosen from the current column according to numerical considerations alone; the
columns may be preordered before factorization to preserve sparsity. Figure 1 sketches a generic
left-looking column LU factorization. Notice that the bulk of the numeric computation occurs in
column-column updates, or, to use Blas terminology [8], in sparse Axpys.

Column methods have the advantage that the preordering for sparsity is completely separate from
the factorization, just as in the symmetric case. However, symbolic factorization cannot be
separated from numeric factorization, because the nonzero structures of the factors depend on
the numerical pivoting choices. Thus column codes must do some symbolic factorization at each
stage; typically this amounts to predicting the structure of each column of the factors
immediately before computing it. George and Ng [14, 15] described ways to obtain upper bounds on
the structure of the factors based only on the nonzero structure of the original matrix. An
early example of such a code is Sherman's Nspiv [31] (which is actually a row code). Gilbert and
Peierls [20] showed how to use depth-first search and topological ordering to get the structure
of each factor column. This gives a column code that runs in total time proportional to the
number of nonzero floating-point operations, unlike the other partial pivoting codes. Eisenstat
and Liu [13] designed a pruning technique to reduce the amount of structural information
required for the symbolic factorization, as we describe further in Section 4. The result was
that the time and space for symbolic factorization were in practice reduced to a low-order term.

¹Row methods are exactly analogous to column methods, and codes of both sorts exist. We will use
column terminology; those who prefer rows may interchange the terms throughout the paper.
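The left-looking scheme of Figure 1 is easiest to see on a dense matrix, where the
symbolic-factorization and pruning steps are vacuous. The following NumPy sketch is our own
dense illustration (the name `left_looking_lu` is ours, not from SuperLU): each column receives
its col-col updates, then a partial pivot is chosen, then the column is split into L and U.

```python
import numpy as np

def left_looking_lu(A):
    """Dense sketch of the left-looking column LU of Figure 1.

    Returns perm, L, U with A[perm] = L @ U.  Dense illustration only:
    the symbolic factorization and pruning steps of the sparse
    algorithm do no work here.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    U = np.zeros((n, n))
    perm = np.arange(n)                    # row permutation so far
    for j in range(n):
        f = A[perm, j].copy()              # f = A(:, j) in current row order
        for r in range(j):                 # col-col update: f = f - f(r)*L(:, r)
            f[r + 1:] -= f[r] * L[r + 1:, r]
        k = j + np.argmax(np.abs(f[j:]))   # pivot: |f(k)| = max |f(j:n)|
        f[[j, k]] = f[[k, j]]              # interchange f(j) and f(k)
        perm[[j, k]] = perm[[k, j]]
        L[[j, k], :j] = L[[k, j], :j]      # keep earlier columns of L consistent
        U[:j + 1, j] = f[:j + 1]           # separate U(1:j, j)
        L[j:, j] = f[j:] / f[j]            # divide: L(j:n, j) = f(j:n) / f(j)
    return perm, L, U
```

The inner loop over `r` is exactly the sequence of sparse Axpys that the supernodal techniques
of Section 2 replace with dense Blas-2 and Blas-3 kernels.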
In view of the success of supernodal techniques for symmetric matrices, it is natural to
consider the use of supernodes to enhance the performance of unsymmetric solvers. One difficulty
is that, unlike the symmetric case, supernodal structure cannot be determined in advance, but
rather emerges depending on pivoting choices during the factorization.

In this paper, we generalize supernodes to unsymmetric matrices, and we give efficient
algorithms for locating and using unsymmetric supernodes during a column-based LU factorization.
We describe a new code called SuperLU that uses depth-first search and symmetric pruning to
speed up symbolic factorization, and uses unsymmetric supernodes to speed up numeric
factorization.

The rest of the paper is organized as follows. Section 2 introduces the tools we use:
unsymmetric supernodes, panels, and the column elimination tree. Section 3 describes the
supernodal numeric factorization. Section 4 describes the supernodal symbolic factorization. In
Section 5, we present experimental results: we benchmark our code on several test matrices, we
compare its performance to other column and submatrix codes, and we investigate its cache
behavior in some detail. Finally, Section 6 presents conclusions and open questions.

2 Unsymmetric supernodes

The idea of a supernode is to group together columns with the same nonzero structure, so they
can be treated as a dense matrix for storage and computation. Supernodes were originally used
for symmetric sparse Cholesky factorization; the first published results are by Ashcraft,
Grimes, Lewis, Peyton, and Simon [5]. In the factorization A = LL^T, a supernode is a range
(r:s) of columns of L with the same nonzero structure below the diagonal; that is, L(r:s, r:s)
is full lower triangular² and every row of L(s:n, r:s) is either full or zero. Columns of
Cholesky supernodes need not be contiguous, but we will consider only contiguous supernodes.
Ng and Peyton [26] analyzed the effect of supernodes in Cholesky factorization on modern
uniprocessor machines with memory hierarchies and vector or superscalar hardware. All the
updates from columns of a supernode are summed into a dense vector before the sparse update is
performed. This reduces indirect addressing, and allows the inner loops to be unrolled. In
effect, a sequence of column-column updates is replaced by a supernode-column update. The
sup-col update can be implemented using a call to a standard dense Blas-2 matrix-vector
multiplication kernel. This idea can be further extended to supernode-supernode updates, which
can be implemented using a Blas-3 dense matrix-matrix kernel. This can reduce memory traffic by
an order of magnitude, because a supernode in the cache can participate in multiple column
updates. Ng and Peyton reported that a sparse Cholesky algorithm based on sup-sup updates
typically runs 2.5 to 4.5 times as fast as a col-col algorithm. Indeed, supernodes have become a
standard tool in sparse Cholesky factorization [5, 26, 27, 32].

²We use Matlab notation for integer ranges and submatrices: (r:s) is the vector of integers
(r, r+1, ..., s). If I and J are vectors of integers, then A(I, J) is the submatrix of A with
rows whose indices are from I and with columns whose indices are from J. A(:, J) abbreviates
A(1:n, J). nnz(A) denotes the number of nonzeros in A.

[Figure 2: Four possible types of unsymmetric supernodes (T1, T2, T3, T4). Legend: columns have
same structure; rows have same structure; dense.]

To sum up, supernodes as the source of updates help because:

1. The inner loop (over rows) has no indirect addressing. (Sparse Blas-1 is replaced by dense
Blas-1.)

2. The outer loop (over columns in the supernode) can be unrolled to save memory references.
(Blas-1 is replaced by Blas-2.)

Supernodes as the destination of updates help because:

3. Elements of the source supernode can be reused in multiple columns of the destination
supernode to reduce cache misses. (Blas-2 is replaced by Blas-3.)

Supernodes in sparse Cholesky can be determined during symbolic factorization, before the
numeric factorization begins. However, in sparse LU, the nonzero structure cannot be predicted
before numeric factorization, so we must identify supernodes on the fly. Furthermore, since the
factors L and U are no longer transposes of each other, we must generalize the definition of a
supernode.

2.1 Definition of unsymmetric supernodes

We considered several possible ways to generalize the symmetric definition of supernodes to
unsymmetric factorization. We define F = L + U - I to be the filled matrix containing both L
and U.

T1. Same row and column structures: A supernode is a range (r:s) of columns of L and rows of U,
such that the diagonal block F(r:s, r:s) is full, and outside that block all the columns of L in
the range have the same structure and all the rows of U in the range have the same structure. T1
supernodes make it possible to do sup-sup updates, realizing all three benefits.

            T1      T2      T3      T4
  median    0.236   0.345   0.326   0.006
  mean      0.284   0.365   0.342   0.052

Table 1: Fraction of nonzeros not in first column of supernode.

T2. Same column structure in L: A supernode is a range (r:s) of columns of L with full
triangular diagonal block and the same structure below the diagonal block. T2 supernodes allow
sup-col updates, realizing the first two benefits.

T3. Same column structure in L, full diagonal block in U: A supernode is a range (r:s) of
columns of L and U, such that the diagonal block F(r:s, r:s) is full, and below the diagonal
block the columns of L have the same structure. T3 supernodes allow sup-col updates, like T2.
In addition, if the storage for a supernode is laid out as for a two-dimensional array (for
Blas-2 or Blas-3 calls), T3 supernodes do not waste any space in the diagonal block of U.

T4. Same column structure in L and U: A supernode is a range (r:s) of columns of L and U with
identical structure. (Since the diagonal is nonzero, the diagonal block must be full.) T4
supernodes allow sup-col updates, and also simplify storage of L and U.

T5. Supernodes of A^T A: A supernode is a range (r:s) of columns of L corresponding to a
Cholesky supernode of the symmetric matrix A^T A. T5 supernodes are motivated by George and Ng's
observation [14] that (with suitable representations) the structures of L and U in the
unsymmetric factorization PA = LU are contained in the structure of the Cholesky factor of
A^T A. In unsymmetric LU, these supernodes themselves are sparse, so we would waste time and
space operating on them. Thus we do not consider them further.

Figure 2 is a schematic of definitions T1 through T4.

Supernodes are only useful if they actually occur in practice. The occurrence of symmetric
supernodes is related to the clique structure of the chordal graph of the Cholesky factor, which
arises because of fill during the factorization. Unsymmetric supernodes seem harder to
characterize, but they also are related to dense submatrices arising from fill. We measured the
supernodes according to each definition for 126 unsymmetric matrices from the Harwell-Boeing
sparse matrix library [9] under various column orderings. Table 1 shows, for each definition,
the fraction of nonzeros of L that are not in the first column of a supernode; this measures how
much row index storage is saved by using supernodes. Corresponding values for symmetric
supernodes for the symmetric Harwell-Boeing structural analysis problems usually range from
about 0.5 to 0.9. Larger numbers are better, indicating larger supernodes.
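To make definition T2 concrete, here is a hypothetical Python sketch that partitions the columns
of an already-computed L into maximal contiguous T2 supernodes, given only each column's nonzero
structure. (SuperLU must detect supernodes on the fly during numeric factorization; this
after-the-fact helper, `t2_supernodes`, is purely illustrative.)

```python
def t2_supernodes(col_struct):
    """Partition columns 0..n-1 into maximal contiguous T2 supernodes.

    col_struct[j] is the set of row indices of nonzeros in L(:, j),
    including the unit diagonal j.  Column j extends the supernode
    starting at r when (a) its structure below row j equals that of
    column j-1 below row j, and (b) L(j, j-1) is nonzero; inductively
    (a) and (b) keep the diagonal block full lower triangular and the
    below-block structures identical, as definition T2 requires.
    Returns a list of inclusive (first, last) column ranges.
    """
    n = len(col_struct)
    supernodes, r = [], 0
    for j in range(1, n):
        prev, cur = col_struct[j - 1], col_struct[j]
        same_below = {i for i in prev if i > j} == {i for i in cur if i > j}
        if same_below and j in prev:
            continue                       # column j joins the supernode at r
        supernodes.append((r, j - 1))      # close supernode (r : j-1)
        r = j
    supernodes.append((r, n - 1))
    return supernodes
```

For example, two columns with structures {0, 1, 4, 5} and {1, 4, 5} form one supernode: the
2-by-2 diagonal block is full and both columns have nonzeros exactly in rows 4 and 5 below it.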
We reject T4 supernodes as being too rare to make up for the simplicity of their storage scheme.
T1 supernodes allow Blas-3 updates, but as we will see in Section 3.2, we can get most of their
cache advantage with the more common T2 or T3 supernodes by using sup-panel updates. Thus we
conclude that either T2 or T3 is the definition of choice. Our code uses T2, which gives
slightly larger supernodes than T3 at a small extra cost in storage.

Figure 3 shows a sample matrix, and the nonzero structure of its factors with no pivoting. Using
definition T2, this matrix has four supernodes: {1, 2}, {3}, {4, 5, 6}, and {7, 8, 9, 10}. For
example, in columns 4, 5, and 6 the diagonal blocks of L and U are full, and the columns of L
all have nonzeros in rows 8 and 9. By definition T3, the matrix has five supernodes: {1, 2},
{3}, {4, 5, 6}, {7}, and {8, 9, 10}. Column 7 fails to join {8, 9, 10} as a T3 supernode because
u_78 is zero.

[Figure 3: A sample 10-by-10 matrix A (original matrix) and its LU factors F = L + U - I.
Diagonal elements a_55 and a_88 are zero.]

2.2 Storage of supernodes

A standard way to lay out storage for a sparse matrix is a one-dimensional array of nonzero
values in column major order, plus integer arrays giving row numbers and column starting
positions. We use this layout for both L and U, but with a slight modification: we store the
entire square diagonal block of each supernode as part of L, including both the strict lower
triangle of values from L and the upper triangle of values from U. We store this square block as
if it were completely full (it is full in T3 supernodes, but its upper triangle may contain
zeros in T2 supernodes). This allows us to address each supernode as a two-dimensional array in
calls to Blas routines.
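A minimal Python sketch of this layout follows, under our own naming (`Supernode` and `entry`
are illustrative, not SuperLU's data structures): one dense column-major block holds the full
square diagonal block plus the rows of L below it, and a single row-index list serves every
column of the supernode.

```python
from dataclasses import dataclass

@dataclass
class Supernode:
    """Illustrative supernodal storage for columns first_col..last_col.

    `values` holds the nonzeros of F(r:n, r:s) as one dense
    column-major block: the full square diagonal block (upper triangle
    from U, strict lower triangle from L), then the rows of L below
    it.  One `row_index` list describes every column's structure.
    """
    first_col: int     # r
    last_col: int      # s
    row_index: list    # nonzero rows of L below the diagonal block
    values: list       # dense (nrows x ncols) block, column major

    def entry(self, i, j):
        """Return F(i, j) for first_col <= j <= last_col, else 0.0."""
        ncols = self.last_col - self.first_col + 1
        nrows = ncols + len(self.row_index)
        if self.first_col <= i <= self.last_col:
            local = i - self.first_col               # inside the diagonal block
        elif i in self.row_index:
            local = ncols + self.row_index.index(i)  # a below-block row of L
        else:
            return 0.0                               # structurally zero
        return self.values[(j - self.first_col) * nrows + local]
```

Because `values` is a dense column-major block, a Blas kernel can operate on it directly,
which is the point of the layout described above.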
In other words, if columns (r:s) form a supernode, then all the nonzeros in F(r:n, r:s) are
stored as a single dense two-dimensional array. This also lets us save some storage for row
indices: only the indices of nonzero rows outside the diagonal block need be stored, and the
structures of all columns within a supernode can be described by one set of row indices. This is
similar to the effect of compressed subscripts in the symmetric case [30].

We represent the part of U outside the supernodal blocks with a standard sparse structure: the
values are stored by columns, with a companion integer array the same size to store row indices;
another array of n integers indicates the start of each column.

Figure 4 shows the structure of the factors in the example from Figure 3, with s_k denoting a
nonzero in the k-th supernode and u_k denoting a nonzero in the k-th column of U outside the
supernodal block. Figure 5 shows the storage layout. (We omit the indexing vectors that point to
the beginning of each supernode and the beginning of each column of U.)

[Figure 4: Supernodal structure (by definition T2) of the factors of the sample matrix.]

[Figure 5: Storage layout for factors of the sample matrix, using T2 supernodes: the supernodal
blocks stored in column-major order, the row subscripts, and the off-supernodal nonzeros in the
columns of U.]

2.3 The column elimination tree

Since our definition requires the columns of a supernode to be contiguous, we should get larger
supernodes if we bring together columns of L with the same nonzero structure. But the column
ordering is fixed, for sparsity, before numeric factorization; what can we do?

In symmetric Cholesky factorization, the so-called fundamental supernodes can be made contiguous
by permuting the matrix symmetrically according to a postorder on its elimination tree [4]. This
postorder is an example of what Liu calls an equivalent reordering [24], which does not change
the sparsity of the factor. The postordered elimination tree can also be used to locate the
supernodes before the numeric factorization.

We proceed similarly for the unsymmetric case. Here the appropriate analog of the symmetric
elimination tree is the column elimination tree, or column etree for short. The vertices of this
tree are the integers 1 through n, representing the columns of A. The column etree of A is the
symmetric elimination tree of the column intersection graph of A, or equivalently the
elimination tree of A^T A provided there is no cancellation in computing A^T A. (See Gilbert and
Ng [19] for complete definitions.) The column etree can be computed from A in time almost linear
in the number of nonzeros of A by a variation of an algorithm of Liu [24].

The following theorem says that the column etree represents potential dependencies among columns
in LU factorization, and that for strong Hall matrices no stronger information is obtainable
from the nonzero structure of A.
Note that column i updates column j in LU factorization if and only if u_ij ≠ 0.

Theorem 1 (Column Elimination Tree [19]) Let A be a square, nonsingular, possibly unsymmetric
matrix, and let PA = LU be any factorization of A with pivoting by row interchanges. Let T be
the column elimination tree of A.

1. If vertex i is an ancestor of vertex j in T, then i ≥ j.

2. If l_ij ≠ 0, then vertex i is an ancestor of vertex j in T.

3. If u_ij ≠ 0, then vertex j is an ancestor of vertex i in T.

4. Suppose in addition that A is strong Hall (that is, A cannot be permuted to a nontrivial
block triangular form). If vertex j is the parent of vertex i in T, then there is some choice of
values for the nonzeros of A that makes u_ij ≠ 0 when the factorization PA = LU is computed with
partial pivoting.

Just as a postorder on the symmetric elimination tree brings together symmetric supernodes, we
expect a postorder on the column etree to bring together unsymmetric supernodes. Thus, before we
factor the matrix, we compute its column etree and permute the matrix columns according to a
postorder on the tree. We now show that this does not change the factorization in any essential
way.

Theorem 2 Let A be a matrix with column etree T. Let π be a permutation such that whenever π(i)
is an ancestor of π(j) in T, we have i ≥ j. Let P be the permutation matrix such that
π^T = P · (1:n)^T, and let Ā = PAP^T.

1. Ā = A(π, π).

2. The column etree T̄ of Ā is isomorphic to T; in particular, relabeling each node i of T̄ as
π(i) yields T.

3. Suppose in addition that A has an LU factorization without pivoting, A = LU. Then PLP^T and
PUP^T are respectively unit lower triangular and upper triangular, so Ā = (PLP^T)(PUP^T) is also
an LU factorization.

Remark: Liu [24] attributes to F. Peters a result similar to part 3 for the symmetric positive
definite case, concerning the Cholesky factor and the usual, symmetric elimination tree.
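Part 3 can be checked numerically. Note that the identity Ā = (PLP^T)(PUP^T) is pure algebra and
holds for any permutation; what the theorem adds is that a permutation respecting the column
etree keeps the re-permuted factors triangular. The sketch below (our own `lu_nopivot` helper,
applied to a diagonally dominant matrix so that unpivoted LU exists) exercises only the
algebraic identity, using an arbitrary permutation.

```python
import numpy as np

def lu_nopivot(A):
    """Doolittle LU without pivoting (assumes nonzero leading minors)."""
    A = np.array(A, dtype=float)          # private copy; eliminated in place
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(n):
        U[k, k:] = A[k, k:]
        L[k + 1:, k] = A[k + 1:, k] / A[k, k]
        A[k + 1:, k + 1:] -= np.outer(L[k + 1:, k], U[k, k + 1:])
    return L, U

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n)) + 4 * n * np.eye(n)  # dominant diagonal: LU exists unpivoted
L, U = lu_nopivot(A)
pi = rng.permutation(n)
P = np.eye(n)[pi]              # (P @ A @ P.T)[i, j] == A[pi[i], pi[j]], i.e. Abar = A(pi, pi)
Abar = P @ A @ P.T
# The identity of part 3 holds for any permutation; triangularity of the
# re-permuted factors additionally requires pi to respect the column etree.
assert np.allclose(Abar, (P @ L @ P.T) @ (P @ U @ P.T))
```

With `pi` chosen as an etree postorder (not constructed here), the two re-permuted factors would
moreover remain unit lower and upper triangular, which is the substance of the theorem.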
Proof: Part 1 is immediate from the definition of P. Part 2 follows from Corollary 6.2 in Liu
[24], with the symmetric structure of the column intersection graph of our matrix A taking the
place of Liu's symmetric matrix A. (Liu exhibits the isomorphism explicitly in the proof of his
Theorem 6.1.)

Now we prove part 3. We have ā_ij = a_π(i)π(j) for all i and j. Write L̄ = PLP^T and Ū = PUP^T,
so that l̄_ij = l_π(i)π(j) and ū_ij = u_π(i)π(j). Then Ā = L̄Ū; we need only show that L̄ and Ū
are triangular.

Consider a nonzero ū_ij of Ū. In the triangular factorization A = LU, element u_π(i)π(j) is
equal to ū_ij and is therefore nonzero. By part 3 of Theorem 1, then, π(j) is an ancestor of
π(i) in T. By the isomorphism between T and T̄, this implies that j is an ancestor of i in T̄.
Then it follows from part 1 of Theorem 1 that j ≥ i. Thus every nonzero of Ū is on or above the
diagonal, so Ū is upper triangular. A similar argument shows that every nonzero of L̄ is on or
below the diagonal, so L̄ is lower triangular. The diagonal elements of L̄ are a permutation of
those of L, so they are all equal to 1. □

Since the triangular factors of Ā = PAP^T are just permutations of the triangular factors of A,
they have the same sparsity. Indeed, they require the same arithmetic to compute; the only
possible difference is the order of updates. If addition for updates is commutative and
associative, this implies that with partial pivoting (i, j) is a legal pivot in Ā if and only if
(π(i), π(j)) is a legal pivot in A. In floating-point arithmetic, the different order of updates
could conceivably change the pivot sequence. Thus we have the following corollary.

Corollary 1 Let π be a postorder on the column elimination tree of A, let P_1 be any permutation
matrix, and let P_2 be the permutation matrix with π^T = P_2 · (1:n)^T. If P_1 A P_2^T = LU is
an LU factorization, then so is (P_2^T P_1) A = (P_2^T L P_2)(P_2^T U P_2). In exact arithmetic,
the former is an LU factorization with partial pivoting of A P_2^T if and only if the latter is
an LU factorization with partial pivoting of A.

This corollary says that an LU code can permute the columns of its input matrix by postorder on
the column etree, and then fold the column permutation into the row permutation on output. Thus
our SuperLU code has the option of returning either four matrices P_1, P_2, L, and U with
P_1 A P_2^T = LU, or just the three matrices P_2^T P_1, P_2^T L P_2, and P_2^T U P_2, which are
a row permutation and two triangular matrices. The advantage of returning all four matrices is
that the columns of each supernode are contiguous in L, which permits the use of a Blas-2
supernodal triangular solve for the forward-substitution phase of a linear system solver. The
supernodes are not contiguous in P_2^T L P_2.

2.4 Relaxed supernodes

We have explored various ways of relaxing the denseness of a supernode. We experimented with
both T2 and T3 supernodes, and found that T2 supernodes (those with only nested column
structures in L) give slightly better performance. We observe that, for most matrices, the
average size of a supernode is only about 2 to 3 columns, though a few supernodes are much
larger. A large percentage of supernodes consist of only a single column, many of which are
leaves of the column etree. Therefore we have devised a scheme to merge groups of columns at the
fringe of the etree into artificial supernodes regardless of their row structures. A parameter r
controls the granularity of the merge. Our merge rule is: node i is merged with its parent node
j when the subtree rooted at j has at most r nodes. In practice,
Apply sup erno de-column up date to column j :