<<

EFFICIENT SPARSE MATRIX-MATRIX PRODUCTS USING COLORINGS

MICHAEL MCCOURT∗, BARRY SMITH∗, AND HONG ZHANG∗

Abstract. Sparse matrix-matrix products appear in multigrid solvers and computational methods for graph analysis. Some formulations of these products require the inner product of two sparse vectors, which makes inefficient use of cache memory. In this paper, we propose a new algorithm for computing sparse matrix-matrix products by exploiting the matrix nonzero structure through the process of graph coloring. We prove the validity of this technique in general and demonstrate its viability for multigrid methods used to solve three-dimensional boundary value problems.

1. Introduction. Matrix-matrix multiplication is a fundamental operation [4]. The operation of concern in this work is

C = AB^T, which arises in algebraic multigrid and certain graph algorithms. This operation is well defined when A ∈ R^{m×r} and B ∈ R^{n×r}, and it will produce C ∈ R^{m×n}. Our results are equally applicable to complex-valued matrices, but we omit that discussion for simplicity. The multiplication operation is defined such that the jth column of the ith row of C (denoted as C(i, j)) is calculated as

C(i, j) = Σ_{k=1}^{r} A(i, k) B^T(k, j).

This can also be written as the inner product of the ith row of A, denoted as A(i, :), and the jth column of B^T, denoted as B^T(:, j):

C(i, j) = A(i, :) B^T(:, j). (1.1)

Several mechanisms exist for computing the matrix-matrix product, each of which is preferable in certain settings. For instance, C can be constructed all at once with a sum of r outer products (rank-1 matrices),

C = Σ_{k=1}^{r} A(:, k) B^T(k, :). (1.2)

For an approach between using (1.1) to compute one value at a time and using (1.2) to compute C all at once, we can compute one row or column at a time, or compute blocks of C using blocks of A and B^T.

While all these techniques yield the same result, they may not be equally preferable for actual computation because of the different forms in which a matrix may be stored. Many sparse matrix algorithms have memory access, rather than floating-point operations, as the computational bottleneck [11]; this demands diligence when choosing a storage format so as to minimize memory traffic. In this paper, we begin with conventional matrix storage, namely, compressed sparse row (CSR) format for sparse matrices and Fortran-style column-major array storage for dense matrices [2], and then transform the storage format for higher efficiency.

One approach to computing AB^T would be to compute B^T from B and store it as a CSR matrix, but this reorganization requires a significant amount of memory traffic. The availability of the columns of B^T would seem to indicate that the inner product computation (1.1) is the preferred method. Unfortunately, computing the inner product of sparse vectors is an inefficient operation because of the low ratio of flops to cache misses (see Section 1.1). Our goal is to produce a storage format and algorithm which will efficiently compute the sparse matrix-matrix product C = AB^T using inner products.

In the rest of this section we introduce sparse inner products and matrix coloring. In Section 2 we analyze matrix coloring applied to the sparse matrix product C = AB^T, which allows us to instead compute C by evaluating the inner product of sparse and dense vectors. In Section 3 we propose algorithms for computing

matrix products with matrix colorings and consider some practical implementation issues. Numerical results are presented in Section 4, which show the benefits of this coloring approach for matrix products appearing in multigrid solvers for three-dimensional PDEs from PFLOTRAN [8] and from the PETSc library distribution [2]. We conclude the paper in Section 5 with a brief discussion of future work.

∗Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439. Emails: (mccomic, bsmith, hzhang)@mcs.anl.gov

1.1. Sparse inner products. When the matrices A and B are stored in compressed sparse row format, computing C = AB^T requires inner products between sparse vectors; this section discusses the inefficiency (in number of flops per cache miss) of sparse inner products. Algorithm 1 shows how inner products between sparse vectors in R^n may be computed.

Algorithm 1 Sparse-Sparse Inner Product of x, y ∈ R^n; x, y have n_x, n_y nonzeros, respectively.
1: i = 1; j = 1; x^T y = 0.0
2: while (i ≤ n_x and j ≤ n_y) do
3:   if (index(x, i) < index(y, j)) { i = i + 1; }
4:   else if (index(x, i) > index(y, j)) { j = j + 1; }
5:   else { x^T y = x^T y + x(index(x, i)) y(index(y, j)); i = i + 1; j = j + 1; }
6: end while

Because x and y are stored in compressed form, only the n_x and n_y nonzero values in each vector, respectively, are stored, along with the rows to which they belong. The function “index” accepts a vector x and an integer 1 ≤ i ≤ n_x and returns the index of the ith nonzero; the ith nonzero is then accessed as x(index(x, i)).

Algorithm 1 iterates through the nonzero entries in each vector until finding the collisions between their indices; each collision is then accumulated to evaluate the inner product. While the number of floating-point operations (flops) for this algorithm is only O(number of collisions between nonzeros), the required memory access is inefficient. The indices of both vectors must be traversed completely, requiring O(n_x + n_y) memory accesses in a while-loop.

This poor ratio of computation to memory access is one motivation behind our work. Contrast Algorithm 1 with the inner product of a sparse and dense vector in Algorithm 2.
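As a concrete illustration of this access pattern, the following C sketch implements Algorithm 1 for vectors stored as sorted (index, value) pairs; the storage layout and function name are assumptions made for this sketch and are not the format used inside PETSc.

```c
#include <stddef.h>

/* Sparse-sparse inner product (Algorithm 1).  Each vector is stored in
 * compressed form: nx (resp. ny) nonzeros, row indices in xidx (yidx)
 * sorted increasingly, and the matching values in xval (yval).          */
double sparse_sparse_dot(size_t nx, const int xidx[], const double xval[],
                         size_t ny, const int yidx[], const double yval[])
{
    double dot = 0.0;
    size_t i = 0, j = 0;
    while (i < nx && j < ny) {                    /* O(nx + ny) index traversal */
        if (xidx[i] < yidx[j])       i++;
        else if (xidx[i] > yidx[j])  j++;
        else { dot += xval[i] * yval[j]; i++; j++; }   /* index collision */
    }
    return dot;
}
```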

Algorithm 2 Sparse-Dense Inner Product of x, y ∈ R^n; x has n_x nonzeros and y is dense.
1: x^T y = 0.0
2: for (i = 1, ..., n_x) do
3:   x^T y = x^T y + x(index(x, i)) y(index(x, i))
4: end for

Algorithm 2 performs O(n_x) flops, because even if some of the values in y are zero, they are all treated as nonzeros. This is more computation than the sparse-sparse inner product but with roughly the same level of memory access: O(2 n_x) accesses in a for-loop. Essentially, a sparse-dense inner product performs more flops per memory access.

Our interest is not in this small trade-off but in compressing multiple sparse vectors into a single, dense vector; this will significantly increase the ratio of flops to memory accesses, without (we hope) introducing a harrowing number of zero-valued nonzeros into the computation. If this can be done effectively, then the inefficient sparse-sparse inner products used to compute sparse matrix products can be replaced with a reduced number of more efficient sparse-dense inner products. Our mechanism for doing this is described in Section 1.2.
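For comparison, here is a sketch of Algorithm 2 under the same illustrative storage assumptions: the dense vector y is a full-length array indexed directly by the nonzero pattern of x, so no index merging is required.

```c
#include <stddef.h>

/* Sparse-dense inner product (Algorithm 2).  Only the nonzero pattern of x
 * is traversed; any zero stored in the dense array y is treated as a nonzero. */
double sparse_dense_dot(size_t nx, const int xidx[], const double xval[],
                        const double y[])
{
    double dot = 0.0;
    for (size_t i = 0; i < nx; i++)
        dot += xval[i] * y[xidx[i]];   /* O(nx) flops, roughly O(2 nx) accesses */
    return dot;
}
```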

1.2. Matrix coloring. Matrix coloring is related to graph coloring, an important topic in graph theory [7]. We are not interested in a graph-theoretic understanding, only in the specific use of graph theory to color matrices, so we introduce only terms specifically relevant to our usage.

Definition 1.1. Let u, v ∈ R^n. The vectors u and v are structurally orthogonal if |u|^T |v| = 0, where |u| is the vector of the absolute values of all the elements of u. A set of vectors {u_1, ..., u_n} is called structurally orthogonal if u_i and u_j are structurally orthogonal for 1 ≤ i, j ≤ n.

Under this definition, not all orthogonal vectors are structurally orthogonal; for example,

u = [1, 1]^T,   v = [1, −1]^T,

so u^T v = 0 but |u|^T |v| = 2. Also, all vectors are structurally orthogonal to a vector of all zeros.

Lemma 1.2. Let u, v ∈ R^n be structurally orthogonal, and denote u(k) as the kth element in u. Then for 1 ≤ k ≤ n, either u(k) = 0 or v(k) = 0, or both.

Proof. We know that |u|^T |v| = 0, which means

Σ_{k=1}^{n} |u(k)| |v(k)| = 0.

Obviously, |u(k)| |v(k)| ≥ 0 for 1 ≤ k ≤ n; for the sum of these nonnegative terms to equal zero, all the terms must themselves be zero:

|u(k)| |v(k)| = 0,   1 ≤ k ≤ n.

Therefore, either u(k) = 0 or v(k) = 0, or potentially both.

Definition 1.3. Let C ∈ R^{m×n}. An orthogonal set is a set of q indices ℓ = {ℓ^1, ℓ^2, ..., ℓ^q} for which C(:, ℓ^i) and C(:, ℓ^j) are structurally orthogonal when 1 ≤ i, j ≤ q and i ≠ j. We will also define a set containing only one index ℓ = {ℓ^1} to be an orthogonal set.

Definition 1.4. Let C ∈ R^{m×n}. A matrix coloring c = {ℓ_1, ℓ_2, ..., ℓ_p} is a collection of index sets such that, for 1 ≤ k ≤ n, the index k appears in exactly one index set. We say that c is a valid matrix coloring of C if each index set in c is an orthogonal set of C.

We refer to an orthogonal set that is part of a coloring as a color. Because we require each 1 ≤ k ≤ n to appear in exactly one orthogonal set, every column of C appears in exactly one color. Each orthogonal set in the coloring contains a set of indices corresponding to columns of C which are structurally orthogonal. The term coloring mimics the graph coloring concept of grouping nonadjacent vertices using the same color; here we are grouping structurally orthogonal columns of a matrix using the same orthogonal set.

Recall that we allow for the possibility of there being only one column in a color; for instance, a column with no zero values in it must exist in its own color. This in turn guarantees that every matrix has a coloring, since every matrix C ∈ R^{m×n} must have at least the trivial coloring {{1}, {2}, ..., {n}}. If the matrix C were totally dense, with no structural zeros at all, this would be the only coloring. Our focus, however, is on the coloring of very sparse matrices. A matrix may have more than one valid coloring; the 2 × 2 identity

I_2 = [ 1  0 ]
      [ 0  1 ]

has two colorings: c_1 = {{1, 2}} and c_2 = {{1}, {2}}. We are not concerned with how the coloring of a matrix is determined but, rather, how we can use it to efficiently compute matrix-matrix products. Therefore, we refer readers to [3] for an introduction to how colorings are found.

Definition 1.5. Let C ∈ R^{m×n} and c = {ℓ_1, ..., ℓ_p} be a valid matrix coloring for C. Applying the coloring c to C produces a matrix C_Dense that is at least as dense as the matrix C. If c has p colors in it, the matrix C_Dense ∈ R^{m×p}. The kth column of the matrix C_Dense is created by combining the columns of C from the orthogonal set ℓ_k = {ℓ_k^1, ..., ℓ_k^{q_k}}. Specifically,

C_Dense(j, k) = { C(j, ℓ_k^r)   if C(j, ℓ_k^r) ≠ 0,
               { 0              if Σ_{i=1}^{q_k} |C(j, ℓ_k^i)| = 0,

for 1 ≤ j ≤ m. The matrix resulting from applying the coloring will be referred to as a compressed matrix.

Theorem 1.6. Let C ∈ R^{m×n} and c = {ℓ_1, ..., ℓ_p} be a valid matrix coloring for C. The matrix C_Dense created by applying c to C is unique.

Proof. Each column of C_Dense is created from a single color, and no column of C appears in more than one color. Thus, proving that any column of C_Dense is unique proves that C_Dense is unique. We will prove that each value of the kth column of C_Dense is unique.

If the value C_Dense(j, k) = 0, then Σ_{i=1}^{q_k} |C(j, ℓ_k^i)| = 0, which could occur only if C(j, ℓ_k^1) = ... = C(j, ℓ_k^{q_k}) = 0. This would prevent having a conflicting nonzero value for that location, so all zero values in the kth column are unique.

If the value C_Dense(j, k) ≠ 0, then at least one of the values in {C(j, ℓ_k^1), ..., C(j, ℓ_k^{q_k})} is nonzero. If exactly one value is nonzero, then the value C_Dense(j, k) is unique. We must prove that only one of these q_k values is nonzero; we begin by assigning the nonzero value to the ν_k index, namely, C(j, ℓ_k^{ν_k}) ≠ 0.
Because c is a valid matrix coloring of C, the columns {C(:, ℓ_k^1), ..., C(:, ℓ_k^{q_k})} must be structurally orthogonal. Lemma 1.2 tells us that for the jth row,

C(j, ℓ_k^s) C(j, ℓ_k^{ν_k}) = 0,   1 ≤ s ≤ q_k,  s ≠ ν_k.

Since we have assumed that C(j, ℓ_k^{ν_k}) ≠ 0, this lemma demands that C(j, ℓ_k^s) = 0 for 1 ≤ s ≤ q_k, s ≠ ν_k. Therefore, every value in the kth column of C_Dense is uniquely determined.

Traditionally, matrix colorings have been used for the accelerated computation of finite-difference Jacobians for the purpose of preconditioning nonlinear solvers. In that setting, structurally orthogonal columns were grouped together to allow for simultaneous function evaluation. This minimizes the number of function calls required to approximate the Jacobian without ignoring any nonzero values. We use this same concept in Section 2 to accelerate the operation C = AB^T.
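To make Definitions 1.1 and 1.4 concrete, the following C sketch checks whether a proposed coloring is valid for a matrix whose columns are stored sparsely: a color is an orthogonal set exactly when no row index appears in two of its columns. The storage layout and names are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Columns of an m x n matrix stored sparsely: column j has colnnz[j] nonzero
 * row indices listed in colrows[j].  The coloring has p colors; color k lists
 * its csize[k] column indices in colors[k].  Returns true if every color is
 * an orthogonal set (Definition 1.4).                                        */
bool coloring_is_valid(int m, const int *colnnz, int **colrows,
                       int p, const int *csize, int **colors)
{
    bool *seen  = calloc((size_t)m, sizeof(bool));
    bool  valid = true;
    for (int k = 0; k < p && valid; k++) {
        memset(seen, 0, (size_t)m * sizeof(bool));       /* reset per color */
        for (int c = 0; c < csize[k] && valid; c++) {
            int j = colors[k][c];
            for (int t = 0; t < colnnz[j]; t++) {
                int row = colrows[j][t];
                if (seen[row]) { valid = false; break; } /* two columns of one
                                                            color share a row */
                seen[row] = true;
            }
        }
    }
    free(seen);
    return valid;
}
```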

2. Analysis of Coloring for Sparse Matrix-Matrix Products. Our goal in this section is to exploit the graph coloring concept, described in Section 1.2, to more efficiently compute inner products in a sparse-sparse matrix product. We are interested in the two expressions

C = AB^T,  or (2.1a)
C = RAR^T, (2.1b)

where the matrices A, B, R are stored in CSR format and are not necessarily square. We analyze only (2.1a) in this section because (2.1b) can be described by using this product.

The sparse-sparse matrix product (2.1a) can be computed with an analogous sparse-dense matrix product; doing so will reduce memory access overhead and improve computational efficiency. To achieve this improved performance, we restructure the sparse matrix B^T into a dense matrix B^T_Dense using the matrix coloring ideas introduced in Section 1.2. This allows us to compute

C_Dense = A B^T_Dense, (2.2)

such that the nonzero entries in C_Dense are also the nonzero entries needed to form C. After computing (2.2), C_Dense must be reorganized into C, which is done by using a process similar to the compression of B^T into B^T_Dense. Since this is not fundamental to the analysis, this process will be discussed in Section 3.1.

We must ask the question: How can we form a dense B^T_Dense such that C_Dense and C have the same nonzero values? Our approach uses a coloring of the matrix C to determine which columns of C are structurally orthogonal and can be computed simultaneously. This allows for the compression of those multiple columns of B^T associated with each color into a single column of B^T_Dense. After C_Dense has been computed, this single dense column is decompressed to fill all the columns of C corresponding to the columns of B^T which were earlier compressed to create B^T_Dense.

Although the dense matrix C_Dense will contain the same nonzero values as C, the dense matrix B^T_Dense will not necessarily contain all the nonzero values of B^T. This special case may occur when A has columns with only zero values, but we prove in Corollary 2.7 that whenever A has no zero columns, any coloring of C will be a valid coloring for B^T.
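A minimal sketch of the sparse-dense product (2.2), with A stored in CSR and B^T_Dense stored column-major, is shown below; the array names and layout are illustrative assumptions (an optimized kernel would also process several dense columns per pass over A, as discussed in Section 3).

```c
#include <stddef.h>

/* CDense = A * BTdense, where A is m x r in CSR (ai row pointers, aj column
 * indices, aa values), BTdense is r x p column-major (one column per color),
 * and CDense is m x p column-major.                                         */
void csr_times_dense(int m, int p, int r,
                     const int *ai, const int *aj, const double *aa,
                     const double *BTdense, double *CDense)
{
    for (int k = 0; k < p; k++) {                 /* one dense column per color */
        const double *bk = BTdense + (size_t)k * r;
        double       *ck = CDense  + (size_t)k * m;
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int t = ai[i]; t < ai[i + 1]; t++)
                sum += aa[t] * bk[aj[t]];         /* sparse-dense inner product */
            ck[i] = sum;
        }
    }
}
```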

2.1. Motivating Example. We consider a small example to demonstrate the coloring concept before we study its validity in general sparse-sparse matrix products. Consider the product C = AB^T of two “sparse” matrices, where C is 6 × 6 with the nonzero structure (∗ marks a structural nonzero)

    ∗ ∗ ∗ . . ∗
    . ∗ . . . ∗
    ∗ ∗ ∗ . . ∗
    . . ∗ ∗ ∗ ∗
    . . . ∗ . ∗
    . . . . . ∗

The structure of C admits the coloring

    c = {ℓ_1, ℓ_2, ℓ_3, ℓ_4} = {{1, 4}, {2, 5}, {3}, {6}}

because columns 1 and 4 are structurally orthogonal, as are columns 2 and 5. Notice that despite the structural orthogonality of the third and fifth columns of B^T, those columns cannot be combined in the same color because columns 3 and 5 of C are not structurally orthogonal. Using the coloring c, we can compress B^T into B^T_Dense, with one dense column per color, and compute C_Dense = A B^T_Dense, a 6 × 4 matrix whose kth column holds the nonzero values of the columns of C belonging to color ℓ_k.
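The compression used in this example can be sketched as a simple scatter: each sparse column of B^T is written into the dense column assigned to its color. The column-wise storage and names are assumptions of this sketch; when c is a valid coloring of B^T (or B^T has been vacated as in Section 2.2), no dense location is written twice.

```c
#include <stddef.h>
#include <string.h>

/* Form the r x p column-major BTdense from the n columns of BT, stored
 * sparsely: column j has nnz[j] entries with row indices in rows[j] and
 * values in vals[j].  color[j] in {0, ..., p-1} is the color of column j. */
void compress_columns(int r, int n, int p,
                      const int *nnz, int **rows, double **vals,
                      const int *color, double *BTdense)
{
    memset(BTdense, 0, (size_t)r * (size_t)p * sizeof(double));
    for (int j = 0; j < n; j++) {
        double *dense_col = BTdense + (size_t)color[j] * r;
        for (int t = 0; t < nnz[j]; t++)
            dense_col[rows[j][t]] = vals[j][t];   /* scatter the nonzeros */
    }
}
```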

To decompress C_Dense into C, we must store both the columns that formed the coloring and the rows of C associated with each of those columns. In the example above, the first column of C_Dense is composed of columns 1 and 4 from C, and the fourth and fifth rows belong to column 4 of C. This process is discussed in Section 3.1.

2.2. Proof of the Validity of Matrix Coloring. For a coloring of C to be appropriate for the product AB^T, one of two situations must apply:
1. The coloring c of C is also a valid coloring of B^T; invoking Theorem 1.6 means that a unique B^T_Dense exists when applying c, or
2. The conflicts that arise when applying the coloring do not contribute to the nonzeros of C.
We must prove this second condition; before we can complete such a proof, we need to create a new device: an auxiliary matrix B̂^T for which c is a valid coloring.

Definition 2.1. Let B^T ∈ R^{r×n} and c be a coloring for a matrix with n columns. A vacated matrix B̂^T ∈ R^{r×n} is a matrix whose nonzero values all coincide with nonzeros from B^T but for which c is a valid coloring.

Under this definition, no nonzeros can be introduced in vacating B^T to B̂^T; values can only be zeroed out to produce a matrix with the appropriate structure to apply the coloring. If c is a valid coloring of B^T, then one trivial vacation of B^T would be to remove no nonzeros. Another trivial vacation of B^T that would validate any coloring would be to zero out every value in the matrix.

Definition 2.2. Let u_1, u_2, ..., u_q ∈ R^r. The conflicted index set of {u_1, ..., u_q} is the set of indices

Γ({u_1, ..., u_q}) = { γ ∈ {1, 2, ..., r} | u_i(γ) u_j(γ) ≠ 0 for some 1 ≤ i, j ≤ q, i ≠ j }.

We define the conflicted index set of a set of one vector as empty: Γ({u}) = ∅.

The term conflicted index set refers to the fact that these rows are preventing the set of vectors {u_1, ..., u_q} from being structurally orthogonal. Were they structurally orthogonal, the conflicted index set would be empty.

Lemma 2.3. Let u_1, u_2, ..., u_q ∈ R^r, and let Γ({u_1, ..., u_q}) be the associated conflicted index set. The set of vectors {û_1, û_2, ..., û_q} defined as

û_i(γ) = { 0         if γ ∈ Γ({u_1, ..., u_q}),
         { u_i(γ)    otherwise,

is structurally orthogonal.

Proof. If Γ = ∅, then the vectors in {u_1, ..., u_q} are already structurally orthogonal. Otherwise, we must prove that

|û_i|^T |û_j| = 0,   1 ≤ i, j ≤ q,  i ≠ j.

Let us simplify the notation for this proof by defining Γ ≡ Γ({u_1, ..., u_q}). For any i ≠ j, the inner product can be separated into two components,

|û_i|^T |û_j| = Σ_{γ∈Γ} |û_i(γ)| |û_j(γ)| + Σ_{γ∉Γ} |û_i(γ)| |û_j(γ)|
              = Σ_{γ∈Γ} |0| |0| + Σ_{γ∉Γ} |u_i(γ)| |u_j(γ)|
              = Σ_{γ∉Γ} |u_i(γ)| |u_j(γ)|.

Because the conflicted index set includes the indices such that u_i(γ) u_j(γ) ≠ 0, any γ ∉ Γ must have u_i(γ) u_j(γ) = 0, leaving the summation above equal to zero.

This lemma allows us to take any set of vectors and replace certain nonzero values with zeros to make them structurally orthogonal. We will apply this concept to the columns of B^T involved in each color of c in order to create a matrix B̂^T for which c is a valid coloring.

Definition 2.4. Let B^T ∈ R^{r×n}, let c = {ℓ_1, ..., ℓ_p} be a coloring for some matrix with n columns, and let B^T(:, ℓ_k) denote the set of columns associated with the color ℓ_k = {ℓ_k^1, ..., ℓ_k^{q_k}}. The minimally vacated matrix B̂_c^T ∈ R^{r×n} is the vacated matrix generated from B^T in the following way:

B̂_c^T(γ, ℓ_k^j) = { 0                if γ ∈ Γ(B^T(:, ℓ_k)),
                  { B^T(γ, ℓ_k^j)    otherwise,

for 1 ≤ j ≤ q_k, 1 ≤ k ≤ p.
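As an illustration of Definition 2.4, the following C sketch vacates the columns belonging to a single color: rows stored by two or more of those columns form the conflicted index set, and the corresponding values are zeroed in place. Names and layout are assumptions of this sketch, which also treats every stored entry as a true nonzero.

```c
#include <stdlib.h>

/* Minimally vacate the q columns of BT that belong to one color.  Column
 * cols[c] has nnz[cols[c]] stored entries, with row indices in rows[cols[c]]
 * and values in vals[cols[c]]; r is the number of rows of BT.               */
void vacate_color(int r, int q, const int *cols,
                  const int *nnz, int **rows, double **vals)
{
    int *count = calloc((size_t)r, sizeof(int));
    for (int c = 0; c < q; c++)                     /* count row occurrences   */
        for (int t = 0; t < nnz[cols[c]]; t++)
            count[rows[cols[c]][t]]++;
    for (int c = 0; c < q; c++)                     /* zero conflicted entries */
        for (int t = 0; t < nnz[cols[c]]; t++)
            if (count[rows[cols[c]][t]] > 1)
                vals[cols[c]][t] = 0.0;
    free(count);
}
```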

Lemma 2.5. Let B^T ∈ R^{r×n}, and let c = {ℓ_1, ..., ℓ_p} be a valid coloring for some matrix with n columns. The coloring c is valid for the minimally vacated matrix B̂_c^T.

Proof. Using Lemma 2.3, we know that the columns in B̂_c^T(:, ℓ_k) are structurally orthogonal for 1 ≤ k ≤ p, and therefore c is a valid coloring of B̂_c^T.

The minimal vacation of a matrix allows for the preservation of nonzeros with which there is no coloring conflict. Recall that for the product C = AB^T, our goal is to produce a dense matrix B^T_Dense from the matrix B^T by applying the coloring c to B^T. As stated earlier, c may not be a valid coloring for B^T, in which case we substitute the minimally vacated matrix B̂_c^T in place of B^T. Lemma 2.5 proves that c is an acceptable coloring for this vacated matrix.

More important than the validity of the coloring is the validity of the equation C = A B̂_c^T. If this equality does not hold, then some values in C would be altered by substituting the minimally vacated B̂_c^T for B^T, which is unacceptable. Theorem 2.6 addresses this concern.

Theorem 2.6. Let A ∈ R^{m×r}, B ∈ R^{n×r}, and c = {ℓ_1, ..., ℓ_p} be a coloring of C = AB^T. If B̂_c^T is the minimally vacated matrix generated by applying c to the matrix B^T, then the equality

A B^T = A B̂_c^T (2.3)

holds.

Proof. We denote Ĉ = A B̂_c^T and prove that Ĉ = C by proving that all the nonzeros of Ĉ match the nonzeros of C. By definition, each column of C belongs to exactly one color, so let us denote by ℓ_{p_j} = {ℓ_{p_j}^1, ..., j, ..., ℓ_{p_j}^{q_j}} the color containing column C(:, j). As was done earlier, let C(:, ℓ_{p_j}) denote the set of columns of C associated with the color ℓ_{p_j}.

Call Γ_j ≡ Γ(B^T(:, ℓ_{p_j})) the conflicted index set arising from applying the color ℓ_{p_j} to the matrix B^T. The values C(i, j) and Ĉ(i, j) can be computed in pieces related to this Γ_j:

C(i, j) = Σ_{γ∈Γ_j} A(i, γ) B^T(γ, j) + Σ_{γ∉Γ_j} A(i, γ) B^T(γ, j),
Ĉ(i, j) = Σ_{γ∈Γ_j} A(i, γ) B̂_c^T(γ, j) + Σ_{γ∉Γ_j} A(i, γ) B̂_c^T(γ, j).

Using the definition of a minimally vacated matrix, we can simplify this second line to

Ĉ(i, j) = Σ_{γ∉Γ_j} A(i, γ) B^T(γ, j),

which, combined with the first line, gives

C(i, j) = Σ_{γ∈Γ_j} A(i, γ) B^T(γ, j) + Ĉ(i, j).

Since we want C(i, j) = Ĉ(i, j), we must prove that

Σ_{γ∈Γ_j} A(i, γ) B^T(γ, j) = 0,   1 ≤ i ≤ m,  1 ≤ j ≤ n,

which we do by proving that

for all 1 ≤ i ≤ m, 1 ≤ j ≤ n :   A(i, γ) = 0 for each γ ∈ Γ_j. (2.4)

By the definition of the conflicted index set Γ_j, for every γ ∈ Γ_j at least two of the columns in B^T(:, ℓ_{p_j}) must have a nonzero in row γ. If we choose any two of those and call their indexes s and t, we can write

γ ∈ Γ_j  ⇒  there exist s, t ∈ ℓ_{p_j}, s ≠ t, such that B^T(γ, s) ≠ 0 and B^T(γ, t) ≠ 0. (2.5)

Because ℓ_{p_j} is a valid color for C, the set C(:, ℓ_{p_j}) is structurally orthogonal:

C(i, s) = 0,   or   C(i, t) = 0,   1 ≤ i ≤ m.

The structural zero C(i, s) occurs because the vectors A(i, :)^T and B^T(:, s) are structurally orthogonal. Applying Lemma 1.2 gives

A(i, γ) = 0,   or   B^T(γ, s) = 0,   1 ≤ γ ≤ r,  1 ≤ i ≤ m.

A similar statement is also true for C(i, t), so we join these results to state that for every 1 ≤ γ ≤ r,

either B^T(γ, s) = 0 or B^T(γ, t) = 0,   or   A(i, γ) = 0 for all 1 ≤ i ≤ m. (2.6)

The combination of (2.5) and (2.6), along with the knowledge that each column of C appears in only one color, is sufficient to prove (2.4).

Theorem 2.6 guarantees that the coloring of C is a sufficient tool to perform the sparse-dense matrix product A B^T_Dense. While creating the minimally vacated matrix B̂_c^T is not difficult in practice, most applications do not require it, as shown in Corollary 2.7.

Corollary 2.7. For the product C = AB^T, any coloring c of C is a valid coloring of B^T if A has no zero columns.

Proof. Start with (2.6), which is valid for any two distinct columns s, t ∈ ℓ_{p_j}. If A has no zero columns, then A(i, γ) ≠ 0 for some 1 ≤ i ≤ m, which requires that either B^T(γ, s) = 0 or B^T(γ, t) = 0 for 1 ≤ γ ≤ r and makes c a valid coloring for B^T.

3. Algorithms for Sparse-Sparse Matrix Product Using Coloring. Now that we have laid the foundation, we present Algorithm 3 for computing the product of sparse matrices by an associated sparse-dense product generated through matrix coloring. The remainder of this section will analyze this algorithm and adapt it to address implementation concerns.

One way to compare Algorithm 3 to a sparse inner product-based algorithm would be to study the memory traffic. Let n_A and n_B denote the average number of nonzeros per row in A and B, respectively, and let n_color be the number of colors in c. From our discussion in Section 1.1, we can surmise that computing AB^T using sparse inner products requires roughly O((n_A + n_B) m n) memory accesses. On the other hand, A B^T_Dense should incur only O(2 n_A m n_color) memory accesses, which could yield substantial savings if n_color ≪ n.

Algorithm 3 Computing C = AB^T Using Coloring, Basic Version
1: Compute symbolic C = AB^T; compute a matrix coloring c of C
2: Assemble B^T_Dense by applying c to B^T
3: Perform the sparse-dense matrix product C_Dense = A B^T_Dense
4: Recover C from C_Dense

Despite the gains from performing the sparse-dense matrix product, costs are incurred by computing the coloring of C, compressing B^T to B^T_Dense, and then recovering C from C_Dense. These eat into the competitive advantage described earlier for the sparse-dense matrix product and must be taken into account when comparing the two algorithms. Rather than try to conduct a “pencil and paper” analysis of this additional complexity, we have performed numerical tests to demonstrate that their cost is not overwhelming. Those results appear in Section 4.

In line 1 of Algorithm 3, we compute the symbolic product of AB^T, which determines the location of the nonzeros in C before their values are computed [10]. Generally, this is used to preallocate space for C, but here we also use this structural description of C to determine a matrix coloring c. We will not discuss the different algorithms for computing matrix and graph colorings here; the effects of two popular choices, which are available in the PETSc library, are compared in Section 4.2.

Line 2 was discussed in Section 2, although here we omit the possible need for a vacated matrix B̂^T to form B^T_Dense. Recall that Corollary 2.7 shows that this vacation is not necessary so long as A has at least one nonzero in each column. The sparse-dense matrix product in line 3 is computed as a sequence of sparse-dense inner products, each of which implements Algorithm 2. The efficient recovery of the sparse matrix C from the computed matrix C_Dense is not trivial; it is discussed in Section 3.1.

Lines 2 through 4 contain the so-called numeric portion of the matrix product. For many applications, including the PFLOTRAN example in Section 4.1, the nonzero patterns of the matrices remain fixed despite varying numeric values; this allows the symbolic component (line 1) to be performed only once while numeric products are computed whenever matrix values are updated. Our efficiency discussions focus on the numeric component.

Although not the focus of this paper, performing the sparse-dense product A B^T_Dense in lieu of the sparse-sparse product AB^T allows for new optimizations leveraging the standardized structure of a dense matrix. One optimization that we exploit is the computation of multiple columns of C_Dense simultaneously. Each time A is brought into memory, four columns of C_Dense are computed, allowing for more flops per cache miss caused by A. The number of columns that can be efficiently computed simultaneously is determined by the available memory. Improving the ratio of flops per cache miss was a key factor in motivating this work, as discussed in Section 1.1, and the topic will appear again in Section 4.

3.1. Recovering the Sparse Matrix from the Dense Product. The fourth line of Algorithm 3 decompresses the dense matrix C_Dense to the sparse matrix C. Although we have not explicitly stated it previously, more information is needed to perform this decompression than just C_Dense and c. In the process of compressing B^T to B^T_Dense, multiple columns (previously denoted B^T(:, ℓ_j)) are joined to form a single column B^T_Dense(:, j); to undo this process, we must know how to partition C_Dense(:, j) among the columns in the set C(:, ℓ_j). In practice, during the compression, the coloring is augmented to also store this row data. Referring to the example in Section 2.1, we would augment the coloring

c = {{1, 4}, {2, 5}, {3}, {6}}

c+ = {{1: {1, 3}, 4: {4, 5}}, {2: {1, 2, 3}, 5: {4}}, {3: {1, 3, 4}}, {6: {1, 2, 3, 4, 5, 6}}}.

Using this augmented coloring c+, we can decompress a compressed matrix to its sparse form. The strategy used to populate C from C_Dense can contribute significantly to the computational cost of Algorithm 3. A simple approach would be to traverse C_Dense contiguously and populate C with each nonzero according to c+. Implementation of this revealed that as the matrix sizes increase, the decompression can consume up to 1/3 of the total execution time; in contrast, the compression of B^T to B^T_Dense generally requires a much smaller portion of the total time.

The most direct implementation of the C_Dense decompression ignores the fact that values that are very close in C_Dense may belong to very distant columns of C. This occurs because C_Dense is stored in dense columns (as is the dense matrix storage standard) but C is a compressed sparse row matrix. Unpacking any single column of C_Dense may insert nonzero entries throughout C; therefore, decompressing each column of C_Dense may cause a traversal through the entire C matrix, which will likely incur excessive cache misses. To mitigate this expense, we could fill some block of rows of C all at once, thereby preventing the need for n_color passes through C. Results for both decompression techniques are presented in Section 4.1.

Algorithm 4 incorporates the changes discussed in this subsection into the coloring-based sparse product algorithm and also notes the potential need for a minimally vacated B^T as described in Theorem 2.6. This algorithm is listed for completeness, to indicate the practical steps that must be taken for implementation; we will in general refer instead to Algorithm 3 for simplicity.

Algorithm 4 Computing C = AB^T Using Coloring, Practical Version
1: Compute symbolic C = AB^T
2: Compute c, a matrix coloring of C
3: Vacate B^T using c if needed for compression
4: Assemble B^T_Dense by applying c to B^T
5: Augment matrix coloring c → c+
6: Perform the sparse-dense matrix product C_Dense = A B^T_Dense
7: Recover C from C_Dense using c+
8: Populate m_blocksize rows of C at once
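Line 7 of Algorithm 4 can be organized in more than one way. The sketch below fills C in CSR order by gathering from C_Dense, which is the idea behind populating blocks of rows at once; it assumes the symbolic pattern of C is already stored in CSR arrays and uses a per-column color map instead of the augmented coloring c+, so the bookkeeping differs from the implementation described above.

```c
#include <stddef.h>

/* Fill the CSR value array of C (row pointers ci, column indices cj, values
 * cv, preallocated by the symbolic product) from the m x p column-major
 * CDense.  color[j] gives the dense column that holds column j of C.        */
void decompress_rowwise(int m, const int *ci, const int *cj, double *cv,
                        const int *color, const double *CDense)
{
    for (int i = 0; i < m; i++)                       /* C traversed row by row */
        for (int t = ci[i]; t < ci[i + 1]; t++)
            cv[t] = CDense[(size_t)color[cj[t]] * m + i];  /* gather from dense */
}
```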

3.2. Algorithms for the RAR^T Product. Thus far we have discussed the product (2.1a), but our motivating application is multigrid, and it involves the product (2.1b). When Algorithm 3 is adapted for the product C = RAR^T, two options arise: C can be computed by using two sparse-dense products or one sparse-dense product and one sparse-sparse product. The relative efficiency of these options is affected by the number of colors present in the compression. The auxiliary matrix W allows us to compute (2.1b) in two steps:

W = AR^T, (3.1a)
C = RW. (3.1b)

We will always compute (3.1a) using coloring, but the choice of coloring will determine the efficiency of that and subsequent computations. The coloring of W, denoted c_W, can be used to implement Algorithm 3, at the end of which a sparse W is returned. Then a sparse-sparse matrix product between R and W is used to compute C. This approach is described in Algorithm 5.

Algorithm 5 Computing C = RAR^T Using the Coloring of AR^T
1: Use Algorithm 3 to compute W = AR^T
2: Compute the sparse-sparse CSR matrix product C = RW

In line 2 of this algorithm, a sparse-sparse matrix product is used instead of the sparse-dense matrix product that we have been developing in this paper. This is counterintuitive, because we have been discussing the excessive cost of performing sparse-sparse products, but it may be preferable depending on the number of colors present in C. If C has too many colors, then it will be faster to use a sparse product to compute it, as discussed in Section 3. Should C have few colors, it may be faster to use Algorithm 6 to compute C.

Algorithm 6 involves two sparse-dense matrix products, each using the coloring c_C instead of the coloring c_W as was used in Algorithm 5. The cost of this algorithm is tied to the number of colors in C, which can be greater than the number of colors in W; for many of the multigrid problems we study, C is much more dense than W, and it will be less effective to use c_C for these products.

Algorithm 6 Computing C = RAR^T Using the Coloring of C
1: Compute symbolic C = RAR^T; compute matrix coloring c_C
2: Compress (and vacate, if needed) R^T with c_C to form R^T_Dense
3: Augment c_C to c_C+ with the necessary row information
4: Use a sparse-dense matrix product to compute W_Dense = A R^T_Dense
5: Use a sparse-dense matrix product to compute C_Dense = R W_Dense
6: Recover C from C_Dense using c_C+

Computational results presented in Section 4.1 compare these two algorithms and show that the number of colors present in the coloring is a major factor in the efficiency of the computation.

4. Numerical Experiments. We present two sets of test cases:
• The regional doublet test case from PFLOTRAN [8] and
• A three-dimensional linear elasticity PDE test provided in the PETSc library distribution [2].
These tests were chosen because algebraic multigrid is an efficient solver for the linear systems arising in both applications. Multigrid is a mathematically optimal method for solving the systems of algebraic equations that arise from discretized elliptic boundary value PDEs [12, 13]. We use a geometric algebraic multigrid (GAMG) solver in PETSc, which integrates geometric information into robust algebraic multigrid formulations to yield superior convergence rates of the multigrid solver [1, 2]. GAMG requires the matrix triple products C = RAR^T to be computed on all grid levels in the solver setup phase and the matrix product C = GG^T to be computed for the creation of connection graphs. These matrix products, as common computational primitives, constitute a significant portion of the entire simulation cost.

The experiments were conducted by using the PETSc library on a Dell PowerEdge 1950 server with dual Intel Xeon E5440 quad-core CPUs at 2.83 GHz and 16 GB DDR2-667 memory in 4 channels providing 21 GB/s total memory bandwidth. The machine runs the Ubuntu Linux 12.04 64-bit OS. The execution time and floating-point rates were obtained by using one core with the GNU compiler version 4.7.3 and -O optimization. Our performance results were obtained by running the entire test cases and profiling the relevant matrix products. We have done so because standalone benchmarking often produces unreasonably optimistic reports, since much of the data is already in cache, which is not the case during an actual simulation.

4.1. Regional Doublet Test Case from PFLOTRAN. We demonstrate the use of matrix coloring on the regional doublet test case [5] from PFLOTRAN, a state-of-the-art code for simulating multiscale, multiphase, multicomponent flow and reactive transport in geologic media. PFLOTRAN solves a coupled system of mass and energy conservation equations for a number of phases, including air, water, and supercritical CO2, and for a number of chemical components. PFLOTRAN is built on the PETSc library and makes extensive use of PETSc iterative nonlinear and linear solvers.

The regional doublet test case models variably saturated groundwater flow and solute transport within a hypothetical aquifer measuring 5000 m × 100 m. We consider only flow problems here because flow solves dominate the computation. The governing equation is a system of time-dependent PDEs. PFLOTRAN utilizes finite-volume or mimetic finite-difference spatial discretizations and backward-Euler (fully implicit) timestepping. At each time step, Newton-Krylov methods are used to solve the resulting nonlinear algebraic equations. In all the experiments reported below we have run 35 time steps, which is the minimum needed to resolve the basic physics.

Tables 4.1 and 4.2 show the benefit of using matrix coloring to speed the execution time for computing matrix triple products over small to large three-dimensional meshes.
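For orientation, the PETSc interfaces through which such triple products can be formed are sketched below; this is illustrative only, assuming A and R are already-assembled AIJ matrices, and it does not show how (or whether) a coloring-based kernel is selected underneath.

```c
#include <petscmat.h>

/* Form C = R A R^T two ways through the PETSc Mat interface. */
PetscErrorCode triple_products(Mat A, Mat R)
{
  Mat            P, C1, C2;
  PetscErrorCode ierr;

  /* As P^T A P with P = R^T explicitly stored */
  ierr = MatTranspose(R, MAT_INITIAL_MATRIX, &P);CHKERRQ(ierr);
  ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C1);CHKERRQ(ierr);

  /* Directly as R A R^T */
  ierr = MatRARt(A, R, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C2);CHKERRQ(ierr);

  ierr = MatDestroy(&P);CHKERRQ(ierr);
  ierr = MatDestroy(&C1);CHKERRQ(ierr);
  ierr = MatDestroy(&C2);CHKERRQ(ierr);
  return 0;
}
```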
Three approaches are compared for computing C = RAR^T:
• Using no coloring:
  – P^T AP – stores P = R^T and then computes C with sparse outer products,
  – RAP – stores P = R^T and performs products between these three CSR matrices,
  – RAR^T – Algorithm 1 is used to compute each value in C,
• Using the coloring of RAR^T – Algorithm 6, and
• Using the coloring of AR^T – Algorithm 5.

Table 4.1: PFLOTRAN: C = RAR^T. Fine grid size: 50 × 25 × 10; A: 12,500 × 12,500, R: 1,203 × 12,500. Times presented are in seconds. The use of coloring in the RAR^T computation is clearly beneficial. Also, the choice of coloring (studying the structure of RAR^T and applying Algorithm 6 or studying AR^T and applying Algorithm 5) plays a major role in the algorithm efficiency.

                        No Coloring                          Coloring
             Calls   P^T AP    RAP     RAR^T     RAR^T (ncolor=59)   AR^T (ncolor=20)
Symbolic       3     .0066     .016    .011      .019                .024
Numeric      246     1.30      1.46    2.39      1.51                .760
Total Time           1.31      1.48    2.40      1.53                .784

Table 4.2: PFLOTRAN: C = RAR^T. Total computation time (in seconds) presented for increasing fine grid density. The “time100” column implements dense-to-sparse decompression with 100-row blocks. Algorithm 5 continues to outperform Algorithm 6. Unpacking multiple rows at once during the decompression (as discussed in Section 3.1) speeds the computations for the AR^T coloring but not for the RAR^T coloring.

                    No Coloring     Algorithm 6, using RAR^T        Algorithm 5, using AR^T
Grid Size           RAR^T time      ncolor   time    time100        ncolor   time    time100
50 × 25 × 10        2.4             59       1.5     1.6            20       .78     .84
100 × 50 × 20       28              70       26      27             24       14      12
200 × 100 × 40      246             84       374     376            25       162     132

The first column of Table 4.1 gives the total number of symbolic and numeric matrix triple products accumulated from all grid levels and all linear iterations during the entire simulation. The symbolic matrix products were computed only during the solver setup phase, while the numeric matrix products were executed for every nonlinear iteration of GAMG. Time spent creating matrix colorings is included in the symbolic row and contributed approximately half the symbolic execution time for the coloring columns. The matrix colorings for this example are created by using the PETSc default algorithm, largest-first ordering (discussed in Section 4.2). When compared with the repeated execution of numeric matrix products performed during the solve, the time spent on the symbolic products is minimal.

In Table 4.1, we see that the number of colors for the matrix RAR^T is 59, whereas AR^T (column 6) has only 20 colors. This disparity is the result of greater density in C. While both of these are far smaller than 1203, the column size of both R^T and the final sparse matrix product C, the greater sparsity in AR^T demands only a third as many colors and leads to significantly shorter execution time. The last row of Table 4.1 gives the total execution time, that is, the sum of the symbolic and numeric components. It shows that computing C = RAR^T by using the matrix coloring of AR^T takes roughly one-third of the execution time of using no coloring.

Table 4.2 presents the same experiments over larger three-dimensional meshes. The use of Algorithm 5 to compute the product continues to outperform standard sparse inner products and Algorithm 6. Here we also consider the dense-to-sparse decompression in block rows of 100 in addition to conducting the entire decompression in one sweep; this idea was introduced for C = AB^T in Section 3.1, and the results are presented in the time100 columns. There is a marked benefit when using the block decompression for the AR^T coloring option but no benefit for the RAR^T coloring. This can likely be attributed to the number of colors, since more colors indicates more zero-valued nonzeros during the dense compression and therefore extra work is performed.

Table 4.3 helps clarify the computational picture by studying the flop rates for the three methods of computing RAR^T. The floating-point computation required for the “No Coloring” option is significantly less than that for the coloring options because only nonzero matrix entries are involved in the computation; as discussed in Section 1.1, the efficiency (measured by flop rate) suffers as a result because many fewer flops take place per cache miss.

Table 4.3: PFLOTRAN: Flop Rate (megaflops/second). Using coloring greatly increases the flop rate because of the sparse-to-dense compression. Even though the RAR^T coloring has a higher flop rate, its total computational time (from Table 4.2) is higher because many of the flops performed are unnecessary. The AR^T coloring has few enough colors in it to maintain a high flop rate without excessive unneeded computation involving zero-valued nonzeros. The coloring computation uses the 100 block row decompression.

Fine Meshes          No Coloring    Coloring RAR^T    Coloring AR^T
50 × 25 × 10         76             1232              724
100 × 50 × 20        64             936               606
200 × 100 × 40       59             663               496

This issue is alleviated by the coloring because necessary values are stored contiguously in the dense matrix, thereby allowing for more efficient access. That improved efficiency is visible in the two columns of coloring results: both have significantly higher flop rates than the computation without coloring.

Following this logic further, the higher flop rate would suggest that the RAR^T coloring is superior to the AR^T coloring. Although computations are happening more quickly there, we know that the total time required for the computation is also greater. This seeming contradiction is caused by the number of zero-valued nonzeros introduced into the dense matrix. When the sparse matrices are used in the inner product computation, only nonzeros are involved in the computation, but accessing them is a slow proposition. At the other extreme, when the RAR^T coloring is used, too many zeros are involved in the inner products, allowing for more efficient memory access but involving so many zero-valued nonzeros that the total computational time is excessive. The coloring of AR^T has many fewer colors, which reduces the number of superfluous zeros in the dense matrix, and achieves the best overall performance. This is why its flop rate is slightly lower but the total computational time (shown in Table 4.2) is better.

4.2. Three-Dimensional Linear Elasticity PDEs. The set of test cases in this section can be found in the PETSc library (see petsc/src/ksp/ksp/examples/tutorials/ex56.c). They model a three-dimensional bilinear quadrilateral (Q1) displacement finite-element formulation of a linear elasticity PDE defined over a unit cube domain with a Dirichlet boundary condition on the side y = 0 and a load of 1.0 in the x + 2y direction on all nodes. Three physical variables are defined at each grid point (i.e., the number of degrees of freedom equals 3). We use these tests to further demonstrate the performance benefit of using matrix colorings and to illustrate that the achieved acceleration depends on the matrix nonzero structure and the selected matrix coloring.

We apply two matrix coloring algorithms, largest-first ordering (LF) [6] and smallest-last ordering (SL) [9], to the grid operator matrices C = RAR^T in Table 4.4 and to the matrix product C = GG^T (used for connection graphs) in Table 4.5. These tables present the dimensions of the matrices, the execution time spent on computing the numeric matrix products, and the number of colors obtained from the matrix colorings. We present only the results using the coloring of AR^T (Algorithm 5) in Table 4.4.

Table 4.4: Elasticity PDE: Execution Time of Numeric C = RAR^T (seconds). As the matrix size increases, the number of colors increases more quickly with the LF coloring than with the SL coloring. This causes the SL time to scale more effectively, although both colorings outperform the “No Coloring” option.

Matrix Size (Fine Grids)                      No Coloring    Coloring AR^T
A                       R                                    LF      ncolor    SL      ncolor
3,000 × 3,000           156 × 3,000           .022           .0092   48        .0092   48
24,000 × 24,000         1,122 × 24,000        .24            .15     66        .15     60
192,000 × 192,000       8,586 × 192,000       2.23           1.68    84        1.33    66
648,000 × 648,000       27,924 × 648,000      7.86           6.07    90        4.75    69

Table 4.5: Elasticity PDE: Execution Time of Numeric C = GG^T (seconds). As the size of the matrix G grows, the number of colors from the LF coloring remains constant, allowing it to outperform the SL coloring. Again, both coloring choices are faster than the “No Coloring” option.

Matrix Size (Fine Grids)      No Coloring    Coloring
G                                            LF       ncolor    SL      ncolor
1,000 × 1,000                 .012           .0026    125       .0029   136
8,000 × 8,000                 .13            .038     125       .043    149
64,000 × 64,000               1.22           .42      125       .51     158
216,000 × 216,000             4.31           1.49     125       1.81    161

As the mesh size increases for the matrices AR^T, the number of colors produced by the SL algorithm grows more slowly than that of LF: from 48 to 69 vs. 48 to 90. This results in improved performance using SL colorings, as shown in Table 4.4. This stands in contrast to the results for the graph connection matrices GG^T shown in Table 4.5, where the LF algorithm generates a more consistent number of colors as the mesh size grows and outperforms the SL algorithm. These results suggest that no one coloring algorithm is ideal for all circumstances. Note that for all the cases presented, the coloring algorithms outperform the sparse inner product approach of “No Coloring”.

5. Conclusions and Future Work. Earlier research has accelerated the computation of sparse Jacobians by using matrix coloring to evaluate several columns simultaneously. That work has been adapted here to accelerate sparse matrix products computed with sparse inner products by applying a matrix coloring to instead compute a related compressed sparse-dense matrix product. We have proved that for the product C = AB^T, the matrix coloring of C is always a viable choice for compressing B^T, although slight modifications may be necessary if A has any zero columns.

Algorithms for both (2.1a) and (2.1b) were proposed, as well as considerations for practical implementation. Numerical results suggest that multigrid simulations that compute sparse matrix products through inner products can be accelerated significantly through the use of coloring. Those results can be improved further by fine-tuning the data traffic during the decompression of the computed dense matrix to the desired sparse matrix. We also studied the effect of the choice of coloring algorithm on the efficiency of the compression and found that different algorithms perform better in different circumstances.

Future adaptation of this coloring approach to parallel matrices will inherit much of the theory, but the added cost of data traffic may require some retooling or reorganization. Additionally, it would be valuable to study the use of this method on block matrices, a common structure for many applications. Beyond multigrid, our motivating application, we believe that applications from the discrete math community (e.g., breadth-first search) will benefit from our coloring algorithm, and we intend to research this possibility.

The transition to a sparse-dense matrix product opens the door to numerous optimizations that are unavailable in the sparse-sparse setting. For instance, our current software computes four columns of the dense matrix each time the sparse matrix is loaded into memory, thereby reducing the total cost of accessing the sparse matrix. Another improvement we hope to consider is the interlacing of several columns of the dense matrix during the product to allow for more computation per sparse matrix cache miss. Decompressing the resulting product requires new techniques, so more work is required to take advantage of this and other previously unavailable opportunities.

We plan to incorporate the block dense-to-sparse decompression discussed in Section 3.1 into the PETSc library for faster computation of finite-difference Jacobians using coloring. Another interesting advance might be to develop a new coloring algorithm that incorporates the cost of decompression when deciding how to organize colors. A coloring algorithm that accelerates the decompression would benefit both the matrix multiplication setting and the evaluation of finite-difference Jacobians.

Acknowledgments. The authors were supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357. The authors would like to thank Jed Brown, Lois Curfman McInnes, and Charles Van Loan for their help and support, and Glenn Hammond for providing us the PFLOTRAN test case.

REFERENCES

[1] M. F. Adams, H. H. Bayraktar, T. M. Keaveny, and P. Papadopoulos, Ultrascalable implicit finite element analyses in solid mechanics with over a half a billion degrees of freedom, in ACM/IEEE Proceedings of SC2004: High Performance Networking and Computing, 2004. Gordon Bell Award.
[2] S. Balay, J. Brown, K. Buschelman, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, B. F. Smith, and H. Zhang, PETSc users manual, Tech. Report ANL-95/11 - Revision 3.4, Argonne National Laboratory, 2013.
[3] T. F. Coleman and J. J. Moré, Estimation of sparse Jacobian matrices and graph coloring problems, SIAM Journal on Numerical Analysis, 20 (1983), pp. 187–209.
[4] G. H. Golub and C. F. Van Loan, Matrix Computations (4th ed.), Johns Hopkins University Press, Baltimore, MD, 2012.
[5] G. E. Hammond, P. C. Lichtner, and R. T. Mills, Evaluating the performance of parallel subsurface simulators: An illustrative example with PFLOTRAN, Water Resources Research, (2013).
[6] W. Klotz, Graph coloring algorithms, Mathematics Report, (2002), pp. 1–9.
[7] M. Kubale, Graph Colorings, Contemporary Mathematics (American Mathematical Society) v. 352, American Mathematical Society, 2004.
[8] P. Lichtner et al., PFLOTRAN project. http://ees.lanl.gov/pflotran/.
[9] D. W. Matula and L. L. Beck, Smallest-last ordering and clustering and graph coloring algorithms, J. ACM, 30 (1983), pp. 417–427.
[10] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing Company, Boston, MA, 1996.
[11] B. Smith and H. Zhang, Sparse triangular solves for ILU revisited: Data layout crucial to better performance, International J. High Performance Computing Applications, 25 (2011), pp. 386–391.
[12] K. Stüben, A review of algebraic multigrid, J. Comput. Appl. Math., 128 (2001), pp. 281–309.
[13] U. Trottenberg, C. W. Oosterlee, and A. Schüller, Multigrid, Elsevier Science, 2000.

Government License. The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.
