<<

EFFICIENT SPARSE MATRIX-MATRIX PRODUCTS USING COLORINGS

MICHAEL MCCOURT∗, BARRY SMITH∗, AND HONG ZHANG∗

Abstract. Sparse matrix-matrix products appear in multigrid solvers and computational methods for graph analysis. Some formulations of these products require the inner product of two sparse vectors, which makes inefficient use of cache memory. In this paper, we propose a new algorithm for computing sparse matrix-matrix products by exploiting the matrix nonzero structure through the process of graph coloring. We prove the validity of this technique in general and demonstrate its viability for multigrid methods used to solve three-dimensional boundary value problems.

1. Introduction. Matrix-matrix multiplication is a fundamental operation [4]. The operation of concern in this work is

C = AB^T, which arises in algebraic multigrid and certain graph algorithms. This operation is well defined when A ∈ R^{m×r} and B ∈ R^{n×r}, and it will produce C ∈ R^{m×n}. Our results are equally applicable to complex-valued matrices, but we omit that discussion for simplicity. The multiplication operation is defined such that the jth column of the ith row of C (denoted as C(i, j)) is calculated as

C(i, j) = Σ_{k=1}^{r} A(i, k) B^T(k, j).

This can also be written as the inner product of the ith row of A, denoted as A(i, :), and the jth column of B^T, denoted as B^T(:, j):

C(i, j) = A(i, :) B^T(:, j). (1.1)

Several mechanisms exist for computing the matrix-matrix product, each of which is preferable in certain settings. For instance, C can be constructed all at once with a sum of r outer products (rank-1 matrices),

C = Σ_{k=1}^{r} A(:, k) B^T(k, :). (1.2)

For an approach between using (1.1) to compute one value at a time and using (1.2) to compute C all at once, we can compute one row or column at a time, or compute blocks of C using blocks of A and B^T.

While all these techniques yield the same result, they may not be equally preferable for actual computation because of the different forms in which a matrix may be stored. Many sparse matrix algorithms have memory access, rather than floating-point operations, as the computational bottleneck [11]; this demands diligence when choosing a storage format so as to minimize memory traffic. In this paper, we begin with conventional matrix storage, namely, compressed sparse row (CSR) format for sparse matrices and Fortran-style column-major array storage for dense matrices [2], and then transform the storage format for higher efficiency.

One approach to computing AB^T would be to compute B^T from B and store it as a CSR matrix, but this reorganization requires a significant amount of memory traffic. The availability of the columns of B^T would seem to indicate that the inner product computation (1.1) is the preferred method. Unfortunately, computing the inner product of sparse vectors is an inefficient operation because of the low ratio of flops to cache misses (see Section 1.1). Our goal is to produce a storage format and algorithm which will efficiently compute the sparse matrix-matrix product C = AB^T using inner products.

In the rest of this section we introduce sparse inner products and matrix coloring. In Section 2 we analyze matrix coloring applied to the sparse matrix product C = AB^T, which allows us to instead compute C by evaluating the inner product of sparse and dense vectors. In Section 3 we propose algorithms for computing

matrix products with matrix colorings and consider some practical implementation issues. Numerical results are presented in Section 4, which show the benefits of this coloring approach for matrix products appearing in multigrid solvers for three-dimensional PDEs from PFLOTRAN [8] and from the PETSc library distribution [2]. We conclude the paper in Section 5 with a brief discussion of future work.

∗Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439. Emails: (mccomic, bsmith, hzhang)@mcs.anl.gov

1.1. Sparse inner products. When the matrices A and B are stored in compressed sparse row format, computing C = AB^T requires inner products between sparse vectors; this section discusses the inefficiency (in number of flops per cache miss) of sparse inner products. Algorithm 1 shows how inner products between sparse vectors in R^n may be computed.

Algorithm 1 Sparse-Sparse Inner Product of x, y ∈ R^n; x, y have n_x, n_y nonzeros, respectively.
1: i = 1; j = 1; x^T y = 0.0
2: while (i ≤ n_x and j ≤ n_y) do
3:   if (index(x, i) < index(y, j)) { i = i + 1; }
4:   else if (index(x, i) > index(y, j)) { j = j + 1; }
5:   else { x^T y = x^T y + x(index(x, i)) y(index(y, j)); i = i + 1; j = j + 1; }
6: end while

Because x and y are stored in compressed form, only the n_x and n_y nonzero values in each vector, respectively, are stored, along with the rows to which they belong. The function “index” accepts a vector x and an integer 1 ≤ i ≤ n_x and returns the index of the ith nonzero; the ith nonzero is then accessed as x(index(x, i)).

Algorithm 1 iterates through the nonzero entries in each vector until finding the collisions between their indices; each collision is then accumulated to evaluate the inner product. While the number of floating-point operations (flops) for this algorithm is only O(number of collisions between nonzeros), the required memory access is inefficient. The indices of both vectors must be traversed completely, requiring O(n_x + n_y) memory accesses in a while-loop.

This poor ratio of computation to memory access is one motivation behind our work. Contrast Algorithm 1 with the inner product of a sparse and dense vector in Algorithm 2.
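As a concrete illustration of this access pattern, the following C sketch implements Algorithm 1 for vectors stored as sorted (index, value) pairs; the storage layout and function name are assumptions made for this sketch and are not the format used inside PETSc.

```c
#include <stddef.h>

/* Sparse-sparse inner product (Algorithm 1).  Each vector is stored in
 * compressed form: nx (resp. ny) nonzeros, row indices in xidx (yidx)
 * sorted increasingly, and the matching values in xval (yval).          */
double sparse_sparse_dot(size_t nx, const int xidx[], const double xval[],
                         size_t ny, const int yidx[], const double yval[])
{
    double dot = 0.0;
    size_t i = 0, j = 0;
    while (i < nx && j < ny) {                    /* O(nx + ny) index traversal */
        if (xidx[i] < yidx[j])       i++;
        else if (xidx[i] > yidx[j])  j++;
        else { dot += xval[i] * yval[j]; i++; j++; }   /* index collision */
    }
    return dot;
}
```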

Algorithm 2 Sparse-Dense Inner Product of x, y ∈ R^n; x has n_x nonzeros and y is dense.
1: x^T y = 0.0
2: for (i = 1, ..., n_x) do
3:   x^T y = x^T y + x(index(x, i)) y(index(x, i))
4: end for

Algorithm 2 performs O(n_x) flops, because even if some of the values in y are zero, they are all treated as nonzeros. This is more computation than the sparse-sparse inner product but with roughly the same level of memory access: O(2 n_x) accesses in a for-loop. Essentially, a sparse-dense inner product performs more flops per memory access.

Our interest is not in this small trade-off but in compressing multiple sparse vectors into a single, dense vector; this will significantly increase the ratio of flops to memory accesses, without (we hope) introducing a harrowing number of zero-valued nonzeros into the computation. If this can be done effectively, then the inefficient sparse-sparse inner products used to compute sparse matrix products can be replaced with a reduced number of more efficient sparse-dense inner products. Our mechanism for doing this is described in Section 1.2.
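For comparison, here is a sketch of Algorithm 2 under the same illustrative storage assumptions: the dense vector y is a full-length array indexed directly by the nonzero pattern of x, so no index merging is required.

```c
#include <stddef.h>

/* Sparse-dense inner product (Algorithm 2).  Only the nonzero pattern of x
 * is traversed; any zero stored in the dense array y is treated as a nonzero. */
double sparse_dense_dot(size_t nx, const int xidx[], const double xval[],
                        const double y[])
{
    double dot = 0.0;
    for (size_t i = 0; i < nx; i++)
        dot += xval[i] * y[xidx[i]];   /* O(nx) flops, roughly O(2 nx) accesses */
    return dot;
}
```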

1.2. Matrix coloring. Matrix coloring is related to graph coloring, an important topic in graph theory [7]. We are not interested in a graph-theoretic understanding, only in the specific use of graph theory to color matrices, so we introduce only terms specifically relevant to our usage.

Definition 1.1. Let u, v ∈ R^n. The vectors u and v are structurally orthogonal if |u|^T |v| = 0, where |u| is the vector of the absolute values of all the elements of u. A set of vectors {u_1, ..., u_n} is called structurally orthogonal if u_i and u_j are structurally orthogonal for 1 ≤ i, j ≤ n.

Under this definition, not all orthogonal vectors are structurally orthogonal; for example,

u = [1, 1]^T,   v = [1, −1]^T,

so u^T v = 0 but |u|^T |v| = 2. Also, all vectors are structurally orthogonal to a vector of all zeros.

Lemma 1.2. Let u, v ∈ R^n be structurally orthogonal, and denote u(k) as the kth element in u. Then for 1 ≤ k ≤ n, either u(k) = 0 or v(k) = 0, or both.

Proof. We know that |u|^T |v| = 0, which means

Σ_{k=1}^{n} |u(k)| |v(k)| = 0.

Obviously, |u(k)| |v(k)| ≥ 0 for 1 ≤ k ≤ n; for the sum of these nonnegative terms to equal zero, all the terms must themselves be zero:

|u(k)| |v(k)| = 0,   1 ≤ k ≤ n.

Therefore, either u(k) = 0 or v(k) = 0, or potentially both.

Definition 1.3. Let C ∈ R^{m×n}. An orthogonal set is a set of q indices ℓ = {ℓ^1, ℓ^2, ..., ℓ^q} for which C(:, ℓ^i) and C(:, ℓ^j) are structurally orthogonal when 1 ≤ i, j ≤ q and i ≠ j. We will also define a set containing only one index ℓ = {ℓ^1} to be an orthogonal set.

Definition 1.4. Let C ∈ R^{m×n}. A matrix coloring c = {ℓ_1, ℓ_2, ..., ℓ_p} is a collection of index sets such that, for 1 ≤ k ≤ n, the index k appears in exactly one index set. We say that c is a valid matrix coloring of C if each index set in c is an orthogonal set of C.

We refer to an orthogonal set that is part of a coloring as a color. Because we require each 1 ≤ k ≤ n to appear in exactly one orthogonal set, every column of C appears in exactly one color. Each orthogonal set in the coloring contains a set of indices corresponding to columns of C which are structurally orthogonal. The term coloring mimics the graph coloring concept of grouping nonadjacent vertices using the same color; here we are grouping structurally orthogonal columns of a matrix using the same orthogonal set.

Recall that we allow for the possibility of there being only one column in a color; for instance, a column with no zero values in it must exist in its own color. This in turn guarantees that every matrix has a coloring, since every matrix C ∈ R^{m×n} must have at least the trivial coloring {{1}, {2}, ..., {n}}. If the matrix C were totally dense, with no structural zeros at all, this would be the only coloring. Our focus, however, is on the coloring of very sparse matrices. A matrix may have more than one valid coloring; the 2 × 2 identity

I_2 = [ 1  0 ]
      [ 0  1 ]

has two colorings: c_1 = {{1, 2}} and c_2 = {{1}, {2}}. We are not concerned with how the coloring of a matrix is determined but, rather, how we can use it to efficiently compute matrix-matrix products. Therefore, we refer readers to [3] for an introduction to how colorings are found.

Definition 1.5. Let C ∈ R^{m×n} and c = {ℓ_1, ..., ℓ_p} be a valid matrix coloring for C. Applying the coloring c to C produces a matrix C_Dense that is at least as dense as the matrix C. If c has p colors in it, the matrix C_Dense ∈ R^{m×p}. The kth column of the matrix C_Dense is created by combining the columns of C from the orthogonal set ℓ_k = {ℓ_k^1, ..., ℓ_k^{q_k}}. Specifically,

C_Dense(j, k) = { C(j, ℓ_k^r)   if C(j, ℓ_k^r) ≠ 0,
               { 0              if Σ_{i=1}^{q_k} |C(j, ℓ_k^i)| = 0,

for 1 ≤ j ≤ m. The matrix resulting from applying the coloring will be referred to as a compressed matrix.

Theorem 1.6. Let C ∈ R^{m×n} and c = {ℓ_1, ..., ℓ_p} be a valid matrix coloring for C. The matrix C_Dense created by applying c to C is unique.

Proof. Each column of C_Dense is created from a single color, and no column of C appears in more than one color. Thus, proving that any column of C_Dense is unique proves that C_Dense is unique. We will prove that each value of the kth column of C_Dense is unique.

If the value C_Dense(j, k) = 0, then Σ_{i=1}^{q_k} |C(j, ℓ_k^i)| = 0, which could occur only if C(j, ℓ_k^1) = ... = C(j, ℓ_k^{q_k}) = 0. This would prevent having a conflicting nonzero value for that location, so all zero values in the kth column are unique.

If the value C_Dense(j, k) ≠ 0, then at least one of the values in {C(j, ℓ_k^1), ..., C(j, ℓ_k^{q_k})} is nonzero. If exactly one value is nonzero, then the value C_Dense(j, k) is unique. We must prove that only one of these q_k values is nonzero; we begin by assigning the nonzero value to the ν_k index, namely, C(j, ℓ_k^{ν_k}) ≠ 0.
Because c is a valid matrix coloring of C, the columns {C(:, ℓ_k^1), ..., C(:, ℓ_k^{q_k})} must be structurally orthogonal. Lemma 1.2 tells us that for the jth row,

C(j, ℓ_k^s) C(j, ℓ_k^{ν_k}) = 0,   1 ≤ s ≤ q_k,  s ≠ ν_k.

Since we have assumed that C(j, ℓ_k^{ν_k}) ≠ 0, this lemma demands that C(j, ℓ_k^s) = 0 for 1 ≤ s ≤ q_k, s ≠ ν_k. Therefore, every value in the kth column of C_Dense is uniquely determined.

Traditionally, matrix colorings have been used for the accelerated computation of finite-difference Jacobians for the purpose of preconditioning nonlinear solvers. In that setting, structurally orthogonal columns were grouped together to allow for simultaneous function evaluation. This minimizes the number of function calls required to approximate the Jacobian without ignoring any nonzero values. We use this same concept in Section 2 to accelerate the operation C = AB^T.
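To make Definitions 1.1 and 1.4 concrete, the following C sketch checks whether a proposed coloring is valid for a matrix whose columns are stored sparsely: a color is an orthogonal set exactly when no row index appears in two of its columns. The storage layout and names are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Columns of an m x n matrix stored sparsely: column j has colnnz[j] nonzero
 * row indices listed in colrows[j].  The coloring has p colors; color k lists
 * its csize[k] column indices in colors[k].  Returns true if every color is
 * an orthogonal set (Definition 1.4).                                        */
bool coloring_is_valid(int m, const int *colnnz, int **colrows,
                       int p, const int *csize, int **colors)
{
    bool *seen  = calloc((size_t)m, sizeof(bool));
    bool  valid = true;
    for (int k = 0; k < p && valid; k++) {
        memset(seen, 0, (size_t)m * sizeof(bool));       /* reset per color */
        for (int c = 0; c < csize[k] && valid; c++) {
            int j = colors[k][c];
            for (int t = 0; t < colnnz[j]; t++) {
                int row = colrows[j][t];
                if (seen[row]) { valid = false; break; } /* two columns of one
                                                            color share a row */
                seen[row] = true;
            }
        }
    }
    free(seen);
    return valid;
}
```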

2. Analysis of Coloring for Sparse Matrix-Matrix Products. Our goal in this section is to exploit the graph coloring concept, described in Section 1.2, to more efficiently compute inner products in a sparse-sparse matrix product. We are interested in the two expressions

C = AB^T,  or (2.1a)
C = RAR^T, (2.1b)

where the matrices A, B, R are stored in CSR format and are not necessarily square. We analyze only (2.1a) in this section because (2.1b) can be described by using this product.

The sparse-sparse matrix product (2.1a) can be computed with an analogous sparse-dense matrix product; doing so will reduce memory access overhead and improve computational efficiency. To achieve this improved performance, we restructure the sparse matrix B^T into a dense matrix B^T_Dense using the matrix coloring ideas introduced in Section 1.2. This allows us to compute

C_Dense = A B^T_Dense, (2.2)

such that the nonzero entries in C_Dense are also the nonzero entries needed to form C. After computing (2.2), C_Dense must be reorganized into C, which is done by using a process similar to the compression of B^T into B^T_Dense. Since this is not fundamental to the analysis, this process will be discussed in Section 3.1.

We must ask the question: How can we form a dense B^T_Dense such that C_Dense and C have the same nonzero values? Our approach uses a coloring of the matrix C to determine which columns of C are structurally orthogonal and can be computed simultaneously. This allows for the compression of those multiple columns of B^T associated with each color into a single column of B^T_Dense. After C_Dense has been computed, this single dense column is decompressed to fill all the columns of C corresponding to the columns of B^T which were earlier compressed to create B^T_Dense.

Although the dense matrix C_Dense will contain the same nonzero values as C, the dense matrix B^T_Dense will not necessarily contain all the nonzero values of B^T. This special case may occur when A has columns with only zero values, but we prove in Corollary 2.7 that whenever A has no zero columns, any coloring of C will be a valid coloring for B^T.
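A minimal sketch of the sparse-dense product (2.2), with A stored in CSR and B^T_Dense stored column-major, is shown below; the array names and layout are illustrative assumptions (an optimized kernel would also process several dense columns per pass over A, as discussed in Section 3).

```c
#include <stddef.h>

/* CDense = A * BTdense, where A is m x r in CSR (ai row pointers, aj column
 * indices, aa values), BTdense is r x p column-major (one column per color),
 * and CDense is m x p column-major.                                         */
void csr_times_dense(int m, int p, int r,
                     const int *ai, const int *aj, const double *aa,
                     const double *BTdense, double *CDense)
{
    for (int k = 0; k < p; k++) {                 /* one dense column per color */
        const double *bk = BTdense + (size_t)k * r;
        double       *ck = CDense  + (size_t)k * m;
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int t = ai[i]; t < ai[i + 1]; t++)
                sum += aa[t] * bk[aj[t]];         /* sparse-dense inner product */
            ck[i] = sum;
        }
    }
}
```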

2.1. Motivating Example. We consider a small example to demonstrate the coloring concept before we study its validity in general sparse-sparse matrix products. Consider the product C = AB^T of two “sparse” matrices, where C is 6 × 6 with the nonzero structure (∗ marks a structural nonzero)

    ∗ ∗ ∗ . . ∗
    . ∗ . . . ∗
    ∗ ∗ ∗ . . ∗
    . . ∗ ∗ ∗ ∗
    . . . ∗ . ∗
    . . . . . ∗

The structure of C admits the coloring

    c = {ℓ_1, ℓ_2, ℓ_3, ℓ_4} = {{1, 4}, {2, 5}, {3}, {6}}

because columns 1 and 4 are structurally orthogonal, as are columns 2 and 5. Notice that despite the structural orthogonality of the third and fifth columns of B^T, those columns cannot be combined in the same color because columns 3 and 5 of C are not structurally orthogonal. Using the coloring c, we can compress B^T into B^T_Dense, with one dense column per color, and compute C_Dense = A B^T_Dense, a 6 × 4 matrix whose kth column holds the nonzero values of the columns of C belonging to color ℓ_k.
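The compression used in this example can be sketched as a simple scatter: each sparse column of B^T is written into the dense column assigned to its color. The column-wise storage and names are assumptions of this sketch; when c is a valid coloring of B^T (or B^T has been vacated as in Section 2.2), no dense location is written twice.

```c
#include <stddef.h>
#include <string.h>

/* Form the r x p column-major BTdense from the n columns of BT, stored
 * sparsely: column j has nnz[j] entries with row indices in rows[j] and
 * values in vals[j].  color[j] in {0, ..., p-1} is the color of column j. */
void compress_columns(int r, int n, int p,
                      const int *nnz, int **rows, double **vals,
                      const int *color, double *BTdense)
{
    memset(BTdense, 0, (size_t)r * (size_t)p * sizeof(double));
    for (int j = 0; j < n; j++) {
        double *dense_col = BTdense + (size_t)color[j] * r;
        for (int t = 0; t < nnz[j]; t++)
            dense_col[rows[j][t]] = vals[j][t];   /* scatter the nonzeros */
    }
}
```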

To decompress C_Dense into C, we must store both the columns that formed the coloring and the rows of C associated with each of those columns. In the example above, the first column of C_Dense is composed of columns 1 and 4 from C, and the fourth and fifth rows belong to column 4 of C. This process is discussed in Section 3.1.

2.2. Proof of the Validity of Matrix Coloring. For a coloring of C to be appropriate for the product AB^T, one of two situations must apply:
1. The coloring c of C is also a valid coloring of B^T; invoking Theorem 1.6 means that a unique B^T_Dense exists when applying c, or
2. The conflicts that arise when applying the coloring do not contribute to the nonzeros of C.
We must prove this second condition; before we can complete such a proof, we need to create a new device: an auxiliary matrix B̂^T for which c is a valid coloring.

Definition 2.1. Let B^T ∈ R^{r×n} and c be a coloring for a matrix with n columns. A vacated matrix B̂^T ∈ R^{r×n} is a matrix whose nonzero values all coincide with nonzeros from B^T but for which c is a valid coloring.

Under this definition, no nonzeros can be introduced in vacating B^T to B̂^T; values can only be zeroed out to produce a matrix with the appropriate structure to apply the coloring. If c is a valid coloring of B^T, then one trivial vacation of B^T would be to remove no nonzeros. Another trivial vacation of B^T that would validate any coloring would be to zero out every value in the matrix.

Definition 2.2. Let u_1, u_2, ..., u_q ∈ R^r. The conflicted index set of {u_1, ..., u_q} is the set of indices

Γ({u_1, ..., u_q}) = { γ ∈ {1, 2, ..., r} | u_i(γ) u_j(γ) ≠ 0 for some 1 ≤ i, j ≤ q, i ≠ j }.

We define the conflicted index set of a set of one vector as empty: Γ({u}) = ∅.

The term conflicted index set refers to the fact that these rows are preventing the set of vectors {u_1, ..., u_q} from being structurally orthogonal. Were they structurally orthogonal, the conflicted index set would be empty.

Lemma 2.3. Let u_1, u_2, ..., u_q ∈ R^r, and let Γ({u_1, ..., u_q}) be the associated conflicted index set. The set of vectors {û_1, û_2, ..., û_q} defined as

û_i(γ) = { 0         if γ ∈ Γ({u_1, ..., u_q}),
         { u_i(γ)    otherwise,

is structurally orthogonal.

Proof. If Γ = ∅, then the vectors in {u_1, ..., u_q} are already structurally orthogonal. Otherwise, we must prove that

|û_i|^T |û_j| = 0,   1 ≤ i, j ≤ q,  i ≠ j.

Let us simplify the notation for this proof by defining Γ ≡ Γ({u_1, ..., u_q}). For any i ≠ j, the inner product can be separated into two components,

|û_i|^T |û_j| = Σ_{γ∈Γ} |û_i(γ)| |û_j(γ)| + Σ_{γ∉Γ} |û_i(γ)| |û_j(γ)|
              = Σ_{γ∈Γ} |0| |0| + Σ_{γ∉Γ} |u_i(γ)| |u_j(γ)|
              = Σ_{γ∉Γ} |u_i(γ)| |u_j(γ)|.

Because the conflicted index set includes the indices such that u_i(γ) u_j(γ) ≠ 0, any γ ∉ Γ must have u_i(γ) u_j(γ) = 0, leaving the summation above equal to zero.

This lemma allows us to take any set of vectors and replace certain nonzero values with zeros to make them structurally orthogonal. We will apply this concept to the columns of B^T involved in each color of c in order to create a matrix B̂^T for which c is a valid coloring.

Definition 2.4. Let B^T ∈ R^{r×n}, let c = {ℓ_1, ..., ℓ_p} be a coloring for some matrix with n columns, and let B^T(:, ℓ_k) denote the set of columns associated with the color ℓ_k = {ℓ_k^1, ..., ℓ_k^{q_k}}. The minimally vacated matrix B̂_c^T ∈ R^{r×n} is the vacated matrix generated from B^T in the following way:

B̂_c^T(γ, ℓ_k^j) = { 0                if γ ∈ Γ(B^T(:, ℓ_k)),
                  { B^T(γ, ℓ_k^j)    otherwise,

for 1 ≤ j ≤ q_k, 1 ≤ k ≤ p.
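As an illustration of Definition 2.4, the following C sketch vacates the columns belonging to a single color: rows stored by two or more of those columns form the conflicted index set, and the corresponding values are zeroed in place. Names and layout are assumptions of this sketch, which also treats every stored entry as a true nonzero.

```c
#include <stdlib.h>

/* Minimally vacate the q columns of BT that belong to one color.  Column
 * cols[c] has nnz[cols[c]] stored entries, with row indices in rows[cols[c]]
 * and values in vals[cols[c]]; r is the number of rows of BT.               */
void vacate_color(int r, int q, const int *cols,
                  const int *nnz, int **rows, double **vals)
{
    int *count = calloc((size_t)r, sizeof(int));
    for (int c = 0; c < q; c++)                     /* count row occurrences   */
        for (int t = 0; t < nnz[cols[c]]; t++)
            count[rows[cols[c]][t]]++;
    for (int c = 0; c < q; c++)                     /* zero conflicted entries */
        for (int t = 0; t < nnz[cols[c]]; t++)
            if (count[rows[cols[c]][t]] > 1)
                vals[cols[c]][t] = 0.0;
    free(count);
}
```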

Lemma 2.5. Let B^T ∈ R^{r×n}, and let c = {ℓ_1, ..., ℓ_p} be a valid coloring for some matrix with n columns. The coloring c is valid for the minimally vacated matrix B̂_c^T.

Proof. Using Lemma 2.3, we know that the columns in B̂_c^T(:, ℓ_k) are structurally orthogonal for 1 ≤ k ≤ p, and therefore c is a valid coloring of B̂_c^T.

The minimal vacation of a matrix allows for the preservation of nonzeros with which there is no coloring conflict. Recall that for the product C = AB^T, our goal is to produce a dense matrix B^T_Dense from the matrix B^T by applying the coloring c to B^T. As stated earlier, c may not be a valid coloring for B^T, in which case we substitute the minimally vacated matrix B̂_c^T in place of B^T. Lemma 2.5 proves that c is an acceptable coloring for this vacated matrix.

More important than the validity of the coloring is the validity of the equation C = A B̂_c^T. If this equality does not hold, then some values in C would be altered by substituting the minimally vacated B̂_c^T for B^T, which is unacceptable. Theorem 2.6 addresses this concern.

Theorem 2.6. Let A ∈ R^{m×r}, B ∈ R^{n×r}, and c = {ℓ_1, ..., ℓ_p} be a coloring of C = AB^T. If B̂_c^T is the minimally vacated matrix generated by applying c to the matrix B^T, then the equality

A B^T = A B̂_c^T (2.3)

holds.

Proof. We denote Ĉ = A B̂_c^T and prove that Ĉ = C by proving that all the nonzeros of Ĉ match the nonzeros of C. By definition, each column of C belongs to exactly one color, so let us denote by ℓ_{p_j} = {ℓ_{p_j}^1, ..., j, ..., ℓ_{p_j}^{q_j}} the color containing column C(:, j). As was done earlier, let C(:, ℓ_{p_j}) denote the set of columns of C associated with the color ℓ_{p_j}.

Call Γ_j ≡ Γ(B^T(:, ℓ_{p_j})) the conflicted index set arising from applying the color ℓ_{p_j} to the matrix B^T. The values C(i, j) and Ĉ(i, j) can be computed in pieces related to this Γ_j:

C(i, j) = Σ_{γ∈Γ_j} A(i, γ) B^T(γ, j) + Σ_{γ∉Γ_j} A(i, γ) B^T(γ, j),
Ĉ(i, j) = Σ_{γ∈Γ_j} A(i, γ) B̂_c^T(γ, j) + Σ_{γ∉Γ_j} A(i, γ) B̂_c^T(γ, j).

Using the definition of a minimally vacated matrix, we can simplify this second line to

Ĉ(i, j) = Σ_{γ∉Γ_j} A(i, γ) B^T(γ, j),

which, combined with the first line, gives

C(i, j) = Σ_{γ∈Γ_j} A(i, γ) B^T(γ, j) + Ĉ(i, j).

Since we want C(i, j) = Ĉ(i, j), we must prove that

Σ_{γ∈Γ_j} A(i, γ) B^T(γ, j) = 0,   1 ≤ i ≤ m,  1 ≤ j ≤ n,

which we do by proving that

for all 1 ≤ i ≤ m, 1 ≤ j ≤ n :   A(i, γ) = 0 for each γ ∈ Γ_j. (2.4)

By the definition of the conflicted index set Γ_j, for every γ ∈ Γ_j at least two of the columns in B^T(:, ℓ_{p_j}) must have a nonzero in row γ. If we choose any two of those and call their indexes s and t, we can write

γ ∈ Γ_j  ⇒  there exist s, t ∈ ℓ_{p_j}, s ≠ t, such that B^T(γ, s) ≠ 0 and B^T(γ, t) ≠ 0. (2.5)

Because ℓ_{p_j} is a valid color for C, the set C(:, ℓ_{p_j}) is structurally orthogonal:

C(i, s) = 0,   or   C(i, t) = 0,   1 ≤ i ≤ m.

The structural zero C(i, s) occurs because the vectors A(i, :)^T and B^T(:, s) are structurally orthogonal. Applying Lemma 1.2 gives

A(i, γ) = 0,   or   B^T(γ, s) = 0,   1 ≤ γ ≤ r,  1 ≤ i ≤ m.

A similar statement is also true for C(i, t), so we join these results to state that for every 1 ≤ γ ≤ r,

either B^T(γ, s) = 0 or B^T(γ, t) = 0,   or   A(i, γ) = 0 for all 1 ≤ i ≤ m. (2.6)

The combination of (2.5) and (2.6), along with the knowledge that each column of C appears in only one color, is sufficient to prove (2.4).

Theorem 2.6 guarantees that the coloring of C is a sufficient tool to perform the sparse-dense matrix product A B^T_Dense. While creating the minimally vacated matrix B̂_c^T is not difficult in practice, most applications do not require it, as shown in Corollary 2.7.

Corollary 2.7. For the product C = AB^T, any coloring c of C is a valid coloring of B^T if A has no zero columns.

Proof. Start with (2.6), which is valid for any two distinct columns s, t ∈ ℓ_{p_j}. If A has no zero columns, then A(i, γ) ≠ 0 for some 1 ≤ i ≤ m, which requires that either B^T(γ, s) = 0 or B^T(γ, t) = 0 for 1 ≤ γ ≤ r and makes c a valid coloring for B^T.

3. Algorithms for Sparse-Sparse Matrix Product Using Coloring. Now that we have laid the foundation, we present Algorithm 3 for computing the product of sparse matrices by an associated sparse-dense product generated through matrix coloring. The remainder of this section will analyze this algorithm and adapt it to address implementation concerns.

One way to compare Algorithm 3 to a sparse inner product-based algorithm would be to study the memory traffic. Let n_A and n_B denote the average number of nonzeros per row in A and B, respectively, and let n_color be the number of colors in c. From our discussion in Section 1.1, we can surmise that computing AB^T using sparse inner products requires roughly O((n_A + n_B) m n) memory accesses. On the other hand, A B^T_Dense should incur only O(2 n_A m n_color) memory accesses, which could yield substantial savings if n_color ≪ n.

Algorithm 3 Computing C = AB^T Using Coloring, Basic Version
1: Compute symbolic C = AB^T; compute a matrix coloring c of C
2: Assemble B^T_Dense by applying c to B^T
3: Perform the sparse-dense matrix product C_Dense = A B^T_Dense
4: Recover C from C_Dense

Despite the gains from performing the sparse-dense matrix product, costs are incurred by computing the coloring of C, compressing B^T to B^T_Dense, and then recovering C from C_Dense. These eat into the competitive advantage described earlier for the sparse-dense matrix product and must be taken into account when comparing the two algorithms. Rather than try to conduct a “pencil and paper” analysis of this additional complexity, we have performed numerical tests to demonstrate that their cost is not overwhelming. Those results appear in Section 4.

In line 1 of Algorithm 3, we compute the symbolic product of AB^T, which determines the location of the nonzeros in C before their values are computed [10]. Generally, this is used to preallocate space for C, but here we also use this structural description of C to determine a matrix coloring c. We will not discuss the different algorithms for computing matrix and graph colorings here; the effects of two popular choices, which are available in the PETSc library, are compared in Section 4.2.

Line 2 was discussed in Section 2, although here we omit the possible need for a vacated matrix B̂^T to form B^T_Dense. Recall that Corollary 2.7 shows that this vacation is not necessary so long as A has at least one nonzero in each column. The sparse-dense matrix product in line 3 is computed as a sequence of sparse-dense inner products, each of which implements Algorithm 2. The efficient recovery of the sparse matrix C from the computed matrix C_Dense is not trivial; it is discussed in Section 3.1.

Lines 2 through 4 contain the so-called numeric portion of the matrix product. For many applications, including the PFLOTRAN example in Section 4.1, the nonzero patterns of the matrices remain fixed despite varying numeric values; this allows the symbolic component (line 1) to be performed only once while numeric products are computed whenever matrix values are updated. Our efficiency discussions focus on the numeric component.

Although not the focus of this paper, performing the sparse-dense product A B^T_Dense in lieu of the sparse-sparse product AB^T allows for new optimizations leveraging the standardized structure of a dense matrix. One optimization that we exploit is the computation of multiple columns of C_Dense simultaneously. Each time A is brought into memory, four columns of C_Dense are computed, allowing for more flops per cache miss caused by A. The number of columns that can be efficiently computed simultaneously is determined by the available memory. Improving the ratio of flops per cache miss was a key factor in motivating this work, as discussed in Section 1.1, and the topic will appear again in Section 4.

3.1. Recovering the Sparse Matrix from the Dense Product. The fourth line of Algorithm 3 decompresses the dense matrix C_Dense to the sparse matrix C. Although we have not explicitly stated it previously, more information is needed to perform this decompression than just C_Dense and c. In the process of compressing B^T to B^T_Dense, multiple columns (previously denoted B^T(:, ℓ_j)) are joined to form a single column B^T_Dense(:, j); to undo this process, we must know how to partition C_Dense(:, j) among the columns in the set C(:, ℓ_j). In practice, during the compression, the coloring is augmented to also store this row data. Referring to the example in Section 2.1, we would augment the coloring

c = {{1, 4}, {2, 5}, {3}, {6}}

c+ = {{1: {1, 3}, 4: {4, 5}}, {2: {1, 2, 3}, 5: {4}}, {3: {1, 3, 4}}, {6: {1, 2, 3, 4, 5, 6}}}.

Using this augmented coloring c+, we can decompress a compressed matrix to its sparse form. The strategy used to populate C from C_Dense can contribute significantly to the computational cost of Algorithm 3. A simple approach would be to traverse C_Dense contiguously and populate C with each nonzero according to c+. Implementation of this revealed that as the matrix sizes increase, the decompression can consume up to 1/3 of the total execution time; in contrast, the compression of B^T to B^T_Dense generally requires a much smaller portion of the total time.

The most direct implementation of the C_Dense decompression ignores the fact that values that are very close in C_Dense may belong to very distant columns of C. This occurs because C_Dense is stored in dense columns (as is the dense matrix storage standard) but C is a compressed sparse row matrix. Unpacking any single column of C_Dense may insert nonzero entries throughout C; therefore, decompressing each column of C_Dense may cause a traversal through the entire C matrix, which will likely incur excessive cache misses. To mitigate this expense, we could fill some block of rows of C all at once, thereby preventing the need for n_color passes through C. Results for both decompression techniques are presented in Section 4.1.

Algorithm 4 incorporates the changes discussed in this subsection into the coloring-based sparse product algorithm and also notes the potential need for a minimally vacated B^T as described in Theorem 2.6. This algorithm is listed for completeness, to indicate the practical steps that must be taken for implementation; we will in general refer instead to Algorithm 3 for simplicity.

Algorithm 4 Computing C = AB^T Using Coloring, Practical Version
1: Compute symbolic C = AB^T
2: Compute c, a matrix coloring of C
3: Vacate B^T using c if needed for compression
4: Assemble B^T_Dense by applying c to B^T
5: Augment matrix coloring c → c+
6: Perform the sparse-dense matrix product C_Dense = A B^T_Dense
7: Recover C from C_Dense using c+
8: Populate m_blocksize rows of C at once
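Line 7 of Algorithm 4 can be organized in more than one way. The sketch below fills C in CSR order by gathering from C_Dense, which is the idea behind populating blocks of rows at once; it assumes the symbolic pattern of C is already stored in CSR arrays and uses a per-column color map instead of the augmented coloring c+, so the bookkeeping differs from the implementation described above.

```c
#include <stddef.h>

/* Fill the CSR value array of C (row pointers ci, column indices cj, values
 * cv, preallocated by the symbolic product) from the m x p column-major
 * CDense.  color[j] gives the dense column that holds column j of C.        */
void decompress_rowwise(int m, const int *ci, const int *cj, double *cv,
                        const int *color, const double *CDense)
{
    for (int i = 0; i < m; i++)                       /* C traversed row by row */
        for (int t = ci[i]; t < ci[i + 1]; t++)
            cv[t] = CDense[(size_t)color[cj[t]] * m + i];  /* gather from dense */
}
```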

3.2. Algorithms for the RAR^T Product. Thus far we have discussed the product (2.1a), but our motivating application is multigrid, and it involves the product (2.1b). When Algorithm 3 is adapted for the product C = RAR^T, two options arise: C can be computed by using two sparse-dense products or one sparse-dense product and one sparse-sparse product. The relative efficiency of these options is affected by the number of colors present in the compression. The auxiliary matrix W allows us to compute (2.1b) in two steps:

W = AR^T, (3.1a)
C = RW. (3.1b)

We will always compute (3.1a) using coloring, but the choice of coloring will determine the efficiency of that and subsequent computations. The coloring of W, denoted c_W, can be used to implement Algorithm 3, at the end of which a sparse W is returned. Then a sparse-sparse matrix product between R and W is used to compute C. This approach is described in Algorithm 5.

Algorithm 5 Computing C = RAR^T Using the Coloring of AR^T
1: Use Algorithm 3 to compute W = AR^T
2: Compute the sparse-sparse CSR matrix product C = RW

In line 2 of this algorithm, a sparse-sparse matrix product is used instead of the sparse-dense matrix product that we have been developing in this paper. This is counterintuitive, because we have been discussing the excessive cost of performing sparse-sparse products, but it may be preferable depending on the number of colors present in C. If C has too many colors, then it will be faster to use a sparse product to compute it, as discussed in Section 3. Should C have few colors, it may be faster to use Algorithm 6 to compute C.

Algorithm 6 involves two sparse-dense matrix products, each using the coloring c_C instead of the coloring c_W as was used in Algorithm 5. The cost of this algorithm is tied to the number of colors in C, which can be greater than the number of colors in W; for many of the multigrid problems we study, C is much more dense than W, and it will be less effective to use c_C for these products.

Algorithm 6 Computing C = RAR^T Using the Coloring of C
1: Compute symbolic C = RAR^T; compute matrix coloring c_C
2: Compress (and vacate, if needed) R^T with c_C to form R^T_Dense
3: Augment c_C to c_C+ with the necessary row information
4: Use a sparse-dense matrix product to compute W_Dense = A R^T_Dense
5: Use a sparse-dense matrix product to compute C_Dense = R W_Dense
6: Recover C from C_Dense using c_C+

Computational results presented in Section 4.1 compare these two algorithms and show that the number of colors present in the coloring is a major factor in the efficiency of the computation.

4. Numerical Experiments. We present two sets of test cases:
• The regional doublet test case from PFLOTRAN [8] and
• A three-dimensional linear elasticity PDE test provided in the PETSc library distribution [2].
These tests were chosen because algebraic multigrid is an efficient solver for the linear systems arising in both applications. Multigrid is a mathematically optimal method for solving the systems of algebraic equations that arise from discretized elliptic boundary value PDEs [12, 13]. We use a geometric algebraic multigrid (GAMG) solver in PETSc, which integrates geometric information into robust algebraic multigrid formulations to yield superior convergence rates of the multigrid solver [1, 2]. GAMG requires the matrix triple products C = RAR^T to be computed on all grid levels in the solver setup phase and the matrix product C = GG^T to be computed for the creation of connection graphs. These matrix products, as common computational primitives, constitute a significant portion of the entire simulation cost.

The experiments were conducted by using the PETSc library on a Dell PowerEdge 1950 server with dual Intel Xeon E5440 quad-core CPUs at 2.83 GHz and 16 GB DDR2-667 memory in 4 channels providing 21 GB/s total memory bandwidth. The machine runs the Ubuntu Linux 12.04 64-bit OS. The execution time and floating-point rates were obtained by using one core with the GNU compiler version 4.7.3 and -O optimization. Our performance results were obtained by running the entire test cases and profiling the relevant matrix products. We have done so because standalone benchmarking often produces unreasonably optimistic reports, since much of the data is already in cache, which is not the case during an actual simulation.

4.1. Regional Doublet Test Case from PFLOTRAN. We demonstrate the use of matrix coloring on the regional doublet test case [5] from PFLOTRAN, a state-of-the-art code for simulating multiscale, multiphase, multicomponent flow and reactive transport in geologic media. PFLOTRAN solves a coupled system of mass and energy conservation equations for a number of phases, including air, water, and supercritical CO2, and for a number of chemical components. PFLOTRAN is built on the PETSc library and makes extensive use of PETSc iterative nonlinear and linear solvers.

The regional doublet test case models variably saturated groundwater flow and solute transport within a hypothetical aquifer measuring 5000 m × 100 m. We consider only flow problems here because flow solves dominate the computation. The governing equation is a system of time-dependent PDEs. PFLOTRAN utilizes finite-volume or mimetic finite-difference spatial discretizations and backward-Euler (fully implicit) timestepping. At each time step, Newton-Krylov methods are used to solve the resulting nonlinear algebraic equations. In all the experiments reported below we have run 35 time steps, which is the minimum needed to resolve the basic physics.

Tables 4.1 and 4.2 show the benefit of using matrix coloring to speed the execution time for computing matrix triple products over small to large three-dimensional meshes.
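For orientation, the PETSc interfaces through which such triple products can be formed are sketched below; this is illustrative only, assuming A and R are already-assembled AIJ matrices, and it does not show how (or whether) a coloring-based kernel is selected underneath.

```c
#include <petscmat.h>

/* Form C = R A R^T two ways through the PETSc Mat interface. */
PetscErrorCode triple_products(Mat A, Mat R)
{
  Mat            P, C1, C2;
  PetscErrorCode ierr;

  /* As P^T A P with P = R^T explicitly stored */
  ierr = MatTranspose(R, MAT_INITIAL_MATRIX, &P);CHKERRQ(ierr);
  ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C1);CHKERRQ(ierr);

  /* Directly as R A R^T */
  ierr = MatRARt(A, R, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C2);CHKERRQ(ierr);

  ierr = MatDestroy(&P);CHKERRQ(ierr);
  ierr = MatDestroy(&C1);CHKERRQ(ierr);
  ierr = MatDestroy(&C2);CHKERRQ(ierr);
  return 0;
}
```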
Three approaches are compared for computing C = RAR^T:
• Using no coloring:
  – P^T AP – stores P = R^T and then computes C with sparse outer products,
  – RAP – stores P = R^T and performs products between these three CSR matrices,
  – RAR^T – Algorithm 1 is used to compute each value in C,
• Using the coloring of RAR^T – Algorithm 6, and
• Using the coloring of AR^T – Algorithm 5.

Table 4.1: PFLOTRAN: C = RAR^T. Fine grid size: 50 × 25 × 10; A: 12,500 × 12,500, R: 1,203 × 12,500. Times presented are in seconds. The use of coloring in the RAR^T computation is clearly beneficial. Also, the choice of coloring (studying the structure of RAR^T and applying Algorithm 6 or studying AR^T and applying Algorithm 5) plays a major role in the algorithm efficiency.

                        No Coloring                          Coloring
             Calls   P^T AP    RAP     RAR^T     RAR^T (ncolor=59)   AR^T (ncolor=20)
Symbolic       3     .0066     .016    .011      .019                .024
Numeric      246     1.30      1.46    2.39      1.51                .760
Total Time           1.31      1.48    2.40      1.53                .784

Table 4.2: PFLOTRAN: C = RAR^T. Total computation time (in seconds) presented for increasing fine grid density. The “time100” column implements dense-to-sparse decompression with 100-row blocks. Algorithm 5 continues to outperform Algorithm 6. Unpacking multiple rows at once during the decompression (as discussed in Section 3.1) speeds the computations for the AR^T coloring but not for the RAR^T coloring.

                    No Coloring     Algorithm 6, using RAR^T        Algorithm 5, using AR^T
Grid Size           RAR^T time      ncolor   time    time100        ncolor   time    time100
50 × 25 × 10        2.4             59       1.5     1.6            20       .78     .84
100 × 50 × 20       28              70       26      27             24       14      12
200 × 100 × 40      246             84       374     376            25       162     132

The first column of Table 4.1 gives the total number of symbolic and numeric matrix triple products accumulated from all grid levels and all linear iterations during the entire simulation. The symbolic matrix products were computed only during the solver setup phase, while the numeric matrix products were executed for every nonlinear iteration of GAMG. Time spent creating matrix colorings is included in the symbolic row and contributed approximately half the symbolic execution time for the coloring columns. The matrix colorings for this example are created by using the PETSc default algorithm, largest-first ordering (discussed in Section 4.2). When compared with the repeated execution of numeric matrix products performed during the solve, the time spent on the symbolic products is minimal.

In Table 4.1, we see that the number of colors for the matrix RAR^T is 59, whereas AR^T (column 6) has only 20 colors. This disparity is the result of greater density in C. While both of these are far smaller than 1203, the column size of both R^T and the final sparse matrix product C, the greater sparsity in AR^T demands only a third as many colors and leads to significantly shorter execution time. The last row of Table 4.1 gives the total execution time, that is, the sum of the symbolic and numeric components. It shows that computing C = RAR^T by using the matrix coloring of AR^T takes roughly one-third of the execution time of using no coloring.

Table 4.2 presents the same experiments over larger three-dimensional meshes. The use of Algorithm 5 to compute the product continues to outperform standard sparse inner products and Algorithm 6. Here we also consider the dense-to-sparse decompression in block rows of 100 in addition to conducting the entire decompression in one sweep; this idea was introduced for C = AB^T in Section 3.1, and the results are presented in the time100 columns. There is a marked benefit when using the block decompression for the AR^T coloring option but no benefit for the RAR^T coloring. This can likely be attributed to the number of colors, since more colors indicates more zero-valued nonzeros during the dense compression and therefore extra work is performed.

Table 4.3 helps clarify the computational picture by studying the flop rates for the three methods of computing RAR^T. The floating-point computation required for the “No Coloring” option is significantly less than that for the coloring options because only nonzero matrix entries are involved in the computation; as discussed in Section 1.1, the efficiency (measured by flop rate) suffers as a result because many fewer flops take place per cache miss.

Table 4.3: PFLOTRAN: Flop Rate (megaflops/second). Using coloring greatly increases the flop rate because of the sparse-to-dense compression. Even though the RAR^T coloring has a higher flop rate, its total computational time (from Table 4.2) is higher because many of the flops performed are unnecessary. The AR^T coloring has few enough colors in it to maintain a high flop rate without excessive unneeded computation involving zero-valued nonzeros. The coloring computation uses the 100 block row decompression.

Fine Meshes          No Coloring    Coloring RAR^T    Coloring AR^T
50 × 25 × 10         76             1232              724
100 × 50 × 20        64             936               606
200 × 100 × 40       59             663               496

This issue is alleviated by the coloring because necessary values are stored contiguously in the dense matrix, thereby allowing for more efficient access. That improved efficiency is visible in the two columns of coloring results: both have significantly higher flop rates than the computation without coloring.

Following this logic further, the higher flop rate would suggest that the RAR^T coloring is superior to the AR^T coloring. Although computations are happening more quickly there, we know that the total time required for the computation is also greater. This seeming contradiction is caused by the number of zero-valued nonzeros introduced into the dense matrix. When the sparse matrices are used in the inner product computation, only nonzeros are involved in the computation, but accessing them is a slow proposition. At the other extreme, when the RAR^T coloring is used, too many zeros are involved in the inner products, allowing for more efficient memory access but involving so many zero-valued nonzeros that the total computational time is excessive. The coloring of AR^T has many fewer colors, which reduces the number of superfluous zeros in the dense matrix, and achieves the best overall performance. This is why its flop rate is slightly lower but the total computational time (shown in Table 4.2) is better.

4.2. Three-Dimensional Linear Elasticity PDEs. The set of test cases in this section can be found in the PETSc library (see petsc/src/ksp/ksp/examples/tutorials/ex56.c). They model a three-dimensional bilinear quadrilateral (Q1) displacement finite-element formulation of a linear elasticity PDE defined over a unit cube domain with a Dirichlet boundary condition on the side y = 0 and a load of 1.0 in the x + 2y direction on all nodes. Three physical variables are defined at each grid point (i.e., the number of degrees of freedom equals 3). We use these tests to further demonstrate the performance benefit of using matrix colorings and to illustrate that the achieved acceleration depends on the matrix nonzero structure and the selected matrix coloring.

We apply two matrix coloring algorithms, largest-first ordering (LF) [6] and smallest-last ordering (SL) [9], to the grid operator matrices C = RAR^T in Table 4.4 and to the matrix product C = GG^T (used for connection graphs) in Table 4.5. These tables present the dimensions of the matrices, the execution time spent on computing the numeric matrix products, and the number of colors obtained from the matrix colorings. We present only the results using the coloring of AR^T (Algorithm 5) in Table 4.4.

Table 4.4: Elasticity PDE: Execution Time of Numeric C = RAR^T (seconds). As the matrix size increases, the number of colors increases more quickly with the LF coloring than with the SL coloring. This causes the SL time to scale more effectively, although both colorings outperform the “No Coloring” option.

Matrix Size (Fine Grids)                      No Coloring    Coloring AR^T
A                       R                                    LF      ncolor    SL      ncolor
3,000 × 3,000           156 × 3,000           .022           .0092   48        .0092   48
24,000 × 24,000         1,122 × 24,000        .24            .15     66        .15     60
192,000 × 192,000       8,586 × 192,000       2.23           1.68    84        1.33    66
648,000 × 648,000       27,924 × 648,000      7.86           6.07    90        4.75    69

Table 4.5: Elasticity PDE: Execution Time of Numeric C = GG^T (seconds). As the size of the matrix G grows, the number of colors from the LF coloring remains constant, allowing it to outperform the SL coloring. Again, both coloring choices are faster than the “No Coloring” option.

Matrix Size (Fine Grids)      No Coloring    Coloring
G                                            LF       ncolor    SL      ncolor
1,000 × 1,000                 .012           .0026    125       .0029   136
8,000 × 8,000                 .13            .038     125       .043    149
64,000 × 64,000               1.22           .42      125       .51     158
216,000 × 216,000             4.31           1.49     125       1.81    161

As the mesh size increases for the matrices AR^T, the number of colors produced by the SL algorithm grows more slowly than that of LF: from 48 to 69 vs. 48 to 90. This results in improved performance using SL colorings, as shown in Table 4.4. This stands in contrast to the results for the graph connection matrices GG^T shown in Table 4.5, where the LF algorithm generates a more consistent number of colors as the mesh size grows and outperforms the SL algorithm. These results suggest that no one coloring algorithm is ideal for all circumstances. Note that for all the cases presented, the coloring algorithms outperform the sparse inner product approach of “No Coloring”.

5. Conclusions and Future Work. Earlier research has accelerated the computation of sparse Jacobians by using matrix coloring to evaluate several columns simultaneously. That work has been adapted here to accelerate sparse matrix products computed with sparse inner products by applying a matrix coloring to instead compute a related compressed sparse-dense matrix product. We have proved that for the product C = AB^T, the matrix coloring of C is always a viable choice for compressing B^T, although slight modifications may be necessary if A has any zero columns.

Algorithms for both (2.1a) and (2.1b) were proposed, as well as considerations for practical implementation. Numerical results suggest that multigrid simulations that compute sparse matrix products through inner products can be accelerated significantly through the use of coloring. Those results can be improved further by fine-tuning the data traffic during the decompression of the computed dense matrix to the desired sparse matrix. We also studied the effect of the choice of coloring algorithm on the efficiency of the compression and found that different algorithms perform better in different circumstances.

Future adaptation of this coloring approach to parallel matrices will inherit much of the theory, but the added cost of data traffic may require some retooling or reorganization. Additionally, it would be valuable to study the use of this method on block matrices, a common structure for many applications. Beyond multigrid, our motivating application, we believe that applications from the discrete math community (e.g., breadth-first search) will benefit from our coloring algorithm, and we intend to research this possibility.

The transition to a sparse-dense matrix product opens the door to numerous optimizations that are unavailable in the sparse-sparse setting. For instance, our current software computes four columns of the dense matrix each time the sparse matrix is loaded into memory, thereby reducing the total cost of accessing the sparse matrix. Another improvement we hope to consider is the interlacing of several columns of the dense matrix during the product to allow for more computation per sparse matrix cache miss. Decompressing the resulting product requires new techniques, so more work is required to take advantage of this and other previously unavailable opportunities.

We plan to incorporate the block dense-to-sparse decompression discussed in Section 3.1 into the PETSc library for faster computation of finite-difference Jacobians using coloring. Another interesting advance might be to develop a new coloring algorithm that incorporates the cost of decompression when deciding how to organize colors. A coloring algorithm that accelerates the decompression would benefit both the matrix multiplication setting and the evaluation of finite-difference Jacobians.

Acknowledgments. The authors were supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357. The authors would like to thank Jed Brown, Lois Curfman McInnes, and Charles Van Loan for their help and support, and Glenn Hammond for providing us the PFLOTRAN test case.

REFERENCES

[1] M. F. Adams, H. H. Bayraktar, T. M. Keaveny, and P. Papadopoulos, Ultrascalable implicit finite element analyses in solid mechanics with over a half a billion degrees of freedom, in ACM/IEEE Proceedings of SC2004: High Performance Networking and Computing, 2004. Gordon Bell Award.
[2] S. Balay, J. Brown, K. Buschelman, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, B. F. Smith, and H. Zhang, PETSc users manual, Tech. Report ANL-95/11 - Revision 3.4, Argonne National Laboratory, 2013.
[3] T. F. Coleman and J. J. Moré, Estimation of sparse Jacobian matrices and graph coloring problems, SIAM Journal on Numerical Analysis, 20 (1983), pp. 187–209.
[4] G. H. Golub and C. F. Van Loan, Matrix Computations (4th ed.), Johns Hopkins University Press, Baltimore, MD, 2012.
[5] G. E. Hammond, P. C. Lichtner, and R. T. Mills, Evaluating the performance of parallel subsurface simulators: An illustrative example with PFLOTRAN, Water Resources Research, (2013).
[6] W. Klotz, Graph coloring algorithms, Mathematics Report, (2002), pp. 1–9.
[7] M. Kubale, Graph Colorings, Contemporary Mathematics (American Mathematical Society) v. 352, American Mathematical Society, 2004.
[8] P. Lichtner et al., PFLOTRAN project. http://ees.lanl.gov/pflotran/.
[9] D. W. Matula and L. L. Beck, Smallest-last ordering and clustering and graph coloring algorithms, J. ACM, 30 (1983), pp. 417–427.
[10] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing Company, Boston, MA, 1996.
[11] B. Smith and H. Zhang, Sparse triangular solves for ILU revisited: Data layout crucial to better performance, International J. High Performance Computing Applications, 25 (2011), pp. 386–391.
[12] K. Stüben, A review of algebraic multigrid, J. Comput. Appl. Math., 128 (2001), pp. 281–309.
[13] U. Trottenberg, C. W. Oosterlee, and A. Schüller, Multigrid, Elsevier Science, 2000.

Government License. The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.
