Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication ∗
Aydın Buluç                                John R. Gilbert
Department of Computer Science, University of California, Santa Barbara
[email protected]                        [email protected]

∗ This research was supported in part by the Department of Energy under award number DE-FG02-04ER25632, in part by MIT/LL under contract number 7000012980, and in part by the National Science Foundation through TeraGrid resources provided by TACC.
Abstract

We identify the challenges that are special to parallel sparse matrix-matrix multiplication (PSpGEMM). We show that sparse algorithms are not as scalable as their dense counterparts, because in general there are not enough nontrivial arithmetic operations to hide the communication costs as well as the sparsity overheads. We analyze the scalability of 1D and 2D algorithms for PSpGEMM. While the 1D algorithm is a variant of existing implementations, the 2D algorithms presented here are completely novel. Most of these algorithms are based on previous research on parallel dense matrix multiplication. We also provide results from preliminary experiments with 2D algorithms.

1 Introduction

We consider the problem of multiplying two sparse matrices in parallel (PSpGEMM). Sparse matrix-matrix multiplication is a building block for many algorithms, including graph contraction, breadth-first search from multiple source vertices, peer pressure clustering [21], recursive formulations of all-pairs shortest-path algorithms [8], multigrid interpolation/restriction [2], and parsing context-free languages [19]. A scalable PSpGEMM algorithm, operating on a semiring, is a key component for building scalable parallel versions of those algorithms.

In this paper, we identify the challenges in the design and implementation of a scalable PSpGEMM algorithm. Section 5 explains those challenges in more detail. We also propose novel algorithms based on a 2D block decomposition of data, analyze their computation and communication requirements, compare their scalability, and finally report timings from our initial implementations.

To the best of our knowledge, parallel algorithms using a 2D block decomposition have not previously been developed for sparse matrix-matrix multiplication. Existing 1D algorithms are not scalable to thousands of processors. In contrast, 2D dense matrix multiplication algorithms have been proved optimal with respect to communication volume [15], making 2D sparse algorithms likely to be more scalable than their 1D counterparts. We show that this intuition is indeed correct.

Widely used sequential sparse matrix-matrix multiplication kernels, such as the ones used in MATLAB [13] and CSparse [9], are not suitable for 2D decomposition, as shown before [4]. Therefore, for our 2D algorithms, we use an outer-product-based kernel, which is scalable with respect to the total amount of work performed.

In Section 6, we compare the scalability of 1D and 2D algorithms for different matrix sizes. In those speedup projections, we used our real-world estimates of the constants instead of relying on O-notation. We also investigated the effectiveness of overlapping communication with computation. Finally, we implemented and tested a 2D algorithm for PSpGEMM using matrices from models of real-world graphs.

2 Terminology

The sparse matrix-matrix multiplication problem (SpGEMM) is to compute C = AB, where the input matrices A and B, having dimensions M × K and K × N respectively, are both sparse. Although the algorithms operate on rectangular matrices, our analysis assumes square matrices (M = K = N) in order not to over-complicate the statement of the results.
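For concreteness, here is a minimal sequential reference for the SpGEMM operation using SciPy. This is our own illustrative snippet, not code from the paper; the matrix size N and sparsity c below are arbitrary choices that mirror the analysis model used later.

# Minimal sequential reference for SpGEMM, C = A*B (illustrative only).
import numpy as np
import scipy.sparse as sp

N = 1000          # square matrices, M = K = N, as assumed in the analysis
c = 8             # average nonzeros per row/column ("sparsity parameter")

rng = np.random.default_rng(0)
A = sp.random(N, N, density=c / N, format="csr", random_state=rng)
B = sp.random(N, N, density=c / N, format="csr", random_state=rng)

C = A @ B         # sequential SpGEMM; the parallel algorithms distribute this
print(C.nnz, "nonzeros in the product")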
In PSpGEMM, we perform the multiplication utilizing p processors. The cost of one floating-point operation, along with the cost of the cache misses and memory indirections associated with that operation, is denoted by γ, measured in nanoseconds. α is the latency of sending a message over the communication interconnect, and β is the inverse bandwidth, measured in nanoseconds per word transferred.
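The analysis that follows charges α + βw nanoseconds for a point-to-point message of w words and γf nanoseconds for f flops (including their memory traffic). A small Python sketch of this machine model follows; the constants are placeholder assumptions of ours, not measured values from the paper.

# Sketch of the alpha-beta-gamma cost model used in the analysis.
# The constants below are placeholders, not measurements from the paper.
ALPHA = 5_000.0   # message latency, in nanoseconds (assumed)
BETA  = 4.0       # inverse bandwidth, in nanoseconds per word (assumed)
GAMMA = 10.0      # cost per flop incl. cache misses/indirection, ns (assumed)

def t_msg(words, alpha=ALPHA, beta=BETA):
    """Time to send one point-to-point message of `words` words."""
    return alpha + beta * words

def t_comp(flops, gamma=GAMMA):
    """Time to perform `flops` nonzero arithmetic operations locally."""
    return gamma * flops

# Example: one round of the 1D algorithm sends nnz(B)/p words per message.
print(t_msg(words=10_000), "ns per message;", t_comp(flops=1_000_000), "ns of compute")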
The number of nonzero elements in A is denoted by nnz(A). We adopt the colon notation for sparse matrix indexing, where A(:, i) denotes the ith column, A(i, :) denotes the ith row, and A(i, j) denotes the element at the (i, j)th position of matrix A. Flops (flops(AB), or flops for short) is defined as the number of nonzero arithmetic operations required to compute the product AB.

The sequential work of SpGEMM, unlike that of dense GEMM, depends on many parameters, which makes parallel scalability analysis a tough process. Therefore, we restrict our analysis to sparse matrices with c > 0 nonzeros per row/column, uniformly randomly distributed across the row/column. The sparsity parameter c, albeit oversimplifying, is useful for analysis purposes, since it makes different parameters comparable to each other. For example, if A and B both have sparsity c, then nnz(A) = cN and flops(AB) = c²N. It also allows us to decouple the effects of load imbalance from the algorithm analysis, because the nonzeros are assumed to be evenly distributed across processors. Our analysis is probabilistic, exploiting the uniform random distribution of nonzeros. Therefore, when we talk about quantities such as nonzeros per subcolumn, we mean the expected number of nonzeros.

The lower bound on sequential SpGEMM is Ω(flops) = Ω(c²N). This bound is achieved by some row-wise and column-wise implementations [13, 14], provided that c ≥ 1. The row-wise implementation of Gustavson is the natural kernel to use in the 1D algorithm, where data is distributed by rows. It has an asymptotic complexity of

O(N + nnz(A) + flops) = O(N + cN + c²N) = Θ(c²N).

Therefore, we take the sequential work to be W = γc²N throughout the paper.

Our sequential SpGEMM kernel used in the 2D algorithms requires matrix B to be transposed before the computation, at a cost of trans(B), which can be O(N + nnz(B)) or O(nnz(B) lg nnz(B)), depending on the implementation. The kernel has a total complexity of

O(nzc(A) + nzr(B) + flops · lg ni + trans(B)),

where nzc(A) is the number of columns of A that contain at least one nonzero, nzr(B) is the number of rows of B that contain at least one nonzero, and ni is the number of indices i for which A(:, i) ≠ ∅ and B(i, :) ≠ ∅.

The data structure used by the sequential SpGEMM kernel is DCSC (doubly compressed sparse column). DCSC has the following properties:

1. Its storage is strictly O(nnz).
2. It supports a scalable sequential SpGEMM algorithm.
3. It allows fast access to columns of the matrix.

DCSC, therefore, is our data structure of choice for storing the submatrices owned locally by each processor in the 2D algorithms. Figure 1 shows the DCSC representation of a 9-by-9 matrix with 4 nonzeros that has the following triples representation:

A = {(5, 0, 0.1), (7, 0, 0.2), (3, 6, 0.3), (1, 7, 0.4)}

Figure 1. Matrix A in DCSC format: AUX = [0 1 3 3], JC = [0 6 7], CP = [0 2 3 4], IR = [5 7 3 1], NUM = [0.1 0.2 0.3 0.4].

More details on DCSC and the scalable sequential SpGEMM kernel can be found in our previous work [4].
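To make the layout concrete, the following Python sketch (our own illustration, not the authors' implementation) rebuilds the JC, CP, IR, and NUM arrays of Figure 1 from the triples above; the AUX array, which accelerates column lookups, is omitted.

# Build the DCSC arrays (JC, CP, IR, NUM) of Figure 1 from a triples list.
# Illustrative sketch only; the AUX fast-lookup array is omitted.
from collections import defaultdict

triples = [(5, 0, 0.1), (7, 0, 0.2), (3, 6, 0.3), (1, 7, 0.4)]  # (row, col, val)

by_col = defaultdict(list)
for r, col, v in triples:
    by_col[col].append((r, v))

JC, CP, IR, NUM = [], [0], [], []
for col in sorted(by_col):             # only columns that contain a nonzero
    JC.append(col)
    for r, v in by_col[col]:
        IR.append(r)
        NUM.append(v)
    CP.append(len(IR))                 # prefix sums over nonzero columns only

print(JC, CP, IR, NUM)   # [0, 6, 7] [0, 2, 3, 4] [5, 7, 3, 1] [0.1, 0.2, 0.3, 0.4]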
3 Parallel 1D Algorithm

In 1D algorithms, the data can be distributed by rows or by columns. We consider distribution by rows; distribution by columns is symmetric. For the analysis, we assume that pairs of adjacent processors can communicate simultaneously without affecting each other. The algorithm used in practice in STAR-P [20] is a variant of this row-wise algorithm in which the simultaneous point-to-point communications are replaced with all-to-all broadcasts.

The data is distributed to processors in block rows, where each processor receives M/p consecutive rows. For a matrix A, A_i denotes the block row owned by the ith processor. To simplify the algorithm description, we use A_ij to denote A_i(:, jN/p : (j+1)N/p − 1), the jth block column of A_i, although block rows are physically not partitioned:

A = [A_1; … ; A_p] = [A_11 … A_1p; … ; A_p1 … A_pp],    B = [B_1; … ; B_p].

For each processor P_i, the computation is set up in the following way:

C_i = C_i + A_i B = C_i + Σ_{j=1}^{p} A_ij B_j

Each processor sends and receives p − 1 point-to-point messages of size nnz(B)/p. Therefore,

T_comm = (p − 1)(α + β·nnz(B)/p) = Θ(pα + βcN).   (1)

The row-wise SpGEMM forms one row of C at a time, and it may need to access all of B to do so. However, only a portion of B is accessible at any given time in a parallel algorithm, so multiple iterations are required to fully form one row of C. The nonzero structure of the current active row of C is formed by using an O(N) data structure called the sparse accumulator (SPA) [13]. It is composed of three components: a dense vector that holds the real values for the active row of C, a dense boolean vector that holds the "occupied" flags, and a sparse list that holds the indices of the nonzero elements in the active row of C so far. Initialization of the SPA takes O(N) time, but loading and unloading it can be done in time proportional to the number of nonzeros in the SPA.

for all processors P_i in parallel do
    SPA ← Initialize;
    for j ← 1 to p do
        Broadcast(B_j);
        for k ← 1 to N/p do
            SPA ← Load(C_i(k, :));
            SPA ← SPA + A_ij(k, :) ∗ B_j;
            C_i(k, :) ← UnLoad(SPA);
        end
    end
end
Algorithm 1: C ← A ∗ B using row-wise Sparse1D

The pseudocode is given in Algorithm 1. The computational time is dominated by the loads and unloads of the SPA, which are not amortized by the number of nonzero arithmetic operations in general. The simple running example in Figure 2 shows an extreme case. In this figure, N = 6, p = 3, and each x denotes a nonzero. It is easy to see that flops(AB) = N² due to the first outer product. After the very first execution of the inner loop in Algorithm 1, the SPA becomes completely dense for every row C_i(k, :). Those dense SPAs are going to be loaded/unloaded p − 1 more times during the subsequent iterations of the outer loop, even though no additional flops are needed.

Figure 2. Worst case for the 1D algorithm: A is 6 × 6 with a single dense column (its first) and no other nonzeros; B is 6 × 6 with a single dense row (its first) and no other nonzeros.

Since there are N/p rows per processor, and each row leads to p loads/unloads of the corresponding SPA, we have N SPA loads/unloads per processor. Each of those loads takes O(nnz(SPA)) = O(N) time, making up a cost of O(N²) per processor, while the flops per processor were only O(N²/p); this is clearly unscalable with an increasing number of processors.

Analyzing the general case is hard without making some assumptions, even for random matrices. Here, we assume that each nonzero in the output is contributed by a small (and constant) number of flops. In practice, this is true for many sparse matrices, and flops/nnz(C) ≈ 1 for reasonably large N. The cost of SPA loads/unloads per row is then

Σ_{i=1}^{p} (flops(AB)/(Np))·i = (c²N/(Np))·(p(p+1)/2) = Θ(c²p).

Apart from SPA loads/unloads, each processor iterates p times over its rows. As each processor has N/p rows, the cost of computation per processor becomes

T_comp = γ (N/p)(p + c²p) = γ N(1 + c²).   (2)

The overall efficiency of the algorithm is

E = W / (p (T_comp + T_comm)) = γc²N / (p (γN(1 + c²) + pα + βcN)).   (3)

We see that the communication-to-computation ratio of 1D PSpGEMM is higher than that of its dense counterpart. In the dense case, the O(pN²) communication was dominated by the O(N³) computation in a scaled speedup setting. To keep the efficiency constant in PSpGEMM, the work should increase proportionally to the cost of the parallel overheads.

We separately consider the factors that limit scalability. Scalability with respect to latency is achieved when γc²N ∝ αp², which requires N to grow on the order of p². Scalability with respect to bandwidth costs is achieved when

γc²N ∝ βcpN, i.e., (γ/β) c ∝ p,

which, since γ/β = Θ(1), requires the number of nonzeros per row c to grow proportionally to p.
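To see these limits numerically, the sketch below evaluates the efficiency of Eq. (3) as p grows with N and c held fixed. The machine constants are illustrative assumptions of ours, not measurements from the paper.

# Evaluate Eq. (3), E = gamma*c^2*N / (p*(gamma*N*(1+c^2) + p*alpha + beta*c*N)).
# Constants are assumptions for illustration only.
ALPHA, BETA, GAMMA = 5_000.0, 4.0, 10.0   # ns, ns/word, ns/flop (assumed)

def efficiency_1d(p, n, c, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    work = gamma * c * c * n                       # W = gamma * c^2 * N
    t_comp = gamma * n * (1 + c * c)               # Eq. (2)
    t_comm = p * alpha + beta * c * n              # Eq. (1), Theta-form
    return work / (p * (t_comp + t_comm))

for p in (1, 4, 16, 64, 256):
    print(p, round(efficiency_1d(p, n=1_000_000, c=8), 4))
# Efficiency decays roughly like 1/p: the 1D algorithm cannot stay efficient.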
Having shown that the 1D algorithm is not scalable in terms of the amount of communication it performs, we now analyze its scalability with respect to computation costs. Keeping the efficiency constant would require

γc²N ∝ p γN(c² + 1)
c²/(c² + 1) = Θ(1) ∝ p.

We see that there is no way to keep parallel efficiency constant with the 1D algorithm, even when we ignore communication costs. In fact, the algorithm gets slower as the number of processors increases. The STAR-P implementation bypasses this problem by all-to-all broadcasting the nonzeros of the B matrix, so that the whole B matrix is essentially assembled at each processor.

This type of implementation avoids the cost of loading/unloading the SPA at every stage, but it introduces a memory problem. The aggregate memory usage among all processors becomes nnz(A) + p·nnz(B) + nnz(C), which is unscalable. Furthermore, the whole B matrix is unlikely to fit into the memory of a single node in real applications. There is clearly a need to design better 1D algorithms that avoid SPA loading/unloading and at the same time do not impose memory problems.

4 Parallel 2D Algorithms

4.1 Sparse Cannon (SpCannon)

for all processors P_ij in parallel do
    LeftCircularShift(A_ij, j);
    UpCircularShift(B_ij, i);
end
for all processors P_ij in parallel do
    for k ← 1 to √p do
        C_ij ← C_ij + A_ij ∗ B_ij;
        LeftCircularShift(A_ij, 1);
        UpCircularShift(B_ij, 1);
    end
end
for all processors P_ij in parallel do
    LeftCircularShift(A_ij, j);
    UpCircularShift(B_ij, i);
end
Algorithm 2: C ← A ∗ B using SpCannon

The expected number of nonzeros in a given column of a local submatrix A_ij is c/√p. Therefore, for a submatrix multiplication A_ik B_kj,

ni(A_ik, B_kj) = min{1, c²/p} · N/√p = min{N/√p, c²N/(p√p)},

flops(A_ik B_kj) = flops(AB)/(p√p) = c²N/(p√p).
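To illustrate the data movement in SpCannon, the following Python sketch simulates Algorithm 2 sequentially on a √p × √p grid of scipy.sparse blocks. It is our own sketch, not the authors' implementation; it uses the textbook Cannon alignment (block row i of A shifted left by i positions, block column j of B shifted up by j), and the grid size and density below are arbitrary choices.

# Sequential simulation of sparse Cannon on a q x q grid of sparse blocks,
# where q = sqrt(p). Illustrative sketch only.
import numpy as np
import scipy.sparse as sp

q, nb = 4, 50                       # grid dimension q = sqrt(p), block size nb
rng = np.random.default_rng(1)
A = sp.random(q * nb, q * nb, density=0.02, format="csr", random_state=rng)
B = sp.random(q * nb, q * nb, density=0.02, format="csr", random_state=rng)

def block(M, i, j):                 # extract the (i, j) block of size nb x nb
    return M[i * nb:(i + 1) * nb, j * nb:(j + 1) * nb]

# Local blocks after the initial skew: processor (i,j) holds A(i, i+j), B(i+j, j).
Ablk = [[block(A, i, (i + j) % q) for j in range(q)] for i in range(q)]
Bblk = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
Cblk = [[sp.csr_matrix((nb, nb)) for _ in range(q)] for _ in range(q)]

for _ in range(q):                  # sqrt(p) stages of multiply-and-shift
    for i in range(q):
        for j in range(q):
            Cblk[i][j] = Cblk[i][j] + Ablk[i][j] @ Bblk[i][j]
    # Left-circular-shift the A blocks and up-circular-shift the B blocks by one.
    Ablk = [[Ablk[i][(j + 1) % q] for j in range(q)] for i in range(q)]
    Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]

C = sp.bmat(Cblk, format="csr")
print("max error:", abs(C - A @ B).max())   # should be ~0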