
37th International Conference on Parallel Processing

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication

Aydın Buluç, Department of Computer Science, University of California, Santa Barbara, [email protected]
John R. Gilbert, Department of Computer Science, University of California, Santa Barbara, [email protected]

Abstract

We identify the challenges that are special to parallel sparse matrix-matrix multiplication (PSpGEMM). We show that sparse algorithms are not as scalable as their dense counterparts, because in general, there are not enough non-trivial arithmetic operations to hide the communication costs as well as the sparsity overheads. We analyze the scalability of 1D and 2D algorithms for PSpGEMM. While the 1D algorithm is a variant of existing implementations, the 2D algorithms presented are completely novel. Most of these algorithms are based on previous research on parallel dense matrix multiplication. We also provide results from preliminary experiments with 2D algorithms.

∗ This research was supported in part by the Department of Energy under award number DE-FG02-04ER25632, in part by MIT/LL under contract number 7000012980, and in part by the National Science Foundation through TeraGrid resources provided by TACC.

1 Introduction

We consider the problem of multiplying two sparse matrices in parallel (PSpGEMM). Sparse matrix-matrix multiplication is a building block for many algorithms including graph contraction, breadth-first search from multiple source vertices, peer pressure clustering [21], recursive formulations of all-pairs-shortest-paths algorithms [8], multigrid interpolation/restriction [2], and parsing context-free languages [19]. A scalable PSpGEMM algorithm, operating on a semiring, is a key component for building scalable parallel versions of those algorithms.

In this paper, we identify the challenges in the design and implementation of a scalable PSpGEMM algorithm. Section 5 explains those challenges in more detail. We also propose novel algorithms based on 2D block decomposition of data, analyze their computation and communication requirements, compare their scalability, and finally report timings from our initial implementations.

To the best of our knowledge, parallel algorithms using a 2D block decomposition have not been developed for sparse matrix-matrix multiplication. Existing 1D algorithms are not scalable to thousands of processors. On the contrary, 2D dense matrix multiplication algorithms are proved to be optimal with respect to the communication volume [15], making 2D sparse algorithms likely to be more scalable than their 1D counterparts. We show that this intuition is indeed correct.

Widely used sequential sparse matrix-matrix multiplication kernels, such as the one used in MATLAB [13] and CSparse [9], are not suitable for 2D decomposition, as shown before [4]. Therefore, for our 2D algorithms, we use an outer-product based kernel, which is scalable with respect to the total amount of work performed.

In Section 6, we compare the scalability of the 1D and 2D algorithms for different matrix sizes. In those speed-up projections, we used our real-world estimates of the constants instead of relying on the O-notation. We also investigated the effectiveness of overlapping communication with computation. Finally, we implemented and tested a 2D algorithm for PSpGEMM using matrices from models of real world graphs.

2 Terminology

The sparse matrix-matrix multiplication problem (SpGEMM) is to compute C = AB, where the input matrices A and B, having dimensions M × K and K × N respectively, are both sparse. Although the algorithms operate on rectangular matrices, our analysis assumes square matrices (M = K = N) in order not to over-complicate the statement of the results.

In PSpGEMM, we perform the multiplication utilizing p processors. The cost of one floating-point operation, along with the cost of cache misses and memory indirections associated with the operation, is denoted by γ, measured in nanoseconds. α is the latency of sending a message over the communication interconnect. β is the inverse bandwidth, measured in nanoseconds per word transferred.
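Spelled out, these three constants define the usual latency–bandwidth cost model that the rest of the analysis instantiates; the two formulas below are only a restatement of the definitions above, not an additional assumption.

```latex
% Cost of sending a single message of w words:
T_{\mathrm{msg}}(w) = \alpha + \beta w
% Cost of performing f nonzero arithmetic operations,
% including their cache-miss and indirection overheads:
T_{\mathrm{comp}}(f) = \gamma f
```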

The number of nonzero elements in A is denoted by nnz(A). We adopt the colon notation for sparse matrix indexing, where A(:,i) denotes the ith column, A(i,:) denotes the ith row, and A(i,j) denotes the element at the (i,j)th position of matrix A. Flops (flops(AB), or flops in short) is defined as the number of nonzero arithmetic operations required to compute the product AB.

The sequential work of SpGEMM, unlike dense GEMM, depends on many parameters. This makes parallel scalability analysis a tough process. Therefore, we restrict our analysis to sparse matrices with c > 0 nonzeros per row/column, uniformly-randomly distributed across the row/column. The sparsity parameter c, albeit oversimplifying, is useful for analysis purposes, since it makes different parameters comparable to each other. For example, if A and B both have sparsity c, then nnz(A) = cN and flops(AB) = c²N. It also allows us to decouple the effects of load imbalance from the algorithm analysis, because the nonzeros are assumed to be evenly distributed across processors. Our analysis is probabilistic, exploiting the uniform random distribution of nonzeros. Therefore, when we talk about quantities such as nonzeros per subcolumn, we mean the expected number of nonzeros.

The lower bound on sequential SpGEMM is Ω(flops) = Ω(c²N). This bound is achieved by some row-wise and column-wise implementations [13, 14], provided that c ≥ 1. The row-wise implementation of Gustavson is the natural kernel to be used in the 1D algorithm, where data is distributed by rows. It has an asymptotic complexity of

    O(N + nnz(A) + flops) = O(N + cN + c²N) = Θ(c²N).

Therefore, we take the sequential work (W) to be γc²N throughout the paper.

Our sequential SpGEMM kernel used in the 2D algorithms requires matrix B to be transposed before the computation, at a cost of trans(B), which can be O(N + nnz(B)) or O(nnz(B)·lg nnz(B)), depending on the implementation. The kernel has a total complexity of

    O( nzc(A) + nzr(B) + flops·lg ni + trans(B) ),

where nzc(A) is the number of columns of A that contain at least one nonzero, nzr(B) is the number of rows of B that contain at least one nonzero, and ni is the number of indices i for which A(:,i) ≠ ∅ and B(i,:) ≠ ∅.

The data structure used by our sequential SpGEMM is DCSC (doubly compressed sparse column). DCSC has the following properties:

1. Its storage is strictly O(nnz).
2. It supports a scalable sequential SpGEMM algorithm.
3. It allows fast access to the columns of the matrix.

DCSC, therefore, is our data structure of choice for storing the submatrices owned locally by each processor in the 2D algorithms. Figure 1 shows the DCSC representation of a 9-by-9 matrix with 4 nonzeros that has the following triples representation:

    A = { (5, 0, 0.1), (7, 0, 0.2), (3, 6, 0.3), (1, 7, 0.4) }

[Figure 1. Matrix A in DCSC format: AUX = {0, 1, 3, 3}, JC = {0, 6, 7}, CP = {0, 2, 3, 4}, IR = {5, 7, 3, 1}, NUM = {0.1, 0.2, 0.3, 0.4}]

More details on DCSC and the scalable sequential SpGEMM can be found in our previous work [4].
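For readers unfamiliar with the format, a minimal C++ sketch of the five DCSC arrays from Figure 1 follows. The struct and function names here are placeholders for illustration only; the array contents are exactly those of the figure, and the precise AUX chunking rule is the one described in [4].

```cpp
#include <cstdint>
#include <vector>

// A minimal container mirroring the five DCSC arrays shown in Figure 1.
// Illustrative sketch only; the real implementation in [4] also stores
// dimensions, nnz, and the routines used by the sequential SpGEMM kernel.
struct DCSC {
    std::vector<int64_t> AUX;  // chunk pointers into JC for fast column lookup
    std::vector<int64_t> JC;   // indices of columns that contain at least one nonzero
    std::vector<int64_t> CP;   // CP[k]..CP[k+1] delimit the nonzeros of column JC[k]
    std::vector<int64_t> IR;   // row indices of the nonzeros
    std::vector<double>  NUM;  // numerical values of the nonzeros
};

// The 9-by-9 matrix of Figure 1 with triples {(5,0,0.1),(7,0,0.2),(3,6,0.3),(1,7,0.4)}.
DCSC figure1_example() {
    DCSC A;
    A.AUX = {0, 1, 3, 3};          // as given in Figure 1
    A.JC  = {0, 6, 7};             // nonempty columns: 0, 6, and 7
    A.CP  = {0, 2, 3, 4};          // column 0 has two nonzeros; columns 6 and 7 have one each
    A.IR  = {5, 7, 3, 1};          // row indices, grouped by column
    A.NUM = {0.1, 0.2, 0.3, 0.4};  // values, in the same order as IR
    return A;
}
```

Note that every array is proportional to either the number of nonzeros or the number of nonempty columns, which is why property 1 above (strictly O(nnz) storage) holds even for hypersparse submatrices.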

3 Parallel 1D Algorithm

In 1D algorithms, data can be distributed by rows or by columns. We consider distribution by rows; distribution by columns is symmetric. For the analysis, we assume that pairs of adjacent processors can communicate simultaneously without affecting each other. The algorithm used in practice in Star-P [20] is a variant of this row-wise algorithm where simultaneous point-to-point communications are replaced with all-to-all broadcasts.

The data is distributed to processors in block rows, where each processor receives M/p consecutive rows. For a matrix A, A_i denotes the block row owned by the ith processor. To simplify the algorithm description, we use A_ij to denote A_i(:, jp:(j+1)p − 1), the jth block column of A_i, although block rows are physically not partitioned:

    A = [ A_1 ; … ; A_p ] = [ A_11 … A_1p ; … ; A_p1 … A_pp ],    B = [ B_1 ; … ; B_p ]

For each processor P_i, the computation is set up in the following way:

    C_i = C_i + A_i·B = C_i + Σ_{j=1}^{p} A_ij·B_j

Each processor sends and receives p − 1 point-to-point messages of size nnz(B)/p. Therefore,

    T_comm = (p − 1)·( α + β·nnz(B)/p ) = Θ(pα + βcN).   (1)

The row-wise SpGEMM forms one row of C at a time, and it might need to access all of B to do so. However, only a portion of B is accessible at any time in parallel algorithms. This requires multiple iterations to fully form one row of C. The nonzero structure of the current active row of C is formed by using an O(N) data structure called the sparse accumulator (SPA) [13]. It is composed of three components: a dense vector that holds the real values for the active row of C, a dense boolean vector that holds the "occupied" flags, and a sparse list that holds the indices of the nonzero elements in the active row of C so far. Initialization of the SPA takes O(N) time, but loading and unloading can be done in time proportional to the number of nonzeros in the SPA. The pseudocode is given in Algorithm 1.

     1  for all processors P_i in parallel do
     2      SPA ← Initialize()
     3      for j ← 1 to p do
     4          Broadcast(B_j)
     5          for k ← 1 to N/p do
     6              SPA ← Load(C_i(k,:))
     7              SPA ← SPA + A_ij(k,:) ∗ B_j
     8              C_i(k,:) ← UnLoad(SPA)
     9          end
    10      end
    11  end

    Algorithm 1: C ← A ∗ B using row-wise Sparse1D
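A minimal sketch of a sparse accumulator exposing the Load/UnLoad interface of Algorithm 1 is shown below. The type names are placeholders for illustration, and error handling is omitted.

```cpp
#include <utility>
#include <vector>

// One row of a sparse matrix as (column index, value) pairs.
using SparseRow = std::vector<std::pair<int, double>>;

// Sparse accumulator (SPA): dense values, dense "occupied" flags,
// and a sparse list of the column indices touched so far.
struct Spa {
    std::vector<double> value;     // dense vector of values for the active row
    std::vector<bool>   occupied;  // dense boolean flags
    std::vector<int>    indices;   // columns touched so far (unsorted)

    explicit Spa(int n) : value(n, 0.0), occupied(n, false) {}  // O(N) initialization

    // Scatter one row into the SPA; O(nnz of that row).
    void load(const SparseRow& row) {
        for (const auto& [j, v] : row) accumulate(j, v);
    }

    // Add one contribution, recording the index the first time it appears.
    void accumulate(int j, double v) {
        if (!occupied[j]) { occupied[j] = true; indices.push_back(j); }
        value[j] += v;
    }

    // Gather the accumulated row and reset only the touched positions;
    // O(nnz of the accumulated row), not O(N).
    SparseRow unload() {
        SparseRow row;
        row.reserve(indices.size());
        for (int j : indices) {
            row.emplace_back(j, value[j]);
            value[j] = 0.0;
            occupied[j] = false;
        }
        indices.clear();
        return row;
    }
};
```

Line 7 of Algorithm 1 corresponds to one accumulate() call per nonzero arithmetic operation of A_ij(k,:) ∗ B_j, which is why the load/unload traffic, rather than the flops, dominates whenever the accumulated row is much denser than the per-stage contribution.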
The computational time of Algorithm 1 is dominated by loads and unloads of the SPA, which are not amortized by the number of nonzero arithmetic operations in general. The simple running example in Figure 2 shows an extreme case. In this figure, N = 6, p = 3, and each x denotes a nonzero. It is easy to see that flops(AB) = N² due to the first outer product. After the very first execution of the inner loop in Algorithm 1, the SPA becomes completely dense for every row C_i(k,:). Those dense SPAs are going to be loaded/unloaded p − 1 more times during the subsequent iterations of the outer loop, even though no additional flops are needed.

[Figure 2. Worst case for the 1D algorithm (N = 6, p = 3): A has all of its nonzeros in its first column and B has all of its nonzeros in its first row; every other entry is zero.]

Since there are N/p rows per processor, and each row leads to p loads/unloads of the corresponding SPA, we have N SPA loads/unloads per processor. Each of those loads takes O(nnz(SPA)) = O(N) time, making up a cost of O(N²) per processor, while the flops per processor were only O(N²/p). This is clearly unscalable with an increasing number of processors.

Analyzing the general case is hard without making some assumptions, even for random matrices. Here, we assume that each nonzero in the output is contributed by a small (and constant) number of flops. In practice, this is true for many sparse matrices, and flops/nnz(C) ≈ 1 for reasonably large N. The cost of SPA loads/unloads per row is then

    Σ_{i=1}^{p} ( i·flops(AB)/(Np) + 1 ) = (c²N/(Np))·(p(p + 1)/2) + p = Θ(c²p).

Apart from SPA loads/unloads, each processor iterates p times over its rows. As each processor has N/p rows, the cost of computation per processor becomes

    T_comp = γ·(N/p)·(p + c²p) = γ·N(1 + c²).   (2)

The overall efficiency of the algorithm is

    E = W / ( p·(T_comp + T_comm) ) = γc²N / ( p·( γN(1 + c²) + pα + βcN ) ).   (3)

We see that the communication-to-computation ratio of 1D PSpGEMM is higher than that of its dense counterpart. In the dense case, the O(pN²) communication is dominated by the O(N³) computation in a scaled speedup setting. To keep the efficiency constant in PSpGEMM, work should increase proportionally to the cost of the parallel overheads.

We separately consider the factors that limit scalability. Scalability due to latency is achieved when γc²N ∝ αp², which requires N to grow on the order of p². Scalability due to bandwidth costs is achieved when

    γc²N ∝ βcpN,  i.e.,  (γ/β)·c ∝ p  with γ/β = Θ(1),

which would require the constant c to grow on the order of p; increasing the problem size N does not help here.

Having shown that the 1D algorithm is not scalable in terms of the amount of communication it does, we now analyze its scalability due to computation costs:

    γc²N ∝ γpN(c² + 1),  i.e.,  c²/(c² + 1) ∝ p, although c²/(c² + 1) = Θ(1).

We see that there is no way to keep parallel efficiency constant with the 1D algorithm, even when we ignore communication costs. In fact, the algorithm gets slower as the number of processors increases. The Star-P implementation bypasses this problem by all-to-all broadcasting the nonzeros of the B matrix, so that the whole B matrix is essentially assembled at each processor. This type of implementation avoids the cost of loading/unloading the SPA at every stage, but it introduces a memory problem. Aggregate memory usage among all processors becomes nnz(A) + p·nnz(B) + nnz(C), which is unscalable. Furthermore, the whole B matrix is unlikely to fit into the memory of a single node in real applications. There is clearly a need to design better 1D algorithms that avoid SPA loading/unloading and at the same time do not impose memory problems.

4 Parallel 2D Algorithms

4.1 Sparse Cannon (SpCannon)

Our first 2D algorithm is based on Cannon's algorithm for dense matrices [5]. The processors are logically organized on a (√p × √p) mesh, indexed by their row and column indices so that the (i,j)th processor is denoted by P_ij. Matrices are assigned to processors according to a 2D block decomposition. Each node gets a submatrix of dimensions (N/√p) × (N/√p) in its local memory. For example, A is partitioned as shown below and A_ij is assigned to processor P_ij:

    A = [ A_11 … A_1,√p ; … ; A_√p,1 … A_√p,√p ]

The pseudocode of the algorithm is given in Algorithm 2.

     1  for all processors P_ij in parallel do
     2      LeftCircularShift(A_ij, j)
     3      UpCircularShift(B_ij, i)
     4  end
     5  for all processors P_ij in parallel do
     6      for k ← 1 to √p do
     7          C_ij ← C_ij + A_ij ∗ B_ij
     8          LeftCircularShift(A_ij, 1)
     9          UpCircularShift(B_ij, 1)
    10      end
    11  end
    12  for all processors P_ij in parallel do
    13      LeftCircularShift(A_ij, j)
    14      UpCircularShift(B_ij, i)
    15  end

    Algorithm 2: C ← A ∗ B using SpCannon
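For concreteness, the shift-and-multiply structure of Algorithm 2 might be realized in MPI roughly as follows. This is only a hedged sketch rather than our implementation: the local blocks are shown as dense buffers so the example stays self-contained, local_multiply_add() stands in for the sequential DCSC kernel of [4] followed by a sparse addition, the skew amounts follow the textbook formulation of Cannon's algorithm, and p is assumed to be a perfect square.

```cpp
#include <mpi.h>
#include <cmath>
#include <vector>

// Placeholder for C_ij += A_ij * B_ij on local blocks of dimension n.
static void local_multiply_add(const std::vector<double>& A, const std::vector<double>& B,
                               std::vector<double>& C, int n) {
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// Move the local block `disp` positions along grid dimension `dim`
// (positive disp = toward increasing coordinate; the grid is periodic).
static void circular_shift(std::vector<double>& blk, MPI_Comm grid, int dim, int disp) {
    if (disp == 0) return;
    int src, dst;
    MPI_Cart_shift(grid, dim, disp, &src, &dst);
    MPI_Sendrecv_replace(blk.data(), static_cast<int>(blk.size()), MPI_DOUBLE,
                         dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int q = static_cast<int>(std::round(std::sqrt(static_cast<double>(p))));

    int dims[2] = {q, q}, periods[2] = {1, 1};
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
    int coords[2];
    MPI_Cart_coords(grid, rank, 2, coords);
    const int myrow = coords[0], mycol = coords[1];

    const int nlocal = 4;  // toy block dimension, i.e., N / sqrt(p)
    std::vector<double> A(nlocal * nlocal, 1.0), B(nlocal * nlocal, 1.0), C(nlocal * nlocal, 0.0);

    // Initial skew: row i of A moves left by i, column j of B moves up by j.
    circular_shift(A, grid, 1, -myrow);
    circular_shift(B, grid, 0, -mycol);

    for (int k = 0; k < q; ++k) {          // sqrt(p) multiply-and-shift stages
        local_multiply_add(A, B, C, nlocal);
        circular_shift(A, grid, 1, -1);    // shift A left by one
        circular_shift(B, grid, 0, -1);    // shift B up by one
    }

    // Undo the skew so each block returns to its original owner.
    circular_shift(A, grid, 1, myrow);
    circular_shift(B, grid, 0, mycol);

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}
```

MPI_Sendrecv_replace on a periodic Cartesian communicator provides exactly the simultaneous nearest-neighbor exchanges that the analysis below assumes.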

Each processor sends and receives √p − 1 point-to-point messages of size nnz(A)/p, and √p − 1 messages of size nnz(B)/p. Therefore, the communication cost per processor is

    T_comm = 2√p·α + √p·β·(nnz(A) + nnz(B))/p = 2α√p + 2βcN/√p = Θ( α√p + βcN/√p ).   (4)

The expected number of nonzeros in a given column of a local submatrix A_ij is c/√p. Therefore, for a submatrix multiplication A_ik·B_kj,

    ni(A_ik, B_kj) = min{1, c²/p}·(N/√p) = min{ N/√p, c²N/(p√p) },
    flops(A_ik·B_kj) = flops(AB)/(p√p) = c²N/(p√p),

and the cost of the √p local multiplications per processor is

    T_mult = √p·( 2·min{1, c/√p} + (c²N/(p√p))·lg min{ N/√p, c²N/(p√p) } ).

The overall cost of additions, using p processors and Brown and Tarjan's O(m·lg(n/m)) merging algorithm [3], is

    T_add = Σ_{i=1}^{√p} (flops/(p√p))·lg i = (flops/(p√p))·Σ_{i=1}^{√p} lg i = (flops/(p√p))·lg(√p !).

Note that we might be slightly overestimating, since we assume flops/nnz(C) ≈ 1 as we did in the analysis of the 1D algorithm. From Stirling's approximation and asymptotic analysis, we know that lg(n!) = Θ(n·lg n) [7]. Thus, we get

    T_add = Θ( (flops/(p√p))·√p·lg √p ) = Θ( (c²N·lg √p)/p ).
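The submatrix additions on line 7 of Algorithm 2 are where this merging cost arises. The sketch below shows the simple two-pointer scan over (column, row)-sorted triples that a straightforward implementation (such as the prototype discussed in Section 6) would use; the Brown–Tarjan merge assumed in the bound above is asymptotically better when one operand is much smaller than the other, but is considerably more involved.

```cpp
#include <tuple>
#include <vector>

// A sparse block as triples (column, row, value), sorted by (column, row).
using Triple  = std::tuple<int, int, double>;
using Triples = std::vector<Triple>;

// C = A + B by a linear two-pointer scan: O(nnz(A) + nnz(B)) work,
// regardless of how unbalanced the two operands are.
Triples add_by_scanning(const Triples& A, const Triples& B) {
    Triples C;
    C.reserve(A.size() + B.size());
    std::size_t i = 0, j = 0;
    while (i < A.size() && j < B.size()) {
        auto keyA = std::make_pair(std::get<0>(A[i]), std::get<1>(A[i]));
        auto keyB = std::make_pair(std::get<0>(B[j]), std::get<1>(B[j]));
        if (keyA < keyB)      C.push_back(A[i++]);
        else if (keyB < keyA) C.push_back(B[j++]);
        else {                                   // same (column, row): add the values
            C.emplace_back(keyA.first, keyA.second,
                           std::get<2>(A[i]) + std::get<2>(B[j]));
            ++i; ++j;
        }
    }
    while (i < A.size()) C.push_back(A[i++]);
    while (j < B.size()) C.push_back(B[j++]);
    return C;
}
```

Scanning costs time proportional to the size of the accumulated block at each of the √p stages, whereas T_add above assumes the O(m·lg(n/m)) merge; Section 6 lists exactly this scanning-based addition as one reason our prototype stops scaling.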

There are two cases to analyze: p > c² and p ≤ c². Since scalability analysis is concerned with the asymptotic behavior as p increases, we just provide results for the p > c² case. The total computation cost is T_comp = T_mult + T_add:

    T_comp = γ·( c + (c²N/p)·lg(c²N/(p√p)) + (c²N/p)·lg √p ) = γ·( c + (c²N/p)·lg(c²N/p) ).   (5)

In this case, the efficiency becomes

    E = W / ( p·(T_comp + T_comm) )
      = γc²N / ( p·( γ(c + (c²N/p)·lg(c²N/p)) + α√p + βcN/√p ) )
      = γc²N / ( γcp + γc²N·lg(c²N/p) + αp√p + βc√p·N ).   (6)

Scalability due to computation is much better than in the 1D case. As long as N grows linearly with p, efficiency due to computation is constant at 1/lg(c²N/p). Scalability due to latency is achieved when γc²N ∝ αp√p, which requires N to grow on the order of p^1.5. Bandwidth cost is now the sole bottleneck to scalability. For keeping the efficiency constant, we need

    γc²N ∝ βc√p·N,  i.e.,  (γ/β)·c ∝ √p  with γ/β = Θ(1).

Although still unscalable, the degree of unscalability is lower than in the 1D case. Consequently, 2D algorithms both lower the degree of unscalability due to bandwidth costs and mitigate the bottleneck of computation. This makes overlapping communication with computation more promising.

4.2 Sparse SUMMA (SpSUMMA)

SUMMA [11] is a memory efficient, easy to generalize algorithm for parallel dense matrix multiplication. It is the algorithm used in parallel BLAS [6]. As opposed to Cannon's algorithm, it allows a tradeoff to be made between latency cost and memory usage by varying the degree of blocking. The algorithm, illustrated in Figure 3, proceeds in K/b stages. At each stage, √p active row processors broadcast b columns of A simultaneously along their rows, and √p active column processors broadcast b rows of B simultaneously along their columns.

[Figure 3. SpSUMMA execution (b = N/√p): at each stage, the active block column A(i,k) and block row B(k,j) are broadcast and multiplied into C(i,j).]

Sparse SUMMA, like dense SUMMA, incurs an extra cost over Cannon for using row-wise and column-wise broadcasts instead of nearest-neighbor communication, which might be modeled as an additional O(lg p) factor in the communication cost. Other than that, the analysis is similar to sparse Cannon and we omit the details. Using the DCSC data structure, the expected cost of fetching b consecutive columns of a matrix A is b plus the size (number of nonzeros) of the output [4]. Therefore, the algorithm asymptotically has the same computation cost for all values of b.
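A hedged sketch of the SpSUMMA communication pattern follows: the process grid is split into row and column communicators, and at each stage the owners of the active b columns of A and b rows of B broadcast them along their grid row and grid column, respectively. As in the Cannon sketch earlier, the payloads are shown as fixed-size dense buffers for brevity; the actual implementation ships serialized sparse blocks.

```cpp
#include <mpi.h>
#include <cmath>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int q = static_cast<int>(std::round(std::sqrt(static_cast<double>(p))));  // assumes p is a perfect square
    const int myrow = rank / q, mycol = rank % q;

    MPI_Comm rowComm, colComm;                      // processors sharing a grid row / grid column
    MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &rowComm);
    MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &colComm);

    const int bufWords = 1024;                      // stand-in for b columns/rows of a sparse block
    std::vector<double> Apiece(bufWords), Bpiece(bufWords);

    // With b = N / sqrt(p), there are K/b = sqrt(p) stages (Figure 3).
    for (int stage = 0; stage < q; ++stage) {
        // The owner of the active block column of A sits at grid column `stage`,
        // and the owner of the active block row of B sits at grid row `stage`.
        if (mycol == stage) { /* pack the local b columns of A into Apiece */ }
        if (myrow == stage) { /* pack the local b rows of B into Bpiece */ }

        MPI_Bcast(Apiece.data(), bufWords, MPI_DOUBLE, stage, rowComm);
        MPI_Bcast(Bpiece.data(), bufWords, MPI_DOUBLE, stage, colComm);

        // local_multiply_add(C, Apiece, Bpiece);   // accumulate into the local C block
    }

    MPI_Comm_free(&rowComm);
    MPI_Comm_free(&colComm);
    MPI_Finalize();
    return 0;
}
```

Splitting the grid communicator by row and by column gives the two broadcast domains; choosing b smaller than N/√p lowers memory pressure at the price of more stages, which is exactly the latency/memory tradeoff mentioned above.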

[Figure 4. Speedup of Sparse1D: projected synchronous speedup versus matrix dimension N and number of processors.]

[Figure 5. Speedup of SpCannon: projected synchronous speedup versus matrix dimension N and number of processors.]

6 Experiments and Implementation

In this section, we project the estimated speedup of the 1D and 2D algorithms and give timing results from a preliminary version of the Sparse SUMMA algorithm that we implemented using MPI.

For realistic evaluation, we first evaluated the constants in the sequential algorithms on a 2.2 GHz Opteron, in order to use them in the projections. We performed multiple runs with different N and c values, estimating the constants using non-linear regression. One surprising result is the order of magnitude difference in the constants between the sequential kernels: the 1D SpGEMM kernel has γ = 293.6 nsec, whereas the 2D kernel has γ = 19.2 nsec. We attribute the difference to the cache friendliness of the 2D SpGEMM kernel. We assume an interconnect with 1 GB/sec point-to-point bandwidth and 1 microsecond latency. TACC's Dell Linux Cluster (Lonestar) is an example system that has the assumed network settings.

Figures 4 and 5 show the estimated speedup of Sparse1D and SpCannon for matrix dimensions from N = 2^16 to 2^20 and numbers of processors from p = 1 to 4096. The number of nonzeros per column is 7 throughout the experiments, which is in line with many real world graphs such as the Internet graph [16].

We see that Sparse1D's speedup does not go beyond 60x, even with bigger matrix dimensions. It also starts showing diminishing returns for relatively small matrix sizes. On the contrary, SpCannon shows positive and almost linear speedup for up to 4096 processors, even though the slope of the curve is less than ideal. It is crucial to note that the projections for Sparse1D are based on the complexity of Star-P's implementation, even though it is shown to be memory inefficient. This is because the original memory-efficient algorithm given in Section 3 actually slows down as p increases.

In the second set of projections, we evaluate the effects of overlapping communication with computation. The non-overlapped percentage of the communication [17] is defined as

    w = 1 − T_comp/T_comm = (T_comm − T_comp)/T_comm.

Since the cost of additions in 2D algorithms is asymptotically different at each stage, we explicitly calculated the amount of non-overlapped communication at stage granularity (this can be considered as a simulation in some sense). Then, the speedup of the asynchronous implementation is given by

    S = W / ( T_comp + w·T_comm ).

Figure 6 shows the estimated speedup of asynchronous SpCannon assuming one-sided communication with RDMA support. For relatively smaller matrices, the speedup is about 30% more than the speedup of a synchronous implementation. Interestingly, speedup is higher for smaller matrices (more than 1200x for 4096 processors). This is because almost all communication is overlapped with computation. In this case, as noted in Section 4.1, efficiency is 1/lg(c²N/p); in other words, when communication is fully overlapped, parallel efficiency drops as N increases.

The projected speedup plots should be interpreted as an upper bound on the speedup that can be achieved on a real system using the aforementioned algorithms, because they require all components to be implemented and working optimally. The conclusion we can derive from those plots is that no matter how hard we try, it is impossible to get good speedup with the current 1D algorithms.
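As a worked illustration of how such projections follow from the model (this is not the script used to produce the figures), the routine below evaluates Equations (4)–(6) for synchronous SpCannon. The word size behind β (taken here as 8-byte words, so roughly 8 ns per word at 1 GB/s) and the use of the 2D kernel's γ for the one-processor baseline are assumptions of this sketch, so its absolute numbers need not match the plots.

```cpp
#include <cmath>
#include <cstdio>

// Projected speedup of synchronous SpCannon from Eqs. (4)-(6).
// All times are in nanoseconds; c is the number of nonzeros per column.
double spcannon_speedup(double N, double p, double c,
                        double gamma = 19.2,    // ns per nonzero arithmetic op (2D kernel)
                        double alpha = 1000.0,  // 1 microsecond latency
                        double beta  = 8.0) {   // assumed ns per 8-byte word at 1 GB/s
    double work  = gamma * c * c * N;                                                // W = gamma*c^2*N
    double tcomp = gamma * (c + (c * c * N / p) * std::log2(c * c * N / p));         // Eq. (5)
    double tcomm = 2.0 * alpha * std::sqrt(p) + 2.0 * beta * c * N / std::sqrt(p);   // Eq. (4)
    return work / (tcomp + tcomm);   // speedup = p * E, with E as in Eq. (6)
}

int main() {
    const double c = 7.0;            // nonzeros per column, as in the projections above
    for (double N : {1 << 16, 1 << 20}) {
        for (double p : {64.0, 1024.0, 4096.0}) {
            std::printf("N=%8.0f  p=%6.0f  projected speedup %8.1f\n",
                        N, p, spcannon_speedup(N, p, c));
        }
    }
    return 0;
}
```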

[Figure 6. Speedup of SpCannon (Async): projected asynchronous speedup versus matrix dimension N and number of processors.]

[Figure 7. Strong scaling of sparse SUMMA: running time in seconds versus number of processors (1 to 256).]

Finally, we implemented a real prototype of the SpSUMMA algorithm using C++, with MPI for parallelism. The implementation is synchronous in the sense that overlapping communication with computation is not attempted. We ran our code on TACC's Dell Linux Cluster (Lonestar), which has 2.66 GHz dual-core processors in each node. It has an Infiniband interconnect, and we compiled our code using the Intel C++ Compiler version 9.1.

We performed a strong scaling experiment with a matrix dimension of 2^20. In our experiments, instead of using random matrices (matrices from Erdős–Rényi random graphs [10]), we used matrices from synthetically generated real-world graphs [18] to achieve results closer to reality. The average number of nonzeros per column is 8 for those synthetically generated graphs. The results are shown in Figure 7.

Our code achieves 10× speedup for 16 processors, and it shows steady speedup until p = 64, which are both promising. However, it starts to slow down as the number of processors approaches 256. A couple of factors lead to this behavior:

1. The code currently performs sparse matrix additions by scanning, instead of using an optimal unbalanced merging algorithm. As p increases, the cost of additions starts to become a bottleneck.

2. Our input matrices model real-world graphs with power-law degree distributions. Coupled with our synchronous implementation, this poses a load-balance problem.

3. SpSUMMA has the extra lg p factor in the communication cost that also affects the performance as p increases.

Although the results significantly deviate from the ideal speedup curve, they still indicate that ours is a good approach to the problem in practice (one that does not ignore constants so large as to make the algorithm impractical).

7 Conclusions and Future Work

In this paper, we provided a full scalability analysis of 1D and 2D algorithms for parallel sparse matrix-matrix multiplication (PSpGEMM). To the best of our knowledge, such an analysis does not exist in the literature. We also proposed the first 2D algorithms for PSpGEMM. We supported our analysis with projections and preliminary experiments.

Both our analysis and experiments show that 2D algorithms are more scalable than their 1D counterparts, both in terms of communication and computation costs. The 2D algorithms proposed in this paper are also amenable to a remote direct memory access (RDMA) implementation to hide the communication costs.

The main future direction of our work is to provide a high-quality implementation of SpSUMMA that scales up to thousands of processors. We are currently developing a non-MPI version that uses the one-sided communication capabilities of GASNet [1] to hide the communication costs. Using an optimal unbalanced merging algorithm is also required for scalability, but our experiments with existing pointer-based merging algorithms show that their constants are too big to make them practical in a real implementation. Developing a cache-efficient algorithm for unbalanced merging is also one of our priorities.

Lastly, a better 1D algorithm is clearly necessary in order to use 1D algorithms in settings where p < 100. Such an algorithm should avoid excessive SPA loading/unloading and should be scalable in terms of memory consumption.

References

[1] C. Bell, D. Bonachea, R. Nishtala, and K. A. Yelick. Optimizing bandwidth limited problems using one-sided communication and overlap. In IPDPS, Rhodes Island, Greece, 2006. IEEE Computer Society.
[2] W. L. Briggs, V. E. Henson, and S. F. McCormick. A Multigrid Tutorial: Second Edition. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.
[3] M. R. Brown and R. E. Tarjan. A fast merging algorithm. Journal of the ACM, 26(2):211–226, 1979.
[4] A. Buluç and J. R. Gilbert. On the representation and multiplication of hypersparse matrices. In IPDPS, Miami, FL, USA, 2008. IEEE Computer Society.
[5] L. E. Cannon. A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University, 1969.
[6] A. Chtchelkanova, J. Gunnels, G. Morrow, J. Overfelt, and R. A. van de Geijn. Parallel implementation of BLAS: General techniques for Level 3 BLAS. Concurrency: Practice and Experience, 9(9):837–857, 1997.
[7] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, chapter 3, page 55. The MIT Press, second edition, 2001.
[8] P. D'Alberto and A. Nicolau. R-Kleene: A high-performance divide-and-conquer algorithm for the all-pair shortest path for densely connected networks. Algorithmica, 47(2):203–213, 2007.
[9] T. A. Davis. Direct Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2006.
[10] P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae, 6(1):290–297, 1959.
[11] R. A. van de Geijn and J. Watts. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience, 9(4):255–274, 1997.
[12] J. R. Gilbert, G. L. Miller, and S.-H. Teng. Geometric mesh partitioning: Implementation and experiments. SIAM J. Sci. Comput., 19(6):2091–2110, 1998.
[13] J. R. Gilbert, C. Moler, and R. Schreiber. Sparse matrices in Matlab: Design and implementation. SIAM Journal of Matrix Analysis and Applications, 13(1):333–356, 1992.
[14] F. G. Gustavson. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Trans. Math. Softw., 4(3):250–269, 1978.
[15] D. Irony, S. Toledo, and A. Tiskin. Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput., 64(9):1017–1026, 2004.
[16] J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web as a graph: Measurements, models, and methods. In COCOON, pages 1–17, 1999.
[17] M. Krishnan and J. Nieplocha. SRUMMA: A matrix multiplication algorithm suitable for clusters and scalable shared memory systems. In IPDPS, Los Alamitos, CA, USA, 2004. IEEE Computer Society.
[18] J. Leskovec, D. Chakrabarti, J. M. Kleinberg, and C. Faloutsos. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In PKDD, pages 133–145, 2005.
[19] G. Penn. Efficient transitive closure of sparse matrices over closed semirings. Theor. Comput. Sci., 354(1):72–81, 2006.
[20] V. Shah and J. R. Gilbert. Sparse matrices in Matlab*P: Design and implementation. In HiPC, pages 144–155, 2004.
[21] V. B. Shah. An Interactive System for Combinatorial Scientific Computing with an Emphasis on Programmer Productivity. PhD thesis, University of California, Santa Barbara, June 2007.
