
A Systematic Survey of General Sparse Matrix-Matrix Multiplication

Jianhua Gao
School of Computer Science and Technology
Beijing Institute of Technology
Beijing, China
[email protected]

Weixing Ji
School of Computer Science and Technology
Beijing Institute of Technology
Beijing, China
[email protected]

Zhaonian Tan
School of Computer Science and Technology
Beijing Institute of Technology
Beijing, China
[email protected]

Yueyan Zhao
School of Information and Electronics
Beijing Institute of Technology
Beijing, China
[email protected]

arXiv:2002.11273v1 [cs.DC] 26 Feb 2020

Abstract—SpGEMM (General Sparse Matrix-Matrix Multiplication) has attracted much attention from researchers in the fields of multigrid methods and graph analysis. Many optimization techniques have been developed for certain application fields and computing architectures over the decades. The objective of this paper is to provide a structured and comprehensive overview of the research on SpGEMM. Existing optimization techniques have been grouped into different categories based on their target problems and architectures. Covered topics include SpGEMM applications, size prediction of the result matrix, matrix partitioning and load balancing, result accumulating, and target architecture-oriented optimization. The rationales of different algorithms in each category are analyzed, and a wide range of SpGEMM algorithms are summarized. This survey sufficiently reveals the latest progress and research status of SpGEMM optimization from 1977 to 2019. More specifically, an experimental comparative study of existing implementations on CPU and GPU is presented. Based on our findings, we highlight future research directions and how future studies can leverage our findings to encourage better design and implementation.

Index Terms—SpGEMM, Parallel Optimization, Sparse Matrix Partitioning, Parallel Architecture.

I. INTRODUCTION

General sparse matrix-matrix multiplication, abbreviated as SpGEMM, is a fundamental and expensive computational kernel in numerous scientific computing applications and graph algorithms, such as algebraic multigrid solvers [1] [2] [3] [4], triangle counting [5] [6] [7] [8], multi-source breadth-first searching [9] [10] [11], shortest path finding [12], colored intersecting [13] [14], and subgraphs [15] [16]. Hence the optimization of SpGEMM has the potential to impact a wide variety of applications.

Given two input sparse matrices A and B, SpGEMM computes a sparse matrix C such that C = A × B, where A ∈ Rp×q, B ∈ Rq×r, C ∈ Rp×r, and the operator × represents the general matrix multiplication. Let aij denote the element in the ith row and the jth column of matrix A. We write ai∗ to denote the vector comprising the entries in the ith row of A, and a∗j to denote the vector comprising the entries in the jth column of A. In the general matrix-matrix multiplication (GEMM), the ith row of the result matrix C can be generated by

    ci∗ = (ai∗ · b∗1, ai∗ · b∗2, ..., ai∗ · b∗r),    (1)

where the operation · represents the dot product of two vectors [17].

Compared with GEMM, SpGEMM needs to exploit the sparsity of the two input matrices and the result matrix, because of the huge memory and computational cost that the zero entries incur under dense formats. Most of the research on SpGEMM is based on the row-wise SpGEMM formulation given by Gustavson [18]. The i-th row of the result matrix C is given by ci∗ = Σj aij ∗ bj∗, where the operator ∗ represents scalar multiplication and the computation only occurs on the non-zero entries of the sparse matrices A and B. The pseudocode for Gustavson's SpGEMM algorithm is presented in Algorithm 1.

Algorithm 1: Pseudocode for Gustavson's SpGEMM algorithm.
  Input: A ∈ Rp×q, B ∈ Rq×r
  Output: C ∈ Rp×r
  1  for ai∗ ∈ A do
  2    for aij ∈ ai∗ and aij is non-zero do
  3      for bjk ∈ bj∗ and bjk is non-zero do
  4        value ← aij ∗ bjk;
  5        if cik ∉ ci∗ then
  6          cik ← 0;
  7        end
  8        cik ← cik + value;
  9      end
  10   end
  11 end

The goal of this survey paper is to provide a structured and broad overview of the extensive research on SpGEMM design and implementation spanning multiple application domains and research areas. We not only strive to provide an overview of existing algorithms, data structures, and libraries, but also try to uncover the design decisions behind the various implementations. It gives an in-depth presentation of the many algorithms and libraries available for computing SpGEMM.

Sparse matrices have been the topic of many surveys and reviews. Duff et al. [19] provide an extensive survey of sparse matrix research developed before the year of 1976. A broad review of sparse matrix and vector multiplication is presented by Grossman et al. [20]. Over the past decades, SpGEMM has attracted intensive attention in a wide range of application fields, and many new designs and optimizations have been proposed targeting different architectures. The development of multi/many-core architectures and parallel programming has brought new opportunities for the acceleration of SpGEMM, and the challenging problems that have emerged need to be considered intensively. To the best of our knowledge, this is the first survey paper that overviews the developments of SpGEMM over the decades. The goal of this survey is to present a working knowledge of the underlying theory and practice of SpGEMM for solving large-scale scientific problems, and to provide an overview of the algorithms, data structures, and libraries available to solve these problems, so that readers can both understand the methods and know how to use them best. This survey covers the application domains, challenging problems, architecture-oriented optimization techniques, and performance evaluation of available implementations for SpGEMM. We adopt the categories of challenging problems given in [21] and summarize existing techniques about size

II. LITERATURE OVERVIEW

A. Research Questions

This study follows the guidelines of the systematic literature review proposed by Kitchenham [22], which was initially used in medical science but later gained interest in other fields as well. According to the three main phases of planning, conducting, and reporting, we have formulated the following research questions:

RQ1: What are the applications of SpGEMM, and how are they formulated?
RQ2: What is the current state of SpGEMM research?
RQ3: How do the SpGEMM implementations on off-the-shelf processors perform?
RQ4: What challenges could be inferred from the current research effort that will inform future research?

Regarding RQ1, the study will address how SpGEMM is used in these applications, including sub-questions such as: What are the typical applications of SpGEMM? Is it used in an iterative algorithm? Do the matrices stay constant during the computation? This will give an insight into the requirements of SpGEMM in solving real problems. In RQ2, the study looks at existing techniques that were proposed in recent decades. Regarding RQ3, we will perform some evaluations on both CPU and GPU to get a general idea about the performance of these implementations. Finally, in RQ4, we will summarize the challenges and future research directions according to our investigative results.

B. Search Procedure

There have been some other reviews presented in the related work section [21]. However, they cover a more limited number of studies. We evaluate and interpret all high-quality studies related to SpGEMM in this study.
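As a concrete rendering of Algorithm 1, the following is a minimal, runnable Python sketch of Gustavson's row-wise SpGEMM operating directly on CSR arrays (indptr, indices, data). It is an illustration only, not code from any of the surveyed libraries; the function name and the choice of a dense per-row accumulator (one of several accumulator strategies discussed in Section IV) are ours.

```python
import numpy as np

def gustavson_spgemm(p, r, A_indptr, A_indices, A_data,
                     B_indptr, B_indices, B_data):
    """Row-wise SpGEMM (Algorithm 1): C = A * B for A (p x q) and B (q x r) in CSR form."""
    acc = np.zeros(r)                     # dense accumulator for the current row of C
    occupied = []                         # column indexes touched in the current row
    C_indptr, C_indices, C_data = [0], [], []
    for i in range(p):
        for t in range(A_indptr[i], A_indptr[i + 1]):        # non-zero a_ij in row i of A
            j, a_ij = A_indices[t], A_data[t]
            for s in range(B_indptr[j], B_indptr[j + 1]):    # non-zero b_jk in row j of B
                k = B_indices[s]
                if k not in occupied:
                    occupied.append(k)
                acc[k] += a_ij * B_data[s]                   # c_ik <- c_ik + a_ij * b_jk
        for k in sorted(occupied):        # emit row i of C with ascending column indexes
            C_indices.append(k)
            C_data.append(acc[k])
            acc[k] = 0.0                  # reset the accumulator for the next row
        occupied.clear()
        C_indptr.append(len(C_indices))
    return np.array(C_indptr), np.array(C_indices), np.array(C_data)
```

Wrapping the returned arrays in scipy.sparse.csr_matrix((data, indices, indptr), shape=(p, r)) and comparing against A @ B is a quick way to sanity-check the sketch.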
predicting of the result matrix, load balancing, and intermedi- To have a broad coverage, we perform a systematic literature ate results merging. survey by indexing papers from several popular digital libraries The main factors that differentiate our survey from the (IEEE Explore Digital Library, ACM Digital Library, Elsevier existing related work are: ScienceDirect, Springer Digital Library, Google Scholar, Web of Science, DBLP) using the keywords “SpGEMM”, “Sparse • the systematic process for collecting the data; matrix”, “sparse matrix multiplication”, “sparse matrix-matrix • a classified list of studies on both applications and opti- multiplication”. It is an iterative process, and the keywords mization techniques; are fine-tuned according to the returned results step by step. • analysis of the collected studies in the view of both Just like other literature survey work, we try to broaden the algorithm design and architecture optimization; search as much possible while maintaining a manageable result • an detailed evaluation of 6 library implementations. . In order to do that, only refereed journal and conference The rest of the paper is organized as follows. Section publications were included. Then, we read the titles and 2 gives a brief overview of this survey. Section 3 intro- abstracts of these papers and finally include 87 papers of more duces several classical applications of SpGEMM in detail. than 142 papers we found in this survey. In section 4, the related work on three pivotal problems of SpGEMM computation is presented. In section 5, the related C. Search Result work of SpGEMM optimization based on different platforms It is difficult to give a sufficient and systematic taxonomy is introduced in detail. Besides, we perform a performance for classifying SpGEMM research because the same topic may evaluation employing different SpGEMM implementations in have different meanings in different contexts. For example, section 6. A discussion about the challenges and future work load balance of SpGEMM in distributed computing, shared of SpGEMM is presented in section 7, and followed by our memory multi-core computing, single GPU computing, multi- conclusion in section 8. GPU computing, CPU+GPU computing. We select several TABLE I: Categories of selected papers. Algorithm 2: Construction of several important operators Category Papers in AMG [2]. Input: A, B Multigrid solvers [1] [2] [3] [4] [23] [24] [25] Triangle counting [5] [6] [7] [8] [26] [27] [28] Output: A1, ..., AL,P0, ..., PL−1 Application MBFS [9] [10] [11] [29] 1 A0 ← A, B0 ← B; Other [12] [13] [14] [15] [16] [30] [31] 2 for l = 0, ..., L − 1 do [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] // Strength-of-connection of edges Algorithm [18] [21] [43] [44] [45] [46] [47] 3 Sl =strength(Al); [14] [7] [48] [49] [50] [2] [51] [52] // Construct interpolation and [53] [42] [54] [3] [55] [56] [57] Optimization [58] [59] [60] [61] [62] [63] [64] restriction [16] [65] [66] [67] [68] [69] [25] 4 Pl =interpolation(Al,Sl); [70] [71] [72] [73] [74] [75] [26] // Galerkin product (triple sparse [41] [76] [?] 
[77] Many-core CPU [69] [51] [25] [59] [58] [68] matrices multiplication) GPU [21] [57] [47] [58] [68] T 5 Al+1 = Pl AlPl; Distributed [78] [42] [79] [67] [46] 6 end Heterogeneous [17] [53] [80] [81] FPGA [82] [83] Programming models [84] [85] [78] [54] Surveys [19] [20] [22] Benchmark [86] the coarse-grid operator is constructed by a Galerkin triple- matrix product (line 5 of Algorithm 2), which is implemented with two sparse matrix-matrix multiplications [1] [2] [3] [4] topics that are frequently covered by existing papers, and [25]. 80% present the correlation between papers and topics in Table I. These matrix-matrix multiplications, taking more than of the total construction time, are the most expensive com- III.TYPICAL APPLICATION ponents. Moreover, the construction of operators (and thus SpGEMM) is a considerable part of the overall execution since SpGEMM is mainly used in and graph it may occur at every time step (for transient problems) or even analytics, and among which algebraic multigrid solver, triangle multiple times per time step (for non-linear problems), making counting, and multi-source breadth-first search are three most it important to optimize SpGEMM [25]. classical applications of SpGEMM. In the following subsec- tions, we will introduce the procedures of these three applica- B. Triangle Counting tions and how the SpGEMM is applied in these applications Triangle counting is one of the major kernels for graph in detail. analysis and constitutes the core of several critical graph algorithms, such as triangle enumeration [8] [7] [26], subgraph A. Algebraic Multigrid Solvers isomorphism [30] [31], and k-truss computation [31] [32]. Algebraic Multigrid (AMG) is a multi-grid method de- The problem of triangle counting in an undirected graph is veloped by utilizing some important principles and concepts to find the number of triangles in the graph. Given a graph of Geometric multigrid (GMG), and it requires no explicit G = (V,E), it can be represented by a square sparse matrix geometric knowledge of the problem. AMG method itera- A of dimension n × n, where n is the number of vertexes tively solves large and sparse linear systems Ax = b by in graph G: n = |V |, and either the matrix entry aij or automatically constructing a hierarchy of grids and inter-grid aji represents the edge (i, j) [5]. Azad et al. [8] describe transfer operators. It achieves optimality by employing two an -based triangle counting method based complementary processes: smoothing and coarse-grid correc- on Cohen’s MapReduce algorithm [6]. First, they construct tion. The smoother is generally fixed to be a simple point- an adjacency matrix A from graph G, with its rows ordered wise method such as Gauss-Seidel or weighted Jacobi for with the increasing number of non-zeros they contain (equal eliminating the high-frequency error. Coarse-grid correction to the degree of the corresponding vertex). Assuming that involves three phases, including projecting error to a coarser sparse matrix L is the lower triangular part of A holding edges grid through restriction operator, solving a coarse-grid system (i, j) where i > j, and U is the upper triangular part of A of equations, and then projecting the solution back to the fine holding edges (i, j) where i < j, then A = L + U. Next, grid through interpolation (also called prolongation) operator all the wedges of (i, j, k) where j is the smallest numbered [23] [24]. 
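Line 5 of Algorithm 2 is where SpGEMM enters AMG: the Galerkin coarse-grid operator is formed by two back-to-back sparse matrix-matrix products. The sketch below illustrates this with scipy.sparse as a stand-in for the AMG libraries discussed above; the piecewise-constant interpolation operator P is a toy placeholder, not the output of a real strength/interpolation routine.

```python
import scipy.sparse as sp

def galerkin_product(A_l, P_l):
    """Galerkin triple product A_{l+1} = P_l^T * A_l * P_l, computed as two SpGEMMs."""
    AP = A_l @ P_l            # first SpGEMM:  (n_f x n_f) * (n_f x n_c)
    return P_l.T @ AP         # second SpGEMM: (n_c x n_f) * (n_f x n_c)

# toy setup: 1D Poisson matrix on 8 fine points, aggregated pairwise onto 4 coarse points
n_f, n_c = 8, 4
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n_f, n_f), format="csr")
P = sp.csr_matrix(([1.0] * n_f, (list(range(n_f)), [i // 2 for i in range(n_f)])),
                  shape=(n_f, n_c))
print(galerkin_product(A, P).toarray())
```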
vertex are counted via the SpGEMM operation: B = L × U, Algorithm 2 describes the construction of several important and B(i, k) represents the count for (i, j, k) wedges. Finally, operators in AMG. First, the strength operation identifies the finding the close wedges by doing element-wise multiplication strongly connected edges in the matrix Al at level l to construct with the original matrix: C = A. ∗ B. Algorithm 3 presents strength-of-connection matrix Sl (line 3 of Algorithm 2). Next, the pseudocode for the serial adjacency matrix-based triangle the interpolation operator Pl is constructed to transfer informa- counting algorithm, and an example of triangle counting is tion from level l +1 to level l (line 4 of Algorithm 2). Finally, presented in Fig. 1. Algorithm 3: Pseudocode for the serial adjacency matrix- connectivity, finding the shortest path, and finding k-hop based triangle counting algorithm [6] [8]. neighbors [36] [37]. Input: Ordered adjacency matrix A of graph G The goal of the BFS is to traverse a graph from a given source vertex, which can be performed by a sparse matrix- Output: Number Nt of triangles in graph G 1 (L, U) ←Split(A); vector multiplication (SpMV) between the adjacency matrix 2 B ← SpGEMM(L,U); of a graph and a sparse vector [29]. 3 C ←ElementWiseSpGEMM(A, B); Let us take an example in a sparse unweighted graph G = // Sum all entries of matrix C (V,E) represented by sparse matrix A. Assume n = |V |, then the size of A is n × n. Let x be a sparse vector with x(i) = 1 4 Nt ← Sumi,j(cij)/2; and all other elements being zero, then the 1-hop vertexes from source vertex i, denoted as Vi1, can be derived by the SpMV operation: y = Ax. Repeat the operation from Vi1, we Besides, Wolf et al. [27] develop a GraphBLAS like can receive 1-hop vertexes from Vi1 or 2-hop vertexes from i. approach for triangle enumeration using an overloaded Finally, a complete BFS for graph G from vertex i is yielded SpGEMM of the graph adjacency matrix (A) and incidence [9]. matrix (H): C = A × H, which can also be used for triangle In contrast, Multi-Source BFS (MS-BFS) executes multiple counting. Although this method is useful for more complex independent BFSs concurrently on the same graph from mul- graph calculations, it is less efficient than the adjacent matrix- tiple source vertexes. It can be performed by SpGEMM. A based method. Furthermore, a variant of the adjacency matrix- simple example for MS-BFS is presented in Fig. 2 [10]. Let based triangle counting algorithm is given by Wolf et al. in {1, 2} be two source vertexes, A is the adjacency matrix of [7], which uses the lower triangle portion of the adjacency the graph, and X = (x1, x2) is a rectangular matrix repre- matrix to count triangles. senting the source vertexes, where x1 and x2 are two sparse column vectors with x1(1) = 1 and x2(2) = 1 respectively i i a a a and all other elements being zero. Let Bi = (b1, b2), then B1 = A × X. Obviously, B1 is a sparse rectangular matrix representing the 1-hop vertexes (denoted as V1) from source b d b d vertexes {1, 2}. Repeat the SpGEMM operation of adjacency matrix and V1, we can derive the 1-hop vertexes from V1, also 2-hop vertexes from {1, 2} by B2 = A × B1. Finally, we get c c c the breadth-first traversal results from vertices 1 and 2, which {1, 3, 4, 2, 5, 6} {2, 3, 4, 1, 5, 6} G Two triangles are and , respectively. Fig. 1: An example of triangle counting. Graph G includes two triangles: (a, b, c) and (a, d, c).
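Algorithm 3 maps almost line-for-line onto scipy.sparse, as sketched below. The degree-based row ordering used by Azad et al. only affects performance, not the count, so it is omitted here; the small graph at the bottom is an arbitrary example, not the graph of Fig. 1.

```python
import numpy as np
import scipy.sparse as sp

def count_triangles(A):
    """Triangle count of an undirected graph from its symmetric 0/1 adjacency matrix A."""
    L = sp.tril(A, k=-1, format="csr")   # strictly lower triangular part of A
    U = sp.triu(A, k=1, format="csr")    # strictly upper triangular part of A
    B = L @ U                            # SpGEMM: B[i, k] counts wedges i-j-k with j < i and j < k
    C = A.multiply(B)                    # element-wise product keeps only closed wedges
    return int(C.sum()) // 2             # each triangle is counted twice

# 5-vertex example containing the triangles (0, 1, 2) and (0, 2, 3)
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (2, 3), (3, 4)]
rows = [i for i, j in edges] + [j for i, j in edges]
cols = [j for i, j in edges] + [i for i, j in edges]
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(5, 5))
print(count_triangles(A))   # -> 2
```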

Collected from social media, sensor feeds, the World Wide Web, biological and genetic fields, co-author networks, and citations, increasing large amounts of data are being analyzed with graph analytics to reveal the complicated underlying relationship between different data entities [28] [31]. One of the fundamental techniques for graph analytics is subgraph finding, and the triangle is one of the most important sub- graphs. Moreover, the number of triangles is an important metric in many network analysis applications, including link recommendation [33], spam detection [34], social network analysis [34], and dense neighborhood graph discovery [35]. In the adjacency matrix-based triangle counting algorithm, the Fig. 2: An example of BFS. Vertexes filling in horizontal lines 2 SpGEMM operation B = L × U costs d n operations, where and vertical lines are being visited by the BFS starting from d is the average degree and n is the number of vertices [8]. vertex {1} and {2} respectively, and vertexes filling in left Considering that the computational complexity increases with oblique lines have been visited by the BFS starting from vertex the scale of the problem, the optimization of triangle counting {1} or {2} [10]. (thus SpGEMM) becomes extremely considerable. Many applications require hundreds or even billion of BFSs C. Multi-Source Breadth-first Search over the same graph. Examples of such applications include Breadth-first search (BFS) is a key and fundamental sub- calculating graph , enumerating the neighborhoods routine in many graph analysis algorithms, such as finding for all vertexes, and solving the all-pairs shortest distance problem. Comparing with running parallel and distributed 2) Probabilistic Method: The second method estimates the BFSs sequentially, executing multiple BFSs concurrently in sparse structure of the result matrix based on random sampling a single core allows us to share the computation between and probability analysis on the input matrices, which is known different BFSs without paying the synchronization cost [9]. as the probabilistic method. As the most expensive part of MS-BFS, SpGEMM is worthy Cohen [48] thinks that the problem of estimating the number of more attention to be paid. of non-zeros in result matrix can be converted to estimate In addition, SpGEMM is also one of the most important the size of reachability sets in a directed graph. Then he components for graph contraction [38], graph matching [39], presents a Monte Carlo-based algorithm that estimates the size graph coloring [13] [14], all pairs shortest path [12], sub-graph of reachability sets with high confidence and small relative [15] [16], cycle detection or counting [40], and molecular error. The non-zero structure not only can guide efficient dynamics [41] [42]. Because of limited space, we will not memory allocation for the output matrix but also can be introduce these applications and corresponding SpGEMM used to determine the optimal order of multiplications when formulation in detail. computing a chain product of three or more matrices, where the optimal order means the minimization of a total number IV. METHODOLOGY of operations. Therefore, Cohen [48] proposes a method that The development of multi/many-core architecture and par- estimates the non-zero structure of the result matrix in time allel programming has brought new opportunities for the linear based on the previous reachability-set size estimation acceleration of SpGEMM, and three challenging problems algorithm. 
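Cohen's estimator rests on reachability-set size estimation and does not fit in a short listing, but the flavor of probabilistic size prediction can be conveyed with a far simpler (and far cruder) row-sampling heuristic: compute a few rows of C exactly and extrapolate. The sketch below is explicitly not Cohen's algorithm and carries no accuracy guarantee; it only illustrates why sampling-style estimates are cheap compared with a full symbolic pass.

```python
import numpy as np
import scipy.sparse as sp

def sampled_nnz_estimate(A, B, sample_rows=32, seed=0):
    """Crude estimate of nnz(A @ B): multiply a random subset of rows of A by B
    and scale the observed non-zero count up to the full row dimension."""
    rng = np.random.default_rng(seed)
    p = A.shape[0]
    rows = rng.choice(p, size=min(sample_rows, p), replace=False)
    sampled_nnz = (A[rows, :] @ B).nnz          # exact nnz of the sampled rows of C
    return int(sampled_nnz * p / len(rows))

A = sp.random(2000, 2000, density=0.002, format="csr", random_state=1)
B = sp.random(2000, 2000, density=0.002, format="csr", random_state=2)
print(sampled_nnz_estimate(A, B), (A @ B).nnz)  # estimate vs. exact count
```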
For any constant error probability δ > 0 and any emerged need to be considered intensively [21], including size the tolerated relative error  > 0, it can compute a 1 ±  predicting of result matrix, matrix partitioning and load bal- approximation of z of result matrix in time O(n/2), where ancing, and intermediate result accumulating. In the following z denotes the number of non-zeros in the result matrix. Then, subsections, we will introduce related studies in these three Amossen et al. [49]√ improve this method to expected time challenges in detail. O(n) for  > 4/ 4 z. Pagh et al. also present a method to estimate column/row sizes of the result matrix based on A. Size Prediction the non-zeros of input matrices, and a detailed description is The first problem is the unknown number of non-zeros in the referred to the Lemma 1 in [50]. result matrix before real execution. The number of non-zeros This type of method does not require SpGEMM-like com- of a sparse matrix usually dominates its memory overhead, putation and then yields a low time cost for estimation. while in SpGEMM, the sparsity of the result matrix is always However, considering that this type of method usually gives unknown in advance. Therefore, precise memory allocation an approximate estimation, hence an extra memory allocation for the result matrix is impossible before real execution. The must be performed when the estimation fails. research focusing on the size prediction of the result matrix 3) Upper Bound Method: The third method computes an can be mainly classified as the following four approaches. upper bound of the number of non-zeros in the result matrix 1) Precise Method: The SpGEMM algorithms, which use and allocates corresponding memory space, which is known as precise method to solve size prediction problem, usually the upper bound method. The most commonly used method of consist of two phases: symbolic and numeric phase. In the calculating upper bound for C is to count the number of non- symbolic phase, the precise number of non-zeros in result zeros of the corresponding row in matrix B for each non- C is computed, while the real values are calculated entry in matrix A. Assume that input matrices A and B are in the numeric phase. compressed with CSR format, and I and J are the row pointer The implementations of SpGEMM in KokkosKernels [87], array and column index array, respectively. Then Algorithm 4 cuSPARSE [88], MKL [89] and RMerge [43] are typical rep- gives the pseudocode for computing the upper bound of the resentatives of this method. Besides, Demouth [44], Nagasaka number of non-zeros in each row of the result matrix, which et al. [45] and Demirci et al. [46] also adopt this approach. is stored in array U = {u0, u1, . . . , up−1}. In [47], Deveci et al. design a graph compression technique, Bell et al. [2] propose a classical ESC (Expansion, Sorting, similar to the color compression idea [14], to compress matrix Compression) method to accelerate AMG method, and it B by packing its columns as bits in order to speedup the allocates the memory space of the upper-bounded size for symbolic phase. result matrix, which is applied in CUSP [90]. Nagasaka et al. Accurate pre-computation of the number of non-zeros not [51] also count a maximum of scalar non-zero multiplications only saves the memory usage but also enables the sparse per row of the result matrix. 
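Counting this upper bound is a single pass over the CSR index arrays of A and B. The sketch below is equivalent in effect to Algorithm 4 (given next), with u[i] bounding the number of non-zeros in row i of C; the CSR fields are named indptr/indices rather than the I/J arrays of the listing.

```python
import numpy as np
import scipy.sparse as sp

def row_upper_bound(A, B):
    """u[i] = sum over the non-zeros a_ij of row i of A of nnz(row j of B),
    an upper bound on nnz(row i of C) for C = A @ B (both inputs in CSR)."""
    b_row_nnz = np.diff(B.indptr)                       # nnz of every row of B
    u = np.zeros(A.shape[0], dtype=np.int64)
    for i in range(A.shape[0]):                         # mirrors the two loops of Algorithm 4
        cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
        u[i] = b_row_nnz[cols].sum()
    return u

A = sp.random(1000, 1000, density=0.005, format="csr", random_state=3)
B = sp.random(1000, 1000, density=0.005, format="csr", random_state=4)
print(row_upper_bound(A, B).sum(), (A @ B).nnz)         # total bound vs. exact nnz(C)
```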
Then each thread allocates the structure of C to be reused for different multiplies with the based on the maximum and reuses the hash table same structure of input matrices [47]. Moreover, it presents throughout the computation by re-initializing for each row. significant benefits in graph analytics, because most of them This type of method is efficient and straightforward to work only on the symbolic structure, no numeric phase [7]. implement, while it does not consider the merging of the However, the calculation of two phases also means that it values with the same index items in the calculation process, needs to iterate through the input matrices twice, indicating and usually leads to over-allocation. Especially for some the higher computation overhead than the following methods. particular matrices, the estimated size with the upper bound Algorithm 4: Pseudocode for computing the upper bound Result matrix C is obtained by summing the of the number of non-zeros in each row of the result matrix, of each column a∗i of A and corresponding row bi∗ of B,

and all sparse matrices are compressed with CSR format. i.e. q I ,J ,I X Input: A A B. C = a∗i ⊗ bi∗, (2) Output: U = {u0, u1, . . . , up−1}. i=1 1 for i = 0, p − 1 do where the operator ⊗ represents the outer product of 2 ui = 0; vectors. 3 for k = IA(i),IA(i + 1) − 1 do 2) Inner-product formulation [53] [55] [45] [56]. This 4 j = JA(k); formulation is based on the row-wise and column-wise 5 ui+ = IB(j + 1) − IB(j); partitioning of input matrices A and B, respectively. Result 6 end matrix C consists of the inner product of each row of 7 end matrix A and each column of matrix B, i.e. X cij = aik ∗ bkj, (3) method and precise size have large difference. k∈I(i,j) 4) Progressive Method: The fourth method, known as the where i = 1, 2, ..., p, j = 1, 2, ..., r, and I(i, j) denotes the progressive method, dynamically allocates memory as needed. set of indexes k such that both the entries aik and bkj are It first allocates memory of proper size and then starts matrix- non-zero. matrix multiplication. The reallocation of larger memory is 3) Row-by-row-product formulation [53] [3] [56] [46]. This required if the current memory is insufficient. formulation is based on the row-wise partitioning of input The implementation of SpGEMM in Matlab [52] is the matrices. Each row ci∗ of result matrix C is obtained by representative of this type of method. It first guesses the size summing the intermediate multiplication results of each of the result matrix and allocates a new memory space that non-zero entry aik of ai∗ and corresponding row bk∗of B. i.e. is larger by a constant factor (typically 1.5) than the current X space if more space is required at some point. Meanwhile, the ci∗ = aik ∗ bk∗, (4) columns which are already computed are required to be copied k∈Ii(A) into the new memory. where Ii(A) denotes the set of column indexes k of the If this type of method estimates successfully in first, then non-zero entries in the i-th row of A. it has no additional time overhead in SpGEMM. However, 4) Column-by-column-product formulation [3] [56]. This if the first estimation fails, this type of method is extremely formulation is based on the column-wise partitioning of inefficient, especially in GPU. Moreover, if the SpGEMM input matrices, which is similar to the row-by-row-product algorithm is implemented in thread-level parallel, and threads formulation. Each column c∗j of result matrix C is ob- are using their local storage, then it must take time to aggregate tained by summing the intermediate multiplication results all the independent storage into the result matrix. of each non-zero entry bkj of b∗j and corresponding Besides, the hybrid methods are also proposed to cope with column a∗k, i.e. the problem. Weifeng Liu and Brian Vinter propose a hybrid X method for result matrix pre-allocation [21]. They calculate c∗j = a∗k ∗ bkj, (5) the upper bound of the number of non-zeros in each row k∈Ij (B) of the result matrix and divide all rows into multiple groups where Ij(B) denotes the set of row indexes k of the non- according to the number of non-zeros. Then they allocate space zero entries in the j-th column of B. of the upper-bound size for the short rows and progressively 2) Block Partitioning: Four formulations of SpGEMM rep- allocate space for the long rows. resent four different 1D matrix partitioning schemes: column- B. Matrix Partition and Load Balancing by-row, row-by-column, row-by-row, and column-by-column. 
However, considering the irregular distribution of non-zero The second problem is matrix partitioning and load balanc- entries in sparse matrices, plain slice-based matrix parti- ing. With the development of multi/many-core architecture, tioning usually results in load unbalancing. Therefore, some reasonable partitioning of matrices and balanced assigning of researchers pay attention to load balancing for specific archi- tasks among multiple cores or nodes are critical for making tectures. full use of hardware resources. Weifeng Liu and Brian Vinter propose a method that 1) Formulations of SpGEMM: It is closely related between guarantees a balanced load through assigning rows with a matrix partitioning and SpGEMM formulations. There are four different number of non-zeros to the different number of bins formulations of SpGEMM can be devised. [21] for GPU platform. In order to balance the load and 1) Outer-product formulation [53] [42] [54] [3] [55] [56]. efficiently exploit GPU resources, Nagasaka et al. [45] divide This formulation is based on the column-wise and row- the rows into groups in the symbolic phase by the number of wise partitioning of input matrices A and B, respectively. intermediate products, which is the upper bound of the number of non-zeros in output matrix. In the numeric phase, they 3) Hypergraph Partitioning: Traditional block-based ma- divide the rows into groups by the number of non-zeros in each trix partitioning schemes seldom consider the communication row. Winter et al. [57] also design a balanced work assigning problem from the sparse patterns of input or output matri- scheme for GPU. Each thread block is assigned an equal ces. Akbudak et al. [42] propose an elementary hypergraph number of non-zeros of A while ignoring the row boundaries partitioning (HP) model to partition input matrices for outer- of a matrix. Deveci et al. [58] propose two chunking methods product parallel SpGEMM. It aims at minimizing communi- for KNL and GPU. The difference lies in partitioning A in the cation cost while maintaining a balanced computational load column-wise method for KNL, but for GPU partitioning A in by a predetermined maximum allowable imbalance ratio. The the 2D method because of its multi-level memory property. HP model is composed of three components. The first is the Deveci et al. [68] also utilize row-by-row matrix partitioning vertex for representing the outer product of each column of A but with different tasks assigning scheme (thread-sequential, with the corresponding row of B. The second is the vertex team-sequential, thread-parallel, and thread-flat-parallel) for for each non-zero entry of C to enable 2D nonzero-based high performance computing architectures. Chen et al. [59] partitioning of the output matrix. The third is the hyperedge design a three-level partitioning scheme to improve the load (or net) for each non-zero entry of C to encode the total balance of SpGEMM on Sunway TaihuLight supercomputer. volume of communication that occurred in intermediate result The first level represents that A is partitioned into NP sub- accumulating. Assuming that the partitioning of the input matrices, where NP is the number of core groups applied matrices A and B in the HP model is presented as in the computation. In the second level, B is partitioned into  r  NC sub-matrices, where NC is the number of computing B1 r processing element (CPE) cores applied in each core group.  B2  Aˆ = AQ = [Ac Ac ··· Ac ] and Bˆ = QB =   , (6) 1 2 K  . 
 Sub-A and sub-B are further partitioned into several sets in  .  the third level. r BK Kurt et al. [60] propose an adaptive task partition scheme across virtual warps of a thread block in GPU. For example, where Q is the induced by the partitioning, for a thread block of size 128 and a virtual warp of size 4, each the HP model for outer-product parallel SpGEMM includes thread block will preside over four rows of A simultaneously. two phases. In the first phase (also called as multiplication c Then 32(=128/4) threads are assigned cyclically to deal with phase), each processor Pk owns column stripe Ak and row r the corresponding entries of B for each row of A. They stripe Bk of the permuted matrices, and then finishes the c r also give a systematic exploration and analysis of upper communication-free local SpGEMM computations Ak × Bk. data movement requirements for SpGEMM for the first time. The second phase (also called as summation phase) reduces The work distribution scheme is similar to the row merging partial results yielded in the first phase to calculate the final method by GPU sub-warps in [43] [61]. Ballard et al. [3] value of each non-zero entry of C. In order to maintain a present a detailed theoretical prediction for communication balanced load over two phases, the two-constraint formulation cost for SpMM (sparse-dense matrix multiplication) operation is proposed. Each vertex is assigned two weights, which rep- within the Galerkin of smoothed aggregation resent the computational overhead associated with the vertex AMG methods. They also conclude that the row-by-row- for two phases, respectively. Moreover, an imbalance ratio is product algorithm is the best 1D method for the first two introduced to direct partitioning for load balancing. multiplications and that the outer-product algorithm is the best Ballard et al. [66], based on the HP model, present a fine- 1D method for the third multiplication. grained framework to prove sparsity-dependent communica- Van de Geijn et al. [62] propose a 2D block data de- tion lower bounds for both parallel and sequential SpGEMM. composition and mapping methods for dense matrix-matrix Further, they reduce identifying a communication-optimal al- multiplication based on the platform with distributed memory. gorithm for given input matrices to solve the HP problem. It is applied in SpGEMM by Buluc et al. [63] [64] [16]. Akbudak et al. [55] also apply the HP model to improve the Noted increasing hypersparse matrices after 2D block data outer-product parallel and inner-product parallel sparse matrix- decomposition, they propose a new format DCSC (doubly matrix multiplication (SpMM) algorithms for the Xeon Phi compressed sparse column) to exploit the hypersparsity of the architecture. The HP model allocates A-matrix columns and individual sub-matrices. Then, they developed two sequential B-matrix rows that contribute to the same C-matrix entries SpGEMM algorithms based on the outer-product and column- into the same parts as much as possible in the inner-product- by-column-product multiplications, respectively, which mainly parallel SpMM. In the outer-product-parallel SpMM, the HP improve the hypersparse sub-matrices multiplication after the model allocates A-matrix rows that require the same rows of 2D partitioning. Finally, they present a parallel efficient sparse B into the same parts as much as possible. The reason is that SUMMA [62]. Furthermore, Azad et al. [65] first implement a it enables the exploiting of temporal locality. 
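The hypergraph machinery above does not fit in a few lines, but the simpler 1D idea that many schemes in this subsection start from (balance a per-row work estimate rather than the row count) can be sketched as follows. This is a generic illustration, not the partitioner of any particular paper; flops_per_row could be, for example, the upper bound u[i] computed earlier.

```python
import numpy as np

def balanced_row_blocks(flops_per_row, num_parts):
    """Split rows 0..p-1 into num_parts contiguous blocks of roughly equal total work."""
    prefix = np.concatenate(([0], np.cumsum(flops_per_row)))   # prefix[i] = work of rows < i
    total = prefix[-1]
    targets = total * np.arange(1, num_parts) / num_parts      # ideal cumulative-work split points
    cuts = np.searchsorted(prefix, targets)                    # rows where the prefix crosses a target
    return np.concatenate(([0], cuts, [len(flops_per_row)]))   # block b = rows [bounds[b], bounds[b+1])

flops = np.array([5, 1, 0, 40, 3, 2, 30, 7, 1, 11])
print(balanced_row_blocks(flops, 4))    # -> [ 0  4  6  7 10]
```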
3D parallel SpGEMM algorithm, based on the work of Buluc Besides, Akbudak et al. [56] also apply the graph model et al. that exploits both the inter-node parallelism within a third and HP model on the outer-product, inner-product, and row- processor grid dimension and the multi-threading parallelism by-row-product parallel SpGEMM algorithms. Demirci et al. within the node. [46] adopt a similar bipartite graph model for matrices parti- tioning on the distributed-memory platform but complement 2) Sort-based: A sort-based merging is performed by sort- a 3-constraint edge weight assignment scheme to maintain ing the intermediate results according to the column indexes a balanced load among tablet servers. Selvitopi et al. [67] and then summing the values with the same column indexes. apply the graph model and hypergraph model for simultane- The difference between sort-based works often lies in the ous assigning of the map and reduce tasks to improve data different sort algorithms utilized, including radix sort, merge locality and maintain a balanced load among processors in sort, and sort. both map and reduce phases. Then, Considering the similarity 1) Radix sort based. Bell et al. [2] adopt the radix-sort between the map and reduce tasks in the MapReduce and the method (implemented in [84]) to sort the intermediate accumulation of intermediate results in SpGEMM, they apply results, and Dalton et al. [71] employ the B40C radix sort the obtained MapReduce partitioning scheme based on graph algorithm [72] which allows specifications in the number and hypergraph models to improve the data locality and load and location of the sorting bits to accelerate the sort balancing in SpGEMM. operation. 2) Merge sort based. In the GPU-accelerated SpGEMM C. Result Accumulating given by Gremse et al. [43], the sparse rows of B selected The third problem is the expensive intermediate result and weighted by corresponding non-zeros of A, are merged accumulating. In outer-product SpGEMM, result matrix C is in a hierarchical way similar to merge sort. At each level, obtained by summing the outer-product result of each row of the number of rows is reduced by half, but the length of A and the corresponding column of B, where the intermediate rows becomes larger due to merging. Winter et al. [57] outer-product result is usually a sparse matrix. In row-by-row- utilize the ESC method to sort intermediate results yielded product SpGEMM, the result matrix is obtained by summing in a chunk while ignoring row boundaries of a matrix. the multiplication result of each non-zero entry of A and the Simple sorting of the chunks is performed based on their corresponding row of B, where the intermediate multiplication global chunk order in the merge stage. Final, a prefix results are also sparse. The other two formulations are similar. scanning is utilized to merge rows shared between chunks So no matter which SpGEMM formulation is chosed, the to generate the final result. merging of sparse structures will be involved, and it is a 3) Heap sort based. Azad et al. [65] represent the interme- challenging problem. diate result as a list of triples, and each list of triples is sorted by the (j, i) pair, where the triple (i, j, val) 1) Accumulator: Here, we call the that holds stores the row index, column index, and value of a intermediate results as the accumulator, and let us take row- non-zero entry. 
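The expand-sort-compress pattern behind these sort-based accumulators can be written for a single row of C in a few lines of NumPy, as below. Real implementations perform the sort with tuned radix or merge kernels and process many rows in parallel; this sketch is only meant to show what "sort then compress" means.

```python
import numpy as np

def esc_row(cols, vals):
    """Expand-Sort-Compress for one row of C.
    cols/vals hold the expanded intermediate products a_ij * b_jk (duplicate columns allowed);
    returns the compressed (column, value) pairs of the result row."""
    order = np.argsort(cols, kind="stable")          # Sort by column index
    cols, vals = np.asarray(cols)[order], np.asarray(vals)[order]
    boundaries = np.flatnonzero(np.diff(cols)) + 1   # positions where a new column starts
    starts = np.concatenate(([0], boundaries))
    out_cols = cols[starts]                          # Compress: one entry per distinct column
    out_vals = np.add.reduceat(vals, starts)         # sum the duplicates within each segment
    return out_cols, out_vals

cols = [4, 1, 4, 7, 1, 1]
vals = [0.5, 1.0, 0.25, 2.0, -1.0, 3.0]
print(esc_row(cols, vals))   # columns [1, 4, 7], values [3.0, 0.75, 2.0]
```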
A k-way merge on k lists of triples by-row-product SpGEMM as an example to introduce the T ,T , ..., T is performed by maintaining a heap of size accumulator (because most of SpGEMM research is based on 1 2 k k, which stores the current lexicographically minimum the row-by-row-product formulation). There are two types of entry in each list of triples. Then a multi-way merge accumulators at present: dense and sparse accumulators. routine finds the minimum triple from the heap and In dense accumulator, intermediate results for a row are merges it into the current result. Nagasaka et al. [51] stored in an array of size r in dense format, where r is implement a heap sort-based sparse accumulator adopting the column dimension of result matrix. It does not require the one-phase SpGEMM in KNL and multi-core archi- extra time cost for hashing, sorting or merging operations, tecture. In merge-based sparse accumulator (similar to while it may not be suitable to parallelization with large heap sort-based accumulator) proposed by Junhong et al. threads and large r values. Deveci et al. [68] [7] utilize dense [70], each thread stores multiple column indexes, and accumulator for CPU or KNL instead of GPU because of its then performs an intra-thread-min operation to ob- high memory requirements. Patwary et al. [69] utilize dense tain the local minimum column index inta-min. Mean- accumulator with a column-wise partitioning of B to reduce while, the intermediate result value for each intra-min memory overhead. Elliott et al. [25] note that prolongation is fetched by each thread. Next, each thread performs matrices (P ) in the AMG method are usually tall and skinny, inter-thread-min(intra-min) operation to obtain the so they use a local dense lookup array for each thread and minimum column index min among the threads at the allocate memory space fast for result matrix by page-aligned same warp. Final, the inter-sum(value(min)) is utilized and size-tracked technologies. to obtain the reduction sum of multiple threads. In row-by-row-product SpGEMM, each row of result matrix C can be expressed as c = P a ∗ b , which is Besides, Junhong et al. [70] propose a register-aware sort- i∗ k∈Ii(A) ik k∗ also known as sparse accumulator [70]. In sparse accumulator, based sparse accumulator and a vectorized method for parallel intermediate results for a row are stored in the sparse formats segmented sum [73], where the sparse accumulator exploits the like COO and CSR. Therefore, the insertion operations at N-to-M pattern [74] to sort the data in the register. It is for random positions hold expensive time overhead in SpGEMM the first time that the two methods are implemented for GPU with a sparse accumulator. There are two types of methods registers. for intermediate result merging: sort-based and hash-based 3) Hash-based: The hash-based sparse accumulator first methods. allocates a memory space based on the upper bound estimation as the hash table and uses the column indexes of the inter- A. Multi-core and Many-core Platforms mediate results as the key. Then the hash table is required to be shrunk to a dense state after hash functions. At last, The research on multi-core and many-core platforms mainly sorting the values of each row of the result matrix according focuses on two problems: memory management and load to their column indexes to obtain the final result matrix balance. compressed with sparse format [70]. Result accumulation in 1) CPU-based Optimization: Memory management. 
To cuSPARSE [44] [88] is based on the hash-based sparse accu- exploit data locality, Patwary et al. optimize the traditional mulator. Deveci et al. [75] [47] [58] design a sparse, two-level, sequential SpGEMM algorithm by using dense arrays in [69]. hashmap-based accumulator which supports parallel insertions In order to reduce deallocation cost in SpGEMM, Nagasaka or accumulations from multiple vector lanes. It is stored in a et al. compute the amount of memory required by each thread structure and consists of 4 parallel arrays, column and allocate this amount of thread-private memory in each indexes, and their numerical values in the numeric phase, as thread independently on the Intel KNL processor [51]. They well as beginning indexes of the linked list corresponding to also develop a hash table-based SpGEMM algorithm targeting the hash values and the indexes of the next elements within shared-memory multi-core and many-core processors. For ar- the liked list. In [68], they implement linked list-based and chitectures with reasonably large memory and modest thread linear probing hashmap accumulators utilizing different data counts, Elliott et al. present a single-pass OpenMP variant structures. Junhong et al. [70] optimize the data allocations of Gustavsons SpGEMM algorithm in [25]. They partition of each thread in a hash-based accumulator utilizing GPU the rows of the right-hand side matrix and then perform registers. Besides, Yaar et al. [26] utilize hashmap-based ac- Gustavson algorithm locally using thread-local storage for cumulators in triangle counting. Weber et al. [41] use the hash data and column arrays while constructing the row array in table to store each row of matrix C based on the distributed place. To reduce the overhead of memory allocation, they blocked compress sparse row (BCSR) format. Nagasaka et al. use page-aligned allocations and track used size and capacity [45] utilize the hash table for counting the number of non- separately. To implement efficient SpGEMM for many large- zeros of each row in the symbolic phase and for calculation of scale applications, Chen et al. propose optimized SpGEMM values of result matrix in the numeric phase. In the hash-based kernels based on COO, CSR, ELL, and CSC formats on the sparse accumulator given by [45], Nagasaka et al. utilize the Sunway TaihuLight supercomputer in [59]. In this paper, they vector registers in KNL and multicore architectures for hash use the DMA transmission to access data for CPE core in probing to accelerate the hash-based SpGEMM. the main memory. Beyond doubt, continuous data transfer Besides, Weifeng Liu et al. [21] propose a hybrid parallel using DMA performs better than discrete memory accesses. result merging method. Heapsort-like, bitonic ESC and merge To study the relationship between different multi-level memory methods are employed respectively according to the upper architectures and the performance of SpGEMM , Deveci bound of the number of non-zeros in each row. et al. design different chunking-based SpGEMM algorithms for Intel KNL where the memory subsystems have similar 4) Discussion: Sort-based join and hash-based join have latency [58]. They find that the performance difference of been two popular join algorithms, and the debate over which subsystems decreases with the better temporal and spatial is the best joint algorithm has been going on for decades. locality of SpGEMM. Considering that the accesses to B can Kim et al. 
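For contrast with the sort-based sketch earlier, a hash-style sparse accumulator for one row of C can be illustrated with an ordinary Python dictionary standing in for the open-addressing or linked-list hash tables described above; the actual KNL and GPU implementations differ substantially in how they size, probe, and reuse the table.

```python
def hash_accumulate_row(i, A_indptr, A_indices, A_data,
                        B_indptr, B_indices, B_data):
    """Row i of C = A @ B using a hash map keyed by column index (A, B given as CSR arrays)."""
    table = {}                                           # column index -> partial sum
    for t in range(A_indptr[i], A_indptr[i + 1]):        # non-zero a_ij in row i of A
        j, a_ij = A_indices[t], A_data[t]
        for s in range(B_indptr[j], B_indptr[j + 1]):    # non-zero b_jk in row j of B
            k = B_indices[s]
            table[k] = table.get(k, 0.0) + a_ij * B_data[s]
    return sorted(table.items())                         # shrink to a sorted sparse row

# tiny CSR example: A = [[1, 0, 2], [0, 3, 0]], B = [[0, 1, 0], [4, 0, 0], [5, 0, 6]]
A_indptr, A_indices, A_data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
B_indptr, B_indices, B_data = [0, 1, 2, 4], [1, 0, 0, 2], [1.0, 4.0, 5.0, 6.0]
print(hash_accumulate_row(0, A_indptr, A_indices, A_data, B_indptr, B_indices, B_data))
# row 0 of C: [(0, 10.0), (1, 1.0), (2, 12.0)]
```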
[76] claim that sort-based join will be faster than be irregular depending on the structure of A, a data placement hash-based join when SIMD wide enough through a series of method of storing matrix B as much as possible for KNL is experiments. Albutiu et al. [?] devise a new massively parallel proposed. In [68], Deveci et al. provide heuristics to choose the sort-merge (MPSM) join algorithm, and the experimental best kernel parameters according to problem characteristics, result shows that the MPSM algorithm outperforms competing and present their results for the selection of partitioning hash join. However, the experiments performed by Balkesen schemes and data structures with trade-offs between memory et al. [77] show that the hash-based join is still superior, and access and computational cost. Before the kernel call and sort-based joint algorithms outperform hash-based join only service requests from thousands of threads, a memory pool when vast amounts of data are involved. Therefore, there is is allocated and initialized beforehand. A chunk of memory is no consensus on which join method is the best. returned to the request thread and it will not be released until the thread puts the chunk back to the pool. The parameters of V. ARCHITECTURE ORIENTED OPTIMIZATION the memory pool are architecture-specific and the number of chunks is chosen based on the available concurrency of the Different computer architectures usually have different com- target device. puting and storage properties, so the optimization of SpGEMM Load balance. In [69], matrices are divided into small par- is closely related to the architecture used. In following sub- titions and dynamic load balancing is used over the partitions. sections, we will introduce the related research of optimizing Experiments show that reducing the size of each partition leads SpGEMM based on the multi/many-core platforms, FPGA to better load balance on a system with two Intel Xeon E5- (Field-Programmable Gate Array), heterogeneous computing 2697 V3 processors. They find that better results are achieved platform, and distributed platform respectively. when the total number of partitions is 6-10 times the number of threads. To reduce the overhead of dynamic scheduling, allocate 38 bins and put them into five bin groups. Then, the static scheduling is used in the SpGEMM algorithm on the all rows are assigned to corresponding bins according to the Intel KNL processor [51]. A three-level partitioning scheme is number of the non-zeros. Finally, they allocate a temporary presented to achieve load balancing based on the architecture matrix for the non-zero entries in the result matrix C based on of the Sunway [59]. For the form of C = A × B, on the the size of each bin. Because the rows in the first two bins only first level, the left sparse matrix A is partitioned into multiple require trivial operations, these two bins are excluded from sub-A. Each core group in the SW26010 processor performs subsequent more complex computation on the GPUs. Thus a the multiplication of a sub-A and B. On the second level, B better load balancing can be expected. In [57], Winter et al. is partitioned into multiple sub-B. Each CPE core in a core split the non-zeros of matrix A uniformly, assigning the same group multiplies the sub-A by a sub-B. On the third level, to number of non-zeros to each thread block. 
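The uniform nonzero-splitting scheme attributed to Winter et al. (equal non-zeros per thread block, ignoring row boundaries) reduces, on the host side, to integer arithmetic plus a binary search over the CSR row pointer, which is also how the auxiliary row-lookup information mentioned just below can be produced. The sketch is illustrative, not their implementation.

```python
import numpy as np

def split_nonzeros(indptr, num_blocks):
    """Give each block an equal share of A's non-zeros, ignoring row boundaries.
    Returns (nnz_ranges, first_row), where first_row[b] is the row containing the
    first non-zero assigned to block b."""
    nnz = indptr[-1]
    starts = (np.arange(num_blocks + 1) * nnz) // num_blocks         # nnz index range per block
    first_row = np.searchsorted(indptr, starts[:-1], side="right") - 1
    ranges = [(int(a), int(b)) for a, b in zip(starts[:-1], starts[1:])]
    return ranges, first_row

indptr = np.array([0, 3, 3, 8, 9, 14])      # 5 rows, 14 non-zeros, row 1 empty
ranges, first_row = split_nonzeros(indptr, 4)
print(ranges)       # [(0, 3), (3, 7), (7, 10), (10, 14)]
print(first_row)    # [0 2 2 4]
```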
Though memory fit the 64-KB local data memory in each CPE core, sub-A and requirements from A are static, the actual workload for each sub-B are further partitioned into several sets. block varies based on the generated intermediate products. A 2) GPU-based Optimization: Memory management. Liu static splitting of A0s non-zeros does not require processing, et al. summarize the extra irregularity that an efficient parallel but the second stage still needs to associate each entry from SpGEMM algorithm has to handle in [21]. One of them is A with a row. Therefore, the global load balancing in parallel that the number of non-zeros in the result sparse matrix is checks A0s row pointers and writes the required information unknown in advance. In their paper, memory pre-allocation into an auxiliary array. The cost of this step is negligible for the result matrix is organized using a hybrid approach compared to enumerating temporary products. that saves a large amount of global memory space. Besides, the very limited on-chip scratchpad memory is efficiently B. Field-Programmable Gate Array (FPGA) utilized. To make full use of the high-speed registers and on Related research of SpGEMM on FPGA is quite rare. Lin GPU, Liu et al. design different register-aware SpGEMM algo- et al. are the first to address SpGEMM on FPGAs. In [82], rithms for different sparse accumulators respectively in [70]. the authors design and implement the SpGEMM algorithm These sparse accumulators include sort-based, merge-based, on FPGAs. They study the performance of their design in and hash-based, where merge-based is similar to the merge terms of computational latency, as well as the associated sort-based accumulator mentioned in Section IV. They fully power-delay and energy-delay trade off. Taking advantage of utilize the GPU registers to fetch data, finish computations, the sparsity of the input matrices, their design allows user- and store results out. In [57], Winter et al. propose an adaptive tunable power-delay and energy-delay trade off by employing chunk-based GPU SpGEMM approach (AC-SpGEMM) to cut the different numbers of processing elements in architecture down the computational overhead throughout all stages. They design and different block size in blocking decomposition. perform chunk-based ESC (explicitly expand all temporary Such ability allows designers to employ different on-chip products, sort them and combine them to yield the final resources for different system power-delay and energy-delay matrix) to produce deterministic bit-stable results. With the requirements. Jamro et al. think that indices comparisons help of the local task distribution, this stage performs multiple dominate floating-point operations. Therefore they propose a iterations of ESC, fully utilizing local memory resources highly parallel architecture, which can carry out at least 8×8 while keeping temporary results in scratchpad memory. To indices comparison in a single clock cycle in [83]. achieve the performance portability on massively threaded architectures, Deveci et al. design a hierarchical and memory- C. Heterogeneous Computing Platform efficient SpGEMM algorithm in [47] by introducing two thread Researchers mainly focus on load balancing problems on scalable data structures, a multilevel hashmap and a memory heterogeneous platforms. For heterogeneous systems com- pool. Since the memory subsystems of NVIDIA GPU differ posing CPUs and GPUs, Siegel et al. 
develop a task-based significantly in both bandwidth and latency, two different data programming model and a runtime-supported execution model placements methods are proposed in [58]. One is to keep the to achieve dynamic load balancing in [80]. They partition partitions of A and C in fast memory and stream the partitions the SpGEMM problem in blocks in the form of tasks to of B to fast memory. Another is to keep the partitions of address the unbalancing introduced by the sparse matrix and B in fast memory and stream the partitions of A and C to the heterogeneous platform. For a CPU-GPU heterogeneous fast memory. Which method to be chosen often depends on computing system, Matam et al. propose several heuristic the input matrix properties. To select proper parameters of methods to identify the proper work division of a subclass the memory pool on GPU, Deveci et al. over-estimate the of matrices between CPU and GPU in [53]. Liu et al. propose concurrency to access memory efficiently. They check the a load-balancing method based on output space decomposition available memory and reduce the number of chunks if the in [17]. They assign rows to a fixed number of bins through memory allocation becomes too expensive on GPUs [68]. a much faster linear-time traverse on the CPU. Moreover, Load balancing. To balance the workload of SpGEMM on they decompose the output space in a more detailed way that GPU, a heuristic method is presented to assign the rows of guarantees much more efficient load balancing. Their method the result matrix to multiple bins with different subsequent is always load balanced in all stages for maximizing resource computational methods [21]. In this paper, Liu et al. first utilization of GPUs. Besides, in [81], Rubensson et al. present a method for parallel block SpGEMM on distributed memory the balancing of computations leads to faster task execution. clusters based on their previously proposed Chunks and Tasks They show the validity of their scheduling on the MapReduce model [54]. They use a quad- to parallelization of SpGEMM. Their model leads to tremendous exploit data locality without prior information about the matrix savings in data transfer and consequently achieves up to 4.2x sparsity pattern. The quad-tree representation, combined with speedup for SpGEMM compared with random scheduling. the Chunks and Tasks model, leads to favorable weak and In [46], Demirci et al. design a distributed SpGEMM strong scaling of the communication cost with the number algorithm on Accumolo, which is a highly scalable, distributed of processes. Matrices are represented as sparse quad-trees key-value store built on the design of Googles Bigtable. The of chunk objects, and the leaves in the hierarchy are block- basic idea of this work is that reducing the values for a key sparse sub-matrices. Sparsity is dynamically detected and may locally before they are sent to a remote node. They use a tablet occur at any level in the hierarchy and/or within the sub-matrix server to iterate through multiple rows in sub-A and create leaves. In case GPU is available, this design uses both CPUs batches of rows to be processed together before performing and GPUs for leaf-level multiplication, thus making full use any computation or a batch-scan operation. Batch processing of the computing capacity of each node. of rows significantly reduces the latency and communication volume among servers, because both the number of lookup D. 
Distributed Platform operations and the redundant retrieval of rows of B are The research of SpGEMM on the cluster system mainly reduced. focuses on the minimizing of inter-node communication. In VI.PERFORMANCE EVALUATION [78], Jin et al. conduct a series of experiments to study the data communication issue of SpGEMM on the PC cluster. They A. Overview find that due to communication overheads, it is difficult for The performance of the SpGEMM kernel is dominated by clusters to exploit low-level parallelism. Thus, to successfully several factors, including sparse format, matrix sparsity, and achieve computation parallel, it is critical to exploit high-level computational device. It is extremely challenging to figure out parallelism sufficiently. Increasing the degree of high-level the exact relationship between the SpGEMM kernel and these parallelism makes easier the task of balancing the workload. factors. Dynamic scheduling algorithms always have less unbalanced In this section, we conduct a series of experiments to overhead than static scheduling algorithms, no matter uni- compare the SpGEMM performance of several state-of-art casting or multicasting is used for data communication. In approaches, and the non-zeros of sparse matrices are cached [42], for outer-product-parallel SpGEMM, Akbudak et al. and involved in calculation in double . Our tested propose three hypergraph models that achieve simultaneous libraries include Intel (MKL) [89], partitioning of input and output matrices without any repli- NVIDIA CUDA Sparse Library (cuSPARSE) [88], CSPARSE cation of input data. All three hypergraph models perform [91], CUSP [90], bhSPARSE [17] [21] and KokkosKernels conformable 1D column-wise and row-wise partitioning of the [87]. input matrices A and B, respectively. The first hypergraph Intel MKL is a highly-parallel closed-source library for model performs two-dimensional nonzero-based partitioning Intel processors. One of its sub-library, named spblas, includes of the output matrix, whereas the second and third models standard C and Fortran API for sparse linear algebra. It perform 1D row-wise and column-wise partitioning of the uses a highly encapsulated internal data structure to perform output matrix, respectively. This partitioning scheme leads to a operations upon sparse matrices. We call mkl sparse sp2m two-phase parallel SpGEMM algorithm: communication-free to perform SpGEMM, and both input matrices are encoded local SpGEMM computations and the multiple single-node- with either CSR or BSR. This interface supports two-stage accumulation operations on the local SpGEMM results. execution. The row pointer array of the output matrix is In [79], Bethune et al. study the performance of SpGEMM calculated in the first stage. In the second stage, the remaining using the open-source sparse linear algebra library DBCSR column indexes and value arrays of the output matrix are (Distributed Block Compressed Sparse Row) on different calculated. At the same time, We record the execution time distributed systems. DBCSR matrices are stored in a blocked of these two stages. compressed sparse row (CSR) format distributed over a two- cuSPARSE is a closed-source library provided by NVIDIA. dimensional grid of P MPI processes. Inter-process com- It contains a set of basic linear algebra subroutines used for munication is based on the communication-reducing 2.5D handling sparse matrices. In our test program, input matri- algorithm. 
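The CPU-side experiments in this section amount to timing one SpGEMM call on matrices read from disk. For reference, the sketch below shows a harness of that shape using scipy's built-in SpGEMM as a stand-in for the library under test; scipy is not one of the evaluated libraries, and the file path is a placeholder.

```python
import time
import scipy.io
import scipy.sparse as sp

def time_spgemm(path, repeats=5):
    """Read a MatrixMarket file and time C = A * A (squaring A is a common SpGEMM benchmark)."""
    A = sp.csr_matrix(scipy.io.mmread(path))      # COO from disk -> CSR, as the evaluated libraries do
    best, nnz_C = float("inf"), 0
    for _ in range(repeats):
        t0 = time.perf_counter()
        C = A @ A                                  # the SpGEMM call under test
        best = min(best, time.perf_counter() - t0)
        nnz_C = C.nnz
    return best, nnz_C

if __name__ == "__main__":
    # "matrix.mtx" is a placeholder path, e.g. a matrix downloaded from SuiteSparse beforehand
    seconds, nnz_C = time_spgemm("matrix.mtx")
    print(f"best of 5: {seconds:.4f} s, nnz(C) = {nnz_C}")
```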
VI. PERFORMANCE EVALUATION

A. Overview

The performance of the SpGEMM kernel is dominated by several factors, including the sparse format, the matrix sparsity, and the computational device. It is extremely challenging to figure out the exact relationship between the SpGEMM kernel and these factors.

In this section, we conduct a series of experiments to compare the SpGEMM performance of several state-of-the-art approaches; the non-zeros of the sparse matrices are cached and involved in the calculation in double precision. The tested libraries include the Intel Math Kernel Library (MKL) [89], the NVIDIA CUDA Sparse Library (cuSPARSE) [88], CSPARSE [91], CUSP [90], bhSPARSE [17] [21], and KokkosKernels [87].

Intel MKL is a highly parallel closed-source library for Intel processors. One of its sub-libraries, named spblas, includes standard C and Fortran APIs for sparse linear algebra. It uses a highly encapsulated internal data structure to perform operations on sparse matrices. We call mkl_sparse_sp2m to perform SpGEMM, and both input matrices are encoded with either CSR or BSR. This interface supports two-stage execution: the row pointer array of the output matrix is calculated in the first stage, and the remaining column index and value arrays of the output matrix are calculated in the second stage. We record the execution time of these two stages.

cuSPARSE is a closed-source library provided by NVIDIA. It contains a set of basic linear algebra subroutines for handling sparse matrices. In our test program, the input matrices encoded with the COO format are first converted into the CSR format by calling cusparseXcoo2csr; then we call cusparseXcsrgemmNnz and cusparseDcsrgemm in turn to complete a two-stage SpGEMM, which is similar to MKL. The execution time of each step is also recorded.

CSPARSE is an open-source, single-threaded sparse matrix package written in C. It uses the CSC format to store sparse matrices. We call cs_multiply to calculate the product of two sparse matrices and also record the execution time of this procedure.
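The two-stage mkl_sparse_sp2m pattern described above looks roughly as follows. This is a sketch under the assumption that A and B have already been created as CSR handles; the request-stage enumerators are quoted from memory and should be checked against the MKL documentation.

```cpp
// Sketch of the two-stage mkl_sparse_sp2m call pattern (inspector-executor
// Sparse BLAS). Stage names are assumptions; verify with the MKL reference.
#include <mkl_spblas.h>

// A and B are CSR handles previously created with mkl_sparse_d_create_csr.
sparse_matrix_t spgemm_two_stage(sparse_matrix_t A, sparse_matrix_t B) {
  matrix_descr general;
  general.type = SPARSE_MATRIX_TYPE_GENERAL;

  sparse_matrix_t C = nullptr;
  // Stage 1 (symbolic): computes only the row pointer array of C.
  mkl_sparse_sp2m(SPARSE_OPERATION_NON_TRANSPOSE, general, A,
                  SPARSE_OPERATION_NON_TRANSPOSE, general, B,
                  SPARSE_STAGE_NNZ_COUNT, &C);
  // Stage 2 (numeric): fills the remaining column index and value arrays.
  mkl_sparse_sp2m(SPARSE_OPERATION_NON_TRANSPOSE, general, A,
                  SPARSE_OPERATION_NON_TRANSPOSE, general, B,
                  SPARSE_STAGE_FINALIZE_MULT, &C);
  return C;  // export with mkl_sparse_d_export_csr, release with mkl_sparse_destroy
}
```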

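For cuSPARSE, the two calls named above belong to the legacy csrgemm path (available up to CUDA 10.x and removed from later toolkits in favour of cusparseSpGEMM); the sketch below shows how the size-prediction and numeric stages fit together. It is illustrative only: all array arguments are assumed to be valid device pointers and error handling is omitted.

```cpp
// Sketch of the legacy cuSPARSE two-stage csrgemm path (CUDA 10.x era).
#include <cuda_runtime.h>
#include <cusparse.h>

// C (m x n) = A (m x k) * B (k x n), all in CSR with 0-based indices.
void csr_spgemm_legacy(cusparseHandle_t handle, int m, int n, int k,
                       const cusparseMatDescr_t descrA, int nnzA,
                       const double* valA, const int* rowPtrA, const int* colIndA,
                       const cusparseMatDescr_t descrB, int nnzB,
                       const double* valB, const int* rowPtrB, const int* colIndB,
                       const cusparseMatDescr_t descrC,
                       double** valC, int** rowPtrC, int** colIndC, int* nnzC) {
  cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST);
  cudaMalloc((void**)rowPtrC, sizeof(int) * (m + 1));

  // Stage 1: size prediction -- computes rowPtrC and the total nnz of C.
  cusparseXcsrgemmNnz(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                      CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, k,
                      descrA, nnzA, rowPtrA, colIndA,
                      descrB, nnzB, rowPtrB, colIndB,
                      descrC, *rowPtrC, nnzC);

  cudaMalloc((void**)colIndC, sizeof(int) * (*nnzC));
  cudaMalloc((void**)valC, sizeof(double) * (*nnzC));

  // Stage 2: numeric multiplication -- fills colIndC and valC.
  cusparseDcsrgemm(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, k,
                   descrA, nnzA, valA, rowPtrA, colIndA,
                   descrB, nnzB, valB, rowPtrB, colIndB,
                   descrC, *valC, *rowPtrC, *colIndC);
}
```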
TABLE II: Detailed information of the evaluated libraries.

Library        | Version  | Compiler       | Open source | Size prediction | Accumulator                 | Supported format
MKL            | 2019     | icc 19.0.3.199 | No          | precise         | -                           | CSC/CSR/BSR
cuSPARSE       | 10.0.130 | nvcc 10.0.130  | No          | precise         | hash-based                  | CSR
CSPARSE        | 3.1.4    | icc 19.0.3.199 | Yes         | progressive     | dense                       | CSC
CUSP           | 0.5.1    | nvcc 8.0.61    | Yes         | upper bound     | sort-based                  | COO/CSR
bhSPARSE       | Latest   | nvcc 8.0.61    | Yes         | hybrid          | hybrid                      | CSR
KokkosKernels  | 2.8.00   | nvcc 10.0.130  | Yes         | precise         | CPU: dense; GPU: hash-based | CSR

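As a usage illustration of the CSPARSE interface described above, the following small program builds a matrix in triplet form, compresses it to CSC, and multiplies it by itself with cs_multiply. The toy 2x2 data is ours and is not part of the benchmark; cs.h is the plain C header of CSparse (SuiteSparse).

```cpp
// Minimal CSPARSE (CSparse) usage sketch: triplet -> CSC -> C = A * A.
extern "C" {
#include "cs.h"
}

int main() {
  // A = [1 2; 0 3] entered as (row, col, value) triplets.
  cs* T = cs_spalloc(2, 2, 4, 1, 1);   // m, n, nzmax, values, triplet
  cs_entry(T, 0, 0, 1.0);
  cs_entry(T, 0, 1, 2.0);
  cs_entry(T, 1, 1, 3.0);
  cs* A = cs_compress(T);              // triplet -> compressed-column (CSC)
  cs_spfree(T);

  cs* C = cs_multiply(A, A);           // C = A * A
  cs_print(C, 0);                      // brief = 0 prints the full matrix

  cs_spfree(A);
  cs_spfree(C);
  return 0;
}
```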
CUSP is an open-source library for sparse basic linear algebra and graph computations based on Thrust and written in C++. It provides a flexible API for users and is highly encapsulated. We call multiply to calculate the product of two sparse matrices encoded with the COO format and record the execution time of this call.

bhSPARSE is an open-source library that provides sparse linear algebra interfaces for sparse matrix computations on heterogeneous parallel processors. A SpGEMM implementation based on the CSR format is included in this library, and it is claimed to be efficient and general for the multiplication of irregular sparse matrices. It is composed of four stages: calculating an upper bound, binning, computing the result matrix, and arranging data. A detailed introduction can be found in paper [21].

KokkosKernels implements local computational kernels for linear algebra and graph operations using the Kokkos shared-memory parallel programming model [85]. It also supports two-stage execution and records the execution times of the symbolic stage and the numeric stage, respectively. Moreover, we test its performance on the Intel multi-core platform and the NVIDIA GPU platform, respectively.

B. System Setup

The computer used for the experiments runs a 64-bit Ubuntu operating system. The host memory is 16GB. The CPU of this machine is an Intel Core i7-4790K with a 4GHz clock frequency. The GPU is an NVIDIA GeForce GTX 1080Ti, which is based on the Pascal architecture and is equipped with 3584 CUDA cores and 11GB of device memory. The versions, compilers, open-source information, size prediction methods, intermediate result accumulators, and supported formats of the evaluated libraries are set out in Table II.

C. Benchmark

The sparse matrices used in our experiments are from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) [86]. We choose square matrices to perform the SpGEMM kernel C = A × A. The corresponding dataset is shown in Fig. 3. The dimensions of the matrices range from 14K to 1M, and the numbers of non-zeros range from 0.4M to 11.5M. These matrices come from many applications, including the finite element method (FEM), quantum chromodynamics (QCD), protein data, electromagnetics, circuit simulation, directed graphs, structural problems, and web connectivity. Besides, it can be seen from Fig. 3 that the matrices on the left have a regular distribution, with most of their non-zero elements distributed near the main diagonal, while the sparse matrices on the right have an irregular non-zero distribution.

Fig. 3: The detailed information of the dataset.
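Returning to the CUSP interface described above, its SpGEMM call is a single templated function; the sketch below multiplies a COO matrix by itself on the device. The 5-point Poisson generator is used only to have self-contained input data and is not part of our benchmark.

```cpp
// Sketch of the CUSP call path: C = A * A with device COO matrices (nvcc).
#include <cusp/coo_matrix.h>
#include <cusp/multiply.h>
#include <cusp/gallery/poisson.h>

int main() {
  cusp::coo_matrix<int, double, cusp::device_memory> A, C;
  cusp::gallery::poisson5pt(A, 64, 64);  // sparse matrix of a 64x64 grid
  cusp::multiply(A, A, C);               // C = A * A
  return 0;
}
```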
D. Evaluation Result

We conduct three groups of experiments on the benchmark to evaluate the performance of the aforementioned libraries.

1) Sparse Storage Format: Although a variety of storage formats have been proposed for sparse matrices, the existing SpGEMM implementations involve only a few of them. CSR is the most commonly used sparse format, and it is used in the SpGEMM implementations of CUSP, MKL, cuSPARSE, bhSPARSE, and KokkosKernels. Besides, MKL supports the multiplication of two sparse matrices encoded with the BSR format, and CSPARSE only supports the CSC format.

In this experiment, we compare the SpGEMM performance in MKL, where the involved sparse matrices are encoded with the CSR or BSR format. The SpGEMM kernel in MKL is composed of four phases: format conversion (COO to CSR/BSR), stage1 (the first stage mentioned earlier), stage2 (the second stage mentioned earlier), and format export (exporting the CSR/BSR matrix from the internal representation). Fig. 4 and Fig. 5 provide the proportion of execution time of each phase, where all involved sparse matrices are encoded with the CSR and BSR formats, respectively.

Fig. 4: The proportion of execution time of each phase. The involved matrices are encoded with the CSR format. Format conversion, stage1, stage2 and format export represent the format conversion from COO to CSR, the calculation of the row pointer array of the output matrix, the calculation of the remaining column index and value arrays of the output matrix, and the format export from the internal representation to the CSR format, respectively.

It can be seen from Fig. 4 that when the involved matrices are stored in the CSR format, the execution time of the format conversion takes up a small proportion for most sparse matrices. The execution time of the last three phases varies from matrix to matrix. Both stage1 and stage2 account for a larger proportion than the other phases, while the execution time of the format export phase varies greatly. From Fig. 5, we can see that the execution time of the stage1 phase takes up a small ratio for most sparse matrices, while the other three phases account for a larger proportion.

Fig. 5: The proportion of execution time of each phase. The involved matrices are encoded with the BSR format. Format conversion, stage1, stage2 and format export represent the format conversion from COO to BSR, the calculation of the row pointer array of the output matrix, the calculation of the remaining column index and value arrays of the output matrix, and the format export from the internal representation to the BSR format, respectively.

A comparison of the execution time of each phase based on different sparse formats is set out in Fig. 6. From Fig. 6(a) we can see that it takes a longer time to convert a matrix from COO to BSR than to CSR for every sparse matrix, and the time difference is huge for some sparse matrices. As can be seen from Fig. 6(b), stage1 based on the CSR format, which computes the row pointer array of the output matrix, takes a longer time than that based on the BSR format on most sparse matrices. From Fig. 6(c) we can see that the SpGEMM based on the BSR format gives better performance than the SpGEMM based on the CSR format on most irregular sparse matrices, while there is no specific winner for the regular sparse matrices. Fig. 6(d) indicates that there is no absolute advantage in exporting the internal representation to CSR or BSR.

The final performance comparison between CSR-based and BSR-based SpGEMM is presented in Fig. 7. It can be seen that neither CSR-based nor BSR-based SpGEMM shows an overwhelming advantage. The specific performance of SpGEMM has a complex relationship with the chosen sparse format and the matrix sparsity.
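The per-phase proportions reported in Figs. 4 and 5 only require wall-clock timing of each phase in isolation; a small harness of the kind sketched below (our own helper, not library code) is sufficient. For GPU libraries the device must be synchronized before each timestamp is taken.

```cpp
// Sketch of a per-phase timing harness used to derive phase proportions.
#include <chrono>
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

double time_ms(const std::function<void()>& phase) {
  // For device-side phases, call cudaDeviceSynchronize() inside `phase`
  // (or around it) so the measured interval covers the actual work.
  auto t0 = std::chrono::steady_clock::now();
  phase();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

void report_proportions(const std::vector<std::pair<const char*, std::function<void()>>>& phases) {
  std::vector<double> ms;
  double total = 0.0;
  for (const auto& p : phases) { ms.push_back(time_ms(p.second)); total += ms.back(); }
  for (size_t i = 0; i < phases.size(); ++i)
    std::printf("%-18s %8.3f ms  (%5.1f%%)\n", phases[i].first, ms[i], 100.0 * ms[i] / total);
}
```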

Fig. 6: Comparison of the execution time of each phase in different sparse formats, including CSR and BSR. (a) Format conversion. (b) Stage1. (c) Stage2. (d) Format export.

Fig. 7: Performance comparison of the SpGEMM implementation in MKL where the involved sparse matrices are compressed with the CSR or BSR format.

2) SpGEMM Implementation: In this section, we compare the performance of different SpGEMM implementations. First, we compare the performance of each stage of the libraries whose SpGEMM implementation predicts the number of non-zeros of the output matrix using the precise method. Among the existing libraries, MKL, cuSPARSE, and KokkosKernels utilize the precise method in the first symbolic phase.

Fig. 8 presents the speedup of the execution time of cuSPARSE and KokkosKernels over MKL (SpGEMM based on CSR) in the symbolic phase. The SpGEMM in KokkosKernels encounters runtime errors on the mac_econ_fwd500, scircuit, and webbase-1M matrices, so their values are set to 0 (the same below). From Fig. 8 we can see that for most regular matrices, the implementation of the symbolic phase in cuSPARSE is better than the other libraries, while the CUDA-based implementation of the symbolic phase in KokkosKernels is the best for most irregular matrices. Besides, MKL and the OpenMP-based implementation of the symbolic phase in KokkosKernels show worse performance than the other libraries.

The speedup of the execution time of cuSPARSE and KokkosKernels over MKL in the second numeric phase can be compared in Fig. 9. It can be seen that the CUDA-based implementation of the numeric phase in KokkosKernels outperforms cuSPARSE and OpenMP-based KokkosKernels on every sparse matrix except the three problematic matrices and the matrix cant. Especially on the irregular sparse matrices, its performance advantage is considerable.

Fig. 8: The speedup of the execution time of cuSPARSE and of CUDA-based and OpenMP-based KokkosKernels over MKL (based on CSR) in the symbolic phase.
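The "precise method" referred to above can be implemented as a separate symbolic pass. The sketch below is a generic Gustavson-style counting pass (not code from any of the evaluated libraries) that marks the columns touched in each row of C = A * B and produces the exact row pointer array of C.

```cpp
// Precise (symbolic) size prediction: exact nnz per row of C from CSR inputs.
#include <vector>

std::vector<int> symbolic_row_ptr(int m, int n,
                                  const std::vector<int>& a_row_ptr,
                                  const std::vector<int>& a_col_idx,
                                  const std::vector<int>& b_row_ptr,
                                  const std::vector<int>& b_col_idx) {
  std::vector<int> row_ptr(m + 1, 0);
  std::vector<int> marker(n, -1);                // last row that touched column j
  for (int i = 0; i < m; ++i) {
    int count = 0;
    for (int p = a_row_ptr[i]; p < a_row_ptr[i + 1]; ++p) {
      const int k = a_col_idx[p];
      for (int q = b_row_ptr[k]; q < b_row_ptr[k + 1]; ++q) {
        const int j = b_col_idx[q];
        if (marker[j] != i) { marker[j] = i; ++count; }  // count each column once
      }
    }
    row_ptr[i + 1] = row_ptr[i] + count;         // exact nnz of row i of C
  }
  return row_ptr;
}
```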

Fig. 9: The speedup of cuSPARSE and of CUDA-based and OpenMP-based KokkosKernels over MKL (based on CSR) in the numeric phase.

Fig. 10 presents the proportion of execution time of the symbolic and numeric phases in different SpGEMM implementations, where MKL_1 and MKL_2 represent the symbolic and numeric phases in MKL, respectively, and the others are similar. It can be seen from Fig. 10 that the execution time of the symbolic phase is less than or equal to that of the numeric phase on every sparse matrix in cuSPARSE and OpenMP-based KokkosKernels. Moreover, there are 8 and 12 matrices out of the whole 20 matrices whose time proportions of the symbolic phase are less than or equal to 30% in cuSPARSE and OpenMP-based KokkosKernels, respectively. In MKL and CUDA-based KokkosKernels, however, there are 11 and 16 out of 17 matrices, respectively, whose time proportions of the symbolic phase are larger than 30%, and the time proportions of 3 and 7 out of the 17 matrices, respectively, are larger than 50%.

The final performance of the SpGEMM implementations in the different libraries is summarised in Fig. 11. It can be seen that CUDA-based KokkosKernels, cuSPARSE, bhSPARSE, and OpenMP-based KokkosKernels outperform the other SpGEMM implementations on 12, 5, 2, and 1 sparse matrices, respectively.

VII. CHALLENGES AND FUTURE WORK

SpGEMM has gained much attention in the past decades, and work has been conducted in many directions. We believe that it will be used in many other fields with the development of IoT, big data, and AI. A systematic literature review sheds light on the potential research directions and on some challenges that require further study.

Fig. 10: The proportion of execution time of the symbolic and numeric phases in different SpGEMM implementations: (a) MKL, (b) cuSPARSE, (c) CUDA-based KokkosKernels, (d) OpenMP-based KokkosKernels. MKL_1 and MKL_2 represent the symbolic and numeric phases in MKL, respectively, and the others are similar.

Fig. 11: Performance comparison of the SpGEMM implementations in CSPARSE, MKL, CUSP, bhSPARSE, cuSPARSE, and KokkosKernels. To be fair, the results in this figure do not include the time cost of format conversion and format export, and the performance of MKL in this figure is based on the CSR format.

One of the challenges of SpGEMM is exploiting the sparsity of the sparse matrices. Researchers find that the distribution of the non-zero elements dominates the performance of SpMV and SpGEMM on the same architecture. To further explore the potential performance improvement, automatic format selection algorithms have been designed for SpMV based on machine learning models. Significant performance improvement has been observed using auto-tuning and format selection based on these models. However, the situation for SpGEMM is more complicated because two sparse matrices are involved. The performance depends not only on the sparsity of the two input matrices but also on the combination of the sub-matrices partitioned from the input matrices.

Size prediction of the result matrix is also a challenge for SpGEMM. Upper-bound prediction may result in memory over-allocation, and other methods may lead to memory re-allocation. Both over-allocation and re-allocation are challenging for acceleration devices, such as GPUs, DSPs, FPGAs, and TPUs. On-device memory capacity is limited and may not accommodate such a large amount of data. It also takes time to copy data to and from device memory because of the separated memory spaces. Precise size prediction not only reduces the overhead of memory management but also indirectly improves the performance of result accumulation, because a smaller memory footprint and little re-allocation are required for manipulating the output matrix.

Finally, heterogeneous architecture-oriented optimization is a challenge for diversified systems. The CPU is good at processing complicated logic, while the GPU is good at dense computations. Besides, DSP and FPGA accelerators may be used in different systems. One of the critical questions in porting SpGEMM to a heterogeneous system is how to achieve load balance and minimize the communication traffic. Moreover, each sub-matrix may have its own non-zero distribution. Ideally, relatively dense blocks should be mapped to acceleration devices, while super-sparse blocks should be mapped to the CPU. Only in this way can each device give full play to its architecture and optimize the overall computing performance of SpGEMM.

VIII. CONCLUSION

The design of an efficient SpGEMM algorithm, as well as its implementation, is critical to many large-scale scientific applications. Therefore, it is not surprising that SpGEMM has attracted much attention from researchers over the years. In this survey, we highlight some of the developments in recent years and emphasize the applications, challenging problems, architecture-oriented optimizations, and performance evaluation. In concluding, we stress the fact that despite recent progress, there are still important areas where much work remains to be done, such as support for different formats, heterogeneous architecture-oriented optimization, and efficient size prediction of the target matrix. On the other hand, heterogeneous architectures may give rise to new optimization opportunities, and efficient implementation on specific architectures is within reach and can be expected to continue in the near future.

REFERENCES

[1] J. Georgii and R. Westermann, "A streaming approach for sparse matrix products and its application in Galerkin multigrid methods," Electronic Transactions on Numerical Analysis, vol. 37, no. 263-275, pp. 3–5, 2010.
[2] N. Bell, S. Dalton, and L. N. Olson, "Exposing fine-grained parallelism in algebraic multigrid methods," SIAM J. Scientific Computing, vol. 34, no. 4, 2012. [Online]. Available: https://doi.org/10.1137/110838844
[3] G. Ballard, C. Siefert, and J. Hu, "Reducing communication costs for sparse matrix multiplication within algebraic multigrid," SIAM Journal on Scientific Computing, vol. 38, no. 3, pp. C203–C231, 2016. [Online]. Available: https://doi.org/10.1137/15M1028807
[4] A. Bienz, "Reducing communication in sparse solvers," Ph.D. dissertation, University of Illinois at Urbana-Champaign, Sep. 2018.
[5] T. A. Davis, "Graph algorithms via SuiteSparse:GraphBLAS: triangle counting and k-truss," in 2018 IEEE High Performance extreme Computing Conference (HPEC), Sep. 2018, pp. 1–6.
[6] J. Cohen, "Graph twiddling in a MapReduce world," Computing in Science and Engg., vol. 11, no. 4, pp. 29–41, Jul. 2009. [Online]. Available: https://doi.org/10.1109/MCSE.2009.120
[7] M. M. Wolf, M. Deveci, J. W. Berry, S. D. Hammond, and S. Rajamanickam, "Fast linear algebra-based triangle counting with KokkosKernels," in 2017 IEEE High Performance Extreme Computing Conference (HPEC), Sep. 2017, pp. 1–7.
[8] A. Azad, A. Buluç, and J. Gilbert, "Parallel triangle counting and enumeration using matrix algebra," in 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, May 2015, pp. 804–811.
[9] J. R. Gilbert, S. Reinhardt, and V. B. Shah, "High-performance graph algorithms from parallel sparse matrices," in International Workshop on Applied Parallel Computing. Springer, 2006, pp. 260–269.
[10] M. Then, M. Kaufmann, F. Chirigati, T.-A. Hoang-Vu, K. Pham, A. Kemper, T. Neumann, and H. T. Vo, "The more the merrier: Efficient multi-source graph traversal," Proc. VLDB Endow., vol. 8, no. 4, pp. 449–460, Dec. 2014. [Online]. Available: http://dx.doi.org/10.14778/2735496.2735507
[11] A. Buluç and J. R. Gilbert, "The combinatorial BLAS: design, implementation, and applications," International Journal of High Performance Computing Applications, vol. 25, no. 4, pp. 496–509, 2011. [Online]. Available: https://doi.org/10.1177/1094342011403516
[12] T. M. Chan, "More algorithms for all-pairs shortest paths in weighted graphs," in Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing, ser. STOC '07. New York, NY, USA: ACM, 2007, pp. 590–598. [Online]. Available: http://doi.acm.org/10.1145/1250790.1250877
[13] H. Kaplan, M. Sharir, and E. Verbin, "Colored intersection searching via sparse rectangular matrix multiplication," in Proceedings of the Twenty-second Annual Symposium on Computational Geometry. ACM, 2006, pp. 52–60.
[14] M. Deveci, E. G. Boman, K. D. Devine, and S. Rajamanickam, "Parallel graph coloring for manycore architectures," in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2016, pp. 892–901.
[15] V. Vassilevska, R. Williams, and R. Yuster, "Finding heaviest h-subgraphs in real weighted graphs, with applications," CoRR, vol. abs/cs/0609009, 2006. [Online]. Available: http://arxiv.org/abs/cs/0609009
[16] A. Buluç and K. Madduri, "Parallel breadth-first search on distributed memory systems," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11. New York, NY, USA: ACM, 2011, pp. 65:1–65:12. [Online]. Available: http://doi.acm.org/10.1145/2063384.2063471
[17] W. Liu and B. Vinter, "A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors," J. Parallel Distrib. Comput., vol. 85, no. C, pp. 47–61, Nov. 2015. [Online]. Available: http://dx.doi.org/10.1016/j.jpdc.2015.06.010
[18] F. G. Gustavson, "Two fast algorithms for sparse matrices: Multiplication and permuted transposition," ACM Trans. Math. Softw., vol. 4, pp. 250–269, 1978.
[19] I. S. Duff, "A survey of sparse matrix research," Proceedings of the IEEE, vol. 65, no. 4, pp. 500–535, April 1977.
[20] M. Grossman, C. Thiele, M. Araya-Polo, F. Frank, F. O. Alpak, and V. Sarkar, "A survey of sparse matrix-vector multiplication performance on large matrices," CoRR, vol. abs/1608.00636, 2016. [Online]. Available: http://arxiv.org/abs/1608.00636
[21] W. Liu and B. Vinter, "An efficient GPU general sparse matrix-matrix multiplication for irregular data," in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014, pp. 370–381. [Online]. Available: https://doi.org/10.1109/IPDPS.2014.47
[22] B. Kitchenham, "Procedures for performing systematic reviews," Keele, UK, Keele Univ., vol. 33, 08 2004.
[23] R. D. Falgout, "An introduction to algebraic multigrid," Computing in Science & Engineering, vol. 8, no. 6, pp. 24–33, Nov 2006.
[24] W. L. Briggs, V. E. Henson, and S. F. McCormick, A Multigrid Tutorial: Second Edition. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2000.
[25] J. J. Elliott and C. M. Siefert, "Low thread-count Gustavson: A multi-threaded algorithm for sparse matrix-matrix multiplication using perfect hashing," in 2018 IEEE/ACM 9th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (scalA), Nov 2018, pp. 57–64.
[26] A. Yaşar, S. Rajamanickam, M. Wolf, J. Berry, and Ü. V. Çatalyürek, "Fast triangle counting using cilk," in 2018 IEEE High Performance extreme Computing Conference (HPEC), Sep. 2018, pp. 1–7.
[39] M. O. Rabin and V. V. Vazirani, "Maximum matchings in general graphs through randomization," J. Algorithms, vol. 10, no. 4, pp. 557–567, Dec. 1989. [Online]. Available: https://doi.org/10.1016/0196-6774(89)90005-9
[40] K. Censor-Hillel, P. Kaski, J. H. Korhonen, C. Lenzen, A. Paz, and J. Suomela, "Algebraic methods in the congested clique," CoRR, vol. abs/1503.04963, 2015. [Online]. Available: http://arxiv.org/abs/1503.04963
[41] V. Weber, T. Laino, A. Pozdneev, I. Fedulova, and A. Curioni, "Semiempirical molecular dynamics (SEMD) I: Midpoint-based parallel sparse matrix-matrix multiplication algorithm for matrices with decay," Journal of Chemical Theory and Computation, vol. 11, no. 7, pp. 3145–3152, 2015, PMID: 26575751. [Online]. Available: https://doi.org/10.1021/acs.jctc.5b00382
[42] K. Akbudak and C. Aykanat, "Simultaneous input and output matrix partitioning for outer-product-parallel sparse matrix-matrix multiplication," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C568–C590, 2014. [Online]. Available: https://doi.org/10.1137/13092589X
[43] F. Gremse, A. Höfter, L. Schwen, F. Kiessling, and U. Naumann, "GPU-accelerated sparse matrix-matrix multiplication by iterative row merging," SIAM Journal on Scientific Computing, vol. 37, no. 1, pp. C54–C71, 2015. [Online]. Available: https://doi.org/10.1137/130948811
[27] M. M. Wolf, J. W. Berry, and D. T. Stark, "A task-based linear algebra
[44] J.
Demouth, “Sparse matrix-matrix multiplication on the gpu,” in GPU building blocks approach for scalable graph analytics,” in 2015 IEEE Technology Conference 2012, 2012. High Performance Extreme Computing Conference (HPEC), Sep. 2015, [45] Y. Nagasaka, A. Nukada, and S. Matsuoka, “High-performance and pp. 1–6. memory-saving sparse general matrix-matrix multiplication for nvidia [28] L. Wang, Y. Wang, C. Yang, and J. D. Owens, “A comparative study pascal gpu,” in 2017 46th International Conference on Parallel Pro- on exact triangle counting algorithms on the gpu,” in Proceedings cessing (ICPP), Aug 2017, pp. 101–110. of the ACM Workshop on High Performance Graph Processing, ser. [46] G. V. Demirci and C. Aykanat, “Scaling sparse matrix-matrix HPGP ’16. New York, NY, USA: ACM, 2016, pp. 1–8. [Online]. multiplication in the accumulo database,” Distributed and Parallel Available: http://doi.acm.org/10.1145/2915516.2915521 Databases, Jan 2019. [Online]. Available: https://doi.org/10.1007/ [29] Y. Zhang, L. Wu, G. Wei, and S. Wang, “A novel algorithm s10619-019-07257-y for all pairs shortest path problem based on matrix multiplication [47] M. Deveci, C. Trott, and S. Rajamanickam, “Performance-portable Digital Signal Processing and pulse coupled neural network,” , sparse matrix-matrix multiplication for many-core architectures,” vol. 21, no. 4, pp. 517 – 521, 2011. [Online]. Available: in 2017 IEEE International Parallel and Distributed Processing http://www.sciencedirect.com/science/article/pii/S1051200411000650 Symposium Workshops, IPDPS Workshops 2017, Orlando / Buena [30] D. Eppstein and E. S. Spiro, “The h-index of a graph and its application Vista, FL, USA, May 29 - June 2, 2017, 2017, pp. 693–702. [Online]. to dynamic subgraph statistics,” in Proceedings of the 11th International Available: https://doi.org/10.1109/IPDPSW.2017.8 Symposium on Algorithms and Data Structures, ser. WADS ’09. Berlin, [48] E. Cohen, “Size-estimation framework with applications to transitive Heidelberg: Springer-Verlag, 2009, pp. 278–289. closure and reachability,” J. Comput. Syst. Sci., vol. 55, no. 3, pp. [31] S. Samsi, V. Gadepally, M. B. Hurley, M. Jones, E. K. Kao, 441–453, Dec. 1997. [Online]. Available: http://dx.doi.org/10.1006/jcss. S. Mohindra, P. Monticciolo, A. Reuther, S. Smith, W. S. Song, 1997.1534 D. Staheli, and J. Kepner, “Static graph challenge: Subgraph [49] R. R. Amossen, A. Campagna, and R. Pagh, “Better size estimation for isomorphism,” CoRR, vol. abs/1708.06866, 2017. [Online]. Available: sparse matrix products,” 6 2010. http://arxiv.org/abs/1708.06866 [32] V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. A. Miller, [50] R. Pagh and M. Stockel,¨ “The input/output complexity of sparse matrix and J. Kepner, “Graphulo: Linear algebra graph kernels for nosql multiplication,” CoRR, vol. abs/1403.3551, 2014. [Online]. Available: databases,” CoRR, vol. abs/1508.07372, 2015. [Online]. Available: http://arxiv.org/abs/1403.3551 http://arxiv.org/abs/1508.07372 [51] Y. Nagasaka, S. Matsuoka, A. Azad, and A. Bulu, “High-performance [33] C. E. Tsourakakis, P. Drineas, E. Michelakis, I. Koutis, and C. Faloutsos, sparse matrix-matrix products on intel knl and multicore architectures.” “Spectral counting of triangles via element-wise sparsification and [52] J. Gilbert, C. Moler, and R. Schreiber, “Sparse matrices in : triangle-based link recommendation,” Social Network Analysis and Design and implementation,” SIAM Journal on Matrix Analysis and Mining, vol. 1, no. 2, pp. 75–81, Apr 2011. [Online]. 
Available: Applications, vol. 13, no. 1, pp. 333–356, 1992. [Online]. Available: https://doi.org/10.1007/s13278-010-0001-9 https://doi.org/10.1137/0613024 [34] A. Pavan, K. Tangwongsan, S. Tirthapura, and K.-L. Wu, “Counting [53] K. Matam, S. R. Krishna Bharadwaj Indarapu, and K. Kothapalli, and sampling triangles from a graph stream,” Proc. VLDB Endow., “Sparse matrix-matrix multiplication on modern architectures,” in 2012 vol. 6, no. 14, pp. 1870–1881, Sep. 2013. [Online]. Available: 19th International Conference on High Performance Computing, Dec http://dx.doi.org/10.14778/2556549.2556569 2012, pp. 1–10. [35] N. Wang, J. Zhang, K.-L. Tan, and A. K. H. Tung, “On [54] E. H. Rubensson and E. Rudberg, “Chunks and tasks: A programming triangulation-based dense neighborhood graph discovery,” Proc. VLDB model for parallelization of dynamic algorithms,” Parallel Comput., Endow., vol. 4, no. 2, pp. 58–68, Nov. 2010. [Online]. Available: vol. 40, no. 7, pp. 328–343, Jul. 2014. [Online]. Available: http://dx.doi.org/10.14778/1921071.1921073 http://dx.doi.org/10.1016/j.parco.2013.09.006 [36] T. Mattson, D. Bader, J. Berry, A. Buluc, J. Dongarra, C. Faloutsos, [55] K. Akbudak and C. Aykanat, “Exploiting locality in sparse matrix- J. Feo, J. Gilbert, J. Gonzalez, B. Hendrickson, J. Kepner, C. Leiserson, matrix multiplication on many-core architectures,” IEEE Transactions A. Lumsdaine, D. Padua, S. Poole, S. Reinhardt, M. Stonebraker, on Parallel and Distributed Systems, vol. 28, no. 8, pp. 2258–2271, S. Wallach, and A. Yoo, “Standards for graph algorithm primitives,” in Aug 2017. 2013 IEEE High Performance Extreme Computing Conference (HPEC), [56] K. Akbudak, O. Selvitopi, and C. Aykanat, “Partitioning models for Sep. 2013, pp. 1–2. scaling parallel sparse matrix-matrix multiplication,” vol. 4, pp. 1–34, [37] A. Buluc¸ and J. R. Gilbert, “Highly parallel sparse matrix-matrix 2018. multiplication,” arXiv preprint arXiv:1006.2183, 2010. [57] M. Winter, D. Mlakar, R. Zayer, H.-P. Seidel, and M. Steinberger, [38] J. R. Gilbert, S. Reinhardt, and V. B. Shah, “A unified framework “Adaptive sparse matrix-matrix multiplication on the gpu,” in for numerical and combinatorial computing,” Computing in Science Proceedings of the 24th Symposium on Principles and Practice Engineering, vol. 10, no. 2, pp. 20–25, March 2008. of Parallel Programming, ser. PPoPP ’19. New York, NY, USA: ACM, 2019, pp. 68–81. [Online]. Available: http://doi.acm.org/10.1145/ Endow., vol. 7, no. 1, pp. 85–96, Sep. 2013. [Online]. Available: 3293883.3295701 http://dx.doi.org/10.14778/2732219.2732227 [58] M. Deveci, S. D. Hammond, M. M. Wolf, and S. Rajamanickam, [78] D. Jin and S. G. Ziavras, “A super-programming technique for large “Sparse matrix-matrix multiplication on multilevel memory architectures sparse matrix multiplication on pc clusters,” IEICE TRANSACTIONS on : Algorithms and experiments,” CoRR, vol. abs/1804.00695, 2018. Information and Systems, vol. 87, no. 7, pp. 1774–1781, 2004. [Online]. Available: http://arxiv.org/abs/1804.00695 [79] I. Bethune, A. Gloess, J. Hutter, A. Lazzaro, H. Pabst, and F. Reid, [59] Y. Chen, K. Li, W. Yang, G. Xiao, X. Xie, and T. Li, “Performance-aware “Porting of the dbcsr library for sparse matrix-matrix multiplications to model for sparse matrix-matrix multiplication on the sunway taihulight intel xeon phi systems.” supercomputer,” vol. 30, pp. 923–938, 2019. [80] J. Siegel, O. Villa, S. Krishnamoorthy, A. Tumeo, and X. Li, “Efficient [60] S. E. Kurt, V. Thumma, C. Hong, A. 
Sukumaran-Rajam, and P. Sadayap- sparse matrix-matrix multiplication on heterogeneous high performance pan, “Characterization of data movement requirements for sparse matrix systems,” in 2010 IEEE International Conference On Cluster Computing computations on gpus,” in 2017 IEEE 24th International Conference on Workshops and Posters (CLUSTER WORKSHOPS). IEEE, 2010, pp. High Performance Computing (HiPC), Dec 2017, pp. 283–293. 1–8. [61] F. Gremse, K. Kpper, and U. Naumann, “Memory-efficient sparse [81] E. H. Rubensson and E. Rudberg, “Locality-aware parallel block-sparse matrix-matrix multiplication by row merging on many-core architec- matrix-matrix multiplication using the chunks and tasks programming tures,” vol. 40, pp. C429–C449, 2018. model,” Parallel Comput., vol. 57, no. C, pp. 87–106, Sep. 2016. [62] R. A. van de Geijn and J. Watts, “Summa: Scalable universal matrix [Online]. Available: https://doi.org/10.1016/j.parco.2016.06.005 multiplication algorithm,” Austin, TX, USA, Tech. Rep., 1995. [82] C. Y. Lin, Z. Zhang, N. Wong, and H. K. So, “Design space exploration [63] A. Buluc and J. R. Gilbert, “On the representation and multiplication for sparse matrix-matrix multiplication on fpgas,” in 2010 International of hypersparse matrices,” in 2008 IEEE International Symposium on Conference on Field-Programmable Technology, Dec 2010, pp. 369– Parallel and Distributed Processing. IEEE, 2008, pp. 1–11. 372. [83] E. Jamro, T. Pabis, P. Russek, and K. Wiatr, “The algorithms for [64] ——, “Challenges and advances in parallel sparse matrix-matrix multi- fpga implementation of sparse matrices multiplication,” Computing and plication,” in 2008 37th International Conference on Parallel Processing. Informatics, vol. 33, pp. 667–684, 01 2014. IEEE, 2008, pp. 503–510. [84] J. Hoberock and N. Bell, Thrust: A parallel template library, 2013, [65] A. Azad, G. Ballard, A. Buluc¸, J. Demmel, L. Grigori, O. Schwartz, version 1.7.0. [Online]. Available: http://thrust.github.io/ S. Toledo, and S. Williams, “Exploiting multiple levels of parallelism [85] H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos: Enabling in sparse matrix-matrix multiplication,” CoRR, vol. abs/1510.00844, manycore performance portability through polymorphic memory access 2015. [Online]. Available: http://arxiv.org/abs/1510.00844 patterns,” Journal of Parallel and Distributed Computing, vol. 74, no. 12, [66] G. Ballard, A. Druinsky, N. Knight, and O. Schwartz, “Hypergraph pp. 3202 – 3216, 2014, domain-Specific Languages and High-Level partitioning for sparse matrix-matrix multiplication,” ACM Trans. Frameworks for High-Performance Computing. [Online]. Available: Parallel Comput., vol. 3, no. 3, pp. 18:1–18:34, Dec. 2016. [Online]. http://www.sciencedirect.com/science/article/pii/S0743731514001257 Available: http://doi.acm.org/10.1145/3015144 [86] T. A. Davis and Y. Hu, “The university of florida sparse matrix [67] O. Selvitopi, G. V. Demirci, A. Turk, and C. Aykanat, “Locality- collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1–1:25, 2011. aware and load-balanced static task scheduling for mapreduce,” [87] Kokkoskernels. [Online]. Available: https://github.com/kokkos/ Future Generation Computer Systems, vol. 90, pp. 49 – 61, kokkos-kernels 2019. [Online]. Available: http://www.sciencedirect.com/science/article/ [88] NVIDIA. Nvidia cusparse library. [Online]. Available: https://developer. pii/S0167739X17329266 nvidia.com/cusparse [68] M. Deveci, C. Trott, and S. Rajamanickam, “Multi-threaded sparse [89] Intel. 
Intel math kernel library. [Online]. Available: https://software. matrix-matrix multiplication for many-core and gpu architectures,” intel.com/en-us/mkl vol. 78, pp. 33–46, 2018. [90] S. Dalton and N. Bell. Cusp : A c++ templated sparse matrix library. [69] M. M. A. Patwary, N. R. Satish, N. Sundaram, J. Park, M. J. Anderson, [Online]. Available: http://cusplibrary.github.com S. G. Vadlamudi, D. Das, S. G. Pudov, V. O. Pirogov, and P. Dubey, [91] T. Davis. Csparse: a concise sparse matrix package. [Online]. Available: Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore http://www.cise.ufl.edu/research/sparse/CSparse Platforms, 2015. [70] J. Liu, X. He, W. Liu, and G. Tan, “Register-aware optimizations for parallel sparse matrix–matrix multiplication,” International Journal of Parallel Programming, Jan 2019. [Online]. Available: https: //doi.org/10.1007/s10766-018-0604-8 [71] S. Dalton, L. N. Olson, and N. Bell, “Optimizing sparse matrix - matrix multiplication for the GPU,” ACM Trans. Math. Softw., vol. 41, no. 4, pp. 25:1–25:20, 2015. [Online]. Available: https://doi.org/10.1145/2699470 [72] D. Merrill and A. Grimshaw, “High performance and scalable radix sorting: a case study of implementing dynamic parallelism for gpu computing.” Parallel Processing Letters, vol. 21, pp. 245–272, 06 2011. [73] G. E. Blelloch, M. A. Heroux, and M. Zagha, “Segmented operations for sparse matrix computation on vector multiprocessors,” School of Computer Science, Carnegie Mellon University, Tech. Rep. CMU-CS- 93-173, Aug. 1993. [74] K. Hou, W. Liu, H. Wang, and W.-c. Feng, “Fast segmented sort on gpus,” in Proceedings of the International Conference on Supercomputing, ser. ICS ’17. New York, NY, USA: ACM, 2017, pp. 12:1–12:10. [Online]. Available: http://doi.acm.org/10.1145/3079079. 3079105 [75] M. Deveci, K. Kaya, and U. V. atalyurek,¨ “Hypergraph sparsification and its application to partitioning,” in 2013 42nd International Conference on Parallel Processing, Oct 2013, pp. 200–209. [76] C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. Di Blas, and P. Dubey, “Sort vs. hash revisited: Fast join implementation on modern multi-core cpus,” Proc. VLDB Endow., vol. 2, no. 2, pp. 1378–1389, Aug. 2009. [Online]. Available: https://doi.org/10.14778/1687553.1687564 [77] C. Balkesen, G. Alonso, J. Teubner, and M. T. Ozsu,¨ “Multi- core, main-memory joins: Sort vs. hash revisited,” Proc. VLDB