2020 IEEE 36th International Conference on Data Engineering (ICDE)

Optimization of GPU-based Sparse Multiplication for Large Sparse Networks

Jeongmyung Lee, Seokwon Kang, Yongseung Yu, Yong-Yeon Jo, Sang-Wook Kim, Yongjun Park
Department of Computer Science, Hanyang University, Seoul, Korea
{jeongmyung, kswon0202, dydtmd1991, jyy0430, wook, yongjunpark}@hanyang.ac.kr

Abstract—Sparse matrix multiplication (spGEMM) is widely used to analyze sparse network data and to extract important information based on the matrix representation. As it contains a high degree of data parallelism, many efficient implementations using data-parallel programming platforms such as CUDA and OpenCL have been introduced on graphic processing units (GPUs). Several well-known spGEMM techniques, such as cuSPARSE and CUSP, often do not utilize the GPU resources fully, owing to the load imbalance between threads in the expansion process and high memory contention in the merge process. Furthermore, even though several outer-product-based spGEMM techniques have been proposed to solve the load balancing problem in expansion, they still do not utilize the GPU resources fully, because severe computation load variations exist among the multiple thread blocks. To solve these challenges, this paper proposes a new optimization pass called Block Reorganizer, which balances the total computations of each computing unit on target GPUs, based on the outer-product-based expansion process, and reduces the memory pressure during the merge process. For expansion, it first identifies the actual computation amount for each block, and then performs two thread block transformation processes based on their characteristics: 1) B-Splitting to transform a heavy-computation block into multiple small blocks and 2) B-Gathering to aggregate multiple small-computation blocks into a larger block. While merging, it improves the overall performance by performing B-Limiting to limit the number of blocks on each computing unit. Experimental results show that it improves the total performance of kernel execution by 1.43x, on average, when compared to the row-product-based spGEMM, for NVIDIA Titan Xp GPUs on real-world datasets.

Index Terms—Sparse matrix multiplication; sparse network; GPU

I. INTRODUCTION

Matrix multiplication is one of the core kernels in various data-mining applications, such as social network services (SNSs) and graph analytics, and is used to extract key information. With the rapid growth of the size of sparse networks, the extraction of valuable information required for various operations, such as ranking [1], similarity computation [2], [3], and recommendation [4], [5], has become a critical challenge. Weighted graphs are typically used to model such network data and are represented in matrix form, where each element contains an edge weight between two nodes. Matrix multiplication based on the adjacency matrix format is widely used to extract useful information from the original data.

Because matrix multiplication is a data-parallel operation, graphic processing units (GPUs) are considered to be the most appropriate accelerators for its speed-up by providing high computational throughput using single-instruction, multiple-thread (SIMT) programming models, such as CUDA [6] and OpenCL [7]. A GPU generally consists of a set of Streaming Multiprocessors (SMs). OpenCL/CUDA programs are executed on GPUs by allocating Thread Blocks (TBs) or Cooperative Thread Arrays (CTAs)^1, which are groups of threads, to each SM in parallel.

The main challenge is developing an efficient matrix multiplication technique that considers the data-specific characteristics of sparsity and power-law degree distribution [8]. Typical sparse networks contain a much smaller number of edges with non-zero values, compared to the number of all possible edges between nodes, and therefore, most of the elements in a sparse matrix have a value of zero. To reduce the memory waste caused by sparsity, matrices are typically represented in a sparse format [9]. Sparse networks also commonly have power-law distributions [8], where a very small number of hub nodes have extremely large numbers of connections and most other nodes have very small numbers of connections. Because of the power law, the distribution of non-zero elements is often highly skewed, and the resulting matrices for sparse networks generally contain a few rows with large numbers of non-zero elements while a large number of rows have only a few non-zero elements.

There have been several previous studies on implementing efficient sparse matrix multiplication (spGEMM) for two sparse matrices on GPUs, including cuSPARSE [10] and CUSP [11]. These techniques generally consist of a row-product-based intermediate data expansion and a parallel data merge process. Despite their promising performance, GPU resources are still not fully utilized. First, the row-product-based expansion process often leads to poor load balancing among threads due to the irregular distributions of target sparse networks. Second, excessive memory accesses during the parallel merge process frequently lead to lower performance than expected because of significant memory contention. Although several improved row-product-based techniques, such as bhSPARSE [12], have recently been introduced, experimental results have shown that they still suffer from the poor thread-level load balancing of the row-product-based scheme and from high performance overhead during the merge process when performing multiplication on highly irregular matrices.

^1 In this work, we use the terms thread block and CTA interchangeably.

To overcome these limitations, several new spGEMM approaches have been introduced by adopting the outer-product (column-row product) scheme [13], [14]. Outer-product-based expansion is expected to produce higher performance than row-product-based expansion, because the computational loads of all threads in a TB are identical. However, the outer product is not yet an ideal solution. First, the outer-product algorithm creates another load imbalance problem among SMs because of the high block-level workload variance. In the outer-product scheme, each TB is formulated by a column and a row of the input matrices. Therefore, the resulting TBs consist of several computation-heavy TBs (overloaded blocks) from the few columns and rows with huge numbers of non-zero elements, and a massive number of computation-light TBs (underloaded blocks) with large numbers of zero elements. As a result, the SMs that execute overloaded blocks can become a performance bottleneck while all other SMs are idle. Second, the outer-product scheme is mainly effective for expansion, and the merge performance remains the same or might even become worse, because it produces intermediate results in a matrix form during expansion, whereas the row product produces the intermediate results in a single-row form [15]. Therefore, full matrix-wise accumulation may be slower than row-wise accumulation owing to the additional column address indexing.

To address these limitations, we propose a novel outer-product-based spGEMM optimization pass referred to as the Block Reorganizer. It first identifies the computation amount of each block and categorizes the blocks as overloaded blocks, normal blocks, and underloaded blocks, based on their computational loads. It then performs two different optimizations in the expansion process: Block Splitting for overloaded blocks and Block Gathering for underloaded blocks. Block Splitting is the process of dividing an overloaded block into multiple small blocks for better load balancing. For underloaded blocks, the Block Reorganizer performs the Block Gathering process by creating a combined block from multiple underloaded blocks to increase intra-SM computation unit utilization and improve latency hiding efficiency via fast context-switching support. After executing all operations to produce intermediate results during the expansion process, Block Limiting is applied to improve performance further during the merge process. Block Limiting is the process where each merging block is forced to execute solely on the allocated SM in order to minimize resource contention.

This paper provides the following three contributions:
- An in-depth analysis of the inefficient resource utilization of outer-product operations on GPUs, including the expansion and merge processes, on real-world datasets.
- The design of a novel optimization framework for efficient sparse matrix multiplication based on the outer-product scheme. To achieve this objective, we offer three key techniques:
  1) Block Splitting: it divides original blocks into several small blocks for better load balancing.
  2) Block Gathering: it merges several underloaded blocks into a combined block for better SM resource utilization and latency hiding effectiveness.
  3) Block Limiting: it prevents merging blocks from executing with other blocks on an SM to minimize resource contention.
- An extensive evaluation of the effectiveness of the Block Reorganizer framework using synthetic and real-world datasets on multiple target GPUs.

II. BACKGROUND

A. GPU Architectures and SIMT Programming Model

Fig. 1: (a) A GPU architecture overview and (b) the effect of the shared memory requirement per thread block on thread block allocation.

GPUs are accelerators that provide high throughput by maximizing data parallelism using an SIMT programming model such as CUDA [6] or OpenCL [7], which enables multiple independent threads to execute the same instructions concurrently. In such programming languages, a thread is the basic unit of execution, and several threads are grouped into TBs or CTAs. A TB is the main scheduling unit for execution on GPUs, and the threads within a TB are affected by barrier operations for synchronization. For NVIDIA GPUs in particular, a number of threads (typically 32) are also grouped into another scheduling unit, called a warp. In NVIDIA GPUs, the threads in a warp are executed in lock-step, similar to SIMD accelerators [16].

To support such operations efficiently, recent GPUs are equipped with multiple SMs that execute the kernel instructions of allocated TBs in an SIMD manner. Each SM contains multiple computing cores, a large register file, an L1 cache, and a shared memory, as shown in Figure 1 (a). To hide memory access latency, GPUs also allow fast context switching between warps. Thus, GPUs attempt to allocate the maximum allowable number of threads to an SM within the resource limit.

The number of threads allocated to an SM is limited by resource usage (e.g., shared memory and register files). For example, the shared memory requirement for each TB can change the total number of allowable TBs on an SM, as shown in Figure 1 (b). Although the number of threads in a TB is determined statically, not all threads always perform identical work, because of branch divergence. In this paper, we refer to a thread with real computations as an effective thread, and a thread without real computations as a non-effective thread.
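To make the occupancy effect of Figure 1 (b) concrete, the following host-side sketch queries how many thread blocks of a kernel can be resident on one SM for two different per-TB shared memory footprints. It is only an illustration of the CUDA occupancy API; the kernel name and sizes are placeholders, not code from the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for an spGEMM expansion kernel.
__global__ void expand_kernel(const float* in, float* out) {
    extern __shared__ float smem[];                 // dynamic shared memory per TB
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = in[tid];
    __syncthreads();
    out[tid] = smem[threadIdx.x];
}

int main() {
    const int block_size = 128;
    // A larger per-TB shared memory footprint lowers the number of resident
    // TBs per SM, which is the effect sketched in Figure 1 (b).
    size_t footprints[] = {4 * 1024, 32 * 1024};
    for (size_t smem_bytes : footprints) {
        int blocks_per_sm = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, expand_kernel, block_size, smem_bytes);
        printf("shared mem %zu B/TB -> %d resident TBs per SM\n",
               smem_bytes, blocks_per_sm);
    }
    return 0;
}
```

The same mechanism is what B-Limiting later exploits deliberately: requesting more shared memory per block reduces the number of co-resident blocks on an SM.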

Fig. 2: (a) Example input matrices, (b) sparse matrix formats (CSR/CSC), (c) row-product spGEMM, and (d) outer-product spGEMM.

Algorithm 1 Outer-product-based spGEMM expansion pseudocode
  for i := 0 to N-1 in parallel do
    for a_idx := a.ptr[i] to a.ptr[i+1] do
      row <- a.idx[a_idx]
      offset <- c.ptr_e[row]
      c.ptr_e[row] <- c.ptr_e[row] + (b.ptr[i+1] - b.ptr[i])
      for b_idx := b.ptr[i] to b.ptr[i+1] in parallel do
        c_idx <- c.ptr[row] + offset
        c.val[c_idx] <- a.val[a_idx] * b.val[b_idx]
        c.idx[c_idx] <- b.idx[b_idx]
      end for
    end for
  end for

B. Sparse Matrix Multiplication

1) Sparse matrix format: The dense-format-based representation of sparse matrices with few non-zero elements incurs high memory space inefficiency owing to the massive storage requirements for zero elements. Thus, compressed sparse row/column (CSR/CSC) formats without zero values are generally used for sparse matrix representation [9]^2. As shown in Figure 2 (b), the CSR format consists of three arrays: the val array stores the values of the non-zero elements, the idx array stores column indices, and the ptr array stores row pointers, which indicate the first element locations within the rows. The CSC format has the same structure but stores elements in column-major order, while CSR is based on row-major order. The CSC format stores row indices in the idx array and column pointers in the ptr array. The size of the ptr array is N, which is the number of rows/columns (CSR/CSC) of a target matrix, and the sizes of the val and idx arrays are nnz.

^2 General CSR/CSC formats do not require a sorted order of column/row indices within a row/column, and this work produces the final result in unordered CSR format [9].
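As a small illustration of the three-array layout just described, the following sketch defines a minimal CSR/CSC container and a helper that returns the per-row (or per-column) nnz. The struct and function names are ours, not the paper's data structures.

```cuda
#include <vector>

// Minimal CSR/CSC container mirroring the three-array layout described above.
struct SparseMatrix {
    int n;                    // number of rows (CSR) or columns (CSC)
    std::vector<int> ptr;     // row pointers (CSR) or column pointers (CSC)
    std::vector<int> idx;     // column indices (CSR) or row indices (CSC)
    std::vector<float> val;   // non-zero values
};

// Number of non-zeros stored in row r of a CSR matrix (or column r of a CSC
// matrix); this is the per-vector nnz used later for workload classification.
inline int nnz_of(const SparseMatrix& m, int r) {
    return m.ptr[r + 1] - m.ptr[r];
}
```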
2) Matrix multiplication algorithms: Matrix multiplication (C = AB) is an operation that produces an output matrix C (N x M size) from two input matrices A (N x K size) and B (K x M size)^3. For sparse matrix multiplication, a basic method based on the dot product is not well matched, because it requires index matching, which is not appropriate for the sparse matrix format.

c_{i*} = \sum_{j=0}^{N-1} a_{ij} x b_{j*}    (1)

For sparse matrix multiplications, several libraries, such as cuSPARSE [10] or CUSP [11], are implemented based on row-product schemes, where the input matrices are CSR-formatted. As shown in Equation (1), the output row c_{i*} is obtained by performing a row-row product calculation between the ith row a_{i*} of A and all corresponding rows in B. On GPUs, a TB generally performs all the row-products for a row in A and all the corresponding rows in B. Therefore, the row-product-based method has a high probability of creating poor load balancing between the threads within a block when there is a high variance in the number of non-zero elements between the rows in B, as illustrated in Figure 2 (c). In such a case, only some threads perform numerous computations, while most threads are idle or finish early after a small number of computations.

C_i = a_{*i} x b_{i*}    (2)

As shown in Equation (2), unlike the row product, the outer-product-based scheme produces a partial matrix C_i by calculating a column-by-row product between the ith column a_{*i} of A and the ith row b_{i*} of B. As shown in Figure 2 (d), all elements in the input column are multiplied by the same row, and therefore, every thread has the same number of computations. Therefore, the outer-product-based method does not create a load balancing problem within a TB when executing on GPUs. Based on the data access pattern, the CSC format is used for matrix A and the CSR format is used for matrix B.

Both row-product and outer-product methods can generate multiple elements with the same index over the partial results. Therefore, such methods require a specific merge phase to accumulate the elements with the same index into a single element after the generation of intermediate results in the expansion phase. In this work, we denote the intermediate result matrix, which allows multiple elements with the same indices between the expansion phase and the merge phase, as C-hat. Algorithm 1 presents the pseudocode of the outer-product-based expansion-phase algorithm for generating C_i for every index. In the algorithm, c.ptr_e indicates an array that stores the nnz filled so far for each row and is updated using an atomic function to manage parallel execution.

^3 Note that we use C = A^2 multiplication for the base evaluation, where the numbers of rows and columns are the same for the input matrix A.
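The following CUDA kernel is a rough sketch of Algorithm 1, in which one thread block handles one column/row pair and the threads of the block cover the row elements of B. The kernel name, the one-pair-per-block mapping, and the per-row atomic counter layout are our assumptions for illustration, not the authors' implementation.

```cuda
#include <cuda_runtime.h>

// One thread block handles one column/row pair i (outer product a_*i x b_i*).
// Every thread performs the same number of multiplications, as described for
// Equation (2); duplicates are resolved later in the merge phase.
__global__ void outer_product_expand(
    const int* __restrict__ a_ptr, const int* __restrict__ a_idx,
    const float* __restrict__ a_val,                    // A in CSC
    const int* __restrict__ b_ptr, const int* __restrict__ b_idx,
    const float* __restrict__ b_val,                    // B in CSR
    const int* __restrict__ c_ptr,                      // row offsets of C-hat
    int* c_ptr_e,                                       // per-row fill counters
    int* c_idx, float* c_val)                           // intermediate C-hat
{
    int i = blockIdx.x;                                 // pair index
    int b_begin = b_ptr[i];
    int b_len = b_ptr[i + 1] - b_begin;

    for (int a_e = a_ptr[i]; a_e < a_ptr[i + 1]; ++a_e) {
        int row = a_idx[a_e];
        float a_v = a_val[a_e];

        __shared__ int offset;
        if (threadIdx.x == 0)                           // reserve b_len slots in this row
            offset = atomicAdd(&c_ptr_e[row], b_len);
        __syncthreads();

        for (int t = (int)threadIdx.x; t < b_len; t += blockDim.x) {
            int c_i = c_ptr[row] + offset + t;
            c_val[c_i] = a_v * b_val[b_begin + t];
            c_idx[c_i] = b_idx[b_begin + t];
        }
        __syncthreads();                                // offset is reused next iteration
    }
}
```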

III. MOTIVATION

A. Limitations of Current Approaches

Since the distribution patterns of sparse matrices are diverse, a main challenge for spGEMM performance improvement on GPUs is to achieve high resource utilization. As shown in Figure 2, thread-level load balancing in a TB can be achieved by adopting the outer-product scheme, whereas the row-product method suffers from intra-SM load imbalance. However, three major problems remain that lead to poor resource utilization.

Fig. 3: (a) Execution time variance of outer-product-based spGEMM between SMs (Titan Xp), (b) thread block distribution at different numbers of effective threads, and (c) execution time distribution between the expansion and merge processes.

1) Overloaded block: As discussed in the previous section, sparse matrices often have a power-law degree distribution, where some rows and columns related to the hub nodes contain massive numbers of non-zero elements, whereas others have only a few non-zero elements. Therefore, the several overloaded blocks used to perform multiplications of the columns and rows related to the hub nodes incur a substantial amount of computation, while the other blocks (underloaded blocks) perform very few computations. When overloaded blocks are scheduled to a few SMs and underloaded blocks are scheduled to the rest of the SMs, the SMs with the underloaded blocks remain idle after completing their tasks until all computations of the overloaded blocks on the other SMs are completed.

Figure 3 (a) presents the variation in the SM-level execution time of the expansion phase when running outer-product spGEMM operations on multiple sparse network datasets on an NVIDIA Titan Xp architecture, which contains 30 SMs. In Figure 3 (a), the execution times for all SMs in the GPU are presented in descending order for each dataset; the five sparse matrices on the left have relatively regular distributions, whereas the five sparse matrices on the right have skewed distributions. In this figure, one can see that irregularity leads to high execution time variation between SMs. When an overloaded block is scheduled to an SM, the block occupies the SM for a long period while the other small blocks are scheduled to the remaining available SMs. Workload redistribution from long-running SMs to idle SMs is therefore the key challenge for performance improvement on skewed matrices. For example, SM utilization for the "loc-Gowalla" and "as-Caida" sets is less than 20% owing to a small number of long-running SMs.

2) Underloaded block: Another issue is that most rows/columns in sparse matrices have zero or fewer non-zero elements than the warp size, except for those rows/columns related to hub nodes. Underloaded blocks for the multiplication of those columns and rows contain small numbers of effective threads with small computations, and they lead to substantial performance degradation on GPUs. While the five left-hand matrices in Figure 3 (a) exhibit fair load balancing across SMs, another inefficiency is generated by underloaded blocks. In Figure 3 (b), most of the thread blocks have fewer than 32 effective threads for many matrices. In this situation, two main reasons exist for the significant performance degradation in each SM. First, multiple computing cores within an SM are idle when executing underloaded blocks with fewer than 32 threads, because 32 threads are executed in a lock-step manner, as described in Section II-A. Second, the memory latency hiding technique based on fast context switching cannot be utilized, because no eligible warps for context switching exist when a warp stalls for several cycles owing to a memory access. Therefore, generating larger blocks by aggregating several underloaded blocks is highly recommended for further performance enhancement.

3) Overhead on merge: In this work, the merge process is implemented in a manner similar to the widely used Gustavson dense accumulator algorithm [19], which uses a temporary array with a length equal to the dimension of the target matrix. Using the dense accumulator algorithm gives the advantage of aggregating elements without sorting overhead. For implementing the algorithm on GPUs, we used atomic functions to manage parallel execution. In Figure 3 (c), high merge latency exists when the merge process is performed on rows with large nnz, because such a block requires a massive number of memory transactions, which can lead to performance degradation due to significant memory resource contention. Several recent studies [17], [18] have also reported that allocating the maximum number of blocks on GPUs does not always guarantee the best performance, because resource contention may decrease overall performance when excessive threads are allocated. Therefore, the over-allocation of merging blocks on an SM should be avoided.

B. Beyond Conventional Approaches

Several insights have been derived from comparisons between several spGEMM algorithms and from the analysis of conflicts between GPU characteristics and sparse network characteristics. First, the outer-product scheme is a better expansion technique than the row-product scheme owing to its superior thread-level load balancing within a block, but the block-level load imbalance problem must be solved by considering both overloaded and underloaded blocks. Second, the performance of the merge process must be improved as well, by reducing resource contention through adjusting the block allocation to each SM.

Based on these insights, we propose several intuitive high-level solutions for improved spGEMM performance. We first perform preprocessing to classify column-row product blocks into three different categories based on their computational loads: overloaded, normal, and underloaded blocks. Overloaded blocks are then split into multiple small blocks to be distributed to different SMs. For underloaded blocks, we improve performance by gathering multiple underloaded blocks into a single combined block to maximize the number of effective threads. We also improve merge performance by limiting the number of merging blocks allocated to each SM.

IV. BLOCK REORGANIZER

A. Overview

Fig. 4: An overview of the Block Reorganizer.

The Block Reorganizer is an optimization method for accelerating sparse matrix multiplication by applying an improved block-level load balancing mechanism that is adaptive to sparse network characteristics. The Block Reorganizer is based on the outer-product scheme and applies several novel load balancing techniques built on an in-depth understanding of GPU architectures. Figure 4 presents a conceptual view of the Block Reorganizer, which is proposed to improve resource utilization during both the expansion and merge processes.

As shown in Figure 4, the Block Reorganizer first precalculates the workload sizes of all blocks that perform column-by-row products. The blocks are then classified into three groups of overloaded, normal, and underloaded blocks based on the sizes of their workloads. We refer to the set of overloaded column/row pairs having numerous non-zero elements as the Dominator. A Low performer is a set of underloaded column/row pairs that require only a few computations due to their insufficient number of effective threads.

Following categorization, dominator pairs are split into multiple smaller column/row pairs (block splitting). Multiple underloaded blocks are gathered to generate larger blocks (block gathering). The newly created combined blocks can be efficiently executed on GPUs by maximizing thread-level parallelism through both high utilization of in-SM computing cores and better latency hiding using fast context switching between warps. After all elements are generated and stored in the intermediate matrix C-hat, elements with the same indices are merged to produce the final matrix C. To achieve better throughput by avoiding excessive memory contention, we adjust the number of thread blocks allocated to an SM.

B. Precalculation & Workload Categorization

The Block Reorganizer first calculates nnz(C-hat) to allocate the upper-bound memory space for C. There are two different ways to compute the memory space, as shown in Figure 4, and we employ both methods for later optimizations. The row-wise nnz is used to relocate the outer-product elements with the same row closer together for a faster merge process. We also calculate the block-wise nnz for workload classification.

Because of the irregular distributions of sparse networks, the outer product of a dominator pair produces a massive number of non-zero elements compared to the other remaining pairs. As a single column/row pair operation is assigned to a single block, the execution time for overloaded blocks can be much greater than the total execution time of all remaining blocks. This often leads to poor load balancing between SMs, and is one of the main causes of performance degradation on skewed matrices. For low performer pairs, the underutilization of in-SM computing units is another reason for poor performance. Therefore, different optimization techniques are required for each column/row pair category.

Based on the block-wise nnz estimation, all dominator pairs are identified from the input matrices (A, B). Because of the sparse data characteristics, the number of dominator pairs is typically small, and the threshold ratio for identifying dominator pairs should be selected carefully. In this study, blocks that produce more than the threshold number of elements (threshold = nnz(C-hat)/(#blocks x alpha)) are classified as dominators. The criteria for classification can be changed by adjusting the value of alpha based on the target sparse network characteristics. Highly skewed networks can use lower alpha values, but social networks with several medium-size hub nodes should have higher alpha values to avoid selecting too many dominator pairs. The dominators are copied into new temporary matrices (A', B'), while blocks with fewer than 32 (the warp size) effective threads are classified as underloaded blocks.
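The following host-side sketch illustrates the classification rule just described, using the per-column nnz of A and per-row nnz of B as the block-wise workload estimate. The function name, the container types, and the exact comparison are illustrative assumptions rather than the paper's preprocessing code.

```cuda
#include <cstdint>
#include <vector>

enum class BlockClass { Dominator, Normal, LowPerformer };

// Classify each column/row pair i by its block-wise nnz estimate
// nnz_i = nnz(a_*i) * nnz(b_i*), using the rule described above:
// threshold = nnz(C-hat) / (#blocks * alpha). Pairs with fewer effective
// threads than a warp (nnz(b_i*) < 32) become low performers.
std::vector<BlockClass> classify_pairs(const std::vector<int>& a_col_nnz,
                                       const std::vector<int>& b_row_nnz,
                                       double alpha) {
    const int n = static_cast<int>(a_col_nnz.size());
    std::vector<int64_t> block_nnz(n);
    int64_t total = 0;
    for (int i = 0; i < n; ++i) {
        block_nnz[i] = static_cast<int64_t>(a_col_nnz[i]) * b_row_nnz[i];
        total += block_nnz[i];                  // upper bound of nnz(C-hat)
    }
    const double threshold = total / (n * alpha);

    std::vector<BlockClass> cls(n, BlockClass::Normal);
    for (int i = 0; i < n; ++i) {
        if (block_nnz[i] > threshold)
            cls[i] = BlockClass::Dominator;
        else if (b_row_nnz[i] < 32)             // fewer effective threads than a warp
            cls[i] = BlockClass::LowPerformer;
    }
    return cls;
}
```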

C. Expansion Optimization

1) Block Splitting: We propose the block-splitting technique for better block-level workload balance. Block splitting is applied to overloaded blocks, which are generated by dominator vectors, in order to distribute heavy workloads evenly across multiple SMs. As expressed in Equation (2), the outer-product operations for each pair are independent of each other, without the possibility of data reuse. Therefore, a pair can be separated and modified without affecting the results of other blocks. The dominator column vector, which is copied into the temporary matrix A', is divided into multiple smaller columns by modifying the column pointer values. This then creates a mapper array for storing the mapping between the divided vector pairs. The multiple divided blocks execute their own products by referencing the mapper array, and therefore, the overloaded workload can be reallocated to multiple SMs to achieve fair load balancing.

Fig. 5: B-Splitting: an overloaded block is split into multiple small blocks.

Figure 5 illustrates a detailed example of the block-splitting process and highlights its effectiveness. First, the dominator vectors a_{*0} and b_{0*} (originally from input matrices A and B) are copied into matrices A' and B'. During the splitting process, several elements from each column vector are shifted to the next vector sequentially. This operation can be accomplished by simply expanding the pointer index of the sparse-format matrix, as shown in Figure 5. A mapper array is constructed to track all of the divided vector pairs so that they produce the same results as the original vector pairs. As a result, the overloaded block requiring 25 computations is split into three smaller blocks.

Block splitting not only improves SM-level load balancing, but also provides improved cache performance. Because a global memory access requires hundreds of cycles, spatial and temporal data locality should be fully utilized. Block splitting forces multiple SMs to share identical vectors, thereby increasing the probability of re-referencing data across SMs and preventing the data from being evicted due to memory space shortage. As a result, additional performance gains are achieved.
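As a rough host-side illustration of the pointer-expansion plus mapper-array idea sketched in Figure 5, the snippet below splits one dominator column into fixed-size chunks and records which original pair each chunk came from. All names and the chunking policy are ours; this is not the authors' preprocessing code.

```cuda
#include <vector>

struct SplitResult {
    std::vector<int> new_ptr;   // column pointers of the split matrix A'
    std::vector<int> mapper;    // split block id -> original pair id
};

// Split the CSC column [col_begin, col_end) of a dominator pair into chunks
// of at most max_elems non-zeros by inserting extra column pointers, so that
// each chunk can be executed by its own thread block against the same b_i*.
SplitResult split_dominator_column(int pair_id, int col_begin, int col_end,
                                   int max_elems) {
    SplitResult r;
    r.new_ptr.push_back(col_begin);
    for (int start = col_begin; start < col_end; start += max_elems) {
        int stop = (start + max_elems < col_end) ? start + max_elems : col_end;
        r.new_ptr.push_back(stop);     // one extra column pointer per chunk
        r.mapper.push_back(pair_id);   // all chunks reuse the same row of B
    }
    return r;
}
```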

        !""  # $                 :LWKRXWVSOLWWLQJ  $IWHUVSOLWWLQJ         Fig. 5: B-Splitting: an overloaded block is split into multiple          small blocks.  ""         !""  # $ columns by modifying the column pointer values. This then        creates a mapper array, for storing the mapping between  ""      divided vector pairs. The multiple divided blocks execute their       !""  # $           own products by referencing the mapper array, and therefore,       ""   the overloaded workload can be reallocated to multiple SMs Fig. 6: B-Gathering: several underloaded blocks are combined to achieve fair load balancing. Figure 5 illustrates a detailed into a large block through block-compaction. example of the block-splitting process and highlights its effec- a sufficient number of effective threads in each block. tiveness. First, the dominator vector a∗0 and b0∗ (originally 2) Block Gathering: Because of the irregularity of sparse from input matrices A and B) are copied into matrices A matrices, executing kernels with a fixed thread block size is and B. During the splitting process, several elements from inefficient, and therefore, executing blocks with an appropriate each column vector are shifted to the next vector sequentially. thread block size is required to avoid thread waste. However, This operation can be accomplished by simply expanding as shown in Figure 3 (b), underloaded blocks, which are the pointer index of the sparse format matrix, as shown in generated by low performer groups, contain fewer effective Figure 5. A mapper array is constructed to track all of the threads than the minimum block size (32). In the proposed divided vector pairs to produce the same results as the original method, nnz(bi∗) indicates the number of effective threads vector pairs. As a result, the overloaded block requiring 25 within a block. As shown in Figure 3 (b), for some networks, computations is split into three smaller blocks. most row vectors have less than 32 non-zero elements. This Block splitting not only improves SM-level load balanc- means that several computing units in an SM are idle when ing, but also provides improved cache performance. Because executing such blocks because the threads in a warp are global memory access requires hundreds of cycles, spatial executed in a lock-step manner, as discussed in Section II-A. and temporal data localities should be fully utilized. Block- Thus, thread-level parallelism cannot be fully utilized through splitting forces multiple SMs to share identical vectors, thereby concurrent executions. increasing the probability of re-referencing data from SMs Having an insufficient number of effective threads in a block and preventing the data from being evicted due to memory also significantly decreases performance, as latency hiding space shortage. As a result, additional performance gains are using fast context switching cannot be applied. When the achieved. current active warp cannot issue the next instructions for any Determining the splitting factors for dominators is impor- reason, the warp scheduler chooses and schedules another tant, because performance improvement depends heavily on warp among the eligible warps to hide latency. However, these factors. Due to irregularity of sparse matrices, it is latency hiding based on fast warp-level context switching difficult to identify the optimal factor that can be applied cannot be applied, as underloaded blocks contain only a small to all datasets. 
Even within dominator groups, the nnz of number of warps with effective threads (typically only one). vectors varies, and the splitting factor for each vector should be To solve the problem, we propose Block Gathering, which selected carefully. From a GPU architectural view, overloaded is intuitive and can be applied easily. In Block Gathering, blocks should be divided into a number of smaller blocks original underloaded blocks are first transformed into micro- that is greater than the total number of SMs. The number blocks, which generate exactly the same results as the original of effective threads within each block should be larger than underloaded blocks, although they only have fewer threads the warp size to guarantee full utilization of in-SM cores. than the original blocks (block-compaction). Multiple micro- Based on these two insights, we decided to choose the splitting blocks are then combined into a large combined block with factor (2n) heuristically. Column vectors, where the number of multiple partitions, which has the same number of threads as elements is equal to the number of computations per thread, the original underloaded blocks. are split into several smaller vectors in a greedy manner. On For block-gathering, it is relatively easy to determine the the other hand, row vectors, where the number of elements optimal value of the gathering factor. In general, the number corresponds to the number of threads, are not split to guarantee of threads in a block is set to a power of two. When the
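The gathering-factor rule above is simple enough to state directly in code. The sketch below computes the factor for a single underloaded block and bins blocks by that factor so that blocks sharing a factor can be compacted together; the helper names and the binning layout are illustrative assumptions.

```cuda
#include <vector>

// For an underloaded block whose effective thread count nnz(b_i*) falls in
// (2^(n-1), 2^n], the gathering factor is 32 / 2^n, i.e. how many such
// micro-blocks fill one 32-thread block. A factor of 1 means the block is
// left alone (it already fills a warp).
int gathering_factor(int effective_threads) {
    if (effective_threads >= 32 || effective_threads <= 0)
        return 1;
    int pow2 = 1;
    while (pow2 < effective_threads) pow2 <<= 1;   // round up to 2^n
    return 32 / pow2;
}

// Bin underloaded block ids by their gathering factor (bin index = log2(factor)),
// so that each bin can later be compacted into combined blocks.
std::vector<std::vector<int>> bin_by_factor(const std::vector<int>& eff_threads) {
    std::vector<std::vector<int>> bins(6);         // factors 1,2,4,8,16,32
    for (int block = 0; block < static_cast<int>(eff_threads.size()); ++block) {
        int f = gathering_factor(eff_threads[block]);
        int bin = 0;
        while ((1 << bin) < f) ++bin;
        bins[bin].push_back(block);
    }
    return bins;
}
```

For instance, a block with 2 effective threads gets factor 16, matching the example in the text.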

D. Merge Optimization: Block Limiting

Fig. 7: B-Limiting: extra shared memory is allocated to alleviate resource contention while merging long rows.

After all non-zero elements have been generated in the intermediate result matrix C-hat, elements with the same indices are merged into unique elements. This merging process is highly memory intensive and has a small computational overhead, meaning that it is sensitive to memory throughput. Similar to the input matrices, the result matrix often has a power-law distribution. Therefore, during the merging process, some thread blocks can generate too many memory requests and incur substantial performance degradation by reducing the throughput of the L2 cache, which is shared by multiple SMs [17], [18].

Based on this insight, we propose the B-Limiting technique, which reduces resource contention by limiting the number of blocks allocated to an SM. Figure 7 illustrates the B-Limiting process. The allowable number of blocks is determined by the resource requirements of each block. Therefore, we allocate extra shared memory to the merge kernel functions in order to reduce the number of blocks on an SM [20].

Because allocating the maximum number of blocks on an SM generally yields the best GPU performance, the block-limiting technique should be applied carefully, only when it is expected to perform better than the traditional allocation scheme. Block limiting is therefore currently applied only to the large rows of C-hat whose nnz exceeds the given threshold (threshold = nnz(C-hat)/(#blocks x beta)), where beta is currently 10 to show fair performance gain.
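The sketch below combines the two ideas used during merging: a Gustavson-style dense accumulator updated with atomics (Section III), and a launch that requests extra dynamic shared memory purely to reduce the number of co-resident merging blocks per SM. The kernel, its data layout, and the 6144-byte step are illustrative; this is not the paper's merge kernel.

```cuda
#include <cuda_runtime.h>

// One thread block merges one long row of C-hat: duplicate column indices
// are accumulated into a dense accumulator of length acc_len with atomicAdd,
// in the spirit of Gustavson's dense-accumulator merge.
__global__ void merge_row_dense_acc(
    const int* __restrict__ chat_ptr, const int* __restrict__ chat_idx,
    const float* __restrict__ chat_val,
    const int* __restrict__ row_ids,   // long rows selected for B-Limiting
    float* acc, int acc_len)           // one dense accumulator per block
{
    // Dynamic shared memory requested at launch; the merge does not use it.
    // The extra bytes only lower the number of resident blocks per SM,
    // which is the B-Limiting effect.
    extern __shared__ float limiter_pad[];
    (void)limiter_pad;

    int row = row_ids[blockIdx.x];
    float* my_acc = acc + static_cast<size_t>(blockIdx.x) * acc_len;
    for (int e = chat_ptr[row] + (int)threadIdx.x; e < chat_ptr[row + 1];
         e += blockDim.x)
        atomicAdd(&my_acc[chat_idx[e]], chat_val[e]);
}

// Host-side launch: the limiting factor chooses how much extra dynamic
// shared memory to request (the evaluation sweeps multiples of 6144 bytes).
void launch_limited_merge(int num_long_rows, const int* d_ptr, const int* d_idx,
                          const float* d_val, const int* d_rows, float* d_acc,
                          int acc_len, int limiting_factor) {
    size_t extra_smem = static_cast<size_t>(limiting_factor) * 6144;
    merge_row_dense_acc<<<num_long_rows, 128, extra_smem>>>(
        d_ptr, d_idx, d_val, d_rows, d_acc, acc_len);
}
```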
E. Putting It All Together

In this section, an example workflow is presented for the YouTube data, for a better understanding of how the three techniques are combined in the Block Reorganizer. The Block Reorganizer first estimates the block-wise nnz and the row-wise nnz. Workload categorization is then performed based on this information. If the block-wise load of an (a_{*i}, b_{i*}) pair exceeds the threshold, the pair is classified as a Dominator. If the row-wise nnz exceeds a certain threshold, the corresponding rows are determined to cause resource contention during merging. For YouTube, 713 pairs are classified as dominators, and 362,736 pairs are classified as low performers. 12,657 rows are also selected to use B-Limiting during merging. The overloaded blocks from the dominators are then split into smaller blocks using the chosen splitting factor. As a result, the B-Splitting technique shows a 10.4% performance gain with improved SM utilization from 16% to 99%.

In contrast, low performer vector pairs are binned into four groups. Depending on their thread ranges, underloaded blocks are gathered and compressed into single, same-sized blocks. This B-Gathering technique shows a 6.7% performance gain. After generating all non-zero elements, B-Limiting is applied to reduce memory contention in the merging process. Extra shared memory is allocated when performing the merge process for long rows in order to limit the number of allocated blocks per SM. As a result, the B-Limiting technique shows a 16.8% performance gain with a 32% L2 cache throughput improvement. Finally, the combination of the three techniques improves the total performance by 41.5% for the YouTube data.

V. EXPERIMENTAL ENVIRONMENT

Implementation: The Block Reorganizer is implemented as an executable binary, written in the CUDA [6] programming language and compiled using NVCC 8.0. The Block Reorganizer first reads the input matrices and precalculates the block-wise workloads. It then applies the three optimization techniques called B-Splitting, B-Gathering, and B-Limiting. All preprocessing is performed on the target GPUs except for B-Splitting, which is performed on the host CPU. When all preprocessing is completed, the sparse matrix multiplication kernel is executed.

System Configuration: In our experiments, we evaluated the Block Reorganizer mainly on a real machine with an Intel Xeon E5-2060 (2.10 GHz) CPU with 64 GB of main memory and an NVIDIA TITAN Xp GPU [21] with 12 GB of global memory, as shown in Table I. We also tested the Block Reorganizer on additional systems to determine its scalability: a Xeon E5 and NVIDIA Tesla V100 system (DGX Station), and a Xeon Gold and NVIDIA RTX 2080 Ti system (Table I).

Performance Measurement: Our spGEMM algorithm generates output data in an unordered CSR format, similar to the Gustavson merge algorithm [19]. Therefore, we present our performance results in two different ways for fairness. We first compare Block Reorganizer performance to a baseline spGEMM, which uses a row-product-based expansion and a Gustavson merge process, and to four widely used spGEMM libraries (cuSPARSE, CUSP, and bhSPARSE for GPUs, and MKL for CPUs) [13], in order to measure the performance difference relative to other open libraries. We then perform a detailed analysis of each Block Reorganizer technique and compare the results to the performance of the baseline spGEMM.
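To summarize the implementation flow described above, the following schematic driver shows the order in which the stages run. Every function here is an empty placeholder standing in for the corresponding step; only the ordering reflects the description in the text.

```cuda
// Placeholder stages of the Block Reorganizer pipeline described above.
struct Matrices { /* device-side CSC A, CSR B, intermediate C-hat, output C */ };

void precalculate_workloads(Matrices&) {}   // block-wise / row-wise nnz estimation
void b_splitting_on_host(Matrices&) {}      // dominator pairs -> A', B' + mapper (host CPU)
void b_gathering_on_device(Matrices&) {}    // low performers -> combined blocks (GPU)
void run_expansion_kernel(Matrices&) {}     // outer-product expansion (Algorithm 1)
void b_limiting_merge(Matrices&) {}         // merge with a limited number of blocks per SM

void block_reorganizer_spgemm(Matrices& m) {
    precalculate_workloads(m);   // 1. estimate nnz and classify pairs
    b_splitting_on_host(m);      // 2. preprocessing for overloaded blocks
    b_gathering_on_device(m);    // 3. compaction of underloaded blocks
    run_expansion_kernel(m);     // 4. generate the intermediate matrix C-hat
    b_limiting_merge(m);         // 5. merge duplicates into the final matrix C
}
```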

All experimental results include these overheads, except the data transfer time between the host and the device. This is because spGEMM is an application kernel whose results will be used on the GPU. The overhead includes the precalculation, the workload classification, and the preprocessing for block splitting.

For the Block Reorganizer and the baseline spGEMM, basic memory-related optimizations, considering shared memory utilization, cache blocking, and memory coalescing, are applied to maximize performance.

Dataset: A total of 28 real-world datasets from the Stanford large network dataset collection [28] and the Florida matrix suite [27] were used for computing C = A^2. Table II lists detailed information for the tested real-world datasets. We chose specific datasets by considering the distribution and size of each matrix; datasets from the Stanford large network dataset collection generally exhibit irregular distributions, whereas the datasets from the Florida matrix suite generally exhibit regular distributions. We also used synthetic datasets generated using R-MAT [29], [30] to evaluate both C = A^2 and C = AB.

TABLE I: Target system configurations

                         System 1              System 2 [22]         System 3
CPU                      Xeon E5-2640v4 [23]   Xeon E5-2698v4 [23]   Xeon Gold 5115 [24]
Number of Cores/Threads  10 / 20               20 / 40               10 / 20
Max CPU Clock            3.40 GHz              3.60 GHz              3.40 GHz
Memory                   64 GB                 256 GB                128 GB
GPU                      Titan Xp [21]         Tesla V100 [25]       RTX 2080 Ti [26]
Number of SMs            30                    80                    68
Max GPU Clock            1582 MHz              1380 MHz              1545 MHz
CUDA Capability          6.1 (Pascal)          7.0 (Volta)           7.5 (Turing)
OS                       Ubuntu 16.04          Ubuntu 18.04          Ubuntu 16.04
Baseline                 NVIDIA cuSPARSE v2, CUSP 0.4.0, bhSPARSE, MKL

TABLE II: Real-world datasets from the Florida SuiteSparse collection [27] and the Stanford large network dataset collection [28]

Name              dimension   nnz(A)   nnz(C)
filter3D          106k        2.7M     20.1M
ship              140k        3.7M     23.0M
harbor            46k         2.3M     7.5M
protein           36k         2.1M     18.7M
sphere            81k         2.9M     25.3M
2cube_sphere      99k         854k     8.6M
accelerator       118k        1.3M     17.8M
cage12            127k        1.9M     14.5M
hood              215k        5.2M     32.7M
m133-b3           196k        782k     3.0M
majorbasis        156k        1.7M     7.9M
mario002          381k        1.1M     6.2M
mono_500Hz        165k        4.8M     39.5M
offshore          254k        2.1M     22.2M
patents_main      235k        548k     2.2M
poisson3Da        13k         344k     2.8M
QCD               48k         1.8M     10.4M
scircuit          167k        0.9M     5.0M
power197k         193k        3.3M     38.0M
youtube           1.1M        2.8M     148M
as-caida          26k         104k     25.6M
sx-mathoverflow   87k         495k     17.7M
loc-gowalla       192k        1.8M     456M
emailEnron        36k         359k     29.1M
slashDot          76k         884k     75.2M
epinions          74k         497k     19.6M
web-Notredame     318k        1.4M     16.0M
stanford          275k        2.2M     19.8M

VI. EVALUATIONS

In this section, we show the effectiveness of the Block Reorganizer, along with the techniques used within it: block splitting, block gathering, and block limiting. Section VI-A shows the performance improvements and analyses on real-world datasets. Section VI-B presents an examination of the effectiveness of the techniques across multiple GPU architectures, and Sections VI-C and VI-D present an analysis of the performance impact of various dataset characteristics using synthetic datasets.

A. Evaluation on Real-World Datasets

Figures 8 and 9 show the normalized and absolute performance of the Block Reorganizer compared to four widely used spGEMM libraries and to our two baselines based on row and outer products. The X-axes represent the datasets, and the Y-axes represent the relative performance based on the row-product baseline (Figure 8) and the absolute performance in GFLOPS (Figure 9). Based on the figures, the Block Reorganizer achieves a performance gain of 1.43x over the row-product baseline, while the outer-product baseline and the libraries show only 0.95x, 0.29x, 0.22x, 0.55x, and 0.48x speedups, respectively. The Block Reorganizer also shows high coverage, as it exhibits the best performance on most datasets.

Block splitting and block limiting are generally effective for irregular data that require numerous calculations and memory accesses per block. However, block gathering can be applied to most matrices due to the high sparsity of the matrices, regardless of their regularity. Figure 10 shows the performance improvement of the three techniques over the outer-product baseline. Block gathering, which is applied to all sparse matrices, shows the highest coverage across matrices. However, for some matrices with high skewness (mostly in the Stanford datasets), block gathering of the underloaded blocks cannot improve performance significantly, because the execution time is dominated by the overloaded blocks or the merging process. For these datasets, block splitting and block limiting are very effective. Consequently, block limiting, block splitting, block gathering, and the Block Reorganizer show average performance gains of 1.05x, 1.05x, 1.28x, and 1.51x, respectively.

1) Better load balancing with block-splitting: To evaluate the effect of block splitting on load balancing, we define a new metric, the load balancing index (LBI), as shown in Equation (3). The LBI indicates the average execution time of all SMs normalized to the SM with the longest execution time.

LBI = (1/N) \sum_{i=1}^{N} cycles(SM_i) / MAX_cycles(SM),  N: number of SMs in the GPU    (3)

Figure 11 shows the LBI values and execution times of dominators for 10 Stanford datasets with increasing splitting factors. As long-running overloaded blocks are the main performance bottleneck for these datasets, only the execution time of the dominator blocks is measured to show the effect of block splitting.
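For reference, Equation (3) is straightforward to compute from per-SM cycle counts; the following helper is a minimal sketch under the assumption that per-SM execution cycles have already been collected (for example, with a profiler).

```cuda
#include <algorithm>
#include <vector>

// Load balancing index from Equation (3): the mean of per-SM execution cycles
// normalized by the longest-running SM. A value of 1.0 means perfectly
// balanced SMs; small values mean a few long-running SMs dominate.
double load_balancing_index(const std::vector<double>& sm_cycles) {
    if (sm_cycles.empty()) return 0.0;
    double max_cycles = *std::max_element(sm_cycles.begin(), sm_cycles.end());
    if (max_cycles == 0.0) return 0.0;
    double sum = 0.0;
    for (double c : sm_cycles) sum += c / max_cycles;
    return sum / sm_cycles.size();
}
```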

Fig. 8: Speedup of spGEMM operations for the row/outer-product baselines, multiple libraries (cuSPARSE, CUSP, bhSPARSE, MKL), and the Block Reorganizer on real-world datasets. All data are normalized to the row-product-based spGEMM performance.

Fig. 9: Absolute performance (GFLOPS) of spGEMM operations for the row/outer-product baselines, multiple libraries (cuSPARSE, CUSP, bhSPARSE, MKL), and the Block Reorganizer on real-world datasets.

Fig. 10: Relative performance of B-Splitting, B-Gathering, B-Limiting, and the Block Reorganizer.

Fig. 11: Load balancing effectiveness when applying B-Splitting.

Fig. 12: L2 cache throughput improvements using B-Splitting.

In Figure 11, the X-axis indicates splitting factors from 1 to 64, and the Y-axis represents the LBI values and the relative performance gains normalized to the performance with a splitting factor of 1. When the splitting factor increases, corresponding LBI and performance increments are observed. The LBI values converge to more than 90% when the splitting factor approximately equals the number of SMs in the target GPU. This implies that even as hardware scales up to a larger number of SMs, block splitting remains an effective technique to improve performance. By applying block splitting, the LBI increases from 0.17 to 0.96, and dominator performance is improved by 8.68x on average.

2) Better cache performance with block-splitting: Some matrices, such as "loc-gowalla," "sx-mathoverflow," and "slashDot," are observed to improve even when the splitting factor becomes larger than the number of existing SMs and there is no significant LBI improvement. This performance gain is mainly due to better cache utilization: block splitting improves the L2-cache throughput, mainly by splitting the overloaded blocks. Memory transactions are originally concentrated in a few overloaded blocks, and these transactions are distributed to the multiple divided blocks by block splitting. Thus, L2 cache utilization can be significantly improved by letting the divided blocks share the same memory spaces.

Figure 12 shows the improvement in L2 cache throughput when splitting overloaded blocks, measured using the NVIDIA nvprof profiler [31]. The X-axis represents the datasets and the Y-axis shows the L2 cache throughput. For all datasets, block splitting shows a substantial L2 cache throughput improvement of 8.9x on average. This explains the further performance gain when the splitting factor is larger than the number of SMs.

Fig. 13: Changes in sync stalls when applying B-Gathering.

3) Better latency hiding efficiency with block-gathering: To prove the effectiveness of block gathering, we profiled the kernel to observe the changes in the ratio of effective threads using nvprof. The sync stall percentage is used as a metric to demonstrate the ratio of effective threads, as numerous synchronization stalls exist when many non-effective threads await the completion of the computations of a few effective threads. Figure 13 shows the percentage of stalls due to thread synchronization. The X-axis represents the datasets and the Y-axis represents the percentage of sync stalls. As shown in Figure 13, the percentage of sync stalls decreases greatly when the block-gathering technique is applied.

As discussed, underloaded blocks cannot efficiently hide latency due to their insufficient number of effective threads. Therefore, most non-effective threads wait for the effective threads to execute their instructions. By applying block gathering to underloaded blocks to increase the number of effective threads in a block, most stalls on synchronization disappear, leaving only memory stalls. Consequently, block gathering greatly increases the performance for underloaded blocks.

Fig. 14: L2 cache throughput improvements using B-Limiting.

4) Less resource contention with block-limiting: Limiting the number of blocks on an SM is effective for memory-intensive kernels, as it alleviates resource contention. Thus, it is expected to increase the performance of merging kernels with many elements. Figure 14 shows the effect of block limiting on L2 cache throughput. The X-axis represents the 10 Stanford datasets on which block limiting is applied, and the Y-axis represents the L2 cache throughput with different limiting factors. The limiting factor indicates the additionally allocated shared memory size used to adjust the number of blocks on a single SM. For the experiment, the size of the allocated memory increases in steps of 6144 bytes. As shown in the figure, the L2 cache throughput improves as the limiting factor increases, up to a certain point, and then decreases beyond that point. The reason for the degradation is that the performance loss due to lower warp occupancy grows beyond the gain from reduced cache contention. As the distribution of matrices varies greatly, it is difficult to find an optimal point for each matrix. In this study, the limiting factor is set to a constant value of 4 x 6144 to show a fair performance gain. Consequently, the L2 cache read and write throughputs increase by 1.49x and 1.52x on average, respectively.

B. Performance Scalability on Different Architectures

Fig. 15: Performance scalability on various GPUs.

To verify the scalability of the Block Reorganizer on various GPU architectures, we tested the performance on three devices of different generations: TITAN Xp, Tesla V100, and RTX 2080 Ti, as shown in Table I. Figure 15 presents the normalized performance after applying the Block Reorganizer technique on the target GPUs. The X-axis represents the devices, and the Y-axis represents the normalized performance gain of each technique based on the row-product baseline. As shown in the figure, the Block Reorganizer shows the best performance across all the target GPU architectures, while the outer-product baseline shows a performance level similar to the row-product baseline. This is because the main problems of sparsity and skewness exist on all the GPU architectures, and the three main techniques proposed by the Block Reorganizer can solve these problems successfully. Therefore, 1.43x, 1.66x, and 1.40x speedups over the row-product baseline were achieved on the TITAN Xp, Tesla V100, and RTX 2080 Ti, respectively.

TABLE III: Synthetic datasets

C = A^2
Data   Dimension (N)   # elements   Parameters
s1     250000          62500        S: (0.45, 0.15, 0.15, 0.25)
s2     500000          250000
s3     750000          562500
s4     1000000         1000000
p1     1M              1M           P: (0.25, 0.25, 0.25, 0.25)
p2     1M              1M           (0.45, 0.15, 0.15, 0.25)
p3     1M              1M           (0.55, 0.15, 0.15, 0.15)
p4     1M              1M           (0.57, 0.19, 0.19, 0.05)
sp1    1M              4M           SP: (0.25, 0.25, 0.25, 0.25)
sp2    1M              3M
sp3    1M              2M
sp4    1M              1M

C = AB
Data   Dimension (N)   # elements   Parameters
15 A   32768           440747       scale=15, edge-factor=16
15 B   32768           440024
16 A   65536           908672       scale=16, edge-factor=16
16 B   65536           909957
17 A   131072          1864289      scale=17, edge-factor=16
17 B   131072          1868244
18 A   262144          3806124      scale=18, edge-factor=16
18 B   262144          3801872

C. Evaluation on Synthetic Datasets (C = A^2)

In the previous sections, we discussed the effectiveness of the Block Reorganizer on real-world datasets compared to the libraries and our customized baseline. To show the general applicability of the Block Reorganizer, we tested its effectiveness using synthetic datasets of contrasting characteristics, as shown in Table III. In these synthetic datasets, we changed the following important factors: the number of nodes (S: scalability), the skewness (P: power-law), and the sparsity (SP).

Fig. 16: (a) Speedup of spGEMM libraries and the Block Reorganizer normalized to the row-product baseline on synthetic datasets for C = A^2 operations, and (b) speedup for C = AB operations.

1) Scalability (dataset S): The first four matrices (s1-s4) in Figure 16 (a) show the performance changes when changing the matrix size. When the matrix is very small, cuSPARSE shows the best performance. However, as the matrices become larger, its performance drops significantly, and it eventually shows the lowest performance among the others. In contrast, the Block Reorganizer shows low performance on small matrices, as the execution time for the matrix multiplication itself is small and the performance is mainly affected by the preprocessing overheads. However, as the matrices become larger, it shows the best performance over all other methods.

2) Skewness (dataset P): The next four matrices (p1-p4) in Figure 16 (a) show the performance changes when increasing the matrix skewness. The X-axis represents the matrices used for the evaluation, and the Y-axis represents the performance normalized to the baseline. With an increase in the skewness level, cuSPARSE and bhSPARSE exhibit performance degradation similar to that on the real datasets. In contrast, the Block Reorganizer shows substantial performance gains in all cases owing to its wide coverage. Notably, block splitting and block limiting improve performance mainly for highly skewed data by solving the load imbalance and high resource contention problems.

3) Sparsity (dataset SP): The last four matrices (sp1-sp4) in Figure 16 (a) show the performance changes when decreasing the matrix density. bhSPARSE shows high performance over the other spGEMM implementations for relatively dense matrices. However, as the matrices become sparser, the Block Reorganizer outperforms all other methods, mainly by applying block gathering.

D. Evaluation on Synthetic Datasets (C = AB)

To prove the generality of our approach, we also evaluated the performance of the Block Reorganizer for C = AB cases, in addition to C = A^2. As shown in Table III, the last four sets of input matrix pairs (A, B) are synthetically generated with two parameters, scale and edge-factor. The size of a target matrix is set to 2^scale, and the number of non-zero entries is set to edge-factor x 2^scale. The performance data are evaluated by increasing the scale parameter from 15 to 18 while the edge-factor parameter is fixed to 16, as in Graphulo [32].

Figure 16 (b) shows the normalized performance of the Block Reorganizer for C = AB cases. The X-axis represents the spGEMM of matrix pairs, and the Y-axis represents the relative performance normalized to the row-product baseline. As shown in the figure, the Block Reorganizer shows fair speedups across all input matrix pairs. C = AB operations do not generate as dense an output matrix as C = A^2 operations [32]. Therefore, block gathering is an effective optimization here, because most thread blocks are categorized into underloaded blocks with only a few overloaded blocks. Consequently, the Block Reorganizer achieves an average performance gain of 1.09x over the baseline, which is the best among the given techniques. The gain also appears scalable as the input size increases.

VII. RELATED WORKS

There have been many previous studies on spGEMM. NVIDIA and Intel provide libraries to support fast spGEMM [10], [11], [33]. Furthermore, several optimized techniques have also been proposed [13], [34]-[42].

In more detail, regularization [35], input categorization [36], and resource optimization [37] techniques have been proposed for spGEMM on GPUs. From the perspective of load balancing, lbGEMM [13] greatly improved performance by introducing an outer-product scheme to solve the thread-level load balancing problem. AC-spGEMM [39] also improved overall performance greatly by using thread-level load balancing on row-product-based spGEMM. Akbudak [40] improved merging performance by increasing matrix locality, orchestrating partitioned and permuted workloads in order to reduce communication overheads between processors. Kernert [41] and Patwary [42] improved cache locality using adaptive tiling of the target matrices.

However, as discussed in Section III, these techniques are not optimally suited to matrix multiplication for SNS analysis due to their lack of consideration of the power-law degree distribution [10], [11], [33], SM-level load balancing, or in-SM resource utilization problems [13], [35]-[37]. Our outer-product-based approach also shows stable performance gains across various target matrices by resolving the thread-level load imbalance problem natively, without introducing complex per-row-level load balancing techniques, which often require additional control overhead to maintain per-row linked-list structures [39].

We propose three novel techniques for better load balancing and resource utilization. Several related studies have also been proposed [17], [18], [43], [44]. Thread tailor [43] adjusted the number of threads by combining multiple CPU threads into a merged thread based on profile results. Lee [18] and Kayiran [17] showed that allocating the maximum number of TBs on GPUs does not always guarantee the best performance, and suggested hardware-level approaches for finding and allocating the optimal number of TBs. Ho [44] introduced thread pairing, which merges two threads into one thread to vectorize operations on GPUs. These approaches are partially related to our approach.

VIII. CONCLUSION

This work proposed a novel optimization pass called Block Reorganizer for outer-product-based spGEMM, with three block-level optimization techniques: B-Splitting, B-Gathering, and B-Limiting. Block Reorganizer first identifies overloaded and underloaded thread blocks and then applies different techniques to each. It solves the SM-level load imbalance problem by splitting overloaded blocks into multiple small blocks using B-Splitting. For underloaded blocks, it increases in-SM computing unit utilization by gathering multiple underloaded blocks into a single block using B-Gathering. It also limits the number of thread blocks allocated to an SM using B-Limiting when overloaded rows exist in the merging process. With these three optimization techniques, Block Reorganizer achieves an average speedup of 1.43x in execution time over the baseline across 28 real-world datasets on a target server-class GPU.
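To make the flow of the three passes concrete, the following is a minimal, schematic sketch of how the block-level classification and the B-Splitting / B-Gathering transformations summarized above could be organized; it is not the paper's implementation. The ThreadBlock structure, the flops estimate, and the thresholds OVERLOAD_THRESHOLD, GATHER_TARGET, and MAX_BLOCKS_PER_SM are hypothetical placeholders, and B-Limiting is only indicated as a comment because it acts on the launch configuration rather than on the block list.

```python
# Schematic illustration only: all names, metrics, and thresholds are
# hypothetical placeholders, not the paper's actual data structures.
from dataclasses import dataclass
from typing import List

@dataclass
class ThreadBlock:
    rows: List[int]     # input rows/columns assigned to this block
    flops: int          # estimated multiplications (expansion work)

# hypothetical tuning knobs
OVERLOAD_THRESHOLD = 1 << 16   # blocks above this are treated as overloaded
GATHER_TARGET      = 1 << 12   # target work when packing underloaded blocks
MAX_BLOCKS_PER_SM  = 8         # cap applied by B-Limiting during merging

def b_splitting(block: ThreadBlock) -> List[ThreadBlock]:
    """B-Splitting: divide an overloaded block into several smaller blocks."""
    pieces = block.flops // OVERLOAD_THRESHOLD + 1
    chunk = max(1, len(block.rows) // pieces)
    return [ThreadBlock(block.rows[i:i + chunk], block.flops // pieces)
            for i in range(0, len(block.rows), chunk)]

def b_gathering(blocks: List[ThreadBlock]) -> List[ThreadBlock]:
    """B-Gathering: pack several underloaded blocks into one larger block."""
    gathered, cur_rows, cur_flops = [], [], 0
    for blk in blocks:
        cur_rows += blk.rows
        cur_flops += blk.flops
        if cur_flops >= GATHER_TARGET:
            gathered.append(ThreadBlock(cur_rows, cur_flops))
            cur_rows, cur_flops = [], 0
    if cur_rows:
        gathered.append(ThreadBlock(cur_rows, cur_flops))
    return gathered

def block_reorganizer(blocks: List[ThreadBlock]) -> List[ThreadBlock]:
    """Classify blocks by estimated work, then split or gather them."""
    overloaded  = [b for b in blocks if b.flops >= OVERLOAD_THRESHOLD]
    underloaded = [b for b in blocks if b.flops <  OVERLOAD_THRESHOLD]
    reorganized = []
    for b in overloaded:
        reorganized += b_splitting(b)
    reorganized += b_gathering(underloaded)
    # B-Limiting is not shown: it would cap the number of concurrent blocks
    # per SM (e.g., MAX_BLOCKS_PER_SM) via the launch configuration when
    # overloaded rows are detected in the merge phase.
    return reorganized
```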
IX. ACKNOWLEDGMENTS

Thanks to Myung-Hwan Jang and Hyuck-Moo Gwon for all their help and feedback. We also thank the anonymous reviewers who provided good suggestions for improving the quality of this work. This work was supported by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-IT1901-03. Yongjun Park is the corresponding author.

REFERENCES

[1] D.-H. Bae et al., "Constructing seminal paper genealogy," in Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011, pp. 2101-2104.
[2] G. He et al., "Parallel SimRank computation on large graphs with iterative aggregation," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010, pp. 543-552.
[3] Y. Cai et al., "Efficient algorithm for computing link-based similarity in real world networks," in 2009 Ninth IEEE International Conference on Data Mining. IEEE, 2009, pp. 734-739.
[4] Y. Dong et al., "Link prediction and recommendation across heterogeneous social networks," in 2012 IEEE 12th International Conference on Data Mining. IEEE, 2012, pp. 181-190.
[5] Y. Koren et al., "Matrix factorization techniques for recommender systems," Computer, no. 8, pp. 30-37, 2009.
[6] J. Nickolls et al., "NVIDIA CUDA software and GPU parallel computing architecture," in Microprocessor Forum, May 2007.
[7] KHRONOS Group, "OpenCL - the open standard for parallel programming of heterogeneous systems," 2010, http://www.khronos.org.
[8] J. Leskovec et al., "Graph evolution: Densification and shrinking diameters," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 1, p. 2, 2007.
[9] C. W. Keßler and C. Smith, "The SPARAMAT approach to automatic comprehension of sparse matrix computations," in Proceedings of the Seventh International Workshop on Program Comprehension. IEEE Computer Society, 1999, pp. 200-207.
[10] "NVIDIA cuSPARSE Library," http://developer.nvidia.com/cusparse.
[11] S. Dalton et al., "CUSP: Generic parallel algorithms for sparse matrix and graph computations," 2014, version 0.5.0. [Online]. Available: http://cusplibrary.github.io/
[12] W. Liu and B. Vinter, "An efficient GPU general sparse matrix-matrix multiplication for irregular data," in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, May 2014, pp. 370-381.
[13] Y.-Y. Jo et al., "Efficient sparse matrix multiplication on GPU for large social network analysis," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015, pp. 1261-1270.
[14] S. Pal et al., "OuterSPACE: An outer product based sparse matrix multiplication accelerator," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb. 2018, pp. 724-736.
[15] J. J. Elliott and C. M. Siefert, "Low thread-count Gustavson: A multithreaded algorithm for sparse matrix-matrix multiplication using perfect hashing," in 2018 IEEE/ACM 9th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), Nov. 2018, pp. 57-64.
[16] S. K. Raman et al., "Implementing Streaming SIMD Extensions on the Pentium III processor," IEEE Micro, vol. 20, no. 4, pp. 47-57, 2000.
[17] O. Kayiran et al., "Neither more nor less: Optimizing thread-level parallelism for GPGPUs," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 157-166. [Online]. Available: http://dl.acm.org/citation.cfm?id=2523721.2523745
[18] M. Lee et al., "Improving GPGPU resource utilization through alternative thread block scheduling," in 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2014, pp. 260-271.
[19] F. G. Gustavson, "Two fast algorithms for sparse matrices: Multiplication and permuted transposition," ACM Transactions on Mathematical Software (TOMS), vol. 4, no. 3, pp. 250-269, 1978.
[20] Y. Yu et al., "A compiler-based approach for GPGPU performance calibration using TLP modulation (WIP paper)."
[21] NVIDIA, "NVIDIA Titan Xp graphics cards," 2017, https://www.nvidia.com/en-us/titan/titan-xp/.
[22] NVIDIA, "NVIDIA DGX Station," 2017, https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-station/nvidia-dgx-station-datasheet.pdf.
[23] Intel, "Intel Xeon E5-2600 model specification," 2016, https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-brief.html.
[24] Intel, "Intel Xeon Gold 5115 model specification," 2017, https://ark.intel.com/content/www/kr/ko/ark/products/120484/intel-xeon-gold-5115-processor-13-75m-cache-2-40-ghz.html.
[25] NVIDIA, "NVIDIA Tesla V100," 2017, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[26] NVIDIA, "NVIDIA RTX 2080 Ti graphics cards," 2018, https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/.
[27] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1-1:25, Dec. 2011. [Online]. Available: http://doi.acm.org/10.1145/2049662.2049663
[28] "Stanford large network dataset collection," http://snap.stanford.edu/data.
[29] D. Chakrabarti et al., "R-MAT: A recursive model for graph mining," in Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 2004, pp. 442-446.
[30] D. Zheng et al., "FlashGraph: Processing billion-node graphs on an array of commodity SSDs," in 13th USENIX Conference on File and Storage Technologies (FAST 15), 2015, pp. 45-58.
[31] NVIDIA, Profiler User's Guide, 2018, http://docs.nvidia.com/cuda/pdf/CUDA_Profiler_Users_Guide.pdf.
[32] D. Hutchison et al., "Graphulo implementation of server-side sparse matrix multiply in the Accumulo database," in 2015 IEEE High Performance Extreme Computing Conference (HPEC), Sep. 2015, pp. 1-7.
[33] Intel, "Intel Math Kernel Library," 2003, https://software.intel.com/en-us/mkl.
[34] B. Xie et al., "CVR: Efficient vectorization of SpMV on x86 processors," in Proceedings of the 2018 International Symposium on Code Generation and Optimization. ACM, 2018, pp. 149-162.
[35] J. Zhang and L. Gruenwald, "Regularizing irregularity: Bitmap-based and portable sparse matrix multiplication for graph data on GPUs," in Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA). ACM, 2018, p. 4.
[36] C. Hong et al., "Efficient sparse-matrix multi-vector product on GPUs," in Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2018, pp. 66-79.
[37] J. Liu et al., "Register-based implementation of the sparse general matrix-matrix multiplication on GPUs," in Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '18. New York, NY, USA: ACM, 2018, pp. 407-408. [Online]. Available: http://doi.acm.org/10.1145/3178487.3178529
[38] F. Gremse et al., "GPU-accelerated sparse matrix-matrix multiplication by iterative row merging," SIAM Journal on Scientific Computing, vol. 37, no. 1, pp. C54-C71, 2015.
[39] M. Winter et al., "Adaptive sparse matrix-matrix multiplication on the GPU," in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. ACM, 2019, pp. 68-81.
[40] K. Akbudak and C. Aykanat, "Simultaneous input and output matrix partitioning for outer-product-parallel sparse matrix-matrix multiplication," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C568-C590, 2014.
[41] D. Kernert et al., "Topology-aware optimization of big sparse matrices and matrix multiplications on main-memory systems," in 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 2016, pp. 823-834.
[42] M. M. A. Patwary et al., "Parallel efficient sparse matrix-matrix multiplication on multicore platforms," in International Conference on High Performance Computing. Springer, 2015, pp. 48-57.
[43] J. Lee et al., "Thread Tailor: Dynamically weaving threads together for efficient, adaptive parallel applications," in Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010, pp. 270-279.
[44] N.-M. Ho and W.-F. Wong, "Exploiting half precision arithmetic in NVIDIA GPUs," in 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2017, pp. 1-7.
