Optimization of GPU-Based Sparse Matrix Multiplication for Large Sparse Networks
2020 IEEE 36th International Conference on Data Engineering (ICDE)

Jeongmyung Lee, Seokwon Kang, Yongseung Yu, Yong-Yeon Jo, Sang-Wook Kim, Yongjun Park
Department of Computer Science, Hanyang University, Seoul, Korea
{jeongmyung, kswon0202, dydtmd1991, jyy0430, wook, yongjunpark}@hanyang.ac.kr

Abstract—Sparse matrix multiplication (spGEMM) is widely used to analyze sparse network data and to extract important information based on the matrix representation. Because it contains a high degree of data parallelism, many efficient implementations using data-parallel programming platforms such as CUDA and OpenCL have been introduced for graphic processing units (GPUs). Several well-known spGEMM techniques, such as cuSPARSE and CUSP, often do not utilize GPU resources fully, owing to load imbalance between threads in the expansion process and high memory contention in the merge process. Furthermore, even though several outer-product-based spGEMM techniques have been proposed to solve the load-balancing problem in expansion, they still do not utilize GPU resources fully, because severe computation-load variations exist among the thread blocks. To solve these challenges, this paper proposes a new optimization pass called Block Reorganizer, which balances the total computation of each computing unit on the target GPU, based on the outer-product-based expansion process, and reduces the memory pressure during the merge process. For expansion, it first identifies the actual computation amount of each block and then performs two thread block transformation processes based on block characteristics: 1) B-Splitting to transform a heavy-computation block into multiple small blocks and 2) B-Gathering to aggregate multiple small-computation blocks into a larger block. While merging, it improves the overall performance by performing B-Limiting to limit the number of blocks on each computing unit. Experimental results show that it improves total kernel execution performance by 1.43x on average, compared to row-product-based spGEMM, for NVIDIA Titan Xp GPUs on real-world datasets.

Index Terms—Sparse matrix multiplication; sparse network; GPU; linear algebra

I. INTRODUCTION

Matrix multiplication is one of the core kernels in various data-mining applications, such as social network services (SNSs) and graph analytics, and is used to extract key information. With the rapid growth of sparse networks, the extraction of valuable information required for various operations, such as ranking [1], similarity computation [2], [3], and recommendation [4], [5], has become a critical challenge. Weighted graphs are typically used to model such network data and are represented in matrix form, where each element contains the edge weight between two nodes. Matrix multiplication based on the adjacency matrix representation is widely used to extract useful information from the original data.

Because matrix multiplication is a data-parallel operation, graphic processing units (GPUs) are considered the most appropriate accelerators for speeding it up, as they provide high computational throughput through single-instruction, multiple-thread (SIMT) programming models such as CUDA [6] and OpenCL [7]. A GPU generally consists of a set of Streaming Multiprocessors (SMs). OpenCL/CUDA programs are executed on GPUs by allocating Thread Blocks (TBs), or Cooperative Thread Arrays (CTAs)¹, which are groups of threads, to each SM in parallel.

¹In this work, we use the terms thread block and CTA interchangeably.

The main challenge is developing an efficient matrix multiplication technique that considers the data-specific characteristics of sparsity and power-law degree distribution [8]. Typical sparse networks contain a much smaller number of edges with non-zero values than the number of all possible edges between nodes, and therefore most of the elements in a sparse matrix are zero. To reduce the memory waste caused by sparsity, matrices are typically represented in a sparse format [9]. Sparse networks also commonly have power-law distributions [8], where a very small number of hub nodes have extremely large numbers of connections and most other nodes have very small numbers of connections. Owing to this power-law property, the distribution of non-zero elements is often highly skewed, and the resulting matrices for sparse networks generally contain a few rows with large numbers of non-zero elements, while a large number of rows have only a few non-zero elements.
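As a concrete illustration of the sparse format mentioned above, the following host-side sketch (ours, not taken from the paper; the CsrMatrix struct and denseToCsr helper are hypothetical names) builds a compressed sparse row (CSR) representation, one common sparse format, from a dense adjacency matrix:

// Minimal CSR construction sketch (hypothetical helper, host-side C++).
// CSR stores only the non-zero entries of a sparse matrix:
//   rowPtr[i] .. rowPtr[i+1] delimit the non-zeros of row i,
//   colIdx[k] and val[k] give their column index and edge weight.
#include <vector>

struct CsrMatrix {
    int numRows;
    int numCols;
    std::vector<int>   rowPtr;  // size numRows + 1
    std::vector<int>   colIdx;  // size nnz
    std::vector<float> val;     // size nnz
};

CsrMatrix denseToCsr(const std::vector<std::vector<float>>& dense) {
    CsrMatrix csr;
    csr.numRows = static_cast<int>(dense.size());
    csr.numCols = csr.numRows ? static_cast<int>(dense[0].size()) : 0;
    csr.rowPtr.push_back(0);
    for (const auto& row : dense) {
        for (int j = 0; j < csr.numCols; ++j) {
            if (row[j] != 0.0f) {            // keep only non-zero edges
                csr.colIdx.push_back(j);
                csr.val.push_back(row[j]);
            }
        }
        csr.rowPtr.push_back(static_cast<int>(csr.colIdx.size()));
    }
    return csr;
}

For a power-law network, the gaps between consecutive rowPtr entries vary widely across rows, which is exactly the skew in non-zero counts that complicates load balancing in the schemes discussed next.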
There have been several previous studies on implementing efficient sparse matrix multiplication (spGEMM) for two sparse matrices on GPUs, including cuSPARSE [10] and CUSP [11]. These techniques generally consist of a row-product-based intermediate-data expansion process and a parallel data merge process. Despite their promising performance, GPU resources are still not fully utilized. First, the row-product-based expansion process often leads to poor load balancing among threads because of the irregular distributions of target sparse networks. Second, excessive memory accesses during the parallel merge process frequently lead to lower performance than expected because of the significant memory contention they cause. Although several improved row-product-based techniques, such as bhSPARSE [12], have recently been introduced, experimental results have shown that they still suffer from the poor thread-level load balancing of the row-product-based scheme and from the high performance overhead of the merge process when multiplying highly irregular matrices.

Fig. 1: (a) A GPU architecture overview and (b) the effect of the shared memory requirement per thread block on thread block allocation.

To overcome these limitations, several new spGEMM approaches have been introduced that adopt the outer-product (column-row product) scheme [13], [14]. Outer-product-based expansion is expected to produce higher performance than row-product-based expansion, because the computational loads of all threads in a TB are identical. However, the outer product is not yet an ideal solution. First, the outer-product algorithm creates another load imbalance problem, this time among SMs, because of the high block-level workload variance. In the outer-product scheme, each TB is formed from a column and a row of the input matrices. Therefore, the resulting TBs consist of several computation-heavy TBs (overloaded blocks), produced from the columns and rows with huge numbers of non-zero elements, and a massive number of computation-light TBs (underloaded blocks) with large numbers of zero elements. As a result, the SMs that execute overloaded blocks can become a performance bottleneck while all other SMs are idle.

Second, the outer-product scheme is mainly effective for expansion; the merge performance remains the same or may even become worse, because the outer product produces intermediate results in matrix form during expansion, whereas the row product produces intermediate results in single-row form [15]. Therefore, full matrix-wise accumulation may be slower than row-wise accumulation owing to the additional column-address indexing.
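To make the outer-product expansion step concrete, the following CUDA sketch (a minimal illustration under our own assumptions, not the authors' kernel) assigns one thread block to the pair formed by column k of the first input matrix, assumed to be stored in CSC as colPtrA/rowIdxA/valA, and row k of the second, assumed to be stored in CSR as rowPtrB/colIdxB/valB; partial products are appended to a global intermediate buffer, and all names are hypothetical:

// Outer-product (column-row product) expansion sketch: one thread block per
// shared dimension k. Every thread strides over the same iteration space, so
// the load inside a block is even, as described in the text.
__global__ void outerProductExpand(const int* colPtrA, const int* rowIdxA,
                                   const float* valA,
                                   const int* rowPtrB, const int* colIdxB,
                                   const float* valB,
                                   int* outRow, int* outCol, float* outVal,
                                   int* outCount) {
    int k    = blockIdx.x;                      // pair (column k of A, row k of B)
    int aBeg = colPtrA[k], aEnd = colPtrA[k + 1];
    int bBeg = rowPtrB[k], bEnd = rowPtrB[k + 1];
    int aLen = aEnd - aBeg, bLen = bEnd - bBeg;
    int total = aLen * bLen;                    // partial products for this block

    for (int t = threadIdx.x; t < total; t += blockDim.x) {
        int i = aBeg + t / bLen;                // non-zero of A's column k
        int j = bBeg + t % bLen;                // non-zero of B's row k
        int slot = atomicAdd(outCount, 1);      // append to intermediate buffer
        outRow[slot] = rowIdxA[i];
        outCol[slot] = colIdxB[j];
        outVal[slot] = valA[i] * valB[j];
    }
}

Because total is the product of the two non-zero counts, blocks built from hub columns and rows perform far more iterations than the rest even though threads within any single block are balanced; this is the block-level imbalance, and the matrix-form intermediate buffer, that motivate the optimizations below.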
To address these limitations, we propose a novel outer-product-based spGEMM optimization pass referred to as the Block Reorganizer. It first identifies the computation amount of each block and categorizes the blocks as overloaded, normal, or underloaded, based on their computational loads. It then performs two different optimizations in the expansion process: Block Splitting for overloaded blocks and Block Gathering for underloaded blocks. Block Splitting is the process of dividing an overloaded block into multiple small blocks for better load balancing. For underloaded blocks, the Block Reorganizer performs the Block Gathering process, creating a combined block from multiple underloaded blocks to increase intra-SM computation-unit utilization and to improve latency-hiding efficiency.

This paper makes the following contributions:

1) Block Splitting: it divides an overloaded block into multiple smaller blocks for better load balancing across SMs.
2) Block Gathering: it merges several underloaded blocks into a combined block for better SM resource utilization and latency-hiding effectiveness.
3) Block Limiting: it prevents the blocks from executing together with other blocks on an SM, to minimize resource contention.

• An extensive evaluation of the effectiveness of the Block Reorganizer framework using synthetic and real-world datasets on multiple target GPUs.

II. BACKGROUND

A. GPU Architectures and SIMT Programming Model

GPUs are accelerators that provide high throughput by maximizing data parallelism through an SIMT programming model such as CUDA [6] or OpenCL [7], which enables multiple independent threads to execute the same instructions concurrently. In these programming models, a thread is the basic unit of execution, and several threads are grouped into TBs or CTAs. A TB is the main scheduling unit for execution on GPUs, and the threads within a TB are synchronized by barrier operations. For NVIDIA GPUs in particular, a number of threads (typically 32) are also grouped into another scheduling unit called a warp. In NVIDIA GPUs, the threads in a warp are executed in lock-step, similar to SIMD.
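To make the thread, thread block, and warp hierarchy just described concrete, the following minimal CUDA fragment (an illustrative sketch, not code from the paper) launches a small grid and reports, once per warp, which block and warp each group of lock-step threads belongs to:

// Minimal illustration of the SIMT execution hierarchy described above:
// threads are grouped into thread blocks (TBs/CTAs), each TB is scheduled
// onto an SM, and inside a TB threads are issued in warps of 32.
#include <cstdio>

__global__ void hierarchyDemo() {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    int warpId = threadIdx.x / warpSize;                 // warp within this TB
    __syncthreads();                                     // TB-wide barrier
    if (threadIdx.x % warpSize == 0)                     // one print per warp
        printf("block %d, warp %d, first thread %d\n", blockIdx.x, warpId, tid);
}

int main() {
    hierarchyDemo<<<4, 128>>>();   // 4 TBs, 128 threads (4 warps) each
    cudaDeviceSynchronize();
    return 0;
}

Each of the four thread blocks here can be placed on a different SM, and the 128 threads of a block are issued as four warps of 32 that execute in lock-step.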