Accelerating Sparse Approximate Matrix Multiplication on GPUs

Xiaoyan Liu1, Yi Liu1, Ming Dun2, Bohong Yin1, Hailong Yang1, Zhongzhi Luan1, and Depei Qian1
School of Computer Science and Engineering1, School of Cyber Science and Technology2
Beihang University1,2, Beijing, China, 100191
{liuxiaoyan,yi.liu,dunming0301,yinbohong,hailong.yang,zhongzhi.luan,depeiq}@buaa.edu.cn

arXiv:2103.13042v1 [cs.PF] 24 Mar 2021

ABSTRACT
Although matrix multiplication plays a vital role in computational linear algebra, there are few efficient solutions for multiplying near-sparse matrices. Sparse Approximate Matrix Multiply (SpAMM) is one of the algorithms that fill the performance gap neglected by traditional optimizations for dense/sparse matrix multiplication. However, existing SpAMM algorithms fail to exploit the performance potential of GPUs for acceleration. In this paper, we present cuSpAMM, the first parallel SpAMM algorithm optimized for multiple GPUs. Several performance optimizations have been proposed, including algorithm re-design to adapt to the thread parallelism, blocking strategies for memory access optimization, and acceleration with the tensor core. In addition, we scale cuSpAMM to run on multiple GPUs with an effective load balance scheme. We evaluate cuSpAMM on both synthesized and real-world datasets on multiple GPUs. The experiment results show that cuSpAMM achieves significant performance speedup compared to the vendor-optimized cuBLAS and cuSPARSE libraries.

KEYWORDS
sparse approximate matrix multiplication, performance optimization, multiple GPUs

1 INTRODUCTION
Generally, existing GEMM algorithms can be classified into dense and sparse algorithms according to the ratio of non-zero elements in the input matrices. Given a matrix A ∈ R^(N×N), the number of non-zero elements is O(N^2) for dense algorithms and O(N) for sparse algorithms. However, in real applications there are a large number of matrices in the middle ground between dense and sparse matrices, a.k.a. near-sparse matrices, whose number of non-zero elements lies between O(N^2) and O(N). Near-sparse matrices are widely used in scientific computing, for example in computational chemistry [41], quantum physics [26], and electronic structure calculation [44].

Near-sparse matrices also exist in emerging domains such as deep neural networks. Especially in convolutional neural networks (CNNs), the feature and weight matrices participating in the calculation are near-sparse [13, 23] due to weight pruning [31] and activation functions [33]. For example, the Rectified Linear Unit (ReLU) activation function can lead to more than 50% sparsity in the feature matrices on average [13]. In CNNs, the convolution operations between feature and weight matrices are transformed to GEMM using the im2col algorithm [2]. In such cases, the matrices involved in the GEMM calculation are also near-sparse.

There is also a special class of matrices that is inherently near-sparse: matrices with decay [5] (a.k.a. decay matrices), whose elements (values) decrease rapidly from the diagonal to the sides. Such elements can be ignored if they are small enough, and the corresponding matrices then become near-sparse. Due to these unique properties, much research has focused on decay matrices themselves, covering the decay rate [20], the left inverse [21, 48], high-dimensional statistics [4], and numerical analysis [52]. In addition, decay matrices often appear in widely used matrix operations such as the matrix inverse [7, 20], the matrix exponential [32], and Jacobi matrices [47], among others [6]. Moreover, decay matrices are commonly adopted in application domains such as quantum chemistry [5, 11] and quantum information theory [18, 19, 22, 46].
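To make the near-sparsity of decay matrices concrete, the following hedged host-side sketch (illustrative only and not part of cuSpAMM; the size N, the decay constant lambda, and the threshold are arbitrary assumptions) generates a synthetic exponential-decay matrix and counts how many entries reach a magnitude threshold. For a fixed decay rate, only a narrow band around the diagonal survives, so the retained entries grow roughly as O(N) even though the matrix is formally dense.

#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Build an N x N matrix with exponential decay |A[i][j]| = c * lambda^|i-j|
// and count the entries whose magnitude reaches a given threshold.
int main() {
    const int N = 2048;                 // assumed problem size
    const double c = 1.0, lambda = 0.5; // assumed decay parameters
    const double threshold = 1e-6;      // entries below this are treated as zero

    std::vector<double> A(static_cast<size_t>(N) * N);
    size_t kept = 0;
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            double v = c * std::pow(lambda, std::abs(i - j));
            A[static_cast<size_t>(i) * N + j] = v;
            if (std::fabs(v) >= threshold) ++kept;
        }
    }
    // With lambda = 0.5 and threshold = 1e-6, only the diagonals with
    // |i - j| <= 19 survive, i.e. O(N) of the N * N entries.
    std::printf("kept %zu of %zu entries (%.2f%%)\n",
                kept, A.size(), 100.0 * kept / A.size());
    return 0;
}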
However, existing research works [12, 37, 51] on dense and sparse GEMM are hardly efficient when applied to near-sparse matrices. On the one hand, research on dense GEMM focuses on reducing the computation complexity. For example, Strassen's algorithm [37] and Williams' algorithm [51] achieve O(N^2.8) and O(N^2.3727), respectively. However, the complexity reduction does little to eliminate the redundant computation that near-sparse matrices spend on zero elements. On the other hand, research on sparse GEMM proposes various storage formats such as CSR [12] to store sparse matrices compactly. However, the sparse formats can hardly benefit near-sparse GEMM due to its non-sparse nature. Therefore, both dense GEMM and sparse GEMM have limited performance potential for near-sparse GEMM.

Fortunately, approximation provides a good opportunity to boost the performance of near-sparse GEMM. For example, skipping the calculation on small enough elements of near-sparse matrices is a profitable way to accelerate performance. Based on this idea, Sparse Approximate Matrix Multiply (SpAMM) [14] has been proposed for accelerating decay matrix multiplication. For matrices with exponential decay, existing research [3] has demonstrated that the absolute error of SpAMM can be controlled reliably.

Meanwhile, with wide adoption in a large number of fields, GPUs have been proven to deliver excellent speedup for matrix operations [45]. Especially with the advent of the tensor core units provided by NVIDIA GPUs, mixed-precision techniques have been exploited to further accelerate matrix operations [42]. Although there are a few research works optimizing SpAMM computation on CPUs [3, 9, 10], to the best of our knowledge there is no GPU implementation available for accelerating SpAMM computation, especially one exploiting architectural features such as the tensor core and scaling to multiple GPUs. This motivates our work to re-design the SpAMM algorithm for better adaptation to the GPU architecture and to propose corresponding optimizations that achieve superior performance compared to state-of-the-art GEMM libraries. Specifically, this paper makes the following contributions:
• We propose cuSpAMM, a re-designed SpAMM algorithm tailored for GPU. Specifically, we adapt the calculation steps and the data access patterns of the SpAMM algorithm to the memory hierarchy and thread organization of the GPU, with increased parallelism and reduced memory accesses.
• We propose several optimization schemes, such as blocking strategies for the calculation kernels and utilization of the tensor core for accelerating the calculation. In addition, we present a scaling method to extend cuSpAMM to multiple GPUs.
• We compare cuSpAMM with highly optimized vendor libraries such as cuBLAS [15] and cuSPARSE [17] on GPUs. In addition, we evaluate on two real datasets from electronic structure calculation (ergo) and a convolutional neural network (VGG13) to demonstrate the performance of cuSpAMM on real-world applications.

The paper is organized as follows. In Section 2, we introduce the background of the SpAMM algorithm and GPU optimizations. In Section 3, we present our re-designed SpAMM algorithm, cuSpAMM, and the corresponding optimizations for performance acceleration on multiple GPUs. Section 4 compares cuSpAMM with state-of-the-art GEMM libraries and evaluates it on two real-world datasets. Section 5 discusses related works and Section 6 concludes this paper.

2 BACKGROUND
2.1 Decay matrix and SpAMM algorithm
A matrix is defined as a decay matrix when its elements decrease following a decay rate from the diagonal to the sides. The decay rate can be exponential or algebraic, formulated as |A[i][j]| < cλ^|i−j| and |A[i][j]| < c/(|i−j| + 1)^λ, respectively, where A[i][j] is the element of matrix A at index (i, j). The separation |i−j| can be replaced by another index-based distance function of the matrix, or by a physical distance such as |r⃗_i − r⃗_j| in non-synthetic cases [3]. By mathematical definition, the decay matrix is quite dense due to few zero elements. However, under certain conditions (e.g., elements smaller than a threshold), a number of elements in the decay matrix can be treated as zeros, which renders the matrix near-sparse.
SpAMM is an approximate matrix multiplication method that can be used on decay matrices. The problem solved by SpAMM is to compute the product C ≈ AB while skipping the sub-block products whose contribution is negligible, as outlined in Algorithm 1.

Algorithm 1 SpAMM algorithm
1: Input: Matrices A, B, parameter τ
2: Output: Multiplication result C
3: if lowest level then
4:     return C = AB
5: for i = 0 to 1 do
6:     for j = 0 to 1 do
7:         if ||A_{i,0}||_F ||B_{0,j}||_F ≥ τ then
8:             T_0 = SpAMM(A_{i,0}, B_{0,j}, τ)
9:         else
10:            T_0 = 0
11:        if ||A_{i,1}||_F ||B_{1,j}||_F ≥ τ then
12:            T_1 = SpAMM(A_{i,1}, B_{1,j}, τ)
13:        else
14:            T_1 = 0
15:        C_{i,j} = T_0 + T_1
16: return C
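To clarify the recursion in Algorithm 1, the following hedged host-side C++ sketch applies it to square matrices whose dimension is a power of two times a leaf size; it illustrates the algorithm as stated above and is not the cuSpAMM GPU implementation (the leaf size LOWEST, the Block view struct, and the helper names are assumptions of this example). Lines 3-4 of Algorithm 1 correspond to the dense leaf product, and the norm tests on lines 7 and 11 correspond to the frobenius() check that skips negligible sub-products.

#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Row-major view of a square sub-block inside a larger matrix.
struct Block {
    const double* data;  // top-left element of the sub-block
    int n;               // sub-block dimension
    int ld;              // leading dimension of the parent matrix
};

static double frobenius(Block m) {
    double s = 0.0;
    for (int i = 0; i < m.n; ++i)
        for (int j = 0; j < m.n; ++j)
            s += m.data[i * m.ld + j] * m.data[i * m.ld + j];
    return std::sqrt(s);
}

// C (leading dimension ldc) += A * B, plain dense product at the lowest level.
static void gemm_leaf(Block a, Block b, double* c, int ldc) {
    for (int i = 0; i < a.n; ++i)
        for (int k = 0; k < a.n; ++k)
            for (int j = 0; j < a.n; ++j)
                c[i * ldc + j] += a.data[i * a.ld + k] * b.data[k * b.ld + j];
}

// Assumed leaf size below which recursion stops (Algorithm 1, lines 3-4).
static const int LOWEST = 64;

// Recursive SpAMM: C += A * B, skipping sub-products whose norm product
// falls below tau (Algorithm 1, lines 5-15, generalized over k).
static void spamm(Block a, Block b, double* c, int ldc, double tau) {
    if (a.n <= LOWEST) {
        gemm_leaf(a, b, c, ldc);
        return;
    }
    const int h = a.n / 2;
    auto sub = [h](Block m, int bi, int bj) {
        return Block{m.data + bi * h * m.ld + bj * h, h, m.ld};
    };
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k) {
                Block aik = sub(a, i, k), bkj = sub(b, k, j);
                if (frobenius(aik) * frobenius(bkj) >= tau)
                    spamm(aik, bkj, c + i * h * ldc + j * h, ldc, tau);
                // otherwise the contribution T_k is treated as zero
            }
}

int main() {
    const int N = 256;
    std::vector<double> A(N * N), B(N * N), C(N * N, 0.0);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            A[i * N + j] = B[i * N + j] = std::pow(0.6, std::abs(i - j));
    spamm(Block{A.data(), N, N}, Block{B.data(), N, N}, C.data(), N, 1e-8);
    std::printf("C[0][0] = %f\n", C[0]);
    return 0;
}

A real implementation would likely precompute the sub-block norms rather than recompute them at every recursive call, which hints at why the calculation steps are re-designed for the GPU in cuSpAMM.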
2.2 GPU architecture and optimization
2.2.1 GPU architecture. The CUDA [16] programming paradigm provides a classic definition of the GPU architecture, with threads and memory organized hierarchically.
Thread hierarchy - Threads are organized at four levels of increasing granularity: thread, warp, block, and grid. One warp usually contains 32 threads, and the threads within a warp are implicitly synchronized in execution. The warp is the most basic unit of calculation execution and hardware scheduling. A block consists of multiple threads, and the threads within the same block can be synchronized. A grid consists of multiple blocks, and the blocks within the grid are executed in a SIMD fashion.
Memory hierarchy - The memory hierarchy can be divided into three levels: the register level, the shared memory level, and the global memory level. Each thread has private registers, which are the fastest but also the most limited storage on the GPU. The second fastest memory is the shared memory of each block; the threads within the same block can cooperate through the shared memory. The global memory level consists of the global memory and the texture memory, which host the data transferred from the CPU.
2.2.2 GPU optimization. We briefly summarize the commonly used optimization strategies for high-performance matrix multiplication on GPU.
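As a concrete illustration of the thread and memory hierarchy above, and of one widely used strategy (shared-memory tiling), the following minimal CUDA sketch computes C = A * B for square matrices whose dimension is a multiple of the tile size. It is a generic textbook-style example under assumed parameters (TILE, n), not a kernel from cuSpAMM.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile edge; each block computes a TILE x TILE tile of C

// Each thread block stages one tile of A and one tile of B in shared memory,
// so each element is read from global memory TILE times less often than in
// a naive kernel.
__global__ void tiledGemm(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C owned by this thread
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C owned by this thread
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                        // wait until the block loaded both tiles
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                        // wait before overwriting the tiles
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 512;                          // assumed multiple of TILE
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 1.0f), hC(n * n);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);                     // a thread block of TILE * TILE threads
    dim3 grid(n / TILE, n / TILE);              // a grid of blocks covering all tiles of C
    tiledGemm<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("C[0][0] = %.1f (expected %d)\n", hC[0], n);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

The blocking strategies mentioned in the contributions follow the same general principle of staging reused data in the faster levels of the memory hierarchy.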