
High-Performance Sparse Matrix Multi-Vector Multiplication on Multi-Core Architectures

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Kunal Singh,

Graduate Program in Computer Science and Engineering

The Ohio State University

2018

Master’s Examination Committee:

Dr. P. Sadayappan, Advisor

Dr. Atanas Rountev

© Copyright by

Kunal Singh

2018

Abstract

SpMM is a widely used primitive in many domains like Fluid Dynamics, Data Analytics, Economic Modelling and Graph Analytics. In the Machine Learning and Artificial Neural Network domain, SpMM is used iteratively and is the main bottleneck in many kernels. Due to its prime importance, many Machine Learning frameworks like Tensorflow, PyTorch, etc. offer SpMM as a primitive. When compared to SpMV, SpMM has a higher theoretical operational intensity. However, the fraction of roofline performance achieved by SpMM is lower than that of SpMV, suggesting possible improvements. In this thesis, we systematically explore different design choices for the SpMM primitive and develop a high-performance SpMM implementation targeted at multi-core and many-core architectures. In addition, we also developed an analytical model to guide the tile size selection. As shown in our experimental section, we achieve up to 3.4x speedup when compared to the Intel MKL library.

This thesis is dedicated to my parents

Acknowledgments

Foremost, I would like to express my sincerest gratitude to my advisor Professor

P. Sadayappan for giving me an opportunity to work with him. This thesis would not have been possible without his immense support and guidance. I am grateful for his invaluable advice, motivation and his ever-positive spirit. His hard-working attitude motivates everyone around him to work harder and push oneself beyond the limit.

I am grateful to Professor Atanas Rountev for his valuable insight and guidance.

Valuable discussions with him helped me immensely in this thesis. He has been an amazing teacher and got me interested in compilers. I cannot thank him enough.

I will always be indebted to Dr. Aravind Sukumaran Rajam for his support and guidance. He has gone above and beyond to help me in every part of my thesis from its inception to the writing.

I would also like to thank my lab mates Changwan, Vineeth, Israt, Prashant, Emre, Gordon and Rohit for the intellectual discussions and support. I would also like to thank my friends Harsh, Piyush, Pragya, Pravar, Anshu, Anand, Shashank and Atul for being such awesome people. They have always supported me through my difficult times and I am indebted to them in every way. I have enjoyed the time I spent with them playing, traveling and studying. They are some of the smartest and most hardworking people I have ever met and I will always look up to them.

My acknowledgment would be incomplete without thanking my colleagues Ian Cottingham, Konstantin Tereshko, Nitin Sinha, Madhavi Anand and Deepak Jha for their guidance and support. They have helped me grow professionally and provided me invaluable life experience.

Finally, I would like to thank my parents for their immense support and love.

Words could never be enough to express my gratitude towards them. They have always supported me unconditionally and have been my inspiration in life. I would not be the person I am today without their encouragement and confidence in me.

Vita

February 23, 1992 ...... Born - Ranchi, India

2013 ...... B.E.
2013-2016 ...... Senior Systems Engineer, Siemens Healthcare, India
2017-present ...... Graduate Research Associate, The Ohio State University

Fields of Study

Major : Computer Science and Engineering

Table of Contents

Abstract

Dedication

Acknowledgments

Vita

List of Tables

List of Figures

1. Introduction
   1.1 Motivation
   1.2 SpMM formulation
   1.3 Challenges in parallel SpMM
   1.4 Contributions
   1.5 Organization of this thesis

2. Background
   2.1 Related Work
       2.1.1 Taco
       2.1.2 Intel Math Kernel Library
       2.1.3 Compressed Sparse Blocks based SpMM

3. Analysis of SpMM

4. SpMM: Sparse Matrix Multi-Vector Multiplication
   4.0.1 Overview
   4.0.2 Data structure
   4.0.3 Algorithm
   4.0.4 Optimizations
   4.0.5 Model
   4.0.6 Reordering Matrices

5. Experiments
   5.0.1 Dataset
   5.0.2 SpMM CPU
   5.0.3 SpMM comparison with CuSparse on GPU
   5.0.4 Reordering
   5.0.5 Pre-processing

6. Conclusion and Future Work
   6.1 Conclusion
   6.2 Future Work

Bibliography

List of Tables

5.1 Data set

List of Figures

1.1 SpMM
3.1 Streaming SpMM
3.2 Data Movement Analysis
4.1 Blocked double compressed sparse column representation
4.2 Our SpMM Algorithm
4.3 Heat Map for mip1 in GFLOPS
4.4 Heat Map for Si41Ge41H72 in GFLOPS
5.1 SpMM KNL double precision lower K
5.2 SpMM KNL double precision higher K
5.3 SpMM Xeon E5 double precision lower K
5.4 SpMM Xeon E5 double precision higher K
5.5 Our algorithm on Intel KNL vs CuSparse on Nvidia K80 and Nvidia P100
5.6 Performance after reordering matrices
5.7 Normalized pre-processing cost

Chapter 1: Introduction

1.1 Motivation

Sparse Matrix Multi-vector multiplication (SpMM) or Sparse Matrix Dense Matrix multiplication (SpMDM) is a widely used kernel in many domains like Fluid Dynamics, Data Analytics, Economic Modelling, Graph BLAS [15, 16] and Machine Learning. In areas like Machine Learning and Artificial Neural Networks, SpMM is used iteratively, making it an important kernel in Machine Learning frameworks like Tensorflow [2] and PyTorch [29]. Several recent efforts have sought to exploit sparsity in neural networks using an SpMM formulation [23, 13, 17].

Examples of the use of SpMM from numerical simulation include the Locally

Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method for finding eigenvalues of a matrix [22, 4], and iterative solvers with multiple right-hand sides, like the Krylov sub-space iterative solvers which have SpMV at their core. Although

SpMM can be implemented as a sequential iterative loop over SpMV, better data reuse and higher performance can be achieved by a direct implementation of SpMM.

[Figure: the sparse matrix S (M x N) is multiplied with the dense input matrix I (N x K) to form the dense output matrix O (M x K); only the non-zero elements of S contribute to the result.]

Figure 1.1: SpMM

1.2 SpMM formulation

In Sparse Matrix Multi-vector multiplication (SpMM), a sparse matrix is multi- plied with a dense matrix to form a dense output matrix. SpMM can be defined as

O(M,K) = S(M,N) × I(N,K), where matrix S is the input sparse matrix of size

M × N, I is the input dense matrix of size N × K and O is the output dense matrix of size M × K. The majority of elements of the sparse matrix S are zeros, therefore it is generally stored in compressed formats like Compressed Sparse Row (CSR),

Compressed Sparse Column (CSC) and Compressed Co-ordinate (COO)[31]. Only the non-zero elements of the sparse matrix contribute to the result, as can be seen in

figure 1.1.

SpMM can also be performed by executing Sparse Matrix Vector multiplication

(SpMV) repeatedly by using different columns of input dense matrix I as vectors.

This method can never achieve optimal performance due to the fact that SpMV itself has a lower roofline performance than SpMM. The actual performance achieved by SpMV is even lower [10].

1.3 Challenges in parallel SpMM

There are two major challenges in designing a parallel SpMM algorithm. The first is achieving work-load balance across the threads. Unlike dense matrix-matrix multiplication,

SpMM has to multiply a sparse matrix with a dense matrix. The major issue here is that the sparse matrices can have very diverse sparsity patterns, some can have power law distribution and some may have non-zeros clustered only in a few rows and columns. This can create an imbalance when assigning rows or columns to threads and makes it difficult for an algorithm to work efficiently on all matrices.

The other major challenge is data reuse. When using CSR or CSC representation for SpMM it is very difficult to get data reuse on both the input I and output O dense matrices. In case of CSR we get reuse on the output matrix while going row-wise, but the elements of the input dense matrix are evicted from the cache before being reused.

Similarly, in CSC we get reuse on the input dense matrix while traversing along columns, but we lose the reuse on the output dense matrix. Depending upon the algorithm, the elements of the sparse or dense matrix have to be read multiple times. If these elements are not cached then a slower main memory access is required which considerably reduces the performance.

1.4 Contributions

This thesis makes the following contributions:

• Systematic exploration of streaming choices for SpMM

• New SpMM algorithm

• Model-driven tile size selection for SpMM

• Extensive evaluation over a wide range of sparse matrices

1.5 Organization of this thesis

The rest of this thesis is organized in the following manner: Chapter 2 presents the background of SpMM along with the architecture of Intel Xeon Phi. Chapter 3 presents the analysis of streaming choices for SpMM and formulates the data movement. Chapter 4 presents our SpMM algorithm and its implementation details.

Chapter 5 presents the experimental evaluation and comparisons. Chapter 6 provides the conclusion and future work.

Chapter 2: Background

SpMM can be defined as O(M,K) = S(M,N) × I(N,K), where matrix S is

a sparse matrix and I and O are dense matrices. A majority of elements of real-

world matrices are zeros and these elements do not contribute to the matrix product.

Representing them as a dense array uses up a huge amount of unnecessary memory

and precious CPU cycles are wasted in computing zero valued product. This moti-

vates one to use a different representation for sparse matrices to save storage space

and improve performance. The most commonly used sparse is

Compressed Sparse Row or CSR representation[31]. In CSR we maintain three arrays

rowptr, colidx and values. The index of rowptr array denotes the row and

its value represents the starting offset of elements in colidx and values array which

corresponds to this row. The values array stores the actual non-zero values and the colidx array stores the column numbers of the non-zero elements. Since the representation of a sparse matrix is different, we also need a different specialized algorithm to effectively exploit this structure. Algorithm 1 shows a naive algorithm to multiply a CSR matrix with a dense matrix.
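For illustration (this small example is not part of the thesis data), consider a 4 × 4 sparse matrix and its CSR arrays:

// A small example matrix (dots denote zeros):
//   [ 5 . 1 . ]
//   [ . 2 . . ]
//   [ . . . 3 ]
//   [ 4 . . 6 ]
// CSR arrays as described above:
int   rowptr[] = {0, 2, 3, 4, 6};     // rowptr[i] .. rowptr[i+1]-1 index the entries of row i
int   colidx[] = {0, 2, 1, 3, 0, 3};  // column of each stored non-zero
float values[] = {5, 1, 2, 3, 4, 6};  // the non-zero values themselves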

This fairly simple algorithm faces many issues when executed which severely degrade its performance. One of the most common issues is that the elements of the input matrix

Algorithm 1: Sequential SpMM
input : CSR S[M][N], float I[N][K]
output: float O[M][K]
for i = 0 to M-1 do
    for j = S.rowptr[i] to S.rowptr[i+1]-1 do
        for k = 0 to K-1 do
            O[i][k] += S.value[j] * I[S.colidx[j]][k]
        end
    end
end

I are evicted from the cache and get no reuse. The whole matrix I is used for calculating one row of matrix O, which puts a lot of pressure on the cache as matrix I has to be read again and again. This plays a major role in diminishing the performance.
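A minimal C++ sketch of this naive CSR-based SpMM is given below; the CSRMatrix struct and the row-major layout of I and O are assumptions made for illustration, not the implementation developed later in this thesis.

#include <vector>

// Minimal CSR container; the field names mirror the rowptr/colidx/values arrays described above.
struct CSRMatrix {
    int M, N;                      // number of rows and columns
    std::vector<int>   rowptr;     // size M+1
    std::vector<int>   colidx;     // size nnz
    std::vector<float> values;     // size nnz
};

// Naive sequential SpMM, O = S * I, following Algorithm 1.
// I (N x K) and O (M x K) are stored row-major; O is assumed to be zero-initialized.
void spmm_naive(const CSRMatrix& S, const float* I, float* O, int K) {
    for (int i = 0; i < S.M; ++i)
        for (int j = S.rowptr[i]; j < S.rowptr[i + 1]; ++j)
            for (int k = 0; k < K; ++k)
                O[i * K + k] += S.values[j] * I[S.colidx[j] * K + k];
}

// For every non-zero of row i, a full row of I is touched, which is why elements of I
// are evicted before they can be reused when M and N are large.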

To improve performance and to alleviate the above-mentioned issues, several optimizations have been proposed for both CPUs and GPUs [35, 21, 33]. Due to the highly SIMD structure of SpMM, Graphics Processing Units (GPUs) are a popular choice and a lot of previous work has been done to optimize SpMM on GPUs [24, 28].

But GPUs have one major drawback: they have a radically different architecture and x86-compliant programs have to be re-written according to the GPU APIs.

To write an optimized program for GPU one must be familiar with its APIs like

CUDA [26] or OpenCL [34] and have an in-depth knowledge of the GPU's architecture like streaming multiprocessors, thread blocks, shared memory, warps, etc. Moreover, in many applications we require serial operations and branched instructions in between matrix operations. In such cases a lot of time can be wasted in copying the data to and from the GPU. The Xeon Phi platform solves these issues by allowing us to run the same x86 code without any modifications. The OpenMP "parallel" pragma used on traditional x86 CPUs is enough to run parallel code on Xeon Phi [30].

Furthermore, enabling vector instructions (using -xMIC-AVX512) while compiling will exploit SIMD parallelization on the 512-bit vector units. The data transfer between host and device is not required as the Xeon Phi can efficiently execute sequential instructions. Moreover, the high-bandwidth MCDRAM [30] further reduces the cost of accessing the slower main memory. Even with so many automatic optimizations, there is still a lot of scope left to optimize SpMM on Xeon Phi and very few previous research works [33, 11, 3] have considered this.

2.1 Related Work

Since SpMM is a widely used kernel, there has been a lot of previous research done to optimize it. A few state-of-the-art frameworks are described below.

2.1.1 Taco

Taco [18] is a C++ library which uses compiler techniques to generate kernels for tensor algebra operations. These operations can be on sparse or dense tensors of any dimensionality. The kernels generated are already optimized and use the OpenMP parallel pragma for parallelization. This library and its online code generation tool can be used to generate an SpMM kernel, where all the tensors are 2D.

2.1.2 Intel Math Kernel Library

Intel MKL is one of the most commonly used BLAS and Sparse BLAS libraries for

CPUs. This library has highly optimized kernels for many sparse BLAS operations like SpMM, SpMV and SpGEMM. MKL supports various matrix representations like

CSR, CSC, COO, etc. MKL library also supports AVX512 instructions and has

kernels optimized especially for the Xeon Phi architecture, which results in significant performance gains [1].

2.1.3 Compressed Sparse Blocks based SpMM

Compressed Sparse Blocks (CSB) is a sparse matrix storage format which parti- tions and stores the matrices in smaller square blocks. This representation does not require any extra space than the commonly used CSR or CSC representations. Using

CSB format for SpMM kernels shows significant improvement for SpMM as well as for SpMM with the transposed sparse matrix (SpMM_T) [3].

Chapter 3: Analysis of SpMM

This chapter highlights the effects of sparsity pattern on SpMM and presents an

algorithm to optimally tile the computation to minimize data movement from main

memory.

SpMM for multiplying an M × N sparse matrix with nnz non-zero elements with

an N × K dense matrix to produce a dense matrix of size M × K has a maximum operational intensity OImax = 2 × K × nnz / (4MK + 4NK + 12nnz), i.e., counting 4 bytes per dense-matrix element and 12 bytes per stored non-zero.

Algorithm 2: Tiled SpMM
input : CSR S[M][N], float I[N][K]
output: float O[M][K]
for ii = 0 to M-1 step Ti do
    for jj = 0 to N-1 step Tj do
        for kk = 0 to K-1 step Tk do
            for i = ii to min(ii + Ti, M)-1 do
                for j = s_tile[jj].rowptr[i] to s_tile[jj].rowptr[i+1]-1 do
                    for k = kk to min(kk + Tk, K)-1 do
                        O[i][k] += S.value[j] * I[S.colidx[j]][k]
                    end
                end
            end
        end
    end
end

[Figure: the three streaming choices for tiled SpMM; in each case a tile of one matrix (I, O, or S) is kept in fast memory while the computation streams along i, j, or k respectively, and the other two matrices are streamed.]

Figure 3.1: Streaming SpMM

Algorithm 1 shows the sequential pseudo-code for SpMM (O = S ∗ I) for a Com- pressed Sparse Row matrix format, where S is an M × N sparse input matrix, I is a dense input matrix (N × K), and O is the resulting dense output matrix (M × K).

Figure 1.1 depicts the operations involved in SpMM.

Many dense algorithms such as Dense-Dense matrix multiplication (DGeMM) employ a streaming approach to reduce the data movement volume. With streaming, one of the three matrices is held stationary and the loop index that does not explicitly appear in the indexing of that matrix is chosen as the streaming dimension.

The streaming choices for SpMM (and their impact) can be explained with the help of the tiled SpMM algorithm shown in Algorithm 2. Each of the three matrices (S, I, O) can be chosen as the stationary matrix; each matrix is indexed by two of the three loop dimensions and has one "independent dimension". The streaming choices are:

10 • Streaming along M(i): A tile of I of size T j × T k is kept stationary in fast

memory/ cache

• Streaming along N(j): A tile of O of size T i × T k is kept stationary in fast

memory/ cache

• Streaming along K(k): A tile of S of size T i × T j is kept stationary in fast

memory/ cache

In the tiled version, the streaming dimension can be represented by the innermost tile dimension. A tile of the matrix that is not characterized by the streaming dimension is kept stationary in fast memory. Figure 3.1 depicts the streaming choices.

i) Streaming along M(i): In this case each I element is only read once and gets

the full reuse. Hence the total volume of data moved for I is N × K. Each S

element is read in once for every T k tile. Hence the data movement volume for S

is (nnz × K)/T k. A simple approximation for the number of times an element of

O has to be read and written is (2 × N)/T j; in other words, for each tile of size

T j each O element is read and written once. However, depending on the sparsity

level and sparsity structure, there may be many empty rows in a tile of size T j, in which case the corresponding O elements are not brought into memory. The total volume of O elements, after accounting for empty rows of S, can be expressed as

(2 × nars(T j) × K) where nars(T j) represents the number of active rows, which is a function of T j. In other words, for every active row, we have to read and write K elements of O. Thus the total volume is:

(N + nnz/T k + 2 × nars(T j)) × K

11 ii) Streaming along N(j): This scheme is similar to streaming along M. This scheme keeps O stationary. Hence, the total volume of data transferred for O is

M × K. Each S element is brought into memory once for every T k tile. Hence, the data movement volume for S is (nnz × K)/T k. Similar to O when streaming along

M, the total data transfer volume for I can be expressed as (nacs(T i) × K) where nacs(T i) represents the number of active column-segments, which is a function of T i.

Thus the total volume is:

(M + nnz/T k + nacs(T i)) × K iii) Streaming along K(k): In this scheme the S matrix is kept stationary, as depicted in Algorithm 3. Hence, the total volume of data transferred for S is nnz.

Similar to streaming along M, the total data volume transferred for O is

(2 × nars(T j) × K). Similar to when streaming along N, the total data transfer volume for I is (nacs(T i) × K). Thus the total volume is:

nnz + (2 × nars(T j) + nacs(T i)) × K

Algorithm 3: SpMM streaming along K(k)
input : CSR S[M][N], float I[N][K]
output: float O[M][K]
for ii = 0 to M-1 step Ti do
    for jj = 0 to N-1 step Tj do
        for k = 0 to K-1 step 1 do
            for i = ii to min(ii + Ti, M)-1 do
                for j = s_tile[jj].rowptr[i] to s_tile[jj].rowptr[i+1]-1 do
                    O[i][k] += S.value[j] * I[S.colidx[j]][k]
                end
            end
        end
    end
end
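The three total-volume expressions can be evaluated together; the sketch below (an illustration, not code from this thesis) computes them in matrix elements, where nars and nacs must be measured from the sparsity structure for the chosen T j and T i.

// Data-movement volumes (in matrix elements) for the three streaming schemes.
// nars = number of active rows per Tj-wide column block of S,
// nacs = number of active column segments per Ti-tall row panel of S.
struct Volumes { double streamM, streamN, streamK; };

Volumes data_movement(double M, double N, double K, double nnz,
                      double Tk, double nars, double nacs) {
    Volumes v;
    v.streamM = (N + nnz / Tk + 2.0 * nars) * K;   // case (i):  I stationary
    v.streamN = (M + nnz / Tk + nacs) * K;         // case (ii): O stationary
    v.streamK = nnz + (2.0 * nars + nacs) * K;     // case (iii): S stationary
    return v;
}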

The analysis for case (i) and case (ii) is similar. Here, we present the analysis for case (ii). In case (ii), O is kept in fast memory. The tile size for O is limited by the fast memory capacity. Thus T i × T k < C, where C is the capacity of fast memory. The higher

T i is, the lower the data movement cost for I. Similarly, increasing T k lowers the amount of data movement for S. Let nnz_per_col_seg(T i) be the average number of non-zero elements per active column segment of height T i in S. Then, the total number of non-zero elements is:

nnz = nacs(T i) × nnz_per_col_seg(T i) (3.1)

From 3.1:

nacs(T i) = nnz / nnz_per_col_seg(T i) (3.2)

Our objective is:

Min((M + nnz/T k + nacs(T i)) × K) (3.3)

Subjected to the constraint

T i × T k ≤ C (3.4)

Since M and K are constants they can be removed from the minimization objec- tive. Thus the minimization objective from 3.3 can be re-written as:

Min((nnz)/T k + nacs(T i)) (3.5)

Equation 3.2 can be substituted in Equation 3.5 to obtain:

Min(nnz × (1/T k + 1/nnz_per_col_seg(T i))) (3.6)

Since nnz is constant, the minimization objective can be written as

Min(1/T k + 1/nnz_per_col_seg(T i)) (3.7)

The optimal sizes of T i and T k (if treated as real variables to enable analytical closed form solution of the optimization problem) will result in equal contributions to the above expression to be minimized.

Figure 3.2 contains the data movement calculation and operational intensity for two matrices, Bone010 and Cage12 for K = 64 and 128. These two matrices are chosen for this evaluation as they have the most contrasting performance, Bone010 shows very high GFLOPS for SpMM and Cage12 on the other hand shows the lowest.

From Figure 3.2 it can be observed that the matrix Cage12 has a low operational intensity and a low FLOPS/data movement ratio, which is one of the factors causing its low performance.

BONE010
K    TI   TK   NACS     DATA MOVEMENT  Operational Intensity  FLOPS/DATA MOVEMENT
64   64   64   8730231  693550000      6.72                   13.23
64   64   32   8730231  765216000      6.72                   11.99
64   64   16   8730231  908549000      6.72                   10.10
64   64   8    8730231  1195210000     6.72                   7.68
64   128  64   7392162  607914000      6.72                   15.09
64   128  32   7392162  679580000      6.72                   13.50
64   128  16   7392162  822913000      6.72                   11.15
64   128  8    7392162  1109580000     6.72                   8.27
128  64   8    8730231  2390430000     9.81                   7.68
128  64   16   8730231  1817100000     9.81                   10.10
128  64   32   8730231  1530430000     9.81                   11.99
128  64   64   8730231  1387100000     9.81                   13.23
128  64   128  8730231  1315430000     9.81                   13.95
128  128  8    7392162  2219160000     9.81                   8.27
128  128  16   7392162  1645830000     9.81                   11.15
128  128  32   7392162  1359160000     9.81                   13.50
128  128  64   7392162  1215830000     9.81                   15.09
128  128  128  7392162  1144160000     9.81                   16.03

CAGE12
K    TI   TK   NACS     DATA MOVEMENT  Operational Intensity  FLOPS/DATA MOVEMENT
64   64   64   1403454  100188000      2.86                   2.60
64   64   32   1403454  102221000      2.86                   2.55
64   64   16   1403454  106286000      2.86                   2.45
64   64   8    1403454  114416000      2.86                   2.27
64   128  64   1268461  91548600       2.86                   2.84
64   128  32   1268461  93581200       2.86                   2.78
64   128  16   1268461  97646200       2.86                   2.66
64   128  8    1268461  105776000      2.86                   2.46
128  64   8    1403454  228832000      3.30                   2.27
128  64   16   1403454  212572000      3.30                   2.45
128  64   32   1403454  204441000      3.30                   2.55
128  64   64   1403454  200376000      3.30                   2.60
128  64   128  1403454  198344000      3.30                   2.62
128  128  8    1268461  211553000      3.30                   2.46
128  128  16   1268461  195292000      3.30                   2.66
128  128  32   1268461  187162000      3.30                   2.78
128  128  64   1268461  183097000      3.30                   2.84
128  128  128  1268461  181065000      3.30                   2.87

Figure 3.2: Data Movement Analysis

Chapter 4: SpMM: Sparse Matrix Multi-Vector Multiplication

4.0.1 Overview

This section discusses the new algorithm, the data structures used and the opti- mization techniques involved.

The performance of a SpMM algorithm depends on several factors ranging from the processor architecture to the characteristics of the matrices. For Intel Xeon Phi one of the most important factors to achieve high performance is to make sure the code is vectorized and the 512-bit vector units are optimally used. We also need to make sure all the cores of the processor have equal work allocated to them, otherwise the performance deteriorates due to load imbalance caused by the varied sparsity patterns.

Latency hiding should also be incorporated as much as possible by designing schemes to optimally reuse elements already in the cache memory.

In this section scheme 2 from Figure 3.1 is used as it proved to be the fastest for

Intel KNL in the experiments conducted. The algorithm optimizes SpMM by reducing data communication and by exploiting temporal locality of the output matrix. Intel

Xeon Phi Processor 7250 has a total of 34 MB of L2 cache, where every tile has 1 MB of L2 cache shared between its two cores. Our algorithm revolves around effectively using this cache and creating data access patterns to aid automatic vectorization.

16 4.0.2 Data structure

The data structure is designed to promote the maximum reuse of the output matrix

elements. The reuse of the output matrix is of utmost importance as it requires both

read and write operations. The lack of an L3 cache in Intel Xeon Phi causes a few restrictions in designing a good data structure while maintaining a high cache hit ratio. The 1 MB L2 cache of Intel KNL is shared between the two cores of a tile and

there are a total of 34 such tiles. One shortcoming of this design is that when the same

cache is being used by multiple cores, each core will maintain its own copy of the

cache line in its local L2 cache. This can reduce the overall L2 cache capacity for the

shared memory parallel application. The work distribution and data access pattern

promoted by the data structure helps to alleviate this issue.

The sparse matrix S is split into several row_panels (T i × N) [Figure 3.1, case

2] based on a model discussed later and is stored in blocked double compressed CSC

representation. We maintain a simple array segment whose elements contain offsets to a row_panel. Each row_panel is stored in double compressed sparse column (dCSC) format [6, 7, 5, 12]. For each dCSC matrix we maintain 4 arrays, namely column_number, column_index, row_number and values, shown in Figure 4.1. The array column_number stores the column number only if there are non-zeros in that column of that row_panel. The array column_index points to the elements of the column in the arrays row_number and values using offsets. The array row_number stores the row number of the non-zero element and the values array stores the actual non-zero element. Both the column_number and column_index arrays have the size nacs (number of active column segments) for the corresponding row_panel, whereas row_number and values have the size nnz, i.e., the number of non-zero elements present in the row_panel. Using a double compressed representation in our case saves a significant amount of storage space, especially for hyper-sparse matrices, as in this representation we don't have to maintain the index of columns which do not have any elements in a row_panel. Using the dCSC representation also aids in increasing the cache hits on the input matrix, which is not possible in case of the CSR representation. This helps us to hide the latency and also improves the L1 cache reuse. Since the input matrix is dense, achieving vectorization is quite simple if the innermost loop iterates over the K dimension.

Figure 4.1: Blocked double compressed sparse column representation
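A C++ sketch of this layout is shown below; the struct wrappers are illustrative assumptions, but the four array names follow the description above.

#include <vector>

// One row_panel in double compressed sparse column (dCSC) form.
// column_number and column_index have one entry per active column segment (nacs);
// row_number and values have one entry per non-zero of the panel.
struct RowPanelDCSC {
    std::vector<int>   column_number;  // column id of each active column segment
    std::vector<int>   column_index;   // offset of that segment's entries in row_number/values
    std::vector<int>   row_number;     // row of each non-zero
    std::vector<float> values;         // the non-zero values
};

// The blocked matrix: in the text, the segment array gives the offset of each
// row_panel; here each panel is simply stored as one struct for clarity.
struct BlockedDCSC {
    int Ti;                            // row_partition width (panel height)
    std::vector<RowPanelDCSC> panels;
};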

4.0.3 Algorithm

Algorithm 4: Our SpMM algorithm
input : CSR S[M][N], float I[N][K]
output: float O[M][K]
for row_panel = 0 to nseg-1 step 1 do
    for kt = 0 to K-1 step Tk do
        for j = segment[row_panel] to segment[row_panel+1]-1 step 1 do
            for i = 0 to column_index[j+1] - column_index[j] - 1 step 1 do
                for k = kt to kt + Tk - 1 step 1 do
                    O[row_number[i + column_index[j]] * K + k] +=
                        S[i + column_index[j]] * I[column_number[j] * K + k]
                end
            end
        end
    end
end

Once we split the sparse matrix S into multiple row_panels and store them using our representation, the row_panels are distributed among the threads using

OpenMP’s dynamic schedule. Each row_panel is processed by only one thread to

[Figure: the sparse matrix S is split into row panels of height Ti, each assigned to one thread; the dense matrix I is split into vertical slices of width Tk; each thread streams along j over one slice at a time, producing the corresponding Ti × Tk block of O, where O(i, k) = S(i, j) x I(j, k).]

Figure 4.2: Our SpMM Algorithm

20 avoid any contention. The matrix I is also split, theoretically, into slices and a thread

working on a row_panel goes over these slices one by one.

The product computation for one thread goes as follows: The thread iterates

over one column segment of the row_panel and multiplies it with one row of the

current slice of the input dense matrix I. This results in a partial product of output

matrix O’s (T i×T k) block. The thread then moves to the next column segment with non-zeros and multiplies it with the corresponding row of matrix I’s slice. Once the

thread finishes iteration over all the column segments of the row_panel we get the

final (T i × T k) block of the output matrix. The thread repeats the above steps using

the next slice of matrix I. As soon as a thread completes processing its row_panel,

it moves to the next row_panel assigned dynamically.

Listing 4.1: SpMM code
 1  #pragma ivdep
 2  #pragma vector aligned
 3  #pragma omp parallel for num_threads(136) schedule(dynamic, 1)
 4  for(int row_panel=0; row_panel
        // ... lines 5-34 omitted ...
35            {
36              O[row_number[i + 0 + column_index[j]] * K + k] +=
37                S[i + 0 + column_index[j]] * I[colNumber * K + k];
38              O[row_number[i + 1 + column_index[j]] * K + k] +=
39                S[i + 1 + column_index[j]] * I[colNumber * K + k];
40              O[row_number[i + 2 + column_index[j]] * K + k] +=
41                S[i + 2 + column_index[j]] * I[colNumber * K + k];
42              O[row_number[i + 3 + column_index[j]] * K + k] +=
43                S[i + 3 + column_index[j]] * I[colNumber * K + k];
44              O[row_number[i + 4 + column_index[j]] * K + k] +=
45                S[i + 4 + column_index[j]] * I[colNumber * K + k];
46              O[row_number[i + 5 + column_index[j]] * K + k] +=
47                S[i + 5 + column_index[j]] * I[colNumber * K + k];
48              O[row_number[i + 6 + column_index[j]] * K + k] +=
49                S[i + 6 + column_index[j]] * I[colNumber * K + k];
50              O[row_number[i + 7 + column_index[j]] * K + k] +=
51                S[i + 7 + column_index[j]] * I[colNumber * K + k];
52            }
53          }
54        }
55      }
56  }

4.0.4 Optimizations

The code with all optimizations is present in Listing 4.1.

This algorithm is designed to use several optimization techniques while tuning

them to benefit from Intel KNL’s architecture.

1. Blocking/Tiling: Blocking is performed on input matrix S by splitting it into

several row_partitions. The first loop [line 4, Listing 4.1] chooses one block

or row_panel of input matrix S. Blocking helps in exploiting both spatial and

temporal locality. Blocking input sparse matrix S also automatically splits the

output matrix O into blocks. This reduces the relevant size of the output matrix O for a thread, making it possible to store the block of matrix O in the L2

cache and reap the benefits of spatial locality. In our case, blocking also helps

to avoid contention as only one thread works on one row_panel at a time.

2. Streaming and Slicing: Streaming as explained earlier in Chapter 3 is done along

the N(j) dimension (rows of matrix S) in line 10 of Listing 4.1. Slicing is performed by the second loop, line 8 in Listing 4.1. Slicing, similar to blocking, divides

the input matrix I into vertical slices. The resultant block of dense output

matrix O after blocking/tiling step is still often bigger than the available L2

cache. This happens when the input dense matrix I is wide (K greater than

512). So, slicing matrix I, in turn, splits the block of output matrix O into even

smaller vertical blocks. This gives us enough reduction to maintain the block

of output matrix in the cache.

One thing to notice here is that we can also reduce the width of row_partition

of S to reduce the size of the output matrix’s block instead of slicing matrix I.

But this approach reduces the performance as matrix I has to be read again

and again for each row_partition and this approach would increase the number

of row_panels.

3. Unroll and Jam: The innermost loops can be explicitly unrolled or we can use

#pragma unroll. Unrolling can sometimes help the compiler in vectorization,

though Intel's ICC compiler generally has no trouble in generating vector instructions. When the innermost loop is small, i.e., K is very low, jamming

can help to improve the performance by providing more work in the innermost

loop. Jamming in our situation is a bit tricky as the number of rows in a column

is not predefined. We split the "i" loop [line 15] into two parts: in the first part the peel loop runs for number_of_rows mod 8, and the second loop runs in multiples of 8 with the multiplication instructions explicitly jammed (a sketch of this structure is given after this list). This unrolling and jamming also enables us to get reuse on the rows of matrix I.

23 4. Temporal/ nontemporal: The pragma vector temporal [1] instructs the ICC

compiler to use temporal or non-streaming stores. This enables us to maintain

the T i × T k portion of the O matrix in the L2 cache. The nontemporal argument

instructs the compiler to use non-temporal or streaming stores [1]. This is

useful to evict the elements of S matrix as they won’t be used ever again.

5. Parallelization: Parallelization is performed in the first loop [line 3] which is

preceded by the omp parallel for pragma [8]. Each thread works on one row_panel of

the sparse matrix S, at a time, which is scheduled using OpenMP’s dynamic

scheduling. This removes the necessity of any costly atomic operations as only

one thread works on the part of output matrix O corresponding to a row_panel

(thread 1 marked with green in Figure 4.2). The input dense matrix I is

also partitioned into slices, so one thread completes multiplying row_panel(1)

with slice(1), then moves on to multiply the same row_panel(1) with the next

slice(2). This way there are no write conflicts.

6. Vectorization: Vectorization (SIMD) provides us data level parallelism and is

done in the innermost loop. The innermost loops iterate for K or the slice_size

Kt, which is predetermined, and this gives us ample opportunity to exploit the

Advanced Vector Extensions 512 instructions (AVX-512). Our code is compiled

using the xMIC-AVX512 flag which instructs Intel’s ICC compiler to generate

AVX-512 instructions and we rely on its automatic vectorization. For other

compilers, their specific 512-bit vectorization flag should be used; for example, for GCC use -mavx512f.
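The sketch below illustrates the peel-plus-jammed structure of item 3 for one active column segment of a row_panel (compare lines 35-51 of Listing 4.1); the function and parameter names are assumptions made for illustration.

// Processes one active column segment j of a row_panel for one Tk-wide slice of I.
// S_values, row_number and column_index follow the dCSC layout; colNumber is the
// dense-matrix row corresponding to segment j.
inline void process_segment(const float* S_values, const int* row_number,
                            const int* column_index, int j, int colNumber,
                            const float* I, float* O, int K, int kt, int Tk) {
    const int base  = column_index[j];
    const int nrows = column_index[j + 1] - column_index[j];
    const int peel  = nrows % 8;

    // Peel loop: leftover rows handled one at a time.
    for (int i = 0; i < peel; ++i)
        for (int k = kt; k < kt + Tk; ++k)
            O[row_number[base + i] * K + k] += S_values[base + i] * I[colNumber * K + k];

    // Jammed loop: 8 rows of the segment reuse the same I element for each k,
    // which the compiler can keep in a vector register.
    for (int i = peel; i < nrows; i += 8)
        for (int k = kt; k < kt + Tk; ++k)
            for (int u = 0; u < 8; ++u)
                O[row_number[base + i + u] * K + k] +=
                    S_values[base + i + u] * I[colNumber * K + k];
}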

4.0.5 Model

The choice of correct row_partition_width T i and slice width T k is essential for the optimal performance of our algorithm. Our Model chooses the optimal row_partition_width T i and slice width T k based on the input sparse matrix and dense matrix characteristics. We measure the standard deviation of the number of elements in the rows of the sparse matrix. If the standard deviation is found to be very high ( > 200 ) then the small row panel is selected ( row_panel_width =

16). Choosing small row panels helps in load-balancing as the dense rows are divided among different row_panels and adjacent panels are processed by different threads. Otherwise, a panel width of 256, 128 or 64 rows is selected depending upon the K dimension of the input dense matrix. As the K dimension increases, we decrease the row_partition_width. The effect of different T i and T k values can be seen in Figures 4.3 and 4.4. Matrix "mip1" has a high standard deviation of elements in a row (350), therefore higher performance can be seen when T i is small and work is evenly distributed. In matrices such as "Si41Ge41H72" the matrix elements are evenly distributed and the standard deviation of elements in a row is lower (126). In this case a larger T i shows better performance as the reuse on the input dense matrix increases.

Since our approach relies on the output matrix being in the L2 cache, we choose the slice size T k such that the total size of the portion of the output matrix being processed by a thread remains in its local L2 cache. This maximizes the L2 cache hits on the thread's own tile. Xeon Phi has 1 MB of L2 cache shared by 2 cores, so we assume 256 KB or 512 KB for each thread depending on the number of threads being used. For K ≤ 128 we use 136 threads and for K > 128 we use 64 threads. If K is large (> 512) then we set T k to 512 so that the block of the output matrix still fits in the L2 cache.

Figure 4.3: Heat Map for mip1 in GFLOPS

Figure 4.4: Heat Map for Si41Ge41H72 in GFLOPS

The values of T i and T k are selected by calculating data movement and using performance data collected by executing SpMM with varying choices of T i and T k

over a large set of sparse matrices.
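A sketch of this selection heuristic is given below; the thresholds stated in the text (standard deviation > 200, panel widths of 16/256/128/64, and T k capped at 512) are kept, while the exact K break-points separating the 256/128/64 cases and the function name are assumptions made for illustration.

// Chooses the row_partition_width Ti and slice width Tk.
void choose_tiles(double row_nnz_stddev, int K, int& Ti, int& Tk) {
    if (row_nnz_stddev > 200.0) {
        Ti = 16;                  // small panels for load balance on skewed matrices
    } else if (K <= 64) {         // assumed break-point
        Ti = 256;
    } else if (K <= 256) {        // assumed break-point
        Ti = 128;
    } else {
        Ti = 64;                  // wider dense matrix -> narrower row panel
    }
    Tk = (K > 512) ? 512 : K;     // slice I so the O block stays in the L2 cache
}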

4.0.6 Reordering Matrices

In previous sections we saw that increasing the non-zeros per active column segment ratio can increase reuse and the performance of the SpMM kernel. Thus, the rows of the sparse matrix are reordered to increase this ratio. For reordering the matrix we

used hypergraph partitioning [20]. For a sparse matrix S, we define the hypergraph

H(V,N) as following:

V = {v_i : ∃ j such that (i, j) ∈ S},
N = {n_j : ∃ i such that (i, j) ∈ S}, where n_j = {v_i : (i, j) ∈ S}

Each row is a vertex and each column is a hyper-edge; the problem then turns into a dual problem: partition the hypergraph into parts with an equal number of vertices while minimizing the total number of hyper-edge cuts. The Partitioning Tool for Hypergraphs (PaToH) does exactly this optimization; therefore, PaToH is used to reorder the sparse matrices. After reordering, rows that fall into the same partition are moved next to each other. Since only the rows are reordered, the result matrix is permuted, but the dense input matrix does not have to be reordered.
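A sketch of constructing this hypergraph from the non-zero coordinates is shown below (illustrative only; the actual partitioning is delegated to PaToH):

#include <vector>
#include <utility>

// Build the hypergraph H(V, N): rows are vertices and each column j becomes a
// hyper-edge (net) containing every row i that has a non-zero in column j.
std::vector<std::vector<int>> build_column_nets(
        const std::vector<std::pair<int, int>>& nonzeros,  // (row, column) pairs of S
        int num_cols) {
    std::vector<std::vector<int>> nets(num_cols);
    for (const auto& rc : nonzeros)
        nets[rc.second].push_back(rc.first);   // add vertex (row) to net (column)
    return nets;                               // nets can then be passed to a partitioner such as PaToH
}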

Chapter 5: Experiments

The experiments were performed on 4 different machines with different architectures. For the CPU experiments, 2 machines were used:

• Intel Xeon Phi 7250 (68 cores at 1.40 GHz with 34 MB L2 cache)

• Intel(R) Xeon(R) CPU E5-2680 v4 (28 cores at 2.40 GHz with 35 MB L3 cache)

GPUs used for experiments:

• NVIDIA Tesla K80

• NVIDIA Tesla P100 (Pascal)

We have not included the cudaMemcpy time for any GPU experiments. Also, the pre-processing time for creating our data structure is not included in the performance benchmarks.

5.0.1 Dataset

We use datasets from two previous papers on sparse matrix multiplication for our experiments [32, 25]. All these data sets are available at the SuiteSparse Matrix

Collection [9]. Some matrices whose dimensions were so large that they caused the program to crash with dense matrix width 2048 have been removed to maintain uniformity. The matrices are listed in Table 5.1.

Table 5.1: Data set

MATRIX            Rows     Columns  nnz
2cubes_sphere     101492   101492   1647264
bmw3_2            227362   227362   11288630
bone010           986703   986703   47851783
cage12            130228   130228   2032536
cant              62451    62451    4007383
cop20k_A          121192   121192   2624331
crankseg_2        63838    63838    14148858
F1                343791   343791   26837113
hood              220542   220542   9895422
inline_1          503712   503712   36816170
ldoor             952203   952203   42493817
mac_econ_fwd500   206500   206500   1273389
mip1              66463    66463    10352819
msdoor            415863   415863   19173163
nd24k             72000    72000    28715634
pdb1HYS           36417    36417    4344765
pre2              659033   659033   5834044
pwtk              217918   217918   11524432
scircuit          170998   170998   958936
shallow_water1    81920    81920    327680
shipsec1          140874   140874   3568176
Si41Ge41H72       185639   185639   15011265
webbase-1M        1000005  1000005  3105536

5.0.2 SpMM CPU

In this section, we compare our algorithm against the latest Intel MKL library [Intel Math Kernel Library (Intel MKL) version 2018], SpMM based on Compressed Sparse Blocks, and the Taco library [19]. We use different dense matrix widths, increasing in powers of two, for our experiments. Figures 5.1 and 5.2 show the performance of our algorithm on Intel Xeon Phi using double precision. Figures 5.3 and 5.4 show the performance of our algorithm on Xeon E5 using double precision. The mkl_dcsrmm and mkl_scsrmm routines from Intel MKL [14] were used for the experiments. The code generated by the Taco compiler was used for the experiments. For CSB based SpMM we used the test spmm Cilk Plus [3] implementation code. From our experiments, we observed that when the dense matrix width K is small (< 32) the MKL library performs similarly to our algorithm. But when we increase K, we can see that our algorithm performs significantly better than MKL, CSB or Taco. This is due to the fact that as K increases it becomes exceedingly difficult to get cache hits on matrix I while getting good vectorization. We can see similar trends on the Xeon E5 machine.

5.0.3 SpMM comparison with CuSparse on GPU

In Figure 5.5 we compare the performance of our algorithm with that of CuSparse cusparseDcsrmm2 [26, 27] on Nvidia K80 and Nvidia P100 GPUs. Here we used double precision for all the computations and used the NON_TRANSPOSE [26, 27] option for both the input and output matrices in CuSparse. Though we are comparing

CPU performance with GPU which has a very different architecture, both Intel KNL and GPGPU are used as accelerators for matrix computation and so the comparison is

[Chart: GFLOPS achieved by OURS, MKL, TACO and CSB on each data-set matrix; four panels for K = 16, 32, 64 and 128.]

Figure 5.1: SpMM KNL double precision lower K

[Chart: GFLOPS achieved by OURS, MKL, TACO and CSB on each data-set matrix; four panels for K = 256, 512, 1024 and 2048.]

Figure 5.2: SpMM KNL double precision higher K

[Chart: GFLOPS on Xeon E5 for OURS, MKL, TACO and CSB on each data-set matrix; four panels for K = 16, 32, 64 and 128.]

Figure 5.3: SpMM Xeon E5 double precision lower K

[Chart: GFLOPS on Xeon E5 for OURS, MKL, TACO and CSB on each data-set matrix; four panels for K = 256, 512, 1024 and 2048.]

Figure 5.4: SpMM Xeon E5 double precision higher K

35 300

250

200

150 GFLOPS

100

50

0 F1 mac cant pre2 mip1 pwtk hood ldoor inline nd24k bmw3 cage12 2cubes scircuit cop20k msdoor shallow shipsec1 crankseg bone010 webbase pdb1HYS Si41Ge41 OURS Intel KNL CuSparse Nvidia K80 CuSparse Nvidia P100

Figure 5.5: Our algorithm on Intel KNL vs CuSparse on Nvidia K80 and Nvidia P100

[Chart: GFLOPS at K = 128 for OURS, Reordered Ours, MKL and Reordered MKL.]

Figure 5.6: Performance after reordering matrices

justified. We can observe that our algorithm performs relatively better than CuSparse even on Nvidia P100 which is better than Xeon Phi in terms of maximum processing power and memory bandwidth.

5.0.4 Reordering

As mentioned in Chapter 4, reordering helps to reduce the number of active column segments. As shown in Figure 5.6, the ratio of performance improvement from reordering is higher for MKL than for ours. In our scheme, since we keep the data in blocked double compressed CSC format, we get full reuse of the dense matrix elements within a row panel. Hence, our performance only improves for cases where the number of active column segments is significantly reduced. Reordering seems to aid MKL more significantly than our SpMM, possibly because its implementation has not been designed to be as locality-aware as our design.

37 120 100 80 60 40 20 0 F1 1M - cant pre2 pwtk mip1 hood ldoor nd24k cage12 scircuit msdoor inline_1 shipsec1 bmw3_2 bone010 pdb1HYS cop20k_A crankseg_2 webbase Si41Ge41H72 2cubes_sphere shallow_water1 mac_econ_fwd500

Normalized K = 128 Normalized K = 1024

Figure 5.7: Normalized pre-processing cost

5.0.5 Pre-processing

The calculation of the standard deviation of the number of elements in the rows and the transformation of the input sparse matrix into our data structure incur an additional overhead. This overhead, shown in Figure 5.7, is a one-time cost and will be amortized over the next few iterative SpMM calculations. Certain neural networks and machine learning algorithms which use SpMM perform SpMM on the same sparse matrix iteratively, where only the input dense matrix changes. In these applications, the pre-processing cost is easily compensated by the increase in performance.

Chapter 6: Conclusion and Future Work

6.1 Conclusion

A systematic exploration of different streaming choices for SpMM was done and the data movement for different tile choices was calculated. A model was built to choose the optimal tile sizes based upon the performance measurement experiments and data movement calculations. An efficient SpMM algorithm based upon 2D tiling was developed for the CPU using the tile choices from the aforementioned model. This new algorithm demonstrated better performance than the state-of-the-art Intel Math Kernel

Library on Intel Xeon and Xeon Phi processors using the matrices from the data-set.

6.2 Future Work

The streaming strategies used in this thesis show a very good performance improvement over the existing state-of-the-art frameworks. These strategies can be applied to Graphics Processing Units and other accelerators, and a library can be created for SpMM on different architectures. The new SpMM algorithm can be used in different applications like multi-source BFS and Betweenness Centrality to show the performance improvement in real-world applications.

Bibliography

[1] “Intel Vector”. Intel Developer Documentation, https://software.intel.com/en-us/node/524559.

[2] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jef- frey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Mur- ray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.

[3] H. Metin Aktulga, Aydin Buluç, Samuel Williams, and Chao Yang. Optimizing sparse matrix-multiple vector multiplication for nuclear configuration interaction calculations. May 2014.

[4] Hartwig Anzt, Stanimire Tomov, and Jack Dongarra. Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product. In Proceedings of the Symposium on High Performance Computing, HPC ’15, pages 75–82, San Diego, CA, USA, 2015. Society for Computer Simulation International.

[5] Ariful Azad, Grey Ballard, Aydin Buluç, James W. Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, and Samuel Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. 38, October 2015.

[6] A. Buluc and J. R. Gilbert. On the representation and multiplication of hy- persparse matrices. In 2008 IEEE International Symposium on Parallel and Distributed Processing, pages 1–11, April 2008.

[7] Aydin Buluç and John R. Gilbert. Highly parallel sparse matrix-matrix multi- plication. CoRR, abs/1006.2183, 2010.

[8] Leonardo Dagum and Ramesh Menon. Openmp: An industry-standard api for shared-memory programming. IEEE Comput. Sci. Eng., 5(1):46–55, January 1998.

[9] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, December 2011.

[10] Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean-Luc Vay, and Henri Vincenti. Applying the roofline performance model to the intel xeon phi knights landing processor. 9945:339–353, 06 2016.

[11] Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq M. Malas, Jean-Luc Vay, and Henri Vincenti. Applying the roofline performance model to the intel xeon phi knights landing processor. In ISC Workshops, 2016.

[12] M. Eleyat, L. Natvig, and J. Amundsen. Cache-aware matrix multiplication on multicore systems for ipm-based lp solvers. In 2011 Federated Conference on Computer Science and Information Systems (FedCSIS), pages 431–438, Sept 2011.

[13] Benjamin Graham. Spatially-sparse convolutional neural networks. CoRR, abs/1409.6070, 2014.

[14] Intel. “Developer Reference for Intel Math Kernel Library - C”. https://software.intel.com/en-us/mkl-developer-reference-c.

[15] Jeremy Kepner, Peter Aaltonen, David A. Bader, Aydin Buluç, Franz Franchetti, John R. Gilbert, Dylan Hutchison, Manoj Kumar, Andrew Lumsdaine, Hen- ning Meyerhenke, Scott McMillan, José E. Moreira, John D. Owens, Carl Yang, Marcin Zalewski, and Timothy G. Mattson. Mathematical foundations of the graphblas. CoRR, abs/1606.05790, 2016.

[16] Jeremy Kepner, David A. Bader, Aydin Buluç, John R. Gilbert, Timothy G. Mattson, and Henning Meyerhenke. Graphs, matrices, and the graphblas: Seven good reasons. CoRR, abs/1504.01039, 2015.

[17] Martin Kiefel, Varun Jampani, and Peter V. Gehler. Sparse convolutional net- works using the permutohedral lattice. CoRR, abs/1503.04949, 2015.

[18] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. The tensor algebra compiler. Proc. ACM Program. Lang., 1(OOPSLA):77:1–77:29, October 2017.

[19] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. The tensor algebra compiler. Proc. ACM Program. Lang., 1(OOPSLA):77:1–77:29, October 2017.

[20] S. E. Kurt, V. Thumma, C. Hong, A. Sukumaran-Rajam, and P. Sadayappan. Characterization of data movement requirements for sparse matrix computations on gpus. In 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pages 283–293, Dec 2017.

[21] Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. The cache perfor- mance and optimizations of blocked algorithms. SIGPLAN Not., 26(4):63–74, April 1991.

[22] Ilya Lashuk, Merico Argentati, Evgueni Ovtchinnikov, and Andrew Knyazev. Preconditioned eigensolver LOBPCG in hypre and PETSc. In Olof B. Widlund and David E. Keyes, editors, Domain Decomposition Methods in Science and Engineering XVI, pages 635–642, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.

[23] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[24] K. Matam, S. R. Krishna Bharadwaj Indarapu, and K. Kothapalli. Sparse matrix-matrix multiplication on modern architectures. In 2012 19th Interna- tional Conference on High Performance Computing, pages 1–10, Dec 2012.

[25] D. Merrill and M. Garland. Merge-based parallel sparse matrix-vector multipli- cation. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 678–689, Nov 2016.

[26] Nvidia. “CUDA C Programming Guide”. http://docs.nvidia.com/cuda/cuda-c- programming-guide/index.html.

[27] Nvidia. “cuSPARSE”. http://docs.nvidia.com/cuda/cusparse/index.html.

[28] Gloria Ortega, Francisco Vázquez, Inmaculada García, and Ester M. Garzón. FastSpMM: An efficient library for sparse matrix matrix product on GPUs. The Computer Journal, 57(7):968–979, 2014.

[29] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

[30] Rezaur Rahman. Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers. Apress, Berkely, CA, USA, 1st edition, 2013.

[31] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, second edition, 2003.

[32] Erik Saule, Kamer Kaya, and Ümit V. Çatalyürek. Performance evaluation of sparse matrix multiplication kernels on intel xeon phi. CoRR, abs/1302.1078, 2013.

[33] Erik Saule, Kamer Kaya, and Ümit V. Çatalyürek. Performance evaluation of sparse matrix multiplication kernels on intel xeon phi. In Roman Wyrzykowski, Jack Dongarra, Konrad Karczewski, and Jerzy Waśniewski, editors, Parallel Processing and Applied Mathematics, pages 559–570, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.

[34] John E. Stone, David Gohara, and Guochun Shi. Opencl: A parallel program- ming standard for heterogeneous computing systems. IEEE Des. Test, 12(3):66– 73, May 2010.

[35] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. Optimization of sparse matrix–vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178–194, 2009. Revolutionary Technologies for Acceleration of Emerging Petascale Applications.
