UNIVERSITY OF COPENHAGEN FACULTY OF SCIENCE

PhD Thesis Weifeng Liu

Parallel and Scalable Sparse Basic Linear Algebra Subprograms


Academic advisor: Brian Vinter

Submitted: 26/10/2015

To Huamin and Renheng for enduring my absence while working on the PhD and the thesis

To my parents and parents-in-law for always helping us to get through hard times

Acknowledgments

I am immensely grateful to my supervisor, Professor Brian Vinter, for his support, feedback and advice throughout my PhD years. This thesis would not have been possible without his encouragement and guidance. I would also like to thank Professor Anders Logg for giving me the opportunity to work with him at the Department of Mathematical Sciences of Chalmers University of Technology and University of Gothenburg. I would like to thank my co-workers in the eScience Center who supported me in every respect during the completion of this thesis: James Avery, Jonas Bardino, Troels Blum, Klaus Birkelund Jensen, Mads Ruben Burgdorff Kristensen, Simon Andreas Frimann Lund, Martin Rehr, Kenneth Skovhede, and Yan Wang.

Special thanks go to my PhD buddy James Avery, Huamin Ren (Aalborg University), and Wenliang Wang (Bank of China) for their insightful feedback on my manuscripts before they were peer-reviewed. I would like to thank Jianbin Fang (Delft University of Technology and National University of Defense Technology), Joseph L. Greathouse (AMD), Shuai Che (AMD), Ruipeng Li (University of Minnesota) and Anders Logg for their insights on my papers before they were published. I thank Jianbin Fang, Klaus Birkelund Jensen, Hans Henrik Happe (University of Copenhagen) and Rune Kildetoft (University of Copenhagen) for access to the Intel Xeon and Xeon Phi machines. I thank Bo Shen (Inspur Group Co., Ltd.) for helpful discussions about Xeon Phi programming. I also thank Arash Ashari (The Ohio State University), the Intel MKL team, Joseph L. Greathouse, Mayank Daga (AMD) and Felix Gremse (RWTH Aachen University) for sharing source code, libraries or implementation details of their SpMV and SpGEMM algorithms with me. I would also like to thank the anonymous reviewers of the conferences SPAA '13, PPoPP '13, IPDPS '14, PPoPP '15 and ICS '15, of the workshops GPGPU-7 and PMAA '14, and of the journals Journal of Parallel and Distributed Computing and Parallel Computing for their invaluable suggestions and comments on my manuscripts. This research project was supported by the Danish National Advanced Technology Foundation (grant number J.nr. 004-2011-3).

Abstract

Sparse basic linear algebra subprograms (BLAS) are fundamental building blocks for numerous scientific computations and graph applications. Compared with Dense BLAS, parallelization of Sparse BLAS routines entails extra challenges due to the irregularity of sparse data structures. This thesis proposes new fundamental algorithms and data structures that accelerate Sparse BLAS routines on modern massively parallel processors: (1) a new heap data structure named ad-heap, for faster heap operations on heterogeneous processors, (2) a new sparse matrix representation named CSR5, for faster sparse matrix-vector multiplication (SpMV) on homogeneous processors such as CPUs, GPUs and Xeon Phi, (3) a new CSR-based SpMV algorithm for a variety of tightly coupled CPU-GPU heterogeneous processors, and (4) a new framework and associated algorithms for sparse matrix-matrix multiplication (SpGEMM) on GPUs and heterogeneous processors. The thesis compares the proposed methods with state-of-the-art approaches on six homogeneous and five heterogeneous processors from Intel, AMD and nVidia. Using in total 38 sparse matrices as a benchmark suite, the experimental results show that the proposed methods obtain significant performance improvements over the best existing algorithms.

Resumé

Sparse lineær algebra biblioteket kaldet BLAS, er en grundlæggende byggesten i mange videnskabelige- og graf-applikationer. Sammenlignet med almindelige BLAS operationer, er paralleliseringen af sparse BLAS specielt udfordrende, grundet den manglende struktur i matricernes data. Denne afhandling præsenterer grundlæggende nye datastrukturer og algoritmer til hurtigere sparse BLAS afvikling på moderne, massivt parallelle, processorer: (1) en ny hob baseret datastruktur, kaldet ad-heap, for hurtigere hob operationer på heterogene processorer, (2) en ny repræsentation for sparse data, kaldet CSR5, der tillader hurtigere matrix-vektor multiplikation (SpMV) på homogene processorer som CPU'er, GPGPU'er og Xeon Phi, (3) en ny CSR baseret SpMV algoritme for en række, tæt forbundne, CPU-GPGPU heterogene processorer, og (4) et nyt værktøj for sparse matrix-matrix multiplikationer (SpGEMM) på GPGPU'er og heterogene processorer. Afhandlingen sammenligner de præsenterede løsninger med state-of-the-art løsninger på seks homogene og fem heterogene processorer fra Intel, AMD, og nVidia. På baggrund af 38 sparse matricer som benchmarks, ses at de nye metoder der præsenteres i denne afhandling kan give signifikante forbedringer over de kendte løsninger.

Contents


1 Introduction
  1.1 Organization of The Thesis
  1.2 Author's Publications

I Foundations

2 Sparsity and Sparse BLAS
  2.1 What Is Sparsity?
    2.1.1 A Simple Example
    2.1.2 Sparse Matrices
  2.2 Where Are Sparse Matrices From?
    2.2.1 Finite Element Methods
    2.2.2 Social Networks
    2.2.3 Sparse Representation of Signals
  2.3 What Are Sparse BLAS?
    2.3.1 Level 1: Sparse Vector Operations
    2.3.2 Level 2: Sparse Matrix-Vector Operations
    2.3.3 Level 3: Sparse Matrix-Matrix Operations
  2.4 Where Does Parallelism Come From?
    2.4.1 Fork: From Matrix to Submatrix
    2.4.2 Join: From Subresult to Result
  2.5 Challenges of Parallel and Scalable Sparse BLAS
    2.5.1 Indirect Addressing
    2.5.2 Selection of Basic Primitives
    2.5.3 Data Decomposition, Load Balancing and Scalability
    2.5.4 Sparse Output of Unknown Size

3 Parallelism in Architectures
  3.1 Overview
  3.2 Multicore CPU
  3.3 Manycore GPU
  3.4 Manycore Coprocessor and CPU


  3.5 Tightly Coupled CPU-GPU Heterogeneous Processor

4 Parallelism in Data Structures and Algorithms
  4.1 Overview
  4.2 Contributions
  4.3 Simple Data-Level Parallelism
    4.3.1 Vector Addition
    4.3.2 Reduction
  4.4 Scan (Prefix Sum)
    4.4.1 An Example Case: Eliminating Unused Entries
    4.4.2 All-Scan
    4.4.3 Segmented Scan
    4.4.4 Segmented Sum Using Inclusive Scan
  4.5 Sorting and Merging
    4.5.1 An Example Case: Permutation
    4.5.2 Bitonic Sort
    4.5.3 Odd-Even Sort
    4.5.4 Ranking Merge
    4.5.5 Merge Path
    4.5.6 A Comparison
  4.6 Ad-Heap and k-Selection
    4.6.1 Implicit d-heaps
    4.6.2 ad-heap design
    4.6.3 Performance Evaluation
    4.6.4 Comparison to Previous Work

II Sparse Matrix Representations

5 Existing Storage Formats
  5.1 Overview
  5.2 Basic Storage Formats
    5.2.1 Diagonal (DIA)
    5.2.2 ELLPACK (ELL)
    5.2.3 Coordinate (COO)
    5.2.4 Compressed Sparse Row (CSR)
  5.3 Hybrid Storage Formats
    5.3.1 Hybrid (HYB)
    5.3.2 Cocktail
    5.3.3 SMAT
  5.4 Sliced and Blocked Storage Formats
    5.4.1 Sliced ELL (SELL)
    5.4.2 Sliced COO (SCOO)
    5.4.3 Blocked COO (BCOO)
    5.4.4 Blocked CSR (BCSR)
    5.4.5 Blocked Compressed COO (BCCOO)
    5.4.6 Blocked Row-Column (BRC)

    5.4.7 Adaptive CSR (ACSR)
    5.4.8 CSR-Adaptive

6 The CSR5 Storage Format
  6.1 Overview
  6.2 Contributions
  6.3 The CSR5 format
    6.3.1 Basic Data Layout
    6.3.2 Auto-Tuned Parameters ω and σ
    6.3.3 Tile Pointer Information
    6.3.4 Tile Descriptor Information
    6.3.5 Storage Details
    6.3.6 The CSR5 for Other Matrix Operations
  6.4 Comparison to Previous Formats

III Sparse BLAS Algorithms

7 Level 1: Sparse Vector Operations
  7.1 Overview
  7.2 Contributions
  7.3 Basic Method
    7.3.1 Sparse Accumulator
    7.3.2 Performance Considerations
  7.4 CSR-Based Sparse Vector Addition
    7.4.1 Heap Method
    7.4.2 Bitonic ESC Method
    7.4.3 Merge Method

8 Level 2: Sparse Matrix-Vector Operations
  8.1 Overview
  8.2 Contributions
  8.3 Basic Methods
    8.3.1 Row Block Algorithm
    8.3.2 Segmented Sum Algorithm
    8.3.3 Performance Considerations
  8.4 CSR-Based SpMV
    8.4.1 Algorithm Description
    8.4.2 Experimental Results
  8.5 CSR5-Based SpMV
    8.5.1 Algorithm Description
    8.5.2 Experimental Results
  8.6 Comparison to Recently Developed Methods

9 Level 3: Sparse Matrix-Matrix Operations
  9.1 Overview
  9.2 Contributions
  9.3 Basic Method
    9.3.1 Gustavson's Algorithm
    9.3.2 Performance Considerations
  9.4 CSR-Based SpGEMM
    9.4.1 Framework
    9.4.2 Algorithm Design
  9.5 Experimental Results
    9.5.1 Galerkin Products
    9.5.2 Matrix Squaring
    9.5.3 Using Re-Allocatable Memory
  9.6 Comparison to Recently Developed Methods

10 Conclusion and Future Work
  10.1 Conclusion
  10.2 Future Work

Appendix A Benchmark Suite

Appendix B Testbeds
  B.1 CPUs
  B.2 GPUs
  B.3 Xeon Phi
  B.4 Heterogeneous Processors

Appendix C Word Cloud of The Thesis

Appendix D Short Biography

1. Introduction

In the past few decades, basic linear algebra subprograms (BLAS) [24, 60, 61] have been gaining attention in many fields of scientific computing and simulation. When the inputs of BLAS routines are large and sparse, exploiting their sparsity can significantly reduce runtime and space complexity. To this end, a set of new routines, called Sparse BLAS [68, 70, 71, 64], has been designed. Because a large number of real-world applications can benefit from the exploitation of sparsity, designing parallel and scalable data structures and algorithms for Sparse BLAS has become an important research area in the era of massively parallel processing.

This thesis presents the author's parallel and scalable data structures and algorithms for Sparse BLAS on modern multicore, manycore and heterogeneous processors. Specifically, four main contributions have been made: (1) a new heap data structure named ad-heap, for faster heap operations on heterogeneous processors, (2) a new sparse matrix representation named CSR5, for faster sparse matrix-vector multiplication (SpMV) on homogeneous processors such as CPUs, GPUs and Xeon Phi, (3) a new CSR-based SpMV algorithm for a variety of tightly coupled CPU-GPU heterogeneous processors, and (4) a new framework and associated algorithms for sparse matrix-matrix multiplication (SpGEMM) on GPUs and heterogeneous processors.

1.1 Organization of The Thesis

The first chapter, i.e., this chapter, gives an overview and introduces the main contributions of the author of the thesis.

Part I, composed of three chapters, describes the background of Sparse BLAS and parallelism in architectures and algorithms. Chapter 2 describes basic concepts of sparsity, where sparse data come from, how sparse data are decomposed, and what the main challenges of parallel processing of sparse data are. Chapter 3 introduces four architectures of modern microprocessors: multicore CPUs, manycore GPUs, manycore coprocessors and CPUs, and tightly coupled CPU-GPU heterogeneous processors. Chapter 4 describes a useful data structure, the ad-heap, and primitives, such as reduction, sorting networks, prefix-sum scan, segmented sum, merging and k-selection, for parallel implementation of Sparse BLAS operations.

Part II, composed of two chapters, introduces the representation of sparse matrices for modern computer systems. Chapter 5 introduces basic and newly proposed storage formats for sparse matrices. Chapter 6 describes the CSR5 storage format, which is one of the main contributions of the author's PhD work.


Part III, composed of three chapters, presents the author's algorithms for Sparse BLAS routines on modern homogeneous and heterogeneous processors. Chapter 7 focuses on several methods for adding two sparse vectors, which is the main building block of the more complex SpGEMM operation. Chapter 8 describes the author's two SpMV algorithms for sparse matrices in the CSR format and in the CSR5 format. Chapter 9 describes the author's approach for SpGEMM, the most complex Sparse BLAS operation.

The last chapter concludes the thesis and suggests future work. Appendix A lists detailed information on the 38 sparse matrices used as the benchmark suite in this thesis. Appendix B describes the experimental platforms used.

1.2 Author’s Publications

During his PhD research, the author published the following five technical papers in the field of parallel computing as the lead author. This thesis is compiled from these five papers:

1. Weifeng Liu, Brian Vinter. "Ad-heap: An Efficient Heap Data Structure for Asymmetric Multicore Processors". 7th Workshop on General Purpose Processing Using GPUs (held with ASPLOS '14) (GPGPU-7), 2014, pp. 54–63. DOI = http://doi.acm.org/10.1145/2576779.2576786. ISBN 978-1-4503-2766-4. (Reference [117])

2. Weifeng Liu, Brian Vinter. "An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data". 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS '14), 2014, pp. 370–381. DOI = http://dx.doi.org/10.1109/IPDPS.2014.47. ISBN 978-1-4799-3800-1. (Reference [118])

3. Weifeng Liu, Brian Vinter. "CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication". 29th ACM International Conference on Supercomputing (ICS '15), 2015, pp. 339–350. DOI = http://dx.doi.org/10.1145/2751205.2751209. ISBN 978-1-4503-3559-1. (Reference [120])

4. Weifeng Liu, Brian Vinter. "A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors". Journal of Parallel and Distributed Computing (JPDC). Volume 85, November 2015. pp. 47–61. DOI = http://dx.doi.org/10.1016/j.jpdc.2015.06.010. ISSN 0743-7315. (Reference [119])

5. Weifeng Liu, Brian Vinter. "Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors". Parallel Computing (PARCO). Volume 49, November 2015. pp. 179–193. DOI = http://dx.doi.org/10.1016/j.parco.2015.04.004. ISSN 0167-8191. (Reference [121])

Because of the monograph format of this thesis, the content of the above five papers is distributed across Chapters 3, 4, 6, 7, 8 and 9. Specifically, Chapter 3 includes contributions from papers 1, 4 and 5; Chapter 4 includes contributions from papers 1, 2 and 4; Chapter 6 includes contributions from paper 3; Chapter 7 includes contributions from papers 2 and 4; Chapter 8 includes contributions from papers 3 and 5; and Chapter 9 includes contributions from papers 2 and 4. Additionally, the author co-authored a technical paper on sparse representation for dictionary learning in machine vision:

6. Huamin Ren, Weifeng Liu, Søren Ingvor Olsen, Sergio Escalera, Thomas B. Moeslund. "Unsupervised Behavior-Specific Dictionary Learning for Abnormal Event Detection". 26th British Machine Vision Conference (BMVC '15), 2015. pp. 28.1–28.13. ISBN 1-901725-53-7. (Reference [152])

Part I

Foundations

2. Sparsity and Sparse BLAS

2.1 What Is Sparsity?

2.1.1 A Simple Example

Given a group of ordered small balls in the colors gray, red, green and blue (shown in Figure 2.1(a)), we can create a one-dimensional array (shown in Figure 2.1(b)) to record their colors. In a computer system, the array can be accessed by integer indices, which are implicitly given. This thesis always uses 0-based indexing (i.e., C style) rather than 1-based indexing (i.e., Fortran style). For example, it is easy to locate the 5th entry of the array at index 4, and to find that the 5th ball is green.

(a) A group of ordered balls.

(b) A one-dimensional array containing the colors of the balls.

Figure 2.1: A group of ordered balls of different colors and their representation.

Now assume that the gray balls are unimportant in a certain scenario and we want to save memory space of the computer system, so that only the red, green and blue balls need to be stored. The expected storage form is the array shown in Figure 2.2.

Figure 2.2: The expected storage fashion when gray balls are not important.

However, setting aside low-level details of modern computer storage systems and more complex data structures such as linked lists, typical computer memory is organized in linear order and thus cannot "label" unused entries in a hollow data structure. In other words, by using the single array, all "gray" entries still occupy memory space to guarantee correct indexing. To avoid wasting memory space on the gray balls, a one-dimensional integer array can be introduced to store the indices of the red, green and blue balls. We can see that the indices are implicitly given in the first example, but in the second, they are explicitly stored. Therefore, all "gray" entries can be removed from the original color array, and we still know the original positions of the other entries. Figure 2.3 gives the new storage form composed of two arrays rather than one. We can search for index 4 (as input) in the index array and find that the 5th ball is at position 2 (as output). In other words, we now know that position 2 of the index array stores index 4. Afterwards, it is easy to find that position 2 (as new input) of the color array is green (as new output). Similarly, when we search for index 6 in the index array and find that it is not in the array, we can say that the 7th ball is gray.

Figure 2.3: Red, green and blue entries and their indices are stored in two arrays.

By using the above representation, the input data are actually "compressed". Given an input array composed of n entries, where each entry occupies α storage units, the array needs αn storage units in total. Assuming it contains p "important" entries and n − p "unimportant" entries, the space cost of storing the array in the compressed fashion is (α + β)p, where β is the storage cost of the index of an entry. Because α and β are constant factors, the compressed storage format is much more space-efficient than the ordinary format when n is very large and p ≪ n. Furthermore, computational cost can also be saved if only "important" entries are involved in a procedure. We say that a dataset is sparse if many entries of the set are "unimportant", and the dataset can be represented in a compressed style to avoid the space and time cost of processing its "unimportant" entries. In contrast, we say that a dataset is dense or full if all entries are seen as "important" and stored explicitly.
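The compressed representation described above can be sketched in a few lines of code. The following C++ fragment is only an illustration of the idea, not code from the thesis implementation; the names SparseVector and find_color are hypothetical, and colors are encoded as integers.

```cpp
#include <vector>

// A compressed ("sparse") representation of the ball colors: only the
// "important" (non-gray) entries are stored, together with their explicit
// indices in the original array. Space cost is (alpha + beta) * p instead
// of alpha * n for the full array.
struct SparseVector {
    std::vector<int> idx;   // positions of the stored entries, ascending
    std::vector<int> val;   // colors of the stored entries
};

// Returns the color at position i of the original array, or 'gray' if i
// is not present in the index array (i.e., the entry was compressed away).
int find_color(const SparseVector &v, int i, int gray) {
    for (std::size_t k = 0; k < v.idx.size(); ++k)
        if (v.idx[k] == i) return v.val[k];   // explicit index found
    return gray;                              // implicit "unimportant" entry
}
```

A lookup such as find_color(v, 4, GRAY) reproduces the search described above: index 4 is found at position 2 of the index array, so the 5th ball is green, while a search for index 6 falls through to gray.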

2.1.2 Sparse Matrices

The above example of small balls can be seen as dealing with a sparse vector. It is a simple matter to extend this concept to matrices and tensors of arbitrary ranks1 [14, 10, 11, 113].

1Because this thesis concentrates on Sparse BLAS, sparse tensor computations are left as future work.

Given a matrix A of dimension m × n, we say that A is sparse

if it contains many zeros and is stored in a compressed fashion, or is dense or full if all entries of the matrix are explicitly stored [65]. In matrix computations, zeros are seen as "unimportant", since zero is the identity element (or neutral element) for addition and the absorbing element for multiplication. That is to say, adding zero to an entry leaves the entry unchanged, and multiplying any entry by zero gives zero. To demonstrate why zeros are not important in matrix computations, we use an example that multiplies a dense matrix A of size 6 × 3,

\[
A = \begin{bmatrix}
      a & g & m \\
      b & h & n \\
      c & i & o \\
      d & j & p \\
      e & k & q \\
      f & l & r
    \end{bmatrix}
  = \begin{bmatrix} a_{*,0} & a_{*,1} & a_{*,2} \end{bmatrix},
\]

composed of three column vectors

\[
a_{*,0} = \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \end{bmatrix}, \quad
a_{*,1} = \begin{bmatrix} g \\ h \\ i \\ j \\ k \\ l \end{bmatrix}, \quad
a_{*,2} = \begin{bmatrix} m \\ n \\ o \\ p \\ q \\ r \end{bmatrix},
\]

with a sparse vector x of size 3,

\[
x = \begin{bmatrix} 2, & 0, & 0 \end{bmatrix}^T.
\]

We obtain a resulting vector y of size 6:

\[
\begin{aligned}
y = Ax &= x_0 a_{*,0} + x_1 a_{*,1} + x_2 a_{*,2} \\
       &= 2a_{*,0} + 0a_{*,1} + 0a_{*,2} \\
       &= 2\begin{bmatrix} a, b, c, d, e, f \end{bmatrix}^T
        + 0\begin{bmatrix} g, h, i, j, k, l \end{bmatrix}^T
        + 0\begin{bmatrix} m, n, o, p, q, r \end{bmatrix}^T \\
       &= \begin{bmatrix} 2a, 2b, 2c, 2d, 2e, 2f \end{bmatrix}^T
        + \begin{bmatrix} 0, 0, 0, 0, 0, 0 \end{bmatrix}^T
        + \begin{bmatrix} 0, 0, 0, 0, 0, 0 \end{bmatrix}^T \\
       &= \begin{bmatrix} 2a, 2b, 2c, 2d, 2e, 2f \end{bmatrix}^T.
\end{aligned}
\]

We can see that the 2nd and the 3rd entries of x are zeros and thus do not contribute to the final resulting vector y (recall that multiplying any entry by zero gives zero and adding zero to any entry gives the entry itself). In other words, if it is known in advance that only the 1st entry of x is nonzero, the simpler and faster computation

\[
y = Ax = x_0 a_{*,0} = 2a_{*,0}
       = 2\begin{bmatrix} a, b, c, d, e, f \end{bmatrix}^T
       = \begin{bmatrix} 2a, 2b, 2c, 2d, 2e, 2f \end{bmatrix}^T
\]

is enough to give the expected result.

In real-world applications, it is very common for a sparse matrix to contain a large number of zeros and only a very small proportion of nonzero entries. Appendix A lists the benchmark suite of this thesis, which is selected from the University of Florida Sparse Matrix Collection [56] containing over 2700 sparse matrices from a variety of real-world applications. We can calculate the ratio of the number of nonzeros to the number of entries in the full matrix: the matrix Protein has a nonzero entry ratio of 0.33%, which makes it the densest of the 38 benchmark matrices, while the lowest ratio is merely 0.00018%, from the matrix Circuit5M. Therefore, it is important to exploit the zeros of a sparse matrix to save memory space and the corresponding computations.

2.2 Where Are Sparse Matrices From?

Many real-world applications generate sparse matrices. This section lists three typical scenarios from finite element methods, social networks and sparse representation of signals.

2.2.1 Finite Element Methods

The finite element method (FEM) [124] is used for approximating a partial differential equation over a large domain by connecting many simple element equations over many small subdomains (called finite elements). Figure 2.4 shows a simple 4-vertex mesh composed of two elements.

Figure 2.4: A simple mesh with 2 elements and 4 vertices. The numbers outside the mesh are indices of the four vertices. The numbers inside the mesh are local indices of vertices of each element.

Because each element has 3 vertices, we say that the number of local degrees of freedom (DoF) per element is 3. In this case, each element generates a local dense matrix of size 3 × 3, and contributes the entries of this local dense matrix to a global sparse matrix of size 4 × 4 (because there are 4 vertices, i.e., the total number of DoFs in the mesh). This procedure is called finite element assembly [2]. Assume that the local matrix of the first element is

\[
E_0 = \bordermatrix{
        & 0 & 1 & 2 \cr
      0 & a & d & g \cr
      1 & b & e & h \cr
      2 & c & f & i \cr
},
\]

where the three entries in the column vector [0, 1, 2]^T to the left of the matrix will be row indices in the global matrix, and the three entries in the row vector [0, 1, 2] on the top of the matrix will be column indices in the global matrix. Thus each entry of the local matrix knows the position in the global matrix to which its value will be written. For example, the value g will be stored at location (0, 2) in the global matrix. After adding all entries, we obtain a sparse global matrix

\[
A = \begin{bmatrix}
      a & d & g & 0 \\
      b & e & h & 0 \\
      c & f & i & 0 \\
      0 & 0 & 0 & 0
    \end{bmatrix}.
\]

Similarly, we can form a local matrix for the second element:

\[
E_1 = \bordermatrix{
        & 0 & 2 & 3 \cr
      0 & j & m & p \cr
      2 & k & n & q \cr
      3 & l & o & r \cr
},
\]

where the three entries in the column vector [0, 2, 3]^T to the left of the matrix will be row indices in the global matrix, and the three entries in the row vector [0, 2, 3] on the top of the matrix will be column indices in the global matrix. By adding all entries of this local matrix to the global sparse matrix obtained above, we update the 4 × 4 sparse matrix to

\[
A = \begin{bmatrix}
      a     & d & g + m & p \\
      b     & e & h     & 0 \\
      c + k & f & i + n & q \\
      l     & 0 & o     & r
    \end{bmatrix}.
\]

The sparse matrix contains many more zeros, and thus becomes sparser, when the domain is divided into more elements. Appendix A collects some matrices generated from FEM applications.
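The assembly procedure described above can be expressed compactly in code. The following C++ fragment is a hedged illustration of the idea rather than the thesis implementation: each local element matrix carries the global indices of its vertices, and its entries are scatter-added into a global matrix that is kept dense here purely to keep the example short; a real FEM code would assemble directly into a sparse format.

```cpp
#include <vector>

// One finite element: the global indices of its vertices and its local
// dense matrix, stored row-major (dofs.size() x dofs.size() entries).
struct Element {
    std::vector<int>    dofs;   // e.g., {0, 1, 2} for the first element
    std::vector<double> local;  // local matrix entries, row-major
};

// Assemble all local matrices into a global n x n matrix (dense only for
// brevity; real assembly targets a compressed sparse format).
void assemble(const std::vector<Element> &elements, int n,
              std::vector<double> &global) {
    global.assign(static_cast<std::size_t>(n) * n, 0.0);
    for (const Element &e : elements) {
        const int d = static_cast<int>(e.dofs.size());
        for (int r = 0; r < d; ++r)
            for (int c = 0; c < d; ++c)
                // entry (r, c) of the local matrix is added to position
                // (dofs[r], dofs[c]) of the global matrix
                global[e.dofs[r] * n + e.dofs[c]] += e.local[r * d + c];
    }
}
```

For the mesh of Figure 2.4, the two elements would carry dofs {0, 1, 2} and {0, 2, 3}, and the scatter-add reproduces the summed entries g + m, c + k and i + n of the updated global matrix above.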

2.2.2 Social Networks

A social network is composed of a set of social actors (e.g., individuals or organizations) and a set of dyadic ties (i.e., pairs of social actors) between them. If each actor is seen as a vertex of a graph and each dyad is seen as an edge connecting two vertices (i.e., social actors), a social network can be represented by a graph. Figure 2.5(a) shows a small and simple social network of 6 individuals and their dyadic ties, where each individual lists his/her preferences in one row. We can see that the 1st individual (red) likes the 2nd and the 5th individuals (orange and blue, respectively). The 2nd individual likes all the others in the network, and the 5th individual does not like anybody but is liked by all the others.

(a) Social network including 6 individuals. (b) Graph representation of the social network. An arrow means "gives like to".

Figure 2.5: A small and simple social network and its graph representation.

This social network can be represented by the graph shown in Figure 2.5(b). Actually, we can directly extract a sparse matrix

\[
A = \begin{bmatrix}
      0 & 1 & 0 & 0 & 1 & 0 \\
      1 & 0 & 1 & 1 & 1 & 1 \\
      1 & 0 & 0 & 0 & 1 & 0 \\
      0 & 0 & 1 & 0 & 1 & 0 \\
      0 & 0 & 0 & 0 & 0 & 0 \\
      0 & 0 & 0 & 1 & 1 & 0
    \end{bmatrix}
\]

from Figure 2.5(a) and recognize the equivalence between the sparse matrix A and the graph in Figure 2.5(b). As the number of social actors in a network increases, the sparse matrix obtained from it becomes sparser and sparser. The reason is that, generally, each individual's friend circle is only a very small proportion of the whole network.
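Such an adjacency matrix is naturally built in a compressed form directly from the list of "likes". The sketch below is illustrative only (the coordinate/triplet format it uses is introduced in Chapter 5, and the struct and function names are hypothetical): every "i likes j" relation becomes one (row, column, value) triplet, so the zero entries of A are never stored.

```cpp
#include <vector>

// Coordinate (triplet) form of a sparse matrix: one (row, col, val)
// entry per nonzero. The format is described in detail in Chapter 5.
struct Triplets {
    std::vector<int>    row, col;
    std::vector<double> val;
};

// Build the adjacency matrix of a "likes" network: likes[i] holds the
// indices of the individuals that individual i gives a like to.
Triplets build_adjacency(const std::vector<std::vector<int>> &likes) {
    Triplets A;
    for (int i = 0; i < static_cast<int>(likes.size()); ++i)
        for (int j : likes[i]) {
            A.row.push_back(i);   // "i likes j" becomes nonzero A(i, j) = 1
            A.col.push_back(j);
            A.val.push_back(1.0);
        }
    return A;
}
```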

2.2.3 Sparse Representation of Signals

In a vector space, a set of linearly independent vectors is called a basis if every vector in the space is a linear combination of this set. The dimension of the space is the number of basis vectors, and is the same for every basis. For instance, a basis of a six-dimensional vector space always contains 6 vectors. An example of such a basis is the canonical basis:

\[
v_0 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix},\;
v_1 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix},\;
v_2 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix},\;
v_3 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix},\;
v_4 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix},\;
v_5 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}.
\]

Any given vector in this space, for example the vector

\[
x = \begin{bmatrix} 2, 2, 3, 3, 5, 5 \end{bmatrix}^T,
\]

is a linear combination of the basis vectors. Assume that the basis vectors are the columns of a dense matrix

\[
B = \begin{bmatrix} v_0, v_1, v_2, v_3, v_4, v_5 \end{bmatrix}
  = \begin{bmatrix}
      1 & 0 & 0 & 0 & 0 & 0 \\
      0 & 1 & 0 & 0 & 0 & 0 \\
      0 & 0 & 1 & 0 & 0 & 0 \\
      0 & 0 & 0 & 1 & 0 & 0 \\
      0 & 0 & 0 & 0 & 1 & 0 \\
      0 & 0 & 0 & 0 & 0 & 1
    \end{bmatrix}.
\]

The vector x can be represented by multiplying B with a dense vector

\[
a = \begin{bmatrix} 2, 2, 3, 3, 5, 5 \end{bmatrix}^T.
\]

In other words,

\[
x = \begin{bmatrix} 2 \\ 2 \\ 3 \\ 3 \\ 5 \\ 5 \end{bmatrix}
  = Ba
  = \begin{bmatrix}
      1 & 0 & 0 & 0 & 0 & 0 \\
      0 & 1 & 0 & 0 & 0 & 0 \\
      0 & 0 & 1 & 0 & 0 & 0 \\
      0 & 0 & 0 & 1 & 0 & 0 \\
      0 & 0 & 0 & 0 & 1 & 0 \\
      0 & 0 & 0 & 0 & 0 & 1
    \end{bmatrix}
    \begin{bmatrix} 2 \\ 2 \\ 3 \\ 3 \\ 5 \\ 5 \end{bmatrix}.
\]

If we want to construct the vector x with a sparse vector rather than the dense vector a, we can introduce another dense matrix

\[
D = \begin{bmatrix} v_0, v_1, v_2, v_3, v_4, v_5, v_6, v_7, v_8 \end{bmatrix}
  = \begin{bmatrix}
      1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
      0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
      0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
      0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
      0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
      0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1
    \end{bmatrix},
\]

that contains a set of linearly dependent vectors as its columns (e.g., v_6 is a linear combination of the first 6 column vectors v_0 – v_5). To obtain x, we can instead multiply D with a sparse vector

\[
a' = \begin{bmatrix} 0, 0, 0, 0, 0, 0, 2, 3, 5 \end{bmatrix}^T.
\]

That is,

\[
x = \begin{bmatrix} 2 \\ 2 \\ 3 \\ 3 \\ 5 \\ 5 \end{bmatrix}
  = Da'
  = \begin{bmatrix}
      1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
      0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
      0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
      0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
      0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
      0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1
    \end{bmatrix}
    \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 2 \\ 3 \\ 5 \end{bmatrix}.
\]

We can call the dense vector x a signal, and can see that even though D requires larger memory space than B, the sparse vector a' occupies less space than a. When we need to compute or store a large number of signals, using sparse vectors like a' is more space-efficient, provided D remains unchanged for all signals. In this case, a large number of sparse vectors form a sparse matrix A. By multiplying D with A, a group of signals is reconstructed. This technique is widely used in applications such as signal processing and audio/image analysis [153, 152]. The above matrix D is called a dictionary (or overcomplete dictionary), and the sparse vector is called a sparse approximation or sparse representation of a signal.

2.3 What Are Sparse BLAS?

Basic Linear Algebra Subprograms, or BLAS for short, are a set of building blocks for a wide range of software. The definitions of the interfaces in BLAS help keep software portable and maintainable [61, 60, 24]. Vectors and matrices are the two basic elements in BLAS. According to the possible combinations among them, BLAS can be grouped into three levels: (1) vector operations, (2) matrix-vector operations, and (3) matrix-matrix operations. Considering the sparsity of an input vector or matrix, Sparse BLAS [64, 68, 70, 71] are defined using interfaces similar to those of BLAS for dense inputs. Sparse BLAS have more functions than Dense BLAS since they consider both dense and sparse data as inputs and outputs. Sparse BLAS are in general more complex because of indirect memory accesses and the need to keep the output sparse whenever possible. This section introduces the main operations of Sparse BLAS. Later on, the main challenges of implementing fast and scalable Sparse BLAS will be discussed in Section 2.5. In this section, subscripts s and d are used to indicate whether a vector or a matrix is sparse or dense.

2.3.1 Level 1: Sparse Vector Operations

Level 1 Sparse BLAS include the dot product and the addition of one vector to another. The output of a dot product operation is a scalar, regardless of whether the input vectors are sparse or dense. The addition of two vectors generates another vector. If either of the inputs is dense, the resulting vector is dense. If both inputs are sparse, the output is sparse as well.

Dense Vector – Sparse Vector Dot Product

The operation is performed by

\[
\mathit{dot} \leftarrow x_d^T y_s, \quad \text{or} \quad \mathit{dot} \leftarrow x_s^T y_d,
\]

where one of the input vectors x and y is sparse, the other is dense, and the output dot is a scalar.

This operation is basically a gather operation. All nonzero entries of the sparse vector are multiplied with the entries at the same indices in the dense vector. The sum of these products is then the output of the dot product operation.
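A minimal sequential sketch of this gather-based dot product follows (illustrative code, not the thesis implementation); the sparse vector is assumed to be stored as parallel index/value arrays, as in the compressed representation of Section 2.1.

```cpp
#include <vector>

// Dot product of a dense vector x_d and a sparse vector y_s given by
// (y_idx, y_val): gather the dense entries at the nonzero indices of y_s,
// multiply, and sum.
double dot_dense_sparse(const std::vector<double> &x_d,
                        const std::vector<int>    &y_idx,
                        const std::vector<double> &y_val) {
    double dot = 0.0;
    for (std::size_t k = 0; k < y_idx.size(); ++k)
        dot += x_d[y_idx[k]] * y_val[k];   // gather x_d[y_idx[k]]
    return dot;
}
```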

Sparse Vector – Sparse Vector Dot Product

The operation is performed by

\[
\mathit{dot} \leftarrow x_s^T y_s,
\]

where the input vectors x and y are sparse, and the output dot is a scalar. In this operation, a nonzero entry in one input sparse vector does not contribute to the dot product result if its index is not found in the nonzero entry list of the other input sparse vector. The dot product of two sparse vectors can therefore be converted to a set intersection problem. That is, only the nonzero entries having the same index in both inputs are multiplied. The sum of these separate products is then the output of the dot product.
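The following sketch (illustrative only, not the thesis code) assumes that both index arrays are sorted in ascending order, so the set intersection can be computed with a two-pointer merge over the indices.

```cpp
#include <vector>

// Dot product of two sparse vectors with sorted index arrays: only index
// pairs found in both vectors contribute to the result.
double dot_sparse_sparse(const std::vector<int>    &x_idx,
                         const std::vector<double> &x_val,
                         const std::vector<int>    &y_idx,
                         const std::vector<double> &y_val) {
    double dot = 0.0;
    std::size_t i = 0, j = 0;
    while (i < x_idx.size() && j < y_idx.size()) {
        if      (x_idx[i] < y_idx[j]) ++i;               // index only in x
        else if (x_idx[i] > y_idx[j]) ++j;               // index only in y
        else { dot += x_val[i] * y_val[j]; ++i; ++j; }   // in the intersection
    }
    return dot;
}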

Dense Vector – Sparse Vector Addition

The operation is performed by

\[
z_d \leftarrow x_d + y_s, \quad \text{or} \quad z_d \leftarrow x_s + y_d,
\]

where one of the input vectors x and y is sparse, the other is dense, and the output z is a dense vector. Because the indices of the sparse vector are known, the addition is actually a scatter operation. In other words, the nonzero entries of the sparse vector are added to the entries at the same indices in the dense vector, and the results are written to the corresponding positions of the output vector. All other entries, at indices not found in the sparse vector, directly copy the corresponding values from the dense input.

Sparse Vector – Sparse Vector Addition

The operation is performed by

\[
z_s \leftarrow x_s + y_s,
\]

where the input and output vectors x, y and z are all sparse. This operation is more complex than the above operations, since all three vectors are sparse. From the point of view of set operations, nonzero entries with indices in the intersection of the two input index sets are added and stored to the output vector. The nonzero entries with indices in the symmetric difference of x and y (i.e., the union of the two relative complements of one input in the other) are saved directly to z. Chapter 7 describes the methods developed for this operation.
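With sorted index arrays, this addition is a merge of the two index lists: intersecting indices are added, the rest are copied through. The following sequential sketch is illustrative only (Chapter 7 discusses the parallel variants actually proposed in this thesis); note that it preallocates the upper bound of nnz(x) + nnz(y) output entries, a point revisited in Section 2.5.4.

```cpp
#include <vector>

struct SpVec {
    std::vector<int>    idx;   // sorted nonzero indices
    std::vector<double> val;   // corresponding values
};

// z = x + y for sparse vectors with sorted indices: a merge in which
// equal indices are added and the remaining entries are copied through.
SpVec add_sparse_sparse(const SpVec &x, const SpVec &y) {
    SpVec z;
    z.idx.reserve(x.idx.size() + y.idx.size());   // upper bound on nnz(z)
    z.val.reserve(x.idx.size() + y.idx.size());
    std::size_t i = 0, j = 0;
    while (i < x.idx.size() || j < y.idx.size()) {
        const bool take_x = (j == y.idx.size()) ||
                            (i < x.idx.size() && x.idx[i] < y.idx[j]);
        const bool take_y = (i == x.idx.size()) ||
                            (j < y.idx.size() && y.idx[j] < x.idx[i]);
        if (take_x)      { z.idx.push_back(x.idx[i]); z.val.push_back(x.val[i]); ++i; }
        else if (take_y) { z.idx.push_back(y.idx[j]); z.val.push_back(y.val[j]); ++j; }
        else {           // same index in both inputs: add the values
            z.idx.push_back(x.idx[i]);
            z.val.push_back(x.val[i] + y.val[j]);
            ++i; ++j;
        }
    }
    return z;
}
```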

2.3.2 Level 2: Sparse Matrix-Vector Operations

Level 2 Sparse BLAS include sparse matrix and vector operations. When we consider sparsity of both the matrix and the vector, Level 2 Sparse BLAS contain three possible operations: (1) multiplication of a dense matrix and a sparse vector, (2) multiplication of a sparse matrix and a dense vector, and (3) multiplication of a sparse matrix and a sparse vector. All three operations have their own application scenarios.

Dense Matrix – Sparse Vector Multiplication

The operation is performed by

\[
y_d \leftarrow A_d x_s,
\]

where the input A is a dense matrix, the input x is a sparse vector, and the output y is a dense vector. Two approaches can be used for this operation. The first multiplies the column vectors of the dense matrix by the corresponding nonzero entries of the sparse vector and adds the resulting vectors together to construct the dense resulting vector. One example of this method is shown in Section 2.2.3. The other method is to compute the dot product of each row of A (as a dense vector) and x by using the dense vector – sparse vector dot product operation in the Level 1 Sparse BLAS, and to store the scalar outputs as entries of the resulting vector y.
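The first approach, combining the columns of A_d selected by the nonzero entries of x_s, can be sketched as follows (illustrative code only; A is assumed to be stored column-major here so that each selected column is contiguous).

```cpp
#include <vector>

// y = A_d * x_s, where A_d is m x n and stored column-major, and x_s is
// given by sorted index/value arrays. Only the columns of A_d that match
// a nonzero of x_s are touched.
void dense_matrix_sparse_vector(int m,
                                const std::vector<double> &A_colmajor,
                                const std::vector<int>    &x_idx,
                                const std::vector<double> &x_val,
                                std::vector<double>       &y) {
    y.assign(m, 0.0);
    for (std::size_t k = 0; k < x_idx.size(); ++k) {
        const double xk = x_val[k];
        const double *col = &A_colmajor[static_cast<std::size_t>(x_idx[k]) * m];
        for (int i = 0; i < m; ++i)
            y[i] += xk * col[i];   // scale-and-add the selected column
    }
}
```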

Sparse Matrix – Dense Vector Multiplication

The operation is performed by

\[
y_d \leftarrow A_s x_d,
\]

where the input A is a sparse matrix, and the input vector x and the output vector y are both dense. This operation can be conducted by multiple dense vector – sparse vector dot product operations in the Level 1 Sparse BLAS; the generated dot products are then the entries of y. Because many real-world applications (as shown in Section 2.2) generate sparse matrices, this operation has probably been the most studied Sparse BLAS routine over the past few decades. This thesis proposes two fast algorithms for this operation, described in Chapter 8. In the rest of this thesis, this operation is called SpMV for short.
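For a sparse matrix stored in the CSR format (introduced in Chapter 5), the row-wise dot-product formulation leads to the classic sequential kernel below. This is a textbook sketch rather than one of the algorithms proposed in Chapter 8, whose point is precisely to parallelize and load-balance this loop on manycore processors.

```cpp
#include <vector>

// y = A * x with A in CSR form: row_ptr has m+1 entries, and the nonzeros
// of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1 of col_idx/val.
void spmv_csr(int m,
              const std::vector<int>    &row_ptr,
              const std::vector<int>    &col_idx,
              const std::vector<double> &val,
              const std::vector<double> &x,
              std::vector<double>       &y) {
    y.assign(m, 0.0);
    for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];   // indirect access through col_idx
        y[i] = sum;
    }
}
```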

Sparse Matrix – Sparse Vector Multiplication

The operation is performed by

\[
y_s \leftarrow A_s x_s,
\]

where the input matrix A, the input vector x and the output vector y are all sparse. The sparse vector – sparse vector dot product in Level 1 Sparse BLAS can be used as a building block for this operation. Similarly to the two Level 2 operations mentioned above, the outputs of the dot products construct the resulting vector y. If A is organized by columns, an approach similar to the first method of dense matrix – sparse vector multiplication can be used: multiplying the sparse column vectors of A with the corresponding nonzero entries of x and adding the resulting sparse vectors together to obtain the sparse vector y. In this case, a sparse vector – sparse vector addition operation is also utilized.

2.3.3 Level 3: Sparse Matrix-Matrix Operations

There are three combinations of matrix-matrix multiplication: (1) multiplication of a dense matrix and a sparse matrix, (2) multiplication of a sparse matrix and a dense matrix, and (3) multiplication of a sparse matrix and a sparse matrix. Besides the three multiplication operations, two matrices, sparse or dense, can be added up if they have the same dimensions. In addition, a sparse matrix can be transposed.

Sparse Matrix Transposition

The operation is performed by

\[
B_s \leftarrow A_s^T,
\]

where the input matrix A and the output matrix B are both sparse. This operation transposes a sparse matrix into another sparse matrix. All nonzero entries of the input are relocated to new positions in the output if nonzero entries of the same row/column are required to be organized contiguously.
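One standard sequential way to perform this relocation for CSR matrices is a counting pass over the column indices followed by a scatter, as in the sketch below (illustrative only; the CSR format itself is defined in Chapter 5).

```cpp
#include <vector>

// B = A^T with both matrices in CSR form; A is m x n, so B is n x m.
// A counting pass over the column indices of A builds the row pointer of
// B, and then every nonzero is scattered to its new position.
void csr_transpose(int m, int n,
                   const std::vector<int> &A_ptr, const std::vector<int> &A_col,
                   const std::vector<double> &A_val,
                   std::vector<int> &B_ptr, std::vector<int> &B_col,
                   std::vector<double> &B_val) {
    const int nnz = A_ptr[m];
    B_ptr.assign(n + 1, 0);
    B_col.resize(nnz);
    B_val.resize(nnz);
    for (int k = 0; k < nnz; ++k)             // count nonzeros per column of A
        ++B_ptr[A_col[k] + 1];
    for (int j = 0; j < n; ++j)               // prefix sum gives row pointer of B
        B_ptr[j + 1] += B_ptr[j];
    std::vector<int> next(B_ptr.begin(), B_ptr.end() - 1);
    for (int i = 0; i < m; ++i)
        for (int k = A_ptr[i]; k < A_ptr[i + 1]; ++k) {
            const int pos = next[A_col[k]]++; // next free slot in row A_col[k] of B
            B_col[pos] = i;
            B_val[pos] = A_val[k];
        }
}
```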

Dense Matrix – Sparse Matrix Multiplication

The operation is performed by

\[
C_d \leftarrow A_d B_s,
\]

where the input matrices A and B are dense and sparse, respectively. The resulting matrix C is dense. The dense vector – sparse vector dot product in Level 1 Sparse BLAS can be used for calculating each entry in the resulting matrix. However, to achieve higher performance, this operation can also be seen as the multiplication of the dense matrix A with multiple sparse column vectors of B. Thus the dense matrix – sparse vector multiplication in Level 2 Sparse BLAS can be used as a building block. In this case, each resulting vector of the Level 2 operation is a column of C.

Sparse Matrix – Dense Matrix Multiplication

The operation is performed by

\[
C_d \leftarrow A_s B_d,
\]

where the input matrices A and B are sparse and dense, respectively. The resulting matrix C is dense. This operation looks similar to the previous one, since they can both use the dense vector – sparse vector dot product in Level 1 Sparse BLAS as a building block. This operation can also be seen as multiplying a sparse matrix A with multiple dense column vectors of B; thus sparse matrix – dense vector multiplication can be used. A better method, however, may be to multiply each nonzero entry of a sparse row of A with the corresponding dense row of B, and to add all resulting dense vectors together as a dense row of C.

Sparse Matrix – Sparse Matrix Multiplication

The operation is performed by

\[
C_s \leftarrow A_s B_s,
\]

where all matrices A, B and C are sparse. This is the most complex operation among the Level 1, 2 and 3 Sparse BLAS since the two inputs and the output are all sparse. Even though the sparse vector – sparse vector dot product in Level 1 Sparse BLAS can be used for calculating each entry of C, the sparsity of C would be broken since explicit zeros may be stored. To keep C sparse, a method similar to computing sparse matrix – dense matrix multiplication (i.e., exploiting rows of B) can be used. In this case, the corresponding rows of B are scaled by the nonzero entries of a sparse row of A, and are added up by the sparse vector – sparse vector addition operation in Level 1 Sparse BLAS. In the rest of this thesis, this operation is called SpGEMM for short. This thesis proposes a new algorithm for SpGEMM, described in Chapter 9.
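A compact way to express this row-by-row (Gustavson-style, see Section 9.3.1) formulation is the sketch below. It uses a dense accumulator of length n per row purely for clarity; the framework of Chapter 9 replaces this accumulator with size-aware parallel methods, since a dense accumulator is exactly what one wants to avoid for very wide matrices. The code is illustrative only, and the column indices within each output row are left unsorted.

```cpp
#include <vector>

// C = A * B with all matrices in CSR form (A is m x k, B is k x n).
// Row i of C is the sum of the rows of B selected and scaled by the
// nonzeros of row i of A.
void spgemm_rowwise(int m, int n,
                    const std::vector<int> &A_ptr, const std::vector<int> &A_col,
                    const std::vector<double> &A_val,
                    const std::vector<int> &B_ptr, const std::vector<int> &B_col,
                    const std::vector<double> &B_val,
                    std::vector<int> &C_ptr, std::vector<int> &C_col,
                    std::vector<double> &C_val) {
    C_ptr.assign(m + 1, 0);
    std::vector<double> acc(n, 0.0);   // dense accumulator for one row of C
    std::vector<char>   used(n, 0);
    for (int i = 0; i < m; ++i) {
        std::vector<int> touched;
        for (int ka = A_ptr[i]; ka < A_ptr[i + 1]; ++ka) {
            const int    a_col = A_col[ka];
            const double a     = A_val[ka];
            for (int kb = B_ptr[a_col]; kb < B_ptr[a_col + 1]; ++kb) {
                const int j = B_col[kb];
                if (!used[j]) { used[j] = 1; touched.push_back(j); }
                acc[j] += a * B_val[kb];   // scale row a_col of B by a and add
            }
        }
        for (int j : touched) {            // gather the nonzeros of row i of C
            C_col.push_back(j);
            C_val.push_back(acc[j]);
            acc[j] = 0.0; used[j] = 0;     // reset accumulator for the next row
        }
        C_ptr[i + 1] = static_cast<int>(C_col.size());
    }
}
```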

Dense Matrix – Sparse Matrix Addition

The operation is performed by

\[
C_d \leftarrow A_d + B_s, \quad \text{or} \quad C_d \leftarrow A_s + B_d,
\]

where one of the two input matrices A and B is dense and the other is sparse, and the resulting matrix C is dense. This operation is similar to the dense vector – sparse vector addition operation in Level 1 Sparse BLAS, except that multiple rows/columns are considered.

Sparse Matrix – Sparse Matrix Addition

The operation is performed by

\[
C_s \leftarrow A_s + B_s,
\]

where all matrices A, B and C are sparse. This operation can use sparse vector – sparse vector addition as a building block for each row or column.

2.4 Where Does Parallelism Come From?

Because sparse matrices from real-world applications can be very large, parallel sparse matrix algorithms have received a lot of attention. To expose parallelism, a fork-join model is used: a sparse matrix needs to be divided into multiple submatrices, and each of them is assigned to one compute unit in a computer system. This stage is called "fork". After all processes complete, the separate subresults of the submatrices are merged into one single output. This stage is called "join".

2.4.1 Fork: From Matrix to Submatrix

There are three main methods that can decompose a sparse matrix into multiple submatrices. The first method is by row/column. We can partition a 6 × 6 sparse matrix

\[
A = \begin{bmatrix}
      a & d & g & 0 & 0 & 0 \\
      b & e & h & 0 & 0 & 0 \\
      c & f & i & 0 & 0 & 0 \\
      0 & 0 & 0 & j & 0 & 0 \\
      0 & 0 & 0 & 0 & k & m \\
      0 & 0 & 0 & 0 & l & n
    \end{bmatrix}
\]

into a sum of three submatrices

\[
A_0 = \begin{bmatrix}
        a & d & g & 0 & 0 & 0 \\
        b & e & h & 0 & 0 & 0 \\
        c & f & i & 0 & 0 & 0 \\
        0 & 0 & 0 & 0 & 0 & 0 \\
        0 & 0 & 0 & 0 & 0 & 0 \\
        0 & 0 & 0 & 0 & 0 & 0
      \end{bmatrix},\quad
A_1 = \begin{bmatrix}
        0 & 0 & 0 & 0 & 0 & 0 \\
        0 & 0 & 0 & 0 & 0 & 0 \\
        0 & 0 & 0 & 0 & 0 & 0 \\
        0 & 0 & 0 & j & 0 & 0 \\
        0 & 0 & 0 & 0 & 0 & 0 \\
        0 & 0 & 0 & 0 & 0 & 0
      \end{bmatrix},\quad
A_2 = \begin{bmatrix}
        0 & 0 & 0 & 0 & 0 & 0 \\
        0 & 0 & 0 & 0 & 0 & 0 \\
        0 & 0 & 0 & 0 & 0 & 0 \\
        0 & 0 & 0 & 0 & 0 & 0 \\
        0 & 0 & 0 & 0 & k & m \\
        0 & 0 & 0 & 0 & l & n
      \end{bmatrix},
\]

where A = A_0 + A_1 + A_2.

The second method is by sparsity structure. We can divide the above sparse matrix A into a diagonal matrix A'_0 and another matrix A'_1 composed of the remaining entries. Then we have

\[
A'_0 = \begin{bmatrix}
         a & 0 & 0 & 0 & 0 & 0 \\
         0 & e & 0 & 0 & 0 & 0 \\
         0 & 0 & i & 0 & 0 & 0 \\
         0 & 0 & 0 & j & 0 & 0 \\
         0 & 0 & 0 & 0 & k & 0 \\
         0 & 0 & 0 & 0 & 0 & n
       \end{bmatrix},\quad
A'_1 = \begin{bmatrix}
         0 & d & g & 0 & 0 & 0 \\
         b & 0 & h & 0 & 0 & 0 \\
         c & f & 0 & 0 & 0 & 0 \\
         0 & 0 & 0 & 0 & 0 & 0 \\
         0 & 0 & 0 & 0 & 0 & m \\
         0 & 0 & 0 & 0 & l & 0
       \end{bmatrix},
\]

where A = A'_0 + A'_1.

The third method is by the number of nonzero entries. We can evenly divide A, which contains 14 nonzero entries, into two submatrices, A''_0 and A''_1, that contain 7 nonzero entries each. Then we have

\[
A''_0 = \begin{bmatrix}
          a & d & g & 0 & 0 & 0 \\
          b & e & h & 0 & 0 & 0 \\
          c & 0 & 0 & 0 & 0 & 0 \\
          0 & 0 & 0 & 0 & 0 & 0 \\
          0 & 0 & 0 & 0 & 0 & 0 \\
          0 & 0 & 0 & 0 & 0 & 0
        \end{bmatrix},\quad
A''_1 = \begin{bmatrix}
          0 & 0 & 0 & 0 & 0 & 0 \\
          0 & 0 & 0 & 0 & 0 & 0 \\
          0 & f & i & 0 & 0 & 0 \\
          0 & 0 & 0 & j & 0 & 0 \\
          0 & 0 & 0 & 0 & k & m \\
          0 & 0 & 0 & 0 & l & n
        \end{bmatrix},
\]

where A = A''_0 + A''_1.

In practice, selecting a decomposition method depends on the target Sparse BLAS algorithm and the architecture of the computer system used. Because of the compressed storage methods, the above decomposition methods in general do not bring an obvious extra memory footprint, regardless of which method is selected.

2.4.2 Join: From Subresult to Result

Using the first method as an example, suppose we want to compute one of the Level 2 BLAS routines, multiplying A with a dense vector

\[
x = \begin{bmatrix} 1, 2, 3, 4, 5, 6 \end{bmatrix}^T.
\]

We can compute

a d g 0 0 0 1 1a + 2d + 3g b e h 0 0 0 2 1b + 2e + 3h       c f i 0 0 0 3 1c + 2f + 3i y0 = A0x =     =   , 0 0 0 0 0 0 4  0        0 0 0 0 0 0 5  0  0 0 0 0 0 0 6 0 0 0 0 0 0 0 1  0  0 0 0 0 0 0 2  0        0 0 0 0 0 0 3  0  y1 = A1x =     =   , 0 0 0 j 0 0 4 4j       0 0 0 0 0 0 5  0  0 0 0 0 0 0 6 0 0 0 0 0 0 0  1  0  0 0 0 0 0 0  2  0        0 0 0 0 0 0  3  0  y2 = A2x =     =   . 0 0 0 0 0 0  4  0        0 0 0 0 k m 5 5k + 6m 0 0 0 0 l n 6 5l + 6n

We can see that y_0, y_1 and y_2 can be computed in parallel. Thereafter, by summing

them up, we obtain

\[
\begin{aligned}
y = y_0 + y_1 + y_2
  &= \begin{bmatrix} 1a + 2d + 3g \\ 1b + 2e + 3h \\ 1c + 2f + 3i \\ 0 \\ 0 \\ 0 \end{bmatrix}
   + \begin{bmatrix} 0 \\ 0 \\ 0 \\ 4j \\ 0 \\ 0 \end{bmatrix}
   + \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 5k + 6m \\ 5l + 6n \end{bmatrix} \\
  &= \begin{bmatrix} 1a + 2d + 3g \\ 1b + 2e + 3h \\ 1c + 2f + 3i \\ 4j \\ 5k + 6m \\ 5l + 6n \end{bmatrix}.
\end{aligned}
\]
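The fork and join stages can be sketched in code as follows: each of p workers multiplies its own submatrix (here, a contiguous block of rows of a CSR matrix, which corresponds to the first decomposition method) with x and produces a full-length partial vector, and the join sums the partial vectors. This is an illustrative host-side sketch using std::thread, not a thesis algorithm; as discussed in Section 2.5, the hard part in practice is choosing the decomposition so that the per-unit work is balanced.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Fork-join SpMV sketch: each task handles a block of rows of a CSR matrix
// and writes a full-length partial vector; the join sums the partials.
void forkjoin_spmv(int m, int p,
                   const std::vector<int>    &row_ptr,
                   const std::vector<int>    &col_idx,
                   const std::vector<double> &val,
                   const std::vector<double> &x,
                   std::vector<double>       &y) {
    std::vector<std::vector<double>> partial(p, std::vector<double>(m, 0.0));
    std::vector<std::thread> workers;
    const int rows_per_task = (m + p - 1) / p;
    for (int t = 0; t < p; ++t) {
        workers.emplace_back([&, t] {                 // "fork"
            const int begin = t * rows_per_task;
            const int end   = std::min(m, begin + rows_per_task);
            for (int i = begin; i < end; ++i)
                for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                    partial[t][i] += val[k] * x[col_idx[k]];
        });
    }
    for (auto &w : workers) w.join();
    y.assign(m, 0.0);
    for (int t = 0; t < p; ++t)                       // "join"
        for (int i = 0; i < m; ++i)
            y[i] += partial[t][i];
}
```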

2.5 Challenges of Parallel and Scalable Sparse BLAS

2.5.1 Indirect Addressing

Because of the compressed storage fashion, nonzero entries of a sparse vector or matrix have to be accessed through indirect addresses stored in its index array. On one hand, indirect addressing leads to random accesses that bring more memory transactions and a lower cache hit rate. On the other hand, the addresses are only known at runtime, not at compile time when software prefetching could be added to improve performance. Thus, compared to the sequential memory accesses in Dense BLAS, indirect addressing in general significantly decreases the performance of Sparse BLAS.

Some hardware features can improve the performance of indirect addressing. For example, modern CPUs are equipped with relatively large private and shared caches. As a result, the cache hit rate is increased because more potentially usable data are stored in on-chip memory for latency hiding. In other words, each random memory transaction requires on average less time because of fewer long-latency accesses to off-chip memory. Additionally, modern GPUs concurrently run a large number (e.g., tens of thousands) of lightweight threads on high-bandwidth memory to hide latency. Moreover, improving sparse matrix storage structures, e.g., by slicing and grouping column entries [5, 8, 54, 135], can greatly reduce the negative effect of indirect addressing by caching more recently usable data. Therefore, the main target of this thesis is utilizing the high throughput of manycore processors such as GPUs and Xeon Phi for Sparse BLAS. All main contributions on data structures and algorithms in this thesis have been designed for them.

2.5.2 Selection of Basic Primitives

When the output is required to be sparse, some computations must be performed on indirect addresses. For example, the sparse vector – sparse vector dot product in Level 1 Sparse BLAS executes a set intersection operation on the index arrays of the two sparse input vectors. This is obviously slower than a scatter or gather operation when one operand is dense, and can be much slower than the trivial dot product of two dense vectors.

Therefore, selection of the best building blocks plays a crucial role in the implementation of Sparse BLAS. Many basic primitives such as sorting, searching, insertion, merging, reduction and prefix-sum scan all have their own application scenarios when using different hardware platforms. Chapter 4 of this thesis describes useful parallel primitives such as sorting networks, evaluates multiple parallel merge algorithms, and develops new parallel primitives such as a fast segmented sum. We further show that the contributions of Chapter 4 greatly improve the performance of the Sparse BLAS routines described in Chapters 7 – 9.

2.5.3 Data Decomposition, Load Balancing and Scalability

As listed in Section 2.4, multiple data decomposition methods can be used for generating submatrices for parallel processing on multiple compute units. However, which decomposition method is the best depends on the required operation and the concrete hardware device. Load balancing is actually the most important criterion of data decomposition, and decides the scalability of a data structure or algorithm. To achieve load balancing, three factors need to be considered: (1) is the data decomposition based on a sparse matrix's rows/columns, its sparsity structure, or its nonzero entries? (2) is the data decomposition conducted in the input space or the output space? and (3) which merging approach is the best once a decomposition method is decided? Because various massively parallel chips are the main platform of this thesis, we offer multiple contributions. For example, Chapter 8 demonstrates that partitioning a sparse matrix into 2D tiles of the same size provides the best scalability for the SpMV operation, and Chapter 9 shows that data decomposition conducted in the output space delivers the best performance for SpGEMM.

2.5.4 Sparse Output of Unknown Size

Several Sparse BLAS operations, such as the addition of two sparse vectors/matrices and the multiplication of two sparse matrices, generate sparse output. The number of nonzero entries and their distribution are not known in advance. Because sparse matrices are normally very large, it is infeasible to store the resulting nonzero entries in a dense matrix and then convert it to sparse. Precomputing an upper bound is one method to approximate the number of nonzero entries of the output. For example, the addition of a sparse vector v0 including nnz0 nonzero entries and another sparse vector v1 containing nnz1 nonzero entries will generate a sparse vector including no more than nnz0 + nnz1 nonzero entries. Thus preallocating a memory block of size nnz0 + nnz1 is a safe choice. However, this method may waste memory space when the intersection of the index sets is large. Another method is to preallocate a sparse resulting vector with room for only nnz0 nonzero entries and then insert the nnz1 nonzero entries from v1. But this method may need memory reallocation, which is not supported on some accelerator platforms such as GPUs.

In Chapter 9 of this thesis, we propose a hybrid memory management method that pre-allocates an upper-bound size for short rows and a fixed size for long rows, and re-allocates memory for the long rows on demand. The experimental results show that this hybrid method can significantly save memory space (e.g., by a factor of 20) for some test problems. Further, on tightly coupled CPU-GPU processors, using re-allocatable memory brings extra performance gains.

3. Parallelism in Architectures

3.1 Overview

The performance of Sparse BLAS operations depends highly on the hardware used. In the past decade, a variety of chip multiprocessors (CMPs) have replaced single-core processors as the main targets of shared-memory Sparse BLAS algorithm design. This chapter introduces four mainstream CMPs: the multicore CPU, the manycore GPU, the manycore coprocessor and CPU, and the emerging tightly coupled CPU-GPU heterogeneous processor.

3.2 Multicore CPU

Central processing units (CPUs) are designed as the central control units that perform basic compute and control operations of computer systems. Because the CPU is responsible for processing and responding, its single-thread performance is highly emphasized. Inside one single CPU core, various parallelism-oriented techniques, such as multiple functional units, pipelining, superscalar execution, out-of-order execution, branch prediction, simultaneous multithreading and deep cache hierarchies, have been designed. Because of the so-called power wall, the clock frequency of a single CPU core cannot be increased much further. As a result, multiple CPU cores of lower clock frequency have to be combined onto a single chip for continued performance improvements. Figure 3.1 shows a quad-core CPU equipped with large private and shared caches.

Because of the low cost of commercial-grade CPUs, they are widely used as the main compute units of large-scale supercomputers. However, for the same reason, CPUs cannot achieve a performance balance between general-purpose operations such as human-computer interaction and specific-purpose applications such as scientific computing. As a result, more special-purpose devices are required for massively parallel scientific computing.

Figure 3.1: A quad-core CPU including multi-level caches.

3.3 Manycore GPU

Graphics processing units (GPUs) grew out of graphics cards, which have a long history; the first chip marketed as a GPU was designed by nVidia in 1999 for photorealistic 3D rendering in computer games. Because basic numerical linear algebra operations play crucial roles in real-time 3D computer graphics, GPUs are designed around this set of operations. Because GPUs offer higher peak performance and bandwidth, numerical linear algebra applications can deliver much higher performance than when merely using multicore CPUs. In 2007, the birth of nVidia's CUDA programming model made GPU programming much easier for researchers without background knowledge of shading languages such as GLSL1 and HLSL2. To describe non-graphics applications on GPUs, a new research direction, general-purpose computation on graphics hardware (or GPGPU for short), was naturally established [100, 134]. Figure 3.2 shows a GPU composed of four cores, private scratchpad memory, and private and shared caches.

1The OpenGL Shading Language from the OpenGL Architecture Review Board (ARB).
2The High Level Shading Language from the Microsoft Corp.

Figure 3.2: A GPU composed of four cores, scratchpad memory, and caches.

Because CUDA and OpenCL are both widely used in GPU programming and they deliver comparable performance [74], our Sparse BLAS algorithms support both of them. We use CUDA implementations on nVidia GPUs and OpenCL implementations on AMD GPUs. For simplicity, we define the following unified terminology: (1) thread denotes a thread in CUDA and a work item in OpenCL, (2) thread bunch denotes a warp on nVidia GPUs and a wavefront on AMD GPUs, (3) thread group denotes a thread block or cooperative thread array (CTA) in CUDA and a work group in OpenCL, (4) core denotes a streaming multiprocessor (SMX) or Maxwell streaming multiprocessor (SMM) on nVidia GPUs and a compute unit on AMD GPUs, and (5) scratchpad memory denotes shared memory in CUDA and local memory in OpenCL.

Because GPU cores can execute massively parallel lightweight threads on SIMD units for higher aggregate throughput, applications with good data-level parallelism can be significantly accelerated by GPUs [42, 43, 41]. As such, some modern supercomputers already use GPUs as accelerators. However, utilizing the power of GPUs requires rewriting programs in CUDA, OpenCL or other GPU-oriented programming models. This brings non-trivial software engineering overhead for applications with a lot of legacy code [37, 165]. Furthermore, because GPUs have their own memory, data have to be transferred back and forth between CPU-controlled system memory and GPU-controlled device memory. This data transfer may offset the performance improvements gained from GPUs. Also, programming the data transfer makes software more complicated and harder to maintain [180]. Figure 3.3 shows a loosely coupled CPU-GPU heterogeneous system. We can see that the CPU and the GPU are connected by a PCIe interface, which delivers much lower bandwidth than the GPU device memory.

Figure 3.3: A loosely coupled CPU-GPU heterogeneous system with PCIe interface.

3.4 Manycore Coprocessor and CPU

Inspired by the success of GPU computing, Intel designed manycore coprocessors and CPUs for aggregate throughput as well. Like GPUs, manycore coprocessors and CPUs contain many relatively small cores in one chip. However, each small core can run an operating system and is strengthened by wider SIMD units. Even though the small cores are not as powerful as their counterparts in multicore CPUs, a manycore coprocessor or CPU can include at least tens of such small cores. Thus the computational power is also impressive compared with GPUs [73]. The first generations of such chips were released in a coprocessor form, which is similar to an accelerator but has more flexible memory management. The later generations can be used independently and thus have no fundamental difference from classic multicore CPUs. Figure 3.4 plots a manycore coprocessor/CPU composed of 12 small cores and their private caches.

One obvious advantage of manycore coprocessors and CPUs is programmability. Ideally, unchanged legacy code for classic CPUs can run smoothly on those chips. As a result, a large number of applications are already well prepared. Moreover, the data transfer overhead of a CPU-GPU system can be completely avoided when using a manycore CPU as the only compute device in a system.

Figure 3.4: A manycore CPU including 12 small cores and their private caches.

3.5 Tightly Coupled CPU-GPU Heterogeneous Processor

Recently, heterogeneous processors (also known as heterogeneous chip multiprocessors or asymmetric multicore processors, hCMPs or AMPs for short) have been designed [105, 96] and implemented [30, 3, 53, 141, 150]. Compared to homogeneous processors, heterogeneous processors can deliver improved overall performance and power efficiency [46], as long as sufficient heterogeneous parallelism exists. The main characteristics of heterogeneous processors include unified shared memory and fast communication among different types of cores (e.g., CPU cores and GPU cores). Practically, heterogeneous system architecture (HSA) [90], OpenCL [136] and CUDA [137] have supplied toolkits for programming heterogeneous processors. Compared to homogeneous chip multiprocessors such as CPUs and GPUs, heterogeneous processors are able to combine different types of cores into one chip. Thus heterogeneous processors offer more flexibility in the architecture design space.

The Cell Broadband Engine [185, 167, 168] can be seen as an early form of heterogeneous processor. Currently, because of mature CPU and GPU architectures, programming environments and applications, CPU-GPU integrated heterogeneous processors with multiple instruction set architectures (ISAs) are the most widely adopted choice. Representatives of this model include AMD Accelerated Processing Units (APUs) [30, 3, 159], Intel multi-CPU and GPU system-on-a-chip (SoC) devices [53], the nVidia Echelon heterogeneous GPU architecture [96], and many mobile processors (e.g., nVidia Tegra [141] and Qualcomm Snapdragon [150]). Figure 3.5 shows block diagrams of two heterogeneous processors used as experimental testbeds in this thesis. In general, a heterogeneous processor consists of four major parts: (1) a group of CPU cores with hardware-controlled caches, (2) a group of GPU cores with shared command processors, software-controlled scratchpad memory and hardware-controlled caches, (3) a shared memory management unit, and (4) shared global dynamic random-access memory (DRAM). The last level cache of the two types of cores can be separate as shown in Figure 3.5(a) or shared as shown in Figure 3.5(b).

(a) Heterogeneous processor with separate last level cache

(b) Heterogeneous processor with CPU-GPU shared last level cache

Figure 3.5: Block diagrams of two representative heterogeneous processors.

The CPU cores have higher single-thread performance due to out-of-order execution, branch prediction and large caches. The GPU cores execute massively parallel lightweight threads on SIMD units for higher aggregate throughput. The two types of compute units have completely different ISAs and separate cache sub-systems. Compared to the loosely-coupled CPU-GPU heterogeneous systems shown in Figure 3.3, the two types of cores in a heterogeneous processor share one single unified address space instead of using separate address spaces (i.e., system memory space and GPU device memory space). One obvious benefit is avoiding data transfer through connection interfaces (e.g., the PCIe link), which is one of the most well known bottlenecks of coprocessor/accelerator computing [82]. Additionally, GPU cores can access more memory by paging memory to and from disk. Further, the consistent pageable shared virtual memory can be fully or partially coherent, meaning that much more efficient CPU-GPU interactions are possible due to eliminated heavyweight synchronization (i.e., flushing and GPU cache invalidation). Currently, programming on the unified address space and low-overhead kernel launch are supported by HSA [90], OpenCL [136] and CUDA [137]. To leverage the heterogeneous processors, existing research has concentrated on various coarse-grained methods that exploit task, data and pipeline parallelism in the heterogeneous processors. However, it is still an open question whether or not the new features of the emerging heterogeneous processors can expose fine-grained parallelism in fundamental data structure and algorithm design. Also, whether new designs can outperform their conventional counterparts plus the coarse-grained parallelization is a further question. A lot of prior work has concentrated on exploiting coarse-grained parallelism or one-sided computation in the heterogeneous processors. The current literature can be classified into four groups: (1) eliminating data transfer, (2) decomposing tasks and data, (3) pipelining, and (4) prefetching data. Eliminating data transfer over a PCIe bus is one of the most distinct advantages brought by the heterogeneous processors, thus its influence on performance and energy consumption has been relatively well studied. Research [49, 181, 133] reported that various benchmarks can obtain performance improvements from the AMD APUs because of reduced data movement cost. Besides the performance benefits, research [169, 138] demonstrated that non-negligible power savings can be achieved by running programs on the APUs rather than on discrete GPUs because of the shorter data path and the elimination of the PCIe bus and controller. Further, Daga and Nutter [48] showed that using the much larger system memory makes searches on very large B+ trees possible. Decomposing tasks and data is also widely studied in heterogeneous system research. Research [106, 174] proposed scheduling approaches that map workloads onto the most appropriate core types in single-ISA heterogeneous processors. In recent years, as GPU computing is becoming more and more important, scheduling on multi-ISA heterogeneous environments has been a hot topic. StarPU [9], Qilin [125], Glinda [163] and HDSS [22] are representatives that can simultaneously execute suitable compute programs for different data portions on CPUs and GPUs.
Pipelining is another widely used approach that divides a program into multiple stages and executes them on the most suitable compute units in parallel. Heterogeneous environments further enable pipeline parallelism to minimize the serial bottleneck in Amdahl's Law [88, 148, 59, 133]. Chen et al. [44] pipelined the map and reduce stages on different compute units. Additionally, the pipelining scheme can also expose wider design dimensions. Wang et al. [181] used the CPU to relieve the GPU workload after each iteration finished, thus the overall execution time was largely reduced. He et al. [87] exposed data parallelism in pipeline parallelism by using both the CPU and the GPU for every high-level data parallel stage. Prefetching data can be considered with heterogeneity as well. Once the GPU and the CPU share one cache block, the idle integrated GPU compute units can be leveraged as prefetchers for improving single-thread performance of the CPU [186, 187], and vice versa [189]. Further, Arora et al. [6] argued that stride-based prefetchers are likely to become significantly less relevant on the CPU if a GPU is integrated.

4. Parallelism in Data Structures and Algorithms

4.1 Overview

Fundamental data structures and algorithms play key roles in implementing any computer program. Data structures are approaches to store information. Algorithms are methods for solving problems. To accelerate Sparse BLAS on modern computer architectures, effectively utilizing parallel-friendly data structures and scalable algorithms is crucial. This chapter introduces some parallel building blocks used in the implementation of Sparse BLAS.

4.2 Contributions

This chapter makes the following contributions:

• A new method that uses offset information for converting a complex segmented sum operation to a simple inclusive all-scan operation.

• A performance comparison of five recently developed merge methods on three nVidia and AMD GPUs. To the best of our knowledge, no literature has reported performance of merging short sequences of size less than 2^12, which fully use on-chip scratchpad memory.

• A new heap data structure named ad-heap for faster fundamental heap operations on heterogeneous processors. To the best of our knowledge, the ad-heap is the first fundamental data structure that efficiently leverages the two different types of cores in the emerging heterogeneous processors through fine-grained frequent interactions between the CPUs and the GPUs.

4.3 Simple Data-Level Parallelism

4.3.1 Vector Addition

Data-level parallelism (or data parallelism) distributes data across different compute units and independently runs the same instructions on them. A typical pattern of data-level parallelism is single instruction, multiple data (SIMD), which is ubiquitous in modern microprocessors. The most easily understood data-level parallel algorithm may be the addition of two dense vectors. The resulting vector of this operation is dense as well. Suppose that the two inputs are a and b, both of size n, and the output is vector c of the same size; the parallel procedure is shown in Algorithm 1. Ideally, this operation can be completed within one time unit, if a computer system contains at least n compute units and n memory paths. In contrast, a sequential method, shown in Algorithm 2, requires n time units, even though the computer system has exactly the same configuration.

Algorithm 1 Addition of two dense vectors in parallel. 1: function VECTOR ADDITION PARALLEL(*in1, *in2, *out, len) 2: for i = 0 to len − 1 in parallel do 3: out[i] ← in1[i] + in2[i] 4: end for 5: end function

From the point of view of pseudocode representation, this thesis uses the keyword in parallel right after a for loop to indicate that the loop can be executed in parallel. For example, the sequential version of Algorithm 1 is shown in Algorithm 2.

Algorithm 2 Addition of two dense vectors in serial. 1: function VECTOR ADDITION SERIAL(*in1, *in2, *out, len) 2: for i = 0 to len − 1 do 3: out[i] ← in1[i] + in2[i] 4: end for 5: end function
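To make the mapping from the pseudocode to real code concrete, the following minimal C++ sketch (using OpenMP, compiled with -fopenmp) expresses Algorithm 1; the function name and types are illustrative only.

#include <cstddef>

// Parallel dense vector addition: out[i] = in1[i] + in2[i].
// The "in parallel" keyword of Algorithm 1 corresponds to the
// OpenMP parallel for directive on the loop below.
void vector_addition_parallel(const double *in1, const double *in2,
                              double *out, std::size_t len)
{
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(len); ++i)
        out[i] = in1[i] + in2[i];
}

Removing the pragma yields exactly the serial loop of Algorithm 2.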

4.3.2 Reduction

Sometimes it is required to calculate a scalar from an array. The scalar can be the maximum/minimum value or the sum of all of the entries of the array. A reduction operation is commonly used to obtain the scalar. Algorithm 3 gives pseudocode of a serial reduction-sum. This method loops through each entry in the array and sums all entries up to obtain the scalar.

Algorithm 3 Serial reduction-sum. 1: function REDUCTION SUM SERIAL(*in, len) 2: sum ← 0 3: for i = 0 to len − 1 do 4: sum ← sum+in[i] 5: end for 6: return sum 7: end function

Suppose that the length of the input array is n; serial reduction has a work complexity of O(n). In modern microprocessors, data in memory can be efficiently streamed to a compute unit by prefetching, thus the cache hit rate is high. But when a system has more compute units, they cannot automatically be used for higher performance. Therefore, parallel reduction has been designed through a balanced tree structure. The height (i.e., number of levels) of a balanced binary tree is log2 n + 1, if it has n leaves. So an array of size n to be reduced can be mapped onto a balanced binary tree of depth log2 n + 1. Figure 4.1 shows an example of executing reduction-sum on an array of size 8 with a balanced tree of 4 levels. We can see that the addition operations between every two levels of the tree can run in parallel. For example, the four additions between the two levels at the top can run independently in parallel.

Figure 4.1: A parallel reduction-sum on an array of size 8.

A parallel implementation of the reduction-sum operation is shown in Algorithm 4. The parallel version needs n − 1 additions and thus still has runtime complexity O(n). However, the operation can be completed in O(log2 n) time, if at least n/2 compute units and enough fast memory are available.

Algorithm 4 Parallel reduction-sum. 1: function REDUCTION SUM PARALLEL(*in, len) 2: for d = 0 to log2 len − 1 do 3: for k = 0 to len − 1 by 2^(d+1) in parallel do 4: in[k + 2^(d+1) − 1] ← in[k + 2^d − 1] + in[k + 2^(d+1) − 1] 5: end for 6: end for 7: sum ← in[len − 1] 8: return sum 9: end function
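A minimal C++/OpenMP sketch of this tree reduction follows; it mirrors Algorithm 4, assumes len is a power of two, and treats the inner loop as the parallel step (compile with -fopenmp). All names are illustrative.

#include <cstddef>
#include <vector>

// Tree-based parallel reduction-sum (in-place): after log2(len) rounds
// the total sum ends up in the last element, as in Algorithm 4.
double reduction_sum_parallel(std::vector<double> &in)
{
    if (in.empty()) return 0.0;
    std::size_t len = in.size();
    for (std::size_t stride = 1; stride < len; stride *= 2) {
        #pragma omp parallel for
        for (std::ptrdiff_t k = 0; k < static_cast<std::ptrdiff_t>(len);
             k += static_cast<std::ptrdiff_t>(2 * stride))
            in[k + 2 * stride - 1] += in[k + stride - 1];   // disjoint pairs
    }
    return in[len - 1];
}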

4.4 Scan (Prefix Sum)

Scan, which is also known as prefix sum, is a powerful building block in parallel computing. For different purposes, scan algorithms can be categorized as inclusive scan and exclusive scan. Given an array of size n

a = [a0, a1, ..., an−1]

and an associative operator ⊕ as inputs of a scan operation, the output of the inclusive scan is another array of the same size

ain = [a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−1)],

and the exclusive scan generates an array of the same size

aex = [0, a0, ..., (a0 ⊕ a1 ⊕ ... ⊕ an−2)].

We see that aex = ain − a. For example, setting the ⊕ operator to arithmetic + and assuming the input is [3, 6, 7, 8, 9, 12], the outputs will be [3, 9, 16, 24, 33, 45] and [0, 3, 9, 16, 24, 33] through the inclusive scan and the exclusive scan, respectively. Scan can convert seemingly serial computations to parallel operations. Unstructured sparse matrices have rows/columns of irregular sizes and are thus a good scenario for using parallel scan operations. The following subsection gives an example.
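As a quick sanity check of these definitions, the C++17 standard library already provides both flavors of all-scan; the snippet below reproduces the worked example above (the container and variable names are illustrative).

#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> a = {3, 6, 7, 8, 9, 12};
    std::vector<int> inc(a.size()), exc(a.size());

    // Inclusive scan: {3, 9, 16, 24, 33, 45}.
    std::inclusive_scan(a.begin(), a.end(), inc.begin());
    // Exclusive scan with identity 0: {0, 3, 9, 16, 24, 33}.
    std::exclusive_scan(a.begin(), a.end(), exc.begin(), 0);

    for (int x : inc) std::cout << x << ' ';
    std::cout << '\n';
    for (int x : exc) std::cout << x << ' ';
    std::cout << '\n';
    return 0;
}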

4.4.1 An Example Case: Eliminating Unused Entries Suppose that we want to eliminate all even columns of a sparse matrix

0 c f j 0 0 a 0 g k n o   b 0 0 0 m 0 A =   , 0 d h 0 0 0   0 0 0 0 0 p 0 e i l n q

and to obtain

Aodd =
[ 0  f  0 ]
[ a  g  n ]
[ b  0  m ]
[ 0  h  0 ]
[ 0  0  0 ]
[ 0  i  n ]

containing only the three odd columns of A. Because the two sparse matrices are required to be stored in a compressed style, parallelizing this partitioning operation is not trivial.

Using the 2nd row as an example, it is stored as two arrays

col idx = [0, 2, 3, 4, 5], val = [a, g, k, n, o].

Obviously, removing its even columns to obtain

col idxodd = [0, 2, 4], valodd = [a, g, n],

can be easily achieved by sequentially iterating over all entries and saving the ones located in odd columns. However, parallel processing is expected to have higher performance. This can be achieved by using the scan primitive. First, an intermediate array is introduced to label each nonzero entry. A nonzero entry is labeled as 1 if its index refers to an odd column, otherwise 0 is used. So the 2nd row with this auxiliary array is stored as

col idx = [0, 2, 3, 4, 5], val = [a, g, k, n, o], aux array = [1, 1, 0, 1, 0].

Then an exclusive scan is executed on the auxiliary array to obtain

aux arrayex = [0, 1, 2, 2, 3],

which stores the target positions of the odd columns. Now it is easy to save the odd columns of the whole 2nd row of A by Algorithm 5 in parallel. This method can be easily extended to the whole matrix.

Algorithm 5 Eliminating unused nonzero entries in parallel. 1: for i = 0 to 4 in parallel do 2: if aux array[i] = 1 then 3: col idxodd[aux arrayex[i]] ← col idx[i] 4: valodd[aux arrayex[i]] ← val[i] 5: end if 6: end for
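The same flag-and-scan pattern (often called stream compaction) can be written compactly in C++. The following serial sketch mirrors the example above: the flag marks entries to keep (the odd columns of the text, i.e. 0-based column indices 0, 2, 4, ...), and both loops plus the scan are the parts that would run in parallel on a GPU. All names are illustrative.

#include <cstddef>
#include <numeric>
#include <vector>

// Compact one sparse row: keep only the flagged entries, using an
// exclusive scan of the 0/1 flags to compute each kept entry's
// target position, exactly as in Algorithm 5.
void compact_row(const std::vector<int> &col_idx,
                 const std::vector<double> &val,
                 std::vector<int> &col_idx_odd,
                 std::vector<double> &val_odd)
{
    std::size_t n = col_idx.size();
    std::vector<int> aux(n), aux_ex(n);
    for (std::size_t i = 0; i < n; ++i)      // label entries in odd columns
        aux[i] = (col_idx[i] % 2 == 0) ? 1 : 0;
    std::exclusive_scan(aux.begin(), aux.end(), aux_ex.begin(), 0);

    std::size_t kept = (n == 0) ? 0 : std::size_t(aux_ex[n - 1] + aux[n - 1]);
    col_idx_odd.resize(kept);
    val_odd.resize(kept);
    for (std::size_t i = 0; i < n; ++i)      // scatter to target positions
        if (aux[i] == 1) {
            col_idx_odd[aux_ex[i]] = col_idx[i];
            val_odd[aux_ex[i]]     = val[i];
        }
}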

4.4.2 All-Scan

A scan operation is called an all-scan if it runs on the whole input array, as opposed to part of it. Algorithm 6 shows a serial out-of-place inclusive scan. Similar to the serial reduction described above, the all-scan iterates through all entries of the array and lets each entry of the output array store the sum of all input entries except the ones strictly to its right. Sometimes, the scan is required to be in-place for less memory allocation. Algorithm 7 gives an in-place version of the serial inclusive all-scan. Similarly, serial out-of-place and in-place implementations of the exclusive all-scan are given by Algorithms 8 and 9, respectively.

Algorithm 6 Serial inclusive all-scan (out-of-place). 1: function INCLUSIVE ALL-SCAN SERIAL(*in, len, *out) 2: sum ← in[0] 3: out[0] ← sum 4: for i = 1 to len − 1 do 5: sum ← sum+in[i] 6: out[i] ← sum 7: end for 8: end function

Algorithm 7 Serial inclusive all-scan (in-place). 1: function INCLUSIVE ALL-SCAN SERIAL(*in, len) 2: for i = 1 to len − 1 do 3: in[i] ← in[i] + in[i − 1] 4: end for 5: end function

Algorithm 8 Serial exclusive all-scan (out-of-place). 1: function EXCLUSIVE ALL-SCAN SERIAL(*in, len, *out) 2: sum ← 0 3: out[0] ← sum 4: for i = 1 to len − 1 do 5: sum ← sum+in[i − 1] 6: out[i] ← sum 7: end for 8: end function

Algorithm 9 Serial exclusive all-scan (in-place). 1: function EXCLUSIVE ALL-SCAN SERIAL(*in, len) 2: oldval ← in[0] 3: in[0] ← 0 4: for i = 1 to len − 1 do 5: newval ← in[i] 6: in[i] ← oldval + in[i − 1] 7: oldval ← newval 8: end for 9: end function

A parallel scan algorithm has been given by Blelloch [25, 26, 27]. The method divides the all-scan into two phases, up-sweep and down-sweep, each organized as a balanced tree as described above. Figure 4.2 gives an example conducting an all-scan on an array of size 8. We can see that the up-sweep phase basically executes a parallel reduction operation, and the down-sweep phase transmits zeros and adds previous partial sums to subsequent entries. Algorithm 10 shows pseudocode of this operation. Its runtime complexity is O(n), but it can be completed in O(log2 n) time with enough parallel resources.
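The following is a minimal serial C++ sketch of this two-phase (Blelloch) exclusive all-scan, written so that each inner loop over k corresponds to the parallel step of Algorithm 10; it assumes the array length is a power of two and uses illustrative names.

#include <cstddef>
#include <vector>

// Work-efficient exclusive all-scan: up-sweep (reduction) followed by
// a down-sweep, both organized as a balanced binary tree. Each inner
// loop touches disjoint index pairs and is therefore parallelizable.
void exclusive_all_scan(std::vector<int> &in)
{
    std::size_t len = in.size();
    if (len == 0) return;
    // up-sweep
    for (std::size_t stride = 1; stride < len; stride *= 2)
        for (std::size_t k = 0; k + 2 * stride - 1 < len; k += 2 * stride)
            in[k + 2 * stride - 1] += in[k + stride - 1];
    in[len - 1] = 0;
    // down-sweep
    for (std::size_t stride = len / 2; stride >= 1; stride /= 2)
        for (std::size_t k = 0; k + 2 * stride - 1 < len; k += 2 * stride) {
            int t = in[k + stride - 1];
            in[k + stride - 1] = in[k + 2 * stride - 1];
            in[k + 2 * stride - 1] += t;
        }
}

For the input [3, 6, 7, 8, 9, 12, 0, 0] (padded to a power of two), the routine produces [0, 3, 9, 16, 24, 33, 45, 45], matching the exclusive scan defined earlier.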

Figure 4.2: An example of the parallel scan.

Recently, the in-place inclusive and exclusive all-scan operations have already been defined by a set of Workgroup Functions of OpenCL 2.0 [136]. Using vendors’ implementation may bring high efficiency and programming productivity.

4.4.3 Segmented Scan

Segmented scan is a more general case of the all-scan operation. It divides an array into multiple segments and performs an all-scan operation for each segment. So an all-scan can be seen as a segmented scan with only one segment. To label starting and ending points, a segment has its first entry flagged as TRUE and the other entries flagged as FALSE. Other methods, such as labeling the last entry as FALSE and all the others as TRUE [188], are also usable. A parallel implementation of in-place exclusive segmented scan is shown in Algorithm 11. This method is very similar to the above algorithm for the all-scan operation, except flags are taken into account. In each step, the flags are used for deciding whether an addition is required.

Algorithm 10 Parallel exclusive all-scan (in-place). 1: function EXCLUSIVE ALL-SCAN PARALLEL(*in, len) 2: //up-sweep 3: for d = 0 to log2 len − 1 do 4: for k = 0 to len − 1 by 2^(d+1) in parallel do 5: in[k + 2^(d+1) − 1] ← in[k + 2^d − 1] + in[k + 2^(d+1) − 1] 6: end for 7: end for 8: in[len − 1] ← 0 9: //down-sweep 10: for d = log2 len − 1 down to 0 do 11: for k = 0 to len − 1 by 2^(d+1) in parallel do 12: t ← in[k + 2^d − 1] 13: in[k + 2^d − 1] ← in[k + 2^(d+1) − 1] 14: in[k + 2^(d+1) − 1] ← t + in[k + 2^(d+1) − 1] 15: end for 16: end for 17: end function

Also, in the up-sweep phase the target position is flagged as TRUE if any one of the two inputs of the addition is flagged, while in the down-sweep phase an entry's flag is modified to FALSE once the entry has been involved in a step.

4.4.4 Segmented Sum Using Inclusive Scan

Segmented sum can be seen as a simplified alternative of segmented scan. It performs a reduction-sum operation for the entries in each segment of an array and collects the sums. Algorithm 12 lists a serial segmented sum algorithm. A parallel segmented sum algorithm can be implemented by segmented scan methods [28, 40, 161, 63]. However, as shown in Algorithm 11, using segmented scan for segmented sum may be complex and not efficient enough. Thus we prepare seg offset as an auxiliary array to facilitate implementation of segmented sum by way of an all-scan, which can be much more efficient than segmented scan. Algorithm 13 shows the fast segmented sum using seg offset and an inclusive all-scan. Figure 4.3 gives an example. The principle of this operation is that the all-scan is essentially an increment operation. Once a segment knows the distance (i.e., offset) between its head and its tail, its partial sum can be deduced from its all-scan results. Therefore, the more complex parallel segmented sum operation in [28, 40, 161, 63] can be replaced by a faster all-scan operation (line 4) and a few arithmetic operations (lines 5–7). Also note that the seg offset array can be an alternative to the flag array.

Algorithm 11 Parallel exclusive segmented scan (in-place). 1: function EXCLUSIVE SEGMENTED SCAN(*in, len, *flag original) 2: MALLOC(*flag, len) 3: MEMCPY(*flag, *flag original, len) 4: for d = 0 to log2 len − 1 do 5: for k = 0 to len − 1 by 2^(d+1) in parallel do 6: if flag[k + 2^(d+1) − 1] = FALSE then 7: in[k + 2^(d+1) − 1] ← in[k + 2^d − 1] + in[k + 2^(d+1) − 1] 8: end if 9: flag[k + 2^(d+1) − 1] ← flag[k + 2^d − 1] ∨ flag[k + 2^(d+1) − 1] 10: end for 11: end for 12: in[len − 1] ← 0 13: flag[len − 1] ← FALSE 14: for d = log2 len − 1 down to 0 do 15: for k = 0 to len − 1 by 2^(d+1) in parallel do 16: t ← in[k + 2^d − 1] 17: in[k + 2^d − 1] ← in[k + 2^(d+1) − 1] 18: if flag original[k + 2^d] = TRUE then 19: in[k + 2^(d+1) − 1] ← 0 20: else if flag[k + 2^d − 1] = TRUE then 21: in[k + 2^(d+1) − 1] ← t 22: else 23: in[k + 2^(d+1) − 1] ← t + in[k + 2^(d+1) − 1] 24: end if 25: flag[k + 2^d − 1] ← FALSE 26: end for 27: end for 28: end function

Figure 4.3: An example of the fast segmented sum.

Algorithm 12 Serial segmented sum operation. 1: function SEGMENTED SUM(*in, len, *flag) 2: for i = 0 to len − 1 do 3: if flag[i] = TRUE then 4: j ← i + 1 5: while j < len && flag[j] = FALSE do 6: in[i] ← in[i] + in[j] 7: j ← j + 1 8: end while 9: else 10: in[i] ← 0 11: end if 12: end for 13: end function

Algorithm 13 Fast segmented sum using seg offset. 1: function FAST SEGMENTED SUM(*in, len, *seg offset) 2: MALLOC(*tmp, len) 3: MEMCPY(*tmp, *in, len) 4: INCLUSIVE ALL-SCAN PARALLEL(*in, len) 5: for i = 0 to len − 1 in parallel do 6: in[i] ← in[i+seg offset[i]] − in[i] + tmp[i] 7: end for 8: FREE(*tmp) 9: end function
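A minimal C++ sketch of this fast segmented sum is given below. It assumes that seg offset[i] holds, for the first entry of each segment, the distance to the last entry of that segment, and 0 for all other entries, so only the head positions of the result are meaningful; names are illustrative. It is written serially; a parallel version of the last loop must read the scan results before they are overwritten, e.g. by writing into a separate output buffer.

#include <cstddef>
#include <numeric>
#include <vector>

// Fast segmented sum following Algorithm 13: copy the input, run one
// inclusive all-scan, then recover each segment's sum from the scan
// value at its tail, the scan value at its head, and the original
// head value.
void fast_segmented_sum(std::vector<int> &in,
                        const std::vector<std::size_t> &seg_offset)
{
    std::vector<int> tmp(in);                           // original values
    std::inclusive_scan(in.begin(), in.end(), in.begin());
    for (std::size_t i = 0; i < in.size(); ++i)
        in[i] = in[i + seg_offset[i]] - in[i] + tmp[i];
}

For example, with in = [2, 1, 3, 4] split into segments [2, 1, 3] and [4], seg offset = [2, 0, 0, 0] yields in[0] = 6 and in[3] = 4 at the segment heads.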

4.5 Sorting and Merging

Sorting is a fundamental algorithm in computer science and finds broad use in Sparse BLAS. Recall the first example of this thesis and Figure 2.3: searching for the 5th small ball takes longer if the index array is not sorted. If the array is sorted, a fast binary search can be utilized to obtain the position of index 4 in O(log n) as opposed to O(n) time. In the past decades, many sorting algorithms have been proposed. This thesis mainly focuses on parallel-friendly sorting methods such as sorting network approaches and the recently developed merge path method.

4.5.1 An Example Case: Permutation

Permutation can be used for rearranging a matrix into an expected order of rows/columns [69]. We sometimes want to exploit the number of nonzero entries in rows,

and to permute a sparse matrix

A =
0: [ 0  c  f  j  0  0 ]
1: [ a  0  g  k  n  o ]
2: [ b  0  0  0  m  0 ]
3: [ 0  d  h  0  0  0 ]
4: [ 0  0  0  0  0  p ]
5: [ 0  e  i  l  n  q ]

to another form 1 a 0 g k n o 5 0 e i l n q   0 0 0 c f j 0 0 A =   , 2 b 0 0 0 m 0   3 0 d h 0 0 0 4 0 0 0 0 0 p in which the longer rows appear before shorter ones, where the column vector to the left of the matrix contains indices of the rows of the matrix. This permutation operation can be implemented by sorting an array of key-value pairs: the key is the number of nonzero entries of each row, and the value is the index of each row of the matrix. In this example, the array of key-value pair is

a⟨key, value⟩ = [⟨3, 0⟩, ⟨5, 1⟩, ⟨2, 2⟩, ⟨2, 3⟩, ⟨1, 4⟩, ⟨5, 5⟩].

After the array is sorted by its key in a descending order, it becomes

a⟨key, value⟩ = [⟨5, 1⟩, ⟨5, 5⟩, ⟨3, 0⟩, ⟨2, 2⟩, ⟨2, 3⟩, ⟨1, 4⟩].

Then the value part of the key-value array is actually the new order of rows of the matrix.

4.5.2 Bitonic Sort

Bitonic sort is a class of sorting network algorithms [17], which connect entries of the array to be sorted by comparators. The algorithm can be divided into multiple stages. Each stage can have multiple steps, and each step changes the connecting paths and may reorder the entries of the array. Because the connecting wires are independent of each other, all comparators within one step can execute in parallel. Figure 4.4 gives an example of sorting an array of size 8. We can see that there are 4 pairs of sequences in the first stage. After that, the array contains 4 ordered pairs. After the second stage, 2 ordered sequences are generated. Finally, only one ordered sequence is left. Algorithm 14 gives the pseudocode of the bitonic sort. This method requires log2 n stages for an input array of size n, and the ith stage has i steps. So the number of operations of the bitonic sort is O(n(log2^2(n) + log2(n))). It can be completed in O(log2^2(n)) time with enough parallel resources. The bitonic sort can also be efficiently implemented by using SIMD intrinsics [89].

Figure 4.4: An example of bitonic sort. The arrows indicate ascending order operated by comparators.

Algorithm 14 Parallel bitonic sort (in-place). 1: function BITONIC SORT PARALLEL(*in, len) 2: for k = 1 to log2 len do 3: for j = 2^(k−1) down to 1 do 4: for i = 0 to len − 1 in parallel do 5: ixj ← XOR(i, j) 6: if ixj > i then 7: if (i ∧ 2^k) = 0 then 8: if in[i] > in[ixj] then 9: SWAP(in[i], in[ixj]) 10: end if 11: else 12: if in[i] < in[ixj] then 13: SWAP(in[i], in[ixj]) 14: end if 15: end if 16: end if 17: end for 18: j ← j/2 19: end for 20: end for 21: end function
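A minimal C++ sketch of the same network is shown below; here k holds the block size 2^stage instead of the stage index, len is assumed to be a power of two, and the innermost loop is the part that can run in parallel because its compare-and-swap operations touch disjoint pairs. Names are illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

// Bitonic sort following the structure of Algorithm 14.
void bitonic_sort(std::vector<int> &in)
{
    std::size_t len = in.size();
    for (std::size_t k = 2; k <= len; k *= 2)        // block size per stage
        for (std::size_t j = k / 2; j >= 1; j /= 2)  // comparator distance
            for (std::size_t i = 0; i < len; ++i) {  // parallelizable step
                std::size_t ixj = i ^ j;
                if (ixj > i) {
                    bool ascending = ((i & k) == 0);
                    if (ascending ? (in[i] > in[ixj]) : (in[i] < in[ixj]))
                        std::swap(in[i], in[ixj]);
                }
            }
}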

4.5.3 Odd-Even Sort

Another sorting network method is odd-even sort, which is very similar to the above bitonic sort. Its idea is to sort all odd and all even keys separately and then merge them. Figure 4.5 gives an example of sorting an array of size 8. We can see that odd-even sort has log2 n stages as well, and the ith stage has i steps. Algorithm 15 gives pseudocode of odd-even sort, which has the same complexity as the bitonic sort.

Figure 4.5: An example of odd-even sort. The arrows indicate ascending order operated by comparators.

4.5.4 Ranking Merge

Merge algorithms give better sorting performance if the input array is composed of multiple already ordered subarrays. Algorithm 16 gives pseudocode of a serial merge that sorts two ordered input arrays and saves the result in an output array. This approach maintains 3 pointers for the two input arrays and the output array. Because the two inputs are sorted, the pointers can trivially iterate through all their entries, compare their values, and select one (e.g., the one less-than-or-equal-to the other one) to save to the output array. To merge two ordered lists, a serial merge algorithm requires n comparisons, where n is the size of the resulting array. Merge methods are out-of-place, meaning they need an extra array of size n to save the resulting array. When multiple compute units are available, the merge algorithm can be parallelized easily by using binary search as a ranking tool; thus this method is called ranking merge [157]. The basic idea is that each entry of the two input arrays wants to know its position in the output array, thus it computes a rank (i.e., the number of entries less-than-or-equal-to it in the other input array) and adds the rank to its local index to obtain the final position in the output array. Algorithm 17 gives pseudocode of this method. Because the binary search of each entry is independent of all the others, the operation can run in parallel very well. The overall cost of ranking merge is O(n1 log2 n2 + n2 log2 n1), where n1 and n2 are the sizes of the two input arrays, respectively. Similar to its serial version, ranking merge is required to be out-of-place.

Algorithm 15 Parallel odd-even sort (in-place). 1: function ODD-EVEN SORT PARALLEL(*in, len) 2: for k = 1 to log2 len do 3: for j = 2^(k−1) down to 1 do 4: for i = 0 to len/2 − 1 in parallel do 5: offset ← i ∧ (2^(k−1) − 1) 6: ixj ← 2 × i − (i ∧ (j − 1)) 7: if j = 2^(k−1) then 8: if in[ixj] > in[ixj + j] then 9: SWAP(in[ixj], in[ixj + j]) 10: end if 11: else 12: if offset ≥ j then 13: if in[ixj − j] > in[ixj] then 14: SWAP(in[ixj − j], in[ixj]) 15: end if 16: end if 17: end if 18: end for 19: j ← j/2 20: end for 21: end for 22: end function

Algorithm 16 Serial merge (out-of-place). 1: function MERGE SERIAL(*in1, len1, *in2, len2, *out) 2: ai ← 0 3: bi ← 0 4: for ci = 0 to len1 + len2 − 1 do 5: if ai < len1 && (bi ≥ len2 || in1[ai] ≤ in2[bi]) then 6: out[ci] ← in1[ai] 7: ai ← ai + 1 8: else 9: out[ci] ← in2[bi] 10: bi ← bi + 1 11: end if 12: end for 13: end function

Algorithm 17 Parallel ranking merge (out-of-place). 1: function RANKING MERGE PARALLEL(*in1, len1, *in2, len2, *out) 2: for i = 0 to len1 − 1 in parallel do 3: idx ← BINARY SEARCH(*in2, in1[i]) 4: out[idx + i] ← in1[i] 5: end for 6: for i = 0 to len2 − 1 in parallel do 7: idx ← BINARY SEARCH(*in1, in2[i]) 8: out[idx + i] ← in2[i] 9: end for 10: end function
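A minimal C++ sketch of ranking merge is shown below; to keep the target positions unique when the two inputs contain equal keys, the two binary searches are deliberately asymmetric (lower bound for one input, upper bound for the other), a detail that Algorithm 17 leaves implicit. Names are illustrative, and both loops are the parallelizable parts.

#include <algorithm>
#include <cstddef>
#include <vector>

// Ranking merge: every entry computes its rank in the other sorted
// input with a binary search and writes itself directly to its final
// position in the output.
void ranking_merge(const std::vector<int> &in1,
                   const std::vector<int> &in2,
                   std::vector<int> &out)
{
    out.resize(in1.size() + in2.size());
    for (std::size_t i = 0; i < in1.size(); ++i) {
        std::size_t rank =
            std::lower_bound(in2.begin(), in2.end(), in1[i]) - in2.begin();
        out[i + rank] = in1[i];
    }
    for (std::size_t j = 0; j < in2.size(); ++j) {
        std::size_t rank =
            std::upper_bound(in1.begin(), in1.end(), in2[j]) - in1.begin();
        out[j + rank] = in2[j];
    }
}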

Sorting network algorithms such as bitonic sort and odd-even sort can also be used for merging. In their last stage (e.g., stage 3 in Figures 4.4 and 4.5), the input is already composed of two ordered subarrays of the same size. Then the merge methods can further sort the two subarrays.

4.5.5 Merge Path

The merge path algorithm [81, 142] is a recently designed merge algorithm for load balanced parallel processing. It conceptually constructs a rectangular grid of size n1 by n2, where n1 and n2 are the sizes of the two input arrays, respectively. Then multiple conceptual 45-degree lines (anti-diagonals) are generated to partition the grid. Those lines are equally spaced, meaning that any single intersection of one line with the grid goes down to an intersection of a neighboring line with the grid through a path of the same length. Figure 4.6 shows an example of the merge path approach. We can see that a grid is established to merge two arrays of sizes 4 and 8, respectively. Then five equally spaced 45-degree anti-diagonals (i.e., lines l0 to l4) are generated to intersect with the grid. The paths (colored in red, orange, green and blue) between two intersections from neighboring lines have the same length of 3. The reason for constructing the grid and the 45-degree anti-diagonals is to evenly partition the two input arrays and let each compute unit have the same amount of work. Figure 4.6 supposes that there are 4 compute units in the system, and thus divides the grid into 4 regions by 5 lines. Then each compute unit deals with 3 entries, so the 12 entries in total are evenly distributed to the 4 compute units. The remaining problem is to find which 3 entries are assigned to a compute unit. To find those entries, the positions of the three empty circles (i.e., the three connecting points of the red path and the orange path, the orange path and the green path, and the green path and the blue path) are required to be searched by comparing entries around any possible intersections. For example, the intersection connecting the red path and the orange path is selected because the entry of value 3 on the x axis is located between the entries of values 2 and 4 on the y axis. Because the two input arrays are sorted, binary search can be utilized to find those intersections. Once the intersections are known, each compute unit completes a serial merge on two ordered subarrays. For example, the 1st compute unit (colored in red) merges a

Figure 4.6: An example of merge path.

subarray including the entry of value 0 and another subarray including the two entries of values 1 and 2, respectively. Then the three ordered entries are stored to the first 3 positions of the resulting array. Algorithm 18 gives pseudocode of merge path. Lines 7–25 conduct binary searches on the 45-degree anti-diagonals to obtain the positions of the intersections. Lines 26–33 complete the serial merges between those intersections. A more detailed description and complexity analysis of the GPU merge path algorithm can be found in [81].
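The heart of merge path is the binary search along one anti-diagonal; the following C++ sketch shows that search in isolation. It returns, for a diagonal index diag (the number of output entries that lie before the 45-degree line), how many of those entries are taken from the first input; the caller then merges each slice serially, as lines 26–33 of Algorithm 18 do. This is a hedged sketch with illustrative names, not the thesis implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

// Find the crossing point of the merge path with anti-diagonal "diag":
// the returned value is the number of entries of a that precede the
// diagonal; diag minus that value entries come from b.
std::size_t merge_path_search(const std::vector<int> &a,
                              const std::vector<int> &b,
                              std::size_t diag)
{
    std::size_t lo = diag > b.size() ? diag - b.size() : 0;
    std::size_t hi = std::min(diag, a.size());
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < b[diag - mid - 1])
            lo = mid + 1;   // the crossing lies further down array a
        else
            hi = mid;
    }
    return lo;
}

With nt compute units, unit t would call merge_path_search with diag = t × (n1 + n2)/nt and serially merge the slice between its own crossing point and that of unit t + 1.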

4.5.6 A Comparison

Since merging is an important operation in sparse matrix computations (such as the addition of two sparse vectors described in Chapter 7), the most efficient algorithm needs to be selected. Considering that the on-chip scratchpad memory of GPUs is controlled by the user and may offer performance gains for short inputs and outputs, we consider some recently implemented merge algorithms [157, 81, 99, 146, 145, 91, 55] for GPUs. First, because the main objective of the research in [145, 91, 55] is efficiently merging large data in the global memory, they still use basic methods, such as bitonic sort and ranking-based merge, as building blocks for small data in the scratchpad memory. Peters et al. [146] proposed a locality-oriented advanced bitonic sort method that can reduce synchronization overhead by merging data in fast private memory instead of relatively slow shared memory. Therefore we experimentally evaluate 5 GPU merge algorithms: (1) ranking merge [157], (2) merge path [81], (3) basic oddeven merge [99], (4) basic bitonic merge [99], and (5) advanced bitonic merge [146]. The implementation of algorithm (2) is extracted from the Modern GPU library [18]. The implementations of algorithms (3) and (4) are extracted from the nVidia CUDA SDK. We implement algorithms (1) and (5). Additionally, another reason why we conduct the evaluation is that none of the above literature presented performance of merging short sequences of size less than 2^12, which is the most important length for the relatively short rows of our benchmark suite. Our evaluation results of merging 32-bit keys, 32-bit key-32-bit value pairs and 32-bit key-64-bit value pairs are shown in Figure 4.7.

Algorithm 18 Parallel merge path (out-of-place). 1: function MERGE PATH PARALLEL(*in1, len1, *in2, len2, *out, nt) 2: MALLOC(*border1, nt + 1) 3: MALLOC(*border2, nt + 1) 4: delta ← ⌈(len1 + len2)/nt⌉ 5: border1[nt] ← len1 6: border2[nt] ← len2 7: for i = 0 to nt in parallel do 8: offset ← delta ∗ i 9: start ← offset > len2 ? offset − len2 : 0 10: stop ← offset > len1 ? len1 : offset 11: while stop ≥ start do 12: median ← (stop + start)/2 13: if in1[median] > in2[offset − median − 1] then 14: if in1[median − 1] < in2[offset − median − 1] then 15: break 16: else 17: stop ← median − 1 18: end if 19: else 20: start ← median + 1 21: end if 22: end while 23: border1[i] ← median 24: border2[i] ← offset − median 25: end for 26: for i = 0 to nt − 1 in parallel do 27: offset ← delta ∗ i 28: start1 ← border1[i] 29: l1 ← border1[i + 1] − border1[i] 30: start2 ← border2[i] 31: l2 ← border2[i + 1] − border2[i] 32: MERGE SERIAL(&in1[start1], l1, &in2[start2], l2, &out[offset]) 33: end for 34: end function

The experimental platforms are

nVidia GeForce GTX Titan Black, nVidia GeForce GTX 980 and AMD Radeon R9 290X (see the detailed specifications of the GPUs listed in Table B.2 of Appendix B). Each of the five algorithms merges two short ordered sequences of size l into one ordered output sequence of size 2l. The sorting network methods in our evaluation only execute the last stage, since both inputs are sorted. To saturate the throughput of the GPUs, the whole problem size is set to 2^25. For example, 2^14 thread groups are launched while each of them merges two sub-sequences of size l = 2^10. We execute each problem set through multiple thread groups of different sizes and record the best performance for the evaluation. In Figure 4.7, we can see that the GPU merge path algorithm almost always outperforms the other methods while the sub-sequence size is no less than 2^8. The extra advantages of the merge path method are that it can evenly assign workload to threads and can easily deal with input sequences of arbitrary sizes. We can see that the ranking merge is faster than the merge path method in Figure 4.7(f). But this algorithm requires more scratchpad memory and thus cannot scale to longer sequences. The basic bitonic merge and the basic oddeven merge in general do not show better performance and cannot simply deal with data of arbitrary sizes. The advanced bitonic sort method is always the slowest because it loads data from the scratchpad memory to thread private memory (register file or an off-chip memory space) for data locality. However, due to the small or even negative latency gap between the scratchpad memory and the thread private memory, the load operations actually reduce the overall performance. Thus this method should only be used for migrating global memory accesses to scratchpad memory accesses. We can also see that the AMD Radeon R9 290X GPU is almost always much faster than the two nVidia GPUs in all tests. The reason is that the capacity of the scratchpad memory (2816 kB, 64 kB/core × 44 cores, in the AMD GPU; 1536 kB, 96 kB/core × 16 cores, in the nVidia Maxwell-based GTX 980 GPU; and 720 kB, 48 kB/core × 15 cores, in the nVidia Kepler-based GTX Titan Black GPU) heavily influences the performance of merging small sequences. For the same reason, the GTX 980 GPU delivers better overall performance than the GTX Titan Black GPU. On the other hand, even though the AMD GPU has 64 kB scratchpad memory per core, each instance of the kernel program can only use up to 32 kB. Thus the AMD GPU cannot scale to longer sub-sequences (e.g., 2^12 with 32-bit key-32-bit value pairs) that can be executed by using the nVidia GPUs.

Figure 4.7: Performance comparison of merging 32-bit keys, 32-bit key-32-bit value pairs and 32-bit key-64-bit value pairs through 5 GPU merge algorithms: ranking merge, merge path, basic oddeven merge, basic bitonic merge and advanced bitonic merge on three different GPUs: nVidia GeForce GTX Titan Black, nVidia GeForce GTX 980 and AMD Radeon R9 290X.

4.6 Ad-Heap and k-Selection

Heap (or priority queue) data structures are heavily used in many algorithms such as k-nearest neighbor (kNN) search, finding the minimum spanning tree and the shortest path problems. Compared to the most basic binary heap, d-heaps [93, 182], in particular the implicit d-heaps proposed by LaMarca and Ladner [109], have better practical performance on modern processors. However, as throughput-oriented processors (e.g., GPUs) bring higher and higher peak performance and bandwidth, heap data structures have not reaped benefit from this trend because their very limited degree of data parallelism cannot saturate wide SIMD units.

In this section, we propose a new heap data structure called ad-heap (asymmetric d-heap) for heterogeneous processors. The ad-heap introduces an implicit bridge structure, a new component that records deferred random memory transactions and makes the two types of cores in a tightly coupled CPU-GPU heterogeneous processor focus on their most efficient memory behaviors. Thus overall bandwidth utilization and instruction throughput can be significantly improved.

We evaluate the performance of the ad-heap by using a batch k-selection algorithm on

two simulated heterogeneous processors, composed of a real AMD CPU and GPU and of a real Intel CPU and nVidia GPU, respectively. The experimental results show that, compared with the optimal scheduling method that executes the fastest d-heaps on the standalone CPUs and GPUs in parallel, the ad-heap achieves up to 1.5x and 3.6x speedup on the two platforms, respectively.

4.6.1 Implicit d-heaps

Given a heap of size n, where n ≠ 0, a d-heap data structure [93, 182] lets each parent node have d child nodes, where normally d > 2. To satisfy cache-line alignment and reduce the cache miss rate, the whole heap can be stored in an implicit space of size n + d − 1, where the extra d − 1 entries are padded in front of the root node and kept empty [109]. Here we call the padded space the "head" of the heap. Figure 4.8 shows an example of an implicit max-d-heap with n = 12 and d = 4. Note that each group of child nodes starts from an aligned cache block.

Figure 4.8: The layout of a 4-heap of size 12.

Because of the padded head, each node has to add an offset = d − 1 to its index in the implicit array. Given a node of index i, its array index becomes i + offset. Its parent node's (if i ≠ 0) array index is ⌊(i − 1)/d⌋ + offset. Its first child node, if any, is located at array index di + 1 + offset and its last child node at array index di + d + offset. Given an established non-empty max-d-heap, we can execute three typical heap operations:
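The index arithmetic above translates into a few one-line helpers; the following C++ sketch is purely illustrative (the struct name and fields are not from the thesis) and returns node indices, to which pos() adds the padding offset.

#include <cstddef>

// Index arithmetic of an implicit d-heap stored with a padded head of
// d - 1 empty entries, as described above.
struct dheap_index {
    std::size_t d;                                                   // fan-out
    std::size_t offset() const { return d - 1; }                     // head size
    std::size_t pos(std::size_t i) const { return i + offset(); }    // array slot
    std::size_t parent(std::size_t i) const { return (i - 1) / d; }  // requires i != 0
    std::size_t first_child(std::size_t i) const { return d * i + 1; }
    std::size_t last_child(std::size_t i) const { return d * i + d; }
};

For example, with d = 4 the children of node 3 are nodes 13 through 16, stored at array positions 16 through 19.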

• insert operation adds a new node at the end of the heap, increases the heap size to n + 1, and takes O(logd n) worst-case time to reconstruct the heap property,

• delete-max operation copies the last node to the position of the root node, decreases the heap size to n − 1, and takes O(d logd n) worst-case time to reconstruct the heap property, and

• update-key operation updates a node, keeps the heap size unchanged, and takes O(d logd n) worst-case time (if the root node is updated) to reconstruct the heap property.

The above heap operations depend on two more basic operations: CHAPTER 4. PARALLELISM IN DATA STRUCTURES AND ALGORITHMS 53

• find-maxchild operation takes O(d) time to find the maximum child node for a given parent node, and

• compare-and-swap operation takes constant time to compare values of a child node and its parent node, then swap their values if the child node is larger.

4.6.2 ad-heap design

Performance Considerations

We first analyze the degree of parallelism of the d-heap operations. We can see that the insert operation does not have any data parallelism because the heap property is reconstructed in a bottom-up order and only the unparallelizable compare-and-swap operations are required. On the other hand, the delete-max operation reconstructs the heap property in a top-down order that does not have any data parallelism either, but it executes multiple (logd n in the worst case) lower-level parallelizable find-maxchild operations. For the update-key operation, the position and the new value of the key decide whether the bottom-up or the top-down order is executed in the heap property reconstruction. Therefore, in this section we mainly consider accelerating the heap property reconstruction in the top-down order. After all, the insert operation can be efficiently executed serially because the heap should be very shallow if d is large. Without loss of generality, we focus on an update-key operation that updates the root node of a non-empty max-d-heap. To reconstruct the heap property in the top-down order, the update-key operation alternately executes find-maxchild operations and compare-and-swap operations until the heap property is satisfied or the last changed parent node does not have any child node. Note that the swap operation can be simplified because the child node does not need to be updated in the procedure; its value can actually be kept in a thread register and be reused until the final round. Algorithms 19 and 20 show pseudocode of the update-key operation and the find-maxchild operation, respectively. If the whole operation is executed on a wide SIMD processor (e.g., a GPU), the find-maxchild operation can be efficiently accelerated by the SIMD units through a streaming reduction scheme in O(log d) time instead of the original O(d) time. And because of wider memory controllers, one group of w continuous SIMD threads (a warp in the nVidia GPUs or a wavefront in the AMD GPUs) can load w aligned continuous entries from the off-chip memory to the on-chip scratchpad memory by one off-chip memory transaction (coalesced memory access). Thus to load d child nodes from the off-chip memory, only d/w memory transactions are required. A similar idea has been implemented on CPU vector units. Furtak et al. [75] accelerated d-heap find-maxchild operations by utilizing x86 SSE instructions. The results showed 15%-31% execution time reduction, on average, in a mixed benchmark composed of delete-max and insert operations with d = 8 or 16. However, the vector units in the CPU cannot supply as much SIMD processing capability as those in the GPU. Further, according to previous research [6], moving vector operations from the CPU to the integrated GPU can obtain both performance improvement and energy efficiency. Therefore, in this section we focus on utilizing GPU-style vector processing rather than SSE/AVX instructions.

Algorithm 19 Update the root node of a non-empty max-d-heap. 1: function UPDATE-KEY(∗heap, d, n, newv) 2: offset ← d − 1 . offset of the implicit storage 3: i ← 0 . the root node index 4: v ← newv . the root node value 5: while di + 1 < n do . if the first child exists 6: ⟨maxi, maxv⟩ ← FIND-MAXCHILD(∗heap, d, n, i) 7: if maxv > v then . compare 8: heap[i + offset] ← maxv . swap 9: i ← maxi 10: else . the heap property is satisfied 11: break 12: end if 13: end while 14: heap[i + offset] ← v 15: return 16: end function

Algorithm 20 Find the maximum child node of a given parent node. 1: function FIND-MAXCHILD(∗heap, d, n, i) 2: offset ← d − 1 3: starti ← di + 1 . the first child index 4: stopi ← MIN(n − 1, di + d) . the last child index 5: maxi ← starti 6: maxv ← heap[maxi + offset] 7: for j = starti + 1 to stopi do 8: if heap[j + offset] > maxv then 9: maxi ← j 10: maxv ← heap[j + offset] 11: end if 12: end for 13: return ⟨maxi, maxv⟩ 14: end function

However, other operations, in particular the compare-and-swap operations, cannot obtain benefit from the SIMD units because they only need one single thread, which is far from saturating the SIMD units. The off-chip memory bandwidth is also wasted because one expensive off-chip memory transaction only stores one entry (lines 8 and 14 in Algorithm 19). Further, the remaining threads are waiting for the single thread's time-consuming off-chip transaction to finish. Even though the single thread store has a chance to trigger a cache write hit, the very limited cache in throughput-oriented processors can easily be polluted by the massively concurrent threads. Thus the single thread task should always be avoided. Therefore, to maximize the performance of the d-heap operations, we consider two design objectives: (1) maximizing throughput of the large amount of SIMD units for faster find-maxchild operations, and (2) minimizing the negative impact from the single-thread compare-and-swap operations.

ad-heap Data Structure

Because the GPUs are designed for wide SIMD operations and the CPUs are good at high performance single-thread tasks, the heterogeneous processors have a chance to become ideal platforms for operations with different characteristics of parallelism. This thesis proposes ad-heap (asymmetric d-heap), a new heap data structure that can obtain performance benefits from both of the two types of cores in the heterogeneous processors. Compared to the d-heaps, the ad-heap data structure introduces an important new component: an implicit bridge structure. The bridge structure is located in the originally empty head part of the implicit d-heap. It consists of one node counter and one sequence of size 2h, where h is the height of the heap. The sequence stores the index-value pairs of the nodes to be updated in different levels of the heap, thus at most h nodes are required. If the space requirement of the bridge is larger than the original head part of size d − 1, the head part can be easily extended to md + d − 1 to guarantee that each group of the child nodes starts from an aligned cache block, where m is a natural number equal to ⌈2(h + 1)/d⌉ − 1. Figure 4.9 shows the layout of the ad-heap data structure.

Figure 4.9: The layout of the ad-heap data structure.

ad-heap Operations

The corresponding operations of the ad-heap data structure are redesigned as well. Again, for simplicity and without loss of generality, we only consider the update-key operation described in Algorithm 19. Before the update-key operation starts, the bridge is constructed in the on-chip scratchpad memory of a GPU and the node counter is initialized to zero. Then in each iteration (lines 6–12 of Algorithm 19), a group of lightweight SIMD threads in the GPU simultaneously execute the find-maxchild operation (i.e., in parallel load at most d child nodes to the scratchpad memory and run the streaming reduction scheme to find the index and the value of the maximum child node). After each find-maxchild and compare operation, if a swap operation is needed, one of the SIMD threads adds a new index-value pair (the index of the current parent node and the value of the maximum child node) to the bridge and updates the node counter. If the current level is not the last level, the new value of the child node can be stored in a register and be reused as the parent node of the next level. Otherwise, the single SIMD thread stores the new indices and values of both the parent node and the child node to the bridge. Because the on-chip scratchpad memory is normally two orders of magnitude faster than the off-chip memory, the cost of the single-thread operations is negligible. When all iterations are finished, at most 2h + 1 SIMD threads store the bridge from the on-chip scratchpad memory to the continuous off-chip memory by ⌈(2h + 1)/w⌉ off-chip memory transactions. The single program multiple data (SPMD) pseudocode is shown in Algorithm 21. Here we do not give parallel pseudocode of the find-maxchild operation, since it is very similar to the reduction operation described in Algorithm 4. After the bridge is dumped, a signal object is transferred to the GPU-CPU queue. Triggered by the synchronization signal from the queue, one of the CPU cores sequentially loads the entries from the bridge and stores them to the real heap space in linear time. Note that no data transfer, address mapping or explicit coherence maintenance is required due to the unified memory space with cache coherence. And because the entries in the bridge are located in continuous memory space, the CPU cache system can be efficiently utilized. When all entries are updated, the whole update-key operation is completed. The pseudocode of the CPU workload in the update-key operation is shown in Algorithm 22. Referring to the command queue in the OpenCL specification and the architected queueing language (AQL) in the HSA design, we list the pseudocode of the update-key operation in Algorithm 23. We can see that although the overall time complexity is not reduced, the two types of compute units focus more on the off-chip memory behaviors that they are good at. We can calculate that the GPU needs hd/w + (2h + 1)/w off-chip memory transactions instead of the h(d/w + 1) needed by the d-heap. For example, given a 7-level 32-heap and setting w to 32, the d-heap needs 14 off-chip memory transactions while the ad-heap only needs 8. Since the cost of the off-chip memory access dominates execution time, the practical GPU performance can be improved significantly.
Further, from the CPU perspective, all read transactions are from the bridge in continuous cache blocks and all write transactions only trigger non-time-critical cache write misses to random positions. Therefore the CPU workload performance can also be expected to be good.

ad-heap Simulator

From the perspective of the programming model, the synchronization mechanism among compute units is redefined. Recently, several CPU-GPU fast synchronization approaches [44, 108, 127] have been proposed. In this section, we implement the ad-heap operations through the synchronization mechanism designed by the HSA Foundation. According to the current HSA design [108], each compute unit executes its task and sends a signal object of size 64 bytes to a low-latency shared memory queue when it has completed the task. Thus with HSA, CPUs and GPUs can queue tasks to each other and to themselves. Further, the communications can be dispatched in the user mode of the operating system; thus the traditional "GPU kernel launch" method (through the operating system kernel services and the GPU drivers) is avoided and the CPU-GPU communication latency is significantly reduced.

Algorithm 21 The SPMD GPU workload in the update-key operation of the ad-heap. 1: function GPU-WORKLOAD(∗heap, d, n, h, newv) 2: tid ← GET-THREAD-LOCALID() 3: i ← 0 4: v ← newv 5: ∗bridge ← SCRATCHPAD-MALLOC(2h + 1) 6: if tid = 0 then 7: bridge[0] ← 0 . initialize the node counter 8: end if 9: while di + 1 < n do 10: ⟨maxi, maxv⟩ ← FIND-MAXCHILD(∗heap, d, n, i) 11: if maxv > v then 12: if tid = 0 then . insert an index-value pair 13: bridge[2 ∗ bridge[0] + 1] ← i 14: bridge[2 ∗ bridge[0] + 2] ← maxv 15: bridge[0] ← bridge[0] + 1 16: end if 17: i ← maxi 18: else 19: break 20: end if 21: end while 22: if tid = 0 then . insert the last index-value pair 23: bridge[2 ∗ bridge[0] + 1] ← i 24: bridge[2 ∗ bridge[0] + 2] ← v 25: bridge[0] ← bridge[0] + 1 26: end if 27: if tid < 2h + 1 then . dump the bridge to off-chip 28: heap[tid] ← bridge[tid] 29: end if 30: return 31: end function

Figure 4.10 shows an example of the shared memory queue. Because the HSA programming tools for the heterogeneous processor hardware described in this section are not currently available yet, we conduct experiments on simulated heterogeneous processor platforms composed of real standalone CPUs and GPUs. The ad-heap simulator has two stages: (1) Pre-execution stage. For a given input list and a size d, we first count the number of the update-key operations and the numbers of the subsequent find-maxchild and compare-and-swap operations by pre-executing the work through the d-heap on the CPU. We write Nu, Nf, Nc and Ns to denote the numbers of the update-key operations, find-maxchild operations, compare operations and swap operations, respectively.

Algorithm 22 The CPU workload in the update-key operation of the ad-heap. 1: function CPU-WORKLOAD(∗heap, d, n, h) 2: m ← ⌈2(h + 1)/d⌉ − 1 3: offset ← md + d − 1 4: ∗bridge ← ∗heap 5: for i = 0 to bridge[0] − 1 do 6: index ← bridge[2 ∗ i + 1] 7: value ← bridge[2 ∗ i + 2] 8: heap[index + offset] ← value 9: end for 10: return 11: end function

Algorithm 23 The control process of the update-key operation. 1: function UPDATE-KEY(∗heap, d, n, h, newv) 2: QCtoG ← CREATE-QUEUE() 3: QGtoC ← CREATE-QUEUE() 4: Gpkt ← GPU-WORKLOAD(∗heap, d, n, h, newv) 5: Cpkt ← CPU-WORKLOAD(∗heap, d, n, h) 6:QUEUE DISPATCH FROM CPU(QCtoG, Gpkt) 7:QUEUE DISPATCH FROM GPU(Gpkt, QGtoC, Cpkt) 8: return 9: end function

Figure 4.10: A shared memory queue.

Although Nf and Nc are numerically equivalent, we use two variables for the sake of clarity. (2) Simulation stage. Then we execute exactly the same amount of work with the ad-heap on the CPU and the GPU. The work can be split into three parts:

• The CPU part reads the entries in Nu bridges (back from the GPU) and writes Nu(Ns + 1) values to the corresponding entry indices. This part takes Tcc time on the CPU.

• To simulate the CPU-GPU communication mechanism in the HSA design, the CPU part also needs to execute signal object sends and receives. We use a lockless multi-producer single-consumer (MPSC) queue programming tool in the DKit C++ Library [19] (based on multithread components in the Boost C++ Libraries [1]) for simulating the heterogeneous processor queueing system. To meet the HSA standard [90], our packet size is set to 64 bytes with two 4-byte flags and seven 8-byte flags. Further, packing and unpacking time is also included. Because each GPU core (and also each GPU) needs to execute multiple thread groups (thread blocks in the CUDA terminology or work-groups in the OpenCL terminology) in parallel for memory latency hiding, we use 16 as a factor for the combined thread groups. Therefore, 2Nu/16 push/pop operation pairs are executed for the Nu CPU to GPU communications and the same amount of GPU to CPU communications. We record this time as Tcq.

• The GPU part executes Nf find-maxchild operations and Nc compare operations and writes Nu bridges from the on-chip scratchpad memory to the off-chip global shared memory. This part takes Tgc time on the GPU.

After the simulation runs, we use the overlapped work time on the CPU and the GPU as the execution time of the ad-heap, since the two types of cores are able to work in parallel. Thus the final execution time is the longer one of Tcc + Tcq and Tgc. Because of the features of the heterogeneous processors, the costs of device/host memory copy and GPU kernel launch are not included in our timer. Note that because we use both the CPU and the GPU separately, the simulated heterogeneous processor platform is assumed to have the accumulated off-chip memory bandwidths of both processors. Moreover, we also assume that the GPU supports the device fission function defined in the OpenCL 1.2 specification and that cores in the current GPU devices can be used as sub-devices, which behave more like the GPUs in the HSA design. Thus one CPU core and one GPU core can cooperate to deal with one ad-heap. The simulator is programmed in C++ and CUDA/OpenCL.

4.6.3 Performance Evaluation

Testbeds

To benchmark the performance of the d-heaps and the ad-heap, we use two representative machines: (1) a laptop system with an AMD A6-1450 APU, and (2) a desktop system with an Intel Core i7-3770 CPU and an nVidia GeForce GTX 680 discrete GPU. Their specifications are listed in Tables B.5 and B.6, respectively.

Benchmark and Datasets

We use a heap-based batch k-selection algorithm as the benchmark of the heap operations. Given a list set consisting of a group of unordered sub-lists, the algorithm finds the kth smallest entry of each sub-list in parallel. One of its applications is batch kNN search in large-scale concurrent queries. In each sub-list, a max-heap of size k is constructed on the first k entries and its root node is compared with the rest of the entries in the sub-list. If a new entry is smaller, an update-key operation (i.e., the root node update and the heap property reconstruction) is triggered. After traversing all entries, the root node is the kth smallest entry and the heap contains the k smallest entries of the input sub-list. In our ad-heap implementation, we execute the heapify function (i.e., the first construction of the heap) on the GPU and the root node comparison operations (i.e., deciding whether an update-key operation is required) on the CPU. Besides the execution time described in the ad-heap simulator, the execution time of the above two operations is recorded in our timer as well. According to the capacity limitation of the GPU device memory, we set the sizes of the list sets to 2^28 and 2^25 on the two machines, respectively, the data type to 32-bit integer (randomly generated), the size of each sub-list to the same length l (from 2^11 to 2^21), and k to 0.1l.

Experimental Results

The line graphs aligned to the primary Y-axis in Figures 4.11(a)–(e) and 4.12(a)–(e) show the selection rates of the d-heaps (on the CPUs and the GPUs) and the ad-heap (on the simulators) over different sizes of the sub-lists and different d values on machines 1 and 2, respectively. In all tests, all cores of the CPUs are utilized. For the d-heaps in all groups, the multicore CPUs are almost always faster than the GPUs, even when the larger d values significantly reduce the throughput of the CPUs. Thus, for the conventional d-heap data structure, the CPUs are still the better choice in the heap-based k-selection problem. For the ad-heap, the fastest d value is always 32. On one hand, smaller d values cannot fully utilize the computation and bandwidth resources of the GPUs. On the other hand, larger d values load much more data without making the heaps shallower by the same order of magnitude. The stacked columns aligned to the secondary Y-axis in Figures 4.11(a)–(e) and 4.12(a)–(e) show the execution time of the three parts (CPU compute, CPU queue and GPU compute) of the ad-heap simulators. On machine 1, the execution time of the GPU compute is always longer than the total time of the CPU work, because the raw performance of the integrated GPU is too low to accelerate the find-maxchild operations and the memory sub-system in the APU is not fully designed for GPU memory behaviors. On machine 2, the ratio of CPU time to GPU time is much more balanced (in particular when d = 32) due to the much stronger discrete GPU. Figures 4.11(f) and 4.12(f) show aggregated performance numbers, including the best results of the former 5 groups and the optimal scheduling method that runs the fastest d-heaps on the CPUs and the GPUs in parallel. In these two sub-figures, we can see that the ad-heap obtains up to 1.5x and 3.6x speedup over the optimal scheduling method when the d value is 32 and the sub-list size is 2^18 and 2^19, respectively. Note that the optimal scheduling method is also assumed to utilize the combined off-chip memory bandwidth of both processors. Among all the candidates, only the ad-heap maintains relatively stable performance as the problem size grows. The performance numbers support our ad-heap design, which benefits from the main features of the two types of cores, while the CPU d-heaps suffer from wider find-maxchild operations and the GPU d-heaps suffer from more single-thread compare-and-swap operations.

Figure 4.11: Selection rates and ad-heap execution time over different sizes of the sub-lists on machine 1 (i.e., the one with the AMD CPU and GPU). Panels (a)–(e) correspond to d = 8, 16, 32, 64 and 128; panel (f) shows the aggregated results. The line-shaped data series are aligned to the primary Y-axis; the stacked column-shaped data series are aligned to the secondary Y-axis.

Figure 4.12: Selection rates and ad-heap execution time over different sizes of the sub-lists on machine 2 (i.e., the one with the Intel CPU and nVidia GPU). Panels (a)–(e) correspond to d = 8, 16, 32, 64 and 128; panel (f) shows the aggregated results. The line-shaped data series are aligned to the primary Y-axis; the stacked column-shaped data series are aligned to the secondary Y-axis.

4.6.4 Comparison to Previous Work

This section proposed ad-heap, a new efficient heap data structure for heterogeneous processors. We conducted empirical studies based on the theoretical analysis. The experimental results showed that the ad-heap can obtain up to 1.5x and 3.6x the performance of the optimal scheduling method on two representative machines, respectively. To the best of our knowledge, the ad-heap is the first fundamental data structure that efficiently leverages the two different types of cores in the emerging heterogeneous processors through fine-grained frequent interactions between the CPUs and the GPUs. Further, the performance numbers also showed that redesigning data structures and algorithms is necessary for exposing the higher computational power of the heterogeneous processors.

Compared with the prior work, our ad-heap not only takes advantage of reduced data movement cost but also utilizes the computational power of both types of cores. As shown in the previous section, we found that the 8-heap is the best choice for the CPU and the 32-heap is the fastest on the GPU; thus the optimal scheduling method should execute the best d-heap operations on both types of cores in parallel. However, our results showed that the ad-heap is much faster than the optimal scheduling method. Thus scheduling is not always the best approach, even when task or data parallelism is obvious. Actually, in the ad-heap, the find-maxchild operation can be seen as a parallelizable stage of its higher-level operation delete-max or update-key. However, the ad-heap differs from the previous work because it exploits the advantages of the heterogeneous processors through frequent fine-grained interactions between the CPUs and the GPUs.

If the two types of cores shared the last level cache, the ad-heap could naturally benefit from heterogeneous prefetching, because the bridge and the nodes to be modified would already be loaded into the on-chip cache by the GPU before being written back by the CPU. Because of the legacy CPU and GPU architecture designs, in this section we chose to focus on a heterogeneous processor environment with separate last level cache sub-systems, as plotted in Figure 3.5(a). Conducting experiments on a shared last level cache heterogeneous processor like the one in Figure 3.5(b) is an interesting direction for future work. Additionally, our approach is different from the previous work since we treat both GPUs and CPUs as compute units, not just as prefetchers.

Part II

Sparse Matrix Representation


5. Existing Storage Formats

5.1 Overview

Because different sparse matrices have different sparsity structures, the choice of representation (i.e., format) for a given matrix may affect both its memory requirements and the efficiency of the algorithms running on it. As a result, research on sparse matrix representation has attracted a lot of attention. This chapter introduces some widely used and recently developed sparse matrix representations. The representative formats are classified into three groups: (1) four basic formats supported in most mathematical software packages, (2) three hybrid formats mainly developed for irregular matrices, and (3) eight extended formats constructed on top of the basic formats.

5.2 Basic Storage Formats

5.2.1 Diagonal (DIA)

A sparse matrix containing nonzero entries only on its diagonals is called a diagonal matrix. Hence only the entries on the diagonals need to be saved in the so-called diagonal (DIA) format. For a matrix of size m × m, the primary diagonal has no more than m nonzero entries, and the other diagonals include fewer entries. Thus a diagonal matrix can be stored as arrays of its diagonals. Figure 5.1 shows a simple sparse matrix A0 of size 6 × 6 containing 6 nonzero entries on its primary diagonal. So the matrix can be stored as an array dia with a label saying the array is diagonal “zero”.

Figure 5.1: A sparse matrix and its DIA format.


Another simple example is shown in Figure 5.2. The sparse matrix A1 contains 3 diagonals, and thus can be saved as a two-dimensional array composed of three rows, together with three labels [−1, 0, 1] indicating the relative distance to the primary diagonal. Because the diagonals at positions -1 and 1 have only 5 entries each, one “empty” fill-in entry is padded to each of them to occupy the memory space. Note that a gray block with an asterisk in the figures of this chapter indicates a padded fill-in. We can see that the space complexity of the DIA format is O(m) as long as the matrix has a fixed number of nonzero diagonals.
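To make the layout concrete, the following C++ sketch shows one possible DIA container and its matrix-vector product; the type and member names (DiaMatrix, offsets, data) are hypothetical, and padded fill-ins are stored as zeros so they do not affect the result.

#include <cstddef>
#include <vector>

// DIA storage of an m x m matrix with ndiags stored diagonals: data holds ndiags
// rows of length m, and data[d*m + i] stores A(i, i + offsets[d]) (zero where the
// diagonal runs out of the matrix, i.e., a padded fill-in).
struct DiaMatrix {
    std::size_t m;
    std::vector<int> offsets;      // size ndiags; distance from the primary diagonal
    std::vector<double> data;      // size ndiags * m, zero-padded
};

// y = A * x
void dia_spmv(const DiaMatrix& A, const std::vector<double>& x, std::vector<double>& y)
{
    y.assign(A.m, 0.0);
    for (std::size_t d = 0; d < A.offsets.size(); ++d) {
        for (std::size_t i = 0; i < A.m; ++i) {
            const long long j = static_cast<long long>(i) + A.offsets[d];   // column index
            if (j >= 0 && j < static_cast<long long>(A.m))
                y[i] += A.data[d * A.m + i] * x[static_cast<std::size_t>(j)];
        }
    }
}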

Figure 5.2: A sparse matrix and its DIA format.

However, the DIA format is designed only for sparse matrices dominated by diagonals. If some weak diagonals have very few nonzero entries, the DIA format may waste a large amount of memory space. Figure 5.3 shows an example matrix A2. The diagonals at positions -2 and 2 each have only one nonzero entry, and thus each of them wastes 5 memory units on padded fill-ins. In general, the DIA format may waste a lot of memory space when storing more irregular sparse matrices.

Figure 5.3: A sparse matrix and its DIA format.

5.2.2 ELLPACK (ELL)

Compared to the DIA format, the ELLPACK (or ELL for short) format is a bit more generic. It saves a sparse matrix of size m by n as two two-dimensional arrays, col idx and val, of size m by nnzrmax, where nnzrmax is the number of nonzero entries of the longest row of the matrix. The array col idx stores the column indices of the nonzero entries organized by rows, and the array val saves their values. The space complexity of the ELL format is O(m × nnzrmax).

Figure 5.4 plots the ELL format of the matrix A2 in row-major order. We can see that nnzrmax = 4 in this case, so the size of col idx and val is 6 × 4. Because the 1st, 2nd, 5th and 6th rows have fewer nonzero entries than the longest row, some fill-ins are padded to keep the arrays col idx and val rectangular. Note that the ELL data can also be stored in column-major order for aligned memory access.
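The column-major variant mentioned above can be sketched as follows; the struct EllMatrix and its members are illustrative only, and padded slots are assumed to carry a zero value and a valid (e.g., repeated) column index.

#include <cstddef>
#include <vector>

// ELL storage in column-major order: slot s of all m rows is contiguous, so
// col_idx[s*m + i] and val[s*m + i] hold the s-th stored entry of row i.
// Padded slots carry val = 0 and any valid column index, so they do not
// change the result of the product below.
struct EllMatrix {
    std::size_t m, nnzr_max;
    std::vector<int> col_idx;      // size m * nnzr_max
    std::vector<double> val;       // size m * nnzr_max
};

// y = A * x; the inner loop touches contiguous memory, which is what makes
// the column-major layout friendly to SIMD units and coalesced GPU accesses.
void ell_spmv(const EllMatrix& A, const std::vector<double>& x, std::vector<double>& y)
{
    y.assign(A.m, 0.0);
    for (std::size_t s = 0; s < A.nnzr_max; ++s)
        for (std::size_t i = 0; i < A.m; ++i)
            y[i] += A.val[s * A.m + i] * x[A.col_idx[s * A.m + i]];
}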

Figure 5.4: A sparse matrix and its ELL format in the row major order.

Unfortunately, for matrices with one relatively long row, the ELL format may need many fill-ins. Figure 5.5 gives an example matrix A3. Because the 3rd row is much longer than the others, many fill-ins are explicitly stored, which wastes memory space.

Figure 5.5: A sparse matrix and its ELL format in the row-major order.

5.2.3 Coordinate (COO)

A one-dimensional form of the Coordinate (or COO for short) format has already been introduced in the very first example of small colored balls in Section 2.1.1 of the thesis. The term coordinate refers to the row and column indices of the nonzero entries of a sparse matrix. A matrix is stored in the COO format if each of its nonzero entries is represented as a triple of row index, column index and value. To group all components of the triples together, three separate arrays row idx, col idx and val of size nnz are constructed, where nnz is the number of nonzero entries of the sparse matrix. The space complexity of the COO format is O(nnz). Unlike the DIA and ELL formats, the COO format does not require storing any fill-ins. Figure 5.6 shows the COO format of the matrix A3.
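A possible in-memory layout and a serial matrix-vector product for the COO format are sketched below; the struct name CooMatrix is an assumption, and a parallel version would additionally need atomic updates or a segmented reduction because several triples may target the same output row.

#include <cstddef>
#include <vector>

// COO storage: one (row, column, value) triple per nonzero entry, no fill-ins.
struct CooMatrix {
    std::size_t m, n, nnz;
    std::vector<int> row_idx, col_idx;   // both of size nnz
    std::vector<double> val;             // size nnz
};

// Serial y = A * x. A parallel version needs atomics or a segmented reduction,
// since several triples may accumulate into the same y[row].
void coo_spmv(const CooMatrix& A, const std::vector<double>& x, std::vector<double>& y)
{
    y.assign(A.m, 0.0);
    for (std::size_t t = 0; t < A.nnz; ++t)
        y[A.row_idx[t]] += A.val[t] * x[A.col_idx[t]];
}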

Figure 5.6: A sparse matrix and its COO format.

However, the COO format still stores redundant row index information if a row contains more than one nonzero entry. In Figure 5.6, we can see that the row idx array stores four ‘2’s for the nonzero entries in the 3rd row, and two ‘5’s for the 6th row. Thus, despite its generality, the COO format may waste memory space in common cases.

5.2.4 Compressed Sparse Row (CSR)

To maximize space efficiency, the Compressed Sparse Row (CSR) format1 has been designed. For a sparse matrix, the CSR format consists of three arrays: (1) the row ptr array of size m + 1, which saves the starting and ending pointers of the nonzero entries of the rows, (2) the col idx array of size nnz, which stores the column indices of the nonzero entries, and (3) the val array of size nnz, which stores their values. Hence the overall space complexity of the CSR format is O(m + nnz). When m < nnz, the CSR format uses memory space more efficiently than the COO format. Figure 5.7 shows the CSR representation of the example matrix A3. We can see that the 3rd and 4th entries of the array row ptr are 2 and 6, respectively. This indicates that the 3rd row of the matrix contains 4 nonzero entries.
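The relation between the COO row indices and the CSR row pointers can be illustrated by the following conversion sketch, which reuses the hypothetical CooMatrix type from the previous sketch and assumes the triples are sorted by row.

#include <cstddef>
#include <vector>

// CSR storage: row_ptr compresses the per-entry row indices of COO into m + 1 pointers.
struct CsrMatrix {
    std::size_t m, n, nnz;
    std::vector<int> row_ptr;      // size m + 1
    std::vector<int> col_idx;      // size nnz
    std::vector<double> val;       // size nnz
};

// Convert a row-sorted CooMatrix (see the previous sketch) to CSR:
// count the nonzero entries per row, then take an exclusive prefix sum.
CsrMatrix coo_to_csr(const CooMatrix& A)
{
    CsrMatrix B{A.m, A.n, A.nnz,
                std::vector<int>(A.m + 1, 0), A.col_idx, A.val};
    for (std::size_t t = 0; t < A.nnz; ++t)
        ++B.row_ptr[A.row_idx[t] + 1];           // histogram of row lengths
    for (std::size_t i = 0; i < A.m; ++i)
        B.row_ptr[i + 1] += B.row_ptr[i];        // prefix sum -> starting pointers
    return B;
}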

Figure 5.7: A sparse matrix and its CSR format.

Because the CSR format does not store redundant information, it is the most widely used format for sparse matrices. As a result, Sparse BLAS algorithms for matrices in the CSR format are basic functions of a sparse matrix library.

1The CSR format is also known as the Compressed Row Storage (CRS) format [156], and is equivalent to the Yale format [72] as well.

Actually, the CSR and COO formats can be further compressed [101, 172, 183], but a description of those methods is beyond the scope of this chapter.

5.3 Hybrid Storage Formats

5.3.1 Hybrid (HYB)

Hybrid formats store a sparse matrix as a combination of more than one basic format. By utilizing hybrid formats, a very irregular sparse matrix can be stored as multiple regular ones, which may be more friendly to parallel processing. Bell and Garland [21] proposed the first hybrid format, called HYB, for throughput-oriented processors, in particular GPUs. Their method is based on the observation that some very long rows destroy load balance on massively parallel chips where each compute unit is responsible for the computations on one row of a matrix. So they divide a sparse matrix into two submatrices: one is stored in the ELL format and the other is stored in the COO format, as shown in Figure 5.8. The matrix A4 would use a large amount of memory if all of its nonzero entries were stored in the ELL format. But storing all nonzero entries except the last three entries of the 3rd row in the ELL format, and those three entries in the COO format, saves the padded fill-ins required by the pure ELL format.

Figure 5.8: A sparse matrix and its HYB format.

5.3.2 Cocktail

Su and Keutzer proposed a more aggressive hybrid format called Cocktail [170]. Their framework analyzes the sparsity structure of an input matrix and stores it in a combination of 9 formats from 3 categories (i.e., diagonal-based formats, flat formats and block-based formats). Figure 5.9 shows that the matrix A4 can be saved more efficiently if its ELL part is saved in the DIA format.

Figure 5.9: A sparse matrix and its Cocktail format.

5.3.3 SMAT

However, given an irregular sparse matrix, selecting the best combination is NP-hard. Li et al. [115] proposed SMAT, a machine learning based autotuner trained on 2373 matrices from the University of Florida Sparse Matrix Collection [56], for deciding a combination of formats for a new input matrix. Similar to SMAT, Sedaghati et al. [160] constructed machine learning classifiers for automatically selecting the best format for a given input on a target device.

5.4 Sliced and Blocked Storage Formats

5.4.1 Sliced ELL (SELL)

As a variant of the ELL format, the sliced ELL (SELL) format [135] groups every nb rows into a block and stores the nonzero entries of each block in column-major order. Figure 5.10 shows an example matrix A5. When nb is set to 3, A5 is divided into 2 blocks containing 8 and 6 nonzero entries, respectively. We can see that the slicing method of the SELL format may bring better cache locality compared to the ELL format. Thus for some relatively regular sparse matrices, such as the one generated by the recent high performance conjugate gradient (HPCG) benchmark [62], SELL can bring good performance [196]. However, both the ELL and SELL formats waste memory space if a row is noticeably longer than the others.

Figure 5.10: A sparse matrix and its SELL format.

5.4.2 Sliced COO (SCOO)

Similar to the slicing method of the SELL format, the COO format can be stored in a sliced fashion as well. Dang and Schmidt designed the sliced COO (SCOO) format [54] and showed its effectiveness. The method also splits the input matrix into blocks of nb rows and stores their entries as triples in column-major order. Figure 5.11 shows the SCOO format for the matrix A5. Because the blocks may contain different numbers of nonzero entries, a new array slice ptr of size ⌈m/nb⌉ + 1 is introduced for storing the starting and ending pointers of the nonzero entries of the row blocks.

Figure 5.11: A sparse matrix and its SCOO format.

5.4.3 Blocked COO (BCOO)

Because some sparse matrices, such as the ones generated from finite element meshes, naturally have dense local block structures, using small dense blocks as the basic unit can reduce the space cost of storing indices [177, 178, 179]. If a sparse matrix is saved as submatrices of size nb1 × nb2, at most (m × n)/(nb1 × nb2) row/column index pairs are required. If the indices are stored in the COO style, the so-called blocked COO (BCOO) format [188] is obtained. Figure 5.12 shows the BCOO representation of the matrix A5. We can see that the blocks of size 2 × 2 may contain padded fill-ins to keep all blocks equally sized, and that the arrays row idx and col idx can be much shorter than their counterparts in the standard COO format.

Figure 5.12: A sparse matrix and its BCOO format of 2 × 2 dense blocks.

5.4.4 Blocked CSR (BCSR)

In analogy to the difference between the COO and CSR formats, the indices of the blocks of the BCOO format can be saved in the CSR fashion, which yields the blocked CSR (BCSR) format [45]. In Figure 5.13, we can see that a row ptr array is used instead of the row idx array, avoiding the storage of duplicated block row indices.

Figure 5.13: A sparse matrix and its BCSR format.

5.4.5 Blocked Compressed COO (BCCOO)

Yan et al. further compressed the BCOO format by introducing a bit flag array in their blocked compressed COO (BCCOO) format [188]. This representation replaces each entry of the row idx array of the BCOO format with a single TRUE or FALSE bit. The basic idea is that only the last block unit of each row needs to be labelled FALSE, and the others are labelled TRUE. This bit-wise information is enough to determine the distribution of the row blocks. Assuming that an integer row index requires 32 bits, the bit flag needs only 1/32 of the memory space. However, the BCCOO format still explicitly stores padded fill-ins, as the BCOO format does. Figure 5.14 shows an example of the BCCOO format.
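A minimal sketch of deriving the BCCOO bit flags from the BCOO block row indices is given below; it assumes the blocks are ordered row by row, and a real implementation would additionally pack the flags, e.g., 32 per 32-bit word.

#include <cstddef>
#include <cstdint>
#include <vector>

// Derive the BCCOO bit flags from the BCOO block row indices: a block is
// labelled FALSE (0) if it is the last block unit of its row, TRUE (1)
// otherwise.
std::vector<std::uint8_t> bccoo_bit_flag(const std::vector<int>& block_row_idx)
{
    const std::size_t nb = block_row_idx.size();
    std::vector<std::uint8_t> flag(nb);
    for (std::size_t i = 0; i < nb; ++i)
        flag[i] = (i + 1 < nb && block_row_idx[i + 1] == block_row_idx[i]) ? 1 : 0;
    return flag;
}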

Figure 5.14: A sparse matrix and its BCCOO format.

5.4.6 Blocked Row-Column (BRC)

Motivated by minimizing the padded fill-ins in blocked formats such as BCOO, BCSR and BCCOO, Ashari et al. proposed the blocked row-column (BRC) format [8]. This method first sorts the rows of the input matrix in descending order of

the number of nonzero entries per row, then aligns the entries to the left as the ELL format does, and finally stores the entries in dense blocks in column-major order for better cache locality. Because of the sorting, a new array called perm is introduced for storing the permutation, i.e., the original row positions. Figure 5.15 shows an example. We can see that the 3rd row is longer than the others and is thus moved to the front.

Figure 5.15: A sparse matrix and its BRC format.

5.4.7 Adaptive CSR (ACSR)

Because CSR is the most widely used format in mathematical software packages, Ashari et al. proposed the adaptive CSR (ACSR) format [7], which adds extra information to help balance the computations. The main idea of the ACSR format is to bin the rows of a sparse matrix according to their lengths. Figure 5.16 shows an example. The rows of the matrix A5 are assigned to two bins of size 5 and 1, respectively. The first bin stores the rows of length 2, and the second bin contains the 3rd row of length 4. Hence computations on the rows in each bin may consume roughly the same execution time, leading to better load balancing. One obvious advantage of this work is that the CSR data is kept intact in the ACSR format.

Figure 5.16: A sparse matrix and its ACSR format.

5.4.8 CSR-Adaptive

Greathouse and Daga proposed the CSR-Adaptive format [80, 50], which uses an array row block, instead of the binning information of the ACSR format, to group rows of comparable length. In this format, contiguous rows are organized into a row block if the sum of their lengths does not exceed the size of the usable on-chip memory. The array row block records the starting and ending points of the row groups. Figure 5.17 gives an example which assumes that each row block does not contain more than 4 nonzero entries. We can see that the 1st row block contains 2 rows, the 2nd one contains 1 row, and so on. If the row blocks have roughly the same number of nonzero entries, good load balancing can be expected on massively parallel devices. As with the ACSR format, the CSR data is not changed in the CSR-Adaptive format.
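A simple greedy construction of the row block array could look as follows; this is only a sketch under the fixed-capacity assumption used in Figure 5.17 (at most 4 nonzero entries per row block) and does not reproduce every detail of the policy in [80, 50].

#include <cstddef>
#include <vector>

// Greedily group contiguous rows so that each row block holds at most
// `capacity` nonzero entries; row_block[b] .. row_block[b+1] is the row range
// of block b. A single row longer than `capacity` ends up in a block of its own.
std::vector<int> build_row_blocks(const std::vector<int>& row_ptr, int capacity)
{
    const std::size_t m = row_ptr.size() - 1;
    std::vector<int> row_block{0};
    int block_start = 0;
    for (std::size_t i = 0; i < m; ++i) {
        const int nnz_in_block = row_ptr[i + 1] - row_ptr[block_start];
        if (nnz_in_block > capacity && static_cast<int>(i) > block_start) {
            row_block.push_back(static_cast<int>(i));   // close the block before row i
            block_start = static_cast<int>(i);
        }
    }
    row_block.push_back(static_cast<int>(m));           // close the last block
    return row_block;
}

// For row lengths 2, 2, 4, 2, 2, 2 (matrix A5, cf. the ACSR example) and capacity 4,
// this yields row_block = {0, 2, 3, 5, 6}: the 1st block holds 2 rows and the 2nd
// holds 1 row, consistent with the description of Figure 5.17.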

Figure 5.17: A sparse matrix and its CSR-Adaptive format.

6. The CSR5 Storage Format

6.1 Overview

As shown in the previous chapter, many advanced formats have been proposed for a variety of purposes. However, they may incur a non-trivial format conversion cost if the input matrix is stored in a basic format such as CSR. The conversion cost mainly comes from the expensive structure-dependent parameter tuning of a storage format. For example, some block-based formats [33, 35, 45, 177, 178, 188] such as BCCOO require finding a good 2D block size. Moreover, some hybrid formats [21, 170] such as HYB may need completely different partitioning parameters for distinct input matrices. To avoid the format conversion overhead, ACSR [7] and CSR-Adaptive [80, 50] directly use the CSR data. However, they may provide very low performance for irregular matrices due to unavoidable load imbalance. Furthermore, neither of them can avoid a preprocessing overhead, since certain auxiliary data for the basic CSR format still have to be generated. Therefore, to be practical, an efficient format must satisfy two criteria: (1) it should limit the format conversion cost by avoiding structure-dependent parameter tuning, and (2) it should support fast sparse matrix computations for both regular and irregular matrices.

6.2 Contributions

To meet these two criteria, this chapter proposes CSR5 (Compressed Sparse Row 5)1, a new format directly extending the classic CSR format. The CSR5 format leaves one of the three arrays of the CSR format unchanged, stores the other two arrays in an in-place tile-transposed order, and adds two groups of extra auxiliary information. The format conversion from CSR to CSR5 needs only two tuning parameters: one is hardware-dependent and the other is sparsity-dependent (but structure-independent). Because the two added groups of information are usually much smaller than the original three arrays of the CSR format, very little extra space is required. Furthermore, the CSR5 format is SIMD-friendly and thus can be easily implemented on all mainstream processors with SIMD units. Because of the

1The reason we call the storage format CSR5 is that it has five groups of data, instead of three in the classic CSR.


structure-independence and the SIMD utilization, CSR5-based sparse matrix algorithms such as SpMV can deliver stable, high throughput for both regular and irregular matrices.

6.3 The CSR5 Format

6.3.1 Basic Data Layout

To achieve near-optimal load balance for matrices with any sparsity structure, we first evenly partition all nonzero entries into multiple 2D tiles of the same size. Thus, when executing a parallel SpMV operation, a compute core can consume one or more 2D tiles, and each SIMD lane of the core can deal with one column of a tile. The main skeleton of the CSR5 format is then simply a group of 2D tiles. The CSR5 format has two tuning parameters, ω and σ, where ω is a tile's width and σ is its height. In fact, these are the only two tuning parameters of the CSR5 format. Further, we need extra information to compute SpMV efficiently. For each tile, we introduce a tile pointer tile ptr and a tile descriptor tile desc. Meanwhile, the three arrays of the classic CSR format, i.e., the row pointer row ptr, the column index col idx and the value val, are directly integrated. The only difference is that the col idx data and the val data in each complete tile are in-place transposed (i.e., from row-major order to column-major order) for coalesced memory access from contiguous SIMD lanes. If the last entries of the matrix do not fill up a complete 2D tile (i.e., nnz mod (ωσ) ≠ 0), they simply remain unchanged and have no tile desc. Figure 6.2 shows the example matrix A from Figure 6.1, of size 8 × 8 with 34 nonzero entries, stored in the CSR5 format. When ω = 4 and σ = 4, the matrix is divided into three tiles, including two complete tiles of size 16 and one incomplete tile of size 2. The arrays col idx and val in the two complete tiles are now stored in tile-level column-major order. Moreover, only the first two tiles have tile desc, since they are complete.
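The tile-level transposition can be sketched as follows, assuming that SIMD lane j of a tile owns the σ consecutive CSR-ordered entries [jσ, (j+1)σ) of that tile; for simplicity the sketch uses a temporary buffer instead of a true in-place permutation.

#include <cstddef>
#include <vector>

// Transpose the col_idx/val data of one complete CSR5 tile from CSR (lane-major)
// order to tile-level column-major storage: before the transposition, lane j owns
// entries [j*sigma, (j+1)*sigma) of the tile; afterwards, the entries of step r of
// all omega lanes are contiguous, which gives coalesced SIMD loads.
template <typename T>
void transpose_tile(T* tile, std::size_t omega, std::size_t sigma)
{
    std::vector<T> tmp(tile, tile + omega * sigma);
    for (std::size_t j = 0; j < omega; ++j)        // lane (tile column)
        for (std::size_t r = 0; r < sigma; ++r)    // step within the lane
            tile[r * omega + j] = tmp[j * sigma + r];
}

// Applied to every complete tile (the incomplete last tile keeps CSR order):
//   for (std::size_t t = 0; t < nnz / (omega * sigma); ++t) {
//       transpose_tile(col_idx + t * omega * sigma, omega, sigma);
//       transpose_tile(val + t * omega * sigma, omega, sigma);
//   }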

Figure 6.1: A sparse matrix A of size 8 × 8 including 34 nonzero entries. CHAPTER 6. THE CSR5 STORAGE FORMAT 79

Figure 6.2: The CSR5 storage format of the sparse matrix A. The five groups of information include row ptr, tile ptr, col idx, val and tile desc.

6.3.2 Auto-Tuned Parameters ω and σ

Because the computational power of modern multicore and manycore processors mainly comes from their SIMD units, we design an auto-tuning strategy for high SIMD utilization. First, the tile width ω is set to the size of the SIMD execution unit of the processor used. Then an SIMD unit can consume a 2D tile in σ steps without any explicit synchronization, and the vector registers can be fully utilized. For double precision SpMV, we always set ω = 4 for CPUs with 256-bit SIMD units, ω = 32 for nVidia GPUs, ω = 64 for AMD GPUs, and ω = 8 for Intel Xeon Phi with 512-bit SIMD units. Therefore, ω can be decided automatically once the processor type is known. The other parameter, σ, is decided by a slightly more complex process. For a given processor, we consider its on-chip memory strategy, such as the cache capacity and the prefetching mechanism. If a 2D tile of size ω × σ empirically brings better performance than other tile sizes, that σ is simply chosen. We found that the x86 processors fall into this category. For double precision SpMV on CPUs and Xeon Phi, we always set σ to 16 and 12, respectively. As for GPUs, the tile height σ further depends on the sparsity of the matrix. Note that “sparsity” is not the same as “sparsity structure”. We define “sparsity” to be the average number of nonzero entries per row (or nnz/row for short). In contrast, “sparsity structure” is much more complex because it includes the 2D space layout of all nonzero entries. On GPUs, we have several performance considerations for mapping the value nnz/row to σ. First, σ should be large enough to expose more thread-level local work and to amortize the basic cost of the segmented sum algorithm. Second, it should not be too large, since a larger tile potentially generates more partial sums (i.e., entries to store to y), which put higher pressure on last level cache writes. Moreover, for

the matrices with large nnz/row, σ may need to be small. The reason is that once the whole tile is located inside a single matrix row (i.e., only one segment is in the tile), the segmented sum converts to a fast reduction sum. Therefore, for the nnz/row to σ mapping on GPUs, we define three simple bounds, r, s and t. The first bound r is designed to prevent σ from being too small. The second bound s prevents σ from being too large. But when nnz/row is even larger than the third bound t, σ is set to a small value u. Then we have

\[
\sigma =
\begin{cases}
r & \text{if } \mathit{nnz/row} \le r,\\
\mathit{nnz/row} & \text{if } r < \mathit{nnz/row} \le s,\\
s & \text{if } s < \mathit{nnz/row} \le t,\\
u & \text{if } t < \mathit{nnz/row}.
\end{cases}
\]

The three bounds, r, s and t, and the value u are hardware-dependent, meaning that for a given processor they can be fixed once and for all. For example, to execute double precision SpMV on nVidia Maxwell GPUs and AMD GCN GPUs, we always set <r, s, t, u> = <4, 32, 256, 4> and <4, 7, 256, 4>, respectively. As for future processors with new architectures, we can obtain the four values through some simple benchmarks during initialization, and then use them for later runs. So the parameter σ can be decided once very basic information about the matrix and the underlying hardware is known. Therefore, the parameter tuning time becomes negligible because ω and σ are easily obtained. This saves a great deal of preprocessing time.
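The mapping from nnz/row to σ is simple enough to express directly; the following C++ sketch mirrors the piecewise definition above, with the bounds passed in as the tuple <r, s, t, u>.

#include <cstddef>

// Select the tile height sigma on a GPU from the average number of nonzero
// entries per row, following the piecewise mapping above. The bounds
// <r, s, t, u> are hardware-dependent, e.g., <4, 32, 256, 4> for nVidia
// Maxwell GPUs and <4, 7, 256, 4> for AMD GCN GPUs.
int select_sigma(std::size_t nnz, std::size_t m, int r, int s, int t, int u)
{
    const std::size_t nnz_per_row = nnz / m;
    if (nnz_per_row <= static_cast<std::size_t>(r)) return r;
    if (nnz_per_row <= static_cast<std::size_t>(s)) return static_cast<int>(nnz_per_row);
    if (nnz_per_row <= static_cast<std::size_t>(t)) return s;
    return u;
}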

6.3.3 Tile Pointer Information

The added tile pointer information tile ptr stores the row index of the first matrix row in each tile, indicating the starting position for storing its partial sums to the vector y. By introducing tile ptr, each tile can find its own starting position, allowing the tiles to execute in parallel. The size of the tile ptr array is p + 1, where p = ⌈nnz/(ωσ)⌉ is the number of tiles in the matrix. For the example in Figure 6.2, the first entry of Tile 1 is located in the 4th row of the matrix, and thus 4 is set as its tile pointer. To build the array, we binary-search the row ptr array for the index of the first nonzero entry of each tile. Lines 1–4 in Algorithm 24 show this procedure. Recall that an empty row has exactly the same row pointer information as its first non-empty right neighbor row (see the second row in the matrix A in Figure 5.7). Thus, for the non-empty rows with an empty left neighbor, we need a specific process (similar to lines 12–16 in Algorithm 30) to store their partial sums to the correct positions in y. To indicate whether this specific process is required, we give a hint to the other parts (i.e., the tile descriptor data) of the CSR5 format and the CSR5-based SpMV algorithm: we set an entry in tile ptr to its negative value if its corresponding tile includes any empty rows. Lines 5–12 in Algorithm 24 show this operation. If the first tile has any empty rows, we need to store a −0 (negative zero) for it. To record −0, we use an unsigned 32- or 64-bit integer as the data type of the tile ptr array. Therefore, we have 1 bit for explicitly storing the sign and 31 or 63 bits for an index. For example, in our design, tile pointer −0 is represented as the binary string

Algorithm 24 Generating tile ptr.
1: for tid = 0 to p in parallel do
2:     bnd ← tid × ω × σ
3:     tile ptr[tid] ← BINARY SEARCH(*row ptr, bnd) − 1
4: end for
5: for tid = 0 to p − 1 do
6:     for rid = tile ptr[tid] to tile ptr[tid + 1] do
7:         if row ptr[rid] = row ptr[rid + 1] then
8:             tile ptr[tid] ← NEGATIVE(tile ptr[tid])
9:             break
10:         end if
11:     end for
12: end for

‘1000 ... 000’, and tile pointer 0 is stored as ‘0000 ... 000’. To the best of our knowledge, a 31- or 63-bit index is completely compatible with most numerical libraries such as Intel MKL. Moreover, the reference implementation of the HPCG benchmark [62] also uses 32-bit signed integers for problem dimensions no larger than 2^31 and 64-bit signed integers for larger problem dimensions. Thus it is safe to use 1 bit as the empty-row hint and the other 31 or 63 bits as the ‘real’ row index.

6.3.4 Tile Descriptor Information

The tile pointer alone is not enough for a fast SpMV operation. For each tile, we also need four extra hints: (1) bit flag of size ω × σ, which indicates whether an entry is the first nonzero of a matrix row, (2) y offset of size ω, used to let each column know where the starting point for storing its local partial sums is, (3) seg offset of size ω, used to accelerate the local segmented sum inside a tile, and (4) empty offset of variable size (but no longer than ω × σ), constructed to help the partial sums find their correct locations in y if the tile includes any empty rows. The tile descriptor tile desc denotes the combination of the above four groups of data.

Generating bit flag is straightforward. The procedure is very similar to lines 3–5 in Algorithm 30, which describes SpMV based on a segmented sum method. The main difference is that the bit flags are saved in column-major order, which matches the in-place transposed col idx and val. Additionally, the first entry of each tile's bit flag is set to TRUE to seal the first segment from the top and to let the 2D tiles be independent of each other.

The array y offset of size ω is used to help the columns in each tile know where the starting points for storing their partial sums to y are. In other words, each column has one entry in the array y offset as a starting point offset for all segments in the same column. We save a row index offset (i.e., a relative row index) for each column in y offset. Thus, for the ith column in the tidth tile, by calculating tile ptr[tid] + y offset[i], the column knows where its own starting position in y is. Thus the columns can work with a high degree of parallelism without waiting for a synchronization. Generating y offset is simple: each column counts the number of TRUEs in the previous columns' bit flag arrays. Consider Tile 1 in Figure 6.2 as an example: because there are 3 TRUEs in the 1st column, the 2nd column's corresponding value in y offset is 3. In addition, since there are in total 4 TRUEs in the 1st, 2nd and 3rd columns' bit flag, Tile 1's y offset[3] = 4. Algorithm 25 lists how to generate y offset for a single 2D tile in an SIMD-friendly way.

Algorithm 25 Generating y offset and seg offset.
1: MALLOC(*tmp bit, ω)
2: MEMSET(*tmp bit, FALSE)
3: for i = 0 to ω − 1 in parallel do
4:     y offset[i] ← 0
5:     for j = 0 to σ − 1 do
6:         y offset[i] ← y offset[i] + bit flag[i][j]
7:         tmp bit[i] ← tmp bit[i] ∨ bit flag[i][j]
8:     end for
9:     seg offset[i] ← 1 − tmp bit[i]
10: end for
11: EXCLUSIVE ALL-SCAN SERIAL(*y offset, ω)    . Algorithm 7
12: SEGMENTED SUM(*seg offset, ω, *tmp bit)    . Algorithm 12
13: FREE(*tmp bit)

The third array, seg offset of size ω, is used to accelerate the local segmented sum in the workload of each tile. The local segmented sum is an essential step that synchronizes the partial sums in a 2D tile (imagine multiple columns in the tile coming from the same matrix row). In the previous segmented sum (or segmented scan) methods [28, 40, 161, 63], the local segmented sum is complex and not efficient enough. Thus we use the method described in Section 4.4.4, which introduces a seg offset array for fast segmented sum. To generate seg offset, we let each column search its right neighbor columns and count the number of contiguous columns without any TRUEs in their bit flag. Using Tile 0 in Figure 6.2 as an example, its 2nd column has one and only one right neighbor column (the 3rd column) without any TRUEs in its bit flag. Thus the 2nd column's seg offset value is 1. In contrast, because the other three columns (the 1st, 3rd and 4th) do not have any ‘all FALSE’ right neighbors, their values in seg offset are 0. Algorithm 25 shows how to generate seg offset in an SIMD-friendly way. The seg offset array is used for calculating a fast segmented sum through an inclusive prefix-sum scan; see Section 4.4.4 for details.

The last array, empty offset, occurs if and only if a 2D tile includes any empty rows (i.e., its tile pointer is negative). Because an empty row of a matrix has the same row pointer as its first non-empty right neighbor row (recall the second row of the matrix A in Figure 5.7), y offset would record an incorrect offset for it. We correct for this by storing the correct offsets for the segments within such a tile. Thus the length of empty offset is the number of segments (i.e., the total number of TRUEs in bit flag) in the tile. For example, Tile 0 in Figure 6.2 has 4 entries in its empty offset

since its bit flag includes 4 TRUEs. Algorithm 26 lists the pseudocode that generates empty offset for a tile containing at least one empty row.

Algorithm 26 Generating empty offset for the tidth tile.
1: len ← REDUCTION SUM PARALLEL(*bit flag, ω)    . Algorithm 4
2: MALLOC(*empty offset, len)
3: eid ← 0
4: for i = 0 to ω − 1 do
5:     for j = 0 to σ − 1 do
6:         if bit flag[i][j] = TRUE then
7:             ptr ← tid × ω × σ + i × σ + j
8:             idx ← BINARY SEARCH(*row ptr, ptr) − 1
9:             idx ← idx − REMOVE SIGN(tile ptr[tid])
10:             empty offset[eid] ← idx
11:             eid ← eid + 1
12:         end if
13:     end for
14: end for

6.3.5 Storage Details

To store the tile desc arrays in a space-efficient way, we find upper bounds for their entries and utilize the bit-field pattern. First, since the entries in y offset store offset distances inside a 2D tile, they have an upper bound of ωσ. So ⌈log2(ωσ)⌉ bits are enough for each entry in y offset. For example, when ω = 32 and σ = 16, 9 bits are enough for each entry. Second, since seg offset includes offsets less than ω, ⌈log2(ω)⌉ bits are enough for an entry in this array. For example, when ω = 32, 5 bits are enough for each entry. Third, bit flag stores σ 1-bit flags for each column of a 2D tile. When σ = 16, each column needs 16 bits. So 30 (i.e., 9 + 5 + 16) bits are enough for each column in the example. Therefore, for a tile, the three arrays can be stored in a compact bit-field composed of ω 32-bit unsigned integers. If the above example matrix has 32-bit integer row indices and 64-bit double precision values, only around 2% extra space is required by the three newly added arrays. The size of empty offset depends on the number of groups of contiguous empty rows, since we only record one offset for the rightmost non-empty row having any number of empty rows as its left neighbors.
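One possible bit-field packing for ω = 32 and σ = 16 is sketched below; the exact bit positions are an assumption for illustration and may differ from the released CSR5 implementation, but the 9 + 5 + 16 = 30 used bits match the budget derived above.

#include <cstdint>

// One possible packing of a tile column's descriptor into a 32-bit word for
// omega = 32 and sigma = 16 (bit positions are illustrative):
//   bits 31..23 : y_offset   (9 bits, value < omega * sigma = 512)
//   bits 22..18 : seg_offset (5 bits, value < omega = 32)
//   bits 17..16 : unused
//   bits 15..0  : bit_flag   (sigma = 16 one-bit flags of this column)
std::uint32_t pack_descriptor(std::uint32_t y_offset,
                              std::uint32_t seg_offset,
                              std::uint32_t bit_flag)
{
    return (y_offset << 23) | (seg_offset << 18) | (bit_flag & 0xFFFFu);
}

std::uint32_t unpack_y_offset(std::uint32_t d)   { return d >> 23; }
std::uint32_t unpack_seg_offset(std::uint32_t d) { return (d >> 18) & 0x1Fu; }
std::uint32_t unpack_bit_flag(std::uint32_t d)   { return d & 0xFFFFu; }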

6.3.6 The CSR5 for Other Matrix Operations

Since we in-place transposed the CSR arrays col idx and val, a conversion from CSR5 back to CSR is required for doing other sparse matrix operations that use the CSR format. This conversion simply removes tile ptr and tile desc and transposes col idx and val back to row-major order, and can thus be very fast. Further, since the CSR5 is a superset of the CSR, any entry accesses or slight changes can be done directly in the CSR5 format, without any need to convert it to

the CSR format. Additionally, some applications such as finite element methods can directly assemble sparse matrices in the CSR5 format from data sources.

6.4 Comparison to Previous Formats

As shown in the previous chapter, a great deal of work has been published on representing sparse matrices for higher SpMV performance. The block-based sparse matrix construction has received the most attention [8, 33, 35, 45, 149, 177, 188] for two main reasons: (1) sparse matrices generated by some real-world problems (e.g., finite element discretization) naturally have block sub-structures, and (2) off-chip load operations may be decreased by using block indices instead of entry indices. However, for many matrices that do not exhibit a natural block structure, trying to extract the block information is time consuming and has limited effect. The BCCOO format is a representative example. Its preprocessing stage works in a complex optimization space consisting of up to 10 groups of parameters and up to 442368 tuning possibilities. From our experiments, we can see that although it is much faster than any other format, its real applicability may be extremely low because of the exhaustive parameter tuning. The CSR5 format, in contrast, has only 2 parameters and thus requires a very low format conversion cost.

On the other hand, the hybrid formats [21, 170], such as HYB, have been designed for irregular matrices. However, higher kernel launch overhead and invalidated caches between kernel launches tend to decrease their overall performance. Moreover, it is hard to guarantee that every submatrix can saturate the whole device. In addition, some relatively simple operations, such as solving triangular systems, become complex when the input matrix is stored in two or more separate parts. The CSR5 format does not require such a decomposition for good performance on irregular matrices.

Unlike the work on automatic selection of the best format by Li et al. [115] and Sedaghati et al. [160], which analyzes the sparsity structure and then suggests one or more formats, the CSR5 format described in this work is insensitive to the sparsity structure of the input sparse matrix. Moreover, to the best of our knowledge, the CSR5 is the only format that supports high throughput cross-platform sparse matrix computations on CPUs, nVidia GPUs, AMD GPUs and Xeon Phi at the same time. This advantage may simplify the development of scientific software for processors with massive on-chip parallelism. Chapter 8 will describe the CSR5-based SpMV algorithm.

Part III

Sparse BLAS Algorithms


7. Level 1: Sparse Vector Operations

7.1 Overview

As shown in Section 2.3, many Level 2 and Level 3 sparse operations can directly use Level 1 routines as building blocks, and thus the performance of the Level 1 routines is crucial for all Sparse BLAS routines. This chapter mainly focuses on the sparse vector–sparse vector addition operation, since it is used as part of the SpGEMM method described in Chapter 9.

7.2 Contributions

In this chapter, a heuristic approach for adding two sparse vectors is developed. Depending on the upper bound length of the sparse output vector, the approach maps the sparse input vectors to a heap method, a bitonic ESC method or a merge method.

7.3 Basic Method

7.3.1 Sparse Accumulator

Gilbert et al. [77] designed a fundamental tool called the sparse accumulator, or SPA for short. The SPA conducts sparse operations on a dense array. One of its use scenarios is adding two sparse vectors: the nonzero entries of the 1st sparse vector are assigned to an all-zero dense vector of the same size, then the nonzero entries of the 2nd sparse vector are added to the associated positions of the dense vector, and finally all zero entries are removed from the dense vector so that only the nonzero ones are stored. Algorithm 27 shows this procedure.

7.3.2 Performance Considerations

We can see that the SPA method works in the space of a dense vector, and thus may waste storage and compute resources on dealing with zero entries, in particular when the


Algorithm 27 Adding two sparse vectors using SPA.
1: function SPARSE VECTOR ADDITION SPA(*in1, *in2, *out, len)
2:     MALLOC(*tmp, len)
3:     MEMSET(*tmp, 0)
4:     nnzin1 ← in1.nnz
5:     nnzin2 ← in2.nnz
6:     for i = 0 to nnzin1 − 1 do
7:         tmp[in1.col idx[i]] ← in1.val[i]
8:     end for
9:     for i = 0 to nnzin2 − 1 do
10:         tmp[in2.col idx[i]] ← tmp[in2.col idx[i]] + in2.val[i]
11:     end for
12:     nnzout ← 0
13:     for i = 0 to len − 1 do
14:         if tmp[i] ≠ 0 then
15:             out.col idx[nnzout] ← i
16:             out.val[nnzout] ← tmp[i]
17:             nnzout ← nnzout + 1
18:         end if
19:     end for
20:     out.nnz ← nnzout
21:     FREE(*tmp)
22: end function

input sparse vectors contain only a few nonzero entries but their dimension is large.

7.4 CSR-Based Sparse Vector Addition

Here we develop three methods for adding two sparse vectors.

7.4.1 Heap Method

The heap method first creates an empty implicit index-value pair heap (or priority queue) of the upper-bound size of the output vector (i.e., the sum of the numbers of nonzero entries of the two input vectors). The heap can be located in the on-chip scratchpad memory and collects all candidate nonzero entries from the input vectors. Then the heap executes a heapsort-like operation to generate an ordered sequence located in the tail part of the heap. The difference between this operation and the classic heapsort operation is that the entries in the resulting sequence are duplicate-free, while the initial heap may include duplicate entries. In each delete-max step of our variant heapsort, the root node and the first entry of the resulting sequence are fused if they share the same index; otherwise the root node is inserted at the head of the sequence. The method is also distinguished from the heap-based sparse accumulator given by Gilbert et al. [78] by its mechanism for eliminating duplicate entries. Figure 7.1 gives two steps of an example of the heap method. Finally, the

sorted sequence without duplicate indices is generated in the scratchpad memory and saved to the output vector in the global memory. In addition, the number of nonzero entries of the output vector is updated to the size of the resulting sequence. To achieve higher throughput, the ad-heap proposed in Section 4.6 can be directly used here.
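A serial C++ sketch of the duplicate-fusing heapsort idea is given below; it uses std::priority_queue ordered by index and a separate output vector, whereas the actual implementation works in place in scratchpad memory and uses the ad-heap.

#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

using Entry = std::pair<int, double>;   // (column index, value)

// Add two sparse vectors with a max-heap ordered by index. Entries are
// extracted in decreasing index order; an extracted entry is fused with the
// previously extracted one when their indices match, so the output is
// sorted and duplicate-free.
std::vector<Entry> add_sparse_heap(const std::vector<Entry>& a, const std::vector<Entry>& b)
{
    std::priority_queue<Entry> heap;    // lexicographic max-heap: largest index on top
    for (const Entry& e : a) heap.push(e);
    for (const Entry& e : b) heap.push(e);

    std::vector<Entry> out;             // built in decreasing index order
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        if (!out.empty() && out.back().first == top.first)
            out.back().second += top.second;   // fuse duplicate index
        else
            out.push_back(top);                // start a new output entry
    }
    std::reverse(out.begin(), out.end());      // ascending index order
    return out;
}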


Figure 7.1: Two steps of an example of the heap method. From (a) to (b), the root entry is fused with the first entry of the resulting sequence since they share the same index. From (b) to (c), the root entry is inserted into the sequence since they have different indices. After each step, the heap property is reconstructed.

7.4.2 Bitonic ESC Method

This method first collects all candidate nonzero entries into an array in the scratchpad memory, then sorts the array using a basic bitonic sort (see Section 4.5.2) and compresses duplicate indices in the sequence using a combination of all-scan (see Section 4.4.2) and segmented sum (see Section 4.4.4). Figure 7.2 shows an example of adding two sparse vectors with 4 and 2 nonzero entries, respectively. Finally, a sorted sequence without duplicate indices is generated in the scratchpad memory and saved to the global memory, and the number of nonzero entries of the output sparse vector is recorded.

Figure 7.2: An example of the bitonic ESC method.

7.4.3 Merge Method

The nonzero entries of the input vectors and of the resulting vector can always be kept ordered and duplicate-free because of the CSR format1. Therefore, we can convert the addition operation to a parallel merge operation that merges ordered sequences into a final resulting sequence that is ordered and duplicate-free. Each parallel merge operation can be split into multiple sub-steps: (1) a binary search operation on the resulting sequence for fusing entries with the same indices and tagging them, (2) an all-scan operation on the input sequence for obtaining contiguous positions in the incremental part of the resulting sequence, (3) copying the non-duplicate entries from the input sequence to the resulting sequence, and (4) merging the two sequences in one continuous memory space. Figure 7.3 shows an example of this procedure. Finally, the resulting sequence is the output sparse vector, and its length is the number of nonzero entries of the output. Because both the binary search and the all-scan take fast logarithmic time for each entry in the input sequence, these operations have relatively good efficiency and performance stability on modern parallel processors. Therefore, a fast merge algorithm is crucial for the performance of the merge method in our SpGEMM framework. Based on the experimental evaluation described in Section 4.5.6, the GPU merge path algorithm is selected from the five candidate GPU merge approaches.
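The effect of the merge method can be summarized by the following serial sketch, which merges two index-sorted, duplicate-free sequences and fuses entries that share an index; the parallel GPU version replaces this loop by the binary search, all-scan and merge path sub-steps described above.

#include <cstddef>
#include <utility>
#include <vector>

using Entry = std::pair<int, double>;   // (column index, value), index-sorted per input

// Merge two sorted, duplicate-free sparse vectors and fuse entries that
// share the same index; the result is again sorted and duplicate-free.
std::vector<Entry> add_sparse_merge(const std::vector<Entry>& a, const std::vector<Entry>& b)
{
    std::vector<Entry> out;
    out.reserve(a.size() + b.size());   // upper bound on the output length
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first < b[j].first)       out.push_back(a[i++]);
        else if (b[j].first < a[i].first)  out.push_back(b[j++]);
        else {                             // equal indices: fuse the two values
            out.push_back({a[i].first, a[i].second + b[j].second});
            ++i; ++j;
        }
    }
    while (i < a.size()) out.push_back(a[i++]);
    while (j < b.size()) out.push_back(b[j++]);
    return out;
}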

Figure 7.3: An example of the merge method. The original status of the resulting sequence contains the 1st input vector. The input sequence as the 2nd input vector can be stored in the register file. Its mask sequence and the resulting sequence are in the scratchpad memory.

1Actually according to the CSR format standard, the column indices in each row do not necessarily have to be sorted. But most implementations choose to do so, thus our method reasonably makes this assumption.

8. Level 2: Sparse Matrix-Vector Operations

8.1 Overview

In the Level 2 Sparse BLAS, the multiplication of a sparse matrix by a dense or sparse vector is very useful and challenging. In this chapter, we mainly consider fast algorithms for sparse matrix-vector multiplication (SpMV for short), which is perhaps the most widely used non-trivial routine in computational science and modeling. In the next chapter, the Level 3 BLAS routine SpGEMM will be introduced; this operation can be seen as a more generic case of the multiplication of a sparse matrix and a sparse vector.

The SpMV operation multiplies a sparse matrix A of size m × n by a dense vector x of size n and gives a dense vector y of size m. The naïve sequential implementation of SpMV can be very simple, and can be easily parallelized by adding a few pragma directives for the compilers [112]. But to accelerate large-scale computation, parallel SpMV still needs to be hand-optimized with specific data storage formats and algorithms [7, 8, 16, 21, 33, 35, 38, 45, 58, 76, 80, 102, 115, 116, 122, 170, 173, 166, 177, 184, 188, 191, 193, 194, 192] (see Chapter 5 for some recently proposed formats for specific hardware architectures such as GPUs and Xeon Phi). Experimental results have shown that these formats can provide performance improvements for various SpMV benchmarks.

However, the completely new formats bring several new problems. The first one is backward compatibility. When the input data are stored in a basic format (e.g., CSR), a format conversion is required before the new format based SpMV can be used. In practice, fusing a completely new format into well-established toolkits for scientific software (e.g., PETSc [12]) is not a trivial task [132] because of the format conversion. Moreover, Kumbhar [107] pointed out that once an application (in particular a non-linear solver) needs repeated format conversion after a fixed small number of iterations, the new formats may degrade the overall performance. Furthermore, Langr and Tvrdík [110] demonstrated that isolated SpMV performance is insufficient to evaluate a new format. Thus more evaluation criteria, such as format conversion cost and memory footprint, must be taken into consideration. Secondly, when the SpMV operation is used with other sparse building blocks (such as preconditioning operations [116] and SpGEMM [118]) that require basic storage formats, using the all-new formats is less feasible.


Therefore, if we can reduce the cost of format conversion and extra memory footprint to a certain level, a new format can offer faster SpMV, compared to the CSR format, in an iterative scenario. Otherwise, accelerating CSR-based SpMV can give the best performance, in particular when the number of iterations is low. In this chapter, we introduce our SpMV algorithms both for the CSR format and for the CSR5 format described in Chapter 6.

8.2 Contributions

Our work described in this chapter particularly focuses on CSR-based SpMV on CPU-GPU heterogeneous processors and CSR5-based SpMV on CPUs, GPUs and Xeon Phi.

The main idea of our CSR-based SpMV algorithm is to first speculatively execute SpMV on the GPU cores of a heterogeneous processor for high-throughput computation, and then locally re-arrange the resulting vectors on the CPU cores of the same chip for low-latency memory access. To achieve a load balanced first step and to utilize both CPU and GPU cores, we improved the conventional segmented sum method by generating auxiliary information (e.g., the segment descriptor) at runtime and recognizing empty rows on-the-fly. Compared to the row block methods for the CSR-based SpMV, our method delivers load balanced computation to achieve higher throughput. Compared with the classic segmented sum method for the CSR-based SpMV, our approach decreases the overhead of global synchronization and removes the pre- and post-processing regarding empty rows. The CSR-based SpMV work makes the following contributions:

• A fast CSR-based SpMV algorithm that efficiently utilizes different types of cores in emerging CPU-GPU heterogeneous processors.

• A speculative segmented sum algorithm that generates auxiliary information on-the-fly and eliminates costly pre- and post-processing of empty rows.

• On a widely adopted benchmark suite, the proposed CSR-based SpMV algorithm achieves stable SpMV performance independent of the sparsity structure of the input matrix.

On a benchmark suite composed of 20 matrices with diverse sparsity structures, the CSR-based SpMV approach greatly outperforms the row block methods for the CSR-based SpMV running on GPU cores of heterogeneous processors. On an Intel heterogeneous processor, the experimental results show that our method obtains up to 6.90x and on average 2.57x speedup over an OpenCL implementation of the CSR-vector algorithm in CUSP running on its GPU cores. On an AMD heterogeneous processor, our approach delivers up to 16.07x (14.43x) and on average 5.61x (4.47x) speedup over the fastest single (double) precision CSR-based SpMV algorithm from PARALUTION and an OpenCL version of CUSP running on its GPU cores. On an nVidia heterogeneous processor, our approach delivers up to 5.91x (6.20x) and on average 2.69x (2.53x) speedup over the fastest single (double) precision CSR-based SpMV algorithm from cuSPARSE and CUSP running on its GPU cores.

The CSR5 format described in Chapter 6 extends the basic CSR format to use SIMD units efficiently, and can thus serve as an alternative to the CSR format. The CSR5 format merely needs a very short format conversion time (a few SpMV operations) and a very small extra memory footprint (around 2% of the CSR data). Because the CSR5 format shares data with the CSR format, the CSR-based Sparse BLAS routines can efficiently work with the CSR5 format. The CSR5-based SpMV work makes the following contributions:

• A CSR5-based SpMV algorithm based on a redesigned low-overhead segmented sum algorithm.

• The work is implemented on four mainstream devices: CPU, nVidia GPU, AMD GPU and Intel Xeon Phi.

• The CSR5-based SpMV is evaluated in both isolated SpMV tests and iteration-based scenarios.

We compare the CSR5 with 11 state-of-the-art formats and algorithms on dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi. Using 14 regular and 10 irregular matrices as a benchmark suite, we show that the CSR5 obtains comparable or better performance than the previous work for the regular matrices, and can greatly outperform the prior work for the irregular matrices. For the 10 irregular matrices, the CSR5 obtains average speedups of 1.18x, 1.29x, 2.73x and 3.93x (and up to 3.13x, 2.54x, 5.05x and 10.43x) over the second best work on the four platforms, respectively. Moreover, for iteration-based real-world scenarios, the CSR5 format achieves higher speedups because of its fast format conversion. To the best of our knowledge, this is the first time that a single storage format can outperform state-of-the-art work on all four modern multicore and manycore processors.

8.3 Basic Methods

8.3.1 Row Block Algorithm

In a given sparse matrix, the rows are independent of each other. Therefore an SpMV operation can be parallelized over decomposed row blocks. A logical processing unit is responsible for a row block and stores the dot products of its matrix rows with the vector x to the corresponding locations in the result y. When the SIMD units of a physical processing unit are available, the SIMD reduction sum operation can be utilized for higher efficiency. These two methods are respectively known as the CSR-scalar and the CSR-vector algorithms, and have been implemented on CPUs [184] and GPUs [21, 170]. Algorithm 28 and Algorithm 29 show the parallel CSR-scalar and CSR-vector methods, respectively. Despite the good parallelism, exploiting scalability on modern multi-processors is not trivial for the row block methods. The performance problems mainly come from load imbalance for matrices that consist of rows with uneven lengths. Specifically, if one single row of a matrix is significantly longer than the other rows, only a single core can be fully used while the other cores of the same chip may be completely idle.

Algorithm 28 Parallel SpMV using the CSR-scalar method.
1: for i = 0 to m − 1 in parallel do
2:     y[i] ← 0
3:     for j = row ptr[i] to row ptr[i + 1] − 1 do
4:         y[i] ← y[i] + val[j] × x[col idx[j]]
5:     end for
6: end for

Algorithm 29 Parallel SpMV using the CSR-vector method.
1: for i = 0 to m − 1 in parallel do
2:     for j = row ptr[i] to row ptr[i + 1] − 1 in parallel do
3:         tmp ← val[j] × x[col idx[j]]
4:     end for
5:     len ← row ptr[i + 1] − row ptr[i]
6:     y[i] ← REDUCTION SUM PARALLEL(tmp, len)    . Algorithm 4
7: end for

Although various strategies, such as data streaming [51, 80], memory coalescing [58], data reordering or reconstruction [16, 84, 149], static or dynamic binning [7, 80], Dynamic Parallelism [7] and dynamic row distribution [123], have been developed, none of them can fundamentally solve the problem of load imbalance, and they thus provide relatively low SpMV performance for the CSR format.

8.3.2 Segmented Sum Algorithm

Blelloch et al. [28] pointed out that the segmented sum may be more attractive for the CSR-based SpMV, since it is SIMD friendly and insensitive to the sparsity structure of the input matrix, thus overcoming the shortcomings of the row block methods. A serial version of segmented sum is shown in Algorithm 12. In the SpMV operation, the segmented sum treats each matrix row as a segment and calculates a partial sum for the entry-wise products generated in each row. The SpMV operation using the segmented sum method consists of four steps: (1) generating an auxiliary bit flag array of size nnz from the row ptr array, where an entry in bit flag is flagged as TRUE if its location matches the first nonzero entry of a row and FALSE otherwise, (2) calculating all intermediate entries (i.e., the entry-wise products) into an array of size nnz, (3) executing the parallel segmented sum on the array, and (4) collecting all partial sums into the result vector y if a row is not empty. Algorithm 30 lists the pseudocode. Figure 8.2 illustrates an example using the matrix A plotted in Figure 8.1. We can see that once the heaviest workload, i.e., step 3, is parallelized through a fast segmented sum method described in [40, 63, 161], nearly perfect load balance can be expected in all steps of Algorithm 30.

Algorithm 30 Segmented sum method CSR-based SpMV.
 1: MALLOC(*bit_flag, nnz)
 2: MEMSET(*bit_flag, FALSE)
 3: for i = 0 to m − 1 in parallel do        ▷ Step 1
 4:     bit_flag[row_ptr[i]] ← TRUE
 5: end for
 6: MALLOC(*product, nnz)
 7: for j = 0 to nnz − 1 in parallel do      ▷ Step 2
 8:     product[j] ← val[j] × x[col_idx[j]]
 9: end for
10: SEGMENTED_SUM(*product, nnz, *bit_flag)  ▷ Step 3, Algorithm 12
11: for k = 0 to m − 1 in parallel do        ▷ Step 4
12:     if row_ptr[k] = row_ptr[k + 1] then
13:         y[k] ← 0
14:     else
15:         y[k] ← product[row_ptr[k]]
16:     end if
17: end for
18: FREE(*bit_flag)
19: FREE(*product)
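The four steps above can be transliterated into the following serial C sketch; a simple in-place serial segmented sum stands in for the parallel primitive of Algorithm 12, and the function name is illustrative only.

#include <stdlib.h>

/* Serial sketch of the segmented-sum CSR SpMV (Algorithm 30). A parallel
 * implementation would replace the loops and the segmented sum with their
 * parallel counterparts. */
void spmv_csr_segsum(int m, int nnz, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y)
{
    char *bit_flag = calloc(nnz, 1);                 /* Step 1: mark segment heads */
    for (int i = 0; i < m; i++)
        if (row_ptr[i] < nnz) bit_flag[row_ptr[i]] = 1;

    double *product = malloc(nnz * sizeof(double));  /* Step 2: entry-wise products */
    for (int j = 0; j < nnz; j++)
        product[j] = val[j] * x[col_idx[j]];

    /* Step 3: serial segmented sum; each segment's total ends up at its head. */
    for (int j = nnz - 2; j >= 0; j--)
        if (!bit_flag[j + 1]) product[j] += product[j + 1];

    for (int k = 0; k < m; k++)                      /* Step 4: gather per-row sums */
        y[k] = (row_ptr[k] == row_ptr[k + 1]) ? 0.0 : product[row_ptr[k]];

    free(bit_flag);
    free(product);
}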

Figure 8.1: A CSR matrix of size 4 × 4.

Figure 8.2: CSR-based SpMV using segmented sum.

8.3.3 Performance Considerations

Figure 8.3 gives a performance comparison of the row block method from the cuSPARSE v6.5 library and the segmented sum method from the cuDPP v2.2 [76, 161] library. It shows that the row block method significantly outperforms the segmented sum method when running SpMV on relatively regular matrices (see Table A.1 for details of the used benchmark suite). On the other hand, the row block method only gives very low performance for irregular matrices.

Figure 8.3: Single precision SpMV performance.

Why is this the case? Step 1 of the segmented sum method is a scatter operation and step 4 is a gather operation, both in the row space of size m. This prevents the two steps from being fused with steps 2 and 3, which work in the nonzero entry space of size nnz. The extra global synchronizations and global memory accesses may degrade the overall performance. Previous research [21, 170] has found that the segmented sum may be more suitable for the COO-based SpMV, since the fully stored row index data convert steps 1 and 4 into the nonzero entry space: the bit_flag array can be generated by comparing neighboring row indices, and the partial sums in the product array can be directly saved to y, since their final locations are easily obtained from the row index array. Further, Yan et al. [188] and Tang et al. [173] reported that some variants of the COO format can also benefit from the segmented sum. However, it is well known that accessing row indices in the COO pattern brings higher off-chip memory pressure, which is exactly what the CSR format tries to avoid. In Figure 8.3, we can also see that the CSR5-based SpMV obtains up to 4x speedup over the CSR-based SpMV using the segmented sum primitive [161]. The main reason is that the CSR5-based SpMV utilizes both the segmented sum for load balance and the compressed row data for better load/store efficiency. The details are shown below.

8.4 CSR-Based SpMV

8.4.1 Algorithm Description

Data Decomposition

In the proposed CSR-based SpMV, we first evenly decompose the nonzero entries of the input matrix into multiple small tiles for load balanced data parallelism. Here we define a tile as a 2D array of size W × T. The width T is the size of a thread-bunch, which is the minimum SIMD execution unit in a given vector processor; it is also known as a wavefront on AMD GPUs or a warp on nVidia GPUs. The height W is the workload (i.e., the number of nonzero entries to be processed) of a thread. A tile is the basic work unit in the matrix-based segmented sum method [161, 63], which is used as a building block in our SpMV algorithm. Actually, the term “tile” is equivalent to the term “matrix” used in the original description of the segmented scan algorithms [161, 63]. Here we use “tile” to avoid confusion between a work unit of matrix shape and a sparse matrix in SpMV. Since a thread-bunch can be too small (e.g., as low as 8 on current Intel GPUs) to amortize scheduling cost, we combine multiple thread-bunches into one thread-group for potentially higher throughput. We let B denote the number of thread-bunches in one thread-group. Additionally, we let each thread-bunch compute S contiguous tiles, so that higher on-chip resource reuse and faster global synchronization can be expected. Therefore, each thread-group deals with BSWT nonzero entries, and the whole nonzero entry space of size nnz can be evenly assigned to ⌈nnz/(BSWT)⌉ thread-groups. Figure 8.4 shows an example of the data decomposition. In this example, we set B = 2, S = 2, W = 4, and T = 2. Thus each thread-group is responsible for 32 nonzero entries, and ⌈nnz/32⌉ thread-groups are dispatched.

Figure 8.4: Data decomposition on the nonzero entry space. nnz nonzero entries are assigned to multiple thread-groups. In this case, each thread-group consists of 2 thread-bunches (i.e., B = 2). The number of threads in each thread-bunch is equal to 2 (i.e., T = 2). The workload per thread is 4 (i.e., W = 4). The number of iterative steps in each thread-bunch is 2 (i.e., S = 2). 98 8.4. CSR-BASED SPMV
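The decomposition arithmetic is straightforward; the small helper below (an illustrative sketch, not part of the reference implementation) computes the number of dispatched thread-groups from the parameters above.

/* Number of thread-groups needed so that each group handles B*S*W*T
 * nonzero entries, rounding up to cover the whole nonzero entry space. */
static int num_thread_groups(int nnz, int B, int S, int W, int T)
{
    int entries_per_group = B * S * W * T;   /* e.g., 2*2*4*2 = 32 in Figure 8.4 */
    return (nnz + entries_per_group - 1) / entries_per_group;
}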

Speculative Segmented Sum

Our CSR-based SpMV builds on the fundamental segmented sum algorithm, which guarantees load balanced computation in the nonzero entry space. While utilizing segmented sum as a building block in our SpMV algorithm, we have three main performance considerations: (1) the segment descriptor needs to be generated in on-chip memory at runtime to reduce the overhead of global memory access, (2) empty rows must be recognized and processed without calling specific pre- and post-processing functions, and (3) both types of cores in a heterogeneous processor should be exploited. Hence we improve the basic segmented sum method to meet the above performance requirements. The algorithm framework includes two main stages: (1) a speculative execution stage, and (2) a checking prediction stage. The first stage speculatively executes the SpMV operation and generates a possibly incorrect resulting vector y. Here the term “incorrect” means that the layout of entries in y can be incorrect, but the entries themselves are guaranteed to be numerically correct. In the second stage we then check whether or not the speculative execution was successful. If the prediction is wrong, a data re-arrangement is launched to obtain a completely correct y. We first give an example of our algorithm and use it in the following algorithm description. Figure 8.5 plots this example. The input sparse matrix includes 12 rows (2 of them empty) and 48 nonzero entries. We set B to 1, S to 2, T to 4 and W to 6. This setting means that one thread-group is composed of one thread-bunch of size 4, and each thread-bunch runs 2 iteration steps. Before the GPU kernel launch, three containers, synchronizer, dirty_counter and speculator, are pre-allocated in DRAM for global synchronization and speculative execution. Algorithm 31 lists pseudocode of the first stage. The speculative execution stage includes the following steps: (1) positioning a range of row indices for the nonzero entries in a given tile, (2) calculating the segment descriptor based on the range, (3) conducting the segmented sum on the tile, and (4) saving partial sums to the computed index range in vector y. This stage also has some input-triggered operations, such as labeling a tile with empty rows.

Algorithm 31 The SPMD pseudocode of a thread-bunch in the speculative execution stage of the CSR-based SpMV.
 1: function SPECULATIVE_EXECUTION_GPU()
 2:     tb ← GET_THREAD_GLOBALID() / T
 3:     // positioning row indices of tiles in the thread-bunch
 4:     for i = 0 to S do
 5:         boundary[i] ← tb × S × W × T + i × W × T
 6:         tile_offset[i] ← BINARY_SEARCH(*row_pointer, boundary[i])
 7:     end for
 8:     // iterative steps in a thread-bunch
 9:     for i = 0 to S − 1 do
10:         start ← tile_offset[i]
11:         stop ← tile_offset[i + 1]

12:         MEMSET(*descriptor, FALSE)
13:         dirty ← FALSE
14:         // calculating segment descriptor
15:         for j = start to stop − 1 do
16:             if row_pointer[j] ≠ row_pointer[j + 1] then
17:                 descriptor[row_pointer[j] − boundary[i]] ← TRUE
18:             else
19:                 dirty ← TRUE
20:             end if
21:         end for
22:         // collecting element-wise products
23:         for j = 0 to W × T − 1 do
24:             x_value ← x[column_index[boundary[i] + j]]
25:             product[j] ← x_value × value[boundary[i] + j]
26:         end for
27:         // transmitting a value from the previous tile
28:         if descriptor[0] = FALSE then
29:             product[0] ← product[0] + transmitter
30:             descriptor[0] ← TRUE
31:         end if
32:         // segmented sum
33:         SEGMENTED_REDUCTION(*product, *descriptor, *ts, *tc)
34:         // calculating index offsets in y
35:         *y_index ← EXCLUSIVE_SCAN(*tc)
36:         // saving partial sums to y
37:         for j = 0 to T − 1 do
38:             for k = 0 to tc[j] − 1 do
39:                 index ← start + y_index[j] + k
40:                 // first segment of the thread-bunch
41:                 if index = tile_offset[0] then
42:                     synchronizer[tb].idx ← index
43:                     synchronizer[tb].val ← product[j × W + ts[j] + k]
44:                 else
45:                     // storing to y directly
46:                     y[index] ← product[j × W + ts[j] + k]
47:                 end if
48:                 if index = stop then
49:                     transmitter ← product[j × W + ts[j] + k]
50:                 end if
51:             end for
52:         end for
53:         // labeling a dirty tile
54:         if dirty = TRUE then
55:             pos ← ATOMIC_INCREMENT(*dirty_counter)
56:             speculator[pos] ← ⟨start, stop⟩
57:             transmitter ← 0
58:         end if
59:     end for
60: end function
61:

62: function SYNCHRONIZATION_CPU()
63:     for i = 0 to ⌈nnz/(S × W × T)⌉ − 1 do
64:         index ← synchronizer[i].idx
65:         value ← synchronizer[i].val
66:         y[index] ← y[index] + value
67:     end for
68: end function

Figure 8.5: An example of our CSR-based SpMV algorithm. The input sparse matrix contains 48 nonzero entries in 12 rows (10 non-empty rows and 2 empty rows). One thread-bunch composed of 4 threads is launched in this 2-iteration process. The arrays synchronizer and speculator store tuples (shown with angular brackets).

First, each thread-bunch executes binary searches of the S + 1 tile boundaries on the CSR row pointer array. We then obtain the corresponding row indices and store them in a scratchpad array tile_offset of size S + 1. The results of the binary searches are the starting and ending row indices of the nonzero entries in each tile; thus each tile knows the locations to store its generated partial sums. Lines 3–7 of Algorithm 31 give a code expression of this step. In our example shown in Figure 8.5, the 2 tiles of size 24 have 3 boundaries {0, 24, 48}. The results of the binary search of {0, 24, 48} on the CSR row pointer array are {0, 4, 12}. Note that the binary search needs to return the rightmost match if multiple slots have the same value. Then each thread-bunch executes an iteration of S steps. Lines 8–59 of Algorithm 31 give a code expression of this step. Each iteration deals with one tile. By calculating the offsets between the left boundary of a tile and the covered row indices, a local segment descriptor is generated (lines 14–21 in Algorithm 31). For example, the left boundary of the second tile is 24 and its row index range is 4–12. We need to compute the offsets between 24 and the row pointer entries {19, 27, 29, 29, 34, 37, 37, 44, 48}, which gives a group of offsets {−5, 3, 5, 5, 10, 13, 13, 20, 24}. After removing duplicate values and the values overflowing on the left and right sides, the effective part {3, 5, 10, 13, 20} in fact implies the local segment descriptor for the current tile. We can easily convert it to a binary expression {0, 0, 0, 1, 0, 1, 0, ..., 0, 0, 1, 0, 0, 0} through a scatter operation in on-chip scratchpad memory. Moreover, since each tile is an independent work unit, the first bit of its segment descriptor should be TRUE. Thus the final expression becomes {1, 0, 0, 1, 0, 1, 0, ..., 0, 0, 1, 0, 0, 0}. In Figure 8.5, the filled and empty circles are the heads (i.e., 1s or TRUEs) and bodies (i.e., 0s or FALSEs) of the segments, respectively. While generating the segment descriptor, each thread detects whether or not its right neighbor wants to write to the same slot. If so (like the duplicate offset values {..., 5, 5, ...} and {..., 13, 13, ...} in the above example), we know that this tile contains at least one empty row, since an empty row is expressed as two contiguous indices of the same value in the CSR row pointer array. We then mark this tile as “dirty” (line 19 in Algorithm 31). Further, the dirty_counter array stored in DRAM is incremented by an atomic operation, and this tile's offsets are recorded in the speculator array (lines 53–58 in Algorithm 31). In our example, dirty_counter is 1 and the speculator array holds one pair of offsets {⟨4, 12⟩} (shown with angular brackets in Figure 8.5). Then we calculate and save the element-wise products in scratchpad memory, based on the nonzero entries' column indices, values and corresponding values in the vector x. Lines 22–26 of Algorithm 31 show a code expression of this step. When finished, we transmit the sum of the last segment to an intermediate space for the next iteration (lines 27–31 in Algorithm 31). In our example, the first tile's last value 5e is transmitted to the next tile. Then we execute the matrix-based segmented sum (lines 32–33) on the tile. Because the segmented sum algorithm used here is very similar to the method described in [28], we refer the reader to [28] and several previous GPU segmented sum algorithms [161, 63] for details.
But note that, compared to [28], our method makes one change: we store the partial sums in a compact pattern (i.e., the values are arranged in order from the first location in the thread work space), instead of saving them to the locations of the corresponding segment heads. For this reason, we need to record the starting position and the number of partial sums. We can then use an ordinary exclusive scan operation (lines 34–35) to obtain contiguous indices of the partial sums in y. In Figure 8.5, we can see that the partial sums (shown as filled hexagons) are aggregated in this compact fashion. Note that the empty hexagons are intermediate partial sums, which have already been added to the correct positions of the segment heads. Finally, we store the partial sums to the known locations in the resulting vector. Lines 36–52 of Algorithm 31 show the code expression. As an exception, the sum of the first segment in a thread-bunch is stored to the synchronizer array (lines 40–43), since the first row of a thread-bunch may cross multiple thread-bunches. This is a well known issue when implementing basic primitives, such as reduction and prefix-sum scan, using more than one thread-group that cannot communicate with each other. In fact, an atomic add operation could be utilized to avoid the global synchronization. But we choose not to use relatively slow global atomic operations and instead let a CPU core finish the global synchronization later on. Lines 62–68 of Algorithm 31 show the corresponding code expression. Since the problem size (i.e., ⌈nnz/(SWT)⌉) can be too small to saturate a GPU core, a CPU core is in fact faster for accessing such short arrays linearly stored in DRAM. Taking the first tile in Figure 8.5 as an example, its first partial sum is 3a, which is stored with its global index 0 to the synchronizer. After that, the value 3a is added to position 0 of y. When the above steps are complete, the resulting vector is numerically correct, except that some values generated by dirty tiles are not in their correct locations. In Figure 8.5, we can see that after synchronization, vector y is already numerically identical to its final form, but the entries 5g, 3h, 7i and 4j generated by the second tile are located in the wrong slots. The checking prediction stage first checks the value of the dirty_counter array. If it is zero, the previous prediction is correct and the result of the first stage is the final result; if it is not zero, the predicted entries generated by the dirty tiles are scattered to their correct positions in the resulting vector. In this procedure, the CSR row pointer array needs to be read to obtain the correct row distribution information. Again, we use a CPU core for this irregular linear memory access, which is more suitable for the cache sub-systems in CPUs. In our example, the entries 5g, 3h, 7i and 4j are moved to their correct positions. Then the SpMV operation is done.
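The rightmost-match binary search used above (lines 3–7 of Algorithm 31, and again in the text for the example boundaries {0, 24, 48}) can be written compactly; the sketch below is illustrative and assumes a standard CSR row pointer array of length m + 1.

/* Returns the largest index i in [0, m] such that row_ptr[i] <= boundary,
 * i.e., a binary search that reports the rightmost match when several
 * entries of row_ptr hold the same value (as happens with empty rows). */
static int rightmost_search(const int *row_ptr, int m, int boundary)
{
    int lo = 0, hi = m;                /* search range over row_ptr[0..m] */
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;
        if (row_ptr[mid] <= boundary)
            lo = mid;                  /* keep moving right on ties */
        else
            hi = mid - 1;
    }
    return lo;
}

For the example above, rightmost_search with boundaries 0, 24 and 48 on the row pointer of Figure 8.5 returns 0, 4 and 12, matching the tile_offset values in the text.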

Complexity Analysis

Our CSR-based SpMV algorithm pre-allocates three auxiliary arrays, synchronizer, dirty_counter and speculator, in off-chip DRAM. The size of synchronizer is ⌈nnz/(SWT)⌉, equivalent to the number of thread-bunches. The size of dirty_counter is a constant 1. The speculator array needs a size of ⌈nnz/(WT)⌉, equivalent to the number of tiles. Since W and T are typically set to relatively large values, the auxiliary arrays only slightly increase the overall space requirement. For each thread-bunch, we execute S + 1 binary searches in the row pointer array of size m + 1. Thus the work complexity of this part is O(⌈nnz/(SWT)⌉ × (S + 1) × log2(m + 1)) = O(nnz log2(m)/(WT)). On the whole, generating the segment descriptors needs O(m) time. Collecting the element-wise products needs O(nnz) time. For each tile, the segmented sum needs O(WT + log2(T)) time, so all segmented sum operations need O(⌈nnz/(WT)⌉ (WT + log2(T))) = O(nnz + nnz log2(T)/(WT)) time. Saving entries to y needs O(m) time. Synchronization takes O(⌈nnz/(SWT)⌉) = O(nnz/(SWT)) time. A possible re-arrangement needs O(m) time in the worst case. Thus the overall work complexity of our CSR-based SpMV algorithm is O(m + nnz + nnz(log2(m) + log2(T))/(WT)).

Implementation Details

Based on the above analysis, we can see that when the input matrix is fixed, the cost of our SpMV algorithm only depends on two parameters: T and W. In our implementation, T is set to the SIMD length of the used processor. Choosing W needs to take the capacity of the on-chip scratchpad memory into account. The other two parameters, B and S, are chosen empirically. Table 8.1 shows the selected parameters. Note that double precision is currently not supported by the Intel OpenCL implementation for its GPUs.

Table 8.1: The selected parameters

Processor    Intel            AMD              AMD              nVidia           nVidia
Precision    32-bit single    32-bit single    64-bit double    32-bit single    64-bit double
T            8                64               64               32               32
W            16               16               8                8                4
B            4                2                2                5                5
S            6                2                5                7                7

We implement the first stage of our algorithm in OpenCL for the Intel and AMD platforms (and in CUDA for the nVidia platform) for GPU execution, and the second stage in standard C running on the CPU part. Since our algorithm requires the CPU and GPU to share some arrays, we allocate all arrays in the Shared Virtual Memory supported by OpenCL for the best performance. On the nVidia platform, we use Unified Memory from the CUDA SDK.

8.4.2 Experimental Results

Experimental Setup

We use three heterogeneous processors, an Intel Core i3-5010U, an AMD A10-7850K APU and an nVidia Tegra K1, for evaluating the CSR-based SpMV algorithms. Tables B.4 and B.5 show the specifications of the three processors. All of them are composed of multiple CPU cores and GPU cores. The platforms from AMD and nVidia are based on the design of Figure 3.5(a); the Intel platform uses the design of Figure 3.5(b). The two types of cores in the Intel heterogeneous processor share a 3 MB last level cache. In contrast, the GPU cores in the AMD heterogeneous processor can snoop the 4 MB L2 cache on the CPU side. Unlike those, the cache systems of the CPU part and the GPU part in the nVidia Tegra processor are completely separate. Note that currently the Intel GPU can run OpenCL programs only on the Microsoft Windows operating system. Also note that we use kB, MB and GB to denote 2^10, 2^20 and 2^30 bytes, respectively, and use GFlop to denote 10^9 flops. To analyze the efficiency of the proposed SpMV algorithm, we also benchmark parallel CSR-based SpMV using other libraries or methods on CPUs and GPUs. On CPUs, we execute three CSR-based SpMV approaches: (1) an OpenMP-accelerated basic row block method, (2) the pOSKI library [36] using OSKI [177] as a building block, and (3) Intel MKL v11.2 Update 2 in Intel Parallel Studio XE 2015 Update 2. The three approaches run on all CPU cores of the used heterogeneous processors. For the Intel CPU, we report results from MKL, since it always delivers the best performance and pOSKI is not supported on the used Microsoft Windows operating system. For the AMD CPU, we report the best results of the three libraries, since none of them outperforms all the others. For the ARM CPU included in the nVidia Tegra K1 platform, we only report results from OpenMP, since the current pOSKI and Intel MKL implementations do not support the ARM architecture. Moreover, a single-threaded naïve CPU implementation is included in our benchmark as well. On GPUs, we benchmark variants of the CSR-scalar and the CSR-vector algorithms proposed in [21]. The OpenCL version of the CSR-scalar method is extracted from PARALUTION v1.0.0 [126] and evaluated on the AMD platform. The OpenCL implementation of the CSR-vector method is extracted from semantically equivalent CUDA code in the CUSP library v0.4.0 and executed on both the Intel and the AMD platforms. On the nVidia platform, we run the CSR-based SpMV from the vendor-supplied cuSPARSE v6.0 and CUSP v0.4.0 libraries. For all tests, we run SpMV 200 times and record averages. The implicit data transfer (i.e., copying matrices and vectors from their sources to OpenCL Shared Virtual Memory or CUDA Unified Memory) is not included in our evaluation, since the SpMV operation is normally one building block of more complex applications. All participating methods conduct general SpMV, meaning that symmetry is not considered even though some input matrices are symmetric. The throughput (flops per second) is calculated by

(2 × nnz) / runtime.

The bandwidth (bytes per second) is calculated by

[(m + 1 + nnz) × sizeof(idx_type) + (nnz + nnz + m) × sizeof(val_type)] / runtime.
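These two metrics translate directly into code; the helper below (with illustrative names) evaluates both formulas for a CSR matrix with 32-bit indices and 64-bit values, which are assumptions, not a statement about the benchmark configuration.

#include <stdint.h>

/* GFlop/s and GB/s for one SpMV, following the two formulas above:
 * index traffic covers row_ptr and col_idx; value traffic covers the val
 * reads, the gathered x reads and the y writes. */
static void spmv_metrics(int64_t m, int64_t nnz, double runtime_s,
                         double *gflops, double *gbytes_per_s)
{
    double flops = 2.0 * (double)nnz;
    double bytes = (double)(m + 1 + nnz) * sizeof(int32_t)
                 + (double)(nnz + nnz + m) * sizeof(double);
    *gflops       = flops / runtime_s / 1e9;
    *gbytes_per_s = bytes / runtime_s / 1e9;
}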

Benchmark Suite

To evaluate our method, we choose 20 unstructured matrices from the benchmark suite. Table 8.2 lists the main information of the evaluated sparse matrices. The first 14 matrices of the benchmark suite have been widely used in previous work [21, 170, 188, 8, 120, 184]. The last 6 matrices are chosen as representatives of irregular matrices extracted from graph applications, such as circuit simulation and optimization problems. Appendix A gives details of the selected matrices. The first 10 matrices are relatively regular, because the average and maximum numbers of nonzero entries per row are close to each other. The other matrices are relatively irregular. In this context, ‘regular’ describes a sparse matrix whose rows have roughly the same length. In contrast, an ‘irregular’ matrix can have a few very long rows and many very short rows. For example, matrices generated from power-law graphs can have a few rows with O(n) nonzero entries and many rows with O(1) nonzero entries.

Id     Name               Dimensions     nnz      nnz per row (min, avg, max)
r1     Dense              2K×2K          4.0M     2K, 2K, 2K
r2     Protein            36K×36K        4.3M     18, 119, 204
r3     FEM/Spheres        83K×83K        6.0M     1, 72, 81
r4     FEM/Cantilever     62K×62K        4.0M     1, 64, 78
r5     Wind Tunnel        218K×218K      11.6M    2, 53, 180
r6     FEM/Harbor         47K×47K        2.4M     4, 50, 145
r7     QCD                49K×49K        1.9M     39, 39, 39
r8     FEM/Ship           141K×141K      7.8M     24, 55, 102
r9     Economics          207K×207K      1.3M     1, 6, 44
r10    Epidemiology       526K×526K      2.1M     2, 3, 4
i1     FEM/Accelerator    121K×121K      2.6M     0, 21, 81
i2     Circuit            171K×171K      959K     1, 5, 353
i3     Webbase            1M×1M          3.1M     1, 3, 4.7K
i4     LP                 4K×1.1M        11.3M    1, 2.6K, 56.2K
i5     ASIC_680k          683K×683K      3.9M     1, 6, 395K
i6     boyd2              466K×466K      1.5M     2, 3, 93K
i7     dc2                117K×117K      766K     1, 6, 114K
i8     ins2               309K×309K      2.8M     5, 8, 309K
i9     rajat21            412K×412K      1.9M     1, 4, 119K
i10    transient          179K×179K      962K     1, 5, 60K

Table 8.2: The benchmark suite.

Performance Analysis

Figures 8.6 and 8.7 show the throughput of single precision and double precision SpMV of the tested CSR-based approaches, respectively. In Figure 8.6, we can see that on the Intel heterogeneous processor, our approach obtains up to 6.90x and on average 2.57x speedup over the CSR-vector method running on the used GPU. Although the speedup mainly comes from the irregular matrices, our method generally does not lose noticeable performance on the regular matrices. Further, compared to the CPU cores running MKL, both GPU SpMV algorithms are slower. For our algorithm, the main reason is that the integrated GPU implements scratchpad memory in its L3 cache, which has one order of magnitude higher latency than the fast scratchpad memory in nVidia or AMD GPUs. Our algorithm in fact heavily uses scratchpad memory for storing and reusing the segment descriptor, the element-wise products and other data shared by threads. Thus, even though the GPU part of the Intel heterogeneous processor has higher single precision theoretical peak performance than its CPU part, the delivered SpMV throughput is lower than expected. For the CSR-vector method, the low performance has another reason: the small thread-bunch of size 8 dramatically increases loop overhead [15], which is one of the well known bottlenecks [74] of GPU programming.

Figure 8.6: Throughput (GFlop/s) of the single precision CSR-based SpMV algorithms running on the three platforms (one panel per matrix, plus the harmonic mean).

Figure 8.7: Throughput (GFlop/s) of the double precision CSR-based SpMV algorithms running on the AMD and the nVidia platforms (one panel per matrix, plus the harmonic mean).

In Figures 8.6 and 8.7, we can see that on the AMD heterogeneous processor, our method delivers up to 71.90x (94.05x) and on average 22.17x (22.88x) speedup over the single (double) precision CSR-scalar method running on the used GPU. Compared to the GPU CSR-vector method, our algorithm achieves up to 16.07x (14.43x) and on average 5.61x (4.47x) speedup. The CSR-scalar and the CSR-vector methods give very low throughput on the last 6 irregular matrices because of load imbalance. Further, we find that the GPU of the Intel heterogeneous processor is actually faster than the AMD GPU on the last 6 matrices. The reason is that the shorter thread-bunch (8 on the Intel GPU vs. 64 on the AMD GPU) reduces SIMD idle cost, since a much shorter vector width is executed for the dramatically imbalanced row distributions. On the other hand, for several very regular matrices with short rows, e.g., Epidemiology, the CSR-scalar method offers the best performance because of almost perfect load balance and execution of short rows without loop cost. For most regular matrices, our method delivers performance comparable to the best CPU algorithm. In Figures 8.6 and 8.7, we can also see that on the nVidia platform, our method delivers up to 5.91x (6.20x) and on average 2.69x (2.53x) speedup over the single (double) precision SpMV in the CUSP library running on the used GPU. Compared to cuSPARSE, our method has higher speedups. Since both libraries use the CSR-vector algorithm, these speedups are expected. Since the Tegra K1 platform contains only a single GPU core, load imbalance on this device is not as severe as on the AMD platform described above; as a result, the speedups are not as high as those on the AMD processor. Here our method delivers on average 1.41x (1.42x) speedup over the OpenMP-accelerated SpMV on the quad-core ARM CPU for the single (double) precision benchmark. Figure 8.8 shows the bandwidth utilization of the algorithm proposed in this section. We can see that the regular matrices can use bandwidth more efficiently than the irregular ones. Considering the throughput speedups listed above, our method obtains higher bandwidth utilization than the other CSR-based SpMV algorithms running on GPUs.

Parameter Selection

We further conduct experiments to explore how the selected parameters influence the overall performance. Figure 8.9 shows the dependency of the overall performance (harmonic means over the 20 benchmarks) on the parameters, while we fix all parameters except W (i.e., the workload per thread). We can see that in general the overall performance goes up as W increases. This trend matches the complexity analysis described above. However, when W is larger than a certain value, the overall performance degrades. The reason is that device occupancy may decrease when more on-chip scratchpad memory is allocated for the WT work space of each thread-bunch. Figure 8.10 shows the trend of the overall performance while we change parameter S (i.e., the number of iterations of each thread-bunch) and fix all the other parameters. We can see that if we assign more work to each thread-bunch, better performance can be expected. The performance improvement mainly comes from higher on-chip resource reuse.

(a) Single precision SpMV on Intel Core i3-5010U

(b) Single precision and double precision SpMV on AMD A10-7850K

(c) Single precision and double precision SpMV on nVidia Tegra K1

Figure 8.8: Bandwidth utilization (GB/s) of our CSR-based SpMV algorithm running on the three platforms. Theoretical bandwidths from the hardware specifications are marked with black lines.

8.5 CSR5-Based SpMV

8.5.1 Algorithm Description

Recall the CSR5 format described in Chapter 6: since the information (tile_ptr, tile_desc, col_idx and val) of its 2D tiles is independent between tiles, all tiles can execute concurrently. On GPUs, we assign a thread-bunch to each tile. On CPUs and Xeon Phi, we use an OpenMP pragma to assign the tiles to the available x86 cores. Furthermore, the columns inside a tile are independent of each other as well, so we assign a thread on GPU cores, or a SIMD lane on x86 cores, to each column in a tile.

(a) Intel, SP (b) AMD, SP (c) nVidia, SP (d) AMD, DP (e) nVidia, DP

Figure 8.9: Single precision (SP) and double precision (DP) SpMV performance of our algorithm on the three platforms while parameter W changes and all the others are fixed to the best observed values (see Table 8.1).

(a) Intel, SP (b) AMD, SP (c) nVidia, SP (d) AMD, DP (e) nVidia, DP

Figure 8.10: Single precision (SP) and double precision (DP) SpMV performance of our algorithm on the three platforms while parameter S changes and all the others are fixed to the best observed values (see Table 8.1).

While running the CSR5-based SpMV, each column in a tile extracts information from bit_flag and labels the segments in its local data with three colors: (1) red means a sub-segment unsealed from its top, (2) green means a completely sealed segment located in the middle, and (3) blue means a sub-segment unsealed from its bottom. As an exception, if a column is unsealed both from its top and from its bottom, it is colored red. Algorithm 32 shows the pseudocode of the CSR5-based SpMV algorithm. Figure 8.11 plots an example of this procedure. We can see that the green segments can directly save their partial sums to y without any synchronization, since the indices can be calculated using tile_ptr and y_offset. In contrast, the red and the blue sub-segments have to further add their partial sums together, since they are not complete segments. For example, the sub-segments B2, R2 and R3 in Figure 8.11 contribute to the same row, thus an addition is required. This addition operation needs the fast segmented sum shown in Algorithm 13 and Figure 4.3. Furthermore, if a tile has any empty rows, the empty_offset array is accessed to get the correct global indices in y.

Figure 8.11: The CSR5-based SpMV in a tile. Partial sums of the green segments are directly stored to y. The red and the blue sub-segments require an extra segmented sum before issuing off-chip write.

Regarding the synchronization among the tiles: since the same matrix row can be touched by multiple 2D tiles running concurrently, the first and the last segments of a tile need to be stored to y by an atomic add (or via a global auxiliary array as used in device-level reduction, scan or segmented scan [63, 161]). In Figure 8.11, the atomic add operations are highlighted by arrow lines with plus signs. For the last entries that do not form a complete tile (e.g., the last two nonzero entries of the matrix in Figure 6.2), we execute a conventional CSR-vector method after all of the complete 2D tiles have been consumed. Note that even though the last (i.e., incomplete) tile does not have tile_desc arrays, it can extract its starting position from tile_ptr. In Algorithm 32, we can see that the main computation (lines 5–21) only contains very basic arithmetic and logic operations that can easily be programmed on all mainstream processors with SIMD units. As the most complex part of our algorithm, the fast segmented sum operation (line 22) only requires a prefix-sum scan, which has been well studied and can be efficiently implemented using CUDA, OpenCL or x86 SIMD intrinsics.

8.5.2 Experimental Results

Experimental Setup

We evaluate the CSR5-based SpMV and 11 state-of-the-art formats and algorithms on four mainstream platforms: dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi. The platforms and participating approaches are shown in Table 8.3. The host of the two GPUs is a machine with an AMD A10-7850K APU, dual-channel DDR3-1600 memory and 64-bit Ubuntu Linux v14.04.

Algorithm 32 The CSR5-based SpMV for the tid-th tile.
 1: MALLOC(*tmp, ω)
 2: MEMSET(*tmp, 0)
 3: MALLOC(*last_tmp, ω)
 4: /* use empty_offset[y_offset[i]] instead of y_offset[i] for a tile with any empty rows */
 5: for i = 0 to ω − 1 in parallel do
 6:     sum ← 0
 7:     for j = 0 to σ − 1 do
 8:         ptr ← tid × ω × σ + j × ω + i
 9:         sum ← sum + val[ptr] × x[col_idx[ptr]]
10:         /* check bit_flag[i][j] */
11:         if /* end of a red sub-segment */ then
12:             tmp[i − 1] ← sum
13:             sum ← 0
14:         else if /* end of a green segment */ then
15:             y[tile_ptr[tid] + y_offset[i]] ← sum
16:             y_offset[i] ← y_offset[i] + 1
17:             sum ← 0
18:         end if
19:     end for
20:     last_tmp[i] ← sum                        // end of a blue sub-segment
21: end for
22: FAST_SEGMENTED_SUM(*tmp, ω, *seg_offset)     ▷ Algorithm 13
23: for i = 0 to ω − 1 in parallel do
24:     last_tmp[i] ← last_tmp[i] + tmp[i]
25:     y[tile_ptr[tid] + y_offset[i]] ← last_tmp[i]
26: end for
27: FREE(*tmp)
28: FREE(*last_tmp)

The host of the Xeon Phi is a machine with an Intel Xeon E5-2680 v2 CPU, quad-channel DDR3-1600 memory and 64-bit Red Hat Enterprise Linux v6.5. Detailed specifications of the four used devices are listed in Tables B.1, B.2 and B.3. Here we evaluate double precision SpMV, so the cuDPP library [76, 161], clSpMV [170] and yaSpMV [188] are not included, since they only support single precision floating point as the data type. Two recently published methods [103, 173] are not tested since their source code is not yet available to us. We use the OpenCL profiling scheme for timing SpMV on the AMD platform and record wall-clock time on the other three platforms. For all participating formats and algorithms, we evaluate SpMV 10 times (each time contains 1000 runs and records the average) and report the best observed result.

The testbeds and the participating formats and algorithms:

Dual-socket Intel Xeon E5-2667 v3:
  (1) The CSR-based SpMV in Intel MKL 11.2 Update 1.
  (2) BiCSB v1.2 using CSB [35] with bitmasked register blocks [33].
  (3) pOSKI v1.0.0 [36] using OSKI v1.0.1h [177, 178] kernels.
  (4) The CSR5-based SpMV implemented using OpenMP and AVX2 intrinsics.

An nVidia GeForce GTX 980:
  (1) The best CSR-based SpMV [21] from cuSPARSE v6.5 and CUSP v0.4.0 [51].
  (2) The best HYB [21] from the above two libraries.
  (3) BRC [8] with texture cache enabled.
  (4) ACSR [7] with texture cache enabled.
  (5) The CSR5-based SpMV implemented using CUDA v6.5.

An AMD Radeon R9 290X:
  (1) The CSR-vector method [21] extracted from CUSP v0.4.0 [51].
  (2) The CSR-Adaptive algorithm [80] implemented in ViennaCL v1.6.2 [154].
  (3) The CSR5-based SpMV implemented using OpenCL v1.2.

An Intel Xeon Phi 5110p:
  (1) The CSR-based SpMV in Intel MKL 11.2 Update 1.
  (2) The ESB [122] with dynamic scheduling enabled.
  (3) The CSR5-based SpMV implemented using OpenMP and MIC-KNC intrinsics.

Table 8.3: The testbeds and participating formats and algorithms.

Benchmark Suite In Table 8.4, we list 24 sparse matrices as our benchmark suite for all platforms. The first 20 matrices have been widely adopted in previous SpMV research [8, 21, 80, 122, 170, 184, 188]. The other 4 matrices are chosen since they have more diverse sparsity structures. Table A.1 in Appendix A gives more details of the matrices. To achieve a high degree of differentiation, we categorize the 24 matrices in Table 8.4 into two groups: (1) regular group with the upper 14 matrices, (2) irregular group with the lower 10 matrices. This classification is mainly based on the minimum, average and maximum lengths of the rows. Matrix dc2 is a representative of the group of irregular matrices. Its longest single row contains 114K nonzero entries, i.e., 15% nonzero entries of the whole matrix with 117K rows. This sparsity pattern challenges the design of efficient storage format and SpMV algorithm.

Isolated SpMV Performance

Figure 8.12 shows the double precision SpMV performance for the 14 regular matrices on the four platforms. We can see that, on average, all participating algorithms deliver comparable performance. On the CPU platform, Intel MKL obtains the best performance on average and the other three methods behave similarly. On the nVidia GPU, the CSR5 delivers the highest throughput. The ACSR format is slower than the others, because its binning strategy leads to non-coalesced memory access. On the AMD GPU, the CSR5 achieves the best performance.

Id     Name               Dimensions     nnz      nnz per row (min, avg, max)
r1     Dense              2K×2K          4.0M     2K, 2K, 2K
r2     Protein            36K×36K        4.3M     18, 119, 204
r3     FEM/Spheres        83K×83K        6.0M     1, 72, 81
r4     FEM/Cantilever     62K×62K        4.0M     1, 64, 78
r5     Wind Tunnel        218K×218K      11.6M    2, 53, 180
r6     QCD                49K×49K        1.9M     39, 39, 39
r7     Epidemiology       526K×526K      2.1M     2, 3, 4
r8     FEM/Harbor         47K×47K        2.4M     4, 50, 145
r9     FEM/Ship           141K×141K      7.8M     24, 55, 102
r10    Economics          207K×207K      1.3M     1, 6, 44
r11    FEM/Accelerator    121K×121K      2.6M     0, 21, 81
r12    Circuit            171K×171K      959K     1, 5, 353
r13    Ga41As41H72        268K×268K      18.5M    18, 68, 702
r14    Si41Ge41H72        186K×186K      15.0M    13, 80, 662
i1     Webbase            1M×1M          3.1M     1, 3, 4.7K
i2     LP                 4K×1.1M        11.3M    1, 2.6K, 56.2K
i3     Circuit5M          5.6M×5.6M      59.5M    1, 10, 1.29M
i4     eu-2005            863K×863K      19.2M    0, 22, 6.9K
i5     in-2004            1.4M×1.4M      16.9M    0, 12, 7.8K
i6     mip1               66K×66K        10.4M    4, 155, 66.4K
i7     ASIC_680k          683K×683K      3.9M     1, 6, 395K
i8     dc2                117K×117K      766K     1, 6, 114K
i9     FullChip           2.9M×2.9M      26.6M    1, 8, 2.3M
i10    ins2               309K×309K      2.8M     5, 8, 309K

Table 8.4: The benchmark suite.

Although the dynamic assignment in the CSR-Adaptive method can obtain better scalability than the CSR-vector method, it still cannot achieve near-perfect load balance. On the Xeon Phi, the CSR5 is slower than Intel MKL and the ESB format. The main reason is that the current generation of Xeon Phi can only issue up to 4 relatively slow threads per core (i.e., up to 4 × 60 threads in total on the used device), and thus the latency of gathering entries from the vector x becomes the main bottleneck. Reordering or partitioning the nonzero entries based on the column index for better cache locality therefore works well in the ESB-based SpMV. However, we will show later that this strategy leads to very high preprocessing cost. Figure 8.13 shows the double precision SpMV performance for the 10 irregular matrices. We can see that the irregularity can dramatically impact the SpMV throughput of some approaches. On the CPU platform, Intel MKL, which is based on the row block method, is now slower than the other methods. The CSR5 outperforms the others because of better SIMD efficiency from the AVX2 intrinsics. On the nVidia GPU, the CSR5 brings the best performance because of its near perfect load balance. The other two irregularity-oriented formats, HYB and ACSR, behave well but still suffer from imbalanced work decomposition.


Figure 8.12: The SpMV performance of the 14 regular matrices (nGPU = nVidia GPU, aGPU = AMD GPU).

Note that the ACSR format is based on Dynamic Parallelism, a technical feature only available on recently released nVidia GPUs. On the AMD GPU, the CSR5 greatly outperforms the other two algorithms, which use the row block methods. Because the minimum work unit of the CSR-Adaptive method is one row, the method delivers degraded performance for matrices with very long rows.¹ On the Xeon Phi, the CSR5 can greatly outperform the other two methods, in particular when matrices are too irregular to expose cache locality of x through the ESB format. Furthermore, since ESB is designed on top of the ELL format, it cannot obtain the best performance for some irregular matrices.

¹ Note that we use the implementation of CSR-Adaptive from the ViennaCL library. AMD's own version of CSR-Adaptive may have slightly different performance.


Figure 8.13: The SpMV performance of the 10 irregular matrices (nGPU = nVidia GPU, aGPU = AMD GPU).

Overall, the CSR5 achieves better performance (on the two GPU devices) or comparable performance (on the two x86 devices) for the 14 regular matrices. For the 10 irregular matrices, compared to pOSKI, ACSR, CSR-Adaptive and ESB as the respective second best methods, the CSR5 obtains average speedups of 1.18x, 1.29x, 2.73x and 3.93x (up to 3.13x, 2.54x, 5.05x and 10.43x), respectively.

Effects of Auto-Tuning

Recall the simple auto-tuning scheme for the parameter σ on GPUs discussed together with the CSR5 format. Figure 8.14 shows its effects (the x axis gives the matrix ids). We can see that, compared to the best performance chosen from the range σ = 4 to 48, the auto-tuned σ causes no obvious performance loss. On the nVidia GPU, the performance loss is on average −4.2%; on the AMD GPU, it is on average −2.5%.

(a) The nVidia GTX 980 GPU.

(b) The AMD R9-290X GPU.

Figure 8.14: Auto-tuning effects on the two GPUs.

Format Conversion Cost

The format conversion from the CSR to the CSR5 includes four steps: (1) memory allocation, (2) generating tile_ptr, (3) generating tile_desc, and (4) transposition of the col_idx and val arrays. Figure 8.15 shows the cost of the four steps for the 24 matrices (the x axis gives the matrix ids) on the four used platforms. The cost of a single SpMV operation is used for normalizing the format conversion cost on each platform. We can see that on the two GPUs the conversion cost is on average as low as the overhead of a few SpMV operations. On the two x86 platforms, the conversion time is longer (up to the cost of around 10–20 SpMV operations). The reason is that the conversion code is manually SIMDized using CUDA or OpenCL on the GPUs, but only auto-parallelized by OpenMP on the x86 processors.

Iteration-Based Scenarios

Since both the preprocessing (i.e., format conversion from a basic format) time and the SpMV time are important for real-world applications, we have designed an iteration-based benchmark. This benchmark measures the overall performance of a solver with n iterations. We assume the input matrix is already stored in the CSR format. The overall cost of using the CSR format in these scenarios is thus n · T_spmv^csr, where T_spmv^csr is the execution time of one CSR-based SpMV operation.

Figure 8.15: The normalized format conversion cost on (a) the CPU, (b) the nVidia GPU, (c) the AMD GPU and (d) the Xeon Phi.

For a new format, the overall cost is T_pre^new + n · T_spmv^new, where T_pre^new is the preprocessing (format conversion) time and T_spmv^new is the time of one SpMV operation using the new format. Thus the speedup of a new format over the CSR format in these scenarios is (n · T_spmv^csr) / (T_pre^new + n · T_spmv^new).
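The break-even point between staying with CSR and converting to a new format follows directly from this ratio; the helper below (with illustrative names) evaluates it for a given iteration count.

/* Speedup of a new format over CSR for an n-iteration solver:
 *   speedup = (n * t_spmv_csr) / (t_pre_new + n * t_spmv_new)
 * A value above 1.0 means the conversion pays off within n iterations. */
static double iteration_speedup(double n, double t_spmv_csr,
                                double t_pre_new, double t_spmv_new)
{
    return (n * t_spmv_csr) / (t_pre_new + n * t_spmv_new);
}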

Tables 8.5 and 8.6 show the new formats' preprocessing cost (i.e., the ratio T_pre^new / T_spmv^new) and their speedups over the CSR format in the iteration-based scenarios when n = 50 and n = 500. The emboldened entries in the tables show the highest positive speedups on each platform. The baseline is the fastest CSR-based SpMV implementation (i.e., Intel MKL, nVidia cuSPARSE/CUSP, CSR-vector from CUSP, and Intel MKL, respectively) on each platform. We can see that, because of its very low preprocessing overhead, the CSR5 can further outperform the previous methods when running 50 or 500 iterations. Although two GPU methods, the ACSR format and the CSR-Adaptive approach, in general have shorter preprocessing time, they suffer from lower SpMV performance and thus cannot obtain the best speedups. On all platforms, the CSR5 always achieves the highest overall speedups. Moreover, the CSR5 is the only format that obtains higher performance than the CSR format when only 50 iterations are required.

Benchmark: the 14 regular matrices
Metrics               Preprocessing     Speedup (#iter. = 50)     Speedup (#iter. = 500)
                      to SpMV ratio     avg        best           avg        best
CPU-BiCSB             538.01x           0.06x      0.11x          0.35x      0.60x
CPU-pOSKI             12.30x            0.43x      0.88x          0.57x      0.99x
CPU-CSR5              6.14x             0.52x      0.74x          0.59x      0.96x
nGPU-HYB              13.73x            0.73x      0.98x          0.92x      1.21x
nGPU-BRC              151.21x           0.26x      0.31x          0.80x      0.98x
nGPU-ACSR             1.10x             0.68x      0.93x          0.72x      1.03x
nGPU-CSR5             3.06x             1.04x      1.34x          1.10x      1.45x
aGPU-CSR-Adaptive     2.68x             1.00x      1.33x          1.07x      1.48x
aGPU-CSR5             4.99x             1.04x      1.39x          1.14x      1.51x
Phi-ESB               922.47x           0.05x      0.15x          0.33x      0.88x
Phi-CSR5              11.52x            0.54x      1.14x          0.65x      1.39x

Table 8.5: Preprocessing cost and its impact on the iteration-based scenarios.

Benchmark: the 10 irregular matrices
Metrics               Preprocessing     Speedup (#iter. = 50)     Speedup (#iter. = 500)
                      to SpMV ratio     avg        best           avg        best
CPU-BiCSB             331.77x           0.13x      0.24x          0.60x      1.07x
CPU-pOSKI             10.71x            0.62x      1.66x          0.83x      2.43x
CPU-CSR5              3.69x             0.91x      2.37x          1.03x      2.93x
nGPU-HYB              28.59x            1.86x      13.61x         2.77x      25.57x
nGPU-BRC              51.85x            1.17x      7.60x          2.49x      15.47x
nGPU-ACSR             3.04x             5.05x      41.47x         5.41x      51.95x
nGPU-CSR5             1.99x             6.43x      48.37x         6.77x      52.31x
aGPU-CSR-Adaptive     1.16x             3.02x      27.88x         3.11x      28.22x
aGPU-CSR5             3.10x             5.72x      135.32x        6.04x      141.94x
Phi-ESB               222.19x           0.27x      1.15x          1.30x      2.96x
Phi-CSR5              9.45x             3.43x      18.48x         4.10x      21.18x

Table 8.6: Preprocessing cost and its impact on the iteration-based scenarios.

8.6 Comparison to Recently Developed Methods

In recent years, several new formats have been designed for the SpMV operation on various processor architectures. Because of less off-chip memory access and better on-chip memory localization, SpMV using block-based formats, such as OSKI [177, 179, 178], pOSKI [36], CSB [35, 33], BELLPACK [45], BCCOO/BCCOO+ [188], BRC [8] and RSB [129], attracted the most attention. However, block-based formats heavily rely on the sparsity structure, meaning that the input matrix is required to have a block structure matching the potential block layout. Therefore, block-based formats are mainly suitable for matrices generated from scientific computation problems, but may not fit irregular matrices generated from graph applications. The methods proposed in this chapter are insensitive to the sparsity structure of the input matrix, and thus generally achieve better performance. A lot of research has focused on improving the row block method for CSR-based SpMV. Williams et al. [184] proposed multiple optimization techniques for SpMV on multicore CPUs and the Cell B.E. processor. Nishtala et al. [139] designed a high-level data partitioning method for SpMV to achieve better cache locality on multicore CPUs. Pichel et al. [147] evaluated how reordering techniques influence the performance of SpMV on GPUs. Baskaran and Bordawekar [16] improved the off-chip and on-chip memory access patterns of SpMV on GPUs. Reguly and Giles [151] improved thread cooperation for better GPU cache utilization. Ashari et al. [7] utilized static reordering and the Dynamic Parallelism scheme offered by nVidia GPUs for fast SpMV operation. Greathouse et al. [80] grouped contiguous rows for better runtime load balancing on GPUs. LightSpMV [123] dynamically distributes matrix rows over warps for more balanced CSR-based SpMV without generating auxiliary data structures, and implements this approach using atomic operations and warp shuffle functions as the fundamental building blocks. However, again, the row block methods cannot achieve good performance for input matrices with dramatically imbalanced row distributions. In contrast, our methods are independent of the sparsity structure of the input matrix. As mentioned, using the segmented sum method as a building block is potentially a better generic approach for the CSR-based SpMV. An early segmented sum GPU SpMV was introduced by Sengupta et al. [161] and Garland [76] and implemented in the cuDPP library [86], but the cost of the segmented sum and the global memory accesses degrade the overall SpMV performance. Zhang [195] improved backward segmented scan for better cache efficiency and implemented the CSR-based SpMV on multicore CPUs. Recently, nVidia's Modern GPU library [18] implemented an improved reduction method, which has been used as a back-end of cuDPP; however, its performance still suffers from pre- and post-processing empty rows in global memory space. The segmented sum methods have been used in two recently published papers [173, 188] for SpMV on either GPUs or Xeon Phi; however, both of them need to store the matrix in COO-like formats to utilize the segmented sum. Our CSR-based SpMV method, in contrast, uses scratchpad memory more efficiently and utilizes the two types of cores in a heterogeneous processor for better workload distribution. Moreover, the CSR5 format saves useful row index information in a compact way, and thus can be more efficient both for the format conversion and for the SpMV operation.
Compared with the CSR5 work designed for cross-platform SpMV on CPUs, GPUs and Xeon Phi, our CSR-based SpMV approach does not need to perform any format conversion or generate any auxiliary data for the input CSR matrix. Considering that the format conversion from the CSR to the CSR5 merely costs a few SpMV operations, the CSR5-based SpMV and the CSR-based SpMV can each find their own application scenarios, such as solvers with different numbers of iterations.

9. Level 3: Sparse Matrix-Matrix Operations

9.1 Overview

General matrix-matrix multiplication (GEMM) [114, 23, 128] is one of the most crucial operations in computational science and modeling. The operation multiplies a matrix A of size m × k with a matrix B of size k × n and gives a resulting matrix C of size m × n. In many linear solvers and graph problems, such as the algebraic multigrid method (AMG) [20], breadth first search [79], finding shortest paths [39], colored intersection [95] and sub-graphs [175], it is required to exploit the sparsity of the two input matrices and the resulting matrix, because their dense forms normally need huge storage space and computation cost for the zero entries. Therefore SpGEMM becomes a common building block in these applications. Compared to CPUs, modern graphics processing units (GPUs) promise much higher peak floating-point performance and memory bandwidth. Thus a lot of research has concentrated on GPU accelerated sparse matrix-dense vector multiplication [21, 120, 121] and sparse matrix-dense matrix multiplication [143, 176] and achieved relatively attractive performance. However, despite the prior achievements on these GPU sparse BLAS routines, the massive parallelism in GPUs is still significantly underused for the SpGEMM algorithm, because it has to handle three more challenging problems: (1) the number of nonzero entries in the resulting matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the resulting matrix dominate the execution time, and (3) load balancing must account for sparse data in both input matrices with diverse sparsity structures. Previous GPU SpGEMM methods [20, 57, 140, 51, 52, 83] have proposed a few solutions for the above problems and demonstrated relatively good time and space complexity. However, the experimental results showed that they either only work best for fairly regular sparse matrices [57, 140, 83], or bring extra high memory overhead for matrices with some specific sparsity structures [20, 51, 52]. Moreover, none of these methods can consistently outperform a well optimized SpGEMM approach [92] for multicore CPUs.


9.2 Contributions

The work described in this chapter particularly focuses on improving GPU SpGEMM performance for matrices with arbitrary irregular sparsity structures by proposing more efficient methods to solve the above three problems on GPUs and emerging CPU-GPU heterogeneous processors. This chapter makes the following contributions:

• A 4-stage framework for implementing SpGEMM on manycore platforms including homogeneous GPUs and heterogeneous processors composed of CPU cores, GPU cores and shared virtual memory. This framework effectively organizes memory allocation, load balancing and GPU kernel launches.

• A hybrid method that initially allocates memory of upper bound size for short rows and progressively allocates memory for long rows. The experimental results show that our method saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory.

• An efficient parallel insert method for long rows of the resulting matrix by using the fastest merge algorithm available on GPUs. We make an experimental evaluation and choose the GPU merge path algorithm from five candidate GPU merge approaches.

• A load balancing oriented heuristic method that assigns rows of the resulting matrix to multiple bins with different subsequent computational methods. Our approach guarantees load balancing in all calculation steps.

Our framework and the corresponding algorithms deliver excellent performance in two experimental scenarios: (1) calculating triple matrix Galerkin products (i.e., P^T AP) in AMG for 2D and 3D Poisson problems, and (2) computing matrix squaring (i.e., A^2) on a benchmark suite composed of 23 sparse matrices with diverse sparsity structures. In the context of Galerkin products, our method consistently outperforms the state-of-the-art GPU SpGEMM methods in the two vendor supplied libraries cuSPARSE and CUSP. Average speedups of 1.9x (up to 2.6x) and 1.7x (up to 2.7x) are achieved when compared to cuSPARSE and CUSP, respectively. In the context of matrix squaring, more comparison methods are included. First, on two nVidia GPUs (i.e., a GeForce GTX Titan Black and a GeForce GTX 980), compared with cuSPARSE and CUSP, our approach delivers on average 3.1x (up to 9.5x) and 4.6x (up to 9.9x) speedups, respectively. Second, compared to a recently developed CUDA-specific SpGEMM method, RMerge [83], our method offers on average 2.5x (up to 4.9x) speedup on the nVidia GeForce GTX 980 GPU. Third, compared to the SpGEMM method in the latest Intel Math Kernel Library (MKL) on a six-core Xeon E5-2630 CPU with quad-channel system memory, our method gives on average 2.4x (up to 5.2x) and 2.1x (up to 4.2x) speedups on the nVidia GeForce GTX 980 GPU and an AMD Radeon R9 290X GPU, respectively. Furthermore, our approach can utilize re-allocatable memory controlled by CPU-GPU heterogeneous processors. On an AMD A10-7850K heterogeneous processor,

compared to merely using its GPU cores, our framework delivers on average 1.2x (up to 1.8x) speedup while utilizing re-allocatable shared virtual memory in the system.

9.3 Basic Method

9.3.1 Gustavson's Algorithm

For the sake of generality, the SpGEMM algorithm description starts from a discussion of the GEMM and gradually takes the sparsity of the matrices A, B and C into consideration. For the matrix A, we write aij to denote the entry in the ith row and the jth column of A and ai∗ to denote the vector consisting of the ith row of A. Similarly, the notation a∗j denotes the jth column of A. In the GEMM, the ith row of the resulting matrix C can be defined by

\[
c_{i*} = (a_{i*} \cdot b_{*1},\; a_{i*} \cdot b_{*2},\; \ldots,\; a_{i*} \cdot b_{*p}),
\]

where the operation $\cdot$ is the dot product of two vectors. We first give consideration to the sparsity of the matrix A. Without loss of generality, we assume that the ith row of A consists of only two nonzero entries, in the kth and the lth columns, respectively. Thus $a_{i*}$ becomes $(a_{ik}, a_{il})$. Since all other entries are zeros, we do not record them explicitly and ignore their influence on the dot products in the calculation of the ith row of C. Then we obtain

\[
c_{i*} = (a_{ik}b_{k1} + a_{il}b_{l1},\; a_{ik}b_{k2} + a_{il}b_{l2},\; \ldots,\; a_{ik}b_{kp} + a_{il}b_{lp}).
\]

We can see that in this case only entries in the kth and the lth rows of B contribute to the ith row of C. Therefore the row vector form, instead of the column vector form, is used for the matrix B, and we obtain

\[
c_{i*} = a_{ik}b_{k*} + a_{il}b_{l*}.
\]

Since the matrix B is sparse as well, again without loss of generality, we assume that the kth row of B has only two nonzero entries, in the rth and the tth columns, and the lth row of B also has only two nonzero entries, in the sth and the tth columns. So the two rows are given by $b_{k*} = (b_{kr}, b_{kt})$ and $b_{l*} = (b_{ls}, b_{lt})$. Then

\[
c_{i*} = a_{ik}(b_{kr}, b_{kt}) + a_{il}(b_{ls}, b_{lt}).
\]

Because the matrix C is also sparse and the ith row of C only has three nonzero entries in the rth, the sth and the tth column, the row can be given by

\[
c_{i*} = (c_{ir}, c_{is}, c_{it}),
\]

where $c_{ir} = a_{ik}b_{kr}$, $c_{is} = a_{il}b_{ls}$ and $c_{it} = a_{ik}b_{kt} + a_{il}b_{lt}$.

In general there are more nonzero entries per row of the matrices A, B and C, but the above derivation shows that the SpGEMM can be represented by operations on row vectors of the matrices. Therefore, in this work we store all sparse matrices in the CSR format. The compressed sparse column (CSC) format is also widely used for sparse matrices stored in column-major order [77]. The SpGEMM in the CSC format is almost the same as in the CSR format, except that rows are exchanged with columns and vice versa. The above CSR-based SpGEMM algorithm is given as pseudocode in Algorithm 33. An early description of this algorithm was given by Gustavson [85].

Algorithm 33 Pseudocode for the SpGEMM.

 1: for each a_i* in the matrix A do
 2:     set c_i* to ∅
 3:     for each nonzero entry a_ij in a_i* do
 4:         load b_j*
 5:         for each nonzero entry b_jk in b_j* do
 6:             value ← a_ij b_jk
 7:             if c_ik ∉ c_i* then
 8:                 insert c_ik to c_i*
 9:                 c_ik ← value
10:             else
11:                 c_ik ← c_ik + value
12:             end if
13:         end for
14:     end for
15: end for
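For reference, the row-by-row structure of Algorithm 33 can also be expressed as a short serial C++ sketch over CSR arrays. This is a minimal, illustrative sketch only: the Csr container and the use of std::map as a sparse accumulator are simplifications, not the GPU implementation described later in this chapter.

```cpp
#include <map>
#include <vector>

// Minimal CSR container (0-based indices); illustrative only.
struct Csr {
    int rows = 0, cols = 0;
    std::vector<int> row_ptr;   // size rows + 1
    std::vector<int> col_idx;   // size nnz
    std::vector<double> val;    // size nnz
};

// Serial Gustavson SpGEMM: C = A * B, computed one row of C at a time.
Csr spgemm_gustavson(const Csr& A, const Csr& B) {
    Csr C;
    C.rows = A.rows;
    C.cols = B.cols;
    C.row_ptr.push_back(0);
    for (int i = 0; i < A.rows; ++i) {
        std::map<int, double> acc;                 // sparse accumulator for c_i*
        for (int pa = A.row_ptr[i]; pa < A.row_ptr[i + 1]; ++pa) {
            int j = A.col_idx[pa];
            double a_ij = A.val[pa];
            // Scale row b_j* by a_ij and merge it into the accumulator;
            // operator[] performs the insert-or-accumulate of lines 7-11.
            for (int pb = B.row_ptr[j]; pb < B.row_ptr[j + 1]; ++pb)
                acc[B.col_idx[pb]] += a_ij * B.val[pb];
        }
        for (const auto& kv : acc) {               // emit the ordered row c_i*
            C.col_idx.push_back(kv.first);
            C.val.push_back(kv.second);
        }
        C.row_ptr.push_back(static_cast<int>(C.col_idx.size()));
    }
    return C;
}
```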

9.3.2 Performance Considerations

Memory Pre-Allocation For the Resulting Matrix

Compared to SpGEMM, other sparse matrix multiplication operations (e.g., multiplication of a sparse matrix and a dense matrix [143, 176, 158] and its special case, sparse matrix-vector multiplication [21, 120, 121, 184, 33]) pre-allocate a dense resulting matrix or vector. Thus the size of the result of the multiplication is trivially predictable, and the corresponding entries are stored to predictable memory addresses. However, because the number of nonzero entries in the resulting sparse matrix C is unknown in advance, precise memory allocation for the SpGEMM is impossible before the real computation. Moreover, the physical address of each new entry is unknown as well (consider line 7 in Algorithm 33: the position k is only a column index that cannot trivially be mapped to a physical address in memory). To solve this problem, previous SpGEMM algorithms proposed four different solutions: (1) the precise method, (2) the probabilistic method, (3) the upper bound method, and (4) the progressive method.

The first method, the precise method, pre-computes a simplified SpGEMM in the same computational pattern. We can imagine that multiplication of sparse boolean matrices is more efficient than multiplication of sparse floating-point matrices. The RMerge algorithm and the SpGEMM methods in cuSPARSE and MKL are representatives of this approach. Even though the pre-computation generates the precise size of nnz(C), this method is relatively expensive since the SpGEMM operation in the same pattern is executed twice.

The second method, the probabilistic method, estimates an imprecise nnz(C). This group of approaches [4, 47, 144] is based on random sampling and probability analysis on the input matrices. Since they do not guarantee a safe lower bound for the resulting matrix C, and extra memory has to be allocated when the estimation fails, they have mostly been used for estimating the shortest execution time of multiplication of multiple sparse matrices.

The third method, the upper bound method, computes an upper bound on the number of nonzero entries in the resulting matrix C and allocates the corresponding memory space. Numerically, the upper bound size equals nnz(Ĉ), or half of flops, the number of necessary arithmetic operations. The ESC algorithms use this method for memory pre-allocation. Even though this approach saves the cost of the pre-computation in the precise method, it brings another problem: the intermediate matrix Ĉ may be too large to fit in the device global memory. Since the SpGEMM algorithm does not take into consideration cancellation that eliminates zero entries generated by arithmetic operations, the resulting matrix is normally larger than the input matrices. Table 9.2 shows that nnz(Ĉ) is much larger than nnz(C) when squaring some matrices. For example, the sparse matrix Wind Tunnel generates 626.1 million nonzero entries (or 7.5 GB of memory space for 32-bit indices and 64-bit values) for the intermediate matrix Ĉ, while the real product C (i.e., $A^2$) only contains 32.8 million nonzero entries. Although the upper bound method can partition the intermediate matrix Ĉ into multiple sub-matrices, the higher global memory pressure may reduce overall performance.

The last method, the progressive method, first allocates memory of a proper size, starts the sparse matrix computation, and re-allocates the buffer if larger space is required. Some CPU sparse matrix libraries use this method. For instance, sparse matrix computation in Matlab [77] increases the buffer by a ratio of 50% if the current memory space is exhausted.

Since the upper bound method sacrifices space efficiency for the sake of improved performance and the progressive method is good at saving space, we use a hybrid method composed of both approaches. However, compared to the relatively convenient upper bound method, it is hard to directly implement a progressive method for discrete GPUs. The reason is that although modern GPU devices are able to allocate device global memory while kernels are running, they still cannot re-allocate device memory on the fly. We will describe our hybrid method designed for discrete GPUs in the next section. On the other hand, emerging heterogeneous processors, composed of multiple CPU cores and GPU cores in one chip, supply both flexibility and efficiency. In this architecture, integrated GPU cores can directly use system memory allocated by the CPU part. Then data transfer through connection interfaces such as the PCIe link can be avoided, to obtain higher performance [82]. This gives our SpGEMM algorithm a chance to let integrated GPUs use re-allocatable system memory for better overall performance. Later on, we will show the corresponding performance gain using an AMD APU.
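The progressive strategy can be illustrated by a simple host-side buffer that grows whenever it runs out of room. The sketch below is illustrative only (the class name and interface are made up for this example); the 1.5x growth factor mimics the Matlab-style policy cited above, and the 2x factor matches the doubling our hybrid method uses for long rows.

```cpp
#include <cstddef>
#include <vector>

// Progressive output buffer: start with a guess, grow when exhausted.
// The growth factor is a tunable parameter; 1.5 mimics a Matlab-style
// policy, 2.0 matches the doubling used for long rows in our hybrid scheme.
template <typename T>
class ProgressiveBuffer {
public:
    explicit ProgressiveBuffer(std::size_t initial, double growth = 2.0)
        : data_(initial), used_(0), growth_(growth) {}

    // Append one entry, re-allocating (and copying) if the buffer is full.
    void push(const T& x) {
        if (used_ == data_.size()) {
            std::size_t larger =
                static_cast<std::size_t>(data_.size() * growth_) + 1;
            data_.resize(larger);   // on a discrete GPU this step would be
                                    // allocate-new + copy + free-old
        }
        data_[used_++] = x;
    }

    std::size_t size() const { return used_; }
    const T* data() const { return data_.data(); }

private:
    std::vector<T> data_;
    std::size_t used_;
    double growth_;
};
```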

Parallel Insert Operations

As shown in Algorithm 33, for each trivial arithmetic computation (line 6), one much more expensive insert operation (lines 7–11) is required. To the best of our knowledge, none of the previous GPU SpGEMM methods takes into account that the input sequence (line 4) is ordered because of the CSR format. One of our algorithm design objectives is to efficiently utilize this property. Based on experiments by Kim et al. [98], as SIMD units become wider and wider, merge sort methods will outperform hash table methods on the join-merge problem, which is similar to the insert problem in the SpGEMM. Our problem thus reduces to finding a fast GPU method for merging sorted sequences. Later on we will describe our strategy in detail.
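The core observation, that inserting a scaled CSR row into an already ordered result row is a merge of two sorted sequences, can be sketched serially as follows. The C++ below is an illustrative two-pointer merge only; the GPU implementation performs the same merge with the parallel merge path approach chosen from the five candidates mentioned above.

```cpp
#include <utility>
#include <vector>

// One ordered (column, value) sequence, e.g. a partially built row of C.
using SortedRow = std::vector<std::pair<int, double>>;

// Merge a scaled input row (alpha * b_j*) into an ordered result row,
// accumulating values that hit the same column index. Serial sketch of
// the "insert as merge" idea.
SortedRow merge_row(const SortedRow& result, const SortedRow& input,
                    double alpha) {
    SortedRow out;
    out.reserve(result.size() + input.size());
    std::size_t p = 0, q = 0;
    while (p < result.size() && q < input.size()) {
        if (result[p].first < input[q].first) {
            out.push_back(result[p++]);
        } else if (result[p].first > input[q].first) {
            out.push_back({input[q].first, alpha * input[q].second});
            ++q;
        } else {  // duplicate column: accumulate instead of inserting
            out.push_back({result[p].first,
                           result[p].second + alpha * input[q].second});
            ++p; ++q;
        }
    }
    for (; p < result.size(); ++p) out.push_back(result[p]);
    for (; q < input.size(); ++q)
        out.push_back({input[q].first, alpha * input[q].second});
    return out;
}
```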

Load Balancing

Because the distribution patterns of nonzero entries in both input sparse matrices can be very diverse (consider the plots of the matrices in Table 9.2), input-space-based data decomposition [57, 171] normally does not bring efficient load balancing. One exception is computing SpGEMM for huge sparse matrices on large-scale distributed memory systems, where 2D and 3D decompositions of the input space have demonstrated good load balancing and scalability by utilizing efficient communication strategies [79, 13, 35]. However, in this chapter we mainly consider load balancing for fine-grained parallelism in GPU and CPU-GPU shared memory architectures. Therefore we use the other group of load balancing methods, based on output space decomposition. Dalton et al. [52] presented a method that sorts the rows of the intermediate matrix Ĉ, divides it into 3 sub-matrices that include the rows in different size ranges, and uses differentiated ESC methods for the sub-matrices. We have a similar consideration, but our implementation is completely different. We do not strictly sort the rows of the intermediate matrix Ĉ but just assign rows to a fixed number of bins through a much faster linear-time traversal on the CPU. Moreover, we decompose the output space in a more detailed way that guarantees much more efficient load balancing. We will demonstrate that our method is load balanced in all stages, for maximizing resource utilization of GPUs.

9.4 CSR-Based SpGEMM

9.4.1 Framework

Our SpGEMM framework includes four stages: (1) calculating upper bound, (2) binning, (3) computing the resulting matrix, and (4) arranging data. Figure 9.1 plots this framework.

Figure 9.1: The SpGEMM framework composed of four stages.

9.4.2 Algorithm Design

The first stage, calculating upper bound, generates the upper-bound number of nonzero entries in each row of the resulting matrix C. We create an array U of size m, where m is the number of rows of C, to hold the upper-bound sizes of the rows. We use one GPU thread for computing each entry of the array U. Algorithm 34 describes this procedure.

Algorithm 34 Pseudocode for the first stage on GPUs.

1: for each entry u_i in U in parallel do
2:     u_i ← 0
3:     for each nonzero entry a_ij in a_i* do
4:         u_i ← u_i + nnz(b_j*)
5:     end for
6: end for
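On the host side the same computation reads directly off the CSR row pointers of A and B. The OpenMP loop below mirrors the one-thread-per-row structure of Algorithm 34; it is a sketch under the CSR layout used throughout this chapter, and the function name is illustrative.

```cpp
#include <vector>

// Stage 1: per-row upper bound on nnz of C = A * B.
// u[i] = sum over the nonzeros a_ij of row i of nnz(b_j*).
// Each iteration of the outer loop is independent, exactly like the
// one-thread-per-row GPU kernel of Algorithm 34.
std::vector<int> upper_bound_rows(const std::vector<int>& a_row_ptr,
                                  const std::vector<int>& a_col_idx,
                                  const std::vector<int>& b_row_ptr) {
    const int m = static_cast<int>(a_row_ptr.size()) - 1;
    std::vector<int> u(m, 0);
    #pragma omp parallel for
    for (int i = 0; i < m; ++i) {
        int ui = 0;
        for (int p = a_row_ptr[i]; p < a_row_ptr[i + 1]; ++p) {
            int j = a_col_idx[p];
            ui += b_row_ptr[j + 1] - b_row_ptr[j];   // nnz(b_j*)
        }
        u[i] = ui;
    }
    return u;
}
```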

The second stage, binning, deals with load balancing and memory pre-allocation. We first allocate 38 bins and put them into five bin groups. The bins contain the indices of the entries in the array U and are stored as one array of size m with 38 segments. Then all rows are assigned to the corresponding bins according to their numbers of nonzero entries. Finally, based on the sizes of the bins, we allocate a temporary matrix for the nonzero entries of the resulting matrix C.

The first bin group includes one bin that contains the indices of the rows of size 0. The second bin group also has only one bin, which contains the indices of the rows of size 1. Because the rows in the first two bins only require trivial operations, they are excluded from the subsequent, more complex computation on GPUs. Thus better load balancing can be expected.

The third bin group is composed of 31 bins that contain the indices of the rows of sizes 2–32, respectively. Since the sizes of these rows are no more than the size of a single thread bunch (32 in current nVidia GPUs or 64 in current AMD GPUs) and these rows require non-trivial computation, using one thread bunch or one thread group for each row cannot deliver efficient instruction throughput on GPUs. Therefore, we use one thread for each row. Further, because each bin only contains rows of the same upper-bound size, the bins can be executed separately on GPUs with different kernel programs for efficient load balancing. In other words, 31 GPU kernel programs will be executed for the 31 bins, if they are not empty.

The fourth bin group consists of 4 bins that contain the indices of the rows located in the size ranges 33–64, 65–128, 129–256 and 257–512, respectively. The rows of these sizes are grouped for three reasons: (1) each of them is large enough to be efficiently executed by a thread group, (2) each of them is small enough for scratchpad memory (48 kB per core in current nVidia Kepler GPUs, 96 kB per core in current nVidia Maxwell GPUs and 64 kB per core in current AMD Graphics Core Next, or GCN, GPUs), and (3) the final sizes of these rows in the resulting matrix C are predictable within a reasonably small range (no less than the lower bound of size 1 and no more than the corresponding upper-bound sizes). Even though the rows in each bin do not have exactly the same upper-bound size, good load balancing can still be expected, because each row is executed by one thread group and inter-thread-group load balancing is naturally guaranteed by the GPU low-level scheduling sub-systems.

The fifth bin group includes the last bin, which contains the indices of the remaining rows of size larger than 512. These rows have two common features: (1) their sizes can be too large (recall nnzr(Ĉ) in Table 9.2) to fit in the scratchpad memory, and (2) predicting the final sizes of these rows to a small range (scratchpad memory level) is not possible in advance. Therefore, we execute them with a unified progressive method described later. Again, because we use one thread group for each row, load balancing is naturally guaranteed.

Since we do not use the precise method for memory pre-allocation, a temporary memory space for the resulting matrix C is required. We design a hybrid method that allocates a CSR format sparse matrix C̃ of the same size as the resulting matrix C as the temporary matrix.
We set nnz($\tilde{c}_{i*}$) to $u_i$ while the row index i is located in the bin groups 1–4, because compared with modern GPU global memory capacity, the total space requirement of these rows is relatively small. For the rows in the bin group 5, we set nnz($\tilde{c}_{i*}$) to a fixed size of 256, since this is normally an efficient working size for the scratchpad memory. Therefore, if all of the row indices fall into the bin groups 1–4, our hybrid method converts to the upper bound method; at the other extreme, it converts to the progressive method. In general, we obtain the benefits of both individual methods. Stage 2 is executed on the CPU since it only requires a few simple linear-time traversals, which are more efficient on the CPU cache sub-systems. The pseudocode is shown in Algorithm 35.

The third stage, computing the resulting matrix, generates nnz($c_{i*}$) and the final nonzero entries stored in the temporary matrix C̃. For the rows in the bin groups 1–2, we simply update the numbers of corresponding nonzero entries. For the rows in the bin groups 3–5, we use three entirely different methods: (1) the heap method, (2) the bitonic ESC method, and (3) the merge method, respectively. Note that each bin has a counter (at the host side) that records the number of rows it includes, so the host can easily decide whether a GPU kernel should be issued for a certain bin. In other words, our approach only issues kernels for non-empty bins.

The heap method described in Section 7.4.1 is used for each row in the bin group 3. For the rows in each bin of the bin group 4, the typical ESC algorithm described in Section 7.4.2 is used. For the rows in the bin group 5, our method inserts each input nonzero entry into the corresponding row of the resulting matrix C (lines 7–11 in Algorithm 33) in parallel.

Algorithm 35 Pseudocode for the second stage on a CPU core.

 1: for each entry u_i in U do
 2:     if u_i = 0 then                         ▷ The 1st bin group
 3:         insert i to bin_0
 4:         nnz(c̃_i*) ← 0
 5:     else if u_i = 1 then                    ▷ The 2nd bin group
 6:         insert i to bin_1
 7:         nnz(c̃_i*) ← 1
 8:     else if u_i ≥ 2 && u_i ≤ 32 then        ▷ The 3rd bin group
 9:         insert i to bin_{u_i}
10:         nnz(c̃_i*) ← u_i
11:     else if u_i ≥ 33 && u_i ≤ 64 then       ▷ The 4th bin group
12:         insert i to bin_33
13:         nnz(c̃_i*) ← u_i
14:     else if u_i ≥ 65 && u_i ≤ 128 then      ▷ The 4th bin group
15:         insert i to bin_34
16:         nnz(c̃_i*) ← u_i
17:     else if u_i ≥ 129 && u_i ≤ 256 then     ▷ The 4th bin group
18:         insert i to bin_35
19:         nnz(c̃_i*) ← u_i
20:     else if u_i ≥ 257 && u_i ≤ 512 then     ▷ The 4th bin group
21:         insert i to bin_36
22:         nnz(c̃_i*) ← u_i
23:     else if u_i > 512 then                  ▷ The 5th bin group
24:         insert i to bin_37
25:         nnz(c̃_i*) ← 256
26:     end if
27: end for
28: nnz(C̃) ← Σ_i nnz(c̃_i*)
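The bin layout of Algorithm 35 boils down to a small mapping from the upper-bound size u_i to one of the 38 bins, plus the temporary row size used for the hybrid allocation. The two helpers below make that mapping explicit; they are an illustrative sketch using the fixed thresholds of this chapter (other GPU generations may warrant different cut-offs).

```cpp
// Map an upper-bound row size u to one of the 38 bins of Algorithm 35.
// Bins 0 and 1:  trivial rows (sizes 0 and 1).
// Bins 2..32:    one bin per size in [2, 32], one thread per row.
// Bins 33..36:   size ranges (32,64], (64,128], (128,256], (256,512],
//                one thread group per row in scratchpad memory.
// Bin 37:        sizes > 512, handled by the progressive merge method.
int bin_of(int u) {
    if (u <= 32)   return u;     // bin groups 1-3: bin index equals size
    if (u <= 64)   return 33;    // bin group 4
    if (u <= 128)  return 34;
    if (u <= 256)  return 35;
    if (u <= 512)  return 36;
    return 37;                   // bin group 5
}

// Temporary row size in the CSR temporary matrix C~ (hybrid allocation):
// the upper bound for bin groups 1-4, a fixed 256 for bin group 5.
int temp_row_size(int u) { return u > 512 ? 256 : u; }
```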

Because the resulting rows in the bin group 5 may involve more entries, the merge method described in Section 7.4.3 is used. As we allocate a limited scratchpad memory space for the resulting sequence of the bin group 5, a potential overflow may happen. In this case, we first compare the total size of the two sequences (note that the input sequence is in the thread registers, not in the scratchpad memory yet) with the allocated size of the resulting sequence in the scratchpad memory. If a merge operation is not possible, our method records the current computation position as a checkpoint and dumps the resulting sequence from the scratchpad memory to the global memory. The host then allocates more global memory (we use 2x each time) and re-launches the kernel with a 2x larger scratchpad memory setting. The re-launched kernels obtain the checkpoint information, load the existing results back into the scratchpad memory and continue the computation. The global memory dumping and reloading bring an extra overhead, but in practice it does not noticeably affect the total execution time, for three reasons: (1) the global memory accesses are almost completely coalesced, (2) the latency can be hidden by subsequent computation, and (3) this overhead is only a small fraction of a large computation (short rows normally do not face this problem). For very long rows that exceed the scratchpad memory capacity, our method still allocates a space in the scratchpad memory as a level-1 merge sequence, executes the same merge operations on it, and merges the level-1 sequence in the scratchpad memory with the resulting sequence in the global memory only once, before the kernel returns.

It is worth noting that the parameters of the binning depend on the specifications (e.g., thread bunch size and scratchpad memory capacity) of the GPU architectures. In this chapter, we use the above fixed-size parameters for assigning the rows into the bins, since current nVidia GPUs and AMD GPUs have comparable hardware specifications. However, the strategies in stages 2 and 3 can easily be extended for future GPUs with changed architecture designs.

The fourth stage, arranging data, first sums the numbers of nonzero entries in all rows of the resulting matrix C and allocates its final memory space. Then our method copies the existing nonzero entries from the temporary matrix C̃ to the resulting matrix C. For the rows in the bin group 1, the copy operation is not required. For the rows in the bin group 2, we use one thread for each row. For the rest of the rows, in the bin groups 3–5, we use one thread group for each row. After all copy operations, the SpGEMM computation is done.
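Stage 4 is essentially an exclusive prefix-sum over the final per-row sizes followed by a gather from the temporary matrix. The serial C++ sketch below captures that structure under stated assumptions (the function name and parameters are illustrative; on the GPU the scan and the copies are of course parallel).

```cpp
#include <vector>

// Stage 4: build the final CSR row pointer from the per-row counts
// nnz(c_i*) produced in stage 3, then copy each row from the temporary
// matrix C~ into its final, tightly packed position.
void arrange_data(const std::vector<int>& row_nnz,          // nnz(c_i*)
                  const std::vector<int>& tmp_row_ptr,      // row starts in C~
                  const std::vector<int>& tmp_col_idx,
                  const std::vector<double>& tmp_val,
                  std::vector<int>& c_row_ptr,
                  std::vector<int>& c_col_idx,
                  std::vector<double>& c_val) {
    const int m = static_cast<int>(row_nnz.size());
    c_row_ptr.assign(m + 1, 0);
    for (int i = 0; i < m; ++i)                   // exclusive prefix sum
        c_row_ptr[i + 1] = c_row_ptr[i] + row_nnz[i];
    c_col_idx.resize(c_row_ptr[m]);
    c_val.resize(c_row_ptr[m]);
    for (int i = 0; i < m; ++i) {                 // per-row gather
        int src = tmp_row_ptr[i];
        int dst = c_row_ptr[i];
        for (int k = 0; k < row_nnz[i]; ++k) {
            c_col_idx[dst + k] = tmp_col_idx[src + k];
            c_val[dst + k]     = tmp_val[src + k];
        }
    }
}
```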

9.5 Experimental Results

We use four platforms (one CPU and three GPUs), shown in Table 9.1, for evaluating the SpGEMM algorithms. Tables B.1 and B.2 list the specifications of the hardware used. The host side of all GPUs is a quad-core 3.7 GHz CPU in an AMD A10-7850K APU with 8 GB DDR3-1600 dual-channel system memory and 64-bit Ubuntu Linux 14.04.

The testbeds                        The participating formats and algorithms
Intel Xeon E5-2630                  (1) Intel MKL v11.0.
nVidia GeForce GTX Titan Black      (1) CUSP v0.4.0 [51]. (2) cuSPARSE v6.5 [140].
                                    (3) RMerge [83]. (4) bhSPARSE (this work).
nVidia GeForce GTX 980              (1) CUSP v0.4.0 [51]. (2) cuSPARSE v6.5 [140].
                                    (3) RMerge [83]. (4) bhSPARSE (this work).
AMD Radeon R9 290X                  (1) bhSPARSE (this work).

Table 9.1: The testbeds and participating algorithms.

9.5.1 Galerkin Products

Testing Scenario

Calculating Galerkin products plays an important role in AMG. We use a smoothed aggregation preconditioner with a Jacobi smoother (described in [20] and implemented in the CUSP library [51]) as a test scenario for evaluating the SpGEMM algorithms. In each level of an AMG hierarchy in this context, we multiply three sparse matrices $P^T$, A and P, where the rectangular matrix $P^T$ is a restriction operator, the square matrix A is initially the system matrix, and the rectangular matrix P is a prolongation operator.
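In code, the triple product reduces to two pairwise SpGEMM calls, and the bracketing decides which intermediate product is materialized. The snippet below is illustrative only: it reuses the Csr type and the spgemm_gustavson sketch from Section 9.3.1 and assumes Pt already holds the explicit transpose of P; the measurements in this section use the GPU libraries listed in Table 9.1, not this serial code.

```cpp
// Galerkin product A_coarse = P^T * A * P, computed as two pairwise SpGEMMs.
//   (P^T A) P  materializes the product P^T A first,
//   P^T (A P)  materializes A P first.
Csr galerkin_left(const Csr& Pt, const Csr& A, const Csr& P) {
    Csr PtA = spgemm_gustavson(Pt, A);   // (P^T A)
    return spgemm_gustavson(PtA, P);     // (P^T A) P
}

Csr galerkin_right(const Csr& Pt, const Csr& A, const Csr& P) {
    Csr AP = spgemm_gustavson(A, P);     // (A P)
    return spgemm_gustavson(Pt, AP);     // P^T (A P)
}
```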

Performance Comparison

Figures 9.2 and 9.3 show the execution time of the Galerkin products $P^TAP$ in constructing an AMG hierarchy (typically including 3–5 levels) for a smoothed aggregation preconditioner in single precision and double precision, respectively. The input system matrix A is from a 2D 5-point, 2D 9-point, 3D 7-point or 3D 27-point Poisson problem, respectively. The two 2D problems have dimensions 1024 × 1024 and generate system matrices of size 1048576 × 1048576. The two 3D problems have dimensions 101 × 101 × 101 and generate system matrices of size 1030301 × 1030301. The SpGEMM approaches in three libraries, CUSP v0.4.0, cuSPARSE v6.5 and bhSPARSE¹, are tested on nVidia GeForce GTX Titan Black and GeForce GTX 980 GPUs. To obtain the best SpGEMM performance, CUSP uses the coordinate (COO) format for its input matrices. The other two libraries use the CSR format. Because the operation multiplies three sparse matrices $P^T$, A and P, the order of multiplication may influence overall performance. Here we test the two possible orders $(P^TA)P$ and $P^T(AP)$. In our experiments, matrix data transfer time between the host and the device is not included, since the SpGEMM is normally one of the building blocks of more complex problems running completely on GPUs.

In Figures 9.2 and 9.3, we can see that our method is consistently faster than the SpGEMM algorithms in the other two libraries. When using the system matrix from the 3D 27-point Poisson problem, bhSPARSE delivers up to 2.6x and up to 2.7x speedups over cuSPARSE and CUSP, respectively. On average, speedups of 1.9x and 1.7x are achieved when compared with the above two libraries, respectively. As for the order of multiplication, our method in general gives better performance when computing $P^T(AP)$ than $(P^TA)P$. In contrast, the order of multiplication does not bring an obvious performance difference for CUSP. When cuSPARSE is used, $(P^TA)P$ delivers better throughput for the two 2D problems, but degrades throughput for the two 3D problems.

9.5.2 Matrix Squaring

Benchmark Suite

We also evaluate the multiplication of a sparse square matrix with itself (i.e., $C = A^2$) to avoid introducing another sparse matrix as a multiplier with a different sparsity structure.

¹We call our library bhSPARSE since this work is under the Project Bohrium [104].

Figure 9.2: Execution time (in milliseconds) comparison of single precision SpGEMM (SpSGEMM) from the three libraries CUSP, cuSPARSE and bhSPARSE in the context of a smoothed aggregation preconditioner with Jacobi smoother. The system matrices are from four Poisson problems. Both $(P^TA)P$ and $P^T(AP)$ are tested on two nVidia GPUs. Panels: (a) 2D 5-point, (b) 2D 9-point, (c) 3D 7-point, (d) 3D 27-point.

We choose 23 sparse matrices as our benchmark suite. 16 of them were widely used for performance evaluations in previous sparse matrix computation research [120, 121, 57, 52, 83, 92, 158, 184, 34]. The other 7 new matrices are chosen because they bring more diverse irregular sparsity structures that challenge the SpGEMM algorithm design. The sparsity structures come from many application fields, such as finite element methods, macroeconomic models, protein data, circuit simulation, web connectivity and combinatorial problems. All 23 matrices are downloadable from the University of Florida Sparse Matrix Collection [56]. Note that symmetry in the sparse matrices is not used in our SpGEMM algorithm, although some matrices in the benchmark suite are symmetric. Also note that we use the standard CSR format, which does not use a symmetric storage pattern.

Figure 9.3: Execution time (in milliseconds) comparison of double precision SpGEMM (SpDGEMM) from the three libraries CUSP, cuSPARSE and bhSPARSE in the context of a smoothed aggregation preconditioner with Jacobi smoother. The system matrices are from four Poisson problems. Both $(P^TA)P$ and $P^T(AP)$ are tested on two nVidia GPUs. Panels: (a) 2D 5-point, (b) 2D 9-point, (c) 3D 7-point, (d) 3D 27-point.

Besides the input matrix A, the work complexities of the different SpGEMM algorithms also depend on the intermediate matrix Ĉ and the resulting matrix C. We therefore list the characteristics of the three matrices in Table 9.2. The set of characteristics includes the matrix dimension (n), the number of nonzero entries (nnz) and the average number of nonzero entries per row (nnzr). The upper 9 matrices in the table have relatively regular nonzero entry distributions, mostly on the diagonal. The other 14 matrices include various irregular sparsity structures.

Table 9.2: Overview of the sparse matrices for benchmarking matrix squaring. Here nnz(Ĉ) is the upper bound size of $A^2$; numerically, nnz(Ĉ) equals half of flops, the number of necessary arithmetic operations while doing SpGEMM. nnz(C) is the number of nonzero entries in the resulting matrix $C = A^2$.

Id   Name              n       nnz(A), nnzr(A)   nnz(Ĉ), nnzr(Ĉ)   nnz(C), nnzr(C)
r1   FEM/Cantilever    63 K    4 M, 64           269.5 M, 4315     17.4 M, 279
r2   Economics         207 K   1.3 M, 6          7.6 M, 37         6.7 M, 32
r3   Epidemiology      526 K   2.1 M, 4          8.4 M, 16         5.2 M, 10
r4   Filter3D          106 K   2.7 M, 25         86 M, 808         20.2 M, 189
r5   Wind Tunnel       218 K   11.6 M, 53        626.1 M, 2873     32.8 M, 150
r6   FEM/Ship          141 K   7.8 M, 55         450.6 M, 3199     24.1 M, 171
r7   FEM/Harbor        47 K    2.4 M, 51         156.5 M, 3341     7.9 M, 169
r8   Protein           36 K    4.3 M, 119        555.3 M, 15249    19.6 M, 538
r9   FEM/Spheres       83 K    6 M, 72           463.8 M, 5566     26.5 M, 318
i1   2cubes sphere     102 K   1.6 M, 16         27.5 M, 270       9 M, 88
i2   FEM/Accelerator   121 K   2.6 M, 22         79.9 M, 659       18.7 M, 154
i3   Cage12            130 K   2 M, 16           34.6 M, 266       15.2 M, 117
i4   Hood              221 K   10.8 M, 49        562 M, 2548       34.2 M, 155
i5   M133-b3           200 K   0.8 M, 4          3.2 M, 16         3.2 M, 16
i6   Majorbasis        160 K   1.8 M, 11         19.2 M, 120       8.2 M, 52
i7   Mario002          390 K   2.1 M, 5          12.8 M, 33        6.4 M, 17
i8   Mono 500Hz        169 K   5 M, 30           204 M, 1204       41.4 M, 244
i9   Offshore          260 K   4.2 M, 16         71.3 M, 275       23.4 M, 90
i10  Patents main      241 K   0.6 M, 2          2.6 M, 11         2.3 M, 9
i11  Poisson3Da        14 K    0.4 M, 26         11.8 M, 871       3 M, 219
i12  QCD               49 K    1.9 M, 39         74.8 M, 1521      10.9 M, 222
i13  Circuit           171 K   1 M, 6            8.7 M, 51         5.2 M, 31
i14  Webbase           1 M     3.1 M, 3          69.5 M, 70        51.1 M, 51

Performance Comparison

The single precision and double precision absolute performance of the SpGEMM algorithms that compute $C = A^2$ is shown in Figures 9.4 and 9.5, respectively. Four GPU methods, from CUSP v0.4.0, cuSPARSE v6.5, RMerge [83] and bhSPARSE, are evaluated on three GPUs: the nVidia GeForce GTX Titan Black, the nVidia GeForce GTX 980 and the AMD Radeon R9 290X. One CPU method, in Intel MKL v11.0, is evaluated on an Intel Xeon E5-2630 CPU. The performance of another recent ESC-based GPU SpGEMM work [52] is not included in the comparison because its source code is not yet available to us. The Intel MKL SpGEMM program is multithreaded and utilizes all six cores of the Intel Xeon CPU. For the GPU algorithms, again, the host-device data transfer time is not included.

We first compare the performance of the four different GPU SpGEMM algorithms on the nVidia GPUs. We can see that bhSPARSE outperforms CUSP, cuSPARSE and RMerge on most sparse matrices in the benchmark suite.

Figure 9.4: Single precision SpGEMM (SpSGEMM) GFlop/s comparison. Panels (a)–(w) show the results for the 23 benchmark matrices listed in Table 9.2.

Figure 9.5: Double precision SpGEMM (SpDGEMM) GFlop/s comparison. Panels (a)–(w) show the results for the 23 benchmark matrices listed in Table 9.2.

Compared to the two vendor-supplied libraries, our method obtains better SpSGEMM and SpDGEMM performance on 21 and 21 matrices, respectively, out of the whole 23 matrices over CUSP, and on 19 and 21 matrices over cuSPARSE. Compared to RMerge, another CUDA-specific method, bhSPARSE achieves better SpSGEMM and SpDGEMM performance on 19 and 10 matrices on the GTX Titan Black GPU, and on 19 and 20 matrices on the GTX 980 GPU.

From the perspective of speedup, our method delivers on average 4.6x (up to 9.6x) and 3.1x (up to 8.8x) speedup on SpSGEMM performance over CUSP and cuSPARSE, and on average 4.6x (up to 9.9x) and 3.1x (up to 9.5x) speedup on SpDGEMM performance over them, respectively. Compared to RMerge, our method offers on average 1.4x (up to 2.5x) speedup and 2.8x (up to 4.9x) speedup for SpSGEMM and on average 1.0x (up to 1.5x) and 2.1x (up to 3.4x) speedup for SpDGEMM on the GTX Titan Black GPU and GTX 980 GPU, respectively.

We can see that the cuSPARSE method outperforms our approach when and only when the input matrices are fairly regular (i.e., among the first 9 matrices in Table 9.2). For all irregular matrices and some regular ones, our bhSPARSE is always more efficient. On the other hand, the absolute performance of the CUSP method is very stable, since its execution time almost only depends on the number of necessary arithmetic operations. This approach is therefore insensitive to sparsity structures. This insensitivity may actually bring better performance on matrices with some specific sparsity structures; in most cases, however, the CUSP method suffers from higher global memory pressure.

The RMerge method offers significant speedups over the other methods on three matrices (i.e., Epidemiology, M133-b3 and Mario002), which are characterized by short rows. However, for the other matrices, RMerge delivers relatively lower performance due to an imbalanced workload and high-overhead global memory operations between iterative steps. Further, we can see that since RMerge mainly relies on the computational power of the SIMD units, its performance decreases from the GTX Titan Black (2880 CUDA cores running at 889 MHz) to the GTX 980 (2048 CUDA cores running at 1126 MHz). In contrast, our method also depends on the capacity of the scratchpad memory. Thus bhSPARSE obtains better performance on the GTX 980 (1536 kB scratchpad) than on the GTX Titan Black (720 kB scratchpad).

Compared to Intel MKL on the Intel CPU, our CUDA-based implementation on the nVidia GPUs obtains better SpSGEMM and SpDGEMM performance on all 23 matrices, and delivers on average 2.5x (up to 5.2x) and 2.2x (up to 4.9x) SpSGEMM and SpDGEMM speedup, respectively. Our OpenCL-based implementation on the AMD Radeon R9 290X GPU obtains better SpSGEMM and SpDGEMM performance on 23 and 18 matrices, and delivers on average 2.3x (up to 4.2x) and 1.9x (up to 3.8x) SpSGEMM and SpDGEMM speedup, respectively.

The relative performance (harmonic mean) of the SpGEMM algorithms that compute $C = A^2$ is shown in Figure 9.6. We can see that our method in general delivers the best performance on the testbeds used while running the 23 matrices as a benchmark suite. If we set the Intel MKL SpGEMM performance in this scenario as a baseline, our approach is the only GPU SpGEMM that constantly outperforms the well-optimized CPU method.

Figure 9.6: Average (harmonic mean) relative performance comparison over the 23 matrices, using the SpGEMM method in MKL on an Intel Xeon E5-2630 as the baseline. Panels: (a) single precision SpGEMM, (b) double precision SpGEMM.

Memory Pre-allocation Comparison

Figure 9.7 shows the comparison of the three memory pre-allocation methods while benchmarking $C = A^2$. We can see that, for small matrices (e.g., 2cubes sphere), our hybrid method shows exactly the same space requirements as the upper bound method. However, for large matrices, the memory sizes allocated by our hybrid method are much closer to the memory sizes allocated by the precise method. Taking the matrix Protein as an example, our hybrid method requires 2.7x the memory space of the precise method, while the upper bound method needs 20.6x. One exception is the matrix Webbase, for which our hybrid method actually allocates more memory space than the upper bound method. The reasons are that the reduction rate from the intermediate matrix Ĉ to the resulting matrix C is very low (see Table 9.2) and that our 2x progression mechanism allocates memory just beyond the upper bound size. But overall, our hybrid method saves the space allocation of the upper bound method and the execution time of the precise method without introducing any significant extra space requirements.

Figure 9.7: Global memory requirement comparison of the precise method, our hybrid method and the upper bound method, when benchmarking $C = A^2$ on the 23 matrices. The memory requirement of the precise method includes the two input matrices and the resulting matrix. The memory requirements of the other two methods also contain additional intermediate matrices. "Hmean" refers to harmonic mean. Panels: (a) absolute memory requirement, (b) relative memory requirement.

9.5.3 Using Re-Allocatable Memory

For some matrices with relatively long rows in the bin group 5, our method dumps scratchpad data to global memory, allocates a larger memory block, copies the old data to the newly allocated block, reloads the values and continues processing. We have to perform this allocation/copy pair and pay its overhead because current GPUs are not able to re-allocate memory (i.e., change the size of the memory block a certain pointer points to). However, emerging heterogeneous processors with a shared virtual memory (or unified memory) address space make it possible to let integrated GPUs use system memory, which is re-allocatable from the CPU side.

We evaluated two memory allocation strategies (i.e., a typical allocation/copy approach and an improved re-allocation approach) of our OpenCL-based SpGEMM algorithm on the GPU part of the AMD A10-7850K APU (see Table B.5 in Appendix B for specifications). Figure 9.8 shows the results. We can see that re-allocatable memory brings on average 1.2x (up to 1.6x) speedup and on average 1.2x (up to 1.8x) speedup for SpSGEMM and SpDGEMM, respectively. Therefore, our GPU SpGEMM method may deliver further performance improvements on future GPUs with re-allocatable memory, or on emerging heterogeneous processors composed of CPU cores and GPU cores. Moreover, both the CPU cores and the GPU cores can be utilized for stage 3 in our framework. We leave this heterogeneous workload partitioning (similar to the methods described in [162, 164]) to future work.
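The difference between the two strategies is easiest to see on the host side: a discrete GPU has to emulate re-allocation with an allocate-new/copy/free sequence, whereas shared system memory lets the buffer grow in place when the underlying allocator permits it. The C-style sketch below is hedged and illustrative only (error handling omitted); the actual OpenCL buffer management in bhSPARSE differs in detail.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Strategy 1: allocate/copy, the only option on current discrete GPUs.
// A new, larger block is allocated, old results are copied over, and the
// old block is released.
double* grow_by_copy(double* old_buf, std::size_t old_n, std::size_t new_n) {
    double* new_buf =
        static_cast<double*>(std::malloc(new_n * sizeof(double)));
    std::memcpy(new_buf, old_buf, old_n * sizeof(double));
    std::free(old_buf);
    return new_buf;
}

// Strategy 2: re-allocation, possible when the buffer lives in system
// memory shared with the integrated GPU. realloc may extend the block in
// place and thereby skip the copy entirely.
double* grow_by_realloc(double* old_buf, std::size_t new_n) {
    return static_cast<double*>(std::realloc(old_buf, new_n * sizeof(double)));
}
```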

Figure 9.8: $C = A^2$ performance of bhSPARSE running with and without re-allocatable memory on an AMD A10-7850K APU. Note that only executable matrices that require memory re-allocation are included here. "Hmean" refers to harmonic mean. Panels: (a) single precision SpGEMM, (b) double precision SpGEMM.

9.6 Comparison to Recently Developed Methods

A classic CPU SpGEMM algorithm, also known as the Matlab algorithm, was proposed by Gilbert et al. [77]. This approach uses a dense vector-based sparse accumulator (or SPA) and takes O(flops + nnz(B) + n) time to complete the SpGEMM, where flops is defined as the number of necessary arithmetic operations on the nonzero entries, nnz(B) is the number of nonzero entries in the matrix B, and n is the number of rows/columns of the input square matrices. Matam et al. [130] developed a similar Matlab-algorithm implementation for GPUs. Sulatycke and Ghose [171] proposed a cache-hits-oriented algorithm that runs in a relatively longer time of $O(\mathrm{flops} + n^2)$. A fast serial SpGEMM algorithm with time complexity $O(\mathrm{nnz}^{0.7} n^{1.2} + n^{2+o(1)})$ was developed by Yuster and Zwick [190]. Buluç and Gilbert [32] presented an SpGEMM algorithm with time complexity independent of the size of the input matrices, under the assumptions that the algorithm is used as a sub-routine of 2D distributed memory SpGEMM and that the input matrices are hypersparse (nnz < n).

Recent GPU-based SpGEMM algorithms showed better time complexity. The SpGEMM algorithm in the cuSPARSE library [57, 140] utilizes a GPU hash table for the insert operations (lines 7–11 in Algorithm 33). So the time complexity of this approach is O(flops) on average and $O(\mathrm{flops} \cdot \mathrm{nnz}_r(C))$ in the worst case, where nnzr(C) is defined as the average number of nonzero entries in the rows of the matrix C. Because the algorithm allocates one hash table of fixed size for each row of C, the space complexity is O(nnz(A) + nnz(B) + n + nnz(C)).

The CUSP library [20, 51] developed an SpGEMM method called expansion, sorting and compression (ESC) that expands all candidate nonzero entries generated by the necessary arithmetic operations (line 6 in Algorithm 33) into an intermediate sparse matrix Ĉ, sorts the matrix by rows and columns, and compresses it into the resulting matrix C by eliminating entries in duplicate positions. By using the GPU radix sort algorithm (with linear time complexity while the size of the index data type of the matrices is fixed) and the prefix-sum scan algorithm (with linear time complexity) as building blocks, the time complexity of the ESC algorithm is O(flops + nnz(Ĉ) + nnz(C)). Since nnz(Ĉ) equals half of flops, the ESC algorithm takes the optimal O(flops) time. Dalton et al. [52] improved the ESC algorithm by executing sorting and compression on the rows of Ĉ rather than on the entire matrix, so that fast on-chip memory has a chance to be utilized more efficiently. The improved method sorts the very short rows (of size no more than 32) by using a sorting network algorithm (with time complexity $O(\mathrm{nnz}_r(\hat{C}) \log^2 \mathrm{nnz}_r(\hat{C}))$) instead of the radix sort algorithm, which is mainly efficient for long lists. So the newer method is more efficient in practice, even though its time complexity is not lower than that of the original ESC algorithm. Because both of the ESC algorithms allocate an intermediate matrix Ĉ, they have the same space complexity O(nnz(A) + nnz(B) + nnz(Ĉ) + nnz(C)).

The RMerge algorithm, recently proposed by Gremse et al. [83], iteratively merges rows of the matrix B into the resulting matrix C. Because this approach underutilizes thread interaction and generates one intermediate sparse matrix per iteration step, it works best for input matrices with evenly distributed short rows. For irregular input matrices, load imbalance and large memory allocations make this method inefficient.
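The ESC strategy discussed above can be summarized in a few lines: expand every partial product, sort the triples by (row, column), then compress duplicates by summation. The C++ below is a serial sketch of the idea only, not the GPU radix-sort-based implementation in CUSP; the Triple type and function names are illustrative.

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

// One expanded partial product a_ij * b_jk destined for position (i, k).
struct Triple { int row, col; double val; };

// Sorting and compression on a list of expanded partial products.
// The expansion step is assumed to have produced one Triple per scalar
// multiplication (i.e. nnz(C^) = flops / 2 entries in total).
std::vector<Triple> sort_and_compress(std::vector<Triple> expanded) {
    std::sort(expanded.begin(), expanded.end(),
              [](const Triple& x, const Triple& y) {
                  return std::tie(x.row, x.col) < std::tie(y.row, y.col);
              });
    std::vector<Triple> compressed;
    for (const Triple& t : expanded) {
        if (!compressed.empty() &&
            compressed.back().row == t.row && compressed.back().col == t.col)
            compressed.back().val += t.val;   // duplicate position: accumulate
        else
            compressed.push_back(t);
    }
    return compressed;
}
```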

Our experimental results show that the proposed SpGEMM method in general outperforms the previous SpGEMM methods designed for CPUs and GPUs.

10. Conclusion and Future Work

10.1 Conclusion

This thesis studied key routines of Sparse BLAS and some fundamental data structures and algorithms as their building blocks.

Chapter 4 proposed ad-heap, a new efficient heap data structure for tightly coupled CPU-GPU heterogeneous processors. Empirical studies were conducted based on the theoretical analysis. The experimental results showed that the ad-heap can obtain up to 1.5x and 3.6x the performance of the optimal scheduling method on two representative machines, respectively. To the best of our knowledge, the ad-heap is the first fundamental data structure that efficiently leveraged the two different types of cores in the emerging heterogeneous processors through fine-grained frequent interactions between the CPUs and the GPUs. Further, the performance numbers also showed that redesigning data structures and algorithms is necessary for exposing the higher computational power of the heterogeneous processors.

Chapter 6 proposed the CSR5 format. The format conversion from the CSR to the CSR5 was very fast because of the format's insensitivity to the sparsity structure of the input matrix.

Chapter 7 developed three methods for sparse vector addition. Those methods have been used for adding rows of different sizes in the SpGEMM operation described in Chapter 9.

Chapter 8 proposed an efficient method for SpMV on heterogeneous processors using the CSR storage format. On three mainstream platforms from Intel, AMD and nVidia, the method greatly outperforms row-block-method CSR-based SpMV algorithms running on GPUs. The performance gain mainly comes from the newly developed speculative segmented sum strategy that efficiently utilizes the different types of cores in a heterogeneous processor. Chapter 8 also proposed the CSR5-based cross-platform SpMV algorithm for CPUs, GPUs and Xeon Phi. The CSR5-based SpMV was implemented by a redesigned segmented sum algorithm with higher SIMD utilization compared to the classic methods. The experimental results showed that the CSR5 delivered high throughput both in the isolated SpMV tests and in the iteration-based scenarios.

Chapter 9 demonstrated an efficient SpGEMM framework and corresponding algorithms on GPUs and emerging CPU-GPU heterogeneous processors for solving the three challenging problems in the SpGEMM. In the two experimental scenarios using matrices with diverse sparsity structures as input, the SpGEMM algorithm delivered excellent absolute and relative performance as well as space efficiency over the previous GPU SpGEMM methods. Moreover, on average, the approach obtained around twice the performance of the state-of-the-art CPU SpGEMM method. Further, the method obtained higher performance on emerging heterogeneous processors with re-allocatable memory.

10.2 Future Work

Besides SpMV and SpGEMM, some other operations of Sparse BLAS, such as multiplication of a dense matrix with a sparse vector or a sparse matrix, and multiplication of a sparse matrix with a sparse vector or a dense matrix, may be important for some scenarios, e.g., machine learning applications. Efficiently implementing them on CPUs/GPUs with more cores and on emerging, stronger CPU-GPU heterogeneous processors is interesting future work.

In the SpMV and SpGEMM work designed for heterogeneous processors, we assign different tasks to the CPU part and the GPU part of one heterogeneous processor. However, the heaviest workload currently only runs on the GPU cores, while the CPU cores may be idle. Obviously, it is possible to schedule tasks in the first stage on both CPU cores and GPU cores simultaneously for potentially higher throughput. Using the ideas described in [111, 94, 162, 164] for utilizing both CPU cores and GPU cores in a heterogeneous environment may further improve performance.

Furthermore, designing adaptive algorithms and sparse data structures for Sparse BLAS in graph applications is another promising direction [97]. Whether graph algorithms require a new set of primitives or can directly use the current Sparse BLAS is still an open question [31, 131], and thus needs study both in theory and in practice.

Bibliography

[1] David Abrahams and Aleksey Gurtovoy. C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond (C++ in Depth Series). Addison-Wesley Professional, 2004.

[2] M. S. Alnaes, A. Logg, K-A. Mardal, O. Skavhaug, and H. P. Langtangen. Unified Framework for Finite Element Assembly. Int. J. Comput. Sci. Eng., 4(4):231–244, 2009.

[3] AMD. White Paper: Compute Cores, jan 2014.

[4] Rasmus Resen Amossen, Andrea Campagna, and Rasmus Pagh. Better Size Estimation for Sparse Matrix Products. Algorithmica, pages 741–757, 2014.

[5] Pham Nguyen Quang Anh, Rui Fan, and Yonggang Wen. Reducing Vector I/O for Faster GPU Sparse Matrix-Vector Multiplication. In Parallel and Distributed Processing Symposium, 2015 IEEE International, IPDPS ’15, pages 1043–1052, 2015.

[6] M. Arora, S. Nath, S. Mazumdar, S.B. Baden, and D.M. Tullsen. Redefining the Role of the CPU in the Era of CPU-GPU Integration. Micro, IEEE, 32(6):4–16, 2012.

[7] Arash Ashari, Naser Sedaghati, John Eisenlohr, Srinivasan Parthasarathy, and P. Sadayappan. Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, pages 781–792, 2014.

[8] Arash Ashari, Naser Sedaghati, John Eisenlohr, and P. Sadayappan. An Efficient Two-dimensional Blocking Strategy for Sparse Matrix-vector Multiplication on GPUs. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS ’14, pages 273–282, 2014.

[9] Cedric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-Andre Wacre- nier. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multi- core Architectures. Concurr. Comput. : Pract. Exper., 23(2):187–198, feb 2011.

[10] Brett W. Bader and Tamara G. Kolda. Algorithm 862: MATLAB Tensor Classes for Fast Algorithm Prototyping. ACM Transactions on Mathematical Software, 32(4):635–653, 2006.


[11] Brett W. Bader and Tamara G. Kolda. Efficient MATLAB Computations with Sparse and Factored Tensors. SIAM Journal on Scientific Computing, 30(1):205–231, 2007.

[12] Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Karl Rupp, Barry F. Smith, and Hong Zhang. PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.5, Argonne National Laboratory, 2014.

[13] Grey Ballard, Aydin Buluç, James Demmel, Laura Grigori, Benjamin Lipshitz, Oded Schwartz, and Sivan Toledo. Communication Optimal Parallel Multiplication of Sparse Random Matrices. In Proceedings of the 25th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’13, pages 222–231, 2013.

[14] M. Baskaran, B. Meister, N. Vasilache, and R. Lethin. Efficient and Scalable Computations with Sparse Tensors. In High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, pages 1–6, 2012.

[15] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs. In Proceedings of the 22nd Annual International Conference on Supercomputing, ICS ’08, pages 225–234, 2008.

[16] Muthu Manikandan Baskaran and Rajesh Bordawekar. Optimizing Sparse Matrix-Vector Multiplication on GPUs. Technical Report RC24704, IBM, 2008.

[17] K. E. Batcher. Sorting networks and their applications. In Proceedings of the April 30–May 2, 1968, Spring Joint Computer Conference, AFIPS ’68 (Spring), pages 307–314, 1968.

[18] Sean Baxter. Modern GPU. nVidia, 2013.

[19] Bob Beaty. DKit: C++ Library of Atomic and Lockless Data Structures, 2012.

[20] N. Bell, S. Dalton, and L. Olson. Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods. SIAM Journal on Scientific Computing, 34(4):C123–C152, 2012.

[21] Nathan Bell and Michael Garland. Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pages 18:1–18:11, 2009.

[22] Mehmet E. Belviranli, Laxmi N. Bhuyan, and Rajiv Gupta. A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Trans. Archit. Code Optim., 9(4):57:1–57:20, jan 2013.

[23] P. Bjorstad, F. Manne, T. Sorevik, and M. Vajtersic. Efficient Matrix Multiplication on SIMD Computers. SIAM Journal on Matrix Analysis and Applications, 13(1):386–401, 1992.

[24] L. S. Blackford, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, L. Kaufman, A. Lumsdaine, A. Petitet, R. Pozo, K. Remington, and R. C. Whaley. An Updated Set of Basic Linear Algebra Subprograms (BLAS). ACM Trans. Math. Softw., 28(2):135–151, 2002.

[25] G. E. Blelloch. Scans As Primitive Parallel Operations. IEEE Trans. Comput., 38(11):1526–1538, 1989.

[26] G. E. Blelloch. Prefix Sums and Their Applications. Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, nov 1990.

[27] G. E. Blelloch. Vector Models for Data-parallel Computing. MIT Press, Cambridge, MA, USA, 1990.

[28] G. E. Blelloch, Michael A. Heroux, and Marco Zagha. Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors. Technical Report CMU-CS-93-173, Carnegie Mellon University, Aug 1993.

[29] Ronald F. Boisvert, Roldan Pozo, Karin Remington, Richard F. Barrett, and Jack J. Dongarra. Matrix Market: A Web Resource for Test Matrix Collections. In Proceedings of the IFIP TC2/WG2.5 Working Conference on Quality of Numerical Software: Assessment and Enhancement, pages 125–137, 1997.

[30] Alexander Branover, Denis Foley, and Maurice Steinman. AMD Fusion APU: Llano. IEEE Micro, 32(2):28–37, 2012.

[31] A. Buluç and John R. Gilbert. The Combinatorial BLAS: Design, Implementation, and Applications. Int. J. High Perform. Comput. Appl., 25(4):496–509, 2011.

[32] A. Buluç and J.R. Gilbert. On the Representation and Multiplication of Hypersparse Matrices. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1–11, 2008.

[33] A. Buluç, S. Williams, L. Oliker, and J. Demmel. Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication. In Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 721–733, 2011.

[34] Aydin Buluç and John R. Gilbert. Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication. In Proceedings of the 2008 37th International Conference on Parallel Processing, ICPP ’08, pages 503–510, 2008.

[35] Aydin Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks. In Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures, SPAA ’09, pages 233–244, 2009.

[36] Jong-Ho Byun, Richard Lin, James W. Demmel, and Katherine A. Yelick. pOSKI: Parallel Optimized Sparse Kernel Interface Library User’s Guide. University of California, Berkeley, 1.0 edition, Apr 2012.

[37] Bradford L. Chamberlain, D. Callahan, and H.P. Zima. Parallel Programmability and the Chapel Language. Int. J. High Perform. Comput. Appl., 21(3):291–312, 2007.

[38] Bradford L. Chamberlain and Lawrence Snyder. Array Language Support for Parallel Sparse Computation. In Proceedings of the 15th International Conference on Supercomputing, ICS ’01, pages 133–145, 2001.

[39] Timothy M. Chan. More Algorithms for All-pairs Shortest Paths in Weighted Graphs. In Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing, STOC ’07, pages 590–598, 2007.

[40] Siddhartha Chatterjee, G.E. Blelloch, and Marco Zagha. Scan Primitives for Vector Computers. In Supercomputing ’90., Proceedings of, pages 666–675, 1990.

[41] Shuai Che, M. Boyer, Jiayuan Meng, D. Tarjan, J.W. Sheaffer, Sang-Ha Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pages 44–54, 2009.

[42] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and Kevin Skadron. A Performance Study of General-purpose Applications on Graphics Processors Using CUDA. J. Parallel Distrib. Comput., 68(10):1370–1380, 2008.

[43] Shuai Che, J.W. Sheaffer, M. Boyer, L.G. Szafaryn, Liang Wang, and K. Skadron. A Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads. In Workload Characterization (IISWC), 2010 IEEE International Symposium on, pages 1–11, 2010.

[44] Linchuan Chen, Xin Huo, and Gagan Agrawal. Accelerating MapReduce on a Coupled CPU-GPU Architecture. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pages 25:1–25:11, 2012.

[45] Jee W. Choi, Amik Singh, and Richard W. Vuduc. Model-Driven Autotuning of Sparse Matrix-vector Multiply on GPUs. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’10, pages 115–126, 2010.

[46] E.S. Chung, P.A. Milder, J.C. Hoe, and Ken Mai. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In Microarchitecture (MICRO), 2010 43rd Annual IEEE/ACM International Symposium on, pages 225–236, 2010.

[47] Edith Cohen. On Optimizing Multiplications of Sparse Matrices. In William H. Cunningham, S. Thomas McCormick, and Maurice Queyranne, editors, Integer Programming and Combinatorial Optimization, volume 1084 of Lecture Notes in Computer Science, pages 219–233. Springer Berlin Heidelberg, 1996.

[48] M. Daga and M. Nutter. Exploiting Coarse-Grained Parallelism in B+ Tree Searches on an APU. In High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:, pages 240–247, 2012.

[49] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing, SAAHPC ’11, pages 141–149, 2011.

[50] Mayank Daga and Joseph L. Greathouse. Structural Agnostic SpMV: Adapting CSR-Adaptive for Irregular Matrices. In High Performance Computing (HiPC), 2015 22nd International Conference on, HiPC ’15, 2015.

[51] Steven Dalton and Nathan Bell. CUSP: A C++ Templated Sparse Matrix Library.

[52] Steven Dalton, Luke Olsen, and Nathan Bell. Optimizing Sparse Matrix-Matrix Multiplication for the GPU. ACM Transactions on Mathematical Software, 41(4), 2015.

[53] Satish Damaraju, Varghese George, Sanjeev Jahagirdar, Tanveer Khondker, R. Milstrey, Sanjib Sarkar, Scott Siers, I. Stolero, and Arun Subbiah. A 22nm IA Multi-CPU and GPU System-on-Chip. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, pages 56–57, 2012.

[54] Hoang-Vu Dang and Bertil Schmidt. The Sliced COO Format for Sparse Matrix-Vector Multiplication on CUDA-enabled GPUs. Procedia Computer Science, 9(0):57–66, 2012.

[55] A. Davidson, D. Tarjan, M. Garland, and J.D. Owens. Efficient Parallel Merge Sort for Fixed and Variable Length Keys. In Innovative Parallel Computing (InPar), 2012, pages 1–9, 2012.

[56] Timothy A. Davis and Yifan Hu. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, dec 2011.

[57] Julien Demouth. Sparse Matrix-Matrix Multiplication on the GPU. Technical report, NVIDIA, 2012.

[58] Yangdong (Steve) Deng, Bo David Wang, and Shuai Mu. Taming Irregular EDA Applications on GPUs. In Proceedings of the 2009 International Conference on Computer-Aided Design, ICCAD ’09, pages 539–546, 2009.

[59] Mrinal Deo and Sean Keely. Parallel Suffix Array and Least Common Prefix for the GPU. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’13, pages 197–206, 2013.

[60] J. J. Dongarra, Jeremy Du Croz, Sven Hammarling, and I. S. Duff. Algorithm 679: A set of level 3 basic linear algebra subprograms: Model implementation and test programs. ACM Trans. Math. Softw., 16(1):18–28, 1990.

[61] J. J. Dongarra, Jeremy Du Croz, Sven Hammarling, and I. S. Duff. A Set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Softw., 16(1):1–17, 1990.

[62] J. J. Dongarra and Michael A. Heroux. Toward a New Metric for Ranking High Performance Computing Systems. Technical Report SAND2013-4744, Sandia National Laboratories, 2013.

[63] Yuri Dotsenko, Naga K. Govindaraju, Peter-Pike Sloan, Charles Boyd, and John Manferdelli. Fast Scan Algorithms on Graphics Processors. In Proceedings of the 22Nd Annual International Conference on Supercomputing, ICS ’08, pages 205–213, 2008.

[64] Iain S. Duff. A Survey of Sparse Matrix Research. Proceedings of the IEEE, 65(4):500–535, 1977.

[65] Iain S Duff, Albert M Erisman, and John K Reid. Direct Methods for Sparse Matrices. Oxford University Press, Inc., 1986.

[66] Iain S. Duff, Roger G. Grimes, and John G. Lewis. Sparse Matrix Test Problems. ACM Trans. Math. Softw., 15(1):1–14, 1989.

[67] Iain S. Duff, Roger G. Grimes, and John G. Lewis. The Rutherford-Boeing Sparse Matrix Collection. Technical Report RAL-TR-97-031, Rutherford Appleton Laboratory, UK, 1997.

[68] Iain S. Duff, Michael A. Heroux, and Roldan Pozo. An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum. ACM Trans. Math. Softw., 28(2):239–267, 2002.

[69] Iain S. Duff and J. Koster. On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM J. Matrix Anal. Appl., 22(4):973–996, 2000.

[70] Iain S. Duff, Michele Marrone, Giuseppe Radicati, and Carlo Vittoli. Level 3 basic linear algebra subprograms for sparse matrices: A user-level interface. ACM Trans. Math. Softw., 23(3):379–401, 1997.

[71] Iain S. Duff and Christof Vömel. Algorithm 818: A Reference Model Implementation of the Sparse BLAS in Fortran 95. ACM Trans. Math. Softw., 28(2):268–283, 2002.

[72] S. C. Eisenstat, M. C. Gursky, M. H. Schultz, and A. H. Sherman. Yale Sparse Matrix Package II: The Nonsymmetric Codes. Technical report, Yale University, 1982.

[73] Jianbin Fang, Henk Sips, LiLun Zhang, Chuanfu Xu, Yonggang Che, and Ana Lucia Varbanescu. Test-driving Intel Xeon Phi. In Proceedings of the 5th ACM/SPEC International Conference on Performance Engineering, ICPE ’14, pages 137–148, 2014.

[74] Jianbin Fang, A.L. Varbanescu, and H. Sips. A Comprehensive Performance Comparison of CUDA and OpenCL. In Parallel Processing (ICPP), 2011 International Conference on, pages 216–225, 2011.

[75] Timothy Furtak, José Nelson Amaral, and Robert Niewiadomski. Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’07, pages 348–357, 2007.

[76] Michael Garland. Sparse Matrix Computations on Manycore GPU’s. In Proceedings of the 45th Annual Design Automation Conference, DAC ’08, pages 2–6, 2008.

[77] J. Gilbert, C. Moler, and R. Schreiber. Sparse Matrices in MATLAB: Design and Implementation. SIAM Journal on Matrix Analysis and Applications, 13(1):333–356, 1992.

[78] John R. Gilbert, William W. Pugh, and Tatiana Shpeisman. Ordered Sparse Accumulator and its Use in Efficient Sparse Matrix Computation. United States Patent US 5983230 A, nov 1999.

[79] John R. Gilbert, Steve Reinhardt, and Viral B. Shah. High-Performance Graph Algorithms from Parallel Sparse Matrices. In Bo Kågström, Erik Elmroth, Jack Dongarra, and Jerzy Waśniewski, editors, Applied Parallel Computing. State of the Art in Scientific Computing, volume 4699 of Lecture Notes in Computer Science, pages 260–269. Springer Berlin Heidelberg, 2007.

[80] Joseph L. Greathouse and Mayank Daga. Efficient Sparse Matrix-Vector Multiplication on GPUs using the CSR Storage Format. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, pages 769–780, 2014.

[81] Oded Green, Robert McColl, and David A. Bader. GPU Merge Path: A GPU Merging Algorithm. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12, pages 331–340, 2012.

[82] C. Gregg and K. Hazelwood. Where is the Data? Why You Cannot Debate CPU vs. GPU Performance Without the Answer. In Performance Analysis of Systems and Software (ISPASS), 2011 IEEE International Symposium on, pages 134–144, 2011.

[83] Felix Gremse, Andreas Höfter, Lars Ole Schwen, Fabian Kiessling, and Uwe Naumann. GPU-Accelerated Sparse Matrix-Matrix Multiplication by Iterative Row Merging. SIAM Journal on Scientific Computing, 37(1):C54–C71, 2015.

[84] Dahai Guo and William Gropp. Adaptive Thread Distributions for SpMV on a GPU. In Proceedings of the Extreme Scaling Workshop, pages 2:1–2:5, 2012.

[85] Fred G. Gustavson. Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition. ACM Trans. Math. Softw., 4(3):250–269, sep 1978.

[86] Mark Harris, John D. Owens, and Shubho Sengupta. CUDPP Documentation. nVidia, 2.2 edition, Aug 2014.

[87] Jiong He, Mian Lu, and Bingsheng He. Revisiting co-processing for hash joins on the coupled CPU-GPU architecture. Proc. VLDB Endow., 6(10):889–900, aug 2013.

[88] M.D. Hill and M.R. Marty. Amdahl’s law in the multicore era. Computer, 41(7):33–38, 2008.

[89] Kaixi Hou, Hao Wang, and Wu-chun Feng. ASPaS: A Framework for Automatic SIMDization of Parallel Sorting on x86-based Many-core Processors. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS ’15, pages 383–392, 2015.

[90] HSA Foundation. HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer’s Guide, and Object Format (BRIG), 0.95 edition, May 2013.

[91] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. In Parallel Architecture and Compilation Techniques, 2007. PACT 2007. 16th International Conference on, pages 189–198, 2007.

[92] Intel. Intel Math Kernel Library.

[93] Donald B. Johnson. Priority queues with update and finding minimum spanning trees. Information Processing Letters, 4(3):53–57, 1975.

[94] Rashid Kaleem, Rajkishore Barik, Tatiana Shpeisman, Brian T. Lewis, Chunling Hu, and Keshav Pingali. Adaptive Heterogeneous Scheduling for Integrated GPUs. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT ’14, pages 151–162, 2014.

[95] Haim Kaplan, Micha Sharir, and Elad Verbin. Colored Intersection Searching via Sparse Rectangular Matrix Multiplication. In Proceedings of the Twenty-second Annual Symposium on Computational Geometry, SCG ’06, pages 52–60, 2006.

[96] S.W. Keckler, W.J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the Future of Parallel Computing. Micro, IEEE, 31(5):7–17, 2011.

[97] J. Kepner and J. Gilbert. Graph Algorithms in the Language of Linear Algebra. Society for Industrial and Applied Mathematics, 2011.

[98] Changkyu Kim, Tim Kaldewey, Victor W. Lee, Eric Sedlar, Anthony D. Nguyen, Nadathur Satish, Jatin Chhugani, Andrea Di Blas, and Pradeep Dubey. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-core CPUs. Proc. VLDB Endow., 2(2):1378–1389, aug 2009.

[99] Peter Kipfer and Rüdiger Westermann. Improved GPU Sorting. In Matt Pharr, editor, GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, chapter 46, pages 733–746. Addison-Wesley, mar 2005.

[100] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors (Second Edition). Morgan Kaufmann, second edition, 2013.

[101] Kornilios Kourtis, Georgios Goumas, and Nectarios Koziris. Optimizing sparse matrix-vector multiplication using index and value compression. In Proceedings of the 5th Conference on Computing Frontiers, CF ’08, pages 87–96, 2008.

[102] Kornilios Kourtis, Georgios Goumas, and Nectarios Koziris. Exploiting Compression Opportunities to Improve SpMxV Performance on Shared Memory Systems. ACM Trans. Archit. Code Optim., 7(3):16:1–16:31, 2010.

[103] Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R. Bishop. A unified sparse matrix data format for efficient general sparse matrix-vector multiply on modern processors with wide SIMD units. SIAM Journal on Scientific Computing, 2014.

[104] Mads R. B. Kristensen, Simon A. F. Lund, Troels Blum, Kenneth Skovhede, and Brian Vinter. Bohrium: A Virtual Machine Approach to Portable Parallelism. In Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pages 312–321, 2014.

[105] R. Kumar, D.M. Tullsen, N.P. Jouppi, and P. Ranganathan. Heterogeneous Chip Multiprocessors. Computer, 38(11):32–38, nov 2005.

[106] Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, and Keith I. Farkas. Single-ISA heterogeneous multi-core architectures for multithreaded workload performance. In Proceedings of the 31st Annual International Symposium on Computer Architecture, ISCA ’04, pages 64–, 2004.

[107] Pramod Kumbhar. Performance of PETSc GPU Implementation with Sparse Matrix Storage Schemes. Master’s thesis, The University of Edinburgh, Aug 2011.

[108] George Kyriazis. Heterogeneous system architecture: A technical review. Technical report, AMD, aug 2013.

[109] Anthony LaMarca and Richard Ladner. The influence of caches on the performance of heaps. J. Exp. Algorithmics, 1, jan 1996.

[110] D. Langr and P. Tvrdík. Evaluation Criteria for Sparse Matrix Storage Formats. IEEE Transactions on Parallel and Distributed Systems, [in press], 2015.

[111] Janghaeng Lee, Mehrzad Samadi, Yongjun Park, and Scott Mahlke. Transparent CPU-GPU Collaboration for Data-parallel Kernels on Heterogeneous Systems. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT ’13, pages 245–256, 2013.

[112] Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’09, pages 101–110, 2009.

[113] Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. An Input-adaptive and In-place Approach to Dense Tensor-times-matrix Multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, pages 76:1–76:12, 2015.

[114] Jiajia Li, Xingjian Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. An Optimized Large-scale Hybrid DGEMM Design for CPUs and ATI GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12, pages 377–386, 2012.

[115] Jiajia Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. SMAT: An input adaptive auto-tuner for sparse matrix-vector multiplication. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 117–126, 2013.

[116] Ruipeng Li and Yousef Saad. GPU-Accelerated Preconditioned Iterative Linear Solvers. The Journal of Supercomputing, 63(2):443–466, 2013.

[117] Weifeng Liu and Brian Vinter. Ad-heap: An efficient heap data structure for asymmetric multicore processors. In Proceedings of Workshop on General Purpose Processing Using GPUs, GPGPU-7, pages 54:54–54:63, 2014.

[118] Weifeng Liu and Brian Vinter. An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS ’14, pages 370–381, 2014.

[119] Weifeng Liu and Brian Vinter. A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors. Journal of Parallel and Distributed Computing, 85:47–61, 2015.

[120] Weifeng Liu and Brian Vinter. CSR5: An Efficient Storage Format for Cross- Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS ’15, pages 339–350, 2015.

[121] Weifeng Liu and Brian Vinter. Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors. Parallel Computing, pages –, 2015.

[122] Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. Efficient Sparse Matrix-Vector Multiplication on x86-based Many-Core Processors. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS ’13, pages 273–282, 2013.

[123] Yongchao Liu and B. Schmidt. LightSpMV: faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs. In Application-specific Systems, Architectures and Processors (ASAP), 2015 IEEE 26th International Conference on, 2015.

[124] Anders Logg, Kent-Andre Mardal, and Garth Wells. Automated solution of differential equations by the finite element method: The FEniCS book, volume 84. Springer Science & Business Media, 2012.

[125] Chi-Keung Luk, Sunpyo Hong, and Hyesoon Kim. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 45–55, 2009.

[126] Dimitar Lukarski and Nico Trost. PARALUTION - User Manual. Technical Report 1.0.0, PARALUTION Labs UG (haftungsbeschränkt) Co. KG, Feb 2015.

[127] Daniel Lustig and Margaret Martonosi. Reducing GPU offload latency via fine-grained CPU-GPU synchronization. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), HPCA ’13, pages 354–365, 2013.

[128] Fredrik Manne. Load Balancing in Parallel Sparse Matrix Computations. PhD thesis, Department of Informatics, University of Bergen, Norway, 1993.

[129] Michele Martone. Efficient Multithreaded Untransposed, Transposed or Symmetric Sparse Matrix-Vector Multiplication with the Recursive Sparse Blocks Format. Parallel Computing, 40(7):251–270, 2014.

[130] K. Matam, S.R.K.B. Indarapu, and K. Kothapalli. Sparse Matrix-Matrix Multiplication on Modern Architectures. In High Performance Computing (HiPC), 2012 19th International Conference on, pages 1–10, 2012.

[131] T. Mattson, D. Bader, J. Berry, A. Buluc, J. Dongarra, C. Faloutsos, J. Feo, J. Gilbert, J. Gonzalez, B. Hendrickson, J. Kepner, C. Leiserson, A. Lumsdaine, D. Padua, S. Poole, S. Reinhardt, M. Stonebraker, S. Wallach, and A. Yoo. Standards for graph algorithm primitives. In High Performance Extreme Computing Conference (HPEC), 2013 IEEE, pages 1–2, 2013.

[132] Victor Minden, Barry Smith, and Matthew G. Knepley. Preliminary Implementation of PETSc Using GPUs. In David A. Yuen, Long Wang, Xuebin Chi, Lennart Johnsson, Wei Ge, and Yaolin Shi, editors, GPU Solutions to Multi-scale Problems in Science and Engineering, Lecture Notes in Earth System Sciences, pages 131–140. Springer Berlin Heidelberg, 2013.

[133] Perhaad Mistry, Yash Ukidave, Dana Schaa, and David Kaeli. Valar: A benchmark suite to study the dynamic behavior of heterogeneous systems. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, GPGPU-6, pages 54–65, 2013.

[134] Sparsh Mittal and Jeffrey S. Vetter. A Survey of CPU-GPU Heterogeneous Computing Techniques. ACM Comput. Surv., 47(4):69:1–69:35, 2015.

[135] Alexander Monakov, Anton Lokhmotov, and Arutyun Avetisyan. Automatically tuning sparse matrix-vector multiplication for GPU architectures. In High Performance Embedded Architectures and Compilers, volume 5952 of Lecture Notes in Computer Science, pages 111–125. Springer Berlin Heidelberg, 2010.

[136] Aaftab Munshi. The OpenCL Specification. Khronos OpenCL Working Group, 2.0 edition, Mar 2014.

[137] Dan Negrut, Radu Serban, Ang Li, and Andrew Seidl. Unified Memory in CUDA 6: A Brief Overview and Related Data Access/Transfer Issues. Technical Report TR-2014-09, University of Wisconsin–Madison, Jun 2014.

[138] Naoki Nishikawa, Keisuke Iwai, and Takakazu Kurokawa. Power efficiency evaluation of block ciphers on GPU-integrated multicore processor. In Yang Xiang, Ivan Stojmenovic, Bernady O. Apduhan, Guojun Wang, Koji Nakano, and Albert Zomaya, editors, Algorithms and Architectures for Parallel Processing, volume 7439 of Lecture Notes in Computer Science, pages 347–361. Springer Berlin Heidelberg, 2012.

[139] Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, and Katherine A. Yelick. When Cache Blocking of Sparse Matrix Vector Multiply Works and Why. Applicable Algebra in Engineering, Communication and Computing, 18(3):297–311, 2007.

[140] NVIDIA. NVIDIA cuSPARSE library.

[141] nVidia. NVIDIA Tegra K1: A New Era in Mobile Computing, 1.1 edition, Jan 2014.

[142] S. Odeh, O. Green, Z. Mwassi, O. Shmueli, and Y. Birk. Merge Path - Parallel Merging Made Simple. In Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), 2012 IEEE 26th International, pages 1611–1618, 2012.

[143] Gloria Ortega, Francisco Vázquez, Inmaculada García, and Ester M. Garzón. FastSpMM: An Efficient Library for Sparse Matrix Matrix Product on GPUs. The Computer Journal, 57(7):968–979, 2014.

[144] Rasmus Pagh and Morten Stöckel. The Input/Output Complexity of Sparse Matrix Multiplication. In Andreas S. Schulz and Dorothea Wagner, editors, Algorithms - ESA 2014, Lecture Notes in Computer Science, pages 750–761. Springer Berlin Heidelberg, 2014.

[145] H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger. A Novel Sorting Algorithm for Many-Core Architectures Based on Adaptive Bitonic Sort. In Parallel Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 227–237, 2012.

[146] Hagen Peters and Ole Schulz-Hildebrandt. Comparison-Based In-Place Sorting with CUDA. In Wen-Mei Hwu, editor, GPU Computing Gems Jade Edition, chapter 8, pages 89–96. Morgan Kaufmann, Oct 2011.

[147] Juan C. Pichel, Francisco F. Rivera, Marcos Fernández, and Aurelio Rodríguez. Optimization of Sparse Matrix-Vector Multiplication Using Reordering Techniques on GPUs. Microprocessors and Microsystems, 36(2):65–77, 2012.

[148] J.A. Pienaar, S. Chakradhar, and A. Raghunathan. Automatic generation of software pipelines for heterogeneous parallel systems. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–12, 2012.

[149] A. Pinar and M.T. Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, SC ’99, 1999.

[150] Qualcomm. Qualcomm Snapdragon 800 Product Brief, Aug 2013.

[151] I. Reguly and M. Giles. Efficient Sparse Matrix-Vector Multiplication on Cache-Based GPUs. In Innovative Parallel Computing (InPar), 2012, pages 1–12, May 2012.

[152] Huamin Ren, Weifeng Liu, Søren Ingvor Olsen, Sergio Escalera, and Thomas B. Moeslund. Unsupervised Behavior-Specific Dictionary Learning for Abnormal Event Detection. In Xianghua Xie, Mark W. Jones, and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 28.1–28.13. BMVA Press, 2015.

[153] Huamin Ren and T.B. Moeslund. Abnormal Event Detection Using Local Sparse Representation. In Advanced Video and Signal Based Surveillance (AVSS), 2014 11th IEEE International Conference on, pages 125–130, 2014.

[154] Karl Rupp, Florian Rudolf, and Josef Weinbub. ViennaCL - A High Level Linear Algebra Library for GPUs and Multi-Core CPUs. In GPUScA, pages 51–56, 2010.

[155] Yousef Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report RIACS-90-20, Research Institute for Advanced Computer Science, NASA Ames Research Center, Moffett Field, CA, 1990.

[156] Yousef Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2003.

[157] N. Satish, M. Harris, and M. Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. In Parallel Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–10, 2009.

[158] Erik Saule, Kamer Kaya, and Ümit V. Çatalyürek. Performance evaluation of sparse matrix multiplication kernels on Intel Xeon Phi. In Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science, pages 559–570. Springer Berlin Heidelberg, 2014.

[159] M.J. Schulte, M. Ignatowski, G.H. Loh, B.M. Beckmann, W.C. Brantley, S. Gurumurthi, N. Jayasena, I. Paul, S.K. Reinhardt, and G. Rodgers. Achieving Exascale Capabilities through Heterogeneous Computing. Micro, IEEE, 35(4):26–36, 2015.

[160] Naser Sedaghati, Te Mu, Louis-Noël Pouchet, Srinivasan Parthasarathy, and P. Sadayappan. Automatic Selection of Sparse Matrix Representation on GPUs. In Proceedings of the 29th ACM International Conference on Supercomputing, ICS ’15, 2015.

[161] Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. Scan Primitives for GPU Computing. In Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, GH ’07, pages 97–106, 2007.

[162] Jie Shen, Jianbin Fang, Ana Lucia Varbanescu, and Henk Sips. An Application- Centric Evaluation of OpenCL on Multi-Core CPUs. Parallel Computing, 39(12):834–850, 2013.

[163] Jie Shen, Ana Lucia Varbanescu, Henk Sips, Michael Arntzen, and Dick G. Simons. Glinda: A framework for accelerating imbalanced applications on heterogeneous platforms. In Proceedings of the ACM International Conference on Computing Frontiers, CF ’13, pages 14:1–14:10, 2013.

[164] Jie Shen, Ana Lucia Varbanescu, Peng Zou, Yutong Lu, and Henk Sips. Improving Performance by Matching Imbalanced Workloads with Heterogeneous Platforms. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS ’14, pages 241–250, 2014.

[165] A. Sidelnik, S. Maleki, B.L. Chamberlain, M.J. Garzaran, and D. Padua. Performance Portability with the Chapel Language. In Parallel Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 582–594, 2012.

[166] I. Šimeček, D. Langr, and P. Tvrdík. Space-efficient Sparse Matrix Storage Formats for Massively Parallel Systems. In High Performance Computing and Communication 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on, pages 54–60, 2012.

[167] Kenneth Skovhede, Morten N. Larsen, and Brian Vinter. Extending Distributed Shared Memory for the Cell Broadband Engine to a Channel Model. In Applied Parallel and Scientific Computing - 10th International Conference, PARA 2010, pages 108–118, 2010.

[168] Kenneth Skovhede, Morten N. Larsen, and Brian Vinter. Programming the CELL-BE using CSP. In 33rd Communicating Process Architectures Conference, CPA 2011, pages 55–70, 2011.

[169] Kyle L. Spafford, Jeremy S. Meredith, Seyong Lee, Dong Li, Philip C. Roth, and Jeffrey S. Vetter. The tradeoffs of fused memory hierarchies in heterogeneous computing architectures. In Proceedings of the 9th Conference on Computing Frontiers, CF ’12, pages 103–112, 2012.

[170] Bor-Yiing Su and Kurt Keutzer. clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12, pages 353–364, 2012.

[171] P.D. Sulatycke and K. Ghose. Caching-Efficient Multithreaded Fast Multiplication of Sparse Matrices. In Parallel Processing Symposium, 1998. IPPS/SPDP 1998. Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing 1998, pages 117–123, 1998.

[172] Wai Teng Tang, Wen Jun Tan, Rajarshi Ray, Yi Wen Wong, Weiguang Chen, Shyh-hao Kuo, Rick Siow Mong Goh, Stephen John Turner, and Weng-Fai Wong. Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’13, pages 26:1–26:12, 2013.

[173] Wai Teng Tang, Ruizhe Zhao, Mian Lu, Yun Liang, Huynh Phung Huynh, Xibai Li, and Rick Siow Mong Goh. Optimizing and Auto-Tuning Scale-Free Sparse Matrix-Vector Multiplication on Intel Xeon Phi. In Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’15, pages 136–145, 2015.

[174] Kenzo Van Craeynest, Aamer Jaleel, Lieven Eeckhout, Paolo Narvaez, and Joel Emer. Scheduling heterogeneous multi-cores through performance impact estimation (pie). In Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA ’12, pages 213–224, 2012.

[175] Virginia Vassilevska, Ryan Williams, and Raphael Yuster. Finding Heaviest H-subgraphs in Real Weighted Graphs, with Applications. ACM Trans. Algorithms, 6(3):44:1–44:23, jul 2010.

[176] F. Vazquez, G. Ortega, J.J. Fernandez, I. Garcia, and E.M. Garzon. Fast Sparse Matrix Matrix Product Based on ELLR-T and GPU Computing. In Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on, pages 669–674, 2012.

[177] Richard Vuduc, James W. Demmel, and Katherine A. Yelick. OSKI: A Library of Automatically Tuned Sparse Matrix Kernels. Journal of Physics: Conference Series, 16(1):521–530, 2005.

[178] Richard Wilson Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, Dec 2003.

[179] Richard W. Vuduc and Hyun-Jin Moon. Fast Sparse Matrix-Vector Multiplication by Exploiting Variable Block Structure. In High Performance Computing and Communications, volume 3726 of Lecture Notes in Computer Science, pages 807–816. Springer Berlin Heidelberg, 2005.

[180] Hao Wang, S. Potluri, D. Bureddy, C. Rosales, and D.K. Panda. GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation. Parallel and Distributed Systems, IEEE Transactions on, 25(10):2595–2605, 2014.

[181] Jin Wang, Norman Rubin, Haicheng Wu, and Sudhakar Yalamanchili. Accelerating simulation of agent-based models on heterogeneous architectures. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, GPGPU-6, pages 108–119, 2013.

[182] Mark Allen Weiss. Data Structures and Algorithm Analysis. Addison-Wesley, second edition, 1995.

[183] Jeremiah Willcock and Andrew Lumsdaine. Accelerating sparse matrix computations via data compression. In Proceedings of the 20th Annual International Conference on Supercomputing, ICS ’06, pages 307–316, 2006.

[184] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. Parallel Computing, 35(3):178–194, mar 2009.

[185] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick. The Potential of the Cell Processor for Scientific Computing. In Proceedings of the 3rd Conference on Computing Frontiers, CF ’06, pages 9–20, 2006.

[186] Dong Hyuk Woo, Joshua B. Fryman, Allan D. Knies, and Hsien-Hsin S. Lee. Chameleon: Virtualizing idle acceleration cores of a heterogeneous multicore processor for caching and prefetching. ACM Trans. Archit. Code Optim., 7(1):3:1–3:35, may 2010.

[187] Dong Hyuk Woo and Hsien-Hsin S. Lee. Compass: A programmable data prefetcher using idle GPU shaders. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 297–310, 2010.

[188] Shengen Yan, Chao Li, Yunquan Zhang, and Huiyang Zhou. yaSpMV: Yet Another SpMV Framework on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’14, pages 107–118, 2014.

[189] Yi Yang, Ping Xiang, Mike Mantor, and Huiyang Zhou. CPU-assisted GPGPU on fused CPU-GPU architectures. In Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture, HPCA ’12, pages 1–12, 2012.

[190] Raphael Yuster and Uri Zwick. Fast Sparse Matrix Multiplication. ACM Trans. Algorithms, 1(1):2–13, jul 2005.

[191] A. N. Yzelman. Fast Sparse Matrix-Vector Multiplication by Partitioning and Reordering. PhD thesis, Utrecht University, Utrecht, the Netherlands, 2011.

[192] A. N. Yzelman and Rob H. Bisseling. Two-Dimensional Cache-Oblivious Sparse Matrix-Vector Multiplication. Parallel Computing, 37(12):806–819, 2011.

[193] A. N. Yzelman and D. Roose. High-Level Strategies for Parallel Shared-Memory Sparse Matrix-Vector Multiplication. IEEE Transactions on Parallel and Distributed Systems, 25(1):116–125, 2014.

[194] A. N. Yzelman, D. Roose, and K. Meerbergen. Sparse matrix-vector multiplication: parallelization and vectorization. In J. Reinders and J. Jeffers, editors, High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches, chapter 27, page 20. Elsevier, 2014.

[195] Nan Zhang. A Novel Parallel Scan for Multicore Processors and Its Application in Sparse Matrix-Vector Multiplication. Parallel and Distributed Systems, IEEE Transactions on, 23(3):397–404, Mar 2012.

[196] Xianyi Zhang, Chao Yang, Fangfang Liu, Yiqun Liu, and Yutong Lu. Optimizing and Scaling HPCG on Tianhe-2: Early Experience. In Algorithms and Architectures for Parallel Processing, volume 8630 of Lecture Notes in Computer Science, pages 28–41. Springer International Publishing, 2014.

A. Benchmark Suite

To maintain a relatively fair performance comparison, Duff et al. [66] advocated building publicly accessible sparse matrix collections. In the past several decades, a few collections (e.g., the SPARSKIT Collection [155], the Rutherford-Boeing Sparse Matrix Collection [67], the Matrix Market Collection [29] and the University of Florida Sparse Matrix Collection [56]) have been established and widely used in evaluating sparse matrix algorithms. The University of Florida Sparse Matrix Collection [56] is a superset of the earlier collections and currently includes over 2,700 sparse matrices. Therefore, all matrices used in this thesis (except a dense matrix named Dense) are downloaded from this collection. Details of the matrices used in this thesis are listed in alphabetical order in Table A.1.
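As a concrete illustration of where the per-row figures in Table A.1 come from, the short C++ sketch below computes the minimum, average and maximum number of nonzeros per row from a CSR row pointer array. It is only a hedged example: the function name, variable names and the tiny test matrix are illustrative and are not part of the thesis code.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Print the per-row nonzero statistics (min; avg; max) of a sparse matrix
    // stored in CSR form; row_ptr is the usual row pointer of length n_rows + 1.
    void print_row_length_stats(const std::vector<int> &row_ptr)
    {
        const int n_rows = static_cast<int>(row_ptr.size()) - 1;
        if (n_rows <= 0) return;

        int min_len = row_ptr[1] - row_ptr[0];
        int max_len = min_len;
        for (int i = 1; i < n_rows; ++i)
        {
            const int len = row_ptr[i + 1] - row_ptr[i];
            min_len = std::min(min_len, len);
            max_len = std::max(max_len, len);
        }
        const int nnz = row_ptr[n_rows] - row_ptr[0];
        std::printf("min %d; avg %.1f; max %d\n",
                    min_len, static_cast<double>(nnz) / n_rows, max_len);
    }

    int main()
    {
        // A tiny 4x4 matrix whose rows hold 2, 0, 3 and 1 nonzeros.
        const std::vector<int> row_ptr = {0, 2, 2, 5, 6};
        print_row_length_stats(row_ptr);   // prints: min 0; avg 1.5; max 3
        return 0;
    }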

Table A.1: Overview of evaluated sparse matrices

Name: 2cubes sphere. #rows: 101,492. #columns: 101,492. #nonzeros: 1,647,264. #nonzeros per row (min; avg; max): 5; 16; 31. Author: E. Um. Kind: Electromagnetics problem. Description: FEM, electromagnetics, 2 cubes in a sphere.

Name: ASIC 680k. #rows: 682,862. #columns: 682,862. #nonzeros: 3,871,773. #nonzeros per row (min; avg; max): 1; 6; 395,259. Author: R. Hoekstra. Kind: Circuit simulation problem. Description: Sandia, Xyce circuit simulation matrix (stripped).


Name: Boyd2. #rows: 466,316. #columns: 466,316. #nonzeros: 1,500,397. #nonzeros per row (min; avg; max): 2; 3; 93,262. Author: N. Gould. Kind: Optimization problem. Description: KKT matrix - convex QP (CUTEr).

Name: Cage12. #rows: 130,228. #columns: 130,228. #nonzeros: 2,032,536. #nonzeros per row (min; avg; max): 5; 15; 33. Author: A. van Heukelum. Kind: Directed weighted graph. Description: DNA electrophoresis, 12 monomers in polymer.

Name: Circuit. #rows: 170,998. #columns: 170,998. #nonzeros: 958,936. #nonzeros per row (min; avg; max): 1; 5; 353. Author: S. Hamm. Kind: Circuit simulation problem. Description: Motorola circuit simulation.

Name: Circuit5M. #rows: 5,558,326. #columns: 5,558,326. #nonzeros: 59,524,291. #nonzeros per row (min; avg; max): 1; 10; 1,290,501. Author: K. Gullapalli. Kind: Circuit simulation problem. Description: Large circuit from Freescale Semiconductor.

Name: Dc2. #rows: 116,835. #columns: 116,835. #nonzeros: 766,396. #nonzeros per row (min; avg; max): 1; 6; 114,190. Author: T. Lehner. Kind: Subsequent circuit simulation problem. Description: IBM EDA circuit simulation matrix.

Name: Dense. #rows: 2,000. #columns: 2,000. #nonzeros: 4,000,000. #nonzeros per row (min; avg; max): 2,000; 2,000; 2,000. Author: Unknown. Kind: Dense. Description: Dense matrix in sparse format.

Name: Economics. #rows: 206,500. #columns: 206,500. #nonzeros: 1,273,389. #nonzeros per row (min; avg; max): 1; 6; 44. Author: Unknown. Kind: Economic problem. Description: Macroeconomic model.

Name: Epidemiology. #rows: 525,825. #columns: 525,825. #nonzeros: 2,100,225. #nonzeros per row (min; avg; max): 2; 3; 4. Author: Unknown. Kind: 2D/3D problem. Description: 2D Markov model of epidemic.

Name: Eu-2005. #rows: 862,664. #columns: 862,664. #nonzeros: 19,235,140. #nonzeros per row (min; avg; max): 0; 22; 6,985. Author: Università degli Studi di Milano. Kind: Directed graph. Description: Small web crawl of .eu domain.

Name: FEM/Accelerator. #rows: 121,192. #columns: 121,192. #nonzeros: 2,624,331. #nonzeros per row (min; avg; max): 0; 21; 81. Author: Unknown. Kind: 2D/3D problem. Description: FEM/Accelerator: Accelerator cavity design.

Name: FEM/Cantilever. #rows: 62,451. #columns: 62,451. #nonzeros: 4,007,383. #nonzeros per row (min; avg; max): 1; 64; 78. Author: Unknown. Kind: 2D/3D problem. Description: FEM/Cantilever.

Name: FEM/Harbor. #rows: 46,835. #columns: 46,835. #nonzeros: 2,329,092. #nonzeros per row (min; avg; max): 4; 50; 145. Author: S. Bova. Kind: Computational fluid dynamics problem. Description: 3D CFD model, Charleston harbor.

Name: FEM/Ship. #rows: 140,874. #columns: 140,874. #nonzeros: 7,813,404. #nonzeros per row (min; avg; max): 24; 55; 102. Author: C. Damhaug. Kind: Structural problem. Description: DNV-Ex 4: Ship section/detail from production run-1999-01-17.

Name: FEM/Spheres. #rows: 83,334. #columns: 83,334. #nonzeros: 6,010,480. #nonzeros per row (min; avg; max): 1; 72; 81. Author: Unknown. Kind: 2D/3D problem. Description: FEM/Spheres: FEM concentric spheres.

Name: Filter3D. #rows: 106,437. #columns: 106,437. #nonzeros: 2,707,179. #nonzeros per row (min; avg; max): 8; 25; 112. Author: D. Hohlfield, T. Bechtold, H. Zappe. Kind: Model reduction problem. Description: Oberwolfach: tunable optical filter.

Name: FullChip. #rows: 2,987,012. #columns: 2,987,012. #nonzeros: 26,621,983. #nonzeros per row (min; avg; max): 1; 8; 2,312,481. Author: K. Gullapalli. Kind: Circuit simulation problem. Description: Circuit simulation from Freescale Semiconductor.

Name: Ga41As41H72. #rows: 268,096. #columns: 268,096. #nonzeros: 18,488,476. #nonzeros per row (min; avg; max): 18; 68; 702. Author: Y. Zhou, Y. Saad, M. Tiago, J. Chelikowsky. Kind: Theoretical/quantum chemistry problem. Description: Real-space pseudopotential method.

Name: Hood. #rows: 220,542. #columns: 220,542. #nonzeros: 10,768,436. #nonzeros per row (min; avg; max): 21; 48; 77. Author: J. Weiher. Kind: Structural problem. Description: INDEED Test Matrix (DC-mh).

Name: In-2004. #rows: 1,382,908. #columns: 1,382,908. #nonzeros: 16,917,053. #nonzeros per row (min; avg; max): 0; 12; 7,753. Author: Università degli Studi di Milano. Kind: Directed graph. Description: Small web crawl of .in domain.

Name: Ins2. #rows: 309,412. #columns: 309,412. #nonzeros: 2,751,484. #nonzeros per row (min; avg; max): 5; 8; 309,412. Author: A. Andrianov. Kind: Optimization problem. Description: ins2 matrix from SAS Institute Inc.

Name: LP. #rows: 4,284. #columns: 1,096,894. #nonzeros: 11,284,032. #nonzeros per row (min; avg; max): 1; 2,633; 56,181. Author: P. Nobili. Kind: Linear programming problem. Description: Italian railways (H. Mittelmann test set).

Name: M133-b3. #rows: 200,200. #columns: 200,200. #nonzeros: 800,800. #nonzeros per row (min; avg; max): 4; 4; 4. Author: V. Welker. Kind: Combinatorial problem. Description: Simplicial complexes from Homology.

Name: Majorbasis. #rows: 160,000. #columns: 160,000. #nonzeros: 1,750,416. #nonzeros per row (min; avg; max): 6; 10; 11. Author: Q. Li and M. Ferris. Kind: Optimization problem. Description: MCP; mixed complementarity optimization problem; similar to QLi/crashbasis.

Name: Mario002. #rows: 389,874. #columns: 389,874. #nonzeros: 2,097,566. #nonzeros per row (min; avg; max): 2; 5; 7. Author: N. Gould, Y. Hu, J. Scott. Kind: Duplicate 2D/3D problem. Description: Larger matrix from Mario for which MA47 analysis is slow.

Name: Mip1. #rows: 66,463. #columns: 66,463. #nonzeros: 10,352,819. #nonzeros per row (min; avg; max): 4; 155; 66,395. Author: A. Andrianov. Kind: Optimization problem. Description: mip1 matrix from SAS Institute Inc.

Name: Mono 500Hz. #rows: 169,410. #columns: 169,410. #nonzeros: 5,033,796. #nonzeros per row (min; avg; max): 10; 29; 719. Author: M. Gontier. Kind: Acoustics problem. Description: 3D vibro-acoustic problem, aircraft engine nacelle.

Name: Offshore. #rows: 259,789. #columns: 259,789. #nonzeros: 4,242,673. #nonzeros per row (min; avg; max): 5; 16; 31. Author: E. Um. Kind: Electromagnetics problem. Description: 3D FEM, transient electric field diffusion.

Name: Patents main. #rows: 240,547. #columns: 240,547. #nonzeros: 560,943. #nonzeros per row (min; avg; max): 0; 2; 206. Author: B. Hall, A. Jaffe, M. Trajtenberg. Kind: Directed weighted graph. Description: Pajek network: main NBER US Patent Citations, 1963-1999, cites 1975-1999.

Name: Poisson3Da. #rows: 13,514. #columns: 13,514. #nonzeros: 352,762. #nonzeros per row (min; avg; max): 6; 26; 110. Author: COMSOL. Kind: Computational fluid dynamics problem. Description: Comsol, Inc. www.femlab.com: 3D Poisson problem.

Name: Protein. #rows: 36,417. #columns: 36,417. #nonzeros: 4,344,765. #nonzeros per row (min; avg; max): 18; 119; 204. Author: S. G. Sarafianos et al. Kind: Weighted undirected graph. Description: Protein: protein data bank 1HYS.

Name: QCD. #rows: 49,152. #columns: 49,152. #nonzeros: 1,916,928. #nonzeros per row (min; avg; max): 39; 39; 39. Author: B. Medeke. Kind: Theoretical/quantum chemistry problem. Description: Quantum chromodynamics conf5.4-00l8x8-2000.

Name: Rajat21. #rows: 411,676. #columns: 411,676. #nonzeros: 1,876,011. #nonzeros per row (min; avg; max): 1; 4; 118,689. Author: Rajat. Kind: Circuit simulation problem. Description: Rajat/rajat21 circuit simulation matrix.

Name: Si41Ge41H72. #rows: 185,639. #columns: 185,639. #nonzeros: 15,011,265. #nonzeros per row (min; avg; max): 13; 80; 662. Author: Y. Zhou, Y. Saad, M. Tiago, J. Chelikowsky. Kind: Theoretical/quantum chemistry problem. Description: Real-space pseudopotential method.

Name: Transient. #rows: 178,866. #columns: 178,866. #nonzeros: 961,368. #nonzeros per row (min; avg; max): 1; 5; 60,423. Author: K. Gullapalli. Kind: Circuit simulation problem. Description: Small circuit from Freescale Semiconductor.

Name: Webbase. #rows: 1,000,005. #columns: 1,000,005. #nonzeros: 3,105,536. #nonzeros per row (min; avg; max): 1; 3; 4,700. Author: Unknown. Kind: Weighted directed graph. Description: Web connectivity matrix.

Name: Wind Tunnel. #rows: 217,918. #columns: 217,918. #nonzeros: 11,524,432. #nonzeros per row (min; avg; max): 2; 53; 180. Author: R. Grimes. Kind: Structural problem. Description: Pressurized wind tunnel.

B. Testbeds

A variety of platforms are used for evaluating the proposed sparse matrix algorithms in this thesis. The CPUs, GPUs, Xeon Phi and tightly coupled CPU-GPU heterogeneous processors are listed separately in this appendix.

B.1 CPUs

The two CPUs used for benchmarking are listed in Table B.1.

Table B.1: Two CPUs used for benchmarking.

Vendor: Intel | Intel
Family: Xeon CPU | Xeon CPU
Device: E5-2630 | E5-2667 v3
Codename: Sandy Bridge | Haswell
#Cores: 6 | 8
#SIMD units: 6×2×256-bit wide | 8×2×256-bit wide
Clock: 2.3 GHz | 3.2 GHz
SP flop/cycle: 96 | 256
SP Peak: 220.8 GFlop/s | 819.2 GFlop/s
DP flop/cycle: 48 | 128
DP Peak: 110.4 GFlop/s | 409.6 GFlop/s
L1 data cache: 6×32 kB | 8×32 kB
L2 cache: 6×256 kB | 8×256 kB
L3 cache: 15 MB | 20 MB
Memory: 32 GB DDR3-1333 (4 channels) | 32 GB DDR4-2133 (4 channels)
Bandwidth: 42.6 GB/s | 68.3 GB/s
ECC: on | on
Hyper-Threading: on | on
OS (64-bit): Ubuntu 12.04 | RedHat Enterprise Linux v6.5
Compiler: Intel C/C++ v14.0 | Intel C/C++ v15.0.1
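The single- and double-precision peaks in Table B.1 are the product of the clock frequency and the aggregate flop/cycle figure in the same column. The minimal C++ sketch below reproduces the Table B.1 entries; the function name is illustrative and the code is not part of the thesis benchmarks.

    #include <cstdio>

    // Theoretical peak throughput in GFlop/s: clock frequency (GHz) times the
    // number of floating-point operations the whole chip can issue per cycle.
    double peak_gflops(double clock_ghz, int flop_per_cycle)
    {
        return clock_ghz * flop_per_cycle;
    }

    int main()
    {
        // Xeon E5-2630 (Table B.1): 2.3 GHz, 96 SP and 48 DP flop/cycle.
        std::printf("E5-2630 SP peak: %.1f GFlop/s\n", peak_gflops(2.3, 96));     // 220.8
        std::printf("E5-2630 DP peak: %.1f GFlop/s\n", peak_gflops(2.3, 48));     // 110.4

        // Xeon E5-2667 v3 (Table B.1): 3.2 GHz, 256 SP and 128 DP flop/cycle.
        std::printf("E5-2667 v3 SP peak: %.1f GFlop/s\n", peak_gflops(3.2, 256)); // 819.2
        std::printf("E5-2667 v3 DP peak: %.1f GFlop/s\n", peak_gflops(3.2, 128)); // 409.6
        return 0;
    }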


B.2 GPUs

The two nVidia GPUs and one AMD GPU used for benchmarking are listed in Table B.2.

Table B.2: Three GPUs used for benchmarking.

Vendor: nVidia | nVidia | AMD
Family: GeForce GPU | GeForce GPU | Radeon GPU
Device: GTX Titan Black | GTX 980 | R9 290X
Codename: Kepler GK110 | Maxwell GM204 | GCN Hawaii
#Cores: 15 | 16 | 44
#SIMD units: 2880 CUDA cores | 2048 CUDA cores | 2816 Radeon cores
Clock: 889 MHz | 1126 MHz | 1050 MHz
SP flop/cycle: 5760 | 4096 | 5632
SP Peak: 5120.6 GFlop/s | 4612.1 GFlop/s | 5913.6 GFlop/s
DP flop/cycle: 1920 | 128 | 704
DP Peak: 1706.9 GFlop/s | 144.1 GFlop/s | 739.2 GFlop/s
L1 data cache: 15×16 kB | 16×24 kB | 44×16 kB
Scratchpad: 15×48 kB | 16×96 kB | 44×64 kB
L2 cache: 1.5 MB | 2 MB | 1 MB
Memory: 6 GB GDDR5 | 4 GB GDDR5 | 4 GB GDDR5
Bandwidth: 336 GB/s | 224 GB/s | 345.6 GB/s
OS (64-bit): Ubuntu 14.04 | Ubuntu 14.04 | Ubuntu 14.04
Device driver: v344.16 | v344.16 | v14.41
Compiler: g++ v4.9, nvcc v6.5.19 | g++ v4.9, nvcc v6.5.19 | g++ v4.9, OpenCL v1.2

B.3 Xeon Phi

The Intel Xeon Phi used for benchmarking is listed in Table B.3.

Table B.3: Xeon Phi used for benchmarking.

Vendor: Intel
Family: Xeon Phi
Device: 5110p
Codename: Knights Corner
#Cores: 60
#SIMD units: 60×2×512-bit wide
Clock: 1.05 GHz
SP flop/cycle: 1920
SP Peak: 2016 GFlop/s
DP flop/cycle: 960
DP Peak: 1008 GFlop/s
L1 data cache: 60×32 kB
L2 cache: 60×512 kB
Memory: 8 GB GDDR5
Bandwidth: 320 GB/s
ECC: on
OS (64-bit): RedHat Enterprise Linux v6.5
Device driver: v3.4-1
µOS: v2.6.38.8
Compiler: Intel C/C++ v15.0.1

B.4 Heterogeneous Processors

The heterogeneous processors from Intel, nVidia and AMD used for benchmarking are listed in Tables B.4 and B.5. A simulated heterogeneous processor is listed in Table B.6.

Table B.4: Intel and nVidia heterogeneous processors used for benchmarking.

Processor: Intel Core i3-5010U | nVidia Tegra K1
Core type: x86 CPU, GPU | ARM CPU, GPU
Codename: Broadwell, HD 5500 | Cortex A15, Kepler
Cores @ clock (GHz): 2 @ 2.1, 3 @ 0.9 | 4 @ 2.3, 1 @ 0.85
SP flops/cycle: 2×32, 3×128 | 4×8, 1×384
SP peak (GFlop/s): 134.4, 345.6 | 73.6, 327.2
DP flops/cycle: 2×16, 3×32 | 2×2, 1×16
DP peak (GFlop/s): 67.2, 86.4 | 18.4, 13.6
L1 data cache: 4×32 kB, 3×4 kB | 4×32 kB, 1×16 kB
Scratchpad: N/A, 3×64 kB | N/A, 1×48 kB
L2 cache: 4×256 kB, 3×24 kB | 2 MB, 128 kB
L3 cache: N/A, 384 kB | N/A, N/A
Last level cache: 3 MB | N/A
DRAM: Dual-channel DDR3-1600 | Single-channel DDR3L-1866
DRAM capacity: 8 GB | 2 GB
DRAM bandwidth: 25.6 GB/s | 14.9 GB/s
OS (64-bit): Microsoft Windows 7 | Ubuntu Linux 14.04
GPU driver: v15.36 | r19.2
Compiler: Intel C/C++ 15.0.2 | gcc 4.8.2, nvcc 6.0.1
Toolkit version: OpenCL 2.0 | CUDA 6.0

Table B.5: AMD heterogeneous processors used for benchmarking.

Processor: AMD A6-1450 APU | AMD A10-7850K APU
Core type: x86 CPU, GPU | x86 CPU, GPU
Codename: Jaguar, GCN | Steamroller, GCN
Cores @ clock (GHz): 4 @ 1.0, 2 @ 0.4 | 4 @ 3.7, 8 @ 0.72
SP flops/cycle: 4×8, 2×128 | 4×8, 8×128
SP peak (GFlop/s): 32, 102.4 | 118.4, 737.3
DP flops/cycle: 4×3, 2×8 | 4×4, 8×8
DP peak (GFlop/s): 12, 6.4 | 59.2, 46.1
L1 data cache: 4×32 kB, 2×16 kB | 4×16 kB, 8×16 kB
Scratchpad: N/A, 2×64 kB | N/A, 2×64 kB
L2 cache: 2 MB, Unreleased | 2×2 MB, Unreleased
DRAM: Single-channel DDR3L-1066 | Dual-channel DDR3-1600
DRAM capacity: 4 GB | 8 GB
DRAM bandwidth: 8.5 GB/s | 25.6 GB/s
OS (64-bit): Ubuntu Linux 12.04 | Ubuntu Linux 14.04
GPU driver: 13.11 Beta | 14.501
Compiler: gcc 4.6.3 | gcc 4.8.2
Toolkit version: OpenCL 1.2 | OpenCL 2.0

Table B.6: A simulated heterogeneous processor used for benchmarking.

Processor: Simulated Processor
Core type: x86 CPU, GPU
Product: Intel Core i7-3770, nVidia GeForce GTX 680
Codename: Ivy Bridge, Kepler
Cores @ clock (GHz): 4 @ 3.4, 8 @ 1.006
SP flops/cycle: 4×16, 8×384
SP peak (GFlop/s): 217.6, 3090.4
DP flops/cycle: 4×8, 8×16
DP peak (GFlop/s): 108.8, 128.8
L1 data cache: 4×32 kB, 8×16 kB
Scratchpad: N/A, 8×48 kB
L2 cache: 4×256 kB, 512 kB
L3 cache: 8 MB, N/A
DRAM: Dual-channel DDR3-1600, GDDR5
DRAM capacity: 32 GB, 2 GB
DRAM bandwidth: 25.6 GB/s, 192.2 GB/s
OS (64-bit): Ubuntu Linux 12.04
GPU driver: v304.116
Compiler: gcc 4.6.3, nvcc 5.0
Toolkit version: CUDA 5.0

C. Word Cloud of The Thesis

This appendix includes a “word cloud” generated by http://www.wordle.net/ from the text of the thesis. The most frequently used words in the thesis are easy to spot.

D. Short Biography

Weifeng Liu is currently a Ph.D. candidate at the Niels Bohr Institute, Faculty of Science, University of Copenhagen, Denmark. He works in the eScience Center under the supervision of Professor Brian Vinter. Before moving to Copenhagen, he worked as a senior researcher in high performance computing technology at SINOPEC Exploration & Production Research Institute for about six years. He received his B.E. and M.E. degrees in computer science from China University of Petroleum, Beijing, in 2002 and 2006, respectively. He is a member of the ACM, the IEEE, the CCF and the SIAM. His research interests include numerical linear algebra and parallel computing, particularly designing algorithms for sparse matrix computations on throughput-oriented processors. His algorithms run on a variety of many-core devices (e.g., nVidia GPUs, AMD GPUs, Intel GPUs and Intel Xeon Phi) and CPU-GPU heterogeneous processors (e.g., nVidia Tegra, AMD Kaveri and Intel Broadwell).
