
National Technical University of Athens School of Electrical and Computer Engineering Division of Computer Science

Optimizing the Sparse Matrix-Vector Multiplication Kernel for Modern Multicore Computer Architectures

Ph.D. Thesis

Vasileios K. Karakasis Electrical and Computer Engineer, Dipl.-Ing.

Athens, Greece December, 2012


National Technical University of Athens School of Electrical and Computer Engineering Division of Computer Science

Optimizing the Sparse Matrix-Vector Multiplication Kernel for Modern Multicore Computer Architectures

Ph.D. esis

Vasileios K. Karakasis Electrical and Computer Engineer, Dipl.-Ing.

Advisory Committee: Nectarios Koziris, Panayiotis Tsanakas, Andreas Stafylopatis

Approved by the examining committee on December 19, 2012.

Nectarios Koziris, Associate Prof., NTUA
Panayiotis Tsanakas, Prof., NTUA
Andreas Stafylopatis, Prof., NTUA
Andreas Boudouvis, Prof., NTUA
Giorgos Stamou, Lecturer, NTUA
Dimitrios Soudris, Assistant Prof., NTUA
Ioannis Cotronis, Assistant Prof., UOA

Athens, Greece December, 2012


Vasileios K. Karakasis Ph.D., National Technical University of Athens, Greece.


Copyright © Vasileios K. Karakasis, 2012. All rights reserved.

Copying, storing and distributing this work, in whole or in part, for commercial purposes is prohibited. Reproduction, storage and distribution for non-profit, educational or research purposes are permitted, provided that the source is acknowledged and this notice is preserved. Questions concerning the use of this work for profit should be addressed to the author.

The approval of this doctoral thesis by the School of Electrical and Computer Engineering of the National Technical University of Athens does not imply acceptance of the author's opinions (Law 5343/1932, Article 202).


To my parents, Konstantinos and Eirini, and to my siblings, Marios, Olga and Alexandros


Abstract

This thesis focuses on the optimization of the Sparse Matrix-Vector Multiplication kernel (SpMV) for modern multicore architectures. We perform an in-depth performance analysis of the kernel and identify its major performance bottlenecks. This analysis allows us to propose an advanced storage format for sparse matrices, the Compressed Sparse eXtended (CSX) format, which specifically targets the minimization of the memory footprint of the sparse matrix. This format provides significant improvements in the performance of the SpMV kernel across a variety of matrices and multicore architectures, while maintaining considerable performance stability. Finally, we investigate the performance of the SpMV kernel from an energy-efficiency perspective, in order to identify the execution configurations that lead to optimal performance-energy tradeoffs.

Keywords: high performance computing; scientific applications; sparse matrix-vector multiplication; multicore; data compression; energy-efficiency; SpMV; CSX; HPC


Contents

Contents i

List of Figures v

List of Tables ix

List of Algorithms xi

In Lieu of a Preface xiii

1 Introduction 1
  1.1 Sparse linear systems 1
  1.2 The computational aspect of iterative methods 3
  1.3 Challenges of multicore architectures 5
    1.3.1 Energy consumption considerations 7
    1.3.2 The algorithmic nature of the SpMV kernel 8
    1.3.3 The energy efficiency of the SpMV kernel 9
  1.4 Contribution of this thesis 10
    1.4.1 In-depth performance analysis and prediction models 10
    1.4.2 The CSX storage format for sparse matrices 11
    1.4.3 Toward an energy-efficient SpMV implementation 13
  1.5 Outline 14

2 Storing Sparse Matrices 17
  2.1 Conventional storage formats 17
  2.2 Exploiting the density structure of the matrix 20
    2.2.1 Fixed size blocking 21
    2.2.2 Variable size blocking 23
    2.2.3 Other approaches 27
  2.3 Explicit compression of the matrix indices 29
  2.4 Exploiting symmetry in sparse matrices 31

3 The Performance of the Sparse Matrix-Vector Kernel 35
  3.1 An algorithmic view 35
  3.2 Experimental preliminaries 38
    3.2.1 Matrix suite 39
    3.2.2 Hardware platforms 39
    3.2.3 Measurement procedures and policies 43
  3.3 A quantitative evaluation 45
    3.3.1 Single-threaded performance 45
    3.3.2 Multithreaded performance 47
  3.4 Related work 60
  3.5 Summary 60

4 Optimization opportunities of blocking 63
  4.1 The effect of compression 63
  4.2 The effect of the computations 67
    4.2.1 Vectorization and the block shape 67
  4.3 Predicting the optimal block size 71
    4.3.1 The M model 72
    4.3.2 The S model 73
    4.3.3 The MC model 74
    4.3.4 The O model 74
    4.3.5 Assessing the accuracy of the models 75
    4.3.6 Extensions 79
  4.4 Summary 81

5 The Compressed Sparse eXtended Format 83
  5.1 The need for an integrated storage format 84
  5.2 CSX data structures 86
  5.3 Detection and encoding of substructures 88
    5.3.1 Mining the matrix for substructures 88
    5.3.2 Selecting substructures for final encoding 93
    5.3.3 Building the CSX data structures 94
  5.4 Generating the SpMV code 95
  5.5 Tackling the preprocessing cost 99
  5.6 Porting to NUMA architectures 100
    5.6.1 Optimizing the computations 101
  5.7 Evaluating the performance of CSX 102
    5.7.1 CSX compression potential 102
    5.7.2 CSX performance 104
    5.7.3 CSX preprocessing cost 111
  5.8 Integrating CSX into multiphysics simulation software 113
  5.9 Summary 116

6 Exploiting symmetry in sparse matrices 119
  6.1 The symmetric SpMV kernel 119
  6.2 Minimizing the reduction cost 122
    6.2.1 Effective ranges of local vectors 122
    6.2.2 Local vectors indexing 123
    6.2.3 Alternative methods 126
  6.3 CSX for symmetric matrices 127
  6.4 Performance evaluation 129
    6.4.1 Local vectors methods 129
    6.4.2 Symmetric CSX 131
    6.4.3 Reduced bandwidth matrices 132
    6.4.4 Impact on the CG iterative method 134
  6.5 Summary 136

7 Energy-efficiency considerations 139
  7.1 Fundamentals of processor power dissipation 139
    7.1.1 Sources of power dissipation 139
    7.1.2 Energy-Delay products 142
    7.1.3 Energy-efficiency from a software perspective 142
  7.2 Performance-Energy tradeoffs in the SpMV kernel 144
    7.2.1 Experimental setup 145
    7.2.2 Characterizing the tradeoffs 146
  7.3 Predicting the optimal execution configurations 150
    7.3.1 Clustering the matrices 151
    7.3.2 Constructing the cluster Pareto front 152
    7.3.3 Classification and testing 153
    7.3.4 Limitations 157
  7.4 Open issues 158

8 Conclusions 159
  8.1 SpMV: Victim of the memory wall 159
  8.2 CSX: A viable approach to a high performance SpMV 160
  8.3 Toward an energy-efficient SpMV 162
  8.4 Future research directions 162

List of publications 165


Bibliography 167

Index 179

Short biography 183


List of Figures

1.1 Execution time breakdown for the non-preconditioned CG iterative method for different problem categories. 5
1.2 The two current trends in modern multicore architectures: symmetric shared memory and non-uniform memory access (NUMA) architectures. 6
1.3 Demonstration of the SpMV kernel speedup in relation to the memory bandwidth consumption in a two-way quad-core symmetric shared memory system. 9
2.1 The Coordinate sparse matrix storage format. 18
2.2 The Compressed Sparse Row storage format. 19
2.3 The Blocked Compressed Sparse Row storage format. 21
2.4 The Row Segmented Diagonal storage format. 23
2.5 The Variable Block Length storage format. 24
2.6 The Variable Block Row storage format. 26
2.7 The Ellpack-Itpack storage format, suitable for vector processors and modern GPU architectures. 29
2.8 Run-length encoding of the matrix column indices. 30
2.9 The Symmetric Sparse Skyline format. 31
2.10 The RAW dependency on the output vector in a parallel execution of the symmetric SpMV kernel. 32
3.1 The block diagrams of the multiprocessor systems used for the experimental evaluations in this thesis (continues at page 42). 41
3.1 The block diagrams of the multiprocessor systems used for the experimental evaluations in this thesis. 42
3.2 Performance overheads of the SpMV kernel. 48
3.2 Performance overheads of the SpMV kernel. 49
3.3 Speedup of the SpMV kernel in SMP systems. 50
3.4 SpMV speedup in Dunnington using the 'share-all' core-filling policy. 51
3.5 The contention in the memory bus of SMP architectures. 52
3.6 Performance of the SpMV kernel in relation to the arithmetic intensity (flop:byte ratio). 52
3.7 Load imbalance in the SpMV kernel. 53
3.8 Variation of the flop:byte ratio across the matrix. 54
3.9 The benefit of matrix reordering. 55
3.10 The speedup of the SpMV kernel in NUMA architectures. 56
3.11 Saturation of the interconnection link in NUMA architectures. 57
3.12 Transparent data placement in NUMA architectures. 59
3.13 Effect of sharing the input vector in NUMA architectures. 59

4.1 Correlation of the compression ratio and the performance improvement. 65
4.2 The speedup achieved by a variety of blocking storage formats. 66
4.3 The effect of compression. 67
4.4 The effect of SpMV vectorization in SMP platforms. 71
4.5 The effect of SpMV vectorization in NUMA platforms. 72
4.6 Selection accuracy of the considered performance models. 77
4.7 Execution time prediction accuracy of the considered performance models. 80
5.1 The advantage of detecting multiple substructures inside a sparse matrix. 85
5.2 The ctl byte-array used by CSX to encode the location information of the non-zero elements of a sparse matrix. 87
5.3 CSX matrix encoding. 87
5.4 The variable-size integer encoding employed by CSX. 88
5.5 Example of the run-length encoding of the column indices. 89
5.6 Detection of 2-D substructures in CSX. 92
5.7 The substructures detected and encoded by CSX in a diverse set of sparse matrices. 93
5.8 The just-in-time compilation framework of CSX. 97
5.9 The compression potential of CSX. 103
5.10 CSX speedup in SMP systems. 105
5.11 CSX speedup in SMP systems using a 'share-nothing' core-filling policy. 106
5.12 CSX speedup in NUMA architectures. 107
5.13 Per-matrix performance in Gainestown. 108
5.14 Performance of CSX and CSR in RCM reordered matrices. 111
5.15 The preprocessing cost of CSX. 112
5.16 The updated matrix-vector multiplication interface of Elmer. 114
5.17 Execution time breakdown of the CSX library. 115
5.18 Speedup of the Elmer multiphysics simulation software. 116
6.1 The RAW dependency on the output vector in a parallel execution of the symmetric SpMV kernel. 120
6.2 Speedup of the naive symmetric SpMV implementation. 122
6.3 Local vector methods for the reduction phase of the symmetric SpMV kernel. 124
6.4 The density of the effective regions of local vectors. 125
6.5 The workload overhead of the reduction phase. 126
6.6 Encoded substructures by CSX-Sym in an 8 × 8 sparse matrix. 128
6.7 Symmetric SpMV speedup with different local vectors reduction methods. 130
6.8 Symmetric SpMV execution time breakdown. 131
6.9 Symmetric SpMV speedup with the CSX-Sym format. 132
6.10 Per-matrix performance for the CSX-Sym format. 132
6.11 Per-matrix performance on reordered symmetric matrices. 134
6.12 CG execution time breakdown. 136
7.1 Performance-energy tradeoffs of the SpMV kernel. 148
7.2 The optimal performance-energy tradeoffs for two sparse matrices. 151
7.3 Matrix clustering. 153
7.4 Construction of the cluster Pareto front. 154
7.5 The first three Pareto fronts of the matrix thermal2. 155
7.6 Optimal execution configuration predictions. 156


List of Tables

3.1 The matrix suite used for experimental evaluation. 40
3.2 Technical characteristics of the hardware platforms used in the experimental evaluations. 43
3.3 Microbenchmarks for assessing the performance problems of the SpMV kernel. 46
4.1 Selection accuracy of the considered performance models. 76
4.2 Execution time prediction accuracy of the considered SpMV performance models. 78
5.1 Key features of the most important CSR alternatives. 84
5.2 The coordinate transformations applied by CSX. 91
5.3 CSX performance stability. 109
5.4 DMV vs. CSX performance. 110
5.5 The benchmark problems used for the evaluation of the CSX integration into Elmer. 114
6.1 Compression ratio of the symmetric CSX. 128
6.2 Symmetric SpMV performance improvement due to matrix reordering. 133
7.1 Performance monitoring events used for the prediction of power dissipation. 146
7.2 Cross-validation results. 157
7.3 Energy-delay prediction accuracy. 157


List of Algorithms

1.1 High-level representation of the matrix-vector multiplication. 8
2.1 Implementation of the SpMV kernel using the COO format. 18
2.2 Implementation of the SpMV kernel using the CSR format. 20
2.3 Implementation of the SpMV kernel using the BCSR format. 22
2.4 Implementation of the SpMV kernel using the VBL format. 25
2.5 Implementation of the SpMV kernel using the SSS symmetric storage format. 32
3.1 Implementation of the SpMV kernel using the CSR format (see Chapter 2, Section 2.1 for a detailed description). 36
4.1 SIMD implementation of the BCSR kernel for 1 × 2 blocks. 69
4.2 SIMD implementation of the BCSR kernel for 2 × 1 blocks. 70
5.1 Run-length encoding of the column indices. 89
5.2 Detecting substructures in CSX's internal representation. 90
5.3 Detection, selection and encoding of the substructures in CSX. 94
5.4 Construction of the CSX data structures. 96
5.5 Helper procedure for emitting delta units. 97
5.6 The SpMV kernel template used by the CSX storage format. 98
6.1 Parallel implementation of the symmetric SpMV kernel using the SSS storage format. 121
6.2 The non-preconditioned Conjugate Gradient algorithm. 135


In Lieu of a Preface

Many are the thoughts and feelings that fill my mind as I reach the end of the "struggle" called a Ph.D.: relief and satisfaction at its completion, anticipation for the future, nostalgia for its beautiful and difficult moments. Everything that in one way or another marks your course is usually inextricably linked to the human factor, directly or indirectly, discreetly or eloquently. The Ph.D. could not be an exception, not only because of its long duration, but also because of the intensity and range of emotions it frequently provokes along the way. It is, therefore, by no means a "one-man show", but rather the resultant of the contributions, material, intellectual and psychological, visible or invisible, of a group of exceptional people.

First of all, I would like to thank my advisor, Associate Professor Nectarios Koziris, for the trust he placed in me in the difficult undertaking of the Ph.D., first by accepting me into the doctoral program and then by ensuring the unhindered and substantive progress of my thesis, through the continuous contact of our entire group with the latest technology and the encouragement of our active participation in the wider research community. I also owe my thanks to my co-advisors, Professor Andreas-Georgios Stafylopatis, for our close, harmonious and constructive collaboration during the first years of the Ph.D., and Professor Panayiotis Tsanakas.

This doctoral thesis, however, might not have existed without the unreserved support of Lecturer Georgios Goumas. Giorgos's contribution was decisive, both in organizing and strengthening the research profile of our group and, specifically, in my own thesis. From my first steps in the demanding field of High Performance Computing, Giorgos was constantly by my side, in the misfortunes, which were several, and in the successes, which were significant and promising for the future.


It would be a great omission in this short note not to mention the other members of the laboratory, older and newer, who, with their high scientific and intellectual level, create a significant updraft that carries along and inspires every newcomer to the group. It is hard to single out specific people without possibly doing "injustice" to others, since the pleasant climate of work and collaboration was created and is maintained collectively. From Vangelis Koukis, whom I knew as an undergraduate student and who was an important spark for me to join the CSLab group; Kornilios Kourtis, with whom I collaborated and whose way of working and of perceiving things inspired me to a great extent; Giorgos Tsoukalas, Nikos Anastopoulos, Kostis Nikas, Tasos Nanos, Georgia Kouveli, Tasos Katsigiannis and Stefanos Gerangelos, with whom I shared beautiful moments of technical (and not so technical) discussions at home and abroad; Thodoris Gountouvas, who contributed with his diploma thesis to the improvement and extension of key parts of my dissertation; and the rest of the members of the "Parallel" and "Distributed" groups: all together they have contributed and continue to contribute to the pleasant climate that acts as a driving force for every success.

I also owe warm thanks to the contributors "outside the frame": my ever-present friends from my high-school and undergraduate years, Nikos, Kostas and Sokratis, who often provided the escape and outlet from the turbulence of everyday life, and my girlfriend, Evanthia, who in recent years has been constantly by my side, passing on to me her strength and her love so that I could continue along the difficult road of the Ph.D. Last, but rather as cornerstones than as an afterthought, my parents, Konstantinos and Eirini, and my siblings, Marios, Olga and Alexandros, deserve my undivided gratitude and thanks for everything they have offered me all these years, materially, morally and intellectually. To them I dedicate this work.

With deep appreciation,
Vasileios K. Karakasis

This thesis was carried out at the Computing Systems Laboratory of the Division of Computer Science of the School of Electrical and Computer Engineering of the National Technical University of Athens.


1 Introduction

Research in sparse matrices has been active since the early days of computing systems. Their use is strongly intertwined with the solution of large sparse linear systems, whose mathematical and computational properties, as well as their performance optimization opportunities, are still of particular interest to the applied mathematics and computer science communities. The advent of modern multicore architectures has posed new challenges to the performance optimization of sparse matrix kernels, which are now even more important for the efficient solution of large sparse linear systems. In this chapter, we briefly present the different methods for solving large sparse linear systems and highlight the importance of the sparse matrix-vector multiplication kernel, which has been the key motivation for this work. We also discuss the main challenges posed by modern multicore architectures in the execution of this kernel and give an overview of our approach for its optimization.

1.1 Sparse linear systems

Sparse matrices are finite matrices dominated by zero elements. These matrices often arise from the discretization of partial differential equations (PDEs) with finite element methods (FEM) and are usually involved in the solution of large linear systems of the form

Ax = b,    (1.1)

where A is an n × n coefficient matrix, b is the right-hand side vector, and x is the vector of unknowns. There are two large categories of methods for solving linear systems: direct and iterative solution methods.

The direct solution methods compute the exact solution of a linear system by factorizing the matrix A, i.e., expressing it as a product of two or more matrices, and then relying on forward and backward substitution to compute the unknown vector x. Well-known direct solution methods include Gaussian elimination, the LU decomposition and the Cholesky factorization [Duff et al., 1989; Barrett et al., 1987; Davis, 2006].


The cost of these methods is proportional to the cost of matrix-matrix multiplication [Strassen, 1969], which can be prohibitive for large dense linear systems. For sparse matrices, though, this cost is on the order of the number of non-zero elements of the matrix, but in this case the implementation of direct methods is not as straightforward as for dense matrices. During the factorization process of the original matrix, new non-zero elements, called fill-in elements, may be introduced in places which were initially occupied by zeros. This complicates both the data structures needed to store the sparse matrix and the factorization process. The storage of sparse matrices must efficiently support the insertion of the fill-in elements, whereas the factorization process must be preceded by a fill-in minimization step, in order to reduce the overall insertion overhead [Saad, 2003]. This increased cost of sparse direct methods favors the use of approximate iterative methods for the solution of large sparse linear systems.

The iterative solution methods do not compute the exact solution of a linear system; instead, they try to provide a close approximation of the solution. The iterative methods for linear systems can be divided into two large categories [Barrett et al., 1987; Saad, 2003]: (a) the stationary methods and (b) the projection methods.

The stationary methods start with an initial approximation x0 of the system's solution, which they then refine iteratively by trying to annihilate one or more components of the residual vector b − Ax at a time. In its general form, a stationary method can be written as follows [Barrett et al., 1987]:

x^(k+1) = B x^(k) + c    (1.2)

At each iteration k, the next approximation x^(k+1) of the solution is computed by multiplying the current approximation with a sparse coefficient matrix B, derived from the original matrix A. The most common stationary methods are the Jacobi, Gauss-Seidel and Successive Overrelaxation (SOR) methods. The convergence of these methods relies on strict requirements on the structure of the original matrix and is therefore not guaranteed for all matrices [James and Riha, 1975; Ferziger and Peric, 2001; Saad, 2003].

The projection methods try to extract an approximate solution x̃ of the system from a subspace K of R^n by constraining the residual vector r = b − Ax̃ to be orthogonal to a subspace L of R^n. Starting from an initial guess x0, these methods proceed at every iteration by selecting a pair of subspaces K and L and advancing x in K, so that the new residual is orthogonal to L. In practice, these methods can be viewed as a generalization of the steepest descent method, where at each iteration the selected solution is the one that minimizes the distance from the right-hand side vector b.


Indeed, the steepest descent method can be obtained by setting K = L = span{r} [Saad, 2003].

The most successful and widely used projection methods are the Krylov subspace methods [Saad, 1981]. These methods rely on Krylov subspaces for drawing the approximate solution at each iteration. A Krylov subspace is defined as

K_m(A, v) = span{v, Av, A^2 v, …, A^(m−1) v},    (1.3)

where the span operator denotes the set of all linear combinations of its vector arguments. At each iteration m, a Krylov method draws a solution from the subspace K_m(A, r0), where A is the system's coefficient matrix and r0 is the residual of the initial guess x0. A Krylov method can be viewed as a polynomial approximation of A^(−1). Indeed, if the method has converged after m iterations to a solution x̃ and x0 = 0, then the following equations must hold:

x ≈ x̃    (1.4)
A^(−1) b ≈ ( Σ_{i=1}^{m−1} α_i A^i ) b    (1.5)

The disadvantage of iterative solution methods compared to direct ones is the lack of robustness. The convergence of stationary methods cannot be guaranteed for all matrices, while the convergence rate of the projection methods may be too slow to be practical. For this reason, it is desirable to transform the initial problem into an equivalent one with better convergence characteristics. This process is known as preconditioning of the linear system.

Preconditioned Krylov methods are currently the most widely used methods for solving large sparse linear systems. Additionally, these methods involve a small and well-defined set of computational kernels, which favors high-performance parallel implementations. Among the most prominent Krylov methods are the Generalized Minimum Residual (GMRES) method and the Conjugate Gradient (CG) method with its variations, Bi-CG and Bi-CG Stabilized [Saad and Schultz, 1986; Hestenes and Stiefel, 1952; van der Vorst, 1992].

1.2 The computational aspect of iterative methods

Krylov iterative methods for the solution of sparse linear systems involve the execution of the following four computational operations [Saad, 2003; Hoemmen, 2010]:

(a) Vector updates. A vector update is a combination of scalar multiplication and vector addition. Given two vectors, x and y, and a scalar α, the vector update operation is of the form

x = x + αy.

This operation is also known as AXPY, named after the corresponding BLAS subroutines [Lawson et al., 1979].

(b) Dot products. The dot product is the inner product of two vectors. Thus, given two vectors x and y with n elements, the dot product is calculated as

t = x^T y = Σ_{i=1}^{n} x_i y_i.

(c) Matrix-by-vector products. This is the product of a sparse matrix A and a vector x: y = Ax. The sparse matrix A is the coefficient matrix of the system or the transformed coefficient matrix in the case of a preconditioned Krylov method.

(d) Preconditioner operations. These are operations related to the preconditioning process and usually involve the direct solution of an 'easy' linear system.

Each of the above well-defined operations forms a computational kernel; a minimal C sketch of the two vector kernels is given below. The most time-consuming kernel in a Krylov method is the matrix-by-vector product, which we will henceforth call the Sparse Matrix-Vector Multiplication kernel, or simply SpMV. Figure 1.1 shows an execution time breakdown of a typical serial implementation of the CG method in a modern multicore architecture for different coefficient matrices. The majority of the execution time, surpassing 90% in some cases, is spent executing the SpMV kernel. This kernel, therefore, becomes of crucial importance for the acceleration of iterative solution methods and has recently been characterized as one of the computational problems whose optimization will play a significant role in scientific computing in the next decade [Colella, 2004; Asanovic et al., 2006].

However, the importance of the preconditioner operations and the dot products must not be underestimated. An expensive preconditioner, e.g., the incomplete LU factorization (ILU), may consume the majority of the execution time of an iterative method. Nonetheless, since preconditioners are actually direct solution methods, they are not directly related to the iterative process itself, as are the SpMV kernel and the dot products, but are rather part of the intriguing problem of direct solution methods for sparse matrices [Asanovic et al., 2006]. Additionally, in contrast to the SpMV kernel and the vector operations, there is no such strict requirement for high-precision arithmetic in the preconditioners, thus allowing high-performance, low-precision implementations [Buttari et al., 2008].
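For concreteness, the following is a minimal serial C sketch of the two vector kernels, (a) and (b). The function names axpy and dot are illustrative and are not taken from any particular BLAS implementation.

    #include <stddef.h>

    /* Vector update (AXPY): x = x + alpha*y, cf. operation (a). */
    void axpy(size_t n, double alpha, const double *y, double *x)
    {
        for (size_t i = 0; i < n; i++)
            x[i] += alpha * y[i];
    }

    /* Dot product: t = x^T y, cf. operation (b). */
    double dot(size_t n, const double *x, const double *y)
    {
        double t = 0.0;
        for (size_t i = 0; i < n; i++)
            t += x[i] * y[i];
        return t;
    }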


[Bar chart; y-axis: Execution time breakdown (%), 0–100; x-axis problem categories: FEM Stiffness, Structural, Micro-FEM, Electromagn.; series: SpMV, Vec. Ops.]

Figure 1.1: Execution time breakdown for the non-preconditioned CG iterative method for different problem categories.

The dot products can also become a performance bottleneck in certain cases, especially in a parallel execution environment. The reduction operation required to calculate the final result has limited parallelism, logarithmic in the input size n. Although large input sizes lead to adequate parallelism, undesirable side-effects may be introduced, such as cache misses or increased communication cost, hindering the exploitation of the available parallelism.

Implementing high-performance iterative solution methods for large sparse linear systems is indeed an extremely challenging task with multiple aspects that need careful and in-depth examination. The advent of multicore and manycore architectures has introduced further challenges for the optimization of these methods, which must be successfully addressed in order to harness the performance capacity of the new hardware.

1.3 Challenges of multicore architectures

Not very long ago, the quest for higher processor frequencies was abandoned by the leading chip manufacturers, due to the need for higher energy efficiency and the need to keep Moore's law alive at the same time. The interest of academia and industry has since shifted toward incorporating multiple cores inside the same physical processor, inaugurating the multicore era and setting up new challenges. The use of multiple cores or hardware threads inside the same physical processor has broadened the gap between the rate at which the processor can now consume data and the rate at which the memory subsystem can supply it, making the 'memory-wall' problem [Wulf and McKee, 1995] even more acute.


[Block diagrams; panels: (a) Symmetric shared memory, (b) NUMA; components shown: processors (P), caches (C$), bus interfaces (Bus I/F), shared bus, memory controllers (MC), main memory and memory nodes 0 and 1.]

Figure 1.2: The two current trends in modern multicore architectures: symmetric shared memory and non-uniform memory access (NUMA) architectures.

Symmetric shared memory multicore architectures are affected the most by the memory-processor speed gap. In these architectures, all the memory and interprocessor communication requests are routed through the same front-end bus to the central, off-chip memory controller and the peer processors (Figure 1.2a). Evidently, this centralized logic, in conjunction with the low bandwidth and high latency that off-chip communication carries with it, can easily become the hotspot of a memory-intensive application. Large and complex cache hierarchies, unfortunately, can limit this effect in the short term only. The need to extract more parallelism from the hardware, given the slow pace of DRAM speed evolution, demands a more decentralized approach. The Non-Uniform Memory Access (NUMA) architectures 'move' the memory controller inside the processor chip and use dedicated hardware for the interprocessor communication (Figure 1.2b). The main memory, though still shared, is no longer uniformly accessible from every processor in the system: it is split into multiple nodes, each one assigned to a single processor. The available bandwidth is now ample for the communication between a processor and its local memory node, but accessing remote nodes requires multiple and costly hops. Two more challenges arise with the NUMA architectures for memory-intensive kernels:

(a) The increase in the available memory bandwidth may reveal weaknesses in the computational part of the kernel that were otherwise hidden by the very slow access to the main memory.

(b) e kernel’s performance can be now very sensitive to the correct place- ment of its data on the different memory nodes.


The latter challenge poses an additional burden on the programmer, who might need to explicitly alter her kernel to fully exploit the NUMA capabilities of a system.

These challenges must be successfully addressed in order to implement high-performance iterative solvers for future computer architectures. We believe that the communication versus computation tradeoff will become increasingly important in the next decade, as advances in hardware technology will allow a merger of the now special-purpose high-performance architectures (e.g., GPUs) and the flexible high-performance general-purpose processors.
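To illustrate the data-placement issue, the sketch below relies on the 'first-touch' page placement policy commonly implemented by NUMA-aware operating systems: each thread initializes the portion of the array it will later work on, so that the corresponding pages are allocated on its local memory node. This is only a minimal illustration under the assumption of a first-touch policy and a static OpenMP schedule; the function name is hypothetical, and the NUMA-specific techniques actually used in this thesis are discussed in Chapters 3 and 5.

    #include <stdlib.h>

    /* Allocate and zero-initialize a vector in parallel, so that each thread
     * "first-touches" the part it will later process; under a first-touch
     * NUMA policy the backing pages are then placed on that thread's local
     * memory node. The SpMV loop should use the same static schedule. */
    double *alloc_init_numa(size_t n)
    {
        double *v = malloc(n * sizeof(*v));
        if (!v)
            return NULL;

        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            v[i] = 0.0;   /* the first touch decides the page's home node */

        return v;
    }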

1.3.1 Energy consumption considerations

The power dissipation of modern processors has been of increasing concern in the last decade. The shrinking of the integration scale and the accompanying increase in the operating frequency of the circuit would have led to such a power density as to render processor cooling problematic, if not impractical. The realization of this problem brought processor clock frequency scaling to an abrupt halt, against the ambitious predictions of the early 2000s [ITRS, 2001]. Performance, however, was not compromised, as more processor cores are now integrated in the same chip, offering increasingly higher performance levels by exploiting parallelism. Nonetheless, as the integration scale is still constantly shrinking toward a few nanometers, leakage or 'idle' power has started to dominate the total power dissipation of modern processors [Ahmed and Schuegraf, 2011]. Resource sharing in the hardware and frequency scaling can help in controlling the power dissipation of modern processors and increase their energy efficiency.

Modern multicore processors share hardware resources at different levels, ranging from pipeline resources to high-level caches and the memory controller. Efficient resource sharing among the threads of a multithreaded application can be beneficial in terms of overall power dissipation, since 'unused' parts of the chip can be shut down, saving a considerable amount of wasted energy [Kaxiras et al., 2001]. Similarly, the effect of scaling down the CPU frequency on the total power dissipation of the processor can be significant, due to their superlinear relation [Brooks et al., 2000; Kaxiras and Martonosi, 2008]. The consequent loss in overall performance can be minimized, while maintaining a high energy efficiency, by dynamically scaling down the frequency in non-performance-critical parts [Kaxiras and Martonosi, 2008; Curtis-Maury et al., 2008]. Unfortunately, there is no magic recipe that maximizes performance and minimizes energy consumption at the same time.

Footnote: With the term communication, we denote all the off-chip communication traffic, including the memory operations.


Instead, there exists a set of optimal tradeoffs that must be balanced depending on the specific needs of each application and the underlying system [Karakasis et al., 2011]. Finding an optimal performance-energy tradeoff is not an easy task and must involve both the hardware and the software. The hardware must become more flexible, offering facilities for monitoring the power dissipation of the different components and for dynamically switching on and off 'power-hungry' parts of the processor. The software, on the other hand, must be able to utilize these features efficiently, either at the operating system level or through a power-aware runtime environment, in order to provide the means for high-performance, energy-efficient computing.

1.3.2 The algorithmic nature of the SpMV kernel

The matrix-vector multiplication kernels, both dense and sparse, can be viewed as a sequence of dot product operations between each row vector of the matrix and the input vector. Algorithm 1.1 shows this high-level approach. The algorithm sweeps once through the whole matrix and performs a constant number of floating-point operations per element. This automatically leads to a Θ(1) flop:byte ratio, compared to the Θ(n) (n being the dimension of the matrix) of the matrix-matrix multiplication kernels. In practice, this means that in order to avoid bottlenecks, the memory hierarchy must be able to provide data to the processor at a comparable speed, which is hardly ever the case for any modern mainstream microarchitecture. The situation gets worse with sparse matrices, where the kernel must first retrieve the non-zero element's location information before accessing it.

1: procedure MV(A::in, x::in, y::out) 2: foreach row vector ai∗ in A do ← T 3: yi ai∗x 4: end for

Algorithm 1.1: High-level representation of the matrix-vector multiplication.
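To put the Θ(1) flop:byte ratio into perspective, consider a rough estimate assuming a CSR-like layout with 8-byte floating-point values and 4-byte column indices, and ignoring the row pointers and the vector traffic: every non-zero element contributes one multiplication and one addition, but requires at least 12 bytes of matrix data to be fetched, i.e.,

flops / bytes ≈ (2 · NNZ) / (12 · NNZ) ≈ 0.17,

well below the arithmetic intensity that a typical multicore memory subsystem can sustain relative to the processor's peak floating-point rate.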

Figure 1.3 shows the speedup of a multithreaded SpMV baseline implementation and the consumed main memory bandwidth in GB/s in a two-way quad-core symmetric shared memory system. Despite exhibiting ample parallelism [Buluç et al., 2009], the SpMV kernel fails to scale beyond four threads due to the saturation of the system's memory bandwidth. 85% of the available memory bandwidth is already consumed by two threads, and it is totally saturated from four threads onward.


[Chart; x-axis: Threads (1, 2, 4, 8); left y-axis: Speedup (0–2.0); right y-axis: consumed memory bandwidth, 0–100% of peak (0.00–6.44 GB/s); series: Speedup, Consumed B/W.]

Figure 1.3: Demonstration of the SpMV kernel speedup in relation to the memory bandwidth consumption in a two-way quad-core symmetric shared memory system.

The minimization of the memory footprint of the sparse matrix, therefore, becomes of vital importance for the optimization of the SpMV kernel, especially for the symmetric shared memory architectures.

Despite being the most prominent, memory intensity is not the sole problem of the SpMV kernel. Even more challenging is that the performance of the kernel depends heavily on the sparsity structure of the input matrix, leading to dramatic performance variations. For example, matrices with a very irregular structure can lead to a considerable amount of cache misses and a significant load imbalance that will ruin performance. Conversely, regular ones may expose the computational part of the kernel more and require a less memory-centric approach. It is therefore imperative for a successful SpMV optimization to adapt to the specificities of the unknown input matrices at runtime, and this is probably one of the greatest challenges in optimizing this particular kernel.

1.3.3 The energy efficiency of the SpMV kernel

The streaming nature of the SpMV kernel renders its performance very sensitive to the sharing of the levels of the cache hierarchy among the threads. Multiple threads executing the SpMV kernel and sharing the same cache level contend for cache space, causing an increase in capacity misses and thus significantly deteriorating the performance of the kernel. On the other hand, 'spreading' the threads across different caches is not a power-friendly configuration, despite the increase in performance, since caches account for a significant share of the total power dissipation of the processor [Kaxiras and Martonosi, 2008].


This increase in power dissipation, however, can be compensated for by the use of multiple threads running at a lower frequency, without compromising the kernel's performance. Selecting the appropriate thread placement and processor frequency for obtaining an optimal compromise between performance and energy consumption is very challenging for the SpMV kernel, since its performance is highly dependent on the structure of the input matrix. Examining the structure of the matrix beforehand and using an advanced classification scheme might be necessary for selecting an execution configuration for the kernel, in order to achieve an optimal performance-energy tradeoff.

1.4 Contribution of this thesis

In this thesis, we focus on the optimization of the SpMV kernel in modern multicore architectures, since it is the most time-consuming kernel in iterative methods for the solution of sparse linear systems. Although preconditioner operations are an integral part of successful iterative methods and can be very time-consuming under certain circumstances, we do not focus on optimizing them. These operations are actually direct solution methods and can therefore be considered not directly related to the iterative solution process itself. Rather, they belong to the intriguing problem of direct methods for sparse matrices, and their optimization can be considered another important, though distinct, computational problem [Davis, 2006; Asanovic et al., 2006].

1.4.1 In-depth performance analysis and prediction models

The performance of the SpMV kernel has been of significant concern in the past; however, a robust performance analysis that would quantify the impact of the alleged performance bottlenecks had not been conducted. As a result, there has been no clear conclusion as to what the major performance problems of the SpMV kernel are. In this thesis, we take an in-depth look at the algorithmic characteristics of the SpMV kernel and perform a quantitative analysis on a wide range of sparse matrices and modern multicore architectures, in order to assess the exact impact of each bottleneck on the performance of the kernel. The conducted analysis offers a clear understanding of the performance of the SpMV kernel and serves as a guide for efficient implementations. Indeed, the identification of the memory bandwidth bottleneck as the key and inherent performance problem of SpMV in a multithreaded context has guided us to the adoption of explicit compression techniques in optimizing the kernel. Additionally, based on this analysis, we have developed performance models for the prediction of the optimal block for fixed-size blocking methods.


1.4.2 The CSX storage format for sparse matrices

The major contribution of this thesis is the Compressed Sparse eXtended (CSX) storage format for sparse matrices. Based on the idea of explicitly compressing the indexing structure of a sparse matrix storage format, in order to minimize the memory footprint of the matrix [Willcock and Lumsdaine, 2006; Kourtis et al., 2010], CSX incorporates the notion of blocking in order to achieve even higher compression rates and better computational characteristics [Karakasis et al., 2009a,b]. The idea of explicitly compressing the data involved in a memory-intensive application, such as the SpMV kernel, aims at reducing the pressure on the memory subsystem at the expense of the decompression computations. However, if the memory intensity of an application is mitigated by a higher available memory bandwidth, the decompression computations might be exposed and should therefore be reduced. In such a case, the computational part of SpMV is also further exposed, an issue that must be addressed successfully in order to achieve high performance.

CSX is a compact storage format for sparse matrices that is able to detect and encode in a single representation a diverse set of matrix substructures. A substructure is any regular one- or two-dimensional sequence of non-zero elements inside the sparse matrix. The number of all the possible substructures detected by CSX is indefinitely large and a priori unknown for a specific sparse matrix. For this reason, CSX employs runtime code generation in order to provide high-performance SpMV implementations adapted to the specificities of every matrix. Compared to other storage formats that exploit a single substructure type, e.g., the BCSR format that exploits only dense block substructures [Im and Yelick, 2001], CSX is able to achieve consistently high performance by successfully adapting to a great variety of sparse matrices, ranging from regular ones to those with a rather irregular structure. The CSX format can accelerate the SpMV kernel by more than 50% on average in symmetric shared memory systems, where the memory intensity of the kernel is more prominent. NUMA architectures are also supported efficiently by the CSX format, which is adapted to employ a less aggressive compression scheme in order to mitigate the cost of decompression; in these architectures, CSX is able to provide a nearly 20% performance improvement over typical SpMV implementations.

A key design goal for the CSX format was to be practical for use within existing iterative solution methods. For this reason, we have spent considerable effort in minimizing the preprocessing cost of CSX, which consists of the cost of mining the matrix for substructures and the cost of constructing the final CSX format. Using a combination of statistical sampling of the input matrix, careful memory management and parallelization, we were able to shrink the full preprocessing cost to a few dozen serial SpMV operations.


Indeed, despite its preprocessing cost, CSX is able to accelerate the SpMV component of the Bi-CG Stabilized solver of the Elmer multiphysics software by nearly 40% and offers a 15% average performance improvement to the overall solver.

An early version of CSX was initially conceived and presented by Kourtis [2010] as a first attempt to achieve higher compression ratios by exploiting linear substructures of the sparse matrix (horizontal, vertical, diagonal and anti-diagonal). In this thesis, CSX was substantially extended in order to become a viable high-performance approach for the optimization of SpMV in modern multicore architectures. A set of new features was implemented that allowed CSX to become one of the state-of-the-art storage formats for sparse matrices:

(a) Support for two-dimensional substructures. Two-dimensional substructures allow CSX to achieve compression ratios close to the theoretical maximum and also exhibit better computational characteristics.

(b) Adaptive compression scheme. This scheme allows the relaxation of compression in NUMA architectures, where the memory bottleneck is not so intense, by sparing processor cycles from the decompression cost.

(c) Advanced substructure selection heuristic. The new heuristic allows an efficient balancing of the compression benefit and the decompression overhead depending on the underlying architecture. In conjunction with the adaptive compression scheme, CSX is able to provide considerable performance benefits also in NUMA architectures.

(d) Optimized preprocessing. The preprocessing of the matrix needed to detect and encode the CSX substructures has been considerably reduced, allowing CSX to be used 'as-is' in the Elmer multiphysics software [Lyly et al., 1999–2000] and speed up the solver.

(e) Enhanced runtime code generation. The code generation module has been substantially revised in order to allow more straightforward and efficient implementations of the substructure-specific SpMV routines.

(f) Support for symmetric matrices. CSX also supports symmetric matrices, leading to considerable gains in SpMV performance, due to the significant reduction of the matrix's memory footprint.

Efficient implementation for symmetric matrices

We introduce a variant of CSX for symmetric matrices, capable of achieving a nearly 3× reduction in the storage size of the matrix. Multithreaded symmetric SpMV implementations, however, must cope with a RAW dependency on the output vector. In order to avoid the prohibitive cost of locking, the dependency is typically resolved with the use of per-thread local vectors, which are reduced to the final output vector at the end of the SpMV computation.


However, this reduction phase entails a significant performance overhead that increases linearly with the thread count, preventing SpMV from scaling. We address this problem by building an indexing scheme for local vectors that allows the selective reduction of the conflicting elements only, while all other update operations are pushed directly to the output vector. Our scheme effectively decouples the reduction overhead from the thread count, since the local vectors become sparser as the number of participating threads increases and, therefore, the index size is reduced. As a result, the symmetric SpMV kernel scales as expected. In conjunction with our low-overhead reduction scheme, symmetric CSX is able to accelerate the SpMV kernel by more than 2×.
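To make the reduction overhead concrete, the following sketch shows the baseline scheme that the indexing technique improves upon: every thread accumulates its updates into a private copy of the output vector, and all private copies are summed into the output vector at the end. The identifiers are illustrative only; the actual low-overhead scheme and the CSX-Sym implementation are presented in Chapter 6.

    #include <stddef.h>

    /* Baseline reduction for parallel symmetric SpMV (illustrative):
     * y_loc[t] is the private output vector of thread t, used to avoid the
     * RAW dependency on y. The cost of this loop grows with the number of
     * threads, which is precisely what the indexing scheme of Chapter 6
     * avoids by reducing only the conflicting elements. */
    void reduce_local_vectors(double *y, double **y_loc, int nthreads, size_t n)
    {
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            for (int t = 0; t < nthreads; t++)
                y[i] += y_loc[t][i];
    }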

Data compression beyond SpMV

The notion of data compression employed by CSX for the optimization of the SpMV kernel has broader repercussions in the modern multicore era. Every new processor generation not only widens the memory-processor speed gap, but also increases the memory-processor 'power gap'. Main memories are major contributors to the overall power dissipation of a computer system, and memory-intensive applications tend to consume more energy than CPU-intensive ones [Kamil et al., 2008]. As a result, explicit data compression techniques are likely to become more mainstream in the future, as a means not only for mitigating the memory contention problem, but also for reducing the energy footprint of future HPC applications.

1.4.3 Toward an energy-efficient SpMV implementation

In this thesis, we take a first step in investigating the performance-energy tradeoffs of the SpMV kernel and propose a methodology for selecting the processor frequency and the thread placement that lead to optimal tradeoffs for the SpMV kernel. Based on the notion of Pareto optimality, we show that there exists a set of optimal performance-energy tradeoffs and that the commonly used energy-delay products are indeed optimal solutions in the Pareto sense. The key goal of our methodology is not only to predict the best energy-delay configurations, but also to provide a representative set of optimal tradeoffs. Based on the observation that matrices with similar structural characteristics have similar performance and energy consumption behavior, we build our method on matrix clustering techniques and construct a representative set of optimal configurations for each cluster. Our prediction method is able to provide near-optimal sets of execution configurations for the SpMV kernel for a wide range of sparse matrices.


1.5 Outline

The rest of the thesis proceeds with a detailed and in-depth analysis of the involved subjects and our contribution.

Chapter 2 is a review of the storage formats for sparse matrices proposed in the literature. Starting from conventional and widely used formats, we pass through the multiple blocking storage variants to microarchitecture-specific formats and formats relying on explicit compression techniques. We also present storage formats for symmetric matrices and discuss the major challenges of the symmetric SpMV kernel.

Chapter 3 presents an in-depth performance analysis of the SpMV kernel in modern multicore architectures. We highlight the possible performance impediments from an algorithmic perspective and proceed to study their impact on modern multicore architectures thoroughly, in single- and multithreaded contexts. In this chapter, we also present the matrix suite and the experimental platforms and methodology we use throughout this thesis.

Chapter 4 focuses specifically on blocking storage formats and investigates their optimization opportunities. We study the compression capabilities and computational characteristics of each format and investigate their impact on modern multicore architectures. Based on this analysis, we propose two alternative and simple performance models for predicting the optimal block of BCSR.

Chapter 5 presents in detail the Compressed Sparse eXtended format. The presentation of the data structures and the matrix construction process is followed by a thorough performance evaluation, which confirms the superiority of CSX compared to other alternative storage formats in terms of absolute performance and performance stability across a variety of sparse matrices and architectures. The chapter closes with the benchmark results of the integration of CSX into the Elmer multiphysics simulation solver.

Chapter 6 presents our approach for optimizing the symmetric SpMV kernel. We start with a presentation and evaluation of the impact of the reduction phase optimization that allows SpMV to scale and continue with the CSX variant for symmetric sparse matrices. The chapter closes with an evaluation of our optimizations in the context of the CG iterative algorithm.

Chapter 7 departs from the purely performance-oriented view of the SpMV kernel and introduces the dimension of energy efficiency. Starting with an introductory section on the fundamentals of processor power dissipation, we then investigate the performance-energy tradeoffs of the SpMV kernel in relation to the input matrix, the processor frequency and the thread placement. We finally propose a machine-learning-based approach for predicting the execution configurations (core frequency, thread placement) that lead to the optimal tradeoffs.


Chapter 8 concludes this thesis, summarizing its contribution and achievements, and discusses future research directions in the field of SpMV optimization and HPC applications in general.


2 Storing Sparse Matrices

The density of a typical large sparse matrix, i.e., the ratio of its non-zero elements over the product of its dimensions, is well below 1% in most cases. Storing a sparse matrix efficiently in terms of storage space, therefore, requires storing only its non-zero values along with some location information for iterating over them. However, building a storage format for the efficient execution of the SpMV kernel is not a trivial task. The distribution of the non-zero elements and the underlying computer architecture play a significant role in the performance of this kernel, and an efficient storage format must adapt successfully to these requirements. Several storage formats for sparse matrices have been proposed for favoring the execution of the SpMV kernel, each one with its own advantages and disadvantages.

In this chapter, we attempt a comprehensive description of the most widely used storage formats for sparse matrices and also present the most recent approaches to the optimization of the SpMV kernel through the use of advanced sparse matrix storage formats. We focus specifically on static generic formats, i.e., formats that can be used to store sparse matrices with any non-zero element structure and do not take any particular measure for the dynamic insertion of new non-zero elements. Finally, although we briefly present two approaches that compress the non-zero values of the matrix, we focus mainly on storage formats that target the minimization of the non-zero element indexing structures.

2.1 Conventional storage formats

The most straightforward way to store a sparse matrix is to represent it as a sequence of (i, j, v) tuples, where v is the non-zero element's value and i, j are the corresponding row and column indices. This format is known as the Coordinate (COO) format [Tewarson, 1973; Pooch and Nieder, 1973; Duff and Reid, 1979; Saad, 1992].


A = [ 7.5  2.9  2.8  2.7  0    0   ]
    [ 6.8  5.7  3.8  0    0    0   ]
    [ 2.4  6.2  3.2  0    0    0   ]
    [ 9.7  0    0    2.3  0    0   ]
    [ 0    0    0    0    5.8  5.0 ]
    [ 0    0    0    0    6.6  8.1 ]

rowind: ( 0 0 0 0 1 1 1 2 2 2 3 3 4 4 5 5 )
colind: ( 0 1 2 3 0 1 2 0 1 2 0 3 4 5 4 5 )
values: ( 7.5 2.9 2.8 2.7 6.8 5.7 3.8 2.4 6.2 3.2 9.7 2.3 5.8 5.0 6.6 8.1 )

Figure 2.1: The Coordinate sparse matrix storage format.

Figure 2.1 shows a typical implementation of the COO format. The rowind and colind data structures are typically 32-bit integers, while the non-zero values are double precision floating point numbers. Algorithm 2.1 shows an SpMV kernel implementation using this format. The main advantage of this format is its flexibility in inserting and deleting non-zero elements, since there is no restriction on the order of the elements. However, its key disadvantage for use in the SpMV kernel is its large memory footprint (≈ 16·NNZ bytes) and the abundance of indirect, and possibly irregular, references in both the input and output vectors during the execution of the kernel.

1: procedure MVC(A::in, x::in, y::out) A: matrix in COO format x: input vector y: output vector 2: for i ← 0 to NNZ do 3: yi ← rowind[i] 4: y[yi] ← y[yi] + values[i] · x[colind[i]] 5: end for

Algorithm 2.1: Implementation of the SpMV kernel using the COO format.
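For concreteness, the following is a minimal C sketch of the COO data structures and of the kernel in Algorithm 2.1; the type and function names (coo_matrix, spmv_coo) are illustrative and not taken from any particular library.

#include <stdint.h>

/* Minimal COO representation: one (row, column, value) triplet per non-zero. */
typedef struct {
    int32_t *rowind;   /* row index of each non-zero    */
    int32_t *colind;   /* column index of each non-zero */
    double  *values;   /* non-zero values               */
    int32_t  nnz;      /* number of non-zero elements   */
} coo_matrix;

/* y <- y + A*x; y is assumed to be zero-initialized by the caller. */
static void spmv_coo(const coo_matrix *A, const double *x, double *y)
{
    for (int32_t i = 0; i < A->nnz; ++i)
        y[A->rowind[i]] += A->values[i] * x[A->colind[i]];
}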

The most widely used storage format for sparse matrices is the Compressed Sparse Row (CSR) format [Tinney and Walker, 1967; Pooch and Nieder, 1973; Duff and Reid, 1979; Saad, 1992].

For the completeness of our description, we should note here that the early sparse matrix storage format nomenclature was 'standardized' by Saad [1992]. Earlier research works or surveys usually describe the different formats without coining specific terms.


A = [ 7.5  2.9  2.8  2.7  0    0   ]
    [ 6.8  5.7  3.8  0    0    0   ]
    [ 2.4  6.2  3.2  0    0    0   ]
    [ 9.7  0    0    2.3  0    0   ]
    [ 0    0    0    0    5.8  5.0 ]
    [ 0    0    0    0    6.6  8.1 ]

rowptr: ( 0 4 7 10 12 14 16 )
colind: ( 0 1 2 3 0 1 2 0 1 2 0 3 4 5 4 5 )
values: ( 7.5 2.9 2.8 2.7 6.8 5.7 3.8 2.4 6.2 3.2 9.7 2.3 5.8 5.0 6.6 8.1 )

Figure 2.2: The Compressed Sparse Row storage format.

This format eliminates the rowind structure of the COO format, which stores the row indices explicitly, and replaces it with a set of pointers to the start of each row of the matrix. Figure 2.2 shows schematically a typical implementation of the CSR format. The colind and values arrays remain the same as in the COO format, while the rowind array is replaced by rowptr, which stores the position of the first element of each row inside the colind and values arrays. For an N × M sparse matrix, the size of the rowptr array is N + 1 and its last element always points to the end of the colind and values arrays. Since N ≪ NNZ for the majority of sparse matrices, the memory footprint of the CSR format is considerably reduced compared to the COO format (≈ 12·NNZ bytes), while its construction is also straightforward. This also constitutes the main advantage of CSR that has largely promoted its ubiquity: it is a compact and easy to use format. Nonetheless, CSR imposes a row-wise iteration order over the non-zero elements of the matrix, which implies a lexicographic sort of the elements based on their coordinates. The row indexing of CSR facilitates the random access of non-zero elements (O(N) complexity), but complicates the insertion and deletion of non-zero elements. The column-wise counterpart of CSR is the Compressed Sparse Column (CSC) format. CSC stores the non-zero elements column-wise and keeps the row indices explicitly, while indexing the columns of the matrix.

Algorithm 2.2 shows a typical implementation of the SpMV kernel using the CSR format. The performance characteristics of this kernel and its optimization opportunities will be discussed in detail in subsequent chapters.


1: procedure MVC(A::in, x::in, y::out) A: matrix in CSR format x: input vector y: output vector 2: for i ← 0 to N do 3: for j ← rowptr[i] to rowptr[i + 1] do 4: y[i] ← y[i] + values[j] · x[colind[j]] 5: end for 6: end for

Algorithm 2.2: Implementation of the SpMV kernel using the CSR format.
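A corresponding C sketch of the CSR layout and of Algorithm 2.2 is given below; as before, the names are illustrative only and the output vector is assumed to be zero-initialized by the caller.

#include <stdint.h>

/* Minimal CSR representation. */
typedef struct {
    int32_t *rowptr;   /* N+1 entries; start of each row in colind/values */
    int32_t *colind;   /* column index of each non-zero                   */
    double  *values;   /* non-zero values                                 */
    int32_t  nrows;
} csr_matrix;

/* y <- y + A*x; y is assumed to be zero-initialized by the caller. */
static void spmv_csr(const csr_matrix *A, const double *x, double *y)
{
    for (int32_t i = 0; i < A->nrows; ++i)
        for (int32_t j = A->rowptr[i]; j < A->rowptr[i + 1]; ++j)
            y[i] += A->values[j] * x[A->colind[j]];
}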

2.2 Exploiting the density structure of the matrix

Despite being relatively compact, the CSR format has a lot of redundant information in its colind structure, which stores the column indices of the non-zero elements. The non-zero elements of sparse matrices arising in real-life applications expose some regularities in their structure. For example, some elements might be arranged in horizontal, vertical or diagonal sequences, while others might form small dense two-dimensional blocks. We will call each of these regularities in the structure of the matrix a substructure. Since the sequence of elements inside a substructure is known by default, one could keep a single column index per substructure for describing all its elements, therefore reducing the size of the colind array. This technique of grouping together neighboring non-zero elements and representing them by a single column index is also known as blocking. Apart from the apparent advantage of reducing the column indexing structure of the matrix, blocking comes with an important side-effect: it provides the means for further optimizing the computational part of the SpMV kernel [Im and Yelick, 2001; Karakasis et al., 2009b]. The inner loop of the SpMV kernel, which computes the dot product between a row vector and the input vector (Algorithm 2.2, lines 3–5), does not proceed element-by-element, but rather substructure-by-substructure, exposing further the computational part of the kernel and allowing specific optimizations.

Several types of blocking formats have been proposed, each one exploiting a different type of substructure. However, we can distinguish two large categories of blocking: fixed and variable size blocking. In the following, we will describe these categories in more detail and present the most representative formats from each category. We will also briefly present some more specialized approaches.


A = [ 7.5  2.9  2.8  2.7  0    0   ]
    [ 6.8  5.7  3.8  0    0    0   ]
    [ 2.4  6.2  3.2  0    0    0   ]
    [ 9.7  0    0    2.3  0    0   ]
    [ 0    0    0    0    5.8  5.0 ]
    [ 0    0    0    0    6.6  8.1 ]        r = 2, c = 2

browptr: ( 0 2 4 5 )
bcolind: ( 0 2 0 2 4 )
bvalues: ( [7.5 2.9 6.8 5.7] [2.8 2.7 3.8 0] [2.4 6.2 9.7 0] [3.2 0 0 2.3] [5.8 5.0 6.6 8.1] )

Figure 2.3: The Blocked Compressed Sparse Row storage format.

2.2.1 Fixed size blocking

Fixed size blocking methods exploit one- or two-dimensional substructures by trying to construct full fixed size blocks. The size and the dimensions of the blocks are fixed throughout the matrix, while padding with explicit zeros is used to construct full blocks. These methods involve an initial preprocessing step, during which the matrix is scanned for dense blocks and a decision is drawn as to what block size to utilize. This decision is based mainly on the minimization of the padding elements [Im and Yelick, 2001; Im et al., 2004], but other approaches furthermore consider the computational characteristics of the resulting code [Karakasis et al., 2009c].

The most representative storage format from this category is the Blocked Compressed Sparse Row (BCSR) format [Pooch and Nieder, 1973; Saad, 1992; Im and Yelick, 2001]. BCSR is essentially the blocked version of the standard CSR format, storing fixed size blocks instead of simple non-zero elements. Figure 2.3 shows an example implementation of the BCSR format. The bcolind structure stores the column index of the upper left element of each block, while browptr now splits the matrix into block-rows. The non-zero elements are stored block-wise following a row-major order. BCSR imposes a strict alignment on its blocks, requiring an r×c block to start at r-row and c-column boundaries. Although this alignment requirement contributes to a faster preprocessing, it usually introduces unnecessary padding. Alternatives to BCSR that relax these requirements have been proposed by Vuduc and Moon [2005].

Algorithm 2.3 shows a typical implementation of the SpMV kernel using the BCSR format. The double loop in lines 7–11 computes the matrix-vector product for an r × c block. In practice, the block dimensions (r, c) are fixed


and known at compilation time; thus, the aforementioned loop can be optimized significantly with the use of common compiler optimization techniques, such as loop unrolling, vectorization etc. Indeed, the OSKI sparse matrix optimization library, which is an 'out-of-the-box' BCSR implementation, implements specific optimized versions for all block sizes up to 8 × 8, falling back to the generic implementation shown in Algorithm 2.3 for larger blocks [Vuduc et al., 2005].

1: procedure MVB(A::in, x::in, y::out, r::in, c::out) A: matrix in BCSR format x: input vector y: output vector r, c: block dimensions 2: ir ← 0 3: for i ← 0 to N step by r do 4: for j ← browptr[ir] to browptr[ir + 1] step by r · c do ← j 5: jb r · c ← 6: x0 bcolind[jb] 7: for k ← 0 to r do ¨ ← ¨ © 8: for l 0 to c do © 9: y[i + k] = y[i + k] + bvalues[j + k · c + l] · x[x0 + l] 10: end for 11: end for 12: end for 13: ir ← ir + 1 14: end for

Algorithm 2.3: Implementation of the SpMV kernel using the BCSR format.
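To make the effect of compile-time block specialization concrete, the following C sketch hard-codes 2×2 blocks and fully unrolls the per-block loops of Algorithm 2.3. Unlike Algorithm 2.3, browptr here counts blocks rather than offsets into bvalues; the structure and names are our own simplifying assumptions, and the number of rows is assumed to be a multiple of two.

#include <stdint.h>

/* BCSR with fixed 2x2 blocks; each block occupies 4 consecutive doubles in
 * bvalues, stored row-major. nrows is assumed to be a multiple of 2. */
typedef struct {
    int32_t *browptr;  /* nrows/2 + 1 entries: first block of each block-row      */
    int32_t *bcolind;  /* column index of the upper-left element of each block    */
    double  *bvalues;  /* 4 values per block, row-major                           */
    int32_t  nrows;
} bcsr22_matrix;

/* y <- A*x with the 2x2 inner loops fully unrolled. */
static void spmv_bcsr22(const bcsr22_matrix *A, const double *x, double *y)
{
    for (int32_t ir = 0; ir < A->nrows / 2; ++ir) {
        double y0 = 0.0, y1 = 0.0;              /* the two rows of this block-row */
        for (int32_t b = A->browptr[ir]; b < A->browptr[ir + 1]; ++b) {
            const double *v  = &A->bvalues[4 * b];
            const double *xb = &x[A->bcolind[b]];
            y0 += v[0] * xb[0] + v[1] * xb[1];  /* upper block row */
            y1 += v[2] * xb[0] + v[3] * xb[1];  /* lower block row */
        }
        y[2 * ir]     = y0;
        y[2 * ir + 1] = y1;
    }
}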

A common substructure encountered in many sparse matrices is a sequence of diagonal elements that does not necessarily lie on the main diagonal of the matrix. Due to its nature, BCSR cannot handle these substructures at all and inserts excessive padding. A common approach for applying blocking in such matrices is to segment the matrix into fixed size bands and try to construct full diagonal blocks using zero-padding. This format is known as Row Segmented Diagonal (RSDIAG) or, similarly to BCSR, Blocked Compressed Sparse Diagonal (BCSD) [Agarwal et al., 1992; Vuduc, 2003; Karakasis et al., 2009a]. Figure 2.4 shows an example implementation of the RSDIAG format.

Despite a simple representation and a straightforward implementation with enough optimization opportunities, fixed size blocking methods fall victim to


A = [ 7.5  2.9  2.8  2.7  0    0   ]
    [ 6.8  5.7  3.8  0    0    0   ]
    [ 2.4  6.2  3.2  0    0    0   ]
    [ 9.7  0    0    2.3  0    0   ]
    [ 0    0    0    0    5.8  5.0 ]
    [ 0    0    0    0    6.6  8.1 ]        b = 2

Figure 2.4: The Row Segmented Diagonal storage format.

the zero-padding they employ to construct full blocks. It is often the case with matrices without the desired non-zero element pattern, e.g., matrices with a lot of diagonal elements in the case of BCSR, that the final memory footprint of the matrix considerably exceeds that of the baseline CSR format, leading even to considerable performance degradation. The size and the orientation of the constructed blocks play a significant role in reducing the final matrix size. It is therefore imperative that a preprocessing step exists for scanning the matrix and deciding on the best block to use. The use of heuristics is a common approach for selecting the most appropriate block size for the input matrix. Im and Yelick [2001] estimate the padding overhead of each candidate block by applying random uniform sampling of the matrix and combine this metric with an estimation of the performance of each candidate block, obtained after an offline benchmarking process. Vuduc et al. [2002] fine-tune this heuristic and provide upper and lower bounds on BCSR performance by modeling the cache behavior. In Chapter 4, we will present our approach to selecting the best block size for BCSR, which, apart from an estimation of the resulting memory footprint of the matrix, takes into account the computational characteristics of the candidate blocks.

Storage space requirements

Although fixed size blocking can significantly reduce the matrix footprint in the case of matrices with a rather regular non-zero element structure, the compression potential of these methods remains limited. Suppose an N × N sparse matrix with NNZ non-zero elements that can be perfectly grouped in r × c blocks without padding (r, c ≪ N) using BCSR. The size of the bcolind structure will then be 4·NNZ/(r·c) = Θ(NNZ) bytes, still remaining in the order of NNZ, as is the case for CSR.

2.2.2 Variable size blocking

The variable size blocking methods avoid the insertion of padding elements by constructing blocks of variable size. To achieve this, they introduce


A = [ 7.5  2.9  2.8  2.7  0    0   ]
    [ 6.8  5.7  3.8  0    0    0   ]
    [ 2.4  6.2  3.2  0    0    0   ]
    [ 9.7  0    0    2.3  0    0   ]
    [ 0    0    0    0    5.8  5.0 ]
    [ 0    0    0    0    6.6  8.1 ]

rowptr:  ( 0 4 7 10 12 14 16 )
bcolind: ( 0 0 0 0 3 4 4 )
bsize:   ( 4 3 3 1 1 2 2 )
bvalues: ( [7.5 2.9 2.8 2.7] [6.8 5.7 3.8] [2.4 6.2 3.2] [9.7] [2.3] [5.8 5.0] [6.6 8.1] )

Figure 2.5: The Variable Block Length storage format.

additional data structures that store either the size of each block or the starting column and row indices of each block. The total size of these additional data structures can be kept low with the use of short, one- or two-byte, integers.

The most representative formats in this category are the Variable Block Length (VBL) format, which exploits one-dimensional horizontal blocks, and the Variable Block Row (VBR) format, which exploits two-dimensional blocks [Saad, 1994; Pinar and Heath, 1999; Vuduc and Moon, 2005].

Figure 2.5 shows a typical implementation of the VBL format. The values and rowptr arrays are exactly the same as in the CSR format, since VBL constructs one-dimensional horizontal blocks only. The bcolind array holds the column index of the first element of each block, while bsize stores the size of each block. Since very large blocks are not frequently encountered in sparse matrices, using a single byte to represent the block size in the bsize array is enough for most cases; in the case of a very large block, though, this is split into 255-element chunks. A characteristic of the VBL format is that all non-zero elements of the matrix are grouped into blocks; even stray non-zero elements form degenerate size-one blocks. It is obvious that for matrices with an irregular or non-horizontally oriented non-zero element structure, the overhead of the block size structure can be significant for VBL, reducing its compression capability.

Algorithm 2.4 shows an implementation of the SpMV kernel using the VBL format. The algorithm is very similar to the CSR implementation, but it also keeps track of the block currently being traversed. Since the block size is not fixed, the algorithm must also keep track of the size of the current block. This adds additional operations to the algorithm and increases the inner loop


overhead, a problem that becomes more prominent in the case of very small blocks.

1: procedure MVV(A::in, x::in, y::out) A: matrix in VBL format x: input vector y: output vector ← 2: ib 0 3: for i ← 0 to N do 4: j ← rowptr[i] ← 5: s bsize[ib] 6: while j < rowptr[i + 1] do ← 7: b0 bcolind[ib] 8: for k ← 0 to s do 9: y[i] ← y[i] + values[j + k] ∗ x[b0 + k] 10: end for 11: j ← j + s ← 12: ib ib + 1 ← 13: s bsize[ib] ¨ 14: end for ¨ © © Algorithm 2.4: Implementation of the SpMV kernel using the VBL format.

A step further from VBL is the VBR format, depicted in Figure 2.6, which is able to exploit variable size two-dimensional substructures. VBR splits the matrix into block rows and block columns with varying dimensions. The non-zero values are stored in the bvalues array in row-major block-wise order, while the rowind and colind arrays store the starting indices of each block row and block column, respectively. The bvalptr array marks the start of each block in the bvalues array, while bcolind stores the block column index by referencing the colind array. Finally, the browptr array marks the start of block rows by indexing the bcolind and bvalptr arrays.

Although VBR can detect arbitrary block substructures in the sparse matrix, the space and computational overhead of the supporting data structures can be excessive. For large sparse matrices, all the required indexing information will most probably require 32-bit integers, diminishing to a large extent the advantage of grouping non-zero elements into blocks. A compression of the VBR data structures could be achieved if the rowind and colind arrays stored the dimensions of the formed blocks, in which case one-byte integers would be enough for these data structures. Even in this case, however, VBR has two additional indexing data structures compared to the much simpler


colind: ( 0 1 3 4 6 )
rowind: ( 0 1 3 4 6 )

A = [ 7.5  2.9  2.8  2.7  0    0   ]
    [ 6.8  5.7  3.8  0    0    0   ]
    [ 2.4  6.2  3.2  0    0    0   ]
    [ 9.7  0    0    2.3  0    0   ]
    [ 0    0    0    0    5.8  5.0 ]
    [ 0    0    0    0    6.6  8.1 ]

browptr: ( 0 3 5 7 8 )
bcolind: ( 0 1 2 0 1 0 2 3 )
bvalptr: ( 0 1 3 4 6 10 11 12 16 )
bvalues: ( [7.5] [2.9 2.8] [2.7] [6.8 2.4] [5.7 3.8 6.2 3.2] [9.7] [2.3] [5.8 5.0 6.6 8.1] )

Figure 2.6: The Variable Block Row storage format.

VBL format. Furthermore, all this indexing metadata adds additional computational overhead, since multiple hops are needed to reach the actual non-zero values.

Storage space requirements

The compression potential of variable size blocking methods is higher than that of their fixed size counterparts, especially for methods without many additional data structures. Considering the case of VBL, suppose an N × N sparse matrix with NNZ_r elements per row and that k, k ≪ NNZ_r, blocks are formed on average per row. Since in a typical sparse matrix NNZ_r ≪ N, k can be considered constant. Therefore, the size of the compressed bcolind data structure will be 4kN bytes and the required size for holding the block sizes will be kN bytes (assuming one-byte size representations). These sum up to a size of 5kN = Ω(N) bytes, which brings the total matrix size very close to the theoretical lower bound. However, in sparse matrices with a non-horizontal non-zero element structure, VBL will form degenerate size-one blocks, and thus the number of blocks per

The theoretical lower bound for the size of a sparse matrix can be obtained if the indexing structures are completely eliminated. For example, suppose a fictional matrix whose non-zero element locations could be computed at runtime. Since N ≪ NNZ for most matrices, compressing the indexing structures to the order of N can lead to sizes very close to the theoretical lower bound.


row cannot be decoupled from the number of non-zero elements per row. In this case, the size of bcolind will be in the order of NNZ. Nonetheless, this cannot overshadow the higher compression potential of the variable size blocking methods over their fixed size counterparts, which are restricted mainly by the large number of resulting blocks. In practice, VBL almost always achieves higher compression ratios than BCSR.

2.2.3 Other approaches

The approaches we have already discussed exploit a single type of substructure in the sparse matrix, leaving out other types and leading to a performance degradation for 'unsuitable' matrices. A common approach for exploiting multiple substructures is to decompose the input matrix into multiple matrices, such that each one exploits a different substructure and their sum produces the initial matrix [Agarwal et al., 1992; Geus and Röllin, 2001]. The last addend is always a matrix in CSR format, containing the remainder elements that could not be grouped in a substructure. These formats avoid padding and therefore their memory footprint can be kept low. The matrix-vector multiplication using these formats consists of running the SpMV kernel for every addend and then accumulating the intermediate results into the final output vector. There are two possible performance problems with the decomposed formats, however: first, the accumulation of the intermediate results may introduce additional overhead and limit parallelism; second, the large segmentation of the non-zero elements among the multiple addends increases the sparsity of the involved matrices and can possibly lead to a deterioration of the SpMV performance.

The most recent approach to use the decomposition technique is the Pattern Block Row (PBR) format [Belgin et al., 2009]. PBR partitions the matrix into square blocks up to 8 × 8 and encodes the non-zero pattern in every block using a 64-bit bit vector. It then splits the matrix into multiple addends, each one storing a different block pattern; the remainder elements are stored in the last addend using the CSR format. Since the pattern structure is not known a priori, PBR generates pattern-specific SpMV routines at runtime.

Another interesting approach in blocking storage formats is the Compressed Sparse Block (CSB) format [Buluç et al., 2009]. The motivation behind this format is the efficient support of both Ax and Aᵀx matrix operations, the second being less efficient with the row-oriented common storage formats. For this reason, CSB divides the matrix into large sparse square blocks, which are stored in the Coordinate format using small integers for the row and column indices. At runtime, it employs task parallelism to schedule the execution of the resulting blocks.


Condensing the non-zero values

The non-zero values consume 2/3 of the overall matrix size in the CSR format. The straightforward approach of using single-precision arithmetic is not an option in most real-life applications, since 64-bit accuracy is important for the stability and the convergence of the iterative methods. The most common approach currently for compressing the non-zero values is to index them, i.e., store only the unique values and use an index for accessing them. The idea behind this technique is that the non-zero elements of the system's coefficient matrix often have the same value. Two approaches that exploit this property are the Super-Sparse (SS) storage format of Escudero [1984] and the CSR Value Indexed (CSR-VI) format of Kourtis et al. [2008b]. The downside of non-zero value indexing is that the value index must be rebuilt every time a non-zero element is changed, thus rendering these formats less flexible.
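The following C sketch illustrates the value-indexing idea on top of CSR: the per-element values array is replaced by a short index into a table of distinct values. It shows only the general technique, not the exact layout of SS or CSR-VI, and all names are illustrative.

#include <stdint.h>

/* CSR-like SpMV with indexed values: vals_idx[j] selects one of the distinct
 * non-zero values stored in unique_vals (at most 65,536 distinct values are
 * assumed here, so a 2-byte index suffices). y is written, not accumulated. */
static void spmv_csr_value_indexed(int32_t nrows,
                                   const int32_t *rowptr,
                                   const int32_t *colind,
                                   const uint16_t *vals_idx,   /* 2 bytes per non-zero */
                                   const double *unique_vals,  /* table of distinct values */
                                   const double *x, double *y)
{
    for (int32_t i = 0; i < nrows; ++i) {
        double yi = 0.0;
        for (int32_t j = rowptr[i]; j < rowptr[i + 1]; ++j)
            yi += unique_vals[vals_idx[j]] * x[colind[j]];
        y[i] = yi;
    }
}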

Microarchitecture-specific formats

Although matrix- or architecture-specific formats are beyond the scope of this presentation, it is relevant to briefly present the Ellpack-Itpack (ELL, ELLPACK or ITPACK) storage format [Saad, 1992]. This format was initially conceived for vector processors [Oppe and Kincaid, 1987], but has recently come back into the forefront with the emergence of general-purpose GPU architectures. The ELLPACK format stores an N×M sparse matrix using two two-dimensional arrays (Figure 2.7). The values array is an N × K matrix storing the non-zero elements row-by-row, with K being the maximum number of non-zero elements per row; rows with fewer elements are padded with zeros. The colind array stores the corresponding column indices, with irrelevant values in the place of nonexistent elements. The advantage of the ELLPACK format is that it favors the streaming fetches of vector processors. On the other hand, matrices with an irregular non-zero element structure might add a considerable amount of padding. Bell and Garland [2009] use a combination of the ELLPACK and COO formats for accelerating the performance of the SpMV kernel on modern GPU architectures, while Choi et al. [2010], inspired by BCSR, extend ELLPACK to store block rows.

Relevant to the ELLPACK format are, to some extent, the Streaming CSR (S-CSR) and Streaming BCSR (S-BCSR) formats proposed by Guo and Gropp [2011]. The motivation behind these formats is to allow the efficient use of the prefetch data stream hardware component of the IBM POWER processors. S-CSR splits the matrix into n streams and stores its rows in a cyclic way, such that row i is stored in stream i mod n. Each stream of rows is stored using the ELLPACK format. The S-CSR format and its blocked counterpart, S-BCSR,


A = [ 7.5  2.9  2.8  2.7  0    0   ]
    [ 6.8  5.7  3.8  0    0    0   ]
    [ 2.4  6.2  3.2  0    0    0   ]
    [ 9.7  0    0    2.3  0    0   ]
    [ 0    0    0    0    5.8  5.0 ]
    [ 0    0    0    0    6.6  8.1 ]

values: [ 7.5 2.9 2.8 2.7 ]    colind: [ 0 1 2 3 ]
        [ 6.8 5.7 3.8 0   ]            [ 0 1 2 * ]
        [ 2.4 6.2 3.2 0   ]            [ 0 1 2 * ]
        [ 9.7 2.3 0   0   ]            [ 0 3 * * ]
        [ 5.8 5.0 0   0   ]            [ 4 5 * * ]
        [ 6.6 8.1 0   0   ]            [ 4 5 * * ]

Figure 2.7: The Ellpack-Itpack storage format, suitable for vector processors and modern GPU architectures.

have the advantage of being able to trigger n independent prefetching streams, leading to a better utilization of the available memory bandwidth and a higher SpMV performance on this specific computer architecture.
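A minimal C sketch of an SpMV kernel over the ELLPACK layout of Figure 2.7 follows; it assumes that padded slots store a zero value and some valid column index, so that they contribute nothing to the result. GPU implementations typically store these arrays column-wise to obtain coalesced accesses; this row-wise sketch only illustrates the fixed inner trip count.

#include <stdint.h>

/* ELLPACK-style SpMV: values and colind are nrows x K arrays stored row-major.
 * Padded slots are assumed to hold the value 0.0 and any valid column index. */
static void spmv_ellpack(int32_t nrows, int32_t K,
                         const int32_t *colind, const double *values,
                         const double *x, double *y)
{
    for (int32_t i = 0; i < nrows; ++i) {
        double yi = 0.0;
        for (int32_t k = 0; k < K; ++k)      /* fixed trip count K for every row */
            yi += values[i * K + k] * x[colind[i * K + k]];
        y[i] = yi;
    }
}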

2.3 Explicit compression of the matrix indices

An alternative approach for minimizing the memory footprint of a sparse matrix is the explicit compression of CSR's colind data structure. Contrary to blocking, explicit compression techniques do not rely directly on the matrix structure, but treat the colind array as a common sequence of 32-bit integers and try to minimize its size. Due to the accentuation of the 'memory wall' problem in modern multicore computer architectures, such approaches have gained increasing interest recently [Willcock and Lumsdaine, 2006; Kourtis et al., 2008b].

The most common approach to explicitly compressing the colind data structure is the delta indexing or delta encoding of the non-zero elements' column indices. Instead of storing the full column index of a non-zero element, delta indexing stores the delta distance from the column index of the previous element and keeps the full column index of only the first element of each row [Pooch and Nieder, 1973]. The idea behind delta indexing for sparse matrices is that a delta distance is much more likely to fit in a small integer, therefore allowing a significant reduction in the size of colind.

A step further from delta indexing is the run-length encoding of the column


col. indices: ( 1 10 11 12 13 14 21 41 61 81 … )
delta values: ( 1  9  1  1  1  1  7 20 20 20 … )   (runs: d = 1 for the four unit deltas, d = 20 for the last three)

Figure 2.8: Run-length encoding of the matrix column indices.

indices. Run-length encoding compresses delta value sequences by grouping together equal delta values and representing them as a two-element tuple containing the common delta value and the size of the sequence. Figure 2.8 shows schematically the notion of run-length encoding of the column indices.

Although delta encoding of the column indices was discussed several years ago [Pooch and Nieder, 1973], it is not until recently that this technique has been utilized for the optimization of the SpMV kernel. Willcock and Lumsdaine [2006] apply delta encoding to the column indices and propose the Delta-Coded Sparse Row (DCSR) format. DCSR encodes the indexing information of the matrix (rowptr and colind) as a sequence of (command, argument) tuples. The argument is always a single byte and the commands (six in total) are responsible for regenerating the row and column index of the current non-zero element. The decompression process consists of reading and applying the command and then performing the appropriate computation with the corresponding non-zero value. The authors also propose a variation of DCSR that employs run-length encoding for detecting up to four contiguous non-zero elements. The downside of DCSR, however, is that the decompression cost is large, requiring non-portable implementations in order to amortize the cost and achieve high performance.

A simpler and more portable approach for delta encoding the column indices of a sparse matrix is the CSR Delta Units (CSR-DU) format of Kourtis et al. [2008b]. CSR-DU views the matrix as a sequence of delta units, defining three types of units depending on the bytes required to store a column delta distance (one-, two- or four-byte units). CSR-DU replaces both the rowptr and colind data structures with a byte sequence encoding the type, the initial column index and the delta distances of each unit. At runtime, the CSR-DU code switches on the unit type in order to decompress the delta distances correctly and proceeds with the SpMV operations. CSR-DU is easily extended to encode horizontal sequences of non-zero elements of an arbitrary delta distance by applying run-length encoding [Kourtis et al., 2010]. With this extension CSR-DU can be viewed as a generalization of the VBL format, but in a more compact form.
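The sketch below illustrates plain delta encoding of a single row's column indices into one-byte deltas. It is a simplified illustration of the idea only: it assumes every delta fits in a byte, whereas DCSR and CSR-DU switch between unit types of different widths; the function name is ours.

#include <stdint.h>
#include <stddef.h>

/* Delta-encode the column indices of one row: the first index is kept intact,
 * every subsequent index is stored as the distance from its predecessor.
 * For simplicity this sketch assumes every delta fits in a single byte;
 * a real format (e.g., CSR-DU) switches between 1-, 2- and 4-byte units. */
static size_t delta_encode_row(const int32_t *colind, size_t n,
                               int32_t *first, uint8_t *deltas)
{
    if (n == 0)
        return 0;
    *first = colind[0];
    for (size_t j = 1; j < n; ++j)
        deltas[j - 1] = (uint8_t)(colind[j] - colind[j - 1]);
    return n - 1;                 /* number of delta bytes produced */
}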


A = [ 7.5  6.8  2.4  9.7  0    0   ]
    [ 6.8  5.7  6.2  0    0    0   ]
    [ 2.4  6.2  3.2  0    0    0   ]
    [ 9.7  0    0    2.3  0    0   ]
    [ 0    0    0    0    5.8  6.6 ]
    [ 0    0    0    0    6.6  8.1 ]

rowptr:  ( 0 0 1 3 4 4 5 )
colind:  ( 0 0 1 0 4 )
values:  ( 6.8 2.4 6.2 9.7 6.6 )
dvalues: ( 7.5 5.7 3.2 2.3 5.8 8.1 )

Figure 2.9: The Symmetric Sparse Skyline format.

2.4 Exploiting symmetry in sparse matrices

Finite element methods often involve the solution of large linear systems with sparse, structured, symmetric coefficient matrices. Exploiting the symmetry in the non-zero element structure and values can reduce the memory footprint of the matrix by half, significantly alleviating the pressure on the memory hierarchy of the underlying architecture. Several approaches have been proposed in the past targeting the optimization of the symmetric SpMV kernel.

The most common approach for storing a symmetric sparse matrix is the Symmetric Sparse Skyline (SSS) format [Eisenstat et al., 1982; Saad, 1992]. SSS is essentially the symmetric version of the CSR format; Figure 2.9 shows an example implementation. The values of the main diagonal, which is full in a well-defined symmetric problem, are stored separately in the dvalues array, while the strictly lower triangular part is stored using the standard CSR format. The SpMV kernel for symmetric matrices consists of iterating once over the elements of the lower triangular matrix and performing the computations of both the lower and upper triangular elements at the same time. Algorithm 2.5 shows a typical SpMV implementation using the SSS storage format. The key issue in the performance of this kernel is the write operations on the output vector that are performed for every element of the upper triangular matrix (Algorithm 2.5, line 7). Despite being somewhat irregular, these write operations do not pose a significant problem in the serial execution of the kernel. However, the parallel execution of the kernel becomes problematic, because these operations insert RAW dependencies on specific output vector elements;


Figure 2.10: The RAW dependency on the output vector in a parallel execution of the symmetric SpMV kernel. The different thread partitions are shown in distinct colors.

Figure 2.10 depicts this problem schematically.

1: procedure MVS(A::in, x::in, y::out) A: matrix in SSS format x: input vector y: output vector ¨ 2: for i ← 0 to N do ¨ © 3: y[i] ← dvalues[i] · x[i] © 4: for j ← rowptr[i] to rowptr[i + 1] do 5: k ← colind[j] 6: y[i] ← y[i] + values[j] · x[k] 7: y[k] ← y[k] + values[j] · x[i] 8: end for 9: end for

Algorithm 2.5: Implementation of the SpMV kernel using the SSS symmetric storage format.

The cost of protecting the accesses to the output vector with locks can be prohibitive, and this approach is not adopted in practice. The most common approaches to optimizing the symmetric SpMV kernel rely on local, per-thread output buffers, which are then reduced into the resulting output vector. The overhead of this reduction phase can also be significant, especially as the number of threads increases, and it becomes crucial to exchange the least possible information among the threads. The 'conflicting' elements of the output vector depend on the structure of the matrix and, more particularly, on the matrix bandwidth. The bandwidth of a sparse matrix is the maximum distance of its non-zero elements from the main diagonal; if the most distant element of the


lower triangular matrix lies on the l-th diagonal and the most distant element of the upper triangular matrix lies on the u-th diagonal, then the bandwidth b of the matrix is defined as b = l + u + 1. It is easy to see that the higher the bandwidth, the higher the interaction between the threads of the symmetric SpMV kernel. Several algorithms have been proposed for reducing the matrix bandwidth, based chiefly on the permutation of the matrix rows and columns [Cuthill and McKee, 1969; George, 1971; Gibbs et al., 1976; Karypis and Kumar, 1995; Çatalyürek and Aykanat, 1999]. These algorithms utilize heuristics for selecting the best matrix reordering, since the matrix bandwidth minimization problem has been proved to be NP-complete [Papadimitriou, 1976].


product in two distinct parallel phases. Since each block diagonal is assigned a private vector, the SpMV computation can proceed in parallel for all the block diagonals, while the remainder elements, i.e., elements not contained in the three block diagonals, use atomic operations for updating directly the output vector. Finally, the private vectors are accumulated in parallel to the output vector. Apart from the use of task parallelism, which can offer a better load balancing, the key advantage of this method is that it decouples the number of intermediate vectors from the number of threads, therefore stabilizing the cost of the nal reduction step.


3

The Performance of the Sparse Matrix-Vector Kernel

The Sparse Matrix-Vector Multiplication kernel (SpMV) lies at the core of iterative solution methods for sparse linear systems. As discussed in Chapter 1, the SpMV kernel dominates the execution time of these methods on modern multicore architectures, and its efficient implementation and optimization has been a key concern in the high-performance computing community. The key performance problem of the SpMV kernel stems primarily from its algorithmic nature, which imposes a very low arithmetic intensity. Unfortunately, this is not the sole problem of this kernel, as other factors, relating mostly to the structure of the input sparse matrix, can be equally important.

In this chapter, we investigate the different performance problems of the SpMV kernel that have been reported in the literature and provide a quantitative evaluation, in order to better characterize the importance of each possible performance-limiting factor. We examine both symmetric shared memory and NUMA architectures and propose a simple and efficient technique for building NUMA-aware versions of the SpMV kernel.

3.1 An algorithmic view

The SpMV kernel poses a variety of possible performance bottlenecks, which we will first present and discuss from a more theoretical point of view. For the completeness of our presentation and in order for the reader to easily follow the subsequent discussion, we repeat here the SpMV algorithm using the CSR storage format (Algorithm 3.1), presented in detail in Chapter 2. CSR is the most widely adopted storage format for sparse matrices and we will use this format as a baseline reference throughout the text. We have deliberately altered the SpMV algorithm presented here slightly to better match a real implementation, where the computed products in a matrix row are not directly


accumulated in the corresponding output vector element, but in a local variable, typically kept in a register.

1: procedure MVC(A::in, x::in, y::out) A: matrix in CSR format x: input vector y: output vector 2: for i ← 0 to N do ← 3: yi 0 4: for j ← rowptr[i] to rowptr[i + 1] do ← · 5: yi yi + values[j] x[colind[j]] 6: end for ← 7: y[i] yi 8: end for

Algorithm 3.1: Implementation of the SpMV kernel using the CSR format (see Chapter 2, Section 2.1 for a detailed description).

A number of algorithmic characteristics of the SpMV kernel have been identified as possible performance bottlenecks in the literature. In the following, we discuss these characteristics from a theoretical standpoint.

Low arithmetic intensity

The term arithmetic intensity, also known as the flop:byte ratio, is an algorithmic metric denoting the amount of useful arithmetic operations performed by the processor per amount of data necessary for performing these operations [Harris, 2005; Williams et al., 2009]. This metric is purely algorithmic in the sense that it does not account for hardware side effects, such as cache line evictions, which might incur additional memory traffic. With the advent of multicore and manycore architectures the arithmetic intensity metric is becoming increasingly important, as the speed gap between the processor and the main memory is continuously growing. The higher the flop:byte ratio of a computational kernel, the higher the potential for an efficient utilization of the processor. Conversely, a low flop:byte ratio denotes that the kernel will most likely be bound by the memory subsystem.

Matrix-vector products, either dense or sparse, exhibit a rather low flop:byte ratio due to their streaming nature: the algorithms proceed in fetch and compute phases, exhibiting no temporal locality in the accesses of the matrix elements. Assuming the typical layout of 8-byte double precision floating point values for the non-zero elements, the flop:byte ratio r_dense of the matrix-vector product for an N × N dense matrix stored in the typical dense format


can be easily calculated as

\[
r_{\text{dense}} = \frac{2N^2}{8N^2 + 16N} = \frac{1}{4 + \frac{8}{N}} \approx 0.25 \tag{3.1}
\]

This is a fairly low ratio, since four bytes must be fetched for every floating point operation. Assuming an architecture with 64-bit words, this ratio implies that the memory hierarchy must be able to provide data to the processor at half of its speed, which is hardly ever the case for modern high-performance multicore architectures.

The case of the sparse matrix-vector product is worse, since the additional indexing data must also be fetched in order to access the actual non-zero element information. Assuming 4-byte integers for the indexing information, the flop:byte ratio r_sparse for an N × N sparse matrix with NNZ non-zero elements stored in CSR can be calculated as follows:

\[
r_{\text{sparse}} = \frac{2\,NNZ}{\underbrace{8\,NNZ}_{\text{values}} + \underbrace{4\,NNZ}_{\text{colind}} + \underbrace{4N}_{\text{rowptr}} + \underbrace{16N}_{x+y}}
                  = \frac{1}{6 + 10\frac{N}{NNZ}} \approx 0.167 \tag{3.2}
\]

This ratio deteriorates further for very sparse matrices, where N is in the order of NNZ, since the size of the rowptr array also becomes significant. As a comparison, the arithmetic intensity of a multigrid PDE stencil kernel ranges between 0.33 and 0.50, that of a 3D FFT reaches 1.64 [Williams et al., 2009], while the arithmetic intensity of the matrix-matrix multiplication is in the order of N. It is clear from this analysis that the performance of the SpMV kernel in modern multicore architectures is expected to be bound by the performance of the memory subsystem, and especially the main memory bandwidth [Mellor-Crummey and Garvin, 2004; Buttari et al., 2007; Williams et al., 2007; Goumas et al., 2008].
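As a concrete check of equation (3.2), consider the Hamrle3 matrix of Table 3.1, with N = 1,447,360 rows and NNZ = 5,514,242 non-zero elements:

\[
r_{\text{sparse}} = \frac{1}{6 + 10\,\frac{N}{NNZ}}
                  = \frac{1}{6 + 10 \cdot \frac{1{,}447{,}360}{5{,}514{,}242}}
                  \approx \frac{1}{8.62} \approx 0.116,
\]

which matches the flop:byte ratio reported for this matrix in Table 3.1.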

Irregular accesses in the input vector

A key difference between the dense and sparse matrix-vector kernels is that in the latter case the access pattern in the input vector is not sequential, but depends on the non-zero element structure of the matrix. This may result in an increased amount of cache misses in the input vector for matrices with a rather irregular non-zero structure [Im, 2000; Geus and Röllin, 2001; Pichel et al., 2004].

Indirect references

The need to save space in the matrix storage dictates the use of additional data structures for storing the location metadata of the non-zero elements. These data structures not only decrease the arithmetic intensity of the kernel (see equation (3.2)), but also introduce additional operations,


unrelated to the actual computation, and add considerable interference to the cache hierarchy [Pinar and Heath, 1999].

Short Rows

The trip count of the inner loop of the SpMV kernel (Algorithm 3.1, lines 4–6) depends on the size of the corresponding row. Therefore, smaller rows are likely to incur a significant loop overhead that can overwhelm the useful computations [White and Sadayappan, 1997; Buttari et al., 2007]. Worse, the remedy of unrolling cannot be applied to the inner loop without an introspection of the matrix structure, since the trip count is unknown at compilation time.

Another subtle implication for the performance of the SpMV kernel is that very short rows negatively affect the arithmetic intensity of the kernel. The N/NNZ ratio in the denominator of equation (3.2) is actually the inverse of the average row size of the matrix. This leads to a monotonically increasing relation between the average row size and the flop:byte ratio, meaning that smaller row sizes lead to a lower arithmetic intensity of the kernel, therefore accentuating the bottleneck in the memory hierarchy.

Load imbalance

The SpMV kernel can be parallelized easily across the rows of the matrix. Each thread takes a band of the matrix and proceeds independently with the matrix-vector computation. A suitable static partitioning scheme would split the matrix into partitions with roughly the same number of non-zero elements, in order to achieve a fair distribution of the computational load. For matrices with an irregular non-zero element distribution, however, this scheme may still lead to load imbalances, since the computational characteristics of the partitions may be quite different. For example, a thread operating on a denser partition is expected to be faster than a thread operating on a rather sparse one, which may suffer from a lower flop:byte ratio or from cache misses due to irregular accesses in the input vector.
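The sketch below shows, in C, a simple version of such a static partitioner over the CSR rowptr array: it assigns whole rows to each of the nthreads partitions so that every partition receives approximately NNZ/nthreads non-zero elements. It illustrates the scheme only and is not necessarily the exact splitter used in our implementation.

#include <stdint.h>

/* Split the rows of a CSR matrix into nthreads contiguous bands with roughly
 * equal numbers of non-zeros. Thread t gets rows [row_start[t], row_end[t]). */
static void partition_rows_by_nnz(int32_t nrows, const int32_t *rowptr,
                                  int nthreads,
                                  int32_t *row_start, int32_t *row_end)
{
    int32_t nnz = rowptr[nrows];
    int32_t row = 0;
    for (int t = 0; t < nthreads; ++t) {
        /* target prefix of non-zeros at the end of partition t */
        int64_t target = ((int64_t)nnz * (t + 1)) / nthreads;
        row_start[t] = row;
        while (row < nrows && rowptr[row + 1] <= target)
            ++row;
        /* the last partition always extends to the final row */
        row_end[t] = (t == nthreads - 1) ? nrows : row;
    }
}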

3.2 Experimental preliminaries

Before proceeding with the quantitative evaluation of the SpMV kernel, it is essential to present the matrices, the test platforms and the experimental methodology we used for our benchmarking. We dedicate a separate section to this description, since we use this experimental setup throughout the thesis. If not stated differently, references to specific matrices and platforms will be resolved in this section, and descriptions of performance measurements and metrics will refer to the material presented hereafter.


3.2.1 Matrix suite

The matrix suite used in our experiments consists of 30 matrices selected from the University of Florida sparse matrix collection [Davis and Hu, 2011]. This collection contains thousands of sparse matrices that arise in real applications from a variety of scientific domains. It has become the standard source of sparse matrices in the numerical community for the development and performance evaluation of sparse matrix algorithms. We have selected matrices for our experimental suite based on a number of criteria:

• Variety: The selected matrices are derived from a large variety of applications, including problems without an underlying 2D/3D geometry. Such problems usually lead to matrices with a more irregular structure and we expect them to stress different aspects of the SpMV kernel.

• Size: Since SpMV is a bottleneck for the solution of large sparse linear systems, the selected matrices are large enough so as not to fit in the aggregate cache of a typical high-end multiprocessor system.

• Algebraic criteria: Half of the selected matrices are symmetric and positive definite. These matrix characteristics are required for the CG iterative method, which we use in later chapters for benchmarking purposes.

Table 3.1 details the characteristics of the matrices in our experimental matrix suite. Two thirds of the matrices are derived from problems with an underlying 2D/3D geometry, since these are more frequently encountered in the solution of sparse linear systems. Nine matrices have rather short rows, exhibiting a very low arithmetic intensity with a flop:byte ratio below 0.14. The sparsest matrix is Hamrle3, which has only four non-zeros per row on average, while the densest is TSOPF_RS_b2383 with 424 elements per row.

3.2.2 Hardware platforms

The hardware platforms we use for our quantitative analysis comprise two symmetric shared memory (SMP) multiprocessor systems and one cache-coherent non-uniform memory access (cc-NUMA) multiprocessor system. The SMP systems are a two-way quad-core Intel Xeon E5405 (codename Harpertown) and a four-way six-core Intel Xeon X7460 (codename Dunnington) multiprocessor. The NUMA system is a two-way quad-core Intel Xeon W5580 (codename Gainestown) multiprocessor. In the following, we will refer to each platform by its codename. Table 3.2 lists the technical specifications of our experimental platforms, while Figure 3.1 presents their block diagrams.

From an architectural point of view, the Dunnington system is an extension of Harpertown, not only system-wise but also within the socket. The Dunnington socket contains another dual-core processor module, reaching a total of six


Matrix            Rows        Non-zeros   Size (MiB)  f:b ratio  P. D.  Problem           2D/3D
xenon2            157,464     3,866,688   44.85       0.156      No     Materials         Yes
ASIC_680k         682,862     3,871,773   46.91       0.129      No     Circuit Sim.      No
torso3            259,156     4,429,042   51.67       0.152      No     Other             Yes
Chebyshev4        68,121      5,377,761   61.80       0.163      No     Structural        Yes
Hamrle3           1,447,360   5,514,242   68.63       0.116      No     Circuit Sim.      No
pre2              659,033     5,959,282   70.71       0.141      No     Circuit Sim.      No
cage13            445,315     7,479,343   87.29       0.152      No     Graph             No
atmosmodj         1,270,432   8,814,880   105.72      0.135      No     C.F.D.            Yes
ohne2             181,343     11,063,545  127.30      0.162      No     Semiconductor     Yes
kkt_power         2,063,494   14,612,663  175.10      0.135      No     Optimization      No
TSOPF_RS_b2383    38,120      16,171,169  185.21      0.166      No     Power             No
Ga41As41H72       268,096     18,488,476  212.61      0.163      No     Chemistry         No
Freescale1        3,428,755   18,920,347  229.61      0.128      No     Circuit Sim.      No
rajat31           4,690,002   20,316,253  250.39      0.120      No     Circuit Sim.      No
F1                3,428,755   26,837,113  308.44      0.163      No     Structural        Yes
parabolic_fem     525,825     3,674,625   44.06       0.135      Yes    C.F.D.            Yes
offshore          259,789     4,242,673   49.54       0.151      Yes    Electromagnetics  Yes
consph            83,334      6,010,480   69.10       0.163      Yes    F.E.M.            Yes
bmw7st_1          141,347     7,339,667   84.54       0.161      Yes    Structural        Yes
G3_circuit        1,585,478   7,660,826   93.72       0.124      Yes    Circuit Sim.      No
thermal2          1,228,045   8,580,313   102.88      0.135      Yes    Thermal           Yes
m_t1              97,578      9,753,570   111.99      0.164      Yes    Structural        Yes
bmwcra_1          148,770     10,644,002  122.38      0.163      Yes    Structural        Yes
hood              220,542     10,768,436  124.08      0.161      Yes    Structural        Yes
crankseg_2        63,838      14,148,858  162.16      0.165      Yes    Structural        Yes
nd12k             36,000      14,220,946  162.88      0.166      Yes    Other             Yes
af_5_k101         503,625     17,550,675  202.77      0.159      Yes    Structural        Yes
inline_1          503,712     36,816,342  423.25      0.163      Yes    Structural        Yes
ldoor             952,203     46,522,475  536.04      0.161      Yes    Structural        Yes
boneS10           914,898     55,468,422  638.28      0.162      Yes    Model Reduction   Yes

Table 3.1: The matrix suite used for experimental evaluation. For each matrix, we show the row count (square matrices only), the non-zero elements, the size and arithmetic intensity (flop:byte ratio) in CSR format, positive-definiteness (P. D.), the problem category and whether it is derived from a problem with an underlying 2D/3D geometry. For further matrix-specific properties, the reader is referred to the University of Florida Sparse Matrix Collection [Davis and Hu, 2011].


(a) Harpertown: two-way quad-core SMP system; two memory channels.

(b) Dunnington: four-way six-core SMP system; four memory channels.

Figure 3.1: The block diagrams of the multiprocessor systems used for the experimental evaluations in this thesis (continued below).

cores, and a very large 16 MiB L3 cache, in order to reduce the contention in the front-end bus incurred by the plethora of cores. Indeed, thanks to its huge L3


(c) Gainestown: two-way quad-core NUMA system; two threads per core; three memory channels per integrated memory controller.

Figure 3.1 (Cont.): The block diagrams of the multiprocessor systems used for the experimental evaluations in this thesis.

cache, Dunnington can sustain a memory bandwidth very close to the theoretical peak. Nonetheless, the four Dunnington sockets all use the common bus to communicate with the main memory and with each other, a layout that can become a serious bottleneck for very memory-intensive multithreaded applications, like the SpMV kernel.

The Gainestown system departs from the centralized SMP logic by moving the memory controller inside the socket and splitting, as a result, the physical memory into per-processor nodes. Each memory controller can serve up to three DDR3 memory channels, leading to a sustained memory bandwidth of 15.5 GB/s, almost three times higher than the one offered by the SMP architectures. Interprocessor communication, as well as remote memory accesses, are routed through a dedicated interconnection network (Intel QuickPath Interconnect, QPI [Kurd et al., 2008]) with a sustained bandwidth of 9.4 GB/s, almost 40% less than the available memory bandwidth. Therefore, the interconnection link is likely to become a bottleneck if a significant amount of main memory traffic is routed through it due to remote memory accesses.


                        Harpertown           Dunnington           Gainestown
Model                   Intel Xeon E5405     Intel Xeon X7460     Intel Xeon W5580
Microarchitecture       Intel Core           Intel Core           Intel Nehalem
Clock freq.             2.00 GHz             2.66 GHz             3.20 GHz
L1 cache (D/I)          32 KiB/32 KiB        32 KiB/32 KiB        32 KiB/32 KiB
L2 cache                6 MiB (per 2 cores)  3 MiB (per 2 cores)  256 KiB (per core)
L3 cache                –                    16 MiB               8 MiB
Cores/Threads           4/4                  6/6                  4/8
Peak front-end b/w      10.7 GB/s            8.5 GB/s             2× 30 GB/s
Interconnection b/w     –                    –                    25.6 GB/s
Sustained mem. b/w      5.8 GB/s             8.1 GB/s             2× 15.5 GB/s
Sustained i/c b/w       –                    –                    9.4 GB/s

Multiprocessor configurations
Sockets                 2                    4                    2
Cores/Threads           8/8                  24/24                8/16

Table 3.2: Technical characteristics of the hardware platforms used for the experimental evaluations. The sustained memory bandwidth figures are obtained with the STREAM benchmark [McCalpin, 1995] utilizing the full system. The sustained interconnection (i/c) bandwidth for Gainestown was measured with a modified version of STREAM, where all the memory traffic was routed through the remote memory controller.

Software setup All systems run a 64-bit version of the Linux OS (kernel version 2.6.30.5 or higher) and the GNU Compiler Collection (gcc, g++ etc.), version 4.6, is used for the compilation of every project, unless stated otherwise. Multithreaded versions of the code are written using explicit, native threading with the Pthreads userspace library (NPTL, version 2.7). Finally, NUMA-aware versions are built with the numactl library (version 2.0.7), a userspace wrapper around the low-level memory allocation system call interface of the Linux kernel.

3.2.3 Measurement procedures and policies

In order to guarantee reliable and fair measurements, we have built a common measurement framework for all the experiments. This framework communicates with the storage format implementations through a well-defined sparse matrix-vector multiplication interface. We measure the performance of 128 consecutive SpMV operations with randomly created input vectors.

We make no attempt to artificially pollute the cache after each iteration, in order to better simulate the behavior of iterative scientific applications, where matrix and vector data are present in the cache hierarchy, either because they have just been produced or because they have been recently accessed. We repeat every experiment three or five times and keep the median performance values. To simulate the typical sparse matrix storage case, we use 32-bit integers for indexing and 64-bit, double-precision floating point values for the non-zero elements. For formats with additional data structures, e.g., VBL, we will specify their size in the corresponding discussion inside the text.

Modern multicore processors have their resources (pipeline, cache hierarchy, bus/memory interfaces) shared at different levels among the threads of a multithreaded application. For example, two threads in Gainestown may share all the processor resources from the pipeline up to the memory controller, but they may share nothing if they are placed on different sockets. Depending on the application's workload, different assignments of threads to the cores may lead to considerable performance variations. It is therefore essential not only to pin threads to specific cores during measurements, but also to define a policy for assigning threads to cores. We perform the thread pinning using the sched_setaffinity() Linux system call, which allows the calling thread to be pinned to an arbitrary set of logical CPUs; a minimal sketch of this call is given after the policy descriptions below. For the assignment of threads to cores, we define two different policies:

Share-all: This policy assigns threads to cores so that the maximum sharing of resources is achieved, with the exception of pipeline resources (Hyperthreading feature). In Dunnington, for example, the cores sharing the L2 cache will be filled first, followed by the cores sharing the L3, and so forth for the rest of the sockets. The advantage of this policy is that it provides more insight into the performance behavior as we scale a system by adding more sockets.

Share-nothing: This is the exact opposite policy: it assigns threads to cores so that the least resource sharing is achieved. Taking the Dunnington example again, the threads will first be spread across the sockets, then across the L2 caches, and only then start to share resources. The advantage of this policy is that it utilizes the full system's potential right from the configurations with a small number of threads.

In our evaluation and experiments, we routinely use the 'share-all' policy, unless otherwise stated.
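A minimal sketch of such pinning with sched_setaffinity() is shown below; the pin_to_core() helper and the hard-coded core number are illustrative only and not part of the measurement framework itself.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to a single logical CPU; returns 0 on success. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);          /* start with an empty CPU mask */
        CPU_SET(core, &set);     /* allow only the requested core */
        /* pid 0 refers to the calling thread */
        return sched_setaffinity(0, sizeof(set), &set);
    }

    int main(void)
    {
        if (pin_to_core(2) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to logical CPU 2\n");
        return 0;
    }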

This technology is known as Simultaneous Multithreading (SMT) [Tullsen et al., 1995] and was initially commercialized as Hyperthreading by Intel in its Netburst microarchitecture [Koufaty and Marr, 2003]. We do not use the Hyperthreading feature by default, except for the 16-threaded configuration in Gainestown.

3.3 A quantitative evaluation

In Section 3.1, we discussed a series of SpMV characteristics that can pose a performance overhead in the execution of the kernel. In the following sections, we perform a quantitative evaluation of these characteristics, in order to assess their real impact on the kernel's performance on a variety of matrices and modern mainstream computer architectures.

3.3.1 Single-threaded performance

In order to quantify the effect of the possible performance problems, we implement a series of microbenchmarks, each one alleviating an alleged performance problem. These microbenchmarks are actually 'stripped-down' versions of the CSR SpMV kernel, where we have eliminated the source of a problem by applying simple code transformations. We apply these transformations incrementally and expect the performance of SpMV to increase, until it reaches the performance of the dense matrix-vector (DMV) multiplication kernel, which is a straightforward upper bound of the SpMV performance. The more the performance of the transformed CSR increases, the higher the impact of the alleviated problem on the original SpMV kernel. The key requirement for all the microbenchmarks is that they must access all the non-zero elements and rows of the original matrix. There is no restriction on the order in which the elements are accessed or how many elements are accessed in a row. This allows us to relax successively the performance limiting factors (e.g., accesses in the input vector, additional data structures etc.), while keeping the required data structure (non-zero elements) within the working set of the algorithm. Table 3.3 presents a summary of the microbenchmarks implemented and their main goal; a sketch of the corresponding code transformations follows the table. The noxmiss microbenchmark eliminates the irregular accesses in the input vector by serializing the access pattern. This microbenchmark does not transform the SpMV code; it only serializes the column indices in the colind data structure. The norowptr microbenchmark eliminates the use of the rowptr data structure by assuming each row has the same number of non-zero elements (equal to the average row size of the matrix), whereas the nocolind microbenchmark eliminates the use of the colind data structure by assuming a serial access pattern of the input vector within each row. Applying all these transformations incrementally eliminates every major performance problem related to the serial execution of the SpMV kernel. The transformed code can be considered as if it assumes a sparse matrix with a perfect access pattern for the input vector and a known arrangement of the non-zero elements, which eliminates the need of keeping additional data structures. The only difference of such a fictional matrix from a dense one is that it might contain short rows, since the row size remains the same as the original matrix's.

Benchmark   Goal                                      Action
noxmiss     Irregular accesses in the input vector    Serialize the access pattern in colind
norowptr    Effect of the rowptr data structure       Rows with the same number of non-zero
                                                      elements, equal to the average row size
                                                      of the original matrix
nocolind    Effect of the colind data structure       Serial access pattern within each
                                                      matrix row

Table 3.3: Microbenchmarks for assessing the performance problems of the SpMV kernel. Applying all the microbenchmarks incrementally also gives a measure of the effect of very short rows on the performance of the kernel (see discussion in text).
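The sketch below illustrates the kind of transformation involved, contrasting the plain CSR kernel with the nocolind variant; it is a simplified illustration and not the exact microbenchmark code.

    /* Plain CSR SpMV: y = A*x */
    void spmv_csr(int nrows, const int *rowptr, const int *colind,
                  const double *values, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double yi = 0.0;
            for (int j = rowptr[i]; j < rowptr[i+1]; j++)
                yi += values[j] * x[colind[j]];  /* indirect access to x */
            y[i] = yi;
        }
    }

    /* 'nocolind' variant: colind is never read; a serial access pattern
     * inside each row is assumed instead, while every non-zero element
     * is still touched exactly once. */
    void spmv_nocolind(int nrows, const int *rowptr,
                       const double *values, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double yi = 0.0;
            int start = rowptr[i];
            for (int j = start; j < rowptr[i+1]; j++)
                yi += values[j] * x[j - start];  /* serial accesses to x */
            y[i] = yi;
        }
    }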

We expect, therefore, the performance of the transformed code to approach that of the DMV kernel; any differences can be attributed to loop overheads due to very short rows of the original matrix.

Figure 3.2 shows the performance overheads incurred by the special nature of the SpMV kernel. The very first observation is that the overhead of the colind data structure is quite important in SMP architectures, exceeding 20% in most of the cases and being more prominent in Dunnington, due to its faster clock and lower memory bandwidth compared to Harpertown. The overhead of the rowptr structure is rather small, since it accounts for a very small amount of the total matrix size in most of the cases. The situation becomes more interesting with the unknown access pattern of the input vector. In fact, this pattern seems to be quite regular in most of the cases, especially for matrices with an underlying 2D/3D geometry (e.g., xenon2, boneS10 etc.). These regular accesses facilitate the hardware prefetching mechanism of modern microarchitectures, allowing it to fetch the correct future data into the last level cache and increasing significantly the cache hit ratio. However, a certain number of sparse matrices (e.g., Hamrle3, parabolic_fem) exhibit a rather irregular access pattern. The SpMV computation proceeds back and forth in the input vector using rather large strides, not only ruining spatial locality, but also rendering a successful prediction by the hardware prefetcher impossible. For example, the accesses for the matrix Hamrle3 consist of non-constant strides of 4,480 up to 755,520 elements. We have also observed an important correlation between matrices with very short rows and an irregular access pattern.

Matrices with very short rows usually come from non-structural problems (see Table 3.1) and their elements are scattered in small groups over the whole span of the row, leading to an irregular access pattern in the input vector. Finally, the loop overhead due to very short rows (4–9 elements) can be quite important in some cases, surpassing 20% (e.g., Hamrle3, pre2, rajat31 etc.), since only a few operations are performed inside the loop. The downside for such cases in SpMV is that it is not straightforward to apply common techniques, such as loop unrolling, in order to decrease the overhead, since the trip count of the loop is not only unknown, but may also vary within a single matrix.

The benchmark results in NUMA architectures (Figure 3.2c) exhibit a somewhat different behavior. The key observation here is that the colind overhead is not so prominent, due to the ample memory bandwidth offered by the integrated memory controller. In fact, the more regular matrices manage to achieve almost 80% of the DMV performance. Conversely, the rowptr overhead is rather important, surpassing 10% for irregular matrices. Despite rowptr being a small data structure, the unknown loop bounds and the indirect references for accessing them restrain the compiler from generating highly optimized code for the inner loop. Given the high memory bandwidth available, this computational overhead is more exposed, while the loop overheads for matrices with very short rows are also more acute.

3.3.2 Multithreaded performance

The SpMV kernel exhibits ample parallelism [Buluç et al., 2011]. Indeed, if the sparse matrix is split row-wise and each thread is assigned a set of rows, the participating threads can proceed completely independently without any communication. The only concern is a well-balanced distribution of the work among the threads. Apparently, assigning each thread the same number of rows, as is the case for a dense matrix, is not an option for sparse matrices, since the distribution of non-zero elements is not always uniform. The best static load-balancing scheme, therefore, is to split the matrix row-wise so that each thread is assigned roughly the same number of non-zero elements.
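A minimal sketch of this static, non-zero-based row partitioning is given below; the function name and its interface are illustrative and not the framework's actual code.

    /* Split the rows of a CSR matrix into 'nthreads' contiguous partitions
     * holding roughly the same number of non-zero elements.
     * Rows row_start[t] .. row_start[t+1]-1 are assigned to thread t. */
    void partition_by_nnz(int nrows, const int *rowptr,
                          int nthreads, int *row_start)
    {
        int nnz_total = rowptr[nrows];
        int target = (nnz_total + nthreads - 1) / nthreads;  /* per thread */
        int t = 0;

        row_start[0] = 0;
        for (int i = 0; i < nrows && t < nthreads - 1; i++) {
            /* close the current partition once it has reached its share */
            if (rowptr[i + 1] - rowptr[row_start[t]] >= target) {
                t++;
                row_start[t] = i + 1;
            }
        }
        for (t = t + 1; t <= nthreads; t++)   /* remaining boundaries */
            row_start[t] = nrows;
    }

Each thread then runs the CSR kernel only over its own row range, writing to a disjoint slice of the output vector.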

Figure 3.2: Performance overheads of the SpMV kernel due to its additional data structures and the unknown access pattern in the input vector. (a) Harpertown. (b) Dunnington. [The bars are produced by incrementally applying the microbenchmarks described in Table 3.3: "Irregular accesses" → noxmiss, "Rowptr overhead" → norowptr, "Colind overhead" → nocolind, "Loop overhead" → the remainder from 100%.]

Figure 3.2 (cont.): (c) Gainestown.

Unfortunately, despite the ample parallelism and a fair load-balancing scheme, SpMV does not scale. Figure 3.3 shows the speedup of the SpMV kernel using the CSR format in the Harpertown and Dunnington systems using a 'share-nothing' core-filling policy (see Section 3.2.3). In Harpertown, SpMV stops scaling at four cores, reaching a mere 1.8 speedup. The scaling in Dunnington is favored by the large L3 cache of each socket, which provides a total of 64 MiB of system-wide aggregate cache. SpMV scales linearly up to two cores and continues to scale at almost half the rate up to 12 cores, where scaling stops abruptly, and SpMV experiences a performance slowdown as we move to the 24-threaded configuration.

The key bottleneck in SMP architectures for the majority of matrices is the limited main memory bandwidth. All threads must contend for the shared front-end bus resources and the algorithm is implicitly 'serialized' at this point, since the threads must wait to be served. The inherently low flop:byte ratio of SpMV renders this problem inevitable and the kernel will eventually hit the 'memory wall' sooner or later. Indeed, 85% of the available memory bandwidth in Harpertown is consumed at the two-threaded configuration and it is completely saturated from the four-threaded configuration onward. Similarly, in Dunnington, large regular matrices consume 75% of the available bandwidth at the six-threaded configuration and saturate it at 12 cores.

In order to further demonstrate the effect of memory bandwidth, we present in Figure 3.4 the speedup achieved in Dunnington using a 'share-all' core-filling policy. According to this policy, we assign threads to cores so that the available sockets are filled successively.

Figure 3.3: The speedup of the SpMV kernel in two SMP systems, (a) Harpertown and (b) Dunnington, using the 'share-nothing' core-filling policy that maximizes the utilization of the available memory bandwidth.

For example, the six-threaded configuration uses a single socket, while the 12-threaded one uses only two. The difference in performance between the two policies is tremendous up to the 12-threaded configuration, since the threads not only contend for cache space (now only 16 MiB and 32 MiB are available, respectively), but also for access to the common front-end bus controller. The sustained bandwidth through a single FSB controller was measured at 2.6 GB/s and is almost completely consumed by a single thread, eliminating any possibility for scaling. When moving to the next socket, both the available aggregate cache and the memory bandwidth are doubled, and so does the speedup, which continues to increase rapidly up to the 24-threaded configuration.

It should have become clear already that the contention for memory bandwidth is a key scalability issue for the SpMV kernel in SMP architectures. Due to its streaming nature and very low arithmetic intensity, the kernel tends to rapidly saturate the available resources as we add more threads to the computation. This behavior results in an almost direct relation between SpMV performance and the matrix representation size, as depicted in Figure 3.5. This figure displays the performance of BCSR for a variety of blocks for the relatively dense and block-dominated TSOPF_RS_b2383 matrix. While there is a tendency in the single-threaded configuration toward lower performance as the matrix representation size increases, the relation becomes completely linear as we add more threads stressing the common front-end bus. Matrix-wise, this relation is better depicted in Figure 3.6, where we have plotted the SpMV performance for every matrix in our suite against its corresponding arithmetic intensity (flop:byte ratio) for the single- and eight-threaded configurations in Harpertown.

Figure 3.4: SpMV speedup in Dunnington using the 'share-all' core-filling policy: the memory bandwidth saturation is clear within a single socket (up to six threads), where almost no speedup is encountered.

The trend toward higher performance as the flop:byte ratio increases, especially in multithreaded configurations, is a clear indication of the memory bandwidth contention in the SpMV kernel. In fact, we can easily separate two distinct groups of matrices: 'low-performing' ones with a flop:byte ratio < 0.15 and 'high-performing' ones with a ratio ≥ 0.15. The variation in performance at the low end is greater, since these matrices not only suffer from additional overheads (irregular accesses, short rows), but are also more likely to exhibit significant load imbalances, due to an irregular distribution of their non-zero elements.
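As a rough back-of-the-envelope bound under the storage settings of Section 3.2.3: each CSR non-zero element contributes two floating point operations and requires at least an 8-byte value plus a 4-byte column index, so, ignoring rowptr and the vector traffic,

    \frac{\text{flop}}{\text{byte}} \le \frac{2\,\mathrm{nnz}}{(8+4)\,\mathrm{nnz}} = \frac{1}{6} \approx 0.17,

which is consistent with the range of values plotted in Figure 3.6.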

Load balancing issues Splitting the input matrix statically based on the non-zero elements is a reasonable load balancing scheme for SpMV and can provide a fair distribution of the work among the threads. Nonetheless, as depicted in Figure 3.7, this scheme seems to be inadequate for some matrices with a very irregular non-zero element structure. The key reason for such behavior is that this scheme does not take into account the distribution of the non-zero elements within a thread partition. For example, two partitions may have the same number of non-zero elements, but different access patterns in the input vector, with one being very irregular; SpMV will be rather slow in this partition, hampering the whole multithreaded execution. The most typical examples of this situation are the Hamrle3 and G3_circuit matrices. Figure 3.8 shows the variation of the flop:byte ratio across these matrices. In both cases the distribution of non-zero elements is rather irregular, with Hamrle3 being very sparse at the latter partitions.

Williams et al. [2009] also indicate this behavior in their Roofline performance model, where the performance of memory bandwidth bound applications increases with the flop:byte ratio.

Figure 3.5: The contention in the memory bus of SMP architectures, as more threads are added to the computation, renders the performance of the SpMV kernel very sensitive to the matrix representation size. Results shown are from the BCSR format using different block sizes for matrix TSOPF_RS_b2383. (a) Harpertown, single-threaded. (b) Harpertown, eight-threaded.

Figure 3.6: Performance of the SpMV kernel in relation to the arithmetic intensity (flop:byte ratio). The tendency toward higher performance as the flop:byte ratio increases becomes more prominent in multithreaded configurations, indicating the bottleneck in the memory subsystem. (a) Harpertown, single-threaded. (b) Harpertown, eight-threaded.

Figure 3.7: Load imbalance in the SpMV kernel. Matrices with an irregular distribution of non-zero elements suffer from significant load imbalances, despite the 'fair' non-zeros-based static balancing scheme. Results shown are from a single socket (four threads) in Gainestown.

A closer inspection of the structure of Hamrle3 reveals that the first half of the matrix is quite regular, while the rest is very irregular, with large and variable distances between the non-zero elements of a row. G3_circuit has a more uniform non-zero structure; however, the first partition is burdened with a sparser structure, leading to lower arithmetic intensities and increased loop overheads.

Matrix bandwidth minimization techniques (see Section 2.4) provide a common way for homogenizing the non-zero element distribution of a sparse matrix. These techniques try to bring all the non-zero elements as close to the main diagonal as possible by reordering the matrix rows and columns. While initially conceived for balancing the amount of communication data (input and output vector) in distributed memory SpMV implementations, they prove to be beneficial also in modern multicore architectures. Moving the non-zero elements toward the main diagonal generates a more regular access pattern in the input vector with smaller and more predictable strides, while the objective of balanced communication creates a more homogeneous non-zero element distribution across the matrix. Figure 3.8 shows the flop:byte ratio variation for G3_circuit before and after the application of the reverse Cuthill-McKee (RCM) matrix reordering algorithm [Cuthill and McKee, 1969]⁴; the distribution of the non-zero elements is now completely homogenized across the matrix and the load imbalance has been eliminated, as depicted in Figure 3.9.

Figure 3.8: Variation of the flop:byte ratio across the matrix: (a) G3_circuit, (b) G3_circuit (reordered), (c) Hamrle3. Partitions are marked with the vertical dashed lines. The different arithmetic intensity across the partitions leads to considerable load imbalances. Matrix reordering techniques homogenize the matrix structure and balance the computations, but they operate only on structurally symmetric matrices; Hamrle3 is not.

Another important aspect of matrix reordering illustrated in this figure is that the gain in performance is not only due to the better balanced computations; the more regular access pattern in the input vector, as a result of the lower matrix bandwidth, also provides a significant performance benefit to the SpMV computation.

⁴ RCM, as well as other matrix reordering algorithms, e.g., METIS [Karypis and Kumar, 1995], operates on structurally symmetric matrices. Therefore, it is not possible to reorder the Hamrle3 matrix in our example.

Figure 3.9: The benefit of matrix reordering. The balanced computations and the more regular matrix structure, allowing a better reuse of the input vector, lead to a significant increase in the SpMV performance.

NUMA architectures

The behavior of the multithreaded SpMV kernel in NUMA architectures also stems from the streaming nature and the very large working set of the algorithm. In a NUMA system (Figure 3.1c), each processor has its own integrated memory controller, which is responsible for accessing a specific partition (node) of the main memory. All memory accesses originating from a processor to its local memory node are served by its dedicated memory controller, while all accesses to remote nodes are routed through the memory controller of a peer processor, using a high-speed processor interconnection network or link. In the case of Gainestown, the sustained bandwidth of the interconnection link is significantly lower than the memory bandwidth sustained by an integrated memory controller (see Table 3.2). Remote accesses, therefore, are likely to saturate the interconnection network, limiting the speedup of a parallel application.

The key to achieving high performance for a streaming application in a NUMA system is the correct placement of the involved data on the available memory nodes, in order to minimize the remote memory accesses and avoid any bottlenecks in the interconnection link. This is better illustrated in Figure 3.10, which shows the scaling of the SpMV kernel in Gainestown using our two core-filling policies for the typical and NUMA-aware implementations. The typical implementation does not care about the placement of the SpMV data on the memory nodes and relies on the operating system for the correct placement. Linux uses a copy-on-write allocation of physical pages and will allocate a physical page on the memory node where the thread that first writes to the corresponding virtual page is running. Therefore, in the typical SpMV implementation, the allocation of the required physical pages will take place during the algorithm's initialization, where the matrix and the associated vectors are constructed and initialized.

Figure 3.10: The speedup of the SpMV kernel in the Gainestown NUMA system under (a) the share-all and (b) the share-nothing policy. The effect of data placement is very important, since remote memory accesses can easily saturate the interconnection link of the processors. The '-numa' suffix denotes a NUMA-aware data placement, while 'HT' in the thread configurations denotes the activation of the Hyperthreading feature.

the algorithm’s initialization, where the matrix and the associated vectors are ¨ constructed and initialized. is phase is usually single-threaded, therefore all ¨ © SpMV data will lie on a single memory node. As a result, if an SpMV compu- © tation thread is assigned to a different node, all of its memory accesses will be routed through the interconnection link, possibly saturating it. Using the ‘share-all’ policy (Figure 3.10a), threads are assigned to a sin- gle memory node (socket) up to the four-threaded conguration, while the eight- and 16-threaded congurations use both memory nodes. e substan- tial memory bandwidth offered by the integrated memory controller allows SpMV to scale well in the single-socket conguration, achieving a 2.1 speedup, signicantly overwhelming not only the single-socket SMP congurations, but also the full system speedup in Harpertown. e two-threaded congura- tion consumes 70% of the available memory bandwidth, whereas the four- threaded saturates the controller. When starting using the second socket, how- ever, the data placement plays a signicant role in the performance of the ker- nel. e typical SpMV implementation encounters a signicant performance slowdown reaching 14% when the full system is utilized, while the NUMA- aware version continues to scale signicantly up to the eight-threaded cong- uration, where the both memory controllers are saturated. Examining closer the individual thread execution times in the typical SpMV implementation for the eight-threaded conguration (Figure 3.11) reveals a signicant load imbal- ance between the threads assigned to the local and remote sockets. In fact, a performance difference of approximately 45% is observed, matching the differ-

Figure 3.11: The saturation of the interconnection link in NUMA architectures increases the execution times of threads performing remote accesses, leading to considerable load imbalance. The SpMV performance is therefore determined by the link bandwidth. Results shown are for the xenon2 matrix, but are similar for all other regular matrices.

The speedup diagram of the 'share-nothing' core-filling policy (Figure 3.10b) reveals better the difference in bandwidth between the memory controller and the interconnection link. The behavior of the typical CSR implementation resembles that of an SMP system. Half of the memory traffic in the multithreaded configurations must pass through the interconnection link, 80% of which is already consumed at the four-threaded configuration, leading to a less than 2× maximum speedup at the eight-threaded configuration. The behavior of the NUMA-aware implementation is the complete opposite: the ample memory bandwidth (≈30 GB/s) offered by the two integrated memory controllers allows SpMV to scale significantly up to the eight-threaded configuration, even achieving linear speedup for the two-threaded configuration, where contention is not encountered at all.

Transparent data placement in NUMA architectures The discussion of the SpMV performance results in NUMA architectures has revealed the correct placement of the SpMV data as a key issue in achieving high performance. Such placement requires each thread to allocate its own data (matrix and output vector partitions, input vector) on its local node. However, such an allocation needs to modify the SpMV algorithm itself. Essentially, the matrix must be split into individual submatrices per thread, and the SpMV computation must be modified to operate on these. Similarly, the output vector must be either split into subvectors or allocated as a whole on a single memory node. These modifications for converting an SMP-based code to a NUMA-aware one require a significant code refactoring and impose an important programming burden, although only the placement of the algorithm's data must be changed.

We address the data placement problem by developing a custom low-level allocation scheme. This scheme allows us to maintain the SMP 'look-and-feel' of the SpMV kernel by keeping the data placement completely transparent to the application code. In order to keep the SMP 'look-and-feel', one needs to assure a contiguous allocation for the matrix structures in the virtual address space, while the NUMA-aware placement is achieved transparently to the user by mapping the virtual memory pages to the correct physical memory node (see Figure 3.12). This can be easily achieved in Linux, using the low-level Linux memory management system calls. More specifically, we allocate the whole data structure contiguously in the virtual address space with a single call to mmap() and then bind every partition to its local node with subsequent calls to mbind(), rounding the partition sizes to the nearest multiple of the system's page size⁵. This mechanism relies on the copy-on-write physical page allocation scheme of Linux. The mmap() call creates only a virtual memory mapping for the calling thread and does not allocate any physical page, while the call to mbind() marks the mapped region to be physically allocated on the specified memory node; this allocation will not take place until a thread writes to this region. This user-defined interleaved allocation scheme allows the transparent conversion of SMP-based code to NUMA-aware code, by just replacing the calls to standard memory allocation routines with calls to our interleaved allocator; the rest of the algorithm remains untouched. All the NUMA-aware SpMV kernels presented and discussed in this thesis are implemented using this technique with a minimal programming effort. A sketch of this allocation scheme is given below.

While the matrix data and the output vector can be allocated across the physical memory nodes using the interleaved allocator presented previously, the input vector, ideally, must be replicated across the memory nodes, since the access pattern is not known a priori. However, this is not practical in the context of a 'real-life' application, where the input vector changes during the SpMV iterations. For this reason, placing it on a single node is a more reasonable choice. Nonetheless, as depicted in Figure 3.13, this shared copy does not constitute a performance bottleneck. The average performance overhead amounts to 2.5% and can reach up to 7% for more irregular matrices, e.g., parabolic_fem, offshore etc. Based on our previous discussion of the SpMV performance, this behavior is quite expected, since the accesses in the input vector do not generate significant memory traffic for the majority of the sparse matrices and, therefore, the impact of the shared copy on the SpMV performance in NUMA systems is minimal.

⁵ Linux allocates and manages memory at the granularity of a page, usually 4 KiB in 64-bit systems.
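A minimal sketch of the interleaved allocator described above is shown below; it assumes the mbind() prototype from libnuma's numaif.h (link with -lnuma), and the alloc_interleaved() name, the fixed 4 KiB page size and the equal-sized per-node chunks are illustrative simplifications.

    #include <sys/mman.h>
    #include <numaif.h>      /* mbind(), MPOL_BIND (libnuma headers) */
    #include <stddef.h>

    #define PAGE_SIZE 4096UL

    /* Allocate 'size' bytes contiguously in virtual memory and bind
     * page-aligned, roughly equal chunks to nodes 0..nr_nodes-1.
     * Physical pages are allocated on first touch, on the bound node. */
    static void *alloc_interleaved(size_t size, int nr_nodes)
    {
        char *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED)
            return NULL;

        /* round the per-node chunk up to a whole number of pages */
        size_t chunk = ((size / nr_nodes + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE;

        for (int node = 0; node < nr_nodes; node++) {
            size_t off = (size_t)node * chunk;
            if (off >= size)
                break;
            size_t len = (off + chunk > size) ? size - off : chunk;
            unsigned long nodemask = 1UL << node;

            if (mbind(addr + off, len, MPOL_BIND,
                      &nodemask, sizeof(nodemask) * 8, 0) != 0) {
                munmap(addr, size);
                return NULL;
            }
        }
        return addr;
    }

Replacing a plain allocation of the matrix arrays with a call such as alloc_interleaved(size, nr_nodes) is then enough to make the placement NUMA-aware, while the SpMV code itself stays unchanged.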

Figure 3.12: Transparent data placement in NUMA architectures on Linux. The space required for a data structure is allocated contiguously in the virtual address space, while individual partitions are then bound to specific physical nodes. User code is agnostic of the placement of the physical pages and 'sees' only the contiguous address space through the addr pointer, as expected.


Figure 3.13: The effect of sharing the input vector in NUMA architectures is minimal, since SpMV is chiefly memory bandwidth bound and the accesses in the input vector do not constitute a performance problem for the majority of sparse matrices.

3.4 Related work

For several years, the irregular access pattern on the input vector has been characterized as the major performance problem of the SpMV kernel. Temam and Jalby [1992] perform a thorough analysis of the cache behavior due to the irregular access pattern in the input vector and discuss optimization techniques, such as matrix reordering and blocking. Under the same assumption, Toledo [1997] proposes a number of optimization techniques for improving the memory performance of the kernel, including reordering, blocking and software prefetching for the non-zero values and the column indices. Geus and Röllin [2001] focus on symmetric matrices and propose software pipelining techniques for optimizing instruction parallelism, matrix reordering for increasing cache reuse and blocking for minimizing indirect references. Im and Yelick [2001] focus specifically on fixed size blocking methods as a means for increasing register reuse and propose techniques for automatically selecting the optimal block size. In the same direction, Vuduc et al. [2002] study the interaction of blocked kernels with the cache hierarchy and investigate the performance bounds of the SpMV kernel, while Pinar and Heath [1999] investigate variable-size blocking as an alternative for reducing the indirect references in the kernel. The computational overhead of short rows has been highlighted by White and Sadayappan [1997], who characterize it as a more crucial performance issue than the irregular access pattern, since it limits considerably the instruction-level parallelism. Mellor-Crummey and Garvin [2004] insist further on this problem and propose compiler optimization techniques for attacking it.

It is not until recently that the memory bandwidth bottleneck has started to be considered explicitly as the most crucial problem of the SpMV execution. Indeed, previous work took a rather implicit approach. A typical example are blocking methods, whose increased performance was attributed to the reduction of the indirect references. In fact, this is a by-product of the real benefit, which is the considerable reduction of the size of the column indexing structure. Williams et al. [2007] and Goumas et al. [2008] are the first to highlight the important role of memory bandwidth in the execution of the SpMV kernel in modern multicore architectures, while Willcock and Lumsdaine [2006] take a first approach in directly tackling this problem through explicit compression of the matrix representation.

3.5 Summary

In this chapter, we have presented an in-depth performance analysis of the SpMV kernel in modern multicore architectures, identifying the key performance problems.

The behavior of the SpMV kernel in modern SMP and NUMA architectures is determined by the streaming algorithmic nature of the kernel, which exhibits an extremely low flop:byte ratio. The lack of temporal and spatial reuse requires the memory subsystem to be able to supply the requested data at a CPU-comparable speed. The memory datapath, in terms of available memory bandwidth, therefore becomes the performance hotspot of the kernel, especially in multithreaded configurations, where multiple threads share this path. This problem is more pronounced in SMP architectures, where all memory and interprocessor traffic passes through a low-bandwidth common bus. NUMA architectures, on the other hand, offer significantly higher bandwidth through their integrated memory controllers, which can easily accommodate the traffic produced by the cores of the socket. The downside of this type of architecture, however, is that the main memory is segmented into multiple physical nodes and the performance of the kernel becomes very sensitive to the placement of the involved data, since remote memory accesses tend to saturate the processor interconnection links. Overcoming this problem requires a correct placement of the algorithm's working data on the memory nodes, a task that is achieved effortlessly and transparently to the application code using the proposed NUMA-aware Linux-based interleaved allocator.

The performance of the SpMV kernel depends heavily on the sparsity structure of the input matrix. We can identify two large categories of sparse matrices: (a) matrices with a rather regular non-zero element structure and (b) very sparse matrices with an irregular non-zero structure. The first category comprises mostly matrices arising from PDE problems with an underlying 2D/3D geometry and is the typical case of sparse matrices encountered in practice. The key performance problem for these matrices is the contention for memory bandwidth and their performance is almost directly related to the matrix representation size. The second category consists of matrices that do not saturate the memory bandwidth of the system, since they suffer from a diverse set of performance problems, including irregular memory accesses in the input vector, large loop overheads due to very short rows and load imbalances.

We conclude this chapter with a note on the optimization directions for the SpMV kernel, as these were revealed by the performance analysis conducted in this chapter. The goal for a successful optimization of the SpMV kernel must be the minimization of the memory footprint of the matrix, since contention for memory bandwidth is the key performance problem in modern multicore architectures. Performance problems related to very irregular matrices are of secondary importance and can be treated orthogonally to the minimization of the matrix size. For instance, matrix reordering techniques not only successfully treat the load imbalance problem of matrices with an irregular non-zero element distribution, but also manage to optimize the access pattern in the input vector.

4 Optimization opportunities of blocking

Blocking storage formats for sparse matrices exploit the regular non-zero element structure of certain sparse matrices, in order to form one- or two-dimensional dense blocks by grouping together neighboring non-zero elements. By keeping a single column index per block, blocking formats manage to reduce the matrix size significantly and alleviate the pressure on the memory hierarchy of the system. Additionally, the known and regular structure of the blocks, especially in fixed size blocking storage formats, allows the optimization of the SpMV computations, which now become more exposed.

In this chapter, we examine the optimization opportunities of a variety of blocking methods in terms of their compression capabilities and computational characteristics. Based on this knowledge, we propose a simple performance model for selecting the optimal block for BCSR, taking into consideration both the memory and the computational part of the SpMV kernel. The proposed model is able not only to detect the best block, but also the most efficient implementation.

4.1 The effect of compression

In Chapter 2, we presented in detail the most representative types of blocking for sparse matrices, namely fixed-size, variable-size and decomposed blocking storage formats. The common characteristic of these formats is the compression of CSR's colind data structure (Chapter 2, Figure 2.2) by keeping a single column index per formed block. Their key difference lies in the way the blocks are formed, which, as we shall see in this chapter, characterizes their compression and computational capabilities. Fixed-size blocking methods form fixed-size one- or two-dimensional blocks and either employ zero-padding for forming full blocks or decompose the original matrix into a full-block submatrix and a CSR remainder. Variable-size blocking methods, on the other hand, store arbitrary one- or two-dimensional blocks and employ additional data structures to keep track of their blocks.
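As a rough illustration of where the compression comes from, the sketch below contrasts a CSR layout with a fixed-size BCSR layout; the structures are simplified (zero-padding bookkeeping is omitted) and the names are illustrative, not the exact definitions used in this thesis.

    /* CSR: one 4-byte column index per non-zero element. */
    struct csr {
        int    *rowptr;     /* nrows + 1 entries      */
        int    *colind;     /* one entry per non-zero */
        double *values;     /* one entry per non-zero */
    };

    /* Fixed-size r x c BCSR: one column index per block, so the index
     * structure shrinks roughly by a factor of r*c, at the cost of
     * zero-padding partially filled blocks. */
    struct bcsr {
        int     r, c;       /* block dimensions             */
        int    *browptr;    /* one entry per block row + 1  */
        int    *bcolind;    /* one entry per stored block   */
        double *bvalues;    /* r*c values per stored block  */
    };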

Figure 4.1 shows the correlation between the compression ratio and the performance improvement for a diverse set of blocking storage formats, including fixed-size blocking with zero-padding (BCSR, RSDIAG), fixed-size blocking with decomposition (BCSR-dec, RSDIAG-dec) and variable-size blocking (VBL). We show the results for the Harpertown and Gainestown systems in single- and multithreaded configurations for the 30 matrices of our matrix suite. The marked lower left region of the diagrams denotes matrices for which a storage format could not offer a performance improvement over CSR, despite compressing its memory footprint. The first clear observation is that there is a tendency, becoming very pronounced in the multithreaded configurations, toward higher performance as the matrix size decreases. This is quite expected according to the quantitative analysis of the SpMV kernel in the previous chapter. It is remarkable that no storage format with a matrix representation larger than CSR's achieved a better performance, even in the single-threaded configurations, where the memory contention is not so intense. The inverse, however, is true for the majority of formats, revealing significant performance overheads unrelated to the matrix size. The most characteristic example is VBL, which, in single-threaded configurations, incurs more than 20% performance degradation, despite its rather compressed representation. This is due to the increased computational overhead of accessing the additional data structure for fetching the block sizes. This situation is reversed as the number of threads increases and the memory contention becomes apparent: the significant compression achieved by VBL relaxes the memory bandwidth requirements and offers a higher performance potential.

The case of diagonal storage formats is interesting. First, as expected, their compression capability is lower than BCSR's, since they detect only one-dimensional blocks. However, even in cases where they achieve comparable compression, their performance is lower, especially in the Gainestown system. This behavior can be explained by examining more closely the access pattern of diagonal blocks: for each block of size b, the kernel must fetch 16b + 4 bytes (8b for the values, 8b for the input vector elements and four bytes for the block column index) and write back into main memory 8b bytes per block row. For a 4 × 1 block, however, the BCSR kernel reads only 8b + 12 bytes per block (a single input vector element is needed) and again writes back 8b bytes per block row.
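Plugging a block size of b = 4 into these per-block byte counts (and counting 2b floating point operations per block, with the output traffic common to both formats left out) gives a rough local flop:byte comparison:

    \frac{2b}{16b + 4}\bigg|_{b=4} = \frac{8}{68} \approx 0.12
    \qquad\text{versus}\qquad
    \frac{2b}{8b + 12}\bigg|_{b=4} = \frac{8}{44} \approx 0.18,

i.e., roughly 50% higher local arithmetic intensity for the vertical BCSR block under these simplifying assumptions.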

VBL must fetch the block size for every block encountered; therefore, the smaller the block, the higher the overhead. Additionally, compared to BCSR, it is not easy for a compiler to emit highly optimized code for the block traversal in VBL, since the loop bounds (block size) are unknown. The typical implementation of a blocked SpMV kernel unrolls the block computations and delays the output vector writes until the end of the matrix's block row, by accumulating the intermediate results into registers. This is done in order to minimize the demands on memory-write bandwidth, a resource that is more restricted.

Figure 4.1: Correlation of the compression ratio and the performance improvement over CSR achieved by the different blocking storage formats in a set of 30 sparse matrices: (a) Harpertown, single-threaded; (b) Harpertown, single socket (four threads); (c) Gainestown, single-threaded; (d) Gainestown, single socket (four threads). The performance results for fixed-size blocking methods correspond to the best block.

The arithmetic intensity for diagonal blocks is therefore significantly lower, explaining their performance deficiency compared to BCSR blocks.

The decomposed formats solve the zero-padding problem of the fixed-size blocking formats, with their size barely exceeding CSR's in the worst case, whereas padding in BCSR can lead to up to a 60% increase in the matrix size, even for the best performing block. Nonetheless, decomposed formats have to pay the cost of the multiple SpMV operations, leading to a slightly lower performance than their zero-padded counterparts in cases of comparable compression. In cases where almost no compression is achieved, this overhead becomes more visible and can lead to a more than 10% performance degradation compared to CSR.

Figure 4.2: The speedup achieved by a variety of blocking storage formats in (a) Harpertown and (b) Gainestown. The performance slowdown in Gainestown from the eight-threaded configuration onward is due to the non-NUMA-aware implementation used.

The superiority of BCSR in the computations compared to the other blocking formats is depicted clearly in the single-threaded configuration in Gainestown, where it achieves the best performance at the same matrix sizes. This advantage is still preserved in the single-socket configuration in Gainestown, but it is completely overshadowed by the memory bottleneck in the single-socket configuration in Harpertown. Nonetheless, despite being very performant for matrices where it achieves high compression, BCSR's compression capabilities are limited; in almost half of the matrices in our suite, it increased the matrix size, deteriorating SpMV's performance.

The effect of compression as the memory bandwidth contention becomes visible is depicted in the speedup diagrams of Figure 4.2. VBL pays the cost of its additional 'decompression' operations in the few-threaded configurations, but it manages to successfully compensate for it as the number of threads increases. From the four-threaded configuration onward in Harpertown, VBL manages to achieve the best SpMV performance on average, while in Gainestown (non-NUMA implementations), it takes the lead from the eight-threaded configuration, where the second socket is used and the SpMV computation is now bound by the processor interconnection link. This behavior, depicted visually in Figure 4.3, is typical of highly compressed formats [Kourtis et al., 2008a], where the limited memory bandwidth in multithreaded configurations can easily hide the decompression cost. Finally, the decomposed formats manage to achieve 5–8% better performance on average compared to their zero-padded counterparts, due to their more compact representation.

Figure 4.3: The effect of compression, (a) single-threaded and (b) multithreaded. Highly compressed storage formats suffer in the single-threaded configuration from the increased decompression overhead. In multithreaded configurations, however, this cost is hidden by the contention in the memory subsystem and the performance benefit becomes evident.

4.2 The effect of the computations

The previous discussion on the compression capabilities of the different blocking storage formats has revealed the significance of the computational part of the kernel in configurations where the available main memory bandwidth is not yet saturated. In this section, we focus more on the computational part of BCSR, which is the most computationally friendly blocking storage format, in order to obtain further insight into the intricacies of the SpMV kernel.

4.2.1 Vectorization and the block shape

The fixed size blocks of BCSR provide a significant computational advantage compared to variable size blocking methods by allowing a number of performance optimizations. Not only can modern compilers provide highly optimized code for loops with fixed bounds, but 'manual' optimizations are also possible. One such optimization is vectorization, a technique that increases the throughput of floating point operations by executing the same instruction on multiple floating point data (Single Instruction Multiple Data – SIMD). Originating from the vector processors of the 1970s [Russell, 1978], SIMD instructions have been part of modern microprocessors for several years [Raman et al., 2000] and have recently come to the forefront with the advent of general purpose GPU architectures [Lindholm et al., 2008]. The experimental platforms we use for the performance evaluations in this thesis support a set of SIMD instructions (Streaming SIMD Extensions – SSE) that operate on a separate register file comprising 16 'wide', 128-bit floating point registers. A dedicated arithmetic unit is responsible for implementing the SIMD extensions, accompanied by a special shuffle unit; this unit is responsible for moving individual SIMD vector elements either across multiple SIMD registers or within the same register.

Instructions that engage the shuffle unit for their execution may encounter high latency and low throughput, due to the rather complex circuit logic of this unit. Algorithms 4.1 and 4.2 show the SIMD implementations for the 1 × 2 and 2 × 1 BCSR blocks, respectively, indicating the instructions that involve the shuffle unit. The computation of a 1 × 2 block starts by loading the two input vector values and the corresponding two non-zero elements into two SIMD registers. It then proceeds with the actual SpMV computation on the SIMD registers. At this point, the typical implementation would add the two elements of the y0 register using a horizontal add instruction.

However, this instruction engages the shuffle unit and can be quite expensive, especially in older architectures; therefore, we delay its execution until the end of the block row, when the actual result must be written back to the output vector. The computation of 2 × 1 blocks starts by loading the two non-zero elements into the y0 SIMD register, while the single input vector element is stored in the lower 64 bits of the y1 register.

This value is then duplicated (using the shuffle unit) in the upper 64 bits, in order for the SpMV to proceed in a vectorized fashion. The implementations shown for these two types of blocks can be extended to larger one-dimensional and two-dimensional blocks in a straightforward manner.

e vectorization of the BCSR blocks has introduced a new set of factors that inuence the SpMV performance. Figures 4.5 and 4.4 show the effect of vectorization in the Harpertown and Gainestown, respectively, for the single- and multithreaded congurations. We show results only for the 15 symmetric matrices of our matrix suite, so that symmetric blocks (e.g., 1 × 2 and 2 × 1) lead to the same matrix sizes. e impact of vectorization in Harpertown is minimal reaching a maximum of 5% improvement for the 1 × 8 blocks. e low memory bandwidth of SMP architectures does not leave enough space for computational optimizations, such as vectorization. Additionally, the SIMD implementations for any block make use of the shuffle unit and cannot com-

1: procedure MVB(A::in, x::in, y::out)
       A: matrix in BCSR format, x: input vector, y: output vector
2:     for i ← 0 to N do
3:         y0 ← 0
4:         for j ← browptr[i] to browptr[i + 1] step by 2 do
5:             jb ← j/2
6:             xc ← bcolind[jb]
7:             x0 ← x[xc]
8:             v0 ← bvalues[j]
9:             y0 ← y0 + v0 · x0
10:        end for
11:        y0 ← y0 ⊕ y0            ◃ Shuffle unit
12:        y[i] ← lower64(y0)      ◃ Shuffle unit
13:    end for

Algorithm 4.1: SIMD implementation of the BCSR kernel for 1 × 2 blocks. Vector registers are in bold typeface; the ⊕ operator performs the addition of the individual register elements (horizontal add), while lower64 gets the 64 lower bits of a vector register.

pensate the overhead, since the blocks are quite small and the unit is not very optimized. is overhead becomes more obvious in cases where unaligned loads are used, where vectorization even degrades performance. In Gaines- town, on the other hand, the gain of vectorization is signicant, reaching 58% on average for the 1×8 block. is behavior is due chiey to the ample memory bandwidth available and the optimized shuffle unit of the Nehalem microar- chitecture [Int, 2009; Fog, 2012]. A closer look at the Gainestown results (Figure 4.5) reveals some impor- tant characteristics of the BCSR blocks. First, the vertically oriented blocks (standard version) exhibit higher performance than their horizontally oriented counterparts, although both lead to the same matrix size. Horizontally ori- ented blocks resemble the case of diagonal blocks (Section 4.1), since for each block the kernel must fetch a series of input vector elements, leading to a low local arithmetic intensity. e performance differences are attenuated in the vectorized versions due to the increased overhead of the shuffle operations

e performance of 3 × 1, 5 × 1 blocks etc. is not degraded, since their unaligned store is not the in the critical path.

69

¨ © ¨ thesis March 11, 2013 15:54 Page 70 ©

4. Optimization opportunities of blocking

1: procedure MVB(A::in, x::in, y::out) A: matrix in BCSR format x: input vector y: output vector 2: ir ← 0 3: for i ← 0 to N step by 2 do ← 4: y0 0 5: for j ← browptr[ir] to browptr[ir + 1] step by 2 do ← j 6: jb 2 ← 7: xc bcolind[jb] 8: lower64(x0) ← x[xc] ◃ Shuffle unit 9: upper64(x0) ← lower64(x0) ◃ Shuffle unit 10: v0 ← bvalues[j] ← · 11: y0 y0 + v0 x0 12: end for ← 13: y[i] y0 14: ir ← ir + 1 15: end for ¨ ¨ © Algorithm 4.2: SIMD implementation of the BCSR kernel for 2 × 1 blocks. Vector © registers are in bold typeface; the lower64 and upper64 operators get the 64 lower or upper bits of a vector register, respectively.

in the vertically oriented blocks. e second key observation is that the per- formance of the SIMD versions is more loosely coupled with the matrix size, due to the various overheads introduced by the strict SIMD implementation requirements. For example, one-dimensional blocks with an odd dimension perform worse that ones with an even dimension, despite leading to a smaller matrix representation and, similarly, 3 × 2 blocks are 17% faster than 2 × 3 blocks. Selecting the optimal BCSR block, therefore, is not a straightforward task and the computational part of the kernel must also be considered. Closing the discussion on the vectorization of the BCSR blocks, we should highlight the multithreaded case in both Harpertown and Gainestown systems. e contention for memory bandwidth resources leaves almost no headroom for computational optimizations, such as vectorization. In Harpertown, which suffers more from the memory bottleneck, the gain of vectorization is negli- gible, while in Gainestown, it falls below 5%. In these cases, the criterion of matrix size seems rather safe for predicting the SpMV performance.

70

¨ © ¨ thesis March 11, 2013 15:54 Page 71 ©

4.3. Predicting the optimal block size

600

500

400 Performance (Mflop/s)

BCSR 300 BCSR (vectorized)

1x2 1x3 1x4 1x5 1x6 1x7 1x8 2x1 2x2 2x3 2x4 3x1 3x2 3x3 4x1 4x2 5x1 6x1 7x1 8x1

Block

(a) Single-threaded performance.

700

600

¨ ¨ © 500 © Performance (Mflop/s)

BCSR 400 BCSR (vectorized)

1x2 1x3 1x4 1x5 1x6 1x7 1x8 2x1 2x2 2x3 2x4 3x1 3x2 3x3 4x1 4x2 5x1 6x1 7x1 8x1

Block

(b) Multithreaded performance (single socket, four threads). Figure 4.4: e effect of vectorization of BCSR in the Harpertown SMP platform.

4.3 Predicting the optimal block size

e discussion on the performance characteristics of BCSR blocks has raised the need for formulating a performance model for the SpMV kernel that will guide us through the selection of the correct optimization. is performance model must consider also the computational part of the kernel, since it can considerably inuence the overall performance. In this section, we propose and evaluate the accuracy of two simple performance models for the BCSR storage format, namely the MC and O models. ese models consider also the computational part of the kernel and ‘decide’ on the optimal block and implementation (standard or vectorized). We focus specically on the single-threaded performance, since the computational part of the kernel is

71

¨ © ¨ thesis March 11, 2013 15:54 Page 72 ©

4. Optimization opportunities of blocking

1800

1600

1400

1200

1000 Performance (Mflop/s) 800 BCSR 600 BCSR (vectorized)

1x2 1x3 1x4 1x5 1x6 1x7 1x8 2x1 2x2 2x3 2x4 3x1 3x2 3x3 4x1 4x2 5x1 6x1 7x1 8x1

Block

(a) Single-threaded performance.

3000

2800

2600

2400 ¨ ¨ 2200 © © Performance (Mflop/s) 2000 BCSR 1800 BCSR (vectorized)

1x2 1x3 1x4 1x5 1x6 1x7 1x8 2x1 2x2 2x3 2x4 3x1 3x2 3x3 4x1 4x2 5x1 6x1 7x1 8x1

Block

(b) Multithreaded performance (single socket, four threads). Figure 4.5: e effect of vectorization of BCSR in the Gainestown NUMA platform.

more signicant in this case, rendering the optimal selection more difficult.

4.3.1 The Mem model

e M model is the most straightforward performance model for the SpMV kernel. Proposed by Gropp et al. [1999], this model is based solely on the algo- rithm’s working set (matrix representation and input/output vectors sizes) and, therefore, it is not strictly focused on any storage format. Based on the stream- ing nature of the SpMV kernel, this model assumes that the sole performance- critical parameter is the effective memory bandwidth of the underlying system. erefore, given a system with bandwidth B and a matrix with a working set S,

72

¨ © ¨ thesis March 11, 2013 15:54 Page 73 ©

4.3. Predicting the optimal block size

the SpMV execution time according to the M model will be S t = (4.1) M B e effective bandwidth of a system can be obtained through microbench- marks, such as STREAM [McCalpin, 1995], BenchIT [Juckenland et al., 2004] etc.

4.3.2 The Sparsity model Im et al. [2004] and Buttari et al. [2007] focus specically on BCSR and pro- pose a performance model for selecting the best block. is model, employed also by the OSKI sparse matrix optimization framework [Vuduc et al., 2005], considers both the memory and the computational part of the kernel, but in a more implicit manner. According to the S model, an estimate of the execution time for a sparse matrix A can be computed as follows:

fillA(r, c) tA ∝ (4.2) perfA(r, c)

¨ e fillA(r, c) parameter is the ll-in ratio of the r×c block and denotes the zero ¨ © padding needed to construct the full BCSR blocks. e ll-in ratio is an indica- © tion of the computational overhead of padding (redundant oating-point op- erations) and the resulting matrix size. e perfA(r, c) is the actual SpMV per- formance for the r × c block. is parameter is microarchitecture-dependent and captures the performance variation for the different blocks leading to the same ll-in ratio, i.e., the same matrix size. Since computing perfA(r, c) for an arbitrary matrix A is irrational, these values are obtained by proling the SpMV execution of a large dense matrix, exceeding the last level cache size of the tar- get system. erefore, the execution time estimate of the S model for an r × c block is calculated as follows:

r,c r,c tS = fillA(r, c)tdense (4.3) e ll-in ratio is the ratio of the BCSR non-zeros, including padding elements, over the non-zero elements of the original matrix. Computing the ll-in ratio for every relevant block can still be time-consuming⁴. For this reason, in prac- tice, an estimate of the ll-in ratios is computed by sampling the sparse matrix. e S model can be applied also to CSR, where fillA(r, c) = 1 and r,c tdense is the SpMV execution time for the proled dense matrix stored in CSR format. ⁴ e cost of converting a sparse matrix into the BCSR format is an order of magnitude higher than a single SpMV operation.

73

¨ © ¨ thesis March 11, 2013 15:54 Page 74 ©

4. Optimization opportunities of blocking

4.3.3 The MemComp model e MC model is a generalization of the M model that considers also the computational part of the kernel. It regards SpMV as a two-phase operation consisting of the memory and the computational parts. e memory part is the one computed by the M model (equation (4.1)), while the computational part is computed explicitly using an estimate of the execution time of a single BCSR block. Assuming a sparse matrix of size S in BCSR format, using Nb r × c blocks, the SpMV execution time, according to the MC model, is computed as follows: S tr,c = + Nr,c · tr,c (4.4) MC B b b e rst term of the equation corresponds to the memory part of the kernel and B is the effective memory bandwidth of the underlying system. e second term is the computational part and corresponds to the execution time of all Nb blocks. e execution time tb of a single block is obtained by proling the execution of a very small dense matrix, tting in the L1 cache of the system, for all r × c blocks under consideration. e MC model can be applied also to decomposed formats, where ¨ the formula (4.4) is computed for each submatrix. e case of CSR is treated ¨ © × r,c r,c © as a degenerate BCSR with 1 1 blocks; therefore, Nb = NNZ and tb is the execution time of the proled small dense matrix stored in CSR.

4.3.4 The Overlap model e MC model treats the memory and the computational parts of the SpMV kernel separately, as if they were two completely distinct phases. How- ever, according to the quantitative analysis presented in Chapter 3, the compu- tations of the SpMV kernel are not very exposed, due to its streaming nature, which tends to saturate the available memory bandwidth. e O model assumes an overlapping of the memory and computational parts of the kernel and only a fraction of the computational part is eventually visible in the overall SpMV execution time. is behavior is possible in modern microarchitectures, even in a single-threaded conguration, thanks to hardware prefetching. Hard- ware prefetching is a mechanism for fetching into the processor’s cache the data that is likely to be accessed by a program shortly, based on its current memory reference pattern. is mechanism not only hides the main memory access la- tency, but also allows streaming applications, such as SpMV, to fully utilize the available memory bandwidth of the system [Goumas et al., 2009]. e O model adjusts the computational part term of the M- C model by multiplying it by a factor, called Non-Overlapping Factor (NOF),

74

¨ © ¨ thesis March 11, 2013 15:54 Page 75 ©

4.3. Predicting the optimal block size

indicating the percentage of SpMV computations exposed to the overall exe- cution time. e NOF factor is dened for every r × c block as tr,c − tr,c r,c = real M NOF r,c · r,c (4.5) Nb tb is factor is obtained by proling the execution of a large dense sparse ma- trix, exceeding the last level cache of the system under consideration. It can be viewed as a correction factor of the computational part of the MC model, since it actually compares the real computations time (nominator) ver- sus the theoretical time of the MC model (denominator). Given the NOF factor, the SpMV execution time according to the O model is computed by the following formula: S tr,c = + NOFr,c · Nr,c · tr,c (4.6) O B b b Similarly to MC, the O model can be applied also to both de- composed formats and CSR, by considering it as a degenerate BCSR with 1 × 1 blocks.

¨ 4.3.5 Assessing the accuracy of the models ¨ © © e main purpose of the performance models presented in this chapter is the selection of the optimal block for BCSR. However, with the exception of the S model, the rest three models, namely M and the proposed M- C and O models, take a more explicit approach in determining also the kernel’s execution time. We could distinguish, therefore, two metrics for assessing the prediction quality of the considered models: Selection accuracy: is metric determines how accurate is the performance model in determining the optimal BCSR block for a specic matrix. Quantitatively, the quality of the prediction can be assessed by how close to the optimal block’sperformance is the performance of the block selected by the model under consideration. Execution time prediction accuracy: is metric determines how accurate the performance model at hand can predict the execution time of the SpMV kernel in absolute terms. ere exist an one-way relation between the two accuracy metrics described. If a performance model is fairly accurate in predicting the absolute execution time of the SpMV kernel using different blocks, it will also exhibit a high se- lection accuracy. e inverse, however, is not true; providing accurate block selections does not prescribe an accurate execution time prediction, since it requires only a good ranking of the candidate blocks.

75

¨ © ¨ thesis March 11, 2013 15:54 Page 76 ©

4. Optimization opportunities of blocking

Harpertown Gainestown Correct Dist. C.I. Correct Dist. C.I. M 27/30 1.81% ±2.05% 10/30 8.50% ±1.86% S 26/30 3.33% ±4.54% 26/30 5.89% ±4.90% MC 22/30 5.61% ±2.88% 28/30 3.63% ±0.34% O 26/30 3.53% ±4.54% 26/30 5.89% ±4.90%

Table 4.1: Selection accuracy of the considered performance models. It is shown the total number of correct predictions, the average distance of mispredicted cases from the optimal performance (Dist.) and a 95% condence interval (C.I.). In cases with a few mispredictions, the condence intervals may ex- ceed the average misprediction distance; this is due to the very small sample of mispredictions available.

e key concern in this chapter is providing high selection accuracy, since we are interested in selecting the best SpMV optimization. e performance models must be able select the best option among the baseline CSR and the standard and SIMD BCSR implementations. Apparently, the M model can- ¨ not differentiate between the simple and vectorized BCSR implementations, ¨ © since it considers only the memory part of the kernel. For this reason, we em- © pirically select the standard implementation for SMP systems and the SIMD version for NUMA ones. Figure 4.6 shows schematically the selection accuracy of the four performance models considered in the Harpertown and Gaines- town systems, while Table 4.1 summarizes the accuracy results. e compe- tence of the M model is clear in Harpertown, where it provides the cor- rect selection for 27 out of the 30 matrices (90%) in our suite. is is more or less expected, since the SpMV kernel is bound from the low memory band- width of the Harpertown system. e MC model, on the other hand, overestimates the impact of the computations in this system, missing the cor- rect prediction in eight cases. e S and O models are clearly more adapted and manage an 87% accuracy (26 correct predictions). In case of a misprediction, all models, except MC, are within 5% of the optimal performance, which reveals a rather high prediction quality. e MC model falls victim of its overestimation of the computations and its prediction can lead up to 10% performance degradation. e situation is completely reversed in the Gainestown system. e com- putations play a rather signicant role in this case and the MC model excels, providing the correct predictions in almost all the matrices (28 cor- rect predictions, 93%). e performance of the M model is disappointing, managing only 10 correct predictions (33%). is is quite expected, according

76

¨ © ¨ thesis March 11, 2013 15:54 Page 77 ©

4.3. Predicting the optimal block size

100

95

90 Mem 85 MemComp Overlap Sparsity

Selection performance (%) 80 F1 pre2 m_t1 ldoor hood nd12k torso3 ohne2 rajat31 consph xenon2 cage13 inline_1 offshore Hamrle3 thermal2 boneS10 G3_circuit kkt_power bmw7st_1 bmwcra_1 af_5_k101 atmosmodj Freescale1 ASIC_680k crankseg_2 Chebyshev4 parabolic_fem Ga41As41H72 TSOPF_RS_b2383

(a) Harpertown.

100

¨ 95 ¨ © © 90 Mem 85 MemComp Overlap Sparsity

Selection performance (%) 80 F1 pre2 m_t1 ldoor hood nd12k torso3 ohne2 rajat31 consph xenon2 cage13 inline_1 offshore Hamrle3 thermal2 boneS10 G3_circuit kkt_power bmw7st_1 bmwcra_1 af_5_k101 atmosmodj Freescale1 ASIC_680k crankseg_2 Chebyshev4 parabolic_fem Ga41As41H72 TSOPF_RS_b2383

(b) Gainestown. Figure 4.6: Selection accuracy of the considered performance models. e M model achieves a high selection accuracy in Harpertown, where the mem- ory bandwidth contention is more pronounced, but it falls short in its pre- dictions in Gainestown, where the computational part of the kernel plays a signicant role in the overall SpMV performance. e inverse is true for the MC model, while O and S balance better the memory and computational parts, achieving good predictions in both cases.

77

¨ © ¨ thesis March 11, 2013 15:54 Page 78 ©

4. Optimization opportunities of blocking

Harpertown Gainestown Dist. (R) Dist. (A) Dist. (R) Dist. (A) M 12.94 ± 2.71% 18.22 ± 3.78% 43.05 ± 3.35% 45.32 ± 2.85% MC 28.41 ± 3.93% 23.13 ± 4.36% 23.98 ± 3.35% 19.76 ± 3.75% O 5.28 ± 1.99% 12.04 ± 4.20% 9.06 ± 5.71% 15.81 ± 5.12%

Table 4.2: Execution time prediction accuracy of the considered SpMV performance models. It is shown the average distance between the predicted and the real performance for regular matrices (Dist. (R), 15 matrices) and for the whole matrix suite (Dist. (A)). e corresponding 95% condence intervals are also presented.

to our earlier performance analysis, since the SpMV performance in Gaines- town is not directly related to the matrix representation size. e S and O models exhibit a rather stable behavior, providing a correct pre- diction in 26 matrices (87%). However, the cost of misprediction is higher in Gainestown, leading to an average 5.9% degradation for S and O-  and 8.5% for the M model. Although the key purpose of the performance models presented in this ¨ ¨ © chapter is the correct prediction of the optimal BCSR block, the M, M- © C and O models are derived from an abstraction of the execution time of the SpMV kernel. For this reason, we provide in Table 4.2 some statis- tics on the execution time prediction accuracy of the proposed models. Since the proposed models assume a streaming behavior of the SpMV kernel, we provide specic results for matrices with a high op:byte ratio (≥ 1.6), as these matrices do not suffer from collateral problems, such loop overheads, irregu- lar memory accesses etc. (see Chapter 3 for a detailed analysis) and are typ- ical of SpMV applications. e O model manages the most accurate execution time prediction, leading to an average 5.3% absolute performance difference in Harpertown for regular matrices. In Gainestown, this difference increases, but still remains below 10%. When the irregular matrices are also considered, the prediction accuracy deteriorates, reaching a 12% distance in Harpertown and 15.8% in Gainestown. e M and MC models are less accurate, though, the rst being more suitable for Harpertown and the lat- ter for Gainestown, but still stay considerably lower than O. We should note here, that the accuracy of MC model seems to be favored by the more irregular matrices. Due to the irregular memory reference pattern, these matrices tend to underutilize the available memory bandwidth and their com- putations are more exposed, allowing the MC model to perform better predictions.

78

¨ © ¨ thesis March 11, 2013 15:54 Page 79 ©

4.3. Predicting the optimal block size

Finally, Figure 4.7 shows the predicted execution time by the three consid- ered models for the more regular matrices. e M model tends to overesti- mate the SpMV performance, since it considers solely the memory part of the kernel. On the other hand, the MC model underestimates the perfor- mance, as it completely decouples the memory and the computational parts. e O model, however, balances the two parts and manages a better approximation of the SpMV performance. Its performance line for our two architectures exhibits its better adaptability: in Harpertown, it is closer to the M’s line, revealing that the memory part is more signicant in this archi- tecture; the inverse is true for Gainestown, where O is closer to the MC’s line.

4.3.6 Extensions

e performance models presented in this chapter are focused on the single- threaded SpMV performance. Our previous performance analysis of the SpMV kernel has revealed that the memory part of the kernel becomes more signi- cant in multithreaded congurations, producing an almost linear relation be- ¨ ¨ tween the matrix size representation and the SpMV performance. For this rea- © © son, we expect the M model to accurately predict not only the optimal block size, but also the actual execution time, provided no other performance im- peding factors are present (e.g. load imbalances). Determining correctly the non-overlapping factor will allow the O model to match the M’s ac- curacy, since it has been proved to be rather adaptable. Similarly, the S model has also been proved to be adaptable. e MC model, on the other hand, is more monolithic and we expect it to be less accurate in a multi- threaded context. Achieving a total performance prediction for the SpMV kernel is not a very straightforward task. All presented models made the assumption of a stream- ing behavior of the kernel, which is a standard for the majority of matrices in- volved in sparse matrix computations. However, a more general performance model should take into account parameters such as the op:byte ratio (or aver- age non-zero elements per row⁵) and load imbalance in multithreaded congu- rations. e O models showed considerable adaptability and we believe it will be able to ‘catch’ also these corner-cases, with a suitable tweaking of its control parameters.

⁵ Buttari et al. [2007] present a performance model that also considers the average row elements per row, but the authors do not elaborate on multithreaded congurations and in a large variety of sparse matrices.

79

¨ © ¨ thesis March 11, 2013 15:54 Page 80 ©

4. Optimization opportunities of blocking

2.0 Mem MemComp Overlap 1.5

1.0

0.5 Performance (norm.)

0.0 F1 m_t1 ldoor hood nd12k ohne2 consph inline_1 boneS10 bmw7st_1 bmwcra_1 crankseg_2 Chebyshev4 Ga41As41H72 TSOPF_RS_b2383

(a) Harpertown.

3.0 Mem MemComp ¨ Overlap ¨

© 2.0 ©

1.0 Performance (norm.)

0.0 F1 m_t1 ldoor hood nd12k ohne2 consph inline_1 boneS10 bmw7st_1 bmwcra_1 crankseg_2 Chebyshev4 Ga41As41H72 TSOPF_RS_b2383

(b) Gainestown. Figure 4.7: Execution time prediction accuracy of the considered performance mod- els for regular matrices (op:byte ratio ≥ 1.6). e M model tends to overestimate performance, while the MC model underestimates it. e O manages to balance the memory and computational parts of the kernel and achieves high accuracy. e F1 matrix can be considered as an outlier, since it exhibits a rather irregular structure, despite the high op:byte ratio.

80

¨ © ¨ thesis March 11, 2013 15:54 Page 81 ©

4.4. Summary

4.4 Summary

In this chapter, we have discussed and evaluated the performance optimiza- tion opportunities of the blocking storage formats. We showed that single- threaded congurations do not saturate the memory bandwidth of the system and leave headroom for computational optimizations, especially in NUMA ar- chitectures. Indeed, employing the SIMD extensions of modern processors, we were able to accelerate the SpMV performance more than 50% for certain BCSR blocks. Motivated by this behavior, we proposed and evaluated two per- formance models (namely, MC and O) for selection of the op- timal BCSR block; these model consider both the memory and computational part of the kernel. In addition, the O model was able to adapt success- fully to the different architectures and predict quite accurately the actual SpMV performance in more regular matrices. In a multithreaded context, the saturation of the memory bandwidth elimi- nates almost any benet from a computational optimization. e compression capability of a storage format becomes of vital importance and formats that lead to the most compact matrix representation exhibit a high performance poten- tial. A key conclusion of the analysis in this chapter is that there exists a clear ¨ tradeoff between the high compression capability and the computational ad- ¨ © vantage of a storage format. e most compressed formats, such as VBL, have © to pay a high cost of decompression in single-threaded congurations (e.g., accessing additional data structures, performing time-consuming operations etc), but compensate it in a highly multithreaded context. In such a context, their performance advantage is clear, since the execution time is almost solely determined by the available memory bandwidth.

81

¨ © ¨ thesis March 11, 2013 15:54 Page 82 ©

¨ ¨ © ©

¨ © ¨ thesis March 11, 2013 15:54 Page 83 ©

5

The Compressed Sparse eXtended Format

e performance analysis of the SpMV kernel and blocked storage formats in the previous chapters has revealed the contention for main memory bandwidth resources as the major performance bottleneck of the SpMV kernel in mod- ern multicore architectures. As the number of SpMV threads is increasing, the memory bandwidth is quickly saturated, leaving no room for computa- tional optimizations. is behavior is especially apparent in SMP architec- tures, where the sole performance-critical parameter in multithreaded cong- urations is the matrix size representation; the lower the representation size of ¨ ¨ a specic matrix, the higher its SpMV performance. Due to the a priori un- © © known distribution of the non-zero elements of a sparse matrix, a highly com- pressed storage format must rely on special data structures and decompres- sion operations, in order to avoid zero padding. ese operations may incur a signicant computational overhead, which is, however, completely hidden in multithreaded congurations, where the memory bandwidth bottleneck leaves considerable slack for time-consuming operations. Motivated by this inherent SpMV behavior, we propose in this chapter the Compressed Sparse eXtended (CSX) format, a highly compressed storage for- mat for sparse matrices, that is able to detect and encode simultaneously a multitude of matrix substructures. CSX employs explicit compression tech- niques, such as delta and run-length encoding, in order to compress efficiently the colind data structure of the original CSR. In conjunction with the use of matrix coordinate transformations, CSX is able to detect a wide variety of non- zero elements substructures, including horizontal, vertical, diagonal and two- dimensional blocks. is feature not only increases its compression capability, but also allows CSX to successfully adapt to the specicities of every matrix without sacricing performance. Indeed, CSX is able to accelerate the SpMV performance more than 60% on average in SMP systems and approximately 20% in NUMA architectures. Additionally, the ability to adapt to the non-zero element structure of each matrix provides a considerable performance stability, a merit lacking from the majority of CSR alternatives.

83

¨ © ¨ thesis March 11, 2013 15:54 Page 84 ©

5. e Compressed Sparse eXtended Format

Loop Decompression Format Substructures Padding overheads overheads CSR None No No No BCSR xed 2-D Yes No No RSDIAG xed diagonal Yes No No Decomposed xed 1-D No Yes No VBL variable horiz. No No Yes CSX variable 1-D, 2-D No No Yes

Table 5.1: Key features of the most important CSR alternatives.

CSX is a relatively complex storage format. In this chapter, we describe in detail every aspect of it and evaluate its performance on our computational testbed. We take particularly care of the preprocessing cost of CSX and evaluate its overall performance impact on a multiphysics simulation soware.

5.1 The need for an integrated storage format

¨ ¨ Sparse matrices arising from the discretization of partial differential equations © © have their non-zero elements arranged in substructures either extending to some one-dimensional direction (e.g., horizontal, vertical, diagonal) or ex- panding to two-dimensional blocks. e exact nature of these substructures depends chiey on the underlying application domain and any preprocessing performed in the original matrix (e.g., bandwidth minimization techniques) that alters the distribution of the non-zero elements. As already discussed in Chapter 2, exploiting these substructures can result in a reduction of the matrix size. However, restraining a storage format to detect a single type of substruc- tures may have diminishing returns in cases where such substructures do not exist in adequate quantities inside the sparse matrix, leading even to an in- crease in matrix size compared to the standard CSR. An example is shown in Figure 5.1, where a sample matrix with a variety of substructures is shown; trying to construct simply BCSR blocks results in excessive padding, while applying VBL incurs signicant overhead for diagonally arranged elements. Other alternatives, such as decomposed matrices that split the original matrix into multiple submatrices, each one storing a different substructure, experience additional performance overheads, due to the multiple SpMV operations (see Chapter 4, Section 4.1). e need for an integrated storage format that could incorporate a multitude of substructures is seemingly the most viable approach to a performant CSR alternative. Table 5.1 summarizes the basic features of the most important alternatives.

84

¨ © ¨ thesis March 11, 2013 15:54 Page 85 ©

5.1. e need for an integrated storage format

. .1. .2. .3. .4. .5. .6. .7. .8. .9. .10. . .1. .2. .3. .4. .5. .6. .7. .8. .9. .10. . .1. .2. .3. .4. .5. .6. .7. .8. .9. .10...... 1 . . . . .1 .1 .h(1) .ad(1) .2...... 2...... 2...... 3 ...... 3 .3 .v(1) ...... 4 .4 .4 ...... 5 . . . .5 .5 .6...... 6...... 6...... 7 . . . .7 .7 .bc(4,2) ...... 8 .8 .8 ...... 9 . . . .9 .9 .10...... 10...... 10...... d(2) .bc(4,2).bc(3,2) (a) BCSR. (b) VBL. (c) CSX.

Figure 5.1: e advantage of detecting multiple substructures inside a sparse matrix (gray dots are non-zero elements, white dots are padding elements). e use of padding in xed size blocking (BCSR) can be excessive for matri- ces with irregular non-zero element structure, while detecting only one substructure type (VBL) cannot exploit substructures in different direc- tions. e proposed CSX format can detect a great variety of substruc- tures without the need of padding. e annotations beside or below the detected substructures by CSX denote their exact instantiation (see text). [CSX substructure legend – h(X): horizontal, v(X): vertical, d(X): diagonal, ¨ ad(X): anti-diagonal, bc(X,X): block column-aligned; numbers inside paren- ¨ © theses are the delta distances or the block dimensions; numbers in delta units © denote their representation size in bits.]

e proposed Compressed Sparse eXtended (CSX) format is a highly com- pressed storage format that integrates ve different substructure categories and can be easily expanded to support more. When designing CSX we set a number of goals for the new format: (a) it should specically target on the minimization of the memory footprint of the matrix, since SpMV is a memory bandwidth bound kernel, (b) it should cover a wide range of substructures inside the matrix, including horizontal, diagonal and 2-D blocks, (c) it should expose a stable high performance behavior across different ma- trices and symmetric shared memory and NUMA architectures, (d) it should be extensible and adaptive in the sense of supporting new sub- structures or implementing variations of the existing ones. In order to meet the two rst design goals, we employ an aggressive compres- sion scheme for CSX. We believe that data compression techniques will play a more signicant role in future multicore and manycore chips as a means not only for minimizing the communication cost in different levels (processor-to- memory, processor-to-processor etc.), but also for increasing energy-efficiency.

85

¨ © ¨ thesis March 11, 2013 15:54 Page 86 ©

5. e Compressed Sparse eXtended Format

CSX builds on top of the CSR-DU format [Kourtis et al., 2008b] (see Chapter 2, Section 2.3), by adding run-length encoding to the delta encoding of the col- umn indices performed by CSR-DU, in order to detect sequences of non-zero elements either continuous or separated by some constant delta distance. To detect non-horizontal substructures, we employ the notion of coordinate trans- formations to transform the elements of the matrix according to the desired it- eration order (e.g., vertical, diagonal, block-wise etc.) and then reuse the very same detection process that we use for the horizontal substructures. is tech- nique adds also to the design goal of the extendibility, since it suffices to dene a ‘1-1’ coordinate transformation in order to detect any substructure inside the matrix. e third goal is assured in SMP systems by the highly compressed rep- resentation of CSX, which leads to maximal compression ratios. For NUMA systems, we take particular care in optimizing the computations by generat- ing optimized code at the runtime and relaxing the compression scheme. e runtime code generation employed in CSX not only allows a high performance SpMV implementation, but also facilitates the task of uniformly supporting multiple substructures.

¨ ¨ © 5.2 CSX data structures © CSX replaces both the rowptr and colind arrays of the standard CSR with a single control byte-array containing all the required information, called ctl (Figure 5.2). Similar to CSR-DU, CSX divides the matrix into units. In CSX’s terminology, a unit represents a sequence of non-zero elements inside the sparse matrix, either a substructure (substructure unit) or a sequence of column index delta distances represented by the same number of bytes (delta unit). A CSX unit is comprised of two parts: the head and the body. e head contains a two- byte descriptor of the unit and its initial column index (ucol eld) encoded as a variable-size integer. e two byte descriptor stores a 6-bit ID of the unit (id eld) and its size in non-zero elements. e nr bit denotes the start of a new row. Finally, the body part is present only in delta units and stores the column index delta distances as xed-size integers. e rjmp bit and the ujmp eld serve a special purpose. Since CSX encodes non-horizontal substructures, it is possible for substructures to group together all the elements of subsequent rows. For example, the anti-diagonal substruc- ture in Figure 5.1c starting at position (1, 9) contains also the sole element of the second row, leaving it empty. ese rows must be skipped when comput- ing the SpMV. erefore, the rjmp bit denotes the existence of empty rows, while the ujmp eld stores the number of empty rows to skip in a variable-size integer.

86

¨ © ¨ thesis March 11, 2013 15:54 Page 87 ©

5.2. CSX data structures

.Head .Body ...... variable variable xed xed .CTL .1. 1 6 8 . .… int int {8,16,32} {8,16,32}

.nr .rjmp .id .size .ujmp .ucol .deltas Figure 5.2: e ctl byte-array used by CSX to encode the location information of the non-zero elements of a sparse matrix. Optional elds are denoted with a dotted bounding box.

.nr .rjmp .id .size .ucol ...... h(1) 1. 0 0 4 0 ...... ad(1) 0 0 1 4 5 ...... bc(4,2) 1 1 2 8 1 0 ...... v(1) 0 0 3 4 9 ...... d(2) 1 0 4 4 3 ...... bc(4,2) 1 1 2 8 2 2 ...... bc(3,2) 1 0 5 6 2 Figure 5.3: e resulting ctl sequence for the sample matrix of Figure 5.1c. e extra ujmp eld for row jumps, present only when the rjmp bit is set, is shaded. ¨ e column delta distances are calculated between the upper lemost el- ¨ © ements of substructures, with the exception of horizontal substructures, © from which the delta distances are calculated from their rightmost element (see ucol of the ad(1) substructure encoding). e total storage required for storing the column and row index information is only 23 bytes, while the colind and rowptr arrays of CSR would require 216 bytes.

Similar to CSR, CSX stores the non-zero elements of the matrix in a values array, but in a substructure row-wise order. For example, the substructures in Figure 5.1c will be stored in the following order: h(1), ad(1), bc(4,2), v(1), d(2), bc(4,2) and bc(3,2). Figure 5.3 shows the exact encoding of the ctl data structure for the sample matrix of Figure 5.1c.

Encoding variable-size integers in CSX In an attempt to be as compact as possible, the CSX format employs the use of variable-size integers to encode the initial column indices of its units and the row jumps. In this encoding, an arbitrary integer is stored in 7-bit chunks reserving the most signicant bit (bit 7) of each byte as a link between the differ- ent chunks (Figure 5.4). e gain in the total matrix size by the use of variable- size integers is 2–3% and has an almost direct impact on the performance of the SpMV kernel in symmetric shared memory systems, where the memory band-

87

¨ © ¨ thesis March 11, 2013 15:54 Page 88 ©

5. e Compressed Sparse eXtended Format

.b7 = 1

. . . . 7. 6 …0 7 6 …0 .…

.b7 = 1 Figure 5.4: e variable-size integer encoding employed by CSX. Only the seven lower bits are used for storing an integer; numbers larger 127 are stored in 7-bit pieces with bit 8 serving as a continuation marker.

width saturation leaves enough space for the additional computational burden of the decoding process.

5.3 Detection and encoding of substructures

5.3.1 Mining the matrix for substructures CSX can detect a variety of non-horizontal substructures with the use of co- ordinate transformations on the non-zero elements of the matrix. To facilitate the detection process, CSX uses a special internal representation for the sparse ¨ ¨ matrix, which is a hybrid of CSR and a generic version of the COO format. © © Instead of simple non-zero elements, the CSX’s internal representation stores generic elements; a generic element is either a single non-zero element or a sub- structure. Similarly to the COO format, each generic element is stored as an (i, j, v, t) tuple, lexicographically sorted on the (i, j). In the case of a substruc- ture, (i, j) represents the coordinate of the rst element in the substructure, v is an array of its elements and t is its type. We also keep the CSR’s row pointers for fast row accessing. is internal representation is constructed once during the ‘loading’ of the original matrix, either from CSR or from the disk. CSX detects substructures by applying run-length encoding on the column indices of the matrix. e run-length encoding rst computes the delta dis- tances of the column indices and then assembles groups, called runs, from the same distance values (Figure 5.5, Algorithm 5.1). Each run is identied by the common delta value and its length. Runs with length greater or equal to two form a substructure. Nonetheless, we impose a restriction on the minimum length of a run (currently set to four) to avoid a proliferation of very small runs that could be more of an overhead than a benet. ere is also another subtle issue when mapping the detected runs into CSX units: all runs—except those that start at the beginning of a row—miss the rst element of the real substructure. For example, in Figure 5.5, the length of the real units are 5 and 4, instead of the detected runs of 4 and 3, which miss the 10 and 21 column in- dices, respectively. is can be crucial for the detection of 2-D substructures, where additional alignment limitations exist. In our actual implementation,

88

¨ © ¨ thesis March 11, 2013 15:54 Page 89 ©

5.3. Detection and encoding of substructures

.col. indices:. .1. .10. ..11. .12. ..13. .14. .21. ..41. .61. .81. .…. ..delta values:. .1. .9. .1. .1. .1. .1. .7. .20. .20. .20. .…. .d = 1 .d = 20 Figure 5.5: Example of the run-length encoding of the column indices. Note the missed element at the le of each run. is inherent issue of run-length encoding is xed in the CSX implementation and allows an even higher compression of the input matrix.

1: function RLE(colind::in) colind: sorted column indices

2: deltas ← DE(colind) 3: d ← deltas[0] ◃ current delta value 4: l ← 1 ◃ current sequence length 5: rle ← {(d, l)} ◃ set of column index runs 6: for i ← 1 to N do 7: if deltas[i] = d then 8: l ← l + 1 9: else ¨ ¨ 10: if l ≥ min_run_length then © © 11: rle ← rle ∪ (d, l) 12: d ← deltas[i] 13: l ← 1 14: end for 15: return rle

Algorithm 5.1: Run-length encoding of the column indices. is is the simple im- plementation without the x of CSX (see text). e DE() function performs a delta encoding of the column indices and returns the sequence of delta values.

we x this issue and also split large runs into 255-element chunks, so as to t in the one-byte size eld of a CSX unit. e process of scanning the whole matrix for a specic type of substruc- tures is described in Algorithm 5.2. More specically, we iterate over the rows of the matrix and for each row we gather all the column indices that are not yet part of a substructure and apply run-length encoding. For all the detected substructure instantiations (different delta values), we keep statistics (number of units and number of non-zero elements covered) to guide us on the nal selection of the substructures to encode in CSX.

89

¨ © ¨ thesis March 11, 2013 15:54 Page 90 ©

5. e Compressed Sparse eXtended Format

1: procedure DS(matrix::in, stats::inout) matrix: CSX’s internal repr., lexicographically sorted stats: substructure statistics

2: colind ← ∅ ◃ Column indices to encode 3: foreach row in matrix do 4: foreach generic element e(i, j, v, t) in row do 5: if e.t = NONE then ◃ e is not a substructure 6: colind ← colind ∪ e.j 7: continue 8: enc ← RLE(colind) 9: CS(stats, enc) ◃ Collect statistics from this encoding 10: colind ← ∅ 11: end for 12: enc ← RLE(colind) 13: CS(stats, enc) ◃ Collect statistics from this encoding 14: colind ← ∅ 15: end for

¨ ¨ © Algorithm 5.2: Detecting substructures in CSX’s internal representation. e C- © S() routine collects and keeps track of the statistics for each substructure type and its instantiations. e information gathered in stats will be used later for guiding the selection of the most suitable substructure for encoding.

Detecting non-horizontal substructures

In our previous discussion on the detection process, we have not mentioned about horizontal substructures specically. Indeed, the detection process ‘sees’ just column indices and the only requirement is that these indices must be sorted. Detecting horizontal substructures is therefore straightforward, since the matrix elements are arranged in row-wise (horizontal) order and are lexi- cographically sorted by default. To detect non-horizontal substructures then, it suffices to transform the matrix elements into the desired iteration order, sort them lexicographically and use the DS() procedure described in Algorithm 5.2. e case of one-dimensional, non-horizontal substructures is straightforward and Table 5.2 shows the exact coordinate transformations that we use for their detection. e case of 2-D substructures, though, is a bit more complex, since we need to ‘linearize’ the coordinates of the elements. Unfortunately, a simple space-lling curve iterating over the non-zero ele- ments is not an option for a sparse matrix for two major reasons: rst, such a

90

¨ © ¨ thesis March 11, 2013 15:54 Page 91 ©

5.3. Detection and encoding of substructures

Substructure type Transformation ′ ′ Horizontal (i , j ) = (i, j) ′ ′ Vertical (i , j ) = (j, i) ′ ′ Diagonal (i , j ) ={ (N + j − i, min(i, j)) ′ ′ (i + j − 1, i), i + j − 1 ≥ N Anti-diagonal (i , j ) = (i + j − 1,N + 1 − j), i + j − 1 > N ′ ′ ⌊ i−1 ⌋ − − Block (row aligned) (i , j ) = ( r + 1, mod(i 1, r) + r(j 1) + 1) ′ ′ ⌊ j−1 ⌋ − − Block (column aligned) (i , j ) = ( c + 1, c(i 1) + mod(j 1, c)) Table 5.2: e coordinate transformations applied by CSX on the matrix elements for enabling the detection of non-horizontal substructures (one-based index- ing assumed).

curve would require an Θ(N2) linear index space, which is far larger than the Θ(NNZ) original column index space, and, second, it would imply xed size, strictly aligned blocks, an option we wanted to avoid in the rst place. e approach we take for the 2-D substructures in CSX is depicted in Figure 5.6. We segment the matrix into xed-width bands, either row- or column-aligned, and apply a space-lling transformation inside every band, passing the result- ¨ ¨ © ing sequence to the DS() procedure. However, care must be taken © during the detection, since now not all detected units are valid blocks; for a unit to be valid, it must begin at a column index (in the transformed space), which is a multiple of the band width. Using this segmentation technique, we have managed to detect loosely aligned, variable-sized blocks, further increasing the compression potential of CSX. We support two categories of block substruc- ture types, namely row-aligned and column-aligned blocks, and each category denes seven different substructure types for different widths of the segmen- tation bands (r, c ∈ [2, 8] in Table 5.2). In total, the use of coordinate transfor- mations has allowed CSX to support seamlessly 18 different substructure types and enables the straightforward expansion to other substructure families, e.g., diagonal blocks. Figure 5.7 shows the substructures detected by CSX in our matrix suite and is very characteristic of the capability of CSX in detecting a variety of substruc- tures inside a sparse matrix. It is very interesting that CSX was able to detect and encode even ‘weird’ substructures, such as the diagonal ones with delta distances of 857 and 1714 in the pre2 matrix. In matrices dominated by dense blocks, CSX was able to detect large blocks in signicant amounts, as is the case of the inline_1 and af_k_101 matrices. e detection of large dense blocks not only has a positive impact on the overall compression ratio, but also provides a performance advantage in the SpMV computations. Finally, the substruc-

91

¨ © ¨ thesis March 11, 2013 15:54 Page 92 ©

5. e Compressed Sparse eXtended Format

.(1, 3)

.(1, 1) . . . . .(1, 1) ...... … ......

.(3, 1) ...... … . .… ...... … . … . … . .

(a) Row-aligned block detection. (b) Column-aligned block detection. Figure 5.6: Detection of 2-D substructures in CSX (black dots denote non-zero el- ements). e matrix is segmented in row- or column-boundaries and a space-lling transformation is applied inside every segment. Simple run- length encoding of the resulting sequence suffices to detect variable-size, loosely aligned blocks (care must be taken only for truncating unnecessary elements at the beginning and end of the sequence).

¨ ¨ © ture map of a sparse matrix produced by CSX can be very revealing for the © specic performance behavior of SpMV. For example, the matrices Freescale1, parabolic_fem and offshore are dominated by delta units with very large delta distances meaning that they are quite sparse and rather random. Indeed, their performance is poor using any format, suffering from irregular accesses in the input vector and load imbalances.

Substructure types and their instantiations In CSX we make a distinction between a substructure type and its instantia- tion. In CSX’s terminology, a substructure instantiation denotes how exactly a substructure type is encountered inside the sparse matrix. For example, the horizontal substructures with d = 1 and d = 20 of Figure 5.5 belong to the same substructure type (horizontal), but are different instantiations. Similarly, the blocks 2 × 10 and 2 × 20 are instantiations of the same block, row-aligned, r = 2 type. A substructure type, therefore, may have indenitely many in- stantiations in a sparse matrix, which adds great exibility to the CSX format, allowing it to detect almost every non-zero pattern inside a sparse matrix. e 6-bit id eld in the ctl structure of CSX (Figure 5.2) refers to the exact sub- structure instantiation being encoded by the related unit; it does not refer to an abstract substructure type. is has the downside of limiting the total number of substructure instantiations in a sparse matrix to 64, but, in practice, only 4–5 instantiations are selected for encoding (see Figure 5.7).

92

¨ © ¨ thesis March 11, 2013 15:54 Page 93 ©

5.3. Detection and encoding of substructures

100 d8 br(2,9) d16 br(2,12) d32 br(2,18) 80 h(1) br(3,3) h(2) br(3,6) v(1) br(3,9) v(2) br(3,12) 60 d(1) br(3,15) d(3) br(3,18) d(11) br(4,4) 40 d(857) br(5,5) d(1714) br(5,15) ad(1) br(7,7) 20 br(2,2) br(7,14) br(2,3) br(7,21)

Non-zero elements coverage (%) br(2,4) bc(3,2) 0 br(2,6) F1 pre2 m_t1 ldoor hood nd12k torso3 ohne2 rajat31 consph xenon2 cage13 inline_1 offshore Hamrle3 thermal2 boneS10 G3_circuit kkt_power bmw7st_1 bmwcra_1 af_5_k101 atmosmodj Freescale1 ASIC_680k crankseg_2 Chebyshev4 parabolic_fem Ga41As41H72 TSOPF_RS_b2383

Figure 5.7: e substructures detected and encoded by CSX in a diverse set of sparse matrices. [Substructure legend – dXX: delta units, h(X): horizontal, v(X): vertical, d(X): diagonal, ad(X): anti-diagonal, br(X,X): block row-aligned, bc(X,X): block column-aligned; numbers inside parentheses are the delta dis- ¨ tances or the block dimensions; numbers in delta units denote their represen- ¨ © tation size in bits.] ©

5.3.2 Selecting substructures for nal encoding e full detection process and the selection of the substructures for the nal encoding is described in Algorithm 5.3, which is a typical local search opti- mization algorithm. More specically, for each available substructure type, we transform the matrix to the corresponding iteration order and scan it, collect- ing statistics for the examined substructure type. Aer having collected statis- tics for all the available substructure types, we lter out all the instantiations that fail to surpass a certain threshold in encoding the non-zero elements of the matrix (in our current implementation, this is set to 5% of the total non-zero elements) and proceed with the selection of the most suitable type for encod- ing the matrix (ST()). e algorithm then proceeds by encoding the matrix with the selected substructure type (ES()) and repeating the same procedure until no type can be selected. e criterion we use for selecting the substructures to encode is a rough estimate of the achieved reduction in the size of the original CSR’s colind structure. Assuming that we keep a single, full column index per detected sub- structure (ignoring delta units), the size of the ctl structure will be encodedz }| { z unencoded}| { − Sctl = Nunits + NNZ NNZenc, (5.1)

93

¨ © ¨ thesis March 11, 2013 15:54 Page 94 ©

5. e Compressed Sparse eXtended Format

1: procedure EM(matrix::inout) matrix: CSX’s internal matrix in row-wise order

2: repeat 3: stats ← ∅ 4: for all available substructure types t do 5: T(matrix, t) 6: LS(matrix) 7: DS(matrix, stats) − 8: T 1(matrix, t) 9: end for 10: FS(stats) ◃ Filter out instantiations that encode less than 5% of the non-zero elements 11: s ← ST(stats) 12: if s ≠ NONE then 13: T(matrix, s) 14: LS(matrix) 15: ES(matrix) ◃ Encode the selected substructure 16: until s = NONE ¨ ¨ © © Algorithm 5.3: Detection, selection and encoding of the substructures in CSX. e process is divided into two phases: (a) gathering of statistics and (b) selection and encoding. Each time the matrix is transformed (T()) to a specic iteration order (see Table 5.2), a lexico- graphic sort (LS()) of the non-zero elements is needed. e T−1() function transforms back the matrix into the orig- inal, horizontal iteration order.

where Nunits is the total number of encoded substructure units and NNZenc is the number of the non-zero elements encoded by the substructure type. ere- fore, the gain in the CSR’s colind size will be roughly

− − G = NNZ Sctl = NNZenc Nunits, (5.2)

which is the metric that a substructure type must maximize in order to be se- lected for encoding.

5.3.3 Building the CSX data structures e nal step in the CSX matrix construction is the build of the actual CSX data structures, namely the ctl and values arrays (see Section 5.2). e process,

94

¨ © ¨ thesis March 11, 2013 15:54 Page 95 ©

5.4. Generating the SpMV code

depicted in Algorithm 5.4, is straightforward. We iterate the encoded inter- nal matrix representation and for each encountered unit, either substructure or delta, we emit the appropriate elds of ctl. e delta units are constructed from stray elements not belonging to a specic substructure. ese elements are collected during the iteration of the encoded internal representation and are “dumped” to the ctl and values arrays when either a new substructure is encountered or a new row starts. During construction, we also keep track of empty rows (not shown in Algorithm 5.4 for simplicity), in order to set appro- priately the rjmp and ujmp elds of ctl, while at the start of each non-empty row we set the new row bit for the next encoded unit. Responsible for constructing the ctl elds and emitting the non-zero val- ues from the CSX units of the internal matrix representation is the set of the E*() functions. e EV() function is quite straightforward: it simply copies the non-zero values of its generic element(s) argument to the values array of the nal CSX matrix structure. e ES() and EDU() functions perform similar operations: they both ll the id, size and ucol elds of the ctl byte array (see Figure 5.2), whereas EDU() emits also the delta values of the encoded delta unit. ese functions take also care for setting unique IDs for every new substructure or ¨ ¨ delta unit instantiation encountered. © ©

5.4 Generating the SpMV code

e indenitely large and a priori unknown number of substructure instanti- ations inside a sparse matrix dictates the use of runtime code generation for the substructure-specic SpMV routines. We have built a new Just-In-Time (JIT) compilation framework for CSX based on Clang and LLVM [Lattner and Adve, 2004; Lattner, 2011]. LLVM is a low-level compiler infrastructure that provides a collection of modular and reusable compiler and toolchain tech- nologies. Clang is a C language family front-end for the LLVM infrastruc- ture. Its main purpose is to parse normal C/C++ source code, convert it to the LLVM’s Intermediate Representation (IR) and pass it to the LLVM back- end for the generation of the nal native code. e great advantage of Clang and LLVM is that they provide a very rich programming interface that allows the development of efficient JIT code and its easy integration into an existing application. In CSX, we generate simple C code for the substructure-specic SpMV rou- tines and use Clang to generate optimized LLVM IR. e overhead of parsing the resulting C source (one translation unit, less than 150 l.o.c.) and generat- ing the optimized LLVM IR is negligible compared to the matrix preprocessing

95

¨ © ¨ thesis March 11, 2013 15:54 Page 96 ©

5. e Compressed Sparse eXtended Format

1: function CC(matrix::in) matrix: CSX’s internal encoded matrix representation

2: ctl ← ∅ ◃ CSX’s control byte sequence 3: values ← ∅ ◃ CSX’s non-zero values 4: stray ← ∅ ◃ Stray elements, not in substructures 5: foreach row in matrix do 6: ctl ← ctl ⊕ ENR() ◃ Set new row bit 7: foreach generic element e(i, j, v, t) in row do 8: if e.t ≠ NONE then 9: ED(stray, ctl, values) ◃ Dump and empty stray elements 10: ctl ← ctl ⊕ ES(e) 11: values ← values ⊕ EV(e) 12: else 13: stray ← stray ∪ e 14: end for 15: ED(stray, ctl, values) ◃ Dump and empty stray elements 16: end for ¨ 17: ED(stray, ctl, values) ◃ Dump and empty stray elements ¨ © 18: return (ctl, values) ©

Algorithm 5.4: Construction of the CSX data structures. e nal matrix is built in- crementally by iterating over the encoded internal representation. e E*() routines are responsible for emitting the actual ctl elds and the non-zero values of the encoded substructure. Stray elements form delta units (see Algorithm 5.5 for more details on the ED() procedure). e ⊕ operator denotes sequence concatenation. Treat- ment of empty rows is not shown for simplicity.

An alternative strategy would be to generate the LLVM IR directly through the related library interface. However, this is a tedious and error-prone task (also in terms of performance), since not only is a good understanding of the intermediate representation required, but the actual SpMV code is also 'obfuscated' by the complex calls needed to construct the IR.

Figure 5.8 shows how the just-in-time compilation is organized in CSX. CSX maintains a directory of C source templates, which define a set of text hooks to be filled in at run time.

Not to be confused with the C++ templates.


1: procedure ED(elems::inout, ctl::inout, values::inout) elems: set of generic elements ctl: CSX’s ctl structure values: CSX’s non-zero values

2: if elems ≠ ∅ then 3: ctl ← ctl ⊕ EDU(elems) 4: values ← values ⊕ EV(elems) 5: elems ← ∅

Algorithm 5.5: Helper procedure for emitting delta units.

Figure 5.8: The just-in-time compilation framework of CSX. The encoded matrix is fed to the SpMV code generator, which produces C source from the SpMV source templates; the Clang front-end emits an LLVM module, and the execution engine returns a function pointer to the native SpMV code that is eventually called.

For each substructure type we maintain a source template for performing the SpMV computation inside the substructure, and we also have a 'top-level' template that is responsible for the execution of the full SpMV kernel in the CSX format. After we have selected the substructures for encoding and have generated CSX's data structures, we pick the suitable templates, generate the corresponding SpMV C code and pass it to the Clang front-end. We program the front-end to emit an optimized LLVM IR module, from which we get a function pointer to the generated SpMV kernel; eventually, the LLVM back-end takes over to generate the final native code for the host machine.

For block substructures we maintain just two generic templates: one for row-aligned blocks and one for column-aligned ones.
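To make the role of the source templates more concrete, the following is a minimal sketch of what a generated routine for a horizontal substructure with a hardcoded delta distance of 4 could look like once the text hooks have been filled in; the function name, parameter list and calling convention are hypothetical and do not reproduce the actual CSX templates.

```c
/* Hypothetical output of the code generator for a horizontal substructure
 * with a hardcoded delta distance of 4; the real CSX templates and their
 * exact calling convention are not shown in the text. */
static void spmv_horiz_delta4(const double **values, const double *x,
                              unsigned long *xcurr, double *yr,
                              unsigned long size)
{
    const double *v = *values;
    unsigned long col = *xcurr;

    for (unsigned long i = 0; i < size; i++) {
        *yr += v[i] * x[col];   /* accumulate into the local row result */
        col += 4;               /* delta distance hardcoded at generation time */
    }

    *values += size;            /* advance the values cursor past this unit */
    *xcurr = col;               /* leave the column cursor after the unit */
}
```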


The SpMV kernel that we execute for CSX is shown in Algorithm 5.6. First, we should note that in CSX we must explicitly zero out the output vector (lines 4–5), an artifact prescribed by the use of multiple non-horizontal substructures. However, the effect of this operation is very small, especially in multithreaded configurations, where the initialization is performed in parallel. The algorithm iterates over the ctl structure, decoding one field at a time. If a new row is detected, we store the computed dot product (yr) in the output vector and advance the current position (ycurr) with the ____() hook. If the matrix does not contain row jumps, the generated code simply advances ycurr by one; otherwise, it checks the rjmp bit and, if there is a jump, advances ycurr by the value of the variable-size integer containing the jump size. The algorithm proceeds by decoding the variable-size integer containing the delta distance from the previous column index (DC()) and computes the starting column index (xcurr) of the current CSX unit. It then retrieves the ID of the unit and executes the unit-specific SpMV code within the ___() hook, which is actually replaced by a C switch statement, switching on the id variable. The unit-specific SpMV routines are responsible for correctly advancing the current position in the ctl structure and the current column index, as well as for updating the local accumulator yr, if necessary.

1: procedure CS(ctl::in, values::in, x::in, y::out) ← 2: ycurr y ◃ Current position in y vector 3: xcurr ← x ◃ Current position in x vector ← ¨ 4: for i 0 to N do ◃ We must zero-out the output vector ¨ ← © 5: ycurr[i] 0 © 6: end for 7: yr ← 0 ◃ Local accumulator 8: repeat 9: flags ← ∗ctl 10: size ← ∗(ctl + 1) 11: ctl ← ctl + 2 12: if TB(flags, 7) then ◃ Check if nr bit is set ∗ ← ∗ 13: ycurr ycurr + yr 14: yr ← 0

15: ____() ◃ Advances ycurr 16: xcurr ← x 17: xcurr ← xcurr+ DC(ctl) 18: id ← GID(flags) ◃ Retrieve the ID of the next unit 19: ___() ◃ Unit-specic SpMV code 20: until ctl ends

Algorithm 5.6: e SpMV kernel template used the CSX storage format. e hooks are lled in during the runtime. e ‘∗’ is a dereference operator.
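The byte-level format of CSX's variable-size integers is not reproduced in this section; the sketch below assumes a common 7-bits-per-byte encoding with a continuation flag in the top bit, which is consistent with the observation in Section 5.6.1 that delta distances below 128 are trivial to decode while larger ones require multi-byte bit operations. The function name decode_varint merely stands in for the DC() hook of Algorithm 5.6.

```c
/* Sketch of a variable-size integer decoder of the kind DC() performs.
 * Assumes a little-endian, 7-bits-per-byte encoding whose high bit marks
 * a continuation byte; the actual CSX encoding may differ in details. */
static unsigned long decode_varint(const unsigned char **ctl)
{
    const unsigned char *p = *ctl;
    unsigned long value = 0;
    unsigned int shift = 0;

    do {
        value |= (unsigned long)(*p & 0x7f) << shift;
        shift += 7;
    } while (*p++ & 0x80);      /* continue while the high bit is set */

    *ctl = p;                   /* advance the ctl cursor past the integer */
    return value;
}
```

Values below 128 take the single-byte fast path of the loop, which is why small column deltas decode cheaply.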

Since we encode specific substructure instantiations in CSX, we know at compilation time the exact delta distances of the one-dimensional substructures and the exact block dimensions of the 2-D ones.


This allows us to generate efficient code with hardcoded constant delta distances and constant block dimensions, matching the computational advantage of the fixed-size blocks of BCSR. Additionally, if only one substructure instantiation is encoded, we optimize out the switch statement completely.

Parallelization

CSX follows the parallelization pattern of CSR. During the construction of CSX's internal representation, we split the input matrix row-wise, maintaining roughly the same number of non-zero elements per partition. From this point on, the detection and encoding phases of CSX proceed independently, producing a different final CSX submatrix per partition. The SpMV kernel of Algorithm 5.6 changes only in line 2, where we initialize ycurr to the beginning of the partition, and in line 4, where we iterate only within the bounds of the partition.
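A minimal sketch of such a row-wise split follows, assuming a CSR-like rowptr array and illustrative start/end arrays (the actual CSX partitioning code is not shown in the text):

```c
/* Split 'nrows' rows into 'p' partitions with roughly equal numbers of
 * non-zero elements, using the CSR rowptr array; start[i]/end[i] bound
 * the rows assigned to thread i. */
static void split_rows(const int *rowptr, int nrows, int p,
                       int *start, int *end)
{
    int nnz = rowptr[nrows];
    int row = 0;

    for (int i = 0; i < p; i++) {
        /* target boundary expressed in accumulated non-zeros */
        int target = (int)(((long long)nnz * (i + 1)) / p);

        start[i] = row;
        while (row < nrows && rowptr[row + 1] <= target)
            row++;
        end[i] = row;
    }
    end[p - 1] = nrows;     /* make sure the last partition is closed */
}
```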

5.5 Tackling the preprocessing cost

The preprocessing time of CSX is bounded by the multiple calls to the LS() routine, which performs a lexicographic sort on the transformed matrix elements during the scanning for substructures (Algorithm 5.3, lines 4–9). We address this issue with a combination of partial sorting and sampling. More specifically, instead of scanning the matrix as a whole, we split it into constant-size preprocessing windows based on the non-zero elements and scan every window separately. This modification automatically reduces the asymptotic complexity of the scanning phase from Θ(NNZ lg NNZ) to Θ(NNZ), at the expense of missing the substructures that cross the boundaries of the windows. To further reduce the preprocessing cost, we examine only a certain number of windows uniformly distributed over the whole matrix, covering only a small amount of the total non-zero elements. In our experiments, sampling a mere 1% of the matrix non-zero elements using 48 sampling windows reached the same SpMV performance levels as the full-fledged preprocessing at nearly an order of magnitude less preprocessing time.

Having minimized the substructure scanning phase, the preprocessing time is now bounded by the final encoding of the matrix (Algorithm 5.3, lines 13–15), which is still in the order of Θ(NNZ lg NNZ). In this phase, unfortunately, partitioning and, especially, sampling cannot be applied, since we need to encode the whole matrix after all.


Nonetheless, the impact of this phase on the overall preprocessing cost is not so crucial, for two main reasons: in most of the cases a very small number of substructure types will finally be encoded, and the full cost of the sorting will be paid only during the encoding of the first substructure type, since each time we sort only the unencoded elements. In addition to the use of sampling and preprocessing windows, we further reduce the preprocessing cost by parallelizing the whole preprocessing. The parallelization is straightforward: after we have loaded the initial CSR matrix, we set up a new thread for every partition, which proceeds independently with the scanning and the construction of the final CSX matrix. Finally, we take particular care with memory management during the preprocessing, by completely avoiding memory reallocations that would result in huge memory copies. For this reason, we try to infer the exact allocation requirements of a data structure right from the beginning or, in case this is not possible, we are generous with the initial allocation and then truncate the extra space. This optimization has led to a considerable acceleration of the preprocessing phase, reaching a 2× to 3× speedup. The preprocessing cost of CSX for detecting all the supported substructure types is in the order of 100 serial CSR SpMV operations, rendering it viable even for online processing of the input matrix.
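A minimal sketch of how the sampling windows described above could be selected follows, assuming windows are expressed as offsets into the stream of non-zero elements; the function and parameter names are illustrative and the real CSX bookkeeping is more involved.

```c
#include <stddef.h>

/* Select 'nwindows' uniformly distributed sampling windows that together
 * cover 'fraction' of the NNZ non-zero elements. */
static void sampling_windows(size_t nnz, int nwindows, double fraction,
                             size_t *win_start, size_t *win_size)
{
    size_t total_sampled = (size_t)(fraction * nnz);
    size_t size = total_sampled / nwindows;     /* constant window size */
    size_t stride = nnz / nwindows;             /* uniform distribution */

    for (int i = 0; i < nwindows; i++) {
        win_start[i] = (size_t)i * stride;      /* window offset in non-zeros */
        win_size[i] = size;
    }
}
```

Calling this with nwindows = 48 and fraction = 0.01 reproduces the sampling configuration used in the experiments above.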

5.6 Porting to NUMA architectures

As discussed in Chapter 3, the key factor for the high performance of SpMV on NUMA architectures is the correct placement of the involved data on the system's memory nodes. The data each thread accesses must lie on its local memory node, in order to avoid both the saturation of the interprocessor links and the increased latency incurred by the multiple hops required for a remote access. A thread in the multithreaded SpMV kernel accesses the matrix data structures of its own matrix partition, the corresponding part of the output vector and arbitrary parts of the input vector. The correct placement of the matrix partitions in CSX is straightforward: since the construction of the CSX matrix happens independently for every partition, we just need to make sure that CSX's data structures are allocated on the correct node, using calls to a NUMA-aware memory allocator, e.g., numa_alloc_onnode() of the Linux numactl library. For the output vector, we use our low-level interleaved NUMA-aware allocator described in Chapter 3, in order to place the different parts of the output vector correctly. Finally, for the sake of our experimentation, we place a copy of the input vector on every memory node. This arrangement provides a better balancing of the overall memory operations and further exposes the computational part of the kernel.


Although this choice is less reasonable in practical applications, the impact on performance of using a shared input vector copy is minimal on average (see Chapter 3, Figure 3.13).
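A minimal sketch of a per-partition, NUMA-aware allocation with libnuma (the library behind the numactl tools mentioned above) is shown below; the csx_part structure and its fields are illustrative and do not correspond to the actual CSX data structures.

```c
#include <stddef.h>
#include <numa.h>

/* Illustrative per-partition container: a ctl byte sequence and the
 * partition's non-zero values. */
struct csx_part {
    unsigned char *ctl;
    double *values;
};

/* Place a thread's partition data on its local memory node. */
static int alloc_partition_on_node(struct csx_part *part, size_t ctl_bytes,
                                   size_t nnz, int node)
{
    if (numa_available() < 0)
        return -1;      /* no NUMA support on this system */

    part->ctl = numa_alloc_onnode(ctl_bytes, node);
    part->values = numa_alloc_onnode(nnz * sizeof(*part->values), node);
    if (!part->ctl || !part->values)
        return -1;
    return 0;
}
```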

5.6.1 Optimizing the computations

Modern NUMA architectures have an important side-effect on the execution of the SpMV kernel: the increased memory bandwidth when accessing a local memory node further exposes the computational part of the kernel, which is no longer negligible. CSX performs some heavy decompression operations when decoding the variable-size integers (see Figure 5.4) used to store the column delta distances and the row jumps. The case of row jumps is not of considerable concern, since we do not generate the decoding code at all for matrices that do not have them, and in case they do, the decoding is not on the critical path. What does stay on the critical path, though, is the decoding of the column delta distance before computing every substructure (Algorithm 5.6, line 17). The decoding is trivial if the delta distance is less than 128, but it demands heavy bit operations if it is larger. Unfortunately, when encoding substructures (and not only delta units), there is a proliferation of multi-byte column delta distances, whose decoding can considerably hinder the overall SpMV performance. For this reason, we replace the column delta distance with the full starting column index of the substructure, stored in a standard 32-bit integer. This optimization degrades the overall compression ratio by 2–3%, but the gain in performance can exceed 15% in certain matrices.

The proliferation of delta units when encoding multiple substructures can also overwhelm the benefit of compression in NUMA architectures, due to the increased performance overhead of the switch statement (branch mispredictions). For this reason, we revise our score function (equation (5.2)) for NUMA architectures to include an estimate of the total number of switch statements executed, as follows:

G = NNZ_enc − N_units − N_switch,    (5.3)

where N_switch = N_units + N_deltas, with N_deltas being the total number of delta units generated. This heuristic better balances the tradeoff between the memory and the computational part of the kernel by penalizing the encodings that lead to a large computational cost. We have measured more than 20% performance improvement for certain matrices in NUMA-aware thread configurations with the revised heuristic.

The case of using only delta units is not critical, because we encode the full matrix row as a single delta unit.
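As a small worked example, the revised score of equation (5.3) could be computed as follows; the variable names are illustrative only.

```c
/* NUMA-revised score of a candidate encoding, per equation (5.3):
 * reward covered non-zeros, penalize the number of units and the
 * switch statements (units plus delta units) they imply. */
static long score_numa(long nnz_encoded, long nunits, long ndeltas)
{
    long nswitch = nunits + ndeltas;    /* switch statements executed */
    return nnz_encoded - nunits - nswitch;
}
```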


5.7 Evaluating the performance of CSX

We evaluate the performance of the CSX storage format using the matrix suite and the hardware platforms detailed in Chapter 3, Section 3.2. As alternative storage formats, besides CSR, we consider BCSR and VBL, since these are the most established paradigms of CSR alternatives and the best representatives of the fixed and variable size blocking storage formats, respectively. We have implemented optimized versions of all these formats, including NUMA-aware implementations. Specifically for BCSR, we have implemented block-specific, optimized SpMV routines for all the block sizes (one- and two-dimensional) up to size eight, plus the 3 × 3 block. The results reported for BCSR correspond to the best performing block, which was obtained after an exhaustive search of the 20 available blocks. In practice, where a heuristic will most probably be used (see Chapter 4, Section 4.3), the real performance of BCSR might be lower. For the parallelization of the SpMV routines, we use a static, row-wise partitioning scheme based on the non-zero elements of the matrix. Specifically for BCSR, we partition the input matrix after we have built it, taking into account the zero-padding as well, in order to achieve a better load balance. We use 32-bit integers for the indexing structures of all the storage formats and 64-bit, double-precision floating point values for the non-zero elements. In the case of VBL, we use one-byte block size fields.

Finally, we used LLVM 2.9 for the compilation of the SpMV routines of all the considered formats, in order to achieve a fair comparison. We should note here that, beyond our initial expectations, LLVM 2.9 offered an average 5% performance improvement for the non-CSX⁴ formats compared to GCC 4.5. For the parallelization of the SpMV routines and the preprocessing phase of CSX, we used explicit, native threading with the Pthreads library (NPTL 2.7) and bound the threads to specific logical processors using the Linux kernel's system call interface. We follow a 'share-all' policy for the assignment of threads to logical processors by default (see Chapter 3, Section 3.2.3 for more details on alternative core-filling strategies).
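For reference, a minimal sketch of binding the calling thread to a single logical processor through the Linux affinity system call, in the spirit of the setup described above, is given below; error handling is omitted for brevity.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to one logical processor. */
static int bind_to_cpu(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* A pid of 0 applies the mask to the calling thread. */
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```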

5.7.1 CSX compression potential

The compression potential of CSX is favored not only by its ability to detect multiple types of substructures, which allows it to keep only a single column index per substructure, but also by the very compact representation of the non-zero element metadata. The column index information is further condensed with the use of delta indexing and variable-size integers, while the row information is now represented by a single bit per row.

⁴ CSX routines are generated using LLVM by default.



Figure 5.9: e compression potential of CSX. CSX manages to adapt successfully to the matrix structure and provides the best compression ratio among the alternative storage formats considered. Maximum compression ratios are calculated by assuming storage of the non-zero elements only, i.e., a matrix representation size of 8NNZ bytes (double-precision oating point values). ¨ ¨ © the representation size, even for matrices with an irregular structure, where no © substructures are encoded. is is a key advantage of CSX compared to other alternative storage formats, especially in SMP systems. Figure 5.9 shows the maximum compression ratio achieved by CSX com- pared to VBL and BCSR. e absolute maximum compression ratio (assum- ing only the non-zero elements are stored) is also reported. CSX achieves al- ways the best compression ratio among the storage formats considered, while its ability to detect and encode multiple different substructures allows it to adapt to the specicities of each matrix. For example, it is able to surpass the compression ratio of BCSR in block-dominated matrices, such as xenon2, TSOPF_RS_b2383, consph, m_t1 etc., and also VBL’s in less regular ones, such as Freescale1 and offshore. In matrices dominated by diagonal elements, e.g., torso3, cage13, atmosmodj etc., the predominance of CSX is clear: it manages a more than 20% compression ratio, while BCSR augments the matrix size and VBL stays close to the original CSR’s size. e most characteristic example of this case is the atmosmodj matrix, where 99.8% of its non-zero elements are encoded in diagonal patterns by CSX, reaching the maximum possible com- pression (36.4%)⁵. At the same time, VBL manages a mere 3.4% compression, while BCSR increases the matrix size by nearly 60%!

⁵ CSX's actual compression ratio for this matrix is 36.1%.


5.7.2 CSX performance

SMP architectures

The large compression capability of CSX provides a strong performance potential in matrices arising from PDE problems, where the key performance bottleneck, especially in SMP systems, is the lack of adequate memory bandwidth. Figure 5.10 shows the speedups achieved by all the considered methods on our two symmetric shared memory platforms, Harpertown and Dunnington. The effect of compression is dominant in these architectures, especially in the multithreaded configurations. CSX and VBL take the lead with an average 26.4% and 18.5% performance improvement over CSR, respectively, in the eight-threaded configuration in Harpertown, while BCSR gains a mere 4.1%. The picture is similar in the 24-thread configuration in Dunnington: CSX offers a 61% performance improvement over CSR on average, VBL follows further behind with a 28.8% improvement, while BCSR achieves only a 6.3% improvement. BCSR suffers from its inherently low compression capability and the use of padding to construct full blocks. It can be classified as an 'all-or-nothing' storage format, since in almost half of the matrices of our suite it degraded SpMV performance by more than 30% on average, while for the other half it provided a more than 25% performance improvement.

The higher compression potential of VBL and CSX is clearly reflected in the speedup diagrams for our two symmetric shared memory architectures. The ability of CSX to detect and encode multiple types of substructures, and especially blocks, is a significant advantage over VBL, not only in terms of pure compression ratio, but also in terms of the involved SpMV computations, since CSX keeps the computational advantage of BCSR's fixed-size blocks, thanks to the runtime code generation, without paying any padding overhead. Indeed, like BCSR, CSX matches the average performance of CSR in Harpertown using one thread, despite the additional overhead of zeroing the output vector in every iteration. In Dunnington, CSX starts off with a significant 16.1% performance advantage over CSR right from the single-threaded configuration. We should note here the steep increase of speedup in Dunnington for 12 and 24 threads; this is due to the linear increase of the available memory bandwidth as we add more CPU sockets (i.e., front-end bus controllers) to the computation (see Chapter 3, Section 3.3.2), owing to the 'share-all' core-filling strategy.

Using a 'share-nothing' core-filling strategy that utilizes the maximum available system memory bandwidth (Figure 5.11), the big picture does not change; CSX's predominance in SMP systems is clear.


Figure 5.10: CSX speedup in SMP systems: (a) Harpertown, (b) Dunnington. The high compression potential of CSX allows it to improve SpMV performance significantly, surpassing all other CSR alternatives. The average speedup is depicted, using a 'share-all' core-filling policy.

It reaches a 47% average performance improvement over CSR for the 12-threaded configuration, while VBL achieves a 24% improvement and BCSR is almost at 5%. An important characteristic of CSX in Dunnington is that it continues to improve its performance as we increase the thread count, while all other formats experience a performance slowdown that reaches 4% in the case of CSR. In Harpertown, the common bus is almost completely saturated from the four-threaded configuration onward, while in Dunnington the speedup reaches a plateau at the 12-threaded configuration. The significant scaling observed for all formats in the six- and 12-threaded configurations in Dunnington is due to the very large, 64 MiB, aggregate L3 cache available. Six matrices from our suite fit comfortably in the available cache even in CSR format, while three more are only a little larger. The system is finally completely saturated in the 24-threaded configuration, where all formats except CSX encounter a performance slowdown.

NUMA architectures

In NUMA architectures, where the available memory bandwidth is considerably higher, the performance landscape changes, but CSX still remains the most performant storage format across the full range of multithreaded configurations. Figure 5.12a shows the speedups of the considered storage formats in Gainestown using a NUMA-aware data placement. The very first observation is that CSR is now rather competitive, since the memory bottleneck is not as intense as before until the 16-threaded configuration, where Hyper-Threading is also enabled.


Figure 5.11: CSX speedup in SMP systems using a 'share-nothing' core-filling policy, which utilizes the maximum available system bandwidth as early as possible: (a) Harpertown, (b) Dunnington. Average speedups are depicted.

CSX achieves a 17.5% performance improvement over CSR when using eight threads and increases the gap to 20.7% in the 16-threaded configuration, where the contention for shared resources by the hardware threads becomes visible. The respective numbers for VBL are 8.4% and 13.9% for the two aforementioned configurations. The case of BCSR is interesting, since it manages to achieve a performance comparable to VBL, thanks chiefly to the ample memory bandwidth that better exposes its computational advantage. While VBL's performance falls behind CSR's up to the two-threaded configuration, reaching a 23.5% performance degradation in the single-threaded one, CSX experiences a slight 2.8% performance degradation only in the single-threaded configuration. From the two-threaded configuration onward, however, CSX is already ahead of every other format, having achieved an 11.4% performance improvement over CSR and reaching 20.7% when the full system is utilized.

In order to examine the behavior of CSX in configurations where the memory bandwidth is not an issue, we have conducted experiments in Gainestown using a 'share-nothing' core-filling policy (see Figure 5.12b). In these configurations, the increased memory bandwidth offered up to the four-threaded configuration favors the simpler CSR and BCSR formats, which achieve comparable performance. VBL falls victim to the increased overhead incurred by its additional data structures and it is not until the eight-threaded configuration that it compensates for this cost. On the other hand, CSX manages to stay very close to CSR and BCSR for the first two configurations, despite its increased bookkeeping and decompression cost, experiencing only a slight 3.6% performance degradation in the two-threaded configuration.


Figure 5.12: CSX speedup in NUMA architectures: (a) share-all policy, (b) share-nothing policy. Average results for the Gainestown system using the two different core-filling policies are shown. Even with the 'share-nothing' policy, CSX manages to stay very close to BCSR and CSR up to the two-threaded configuration, before taking the lead at higher thread counts.

From the four-threaded configuration onward, however, CSX builds a significant 11.3% performance gap over CSR, which is further expanded to 17.5% and 20.7% in the eight- and 16-threaded configurations, respectively.

The speedup results in NUMA architectures revealed a significant performance stability advantage of CSX, since even in configurations where the memory bandwidth is not an issue, it manages to match the performance of CSR and BCSR. This behavior is further illuminated by examining the per-matrix performance results for eight threads in Gainestown, presented in Figure 5.13. In this configuration, the demands on memory bandwidth are quite significant, but the memory subsystem is not yet saturated. Observing the per-matrix performance results, we can first distinguish two matrix categories: (a) low-performing matrices (≈3 Gflop/s in CSR) and (b) high-performing matrices (≈5 Gflop/s in CSR). The first category is formed by matrices with an irregular non-zero element structure and very short rows; according to the discussion in Chapter 3, SpMV performance in these cases is hindered by a number of factors not related to memory bandwidth saturation, such as irregular accesses to the input vector, increased loop overheads and load imbalance. The second category, which is the most typical in SpMV applications, consists of matrices with a more regular structure arising mostly from the discretization of PDEs; the key problem here is the contention for memory bandwidth. CSX manages to achieve considerable performance improvements in matrices of the second category, approaching the upper bound of dense matrix-vector multiplication performance in this configuration (8.5 Gflop/s). In the low-performing matrices, CSX exhibits a rather stable behavior, matching or even surpassing the performance of CSR.


Figure 5.13: Per-matrix performance (Gflop/s) in Gainestown (eight threads, NUMA-aware). CSX provides consistently high performance from irregular to regular matrices, while other CSR alternatives take an 'all-or-nothing' approach.

Although VBL and BCSR exhibit similarly high performance in the more regular matrices, with BCSR matching the performance of CSX in block-dominated matrices (xenon2, consph, m_t1, bmwcra_1, inline_1), the situation changes for irregular matrices, where both BCSR and VBL tend to exhibit a significant performance degradation, due to their increased overhead in matrix size and computations. The advantage of CSX is also clear in matrices dominated by diagonal substructures (e.g., torso3, cage13, atmosmodj), which cannot be efficiently exploited by the alternative formats considered.

Performance stability

The performance stability of CSX is quantified in Table 5.3, where we provide some statistics for the eight-threaded, NUMA-aware configuration in Gainestown, an architecture that favors more computationally friendly formats⁶. The competitiveness of CSR and BCSR is clear from this table, gaining seven and nine matrices, respectively. Nonetheless, even in this not so favorable configuration, CSX manages to achieve the absolute best performance in the majority of matrices (12), while VBL contents itself with just two matrices. The most important information in this table, however, is the performance difference of each format from the best overall performance. Specifically, for each considered format, we present its average performance difference from the absolute best for all the matrices that it did not gain; we also present a 95% confidence interval (C.I.).

⁶ In Harpertown and Dunnington, the predominance of CSX is almost total, providing the best performance in 26 out of the 30 matrices of our suite.


                         CSR       BCSR      VBL       CSX
Best perf.               7         9         2         12
Differences from best
  Average                21.15%    20.13%    12.03%    5.79%
  95% C.I.               ±3.53%    ±4.99%    ±2.94%    ±2.16%
  Minimum                2.60%     0.01%     1.62%     1.13%
  Maximum                32.27%    35.86%    33.79%    19.04%

Table 5.3: Performance stability of the different storage formats examined compared to CSX (Gainestown, eight-threaded, NUMA-aware configuration).

These metrics confirm the predominance of CSX both in terms of overall SpMV performance and in terms of adaptivity to the different characteristics of each matrix. Despite obtaining the absolute best performance in fewer than half of the matrices, CSX manages to be as close as 5.8% to the overall best performance. VBL follows further back with an average difference of 12% from the best performance, while BCSR and CSR come last with 20.1% and 21.2% differences, respectively. The stability of CSX is also superior to that of the other considered formats, as depicted by the computed confidence intervals, while BCSR, as expected, is the least stable format with a performance variation of 5%. CSX can therefore be considered a high performance storage format for sparse matrices, since it not only achieves the highest performance in the majority of matrices in both SMP and NUMA architectures, but also manages to stay very close to the overall best performance in the rest of the matrices, including corner cases such as very irregular matrices, exhibiting significant performance stability compared to the other alternatives.

Performance on special-structured matrices

Despite being a matrix-agnostic format, CSX is able to detect the structure of the input matrix and provide performance comparable to specialized formats for matrices with a special structure, e.g., diagonal or banded matrices. As a proof-of-concept example, we compare the performance of CSX with the performance of the Dense Matrix-Vector multiplication kernel (DMV) on a 2000 × 2000 dense matrix. Table 5.4 shows the performance of both kernels for the Dunnington and Gainestown systems. It is clear that CSX, despite being a generic format and without any prior knowledge, can match the performance of a special-purpose format (here, the dense matrix format). This is possible because CSX has detected and encoded a single substructure (horizontal substructures, no delta units) and our code generator has completely optimized out the switch statement by using a single inline call to the substructure-specific code.


(a) Dunnington

         1 core          1 socket        4 sockets
DMV      558 Mflop/s     755 Mflop/s     10095 Mflop/s
CSX      547 Mflop/s     751 Mflop/s     9922 Mflop/s
CSR      404 Mflop/s     473 Mflop/s     5509 Mflop/s

(b) Gainestown

         1 core          1 socket        2 sockets
DMV      1827 Mflop/s    4950 Mflop/s    8562 Mflop/s
CSX      1766 Mflop/s    4872 Mflop/s    8442 Mflop/s
CSR      1526 Mflop/s    3117 Mflop/s    5658 Mflop/s

Table 5.4: DMV vs. CSX performance. Despite being matrix-agnostic, CSX matches the performance of dense matrix-vector multiplication.

The little overhead encountered is incurred by the 255-element unit limit and the rest of CSX's bookkeeping. The behavior will be similar for a banded matrix, since CSX will detect diag(1) units only. The ability of CSX to adapt automatically to the structure of the matrix offers great flexibility, since it can provide near-optimal performance for a variety of different matrices without any prior knowledge of the non-zero element structure. These results emphasize the adaptability of CSX, which also matches or surpasses the performance of CSR in irregular matrices.

The effect of matrix reordering

Some of the matrices in our matrix suite depart from the typical, memory bandwidth-bound behavior and experience significant performance overheads, incurred chiefly by their irregular sparsity structure (see Figure 5.13). As discussed in Chapter 2 (Section 2.4), a common technique for 'homogenizing' the sparsity structure of a sparse matrix is matrix bandwidth minimization [Cuthill and McKee, 1969; Karypis and Kumar, 1995; Çatalyürek and Aykanat, 1999]. Such techniques operate on structurally symmetric matrices and use row and column permutations in order to bring the non-zero elements as close as possible to the main diagonal. A successful matrix reordering will render the access pattern in the input vector more regular and will also homogenize the flop:byte ratio across the whole matrix, therefore leading to higher performance levels. We have applied the Reverse Cuthill-McKee (RCM) [Cuthill and McKee, 1969] reordering algorithm to the low-performing and structurally symmetric matrices of our suite, namely parabolic_fem, offshore, G3_circuit and thermal2.



Figure 5.14: Performance of CSX and CSR on RCM-reordered matrices. Matrix reordering techniques benefit the CSX format more, since the redistribution of the non-zero elements allows the detection of more substructures by CSX. Results shown are from the eight-threaded NUMA-aware configuration in Gainestown.

Figure 5.14 shows the performance of their original and reordered versions for CSR and CSX on the Gainestown platform using a NUMA-aware, eight-threaded configuration. The performance gain is significant for both formats, but CSX benefits more in all matrices considered. In parabolic_fem and offshore, CSX achieves a 28% and 11% performance improvement over CSR, respectively, while before reordering both formats achieved similar performance. The same behavior is observed for G3_circuit and thermal2, where CSX has managed to close the performance gap from CSR. A significant observation from this figure is that matrix reordering and similar input vector access optimization techniques have an effect orthogonal to the memory footprint minimization performed by CSX. Additionally, these techniques seem to favor CSX more, since they provide the grounds for detecting more substructures and therefore achieving higher compression levels.

5.7.3 CSX preprocessing cost

As discussed in Section 5.5, the preprocessing of CSX is bounded by the detection and encoding of non-horizontal substructures, which require a lexicographic sort of the matrix coordinates.


Figure 5.15: The preprocessing cost of CSX (in serial CSR SpMV operations) versus the performance improvement over multithreaded CSR for the CSX-delta, CSX-horiz, CSX-sampling and CSX-full configurations: (a) Dunnington (24 threads), (b) Gainestown (16 threads, NUMA-aware). Statistical sampling of the matrix leads to an order of magnitude drop in the preprocessing cost, rendering CSX a viable alternative even for online preprocessing.

An easy way to reduce the preprocessing cost is to program CSX⁷ to detect only delta units or horizontal substructures. Nonetheless, this kind of preprocessing reduction comes at a cost in the overall CSX performance, as depicted in Figure 5.15. To exploit its full potential, CSX must be programmed to detect as many substructure types as possible. In this case, however, the preprocessing cost climbs to several hundreds of serial CSR SpMV operations. Though not irrational even for on-line preprocessing, this cost is large and can eradicate the performance benefit of CSX in the midterm. For this reason, we employ uniform statistical sampling on the input matrix, as described in detail in Section 5.5. The use of sampling can drop the preprocessing cost by nearly an order of magnitude with a minimal impact on CSX's overall performance. A key aspect of sampling a sparse matrix with CSX is to use many sampling windows scattered all over the matrix, in order to obtain a 'good look' at the whole matrix and avoid any over- or under-estimation of the presence of certain substructures. In our case, we have used 48 sampling windows covering only 1% of the total non-zero elements of the matrix.

Figure 5.15 shows the preprocessing cost of CSX, measured in serial CSR SpMV operations, in relation to the achieved performance improvement over multithreaded CSR in Dunnington using 24 threads and in Gainestown using 16 threads. The cost when detecting only delta units (CSX-delta) or horizontal substructures (CSX-horiz) in Dunnington is very small, ranging from 16 to 35 CSR SpMV operations, but only a 14.8% to 23.9% performance improvement is achieved.

⁷ CSX can be configured to detect specific types of substructures or even to encode specific substructure instantiations.


The use of sampling when detecting all the available substructure types (CSX-sampling) increases the cost to up to 69 CSR SpMV operations (an affordable number for on-line preprocessing of the matrix), providing a 59% performance improvement. The preprocessing cost increases significantly with the activation of full preprocessing (no windows, no sampling), surpassing 400 serial CSR SpMV operations, only to offer a mere 1.3% additional performance improvement over CSR. The picture is similar in Gainestown, where sampling even leads to a slightly higher average performance. SpMV performance in Gainestown is not as directly related to the matrix representation size as in SMP systems; therefore, some encoding decisions made with sampling may lead to better computational characteristics, in spite of a larger representation size. Another difference between SMP and NUMA architectures, as revealed by Figure 5.15, is that the preprocessing cost is more exposed in NUMA architectures, ranging from 32 to 985 serial SpMV operations, while in Dunnington this cost ranges from 16 to 492 operations.

5.8 Integrating CSX into multiphysics simulation software

As a further evaluation of the potential of CSX for optimizing SpMV in the context of a multiphysics application, we have integrated CSX into the Elmer [Lyly et al., 1999–2000] multiphysics simulation software. Elmer employs iterative Krylov subspace methods for treating large problems, using the preconditioned Bi-Conjugate Gradient Stabilized (BiCGStab) method [van der Vorst, 1992] for the solution of the resulting linear systems. Elmer supports parallelism across multiple nodes using MPI, but uses only a single thread inside every node. For this reason, we have also implemented a multithreaded CSR version for Elmer, in order to achieve a fair comparison with CSX.

The integration process was quite straightforward, since Elmer already supported the use of external SpMV libraries through a well-defined multiplication interface; we did, however, extend the interface slightly, in order to support the notion of matrix representation tuning (see Figure 5.16). Elmer passes to the external library the CSR data structures and the input and output vectors, along with a generic handle to be filled with the tuned matrix, either automatically (e.g., on the first call) or upon request (reinit flag). When a matrix-vector multiplication call is redirected to the CSX library, if the *tuned handle is not valid or the reinit flag is set, we construct the CSX matrix, assign it to the *tuned handle and proceed with the multiplication. Otherwise, we simply perform the multiplication with the provided handle, ignoring all CSR parameters.


void matvec_(void **tuned, void *n, void *rowptr, void *colind, void *values, void *u, void *v, int reinit);

Figure 5.16: e updated matrix-vector multiplication interface of Elmer. New argu- ments are typeset in italics, while u and v are the input and output vectors, respectively. If *tuned is NULL or reinit is set, the CSX matrix is con- structed from the CSR parameters and assigned to the *tuned handle. If the reinit ag is not set during a computation, the CSX matrix will be constructed once upon the rst call.

Problem name    Equations involved                   SpMV exec. time (%)
fluxsolver      Heat + Flux                          57.4
HeatControl     Heat                                 57.5
PoissonDG       Poisson + Discontinuous Galerkin     62.0
shell           Reissner-Mindlin                     83.0
vortex3d        Navier-Stokes + Vorticity            92.3

¨ Table 5.5: e benchmark problems used for the evaluation of the CSX integration ¨ © into the Elmer multiphysics soware. e problems are selected from © Elmer’stesting suite and signicantly enlarged, in order to lead to quite large matrix representations (larger than 576 MiB). e execution time percent- ages are reported over the total solver’s time, including a diagonal precon- ditioner.

The process of constructing the CSX matrix from a CSR input consists of two steps: (a) conversion of CSR into CSX's internal representation, which is used for the preprocessing of the matrix (see Section 5.3.1), and (b) detection and encoding of substructures, which is the phase of the actual CSX preprocessing, where the matrix is mined for substructures and the final CSX format is constructed.

For the evaluation of the CSX integration, we have selected five benchmark problems from the Elmer testing suite. We have properly augmented the problem size (increasing the grid density), so that the resulting matrices are quite large, exceeding the 576 MiB boundary⁸. Table 5.5 presents details on the selected benchmarks; the percentage of the total solver execution time⁹ spent in the SpMV routine is also reported.

Figure 5.17 shows an execution time breakdown of the total time Elmer spent inside the CSX library after 1000 Bi-CG iterations.

⁸ This is the size of the aggregate cache of the cluster nodes participating in the computation.
⁹ Assembly times are not included.


Figure 5.17: Execution time breakdown (SpMV, conversion to the internal representation, conversion to CSX) of the CSX library after 1000 Bi-CG iterations: (a) HeatControl, (b) PoissonDG, (c) shell, (d) vortex3d. The conversion cost is negligible, while the CSX preprocessing is easily hidden by increasing the computation threads. CSX's bootstrap cost is completely amortized after less than 300 iterations.

The first observation is that the cost of the conversion to CSX's internal representation is negligible. In fact, this step is a single large allocation of the internal representation's data structures and a sweep of the input CSR matrix. The rest of the preprocessing cost is hidden as we add more threads to the SpMV computation or if the computation itself is quite significant, e.g., in the vortex3d benchmark. The total preprocessing cost of CSX is completely amortized, compared with the multithreaded CSR kernel, after 224–300 Bi-CG iterations. While this cost is reasonable for large and complex simulations that may take thousands of iterations, CSX's preprocessing could also be performed offline, in order to further minimize this bootstrap cost.


Figure 5.18: Speedup of the Elmer multiphysics simulation software from employing CSX (preprocessing time included) after 1000 linear system iterations: (a) SpMV component, (b) overall solver, for up to 24 nodes (×8 cores). The knee in the multithreaded speedup diagrams at 16 nodes is due to the lower memory bandwidth of the Clovertown cluster nodes, which start to participate in the computation from that point on.

Despite the preprocessing cost, however, CSX was able to considerably accelerate not only the SpMV component of the Elmer software, but also the overall execution time of the solver (Figure 5.18). More specifically, CSX provided a 37% average improvement of the SpMV component on a 24-node, two-way SMP, quad-core Intel Xeon E5405/E5335 (Harpertown/Clovertown) mixed cluster after 1000 Bi-CG iterations. The overall solver performance was improved by a noticeable 14.8%, despite the large preconditioning cost in three of the benchmarks used (see Table 5.5). We believe this improvement would be much more significant if other parts of the solver (e.g., the preconditioner) also exploited parallelism.

5.9 Summary

In this chapter, we presented, discussed and evaluated in detail the Compressed Sparse eXtended (CSX) storage format, a new approach for storing sparse matrices in order to achieve high performance on modern multicore architectures. CSX is able to detect a variety of substructures in the non-zero element structure of the matrix, e.g., horizontal, diagonal and two-dimensional blocks, which it then encodes in a single, very condensed, integrated matrix representation. This characteristic allows CSX to adapt to the specificities of every input matrix and achieve considerable compression levels, providing stable high performance from irregular to special-structured matrices. CSX takes special care of NUMA architectures by favoring the computational part of the kernel, which is more significant in these cases.


In practice, CSX surpasses all other CSR-alternative storage formats in terms of absolute performance, providing performance gains of more than 60% and 20% in SMP and NUMA architectures, respectively. Moreover, it is rather important that these improvements do not come at an excessive preprocessing cost, allowing CSX to provide noticeable performance improvements also in the context of a production multiphysics simulation software, despite its initial bootstrap cost.


6 Exploiting symmetry in sparse matrices

Symmetric sparse matrices arise quite often in the solution of sparse linear systems, and specific iterative methods exist for treating such symmetric problems. Since the key performance problem of the SpMV kernel is the saturation of the system's memory bandwidth, it is quite tempting to nearly halve the representation size of a symmetric matrix by not storing the non-zero elements of the upper or lower triangular part. While beneficial in a single-threaded context, such an approach proves to be problematic in multithreaded configurations. The typical row-wise distribution of the sparse matrix among the threads introduces an undesirable race condition on the elements of the output vector. The preferred way to resolve this dependency, in order to avoid the prohibitive cost of locking, is to use per-thread local vectors for performing the local SpMV computation; this method, however, introduces an additional reduction step in the symmetric SpMV computation, which further stresses the memory subsystem and eliminates the benefit from the reduction in the matrix size. In this chapter, we propose an indexing scheme for the local vectors of the symmetric SpMV kernel, which minimizes the memory traffic during the reduction phase of the kernel, allowing symmetric SpMV to scale. We also extend the CSX storage format to support symmetric matrices efficiently. In conjunction with the local vectors indexing scheme, symmetric CSX is capable of providing a more than 2× performance improvement over the standard CSR format and accelerating the CG iterative method by nearly 80%.

6.1 The symmetric SpMV kernel

The most widely used format for storing symmetric sparse matrices is the Symmetric Sparse Skyline (SSS) format, a variation of CSR that stores only the main diagonal and the lower triangular elements. More specifically, SSS stores the diagonal elements separately and the lower triangular elements using the typical CSR format (see Chapter 2, Section 2.4 for more details).


Figure 6.1: The RAW dependency on the output vector in a parallel execution of the symmetric SpMV kernel (figure replicated from Chapter 2).

Assuming four-byte indices and eight-byte double-precision floating point values for the non-zero elements, the matrix size in the SSS format can be calculated as follows:

S_SSS = 4·(NNZ − N)/2 [colind] + 8·(NNZ − N)/2 [lower values] + 4(N + 1) [rowptr] + 8N [diagonal]    (6.1)
      = 6(NNZ + N) + 4    (6.2)

For the typical case of sparse matrices, where NNZ ≫ N, the matrix size in the SSS format is almost half that of CSR (approximately 12NNZ). Given the streaming nature of the SpMV kernel, this considerable reduction in the matrix size is expected to provide a significant performance improvement in a single-threaded implementation of the kernel. In a multithreaded context, however, the execution of the kernel becomes problematic. Splitting the matrix row-wise introduces a RAW dependency on specific output vector elements, due to the vector updates generated by the SpMV operations in the upper triangular matrix (see Figure 6.1). The typical way of avoiding this dependency without the use of locking, whose cost can be prohibitive, is to use local, per-thread vectors for storing the partial results of the multiplication (Algorithm 6.1). The SpMV execution is split into two parallel phases: the multiplication phase and the reduction phase. The multiplication phase (Algorithm 6.1, lines 2–11) performs the actual SpMV operation, writing the partial results to the local vectors, while the reduction phase (Algorithm 6.1, lines 12–16) reduces the local vectors into the final output vector.

The reduction operation required by the local vectors method is streaming and is expected to be bound by the available memory bandwidth, despite its high parallelism at the order of Θ(N). This phase introduces a significant workload overhead to the SpMV computation that depends on the total number of threads used.
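For reference, a minimal C sketch of the SSS arrays that the size formula (6.1) counts is given below; the struct and field names are illustrative, with dvalues following the naming of Algorithm 6.1.

```c
/* Illustrative layout of an SSS matrix: the diagonal is stored densely,
 * the strictly lower triangular part in CSR form. */
struct sss_matrix {
    int     n;          /* matrix dimension                               */
    double *dvalues;    /* main diagonal, N values              -> 8N     */
    int    *rowptr;     /* CSR row pointers, N + 1 entries      -> 4(N+1) */
    int    *colind;     /* column indices of the lower part, (NNZ - N)/2  */
    double *values;     /* non-zero values of the lower part, (NNZ - N)/2 */
};
```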


1: procedure MVSM(A::in, x::in, y::out, p::in, start::in, end::in) A: matrix in SSS format x: input vector y: output vector p: participating threads start, end: start/end of thread partitions 2: for i ← 0 to p do in parallel 3: for r ← start[i] to end[i] do ← · 4: yi[r] dvalues[r] x[r] 5: for j ← rowptr[i] to rowptr[i + 1] do 6: c ← colind[j] ← · 7: yi[r] yi[r] + values[j] x[c] ← · 8: yi[c] yi[c] + values[j] x[r] 9: end for 10: end for 11: end for 12: for r ← 0 to N do in parallel 13: for i ← 0 to p do ← ¨ 14: y[r] y[r] + yi[r] ¨ © 15: end for © 16: end for

Algorithm 6.1: Parallel implementation of the symmetric SpMV kernel using the SSS storage format.

Assuming eight-byte double-precision values for the local vectors and p participating threads, the working set of the reduction phase is

ws_red = 8pN    (6.3)

The key characteristic of this overhead is that it increases proportionally to the number of threads used and is expected to lead to a saturation of the available memory bandwidth beyond a certain number of threads, limiting the scalability of the symmetric SpMV kernel. This behavior is clearly depicted in the speedup diagrams of Figure 6.2. While the performance benefit of the symmetric SSS kernel ranges between 30% and 80% in the single-threaded configuration, this gain continuously shrinks as the number of threads increases, until it is completely eliminated (Dunnington, 24 threads; Gainestown, 8 threads) or even reversed (Gainestown, 16 threads), leading to considerably lower performance than the baseline CSR.


Figure 6.2: Speedup of the naive symmetric SpMV implementation (SSS naive vs. CSR, over serial CSR): (a) Dunnington, (b) Gainestown. The scalability of the symmetric SpMV kernel using local vectors is limited by the final reduction step, whose workload increases linearly with the number of computation threads, leading to a complete saturation of the available memory bandwidth resources.

6.2 Minimizing the reduction cost

The previous discussion of the symmetric SpMV kernel has revealed its key performance problem: the increased workload overhead of the reduction phase. Due to its streaming nature, it is crucial to reduce the data handled during the reduction phase, in order to relieve the memory subsystem and, if possible, decouple this overhead from the thread count, allowing SpMV to scale. We present here two approaches in this direction, based on the key observation that only specific parts of the output vector need to be kept locally, since not all accesses conflict. The memory traffic incurred by the reduction phase can therefore be considerably reduced.

6.2.1 Effective ranges of local vectors

The method of effective ranges, proposed by Batista et al. [2010], updates only a specific region of each local vector and redirects the rest of the updates directly to the output vector. According to the symmetric SpMV algorithm (Algorithm 6.1), the i-th thread is assigned the calculations for the SSS submatrix between the start[i] and end[i] rows. Since we store the lower triangular matrix, there is no way for thread i to access elements below the end[i] row boundary. Therefore, it is safe to exclude this region from the reduction phase.


the regular unsymmetric SpMV implementation. e SpMV operations at the remaining region, from the rst row up to start[i], may conict with the output vector updates and, therefore, it must write to the local vector; this region is the effective region of the local vector (Figure 6.3c) and should be updated dur- ing the reduction phase. Assuming, without loss of generality, that each thread obtains almost the same number of rows, the working set overhead of the re- duction phase for this method using p threads can be calculated as follows:

ws_eff ≈ 8 · p(p − 1)/2 · N/p = 4(p − 1)N    (6.4)

The reduction overhead is now halved compared to the naive version, but the key problem of the symmetric SpMV execution remains: the overhead of the reduction phase still grows linearly with the number of participating threads. Geus and Röllin [2001] present a similar approach for non-banded matrices for minimizing the communication data among the processes of a distributed SpMV computation. They specialize their approach for banded matrices by reducing the effective range of the communicated data (i.e., the local vectors) to half the matrix bandwidth. This approach can be extended straightforwardly to regular matrices and collapses to the method described here for matrices with a very large bandwidth. Nonetheless, even this approach does not decouple the reduction overhead from the thread count, especially for matrices with a relatively large bandwidth.
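A minimal sketch of the corresponding reduction loop is shown below, with assumed names (start[] holds each thread's first row); it is not the thesis implementation. Since thread t writes rows [start[t], end[t]) of the output vector directly, only the prefix [0, start[t]) of its local vector needs to be reduced, which is where the halved volume of equation (6.4) comes from; the loop can of course itself be parallelized over output rows.

    #include <stddef.h>

    /* Effective-ranges reduction: only the prefix [0, start[t]) of each
     * local vector can hold conflicting updates (thread 0 has none). */
    void reduce_eff_ranges(double *y, double **y_local,
                           const size_t *start, int nthreads)
    {
        for (int t = 1; t < nthreads; t++)
            for (size_t i = 0; i < start[t]; i++)
                y[i] += y_local[t][i];
    }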

6.2.2 Local vectors indexing

Our approach to minimizing the reduction phase overhead is based on the observation that the effective regions of the local vectors are quite sparse, i.e., very few elements of the effective regions are actually updated during the SpMV computation. Another interesting trait of the effective regions, as depicted in Figure 6.4, is that their density decreases continuously as more threads are added to the SpMV computation, reaching an average of 10.7% at the 24 threads of Dunnington and 2.6% at 256 threads. Motivated by this behavior, we introduce an indexing scheme for the non-zero elements of the local vectors, in order to update only the conflicting elements during the reduction phase, minimizing the workload overhead. For each non-zero element in the effective regions of the local vectors (Figure 6.3d), we keep a pair (vid, idx), where vid is a unique local vector ID and idx is the index of the non-zero element inside the local vector. The size of idx equals the matrix index size, i.e., four bytes, while vid can vary depending on the maximum expected thread count. In our implementation, we generously use four bytes for the vid field, but two or even a single byte is enough for current multicore architectures. A minimal sketch of such an indexed reduction follows.
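The sketch below illustrates the idea with assumed names and layout; it is not the exact implementation. During the multiplication phase each thread appends a (vid, idx) record the first time it touches an element of its effective region, and the reduction then walks only these records.

    #include <stdint.h>
    #include <stddef.h>

    /* One record per conflicting local-vector element. We use 4 bytes for
     * vid for simplicity; one or two bytes suffice in practice. */
    struct conflict {
        uint32_t vid;   /* which local vector holds the element */
        uint32_t idx;   /* row index inside that local vector   */
    };

    /* Indexed reduction: only the recorded (conflicting) elements are read,
     * so the streamed vector data is roughly 8(p-1)Nd bytes plus the index
     * itself, as in equations (6.5)-(6.6). */
    void reduce_indexed(double *y, double **y_local,
                        const struct conflict *index, size_t nconflicts)
    {
        for (size_t k = 0; k < nconflicts; k++) {
            y[index[k].idx] += y_local[index[k].vid][index[k].idx];
            y_local[index[k].vid][index[k].idx] = 0.0; /* reset for the next SpMV */
        }
    }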


 . . .   .2.7. .0.5. .0. .0. .1.2. .0. .0. .0.       .0.5. .5.6. .6.6. .0. .0. .0. .0. .3.4.       .0. .6.6. .9.4. .5.4. .0. .4.1. .0. .0.  . .     .0. .0. .5.4. .0.7. .9.8. .7.2. .0. .0.   .  =. . . .y .+ .+ .+  .1.2. .0. .0. .9.8. .2.4. .0. .0. .3.3.  . .       .0. .0. .4.1. .7.2. .0. .7.8. .4.7. .0.     .0. .0. .0. .0. .0. .4.7. .9.8. .0.  . .  . .0. .3.4. .0. .0. .3.3. .0. .0. .4.1.

.y1 .y2 .y3 .y4 (a) Input matrix. (b) Naive reduction.

.Index: .0 .1 .2 .2 .3 .4 ¨ ¨ © © ...... y =. .+ .+ .+ .y =. .+ .+ .+ .

.y .y2 .y3 .y4 .y .y2 .y3 .y4 (c) Reduction with effective ranges. (d) Indexed reduction. Figure 6.3: Local vector methods for the reduction phase of the symmetric SpMV ker- nel. An example implementation of symmetric SpMV with p = 4 threads. e naive method (6.3b) uses simply four local buffers that reduces later to the nal output vector. e effective ranges method (6.3c) uses p − 1 local vectors writing only the possibly conicting regions. e indexing scheme proposed (6.3d) uses p − 1 local vectors and an indexing structure that points only to the really conicting elements.


[Figure 6.4 — plot of the average density (%) of the effective regions versus thread count (1–256, log scale).]
Figure 6.4: The density of the effective regions of local vectors. Local vectors become more sparse as the thread count increases, reaching a 2.7% density at 256 threads. The vertical line marks the density at the 24 threads of Dunnington.

The workload overhead of the reduction phase using our indexing scheme now depends on the density d of the effective regions of the local vectors. More specifically, assuming, without loss of generality, that the N matrix rows are equally partitioned among the p threads, the working set overhead is

ws_idx = ws_eff · d + 8 · (p − 1)Nd / 2    (6.5)
       ≈ 8(p − 1)Nd    (6.6)

where the second term of (6.5) is the size of the index itself.

Although, theoretically, the local vectors indexing does not decouple the reduction phase overhead from the thread count, in practice the effect of increasing the thread count is considerably attenuated. This is due to the inverse relation between the thread count and the density of the effective regions, which eventually tends to stabilize the workload overhead as the thread count increases, as depicted in Figure 6.5. The reduction overhead of both the naive and the effective ranges methods increases linearly with the thread count, significantly exceeding the multiplication phase at the 24 threads of Dunnington. The overhead of the proposed indexing scheme, on the other hand, increases at a much slower pace and tends to stabilize around 15% at 24 threads. Indeed, with an average density of almost 11% at this thread count, the indexing scheme's working set overhead is more than four times lower than that of the effective ranges. In Gainestown, the reduction overhead is lower for all methods, since the memory bandwidth contention is not as pronounced, but the same behavior is observed in this architecture as well.
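To put concrete numbers on this comparison, at the 24 threads of Dunnington the measured average density is d ≈ 0.107, so equation (6.4) gives ws_eff = 4 · 23 · N = 92N bytes, whereas equation (6.6) gives ws_idx ≈ 8 · 23 · 0.107 · N ≈ 20N bytes, i.e., more than four times less data streamed during the reduction phase.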


[Figure 6.5 — reduction workload overhead (%) over the serial SSS implementation versus thread count: (a) Dunnington, (b) Gainestown; curves: SSS naive, SSS eff. ranges, SSS indexed.]

Figure 6.5: e workload overhead of the reduction phase (over the serial SSS imple- mentation) for the three local vectors methods considered. e reduction overhead increases linearly with the thread count for the naive and the effective ranges methods, surpassing the execution time of the actual mul- tiplication. Conversely, the overhead of the indexing scheme proposed tends to stabilize as the thread count increases, leading to a rather small, 15% overhead. ¨ ¨ © © Parallelization e parallelization of the reduction phase in our index-based scheme is based on the local vectors index, since this species the actual re- duction operations. More specically, we rst sort the index in an ascending order of the idx eld and then split it equally among the participating threads. e only restriction in the splitting process is that an idx value must not be shared among any pair of threads, in order to guarantee the independence of the updates to the nal output vector.

6.2.3 Alternative methods

The reduction phase of the symmetric SpMV cannot be avoided, unless atomic operations are used on the output vector elements. Buluç et al. [2011] take an interesting approach by introducing a hybrid method that uses a constant number of reduction operations and some stray atomic operations on specific output vector elements. The key idea of the method is to organize the matrix into large block diagonals (three in the proposed implementation), stored with the efficient CSB blocking storage format (see Chapter 2, Section 2.2.3). The SpMV computation proceeds independently over the three block diagonals and writes the results to dedicated local vectors, which

The computation inside each diagonal is also parallel, proceeding in two distinct phases.


will be reduced to the output vector in a final step. The SpMV computations on non-zero elements beyond the block diagonals proceed independently and write directly to the output vector using atomic operations. The key advantage of this method is that it uses a constant number of local vectors, decoupling the reduction overhead from the thread count. However, the overhead of reducing even three local vectors can be quite significant for very large and sparse matrices, where the local vectors might not fit in the system's aggregate cache. Additionally, the cost of atomic operations on the stray non-zero elements can be considerable for matrices with a large bandwidth.

6.3 CSX for symmetric matrices

Having optimized the reduction phase of the symmetric SpMV kernel, we take a step further and optimize the multiplication phase by extending the highly compressed CSX format (see Chapter 5) to support symmetric matrices. This optimization is orthogonal to the reduction phase optimization discussed in the previous section. Similarly to SSS, the symmetric version of CSX, henceforth called CSX-Sym, stores the main diagonal separately in a dvalues array and the lower triangular submatrix in normal CSX. Every substructure detected in the lower submatrix has a symmetric counterpart in the upper triangular submatrix; for example, every horizontal substructure in the lower half starting at element (i, j) implies the existence of a vertical one in the upper half starting at position (j, i), and so forth. The generated code for CSX-Sym takes care of computing the symmetric substructure as well. The only restriction we impose on the detection of symmetric substructures is that all of their vector updates must be directed either to the original output vector or to the local vector, not to both of them. In other words, if the i-th thread's partition starts at row start[i], symmetric substructures crossing this boundary are not encoded. Figure 6.6 shows an example of the substructures detected by CSX-Sym; note that the horizontal substructure starting at row 6 is not encoded, since the elements of its symmetric counterpart cross the original/local vector boundary. Although it slightly reduces the compression potential of CSX-Sym, this restriction leads to more efficient code, since per-element checks would otherwise be needed to determine where to redirect each vector update.
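The conceptual sketch below (not the actual generated CSX code) shows what computing a horizontal substructure together with its mirrored vertical counterpart amounts to; the output target yout stands for either the global output vector or the thread's local vector, as dictated by the restriction above.

    #include <stddef.h>

    /* Horizontal substructure of 'len' contiguous non-zeros starting at the
     * lower-triangle position (i, j). Each value also represents the
     * symmetric element (j+k, i), which contributes vals[k]*x[i] to rows
     * j..j+len-1; these updates go to 'yout' (global or local vector). */
    static void spmv_horiz_sym(const double *vals, size_t i, size_t j,
                               size_t len, const double *x,
                               double *y, double *yout)
    {
        double yi = 0.0;
        for (size_t k = 0; k < len; k++) {
            yi += vals[k] * x[j + k];        /* lower-triangle element (i, j+k) */
            yout[j + k] += vals[k] * x[i];   /* mirrored element (j+k, i)       */
        }
        y[i] += yi;
    }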

Compression potential Employing all the sophistication of CSX in detecting and encoding substructures inside a sparse matrix, CSX-Sym is able to considerably compress the symmetric sparse matrix representation, reaching very close to the maximum possible compression ratio (67%), while the SSS format


 . .  .2.7. .0.5. .3.1. .0. .1.2. .0. .0. .0.    .0.5. .5.6. .6.6. .0. .0. .9.8. .0. .0.       .3.1. .6.6. .9.4. .5.4. .0. .4.1. .0. .0.  . .     .0. .0. .5.4. .0.7. . .0. .7.2. .0. .0.  .A =. .  .1.2. .0. .0. .0. .2.4. .1.9. .4.6. .3.3.   .   .  . .0. .9.8. .4.1. .7.2. .1.9. .7.8. . .4.7. .3.4.    .0. .0. .0. .0. .4.6. .4.7. .9.8. .0. . .0. .0. .0. .0. .3.3. .3.4. .0. .4.1. . Figure 6.6: Encoded substructures by CSX-Sym in an 8 × 8 sparse matrix. Dotted lines denote thread partitions; the shaded elements of the upper triangular submatrix are not stored. Note that the horizontal substructure at row 6 is not encoded, since its symmetric vertical substructure crosses the thread partition boundary.

Matrix          CSR Size (MiB)   C.R. (SSS)   C.R. (CSX-Sym)   C.R. (Max.)
parabolic_fem        44.06         45.4%          49.6%           63.6%
offshore             49.54         48.0%          56.1%           65.3%
consph               69.10         49.5%          63.9%           66.4%
bmw7st_1             84.54         49.3%          64.4%           66.2%
G3_circuit           93.72         43.5%          60.2%           62.4%
thermal2            102.88         45.4%          53.4%           63.6%
m_t1                111.99         49.7%          65.3%           66.4%
bmwcra_1            122.38         49.5%          65.1%           66.4%
hood                124.08         49.3%          64.4%           66.2%
crankseg_2          162.16         49.8%          64.9%           66.6%
nd12k               162.88         49.9%          64.9%           66.6%
af_5_k101           202.77         49.1%          63.9%           66.0%
inline_1            423.25         49.5%          64.7%           66.4%
ldoor               536.04         49.3%          64.5%           66.2%
boneS10             638.28         49.5%          65.1%           66.3%

Table 6.1: Compression ratio (C.R.) of the symmetric CSX. Symmetric CSX is able to provide near optimal compression ratios for symmetric matrices.

usually contents itself with less than 50% in most cases. Table 6.1 details the compression ratio of both formats for the 15 symmetric matrices of our suite.


6.4 Performance evaluation

In the following, we evaluate the performance of the three local vector methods discussed, namely the naive, the effective ranges and the index-based method. We also evaluate the impact of the CSX-Sym variant on the execution of the symmetric SpMV kernel. For the evaluation, we use the 15 symmetric and positive definite sparse matrices of our matrix suite and focus on the Dunnington and Gainestown platforms as representatives of the SMP and NUMA architectures, respectively. We employ a 'share-nothing' core-filling policy for this evaluation, in order to expose more clearly the memory bandwidth contention of the reduction phase of the symmetric SpMV kernel. The CSX-Sym variant is set up exactly as the typical CSX (see Chapter 5, Section 5.7). In order to exhibit its full potential, we report performance results with full preprocessing enabled for the comparisons with other symmetric SpMV approaches. In the evaluation of the CG iterative method, however, we employ statistical sampling as described in Chapter 5, Section 5.7.3, and also include this time in the overall reported results. For more information on the matrix suite, the hardware platforms used and the experimental procedures, the reader is referred to Chapter 3, Section 3.2.

6.4.1 Local vectors methods

The performance improvement achieved by the different local vector methods considered is depicted in the speedup diagrams of Figure 6.7. All methods start with a significant improvement in the single-threaded configuration, especially on the Dunnington platform, but the naive and the effective ranges methods scale at a lower rate compared to the baseline CSR. The performance improvement offered by these methods shrinks continuously as the thread count increases and is completely eliminated when the memory bandwidth is saturated. Due to its reduced working set overhead, the effective ranges method exhibits a slightly better behavior in such cases, but it still does not allow the symmetric SpMV to scale. In Gainestown, the performance of the naive and the effective ranges methods is very close to the original CSR in the eight-threaded configuration and deteriorates significantly in the 16-threaded one. The performance benefit of the proposed local vectors indexing method is prominent in both considered architectures. The considerably reduced working set overhead, which remains almost stable with the thread count, allows the symmetric SpMV to scale at the same rate as the original CSR implementation, without compromising the performance improvement in the cases of memory bandwidth saturation. More specifically, our indexing scheme achieves an 83.9% performance gain over the best SSS configuration in Dunnington (12 threads)


[Figure 6.7 — speedup plots: (a) Dunnington, (b) Gainestown; y-axis: speedup over serial CSR; curves: CSR, SSS naive, SSS eff. ranges, SSS indexed.]
Figure 6.7: Symmetric SpMV speedup with different local vectors reduction methods. The indexing method is the most efficient, allowing SpMV to scale.

and a 44% improvement in Gainestown. Overall, the symmetric SSS kernel using the local vectors indexing scheme was able to provide a more than 2× improvement over the standard CSR implementation in Dunnington and 1.5× in Gainestown in the multithreaded configurations. This is a rather significant achievement, since the proposed technique allows the efficient exploitation of the symmetric structure of certain matrices, which would otherwise remain unexploited. In order to provide more insight into the importance of the reduction phase in the symmetric SpMV implementation, we present in Figure 6.8 the execution time breakdown of the symmetric SpMV kernel for all the considered reduction methods in the 24-threaded configuration in Dunnington. It is clear that the proposed local vectors indexing scheme reduces the reduction phase overhead considerably, keeping it at a minimal level. This reduction method has a beneficial side-effect: the multiplication phase time has also decreased with the proposed indexing scheme. This is mainly due to the lower cache interference introduced by the modest working set overhead of our method. The high working set overhead of the alternative methods in many-threaded configurations is likely to evict useful data from the cache, incurring an increased overhead on the multiplication phase of the next iteration. Finally, it is essential to point out the four cases (parabolic_fem, offshore, G3_circuit, thermal2) where CSR's performance surpasses or matches that of our indexing method. These are high-bandwidth matrices with many of their non-zero elements at very long distances from the main diagonal, leading to considerable memory traffic during the reduction phase. However, the local vectors indexing method seems to handle even such cases efficiently, exhibiting a rather low overhead, while the naive and effective ranges methods are overwhelmed by the reduction cost.


[Figure 6.8 — per-matrix bar chart; y-axis: time over serial CSR (log scale); bars: CSR, SSS naive, SSS eff. ranges, SSS indexed.]
Figure 6.8: Symmetric SpMV execution time breakdown at 24 threads in Dunnington. The reduction overhead (darker regions) is considerably reduced with the use of local vector indexing.

6.4.2 Symmetric CSX

The performance of the CSX-Sym variant and the optimized SSS format (local vectors indexing) compared to the unsymmetric CSR and CSX implementations is depicted in Figure 6.9. Thanks to its highly compressed representation, CSX-Sym provides a further 43.4% performance improvement over the optimized SSS format in Dunnington, an architecture that is mostly affected by the size of the matrix representation. In Gainestown, where the available memory bandwidth is ample, the performance gap closes to 10% on average, but CSX-Sym is still able to provide a performance gain. The unsymmetric CSX and CSR implementations lag well behind in performance, especially in Dunnington, where the memory bottleneck is more prominent. CSX-Sym was able to provide an impressive (for an SpMV implementation) 21× speedup in Dunnington at 24 threads and nearly 8× in Gainestown in the 16-threaded configuration. This noteworthy behavior is due to the low overhead of the index-based reduction phase, which allows the benefit from the advanced compression of CSX to become visible in the symmetric kernel as well. In order to gain further insight into the CSX-Sym performance, Figure 6.10 shows the per-matrix performance of the four considered formats (CSR, SSS, CSX and CSX-Sym) in the 16-threaded configuration in Gainestown. CSX-Sym manages to achieve the best performance, surpassing 10 Gflop/s, in 11 out of the 15 matrices. The remaining four matrices are the high-bandwidth corner cases, where no symmetric format achieved any performance improvement over CSR. High-bandwidth matrices have their non-zeros spread across the whole matrix, leading to a rather low substructure frequency. However, with the


[Figure 6.9 — speedup plots: (a) Dunnington, (b) Gainestown; y-axis: speedup over serial CSR; curves: CSR, CSX, SSS indexed, CSX-Sym.]
Figure 6.9: Symmetric SpMV speedup with the CSX-Sym format. CSX-Sym further optimizes the symmetric SpMV kernel, especially in SMP architectures, where the memory bandwidth contention is more prominent.

[Figure 6.10 — per-matrix bar chart; y-axis: Gflop/s; bars: CSR, SSS indexed, CSX, CSX-Sym.]
Figure 6.10: Per-matrix performance of the CSX-Sym format at 16 threads in Gainestown. CSX-Sym's performance in more regular matrices surpasses 10 Gflop/s, while staying close to the baseline CSR performance in less regular ones.

exception of parabolic_fem, which has a rather irregular structure and a very high bandwidth, CSX-Sym was able to achieve near-best performance for almost all of these corner-case matrices.

6.4.3 Reduced bandwidth matrices

The execution of the symmetric SpMV kernel is particularly affected by the bandwidth of the input sparse matrix. Apart from the performance problems caused to the typical SpMV kernel, e.g., irregular accesses to the input vector,


           Dunnington      Gainestown
           (24 threads)    (16 threads)
CSR           24.8%            8.9%
CSX           50.2%           11.1%
SSS           73.6%           34.9%
CSX-Sym       85.3%           38.6%

Table 6.2: Symmetric SpMV performance improvement due to matrix reordering (the RCM algorithm was used). The impact of reordering is much more significant for the symmetric formats, while CSX tends to benefit more from this optimization, since it is able to detect and encode more substructures.

increased loop overheads, load imbalances etc. (see Chapter 3, Section 3.3 for more details), high-bandwidth matrices tend to increase the interference among the threads of the symmetric SpMV kernel when writing to the output vector. The reduction phase of the naive and the effective ranges local vectors methods is not affected by this interference, since all updates are directed to the local vectors. The proposed local vectors indexing scheme, however, is affected, since the reduction phase overhead depends exactly on this interference, as we update in the local vectors only the conflicting elements. Reducing the bandwidth of such matrices can therefore be quite beneficial for the symmetric SpMV kernel. Originally conceived for reducing the communication overhead in distributed symmetric SpMV implementations, matrix bandwidth reduction techniques try to bring the non-zero elements as close to the main diagonal as possible by reordering the rows and columns of the matrix. The obvious effect of this non-zero rearrangement on the SpMV execution is the minimization of the inter-process interference. In a distributed SpMV implementation, this is equivalent to reducing the communication overhead, since less data from the input and output vectors must be exchanged among the participating processors. In the context of a multithreaded execution, it is equivalent to the minimization of the reduction phase overhead. Matrix bandwidth minimization techniques have beneficial side-effects for the typical unsymmetric SpMV implementation as well, since the homogenization of the non-zero element distribution leads to a better access pattern and load balance (see Chapter 3). Substructure-detecting formats, like CSX, benefit further, since the non-zero rearrangement increases the opportunities for detecting more substructures (see Chapter 5). This picture is verified for the symmetric SpMV kernel as well, as Table 6.2 shows, where the average performance improvement of the different SpMV implementations due to matrix reordering is reported. In Dunnington, the standard CSR gains a 22% improvement, while the baseline CSX


[Figure 6.11 — per-matrix bar chart; y-axis: Gflop/s; bars: CSR, SSS indexed, CSX, CSX-Sym.]
Figure 6.11: Per-matrix performance on the reordered symmetric matrices (Gainestown, 16 threads). CSX-Sym is the most performant format, surpassing the 12 Gflop/s barrier in the majority of the sparse matrices.

is benefited by 63%. As expected, the effect of matrix reordering on the symmetric kernel is much more important, with the improvement of SSS surpassing 90%, while CSX-Sym gets a more than 2× acceleration. The picture is similar in Gainestown, but both the observed improvements and the differences among the formats are attenuated. Such behavior, apparent throughout the performance evaluations in this thesis, is quite typical of NUMA architectures, where the memory bandwidth contention is not as intense as in SMP systems. Finally, Figure 6.11 shows the absolute SpMV performance on the reordered matrices of our suite; it is noteworthy that CSX-Sym's performance surpasses 12 Gflop/s in nine matrices.

6.4.4 Impact on the CG iterative method

In order to evaluate the impact of the local vectors indexing optimization and the CSX-Sym format in the context of an iterative solver, we have implemented a non-preconditioned version of the Conjugate Gradient (CG) method. The CG method is a widely used iterative solution method for symmetric positive definite linear systems. It is also part of the NAS parallel benchmark suite, as a typical kernel of unstructured grid computations, for assessing the performance of a parallel system in irregular long-distance communication and the matrix-vector product [Bailey et al., 1991]. Figure 6.12 shows the execution time breakdown of the CG method using the unsymmetric CSR and CSX formats and their symmetric counterparts, SSS and CSX-Sym (with local vectors indexing). Results are shown for 24 threads in Dunnington on the RCM-reordered matrices after 2048 iterations. The first general observation is that the vector operations can be quite


1: procedure CG(A, b, x0) A: coefficient matrix b: target vector x0: initial approximate solution 2: r0 ← b − Ax0 ← 3: p0 r0 4: i ← 0 ◃ Iteration count 5: loop rTr 6: ← i i ◃ SpMV operation ai pTAp ←i i · 7: xi+1 xi + ai pi ← − · 8: ri+1 ri ai pi 9: if |ri+1| is adequately small then 10: break T ← ri+1ri+1 11: bi T r ri ← i − · 12: pi+1 ri+1 bi pi 13: i ← i + 1 return xi+1 ¨ ¨ © Algorithm 6.2: e non-preconditioned Conjugate Gradient algorithm. ©

significant in smaller and sparser matrices, such as parabolic_fem, offshore etc., and may exceed 50% of the overall execution time of the multithreaded CG kernel. CG performs several vector operations per iteration, including dot products (see Algorithm 6.2), but only a single SpMV operation. For small matrices that fit in the aggregate cache, therefore, the overhead of the vector operations can easily dominate the total execution time of the solver. With the exception of the very sparse parabolic_fem and offshore matrices, where the computation is dominated by the vector operations, CG greatly benefits from the symmetric storage formats, achieving a more than 50% overall performance improvement on large matrices. CSX-Sym is hindered by its preprocessing cost on smaller matrices and offers performance similar to or lower than the SSS format with the local vectors indexing optimization. On larger matrices, however, CSX-Sym compensates for its preprocessing cost and manages to offer a further performance improvement to the CG kernel.
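The following C sketch of a single CG iteration (with assumed helper names; spmv_sym() stands for any of the symmetric kernels discussed above, and the convergence test of Algorithm 6.2 is omitted) makes this imbalance explicit: one SpMV per iteration faces two dot products and three O(N) vector updates, so for matrices that fit in the aggregate cache the vector kernels easily dominate.

    #include <stddef.h>

    /* Assumed symmetric SpMV kernel: y <- A*x. */
    void spmv_sym(const void *A, const double *x, double *y, size_t n);

    static double dot(const double *a, const double *b, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    static void axpy(double *y, double a, const double *x, size_t n) /* y += a*x */
    {
        for (size_t i = 0; i < n; i++) y[i] += a * x[i];
    }

    /* One iteration of Algorithm 6.2. On entry *rr holds r'r of the current
     * residual; it is updated for the next iteration. q is a work vector. */
    void cg_iteration(const void *A, double *x, double *r, double *p,
                      double *q, double *rr, size_t n)
    {
        spmv_sym(A, p, q, n);                   /* q <- A*p (the only SpMV)  */
        double alpha = *rr / dot(p, q, n);
        axpy(x,  alpha, p, n);                  /* x <- x + alpha*p          */
        axpy(r, -alpha, q, n);                  /* r <- r - alpha*A*p        */
        double rr_new = dot(r, r, n);
        double beta = rr_new / *rr;
        for (size_t i = 0; i < n; i++)          /* p <- r + beta*p           */
            p[i] = r[i] + beta * p[i];
        *rr = rr_new;
    }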


[Figure 6.12 — per-matrix stacked bar chart; y-axis: time over multithreaded CG with CSR; bars: CSR, SSS indexed, CSX, CSX-Sym.]
Figure 6.12: CG execution time breakdown using 24 threads in Dunnington on the RCM-reordered matrices after 2048 iterations. Symmetric storage formats using the local vectors indexing technique provide a significant performance improvement; CSX-Sym is beneficial for large sparse matrices, where it is able to compensate for its preprocessing cost. [Execution time breakdown legend – colored: SpMV, black: SpMV reduction, anti-diagonal: vector operations, diagonal: CSX preprocessing.]

6.5 Summary

In this chapter, we focused on exploiting the symmetry of the non-zero elements of symmetric sparse matrices, in order to accelerate the SpMV kernel. By storing almost half of the original matrix's non-zero elements (the lower triangular part and the main diagonal), symmetric sparse matrix storage formats can significantly benefit SpMV performance, due to the relief of the memory subsystem. However, the multithreaded symmetric SpMV kernel must overcome a RAW dependency on the output vector, incurred by the SpMV operations performed on the symmetric (upper triangular) part of the matrix. The most common solution involves the use of local, per-thread vectors, where the intermediate results of the SpMV computation are written. In a second step, these vectors are reduced in parallel into the final output vector. The key performance problem of this step is that it incurs a significant overhead, increasing linearly with the thread count, which does not allow the SpMV kernel to scale. Based on the observation that the local vectors are quite sparse and that their sparsity increases with the thread count, we proposed a local vectors indexing scheme, which reduces only the conflicting output vector elements, thereby attenuating the memory bandwidth bottleneck of the reduction phase. We further optimize the symmetric SpMV by extending the CSX


format to support symmetric matrices as well. The so-called CSX-Sym variant achieves almost optimal compression of the sparse matrix and, in conjunction with the local vectors indexing scheme, manages a more than 2× acceleration of the symmetric SpMV kernel, while the impact of the proposed optimizations in the context of the CG iterative solver amounts to a more than 50% performance improvement for large symmetric sparse matrices.


7 Energy-efficiency considerations

Power-aware computing and energy-efficiency issues have been of particular interest in embedded computing for several years. As the level of integration of CMOS digital circuits heads toward its limits, energy-efficiency is gaining increasing interest, from the hardware up to the software level, in the HPC community as well. Leakage power is becoming increasingly important as circuit designs scale down to a few nanometers and tends to dominate the overall power dissipation of a chip. Techniques for dynamically switching off idle processor components now complement more traditional ones, like Dynamic Voltage and Frequency Scaling (DVFS), in an effort to minimize the processor's overall power dissipation. Nonetheless, application performance is a critical and definitive factor of HPC and should remain in the foreground. Balancing this tradeoff, therefore, is a key concern in implementing high-performance and energy-efficient systems and applications. In this chapter, we take a first step toward identifying and exploring the performance-energy tradeoffs of the SpMV kernel on modern multicore architectures by varying the processor's frequency and the placement of threads on the available cores. Employing the notion of Pareto optimality, we characterize the different tradeoffs and propose a methodology for selecting the SpMV execution configurations (processor frequency, thread placement) that lead to the best performance-energy compromises for a specific input matrix. This chapter opens a number of issues that need to be answered more precisely in the future and serves as food for thought for evaluating the real impact of software optimizations targeting energy-efficiency in the context of HPC.

7.1 Fundamentals of processor power dissipation

7.1.1 Sources of power dissipation

The power dissipation of a CMOS device at the elementary transistor gate level is formulated as a sum of three major components: switching loss, leakage and


short-circuit loss [Brooks et al., 2000; Kaxiras and Martonosi, 2008]. More specifically, the power dissipation P of a CMOS device is approximated by the following formula:

P = (1/2)·C·V²·α·f + P_leakage + P_short-circuit    (7.1)

The first term, also referred to as dynamic power, is the power dissipated by the switching of the device's transistors and is directly related to the circuit capacitance C, the supply voltage V, the clock frequency f of the device and the activity factor α. The circuit capacitance is a design parameter that depends largely on the wiring of the on-chip components. The activity factor is a fraction in the range [0, 1], denoting how often the device is clocked, i.e., how often it is actually used; whenever idle, devices are disconnected from the clock source in order to save energy (clock gating). The leakage power is the power dissipated due to the current that leaks from a transistor when its gate is switched off. Although transistors are viewed as ideal on-off switches, in practice a very small amount of current flows in the 'off' state, inducing a power loss. The short-circuit or glitching power is the power dissipated by the instantaneous short-circuit, due to imperfect state transition times, as a transistor switches between the on and off states. Dynamic power has been the dominant source of power dissipation in CMOS devices for several years. Its quadratic relation to the supply voltage has led to considerable savings in power consumption as new technology allowed this voltage to be safely reduced. The clock frequency has a double effect on the dynamic power dissipation: apart from its direct relation to the total dissipated power, as specified by equation (7.1), sustaining a higher frequency often requires a proportionally higher supply voltage, leading to a cubic relation between frequency and dynamic power [Brooks et al., 2000; Kaxiras and Martonosi, 2008]. This significant impact of frequency on the total dissipated power has led to the wider adoption of dynamic management of voltage and frequency, in order to reduce the overall power dissipation without compromising performance. Dynamic Voltage and Frequency Scaling (DVFS) techniques allow the selective scaling of both voltage and frequency in less CPU-intensive phases of an application, e.g., memory-bound or latency-tolerant parts, in order to achieve high performance at a lower power budget [Xie et al., 2003; Choi et al., 2004]. More advanced techniques separate the processor into multiple independent voltage-control domains, allowing the selective scaling of voltage in different processor components depending on the needs of the running application [Semeraro et al., 2002; Herbert and Marculescu, 2007; Mattson et al., 2010].
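As a toy illustration of this cubic relation (and not a model of any real processor), the C snippet below evaluates the dynamic term of equation (7.1) for Dunnington's two frequency steps under the simplifying assumption that the supply voltage scales proportionally with frequency; the capacitance and activity factor values are arbitrary placeholders.

    #include <stdio.h>

    /* Dynamic term of equation (7.1): P_dyn = 0.5 * C * V^2 * a * f. */
    static double p_dyn(double C, double V, double a, double f)
    {
        return 0.5 * C * V * V * a * f;
    }

    int main(void)
    {
        const double C = 1.0e-9, a = 0.5;         /* placeholder values           */
        const double f1 = 2.14e9, f2 = 2.67e9;    /* Dunnington's frequency steps */
        const double V1 = 1.0, V2 = V1 * f2 / f1; /* assume V scales with f       */
        printf("P_dyn(2.14 GHz) = %.3f W\n", p_dyn(C, V1, a, f1));
        printf("P_dyn(2.67 GHz) = %.3f W\n", p_dyn(C, V2, a, f2));
        /* The ratio is (f2/f1)^3, roughly 1.94 under this assumption. */
        return 0;
    }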


As the integration scale shrinks to a few nanometers, leakage power is becoming a significant factor of the power dissipation of modern processors. There are two types of leakage power: sub-threshold leakage and gate leakage. Sub-threshold leakage is due to the electric current that flows through a transistor in the off state, i.e., when its voltage is below the threshold denoting the off state. Sub-threshold leakage increases exponentially as this threshold voltage is lowered [Kaxiras and Martonosi, 2008]. The reduction of the threshold voltage comes as a result of an analogous reduction of the supply voltage V in equation (7.1), since a minimum voltage difference is required between the on and off states of the transistor. Gate leakage, on the other hand, occurs due to direct tunnelling of electrons through the insulator separating the gate and the transistor channel. As the insulator layer becomes too thin, due to the shrinking of the integration scale, the gate leakage increases exponentially and so does leakage power [Borkar, 1999; Kim et al., 2003; Agarwal et al., 2006]. It is clear that computer architecture is hitting a 'power wall', built chiefly by the tremendous increase in leakage power [Ramirez, 2011; Ahmed and Schuegraf, 2011], and drastic technology shifts are needed in order to keep Moore's law alive beyond the 22 nm barrier [Hisamoto et al., 2000; Yang et al., 2004; Skotnicki et al., 2005; Hu, 2012]. Modern microarchitectures tackle the leakage power problem by incorporating advanced power management. Apart from dynamic voltage and frequency scaling techniques, which have been around for some years [Semeraro et al., 2002], modern processors also allow the dynamic 'shut down' of idle cores. For example, IBM's latest eight-core four-way SMT Power7 processor can bring a single core into one of two 'sleep' modes, in order to save as much power as possible from idling cores [Kalla et al., 2010]. Similarly, Intel's Sandy Bridge microarchitecture dynamically alters the power state of different processor components, in order to optimize the total power dissipation, and also offers a power monitoring interface at the user level, so that software can take advantage of it and reduce the overall energy consumption [Rotem et al., 2012]. Closing the discussion on the sources of power dissipation in modern microarchitectures, we should make a specific note on caches. The integration of multiple fast cores into the same processor socket has placed a significant burden on the memory subsystem. Both in terms of latency and bandwidth, main memory cannot sustain the very high rates at which a modern processor can consume data, and this gap tends to grow exponentially with every new processor generation [Wulf and McKee, 1995]. Therefore, large caches come to fill in and hide this performance gap, and it is not unusual today to encounter caches as large as 24 MiB per socket. However, such large caches take up a significant portion of the total die area and, as a result, are chief contributors


to the processor's total leakage power [Kaxiras and Martonosi, 2008]. Several techniques have been proposed for reducing the leakage power loss from caches, ranging from specific memory cell technologies [Powell et al., 2000] to techniques for shutting down whole unused portions of the cache [Yang et al., 2001] or specific unused cache lines [Kaxiras et al., 2001; Flautner et al., 2001]. Such techniques are becoming mainstream and allow energy-efficient designs for caches even exceeding 20 MiB [Chang et al., 2009; Huang et al., 2012].

7.1.2 Energy-Delay products

From a software perspective, the minimization of the power dissipation of the underlying architecture is just one side of the coin: performance is the other, and it is quite critical as well. While there is a clear tradeoff between the power dissipation of a computer system and its performance, a high-performance system is not necessarily inefficient in terms of energy consumption. Power dissipation is nothing more than the rate at which a computer system consumes energy. Therefore, the faster the computation, the lower the total amount of energy consumed; after all, energy is what we pay for. For this reason, the flop/s per Watt metric is being widely adopted for assessing the energy efficiency of a computer system. As of this writing, the fastest supercomputer in the world, achieving the mighty performance of 17.6 Pflop/s, is also the third most energy-efficient, with a performance of 2.1 Gflop/s per Watt of dissipated power [Top500, 2012; Green500, 2012]. A closer look at the performance-per-Watt metric reveals that it is nothing more than the inverse of the energy consumption. A common way of assessing the performance-energy tradeoffs of a system's design are the energy-delay products, ED^n, n ≥ 0. For n = 0, the energy-delay product corresponds to the system's energy consumption, while larger values of n bias the tradeoff toward higher performance. As we shall see in Section 7.2, energy-delay products are optimal design tradeoffs under the Pareto optimality criterion, i.e., there exists no other design that improves one of the energy or performance objectives without compromising the other.

7.1.3 Energy-efficiency from a software perspective

Software applications can play a significant role in reducing the overall power dissipation of a computer system. The dynamic power of a processor is directly

The dynamic power dissipation of very large caches is not so significant, since they usually reside in a different voltage and frequency domain than the rest of the processor, running at much lower frequencies. The most energy-efficient system achieves 2.5 Gflop/s per Watt.


affected by the utilization of its units during the different phases of an application. In power terms, the utilization of a hardware unit can be viewed as an indication of its activity factor. Based on this assumption, Isci and Martonosi [2003] built a quite accurate model for predicting the power dissipation of processors based on the Intel NetBurst microarchitecture [Koufaty and Marr, 2003]. The authors infer the activity factor of 22 hardware components (caches, TLBs, execution units etc.) by calculating their utilization using hardware performance counters, and use a weighted sum of the per-component power predictions in order to calculate the overall dynamic power dissipation of the processor. A more detailed and generic approach to predicting the power dissipation of a modern processor is the McPAT power modeling framework [Sheng et al., 2009]. McPAT also predicts power dissipation through the use of performance monitoring events, but, in contrast to the model of Isci and Martonosi [2003], it is not bound to a specific architecture technology. Instead, it uses low-level power models for modeling the behavior of the fundamental components of a chip multiprocessor, allowing the exploration of design tradeoffs for new architectures. Collecting a multitude of performance monitoring events, in order to obtain an accurate power estimate, might require multiple runs of the profiled application, since the number of performance monitoring registers is rather restricted in current microarchitectures. As a result, such methods are not suitable for online power estimation of a running application. Curtis-Maury et al. [2008] take a hybrid offline/online approach to predicting the execution configuration (thread count and core frequency) that provides an optimal balance between performance and energy consumption. The proposed runtime system collects performance monitoring information from a running application and feeds a regression model trained offline, which predicts the optimal configuration for the next phase of the application. In effect, this technique intends to allocate the minimum of resources (frequency, thread count) to an application, so that it proceeds within a low power budget without compromising performance. A similar hybrid approach is adopted by Singh et al. [2009] in a power-aware thread scheduling scheme that tries to keep the overall system's power dissipation within a user-specified envelope. While software control over dynamic power consumption is feasible, thanks to the smart use of frequency scaling and thread placement, software control of leakage power is not yet widely adopted. In order to achieve this, the software must be able to 'switch off' specific functional units of the processor (ALU, FPU etc.) or deactivate unused parts of the cache, depending on its own needs. Assuming hardware support for such fine-grained leakage power control, Zhang et al. [2003] and You et al. [2006] explore compiler techniques for identifying basic blocks where a specific functional unit is unused and instrument the


resulting code with explicit activation/deactivation instructions for controlling the state of a functional unit. Similarly, Zhang et al. [2002] assume fine-grained control of the instruction cache lines and propose compiler optimization techniques for selectively putting cache lines into a low-leakage mode.
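To fix ideas, the sketch below shows the general shape of the counter-based power estimates discussed in this section: per-component utilizations derived from event counts act as proxies for the activity factors and scale each component's peak dynamic power, on top of an idle baseline. The component structure and weights are placeholders, not the published per-unit values of any of the cited models.

    #include <stddef.h>

    struct component {
        double max_power;      /* W at full activity (placeholder weight) */
        double events;         /* event count attributed to this unit     */
        double events_at_max;  /* events per cycle at full utilization    */
    };

    /* Weighted-sum power estimate: utilization (clipped to [0,1]) is used as
     * a proxy for each component's activity factor. */
    double estimate_power(const struct component *c, size_t ncomp,
                          double cycles, double idle_power)
    {
        double p = idle_power;
        for (size_t i = 0; i < ncomp; i++) {
            double util = c[i].events / (c[i].events_at_max * cycles);
            if (util > 1.0)
                util = 1.0;
            p += util * c[i].max_power;
        }
        return p;
    }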

7.2 Performance-Energy tradeoffs in the SpMV kernel

Sparse Matrix-Vector Multiplication is a memory bandwidth bound kernel. This has been made clear by the analysis in the previous chapters, where we have proposed original techniques for attacking this key performance problem. We have shown that once the memory path is completely saturated, there is almost no point in increasing the concurrency of the kernel, since the performance gain is minimal. This observation is important from a power perspective, since adding more cores to the computation increases the power dissipation for only a marginal gain in performance. Thread placement also plays a significant role in the execution of the SpMV kernel. A sane core-filling policy can allow SpMV to scale, by delaying the saturation of the memory bandwidth through a better balancing of the memory traffic. For example, the 'share-nothing' core-filling policy, presented in Chapter 3, distributes the threads across all the available sockets, so that minimum resource sharing is achieved. The advantage of this policy is the minimization of the contention for the shared resources of the memory data path (e.g., caches, bus interface, memory controller etc.). Specifically for the shared caches, such a placement increases the size of the system's aggregate cache available to the computation, therefore further relieving the main memory subsystem. As a result, SpMV scales at a steady pace up to the 12-threaded configuration in Dunnington, before the memory bus is eventually saturated (see Chapter 3, Section 3.3.2). Unfortunately, such a performance improvement is not a free lunch. Not only does the increased processor utilization raise its power dissipation, but, more importantly, the use of multiple caches can lead to a considerable increase in the total power dissipation of a system that supports dynamic leakage control and is able to put unused caches, or even whole unused processors, into a 'sleep mode'. This increase in power dissipation, however, may be compensated by the faster execution time and a possible frequency scaling through the DVFS mechanism. Finally, the performance of the SpMV kernel is directly related to the structure of the input matrix. Apart from the implications a sparse matrix may have on SpMV performance (memory bottleneck, irregular accesses, load imbalances etc.), the energy footprint of SpMV is likely to vary from matrix to matrix. For example, it might be a waste of energy to assign multiple


sockets to the computation on an irregular matrix with load imbalances, while such a placement is reasonable for a more regular one. It is therefore important to quantify and consider such information in the effort to predict the optimal performance-energy tradeoffs for the SpMV kernel.

7.2.1 Experimental setup

For the evaluation of the performance-energy tradeoffs of the SpMV kernel, we have used the Dunnington system for the performance measurements (see Chapter 3, Section 3.2) and the McPAT framework [Sheng et al., 2009] for detailed power estimations. The Dunnington system comprises four Intel X7460 packages built at 45 nm technology. Each X7460 package is essentially a set of three dual-core modules with a 3 MiB shared cache, and the whole package hosts a 16 MiB unified L3 cache. The processor supports DVFS and offers two frequency steps, namely 2.14 GHz and 2.67 GHz (full speed). Each frequency step is associated with a specific supply voltage, which is regulated accordingly upon transition to the specified frequency [Int, 2008]. We use the cpufrequtils userspace tools in Linux for controlling the processor's frequency. The alternative thread placements we consider fall into three categories, depending on the sharing of common resources:

(a) Full sharing, in which threads are placed as close as possible, sharing all the levels of the cache hierarchy (same as the 'share-all' policy described in Chapter 3).
(b) Semi sharing, in which threads are placed so that they share only the L3 cache, but not the L2 cache.
(c) No sharing, in which threads are placed as sparsely as possible, so as to minimize the sharing of the cache hierarchy (same as the 'share-nothing' policy described in Chapter 3).

We use the notation XpYsZc to describe a placement using Z cores in total, Y sub-packages (shared L2 components) and X physical packages or sockets (shared L3). For example, the possible thread placements for four threads are 1p2s4c, 2p4s4c and 4p4s4c. We consider in total 26 different thread placements for the following thread counts: 1, 2, 3, 4, 6, 8, 9, 12, 15, 18, 24. The McPAT framework predicts power dissipation based on specific processor event counts (e.g., core cycles, cache accesses/misses etc.). Since it is intended for use with processor simulators, it supports a very detailed list of events, not all of which are available in real processors. For this reason, we included events that are a close match to the ones required by McPAT and that contribute to the overall power dissipation of the processor [Isci and Martonosi, 2003; Singh et al., 2009]. In total, we included 14 event counts, which are


Event Mnemonic              Description
Core events:
  UnHalted Core Cycles        Unhalted core cycles
  UOPS_RETIRED.ANY            Micro-ops retired
  SIMD_UOPS_EXEC              SIMD micro-ops executed
Branch unit events:
  Branch Instruction Retired  Branch instructions retired
  Branch Misses Retired       Mispredicted branch instructions retired
Cache events:
  L1I_READS                   Instruction fetches
  L1I_MISSES                  Instruction fetch unit misses
  L1D_REPL                    Cache lines allocated in the L1 data cache
  L1D_ALL_REF.ANY             All references to the L1 data cache
  L2_RQSTS                    L2 cache requests
  L2_LINES_IN                 L2 cache misses
  LLC Reference               Last level cache references
  LLC Misses                  Last level cache misses
Bus events:
  BUS_TRANS_ANY               All bus transactions

Table 7.1: Performance monitoring events used for the prediction of power dissipation by the McPAT framework. For more information on the performance monitoring events of Intel microarchitectures, please refer to [Int, 2010] using the event mnemonic.

detailed in Table 7.1. The Intel Core microarchitecture of Dunnington includes only two programmable performance monitoring counters, so multiple runs of the SpMV kernel were required to collect all the event counts. Finally, we do not include in our overall power measurements the idle power of unused processor packages, as if we were able to put them in a low-leakage sleep mode.

7.2.2 Characterizing the tradeoffs

Figure 7.1 presents the performance-energy tradeoffs of 52 execution configurations (thread placement, core frequency) of the SpMV kernel in Dunnington for four sparse matrices with different computational characteristics. More specifically, xenon2 and parabolic_fem are both small matrices, fitting in the system's aggregate cache, but the latter is quite irregular. Conversely, boneS10 and thermal2 are both too large to fit in the system's aggregate cache, but the second has an irregular structure. All matrices were stored in the CSX


format. Performance and dynamic energy consumption (dynamic power × execution time) are normalized to the overall best performance and the maximum energy consumption, respectively. The first key observation is that the performance-energy landscape is rather broad and two execution configurations might lie quite far apart in terms of energy-efficiency. Additionally, at the same energy level there exist multiple configurations with varying performance and vice versa. This calls for a method for characterizing each tradeoff, an issue we discuss later in this section. Another key observation is that the performance-energy landscape differs between small and large regular matrices, while for irregular ones the difference is not as wide. A closer examination of the landscape for matrix xenon2 (31 MiB in CSX) reveals that the most energy-efficient execution configuration is the one using six threads spread sparsely (no sharing of L2 caches) across two packages and clocked at 2.14 GHz. At the other end, using two packages and eight threads packed together to share the L2 caches at 2.67 GHz is the least energy-efficient configuration, consuming almost 2.5× more energy for a fraction of the performance. This is a typical example of how beneficial a correct thread placement at a lower frequency can be. Although the six-threaded configuration uses more shared hardware resources, the lower frequency allows a 35% saving in power dissipation compared to the high-frequency eight-threaded configuration (4.4 W vs. 6.8 W). From a performance perspective, the six-threaded configuration is also more beneficial, since it eliminates the contention in the shared L2 caches, which now act as private caches, and offers a significantly higher performance (+37%), despite the lower frequency. A key observation, common in the more regular matrices, is the steep increase in energy-efficiency. For example, the 3p9s9c execution configuration at 2.14 GHz for xenon2 has twice the performance of the most energy-efficient 2p6s6c with almost the same energy consumption. This is due to an almost equal increase in performance and power consumption, which is typical of the linear scaling that SpMV encounters on regular matrices when more sockets are added to the computation. Despite the same steep increase in energy-efficiency, large matrices, exceeding the system's aggregate cache, exhibit a slightly different performance-energy landscape. For example, the most energy-efficient configuration for boneS10 is the 1p2s2c two-threaded configuration at 2.14 GHz, while the least energy-efficient is the 15-threaded one at 2.67 GHz. It is worth noting the memory bottleneck and its effect on energy consumption. The most performant configurations for boneS10 are the 12-threaded and the eight-threaded configurations using all four sockets at 2.67 GHz. The memory bus is already saturated from

The doubling of performance when moving from six to nine threads is due to the comfortable fit of the working set in the aggregate cache of the three packages.


[Figure 7.1 — normalized performance versus normalized dynamic energy scatter plots at 2.14 GHz and 2.67 GHz: (a) small, regular (xenon2); (b) small, irregular (parabolic_fem); (c) large, regular (boneS10); (d) large, irregular (thermal2); selected configurations such as 1p2s2c, 2p6s6c, 2p4s8c, 4p12s12c, 4p12s15c and 4p12s24c are annotated.]
Figure 7.1: Performance-energy tradeoffs of the SpMV kernel for a set of matrices with different performance characteristics. Execution configurations are in the form XpYsZc, denoting that X sockets, Y sub-packages (shared L2 components) and Z cores are used in total.


the eight-threaded configuration, and adding four more threads offers only a marginal performance improvement; trying to add even more threads is a pure waste of energy, which is eventually doubled at the 15-threaded configuration⁴. The landscape of the performance-energy tradeoffs for irregular matrices is similar to that of large regular matrices, but with a less steep increase in energy-efficiency. Matrix size does not seem to play a significant role, since both the small parabolic_fem and the larger thermal2 have the same most and least energy-efficient configurations, namely the 1p2s2c at 2.14 GHz and the 15-threaded at 2.67 GHz, respectively. This is to be expected, since the key performance problem of these matrices is not the memory bandwidth bottleneck (see Chapter 3); therefore, matrix size is not very critical. For the same reason, there is no such steep increase in energy-efficiency; since SpMV's scaling is hindered, the gain in performance from the use of more cores is less important than the resulting increase in power dissipation.

Optimal tradeoffs

Having examined the performance-energy landscape of the SpMV kernel in more detail, the question that arises is what constitutes a good tradeoff. Apparently, points in the upper left corner of the performance-energy landscape are good tradeoffs, since they maximize performance while keeping energy requirements low. However, the question now is how we could compare tradeoffs against each other and whether there exists a single best tradeoff. Suppose two points, (e, p) and (e′, p′), in the performance-energy landscape, corresponding to execution configurations E and E′. If e < e′ and p ≥ p′, then E is definitely a better tradeoff than E′, since with a lower energy consumption it achieves at least the same performance; the same is true for two configurations where e ≤ e′ and p > p′. In this case, configuration E strictly dominates configuration E′. On the other hand, if e ≤ e′ and p ≤ p′, we cannot assess whether E is a better tradeoff than E′. The set of points (and the corresponding execution configurations) that are not dominated by any other point in the performance-energy landscape form the non-dominated Pareto front [Luke, 2011]. The configurations on the Pareto front are the set of optimal tradeoffs, since they are strictly better than any other configuration in at least one objective (performance or energy). An important property of the Pareto front is that, when moving from one point to another, we experience the least possible loss in energy consumption for the greatest gain in performance, and vice versa. In that sense, all points of the Pareto front are formally equivalent tradeoffs.
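To make the dominance test concrete, the following minimal Python sketch (illustrative only; the configuration names and power/time numbers are hypothetical, not measurements from this work) builds the normalized performance/dynamic-energy points used in Figure 7.1 and extracts the non-dominated Pareto front.

```python
from typing import Dict, Tuple

# Hypothetical measurements: configuration -> (dynamic power in W, execution time in s).
measurements: Dict[str, Tuple[float, float]] = {
    "2p6s6c@2.14GHz": (4.4, 1.00),
    "2p4s8c@2.67GHz": (6.8, 1.37),
    "3p9s9c@2.14GHz": (8.8, 0.50),
    "1p2s2c@2.14GHz": (1.8, 2.20),
}

def normalized_points(meas):
    """Return {config: (energy, performance)}: dynamic energy = power x time,
    normalized to the maximum energy; performance = 1/time, normalized to the best."""
    energy = {c: p * t for c, (p, t) in meas.items()}
    perf = {c: 1.0 / t for c, (_, t) in meas.items()}
    e_max, p_max = max(energy.values()), max(perf.values())
    return {c: (energy[c] / e_max, perf[c] / p_max) for c in meas}

def dominates(a, b):
    """Strict dominance: a is no worse than b in both objectives and differs from b."""
    return a[0] <= b[0] and a[1] >= b[1] and a != b

def pareto_front(points):
    """Configurations whose (energy, performance) point is dominated by no other."""
    return {c: pt for c, pt in points.items()
            if not any(dominates(q, pt) for q in points.values())}

points = normalized_points(measurements)
print(sorted(pareto_front(points)))
```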

⁴ The 15-threaded configuration also suffers from load imbalance, since six threads must share the L2 cache, while the rest use it exclusively.


Therefore, there is not a single best tradeoff between performance and energy (unless the Pareto front collapses to a single point), but rather a set of best tradeoffs.

Optimality of energy-delay products  Each tradeoff corresponding to a minimal energy-delay product $ED^n$ for some $n \ge 0$ is an optimal tradeoff in the Pareto sense. This can be easily proved: assume that the execution configuration $M$ minimizes the $ED^n$ product for some $n \ge 0$, i.e., $e_M d_M^n < e_{M'} d_{M'}^n$ for every $M' \ne M$. If we suppose that $(e_M, d_M)$ does not lie on the Pareto front, then there must be a configuration $C$ such that either $e_C < e_M$ and $d_C \le d_M$, or $e_C \le e_M$ and $d_C < d_M$. Therefore, the energy-delay product of this configuration will be $e_C d_C^n \le e_M d_M^n$ for $n = 0$ or $e_C d_C^n < e_M d_M^n$ for $n > 0$, which contradicts our initial assumption that the configuration $M$ minimizes the energy-delay product. Therefore, $M$ is a non-dominated solution, i.e., a point on the Pareto front. As a corollary, every point of the Pareto front corresponds to a minimal energy-delay product for some $n \ge 0$.

The most typical energy-delay products used in practice are the energy (E), energy-delay (ED) and energy-delay squared (ED²) products, in increasing bias toward higher performance. Figure 7.2 shows the Pareto front for the xenon2 and parabolic_fem matrices along with the ED and ED² products.
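Continuing the illustrative sketch above (reusing the hypothetical measurements dictionary), the minimal-$ED^n$ selection can be written directly from the definition; by the argument just given, whatever configuration it returns is one of the Pareto-optimal ones.

```python
def min_energy_delay(meas, n):
    """Configuration minimizing e * d**n, with e = power x time (dynamic energy)
    and d = time (delay); n = 0, 1, 2 give the E, ED and ED^2 products."""
    return min(meas, key=lambda c: (meas[c][0] * meas[c][1]) * meas[c][1] ** n)

for n, name in ((0, "E"), (1, "ED"), (2, "ED2")):
    print(name, "best:", min_energy_delay(measurements, n))
```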

As SpMV kernel’s performance depends heavily on the input matrix structure, it is desirable, given an unknown matrix, to be able to predict those multicore execution congurations that lead to an ideal performance-energy tradeoff, in order to spare hardware resources and achieve possible energy savings. Ac- cording to the discussion and experimental evaluation in Chapter 3, two of the most denitive parameters for the SpMV kernel performance are the matrix size and the op:byte ratio of the matrix, which is directly related the average non-zero elements per row. If the matrix is small enough to t in the aggre- gate cache of the underlying architecture, SpMV will experience a signicant performance improvement, since the contention for main memory bandwidth is eliminated. However, there exist matrices where the memory bandwidth contention is not the key performance problem, as these suffer from irregular accesses, loop overheads, load imbalances etc. We showed in Chapter 3 that there exist a signicant correlation between the op:byte ratio of a matrix and its performance, while in the last section, we identied also differences in the performance-energy landscape of SpMV, depending on the size and the spar- sity of the matrix. is information can be therefore useful in the prediction of the optimal execution congurations for the SpMV kernel.


[Figure 7.2 plots performance (norm.) versus dynamic energy (norm.) at 2.14 GHz and 2.67 GHz, marking the Pareto front and the ED- and ED²-best configurations, for two panels: (a) xenon2 matrix; (b) parabolic_fem matrix.]

Figure 7.2: e optimal performance-energy tradeoffs for two sparse matrices. e ¨ points on the Pareto front are annotated with the corresponding execution ¨ © congurations. e congurations minimizing the energy-delay products © are also shown.

Our goal is to predict a broad set of optimal execution configurations; for this reason, we rely on a machine learning approach based on clustering, in order to approximate the Pareto front of the optimal configurations. The key concept behind this idea is that we expect matrices with similar structural characteristics, e.g., large regular matrices, to have similar performance-energy tradeoffs. The learning procedure is based on an initial clustering of matrices according to their size and their average number of non-zero elements per row. For each cluster, we construct a representative Pareto front from the execution configurations that lie on the Pareto fronts of the matrices in the cluster. Each cluster is then represented by its geometric center (in the space of the matrix attributes) and its Pareto front.

7.3.1 Clustering the matrices

In order to obtain a representative set of matrices and more samples for training our model, we have expanded our initial matrix suite (see Chapter 3, Table 3.1) to contain 50 sparse matrices, all selected from the University of Florida Sparse Matrix Collection [Davis and Hu, 2011]. Each matrix is assigned a


vector of two attributes reflecting its sparsity and size. Since clustering depends on the distances of the matrices in the attribute space, it is crucial that the attribute values represent a valid similarity measure. For example, considering as the size attribute the matrix size relative to the total aggregate cache of the underlying system can lead to misleading results, since very large matrices, such as boneS10, are likely to falsely become outliers, i.e., not belonging to any cluster. In practice, we do not care how large a matrix is if it significantly exceeds the system's aggregate cache; therefore, such a metric is not a valid similarity measure. The same is true for a purely absolute or relative average-non-zeros-per-row metric. For this reason, we assign values in a stepwise fashion, trying to better reflect the similarity of the matrices. More specifically, assuming C is the system's aggregate cache and S the matrix size, we calculate the size attribute as follows, identifying different cases for matrices fitting in the aggregate cache of two, three or four sockets (full system)⁵:

$$a_m = \begin{cases} 1, & S/C \in (0, 1/2) \\ 1.5, & S/C \in [1/2, 3/4) \\ 3, & S/C \in [3/4, 1) \\ 3.5, & S/C \in [1, \infty) \end{cases} \qquad (7.2)$$

Similarly, we set the threshold for very sparse matrices at 15 non-zero elements per row and calculate the sparsity attribute as follows:

$$a_s = \begin{cases} \dfrac{1}{15}\cdot\dfrac{NNZ}{N}, & \dfrac{NNZ}{N} \le 15 \\[1ex] 2, & \dfrac{NNZ}{N} > 15 \end{cases} \qquad (7.3)$$
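Equations (7.2) and (7.3) translate directly into code; the sketch below is a plain transcription (the function names are ours), with S, C and the 15 non-zeros-per-row threshold as defined in the text.

```python
def size_attribute(matrix_bytes, aggregate_cache_bytes):
    """Stepwise size attribute a_m of Eq. (7.2), based on the ratio S/C."""
    r = matrix_bytes / aggregate_cache_bytes
    if r < 0.5:
        return 1.0      # fits in the aggregate cache of two sockets
    if r < 0.75:
        return 1.5      # fits in the aggregate cache of three sockets
    if r < 1.0:
        return 3.0      # fits in the full system's aggregate cache
    return 3.5          # exceeds the system's aggregate cache

def sparsity_attribute(nnz, nrows):
    """Sparsity attribute a_s of Eq. (7.3), saturating at 15 non-zeros per row."""
    nnz_per_row = nnz / nrows
    return nnz_per_row / 15.0 if nnz_per_row <= 15 else 2.0
```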

Figure 7.3 shows the matrix clusters formed using a hierarchical clustering technique. Our attribute value assignment allowed the clustering algorithm to correctly identify the four basic matrix categories, namely 'small and sparse', 'small and regular', 'large and sparse' and 'large and regular', without leaving irrelevant outliers.
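The text does not specify which clustering implementation was used; as one possible realization, a hierarchical clustering of the two-attribute vectors can be obtained with SciPy roughly as follows (the toy data, the Ward linkage and the centroid computation are our assumptions).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# attrs[i] = (a_m, a_s) of matrix i, computed with the attribute functions above.
attrs = np.array([[1.0, 0.3], [1.0, 0.4], [1.0, 1.8], [1.5, 2.0],
                  [3.5, 0.3], [3.5, 0.5], [3.5, 1.9], [3.0, 2.0]])

Z = linkage(attrs, method="ward")                 # hierarchical clustering
labels = fcluster(Z, t=4, criterion="maxclust")   # cut the dendrogram into 4 clusters

# One representative point per cluster in the attribute space (the thesis uses
# the geometric median; the centroid is used here only for brevity).
centers = {c: attrs[labels == c].mean(axis=0) for c in np.unique(labels)}
```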

7.3.2 Constructing the cluster Pareto front

Assuming that the clustering of the matrices correctly reflects their common performance characteristics, the cluster Pareto front will consist of the most common execution configurations present in the Pareto fronts of every matrix in the cluster.

⁵ We do not include a case for matrices fitting in the L3 cache of a single socket, since our suite does not contain such small matrices.


[Figure 7.3 shows the clustering dendrogram (distance on the vertical axis) over the 50 matrices of the extended suite, with the four clusters labeled 'Small regular', 'Small irregular', 'Large regular' and 'Large irregular'.]

Figure 7.3: e four major clusters in our extended matrix suite.

In order to construct the cluster Pareto front, we start iterating over the Pareto fronts of the matrices in the cluster from the most performant configuration to the most energy-efficient, i.e., from the upper-rightmost Pareto point down to the leftmost. During the iteration, we record the execution configuration that corresponds to the current point for every matrix in the cluster and select the most frequent one. If two or more configurations have the same frequency, we select all of them if their frequency is more than 25% of the cluster size; otherwise, we select the one with the least number of threads, in order to better cover the low-energy configurations. We also use this technique to provide predictions for the configurations minimizing the ED and ED² products.

Avoiding overfitting  The Pareto fronts of the different matrices in a cluster do not necessarily have the same number of points. The proposed method for constructing the cluster Pareto front therefore entails the risk of creating a Pareto front with so many points that it will not be representative of the cluster, covering even corner cases. To avoid this overfitting, we stop the construction of the Pareto front as soon as we 'run out' of points for more than half of the matrices in the cluster. Figure 7.4 shows an example of the construction of the cluster Pareto front.
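A sketch of this construction, including the tie-breaking and the early-stop rule, is given below (illustrative only; the per-matrix fronts and the thread_count mapping are assumed inputs).

```python
from collections import Counter

def cluster_pareto(fronts, thread_count):
    """Build a representative cluster Pareto front.

    fronts: per-matrix Pareto fronts, each ordered from the most performant
            configuration down to the most energy-efficient one.
    thread_count: configuration name -> number of threads (for tie-breaking).
    """
    cluster_size = len(fronts)
    cluster_front = []
    pos = 0
    while True:
        # Configurations found at position `pos` of each matrix front.
        present = [f[pos] for f in fronts if pos < len(f)]
        # Early stop: more than half of the matrices have run out of points.
        if len(present) < cluster_size / 2.0:
            break
        counts = Counter(present)
        top = max(counts.values())
        tied = [c for c, k in counts.items() if k == top]
        if len(tied) > 1 and top > 0.25 * cluster_size:
            cluster_front.extend(tied)     # keep all equally frequent configurations
        else:
            # Otherwise keep the one with the fewest threads, to better cover
            # the low-energy end of the front.
            cluster_front.append(min(tied, key=lambda c: thread_count[c]))
        pos += 1
    return cluster_front
```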

7.3.3 Classification and testing

After the training procedure is over, we store the cluster centers and the associated cluster Pareto fronts for future use. The cluster center is the geometric median of its comprising matrices in the attribute space, while the cluster Pareto front is constructed as described before. These two features characterize every discovered matrix category.


[Figure 7.4 lists, rank by rank, the execution configurations of Matrix #1, Matrix #2 and Matrix #3 and the resulting Cluster Pareto entry.]

Figure 7.4: Construction of the cluster Pareto front for a cluster of three matrices. Bold typeface shows the configurations that are selected to be part of the constructed Pareto front.

When a new, unseen matrix appears, we calculate its size and sparsity attributes according to equations (7.2) and (7.3), and classify it to the closest matrix cluster in the attribute space. We then retrieve the execution configurations of the Pareto front of the matching cluster and assign them to the new matrix.

In order to assess the quality of our predictions, we assign to every execution configuration of the predicted Pareto front a rank reflecting its optimality. To achieve this, we first compute iteratively a set of Pareto fronts for all the performance-energy tradeoffs of the target matrix. The first Pareto front is the set of optimal configurations as described in Section 7.2.2 (optimality rank zero). In every iteration i, we omit the points of the previously computed fronts and compute a new Pareto front; the points of this front are the i-th best execution configurations (optimality rank i). Figure 7.5 shows an example of the first three Pareto fronts of matrix thermal2. Each point of the predicted Pareto front, therefore, is assigned the optimality rank of the real Pareto front it lies on. The average rank of all predicted configurations is the overall quality measure of our prediction; we call this measure the rank of the predicted Pareto front (a sketch of this evaluation is given after the cross-validation steps below).

We test our prediction method through a five-fold cross-validation (5-CV) technique. Cross-validation is a common statistical technique for assessing the accuracy of a prediction model and consists of the following steps:

(a) Shuffle the initial data set uniformly and split it into a fixed number of equal partitions (also known as folds). In our case, the data set is the matrix suite of 50 matrices, which is split into five folds.

(b) Keep one fold for testing and use the rest for training.

(c) Repeat the previous step until all folds have been tested.
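The classification and rank-based evaluation described above can be sketched as follows (reusing pareto_front from the earlier sketch; the attribute vectors and cluster centers are assumed to be available).

```python
def classify(matrix_attrs, centers):
    """Return the id of the cluster whose center is closest in the attribute space."""
    return min(centers, key=lambda c: sum((a - b) ** 2
                                          for a, b in zip(matrix_attrs, centers[c])))

def optimality_ranks(points):
    """Peel off successive Pareto fronts: rank 0 is optimal, rank i the i-th best."""
    ranks, remaining, rank = {}, dict(points), 0
    while remaining:
        for cfg in pareto_front(remaining):   # pareto_front from the earlier sketch
            ranks[cfg] = rank
            del remaining[cfg]
        rank += 1
    return ranks

def predicted_pareto_rank(predicted_cfgs, points):
    """Average optimality rank of the predicted configurations (lower is better)."""
    ranks = optimality_ranks(points)
    return sum(ranks[c] for c in predicted_cfgs) / len(predicted_cfgs)
```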


[Figure 7.5 plots performance (norm.) versus dynamic energy (norm.) for the configurations of thermal2 at 2.14 GHz and 2.67 GHz.]

Figure 7.5: e three rst Pareto fronts of the matrix thermal2. e execution cong- urations on the rst front are optimal, while those lying on the i-th front constitute the i-th best solutions.

Table 7.2 summarizes the results of the 5-CV test. For each fold, we report the rank of the predicted Pareto front and the percentage of predictions with a rank below a certain threshold. Our prediction methodology manages to provide rather accurate predictions, with an average prediction rank close to zero. In fact, 90% of the predicted Pareto fronts have a rank below one on average, while all have a rank below two. Figure 7.6 shows typical predictions for matrices from every cluster. The predictions for regular matrices are quite accurate and tend to become a perfect match for the larger ones. On the other hand, predictions for very sparse matrices, e.g., thermal2, tend to be slightly less accurate. This should be expected, since our sparsity attribute (average non-zero elements per row) accurately reflects only the loop overheads of very sparse matrices. Although very sparse matrices are likely to also suffer from irregular accesses and load imbalances, this rule has exceptions; for example, a reordered version of an irregular matrix does not have such performance limitations (see Chapter 3, Section 3.3.2). However, we have chosen not to fragment the category of very sparse matrices further, since our initial data set is too small to allow accurate training of such sub-categories.

Finally, Table 7.3 summarizes the accuracy of the energy-delay predictions. More specifically, for each of the three products selected, namely energy, energy-delay and energy-delay squared, we report how far our predictions lie from the original minimal products. For example, our predictions for the most energy-efficient configurations lead to an average of 7.6% more energy consumption than


[Figure 7.6 plots performance (norm.) versus dynamic energy (norm.) at 2.14 GHz and 2.67 GHz, marking the predicted tradeoffs and the actual and predicted ED- and ED²-best configurations, for four panels: (a) small regular (torso3); (b) small irregular (parabolic_fem); (c) large regular (boneS10); (d) large irregular (thermal2).]

Figure 7.6: Predictions of the optimal execution configurations and of the minimal ED and ED² products. The rank-zero and rank-two Pareto fronts are also shown.


Fold No.   Prediction Rank   Rank < 1   Rank < 2
#1         0.45              100.0%     100.0%
#2         0.27               77.8%     100.0%
#3         0.37               88.9%     100.0%
#4         0.19              100.0%     100.0%
#5         0.27               88.9%     100.0%
Average    0.31               91.1%     100.0%

Table 7.2: Cross-validation results for the proposed prediction methodology. The rank of the predicted Pareto front and the percentage of predictions with a rank below a certain threshold are depicted.

Fold No.   Energy   ED      ED²
#1          6.9%     1.6%    6.6%
#2         12.5%    10.5%   17.7%
#3         13.2%     4.1%    8.0%
#4          2.1%     0.9%    0.0%
#5          3.5%     8.5%   14.2%
Average     7.6%     5.1%    9.3%

Table 7.3: Energy-delay prediction accuracy. The average per-fold distances from the original minimal products are depicted.

the optimal. The predictions for the rest of the considered products are of similar accuracy, with the ED products being approximated with a 5.1% difference and the ED² products with 9.3% on average.
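For reference, the 'distance from the minimal product' reported in Table 7.3 corresponds to the relative overhead of the predicted configuration's product over the true minimum (averaged per fold); a tiny illustration with made-up numbers:

```python
def product_overhead(predicted_product, minimal_product):
    """Relative distance of a predicted energy-delay product from the true minimum."""
    return predicted_product / minimal_product - 1.0

print(f"{product_overhead(10.75, 10.0):.1%}")   # prints 7.5%
```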

7.3.4 Limitations

The proposed methodology for predicting the SpMV execution configurations that lead to optimal performance-energy tradeoffs is able to provide rather accurate predictions for a diverse set of sparse matrices. However, some limitations still exist. First, despite the accurate execution configuration predictions, our method cannot currently predict how far apart these configurations lie in the performance-energy landscape. For example, the configurations 1p2s2c and 4p8s8c at 2.14 GHz for boneS10 (Figure 7.1), although equivalent in terms of Pareto optimality, consume almost the same amount of dynamic energy for a more than 3× performance difference; it is clear that 1p2s2c could have been omitted as a 'not-so-ideal' tradeoff in practice. This limitation can be lifted with a looser construction of the Pareto front, which would


include only points whose objectives differ by more than a certain threshold. A second limitation involves the matrix clustering. While the matrix-size-related attribute is well formed, the sparsity-related attribute does not account for specific intricacies of very sparse matrices that affect SpMV performance, e.g., an irregular non-zero element structure and load imbalances. As a result, a deterioration in accuracy is observed in such cases. Including this information in the clustering requires a full matrix scan and a reasonable quantification of these characteristics that would correctly reflect the similarity or dissimilarity of the matrices in terms of performance.

7.4 Open issues

In this chapter, we have taken a first approach at identifying the performance-energy tradeoffs of the SpMV kernel and predicting the execution configurations that lead to optimal performance-energy compromises, depending on the input matrix. A number of issues, though, still remain open and should be addressed in the future. Our approach focused solely on the dynamic power and energy consumption of the processor, ignoring leakage power dissipation. According to McPAT's predictions, the leakage power of the four sockets of the Dunnington system amounts to 88 W, while the maximum dynamic power recorded was slightly over 50 W. On average, matrices fitting in the system's aggregate cache contributed a 37% increase to the combined power dissipation of all processors in the system, while the total processor power dissipation for large matrices was only 12.5% higher. This difference is quite expected, since SpMV under-utilizes the processor for large matrices (IPC < 1) due to the memory bottleneck. Nonetheless, even these numbers tell only half the truth, since main memory power dissipation is not considered. In fact, main memory operations are quite power-hungry, consuming almost 50× more energy than the processor's arithmetic operations [Dongarra, 2012]. Indeed, Kamil et al. [2008] show that memory-intensive benchmarks, such as STREAM and CG, are more power-hungry than CPU-intensive ones at full-system scale, and the SpMV kernel is no exception. It is therefore important to consider the performance-energy tradeoffs of the SpMV kernel in a full-system utilization context. Despite the several questions that remain to be answered in this direction, we believe that the preliminary results presented in this chapter and the proposed methodology for identifying and predicting the optimal execution configurations will remain relevant also in a full-system performance-energy analysis, where main memory power dissipation will play a definitive role.


8 Conclusions

This thesis focused on the optimization of the Sparse Matrix-Vector Multiplication (SpMV) kernel for modern multicore architectures. We performed an in-depth performance analysis of the kernel and identified its major performance bottlenecks. This allowed us to propose an advanced storage format for sparse matrices, the Compressed Sparse eXtended (CSX) format, which specifically targets the minimization of the memory footprint of the sparse matrix. This format provides significant performance improvements for the SpMV kernel on a variety of matrices and multicore architectures, while maintaining considerable performance stability. Finally, we investigated the performance of the SpMV kernel from an energy-efficiency perspective, in order to identify the execution configurations that lead to optimal performance-energy tradeoffs. This chapter summarizes the basic conclusions and achievements of this work, and also provides the author's vision of future research prospects and directions.

8.1 SpMV: Victim of the memory wall

The SpMV kernel lies at the heart of iterative solution methods for sparse linear systems, which arise in a variety of scientific domains. The key performance problem of SpMV stems primarily from its streaming nature, which incurs a very low flop:byte ratio and eventually puts considerable pressure on the memory hierarchy. Performing an in-depth performance analysis of the SpMV kernel in this thesis, we highlighted this characteristic, somewhat overlooked in the past, as the major performance bottleneck of the kernel on modern multicore architectures. In SMP architectures, this problem completely defines SpMV performance for large matrices in a highly multithreaded context, as the common front-side bus is quickly saturated. In NUMA architectures, on the other hand, the integrated memory controllers leave more headroom and the memory path is not completely saturated until the full system is utilized; this allows some computational optimizations, which in SMP systems


have no impact at all. Nonetheless, SpMV performance on NUMA systems is quite sensitive to the correct placement of the threads' data on the participating memory nodes, since the interprocessor links can be easily saturated. Respecting the correct data placement for shared data structures may require considerable porting effort and lead to substantial refactoring of an existing SMP-based code. To facilitate this process, we have introduced a simple NUMA-aware interleaved allocator that transparently places parts of a shared data structure on the correct memory nodes without affecting the SMP-based logic of the user code; simply replacing the memory allocation calls suffices for a successful porting.

In this thesis, we took a closer look at blocking storage formats for sparse matrices. Blocking has been successfully used in the past as an alternative storage method for sparse matrices, since it not only allows a reduction of the matrix size but also offers good computational characteristics. We provided a detailed performance analysis of the most typical blocking methods, identifying the merits and weaknesses of each one. The key conclusion of this analysis was the compression versus multithreaded performance tradeoff between the variable and fixed size blocking methods. While leading to considerable gains in the matrix size representation, especially for matrices with an irregular structure, variable size blocking methods must pay the cost of the additional bookkeeping needed to store the different block size values. The increased compression starts to pay off as the number of threads increases and the pressure on the memory subsystem becomes of crucial importance for the performance of the SpMV kernel; in a highly multithreaded context, therefore, variable size blocking formats can significantly outperform their fixed size counterparts. Fixed size blocking methods, on the other hand, lead to a simpler kernel implementation, allowing not only better code generation from the compiler, but also advanced computational optimizations, such as vectorization. Focusing on BCSR, the most representative fixed size blocking storage format, we investigated the implications of the block shape for the performance of the SpMV kernel. These can be quite important in cases where the memory bandwidth is not completely saturated and computational optimizations can offer a significant performance benefit. In this direction, we have developed a performance model that considers both the memory and the computational part of the SpMV kernel and is able to accurately predict the optimal block for BCSR.

8.2 CSX: A viable approach to a high performance SpMV

Previous approaches to optimizing the performance of the SpMV kernel have two main drawbacks. First, these approaches are not particularly focused on


the minimization of the memory footprint of the sparse matrix, since their origins go back to different backgrounds, e.g., register optimizations, and, second, they take an 'all-or-nothing' approach. For example, the BCSR format will provide a significant performance improvement for matrices dominated by dense two-dimensional regions, but its performance will plummet in other cases. Motivated by these shortcomings of alternative storage formats for SpMV, we have proposed in this thesis the Compressed Sparse eXtended (CSX) format, an explicitly compression-based and integrated format. CSX is able to detect and encode in the same representation a multitude of substructures present in sparse matrices, namely horizontal, vertical, diagonal, anti-diagonal and two-dimensional blocks. Combined with a highly compressed representation based on run-length encoding of the column indices, CSX is able to reduce the matrix representation size close to the theoretical bounds. As a result, CSX provides considerable performance improvements (exceeding 50% on average) in SMP systems, where the memory bottleneck is more pronounced. In NUMA architectures, we take particular care to balance the decompression overhead. More specifically, we relax the compression scheme and use an advanced substructure selection heuristic that also considers the computational overhead of decompression. These optimizations allow CSX to provide a more than 20% average performance improvement in NUMA systems, outperforming the other alternative storage formats.

An important trait of CSX is its performance stability. While it is able to provide significant performance improvements for matrices with a lot of substructures, its performance is at least comparable to the baseline CSR format for very sparse and irregular matrices, proving a high adaptability to the matrix structure. Additionally, we placed considerable effort in minimizing the matrix preprocessing cost (detection and encoding of substructures), in order to render CSX a viable alternative in the context of 'real-life' solvers. Indeed, CSX was able to accelerate the Elmer solver by nearly 15%, despite its preprocessing cost and the increased preconditioning cost for the considered benchmark problems.

A significant extension of CSX, proposed in this thesis, is the support for symmetric sparse matrices. Symmetric sparse matrices may arise in the discretization of PDEs, and it is quite tempting, in the context of the SpMV kernel, to store only the lower triangular part and the main diagonal. Despite the obvious benefit of this layout in reducing the matrix size, the multithreaded execution of the symmetric SpMV kernel introduces a RAW dependency on the output vector, rendering an efficient implementation of the kernel problematic. Most approaches eliminate this dependency either using local per-thread vectors or a combination of per-thread vectors and selective locking. Using a local vector per thread requires an additional final reduction step, whose


overhead grows linearly with the thread count, thus limiting scalability significantly, while the cost of excessive locking, on the other hand, can be prohibitive. Our approach to optimizing the symmetric SpMV kernel exploited the fact that the local vectors become quite sparse as the thread count increases. For this reason, we employ an indexing scheme for the local vectors, in order to reduce only the conflicting elements; all other vector updates are pushed directly to the output vector. This technique decouples in practice the reduction overhead from the thread count and allows the symmetric SpMV kernel to scale. In combination with the CSX variant for symmetric matrices, which leads to nearly 67% compression ratios, we were able to accelerate the SpMV kernel by more than 2×, an improvement that is also visible in the context of the CG iterative solution method.

8.3 Toward an energy-efficient SpMV

In this thesis, we took a first step toward investigating the performance-energy tradeoffs of the SpMV kernel. Motivated by the memory-intensive nature of SpMV, we investigated alternative core frequencies and thread placements, in order to achieve high performance at a low energy budget. Modern processors support dynamic voltage and frequency scaling, allowing considerable energy savings without sacrificing performance in non-CPU-intensive workloads. At the same time, SpMV's performance depends greatly on the placement of threads on the available cores, since a sane placement can alleviate the pressure on the memory subsystem. As a result, a combination of a low processor frequency and an efficient thread placement may lead to considerable energy savings, sacrificing only a small fraction of the overall performance. Based on the assumption that matrices with similar performance characteristics (mainly due to matrix structure) will exhibit similar behavior in energy consumption, we proposed a machine learning approach for predicting the execution configurations leading to the optimal performance-energy tradeoffs. Our preliminary results are promising and pave the way for further investigation of the energy-efficient execution of the SpMV kernel.

8.4 Future research directions

Memory intensity is an inherent problem of the SpMV kernel. In fact, it is a scientific kernel with one of the smallest flop:byte ratios, and it is expected to remain bound by memory bandwidth in the foreseeable future, unless we experience a considerable breakthrough in main memory technology. As a result, we expect the notion of data compression, also employed by CSX in this thesis, to


be of real interest in the coming years. However, CSX has reached the limits of matrix metadata compression, and it is therefore essential to extend compression to the matrix data itself, i.e., its non-zero values. The performance gains of symmetric storage formats provide a clear indication of the benefits of efficient non-zero value compression. The key challenges in this approach, however, are that the compression must respect the double precision accuracy of the values and that the decompression cost must be well balanced, in order not to hog the overall computation.

Data compression techniques have recently been gaining increasing interest not only as a means of tackling the ever increasing memory-processor speed gap in memory-intensive applications, but also as a means of reducing the energy consumption of the memory subsystem. This adds another dimension to the optimization of memory-intensive applications and opens interesting research directions. In the context of a scientific application, a combination of data compression and advanced power management techniques in the processor could provide the grounds for minimizing its energy footprint and maximizing performance.

Finally, from a more technical point of view, it is important to enrich CSX with a functional library interface, so that it can be easily exploited by the HPC community. In the same direction, a further improvement of matrix preprocessing through advanced caching techniques will make CSX more appealing and help its dissemination.


List of publications

The work in this thesis produced a number of publications in a series of well-known, high-quality journals and conferences of the HPC community.

Journal publications

V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris. An extended compression format for the optimization of sparse matrix-vector multiplication. IEEE Transactions on Parallel and Distributed Systems (TPDS). To appear.

G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis, and N. Koziris. Performance evaluation of the sparse matrix-vector multiplication on modern architectures. The Journal of Supercomputing, 50(1):36–77, 2009.

Conference publications

V. Karakasis, G. Goumas, K. Nikas, N. Koziris, J. Ruokolainen, and P. Råback. Using state-of-the-art sparse matrix optimizations for accelerating the performance of multiphysics simulations. In PARA 2012: Workshop on State-of-the-Art in Scientific and Parallel Computing, Helsinki, Finland, 2012. Springer.

V. Karakasis, G. Goumas, and N. Koziris. Exploring the performance-energy tradeoffs in sparse matrix-vector multiplication. In Workshop on Emerging Supercomputing Technologies (WEST), ICS'11, Tucson, AZ, USA, 2011.

K. Kourtis, V. Karakasis, G. Goumas, and N. Koziris. CSX: An extended compression format for SpMV on shared memory systems. In 16th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP'11), San Antonio, TX, USA, 2011. ACM.


V. Karakasis, G. Goumas, and N. Koziris. Performance models for blocked sparse matrix-vector multiplication kernels. In 38th International Conference on Parallel Processing (ICPP'09), pages 356–364, Vienna, Austria, 2009. IEEE Computer Society.

V. Karakasis, G. Goumas, and N. Koziris. A comparative study of blocking storage methods for sparse matrices on multicore architectures. In 12th IEEE International Conference on Computational Science and Engineering (CSE-09), Vancouver, Canada, 2009. IEEE Computer Society.

V. Karakasis, G. Goumas, and N. Koziris. Exploring the effect of block shapes on the performance of sparse kernels. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS'09), pages 1–8, Rome, Italy, 2009. IEEE Computer Society.

G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis, and N. Koziris. Understanding the performance of sparse matrix-vector multiplication. In Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP'08), Toulouse, France, 2008. IEEE Computer Society.

G. Goumas, N. Drosinos, V. Karakasis, and N. Koziris. Coarse-grain parallel execution for 2-dimensional PDE problems. In 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS'07), pages 1–8, Long Beach, CA, USA, 2007. IEEE.


Bibliography

A. Agarwal, S. Mukhopadhyay, A. Raychowdhury, K. Roy, and C. H. Kim. Leakage power analysis and reduction for nanoscale circuits. IEEE Micro, 26(2):68–80, 2006.

R. C. Agarwal, F. G. Gustavson, and M. Zubair. A high performance algorithm using pre-processing for the sparse matrix-vector multiplication. In Proceedings of Supercomputing'92, pages 32–41, Minneapolis, MN, USA, 1992. IEEE Computer Society.

K. Ahmed and K. Schuegraf. Transistor wars. IEEE Spectrum, 48(11):50–66, 2011.

K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, University of California, Berkeley, 2006.

D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS parallel benchmarks summary and preliminary results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pages 158–165, Albuquerque, NM, USA, 1991. ACM.

R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. M. Donato, J. Dongarra, V. Ei- jkhout, R. Pozo, C. Romine, and H. V. der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, 1987. ISBN 0-89871-328-5.

V. H. F. Batista, G. O. Ainsworth Jr., and F. L. B. Ribeiro. Parallel structurally- symmetric sparse matrix-vector products on multi-core processors. Com- puting Research Repository (CoRR), abs/1003.0952, 2010.


M. Belgin, G. Back, and C. J. Ribbens. Pattern-based sparse matrix represen- tation for memory-efficient SMVM kernels. In Proceedings of the 23rd inter- national conference on Supercomputing (ICS’09), pages 100–109, Yorktown Heights, NY, USA, 2009. ACM.

N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC'09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, Portland, OR, USA, 2009. ACM.

S. Borkar. Design challenges of technology scaling. IEEE Micro, 19(4):23–29, 1999.

D. M. Brooks, P. Bose, S. E. Schuster, H. Jacobson, P. N. Kudva, A. Buyukto- sunoglu, J.-D. Wellman, V. Zyuban, M. Gupta, and P. W. Cook. Power- Aware Microarchitecture: Design and Modeling Challenges for Next- Generation Microprocessors. IEEE Micro, 20(6):26–44, 2000.

A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, pages 233–244, Calgary, Canada, 2009. ACM.

A. Buluç, S. Williams, L. Oliker, and J. Demmel. Reduced-bandwidth multi- threaded algorithms for sparse matrix-vector multiplication. In IEEE Inter- national Parallel & Distributed Processing Symposium, pages 721–733, An- chorage, AK, USA, 2011. IEEE Computer Society.

A. Buttari, V. Eijkhout, J. Langou, and S. Filippone. Performance optimiza- tion and modeling of blocked sparse kernels. International Journal of High Performance Computing Applications, 21(4):467–484, 2007.

A. Buttari, J. Dongarra, J. Kurzak, P. Luszczek, and S. Tomov. Using mixed precision for sparse matrix computations to enhance the performance while achieving 64-bit accuracy. ACM Transactions on Mathematical Software, 34(4):17–39, 2008.

Ü. V. Çatalyürek and C. Aykanat. Hypergraph-partitioning-based decompo- sition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 10(7):673–693, 1999.

J. Chang, S.-L. Chen, W. Chen, S. Chiu, R. Faber, R. Ganesan, M. Grgek, V. Lukka, W.-W. Mar, J. Vash, S. Rusu, and K. Zhang. A 45nm 24MB on-die L3 cache for the 8-core multi-threaded Xeon® processor. In Symposium on VLSI Circuits, pages 152–153, Kyoto, Japan, 2009. IEEE.

J. W. Choi, A. Singh, and R. W. Vuduc. Model-driven autotuning of sparse matrix-vector multiply on gpus. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Bangalore, India, 2010. ACM.

K. Choi, R. Soma, and M. Pedram. Dynamic voltage and frequency scaling based on workload decomposition. In Proceedings of the 2004 International Symposium on Low Power Electronics and Design, pages 174–179, Newport, CA, USA, 2004. ACM.

P. Colella. Defining software requirements for scientific computing, 2004. DARPA's High Productivity Computer Systems (pres.).

M. Curtis-Maury, A. Shah, F. Blagojevic, D. S. Nikolopoulos, B. R. de Supinski, and M. Schulz. Prediction models for multi-dimensional power-performance optimization on many cores. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 250–259, Toronto, Ontario, Canada, 2008. ACM.

E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference. ACM, 1969.

T. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software, 38:1–25, 2011. URL http://www.cise.ufl.edu/research/sparse/matrices.

T. A. Davis. Direct Methods for Sparse Linear Systems. SIAM, 2006. ISBN 0-89871-613-6.

J. Dongarra. Algorithmic and software challenges when moving towards exascale. In PARA 2012: Workshop on the State-of-the-Art in Scientific and Parallel Computing, Helsinki, Finland, 2012. Springer. (keynote pres.).

I. S. Duff and J. K. Reid. Some design features of a sparse matrix code. ACM Transactions on Mathematical Software, 5(1):18–35, 1979.

I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, 1989. ISBN 0-19853-421-3.

S. C. Eisenstat, M. C. Gursky, M. H. Schultz, and A. H. Sherman. Yale sparse matrix package I: The symmetric codes. International Journal for Numerical Methods in Engineering, 18(8):1145–1151, 1982.


L. F. Escudero. Solving systems of sparse linear equations. Advances in Engineering Software, 6(3):141–147, 1984.

J. H. Ferziger and M. Peric. Computational Methods for Fluid Dynamics. Springer, 3rd edition, 2001. ISBN 3-54042-074-6.

K. Flautner, K. Nam Sung, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: Simple techniques for reducing leakage power. In 29th Annual International Symposium on Computer Architecture, pages 148–157, Anchorage, AL, USA, 2001. IEEE.

A. Fog. Instruction tables: Lists of instruction latencies, throughputs and micro- operation breakdowns for Intel, AMD and VIA CPUs. Copenhagen University College of Engineering, February 2012.

J. A. George. Computer implementation of the finite element method. Technical Report STAN-CS-71-208, Stanford University, Computer Science Department, 1971.

R. Geus and S. Röllin. Towards a fast parallel sparse matrix-vector multiplication. Parallel Computing, 27:883–896, 2001.

N. E. Gibbs, W. G. Poole, and P. K. Stockmeyer. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM Journal on Numerical Analysis, 13(2):236–250, 1976.

G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis, and N. Koziris. Understanding the performance of sparse matrix-vector multiplication. In Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing, Toulouse, France, 2008. IEEE Computer Society.

G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis, and N. Koziris. Performance evaluation of the sparse matrix-vector multiplication on modern architectures. The Journal of Supercomputing, 50(1):36–77, 2009.

Green500. The Green500 List, November 2012. URL http://www.green500.org.

W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Toward realistic performance bounds for implicit CFD codes. In Proceedings of Parallel CFD, pages 233–240, Williamsburg, VA, USA, 1999. Elsevier.

D. Guo and W. Gropp. Optimizing sparse data structures for matrix-vector multiply. High Performance Computing Applications, 25(1):115–131, 2011.


M. Harris. Mapping computational concepts to GPUs. In ACM SIGGRAPH 2005 Courses, Chapter 31. ACM, 2005.

S. Herbert and D. Marculescu. Analysis of dynamic voltage/frequency scal- ing in chip-multiprocessors. In ACM/IEEE International Symposium on Low Power Electronics and Design, pages 38–43, Portland, OR, USA, 2007. IEEE.

M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving lin- ear systems. Journal of Research of the National Bureau of Standards, 49(6): 409–436, 1952.

D. Hisamoto, W.-C. Lee, J. Kedzierski, H. Takeuchi, K. Asano, C. Kuo, E. An- derson, T.-J. King, J. Bokor, and C. Hu. FinFET – A self-aligned double-gate MOSFET scalable to 20 nm. IEEE Transactions on Electron Devices, 47(12): 2320–2325, 2000.

M. Hoemmen. Communication-Avoiding Krylov Subspace Methods. PhD thesis, University of California, Berkeley, 2010.

C. Hu. in-body FinFET as scalable low voltage transistor. In Interna- ¨ tional Symposium on VLSI Technology, Systems, and Applications (VLSI- ¨ © TSA), pages 1–4, Hsinchu, Taiwan, 2012. IEEE. © M. Huang, M. Mehalel, R. Arvapalli, and He. S. An energy efficient 32nm 20MB L3 cache for Intel®Xeon®processor E5 family. In IEEE Custom Inte- grated Circuits Conference, pages 1–4, San Jose, CA, USA, 2012. IEEE.

E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, 2000.

E.-J. Im and K. A. Yelick. Optimizing sparse matrix computations for regis- ter reuse in SPARSITY. In Proceedings of the International Conference on Computational Sciences – Part I, pages 127–136. Springer-Verlag, 2001.

E.-J. Im, K. Yelick, and R. Vuduc. Sparsity: Optimization framework for sparse matrix kernels. International Journal of High Performance Computing Appli- cations, 18:135–158, 2004.

Intel®Xeon®Processor 7400 Series – Datasheet. Intel Corp., October 2008.

Intel®64 and IA-32 Architectures Optimization Reference Manual. Intel Corp., March 2009.

Intel®64 and IA-32 Architectures Soware Developer’s Manual – Volume 3B: Sys- tem Programming Guide, Part 2. Intel Corp., September 2010.


C. Isci and M. Martonosi. Runtime power monitoring in high-end processors: Methodology and empirical data. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, San Diego, CA, USA, 2003. IEEE Computer Society.

ITRS. International Technology Roadmap for Semiconductors, 2001. URL http://www.itrs.net.

K. R. James and W. Riha. Convergence criteria for successive overrelaxation. SIAM Journal on Numerical Analysis, 12(2):137–143, 1975.

G. Juckenland, S. Börner, M. Kluge, S. Kölling, W. E. Nagel, S. Pflüger, H. Röding, S. Seidl, T. William, and R. Wloch. BenchIT – performance measurement and comparison for scientific applications. In Parallel Computing Software Technology, Algorithms, Architectures and Applications, volume 13 of Advances in Parallel Computing, pages 501–508. Elsevier, 2004.

R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd. Power7: IBM's next-generation server processor. IEEE Micro, 30(2):7–15, 2010.

S. Kamil, J. Shalf, and E. Strohmaier. Power efficiency in high performance computing. In International Symposium on Parallel and Distributed Processing, pages 1–8, Miami, FL, USA, 2008. IEEE.

V. Karakasis, G. Goumas, and N. Koziris. A comparative study of blocking storage methods for sparse matrices on multicore architectures. In 12th IEEE International Conference on Computational Science and Engineering, Vancouver, Canada, 2009a. IEEE Computer Society.

V. Karakasis, G. Goumas, and N. Koziris. Exploring the effect of block shapes on the performance of sparse kernels. In IEEE International Symposium on Parallel & Distributed Processing, pages 1–8, Rome, Italy, 2009b. IEEE Computer Society.

V. Karakasis, G. Goumas, and N. Koziris. Performance models for blocked sparse matrix-vector multiplication kernels. In International Conference on Parallel Processing, Vienna, Austria, 2009c. IEEE.

V. Karakasis, G. Goumas, and N. Koziris. Exploring the performance-energy tradeoffs in sparse matrix-vector multiplication. In Workshop on Emerging Supercomputing Technologies (WEST), ICS'11, Tucson, AZ, USA, 2011.

G. Karypis and V. Kumar. METIS – Unstructured graph partitioning and sparse matrix ordering system, version 2.0. Technical report, University of Minnesota, Department of Computer Science, 1995.


S. Kaxiras and M. Martonosi. Computer Architecture Techniques for Power-Efficiency, volume 3 of Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2008.

S. Kaxiras, H. Zhigang, and M. Martonosi. Cache decay: Exploiting generational behavior to reduce cache leakage power. In 28th Annual International Symposium on Computer Architecture, pages 240–251, Göteborg, Sweden, 2001. IEEE.

N. S. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin, M. Kandemir, and V. Narayanan. Leakage current: Moore's law meets static power. Computer, 36(12):68–75, 2003.

D. Koufaty and D. T. Marr. Hyperthreading technology in the Netburst microarchitecture. IEEE Micro, 23(2):56–65, 2003.

K. Kourtis. Data Compression Techniques for Performance Improvement of Memory-Intensive Applications on Shared Memory Architectures. PhD thesis, National Technical University of Athens, 2010.

K. Kourtis, G. Goumas, and N. Koziris. Improving the performance of multithreaded sparse matrix-vector multiplication using index and value compression. In Proceedings of the 37th International Conference on Parallel Processing, Portland, Oregon, USA, 2008a. IEEE Computer Society.

K. Kourtis, G. Goumas, and N. Koziris. Optimizing sparse matrix-vector multiplication using index and value compression. In Proceedings of the 5th Conference on Computing Frontiers, Ischia, Italy, 2008b. ACM.

K. Kourtis, G. Goumas, and N. Koziris. Exploiting compression opportunities to improve SpMxV performance on shared memory systems. ACM Transactions on Architecture and Code Optimization, 7(3), 2010.

N. Kurd, J. Douglas, P. Mosalikanti, and R. Kumar. Next generation Intel® micro-architecture (Nehalem) clocking architecture. In IEEE Symposium on VLSI Circuits, Honolulu, HI, USA, 2008. IEEE.

C. Lattner. LLVM and Clang: Advancing compiler technology. In Free and Open Source Developers' European Meeting, Brussels, Belgium, February 2011. URL http://clang.llvm.org/.

C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization (CGO'04), San Jose, CA, USA, 2004. IEEE Computer Society. URL http://www.llvm.org/.


C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Transactions on Mathematical Software, 5(3):308–323, 1979.

E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39–55, 2008.

S. Luke. Essentials of Metaheuristics. Lulu, first edition, 2011.

M. Lyly, J. Ruokolainen, and E. Järvinen. ELMER – a finite element solver for multiphysics. In CSC Report on Scientific Computing, 1999–2000. URL http://www.csc.fi/english/pages/elmer.

T. G. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Hass, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe. The 48-core SCC processor: the programmer's view. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), pages 1–11, New Orleans, LA, USA, 2010. ACM.

J. D. McCalpin. STREAM: Sustainable memory bandwidth in high perfor- mance computing, 1995. URL http://www.cs.virginia.edu/stream/. ¨ ¨ © J. Mellor-Crummey and J. Garvin. Optimizing sparse matrix-vector product © computations using unroll and jam. International Journal of High Perfor- mance Computing Applications, 18(2):225–236, 2004.

T. C. Oppe and D. R. Kincaid. e performance of ITPACK on vector com- puters for solving large sparse linear systems arising in sample oil reservoir simulation problems. Communications in Applied Numerical Methods, 3(1): 23–29, 1987.

Ch. H. Papadimitriou. e NP-Completeness of the bandwidth minimization problem. Computing, 16(3):263–270, 1976.

J. C. Pichel, D. B. Heras, J. C. Cabaleiro, and F. F. Rivera. Improving the local- ity of the sparse matrix-vector product on shared memory multiprocessors. In Proceedings of the 12th Euromicro Conference on Parallel, Distributed and Network Based Processing, pages 66–71. IEEE Computer Society, 2004.

A. Pinar and M. T. Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of the 1999 ACM/IEEE conference on Super- computing, Portland, OR, USA, 1999. ACM.

U. W. Pooch and A. Nieder. A survey of indexing techniques for sparse matri- ces. ACM Computing Surveys, 5(2):109–133, 1973.

174

¨ © ¨ thesis March 11, 2013 15:54 Page 175 ©

Bibliography

M. Powell, S.-H. Yang, B. Falsa, K. Roy, and T. N. Vijaykumar. Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 90–95, Rapallo, Italy, 2000. ACM.

S. K. Raman, V. Pentkovski, and J. Keshava. Implementing streaming SIMD extensions on the Pentium III processor. IEEE Micro, 20(4):47–57, 2000.

A. Ramirez. Supercomputing: Past, present, and a possible future. In Interna- tional Conference on Embedded Computer Systems (SAMOS), page ii, Samos, Greece, 2011. IEEE.

E. Rotem, A. Naveh, D. Rajwan, A. Ananthakrishnan, and E Weissmann. Power-management architecture of the Intel microarchitecture code-named Sandy Bridge. IEEE Micro, 32(2):20–27, 2012.

R. M. Russell. e CRAY-1 computer system. Communications of the ACM, 21 (1):63–72, 1978.

Y. Saad. Krylov subspace methods for solving large unsymmetric linear sys- tems. Mathematics of Computation, 37(155):105–126, 1981. ¨ ¨ © Y. Saad. Numerical methods for large eigenvalue problems. Manchester Univer- © sity Press ND, 1992.

Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations, 1994.

Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2nd edition, 2003. ISBN 0-89871-534-2.

Y. Saad and M. H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientic and Statistical Computing, 7(3):856–869, 1986.

G. Semeraro, G. Magklis, R. Balasubramonian, D. H. Albonesi, S. Dwarkadas, and M. L. Scott. Energy-efficient processor design using multiple clock do- mains with dynamic voltage and frequency scaling. In Proceedings of the Eighth International Symposium on High-Performance Computer Architec- ture, pages 29–40, Cambridge, MA, USA, 2002. IEEE.

L. Sheng, H. A. Jung, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In 42nd Annual IEEE/ACM In- ternational Symposium on Microarchitecture, pages 469–480, New York, NY, USA, 2009. IEEE.

K. Singh, M. Bhadauria, and S. A. McKee. Real time power estimation and thread scheduling via performance counters. ACM SIGARCH Computer Architecture News, 37(2):46–55, 2009.

T. Skotnicki, J. A. Hutchby, T.-J. King, H.-S. P. Wong, and F. Boeuf. The end of CMOS scaling: toward the introduction of new materials and structural changes to improve MOSFET performance. IEEE Circuits and Devices Magazine, 21(1):16–26, 2005.

V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13 (4):354–356, 1969.

O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing’92, pages 578–587, Minneapolis, MN, USA, 1992. IEEE Computer Society.

R. P. Tewarson. Sparse Matrices. Academic Press, 1973. ISBN 0-124-11098-3.

W. F. Tinney and J. W. Walker. Direct solutions of sparse network equations by optimally ordered triangular factorization. Proceedings of the IEEE, 55(11):1801–1809, 1967.

S. Toledo. Improving memory-system performance of sparse matrix-vector multiplication. IBM Journal of Research and Development, 41:711–725, 1997.

Top500. TOP500 Supercomputer sites, November 2012. URL http://www.top500.org.

D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, S. Margherita Ligure, Italy, 1995. ACM.

H. A. van der Vorst. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 13(2):631–644, 1992.

R. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, 2003.

R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, and B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Proceedings of the ACM/IEEE conference on Supercomputing, pages 1–35, Baltimore, MD, USA, 2002. IEEE Computer Society.

R. Vuduc, J. W. Demmel, and K. A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16(521), 2005.

R. W. Vuduc and H.-J. Moon. Fast sparse matrix-vector multiplication by exploiting variable block structure. In High Performance Computing and Communications, volume 3726 of Lecture Notes in Computer Science, pages 807–816. Springer Berlin/Heidelberg, 2005.

J. White and P. Sadayappan. On improving the performance of sparse matrix-vector multiplication. In Proceedings of the Fourth International Conference on High-Performance Computing, pages 66–71. IEEE Computer Society, 1997.

J. Willcock and A. Lumsdaine. Accelerating sparse matrix computations via data compression. In Proceedings of the 20th Annual International Conference on Supercomputing, pages 307–316, Cairns, QLD, Australia, 2006. ACM.

S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE conference on Supercomputing, Reno, NV, USA, 2007. ACM.

S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.

Wm. A. Wulf and S. A. McKee. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1), 1995.

F. Xie, M. Martonosi, and S. Malik. Compile-time dynamic voltage scaling settings: Opportunities and limits. In Proceedings of the ACM SIGPLAN 2003 conference on Programming Language Design and Implementation, pages 49–62, San Diego, CA, USA, 2003. ACM.

F.-L. Yang, D.-H. Lee, H.-Y. Chen, C.-Y. Chang, S.-D. Liu, C.-C. Huang, T.-X. Chung, H.-W. Chen, C.-C. Huang, Y.-H. Liu, C.-C. Wu, C.-C. Chen, S.-C. Chen, Y.-Ts. Chen, Y.-H. Chen, C.-J. Chen, B.-W. Chan, P.-F. Hsu, J.-H. Shieh, H.-J. Tao, Y.-C. Yeo, Y. Li, J.-W. Lee, P. Chen, M.-S. Lian, and C. Hu. 5nm-gate nanowire FinFET. In Symposium on VLSI Technology, pages 196–197, Honolulu, HI, USA, 2004. IEEE.

S. Yang, M. D. Powell, B. Falsafi, K. Roy, and T. N. Vijaykumar. An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance I-caches. In Seventh International Symposium on High-Performance Computer Architecture, pages 147–157, Monterey, Mexico, 2001. IEEE.

Y.-P. You, C. Lee, and J.-K. Lee. Compilers for leakage power reduction. ACM Transactions on Design Automation of Electronic Systems, 11(1):147–164, 2006.

W. Zhang, J. S. Hu, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin. Compiler-directed instruction cache leakage optimization. In 35th Annual IEEE/ACM International Symposium on Microarchitecture, pages 208–218, Istanbul, Turkey, 2002. IEEE.

W. Zhang, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and V. De. Compiler support for reducing leakage energy consumption. In Conference and Exhibition of Design, Automation and Test in Europe, pages 1146–1147, Munich, Germany, 2003. IEEE.

Index

5-CV, see five-cross-validation
arithmetic intensity, 36
BCSD, see Row Segmented Diagonal
BCSR, see Blocked Compressed Sparse Row
COO, see Coordinate
CSB, see Compressed Sparse Block
CSC, see Compressed Sparse Column
CSR, see Compressed Sparse Row
CSR-DU, see CSR Delta Units
CSR-VI, see CSR Value Indexed
CSRC, see Compressed Sparse Row-Column
CSX, see Compressed Sparse eXtended
DCSR, see Delta-Coded Sparse Row
delta indexing, see delta encoding
DVFS, see Dynamic Voltage and Frequency Scaling
ELL, see Ellpack-Itpack
ELLPACK, see Ellpack-Itpack
energy-efficiency
    energy-delay products, 142
flop:byte ratio, see arithmetic intensity
glitching power, see short-circuit power
hardware prefetching, 74
ITPACK, see Ellpack-Itpack
linear systems
    coefficient matrix, 1
    factorization, 1
    fill-in elements, 2
    preconditioning, 3
    right-hand side vector, 1
    vector of unknowns, 1
machine learning
    overfitting, 153
NOF, see Non-Overlapping Factor
optimization
    non-dominated Pareto front, 149
    Pareto rank, 154
PBR, see Pattern Block Row
performance models
    MC, 74
    M, 72
    O, 74
    S, 73
    Execution time prediction accuracy, 75
    Non-Overlapping Factor, 74
    Selection accuracy, 75
power dissipation
    activity factor, 140
    clock gating, 140
    dynamic power, 140
    Dynamic Voltage and Frequency Scaling, 140
    gate leakage, 141
    leakage power, 140
    short-circuit power, 140
    sub-threshold leakage, 141
RSDIAG, see Row Segmented Diagonal
run-length encoding, 88
runs, 88
S-BCSR, see Streaming BCSR
S-CSR, see Streaming CSR
SIMD, see Single Instruction Multiple Data
solution methods
    direct, 1
    iterative, 2
    Krylov subspace, 3
    projection, 2
    stationary, 2
sparse matrices
    bandwidth, 32
    blocking, 20
        fixed size, 21
        variable size, 23
    compression
        delta encoding, 29
        run-length encoding, 29
    density, 17
    substructure, 20
SS, see Super-Sparse
SSE, see Streaming SIMD Extensions
SSS, see Symmetric Sparse Skyline
statistical analysis
    five-cross-validation, 154
    fold, 154
    outliers, 152
storage formats
    Blocked Compressed Sparse Diagonal, see Row Segmented Diagonal
    Blocked Compressed Sparse Row, 21
    Compressed Sparse Block, 27
    Compressed Sparse Column, 19
    Compressed Sparse eXtended, 83, 85
        body, 86
        delta unit, 86
        generic elements, 88
        head, 86
        substructure instantiation, 92
        substructure unit, 86
        unit, 86
    Compressed Sparse Row, 18
    Compressed Sparse Row-Column, 33
    Coordinate, 17
    CSR Delta Units, 30
    CSR Value Indexed, 28
    decomposed, 27
    Delta-Coded Sparse Row, 30
    Ellpack-Itpack, 28
    Pattern Block Row, 27
    Row Segmented Diagonal, 22
    Streaming BCSR, 28
    Streaming CSR, 28
    Super-Sparse, 28
    Symmetric Sparse Skyline, 31, 119
    Variable Block Length, 24
    Variable Block Row, 25
VBL, see Variable Block Length
VBR, see Variable Block Row
vectorization, 67
    aligned loads, 68
    horizontal add, 68
    shuffle unit, 67
    Single Instruction Multiple Data, 67
    Streaming SIMD Extensions, 67
    unaligned loads, 68

Short biography

Vasileios Karakasis was born on April 5, 1982, in Athens, Greece. In 2000, he entered the National Technical University of Athens (School of Electrical and Computer Engineering), where he obtained his diploma as an Electrical and Computer Engineer in 2005. In December 2006, he entered the postgraduate curriculum of the School of ECE, NTUA, as a Ph.D. student in the field of High Performance Computing at the Computing Systems Laboratory, under the supervision of Associate Professor Nectarios Koziris. His thesis focused on the optimization of the sparse matrix-vector multiplication kernel on modern multicore computer architectures; he received his Ph.D. degree in December 2012.
