Automatic Performance Tuning of Sparse Matrix Kernels
Automatic Performance Tuning of Sparse Matrix Kernels

by Richard Wilson Vuduc
B.S. (Cornell University) 1997

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY.

Committee in charge:
Professor James W. Demmel, Chair
Professor Katherine A. Yelick
Professor Sanjay Govindjee

Fall 2003

The dissertation of Richard Wilson Vuduc is approved.
University of California, Berkeley, Fall 2003

Automatic Performance Tuning of Sparse Matrix Kernels
Copyright 2003 by Richard Wilson Vuduc

Abstract

Automatic Performance Tuning of Sparse Matrix Kernels
by Richard Wilson Vuduc
Doctor of Philosophy in Computer Science
University of California, Berkeley
Professor James W. Demmel, Chair

This dissertation presents an automated system to generate highly efficient, platform-adapted implementations of sparse matrix kernels. These computational kernels lie at the heart of diverse applications in scientific computing, engineering, economic modeling, and information retrieval, to name a few. Informally, sparse kernels are computational operations on matrices whose entries are mostly zero, so that operations with and storage of these zero elements may be eliminated. The challenge in developing high-performance implementations of such kernels is choosing the data structure and code that best exploit the structural properties of the matrix (generally unknown until application run-time) for high performance on the underlying machine architecture (e.g., memory hierarchy configuration and CPU pipeline structure). We show that conventional implementations of important sparse kernels like sparse matrix-vector multiply (SpMV) have historically run at 10% or less of peak machine speed on cache-based superscalar architectures. Our implementations of SpMV, automatically tuned using a methodology based on empirical search, can by contrast achieve up to 31% of peak machine speed, and can be up to 4× faster.

Given a matrix, kernel, and machine, our approach to selecting a fast implementation consists of two steps: (1) we identify and generate a space of reasonable implementations, and then (2) search this space for the fastest one using a combination of heuristic models and actual experiments (i.e., running and timing the code). We build on the Sparsity system for generating highly tuned implementations of the SpMV kernel y ← y + Ax, where A is a sparse matrix and x, y are dense vectors. We extend Sparsity to support tuning for a variety of common non-zero patterns arising in practice, and for additional kernels like sparse triangular solve (SpTS) and computation of A^T A·x (or A A^T·x) and A^ρ·x.
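To make the kernel concrete, the following is a minimal sketch (not code from the dissertation) of the conventional compressed sparse row (CSR) implementation of y ← y + Ax, the untuned baseline against which tuned implementations are compared. The array names ptr, ind, and val are illustrative choices for the CSR row-pointer, column-index, and stored-value arrays.

#include <stdio.h>

/* Reference CSR SpMV: y <- y + A*x for an m-row sparse matrix A whose
 * non-zeros of row i occupy positions ptr[i] .. ptr[i+1]-1 of val/ind. */
static void spmv_csr(int m, const int *ptr, const int *ind,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            yi += val[k] * x[ind[k]];   /* indirect, irregular access to x */
        y[i] = yi;
    }
}

int main(void)
{
    /* Small example: A = [1 0 2; 0 3 0; 4 0 5] in CSR form, x = ones(3). */
    int    ptr[] = {0, 2, 3, 5};
    int    ind[] = {0, 2, 1, 0, 2};
    double val[] = {1, 2, 3, 4, 5};
    double x[]   = {1, 1, 1};
    double y[]   = {0, 0, 0};

    spmv_csr(3, ptr, ind, val, x, y);
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);   /* prints y = [3 3 9] */
    return 0;
}

The indirect access to x through ind[k], together with the storage and memory traffic for the integer indices themselves, is the main reason such straightforward implementations run at a small fraction of peak; the register-level blocking and other data structure transformations studied in the chapters listed below target exactly this overhead.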
We develop new models to compute, for particular data structures and kernels, the best absolute performance (e.g., Mflop/s) we might expect on a given matrix and machine. These performance upper bounds account for the cost of memory operations at all levels of the memory hierarchy, but assume ideal instruction scheduling and low-level tuning. We evaluate the performance of our implementations with respect to such bounds, finding that the generated and tuned implementations of SpMV and SpTS achieve up to 75% of the performance bound. This finding places limits on the effectiveness of additional low-level tuning (e.g., better instruction selection and scheduling). Instances in which we are further from the bounds indicate new opportunities to close the gap by applying existing automatic low-level tuning technology. We also use these bounds to assess, in part, which architectures are well suited to kernels like SpMV. Among other conclusions, we find that performance improvements may be possible for SpMV and other streaming applications by ensuring strictly increasing cache line sizes in multi-level memory hierarchies.

The costs and steps of tuning imply changes to the design of sparse matrix libraries. We propose extensions to the recent standardized interface, the Sparse Basic Linear Algebra Subroutines (SpBLAS). We argue that such an extended interface complements existing approaches to sparse code generation, and furthermore is a suitable building block for widely used higher-level scientific libraries and systems (e.g., PETSc and MATLAB) to provide users with high-performance sparse kernels.

Looking toward future tuning systems, we consider an aspect of the tuning problem that is common to all current systems: the problem of search. Specifically, we pose two search-related problems. First, we consider the problem of stopping an exhaustive search early while still providing approximate bounds on the probability that an optimal implementation has been found. Second, we consider the problem of choosing, at run-time, one implementation from among several candidates based on the run-time input. We formalize both problems in a manner amenable to attack by statistical modeling techniques. Our methods may potentially apply broadly to tuning systems for as yet unexplored domains.

Professor James W. Demmel, Dissertation Committee Chair

To my bringer of wing zings.

Contents

List of Figures  vi
List of Tables  xi
List of Symbols  xiv

1 Introduction  1
  1.1 Contributions  5
  1.2 Problem Context and History  6
    1.2.1 Hardware and software trends in sparse kernel performance  6
    1.2.2 Emergence of standard library interfaces  8
  1.3 The Case for Search  11
  1.4 Summary, Scope, and Outline  15

2 Basic Sparse Matrix Data Structures  18
  2.1 Survey of Basic Data Structures  20
    2.1.1 Example: Contrasting dense and sparse storage  20
    2.1.2 Dense storage formats  21
    2.1.3 Sparse vector formats  25
    2.1.4 Sparse matrix formats  26
  2.2 Experimental Comparison of the Basic Formats  39
    2.2.1 Experimental setup  40
    2.2.2 Results on the Sparsity matrix benchmark suite  41
  2.3 Note on Recursive Storage Formats  45
  2.4 Summary  45

3 Improved Register-Level Tuning of Sparse Matrix-Vector Multiply  51
  3.1 Register Blocked Sparse Matrix-Vector Multiply  53
    3.1.1 Register blocking overview  54
    3.1.2 Surprising performance behavior in practice  58
  3.2 An Improved Heuristic for Block Size Selection  67
    3.2.1 A fill ratio estimation algorithm  68
    3.2.2 Tuning the fill estimator: cost and accuracy trade-offs  71
  3.3 Evaluation of the Heuristic: Accuracy and Costs  77
    3.3.1 Accuracy of the Sparsity Version 2 heuristic  78
    3.3.2 The costs of register block size tuning  85
  3.4 Summary  90

4 Performance Bounds for Sparse Matrix-Vector Multiply  93
  4.1 A Performance Bounds Model for Register Blocking  95
    4.1.1 Modeling framework and summary of key assumptions  96
    4.1.2 A latency-based model of execution time  97
    4.1.3 Lower (and upper) bounds on cache misses  98
  4.2 Experimental Evaluation of the Bounds Model  100
    4.2.1 Determining the model latencies  102
    4.2.2 Cache miss model validation  109
    4.2.3 Key results: observed performance vs. the bounds model  121
  4.3 Related Work  137
  4.4 Summary  141

5 Advanced Data Structures for Sparse Matrix-Vector Multiply  143
  5.1 Splitting variable-block matrices  144
    5.1.1 Test matrices and a motivating example  147
    5.1.2 Altering the non-zero distribution of blocks using fill  149
    5.1.3 Choosing a splitting  154
    5.1.4 Experimental results  156
  5.2 Exploiting diagonal structure  165
    5.2.1 Test matrices and motivating examples  165
    5.2.2 Row segmented diagonal format  166
    5.2.3 Converting to row segmented diagonal format  169
    5.2.4 Experimental results  171
  5.3 Summary and overview of additional techniques  179

6 Performance Tuning and Bounds for Sparse Triangular Solve  184
  6.1 Optimization Techniques  186
    6.1.1 Improving register reuse: register blocking  188
    6.1.2 Using the dense BLAS: switch-to-dense  189
    6.1.3 Tuning parameter selection  189
  6.2 Performance Bounds  191
    6.2.1 Review of the latency-based execution time model  192
    6.2.2 Cache miss lower and upper bounds  193
  6.3 Performance Evaluation  194
    6.3.1 Validating the cache miss bounds  194
    6.3.2 Evaluating optimized SpTS performance  195
  6.4 Related Work  197
  6.5 Summary  198

7 Higher-Level Sparse Kernels  202
  7.1 Automatically Tuning A^T A·x for the Memory Hierarchy  203
  7.2 Upper Bounds on A^T A·x Performance  205
    7.2.1 A latency-based execution time model  207
    7.2.2 A lower bound on cache misses  207
  7.3 Experimental Results