Efficient Sparse Vector Multiplication for Structured Grid Representation

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Deepan Karthik Balasubramanian, B.Tech

Graduate Program in Computer Science and Engineering

The Ohio State University

2012

Thesis Committee:

Dr. P. Sadayappan, Advisor

Dr. Atanas Rountev

© Copyright by

Deepan Karthik Balasubramanian

2012

Abstract

Due to technology advancements, there is a need for higher accuracy in scientific computations. Sparse matrix-vector (SpMV) multiplication is widely used in scientific applications. There can be significant performance variability due to irregular memory access patterns, so there is a great opportunity to optimize these applications. Conventional data structures and algorithms need to be modified to take advantage of these improved architectures.

In the first part of the thesis, we focus on introducing a new data structure, Block Structured Grid, that allows vectorization. We also focus on modifying the existing Block CSR representation of structured grids to improve performance. Due to the inherent nature of structured grid problems, which use block elements because of the degrees of freedom involved, blocked structures are considered for improving performance. We compare our performance with existing standard algorithms in PETSc. With the new matrix representations we were able to achieve an average of 1.5x performance improvement for generic algorithms.

In the second part of the thesis, we compare the performance of PFlotran, an application for modeling Multiscale-Multiphase-Multicomponent Subsurface Reactive Flows, using Block Structured Grid and Vectorized Block CSR against the standard matrix representations. With the new matrix representations we were able to achieve an average of 1.2x performance improvement.

I dedicate this work to my parents and my sister

Acknowledgments

I owe my deepest gratitude to my advisor Prof. Sadayappan for his vision and constant support throughout my Masters program. His enthusiasm for research ideas has been a great source of inspiration for me. His excellent technical foresight and guidance helped me toward the right goals. I would also like to thank Dr. Atanas Rountev for agreeing to serve on my Masters examination committee. I would especially like to thank Jeswin Godwin, Kevin Stock and Justin Holewinski at DL 574 for providing me with valuable technical input during the course of the program. Special thanks also to my friends Ragavendar, Venmugil, Naveen, Madhu, Sriram, Shriram, Arun and Viswa, who kept me motivated and made this journey a really enjoyable one. I would also like to thank all my friends here at Ohio State University for making this journey a pleasant one. Finally, this endeavor would not have been possible without the support of my family members, who have encouraged and motivated me to no end.

It is to my parents D. Balasubramanian and B. SaralaDevi, and my sister S. Uma, that I dedicate this work.

Vita

2010 ...... B.Tech. Information Technology, College of Engineering Guindy, Anna University, India.
June 2011 - Sept 2011 ...... Software Development Engineer Intern, Microsoft Corp.
Jan 2011 - present ...... Graduate Research Associate, Department of Computer Science and Engineering, The Ohio State University.

Fields of Study

Major Field: Computer Science and Engineering

Studies in:

High Performance Computing: Prof. P. Sadayappan
Compiler Design and Implementation: Prof. Atanas Rountev
Programming Languages: Prof. Neelam Soundarajan

Table of Contents

Page

Abstract ...... ii

Dedication ...... iii

Acknowledgments ...... iv

Vita...... v

List of Tables ...... viii

List of Figures ...... ix

1. Introduction ...... 1

1.1 Problem Description ...... 4
1.2 PFlotran ...... 6
1.3 Related Work ...... 7
1.4 Summary ...... 8

2. Modified Matrix Representation ...... 9

2.1 Structure Grid ...... 10
2.1.1 Grid or Meshes ...... 10
2.1.2 Matrix Properties ...... 12
2.2 Matrix Representation ...... 13
2.2.1 Compressed Sparse Row ...... 13
2.2.2 Block Structure Grid ...... 17
2.2.3 Modified Block Compressed Sparse Row ...... 32
2.3 Experimental Evaluation ...... 33
2.3.1 Experimental Setup ...... 33

2.3.2 Performance Comparison of Matrix structures using SSE and AVX intrinsics ...... 34
2.4 Summary ...... 46

3. Performance Evaluation on PFlotran ...... 47

3.1 PFlotran - Basics and Architecture Overview ...... 48
3.1.1 Overview of Pflotran [5] ...... 48
3.1.2 Architectural Overview ...... 49
3.2 Experimental Evaluation ...... 50
3.2.1 Experimental Setup ...... 50
3.2.2 Performance Evaluation and Analysis ...... 51
3.3 Conclusion ...... 52

4. Conclusions and Future Work ...... 54

Bibliography ...... 55

List of Tables

Table Page

2.1 Regions for a 3D - physical grid of dimension m*n*p ...... 19

2.2 Regions for a 2D - physical grid of dimension m*n*1 ...... 19

2.3 Theoretical Comparison of column-major block order and re-arranged block order ...... 32

3.1 Pflotran sample without preconditioner ...... 52

3.2 Pflotran sample with ILU preconditioning ...... 52

List of Figures

Figure Page

2.1 Types of Grid ...... 11

2.2 Stencil Computation ...... 11

2.3 Nonzero structure of Structure Grid Matrices ...... 13

2.4 Compressed Sparse Row Representation ...... 14

2.5 CSR MV ...... 15

2.6 Block CSR Representation ...... 15

2.7 Block CSR MV Algorithm ...... 16

2.8 Block Structure Grid Representation ...... 18

2.9 Block Arrangement for AVX machines ...... 20

2.10 Block Arrangement for SSE machines ...... 21

2.11 Generic Block Structure Grid MV Algorithm ...... 22

2.10 Algorithm for Handling Blocks - AVX ...... 25

2.9 Algorithm for Handling Blocks - SSE ...... 28

2.6 Horizontal Addition By Rearranging data ...... 30

2.7 Handling blocks in Block Structure Grid MV Algorithm - Other Approaches ...... 31

2.8 Vectorized Block CSR MV algorithm ...... 32

2.9 Performance Comparison of Matrix Format - Cache Resident Data - Customized block handling - AVX Machines ...... 36

2.10 Performance Comparison of Matrix Format - Cache Resident Data - Generic block handling - AVX Machines ...... 36

2.11 Performance Comparison of Matrix Format - Cache Resident Data - Customized block handling - SSE Machines ...... 37

2.12 Performance Comparison of Matrix Format - Cache Resident Data - Generic block handling - SSE Machines ...... 37

2.13 Performance Comparison of Matrix Format - Non-Cache Resident Data - Customized block handling - AVX Machines ...... 38

2.14 Performance Comparison of Matrix Format - Non-Cache Resident Data - Generic block handling - AVX Machines ...... 38

2.15 Performance Comparison of Matrix Format - Non-Cache Resident Data - Customized block handling - SSE Machines ...... 39

2.16 Performance Comparison of Matrix Format - Non-Cache Resident Data - Generic block handling - SSE Machines ...... 39

2.17 L1 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines ...... 40

2.18 L2 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines ...... 40

2.19 L3 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines ...... 41

2.20 L1 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines ...... 41

2.21 L2 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines ...... 42

2.22 L3 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines ...... 42

2.23 Performance Comparison of Unrolled Matrix Format - Non-Cache Resident Data - Customized block handling ...... 43

2.24 Performance Comparison of Unrolled Matrix Format - Non-Cache Resident Data - Generic block handling ...... 44

2.25 Performance Comparison of Unrolled Matrix Format with OpenMP - Non-Cache Resident Data ...... 44

2.26 Performance Comparison of OpenMP BSG and MPI BCSR (2 threads) - Non-Cache Resident Data ...... 45

2.27 Performance Comparison of OpenMP BSG and MPI BCSR (4 threads) - Non-Cache Resident Data ...... 45

Chapter 1: Introduction

In the era of modern technological advances, there is a need for higher accuracy in scientific applications. These high-accuracy scientific applications require a large amount of computation, which can be optimized by using high performance computing principles. With the advent of multi-core architectures, research into applying high performance computing principles to engineering problems such as computational fluid dynamics, subsurface reactive flows, etc. is attracting increasing interest. These applications use solvers available in packages such as PETSc, which apply the principles of HPC to optimize solver kernels.

Scientific computations are inherently parallel. This can be exploited by parallelizing the application to run on multiple cores. In cases where the data cannot fit on a single node, computations can be distributed across multiple nodes in a cluster. Cluster nodes communicate among themselves to solve the problem at hand; these nodes use the Message Passing Interface (MPI) standard for communication. Computations can also be distributed across multiple cores using POSIX threads and OpenMP.

Software programs spend most of their time executing only a small fraction of the code, a characteristic often called the "90-10" rule: 90% of the time is spent executing 10% of the code. This allows the kernel to be optimized with minimal effort by concentrating on that 10% of the code.

With advancements in architecture and faster computing units, the gap between memory bandwidth and computational capacity has widened. This requires algorithms to be modified to increase the spatial and temporal locality of the data used in computations. Accessing the data in the proper order also allows the hardware to prefetch data into the caches, thus reducing the latency due to memory accesses. To improve the memory accesses, loop transformation techniques such as loop unrolling, loop tiling, and loop permutation can be used. These techniques improve temporal and spatial locality at the register and cache level. When the application is latency limited, these techniques aid prefetching, thereby improving the computational speed.
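As an illustration of the kind of loop transformation meant here, the following is a small sketch (not taken from the thesis code) of a dense matrix-vector product with the inner loop unrolled by four; the function and names are purely illustrative, and n is assumed to be a multiple of 4 for brevity.

/* Illustrative only: 4x unrolled inner loop for y = A*x on a dense n x n
 * matrix.  Unrolling exposes independent multiply-adds to the compiler and
 * improves register reuse of sum; tiling the i/j loops in a similar way
 * improves cache reuse of x. */
void dense_mv_unrolled(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j += 4) {
            sum += A[i * n + j]     * x[j]
                 + A[i * n + j + 1] * x[j + 1]
                 + A[i * n + j + 2] * x[j + 2]
                 + A[i * n + j + 3] * x[j + 3];
        }
        y[i] = sum;
    }
}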

Most current architectures support SIMD parallelization through streaming SIMD extensions (SSE) or advanced vector extensions (AVX). These extensions provide vector registers of 128 bits or 256 bits respectively and efficiently perform the same operation on multiple independent words. For double-precision data, an effective speedup of 2x or 4x respectively can be achieved by using the vector registers. Code optimization techniques can be used to enhance vectorization; these include loop permutation, array padding, statement reordering, data reordering, loop distribution, node splitting, array expansion, loop peeling, and conditional handling.
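As a minimal, hypothetical illustration of such SIMD usage (not the thesis's kernel), the sketch below uses AVX intrinsics to process four double-precision elements per operation; it assumes AVX support and a length divisible by four.

#include <immintrin.h>

/* Illustrative AVX sketch: y[i] += a[i] * x[i] on doubles, four elements per
 * 256-bit register (two per 128-bit register with SSE).  A remainder loop
 * would be needed when n is not a multiple of 4. */
void vec_muladd(int n, const double *a, const double *x, double *y)
{
    for (int i = 0; i < n; i += 4) {
        __m256d av = _mm256_loadu_pd(&a[i]);
        __m256d xv = _mm256_loadu_pd(&x[i]);
        __m256d yv = _mm256_loadu_pd(&y[i]);
        yv = _mm256_add_pd(yv, _mm256_mul_pd(av, xv));
        _mm256_storeu_pd(&y[i], yv);
    }
}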

In addition to the aforementioned code optimization techniques, using appropriate data structures also helps improve the performance of the kernel. The challenge in optimizing the code lies in selecting an appropriate data structure and designing the algorithm so that the above-mentioned optimization techniques can be applied. The data structure should have good space complexity, the algorithm must exhibit good time complexity, and it is desirable that the data structure enables a high degree of spatial locality while the algorithm runs. The next challenge lies in identifying the independent logical execution paths in the code and parallelizing them. After parallelization, the complexity lies in applying suitable transformations at the loop level and applying techniques to vectorize it.

In this thesis, we discuss a new data structure and algorithms to support vectorized operation using the above-mentioned techniques. We also focus on modifying the existing, commonly used data structures to allow vectorized operations. The focus of the thesis is to improve the computation time of the matrix-vector multiplication, which is the time-dominant operation in many scientific applications, without increasing the space complexity.

For evaluating the performance, we introduce the new data structures into the parallel library PETSc, which is commonly used in many scientific applications. We also modify the existing data structure to support vectorized matrix-vector multiplication and compare the performance against the existing data structures for structured grids available in PETSc, such as the Compressed Sparse Row (CSR) format and the Blocked Compressed Sparse Row (BCSR) format. The matrices used for the performance comparison are generated to have the same structure as a 5-point stencil, which is used for 2D physical grids, and a 7-point stencil, which is used for 3D physical grids. In the second part of the thesis, we use PFlotran, a 3D modeling application for multiphase multicomponent subsurface reactive flow that uses PETSc, to test the performance on a real-world application.

1.1 Problem Description

Solving many physical problems such as heat transfer, computational fluid dynamics, etc. involves writing the governing equations using laws of conservation. These governing equations are expressed numerically using partial differential equations (PDEs). For instance, PFlotran models multiphase flows and multicomponent reactive transport in three-dimensional problem domains using the partial differential equations in [5].

Numerical methods for solving partial differential equations require some form of spatial discretization, or mesh of nodes, at which the solution is specified [1]. The PDE is solved over the discretized mesh by attributing unknown variables to the grid points. Computations typically proceed as a sequence of grid update steps. For example, for explicit methods, at each step, values associated with each entity are updated in parallel, based on values retrieved from neighboring entities. For implicit methods, a sparse linear algebraic system is solved at each step.

Computations that are done on structured grids can be broadly classified into two classes: explicit and implicit. Explicit methods calculate the state of a system at a later time from the state of the system at the current time, while implicit methods find a solution by solving an equation involving both the current state of the system and the later one. An example of an explicit method is a stencil computation on the grid, which involves interacting only with neighboring grid points. This operation is simple and does not need any complex data structure to represent the structure grid. Usually two-dimensional (2D) or three-dimensional (3D) matrices are used for these stencil computations. Another important kind of computation is the implicit method, involving a sparse solver, which usually requires a matrix-vector multiplication at each step.
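For concreteness, a simple explicit 5-point stencil sweep might look like the following sketch; the array layout and coefficients are illustrative assumptions, not taken from any specific application.

/* An explicit 5-point stencil sweep on a 2D grid.  u and unew are
 * (m+2) x (n+2) row-major arrays with a one-point boundary halo;
 * the coefficients are illustrative. */
void stencil_5pt(int m, int n, const double *u, double *unew)
{
    int ld = n + 2;                                    /* row stride incl. halo */
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            unew[i * ld + j] = 0.5   * u[i * ld + j]
                             + 0.125 * (u[(i - 1) * ld + j] + u[(i + 1) * ld + j]
                                      + u[i * ld + j - 1]   + u[i * ld + j + 1]);
}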

In the implicit method, the structure grid is represented as a sparse matrix (in CSR or diagonal format). These sparse matrices generally have a diagonal or block-diagonal pattern corresponding to the degrees of freedom involved in the problem. For instance, in subsurface reactive flow models, the number of chemical components determines the degrees of freedom. In general, sparse matrix-vector (SpMV) multiplication is the time-dominant portion of the solver, so optimizing it improves the overall performance of the solver.

In this thesis, we focus on introducing a new matrix representation, Block Structured Grid, that uses the properties of the physical grid to allow vectorized SpMV multiplication. The dense block arrangement of elements allows vectorization within a block. We also discuss modifying the block CSR matrix representation to allow vectorization during the SpMV multiplication. In both cases, the SpMV multiplication can be viewed as a sequence of dense matrix-vector multiplications, with the size of each dense matrix determined by the degrees of freedom in the problem. Our work mainly focuses on vectorization of the dense matrix-vector multiplication. With enough degrees of freedom, the performance improvement obtained by this vectorization approaches the improvement that would be obtained by vectorizing across the entire sparse matrix.

1.2 PFlotran

PFlotran is a tool for modeling Multiscale-Multiphase-Multicomponent Subsurface Reactive Flows using advanced computing. It utilizes a first-order finite volume spatial discretization combined with backward-Euler time stepping. The system of nonlinear equations arising from the discretization is solved using inexact Newton-Krylov methods. The Conjugate Gradient method (CG) is a member of a family of iterative solvers known as Krylov subspace methods, used primarily on large sparse linear systems arising from the discretization of partial differential equations (PDEs). CG uses successive approximations to obtain a more accurate solution at each step. It is considered a nonstationary method, generating a sequence of conjugate (or orthogonal) vectors.

PFlotran is built on top of the PETSc framework and uses numerous features from PETSc, including nonlinear solvers, linear solvers, sparse matrix data structures (both blocked and non-blocked matrices), vectors, constructs for the parallel solution of PDEs on structured grids, an options database (runtime control of solver options), and binary I/O. PFLOTRAN employs domain-decomposition parallelism, with each subdomain assigned to an MPI process and a parallel solve implemented over all processes. A number of different solver and preconditioner combinations from PETSc or other packages can be used.

In this thesis, we focus on modifying the representation of structured grids in the PETSc framework, using algorithms that support vectorization, and showing the impact on PFlotran. We run various samples and show the performance improvement obtained by using the vectorized code.

1.3 Related Work

We now discuss the research efforts relevant to our work from the area of algorithms for sparse matrix representations. Over the past decade, many research groups have worked on developing different methodologies to improve the performance of sparse matrix representations, including [8]. In [7], attempts were made to pack data to reduce indirection, and elements were rearranged using heuristics to improve the effectiveness of the sparse structure. In [6], new low-level kernel modules were developed for runtime performance tuning of sparse matrix kernels. In [3], adaptive runtime tuning for improving the parallel performance of matrix kernels was proposed; tuning was done based on the load on each node, and the communication method (broadcast or point-to-point) was selected at runtime. In [4], a new matrix representation, structured grid, that supports vectorized SpMV was introduced, but it does not take into account the degrees of freedom involved in the problem under consideration. The problems considered in this thesis have unique characteristics that can be exploited for static tuning of the matrix kernels. For instance, all of them have a block diagonal structure, which enables vectorization within a block.

1.4 Summary

This thesis has considered the problem of optimizing Sparse Matrix-Vector Multiplication for multicore processors by allowing vectorization. Matrix-Vector Multiplication for the sparse matrix structures that occur in structured grid problems is considered. A novel representation for the above-mentioned sparse matrices is introduced. In the second part, a modified Block CSR representation is discussed. In both cases, a Matrix-Vector Multiplication that uses vector instructions is implemented.

An experimental study was conducted using the new Matrix-Vector Multiplication algorithm. Parallelization was also done using OpenMP threads. The summary of the results is as follows. With the new matrix representations we were able to achieve an average of 1.5x performance improvement for generic algorithms. By using a customized algorithm for 3D structures we were able to increase the performance improvement to 2x, and by using OpenMP threads we were able to further improve the performance by another factor of 2.

In the second part of the thesis, we compare the performance of the modified Block CSR against the standard matrix representations in PETSc. With the new matrix representations a performance improvement of 20% is achieved.

The remainder of the work is organized as follows. In the remainder of this chapter we provide background on the problem that we have considered and the tools that are used. In Chapter 2, we discuss our data structures and algorithms and evaluate their effectiveness. In Chapter 3, we discuss the performance improvement obtained by using the data structures in PFlotran. We finally conclude the work in Chapter 4.

Chapter 2: Modified Matrix Representation

Matrix-Vector multiplication is the most expensive operation and determines the overall cost of a sparse linear solver. Since it is desired to keep this cost minimal, the representation of the structured grid (SG) and the MV algorithm have to be optimal. Many linear solvers involve Matrix-Vector multiplication, dot-product, and SAXPY operations, of which Matrix-Vector multiplication is the time-dominant operation.

Our focus is to develop a new representation for the matrices arising from structure grid problems and to run Matrix-Vector multiplication faster than the algorithms for the existing representations. We will see how to define the new data structure without increasing the space complexity and provide the corresponding Matrix-Vector algorithm. We will also discuss ways to enable vectorization for the current representations, look into the problems with the current representations, and discuss the reasons for the improved efficiency of the new representation. We provide a driver program written using the parallel library PETSc to compare performance with existing representations such as CSR and Block CSR.

The remainder of the chapter is organized as follows. In section 2.1 we describe the structure grid problem and the properties of the matrices arising from it. In section 2.2 we describe the new representations for the matrices and discuss the improved Matrix-Vector algorithm. In section 2.3 we show the performance of the new matrix representations using PETSc driver programs, and we conclude the chapter in section 2.4.

2.1 Structure Grid

In this section we describe the physical grids and their characteristics when represented as matrices. We also describe the standard representations used and the problems with those representations.

2.1.1 Grid or Meshes

Physical problems such as computational fluid dynamics and heat transfer are generally defined using partial differential equations (PDEs). These PDEs are solved using nonlinear solvers or linear solvers depending on the order of the equations. In general, the nonlinear solvers convert the high-order equations into low-order equations and use the linear solvers in each step. The linear solvers represent these equations geometrically as grids or meshes.

Associated with each grid element are one or more dependent variables (degrees of freedom) such as pressure, volume or temperature. Numerical algorithms representing approximations to the conservation laws of mass, momentum, and energy are then used to compute these variables at each grid point. Each grid point is updated based on the values of the neighboring grid points iteratively until the result converges.

Depending on the equation used, these grids can be either structured or unstructured, as in fig 2.1. In structured grids, the number of neighbors contributing to the computation of the dependent variables (the stencil) is fixed and the grid has a definite shape. For instance, the simplest structured grid is a rectangular grid with a 5-point stencil, where the contributions come from the neighboring grid points to the left, right, top, and bottom, and from the point itself (fig 2.2). In an unstructured grid, not all neighboring grid points contribute to the computation of the dependent variables, resulting in a grid without a definite shape. While unstructured grids allow complex physical problems to be defined, they are generally more difficult to solve than structured grid problems.

Figure 2.1: Types of Grid

Figure 2.2: Stencil Computation

The most commonly used method to solve an unstructured grid is to use Delaunay triangulation. In the remainder of the chapter, we consider only the structure grids for convenience.

2.1.2 Matrix Properties

The matrices that are used to represent these structure grids display the following characteristics (fig 2.3) which will be the key for our representations.

• The matrices have an equal number of rows and columns, equal to the number of grid points.

• Since each matrix element represents the coefficient of the interaction of the corresponding grid point, the matrix is generally sparse.

• Since the stencil is fixed, the matrix has either a diagonal or a block diagonal structure, depending on the degrees of freedom.

• The number of non-zeros in a row is determined by the position of the corresponding grid point.

Figure 2.3: Nonzero structure of Structure Grid Matrices

2.2 Matrix Representation

In a linear solver, physical structure grids are represented using sparse matrices with the number of rows and columns equal to the number of grid points, and with each matrix element being a block of size equal to the degrees of freedom. For the detailed discussion, we consider a 2D physical grid of size 5x5 with a 5-point stencil and 2 degrees of freedom. This can be generalized to higher dimensions too.

2.2.1 Compressed Sparse Row

The most commonly used sparse matrix representation for physical problems is Compressed Sparse Row storage (fig 2.4). A generalized CSR representation uses three vectors: one for the linearized matrix elements, one to store the corresponding column indices, and one to store the row offsets, as shown in fig 2.4.

Figure 2.4: Compressed Sparse Row Representation

The space requirement of this representation for the above-mentioned matrix is 4 (block size) * 5 (number of neighbor elements) scalars per grid point for the matrix elements, the same number of integers for storing the columns, and 2 integers per grid point for the row offsets. Effectively, N*4*5 scalars and (2*N+1) + N*4*5 integers are needed for storing the row and column indices. The matrix-vector multiplication (y = Ax) algorithm for the CSR representation is given in fig 2.5.

Due to the indirection involved in accessing the x-vector elements and the indeterminate sparsity pattern of the matrix elements, the CSR representation does not lend itself to SIMD parallelization.
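A minimal scalar CSR matrix-vector product along the lines of fig 2.5 is sketched below (the names are illustrative); the indirect, irregular access to the x-vector is visible in the inner loop.

/* Scalar CSR matrix-vector product (y = A*x): val holds the nonzeros,
 * col the column index of each nonzero, and rowptr[i]..rowptr[i+1]
 * delimits row i. */
void csr_mv(int nrows, const int *rowptr, const int *col,
            const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];   /* indirect access to x */
        y[i] = sum;
    }
}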

Figure 2.5: CSR MV Algorithm

One of the optimizations applied to the CSR representation is to store the block elements together and to use a single row index and a single column index for each block (fig 2.6). The space requirement for this representation is N*4*5 scalars for storing the grid elements, N*5 integers for storing the column indices, and N+1 integers for storing the row offsets. The matrix-vector multiplication algorithm for the Block CSR representation is given in fig 2.7.

Figure 2.6: Block CSR Representation

Figure 2.7: Block CSR MV Algorithm

Even though the temporal locality of the x-vector is improved, SIMD parallelization requires custom modules and rearrangement of the matrix elements. The modified algorithm is discussed in section 2.2.3.
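For reference, a scalar Block CSR matrix-vector product in the spirit of fig 2.7 might look like the following sketch; the array names and the row-major bs x bs block layout are assumptions for illustration. The dense block loop is the part targeted for vectorization in section 2.2.3.

#include <stddef.h>

/* Scalar Block CSR MV: each stored entry is a dense bs x bs block; bcol gives
 * the block-column index and browptr delimits each block row. */
void bcsr_mv(int nbrows, int bs, const int *browptr, const int *bcol,
             const double *val, const double *x, double *y)
{
    for (int i = 0; i < nbrows; i++) {
        double *yb = &y[i * bs];
        for (int r = 0; r < bs; r++) yb[r] = 0.0;
        for (int k = browptr[i]; k < browptr[i + 1]; k++) {
            const double *blk = &val[(size_t)k * bs * bs];
            const double *xb  = &x[bcol[k] * bs];
            for (int r = 0; r < bs; r++)          /* dense bs x bs block times xb */
                for (int c = 0; c < bs; c++)
                    yb[r] += blk[r * bs + c] * xb[c];
        }
    }
}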

2.2.2 Block Structure Grid

Key Goals

As stated earlier, the main idea behind the new algorithm is to allow SIMD parallelization in the Matrix-Vector Multiplication to improve the performance of the linear PDE solver. The new algorithm should also maintain the accuracy of the computation: the result obtained from the new Matrix-Vector multiplication algorithm must be the same as that of the standard algorithm, with absolutely no difference.

Detailed description

The following characteristics of the matrix are used in defining the matrix representation.

• Every grid element has the same block size.

• All the non-zero elements are located at a definite and constant offset from the centre diagonal.

• The number of stencil neighbors required for the computation of a particular grid point depends on its location in the structure grid. For instance, the top-right edge point has neighboring grid points only to its left and below (and to its right if a wrap-around scheme is used).

The Block Structure Grid represents the matrix elements using two vectors, one for the grid elements and one to define the offsets of the stencil from the diagonal, as shown in fig 2.8. The space requirement for the previously mentioned matrix is 4 (block size) * 5 (number of stencil neighbors) * N (number of grid points) scalars, plus 5 integers to define the stencil offsets. These stencil offsets can be identified during matrix initialization, namely 0 for the grid point itself, -1 for the left neighbor, 1 for the right neighbor, -5 (or -m for a grid with dimensions m*n) for the top neighbor, and 5 (or m for a grid with dimensions m*n) for the bottom neighbor.

Figure 2.8: Block Structure Grid Representation

The stencil diagonals are arranged consecutively within a region. A region is defined as a range of grid points in which the stencil neighbors used for computation are the same. Further, the range of each region is also definite and depends on the dimensions of the physical grid. Tables 2.1 and 2.2 define the ranges of the different regions and the corresponding stencil neighbors used in each region.

Starting Point | End Point | Stencil Neighbors used other than self
0              | 1         | right, down, back
1              | m         | left, right, down, back
m              | m*n       | top, left, right, down, back
m*n            | N-m*n     | front, top, left, right, down, back
N-m*n          | N-m       | front, top, left, right, down
N-m            | N-1       | front, top, left, right
N-1            | N         | front, top, left

Table 2.1: Regions for a 3D - physical grid of dimension m*n*p

Starting Point | End Point | Stencil Neighbors used other than self
0              | 1         | right, down
1              | m         | left, right, down
m              | N-m       | top, left, right, down
N-m            | N-1       | top, left, right
N-1            | N         | top, left

Table 2.2: Regions for a 2D - physical grid of dimension m*n*1
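The region boundaries in Table 2.1 can be computed directly from the grid dimensions; the following sketch (illustrative names, 3D case only) shows one way to do so.

/* Region boundaries for a 3D grid of dimension m*n*p (N = m*n*p grid points),
 * following Table 2.1: region r spans grid points [bounds[r], bounds[r+1]). */
void region_bounds_3d(int m, int n, int p, int bounds[8])
{
    int N = m * n * p;
    bounds[0] = 0;          /* first point: right, down, back              */
    bounds[1] = 1;          /* 1 .. m: left, right, down, back             */
    bounds[2] = m;          /* m .. m*n: top, left, right, down, back      */
    bounds[3] = m * n;      /* m*n .. N-m*n: all six neighbors             */
    bounds[4] = N - m * n;  /* N-m*n .. N-m: front, top, left, right, down */
    bounds[5] = N - m;      /* N-m .. N-1: front, top, left, right         */
    bounds[6] = N - 1;      /* N-1 .. N: front, top, left                  */
    bounds[7] = N;
}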

Further, the arrangement of elements within a block and the algorithm to be used are defined by the type of vectorization required (SSE or AVX). This difference in arrangement is required due to the difference in the functionality available for the vector registers. Figs 2.9 and 2.10 show the block arrangement of elements for a structure grid with 7 degrees of freedom.

Matrix-Vector Algorithm

The Matrix-Vector multiplication on the Block Structure Grid uses slightly different algorithms based on the type of vectorization and the degrees of freedom. It differs in the way blocks are handled, owing to the different arrangement of block elements. The generic algorithm is given in fig 2.11.

Figure 2.9: Block Arrangement for AVX machines

Figure 2.10: Block Arrangement for SSE machines

For computational effectiveness, regions are sub-divided. This allows elements accessed consecutively to remain in the same page, enabling the hardware prefetcher.

The algorithms for handling the blocks using SSE vector registers and AVX vector registers are given in figs 2.10 and 2.9.

Matrix-Vector multiplication on Block Structure Grid uses vectorization only within block elements. This is done for reasons discussed below.

• Accessing elements block by block gives the x-vector better temporal locality (x-vector elements are reused within a block).

• There is no need to load the output vector elements. This reduces the total number of loads that would otherwise be required in a bandwidth-constrained problem.

• Since each output vector element is written only once, the algorithm allows SPMD parallelization.

• The block structure and the regions avoid the need for padding data.

Figure 2.11: Generic Block Structure Grid MV Algorithm
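To make the generic algorithm of fig 2.11 concrete, the following is a sketch of a Block Structure Grid MV loop. The storage order assumed here (blocks stored per grid point, active stencil diagonal after active stencil diagonal, region by region) and all names are illustrative assumptions; the actual PETSc implementation may lay the data out differently.

/* Sketch of a generic Block Structure Grid MV (y = A*x).
 *   dof   - degrees of freedom (block size)
 *   ns    - number of stencil diagonals
 *   off   - stencil offsets in grid points (e.g. {0, -1, 1, -m, m})
 *   rs/re - start/end grid point of each region
 *   act   - which stencil diagonals are active in each region
 *   a     - block elements stored consecutively in the assumed order */
void bsg_mv(int nregions, const int *rs, const int *re,
            int ns, const int *off, const char *act,
            int dof, const double *a, const double *x, double *y)
{
    const double *blk = a;
    for (int r = 0; r < nregions; r++) {
        for (int g = rs[r]; g < re[r]; g++) {
            double *yb = &y[g * dof];
            for (int i = 0; i < dof; i++) yb[i] = 0.0;   /* y written exactly once */
            for (int s = 0; s < ns; s++) {
                if (!act[r * ns + s]) continue;          /* diagonal unused here */
                const double *xb = &x[(g + off[s]) * dof];
                for (int i = 0; i < dof; i++)            /* dense dof x dof block */
                    for (int j = 0; j < dof; j++)
                        yb[i] += blk[i * dof + j] * xb[j];
                blk += dof * dof;
            }
        }
    }
}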

Requirements for rearranging blocks

One of the requirements imposed by the above algorithms to improve performance is not loading the output vector. For this reason, the resultant vector register should hold its elements in sequential order. We use the horizontal-add operation on vector registers to achieve this. Since SSE vector registers fully support the horizontal-add operation, the rearrangement is done only to improve computational efficiency. In the case of AVX registers, however, a horizontal add can be performed only within each 128-bit half of the vector register; a complete horizontal add is therefore achieved in stages, as shown in fig 2.6.

Figure 2.10: Algorithm for Handling Blocks - AVX

Figure 2.9: Algorithm for Handling Blocks - SSE

Figure 2.6: Horizontal Addition By Rearranging Data (Stages 1-5)

We used a few other approaches to avoid rearranging the data within a block. One approach was to arrange the elements in column-major order and splat the x-vector elements across registers. The algorithm for handling a block this way is given in fig 2.7. The problem with this approach was that it increased the number of loads. A comparative study of the number of AVX vector operations and the number of clock cycles required [2] per block is given in table 2.3. It can be seen that the rearranged block structure requires fewer load operations, and even the number of multiplication and addition operations required is lower than for the column-major block order. This reduction in operations is facilitated by the horizontal-add operations. With block-size-customized code, the number of horizontal-add operations can be reduced further by keeping temporary register values across blocks.
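One well-known AVX sequence for this staged horizontal addition is sketched below: it combines the four per-row product registers of a 4x4 block into a single register holding the four row sums, so the result can be stored without ever loading y. This is a sketch of the idea, not necessarily the exact instruction sequence used in the implementation.

#include <immintrin.h>

/* Combine r0..r3 (per-row products of a 4x4 block) into
 * [sum(r0), sum(r1), sum(r2), sum(r3)]. */
static inline __m256d hadd4(__m256d r0, __m256d r1, __m256d r2, __m256d r3)
{
    __m256d t01   = _mm256_hadd_pd(r0, r1);                 /* pairwise sums of r0, r1 */
    __m256d t23   = _mm256_hadd_pd(r2, r3);                 /* pairwise sums of r2, r3 */
    __m256d swap  = _mm256_permute2f128_pd(t01, t23, 0x21); /* cross the 128-bit lanes */
    __m256d blend = _mm256_blend_pd(t01, t23, 0xC);
    return _mm256_add_pd(swap, blend);
}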

Figure 2.7: Handling blocks in Block Structure Grid MV Algorithm - Other Approaches

Instruction | Latency | Column-Major Block    | Rearranged Block
Loadu       | 6       | bs + bs*(bs-3)/4 + bs | bs/4 + bs*(bs-3)/4 + bs/2 + bs/4
Permute     | 6       | 2*bs                  | 4
Mul         | 7       | bs*(bs-3)/4 + bs      | bs*(bs-3)/4 + bs/2 + bs/4
Hadd        | 5       | 0                     | 3*bs*bs/16
Add         | 3       | bs*(bs-3)/4 + bs      | bs/4

Table 2.3: Theoretical Comparison of column-major block order and re-arranged block order

2.2.3 Modified Block Compressed Sparse Row

The block arrangement discussed in the previous section can also be extended to the Blocked Compressed Sparse Row format. This allows vectorization of the matrix-vector multiplication algorithm, as shown in fig 2.8.

Figure 2.8: Vectorized Block CSR MV algorithm
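Putting the pieces together, a vectorized Block CSR MV for 4x4 blocks might look like the following sketch, which reuses the staged horizontal add shown in the previous section; the block layout, names, and the fixed block size of 4 are illustrative assumptions, not the actual fig 2.8 code.

#include <immintrin.h>

/* Multiply one row-major 4x4 block by xb and accumulate into acc. */
static void bcsr_block4(const double *blk, const double *xb, __m256d *acc)
{
    __m256d xv = _mm256_loadu_pd(xb);
    __m256d r0 = _mm256_mul_pd(_mm256_loadu_pd(blk +  0), xv);
    __m256d r1 = _mm256_mul_pd(_mm256_loadu_pd(blk +  4), xv);
    __m256d r2 = _mm256_mul_pd(_mm256_loadu_pd(blk +  8), xv);
    __m256d r3 = _mm256_mul_pd(_mm256_loadu_pd(blk + 12), xv);
    __m256d t01   = _mm256_hadd_pd(r0, r1);                  /* staged horizontal add */
    __m256d t23   = _mm256_hadd_pd(r2, r3);
    __m256d swap  = _mm256_permute2f128_pd(t01, t23, 0x21);
    __m256d blend = _mm256_blend_pd(t01, t23, 0xC);
    *acc = _mm256_add_pd(*acc, _mm256_add_pd(swap, blend));  /* [y0..y3] += block*xb */
}

/* Vectorized Block CSR MV for block size 4: the output block is stored once
 * per block row, without loading y. */
void bcsr_mv4(int nbrows, const int *browptr, const int *bcol,
              const double *vals, const double *x, double *y)
{
    for (int i = 0; i < nbrows; i++) {
        __m256d acc = _mm256_setzero_pd();
        for (int k = browptr[i]; k < browptr[i + 1]; k++)
            bcsr_block4(&vals[16 * k], &x[4 * bcol[k]], &acc);
        _mm256_storeu_pd(&y[4 * i], acc);
    }
}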

2.3 Experimental Evaluation

This section describes the experiments we conducted to evaluate the Matrix-Vector multiplication using the PETSc driver program. The goals of the experiments are as follows:

• The new algorithm produces exactly the same result as the existing standard algorithms.

• The new algorithm runs faster and is capable of taking advantage of vector architectures.

2.3.1 Experimental Setup

The details of the machine used for running the experiments are as follows. The machine, sirius, has an Intel Core i7 with four logical CPUs, a 32 KB L1 data cache, a 256 KB L2 cache, and an 8 MB shared L3 cache. It has 16 GB of memory. Each core has a clock frequency of 3.4 GHz. The machine runs Linux kernel 2.6. The machine has vector registers of size 256 bits, which can perform up to 4 double-precision floating point operations per register operation. The machine also supports SSE vector operations, which are performed by masking the upper 128 bits of the register.

We have evaluated our experiments using a driver program written with the PETSc API that performs repeated Matrix-Vector multiplication operations. This simulates the Matrix-Vector multiplication calls in a linear solver. The Matrix-Vector multiplication is called on the same data so as to capture any cache-residency benefits exploited by the linear solver. A customized algorithm is used for Matrix-Vector multiplication for smaller block sizes (less than 7). The PETSc library and the new matrix structures are compiled using the Intel compiler icc with the '-fast' option.
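A minimal sketch of such a driver is shown below, using standard PETSc calls (PetscInitialize, MatCreateSeqBAIJ, MatMult). The matrix assembly is elided and the sizes shown are placeholders rather than the thesis's actual settings; the custom BSG and vectorized BCSR matrix types would be selected in place of the stock blocked format.

#include <petscmat.h>

int main(int argc, char **argv)
{
    Mat            A;
    Vec            x, y;
    PetscInt       npts = 1000000, dof = 4, niter = 100, n;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); CHKERRQ(ierr);
    n = npts * dof;

    /* Blocked (BCSR-style) matrix with block size dof and 7 blocks per row. */
    ierr = MatCreateSeqBAIJ(PETSC_COMM_SELF, dof, n, n, 7, NULL, &A); CHKERRQ(ierr);
    /* ... fill A with the 7-point stencil blocks here ... */
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

    ierr = VecCreateSeq(PETSC_COMM_SELF, n, &x); CHKERRQ(ierr);
    ierr = VecDuplicate(x, &y); CHKERRQ(ierr);
    ierr = VecSet(x, 1.0); CHKERRQ(ierr);

    /* Repeated MatMult on the same data, mimicking a solver's call pattern. */
    for (PetscInt it = 0; it < niter; it++) {
        ierr = MatMult(A, x, y); CHKERRQ(ierr);
    }

    ierr = MatDestroy(&A); CHKERRQ(ierr);
    ierr = VecDestroy(&x); CHKERRQ(ierr);
    ierr = VecDestroy(&y); CHKERRQ(ierr);
    ierr = PetscFinalize();
    return (int)ierr;
}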

The experiments were done with two input sizes, one in which the matrix is completely cache resident (N = 1000) and the other in which the data size is larger than the cache (N = 1000000). The matrix was created so that it has the same structure as that of a 3D physical grid with a 7-point stencil. Memory is dynamically allocated for the matrices. The memory allocation is not modified to allocate only aligned memory, since the latency for loading aligned and unaligned memory is the same, 6 clock cycles [2]. In these experiments, we have used double-precision floating point values.

2.3.2 Performance Comparison of Matrix structures using SSE and AVX intrinsics

In this set of experiments, a 7-point stencil is used, which means each grid point interacts with its neighboring grid points in the x, y, and z directions with offset 1. We have executed the MV algorithm with double-precision floating point operations. The first set of experiments uses a matrix size such that the entire data is cache-resident.

The results of the experiments show that the vectorized operation runs faster than the existing algorithms. The results in figs 2.9 and 2.11 correspond to Mat-Vec multiplication customized for smaller block sizes on a cache-resident matrix. When the degrees of freedom are 1, block structure grid performs 2.5x faster than both CSR and block CSR. For the other cases, it performs 1.5x faster than both CSR and block CSR. Modified block CSR performs slightly better than block structure grid in all cases except when the degrees of freedom are 1.

The results in figs 2.10 and 2.12 use the generic Mat-Vec multiplication mentioned before. Both of the new representations perform slightly worse initially, but with larger block sizes the performance improvement due to vectorization is achieved. Block structure grid performs nearly 1.8x faster than the Block CSR and CSR formats, and modified block CSR performs 1.9x faster.

For the second set of experiments, a larger matrix size is used such that the data does not fit in the cache. Using the customized algorithms, the new structures perform 1.5x faster than CSR. Using the generic algorithm, the new structures perform 1.5x faster when the degrees of freedom are even, and they perform as well as the existing algorithms when the degrees of freedom are odd.

When using AVX intrinsics, the results were similar to before. This is because the vectorization available within blocks is not enough to hide the latency caused by bandwidth constraints. Figures 2.17, 2.18, 2.19, 2.20, 2.21 and 2.22 give the cache miss ratios at the different cache levels.

Performance improvement in Block Structure Grid using explicit unrolling of regions

This is one of the optimizations used to improve the performance of the block structure grid algorithm. For structure grid problems, the region boundaries and the stencil neighbors are fixed for a particular grid dimension, as given in tables 2.2 and 2.1. In this experiment, we use these boundary limits to explicitly unroll the stencil loop.

35 Figure 2.9: Performance Comparison of Matrix Format - Cache Resident Data - Customized block handling - AVX Machines

Figure 2.10: Performance Comparison of Matrix Format - Cache Resident Data - Generic block handling - AVX Machines

Figure 2.11: Performance Comparison of Matrix Format - Cache Resident Data - Customized block handling - SSE Machines

Figure 2.12: Performance Comparison of Matrix Format - Cache Resident Data - Generic block handling - SSE Machines

Figure 2.13: Performance Comparison of Matrix Format - Non-Cache Resident Data - Customized block handling - AVX Machines

Figure 2.14: Performance Comparison of Matrix Format - Non-Cache Resident Data - Generic block handling - AVX Machines

Figure 2.15: Performance Comparison of Matrix Format - Non-Cache Resident Data - Customized block handling - SSE Machines

Figure 2.16: Performance Comparison of Matrix Format - Non-Cache Resident Data - Generic block handling - SSE Machines

Figure 2.17: L1 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines

Figure 2.18: L2 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines

Figure 2.19: L3 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines

Figure 2.20: L1 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines

Figure 2.21: L2 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines

Figure 2.22: L3 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines

We use a 3D physical grid with a non-cache-resident data size.

Figure 2.23: Performance Comparison of Unrolled Matrix Format - Non-Cache Resident Data - Customized block handling

The result of this experiment shows a performance improvement of 1.2x compared to the aforementioned algorithm. We further incorporate SPMD parallelism into this algorithm using OpenMP. In this set of experiments, we run the region-unrolled algorithm across multiple cores using OpenMP constructs. We can obtain up to a maximum of N (grid points) parallelism without using any synchronization. The idea is that each block row of the matrix is handled by a different thread; since each output value is written only once, there is no need for thread synchronization. Figures 2.25, 2.26 and 2.27 show the performance improvement due to SPMD parallelism.
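The SPMD parallelization described above amounts to a single OpenMP pragma over the block-row loop; a sketch is shown below for the scalar Block CSR kernel with illustrative names, and the same pragma applies to the unrolled block structure grid kernel.

#include <omp.h>
#include <stddef.h>

/* Each thread owns a disjoint range of block rows and writes every output
 * block exactly once, so no synchronization beyond the implicit barrier is
 * needed. */
void bcsr_mv_omp(int nbrows, int bs, const int *browptr, const int *bcol,
                 const double *vals, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nbrows; i++) {
        double *yb = &y[i * bs];
        for (int r = 0; r < bs; r++) yb[r] = 0.0;
        for (int k = browptr[i]; k < browptr[i + 1]; k++) {
            const double *blk = &vals[(size_t)k * bs * bs];
            const double *xb  = &x[bcol[k] * bs];
            for (int r = 0; r < bs; r++)
                for (int c = 0; c < bs; c++)
                    yb[r] += blk[r * bs + c] * xb[c];
        }
    }
}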

Figure 2.24: Performance Comparison of Unrolled Matrix Format - Non-Cache Resident Data - Generic block handling

Figure 2.25: Performance Comparison of Unrolled Matrix Format with OpenMP - Non-Cache Resident Data

44 Figure 2.26: Performance Comparison of OpenMP BSG and MPI BCSR (2 threads) - Non-Cache Resident Data

Figure 2.27: Performance Comparison of OpenMP BSG and MPI BCSR (4 threads) - Non-Cache Resident Data

2.4 Summary

This chapter focused on developing a new representation for the matrices arising in structure grid problems. We also described our Matrix-Vector algorithm that takes advantage of SIMD architectures, and we discussed extending the ideas used in the new representation to the Block Compressed Sparse Row format.

Our experimental study was conducted using a driver program in PETSc. The summary of the results is as follows. With the new matrix structure, block structure grid, we were able to achieve a performance improvement of 1.2x to 1.8x. In all cases, however, the modified Block CSR achieved slightly higher performance. By optimizing the block structure grid for a particular dimension and by unrolling the stencil neighbor loop, we were able to achieve a 25% performance improvement over the generic block structure grid algorithm. We were further able to improve the performance of this algorithm by 1.5x to 3.5x by using 2 and 4 OpenMP threads respectively.

In the next chapter, we will use a real-world application, PFlotran, to compare the performance of our new matrix representations. We will compare the effectiveness of our algorithm against the standard matrix representations used in PETSc.

Chapter 3: Performance Evaluation on PFlotran

PFlotran is a tool for modeling Multiscale-Multiphase-Multicomponent Subsurface Reactive Flows using advanced computing. It utilizes a first-order finite volume spatial discretization combined with backward-Euler time stepping. PFlotran is built on top of PETSc, a parallel library for solving nonlinear and linear differential equations using the principles of advanced computing.

In this chapter, we cover the basics of PFlotran and give an overview of its architecture. We also discuss the dependency of PFlotran on PETSc and the operations used by the PFlotran modules, as well as the role of preconditioners in linear solvers and the modifications necessary in the matrix formats discussed earlier to allow the use of preconditioners. We then evaluate the performance of PFlotran using the new matrix formats and discuss the results.

The remainder of the chapter is organized as follows. In section 3.1 we discuss the basics of PFlotran and overview its architecture. In section 3.2 we provide a performance evaluation done on PFlotran, and we summarize the results in section 3.3.

3.1 PFlotran - Basics and Architecture Overview

3.1.1 Overview of Pflotran [5]

PFLOTRAN solves a coupled system of continuum-scale mass and energy conservation equations in porous media for a number of phases, including air, water, and supercritical CO2, and for multiple chemical components. The general form of the multiphase partial differential equations solved in the flow module of PFLOTRAN for mass and energy conservation can be summarized as

\frac{\partial}{\partial t}\Big(\phi \sum_{\alpha} s_\alpha \rho_\alpha X_i^\alpha\Big) + \nabla \cdot \sum_{\alpha} \Big[ q_\alpha \rho_\alpha X_i^\alpha - \phi s_\alpha D_\alpha \rho_\alpha \nabla X_i^\alpha \Big] = Q_i    (3.1)

and

\frac{\partial}{\partial t}\Big(\phi \sum_{\alpha} s_\alpha \rho_\alpha U_\alpha + (1-\phi)\,\rho_r c_r T\Big) + \nabla \cdot \Big[ \sum_{\alpha} q_\alpha \rho_\alpha H_\alpha - k \nabla T \Big] = Q_e    (3.2)

In these equations, α designates a phase (e.g. H2O, supercritical CO2), species are designated by the subscript i (e.g. w = H2O, c = CO2), φ denotes the porosity of the geologic formation, s_α denotes the saturation state of the phase, X_i^α denotes the mole fraction of species i; ρ_α, H_α, U_α refer to the molar density, enthalpy, and internal energy of each fluid phase, respectively; q_α denotes the Darcy flow rate; and Q_i, Q_e denote source/sink terms such as wells. The multicomponent reactive transport equations solved by PFLOTRAN have the form

\frac{\partial}{\partial t}\Big(\phi \sum_{\alpha} s_\alpha \Psi_j^\alpha\Big) + \nabla \cdot \sum_{\alpha} \Omega_j^\alpha = - \sum_{m} \nu_{jm} I_m    (3.3)

for the jth primary species, and

\frac{\partial \phi_m}{\partial t} = \bar{V}_m I_m    (3.4)

for the mth mineral. The quantities Ψ_j^α, Ω_j^α denote the total concentration and flux of the jth primary species in phase α. The mineral precipitation/dissolution reaction rate I_m is determined using a transition state rate law, and the quantities ν_jm designate the stoichiometric reaction coefficients. These equations are coupled to the flow and energy conservation equations through the variables p, T, s_α, and q_α.

PFlotran currently utilizes a first-order finite volume spatial discretization combined with backward-Euler (fully implicit) time stepping. Upwinding is used for the advective term in the transport equations. The system of nonlinear equations arising from the discretization is solved using inexact Newton-Krylov methods. Within the flow and transport modules the equations are solved fully implicitly, but because transport generally requires much smaller time steps than flow, these modules are coupled sequentially.

3.1.2 Architectural Overview

PFLOTRAN is written in Fortran 95 using as modular and object-oriented an approach as possible within the constraints of the language standard. Being a relatively new code, it is unencumbered by legacy code and has been designed from day one with parallel scalability in mind. It is built on top of the PETSc framework and makes extensive use of features from PETSc, including iterative nonlinear and linear solvers, distributed linear algebra data structures, parallel constructs for representing PDEs on structured grids, performance logging, runtime control of solver and other options, and binary I/O. It employs parallel HDF5 for I/O and SAMRAI for adaptive mesh refinement.

PFLOTRAN employs domain-decomposition parallelism, with each subdomain assigned to an MPI process and a parallel solve implemented over all processes. A number of different solver and preconditioner combinations from PETSc or other packages can be used. Message passing is required to exchange ghost points across subdomain boundaries; within the nonlinear solver, gather/scatter operations are needed to handle off-processor vector elements in matrix-vector product computations, and global reduction operations are required to compute vector inner products and norms.

3.2 Experimental Evaluation

This section describes the PFlotran experiments we conducted to evaluate the new Matrix - Vector multiplication.

3.2.1 Experimental Setup

The details of the machine used for running the experiments are as follows. The machine, sirius, has an Intel Core i7 with four logical CPUs, a 32 KB L1 data cache, a 256 KB L2 cache, and an 8 MB shared L3 cache. It has 16 GB of memory. Each core has a clock frequency of 3.4 GHz. The machine runs Linux kernel 2.6. This machine has a theoretical peak bandwidth of <>. The machine has vector registers of size 256 bits, which can perform up to 4 double-precision floating point operations per register operation. The machine also supports SSE vector operations, which are performed by masking the upper 128 bits of the register.

The PFlotran samples used for evaluation have 1 degree of freedom for the flow module and 3 to 15 degrees of freedom for the transport module. The experiments use Incomplete LU factorization for preconditioning. In this type of preconditioning, the sparse matrix is approximated by the product of an upper and a lower triangular matrix. Depending on the level of approximation, the preconditioner matrix has a different non-zero pattern. For the examples discussed, we use the zero-level approximation, in which the non-zero structure of the matrix is retained.

3.2.2 Performance Evaluation and Analysis

The following set of experiments was used for the performance evaluation of PFlotran. The first sample used for evaluation has a matrix with 10000 grid points, 3 degrees of freedom for the transport module, and 1 degree of freedom for the flow module. 75% of the floating point operations occur in the transport module and 25% occur in the flow module. It takes 45 Flow and 45 Trans steps to converge, with and without preconditioning. The results obtained indicate that the new vectorized block CSR has a performance improvement of 17% compared to the standard block CSR, whereas the block structured grid has a 16% performance improvement.

The other example used for evaluation has 13500 grid points, with 1 degree of freedom for the flow module and 15 degrees of freedom for the transport module. This example too has a similar distribution of floating point operations. It takes 8457 Flow and 8457 Trans steps to converge using the Incomplete LU factorization preconditioning. The results obtained indicate that the vectorized block CSR has a 39% performance improvement compared to the standard block CSR.

Matrix Type          | Flow MatMult (GFlops) | Trans MatMult (GFlops) | Overall FLOPS (GFlops)
Block CSR            | 1.274                 | 2.286                  | 0.5421
Vectorized Block CSR | 1.287                 | 2.694                  | 0.6034
Block Structure Grid | 3.881                 | 2.662                  | 0.5958

Table 3.1: Pflotran sample without preconditioner

Matrix Type          | Flow MatMult (GFlops) | Trans MatMult (GFlops) | Execution time (hours)
Block CSR            | 1.274                 | 1.521                  | 8.3
Vectorized Block CSR | 1.287                 | 2.122                  | 6.9

Table 3.2: Pflotran sample with ILU preconditioning


3.3 Conclusion

This chapter focused on the basics of PFlotran, which was used for evaluating the performance of the new matrix representations. We also discussed the architecture of PFlotran and its interaction with PETSc.

The results obtained indicate that the new vectorized matrix-vector multiplication improved the performance of the application by approximately 1.2x. The application, which ran for approximately 8.3 hours using the default block CSR representation, was able to complete in approximately 6.9 hours using the vectorized block CSR representation, providing a 20% improvement in the performance of the application.

In the next chapter, we conclude the thesis and provide recommendations for future work.

Chapter 4: Conclusions and Future Work

We now summarize the work done for this thesis and identify potential future work. The thesis provided an alternative sparse matrix representation and Matrix-Vector multiplication algorithm to improve performance and utilize SIMD parallelization. It also provided a way to modify the existing representation to enable SIMD parallelization. An experimental study done using a PETSc driver showed that the new matrix structure had a performance improvement of 50%, and that the modified Block CSR representation performed slightly better than the block structure grid. In the second part of the thesis, PFlotran was used to evaluate the new matrix structures. The results obtained indicate that the new matrix structures gave a performance improvement of 20%.

Future work can develop scripts for generating customized code for different block sizes and dimensions. Auto-tuning can also be done to improve the performance further.

Bibliography

[1] Jean Braun and Malcolm Sambridge. A numerical method for solving partial differential equations on highly irregular evolving grids. Nature, 376, August 1995.

[2] Intel Corp. Intel 64 and IA-32 Architectures Optimization Reference Manual, November 2009.

[3] Seyong Lee and Rudolf Eigenmann. Adaptive runtime tuning of parallel sparse matrix-vector multiplication on distributed memory systems. In ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing, pages 195-204, 2008.

[4] Iyyappa T Murugandi. A new representation of structure grid for matrix-vector operation and optimization of doitgen kernel. Master's thesis, The Ohio State University, 2010.

[5] Richard Tran Mills, Glenn E Hammond, Peter C Lichtner, Vamsi Sripathi, G (Kumar) Mahinthakumar, and Barry F Smith. Modeling subsurface reactive flows using leadership-class computing. Journal of Physics: Conference Series, 2009.

[6] Richard Vuduc, James W Demmel, and Katherine A Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 2005.

[7] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In SC '07, November 2007.

[8] A. N. Yzelman and Rob H. Bisseling. Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods. SIAM Journal on Scientific Computing, 31(4):3128-3154, 2009.
