Efficient Sparse Matrix Vector Multiplication for Structured Grid Representation

A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By Deepan Karthik Balasubramanian, B.Tech

Graduate Program in Computer Science and Engineering, The Ohio State University, 2012

Thesis Committee: Dr. P. Sadayappan, Advisor; Dr. Atanas Rountev

© Copyright by Deepan Karthik Balasubramanian, 2012

Abstract

Advances in technology have created a need for higher accuracy in scientific computations. Sparse matrix-vector (SpMV) multiplication is a widely used kernel in scientific applications. Its irregular memory access patterns can cause significant performance variability, so there is great opportunity to optimize these applications; conventional algorithms need to be modified to take advantage of improved architectures.

In the first part of the thesis, we introduce a new data structure, Block Structured Grid, that allows vectorization. We also modify the existing Block CSR representation of structured grids to improve performance. Because structured grid problems inherently use block elements, owing to the degrees of freedom involved in the problem, blocked structures are a natural choice for improving performance. We compare our performance with existing standard algorithms in PETSc. With the new matrix representations, we achieved an average 1.5x performance improvement for generic algorithms.

In the second part of the thesis, we compare the performance of PFlotran, an application for modeling multiscale, multiphase, multicomponent subsurface reactive flows, using Block Structured Grid and Vectorized Block CSR against the standard matrix representations. With the new matrix representations, we achieved an average 1.2x performance improvement.

Dedication

I dedicate this work to my parents and my sister.

Acknowledgments

I owe my deepest gratitude to my advisor, Prof. Sadayappan, for his vision and constant support throughout my Masters program. His enthusiasm for research ideas has been a great source of inspiration for me, and his excellent technical foresight and guidance steered me toward the right goals. I would also like to thank Dr. Atanas Rountev for agreeing to serve on my Masters examination committee. I would especially like to thank Jeswin Godwin, Kevin Stock and Justin Holewinski at DL 574 for providing me with valuable technical input during the course of the program. Special thanks also to my friends Ragavendar, Venmugil, Naveen, Madhu, Sriram, Shriram, Arun and Viswa, who kept me motivated and made this journey a really enjoyable one. I would also like to thank all my friends here at Ohio State University for making this journey a pleasant one. Finally, this endeavor would not have been possible without the support of my family members, who have encouraged and motivated me to no end. It is to my parents, D. Balasubramanian and B. SaralaDevi, and my sister, S. Uma, that I dedicate this work.

Vita

2010: B.Tech. Information Technology, College of Engineering Guindy, Anna University, India.
June 2011 - Sept 2011: Software Development Engineer Intern, Microsoft Corp.
Jan 2011 - present: Graduate Research Associate, Department of Computer Science and Engineering, The Ohio State University.

Fields of Study

Major Field: Computer Science and Engineering

Studies in:
High Performance Computing (Prof. P. Sadayappan)
Parallel Computing (Prof. P. Sadayappan)
Compiler Design and Implementation (Prof. Atanas Rountev)
Programming Languages (Prof. Neelam Soundarajan)

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures
1. Introduction
1.1 Problem Description
1.2 PFlotran
1.3 Related Work
1.4 Summary
2. Modified Matrix Representation
2.1 Structured Grid
2.1.1 Grids or Meshes
2.1.2 Matrix Properties
2.2 Matrix Representation
2.2.1 Compressed Sparse Row
2.2.2 Block Structured Grid
2.2.3 Modified Block Compressed Sparse Row
2.3 Experimental Evaluation
2.3.1 Experimental Setup
2.3.2 Performance Comparison of Matrix Structures Using SSE and AVX Intrinsics
2.4 Summary
3. Performance Evaluation on PFlotran
3.1 PFlotran - Basics and Architecture Overview
3.1.1 Overview of PFlotran [5]
3.1.2 Architectural Overview
3.2 Experimental Evaluation
3.2.1 Experimental Setup
3.2.2 Performance Evaluation and Analysis
3.3 Conclusion
4. Conclusions and Future Work
Bibliography

List of Tables

2.1 Regions for a 3D physical grid of dimension m*n*p
2.2 Regions for a 2D physical grid of dimension m*n*1
2.3 Theoretical comparison of column-major block order and re-arranged block order
3.1 PFlotran sample without preconditioner
3.2 PFlotran sample with ILU preconditioning

List of Figures

2.1 Types of Grid
2.2 Stencil Computation
2.3 Nonzero Structure of Structured Grid Matrices
2.4 Compressed Sparse Row Representation
2.5 CSR MV Algorithm
2.6 Block CSR Representation
2.7 Block CSR MV Algorithm
2.8 Block Structured Grid Representation
2.9 Block Arrangement for AVX Machines
2.10 Block Arrangement for SSE Machines
2.11 Generic Block Structured Grid MV Algorithm
2.10 Algorithm for Handling Blocks - AVX
2.9 Algorithm for Handling Blocks - SSE
2.6 Horizontal Addition by Rearranging Data
2.7 Handling Blocks in Block Structured Grid MV Algorithm - Other Approaches
2.8 Vectorized Block CSR MV Algorithm
2.9 Performance Comparison of Matrix Format - Cache Resident Data - Customized Block Handling - AVX Machines
2.10 Performance Comparison of Matrix Format - Cache Resident Data - Generic Block Handling - AVX Machines
2.11 Performance Comparison of Matrix Format - Cache Resident Data - Customized Block Handling - SSE Machines
2.12 Performance Comparison of Matrix Format - Cache Resident Data - Generic Block Handling - SSE Machines
2.13 Performance Comparison of Matrix Format - Non-Cache Resident Data - Customized Block Handling - AVX Machines
2.14 Performance Comparison of Matrix Format - Non-Cache Resident Data - Generic Block Handling - AVX Machines
2.15 Performance Comparison of Matrix Format - Non-Cache Resident Data - Customized Block Handling - SSE Machines
2.16 Performance Comparison of Matrix Format - Non-Cache Resident Data - Generic Block Handling - SSE Machines
2.17 L1 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines
2.18 L2 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines
2.19 L3 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines
2.20 L1 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines
2.21 L2 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines
2.22 L3 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines
2.23 Performance Comparison of Unrolled Matrix Format - Non-Cache Resident Data - Customized Block Handling
2.24 Performance Comparison of Unrolled Matrix Format - Non-Cache Resident Data - Generic Block Handling
2.25 Performance Comparison of Unrolled Matrix Format with OpenMP - Non-Cache Resident Data
2.26 Performance Comparison of OpenMP BSG and MPI BCSR (2 threads) - Non-Cache Resident Data
2.27 Performance Comparison of OpenMP BSG and MPI BCSR (4 threads) - Non-Cache Resident Data

Chapter 1: Introduction

In this era of technological advances, there is a need for higher accuracy in scientific applications. High-accuracy scientific applications require large amounts of computation, which can be optimized by applying high performance computing principles. With the advent of multi-core architectures, the application of these principles to engineering problems such as computational fluid dynamics and subsurface reactive flows is attracting increasing interest. Such applications use solvers available in packages such as PETSc, which apply HPC principles to optimize solver kernels.

Scientific computations are inherently parallel. This can be exploited by parallelizing the application to run on multiple cores. In cases where the data cannot fit on a single node, computations can be distributed across multiple nodes in a cluster; the nodes communicate among themselves to solve the problem at hand, using the Message Passing Interface (MPI) standard. Computations can also be distributed across multiple cores using POSIX threads or OpenMP.

Software programs spend most of their time executing only a small fraction of the code, a behavior known as the "90-10" rule: 90% of the time is spent executing 10% of the code. This property allows a kernel to be optimized with minimal effort by concentrating on that 10% of the code.

With advances in architecture and faster computing units, the gap between memory bandwidth and computational capacity has widened. This requires algorithms to be modified to increase the spatial and temporal locality of the data used in computations. Accessing data in the proper order also allows the hardware to prefetch it into caches, reducing the latency of memory accesses. To improve memory accesses, loop transformation techniques such as loop unrolling, loop tiling, and loop permutation can be used (a short loop-tiling sketch appears below). These techniques improve temporal and spatial locality at the register and cache levels. When the application is latency limited, they also aid prefetching, thereby improving computational speed.

Most current architectures support SIMD parallelization through Streaming SIMD Extensions (SSE) or Advanced Vector Extensions (AVX). These architectures provide vector registers of 128 or 256 bits, respectively, and can efficiently perform the same operation on multiple independent words.
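As a minimal illustration of the AVX intrinsics just described, the sketch below multiplies one small dense block by a slice of the input vector, the kind of operation that arises in blocked SpMV. This is an illustration only, not the thesis's actual kernel: the function name, the 4x4 block size, and the use of unaligned loads are assumptions.

```c
#include <immintrin.h>

/* Illustrative sketch (assumed names, not the thesis's kernel): multiply one
 * row-major 4x4 block by a 4-element slice of x and accumulate into y.
 * One _mm256_mul_pd performs all four multiplies of a block row at once;
 * a horizontal reduction then collapses the four partial products. */
static void block4x4_mv_avx(const double *blk, const double *x, double *y)
{
    __m256d xv = _mm256_loadu_pd(x);                   /* x[0..3] in one register */
    for (int r = 0; r < 4; r++) {
        __m256d row  = _mm256_loadu_pd(blk + 4 * r);   /* block row r */
        __m256d prod = _mm256_mul_pd(row, xv);         /* elementwise products */
        __m128d lo   = _mm256_castpd256_pd128(prod);   /* lanes 0,1 */
        __m128d hi   = _mm256_extractf128_pd(prod, 1); /* lanes 2,3 */
        __m128d pair = _mm_add_pd(lo, hi);             /* two partial sums */
        __m128d sum  = _mm_hadd_pd(pair, pair);        /* horizontal add */
        y[r] += _mm_cvtsd_f64(sum);                    /* accumulate scalar */
    }
}
```

As later chapters discuss, such horizontal additions are part of the cost of vectorizing small blocks, which motivates rearranging the block data layout.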
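The loop tiling transformation mentioned above can be sketched as follows for a dense matrix-vector product; the tile size B, the function name, and the row-major layout are illustrative assumptions, and real codes tune the tile size to the cache hierarchy.

```c
#include <stddef.h>

enum { B = 64 };  /* assumed tile size; tuned to cache size in practice */

/* Sketch of a tiled dense matrix-vector product y = A*x for a row-major
 * n x n matrix. Tiling the column loop keeps a B-element chunk of x
 * resident in cache while it is reused across all n rows, improving
 * temporal locality compared to streaming over full rows. */
void matvec_tiled(size_t n, const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = 0.0;
    for (size_t jj = 0; jj < n; jj += B) {        /* tile over columns */
        size_t jend = (jj + B < n) ? jj + B : n;  /* clamp last partial tile */
        for (size_t i = 0; i < n; i++) {          /* every row reuses x[jj..jend) */
            double sum = y[i];
            for (size_t j = jj; j < jend; j++)
                sum += a[i * n + j] * x[j];
            y[i] = sum;
        }
    }
}
```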
