Efficient Sparse Vector Multiplication for Structured Grid Representation

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Deepan Karthik Balasubramanian, B.Tech

Graduate Program in Computer Science and Engineering

The Ohio State University

2012

Thesis Committee:

Dr. P. Sadayappan, Advisor

Dr. Atanas Rountev

© Copyright by

Deepan Karthik Balasubramanian

2012

Abstract

Due to technology advancements, there is a need for higher accuracy in scientific computations. Sparse matrix-vector (SpMV) multiplication is widely used in scientific applications. There can be significant performance variability due to irregular memory access patterns, so there is a great opportunity to optimize these applications. Conventional data structures and algorithms need to be modified to take advantage of these improved architectures.

In the first part of the thesis, we focus on introducing a new data structure, Block Structured Grid, that allows vectorization. We also focus on modifying the existing Block CSR representation of structured grids to improve performance. Due to the inherent nature of structured grid problems, which use block elements because of the degrees of freedom involved, blocked structures are considered for improving performance. We compare our performance with existing standard algorithms in PETSc. With the new matrix representations we were able to achieve an average of 1.5x performance improvement for generic algorithms.

In the second part of the thesis, we compare the performance of PFlotran, an application for modeling Multiscale-Multiphase-Multicomponent Subsurface Reactive Flows, using Block Structured Grid and Vectorized Block CSR against the standard matrix representations. With the new matrix representations we were able to achieve an average of 1.2x performance improvement.

I dedicate this work to my parents and my sister

Acknowledgments

I owe my deepest gratitude to my advisor Prof. Sadayappan for his vision and constant support throughout my Masters program. His enthusiasm for research ideas has been a great source of inspiration for me. His excellent technical foresight and guidance helped me toward the right goals. I would also like to thank Dr. Atanas Rountev for agreeing to serve on my Masters examination committee. I would especially like to thank Jeswin Godwin, Kevin Stock and Justin Holewinski at DL 574 for providing me with valuable technical input during the course of the program. Special thanks also to my friends Ragavendar, Venmugil, Naveen, Madhu, Sriram, Shriram, Arun and Viswa, who kept me motivated and made this journey a really enjoyable one. I would also like to thank all my friends here at Ohio State University for making this journey a pleasant one. Finally, this endeavor would not have been possible without the support of my family members, who have encouraged and motivated me to no end.

It is to my parents D. Balasubramanian and B. SaralaDevi, and my sister S. Uma, that I dedicate this work.

Vita

2010 ...... B.Tech. Information Technology, College of Engineering Guindy, Anna University, India.
June 2011 - Sept 2011 ...... Software Development Engineer Intern, Microsoft Corp.
Jan 2011 - present ...... Graduate Research Associate, Department of Computer Science and Engineering, The Ohio State University.

Fields of Study

Major Field: Computer Science and Engineering

Studies in:

High Performance Computing: Prof. P. Sadayappan
Compiler Design and Implementation: Prof. Atanas Rountev
Programming Languages: Prof. Neelam Soundarajan

Table of Contents

Page

Abstract ...... ii

Dedication ...... iii

Acknowledgments ...... iv

Vita...... v

List of Tables ...... viii

List of Figures ...... ix

1. Introduction ...... 1

1.1 Problem Description ...... 4
1.2 PFlotran ...... 6
1.3 Related Work ...... 7
1.4 Summary ...... 8

2. Modified Matrix Representation ...... 9

2.1 Structure Grid ...... 10
2.1.1 Grid or Meshes ...... 10
2.1.2 Matrix Properties ...... 12
2.2 Matrix Representation ...... 13
2.2.1 Compressed Sparse Row ...... 13
2.2.2 Block Structure Grid ...... 17
2.2.3 Modified Block Compressed Sparse Row ...... 32
2.3 Experimental Evaluation ...... 33
2.3.1 Experimental Setup ...... 33

2.3.2 Performance Comparison of Matrix structures using SSE and AVX intrinsics ...... 34
2.4 Summary ...... 46

3. Performance Evaluation on PFlotran ...... 47

3.1 PFlotran - Basics and Architecture Overview ...... 48
3.1.1 Overview of Pflotran [5] ...... 48
3.1.2 Architectural Overview ...... 49
3.2 Experimental Evaluation ...... 50
3.2.1 Experimental Setup ...... 50
3.2.2 Performance Evaluation and Analysis ...... 51
3.3 Conclusion ...... 52

4. Conclusions and Future Work ...... 54

Bibliography ...... 55

List of Tables

Table Page

2.1 Regions for a 3D - physical grid of dimension m*n*p ...... 19

2.2 Regions for a 2D - physical grid of dimension m*n*1 ...... 19

2.3 Theoretical Comparison of column-major block order and re-arranged block order ...... 32

3.1 Pflotran sample without preconditioner ...... 52

3.2 Pflotran sample with ILU preconditioning ...... 52

List of Figures

Figure Page

2.1 Types of Grid ...... 11

2.2 Stencil Computation ...... 11

2.3 Nonzero structure of Structure Grid Matrices ...... 13

2.4 Compressed Sparse Row Representation ...... 14

2.5 CSR MV ...... 15

2.6 Block CSR Representation ...... 15

2.7 Block CSR MV Algorithm ...... 16

2.8 Block Structure Grid Representation ...... 18

2.9 Block Arrangement for AVX machines ...... 20

2.10 Block Arrangement for SSE machines ...... 21

2.11 Generic Block Structure Grid MV Algorithm ...... 22

2.10 Algorithm for Handling Blocks - AVX ...... 25

2.9 Algorithm for Handling Blocks - SSE ...... 28

2.6 Horizontal Addition By Rearranging data ...... 30

2.7 Handling blocks in Block Structure Grid MV Algorithm - Other Approaches ...... 31

2.8 Vectorized Block CSR MV algorithm ...... 32

2.9 Performance Comparison of Matrix Format - Cache Resident Data - Customized block handling - AVX Machines ...... 36

2.10 Performance Comparison of Matrix Format - Cache Resident Data - Generic block handling - AVX Machines ...... 36

2.11 Performance Comparison of Matrix Format - Cache Resident Data - Customized block handling - SSE Machines ...... 37

2.12 Performance Comparison of Matrix Format - Cache Resident Data - Generic block handling - SSE Machines ...... 37

2.13 Performance Comparison of Matrix Format - Non-Cache Resident Data - Customized block handling - AVX Machines ...... 38

2.14 Performance Comparison of Matrix Format - Non-Cache Resident Data - Generic block handling - AVX Machines ...... 38

2.15 Performance Comparison of Matrix Format - Non-Cache Resident Data - Customized block handling - SSE Machines ...... 39

2.16 Performance Comparison of Matrix Format - Non-Cache Resident Data - Generic block handling - SSE Machines ...... 39

2.17 L1 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines ...... 40

2.18 L2 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines ...... 40

2.19 L3 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines ...... 41

2.20 L1 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines ...... 41

2.21 L2 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines ...... 42

2.22 L3 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines ...... 42

2.23 Performance Comparison of Unrolled Matrix Format - Non-Cache Resident Data - Customized block handling ...... 43

2.24 Performance Comparison of Unrolled Matrix Format - Non-Cache Resident Data - Generic block handling ...... 44

2.25 Performance Comparison of Unrolled Matrix Format with OpenMP - Non-Cache Resident Data ...... 44

2.26 Performance Comparison of OpenMP BSG and MPI BCSR (2 threads) - Non-Cache Resident Data ...... 45

2.27 Performance Comparison of OpenMP BSG and MPI BCSR (4 threads) - Non-Cache Resident Data ...... 45

Chapter 1: Introduction

In the era of modern technological advances, there is a need for higher accuracy in scientific applications. These high-accuracy scientific applications require a large amount of computation, which can be optimized by using high performance computing principles. With the advent of multi-core architectures, research into applying high performance computing principles to engineering problems such as computational fluid dynamics, subsurface reactive flows, etc. is attracting increasing interest. These applications use solvers available in packages such as PETSc, which apply the principles of HPC to optimize solver kernels.

Scientific computations are inherently parallel. This can be exploited by parallelizing the application to run on multiple cores. In cases where the data cannot fit on a single node, computations can be distributed across multiple nodes in a cluster. Cluster nodes communicate among themselves to solve the problem at hand; these nodes use the Message Passing Interface (MPI) standard for communication. Computations can also be distributed across multiple cores using POSIX threads and OpenMP.

Software programs spend most of their time executing only a small fraction of the code, a characteristic often called the "90-10" rule: 90% of the time is spent executing 10% of the code. This allows the kernel to be optimized with minimal effort by concentrating on that 10% of the code.

With advancements in architecture and faster computing units, the gap between memory bandwidth and computational capacity has widened. This requires algorithms to be modified to increase the spatial and temporal locality of the data used in computations. Accessing the data in the proper order also allows the hardware to prefetch data into the caches, thus reducing the latency due to memory accesses. To improve the memory accesses, loop transformation techniques such as loop unrolling, loop tiling, and loop permutation can be used. These techniques improve temporal and spatial locality at the register and cache level. When the application is latency limited, these techniques aid prefetching, thereby improving the computational speed.
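As an illustration of the kind of loop transformation meant here, the following is a small sketch (not taken from the thesis code) of a dense matrix-vector product with the inner loop unrolled by four; the function and names are purely illustrative, and n is assumed to be a multiple of 4 for brevity.

/* Illustrative only: 4x unrolled inner loop for y = A*x on a dense n x n
 * matrix.  Unrolling exposes independent multiply-adds to the compiler and
 * improves register reuse of sum; tiling the i/j loops in a similar way
 * improves cache reuse of x. */
void dense_mv_unrolled(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j += 4) {
            sum += A[i * n + j]     * x[j]
                 + A[i * n + j + 1] * x[j + 1]
                 + A[i * n + j + 2] * x[j + 2]
                 + A[i * n + j + 3] * x[j + 3];
        }
        y[i] = sum;
    }
}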

Most current architectures support SIMD parallelization through streaming SIMD extensions (SSE) or advanced vector extensions (AVX). These extensions provide vector registers of 128 bits or 256 bits respectively and efficiently perform the same operation on multiple independent words. For double-precision data, an effective speedup of 2x or 4x respectively can be achieved by using the vector registers. Code optimization techniques can be used to enhance vectorization; these include loop permutation, array padding, statement reordering, data reordering, loop distribution, node splitting, array expansion, loop peeling, and conditional handling.
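As a minimal, hypothetical illustration of such SIMD usage (not the thesis's kernel), the sketch below uses AVX intrinsics to process four double-precision elements per operation; it assumes AVX support and a length divisible by four.

#include <immintrin.h>

/* Illustrative AVX sketch: y[i] += a[i] * x[i] on doubles, four elements per
 * 256-bit register (two per 128-bit register with SSE).  A remainder loop
 * would be needed when n is not a multiple of 4. */
void vec_muladd(int n, const double *a, const double *x, double *y)
{
    for (int i = 0; i < n; i += 4) {
        __m256d av = _mm256_loadu_pd(&a[i]);
        __m256d xv = _mm256_loadu_pd(&x[i]);
        __m256d yv = _mm256_loadu_pd(&y[i]);
        yv = _mm256_add_pd(yv, _mm256_mul_pd(av, xv));
        _mm256_storeu_pd(&y[i], yv);
    }
}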

In addition to the aforementioned code optimization techniques, using appropriate data structures also helps improve the performance of the kernel. The challenge in optimizing the code lies in selecting an appropriate data structure and designing the algorithm so that the above-mentioned optimization techniques can be applied. The data structure should have good space complexity, the algorithm must exhibit good time complexity, and it is desirable that the data structure enables a high degree of spatial locality while the algorithm runs. The next challenge lies in identifying the independent logical execution paths in the code and parallelizing them. After parallelization, the complexity lies in applying suitable transformations at the loop level and applying techniques to vectorize it.

In this thesis, we discuss a new data structure and algorithms to support vectorized operation using the above-mentioned techniques. We also focus on modifying the existing, commonly used data structures to allow vectorized operations. The focus of the thesis is to improve the computation time of the matrix-vector multiplication, which is the time-dominant operation in many scientific applications, without increasing the space complexity.

For evaluating the performance, we introduce the new data structures into the parallel library PETSc, which is commonly used in many scientific applications. We also modify the existing data structure to support vectorized matrix-vector multiplication and compare the performance against the existing data structures for structured grids available in PETSc, such as the Compressed Sparse Row (CSR) format and the Blocked Compressed Sparse Row (BCSR) format. The matrices used for the performance comparison are generated to have the same structure as a 5-point stencil, which is used for 2D physical grids, and a 7-point stencil, which is used for 3D physical grids. In the second part of the thesis, we use PFlotran, a 3D modeling application for multiphase multicomponent subsurface reactive flow that uses PETSc, to test the performance on a real-world application.

1.1 Problem Description

Solving many physical problems such as heat transfer, computational fluid dynamics, etc. involves writing the governing equations using laws of conservation. These governing equations are expressed numerically using partial differential equations (PDEs). For instance, PFlotran models multiphase flows and multicomponent reactive transport in three-dimensional problem domains using the partial differential equations in [5].

Numerical methods for solving partial differential equations require some form of spatial discretization, or mesh of nodes, at which the solution is specified [1]. The PDE is solved over the discretized mesh by attributing unknown variables to the grid points. Computations typically proceed as a sequence of grid update steps. For example, for explicit methods, at each step, values associated with each entity are updated in parallel, based on values retrieved from neighboring entities. For implicit methods, a sparse linear algebraic system is solved at each step.

Computations that are done on structured grids can be broadly classified into two classes: explicit and implicit. Explicit methods calculate the state of a system at a later time from the state of the system at the current time, while implicit methods find a solution by solving an equation involving both the current state of the system and the later one. An example of an explicit method is a stencil computation on the grid, which involves interacting only with neighboring grid points. This operation is simple and does not need any complex data structure to represent the structure grid. Usually two-dimensional (2D) or three-dimensional (3D) matrices are used for these stencil computations. Another important kind of computation is the implicit method, involving a sparse solver, which usually requires a matrix-vector multiplication at each step.
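For concreteness, a simple explicit 5-point stencil sweep might look like the following sketch; the array layout and coefficients are illustrative assumptions, not taken from any specific application.

/* An explicit 5-point stencil sweep on a 2D grid.  u and unew are
 * (m+2) x (n+2) row-major arrays with a one-point boundary halo;
 * the coefficients are illustrative. */
void stencil_5pt(int m, int n, const double *u, double *unew)
{
    int ld = n + 2;                                    /* row stride incl. halo */
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            unew[i * ld + j] = 0.5   * u[i * ld + j]
                             + 0.125 * (u[(i - 1) * ld + j] + u[(i + 1) * ld + j]
                                      + u[i * ld + j - 1]   + u[i * ld + j + 1]);
}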

In the implicit method, the structure grid is represented as a sparse matrix (in CSR or diagonal format). These sparse matrices generally have a diagonal or block-diagonal pattern corresponding to the degrees of freedom involved in the problem. For instance, in subsurface reactive flow models, the number of chemical components determines the degrees of freedom. In general, sparse matrix-vector (SpMV) multiplication is the time-dominant portion of the solver, so optimizing it improves the overall performance of the solver.

In this thesis, we focus on introducing a new matrix representation, Block Structured Grid, that uses the properties of the physical grid to allow vectorized SpMV multiplication. The dense block arrangement of elements allows vectorization within a block. We also discuss modifying the block CSR matrix representation to allow vectorization during the SpMV multiplication. In both cases, the SpMV multiplication can be viewed as a sequence of dense matrix-vector multiplications, with the size of each dense matrix determined by the degrees of freedom in the problem. Our work mainly focuses on vectorization of the dense matrix-vector multiplication. With enough degrees of freedom, the performance improvement obtained by this vectorization approaches the improvement that would be obtained by vectorizing across the entire sparse matrix.

1.2 PFlotran

PFlotran is a tool for modeling Multiscale-Multiphase-Multicomponent Subsurface Reactive Flows using advanced computing. It utilizes a first-order finite volume spatial discretization combined with backward-Euler time stepping. The system of nonlinear equations arising from the discretization is solved using inexact Newton-Krylov methods. The Conjugate Gradient method (CG) is a member of a family of iterative solvers known as Krylov subspace methods, used primarily on large sparse linear systems arising from the discretization of partial differential equations (PDEs). CG uses successive approximations to obtain a more accurate solution at each step. It is considered a nonstationary method, generating a sequence of conjugate (or orthogonal) vectors.

PFlotran is built on top of the PETSc framework and uses numerous features from PETSc, including nonlinear solvers, linear solvers, sparse matrix data structures (both blocked and non-blocked matrices), vectors, constructs for the parallel solution of PDEs on structured grids, an options database (runtime control of solver options), and binary I/O. PFLOTRAN employs domain-decomposition parallelism, with each subdomain assigned to an MPI process and a parallel solve implemented over all processes. A number of different solver and preconditioner combinations from PETSc or other packages can be used.

In this thesis, we focus on modifying the representation of structured grids in the PETSc framework, using algorithms that support vectorization, and showing the impact on PFlotran. We run various samples and show the performance improvement obtained by using the vectorized code.

1.3 Related Work

We now discuss the research efforts relevant to our work from the area of algorithms for sparse matrix representations. Over the past decade, many research groups have worked on developing different methodologies to improve the performance of sparse matrix representations, including [8]. In [7], attempts were made to pack data to reduce indirection, and elements were rearranged using heuristics to improve the effectiveness of the sparse structure. In [6], new low-level kernel modules were developed for runtime performance tuning of sparse matrix kernels. In [3], adaptive runtime tuning for improving the parallel performance of matrix kernels was proposed; tuning was done based on the load on each node, and the communication method (broadcast or point-to-point) was selected at runtime. In [4], a new matrix representation, structured grid, that supports vectorized SpMV was introduced, but it does not take into account the degrees of freedom involved in the problem under consideration. The problems considered in this thesis have unique characteristics that can be exploited for static tuning of the matrix kernels. For instance, all of them have a block diagonal structure, which enables vectorization within a block.

1.4 Summary

This thesis has considered the problem of optimizing Sparse Matrix-Vector Multiplication for multicore processors by allowing vectorization. Matrix-Vector Multiplication for the sparse matrix structures that occur in structured grid problems is considered. A novel representation for the above-mentioned sparse matrices is introduced. In the second part, a modified Block CSR representation is discussed. In both cases, a Matrix-Vector Multiplication that uses vector instructions is implemented.

An experimental study was conducted using the new Matrix-Vector Multiplication algorithm. Parallelization was also done using OpenMP threads. The summary of the results is as follows. With the new matrix representations we were able to achieve an average of 1.5x performance improvement for generic algorithms. By using a customized algorithm for 3D structures we were able to increase the performance improvement to 2x, and by using OpenMP threads we were able to further improve the performance by another factor of 2.

In the second part of the thesis, we compare the performance of the modified Block CSR against the standard matrix representations in PETSc. With the new matrix representations a performance improvement of 20% is achieved.

The remainder of the work is organized as follows. In the remainder of this chapter we provide background on the problem that we have considered and the tools that are used. In Chapter 2, we discuss our data structures and algorithms and evaluate their effectiveness. In Chapter 3, we discuss the performance improvement obtained by using the data structures in PFlotran. We finally conclude the work in Chapter 4.

Chapter 2: Modified Matrix Representation

Matrix-Vector multiplication is the most expensive operation and determines the overall cost of a sparse linear solver. Since it is desired to keep this cost minimal, the representation of the structured grid (SG) and the MV algorithm have to be optimal. Many linear solvers involve Matrix-Vector multiplication, dot-product, and SAXPY operations, of which Matrix-Vector multiplication is the time-dominant operation.

Our focus is to develop a new representation for the matrices arising from structure grid problems and to run Matrix-Vector multiplication faster than the algorithms for the existing representations. We will see how to define the new data structure without increasing the space complexity and provide the corresponding Matrix-Vector algorithm. We will also discuss ways to enable vectorization for the current representations, look into the problems with the current representations, and discuss the reasons for the improved efficiency of the new representation. We provide a driver program written using the parallel library PETSc to compare performance with existing representations such as CSR and Block CSR.

The remainder of the chapter is organized as follows. In section 2.1 we describe the structure grid problem and the properties of the matrices arising from it. In section 2.2 we describe the new representations for the matrices and discuss the improved Matrix-Vector algorithm. In section 2.3 we show the performance of the new matrix representations using PETSc driver programs, and we conclude the chapter in section 2.4.

2.1 Structure Grid

In this section we describe the physical grids and their characteristics when represented as matrices. We also describe the standard representations used and the problems with those representations.

2.1.1 Grid or Meshes

Physical problems such as computational fluid dynamics and heat transfer are generally defined using partial differential equations (PDEs). These PDEs are solved using nonlinear solvers or linear solvers depending on the order of the equations. In general, the nonlinear solvers convert the high-order equations into low-order equations and use the linear solvers in each step. The linear solvers represent these equations geometrically as grids or meshes.

Associated with each grid element are one or more dependent variables (degrees of freedom) such as pressure, volume or temperature. Numerical algorithms representing approximations to the conservation laws of mass, momentum, and energy are then used to compute these variables at each grid point. Each grid point is updated based on the values of the neighboring grid points iteratively until the result converges.

Depending on the equation used, these grids can be either structured or unstructured, as in fig 2.1. In structured grids, the number of neighbors contributing to the computation of the dependent variables (the stencil) is fixed and the grid has a definite shape. For instance, the simplest structured grid is a rectangular grid with a 5-point stencil, where the contributions come from the neighboring grid points to the left, right, top, and bottom, and from the point itself (fig 2.2). In an unstructured grid, not all neighboring grid points contribute to the computation of the dependent variables, resulting in a grid without a definite shape. While unstructured grids allow complex physical problems to be defined, they are generally more difficult to solve than structured grid problems.

Figure 2.1: Types of Grid

Figure 2.2: Stencil Computation

The most commonly used method to solve an unstructured grid is to use Delaunay triangulation. In the remainder of the chapter, we consider only the structure grids for convenience.

2.1.2 Matrix Properties

The matrices that are used to represent these structure grids display the following characteristics (fig 2.3) which will be the key for our representations.

• The matrices have an equal number of rows and columns, equal to the number of grid points.

• Since each matrix element represents the coefficient of the interaction of the corresponding grid point, the matrix is generally sparse.

• Since the stencil is fixed, the matrix has either a diagonal or a block diagonal structure, depending on the degrees of freedom.

• The number of non-zeros in a row is determined by the position of the corresponding grid point.

Figure 2.3: Nonzero structure of Structure Grid Matrices

2.2 Matrix Representation

In a linear solver, physical structure grids are represented using sparse matrices with the number of rows and columns equal to the number of grid points, and with each matrix element being a block of size equal to the degrees of freedom. For the detailed discussion, we consider a 2D physical grid of size 5x5 with a 5-point stencil and 2 degrees of freedom. This can be generalized to higher dimensions too.

2.2.1 Compressed Sparse Row

The most commonly used sparse matrix representation for physical problems is Compressed Sparse Row storage (fig 2.4). A generalized CSR representation uses three vectors: one for the linearized matrix elements, one to store the corresponding column indices, and one to store the row offsets, as shown in fig 2.4.

Figure 2.4: Compressed Sparse Row Representation

The space requirement of this representation for the above-mentioned matrix is 4 (block size) * 5 (number of neighbor elements) scalars per grid point for the matrix elements, the same number of integers for storing the columns, and 2 integers per grid point for the row offsets. Effectively, N*4*5 scalars and (2*N+1) + N*4*5 integers are needed for storing the row and column indices. The matrix-vector multiplication (y = Ax) algorithm for the CSR representation is given in fig 2.5.

Due to the indirection involved in accessing the x-vector elements and the indeterminate sparsity pattern of the matrix elements, the CSR representation does not lend itself to SIMD parallelization.
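A minimal scalar CSR matrix-vector product along the lines of fig 2.5 is sketched below (the names are illustrative); the indirect, irregular access to the x-vector is visible in the inner loop.

/* Scalar CSR matrix-vector product (y = A*x): val holds the nonzeros,
 * col the column index of each nonzero, and rowptr[i]..rowptr[i+1]
 * delimits row i. */
void csr_mv(int nrows, const int *rowptr, const int *col,
            const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];   /* indirect access to x */
        y[i] = sum;
    }
}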

Figure 2.5: CSR MV Algorithm

One of the optimizations applied to the CSR representation is to store the block elements together and to use a single row index and a single column index for each block (fig 2.6). The space requirement for this representation is N*4*5 scalars for storing the grid elements, N*5 integers for storing the column indices, and N+1 integers for storing the row offsets. The matrix-vector multiplication algorithm for the Block CSR representation is given in fig 2.7.

Figure 2.6: Block CSR Representation

Figure 2.7: Block CSR MV Algorithm

Even though the temporal locality of the x-vector is improved, SIMD parallelization requires custom modules and rearrangement of the matrix elements. The modified algorithm is discussed in section 2.2.3.
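For reference, a scalar Block CSR matrix-vector product in the spirit of fig 2.7 might look like the following sketch; the array names and the row-major bs x bs block layout are assumptions for illustration. The dense block loop is the part targeted for vectorization in section 2.2.3.

#include <stddef.h>

/* Scalar Block CSR MV: each stored entry is a dense bs x bs block; bcol gives
 * the block-column index and browptr delimits each block row. */
void bcsr_mv(int nbrows, int bs, const int *browptr, const int *bcol,
             const double *val, const double *x, double *y)
{
    for (int i = 0; i < nbrows; i++) {
        double *yb = &y[i * bs];
        for (int r = 0; r < bs; r++) yb[r] = 0.0;
        for (int k = browptr[i]; k < browptr[i + 1]; k++) {
            const double *blk = &val[(size_t)k * bs * bs];
            const double *xb  = &x[bcol[k] * bs];
            for (int r = 0; r < bs; r++)          /* dense bs x bs block times xb */
                for (int c = 0; c < bs; c++)
                    yb[r] += blk[r * bs + c] * xb[c];
        }
    }
}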

2.2.2 Block Structure Grid

Key Goals

As stated earlier, the main idea behind the new algorithm is to allow SIMD parallelization in the Matrix-Vector Multiplication to improve the performance of the linear PDE solver. The new algorithm should also maintain the accuracy of the computation: the result obtained from the new Matrix-Vector multiplication algorithm must be the same as that of the standard algorithm, with absolutely no difference.

Detailed description

The following characteristics of the matrix are used in defining the matrix representation.

• Every grid element has the same block size.

• All the non-zero elements are located at a definite and constant offset from the centre diagonal.

• The number of stencil neighbors required for the computation of a particular grid point depends on its location in the structure grid. For instance, the top-right edge point has neighboring grid points only to its left and below (and to its right if a wrap-around scheme is used).

The Block Structure Grid represents the matrix elements using two vectors, one for the grid elements and one to define the offsets of the stencil from the diagonal, as shown in fig 2.8. The space requirement for the previously mentioned matrix is 4 (block size) * 5 (number of stencil neighbors) * N (number of grid points) scalars, plus 5 integers to define the stencil offsets. These stencil offsets can be identified during matrix initialization, namely 0 for the grid point itself, -1 for the left neighbor, 1 for the right neighbor, -5 (or -m for a grid with dimensions m*n) for the top neighbor, and 5 (or m for a grid with dimensions m*n) for the bottom neighbor.

Figure 2.8: Block Structure Grid Representation

The stencil diagonals are arranged consecutively within a region. A region is defined as a range of grid points in which the stencil neighbors used for computation are the same. Further, the range of each region is also definite and depends on the dimensions of the physical grid. Tables 2.1 and 2.2 define the ranges of the different regions and the corresponding stencil neighbors used in each region.

Starting Point | End Point | Stencil Neighbors used other than self
0              | 1         | right, down, back
1              | m         | left, right, down, back
m              | m*n       | top, left, right, down, back
m*n            | N-m*n     | front, top, left, right, down, back
N-m*n          | N-m       | front, top, left, right, down
N-m            | N-1       | front, top, left, right
N-1            | N         | front, top, left

Table 2.1: Regions for a 3D - physical grid of dimension m*n*p

Starting Point | End Point | Stencil Neighbors used other than self
0              | 1         | right, down
1              | m         | left, right, down
m              | N-m       | top, left, right, down
N-m            | N-1       | top, left, right
N-1            | N         | top, left

Table 2.2: Regions for a 2D - physical grid of dimension m*n*1
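The region boundaries in Table 2.1 can be computed directly from the grid dimensions; the following sketch (illustrative names, 3D case only) shows one way to do so.

/* Region boundaries for a 3D grid of dimension m*n*p (N = m*n*p grid points),
 * following Table 2.1: region r spans grid points [bounds[r], bounds[r+1]). */
void region_bounds_3d(int m, int n, int p, int bounds[8])
{
    int N = m * n * p;
    bounds[0] = 0;          /* first point: right, down, back              */
    bounds[1] = 1;          /* 1 .. m: left, right, down, back             */
    bounds[2] = m;          /* m .. m*n: top, left, right, down, back      */
    bounds[3] = m * n;      /* m*n .. N-m*n: all six neighbors             */
    bounds[4] = N - m * n;  /* N-m*n .. N-m: front, top, left, right, down */
    bounds[5] = N - m;      /* N-m .. N-1: front, top, left, right         */
    bounds[6] = N - 1;      /* N-1 .. N: front, top, left                  */
    bounds[7] = N;
}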

Further, the arrangement of elements within a block and the algorithm to be used are defined by the type of vectorization required (SSE or AVX). This difference in arrangement is required due to the difference in the functionality available for the vector registers. Figs 2.9 and 2.10 show the block arrangement of elements for a structure grid with 7 degrees of freedom.

Matrix-Vector Algorithm

The Matrix-Vector multiplication on the Block Structure Grid uses slightly different algorithms based on the type of vectorization and the degrees of freedom. It differs in the way blocks are handled, owing to the different arrangement of block elements. The generic algorithm is given in fig 2.11.

Figure 2.9: Block Arrangement for AVX machines

Figure 2.10: Block Arrangement for SSE machines

For computational effectiveness, regions are sub-divided. This allows elements accessed consecutively to remain in the same page, enabling the hardware prefetcher.

The algorithms for handling the blocks using SSE vector registers and AVX vector registers are given in figs 2.10 and 2.9.

Matrix-Vector multiplication on Block Structure Grid uses vectorization only within block elements. This is done for reasons discussed below.

• Accessing elements block by block gives the x-vector better temporal locality (x-vector elements are reused within a block).

• There is no need to load the output vector elements. This reduces the total number of loads that would otherwise be required in a bandwidth-constrained problem.

• Since each output vector element is written only once, the algorithm allows SPMD parallelization.

• The block structure and the regions avoid the need for padding data.

Figure 2.11: Generic Block Structure Grid MV Algorithm
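To make the generic algorithm of fig 2.11 concrete, the following is a sketch of a Block Structure Grid MV loop. The storage order assumed here (blocks stored per grid point, active stencil diagonal after active stencil diagonal, region by region) and all names are illustrative assumptions; the actual PETSc implementation may lay the data out differently.

/* Sketch of a generic Block Structure Grid MV (y = A*x).
 *   dof   - degrees of freedom (block size)
 *   ns    - number of stencil diagonals
 *   off   - stencil offsets in grid points (e.g. {0, -1, 1, -m, m})
 *   rs/re - start/end grid point of each region
 *   act   - which stencil diagonals are active in each region
 *   a     - block elements stored consecutively in the assumed order */
void bsg_mv(int nregions, const int *rs, const int *re,
            int ns, const int *off, const char *act,
            int dof, const double *a, const double *x, double *y)
{
    const double *blk = a;
    for (int r = 0; r < nregions; r++) {
        for (int g = rs[r]; g < re[r]; g++) {
            double *yb = &y[g * dof];
            for (int i = 0; i < dof; i++) yb[i] = 0.0;   /* y written exactly once */
            for (int s = 0; s < ns; s++) {
                if (!act[r * ns + s]) continue;          /* diagonal unused here */
                const double *xb = &x[(g + off[s]) * dof];
                for (int i = 0; i < dof; i++)            /* dense dof x dof block */
                    for (int j = 0; j < dof; j++)
                        yb[i] += blk[i * dof + j] * xb[j];
                blk += dof * dof;
            }
        }
    }
}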

Requirements for rearranging blocks

One of the requirements imposed by the above algorithms to improve performance is not loading the output vector. For this reason, the resultant vector register should hold its elements in sequential order. We use the horizontal-add operation on vector registers to achieve this. Since SSE vector registers fully support the horizontal-add operation, the rearrangement is done only to improve computational efficiency. In the case of AVX registers, however, a horizontal add can be performed only within each 128-bit half of the vector register; a complete horizontal add is therefore achieved in stages, as shown in fig 2.6.

Figure 2.10: Algorithm for Handling Blocks - AVX

Figure 2.9: Algorithm for Handling Blocks - SSE

Figure 2.6: Horizontal Addition By Rearranging Data (Stages 1-5)

We used a few other approaches to avoid rearranging the data within a block. One approach was to arrange the elements in column-major order and splat the x-vector elements across registers. The algorithm for handling a block this way is given in fig 2.7. The problem with this approach was that it increased the number of loads. A comparative study of the number of AVX vector operations and the number of clock cycles required [2] per block is given in table 2.3. It can be seen that the rearranged block structure requires fewer load operations, and even the number of multiplication and addition operations required is lower than for the column-major block order. This reduction in operations is facilitated by the horizontal-add operations. With block-size-customized code, the number of horizontal-add operations can be reduced further by keeping temporary register values across blocks.
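One well-known AVX sequence for this staged horizontal addition is sketched below: it combines the four per-row product registers of a 4x4 block into a single register holding the four row sums, so the result can be stored without ever loading y. This is a sketch of the idea, not necessarily the exact instruction sequence used in the implementation.

#include <immintrin.h>

/* Combine r0..r3 (per-row products of a 4x4 block) into
 * [sum(r0), sum(r1), sum(r2), sum(r3)]. */
static inline __m256d hadd4(__m256d r0, __m256d r1, __m256d r2, __m256d r3)
{
    __m256d t01   = _mm256_hadd_pd(r0, r1);                 /* pairwise sums of r0, r1 */
    __m256d t23   = _mm256_hadd_pd(r2, r3);                 /* pairwise sums of r2, r3 */
    __m256d swap  = _mm256_permute2f128_pd(t01, t23, 0x21); /* cross the 128-bit lanes */
    __m256d blend = _mm256_blend_pd(t01, t23, 0xC);
    return _mm256_add_pd(swap, blend);
}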

Figure 2.7: Handling blocks in Block Structure Grid MV Algorithm - Other Approaches

Instruction | Latency | Column-Major Block    | Rearranged Block
Loadu       | 6       | bs + bs*(bs-3)/4 + bs | bs/4 + bs*(bs-3)/4 + bs/2 + bs/4
Permute     | 6       | 2*bs                  | 4
Mul         | 7       | bs*(bs-3)/4 + bs      | bs*(bs-3)/4 + bs/2 + bs/4
Hadd        | 5       | 0                     | 3*bs*bs/16
Add         | 3       | bs*(bs-3)/4 + bs      | bs/4

Table 2.3: Theoretical Comparison of column-major block order and re-arranged block order

2.2.3 Modified Block Compressed Sparse Row

The block arrangement discussed in the previous section can also be extended to the Blocked Compressed Sparse Row format. This allows vectorization of the matrix-vector multiplication algorithm, as shown in fig 2.8.

Figure 2.8: Vectorized Block CSR MV algorithm
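Putting the pieces together, a vectorized Block CSR MV for 4x4 blocks might look like the following sketch, which reuses the staged horizontal add shown in the previous section; the block layout, names, and the fixed block size of 4 are illustrative assumptions, not the actual fig 2.8 code.

#include <immintrin.h>

/* Multiply one row-major 4x4 block by xb and accumulate into acc. */
static void bcsr_block4(const double *blk, const double *xb, __m256d *acc)
{
    __m256d xv = _mm256_loadu_pd(xb);
    __m256d r0 = _mm256_mul_pd(_mm256_loadu_pd(blk +  0), xv);
    __m256d r1 = _mm256_mul_pd(_mm256_loadu_pd(blk +  4), xv);
    __m256d r2 = _mm256_mul_pd(_mm256_loadu_pd(blk +  8), xv);
    __m256d r3 = _mm256_mul_pd(_mm256_loadu_pd(blk + 12), xv);
    __m256d t01   = _mm256_hadd_pd(r0, r1);                  /* staged horizontal add */
    __m256d t23   = _mm256_hadd_pd(r2, r3);
    __m256d swap  = _mm256_permute2f128_pd(t01, t23, 0x21);
    __m256d blend = _mm256_blend_pd(t01, t23, 0xC);
    *acc = _mm256_add_pd(*acc, _mm256_add_pd(swap, blend));  /* [y0..y3] += block*xb */
}

/* Vectorized Block CSR MV for block size 4: the output block is stored once
 * per block row, without loading y. */
void bcsr_mv4(int nbrows, const int *browptr, const int *bcol,
              const double *vals, const double *x, double *y)
{
    for (int i = 0; i < nbrows; i++) {
        __m256d acc = _mm256_setzero_pd();
        for (int k = browptr[i]; k < browptr[i + 1]; k++)
            bcsr_block4(&vals[16 * k], &x[4 * bcol[k]], &acc);
        _mm256_storeu_pd(&y[4 * i], acc);
    }
}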

2.3 Experimental Evaluation

This section describes the experiments we conducted to evaluate the Matrix-Vector multiplication using the PETSc driver program. The goals of the experiments are as follows:

• The new algorithm produces exactly the same result as the existing standard algorithms.

• The new algorithm runs faster and is capable of taking advantage of vector architectures.

2.3.1 Experimental Setup

The details of the machine used for running the experiments are as follows. The machine, sirius, has an Intel Core i7 with four logical CPUs, a 32 KB L1 data cache, a 256 KB L2 cache, and an 8 MB shared L3 cache. It has 16 GB of memory. Each core has a clock frequency of 3.4 GHz. The machine runs Linux kernel 2.6. The machine has vector registers of size 256 bits, which can perform up to 4 double-precision floating point operations per register operation. The machine also supports SSE vector operations, which are performed by masking the upper 128 bits of the register.

We have evaluated our experiments using a driver program written with the PETSc API that performs repeated Matrix-Vector multiplication operations. This simulates the Matrix-Vector multiplication calls in a linear solver. The Matrix-Vector multiplication is called on the same data so as to capture any cache-residency benefits exploited by the linear solver. A customized algorithm is used for Matrix-Vector multiplication for smaller block sizes (less than 7). The PETSc library and the new matrix structures are compiled using the Intel compiler icc with the '-fast' option.
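A minimal sketch of such a driver is shown below, using standard PETSc calls (PetscInitialize, MatCreateSeqBAIJ, MatMult). The matrix assembly is elided and the sizes shown are placeholders rather than the thesis's actual settings; the custom BSG and vectorized BCSR matrix types would be selected in place of the stock blocked format.

#include <petscmat.h>

int main(int argc, char **argv)
{
    Mat            A;
    Vec            x, y;
    PetscInt       npts = 1000000, dof = 4, niter = 100, n;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); CHKERRQ(ierr);
    n = npts * dof;

    /* Blocked (BCSR-style) matrix with block size dof and 7 blocks per row. */
    ierr = MatCreateSeqBAIJ(PETSC_COMM_SELF, dof, n, n, 7, NULL, &A); CHKERRQ(ierr);
    /* ... fill A with the 7-point stencil blocks here ... */
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

    ierr = VecCreateSeq(PETSC_COMM_SELF, n, &x); CHKERRQ(ierr);
    ierr = VecDuplicate(x, &y); CHKERRQ(ierr);
    ierr = VecSet(x, 1.0); CHKERRQ(ierr);

    /* Repeated MatMult on the same data, mimicking a solver's call pattern. */
    for (PetscInt it = 0; it < niter; it++) {
        ierr = MatMult(A, x, y); CHKERRQ(ierr);
    }

    ierr = MatDestroy(&A); CHKERRQ(ierr);
    ierr = VecDestroy(&x); CHKERRQ(ierr);
    ierr = VecDestroy(&y); CHKERRQ(ierr);
    ierr = PetscFinalize();
    return (int)ierr;
}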

The experiments were done with two input sizes, one in which the matrix is completely cache resident (N = 1000) and the other in which the data size is larger than the cache (N = 1000000). The matrix was created so that it has the same structure as that of a 3D physical grid with a 7-point stencil. Memory is dynamically allocated for the matrices. The memory allocation is not modified to allocate only aligned memory, since the latency for loading aligned and unaligned memory is the same, 6 clock cycles [2]. In these experiments, we have used double-precision floating point values.

2.3.2 Performance Comparison of Matrix structures using SSE and AVX intrinsics

In this set of experiments, a 7-point stencil is used, which means each grid point interacts with its neighboring grid points in the x, y, and z directions with offset 1. We have executed the MV algorithm with double-precision floating point operations. The first set of experiments uses a matrix size such that the entire data is cache-resident.

The results of the experiments show that the vectorized operation runs faster than the existing algorithms. The results in figs 2.9 and 2.11 correspond to Mat-Vec multiplication customized for smaller block sizes on a cache-resident matrix. When the degrees of freedom are 1, block structure grid performs 2.5x faster than both CSR and block CSR. For the other cases, it performs 1.5x faster than both CSR and block CSR. Modified block CSR performs slightly better than block structure grid in all cases except when the degrees of freedom are 1.

The results in figs 2.10 and 2.12 use the generic Mat-Vec multiplication mentioned before. Both of the new representations perform slightly worse initially, but with larger block sizes the performance improvement due to vectorization is achieved. Block structure grid performs nearly 1.8x faster than the Block CSR and CSR formats, and modified block CSR performs 1.9x faster.

For the second set of experiments, a larger matrix size is used such that the data does not fit in the cache. Using the customized algorithms, the new structures perform 1.5x faster than CSR. Using the generic algorithm, the new structures perform 1.5x faster when the degrees of freedom are even, and they perform as well as the existing algorithms when the degrees of freedom are odd.

When using AVX intrinsics, the results were similar to before. This is because the vectorization available within blocks is not enough to hide the latency caused by bandwidth constraints. Figures 2.17, 2.18, 2.19, 2.20, 2.21 and 2.22 give the cache miss ratios at the different cache levels.

Performance improvement in Block Structure Grid using explicit unrolling of regions

This is one of the optimizations used to improve the performance of the block structure grid algorithm. For structure grid problems, the region boundaries and the stencil neighbors are fixed for a particular grid dimension, as given in tables 2.2 and 2.1. In this experiment, we use these boundary limits to explicitly unroll the stencil loop.

35 Figure 2.9: Performance Comparison of Matrix Format - Cache Resident Data - Customized block handling - AVX Machines

Figure 2.10: Performance Comparison of Matrix Format - Cache Resident Data - Generic block handling - AVX Machines

Figure 2.11: Performance Comparison of Matrix Format - Cache Resident Data - Customized block handling - SSE Machines

Figure 2.12: Performance Comparison of Matrix Format - Cache Resident Data - Generic block handling - SSE Machines

Figure 2.13: Performance Comparison of Matrix Format - Non-Cache Resident Data - Customized block handling - AVX Machines

Figure 2.14: Performance Comparison of Matrix Format - Non-Cache Resident Data - Generic block handling - AVX Machines

Figure 2.15: Performance Comparison of Matrix Format - Non-Cache Resident Data - Customized block handling - SSE Machines

Figure 2.16: Performance Comparison of Matrix Format - Non-Cache Resident Data - Generic block handling - SSE Machines

Figure 2.17: L1 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines

Figure 2.18: L2 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines

Figure 2.19: L3 Cache Miss Ratio of Matrix Format - Cache Resident Data - AVX Machines

Figure 2.20: L1 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines

Figure 2.21: L2 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines

Figure 2.22: L3 Cache Miss Ratio of Matrix Format - Non-Cache Resident Data - AVX Machines

We use a 3D physical grid with a non-cache-resident data size.

Figure 2.23: Performance Comparison of Unrolled Matrix Format - Non-Cache Resident Data - Customized block handling

The result of this experiment shows a performance improvement of 1.2x compared to the aforementioned algorithm. We further incorporate SPMD parallelism into this algorithm using OpenMP. In this set of experiments, we run the region-unrolled algorithm across multiple cores using OpenMP constructs. We can obtain up to a maximum of N (grid points) parallelism without using any synchronization. The idea is that each block row of the matrix is handled by a different thread; since each output value is written only once, there is no need for thread synchronization. Figures 2.25, 2.26 and 2.27 show the performance improvement due to SPMD parallelism.
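The SPMD parallelization described above amounts to a single OpenMP pragma over the block-row loop; a sketch is shown below for the scalar Block CSR kernel with illustrative names, and the same pragma applies to the unrolled block structure grid kernel.

#include <omp.h>
#include <stddef.h>

/* Each thread owns a disjoint range of block rows and writes every output
 * block exactly once, so no synchronization beyond the implicit barrier is
 * needed. */
void bcsr_mv_omp(int nbrows, int bs, const int *browptr, const int *bcol,
                 const double *vals, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nbrows; i++) {
        double *yb = &y[i * bs];
        for (int r = 0; r < bs; r++) yb[r] = 0.0;
        for (int k = browptr[i]; k < browptr[i + 1]; k++) {
            const double *blk = &vals[(size_t)k * bs * bs];
            const double *xb  = &x[bcol[k] * bs];
            for (int r = 0; r < bs; r++)
                for (int c = 0; c < bs; c++)
                    yb[r] += blk[r * bs + c] * xb[c];
        }
    }
}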

Figure 2.24: Performance Comparison of Unrolled Matrix Format - Non-Cache Resident Data - Generic block handling

Figure 2.25: Performance Comparison of Unrolled Matrix Format with OpenMP - Non-Cache Resident Data

44 Figure 2.26: Performance Comparison of OpenMP BSG and MPI BCSR (2 threads) - Non-Cache Resident Data

Figure 2.27: Performance Comparison of OpenMP BSG and MPI BCSR (4 threads) - Non-Cache Resident Data

2.4 Summary

This chapter focused on developing a new representation for the matrices arising in structure grid problems. We also described our Matrix-Vector algorithm that takes advantage of SIMD architectures, and we discussed extending the ideas used in the new representation to the Block Compressed Sparse Row format.

Our experimental study was conducted using a driver program in PETSc. The summary of the results is as follows. With the new matrix structure, block structure grid, we were able to achieve a performance improvement of 1.2x to 1.8x. In all cases, however, the modified Block CSR achieved slightly higher performance. By optimizing the block structure grid for a particular dimension and by unrolling the stencil neighbor loop, we were able to achieve a 25% performance improvement over the generic block structure grid algorithm. We were further able to improve the performance of this algorithm by 1.5x to 3.5x by using 2 and 4 OpenMP threads respectively.

In the next chapter, we will use a real-world application, PFlotran, to compare the performance of our new matrix representations. We will compare the effectiveness of our algorithm against the standard matrix representations used in PETSc.

Chapter 3: Performance Evaluation on PFlotran

PFlotran is a tool for modeling Multiscale-Multiphase-Multicomponent Subsurface Reactive Flows using advanced computing. It utilizes a first-order finite volume spatial discretization combined with backward-Euler time stepping. PFlotran is built on top of PETSc, a parallel library for solving nonlinear and linear differential equations using the principles of advanced computing.

In this chapter, we cover the basics of PFlotran and give an overview of its architecture. We also discuss the dependency of PFlotran on PETSc and the operations used by the PFlotran modules, as well as the role of preconditioners in linear solvers and the modifications necessary in the matrix formats discussed earlier to allow the use of preconditioners. We then evaluate the performance of PFlotran using the new matrix formats and discuss the results.

The remainder of the chapter is organized as follows. In section 3.1 we discuss the basics of PFlotran and overview its architecture. In section 3.2 we provide a performance evaluation done on PFlotran, and we summarize the results in section 3.3.

3.1 PFlotran - Basics and Architecture Overview

3.1.1 Overview of Pflotran [5]

PFLOTRAN solves a coupled system of continuum-scale mass and energy conservation equations in porous media for a number of phases, including air, water, and supercritical CO2, and for multiple chemical components. The general form of the multiphase partial differential equations solved in the flow module of PFLOTRAN for mass and energy conservation can be summarized as

\frac{\partial}{\partial t}\Big(\phi \sum_{\alpha} s_\alpha \rho_\alpha X_i^\alpha\Big) + \nabla \cdot \sum_{\alpha} \Big[ q_\alpha \rho_\alpha X_i^\alpha - \phi s_\alpha D_\alpha \rho_\alpha \nabla X_i^\alpha \Big] = Q_i    (3.1)

and

\frac{\partial}{\partial t}\Big(\phi \sum_{\alpha} s_\alpha \rho_\alpha U_\alpha + (1-\phi)\,\rho_r c_r T\Big) + \nabla \cdot \Big[ \sum_{\alpha} q_\alpha \rho_\alpha H_\alpha - k \nabla T \Big] = Q_e    (3.2)

In these equations, α designates a phase (e.g. H2O, supercritical CO2), species are designated by the subscript i (e.g. w = H2O, c = CO2), φ denotes the porosity of the geologic formation, s_α denotes the saturation state of the phase, X_i^α denotes the mole fraction of species i; ρ_α, H_α, U_α refer to the molar density, enthalpy, and internal energy of each fluid phase, respectively; q_α denotes the Darcy flow rate; and Q_i, Q_e denote source/sink terms such as wells. The multicomponent reactive transport equations solved by PFLOTRAN have the form

\frac{\partial}{\partial t}\Big(\phi \sum_{\alpha} s_\alpha \Psi_j^\alpha\Big) + \nabla \cdot \sum_{\alpha} \Omega_j^\alpha = - \sum_{m} \nu_{jm} I_m    (3.3)

for the jth primary species, and

\frac{\partial \phi_m}{\partial t} = \bar{V}_m I_m    (3.4)

for the mth mineral. The quantities Ψ_j^α, Ω_j^α denote the total concentration and flux of the jth primary species in phase α. The mineral precipitation/dissolution reaction rate I_m is determined using a transition state rate law, and the quantities ν_jm designate the stoichiometric reaction coefficients. These equations are coupled to the flow and energy conservation equations through the variables p, T, s_α, and q_α.

PFlotran currently utilizes a first-order finite volume spatial discretization combined with backward-Euler (fully implicit) time stepping. Upwinding is used for the advective term in the transport equations. The system of nonlinear equations arising from the discretization is solved using inexact Newton-Krylov methods. Within the flow and transport modules the equations are solved fully implicitly, but because transport generally requires much smaller time steps than flow, these modules are coupled sequentially.

3.1.2 Architectural Overview

PFLOTRAN is written in Fortran 95 using as modular and object-oriented an approach as possible within the constraints of the language standard. Being a relatively new code, it is unencumbered by legacy code and has been designed from day one with parallel scalability in mind. It is built on top of the PETSc framework and makes extensive use of features from PETSc, including iterative nonlinear and linear solvers, distributed linear algebra data structures, parallel constructs for representing PDEs on structured grids, performance logging, runtime control of solver and other options, and binary I/O. It employs parallel HDF5 for I/O and SAMRAI for adaptive mesh refinement.

PFLOTRAN employs domain-decomposition parallelism, with each subdomain assigned to an MPI process and a parallel solve implemented over all processes. A number of different solver and preconditioner combinations from PETSc or other packages can be used. Message passing is required to exchange ghost points across subdomain boundaries; within the nonlinear solver, gather/scatter operations are needed to handle off-processor vector elements in matrix-vector product computations, and global reduction operations are required to compute vector inner products and norms.

3.2 Experimental Evaluation

This section describes the PFlotran experiments we conducted to evaluate the new Matrix - Vector multiplication.

3.2.1 Experimental Setup

The details of the machine used for running the experiments are as follows. The machine, sirius, has an Intel Core i7 with four logical CPUs, a 32 KB L1 data cache, a 256 KB L2 cache, and an 8 MB shared L3 cache. It has 16 GB of memory. Each core has a clock frequency of 3.4 GHz. The machine runs Linux kernel 2.6. This machine has a theoretical peak bandwidth of <>. The machine has vector registers of size 256 bits, which can perform up to 4 double-precision floating point operations per register operation. The machine also supports SSE vector operations, which are performed by masking the upper 128 bits of the register.

The PFlotran samples used for evaluation have 1 degree of freedom for the flow module and 3 to 15 degrees of freedom for the transport module. The experiments use Incomplete LU factorization for preconditioning. In this type of preconditioning, the sparse matrix is approximated by the product of an upper and a lower triangular matrix. Depending on the level of approximation, the preconditioner matrix has a different non-zero pattern. For the examples discussed, we use the zero-level approximation, in which the non-zero structure of the matrix is retained.

3.2.2 Performance Evaluation and Analysis

The following set of experiments was used for the performance evaluation of PFlotran. The first sample used for evaluation has a matrix with 10000 grid points, 3 degrees of freedom for the transport module, and 1 degree of freedom for the flow module. 75% of the floating point operations occur in the transport module and 25% occur in the flow module. It takes 45 Flow and 45 Trans steps to converge, with and without preconditioning. The results obtained indicate that the new vectorized block CSR has a performance improvement of 17% compared to the standard block CSR, whereas the block structured grid has a 16% performance improvement.

The other example used for evaluation has 13500 grid points, with 1 degree of freedom for the flow module and 15 degrees of freedom for the transport module. This example too has a similar distribution of floating point operations. It takes 8457 Flow and 8457 Trans steps to converge using the Incomplete LU factorization preconditioning. The results obtained indicate that the vectorized block CSR has a 39% performance improvement compared to the standard block CSR.

Matrix Type          | Flow MatMult (GFlops) | Trans MatMult (GFlops) | Overall FLOPS (GFlops)
Block CSR            | 1.274                 | 2.286                  | 0.5421
Vectorized Block CSR | 1.287                 | 2.694                  | 0.6034
Block Structure Grid | 3.881                 | 2.662                  | 0.5958

Table 3.1: Pflotran sample without preconditioner

Matrix Type          | Flow MatMult (GFlops) | Trans MatMult (GFlops) | Execution time (hours)
Block CSR            | 1.274                 | 1.521                  | 8.3
Vectorized Block CSR | 1.287                 | 2.122                  | 6.9

Table 3.2: Pflotran sample with ILU preconditioning


3.3 Conclusion

This chapter focused on the basics of PFlotran, which was used for evaluating the performance of the new matrix representations. We also discussed the architecture of PFlotran and its interaction with PETSc.

The results obtained indicate that the new vectorized matrix-vector multiplication improved the performance of the application by approximately 1.2x. The application, which ran for approximately 8.3 hours using the default block CSR representation, was able to complete in approximately 6.9 hours using the vectorized block CSR representation, providing a 20% improvement in the performance of the application.

In the next chapter, we conclude the thesis and provide recommendations for future work.

Chapter 4: Conclusions and Future Work

We now summarize the work done for this thesis and identify potential future work. The thesis provided an alternative sparse matrix representation and Matrix-Vector multiplication algorithm to improve performance and utilize SIMD parallelization. It also provided a way to modify the existing representation to enable SIMD parallelization. An experimental study done using a PETSc driver showed that the new matrix structure had a performance improvement of 50%, and that the modified Block CSR representation performed slightly better than the block structure grid. In the second part of the thesis, PFlotran was used to evaluate the new matrix structures. The results obtained indicate that the new matrix structures gave a performance improvement of 20%.

Future work can develop scripts for generating customized code for different block sizes and dimensions. Auto-tuning can also be done to improve the performance further.

Bibliography

[1] Jean Braun and Malcolm Sambridge. A numerical method for solving partial differential equations on highly irregular evolving grids. Nature, 376, August 1995.

[2] Intel Corp. Intel 64 and IA-32 Architectures Optimization Reference Manual, November 2009.

[3] Seyong Lee and Rudolf Eigenmann. Adaptive runtime tuning of parallel sparse matrix-vector multiplication on distributed memory systems. In ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing, pages 195-204, 2008.

[4] Iyyappa T Murugandi. A new representation of structure grid for matrix-vector operation and optimization of doitgen kernel. Master's thesis, The Ohio State University, 2010.

[5] Richard Tran Mills, Glenn E Hammond, Peter C Lichtner, Vamsi Sripathi, G (Kumar) Mahinthakumar, and Barry F Smith. Modeling subsurface reactive flows using leadership-class computing. Journal of Physics: Conference Series, 2009.

[6] Richard Vuduc, James W Demmel, and Katherine A Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 2005.

[7] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In SC '07, November 2007.

[8] A. N. Yzelman and Rob H. Bisseling. Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods. SIAM Journal on Scientific Computing, 31(4):3128-3154, 2009.
