Sparse Matrices and Optimized Parallel Implementations

Stan Tomov
Innovative Computing Laboratory, Computer Science Department, The University of Tennessee

April 5, 2017

CS 594 Lecture Notes, 04/05/2017

Slide 1 / 32 Topics

Projection in Scientific Computing

Sparse matrices, numerical solution of PDEs, parallel implementations, tools, etc.

Iterative Methods

Slide 2 / 32 Outline

• Part I – Discussion

• Part II – Sparse computations

• Part III – Reordering algorithms and parallelization

Slide 3 / 32 Part I

Discussion

Slide 4 / 32 Orthogonalization

• We can orthonormalize a non-orthogonal basis. How?
• Other approaches: QR using Householder transformations (as in LAPACK), or Cholesky and/or SVD on the normal equations (as in the homework)

[Figure: Gflop/s for orthogonalizing 128 vectors vs. vector size (x 1,000), comparing QR_it_svd, QR_it_Chol/svd, and MGS. Left: hybrid CPU-GPU computation on an NVIDIA Quadro FX 5600. Right: CPU computation on an AMD Opteron 265 (1.8 GHz).]
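One answer to the 'How?' above is (modified) Gram-Schmidt, which is what the MGS curve in the figure measures. The following C code is a minimal sketch of my own, not from the slides: it orthonormalizes a small made-up set of M vectors of length N in place.

/* Minimal sketch (not from the slides): modified Gram-Schmidt (MGS)
 * orthonormalization of M vectors v[0..M-1], each of length N.
 * Sizes and data are made-up examples. Compile with -lm. */
#include <stdio.h>
#include <math.h>

#define N 3
#define M 2

int main(void) {
    double v[M][N] = { {1, 1, 0},
                       {1, 0, 1} };   /* non-orthonormal basis */

    for (int k = 0; k < M; k++) {
        /* normalize v[k] */
        double nrm = 0.0;
        for (int i = 0; i < N; i++) nrm += v[k][i] * v[k][i];
        nrm = sqrt(nrm);
        for (int i = 0; i < N; i++) v[k][i] /= nrm;

        /* orthogonalize the remaining vectors against v[k] */
        for (int j = k + 1; j < M; j++) {
            double dot = 0.0;
            for (int i = 0; i < N; i++) dot += v[j][i] * v[k][i];
            for (int i = 0; i < N; i++) v[j][i] -= dot * v[k][i];
        }
    }

    for (int k = 0; k < M; k++)
        printf("q%d = (%g, %g, %g)\n", k + 1, v[k][0], v[k][1], v[k][2]);
    return 0;
}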

Slide 5 / 32 What if the basis is not orthonormal?

• If we do not want to orthonormalize:
  'Multiply' by x1, ..., xm to get u ≈ P u = c1 x1 + c2 x2 + ... + cm xm:

  (u, x1) = c1 (x1, x1) + c2 (x2, x1) + ... + cm (xm, x1)
  ...
  (u, xm) = c1 (x1, xm) + c2 (x2, xm) + ... + cm (xm, xm)

• These are the so-called Petrov-Galerkin conditions
• We saw examples of their use in
  * optimization, and
  * PDE discretization, e.g., FEM
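To make these conditions concrete, here is a minimal C sketch (mine, not from the slides): it forms the system G c = b with G[i][j] = (xj, xi) and b[i] = (u, xi) for a small made-up non-orthonormal basis, and solves it with plain Gaussian elimination, used here only because the example is tiny.

/* Minimal sketch (not from the slides): form and solve the Petrov-Galerkin
 * (normal-equations) system G c = b, where G[i][j] = (x_j, x_i) and
 * b[i] = (u, x_i), for a small non-orthonormal basis x_1,...,x_M in R^N.
 * Gaussian elimination without pivoting is fine for this tiny SPD system. */
#include <stdio.h>

#define N 4   /* vector length (assumed for the example)          */
#define M 2   /* number of basis vectors (assumed for the example) */

static double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

int main(void) {
    double x[M][N] = { {1, 1, 0, 0},      /* non-orthonormal basis */
                       {1, 0, 1, 0} };
    double u[N]    = {3, 1, 2, 0};        /* vector to project     */

    double G[M][M], b[M], c[M];
    for (int i = 0; i < M; i++) {         /* Gram matrix and RHS   */
        b[i] = dot(u, x[i], N);
        for (int j = 0; j < M; j++) G[i][j] = dot(x[j], x[i], N);
    }

    /* forward elimination */
    for (int k = 0; k < M; k++)
        for (int i = k + 1; i < M; i++) {
            double f = G[i][k] / G[k][k];
            for (int j = k; j < M; j++) G[i][j] -= f * G[k][j];
            b[i] -= f * b[k];
        }
    /* back substitution */
    for (int i = M - 1; i >= 0; i--) {
        c[i] = b[i];
        for (int j = i + 1; j < M; j++) c[i] -= G[i][j] * c[j];
        c[i] /= G[i][i];
    }

    for (int i = 0; i < M; i++) printf("c%d = %g\n", i + 1, c[i]);
    return 0;
}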

Slide 6 / 32

What if the basis is not orthonormal?

• If we do not want to orthonormalize, e.g., in FEM:
  'Multiply' by ϕ1, ..., ϕ7 to get u ≈ P u = c1 ϕ1 + c2 ϕ2 + ... + c7 ϕ7, i.e., a 7x7 system:

  a(c1 ϕ1 + c2 ϕ2 + ... + c7 ϕ7, ϕi) = F(ϕi)   for i = 1, ..., 7

• Two examples of basis functions ϕi

• The more the ϕi overlap, the denser the resulting matrix

• Spectral element methods (high-order FEM)

(Image taken from http://www.urel.feec.vutbr.cz/~raida)

Slide 7 / 32 Stencil Computations

• K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, K. Yelick, “Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors”, SIAM Review, 2008.

http://bebop.cs.berkeley.edu/pubs/datta2008-stencil-sirev.pdf
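As a point of reference for the paper above, here is a minimal C sketch (mine, not from the paper) of the kind of kernel it studies: one Jacobi sweep of the 2-D 5-point stencil. The grid size, boundary values, and coefficients are made-up assumptions.

/* Minimal sketch (not from the cited paper): one Jacobi sweep of the 2-D
 * 5-point stencil on an (NX+2) x (NY+2) grid; boundary values stay fixed.
 * Grid size and initial data are made-up examples. */
#include <stdio.h>

#define NX 6
#define NY 6

int main(void) {
    static double u[NX + 2][NY + 2], unew[NX + 2][NY + 2];

    /* made-up initialization: 1.0 on the boundary, 0.0 inside */
    for (int i = 0; i < NX + 2; i++)
        for (int j = 0; j < NY + 2; j++)
            u[i][j] = (i == 0 || j == 0 || i == NX + 1 || j == NY + 1) ? 1.0 : 0.0;

    /* one sweep: each interior point is replaced by the average of its
     * 4 neighbours -- the same access pattern a 5-point operator row has */
    for (int i = 1; i <= NX; i++)
        for (int j = 1; j <= NY; j++)
            unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                 u[i][j - 1] + u[i][j + 1]);

    printf("unew[1][1] = %g\n", unew[1][1]);   /* 0.5 for this initialization */
    return 0;
}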

Slide 8 / 32 Part II

Sparse computations

Slide 9 / 32 Sparse matrices

• Sparse matrix: a substantial part of the coefficients is zero
• Sparse matrices naturally arise from PDE discretizations – finite differences, FEM, etc.; we saw examples in the
  – 5-point finite difference operator
  – 1-D piecewise linear FEM

[Figure: a 2-D grid with indexed vertices, e.g., 2, 5, 6, 7, 10, and a 1-D mesh for piecewise linear FEM.]

For the 5-point finite difference operator, row 6 will have 5 non-zero elements: A6,2, A6,5, A6,6, A6,7, and A6,10.
For the 1-D piecewise linear FEM, row 3, for example, will have 3 non-zeros: A3,2, A3,3, and A3,4.
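The 1-D FEM statement above (row 3 having only the 3 non-zeros A3,2, A3,3, A3,4) can be reproduced with a short sketch. The C code below is mine, not from the slides: it assembles the stiffness matrix a(ϕj, ϕi) = ∫ ϕj' ϕi' for piecewise linear (hat) functions on a uniform mesh of NE elements, a standard model problem; only the tridiagonal entries come out non-zero.

/* Minimal sketch (not from the slides): assemble the 1-D stiffness matrix
 * a(phi_j, phi_i) = integral of phi_j' * phi_i' for piecewise linear (hat)
 * functions on a uniform mesh with NE elements of length h. Each element
 * contributes the 2x2 local matrix (1/h) * [ 1 -1; -1 1 ] to the two hat
 * functions touching it, so the global matrix is tridiagonal. */
#include <stdio.h>

#define NE 6                 /* number of elements (assumed) */
#define NN (NE + 1)          /* number of nodes 0..NE        */

int main(void) {
    double h = 1.0 / NE;
    double A[NN][NN] = {{0}};       /* dense here only to show the pattern */

    for (int e = 0; e < NE; e++) {  /* element e spans nodes e and e+1 */
        A[e][e]         += 1.0 / h;
        A[e][e + 1]     += -1.0 / h;
        A[e + 1][e]     += -1.0 / h;
        A[e + 1][e + 1] += 1.0 / h;
    }

    /* print row 3: only A[3][2], A[3][3], A[3][4] are non-zero */
    for (int j = 0; j < NN; j++)
        printf("A[3][%d] = %5.1f\n", j, A[3][j]);
    return 0;
}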

Slide 10 / 32 Sparse matrices

• In general:
  * Degrees of freedom (DOF), associated for example with vertices (or edges, faces, etc.), are indexed
  * A basis function is associated with every DOF (unknown)
  * A Petrov-Galerkin condition (equation) is derived for every basis function, representing a row in the resulting system
  * Only 'a few' elements per row will be nonzero, as the basis functions have local support
    – e.g., row 10, using continuous piecewise linear FEM, will have 6 nonzeros: A10,10, A10,35, A10,100, A10,332, A10,115, A10,201
    – i.e., what happens at '10' is described by the physical state at '10' and the neighbouring vertices 35, 201, 115, 100, and 332
    – physical intuition: PDEs describe changes in physical processes; describing/discretizing these changes numerically, based only on local/neighbouring information, results in sparse matrices

[Figure: mesh fragment with vertex 10 and its neighbouring vertices 35, 100, 115, 201, and 332.]

Slide 11 / 32 Sparse matrices

• Can we take advantage of this sparse structure?
  – To solve, for example, very large problems
  – To solve them efficiently

• Yes! There are algorithms

– Linear solvers (some to be covered in the last 2 lectures)

– Efficient data storage and implementation (next ...)

Slide 12 / 32 Sparse matrix formats

• It pays to avoid storing the zeros!
• Common sparse storage formats:
  – AIJ
  – Compressed row/column storage (CRS/CCS)
  – Compressed diagonal storage (CDS)
  * for more see the 'Templates' book: http://www.netlib.org/linalg/html_templates/node90.html#SECTION00931000000000000000

– Blocked versions (why?)

Slide 13 / 32 AIJ

• Stored in 3 arrays
  – The same length
  – No order implied

 I  J  AIJ
 1  1   1
 1  2   2
 2  1   3
 2  3   4
 3  2   5
 3  4   6
 4  3   7
 4  5   8
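A minimal C sketch of the table above (mine, not from the slides): the three AIJ arrays for this example matrix, 0-based instead of 1-based, together with the mat-vec y = A*x done directly over the (I, J, AIJ) triples; since no order is implied, each triple simply adds its contribution. The test vector x is made up.

/* Minimal sketch (not from the slides): the example matrix in AIJ
 * (coordinate) format, 0-based, and a mat-vec y = A*x over the triples. */
#include <stdio.h>

#define NNZ   8
#define NROWS 4
#define NCOLS 5

int main(void) {
    int    I[NNZ]   = {0, 0, 1, 1, 2, 2, 3, 3};
    int    J[NNZ]   = {0, 1, 0, 2, 1, 3, 2, 4};
    double AIJ[NNZ] = {1, 2, 3, 4, 5, 6, 7, 8};

    double x[NCOLS] = {1, 1, 1, 1, 1};      /* made-up test vector */
    double y[NROWS] = {0, 0, 0, 0};

    /* no ordering assumed: each triple adds its contribution */
    for (int k = 0; k < NNZ; k++)
        y[I[k]] += AIJ[k] * x[J[k]];

    for (int i = 0; i < NROWS; i++)
        printf("y[%d] = %g\n", i, y[i]);    /* expected: 3 7 11 15 */
    return 0;
}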

Slide 14 / 32 CRS

• Stored in 3 arrays
  – J and AIJ the same length
  – I (representing rows) is compressed

 I   J  AIJ
 1   1   1
 3   2   2
 5   1   3
 7   3   4
 9   2   5
     4   6
     3   7
     5   8

* array I: think of it as pointers to where the next row starts
* CCS: similar, but J is compressed
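The same example matrix in CRS, as a minimal C sketch (mine, not from the slides; 0-based instead of the 1-based table): J and AIJ keep their length, while I shrinks to one pointer per row plus a final sentinel.

/* Minimal sketch (not from the slides): the example matrix in CRS, 0-based.
 * I[i] points to where row i starts in J/AIJ; I[i+1] is where the next
 * row starts, so row i occupies positions I[i] .. I[i+1]-1. */
#include <stdio.h>

int main(void) {
    int    I[5]   = {0, 2, 4, 6, 8};            /* row pointers (n+1 entries) */
    int    J[8]   = {0, 1, 0, 2, 1, 3, 2, 4};   /* column indexes             */
    double AIJ[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* values                     */

    /* print the non-zeros of row 2 (third row): A[2][1] = 5, A[2][3] = 6 */
    for (int j = I[2]; j < I[3]; j++)
        printf("A[2][%d] = %g\n", J[j], AIJ[j]);
    return 0;
}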

Slide 15 / 32 CDS

• For matrices with non-zeros along sub-diagonals

[Figure: a banded example matrix and its CDS storage, with sub-diagonal indexes -1, 0, and 1.]
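A minimal C sketch of the format (mine, not from the slides): a small made-up tridiagonal matrix stored diagonal-by-diagonal with offsets -1, 0, +1, and the corresponding mat-vec.

/* Minimal sketch (not from the slides): compressed diagonal storage (CDS)
 * for a tridiagonal example, with diagonals indexed by offsets -1, 0, +1,
 * and the mat-vec y = A*x. Entries falling outside the matrix are stored
 * as zeros. Values are made up. */
#include <stdio.h>

#define N     5
#define NDIAG 3

int main(void) {
    int    off[NDIAG] = {-1, 0, 1};
    /* DIAG[i][d] holds A[i][i + off[d]]; first/last rows pad with 0 */
    double DIAG[N][NDIAG] = {
        { 0, 2, -1},
        {-1, 2, -1},
        {-1, 2, -1},
        {-1, 2, -1},
        {-1, 2,  0},
    };
    double x[N] = {1, 1, 1, 1, 1};
    double y[N] = {0};

    for (int d = 0; d < NDIAG; d++)
        for (int i = 0; i < N; i++) {
            int j = i + off[d];
            if (j >= 0 && j < N)
                y[i] += DIAG[i][d] * x[j];
        }

    for (int i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expected: 1 0 0 0 1 */
    return 0;
}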

Slide 16 / 32 Performance (Mat-vec product)

• Notorious for running at just a fraction of the performance peak!
• Why? Consider the mat-vec product for a matrix in CRS:

for i = 1, n
    for j = I[i], I[i+1]-1
        y[i] += AIJ[j] * x[J[j]]
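For reference, a runnable 0-based C version of this loop (mine, not from the slides), with the CRS arrays of the earlier 4x5 example matrix hard-coded as test data:

/* Minimal runnable version of the CRS mat-vec loop above (0-based). */
#include <stdio.h>

static void spmv_crs(int n, const int *I, const int *J, const double *AIJ,
                     const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = I[i]; j < I[i + 1]; j++)
            y[i] += AIJ[j] * x[J[j]];
    }
}

int main(void) {
    int    I[5]   = {0, 2, 4, 6, 8};
    int    J[8]   = {0, 1, 0, 2, 1, 3, 2, 4};
    double AIJ[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double x[5]   = {1, 1, 1, 1, 1};           /* made-up test vector */
    double y[4];

    spmv_crs(4, I, J, AIJ, x, y);
    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);       /* expected: 3 7 11 15 */
    return 0;
}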

Slide 17 / 32 Performance (Mat-vec product)

• Notorious for running at just a fraction of the performance peak!
• Why? Consider the mat-vec product for a matrix in CRS:

for i = 1, n
    for j = I[i], I[i+1]-1
        y[i] += AIJ[j] * x[J[j]]

* Irregular, indirect memory access for x
  - results in cache thrashing
* performance is often <10% of peak

Slide 18 / 32 Performance (Mat-vec product)

* Performance of mat-vec products of various sizes on a 2.4 GHz Pentium 4
* An example from Gahvari et al.: http://bebop.cs.berkeley.edu/pubs/gahvari2007-spmvbench-spec.pdf

Slide 19 / 32 Performance (Mat-vec product)

• How to improve the performance?
  – A common technique (as done for dense matrices) is blocking (register, cache: next ...)
  – Index reordering (in Part III)
  – Exploit special matrix structure (e.g., symmetry, bands, other structures)

Slide 20 / 32 Block Compressed Row Storage (BCRS)

• Example of using 2x2 blocks

[Table: the BI, BJ, and AIJ arrays for the example matrix stored in 2x2 blocks.]

* Reduced storage for indexes
* Drawback: adds explicit 0s
* What block size to choose?
* BCRS for register blocking
* Discussion?
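A minimal C sketch of a 2x2 BCRS mat-vec (mine; the arrays are not the slide's exact data): the value array stores one full 2x2 block per BJ entry, in row-major order and with explicit zeros, for the earlier 4x5 example matrix padded with a zero sixth column.

/* Minimal sketch (not the slide's exact arrays): 2x2 BCRS mat-vec, 0-based.
 * The data encodes the 4x5 example matrix padded with a zero 6th column;
 * each BJ entry owns 4 values (one 2x2 block, row-major), zeros included. */
#include <stdio.h>

int main(void) {
    int    BI[3]   = {0, 2, 5};            /* block-row pointers   */
    int    BJ[5]   = {0, 1, 0, 1, 2};      /* block-column indexes */
    double VAL[20] = {1, 2, 3, 0,          /* block (0,0)          */
                      0, 0, 4, 0,          /* block (0,1)          */
                      0, 5, 0, 0,          /* block (1,0)          */
                      0, 6, 7, 0,          /* block (1,1)          */
                      0, 0, 8, 0};         /* block (1,2)          */
    double x[6] = {1, 1, 1, 1, 1, 1};
    double y[4] = {0};

    for (int ib = 0; ib < 2; ib++)                  /* block rows         */
        for (int b = BI[ib]; b < BI[ib + 1]; b++) { /* blocks in this row */
            int jb = BJ[b];
            /* one 2x2 block multiplies two entries of x and updates two of y */
            y[2 * ib]     += VAL[4 * b]     * x[2 * jb] + VAL[4 * b + 1] * x[2 * jb + 1];
            y[2 * ib + 1] += VAL[4 * b + 2] * x[2 * jb] + VAL[4 * b + 3] * x[2 * jb + 1];
        }

    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expected: 3 7 11 15 */
    return 0;
}

Because each block keeps two x values and two y values live across four multiplies, the compiler can hold them in registers -- the register blocking mentioned above.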

Slide 21 / 32

BCRS

Slide 22 / 32 Cache blocking

• Improve cache reuse for x in Ax by splitting A into a set of sparse matrices, e.g.

Sparse matrix and its splitting

For more info check: Eun-Jin Im, K. Yelick, R. Vuduc, "SPARSITY: An Optimization Framework for Sparse Matrix Kernels", International Journal of High Performance Computing Applications, 18 (1), pp. 135-158, February 2004.
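A minimal C sketch of the idea (mine, not the SPARSITY code): the earlier 4x5 example matrix is pre-split by columns into two CRS matrices A1 (columns 0-2) and A2 (columns 3-4), and y = A*x is computed as two accumulating passes, each of which touches only the part of x its columns cover.

/* Minimal sketch (not from the SPARSITY paper): cache blocking of y = A*x
 * by pre-splitting A column-wise into A1 (columns 0-2) and A2 (columns 3-4),
 * both in CRS with global column indexes. Each pass reuses a smaller piece
 * of x. The data is the 4x5 example matrix from the earlier slides. */
#include <stdio.h>

static void spmv_add(int n, const int *I, const int *J, const double *V,
                     const double *x, double *y) {
    for (int i = 0; i < n; i++)
        for (int j = I[i]; j < I[i + 1]; j++)
            y[i] += V[j] * x[J[j]];
}

int main(void) {
    /* A1: entries with column < 3 */
    int    I1[5] = {0, 2, 4, 5, 6};
    int    J1[6] = {0, 1, 0, 2, 1, 2};
    double V1[6] = {1, 2, 3, 4, 5, 7};
    /* A2: entries with column >= 3 */
    int    I2[5] = {0, 0, 0, 1, 2};
    int    J2[2] = {3, 4};
    double V2[2] = {6, 8};

    double x[5] = {1, 1, 1, 1, 1};
    double y[4] = {0};

    spmv_add(4, I1, J1, V1, x, y);   /* pass 1: reuses x[0..2] */
    spmv_add(4, I2, J2, V2, x, y);   /* pass 2: reuses x[3..4] */

    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expected: 3 7 11 15 */
    return 0;
}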

Slide 23 / 32 Part III

Reordering algorithms and parallelization

Slide 24 / 32 Reorder to preserve locality

[Figure: mesh fragment with vertex 10 and its neighbouring vertices 35, 100, 115, 201, and 332.]

e.g., Cuthill-McKee ordering: start from an arbitrary node, say '10', and reorder:
* '10' becomes 0
* its neighbours are ordered next and become 1, 2, 3, 4, 5; denote these as level 1
* neighbours of the level 1 nodes are consecutively reordered next, and so on until the end
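A minimal C sketch of this procedure (mine, not from the slides): a basic Cuthill-McKee renumbering by breadth-first search on a small made-up 6-node graph stored as a CRS-style adjacency list. A full implementation would also visit neighbours in order of increasing degree, sweep over all connected components, and usually reverse the result (RCM).

/* Minimal sketch (not from the slides): basic Cuthill-McKee renumbering by
 * breadth-first search from a chosen start node. The 6-node graph is made up. */
#include <stdio.h>

#define N 6

int main(void) {
    /* neighbours of node i are adj[xadj[i] .. xadj[i+1]-1] */
    int xadj[N + 1] = {0, 2, 5, 7, 9, 12, 14};
    int adj[14] = {1, 3,      /* neighbours of 0 */
                   0, 2, 4,   /* neighbours of 1 */
                   1, 5,      /* neighbours of 2 */
                   0, 4,      /* neighbours of 3 */
                   1, 3, 5,   /* neighbours of 4 */
                   2, 4};     /* neighbours of 5 */

    int start = 0;            /* plays the role of '10' in the slide's example */
    int perm[N];              /* perm[k] = old index of the node numbered k   */
    int visited[N] = {0};

    /* BFS: the perm array itself serves as the queue */
    int head = 0, tail = 0;
    perm[tail++] = start;
    visited[start] = 1;
    while (head < tail) {
        int v = perm[head++];
        for (int j = xadj[v]; j < xadj[v + 1]; j++) {
            int w = adj[j];
            if (!visited[w]) {      /* unvisited neighbours get the next numbers */
                visited[w] = 1;
                perm[tail++] = w;
            }
        }
    }

    for (int k = 0; k < N; k++)
        printf("new index %d <- old node %d\n", k, perm[k]);
    return 0;
}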

Slide 25 / 32 Cuthill-McKee Ordering

• Reversing the ordering (RCM) results in an ordering that is better for sparse LU
• Reduces matrix bandwidth (see example)
• Improves cache performance

• Can be used as a partitioner (parallelization) into p1, p2, p3, p4, but in general does not reduce the edge cut

Slide 26 / 32 Self-Avoiding Walks (SAW)

• Enumeration of mesh elements through 'consecutive elements' (sharing a face, edge, vertex, etc.)
  * similar to space-filling curves, but for unstructured meshes
  * improves cache reuse
  * can be used as a partitioner with good load balance, but in general does not reduce the edge cut

Slide 27 / 32 Graph partitioning

• Refer back to Lecture #9, Part II: Mesh Generation and Load Balancing
• Can be used for reordering
• Metis/ParMetis:
  – multilevel partitioning
  – good load balance and minimized edge cut

Slide 28 / 32 Parallel Mat-Vec Product

• Easiest way:
  – 1D partitioning of the rows among processes p1, p2, p3, p4 (see the sketch after this list)
  – May lead to load imbalance (why?)
  – May need a lot of communication for x
• Can use any of the just mentioned techniques
• Most promising seem to be spectral multilevel methods (as in Metis/ParMetis)
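A minimal MPI sketch in C of the 1D row partitioning from the first bullet (mine, not from the slides): every process owns NLOCAL consecutive rows of a global tridiagonal (-1, 2, -1) matrix in CRS with global column indexes, plus the matching block of x; MPI_Allgather replicates all of x on every process before the local mat-vec. This is the 'a lot of communication for x' case; a real code would send only the needed pieces of x (see the optimizations on the next slide).

/* Minimal MPI sketch (not from the slides): 1D row-partitioned mat-vec.
 * Typically compiled with mpicc; runs with any number of processes. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NLOCAL 4

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = NLOCAL * size;            /* global problem size         */
    int row0 = rank * NLOCAL;         /* first global row owned here */

    /* build local CRS rows of the -1 2 -1 stencil (global column indexes) */
    int    I[NLOCAL + 1], J[3 * NLOCAL];
    double V[3 * NLOCAL];
    int nnz = 0;
    for (int i = 0; i < NLOCAL; i++) {
        int g = row0 + i;
        I[i] = nnz;
        if (g > 0)     { J[nnz] = g - 1; V[nnz++] = -1.0; }
        J[nnz] = g; V[nnz++] = 2.0;
        if (g < n - 1) { J[nnz] = g + 1; V[nnz++] = -1.0; }
    }
    I[NLOCAL] = nnz;

    /* local block of x (all ones here), gathered into a full copy of x */
    double xloc[NLOCAL], y[NLOCAL];
    double *x = malloc(n * sizeof(double));
    for (int i = 0; i < NLOCAL; i++) xloc[i] = 1.0;
    MPI_Allgather(xloc, NLOCAL, MPI_DOUBLE, x, NLOCAL, MPI_DOUBLE,
                  MPI_COMM_WORLD);

    /* local CRS mat-vec over the owned rows */
    for (int i = 0; i < NLOCAL; i++) {
        y[i] = 0.0;
        for (int j = I[i]; j < I[i + 1]; j++)
            y[i] += V[j] * x[J[j]];
    }

    if (rank == 0)
        printf("y[0] = %g (expected 1 for the -1 2 -1 stencil and x = 1)\n", y[0]);

    free(x);
    MPI_Finalize();
    return 0;
}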

Slide 29 / 32 Possible optimizations

• Block communication
  – To send the minimum required part of x
  – e.g., pre-compute blocks of interfaces
• Load balance, minimize edge cut
  – e.g., a good partitioner would do it
• Reordering
• Take advantage of additional structure (symmetry, bands, etc.)

Slide 30 / 32 Comparison

• Distributed memory implementation (by X. Li, L. Oliker, G. Heber, R. Biswas)

– ORIG ordering has a large edge cut (interprocessor communication) and poor locality (high number of cache misses)
– MeTiS minimizes edge cut, while SAW minimizes cache misses

Slide 31 / 32 Learning Goals

• Efficient sparse computations are challenging!
• Computational challenges and issues related to sparse matrices
  – Data formats
  – Optimization
    • Blocking
    • Reordering
    • Other

• Parallel sparse Mat-Vec product – Code optimization opportunities

Slide 32 / 32