Sparse Matrices and Optimized Parallel Implementations

Stan Tomov
Innovative Computing Laboratory, Computer Science Department, The University of Tennessee

April 5, 2017

CS 594 Lecture Notes, 04/05/2017

Slide 1 / 32 Topics

Projection in Scientific Computing

Sparse matrices, numerical solution of PDEs, parallel implementations, tools, etc.

Iterative Methods

Slide 2 / 32 Outline

• Part I – Discussion

• Part II – Sparse computations

• Part III – Reordering algorithms and parallelization

Slide 3 / 32 Part I

Discussion

Slide 4 / 32 Orthogonalization

• We can orthonormalize a non-orthogonal basis. How?
• Other approaches: QR using Householder transformations (as in LAPACK), or Cholesky and/or SVD on the normal equations (as in the homework)

[Figure: Gflop/s for orthogonalizing 128 vectors vs. vector size (x 1,000), comparing QR_it_svd, QR_it_Chol/svd, and MGS. Left: hybrid CPU-GPU computation on an NVIDIA Quadro FX 5600. Right: CPU computation on an AMD Opteron 265 (1.8 GHz).]
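One answer to the 'How?' above is (modified) Gram-Schmidt, which is what the MGS curve in the figure measures. The following C code is a minimal sketch of my own, not from the slides: it orthonormalizes a small made-up set of M vectors of length N in place.

/* Minimal sketch (not from the slides): modified Gram-Schmidt (MGS)
 * orthonormalization of M vectors v[0..M-1], each of length N.
 * Sizes and data are made-up examples. Compile with -lm. */
#include <stdio.h>
#include <math.h>

#define N 3
#define M 2

int main(void) {
    double v[M][N] = { {1, 1, 0},
                       {1, 0, 1} };   /* non-orthonormal basis */

    for (int k = 0; k < M; k++) {
        /* normalize v[k] */
        double nrm = 0.0;
        for (int i = 0; i < N; i++) nrm += v[k][i] * v[k][i];
        nrm = sqrt(nrm);
        for (int i = 0; i < N; i++) v[k][i] /= nrm;

        /* orthogonalize the remaining vectors against v[k] */
        for (int j = k + 1; j < M; j++) {
            double dot = 0.0;
            for (int i = 0; i < N; i++) dot += v[j][i] * v[k][i];
            for (int i = 0; i < N; i++) v[j][i] -= dot * v[k][i];
        }
    }

    for (int k = 0; k < M; k++)
        printf("q%d = (%g, %g, %g)\n", k + 1, v[k][0], v[k][1], v[k][2]);
    return 0;
}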

Slide 5 / 32 What if the basis is not orthonormal?

• If we do not want to orthonormalize:
  'Multiply' by x1, ..., xm to get u ≈ P u = c1 x1 + c2 x2 + ... + cm xm:

  (u, x1) = c1 (x1, x1) + c2 (x2, x1) + ... + cm (xm, x1)
  ...
  (u, xm) = c1 (x1, xm) + c2 (x2, xm) + ... + cm (xm, xm)

• These are the so-called Petrov-Galerkin conditions
• We saw examples of their use in
  * optimization, and
  * PDE discretization, e.g., FEM
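To make these conditions concrete, here is a minimal C sketch (mine, not from the slides): it forms the system G c = b with G[i][j] = (xj, xi) and b[i] = (u, xi) for a small made-up non-orthonormal basis, and solves it with plain Gaussian elimination, used here only because the example is tiny.

/* Minimal sketch (not from the slides): form and solve the Petrov-Galerkin
 * (normal-equations) system G c = b, where G[i][j] = (x_j, x_i) and
 * b[i] = (u, x_i), for a small non-orthonormal basis x_1,...,x_M in R^N.
 * Gaussian elimination without pivoting is fine for this tiny SPD system. */
#include <stdio.h>

#define N 4   /* vector length (assumed for the example)          */
#define M 2   /* number of basis vectors (assumed for the example) */

static double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

int main(void) {
    double x[M][N] = { {1, 1, 0, 0},      /* non-orthonormal basis */
                       {1, 0, 1, 0} };
    double u[N]    = {3, 1, 2, 0};        /* vector to project     */

    double G[M][M], b[M], c[M];
    for (int i = 0; i < M; i++) {         /* Gram matrix and RHS   */
        b[i] = dot(u, x[i], N);
        for (int j = 0; j < M; j++) G[i][j] = dot(x[j], x[i], N);
    }

    /* forward elimination */
    for (int k = 0; k < M; k++)
        for (int i = k + 1; i < M; i++) {
            double f = G[i][k] / G[k][k];
            for (int j = k; j < M; j++) G[i][j] -= f * G[k][j];
            b[i] -= f * b[k];
        }
    /* back substitution */
    for (int i = M - 1; i >= 0; i--) {
        c[i] = b[i];
        for (int j = i + 1; j < M; j++) c[i] -= G[i][j] * c[j];
        c[i] /= G[i][i];
    }

    for (int i = 0; i < M; i++) printf("c%d = %g\n", i + 1, c[i]);
    return 0;
}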

Slide 6 / 32

What if the basis is not orthonormal?

• If we do not want to orthonormalize, e.g., in FEM:
  'Multiply' by ϕ1, ..., ϕ7 to get u ≈ P u = c1 ϕ1 + c2 ϕ2 + ... + c7 ϕ7, i.e., a 7x7 system:

  a(c1 ϕ1 + c2 ϕ2 + ... + c7 ϕ7, ϕi) = F(ϕi)   for i = 1, ..., 7

• Two examples of basis functions ϕi

• The more the ϕi overlap, the denser the resulting matrix

• Spectral element methods (high-order FEM)

(Image taken from http://www.urel.feec.vutbr.cz/~raida)

Slide 7 / 32 Stencil Computations

• K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, K. Yelick, “Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors”, SIAM Review, 2008.

http://bebop.cs.berkeley.edu/pubs/datta2008-stencil-sirev.pdf
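As a point of reference for the paper above, here is a minimal C sketch (mine, not from the paper) of the kind of kernel it studies: one Jacobi sweep of the 2-D 5-point stencil. The grid size, boundary values, and coefficients are made-up assumptions.

/* Minimal sketch (not from the cited paper): one Jacobi sweep of the 2-D
 * 5-point stencil on an (NX+2) x (NY+2) grid; boundary values stay fixed.
 * Grid size and initial data are made-up examples. */
#include <stdio.h>

#define NX 6
#define NY 6

int main(void) {
    static double u[NX + 2][NY + 2], unew[NX + 2][NY + 2];

    /* made-up initialization: 1.0 on the boundary, 0.0 inside */
    for (int i = 0; i < NX + 2; i++)
        for (int j = 0; j < NY + 2; j++)
            u[i][j] = (i == 0 || j == 0 || i == NX + 1 || j == NY + 1) ? 1.0 : 0.0;

    /* one sweep: each interior point is replaced by the average of its
     * 4 neighbours -- the same access pattern a 5-point operator row has */
    for (int i = 1; i <= NX; i++)
        for (int j = 1; j <= NY; j++)
            unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                 u[i][j - 1] + u[i][j + 1]);

    printf("unew[1][1] = %g\n", unew[1][1]);   /* 0.5 for this initialization */
    return 0;
}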

Slide 8 / 32 Part II

Sparse computations

Slide 9 / 32 Sparse matrices

• Sparse matrix: a substantial part of the coefficients is zero
• Sparse matrices naturally arise from PDE discretizations – finite differences, FEM, etc.; we saw examples in the
  – 5-point finite difference operator
  – 1-D piecewise linear FEM

[Figure: a 2-D grid with indexed vertices, e.g., 2, 5, 6, 7, 10, and a 1-D mesh for piecewise linear FEM.]

For the 5-point finite difference operator, row 6 will have 5 non-zero elements: A6,2, A6,5, A6,6, A6,7, and A6,10.
For the 1-D piecewise linear FEM, row 3, for example, will have 3 non-zeros: A3,2, A3,3, and A3,4.
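The 1-D FEM statement above (row 3 having only the 3 non-zeros A3,2, A3,3, A3,4) can be reproduced with a short sketch. The C code below is mine, not from the slides: it assembles the stiffness matrix a(ϕj, ϕi) = ∫ ϕj' ϕi' for piecewise linear (hat) functions on a uniform mesh of NE elements, a standard model problem; only the tridiagonal entries come out non-zero.

/* Minimal sketch (not from the slides): assemble the 1-D stiffness matrix
 * a(phi_j, phi_i) = integral of phi_j' * phi_i' for piecewise linear (hat)
 * functions on a uniform mesh with NE elements of length h. Each element
 * contributes the 2x2 local matrix (1/h) * [ 1 -1; -1 1 ] to the two hat
 * functions touching it, so the global matrix is tridiagonal. */
#include <stdio.h>

#define NE 6                 /* number of elements (assumed) */
#define NN (NE + 1)          /* number of nodes 0..NE        */

int main(void) {
    double h = 1.0 / NE;
    double A[NN][NN] = {{0}};       /* dense here only to show the pattern */

    for (int e = 0; e < NE; e++) {  /* element e spans nodes e and e+1 */
        A[e][e]         += 1.0 / h;
        A[e][e + 1]     += -1.0 / h;
        A[e + 1][e]     += -1.0 / h;
        A[e + 1][e + 1] += 1.0 / h;
    }

    /* print row 3: only A[3][2], A[3][3], A[3][4] are non-zero */
    for (int j = 0; j < NN; j++)
        printf("A[3][%d] = %5.1f\n", j, A[3][j]);
    return 0;
}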

Slide 10 / 32 Sparse matrices

• In general:
  * Degrees of freedom (DOF), associated for example with vertices (or edges, faces, etc.), are indexed
  * A basis function is associated with every DOF (unknown)
  * A Petrov-Galerkin condition (equation) is derived for every basis function, representing a row in the resulting system
  * Only 'a few' elements per row will be nonzero, as the basis functions have local support
    – e.g., row 10, using continuous piecewise linear FEM, will have 6 nonzeros: A10,10, A10,35, A10,100, A10,332, A10,115, A10,201
    – i.e., what happens at '10' is described by the physical state at '10' and the neighbouring vertices 35, 201, 115, 100, and 332
    – physical intuition: PDEs describe changes in physical processes; describing/discretizing these changes numerically, based only on local/neighbouring information, results in sparse matrices

[Figure: mesh fragment with vertex 10 and its neighbouring vertices 35, 100, 115, 201, and 332.]

Slide 11 / 32 Sparse matrices

• Can we take advantage of this sparse structure?
  – To solve, for example, very large problems
  – To solve them efficiently

• Yes! There are algorithms

– Linear solvers (some to be covered in the last 2 lectures)

– Efficient data storage and implementation (next ...)

Slide 12 / 32 Sparse matrix formats

• It pays to avoid storing the zeros!
• Common sparse storage formats:
  – AIJ
  – Compressed row/column storage (CRS/CCS)
  – Compressed diagonal storage (CDS)
  * for more see the 'Templates' book: http://www.netlib.org/linalg/html_templates/node90.html#SECTION00931000000000000000

– Blocked versions (why?)

Slide 13 / 32 AIJ

• Stored in 3 arrays
  – The same length
  – No order implied

 I  J  AIJ
 1  1   1
 1  2   2
 2  1   3
 2  3   4
 3  2   5
 3  4   6
 4  3   7
 4  5   8
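A minimal C sketch of the table above (mine, not from the slides): the three AIJ arrays for this example matrix, 0-based instead of 1-based, together with the mat-vec y = A*x done directly over the (I, J, AIJ) triples; since no order is implied, each triple simply adds its contribution. The test vector x is made up.

/* Minimal sketch (not from the slides): the example matrix in AIJ
 * (coordinate) format, 0-based, and a mat-vec y = A*x over the triples. */
#include <stdio.h>

#define NNZ   8
#define NROWS 4
#define NCOLS 5

int main(void) {
    int    I[NNZ]   = {0, 0, 1, 1, 2, 2, 3, 3};
    int    J[NNZ]   = {0, 1, 0, 2, 1, 3, 2, 4};
    double AIJ[NNZ] = {1, 2, 3, 4, 5, 6, 7, 8};

    double x[NCOLS] = {1, 1, 1, 1, 1};      /* made-up test vector */
    double y[NROWS] = {0, 0, 0, 0};

    /* no ordering assumed: each triple adds its contribution */
    for (int k = 0; k < NNZ; k++)
        y[I[k]] += AIJ[k] * x[J[k]];

    for (int i = 0; i < NROWS; i++)
        printf("y[%d] = %g\n", i, y[i]);    /* expected: 3 7 11 15 */
    return 0;
}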

Slide 14 / 32 CRS

• Stored in 3 arrays
  – J and AIJ the same length
  – I (representing rows) is compressed

 I   J  AIJ
 1   1   1
 3   2   2
 5   1   3
 7   3   4
 9   2   5
     4   6
     3   7
     5   8

* array I: think of it as pointers to where the next row starts
* CCS: similar, but J is compressed
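The same example matrix in CRS, as a minimal C sketch (mine, not from the slides; 0-based instead of the 1-based table): J and AIJ keep their length, while I shrinks to one pointer per row plus a final sentinel.

/* Minimal sketch (not from the slides): the example matrix in CRS, 0-based.
 * I[i] points to where row i starts in J/AIJ; I[i+1] is where the next
 * row starts, so row i occupies positions I[i] .. I[i+1]-1. */
#include <stdio.h>

int main(void) {
    int    I[5]   = {0, 2, 4, 6, 8};            /* row pointers (n+1 entries) */
    int    J[8]   = {0, 1, 0, 2, 1, 3, 2, 4};   /* column indexes             */
    double AIJ[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* values                     */

    /* print the non-zeros of row 2 (third row): A[2][1] = 5, A[2][3] = 6 */
    for (int j = I[2]; j < I[3]; j++)
        printf("A[2][%d] = %g\n", J[j], AIJ[j]);
    return 0;
}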

Slide 15 / 32 CDS

• For matrices with non-zeros along sub-diagonals

[Figure: a banded example matrix and its CDS storage, with sub-diagonal indexes -1, 0, and 1.]
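A minimal C sketch of the format (mine, not from the slides): a small made-up tridiagonal matrix stored diagonal-by-diagonal with offsets -1, 0, +1, and the corresponding mat-vec.

/* Minimal sketch (not from the slides): compressed diagonal storage (CDS)
 * for a tridiagonal example, with diagonals indexed by offsets -1, 0, +1,
 * and the mat-vec y = A*x. Entries falling outside the matrix are stored
 * as zeros. Values are made up. */
#include <stdio.h>

#define N     5
#define NDIAG 3

int main(void) {
    int    off[NDIAG] = {-1, 0, 1};
    /* DIAG[i][d] holds A[i][i + off[d]]; first/last rows pad with 0 */
    double DIAG[N][NDIAG] = {
        { 0, 2, -1},
        {-1, 2, -1},
        {-1, 2, -1},
        {-1, 2, -1},
        {-1, 2,  0},
    };
    double x[N] = {1, 1, 1, 1, 1};
    double y[N] = {0};

    for (int d = 0; d < NDIAG; d++)
        for (int i = 0; i < N; i++) {
            int j = i + off[d];
            if (j >= 0 && j < N)
                y[i] += DIAG[i][d] * x[j];
        }

    for (int i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expected: 1 0 0 0 1 */
    return 0;
}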

Slide 16 / 32 Performance (Mat-vec product)

• Notorious for running at just a fraction of the performance peak!
• Why? Consider the mat-vec product for a matrix in CRS:

for i = 1, n
    for j = I[i], I[i+1]-1
        y[i] += AIJ[j] * x[J[j]]
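For reference, a runnable 0-based C version of this loop (mine, not from the slides), with the CRS arrays of the earlier 4x5 example matrix hard-coded as test data:

/* Minimal runnable version of the CRS mat-vec loop above (0-based). */
#include <stdio.h>

static void spmv_crs(int n, const int *I, const int *J, const double *AIJ,
                     const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = I[i]; j < I[i + 1]; j++)
            y[i] += AIJ[j] * x[J[j]];
    }
}

int main(void) {
    int    I[5]   = {0, 2, 4, 6, 8};
    int    J[8]   = {0, 1, 0, 2, 1, 3, 2, 4};
    double AIJ[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double x[5]   = {1, 1, 1, 1, 1};           /* made-up test vector */
    double y[4];

    spmv_crs(4, I, J, AIJ, x, y);
    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);       /* expected: 3 7 11 15 */
    return 0;
}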

Slide 17 / 32 Performance (Mat-vec product)

• Notorious for running at just a fraction of the performance peak!
• Why? Consider the mat-vec product for a matrix in CRS:

for i = 1, n
    for j = I[i], I[i+1]-1
        y[i] += AIJ[j] * x[J[j]]

* Irregular, indirect memory access for x
  - results in cache thrashing
* performance is often <10% of peak

Slide 18 / 32 Performance (Mat-vec product)

* Performance of mat-vec products of various sizes on a 2.4 GHz Pentium 4
* An example from Gahvari et al.: http://bebop.cs.berkeley.edu/pubs/gahvari2007-spmvbench-spec.pdf

Slide 19 / 32 Performance (Mat-vec product)

• How to improve the performance?
  – A common technique (as done for dense matrices) is blocking (register, cache: next ...)
  – Index reordering (in Part III)
  – Exploit special matrix structure (e.g., symmetry, bands, other structures)

Slide 20 / 32 Block Compressed Row Storage (BCRS)

• Example of using 2x2 blocks

[Table: the BI, BJ, and AIJ arrays for the example matrix stored in 2x2 blocks.]

* Reduced storage for indexes
* Drawback: adds explicit 0s
* What block size to choose?
* BCRS for register blocking
* Discussion?
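A minimal C sketch of a 2x2 BCRS mat-vec (mine; the arrays are not the slide's exact data): the value array stores one full 2x2 block per BJ entry, in row-major order and with explicit zeros, for the earlier 4x5 example matrix padded with a zero sixth column.

/* Minimal sketch (not the slide's exact arrays): 2x2 BCRS mat-vec, 0-based.
 * The data encodes the 4x5 example matrix padded with a zero 6th column;
 * each BJ entry owns 4 values (one 2x2 block, row-major), zeros included. */
#include <stdio.h>

int main(void) {
    int    BI[3]   = {0, 2, 5};            /* block-row pointers   */
    int    BJ[5]   = {0, 1, 0, 1, 2};      /* block-column indexes */
    double VAL[20] = {1, 2, 3, 0,          /* block (0,0)          */
                      0, 0, 4, 0,          /* block (0,1)          */
                      0, 5, 0, 0,          /* block (1,0)          */
                      0, 6, 7, 0,          /* block (1,1)          */
                      0, 0, 8, 0};         /* block (1,2)          */
    double x[6] = {1, 1, 1, 1, 1, 1};
    double y[4] = {0};

    for (int ib = 0; ib < 2; ib++)                  /* block rows         */
        for (int b = BI[ib]; b < BI[ib + 1]; b++) { /* blocks in this row */
            int jb = BJ[b];
            /* one 2x2 block multiplies two entries of x and updates two of y */
            y[2 * ib]     += VAL[4 * b]     * x[2 * jb] + VAL[4 * b + 1] * x[2 * jb + 1];
            y[2 * ib + 1] += VAL[4 * b + 2] * x[2 * jb] + VAL[4 * b + 3] * x[2 * jb + 1];
        }

    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expected: 3 7 11 15 */
    return 0;
}

Because each block keeps two x values and two y values live across four multiplies, the compiler can hold them in registers -- the register blocking mentioned above.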

Slide 21 / 32

BCRS

Slide 22 / 32 Cache blocking

• Improve cache reuse for x in Ax by splitting A into a set of sparse matrices, e.g.

Sparse matrix and its splitting

For more info check: Eun-Jin Im, K. Yelick, R. Vuduc, "SPARSITY: An Optimization Framework for Sparse Matrix Kernels", International Journal of High Performance Computing Applications, 18 (1), pp. 135-158, February 2004.
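A minimal C sketch of the idea (mine, not the SPARSITY code): the earlier 4x5 example matrix is pre-split by columns into two CRS matrices A1 (columns 0-2) and A2 (columns 3-4), and y = A*x is computed as two accumulating passes, each of which touches only the part of x its columns cover.

/* Minimal sketch (not from the SPARSITY paper): cache blocking of y = A*x
 * by pre-splitting A column-wise into A1 (columns 0-2) and A2 (columns 3-4),
 * both in CRS with global column indexes. Each pass reuses a smaller piece
 * of x. The data is the 4x5 example matrix from the earlier slides. */
#include <stdio.h>

static void spmv_add(int n, const int *I, const int *J, const double *V,
                     const double *x, double *y) {
    for (int i = 0; i < n; i++)
        for (int j = I[i]; j < I[i + 1]; j++)
            y[i] += V[j] * x[J[j]];
}

int main(void) {
    /* A1: entries with column < 3 */
    int    I1[5] = {0, 2, 4, 5, 6};
    int    J1[6] = {0, 1, 0, 2, 1, 2};
    double V1[6] = {1, 2, 3, 4, 5, 7};
    /* A2: entries with column >= 3 */
    int    I2[5] = {0, 0, 0, 1, 2};
    int    J2[2] = {3, 4};
    double V2[2] = {6, 8};

    double x[5] = {1, 1, 1, 1, 1};
    double y[4] = {0};

    spmv_add(4, I1, J1, V1, x, y);   /* pass 1: reuses x[0..2] */
    spmv_add(4, I2, J2, V2, x, y);   /* pass 2: reuses x[3..4] */

    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expected: 3 7 11 15 */
    return 0;
}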

Slide 23 / 32 Part III

Reordering algorithms and parallelization

Slide 24 / 32 Reorder to preserve locality

[Figure: mesh fragment with vertex 10 and its neighbouring vertices 35, 100, 115, 201, and 332.]

e.g., Cuthill-McKee ordering: start from an arbitrary node, say '10', and reorder:
* '10' becomes 0
* its neighbours are ordered next and become 1, 2, 3, 4, 5; denote these as level 1
* neighbours of the level 1 nodes are consecutively reordered next, and so on until the end
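A minimal C sketch of this procedure (mine, not from the slides): a basic Cuthill-McKee renumbering by breadth-first search on a small made-up 6-node graph stored as a CRS-style adjacency list. A full implementation would also visit neighbours in order of increasing degree, sweep over all connected components, and usually reverse the result (RCM).

/* Minimal sketch (not from the slides): basic Cuthill-McKee renumbering by
 * breadth-first search from a chosen start node. The 6-node graph is made up. */
#include <stdio.h>

#define N 6

int main(void) {
    /* neighbours of node i are adj[xadj[i] .. xadj[i+1]-1] */
    int xadj[N + 1] = {0, 2, 5, 7, 9, 12, 14};
    int adj[14] = {1, 3,      /* neighbours of 0 */
                   0, 2, 4,   /* neighbours of 1 */
                   1, 5,      /* neighbours of 2 */
                   0, 4,      /* neighbours of 3 */
                   1, 3, 5,   /* neighbours of 4 */
                   2, 4};     /* neighbours of 5 */

    int start = 0;            /* plays the role of '10' in the slide's example */
    int perm[N];              /* perm[k] = old index of the node numbered k   */
    int visited[N] = {0};

    /* BFS: the perm array itself serves as the queue */
    int head = 0, tail = 0;
    perm[tail++] = start;
    visited[start] = 1;
    while (head < tail) {
        int v = perm[head++];
        for (int j = xadj[v]; j < xadj[v + 1]; j++) {
            int w = adj[j];
            if (!visited[w]) {      /* unvisited neighbours get the next numbers */
                visited[w] = 1;
                perm[tail++] = w;
            }
        }
    }

    for (int k = 0; k < N; k++)
        printf("new index %d <- old node %d\n", k, perm[k]);
    return 0;
}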

Slide 25 / 32 Cuthill-McKee Ordering

• Reversing the ordering (RCM) results in an ordering that is better for sparse LU
• Reduces matrix bandwidth (see example)
• Improves cache performance

• Can be used as a partitioner (parallelization) into p1, p2, p3, p4, but in general does not reduce the edge cut

Slide 26 / 32 Self-Avoiding Walks (SAW)

• Enumeration of mesh elements through 'consecutive elements' (sharing a face, edge, vertex, etc.)
  * similar to space-filling curves, but for unstructured meshes
  * improves cache reuse
  * can be used as a partitioner with good load balance, but in general does not reduce the edge cut

Slide 27 / 32 Graph partitioning

• Refer back to Lecture #9, Part II: Mesh Generation and Load Balancing
• Can be used for reordering
• Metis/ParMetis:
  – multilevel partitioning
  – good load balance and minimized edge cut

Slide 28 / 32 Parallel Mat-Vec Product

• Easiest way:
  – 1D partitioning of the rows among processes p1, p2, p3, p4 (see the sketch after this list)
  – May lead to load imbalance (why?)
  – May need a lot of communication for x
• Can use any of the just mentioned techniques
• Most promising seem to be spectral multilevel methods (as in Metis/ParMetis)
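A minimal MPI sketch in C of the 1D row partitioning from the first bullet (mine, not from the slides): every process owns NLOCAL consecutive rows of a global tridiagonal (-1, 2, -1) matrix in CRS with global column indexes, plus the matching block of x; MPI_Allgather replicates all of x on every process before the local mat-vec. This is the 'a lot of communication for x' case; a real code would send only the needed pieces of x (see the optimizations on the next slide).

/* Minimal MPI sketch (not from the slides): 1D row-partitioned mat-vec.
 * Typically compiled with mpicc; runs with any number of processes. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NLOCAL 4

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = NLOCAL * size;            /* global problem size         */
    int row0 = rank * NLOCAL;         /* first global row owned here */

    /* build local CRS rows of the -1 2 -1 stencil (global column indexes) */
    int    I[NLOCAL + 1], J[3 * NLOCAL];
    double V[3 * NLOCAL];
    int nnz = 0;
    for (int i = 0; i < NLOCAL; i++) {
        int g = row0 + i;
        I[i] = nnz;
        if (g > 0)     { J[nnz] = g - 1; V[nnz++] = -1.0; }
        J[nnz] = g; V[nnz++] = 2.0;
        if (g < n - 1) { J[nnz] = g + 1; V[nnz++] = -1.0; }
    }
    I[NLOCAL] = nnz;

    /* local block of x (all ones here), gathered into a full copy of x */
    double xloc[NLOCAL], y[NLOCAL];
    double *x = malloc(n * sizeof(double));
    for (int i = 0; i < NLOCAL; i++) xloc[i] = 1.0;
    MPI_Allgather(xloc, NLOCAL, MPI_DOUBLE, x, NLOCAL, MPI_DOUBLE,
                  MPI_COMM_WORLD);

    /* local CRS mat-vec over the owned rows */
    for (int i = 0; i < NLOCAL; i++) {
        y[i] = 0.0;
        for (int j = I[i]; j < I[i + 1]; j++)
            y[i] += V[j] * x[J[j]];
    }

    if (rank == 0)
        printf("y[0] = %g (expected 1 for the -1 2 -1 stencil and x = 1)\n", y[0]);

    free(x);
    MPI_Finalize();
    return 0;
}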

Slide 29 / 32 Possible optimizations

• Block communication
  – To send the minimum required part of x
  – e.g., pre-compute blocks of interfaces
• Load balance, minimize edge cut
  – e.g., a good partitioner would do it
• Reordering
• Take advantage of additional structure (symmetry, bands, etc.)

Slide 30 / 32 Comparison

• Distributed memory implementation (by X. Li, L. Oliker, G. Heber, R. Biswas)

– ORIG ordering has a large edge cut (interprocessor communication) and poor locality (high number of cache misses)
– MeTiS minimizes edge cut, while SAW minimizes cache misses

Slide 31 / 32 Learning Goals

• Efficient sparse computations are challenging!
• Computational challenges and issues related to sparse matrices
  – Data formats
  – Optimization
    • Blocking
    • Reordering
    • Other

• Parallel sparse Mat-Vec product – Code optimization opportunities

Slide 32 / 32