Sparse Matrices and Optimized Parallel Implementations
Stan Tomov
Innovative Computing Laboratory, Computer Science Department, The University of Tennessee
CS 594 Lecture Notes, April 5, 2017
Slide 1 / 32

Topics (Slide 2 / 32)
• Projection in scientific computing
• PDEs, numerical solution
• Sparse matrices, parallel implementations, tools, etc.
• Iterative methods

Outline (Slide 3 / 32)
• Part I – Discussion
• Part II – Sparse matrix computations
• Part III – Reordering algorithms and parallelization

Part I: Discussion (Slide 4 / 32)

Orthogonalization (Slide 5 / 32)
• We can orthonormalize a non-orthogonal basis. How?
• Other approaches: QR using Householder transformations (as in LAPACK), Cholesky, or/and SVD on the normal equations (as in the homework)
• [Figure: Gflop/s for orthogonalizing 128 vectors of size 1,000 to 512,000, comparing QR_it_svd and QR_it_Chol/svd (with MGS as reference); hybrid CPU-GPU computation on an NVIDIA Quadro FX 5600 vs. CPU-only computation on an AMD Opteron 265 (1.8 GHz, 1 MB cache)]

What if the basis is not orthonormal? (Slide 6 / 32)
• If we do not want to orthonormalize, 'multiply' by x1, ..., xm to get the coefficients of u ≈ P u = c1 x1 + c2 x2 + ... + cm xm:
    (u, x1) = c1 (x1, x1) + c2 (x2, x1) + ... + cm (xm, x1)
    ...
    (u, xm) = c1 (x1, xm) + c2 (x2, xm) + ... + cm (xm, xm)
• These are the so-called Petrov-Galerkin conditions
• We saw examples of their use in
    * optimization, and
    * PDE discretization, e.g. FEM

What if the basis is not orthonormal? (Slide 7 / 32)
• If we do not want to orthonormalize, e.g. in FEM, 'multiply' by ϕ1, ..., ϕ7 to get u ≈ P u = c1 ϕ1 + c2 ϕ2 + ... + c7 ϕ7, i.e. a 7x7 system
    a(c1 ϕ1 + c2 ϕ2 + ... + c7 ϕ7, ϕi) = F(ϕi)   for i = 1, ..., 7
• [Figure: two examples of basis functions ϕi; image taken from http://www.urel.feec.vutbr.cz/~raida]
• The more the ϕi overlap, the denser the resulting matrix
• Spectral element methods (high-order FEM)

Stencil Computations (Slide 8 / 32)
• K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, K. Yelick, "Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors", SIAM Review, 2008.
  http://bebop.cs.berkeley.edu/pubs/datta2008-stencil-sirev.pdf

Part II: Sparse matrix computations (Slide 9 / 32)

Sparse matrices (Slide 10 / 32)
• Sparse matrix: a substantial part of the coefficients is zero
• They arise naturally from PDE discretizations – finite differences, FEM, etc.; we saw examples in the 5-point finite difference operator and in 1-D piecewise linear FEM
• 5-point finite difference operator: vertices are indexed; if vertex 6 has neighbours 2, 5, 7, and 10, then row 6 will have 5 non-zero elements: A6,2, A6,5, A6,6, A6,7, and A6,10
• 1-D piecewise linear FEM: row 3, for example, will have 3 non-zeros: A3,2, A3,3, A3,4

Sparse matrices (Slide 11 / 32)
• In general:
  * Degrees of freedom (DOF), associated for example with vertices (or edges, faces, etc.), are indexed
  * A basis function is associated with every DOF (unknown)
  * A Petrov-Galerkin condition (equation) is derived for every basis function, representing a row in the resulting system
  * Only 'a few' elements per row will be nonzero, as the basis functions have local support
    – e.g. row 10, using continuous piecewise linear FEM on a mesh where vertex 10 has neighbours 35, 100, 115, 201, and 332, will have 6 nonzeros: A10,10, A10,35, A10,100, A10,115, A10,201, A10,332
    – what happens at '10' is described by the physical state at '10' and at the neighbouring vertices 35, 100, 115, 201, and 332
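
To make the 'local support' point concrete, here is a minimal sketch (not from the slides) that assembles the stiffness matrix of 1-D piecewise-linear FEM for -u'' = f on a uniform mesh with homogeneous Dirichlet boundary conditions. The mesh size n and the triplet output are illustrative assumptions; each interior row gets exactly 3 nonzeros, matching the row-3 example above, and the (I, J, value) triplets are the AIJ format discussed in Part II.

    #include <stdio.h>

    /* Sketch: 1-D piecewise-linear FEM stiffness matrix for -u'' = f on a
     * uniform mesh with n interior vertices (spacing h = 1/(n+1)), homogeneous
     * Dirichlet BCs.  Interior row i has only 3 nonzeros, A(i,i-1) = -1/h,
     * A(i,i) = 2/h, A(i,i+1) = -1/h, because the hat basis functions have
     * local support.  Entries are printed as (I, J, value) triplets. */
    int main(void)
    {
        const int n = 5;                 /* illustrative problem size */
        const double h = 1.0 / (n + 1);

        for (int i = 0; i < n; i++) {
            if (i > 0)
                printf("%d %d %8.4f\n", i, i - 1, -1.0 / h);
            printf("%d %d %8.4f\n", i, i, 2.0 / h);
            if (i < n - 1)
                printf("%d %d %8.4f\n", i, i + 1, -1.0 / h);
        }
        return 0;
    }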
• Physical intuition: PDEs describe changes in physical processes; describing/discretizing these changes numerically, based only on local/neighbouring information, results in sparse matrices

Sparse matrices (Slide 12 / 32)
• Can we take advantage of this sparse structure?
  – To solve, for example, very large problems
  – To solve them efficiently
• Yes! There are algorithms for
  – Linear solvers and preconditioners (to cover some in the last 2 lectures)
  – Efficient data storage and implementation (next ...)

Sparse matrix formats (Slide 13 / 32)
• It pays to avoid storing the zeros!
• Common sparse storage formats:
  – AIJ
  – Compressed row/column storage (CRS/CCS)
  – Compressed diagonal storage (CDS)
    * for more see the 'Templates' book: http://www.netlib.org/linalg/html_templates/node90.html#SECTION00931000000000000000
  – Blocked versions (why?)

AIJ (Slide 14 / 32)
• Stored in 3 arrays
  – All of the same length
  – No order implied
• Example (8 nonzeros):
    I   : 1 1 2 2 3 3 4 4
    J   : 1 2 1 3 2 4 3 5
    AIJ : 1 2 3 4 5 6 7 8

CRS (Slide 15 / 32)
• Stored in 3 arrays
  – J and AIJ have the same length (one entry per nonzero)
  – I (representing rows) is compressed
• Same example:
    I   : 1 3 5 7 9
    J   : 1 2 1 3 2 4 3 5
    AIJ : 1 2 3 4 5 6 7 8
• Array I: think of it as pointers to where the next row starts
• CCS: similar, but J is compressed

CDS (Slide 16 / 32)
• For matrices with non-zeros along sub-diagonals
• [Figure: sub-diagonals stored by sub-diagonal index, e.g. -1, 0, 1]

Performance (mat-vec product) (Slide 17 / 32)
• Notoriously bad for running at just a fraction of the performance peak!
• Why? Consider the mat-vec product for a matrix in CRS:

    for i = 1, n
        for j = I[i], I[i+1]-1
            y[i] += AIJ[j] * x[J[j]]

Performance (mat-vec product) (Slide 18 / 32)
• The access x[J[j]] is an irregular, indirect memory access
  – it results in cache thrashing
  – performance is often <10% of peak

Performance (mat-vec product) (Slide 19 / 32)
• [Figure: performance of mat-vec products of various sizes on a 2.4 GHz Pentium 4]
• An example from Gahvari et al.: http://bebop.cs.berkeley.edu/pubs/gahvari2007-spmvbench-spec.pdf

Performance (mat-vec product) (Slide 20 / 32)
• How to improve the performance?
  – A common technique (as done for dense linear algebra) is blocking (register, cache: next ...)
  – Index reordering (in Part III)
  – Exploiting special matrix structure (e.g., symmetry, bands, other structures)

Block Compressed Row Storage (BCRS) (Slide 21 / 32)
• Example of using 2x2 blocks: nonzeros are grouped into dense 2x2 blocks, stored with block row pointers (BI), block column indices (BJ), and block values (AIJ)
• Reduced storage for indexes (one index per block instead of one per nonzero)
• Drawback: explicit 0s must be added to fill partially empty blocks
• What block size to choose?
• BCRS enables register blocking
• Discussion?

BCRS (Slide 22 / 32)
• [Figure: BCRS performance illustration]

Cache blocking (Slide 23 / 32)
• Improve cache reuse for x in Ax by splitting A into a set of sparse matrices
• [Figure: a sparse matrix and its splitting]
• For more info check: "SPARSITY: An Optimization Framework for Sparse Matrix Kernels", Eun-Jin Im, K. Yelick, R. Vuduc, International Journal of High Performance Computing Applications, 18 (1), pp. 135-158, February 2004.
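
As a concrete illustration of the register-blocking idea behind BCRS, the sketch below is a plain C mat-vec kernel for a block-compressed (BCSR) matrix with r x c dense blocks. It is not the implementation behind the figures above, just a minimal 0-based sketch; the function and parameter names (bcsr_matvec, brow_ptr, bcol_ind, vals) are illustrative. The small fixed-trip-count loops over each block are what a compiler can unroll and keep in registers, and each block index now covers r*c values, which is where both the index-storage savings and the explicit padding zeros come from.

    /* Sketch: y += A*x for a matrix in Block CRS (BCSR) with r-by-c dense
     * blocks, each stored row-major in vals.  brow_ptr has nbrows+1 entries
     * (0-based); bcol_ind holds one block-column index per block. */
    void bcsr_matvec(int nbrows, int r, int c,
                     const int *brow_ptr, const int *bcol_ind,
                     const double *vals, const double *x, double *y)
    {
        for (int bi = 0; bi < nbrows; bi++) {
            for (int k = brow_ptr[bi]; k < brow_ptr[bi + 1]; k++) {
                const double *blk = vals + (long)k * r * c;      /* dense r x c block    */
                const double *xb  = x + (long)bcol_ind[k] * c;   /* part of x it touches */
                double       *yb  = y + (long)bi * r;            /* part of y it updates */
                for (int i = 0; i < r; i++)
                    for (int j = 0; j < c; j++)
                        yb[i] += blk[i * c + j] * xb[j];
            }
        }
    }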
Part III: Reordering algorithms and parallelization (Slide 24 / 32)

Reorder to preserve locality (Slide 25 / 32)
• e.g., Cuthill-McKee ordering: start from an arbitrary node, say '10', and reorder
  * '10' becomes 0
  * its neighbours are ordered next to become 1, 2, 3, 4, 5; denote this as level 1
  * neighbours of level-1 nodes are next consecutively reordered, and so on until the end

Cuthill-McKee Ordering (Slide 26 / 32)
• Reversing the ordering (RCM) results in an ordering that is better for sparse LU
• Reduces matrix bandwidth (see example)
• Improves cache performance
• Can be used as a partitioner (parallelization, e.g. into p1, p2, p3, p4), but in general does not reduce the edge cut

Self-Avoiding Walks (SAW) (Slide 27 / 32)
• Enumeration of mesh elements through 'consecutive elements' (sharing a face, edge, vertex, etc.)
  * similar to space-filling curves, but for unstructured meshes
  * improves cache reuse
  * can be used as a partitioner with good load balance, but in general does not reduce the edge cut

Graph partitioning (Slide 28 / 32)
• Refer back to Lecture #9, Part II: Mesh Generation and Load Balancing
• Can be used for reordering
• Metis/ParMetis:
  – multilevel partitioning
  – good load balance and minimized edge cut

Parallel Mat-Vec Product (Slide 29 / 32)
• Easiest way: 1-D partitioning (processes p1, p2, p3, p4 each own a block of rows)
  – May lead to load imbalance (why?)
  – May need a lot of communication for x
• Can use any of the just-mentioned techniques
• Most promising seem to be spectral and multilevel methods (as in Metis/ParMetis)

Possible optimizations (Slide 30 / 32)
• Block communication
  – To send only the minimum required part of x
  – e.g., pre-compute blocks of interfaces
• Load balance, minimize edge cut
  – e.g., a good partitioner would do it
• Reordering
• Taking advantage of additional structure (symmetry, bands, etc.)

Comparison (Slide 31 / 32)
• Distributed memory implementation (by X. Li, L. Oliker, G. Heber, R. Biswas)
  – The ORIG ordering has a large edge cut (interprocessor communication) and poor locality (high number of cache misses)
  – MeTiS minimizes the edge cut, while SAW minimizes cache misses

Learning Goals (Slide 32 / 32)
• Efficient sparse computations are challenging!
• Computational challenges and issues related to sparse matrices
  – Data formats
  – Optimization
    • Blocking
    • Reordering
    • Other
• Parallel sparse Mat-Vec product
  – Code optimization opportunities
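
To tie Part III together, here is a minimal MPI sketch (not from the slides) of the 'easiest way' from Slide 29: a 1-D row partitioning of a CRS matrix where each process gathers the full x before its local mat-vec. All names (local_csr_matvec, parallel_matvec, counts, displs) are illustrative assumptions. The MPI_Allgatherv step is exactly the 'a lot of communication for x' issue noted on Slide 29; the blocked-communication optimization on Slide 30 would replace it with point-to-point exchanges of interface entries only.

    #include <mpi.h>

    /* Local CRS mat-vec: y_local = A_local * x, where A_local holds this
     * process's block of rows (0-based CRS arrays) and x is the full vector. */
    static void local_csr_matvec(int nrows_local, const int *rowptr,
                                 const int *colind, const double *val,
                                 const double *x, double *y_local)
    {
        for (int i = 0; i < nrows_local; i++) {
            double s = 0.0;
            for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
                s += val[j] * x[colind[j]];   /* indirect access into the full x */
            y_local[i] = s;
        }
    }

    /* Sketch of the parallel product with 1-D row partitioning: every process
     * contributes its local piece of x, receives the full x via MPI_Allgatherv,
     * then applies its own rows.  counts/displs give the number of rows (and
     * hence entries of x) owned by each process and their offsets. */
    void parallel_matvec(int nrows_local, const int *rowptr, const int *colind,
                         const double *val, const double *x_local,
                         const int *counts, const int *displs,
                         double *x_full, double *y_local, MPI_Comm comm)
    {
        /* Simple but potentially communication-heavy: gather all of x. */
        MPI_Allgatherv(x_local, nrows_local, MPI_DOUBLE,
                       x_full, counts, displs, MPI_DOUBLE, comm);
        local_csr_matvec(nrows_local, rowptr, colind, val, x_full, y_local);
    }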
