Iterative Methods in Linear Algebra
Azzam Haidar
Innovative Computing Laboratory Computer Science Department The University of Tennessee
Wednesday April 18, 2018
COSC 594, 04-18-2018

Outline
Part I
  Discussion
  Motivation for iterative methods
Part II
  Stationary Iterative Methods
Part III
  Iterative Refinement Methods
Part IV
  Nonstationary Iterative Methods
Part I
Discussion
Original matrix:
  >> load matrix.output
  >> S = spconvert(matrix);
  >> spy(S)

Reordered matrix:
  >> load A.data
  >> S = spconvert(A);
  >> spy(S)
Discussion
Original sub-matrix:
  >> load matrix.output
  >> S = spconvert(matrix);
  >> spy(S(1:8660,1:8660))

Reordered sub-matrix:
  >> load A.data
  >> S = spconvert(A);
  >> spy(S(8602:8700,8602:8700))
Discussion
Results on torc0 (Pentium III 933 MHz, 16 KB L1 and 256 KB L2 cache):
• Original matrix: 42.4568 Mflop/s
• Reordered matrix: 45.3954 Mflop/s
• BCRS: 72.17374 Mflop/s

Results on woodstock (Intel Xeon 5160 Woodcrest 3.0 GHz, L1 32 KB, L2 4 MB):
• Original matrix: 386.8 Mflop/s
• Reordered matrix: 387.6 Mflop/s
• BCRS: 894.2 Mflop/s
Iterative Methods

Motivation
So far we have discussed and have seen:
• Many engineering and physics simulations lead to sparse matrices, e.g. PDE-based simulations and their discretizations based on FEM/FVM, finite differences, Petrov-Galerkin type of conditions, etc.
• How to optimize performance on sparse matrix computations
• Some software for solving sparse matrix problems (e.g. PETSc)

The question is: can we solve sparse matrix problems faster than using direct sparse methods?

In certain cases yes: using iterative methods.
Sparse direct methods

Sparse direct methods
• These are based on Gaussian elimination (LU)
• Performed in sparse format

Are there limitations of the approach? Yes, they have fill-ins, which lead to
• more memory requirements
• more flops being performed

Fill-ins can become prohibitively high.
Sparse direct methods

Consider LU for a matrix in which a nonzero is represented by a * and a fill-in is marked by an F: the 1st step of the LU factorization will introduce fill-ins.
Sparse direct methods

Fill-ins can be reduced by reordering. Remember: we talked about it in a slightly different context (for speed and parallelization). Consider the following reordering
These were extreme cases but still, the problem exists
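The matrix figures for these two slides did not survive, so the following is only a sketch under an assumption: it uses an "arrow" pattern (full diagonal plus one dense row and column), a common textbook example of an extreme case. The helper names `lu_fill_in` and `arrow` are hypothetical. The sketch counts the fill-ins produced by symbolic LU without pivoting when the dense row and column come first, and again when they are reordered to come last.

```python
def lu_fill_in(pattern):
    """Count fill-ins created by symbolic LU (no pivoting) on a boolean pattern."""
    n = len(pattern)
    nz = [row[:] for row in pattern]  # work on a copy of the nonzero pattern
    fill = 0
    for k in range(n):
        for i in range(k + 1, n):
            if not nz[i][k]:
                continue  # row i is not touched when eliminating column k
            for j in range(k + 1, n):
                if nz[k][j] and not nz[i][j]:
                    nz[i][j] = True  # a zero entry becomes nonzero: a fill-in
                    fill += 1
    return fill


def arrow(n, dense_first):
    """Arrow pattern: full diagonal plus one dense row and one dense column."""
    d = 0 if dense_first else n - 1
    return [[i == j or i == d or j == d for j in range(n)] for i in range(n)]


n = 6
fill_bad = lu_fill_in(arrow(n, dense_first=True))    # dense row/column eliminated first
fill_good = lu_fill_in(arrow(n, dense_first=False))  # dense row/column ordered last
print(fill_bad, fill_good)  # 20 fill-ins vs 0
```

With the dense row and column first, the very first elimination step fills the whole trailing block; with them last, no fill-in appears at all.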
Sparse methods

Sparse direct vs dense methods
• Dense takes $O(n^2)$ storage and $O(n^3)$ flops, and runs close to peak performance
• Sparse direct can take O(#nonz) storage and O(#nonz) flops, but these can also grow due to fill-ins, and performance is bad
• With #nonz << $n^2$ and a 'proper' ordering it pays off to do sparse direct

Software (table from Julien Langou)
http://www.netlib.org/utk/people/JackDongarra/la-sw.html
Package | Support | Type: Real, Complex | Language: f77, c, c++ | Mode: Seq, Dist | SPD | Gen
DSCPACK   yes   X X X M X
HSL       yes   X X X X X X
MFACT     yes   X X X X
MUMPS     yes   X X X X X M X X
PSPASES   yes   X X X M X
SPARSE    ?     X X X X X X
SPOOLES   ?     X X X X M X
SuperLU   yes   X X X X X M X
TAUCS     yes   X X X X X X
UMFPACK   yes   X X X X X
Y12M      ?     X X X X X
Sparse methods

What about iterative methods? Think, for example, of

$$x_{i+1} = x_i + P(b - A x_i)$$

Pluses:
• Storage is only O(#nonz) (for the matrix and a few working vectors)
• Can take only a few iterations to converge (e.g. << n); for $P = A^{-1}$ it takes 1 iteration (check)!
• In general easy to parallelize
Minuses:
• Performance can be bad (as we saw previously)
• Convergence can be slow or even stagnate
• But can be improved with preconditioning: think of P as a preconditioner, an operator/matrix $P \approx A^{-1}$
• Optimal preconditioners (e.g. multigrid can be) lead to convergence in O(1) iterations
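As a quick check of the "1 iteration for $P = A^{-1}$" remark, here is a minimal sketch of the iteration $x_{i+1} = x_i + P(b - A x_i)$. The 2x2 system, the tolerance, and the helper names `matvec` and `iterate` are assumptions made for illustration, not taken from the slides. The sketch compares the exact inverse with a much cruder preconditioner, the inverse of the diagonal.

```python
def matvec(M, v):
    """Dense matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def iterate(A, b, P, tol=1e-12, max_iter=1000):
    """Run x <- x + P (b - A x) from x = 0; return (x, iterations used)."""
    x = [0.0] * len(b)
    for it in range(1, max_iter + 1):
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        x = [xi + zi for xi, zi in zip(x, matvec(P, r))]
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        if max(abs(ri) for ri in r) < tol:
            return x, it
    return x, max_iter


# Hypothetical 2x2 SPD example; det(A) = 11, so the inverse is known in closed form.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
P_exact = [[3.0 / 11.0, -1.0 / 11.0], [-1.0 / 11.0, 4.0 / 11.0]]  # P = A^{-1}
P_diag = [[1.0 / 4.0, 0.0], [0.0, 1.0 / 3.0]]                     # P = D^{-1}

x_exact, it_exact = iterate(A, b, P_exact)
x_diag, it_diag = iterate(A, b, P_diag)
print("P = A^-1:", it_exact, "iteration(s);  P = D^-1:", it_diag, "iterations")
```

Both runs reach the same solution; the better P approximates $A^{-1}$, the fewer iterations are needed.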
Part II
Stationary Iterative Methods
Can be expressed in the form

$$x_{i+1} = B x_i + c$$

where B and c do not depend on i
• older, simpler, easy to implement, but usually not as effective (as nonstationary)
• examples: Richardson, Jacobi, Gauss-Seidel, SOR, etc. (next)
Richardson Method
Richardson iteration
$$x_{i+1} = x_i + (b - A x_i) = (I - A) x_i + b \qquad (1)$$

i.e. B from the definition above is $B = I - A$.

Denote $e_i = x - x_i$ and rewrite (1), using $b = Ax$:

$$x - x_{i+1} = x - x_i - (Ax - A x_i)$$

$$e_{i+1} = e_i - A e_i = (I - A) e_i$$

$$\|e_{i+1}\| = \|(I - A) e_i\| \le \|I - A\| \, \|e_i\| \le \|I - A\|^2 \, \|e_{i-1}\| \le \cdots \le \|I - A\|^{i+1} \, \|e_0\|$$

i.e. for convergence ($e_i \to 0$) we need

$$\|I - A\| < 1$$

for some norm $\|\cdot\|$, e.g. when $\rho(B) < 1$.

Jacobi Method

$$x_{i+1} = x_i + D^{-1}(b - A x_i) = (I - D^{-1}A) x_i + D^{-1} b \qquad (2)$$

where D is the diagonal of A (assumed nonzero; $B = I - D^{-1}A$ from the definition).

Denote $e_i = x - x_i$ and rewrite (2):

$$x - x_{i+1} = x - x_i - D^{-1}(Ax - A x_i)$$

$$e_{i+1} = e_i - D^{-1}A e_i = (I - D^{-1}A) e_i$$

$$\|e_{i+1}\| = \|(I - D^{-1}A) e_i\| \le \|I - D^{-1}A\| \, \|e_i\| \le \|I - D^{-1}A\|^2 \, \|e_{i-1}\| \le \cdots \le \|I - D^{-1}A\|^{i+1} \, \|e_0\|$$

i.e. for convergence ($e_i \to 0$) we need

$$\|I - D^{-1}A\| < 1$$

for some norm $\|\cdot\|$, e.g. when $\rho(I - D^{-1}A) < 1$.
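The Jacobi convergence condition can be checked numerically. In the infinity norm, $\|I - D^{-1}A\|_\infty = \max_i \sum_{j \ne i} |a_{ij}| / |a_{ii}|$, which is below 1 exactly when A is strictly diagonally dominant by rows. The sketch below uses a hypothetical 3x3 matrix and a hypothetical exact solution (neither is from the slides) and records the error after each Jacobi step, so that the bound $\|e_{i+1}\| \le \|I - D^{-1}A\| \, \|e_i\|$ can be observed.

```python
def jacobi_iteration_norm(A):
    """Infinity norm of B = I - D^{-1} A: max_i sum_{j != i} |a_ij| / |a_ii|."""
    return max(
        sum(abs(a) for j, a in enumerate(row) if j != i) / abs(row[i])
        for i, row in enumerate(A)
    )


def jacobi_step(A, b, x):
    """One Jacobi update x_{i+1} = x_i + D^{-1} (b - A x_i)."""
    return [
        x[i] + (b[i] - sum(a * xj for a, xj in zip(row, x))) / row[i]
        for i, row in enumerate(A)
    ]


# Hypothetical strictly diagonally dominant example.
A = [[4.0, 1.0, 1.0], [1.0, 5.0, 2.0], [0.0, 1.0, 3.0]]
x_exact = [1.0, 2.0, 3.0]
b = [sum(a * xj for a, xj in zip(row, x_exact)) for row in A]  # b = A x_exact

norm_B = jacobi_iteration_norm(A)
x = [0.0, 0.0, 0.0]
errors = []  # infinity norm of e_i = x_exact - x_i after each step
for _ in range(10):
    x = jacobi_step(A, b, x)
    errors.append(max(abs(xe - xi) for xe, xi in zip(x_exact, x)))

print("||I - D^-1 A||_inf =", norm_B)
print("errors:", errors)
```

Each recorded error is at most `norm_B` times the previous one, which is the linear convergence derived above.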
Gauss-Seidel Method
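$$x_{i+1} = (D - L)^{-1}(U x_i + b) \qquad (3)$$

where $A = D - L - U$: D is the diagonal of A, and $-L$ and $-U$ are the strictly lower and strictly upper triangular parts of A.

Equivalently:

$$\begin{cases}
x_1^{(i+1)} = (b_1 - a_{12}x_2^{(i)} - a_{13}x_3^{(i)} - a_{14}x_4^{(i)} - \dots - a_{1n}x_n^{(i)})/a_{11} \\
x_2^{(i+1)} = (b_2 - a_{21}x_1^{(i+1)} - a_{23}x_3^{(i)} - a_{24}x_4^{(i)} - \dots - a_{2n}x_n^{(i)})/a_{22} \\
x_3^{(i+1)} = (b_3 - a_{31}x_1^{(i+1)} - a_{32}x_2^{(i+1)} - a_{34}x_4^{(i)} - \dots - a_{3n}x_n^{(i)})/a_{33} \\
\quad \vdots \\
x_n^{(i+1)} = (b_n - a_{n1}x_1^{(i+1)} - a_{n2}x_2^{(i+1)} - a_{n3}x_3^{(i+1)} - \dots - a_{n,n-1}x_{n-1}^{(i+1)})/a_{nn}
\end{cases}$$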
A comparison (from Julien Langou slides)

Gauss-Seidel method

$$A = \begin{pmatrix} 10 & 1 & 2 & 0 & 1 \\ 0 & 12 & 1 & 3 & 1 \\ 1 & 2 & 9 & 1 & 0 \\ 0 & 3 & 1 & 10 & 0 \\ 1 & 2 & 0 & 0 & 15 \end{pmatrix}, \qquad b = \begin{pmatrix} 23 \\ 44 \\ 36 \\ 49 \\ 80 \end{pmatrix}$$

[Plot: $\|b - Ax^{(i)}\|_2 / \|b\|_2$ as a function of i, the number of iterations]

x(0)     x(1)     x(2)     x(3)     x(4)     ...
0.0000   2.3000   0.8783   0.9790   0.9978   1.0000
0.0000   3.6667   2.1548   2.0107   2.0005   2.0000
0.0000   2.9296   3.0339   3.0055   3.0006   3.0000
0.0000   3.5070   3.9502   3.9962   3.9998   4.0000
0.0000   4.6911   4.9875   5.0000   5.0001   5.0000
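The iterates in the table can be reproduced with a few lines of code. This is a minimal sketch using the 5x5 system from this slide, with entries read so that $Ax = b$ holds for $x = (1, 2, 3, 4, 5)$; the function name `gauss_seidel_sweep` is chosen here for illustration.

```python
def gauss_seidel_sweep(A, b, x):
    """One Gauss-Seidel sweep; x is updated in place, so newer values are used at once."""
    n = len(b)
    for k in range(n):
        s = sum(A[k][j] * x[j] for j in range(n) if j != k)
        x[k] = (b[k] - s) / A[k][k]
    return x


A = [
    [10.0, 1.0, 2.0, 0.0, 1.0],
    [0.0, 12.0, 1.0, 3.0, 1.0],
    [1.0, 2.0, 9.0, 1.0, 0.0],
    [0.0, 3.0, 1.0, 10.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 15.0],
]
b = [23.0, 44.0, 36.0, 49.0, 80.0]

x = [0.0] * 5          # x^(0)
history = []           # x^(1), x^(2), ...
for _ in range(4):
    gauss_seidel_sweep(A, b, x)
    history.append(x[:])

for i, xi in enumerate(history, start=1):
    print("x(%d) =" % i, ["%.4f" % v for v in xi])
```

The first sweep gives (2.3000, 3.6667, 2.9296, 3.5070, 4.6911), matching the x(1) column.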
A comparison (from Julien Langou slides)

Jacobi method:

$$\begin{cases}
x_1^{(i+1)} = (b_1 - a_{12}x_2^{(i)} - a_{13}x_3^{(i)} - a_{14}x_4^{(i)} - \dots - a_{1n}x_n^{(i)})/a_{11} \\
x_2^{(i+1)} = (b_2 - a_{21}x_1^{(i)} - a_{23}x_3^{(i)} - a_{24}x_4^{(i)} - \dots - a_{2n}x_n^{(i)})/a_{22} \\
x_3^{(i+1)} = (b_3 - a_{31}x_1^{(i)} - a_{32}x_2^{(i)} - a_{34}x_4^{(i)} - \dots - a_{3n}x_n^{(i)})/a_{33} \\
\quad \vdots \\
x_n^{(i+1)} = (b_n - a_{n1}x_1^{(i)} - a_{n2}x_2^{(i)} - a_{n3}x_3^{(i)} - \dots - a_{n,n-1}x_{n-1}^{(i)})/a_{nn}
\end{cases}$$

Gauss-Seidel method:

$$\begin{cases}
x_1^{(i+1)} = (b_1 - a_{12}x_2^{(i)} - a_{13}x_3^{(i)} - a_{14}x_4^{(i)} - \dots - a_{1n}x_n^{(i)})/a_{11} \\
x_2^{(i+1)} = (b_2 - a_{21}x_1^{(i+1)} - a_{23}x_3^{(i)} - a_{24}x_4^{(i)} - \dots - a_{2n}x_n^{(i)})/a_{22} \\
x_3^{(i+1)} = (b_3 - a_{31}x_1^{(i+1)} - a_{32}x_2^{(i+1)} - a_{34}x_4^{(i)} - \dots - a_{3n}x_n^{(i)})/a_{33} \\
\quad \vdots \\
x_n^{(i+1)} = (b_n - a_{n1}x_1^{(i+1)} - a_{n2}x_2^{(i+1)} - a_{n3}x_3^{(i+1)} - \dots - a_{n,n-1}x_{n-1}^{(i+1)})/a_{nn}
\end{cases}$$

Gauss-Seidel is the method with better numerical properties (fewer iterations to convergence). Which method is more efficient to implement on a sequential or parallel computer? Why?

In Gauss-Seidel, the computation of $x_{k+1}^{(i+1)}$ requires the knowledge of $x_k^{(i+1)}$, so the sweep is inherently sequential and cannot be parallelized as written.
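To make the trade-off concrete, the sketch below counts sweeps for both methods on the 5x5 system from the earlier comparison slide. The stopping tolerance of 1e-8 on the relative residual and the helper names are assumptions chosen for illustration. Note how the Jacobi sweep builds every component from the previous iterate only (the components are independent), while the Gauss-Seidel sweep must walk through the components in order.

```python
import math

A = [
    [10.0, 1.0, 2.0, 0.0, 1.0],
    [0.0, 12.0, 1.0, 3.0, 1.0],
    [1.0, 2.0, 9.0, 1.0, 0.0],
    [0.0, 3.0, 1.0, 10.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 15.0],
]
b = [23.0, 44.0, 36.0, 49.0, 80.0]


def rel_residual(x):
    """||b - A x||_2 / ||b||_2."""
    r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
    return math.sqrt(sum(v * v for v in r)) / math.sqrt(sum(v * v for v in b))


def jacobi_sweep(x):
    """Every component uses only the previous iterate: independent updates."""
    n = len(b)
    return [(b[k] - sum(A[k][j] * x[j] for j in range(n) if j != k)) / A[k][k]
            for k in range(n)]


def gauss_seidel_sweep(x):
    """Component k uses the already updated components 0..k-1: sequential updates."""
    x = x[:]
    n = len(b)
    for k in range(n):
        x[k] = (b[k] - sum(A[k][j] * x[j] for j in range(n) if j != k)) / A[k][k]
    return x


def count_sweeps(sweep, tol=1e-8, max_sweeps=500):
    """Number of sweeps from x = 0 until the relative residual drops below tol."""
    x = [0.0] * len(b)
    for it in range(1, max_sweeps + 1):
        x = sweep(x)
        if rel_residual(x) < tol:
            return it
    return max_sweeps


jacobi_count = count_sweeps(jacobi_sweep)
gs_count = count_sweeps(gauss_seidel_sweep)
print("Jacobi sweeps:", jacobi_count, " Gauss-Seidel sweeps:", gs_count)
```

On this system Gauss-Seidel needs fewer sweeps, but each Jacobi sweep could be distributed over the components without any change to the algorithm.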
Convergence

Convergence can be very slow. Consider a modified Richardson method:

$$x_{i+1} = x_i + \tau (b - A x_i) \qquad (4)$$

Convergence is linear; similarly to Richardson we get

$$\|e_{i+1}\| \le \|I - \tau A\|^{i+1} \, \|e_0\|$$

but it can be very slow (if $\|I - \tau A\|$ is close to 1), e.g. let
• A be symmetric and positive definite (SPD), with smallest eigenvalue $\lambda_1$ and largest eigenvalue $\lambda_N$
• the matrix norm in $\|I - \tau A\|$ be induced by $\|\cdot\|_2$

Then the best choice for $\tau$, $\tau = \frac{2}{\lambda_1 + \lambda_N}$, would give

$$\|I - \tau A\| = \frac{k(A) - 1}{k(A) + 1}$$

where $k(A) = \frac{\lambda_N}{\lambda_1}$ is the condition number of A.
Convergence

Note:
• The rate of convergence depends on the condition number k(A)
• Even for the best $\tau$ the rate of convergence

$$\frac{k(A) - 1}{k(A) + 1}$$

is slow (i.e. close to 1) for large k(A)
• We will see that the convergence of nonstationary methods also depends on k(A) but is better, e.g. compare with CG

$$\frac{\sqrt{k(A)} - 1}{\sqrt{k(A)} + 1}$$
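To get a feeling for the two rates, the following sketch estimates how many iterations each factor needs to reduce the error by $10^{-6}$. It treats both expressions as plain per-iteration reduction factors and ignores constants and the choice of norm, which is a simplification; the function name `iterations_needed` and the sample condition numbers are illustrative.

```python
import math


def iterations_needed(rate, reduction=1e-6):
    """Smallest integer i with rate**i <= reduction, for 0 < rate < 1."""
    return math.ceil(math.log(reduction) / math.log(rate))


for kappa in (10.0, 100.0, 10000.0):
    richardson_rate = (kappa - 1.0) / (kappa + 1.0)
    cg_rate = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
    print(
        "k(A) = %8.0f   Richardson: rate %.4f, %6d iterations   CG: rate %.4f, %4d iterations"
        % (kappa, richardson_rate, iterations_needed(richardson_rate),
           cg_rate, iterations_needed(cg_rate))
    )
```

For k(A) = 100 this gives roughly 691 iterations for the best modified Richardson against roughly 69 for the CG-type rate: the count grows like k(A) in the first case and like $\sqrt{k(A)}$ in the second.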
Part III
Iterative Refinement Methods
Mixed-Precision Solvers

Dense Linear Algebra (DLA) is needed in a wide variety of science and engineering applications:
• Linear systems: Solve Ax = b
  • Computational electromagnetics, material science, applications using boundary integral equations, airflow past wings, fluid flow around ship and other offshore constructions, and many more
• Least squares: Find x to minimize || Ax - b ||
  • Computational statistics (e.g., linear least squares or ordinary least squares), econometrics, control theory, signal processing, curve fitting, and many more
• Eigenproblems: Solve Ax = λx
  • Computational chemistry, quantum mechanics, material science, face recognition, PCA, data-mining, marketing, Google Page Rank, spectral clustering, vibrational analysis, compression, and many more
• SVD: A = U Σ V* (Av = σu and A*u = σv)
  • Information retrieval, web search, signal processing, big data analytics, low rank matrix approximation, total least squares minimization, pseudo-inverse, and many more
• Many variations depending on structure of A
  • A can be symmetric, positive definite, tridiagonal, Hessenberg, banded, sparse with dense blocks, etc.
• DLA is crucial to the development of sparse solvers

Provided in MAGMA 2.3
http://icl.cs.utk.edu/magma
https://bitbucket.org/icl/magma

Leveraging Half Precision in HPC on V100

Solving linear system Ax = b
• LU factorization is used to solve a linear system Ax = b:
  A x = b
  LU factorization: LUx = b
  Ly = b, then Ux = y

Solving linear system Ax = b
For s = 0, nb, ..., N
  1. panel factorize
  2. update trailing matrix

LU factorization requires $O(n^3)$ flops; most of the operations are spent in GEMM.

[Figure: blocked LU factorization over steps 1 to 4, with labels Panel (width nb), TRSM, GEMM, L, U, panel, update]
Motivation

Study of the matrix-matrix multiplication kernel on Nvidia V100
• dgemm achieves about 6.4 Tflop/s
• sgemm achieves about 14 Tflop/s
• hgemm achieves about 27 Tflop/s
• Tensor cores gemm reaches about 85 Tflop/s

Rank-k GEMM needed by LU does not perform as well as square but is still OK.

[Figure: Matrix matrix multiplication GEMM; Tflop/s vs m = n (2k to 30k); curves: FP16 TC, FP16, FP32, and FP64, each for square matrices and for k = 256]

Motivation

Study of the LU factorization algorithm on Nvidia V100
• LU factorization is used to solve a linear system Ax = b: LUx = b, i.e. Ly = b, then Ux = y

[Figure: LU factorization; Tflop/s vs matrix size (2k to 34k); curves: FP16-TC (Tensor Cores) hgetrf LU, FP16 hgetrf LU, FP32 sgetrf LU, FP64 dgetrf LU; annotation: 3~4X]
Use Mixed Precision algorithms

Idea: use lower precision to compute the expensive flops (LU, $O(n^3)$) and then iteratively refine the solution in order to achieve FP64 accuracy.
• Achieve higher performance → faster time to solution
• Reduce power consumption by decreasing the execution time → energy savings

Iterative Refinement
Iterative refinement for dense systems, Ax = b, can work this way:

  L U = lu(A)        lower precision    O(n^3)
  x = U\(L\b)        lower precision    O(n^2)
  r = b - Ax         FP64 precision     O(n^2)

  WHILE || r || not small enough
    1. find a correction "z" to adjust x that satisfies Az = r
       solving Az = r could be done by either:
       • z = U\(L\r)  (Classical Iterative Refinement)    lower precision    O(n^2)
       • GMRes preconditioned by the LU to solve Az = r  (Iterative Refinement using GMRes)    lower precision    O(n^2)
    2. x = x + z     FP64 precision     O(n)
    3. r = b - Ax    FP64 precision     O(n^2)
  END
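The classical variant of the loop above can be sketched in plain Python. Several things here are assumptions made for illustration rather than the implementation behind these slides: FP32 stands in for "lower precision" and is emulated by rounding through `struct`; the LU factorization uses no pivoting (acceptable for the diagonally dominant test matrix, which is the 5x5 system from the Gauss-Seidel comparison); and the names `f32`, `lu_low`, `solve_low`, and `refine` are hypothetical.

```python
import struct


def f32(v):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def lu_low(A):
    """LU without pivoting, every operation rounded to FP32 (assumes nonzero pivots)."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])          # multiplier, stored in L
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU


def solve_low(LU, rhs):
    """Forward then backward substitution with the FP32 factors, rounded to FP32."""
    n = len(rhs)
    y = [f32(v) for v in rhs]
    for i in range(n):                                   # L y = rhs (unit diagonal)
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    for i in reversed(range(n)):                         # U z = y
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
        y[i] = f32(y[i] / LU[i][i])
    return y


def refine(A, b, tol=1e-14, max_iter=20):
    """Classical iterative refinement: FP32 factorization, FP64 residual and update."""
    LU = lu_low(A)                                       # lower precision, O(n^3)
    x = solve_low(LU, b)                                 # lower precision, O(n^2)
    for it in range(max_iter):
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]  # FP64
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            return x, it
        z = solve_low(LU, r)                             # correction, lower precision
        x = [xi + zi for xi, zi in zip(x, z)]            # FP64 update
    return x, max_iter


A = [
    [10.0, 1.0, 2.0, 0.0, 1.0],
    [0.0, 12.0, 1.0, 3.0, 1.0],
    [1.0, 2.0, 9.0, 1.0, 0.0],
    [0.0, 3.0, 1.0, 10.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 15.0],
]
b = [23.0, 44.0, 36.0, 49.0, 80.0]

x, steps = refine(A, b)
print("refinement steps:", steps)
print("x =", x)
```

The expensive factorization is done once in the lower precision; only the cheap residual and update steps run in FP64, yet the returned solution is accurate to FP64 level for this well-conditioned system.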
• Wilkinson, Moler, Stewart, and Higham provide error bounds for single-precision floating-point results when using double-precision floating point.
• It can be shown that using this approach we can compute the solution to 64-bit floating point precision.

Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers, ScalA17, November 12–17, 2017, Denver, CO, USA
[Figure: performance (Tflop/s) and number of iterations vs matrix size (2k to 34k) for FP64 dgesv, FP32->64 dsgesv, and FP16->64 dhgesv, on four matrix types with cond = 100:
(a) Performance_rand_dominant_cond_100: diagonally dominant matrix.
(b) Performance_poev_logrand_cond_100: matrix with positive $\lambda$, where $\sigma_i$ is a random number between $1/\mathrm{cond}$ and 1 such that their logarithms are uniformly distributed.
(c) Performance_poev_cluster2_cond_100: matrix with positive $\lambda$ and with clustered singular values, $\sigma_i = (1, \cdots, 1, 1/\mathrm{cond})$.
(d) Performance_svd_cluster2_cond_100: matrix with clustered singular values, $\sigma_i = (1, \cdots, 1, 1/\mathrm{cond})$.]

Leveraging FP16 in HPC on V100: Numerical Behavior