Iterative Methods in Linear Algebra

Azzam Haidar

Innovative Computing Laboratory Computer Science Department The University of Tennessee

Wednesday April 18, 2018

Outline

Part I: Discussion, motivation for iterative methods
Part II: Stationary Iterative Methods
Part III: Iterative Refinement Methods
Part IV: Nonstationary Iterative Methods

Part I

Discussion

Discussion

Original matrix:
>> load matrix.output
>> S = spconvert(matrix);
>> spy(S)

Reordered matrix:
>> load A.data
>> S = spconvert(A);
>> spy(S)

Discussion

Original sub-matrix:
>> load matrix.output
>> S = spconvert(matrix);
>> spy(S(1:8660,1:8660))

Reordered sub-matrix:
>> load A.data
>> S = spconvert(A);
>> spy(S(8602:8700,8602:8700))
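For readers working in Python rather than MATLAB, the same visualization can be done with SciPy and Matplotlib. This is only a sketch: it assumes the files hold (row, column, value) triplets with 1-based indices, which is the format MATLAB's spconvert expects.

# Sketch: plot the sparsity pattern of a matrix stored as (row, col, value) triplets,
# the same format MATLAB's spconvert() reads. Assumes 1-based indices as in MATLAB.
import numpy as np
import scipy.sparse as sp
import matplotlib.pyplot as plt

triplets = np.loadtxt("matrix.output")          # columns: row index, col index, value
rows = triplets[:, 0].astype(int) - 1           # convert to 0-based indexing
cols = triplets[:, 1].astype(int) - 1
vals = triplets[:, 2]
S = sp.coo_matrix((vals, (rows, cols))).tocsr() # equivalent of S = spconvert(matrix)

plt.spy(S, markersize=0.2)                      # equivalent of spy(S)
plt.show()

# Zoom into a sub-matrix, as in the spy(S(8602:8700,8602:8700)) call above
# (MATLAB indices are 1-based, Python's are 0-based):
# plt.spy(S[8601:8700, 8601:8700].toarray()); plt.show()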

Discussion

Results on torc0: Pentium III 933 MHz, 16 KB L1 and 256 KB L2 cache
Results on woodstock: Intel Xeon 5160 Woodcrest, 3.0 GHz, 32 KB L1, 4 MB L2 cache

torc0:     Original matrix: 42.4568 Mflop/s   Reordered matrix: 45.3954 Mflop/s   BCRS: 72.17374 Mflop/s
woodstock: Original matrix: 386.8 Mflop/s     Reordered matrix: 387.6 Mflop/s     BCRS: 894.2 Mflop/s
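The BCRS numbers above come from storing the matrix in a block-compressed format, so that the inner loops of the matrix-vector product work on small dense blocks. A rough SciPy sketch of that kind of comparison follows; the matrix size, density, and 4x4 block size are made-up illustration values, not the ones from the experiment.

# Sketch: compare sparse matrix-vector product speed in CSR vs block CSR (BSR) storage.
# The matrix is built to have genuine 4x4 dense blocks so the block format pays off.
import time
import numpy as np
import scipy.sparse as sp

block = 4
pattern = sp.random(2000, 2000, density=0.002, format="csr", random_state=0)
A_csr = sp.kron(pattern, np.ones((block, block)), format="csr")   # 8000x8000, block-sparse
A_bsr = A_csr.tobsr(blocksize=(block, block))
x = np.random.default_rng(0).standard_normal(A_csr.shape[1])

def mflops(A, x, reps=50):
    t0 = time.perf_counter()
    for _ in range(reps):
        y = A @ x
    dt = (time.perf_counter() - t0) / reps
    return 2 * A.nnz / dt / 1e6          # 2 flops per stored nonzero

print("CSR:", round(mflops(A_csr, x)), "Mflop/s")
print("BSR:", round(mflops(A_bsr, x)), "Mflop/s")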

Iterative Methods

Motivation

So far we have discussed and have seen:
- Many engineering and physics simulations lead to sparse matrices, e.g. PDE-based simulations and their discretizations based on FEM/FVM, finite differences, Petrov-Galerkin type conditions, etc.
- How to optimize the performance of the computations
- Some software for solving sparse matrix problems (e.g. PETSc)

The question is: can we solve sparse matrix problems faster than with sparse direct methods? In certain cases yes: using iterative methods.

Sparse direct methods

Sparse direct methods:
- These are based on LU factorization
- Performed in sparse format

Are there limitations to the approach? Yes:
- They produce fill-ins, which lead to more memory requirements and more flops being performed
- Fill-ins can become prohibitively high

Sparse direct methods

Consider LU for the matrix below. The first step of the LU factorization introduces fill-ins: a nonzero is represented by a *, a fill-in is marked by an F.

Sparse direct methods

Fill-in can be reduced by reordering. Remember: we talked about reordering in a slightly different context (for speed and parallelization). Consider the following reordering:
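To make the effect concrete, here is a small SciPy experiment (my own illustration, not from the slides): an "arrowhead" matrix factored in its natural order fills in almost completely, while a fill-reducing ordering such as COLAMD keeps the LU factors about as sparse as the matrix itself.

# Sketch: count fill-in of sparse LU with and without a fill-reducing ordering.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 200
# Arrowhead matrix: dense first row/column plus a strong diagonal (keeps it nonsingular).
A = sp.lil_matrix((n, n))
A[0, :] = 1.0
A[:, 0] = 1.0
A.setdiag(n + 1.0)
A = A.tocsc()
print("nnz(A) =", A.nnz)

for ordering in ["NATURAL", "COLAMD"]:
    lu = splu(A, permc_spec=ordering)
    print(f"{ordering:8s}: nnz(L) + nnz(U) = {lu.L.nnz + lu.U.nnz}")
# With the natural ordering, eliminating the dense first row/column couples every
# remaining row and column, so L and U fill in almost completely. COLAMD picks an
# ordering that effectively eliminates the dense row/column last, so the factors
# stay close to nnz(A) in size.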

These were extreme cases, but the problem still exists.

Sparse methods

Sparse direct vs dense methods:
- Dense: O(n²) storage, O(n³) flops, runs near peak performance
- Sparse direct: can take O(#nonz) storage and O(#nonz) flops, but both can grow due to fill-ins, and performance is poor
- With #nonz ≪ n² and a 'proper' ordering, it pays off to do sparse direct

Software (table from Julien Langou): http://www.netlib.org/utk/people/JackDongarra/la-sw.html

Package | Support | Type: Real, Complex | Language: f77, c, c++ | Mode: Seq, Dist | SPD | Gen
DSCPACK   yes   X X X M X
HSL       yes   X X X X X X
MFACT     yes   X X X X
MUMPS     yes   X X X X X M X X
PSPASES   yes   X X X M X
SPARSE    ?     X X X X X X
SPOOLES   ?     X X X X M X
SuperLU   yes   X X X X X M X
TAUCS     yes   X X X X X X
UMFPACK   yes   X X X X X
Y12M      ?     X X X X X

Sparse methods

What about iterative methods? Think, for example, of the iteration

x_{i+1} = x_i + P(b − A x_i)

Pluses:
- Storage is only O(#nonz) (for the matrix and a few working vectors)
- Can take only a few iterations to converge (e.g. ≪ n); for P = A^{-1} it takes 1 iteration (check!)
- In general easy to parallelize

Sparse methods

What about iterative methods? Think, for example, of the iteration

x_{i+1} = x_i + P(b − A x_i)

Minuses:
- Performance can be bad (as we saw previously)
- Convergence can be slow or even stagnate, but can be improved with preconditioning
- Think of P as a preconditioner, an operator/matrix P ≈ A^{-1}
- Optimal preconditioners (e.g. multigrid can be optimal) lead to convergence in O(1) iterations
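A minimal NumPy sketch of the iteration x_{i+1} = x_i + P(b − A x_i), with P supplied as a function. The test matrix and the two example choices of P (the exact inverse, and the inverse of the diagonal) are my own illustrations, not from the slides.

# Sketch: the generic (preconditioned Richardson) iteration x_{i+1} = x_i + P(b - A x_i).
import numpy as np

def simple_iteration(A, b, apply_P, tol=1e-10, maxiter=1000):
    x = np.zeros_like(b)
    for it in range(1, maxiter + 1):
        r = b - A @ x                 # residual
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, it - 1
        x = x + apply_P(r)            # apply the preconditioner P to the residual
    return x, maxiter

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant test matrix
b = rng.standard_normal(n)

Ainv = np.linalg.inv(A)
x1, it1 = simple_iteration(A, b, lambda r: Ainv @ r)        # P = A^{-1}: 1 iteration
x2, it2 = simple_iteration(A, b, lambda r: r / np.diag(A))  # P = D^{-1}: Jacobi-like
print(it1, it2, np.linalg.norm(A @ x2 - b))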

Part II

Stationary Iterative Methods

Stationary Iterative Methods

Can be expressed in the form

x_{i+1} = B x_i + c

where B and c do not depend on i

- older, simpler, easy to implement, but usually not as effective (as nonstationary methods)
- examples: Richardson, Jacobi, Gauss-Seidel, SOR, etc. (next)

Richardson Method

Richardson iteration

x_{i+1} = x_i + (b − A x_i) = (I − A) x_i + b    (1)

i.e. B from the definition above is B = I − A.

Denote e_i = x − x_i and rewrite (1):

x − x_{i+1} = x − x_i − (A x − A x_i)
e_{i+1} = e_i − A e_i = (I − A) e_i

‖e_{i+1}‖ = ‖(I − A) e_i‖ ≤ ‖I − A‖ ‖e_i‖ ≤ ‖I − A‖² ‖e_{i−1}‖ ≤ ··· ≤ ‖I − A‖^{i+1} ‖e_0‖

i.e. for convergence (e_i → 0) we need ‖I − A‖ < 1 for some norm ‖·‖, e.g. when ρ(B) < 1.

Jacobi Method

x_{i+1} = x_i + D^{-1}(b − A x_i) = (I − D^{-1}A) x_i + D^{-1} b    (2)

where D is the diagonal of A (assumed nonzero); B = I − D^{-1}A from the definition.

Denote e_i = x − x_i and rewrite (2):

x − x_{i+1} = x − x_i − D^{-1}(A x − A x_i)
e_{i+1} = e_i − D^{-1} A e_i = (I − D^{-1}A) e_i

‖e_{i+1}‖ ≤ ‖I − D^{-1}A‖ ‖e_i‖ ≤ ‖I − D^{-1}A‖² ‖e_{i−1}‖ ≤ ··· ≤ ‖I − D^{-1}A‖^{i+1} ‖e_0‖

i.e. for convergence (e_i → 0) we need ‖I − D^{-1}A‖ < 1 for some norm ‖·‖, e.g. when ρ(I − D^{-1}A) < 1.
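A short NumPy sketch of iteration (2); the diagonally dominant test matrix is my own choice, so that ρ(I − D^{-1}A) < 1 and the iteration converges.

# Sketch: Jacobi iteration x_{i+1} = x_i + D^{-1}(b - A x_i), with D = diag(A).
import numpy as np

def jacobi(A, b, tol=1e-10, maxiter=500):
    d = np.diag(A)                        # assumed nonzero
    x = np.zeros_like(b)
    for it in range(maxiter):
        r = b - A @ x
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, it
        x = x + r / d
    return x, maxiter

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant => convergence
b = rng.standard_normal(n)
x, iters = jacobi(A, b)
print(iters, np.linalg.norm(A @ x - b))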

Gauss-Seidel Method

x_{i+1} = (D − L)^{-1}(U x_i + b)    (3)

where A = D − L − U, with D the diagonal of A, −L its strictly lower triangular part, and −U its strictly upper triangular part.

Equivalently:

x_1^{(i+1)} = (b_1 − a_{12} x_2^{(i)} − a_{13} x_3^{(i)} − a_{14} x_4^{(i)} − ... − a_{1n} x_n^{(i)}) / a_{11}
x_2^{(i+1)} = (b_2 − a_{21} x_1^{(i+1)} − a_{23} x_3^{(i)} − a_{24} x_4^{(i)} − ... − a_{2n} x_n^{(i)}) / a_{22}
x_3^{(i+1)} = (b_3 − a_{31} x_1^{(i+1)} − a_{32} x_2^{(i+1)} − a_{34} x_4^{(i)} − ... − a_{3n} x_n^{(i)}) / a_{33}
...
x_n^{(i+1)} = (b_n − a_{n1} x_1^{(i+1)} − a_{n2} x_2^{(i+1)} − a_{n3} x_3^{(i+1)} − ... − a_{n,n−1} x_{n−1}^{(i+1)}) / a_{nn}
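The component-wise updates above translate directly into code. A minimal NumPy sketch (again with a diagonally dominant test matrix of my own, so the sweep converges):

# Sketch: one Gauss-Seidel sweep updates each component in place, so the newly
# computed entries x_1, ..., x_{k-1} are already used when x_k is formed (unlike Jacobi).
import numpy as np

def gauss_seidel(A, b, tol=1e-10, maxiter=500):
    n = len(b)
    x = np.zeros_like(b)
    for it in range(maxiter):
        if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
            return x, it
        for k in range(n):
            # sum over all j != k, using the freshest available values of x
            s = A[k, :] @ x - A[k, k] * x[k]
            x[k] = (b[k] - s) / A[k, k]
    return x, maxiter

rng = np.random.default_rng(2)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x, iters = gauss_seidel(A, b)
print(iters, np.linalg.norm(A @ x - b))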

A comparison (from Julien Langou's slides)

Gauss-Seidel method

A = [ 10   1   2   0   1          b = [ 23
       0  12   1   3   1                44
       1   2   9   1   0                36
       0   3   1  10   0                49
       1   2   0   0  15 ]              80 ]

‖b − A x^{(i)}‖_2 / ‖b‖_2, as a function of i, the number of iterations

      x^(0)    x^(1)    x^(2)    x^(3)    x^(4)    ...
x_1   0.0000   2.3000   0.8783   0.9790   0.9978   1.0000
x_2   0.0000   3.6667   2.1548   2.0107   2.0005   2.0000
x_3   0.0000   2.9296   3.0339   3.0055   3.0006   3.0000
x_4   0.0000   3.5070   3.9502   3.9962   3.9998   4.0000
x_5   0.0000   4.6911   4.9875   5.0000   5.0001   5.0000

A comparison (from Julien Langou's slides)

Jacobi Method:

x_1^{(i+1)} = (b_1 − a_{12} x_2^{(i)} − a_{13} x_3^{(i)} − a_{14} x_4^{(i)} − ... − a_{1n} x_n^{(i)}) / a_{11}
x_2^{(i+1)} = (b_2 − a_{21} x_1^{(i)} − a_{23} x_3^{(i)} − a_{24} x_4^{(i)} − ... − a_{2n} x_n^{(i)}) / a_{22}
x_3^{(i+1)} = (b_3 − a_{31} x_1^{(i)} − a_{32} x_2^{(i)} − a_{34} x_4^{(i)} − ... − a_{3n} x_n^{(i)}) / a_{33}
...
x_n^{(i+1)} = (b_n − a_{n1} x_1^{(i)} − a_{n2} x_2^{(i)} − a_{n3} x_3^{(i)} − ... − a_{n,n−1} x_{n−1}^{(i)}) / a_{nn}

Gauss-Seidel method:

x_1^{(i+1)} = (b_1 − a_{12} x_2^{(i)} − a_{13} x_3^{(i)} − a_{14} x_4^{(i)} − ... − a_{1n} x_n^{(i)}) / a_{11}
x_2^{(i+1)} = (b_2 − a_{21} x_1^{(i+1)} − a_{23} x_3^{(i)} − a_{24} x_4^{(i)} − ... − a_{2n} x_n^{(i)}) / a_{22}
x_3^{(i+1)} = (b_3 − a_{31} x_1^{(i+1)} − a_{32} x_2^{(i+1)} − a_{34} x_4^{(i)} − ... − a_{3n} x_n^{(i)}) / a_{33}
...
x_n^{(i+1)} = (b_n − a_{n1} x_1^{(i+1)} − a_{n2} x_2^{(i+1)} − a_{n3} x_3^{(i+1)} − ... − a_{n,n−1} x_{n−1}^{(i+1)}) / a_{nn}

Gauss-Seidel is the method with better numerical properties (fewer iterations to convergence). Which method is more efficient to implement on a sequential or parallel computer? Why? In Gauss-Seidel, the computation of x_{k+1}^{(i+1)} requires knowledge of x_k^{(i+1)}, so the sweep is inherently sequential and parallelization of the update is impossible.
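Running both methods on the 5x5 example above (matrix and right-hand side as reconstructed from the slide) shows the difference in iteration counts; the stopping tolerance on the relative residual is my own choice.

# Sketch: Jacobi vs Gauss-Seidel on the 5x5 example, counting iterations
# until ||b - A x|| / ||b|| <= 1e-8.
import numpy as np

A = np.array([[10.,  1.,  2.,  0.,  1.],
              [ 0., 12.,  1.,  3.,  1.],
              [ 1.,  2.,  9.,  1.,  0.],
              [ 0.,  3.,  1., 10.,  0.],
              [ 1.,  2.,  0.,  0., 15.]])
b = np.array([23., 44., 36., 49., 80.])    # exact solution is (1, 2, 3, 4, 5)

D = np.diag(np.diag(A))
L = -np.tril(A, -1)                        # splitting A = D - L - U
U = -np.triu(A, 1)

def run(M, N):                             # iterate x <- M^{-1}(N x + b)
    x = np.zeros_like(b)
    for it in range(1, 200):
        x = np.linalg.solve(M, N @ x + b)
        if np.linalg.norm(b - A @ x) <= 1e-8 * np.linalg.norm(b):
            return x, it
    return x, 200

print("Jacobi      :", run(D, L + U)[1], "iterations")
print("Gauss-Seidel:", run(D - L, U)[1], "iterations")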

Convergence

Convergence can be very slow. Consider a modified Richardson method:

x_{i+1} = x_i + τ(b − A x_i)    (4)

Convergence is linear; similarly to Richardson we get

‖e_{i+1}‖ ≤ ‖I − τA‖^{i+1} ‖e_0‖

but it can be very slow (if ‖I − τA‖ is close to 1). E.g. let A be symmetric and positive definite (SPD), and let the matrix norm in ‖I − τA‖ be the one induced by ‖·‖_2. Then the best choice for τ (τ = 2/(λ_1 + λ_N)) would give

‖I − τA‖ = (k(A) − 1) / (k(A) + 1)

where k(A) = λ_N / λ_1 is the condition number of A.
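A quick numerical check of this statement, on an SPD test matrix of my own:

# Sketch: for SPD A, the optimal tau = 2/(lambda_min + lambda_max) gives
# ||I - tau*A||_2 = (k(A) - 1)/(k(A) + 1).
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)              # SPD test matrix

lam = np.linalg.eigvalsh(A)
lam_min, lam_max = lam[0], lam[-1]
tau = 2.0 / (lam_min + lam_max)
kappa = lam_max / lam_min                  # k(A) in the 2-norm

norm = np.linalg.norm(np.eye(50) - tau * A, 2)
print(norm, (kappa - 1) / (kappa + 1))     # the two numbers agree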

Convergence

Note: the rate of convergence depends on the condition number k(A). Even for the best τ, the rate of convergence

(k(A) − 1) / (k(A) + 1)

is slow (i.e. close to 1) for large k(A). We will see that the convergence of nonstationary methods also depends on k(A), but is better; e.g. compare with the CG rate

(√k(A) − 1) / (√k(A) + 1)

Part III

Iterative Refinement Methods

Mixed-Precision Solvers

Dense Linear Algebra (DLA) is needed in a wide variety of science and engineering applications (the routines below are provided in MAGMA 2.3):
- Linear systems: solve Ax = b. Computational electromagnetics, material science, applications using boundary integral equations, airflow past wings, fluid flow around ships and other offshore constructions, and many more.
- Least squares: find x to minimize ||Ax − b||. Computational statistics (e.g., linear least squares or ordinary least squares), econometrics, control theory, signal processing, curve fitting, and many more.
- Eigenproblems: solve Ax = λx. Computational chemistry, quantum mechanics, material science, face recognition, PCA, data mining, marketing, Google PageRank, spectral clustering, vibrational analysis, compression, and many more.
- SVD: A = U Σ V* (Av = σu and A*u = σv). Information retrieval, web search, signal processing, big data analytics, low-rank matrix approximation, total least squares minimization, pseudo-inverse, and many more.
- Many variations depending on the structure of A: A can be symmetric, positive definite, tridiagonal, Hessenberg, banded, sparse with dense blocks, etc.
- DLA is crucial to the development of sparse solvers.
http://icl.cs.utk.edu/magma
https://bitbucket.org/icl/magma

Leveraging Half Precision in HPC on V100: solving the linear system Ax = b
LU factorization is used to solve a linear system Ax = b:
- factor A = LU, so that LUx = b
- solve Ly = b (forward substitution)
- then solve Ux = y (backward substitution)

LU factorization requires O(n³) operations, and most of them are spent in GEMM (the trailing-matrix update). The factorization proceeds block by block:

For s = 0, nb, 2nb, ..., N:
  1. panel factorize (the current block of nb columns)
  2. update the trailing matrix (TRSM for the block row of U, then a large GEMM)
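A schematic NumPy version of this loop, without the pivoting that real getrf routines perform inside the panel; it only shows where the panel factorization, the triangular solves (TRSM, via scipy.linalg.solve_triangular), and the large GEMM update appear.

# Sketch: right-looking blocked LU without pivoting. L and U overwrite A in place.
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, nb=64):
    A = A.copy()
    n = A.shape[0]
    for s in range(0, n, nb):
        e = min(s + nb, n)
        # 1. panel factorization: unblocked LU of the diagonal block
        for k in range(s, e - 1):
            A[k + 1:e, k] /= A[k, k]
            A[k + 1:e, k + 1:e] -= np.outer(A[k + 1:e, k], A[k, k + 1:e])
        if e < n:
            # TRSM: block column of L (A21 * U11^{-1}) and block row of U (L11^{-1} * A12)
            A[e:, s:e] = solve_triangular(A[s:e, s:e], A[e:, s:e].T, lower=False, trans='T').T
            A[s:e, e:] = solve_triangular(A[s:e, s:e], A[s:e, e:], lower=True, unit_diagonal=True)
            # 2. GEMM update of the trailing matrix: this is where most flops go
            A[e:, e:] -= A[e:, s:e] @ A[s:e, e:]
    return A   # strictly lower part holds L (unit diagonal), upper part holds U

# quick check
rng = np.random.default_rng(0)
n = 300
A = rng.standard_normal((n, n)) + n * np.eye(n)   # safely factorizable without pivoting
LU = blocked_lu(A)
L = np.tril(LU, -1) + np.eye(n)
U = np.triu(LU)
print(np.linalg.norm(L @ U - A) / np.linalg.norm(A))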

[Figure: right-looking blocked LU. At each step a panel of nb columns is factored (Panel), a TRSM produces the corresponding block row of U, and a GEMM updates the trailing matrix; steps 1-4 of the sweep are shown, with the L and U factors accumulating on the left and top.]

Leveraging Half Precision in HPC on V100, Motivation: study of the matrix-multiplication (GEMM) kernel on an Nvidia V100
- dgemm achieves about 6.4 Tflop/s
- sgemm achieves about 14 Tflop/s
- hgemm achieves about 27 Tflop/s
- Tensor-core GEMM reaches about 85 Tflop/s
- the rank-k GEMM needed by LU does not perform as well as the square GEMM, but is still OK
[Figure: GEMM performance in Tflop/s vs m = n, for FP64, FP32, FP16, and FP16-TC, square and k = 256.]

Study of the LU factorization algorithm on an Nvidia V100:
- LU factorization is used to solve a linear system Ax = b (factor A = LU, solve Ly = b, then Ux = y)
- the FP16-TC (Tensor Cores) hgetrf LU runs about 3-4X faster than the FP64 dgetrf LU
[Figure: LU factorization performance in Tflop/s vs matrix size for FP16-TC, FP16, FP32, and FP64 getrf.]

Use mixed-precision algorithms. Idea: use lower precision to compute the expensive flops (the O(n³) LU) and then iteratively refine the solution in order to achieve FP64 accuracy.

- Achieve higher performance: faster time to solution
- Reduce power consumption by decreasing the execution time: energy savings

Leveraging Half Precision in HPC on V100: Iterative Refinement
Idea: use lower precision to compute the expensive flops (the O(n³) LU) and then iteratively refine the solution in order to achieve FP64 accuracy.

Iterative refinement for dense systems, Ax = b, can work this way:

LU = lu(A)       lower precision    O(n³)
x = U\(L\b)      lower precision    O(n²)
r = b − Ax       FP64 precision     O(n²)

WHILE ||r|| not small enough
  1. find a correction z to adjust x that satisfies Az = r;
     solving Az = r could be done by either:
       z = U\(L\r)                            classical iterative refinement      lower precision   O(n²)
       GMRes preconditioned by the LU         iterative refinement using GMRes    lower precision   O(n²)
  2. x = x + z     FP64 precision   O(n)
  3. r = b − Ax    FP64 precision   O(n²)
END
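A minimal NumPy/SciPy sketch of the classical variant, with the factorization in FP32 and the residual and update in FP64 (FP16 would follow the same pattern, but NumPy has no half-precision LU):

# Sketch: mixed-precision iterative refinement. The O(n^3) LU is done in float32,
# the O(n^2) residual and update are done in float64.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, maxiter=50):
    A32 = A.astype(np.float32)
    lu, piv = lu_factor(A32)                      # lower precision, O(n^3)
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for it in range(maxiter):
        r = b - A @ x                             # FP64 residual, O(n^2)
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, it
        # classical IR: correction z from the low-precision factors, O(n^2)
        z = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x = x + z                                 # FP64 update
        # (the GMRes variant would instead solve Az = r with GMRes,
        #  preconditioned by these same low-precision LU factors)
    return x, maxiter

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x, iters = mixed_precision_solve(A, b)
print(iters, np.linalg.norm(A @ x - b) / np.linalg.norm(b))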

- Wilkinson, Moler, Stewart, and Higham provide error bounds for single-precision floating-point results when using double-precision floating point.
- It can be shown that using this approach we can compute the solution to 64-bit floating-point accuracy.

(The results below are from: Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers, ScalA17, November 12-17, 2017, Denver, CO, USA.)

[Figure 4 panels: performance in Tflop/s and required number of iterations vs matrix size for FP64 dgesv, FP32->64 dsgesv, and FP16->64 dhgesv, for six matrix types:
(a) diagonally dominant matrix;
(b) positive eigenvalues, singular values σ_i random in [1/cond, 1] such that their logarithms are uniformly distributed;
(c) positive eigenvalues, clustered singular values σ = (1, ..., 1, 1/cond);
(d) clustered singular values σ = (1, ..., 1, 1/cond);
(e) positive eigenvalues, arithmetic distribution of the singular values σ_i = 1 − ((i−1)/(n−1))(1 − 1/cond);
(f) arithmetic distribution of the singular values σ_i = 1 − ((i−1)/(n−1))(1 − 1/cond).]

Leveraging FP16 in HPC on V100: Numerical Behavior
- Convergence history of the iterative refinement solver to achieve FP64 solution accuracy; the left graph shows the classical iterative refinement method, the right graph shows iterative refinement using GMRes.
- Problem generated with an arithmetic distribution of the singular values and positive eigenvalues.
- The FP32->64 algorithm converges as expected and is able to achieve FP64 solution accuracy in about 3-5 iterations.
- The FP16->64 algorithm requires more iterations than FP32->64 because of the lower-precision factorization.
- The FP16-TC->64 (Tensor Cores) variant outperforms FP16->64 because the accumulation during the FP16-TC GEMM (Schur update) uses 32-bit arithmetic.

Figure 4: Performance in Tflop/s for the three linear solvers (dgesv, dsgesv, dhgesv) and the required number of iterations to achieve an FP64-accurate solution using either the dsgesv (orange numbers) or the dhgesv (green numbers) solver, for different matrix sizes and different types of matrices. Note that cond = 10² for these experiments.

Leveraging Half Precision in HPC on V100: Performance Behavior
Performance of solving Ax = b using FP64, or using iterative refinement with GMRes to achieve FP64 accuracy. Tflop/s are computed as 2n³/(3·time), meaning twice higher is twice faster. Four solvers are compared:
- solving Ax = b using FP64 LU (dgesv)
- solving Ax = b using FP32 LU and iterative refinement to achieve FP64 accuracy (FP32->64 dsgesv)
- solving Ax = b using FP16 LU and iterative refinement to achieve FP64 accuracy (FP16->64 dhgesv)
- solving Ax = b using FP16 Tensor Cores LU and iterative refinement to achieve FP64 accuracy (FP16-TC->64 dhgesv)
For the matrix with an arithmetic distribution of the singular values and positive eigenvalues, the FP16-TC variant reaches about a 4X speedup over the FP64 solver.
[Figure: performance in Tflop/s vs matrix size for the four solvers, with the condition number κ∞(A) of the test matrices indicated on a secondary axis.]


Fig. 2: Performance of the three arithmetic precisions obtained on an Nvidia V100 GPU. (a) Performance of the rank-k update (Xgemm function) used in Xgetrf. (b) Performance of the Xgetrf routine.

Leveraging FP16 in HPC on V100: Numerical Behavior
- Convergence history of the iterative refinement solver to achieve FP64 solution accuracy; the left graph shows the classical iterative refinement method, the right graph shows iterative refinement using GMRes.
- Problem generated with a clustered distribution of the singular values.
- The FP32->64 algorithm converges as expected and is able to achieve FP64 solution accuracy in about 3-5 iterations.
- The FP16->64 algorithm requires too many iterations and might be a performance bottleneck.
- The FP16-TC->64 (Tensor Cores) variant outperforms FP16->64 because the accumulation during the FP16-TC GEMM (Schur update) uses 32-bit arithmetic.

Table II: description of the test matrices (cond = 100):
1. Random numbers with the diagonal modified to be dominant.
2. Positive eigenvalues; singular values random in [1/cond, 1] such that their logarithms are uniformly distributed.
3. Positive eigenvalues; clustered singular values, σ = (1, ..., 1, 1/cond).
4. Clustered singular values, σ = (1, ..., 1, 1/cond).
5. Positive eigenvalues; arithmetically distributed singular values, σ_i = 1 − ((i−1)/(n−1))(1 − 1/cond), i = 1..n.
6. Arithmetically distributed singular values, σ_i = 1 − ((i−1)/(n−1))(1 − 1/cond), i = 1..n.

Figure 4a shows results on matrices where all methods behave well. Convergence was achieved in 3 iterations for FP32, 4 iterations for FP16-TC, and 7 iterations for FP16. Figure 4b has the same singular value distribution as Figure 4a but not necessarily positive eigenvalues. This type is the most difficult, and the FP16 variants using either IR or IRGM do not converge. Note that the FP16 IRGM can converge when allowing more than 2,000 iterations, but for our experiment we limited the max iterations to 400, since we have seen a large performance drop when iterations are around 200, where the iterative refinement becomes a drawback. The FP32 variants are not influenced by the matrix type and always converge in about 3-4 iterations. Surprisingly, the FP16-TC behaves very well and converges in about 18 iterations, while the FP16 did not converge.

Lesson: for all the matrices considered, the FP16-TC variant is the most robust and fastest in convergence among the FP16 arithmetic-based methods. The FP32 refinement variants emphasize a stable behavior regardless of the matrix types. [NJH: I suggest changing "stable" to "consistent", since "stable" implies numerical stability, which is not the issue here.]

This observation suggests the surprising effectiveness of the FP16 arithmetic, which might be robust enough to be used in HPC dense linear system solvers.

VII. Experimental results discussion. This section presents the performance results of our three iterative refinement methods (dhgesv-TC, dhgesv, and dsgesv), using either IR or IRGM, compared to the reference dgesv solver. We also depict the number of iterations required by each method to reach FP64 accuracy. The Tflop/s are computed based on the same formula (P = 2n³/(3·time)), which means performance reflects the time to solution: if a method has 2X higher performance, it is twice faster. The performance results are presented in Figure 5 and Figure 6 for the six representative types of matrices studied in Section VI. In each figure there are four performance curves that refer to the reference dgesv and the three iterative refinement algorithms dhgesv-TC, dhgesv, and dsgesv. In Figure 5a, the matrix is diagonally dominant and, as shown in Section VI, all variants require three to five iterations to converge. Thus, one can expect that the low-precision iterative refinement algorithms will bring a large speedup compared to dgesv. Since the number of iterations is small, we imagine that the speedup ratio will be similar to the one observed in Figure 2b for the LU factorization. We confirm ...

Part IV

Nonstationary Iterative Methods

Nonstationary Iterative Methods

The methods involve information that changes at every iteration, e.g. inner products with residuals or other vectors.

The methods are newer, harder to understand, but in general more effective:
- based on the idea of orthogonal vectors and subspace projections
- examples: Krylov iterative methods (CG/PCG, GMRES, CGNE, QMR, BiCG, etc.)

Krylov Iterative Methods

Krylov Iterative Methods

Krylov subspaces: these are the spaces

K_i(A, r) = span{ r, Ar, A²r, ..., A^{i−1} r }

Krylov iterative methods find an approximation x_i to x, where

Ax = b,

as a minimization on K_i(A, r). We have seen how this can be done, for example, by projection, i.e. by the Petrov-Galerkin conditions:

(A x_i, φ) = (b, φ)    for all φ ∈ K_i(A, r)
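A small sketch of this projection idea: build an orthonormal basis of K_i(A, r) with an Arnoldi-style loop, then impose the Galerkin condition by solving the projected i × i system. This is essentially the FOM approach; the test matrix is my own, and x_0 = 0 is assumed so that r = b.

# Sketch: Galerkin projection onto the Krylov subspace K_i(A, b) (x_0 = 0).
# Columns of V form an orthonormal basis of K_i; the condition (A x_i, v) = (b, v)
# for all v in K_i becomes the small system (V^T A V) y = V^T b, with x_i = V y.
import numpy as np

def krylov_galerkin(A, b, i):
    n = len(b)
    V = np.zeros((n, i))
    V[:, 0] = b / np.linalg.norm(b)
    for j in range(1, i):                      # expand by one matvec, then orthogonalize
        w = A @ V[:, j - 1]
        w -= V[:, :j] @ (V[:, :j].T @ w)       # Gram-Schmidt against previous basis vectors
        V[:, j] = w / np.linalg.norm(w)
    y = np.linalg.solve(V.T @ A @ V, V.T @ b)  # projected (Galerkin) system
    return V @ y

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
for i in (5, 10, 20, 40):
    x_i = krylov_galerkin(A, b, i)
    print(i, np.linalg.norm(b - A @ x_i) / np.linalg.norm(b))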

Krylov Iterative Methods

In general we expand the Krylov subspace by a matrix-vector product, and do a minimization/projection in it. Various methods result from specific choices of expansion and minimization/projection.

Krylov Iterative Methods

An example: The Conjugate Gradient Method (CG)

The method is for SPD matrices

There is a way at iteration i to construct a new 'search direction' p_i s.t.

span{ p_0, p_1, ..., p_i } ≡ K_{i+1}(A, r_0)    and    (A p_i, p_j) = 0 for i ≠ j.

Note: A is SPD ⇒ (A p_i, p_j) ≡ (p_i, p_j)_A can be used as an inner product, i.e. p_0, ..., p_i is an (·, ·)_A-orthogonal basis for K_{i+1}(A, r_0) ⇒ we can easily find x_{i+1} ≈ x as

x_{i+1} = x_0 + α_0 p_0 + ··· + α_i p_i    s.t.    (A x_{i+1}, p_j) = (b, p_j) for j = 0, ..., i

Namely, because of the (·, ·)_A-orthogonality of p_0, ..., p_i, at iteration i + 1 we have to find only α_i:

(A x_{i+1}, p_i) = (A(x_i + α_i p_i), p_i) = (b, p_i)    ⇒    α_i = (r_i, p_i) / (A p_i, p_i)

Note: x_i above can actually be replaced by any x_0 + v, v ∈ K_i(A, r_0). (Why?)
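For completeness, a compact NumPy version of CG built on exactly these ingredients (A-orthogonal search directions and the α_i above); the SPD test matrix is my own.

# Sketch: the Conjugate Gradient method for SPD A.
import numpy as np

def cg(A, b, tol=1e-10, maxiter=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()                          # first search direction
    rs = r @ r
    for it in range(1, maxiter + 1):
        Ap = A @ p
        alpha = rs / (p @ Ap)             # alpha_i = (r_i, p_i)/(A p_i, p_i); here (r_i, p_i) = (r_i, r_i)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            return x, it
        p = r + (rs_new / rs) * p         # new direction, A-orthogonal to the previous ones
        rs = rs_new
    return x, maxiter

rng = np.random.default_rng(0)
M = rng.standard_normal((300, 300))
A = M @ M.T + 300 * np.eye(300)           # SPD test matrix
b = rng.standard_normal(300)
x, iters = cg(A, b)
print(iters, np.linalg.norm(A @ x - b) / np.linalg.norm(b))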

Learning Goals

A brief introduction to iterative methods:
- Motivation for iterative methods and links to previous lectures, namely PDE discretization and sparse matrices
- Optimized implementations
- Stationary iterative methods
- Nonstationary iterative methods and links to building blocks that we have already covered: projection/minimization, orthogonalization
- Krylov methods; an example with CG; more examples in the next lecture
