Lecture 8: Fast Linear Solvers (Part 6)

1 Nonsymmetric System of Linear Equations

• The CG method requires to 퐴 to be an 푛 × 푛 symmetric positive definite to solve 퐴풙 = 풃. • If 퐴 is nonsymmetric: – Convert the system to a symmetric positive definite one – Modify CG to handle general matrices

2 Normal Equation Approach

The normal equations corresponding to 퐴풙 = 풃 are 퐴푇퐴풙 = 퐴푇풃 • If 퐴 is nonsingular then 퐴푇퐴 is symmetric positive definite and the CG method can be applied to solve 퐴푇퐴풙 = 퐴푇풃 (CG normal residual -- CGNR). • Alternatively, we can first solve A퐴푇풚 = 풃 for 풚, then 풙 = 퐴푇풚. • Disadvantages: – Each iteration requires 퐴푇퐴 or A퐴푇 – Condition number of 퐴푇퐴 or A퐴푇 is square of that of 퐴. However, CG works well if condition number is small 3

Arnoldi Iteration • The Arnoldi method is an orthogonal projection onto a Κ푚(퐴, 풓0) for 푛 × 푛 nonsymmetric matrix 퐴. Here 푚 ≪ 푛. • Arnoldi reduces 퐴 to a Hessenberg form. Upper : zero entries below the first subdiagonal. 2 3 4 1 2 5 1 9

0 2 1 2 0 0 3 2

Lower Hessenberg matrix: zero entries above the first superdiagonal. 3 2 0 0 2 5 1 0

1 2 1 2 3 4 3 2 4 Mechanics of

푛×푛 푛 • For 퐴 ∈ 푅 , a given vector 풓0 ∈ 푅 defines a sequence of Krylov subspaces Κ푚(퐴, 풓0). Matrix 2 푚−1 푛×푚 Κ푚 = [풓0|퐴풓0|퐴 풓0| … |퐴 풓0] ∈ 푅 is the corresponding Krylov matrix. • The Gram-Schmidt procedure for forming an orthonormal for Κ푚 is called the Arnoldi process. – Theorem. The Arnoldi procedure generates a reduced QR factorization of Krylov matrix Κ푚 in the form Κ푚 = 푉푚푅푚 with 푛×푚 푉푚 ∈ 푅 and having orthonormal columns and with a 푚×푚 triangular matrix 푅푚 ∈ 푅 . Furthermore, with the 푇 푚 × 푚 − 푢푝푝푒푟 Hessenberg matrix 퐻푚, we have 푉푚 퐴푉푚 = 퐻푚.

5 Let 퐻푚 be a 푚 × 푚 Hessenberg matrix: ℎ11 ℎ12 … ℎ1푚 ℎ ℎ … ℎ 퐻 = 21 22 2푚 푚 0 ⋱ ⋱ ⋮ 0 … ℎ푚,푚−1 ℎ푚푚 Let (푚 + 1) × 푚 퐻 푚 be the extended matrix of 퐻푚: ℎ11 ℎ12 … ℎ1푚 ℎ21 ℎ22 … ℎ2푚 퐻 푚 = 0 ⋱ ⋱ ⋮ 0 … ℎ푚,푚−1 ℎ푚푚 0 … 0 ℎ푚+1,푚 The Arnoldi iteration produces matrices 푉푚, 푉푚+1 and 퐻 푚 for matrix 퐴 satisfying: 푇 퐴푉푚 = 푉푚+1퐻 푚 = 푉푚퐻푚 + 풘푚풆푚 Here 푉푚, 푉푚+1 have orthonormal columns 푉푚 = [풗1 풗2 … |풗푚], 푉푚+1 = [풗1 풗2 … |풗푚|풗푚+1]

6

The 푚th column of the equation: 퐴풗푚 = ℎ1푚풗1 + ℎ2푚풗2 + ⋯ + ℎ푚푚풗푚 + ℎ푚+1,푚풗푚+1 Therefore, ℎ1푚 = 퐴풗푚 ∙ 풗1 ⋮

ℎ푚+1,푚 = ||퐴풗푚 − ℎ1푚풗1 … − ℎ푚푚풗푚|| 풗푚+1 = (퐴풗푚 − ℎ1푚풗1 … − ℎ푚푚풗푚)/ℎ푚+1,푚

7 Arnoldi Algorithm

Choose 풓0 and let 풗1 = 풓0/| 풓0 | for 푗 = 1, … , 푚 − 1 푗 푇 풘 = 퐴풗푗 − 푖=1((퐴풗푗) 풗푖)풗푖 풗푗+1 = 풘/||풘||2 endfor Remark: Choose 풗1. Then for 푗 = 1, … , 푚 − 1, first multiply the current Arnoldi vector 풗푗 by A, and orthonormalize 퐴풗푗 against all previous Arnoldi vectors.

8

푇 • 푉푚 푉푚 = 퐼푚×푚. • If Arnoldi process breaks down at 푚푡ℎ step, 풘푚 = ퟎ is still well- defined but not 풗푚+1, and the algorithm stop.

• In this case, the last row of 퐻푚 is set to zero, ℎ푚+1,푚 = 0 9 Stable Arnoldi Algorithm

Choose 풙0 and let 풗1 = 풙0/| 풙0 | . for 푗 = 1, … , 푚

풘 = 퐴풗푗 for 푖 = 1, … , 푗

ℎ푖푗 =< 풘, 풗푖 > 풘 = 풘 − ℎ푖푗풗푖 endfor

ℎ푗+1,푗 = ||풘||2 if ℎ푗+1,푗 = 0, then stop 풗푗+1 = 풘/ℎ푗+1,푗 endfor 10 Generalized Minimum Residual (GMRES) Method

Let the Krylov space associated with 퐴풙 = 풃 be Κ푘 퐴, 풓0 = 2 푘−1 푠푝푎푛 풓0, 퐴풓0, 퐴 풓0, … , 퐴 풓0 , where 풓0 = 풃 − 퐴 풙0 for some initial guess 풙0.

The 푘푡ℎ (푘 ≥ 1) iteration of GMRES is the solution to the least squares problem:

푚푖푛푖푚푖푧푒풙∈풙0+Κ푘||풃 − 퐴풙||2, i.e. Find 풙푘 ∈ 풙0 + Κ푘 such that ||풃 − 퐴풙푘||2 = 푚푖푛풙∈풙0+Κ푘||풃 − 퐴풙||2

• Remark: the GMRES was proposed in “Y. Saad and M. Schultz, GMRES a generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Statist. Comput., 7 (1986), pp. 856–869.”

11 푘−1 푗 If 풙 ∈ 풙0 + Κ푘, then 풙 = 풙0 + 푗=0 훾푗퐴 풓0. 푘−1 푗+1 So 풃 − 퐴풙 = 풃 − 퐴풙0 − 푗=0 훾푗퐴 풓0 = 풓0 − 푘 푗 푗=1 훾푗−1퐴 풓0 .

12 • Theorem (Kelly). Let A be a nonsingular diagonalizable matrix. Assume that A has only k distinct eigenvalues. Then GMRES will terminate in at most k iterations. • Least Square via QR factorization Let 퐴 ∈ 푅푚×푛 (푚 ≥ 푛), and 풃 ∈ 푅푚 be given. Find 풙 ∈ 푅푛 so that the norm of 풓 = 풃 − 퐴풙 is minimized. Algorithm 1. Compute the QR factorization 퐴 = 푄 푅 2. Compute vector 푄 ∗풃 3. Solve the upper triangular system 푅 풙 = 푄 ∗풃 for 풙 Reference: Numerical , L.N. Trefethen, D. Bau, III

13 GMRES Implementation

• The 푘푡ℎ (푘 ≥ 1) iteration of GMRES is the solution to the least squares problem:

푚푖푛푖푚푖푧푒풙∈풙0+Κ푘||풃 − 퐴풙||2 • Suppose we have used Arnoldi process constructed an orthogonal basis 푉푘 for Κ푘 퐴, 풓0 . 푇 – 풓0 = 훽푉푘풆1, where 풆1 = (1,0,0, … ) , 훽 = ||풓0||2 – Any vector 풛 ∈ Κ푘 퐴, 풓0 can be written as 풛 = 푘 푘 푘 푙=1 푦푙풗푙 , where 풗푙 is the 푙푡ℎ column of 푉푘. Denote 푇 푘 풚 = (푦1, 푦2, … , 푦푘) ∈ 푅 . 풛 = 푉푘풚

14 Since 풙 − 풙0 = 푉푘풚 for some coefficient vector 풚 ∈ 푘 푅 , we must have 풙푘 = 풙0 + 푉푘풚 where 풚 minimizes ||풃 − 퐴(풙0 + 푉푘풚)||2 = ||풓0 − 퐴푉푘풚||2.

• The 푘푡ℎ (푘 ≥ 1) iteration of GMRES now is equivalent to a least squares problem in 푅푘, i.e.

푚푖푛푖푚푖푧푒풙∈풙0+Κ푘||풃 − 퐴풙||2 = 푚푖푛푖푚푖푧푒푦∈푅푘||풓0 − 퐴푉푘풚||2 – Remark: This is a linear least square problem, which can be solved by QR factorization. However, 퐴푉푘 must be computed at each iteration. 푇 푇 – The associate normal equation is (퐴푉푘) 퐴푉푘풚 = (퐴푉푘) 풓0. – But we will solve it differently. 15 • Let 풙푘 be 푘푡ℎ iterative solution of GMRES.

Define: 풓푘 = 풃 − 퐴풙푘 = 풓0 − 퐴 풙푘 − 풙0 = 훽푉푘+1풆1 − 퐴 풙0 + 푉푘풚 − 풙0 = 훽푉푘+1풆1 − 푘 푘 푉푘+1퐻 푘풚 = 푉푘+1(훽풆1 − 퐻 푘풚 )

Using orthonomality of 푉푘+1:

푚푖푛푖푚푖푧푒풙∈풙0+Κ푘||풃 − 퐴풙||2 푘 = 푚푖푛푖푚푖푧푒푦∈푅푘||훽풆1 − 퐻푘풚 ||2

16 “C.T. Kelley, Iterative Methods for Linear and Nonlinear Equations” . 17 푘 푚푖푛푖푚푖푧푒푦∈푅푘||훽풆1 − 퐻푘풚 ||2

Theorem. Let 푛 × 푘 (푘 ≤ 푛) matrix 퐵 be with linearly independent columns (full column rank). Let 퐵 = 푄푅 be a 푄푅 factorization of 퐵. Then for each 풃 ∈ 푅푛, the equation 퐵풖 = 풃 has a unique least-square solution, given by 풖 = 푅−1푄푇풃.

Using Householder reflection to do QR factorization (푘+1)×(푘+1) gives 퐻 푘 = 푄푘+1푅 푘 where 푄푘+1 ∈ 푅 is (푘+1)×푘 orthogonal and 푅 푘 ∈ 푅 has the form 푅 푅 = 푘 , where 푅 ∈ 푅푘×푘 is upper triangular. 푘 0 푘 18 • 풗푗 may become nonorthogonal as a result of round off errors. 푘 – ||훽풆1 − 퐻 푘풚 ||2 which depends on , will not hold and the residual could be inaccurate. – Replace the loop in Step 2c of Algorithm gmresa with

19 “C.T. Kelley, Iterative Methods for Linear and Nonlinear Equations” . 20 “C.T. Kelley, Iterative Methods for Linear and Nonlinear Equations” . 21 Modified Gram-Schmidt Process with Reorthogonalization

Test Reorthogonalization

If 퐴푣푘||2 + 훿 푣푘+1||2 = ||퐴푣푘||2 to working precision. 훿 = 10−3

22 Givens Rotations 푘 푚푖푛푖푚푖푧푒푦∈푅푘||훽풆1 − 퐻푘풚 ||2 involves QR factorization. Do QR factorizations of 퐻 푘by Givens Rotations.

• A 2 × 2 Givens rotation is a matrix of the form 푐 −푠 퐺 = where 푐 = cos (휃), 푠 = sin (휃) for 푠 푐 휃 ∈ [−휋, 휋]. The orthogonal matrix 퐺 rotates the vector (푐, −푠)푇, which makes an angle of −휃 with the 푥-axis, through an angle 휃 so that it overlaps the 푥-axis. 푐 1 퐺 = −푠 0 23 An 푁 × 푁 Givens rotation 퐺푗(푐, 푠) replaces a 2 × 2 block on the diagonal of the 푁 × 푁 identity matrix with a 2 × 2 Givens rotations. 퐺푗(푐, 푠) is with a 2 × 2 Givens rotations in rows and columns 푗 and 푗 + 1.

24 • Givens rotations can be used in reducing Hessenberg matrices to triangular form. This can be done in 푂 푁 floating-point operations.

• Let 퐻 be an 푁 × 푀(푁 ≥ 푀) upper Hessenberg matrix with rank 푀. We reduce 퐻 to triangular form by first multiplying the matrix by a Givens rotations that zeros ℎ21 (values of ℎ11 and subsequent columns are changed)

25 2 2 • Step 1: Define 퐺1(푐1, 푠1) by 푐1 = ℎ11/ ℎ11 + ℎ21 and 2 2 푠1 = −ℎ21/ ℎ11 + ℎ21. Replace 퐻 by 퐺1퐻. 2 2 • Step 2: Define 퐺2(푐2, 푠2) by 푐2 = ℎ22/ ℎ22 + ℎ32 and 2 2 푠2 = −ℎ32/ ℎ22 + ℎ32. Replace 퐻 by 퐺2퐻.

Remark: 퐺2 does not affect the first column of 퐻. • …

2 2 • Step j: Define 퐺푗(푐푗, 푠푗) by 푐푗 = ℎ푗푗/ ℎ푗푗 + ℎ푗+1,푗 and 2 2 푠푗 = −ℎ푗+1,푗/ ℎ푗푗 + ℎ푗+1,푗. Replace 퐻 by 퐺푗퐻.

Setting 푄 = 퐺푁 … 퐺1. 푅 = 푄퐻 is upper triangular.

26 Let 퐻 푘 = 푄푅 by Givens rotations matrices. 푘 푚푖푛푖푚푖푧푒푦∈푅푘||훽풆1 − 퐻푘풚 ||2 푘 = 푚푖푛푖푚푖푧푒푦∈푅푘||푄(훽풆1 − 퐻푘풚 )||2 푘 = 푚푖푛푖푚푖푧푒푦∈푅푘||훽푄풆1 − 푅풚 ||2

27 28 Preconditioning Basic idea: using GMRES on a modified system such as 푀−1퐴풙 = 푀−1풃. The matrix 푀−1퐴 need not to be formed explicitly. However, 푀풘 = 풗 need to be solved whenever needed.

Left preconditioning 푀−1퐴풙 = 푀−1풃 Right preconditioning 퐴푀−1풖 = 풃 푤푖푡ℎ 풙 = 푀−1풖 Split preconditioning: 푀 is factored as 푀 = 푀퐿푀푅 −1 −1 −1 −1 푀퐿 퐴푀푅 풖 = 푀퐿 풃 푤푖푡ℎ 풙 = 푀푅 풖 29 GMRES with Left Preconditioning

The Arnoldi process constructs an orthogonal basis for −1 −1 2 −1 푘−1 Span{풓0, 푀 퐴풓0, (푀 퐴) 풓0, … (푀 퐴) 풓0}.

Sadd. Iterative Methods for Sparse Linear Systems 30 GMRES with Right Preconditioning

Right preconditioned GMRES is based on solving 퐴푀−1풖 = 풃 푤푖푡ℎ 풙 = 푀−1풖. −1 • The initial residual is: 풃 − 퐴푀 풖0 = 풃 − 퐴풙0. – This means all subsequent vectors of the Krylov subspace can be obtained without any references to the 풖. • At the end of right preconditioned GMRES: 푚

풖푚 = 풖0 + 풗푖휂푖 푤푖푡ℎ 풖0 = 푀풙0 푖=1 푚 −1 풙푚 = 풙0 + 푀 풗푖휂푖 푖=1

31 GMRES with Right Preconditioning

The Arnoldi process constructs an orthogonal basis for −1 −1 2 −1 푘−1 Span{풓0, 퐴푀 풓0, (퐴푀 ) 풓0, … (퐴푀 ) 풓0}.

Sadd. Iterative Methods for Sparse Linear Systems. 32 Split Preconditioning • 푀 can be a factorization of the form 푀 = 퐿푈.

• Then 퐿−1퐴푈−1풖 = 퐿−1풃, 푤푖푡ℎ 풙 = 푈−1풖. – Need to operate on the initial residual by 퐿−1(풃 − 퐴풙ퟎ) – Need to operate on the linear combination −1 푈 (푉푚풚푚) in forming the approximate solution

33 Comparison of Left and Right Preconditioning

• Spectra of 푀−1퐴, 퐴푀−1 and 퐿−1퐴푈−1 are identical. • In principle, one should expect convergence to be similar. • When 푀 is ill-conditioned, the difference could be substantial.

34 Jacobi Preconditioner for solving 퐴푥 = 푏 takes the form: −1 −1 풙푘+1 = 푀 푁풙푘 + 푀 풃 where 푀 푎푛푑 푁 split 퐴 into 퐴 = 푀 − 푁. • Define 퐺 = 푀−1푁 = 푀−1 푀 − 퐴 = 퐼 − 푀−1퐴 and 풇 = 푀−1풃. • Iterative method is to solve 퐼 − 퐺 풙 = 풇, which can be written as 푀−1퐴풙 = 푀−1풃.

Jacobi iterative method: 풙푘+1 = 퐺퐽퐴풙푘 + 풇 where −1 −1 퐺퐽퐴 = (퐼 − 퐷 퐴) and 풇 = 퐷 풃 • 푀 = 퐷 for Jacobi method.

35 SOR/SSOR Preconditioner • Define: 퐴 = 퐷 − 퐸 − 퐹 −1 • Gauss-Seidel: 퐺퐺푆 = 퐼 − (퐷 − 퐸) 퐴 1 • 푀 = (퐷 − 푤퐸) 푆푂푅 푤

A symmetric SOR (SSOR) consists of: 퐷 − 푤퐸 풙 1 = 푤퐹 + 1 − 푤 퐷 풙 + 푤풃 푘+ 푘 2 퐷 − 푤퐹 풙푘+1 = 푤퐸 + 1 − 푤 퐷 풙 1 + 푤풃 푘+2 This gives 풙푘+1 = 퐺푆푆푂푅풙푘 + 풇 Where −1 −1 퐺푆푆푂푅 = 퐷 − 푤퐹 (푤퐸 + 1 − 푤 퐷) 퐷 − 푤퐸 (푤퐹 + 1 − 푤 퐷)

−1 −1 • 푀푆푆푂푅 = 퐷 − 푤퐸 퐷 (퐷 − 푤퐹); 푀푆퐺푆 = 퐷 − 퐸 퐷 (퐷 − 퐹); • Note: SSOR usually is used when 퐴 is symmetric 36 Take symmetric GS for example: −1 푀푆퐺푆 = 퐷 − 퐸 퐷 퐷 − 퐹 • Define: 퐿 = 퐷 − 퐸 퐷−1 = 퐼 − 퐸퐷−1 and 푈 = 퐷 − 퐹. • 퐿 is a lower triangular matrix and 푈 is a upper triangular matrix.

• To solve 푀푆퐺푆풘 = 풙 for 풘, a forward solve and a backward solve are used: – Solve 퐼 − 퐸퐷−1 풛 = 풙 for 풛 – Solve 퐷 − 퐹 풘 = 풛 for 풘

37 Incomplete LU(0) Factorization

Define: 푁푍 푋 = {(푖, 푗)|푋푖,푗 ≠ 0} Incomplete LU (ILU(0)): • 퐴 = 퐿푈 + 푅 with 푁푍 퐿 ∪ 푁푍 푈 = 푁푍(퐴) 푟푖푗 = 0 푓표푟 (푖, 푗) ∈ 푁푍(퐴)

I.e. 퐿 and 푈 have no fill-ins at the entries 푎푖푗 = 0.

for 푖 = 1 to 푛 for 푘 = 1 to 푖 − 1 and if (푖, 푘) ∈ 푁푍(퐴) 푎푖푘 = 푎푖푘/푎푘푗 for j = 푘 + 1 to 푛 and if (푖, 푘) ∈ 푁푍(퐴) 푎푖푗 = 푎푖푗 − 푎푖푘푎푘푗 end; end; end; 38 ILU(0)

Sadd. Iterative Methods for Sparse Linear Systems. 39 Parallel GMRES

• J. Erhel. A parallel GMRES version for general sparse matrices. Electronic Transactions on Numerical Analyis. 3:160-176, 1995. • Implementation in PETSc (Portable, Extensible Toolkit for Scientific Computation) – http://www.mcs.anl.gov/petsc/

40 Parallel Libraries ScaLAPACK • http://www.netlib.org/scalapack/ • Based on LAPACK (Linear Algebra PACKage) and BLAS (Basic Linear Algebra Subroutines) • Parallelized by “divide and conquer” or block distribution • Written in Fortran 90 • Successor of LINPACK, which was originally written for vector supercomputers in the 1970s • Implemented on top of MPI using MIMD, SPMD, and used explicit message passing

41 PETSc (Portable, Extensible Toolkit for Scientific Computation) • http://www.mcs.anl.gov/petsc/ • Suite of data structures (core: distributed vectors and matrices) and routines for linea and non-linear solvers • User (almost) never has to call MPI himself when using PETSc • Uses two MPI communicators: PETSC_COMM_SELF for the library-internal communication and PETSC_COMM_WORLD for user processes • Written in C, callable from Fortran • Has been used to solve systems with over 500 millions unknowns • Has been shown to scale up to over 6000 processors

42 PETSc Structure

43 PETSc Numerical Solvers

44 Parallel Random Number Generator SPRNG (The Scalable parallel random number generators library) • http://sprng.cs.fsu.edu/ • Random number sequence does not depend on the number of processors used, but only on the seed a reproducible Monte Carlo simulations in parallel • SPRNG implements parallel-safe, high-quality random number generators • C++/Fortran (used to be C/Fortran in previous versions)

45 Parallel PDE Solver

POOMA (Parallel Object-Oriented Methods and Applications) • http://acts.nersc.gov/formertools/pooma/index.html • Collection of templated C++ classes for writing parallel PDE solvers • Provides high-level data types (abstractions) for fields and particles using data-parallel arrays • Supports finite-difference simulations on structured, unstructured, and adaptive grids. Also supports particle simulations, hybrid particle-mesh simulations, and Monte Carlo • Uses mixed message-passing/thread parallelism

46 Many more… • Aztec (iterative solvers for sparse linear systems) • SuperLU (LU decomposition) • Umfpack (unsymmetric multifrontal LU) • EISPACK (eigen-solvers) • Fishpack (cyclic reduction for 2nd & 4th order FD) • PARTI (Parallel run-time system) • Bisect (recursive orthogonal bisection) • ROMIO (parallel distributed file I/O) • KINSol (solves the nonlinear algebraic systems) https://computation.llnl.gov/casc/sundials/main.html • SciPy (Scientific Tools for Phython) http://www.scipy.org/ • …

47 References: – C.T. Kelley. Iterative Methods for Linear and Nonlinear Equations. – Yousef Sadd. Iterative methods for Sparse Linear Systems – G. Karypis and V. Kumar. Parallel Threshold-based ILU Factorization. Technical Report #96-061. U. of Minnesota, Dept. of Computer Science, 1998. – P. -O. Persson and J. Peraire. Newton-GMRES Preconditioning for Discontinuous Galerkin Discretizations of the Navier-Stokes Equations. SIAM J. on Sci. Comput. 30(6), 2008.

48