Lecture 7, Numerical Computing
Krylov methods for Hermitian systems

Program
• The Arnoldi basis: a basis for the Krylov subspace
• Lanczos method
• Solution methods for linear systems
• Conjugate gradients
• MINRES, Conjugate Residuals
• SYMMLQ

Gerard Sleijpen and Martin van Gijzen November 8, 2017


Delft University of Technology

Projection methods

Last time you saw a general framework to solve Ax = b with a projection method. The main steps are:
• use recursion to compute a basis V_k for the search subspace;
• project the problem and compute its representation w.r.t. the basis (gives a low dimensional problem);
• solve the projected problem (gives a low dimensional solution vector ~y_k);
• compute the (high dimensional) solution x_k = V_k ~y_k.

Today we will see how symmetry (Hermitian problems) can be exploited to enhance efficiency.

The Krylov subspace

The subspace span{r_0, A r_0, A^2 r_0, ..., A^{k-1} r_0} is called the Krylov subspace of order k, generated by the matrix A and the initial vector r_0, and is denoted by

    K_k(A, r_0) = span{r_0, A r_0, A^2 r_0, ..., A^{k-1} r_0}.

Projection methods that search Krylov subspaces are called Krylov subspace methods. As we have seen in the previous lessons, a stable orthogonal basis for K_k(A, r_0) can be computed using Arnoldi's method.


Arnoldi's method

    Choose a starting vector v_1 with ||v_1||_2 = 1.
    For k = 1, 2, ..., do                      % iteration
        w = A v_k                              % expansion
        For i = 1, ..., k, do                  % orthogonalisation
            h_{i,k} = v_i^* w
            w = w - h_{i,k} v_i
        end for
        h_{k+1,k} = ||w||_2
        if h_{k+1,k} = 0, stop                 % invariant subspace spanned
        v_{k+1} = w / h_{k+1,k}                % new basis vector
    end for

The Arnoldi relation

The Arnoldi method can be summarised in a compact way. With the (k+1) x k upper Hessenberg matrix

    H̲_k ≡ [ h_{1,1}   h_{1,2}   ...       h_{1,k}   ]
          [ h_{2,1}   h_{2,2}   ...       h_{2,k}   ]
          [           ...       ...       ...       ]
          [                     h_{k,k-1} h_{k,k}   ]
          [    O                          h_{k+1,k} ]

and V_k ≡ [v_1, v_2, ..., v_k], we have the Arnoldi relation

    A V_k = V_{k+1} H̲_k.
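The loop above translates almost line by line into code. Below is a minimal NumPy sketch of the Arnoldi iteration (an illustration, not part of the slides; the function name `arnoldi` and the dense storage of H are assumptions):

```python
import numpy as np

def arnoldi(A, v1, m):
    """Minimal Arnoldi sketch: m steps with (modified) Gram-Schmidt.
    Returns V (n x (m+1)) with orthonormal columns and the (m+1) x m upper
    Hessenberg matrix H with A @ V[:, :m] = V @ H, unless an invariant
    subspace is found earlier (then truncated arrays are returned)."""
    n = len(v1)
    V = np.zeros((n, m + 1), dtype=complex)
    H = np.zeros((m + 1, m), dtype=complex)
    V[:, 0] = v1 / np.linalg.norm(v1)
    for k in range(m):
        w = A @ V[:, k]                      # expansion
        for i in range(k + 1):               # orthogonalisation
            H[i, k] = np.vdot(V[:, i], w)    # h_{i,k} = v_i^* w
            w = w - H[i, k] * V[:, i]
        H[k + 1, k] = np.linalg.norm(w)
        if H[k + 1, k] == 0:                 # invariant subspace spanned
            return V[:, :k + 1], H[:k + 1, :k + 1]
        V[:, k + 1] = w / H[k + 1, k]        # new basis vector
    return V, H
```

The work per step grows with k because of the full orthogonalisation loop; the next slides show how Hermitian A removes this cost.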


A is Hermitian

Let A be Hermitian. According to the Arnoldi relation we have

    V_k^* A V_k = V_k^* V_{k+1} H̲_k = H_k,

where H_k is the k x k upper block of H̲_k. Moreover, if A is Hermitian we have

    H_k^* = V_k^* A^* V_k = V_k^* A V_k = H_k.

So H_k is Hermitian and upper Hessenberg. This implies that H_k must be tridiagonal.


A is Hermitian (2)

So

    H_k = [ h_{1,1}   h_{1,2}              O         ]
          [ h_{2,1}   h_{2,2}   ...                  ]
          [           ...       ...       h_{k-1,k}  ]
          [    O                h_{k,k-1} h_{k,k}    ]

With α_k ≡ h_{k,k} = h̄_{k,k} and β_{k+1} ≡ h_{k+1,k} = h̄_{k,k+1}, the Arnoldi method simplifies to the (Hermitian) Lanczos method. With the Lanczos method it is possible to compute a new orthonormal basis vector using only the two previous basis vectors.

Lanczos' method

    Choose a starting vector v_1 with ||v_1||_2 = 1
    β_1 = 0, v_0 = 0                              % Initialization
    For k = 1, 2, ..., do                         % Iteration
        α_k = v_k^* A v_k
        w = A v_k - α_k v_k - β_k v_{k-1}         % New direction, orthogonal to the previous v's
        β_{k+1} = ||w||_2
        v_{k+1} = w / β_{k+1}                     % Normalization
    end for

Note that β_k > 0.
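For comparison with the pseudocode, here is a minimal NumPy sketch of the Lanczos recurrence (an illustration under stated assumptions, not taken from the slides; the function name and the choice to return the coefficients as arrays are mine). It is reused in some of the later sketches.

```python
import numpy as np

def lanczos(A, v1, m):
    """Minimal Lanczos sketch for Hermitian A: m steps of the three-term
    recurrence. Returns V (n x (m+1)), alpha (length m) and beta (length m+1),
    where beta[0] = 0 plays the role of beta_1 = 0 in the slides."""
    n = len(v1)
    V = np.zeros((n, m + 1), dtype=complex)
    alpha = np.zeros(m)
    beta = np.zeros(m + 1)
    V[:, 0] = v1 / np.linalg.norm(v1)
    for k in range(m):
        w = A @ V[:, k]
        alpha[k] = np.vdot(V[:, k], w).real      # v_k^* A v_k, real for Hermitian A
        w = w - alpha[k] * V[:, k]
        if k > 0:
            w = w - beta[k] * V[:, k - 1]        # subtract the previous basis vector only
        beta[k + 1] = np.linalg.norm(w)
        if beta[k + 1] == 0:                     # invariant subspace spanned
            return V[:, :k + 1], alpha[:k + 1], beta[:k + 2]
        V[:, k + 1] = w / beta[k + 1]            # normalisation
    return V, alpha, beta
```

In exact arithmetic the columns of V are orthonormal; in floating-point arithmetic this orthogonality is gradually lost, which is a well-known property of the Lanczos process.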


Lanczos' method (continued)

Let

    T̲_k = [ α_1   β_2               0       ]
          [ β_2   α_2   ...                 ]
          [       ...   ...         β_k     ]
          [  0          β_k         α_k     ]
          [  0                      β_{k+1} ]

and V_k = [v_1, v_2, ..., v_k]. Then the Lanczos relation reads

    A V_k = V_{k+1} T̲_k.

Eigenvalue methods

Arnoldi's and Lanczos' method were originally proposed as iterative methods to compute the eigenvalues of a matrix A:

    V_k^* A V_k = H_k ≡ T_k

is 'almost' a similarity transformation. The eigenvalues of H_k are called Ritz values (of A of order k), and

    H_k ~z = θ ~z   ⇔   A u - θ u ⊥ V_k,   where u ≡ V_k ~z.

u is a Ritz vector, (θ, u) is a Ritz pair.
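As a small numerical illustration (assumptions: the hypothetical `lanczos` sketch above, a random real symmetric test matrix, and the helper name `build_T_under` introduced here), one can assemble T̲_k from the computed coefficients, check the Lanczos relation, and obtain the Ritz values as the eigenvalues of the k x k upper block:

```python
import numpy as np

def build_T_under(alpha, beta):
    """Assemble the (m+1) x m tridiagonal matrix from Lanczos coefficients:
    alpha[0..m-1] on the diagonal, beta[1..m] on the sub/super-diagonals."""
    m = len(alpha)
    T = np.zeros((m + 1, m))
    for j in range(m):
        T[j, j] = alpha[j]
        if j > 0:
            T[j - 1, j] = beta[j]
            T[j, j - 1] = beta[j]
    T[m, m - 1] = beta[m]
    return T

rng = np.random.default_rng(0)
B = rng.standard_normal((60, 60))
A = B + B.T                                    # real symmetric test matrix
r0 = rng.standard_normal(60)

m = 12
V, alpha, beta = lanczos(A, r0, m)             # sketch from the Lanczos slide
Tu = build_T_under(alpha, beta)
print(np.linalg.norm(A @ V[:, :m] - V @ Tu))   # should be at rounding-error level
print(np.linalg.eigvalsh(Tu[:m, :]))           # Ritz values of A of order m
```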


Optimal approximations

The Lanczos method provides a cheap way to compute an orthogonal basis for the Krylov subspace K_k(A, r_0). Our approximations can be written as

    x_k = x_0 + V_k ~y_k,

where ~y_k is determined so that either the error

    ||x - x_k||_A = sqrt( (x - x_k)^* A (x - x_k) )

is minimised in A-norm (only meaningful if A is pos. def.), or that

    ||r_k||_2 = ||A(x - x_k)||_2 = sqrt( r_k^* r_k )

is minimised, i.e. the norm of the residual is minimised.

Optimal approximations (2)

We first look at the minimisation of the error in the A-norm:

    ||x - x_0 - V_k ~y_k||_A

is minimal iff e_k ≡ x - x_0 - V_k ~y_k ⊥_A V_k, or, equivalently, V_k ⊥ A e_k = r_0 - A V_k ~y_k. This yields

    V_k^* A V_k ~y_k = V_k^* r_0.

With T_k ≡ V_k^* A V_k, the k x k upper block of T̲_k, and r_0 = ||r_0||_2 v_1, we get

    T_k ~y_k = ||r_0||_2 e_1,

with e_1 the first canonical basis vector.
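As an illustration of this projected system (a sketch only; it reuses the hypothetical `lanczos` and `build_T_under` helpers from the earlier sketches and assumes A is HPD so that T_k is nonsingular), the A-norm-optimal iterate can be formed explicitly:

```python
import numpy as np

def lanczos_galerkin(A, b, x0, m):
    """Illustrative sketch of the A-norm-optimal (Galerkin) approximation from
    x0 + K_m(A, r0): solve the projected system T_m y = ||r0||_2 e1 explicitly.
    This stores all basis vectors; CG reaches the same iterate with short recurrences."""
    r0 = b - A @ x0
    V, alpha, beta = lanczos(A, r0, m)
    T = build_T_under(alpha, beta)[:m, :]        # k x k upper block T_m
    rhs = np.zeros(m)
    rhs[0] = np.linalg.norm(r0)                  # ||r0||_2 e1
    y = np.linalg.solve(T, rhs)                  # requires T_m nonsingular (A HPD)
    return x0 + (V[:, :m] @ y).real              # real part suffices for real symmetric A
```

The storage of all of V is exactly the problem the next slides remove.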


Optimal approximations (3)

In particular, we have that the residuals are orthogonal to the basis vectors:

    r_k = A e_k = r_0 - A V_k ~y_k ⊥ V_k.

Since the r_k's are orthogonal, each residual is just a multiple of the corresponding basis vector v_{k+1}: r_k and v_{k+1} are collinear. This also means that the residuals form an orthogonal Krylov basis for the Krylov subspace.

Towards a practical algorithm

The main problem in this approach is that all v_i (or r_{i-1}) have to be stored to compute x_k. This problem can be overcome by making an implicit LU-factorisation, T_k = L_k U_k, of T_k, and updating

    x_k = x_0 + V_k U_k^{-1} L_k^{-1} ( ||r_0||_2 e_1 )

in every iteration. (Recall that r_0 = ||r_0||_2 v_1.) Details can be found in the book of Van der Vorst.

With this technique we get the famous and very elegant Conjugate Gradient method.


The Conjugate Gradient method

    r_0 = b - A x_0,  u_{-1} = 0,  ρ_{-1} = 1     % Initialization
    For k = 0, 1, ..., do
        ρ_k = r_k^* r_k,  β_k = ρ_k / ρ_{k-1}
        u_k = r_k + β_k u_{k-1}                   % Update direction vector
        c_k = A u_k
        σ_k = u_k^* c_k,  α_k = ρ_k / σ_k
        x_{k+1} = x_k + α_k u_k                   % Update iterate
        r_{k+1} = r_k - α_k c_k                   % Update residual
    end for

The Conjugate Gradient method (implementation)

    x = x_0,  r = b - A x,  u = 0,  ρ = 1         % Initialization
    For k = 0, 1, ..., k_max do
        ρ' = ρ,  ρ = r^* r,  β = ρ / ρ'
        Stop if ρ ≤ tol
        u ← r + β u                               % Update direction vector
        c = A u
        σ = u^* c,  α = ρ / σ
        x ← x + α u                               % Update iterate
        r ← r - α c                               % Update residual
    end for

Note that, in our notation, the CG α_j and β_j are not the same as the Lanczos α_j and β_j.
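The second, index-free version is essentially what one implements. A minimal NumPy sketch of it (illustrative; the function name, the stopping test and the default parameters are assumptions):

```python
import numpy as np

def cg(A, b, x0, tol=1e-10, kmax=1000):
    """Minimal Conjugate Gradient sketch for Hermitian positive definite A."""
    x = x0.copy()
    r = b - A @ x                      # initial residual
    u = np.zeros_like(b)               # direction vector
    rho = 1.0
    for k in range(kmax):
        rho_old = rho
        rho = np.vdot(r, r).real       # rho = r^* r
        if np.sqrt(rho) <= tol:        # stop on the residual norm
            break
        beta = rho / rho_old           # the first beta is irrelevant since u = 0
        u = r + beta * u               # update direction vector
        c = A @ u
        sigma = np.vdot(u, c).real     # sigma = u^* A u
        alpha = rho / sigma
        x = x + alpha * u              # update iterate
        r = r - alpha * c              # update residual
    return x
```

Only a handful of vectors are kept, in line with the memory property on the next slide.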


Properties of CG

CG has several favourable properties:
• The method uses limited memory: only three vectors need to be stored;
• The method is optimal: the error is minimised in A-norm;
• The method is finite: the (n+1)st residual must be zero since all the residuals are orthogonal;
• The method is robust (if A is HPD): σ_k ≡ u_k^* A u_k = 0 and ρ_k ≡ r_k^* r_k = 0 both imply that the true solution has been found (that r_k = 0);
• u_i^* A u_j = 0 and r_i^* r_j = 0 for i ≠ j.

Lanczos matrix and CG

Since CG and Lanczos are mathematically equivalent, it should be possible to recover the Lanczos matrix T_k from the CG iteration parameters. This is indeed the case:

    T_k = [ 1/α_0              √β_1/α_0                                               ]
          [ √β_1/α_0           1/α_1 + β_1/α_0     √β_2/α_1                            ]
          [                    ...                 ...               ...               ]
          [                                        ...               √β_{k-1}/α_{k-2}  ]
          [                                        √β_{k-1}/α_{k-2}  1/α_{k-1} + β_{k-1}/α_{k-2} ]

Note that the CG β_0 is not in the matrix (β_0 is meaningless anyway since u_{-1} = 0).
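A sketch of this correspondence (illustrative; it assumes the α_j and β_j produced during a CG run have been recorded, which the cg sketch above would need a small modification to do):

```python
import numpy as np

def lanczos_matrix_from_cg(alphas, betas):
    """Assemble the k x k Lanczos tridiagonal from recorded CG coefficients
    alphas = [alpha_0, ..., alpha_{k-1}] and betas = [beta_1, ..., beta_{k-1}]
    (beta_0 is not used), following the formula on the slide."""
    k = len(alphas)
    T = np.zeros((k, k))
    T[0, 0] = 1.0 / alphas[0]
    for j in range(1, k):
        T[j, j] = 1.0 / alphas[j] + betas[j - 1] / alphas[j - 1]
        off = np.sqrt(betas[j - 1]) / alphas[j - 1]
        T[j, j - 1] = off
        T[j - 1, j] = off
    return T
```

The eigenvalues of this matrix are the Ritz values, so a CG run can double as a cheap eigenvalue estimator for A.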

Conjugate Gradient Error Analysis

Since e_k ≡ x - x_k = e_0 - V_k ~y_k with V_k ~y_k ∈ K_k(A, r_0), and r_0 = A e_0, we see that

    e_k = e_0 - γ_1 A e_0 - γ_2 A^2 e_0 - ... - γ_k A^k e_0 = p_k(A) e_0.

So, e_k is a degree k polynomial p_k in A times e_0, with p_k(0) = 1.

CG minimises ||e_k||_A = ||e_0 - V_k ~y_k||_A with V_k ~y_k ∈ K_k(A, r_0). Consequently,

    ||e_k||_A = ||p_k(A) e_0||_A ≤ ||p̃_k(A) e_0||_A

for all degree k polynomials p̃_k with p̃_k(0) = 1. Here we used that p̃_k(A) e_0 is also an error, i.e., of the form p̃_k(A) e_0 = e_0 - V_k ỹ_k.

A convergence bound for CG

The iterates x_k obtained from the CG algorithm satisfy the following inequality:

    ||x - x_k||_A / ||x - x_0||_A ≤ 2 ( (√C_2 - 1) / (√C_2 + 1) )^k ≤ 2 exp( -2k / √C_2 ).

C_2 is the 2-condition number of A, which for HPD matrices is

    C_2 ≡ λ_max(A) / λ_min(A).
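As a small worked illustration of this bound (the function name and the chosen numbers are illustrative, not from the slides):

```python
import numpy as np

def cg_bound(kappa2, k):
    """Upper bound 2 * ((sqrt(kappa2) - 1) / (sqrt(kappa2) + 1))**k on the
    relative A-norm error ||x - x_k||_A / ||x - x_0||_A after k CG steps."""
    s = np.sqrt(kappa2)
    return 2.0 * ((s - 1.0) / (s + 1.0)) ** k

# The bound 2*exp(-2k/sqrt(C_2)) suggests about sqrt(C_2)/2 * ln(2/eps) steps
# for a relative error eps; e.g. C_2 = 1e4 and eps = 1e-8 give roughly 1000 steps:
print(cg_bound(1e4, 1000))    # approximately 4e-9
```

As the next slides stress, this worst-case estimate is often far too pessimistic in practice.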


Proof (sketch)

The error e_k = x - x_k can be written as p_k(A) e_0 with p_k a polynomial such that p_k(0) = 1. Hence, since CG is optimal,

    ||e_k||_A = ||p_k(A) e_0||_A ≤ ||p̃_k(A) e_0||_A   for all p̃_k with p̃_k(0) = 1.

Since, in this Hermitian case, there is an orthonormal basis of eigenvectors of A, we have

    ||p̃_k(A) e_0||_A ≤ max_i |p̃_k(λ_i(A))| · ||e_0||_A.

The convergence bound can now be proved by taking for p̃_k a scaled Chebyshev polynomial that is transformed (shifted and scaled) to the interval [λ_min, λ_max].

Superlinear convergence

The upper bound on the CG error is in practice very pessimistic. Typically the rate of convergence increases during the process. This is called superlinear convergence.

[Figure: residual norm (log scale, 10^2 down to 10^{-12}) versus iteration number (0 to 120).]


Superlinear convergence (2)

Zeros of the CG polynomial p_k are Ritz values (of A of order k, i.e., the eigenvalues of T_k).

Proof. If p_k(θ) = 0, then

    p_k(λ) = (θ - λ) q(λ)   for some degree k-1 polynomial q.

That is,

    e_k = (θ I - A) q(A) e_0   ⇒   r_k = (θ I - A) q(A) r_0.   (*)

Since u ≡ q(A) r_0 = V_k ~z_k for some k-vector ~z_k, we have that

    (θ I - A) u = A e_k = r_k ⊥ V_k   and   u ∈ span(V_k) = K_k(A, r_0),

whence (θ, u) is a Ritz pair. A counting argument now shows that Ritz values are zeros of the CG polynomial.

Superlinear convergence (3)

Superlinear convergence occurs when Ritz values converge to extreme eigenvalues. Then the component in the direction of that eigenvector is found.

Explanation. If, in addition, θ ≈ λ_j, then (*) shows that the eigenvector w_j of A associated with λ_j, i.e., A w_j = λ_j w_j, is (≈) 'deflated' from e_k and from r_k = A e_k. Hence, both the error and the residual have (≈) zero component in the direction of that eigenvector (w_j).

Note. Assume 0 < λ_1 < λ_2 < ... < λ_n. When is θ sufficiently close to λ_1? It can be shown that superlinear convergence is noticeable already if θ < λ_2.


Superlinear convergence (3), continued

Convergence of CG from then on is determined by the reduced spectrum, from which converged eigenvalues have been removed.

Explanation.

    ||p_{k+ℓ}(A) e_0||_A ≤ ||p̃_ℓ(A) ẽ_k||_A = ||p̃_ℓ(Ã) ẽ_k||_Ã,

with the deflated matrix Ã ≡ (I - w_j w_j^* / (w_j^* w_j)) A; here we used that Ã ẽ_k = A ẽ_k.

Superlinear convergence (4)

Convergence towards extreme eigenvalues goes faster if they are isolated (and the other eigenvalues are clustered).

Explanation (Heuristics). Again, consider the case where 0 < λ_1 < λ_2 < ... < λ_n. The shifted power method with A - σ I converges faster towards the eigenvector with eigenvalue λ_1 if the spectral gap (λ_2 - λ_1)/(λ_n - λ_1) is larger, because then, with σ ≡ (λ_1 + λ_n)/2, the reduction factor (λ_2 - σ)/(λ_n - σ) of the shifted power method is smaller. Since K_k(A, r_0) = K_k(A - σ I, r_0), the Lanczos method will do better than the shifted power method: θ_1 will converge faster towards λ_1 if the gap between λ_1 and the other eigenvalues is larger.


CG convergence: Sparse Spectrum

[Figure: the CG polynomial p_10(λ) for λ in [0, 4] (vertical axis from -1 to 1), for a sparse spectrum.]

CG convergence: small eigenvalues

Note that the speed of convergence towards an extreme eigenvalue also depends on the size of the component of the associated eigenvector in the initial r_0.

In the example, b was obtained as b = A x with x = 1 (the vector of all ones) and A diagonal. In practice, x often has moderate components also in the direction of eigenvectors with small eigenvalues. The small eigenvalues lead to small components of r_0 = b (we took x_0 = 0) in the direction of the associated eigenvectors. This explains the relatively slow convergence, in this example, towards the small eigenvalue (as compared to the largest eigenvalue), and it is in line with the steepness of the polynomial near zero.


CG convergence: Dense Spectrum

[Figure: the CG polynomial p_10(λ) for λ in [0, 4] (vertical axis from -1 to 1), for a dense spectrum.]

Superlinear convergence (4), continued

Similar observations hold (with similar arguments) for optimal Krylov methods, such as GMRES, FOM and GCR, for general matrices (though, for general matrices, the situation can be obscured by very skew eigenvectors, i.e., an ill-conditioned basis of eigenvectors).

Minimising the residuals

CG minimises the A-norm of the error. As we have seen before, another way to construct optimal approximations x_k is to minimise the residual, i.e. minimise

    ||A(x - x_k)||_2 = sqrt( r_k^* r_k )

(as in GMRES) over all x_k ∈ x_0 + K_k(A, r_0). Before we solve this minimisation problem, recall the Lanczos relation.

MINimal RESiduals

The problem is: find x_k = x_0 + V_k ~y_k such that ||r_k||_2 is minimal. Now

    r_k = b - A x_k = r_0 - A V_k ~y_k = ||r_0||_2 v_1 - A V_k ~y_k.

Hence,

    ||r_k||_2 = || ||r_0||_2 v_1 - A V_k ~y_k ||_2
              = || ||r_0||_2 V_{k+1} e_1 - V_{k+1} T̲_k ~y_k ||_2
              = || V_{k+1} ( ||r_0||_2 e_1 - T̲_k ~y_k ) ||_2
              = || ||r_0||_2 e_1 - T̲_k ~y_k ||_2.
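The next slides solve this small least-squares problem with a QR factorisation and short recurrences. As a naive illustration of the same idea (a sketch only, reusing the hypothetical `lanczos` and `build_T_under` helpers from earlier; this is not how MINRES is actually implemented):

```python
import numpy as np

def minres_naive(A, b, x0, m):
    """Illustrative sketch: the minimal-residual iterate from x0 + K_m(A, r0),
    obtained by solving || ||r0||_2 e1 - T y ||_2 = min (T the (m+1) x m
    tridiagonal) with a dense least-squares solve."""
    r0 = b - A @ x0
    V, alpha, beta = lanczos(A, r0, m)
    Tu = build_T_under(alpha, beta)          # (m+1) x m
    rhs = np.zeros(m + 1)
    rhs[0] = np.linalg.norm(r0)              # ||r0||_2 e1
    y, *_ = np.linalg.lstsq(Tu, rhs, rcond=None)
    return x0 + (V[:, :m] @ y).real          # real part suffices for real symmetric A
```

A production implementation updates the QR factorisation of the tridiagonal with Givens rotations and keeps only a few vectors, as described next.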


MINRES (2)

Solving the small overdetermined system

    T̲_k ~y_k = ||r_0||_2 e_1

in the least-squares sense provides iterates

    x_k = x_0 + V_k ~y_k

that minimise the residual. The algorithm that results from solving the small system in least-squares sense using the QR-factorisation T̲_k = Q̲_k R_k is called MINRES:

    x_k = x_0 + ( V_k R_k^{-1} ) ( Q̲_k^* ( ||r_0||_2 e_1 ) ).

The placing of the brackets allows short recurrences.

MINRES and GMRES

CG can be viewed as the Hermitian variant of FOM, while MINRES is the Hermitian variant of GMRES. For symmetric (HPD) matrices, CG is mathematically equivalent to FOM (i.e., in exact arithmetic, they have the same residuals in the same steps); MINRES is mathematically equivalent to GMRES.

Advantage. Fast convergence (smallest residuals).

Exploiting symmetry allows implementations with short recurrences: Lanczos combined with (V_k U_k^{-1}) (L_k^{-1} e_1) rather than with V_k ( U_k^{-1} ( L_k^{-1} e_1 ) ).

Advantage. Highly efficient steps, low (fixed) demands on memory.

Disadvantage. More sensitive to perturbations.


Conjugate Residuals

CG has been designed for Hermitian positive definite matrices. In a previous lecture we saw GCR for general matrices. The Hermitian variant is CR; it leads to minimal residuals also if A is Hermitian indefinite.

As GCR is mathematically equivalent to GMRES, CR is mathematically equivalent to MINRES.

(G)CR can break down in the indefinite case, while MINRES is robust. MINRES is (slightly?) more expensive per step (six vector updates versus four for CR) and needs more memory.

Conjugate Residuals (2)

    r_0 = b - A x_0
    u_{-1} = 0,  c_{-1} = 0,  ρ_{-1} = 1          % Initialization
    for k = 0, 1, ..., do
        u_k = r_k,  c_k = A u_k
        ρ_k = u_k^* c_k,  β_k = ρ_k / ρ_{k-1}
        u_k ← u_k + β_k u_{k-1}                   % Update direction vector
        c_k ← c_k + β_k c_{k-1}                   % to avoid extra MVs
        σ_k = c_k^* c_k,  α_k = ρ_k / σ_k
        x_{k+1} = x_k + α_k u_k                   % Update iterate
        r_{k+1} = r_k - α_k c_k                   % Update residual
    end for
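A minimal NumPy sketch of this CR loop (illustrative; the function name, the stopping rule and the defaults are assumptions):

```python
import numpy as np

def cr(A, b, x0, tol=1e-10, kmax=1000):
    """Minimal Conjugate Residual sketch for Hermitian A (also indefinite,
    as long as rho = r^* A r stays nonzero)."""
    x = x0.copy()
    r = b - A @ x
    u = np.zeros_like(b)               # direction vector
    c = np.zeros_like(b)               # c = A u
    rho = 1.0
    for k in range(kmax):
        if np.linalg.norm(r) <= tol:
            break
        rho_old = rho
        w = A @ r                      # the only matrix-vector product per step
        rho = np.vdot(r, w).real       # rho = r^* A r
        beta = rho / rho_old
        u = r + beta * u               # update direction vector
        c = w + beta * c               # c = A u, without an extra MV
        sigma = np.vdot(c, c).real     # sigma = c^* c
        alpha = rho / sigma
        x = x + alpha * u              # update iterate
        r = r - alpha * c              # update residual
    return x
```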


Conjugate Residuals (3)

Like CG, CR has many favourable properties:
• The method uses limited memory: only four vectors need to be stored;
• The method is optimal: the residual is minimised;
• The method is finite: the n-th residual must be zero since it is optimal over the whole space;
• The method is robust if A is HPD; otherwise ρ_k = r_k^* A r_k may be zero for some nonzero r_k.

CR is less popular than CG since minimising the A-norm of the error is often more natural. CG is also slightly cheaper.

SYMMetric LQ

It is natural to try to minimise the true error e_k ≡ x - x_k. In the method SYMMLQ this is achieved by computing approximate solutions of the form

    x_k = x_0 + A V_k ~y_k = x_0 + V_{k+1} T̲_k ~y_k

(A V_k ~y_k rather than V_k ~y_k as before) and minimising

    ||x - x_k||_2 = ||x - x_0 - A V_k ~y_k||_2

with respect to ~y_k, or, equivalently, solving

    (A V_k)^* ( e_0 - A V_k ~y_k ) = 0   ⇔   (A V_k)^* A V_k ~y_k = V_k^* A e_0 = ||r_0||_2 e_1.


SYMMLQ (2)

Using the Lanczos relation we get

    (A V_k)^* A V_k ~y_k = (V_{k+1} T̲_k)^* (V_{k+1} T̲_k) ~y_k = ||r_0||_2 e_1,

or

    T̲_k^* ( T̲_k ~y_k ) = ||r_0||_2 e_1.

This problem is solved with the LQ-factorisation of T̲_k^* (which is obtained by taking the conjugate transpose of the QR-factorisation of T̲_k):

    x_k = x_0 + V_{k+1} T̲_k ~y_k = x_0 + V_{k+1} ( Q̲_k R_k^{-*} ( ||r_0||_2 e_1 ) ).

SYMMLQ is a stable method for solving symmetric indefinite linear systems.

SYMMLQ (3): the naming

If ~y_k solves T̲_k^* (T̲_k ~y_k) = ||r_0||_2 e_1, then ~z_{k+1} ≡ T̲_k ~y_k solves

    T̲_k^* ~z_{k+1} = ||r_0||_2 e_1

in the least-norm sense, i.e., among all solutions, T̲_k ~y_k is the one with smallest 2-norm. This problem is solved with the QR-factorisation of T̲_k, yielding

    x_k = x_0 + V_{k+1} ~z_{k+1}   with   ~z_{k+1} = Q̲_k ( R_k^{-*} ( ||r_0||_2 e_1 ) ).

This explains the naming for this method.

In MINRES, ~y_k is the least-squares solution of T̲_k ~y_k = ||r_0||_2 e_1:

    x_k = x_0 + V_k ~y_k   with   ~y_k = R_k^{-1} ( Q̲_k^* ( ||r_0||_2 e_1 ) ).
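As an illustration of these formulas (a sketch only, reusing the hypothetical `lanczos` and `build_T_under` helpers; it relies on the fact that numpy.linalg.lstsq returns the minimum-norm solution of an underdetermined system, and it is not the actual SYMMLQ recurrence):

```python
import numpy as np

def symmlq_naive(A, b, x0, m):
    """Illustrative sketch of the SYMMLQ construction: x_m = x_0 + V_{m+1} z with
    z the minimum-norm solution of T^* z = ||r0||_2 e1, T the (m+1) x m tridiagonal.
    The real method forms this via an LQ factorisation and short recurrences."""
    r0 = b - A @ x0
    V, alpha, beta = lanczos(A, r0, m)
    Tu = build_T_under(alpha, beta)                          # (m+1) x m
    rhs = np.zeros(m)
    rhs[0] = np.linalg.norm(r0)                              # ||r0||_2 e1 (length m)
    z, *_ = np.linalg.lstsq(Tu.T.conj(), rhs, rcond=None)    # minimum-norm solution
    return x0 + (V @ z).real                                 # real part suffices for real symmetric A
```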


Concluding remarks

Today we discussed Krylov methods for symmetric systems. These methods combine an optimal error reduction with short recurrences, and hence limited memory requirements.

Last week we discussed GMRES, which solves nonsymmetric problems while minimising the norm of the residual over the Krylov subspace. For this you need to store all the basis vectors.

In the next lessons we will investigate methods for nonsymmetric systems that only require a limited number of vectors, similar to the methods we discussed today. However, as we will see, these methods do not minimise an error norm.
