AM 205: lecture 24

- Last time: multigrid methods
- Today: Krylov subspace methods
- Q guide is now open; please complete it
- This is the final lecture. If you’d like to continue with more advanced topics, take a look at Applied Math 225 in Spring 2019

Krylov Subspace Methods

We now give an overview of the role of Krylov¹ subspace methods in Scientific Computing

Given a matrix A and a vector b, a Krylov sequence is the set of vectors {b, Ab, A²b, A³b, ...}

The corresponding Krylov subspaces are the spaces spanned by successive groups of these vectors

K_m(A, b) ≡ span{b, Ab, A²b, ..., A^{m−1}b}

¹Aleksey Krylov, 1863–1945, wrote a paper on this idea in 1931
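As a purely illustrative sketch (not part of the lecture code), the matrix whose columns are the first m Krylov vectors can be formed with nothing but matrix–vector products; Krylov methods never build it explicitly, but its column span is K_m(A, b):

```python
import numpy as np

def krylov_matrix(A, b, m):
    """Illustrative: return the n x m matrix [b, Ab, A^2 b, ..., A^{m-1} b].
    Real Krylov methods never form this matrix; they only apply A to vectors."""
    K = np.empty((len(b), m))
    K[:, 0] = b
    for j in range(1, m):
        K[:, j] = A @ K[:, j - 1]   # next Krylov vector via one matvec
    return K
```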

Krylov subspaces are the basis for iterative methods for eigenvalue problems (and also for solving linear systems)

An important advantage: Krylov methods do not deal directly with A, but rather with matrix–vector products involving A

This is particularly helpful when A is large and sparse, since matrix–vector multiplications are relatively cheap

Also, the Krylov sequence is closely related to power iteration, hence it is not surprising that it is useful for solving eigenproblems

Arnoldi Iteration

We define a matrix as being in Hessenberg form in the following way:

- A is called upper-Hessenberg if a_{ij} = 0 for all i > j + 1
- A is called lower-Hessenberg if a_{ij} = 0 for all j > i + 1

The Arnoldi iteration is a Krylov subspace method that reduces A to upper-Hessenberg form

As we’ll see, we can then use this simpler form to approximate some eigenvalues of A
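As a tiny illustration of the definition above (the helper name is made up for this sketch), upper-Hessenberg structure simply means that everything strictly below the first subdiagonal vanishes:

```python
import numpy as np

def is_upper_hessenberg(A, tol=1e-12):
    """Sketch: A is upper Hessenberg iff a_ij = 0 for all i > j + 1,
    i.e. all entries strictly below the first subdiagonal are (numerically) zero."""
    return bool(np.all(np.abs(np.tril(A, -2)) < tol))
```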

For A ∈ C^{n×n}, we want to compute A = QHQ∗, where H is upper Hessenberg and Q is unitary (i.e. QQ∗ = I)

However, we suppose that n is huge! Hence we do not try to compute the full factorization

Instead, let us consider just the first m ≪ n columns of the factorization AQ = QH

Therefore, on the left-hand side, we only need the matrix Q_m ∈ C^{n×m}:

\[
Q_m = \begin{bmatrix} q_1 & q_2 & \cdots & q_m \end{bmatrix}
\]

On the right-hand side, we only need the first m columns of H

More specifically, due to the upper-Hessenberg structure, we only need H̃_m, which is the (m + 1) × m upper-left section of H:

\[
\tilde{H}_m = \begin{bmatrix}
h_{11} & h_{12} & \cdots & h_{1m} \\
h_{21} & h_{22} & \cdots & h_{2m} \\
       & \ddots & \ddots & \vdots \\
       &        & h_{m,m-1} & h_{mm} \\
       &        &           & h_{m+1,m}
\end{bmatrix}
\]

H̃_m only interacts with the first m + 1 columns of Q, hence we have

AQ_m = Q_{m+1} H̃_m

\[
A \begin{bmatrix} q_1 & \cdots & q_m \end{bmatrix}
= \begin{bmatrix} q_1 & \cdots & q_{m+1} \end{bmatrix}
\begin{bmatrix}
h_{11} & \cdots & h_{1m} \\
h_{21} & \cdots & h_{2m} \\
       & \ddots & \vdots \\
       &        & h_{m+1,m}
\end{bmatrix}
\]

The mth column can be written as

Aq_m = h_{1m}q_1 + ··· + h_{mm}q_m + h_{m+1,m}q_{m+1}

Or, equivalently

q_{m+1} = (Aq_m − h_{1m}q_1 − ··· − h_{mm}q_m)/h_{m+1,m}

Arnoldi iteration is just the Gram–Schmidt method that constructs the h_{ij} and the (orthonormal) vectors q_j, j = 1, 2, ...

1: choose b arbitrarily, then q_1 = b/‖b‖_2
2: for m = 1, 2, 3, ... do
3:   v = Aq_m
4:   for j = 1, 2, ..., m do
5:     h_{jm} = q_j^∗ v
6:     v = v − h_{jm} q_j
7:   end for
8:   h_{m+1,m} = ‖v‖_2
9:   q_{m+1} = v/h_{m+1,m}
10: end for

This is akin to the modified Gram–Schmidt method because the updated vector v is used in line 5 (vs. the “raw vector” Aq_m)

Also, we only need to evaluate Aq_m and perform some vector operations in each iteration
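A minimal Python sketch of this iteration for real A (function and variable names are ours, not from the lecture code; a complex version would use conjugated inner products, and the sketch assumes no breakdown, i.e. h_{m+1,m} ≠ 0):

```python
import numpy as np

def arnoldi(A, b, m):
    """Sketch of Arnoldi iteration: build orthonormal q_1, ..., q_{m+1} and the
    (m+1) x m upper-Hessenberg matrix H such that A Q_m = Q_{m+1} H."""
    n = len(b)
    Q = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    Q[:, 0] = b / np.linalg.norm(b)
    for k in range(m):
        v = A @ Q[:, k]                  # the only use of A: one matvec per step
        for j in range(k + 1):           # modified Gram-Schmidt against q_1..q_{k+1}
            H[j, k] = Q[:, j] @ v        # uses the updated v, as in line 5 above
            v = v - H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(v)
        Q[:, k + 1] = v / H[k + 1, k]    # assumes no breakdown
    return Q, H
```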

The Arnoldi iteration is useful because the q_j form orthonormal bases of the successive Krylov spaces

K_m(A, b) = span{b, Ab, ..., A^{m−1}b} = span{q_1, q_2, ..., q_m}

We expect K_m(A, b) to provide good information about the dominant eigenvalues/eigenvectors of A

Note that this looks similar to the QR algorithm, but the QR algorithm was based on QR factorization of

\[
\begin{bmatrix} Ae_1 & Ae_2 & \cdots & Ae_n \end{bmatrix}
\]

Question: How do we find eigenvalues from the Arnoldi iteration?

Let H_m = Q_m^∗ A Q_m be the m × m matrix obtained by removing the last row from H̃_m

Answer: At each step m, we compute the eigenvalues of H_m (via, say, the QR algorithm)²

This provides estimates for m eigenvalues/eigenvectors (m ≪ n) called Ritz values, Ritz vectors, respectively

Just as with the power method, the Ritz values will typically converge to extreme eigenvalues of the spectrum

²This is how eigs in Python/Matlab works
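For example, with the hypothetical arnoldi sketch from above, the Ritz values are just the eigenvalues of H_m (H̃_m with its last row removed); scipy's ARPACK-based eigs, itself a restarted Arnoldi method, can serve as a reference (the test matrix here is an arbitrary example with two isolated eigenvalues):

```python
import numpy as np
from scipy.sparse.linalg import eigs

# Test matrix: a cluster of eigenvalues in [0, 2] plus isolated ones at 2.5 and 3
rng = np.random.default_rng(0)
A = np.diag(np.concatenate([np.linspace(0.0, 2.0, 498), [2.5, 3.0]]))
b = rng.standard_normal(500)

Q, H = arnoldi(A, b, 30)                    # sketch defined earlier
ritz = np.linalg.eigvals(H[:-1, :])         # eigenvalues of the 30 x 30 H_m
print(np.sort(ritz.real)[-2:])              # extreme Ritz values: ~ [2.5, 3.0]

# Reference: dominant eigenvalues via ARPACK (restarted Arnoldi)
print(eigs(A, k=2, which='LM', return_eigenvectors=False))
```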

We now examine why eigenvalues of Hm approximate extreme eigenvalues of A

Let P^m_monic denote the monic³ polynomials of degree m

Theorem: The characteristic polynomial of H_m is the unique solution of the approximation problem: find p ∈ P^m_monic such that

‖p(A)b‖_2 = minimum

Proof: See Trefethen & Bau

³Recall that a monic polynomial has the coefficient of its highest-order term equal to 1

This theorem implies that Ritz values (i.e. eigenvalues of H_m) are the roots of the optimal polynomial

p∗ = arg min_{p ∈ P^m_monic} ‖p(A)b‖_2

Now, let’s consider what p∗ should look like in order to minimize ‖p(A)b‖_2

We can illustrate the important ideas with a simple case; suppose:
- A has only m (≪ n) distinct eigenvalues
- b = Σ_{j=1}^{m} α_j v_j, where v_j is an eigenvector corresponding to λ_j

Then, for p ∈ P^m_monic, we have

p(x) = c_0 + c_1 x + c_2 x² + ··· + x^m

for some coefficients c_0, c_1, ..., c_{m−1}

Applying this polynomial to a matrix A gives

\[
\begin{aligned}
p(A)b &= \left(c_0 I + c_1 A + c_2 A^2 + \cdots + A^m\right) b \\
&= \sum_{j=1}^{m} \alpha_j \left(c_0 I + c_1 A + c_2 A^2 + \cdots + A^m\right) v_j \\
&= \sum_{j=1}^{m} \alpha_j \left(c_0 + c_1 \lambda_j + c_2 \lambda_j^2 + \cdots + \lambda_j^m\right) v_j \\
&= \sum_{j=1}^{m} \alpha_j\, p(\lambda_j)\, v_j
\end{aligned}
\]

Then the polynomial p∗ ∈ P^m_monic with roots at λ_1, λ_2, ..., λ_m minimizes ‖p(A)b‖_2, since ‖p∗(A)b‖_2 = 0

Hence, in this simple case the Arnoldi method finds p∗ after m iterations

The Ritz values after m iterations are then exactly the m distinct eigenvalues of A
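A quick numerical sanity check of this special case (purely illustrative; the eigenvalues 1, 2, 3 are arbitrary): the monic polynomial with roots at the distinct eigenvalues annihilates b, so ‖p∗(A)b‖_2 = 0:

```python
import numpy as np

# A has only the eigenvalues 1, 2, 3; any b is a combination of its eigenvectors,
# so p*(A) b = 0 for the monic polynomial p*(x) = (x - 1)(x - 2)(x - 3).
lam = np.array([1.0, 2.0, 3.0])
A = np.diag(lam)                          # eigenvectors are e_1, e_2, e_3
b = np.random.default_rng(0).standard_normal(3)
I = np.eye(3)
p_star_A = (A - lam[0] * I) @ (A - lam[1] * I) @ (A - lam[2] * I)
print(np.linalg.norm(p_star_A @ b))       # 0.0 (up to rounding)
```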

Suppose now that there are more than m distinct eigenvalues (as is generally the case in practice)

It is intuitive that in order to minimize ‖p(A)b‖_2, p∗ should have roots close to the dominant eigenvalues of A

Also, we expect Ritz values to converge more rapidly for extreme eigenvalues that are well-separated from the rest of the spectrum

(We’ll see a concrete example of this for a diagonal matrix A shortly)

Lanczos Iteration

Lanczos iteration is the Arnoldi iteration in the special case that A is hermitian

However, we obtain some significant computational savings in this special case

Let us suppose for simplicity that A is symmetric with real entries, and hence has real eigenvalues

Then H_m = Q_m^T A Q_m is also symmetric ⟹ Ritz values (i.e. eigenvalue estimates) are also real

Also, we can show that H_m is tridiagonal: consider the (i, j) entry of H_m, h_{ij} = q_i^T A q_j

Recall first that {q_1, q_2, ..., q_j} is an orthonormal basis for K_j(A, b)

Then we have Aq_j ∈ K_{j+1}(A, b) = span{q_1, q_2, ..., q_{j+1}}, and hence h_{ij} = q_i^T(Aq_j) = 0 for i > j + 1 since

q_i ⊥ span{q_1, q_2, ..., q_{j+1}}, for i > j + 1

Also, since H_m is symmetric, we have h_{ij} = h_{ji} = q_j^T(Aq_i), which implies h_{ij} = 0 for j > i + 1, by the same reasoning as above

Since H_m is now tridiagonal, we shall write it as

\[
T_m = \begin{bmatrix}
\alpha_1 & \beta_1  &          &             &             \\
\beta_1  & \alpha_2 & \beta_2  &             &             \\
         & \beta_2  & \alpha_3 & \ddots      &             \\
         &          & \ddots   & \ddots      & \beta_{m-1} \\
         &          &          & \beta_{m-1} & \alpha_m
\end{bmatrix}
\]

The consequence of tridiagonality: Lanczos iteration is much cheaper than Arnoldi iteration!

The inner loop in Lanczos iteration only runs from m − 1 to m, instead of 1 to m as in Arnoldi

This is due to the three-term recurrence at step m:

Aq_m = β_{m−1} q_{m−1} + α_m q_m + β_m q_{m+1}

(This follows from our discussion of the Arnoldi case, with T̃_m replacing H̃_m)

As before, we rearrange this to give

q_{m+1} = (Aq_m − β_{m−1} q_{m−1} − α_m q_m)/β_m

Which leads to the Lanczos iteration

1: β_0 = 0, q_0 = 0
2: choose b arbitrarily, then q_1 = b/‖b‖_2
3: for m = 1, 2, 3, ... do
4:   v = Aq_m
5:   α_m = q_m^T v
6:   v = v − β_{m−1} q_{m−1} − α_m q_m
7:   β_m = ‖v‖_2
8:   q_{m+1} = v/β_m
9: end for
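A minimal Python sketch of this loop for real symmetric A (names are ours, not from the lecture code; it assumes no breakdown, i.e. β_m ≠ 0):

```python
import numpy as np

def lanczos(A, b, m):
    """Sketch of Lanczos iteration for symmetric A: orthonormal q_1, ..., q_{m+1}
    plus the diagonal (alpha) and off-diagonal (beta) entries of T_m."""
    n = len(b)
    Q = np.zeros((n, m + 1))
    alpha = np.zeros(m)
    beta = np.zeros(m)
    Q[:, 0] = b / np.linalg.norm(b)
    q_prev, beta_prev = np.zeros(n), 0.0
    for k in range(m):
        v = A @ Q[:, k]
        alpha[k] = Q[:, k] @ v
        v = v - beta_prev * q_prev - alpha[k] * Q[:, k]   # three-term recurrence only
        beta[k] = np.linalg.norm(v)
        Q[:, k + 1] = v / beta[k]                          # assumes no breakdown
        q_prev, beta_prev = Q[:, k], beta[k]
    # Tridiagonal T_m: diagonal alpha, off-diagonals beta_1, ..., beta_{m-1};
    # its eigenvalues (e.g. np.linalg.eigvalsh(T)) are the Ritz values
    T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
    return Q, T
```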

Python demo: Lanczos iteration for a diagonal matrix

[Figure: the optimal polynomial p after m = 10 and m = 20 Lanczos iterations, plotted over x ∈ [−0.5, 3.5] with vertical range [−4, 4]]

We can see that Lanczos minimizes ‖p(A)b‖_2:
- p is uniformly small in the region of clustered eigenvalues
- roots of p match isolated eigenvalues very closely

Note that in general p will be very steep near isolated eigenvalues, hence convergence for isolated eigenvalues is rapid!

Python demo: Lanczos iteration for smallest eigenpairs of Laplacian on the unit square

(We apply Lanczos iteration to A⁻¹ since then the smallest eigenvalues become the largest and (more importantly) isolated)
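This “invert, then find the largest (now isolated) eigenvalues” strategy is what shift-invert mode implements in practice; a sketch with scipy (the grid size and the way the Laplacian is assembled are our own choices, not the lecture demo):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# 2D discrete Laplacian on the unit square (Dirichlet BCs) via Kronecker sums
n = 50                                      # interior grid points per direction
h = 1.0 / (n + 1)
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h**2
I = sp.identity(n)
A = sp.kron(I, T) + sp.kron(T, I)           # SPD, size n^2 x n^2

# Shift-invert with sigma=0 works with A^{-1}, so the smallest eigenvalues of A
# become the largest (and isolated) ones of the operator Lanczos actually sees
vals, vecs = eigsh(A, k=6, sigma=0, which='LM')
print(np.sort(vals))                        # approx pi^2 * (2, 5, 5, 8, 10, 10)
```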

Conjugate Gradient Method

We now turn to another Krylov subspace method: the conjugate gradient method,⁴ or CG

(This is a detour in this Unit since CG is not an eigenvalue algorithm)

CG is an iterative method for solving Ax = b in the case that A is symmetric positive definite (SPD)

CG is the original (and perhaps most famous) Krylov subspace method, and is a mainstay of Scientific Computing

⁴Due to Hestenes and Stiefel in 1952

Recall that we discussed CG in Unit IV — the approach in Unit IV is also often called nonlinear conjugate gradients

Nonlinear conjugate gradients applies the approach of Hestenes and Stiefel to nonlinear unconstrained optimization

Iterative solvers (e.g. CG) and direct solvers (e.g. Gaussian elimination) for solving Ax = b are fundamentally different:
- Direct solvers: in exact arithmetic, give the exact answer after finitely many steps
- Iterative solvers: in principle require infinitely many iterations, but should give an accurate approximation after few iterations

Direct methods are very successful, but iterative methods are typically more efficient for very large, sparse systems

Krylov subspace methods only require matvecs and vector-vector (e.g. dot product) operations

Hence Krylov methods require O(n) operations per iteration for sparse A, no issues with “fill-in” etc

Also, iterative methods are generally better suited to parallelization, hence an important topic in supercomputing

The CG algorithm is given by

1: x_0 = 0, r_0 = b, p_0 = r_0
2: for k = 1, 2, 3, ... do
3:   α_k = (r_{k−1}^T r_{k−1})/(p_{k−1}^T A p_{k−1})
4:   x_k = x_{k−1} + α_k p_{k−1}
5:   r_k = r_{k−1} − α_k A p_{k−1}
6:   β_k = (r_k^T r_k)/(r_{k−1}^T r_{k−1})
7:   p_k = r_k + β_k p_{k−1}
8: end for
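A direct transcription into Python might look like the following sketch (the residual-based stopping test is a practical addition of ours, since the pseudocode above iterates indefinitely):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=None):
    """Sketch of the CG algorithm above for SPD A; stops when the residual
    norm drops below tol."""
    n = len(b)
    max_iter = max_iter or n
    x = np.zeros(n)
    r = b.copy()                  # r_0 = b - A x_0 = b
    p = r.copy()
    rr = r @ r
    for k in range(max_iter):
        Ap = A @ p
        alpha = rr / (p @ Ap)     # line 3
        x = x + alpha * p         # line 4
        r = r - alpha * Ap        # line 5
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        beta = rr_new / rr        # line 6
        p = r + beta * p          # line 7
        rr = rr_new
    return x

# Usage sketch with an arbitrary small SPD system
rng = np.random.default_rng(0)
M = rng.standard_normal((100, 100))
A = M.T @ M + 100 * np.eye(100)   # SPD by construction
b = rng.standard_normal(100)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))  # small residual
```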

We shall now discuss CG in more detail — it’s certainly not obvious upfront why this is a useful algorithm!

Let x∗ = A⁻¹b denote the exact solution, and let e_k ≡ x∗ − x_k denote the error at step k

Also, let ‖·‖_A denote the norm

‖x‖_A ≡ √(xᵀAx)

Theorem: The CG iterate x_k is the unique member of K_k(A, b) which minimizes ‖e_k‖_A. Also, x_k = x∗ for some k ≤ n.

Proof: This result relies on a set of identities which can be derived (by induction) from the CG algorithm:

(i) K_k(A, b) = span{x_1, x_2, ..., x_k} = span{p_0, p_1, ..., p_{k−1}} = span{r_0, r_1, ..., r_{k−1}}
(ii) r_k^T r_j = 0 for j < k
(iii) p_k^T A p_j = 0 for j < k

From the first identity above, it follows that x_k ∈ K_k(A, b)

We will now show that x_k is the unique minimizer in K_k(A, b)

Let x̃ ∈ K_k(A, b) be another “candidate minimizer” and let ∆x ≡ x_k − x̃; then

\[
\begin{aligned}
\|x_* - \tilde{x}\|_A^2 &= \|(x_* - x_k) + (x_k - \tilde{x})\|_A^2 \\
&= \|e_k + \Delta x\|_A^2 \\
&= (e_k + \Delta x)^T A (e_k + \Delta x) \\
&= e_k^T A e_k + 2\, e_k^T A\, \Delta x + \Delta x^T A\, \Delta x
\end{aligned}
\]

Next, let r(x_k) = b − Ax_k denote the residual at step k, so that

r(x_k) = b − Ax_k = b − A(x_{k−1} + α_k p_{k−1}) = r(x_{k−1}) − α_k A p_{k−1}

Since r(x_0) = b = r_0, by induction we see that for r_k computed in line 5 of CG,

r_k = r_{k−1} − α_k A p_{k−1},

we have r_k = r(x_k), k = 1, 2, ...

Now, recall our expression for ‖x∗ − x̃‖_A²:

‖x∗ − x̃‖_A² = e_k^T A e_k + 2 e_k^T A ∆x + ∆x^T A ∆x

and note that

2 e_k^T A ∆x = 2 ∆x^T A(x∗ − x_k) = 2 ∆x^T(b − Ax_k) = 2 ∆x^T r_k

Now,

- ∆x = x_k − x̃ ∈ K_k(A, b)
- from (i), we have that K_k(A, b) = span{r_0, r_1, ..., r_{k−1}}
- from (ii), we have that r_k ⊥ span{r_0, r_1, ..., r_{k−1}}

Therefore, we have 2 e_k^T A ∆x = 2 ∆x^T r_k = 0

This implies that,

‖x∗ − x̃‖_A² = e_k^T A e_k + ∆x^T A ∆x ≥ ‖e_k‖_A²,

with equality only when ∆x = 0, hence x_k ∈ K_k(A, b) is the unique minimizer!

This also tells us that if x∗ ∈ K_k(A, b), then x_k = x∗

Therefore CG will converge to x∗ in at most n iterations⁵ since K_k(A, b) is a subspace of Rⁿ of dimension k □

⁵Assuming exact arithmetic!

Note that the theoretical guarantee that CG will converge in n steps is of no practical use

In floating point arithmetic we will not get exact convergence to x∗

More importantly, we assume n is huge, so we want to terminate CG well before n iterations anyway

Nevertheless, the guarantee of convergence in at most n steps is of historical interest

Hestenes and Stiefel originally viewed CG as a direct method that will converge after a finite number of steps

Steps of CG are chosen to give the properties (ii), (iii), which lead to the remarkable CG optimality property:

CG minimizes the error over the Krylov subspace K_k(A, b) at step k

Question: Where did the steps in the CG algorithm come from?

Answer: It turns out that CG can be derived by developing an optimization algorithm for φ : Rⁿ → R given by

φ(x) ≡ ½ xᵀAx − xᵀb

e.g. lines 3 and 4 in CG perform line search, line 7 gives a search direction p_k
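For instance, the coefficient in line 3 is exactly what an exact line search of φ along p_{k−1} gives; a short derivation (our addition, using the identities (i) and (ii) listed earlier):

\[
\frac{d}{d\alpha}\,\phi(x_{k-1} + \alpha\, p_{k-1})
= p_{k-1}^T\bigl(A(x_{k-1} + \alpha\, p_{k-1}) - b\bigr)
= \alpha\, p_{k-1}^T A\, p_{k-1} - p_{k-1}^T r_{k-1} = 0
\]
\[
\implies\; \alpha_k = \frac{p_{k-1}^T r_{k-1}}{p_{k-1}^T A\, p_{k-1}}
= \frac{r_{k-1}^T r_{k-1}}{p_{k-1}^T A\, p_{k-1}}
\]

where the last step uses p_{k−1}^T r_{k−1} = r_{k−1}^T r_{k−1}, which holds because p_{k−1} = r_{k−1} + β_{k−1} p_{k−2} (line 7) and, by (i) and (ii), r_{k−1} ⊥ p_{k−2}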

[ Aside: Note that −∇φ(x) = b − Ax = r(x)

The name “Conjugate Gradient” then comes from the property

(ii) ∇φ(x_k)^T ∇φ(x_j) = r_k^T r_j = 0 for j < k

That is, the gradient directions are orthogonal, or “conjugate”]

Question: Why is the quadratic objective function φ relevant to solving Ax = b?

Answer: Minimizing φ is equivalent to minimizing ‖e_k‖_A², since

\[
\begin{aligned}
\|e_k\|_A^2 &= (x_* - x_k)^T A (x_* - x_k) \\
&= x_k^T A x_k - 2 x_k^T A x_* + x_*^T A x_* \\
&= x_k^T A x_k - 2 x_k^T b + x_*^T b \\
&= 2\phi(x_k) + \text{const.}
\end{aligned}
\]

Hence, our argument from above shows that, at iteration k, CG solves the optimization problem

min_{x ∈ K_k(A, b)} φ(x)

An important topic (that we will not cover in detail) is convergence analysis of CG: how fast does ‖e_k‖_A converge?

A famous result for CG is that if A has 2-norm condition number κ, then

\[
\frac{\|e_k\|_A}{\|e_0\|_A} \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{k}
\]

Hence smaller condition number implies faster convergence!

Suppose we want to terminate CG when

‖e_k‖_A ≤ ε ‖e_0‖_A

for some ε > 0, how many CG iterations will this require?

We have the identities

\[
2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{k}
= 2\left(\frac{(\sqrt{\kappa} + 1) - 2}{\sqrt{\kappa} + 1}\right)^{k}
= 2\left(1 - \frac{2}{\sqrt{\kappa} + 1}\right)^{k}
= 2\left(1 - \frac{2/\sqrt{\kappa}}{1 + 1/\sqrt{\kappa}}\right)^{k}
\]

And for large κ it follows that

\[
\frac{\|e_k\|_A}{\|e_0\|_A} \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{k} \approx 2\left(1 - \frac{2}{\sqrt{\kappa}}\right)^{k}
\]

Hence we terminate CG when

\[
\left(1 - \frac{2}{\sqrt{\kappa}}\right)^{k} \approx \frac{\epsilon}{2}
\]

Taking logs gives

\[
k \approx \log(\epsilon/2) \Big/ \log\!\left(1 - \frac{2}{\sqrt{\kappa}}\right) \approx \frac{\sqrt{\kappa}}{2}\,\bigl|\log(\epsilon/2)\bigr|
\]

where the last expression follows from the Taylor expansion:

\[
\log\!\left(1 - \frac{2}{\sqrt{\kappa}}\right) \approx \log(1) - \frac{2}{\sqrt{\kappa}} = -\frac{2}{\sqrt{\kappa}}
\]

This analysis shows that the number of CG iterations for a given tolerance ε grows approximately as √κ
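As a quick illustration (the condition numbers below are arbitrary example values), the estimate k ≈ (√κ/2)|log(ε/2)| can be evaluated directly:

```python
import numpy as np

# Rough iteration-count estimate k ~ (sqrt(kappa)/2) * |log(eps/2)|
eps = 1e-4
for kappa in [1e2, 1e4, 1e6]:
    k_est = 0.5 * np.sqrt(kappa) * abs(np.log(eps / 2))
    print(f"kappa = {kappa:.0e}:  roughly {k_est:.0f} CG iterations")
```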

Python demo: Convergence of CG for the discrete Laplacian

For the discrete Laplacian, we have κ = O(h⁻²),⁶ hence the number of CG iterations should grow as O(√κ) = O(h⁻¹)

For ε = 10⁻⁴, we obtain the following convergence results:

h          κ            CG iterations
4 × 10⁻²   3.67 × 10²    32
2 × 10⁻²   1.47 × 10³    65
1 × 10⁻²   5.89 × 10³   133
5 × 10⁻³   2.36 × 10⁴   272

⁶This is a standard result in finite element analysis

These results indicate that CG gets more expensive for the Poisson equation as h is reduced, for two reasons:
- The matrix and vectors get larger, hence each CG iteration is more expensive
- We require more iterations since the condition number gets worse

Conjugate Gradient Method: Preconditioning

The final crucial idea that we will mention is preconditioning

The idea is that we premultiply Ax = b by the preconditioning matrix M to obtain the system MAx = Mb

It is possible to “similarity transform” the preconditioned system to restore symmetry; e.g. one intuitive idea is to choose M ≈ A⁻¹ so that MA ≈ I has a smaller condition number than A

Good preconditioners must be cheap to compute, and should significantly accelerate convergence of an iterative method

Preconditioners can have a dramatic effect on convergence!

For example, preconditioning can ensure that number of CG iterations required for Poisson equation is independent of h

Preconditioning for Krylov subspace methods is a major topic in Scientific Computing: It is essential for large-scale problems!
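As an illustration of the mechanics only (the test matrix and the Jacobi preconditioner below are a toy example of ours, not the lecture's demo), scipy's cg accepts an operator M that applies an approximation to A⁻¹:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

# Toy SPD matrix with a strongly varying diagonal, where simple Jacobi
# (diagonal) preconditioning already improves the conditioning of MA a lot
n = 1000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) + sp.diags(np.logspace(0, 4, n), 0)
b = np.ones(n)

counts = {}
def counter(key):
    counts[key] = 0
    def cb(xk):                   # scipy calls this once per CG iteration
        counts[key] += 1
    return cb

x_plain, _ = cg(A, b, callback=counter("no preconditioner"))

d = A.diagonal()
M = LinearOperator((n, n), matvec=lambda v: v / d)   # applies diag(A)^{-1}
x_prec, _ = cg(A, b, M=M, callback=counter("Jacobi preconditioner"))

print(counts)                     # typically far fewer iterations with the preconditioner
```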