Chapter 8

KRYLOV SUBSPACE METHODS

A more readable reference is the book by Lloyd N. Trefethen and David Bau.

8.1 Krylov Subspaces
8.2 Arnoldi
8.3 Generalized Minimal Residual Method
8.4 Conjugate Gradient Method
8.5 Biconjugate Gradient Method
8.6 Biconjugate Gradient Stabilized Method

8.1 Krylov Subspaces

Saad, Sections 6.1, 6.2, omitting Proposition 6.3.

In a Krylov subspace method

$$x_i - x_0 \in \mathcal{K}_i(A, r_0) = \mathrm{span}\{r_0, Ar_0, \ldots, A^{i-1}r_0\}.$$

We call $\mathcal{K}_i(A, r_0)$ a Krylov subspace. Equivalently,

$$x_i \in x_0 + \mathrm{span}\{r_0, Ar_0, \ldots, A^{i-1}r_0\},$$
which we call a linear manifold. The exact solution is $A^{-1}b = x_0 + A^{-1}r_0$. The minimal polynomial of $A$ is the polynomial
$$p(x) = x^m + c_{m-1}x^{m-1} + \cdots + c_1 x + c_0$$
of lowest degree $m$ such that $p(A) = 0$. If $A$ is diagonalizable, $m$ is the number of distinct eigenvalues. To see this, let $A$ have distinct eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$, and define

$$p(\lambda) = (\lambda - \lambda_1)(\lambda - \lambda_2)\cdots(\lambda - \lambda_m) = \lambda^m + c_{m-1}\lambda^{m-1} + \cdots + c_1\lambda + c_0.$$

Writing $A = X\Lambda X^{-1}$, we have

$$p(A) = X(\Lambda - \lambda_1 I)(\Lambda - \lambda_2 I)\cdots(\Lambda - \lambda_m I)X^{-1} = 0.$$

If $A$ has no eigenvalue equal to zero, $A^{-1}$ is a linear combination of $I, A, \ldots, A^{m-1}$,

 −1     1 1 1 3 1 e.g.,  1  =  1  −  1  . 2 2 2 1 2

To see this, write
$$A^{m-1} + c_{m-1}A^{m-2} + \cdots + c_1 I + c_0 A^{-1} = 0.$$
Hence $A^{-1}r_0 \in \mathrm{span}\{r_0, Ar_0, \ldots, A^{m-1}r_0\}$, $A^{-1}b \in x_0 + \mathrm{span}\{r_0, Ar_0, \ldots, A^{m-1}r_0\}$, and $x_m = A^{-1}b$. This is the finite termination property. (In practice we do not go this far.)
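As a concrete check of the diagonal example above, the following sketch (not part of the text; the class name and plain Java arrays are introduced here just for illustration) verifies numerically that $A^{-1} = \frac{3}{2}I - \frac{1}{2}A$ for $A = \mathrm{diag}(1,1,2)$, whose minimal polynomial is $(\lambda-1)(\lambda-2) = \lambda^2 - 3\lambda + 2$.

// Illustrative sketch: verify A^{-1} = (3/2)I - (1/2)A for A = diag(1,1,2),
// whose minimal polynomial is (lambda - 1)(lambda - 2) = lambda^2 - 3 lambda + 2.
public class MinimalPolynomialDemo {
    public static void main(String[] args) {
        double[][] a = {{1, 0, 0}, {0, 1, 0}, {0, 0, 2}};
        int n = a.length;
        // Form B = (3/2)I - (1/2)A and check that A*B is the identity.
        double[][] b = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                b[i][j] = 1.5 * (i == j ? 1.0 : 0.0) - 0.5 * a[i][j];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++) s += a[i][k] * b[k][j];
                System.out.printf("%6.2f", s);   // should print the identity matrix
            }
            System.out.println();
        }
    }
}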

Review questions

1. What is a Krylov subspace?

2. Is $x_0 + \mathrm{span}\{r_0, Ar_0, \ldots, A^{i-1}r_0\}$ a Krylov subspace?

3. Which of the following methods choose their search directions from a Krylov subspace: cyclic coordinate descent? steepest descent? For each that does, what is its Krylov subspace?

4. What is the minimal polynomial of a square matrix $A$?

5. What is the finite termination property?

Exercise

1. Use the characteristic polynomial of a nonsingular square matrix $A$ to show that $A^{-1}$ can be expressed as a polynomial of degree at most $n-1$ in $A$.

8.2 Arnoldi Orthogonalization

Saad, Section 6.3.

For numerical stability we incrementally construct orthonormal bases $\{q_1, q_2, \ldots, q_k\}$ for the Krylov subspaces. However, rather than applying the Gram-Schmidt process to the sequence $r_0, Ar_0, \ldots, A^{k-1}r_0$, we use what is known as the Arnoldi process. It is based on the fact that each Krylov subspace can be obtained from the orthonormal basis of the Krylov subspace of one dimension less using the spanning set $q_1, q_2, \ldots, q_k, Aq_k$. In other words, a new direction in the expanded Krylov subspace can be created by multiplying the most recent basis vector $q_k$ by $A$ rather than by multiplying $A^{k-1}r_0$ by $A$. We remove from this new direction $Aq_k$ its orthogonal projection onto $q_1, q_2, \ldots, q_k$, obtaining the direction

$$v_{k+1} = Aq_k - q_1 h_{1k} - q_2 h_{2k} - \cdots - q_k h_{kk},$$
where the coefficients $h_{ik}$ are determined by the orthogonality conditions. This computation should be performed using the modified Gram-Schmidt iteration:

    t = A q_k;
    for j = 1, 2, ..., k do {
        /* t = (I - q_j q_j^T) t */
        h_{jk} = q_j^T t;
        t = t - q_j h_{jk};
    }

Normalization produces $q_{k+1} = v_{k+1}/h_{k+1,k}$, where $h_{k+1,k} = \|v_{k+1}\|_2$.

The coefficients $h_{ij}$ have been labeled so that we can write

$$AQ_k = Q_{k+1}\bar{H}_k \qquad (8.1)$$

where $Q_k := [q_1, q_2, \ldots, q_k]$ and $\bar{H}_k$ is a $k+1$ by $k$ upper Hessenberg matrix whose $(i,j)$th element, for $j \ge i-1$, is $h_{ij}$. Given the basis $q_1, q_2, \ldots, q_k$, we can express any element of the $k$-dimensional Krylov subspace as $Q_k y$ for some $k$-vector $y$.
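The Arnoldi loop above maps almost line for line onto code. The sketch below is illustrative only, not an implementation from the text; the helper routines dot, norm, and matVec are names introduced here, the arrays q and h are assumed preallocated with at least $k+1$ rows, and breakdown ($h_{k+1,k} = 0$) is not handled.

// Illustrative sketch of the Arnoldi process with modified Gram-Schmidt.
// Fills q[0..k] (orthonormal basis vectors of length n) and the (k+1)-by-k
// upper Hessenberg array h so that A*Q_k = Q_{k+1}*Hbar_k.
static void arnoldi(double[][] a, double[] r0, int k, double[][] q, double[][] h) {
    int n = r0.length;
    double beta = norm(r0);
    q[0] = new double[n];
    for (int i = 0; i < n; i++) q[0][i] = r0[i] / beta;        // q_1 = r_0 / ||r_0||
    for (int j = 0; j < k; j++) {
        double[] t = matVec(a, q[j]);                           // t = A q_{j+1}
        for (int i = 0; i <= j; i++) {                          // modified Gram-Schmidt
            h[i][j] = dot(q[i], t);
            for (int m = 0; m < n; m++) t[m] -= h[i][j] * q[i][m];
        }
        h[j + 1][j] = norm(t);                                  // zero would mean breakdown (not handled)
        q[j + 1] = new double[n];
        for (int m = 0; m < n; m++) q[j + 1][m] = t[m] / h[j + 1][j];
    }
}

static double dot(double[] x, double[] y) {
    double s = 0.0;
    for (int i = 0; i < x.length; i++) s += x[i] * y[i];
    return s;
}

static double norm(double[] x) { return Math.sqrt(dot(x, x)); }

static double[] matVec(double[][] a, double[] x) {
    double[] y = new double[a.length];
    for (int i = 0; i < a.length; i++) y[i] = dot(a[i], x);
    return y;
}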

Review questions

1. In the Arnoldi process for orthogonalizing a Krylov subspace $\mathcal{K}_i(A, r_0)$, how is each new basis vector $q_{k+1}$ produced?

2. The relationship among the first $k+1$ vectors produced by the Arnoldi process for $\mathcal{K}_i(A, r_0)$ can be summarized as

$$AQ_k = Q_{k+1}\bar{H}_k$$

where $Q_k$ is composed of the first $k$ vectors. What are the dimensions of $\bar{H}_k$, and what other property does it have? What is $Q_k^T A Q_k$ in terms of $\bar{H}_k$?

3. Let $Q_k$ be composed of the first $k$ vectors of the Arnoldi process for $\mathcal{K}_i(A, r_0)$. What is $r_0$ in terms of $Q_k$ and $\|r_0\|$?

8.3 Generalized Minimal Residual Method

Saad, Sections 6.5.1, 6.5.3–6.5.5.

GMRES, "generalized minimal residuals," is a popular iterative method for nonsymmetric matrices $A$. It is based on the principle of minimizing the norm of the residual $\|b - Ax\|_2$, since the energy norm is available only for an s.p.d. matrix. It is, however, not a gradient method; rather, it chooses for the correction $x_k - x_0$ that element of the Krylov subspace $\mathrm{span}\{r_0, Ar_0, \ldots, A^{k-1}r_0\}$ which minimizes the 2-norm of the residual. For numerical stability we construct orthonormal bases $\{q_1, q_2, \ldots, q_k\}$ for the Krylov subspaces using the Arnoldi process. We can express any element of the $k$-dimensional Krylov subspace as $Q_k y$ for some $k$-vector $y$. Thus, the minimization problem becomes

$$\min_y \|b - A(x_0 + Q_k y)\|_2.$$

This is a linear least squares problem involving $k$ unknowns and $n$ equations. The number of equations can be reduced to $k+1$ by using eq. (8.1) to get

$$\begin{aligned}
\|b - A(x_0 + Q_k y)\|_2 &= \|r_0 - AQ_k y\|_2 \\
 &= \|\rho q_1 - AQ_k y\|_2 \qquad \text{where } \rho = \|r_0\| \\
 &= \|Q_{k+1}(\rho e_1 - \bar{H}_k y)\|_2 \\
 &= \|\rho e_1 - \bar{H}_k y\|_2.
\end{aligned}$$
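One way to solve the reduced problem $\min_y \|\rho e_1 - \bar{H}_k y\|_2$ is a QR factorization of $\bar{H}_k$ by Givens rotations, which exploits the Hessenberg structure; practical GMRES codes apply the rotations incrementally as each Arnoldi column is produced. The sketch below is illustrative only, not an implementation from the text; it assumes $\bar{H}_k$ is stored as a $(k+1)$-by-$k$ array and that no breakdown occurs.

// Illustrative sketch: solve min_y || rho*e1 - Hbar*y ||_2 for a (k+1)-by-k
// upper Hessenberg Hbar by Givens rotations followed by back substitution.
static double[] gmresLeastSquares(double[][] hbar, double rho, int k) {
    double[][] h = new double[k + 1][];
    for (int i = 0; i <= k; i++) h[i] = hbar[i].clone();        // work on a copy
    double[] g = new double[k + 1];
    g[0] = rho;                                                 // right-hand side rho*e_1
    for (int j = 0; j < k; j++) {
        // Givens rotation chosen to zero the subdiagonal entry h[j+1][j]
        double denom = Math.hypot(h[j][j], h[j + 1][j]);
        double c = h[j][j] / denom, s = h[j + 1][j] / denom;
        for (int m = j; m < k; m++) {
            double hjm = h[j][m], hj1m = h[j + 1][m];
            h[j][m] = c * hjm + s * hj1m;
            h[j + 1][m] = -s * hjm + c * hj1m;
        }
        double gj = g[j], gj1 = g[j + 1];
        g[j] = c * gj + s * gj1;
        g[j + 1] = -s * gj + c * gj1;        // |g[k]| ends up equal to the residual norm
    }
    double[] y = new double[k];              // back substitution on the triangular part
    for (int i = k - 1; i >= 0; i--) {
        double s = g[i];
        for (int m = i + 1; m < k; m++) s -= h[i][m] * y[m];
        y[i] = s / h[i][i];
    }
    return y;                                // then x_k = x_0 + Q_k * y
}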

Review questions

1. For GMRES applied to $Ax = b$, where $A$ is $n$ by $n$, from what subset of $\mathbb{R}^n$ does one choose the approximation $x_k$ given an initial guess $x_0$?

2. How many matrix–vector multiplications are required for each iteration of GMRES?

3. What is the optimality property of GMRES?

4. For the GMRES solution from a k-dimensional subspace, one solves a least squares problem of reduced dimension. What are the dimensions of the coefficient matrix in this problem, and what is its special property?

8.4 Conjugate Gradient Method

Saad, Sections 6.7.1, 6.11.3.

Considered here is the case where $A$ is symmetric. Let $H_k$ be all but the last row of $\bar{H}_k$. Then $Q_k^T A Q_k = H_k$, which is square and upper Hessenberg. Since $A$ is symmetric, $H_k$ is symmetric and hence tridiagonal, and we write $\bar{H}_k = \bar{T}_k$, $H_k = T_k$.

Clearly, it is unnecessary to compute elements of $T_k$ known to be zero, thus reducing the cost from $O(k^2)$ to $O(k)$. Assuming $A$ is also positive definite, we choose $x_k$ to be that element of $x_0 + \mathcal{K}_k(A, r_0)$ which is closest to $A^{-1}b$ in energy norm. Hence, each iterate $x_k$ has the following optimality property:
$$|||x_k - A^{-1}b||| = \min\{\,|||x - A^{-1}b|||\ :\ x \in x_0 + \mathcal{K}_k(A, r_0)\,\}.$$
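For reference, the energy norm appearing here is, presumably as defined in Chapter 7, the norm induced by the s.p.d. matrix $A$:
$$|||v||| = (v^T A v)^{1/2}.$$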

In Section 7.3 this is shown to be equivalent to making the residual $b - Ax_k$ orthogonal to $\mathcal{K}_k(A, r_0) = \mathcal{R}(Q_k)$. Writing $x_k = x_0 + Q_k y$, this becomes

$$\begin{aligned}
0 &= Q_k^T(b - Ax_k) \\
 &= Q_k^T(r_0 - AQ_k y) \\
 &= Q_k^T(\rho q_1 - AQ_k y) \\
 &= \rho e_1 - T_k y,
\end{aligned}$$
a tridiagonal system to solve for $y$.

Although the conjugate gradient method can be derived from the Lanczos orthogonalization described above, it is more usual to start from the method of steepest descent. The conjugate gradient method constructs directions $p_i$ from the gradients $r_0, r_1, r_2, \ldots$. These directions are conjugate,
$$p_i^T A p_j = 0 \quad \text{if } i \ne j,$$
or orthogonal in the energy inner product. We skip the details and simply state the result:

    x_0 = initial guess;  r_0 = b - A x_0;  p_0 = r_0;
    for i = 0, 1, 2, ... do {
        alpha_i = (r_i^T r_i) / (p_i^T A p_i);
        x_{i+1} = x_i + alpha_i p_i;
        r_{i+1} = r_i - alpha_i A p_i;
        p_{i+1} = r_{i+1} + ((r_{i+1}^T r_{i+1}) / (r_i^T r_i)) p_i;
    }
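The iteration above translates directly into code. The sketch below is illustrative only, not an implementation from the text; it reuses the dot and matVec helpers introduced in the Arnoldi sketch and adds a simple stopping test on $\|r_i\|_2$.

// Illustrative sketch of the conjugate gradient iteration above;
// A is assumed symmetric positive definite.
static double[] conjugateGradient(double[][] a, double[] b, double[] x0,
                                  int maxIter, double tol) {
    int n = b.length;
    double[] x = x0.clone();
    double[] ax = matVec(a, x);
    double[] r = new double[n];
    for (int i = 0; i < n; i++) r[i] = b[i] - ax[i];            // r_0 = b - A x_0
    double[] p = r.clone();                                     // p_0 = r_0
    double rtr = dot(r, r);
    for (int iter = 0; iter < maxIter && Math.sqrt(rtr) > tol; iter++) {
        double[] ap = matVec(a, p);                             // the one matrix-vector product
        double alpha = rtr / dot(p, ap);
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * ap[i];
        }
        double rtrNew = dot(r, r);
        double beta = rtrNew / rtr;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rtr = rtrNew;
    }
    return x;
}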

The cost per iteration is

    1 matrix · vector product:  $Ap_i$
    2 vector · vector products:  $r_{i+1}^T r_{i+1}$, $p_i^T(Ap_i)$
    3 vector + scalar · vector operations

for a total cost of 1 matrix–vector product and $5n$ multiplications. For
$$\begin{bmatrix} & -1 & \\ -1 & 4 & -1 \\ & -1 & \end{bmatrix}$$
the cost of matrix·vector is $2.5n$ "multiplications." We note that $r_i \in \mathcal{K}_{i+1}(A, r_0)$ and $p_i \in \mathcal{K}_{i+1}(A, r_0)$. Moreover, it can be shown that the gradients $\{r_0, r_1, \ldots, r_{i-1}\}$ constitute an orthogonal basis for the Krylov subspace.

Convergence rate:

$$|||x_k - A^{-1}b||| \le \left(\frac{\sqrt{\kappa_2(A)} - 1}{\sqrt{\kappa_2(A)} + 1}\right)^{k} |||x_0 - A^{-1}b|||.$$
Since $\bigl((\sqrt{\kappa_2(A)}-1)/(\sqrt{\kappa_2(A)}+1)\bigr)^k \approx e^{-2k/\sqrt{\kappa_2(A)}}$ for large $\kappa_2(A)$, to reduce the energy norm of the error by a factor of $\varepsilon$ requires approximately $\frac{1}{2}\sqrt{\kappa_2(A)}\,\log\frac{1}{\varepsilon}$ iterations, e.g., $\frac{n}{\pi}\log\frac{1}{\varepsilon}$ if $\sqrt{\kappa_2(A)} \approx 2n/\pi$.

Review questions

1. What does it mean for the directions of the conjugate gradient method to be conjugate?

2. What is the optimality property of the conjugate gradient method?

3. How many matrix–vector multiplications are required for each iteration of CG?

4. On what property of the matrix does the rate of convergence of conjugate gradient depend?

Exercises

1. Given $x_i$, $r_i$, $p_i$, one conjugate gradient iteration is given by

$$\begin{aligned}
\alpha_i &= r_i^T r_i/(p_i^T A p_i), & r_{i+1} &= r_i - \alpha_i A p_i, \\
x_{i+1} &= x_i + \alpha_i p_i, & p_{i+1} &= r_{i+1} + \frac{r_{i+1}^T r_{i+1}}{r_i^T r_i}\, p_i,
\end{aligned}$$
where $A$ is assumed to be symmetric positive definite. Suppose you are given

    static void ax(float[] x, float[] y) {
        int n = x.length;
        // Given x this method returns y where
        // y[i] = a[i][0]*x[0] + ... + a[i][n-1]*x[n-1]
        ...
    }

You should complete the following method, which performs one conjugate gradient iteration, overwriting the old values of $x_i$, $r_i$, $r_i^T r_i$, and $p_i$ with the new values. Your method must use no temporary arrays except for one float array of dimension $n$, and it must do a minimum amount of floating-point arithmetic.

static void cg(float[] x, float[] r, float rTr, float[] p)

2. Solve the following system using the conjugate gradient method with $x_1 = 0$.
$$\begin{bmatrix} 4 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 4 \end{bmatrix} x = \begin{bmatrix} 2 \\ 6 \\ 2 \end{bmatrix}$$

8.5 Biconjugate Gradient Method

Saad, Section 7.3.1.

Recall that for a symmetric matrix $A$ one can find, by means of a direct calculation, an orthogonal similarity transformation such that $Q^T A Q = T$ is tridiagonal, and that for a general matrix one can get an upper Hessenberg matrix $Q^T A Q = H$. In fact, for a general matrix it is possible to find, by means of a direct calculation, a similarity transformation such that $V^{-1} A V = T$ is tridiagonal if one requires only that $V$ be nonsingular. Letting $W^T = V^{-1}$, we can write this as two conditions:
$$W^T A V = T, \qquad W^T V = I.$$

The columns of $V = [v_1\, v_2 \cdots v_n]$ and $W = [w_1\, w_2 \cdots w_n]$ are said to form a biorthogonal system in that
$$v_i^T w_j = \delta_{ij}.$$
Lanczos orthogonalization incrementally builds up the relation $Q_k^T A Q_k = T_k$, which if carried to completion would yield $Q^T A Q = T$. Similarly, Lanczos biorthogonalization incrementally constructs $V_k$, $W_k$, $T_k$ such that
$$W_k^T A V_k = T_k, \qquad W_k^T V_k = I.$$
These relations can be exploited to obtain an approximate solution:
$$\tilde{x} = x_0 + V_k y \quad \text{such that} \quad W_k^T (b - A\tilde{x}) = 0,$$
which simplifies to the tridiagonal system
$$T_k y = W_k^T r_0.$$
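Spelling out the simplification, using $W_k^T A V_k = T_k$:
$$0 = W_k^T\bigl(b - A(x_0 + V_k y)\bigr) = W_k^T r_0 - W_k^T A V_k\, y = W_k^T r_0 - T_k y.$$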

The columns of $V_k$ are a basis for $\mathcal{K}_k(A, v_1)$, and the columns of $W_k$ are a basis for $\mathcal{K}_k(A^T, w_1)$. The method just described can be shown to be equivalent to the biconjugate gradient method, which is given below without a derivation:

    x_0 = initial guess;
    r_0 = b - A x_0;  choose r'_0 so that r_0^T r'_0 != 0;
    p_0 = r_0;  p'_0 = r'_0;
    for i = 0, 1, 2, ... do {
        alpha_i = (r_i^T r'_i) / ((A p_i)^T p'_i);
        x_{i+1} = x_i + alpha_i p_i;
        r_{i+1} = r_i - alpha_i A p_i;
        r'_{i+1} = r'_i - alpha_i A^T p'_i;
        beta_i = (r_{i+1}^T r'_{i+1}) / (r_i^T r'_i);
        p_{i+1} = r_{i+1} + beta_i p_i;
        p'_{i+1} = r'_{i+1} + beta_i p'_i;
    }

Convergence of BiCG is slower and more erratic than that of GMRES. Also, there is the possibility of breakdown when a denominator becomes zero.
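As with CG, the loop above can be transcribed directly. The sketch below is illustrative only, not an implementation from the text; it forms $A^T$ explicitly for simplicity, chooses $r'_0 = r_0$, reuses the dot, norm, and matVec helpers from the earlier sketches, and does not guard against the breakdowns mentioned above.

// Illustrative sketch of the biconjugate gradient iteration above.
// Requires products with both A and A^T.
static double[] bicg(double[][] a, double[] b, double[] x0, int maxIter, double tol) {
    int n = b.length;
    double[][] at = new double[n][n];                           // explicit transpose, for simplicity
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) at[i][j] = a[j][i];
    double[] x = x0.clone();
    double[] ax = matVec(a, x);
    double[] r = new double[n];
    for (int i = 0; i < n; i++) r[i] = b[i] - ax[i];            // r_0 = b - A x_0
    double[] rs = r.clone();                                    // shadow residual r'_0 (here r'_0 = r_0)
    double[] p = r.clone(), ps = rs.clone();
    double rho = dot(r, rs);                                    // r_i^T r'_i
    for (int iter = 0; iter < maxIter && norm(r) > tol; iter++) {
        double[] ap = matVec(a, p);
        double[] atps = matVec(at, ps);
        double alpha = rho / dot(ap, ps);                       // breakdown if this denominator is zero
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * ap[i];
            rs[i] -= alpha * atps[i];
        }
        double rhoNew = dot(r, rs);                             // breakdown if this becomes zero
        double beta = rhoNew / rho;
        for (int i = 0; i < n; i++) {
            p[i] = r[i] + beta * p[i];
            ps[i] = rs[i] + beta * ps[i];
        }
        rho = rhoNew;
    }
    return x;
}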

Review questions

1. How close can one come to diagonalizing a general matrix by a similarity transformation with a finite number of arithmetic operations?

2. What does it mean for the columns of $V = [v_1\, v_2 \cdots v_n]$ and $W = [w_1\, w_2 \cdots w_n]$ to form a biorthogonal system?

3. Lanczos biorthogonalization applied to an $n$ by $n$ matrix $A$ incrementally constructs matrices $V_k$, $W_k$ of dimension $n$ by $k$ and $T_k$ of dimension $k$ by $k$. What relations are satisfied by these matrices, and what is special about $T_k$?

4. How many matrix–vector multiplications are required for each iteration of BiCG applied to a general system $Ax = b$, and with which matrices?

5. How does the convergence of BiCG compare to that of GMRES?

8.6 Biconjugate Gradient Stabilized Method

Saad, Section 7.4.2.

BiCGSTAB has significantly smoother convergence than BiCG.

Review question

1. How does BiCGSTAB compare to BiCG? Be precise.
