Lecture # 20 The Preconditioned Conjugate Gradient Method

We wish to solve
$$Ax = b \qquad (1)$$
where $A \in \mathbb{R}^{n \times n}$ is symmetric and positive definite (SPD). We think of $n$ as being VERY LARGE, say, $n = 10^6$ or $n = 10^7$. Usually, the matrix is also sparse (mostly zeros) and Cholesky factorization is not feasible. When $A$ is SPD, solving (1) is equivalent to finding $x^*$ such that
$$x^* = \arg\min_{x \in \mathbb{R}^n} \tfrac{1}{2} x^T A x - x^T b.$$
The conjugate gradient method is one way to solve this problem.
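To see why these two problems are equivalent, let $\phi(x) = \tfrac{1}{2} x^T A x - x^T b$. Then
$$\nabla \phi(x) = A x - b \quad \text{and} \quad \nabla^2 \phi(x) = A,$$
so $\phi$ is strictly convex (since $A$ is SPD) and its unique minimizer satisfies $\nabla \phi(x^*) = A x^* - b = 0$, that is, $A x^* = b$.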

Algorithm 1 (The Conjugate Gradient Algorithm)

x_0 = initial guess (usually 0).
p_1 = r_0 = b - A x_0;
w = A p_1;
α_1 = (r_0^T r_0)/(p_1^T w);
x_1 = x_0 + α_1 p_1;
r_1 = r_0 - α_1 w;
k = 1;
while ||r_k||_2 > ε (some tolerance)
    β_k = (r_k^T r_k)/(r_{k-1}^T r_{k-1});
    p_{k+1} = r_k + β_k p_k;
    w = A p_{k+1};
    α_{k+1} = (r_k^T r_k)/(p_{k+1}^T w);
    x_{k+1} = x_k + α_{k+1} p_{k+1};
    r_{k+1} = r_k - α_{k+1} w;
    k = k + 1;
end;
x = x_k;
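For concreteness, here is a minimal Python/NumPy sketch of Algorithm 1 (the function name conjugate_gradient, the dense-matrix interface, and the default tolerances are my own illustrative choices; in practice A would be applied as a sparse matrix-vector product):

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-8, maxit=1000):
    """Minimal sketch of Algorithm 1: CG for SPD A (dense here for simplicity)."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                    # residual r_0
    p = r.copy()                     # first search direction p_1 = r_0
    rr = r @ r                       # r_k^T r_k
    for k in range(maxit):
        if np.sqrt(rr) <= tol:       # ||r_k||_2 <= tolerance
            break
        w = A @ p
        alpha = rr / (p @ w)         # alpha_{k+1} = (r_k^T r_k)/(p_{k+1}^T A p_{k+1})
        x = x + alpha * p
        r = r - alpha * w
        rr_new = r @ r
        beta = rr_new / rr           # beta_k = (r_k^T r_k)/(r_{k-1}^T r_{k-1})
        p = r + beta * p             # next search direction
        rr = rr_new
    return x

The loop body is the same update as in Algorithm 1, just reorganized so that β_k is computed at the end of each pass rather than at the top of the next one.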

The conjugate gradient algorithm is essentially optimal in the following sense. We note that
$$x_k = x_0 + \sum_{j=1}^{k} \alpha_j p_j = x_0 + P_k a$$
where
$$P_k = (p_1, \ldots, p_k), \qquad a = (\alpha_1, \ldots, \alpha_k)^T.$$

One can show that

$$x_k = \arg\min_{x = x_0 + P_k c} \tfrac{1}{2} x^T A x - x^T b.$$
If we let the $A$-norm be given by
$$\|w\|_A = \sqrt{w^T A w},$$
then one can show that

$$\|x_k - x^*\|_A \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{k} \|x_0 - x^*\|_A$$
where
$$\kappa = \kappa_2(A) = \|A\|_2 \|A^{-1}\|_2.$$
Thus if $A$ is well conditioned, the conjugate gradient method converges quickly. This bound is actually very pessimistic! However, if $A$ is badly conditioned, conjugate gradient can converge slowly, so we look to transform (1) into a well-conditioned problem. We write $A$ in the form of a splitting

$$A = M - N$$
where $M$ is SPD. Since $M$ is SPD, it has a Cholesky factorization, thus
$$M = R^T R.$$

If we let
$$\hat{A} = R^{-T} A R^{-1}, \qquad \hat{b} = R^{-T} b,$$

then solving (1) is equivalent to solving $\hat{A} y = \hat{b}$ where $y = Rx$.
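As a quick numerical sanity check of this equivalence, here is a small Python sketch (the random test matrix, the artificial bad scaling, and the choice $M = \mathrm{diag}(A)$ are assumptions for illustration only; that choice of $M$ is the Jacobi preconditioner discussed below):

import numpy as np

rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n))
s = 10.0 ** rng.uniform(-2, 2, n)                          # artificial bad scaling
A = np.diag(s) @ (B @ B.T + n * np.eye(n)) @ np.diag(s)    # SPD but badly scaled
b = rng.standard_normal(n)

M = np.diag(np.diag(A))                    # illustrative SPD choice: M = diag(A)
R = np.linalg.cholesky(M).T                # M = R^T R with R upper triangular
Rinv = np.linalg.inv(R)

A_hat = Rinv.T @ A @ Rinv                  # A_hat = R^{-T} A R^{-1}
b_hat = Rinv.T @ b                         # b_hat = R^{-T} b
y = np.linalg.solve(A_hat, b_hat)          # solve A_hat y = b_hat
x = Rinv @ y                               # recover x = R^{-1} y

print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # small relative residual
print(np.linalg.cond(A), np.linalg.cond(A_hat))        # conditioning before and after

The transformed system is solved in the well-scaled variables $y$ and then mapped back through $R^{-1}$, giving a small relative residual for the original system.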

We hope that $\kappa_2(\hat{A}) \ll \kappa_2(A)$, and we solve (1) by performing conjugate gradient on $\hat{A}$. In that formulation, we would have

$$y_k = R x_k \quad \text{and} \quad \hat{r}_k = \hat{b} - \hat{A} y_k = R^{-T}(b - A x_k) = R^{-T} r_k.$$
A conjugate gradient step would be

$$\hat{p}_{k+1} = \hat{r}_k + \beta_k \hat{p}_k$$

$$y_{k+1} = y_k + \alpha_{k+1} \hat{p}_{k+1}.$$
We prefer to update $x_k = R^{-1} y_k$ directly. We have that
$$\hat{p}_{k+1} = R^{-T} r_k + \beta_k \hat{p}_k,$$
so
$$x_{k+1} = R^{-1} y_{k+1} = R^{-1} y_k + \alpha_{k+1} R^{-1} \hat{p}_{k+1}.$$
Thus
$$x_{k+1} = x_k + \alpha_{k+1} p_{k+1},$$
where $p_{k+1} = R^{-1} \hat{p}_{k+1}$ satisfies
$$p_{k+1} = R^{-1} R^{-T} r_k + \beta_k p_k.$$
Since $M^{-1} = R^{-1} R^{-T}$, if we let
$$z_k = M^{-1} r_k,$$
then
$$p_{k+1} = z_k + \beta_k p_k$$
is the update to the descent direction. The formulas for $\beta_k$ and $\alpha_{k+1}$ update to
$$\beta_k = (r_k^T z_k)/(r_{k-1}^T z_{k-1}), \qquad \alpha_{k+1} = (r_k^T z_k)/(p_{k+1}^T A p_{k+1}).$$
These modifications lead to the preconditioned conjugate gradient method.

Algorithm 2 (The Preconditioned Conjugate Gradient Algorithm)

x_0 = initial guess (usually 0).
r_0 = b - A x_0;
Solve M z_0 = r_0;
p_1 = z_0;
w = A p_1;
α_1 = (r_0^T z_0)/(p_1^T w);
x_1 = x_0 + α_1 p_1;
r_1 = r_0 - α_1 w;
k = 1;
while ||r_k||_2 > ε (some tolerance)
    Solve M z_k = r_k;
    β_k = (r_k^T z_k)/(r_{k-1}^T z_{k-1});
    p_{k+1} = z_k + β_k p_k;
    w = A p_{k+1};
    α_{k+1} = (r_k^T z_k)/(p_{k+1}^T w);
    x_{k+1} = x_k + α_{k+1} p_{k+1};
    r_{k+1} = r_k - α_{k+1} w;
    k = k + 1;
end;
x = x_k;
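Here is a matching Python/NumPy sketch of Algorithm 2, assuming the preconditioner is supplied as a function M_solve that returns $z = M^{-1} r$ (the function and argument names are illustrative, not part of the notes):

import numpy as np

def preconditioned_cg(A, b, M_solve, x0=None, tol=1e-8, maxit=1000):
    """Sketch of Algorithm 2: PCG for SPD A with SPD preconditioner M.

    M_solve(r) should return the solution z of M z = r.
    """
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x
    z = M_solve(r)                   # solve M z_0 = r_0
    p = z.copy()                     # p_1 = z_0
    rz = r @ z                       # r_k^T z_k
    for k in range(maxit):
        if np.linalg.norm(r) <= tol:
            break
        w = A @ p
        alpha = rz / (p @ w)
        x = x + alpha * p
        r = r - alpha * w
        z = M_solve(r)               # solve M z_k = r_k
        rz_new = r @ z
        beta = rz_new / rz           # beta_k = (r_k^T z_k)/(r_{k-1}^T z_{k-1})
        p = z + beta * p
        rz = rz_new
    return x

For example, with the Jacobi preconditioner described below, M_solve could simply divide componentwise by the diagonal of A.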

So far, I have ignored the question of how to choose $M$. It is a very important question. Although this version of the conjugate gradient method dates from Concus, Golub, and O'Leary (1976), people still write papers on how to choose preconditioners (including your instructor!). Here are some straightforward choices.

The Jacobi preconditioner. Choose

$$M = D = \mathrm{diag}(a_{11}, a_{22}, \ldots, a_{nn}).$$
The resulting $\hat{A}$ is
$$\hat{A} = D^{-1/2} A D^{-1/2},$$
which is the same as diagonally scaling to produce $\hat{A}$ with ones on the diagonal.
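In code, applying this preconditioner is just a componentwise division by the diagonal; a small sketch (the helper name jacobi_solve is hypothetical):

import numpy as np

def jacobi_solve(A):
    """Return an M_solve function for the Jacobi preconditioner M = diag(A)."""
    d = np.diag(A).copy()            # diagonal of A; positive since A is SPD
    return lambda r: r / d           # z = D^{-1} r

# usage sketch: x = preconditioned_cg(A, b, jacobi_solve(A))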

Block versions of the Jacobi preconditioner are also possible. That is, if we write $A$ in the form

$$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1s} \\ A_{21} & A_{22} & \cdots & A_{2s} \\ \vdots & \vdots & \ddots & \vdots \\ A_{s1} & A_{s2} & \cdots & A_{ss} \end{pmatrix}$$
where each $A_{ij}$ is a $q \times q$ matrix. A block Jacobi preconditioner is the matrix

$$M = D = \begin{pmatrix} A_{11} & 0 & \cdots & 0 \\ 0 & A_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & A_{ss} \end{pmatrix}.$$
This matrix $M$ is symmetric and positive definite. The solution of

$$Mz = r$$
solves $A_{kk} z_k = r_k$, $k = 1, \ldots, s$, where
$$z = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_s \end{pmatrix}, \qquad r = \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_s \end{pmatrix}.$$

The SSOR preconditioner. Write $A = L + D + L^T$, where $D$ is the diagonal of $A$ and $L$ is its strictly lower triangular part. For $\omega \in (0, 2)$ choose
$$M = [\omega(2 - \omega)]^{-1}(D + \omega L) D^{-1} (D + \omega L^T).$$
The value of $\omega$ matters little, so we usually choose $\omega = 1$, which yields

$$M = (D + L) D^{-1} (D + L^T).$$

In this case, $N = L D^{-1} L^T$. The work in solving $Mz = r$ is just a forward and a back substitution (with a diagonal scaling in between).
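A sketch of that solve in Python, using dense storage and scipy.linalg.solve_triangular purely for clarity (the helper name ssor_solve is hypothetical):

import numpy as np
from scipy.linalg import solve_triangular

def ssor_solve(A):
    """Return an M_solve function for the SSOR (omega = 1) preconditioner
    M = (D + L) D^{-1} (D + L^T), where D = diag(A) and L is the strict lower part."""
    d = np.diag(A)
    DL = np.diag(d) + np.tril(A, k=-1)   # lower triangular factor D + L
    def M_solve(r):
        t = solve_triangular(DL, r, lower=True)     # (D + L) t = r  (forward)
        u = d * t                                   # u = D t
        z = solve_triangular(DL.T, u, lower=False)  # (D + L^T) z = u  (backward)
        return z
    return M_solve

# usage sketch: x = preconditioned_cg(A, b, ssor_solve(A))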

Incomplete Cholesky preconditioner. Do Cholesky, but ignore fill elements. If $A$ is large and sparse, then in the Cholesky factorization

$$A = R^T R \qquad (2)$$
the matrix $R$ will often have many more nonzeros than $A$. This is one of the reasons that conjugate gradient is cheaper than Cholesky in some instances. First, let us write a componentwise version of the Cholesky algorithm to compute (2).

for k = 1 : n-1
    r_kk = √(a_kk);
    for j = k+1 : n
        r_kj = a_kj / r_kk;
    end;
    for i = k+1 : n
        for j = i : n    % Upper triangle only!
            a_ij = a_ij - r_ki r_kj;
        end;
    end;
end;
r_nn = √(a_nn);

Incomplete Cholesky ignores the fill elements. We get

$$M = \hat{R}^T \hat{R}$$
where $\hat{R} = (\hat{r}_{kj})$ is computed from the following algorithm. Here $\hat{r}_{kj} = 0$ if $a_{kj} = 0$.

Algorithm 3 (Incomplete Cholesky Factorization)

for k = 1 : n-1
    r̂_kk = √(a_kk);
    for j = k+1 : n
        r̂_kj = a_kj / r̂_kk;
    end;
    for i = k+1 : n
        for j = i : n    % Upper triangle only!
            if a_ij ≠ 0
                a_ij = a_ij - r̂_ki r̂_kj;
            end;
        end;
    end;
end;
r̂_nn = √(a_nn);
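Here is a dense Python sketch of Algorithm 3 (the name incomplete_cholesky and the dense interface are illustrative; a real implementation would operate on a sparse data structure and would also have to deal with the breakdown issue discussed next):

import numpy as np

def incomplete_cholesky(A):
    """Sketch of Algorithm 3: zero-fill incomplete Cholesky, M = Rhat^T Rhat.

    Dense arrays are used only for clarity; entries of Rhat are kept only
    where A has nonzeros, so no fill is created.
    """
    n = A.shape[0]
    a = A.astype(float)              # working copy; only its upper triangle is used
    R = np.zeros((n, n))
    for k in range(n - 1):
        R[k, k] = np.sqrt(a[k, k])
        for j in range(k + 1, n):
            if a[k, j] != 0.0:       # a_kj = 0 would give a zero entry anyway
                R[k, j] = a[k, j] / R[k, k]
        for i in range(k + 1, n):
            for j in range(i, n):    # upper triangle only
                if a[i, j] != 0.0:   # skip fill elements
                    a[i, j] -= R[k, i] * R[k, j]
    R[n - 1, n - 1] = np.sqrt(a[n - 1, n - 1])
    return R

# usage sketch: with Rhat = incomplete_cholesky(A), solving M z = r amounts to
# one triangular solve with Rhat^T followed by one with Rhat.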

One downside to this algorithm is that even if $A$ is SPD, it is possible that $a_{kk}$ could be negative or zero when it is time for $\hat{r}_{kk}$ to be evaluated at the beginning of the main loop. Thus, unlike the Jacobi and SSOR preconditioners, the incomplete Cholesky preconditioner is not defined for all SPD matrices! However, if, in addition, $A$ is irreducibly diagonally dominant, that is,

$$|a_{jj}| \ge \sum_{i \ne j} |a_{ij}|, \qquad j = 1, \ldots, n,$$
with at least one inequality strict and no decoupling, that is, no permutation such that
$$P A P^T = \begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix},$$
then incomplete Cholesky is guaranteed to exist. Incomplete Cholesky is a popular preconditioner in many environments. Much has been written about it and continues to be written about it. The many variants of it consist of different rules for specifying which fill elements to accept and which to reject. They can become quite sophisticated.
