L. Vandenberghe ECE236C (Spring 2020) 13. Conjugate gradient method

Outline

• conjugate gradient method for linear equations
• complexity
• conjugate gradient method as iterative method
• applications in nonlinear optimization

Conjugate gradient method 13.1

Unconstrained quadratic minimization

  minimize   f(x) = (1/2) x^T A x − b^T x

with A symmetric positive definite and n × n

• equivalent to solving the linear equation Ax = b
• the residual r = b − Ax is the negative gradient: r = −∇f(x)

Conjugate gradient method (CG)

• invented by Hestenes and Stiefel around 1951
• the most widely used iterative method for solving Ax = b, with A ≻ 0
• can be extended to non-quadratic unconstrained minimization

Conjugate gradient method 13.2

Krylov subspaces

Definition: a sequence of subspaces

  K_0 = {0},   K_k = span{b, Ab, ..., A^{k−1} b}   for k ≥ 1

Properties

• the subspaces are nested: K_0 ⊆ K_1 ⊆ K_2 ⊆ · · ·
• dimensions increase by at most one: dim K_{k+1} − dim K_k is zero or one
• if K_{k+1} = K_k, then K_i = K_k for all i ≥ k:

  A^k b ∈ span{b, Ab, ..., A^{k−1} b}   ⇒   A^i b ∈ span{b, Ab, ..., A^{k−1} b} for i > k
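As a quick numerical illustration of these properties (not part of the original slides; the matrix, right-hand side, and sizes are arbitrary choices), the NumPy sketch below builds the vectors b, Ab, ..., A^{n−1}b and checks that dim K_k increases by at most one per step:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # random symmetric positive definite matrix
b = rng.standard_normal(n)

# Krylov vectors b, Ab, A^2 b, ...
vectors = [b]
for _ in range(1, n):
    vectors.append(A @ vectors[-1])

# dim K_k = rank of [b, Ab, ..., A^{k-1} b]; it increases by at most one per step
dims = [np.linalg.matrix_rank(np.column_stack(vectors[:k])) for k in range(1, n + 1)]
print(dims)                           # nondecreasing, with increments of 0 or 1
```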

Conjugate gradient method 13.3

Solution of Ax = b

Key property:

  A^{−1} b ∈ K_n

this holds even when K_n ≠ R^n

• from the Cayley–Hamilton theorem,

  p(A) = A^n + a_1 A^{n−1} + · · · + a_n I = 0

  where p(λ) = det(λI − A) = λ^n + a_1 λ^{n−1} + · · · + a_{n−1} λ + a_n

• multiplying on the right with A^{−1} b shows

  A^{−1} b = −(1/a_n) ( A^{n−1} b + a_1 A^{n−2} b + · · · + a_{n−1} b )

Conjugate gradient method 13.4

Krylov sequence

  x_k = argmin_{x ∈ K_k} f(x),   k = 0, 1, ...

• from the previous page, x_n = A^{−1} b
• the CG method is a recursive method for computing the Krylov sequence x_0, x_1, ...
• we will see there is a simple two-term recurrence

  x_{k+1} = x_k − t_k ∇f(x_k) + s_k (x_k − x_{k−1})

Example: (figure) contours of f for a two-dimensional quadratic, showing the Krylov sequence x_0, x_1, x_2 = A^{−1} b

Conjugate gradient method 13.5

Residuals of Krylov sequence

  x_k = argmin_{x ∈ K_k} f(x),   k = 0, 1, ...

• optimality conditions in the definition of the Krylov sequence:

  x_k ∈ K_k,   ∇f(x_k) = A x_k − b ∈ K_k^⊥

• hence, the residual r_k = b − A x_k satisfies

  r_k ∈ K_{k+1},   r_k ∈ K_k^⊥

the first property follows from b ∈ K_1 and x_k ∈ K_k

the (nonzero) residuals form an orthogonal basis for the Krylov subspaces:

  K_k = span{r_0, r_1, ..., r_{k−1}},   r_i^T r_j = 0 for i ≠ j

Conjugate gradient method 13.6

Conjugate directions

the “steps” v_i = x_i − x_{i−1} in the Krylov sequence (defined for i ≥ 1) satisfy

  v_i^T A v_j = 0 for i ≠ j,   v_i^T A v_i = v_i^T r_{i−1}

(proof on next page)

• the vectors v_i are conjugate: orthogonal for the inner product ⟨v, w⟩ = v^T A w
• in particular, if v_i ≠ 0, it is linearly independent of v_1, ..., v_{i−1}

the (nonzero) vectors v_i form a conjugate basis for the Krylov subspaces:

  K_k = span{v_1, v_2, ..., v_k},   v_i^T A v_j = 0 for i ≠ j
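A small numerical check of this conjugacy property (an illustrative sketch, not from the slides; the problem data are arbitrary): run the CG recursion of page 13.12 on a random positive definite system, form the steps v_i = x_i − x_{i−1}, and verify that their Gram matrix in the A inner product is diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # random symmetric positive definite matrix
b = rng.standard_normal(n)

# run the CG recursion (page 13.12), storing the iterates x_0, x_1, ..., x_n
x, r, p = np.zeros(n), b.copy(), b.copy()
xs = [x.copy()]
for k in range(n):
    alpha = (r @ r) / (p @ A @ p)
    x = x + alpha * p
    r_new = r - alpha * (A @ p)
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    xs.append(x.copy())

V = np.diff(np.array(xs), axis=0).T   # columns are the steps v_i = x_i - x_{i-1}
G = V.T @ A @ V                       # Gram matrix in the A inner product
print(np.abs(G - np.diag(np.diag(G))).max())   # essentially zero: v_i^T A v_j = 0 for i != j
```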

Conjugate gradient method 13.7

Proof of properties on page 13.7

• assume j < i; we show that A v_i and v_j are orthogonal (v_j^T A v_i = 0):

  v_j = x_j − x_{j−1} ∈ K_j ⊆ K_{i−1}

  and

  A v_i = A (x_i − x_{i−1}) = −r_i + r_{i−1} ∈ K_{i−1}^⊥

• the expression v_i^T A v_i = v_i^T r_{i−1} follows from the fact that t = 1 minimizes

  f(x_{i−1} + t v_i) = f(x_{i−1}) + (1/2) t^2 (v_i^T A v_i) − t (v_i^T r_{i−1}),

  since x_i = x_{i−1} + v_i minimizes f over the entire subspace K_i

Conjugate gradient method 13.8

Conjugate vectors

it will be convenient to work with a sequence of scaled vectors p_k = v_k / α_k with

  α_k = (v_k^T r_{k−1}) / ||r_{k−1}||_2^2

• the scaling factor α_k was chosen to satisfy

  p_k^T r_{k−1} = ||r_{k−1}||_2^2

• using v_k^T A v_k = v_k^T r_{k−1} (page 13.7), we can express α_k as

  α_k = (p_k^T r_{k−1}) / (p_k^T A p_k) = ||r_{k−1}||_2^2 / (p_k^T A p_k)

• in this notation, the Krylov sequence and residuals satisfy

  x_k = x_{k−1} + α_k p_k,   r_k = r_{k−1} − α_k A p_k

Conjugate gradient method 13.9

Recursion for p_k

the vectors p_1, p_2, ... can be computed recursively as p_1 = r_0,

  p_{k+1} = r_k − ((p_k^T A r_k) / (p_k^T A p_k)) p_k,   k = 1, 2, ...        (1)

(proof on next page)

• this can be further simplified using

  r_k = r_{k−1} − (||r_{k−1}||_2^2 / (p_k^T A p_k)) A p_k   ⇒   (r_k^T A p_k) / (p_k^T A p_k) = −||r_k||_2^2 / ||r_{k−1}||_2^2

• substituting in the recursion for p_{k+1} gives

  p_{k+1} = r_k + (||r_k||_2^2 / ||r_{k−1}||_2^2) p_k,   k = 1, 2, ...

Conjugate gradient method 13.10

Proof of (1): p_{k+1} ∈ K_{k+1} = span{p_1, p_2, ..., p_k, r_k}, so we can express it as

  p_{k+1} = γ_1 p_1 + · · · + γ_{k−1} p_{k−1} + β p_k + δ r_k

• δ = 1: take the inner product with r_k and use

  r_k^T p_{k+1} = ||r_k||_2^2,   r_k^T p_1 = · · · = r_k^T p_k = 0   (r_k ∈ K_k^⊥)

• γ_1 = · · · = γ_{k−1} = 0: take inner products with A p_j for j ≤ k − 1, and use

  p_j^T A p_i = 0 for j ≠ i,   p_j^T A r_k = 0

  (second equality because A p_j ∈ K_{j+1} ⊆ K_k and r_k ∈ K_k^⊥)

• hence, p_{k+1} = r_k + β p_k; the inner product with A p_k shows that

  β = −(p_k^T A r_k) / (p_k^T A p_k)

Conjugate gradient method 13.11

Conjugate gradient

define x_0 = 0, r_0 = b, and repeat for k = 0, 1, ... until r_k is sufficiently small:

1. if k = 0, take p_1 = r_0; otherwise, take

   p_{k+1} = r_k + (||r_k||_2^2 / ||r_{k−1}||_2^2) p_k

2. compute

   α = ||r_k||_2^2 / (p_{k+1}^T A p_{k+1}),   x_{k+1} = x_k + α p_{k+1},   r_{k+1} = r_k − α A p_{k+1}

the main computation per iteration is the matrix–vector product A p_{k+1}
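A minimal NumPy sketch of the algorithm above (the function name, tolerance, and random test problem are illustrative choices, not part of the slides):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """CG iteration of page 13.12: solve Ax = b with A symmetric positive definite."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = np.zeros(n)
    r = b.copy()                      # r_0 = b - A x_0 with x_0 = 0
    p = r.copy()                      # p_1 = r_0
    rr = r @ r
    for _ in range(max_iter):
        if np.sqrt(rr) <= tol:
            break
        Ap = A @ p                    # the one matrix-vector product per iteration
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p     # two-term recurrence for the conjugate direction
        rr = rr_new
    return x

# usage on a random positive definite system (illustrative data)
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))      # residual norm, close to zero
```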

Conjugate gradient method 13.12

Outline

• conjugate gradient method for linear equations
• complexity
• conjugate gradient method as iterative method
• applications in nonlinear optimization

Notation

  minimize   f(x) = (1/2) x^T A x − b^T x

Optimal value

  f(x*) = −(1/2) b^T A^{−1} b = −(1/2) ||x*||_A^2

Suboptimality at x

  f(x) − f* = (1/2) ||x − x*||_A^2

Relative error measure

  τ = (f(x) − f*) / (f(0) − f*) = ||x − x*||_A^2 / ||x*||_A^2

here, ||u||_A = (u^T A u)^{1/2} is the A-weighted norm

Conjugate gradient method 13.13

Error after k steps

• x_k ∈ K_k = span{b, Ab, ..., A^{k−1} b}, so x_k can be expressed as

  x_k = Σ_{i=1}^k c_i A^{i−1} b = p(A) b

  where p(λ) = Σ_{i=1}^k c_i λ^{i−1} is a polynomial of degree k − 1 or less

• x_k minimizes f(x) over K_k; hence

  f(x_k) − f* = inf_{x ∈ K_k} (1/2) ||x − x*||_A^2 = inf_{deg p < k} (1/2) ||(p(A) − A^{−1}) b||_A^2

Conjugate gradient method 13.14

Error and spectrum of A

• eigenvalue decomposition of A:

  A = Q Λ Q^T = Σ_{i=1}^n λ_i q_i q_i^T   (Q^T Q = I,  Λ = diag(λ_1, ..., λ_n))

• define d = Q^T b
• the expression on the previous page simplifies to

  f(x_k) − f* = inf_{deg p < k} (1/2) ||(p(A) − A^{−1}) b||_A^2 = inf_{deg p < k} (1/2) Σ_{i=1}^n (d_i^2 / λ_i) (1 − λ_i p(λ_i))^2

Conjugate gradient method 13.15

Error bounds

Absolute error

  f(x_k) − f* ≤ ( Σ_{i=1}^n d_i^2 / (2λ_i) )  inf_{deg q ≤ k, q(0)=1}  max_{i=1,...,n} q(λ_i)^2
             = (1/2) ||x*||_A^2  inf_{deg q ≤ k, q(0)=1}  max_{i=1,...,n} q(λ_i)^2

the equality follows from Σ_i d_i^2 / λ_i = b^T A^{−1} b = ||x*||_A^2

Relative error

  τ_k = ||x_k − x*||_A^2 / ||x*||_A^2  ≤  inf_{deg q ≤ k, q(0)=1}  max_{i=1,...,n} q(λ_i)^2

Conjugate gradient method 13.16

Convergence rate and spectrum of A

• if A has m distinct eigenvalues γ_1, ..., γ_m, CG terminates in m steps:

  q(λ) = ((−1)^m / (γ_1 · · · γ_m)) (λ − γ_1) · · · (λ − γ_m)

  satisfies deg q = m, q(0) = 1, q(λ_1) = · · · = q(λ_n) = 0; therefore τ_m = 0

• if the eigenvalues are clustered in m groups, then τ_m is small

  one can find q(λ) of degree m, with q(0) = 1, that is small on the spectrum

• if x* is a linear combination of m eigenvectors, CG terminates in m steps

  take q of degree m with q(λ_i) = 0 where d_i ≠ 0; then

  Σ_{i=1}^n q(λ_i)^2 d_i^2 / λ_i = 0

Conjugate gradient method 13.17

Other bounds

we omit the proofs of the following results

• in terms of the condition number κ = λ_max / λ_min:

  τ_k ≤ 2 ( (√κ − 1) / (√κ + 1) )^k

  derived by taking for q a Chebyshev polynomial on [λ_min, λ_max]

• in terms of the sorted eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ λ_n:

  τ_k ≤ ( (λ_k − λ_n) / (λ_k + λ_n) )^2

  derived by taking q with roots at λ_1, ..., λ_{k−1} and (λ_k + λ_n)/2
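To see the condition-number bound in action, the short script below (an illustration, not from the slides; the spectrum, problem size, and iteration count are arbitrary) runs CG and prints the observed relative error τ_k next to the bound 2((√κ − 1)/(√κ + 1))^k:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.linspace(1.0, 100.0, n)              # chosen spectrum: condition number kappa = 100
A = Q @ np.diag(lam) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
kappa = lam.max() / lam.min()
c = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

def a_norm2(u):                               # squared A-weighted norm ||u||_A^2
    return u @ A @ u

# run CG and compare tau_k = ||x_k - x*||_A^2 / ||x*||_A^2 with the bound 2 c^k
x, r, p = np.zeros(n), b.copy(), b.copy()
for k in range(1, 31):
    alpha = (r @ r) / (p @ A @ p)
    x = x + alpha * p
    r_new = r - alpha * (A @ p)
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    tau_k = a_norm2(x - x_star) / a_norm2(x_star)
    print(k, tau_k, 2 * c**k)
```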

Conjugate gradient method 13.18

Outline

• conjugate gradient method for linear equations
• complexity
• conjugate gradient method as iterative method
• applications in nonlinear optimization

Conjugate gradient method as iterative method

In exact arithmetic

• CG was originally proposed as a direct (non-iterative) method
• in theory, it terminates in at most n steps

In practice

• due to rounding errors, the CG method can take many more than n steps (or fail)
• CG is now used as an iterative method
• with luck (good spectrum of A), a good approximation in a small number of steps
• attractive if matrix–vector products are inexpensive

Conjugate gradient method 13.19

Preconditioning

• make the change of variables y = Bx with B nonsingular, and apply CG to

  B^{−T} A B^{−1} y = B^{−T} b

• if the spectrum of B^{−T} A B^{−1} is clustered, PCG converges fast
• trade-off between enhanced convergence and the cost of the extra computation
• the matrix C = B^T B is called the preconditioner

Examples

• diagonal: C = diag(A_11, A_22, ..., A_nn)
• incomplete or approximate Cholesky factorization of A
• good preconditioners are often application-dependent

Conjugate gradient method 13.20

Naive implementation

apply the algorithm of page 13.12 to Ãy = b̃, where à = B^{−T} A B^{−1} and b̃ = B^{−T} b

Algorithm: define y_0 = 0, r̃_0 = b̃, and repeat for k = 0, 1, ... until r̃_k is sufficiently small:

1. if k = 0, take p̃_1 = r̃_0; otherwise, take

   p̃_{k+1} = r̃_k + (||r̃_k||_2^2 / ||r̃_{k−1}||_2^2) p̃_k

2. compute

   α = ||r̃_k||_2^2 / (p̃_{k+1}^T à p̃_{k+1}),   y_{k+1} = y_k + α p̃_{k+1},   r̃_{k+1} = r̃_k − α à p̃_{k+1}

Conjugate gradient method 13.21

Improvements

• instead of y_k, p̃_k, compute the iterates and steps in the original coordinates:

  x_k = B^{−1} y_k,   p_k = B^{−1} p̃_k

• compute residuals in the original coordinates:

  r_k = B^T r̃_k = b − A x_k

• compute the squared residual norms as

  ||r̃_k||_2^2 = r_k^T C^{−1} r_k

• the extra work per iteration is solving one equation to compute C^{−1} r_k

Conjugate gradient method 13.22

Preconditioned conjugate gradient algorithm

define x_0 = 0, r_0 = b, and repeat for k = 0, 1, ... until r_k is sufficiently small:

1. solve the equation C s_k = r_k

2. if k = 0, take p_1 = s_0; otherwise, take

   p_{k+1} = s_k + ((r_k^T s_k) / (r_{k−1}^T s_{k−1})) p_k

3. compute

   α = (r_k^T s_k) / (p_{k+1}^T A p_{k+1}),   x_{k+1} = x_k + α p_{k+1},   r_{k+1} = r_k − α A p_{k+1}
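A NumPy sketch of the algorithm above; the function names, the diagonal (Jacobi) preconditioner from the examples on page 13.20, and the random test problem are illustrative choices:

```python
import numpy as np

def preconditioned_cg(A, b, solve_C, tol=1e-10, max_iter=None):
    """PCG of page 13.23; solve_C(r) should return C^{-1} r for the preconditioner C."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = np.zeros(n)
    r = b.copy()
    s = solve_C(r)                    # step 1: solve C s_0 = r_0
    p = s.copy()                      # p_1 = s_0
    rs = r @ s
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)         # step 3
        x += alpha * p
        r -= alpha * Ap
        s = solve_C(r)                # step 1 for the next iteration
        rs_new = r @ s
        p = s + (rs_new / rs) * p     # step 2
        rs = rs_new
    return x

# example: diagonal (Jacobi) preconditioner C = diag(A_11, ..., A_nn)
rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)
b = rng.standard_normal(200)
d = np.diag(A)
x = preconditioned_cg(A, b, solve_C=lambda r: r / d)
print(np.linalg.norm(A @ x - b))      # residual norm, close to zero
```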

Conjugate gradient method 13.23

Outline

• conjugate gradient method for linear equations
• complexity
• conjugate gradient method as iterative method
• applications in nonlinear optimization

Applications in optimization

Inexact and truncated Newton methods

• use the conjugate gradient method to compute an (approximate) Newton step
• less reliable than exact Newton methods, but can handle very large problems

Nonlinear conjugate gradient methods

• extend the linear CG method to non-quadratic functions
• local convergence similar to linear CG
• limited global convergence theory

Conjugate gradient method 13.24

Nonlinear conjugate gradient

  minimize   f(x)

f convex and differentiable

Modifications needed to extend the linear CG algorithm of page 13.12

• replace r_k = b − A x_k with −∇f(x_k)
• determine the step size α_k by a line search

Conjugate gradient method 13.25

Fletcher–Reeves CG algorithm

the CG algorithm of page 13.12, modified to minimize a non-quadratic convex f

Algorithm: choose x_0 and repeat for k = 0, 1, ... until ∇f(x_k) is sufficiently small:

1. if k = 0, take p_1 = −∇f(x_0); otherwise, take

   p_{k+1} = −∇f(x_k) + β_k p_k   where   β_k = ||∇f(x_k)||_2^2 / ||∇f(x_{k−1})||_2^2

2. update x_{k+1} = x_k + α_k p_{k+1}, where

   α_k = argmin_α f(x_k + α p_{k+1})
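The sketch below implements the Fletcher–Reeves iteration above in NumPy, with scipy.optimize.minimize_scalar standing in for the exact line search; the test function (a log-sum-exp), the tolerances, and the absence of restarts are illustrative assumptions, not part of the slides:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fletcher_reeves(f, grad, x0, tol=1e-8, max_iter=200):
    """Fletcher-Reeves nonlinear CG with a numerical stand-in for the exact line search."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                                               # first iteration: gradient step
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        alpha = minimize_scalar(lambda a: f(x + a * p)).x   # alpha_k = argmin_a f(x + a p)
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)                 # Fletcher-Reeves beta_k
        p = -g_new + beta * p
        g = g_new
    return x

# example: a smooth convex test function (log-sum-exp) and its gradient
W = np.array([[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]])

def f(x):
    return np.log(np.sum(np.exp(W @ x)))

def grad(x):
    w = np.exp(W @ x)
    return W.T @ (w / w.sum())

print(fletcher_reeves(f, grad, x0=[1.0, 1.0]))           # approaches the minimizer of f
```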

Conjugate gradient method 13.26

Some observations

Interpretation

• the first iteration is a gradient step
• the general update is a gradient step with a momentum term:

  x_{k+1} = x_k − α_k ∇f(x_k) + (α_k β_k / α_{k−1}) (x_k − x_{k−1})

• it is common to restart the algorithm periodically by taking a gradient step

Line search

• with exact line search, the method reduces to linear CG for quadratic f
• exact line search in the computation of α_{k−1} implies that ∇f(x_k)^T p_k = 0
• therefore p_{k+1} is a descent direction at x_k:

  ∇f(x_k)^T p_{k+1} = −||∇f(x_k)||_2^2 + β_k ∇f(x_k)^T p_k = −||∇f(x_k)||_2^2 < 0

Conjugate gradient method 13.27

Variations

Polak–Ribière: compute β_k from

  β_k = ∇f(x_k)^T (∇f(x_k) − ∇f(x_{k−1})) / ||∇f(x_{k−1})||_2^2

Hestenes–Stiefel: compute β_k from

  β_k = ∇f(x_k)^T (∇f(x_k) − ∇f(x_{k−1})) / ( p_k^T (∇f(x_k) − ∇f(x_{k−1})) )

the formulas are equivalent for quadratic f and exact line search
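For reference, the three β_k choices (Fletcher–Reeves, Polak–Ribière, Hestenes–Stiefel) written as small helper functions; a sketch assuming the current gradient, previous gradient, and previous direction are available as NumPy arrays:

```python
import numpy as np

def beta_fletcher_reeves(g_new, g_old, p_old):
    # p_old unused; kept for a uniform signature
    return (g_new @ g_new) / (g_old @ g_old)

def beta_polak_ribiere(g_new, g_old, p_old):
    y = g_new - g_old                 # gradient change
    return (g_new @ y) / (g_old @ g_old)

def beta_hestenes_stiefel(g_new, g_old, p_old):
    y = g_new - g_old
    return (g_new @ y) / (p_old @ y)

# for quadratic f with exact line search, the three values coincide
```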

Conjugate gradient method 13.28

References

• S. Boyd, lecture slides and notes for EE364b (Convex Optimization II), lectures on the conjugate gradient method.
• G. H. Golub and C. F. Van Loan, Matrix Computations (1996), chapter 10.
• C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations (1995), chapter 2.
• J. Nocedal and S. J. Wright, Numerical Optimization (2006), chapter 5.
• H. A. van der Vorst, Iterative Krylov Methods for Large Linear Systems (2003).

Conjugate gradient method 13.29