L. Vandenberghe ECE236C (Spring 2020) 13. Conjugate gradient method

Outline

• conjugate gradient method for linear equations
• complexity
• conjugate gradient method as iterative method
• applications in nonlinear optimization

Conjugate gradient method 13.1

Unconstrained quadratic minimization

  minimize   f(x) = (1/2) x^T A x − b^T x

with A symmetric positive definite and n × n

• equivalent to solving the linear equation Ax = b
• the residual r = b − Ax is the negative gradient: r = −∇f(x)

Conjugate gradient method (CG)

• invented by Hestenes and Stiefel around 1951
• the most widely used iterative method for solving Ax = b, with A ≻ 0
• can be extended to non-quadratic unconstrained minimization

Conjugate gradient method 13.2

Krylov subspaces

Definition: a sequence of subspaces

  K_0 = {0},   K_k = span{b, Ab, ..., A^{k−1} b}   for k ≥ 1

Properties

• the subspaces are nested: K_0 ⊆ K_1 ⊆ K_2 ⊆ · · ·
• dimensions increase by at most one: dim K_{k+1} − dim K_k is zero or one
• if K_{k+1} = K_k, then K_i = K_k for all i ≥ k:

  A^k b ∈ span{b, Ab, ..., A^{k−1} b}   ⇒   A^i b ∈ span{b, Ab, ..., A^{k−1} b} for i > k
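As a quick numerical illustration of these properties (not part of the original slides; the matrix, right-hand side, and sizes are arbitrary choices), the NumPy sketch below builds the vectors b, Ab, ..., A^{n−1}b and checks that dim K_k increases by at most one per step:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # random symmetric positive definite matrix
b = rng.standard_normal(n)

# Krylov vectors b, Ab, A^2 b, ...
vectors = [b]
for _ in range(1, n):
    vectors.append(A @ vectors[-1])

# dim K_k = rank of [b, Ab, ..., A^{k-1} b]; it increases by at most one per step
dims = [np.linalg.matrix_rank(np.column_stack(vectors[:k])) for k in range(1, n + 1)]
print(dims)                           # nondecreasing, with increments of 0 or 1
```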

Conjugate gradient method 13.3

Solution of Ax = b

Key property:

  A^{−1} b ∈ K_n

this holds even when K_n ≠ R^n

• from the Cayley–Hamilton theorem,

  p(A) = A^n + a_1 A^{n−1} + · · · + a_n I = 0

  where p(λ) = det(λI − A) = λ^n + a_1 λ^{n−1} + · · · + a_{n−1} λ + a_n

• multiplying on the right with A^{−1} b shows

  A^{−1} b = −(1/a_n) ( A^{n−1} b + a_1 A^{n−2} b + · · · + a_{n−1} b )

Conjugate gradient method 13.4

Krylov sequence

  x_k = argmin_{x ∈ K_k} f(x),   k = 0, 1, ...

• from the previous page, x_n = A^{−1} b
• the CG method is a recursive method for computing the Krylov sequence x_0, x_1, ...
• we will see there is a simple two-term recurrence

  x_{k+1} = x_k − t_k ∇f(x_k) + s_k (x_k − x_{k−1})

Example: (figure) contours of f for a two-dimensional quadratic, showing the Krylov sequence x_0, x_1, x_2 = A^{−1} b

Conjugate gradient method 13.5

Residuals of Krylov sequence

  x_k = argmin_{x ∈ K_k} f(x),   k = 0, 1, ...

• optimality conditions in the definition of the Krylov sequence:

  x_k ∈ K_k,   ∇f(x_k) = A x_k − b ∈ K_k^⊥

• hence, the residual r_k = b − A x_k satisfies

  r_k ∈ K_{k+1},   r_k ∈ K_k^⊥

the first property follows from b ∈ K_1 and x_k ∈ K_k

the (nonzero) residuals form an orthogonal basis for the Krylov subspaces:

  K_k = span{r_0, r_1, ..., r_{k−1}},   r_i^T r_j = 0 for i ≠ j

Conjugate gradient method 13.6

Conjugate directions

the “steps” v_i = x_i − x_{i−1} in the Krylov sequence (defined for i ≥ 1) satisfy

  v_i^T A v_j = 0 for i ≠ j,   v_i^T A v_i = v_i^T r_{i−1}

(proof on next page)

• the vectors v_i are conjugate: orthogonal for the inner product ⟨v, w⟩ = v^T A w
• in particular, if v_i ≠ 0, it is linearly independent of v_1, ..., v_{i−1}

the (nonzero) vectors v_i form a conjugate basis for the Krylov subspaces:

  K_k = span{v_1, v_2, ..., v_k},   v_i^T A v_j = 0 for i ≠ j
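A small numerical check of this conjugacy property (an illustrative sketch, not from the slides; the problem data are arbitrary): run the CG recursion of page 13.12 on a random positive definite system, form the steps v_i = x_i − x_{i−1}, and verify that their Gram matrix in the A inner product is diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # random symmetric positive definite matrix
b = rng.standard_normal(n)

# run the CG recursion (page 13.12), storing the iterates x_0, x_1, ..., x_n
x, r, p = np.zeros(n), b.copy(), b.copy()
xs = [x.copy()]
for k in range(n):
    alpha = (r @ r) / (p @ A @ p)
    x = x + alpha * p
    r_new = r - alpha * (A @ p)
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    xs.append(x.copy())

V = np.diff(np.array(xs), axis=0).T   # columns are the steps v_i = x_i - x_{i-1}
G = V.T @ A @ V                       # Gram matrix in the A inner product
print(np.abs(G - np.diag(np.diag(G))).max())   # essentially zero: v_i^T A v_j = 0 for i != j
```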

Conjugate gradient method 13.7

Proof of properties on page 13.7

• assume j < i; we show that A v_i and v_j are orthogonal (v_j^T A v_i = 0):

  v_j = x_j − x_{j−1} ∈ K_j ⊆ K_{i−1}

  and

  A v_i = A (x_i − x_{i−1}) = −r_i + r_{i−1} ∈ K_{i−1}^⊥

• the expression v_i^T A v_i = v_i^T r_{i−1} follows from the fact that t = 1 minimizes

  f(x_{i−1} + t v_i) = f(x_{i−1}) + (1/2) t^2 (v_i^T A v_i) − t (v_i^T r_{i−1}),

  since x_i = x_{i−1} + v_i minimizes f over the entire subspace K_i

Conjugate gradient method 13.8

Conjugate vectors

it will be convenient to work with a sequence of scaled vectors p_k = v_k / α_k with

  α_k = (v_k^T r_{k−1}) / ||r_{k−1}||_2^2

• the scaling factor α_k was chosen to satisfy

  p_k^T r_{k−1} = ||r_{k−1}||_2^2

• using v_k^T A v_k = v_k^T r_{k−1} (page 13.7), we can express α_k as

  α_k = (p_k^T r_{k−1}) / (p_k^T A p_k) = ||r_{k−1}||_2^2 / (p_k^T A p_k)

• in this notation, the Krylov sequence and residuals satisfy

  x_k = x_{k−1} + α_k p_k,   r_k = r_{k−1} − α_k A p_k

Conjugate gradient method 13.9

Recursion for p_k

the vectors p_1, p_2, ... can be computed recursively as p_1 = r_0,

  p_{k+1} = r_k − ((p_k^T A r_k) / (p_k^T A p_k)) p_k,   k = 1, 2, ...        (1)

(proof on next page)

• this can be further simplified using

  r_k = r_{k−1} − (||r_{k−1}||_2^2 / (p_k^T A p_k)) A p_k   ⇒   (r_k^T A p_k) / (p_k^T A p_k) = −||r_k||_2^2 / ||r_{k−1}||_2^2

• substituting in the recursion for p_{k+1} gives

  p_{k+1} = r_k + (||r_k||_2^2 / ||r_{k−1}||_2^2) p_k,   k = 1, 2, ...

Conjugate gradient method 13.10

Proof of (1): p_{k+1} ∈ K_{k+1} = span{p_1, p_2, ..., p_k, r_k}, so we can express it as

  p_{k+1} = γ_1 p_1 + · · · + γ_{k−1} p_{k−1} + β p_k + δ r_k

• δ = 1: take the inner product with r_k and use

  r_k^T p_{k+1} = ||r_k||_2^2,   r_k^T p_1 = · · · = r_k^T p_k = 0   (r_k ∈ K_k^⊥)

• γ_1 = · · · = γ_{k−1} = 0: take inner products with A p_j for j ≤ k − 1, and use

  p_j^T A p_i = 0 for j ≠ i,   p_j^T A r_k = 0

  (second equality because A p_j ∈ K_{j+1} ⊆ K_k and r_k ∈ K_k^⊥)

• hence, p_{k+1} = r_k + β p_k; the inner product with A p_k shows that

  β = −(p_k^T A r_k) / (p_k^T A p_k)

Conjugate gradient method 13.11

Conjugate gradient

define x_0 = 0, r_0 = b, and repeat for k = 0, 1, ... until r_k is sufficiently small:

1. if k = 0, take p_1 = r_0; otherwise, take

   p_{k+1} = r_k + (||r_k||_2^2 / ||r_{k−1}||_2^2) p_k

2. compute

   α = ||r_k||_2^2 / (p_{k+1}^T A p_{k+1}),   x_{k+1} = x_k + α p_{k+1},   r_{k+1} = r_k − α A p_{k+1}

the main computation per iteration is the matrix–vector product A p_{k+1}
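A minimal NumPy sketch of the algorithm above (the function name, tolerance, and random test problem are illustrative choices, not part of the slides):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """CG iteration of page 13.12: solve Ax = b with A symmetric positive definite."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = np.zeros(n)
    r = b.copy()                      # r_0 = b - A x_0 with x_0 = 0
    p = r.copy()                      # p_1 = r_0
    rr = r @ r
    for _ in range(max_iter):
        if np.sqrt(rr) <= tol:
            break
        Ap = A @ p                    # the one matrix-vector product per iteration
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p     # two-term recurrence for the conjugate direction
        rr = rr_new
    return x

# usage on a random positive definite system (illustrative data)
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))      # residual norm, close to zero
```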

Conjugate gradient method 13.12

Outline

• conjugate gradient method for linear equations
• complexity
• conjugate gradient method as iterative method
• applications in nonlinear optimization

Notation

  minimize   f(x) = (1/2) x^T A x − b^T x

Optimal value

  f(x*) = −(1/2) b^T A^{−1} b = −(1/2) ||x*||_A^2

Suboptimality at x

  f(x) − f* = (1/2) ||x − x*||_A^2

Relative error measure

  τ = (f(x) − f*) / (f(0) − f*) = ||x − x*||_A^2 / ||x*||_A^2

here, ||u||_A = (u^T A u)^{1/2} is the A-weighted norm

Conjugate gradient method 13.13

Error after k steps

• x_k ∈ K_k = span{b, Ab, ..., A^{k−1} b}, so x_k can be expressed as

  x_k = Σ_{i=1}^k c_i A^{i−1} b = p(A) b

  where p(λ) = Σ_{i=1}^k c_i λ^{i−1} is a polynomial of degree k − 1 or less

• x_k minimizes f(x) over K_k; hence

  f(x_k) − f* = inf_{x ∈ K_k} (1/2) ||x − x*||_A^2 = inf_{deg p < k} (1/2) ||(p(A) − A^{−1}) b||_A^2

Conjugate gradient method 13.14

Error and spectrum of A

• eigenvalue decomposition of A:

  A = Q Λ Q^T = Σ_{i=1}^n λ_i q_i q_i^T   (Q^T Q = I,  Λ = diag(λ_1, ..., λ_n))

• define d = Q^T b
• the expression on the previous page simplifies to

  f(x_k) − f* = inf_{deg p < k} (1/2) ||(p(A) − A^{−1}) b||_A^2 = inf_{deg p < k} (1/2) Σ_{i=1}^n (d_i^2 / λ_i) (1 − λ_i p(λ_i))^2

Conjugate gradient method 13.15

Error bounds

Absolute error

  f(x_k) − f* ≤ ( Σ_{i=1}^n d_i^2 / (2λ_i) )  inf_{deg q ≤ k, q(0)=1}  max_{i=1,...,n} q(λ_i)^2
             = (1/2) ||x*||_A^2  inf_{deg q ≤ k, q(0)=1}  max_{i=1,...,n} q(λ_i)^2

the equality follows from Σ_i d_i^2 / λ_i = b^T A^{−1} b = ||x*||_A^2

Relative error

  τ_k = ||x_k − x*||_A^2 / ||x*||_A^2  ≤  inf_{deg q ≤ k, q(0)=1}  max_{i=1,...,n} q(λ_i)^2

Conjugate gradient method 13.16

Convergence rate and spectrum of A

• if A has m distinct eigenvalues γ_1, ..., γ_m, CG terminates in m steps:

  q(λ) = ((−1)^m / (γ_1 · · · γ_m)) (λ − γ_1) · · · (λ − γ_m)

  satisfies deg q = m, q(0) = 1, q(λ_1) = · · · = q(λ_n) = 0; therefore τ_m = 0

• if the eigenvalues are clustered in m groups, then τ_m is small

  one can find q(λ) of degree m, with q(0) = 1, that is small on the spectrum

• if x* is a linear combination of m eigenvectors, CG terminates in m steps

  take q of degree m with q(λ_i) = 0 where d_i ≠ 0; then

  Σ_{i=1}^n q(λ_i)^2 d_i^2 / λ_i = 0

Conjugate gradient method 13.17

Other bounds

we omit the proofs of the following results

• in terms of the condition number κ = λ_max / λ_min:

  τ_k ≤ 2 ( (√κ − 1) / (√κ + 1) )^k

  derived by taking for q a Chebyshev polynomial on [λ_min, λ_max]

• in terms of the sorted eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ λ_n:

  τ_k ≤ ( (λ_k − λ_n) / (λ_k + λ_n) )^2

  derived by taking q with roots at λ_1, ..., λ_{k−1} and (λ_k + λ_n)/2
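To see the condition-number bound in action, the short script below (an illustration, not from the slides; the spectrum, problem size, and iteration count are arbitrary) runs CG and prints the observed relative error τ_k next to the bound 2((√κ − 1)/(√κ + 1))^k:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.linspace(1.0, 100.0, n)              # chosen spectrum: condition number kappa = 100
A = Q @ np.diag(lam) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
kappa = lam.max() / lam.min()
c = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

def a_norm2(u):                               # squared A-weighted norm ||u||_A^2
    return u @ A @ u

# run CG and compare tau_k = ||x_k - x*||_A^2 / ||x*||_A^2 with the bound 2 c^k
x, r, p = np.zeros(n), b.copy(), b.copy()
for k in range(1, 31):
    alpha = (r @ r) / (p @ A @ p)
    x = x + alpha * p
    r_new = r - alpha * (A @ p)
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    tau_k = a_norm2(x - x_star) / a_norm2(x_star)
    print(k, tau_k, 2 * c**k)
```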

Conjugate gradient method 13.18

Outline

• conjugate gradient method for linear equations
• complexity
• conjugate gradient method as iterative method
• applications in nonlinear optimization

Conjugate gradient method as iterative method

In exact arithmetic

• CG was originally proposed as a direct (non-iterative) method
• in theory, it terminates in at most n steps

In practice

• due to rounding errors, the CG method can take many more than n steps (or fail)
• CG is now used as an iterative method
• with luck (good spectrum of A), a good approximation in a small number of steps
• attractive if matrix–vector products are inexpensive

Conjugate gradient method 13.19

Preconditioning

• make the change of variables y = Bx with B nonsingular, and apply CG to

  B^{−T} A B^{−1} y = B^{−T} b

• if the spectrum of B^{−T} A B^{−1} is clustered, PCG converges fast
• trade-off between enhanced convergence and the cost of the extra computation
• the matrix C = B^T B is called the preconditioner

Examples

• diagonal: C = diag(A_11, A_22, ..., A_nn)
• incomplete or approximate Cholesky factorization of A
• good preconditioners are often application-dependent

Conjugate gradient method 13.20

Naive implementation

apply the algorithm of page 13.12 to Ãy = b̃, where à = B^{−T} A B^{−1} and b̃ = B^{−T} b

Algorithm: define y_0 = 0, r̃_0 = b̃, and repeat for k = 0, 1, ... until r̃_k is sufficiently small:

1. if k = 0, take p̃_1 = r̃_0; otherwise, take

   p̃_{k+1} = r̃_k + (||r̃_k||_2^2 / ||r̃_{k−1}||_2^2) p̃_k

2. compute

   α = ||r̃_k||_2^2 / (p̃_{k+1}^T à p̃_{k+1}),   y_{k+1} = y_k + α p̃_{k+1},   r̃_{k+1} = r̃_k − α à p̃_{k+1}

Conjugate gradient method 13.21

Improvements

• instead of y_k, p̃_k, compute the iterates and steps in the original coordinates:

  x_k = B^{−1} y_k,   p_k = B^{−1} p̃_k

• compute residuals in the original coordinates:

  r_k = B^T r̃_k = b − A x_k

• compute the squared residual norms as

  ||r̃_k||_2^2 = r_k^T C^{−1} r_k

• the extra work per iteration is solving one equation to compute C^{−1} r_k

Conjugate gradient method 13.22

Preconditioned conjugate gradient algorithm

define x_0 = 0, r_0 = b, and repeat for k = 0, 1, ... until r_k is sufficiently small:

1. solve the equation C s_k = r_k

2. if k = 0, take p_1 = s_0; otherwise, take

   p_{k+1} = s_k + ((r_k^T s_k) / (r_{k−1}^T s_{k−1})) p_k

3. compute

   α = (r_k^T s_k) / (p_{k+1}^T A p_{k+1}),   x_{k+1} = x_k + α p_{k+1},   r_{k+1} = r_k − α A p_{k+1}
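A NumPy sketch of the algorithm above; the function names, the diagonal (Jacobi) preconditioner from the examples on page 13.20, and the random test problem are illustrative choices:

```python
import numpy as np

def preconditioned_cg(A, b, solve_C, tol=1e-10, max_iter=None):
    """PCG of page 13.23; solve_C(r) should return C^{-1} r for the preconditioner C."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = np.zeros(n)
    r = b.copy()
    s = solve_C(r)                    # step 1: solve C s_0 = r_0
    p = s.copy()                      # p_1 = s_0
    rs = r @ s
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)         # step 3
        x += alpha * p
        r -= alpha * Ap
        s = solve_C(r)                # step 1 for the next iteration
        rs_new = r @ s
        p = s + (rs_new / rs) * p     # step 2
        rs = rs_new
    return x

# example: diagonal (Jacobi) preconditioner C = diag(A_11, ..., A_nn)
rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)
b = rng.standard_normal(200)
d = np.diag(A)
x = preconditioned_cg(A, b, solve_C=lambda r: r / d)
print(np.linalg.norm(A @ x - b))      # residual norm, close to zero
```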

Conjugate gradient method 13.23

Outline

• conjugate gradient method for linear equations
• complexity
• conjugate gradient method as iterative method
• applications in nonlinear optimization

Applications in optimization

Inexact and truncated Newton methods

• use the conjugate gradient method to compute an (approximate) Newton step
• less reliable than exact Newton methods, but can handle very large problems

Nonlinear conjugate gradient methods

• extend the linear CG method to non-quadratic functions
• local convergence similar to linear CG
• limited global convergence theory

Conjugate gradient method 13.24

Nonlinear conjugate gradient

  minimize   f(x)

f convex and differentiable

Modifications needed to extend the linear CG algorithm of page 13.12

• replace r_k = b − A x_k with −∇f(x_k)
• determine the step size α_k by a line search

Conjugate gradient method 13.25

Fletcher–Reeves CG algorithm

the CG algorithm of page 13.12, modified to minimize a non-quadratic convex f

Algorithm: choose x_0 and repeat for k = 0, 1, ... until ∇f(x_k) is sufficiently small:

1. if k = 0, take p_1 = −∇f(x_0); otherwise, take

   p_{k+1} = −∇f(x_k) + β_k p_k   where   β_k = ||∇f(x_k)||_2^2 / ||∇f(x_{k−1})||_2^2

2. update x_{k+1} = x_k + α_k p_{k+1}, where

   α_k = argmin_α f(x_k + α p_{k+1})
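The sketch below implements the Fletcher–Reeves iteration above in NumPy, with scipy.optimize.minimize_scalar standing in for the exact line search; the test function (a log-sum-exp), the tolerances, and the absence of restarts are illustrative assumptions, not part of the slides:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fletcher_reeves(f, grad, x0, tol=1e-8, max_iter=200):
    """Fletcher-Reeves nonlinear CG with a numerical stand-in for the exact line search."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                                               # first iteration: gradient step
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        alpha = minimize_scalar(lambda a: f(x + a * p)).x   # alpha_k = argmin_a f(x + a p)
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)                 # Fletcher-Reeves beta_k
        p = -g_new + beta * p
        g = g_new
    return x

# example: a smooth convex test function (log-sum-exp) and its gradient
W = np.array([[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]])

def f(x):
    return np.log(np.sum(np.exp(W @ x)))

def grad(x):
    w = np.exp(W @ x)
    return W.T @ (w / w.sum())

print(fletcher_reeves(f, grad, x0=[1.0, 1.0]))           # approaches the minimizer of f
```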

Conjugate gradient method 13.26

Some observations

Interpretation

• the first iteration is a gradient step
• the general update is a gradient step with a momentum term:

  x_{k+1} = x_k − α_k ∇f(x_k) + (α_k β_k / α_{k−1}) (x_k − x_{k−1})

• it is common to restart the algorithm periodically by taking a gradient step

Line search

• with exact line search, the method reduces to linear CG for quadratic f
• exact line search in the computation of α_{k−1} implies that ∇f(x_k)^T p_k = 0
• therefore p_{k+1} is a descent direction at x_k:

  ∇f(x_k)^T p_{k+1} = −||∇f(x_k)||_2^2 + β_k ∇f(x_k)^T p_k = −||∇f(x_k)||_2^2 < 0

Conjugate gradient method 13.27

Variations

Polak–Ribière: compute β_k from

  β_k = ∇f(x_k)^T (∇f(x_k) − ∇f(x_{k−1})) / ||∇f(x_{k−1})||_2^2

Hestenes–Stiefel: compute β_k from

  β_k = ∇f(x_k)^T (∇f(x_k) − ∇f(x_{k−1})) / ( p_k^T (∇f(x_k) − ∇f(x_{k−1})) )

the formulas are equivalent for quadratic f and exact line search
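For reference, the three β_k choices (Fletcher–Reeves, Polak–Ribière, Hestenes–Stiefel) written as small helper functions; a sketch assuming the current gradient, previous gradient, and previous direction are available as NumPy arrays:

```python
import numpy as np

def beta_fletcher_reeves(g_new, g_old, p_old):
    # p_old unused; kept for a uniform signature
    return (g_new @ g_new) / (g_old @ g_old)

def beta_polak_ribiere(g_new, g_old, p_old):
    y = g_new - g_old                 # gradient change
    return (g_new @ y) / (g_old @ g_old)

def beta_hestenes_stiefel(g_new, g_old, p_old):
    y = g_new - g_old
    return (g_new @ y) / (p_old @ y)

# for quadratic f with exact line search, the three values coincide
```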

Conjugate gradient method 13.28

References

• S. Boyd, lecture slides and notes for EE364b (Convex Optimization II), lectures on the conjugate gradient method.
• G. H. Golub and C. F. Van Loan, Matrix Computations (1996), chapter 10.
• C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations (1995), chapter 2.
• J. Nocedal and S. J. Wright, Numerical Optimization (2006), chapter 5.
• H. A. van der Vorst, Iterative Krylov Methods for Large Linear Systems (2003).

Conjugate gradient method 13.29