13. Conjugate Gradient Method
L. Vandenberghe, ECE236C (Spring 2020)

Outline
• conjugate gradient method for linear equations
• complexity
• conjugate gradient method as iterative method
• applications in nonlinear optimization

13.2 Unconstrained quadratic minimization

    minimize   f(x) = (1/2) x^T A x − b^T x

with A symmetric positive definite and n × n

• equivalent to solving the linear equation Ax = b
• the residual r = b − Ax is the negative gradient: r = −∇f(x)

Conjugate gradient method (CG)

• invented by Hestenes and Stiefel around 1951
• the most widely used iterative method for solving Ax = b, with A ≻ 0
• can be extended to non-quadratic unconstrained minimization

13.3 Krylov subspaces

Definition: a sequence of subspaces

    K_0 = {0},   K_k = span{b, Ab, ..., A^{k−1} b}   for k ≥ 1

Properties

• the subspaces are nested: K_0 ⊆ K_1 ⊆ K_2 ⊆ · · ·
• dimensions increase by at most one: dim K_{k+1} − dim K_k is zero or one
• if K_{k+1} = K_k, then K_i = K_k for all i ≥ k:

    A^k b ∈ span{b, Ab, ..., A^{k−1} b}   ⟹   A^i b ∈ span{b, Ab, ..., A^{k−1} b} for i > k

13.4 Solution of Ax = b

Key property: A^{−1} b ∈ K_n

this holds even when K_n ≠ R^n

• from the Cayley–Hamilton theorem,

    p(A) = A^n + a_1 A^{n−1} + · · · + a_n I = 0

  where p(λ) = det(λI − A) = λ^n + a_1 λ^{n−1} + · · · + a_{n−1} λ + a_n

• multiplying on the right with A^{−1} b shows

    A^{−1} b = −(1/a_n) (A^{n−1} b + a_1 A^{n−2} b + · · · + a_{n−1} b)

13.5 Krylov sequence

    x_k = argmin_{x ∈ K_k} f(x),   k = 0, 1, ...

• from the previous page, x_n = A^{−1} b
• the CG method is a recursive method for computing the Krylov sequence x_0, x_1, ...
• we will see that there is a simple two-term recurrence

    x_{k+1} = x_k − t_k ∇f(x_k) + s_k (x_k − x_{k−1})

Example: (figure) contour lines of f for a 2 × 2 positive definite example, showing the Krylov sequence points x_0, x_1, and x_2 = A^{−1} b
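Before developing the recursions, it may help to see the Krylov sequence computed directly from its definition. The sketch below is not part of the notes: it builds an explicit (orthonormalized) basis of K_k, minimizes f over that subspace, and confirms the key property x_n = A^{−1} b of page 13.4 on a small random positive definite example; the helper names krylov_basis and krylov_sequence are illustrative choices.

```python
# Sketch (not from the notes): compute x_k = argmin over K_k of f(x) directly
# from the definition and check that x_n = A^{-1} b.
import numpy as np

def krylov_basis(A, b, k):
    """Columns form an orthonormal basis of K_k = span{b, Ab, ..., A^(k-1) b}."""
    Q = np.empty((len(b), k))
    v = b.copy()
    for i in range(k):
        for j in range(i):                     # Gram-Schmidt against previous columns
            v -= (Q[:, j] @ v) * Q[:, j]
        Q[:, i] = v / np.linalg.norm(v)
        v = A @ Q[:, i]                        # next Krylov direction
    return Q

def krylov_sequence(A, b, n):
    """x_k minimizes f(x) = 0.5 x^T A x - b^T x over K_k; write x = Q y and
    solve the optimality condition Q^T (A Q y - b) = 0."""
    xs = [np.zeros(len(b))]                    # x_0 = 0
    for k in range(1, n + 1):
        Q = krylov_basis(A, b, k)
        y = np.linalg.solve(Q.T @ A @ Q, Q.T @ b)
        xs.append(Q @ y)
    return xs

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                    # symmetric positive definite
b = rng.standard_normal(n)

xs = krylov_sequence(A, b, n)
print(np.allclose(xs[n], np.linalg.solve(A, b)))   # True: x_n = A^{-1} b
```

This brute-force construction rebuilds a basis at every step and is far more expensive than CG itself; it is only meant to make the definitions on pages 13.3 to 13.5 concrete.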
13.6 Residuals of Krylov sequence

    x_k = argmin_{x ∈ K_k} f(x),   k = 0, 1, ...

• optimality conditions in the definition of the Krylov sequence:

    x_k ∈ K_k,   ∇f(x_k) = Ax_k − b ⊥ K_k

• hence, the residual r_k = b − Ax_k satisfies

    r_k ∈ K_{k+1},   r_k ⊥ K_k

  the first property follows from b ∈ K_1 and x_k ∈ K_k
• the (nonzero) residuals form an orthogonal basis for the Krylov subspaces:

    K_k = span{r_0, r_1, ..., r_{k−1}},   r_i^T r_j = 0 for i ≠ j

13.7 Conjugate directions

the "steps" v_i = x_i − x_{i−1} in the Krylov sequence (defined for i ≥ 1) satisfy

    v_i^T A v_j = 0 for i ≠ j,   v_i^T A v_i = v_i^T r_{i−1}

(proof on next page)

• the vectors v_i are conjugate: orthogonal for the inner product ⟨v, w⟩ = v^T A w
• in particular, if v_i ≠ 0, it is linearly independent of v_1, ..., v_{i−1}
• the (nonzero) vectors v_i form a conjugate basis for the Krylov subspaces:

    K_k = span{v_1, v_2, ..., v_k},   v_i^T A v_j = 0 for i ≠ j

13.8 Proof of properties on page 13.7

• assume j < i; we show that Av_i and v_j are orthogonal (v_j^T A v_i = 0):

    v_j = x_j − x_{j−1} ∈ K_j ⊆ K_{i−1}

  and

    Av_i = A(x_i − x_{i−1}) = −r_i + r_{i−1} ∈ K_{i−1}^⊥

• the expression v_i^T A v_i = v_i^T r_{i−1} follows from the fact that t = 1 minimizes

    f(x_{i−1} + t v_i) = f(x_{i−1}) + (1/2) t^2 (v_i^T A v_i) − t (v_i^T r_{i−1}),

  since x_i = x_{i−1} + v_i minimizes f over the entire subspace K_i

13.9 Conjugate vectors

it will be convenient to work with a sequence of scaled vectors p_k = v_k / α_k with

    α_k = (v_k^T r_{k−1}) / ||r_{k−1}||_2^2

• the scaling factor α_k was chosen to satisfy

    p_k^T r_{k−1} = ||r_{k−1}||_2^2

• using v_k^T A v_k = v_k^T r_{k−1} (page 13.7), we can express α_k as

    α_k = (p_k^T r_{k−1}) / (p_k^T A p_k) = ||r_{k−1}||_2^2 / (p_k^T A p_k)

• in this notation, the Krylov sequence and residuals satisfy

    x_k = x_{k−1} + α_k p_k,   r_k = r_{k−1} − α_k A p_k

13.10 Recursion for p_k

the vectors p_1, p_2, ... can be computed recursively as p_1 = r_0,

    p_{k+1} = r_k − ((p_k^T A r_k) / (p_k^T A p_k)) p_k,   k = 1, 2, ...     (1)

(proof on next page)

• this can be further simplified using

    r_k = r_{k−1} − (||r_{k−1}||_2^2 / (p_k^T A p_k)) A p_k   ⟹   ||r_k||_2^2 = −||r_{k−1}||_2^2 (r_k^T A p_k) / (p_k^T A p_k)

• substituting in the recursion for p_{k+1} gives

    p_{k+1} = r_k + (||r_k||_2^2 / ||r_{k−1}||_2^2) p_k,   k = 1, 2, ...

13.11 Proof of (1)

p_{k+1} ∈ K_{k+1} = span{p_1, p_2, ..., p_k, r_k}, so we can express it as

    p_{k+1} = γ_1 p_1 + · · · + γ_{k−1} p_{k−1} + β p_k + δ r_k

• δ = 1: take the inner product with r_k and use

    r_k^T p_{k+1} = ||r_k||_2^2,   r_k^T p_1 = · · · = r_k^T p_k = 0   (because r_k ∈ K_k^⊥)

• γ_1 = · · · = γ_{k−1} = 0: take inner products with Ap_j for j ≤ k − 1, and use

    p_j^T A p_i = 0 for j ≠ i,   p_j^T A r_k = 0

  (the second equality holds because Ap_j ∈ K_{j+1} ⊆ K_k and r_k ∈ K_k^⊥)
• hence, p_{k+1} = r_k + β p_k; the inner product with Ap_k shows that

    β = −(p_k^T A r_k) / (p_k^T A p_k)

13.12 Conjugate gradient algorithm

define x_0 = 0, r_0 = b, and repeat for k = 0, 1, ... until ||r_k|| is sufficiently small:

1. if k = 0, take p_1 = r_0; otherwise, take

    p_{k+1} = r_k + (||r_k||_2^2 / ||r_{k−1}||_2^2) p_k

2. compute

    α = ||r_k||_2^2 / (p_{k+1}^T A p_{k+1}),   x_{k+1} = x_k + α p_{k+1},   r_{k+1} = r_k − α A p_{k+1}

the main computation per iteration is the matrix–vector product Ap_{k+1}
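For concreteness, here is a minimal NumPy transcription of the algorithm above. It is a sketch: the function name, the stopping tolerance tol, and the iteration cap maxiter are choices made here, not part of the statement in the notes.

```python
# CG algorithm of page 13.12: x_0 = 0, r_0 = b, iterate until ||r_k|| is small.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, maxiter=None):
    maxiter = 2 * len(b) if maxiter is None else maxiter
    x = np.zeros(len(b))                     # x_0 = 0
    r = np.array(b, dtype=float)             # r_0 = b
    rr = r @ r                               # ||r_k||_2^2
    for k in range(maxiter):
        if np.sqrt(rr) <= tol:               # stop when ||r_k|| is sufficiently small
            break
        # step 1: p_1 = r_0; otherwise p_{k+1} = r_k + (||r_k||^2 / ||r_{k-1}||^2) p_k
        p = r.copy() if k == 0 else r + (rr / rr_prev) * p
        # step 2: alpha = ||r_k||^2 / (p_{k+1}^T A p_{k+1}); update x and r
        Ap = A @ p                           # the main computation per iteration
        alpha = rr / (p @ Ap)
        x = x + alpha * p                    # x_{k+1} = x_k + alpha p_{k+1}
        r = r - alpha * Ap                   # r_{k+1} = r_k - alpha A p_{k+1}
        rr_prev, rr = rr, r @ r
    return x
```

Note that only the product A @ p is needed, so A can be any object that implements a matrix–vector product (for example a sparse matrix); this is what makes the method attractive when those products are inexpensive.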
13.13 Notation

    minimize   f(x) = (1/2) x^T A x − b^T x

Optimal value

    f(x*) = −(1/2) b^T A^{−1} b = −(1/2) ||x*||_A^2

Suboptimality at x

    f(x) − f* = (1/2) ||x − x*||_A^2

Relative error measure

    τ = (f(x) − f*) / (f(0) − f*) = ||x − x*||_A^2 / ||x*||_A^2

here, ||u||_A = (u^T A u)^{1/2} is the A-weighted norm

13.14 Error after k steps

• x_k ∈ K_k = span{b, Ab, ..., A^{k−1} b}, so x_k can be expressed as

    x_k = sum_{i=1}^k c_i A^{i−1} b = p(A) b

  where p(λ) = sum_{i=1}^k c_i λ^{i−1} is a polynomial of degree k − 1 or less
• x_k minimizes f(x) over K_k; hence

    2 (f(x_k) − f*) = inf_{x ∈ K_k} ||x − x*||_A^2 = inf_{deg p < k} ||(p(A) − A^{−1}) b||_A^2

we now use the eigenvalue decomposition of A to bound this quantity

13.15 Error and spectrum of A

• eigenvalue decomposition of A:

    A = Q Λ Q^T = sum_{i=1}^n λ_i q_i q_i^T,   Q^T Q = I,   Λ = diag(λ_1, ..., λ_n)

• define d = Q^T b
• the expression on the previous page simplifies to

    2 (f(x_k) − f*) = inf_{deg p < k} ||(p(A) − A^{−1}) b||_A^2
                    = inf_{deg p < k} ||(p(Λ) − Λ^{−1}) d||_Λ^2
                    = inf_{deg p < k} sum_{i=1}^n (λ_i p(λ_i) − 1)^2 d_i^2 / λ_i
                    = inf_{deg q ≤ k, q(0) = 1} sum_{i=1}^n q(λ_i)^2 d_i^2 / λ_i

13.16 Error bounds

Absolute error

    f(x_k) − f* ≤ ( sum_{i=1}^n d_i^2 / (2 λ_i) ) inf_{deg q ≤ k, q(0) = 1} max_{i=1,...,n} q(λ_i)^2
                = (1/2) ||x*||_A^2 inf_{deg q ≤ k, q(0) = 1} max_{i=1,...,n} q(λ_i)^2

the equality follows from sum_i d_i^2 / λ_i = b^T A^{−1} b = ||x*||_A^2

Relative error

    τ_k = ||x_k − x*||_A^2 / ||x*||_A^2 ≤ inf_{deg q ≤ k, q(0) = 1} max_{i=1,...,n} q(λ_i)^2

13.17 Convergence rate and spectrum of A

• if A has m distinct eigenvalues γ_1, ..., γ_m, CG terminates in m steps:

    q(λ) = ((−1)^m / (γ_1 · · · γ_m)) (λ − γ_1) · · · (λ − γ_m)

  satisfies deg q = m, q(0) = 1, q(λ_1) = · · · = q(λ_n) = 0; therefore τ_m = 0
• if the eigenvalues are clustered in m groups, then τ_m is small:
  one can find q(λ) of degree m, with q(0) = 1, that is small on the spectrum
• if x* is a linear combination of m eigenvectors, CG terminates in m steps:
  take q of degree m with q(λ_i) = 0 where d_i ≠ 0; then

    sum_{i=1}^n q(λ_i)^2 d_i^2 / λ_i = 0

13.18 Other bounds

we omit the proofs of the following results

• in terms of the condition number κ = λ_max / λ_min:

    τ_k ≤ 2 ((√κ − 1) / (√κ + 1))^k

  derived by taking for q a Chebyshev polynomial on [λ_min, λ_max]
• in terms of the sorted eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ λ_n:

    τ_k ≤ ((λ_k − λ_n) / (λ_k + λ_n))^2

  derived by taking q with roots at λ_1, ..., λ_{k−1} and (λ_k + λ_n)/2
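The termination property in the first bullet of page 13.17 is easy to check numerically. The sketch below is not from the notes: the helper cg_iterates, the eigenvalue values, and the problem size are arbitrary choices. It builds a matrix with exactly three distinct eigenvalues and prints the relative error τ_k, which drops to rounding level at k = 3.

```python
# Numerical check of the termination property on page 13.17 (illustrative sketch):
# with m = 3 distinct eigenvalues, tau_k is numerically zero for k >= 3.
import numpy as np

def cg_iterates(A, b, num_steps):
    """Run the CG algorithm of page 13.12 and return [x_0, x_1, ..., x_num_steps]."""
    x, r = np.zeros(len(b)), np.array(b, dtype=float)
    rr = r @ r
    xs = [x]
    for k in range(num_steps):
        if rr == 0.0:                        # exact convergence; later iterates are unchanged
            xs.append(x)
            continue
        p = r.copy() if k == 0 else r + (rr / rr_prev) * p
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x, r = x + alpha * p, r - alpha * Ap
        rr_prev, rr = rr, r @ r
        xs.append(x)
    return xs

rng = np.random.default_rng(0)
n = 300
eigs = rng.choice([1.0, 10.0, 100.0], size=n)        # m = 3 distinct eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(eigs) @ Q.T                           # symmetric positive definite
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

def tau(x):                                           # relative error ||x - x*||_A^2 / ||x*||_A^2
    e = x - x_star
    return (e @ A @ e) / (x_star @ A @ x_star)

for k, xk in enumerate(cg_iterates(A, b, 5)):
    print(k, tau(xk))                                 # tau_3, tau_4, tau_5 are at rounding level
```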
13.19 Conjugate gradient method as iterative method

In exact arithmetic

• CG was originally proposed as a direct (non-iterative) method
• in theory, it terminates in at most n steps

In practice

• due to rounding errors, the CG method can take many more than n steps (or fail)
• CG is now used as an iterative method
• with luck (a good spectrum of A), it gives a good approximation in a small number of steps
• attractive if matrix–vector products are inexpensive

13.20 Preconditioning

• make a change of variables y = Bx with B nonsingular, and apply CG to

    B^{−T} A B^{−1} y = B^{−T} b

• if the spectrum of B^{−T} A B^{−1} is clustered, PCG converges fast
• trade-off between enhanced convergence and the cost of the extra computation
• the matrix C = B^T B is called the preconditioner

Examples

• diagonal: C = diag(A_11, A_22, ..., A_nn)
• incomplete or approximate Cholesky factorization of A
• good preconditioners are often application-dependent

Naive implementation

apply the algorithm of page 13.12 to Ãy = b̃, where à = B^{−T} A B^{−1} and b̃ = B^{−T} b

Algorithm: define y_0 = 0, r̃_0 = b̃, and repeat for k = 0, 1, ... until ||r̃_k|| is sufficiently small:

1.
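As an illustration of the naive approach just described, the sketch below forms the transformed matrix and right-hand side explicitly and hands them to the plain CG algorithm of page 13.12, then maps the solution back with x = B^{−1} y. It is only a sketch under stated assumptions: it uses the diagonal preconditioner C = diag(A_11, ..., A_nn) from the examples on page 13.20 (so B = C^{1/2}), and the function names are choices made here, not from the notes.

```python
# Illustrative sketch of the naive preconditioned CG (not the notes' implementation).
import numpy as np

def naive_preconditioned_cg(A, b, cg_solver):
    d = np.sqrt(np.diag(A))          # B = diag(d), so C = B^T B = diag(A)
    Binv = np.diag(1.0 / d)          # B^{-1} (= B^{-T}, since B is diagonal)
    A_tilde = Binv @ A @ Binv        # B^{-T} A B^{-1}
    b_tilde = Binv @ b               # B^{-T} b
    y = cg_solver(A_tilde, b_tilde)  # plain CG (page 13.12) on the transformed system
    return Binv @ y                  # change variables back: x = B^{-1} y

# e.g. x = naive_preconditioned_cg(A, b, conjugate_gradient), with the CG
# function sketched after page 13.12
```

Forming B^{−T} A B^{−1} explicitly is what makes this implementation naive: it adds matrix products and can destroy sparsity, which motivates reorganizing the iteration so that only products with A and solves with C are needed.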