Method of Conjugate Gradients

Dragica Vasileska

CG Method – Conjugate Gradient (CG) Method

M. R. Hestenes & E. Stiefel, 1952

BiCG Method – Biconjugate Gradient (BiCG) Method

C. Lanczos, 1952; D. A. H. Jacobs, 1981; C. F. Smith et al., 1990; R. Barrett et al., 1994

CGS Method – Conjugate Gradient Squared (CGS) Method (MATLAB Function)

P. Sonneveld, 1989

GMRES Method – Generalized Minimal Residual (GMRES) Method

Y. Saad & M. H. Schultz, 1986; R. Barrett et al., 1994; Y. Saad, 1996

QMR Method – Quasi–Minimal–Residual (QMR) Method

R. Freund & N. Nachtigal, 1990; N. Nachtigal, 1991; R. Barrett et al., 1994; Y. Saad, 1996

1. Introduction, Notation and Basic Terms
2. Eigenvalues and Eigenvectors
3. The Method of Steepest Descent
4. Convergence Analysis of the Method of Steepest Descent
5. The Method of Conjugate Directions
6. The Method of Conjugate Gradients
7. Convergence Analysis of the Conjugate Gradient Method
8. Complexity of the Conjugate Gradient Method
9. Preconditioning Techniques
10. Conjugate Gradient Type Methods for Non-Symmetric Matrices (CGS, Bi-CGSTAB)

1. Introduction, Notation and Basic Terms

• The CG method is one of the most popular iterative methods for solving large systems of linear equations Ax = b, which arise in many important settings such as finite difference and finite element methods for partial differential equations, circuit analysis, etc.
• It is suited for use with sparse matrices. If A is dense, the best choice is to factor A and solve the equation by backsubstitution.
• There is a fundamental underlying structure for almost all descent algorithms: (1) one starts with an initial point; (2) determines, according to a fixed rule, a direction of movement; (3) moves in that direction to a relative minimum of the objective function; (4) at the new point, a new direction is determined and the process is repeated. The difference between algorithms lies in the rule by which successive directions of movement are selected.

1. Introduction, Notation and Basic Terms (Cont'd)

• A matrix is a rectangular array of numbers, called elements.
• The transpose of an $m \times n$ matrix A is the $n \times m$ matrix $A^T$ with elements $a^T_{ij} = a_{ji}$.
• Two square $n \times n$ matrices A and B are similar if there is a nonsingular matrix S such that $B = S^{-1}AS$.
• Matrices having a single row are called row vectors; matrices having a single column are called column vectors.
  Row vector: $a = [a_1, a_2, \ldots, a_n]$
  Column vector: $a = (a_1, a_2, \ldots, a_n)$
• The inner product of two vectors is written as $x^T y = \sum_{i=1}^{n} x_i y_i$.

1. Introduction, Notation and Basic Terms (Cont'd)

• A matrix is called symmetric if $a_{ij} = a_{ji}$.
• A matrix is positive definite if, for every nonzero vector x, $x^T A x > 0$.
• Basic matrix identities: $(AB)^T = B^T A^T$ and $(AB)^{-1} = B^{-1}A^{-1}$.
• A quadratic form is a scalar, quadratic function of a vector with the form
  $f(x) = \tfrac{1}{2} x^T A x - b^T x + c$
• The gradient of a quadratic form is
  $f'(x) = \begin{bmatrix} \partial f/\partial x_1 \\ \vdots \\ \partial f/\partial x_n \end{bmatrix} = \tfrac{1}{2} A^T x + \tfrac{1}{2} A x - b$
  so that, for symmetric A, $f'(x) = Ax - b$.

1. Introduction, Notation and Basic Terms (Cont'd)

• Various quadratic forms for an arbitrary $2 \times 2$ matrix:

[Figure: (a) positive-definite matrix, (b) negative-definite matrix, (c) singular positive-indefinite matrix, (d) indefinite matrix]

1. Introduction, Notation and Basic Terms (Cont'd)

• Specific example of a $2 \times 2$ symmetric positive definite matrix:
  $A = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}, \qquad b = \begin{bmatrix} 2 \\ -8 \end{bmatrix}, \qquad c = 0, \qquad x = \begin{bmatrix} 2 \\ -2 \end{bmatrix}$
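As an illustrative aside (not part of the original slides), the following short NumPy sketch evaluates this quadratic form and its gradient and checks that its minimizer is exactly the solution of Ax = b:

```python
# Evaluate f(x) = 1/2 x^T A x - b^T x + c and its gradient for the 2x2 example,
# and verify that the minimizer is x = [2, -2]^T, the solution of Ax = b.
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 6.0]])   # symmetric positive definite
b = np.array([2.0, -8.0])
c = 0.0

def f(x):
    """Quadratic form f(x) = 1/2 x^T A x - b^T x + c."""
    return 0.5 * x @ A @ x - b @ x + c

def grad_f(x):
    """Gradient f'(x) = A x - b (A is symmetric)."""
    return A @ x - b

x_star = np.linalg.solve(A, b)
print(x_star)          # ~ [ 2., -2.]
print(grad_f(x_star))  # ~ [0., 0.]: the gradient vanishes at the minimizer
print(f(x_star))       # minimum value of the quadratic form (-10 here)
```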

Contours of the quadratic form

Gradient of the quadratic form

Graph of a quadratic form f(x)

2. Eigenvalues and Eigenvectors

• For any $n \times n$ matrix B, a scalar $\lambda$ and a nonzero vector v that satisfy the equation $Bv = \lambda v$ are said to be the eigenvalue and the eigenvector of B.
• If the matrix is symmetric, then the following properties hold:
  (a) the eigenvalues of B are real
  (b) eigenvectors associated with distinct eigenvalues are orthogonal
• The matrix B is positive definite (or positive semidefinite) if and only if all eigenvalues of B are positive (or nonnegative).
• Why should we care about the eigenvalues? Iterative methods often depend on applying B to a vector over and over again:
  (a) if $|\lambda| < 1$, then $B^i v = \lambda^i v$ vanishes as i approaches infinity
  (b) if $|\lambda| > 1$, then $B^i v = \lambda^i v$ will grow to infinity

2. Eigenvalues and Eigenvectors (Cont'd)

• Examples for $|\lambda| < 1$ and $|\lambda| > 1$:

 =-0.5

 2

•   The spectral radius of a matrix is: (B)= max| i|

 =0.7  =-2 ’ 2. Eigenvalues and Eigenvectors (Cont d) • Example: The for the solution of Ax=b - split the matrix A such that A=D+E - Instead of solving the system Ax=b, one solves the modifi- ed system x=Bx+z, where B=-D-1E and z=D-1b

  - the Jacobi iteration is $x_{(i+1)} = B x_{(i)} + z$, or,

in terms of the error vector, $e_{(i+1)} = B e_{(i)}$
  - For our particular example, the eigenvalues and the eigenvectors of the matrix A are:
    $\lambda_1 = 7$, $v_1 = [1, 2]^T$ and $\lambda_2 = 2$, $v_2 = [-2, 1]^T$
  - The eigenvalues and eigenvectors of the matrix B are:
    $\lambda_1 = -\sqrt{2}/3 \approx -0.47$, $v_1 = [\sqrt{2}, 1]^T$ and $\lambda_2 = \sqrt{2}/3 \approx 0.47$, $v_2 = [-\sqrt{2}, 1]^T$
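To connect the spectral radius to convergence in practice, here is a small NumPy sketch of the Jacobi iteration for this example (an added illustration; the iteration count of 50 is arbitrary). It converges because $\rho(B) = \sqrt{2}/3 < 1$:

```python
# Jacobi iteration x_(i+1) = B x_(i) + z for A = [[3,2],[2,6]], b = [2,-8].
# Converges to the exact solution x = [2,-2] because rho(B) = sqrt(2)/3 < 1.
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 6.0]])
b = np.array([2.0, -8.0])

D = np.diag(np.diag(A))        # diagonal part of A
E = A - D                      # off-diagonal part, so A = D + E
B = -np.linalg.solve(D, E)     # B = -D^{-1} E
z = np.linalg.solve(D, b)      # z = D^{-1} b

print(max(abs(np.linalg.eigvals(B))))   # spectral radius ~ 0.471

x = np.array([-2.0, -2.0])     # starting guess used in the slides' figure
for i in range(50):
    x = B @ x + z
print(x)                        # ~ [ 2., -2.]
```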

2. Eigenvalues and Eigenvectors (Cont'd)

• Graphical representation of the convergence of the Jacobi method

[Figure: eigenvectors of B together with their eigenvalues; convergence of the Jacobi method, which starts at $[-2, -2]^T$ and converges to $[2, -2]^T$]

[Figure: the error vectors $e_{(0)}$, $e_{(1)}$, $e_{(2)}$]

3. The Method of Steepest Descent

• In the method of steepest descent, one starts with an arbitrary point $x_{(0)}$ and takes a series of steps $x_{(1)}, x_{(2)}, \ldots$ until we are satisfied that we are close enough to the solution.
• When taking the step, one chooses the direction in which f decreases most quickly, i.e. $-f'(x_{(i)}) = b - A x_{(i)}$.
• Definitions:

  error vector: $e_{(i)} = x_{(i)} - x$

  residual: $r_{(i)} = b - A x_{(i)}$
• From Ax = b, it follows that $r_{(i)} = -A e_{(i)} = -f'(x_{(i)})$.
  The residual is actually the direction of steepest descent.

3. The Method of Steepest Descent (Cont'd)

• Start with some vector $x_{(0)}$. The next vector falls along the solid line: $x_{(1)} = x_{(0)} + \alpha\, r_{(0)}$.
• The magnitude of the step, $\alpha$, is determined with a line search procedure that minimizes f along the line:

  $\frac{d}{d\alpha} f(x_{(1)}) = f'(x_{(1)})^T \frac{d x_{(1)}}{d\alpha} = f'(x_{(1)})^T r_{(0)} = 0$
  $\Rightarrow\ r_{(1)}^T r_{(0)} = 0$
  $\Rightarrow\ \left(b - A x_{(1)}\right)^T r_{(0)} = 0$
  $\Rightarrow\ \left(b - A\left(x_{(0)} + \alpha\, r_{(0)}\right)\right)^T r_{(0)} = 0$
  $\Rightarrow\ r_{(0)}^T r_{(0)} - \alpha\, r_{(0)}^T A r_{(0)} = 0$
  $\Rightarrow\ \alpha = \frac{r_{(0)}^T r_{(0)}}{r_{(0)}^T A r_{(0)}}$

3. The Method of Steepest Descent (Cont'd)

• Geometrical representation [Figure: panels (a)-(f)]

3. The Method of Steepest Descent (Cont'd)

• The algorithm is:
  $r_{(i)} = b - A x_{(i)}$
  $\alpha_{(i)} = \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}}$
  $x_{(i+1)} = x_{(i)} + \alpha_{(i)} r_{(i)}, \qquad e_{(i+1)} = e_{(i)} + \alpha_{(i)} r_{(i)}$
  Two matrix-vector multiplications are required per iteration.
• To avoid one matrix-vector multiplication, one uses the recurrence
  $r_{(i+1)} = r_{(i)} - \alpha_{(i)} A r_{(i)}$

  The disadvantage of using this recurrence is that the residual sequence is determined without any feedback from the value of $x_{(i)}$, so that round-off errors may cause $x_{(i)}$ to converge to some point near x.

3. The Method of Steepest Descent (Cont'd)

• Summary of the algorithm: efficient implementation of the method of steepest descent
  i = 0
  r = b - Ax
  δ = r^T r
  δ_0 = δ
  While i < i_max and δ > ε² δ_0:
    q = A r
    α = δ / (r^T q)
    x = x + α r
    If i is divisible by 50:
      r = b - Ax
    else:
      r = r - α q
    endif
    δ = r^T r
    i = i + 1
• To avoid one matrix-vector multiplication, one uses r = r - α q instead of r = b - Ax; the exact residual is recomputed every 50 iterations to remove the accumulated round-off error.
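The pseudocode above translates directly into NumPy; the sketch below is an added illustration (the argument names i_max and eps are assumptions, not from the slides):

```python
# Steepest descent with the r = r - alpha*q recurrence and a periodic
# exact-residual refresh, following the pseudocode above.
import numpy as np

def steepest_descent(A, b, x0, i_max=1000, eps=1e-8):
    x = x0.astype(float).copy()
    r = b - A @ x
    delta = r @ r
    delta0 = delta
    i = 0
    while i < i_max and delta > eps**2 * delta0:
        q = A @ r
        alpha = delta / (r @ q)
        x = x + alpha * r
        if i % 50 == 0:
            r = b - A @ x          # recompute the exact residual periodically
        else:
            r = r - alpha * q      # cheap recurrence for the residual
        delta = r @ r
        i += 1
    return x, i

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
x, iters = steepest_descent(A, b, np.array([-2.0, -2.0]))
print(x, iters)                     # ~ [ 2., -2.]
```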

4. Convergence Analysis of the Method of Steepest Descent

• The error vector $e_{(i)}$ is equal to an eigenvector $v_j$ of A:
  $r_{(i)} = -A e_{(i)} = -\lambda_j v_j$
  $\alpha_{(i)} = \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}} = \frac{1}{\lambda_j}, \qquad e_{(i+1)} = e_{(i)} + \alpha_{(i)} r_{(i)} = 0$
• The error vector $e_{(i)}$ is a linear combination of the eigenvectors of A, and all eigenvalues are the same ($\lambda_j = \lambda$):
  $e_{(i)} = \sum_{j=1}^{n} \xi_j v_j, \qquad r_{(i)} = -A e_{(i)} = -\sum_{j=1}^{n} \xi_j \lambda_j v_j = -\lambda \sum_{j=1}^{n} \xi_j v_j$
  $\alpha_{(i)} = \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}} = \frac{1}{\lambda}, \qquad e_{(i+1)} = e_{(i)} + \alpha_{(i)} r_{(i)} = 0$

4. Convergence Analysis of the Method of Steepest Descent (Cont'd)

• General convergence of the method can be proven by calculating the energy norm $\|e\|_A = \left(e^T A e\right)^{1/2}$. One finds
  $\|e_{(i+1)}\|_A^2 = \omega^2\, \|e_{(i)}\|_A^2, \qquad \omega^2 = 1 - \frac{\left(\sum_j \xi_j^2 \lambda_j^2\right)^2}{\left(\sum_j \xi_j^2 \lambda_j\right)\left(\sum_j \xi_j^2 \lambda_j^3\right)}$
• For n = 2, one has that
  $\omega^2 = 1 - \frac{\left(\kappa^2 + \mu^2\right)^2}{\left(\kappa + \mu^2\right)\left(\kappa^3 + \mu^2\right)}, \qquad \kappa = \lambda_{\max}/\lambda_{\min}\ \text{(spectral condition number)}, \quad \mu = \xi_2/\xi_1$

4. Convergence Analysis of the Method of Steepest Descent (Cont'd)

• Worst-case scenario: $\mu = \pm\kappa$, which gives $\omega \le \dfrac{\kappa - 1}{\kappa + 1}$

[Figure: the influence of the spectral condition number on the convergence]

5. The Method of Conjugate Directions

• Basic idea:
  1. Pick a set of orthogonal search directions $d_{(0)}, d_{(1)}, \ldots, d_{(n-1)}$
  2. Take exactly one step in each search direction to line up with x
• Mathematical formulation:
  1. For each step we choose a point $x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)}$
  2. To find $\alpha_{(i)}$, we use the fact that $e_{(i+1)}$ is orthogonal to $d_{(i)}$:
  $d_{(i)}^T e_{(i+1)} = 0$
  $d_{(i)}^T \left(e_{(i)} + \alpha_{(i)} d_{(i)}\right) = 0$
  $d_{(i)}^T e_{(i)} + \alpha_{(i)}\, d_{(i)}^T d_{(i)} = 0$
  $\alpha_{(i)} = -\frac{d_{(i)}^T e_{(i)}}{d_{(i)}^T d_{(i)}}$

5. The Method of Conjugate Directions (Cont'd)

• To solve the problem of not knowing $e_{(i)}$, one makes the search directions A-orthogonal rather than orthogonal to each other, i.e.:
  $d_{(i)}^T A d_{(j)} = 0$

5. The Method of Conjugate Directions (Cont'd)

• The new requirement is now that $e_{(i+1)}$ is A-orthogonal to $d_{(i)}$:
  $\frac{d}{d\alpha} f(x_{(i+1)}) = f'(x_{(i+1)})^T \frac{d x_{(i+1)}}{d\alpha} = 0$
  $\Rightarrow\ -r_{(i+1)}^T d_{(i)} = 0$
  $\Rightarrow\ d_{(i)}^T A e_{(i+1)} = 0$
  $\Rightarrow\ d_{(i)}^T A \left(e_{(i)} + \alpha_{(i)} d_{(i)}\right) = 0$
  $\Rightarrow\ \alpha_{(i)} = \frac{d_{(i)}^T r_{(i)}}{d_{(i)}^T A d_{(i)}}$

  If the search vectors were the residuals, this formula would be identical to the method of steepest descent.

5. The Method of Conjugate Directions (Cont'd)

• Proof that this process computes x in n steps:
  - express the error term as a linear combination of the search directions,
    $e_{(0)} = \sum_{j=0}^{n-1} \delta_{(j)} d_{(j)}$
  - use the fact that the search directions are A-orthogonal

5. The Method of Conjugate Directions (Cont'd)

• Calculation of the A-orthogonal search directions by a conjugate Gram-Schmidt process:
  1. Take a set of linearly independent vectors $u_0, u_1, \ldots, u_{n-1}$
  2. Set $d_{(0)} = u_0$
  3. For i > 0, take $u_i$ and subtract from it all the components that are not A-orthogonal to the previous search directions:
     $d_{(i)} = u_{(i)} + \sum_{j=0}^{i-1} \beta_{ij}\, d_{(j)}, \qquad \beta_{ij} = -\frac{u_{(i)}^T A d_{(j)}}{d_{(j)}^T A d_{(j)}}$
  (a short code sketch of this process follows at the end of this section)

5. The Method of Conjugate Directions (Cont'd)

• The method of Conjugate Directions using the axial unit vectors as basis vectors [Figure]

5. The Method of Conjugate Directions (Cont'd)

• Important observations:

  1. The error vector is A-orthogonal to all previous search directions.
  2. The residual vector is orthogonal to all previous search directions.
  3. The residual vector is also orthogonal to all previous basis vectors.

  $d_{(i)}^T A e_{(j)} = d_{(i)}^T r_{(j)} = u_i^T r_{(j)} = 0 \quad \text{for } i < j$
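The following NumPy sketch (an added illustration, with the assumed helper name conjugate_gram_schmidt) builds A-orthogonal directions from the axial unit vectors by the conjugate Gram-Schmidt process above and then applies the method of conjugate directions to the running 2x2 example:

```python
# Method of conjugate directions: build A-orthogonal search directions from the
# axial unit vectors by conjugate Gram-Schmidt, then take one step along each.
# Converges in at most n steps (here n = 2).
import numpy as np

def conjugate_gram_schmidt(A, U):
    """Make the columns of U A-orthogonal: d_i = u_i + sum_j beta_ij d_j."""
    n = U.shape[1]
    D = np.zeros_like(U, dtype=float)
    for i in range(n):
        d = U[:, i].astype(float)
        for j in range(i):
            beta = -(U[:, i] @ A @ D[:, j]) / (D[:, j] @ A @ D[:, j])
            d = d + beta * D[:, j]
        D[:, i] = d
    return D

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

D = conjugate_gram_schmidt(A, np.eye(2))   # axial unit vectors as basis
print(D[:, 0] @ A @ D[:, 1])               # ~ 0: directions are A-orthogonal

x = np.array([-2.0, -2.0])
for i in range(2):                          # at most n = 2 steps
    d = D[:, i]
    r = b - A @ x
    alpha = (d @ r) / (d @ A @ d)
    x = x + alpha * d
print(x)                                    # ~ [ 2., -2.]
```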

6. The Method of Conjugate Gradients

• The method of Conjugate Gradients is simply the method of conjugate directions where the search directions are constructed by conjugation of the residuals, i.e. $u_i = r_{(i)}$.
• This allows us to simplify the calculation of the new search direction, because
  $\beta_{ij} = \begin{cases} \dfrac{1}{\alpha_{(i-1)}}\,\dfrac{r_{(i)}^T r_{(i)}}{d_{(i-1)}^T A d_{(i-1)}} = \dfrac{r_{(i)}^T r_{(i)}}{r_{(i-1)}^T r_{(i-1)}} & i = j + 1 \\[4pt] 0 & i > j + 1 \end{cases}$
• The new search direction is determined as a linear combination of the previous search direction and the new residual:
  $d_{(i+1)} = r_{(i+1)} + \beta_{(i+1)} d_{(i)}$

6. The Method of Conjugate Gradients (Cont'd)

• Efficient implementation:
  i = 0
  r = b - Ax
  d = r
  δ_new = r^T r
  δ_0 = δ_new
  While i < i_max and δ_new > ε² δ_0:
    q = A d
    α = δ_new / (d^T q)
    x = x + α d
    If i is divisible by 50:
      r = b - Ax
    else:
      r = r - α q
    endif
    δ_old = δ_new
    δ_new = r^T r
    β = δ_new / δ_old
    d = r + β d
    i = i + 1

The Landscape of Ax=b Solvers

                                  Direct (A = LU)      Iterative (y' = Ay)
  More general / nonsymmetric:    Pivoting LU          GMRES, BiCGSTAB, …
  Symmetric positive definite:    Cholesky             Conjugate gradient

  (Direct methods: more robust; iterative methods: less storage, if sparse)

Conjugate gradient iteration

  x_0 = 0, r_0 = b, d_0 = r_0
  for k = 1, 2, 3, …
    α_k = (r_{k-1}^T r_{k-1}) / (d_{k-1}^T A d_{k-1})   (step length)
    x_k = x_{k-1} + α_k d_{k-1}                          (approximate solution)
    r_k = r_{k-1} - α_k A d_{k-1}                        (residual)
    β_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1})              (improvement)
    d_k = r_k + β_k d_{k-1}                              (search direction)

• One matrix-vector multiplication per iteration
• Two vector dot products per iteration
• Four n-vectors of working storage
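The pseudocode maps almost line for line onto NumPy; the following sketch is an added illustration (function name and tolerances are assumptions), including the periodic exact-residual refresh from the efficient implementation above:

```python
# Conjugate gradient iteration following the pseudocode above:
# one matrix-vector product, two dot products and four work vectors (x, r, d, q).
import numpy as np

def conjugate_gradient(A, b, x0=None, i_max=1000, eps=1e-10):
    x = np.zeros_like(b, dtype=float) if x0 is None else x0.astype(float).copy()
    r = b - A @ x
    d = r.copy()
    delta_new = r @ r
    delta0 = delta_new
    i = 0
    while i < i_max and delta_new > eps**2 * delta0:
        q = A @ d
        alpha = delta_new / (d @ q)
        x = x + alpha * d
        if i % 50 == 0:
            r = b - A @ x          # periodic exact-residual refresh
        else:
            r = r - alpha * q
        delta_old = delta_new
        delta_new = r @ r
        beta = delta_new / delta_old
        d = r + beta * d
        i += 1
    return x, i

# The 2x2 running example converges in (at most) n = 2 iterations.
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(conjugate_gradient(A, b))    # (~[ 2., -2.], 2)
```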

7. Convergence Analysis of the CG Method

• If the algorithm is performed in exact arithmetic, the exact solution is obtained in at most n steps.
• When finite precision arithmetic is used, rounding errors lead to gradual loss of orthogonality among the residuals, and the finite termination property of the method is lost.
• If the matrix A has only m distinct eigenvalues, then the CG method will converge in at most m iterations.
• An error bound on the CG method can be obtained in terms of the A-norm; after k iterations,
  $\|x - x_k\|_A \le 2\,\|x - x_0\|_A \left(\frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1}\right)^{k}$
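To make the bound concrete, here is a back-of-the-envelope illustration (an addition, not from the original slides): for $\kappa(A) = 100$ the contraction factor is $\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} = \frac{9}{11} \approx 0.818$, so requiring $2\,(9/11)^k \le 10^{-6}$ gives $k \ge \ln(2 \cdot 10^{6})/\ln(11/9) \approx 73$ iterations, whereas the steepest descent factor $\frac{\kappa-1}{\kappa+1} = \frac{99}{101} \approx 0.980$ would need roughly 700 iterations for the same reduction of the A-norm of the error.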

8. Complexity of the CG Method

• The dominant operations of the Steepest Descent and Conjugate Gradient methods are the matrix-vector products; thus, both methods have spatial complexity O(m), where m is the number of nonzero matrix entries.
• Suppose that after i iterations one requires that $\|e_{(i)}\|_A \le \varepsilon\, \|e_{(0)}\|_A$:
  (a) Steepest Descent: $i \le \left\lceil \tfrac{1}{2}\,\kappa \ln\!\left(\tfrac{1}{\varepsilon}\right) \right\rceil$
  (b) Conjugate Gradient: $i \le \left\lceil \tfrac{1}{2}\,\sqrt{\kappa} \ln\!\left(\tfrac{2}{\varepsilon}\right) \right\rceil$
• The time complexity of these two methods is:
  (a) Steepest Descent: $O(m\,\kappa)$
  (b) Conjugate Gradient: $O(m\,\sqrt{\kappa})$
• Stopping criterion: $\|r_{(i)}\| \le \varepsilon\, \|r_{(0)}\|$

9. Preconditioning Techniques

References:
• J. A. Meijerink and H. A. van der Vorst, "An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix," Mathematics of Computation, Vol. 31, No. 137, pp. 148-162 (1977).
• J. A. Meijerink and H. A. van der Vorst, "Guidelines for the usage of incomplete decompositions in solving sets of linear equations as they occur in practical problems," Journal of Computational Physics, Vol. 44, pp. 134-155 (1981).
• D. S. Kershaw, "The incomplete Cholesky-Conjugate Gradient method for the iterative solution of systems of linear equations," Journal of Computational Physics, Vol. 26, pp. 43-65 (1978).
• H. A. van der Vorst, "High performance preconditioning," SIAM Journal on Scientific and Statistical Computing, Vol. 10, No. 6, pp. 1174-1185 (1989).

9. Preconditioning Techniques (Cont'd)

• Preconditioning is a technique for improving the condition number of a matrix. Suppose that M is a symmetric, positive-definite matrix that approximates A but is easier to invert. Then, rather than solving the original system Ax = b, one solves the modified system $M^{-1}Ax = M^{-1}b$.
• The effectiveness of the preconditioner depends upon the condition number of $M^{-1}A$.
• Intuitively, preconditioning is an attempt to stretch the quadratic form to appear more spherical, so that the eigenvalues become closer to each other.
• Derivation: write $M = E E^T$ and solve the symmetric transformed system
  $\left(E^{-1} A E^{-T}\right)\tilde{x} = E^{-1} b, \qquad \tilde{x} = E^T x$

9. Preconditioning Techniques (Cont'd)

• Transformed Preconditioned CG (PCG) Algorithm (with $\tilde{r} = E^{-1} r$, $\tilde{d} = E^T d$):
  i = 0
  r̃ = E⁻¹b - (E⁻¹AE⁻ᵀ) x̃
  d̃ = r̃
  δ_new = r̃^T r̃
  δ_0 = δ_new
  While i < i_max and δ_new > ε² δ_0:
    q̃ = (E⁻¹AE⁻ᵀ) d̃
    α = δ_new / (d̃^T q̃)
    x̃ = x̃ + α d̃
    r̃ = E⁻¹b - (E⁻¹AE⁻ᵀ) x̃  or  r̃ = r̃ - α q̃
    δ_old = δ_new
    δ_new = r̃^T r̃
    β = δ_new / δ_old
    d̃ = r̃ + β d̃
    i = i + 1
• Untransformed PCG Algorithm:
  i = 0
  r = b - Ax
  d = M⁻¹ r
  δ_new = r^T M⁻¹ r
  δ_0 = δ_new
  While i < i_max and δ_new > ε² δ_0:
    q = A d
    α = δ_new / (d^T q)
    x = x + α d
    r = b - Ax  or  r = r - α q
    δ_old = δ_new
    δ_new = r^T M⁻¹ r
    β = δ_new / δ_old
    d = M⁻¹ r + β d
    i = i + 1
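As an added illustration of the untransformed PCG loop, here is a minimal NumPy sketch that uses a diagonal (Jacobi) preconditioner as M; the function name and tolerances are assumptions:

```python
# Untransformed preconditioned CG with a diagonal (Jacobi) preconditioner
# M = diag(A), so applying M^{-1} is an elementwise division.
import numpy as np

def pcg_jacobi(A, b, x0=None, i_max=1000, eps=1e-10):
    Mdiag = np.diag(A).astype(float)       # M = diag(A)
    x = np.zeros_like(b, dtype=float) if x0 is None else x0.astype(float).copy()
    r = b - A @ x
    s = r / Mdiag                          # s = M^{-1} r
    d = s.copy()
    delta_new = r @ s
    delta0 = delta_new
    i = 0
    while i < i_max and delta_new > eps**2 * delta0:
        q = A @ d
        alpha = delta_new / (d @ q)
        x = x + alpha * d
        r = r - alpha * q
        s = r / Mdiag
        delta_old = delta_new
        delta_new = r @ s
        beta = delta_new / delta_old
        d = s + beta * d
        i += 1
    return x, i

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(pcg_jacobi(A, b))                    # (~[ 2., -2.], 2)
```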

9. Preconditioning Techniques (Cont'd)

There are several types of preconditioners that can be used:

  (a) Preconditioners based on splitting of the matrix A, i.e. A = M - N
  (b) Complete or incomplete factorization of A, e.g. $A = LL^T + E$
  (c) Approximation of $A^{-1}$, i.e. $M^{-1} \approx A^{-1}$
  (d) Reordering of the equations and/or unknowns, e.g. Domain Decomposition

9. Preconditioning Techniques (Cont'd)

(a) Preconditioning based on splittings of A

• Diagonal Preconditioning: $M = D = \mathrm{diag}\{a_{11}, a_{22}, \ldots, a_{nn}\}$
  For the running example, $A = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}, \quad b = \begin{bmatrix} 2 \\ -8 \end{bmatrix}, \quad c = 0, \quad x = \begin{bmatrix} 2 \\ -2 \end{bmatrix}$
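As a quick sanity check (added here, not in the slides), one can compare the spectral condition number of A with that of the diagonally preconditioned matrix $D^{-1}A$ for the running example:

```python
# Effect of diagonal (Jacobi) preconditioning on the running example:
# compare the spectral condition number of A with that of D^{-1} A.
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
D = np.diag(np.diag(A))

def spectral_condition_number(B):
    """kappa = lambda_max / lambda_min for a matrix with real positive eigenvalues."""
    lam = np.linalg.eigvals(B).real
    return lam.max() / lam.min()

print(spectral_condition_number(A))                      # kappa(A)      = 3.5
print(spectral_condition_number(np.linalg.solve(D, A)))  # kappa(D^-1 A) ~ 2.8
```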

• Tridiagonal Preconditioning
• Block Diagonal Preconditioning

9. Preconditioning Techniques (Cont'd)

(b) Preconditioning based on factorization of A

• Cholesky Factorization (for a symmetric and positive definite matrix) as a direct method: $A = LL^T$, $x = L^{-T}\left(L^{-1} b\right)$, where
  $L_{ii} = \sqrt{A_{ii} - \sum_{k=1}^{i-1} L_{ik}^2}$
  $L_{ji} = \frac{A_{ji} - \sum_{k=1}^{i-1} L_{jk} L_{ik}}{L_{ii}}, \qquad j = i+1, i+2, \ldots, n$
  (a code sketch of these recurrences follows at the end of this subsection)
• Modified Cholesky factorization: $A = LDL^T$

9. Preconditioning Techniques (Cont'd)

(b) Preconditioning based on factorization of A

• Incomplete Cholesky Factorization: $A = LL^T + E$ or $A = LDL^T + E$, where $L_{ij} = 0$ if $(i,j) \in P$.
  The simplest choice is $P = \{(i,j)\,|\,a_{ij} = 0;\ i,j = 1, 2, \ldots, n\}$, which gives ICCG(0).
  For a banded matrix A with diagonal entries $a_i$ and off-diagonal bands $b_i$ (offset 1) and $c_i$ (offset m), the incomplete factors $\tilde{L}$ and $\tilde{D}$ keep the same sparsity pattern:
  $\tilde{d}_i = a_i - \frac{\tilde{b}_{i-1}^2}{\tilde{d}_{i-1}} - \frac{\tilde{c}_{i-m}^2}{\tilde{d}_{i-m}}, \qquad \tilde{b}_i = b_i, \quad \tilde{c}_i = c_i, \qquad i = 1, 2, \ldots, n$
• Modified ICCG(0): $M = (L + D)\,D^{-1}\,(D + L^T)$ with D chosen so that $\mathrm{diag}(A) = \mathrm{diag}(M)$, i.e.
  $d_i = a_i - \frac{b_{i-1}^2}{d_{i-1}} - \frac{c_{i-m}^2}{d_{i-m}}$

9. Preconditioning Techniques (Cont'd)

(c) Preconditioning based on approximation of $A^{-1}$

• If $\rho(J) < 1$, then the inverse of $I - J$ is
  $(I - J)^{-1} = \sum_{k=0}^{\infty} J^k = I + J + J^2 + J^3 + \cdots$
• Write the matrix A as $A = D + L + U = D\left(I + D^{-1}(L+U)\right)$, so that $J = -D^{-1}(L+U)$.
• The inverse of A can then be expressed as
  $A^{-1} = \left(D(I - J)\right)^{-1} = (I - J)^{-1} D^{-1} = \sum_{k=0}^{\infty} (-1)^k \left(D^{-1}(L+U)\right)^k D^{-1}$
• For k = 1, one has $M_1^{-1} = D^{-1} - D^{-1}(L+U)D^{-1}$.
• For k = m, one usually uses $M_m^{-1} = \sum_{k=0}^{m} (-1)^k \left(D^{-1}(L+U)\right)^k D^{-1}$.

9. Preconditioning Techniques (Cont'd)

(d) Domain Decomposition Preconditioning

  $A \begin{bmatrix} x_1 \\ x_2 \\ z \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ f \end{bmatrix}, \qquad A = \begin{bmatrix} A_1 & 0 & B_1 \\ 0 & A_2 & B_2 \\ B_1^T & B_2^T & Q \end{bmatrix}$
  (blocks 1 and 2 correspond to Domain I and Domain II)
  $M = L \begin{bmatrix} M_1 & 0 & 0 \\ 0 & M_2 & 0 \\ 0 & 0 & S \end{bmatrix}^{-1} L^T, \qquad \text{where } L = \begin{bmatrix} M_1 & 0 & 0 \\ 0 & M_2 & 0 \\ B_1^T & B_2^T & S \end{bmatrix}$
  so that
  $M = \begin{bmatrix} M_1 & 0 & B_1 \\ 0 & M_2 & B_2 \\ B_1^T & B_2^T & S^* \end{bmatrix}, \qquad S^* = S + B_1^T M_1^{-1} B_1 + B_2^T M_2^{-1} B_2$
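Returning to the factorization formulas in (b), the sketch below (an added illustration) implements the complete Cholesky recurrences quoted there and uses the factor to solve the running example; an incomplete variant would simply skip entries outside the allowed pattern P:

```python
# Cholesky recurrences from (b): A = L L^T with
#   L[i][i] = sqrt(A[i][i] - sum_k L[i][k]^2)
#   L[j][i] = (A[j][i] - sum_k L[j][k] L[i][k]) / L[i][i]  for j > i.
import numpy as np

def cholesky(A):
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for i in range(n):
        L[i, i] = np.sqrt(A[i, i] - np.sum(L[i, :i]**2))
        for j in range(i + 1, n):
            L[j, i] = (A[j, i] - np.sum(L[j, :i] * L[i, :i])) / L[i, i]
    return L

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
L = cholesky(A)
print(np.allclose(L @ L.T, A))   # True

# Direct solve x = L^{-T} (L^{-1} b):
y = np.linalg.solve(L, b)        # solve L y = b
x = np.linalg.solve(L.T, y)      # solve L^T x = y
print(x)                          # ~ [ 2., -2.]
```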

10. Conjugate Gradient Type Algorithms for Nonsymmetric Matrices

• The Bi-Conjugate Gradient (BCG) Algorithm was proposed by Lanczos in 1952.
• It solves not only the original system Ax = b but also the dual linear system $A^T x^* = b^*$.
• Each step of this algorithm requires a matrix-by-vector product with both A and $A^T$.
• The search direction $p_j$ does not contribute to the solution directly.
• Algorithm:
  i = 0, r = b - Ax, r* = r
  p = r, p* = r*
  δ_new = r*^T r, δ_0 = δ_new
  While i < i_max and δ_new > ε² δ_0:
    q = A p
    q* = A^T p*
    α = δ_new / (p*^T q)
    x = x + α p
    r = r - α q
    r* = r* - α q*
    δ_old = δ_new
    δ_new = r*^T r
    β = δ_new / δ_old
    p = r + β p
    p* = r* + β p*
    i = i + 1
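A minimal NumPy sketch of the BCG loop above follows (an added illustration; the choice r* = r and the small nonsymmetric test matrix are assumptions):

```python
# Bi-Conjugate Gradient (BCG) for a nonsymmetric A.
# The shadow residual is initialized as r* = r (a common default).
import numpy as np

def bicg(A, b, x0=None, i_max=1000, eps=1e-10):
    x = np.zeros_like(b, dtype=float) if x0 is None else x0.astype(float).copy()
    r = b - A @ x
    rs = r.copy()                     # r*, the shadow residual
    p, ps = r.copy(), rs.copy()
    delta_new = rs @ r
    delta0 = abs(delta_new)
    i = 0
    while i < i_max and abs(delta_new) > eps**2 * delta0:
        q = A @ p
        qs = A.T @ ps
        alpha = delta_new / (ps @ q)
        x = x + alpha * p
        r = r - alpha * q
        rs = rs - alpha * qs
        delta_old = delta_new
        delta_new = rs @ r
        beta = delta_new / delta_old
        p = r + beta * p
        ps = rs + beta * ps
        i += 1
    return x, i

A = np.array([[3.0, 2.0], [1.0, 4.0]])   # small nonsymmetric test matrix (assumed)
b = np.array([5.0, 5.0])
x, iters = bicg(A, b)
print(x, A @ x)                           # x solves Ax = b, so A @ x ~ b
```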

10. Conjugate Gradient Type Algorithms for Nonsymmetric Matrices (Cont'd)

• The Conjugate Gradient Squared (CGS) Algorithm was developed by Sonneveld in 1984.
• Main idea of the algorithm: in BCG the residuals and search directions satisfy
  $r_j = \phi_j(A)\, r_0, \qquad p_j = \pi_j(A)\, r_0, \qquad r_j^* = \phi_j(A^T)\, r_0^*, \qquad p_j^* = \pi_j(A^T)\, r_0^*$
  CGS works with the squared polynomials instead, so the products with $A^T$ are avoided.
• Algorithm:
  i = 0, r = b - Ax_0, r*_0 is arbitrary
  p = r_0, u = r_0
  δ_new = r*_0^T r, δ_0 = δ_new
  While i < i_max and δ_new > ε² δ_0:
    α = δ_new / (r*_0^T A p)
    q = u - α A p
    x = x + α (u + q)
    r = r - α A (u + q)
    δ_old = δ_new
    δ_new = r*_0^T r
    β = δ_new / δ_old
    u = r + β q
    p = u + β (q + β p)
    i = i + 1

10. Conjugate Gradient Type Algorithms for Nonsymmetric Matrices (Cont'd)

• The Bi-Conjugate Gradient Stabilized (Bi-CGSTAB) Algorithm was developed by van der Vorst in 1992.
• Rather than producing iterates whose residual vectors are of the form $r_j = \phi_j^2(A)\, r_0$, it produces iterates with residual vectors of the form $r_j = \psi_j(A)\,\phi_j(A)\, r_0$.
• Algorithm:
  i = 0, r = b - Ax_0, r*_0 is arbitrary
  p = r_0
  δ_new = r*_0^T r, δ_0 = δ_new
  While i < i_max and δ_new > ε² δ_0:
    α = δ_new / (r*_0^T A p)
    s = r - α A p
    ω = (A s)^T s / ((A s)^T (A s))
    x = x + α p + ω s
    r = s - ω A s
    δ_old = δ_new
    δ_new = r*_0^T r
    β = (δ_new / δ_old)(α / ω)
    p = r + β (p - ω A p)
    i = i + 1
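A minimal NumPy sketch of the Bi-CGSTAB loop above follows (an added illustration; the test matrix and the explicit half-step convergence check are assumptions not present in the slide's pseudocode):

```python
# Bi-CGSTAB: only products with A (never A^T) are needed.
import numpy as np

def bicgstab(A, b, x0=None, i_max=1000, eps=1e-10):
    x = np.zeros_like(b, dtype=float) if x0 is None else x0.astype(float).copy()
    r = b - A @ x
    rs0 = r.copy()                    # r*_0, the (arbitrary) shadow residual
    p = r.copy()
    delta_new = rs0 @ r
    delta0 = abs(delta_new)
    i = 0
    while i < i_max and abs(delta_new) > eps**2 * delta0:
        Ap = A @ p
        alpha = delta_new / (rs0 @ Ap)
        s = r - alpha * Ap
        if np.linalg.norm(s) < eps * np.linalg.norm(b):
            x = x + alpha * p         # converged already at the BCG half-step
            return x, i + 1
        As = A @ s
        omega = (As @ s) / (As @ As)
        x = x + alpha * p + omega * s
        r = s - omega * As
        delta_old = delta_new
        delta_new = rs0 @ r
        beta = (delta_new / delta_old) * (alpha / omega)
        p = r + beta * (p - omega * Ap)
        i += 1
    return x, i

A = np.array([[3.0, 2.0], [1.0, 4.0]])   # nonsymmetric test matrix (assumed)
b = np.array([5.0, 5.0])
x, iters = bicgstab(A, b)
print(x, np.allclose(A @ x, b))           # ~ [1. 1.] True
```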

Complexity of linear solvers

Time to solve the model problem (Poisson's equation) on a regular mesh with n unknowns (mesh side $n^{1/2}$ in 2D, $n^{1/3}$ in 3D):

                           2D                       3D
  Sparse Cholesky:         O(n^1.5)                 O(n^2)
  CG, exact arithmetic:    O(n^2)                   O(n^2)
  CG, no precond.:         O(n^1.5)                 O(n^1.33)
  CG, modified IC:         O(n^1.25)                O(n^1.17)
  CG, support trees:       O(n^1.20) -> O(n^1+)     O(n^1.75) -> O(n^1.31)
  Multigrid:               O(n)                     O(n)