CHAPTER 4

Systems of Linear Equations

In this chapter we’ll examine both iterative and direct methods for solving equations of the form
\[
Ax = b \tag{4.1}
\]
where x, b ∈ Rⁿ, A ∈ MₙR is an n × n matrix, x is an unknown vector (to be found), and b and A are known. Solving systems of linear equations is still the most important problem in computational mathematics, because it is used as a sub-problem in solving other problems. Algorithms that solve non-linear systems commonly use linear approximations, which give rise to systems of linear equations. Algorithms that optimise over feasible sets given by linear and non-linear equalities and inequalities commonly solve systems related to first-order optimality conditions iteratively, which again give rise to systems of linear equations.

In undergraduate mathematics you have learned the mathematical ideas behind direct and iterative approaches to solving (4.1). You should understand:

Theorem 4.1. The following are equivalent for any n × n matrix A:
• Ax = b has a unique solution for all b ∈ Rⁿ.
• Ax = 0 implies x = 0.
• A⁻¹ exists.
• det(A) ≠ 0.
• rank(A) = n.

The full rank of A is also our assumption throughout the chapter. Here, we build on these facts and analyse the related algorithms, focussing first on conditioning, second on the stability of direct methods, and third on the convergence and stability of iterative methods. This still leaves much unexplained, including conjugate gradients (CG), generalised minimal residuals (GMRES), and preconditioning, i.e. methods for changing the condition. See Liesen and Strakos [2012] for much more.

1. Condition of a System of Linear Equations

Condition describes how changes in A and b, the “instance” of the problem, affect the solution x, whichever algorithm is used. We will examine errors in A and b separately. It turns out that in both cases the condition number of the matrix A plays a role.

Example 4.2 (An Ill-Conditioned Matrix). Consider the system of linear equations

x₁ + 0.99x₂ = 1.99
0.99x₁ + 0.98x₂ = 1.97.

The true solution is x₁ = 1 and x₂ = 1, but x₁ = 3.0000 and x₂ = −1.0203 gives

x₁ + 0.99x₂ = 1.989903
0.99x₁ + 0.98x₂ = 1.970106.

Thus, a small change in the problem data, a change in the vector b from (1.99, 1.97)ᵀ to (1.989903, 1.970106)ᵀ,


Figure 4.1. The system x1 + 0.99x2 = 1.99 (solid), 0.99x1 + 0.98x2 = 1.97 (dashed).

leads to a large change in the solution: this is our criterion for ill-conditioning. Intuitively, it is easy to see what is going wrong. See the illustration in Figure 4.1.

♦

1.1. Perturbation of b. Let the right-hand side b be perturbed by δb. Then we want to find the solution of
\[
A(x + \delta x) = b + \delta b. \tag{4.2}
\]
Let ‖·‖ denote a vector or matrix norm, according to context. By (4.2), Aδx = δb, so
\[
\delta x = A^{-1}\delta b \;\Rightarrow\; \|\delta x\| \le \|A^{-1}\|\,\|\delta b\| \quad \text{(a sharp bound).} \tag{4.3}
\]
Since b = Ax, the properties of matrix norms again give
\[
\|b\| \le \|A\|\,\|x\|. \tag{4.4}
\]
Hence, combining (4.3) and (4.4) (each left-hand side is at most the corresponding right-hand side, so the product of the left-hand sides is at most the product of the right-hand sides):
\[
\|\delta x\|\,\|b\| \le \|A\|\,\|A^{-1}\|\,\|x\|\,\|\delta b\|,
\]
and assuming b ≠ 0 we get
\[
\frac{\|\delta x\|}{\|x\|} \le \|A\|\,\|A^{-1}\|\,\frac{\|\delta b\|}{\|b\|},
\]
i.e., (rel. error in x) ≤ ‖A‖ ‖A⁻¹‖ (rel. error in b).
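To make the bound concrete, here is a minimal numerical sketch (numpy is assumed; the matrix, right-hand side, and perturbation are those of Example 4.2) comparing the actual relative change in x with the bound ‖A‖ ‖A⁻¹‖ · ‖δb‖/‖b‖ in the 2-norm.

from numpy import array
from numpy.linalg import solve, norm, inv

A = array([[1.00, 0.99],
           [0.99, 0.98]])
b = array([1.99, 1.97])
db = array([-0.000097, 0.000106])    # the perturbation of b from Example 4.2

x = solve(A, b)                      # exact solution (1, 1)
x_pert = solve(A, b + db)            # solution of the perturbed system

rel_change_x = norm(x_pert - x) / norm(x)
rel_change_b = norm(db) / norm(b)
bound = norm(A, 2) * norm(inv(A), 2) * rel_change_b

print(rel_change_x, "<=", bound)     # a relative change of order 1 in x, caused by
                                     # a relative change of order 5e-5 in b

The actual relative change is close to the bound here, which illustrates that the bound is sharp.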

2. The Condition Number of a Matrix

Thus, the quantity ‖A‖ ‖A⁻¹‖ measures the relative change in solution for a given relative change in the problem data: it measures the relative condition of the system of linear equations problem. This leads to:

Definition 4.3. Given a matrix norm ‖·‖, the condition number of the matrix A is
\[
\operatorname{cond}_{\mathrm{rel}}(A) = \|A\|\,\|A^{-1}\|.
\]
This depends on the norm used; but, since the underlying vector norms only differ by a fixed multiplicative constant for a given n (all norms on Rⁿ are equivalent), all measures of condition number are equally good.

We can interpret cond_rel(A) as:
• the amount a relative error in b is magnified in the solution vector x; or
• the distortion A produces when applied to the unit sphere; or
• how “close” A (and indeed A⁻¹) is to being a singular matrix.

Definition 4.4. We also define the spectral condition number of A as

\[
\operatorname{cond}^*_{\mathrm{rel}}(A) := \frac{\max_{\lambda\in\sigma(A)} |\lambda|}{\min_{\lambda\in\sigma(A)} |\lambda|}.
\]

Here σ(A), the spectrum of A, is the set of all eigenvalues of A. Recall that max_{λ∈σ(A)} |λ| = ρ(A) = r_σ(A), the spectral radius of A.

If λ is an eigenvalue of A, its modulus (absolute value) |λ| is the factor by which a λ-eigenvector is expanded (if |λ| > 1) or contracted (if |λ| < 1). Thus
• ρ(A) = max_{λ∈σ(A)} |λ| is the largest factor by which A multiplies an eigenvector, while
• min_{λ∈σ(A)} |λ| is the smallest factor by which A multiplies an eigenvector.
The ratio cond*_rel(A) = max_{λ∈σ(A)} |λ| / min_{λ∈σ(A)} |λ| is thus a measure of the distortion produced by A: it measures how great is the difference in expansion/contraction of eigenvectors that A can cause. See Example 4.2 above.

In the example above, the related matrix is
\[
A = \begin{bmatrix} 1.00 & 0.99 \\ 0.99 & 0.98 \end{bmatrix},
\]
whose eigenvalues λ₁ ≈ 1.98 and λ₂ ≈ −0.00005 are the roots of the characteristic equation
\[
\det(A - \lambda I) = \det\begin{bmatrix} 1.00-\lambda & 0.99 \\ 0.99 & 0.98-\lambda \end{bmatrix}
= (1-\lambda)(0.98-\lambda) - 0.99^2
= \lambda^2 - 1.98\lambda + 0.98 - 0.9801
= \lambda^2 - 1.98\lambda - 0.0001.
\]
Thus the spectral condition number is cond*_rel(A) = |1.98| / |−0.00005| ≈ 39,600. Hence, this matrix is very ill-conditioned.

The condition number cond_rel(A) is bounded below by 1: this is seen by noting that ‖I‖ = 1 for any induced norm and
\[
1 = \|I\| = \|AA^{-1}\| \le \|A\|\,\|A^{-1}\| = \operatorname{cond}_{\mathrm{rel}}(A).
\]
Fact 4.5. Each norm-based condition number is also bounded below by the spectral condition number of A:
\[
1 \le \operatorname{cond}^*_{\mathrm{rel}}(A) \le \operatorname{cond}_{\mathrm{rel}}(A)
\]
for any norm.
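A quick numerical check of these numbers (a sketch using numpy; for this symmetric matrix the 2-norm condition number coincides with the spectral one, so Fact 4.5 holds with near-equality):

from numpy import array, abs
from numpy.linalg import eigvals, norm, inv

A = array([[1.00, 0.99],
           [0.99, 0.98]])

moduli = abs(eigvals(A))                       # |λ| for each eigenvalue
spectral_cond = moduli.max() / moduli.min()    # cond*_rel(A), roughly 3.9e4

cond_2 = norm(A, 2) * norm(inv(A), 2)          # norm-based condition number (2-norm)
print(spectral_cond, "<=", cond_2)             # Fact 4.5: spectral <= norm-based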

Thus the spectral condition number is the smallest measure of relative condition of the system of linear equations problem.

2.1. Perturbation of A. Having looked at changes in b, and seen how the condition number arises naturally there, we now look at changes in A alone. If A is perturbed by δA then we have
\[
b = (A + \delta A)(x + \delta x) = Ax + A\,\delta x + \delta A\,x + \delta A\,\delta x
\;\Rightarrow\; A\,\delta x = -\,\delta A\,(x + \delta x)
\;\Rightarrow\; \delta x = -A^{-1}\,\delta A\,(x + \delta x).
\]
Taking norms and using the triangle inequality we have
\[
\|\delta x\| = \|A^{-1}\,\delta A\,(x + \delta x)\| \le \|A^{-1}\|\,\|\delta A\|\,(\|x\| + \|\delta x\|)
\;\Rightarrow\; \|\delta x\|\bigl(1 - \|A^{-1}\|\,\|\delta A\|\bigr) \le \|A^{-1}\|\,\|\delta A\|\,\|x\|.
\]
Thus
\[
\frac{\|\delta x\|}{\|x\|} \le \frac{\|A^{-1}\|\,\|\delta A\|}{1 - \|A^{-1}\|\,\|\delta A\|}
= \frac{\|A\|\,\|A^{-1}\|\,\|\delta A\|}{\bigl(1 - \|A^{-1}\|\,\|\delta A\|\bigr)\,\|A\|}.
\]
Since 1 = ‖I‖ = ‖AA⁻¹‖ ≤ ‖A‖ ‖A⁻¹‖ we have 1/‖A‖ ≤ ‖A⁻¹‖. Thus, if ‖A⁻¹‖ ‖δA‖ ≪ 1 (and so ‖δA‖/‖A‖ ≪ 1), then 1 − ‖A⁻¹‖ ‖δA‖ ≈ 1 and we have
\[
\frac{\|\delta x\|}{\|x\|} \le \operatorname{cond}_{\mathrm{rel}}(A)\,\frac{\|\delta A\|}{\|A\|}.
\]
Thus, for a small perturbation of A, we again have that the condition number measures the relative condition of the system of linear equations problem. A similar result can be derived for the case where both A and b are perturbed.

Theorem 4.6. If A is non-singular, and
\[
\frac{\|\delta A\|}{\|A\|} < \frac{1}{\operatorname{cond}_{\mathrm{rel}}(A)},
\]
then A + δA is also non-singular.

This theorem tells us that the condition number measures the distance from A to the nearest singular matrix: it is a better measure than the determinant of “how close to singularity” a matrix is. It also says that the set GLₙR of invertible n × n matrices is an open set¹: about any matrix in GLₙR there is an ε-neighbourhood (or ε-ball) contained wholly in GLₙR.

2.2. Errors and Residuals. There are two common ways to measure the discrepancy between the true solution x and the computed solution x̂:

Error: δx = x − x̂
Residual: r = b − Ax̂

If A is invertible, and either δx or r is zero, then both must be zero. In many applications of linear equations, we want to solve Ax = b so that the residual r, the difference between the left- and right-hand sides, is small, i.e., so that ‖r‖ = ‖b − Ax̂‖ is small.

Intuitively, we can think of the residual as follows: if you have a computed solution x̂ to a system of linear equations and you know the exact solution x, then you know the error δx = x − x̂; but if you don’t know the solution x beforehand, then the residual r = b − Ax̂ is a measure, along a different axis, of how close you are.

We now examine the relationship between r and errors in x. Let x̂ be the computed solution to Ax = b. Then δx = x − x̂ and r = b − Ax̂, giving
\[
A\,\delta x = Ax - A\hat{x} = b - A\hat{x} = r \;\Rightarrow\; \delta x = A^{-1}r.
\]
Thus
‖δx‖ ≤ ‖A⁻¹‖ ‖r‖  (property of matrix norms)  [1].
Similarly ‖b‖ ≤ ‖A‖ ‖x‖, so
1/‖x‖ ≤ ‖A‖/‖b‖  [2].
It follows that
‖δx‖/‖x‖ ≤ ‖A‖ ‖A⁻¹‖ ‖r‖/‖b‖  (combining [1] and [2]).
Thus
\[
\frac{\|\delta x\|}{\|x\|} \le \operatorname{cond}_{\mathrm{rel}}(A)\,\frac{\|r\|}{\|b\|}.
\]
The conclusion is: if A is ill-conditioned then small ‖r‖ does not imply small ‖δx‖/‖x‖. (We’ll see there is a similar conclusion for solutions of non-linear equations: if the problem is ill-conditioned, then a small “residual” |f(x_k)| does not mean that |x_k − x_{k−1}| is small.)

¹ GLₙR is a group under matrix multiplication, since: (a) the product of two invertible matrices is invertible (so GLₙR is closed under multiplication); (b) matrix multiplication is associative; (c) there is an identity matrix I such that AI = A = IA for all A ∈ GLₙR; and (d) each matrix in GLₙR has an inverse under multiplication (by definition of GLₙR). Since GLₙR is also a smooth manifold (locally there is a smooth mapping from Euclidean space R^{n²}, with the overlaps consistent and smooth), GLₙR is a Lie group.

Solutions with small residuals² can, because of the above, lead to large errors in x̂ if A is ill-conditioned.
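A minimal sketch of this phenomenon, using the matrix from Example 4.2 (the approximate solution x̂ is the one given there): the residual is tiny while the error is of order 1.

from numpy import array, dot
from numpy.linalg import norm

A = array([[1.00, 0.99],
           [0.99, 0.98]])
b = array([1.99, 1.97])

x_true = array([1.0, 1.0])          # exact solution
x_hat = array([3.0000, -1.0203])    # approximate "solution" from Example 4.2

residual = b - dot(A, x_hat)
error = x_true - x_hat

print(norm(residual))   # about 1.4e-4: a small residual...
print(norm(error))      # about 2.8: ...but a large error, since A is ill-conditioned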

3. Methods for Solving Ax = b: An Overview

3.1. Special Systems. Let us consider the following types of matrices:

• Symmetric: A = Aᵀ.
• Positive definite: xᵀAx > 0 for all x ≠ 0; for an eigenpair (λ, x) this gives xᵀAx = λxᵀx > 0, so all eigenvalues satisfy λ > 0.
• Diagonally dominant (DD): in each row, the diagonal element is larger than or equal to the sum of the absolute values of the other elements in the row, i.e., |a_ii| ≥ Σ_{j≠i} |a_ij|.
• Strictly DD: the same, except with strict inequality, i.e., |a_ii| > Σ_{j≠i} |a_ij|.
• Upper triangular (with a_ii ≠ 0):
\[
\begin{bmatrix}
a_{11} & a_{12} & \cdots & \cdots & a_{1n} \\
0 & a_{22} & & & \vdots \\
\vdots & 0 & \ddots & & \vdots \\
\vdots & \vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \cdots & a_{nn}
\end{bmatrix}
\]

Recall that if a matrix is in echelon form (e.g., upper triangular) the first non-zero entry in a row is called the pivot for that row: here a_kk is the pivot for the k-th row.

3.2. The Algorithms. Direct methods for solving Ax = b apply elementary matrix operations to A and b, giving a transformed problem A′x′ = b′ which is easily solved for x′. Within direct methods:

• In Gauss-Jordan, multiples of a pivot row are subtracted from other rows, so that one obtains first an upper triangular matrix, and then an identity matrix. Gauss-Jordan works (with appropriate pivoting) on any non-singular matrix, but is stable only for diagonally dominant or positive-definite matrices.
• Gauss-Jordan is also closely related to the LU and LUP decompositions, where U stands for an upper triangular matrix and L stands for a lower triangular matrix.
• On symmetric positive-definite matrices, one can also use other decomposition methods (e.g., Cholesky, QR), which are stable and faster.

Iterative methods successively improve an initial guess until it becomes satisfactory. Iterative methods for systems of linear equations are best understood as means of solving an associated optimisation problem. Let f be the quadratic f(x) := ½ xᵀAx − bᵀx + c with A symmetric positive definite. Whenever the first-order optimality conditions of min_{x∈Rⁿ} f(x) are satisfied, i.e. ∇f(x) = Ax − b = 0, we have Ax = b (a small numerical check follows the list below). Within iterative methods:

• The Jacobi method is guaranteed to converge if A is strictly diagonally dominant.
• Gauss-Seidel is guaranteed to converge if A is either strictly diagonally dominant or symmetric positive definite.
• Many other algorithms apply here too: CG requires a symmetric positive-definite matrix, while GMRES handles general non-singular matrices.
In a number of applications, iterative methods are preferred to direct methods, especially when the coefficient matrix A is sparse or structured.
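Here is the promised check, a minimal sketch (numpy; the 2 × 2 symmetric positive-definite matrix is chosen purely for illustration) confirming that the gradient of f vanishes exactly at the solution of Ax = b.

from numpy import array, dot
from numpy.linalg import solve, norm

A = array([[4.0, 1.0],
           [1.0, 3.0]])       # symmetric positive definite
b = array([1.0, 2.0])

def grad_f(x):
    # gradient of f(x) = 0.5*x^T A x - b^T x  (A symmetric)
    return dot(A, x) - b

x_star = solve(A, b)          # direct solution of Ax = b
print(norm(grad_f(x_star)))   # essentially 0: the minimiser of f solves the system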

2A very important fact in numerical computation.

4. Direct Methods

4.1. Gauss-Jordan. Recall that this method uses a sequence of elementary matrix operations to transform the square system Ax = b into an upper triangular system Ux = b′, which is then solved using back substitution.

We use a superscript in parentheses to denote the stage: x_i^{(k)} denotes the value for x_i at the k-th stage and A^{(k)} denotes the matrix A at this stage. At stage k we have
\[
\bigl[\,A^{(k)} \,\big|\, b^{(k)}\bigr] =
\left[\begin{array}{cccccc|c}
a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1k}^{(1)} & \cdots & a_{1n}^{(1)} & b_1^{(1)} \\
0 & a_{22}^{(2)} & \cdots & a_{2k}^{(2)} & \cdots & a_{2n}^{(2)} & b_2^{(2)} \\
\vdots & & \ddots & \vdots & & \vdots & \vdots \\
0 & \cdots & 0 & a_{kk}^{(k)} & \cdots & a_{kn}^{(k)} & b_k^{(k)} \\
\vdots & & & \vdots & & \vdots & \vdots \\
0 & \cdots & 0 & a_{nk}^{(k)} & \cdots & a_{nn}^{(k)} & b_n^{(k)}
\end{array}\right].
\]
The elements a_{k+1,k}^{(k)}, a_{k+2,k}^{(k)}, …, a_{nk}^{(k)} are eliminated by subtracting the following multiples of row k from rows k+1, k+2, …, n:
\[
m_{k+1,k} := \frac{a_{k+1,k}^{(k)}}{a_{kk}^{(k)}}, \quad
m_{k+2,k} := \frac{a_{k+2,k}^{(k)}}{a_{kk}^{(k)}}, \quad \ldots, \quad
m_{n,k} := \frac{a_{n,k}^{(k)}}{a_{kk}^{(k)}}.
\]
In general, assuming that a_{kk}^{(k)} ≠ 0, the (i, k) multiplier is
\[
m_{ik} := \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}, \quad i = k+1, \ldots, n,
\]
and, for all i, j = k+1, …, n,
\[
a_{ij}^{(k+1)} = a_{ij}^{(k)} - m_{ik}\, a_{kj}^{(k)}, \qquad
b_i^{(k+1)} = b_i^{(k)} - m_{ik}\, b_k^{(k)}.
\]

Pictorially, at stage k the matrix will look as in Figure 4.2. Note that rows 1, . . . , k will not change from stage k + 1 onwards.

from numpy import dot

def noPivot(A, row):
    # default pivoting strategy: keep the current row (no row interchange)
    return row

def GaussJordan(A, b, pivoting = noPivot):
    # A and b are numpy arrays of floats; both are overwritten in place
    (rows, cols) = A.shape
    for row in range(0, rows-1):
        pivot = pivoting(A, row)
        if abs(A[pivot, row]) < 1e-8:
            raise ValueError()
        if pivot != row:
            A[[row, pivot], :] = A[[pivot, row], :]
            b[[row, pivot]] = b[[pivot, row]]
        for i in range(row+1, rows):
            if abs(A[row, row]) < 1e-8:
                raise ValueError()
            # eliminate A[i, row] by subtracting a multiple of the pivot row
            factor = A[i, row] / A[row, row]
            A[i, row+1:rows] = A[i, row+1:rows] - factor*A[row, row+1:rows]
            b[i] = b[i] - factor*b[row]
    # back substitution on the (implicitly) upper triangular system
    for k in range(rows-1, -1, -1):
        b[k] = (b[k] - dot(A[k, k+1:rows], b[k+1:rows])) / A[k, k]
    return b
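A usage sketch (the 3 × 3 system is made up for illustration; since A and b are overwritten, copies are passed):

from numpy import array, allclose
from numpy.linalg import solve

A = array([[2.0, 1.0, 1.0],
           [4.0, -6.0, 0.0],
           [-2.0, 7.0, 2.0]])
b = array([5.0, -2.0, 9.0])

x = GaussJordan(A.copy(), b.copy())      # no pivoting needed for this matrix
print(allclose(x, solve(A, b)))          # True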

Figure 4.2. Gauss-Jordan: changes at stage k (the pivot a_kk, the entries below it that are reduced to zero, and the trailing part of the matrix that changes).

4.1.1. Analysis of Gauss-Jordan. Observe that the elimination update (the line updating A[i, row+1:rows] above) performs O(n) “multiply–accumulate” operations, and it is executed once for each remaining row at each stage. If we count a “multiply–accumulate” as one operation, the number S(n) of operations performed is:
\[
S(n) = \sum_{k=1}^{n-1} \sum_{i=k+1}^{n} \sum_{j=k+1}^{n} 1
= \sum_{k=1}^{n-1} \sum_{i=k+1}^{n} (n-k)
= \sum_{k=1}^{n-1} (n-k)^2
= (n-1)^2 + (n-2)^2 + \cdots + 2^2 + 1^2
= \frac{n(n-1)(2n-1)}{6}
\approx \frac{n^3}{3} \quad \text{for large } n.
\]
Hence Gauss-Jordan is a Θ(n³) process. (One can show Σ_{k=1}^{n−1} k² = n(n−1)(2n−1)/6 by induction.)

To put Θ(n³) into perspective, consider a single computer which can sustain a performance of 10¹¹ operations per second (“100 gigaFLOPS”). For a 10000 × 10000 matrix, you need about 10¹² operations, or 10 seconds. For a 100000 × 100000 matrix, you need about 10¹⁵ operations, or under 3 hours, if you can store the 80 GB in RAM. For a 1000000 × 1000000 matrix, you need about 10¹⁸ operations, or over 115 days, if you can store the 8 TB in RAM. As you can test using your own laptop, this is a very optimistic estimate.

Gauss-Jordan transforms the original system Ax = b to upper triangular form:
\[
Ux = \begin{bmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & & \vdots \\
\vdots & & \ddots & \vdots \\
0 & \cdots & 0 & a_{nn}^{(n)}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1^{(1)} \\ b_2^{(2)} \\ \vdots \\ b_n^{(n)} \end{bmatrix}.
\]
This system of equations can now be solved using back substitution.

• The algorithm assumes a_kk^{(k)} ≠ 0: but in fact, since A is invertible, we could always swap row k with a later row to get a_kk^{(k)} ≠ 0 (see later).
• A and b are overwritten.
• The 0’s beneath the pivot element are not calculated. They are ignored, as they are known to be zero. Thus the storage space used for these zeros could be used for something else. . .
• An extra matrix is not needed to store the m_ik’s. They can be stored in place of the zeros.
• The operations on b can be done separately, once we have stored the m_ik’s.
• Because of the last observation, we may now solve for any b without going through the elimination calculations again.

We solved Ax = b using Gauss-Jordan, which required elementary row operations to be performed on both A and b. If we are required to solve the equation Ax = b′ then we would need to perform exactly the same operations on A, because these are determined by the elements of A only, and A is the same in both equations.

Hence, if we have stored the multipliers m_ik, we need to perform only the update of b from Gauss-Jordan, i.e.,

\[
b_i := b_i - m_{ik}\, b_k, \qquad k = 1, \ldots, n-1, \quad i = k+1, \ldots, n.
\]
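A sketch of this re-use (the helper name apply_stored_multipliers is hypothetical; it assumes the multipliers m_ik have been stored in the strictly lower part of a matrix M, e.g. in place of the zeros as suggested above):

def apply_stored_multipliers(M, b):
    # M holds the multipliers m_ik strictly below the diagonal, as left
    # behind by the elimination; b is a new right-hand side (numpy array)
    b = b.copy()
    n = b.shape[0]
    for k in range(n - 1):
        for i in range(k + 1, n):
            b[i] = b[i] - M[i, k] * b[k]   # b_i := b_i - m_ik * b_k
    return b                               # now ready for back substitution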

4.1.2. The LU Decomposition of A. If at each stage k of Gauss-Jordan we store m_ik in those cells of A that become zero, then the A matrix after elimination would be as follows:
\[
\begin{bmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} \\
m_{21} & a_{22}^{(2)} & & \vdots \\
\vdots & & \ddots & \vdots \\
m_{n1} & m_{n2} & \cdots & a_{nn}^{(n)}
\end{bmatrix}.
\]
We define the upper and unit lower triangular parts as
\[
U = (u_{ij}) = \begin{bmatrix}
u_{11} & u_{12} & \cdots & u_{1n} \\
0 & u_{22} & & \vdots \\
\vdots & & \ddots & \vdots \\
0 & \cdots & 0 & u_{nn}
\end{bmatrix}
= \begin{bmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & & \vdots \\
\vdots & & \ddots & \vdots \\
0 & \cdots & 0 & a_{nn}^{(n)}
\end{bmatrix},
\qquad
L = (\ell_{ij}) = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
m_{21} & 1 & & \vdots \\
\vdots & & \ddots & \vdots \\
m_{n1} & m_{n2} & \cdots & 1
\end{bmatrix}.
\]
That is, for all i, j ∈ {1, …, n},
\[
u_{ij} = \begin{cases} a_{ij}^{(i)} & \text{if } i \le j \\ 0 & \text{otherwise,} \end{cases}
\qquad
\ell_{ij} = \begin{cases} m_{ij} & \text{if } i > j \\ 1 & \text{if } i = j \\ 0 & \text{otherwise.} \end{cases}
\]

Theorem 4.7 (LU Decomposition). If L = (ℓ_ij) and U = (u_ij) are the unit lower and upper triangular matrices generated by Gauss-Jordan, assuming a_kk^{(k)} ≠ 0 at each stage, then
\[
A = (a_{ij}) = LU, \quad \text{that is,} \quad a_{ij} = \sum_{k=1}^{n} \ell_{ik} u_{kj},
\]
where
\[
u_{kj} = a_{kj}^{(k)} \text{ for } k \le j \quad (\text{in particular, } u_{kk} = a_{kk}^{(k)}),
\qquad
\ell_{ik} = m_{ik} \text{ for } k < i, \quad \ell_{kk} = 1,
\]
and this decomposition is unique.

For a proof, cf. [Watkins, 2004, pp. 51–53]. We can now interpret Gauss-Jordan as a process which decomposes A into L and U, and hence we have
\[
Ax = LUx = L(Ux) = Ly = b.
\]
This represents two triangular systems of equations, Ly = b and Ux = y, whose solutions are
\[
y = L^{-1}b, \qquad Ux = L^{-1}b, \qquad x = U^{-1}L^{-1}b.
\]

Overall, we solve Ly = b for y first (“forward substitution”), and then solve Ux = y for x (“backward substitution”). The revised code, using scipy’s LU factorisation, is:

from numpy import zeros_like, dot
from scipy.linalg import lu

def LU(A, b):
    # scipy's lu returns P, L, U with A = P L U; handling the permutation
    # explicitly keeps L genuinely lower triangular
    P, L, U = lu(A)
    b = dot(P.T, b)                  # L U x = P^T b
    # forward substitution: solve L y = P^T b
    y = zeros_like(b)
    for m, bi in enumerate(b.flatten()):
        y[m] = bi
        for n in range(m):
            y[m] -= y[n] * L[m, n]
        y[m] /= L[m, m]
    # backward substitution: solve U x = y
    x = zeros_like(b)
    for midx in range(b.size):
        m = b.size - 1 - midx
        x[m] = y[m]
        for nidx in range(midx):
            n = b.size - 1 - nidx
            x[m] -= x[n] * U[m, n]
        x[m] /= U[m, m]
    return x
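A usage sketch (assuming the LU function above; the 2 × 2 system is made up for illustration):

from numpy import array, allclose
from numpy.linalg import solve

A = array([[4.0, 3.0],
           [6.0, 3.0]])
b = array([10.0, 12.0])

print(LU(A, b))                          # [1. 2.]
print(allclose(LU(A, b), solve(A, b)))   # True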

Example 4.8 (Trefethen and Schreiber [1990]). Consider the LU decomposition of the following matrix:
\[
\begin{bmatrix}
1 & & & & 1 \\
-1 & 1 & & & 1 \\
-1 & -1 & 1 & & 1 \\
-1 & -1 & -1 & 1 & 1 \\
-1 & -1 & -1 & -1 & 1
\end{bmatrix}
\]
What is the largest element in the U matrix? What if we constructed larger matrices with the same structure? ♦

Example 4.9 (Hilbert’s Matrix). Consider the following matrix:
\[
A = \begin{bmatrix}
1 & 1/2 & 1/3 & \cdots \\
1/2 & 1/3 & 1/4 & \\
1/3 & 1/4 & 1/5 & \\
\vdots & & & \ddots
\end{bmatrix}
\]
How does Gauss-Jordan do when you try to solve the system where b_i = Σ_{j=1}^{n} A_{ij}? Why? ♦

4.1.3. The LDU Decomposition of A. Gauss-Jordan also provides the decomposition A = LDU′,

where L and U′ are unit lower and unit upper triangular and D = diag(u_ii), the diagonal matrix with u_11, …, u_nn as the diagonal entries.

To see this, decompose A = LU and let U′ = D⁻¹U. Since U is non-singular, u_ii ≠ 0 for i = 1, 2, …, n and hence D⁻¹ exists. It is easy to show that U′ := D⁻¹U is a unit upper triangular matrix. Thus,
\[
A = LU = LDD^{-1}U = LDU'.
\]
If A is symmetric then A = LDU′ = LDLᵀ, where L is unit lower triangular.

If A is symmetric and positive definite (that is, xᵀAx > 0 for all x ≠ 0) then each u_ii is positive and
\[
A = LDL^{\mathsf T} = L\sqrt{D}\sqrt{D}L^{\mathsf T} = CC^{\mathsf T},
\quad \text{where } C = L\sqrt{D} \text{ and } \sqrt{D} = \operatorname{diag}(\sqrt{u_{ii}}).
\]
This is called the Cholesky Factorization of A.
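A quick numerical illustration (numpy’s built-in Cholesky routine; the symmetric positive-definite matrix is chosen only for illustration): numpy returns the lower triangular factor C with A = CCᵀ.

from numpy import array, dot, allclose
from numpy.linalg import cholesky

A = array([[4.0, 2.0, 0.0],
           [2.0, 5.0, 1.0],
           [0.0, 1.0, 3.0]])    # symmetric positive definite

C = cholesky(A)                 # lower triangular factor
print(allclose(dot(C, C.T), A)) # True: A = C C^T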

4.2. Pivoting in Gauss-Jordan. In the preceding discussion of Gauss-Jordan we assumed that a_kk^{(k)} ≠ 0 at each stage of the process. If a_kk^{(k)} = 0 then we can interchange rows of the matrix A^{(k)} so that a_kk^{(k)} ≠ 0. In fact, we need only find a row i > k for which a_ik^{(k)} ≠ 0 and then interchange rows i and k. It can be easily shown that if A is non-singular then such a row exists. Hence, theoretically, zero pivots cause no difficulty.

However, there is a much more important reason for interchanging rows: if a_kk^{(k)} is small (even if a_kk^{(k)} ≠ 0) then division by a_kk^{(k)} would cause problems because of roundoff. We can see this in the next example. The problem with roundoff in Gauss-Jordan is that it propagates and is amplified from stage to stage because there is no contraction of error. For this reason, roundoff error control is absolutely essential in Gauss-Jordan. We will indicate briefly a couple of approaches to this, known as Partial Pivoting and Complete Pivoting.

Note: It can be shown that the step A^{(k)} → A^{(k+1)} in Gauss-Jordan may be viewed as multiplication by a matrix M^{(k)}, where M^{(k)} is a product of elementary matrices (by an elementary matrix we mean the matrix associated to an elementary row operation such as R_i^{(k+1)} := R_i^{(k)} − m_{ik} R_k^{(k)}, multiplication of a row by a constant, or swapping two rows). We will not cover this in detail.

It can be shown that if all the multipliers have magnitude (absolute value) < 1, then the final result will be accurate, as in our second approach to the above example. This is the basic idea of partial pivoting.

4.2.1. Partial Pivoting. At stage k we choose as the pivot the entry of largest absolute value in column k among rows k, …, n, and interchange that row with row k; every multiplier then satisfies |m_ik| ≤ 1.
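A minimal sketch of this strategy as a pivot-selection function compatible with the GaussJordan routine above (the name partialPivot is ours; the pivoting(A, row) interface is the one assumed by that code):

from numpy import argmax, abs

def partialPivot(A, row):
    # among rows row..n-1, pick the one whose entry in column `row`
    # has the largest absolute value; GaussJordan swaps it into place
    return row + int(argmax(abs(A[row:, row])))

It can then be used as GaussJordan(A, b, pivoting=partialPivot).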

4.3. Scaled Partial Pivoting. Scaled partial pivoting is a variation of standard partial pivoting. In scaled partial pivoting, at stage k, we choose as the pivot the entry in column k which is of greatest absolute value relative to the entries in its row (as before, we only consider rows k, …, n). The scaled pivoting approach is useful when entries have large differences in absolute value, since this causes propagation of roundoff error. We use it for systems of linear equations where the row entries vary greatly in magnitude, e.g.,
\[
\begin{bmatrix} 10 & 10^5 & 10^6 \\ 1 & 1 & -3 \end{bmatrix}.
\]
Here, it is worth interchanging the two rows, since the current pivot, 10, is larger than 1 but is very small relative to the other entries 10⁵ and 10⁶ in the first row. Without a row swap, roundoff errors will lead to loss of accuracy as in Example ??.
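The corresponding pivot-selection rule, again as a sketch with the same pivoting(A, row) interface (here each candidate entry is scaled by the largest absolute value in its own row; variants differ in exactly which entries are used for the scale factors):

from numpy import argmax, abs

def scaledPartialPivot(A, row):
    # scale factor for each remaining row: its largest absolute entry
    scales = abs(A[row:, :]).max(axis=1)        # assumes no zero rows (A non-singular)
    # pick the row whose entry in column `row` is largest relative to its scale
    return row + int(argmax(abs(A[row:, row]) / scales))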

4.4. Complete Pivoting. Complete pivoting (also called maximal pivoting) is a natural extension of partial pivoting whereby we find i* and j* such that
\[
|a_{i^*j^*}| = \max_{k \le i, j \le n} |a_{ij}|.
\]
This means that we interchange rows i* and k and columns j* and k. The row interchange does not have any effect (theoretically) on the solution, but the column interchange swaps the variable names (labels), i.e., x_{j*} ↔ x_k. These interchanges of columns must be recorded so that the correct variable is associated with the corresponding solution value at the end of the algorithm.

Complete pivoting is an O(n²) process at each stage k. Thus it adds O(n³) steps to Gauss-Jordan, which is a substantial increase, although the elimination is still Θ(n³). It is rarely used because it has been found in practice that partial pivoting is entirely adequate to ensure numerical stability, except in isolated cases: in these cases, complete pivoting may be needed to attain acceptable accuracy.

Exercise 4.10 (A Russian Matrix, Faddeev and Faddeeva [1960]). This innocent-looking matrix comes from a book by Faddeev and Faddeeva [1960]:
\[
A = \begin{bmatrix}
5 & 7 & 6 & 5 \\
7 & 10 & 8 & 7 \\
6 & 8 & 10 & 9 \\
5 & 7 & 9 & 10
\end{bmatrix}
\]
Solve the system Ax = b = (23, 32, 33, 31)ᵀ.

4.5. Direct Methods: Conclusions.
• In theory, the complexity can be decreased to that of matrix–matrix multiplication, cf. Ibarra et al. [1982].
• Complete pivoting is safe (proven), but so computationally expensive that it is not used.
• Partial pivoting is safe with high probability, particularly if the scaled version is used (experimental result, Trefethen and Schreiber [1990]).
• In practice, the various decompositions (LU, LDU, LUP, Cholesky, etc.) are of particular importance, as they often allow for elegant solutions of non-trivial problems.

5. Iterative methods for solving linear equations

Iterative methods successively improve an initial guess until it becomes satisfactory. The iterative solution of Ax = b requires the equation to be re-arranged into fixed point form as follows: x = T (x) := Cx + d.

Since subscripts are traditionally used to indicate components of a vector, we will use a superscript on the vector x to denote the iteration: x^k is the k-th “guess” or iterate of the solution vector x. Then x_i^k denotes the value of the i-th component x_i at the k-th iteration.

The convergence of iterative methods for solving Ax = b is usually restricted to diagonally dominant matrices, because:
• T is a contraction mapping ⟺ the spectral radius ρ(C) < 1, where ρ(C) is the largest modulus among C’s eigenvalues.
• A sufficient condition for this is that, for some matrix norm ‖·‖, we have ‖C‖ < 1. This is the case for strictly diagonally dominant matrices.
• Then Banach’s Fixed Point Theorem tells us that the sequence (x^k) defined by x^{k+1} := T(x^k) will converge to a unique limit x, the solution of Ax = b.

Assume the sequence (x^k) converges to the fixed point x and define e^{k+1} = x − x^{k+1}, the error at the (k+1)-th iteration. Then we have
\[
x - x^{k+1} = x - (Cx^k + d) = Cx + d - (Cx^k + d) = C(x - x^k),
\]
using that x is a fixed point and the linearity of matrix multiplication. Hence ‖e^{k+1}‖ ≤ ‖C‖ ‖e^k‖, i.e., linear order of convergence. It is obvious that the smaller ‖C‖ is, the faster the iterations converge to a solution.

5.1. Transforming Ax = b to x = Cx + d. A can be split to rewrite Ax = b in fixed point form x = Cx + d in a number of ways, including the Jacobi and Gauss-Seidel splittings.³ In both cases, because of the way C is derived from A, it turns out that if A is strictly diagonally dominant then ‖C‖∞ < 1 (or ‖C‖₁ < 1, for column dominance), and so our sufficient condition for convergence of the sequence (x^k) holds true.

5.2. Jacobi Method. This splits A as follows: Ax = (A − D + D)x = b, where D is the diagonal matrix formed from the diagonal elements of A. This leads to
\[
C = -D^{-1}(A - D) \quad \text{and} \quad d = D^{-1}b.
\]
Each component of the new vector x^{k+1} can be calculated using the original A and b as follows:
for i := 1 to n do
\[
x_i^{k+1} := \frac{1}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{k} - \sum_{j=i+1}^{n} a_{ij}x_j^{k}\right).
\]
This iteration formula can be written in correction form as: for i := 1 to n do
\[
x_i^{k+1} := x_i^{k} + \frac{1}{a_{ii}}\left(b_i - \sum_{j=1}^{n} a_{ij}x_j^{k}\right).
\]

In terms of code:

from numpy import zeros_like, dot, allclose

def Jacobi(A, b, tol = 1e-10, limit = 100):
    x = zeros_like(b)
    for iteration in range(limit):
        next = zeros_like(x)
        for i in range(A.shape[0]):
            # split the i-th row into the parts before and after the diagonal
            s1 = dot(A[i, :i], x[:i])
            s2 = dot(A[i, i + 1:], x[i + 1:])
            next[i] = (b[i] - s1 - s2) / A[i, i]
        if allclose(x, next, atol=tol):
            break
        x = next
    return x
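A usage sketch on a strictly diagonally dominant system (made up for illustration; b must be a float array so that zeros_like produces floats):

from numpy import array, allclose
from numpy.linalg import solve

A = array([[10.0, 2.0, 1.0],
           [1.0, 8.0, 2.0],
           [2.0, 1.0, 9.0]])
b = array([13.0, 11.0, 12.0])

x = Jacobi(A, b)
print(allclose(x, solve(A, b), atol=1e-8))   # True: Jacobi converges here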

5.3. Gauss-Seidel Method. This splits A as follows: Ax = (L + D + U)x = b, where L, U and D are the matrices formed from the sub-diagonal, super-diagonal, and diagonal elements of A, respectively. This leads to
\[
C = -(D + L)^{-1}U \quad \text{and} \quad d = (D + L)^{-1}b.
\]
Each component of the new vector x^{k+1} can be calculated using the original A and b as follows:
for i := 1 to n do
\[
x_i^{k+1} := \frac{1}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{k+1} - \sum_{j=i+1}^{n} a_{ij}x_j^{k}\right).
\]

In terms of code:

3The Jacobi iteration method is attributed to Carl Jacobi (1804–1851). The Gauss-Seidel method is attributed to Johann Carl Friedrich Gauss (1777–1855) and Philipp Ludwig von Seidel (1821–1896).

from numpy import zeros_like, dot, allclose

def GaussSeidel(A, b, tol = 1e-10, limit = 100):
    x = zeros_like(b)
    for iteration in range(limit):
        next = zeros_like(x)
        for i in range(A.shape[0]):
            # use the already-updated components next[:i] (Gauss-Seidel),
            # and the previous iterate for the remaining components
            s1 = dot(A[i, :i], next[:i])
            s2 = dot(A[i, i + 1:], x[i + 1:])
            next[i] = (b[i] - s1 - s2) / A[i, i]
        if allclose(x, next, rtol=tol):
            break
        x = next
    return x

Thus Gauss-Seidel uses a new component of x as soon as it becomes available, in contrast to the Jacobi method, which waits for all n new components before using any of them. The correction form of the Gauss-Seidel iteration formula is: for i := 1 to n do
\[
x_i^{k+1} := x_i^{k} + \frac{1}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{k+1} - \sum_{j=i}^{n} a_{ij}x_j^{k}\right).
\]
In vector-matrix form this is
\[
x^{k+1} = x^k + D^{-1}\bigl(b - Lx^{k+1} - (D + U)x^k\bigr) = x^k + D^{-1}r^{k,k+1},
\]
where r^{k,k+1} is the ‘residual’ after the k-th iteration.
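The convergence condition ρ(C) < 1 from the start of this section can be checked directly. A sketch (numpy; the strictly diagonally dominant matrix is the one from the Jacobi usage example above) building both iteration matrices and comparing their spectral radii:

from numpy import array, diag, tril, triu, dot, abs
from numpy.linalg import eigvals, inv

A = array([[10.0, 2.0, 1.0],
           [1.0, 8.0, 2.0],
           [2.0, 1.0, 9.0]])

D = diag(diag(A))               # diagonal part
L = tril(A, -1)                 # strictly lower part
U = triu(A, 1)                  # strictly upper part

C_jacobi = -dot(inv(D), L + U)
C_gs = -dot(inv(D + L), U)

rho = lambda C: abs(eigvals(C)).max()
print(rho(C_jacobi), rho(C_gs))   # both < 1; Gauss-Seidel's is typically smaller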

Exercise 4.11. Consider the system 2x1 + 3x2 = 11, 5x1 + 7x2 = 13. Does Jacobi converge? Does Gauss-Seidel converge? Why?

Exercise 4.12 (Trefethen and Bau III [1997]). Consider the system x₂ = 1, x₁ + x₂ = 1. Does Gauss-Jordan converge? Why? Consider the system 10⁻²⁰x₁ + x₂ = 1, −x₁ = 1. What does Gauss-Jordan converge to? Why?

5.4. Iterative Methods: In Place of a Conclusion. All of the above methods have first order convergence, i.e., ‖e^{k+1}‖ ≤ ‖C‖ ‖e^k‖, where C depends on the method used. The similarities between the methods can be seen most easily if we write them in matrix correction form:
• x^{k+1} = x^k + D⁻¹(b − Ax^k) = x^k + D⁻¹r^k (Jacobi: here r^k is the residual after the k-th iteration);
• x^{k+1} = x^k + D⁻¹(b − Lx^{k+1} − (D + U)x^k) = x^k + D⁻¹r^{k,k+1} (Gauss-Seidel: here r^{k,k+1} is the ‘residual’ after the k-th iteration).
Thus the Jacobi and Gauss-Seidel methods use different approximations to the matrix A⁻¹. In both cases, the rate of convergence slows down as the condition number increases.

There are much more sophisticated iterative methods, including conjugate gradients (CG), generalised minimal residuals (GMRES), and numerous randomised methods; see Liesen and Strakos [2012], Gower and Richtárik [2015]. More importantly, there are sophisticated means of preconditioning, i.e., lowering the condition number. These fall outside of our scope, but we will provide the briefest of overviews of each.

If one draws an i.i.d. random matrix S ∈ R^{m×q} at each iteration, one can apply a projection step, where x^{k+1} is the best approximation of x* in a random affine space passing through x^k:
\[
x^{k+1} = \arg\min_{x\in\mathbb{R}^n} \|x - x^*\|_B^2 \quad \text{subject to} \quad x = x^k + B^{-1}A^{\mathsf T} S y, \; y \text{ free}, \tag{4.5}
\]
where B is an n × n positive definite matrix used to define the B-inner product and the induced B-norm by
\[
\langle x, y\rangle_B := \langle Bx, y\rangle, \qquad \|x\|_B := \sqrt{\langle x, x\rangle_B}, \tag{4.6}
\]
where ⟨·, ·⟩ is the standard Euclidean inner product. As it turns out, one can prove very strong convergence results for such methods.

CG and GMRES can be explained as Krylov subspace methods [Liesen and Strakos, 2012, Chapter 1], with iteration
\[
x^{k+1} := \arg\min_{x\in\mathbb{R}^n} \|x - x^*\|_B^2 \quad \text{subject to} \quad x \in x^0 + \mathcal{K}_{k+1}, \tag{4.7}
\]
where K_{k+1} ⊂ Rⁿ is a (k+1)-dimensional subspace and the constraint x ∈ x⁰ + K_{k+1} is an affine space that contains x⁰. GMRES uses B = AᵀA in the objective ‖x − x*‖²_B and CG uses B = A. Alternatively, one can think in terms of the Cayley–Hamilton theorem: for any invertible A there exists a polynomial q of degree n − 1 such that q(A) = A⁻¹. In each iteration, we increase the allowable degree by 1.

Many people solve P⁻¹(Ax − b) = 0 instead of Ax − b = 0, with the hope that P⁻¹A has a lower condition number than A:
\[
x^{k+1} = x^k - \gamma_k P^{-1}(Ax^k - b). \tag{4.8}
\]
A non-singular preconditioner P is often problem-specific and applied in a matrix-free fashion, i.e., without ever instantiating P. For example, the Jacobi preconditioner uses P = diag(A). Many other preconditioners approximate A⁻¹.
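A minimal sketch of the preconditioned iteration (4.8) with the Jacobi preconditioner P = diag(A) (the step size γ_k is fixed to a constant here purely for illustration; practical methods choose it adaptively):

from numpy import array, diag, dot
from numpy.linalg import norm, solve

A = array([[10.0, 2.0, 1.0],
           [1.0, 8.0, 2.0],
           [2.0, 1.0, 9.0]])
b = array([13.0, 11.0, 12.0])

P_inv = diag(1.0 / diag(A))        # Jacobi preconditioner: P = diag(A)
x = array([0.0, 0.0, 0.0])
gamma = 0.9                        # fixed step size (an assumption for this sketch)

for k in range(200):
    x = x - gamma * dot(P_inv, dot(A, x) - b)

print(norm(x - solve(A, b)))       # tiny: the preconditioned iteration converges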

Bibliography

D. K. Faddeev and V. N. Faddeeva. Computational Methods of Linear Algebra [Vychislitel’nye metody lineinoi algebry]. Fizmatgiz, Moscow, 1960.

Robert M. Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 2015. Also arXiv:1506.03296.

Oscar H. Ibarra, Shlomo Moran, and Roger Hui. A generalization of the fast LUP matrix decomposition algorithm and applications. Journal of Algorithms, 3(1):45–56, 1982. ISSN 0196-6774. doi:10.1016/0196-6774(82)90007-4. URL http://www.sciencedirect.com/science/article/pii/0196677482900074.

Jörg Liesen and Zdeněk Strakoš. Krylov Subspace Methods: Principles and Analysis. Oxford University Press, 2012.

Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra. SIAM, 1997.

Lloyd N. Trefethen and Robert S. Schreiber. Average-case stability of Gaussian elimination. SIAM Journal on Matrix Analysis and Applications, 11(3):335–360, 1990.

David S. Watkins. Fundamentals of Matrix Computations. John Wiley & Sons, 2004.
