
8 Norms and Inner Products

This section is devoted to vector spaces with an additional structure: an inner product. The underlying field here is either F = IR or F = C.

8.1 Definition (Norm, distance) A norm on a vector space V is a real-valued function \| \cdot \| satisfying
1. \|v\| \ge 0 for all v \in V, and \|v\| = 0 if and only if v = 0.
2. \|cv\| = |c| \, \|v\| for all c \in F and v \in V.
3. \|u + v\| \le \|u\| + \|v\| for all u, v \in V (triangle inequality).
A vector space V = (V, \| \cdot \|) together with a norm is called a normed space. In normed spaces, we define the distance between two vectors u, v by d(u, v) = \|u - v\|.

Note that property 3 has a useful implication: \big| \|u\| - \|v\| \big| \le \|u - v\| for all u, v \in V.

8.2 Examples (i) Several standard norms in IR^n and C^n:

\|x\|_1 := \sum_{i=1}^n |x_i|   (1-norm)

\|x\|_2 := \left( \sum_{i=1}^n |x_i|^2 \right)^{1/2}   (2-norm; note that this is the Euclidean norm)

\|x\|_p := \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}   (p-norm, for any real p \in [1, \infty))

\|x\|_\infty := \max_{1 \le i \le n} |x_i|   (\infty-norm)

(ii) Several standard norms in C[a, b] for any a < b:

\|f\|_1 := \int_a^b |f(x)| \, dx

\|f\|_2 := \left( \int_a^b |f(x)|^2 \, dx \right)^{1/2}

\|f\|_\infty := \max_{a \le x \le b} |f(x)|

8.3 Definition (Unit sphere, Unit ball) Let \| \cdot \| be a norm on V. The set

S_1 = \{ v \in V : \|v\| = 1 \}

is called the unit sphere in V, and the set

B_1 = \{ v \in V : \|v\| \le 1 \}

the unit ball (with respect to the norm \| \cdot \|). The vectors v \in S_1 are called unit vectors.

For any vector v \ne 0 the vector u = v/\|v\| belongs to the unit sphere S_1, i.e. any nonzero vector is a scalar multiple of a unit vector.

The following four statements, 8.4-8.7, are given for the sake of completeness; they will not be used in the rest of the course. Their proofs involve some advanced material from real analysis. Students who are not familiar with it may disregard the proofs.

8.4 Theorem Any norm on IR^n or C^n is a continuous function. Precisely: for any \varepsilon > 0 there is a \delta > 0 such that for any two vectors u = (x_1, \dots, x_n) and v = (y_1, \dots, y_n) satisfying \max_i |x_i - y_i| < \delta we have \big| \|u\| - \|v\| \big| < \varepsilon.

Proof. Let g = \max\{\|e_1\|, \dots, \|e_n\|\}. By the triangle inequality, for any vector v = (x_1, \dots, x_n) = x_1 e_1 + \cdots + x_n e_n we have

\|v\| \le |x_1| \, \|e_1\| + \cdots + |x_n| \, \|e_n\| \le (|x_1| + \cdots + |x_n|) \cdot g

Hence if \max\{|x_1|, \dots, |x_n|\} < \delta, then \|v\| < ng\delta. Now let two vectors v = (x_1, \dots, x_n) and u = (y_1, \dots, y_n) be close, so that \max_i |x_i - y_i| < \delta. Then

\big| \|u\| - \|v\| \big| \le \|u - v\| < ng\delta

It is then enough to set \delta = \varepsilon/ng. This proves the continuity of the norm \| \cdot \| as a function of v. □

8.5 Corollary Let \| \cdot \| be a norm on IR^n or C^n. Denote by

S_1^e = \{ (x_1, \dots, x_n) : |x_1|^2 + \cdots + |x_n|^2 = 1 \}

the Euclidean unit sphere in IR^n (or C^n). Then the function \| \cdot \| is bounded above and below on S_1^e:

0 < \min_{v \in S_1^e} \|v\| \le \max_{v \in S_1^e} \|v\| < \infty

Proof. Indeed, it is known from real analysis that S_1^e is a compact set. Note that \| \cdot \| is a continuous function on S_1^e. It is known from real analysis that a continuous function on a compact set always attains its maximum and minimum values. In our case, the minimum value of \| \cdot \| on S_1^e is strictly positive, because 0 \notin S_1^e. This proves the corollary. □

8.6 Definition (Equivalent norms) Two norms, k · ka and k · kb, on V are said to be equivalent if there are constants 0 < C1 < C2 such that C1 ≤ kuka/kukb ≤ C2 < ∞ for all u 6= 0. This is an equivalence relation.

8.7 Theorem In any finite dimensional space V , any two norms are equivalent.

Proof. Assume first that V = IR^n or V = C^n. It is enough to prove that any norm is equivalent to the 2-norm. Any vector v = (x_1, \dots, x_n) \in V is a multiple of a Euclidean unit vector u \in S_1^e, so it is enough to check the equivalence for vectors u \in S_1^e, which immediately follows from 8.5. So, the theorem is proved for V = IR^n and V = C^n. An arbitrary n-dimensional vector space over F = IR or F = C is isomorphic to IR^n or C^n, respectively. □

8.8 Theorem + Definition (Matrix norm) Let \| \cdot \| be a norm on IR^n or C^n. Then

\|A\| := \sup_{\|x\|=1} \|Ax\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|}

defines a norm on the space of n \times n matrices. It is called the matrix norm induced by \| \cdot \|.

Proof is a direct inspection.

Note that the supremum can be replaced by a maximum in Theorem 8.8. Indeed, one can obviously write \|A\| = \sup_{\|x\|_2=1} \|Ax\|/\|x\|, then argue that the function \|Ax\| is continuous on the Euclidean unit sphere S_1^e (as a composition of two continuous functions, Ax and \| \cdot \|), then argue that \|Ax\|/\|x\| is a continuous function, as a ratio of two continuous functions of which \|x\| \ne 0, so that \|Ax\|/\|x\| attains its maximum value on S_1^e.

Note that there are norms on IR^{n \times n} that are not induced by any norm on IR^n, for example \|A\| := \max_{i,j} |a_{ij}| (Exercise: use 8.10(ii) below).

8.9 Theorem
(i) \|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^n |a_{ij}| (maximum column sum)
(ii) \|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^n |a_{ij}| (maximum row sum)

Note: There is no explicit characterization of \|A\|_2 in terms of the a_{ij}'s.

8.10 Theorem Let \| \cdot \| be a norm on IR^n or C^n. Then
(i) \|Ax\| \le \|A\| \, \|x\| for all vectors x and matrices A.
(ii) \|AB\| \le \|A\| \, \|B\| for all matrices A, B.
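As a quick numerical check (an illustration, not part of the notes), the following Python/NumPy sketch verifies 8.9 and 8.10 on a small, arbitrarily chosen example; numpy.linalg.norm with ord = 1 or ord = inf returns exactly the induced matrix norms described above.

    import numpy as np

    A = np.array([[1.0, -2.0], [3.0, 4.0]])
    B = np.array([[0.5, 1.0], [-1.0, 2.0]])
    x = np.array([1.0, -1.0])

    # Theorem 8.9: induced 1-norm = maximum column sum, infinity-norm = maximum row sum
    assert np.isclose(np.linalg.norm(A, 1), np.abs(A).sum(axis=0).max())
    assert np.isclose(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max())

    # Theorem 8.10: ||Ax|| <= ||A|| ||x|| and ||AB|| <= ||A|| ||B|| (checked here for the 1-norm)
    assert np.linalg.norm(A @ x, 1) <= np.linalg.norm(A, 1) * np.linalg.norm(x, 1) + 1e-12
    assert np.linalg.norm(A @ B, 1) <= np.linalg.norm(A, 1) * np.linalg.norm(B, 1) + 1e-12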

8.11 Definition (Real Inner Product) Let V be a real vector space. A real inner product on V is a real-valued function on V \times V, denoted by \langle \cdot, \cdot \rangle, satisfying
1. \langle u, v \rangle = \langle v, u \rangle for all u, v \in V
2. \langle cu, v \rangle = c \langle u, v \rangle for all c \in IR and u, v \in V
3. \langle u + v, w \rangle = \langle u, w \rangle + \langle v, w \rangle for all u, v, w \in V
4. \langle u, u \rangle \ge 0 for all u \in V, and \langle u, u \rangle = 0 iff u = 0

Comments: 1 says that the inner product is symmetric, 2 and 3 say that it is linear in the first argument (the linearity in the second argument follows then from 1), and 4 says that the inner product is non-negative and non-degenerate (just like a norm). Note that h0, vi = hu, 0i = 0 for all u, v ∈ V .

A real vector space together with a real inner product is called a real inner product space, or sometimes a Euclidean space.

8.12 Examples
(i) V = IR^n: \langle u, v \rangle = \sum_{i=1}^n u_i v_i (standard inner product). Note: \langle u, v \rangle = u^t v = v^t u.
(ii) V = C([a, b]) (real continuous functions): \langle f, g \rangle = \int_a^b f(x) g(x) \, dx

8.13 Definition (Complex Inner Product) Let V be a complex vector space. A complex inner product on V is a complex-valued function on V \times V, denoted by \langle \cdot, \cdot \rangle, satisfying
1. \langle u, v \rangle = \overline{\langle v, u \rangle} for all u, v \in V
2, 3, 4 as in 8.11.

A complex vector space together with a complex inner product is called a complex inner product space, or sometimes a unitary space.

8.14 Simple properties
(i) \langle u, v + w \rangle = \langle u, v \rangle + \langle u, w \rangle
(ii) \langle u, cv \rangle = \bar{c} \langle u, v \rangle

The properties (i) and (ii) are called conjugate linearity in the second argument.

8.15 Examples
(i) V = C^n: \langle u, v \rangle = \sum_{i=1}^n u_i \bar{v}_i (standard inner product). Note: \langle u, v \rangle = u^t \bar{v} = \bar{v}^t u.
(ii) V = C([a, b]) (complex continuous functions): \langle f, g \rangle = \int_a^b f(x) \overline{g(x)} \, dx

Note: the term “inner product space” from now on refers to either real or complex inner product space.

8.16 Theorem (Cauchy-Schwarz-Buniakowsky inequality) Let V be an inner product space. Then

|\langle u, v \rangle| \le \langle u, u \rangle^{1/2} \langle v, v \rangle^{1/2}

for all u, v \in V. Equality holds if and only if \{u, v\} is linearly dependent.

Proof. We do it in the complex case. Assume that v \ne 0. Consider the function

f(z) = \langle u - zv, u - zv \rangle = \langle u, u \rangle - z \langle v, u \rangle - \bar{z} \langle u, v \rangle + |z|^2 \langle v, v \rangle

of a complex variable z. Let z = re^{i\theta} and \langle u, v \rangle = se^{i\varphi} be the polar forms of the numbers z and \langle u, v \rangle. Set \theta = \varphi and let r vary from -\infty to \infty; then

0 \le f(z) = \langle u, u \rangle - 2sr + r^2 \langle v, v \rangle

Since this holds for all r \in IR (also for r < 0, because the coefficients are all nonnegative), the discriminant has to be \le 0, i.e. s^2 - \langle u, u \rangle \langle v, v \rangle \le 0. This completes the proof in the complex case. In the real case it is even easier: just assume z \in IR. The equality case in the theorem corresponds to a zero discriminant; hence the above polynomial attains the value zero, and hence u = zv for some z \in C. (We left out the case v = 0; do it yourself as an exercise.) □

8.17 Theorem + Definition (Induced norm) If V is an inner product space, then \|v\| := \langle v, v \rangle^{1/2} defines a norm on V. It is called the induced norm.

To prove the triangle inequality, you will need 8.16.

8.18 Example The inner products in Examples 8.12(i) and 8.15(i) induce the 2-norm on IRn and C|| n, respectively. The inner products in Examples 8.12(ii) and 8.15(ii) induce the 2-norm on the spaces C[a, b] of real and complex functions, respectively.

8.19 Theorem Let V be a real vector space with norm \| \cdot \|. The norm \| \cdot \| is induced by an inner product if and only if the function

\langle u, v \rangle := \frac{1}{4} \left( \|u + v\|^2 - \|u - v\|^2 \right)   (polarization identity)

satisfies the definition of an inner product. In this case \| \cdot \| is induced by the above inner product.

Note: A similar but more complicated polarization identity holds in complex inner product spaces.

9 Orthogonal Vectors

In this section, V is always an inner product space (real or complex).

9.1 Definition (Orthogonal vectors) Two vectors u, v ∈ V are said to be orthogonal if hu, vi = 0.

9.2 Example
(i) The canonical vectors e_1, \dots, e_n in IR^n or C^n with the standard inner product are mutually (i.e., pairwise) orthogonal.
(ii) Any vectors u = (u_1, \dots, u_k, 0, \dots, 0) and v = (0, \dots, 0, v_{k+1}, \dots, v_n) are orthogonal in IR^n or C^n with the standard inner product.
(iii) The zero vector 0 is orthogonal to every vector.

9.3 Theorem (Pythagoras) If \langle u, v \rangle = 0, then \|u + v\|^2 = \|u\|^2 + \|v\|^2.

Inductively, it follows that if u_1, \dots, u_k are mutually orthogonal, then

\|u_1 + \cdots + u_k\|^2 = \|u_1\|^2 + \cdots + \|u_k\|^2

9.4 Theorem If nonzero vectors u_1, \dots, u_k are mutually orthogonal, then they are linearly independent.

9.5 Definition (Orthogonal/Orthonormal basis) A basis {ui} in V is said to be orthogonal, if all the basis vectors ui are mutually orthogonal. If, in addition, all the basis vectors are unit (i.e., kuik = 1 for all i), then the basis is said to be orthonormal, or an ONB.

9.6 Theorem (Fourier expansion) If B = \{u_1, \dots, u_n\} is an ONB in a finite dimensional space V, then

v = \sum_{i=1}^n \langle v, u_i \rangle u_i

for every v \in V, i.e. c_i = \langle v, u_i \rangle are the coordinates of the vector v in the basis B. One can also write this as [v]_B = (\langle v, u_1 \rangle, \dots, \langle v, u_n \rangle)^t.

Note: the numbers hv, uii are called the Fourier coefficients of v in the ONB {u1, . . . , un}.

For example, in IR^n or C^n with the standard inner product, the coordinates of any vector v = (v_1, \dots, v_n) satisfy v_i = \langle v, e_i \rangle.

9.7 Definition (Orthogonal projection) Let u, v \in V, and v \ne 0. The orthogonal projection of u onto v is

\mathrm{Pr}_v u = \frac{\langle u, v \rangle}{\|v\|^2} \, v

Note that the vector w := u - \mathrm{Pr}_v u is orthogonal to v. Therefore, u is the sum of two vectors: \mathrm{Pr}_v u, parallel to v, and w, orthogonal to v (see the diagram below).

Figure 1: Orthogonal projection of u onto v. [Diagram: the vector u makes an angle \theta with v; its projection along v is \|u\| \cos\theta \cdot (v/\|v\|).]

9.8 Definition (Angle) In the real case, for any nonzero vectors u, v \in V let

\cos\theta = \frac{\langle u, v \rangle}{\|u\| \, \|v\|}

By 8.16, we have cos θ ∈ [−1, 1]. Hence, there is a unique angle θ ∈ [0, π] with this value of cosine. It is called the angle between u and v.

Note that \cos\theta = 0 if and only if u and v are orthogonal. Also, \cos\theta = \pm 1 if and only if u, v are proportional, v = cu; in that case the sign of c coincides with the sign of \cos\theta.

9.9 Theorem Let B = \{u_1, \dots, u_n\} be an ONB in a finite dimensional space V. Then

v = \sum_{i=1}^n \mathrm{Pr}_{u_i} v = \sum_{i=1}^n \|v\| \cos\theta_i \, u_i

where \theta_i is the angle between u_i and v.

9.10 Theorem (Gram-Schmidt) Let nonzero vectors w_1, \dots, w_m be mutually orthogonal. For v \in V, set

w_{m+1} = v - \sum_{i=1}^m \mathrm{Pr}_{w_i} v

Then the vectors w_1, \dots, w_{m+1} are mutually orthogonal, and

\mathrm{span}\{w_1, \dots, w_m, v\} = \mathrm{span}\{w_1, \dots, w_m, w_{m+1}\}

In particular, w_{m+1} = 0 if and only if v \in \mathrm{span}\{w_1, \dots, w_m\}.

9.11 Algorithm (Gram-Schmidt orthogonalization) Let \{v_1, \dots, v_n\} be a basis in V. Define

w_1 = v_1

and then inductively, for m \ge 1,

w_{m+1} = v_{m+1} - \sum_{i=1}^m \mathrm{Pr}_{w_i} v_{m+1} = v_{m+1} - \sum_{i=1}^m \frac{\langle v_{m+1}, w_i \rangle}{\|w_i\|^2} \, w_i

This gives an orthogonal basis \{w_1, \dots, w_n\}, which 'agrees' with the basis \{v_1, \dots, v_n\} in the following sense: \mathrm{span}\{v_1, \dots, v_m\} = \mathrm{span}\{w_1, \dots, w_m\} for all 1 \le m \le n. The basis \{w_1, \dots, w_n\} can be normalized by u_i = w_i/\|w_i\| to give an ONB \{u_1, \dots, u_n\}.

Alternatively, an ONB \{u_1, \dots, u_n\} can be obtained directly by

w_1 = v_1, \quad u_1 = w_1/\|w_1\|

and inductively for m \ge 1

w_{m+1} = v_{m+1} - \sum_{i=1}^m \langle v_{m+1}, u_i \rangle u_i, \quad u_{m+1} = w_{m+1}/\|w_{m+1}\|

9.12 Example
Let V = P_n(IR) with the inner product given by \langle f, g \rangle = \int_0^1 f(x) g(x) \, dx. Applying Gram-Schmidt orthogonalization to the basis \{1, x, \dots, x^n\} gives the first n + 1 of the so-called Legendre polynomials.
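As an illustration (not part of the notes), the normalized form of Algorithm 9.11 can be coded directly for vectors in IR^n with the standard inner product; the function name gram_schmidt below is my own choice.

    import numpy as np

    def gram_schmidt(vectors):
        """Orthonormalize a list of linearly independent vectors in R^n
        (standard inner product), following the normalized form of 9.11."""
        basis = []
        for v in vectors:
            w = v.astype(float)
            # subtract the projections onto the already constructed ONB vectors
            for u in basis:
                w = w - np.dot(v, u) * u
            basis.append(w / np.linalg.norm(w))
        return np.array(basis)

    # usage: the rows of U form an ONB of span{v1, v2, v3}
    V = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
    U = gram_schmidt(V)
    print(np.round(U @ U.T, 10))   # should print the 3x3 identity matrix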

9.13 Corollary Let W \subset V be a finite dimensional subspace of an inner product space V, and \dim W = k. Then there is an ONB \{u_1, \dots, u_k\} in W. If, in addition, V is finite dimensional, then the basis \{u_1, \dots, u_k\} of W can be extended to an ONB \{u_1, \dots, u_n\} of V.

9.14 Definition (Orthogonal complement) Let S ⊂ V be a subset (not necessarily a subspace). Then

S⊥ := {v ∈ V : hv, wi = 0 for all w ∈ S}

is called the orthogonal complement to S.

9.15 Theorem S⊥ is a subspace of V . If W = span S, then W ⊥ = S⊥.

9.16 Example If S = \{(1, 0, 0)\} in IR^3, then S^\perp = \mathrm{span}\{(0, 1, 0), (0, 0, 1)\}.
Let V = C[a, b] (real continuous functions) with the inner product from 8.12(ii). Let S = \{ f \equiv \mathrm{const} \} (the subspace of constant functions). Then S^\perp = \{ g : \int_a^b g(x) \, dx = 0 \}. Note that V = S \oplus S^\perp, see 1.34(b).

9.17 Theorem If W is a finite dimensional subspace of V , then

V = W ⊕ W ⊥

Proof. By 9.13, there is an ONB \{u_1, \dots, u_k\} of W. For any v \in V the vector

v - \sum_{i=1}^k \langle v, u_i \rangle u_i

belongs to W^\perp. Hence, V = W + W^\perp. The fact that the sum is direct follows from 9.4. □

Note: the finite dimensionality of W is essential. Let V = C[a, b] with the inner product from 8.12(ii) and let W \subset V be the set of real polynomials restricted to the interval [a, b]. Then W^\perp = \{0\}, and at the same time V \ne W.

9.18 Theorem (Parseval's identity) Let B = \{u_1, \dots, u_n\} be an ONB in V. Then

\langle v, w \rangle = \sum_{i=1}^n \langle v, u_i \rangle \overline{\langle w, u_i \rangle} = [v]_B^t \, \overline{[w]_B} = \overline{[w]_B}^t \, [v]_B

for all v, w \in V. In particular,

\|v\|^2 = \sum_{i=1}^n |\langle v, u_i \rangle|^2 = [v]_B^t \, \overline{[v]_B}

Proof. Follows from 9.6. □

9.19 Theorem (Bessel's inequality) Let \{u_1, \dots, u_n\} be an orthonormal subset of V. Then

\|v\|^2 \ge \sum_{i=1}^n |\langle v, u_i \rangle|^2

for all v \in V.

Proof. For any v \in V the vector

w := v - \sum_{i=1}^n \langle v, u_i \rangle u_i

belongs to \{u_1, \dots, u_n\}^\perp. Hence,

\|v\|^2 = \|w\|^2 + \sum_{i=1}^n |\langle v, u_i \rangle|^2   □

9.20 Definition (Isometry) Let V, W be two inner product spaces (both real or both complex). An isomorphism T : V \to W is called an isometry if it preserves the inner product, i.e. \langle Tv, Tw \rangle = \langle v, w \rangle for all v, w \in V. In this case V and W are said to be isometric.

Note: it can be shown using the polarization identity that T : V \to W preserves the inner product if and only if T preserves the induced norms, i.e. \|Tv\| = \|v\| for all v \in V.

9.21 Theorem Let dim V < ∞. A linear transformation T : V → W is an isometry if and only if whenever {u1, . . . , un} is an ONB in V , then {T u1, . . . , T un} is an ONB in W .

9.22 Corollary Finite dimensional inner product spaces V and W (over the same field) are isometric if and only if dim V = dim W .

9.23 Example Let V = IR^2 with the standard inner product. Then the maps defined by the matrices

A_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \quad \text{and} \quad A_2 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}

are isometries of IR^2.

10 Orthogonal/Unitary Matrices

10.1 Definition If V is a real inner product space, then isometries V \to V are called orthogonal operators. If V is a complex inner product space, then isometries V \to V are called unitary operators.

Note: If U_1, U_2 : V \to V are unitary operators, then so are U_1 U_2 and U_1^{-1}, U_2^{-1}. Thus, unitary operators form a group, denoted by U(V) \subset GL(V). Similarly, orthogonal operators on a real inner product space V form a group O(V) \subset GL(V).

10.2 Definition A real n \times n matrix Q is said to be orthogonal if QQ^t = I, i.e. Q is invertible and Q^{-1} = Q^t. A complex n \times n matrix U is said to be unitary if UU^H = I, where U^H = \bar{U}^t is the Hermitian transpose (conjugate transpose) of U.

Note: If Q is orthogonal, then also Q^t Q = I, and so Q^t is an orthogonal matrix as well. If U is unitary, then also U^H U = I, and so U^H is a unitary matrix, too. In the latter case, we also have U \bar{U}^t = I and U^t \bar{U} = I.

10.3 Theorem (i) The linear transformation in IRn defined by a matrix Q ∈ IRn×n preserves the standard (Euclidean) inner product if and only if Q is orthogonal. (ii) The linear transformation in C|| n defined by a matrix U ∈ C|| n×n preserves the standard (Hermitian) inner product if and only if U is unitary.

Proof. In the complex case: \langle Ux, Uy \rangle = (Ux)^t \overline{Uy} = x^t U^t \bar{U} \bar{y} = x^t \bar{y} = \langle x, y \rangle. The real case is similar. □

10.4 Theorem (i) An operator T : V → V of a finite dimensional complex space V is unitary iff the matrix [T ]B is unitary in any ONB B. (ii) An operator T : V → V of a finite dimensional real space V is orthogonal iff the matrix [T ]B is orthogonal in any ONB B.

10.5 Corollary Unitary n × n matrices make a group, denoted by U(n). Orthogonal n × n matrices make a group, denoted by O(n).

10.6 Theorem (i) A matrix U ∈ C|| n×n is unitary iff its columns (resp., rows) make an ONB of C|| n.

(ii) A matrix Q \in IR^{n \times n} is orthogonal iff its columns (resp., rows) make an ONB of IR^n.

10.7 Theorem If Q is orthogonal, then \det Q = \pm 1. If U is unitary, then |\det U| = 1, i.e. \det U = e^{i\theta} for some \theta \in [0, 2\pi].

Proof. In the complex case: 1 = \det I = \det(U^H U) = \det \bar{U}^t \cdot \det U = \overline{\det U} \cdot \det U = |\det U|^2. The real case is similar. □

Note: Orthogonal matrices have determinant 1 or -1. Orthogonal n \times n matrices with determinant 1 form a subgroup of O(n), denoted by SO(n).

10.8 Example The orthogonal matrix

Q = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}

represents the counterclockwise rotation of IR^2 through the angle \theta.
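A small numerical illustration (not from the notes): the following Python/NumPy snippet checks that the rotation matrix is orthogonal, has determinant 1 (so it lies in SO(2), cf. 10.7), and preserves the Euclidean norm, as any isometry must.

    import numpy as np

    theta = 0.7   # an arbitrary angle
    Q = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    assert np.allclose(Q.T @ Q, np.eye(2))           # Q is orthogonal: Q^t Q = I
    assert np.isclose(np.linalg.det(Q), 1.0)         # det Q = 1, so Q is in SO(2)
    x = np.array([3.0, -4.0])
    assert np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))  # rotation preserves length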

10.9 Theorem If λ is an eigenvalue of an orthogonal or unitary matrix, then |λ| = 1. In the real case this means λ = ±1.

Proof. If Ux = λx for some x 6= 0, then hx, xi = hUx, Uxi = hλx, λxi = |λ|2hx, xi, so that |λ|2 = 1. 2

10.10 Theorem Let T : V → V be an isometry of a finite dimensional space V . If a subspace W ⊂ V is invariant under T , then so is its orthogonal complement W ⊥.

10.11 Theorem Any unitary operator T of a finite dimensional complex space is diagonalizable. Fur- thermore, there is an ONB consisting of eigenvectors of T . Any unitary matrix is diago- nalizable.

Proof goes by induction on the dimension of the space, use 10.10. 2

10.12 Theorem Let T : V → V be an orthogonal operator on a finite dimensional real space V . Then V = V1 ⊕ · · · ⊕ Vm, where Vi are mutually orthogonal subspaces, each Vi is a T -invariant one- or two-dimensional subspace of V .

Proof. If T has an eigenvalue, we can use 10.10 and reduce the dimension of V by one. Assume now that T has no eigenvalues. The characteristic polynomial CT (x) has

real coefficients and no (real) roots, so its roots are pairs of conjugate complex numbers (because C_T(z) = 0 \Leftrightarrow C_T(\bar{z}) = 0). So, C_T(x) is a product of quadratic polynomials with no real roots: C_T(x) = P_1(x) \cdots P_k(x), where \deg P_i(x) = 2. By Cayley-Hamilton,

CT (T ) = P1(T ) ··· Pk(T ) = 0

Hence, at least one operator P_i(T), 1 \le i \le k, is not invertible (otherwise C_T(T) would be invertible). Let P_i(x) = x^2 + ax + b. Then the operator T^2 + aT + bI has a nontrivial kernel, i.e. T^2 v + aTv + bv = 0 for some vector v \ne 0. Let w = Tv (note that w \ne 0, since the operator T is an isometry). We get Tw + aw + bv = 0, so that

T v = w and T w = −aw − bv

Hence, the subspace span{v, w} is invariant under T . It is two-dimensional, since v and w are linearly independent (otherwise T v = w = λv, and λ would be an eigenvalue of T ). Now we can use 10.10 and reduce the dimension of V by two. 2

11 Adjoint and Self-adjoint Matrices

In this chapter, V denotes a finite dimensional inner product space (unless stated other- wise).

11.1 Theorem (Riesz representation) Let f ∈ V ∗, i.e. f is a linear functional on V . Then there is a unique vector w ∈ V such that f(v) = hv, wi ∀v ∈ V

Proof. Let B = \{u_1, \dots, u_n\} be an ONB in V. Then for any v = \sum c_i u_i we have f(v) = \sum c_i f(u_i) by linearity. Also, for any w = \sum d_i u_i we have \langle v, w \rangle = \sum c_i \bar{d}_i. Hence, the vector w = \sum \overline{f(u_i)} \, u_i will suffice. The uniqueness of w is obvious. □

11.2 Corollary The correspondence f \leftrightarrow w established in the previous theorem is "quasi-linear" in the following sense¹: f_1 + f_2 \leftrightarrow w_1 + w_2 and cf \leftrightarrow \bar{c} w. In the real case it is perfectly linear, though, and hence it is an isomorphism between V^* and V. This is the canonical isomorphism associated with the given (real) inner product \langle \cdot, \cdot \rangle.

Remark. If \dim V = \infty, then Theorem 11.1 fails. Consider V = C[0, 1] (real functions) with the inner product \langle F, G \rangle = \int_0^1 F(x) G(x) \, dx. Pick a point t \in [0, 1]. Let f \in V^* be the linear functional defined by f(F) = F(t). It does not correspond to any G \in V with f(F) = \langle F, G \rangle. In fact, the lack of such functions G has led to the concept of generalized functions: a generalized function G_t(x) is "defined" by three requirements: G_t(x) \equiv 0 for all x \ne t, G_t(t) = \infty and \int_0^1 F(x) G_t(x) \, dx = F(t) for every F \in C[0, 1]. Clearly, no regular function satisfies these requirements, so a "generalized function" is a purely abstract concept.

11.3 Lemma Let T : V → V be a linear operator, B = {u1, . . . , un} an ONB in V , and A = [T ]B. Then Aij = hT uj, uii

11.4 Theorem and Definition (Adjoint Operator) Let T : V → V be a linear operator. Then there is a unique linear operator T ∗ : V → V such that hT v, wi = hv, T ∗wi ∀v, w ∈ V

¹ Recall the conjugate linearity of the complex inner product in the second argument.

T^* is called the adjoint of T.

Proof. Let w ∈ V . Then f(v) := hT v, wi defines a linear functional f ∈ V ∗. By the Riesz representation theorem, there is a unique w0 ∈ V such that f(v) = hv, w0i. Then we define T ∗ by setting T ∗w = w0. The linearity of T ∗ is a routine check. Note that in the complex case the conjugating bar appears twice and thus cancels. The uniqueness of T ∗ is obvious. 2

11.5 Corollary Let T,S be linear operators on V . Then (i) (T + S)∗ = T ∗ + S∗ (ii) (cT )∗ =c ¯ T ∗ (conjugate linearity) (iii) (TS)∗ = S∗T ∗ (iv) (T ∗)∗ = T .

11.6 Corollary Let T : V → V be a linear operator, and B an ONB in V . Then

[T^*]_B = \overline{[T]_B}^{\,t}

Notation. For a matrix A \in C^{n \times n} we call

A^* := \overline{A^t} = \bar{A}^t

the adjoint matrix. One often uses A^H instead of A^*. In the real case, we simply have A^* = A^t, the transposed matrix.

11.7 Theorem. Let T : V → V be a linear operator. Then

Ker T ∗ = (Im T )⊥

In particular, V = Im T ⊕ Ker T ∗.

Proof is a routine check.

11.8 Definition (Selfadjoint Operator) A linear operator T : V \to V is said to be selfadjoint if T^* = T. A matrix A is said to be selfadjoint if A^* = A. In the real case, this is equivalent to A^t = A, i.e. A is a symmetric matrix. In the complex case, selfadjoint matrices are often called Hermitean matrices.

Note: By 11.6, an operator T is selfadjoint whenever the matrix [T]_B is selfadjoint in any (and then every) ONB B.

11.9 Theorem Let T : V → V be selfadjoint. Then (i) All eigenvalues of T are real (ii) If λ1 6= λ2 are two distinct eigenvalues of T with respective eigenvectors v1, v2, then v1 and v2 are orthogonal.

Proof. If Tv = \lambda v, then \lambda \langle v, v \rangle = \langle Tv, v \rangle = \langle v, Tv \rangle = \bar{\lambda} \langle v, v \rangle, hence \lambda = \bar{\lambda} (since v \ne 0). To prove (ii), we have \lambda_1 \langle v_1, v_2 \rangle = \langle Tv_1, v_2 \rangle = \langle v_1, Tv_2 \rangle = \bar{\lambda}_2 \langle v_1, v_2 \rangle, and in addition \bar{\lambda}_2 = \lambda_2 \ne \lambda_1, hence \langle v_1, v_2 \rangle = 0. □

Remark. Note that in the real case Theorem 11.9 implies that the characteristic polynomial C_T(x) has all real roots, i.e. C_T(x) = \prod_i (x - \lambda_i) where all \lambda_i's are real numbers. In particular, every real symmetric matrix has at least one (real) eigenvalue.

11.10 Corollary Any selfadjoint operator T has at least one eigenvalue.

11.11 Lemma Let T be a selfadjoint operator and a subspace W be T -invariant, i.e. TW ⊂ W . Then W ⊥ is also T -invariant, i.e. TW ⊥ ⊂ W ⊥.

Proof. If v ∈ W ⊥, then for any w ∈ W we have hT v, wi = hv, T wi = 0, so T v ∈ W ⊥. 2

11.12 Definition (Projection) Let V be a vector space. A linear operator P : V \to V is called a projection if P^2 = P.

Note: By the homework problem 3 in assignment 2 of MA631, a projection P satisfies Ker P = Im (I − P ) and V = Ker P ⊕ Im P . For any vector v ∈ V there is a unique decomposition v = v1 + v2 with v1 ∈ Ker P and v2 ∈ Im P . Then P v = P (v1 + v2) = v2. We say that P is the projection on Im P along Ker P .

11.13 Corollary Let V be a vector space and V = W1 ⊕ W2. Then there is a unique projection P on W2 along W1.

11.14 Definition (Orthogonal Projection) Let V be an inner product space and W \subset V a finite dimensional subspace. Then the projection on W along W^\perp is called the orthogonal projection on W, denoted by P_W.

Remark. The assumption on W being finite dimensional is made to ensure that V = W ⊕ W ⊥, recall 9.17.

11.15 Theorem Let V be a finite dimensional inner product space and W ⊂ V a subspace. Then (W ⊥)⊥ = W and

PW ⊥ = I − PW

Remark. More generally, if V = W1 ⊕ W2, then P1 + P2 = I, where P1 is a projection on W1 along W2 and P2 is a projection on W2 along W1.

11.16 Theorem Let P be a projection. Then P is an orthogonal projection if and only if P is selfad- joint.

Proof. Let P be the projection on W_2 along W_1, and V = W_1 \oplus W_2. For any vectors v, w \in V we have v = v_1 + v_2 and w = w_1 + w_2 with some v_i, w_i \in W_i, i = 1, 2. Now, if P is orthogonal, then \langle Pv, w \rangle = \langle v_2, w \rangle = \langle v_2, w_2 \rangle = \langle v, w_2 \rangle = \langle v, Pw \rangle. If P is not orthogonal, then there are v_1 \in W_1, w_2 \in W_2 with \langle v_1, w_2 \rangle \ne 0. Then \langle v_1, Pw_2 \rangle \ne 0 = \langle Pv_1, w_2 \rangle. □

11.17 Definition (Unitary/Orthogonal Equivalence) Two complex matrices A, B ∈ C|| n×n are said to be unitary equivalent if B = P −1AP for some unitary matrix P , i.e. we also have B = P ∗AP . Two real matrices A, B ∈ IRn×n are said to be orthogonally equivalent if B = P −1AP for some orthogonal matrix P , i.e. we also have B = P tAP .

11.18 Remark A coordinate change between two ONB’s is represented by a unitary (resp. orthog- onal) matrix, cf. 9.22. Therefore, for any linear operator T : V → V and ONB’s B,B0 the matrices [T ]B and [T ]B0 are unitary (resp., orthogonally) equivalent. Conversely, two matrices A, B are unitary (resp., orthogonally) equivalent iff they represent one linear operator in some two ONB’s.

11.19 Remark Any unitary matrix U is unitary equivalent to a D (which is also unitary), recall 10.11.

11.20 Theorem (Spectral Theorem)

Let T : V \to V be a selfadjoint operator. Then there is an ONB consisting entirely of eigenvectors of T.

Proof goes by induction on n = dim V , just as the proof of 10.11. You need to use 11.10 and 11.11.

11.21 Corollary Any complex Hermitean matrix is unitary equivalent to a diagonal matrix (with real diagonal entries). Any real symmetric matrix is orthogonally equivalent to a diagonal matrix.

11.22 Lemma If a complex (real) matrix A is unitary (resp., orthogonally) equivalent to a diagonal matrix with real diagonal entries, then A is Hermitean (resp., symmetric).

Proof. If A = P −1DP with P −1 = P ∗, then A∗ = P ∗D∗(P −1)∗ = A, because D∗ = D (as a real diagonal matrix). 2

11.23 Corollary If a linear operator T has an ONB consisting of eigenvectors and all its eigenvalues are real, then T is selfadjoint.

11.24 Corollary If an operator T is selfadjoint and invertible, then so is T −1. If a matrix A is selfad- joint and nonsingular, then so is A−1.

Proof. By the Spectral Theorem 11.20, there is an ONB B consisting of eigenvectors of T . Now T −1 has the same eigenvectors, and its eigenvalues are the reciprocals of those of T , hence they are real, too. Therefore, T −1 is selfadjoint. 2

12 Bilinear Forms, Positive Definite Matrices

12.1 Definition (Bilinear Form) A bilinear form on a complex vector space V is a mapping f : V × V → C|| such that

f(u1 + u2, v) = f(u1, v) + f(u2, v)

f(cu, v) = cf(u, v)

f(u, v_1 + v_2) = f(u, v_1) + f(u, v_2)

f(u, cv) = \bar{c} f(u, v)

for all vectors u, v, u_i, v_i \in V and scalars c \in C. In other words, f is linear in the first argument and conjugate linear in the second. A bilinear form on a real vector space V is a mapping f : V \times V \to IR that satisfies the same properties, except that c is real and so \bar{c} = c.

12.2 Example If V is an inner product space and T : V → V a linear operator, then f(u, v) := hT u, vi is a bilinear form.

12.3 Theorem Let V be a finite dimensional inner product space. Then for every bilinear form f on V there is a unique linear operator T : V \to V such that

f(u, v) = hT u, vi ∀u, v ∈ V

Proof. For every v ∈ V the function g(u) = f(u, v) is linear in u, so by the Riesz representation theorem 11.1 there is a vector w ∈ V such that f(u, v) = hu, wi. Define a map S : V → V by Sv = w. It is then a routine check that S is linear. Setting T = S∗ proves the existence. The uniqueness is obvious. 2

12.4 Example All bilinear forms on C^n are of the type f(x, y) = \langle Ax, y \rangle with A \in C^{n \times n}, i.e.

f(x, y) = \sum_{i,j} A_{ji} x_i \bar{y}_j

12.5 Definition (Hermitean/Symmetric Form) A bilinear form f on a complex (real) vector space V is called Hermitean (resp., symmetric) if f(u, v) = f(v, u) ∀u, v ∈ V In the real case, the bar can be dropped.

For a Hermitean or symmetric form f, the function q : V \to IR defined by q(u) := f(u, u) is called the quadratic form associated with f. Note that q(u) \in IR even in the complex case, because f(u, u) = \overline{f(u, u)}.

12.6 Theorem A linear operator T : V → V is selfadjoint if and only if the bilinear form f(u, v) = hT u, vi is Hermitean (symmetric, in the real case).

Proof. If T is selfadjoint, then f(u, v) = \langle Tu, v \rangle = \langle u, Tv \rangle = \overline{\langle Tv, u \rangle} = \overline{f(v, u)}, so f is Hermitean. If f is Hermitean, then \langle u, Tv \rangle = \overline{\langle Tv, u \rangle} = \overline{f(v, u)} = f(u, v) = \langle Tu, v \rangle = \langle u, T^* v \rangle, hence T = T^*. □

12.7 Lemma Let V be a complex inner product space and T,S : V → V linear operators. If hT u, ui = hSu, ui for all u ∈ V , then T = S.

Note: This lemma holds only in complex spaces, it fails in real spaces.

Proof. Let B be an ONB. Then \langle Tu, u \rangle = [u]_B^t \, [T]_B^t \, \overline{[u]_B}, and the same holds for S. It is then enough to prove that if for some A, B \in C^{n \times n} we have x^t A \bar{x} = x^t B \bar{x} for all x \in C^n, then A = B. Equivalently, if x^t A \bar{x} = 0 for all x \in C^n, then A = 0. This is done in the homework assignment.

12.8 Corollary If f is a bilinear form in a complex vector space V such that f(u, u) ∈ IR for all vectors u ∈ V , then f is Hermitean.

Proof. By 12.3, there is an operator T : V \to V such that f(u, v) = \langle Tu, v \rangle. Then we have \langle Tu, u \rangle = f(u, u) = \overline{f(u, u)} = \overline{\langle Tu, u \rangle} = \langle u, Tu \rangle = \langle T^* u, u \rangle, hence T = T^* by 12.7. Then apply 12.6. □

12.9 Definition (Positive Definite Form/Matrix) A Hermitean (symmetric) bilinear form f on a vector space V is said to be positive definite if f(u, u) > 0 for all u 6= 0. A selfadjoint operator T : V → V is said to be positive definite if hT u, ui > 0 for all u 6= 0. A selfadjoint matrix A is said to be positive definite if xtAx¯ > 0 for all x 6= 0.

By replacing “> 0” with “≥ 0”, one gets positive semi-definite forms/operators/matrices.

Note: in the complex cases the Hermitean requirement can be dropped, because f(u, u) > 0 implies f(u, u) ∈ IR, see 12.8.

12.10 Theorem Let A be a complex or real n \times n matrix. The following are equivalent:
(a) A is positive definite
(b) f(x, y) := x^t A \bar{y} defines an inner product in C^n (resp., IR^n)
(c) A is Hermitean (resp., symmetric) and all its eigenvalues are positive.

Proof. (a)\Leftrightarrow(b) is a routine check. To prove (a)\Leftrightarrow(c), find an ONB consisting of eigenvectors u_1, \dots, u_n of A (one exists by 11.20). Then for any v = \sum c_i u_i we have \langle Av, v \rangle = \sum \lambda_i |c_i|^2. Hence, \langle Av, v \rangle > 0 for all v \ne 0 if and only if \lambda_i > 0 for all i. □

Remark: Let A be a Hermitean (symmetric) matrix. Then it is positive semidefinite if and only if its eigenvalues are nonnegative.

12.11 Corollary If a matrix A is positive definite, then so is A−1. If an operator T is positive definite, then so is T −1.

There is a very simple and efficient way to take square roots of positive definite ma- trices.

12.12 Definition (Square Root) An n × n matrix B is called a square root of an n × n matrix A if A = B2. Notation: B = A1/2.

12.13 Lemma If A is a positive definite matrix, then there is a square root A1/2 which is also positive definite.

Proof. By 11.21 and 12.10, A = P^{-1} D P, where D is a diagonal matrix with positive diagonal entries and P a unitary (orthogonal) matrix. If D = \mathrm{diag}(d_1, \dots, d_n), then denote D^{1/2} = \mathrm{diag}(\sqrt{d_1}, \dots, \sqrt{d_n}). Now A = B^2 where B = P^{-1} D^{1/2} P, and B is selfadjoint by 11.22. □

Remark. If A is positive semidefinite, then there is a square root A1/2 which is also positive semidefinite.
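As an illustration (not part of the notes), the construction in 12.13 can be carried out numerically with an eigendecomposition; numpy.linalg.eigh returns the eigenvalues and an orthonormal eigenbasis of a symmetric (Hermitean) matrix, written here as A = P D P^t with P orthogonal rather than the notes' A = P^{-1}DP (the same thing, since P^{-1} = P^t).

    import numpy as np

    def spd_sqrt(A):
        """Square root of a symmetric positive (semi)definite matrix A,
        built as in 12.13: diagonalize A and take square roots of the eigenvalues."""
        d, P = np.linalg.eigh(A)    # eigenvalues d and orthonormal eigenvectors (columns of P)
        return P @ np.diag(np.sqrt(np.clip(d, 0.0, None))) @ P.T

    A = np.array([[4.0, 1.0], [1.0, 3.0]])   # a symmetric positive definite example
    B = spd_sqrt(A)
    assert np.allclose(B @ B, A)             # B is a square root of A
    assert np.allclose(B, B.T)               # and B is symmetric (selfadjoint)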

12.14 Corollary An n × n matrix A is positive definite if and only if there is a nonsingular matrix B such that A = B∗B.

Proof.“⇒” follows from 12.13. If A = B∗B, then A∗ = B∗(B∗)∗ = A and

\langle Ax, x \rangle = \langle Bx, Bx \rangle > 0 for any x \ne 0, because B is nonsingular. □

Remark. An n × n matrix A is positive semi-definite if and only if there is a matrix B such that A = B∗B.

12.15 Lemma (Rudin) For any matrix A \in IR^{n \times n} there is a symmetric positive semi-definite matrix B such that
(i) A^t A = B^2
(ii) \|Ax\| = \|Bx\| for all x \in IR^n
(One can think of B as a 'rectification' or 'symmetrization' of A.)

Proof. The matrix AtA is symmetric positive semidefinite by the previous remark, so it has a symmetric square root B by 12.13. Lastly, hBx, Bxi = hB2x, xi = hAtAx, xi = hAx, Axi.

12.16 Remark Let A \in IR^{n \times n} be symmetric. Then, by 11.20, there is an ONB \{u_1, \dots, u_n\} consisting entirely of eigenvectors of A. Denote by \lambda_1, \dots, \lambda_n the corresponding eigenvalues of A. Then

A = \sum_{i=1}^n \lambda_i u_i u_i^t

Proof. Just verify that A u_i = \lambda_i u_i for all 1 \le i \le n. □
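A short numerical check of this rank-one expansion (an illustration, not from the notes), again using numpy.linalg.eigh:

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 2.0]])    # symmetric
    lam, U = np.linalg.eigh(A)                # eigenvalues and ONB of eigenvectors (columns)
    # rebuild A as the sum of rank-one terms lambda_i * u_i u_i^t
    A_rebuilt = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(len(lam)))
    assert np.allclose(A, A_rebuilt)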

13 Cholesky Factorization

This is the continuation of Section 4 (LU Decomposition). We explore other decomposi- tions of square matrices.

13.1 Theorem (LDM^t Decomposition) Let A be an n \times n matrix with nonsingular principal minors, i.e. \det A_k \ne 0 for k = 1, \dots, n. Then there are unique matrices L, D, M such that L, M are unit lower triangular and D is diagonal, and

A = LDM^t

Proof. By the LU decomposition (Theorem 4.10) there are a unit lower triangular matrix L and an upper triangular matrix U such that A = LU. Let u_{11}, \dots, u_{nn} be the diagonal entries of U, and set D = \mathrm{diag}(u_{11}, \dots, u_{nn}). Then the matrix M^t := D^{-1} U is unit upper triangular, and A = LDM^t.
To establish uniqueness, let A = LDM^t = L_1 D_1 M_1^t. By the uniqueness of the LU decomposition 4.10, we have L = L_1. Hence, (D_1^{-1} D) M^t = M_1^t. Since both M^t and M_1^t are unit upper triangular, the diagonal matrix D_1^{-1} D must be the identity matrix. Hence, D = D_1, and then M = M_1. □

13.2 Corollary If, in addition, A is symmetric, then there exist a unique unit lower triangular matrix L and a diagonal matrix D such that

A = LDL^t

Proof. By the previous theorem A = LDM t. Then A = At = MDLt, and by the uniqueness of the LDMt decomposition we have L = M. 2

13.3 Theorem (Sylvester's Theorem) Let A \in IR^{n \times n} be symmetric. Then A is positive definite if and only if \det A_k > 0 for all k = 1, \dots, n.

Proof. Let A be positive definite. By 12.14, \det A = (\det B)^2 > 0. Any principal minor A_k is also a symmetric positive definite matrix, therefore by the same argument \det A_k > 0. Conversely, let \det A_k > 0 for all k. By Corollary 13.2 we have A = LDL^t. Denote by L_k and D_k the k-th principal minors of L and D, respectively. Then A_k = L_k D_k L_k^t. Note that \det D_k = \det A_k > 0 for all k = 1, \dots, n, therefore all the diagonal entries of D are positive. Lastly, \langle Ax, x \rangle = \langle DL^t x, L^t x \rangle = \langle Dy, y \rangle > 0 because y = L^t x \ne 0 whenever x \ne 0. □

13.4 Corollary Let A be a real symmetric positive definite matrix. Then A_{ii} > 0 for all i = 1, \dots, n. Furthermore, let 1 \le i_1 < i_2 < \cdots < i_k \le n, and let A' be the k \times k matrix formed by the intersections of the rows and columns of A with numbers i_1, \dots, i_k. Then \det A' > 0.

Proof. Just reorder the coordinates in IR^n so that A' becomes a principal minor. □

13.5 Theorem (Cholesky Factorization) Let A \in IR^{n \times n} be symmetric and positive definite. Then there exists a unique lower triangular matrix G with positive diagonal entries such that

A = GG^t

Proof. By Corollary 13.2 we have A = LDL^t. Let D = \mathrm{diag}(d_1, \dots, d_n). As was shown in the proof of 13.3, all d_i > 0. Let D^{1/2} = \mathrm{diag}(\sqrt{d_1}, \dots, \sqrt{d_n}). Then D = D^{1/2} D^{1/2}, and setting G = LD^{1/2} gives A = GG^t. The diagonal entries of G are \sqrt{d_1}, \dots, \sqrt{d_n}, so they are positive.
To establish uniqueness, let A = GG^t = G_1 G_1^t. Then G_1^{-1} G = G_1^t (G^t)^{-1}. Since this is the equality of a lower triangular matrix and an upper triangular one, both matrices are diagonal: G_1^{-1} G = G_1^t (G^t)^{-1} = D. Hence, G_1 = GD^{-1} and G_1^t = DG^t. Therefore, D = D^{-1}, so the diagonal entries of D are all \pm 1. Note that the value -1 is not possible since the diagonal entries of both G and G_1 are positive. Hence, D = I, and so G_1 = G. □

13.6 Algorithm (Cholesky Factorization) Here we outline the algorithm for computing the matrix G = (g_{ij}) from the matrix A = (a_{ij}). Note that G is lower triangular, so g_{ij} = 0 for i < j. Hence,

a_{ij} = \sum_{k=1}^{\min\{i,j\}} g_{ik} g_{jk}

Setting i = j = 1 gives a_{11} = g_{11}^2, so

g_{11} = \sqrt{a_{11}}

(remember, g_{ii} must be positive). Next, for 2 \le i \le n we have a_{i1} = g_{i1} g_{11}, hence

g_{i1} = a_{i1}/g_{11}, \quad i = 2, \dots, n

This gives the first column of G. Now, inductively, assume that we already have the first j - 1 columns of G. Then a_{jj} = \sum_{k=1}^{j} g_{jk}^2, hence

g_{jj} = \sqrt{ a_{jj} - \sum_{k=1}^{j-1} g_{jk}^2 }

Next, for j + 1 \le i \le n we have a_{ij} = \sum_{k=1}^{j} g_{ik} g_{jk}, hence

g_{ij} = \frac{1}{g_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} g_{ik} g_{jk} \right)

13.7 Cost of computation The cost of computation is measured in flops, where a flop is a multiplication or division together with one addition or subtraction, recall 4.12. The above computation of g_{ij} takes j flops for each i = j, \dots, n, so the total is

\sum_{j=1}^n j(n - j) \approx n \cdot \frac{n^2}{2} - \frac{n^3}{3} = \frac{n^3}{6}

It also takes n square root extractions. Recall that the LU decomposition takes about n3/3 flops, so the Cholesky factorization is nearly twice as efficient. It is also more stable than the LU decomposition.
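The algorithm in 13.6 translates directly into Python; the sketch below is an illustration under the assumption that A is symmetric positive definite (NumPy's built-in numpy.linalg.cholesky computes the same factor).

    import numpy as np

    def cholesky(A):
        """Cholesky factor G (lower triangular, positive diagonal) with A = G G^t,
        computed column by column as in 13.6."""
        n = A.shape[0]
        G = np.zeros_like(A, dtype=float)
        for j in range(n):
            # diagonal entry: g_jj = sqrt(a_jj - sum_{k<j} g_jk^2)
            G[j, j] = np.sqrt(A[j, j] - np.dot(G[j, :j], G[j, :j]))
            # entries below the diagonal: g_ij = (a_ij - sum_{k<j} g_ik g_jk) / g_jj
            for i in range(j + 1, n):
                G[i, j] = (A[i, j] - np.dot(G[i, :j], G[j, :j])) / G[j, j]
        return G

    A = np.array([[4.0, 2.0, 2.0],
                  [2.0, 5.0, 3.0],
                  [2.0, 3.0, 6.0]])
    G = cholesky(A)
    assert np.allclose(G @ G.T, A)
    assert np.allclose(G, np.linalg.cholesky(A))   # matches NumPy's factor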

13.8 Remark The above algorithm can be used to verify that a given symmetric matrix A is positive definite. Whenever the square root extractions in 13.6 are all possible and nonzero, i.e. whenever

a_{11} > 0 \quad \text{and} \quad a_{jj} - \sum_{k=1}^{j-1} g_{jk}^2 > 0 \quad \forall j \ge 2

the matrix A is positive definite.

14 Machine Arithmetic

This section is unusual. It discusses imprecise calculations. It may be difficult or frustrating for students who have not done much computer work, but it might be easy (or even trivial) for people experienced with computers. This section is necessary, since it motivates the discussion in the following section.

14.1 Binary numbers A bit is a binary digit, it can only take two values: 0 and 1. Any natural number N can be written, in the binary system, as a sequence of binary digits:

N = (d_n \cdots d_1 d_0)_2 = 2^n d_n + \cdots + 2 d_1 + d_0

For example, 5 = 101_2, 11 = 1011_2, 64 = 1000000_2, etc.

14.2 More binary numbers To represent more numbers in the binary system (not just positive integers), it is convenient to use the numbers \pm N \times 2^{\pm M}, where M and N are positive integers. Clearly, with these numbers we can approximate any real number arbitrarily accurately. In other words, the set of numbers \{\pm N \times 2^{\pm M}\} is dense on the real line. The number E = \pm M is called the exponent and N the mantissa. Note that N \times 2^M = (2^k N) \times 2^{M-k}, so the same real number can be represented in many ways as \pm N \times 2^{\pm M}.

14.3 Floating point representation We return to our decimal system. Any decimal number (with finitely many digits) can be written as

f = \pm .d_1 d_2 \dots d_t \times 10^e

where d_i are decimal digits and e is an integer. For example, 18.2 = .182 \times 10^2 = .0182 \times 10^3, etc. This is called a floating point representation of decimal numbers. The part .d_1 \dots d_t is called the mantissa and e is the exponent. By changing the exponent e with a fixed mantissa .d_1 \dots d_t we can move ("float") the decimal point, for example .182 \times 10^2 = 18.2 and .182 \times 10^1 = 1.82.

14.4 Normalized floating point representation To avoid unnecessary multiple representations of the same number (as 18.2 by .182 \times 10^2 and .0182 \times 10^3 above), we require that d_1 \ne 0. We say the floating point representation is normalized if d_1 \ne 0. Then .182 \times 10^2 is the only normalized representation of the number 18.2. For every positive real f > 0 there is a unique integer e \in Z such that g := 10^{-e} f \in [0.1, 1). Then f = g \times 10^e is the normalized representation of f. So, the normalized representation is unique.

14.5 Other number systems Now suppose we are working in a number system with base \beta \ge 2. By analogy with 14.3, the floating point representation is

f = \pm .d_1 d_2 \dots d_t \times \beta^e

where 0 \le d_i \le \beta - 1 are digits,

.d_1 d_2 \dots d_t = d_1 \beta^{-1} + d_2 \beta^{-2} + \cdots + d_t \beta^{-t}

is the mantissa and e \in Z is the exponent. Again, we say that the above representation is normalized if d_1 \ne 0; this ensures uniqueness.

14.6 Machine floating point numbers Any computer can only handle finitely many numbers. Hence, the number of digits di’s is necessarily bounded, and the possible values of the exponent e are limited to a finite interval. Assume that the number of digits t is fixed (it characterizes the accuracy of machine numbers) and the exponent is bounded by L ≤ e ≤ U. Then the parameters β, t, L, U completely characterize the set of numbers that a particular machine system can handle. The most important parameter is t, the number of significant digits, or the length of the mantissa. (Note that the same computer can use many possible machine systems, with different values of β, t, L, U, see 14.8.)

14.7 Remark The maximal (in absolute value) number that a machine system can handle is M = \beta^U (1 - \beta^{-t}). The minimal positive number is m = \beta^{L-1}.

14.8 Examples Most computers use the binary system, \beta = 2. Many modern computers (e.g., all IBM compatible PC's) conform to the IEEE floating-point standard (ANSI/IEEE Standard 754-1985). This standard provides two systems. One is called single precision; it is characterized by t = 24, L = -125 and U = 128. The other is called double precision; it is characterized by t = 53, L = -1021 and U = 1024. The PC's equipped with the so-called numerical coprocessor also use an internal system called temporary format; it is characterized by t = 65, L = -16381 and U = 16384.

14.9 Relative errors Let x > 0 be a positive real number with the normalized floating point representation with base \beta

x = .d_1 d_2 \dots \times \beta^e

where the number of digits may be finite or infinite. We need to represent x in a machine system with parameters \beta, t, L, U. If e > U, then x cannot be represented (an attempt to store x in the computer memory or perform a calculation that results in x should terminate the computer program with the error message OVERFLOW, a number too large). If e < L, the system may either represent x by 0 (quite reasonably) or terminate the program with the error message UNDERFLOW, a number too small. If e \in [L, U] is within the right range, then the mantissa has to be reduced to t digits (if it is longer or infinite). There are two standard ways to do that reduction:
(i) just take the first t digits of the mantissa of x, i.e. .d_1 \dots d_t, and chop off the rest;
(ii) round off to the nearest available number, i.e. take the mantissa .d_1 \dots d_t if d_{t+1} < \beta/2, and .d_1 \dots d_t + .0 \dots 01 if d_{t+1} \ge \beta/2.

Denote the obtained number by x_c (the computer representation of x). The relative error in this representation can be estimated as

\frac{x_c - x}{x} = \varepsilon, \quad \text{or} \quad x_c = x(1 + \varepsilon)

where the maximal possible value of |\varepsilon| is

u = \beta^{1-t} for chopped arithmetic, \quad u = \tfrac{1}{2} \beta^{1-t} for rounded arithmetic

The number u is called the unit round-off or machine precision.
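A small Python check (an illustration, not from the notes): Python floats are IEEE double precision (\beta = 2, t = 53), and numpy.finfo(float).eps equals \beta^{1-t} = 2^{-52}, the quantity appearing in 14.9 and in 14.10(b) below.

    import numpy as np

    # numpy.finfo reports eps = 2**(1-t) = 2**-52 for double precision
    print(np.finfo(float).eps)        # 2.220446049250313e-16
    print(2.0**-52)                   # the same number

    # 1 + eps is the smallest representable number greater than 1:
    print(1.0 + 2.0**-52 == 1.0)      # False
    # adding anything clearly below the unit round-off is lost:
    print(1.0 + 2.0**-54 == 1.0)      # True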

14.10 Examples
a) For the IEEE floating-point standard with chopped arithmetic in single precision we have u = 2^{-23} \approx 1.2 \times 10^{-7}. In other words, approximately 7 decimal digits are accurate.
b) For the IEEE floating-point standard with chopped arithmetic in double precision we have u = 2^{-52} \approx 2.2 \times 10^{-16}. In other words, approximately 16 decimal digits are accurate.

14.11 Computational errors Let x, y be two real numbers represented in a machine system by xc, yc. An arithmetic operation x ∗ y, where ∗ is one of +, −, ×, ÷, is performed by a computer in the following way. The computer finds xc ∗ yc (first, exactly) and then represents that number by the machine system. The result is z := (xc ∗ yc)c. Note that, generally, z is different from (x ∗ y)c, which is the representation of the exact result x ∗ y. Hence, z is not necessarily the best representation for x ∗ y. In other words, the computer makes additional round off errors during computations. Assuming that xc = x(1 + ε1) and yc = y(1 + ε2) we have

(xc ∗ yc)c = (xc ∗ yc)(1 + ε3) = [x(1 + ε1)] ∗ [y(1 + ε2)](1 + ε3)

where |ε1|, |ε2|, |ε3| ≤ u.

29 14.12 Multiplication and Division For multiplication, we have

z = xy(1 + ε1)(1 + ε2)(1 + ε3) ≈ xy(1 + ε1 + ε2 + ε3)

so the relative error is (approximately) bounded by 3u. A similar estimate can be made in the case of division.

14.13 Addition and Subtraction For addition, we have

z = (x + y + x\varepsilon_1 + y\varepsilon_2)(1 + \varepsilon_3) = (x + y) \left( 1 + \frac{x\varepsilon_1 + y\varepsilon_2}{x + y} \right) (1 + \varepsilon_3)

The relative error is now small if |x| and |y| are not much bigger than |x + y|. The error, however, can be arbitrarily large if |x + y| \ll \max\{|x|, |y|\}. This effect is known as catastrophic cancellation. A similar estimate can be made in the case of subtraction x - y: if |x - y| is not much smaller than |x| or |y|, then the relative error is small; otherwise we may have a catastrophic cancellation.

14.14 Quantitative estimation Assume that z = x + y is the exact sum of x and y. Let x_c = x + \Delta x and y_c = y + \Delta y be the machine representations of x and y. Let z + \Delta z = x_c + y_c. We want to estimate the relative error |\Delta z|/|z| in terms of the relative errors |\Delta x|/|x| and |\Delta y|/|y|:

\frac{|\Delta z|}{|z|} \le \frac{|x|}{|x + y|} \cdot \frac{|\Delta x|}{|x|} + \frac{|y|}{|x + y|} \cdot \frac{|\Delta y|}{|y|}

In particular, assume that |\Delta x|/|x| \le u and |\Delta y|/|y| \le u, so that x_c and y_c are the best possible machine representations of x and y. Then

\frac{|\Delta z|}{|z|} \le \frac{|x| + |y|}{|x + y|} \, u

so the value q := (|x| + |y|)/|x + y| characterizes the accuracy of the machine addition of x and y. More precisely, if q \ll u^{-1}, then the result is fairly accurate. On the contrary, if q \approx u^{-1} or higher, then a catastrophic cancellation occurs.
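A quick Python illustration of catastrophic cancellation (not from the notes): subtracting two nearly equal numbers loses most of the significant digits.

    import numpy as np

    x = 1.0 + 1e-12
    y = 1.0
    # exact difference is 1e-12, but x is stored with a relative error of order u ~ 1e-16,
    # which becomes a relative error of order 1e-4 in the computed difference
    diff = x - y
    print(diff)                       # about 1.00009e-12, only ~4 digits correct
    print(abs(diff - 1e-12) / 1e-12)  # relative error of order 1e-4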

14.15 Example Consider the system of equations

\begin{pmatrix} 0.01 & 2 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 2 \\ 4 \end{pmatrix}

The exact solution is

\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 200/197 \\ 196/197 \end{pmatrix} = \begin{pmatrix} 1.015\ldots \\ 0.995\ldots \end{pmatrix}

One can solve the system with chopped arithmetic with base \beta = 10 and t = 2 (i.e. working with a two digit mantissa) by the LU decomposition without pivoting. One gets (x_c, y_c) = (0, 1). Increasing the accuracy to t = 3 gives (x_c, y_c) = (2, 0.994), not much of an improvement, since x_c is still far off. By applying partial pivoting one gets (x_c, y_c) = (1, 1) for t = 2 and (x_c, y_c) = (1.02, 0.994) for t = 3, almost perfect accuracy!
One can see that solving linear systems in machine arithmetic is different from solving them exactly. We 'pretend' that we are computers subject to the strict rules of machine arithmetic; in particular we are limited to t significant digits. As we noticed, these limitations may lead to unexpectedly large errors in the end.

15 Sensitivity and Stability

One needs to avoid catastrophic cancellations. At the very least, one wants to have means to determine whether catastrophic cancellations can occur. Here we develop such means for the problem of solving systems of linear equations Ax = b.

15.1 Convention Let k · k be a norm in IRn. If x ∈ IRn is a vector and x + ∆x ∈ IRn is a nearby vector (say, a machine representation of x), we consider k∆xk/kxk as the relative error (in x). If A ∈ IRn×n is a matrix and A + ∆A ∈ IRn×n a nearby matrix, we consider k∆Ak/kAk as the relative error (in A). Here k · k is the matrix norm induced by the norm k · k in IRn.

15.2 Definition (Condition Number) For a nonsingular matrix A, the condition number with respect to the given matrix norm \| \cdot \| is

\kappa(A) = \|A^{-1}\| \, \|A\|

We denote by κ1(A), κ2(A), κ∞(A) the condition numbers with respect to the 1-norm, 2-norm and ∞-norm, respectively.
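For concreteness (an illustration, not from the notes), NumPy computes these condition numbers directly via numpy.linalg.cond; the example reuses the matrix from 14.15.

    import numpy as np

    A = np.array([[0.01, 2.0],
                  [1.0,  3.0]])     # the matrix from Example 14.15

    # kappa(A) = ||A^{-1}|| * ||A|| in the 1-, 2- and infinity-norms
    for p in (1, 2, np.inf):
        kappa = np.linalg.norm(A, p) * np.linalg.norm(np.linalg.inv(A), p)
        assert np.isclose(kappa, np.linalg.cond(A, p))
        print(p, kappa)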

15.3 Theorem Suppose we have

Ax = b, \qquad (A + \Delta A)(x + \Delta x) = b + \Delta b

with a nonsingular matrix A. Assume that \|\Delta A\| is small, so that \|\Delta A\| \, \|A^{-1}\| < 1. Then

\frac{\|\Delta x\|}{\|x\|} \le \frac{\kappa(A)}{1 - \kappa(A) \, \frac{\|\Delta A\|}{\|A\|}} \left( \frac{\|\Delta A\|}{\|A\|} + \frac{\|\Delta b\|}{\|b\|} \right)

Proof. Expanding out the second equation, subtracting the first one and multiplying by A^{-1} gives

\Delta x = -A^{-1} \Delta A (x + \Delta x) + A^{-1} \Delta b

Taking norms and using the triangle inequality and 8.10 gives

\|\Delta x\| \le \|A^{-1}\| \, \|\Delta A\| \left( \|x\| + \|\Delta x\| \right) + \|A^{-1}\| \, \|\Delta b\|

Using \|b\| \le \|A\| \, \|x\|, this rearranges to

\left( 1 - \|A^{-1}\| \, \|\Delta A\| \right) \|\Delta x\| \le \left( \|A^{-1}\| \, \|\Delta A\| + \|A^{-1}\| \, \|A\| \, \frac{\|\Delta b\|}{\|b\|} \right) \|x\|

Recall that \|\Delta A\| \, \|A^{-1}\| < 1, so the first factor above is positive. The theorem follows immediately. □

Interpretation. Let Ax = b be a given system of linear equations to be solved numeri- cally. A computer represents A by Ac = A + ∆A and b by bc = b + ∆b and then solves the system Acx = bc. Assume now (ideally) that the computer finds an exact solution xc = x + ∆x, i.e., Acxc = bc. Then the relative error k∆xk/kxk can be estimated by 15.3. The smaller the condition number κ(A), the tighter (better) estimate on k∆xk/kxk we get. The value of κ(A) thus characterizes the sensitivity of the solution of the linear system Ax = b to small errors in A and b.

15.4 Corollary Assume that in Theorem 15.3 we have \|\Delta A\| \le u \|A\| and \|\Delta b\| \le u \|b\|, i.e. the matrix A and the vector b are represented with the best possible machine accuracy. Then

\frac{\|\Delta x\|}{\|x\|} \le \frac{2u\kappa(A)}{1 - u\kappa(A)}

Interpretation. If u\kappa(A) \ll 1, then the numerical solution of Ax = b is quite accurate in the ideal case (when the computer solves the system A_c x = b_c exactly). If, however, u\kappa(A) \approx 1 or > 1, then the numerical solution is completely unreliable, no matter how accurately the computer works. Comparing this to 14.14, we can interpret \kappa(A) as a quantitative indicator of the possibility of catastrophic cancellations.
Linear systems Ax = b with \kappa(A) \ll u^{-1} are often called well-conditioned. Those with \kappa(A) of order u^{-1} or higher are called ill-conditioned. In practical applications, when one runs (or may run) into an ill-conditioned linear system, one needs to reformulate the underlying problem rather than trying to use numerical tricks to deal with the ill-conditioning. We will face this problem in Section 16.

15.5 Example Assume that u \approx 10^{-l}, i.e. the machine system provides l accurate digits. If \kappa(A) \approx 10^k with k < l, then \|\Delta x\|/\|x\| \lesssim 10^{k-l}, i.e. even an ideal numerical solution only provides l - k accurate digits.

15.6 Proposition
1. \kappa(\lambda A) = \kappa(A) for \lambda \ne 0
2. \kappa(A) = \dfrac{\max_{\|x\|=1} \|Ax\|}{\min_{\|x\|=1} \|Ax\|}
3. If a_j denotes the j-th column of A, then \kappa(A) \ge \|a_j\|/\|a_i\| for all i, j
4. \kappa_2(A) = \kappa_2(A^t)
5. \kappa(I) = 1
6. \kappa(A) \ge 1
7. For any orthogonal matrix Q,

\kappa_2(QA) = \kappa_2(AQ) = \kappa_2(A)

8. If D = \mathrm{diag}(d_1, \dots, d_n), then

\kappa_2(D) = \kappa_1(D) = \kappa_\infty(D) = \frac{\max_{1 \le i \le n} |d_i|}{\min_{1 \le i \le n} |d_i|}

9. If \lambda_M is the largest eigenvalue of A^t A and \lambda_m is its smallest eigenvalue, then

\kappa_2(A^t A) = \kappa_2(A A^t) = (\kappa_2(A))^2 = \lambda_M / \lambda_m

10. \kappa_2(A) = 1 if and only if A is a multiple of an orthogonal matrix.

15.7 Example Let

A = \begin{pmatrix} 1 & 0 \\ 0 & \varepsilon \end{pmatrix}

Then \kappa_2(A) = 1/\varepsilon by 15.6(8), and so the system Ax = b is ill-conditioned for small \varepsilon. It is, however, easy to 'correct' the system by multiplying the second equation by 1/\varepsilon. Then the new system A'x' = b' will be perfectly conditioned, since \kappa_2(A') = 1 by 15.6(5). The trick of multiplying rows (and columns) by scalars is called scaling of the matrix A. Sometimes, scaling allows one to significantly reduce the condition number of A. Such scalable matrices are, however, not encountered very often in practice. Besides, no satisfactory method exists for detecting scalable matrices.

15.8 Remark Another way to look at the condition number \kappa(A) is the following. Since in practice the exact solution x of the system Ax = b is rarely known, one can try to verify whether the so-called residual vector r = b - Ax_c is small relative to b. Since Ax_c = b - r, Theorem 15.3 with \Delta A = 0 implies that

\frac{\|x_c - x\|}{\|x\|} \le \kappa(A) \, \frac{\|r\|}{\|b\|}

If A is well-conditioned, the smallness of \|r\|/\|b\| ensures the smallness of the relative error \|x_c - x\|/\|x\|. If A is ill-conditioned, this does not work: one can find x_c far from x for which r is still small.

15.9 Remark In 15.3-15.8, we assumed that x_c was an exact solution of the system A_c x = b_c. We now get back to reality and consider the numerical solution \hat{x}_c of the system A_c x = b_c obtained by a computer. Because of computational errors (see 14.11-14.13) the vector \hat{x}_c will not be an exact solution, i.e. we have A_c \hat{x}_c \ne b_c. Hence, we have

\hat{x}_c - x = (\hat{x}_c - x_c) + (x_c - x) = \hat{\Delta} x + \Delta x

[Diagram: the actual data A, b have exact solution x; the machine data A_c, b_c have exact ('ideal') solution x_c; the computer algorithm produces the computed solution \hat{x}_c, which is the exact solution of a 'virtual' system \hat{A}_c x = b_c.]

The error ∆x is solely due to inaccurate representation of the input data, A and b, in the computer memory (by Ac and bc). The error ∆x is estimated in Theorem 15.3 with the help of the condition number κ(A). If κ(A) is too large, the error ∆x may be too big, and one should not even attempt to solve the system numerically.

15.10 Round-off error analysis Assume now that \kappa(A) is small enough, so that we need not worry about \Delta x. Now we need to estimate \hat{\Delta} x = \hat{x}_c - x_c. This error results from computational errors, see 14.11-14.13. Note that even small errors may accumulate to a large error in the end.
In order to estimate \hat{x}_c - x_c, a typical approach is to find another matrix, \hat{A}_c = A_c + \delta A, such that (A_c + \delta A) \hat{x}_c = b_c. We call A_c + \delta A a virtual matrix, since it is neither given nor computed numerically. Moreover, it is only specified by the fact that it takes the vector \hat{x}_c to b_c, hence it is far from being uniquely defined. One wants to find a virtual matrix as close to A_c as possible, to make \delta A small. Then one can use Theorem 15.3 to estimate \hat{\Delta} x:

\frac{\|\hat{x}_c - x_c\|}{\|x_c\|} \le \frac{\kappa(A_c)}{1 - \kappa(A_c) \, \frac{\|\delta A\|}{\|A_c\|}} \cdot \frac{\|\delta A\|}{\|A_c\|}

Clearly, the numerical solution \hat{x}_c, and then the virtual matrix A_c + \delta A, will depend on the method (algorithm) used. For example, one can use the LU decomposition method with or without pivoting. For a symmetric positive definite matrix A, one can use the Cholesky factorization. We will say that a numerical method is (algebraically) stable if one can find a virtual matrix so that \|\delta A\|/\|A\| is small, and unstable otherwise. The rule of thumb is then that for stable methods, computational errors do not accumulate too much and the numerical solution \hat{x}_c is close to the ideal solution x_c. For unstable methods, computational errors may accumulate too much and force \hat{x}_c to be far from x_c, thus making the numerical solution \hat{x}_c unreliable.

15.11 Wilkinson's analysis for the LU decomposition Here we provide, without proof, an explicit estimate on the matrix \delta A for the LU decomposition method described in Section 4. Denote by L_c = (l_{ij}^c) and U_c = (u_{ij}^c) the matrices L and U computed by the LU decomposition algorithm. Then

|\delta A_{ij}| \le nu \left( 3|a_{ij}| + 5 \sum_{k=1}^n |l_{ik}^c| \, |u_{kj}^c| \right) + O(u^2)

If the values of l_{ik}^c and u_{kj}^c are not too large compared to those of a_{ij}, this provides a good upper bound on \|\delta A\|/\|A\|; it is of order n^2 u. In this case the LU decomposition method is stable (assuming that n is not too big). But it may become unstable if \sum |l_{ik}^c| \, |u_{kj}^c| is large compared to |a_{ij}|.
One can improve the stability by using partial pivoting. In this case |l_{ij}^c| \le 1. So, trouble can only occur if one gets large elements of U_c. Practically, however, it is observed that \|U_c\|_\infty \approx \|A\|_\infty, so that the partial pivoting algorithm is usually stable. The LU decomposition with complete pivoting is always stable. In this case it can be proved that

\frac{\|U_c\|_\infty}{\|A\|_\infty} \le n^{1/2} \left( 2^1 3^{1/2} 4^{1/3} \cdots n^{1/(n-1)} \right)^{1/2}

This bound grows slowly with n and ensures stability. (Exercise: show that this bound grows approximately like n^{0.25 \ln n}.)

15.12 Remark The Cholesky factorization A = GG^t of a symmetric positive definite matrix A, see 13.5, is a particular form of the LU decomposition, so the above analysis applies. In this case, however, we know that

a_{ii} = \sum_{j=1}^{i} g_{ij}^2

see 13.6. Hence, the elements of G cannot be large compared to the elements of A. This proves that the Cholesky factorization is always stable.

15.13 Remark (Iterative Improvement) In some practical applications, a special technique can be used to improve the numerical solution \hat{x}_c of a system Ax = b. First one solves the system A_c x = b_c numerically and then finds the residual vector r = b_c - A_c \hat{x}_c. Then one solves the system A_c z = r_c for z and replaces \hat{x}_c by \hat{x}_c + z_c. This is one iteration of the so-called iterative improvement or iterative refinement. Note that solving the system A_c z = r_c is relatively cheap, since the LU decomposition of A is already obtained (and stored). Sometimes one uses 2t-digit arithmetic to compute r, if the solution \hat{x}_c is computed in t-digit arithmetic (one must have it available, though, which is not always the case).
It is known that by repeating the iterations k times with machine precision u = 10^{-d} and \kappa_\infty(A) = 10^q one expects to obtain approximately \min(d, k(d - q)) correct digits. If one uses the LU decomposition with partial pivoting, then just one iteration of the above improvement makes the solution algebraically stable.
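A minimal sketch of iterative refinement (an illustration under simplifying assumptions, not the notes' algorithm verbatim): here np.linalg.solve stands in for reusing a stored LU factorization, and everything is done in the same working precision.

    import numpy as np

    def refine(A, b, x_hat, iterations=1):
        """Iterative refinement as in 15.13: repeatedly solve A z = r for the
        residual r = b - A x_hat and update x_hat := x_hat + z."""
        for _ in range(iterations):
            r = b - A @ x_hat          # residual of the current approximate solution
            z = np.linalg.solve(A, r)  # correction
            x_hat = x_hat + z
        return x_hat

    A = np.array([[0.01, 2.0], [1.0, 3.0]])
    b = np.array([2.0, 4.0])
    x_rough = np.array([1.0, 1.0])      # a crude starting approximation (cf. 14.15)
    print(refine(A, b, x_rough, 2))     # approaches the exact solution (200/197, 196/197)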

16 Overdetermined Linear Systems

16.1 Definition (Overdetermined Linear System) A system of linear equations Ax = b with A ∈ C|| m×n, x ∈ C|| n and b ∈ C|| m, is said to be overdetermined if m > n. Since there are more equations than unknowns, the system usu- ally has no solutions. We will be concerned here exactly with that “no solution” situation.

n m Note that the matrix A defines a linear transformation TA :C|| → C|| . Then Ax = b has a solution if and only if b ∈ Im TA. In this case the solution is unique if and only if Ker TA = {0}, i.e. the columns of A are linearly independent.

16.2 Definition (Least Squares Fit) Let (xi, yi), 1 ≤ i ≤ m, be given points on the real xy plane. For any straight line y = a + bx one defines the “distance” of that line from the given points by

E(a, b) = Σ_{i=1}^{m} (a + b x_i − y_i)²

This quantity is called the residual sum of squares (RSS). Suppose that the line y = â + b̂x minimizes the function E(a, b), i.e.

E(â, b̂) = min_{a,b} E(a, b)

Then y = â + b̂x is called the least squares approximation to the given points (x_i, y_i). Finding the least squares approximation is called the least squares fit of a straight line to the given points. It is widely used in statistics.

16.3 Example Let {xi} be the heights of some m people and {yi} their weights. One can expect an approximately linear dependence y = a + bx of the weight from the height. The least squares fit of a line y = a + bx to the experimental data (xi, yi), 1 ≤ i ≤ m, gives the numerical estimates of the coefficients a and b in the formula y = a + bx.

Note: in many statistical applications the least squares estimates are the best possible (most accurate). There are, however, just as many exceptions, these questions are be- yond the scope of this course.

16.4 Remark
For the least squares fit in 16.2, let

A = ( 1  x_1 )        x = ( a )        b = ( y_1 )
    ( 1  x_2 )            ( b )            ( y_2 )
    ( :   :  )                             (  :  )
    ( 1  x_m )                             ( y_m )

Then

E(a, b) = ‖b − Ax‖_2²

Note that E(a, b) ≥ 0. Also, E(a, b) = 0 if and only if x = (a, b)^t is an exact solution of the system Ax = b. The system Ax = b is overdetermined whenever m > 2, so exact solutions rarely exist.

16.5 Definition (Least Squares Solution) Let Ax = b be an overdetermined linear system. A vector x ∈ C|| n that minimizes the function 2 E(x) = kb − Axk2 is called a least squares solution of Ax = b.

16.6 Remark
If A, b and x are real, then one can easily differentiate the function E(x) with respect to the components of the vector x and equate the derivatives to zero. Then one arrives at the linear system A^tAx = A^tb.

16.7 Definition (Normal Equations) Let Ax = b be an overdetermined linear system. Then the linear system

A∗Ax = A∗b

is called the system of normal equations associated with the overdetermined system Ax = b. In the real case, the system of normal equations is

AtAx = Atb

Note: For a matrix A ∈ C^{m×n}, its adjoint is defined by A* = (Ā)^t ∈ C^{n×m}.

16.8 Lemma
The matrix A*A is a square n × n Hermitean positive semidefinite matrix. If A has full rank, then A*A is positive definite.

16.9 Theorem
Let Ax = b be an overdetermined linear system.
(a) A vector x minimizes E(x) = ‖b − Ax‖² if and only if it is an exact solution of the system Ax = b̂, where b̂ is the orthogonal projection of b onto Im T_A.
(b) A vector x minimizing E(x) always exists. It is unique if and only if A has full rank, i.e. the columns of A are linearly independent.
(c) A vector x minimizes E(x) if and only if it is a solution of the system of normal equations A*Ax = A*b.

Proof. The matrix A* defines a linear transformation T_{A*} : C^m → C^n. Note that (Im T_A)^⊥ = Ker T_{A*}, which generalizes 11.7. The proof is a routine check, just as that of 11.7. So, we have an orthogonal decomposition C^m = Im T_A ⊕ Ker T_{A*}. Then we can write b = b̂ + r uniquely, where b̂ ∈ Im T_A and r ∈ Ker T_{A*}. Since r is orthogonal to b̂ − Ax ∈ Im T_A, it follows from the Theorem of Pythagoras that

‖b − Ax‖² = ‖b̂ − Ax‖² + ‖r‖² ≥ ‖r‖²

Hence, min_x E(x) = ‖r‖² is attained whenever Ax = b̂. Note that b̂ ∈ Im T_A, so there is always an x ∈ C^n such that Ax = b̂. The vector x is unique whenever Ker T_A = {0}, i.e. dim Im T_A = n, which occurs precisely when the columns of A are linearly independent, i.e. A has full rank (rank A = n). This proves (a) and (b).
To prove (c), observe that if x minimizes E(x), then by (a) we have b − Ax = r ∈ Ker T_{A*}, and thus A*(b − Ax) = 0. Conversely, if A*(b − Ax) = 0, then b − Ax ∈ Ker T_{A*}, and by the uniqueness of the decomposition b = b̂ + r we have Ax = b̂. 2

16.10 Normal Equations, Pros and Cons
Let A ∈ IR^{m×n}.
a) By Lemma 16.8, the matrix A^tA is symmetric positive definite, provided A has full rank. In this case the solution of the system of normal equations A^tAx = A^tb can be effectively found by Cholesky factorization. Furthermore, in many applications n is much smaller than m, so the system A^tAx = A^tb is much smaller than Ax = b.
b) It may happen, though, that the matrix A is somewhat ill-conditioned (i.e. almost causes catastrophic cancellations). In that case the condition of the matrix A^tA will be much worse than that of A, compare this to 15.6 (9). Solving the normal equations can then be disastrous. For example, let

A = ( 1  1 )
    ( ε  0 )
    ( 0  ε )

then

A^tA = ( 1 + ε²    1    )
       (   1     1 + ε² )

If ε is so small that ε² < u (for example, ε = 10^{−4} in single precision), then the matrix A^tA will be stored as a singular matrix, and the numerical solution of the normal equations will crash. Still, it is possible to find a numerical least squares solution of the original system Ax = b, if one uses more elaborate methods.
c) One can define a condition number of an m × n rectangular matrix A with m > n by

κ(A) := max_{‖x‖=1} ‖Ax‖ / min_{‖x‖=1} ‖Ax‖

Then, in the above example,

κ_2(A) = ε^{−1} √(2 + ε²)     and     κ_2(A^tA) = ε^{−2} (2 + ε²)

Hence, κ_2(A^tA) = [κ_2(A)]².
d) Clearly, κ_2(A) = κ_2(QA) for any orthogonal m × m matrix Q. Hence, one can safely (without increasing the 2-condition number of A) multiply A by orthogonal m × m matrices. We will try to find an orthogonal matrix Q so that QA is upper triangular (i.e., perform orthogonal triangularization).
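A quick numerical check of items b) and c) above (a small NumPy illustration of mine, using NumPy's cond, which computes the 2-norm condition number from the SVD):

    import numpy as np

    eps = 1e-4
    A = np.array([[1.0, 1.0],
                  [eps, 0.0],
                  [0.0, eps]])
    kA   = np.linalg.cond(A)           # 2-norm condition number of the rectangular A
    kAtA = np.linalg.cond(A.T @ A)     # condition number of the normal-equations matrix
    print(kA, kAtA, kA**2)             # kAtA is (up to round-off) the square of kA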

16.11 Definition (Hyperplane, Reflection) Let V be a finite dimensional inner product space (real or complex). A subspace W ⊂ V is called a hyperplane if dim W = dim V − 1. Note that in this case dim W ⊥ = 1. Let W ⊂ V be a hyperplane. For any vector v ∈ V we have a unique decomposition v = w + w0, where w ∈ W and w0 ∈ W ⊥. The linear operator P on V defined by P v = w − w0 is called a reflection (or reflector) across the hyperplane W . It is identical on W and negates vectors orthogonal to W .

16.12 Definition (Reflection Matrix) Let x 6= 0 be a vector in IRn or C|| n. The matrix

P = I − 2 (x x̄^t)/(x^t x̄)

is called a reflection matrix (or a reflector matrix). Obviously, P is unchanged if x is replaced by cx for any c ≠ 0. Note also that x^t x̄ = ‖x‖².

16.13 Theorem Let P be a reflection matrix corresponding to a vector x 6= 0. Then (a) P x = −x (b) P y = y whenever hy, xi = 0 (c) P is Hermitean (in the real case it is symmetric) (d) P is unitary (in the real case it is orthogonal) (e) P is involution, i.e. P 2 = I Note that (a)+(b) mean that P is a reflection of IRn or C|| n across the hyperplane orthog- onal to the vector x.

Proof. (a), (b) and (c) are proved by direct calculation. To prove (e), write

P² = I − 4 (x x̄^t)/(x^t x̄) + 4 (x x̄^t)(x x̄^t)/((x^t x̄)²)

Then (e) follows from the fact that x̄^t x = x^t x̄ = ‖x‖² is a scalar quantity. Next, (d) follows from (c) and (e).

16.14 Theorem
Let y be a vector in IR^n or C^n. Choose a scalar σ so that |σ| = ‖y‖ and σ·⟨e_1, y⟩ ∈ IR. Suppose that x = y + σe_1 ≠ 0. Let P = I − 2x x̄^t/‖x‖² be the reflection matrix defined in 16.12. Then P y = −σe_1.

Proof. Notice that ⟨y − σe_1, y + σe_1⟩ = ‖y‖² − σ⟨e_1, y⟩ + σ̄⟨y, e_1⟩ − |σ|² = 0. Hence, P(y − σe_1) = y − σe_1 by 16.13 (b). Besides, P(y + σe_1) = −y − σe_1 by 16.13 (a). Adding these two equations gives the theorem.

16.15 Remark iθ (a) To choose σ in 16.14, write a polar representation for he1, yi =y ¯1 = re and then set σ = ±kyke−iθ. (b) In the real case, we have y1 ∈ IR, and one can just set

σ = ±kyk

Note: It is geometrically obvious that for any two unit vectors x, y ∈ IRn there is a reflection P that takes x to y. In the complex space C|| n, this is not true: for generic unit vectors x, y one can only find a reflection that takes x to cy with some scalar c ∈ C|| .

16.16 Corollary For any vector y in IRn or C|| n there is a scalar σ (defined in 16.14) and a matrix P , which is either a reflection or an identity, such that P y = −σe1.

Proof. Apply Theorem 16.14 in the case y + σe1 6= 0 and set P = I otherwise. 2

16.17 Theorem (QR Decomposition) Let A be an m×n complex or real matrix with m ≥ n. Then there is a unitary (resp., orthogonal) m × m matrix Q and an upper triangular m × n matrix R (i.e., Rij = 0 for i > j) such that A = QR Furthermore, Q may be found as a product of at most n reflection matrices.

Proof. We use induction on n. Let n = 1, so that A is a column m-vector. By 16.16 there is a matrix P (a reflection or an identity) such that PA = −σe_1 for a scalar σ. Hence, A = PR where R = −σe_1 is upper triangular.
Now, let n > 1 and let a_1 be the first column of A. Again, by 16.16 there is a (reflection or identity) matrix P such that P a_1 = −σe_1. Hence,

PA = ( −σ  w^t )
     (  0   B  )

where w is an (n − 1)-vector and B is an (m − 1) × (n − 1) matrix. By the inductive assumption, there is an (m − 1) × (m − 1) unitary matrix Q' and an (m − 1) × (n − 1) upper triangular matrix R' such that B = Q'R'. Consider the unitary m × m matrix

Q_1 = ( 1  0  )
      ( 0  Q' )

(by 10.6, Q_1 is unitary whenever Q' is). One can easily check that PA = Q_1R where

R = ( −σ  w^t )
    (  0   R' )

is an upper triangular matrix. Hence, A = QR with Q = PQ_1. 2
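The proof above translates directly into an algorithm. Below is a minimal NumPy sketch of QR by reflectors for real matrices (the helper householder_qr is mine, not from the notes); the sign of σ is chosen as recommended later in 16.26(i):

    import numpy as np

    def householder_qr(A):
        # QR decomposition of a real m x n matrix (m >= n) by reflectors, as in 16.17.
        A = np.array(A, dtype=float)
        m, n = A.shape
        Q = np.eye(m)
        R = A.copy()
        for j in range(n):
            y = R[j:, j]
            sigma = np.copysign(np.linalg.norm(y), y[0])   # sign matches y_1, see 16.26(i)
            x = y.copy()
            x[0] += sigma
            if np.linalg.norm(x) == 0.0:
                continue                                   # column already zero below the diagonal
            v = x / np.linalg.norm(x)
            # Apply P = I - 2 v v^t to the trailing block of R and accumulate it into Q.
            R[j:, j:] -= 2.0 * np.outer(v, v @ R[j:, j:])
            Q[:, j:]  -= 2.0 * np.outer(Q[:, j:] @ v, v)
        return Q, R        # A is (numerically) Q @ R, Q orthogonal, R upper triangular

    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    Q, R = householder_qr(A)
    print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))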

16.18 Remark
In the above theorem, denote by a_j ∈ C^m, 1 ≤ j ≤ n, the columns of A, by q_j ∈ C^m, 1 ≤ j ≤ m, the columns of Q, and put R = (r_ij). Then A = QR may be written as

a_1 = r_11 q_1
a_2 = r_12 q_1 + r_22 q_2
...
a_n = r_1n q_1 + r_2n q_2 + ··· + r_nn q_n

16.19 Corollary If, in addition, A has full rank (rank A = n), then (a) span{a1,..., ak} = span{q1,..., qk} for every 1 ≤ k ≤ n. (b) one can find a unitary Q so that the diagonal entries of R will be real and positive (rii > 0 for 1 ≤ i ≤ n).

iθj Proof. (a) follows from 16.18. To prove (b), let rjj = |rjj|e be the polar represen- iθj tation of rjj. Then multiplying every column qj, 1 ≤ j ≤ n, by e gives (b). The new matrix Q will be still unitary by 10.6. 2

The matrix A depends only on the first n columns of Q. This suggests a reduced QR decomposition.

16.20 Theorem (Reduced or “Skinny” QR decomposition) Let A be an m × n real or complex matrix with m ≥ n. Then there is an m × n matrix Qˆ with orthonormal columns and an upper triangular n × n matrix Rˆ such that

A = QˆRˆ

Proof. Apply Theorem 16.17. Let Qˆ be the left m × n rectangular block of Q (the first n columns of Q). Let Rˆ be the top n × n square block of R (note that the remainder of R is zero). Then A = QˆRˆ. 2

16.21 Corollary If, in addition, A has full rank (rank A = n), then ˆ (a) The columns of Q make an ONB in the column space of A (=Im TA) ˆ ˆ (b) one can find Q so that the diagonal entries of R will be real and positive (rii > 0). Such Qˆ and Rˆ are unique.

Proof. This follows from 16.19, except the uniqueness. For simplicity, we prove it only ˆ ˆ ˆ ˆ t ˆt ˆ ˆt ˆ t in the real case. Let A = QR = Q1R1. Then A A = R R = R1R1. Since A A is positive ˆ ˆ definite, we can use the uniqueness of Cholesky factorization and obtain R = R1. Then ˆ ˆ ˆ ˆ−1 ˆ also Q1 = QRR1 = Q. 2

In the remainder of this section, we only consider real matrices and vectors.

16.22 Algorithm
Let Ax = b be a real overdetermined system with matrix A of full rank. One finds the decomposition A = QR with reflectors, as shown in the proof of Theorem 16.17. Then Q^tQ = I, so ‖b − Ax‖ = ‖Q^tb − Rx‖. Denote Q^tb = (c, d)^t, where c ∈ IR^n is the vector of the top n components of Q^tb and d ∈ IR^{m−n} is its bottom part. Next, one finds x by using backward substitution to solve the system R̂x = c, where R̂ is the top n × n square block of R. Lastly, one finds the value of E(x) = ‖b − Ax‖² by computing ‖d‖². Indeed,

‖b − Ax‖² = ‖Q^tb − Q^tAx‖² = ‖(c, d)^t − Rx‖² = ‖c − R̂x‖² + ‖d‖² = ‖d‖²

The above algorithm is the most suitable for computer implementation. It does not worsen the condition of the given system Ax = b, see also numerical hints in 16.26.
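A sketch of Algorithm 16.22 built on a library QR factorization (NumPy/SciPy calls; the helper lstsq_by_qr is mine, and a production code would apply the reflectors directly rather than forming Q explicitly):

    import numpy as np
    from scipy.linalg import solve_triangular

    def lstsq_by_qr(A, b):
        # Least squares solution of an overdetermined system, following 16.22.
        m, n = A.shape
        Q, R = np.linalg.qr(A, mode='complete')   # full m x m Q and m x n R
        c = Q.T @ b
        x = solve_triangular(R[:n, :n], c[:n])    # backward substitution with the top block R-hat
        residual_norm_sq = np.sum(c[n:]**2)       # E(x) = ||d||^2
        return x, residual_norm_sq

    A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0, 2.0])
    x, E = lstsq_by_qr(A, b)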

On the other hand, for calculations 'by hand' (such as on tests!), the following reduced version of this algorithm is more convenient.

16.23 Algorithm Let Ax = b be a real overdetermined system with matrix A of full rank. One can obtain a “skinny” QR decomposition A = QˆRˆ by using 16.18 and employing a version of the Gram-Schmidt orthogonalization method:

r11 = ka1k and q1 = a1/r11

r_12 = ⟨a_2, q_1⟩,   r_22 = ‖a_2 − r_12 q_1‖   and   q_2 = (a_2 − r_12 q_1)/r_22
···
r_in = ⟨a_n, q_i⟩ (i < n),   r_nn = ‖a_n − Σ_{i<n} r_in q_i‖   and   q_n = (a_n − Σ_{i<n} r_in q_i)/r_nn

Then one solves the triangular system R̂x = Q̂^tb by backward substitution. This method does not give the value of ‖b − Ax‖²; one has to compute it directly, if necessary.
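A direct transcription of this recipe into NumPy (my own sketch; it assumes A has full rank, so that r_ii ≠ 0):

    import numpy as np

    def gram_schmidt_qr(A):
        # "Skinny" QR by the Gram-Schmidt recipe of 16.23 (fine by hand;
        # less robust numerically than the reflector algorithm 16.22).
        A = np.array(A, dtype=float)
        m, n = A.shape
        Q = np.zeros((m, n))
        R = np.zeros((n, n))
        for j in range(n):
            v = A[:, j].copy()
            for i in range(j):
                R[i, j] = Q[:, i] @ A[:, j]    # r_ij = <a_j, q_i>
                v -= R[i, j] * Q[:, i]
            R[j, j] = np.linalg.norm(v)        # assumes full rank: R[j, j] != 0
            Q[:, j] = v / R[j, j]
        return Q, R

    A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0, 2.0])
    Q, R = gram_schmidt_qr(A)
    x = np.linalg.solve(R, Q.T @ b)            # solves R-hat x = Q-hat^t b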

Below we outline an alternative approach to finding a QR decomposition.

16.24 Definition () Let 1 ≤ p < q ≤ m and θ ∈ [0, 2π). The matrix G = Gp,q,θ = (gij) defined by gpp = cos θ, gpq = sin θ, gqp = − sin θ, gqq = cos θ and gij = δij otherwise is called a Givens rotation matrix. It defines a rotation through the angle θ of the xpxq coordinate plane in IRm with all the other coordinates fixed. Obviously, G is orthogonal.

16.25 Algorithm
Let Ax = b be a real overdetermined system with matrix A of full rank. Let a_j be the leftmost column of A that contains a nonzero entry below the main diagonal, a_ij ≠ 0 with some i > j. Consider the matrix A' = GA where G = G_{j,i,θ} is a Givens rotation matrix. One easily checks that
(a) the first j − 1 columns of A' are zero below the main diagonal;
(b) in the j-th column, only the elements a'_jj and a'_ij will be different from the corresponding elements of A, and moreover

a'_ij = −a_jj sin θ + a_ij cos θ

Now we find sin θ and cos θ so that a'_ij = 0. For example, let

cos θ = a_jj / √(a_jj² + a_ij²)     and     sin θ = a_ij / √(a_jj² + a_ij²)

Note that one never actually evaluates the angle θ, since Gj,i,θ only contains cos θ and sin θ.

In this way one zeroes out one nonzero element a_ij below the main diagonal. Working from left to right, one can convert A into an upper triangular matrix G̃A = R where G̃ is a product of Givens rotation matrices. Each nonzero element of A below the main diagonal requires one multiplication by a rotation matrix. Then we get A = QR with an orthogonal matrix Q = G̃^{−1}. This algorithm is generally more expensive than the QR decomposition with reflectors. But it works very efficiently if the matrix A is sparse, i.e. contains just a few nonzero elements below the main diagonal.
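A minimal NumPy sketch of this Givens-rotation triangularization (the helper name givens_qr is mine); np.hypot computes √(a_jj² + a_ij²) in a scaled way, in the spirit of 16.26(ii):

    import numpy as np

    def givens_qr(A):
        # Upper-triangularize A with Givens rotations (16.25); returns Q, R with A = Q R.
        R = np.array(A, dtype=float)
        m, n = R.shape
        Q = np.eye(m)
        for j in range(n):
            for i in range(j + 1, m):
                if R[i, j] == 0.0:
                    continue
                r = np.hypot(R[j, j], R[i, j])      # scaled, avoids overflow/underflow
                c, s = R[j, j] / r, R[i, j] / r
                # Rotate rows j and i of R; this zeroes out R[i, j].
                Rj, Ri = R[j, :].copy(), R[i, :].copy()
                R[j, :] =  c * Rj + s * Ri
                R[i, :] = -s * Rj + c * Ri
                # Accumulate the transpose of the rotation into Q.
                Qj, Qi = Q[:, j].copy(), Q[:, i].copy()
                Q[:, j] =  c * Qj + s * Qi
                Q[:, i] = -s * Qj + c * Qi
        return Q, R

    A = np.array([[3.0, 1.0], [4.0, 2.0], [0.0, 5.0]])
    Q, R = givens_qr(A)
    print(np.allclose(Q @ R, A))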

16.26 Remarks (Numerical Hints)
The most accurate numerical algorithm for solving a generic overdetermined linear system Ax = b is based on the QR decomposition with reflectors; it then proceeds as in 16.22. In the process of computing the reflector matrices, two rules should be followed:
(i) The sign of σ in 16.15(b) must coincide with the sign of y_1 = ⟨y, e_1⟩ every time one uses Theorem 16.14. This helps to avoid catastrophic cancellations when calculating x = y + σe_1 (note that the "skinny" algorithm in 16.23 does not provide this flexibility).
(ii) In the calculation of

‖y‖ = √(y_1² + ··· + y_m²)

in Theorem 16.14, there is a danger of overflow (if one of the y_i's is too large) or underflow (if all the y_i's are so small that y_i² is a machine zero, then so will be ‖y‖). To avoid this, first find y_max = max{|y_1|, ..., |y_m|} and then compute

‖y‖ = y_max · √(|y_1/y_max|² + ··· + |y_m/y_max|²)

The same trick must be used when computing the rotation matrices in 16.25.

16.27 Polynomial Least Squares Fit
Generalizing 16.2, one can fit a set of data points (x_i, y_i), 1 ≤ i ≤ m, by a polynomial y = p(x) = a_0 + a_1x + ··· + a_nx^n with n + 1 ≤ m. The least squares fit is based on minimizing the function

E(a_0, ..., a_n) = Σ_{i=1}^{m} (a_0 + a_1x_i + ··· + a_nx_i^n − y_i)²

This leads to an overdetermined linear system generalizing 16.4:

a_0 + a_1x_i + ··· + a_nx_i^n = y_i,     1 ≤ i ≤ m

16.28 Continuous Least Squares Fit
Instead of fitting a discrete data set (x_i, y_i) one can fit a continuous function y = f(x) on [0, 1] by a polynomial y = p(x) ∈ P_n(IR). The least squares fit is the one minimizing

E(a_0, ..., a_n) = ∫_0^1 |f(x) − p(x)|² dx

The solution of this problem is the orthogonal projection of f(x) onto P_n(IR), recall the homework problem #1 in Assignment 1.
To find the solution, consider the basis {1, x, ..., x^n} in P_n(IR). Then a_0, ..., a_n can be found from the system of normal equations

Σ_{j=0}^{n} a_j ⟨x^j, x^i⟩ = ⟨f, x^i⟩,     0 ≤ i ≤ n

The matrix A of coefficients here is

a_ij = ⟨x^j, x^i⟩ = ∫_0^1 x^{i+j} dx = 1/(1 + i + j)

This is the infamous Hilbert matrix; it is ill-conditioned even for moderately large n, since κ_2(A) ≈ 10^{1.5n}. One can choose an orthonormal basis in P_n(IR) (Legendre polynomials, see 9.12) to improve the condition of the problem.
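A quick numerical illustration of this claim (my own check, using scipy.linalg.hilbert to build the matrix):

    import numpy as np
    from scipy.linalg import hilbert

    for n in (4, 8, 12):
        H = hilbert(n + 1)              # (n+1) x (n+1) matrix with entries 1/(1+i+j)
        print(n, np.linalg.cond(H))     # grows roughly like 10^(1.5 n)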

17 Other Decomposition Theorems

17.1 Theorem (Schur Decomposition) If A ∈ C|| n×n, then there exists a unitary matrix Q such that

Q∗AQ = T where T is upper triangular, i.e. A is unitary equivalent to an upper triangular matrix. Moreover, the matrix Q can be chosen so that the eigenvalues of A appear in any order on the diagonal of T . The columns of the matrix Q are called Schur vectors.

Proof. We use induction on n. The theorem is clearly true for n = 1. Assume that it holds for matrices of order less than n. Let λ be an eigenvalue of A and let x be a unit eigenvector for λ. Let R be a unitary matrix whose first column is x (note: such a matrix exists, because there is an ONB in C^n whose first vector is x, by 9.13, and then R can be constructed by using the vectors of that ONB as columns of R). Note that Re_1 = x, and hence R*x = e_1, since R* = R^{−1}. Hence, we have

R*ARe_1 = R*Ax = λR*x = λe_1

so e_1 is an eigenvector of the matrix R*AR for the eigenvalue λ. Thus,

R*AR = ( λ  w^t )
       ( 0   B  )

with some w ∈ C^{n−1} and B ∈ C^{(n−1)×(n−1)}. By the induction assumption, there is a unitary matrix Q_1 ∈ C^{(n−1)×(n−1)} such that Q_1*BQ_1 = T_1, where T_1 is upper triangular. Let

Q = R ( 1   0  )
      ( 0  Q_1 )

Note: by 10.6, the second factor is a unitary matrix because so is Q_1, and then the product of two unitary matrices is a unitary matrix Q. Next,

Q*AQ = ( 1    0   ) ( λ  w^t ) ( 1   0  )
       ( 0  Q_1*  ) ( 0   B  ) ( 0  Q_1 )

     = ( λ    w^t Q_1   )  =  ( λ  w^t Q_1 )
       ( 0  Q_1* B Q_1  )     ( 0     T_1  )

which is upper triangular, as required. 2
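Numerically, a Schur decomposition can be obtained with scipy.linalg.schur; the snippet below (mine, not from the notes) verifies A = QTQ* on a small real matrix with complex eigenvalues:

    import numpy as np
    from scipy.linalg import schur

    A = np.array([[0.0, -1.0], [1.0, 0.0]])            # rotation matrix, eigenvalues +i and -i
    T, Q = schur(A.astype(complex), output='complex')  # Q unitary, T upper triangular
    print(np.allclose(Q @ T @ Q.conj().T, A))
    print(np.diag(T))                                  # the eigenvalues of A appear on the diagonal of T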

17.2 Definition () A matrix A ∈ C|| n×n is said to be normal if AA∗ = A∗A.

Note: unitary and Hermitean matrices are normal.

47 17.3 Lemma If A is normal and Q unitary, then B = Q∗AQ is also normal (i.e., the class of normal matrices is closed under unitary equivalence).

Proof. Note that B∗ = Q∗A∗Q and BB∗ = Q∗AA∗Q = Q∗A∗AQ = B∗B. 2

17.4 Lemma If A is normal and upper triangular, then A is diagonal.

Proof. We use induction on n. For n = 1 the theorem is trivially true. Assume that it holds for matrices of order less than n. Compute the top left element of the matrix AA∗ = A∗A. On the one hand, it is

Σ_{i=1}^{n} a_{1i} ā_{1i} = Σ_{i=1}^{n} |a_{1i}|²

On the other hand, it is just |a_{11}|². Hence, a_{12} = ··· = a_{1n} = 0, and

A = ( a_11  0 )
    (  0    B )

One can easily check that AA∗ = A∗A implies BB∗ = B∗B. By the induction assump- tion, B is diagonal. 2

Note: Any diagonal matrix is normal.

17.5 Theorem (i) A matrix A ∈ C|| n×n is normal if and only if it is unitary equivalent to a diagonal matrix. In that case the Schur decomposition takes form

Q∗AQ = D

where D is a diagonal matrix. (ii) If A is normal, the columns of Q (Schur vectors) are eigenvectors of A (specifically, the jth column of Q is an eigenvector of A for the eigenvalue in the jth position on the diagonal of D).

Proof. The first claim follows from 17.1–17.4. The second is verified by direct inspec- tion. 2

Note: Theorem 17.5 applies to unitary and Hermitean matrices.

48 17.6 Remark Three classes of complex matrices have the same property, that their matrices are unitary equivalent to a diagonal matrix (i.e., admit an ONB consisting of eigenvectors). The difference between those classes lies in restrictions on eigenvalues: unitary matrices have eigenvalues on the unit circle (|λ| = 1), Hermitean matrices have real eigenvalues (λ ∈ IR), and now normal matrices may have arbitrary complex eigenvalues.

Schur decomposition established in 17.1 is mainly of theoretical interest. As most matri- ces of practical interest have real entries the following variation of the Schur decomposi- tion is of some practical value.

17.7 Theorem (Real Schur Decomposition)
If A ∈ IR^{n×n} then there exists an orthogonal matrix Q such that

Q^tAQ = ( R_11  R_12  ···  R_1m )
        (  0    R_22  ···  R_2m )
        (  :     :    ···    :  )
        (  0     0    ···  R_mm )

where each R_ii is either a 1 × 1 matrix or a 2 × 2 matrix having complex conjugate eigenvalues.

Proof. We use induction on n. If the matrix A has a real eigenvalue, then we can reduce the dimension and use induction just like in the proof of the Schur Theorem 17.1. Assume that A has no real eigenvalues. Since the characteristic polynomial of A has real coefficients, its roots come in conjugate pairs. Let λ_1 = α + iβ and λ_2 = α − iβ be a pair of roots of C_A(x), with β ≠ 0. The root λ_1 is a complex eigenvalue of A, considered as a complex matrix, with a complex eigenvector x + iy, where x, y ∈ IR^n. The equation A(x + iy) = (α + iβ)(x + iy) can be written as

Ax = αx − βy
Ay = βx + αy

or in the matrix form

A [x y] = [x y] (  α  β )
                ( −β  α )

where [x y] is the n × 2 matrix with columns x and y. Observe that x and y must be linearly independent, because if not, one can easily show that β = 0, contrary to the assumption. Let

[x y] = P ( R )
          ( 0 )

be a QR decomposition of the matrix [x y], where R ∈ IR^{2×2} is an upper triangular matrix and P ∈ IR^{n×n} is orthogonal. Note that R is nonsingular, because [x y] has full rank. Denote

P^tAP = ( T_11  T_12 )
        ( T_21  T_22 )

where T_11 is a 2 × 2 matrix, T_12 is a 2 × (n − 2) matrix, T_21 is an (n − 2) × 2 matrix, and T_22 is an (n − 2) × (n − 2) matrix. Now, combining the above equations gives

( T_11  T_12 ) ( R )  =  ( R ) (  α  β )
( T_21  T_22 ) ( 0 )     ( 0 ) ( −β  α )

This implies

T_11 R = R (  α  β )     and     T_21 R = 0
           ( −β  α )

Since R is nonsingular, the latter equation implies T_21 = 0, and the former equation shows that the matrix T_11 is similar to the matrix with rows (α, β) and (−β, α). That last matrix has eigenvalues λ_{1,2} = α ± iβ, hence so does T_11, by similarity, therefore it satisfies the requirements of the theorem. Now we can apply the induction assumption to the (n − 2) × (n − 2) matrix T_22. 2

In the next theorem we consider a matrix A ∈ IR^{m×n}. It defines a linear transformation T_A : IR^n → IR^m. Let B be an ONB in IR^n and B' an ONB in IR^m. Then the transformation T_A is represented in the bases B and B' by the matrix U^tAV, where U and V are orthogonal matrices of size m × m and n × n, respectively. The following theorem shows that one can always find bases B and B' so that the matrix U^tAV will be diagonal.
Note: D ∈ IR^{m×n} is said to be diagonal if D_ij = 0 for i ≠ j. It has exactly p = min{m, n} diagonal elements and can be denoted by D = diag(d_1, ..., d_p).

17.8 Theorem (Singular Value Decomposition (SVD))
Let A ∈ IR^{m×n} have rank r and let p = min{m, n}. Then there are orthogonal matrices U ∈ IR^{m×m} and V ∈ IR^{n×n} and a diagonal matrix D = diag(σ_1, ..., σ_p) such that U^tAV = D. In the diagonal of D, exactly r elements are nonzero. If additionally we require that σ_1 ≥ ··· ≥ σ_r > 0 and σ_{r+1} = ··· = σ_p = 0, then the matrix D is unique.

Proof. The matrix A^t defines a linear transformation IR^m → IR^n. Recall that the matrices A^tA ∈ IR^{n×n} and AA^t ∈ IR^{m×m} are symmetric and positive semi-definite. Note that Ker A = Ker A^tA and Ker A^t = Ker AA^t (indeed, if A^tAx = 0, then 0 = ⟨A^tAx, x⟩ = ⟨Ax, Ax⟩, so Ax = 0; the second identity is proved similarly). Hence,

rank A^tA = rank AA^t = rank A = r

Indeed, rank A^tA = n − dim Ker A^tA = n − dim Ker A = rank A = r, and similarly rank AA^t = rank A^t = rank A = r. Next, by 11.21 and the remark after 12.10, there is an orthogonal matrix V ∈ IR^{n×n} such that A^tA = VSV^t, where S = diag(s_1, ..., s_n) ∈ IR^{n×n} and s_i ≥ 0 are the eigenvalues of A^tA. Note that rank S = rank A^tA = r, so exactly r diagonal elements of S are positive, and we can reorder them (e.g., according to 17.1 and 17.5) so that s_1 ≥ ··· ≥ s_r > 0 and s_{r+1} = ··· = s_n = 0. Let v_1, ..., v_n be the columns of V, which are the eigenvectors of A^tA, according to 17.5, i.e. A^tAv_i = s_iv_i. Define vectors u_1, ..., u_m as follows. Let

u_i = (1/√s_i) Av_i     for 1 ≤ i ≤ r

For r + 1 ≤ i ≤ m, let {u_i} be an arbitrary ONB of the subspace Ker AA^t, whose dimension is m − rank AA^t = m − r. Note that

AA^t u_i = (1/√s_i) A(A^tA)v_i = √s_i Av_i = s_i u_i

for i = 1, ..., r, and AA^t u_i = 0 for i ≥ r + 1. Therefore, u_1, ..., u_m are the eigenvectors of AA^t corresponding to the eigenvalues s_1, ..., s_r, 0, ..., 0. Next, we claim that {u_i} make an ONB in IR^m, i.e. ⟨u_i, u_j⟩ = δ_ij. This is true for i, j ≥ r + 1 by our choice of u_i. It is true for i ≤ r < j and j ≤ r < i because then u_i, u_j are eigenvectors of the symmetric matrix AA^t corresponding to distinct eigenvalues, so they are orthogonal by 11.9 (ii). Lastly, for i, j ≤ r

⟨u_i, u_j⟩ = (1/√(s_is_j)) ⟨Av_i, Av_j⟩ = (1/√(s_is_j)) ⟨A^tAv_i, v_j⟩ = (s_i/√(s_is_j)) ⟨v_i, v_j⟩ = δ_ij

Now let U ∈ IR^{m×m} be the matrix whose columns are u_1, ..., u_m. Then U is orthogonal, and U^tAA^tU = S' ∈ IR^{m×m} where S' = diag(s_1, ..., s_r, 0, ..., 0). Let σ_i = √s_i for 1 ≤ i ≤ p and observe that Av_i = σ_iu_i. Consequently, if D ∈ IR^{m×n} is defined by D = diag(σ_1, ..., σ_p), then AV = UD (the i-th column of the left matrix is Av_i, and that of the right matrix is σ_iu_i). Hence, U^tAV = D as required. It remains to prove the uniqueness of D. Let A = ÛD̂V̂^t be another SVD. Then A^tA = V̂(D̂^tD̂)V̂^t, so the squares of the diagonal elements of D̂ are the eigenvalues of A^tA, hence they are s_1, ..., s_r, 0, ..., 0. Since the diagonal elements of D̂ are nonnegative, they are √s_1, ..., √s_r, 0, ..., 0. 2

51 17.9 Definition (Singular Values and Vectors) The positive numbers σ1, . . . , σr are called the singular values of A. The columns v1, . . . , vn of the matrix V are called the right singular vectors for A, and the columns u1, . . . , um of the matrix U are called the left singular vectors for A.

17.10 Remark
For 1 ≤ i ≤ r we have

Av_i = σ_iu_i     and     A^tu_i = σ_iv_i

We also have

Ker A = span{v_{r+1}, ..., v_n}     and     Ker A^t = span{u_{r+1}, ..., u_m}

and

Im A = span{u_1, ..., u_r}     and     Im A^t = span{v_1, ..., v_r}

Illustration to Remark 17.10: under A, each v_i with 1 ≤ i ≤ r is taken to σ_iu_i, and under A^t each such u_i is taken back to σ_iv_i; the remaining vectors v_{r+1}, ..., v_n (under A) and u_{r+1}, ..., u_m (under A^t) are taken to 0.

17.11 Remark We have the following SVD expansion:

A = Σ_{i=1}^{r} σ_iu_iv_i^t

Proof. It is enough to observe that for every v_j

( Σ_{i=1}^{r} σ_iu_iv_i^t ) v_j = σ_ju_j = Av_j

because v_i^tv_j = δ_ij. 2
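A small NumPy illustration of this expansion (my own sketch; numpy.linalg.svd returns U, the singular values, and V^t):

    import numpy as np

    A = np.array([[2.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [2.0, 0.0, 1.0]])
    U, s, Vt = np.linalg.svd(A)                  # A = U diag(s) V^t, singular values sorted
    r = int(np.sum(s > 1e-12))                   # numerical rank (here r = 2)
    A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
    print(r, np.allclose(A_rebuilt, A))          # the rank-r SVD expansion recovers A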

17.12 Remark
The positive numbers σ_1², ..., σ_r² are all the non-zero eigenvalues of both A^tA and AA^t. This can be used as a practical method to compute the singular values σ_1, ..., σ_r. The multiplicity of σ_i > 0 as a singular value of A equals the (algebraic and geometric) multiplicity of σ_i² as an eigenvalue of both A^tA and AA^t.

Much useful information about the matrix A is revealed by the SVD.

17.13 Remark
Assume that m = n, i.e. A is a square matrix. Then ‖A‖_2 = ‖A^t‖_2 = ‖D‖_2 = σ_1. If A is nonsingular, then similarly we have ‖A^{−1}‖_2 = ‖D^{−1}‖_2 = σ_n^{−1}. Therefore, κ_2(A) = κ_2(D) = σ_1/σ_n.

SVD is helpful in the study of the matrix rank.

17.14 Definition A matrix A ∈ IRm×n is said to have full rank if its rank equals p = min{m, n}. Oth- erwise, A is said to be rank deficient.

Note: If A = UDV t is an SVD, then rank A = rank D is the number of positive singular values. If A has full rank, then all its singular values are positive. Rank deficient matrices have at least one zero singular value.

17.15 Definition The 2-norm of a rectangular matrix A ∈ IRm×n is defined by

kAk2 = max kAxk2 kxk2=1

n m where kxk2 is the 2-norm in IR and kAxk2 is the 2-norm in IR .

t Note: kAk2 = kA k2 = kDk2 = σ1, generalizing Remark 17.13.

m×n The linear space IR with the distance between matrices given by kA−Bk2 is a metric space. Then topological notions, like open sets, dense sets, etc., apply.

17.16 Theorem Full rank matrices make an open and dense subset of IRm×n.

Openness means that for any full rank matrix A there is an ε > 0 such that A + E has full rank whenever kEk2 ≤ ε. Denseness means that if A is rank deficient, then for any ε > 0 there is E, kEk2 ≤ ε, such that A + E has full rank.

53 Proof. To prove denseness, let A be a rank deficient matrix and A = UDV t its SVD. t For ε > 0, put Dε = εI and E = UDεV . Then kEk2 = kDεk2 = ε and

t rank(A + E) = rank(U(D + Dε)V ) = rank(D + Dε) = p

For the openness, see homework assignment (one can use the method of the proof of 17.18 below).

One can see that slight perturbation of a rank deficient matrix (e.g., caused by round- off errors) can produce a full rank matrix. On the other hand, a full rank matrix may be perturbed by round-off errors and become rank deficient, a very undesirable event. To describe such situations, we introduce the following

17.17 Definition (Numerical Rank) The numerical rank of A ∈ IRm×n with tolerance ε is

rank(A, ε) = min rank(A + E) kEk2≤ε

Note: rank(A, ε) ≤ rank A. The numerical rank gives the minimum rank of A under perturbations of norm ≤ ε. If A has full rank p = min{m, n} but rank(A, ε) < p, then A is ‘nearly rank deficient’, which indicates dangerous situation in numerical calculations.

17.18 Theorem The numerical rank of a matrix A, i.e. rank(A, ε), equals the number of singular values of A (counted with multiplicity) that are greater than ε.

Proof. Let A = UDV^t be an SVD and σ_1 ≥ ··· ≥ σ_p the singular values of A. Let k be defined so that σ_k > ε ≥ σ_{k+1}. We show that rank(A, ε) = k. Let ‖E‖_2 ≤ ε and E' = U^tEV. Observe that ‖E'‖_2 = ‖E‖_2 ≤ ε and

rank(A + E) = rank(U(D + E')V^t) = rank(D + E')

Hence,

rank(A, ε) = min_{‖E'‖_2 ≤ ε} rank(D + E')

Now, for any x = Σ_{i=1}^{k} x_ie_i we have

‖(D + E')x‖_2 ≥ ‖Dx‖_2 − ‖E'x‖_2 ≥ (σ_k − ε)‖x‖_2

Hence, (D + E')x ≠ 0 for x ≠ 0, so rank(D + E') ≥ k. On the other hand, setting E' = diag(0, ..., 0, −σ_{k+1}, ..., −σ_p) implies that ‖E'‖_2 ≤ ε and rank(D + E') = k. 2
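Theorem 17.18 gives a practical recipe: count the singular values above the tolerance. A minimal sketch (the helper numerical_rank is mine):

    import numpy as np

    def numerical_rank(A, eps):
        # rank(A, eps) = number of singular values strictly greater than eps (Theorem 17.18).
        s = np.linalg.svd(A, compute_uv=False)
        return int(np.sum(s > eps))

    A = np.diag([1.0, 1e-3, 1e-9])
    print(numerical_rank(A, 1e-6), numerical_rank(A, 1e-12), np.linalg.matrix_rank(A))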

54 17.19 Corollary For every ε > 0, rank(At, ε) = rank(A, ε).

Proof. A and At have the same singular values. 2

17.20 Theorem (Distance to the nearest singular matrix)
Let A ∈ IR^{n×n} be a nonsingular matrix. Then

min { ‖A − A_s‖_2 / ‖A‖_2 : A_s is singular } = 1/κ_2(A)

Proof. Follows from 17.13 and 17.18. 2

17.21 Example
Let A = (3, 4)^t. Then A^tA = (25) is a 1 × 1 matrix; it has the only eigenvalue λ = 25 with eigenvector v_1 = e_1 = (1). Hence, σ_1 = √λ = 5. The matrix

AA^t = (  9  12 )
       ( 12  16 )

has eigenvalues λ_1 = 25 and λ_2 = 0 with corresponding eigenvectors u_1 = (3/5, 4/5)^t and u_2 = (4/5, −3/5)^t. The SVD A = UDV^t takes the form

( 3 ) = ( 3/5   4/5 ) ( 5 ) (1)
( 4 )   ( 4/5  −3/5 ) ( 0 )

Observe that Av_1 = 5u_1 and A^tu_1 = 5v_1, in accordance with the SVD theorem.

18 Eigenvalues and eigenvectors: sensitivity

18.1 Definition (Left Eigenvector) Let A ∈ C|| n×n. A nonzero vector x ∈ C|| n is called a left eigenvector of A corresponding to an eigenvalue λ if x∗A = λx∗

Note that this is equivalent to A∗x = λx¯ , i.e. x being an ordinary (right) eigenvector of A∗ corresponding to the eigenvalue λ¯.

18.2 Lemma A matrix A has a left eigenvector corresponding to λ if and only if λ is an eigenvalue of A (a root of the characteristic polynomial of A).

Proof. x∗A = λx∗ for an x 6= 0 is equivalent to (A∗ − λI¯ )x = 0, which means that ∗ ¯ det(A − λI) = 0, or equivalently, det(A − λI) = 0, i.e. CA(λ) = 0. 2

This explains why we do not introduce a notion of a left eigenvalue: the eigenvalues for left eigenvectors and those for ordinary (right) eigenvectors are the same.

18.3 Lemma For any eigenvalue λ of A the dimension of the ordinary (right) eigenspace equals the dimension of the left eigenspace (i.e., the geometric multiplicity of λ is the same).

Proof. dim Ker(A − λI) = n − rank(A − λI) = n − rank(A* − λ̄I) = dim Ker(A* − λ̄I). 2

18.4 Definition (Rayleigh Quotient)
Let A ∈ C^{n×n} and x ∈ C^n a nonzero vector. We call

x*Ax / x*x = ⟨Ax, x⟩ / ⟨x, x⟩

the Rayleigh quotient for A. One can consider it as a function of x.

Note: The Rayleigh quotient takes the same value for all scalar multiples of a given vector x, i.e. it is constant on the line spanned by x (with the zero vector removed). Recall that any nonzero vector is a scalar multiple of a unit vector. Hence, Rayleigh quotient as a function of x is completely defined by its values on the unit vectors (on the unit sphere). For unit vectors x, the Rayleigh quotient can be simply defined by

x∗Ax = hAx, xi

56 18.5 Theorem Let A ∈ C|| n×n and x a unit vector in C|| n. Then

∗ kAx − (x Ax)xk2 = min kAx − µxk2 µ∈C

That is, the vector (x∗Ax)x is the orthogonal projection of the vector Ax on the line spanned by x.

Proof. From the homework problem #1 in the first assignment we know that

min kAx − µxk2 µ∈C is attained when Ax − µx is orthogonal to x. This gives the value µ = x∗Ax. Alter- natively, consider an overdetermined (n × 1) system xµ = Ax where µ is (the only) unknown. Then the result follows from Theorem 16.9. 2

Note: If x is a unit eigenvector for an eigenvalue λ, then x∗Ax = λ. Suppose that x is not an eigenvector. If one wants to regard x as an eigenvector, then the Rayleigh quotient for x is the best choice that one could make for the associated eigenvalue in the sense that this value of µ comes closest (in the 2-norm) to achieving Ax − µx = 0. So, one could regard x∗Ax as a “quasi-eigenvalue” for x.

18.6 Theorem Let A ∈ C|| n×n and x a unit eigenvector of A corresponding to eigenvalue λ. Let y be another unit vector and ρ = y∗Ay. Then

|λ − ρ| ≤ 2 kAk2 kx − yk2

Moreover, if A is a Hermitean matrix, then there is a constant C = C(A) > 0 such that

2 |λ − ρ| ≤ C kx − yk2

Proof. To prove the first part, put

λ − ρ = x∗A(x − y) + (x − y)∗Ay and then using the triangle inequality and Cauchy-Schwarz inequality gives the result. Now, assume that A is Hermitean. Then there is an ONB of eigenvectors, and we can assume that x is one of them. Denote that ONB by {x, x2, . . . , xn} and the corresponding eigenvalues by λ, λ2, . . . , λn. Let y = cx + c2x2 + ··· + cnxn. Then

‖y − x‖² = |c − 1|² + Σ_{i=2}^{n} |c_i|² ≥ Σ_{i=2}^{n} |c_i|²

On the other hand, ‖y‖ = 1, so

λ = λ|c|² + Σ_{i=2}^{n} λ|c_i|²

Now, Ay = cλx + Σ_{i=2}^{n} c_iλ_ix_i, so

ρ = ⟨Ay, y⟩ = λ|c|² + Σ_{i=2}^{n} λ_i|c_i|²

Therefore,

λ − ρ = Σ_{i=2}^{n} (λ − λ_i)|c_i|²

The result now follows with

C = max_{2≤i≤n} |λ − λ_i|

The theorem is proved. 2

Note: The Rayleigh quotient function x∗Ax is obviously continuous on the unit sphere, because it is a polynomial of the coordinates of x. Since at every eigenvector x the value of this function equals the eigenvalue λ for x, then for y close to x its values are close to λ. Suppose that ky − xk = ε. The first part of the theorem gives a specific estimate on how close y∗Ay to λ is, the difference between them is O(ε). The second part says that for Hermitean matrices y∗Ay is very close to λ, the difference is now O(ε2).

Note: If A is Hermitean, then x∗Ax is a real number for any vector x ∈ C|| n, i.e. the Rayleigh quotient is a real valued function.

18.7 Theorem n×n Let A ∈ C|| be Hermitean with eigenvalues λ1 ≤ · · · ≤ λn. Then for any kxk = 1

∗ λ1 ≤ x Ax ≤ λn

Proof. Let {x1, . . . , xn} be an ONB consisting of eigenvectors of A. Let x = c1x1 + ∗ 2 2 ··· + cnxn. Then x Ax = λ1|c1| + ··· + λn|cn| . The result now follows easily. 2

18.8 Lemma Let L and G be subspaces of C|| n and dim G > dim L. Then there is a nonzero vector in G orthogonal to L.

Proof. By way of contradiction, if G ∩ L⊥ = {0}, then G ⊕ L⊥ is a subspace of C|| n with dimension dim G + n− dim L, which is > n, which is impossible. 2

18.9 Theorem (Courant-Fischer Minimax Theorem)
Let A ∈ C^{n×n} be Hermitean with eigenvalues λ_1 ≤ ··· ≤ λ_n. Then for every i = 1, ..., n

λ_i = min_{dim L = i}  max_{x ∈ L\{0}}  x*Ax / x*x

where L stands for a subspace of C^n.

Proof. Let {u_1, ..., u_n} be an ONB of eigenvectors of A corresponding to the eigenvalues λ_1, ..., λ_n. If dim L = i, then by Lemma 18.8 there is a nonzero vector x ∈ L orthogonal to the subspace span{u_1, ..., u_{i−1}}. Hence, the first i − 1 coordinates of x (in this ONB) are zero and x = Σ_{j=i}^{n} c_ju_j, so

x*Ax / x*x = ( Σ_{j=i}^{n} |c_j|²λ_j ) / ( Σ_{j=i}^{n} |c_j|² ) ≥ λ_i

Therefore,

max_{x ∈ L\{0}} x*Ax / x*x ≥ λ_i

Now, take the subspace L = span{u_1, ..., u_i}. Obviously, dim L = i, and for every nonzero vector x ∈ L we have x = Σ_{j=1}^{i} c_ju_j, so

x*Ax / x*x = ( Σ_{j=1}^{i} |c_j|²λ_j ) / ( Σ_{j=1}^{i} |c_j|² ) ≤ λ_i

The theorem is proved. 2

18.10 Theorem Let A and ∆A be Hermitean matrices. Let α1 ≤ · · · ≤ αn be the eigenvalues of A, δmin and δmax the smallest and the largest eigenvalues of ∆A. Denote the eigenvalues of the matrix B = A + ∆A by β1 ≤ · · · ≤ βn. Then for each i = 1, . . . , n

αi + δmin ≤ βi ≤ αi + δmax

Proof. Let {u_1, ..., u_n} be an ONB consisting of eigenvectors of A corresponding to the eigenvalues α_1, ..., α_n. Let L = span{u_1, ..., u_i}. Then, by 18.9,

β_i ≤ max_{x ∈ L\{0}} x*Bx / x*x
    ≤ max_{x ∈ L\{0}} x*Ax / x*x + max_{x ∈ L\{0}} x*∆Ax / x*x
    ≤ α_i + max_{x ∈ C^n\{0}} x*∆Ax / x*x
    = α_i + δ_max

which is the right inequality. Now apply this inequality to the matrices B, −∆A and A = B + (−∆A). It then reads α_i ≤ β_i − δ_min (note that the largest eigenvalue of −∆A is −δ_min), which is exactly the left inequality. The theorem is completely proved. 2

Theorem 18.10 shows that if we perturb a Hermitean matrix A by a small Hermitean matrice ∆A, the eigenvalues of A will not change much. This is clearly related to nu- merical calculations of eigenvalues.

18.11 Remark. A similar situation occurs when one knows an approximate eigenvalue λ and an approximate eigenvector x of a matrix A. We can assume that kxk = 1. To estimate the closeness of λ to the actual but unknown eigenvalue of A, one can compute the residual r = Ax − λx. Assume that r is small and define the matrix ∆A = −rx∗. Then k∆Ak2 = krk2 (see the homework assignment) and

(A + ∆A)x = Ax − rx∗x = λx

Therefore, (λ, x) are an exact eigenpair of a perturbed matrix A + ∆A, and the norm k∆Ak is known. One could then apply Theorem 18.10 to estimate the closeness of λ to the actual eigenvalue of A, if the matrices A and ∆A were Hermitean. Since this is not always the case, we need to study how eigenvalues of a generic matrix change under small perturbations of the matrix. This is the issue of eigenvalue sensitivity.

18.12 Theorem (Bauer-Fike) Let A ∈ C|| n×n and suppose that

Q^{−1}AQ = D = diag(λ_1, ..., λ_n)

If µ is an eigenvalue of a perturbed matrix A + ∆A, then

min_{1≤i≤n} |λ_i − µ| ≤ κ_p(Q) ‖∆A‖_p

where ‖ · ‖_p stands for any p-norm.

Proof. If µ is an eigenvalue of A, the claim is trivial. If not, the matrix D − µI is invertible. Observe that

Q−1(A + ∆A − µI)Q = D + Q−1∆AQ − µI = (D − µI)(I + (D − µI)−1(Q−1∆AQ))

Since the matrix A + ∆A − µI is singular, so is the matrix I + (D − µI)−1(Q−1∆AQ). Then the Neumann lemma implies

1 ≤ ‖(D − µI)^{−1}(Q^{−1}∆AQ)‖_p ≤ ‖(D − µI)^{−1}‖_p ‖Q^{−1}‖_p ‖∆A‖_p ‖Q‖_p

Lastly, observe that (D − µI)^{−1} is diagonal, so

‖(D − µI)^{−1}‖_p = max_{1≤i≤n} 1/|λ_i − µ| = 1 / min_{1≤i≤n} |λ_i − µ|

The theorem now follows. 2

This theorem answers the question raised in 18.11, it gives an estimate on the error in the eigenvalue in terms of k∆Ak and κ(Q). However, this answer is not good enough – it gives one estimate for all eigenvalues. In practice, some eigenvalues can be estimated much better than others. It is important then to develop finer estimates for individual eigenvalues.

18.13 Lemma Let A ∈ C|| n×n. (i) If λ is an eigenvalue with a right eigenvector x, and µ 6= λ is another eigenvalue with a left eigenvector y, then y∗x = 0. (ii) If λ is a simple eigenvalue (of algebraic multiplicity one) with right and left eigenvec- tors x and y, respectively, then y∗x 6= 0.

Proof. To prove (i), observe that hAx, yi = λhx, yi and, by a remark after 18.1, hx, A∗yi = hx, µy¯ i = µhx, yi. Hence, λhx, yi = µhx, yi, which proves (i), since λ 6= µ. To prove (ii), assume that kxk = 1. By the Schur decomposition theorem, there is a unitary matrix R with first column x such that

R*AR = ( λ  h* )
       ( 0  C  )

with some h ∈ C^{n−1} and C ∈ C^{(n−1)×(n−1)}. Note also that Re_1 = x. Since λ is a simple eigenvalue of A, it is not an eigenvalue of C, so the matrix λI − C is invertible, and so is λ̄I − C*. Let z = (λ̄I − C*)^{−1}h. Then λ̄z − C*z = h, hence

h* + z*C = λz*

One can immediately verify that

(1, z∗)R∗AR = λ(1, z∗)

Put w∗ = (1, z∗)R∗. The above equation can be rewritten as

w∗A = λw∗

Hence w is a left eigenvector of A. By the simplicity of λ, the vector w is a nonzero multiple of y. However, observe that

w*x = (1, z*)R*Re_1 = 1

which proves the lemma. 2

18.14 Theorem
Let A ∈ C^{n×n} have a simple eigenvalue λ with right and left unit eigenvectors x and y, respectively. Let E ∈ C^{n×n} be such that ‖E‖_2 = 1. For small ε, denote by λ(ε) and x(ε), y(ε) the eigenvalue and unit eigenvectors of the matrix A + εE obtained from λ and x, y. Then

|λ'(0)| ≤ 1/|y*x|

Proof. Write the equation

(A + εE) x(ε) = λ(ε) x(ε)

differentiate it in ε, set ε = 0, and get

Ax'(0) + Ex = λ'(0)x + λx'(0)

Then multiply this equation through on the left by the vector y*, use the fact that y*A = λy*, and get

y*Ex = λ'(0) y*x

Now the result follows since |y*Ex| = |⟨y, Ex⟩| ≤ ‖y‖ ‖Ex‖ ≤ 1. Note that y*x ≠ 0 by 18.13. 2

Note: The matrix A + εE is the perturbation of A in the direction of E. If the perturbation matrix E is known, one has exactly

λ'(0) = y*Ex / y*x

and so

λ(ε) = λ + ε · y*Ex / y*x + O(ε²)

by Taylor expansion, a fairly precise estimate on λ(ε). In practice, however, the matrix E is absolutely unknown, so one has to use the bound in 18.14 to estimate the sensitivity of λ to small perturbations of A.

18.15 Definition (Condition Number of An Eigenvalue)
Let λ be a simple eigenvalue (of algebraic multiplicity one) of a matrix A ∈ C^{n×n} and x, y the corresponding right and left unit eigenvectors. The condition number of λ is

K(λ) = 1/|y*x|

The condition number K(λ) describes the sensitivity of a (simple) eigenvalue to small perturbations of the matrix. Large K(λ) signifies an ill-conditioned eigenvalue.

18.16 Simple properties of K(λ) (a) Obviously, K(λ) ≥ 1. (b) If a matrix A is normal, then K(λ) = 1 for all its eigenvalues. (c) Conversely, if a matrix A has all simple eigenvalues with K(λ) = 1, then it is normal.

Normal matrices are characterized by the fact that the Shur decomposition Q∗AQ = T results in a diagonal matrix T . One can expect that if the matrix T is nearly diagonal (its off-diagonal elements are small), then the eigenvalues of A are well-conditioned. On the contrary, if some off-diagonal elements of T are substantial, then at least some eigen- values of A are ill-conditioned.

18.17 Remark
It remains to discuss the case of multiple eigenvalues (of algebraic multiplicity ≥ 2). If λ is a multiple eigenvalue, the left and right eigenvectors may be orthogonal even if the geometric multiplicity of λ equals one. Example: for

A = ( 0  1 )
    ( 0  0 )

the right and left eigenvectors are x = e_1 and y = e_2, respectively. Moreover, if the geometric multiplicity is ≥ 2, then for any right eigenvector x there is a left eigenvector y such that y*x = 0. Hence, Definition 18.15 gives an infinite value of K(λ). This does not necessarily mean that a multiple eigenvalue is always ill-conditioned. It does mean, however, that an ill-conditioned simple eigenvalue is 'nearly multiple'. Precisely, if λ is a simple eigenvalue of A with K(λ) > 1, then there is a matrix E such that

‖E‖_2 / ‖A‖_2 ≤ 1 / √(K(λ)² − 1)

and λ is a multiple eigenvalue of A + E. We leave out the proof. We will not further discuss the sensitivity of multiple eigenvalues.

18.18 Theorem (Gershgorin)
Let A ∈ C^{n×n} be 'almost diagonal'. Precisely, let A = D + E, where D = diag(d_1, ..., d_n) and E = (e_ij) is small. Then every eigenvalue of A lies in at least one of the circular disks

D_i = { z : |z − d_i| ≤ Σ_{j=1}^{n} |e_ij| }

Note: Di are called Gershgorin disks.

Proof. Let λ be an eigenvalue of A with eigenvector x. Let |xr| = maxi{|x1|,..., |xn|} be the maximal (in absolute value) component of x. We can normalize x so that xr = 1

63 and |xi| ≤ 1 for all i. On equating the r-th components in Ax = λx we obtain

(Ax)_r = d_rx_r + Σ_{j=1}^{n} e_rjx_j = d_r + Σ_{j=1}^{n} e_rjx_j = λx_r = λ

Hence

|λ − d_r| ≤ Σ_{j=1}^{n} |e_rj| |x_j| ≤ Σ_{j=1}^{n} |e_rj|

The theorem is proved. 2

18.19 Theorem If k of the Gershgorin disks Di form a connected region which is isolated from the other disks, then there are precisely k eigenvalues of A (counting multiplicity) in this connected region.

We call connected components of the union ∪Di clusters.

Proof. For brevity, denote

h_i = Σ_{j=1}^{n} |e_ij|

Consider the family of matrices A(s) = D + sE for 0 ≤ s ≤ 1. Clearly, the eigenvalues of A(s) depend continuously on s. The Gershgorin disks D_i(s) for the matrix A(s) are centered at d_i and have radii sh_i. As s increases, the disks D_i(s) grow concentrically, until they reach the size of the Gershgorin disks D_i of Theorem 18.18 at s = 1. When s = 0, each Gershgorin disk D_i(0) is just a point, d_i, which is an eigenvalue of the matrix A(0) = D. So, if d_i is an eigenvalue of multiplicity m ≥ 1, then exactly m disks D_j(0) coincide with the point d_i, which makes a cluster of m disks containing m eigenvalues (more precisely, one eigenvalue of multiplicity m). As the disks grow with s, the eigenvalues cannot jump from one cluster to another (by continuity), unless two clusters overlap and then make one cluster. When two clusters overlap (merge) at some s, they stay in one cluster for all larger values of s, including s = 1. This proves the theorem. 2

Note: If the Gershgorin disks are disjoint, then each contains exactly one eigenvalue of A.
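A small NumPy sketch (mine) that computes the Gershgorin disks of 18.18 and checks that every eigenvalue indeed lies in their union:

    import numpy as np

    def gershgorin_disks(A):
        # Returns (center, radius) pairs d_i, sum_j |e_ij| for the splitting A = D + E.
        d = np.diag(A)
        E = A - np.diag(d)
        radii = np.sum(np.abs(E), axis=1)
        return list(zip(d, radii))

    A = np.array([[4.0, 0.1, 0.2],
                  [0.1, 2.0, 0.1],
                  [0.0, 0.1, 1.0]])
    disks = gershgorin_disks(A)
    for lam in np.linalg.eigvals(A):
        print(lam, any(abs(lam - c) <= r for c, r in disks))   # each eigenvalue is in some disk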

19 Eigenvalues and eigenvectors: computation

19.1 Preface Eigenvalues of a matrix A ∈ C|| n×n are the roots of its characteristic polynomial, CA(x). It is a consequence of the famous Galois group theory (Abel’s theorem) that there is no finite algorithm for calculation of the roots of a generic polynomial of degree > 4. Therefore, all the methods of computing eigenvalues of matrices larger than 4 × 4 are necessarily iterative, they only provide successive approximations to the eigenvalues, but never exact formulas. For this reason, iterative methods may not look very nice as compared to other exact methods covered in this course. Some of them may not even look reasonable or logically consistent at first, but we need to understand that they have been obtained by numerical analysts after years and years of experimentation. And these methods can work surprisingly well, in fact even better than some finite and “exact” algorithms. If an eigenvalue λ of a matrix A is known, an eigenvector x can be found by solving the linear system (A − λI)x = 0, which can be done by a finite algorithm (say, LU de- composition). But since the eigenvalues can only be obtained approximately, by iterative procedures, the same goes for eigenvectors. It is then reasonable to define a procedure that gives approximations for both eigenvalues and eigenvectors, in parallel. Also, know- ing an approximate eigenvector x one can approximate the corresponding eigenvalue λ by the Rayleigh quotient x∗Ax/x∗x. To simplify the matter, we always assume that the matrix A is diagonalizable, i.e. it has a complete set of eigenvectors x1, . . . , xn with eigenvalues λ1, . . . , λn, which are ordered in absolute value: |λ1| ≥ |λ2| ≥ · · · ≥ |λn|

19.2 Definition (Dominant Eigenvalue/Eigenvector) Assume that |λ1| > |λ2|, i.e. largest eigenvalue is simple. We call λ1 the dominant eigenvalue and x1 a dominant eigenvector.

19.3 Power method: the idea Let λ1 be the dominant eigenvalue of A and

q = c1x1 + ··· + cnxn

an arbitrary vector such that c1 6= 0. Then

A^kq = c_1λ_1^kx_1 + ··· + c_nλ_n^kx_n
     = λ_1^k [ c_1x_1 + c_2(λ_2/λ_1)^kx_2 + ··· + c_n(λ_n/λ_1)^kx_n ]

Denote

q^(k) = A^kq/λ_1^k = c_1x_1 + ∆_k,   where   ∆_k = c_2(λ_2/λ_1)^kx_2 + ··· + c_n(λ_n/λ_1)^kx_n

65 19.4 Lemma (k) The vector q converges to c1x1. Moreover,

(k) k k∆kk = kq − c1x1k ≤ const · r where r = |λ2/λ1| < 1.

Therefore, the vectors Akq (obtained by the powers of A) will align in the direction of the dominant eigenvector x1 as k → ∞. The number r characterizes the speed of alignment, i.e. the speed of convergence k∆kk → 0. Note that if c2 6= 0, then

(k+1) (k) kq − c1x1k/kq − c1x1k → r

The number r is called the convergence ratio or the contraction number.

19.5 Definition (Linear/Quadratic Convergence)
We say that the convergence a_k → a is linear if

|a_{k+1} − a| ≤ r|a_k − a|

for some r < 1 and all sufficiently large k. If

|a_{k+1} − a| ≤ C|a_k − a|²

with some C > 0, then the convergence is said to be quadratic. It is much faster than linear. If

|a_{k+1} − a| ≤ C|a_k − a|³

with some C > 0, then the convergence is said to be cubic. It is even faster than quadratic.

19.6 Remarks on Convergence k If the convergence is linear, then by induction we have |ak − a| ≤ Cr , where C = k |a0 − a|. The sequence Cr decreases to zero exponentially fast, which is very fast by calculus standards. However, in numerical algebra standards are different. The linear convergence practically means that each iteration adds a fixed number of accurate digits to the result. For example, if r = 0.1, then each iteration adds one correct decimal digit. This is not superb, since it will take 6–7 iterations to reach the maximum accuracy in single precision arithmetic and 15–16 iterations in double precision. If r = 0.5, then each iteration adds one binary digit (a bit), hence one needs 22–23 iterations in single precision and 52–53 in double precision. Now imagine how long it might take when r = 0.9 or r = 0.99. On the other hand, the quadratic convergence means that each iteration doubles (!) the number of digits of accuracy. Starting with just one accurate binary digit, one needs

66 4–5 iterations in single precision arithmetic and only one more iteration in double preci- sion. The cubic convergence means that each iteration triples (!!!) the number of digits of accuracy. See Example 19.16 for an illustration.

19.7 Remark on Power Method
In practice, the vector q^(k) = A^kq/λ_1^k is inaccessible because we do not know λ_1 in advance. On the other hand, it is impractical to work with A^kq, because ‖A^kq‖ → ∞ if |λ_1| > 1 and ‖A^kq‖ → 0 if |λ_1| < 1. In order to avoid the danger of overflow or underflow, we must somehow normalize, or scale, the vector A^kq.

19.8 Power Method: Algorithm Pick an initial vector q0. For k ≥ 1, define

qk = Aqk−1/σk where σk is a properly chosen scaling factor. A common choice is σk = kAqk−1k, so that kqkk = 1. Then one can approximate the eigenvalue λ1 by the Rayleigh quotient

λ_1^(k) = q_k*Aq_k

Note that

q_k = A^kq_0/(σ_1 ··· σ_k) = A^kq_0/‖A^kq_0‖

To estimate how close the unit vector q_k is to the one-dimensional eigenspace span{x_1}, denote by p_k the orthogonal projection of q_k on span{x_1} and by d_k = q_k − p_k the orthogonal component. Then ‖d_k‖ measures the distance from q_k to span{x_1}.
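A minimal NumPy sketch of this algorithm with σ_k = ‖Aq_{k−1}‖ and the Rayleigh quotient estimate (the helper power_method is mine):

    import numpy as np

    def power_method(A, q0, iters=50):
        # Power method (19.8): q_k = A q_{k-1} / ||A q_{k-1}||; the eigenvalue is
        # estimated by the Rayleigh quotient q_k^* A q_k.
        q = q0 / np.linalg.norm(q0)
        for _ in range(iters):
            z = A @ q
            q = z / np.linalg.norm(z)
        lam = q @ (A @ q)
        return lam, q

    A = np.array([[3.0, 2.0], [1.0, 1.0]])        # the matrix of Example 19.11(a)
    lam, q = power_method(A, np.array([1.0, 1.0]))
    print(lam)                                     # approaches 2 + sqrt(3) = 3.732...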

19.9 Theorem (Convergence of Power Method)
Assume that λ_1 is the dominant eigenvalue, and q_0 = Σ c_ix_i is chosen so that c_1 ≠ 0. Then the distance from q_k to the eigenspace span{x_1} converges to zero and λ_1^(k) converges to λ_1. Furthermore,

‖d_k‖ ≤ const · r^k     and     |λ_1^(k) − λ_1| ≤ const · r^k

Note: The sequence of unit vectors qk need not have a limit, see examples.

k k Proof. It is a direct calculation, based on the representation A q0 = λ1(c1x1 + ∆k) of 19.3 and Lemma 19.4. 2

19.10 Remark
Another popular choice for σ_k in 19.8 is the largest (in absolute value) component of the vector Aq_{k−1}. This ensures that ‖q_k‖_∞ = 1. Assume that the dominant eigenvector x_1 has exactly one component with the largest absolute value. In that case Theorem 19.9 applies, and moreover, σ_k → λ_1 so that

|σ_k − λ_1| ≤ const · r^k

Furthermore, the vector qk will now converge to a dominant eigenvector. The largest component of that dominant eigenvector will be equal to one, and this condition specifies it uniquely.

19.11 Examples
(a) Let

A = ( 3  2 )
    ( 1  1 )

Pick q_0 = (1, 1)^t and choose σ_k as in 19.10. Then σ_1 = 5 and q_1 = (1, 0.4)^t, σ_2 = 3.8 and q_2 = (1, 0.368)^t, σ_3 = 3.736, etc. Here σ_k converges to the dominant eigenvalue λ_1 = 2 + √3 = 3.732... and q_k converges to a dominant eigenvector (1, √3/2 − 1/2)^t = (1, 0.366)^t.
(b) Let

A = ( −1  0 )
    (  0  0 )

Pick q_0 = (1, 1)^t and choose σ_k as in 19.8. Then q_k = ((−1)^k, 0)^t does not have a limit; it oscillates. With σ_k chosen as in 19.10, we have q_k = (1, 0)^t and σ_k = −1 = λ_1 for all k ≥ 1.

19.12 Remark The choice of the initial vector q0 only has to fulfill the requirement c1 6= 0. Since n the vectors with c1 = 0 form a hyperplane in C|| , one hopes that a vector q0 picked “at random” will not lie in that hyperplane “with probability one”. Furthermore, even if c1 = 0, round-off errors will most likely pull the numerical vectors qk away from that hyperplane. If that does not seem to be enough, one can carry out the power method for n different initial vectors that make a basis, say e1, . . . , en. One of these vectors surely lies away from that hyperplane.

19.13 Inverse Power Method
Assume that A is invertible. Then λ_1^{−1}, ..., λ_n^{−1} are the eigenvalues of A^{−1}, with the same eigenvectors x_1, ..., x_n. Note that |λ_1^{−1}| ≤ ··· ≤ |λ_n^{−1}|.
Assume that |λ_n^{−1}| > |λ_{n−1}^{−1}|. Then λ_n^{−1} is the dominant eigenvalue of A^{−1} and x_n a dominant eigenvector. One can apply the power method to A^{−1} and find λ_n^{−1} and x_n. The rate of convergence of iterations will be characterized by the ratio r = |λ_n/λ_{n−1}|. This is called the inverse power method.

Now we know how to compute the largest and the smallest eigenvalues. The following trick allows us to compute any simple eigenvalue.

19.14 Inverse Power Method with Shift. Recall that if λ is an eigenvalue of A with eigenvector x, then λ − ρ is an eigenvalue of A − ρI with the same eigenvector x.

Assume that ρ is a good approximation to a simple eigenvalue λ_i of A, so that |λ_i − ρ| < |λ_j − ρ| for all j ≠ i. Then the matrix A − ρI will have the smallest eigenvalue λ_i − ρ with the eigenvector x_i. The inverse power method can now be applied to A − ρI to find λ_i − ρ and x_i. The convergence of iterations will be linear with ratio

r = |λ_i − ρ| / min_{j≠i} |λ_j − ρ|

Hence, the better ρ approximates λi, the faster convergence is guaranteed. By subtracting ρ from all the eigenvalues of A we shift the entire spectrum of A by ρ. The number ρ is called the shift. The above algorithm for computing λi and xi is called the inverse power method with shift. The inverse power method with shift allows us to compute all the simple eigenvalues and eigenvectors of a matrix, but the convergence is slow (just linear).

19.15 Rayleigh Quotient Iterations with Shift This is an improvement of the algorithm 19.14. Since at each iteration of the inverse power method we obtain a better approximation to the eigenvalue λi, we can use it as the shift ρ for the next iteration. So, the shift ρ will be updated at each iteration. This will ensure a faster convergence. The algorithm goes as follows: one chooses an initial vector q0 and an initial approximation ρ0, and for k ≥ 1 computes

q_k = (A − ρ_{k−1}I)^{−1}q_{k−1} / σ_k

and then updates the shift by the Rayleigh quotient of q_k,

ρ_k = q_k*Aq_k / q_k*q_k

where σ_k is a convenient scaling factor, for example, σ_k = ‖(A − ρ_{k−1}I)^{−1}q_{k−1}‖.
Note: in practice, there is no need to compute the inverse matrix (A − ρ_{k−1}I)^{−1} explicitly. To find (A − ρ_{k−1}I)^{−1}b for any given vector b, one can just solve the system (A − ρ_{k−1}I)x = b by the LU decomposition.
The convergence of the Rayleigh quotient iterations is, generally, quadratic (better than linear). If the matrix A is Hermitean, the convergence is even faster – it is cubic.
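A minimal NumPy sketch of these iterations (the helper name is mine); as in the note above, the shifted system is solved rather than inverting A − ρI, here simply with numpy.linalg.solve:

    import numpy as np

    def rayleigh_quotient_iteration(A, q0, rho0, iters=5):
        # Inverse iteration with the Rayleigh quotient used as the shift at every step (19.15).
        q = q0 / np.linalg.norm(q0)
        rho = rho0
        for _ in range(iters):
            z = np.linalg.solve(A - rho * np.eye(len(q)), q)   # never form the inverse explicitly
            q = z / np.linalg.norm(z)
            rho = q @ (A @ q)                                   # Rayleigh quotient of the new q
        return rho, q

    A = np.array([[2.0, 1.0, 1.0],
                  [1.0, 3.0, 1.0],
                  [1.0, 1.0, 4.0]])                             # the matrix of Example 19.16
    rho, q = rayleigh_quotient_iteration(A, np.ones(3), 5.0, iters=3)
    print(rho)                                                  # about 5.2143197...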

19.16 Example Consider the symmetric matrix

 2 1 1    A =  1 3 1  1 1 4

69 √ t and let q0 = (1, 1, 1) / 3 be the initial vector. When Rayleigh quotient iteration is applied to A, the following values ρk are computed by the first three iterations:

ρ0 = 5, ρ1 = 5.2131, ρ2 = 5.214319743184

The actual value is λ = 5.214319743377. After only three iterations, Rayleigh quotient method produced 10 accurate digits. The next iteration would bring about 30 accurate digits – more than enough in double precision. Two more iterations would increase this figure to about 270 digits, if our machine precision were high enough.

19.17 Power Method: Pros and Cons a) The power method and its variations described above are classic. They are very simple and generally good. b) An obvious concern is the of the method. The matrices used in the inverse power method with shift tend to be exceedingly ill-conditioned. As a result, the numerical solution of the system (A − ρI)x = b, call it xc, will deviate significantly from its exact solution x. However, for some peculiar reason (we do not elaborate) the difference xc − x tends to align with the vector x. Therefore, the normalized vectors xc/kxck and x/kxk are close to each other. Hence, ill conditioning of the matrices does not cause a serious problem here. c) On the other hand, the power method is slow. Even its fastest version, the Rayleigh quotient iterations with shift 19.15, is not so fast, because it requires solving a linear system of equations (A − ρk−1I)x = b with a different matrix at each iteration for each eigenvalue (a quite expensive procedure). d) Lastly, when the matrix A is real, then in order to compute its complex eigenvalues one has to deal with complex matrices (A − ρk−1I), which is inconvenient and expensive. It would be nice to stick to real matrices as much as possible and obtain pairs of complex conjugate eigenvalues only at final steps. Some algorithms provide such a luxury, see for example 19.27 below.

In recent decades, the most widely used algorithm for calculating the complete set of eigenvalues and eigenvectors of a matrix has been the QR algorithm.

19.18 The QR Algorithm
Let A ∈ C|| n×n be nonsingular. The algorithm starts with A_0 = A and generates a sequence of matrices A_k defined as follows:

A_{k−1} = Q_k R_k,     A_k = R_k Q_k

That is, a QR factorization of A_{k−1} is computed and then its factors are recombined in the reverse order to produce A_k. One iteration of the QR algorithm is called a QR step.
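For concreteness, here is a minimal NumPy sketch of the unshifted QR algorithm just described; the function name, the fixed number of steps, and the use of numpy.linalg.qr are our choices for illustration.

```python
import numpy as np

def qr_algorithm(A, n_steps=100):
    """Unshifted QR algorithm (19.18): repeat A_{k-1} = Q_k R_k, then A_k = R_k Q_k."""
    Ak = np.array(A, dtype=float)
    for _ in range(n_steps):
        Q, R = np.linalg.qr(Ak)   # QR factorization of A_{k-1}
        Ak = R @ Q                # recombine the factors in reverse order
    return Ak

A = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 4.0]])
print(np.round(qr_algorithm(A), 6))
# the diagonal of the result approximates the eigenvalues of A
```

For the symmetric matrix of Example 19.16 the iterates approach a diagonal matrix, in line with Remark 19.21(a) below.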

19.19 Lemma

All the matrices A_k in the QR algorithm are unitarily equivalent; in particular, they have the same eigenvalues.

Proof. Indeed, A_k = Q_k^* A_{k−1} Q_k. 2

Note: if A is a Hermitean matrix, then all the A_k's are Hermitean matrices as well, since by induction A_k^* = Q_k^* A_{k−1}^* Q_k = Q_k^* A_{k−1} Q_k = A_k.

19.20 Theorem (Convergence of the QR Algorithm) Let λ1, . . . , λn be the eigenvalues of A satisfying

|λ1| > |λ2| > ··· > |λn| > 0

Under one technical assumption (see below), the matrix A_k = (a_{ij}^{(k)}) will converge to an upper triangular form, so that
(a) a_{ij}^{(k)} → 0 as k → ∞ for all i > j;
(b) a_{ii}^{(k)} → λ_i as k → ∞ for all i.
The convergence is linear, with ratio max_i |λ_{i+1}/λ_i|.

This theorem is given without proof. The technical assumption here is that the matrix Y, whose i-th row is a left eigenvector of A corresponding to λ_i for all i, must have an LU decomposition (i.e. all its leading principal submatrices are nonsingular).

19.21 Remarks
(a) If A is Hermitean, then A_k will converge to a diagonal matrix, which is sweet.
(b) All the matrices A_k (and R_k) involved in the QR algorithm have the same 2-condition number (by 15.6), thus the QR algorithm is numerically stable. This is also true for all the variations of the QR method discussed below.
(c) Unfortunately, the theorem does not apply to real matrices with complex eigenvalues. Indeed, complex eigenvalues of a real matrix come in conjugate pairs a ± ib, which have equal absolute values |a + ib| = |a − ib|, violating the main assumption of the theorem. Furthermore, if A is real, then all Q_k, R_k and A_k are real matrices as well, and thus we cannot even expect the real diagonal elements of A_k to converge to the complex eigenvalues of A. In this case A_k simply does not (!) converge to an upper triangular matrix.

The QR algorithm described above is very reliable but quite expensive – each iteration takes too many flops and the convergence is slow (linear). These problems can be resolved with the help of Hessenberg matrices.

19.22 Definition (Upper Hessenberg Matrix)

A matrix A = (a_{ij}) is called an upper Hessenberg matrix if a_{ij} = 0 for all i > j + 1, i.e. A has the form

 × · · · ×     × ×     0 × ×     . . . . .   ......    0 ··· 0 × ×

19.23 Lemma (Arnoldi Iterations) Every matrix A ∈ C|| n×n is unitarily equivalent to an upper Hessenberg matrix, i.e.

A = Q^* H Q
where H is an upper Hessenberg matrix and Q is a unitary matrix. There is an explicit and inexpensive algorithm to compute such H and Q.

Note: if A is Hermitean, then H is a tridiagonal matrix (i.e. such that h_{ij} = 0 for all |i − j| > 1).

Proof. We present an explicit algorithm for the computation of Q and H, known as the Arnoldi iteration. The matrix equation in the lemma can be rewritten as AQ^* = Q^* H. Denote by q_i, 1 ≤ i ≤ n, the columns of the unitary matrix Q^* and by h_{ij} the components of H. Now our matrix equation takes the form

A q_i = h_{1i} q_1 + ··· + h_{ii} q_i + h_{i+1,i} q_{i+1}
for all i = 1, . . . , n − 1 (recall that h_{ji} = 0 for j > i + 1) and

A q_n = h_{1n} q_1 + ··· + h_{nn} q_n

Now the Arnoldi iteration goes like this. We pick an arbitrary unit vector q_1. Then, for each i = 1, . . . , n − 1 we make four steps.
Step 1: v_i = A q_i.
Step 2: for all j = 1, . . . , i we compute h_{ji} = q_j^* v_i.
Step 3: u_i = v_i − Σ_{j=1}^{i} h_{ji} q_j (note: this vector is orthogonal to q_1, . . . , q_i).
Step 4: h_{i+1,i} = ‖u_i‖ and q_{i+1} = u_i/h_{i+1,i}, unless h_{i+1,i} = 0, see below.
Finally, for i = n we execute Steps 1 and 2 only.
The exceptional case h_{i+1,i} = 0 in Step 4 is referred to as the breakdown of the Arnoldi iteration. This sounds bad, but it is not a bad thing at all. It means that the Hessenberg matrix H will have a simpler structure allowing some reductions (not discussed here). Technically, if h_{i+1,i} = 0, then one simply picks an arbitrary unit vector q_{i+1} orthogonal to q_1, . . . , q_i and continues the Arnoldi iteration algorithm. 2
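The four steps translate almost line by line into code. Below is a minimal NumPy sketch of the full Arnoldi iteration; the function name `arnoldi`, the breakdown tolerance, and the random replacement vector in the breakdown case are our choices for illustration.

```python
import numpy as np

def arnoldi(A, q1):
    """Full Arnoldi iteration (19.23): returns a unitary matrix Qs (columns q_1,...,q_n)
    and an upper Hessenberg H with A @ Qs = Qs @ H, i.e. A = Q^* H Q for Q = Qs^*."""
    n = A.shape[0]
    rng = np.random.default_rng(0)
    Qs = np.zeros((n, n), dtype=complex)       # columns are q_1, ..., q_n
    H = np.zeros((n, n), dtype=complex)
    Qs[:, 0] = q1 / np.linalg.norm(q1)
    for i in range(n):
        v = A @ Qs[:, i]                        # Step 1
        for j in range(i + 1):
            H[j, i] = np.vdot(Qs[:, j], v)      # Step 2: h_{ji} = q_j^* v_i
        if i == n - 1:
            break                               # for i = n only Steps 1 and 2 are executed
        u = v - Qs[:, :i+1] @ H[:i+1, i]        # Step 3: orthogonal to q_1, ..., q_i
        h = np.linalg.norm(u)
        if h > 1e-14:                           # Step 4
            H[i+1, i] = h
            Qs[:, i+1] = u / h
        else:
            # breakdown: pick any unit vector orthogonal to q_1, ..., q_i and continue
            w = rng.standard_normal(n) + 1j * rng.standard_normal(n)
            w -= Qs[:, :i+1] @ (Qs[:, :i+1].conj().T @ w)
            Qs[:, i+1] = w / np.linalg.norm(w)
    return Qs, H
```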

Lemma 19.23 shows that one can first transform A to an upper Hessenberg matrix A_0 = H, which has the same eigenvalues as A by similarity, and then start the QR algorithm with A_0.

19.24 Lemma If the matrix A0 is upper Hessenberg, then all the matrices Ak generated by the QR algorithm are upper Hessenberg.

Proof. By induction, let A_{k−1} be upper Hessenberg. Then A_{k−1} = Q_k R_k and so Q_k = A_{k−1} R_k^{−1}. Since this is a product of an upper Hessenberg matrix and an upper triangular matrix, it is verified by direct inspection that Q_k is upper Hessenberg. Then, similarly, A_k = R_k Q_k is a product of an upper triangular and an upper Hessenberg matrix, so it is upper Hessenberg. 2

For upper Hessenberg matrices, the QR algorithm can be implemented with the help of rotation matrices at a relatively low cost, cf. 16.25. This trick provides an inexpensive modification of the QR algorithm.
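As an illustration of this trick, here is a sketch of a single QR step on a real upper Hessenberg matrix carried out with Givens rotation matrices: only the n − 1 subdiagonal entries have to be annihilated, so the step costs O(n²) operations instead of O(n³). The function name and the in-place updates are our choices, not part of the lecture notes.

```python
import numpy as np

def qr_step_hessenberg(H):
    """One QR step A_{k-1} -> A_k = R_k Q_k for a real upper Hessenberg matrix,
    using n-1 Givens rotations instead of a full QR factorization."""
    n = H.shape[0]
    R = np.array(H, dtype=float)
    rots = []
    for i in range(n - 1):
        a, b = R[i, i], R[i + 1, i]
        r = np.hypot(a, b)
        c, s = (1.0, 0.0) if r == 0 else (a / r, b / r)
        G = np.array([[c, s], [-s, c]])
        R[i:i+2, i:] = G @ R[i:i+2, i:]           # annihilate the subdiagonal entry R[i+1, i]
        rots.append(G)
    A_next = R                                     # now form R Q = R G_1^T G_2^T ... G_{n-1}^T
    for i, G in enumerate(rots):
        A_next[:i+2, i:i+2] = A_next[:i+2, i:i+2] @ G.T
    return A_next                                  # again upper Hessenberg (Lemma 19.24)
```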

There is a further improvement of the QR algorithm that accelerates the convergence in Theorem 19.20.

19.25 Lemma
Assume that A_0, and hence A_k for all k ≥ 1, are upper Hessenberg matrices. Then the convergence a_{i+1,i}^{(k)} → 0 as k → ∞ in Theorem 19.20 is linear with ratio r = |λ_{i+1}/λ_i|.

This lemma is given without proof. Note that |λ_{i+1}/λ_i| < 1 for all i. Clearly, the smaller this ratio, the faster the convergence. Next, the faster the subdiagonal entries a_{i+1,i}^{(k)} converge to zero, the faster the diagonal entries a_{ii}^{(k)} converge to the eigenvalues of A.

One can modify the matrix A to decrease the ratio |λ_n/λ_{n−1}| and thus accelerate the convergence a_{nn}^{(k)} → λ_n, with the help of shifting, as in 19.14. One applies the QR steps to the matrix A − ρI, where ρ is an appropriate approximation to λ_n. Then the convergence a_{n,n−1}^{(k)} → 0 will be linear with ratio r = |λ_n − ρ|/|λ_{n−1} − ρ|. The better ρ approximates λ_n, the faster the convergence.

19.26 The QR Algorithm with Shift To improve the convergence further, one can update the shift ρ = ρ_k at each iteration. The QR algorithm with shift then goes as follows:

A_{k−1} − ρ_{k−1} I = Q_k R_k,     A_k = R_k Q_k + ρ_{k−1} I

where the shift ρk is updated at each step k. A standard choice for the shift ρk is the

Rayleigh quotient of the vector e_n:

ρ_k = e_n^* A_k e_n = a_{nn}^{(k)}

(regarding e_n as the approximate eigenvector corresponding to λ_n). This is called the Rayleigh quotient shift. The convergence a_{n,n−1}^{(k)} → 0 and a_{nn}^{(k)} → λ_n is then, generally, quadratic.
However, the other subdiagonal entries a_{i+1,i}^{(k)}, 1 ≤ i ≤ n − 2, move to zero slowly (linearly). To speed them up, one uses the following trick. After a_{n,n−1}^{(k)} has become practically zero, the entry a_{nn}^{(k)} is practically λ_n. Then one can partition the matrix A_k as

    A_k = [ Â_k   b_k ]
          [ 0     λ_n ]

where Â_k is an (n−1)×(n−1) upper Hessenberg matrix, whose eigenvalues are (obviously) λ_1, . . . , λ_{n−1}. Then one can apply further steps of the QR algorithm with shift to the matrix Â_k instead of A_k. This quickly produces its smallest eigenvalue, λ_{n−1}, which can be split off as above, etc. This procedure is called the deflation of the matrix A.
In practice, each eigenvalue of A requires just 3-5 iterations (QR steps), on the average. The algorithm is rather fast and very accurate. For Hermitean matrices, it takes just 2-3 iterations per eigenvalue.
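Putting the shift and the deflation together, the following NumPy sketch computes all eigenvalues of a matrix with real eigenvalues (e.g. a symmetric matrix). It is only an illustration of the idea: the function name and tolerances are ours, a full QR factorization is used at each step rather than the cheap Hessenberg/Givens version, and the Rayleigh quotient shift is used rather than the more robust Wilkinson shift of 19.27.

```python
import numpy as np

def qr_shifted_with_deflation(A, tol=1e-12, max_steps=1000):
    """QR algorithm with the Rayleigh quotient shift rho_k = a_nn^{(k)} and deflation.
    A sketch for real matrices with real eigenvalues (e.g. symmetric A)."""
    Ak = np.array(A, dtype=float)
    eigenvalues = []
    while Ak.shape[0] > 1:
        n = Ak.shape[0]
        for _ in range(max_steps):
            rho = Ak[-1, -1]                       # Rayleigh quotient shift
            Q, R = np.linalg.qr(Ak - rho * np.eye(n))
            Ak = R @ Q + rho * np.eye(n)
            if abs(Ak[-1, -2]) < tol:              # a_{n,n-1} is practically zero
                break
        eigenvalues.append(Ak[-1, -1])             # split off the converged corner entry ...
        Ak = Ak[:-1, :-1]                          # ... and deflate to the (n-1)x(n-1) block
    eigenvalues.append(Ak[0, 0])
    return eigenvalues

A = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 4.0]])
print(sorted(qr_shifted_with_deflation(A)))
# approximately [1.3249, 2.4608, 5.2143] for the matrix of Example 19.16
```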

19.27 An improvement: Wilkinson Iteration
The Rayleigh quotient shift ρ_k = a_{nn}^{(k)} described above can be slightly improved. At step k, consider the trailing 2-by-2 matrix at the bottom right of A_k:

    [ a_{n−1,n−1}^{(k)}   a_{n−1,n}^{(k)} ]
    [ a_{n,n−1}^{(k)}     a_{n,n}^{(k)}   ]

and set ρ_k to its eigenvalue that is closer to a_{n,n}^{(k)}. Note: the eigenvalues of a 2-by-2 matrix can be easily computed by the quadratic formula. This modification is called the Wilkinson shift. It is slightly more accurate and stable than the Rayleigh quotient shift.
Furthermore, the Wilkinson shift allows us to resolve the problem of avoiding complex matrices when the original matrix A is real. In that case the above 2-by-2 trailing block of A_k is real and therefore has either real eigenvalues or complex conjugate eigenvalues λ and λ̄. If λ ∉ IR, the matrix A_k − λI is complex, and a QR step with shift λ will result in a complex matrix A_{k+1}. However, if this step is followed immediately by a QR step with shift λ̄, then the resulting matrix A_{k+2} is real again. Actually, if λ ∉ IR, there is no need to compute the complex matrix A_{k+1}. One can just combine two QR steps and construct A_{k+2} directly from A_k, bypassing A_{k+1}. This can be carried out entirely in real arithmetic and thus is more convenient and economical. The resulting all-real procedure is called the double-step QR algorithm.
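For reference, here is a small sketch of how the Wilkinson shift can be computed from the trailing 2-by-2 block via the quadratic formula; the function name is ours, and the complex square root yields a complex shift exactly in the situation described above (a real block with complex conjugate eigenvalues).

```python
import numpy as np

def wilkinson_shift(Ak):
    """Wilkinson shift (19.27): the eigenvalue of the trailing 2x2 block of Ak
    that is closer to the corner entry a_nn; may be complex for a real block."""
    B = Ak[-2:, -2:]
    tr = B[0, 0] + B[1, 1]
    det = B[0, 0] * B[1, 1] - B[0, 1] * B[1, 0]
    # eigenvalues of the 2x2 block by the quadratic formula
    disc = np.sqrt(complex(tr * tr - 4.0 * det))
    mu1, mu2 = (tr + disc) / 2.0, (tr - disc) / 2.0
    ann = Ak[-1, -1]
    return mu1 if abs(mu1 - ann) <= abs(mu2 - ann) else mu2
```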
