(VII.E) The Singular Value Decomposition (SVD)

In this section we describe a generalization of the Spectral Theorem to non-normal operators, and even to transformations between different vector spaces. This is a computationally useful generalization, with applications to data fitting and data compression in particular. It is used in many branches of science, in statistics, and even in machine learning. As we develop the underlying theory, we will encounter two closely associated constructions: the polar decomposition and the pseudoinverse, in each case presenting the version for operators followed by that for matrices.

To begin, let $(V,\langle\cdot,\cdot\rangle)$ be an $n$-dimensional inner-product space and $T\colon V\to V$ an operator. By Spectral Theorem III, if $T$ is normal then it has a unitary eigenbasis $\mathcal B$, with complex eigenvalues $\sigma_1,\dots,\sigma_n$. In particular, $T$ is unitary iff all $|\sigma_i|=1$ (since preserving the lengths of a unitary eigenbasis is equivalent to preserving the lengths of all vectors), and self-adjoint iff the $\sigma_i$ are real. So $T$ is both unitary and self-adjoint iff $T$ has only $+1$ and $-1$ as eigenvalues, which is to say $T$ is an orthogonal reflection, acting on $V=W\oplus W^{\perp}$ by $\mathrm{Id}_W$ on $W$ and $-\mathrm{Id}_{W^{\perp}}$ on $W^{\perp}$. But the broader point here is that we should have in mind the following "table of analogies":

  operators        numbers
  normal           $\mathbb{C}$
  unitary          $S^1$
  self-adjoint     $\mathbb{R}$
  ??               $\mathbb{R}_{\ge 0}$

where $S^1\subset\mathbb{C}$ is the unit circle. So, how do we complete it?


DEFINITION 1. $T$ is positive (resp. strictly positive) if it is normal, with all eigenvalues in $\mathbb{R}_{\ge 0}$ (resp. $\mathbb{R}_{>0}$).

Since they are unitarily diagonalizable with real eigenvalues, positive operators are self-adjoint. Given a self-adjoint (or normal) operator $T$, any $\vec v\in V$ is a sum of orthogonal eigenvectors $\vec w_j$ with eigenvalues $\sigma_j$; so
$$\langle\vec v,T\vec v\rangle=\sum_{i,j}\langle\vec w_i,T\vec w_j\rangle=\sum_{i,j}\sigma_j\langle\vec w_i,\vec w_j\rangle=\sum_j\sigma_j\|\vec w_j\|^2$$
is in $\mathbb{R}_{\ge 0}$ for all $\vec v$ if and only if $T$ is positive. If $V$ is complex, the condition that $\langle\vec v,T\vec v\rangle\ge 0$ $(\forall\vec v)$ on its own gives self-adjointness (Exercise VII.D.7), hence positivity, of $T$. (If $V$ is real, this condition alone isn't enough.)
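To make the criterion concrete, here is a small numerical sketch (the matrix and the use of numpy are our own illustration, not part of the text): a self-adjoint matrix is positive exactly when its eigenvalues, as returned by `np.linalg.eigvalsh`, are all nonnegative.

```python
import numpy as np

# Hedged illustration (assumed example matrix): for a self-adjoint matrix,
# <v, Av> >= 0 for all v exactly when every eigenvalue is >= 0.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # self-adjoint, eigenvalues 1 and 3

eigs = np.linalg.eigvalsh(A)        # real eigenvalues of a Hermitian matrix
print(eigs)                         # [1. 3.]  -> A is (strictly) positive

rng = np.random.default_rng(0)
v = rng.standard_normal(2)
print(v @ A @ v >= 0)               # True for every choice of v
```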

EXAMPLE 2. Orthogonal projections are positive, since they are self-adjoint with eigenvalues 0 and 1.

Polar decomposition. A nonzero complex number can be written (in "polar form") as a product $re^{i\theta}$, for unique $r\in\mathbb{R}_{>0}$ and $e^{i\theta}\in S^1$. If $T$ is a normal operator, then it has a unitary eigenbasis $\mathcal B=\{\hat v_1,\dots,\hat v_n\}$, with $[T]_{\mathcal B}=\mathrm{diag}\{r_1e^{i\theta_1},\dots,r_ne^{i\theta_n}\}$ ($r_i\in\mathbb{R}_{\ge 0}$); and defining $|T|$, $U$ by $[|T|]_{\mathcal B}=\mathrm{diag}\{r_1,\dots,r_n\}$ and $[U]_{\mathcal B}=\mathrm{diag}\{e^{i\theta_1},\dots,e^{i\theta_n}\}$, we have $T=U|T|$ with $U$ unitary and $|T|$ positive, and $U|T|=|T|U$. One might expect such "polar decompositions" of operators to stop there. But there exists an analogue for arbitrary transformations, one that will lead on to the SVD as well!

To formulate this general polar decomposition, let $T\colon V\to V$ be an arbitrary linear transformation. Notice that $T^\dagger T$ is self-adjoint¹ and, in fact, positive:

$$\langle\vec v,T^\dagger T\vec v\rangle=\langle T\vec v,T\vec v\rangle\ge 0,$$

since $\langle\cdot,\cdot\rangle$ is an inner product. So we have a unitary basis $\mathcal B$ with $[T^\dagger T]_{\mathcal B}=\mathrm{diag}\{\lambda_1,\dots,\lambda_n\}$, and all $\lambda_i\in\mathbb{R}_{\ge 0}$. (Henceforth we impose the convention that $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_n$.) Taking nonnegative square roots $\mu_i=\sqrt{\lambda_i}\in\mathbb{R}_{\ge 0}$, we define a new operator $|T|$ by $[|T|]_{\mathcal B}=\mathrm{diag}\{\mu_1,\dots,\mu_n\}$; clearly $|T|^2=T^\dagger T$, and $|T|$ is positive.

¹Trivially so: $(T^\dagger T)^\dagger=T^\dagger T^{\dagger\dagger}=T^\dagger T$.

THEOREM 3. There exists a unitary $U$ such that

T = U|T|.

This "polar decomposition" is unique exactly when $T$ is invertible. Moreover, $U$ and $|T|$ commute exactly when $T$ is normal.

PROOF. We have $\|T\vec v\|^2=\langle T\vec v,T\vec v\rangle=\langle\vec v,T^\dagger T\vec v\rangle=\langle\vec v,|T|^2\vec v\rangle=\langle|T|\vec v,|T|\vec v\rangle=\||T|\vec v\|^2$, since $|T|$ is self-adjoint. Consequently $T$ and $|T|$ have the same kernel, hence (by Rank + Nullity) the same rank; so while their images (as subspaces of $V$) may differ, they have the same dimension.

By Spectral Theorem I, $V=\mathrm{im}|T|\overset{\perp}{\oplus}\ker|T|$. So $\dim(\ker|T|)=\dim((\mathrm{im}\,T)^\perp)$, and we fix an invertible transformation $U_0\colon\ker|T|\to(\mathrm{im}\,T)^\perp$ sending some choice of unitary basis to another unitary basis. Writing $\vec v=|T|\vec w+\vec v_0$ (with $|T|\vec v_0=\vec 0$), we now define
$$U\colon\underbrace{\mathrm{im}|T|\overset{\perp}{\oplus}\ker|T|}_{V}\;\longrightarrow\;\underbrace{\mathrm{im}\,T\overset{\perp}{\oplus}(\mathrm{im}\,T)^\perp}_{V}$$
by $U\vec v:=T\vec w+U_0\vec v_0$. Since
$$\|T\vec w+U_0\vec v_0\|^2=\|T\vec w\|^2+\|U_0\vec v_0\|^2=\||T|\vec w\|^2+\|\vec v_0\|^2=\|\vec v\|^2,$$
$U$ is unitary; and taking $\vec v=|T|\vec w$ gives $U|T|\vec w=T\vec w$ $(\forall\vec w\in V)$.

Now if $T$ is invertible, then so is $|T|$ (all $\mu_i>0$) $\implies U=T|T|^{-1}$. If it isn't, then $\ker|T|\ne\{0\}$ and non-uniqueness enters in the choice of $U_0$.

Finally, if $U$ and $|T|$ commute, then they simultaneously unitarily diagonalize, and therefore so does $T$, making $T$ normal. If $T$ is normal, we have constructed $U$ and $|T|$ above, and they commute. $\square$

EXAMPLE 4. Consider $V=\mathbb{R}^4$ with the dot product, and

 2 0 0 0     0 1 0 0  A = [T]eˆ =   .  0 1 1 0  0 0 0 0 Since A has a Jordan block, we know it is not diagonalizable; and so T is not normal. However,  4 0 0 0    † t  0 2 1 0  −1 [T T]eˆ = AA =   = SBDSB  0 1 1 0  0 0 0 0 with B := {eˆ1, eˆ2 +√ϕ−eˆ3, eˆ2 − ϕ+eˆ3, eˆ√4} and D := diag{4, η+, η−, 0} ±1+ 5 3± 5 2 (where ϕ± := 2 and η± := 2 = ϕ±). Taking the positive 1 square root D 2 := diag{2, ϕ+, ϕ−, 0}, we get   2 0 0 0  0 √3 √1 0  1 −1  5 5  [|T|]eˆ = |A| := SBD 2 S =   . B  0 √1 √2 0   5 5  0 0 0 0

On $W:=\mathrm{span}\{\hat e_1,\hat e_2,\hat e_3\}=\mathrm{im}\,T=\mathrm{im}|T|$, we must have $U|_W=T|_W\,(|T|\,|_W)^{-1}$; while on $\mathrm{span}\{\hat e_4\}=W^\perp$, $U$ can be taken to act by any scalar of absolute value $1$. Therefore
$$[U]_{\hat e}=Q:=\begin{pmatrix}1&0&0&0\\[1mm]0&\tfrac{2}{\sqrt5}&\tfrac{-1}{\sqrt5}&0\\[1mm]0&\tfrac{1}{\sqrt5}&\tfrac{2}{\sqrt5}&0\\[1mm]0&0&0&e^{i\theta}\end{pmatrix},$$
and $A=Q|A|$. Geometrically, $|A|$ dilates the $\mathcal B$-coordinates of a vector, then $Q$ performs a rotation.
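As a hedged numerical aside (not part of the original example), one can recover this polar decomposition from the SVD: if $A=W\Sigma V^t$, then $|A|=V\Sigma V^t$ and one may take $U=WV^t$. A minimal numpy sketch:

```python
import numpy as np

# Sketch (assumed numerics): polar decomposition A = U|A| of Example 4
# obtained from the SVD A = W S V^t, with |A| = V S V^t and U = W V^t.
A = np.array([[2., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 0., 0.]])

W, s, Vt = np.linalg.svd(A)
absA = Vt.T @ np.diag(s) @ Vt       # positive square root of A^t A
U = W @ Vt                          # orthogonal; this choice fixes the non-uniqueness

print(np.allclose(U @ absA, A))             # True
print(np.allclose(U @ U.T, np.eye(4)))      # True: U is orthogonal
print(np.round(absA, 3))                    # matches [|T|]_e computed above
```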

SVD for endomorphisms.² Now let $|T|$ be the (unique) positive square root of $T^\dagger T$ as above, and $\mathcal B=\{\hat v_1,\dots,\hat v_n\}$ the unitary basis of $V$ under which $[|T|]_{\mathcal B}=\mathrm{diag}\{\mu_1,\dots,\mu_n\}$.

DEFINITION 5. These {µi} are called the singular values of T.

Writing $T=U|T|$ and $\hat w_i:=U\hat v_i$, we have $U^\dagger U=\mathrm{Id}_V\implies\langle\hat w_i,\hat w_j\rangle=\langle U^\dagger U\hat v_i,\hat v_j\rangle=\langle\hat v_i,\hat v_j\rangle=\delta_{ij}\implies\mathcal C:=\{\hat w_1,\dots,\hat w_n\}=U(\mathcal B)$ is unitary. For $\vec v\in V$, we have
$$T\vec v=U|T|\sum_{i=1}^n\langle\hat v_i,\vec v\rangle\hat v_i=\sum_{i=1}^n\langle\hat v_i,\vec v\rangle U|T|\hat v_i$$

(1) $\qquad\displaystyle =\sum_{i=1}^n\mu_i\langle\hat v_i,\vec v\rangle U\hat v_i=\sum_{i=1}^n\mu_i\langle\hat v_i,\vec v\rangle\hat w_i.$

Viewing $\langle\hat v_i,\cdot\rangle=:\ell_{\hat v_i}\colon V\to\mathbb{C}$ as a linear functional, and $\hat w_i\colon\mathbb{C}\to V$ as a map (sending $\alpha\in\mathbb{C}$ to $\alpha\hat w_i$), (1) becomes

(2) $\qquad\displaystyle T=\sum_{i=1}^n\mu_i\,\hat w_i\circ\ell_{\hat v_i}.$

How does this look in matrix terms? Let $\mathcal A$ be an arbitrary³ unitary basis of $V$, and set $A:=[T]_{\mathcal A}$. Using $\{1\}$ as a basis of $\mathbb{C}$, (2) yields

(3) $\qquad\displaystyle A=\sum_{i=1}^n\mu_i\,{}_{\mathcal A}[\hat w_i]_{\{1\}}\;{}_{\{1\}}[\ell_{\hat v_i}]_{\mathcal A}=\sum_{i=1}^n\mu_i\,[\hat w_i]_{\mathcal A}\,[\hat v_i]_{\mathcal A}^*.$

Here $[\hat w_i]_{\mathcal A}[\hat v_i]_{\mathcal A}^*$ is an $(n\times1)\cdot(1\times n)$ matrix product, yielding an $n\times n$ matrix of rank one. So (2) (resp. (3)) decompose $T$ (resp. $A$) into a sum of rank-one endomorphisms (resp. matrices) weighted by the singular values. This is a first version of the Singular Value Decomposition.

²The reader may of course replace "unitary" and $\mathbb{C}$ by "orthonormal"/"orthogonal" and $\mathbb{R}$ in what follows.
³What you should actually have in mind here is, in the case where $V=\mathbb{C}^n$ with dot product, the standard basis $\hat e$.
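Here is a quick numerical sanity check of (3) (a sketch with an arbitrary made-up matrix, not from the text): numpy's `svd` returns exactly the $\mu_i$, the $\hat w_i$ (columns of the first factor), and the $\hat v_i^*$ (rows of the last factor) needed for the rank-one sum.

```python
import numpy as np

# Sketch (assumed example matrix): A equals the sum of rank-one matrices
# mu_i * w_i * v_i^*, as in formula (3).
A = np.array([[1., 2., 0.],
              [0., 1., 1.],
              [1., 0., 1.]])

W, mu, Vt = np.linalg.svd(A)                 # columns of W ~ w_i, rows of Vt ~ v_i^*
rank_one_sum = sum(mu[i] * np.outer(W[:, i], Vt[i]) for i in range(len(mu)))
print(np.allclose(rank_one_sum, A))          # True
```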

REMARK 6. If $T$ is positive, we can take $U=\mathrm{Id}_V$, so that $\hat w_i\circ\ell_{\hat v_i}=\hat v_i\circ\ell_{\hat v_i}=:\mathrm{Pr}_{\hat v_i}$ is the orthogonal projection onto $\mathrm{span}\{\hat v_i\}$. Thus (2) merely restates Spectral Theorem I in this case. In fact, if $T$ is normal, then $T$, $T^\dagger T$, and $|T|$ simultaneously diagonalize. If $\tilde\mu_1,\dots,\tilde\mu_n$ are the eigenvalues of $T$, then we evidently have $\mu_i=|\tilde\mu_i|$; $U$ may then be defined by $[U]_{\mathcal B}=\mathrm{diag}\{\tilde\mu_i/\mu_i\}$ (with $0/0$ interpreted as $1$). Thus (2) reads $T=\sum_{i=1}^n\tilde\mu_i\mathrm{Pr}_{\hat v_i}$, which restates Spectral Theorem III. So in this sense, the SVD generalizes the results in §VII.D.

Now since U sends B to C, we have

$${}_{\mathcal C}[T]_{\mathcal B}=\underbrace{{}_{\mathcal C}[U]_{\mathcal B}}_{I_n}\,[|T|]_{\mathcal B}=\mathrm{diag}\{\mu_1,\dots,\mu_n\}=:D$$
and thus

(4) $\qquad A=[T]_{\mathcal A}={}_{\mathcal A}[I]_{\mathcal C}\;{}_{\mathcal C}[T]_{\mathcal B}\;{}_{\mathcal B}[I]_{\mathcal A}=:S_1DS_2,$

with $S_1$ and $S_2$ unitary matrices (as they change between unitary bases). Roughly speaking, (4) presents $A$ as a composition of the form "rotation,⁴ dilation, rotation".

THEOREM 7. Any $A\in M_n(\mathbb{C})$ can be presented as a product $S_1DS_2$, where $D$ is a diagonal matrix with entries in $\mathbb{R}_{\ge 0}$, and $S_iS_i^*=I_n$. The diagonal entries of $D$, called the singular values of $A$, are unique; and if these are all distinct, nonzero, and in decreasing order, then $(S_1,S_2)$ are unique modulo rescaling $(S_1\Delta,\Delta^{-1}S_2)$ by a unitary diagonal matrix $\Delta$.⁵

PROOF. Existence was deduced above. We have $D^*=D$, hence
$$A^*A=S_2^*D^*S_1^*S_1DS_2=S_2^*D^2S_2,$$
which is a unitary⁶ diagonalization of the positive matrix $A^*A$. So $S_2$, hence $D$, is unique; and if the diagonal entries are distinct, then the unitary eigenbasis ($\implies$ columns of $S_2^*$) is determined up to $e^{i\theta}$ scaling factors. If no singular values are $0$, $D$ is invertible and $S_1=AS_2^*D^{-1}$. $\square$

⁴More precisely, "rotation" here can mean a complicated sequence of rotations and reflections.
⁵If $A\in M_n(\mathbb{R})$, then the above construction shows that $S_1$ and $S_2$ can be taken to be real (orthogonal); and such $S_i$ are unique modulo rescaling by $\Delta$ real diagonal unitary, i.e. with diagonal entries $\pm1$.
⁶In the dot product.
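A small numerical illustration of the uniqueness statement (assumed example matrix, not from the text): the singular values produced by `np.linalg.svd` coincide with the square roots of the eigenvalues of $A^*A$.

```python
import numpy as np

# Sketch: the diagonal of D in A = S1 D S2 is determined by A, since the
# squares of the singular values are the eigenvalues of A* A.
A = np.array([[0., 2.],
              [1., 1.]])

s = np.linalg.svd(A, compute_uv=False)               # singular values, decreasing
lam = np.linalg.eigvalsh(A.conj().T @ A)[::-1]       # eigenvalues of A*A, decreasing
print(np.allclose(s, np.sqrt(lam)))                  # True
```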

EXAMPLE 8. Let's look at the operator
$$T=\frac{d}{dt}\colon P_2(\mathbb{R})\to P_2(\mathbb{R})$$
with $\langle f,g\rangle:=\int_{-1}^1 f(t)g(t)\,dt$. This is a non-normal, nilpotent transformation. We compute its singular values – which are not all zero, even though⁷ $0$ is the only eigenvalue of $T$ – and the bases $\mathcal B$ and $\mathcal C$. To compute $T^\dagger$ we will need an o.n. basis $\mathcal A$: Gram-Schmidt on $\{1,t,t^2\}$ yields $\mathcal A:=\left\{\tfrac{1}{\sqrt2},\ \sqrt{\tfrac32}\,t,\ \tfrac32\sqrt{\tfrac52}\,(t^2-\tfrac13)\right\}$ and
$$[T]_{\mathcal A}=\begin{pmatrix}0&\sqrt3&0\\0&0&\sqrt{15}\\0&0&0\end{pmatrix}\implies[T^\dagger]_{\mathcal A}=[T]_{\mathcal A}^*=\begin{pmatrix}0&0&0\\\sqrt3&0&0\\0&\sqrt{15}&0\end{pmatrix}$$
$$\implies[T^\dagger T]_{\mathcal A}=\begin{pmatrix}0&0&0\\0&3&0\\0&0&15\end{pmatrix}\implies[|T|]_{\mathcal A}=\begin{pmatrix}0&0&0\\0&\sqrt3&0\\0&0&\sqrt{15}\end{pmatrix}$$
$$\implies\text{singular values are }\sqrt{15},\ \sqrt3,\ 0,\quad\text{and}\quad\mathcal B=\left\{\tfrac32\sqrt{\tfrac52}\,(t^2-\tfrac13),\ \sqrt{\tfrac32}\,t,\ \tfrac{1}{\sqrt2}\right\}=:\{\hat v_1,\hat v_2,\hat v_3\}.$$

Now $\mathrm{im}(|T|)=\mathrm{span}\{\hat v_1,\hat v_2\}$, $\ker(|T|)=\mathrm{span}\{\hat v_3\}=\ker(T)$, $\mathrm{im}(T)=\mathrm{span}\{\hat v_2,\hat v_3\}$, and $\mathrm{im}(T)^\perp=\mathrm{span}\{\hat v_1\}$. So we define
$$U\colon\mathrm{span}\{\hat v_1,\hat v_2\}\overset{\perp}{\oplus}\mathrm{span}\{\hat v_3\}\;\longrightarrow\;\mathrm{span}\{\hat v_2,\hat v_3\}\overset{\perp}{\oplus}\mathrm{span}\{\hat v_1\}$$
by sending $\hat v_1=|T|(\tfrac{1}{\sqrt{15}}\hat v_1)\mapsto T(\tfrac{1}{\sqrt{15}}\hat v_1)=\hat v_2$, $\hat v_2=|T|(\tfrac{1}{\sqrt3}\hat v_2)\mapsto T(\tfrac{1}{\sqrt3}\hat v_2)=\hat v_3$, and $\hat v_3\mapsto\hat v_1$. We then have
$$[U]_{\mathcal A}=\begin{pmatrix}0&1&0\\0&0&1\\1&0&0\end{pmatrix},$$
and $\mathcal C=U(\mathcal B)=\{\hat v_2,\hat v_3,\hat v_1\}$.⁸ The SVD of $A=[T]_{\mathcal A}$ now reads (noting that $\mathcal A=\{\hat v_3,\hat v_2,\hat v_1\}$)
$$A=S_1DS_2=\begin{pmatrix}0&1&0\\1&0&0\\0&0&1\end{pmatrix}\begin{pmatrix}\sqrt{15}&0&0\\0&\sqrt3&0\\0&0&0\end{pmatrix}\begin{pmatrix}0&0&1\\0&1&0\\1&0&0\end{pmatrix}.$$

⁷Caution: if $T$ is diagonalizable, it may not be unitarily diagonalizable, and it's only in the latter case that $T$'s singular values are the absolute values of $T$'s eigenvalues.
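As a check (our own numerics, using the matrix $[T]_{\mathcal A}$ computed above), the singular values $\sqrt{15},\sqrt3,0$ can be confirmed with numpy:

```python
import numpy as np

# Numerical check of Example 8 (a sketch): the singular values of [T]_A
# should be sqrt(15), sqrt(3), 0.
T_A = np.array([[0., np.sqrt(3.), 0.],
                [0., 0., np.sqrt(15.)],
                [0., 0., 0.]])

print(np.round(np.linalg.svd(T_A, compute_uv=False), 6))
# [3.872983 1.732051 0.      ]  i.e.  sqrt(15), sqrt(3), 0
```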

We now turn to the more general form of the Singular Value Decomposition.

SVD for transformations. Let $T\colon V\to W$ be an arbitrary linear transformation between finite-dimensional inner-product spaces, and fix unitary bases $\mathcal A$ resp. $\mathcal A'$ of $V$ resp. $W$. Write $n=\dim V$, $m=\dim W$.

LEMMA 9. The transformation $T^\dagger\colon W\to V$ defined by ${}_{\mathcal A}[T^\dagger]_{\mathcal A'}=({}_{\mathcal A'}[T]_{\mathcal A})^*$ is the unique operator satisfying

(5) $\qquad\langle T^\dagger\vec w,\vec v\rangle=\langle\vec w,T\vec v\rangle$

for all $\vec v\in V$, $\vec w\in W$. We call it the adjoint of $T$.

PROOF. Since $\mathcal A,\mathcal A'$ are unitary, (5) is equivalent to
$$[T^\dagger\vec w]_{\mathcal A}^*\,[\vec v]_{\mathcal A}=[\vec w]_{\mathcal A'}^*\,[T\vec v]_{\mathcal A'}$$
and thus to
$$[\vec w]_{\mathcal A'}^*\left({}_{\mathcal A}[T^\dagger]_{\mathcal A'}\right)^*[\vec v]_{\mathcal A}=[\vec w]_{\mathcal A'}^*\;{}_{\mathcal A'}[T]_{\mathcal A}\,[\vec v]_{\mathcal A}. \qquad\square$$



⁸Of course, even over $\mathbb{R}$ it isn't quite unique: one can play with the signs in $S_1$ and $S_2$.

LEMMA 10. We have $V=\ker(T)\overset{\perp}{\oplus}\mathrm{im}(T^\dagger)$ and $W=\mathrm{im}(T)\overset{\perp}{\oplus}\ker(T^\dagger)$, with $T$ (resp. $T^\dagger$) restricting to an isomorphism from $\mathrm{im}(T^\dagger)\to\mathrm{im}(T)$ (resp. $\mathrm{im}(T)\to\mathrm{im}(T^\dagger)$).

PROOF. By Lemma 9, $T$ and $T^\dagger$ have the same rank $r$, so Rank + Nullity $\implies$
$$\begin{cases}\dim(\ker T)+\dim(\mathrm{im}\,T^\dagger)=(n-r)+r=n=\dim V\\ \dim(\mathrm{im}\,T)+\dim(\ker T^\dagger)=r+(m-r)=m=\dim W.\end{cases}$$

For $\vec v\in\ker(T)$, $\langle\vec v,T^\dagger\vec w\rangle=\langle T\vec v,\vec w\rangle=\langle\vec 0,\vec w\rangle=0\implies\ker(T)\perp\mathrm{im}(T^\dagger)\implies\ker(T)\cap\mathrm{im}(T^\dagger)=\{0\}\implies T|_{\mathrm{im}(T^\dagger)}$ is injective. This proves the assertions about $V$ and $T$; those for $W$ and $T^\dagger$ follow by symmetry. $\square$

The compositions $T^\dagger T\colon V\to V$ and $TT^\dagger\colon W\to W$ are clearly positive (argue as above), and preserve the $\overset{\perp}{\oplus}$'s in Lemma 10. So $T^\dagger T$ (resp. $TT^\dagger$) is an automorphism of $\mathrm{im}(T^\dagger)$ (resp. $\mathrm{im}(T)$) "$\overset{\perp}{\oplus}$" the zero map on $\ker(T)$ (resp. $\ker(T^\dagger)$). The same remarks apply to the positive operator $\sqrt{T^\dagger T}$ (resp. $\sqrt{TT^\dagger}$), and there exists a unitary basis $\mathcal B=\{\hat v_1,\dots,\hat v_n\}$ of $V$ with $\left[\sqrt{T^\dagger T}\right]_{\mathcal B}=\mathrm{diag}\{\mu_1,\dots,\mu_r,0,\dots,0\}$ and $\mu_1\ge\mu_2\ge\dots\ge\mu_r>0$.

DEFINITION 11. These $\{\mu_i\}$ are called the (nonzero) singular values of $T$.

In particular, $\{\hat v_1,\dots,\hat v_r\}\subset\mathrm{im}(T^\dagger)$ and $\{\hat v_{r+1},\dots,\hat v_n\}\subset\ker(T)$; and we define $\{\hat w_1,\dots,\hat w_r\}\subset\mathrm{im}(T)$ by $\hat w_i:=\frac{1}{\mu_i}T\hat v_i$. Since $\langle\hat w_i,\hat w_j\rangle=\frac{1}{\mu_i\mu_j}\langle T\hat v_i,T\hat v_j\rangle=\frac{\mu_i^2}{\mu_i\mu_j}\langle\hat v_i,\hat v_j\rangle=\frac{\mu_i}{\mu_j}\delta_{ij}=\delta_{ij}$, this gives a unitary basis of $\mathrm{im}(T)$, which we complete to a unitary basis $\mathcal C\subset W$ by choosing $\{\hat w_{r+1},\dots,\hat w_m\}\subset\ker(T^\dagger)$. This proves

THEOREM 12 (Abstract SVD). There exist unitary bases $\mathcal B\subset V$, $\mathcal C\subset W$ such that⁹

(6) $\qquad{}_{\mathcal C}[T]_{\mathcal B}=\mathrm{diag}_{m\times n}\{\mu_1,\dots,\mu_r,0,\dots,0\}$

with $\mu_1\ge\dots\ge\mu_r>0$, and $r=\mathrm{rank}(T)$.

REMARK 13. (i) $\mathcal C$ is an eigenbasis for $\sqrt{TT^\dagger}$, and its nonzero eigenvalues are also the $\{\mu_i\}_{i=1}^r$: indeed, $TT^\dagger\hat w_i=\frac{1}{\mu_i}TT^\dagger T\hat v_i=\frac{1}{\mu_i}T(\mu_i^2\hat v_i)=\mu_i^2\hat w_i$.

(ii) If $\mathcal B'\subset V$, $\mathcal C'\subset W$ are unitary bases such that ${}_{\mathcal C'}[T]_{\mathcal B'}=\mathrm{diag}_{m\times n}\{\mu_1',\dots,\mu_s',0,\dots,0\}$ in decreasing order, then $s=r$ and $\mu_i'=\mu_i$ $(\forall i)$. This is because ${}_{\mathcal B'}[T^\dagger T]_{\mathcal B'}={}_{\mathcal B'}[T^\dagger]_{\mathcal C'}\;{}_{\mathcal C'}[T]_{\mathcal B'}=({}_{\mathcal C'}[T]_{\mathcal B'})^*\;{}_{\mathcal C'}[T]_{\mathcal B'}=\mathrm{diag}_{n\times n}\{(\mu_1')^2,\dots,(\mu_r')^2,0,\dots,0\}$, so that the $\{\mu_i'\}$ are the eigenvalues of $\sqrt{T^\dagger T}$. So the (nonzero) singular values are the unique positive numbers that can appear in (6).

(iii) Moreover, if the $\mu_i$ are distinct and $T$ is injective (resp. surjective), the elements of $\mathcal B'$ (resp. $\mathcal C'$) are $e^{i\theta}$-multiples of the elements of $\mathcal B$ (resp. $\mathcal C$).

DEFINITION 14. The pseudoinverse of T is the transformation T∼ : W → V given by 0 on ker(T†) and inverting T on im(T).

COROLLARY 15. ${}_{\mathcal B}[T^\sim]_{\mathcal C}=\mathrm{diag}_{n\times m}\{\mu_1^{-1},\dots,\mu_r^{-1},0,\dots,0\}$. Note that if $T$ is invertible, then $T^\sim=T^{-1}$.

COROLLARY 16. $T^\sim T\colon V\to V$ is the orthogonal projection $\mathrm{Pr}_{\mathrm{im}(T^\dagger)}$, and $TT^\sim\colon W\to W$ is $\mathrm{Pr}_{\mathrm{im}(T)}$.

SVD for $m\times n$ matrices. The matrix version of Theorem 12 is one of the most important results in linear algebra. Its efficient computer implementation for large matrices has been the subject of countless articles in numerical analysis. While we won't say anything about these algorithms, they are more efficient (and accurate) than the orthogonal diagonalization of $A^*A$, which remains our algorithm of choice here.

⁹The notation means an $m\times n$ matrix, whose entries are zero except for the $(j,j)$th entries for $1\le j\le r\ (\le m,n)$.

THEOREM 17 (Matrix SVD). Let $A$ be an arbitrary $m\times n$ matrix of rank $r$, with complex entries. Then there exist unitary matrices $P\in M_m(\mathbb{C})$, $Q\in M_n(\mathbb{C})$ and real numbers $\mu_1\ge\dots\ge\mu_r>0$, such that

(7) $\qquad A=P\Delta Q^*=(\hat p_1\cdots\hat p_m)\;\mathrm{diag}_{m\times n}\{\mu_1,\dots,\mu_r,0,\dots,0\}\;(\hat q_1\cdots\hat q_n)^*$

(and $A^*=Q\,{}^t\Delta\,P^*$). The map from $\mathbb{C}^n\to\mathbb{C}^m$ induced by $A$ decomposes as an orthogonal direct sum under the dot product
$$\underset{\mathrm{span}\{\hat q_1,\dots,\hat q_r\}}{\underbrace{V_{\mathrm{col}}(A^*)}}\overset{\perp}{\oplus}\{\vec v\mid A\vec v=\vec 0\}\;\longrightarrow\;\underset{\mathrm{span}\{\hat p_1,\dots,\hat p_r\}}{\underbrace{V_{\mathrm{col}}(A)}}\overset{\perp}{\oplus}\{\vec w\mid A^*\vec w=\vec 0\}$$
given by $\hat q_j\overset{A}{\longmapsto}\mu_j\hat p_j$ $(j=1,\dots,r)$ on the first factor and zero on the second factor. (The reverse map $A^*$ sends $\hat p_j\mapsto\mu_j\hat q_j$.) The nonzero singular values $\mu_j$ are the positive square roots of the nonzero eigenvalues of $A^*A$ (resp. $AA^*$), for which the columns of $Q$ (resp. $P$) give unitary eigenbases.

PROOF. Apply Theorem 12 to the setting: $V=\mathbb{C}^n$ with $\langle\cdot,\cdot\rangle=$ dot product and $\mathcal A=\hat e$; $W=\mathbb{C}^m$ with $\langle\cdot,\cdot\rangle=$ dot product and $\mathcal A'=\hat e$; $T\colon V\to W$ such that $A={}_{\mathcal A'}[T]_{\mathcal A}$. Then
$${}_{\mathcal A'}[T]_{\mathcal A}={}_{\mathcal A'}[\mathrm{Id}_W]_{\mathcal C}\;{}_{\mathcal C}[T]_{\mathcal B}\;{}_{\mathcal B}[\mathrm{Id}_V]_{\mathcal A}$$
is $A=S_{\mathcal C}\,\Delta\,S_{\mathcal B}^*$. The remaining details are immediate from the "abstract" analysis. $\square$
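For readers who compute, here is a hedged numpy sketch of Theorem 17 on an assumed $3\times4$ example (not from the text); `np.linalg.svd` returns $P$, the $\mu_j$, and $Q^*$:

```python
import numpy as np

# Sketch of the matrix SVD: A = P @ Delta @ Q*, with P (m x m) and Q* (n x n)
# unitary and Delta the m x n "diagonal" matrix of singular values.
A = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 1., 1., 1.]])        # 3 x 4, rank 2

P, mu, Qstar = np.linalg.svd(A)         # P is 3x3, Qstar is 4x4, mu has min(m,n) entries
Delta = np.zeros((3, 4))
Delta[:len(mu), :len(mu)] = np.diag(mu)

print(np.allclose(P @ Delta @ Qstar, A))     # True
print(np.round(mu, 6))                       # two nonzero singular values, then ~0
```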

DEFINITION 18. The pseudoinverse of $A$ is the $n\times m$ matrix
$$A^\sim:={}_{\mathcal A}[T^\sim]_{\mathcal A'}=Q\Delta^\sim P^*,$$
where $\Delta^\sim=\mathrm{diag}_{n\times m}\{\mu_1^{-1},\dots,\mu_r^{-1},0,\dots,0\}$.

COROLLARY 19. $A^\sim A=[\mathrm{Pr}_{V_{\mathrm{col}}(A^*)}]_{\hat e}$ and $AA^\sim=[\mathrm{Pr}_{V_{\mathrm{col}}(A)}]_{\hat e}$ are matrices of orthogonal projections (under the dot product).

REMARK 20. In all of this, if $A$ is real then $A^*={}^t\!A\implies V_{\mathrm{col}}(A^*)=V_{\mathrm{row}}(A)$. In this case, we can take $P$, $Q$ orthogonal, and $A=P\Delta\,{}^tQ$ with the $\{\hat q_i\}$ appearing as the rows of ${}^tQ$.

EXAMPLE 21. Consider the rank 1 matrix   1 −1   A =  1 −1  . 2 −2

We have
$${}^t\!A\,A=\begin{pmatrix}6&-6\\-6&6\end{pmatrix}=Q\,\mathrm{diag}\{\mu_1^2,0\}\,{}^tQ$$
with $\mu_1=\sqrt{12}$ and $Q=\tfrac{1}{\sqrt2}\begin{pmatrix}1&1\\-1&1\end{pmatrix}$. Now
$$\hat p_1=\tfrac{1}{\mu_1}A\hat q_1=\tfrac{1}{\sqrt6}\begin{pmatrix}1\\1\\2\end{pmatrix},$$
while $\hat p_2,\hat p_3$ have to be an o.n. basis of $\ker({}^t\!A)$. So
$$A=\begin{pmatrix}\tfrac{1}{\sqrt6}&\tfrac{1}{\sqrt2}&\tfrac{1}{\sqrt3}\\[1mm]\tfrac{1}{\sqrt6}&\tfrac{-1}{\sqrt2}&\tfrac{1}{\sqrt3}\\[1mm]\tfrac{2}{\sqrt6}&0&\tfrac{-1}{\sqrt3}\end{pmatrix}\begin{pmatrix}\sqrt{12}&0\\0&0\\0&0\end{pmatrix}\begin{pmatrix}\tfrac{1}{\sqrt2}&\tfrac{-1}{\sqrt2}\\[1mm]\tfrac{1}{\sqrt2}&\tfrac{1}{\sqrt2}\end{pmatrix}$$
is a SVD, and
$$A^\sim=\begin{pmatrix}\tfrac{1}{\sqrt2}&\tfrac{1}{\sqrt2}\\[1mm]\tfrac{-1}{\sqrt2}&\tfrac{1}{\sqrt2}\end{pmatrix}\begin{pmatrix}\tfrac{1}{\sqrt{12}}&0&0\\0&0&0\end{pmatrix}\begin{pmatrix}\tfrac{1}{\sqrt6}&\tfrac{1}{\sqrt6}&\tfrac{2}{\sqrt6}\\[1mm]\tfrac{1}{\sqrt2}&\tfrac{-1}{\sqrt2}&0\\[1mm]\tfrac{1}{\sqrt3}&\tfrac{1}{\sqrt3}&\tfrac{-1}{\sqrt3}\end{pmatrix}=\tfrac{1}{12}\begin{pmatrix}1&1&2\\-1&-1&-2\end{pmatrix}$$
is the pseudoinverse. So for instance $A^\sim A=\tfrac12\begin{pmatrix}1&-1\\-1&1\end{pmatrix}$ does indeed orthogonally project onto $V_{\mathrm{row}}(A)=\mathrm{span}\left\{\begin{pmatrix}1\\-1\end{pmatrix}\right\}$.
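A numerical check of this example (a sketch; `np.linalg.pinv` computes $Q\Delta^\sim P^*$ internally via the SVD):

```python
import numpy as np

# Check of Example 21: the pseudoinverse and the projection A~ A.
A = np.array([[1., -1.],
              [1., -1.],
              [2., -2.]])

A_pinv = np.linalg.pinv(A)
print(np.allclose(A_pinv, np.array([[1, 1, 2], [-1, -1, -2]]) / 12))   # True
print(np.round(A_pinv @ A, 3))   # [[ 0.5 -0.5] [-0.5  0.5]] , projection onto span{(1,-1)}
```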

Applications to least squares and linear regression. Let $T\colon V\to W$ be an arbitrary linear transformation, with $V=\mathbb{C}^n$ and $W=\mathbb{C}^m$ equipped with the dot product, and $A:={}_{\hat e}[T]_{\hat e}$. Write $\vec v\in V$, $\vec w\in W$ and $[\vec v]_{\hat e}=\vec x$, $[\vec w]_{\hat e}=\vec y$.

DEFINITION 22. For any fixed $\vec w\in W$, a least-squares solution (LSS) to $T\vec v=\vec w$ (or equivalently $A\vec x=\vec y$) is any $\tilde v\in V$ minimizing $\|T\tilde v-\vec w\|$. The minimal LSS $\tilde v_{\min}$ is the (unique) LSS minimizing $\|\tilde v\|$.

The minimum distance from $\vec w$ to $\mathrm{im}(T)$ is, of course, the distance from $\vec w$ to the orthogonal projection $\tilde w:=\mathrm{Pr}_{\mathrm{im}(T)}\vec w$,

[Figure: $\vec w$, its orthogonal projection $\tilde w$ onto the subspace $\mathrm{im}(T)$, and a point $T\tilde v\in\mathrm{im}(T)$.]

and so $\tilde v$ is just any solution to

(8) $\qquad T\tilde v=\tilde w\quad(\iff A\tilde x=\tilde y).$

THEOREM 23. (i) The least-squares solutions of $T\vec v=\vec w$ are the solutions of the normal equations

(9) $\qquad T^\dagger T\tilde v=T^\dagger\vec w\quad(\iff A^*A\tilde x=A^*\vec y).$

(ii) The minimal LSS of $T\vec v=\vec w$ is

(10) $\qquad\tilde v_{\min}=T^\sim\vec w.$

PROOF. (i) Since $\mathrm{im}(T)^\perp=\ker(T^\dagger)$, $\tilde v$ satisfies (9) $\iff\vec w-T\tilde v\in\mathrm{im}(T)^\perp\iff\tilde v$ satisfies (8).

(ii) By Corollary 16, $\tilde w=TT^\sim\vec w$. So $\tilde v$ satisfies (8) $\iff\vec\kappa:=\tilde v-T^\sim\vec w\in\ker(T)$. But $\ker(T)\perp\mathrm{im}(T^\dagger)\implies\|\tilde v\|^2=\|T^\sim\vec w\|^2+\|\vec\kappa\|^2$ is minimized by $\vec\kappa=\vec 0$. $\square$

Now $T^\dagger T$ is invertible iff $T$ is 1-to-1 (cf. Exercise (1)), in which case $\tilde v\,(=\tilde v_{\min})$ is unique and the normal equations become

(11) $\qquad\tilde v=(T^\dagger T)^{-1}T^\dagger\tilde w\quad(\iff\tilde x=(A^*A)^{-1}A^*\vec y).$

If $\ker(T)\ne\{0\}$, then (9) is less convenient and (10) seems the better result. But it turns out that if $A$ is large, computationally (10) is more useful than the normal equations regardless of invertibility of $A^*A$. This is because the SVD gives

(12) $\qquad\tilde x_{\min}=A^\sim\vec y=Q\Delta^\sim P^*\vec y,$

and $\Delta^\sim$ involves inverting the $\{\mu_i\}$ whereas $(A^*A)^{-1}$ involves inverting the $\{\mu_i^2\}$. If some nonzero $\mu_i$'s differ by orders of magnitude, the computation of $(A^*A)^{-1}$ will therefore introduce significantly more error than (12). That said, the normal equations are typically simpler for small examples, as we'll now see.

EXAMPLE 24. We find the LSS for the linear system $A\vec x=\vec y$, where
$$A=\begin{pmatrix}1&-1\\1&0\\1&1\\1&2\end{pmatrix},\qquad\vec y=\begin{pmatrix}2\\0\\-2\\1\end{pmatrix}.$$
Compute
$${}^t\!A\,A=\begin{pmatrix}4&2\\2&6\end{pmatrix},\qquad({}^t\!A\,A)^{-1}=\tfrac{1}{10}\begin{pmatrix}3&-1\\-1&2\end{pmatrix}\implies\tilde x=({}^t\!A\,A)^{-1}\,{}^t\!A\,\vec y=\tfrac12\begin{pmatrix}1\\-1\end{pmatrix}.$$

Writing $\mu_\pm=\sqrt{5\pm\sqrt5}$, the SVD is
$$A=\begin{pmatrix}\frac{3-\sqrt5}{2\sqrt{10}}&\frac{3+\sqrt5}{2\sqrt{10}}&-\frac12&\frac{1}{2\sqrt5}\\[1mm]\frac{1-\sqrt5}{2\sqrt{10}}&\frac{1+\sqrt5}{2\sqrt{10}}&\frac12&-\frac{3}{2\sqrt5}\\[1mm]\frac{-1-\sqrt5}{2\sqrt{10}}&\frac{-1+\sqrt5}{2\sqrt{10}}&\frac12&\frac{3}{2\sqrt5}\\[1mm]\frac{-3-\sqrt5}{2\sqrt{10}}&\frac{-3+\sqrt5}{2\sqrt{10}}&-\frac12&-\frac{1}{2\sqrt5}\end{pmatrix}\begin{pmatrix}\mu_+&0\\0&\mu_-\\0&0\\0&0\end{pmatrix}\begin{pmatrix}-\frac{\sqrt2}{\mu_+}&-\frac{1+\sqrt5}{\sqrt2\,\mu_+}\\[1mm]\frac{\sqrt2}{\mu_-}&\frac{1-\sqrt5}{\sqrt2\,\mu_-}\end{pmatrix},$$
and replacing the middle factor by $\Delta^\sim=\mathrm{diag}_{2\times4}\{\mu_+^{-1},\mu_-^{-1}\}$ (and reversing the order of the outer factors, now transposed) gives
$$A^\sim=\tfrac{1}{10}\begin{pmatrix}4&3&2&1\\-3&-1&1&3\end{pmatrix}.$$

But using $A^\sim=({}^t\!A\,A)^{-1}\,{}^t\!A$ (see Exercise (1) below) is much easier!
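For completeness, a short numpy sketch (our own check, not part of the text) confirming that the normal equations, the pseudoinverse, and numpy's built-in least-squares routine all give $\tilde x=\tfrac12(1,-1)$ for Example 24:

```python
import numpy as np

# Verification of Example 24: three equivalent ways to get the LSS.
A = np.array([[1., -1.],
              [1., 0.],
              [1., 1.],
              [1., 2.]])
y = np.array([2., 0., -2., 1.])

x_normal = np.linalg.solve(A.T @ A, A.T @ y)     # (t A A)^{-1} t A y
x_pinv = np.linalg.pinv(A) @ y                   # A~ y, computed via the SVD
x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]   # numpy's built-in least squares

print(np.round(x_normal, 6), np.round(x_pinv, 6), np.round(x_lstsq, 6))
# all three give [ 0.5 -0.5]
```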

Note that this example can be seen as a data-fitting (linear regression) problem: the line minimizing the sum of squares of the vertical errors at the data points $(X_i,Y_i)=(-1,2),(0,0),(1,-2),(2,1)$ is $Y=\tfrac12-\tfrac12X$.

[Figure: the four data points and the least-squares line $Y=\tfrac12-\tfrac12X$ in the $(X,Y)$-plane.]

Here is a slightly more interesting example.

EXAMPLE 25. Having entered the sports business, Monsanto is engaged in trying to grow a better baseball player. They've tracked a few little-leaguers and got the data

[Figure: scatter plot of $(X_1,Y_1)=(0,1)$, $(X_2,Y_2)=(1,3)$, $(X_3,Y_3)=(2,3)$, $(X_4,Y_4)=(3,2)$ with vertical errors $\varepsilon_1,\dots,\varepsilon_4$; $Y=$ # of home runs / season, $X=$ tens of miles from the Arch]

which suggests a parabola¹⁰

$$Y=f(X)=b_0+b_1X+b_2X^2.$$

¹⁰More seriously, rather than looking at the shape of the data set, one should try to base the form of $f$ on physical principle (see for instance Exercise (2) below).

So write

$$Y_i=b_0+b_1X_i+b_2X_i^2+\varepsilon_i\qquad(i=1,2,3,4),$$
which translates to $\vec Y=A\vec\beta+\vec\varepsilon$, where

 2    1 X1 X1 1 0 0  1 X X2   1 1 1  =  2 2  =   A  2    .  1 X3 X3   1 2 4  2 1 X4 X4 1 3 9 Minimizing ||~ε|| = ||A~β − ~Y|| just means solving Aβe = Ye, or equiv- alently t AAβe = t A~Y which is         4 6 14 b˜0 9 1.05         6 14 36 b˜1 = 15 =⇒ βe = 2.55 .       RREF   14 36 98 b˜2 33 −0.75 So using f (X) = 1.05 + 2.55X − 0.75X2 =⇒ 0 = f 0(X) = 2.55 − 1.5X is maximized at X = 1.7. Therefore the ideal distance from the Arch for raising baseball stars must be 17 miles. Because our data wasn’t questionable at all and our sample size was YUGE.

Applications to data compression. One of the many other computational applications of the SVD is called Principal Component Analysis. Suppose in a country of $N=1$ million internet users, you want to target ads (or fake news) based on online behavior, where the number $M$ of possible clicks or purchases is also very large. Assume however that only $k$ (cultural, demographic, etc.) attributes of a person generally determine this behavior: that is, there exists an $N\times k$ "attribute matrix" and a $k\times M$ "behavior matrix" whose product is roughly $A$, the $N\times M$ matrix of raw user data. In other words, we expect $A$ to be well-approximated by a rank $k$ matrix, with $k$ much smaller than $M$ and $N$.

The SVD in the first form we encountered (cf. (3)) extends to non-square matrices: in view of (7),

(13) $\qquad\displaystyle A=\sum_{\ell=1}^r\mu_\ell\cdot\underset{N\times1}{\hat p_\ell}\cdot\underset{1\times M}{\hat q_\ell^*}$

where $r=\mathrm{rank}(A)$. It turns out that

(14) $\qquad\displaystyle A_k:=\sum_{\ell=1}^k\mu_\ell\cdot\hat p_\ell\cdot\hat q_\ell^*$

is the "best" rank $k$ approximation to $A$ (in the sense of minimizing the norm¹¹ $\|A-A_k\|$ among all rank $k$ $N\times M$ matrices). In the situation just described, we'd expect $A_k$ to be very close to $A$, while requiring us to record only $k(1+M+N)$ numbers (the $\mu_\ell$ and entries of $\hat p_\ell$, $\hat q_\ell$ for $\ell=1,\dots,k$), a significant data compression vs. the $MN$ entries of $A$. The formulas (13)-(14) are close in spirit to (discrete) Fourier analysis, which breaks a function down into its constituent frequency components -- viz., $A_{ij}=\sum_{\ell,\ell'}\alpha_{\ell,\ell'}\,e^{\frac{2\pi\sqrt{-1}\,i\ell}{M}}e^{\frac{2\pi\sqrt{-1}\,j\ell'}{N}}$. But the SVD allows for many fewer terms in (14), since the $\hat p_\ell$, $\hat q_\ell$ are not fixed trigonometric functions, but rather "adapted to" the matrix $A$. On the other hand, the discrete Fourier transform doesn't require recording $\hat p_\ell$ and $\hat q_\ell$, so has that advantage. Fortunately for computational applications, MATLAB has efficient algorithms for both!
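A minimal sketch of (13)-(14) in numpy (a small random matrix assumed for illustration): truncating the SVD after $k$ terms gives the best rank-$k$ approximation, and the operator norm of the error is $\mu_{k+1}$.

```python
import numpy as np

# Truncated SVD as a rank-k approximation / compression of A.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))

P, mu, Qstar = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = P[:, :k] @ np.diag(mu[:k]) @ Qstar[:k]             # sum of the first k rank-one terms

print(np.linalg.matrix_rank(A_k))                        # 2
print(np.isclose(np.linalg.norm(A - A_k, 2), mu[k]))     # operator norm of error = mu_{k+1}
```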

Exercises

(1) (a) Show that $T$ is 1-to-1 iff $T^\dagger T$ is invertible.
(b) Show that if $T$ is 1-to-1, then $T^\sim=(T^\dagger T)^{-1}T^\dagger$. (So in this case, (10) and (11) are "the same". However, the two formulas $(A^*A)^{-1}A^*$ and $Q\Delta^\sim P^*$ for $A^\sim$ really do have different strengths.)

¹¹$\|M\|:=\max_{\|\vec v\|=1}\|M\vec v\|$; in fact, $\|A-A_k\|=\mu_{k+1}$. In addition, $A_k$ minimizes the "stupid" norm $\|M\|:=\big(\sum_{i,j}|M_{ij}|^2\big)^{\frac12}$ for $M=A-A_k$.

(2) As luck would have it, at time $t=0$ bullies stuffed your backpack full of radioactive material from the Bridgeton landfill. Your legal team thinks there are two substances in there, with half-lives of one and five days respectively, but all they can do is measure the overall masses $M_1,M_2,\dots,M_n$ of the backpack at the end of days $1,2,\dots,n$. (The bullies also superglued it shut, you see.) Measurements have error, so you need to formulate a linear regression model in order to determine the original masses $A$, $B$ of the two substances at time $t=0$. This should involve an $n\times2$ matrix.

(3) Find a polar decomposition for
$$A=\begin{pmatrix}20&4&0\\0&0&1\\4&20&0\end{pmatrix}.$$

(4) Calculate the matrix SVD for
$$A=\begin{pmatrix}1&1&1&1\\1&0&-2&1\\1&-1&1&1\end{pmatrix}.$$

(5) Calculate the minimal least-squares solution of the system
$$\begin{aligned}x_1+x_2&=1\\x_1+x_2&=2\\-x_1-x_2&=0.\end{aligned}$$