
3 Singular Value Decomposition

Thus far we have focused on matrix factorizations that reveal the eigenvalues of a square matrix $A \in \mathbb{C}^{n\times n}$, such as the Schur factorization and the Jordan canonical form. Eigenvalue-based decompositions are ideal for analyzing the behavior of dynamical systems like $x'(t) = Ax(t)$ or $x_{k+1} = Ax_k$. When it comes to solving linear systems of equations or tackling more general problems in data science, eigenvalue-based factorizations are often not so illuminating. In this chapter we develop another decomposition that provides deep insight into the rank structure of a matrix, showing the way to solving all variety of linear equations and exposing optimal low-rank approximations.

3.1 Singular Value Decomposition

The singular value decomposition (SVD) is a remarkable factorization that writes a general rectangular matrix $A \in \mathbb{C}^{m\times n}$ in the form
$$A = (\text{unitary matrix}) \times (\text{diagonal matrix}) \times (\text{unitary matrix})^*.$$
From the unitary matrices we can extract bases for the four fundamental subspaces $R(A)$, $N(A)$, $R(A^*)$, and $N(A^*)$, and the diagonal matrix will reveal much about the rank structure of $A$. We will build up the SVD in a four-step process. For simplicity suppose that $A \in \mathbb{C}^{m\times n}$ with $m \ge n$. (If $m < n$, one can apply the same construction to $A^*$.)


Step 1. Using the spectral decomposition of a Hermitian matrix discussed in Section 1.5, $A^*A$ has $n$ eigenpairs $\{(\lambda_j, v_j)\}_{j=1}^{n}$ with orthonormal unit eigenvectors ($v_j^*v_j = 1$, $v_j^*v_k = 0$ when $j \neq k$). We are free to pick any convenient indexing for these eigenpairs; label the eigenvalues in decreasing magnitude, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$.

Step 2. Define $s_j := \|Av_j\|$. Note that $s_j^2 = \|Av_j\|^2 = v_j^*A^*Av_j = \lambda_j$. Since the eigenvalues $\lambda_1,\ldots,\lambda_n$ are decreasing in magnitude, so are the $s_j$ values: $s_1 \ge s_2 \ge \cdots \ge s_n \ge 0$.

Step 3. Next, we will build a set of related orthonormal vectors in $\mathbb{C}^m$. Suppose we have already constructed such vectors $u_1,\ldots,u_{j-1}$. If $s_j \neq 0$, then define $u_j = s_j^{-1}Av_j$, so that $\|u_j\| = s_j^{-1}\|Av_j\| = 1$. If $s_j = 0$, then pick $u_j$ to be any unit vector such that

$$u_j \in \mathrm{span}\{u_1,\ldots,u_{j-1}\}^\perp;$$
i.e., ensure $u_j^*u_k = 0$ for all $k < j$. When $s_j$ and $s_k$ are both nonzero, we can compute
$$u_j^*u_k = \frac{1}{s_js_k}(Av_j)^*(Av_k) = \frac{1}{s_js_k}\,v_j^*A^*Av_k = \frac{\lambda_k}{s_js_k}\,v_j^*v_k,$$
where we used the fact that $v_k$ is an eigenvector of $A^*A$. Now if $j \neq k$, then $v_j^*v_k = 0$, and hence $u_j^*u_k = 0$. On the other hand, $j = k$ implies that $v_j^*v_k = 1$, so $u_j^*u_k = \lambda_j/s_j^2 = 1$. In conclusion, we have constructed a set of orthonormal vectors $\{u_j\}_{j=1}^{n}$ with $u_j \in \mathbb{C}^m$.

Step 4. For all $j = 1,\ldots,n$,

$$Av_j = s_ju_j,$$

regardless of whether $s_j = 0$ or not. We can stack these $n$ vector equations as columns of a single matrix equation,

$$\begin{bmatrix} Av_1 & Av_2 & \cdots & Av_n \end{bmatrix} = \begin{bmatrix} s_1u_1 & s_2u_2 & \cdots & s_nu_n \end{bmatrix}.$$
(Footnote: if $s_j = 0$, then $\lambda_j = 0$, and so $A^*A$ has a zero eigenvalue; i.e., this matrix is singular.)


Note that both matrices in this equation can be factored into the product of simpler matrices:

$$A\begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix}\begin{bmatrix} s_1 & & \\ & \ddots & \\ & & s_n \end{bmatrix}.$$
Denote these matrices as $AV = \widehat{U}\widehat{\Sigma}$, where $A \in \mathbb{C}^{m\times n}$, $V \in \mathbb{C}^{n\times n}$, $\widehat{U} \in \mathbb{C}^{m\times n}$, and $\widehat{\Sigma} \in \mathbb{C}^{n\times n}$. The $(j,k)$ entry of $V^*V$ is simply $v_j^*v_k$, and so $V^*V = I$. Since $V$ is a square matrix, we have just proved that it is unitary, and hence $VV^* = I$ as well. We conclude that
$$A = \widehat{U}\widehat{\Sigma}V^*.$$

This matrix factorization is known as the reduced singular value decomposition or the economy-sized singular value decomposition (or, informally, the skinny SVD). It can be obtained via the MATLAB command

[Uhat, Sighat, V] = svd(A,'econ');
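As a concrete check on the four-step construction above, here is a minimal MATLAB sketch (not from the text; the test matrix is random and assumed to have full column rank, so every s_j > 0, and the variable names simply echo the command above):

m = 8; n = 5;
A = randn(m,n) + 1i*randn(m,n);   % random test matrix, full column rank with probability 1

B = (A'*A + (A'*A)')/2;           % Step 1: the Hermitian matrix A'*A (symmetrized for eig)
[V,L] = eig(B);
[lam,idx] = sort(real(diag(L)),'descend');
V = V(:,idx);                     % orthonormal eigenvectors, eigenvalues decreasing

s = sqrt(max(lam,0));             % Step 2: s_j = ||A v_j|| = sqrt(lambda_j)
Uhat = A*V*diag(1./s);            % Step 3: u_j = A v_j / s_j  (every s_j > 0 here)
Sighat = diag(s);                 % Step 4: A V = Uhat*Sighat, so A = Uhat*Sighat*V'

norm(A - Uhat*Sighat*V')          % should be at machine-precision level
norm(svd(A) - s)                  % matches the singular values MATLAB computes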

The Reduced Singular Value Decomposition

Theorem 3.1. Any matrix $A \in \mathbb{C}^{m\times n}$ with $m \ge n$ can be written as

$$A = \widehat{U}\widehat{\Sigma}V^*,$$

where $\widehat{U} \in \mathbb{C}^{m\times n}$ has orthonormal columns, $V \in \mathbb{C}^{n\times n}$ is unitary, and $\widehat{\Sigma} = \mathrm{diag}(s_1,\ldots,s_n) \in \mathbb{R}^{n\times n}$ has real nonnegative decreasing entries. The columns of $\widehat{U}$ are left singular vectors, the columns of $V$ are right singular vectors, and the values $s_1,\ldots,s_n$ are the singular values.

While the matrix $\widehat{U}$ has orthonormal columns, it is not a unitary matrix when $m > n$. In particular, we have $\widehat{U}^*\widehat{U} = I \in \mathbb{C}^{n\times n}$, but
$$\widehat{U}\widehat{U}^* \in \mathbb{C}^{m\times m}$$

cannot be the identity unless $m = n$. (To see this, note that $\widehat{U}\widehat{U}^*$ is an orthogonal projection onto $R(\widehat{U}) = \mathrm{span}\{u_1,\ldots,u_n\}$. Since $\dim(R(\widehat{U})) = n$, this projection cannot equal the $m$-by-$m$ identity matrix when $m > n$.)


Though $\widehat{U}$ is not unitary, it is subunitary. We can construct $m - n$ additional column vectors to append to $\widehat{U}$ to make it unitary. Here is the recipe: for $j = n+1,\ldots,m$, pick
$$u_j \in \mathrm{span}\{u_1,\ldots,u_{j-1}\}^\perp$$

with $u_j^*u_j = 1$. Then define

$$U = \begin{bmatrix} u_1 & u_2 & \cdots & u_m \end{bmatrix}.$$
Confirm that $U^*U = UU^* = I \in \mathbb{C}^{m\times m}$, showing that $U$ is unitary. We wish to replace the $\widehat{U}$ in the reduced SVD with the unitary matrix $U$. To do so, we also need to replace $\widehat{\Sigma}$ by some $\Sigma$ in such a way that $U\Sigma = \widehat{U}\widehat{\Sigma}$. The simplest approach constructs $\Sigma$ by appending zeros to the end of $\widehat{\Sigma}$, thus ensuring there is no contribution when the new entries of $U$ multiply against the new entries of $\Sigma$:
$$\Sigma = \begin{bmatrix} \widehat{\Sigma} \\ 0 \end{bmatrix} \in \mathbb{R}^{m\times n}.$$
The factorization $A = U\Sigma V^*$ is called the full singular value decomposition.

The Full Singular Value Decomposition

Theorem 3.2. Any matrix $A \in \mathbb{C}^{m\times n}$ can be written in the form

$$A = U\Sigma V^*,$$

where $U \in \mathbb{C}^{m\times m}$ and $V \in \mathbb{C}^{n\times n}$ are unitary matrices and $\Sigma \in \mathbb{R}^{m\times n}$ is zero except for the main diagonal. The columns of $U$ are left singular vectors, the columns of $V$ are right singular vectors, and the values $s_1,\ldots,s_{\min\{m,n\}}$ are the singular values.
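In MATLAB, both factorizations come from the same command; a brief illustrative sketch (random matrix, names chosen for this example):

m = 8; n = 5;
A = randn(m,n);

[Uhat,Sighat,V1] = svd(A,'econ');   % reduced: Uhat is m-by-n, Sighat is n-by-n
[U,Sigma,V2]     = svd(A);          % full:    U is m-by-m,    Sigma is m-by-n

norm(Sigma(n+1:m,:))                % the appended rows of Sigma are zero
norm(U'*U - eye(m))                 % U is unitary
norm(A - U*Sigma*V2')               % the full SVD reproduces A
norm(A - Uhat*Sighat*V1')           % so does the reduced SVD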

A third version of the singular value decomposition is often very helpful. Start with the reduced SVD $A = \widehat{U}\widehat{\Sigma}V^*$ in Theorem 3.1. Multiply $\widehat{U}\widehat{\Sigma}$ together to get
$$A = \widehat{U}\widehat{\Sigma}V^* = \begin{bmatrix} s_1u_1 & \cdots & s_nu_n \end{bmatrix}\begin{bmatrix} v_1^* \\ \vdots \\ v_n^* \end{bmatrix} = \sum_{j=1}^{n} s_ju_jv_j^*,$$


which renders $A$ as the sum of outer products $u_jv_j^* \in \mathbb{C}^{m\times n}$, weighted by nonnegative numbers $s_j$. Let $r$ denote the number of nonzero singular values, so that if $r < n$, then $s_{r+1} = \cdots = s_n = 0$ and the trailing terms of the sum contribute nothing. Thus we can always write

$$A = \sum_{j=1}^{r} s_ju_jv_j^*, \qquad (3.1)$$
known as the dyadic form of the SVD.

The Dyadic Form of the Singular Value Decomposition

Theorem 3.3. For any matrix $A \in \mathbb{C}^{m\times n}$, there exists some $r \in \{1,\ldots,n\}$ such that

$$A = \sum_{j=1}^{r} s_ju_jv_j^*,$$
where $s_1 \ge s_2 \ge \cdots \ge s_r > 0$ and $u_1,\ldots,u_r \in \mathbb{C}^m$ are orthonormal, and $v_1,\ldots,v_r \in \mathbb{C}^n$ are orthonormal.

Corollary 3.4. The rank of a matrix equals its number of nonzero singular values.
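A quick numerical illustration of Corollary 3.4 (an illustrative sketch; the matrix and the tolerance are not from the text):

m = 10; n = 7; r = 3;
A = randn(m,r)*randn(r,n);    % a product of rank-3 factors, so rank(A) = 3

s = svd(A);
tol = max(m,n)*eps(s(1));     % treat singular values below this tolerance as zero
sum(s > tol)                  % number of nonzero singular values: returns 3
rank(A)                       % agrees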

The singular value decomposition also gives an immediate formula for the 2-norm of a matrix.

Theorem 3.5. Let $A \in \mathbb{C}^{m\times n}$ have singular value decomposition (3.1). Then
$$\|A\| = \max_{x\neq 0}\frac{\|Ax\|}{\|x\|} = s_1.$$

Proof. The proof follows from the construction of the SVD at the start of this chapter. Note that
$$\|A\|^2 = \max_{x\neq 0}\frac{\|Ax\|^2}{\|x\|^2} = \max_{x\neq 0}\frac{x^*A^*Ax}{x^*x},$$
and so $\|A\|^2$ is the maximum Rayleigh quotient of the Hermitian matrix $A^*A$. By Theorem 2.2, this maximum value is the largest eigenvalue of


$A^*A$, i.e., $\lambda_1 = s_1^2$ in "Step 2" of the construction on page 92. Thus $\|A\| = \sqrt{\lambda_1} = s_1$.
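A minimal MATLAB check of Theorem 3.5 (illustrative sketch with a random matrix):

A = randn(6,4);
s = svd(A);

norm(A) - s(1)                % the 2-norm equals the largest singular value

[~,~,V] = svd(A);
norm(A*V(:,1)) - s(1)         % the maximum of ||Ax||/||x|| is attained at x = v_1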

3.1.1 Inductive Proof of the SVD

3.1.2 The SVD and the Four Fundamental Subspaces

3.2 The SVD: Undoer of Many Knots

In the introduction to this chapter, we claimed that eigenvalue-based decompositions were the right tool for handling systems that involved dynamics. The SVD, in turn, is the perfect tool for handling static systems, i.e., systems that do not change with time. We identify three canonical static problems:

1. Find the unique solution $x$ to $Ax = b$, where $A$ is an invertible square matrix.

2. Find the solution $x$ of minimum norm that solves the underdetermined system $Ax = b$, where $A$ is a matrix with a nontrivial null space and $b \in R(A)$.

3. Find the minimum-norm vector $x$ that minimizes $\|Ax - b\|$, for any given $b$.

Notice that this last problem subsumes the first two. (If $Ax = b$ has a solution $x$, then $\|Ax - b\| = 0$ is minimal; if there are multiple solutions, one then finds the one having smallest norm.) Thus, our discussion will focus on problem 3.

Let $A \in \mathbb{C}^{m\times n}$ have rank $r$ and let $u_1,\ldots,u_m \in \mathbb{C}^m$ be a full set of left singular vectors, giving an orthonormal basis for $\mathbb{C}^m$ with

$$R(A) = \mathrm{span}\{u_1,\ldots,u_r\}, \qquad N(A^*) = \mathrm{span}\{u_{r+1},\ldots,u_m\}.$$
Expand $b \in \mathbb{C}^m$ as a linear combination of the left singular vectors:
$$b = \beta_1u_1 + \cdots + \beta_mu_m.$$

We can find the coefficients $\beta_j$ very easily, since the orthonormality of the singular vectors gives

$$u_j^*b = \beta_1u_j^*u_1 + \cdots + \beta_mu_j^*u_m = \beta_j.$$


Now for any $x \in \mathbb{C}^n$, notice that $Ax \in R(A)$, and so
$$Ax - b = \Big(Ax - (\beta_1u_1 + \cdots + \beta_ru_r)\Big) - \Big(\beta_{r+1}u_{r+1} + \cdots + \beta_mu_m\Big) = (Ax - b_R) - b_N,$$
where

$$b_R := \beta_1u_1 + \cdots + \beta_ru_r \in R(A), \qquad b_N := \beta_{r+1}u_{r+1} + \cdots + \beta_mu_m \in N(A^*).$$
Since $R(A) \perp N(A^*)$ (by the Fundamental Theorem of Linear Algebra), $(Ax - b_R) \perp b_N$, and hence, by the Pythagorean Theorem,
$$\|Ax - b\|^2 = \|(Ax - b_R) - b_N\|^2 = \|Ax - b_R\|^2 + \|b_N\|^2. \qquad (3.2)$$
Inspect this expression. The choice of $x$ does not affect $\|b_N\|^2$: $b_N$ is the piece of $b$ that is beyond the reach of $Ax$. (If $Ax = b$ has a solution, then $b_N = 0$.) To minimize (3.2), the best we can do is find $x$ such that $\|Ax - b_R\| = 0$. The dyadic form of the SVD,

$$A = \sum_{j=1}^{r} s_ju_jv_j^*, \qquad (3.3)$$
makes quick work of this problem. Expand any $x \in \mathbb{C}^n$ in the right singular vectors,
$$x = \xi_1v_1 + \cdots + \xi_nv_n,$$
so that (via orthonormality of the singular vectors),

$$Ax = \sum_{j=1}^{r} s_ju_jv_j^*x = \sum_{j=1}^{r} s_j\xi_ju_j.$$
We can equate this expression with

$$b_R = \beta_1u_1 + \cdots + \beta_ru_r = (u_1^*b)u_1 + \cdots + (u_r^*b)u_r \qquad (3.4)$$

by simply matching the coefficients in the u1,...,ur directions:

$$s_k\xi_k = u_k^*b, \qquad k = 1,\ldots,r,$$


giving
$$\xi_k = \frac{u_k^*b}{s_k}, \qquad k = 1,\ldots,r.$$

What about $\xi_{r+1},\ldots,\xi_n$? They can take any value! To confirm this fact, define
$$x = \sum_{k=1}^{r} \frac{u_k^*b}{s_k}\,v_k + \sum_{k=r+1}^{n} \xi_kv_k \qquad (3.5)$$
for any $\xi_{r+1},\ldots,\xi_n$, and verify that $Ax = b_R$:

$$\begin{aligned}
Ax &= \bigg(\sum_{j=1}^{r} s_ju_jv_j^*\bigg)\bigg(\sum_{k=1}^{r} \frac{u_k^*b}{s_k}\,v_k + \sum_{k=r+1}^{n} \xi_kv_k\bigg) \\
   &= \sum_{j=1}^{r}\sum_{k=1}^{r} s_j\frac{u_k^*b}{s_k}\,u_jv_j^*v_k + \sum_{j=1}^{r}\sum_{k=r+1}^{n} s_j\xi_ku_jv_j^*v_k \\
   &= \sum_{j=1}^{r} (u_j^*b)u_j + 0 = b_R,
\end{aligned}$$

according to the expansion (3.4) for $b_R$. Thus when $r < n$, the vector $x$ in (3.5) minimizes $\|Ax - b\|$ no matter how the coefficients $\xi_{r+1},\ldots,\xi_n$ are chosen: the minimizer is not unique. To single out the minimum-norm solution, note that the orthonormality of the right singular vectors gives
$$\|x\|^2 = \sum_{k=1}^{r}\frac{|u_k^*b|^2}{s_k^2} + \sum_{k=r+1}^{n}|\xi_k|^2,$$
making it obvious that the unique norm-minimizing solution is

$$x = \sum_{j=1}^{r} \frac{u_j^*b}{s_j}\,v_j = \bigg(\sum_{j=1}^{r} \frac{1}{s_j}\,v_ju_j^*\bigg)b. \qquad (3.6)$$
Take a moment to savor this beautiful formula, arguably one of the most important in matrix theory! In fact, the formula is so useful that the matrix on the right-hand side deserves some special nomenclature.


Pseudoinverse

Definition 3.6. Let $A \in \mathbb{C}^{m\times n}$ be a matrix of rank $r$ with singular value decomposition (3.3). Then the pseudoinverse (or Moore–Penrose pseudoinverse) of $A$ is given by

$$A^+ = \sum_{j=1}^{r}\frac{1}{s_j}\,v_ju_j^*. \qquad (3.7)$$

We emphasize that $x = A^+b$ in (3.6) solves all three of the problems formulated at the beginning of this section.

1. If $A$ is invertible, then $m = n = r$ and
$$x = A^{-1}b = \bigg(\sum_{j=1}^{n}\frac{1}{s_j}\,v_ju_j^*\bigg)b.$$
In particular, whenever $A$ is invertible, $A^+ = A^{-1}$.

2. If $b \in R(A)$, then $b_R = b$, and
$$x = A^+b = \bigg(\sum_{j=1}^{r}\frac{1}{s_j}\,v_ju_j^*\bigg)b$$
solves $Ax = b$. If $r < n$, then $x$ is the solution of minimum norm.

3. For any $b$, the vector $x = A^+b$ is the minimum-norm minimizer of $\|Ax - b\|$.
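The following MATLAB sketch compares formula (3.6) with the built-in pinv for a deliberately rank-deficient matrix (the sizes and the rank are illustrative choices, and the rank r is known here by construction):

m = 8; n = 6; r = 4;
A = randn(m,r)*randn(r,n);    % rank r < n, so A has a nontrivial null space
b = randn(m,1);

[U,S,V] = svd(A);
s = diag(S);

x = zeros(n,1);
for j = 1:r                   % x = sum_{j=1}^r (u_j'*b/s_j) v_j, as in (3.6)
    x = x + (U(:,j)'*b)/s(j)*V(:,j);
end

norm(x - pinv(A)*b)           % agrees with the pseudoinverse solution
norm(A'*(A*x - b))            % the residual A*x - b lies in N(A*)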

3.3 Optimal Low-Rank Approximations

Many applications call for us to approximate $A$ with some low-rank matrix. Suppose we have the SVD

$$A = \sum_{j=1}^{r} s_ju_jv_j^*, \qquad (3.8)$$


and we seek some rank-$k$ approximation to $A$, for some $1 \le k < r$. Since the singular values are ordered $s_1 \ge s_2 \ge \cdots \ge s_r > 0$, one might naturally grab the leading $k$ terms from the decomposition (3.8):

$$A_k = \sum_{j=1}^{k} s_ju_jv_j^*.$$
How good an approximation is $A_k$ to $A$? Notice that
$$A - A_k = \sum_{j=k+1}^{r} s_ju_jv_j^*$$
is a singular value decomposition for the error $A - A_k$, and hence by Theorem 3.5 its norm equals the largest singular value in the decomposition:

$$\|A - A_k\| = s_{k+1}. \qquad (3.9)$$
Can we construct a better rank-$k$ approximation? The following theorem, one of the most important in matrix theory, says that we cannot. (This is known as the Schmidt–Eckart–Young–Mirsky Theorem, after its various discoverers.)

Optimal Low-Rank Approximation

Theorem 3.7. Let $A \in \mathbb{C}^{m\times n}$ be a rank-$r$ matrix with singular value decomposition

$$A = \sum_{j=1}^{r} s_ju_jv_j^*,$$
and let $k < r$. Then

$$\min_{\mathrm{rank}(X)\le k}\|A - X\| = s_{k+1},$$

and this minimum is attained by

$$A_k = \sum_{j=1}^{k} s_ju_jv_j^*. \qquad (3.10)$$

Proof. We know from (3.9) that $\|A - A_k\| = s_{k+1}$, and so this proof must show that for any rank-$k$ matrix $X$, we have $\|A - X\| \ge s_{k+1}$.


Let $X \in \mathbb{C}^{m\times n}$ be an arbitrary rank-$k$ matrix. To show $\|A - X\| \ge s_{k+1}$, it suffices to identify some unit vector $z \in \mathbb{C}^n$ for which $\|(A - X)z\| \ge s_{k+1}$, since
$$\|A - X\| = \max_{\|v\|=1}\|(A - X)v\| \ge \|(A - X)z\|.$$
Since $X \in \mathbb{C}^{m\times n}$ has rank $k$, we must have
$$\dim(N(X)) = n - k.$$
By the orthogonality of the right singular vectors of $A$, we have

$$\dim(\mathrm{span}\{v_1,\ldots,v_{k+1}\}) = k+1.$$
The sum of the dimensions of these two spaces exceeds $n$, since $(n-k) + (k+1) = n+1$, and hence the intersection of these spaces must have dimension 1 or greater:
$$\dim\Big(N(X) \cap \mathrm{span}\{v_1,\ldots,v_{k+1}\}\Big) \ge 1.$$
Thus we can find some unit vector in this intersection,

$$z \in N(X) \cap \mathrm{span}\{v_1,\ldots,v_{k+1}\}$$
with $\|z\| = 1$. We will use the fact that $z$ is in each of these spaces in turn. First, since $z \in N(X)$ we must have $Xz = 0$, so
$$\|A - X\| \ge \|(A - X)z\| = \|Az\|.$$
Now since $z \in \mathrm{span}\{v_1,\ldots,v_{k+1}\}$ we can expand
$$z = c_1v_1 + \cdots + c_{k+1}v_{k+1}.$$
By the Pythagorean Theorem (and the orthonormality of the $\{v_j\}$),
$$1 = \|z\|^2 = |c_1|^2 + \cdots + |c_{k+1}|^2. \qquad (3.11)$$
The orthonormality of the $\{v_j\}$ also implies that

$$Az = \bigg(\sum_{j=1}^{r} s_ju_jv_j^*\bigg)\bigg(\sum_{\ell=1}^{k+1} c_\ell v_\ell\bigg) = \sum_{j=1}^{r}\sum_{\ell=1}^{k+1} s_jc_\ell u_jv_j^*v_\ell = \sum_{j=1}^{k+1} s_jc_ju_j.$$
Again by the Pythagorean Theorem (now with orthonormality of the $\{u_j\}$) we have
$$\|Az\|^2 = \sum_{j=1}^{k+1} s_j^2|c_j|^2 \ge s_{k+1}^2\sum_{j=1}^{k+1} |c_j|^2 = s_{k+1}^2,$$


where the inequality follows from the decaying magnitudes of the singular values, $s_1 \ge \cdots \ge s_k \ge s_{k+1}$, and the last step uses the formula (3.11). We conclude that

$$\|A - X\| = \max_{\|v\|=1}\|(A - X)v\| \ge \|(A - X)z\| = \|Az\| \ge s_{k+1},$$
thus establishing the theorem.

One must wonder if the partial sum

$$A_k = \sum_{j=1}^{k} s_ju_jv_j^*$$
delivers a unique best rank-$k$ approximation. That is not generally the case: if $k < r$, other rank-$k$ matrices achieve the same error. For example, perturb the leading singular values to form
$$\widehat{A}_k = \sum_{j=1}^{k} \widehat{s}_ju_jv_j^*$$
for some $\widehat{s}_1,\ldots,\widehat{s}_k$, so that

$$A - \widehat{A}_k = \sum_{j=1}^{k} (s_j - \widehat{s}_j)u_jv_j^* + \sum_{j=k+1}^{r} s_ju_jv_j^*.$$
This is like a singular value decomposition for $A - \widehat{A}_k$, except the "singular values" $s_j - \widehat{s}_j$ could potentially be negative. One can show that the norm of the misfit $A - \widehat{A}_k$ is the largest magnitude of these quantities and the untouched $s_j$ for $j = k+1,\ldots,r$:
$$\|A - \widehat{A}_k\| = \max\Big\{|s_1 - \widehat{s}_1|, \ldots, |s_k - \widehat{s}_k|, s_{k+1}\Big\}.$$
So long as we pick $\widehat{s}_1,\ldots,\widehat{s}_k$ so that

$$\max_{1\le j\le k} |s_j - \widehat{s}_j| \le s_{k+1},$$
then $\|A - \widehat{A}_k\| = s_{k+1}$, and $\widehat{A}_k$ is another best rank-$k$ approximation to $A$. (However, $A_k$ is the only one of these that leaves a rank $r - k$ misfit $A - A_k$.)
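A brief MATLAB illustration of Theorem 3.7 and the non-uniqueness just described (the matrix, the value of k, and the perturbation are illustrative choices):

A = randn(9,6);
[U,S,V] = svd(A);
s = diag(S);
k = 3;

Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';      % the truncated SVD (3.10)
norm(A - Ak) - s(k+1)                    % error equals s_{k+1}, as in (3.9)

shat = s(1:k) - 0.5*s(k+1);              % perturb each leading s_j by 0.5*s_{k+1}
Akhat = U(:,1:k)*diag(shat)*V(:,1:k)';   % a different rank-k matrix
norm(A - Akhat) - s(k+1)                 % same error: another best approximation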


3.4 The Polar Decomposition

Throughout our studies we have fluently translated basic properties of scalars into their matrix analogues: $a \pm b$ effortlessly becomes $A \pm B$; the multiplication $ab$ becomes $AB$, but with the caveat that the product does not in general commute; the inversion $1/b$ becomes $B^{-1}$, and division by $b = 0$ corresponds to singular $B$. Even magnitude generalizes: $|a|$ becomes $\|A\|$. The singular value decomposition allows one more construction in the same vein. Any $a \in \mathbb{C}$ can be written in the polar form $a = re^{i\theta}$ for $r \ge 0$ and $\theta \in [0,2\pi)$: that is, $a$ is the product of a nonnegative number $r$ and a number with magnitude 1, since $|e^{i\theta}| = 1$.

Suppose that $A \in \mathbb{C}^{m\times n}$ with $m \ge n$. Into the skinny SVD $A = \widehat{U}\widehat{\Sigma}V^*$ splice the identity matrix in the form $V^*V = I$:
$$A = \widehat{U}(V^*V)\widehat{\Sigma}V^* = (\widehat{U}V^*)(V\widehat{\Sigma}V^*) =: ZR,$$
where $Z := \widehat{U}V^* \in \mathbb{C}^{m\times n}$ is subunitary and $R := V\widehat{\Sigma}V^* \in \mathbb{C}^{n\times n}$ is Hermitian positive semidefinite. (Note that $R$ is constructed via a unitary diagonalization whose central matrix $\widehat{\Sigma}$ has nonnegative diagonal entries: hence $R$ must be Hermitian positive semidefinite.) In the polar form $a = re^{i\theta}$, the unit-length scalar $e^{i\theta}$ generalizes to the subunitary $Z$, while the nonnegative scalar $r \ge 0$ generalizes to the Hermitian semidefinite $R$.

Polar Decomposition

Theorem 3.8. Any matrix $A \in \mathbb{C}^{m\times n}$, $m \ge n$, can be written as
$$A = ZR$$

for subunitary $Z \in \mathbb{C}^{m\times n}$ and Hermitian positive semidefinite $R \in \mathbb{C}^{n\times n}$.

Notice that $m \ge n$ was necessary for $Z$ to be subunitary, since
$$Z^*Z = V\widehat{U}^*\widehat{U}V^* = VV^*$$
can only equal the identity if $V$ is a square unitary matrix. (Also note that $Z \in \mathbb{C}^{m\times n}$, and one cannot have a matrix with $n$ orthonormal columns of length $m$ if $n > m$.) If $m \le n$, we can alternatively write
$$A = \widehat{U}\widehat{\Sigma}V^* = \widehat{U}\widehat{\Sigma}(\widehat{U}^*\widehat{U})V^* = (\widehat{U}\widehat{\Sigma}\widehat{U}^*)(\widehat{U}V^*) =: RZ^*,$$
now with $R := \widehat{U}\widehat{\Sigma}\widehat{U}^* \in \mathbb{C}^{m\times m}$ Hermitian positive semidefinite and $Z := V\widehat{U}^* \in \mathbb{C}^{n\times m}$ subunitary.
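A minimal MATLAB sketch of the polar decomposition built from the reduced SVD, for the m >= n case (illustrative, with a random matrix):

m = 7; n = 4;
A = randn(m,n);

[Uhat,Sighat,V] = svd(A,'econ');
Z = Uhat*V';                 % subunitary factor
R = V*Sighat*V';             % Hermitian positive semidefinite factor

norm(A - Z*R)                % A = Z*R up to rounding
norm(Z'*Z - eye(n))          % the columns of Z are orthonormal
min(eig((R+R')/2))           % the eigenvalues of R are nonnegative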


3.5 Variational Characterization of Singular Values

Since the singular values are square roots of the eigenvalues of the Hermitian matrices $A^*A$ and $AA^*$, the singular values inherit the variational characterizations that were explored in Section 2.2. For example,

$$s_1 = \max_{v\in\mathbb{C}^n}\left(\frac{v^*A^*Av}{v^*v}\right)^{1/2} = \max_{u\in\mathbb{C}^m}\left(\frac{u^*AA^*u}{u^*u}\right)^{1/2},$$
with the leading right and left singular vectors $v_1$ and $u_1$ being unit vectors that attain these maxima.

However, the singular values also satisfy a subtler variational property that incorporates both left and right singular vectors at the same time. Consider, for unit vectors $u \in \mathbb{C}^m$ and $v \in \mathbb{C}^n$, the quantity

$$|u^*Av| \le \|u\|\,\|A\|\,\|v\| = s_1,$$
using the Cauchy–Schwarz inequality and the definition of the induced matrix 2-norm. On the other hand, the leading singular vectors $u_1$ and $v_1$ are unit vectors that give

$$|u_1^*Av_1| = |u_1^*(s_1u_1)| = s_1.$$
Thus
$$s_1 = \max_{u\in\mathbb{C}^m,\,v\in\mathbb{C}^n}\frac{|u^*Av|}{\|u\|\,\|v\|}.$$
We can characterize subsequent singular values the same way. Recall the dyadic version of the singular value decomposition,

$$A = \sum_{j=1}^{r} s_ju_jv_j^*$$
for $A$ of rank $r$. If we restrict the unit vectors $u$ and $v$ to be orthogonal to $u_1$ and $v_1$, then

$$u^*Av = \sum_{j=1}^{r} s_ju^*u_jv_j^*v = \sum_{j=2}^{r} s_ju^*u_jv_j^*v = u^*\bigg(\sum_{j=2}^{r} s_ju_jv_j^*\bigg)v.$$
Hence
$$|u^*Av| \le s_2,$$
with the inequality attained when $u = u_2$ and $v = v_2$. Continuing this process gives the following analogue of Theorem 2.2.


Theorem 3.9. For any $A \in \mathbb{C}^{m\times n}$,
$$s_k = \max_{\substack{u\,\perp\,\mathrm{span}\{u_1,\ldots,u_{k-1}\} \\ v\,\perp\,\mathrm{span}\{v_1,\ldots,v_{k-1}\}}} \frac{|u^*Av|}{\|u\|\,\|v\|}.$$
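A numerical spot check of Theorem 3.9 for k = 2, using random trial vectors projected orthogonal to u_1 and v_1 (an illustrative sketch, not a proof):

A = randn(7,5);
[U,S,V] = svd(A);
s = diag(S);

abs(U(:,2)'*A*V(:,2)) - s(2)           % the maximum is attained at u_2, v_2

vals = zeros(1000,1);
for t = 1:1000
    u = randn(7,1); u = u - U(:,1)*(U(:,1)'*u);   % force u perpendicular to u_1
    v = randn(5,1); v = v - V(:,1)*(V(:,1)'*v);   % force v perpendicular to v_1
    vals(t) = abs(u'*A*v)/(norm(u)*norm(v));
end
max(vals) <= s(2) + 1e-12              % no trial pair exceeds s_2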

3.6 Principal Component Analysis

Matrix theory enables the analysis of the volumes of data that now so commonly arise from applications ranging from basic science to public policy. Such measured data often depends on many factors, and we seek to identify those that are most critical. Within this realm of multivariate statistics, principal component analysis (PCA) is a fundamental tool. Linear algebraists often say, "PCA is the SVD" – in this section, we will explain what this means, and some of the subtleties involved.

3.6.1 Variance and covariance

To understand principal component analysis, we need some basic notions from statistics, described in any basic textbook. For a general description of PCA along with numerous applications, see the text by Jolliffe [Jol02], whose presentation shaped parts of our discussion here.

The expected value, or mean, of a random variable $X$ is denoted $E[X]$. The expected value is a linear function, so for any constants $\alpha, \beta \in \mathbb{R}$,
$$E[\alpha X + \beta] = \alpha E[X] + \beta.$$
The variance of $X$ describes how much $X$ is expected to deviate from its mean,
$$\mathrm{Var}(X) = E[(X - E[X])^2],$$
which, using linearity of the expected value, takes the equivalent form
$$\mathrm{Var}(X) = E[X^2] - E[X]^2.$$
The covariance between two (potentially correlated) random variables $X$ and $Y$ is
$$\mathrm{Cov}(X,Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y],$$
with $\mathrm{Cov}(X,X) = \mathrm{Var}(X)$. These definitions of variance and covariance are the bedrock concepts underneath PCA, for with them we can understand the variance present in a linear combination of several random variables.


Suppose we have a set of real-valued random variables $X_1,\ldots,X_n$ in which we suspect there may be some redundancy. Perhaps some of these variables can be expressed as linear combinations of the others – either exactly, or nearly so. At the other extreme, there may be some way to combine $X_1,\ldots,X_n$ that captures much of the variance in one (or a few) aggregate random variables. In particular, we shall seek scalars $\gamma_1,\ldots,\gamma_n$ such that
$$\sum_{j=1}^{n}\gamma_jX_j$$
has the largest possible variance. The definitions of variance and covariance, along with the linearity of the expected value, lead to a formula for the variance of a linear combination of random variables:
$$\mathrm{Var}\Big(\sum_{j=1}^{n}\gamma_jX_j\Big) = \sum_{j=1}^{n}\sum_{k=1}^{n}\gamma_j\gamma_k\,\mathrm{Cov}(X_j,X_k). \qquad (3.12)$$
You have seen double sums like this before. If we define the covariance matrix $C \in \mathbb{R}^{n\times n}$ having $(j,k)$ entry
$$c_{j,k} = \mathrm{Cov}(X_j,X_k),$$
and let $v = [\gamma_1,\ldots,\gamma_n]^T$, then the variance of the combined variable is just a Rayleigh quotient:
$$\mathrm{Var}\Big(\sum_{j=1}^{n}\gamma_jX_j\Big) = v^*Cv.$$
Since the covariance function is symmetric, $\mathrm{Cov}(X,Y) = \mathrm{Cov}(Y,X)$, the matrix $C$ is Hermitian; it is also positive semidefinite. Why? Variance, by its definition as the expected value of the square of a real random variable, is always nonnegative. Thus the formula (3.12), which derives from the linearity of the expected value, ensures that $v^*Cv \ge 0$. (Under what circumstances can this quantity be zero?)

We can write $C$ in another convenient way. Collect the random variables into the vector
$$X = \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix}.$$
Then the $(j,k)$ entry of $E[XX^*] - E[X]E[X^*]$ is
$$E[X_jX_k] - E[X_j]E[X_k] = \mathrm{Cov}(X_j,X_k) = c_{j,k},$$
and so
$$C = E[XX^*] - E[X]E[X^*].$$


3.6.2 Derived variables that maximize variance

Return now to the problem of maximizing the variance $v^*Cv$. Without constraint on $v$, this quantity can be arbitrarily large (assuming $C$ is nonzero); thus we shall require that $\sum_{j=1}^{n}\gamma_j^2 = \|v\|^2 = 1$. With this normalization, you immediately see how to maximize the variance $v^*Cv$: $v$ should be a unit eigenvector associated with the largest magnitude eigenvalue of $C$; call this vector $v_1$. The associated variance, of course, is the largest eigenvalue of $C$; call it

$$\lambda_1 = v_1^*Cv_1 = \max_{v\in\mathbb{C}^n}\frac{v^*Cv}{v^*v}.$$

The eigenvector $v_1$ encodes the way to combine $X_1,\ldots,X_n$ to maximize variance. The new variable – the leading principal component – is

$$v_1^*X = \sum_{j=1}^{n}\gamma_jX_j.$$
You are already suspecting that a unit eigenvector associated with the second largest eigenvalue, $v_2$ with $\lambda_2 = v_2^*Cv_2$, must encode the second-best way to maximize variance. Let us explore this intuition. To find the second-best way to combine the variables, we should insist that the next new variable, for now call it $w^*X$, should be independent of the first, i.e.,

$$\mathrm{Cov}(v_1^*X, w^*X) = 0.$$

However, using linearity of expectation and the fact that, e.g., $w^*X = X^*w$ for real vectors,

$$\begin{aligned}
\mathrm{Cov}(v_1^*X, w^*X) &= E[(v_1^*X)(w^*X)] - E[v_1^*X]E[w^*X] \\
&= E[v_1^*XX^*w] - E[v_1^*X]E[X^*w] \\
&= v_1^*E[XX^*]w - v_1^*E[X]E[X^*]w \\
&= v_1^*(E[XX^*] - E[X]E[X^*])w \\
&= v_1^*Cw = \lambda_1v_1^*w.
\end{aligned}$$

Hence (assuming $\lambda_1 \neq 0$), for the combined variables $v_1^*X$ and $w^*X$ to be independent, the vectors $v_1$ and $w$ must be orthogonal, perfectly confirming your intuition: the second-best way to combine the variables is to pick $w$


to be a unit eigenvector $v_2$ of $C$ corresponding to the second largest eigenvalue – a direct result of the variational characterization of eigenvalues in Theorem 2.2. The associated variance of $v_2^*X$ is

$$\lambda_2 = \max_{w\,\perp\,\mathrm{span}\{v_1\}}\frac{w^*Cw}{w^*w}.$$
Of course, in general, the $k$th best way to combine the variables is given by the eigenvector $v_k$ of $C$ associated with the $k$th largest eigenvalue. We learn much about our variables from the relative size of the variances (eigenvalues)
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0.$$
If some of the latter eigenvalues are very small, that indicates that the set of $n$ random variables can be well approximated by a smaller number of aggregate variables. These aggregate variables are the principal components of $X_1,\ldots,X_n$.

3.6.3 Approximate PCA from empirical data

In practical situations, one often seeks to analyze empirical data drawn from some unknown distribution: the expected values and covariances are not available. Instead, we will estimate these from the measured data. Suppose, as before, that we are considering $n$ random variables, $X_1,\ldots,X_n$, with $m$ samples of each:

$$x_{j,k}, \qquad k = 1,\ldots,m,$$

i.e., $x_{j,k}$ is the $k$th sample of the random variable $X_j$. The expected value has the familiar unbiased estimate
$$\mu_j = \frac{1}{m}\sum_{k=1}^{m}x_{j,k}.$$
Similarly, we can approximate the covariance

$$\mathrm{Cov}(X_j,X_k) = E[(X_j - E[X_j])(X_k - E[X_k])].$$
One might naturally estimate this as

$$\frac{1}{m}\sum_{\ell=1}^{m}(x_{j,\ell} - \mu_j)(x_{k,\ell} - \mu_k).$$


However, replacing the true expected values $E[X_j]$ and $E[X_k]$ with the empirical estimates $\mu_j$ and $\mu_k$ introduces some slight bias into this estimate. This bias can be removed by scaling [Cra46, Sect. 27.6], replacing $1/m$ by $1/(m-1)$ to get the unbiased estimate
$$s_{j,k} = \frac{1}{m-1}\sum_{\ell=1}^{m}(x_{j,\ell} - \mu_j)(x_{k,\ell} - \mu_k), \qquad j,k = 1,\ldots,n.$$
If we let
$$x_j = \begin{bmatrix} x_{j,1} \\ \vdots \\ x_{j,m} \end{bmatrix}, \qquad j = 1,\ldots,n,$$
then each covariance estimate is just an inner product
$$s_{j,k} = \frac{1}{m-1}(x_j - \mu_j)^*(x_k - \mu_k).$$
Thus, if we center the samples of each variable about its empirical mean, we can write the empirical covariance matrix $S = [s_{j,k}]$ as a matrix product. Let
$$X := [\,(x_1 - \mu_1)\ \ (x_2 - \mu_2)\ \cdots\ (x_n - \mu_n)\,] \in \mathbb{R}^{m\times n},$$
so that
$$S = \frac{1}{m-1}X^*X.$$
Now conduct principal component analysis just as before, but with the empirical covariance matrix $S$ replacing the true covariance matrix $C$. The eigenvectors of $S$ now lead to sample principal components. Note that there is no need to explicitly form the matrix $S$: instead, we can simply perform the singular value decomposition of the data matrix $X$. This is why some say, "PCA is just the SVD." We summarize the details step-by-step.

1. Collect $m$ samples of each of $n$ random variables, $x_{j,k}$ for $j = 1,\ldots,m$ and $k = 1,\ldots,n$. (We need $m > 1$, and generally expect $m \ge n$.)

2. Compute the empirical means of each column, $\mu_k = \big(\sum_{j=1}^{m}x_{j,k}\big)/m$.

3. Stacking the samples of the $k$th variable in the vector $x_k \in \mathbb{R}^m$, construct the mean-centered data matrix

$$X = [\,(x_1 - \mu_1)\ \ (x_2 - \mu_2)\ \cdots\ (x_n - \mu_n)\,] \in \mathbb{R}^{m\times n}.$$


4. Compute the (skinny) singular value decomposition $X = U\Sigma V^*$, with $U \in \mathbb{R}^{m\times n}$, $\Sigma = \mathrm{diag}(\sigma_1,\ldots,\sigma_n) \in \mathbb{R}^{n\times n}$, and $V = [v_1\ \cdots\ v_n] \in \mathbb{R}^{n\times n}$.

5. The $k$th sample principal component is given by $v_k^*X$, where $X = [X_1,\ldots,X_n]^*$ is the vector of random variables.
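A compact MATLAB sketch of steps 1-5 on synthetic data (the data and the variable names are illustrative, not from the text; the mean subtraction uses implicit expansion):

m = 200; n = 3;
raw = randn(m,n);
raw(:,3) = 2*raw(:,1) - raw(:,2) + 0.01*randn(m,1);   % a nearly redundant third variable

mu = mean(raw,1);            % step 2: empirical mean of each column
X  = raw - mu;               % step 3: mean-centered data matrix

[U,Sig,V] = svd(X,'econ');   % step 4: skinny SVD of X
diag(Sig).^2/(m-1)           % variances of the sample principal components

S = (X'*X)/(m-1);            % the empirical covariance matrix has these same
sort(eig(S),'descend')       % values as its eigenvalues

scores = X*V;                % step 5: each sample expressed in the principal components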

A word of caution: when conducting principal component analysis, the scale of each column matters. For example, if the random variables sampled in each column of $X$ are measurements of physical quantities, they can differ considerably in magnitude depending on the units of measurement. By changing units of measurement, you can significantly alter the principal components.


References

[Ber33] Daniel Bernoulli. Theoremata de oscillationibus corporum filo flexili connexorum et catenae verticaliter suspensae. Commentarii Academiae Scientiarum Imperialis Petropolitanae, 6:108–122, 1733. Reprinted and translated in [CD81].
[Ber34] Daniel Bernoulli. Demonstrationes theorematum suorum de oscillationibus corporum filo flexili connexorum et catenae verticaliter suspensae. Commentarii Academiae Scientiarum Imperialis Petropolitanae, 7:162–173, 1734.
[BS72] R. H. Bartels and G. W. Stewart. Solution of the matrix equation AX + XB = C. Comm. ACM, 15:820–826, 1972.
[BS09] Robert E. Bradley and C. Edward Sandifer. Cauchy's Cours d'analyse: An Annotated Translation. Springer, Dordrecht, 2009.
[Cau21] Augustin-Louis Cauchy. Cours d'Analyse de l'École Royale Polytechnique. Debure Frères, Paris, 1821.
[Cay58] Arthur Cayley. Memoir on the theory of matrices. Phil. Trans. Royal Soc. London, 148:17–37, 1858.
[CD81] John T. Cannon and Sigalia Dostrovsky. The Evolution of Dynamics: Vibration Theory from 1687 to 1782. Springer-Verlag, New York, 1981.
[Cra46] Harald Cramér. Mathematical Methods of Statistics. Princeton University Press, Princeton, 1946.
[ES] Mark Embree and D. C. Sorensen. An Introduction to Model Reduction for Linear and Nonlinear Differential Equations. In preparation.
[FS83] R. Fletcher and D. C. Sorensen. An algorithmic derivation of the Jordan canonical form. Amer. Math. Monthly, 90:12–16, 1983.
[GW76] G. H. Golub and J. H. Wilkinson. Ill-conditioned eigensystems and the computation of the Jordan canonical form. SIAM Review, 18:578–619, 1976.
[HJ85] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1985.
[HJ13] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, second edition, 2013.
[HK71] Kenneth M. Hoffman and Ray Kunze. Linear Algebra. Prentice Hall, Englewood Cliffs, N.J., second edition, 1971.
[Jol02] I. T. Jolliffe. Principal Component Analysis. Springer, New York, second edition, 2002.


[Kat80] Tosio Kato. Perturbation Theory for Linear Operators. Springer-Verlag, Berlin, corrected second edition, 1980.
[Lak76] Imre Lakatos. Proofs and Refutations: The Logic of Mathematical Discovery. Cambridge University Press, Cambridge, 1976.
[Par98] Beresford N. Parlett. The Symmetric Eigenvalue Problem. SIAM, Philadelphia, SIAM Classics edition, 1998.
[Ray78] Lord Rayleigh (John William Strutt). The Theory of Sound. Macmillan, London, 1877, 1878. 2 volumes.
[RS80] Michael Reed and Barry Simon. Methods of Modern Mathematical Physics I: Functional Analysis. Academic Press, San Diego, revised and enlarged edition, 1980.
[SM03] Endre Süli and David Mayers. An Introduction to Numerical Analysis. Cambridge University Press, Cambridge, 2003.
[Ste04] J. Michael Steele. The Cauchy–Schwarz Master Class. Cambridge University Press, Cambridge, 2004.
[Str93] Gilbert Strang. The fundamental theorem of linear algebra. Amer. Math. Monthly, 100:848–855, 1993.
[Tru60] C. Truesdell. The Rational Mechanics of Flexible or Elastic Bodies, 1638–1788. Leonhardi Euleri Opera Omnia, Introduction to Volumes X and XI, Second Series. Orell Füssli, Zürich, 1960.
[vNW29] J. von Neumann and E. Wigner. Über das Verhalten von Eigenwerten bei adiabatischen Prozessen. Physikalische Zeit., 30:467–470, 1929.
[You88] Nicholas Young. An Introduction to Hilbert Space. Cambridge University Press, Cambridge, 1988.
