
Topic 5: Principal component analysis

5.1 Covariance matrices

Suppose we are interested in a population whose members are represented by vectors in $\mathbb{R}^d$. We model the population as a probability distribution $P$ over $\mathbb{R}^d$, and let $X$ be a random vector with distribution $P$. The mean of $X$ is the “center of mass” of $P$. The covariance of $X$ is also a kind of “center of mass”, but it turns out to reveal quite a lot of other information.

Note: if we have a finite collection of data points $x_1, x_2, \dots, x_n \in \mathbb{R}^d$, then it is common to arrange these vectors as rows of a matrix $A \in \mathbb{R}^{n \times d}$. In this case, we can think of $P$ as the uniform distribution over the $n$ points $x_1, x_2, \dots, x_n$. The mean of $X \sim P$ can be written as
$$\mathbb{E}(X) = \frac{1}{n} A^\top \mathbf{1},$$
and the covariance of $X$ is

$$\operatorname{cov}(X) = \frac{1}{n} A^\top A - \left(\frac{1}{n} A^\top \mathbf{1}\right)\left(\frac{1}{n} A^\top \mathbf{1}\right)^{\!\top} = \frac{1}{n} \tilde{A}^\top \tilde{A},$$
where $\tilde{A} = A - (1/n)\mathbf{1}\mathbf{1}^\top A$. We often call these the empirical mean and empirical covariance of the data $x_1, x_2, \dots, x_n$.

Covariance matrices are always symmetric by definition. Moreover, they are always positive semidefinite, since for any non-zero $z \in \mathbb{R}^d$,

$$z^\top \operatorname{cov}(X)\, z = z^\top \mathbb{E}\!\left[(X - \mathbb{E}(X))(X - \mathbb{E}(X))^\top\right] z = \mathbb{E}\!\left[\langle z, X - \mathbb{E}(X)\rangle^2\right] \ge 0.$$

This also shows that for any unit vector $u$, the variance of $X$ in direction $u$ is

$$\operatorname{var}(\langle u, X\rangle) = \mathbb{E}\!\left[\langle u, X - \mathbb{E}(X)\rangle^2\right] = u^\top \operatorname{cov}(X)\, u.$$
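As a quick numerical illustration (not part of the notes' development; the synthetic data, seed, and variable names are our own), the following NumPy sketch checks this identity for the empirical distribution over the rows of a data matrix:

import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 3
A = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # data matrix, rows are samples

u = rng.normal(size=d)
u /= np.linalg.norm(u)                        # unit direction

cov = np.cov(A, rowvar=False, bias=True)      # empirical covariance (1/n normalization)
proj = A @ u                                  # <u, x_i> for each row x_i

# var(<u, X>) should match u^T cov(X) u
print(np.var(proj), u @ cov @ u)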

Consider the following question: in what direction does $X$ have the highest variance? It turns out this is given by an eigenvector corresponding to the largest eigenvalue of $\operatorname{cov}(X)$. This follows from the following variational characterization of the eigenvalues of symmetric matrices.

Theorem 5.1. Let $M \in \mathbb{R}^{d \times d}$ be a symmetric matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and corresponding orthonormal eigenvectors $v_1, v_2, \dots, v_d$. Then
$$\max_{u \ne 0} \frac{u^\top M u}{u^\top u} = \lambda_1, \qquad \min_{u \ne 0} \frac{u^\top M u}{u^\top u} = \lambda_d.$$

These are achieved by $v_1$ and $v_d$, respectively. (The ratio $u^\top M u / u^\top u$ is called the Rayleigh quotient associated with $M$ in direction $u$.)


Proof. Following Theorem 4.1, write the eigendecomposition of $M$ as $M = V \Lambda V^\top$, where $V = [v_1 | v_2 | \cdots | v_d]$ is orthogonal and $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_d)$ is diagonal. For any $u \ne 0$,
$$\frac{u^\top M u}{u^\top u}
= \frac{u^\top V \Lambda V^\top u}{u^\top V V^\top u} \quad (\text{since } V V^\top = I)
= \frac{w^\top \Lambda w}{w^\top w} \quad (\text{using } w := V^\top u)
= \frac{w_1^2 \lambda_1 + w_2^2 \lambda_2 + \cdots + w_d^2 \lambda_d}{w_1^2 + w_2^2 + \cdots + w_d^2}.$$

This final ratio is a convex combination of the scalars $\lambda_1, \lambda_2, \dots, \lambda_d$. Its largest value is $\lambda_1$, achieved by $w = e_1$ (and hence $u = V e_1 = v_1$), and its smallest value is $\lambda_d$, achieved by $w = e_d$ (and hence $u = V e_d = v_d$).
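A small NumPy sketch of Theorem 5.1 (illustrative only; the random matrix and the sampled directions are our own choices): the Rayleigh quotient attains $\lambda_1$ at the top eigenvector, and no random direction exceeds it.

import numpy as np

rng = np.random.default_rng(1)
d = 5
B = rng.normal(size=(d, d))
M = B @ B.T                        # a symmetric positive semidefinite matrix

evals, evecs = np.linalg.eigh(M)   # eigh returns eigenvalues in ascending order
v_top = evecs[:, -1]               # eigenvector for the largest eigenvalue

rayleigh = lambda u: (u @ M @ u) / (u @ u)

# Rayleigh quotient at the top eigenvector equals lambda_1 ...
print(rayleigh(v_top), evals[-1])
# ... and random directions never do better (Theorem 5.1)
print(max(rayleigh(rng.normal(size=d)) for _ in range(10_000)) <= evals[-1] + 1e-9)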

Corollary 5.1. Let $v_1$ be a unit-length eigenvector of $\operatorname{cov}(X)$ corresponding to the largest eigenvalue of $\operatorname{cov}(X)$. Then
$$\operatorname{var}(\langle v_1, X\rangle) = \max_{u \in S^{d-1}} \operatorname{var}(\langle u, X\rangle).$$

Now suppose we are interested in the $k$-dimensional subspace of $\mathbb{R}^d$ that captures the “most” variance of $X$. Recall that a $k$-dimensional subspace $W \subseteq \mathbb{R}^d$ can always be specified by a collection of $k$ orthonormal vectors $u_1, u_2, \dots, u_k \in W$. By the orthogonal projection to $W$, we mean the linear map
$$x \mapsto U^\top x, \quad \text{where } U = [u_1 | u_2 | \cdots | u_k] \in \mathbb{R}^{d \times k}.$$
The covariance of $U^\top X$, a $k \times k$ covariance matrix, is simply given by

$$\operatorname{cov}(U^\top X) = U^\top \operatorname{cov}(X)\, U.$$

The “total” variance in this subspace is often measured by the trace of the covariance: $\operatorname{tr}(\operatorname{cov}(U^\top X))$. Recall that the trace of a square matrix is the sum of its diagonal entries, and that the trace is a linear function.

Fact 5.1. For any $U \in \mathbb{R}^{d \times k}$, $\operatorname{tr}(\operatorname{cov}(U^\top X)) = \mathbb{E}\,\|U^\top(X - \mathbb{E}(X))\|_2^2$. Furthermore, if $U^\top U = I$, then $\operatorname{tr}(\operatorname{cov}(U^\top X)) = \mathbb{E}\,\|U U^\top(X - \mathbb{E}(X))\|_2^2$.

Theorem 5.2. Let $M \in \mathbb{R}^{d \times d}$ be a symmetric matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and corresponding orthonormal eigenvectors $v_1, v_2, \dots, v_d$. Then for any $k \in [d]$,

$$\max_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \operatorname{tr}(U^\top M U) = \lambda_1 + \lambda_2 + \cdots + \lambda_k, \qquad \min_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \operatorname{tr}(U^\top M U) = \lambda_{d-k+1} + \lambda_{d-k+2} + \cdots + \lambda_d.$$

The max is achieved by an orthogonal projection to the span of v1, v2,..., vk, and the min is achieved by an orthogonal projection to the span of vd−k+1, vd−k+2,..., vd.

Proof. Let $u_1, u_2, \dots, u_k$ denote the columns of $U$. Then, writing $M = \sum_{j=1}^d \lambda_j v_j v_j^\top$ (Theorem 4.1),

$$\operatorname{tr}(U^\top M U) = \sum_{i=1}^k u_i^\top M u_i = \sum_{i=1}^k u_i^\top \left(\sum_{j=1}^d \lambda_j v_j v_j^\top\right) u_i = \sum_{j=1}^d \lambda_j \sum_{i=1}^k \langle v_j, u_i\rangle^2 = \sum_{j=1}^d c_j \lambda_j,$$

where $c_j := \sum_{i=1}^k \langle v_j, u_i\rangle^2$ for each $j \in [d]$. We'll show that each $c_j \in [0, 1]$ and that $\sum_{j=1}^d c_j = k$. First, it is clear that $c_j \ge 0$ for each $j \in [d]$. Next, extending $u_1, u_2, \dots, u_k$ to an orthonormal basis $u_1, u_2, \dots, u_d$ for $\mathbb{R}^d$, we have for each $j \in [d]$,
$$c_j = \sum_{i=1}^k \langle v_j, u_i\rangle^2 \le \sum_{i=1}^d \langle v_j, u_i\rangle^2 = 1.$$

Finally, since $v_1, v_2, \dots, v_d$ is an orthonormal basis for $\mathbb{R}^d$,
$$\sum_{j=1}^d c_j = \sum_{j=1}^d \sum_{i=1}^k \langle v_j, u_i\rangle^2 = \sum_{i=1}^k \sum_{j=1}^d \langle v_j, u_i\rangle^2 = \sum_{i=1}^k \|u_i\|_2^2 = k.$$
The maximum value of $\sum_{j=1}^d c_j \lambda_j$ over all choices of $c_1, c_2, \dots, c_d \in [0, 1]$ with $\sum_{j=1}^d c_j = k$ is $\lambda_1 + \lambda_2 + \cdots + \lambda_k$. This is achieved when $c_1 = c_2 = \cdots = c_k = 1$ and $c_{k+1} = \cdots = c_d = 0$, i.e., when $\operatorname{span}(v_1, v_2, \dots, v_k) = \operatorname{span}(u_1, u_2, \dots, u_k)$. The minimum value of $\sum_{j=1}^d c_j \lambda_j$ over all such choices is $\lambda_{d-k+1} + \lambda_{d-k+2} + \cdots + \lambda_d$. This is achieved when $c_1 = \cdots = c_{d-k} = 0$ and $c_{d-k+1} = \cdots = c_d = 1$, i.e., when $\operatorname{span}(v_{d-k+1}, v_{d-k+2}, \dots, v_d) = \operatorname{span}(u_1, u_2, \dots, u_k)$.

We'll refer to the $k$ largest eigenvalues of a symmetric matrix $M$ as the top-$k$ eigenvalues of $M$, and the $k$ smallest eigenvalues as the bottom-$k$ eigenvalues of $M$. We analogously use the term top-$k$ (resp., bottom-$k$) eigenvectors to refer to orthonormal eigenvectors corresponding to the top-$k$ (resp., bottom-$k$) eigenvalues. Note that the choice of top-$k$ (or bottom-$k$) eigenvectors is not necessarily unique.

Corollary 5.2. Let $v_1, v_2, \dots, v_k$ be top-$k$ eigenvectors of $\operatorname{cov}(X)$, and let $V_k := [v_1 | v_2 | \cdots | v_k]$. Then
$$\operatorname{tr}(\operatorname{cov}(V_k^\top X)) = \max_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \operatorname{tr}(\operatorname{cov}(U^\top X)).$$
An orthogonal projection given by top-$k$ eigenvectors of $\operatorname{cov}(X)$ is called a (rank-$k$) principal component analysis (PCA) projection. Corollary 5.2 reveals an important property of a PCA projection: it maximizes the variance captured by the subspace.
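For concreteness, a minimal NumPy sketch of Corollary 5.2 (the synthetic data and the names `V_k`, `U` are ours): the variance captured by the top-$k$ eigenvectors equals the sum of the top-$k$ eigenvalues, and a random orthonormal $U$ captures no more.

import numpy as np

rng = np.random.default_rng(2)
n, d, k = 2000, 6, 2
A = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # data matrix, rows x_1..x_n

cov = np.cov(A, rowvar=False, bias=True)                 # empirical covariance
evals, evecs = np.linalg.eigh(cov)                       # ascending eigenvalues
V_k = evecs[:, -k:]                                      # top-k eigenvectors as columns

# Variance captured by the PCA projection: tr(cov(V_k^T X)) = sum of top-k eigenvalues
print(np.trace(V_k.T @ cov @ V_k), evals[-k:].sum())

# Any other orthonormal U in R^{d x k} captures no more variance (Corollary 5.2)
U, _ = np.linalg.qr(rng.normal(size=(d, k)))
print(np.trace(U.T @ cov @ U) <= evals[-k:].sum() + 1e-9)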

5.2 Best affine and linear subspaces

PCA has another important property: it gives an affine subspace $A \subseteq \mathbb{R}^d$ that minimizes the expected squared distance between $X$ and $A$.

Recall that a $k$-dimensional affine subspace $A$ is specified by a $k$-dimensional (linear) subspace $W \subseteq \mathbb{R}^d$ (say, with orthonormal basis $u_1, u_2, \dots, u_k$) and a displacement vector $u_0 \in \mathbb{R}^d$:

$$A = \{u_0 + \alpha_1 u_1 + \alpha_2 u_2 + \cdots + \alpha_k u_k : \alpha_1, \alpha_2, \dots, \alpha_k \in \mathbb{R}\}.$$
Let $U := [u_1 | u_2 | \cdots | u_k]$. Then, for any $x \in \mathbb{R}^d$, the point in $A$ closest to $x$ is given by $u_0 + U U^\top(x - u_0)$, and hence the squared distance from $x$ to $A$ is $\|(I - U U^\top)(x - u_0)\|_2^2$.
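A short NumPy sketch of this closest-point formula (the basis, displacement, and query point are randomly generated for illustration):

import numpy as np

rng = np.random.default_rng(3)
d, k = 5, 2
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal basis of W (columns)
u0 = rng.normal(size=d)                        # displacement vector
x = rng.normal(size=d)

# Closest point in the affine subspace A = u0 + span(U), and squared distance to A
closest = u0 + U @ U.T @ (x - u0)
sq_dist = np.linalg.norm((np.eye(d) - U @ U.T) @ (x - u0)) ** 2

# Sanity check: the residual is orthogonal to every basis vector of W
print(np.allclose(U.T @ (x - closest), 0), np.isclose(sq_dist, np.sum((x - closest) ** 2)))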

Theorem 5.3. Let $v_1, v_2, \dots, v_k$ be top-$k$ eigenvectors of $\operatorname{cov}(X)$, let $V_k := [v_1 | v_2 | \cdots | v_k]$, and let $v_0 := \mathbb{E}(X)$. Then

$$\mathbb{E}\,\|(I - V_k V_k^\top)(X - v_0)\|_2^2 = \min_{\substack{U \in \mathbb{R}^{d \times k},\, u_0 \in \mathbb{R}^d:\\ U^\top U = I}} \mathbb{E}\,\|(I - U U^\top)(X - u_0)\|_2^2.$$

Proof. For any $d \times d$ matrix $M$, the function $u_0 \mapsto \mathbb{E}\,\|M(X - u_0)\|_2^2$ is minimized when $M u_0 = M\,\mathbb{E}(X)$ (Fact 5.2). Therefore, we can plug in $\mathbb{E}(X)$ for $u_0$ in the minimization problem, whereupon it reduces to

$$\min_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \mathbb{E}\,\|(I - U U^\top)(X - \mathbb{E}(X))\|_2^2.$$
The objective function is equivalent to

$$\mathbb{E}\,\|(I - U U^\top)(X - \mathbb{E}(X))\|_2^2 = \mathbb{E}\,\|X - \mathbb{E}(X)\|_2^2 - \mathbb{E}\,\|U U^\top(X - \mathbb{E}(X))\|_2^2 = \mathbb{E}\,\|X - \mathbb{E}(X)\|_2^2 - \operatorname{tr}(\operatorname{cov}(U^\top X)),$$
where the second equality comes from Fact 5.1. Therefore, minimizing the objective is equivalent to maximizing $\operatorname{tr}(\operatorname{cov}(U^\top X))$, which is achieved by PCA (Corollary 5.2).

The proof of Theorem 5.3 depends on the following simple but useful fact.

Fact 5.2 (Bias-variance decomposition). Let $Y$ be a random vector in $\mathbb{R}^d$, and let $b \in \mathbb{R}^d$ be any fixed vector. Then
$$\mathbb{E}\,\|Y - b\|_2^2 = \mathbb{E}\,\|Y - \mathbb{E}(Y)\|_2^2 + \|\mathbb{E}(Y) - b\|_2^2$$
(which, as a function of $b$, is minimized when $b = \mathbb{E}(Y)$).

A similar statement can be made about (linear) subspaces by using top-$k$ eigenvectors of $\mathbb{E}(X X^\top)$ instead of $\operatorname{cov}(X)$. This is sometimes called uncentered PCA.
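A quick numerical check of Fact 5.2 (a sketch with synthetic data; for the empirical distribution the identity holds exactly, up to floating-point error):

import numpy as np

rng = np.random.default_rng(4)
n, d = 100_000, 3
Y = rng.normal(size=(n, d)) + np.array([1.0, -2.0, 0.5])   # samples of a random vector Y
b = np.array([0.3, 0.0, 2.0])                               # an arbitrary fixed vector

lhs = np.mean(np.sum((Y - b) ** 2, axis=1))                 # E||Y - b||^2 (empirical)
rhs = (np.mean(np.sum((Y - Y.mean(axis=0)) ** 2, axis=1))   # E||Y - E(Y)||^2
       + np.sum((Y.mean(axis=0) - b) ** 2))                 # + ||E(Y) - b||^2
print(lhs, rhs)   # identical up to floating-point error for the empirical distribution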

Theorem 5.4. Let $v_1, v_2, \dots, v_k$ be top-$k$ eigenvectors of $\mathbb{E}(X X^\top)$, and let $V_k := [v_1 | v_2 | \cdots | v_k]$. Then
$$\mathbb{E}\,\|(I - V_k V_k^\top) X\|_2^2 = \min_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \mathbb{E}\,\|(I - U U^\top) X\|_2^2.$$

5.3 Noisy affine subspace recovery

Suppose there are $n$ points $t_1, t_2, \dots, t_n \in \mathbb{R}^d$ that lie on an affine subspace $A_\star$ of dimension $k$. In this scenario, you don't directly observe the $t_i$; rather, you only observe noisy versions of these points, $Y_1, Y_2, \dots, Y_n$, where for some $\sigma_1, \sigma_2, \dots, \sigma_n > 0$,

$$Y_j \sim \mathrm{N}(t_j, \sigma_j^2 I) \quad \text{for all } j \in [n], \qquad \text{and } Y_1, Y_2, \dots, Y_n \text{ are independent.}$$
The observations $Y_1, Y_2, \dots, Y_n$ no longer all lie in the affine subspace $A_\star$, but by applying PCA to the empirical covariance of $Y_1, Y_2, \dots, Y_n$, you can hope to approximately recover $A_\star$. Regard $X$ as a random vector whose conditional distribution given the noisy points is uniform over $Y_1, Y_2, \dots, Y_n$. In fact, the distribution of $X$ is given by the following generative process:

1. Draw $J \in [n]$ uniformly at random.

2. Given $J$, draw $Z \sim \mathrm{N}(0, \sigma_J^2 I)$.

3. Set $X := t_J + Z$.

Note that the empirical covariance based on Y 1, Y 2,..., Y n is not exactly cov(X), but it can be a good approximation when n is large (with high probability). Similarly, the empirical average of Y 1, Y 2,..., Y n is a good approximation to E(X) when n is large (with high probability). So here, we assume for simplicity that both cov(X) and E(X) are known exactly. We show that PCA produces a k-dimensional affine subspace that contains all of the tj.

Theorem 5.5. Let X be the random vector as defined above, v1, v2,..., vk be top-k eigenvectors of cov(X), and v0 := E(X). Then the affine subspace

$$\hat{A} := \{v_0 + \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_k v_k : \alpha_1, \alpha_2, \dots, \alpha_k \in \mathbb{R}\}$$
contains $t_1, t_2, \dots, t_n$.

Proof. Theorem 5.3 says that the matrix $V_k := [v_1 | v_2 | \cdots | v_k]$ minimizes $\mathbb{E}\,\|(I - U U^\top)(X - v_0)\|_2^2$ (as a function of $U \in \mathbb{R}^{d \times k}$, subject to $U^\top U = I$), or equivalently, maximizes $\operatorname{tr}(\operatorname{cov}(U^\top X))$. This maximization objective can be written as

$$\begin{aligned}
\operatorname{tr}(\operatorname{cov}(U^\top X)) &= \mathbb{E}\,\|U U^\top(X - v_0)\|_2^2 && \text{(by Fact 5.1)}\\
&= \frac{1}{n}\sum_{j=1}^n \mathbb{E}\!\left[\|U U^\top(t_j - v_0 + Z)\|_2^2 \,\middle|\, J = j\right]\\
&= \frac{1}{n}\sum_{j=1}^n \mathbb{E}\!\left[\|U U^\top(t_j - v_0)\|_2^2 + 2\langle U U^\top(t_j - v_0),\, U U^\top Z\rangle + \|U U^\top Z\|_2^2 \,\middle|\, J = j\right]\\
&= \frac{1}{n}\sum_{j=1}^n \left(\|U U^\top(t_j - v_0)\|_2^2 + \mathbb{E}\!\left[\|U U^\top Z\|_2^2 \,\middle|\, J = j\right]\right)\\
&= \frac{1}{n}\sum_{j=1}^n \left(\|U U^\top(t_j - v_0)\|_2^2 + k\sigma_j^2\right),
\end{aligned}$$
where the penultimate step uses the fact that the conditional distribution of $Z$ given $J = j$ is $\mathrm{N}(0, \sigma_j^2 I)$ (so the cross term has zero conditional expectation), and the final step uses the fact that $\|U U^\top Z\|_2^2$ has the same conditional distribution (given $J = j$) as the squared length of a $\mathrm{N}(0, \sigma_j^2 I)$ random vector in $\mathbb{R}^k$. Since $U U^\top(t_j - v_0)$ is the orthogonal projection of $t_j - v_0$ onto the subspace spanned by the columns of $U$ (call it $W$),

$$\|U U^\top(t_j - v_0)\|_2^2 \le \|t_j - v_0\|_2^2 \quad \text{for all } j \in [n].$$

The inequalities above are equalities precisely when $t_j - v_0 \in W$ for all $j \in [n]$. This is indeed the case for the subspace $A_\star - \{v_0\}$. Since $V_k$ maximizes the objective, its columns must span a $k$-dimensional subspace $\hat{W}$ that also contains all of the $t_j - v_0$; hence the affine subspace $\hat{A} = \{v_0 + x : x \in \hat{W}\}$ contains all of the $t_j$.
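A simulation sketch of this recovery (our own synthetic setup; since only the empirical mean and covariance of finitely many noisy points are available, containment holds approximately rather than exactly, improving as $n$ grows):

import numpy as np

rng = np.random.default_rng(5)
n, d, k = 500, 6, 2

# Points t_1..t_n on a k-dimensional affine subspace A* = w0 + span(W)
W, _ = np.linalg.qr(rng.normal(size=(d, k)))
w0 = rng.normal(size=d)
T = w0 + rng.normal(size=(n, k)) @ W.T

# Noisy observations Y_j ~ N(t_j, sigma_j^2 I)
sigmas = rng.uniform(0.1, 0.5, size=n)
Y = T + sigmas[:, None] * rng.normal(size=(n, d))

# PCA on the noisy points: empirical mean and top-k eigenvectors of the empirical covariance
v0 = Y.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Y, rowvar=False, bias=True))
V_k = evecs[:, -k:]

# Distance of the true t_j from the recovered affine subspace v0 + span(V_k)
resid = (T - v0) - (T - v0) @ V_k @ V_k.T
print(np.max(np.linalg.norm(resid, axis=1)))   # small relative to the data, shrinking as n grows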

5.4 Singular value decomposition

Let $A$ be any $n \times d$ matrix. Our aim is to define an extremely useful decomposition of $A$ called the singular value decomposition (SVD). Our derivation starts by considering two related matrices, $A^\top A$ and $A A^\top$; their eigendecompositions will lead to the SVD of $A$.

Fact 5.3. $A^\top A$ and $A A^\top$ are symmetric and positive semidefinite.

It is clear that the eigenvalues of $A^\top A$ and $A A^\top$ are non-negative. In fact, the non-zero eigenvalues of $A^\top A$ and $A A^\top$ are exactly the same.

Lemma 5.1. Let $\lambda$ be an eigenvalue of $A^\top A$ with corresponding eigenvector $v$.

• If $\lambda > 0$, then $\lambda$ is a non-zero eigenvalue of $A A^\top$ with corresponding eigenvector $A v$.

• If $\lambda = 0$, then $A v = 0$.

Proof. First suppose λ > 0. Then

$$A A^\top (A v) = A(A^\top A v) = A(\lambda v) = \lambda (A v),$$
so $\lambda$ is an eigenvalue of $A A^\top$ with corresponding eigenvector $A v$. Now suppose $\lambda = 0$ (which is the only remaining case, as per Fact 5.3). Then

$$\|A v\|_2^2 = v^\top A^\top A v = v^\top (\lambda v) = 0.$$

Since only the zero vector has length 0, it must be that Av = 0.

(We can apply Lemma 5.1 to both $A$ and $A^\top$ to conclude that $A^\top A$ and $A A^\top$ have the same non-zero eigenvalues.)

Theorem 5.6 (Singular value decomposition). Let $A$ be an $n \times d$ matrix. Let $v_1, v_2, \dots, v_d \in \mathbb{R}^d$ be orthonormal eigenvectors of $A^\top A$ corresponding to eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$. Let $r$ be the number of positive $\lambda_i$. Define

$$u_i := \frac{A v_i}{\|A v_i\|_2} = \frac{A v_i}{\sqrt{v_i^\top A^\top A v_i}} = \frac{A v_i}{\sqrt{\lambda_i}} \quad \text{for each } i \in [r].$$
Then
$$A = \underbrace{[u_1 | u_2 | \cdots | u_r]}_{n \times r}\; \underbrace{\operatorname{diag}(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \dots, \sqrt{\lambda_r})}_{r \times r}\; \underbrace{[v_1 | v_2 | \cdots | v_r]^\top}_{r \times d} = \sum_{i=1}^r \sqrt{\lambda_i}\, u_i v_i^\top,$$
and $u_1, u_2, \dots, u_r$ are orthonormal.

Proof. It suffices to show that for some set of $d$ linearly independent vectors $q_1, q_2, \dots, q_d \in \mathbb{R}^d$,

$$A q_j = \left(\sum_{i=1}^r \sqrt{\lambda_i}\, u_i v_i^\top\right) q_j \quad \text{for all } j \in [d].$$

We'll use $v_1, v_2, \dots, v_d$. Observe that
$$A v_j = \begin{cases} \sqrt{\lambda_j}\, u_j & \text{if } 1 \le j \le r,\\ 0 & \text{if } r < j \le d, \end{cases}$$
by definition of $u_i$ and by Lemma 5.1. Moreover,
$$\left(\sum_{i=1}^r \sqrt{\lambda_i}\, u_i v_i^\top\right) v_j = \sum_{i=1}^r \sqrt{\lambda_i}\, \langle v_j, v_i\rangle\, u_i = \begin{cases} \sqrt{\lambda_j}\, u_j & \text{if } 1 \le j \le r,\\ 0 & \text{if } r < j \le d, \end{cases}$$
since $v_1, v_2, \dots, v_d$ are orthonormal. We conclude that $A v_j = \left(\sum_{i=1}^r \sqrt{\lambda_i}\, u_i v_i^\top\right) v_j$ for all $j \in [d]$, and hence $A = \sum_{i=1}^r \sqrt{\lambda_i}\, u_i v_i^\top$. Note that
$$u_i^\top u_j = \frac{v_i^\top A^\top A v_j}{\sqrt{\lambda_i \lambda_j}} = \frac{\lambda_j\, v_i^\top v_j}{\sqrt{\lambda_i \lambda_j}} = 0 \quad \text{for all } 1 \le i < j \le r,$$
where the last step follows since $v_1, v_2, \dots, v_d$ are orthonormal. This implies that $u_1, u_2, \dots, u_r$ are orthonormal (each $u_i$ has unit length by construction).

The decomposition of $A$ into the sum $A = \sum_{i=1}^r \sqrt{\lambda_i}\, u_i v_i^\top$ from Theorem 5.6 is called the singular value decomposition (SVD) of $A$. The $u_1, u_2, \dots, u_r$ are the left singular vectors, and the $v_1, v_2, \dots, v_r$ are the right singular vectors. The scalars $\sqrt{\lambda_1} \ge \sqrt{\lambda_2} \ge \cdots \ge \sqrt{\lambda_r}$ are the (positive) singular values corresponding to the left/right singular vector pairs $(u_1, v_1), (u_2, v_2), \dots, (u_r, v_r)$. The representation $A = \sum_{i=1}^r \sqrt{\lambda_i}\, u_i v_i^\top$ is typically called the thin SVD of $A$. The number $r$ of positive $\lambda_i$ is the rank of $A$, which is at most the smaller of $n$ and $d$.

Of course, one can extend $u_1, u_2, \dots, u_r$ to an orthonormal basis for $\mathbb{R}^n$. Define the matrices $U := [u_1 | u_2 | \cdots | u_n] \in \mathbb{R}^{n \times n}$ and $V := [v_1 | v_2 | \cdots | v_d] \in \mathbb{R}^{d \times d}$. Also define $S \in \mathbb{R}^{n \times d}$ to be the matrix whose only non-zero entries are $\sqrt{\lambda_i}$ in the $(i, i)$-th position, for $1 \le i \le r$. Then $A = U S V^\top$. This matrix factorization of $A$ is typically called the full SVD of $A$. (The vectors $u_{r+1}, u_{r+2}, \dots, u_n$ and $v_{r+1}, v_{r+2}, \dots, v_d$ are also regarded as singular vectors of $A$; they correspond to the singular value equal to zero.)

Just as before, we'll refer to the $k$ largest singular values of $A$ as the top-$k$ singular values of $A$, and the $k$ smallest singular values as the bottom-$k$ singular values of $A$. We analogously use the term top-$k$ (resp., bottom-$k$) singular vectors to refer to orthonormal singular vectors corresponding to the top-$k$ (resp., bottom-$k$) singular values. Again, the choice of top-$k$ (or bottom-$k$) singular vectors is not necessarily unique.
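A NumPy sketch of Theorem 5.6 (illustrative only; `np.linalg.svd` is used in place of the explicit construction): the singular values are the square roots of the eigenvalues of $A^\top A$, the left singular vectors satisfy $u_i = A v_i / \sqrt{\lambda_i}$, and the factors reconstruct $A$.

import numpy as np

rng = np.random.default_rng(6)
n, d = 8, 5
A = rng.normal(size=(n, d))

# Thin SVD: A = U S V^T with U^T U = V^T V = I
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Singular values are the square roots of the eigenvalues of A^T A
evals = np.linalg.eigvalsh(A.T @ A)[::-1]            # descending order
print(np.allclose(s, np.sqrt(np.clip(evals, 0, None))))

# Left singular vectors: u_i = A v_i / sqrt(lambda_i)
print(np.allclose(U[:, 0], A @ Vt[0] / s[0]))

# Reconstruction: A = sum_i sigma_i u_i v_i^T
print(np.allclose(A, (U * s) @ Vt))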

Relationship between PCA and SVD

As seen above, the eigenvectors of $A^\top A$ are the right singular vectors of $A$, and the eigenvectors of $A A^\top$ are the left singular vectors of $A$.

Suppose there are $n$ data points $a_1, a_2, \dots, a_n \in \mathbb{R}^d$, arranged as the rows of the matrix $A \in \mathbb{R}^{n \times d}$. Now regard $X$ as a random vector with the uniform distribution on the $n$ data points. Then $\mathbb{E}(X X^\top) = \frac{1}{n}\sum_{i=1}^n a_i a_i^\top = \frac{1}{n} A^\top A$: top-$k$ eigenvectors of $\frac{1}{n} A^\top A$ are top-$k$ right singular vectors of $A$. Hence, rank-$k$ uncentered PCA (as in Theorem 5.4) corresponds to the subspace spanned by the top-$k$ right singular vectors of $A$.
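A short sketch of this correspondence (synthetic data; the two subspaces are compared through their projection matrices, since individual eigenvectors are only determined up to sign):

import numpy as np

rng = np.random.default_rng(7)
n, d, k = 200, 6, 2
A = rng.normal(size=(n, d))

# Uncentered PCA: top-k eigenvectors of (1/n) A^T A ...
evals, evecs = np.linalg.eigh(A.T @ A / n)
top_k_eig = evecs[:, -k:]                     # top-k eigenvectors as columns

# ... are the top-k right singular vectors of A
_, _, Vt = np.linalg.svd(A, full_matrices=False)
top_k_sv = Vt[:k].T

# Compare the spanned subspaces via their orthogonal projectors
print(np.allclose(top_k_eig @ top_k_eig.T, top_k_sv @ top_k_sv.T))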

Variational characterization of singular values

Given the relationship between the singular values of $A$ and the eigenvalues of $A^\top A$ and $A A^\top$, it is easy to obtain variational characterizations of the singular values. We can also obtain the characterization directly.

Fact 5.4. Let the SVD of a matrix $A \in \mathbb{R}^{n \times d}$ be given by $A = \sum_{i=1}^r \sigma_i u_i v_i^\top$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. For each $i \in [r]$,

$$\sigma_i = \max_{\substack{x \in S^{d-1},\; y \in S^{n-1}:\\ \langle v_j, x\rangle = 0\ \forall j < i}} y^\top A x = u_i^\top A v_i.$$

Relationship between eigendecomposition and SVD

If $M \in \mathbb{R}^{d \times d}$ is symmetric and has eigendecomposition $M = \sum_{i=1}^d \lambda_i v_i v_i^\top$, then its singular values are the absolute values of the $\lambda_i$. We can take $v_1, v_2, \dots, v_d$ as corresponding right singular vectors. For corresponding left singular vectors, we can take $u_i := v_i$ whenever $\lambda_i \ge 0$ (which is the case for all $i$ if $M$ is also positive semidefinite), and $u_i := -v_i$ whenever $\lambda_i < 0$. Therefore, we have the following variational characterization of the singular values of $M$.

Fact 5.5. Let the eigendecomposition of a symmetric matrix $M \in \mathbb{R}^{d \times d}$ be given by $M = \sum_{i=1}^d \lambda_i v_i v_i^\top$, where $|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_d|$. For each $i \in [d]$,

$$|\lambda_i| = \max_{\substack{x \in S^{d-1},\; y \in S^{d-1}:\\ \langle v_j, x\rangle = 0\ \forall j < i}} y^\top M x = \max_{\substack{x \in S^{d-1}:\\ \langle v_j, x\rangle = 0\ \forall j < i}} |x^\top M x| = |v_i^\top M v_i|.$$

Moore-Penrose pseudoinverse

The SVD defines a kind of matrix inverse that is applicable to non-square matrices $A \in \mathbb{R}^{n \times d}$ (where possibly $n \ne d$). Let the (thin) SVD be given by $A = U S V^\top$, where $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{d \times r}$ satisfy $U^\top U = V^\top V = I$, and $S \in \mathbb{R}^{r \times r}$ is diagonal with positive diagonal entries. Here, the rank of $A$ is $r$. The Moore-Penrose pseudoinverse of $A$ is given by

$$A^\dagger := V S^{-1} U^\top \in \mathbb{R}^{d \times n}.$$

Note that $A^\dagger$ is well-defined: $S$ is invertible because its diagonal entries are all strictly positive. What is the effect of multiplying $A$ by $A^\dagger$ on the left? Using the SVD of $A$,

$$A^\dagger A = V S^{-1} U^\top U S V^\top = V V^\top \in \mathbb{R}^{d \times d},$$
which is the orthogonal projection to the row space of $A$. In particular, this means that

$$A A^\dagger A = A.$$

Similarly, $A A^\dagger = U U^\top \in \mathbb{R}^{n \times n}$, the orthogonal projection to the column space of $A$. Note that if $r = d$, then $A^\dagger A = I$, as the row space of $A$ is simply $\mathbb{R}^d$; similarly, if $r = n$, then $A A^\dagger = I$.

The Moore-Penrose pseudoinverse is also related to least squares. For any $y \in \mathbb{R}^n$, the vector $A A^\dagger y$ is the orthogonal projection of $y$ onto the column space of $A$. This means that $\min_{x \in \mathbb{R}^d} \|A x - y\|_2^2$ is achieved by $x = A^\dagger y$. The more familiar expression for the least squares solution, $x = (A^\top A)^{-1} A^\top y$, only applies in the special case where $A^\top A$ is invertible. The connection to the general form of a solution can be seen by using the easily verified identity

$$A^\dagger = (A^\top A)^\dagger A^\top$$
and using the fact that $(A^\top A)^\dagger = (A^\top A)^{-1}$ when $A^\top A$ is invertible.
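A NumPy sketch of the pseudoinverse and its least-squares property (the full-column-rank example is our own; `np.linalg.pinv` and `np.linalg.lstsq` are used only as cross-checks):

import numpy as np

rng = np.random.default_rng(8)
n, d = 10, 4
A = rng.normal(size=(n, d))          # full column rank with probability 1
y = rng.normal(size=n)

# Pseudoinverse via the thin SVD: A^+ = V S^{-1} U^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
print(np.allclose(A_pinv, np.linalg.pinv(A)))

# A^+ A is the orthogonal projection onto the row space (here the identity, since rank = d)
print(np.allclose(A_pinv @ A, np.eye(d)))

# x = A^+ y solves the least-squares problem min_x ||Ax - y||^2
x_pinv = A_pinv @ y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x_pinv, x_lstsq))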

5.5 Matrix norms and low rank SVD

Matrix inner product and the Frobenius norm

The space of $n \times d$ real matrices is a real vector space in its own right, and it can, in fact, be viewed as an inner product space, with inner product given by $\langle X, Y\rangle := \operatorname{tr}(X^\top Y)$. It can be checked that this is indeed a valid inner product. For instance, the fact that the trace function is linear can be used to establish linearity in the first argument:

$$\langle cX + Y, Z\rangle = \operatorname{tr}((cX + Y)^\top Z) = \operatorname{tr}(cX^\top Z + Y^\top Z) = c\operatorname{tr}(X^\top Z) + \operatorname{tr}(Y^\top Z) = c\langle X, Z\rangle + \langle Y, Z\rangle.$$

The inner product naturally induces an associated norm $X \mapsto \sqrt{\langle X, X\rangle}$. Viewing $X \in \mathbb{R}^{n \times d}$ as a data matrix whose rows are the vectors $x_1, x_2, \dots, x_n \in \mathbb{R}^d$, we see that

$$\langle X, X\rangle = \operatorname{tr}(X^\top X) = \operatorname{tr}\!\left(\sum_{i=1}^n x_i x_i^\top\right) = \sum_{i=1}^n \operatorname{tr}(x_i x_i^\top) = \sum_{i=1}^n \operatorname{tr}(x_i^\top x_i) = \sum_{i=1}^n \|x_i\|_2^2.$$

Above, we make use of the fact that for any matrices $A, B \in \mathbb{R}^{n \times d}$,
$$\operatorname{tr}(A^\top B) = \operatorname{tr}(B A^\top),$$
which is called the cyclic property of the matrix trace. Therefore, the square of the induced norm is simply the sum-of-squares of the entries in the matrix. We call this norm the Frobenius norm of the matrix $X$, and denote it by $\|X\|_{\mathrm{F}}$. It can be checked that this matrix inner product and norm are exactly the same as the Euclidean inner product and norm when you view the $n \times d$ matrices as $nd$-dimensional vectors obtained by stacking columns on top of each other (or rows side-by-side).

Suppose a matrix $X$ has thin SVD $X = U S V^\top$, where $S = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_r)$ and $U^\top U = V^\top V = I$. Then its squared Frobenius norm is

$$\|X\|_{\mathrm{F}}^2 = \operatorname{tr}(V S U^\top U S V^\top) = \operatorname{tr}(V S^2 V^\top) = \operatorname{tr}(S^2 V^\top V) = \operatorname{tr}(S^2) = \sum_{i=1}^r \sigma_i^2,$$
the sum-of-squares of $X$'s singular values.
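A quick check of this identity on a random matrix (illustrative only):

import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(7, 4))

s = np.linalg.svd(X, compute_uv=False)

# Three equal quantities: sum of squared entries, trace of X^T X, sum of squared singular values
print(np.sum(X ** 2), np.trace(X.T @ X), np.sum(s ** 2))
print(np.isclose(np.linalg.norm(X, 'fro') ** 2, np.sum(s ** 2)))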

Best rank-k approximation in Frobenius norm

Let the SVD of a matrix $A \in \mathbb{R}^{n \times d}$ be given by $A = \sum_{i=1}^r \sigma_i u_i v_i^\top$. Here, we assume $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. For any $k \le r$, a rank-$k$ SVD of $A$ is obtained by just keeping the first $k$ components (corresponding to the $k$ largest singular values), and this yields a matrix $A_k \in \mathbb{R}^{n \times d}$ with rank $k$:

$$A_k := \sum_{i=1}^k \sigma_i u_i v_i^\top. \tag{5.1}$$

This matrix $A_k$ is the best rank-$k$ approximation to $A$ in the sense that it minimizes the Frobenius norm error over all matrices of rank (at most) $k$. This is remarkable because the set of matrices of rank at most $k$ is not a set over which it is typically easy to optimize. (For instance, it is not a convex set.)
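A sketch of the rank-$k$ truncation (5.1) on a random matrix (our own example); it also checks the standard identity $\|A - A_k\|_{\mathrm{F}}^2 = \sum_{i > k} \sigma_i^2$, consistent with Theorem 5.7 below:

import numpy as np

rng = np.random.default_rng(10)
n, d, k = 20, 8, 3
A = rng.normal(size=(n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]            # A_k = sum_{i <= k} sigma_i u_i v_i^T

print(np.linalg.matrix_rank(A_k))            # k
print(np.linalg.norm(A - A_k, 'fro') ** 2, np.sum(s[k:] ** 2))   # tail of the spectrum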

Theorem 5.7. Let $A \in \mathbb{R}^{n \times d}$ be any matrix, with SVD as given in Theorem 5.6, and $A_k$ as defined in (5.1). Then:

1. The rows of $A_k$ are the orthogonal projections of the corresponding rows of $A$ onto the $k$-dimensional subspace spanned by the top-$k$ right singular vectors $v_1, v_2, \dots, v_k$ of $A$.

2. $\|A - A_k\|_{\mathrm{F}} \le \min\{\|A - B\|_{\mathrm{F}} : B \in \mathbb{R}^{n \times d},\ \operatorname{rank}(B) \le k\}$.

3. If $a_1, a_2, \dots, a_n \in \mathbb{R}^d$ are the rows of $A$, and $\hat{a}_1, \hat{a}_2, \dots, \hat{a}_n \in \mathbb{R}^d$ are the rows of $A_k$, then
$$\sum_{i=1}^n \|a_i - \hat{a}_i\|_2^2 \le \sum_{i=1}^n \|a_i - b_i\|_2^2$$

for any vectors $b_1, b_2, \dots, b_n \in \mathbb{R}^d$ that span a subspace of dimension at most $k$.

Proof. The orthogonal projection onto the subspace $W_k$ spanned by $v_1, v_2, \dots, v_k$ is given by $x \mapsto V_k V_k^\top x$, where $V_k := [v_1 | v_2 | \cdots | v_k]$. Since $V_k V_k^\top v_i = v_i$ for $i \in [k]$ and $V_k V_k^\top v_i = 0$ for $i > k$,

$$A V_k V_k^\top = \sum_{i=1}^r \sigma_i u_i v_i^\top V_k V_k^\top = \sum_{i=1}^k \sigma_i u_i v_i^\top = A_k.$$

This equality says that the rows of $A_k$ are the orthogonal projections of the rows of $A$ onto $W_k$. This proves the first claim.

Consider any matrix $B \in \mathbb{R}^{n \times d}$ with $\operatorname{rank}(B) \le k$, and let $W$ be the subspace spanned by the rows of $B$. Let $\Pi_W$ denote the orthogonal projector onto $W$. Then clearly we have $\|A - A\Pi_W\|_{\mathrm{F}} \le \|A - B\|_{\mathrm{F}}$. This means that

$$\min_{\substack{B \in \mathbb{R}^{n \times d}:\\ \operatorname{rank}(B) \le k}} \|A - B\|_{\mathrm{F}}^2 = \min_{\substack{\text{subspace } W \subseteq \mathbb{R}^d:\\ \dim W \le k}} \|A - A\Pi_W\|_{\mathrm{F}}^2 = \min_{\substack{\text{subspace } W \subseteq \mathbb{R}^d:\\ \dim W \le k}} \sum_{i=1}^n \|(I - \Pi_W) a_i\|_2^2,$$

where $a_i \in \mathbb{R}^d$ denotes the $i$-th row of $A$. In fact, it is clear that we can take the minimization over subspaces $W$ with $\dim W = k$. Since the orthogonal projector onto a subspace of dimension $k$ is of the form $U U^\top$ for some $U \in \mathbb{R}^{d \times k}$ satisfying $U^\top U = I$, it follows that the expression above is the same as
$$\min_{U \in \mathbb{R}^{d \times k}:\, U^\top U = I} \sum_{i=1}^n \|(I - U U^\top) a_i\|_2^2.$$

Observe that $\frac{1}{n}\sum_{i=1}^n a_i a_i^\top = \frac{1}{n} A^\top A$, so Theorem 5.6 implies that top-$k$ eigenvectors of $\frac{1}{n}\sum_{i=1}^n a_i a_i^\top$ are top-$k$ right singular vectors of $A$. By Theorem 5.4, the minimization problem above is solved by $U = V_k$. This proves the second claim. The third claim is just a different interpretation of the second claim.

Best rank-k approximation in spectral norm

Another important matrix norm is the spectral norm: for a matrix $X \in \mathbb{R}^{n \times d}$,

$$\|X\|_2 := \max_{u \in S^{d-1}} \|X u\|_2.$$
By Theorem 5.6, the spectral norm of $X$ is equal to its largest singular value.
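A small numerical check (random matrix and random directions; illustrative only) that the spectral norm equals the largest singular value and upper-bounds $\|X u\|_2$ over unit vectors:

import numpy as np

rng = np.random.default_rng(11)
X = rng.normal(size=(6, 4))

s = np.linalg.svd(X, compute_uv=False)

# Spectral norm equals the largest singular value
print(np.isclose(np.linalg.norm(X, 2), s[0]))

# No random unit direction is stretched by more than sigma_1
best = 0.0
for _ in range(10_000):
    u = rng.normal(size=4)
    u /= np.linalg.norm(u)
    best = max(best, np.linalg.norm(X @ u))
print(best <= s[0] + 1e-9)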

Fact 5.6. Let the SVD of a matrix $A \in \mathbb{R}^{n \times d}$ be as given in Theorem 5.6, with $r = \operatorname{rank}(A)$.

• For any $x \in \mathbb{R}^d$,
$$\|A x\|_2 \le \sigma_1 \|x\|_2.$$

• For any x in the span of v1, v2,..., vr,

$$\|A x\|_2 \ge \sigma_r \|x\|_2.$$

Unlike the Frobenius norm, the spectral norm does not arise from a matrix inner product. Nevertheless, it can be checked that it has the required properties of a norm: it satisfies $\|cX\|_2 = |c|\,\|X\|_2$ and $\|X + Y\|_2 \le \|X\|_2 + \|Y\|_2$, and the only matrix with $\|X\|_2 = 0$ is $X = 0$. Because of this, the spectral norm also provides a metric between matrices, $\operatorname{dist}(X, Y) = \|X - Y\|_2$, satisfying the properties given in Section 1.1.

The rank-$k$ SVD of a matrix $A$ also provides the best rank-$k$ approximation in terms of spectral norm error.

Theorem 5.8. Let $A \in \mathbb{R}^{n \times d}$ be any matrix, with SVD as given in Theorem 5.6, and $A_k$ as defined in (5.1). Then $\|A - A_k\|_2 \le \min\{\|A - B\|_2 : B \in \mathbb{R}^{n \times d},\ \operatorname{rank}(B) \le k\}$.

Proof. Since the largest singular value of $A - A_k = \sum_{i=k+1}^r \sigma_i u_i v_i^\top$ is $\sigma_{k+1}$, it follows that

$$\|A - A_k\|_2 = \sigma_{k+1}.$$

Consider any matrix $B \in \mathbb{R}^{n \times d}$ with $\operatorname{rank}(B) \le k$. Its null space $\ker(B)$ has dimension at least $d - \operatorname{rank}(B) \ge d - k$. On the other hand, the span $W_{k+1}$ of $v_1, v_2, \dots, v_{k+1}$ has dimension $k + 1$. Therefore, there must be some non-zero vector $x \in \ker(B) \cap W_{k+1}$. For any such vector $x$,

$$\begin{aligned}
\|A - B\|_2 &\ge \frac{\|(A - B)x\|_2}{\|x\|_2} && \text{(by Fact 5.6)}\\
&\ge \frac{\|A x\|_2}{\|x\|_2} && \text{(since $x$ is in the null space of $B$)}\\
&= \frac{\sqrt{\|A_{k+1} x\|_2^2 + \|(A - A_{k+1}) x\|_2^2}}{\|x\|_2}\\
&\ge \frac{\|A_{k+1} x\|_2}{\|x\|_2}\\
&\ge \sigma_{k+1} && \text{(by Fact 5.6)}.
\end{aligned}$$

Therefore $\|A - B\|_2 \ge \|A - A_k\|_2$.
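A closing sketch of Theorem 5.8 on synthetic data (the rank-$k$ competitor here is just a random product of thin factors, not an adversarial choice):

import numpy as np

rng = np.random.default_rng(12)
n, d, k = 15, 10, 4
A = rng.normal(size=(n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]

# Spectral-norm error of the rank-k SVD is sigma_{k+1} (0-indexed: s[k])
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))

# A random rank-k competitor does no better (Theorem 5.8)
B = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))    # a random rank-k matrix
print(np.linalg.norm(A - B, 2) >= s[k] - 1e-9)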