
A Tutorial on Principal Component Analysis


Jonathon Shlens∗ Systems Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037 and Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093-0402 (∗Electronic address: [email protected])

(Dated: December 10, 2005; Version 2)

Principal component analysis (PCA) is a mainstay of modern data analysis - a black box that is widely used but poorly understood. The goal of this paper is to dispel the magic behind this black box. This tutorial focuses on building a solid intuition for how and why principal component analysis works; furthermore, it crystallizes this knowledge by deriving, from simple intuitions, the mathematics behind PCA. This tutorial does not shy away from explaining the ideas informally, nor does it shy away from the mathematics. The hope is that by addressing both aspects, readers of all levels will be able to gain a better understanding of PCA as well as the when, the how and the why of applying this technique.

I. INTRODUCTION

Principal component analysis (PCA) has been called one of the most valuable results from applied linear algebra. PCA is used abundantly in all forms of analysis - from neuroscience to computer graphics - because it is a simple, non-parametric method of extracting relevant information from confusing data sets. With minimal additional effort PCA provides a roadmap for how to reduce a complex data set to a lower dimension to reveal the sometimes hidden, simplified structure that often underlies it.

The goal of this tutorial is to provide both an intuitive feel for PCA and a thorough discussion of this topic. We will begin with a simple example and provide an intuitive explanation of the goal of PCA. We will continue by adding mathematical rigor to place it within the framework of linear algebra and to provide an explicit solution. We will see how and why PCA is intimately related to the mathematical technique of singular value decomposition (SVD). This understanding will lead us to a prescription for how to apply PCA in the real world. We will discuss both the assumptions behind this technique as well as possible extensions to overcome these limitations.

The discussion and explanations in this paper are informal in the spirit of a tutorial. The goal of this paper is to educate. Occasionally, rigorous mathematical proofs are necessary, although they are relegated to the Appendix. Although not as vital to the tutorial, the proofs are presented for the adventurous reader who desires a more complete understanding of the math. The only assumption is that the reader has a working knowledge of linear algebra. Please feel free to contact me with any suggestions, corrections or comments.

II. MOTIVATION: A TOY EXAMPLE

Here is the perspective: we are an experimenter. We are trying to understand some phenomenon by measuring various quantities (e.g. spectra, voltages, velocities, etc.) in our system. Unfortunately, we cannot figure out what is happening because the data appears clouded, unclear and even redundant. This is not a trivial problem, but rather a fundamental obstacle in empirical science. Examples abound from complex systems such as neuroscience, photometry, meteorology and oceanography - the number of variables to measure can be unwieldy and at times even deceptive, because the underlying relationships can often be quite simple.

Take for example a simple toy problem from physics diagrammed in Figure 1. Pretend we are studying the motion of the physicist's ideal spring. This system consists of a ball of mass m attached to a massless, frictionless spring. The ball is released a small distance away from equilibrium (i.e. the spring is stretched). Because the spring is "ideal," it oscillates indefinitely along the x-axis about its equilibrium at a set frequency.

This is a standard problem in physics in which the motion along the x direction is solved by an explicit function of time. In other words, the underlying dynamics can be expressed as a function of a single variable x.

However, being ignorant experimenters we do not know any of this. We do not know which, let alone how many, axes and dimensions are important to measure. Thus, we decide to measure the ball's position in three-dimensional space (since we live in a three-dimensional world). Specifically, we place three movie cameras around our system of interest. At 120 Hz each movie camera records an image indicating a two-dimensional position of the ball (a projection).
Unfortunately, because of our ignorance, we do not even know what the real "x", "y" and "z" axes are, so we choose three camera axes {a, b, c} at some arbitrary angles with respect to the system. The angles between our measurements might not even be 90°!

Now, we record with the cameras for several minutes. The big question remains: how do we get from this data set to a simple equation of x?

We know a priori that if we were smart experimenters we would have just measured the position along the x-axis with one camera. But this is not what happens in the real world. We often do not know which measurements best reflect the dynamics of our system in question. Furthermore, we sometimes record more dimensions than we actually need!

Also, we have to deal with that pesky, real-world problem of noise. In the toy example this means that we need to deal with air, imperfect cameras or even friction in a less-than-ideal spring. Noise contaminates our data set, only serving to obfuscate the dynamics further. This toy example is the challenge experimenters face every day. We will refer to this example as we delve further into abstract concepts. Hopefully, by the end of this paper we will have a good understanding of how to systematically extract x using principal component analysis.

FIG. 1 A diagram of the toy example.

III. FRAMEWORK: CHANGE OF BASIS

The goal of principal component analysis is to compute the most meaningful basis to re-express a noisy data set. The hope is that this new basis will filter out the noise and reveal hidden structure. In the example of the spring, the explicit goal of PCA is to determine that "the dynamics are along the x-axis." In other words, the goal of PCA is to determine that x̂ - the unit basis vector along the x-axis - is the important dimension. Determining this fact allows an experimenter to discern which dynamics are important, which are just redundant and which are just noise.

A. A Naive Basis

With a more precise definition of our goal, we need a more precise definition of our data as well. We treat every time sample (or experimental trial) as an individual sample in our data set. At each time sample we record a set of data consisting of multiple measurements (e.g. voltage, position, etc.). In our data set, at one point in time, camera A records a corresponding ball position (x_A, y_A). One sample or trial can then be expressed as a 6-dimensional column vector

\vec{X} = \begin{bmatrix} x_A \\ y_A \\ x_B \\ y_B \\ x_C \\ y_C \end{bmatrix}

where each camera contributes a 2-dimensional projection of the ball's position to the entire vector \vec{X}. If we record the ball's position for 10 minutes at 120 Hz, then we have recorded 10 × 60 × 120 = 72000 of these vectors.

With this concrete example, let us recast the problem in abstract terms. Each sample \vec{X} is an m-dimensional vector, where m is the number of measurement types. Equivalently, every sample is a vector that lies in an m-dimensional vector space spanned by some orthonormal basis. From linear algebra we know that all measurement vectors form a linear combination of this set of unit-length basis vectors. What is this orthonormal basis?

This question is usually a tacit assumption often overlooked. Pretend we gathered our toy example data above, but only looked at camera A. What is an orthonormal basis for (x_A, y_A)? A naive choice would be {(1, 0), (0, 1)}, but why select this basis over {(√2/2, √2/2), (−√2/2, √2/2)} or any other arbitrary rotation? The reason is that the naive basis reflects the method we gathered the data. Pretend we record the position (2, 2). We did not record 2√2 in the (√2/2, √2/2) direction and 0 in the perpendicular direction. Rather, we recorded the position (2, 2) on our camera, meaning 2 units up and 2 units to the left in our camera window. Thus our naive basis reflects the method we measured our data.

How do we express this naive basis in linear algebra? In the two-dimensional case, {(1, 0), (0, 1)} can be recast as individual row vectors. A matrix constructed out of these row vectors is the 2 × 2 identity matrix I. We can generalize this to the m-dimensional case by constructing an m × m identity matrix

\mathbf{B} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \mathbf{I}

where each row is an orthonormal basis vector b_i with m components. We can consider our naive basis as the effective starting point. All of our data has been recorded in this basis and thus it can be trivially expressed as a linear combination of {b_i}.
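To make the bookkeeping concrete, the following MATLAB sketch assembles samples into the m × n data matrix used throughout the rest of the tutorial. It is an illustration only: the spring data are simulated rather than recorded, and the projection angles, frequency and noise level are assumptions, not values from the paper.

% Minimal sketch: organize toy "camera" measurements into an m x n data
% matrix, one column per time sample (simulated data, for illustration only).
n = 72000;                          % number of time samples
t = (0:n-1) / 120;                  % 120 Hz sampling for 10 minutes
x = cos(2*pi*2*t);                  % hidden 1-D dynamics along the true x-axis

% Each camera reports a 2-D projection of the 1-D motion at an arbitrary angle.
projections = [cos(0.3) sin(0.3) cos(1.2) sin(1.2) cos(2.0) sin(2.0)]';
X = projections * x + 0.1 * randn(6, n);   % X is m x n with m = 6 (xA,yA,...,yC)

% Each column of X is one sample vector; the naive basis is the identity.
B = eye(6);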

B. Change of Basis

With this rigor we may now state more precisely what PCA asks: Is there another basis, which is a linear combination of the original basis, that best re-expresses our data set?

A close reader might have noticed the conspicuous addition of the word linear. Indeed, PCA makes one stringent but powerful assumption: linearity. Linearity vastly simplifies the problem by (1) restricting the set of potential bases, and (2) formalizing the implicit assumption of continuity in a data set.[1]

[1] A subtle point it is, but we have already assumed linearity by implicitly stating that the data set even characterizes the dynamics of the system. In other words, we are already relying on the superposition principle of linearity to believe that the data provides an ability to interpolate between individual data points.

With this assumption PCA is now limited to re-expressing the data as a linear combination of its basis vectors. Let X be the original data set, where each column is a single sample (or moment in time) of our data set (i.e. \vec{X}). In the toy example X is an m × n matrix where m = 6 and n = 72000. Let Y be another m × n matrix related by a linear transformation P. X is the original recorded data set and Y is a re-representation of that data set.

PX = Y    (1)

Also let us define the following quantities.[2]

• p_i are the rows of P.
• x_i are the columns of X (or individual \vec{X}).
• y_i are the columns of Y.

[2] In this section x_i and y_i are column vectors, but be forewarned: in all other sections x_i and y_i are row vectors.

Equation 1 represents a change of basis and thus can have many interpretations.

1. P is a matrix that transforms X into Y.
2. Geometrically, P is a rotation and a stretch which again transforms X into Y.
3. The rows of P, {p_1, ..., p_m}, are a set of new basis vectors for expressing the columns of X.

The latter interpretation is not obvious but can be seen by writing out the explicit dot products of PX.

PX = \begin{bmatrix} p_1 \\ \vdots \\ p_m \end{bmatrix} \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix}

Y = \begin{bmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{bmatrix}

We can note the form of each column of Y.

y_i = \begin{bmatrix} p_1 \cdot x_i \\ \vdots \\ p_m \cdot x_i \end{bmatrix}

We recognize that each coefficient of y_i is a dot product of x_i with the corresponding row in P. In other words, the j-th coefficient of y_i is a projection onto the j-th row of P. This is in fact the very form of an equation where y_i is a projection onto the basis {p_1, ..., p_m}. Therefore, the rows of P are indeed a new set of basis vectors for representing the columns of X.

C. Questions Remaining

By assuming linearity the problem reduces to finding the appropriate change of basis. The row vectors {p_1, ..., p_m} in this transformation will become the principal components of X. Several questions now arise.

• What is the best way to "re-express" X?
• What is a good choice of basis P?

These questions must be answered by next asking ourselves what features we would like Y to exhibit. Evidently, additional assumptions beyond linearity are required to arrive at a reasonable result. The selection of these assumptions is the subject of the next section.

IV. VARIANCE AND THE GOAL

Now comes the most important question: what does "best express" the data mean? This section will build up an intuitive answer to this question and along the way tack on additional assumptions. The end of this section will conclude with a mathematical goal for deciphering "garbled" data.

In a linear system "garbled" can refer to only three potential confounds: noise, rotation and redundancy. Let us deal with each situation individually.

A. Noise and Rotation

Measurement noise in any data set must be low or else, no matter the analysis technique, no information about a system can be extracted. There exists no absolute scale for noise; rather, all noise is measured relative to the measurement. A common measure is the signal-to-noise ratio (SNR), or a ratio of variances σ²,

SNR = σ²_signal / σ²_noise.    (2)

A high SNR (≫ 1) indicates high-precision data, while a low SNR indicates noise-contaminated data.
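As a concrete illustration of Equations 1 and 2, the following MATLAB sketch rotates a naive basis over a simulated two-dimensional "camera A" cloud and computes the ratio of variances along and across each candidate direction - the idea behind Figure 2b below. The data, angles and variable names are all assumptions made for illustration; this is not the code used to produce the figure.

% Sketch: project a simulated 2-D cloud onto a rotated basis (Eq. 1)
% and compute the ratio of variances along and across each direction (Eq. 2).
n  = 5000;
XA = [2; 1] * randn(1, n) + 0.1 * randn(2, n);   % cloud elongated along (2,1)
XA = XA - repmat(mean(XA, 2), 1, n);             % mean-deviation form
theta = linspace(0, pi, 180);
ratio = zeros(size(theta));
for k = 1:length(theta)
    P = [ cos(theta(k))  sin(theta(k));          % rows of P: candidate basis
         -sin(theta(k))  cos(theta(k)) ];
    Y = P * XA;                                  % change of basis, Y = PX
    ratio(k) = var(Y(1,:)) / var(Y(2,:));        % variance along vs. across
end
[junk, kbest] = max(ratio);                      % best direction p* (cf. Fig. 2b)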


FIG. 2 (a) Simulated data of (x_A, y_A) for camera A. The signal and noise variances σ²_signal and σ²_noise are graphically represented by the two lines subtending the cloud of data. (b) Rotating these axes finds an optimal p∗ where the variance and SNR are maximized. The SNR is defined as the ratio of the variance along p∗ and the variance in the perpendicular direction.

FIG. 3 A spectrum of possible redundancies in data from two separate recordings r_1 and r_2 (e.g. x_A, y_B). The best-fit line r_2 = k r_1 is indicated by the dashed line.

Pretend we plotted all data from camera A from the spring in Figure 2a. Remembering that the spring travels in a straight line, every individual camera should record motion in a straight line as well. Therefore, any spread deviating from straight-line motion must be noise. The variances due to the signal and the noise are indicated by the two lines in the diagram. The ratio of the two lengths, the SNR, thus measures how "fat" the cloud is - a range of possibilities includes a thin line (SNR ≫ 1), a perfect circle (SNR = 1) or even worse. By positing reasonably good measurements, we quantitatively assume that directions with the largest variances in our measurement vector space contain the dynamics of interest. In Figure 2 the direction with the largest variance is not x̂_A = (1, 0) nor ŷ_A = (0, 1), but the direction along the long axis of the cloud. Thus, by assumption the dynamics of interest exist along directions with largest variance and presumably highest SNR.

Our assumption suggests that the basis for which we are searching is not the naive basis (x̂_A, ŷ_A) because these directions do not correspond to the directions of largest variance. Maximizing the variance (and by assumption the SNR) corresponds to finding the appropriate rotation of the naive basis. This intuition corresponds to finding the direction p∗ in Figure 2b. In the 2-dimensional case of Figure 2a, p∗ falls along the direction of the best-fit line for the data cloud. Thus, rotating the naive basis to lie parallel to p∗ would reveal the direction of motion of the spring for the 2-D case. How do we generalize this notion to an arbitrary number of dimensions? Before we approach this question we need to examine this issue from a second perspective.

B. Redundancy

Figure 2 hints at an additional confounding factor in our data - redundancy. This issue is particularly evident in the example of the spring, where multiple sensors record the same dynamic information. Reconsider Figure 2 and ask whether it was really necessary to record two variables. Figure 3 might reflect a range of possible plots between two arbitrary measurement types r_1 and r_2. Panel (a) depicts two recordings with no apparent relationship. In other words, r_1 is entirely uncorrelated with r_2. Because one cannot predict r_1 from r_2, one says that r_1 and r_2 are statistically independent. This situation could occur by plotting two variables such as (x_A, humidity).

On the other extreme, Figure 3c depicts highly correlated recordings. This extremity might be achieved by several means:

• A plot of (x_A, x_B) if cameras A and B are very nearby.
• A plot of (x_A, x̃_A) where x_A is in meters and x̃_A is in inches.

Clearly in panel (c) it would be more meaningful to have recorded a single variable, not both. Why? Because one can calculate r_1 from r_2 (or vice versa) using the best-fit line. Recording solely one response would express the data more concisely and reduce the number of sensor recordings (2 → 1 variables). Indeed, this is the very idea behind dimensional reduction.

C. Covariance Matrix

In a two-variable case it is simple to identify redundant cases by finding the slope of the best-fit line and judging the quality of the fit. How do we quantify and generalize these notions to arbitrarily higher dimensions? Consider two sets of measurements with zero means[3]

A = {a_1, a_2, ..., a_n},   B = {b_1, b_2, ..., b_n}

where the subscript denotes the sample number.

[3] These data sets are in mean deviation form because the means have been subtracted off or are zero.

The variances of A and B are individually defined as

σ²_A = ⟨a_i a_i⟩_i,   σ²_B = ⟨b_i b_i⟩_i

where the expectation[4] is the average over n variables. The covariance between A and B is a straightforward generalization.

covariance of A and B ≡ σ²_AB = ⟨a_i b_i⟩_i

[4] ⟨·⟩_i denotes the average over values indexed by i.

The covariance measures the degree of the linear relationship between two variables. A large (small) value indicates high (low) redundancy. In the case of Figures 2a and 3c, the covariances are quite large. Some additional facts about the covariance:

• σ²_AB ≥ 0 (non-negative). σ²_AB is zero if and only if A and B are entirely uncorrelated.
• σ²_AB = σ²_A if A = B.

We can equivalently convert A and B into corresponding row vectors

a = [a_1 a_2 ... a_n]
b = [b_1 b_2 ... b_n]

so that we may now express the covariance as a dot-product matrix computation

σ²_ab ≡ (1/(n−1)) a bᵀ    (3)

where 1/(n−1) is a constant for normalization.[5]

[5] The most straightforward normalization is 1/n. However, this provides a biased estimation of the variance, particularly for small n. It is beyond the scope of this paper to show that the proper normalization for an unbiased estimator is 1/(n−1).

Finally, we can generalize from two vectors to an arbitrary number. We can rename the row vectors x_1 ≡ a, x_2 ≡ b and consider additional indexed row vectors x_3, ..., x_m. Now we can define a new m × n matrix X.

X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}    (4)

One interpretation of X is the following. Each row of X corresponds to all measurements of a particular type (x_i). Each column of X corresponds to a set of measurements from one particular trial (this is \vec{X} from section 3.1). We now arrive at a definition for the covariance matrix C_X.

C_X ≡ (1/(n−1)) X Xᵀ    (5)

Consider how the matrix form X Xᵀ, in that order, computes the desired value for the ij-th element of C_X. Specifically, the ij-th element of C_X is the dot product between the vector of the i-th measurement type and the vector of the j-th measurement type (or substituting x_i and x_j for a and b in Equation 3). We can summarize several properties of C_X:

• C_X is a square symmetric m × m matrix.
• The diagonal terms of C_X are the variances of particular measurement types.
• The off-diagonal terms of C_X are the covariances between measurement types.

C_X captures the correlations between all possible pairs of measurements. The correlation values reflect the noise and redundancy in our measurements.

• In the diagonal terms, by assumption, large (small) values correspond to interesting dynamics (or noise).
• In the off-diagonal terms, large (small) values correspond to high (low) redundancy.

Pretend we have the option of manipulating C_X. We will suggestively define our manipulated covariance matrix C_Y. What features do we want to optimize in C_Y?

D. Diagonalize the Covariance Matrix

We can summarize the last two sections by stating that our goals are (1) to minimize redundancy, measured by covariance, and (2) to maximize the signal, measured by variance. By definition covariances must be non-negative, thus the minimal covariance is zero. What would the optimized covariance matrix C_Y look like? Evidently, in an "optimized" matrix all off-diagonal terms in C_Y are zero. Thus, C_Y must be diagonal.

There are many methods for diagonalizing C_Y. It is curious to note that PCA arguably selects the easiest method, perhaps accounting for its widespread application.

PCA assumes that all basis vectors {p_1, ..., p_m} are orthonormal (i.e. p_i · p_j = δ_ij). In the language of linear algebra, PCA assumes P is an orthonormal matrix. Secondly, PCA assumes the directions with the largest variances are the signals and the most "important." In other words, they are the most principal. Why are these assumptions easiest?

Envision how PCA works. In our simple example in Figure 2b, P acts as a generalized rotation to align a basis with the maximally variant axis. In multiple dimensions this could be performed by a simple algorithm:

1. Select a normalized direction in m-dimensional space along which the variance in X is maximized. Save this vector as p_1.

2. Find another direction along which variance is maximized; however, because of the orthonormality condition, restrict the search to all directions perpendicular to all previously selected directions. Save this vector as p_i.

3. Repeat this procedure until m vectors are selected.

The resulting ordered set of p's are the principal components.

In principle this simple algorithm works; however, that would belie the true reason why the orthonormality assumption is particularly judicious. Namely, the true benefit of this assumption is that it makes the solution amenable to linear algebra. There exist decompositions that can provide efficient, explicit algebraic solutions.

Notice what we also gained with the second assumption. We have a method for judging the "importance" of each principal direction. Namely, the variances associated with each direction p_i quantify how "principal" each direction is. We could thus rank-order each basis vector p_i according to the corresponding variances. We will now pause to review the implications of all the assumptions made to arrive at this mathematical goal.

E. Summary of Assumptions and Limits

This section provides an important context for understanding when PCA might perform poorly as well as a road map for understanding new extensions to PCA. The Discussion returns to this issue and provides a more lengthy treatment.

I. Linearity. Linearity frames the problem as a change of basis. Several areas of research have explored how applying a nonlinearity prior to performing PCA could extend this algorithm - this has been termed kernel PCA.

II. Mean and variance are sufficient statistics. The formalism of sufficient statistics captures the notion that the mean and the variance entirely describe a probability distribution. The only class of probability distributions that are fully described by the first two moments are exponential distributions (e.g. Gaussian, Exponential, etc.). In order for this assumption to hold, the probability distribution of x_i must be exponentially distributed. Deviations from this could invalidate this assumption.[6] On the flip side, this assumption formally guarantees that the SNR and the covariance matrix fully characterize the noise and redundancies.

[6] A sidebar question: What if the x_i are not Gaussian distributed? Diagonalizing a covariance matrix might not produce satisfactory results. The most rigorous form of removing redundancy is statistical independence, P(y_1, y_2) = P(y_1) P(y_2), where P(·) denotes the probability density. The class of algorithms that attempt to find the y_1, ..., y_m that satisfy this statistical constraint are termed independent component analysis (ICA). In practice, though, quite a lot of real-world data are Gaussian distributed - thanks to the Central Limit Theorem - and PCA is usually a robust solution to slight deviations from this assumption.

III. Large variances have important dynamics. This assumption also encompasses the belief that the data has a high SNR. Hence, principal components with larger associated variances represent interesting dynamics, while those with lower variances represent noise.

IV. The principal components are orthogonal. This assumption provides an intuitive simplification that makes PCA soluble with linear algebra decomposition techniques. These techniques are highlighted in the two following sections.

We have discussed all aspects of deriving PCA - what remain are the linear algebra solutions. The first solution is somewhat straightforward while the second solution involves understanding an important algebraic decomposition.

V. SOLVING PCA: EIGENVECTORS OF COVARIANCE

We derive our first algebraic solution to PCA using linear algebra. This solution is based on an important property of eigenvector decomposition. Once again, the data set is X, an m × n matrix, where m is the number of measurement types and n is the number of samples. The goal is summarized as follows.

Find some orthonormal matrix P where Y = PX such that C_Y ≡ (1/(n−1)) Y Yᵀ is diagonalized. The rows of P are the principal components of X.
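Before turning to the algebraic solution, here is a minimal MATLAB sketch of the greedy three-step procedure above: variance maximization with deflation. It is an illustration only, not the paper's prescribed implementation; the use of power iteration, the iteration counts and all variable names are assumptions.

function P = greedy_pca(X, k)
% GREEDY_PCA: illustrative sketch of steps 1-3 (not the paper's code).
% X - m x n data matrix (rows are measurement types)
% k - number of directions to extract
% P - k x m matrix whose rows approximate the principal components
[m, n] = size(X);
X = X - repmat(mean(X,2), 1, n);          % mean-deviation form
C = X * X' / (n - 1);                     % covariance matrix (Equation 5)
P = zeros(k, m);
for j = 1:k
    p = randn(m, 1);  p = p / norm(p);    % random starting direction
    for it = 1:200                        % power iteration converges to the
        p = C * p;  p = p / norm(p);      % direction of maximal variance
    end
    P(j, :) = p';                         % save this direction as p_j
    C = C - (p' * C * p) * (p * p');      % deflate: remove variance along p_j,
end                                       % so the next search is perpendicular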

We begin by rewriting C_Y in terms of our variable of choice P.

C_Y = (1/(n−1)) Y Yᵀ
    = (1/(n−1)) (PX)(PX)ᵀ
    = (1/(n−1)) P X Xᵀ Pᵀ
    = (1/(n−1)) P (X Xᵀ) Pᵀ
C_Y = (1/(n−1)) P A Pᵀ

Note that we defined a new matrix A ≡ X Xᵀ, where A is symmetric (by Theorem 2 of Appendix A).

Our roadmap is to recognize that a symmetric matrix (A) is diagonalized by an orthogonal matrix of its eigenvectors (by Theorems 3 and 4 of Appendix A). For a symmetric matrix A, Theorem 4 provides

A = E D Eᵀ    (6)

where D is a diagonal matrix and E is a matrix of the eigenvectors of A arranged as columns.

The matrix A has r ≤ m orthonormal eigenvectors, where r is the rank of the matrix. The rank of A is less than m when A is degenerate or all data occupy a subspace of dimension r ≤ m. Maintaining the constraint of orthogonality, we can remedy this situation by selecting (m − r) additional orthonormal vectors to "fill up" the matrix E. These additional vectors do not affect the final solution because the variances associated with these directions are zero.

Now comes the trick. We select the matrix P to be a matrix where each row p_i is an eigenvector of X Xᵀ. By this selection, P ≡ Eᵀ. Substituting into Equation 6, we find A = Pᵀ D P. With this relation and Theorem 1 of Appendix A (P⁻¹ = Pᵀ) we can finish evaluating C_Y.

C_Y = (1/(n−1)) P A Pᵀ
    = (1/(n−1)) P (Pᵀ D P) Pᵀ
    = (1/(n−1)) (P Pᵀ) D (P Pᵀ)
    = (1/(n−1)) (P P⁻¹) D (P P⁻¹)
C_Y = (1/(n−1)) D

It is evident that the choice of P diagonalizes C_Y. This was the goal for PCA. We can summarize the results of PCA in the matrices P and C_Y.

• The principal components of X are the eigenvectors of X Xᵀ, or the rows of P.
• The i-th diagonal value of C_Y is the variance of X along p_i.

In practice, computing PCA of a data set X entails (1) subtracting off the mean of each measurement type and (2) computing the eigenvectors of X Xᵀ. This solution is encapsulated in the demonstration Matlab code included in Appendix B.

VI. A MORE GENERAL SOLUTION: SVD

This section is the most mathematically involved and can be skipped without much loss of continuity. It is presented solely for completeness. We derive another algebraic solution for PCA and in the process find that PCA is closely related to singular value decomposition (SVD). In fact, the two are so intimately related that the names are often used interchangeably. What we will see, though, is that SVD is a more general method of understanding change of basis.

We begin by quickly deriving the decomposition. In the following section we interpret the decomposition, and in the last section we relate these results to PCA.

A. Singular Value Decomposition

Let X be an arbitrary n × m matrix[7] and XᵀX be a rank r, square, symmetric m × m matrix. In a seemingly unmotivated fashion, let us define all of the quantities of interest.

[7] Notice that in this section only we are reversing convention from m × n to n × m. The reason for this choice will become clear in section 6.3.

• {v̂_1, v̂_2, ..., v̂_r} is the set of orthonormal m × 1 eigenvectors with associated eigenvalues {λ_1, λ_2, ..., λ_r} for the symmetric matrix XᵀX:

  (XᵀX) v̂_i = λ_i v̂_i

• σ_i ≡ √λ_i are positive real and termed the singular values.

• {û_1, û_2, ..., û_r} is the set of orthonormal n × 1 vectors defined by û_i ≡ (1/σ_i) X v̂_i.

We obtain the final definition by referring to Theorem 5 of Appendix A. The final definition includes two new and unexpected properties.

• û_i · û_j = δ_ij
• ‖X v̂_i‖ = σ_i

These properties are both proven in Theorem 5. We now have all of the pieces to construct the decomposition.

The "value" version of singular value decomposition is just a restatement of the third definition.

X v̂_i = σ_i û_i    (7)

This result says quite a bit. X multiplied by an eigenvector of XᵀX is equal to a scalar times another vector. The set of eigenvectors {v̂_1, v̂_2, ..., v̂_r} and the set of vectors {û_1, û_2, ..., û_r} are both orthonormal sets or bases in r-dimensional space.

We can summarize this result for all vectors in one matrix multiplication by following the prescribed construction in Figure 4. We start by constructing a new diagonal matrix Σ,

\Sigma \equiv \begin{bmatrix} \tilde{\sigma}_1 & & & & \\ & \ddots & & & \\ & & \tilde{\sigma}_r & & \\ & & & 0 & \\ & & & & \ddots \end{bmatrix}

where σ̃_1 ≥ σ̃_2 ≥ ... ≥ σ̃_r are the rank-ordered set of singular values. Likewise we construct accompanying orthogonal matrices V and U,

V = [v̂_1̃  v̂_2̃  ...  v̂_m̃]
U = [û_1̃  û_2̃  ...  û_ñ]

where we have appended an additional (m − r) and (n − r) orthonormal vectors to "fill up" the matrices V and U respectively.[8] Figure 4 provides a graphical representation of how all of the pieces fit together to form the matrix version of SVD.

[8] This is the same procedure used to fix the degeneracy in the previous section.

X V = U Σ    (8)

where each column of V and U performs the "value" version of the decomposition (Equation 7). Because V is orthogonal, we can multiply both sides by V⁻¹ = Vᵀ to arrive at the final form of the decomposition.

X = U Σ Vᵀ    (9)

Although it was derived without motivation, this decomposition is quite powerful. Equation 9 states that any arbitrary matrix X can be converted to an orthogonal matrix, a diagonal matrix and another orthogonal matrix (or a rotation, a stretch and a second rotation). Making sense of Equation 9 is the subject of the next section.

B. Interpreting SVD

The final form of SVD (Equation 9) is a concise but thick statement to understand. Let us instead reinterpret Equation 7 as

X a = k b

where a and b are column vectors and k is a scalar constant. The set {v̂_1, v̂_2, ..., v̂_m} is analogous to a and the set {û_1, û_2, ..., û_n} is analogous to b. What is unique, though, is that {v̂_1, v̂_2, ..., v̂_m} and {û_1, û_2, ..., û_n} are orthonormal sets of vectors which span an m- or n-dimensional space, respectively. In particular, loosely speaking these sets appear to span all possible "inputs" (a) and "outputs" (b). Can we formalize the view that {v̂_1, ..., v̂_m} and {û_1, ..., û_n} span all possible "inputs" and "outputs"?

We can manipulate Equation 9 to make this fuzzy hypothesis more precise.

X = U Σ Vᵀ
Uᵀ X = Σ Vᵀ
Uᵀ X = Z

where we have defined Z ≡ Σ Vᵀ. Note that the previous columns {û_1, û_2, ..., û_n} are now rows in Uᵀ. Comparing this equation to Equation 1, {û_1, û_2, ..., û_n} perform the same role as {p̂_1, p̂_2, ..., p̂_m}. Hence, Uᵀ is a change of basis from X to Z. Just as before with P, Uᵀ transforms column vectors. The fact that the orthonormal basis Uᵀ (or P) transforms column vectors means that Uᵀ is a basis that spans the columns of X. Bases that span the columns are termed the column space of X. The column space formalizes the notion of the possible "outputs" of any matrix.

There is a funny symmetry to SVD such that we can define a similar quantity - the row space.

X = U Σ Vᵀ
Xᵀ = V Σᵀ Uᵀ
Vᵀ Xᵀ = Σᵀ Uᵀ
Vᵀ Xᵀ = Z

where we have defined Z ≡ Σᵀ Uᵀ. Again the rows of Vᵀ (or the columns of V) are an orthonormal basis for transforming Xᵀ into Z. Because of the transpose on X, it follows that V is an orthonormal basis spanning the row space of X. The row space likewise formalizes the notion of the possible "inputs" into an arbitrary matrix.

We are only scratching the surface for understanding the full implications of SVD. For the purposes of this tutorial, though, we have enough information to understand how PCA will fall within this framework.

The “value” form of SVD is expressed in equation 7.

X v̂_i = σ_i û_i

The mathematical intuition behind the construction of the matrix form is that we want to express all n "value" equations in just one equation. It is easiest to understand this process graphically. Drawing the matrices of equation 7 looks like the following.

We can construct three new matrices V, U and Σ. All singular values are first rank-ordered σ̃_1 ≥ σ̃_2 ≥ ... ≥ σ̃_r, and the corresponding vectors are indexed in the same rank order. Each pair of associated vectors v̂_i and û_i is stacked in the i-th column of its respective matrix. The corresponding singular value σ_i is placed along the diagonal (the ii-th position) of Σ. This generates the equation XV = UΣ, which looks like the following.

The matrices V and U are m × m and n × n matrices respectively and Σ is a diagonal matrix with a few non-zero values (represented by the checkerboard) along its diagonal. Solving this single matrix equation solves all n “value” form equations.

FIG. 4 How to construct the matrix form of SVD from the “value” form.
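As a quick numerical check of this construction (a sketch only; the test matrix is arbitrary and not from the paper), MATLAB's built-in svd returns U, Σ and V directly, so both the matrix form XV = UΣ and the "value" form X v̂_i = σ_i û_i can be verified column by column.

% Sketch: verify the matrix and "value" forms of SVD on an arbitrary matrix.
X = randn(5, 3);                               % any n x m matrix will do
[U, S, V] = svd(X);                            % X = U*S*V'
err_matrix = norm(X*V - U*S);                  % matrix form, Equation 8 (~0)
err_value  = norm(X*V(:,1) - S(1,1)*U(:,1));   % value form for the first pair
% The singular values on the diagonal of S are already rank-ordered.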

C. SVD and PCA

With similar computations it is evident that the two methods are intimately related. Let us return to the original m × n data matrix X. We can define a new matrix Y as an n × m matrix[9]

Y ≡ (1/√(n−1)) Xᵀ

where each column of Y has zero mean. The definition of Y becomes clear by analyzing YᵀY.

YᵀY = ((1/√(n−1)) Xᵀ)ᵀ ((1/√(n−1)) Xᵀ)
    = (1/(n−1)) Xᵀᵀ Xᵀ
    = (1/(n−1)) X Xᵀ
YᵀY = C_X

[9] Y is of the appropriate n × m dimensions laid out in the derivation of section 6.1. This is the reason for the "flipping" of dimensions in 6.1 and Figure 4.

By construction YᵀY equals the covariance matrix of X. From section 5 we know that the principal components of X are the eigenvectors of C_X. If we calculate the SVD of Y, the columns of the matrix V contain the eigenvectors of YᵀY = C_X. Therefore, the columns of V are the principal components of X. This second algorithm is encapsulated in the Matlab code included in Appendix B.

What does this mean? V spans the row space of Y ≡ (1/√(n−1)) Xᵀ. Therefore, V must also span the column space of (1/√(n−1)) X. We can conclude that finding the principal components[10] amounts to finding an orthonormal basis that spans the column space of X.

[10] If the final goal is to find an orthonormal basis for the column space of X then we can calculate it directly without constructing Y. By symmetry, the columns of U produced by the SVD of (1/√(n−1)) X must also be the principal components.
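The equivalence of the two solutions is easy to check numerically. The following sketch (simulated data; all variable names are illustrative assumptions) computes the principal components once via the eigenvectors of the covariance matrix and once via the SVD of Y, and confirms that the leading directions agree up to sign.

% Sketch: compare the covariance/eigenvector route with the SVD route.
m = 6;  n = 1000;
X = randn(m, 1) * randn(1, n) + 0.05 * randn(m, n);   % simulated low-rank data
X = X - repmat(mean(X, 2), 1, n);                     % subtract the mean

% Route 1: eigenvectors of the covariance matrix (Section V)
[E, D] = eig(X * X' / (n - 1));
[junk, idx] = sort(-diag(D));
p_eig = E(:, idx(1));                 % leading principal component

% Route 2: SVD of Y = X' / sqrt(n-1) (Section VI.C)
Y = X' / sqrt(n - 1);
[U, S, V] = svd(Y);
p_svd = V(:, 1);                      % leading column of V

agreement = abs(p_eig' * p_svd);      % ~1 if the directions coincide (up to sign)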

FIG. 5 Data points (black dots) tracking a person on a ferris wheel. The extracted principal components are (p_1, p_2) and the phase is θ̂.

FIG. 6 Non-Gaussian distributed data causes PCA to fail. In exponentially distributed data the axes with the largest variance do not correspond to the underlying basis.

VII. DISCUSSION AND CONCLUSIONS

A. Quick Summary

Performing PCA is quite simple in practice.

1. Organize a data set as an m × n matrix, where m is the number of measurement types and n is the number of trials.

2. Subtract off the mean for each measurement type or row x_i.

3. Calculate the SVD or the eigenvectors of the covariance.

In several fields of literature, many authors refer to the individual measurement types x_i as the sources. The data projected onto the principal components, Y = PX, are termed the signals, because the projected data presumably represent the underlying items of interest.

B. Dimensional Reduction

One benefit of PCA is that we can examine the variances C_Y associated with the principal components. Often one finds that large variances are associated with only the first k < m principal components, while the remaining directions carry comparatively negligible variance; keeping only those k directions re-expresses the data more concisely. This is the dimensional reduction anticipated in the discussion of redundancy.

C. Limits and Extensions of PCA

Both the strength and weakness of PCA is that it is a non-parametric analysis. One only needs to make the assumptions outlined in section 4.5 and then calculate the corresponding answer. There are no parameters to tweak and no coefficients to adjust based on user experience - the answer is unique[11] and independent of the user.

[11] The additional orthonormal vectors beyond the rank (the (m − r) directions, where r is the rank) can be selected almost at whim. However, these degrees of freedom do not affect the qualitative features of the solution nor a dimensional reduction.

This same strength can also be viewed as a weakness. If one knows a priori some features of the structure of a system, then it makes sense to incorporate these assumptions into a parametric algorithm - or an algorithm with selected parameters.

Consider the recorded positions of a person on a ferris wheel over time in Figure 5. The probability distributions along the axes are approximately Gaussian and thus PCA finds (p_1, p_2); however, this answer might not be optimal. The most concise form of dimensional reduction is to recognize that the phase (or angle along the ferris wheel) contains all dynamic information. Thus, the appropriate parametric algorithm is to first convert the data to the appropriately centered polar coordinates and then compute PCA.

This prior non-linear transformation is sometimes termed a kernel transformation and the entire parametric algorithm is termed kernel PCA. Other common kernel transformations include Fourier and Gaussian transformations. This procedure is parametric because the user must incorporate prior knowledge of the structure in the selection of the kernel, but it is also more optimal in the sense that the structure is more concisely described.
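As an illustration of this kind of parametric preprocessing (a sketch only - the ferris-wheel data are simulated, and the centering estimate and all names are assumptions), one can map the positions to polar coordinates before applying PCA, so that nearly all the variance ends up along the phase coordinate.

% Sketch: "kernel"-style preprocessing for the ferris-wheel example.
% Simulated positions on a noisy circle (for illustration only).
n = 2000;
phase = 2*pi*rand(1, n);
center = [3; -1];                               % unknown wheel center
pos = center * ones(1, n) + [cos(phase); sin(phase)] + 0.05*randn(2, n);

% Parametric step: center the data and convert to polar coordinates.
c = mean(pos, 2);                               % estimate of the wheel center
d = pos - repmat(c, 1, n);
polar = [atan2(d(2,:), d(1,:));                 % phase (angle around the wheel)
         sqrt(sum(d.^2, 1))];                   % radius

% PCA on the transformed variables (using pca1 from Appendix B) now finds
% that nearly all the variance lies along the phase coordinate.
[signals, PC, V] = pca1(polar);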

A second caveat is the implicit assumption that the underlying statistics of the data are Gaussian. For instance, Figure 6 contains a 2-D exponentially distributed data set. The largest variances do not correspond to the meaningful axes; thus PCA fails.

This less constrained set of problems is not trivial and only recently has been solved adequately via Independent Component Analysis (ICA). The formulation is equivalent:

Find a matrix P where Y = PX such that C_Y is diagonalized.

however ICA abandons all assumptions except linearity, and attempts to find axes that satisfy the most formal form of redundancy reduction - statistical independence. Mathematically, ICA finds a basis such that the joint probability distribution can be factorized[12]

P(y_i, y_j) = P(y_i) P(y_j)

for all i and j, i ≠ j. The downside of ICA is that it is a form of nonlinear optimization, making the solution difficult to calculate in practice and potentially not unique. However, ICA has been shown to be a very practical and powerful algorithm for solving a whole new class of problems.

[12] Equivalently, in the language of information theory the goal is to find a basis P such that the mutual information I(y_i, y_j) = 0 for i ≠ j.

Writing this paper has been an extremely instructional experience for me. I hope that this paper helps to demystify the motivation and results of PCA, and the underlying assumptions behind this important analysis technique. Please send me a note if this has been useful to you as it inspires me to keep writing!

APPENDIX A: Linear Algebra

This section proves a few unapparent theorems in linear algebra, which are crucial to this paper.

1. The inverse of an orthogonal matrix is its transpose.

The goal of this proof is to show that if A is an orthogonal matrix, then A⁻¹ = Aᵀ.

Let A be an m × n matrix

A = [a_1 a_2 ... a_n]

where a_i is the i-th column vector. We now show that AᵀA = I where I is the identity matrix.

Let us examine the ij-th element of the matrix AᵀA. The ij-th element of AᵀA is (AᵀA)_ij = a_iᵀ a_j.

Remember that the columns of an orthonormal matrix are orthonormal to each other. In other words, the dot product of any two columns is zero. The only exception is the dot product of one particular column with itself, which equals one.

(AᵀA)_ij = a_iᵀ a_j = { 1 if i = j; 0 if i ≠ j }

AᵀA is the exact description of the identity matrix. The definition of A⁻¹ is A⁻¹A = I. Therefore, because AᵀA = I, it follows that A⁻¹ = Aᵀ.

2. If A is any matrix, the matrices AᵀA and AAᵀ are both symmetric.

Let's examine the transpose of each in turn.

(AAᵀ)ᵀ = Aᵀᵀ Aᵀ = A Aᵀ
(AᵀA)ᵀ = Aᵀ Aᵀᵀ = Aᵀ A

The equality of each quantity with its transpose completes this proof.

3. A matrix is symmetric if and only if it is orthogonally diagonalizable.

Because this statement is bi-directional, it requires a two-part "if-and-only-if" proof. One needs to prove the forward and the backward "if-then" cases.

Let us start with the forward case. If A is orthogonally diagonalizable, then A is a symmetric matrix. By hypothesis, orthogonally diagonalizable means that there exists some E such that A = E D Eᵀ, where D is a diagonal matrix and E is some special matrix which diagonalizes A. Let us compute Aᵀ.

Aᵀ = (E D Eᵀ)ᵀ = Eᵀᵀ Dᵀ Eᵀ = E D Eᵀ = A

Evidently, if A is orthogonally diagonalizable, it must also be symmetric.

The reverse case is more involved and less clean, so it will be left to the reader. In lieu of this, hopefully the "forward" case is suggestive if not somewhat convincing.

4. A symmetric matrix is diagonalized by a matrix of its orthonormal eigenvectors.

Restated in math, let A be a square n × n symmetric matrix with associated eigenvectors {e_1, e_2, ..., e_n}. Let E = [e_1 e_2 ... e_n] where the i-th column of E is the eigenvector e_i. This theorem asserts that there exists a diagonal matrix D where A = E D Eᵀ.

This theorem is an extension of the previous theorem 3. It provides a prescription for how to find the matrix E, the "diagonalizer" for a symmetric matrix. It says that the special diagonalizer is in fact a matrix of the original matrix's eigenvectors.

This proof is in two parts. In the first part, we see that any matrix can be orthogonally diagonalized if and only if that matrix's eigenvectors are all linearly independent. In the second part of the proof, we see that a symmetric matrix has the special property that all of its eigenvectors are not just linearly independent but also orthogonal, thus completing our proof.

In the first part of the proof, let A be just some matrix, not necessarily symmetric, and let it have independent eigenvectors (i.e. no degeneracy). Furthermore, let E = [e_1 e_2 ... e_n] be the matrix of eigenvectors placed in the columns. Let D be a diagonal matrix where the i-th eigenvalue is placed in the ii-th position.

We will now show that AE = ED. We can examine the columns of the right-hand and left-hand sides of the equation.

Left hand side:  AE = [Ae_1 Ae_2 ... Ae_n]
Right hand side: ED = [λ_1 e_1 λ_2 e_2 ... λ_n e_n]

Evidently, if AE = ED then Ae_i = λ_i e_i for all i. This equation is the definition of the eigenvalue equation. Therefore, it must be that AE = ED. A little rearrangement provides A = E D E⁻¹, completing the first part of the proof.

For the second part of the proof, we show that a symmetric matrix always has orthogonal eigenvectors. For some symmetric matrix, let λ_1 and λ_2 be distinct eigenvalues for eigenvectors e_1 and e_2.

λ_1 e_1 · e_2 = (λ_1 e_1)ᵀ e_2
             = (A e_1)ᵀ e_2
             = e_1ᵀ Aᵀ e_2
             = e_1ᵀ A e_2
             = e_1ᵀ (λ_2 e_2)
λ_1 e_1 · e_2 = λ_2 e_1 · e_2

By the last relation we can equate that (λ_1 − λ_2) e_1 · e_2 = 0. Since we have conjectured that the eigenvalues are in fact distinct, it must be the case that e_1 · e_2 = 0. Therefore, the eigenvectors of a symmetric matrix are orthogonal.

Let us back up now to our original postulate that A is a symmetric matrix. By the second part of the proof, we know that the eigenvectors of A are all orthonormal (we choose the eigenvectors to be normalized). This means that E is an orthogonal matrix, so by theorem 1, Eᵀ = E⁻¹ and we can rewrite the final result.

A = E D Eᵀ

Thus, a symmetric matrix is diagonalized by a matrix of its eigenvectors.

5. For any arbitrary m × n matrix X, the symmetric matrix XᵀX has a set of orthonormal eigenvectors {v̂_1, v̂_2, ..., v̂_n} and a set of associated eigenvalues {λ_1, λ_2, ..., λ_n}. The set of vectors {X v̂_1, X v̂_2, ..., X v̂_n} then forms an orthogonal basis, where each vector X v̂_i is of length √λ_i.

All of these properties arise from the dot product of any two vectors from this set.

(X v̂_i) · (X v̂_j) = (X v̂_i)ᵀ (X v̂_j)
                  = v̂_iᵀ Xᵀ X v̂_j
                  = v̂_iᵀ (λ_j v̂_j)
                  = λ_j v̂_i · v̂_j
(X v̂_i) · (X v̂_j) = λ_j δ_ij

The last relation arises because the set of eigenvectors of XᵀX is orthogonal, resulting in the Kronecker delta. In simpler terms, the last relation states

(X v̂_i) · (X v̂_j) = { λ_j if i = j; 0 if i ≠ j }

This equation states that any two vectors in the set are orthogonal.

The second property arises from the above equation by realizing that the length squared of each vector is defined as

‖X v̂_i‖² = (X v̂_i) · (X v̂_i) = λ_i

APPENDIX B: Code

This code is written for Matlab 6.5 (Release 13) from Mathworks.[13] The code is not computationally efficient but explanatory (terse comments begin with a %).

[13] http://www.mathworks.com

This first version follows Section 5 by examining the covariance of the data set.

function [signals,PC,V] = pca1(data)
% PCA1: Perform PCA using covariance.
% data - MxN matrix of input data
% (M dimensions, N trials)
% signals - MxN matrix of projected data
% PC - each column is a PC
% V - Mx1 matrix of variances

[M,N] = size(data);

% subtract off the mean for each dimension
mn = mean(data,2);
data = data - repmat(mn,1,N);

% calculate the covariance matrix
covariance = 1 / (N-1) * data * data';

% find the eigenvectors and eigenvalues
[PC, V] = eig(covariance);

% extract diagonal of matrix as vector
V = diag(V);

% sort the variances in decreasing order
[junk, rindices] = sort(-1*V);
V = V(rindices);
PC = PC(:,rindices);

% project the original data set
signals = PC' * data;

This second version follows section 6, computing PCA through SVD.

function [signals,PC,V] = pca2(data)
% PCA2: Perform PCA using SVD.
% data - MxN matrix of input data
% (M dimensions, N trials)
% signals - MxN matrix of projected data
% PC - each column is a PC
% V - Mx1 matrix of variances

[M,N] = size(data);

% subtract off the mean for each dimension
mn = mean(data,2);
data = data - repmat(mn,1,N);

% construct the matrix Y
Y = data' / sqrt(N-1);

% SVD does it all
[u,S,PC] = svd(Y);

% calculate the variances
S = diag(S);
V = S .* S;

% project the original data
signals = PC' * data;
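To close the loop with the toy example, here is a short usage sketch that feeds simulated spring-like data to the two routines above and checks that they recover the same dominant direction. The projection angles, noise level and recording length are illustrative assumptions, not values from the paper.

% Usage sketch: run pca1 and pca2 on simulated spring-like data.
n = 1000;
t = (0:n-1) / 120;                                  % a short recording at 120 Hz
x = cos(2*pi*2*t);                                  % hidden 1-D oscillation
angles = [0.3 0.3+pi/2 1.2 1.2+pi/2 2.0 2.0+pi/2];  % arbitrary camera axes
data = cos(angles)' * x + 0.1*randn(6, n);          % 6 noisy measurement types

[signals1, PC1, V1] = pca1(data);
[signals2, PC2, V2] = pca2(data);

% The leading principal component should agree between the two methods
% (up to sign), and most of the variance should lie along it.
agreement = abs(PC1(:,1)' * PC2(:,1));
explained = V1(1) / sum(V1);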