
8 Principal Components Analysis

Further reading. The paper Xoplaki et al. (2000) shows a nice application of the principal component analysis method (see also Chapter 9).

8.1 Introduction

Example (Weather Report, p. 1-6). The data set consists of sunshine (three variables), air temperature (six variables), heating degree days (HDD) and precipitation data (five variables). Figure 8.1 shows a scatterplot matrix of several variables. With the help of principal component analysis we want to reduce the number of variables without losing much information.

A principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables. Its general objectives are (1) data reduction and (2) interpretation. Although p components are required to reproduce the total system variability, often much of this variability can be accounted for by a small number k of principal components. If so, there is almost as much information in the k components as there is in the original p variables. The k principal components can then replace the initial p variables, and the original data set, consisting of n measurements on p variables, is reduced to a data set consisting of n measurements on k principal components. Analyses of principal components are more of a means to an end than an end in themselves, because they frequently serve as intermediate steps in much larger investigations.

8.2 Population Principal Components

Algebraically, principal components are particular linear combinations of the p random variables X1,...,Xp. Geometrically, these linear combinations represent the selection of a new coordinate system obtained by rotating the original system with X1,...,Xp as the coordinate axes. The new axes represent the directions with maximum variability and provide a simpler and more parsimonious description of the covariance structure. As we shall see, principal components depend solely on the covariance matrix Σ (or the correlation matrix ρ) of X1,...,Xp. Their development does not require a multivariate normal assumption. On the other hand, principal components derived for multivariate normal populations have useful interpretations in terms of the constant density ellipsoids. Further, inferences can be made from the sample components when the population is multivariate normal.

Let the random vector X' = (X1,...,Xp) have the covariance matrix Σ with eigenvalues λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0. Consider the linear combinations
$$Y_1 = a_1'X = a_{11}X_1 + \dots + a_{1p}X_p$$
$$\vdots$$
$$Y_p = a_p'X = a_{p1}X_1 + \dots + a_{pp}X_p.$$

Then, we find

$$\mathrm{Var}(Y_i) = a_i'\Sigma a_i, \qquad i = 1,\dots,p \tag{8.1}$$
$$\mathrm{Cov}(Y_i, Y_k) = a_i'\Sigma a_k, \qquad i,k = 1,\dots,p \tag{8.2}$$

The principal components are those uncorrelated linear combinations Y1,...,Yp whose variances in (8.1) are as large as possible.

Definition 8.2.1. We define

First principal component = linear combination a1'X that maximizes Var(a1'X) subject to a1'a1 = 1

Second principal component = linear combination a2'X that maximizes Var(a2'X) subject to a2'a2 = 1 and Cov(a1'X, a2'X) = 0

...

ith principal component = linear combination ai'X that maximizes Var(ai'X) subject to ai'ai = 1 and Cov(ai'X, ak'X) = 0 for k < i.

Proposition 8.2.2. Let Σ be the covariance matrix associated with the random vector X' = (X1,...,Xp). Let Σ have eigenvalue-eigenvector pairs (λ1, e1),..., (λp, ep), where λ1 ≥ ... ≥ λp ≥ 0. Then the ith principal component is given by
$$Y_i = e_i'X = \sum_{j=1}^{p} e_{ij}X_j, \qquad i = 1,\dots,p. \tag{8.3}$$
With these choices

$$\mathrm{Var}(Y_i) = e_i'\Sigma e_i = \lambda_i, \qquad i = 1,\dots,p$$
$$\mathrm{Cov}(Y_i, Y_k) = e_i'\Sigma e_k = 0, \qquad i \ne k.$$

Remark. If some λi are equal, the choices of the corresponding coefficient vectors, ei, and hence Yi, are not unique.

Proposition 8.2.3. Let X' = (X1,...,Xp) be as in Proposition 8.2.2. Then
$$\mathrm{tr}(\Sigma) = \sum_{i=1}^{p}\sigma_{ii} = \sum_{i=1}^{p}\mathrm{Var}(X_i) = \sum_{i=1}^{p}\lambda_i = \sum_{i=1}^{p}\mathrm{Var}(Y_i) \tag{8.4}$$
and the proportion of total population variance due to the kth principal component is
$$\frac{\lambda_k}{\sum_{i=1}^{p}\lambda_i}, \qquad k = 1,\dots,p.$$

Remark. If most of the total population variance, for large p, can be attributed to the first one, two or three components, then these components can “replace” the original p variables without much loss of information.
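To make Propositions 8.2.2 and 8.2.3 concrete, here is a minimal R sketch on a small hypothetical covariance matrix (the matrix entries are illustrative only, not from the Weather Report data); eigen() returns the ordered eigenvalue-eigenvector pairs, the columns of $vectors playing the role of the coefficient vectors ei:

# Hypothetical 3x3 covariance matrix Sigma (illustrative values)
Sigma <- matrix(c(4, 2, 0,
                  2, 3, 1,
                  0, 1, 2), nrow = 3, byrow = TRUE)

ev     <- eigen(Sigma)        # eigenvalues are returned in decreasing order
lambda <- ev$values           # lambda_i = Var(Y_i), the variances of the principal components
E      <- ev$vectors          # columns are the coefficient vectors e_i, so Y_i = e_i' X

lambda / sum(lambda)          # proportion of total variance due to each component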

Remark. The magnitude of eik – also called the principal component loading – measures the importance of the kth variable to the ith principal component and thus is a useful basis for interpretation. A large coefficient (in absolute value) corresponds to a high loading, while a coefficient near zero corresponds to a low loading.

Remark. One important use of principal components is interpreting the original data in terms of the principal components. The images of the original data under the principal components transformation are referred to as principal component scores.

Proposition 8.2.4. If Y1 = e1'X,...,Yp = ep'X are the principal components obtained from the covariance matrix Σ, then

$$\rho_{Y_i,X_k} = \frac{e_{ik}\sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}}, \qquad i,k = 1,\dots,p.$$

Remark. Suppose X ~ Np(0, Σ). Then c² = y1²/λ1 + ... + yp²/λp, and this equation defines an ellipsoid with axes y1, . . . , yp lying in the directions of e1,..., ep, respectively.

8.2.1 Principal Components obtained from Standardized Values

Principal components may also be obtained for the standardized variables

$$Z_i = \frac{X_i - \mu_i}{\sqrt{\sigma_{ii}}}, \qquad i = 1,\dots,p \tag{8.5}$$
or
$$Z = \left(V^{1/2}\right)^{-1}(X - \mu), \tag{8.6}$$
where the diagonal matrix V^{1/2} is defined as

$$V^{1/2} = \begin{pmatrix} \sqrt{\sigma_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{\sigma_{22}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{\sigma_{pp}} \end{pmatrix}.$$

We find E(Z) = 0 and Cov(Z) = (V^{1/2})^{-1} Σ (V^{1/2})^{-1} = ρ.

Remark. The eigenvalue-eigenvector pairs derived from Σ are in general not the same as the ones derived from ρ.

Proposition 8.2.5. The ith principal component of the standardized variables Z' = (Z1,...,Zp) with Cov(Z) = ρ is given by

$$Y_i = e_i'Z = e_i'\left(V^{1/2}\right)^{-1}(X - \mu), \qquad i = 1,\dots,p.$$
Moreover
$$\sum_{i=1}^{p}\mathrm{Var}(Y_i) = \sum_{i=1}^{p}\mathrm{Var}(Z_i) = p$$
and

$$\rho_{Y_i,Z_k} = e_{ik}\sqrt{\lambda_i}, \qquad i,k = 1,\dots,p.$$

In this case (λ1, e1),..., (λp, ep) are the eigenvalue-eigenvector pairs of ρ with λ1 ≥ ... ≥ λp ≥ 0.

Remark. Variables should be standardized if they are measured on scales with widely differing ranges or if the units of measurement are not commensurate.
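A minimal R sketch of Proposition 8.2.5 on simulated data (the data are purely illustrative); princomp() with cor = TRUE works with the correlation matrix ρ, i.e. with the standardized variables:

set.seed(1)
X <- matrix(rnorm(200), ncol = 4)      # hypothetical data: n = 50 observations, p = 4 variables
colnames(X) <- paste0("x", 1:4)

pc <- princomp(X, cor = TRUE)          # principal components of the standardized variables
lambda <- pc$sdev^2                    # eigenvalues of the correlation matrix
sum(lambda)                            # equals p = 4, as stated in Proposition 8.2.5
unclass(pc$loadings)                   # eigenvectors e_i (the principal component loadings)

eigen(cor(X))$values                   # the same eigenvalues computed "by hand"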

8.3 Summarizing Sample Variation by Principal Components

We now study the problem of summarizing the variation in n measurements on p variables with a few judiciously chosen linear combinations. Suppose the data x1,..., xn represent n independent drawings from some p-dimensional population with mean vector µ and covariance matrix Σ. These data yield the sample mean vector x̄, the sample covariance matrix S and the sample correlation matrix R. The uncorrelated combinations with the largest variances will be called the sample principal components. The sample principal components (PC) are defined as those linear

combinations which have maximum sample variance. Specifically,

First sample PC = linear combination a1'xj that maximizes the sample variance Var(a1'xj) subject to a1'a1 = 1

Second sample PC = linear combination a2'xj that maximizes the sample variance Var(a2'xj) subject to a2'a2 = 1 and Cov(a1'xj, a2'xj) = 0

...

ith sample PC = linear combination ai'xj that maximizes the sample variance Var(ai'xj) subject to ai'ai = 1 and Cov(ai'xj, ak'xj) = 0, k < i.

Proposition 8.3.1. If S = {sik} is the p × p sample covariance matrix with eigenvalue-eigenvector pairs (λ̂1, ê1),..., (λ̂p, êp), the ith sample principal component is given by

$$\hat y_i = \hat e_i'x = \hat e_{i1}x_1 + \dots + \hat e_{ip}x_p, \qquad i = 1,\dots,p,$$
where λ̂1 ≥ ... ≥ λ̂p ≥ 0 and x is any observation on the variables X1,...,Xp. Also,

sample variance(ŷk) = λ̂k,  k = 1, . . . , p,

sample covariance(ŷi, ŷk) = 0,  i ≠ k.

In addition

$$\text{total sample variance} = \sum_{i=1}^{p} s_{ii} = \hat\lambda_1 + \dots + \hat\lambda_p$$
and
$$r(\hat y_i, x_k) = \frac{\hat e_{ik}\sqrt{\hat\lambda_i}}{\sqrt{s_{kk}}}, \qquad i,k = 1,\dots,p.$$

Number of Principal Components

How many components should be retained? There is no definitive answer to this question. A useful visual aid for determining an appropriate number of principal components is a scree plot. With the eigenvalues ordered from largest to smallest, a scree plot is a plot of λ̂i versus i. To determine the appropriate number of components, we look for an elbow (bend) in the scree plot. The number of components is taken to be the point at which the remaining eigenvalues are relatively small and all about the same size.
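A minimal R sketch (again on illustrative simulated data) of the sample principal components and the scree plot; screeplot() plots the ordered eigenvalues λ̂i against i:

set.seed(2)
X  <- matrix(rnorm(300), ncol = 6)     # hypothetical data: n = 50, p = 6
pc <- princomp(X, cor = TRUE)

summary(pc)                            # standard deviations and proportions of variance
screeplot(pc, type = "lines")          # scree plot: look for the elbow
head(pc$scores)                        # sample principal component scores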

Remark. An unusually small value for the last eigenvalue from either the sample covariance or correlation matrix can indicate an unnoticed linear dependency in the data set and should therefore not be routinely ignored.

Example (Weather Report, p. 1-6). Figure 8.2 shows the results of the principal component analysis calculated with the correlation matrix.

Figure 8.1: Scatterplot matrix of some variables of the Weather Report, p. 1-6.

Figure 8.2: Screeplot of the principal components analysis (top left), scatterplot matrix of the scores calculated with the correlation matrix (top right) and the loadings of the variables (bottom). Data set: Weather Report, p. 1-6.

9 Canonical Correlation Analysis

Further reading. Read again the paper Xoplaki et al. (2000) to see how the canonical correlation analysis method can be applied in climate sciences.

9.1 Introduction

Example (Soil Evaporation, p. 1-8). The observed variables are maximum (maxst), minimum (minst), and average soil temperature (avst); maximum (maxat), minimum (minat), and average air temperature (avat); maximum (maxh), minimum (minh), and average relative humidity (avh); total wind in miles per day (wind) and the daily amount of evaporation from the soil (evap). The three “average” measurements are integrated: average soil temperature is the integrated area under the daily soil temperature curve, average air temperature is the integrated area under the daily air temperature curve, and average relative humidity is the integrated area under the daily relative humidity curve. We want to find the association between the soil variables (maxst, minst and avst) and the air variables (maxat, minat, avat, maxh, minh, avh, wind).

Canonical correlation analysis (CCA) seeks to identify and quantify the associations between two sets of variables. Canonical correlation analysis focuses on the correlation between a linear combination of the variables in one set and a linear combination of the variables in another set. The idea is first to determine the pair of linear combinations having the largest correlation. Next, we determine the pair of linear combinations having the largest correlation among all pairs uncorrelated with the initially selected pair, and so on. The pairs of linear combinations are called the canonical variables, and their correlations are called canonical correlations. The canonical correlations measure the strength of association between the two sets of variables. The maximization aspect of the technique represents an attempt to concentrate a high-dimensional relationship between two sets of variables into a few pairs of canonical variables.

9.2 Canonical Variates and Canonical Correlations

We are interested in measures of association between two groups of variables. The first group, of p variables, is represented by the (p × 1) random vector X(1). The second group, of q variables, is represented by the (q × 1) random vector X(2). We assume, in the theoretical development, that X(1) represents the smaller set, so that p ≤ q.

For the random vectors X(1) and X(2), let

$$\mathrm{E}(X^{(1)}) = \mu^{(1)}, \qquad \mathrm{Cov}(X^{(1)}) = \Sigma_{11}$$
$$\mathrm{E}(X^{(2)}) = \mu^{(2)}, \qquad \mathrm{Cov}(X^{(2)}) = \Sigma_{22}$$
$$\mathrm{Cov}(X^{(1)}, X^{(2)}) = \Sigma_{12} = \Sigma_{21}'.$$

It will be convenient to consider X(1) and X(2) jointly, so we form the ((p+q) × 1) random vector
$$X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}$$
with mean vector
$$\mu = \mathrm{E}(X) = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}$$
and covariance matrix

$$\Sigma = \mathrm{E}\left[(X - \mu)(X - \mu)'\right] = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
where Σ11 is (p × p), Σ22 is (q × q) and Σ12 = Σ21' is (p × q).

The covariances between pairs of variables from different sets – one variable from X(1), one variable from X(2) – are contained in Σ12 or, equivalently, in Σ21. That is, the pq elements of Σ12 measure the association between the two sets. When p and q are relatively large, interpreting the elements of Σ12 collectively is ordinarily hopeless. Moreover, it is often linear combinations of variables that are interesting and useful for predictive or comparative purposes. The main task of canonical correlation analysis is to summarize the associations between the X(1) and X(2) sets in terms of a few carefully chosen covariances (or correlations) rather than the pq covariances in Σ12. Linear combinations provide simple summary measures of a set of variables. Set

$$U = a'X^{(1)}, \qquad V = b'X^{(2)}$$
for some pair of coefficient vectors a and b. We obtain

$$\mathrm{Var}(U) = a'\,\mathrm{Cov}(X^{(1)})\,a = a'\Sigma_{11}a$$
$$\mathrm{Var}(V) = b'\,\mathrm{Cov}(X^{(2)})\,b = b'\Sigma_{22}b$$
$$\mathrm{Cov}(U, V) = a'\,\mathrm{Cov}(X^{(1)}, X^{(2)})\,b = a'\Sigma_{12}b.$$

We shall seek coefficient vectors a and b such that
$$\mathrm{Cor}(U, V) = \frac{a'\Sigma_{12}b}{\sqrt{a'\Sigma_{11}a}\,\sqrt{b'\Sigma_{22}b}} \tag{9.1}$$
is as large as possible.

Definition 9.2.1. We define:

First pair of canonical variables (first canonical variate pair): the pair of linear combinations U1, V1 having unit variances, which maximizes the correlation (9.1).

Second pair of canonical variables (second canonical variate pair): the pair of linear combinations U2, V2 having unit variances, which maximizes the correlation (9.1) among all choices that are uncorrelated with the first pair of canonical variables.

At the kth step:

kth pair of canonical variables (kth canonical variate pair): the pair of linear combinations Uk, Vk having unit variances, which maximizes the correlation (9.1) among all choices uncorrelated with the previous k − 1 canonical variable pairs.

The correlation between the kth pair of canonical variables is called the kth canonical correlation.

Proposition 9.2.2. Suppose p ≤ q and let the random vectors X(1) and X(2) have Cov(X(1)) = Σ11, Cov(X(2)) = Σ22 and Cov(X(1), X(2)) = Σ12 = Σ21', where Σ has full rank. For coefficient vectors a and b, form the linear combinations U = a'X(1) and V = b'X(2). Then
$$\max_{a,b}\ \mathrm{Cor}(U, V) = \rho_1^*$$
is attained by the linear combinations

$$U_1 = \underbrace{e_1'\Sigma_{11}^{-1/2}}_{a_1'}X^{(1)} \qquad \text{and} \qquad V_1 = \underbrace{f_1'\Sigma_{22}^{-1/2}}_{b_1'}X^{(2)}.$$

The kth pair of canonical variates, k = 2, . . . , p,
$$U_k = e_k'\Sigma_{11}^{-1/2}X^{(1)} \qquad \text{and} \qquad V_k = f_k'\Sigma_{22}^{-1/2}X^{(2)}$$
maximizes
$$\mathrm{Cor}(U_k, V_k) = \rho_k^*$$
among those linear combinations uncorrelated with the preceding 1, 2, . . . , k − 1 canonical variables.

Here (ρ1*)² ≥ ... ≥ (ρp*)² are the eigenvalues of Σ11^{-1/2} Σ12 Σ22^{-1} Σ21 Σ11^{-1/2}, and e1,..., ep are the associated (p × 1) eigenvectors. The canonical variates have the properties

Var(Uk) = Var(Vk) = 1

Cov(Uk,Ul) = Cov(Vk,Vl) = Cov(Uk,Vl) = 0

Cor(Uk, Ul) = Cor(Vk, Vl) = Cor(Uk, Vl) = 0

for k, l = 1, 2, . . . , p with k ≠ l.

Remark. The quantities (ρ1*)² ≥ ... ≥ (ρp*)² are also the p largest eigenvalues of the matrix Σ22^{-1/2} Σ21 Σ11^{-1} Σ12 Σ22^{-1/2} with corresponding (q × 1) eigenvectors f1,..., fp. Each fi is proportional to Σ22^{-1/2} Σ21 Σ11^{-1/2} ei.
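A minimal R sketch of Proposition 9.2.2 for hypothetical covariance blocks (the block values and the helper function inv.sqrt() are illustrative, not part of the course material); the canonical correlations are the square roots of the eigenvalues of Σ11^{-1/2} Σ12 Σ22^{-1} Σ21 Σ11^{-1/2}:

inv.sqrt <- function(M) {              # symmetric inverse square root M^{-1/2}
  e <- eigen(M, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}

# Hypothetical covariance blocks with p = 2, q = 3 (illustrative values)
S11 <- matrix(c(1.0, 0.4,
                0.4, 1.0), 2, 2)
S22 <- diag(3)
S12 <- matrix(c(0.5, 0.3, 0.0,
                0.2, 0.4, 0.1), nrow = 2, byrow = TRUE)

M  <- inv.sqrt(S11) %*% S12 %*% solve(S22) %*% t(S12) %*% inv.sqrt(S11)
ev <- eigen(M, symmetric = TRUE)

rho.star <- sqrt(ev$values)            # canonical correlations rho*_1 >= rho*_2
A <- t(ev$vectors) %*% inv.sqrt(S11)   # rows are the coefficient vectors a_k' = e_k' Sigma11^{-1/2}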

Remark. If the original variables are standardized, with Z(1) = (Z1(1),...,Zp(1))' and Z(2) = (Z1(2),...,Zq(2))', the canonical variates are of the form

$$U_k = a_k'Z^{(1)} = e_k'\rho_{11}^{-1/2}Z^{(1)}, \qquad V_k = b_k'Z^{(2)} = f_k'\rho_{22}^{-1/2}Z^{(2)}. \tag{9.2}$$

Here, Cov(Z(1)) = ρ11, Cov(Z(2)) = ρ22, Cov(Z(1), Z(2)) = ρ12 = ρ21', and ek and fk are the eigenvectors of ρ11^{-1/2} ρ12 ρ22^{-1} ρ21 ρ11^{-1/2} and ρ22^{-1/2} ρ21 ρ11^{-1} ρ12 ρ22^{-1/2}, respectively. The canonical correlations ρk* satisfy

$$\mathrm{Cor}(U_k, V_k) = \rho_k^*, \qquad k = 1,\dots,p,$$

where (ρ1*)² ≥ ... ≥ (ρp*)² are the nonzero eigenvalues of the matrix ρ11^{-1/2} ρ12 ρ22^{-1} ρ21 ρ11^{-1/2} or, equivalently, the largest eigenvalues of ρ22^{-1/2} ρ21 ρ11^{-1} ρ12 ρ22^{-1/2}.

Remark. The canonical coefficients for the standardized variables,

$$Z_i^{(1)} = \frac{X_i^{(1)} - \mu_i^{(1)}}{\sqrt{\sigma_{ii}}},$$

are simply related to the canonical coefficients attached to the original variables Xi(1). Specifically, if ak' is the coefficient vector for the kth canonical variate Uk, then ak'V11^{1/2} is the coefficient vector for the kth canonical variate constructed from the standardized variables Z(1). Here V11^{1/2} is the diagonal matrix with ith diagonal element √σii = √Var(Xi(1)). Similarly, bk'V22^{1/2} is the coefficient vector for the canonical variate constructed from the set of standardized variables Z(2). The canonical correlations are unchanged by the standardization.

9.3 Interpreting the Population Canonical Variables

Canonical variables are, in general, artificial. That is, they have no physical meaning. If the original variables X(1) and X(2) are used, the canonical coefficients a and b have units proportional to those of the X(1) and X(2) sets. If the original variables are standardized to have zero means and unit variances, the canonical coefficients have no units of measurement, and they must be interpreted in terms of the standardized variables.

9.3.1 Identifying the Canonical Variables

Even though the canonical variables are artificial, they can be “identified” in terms of the subject-matter variables. Many times this identification is aided by computing the correlations between the canonical variates and the original variables. These correlations, however, must be interpreted with caution. They provide only univariate information, in the sense that they do not indicate how the original variables contribute jointly to the canonical analyses. For this reason, many investigators prefer to assess the contributions of the original variables directly from the standardized coefficients (9.2).

Let A = (a1,..., ap)' and B = (b1,..., bq)', so that the vectors of canonical variables are
$$U = A X^{(1)} \qquad \text{and} \qquad V = B X^{(2)},$$

where A is (p × p), B is (q × q), and we are primarily interested in the first p canonical variables in V. Introducing the (p × p) diagonal matrix V11^{-1/2} with kth diagonal element σkk^{-1/2} = [Var(Xk(1))]^{-1/2}, we find
$$\rho_{U,X^{(1)}} = \mathrm{Cor}(U, X^{(1)}) = \mathrm{Cov}(U, V_{11}^{-1/2}X^{(1)}) = \mathrm{Cov}(AX^{(1)}, V_{11}^{-1/2}X^{(1)}) = A\Sigma_{11}V_{11}^{-1/2}. \tag{9.3}$$
Similar calculations for the pairs (U, X(2)), (V, X(2)) and (V, X(1)) yield

$$\rho_{U,X^{(2)}} = A\Sigma_{12}V_{22}^{-1/2}, \qquad \rho_{V,X^{(2)}} = B\Sigma_{22}V_{22}^{-1/2}, \qquad \rho_{V,X^{(1)}} = B\Sigma_{21}V_{11}^{-1/2}, \tag{9.4}$$
where V22^{-1/2} is the (q × q) diagonal matrix with ith diagonal element σii^{-1/2} = [Var(Xi(2))]^{-1/2}.

Canonical variables derived from standardized variables are sometimes interpreted by computing the correlations. Thus

$$\rho_{U,Z^{(1)}} = A_Z\,\rho_{11}, \qquad \rho_{V,Z^{(2)}} = B_Z\,\rho_{22},$$

$$\rho_{U,Z^{(2)}} = A_Z\,\rho_{12}, \qquad \rho_{V,Z^{(1)}} = B_Z\,\rho_{21},$$
where AZ and BZ are the matrices whose rows contain the canonical coefficients for the Z(1) and Z(2) sets, respectively.

Remark. The correlations are unaffected by the standardization, since for example

$$\rho_{U,X^{(1)}} = A\Sigma_{11}V_{11}^{-1/2} = \underbrace{AV_{11}^{1/2}}_{A_Z}\ \underbrace{V_{11}^{-1/2}\Sigma_{11}V_{11}^{-1/2}}_{\rho_{11}} = \rho_{U,Z^{(1)}}.$$

Remark. The correlations ρ_{U,X(1)} and ρ_{V,X(2)} can help supply meanings for the canonical variates. The spirit is the same as in principal component analysis, where the correlations between the principal components and their associated variables may provide subject-matter interpretations for the components.

9.4 Sample Canonical Variates and Sample Canonical Correlations

A random sample of n observations on each of the (p + q) variables X(1), X(2) can be assembled into the n × (p + q) data matrix

$$X = (X^{(1)} \mid X^{(2)}) = \begin{pmatrix} x_{11}^{(1)} & \cdots & x_{1p}^{(1)} & x_{11}^{(2)} & \cdots & x_{1q}^{(2)} \\ \vdots & & \vdots & \vdots & & \vdots \\ x_{n1}^{(1)} & \cdots & x_{np}^{(1)} & x_{n1}^{(2)} & \cdots & x_{nq}^{(2)} \end{pmatrix} = \begin{pmatrix} x_1^{(1)\prime} & x_1^{(2)\prime} \\ \vdots & \vdots \\ x_n^{(1)\prime} & x_n^{(2)\prime} \end{pmatrix}.$$
We find
$$\bar{x} = \begin{pmatrix} \bar{x}^{(1)} \\ \bar{x}^{(2)} \end{pmatrix}, \qquad \bar{x}^{(i)} = \frac{1}{n}\sum_{j=1}^{n} x_j^{(i)}, \quad i = 1, 2,$$
and

$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix} \qquad \text{with } S_{12} = S_{21}',$$
where S11 is (p × p), S22 is (q × q) and
$$S_{kl} = \frac{1}{n-1}\sum_{j=1}^{n} (x_j^{(k)} - \bar{x}^{(k)})(x_j^{(l)} - \bar{x}^{(l)})', \qquad k, l = 1, 2. \tag{9.5}$$
The linear combinations Û = â'x(1) and V̂ = b̂'x(2) have sample correlation

$$r_{\hat U, \hat V} = \frac{\hat a'S_{12}\hat b}{\sqrt{\hat a'S_{11}\hat a}\,\sqrt{\hat b'S_{22}\hat b}}. \tag{9.6}$$

The first pair of sample canonical variates is the pair of linear combinations Û1, V̂1 having unit sample variances that maximizes the ratio (9.6).

In general, the kth pair of sample canonical variates is the pair of linear combinations Ûk, V̂k having unit sample variances that maximizes the ratio (9.6) among those linear combinations uncorrelated with the previous k − 1 sample canonical variates.

The sample correlation between Ûk and V̂k is called the kth sample canonical correlation.
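The output in Table 9.1 below resembles what the cancor() function in R/S-Plus returns ($cor, $xcoef, $ycoef, $xcenter, $ycenter). A minimal sketch on simulated data (the variable names mimic the Soil Evaporation example, but the sample size and numbers are purely illustrative):

set.seed(3)
n    <- 46                             # illustrative sample size
air  <- matrix(rnorm(n * 7), ncol = 7,
               dimnames = list(NULL, c("maxat", "minat", "avat", "maxh", "minh", "avh", "wind")))
soil <- matrix(rnorm(n * 3), ncol = 3,
               dimnames = list(NULL, c("maxst", "minst", "avst")))

cc <- cancor(air, soil)                # returns $cor, $xcoef, $ycoef, $xcenter, $ycenter
cc$cor                                 # sample canonical correlations
cc$xcoef[, 1]                          # coefficients of the first air canonical variate
cc$ycoef[, 1]                          # coefficients of the first soil canonical variate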

9-6 Example (Soil Evaporation, p. 1-8). Table 9.1 and Figure 9.1 show the results of the canonical correlation analysis of the soil and the air variables.

Table 9.1: Results of the canonical correlation analysis. Data set: Soil Evaporation, p. 1-8.

> cc.airsoil
$cor:
[1] 0.9624326 0.7604630 0.5963187

$xcoef:
               [,1]          [,2]          [,3]           [,4]           [,5]          [,6]          [,7]
[1,] -0.00810700994 -0.0131838494 -0.0161725757  0.02517594372  0.04412103124 -0.0579086504 -0.0005754910
[2,]  0.00216809356 -0.0440762196 -0.0176049769  0.01160575703 -0.08327800431 -0.0336352188 -0.0366291212
[3,] -0.00470274969  0.0086711986  0.0003125829 -0.00638888876  0.00565562936  0.0235464153  0.0041274285
[4,]  0.01681633315 -0.0389196441  0.0407784120  0.12108216355 -0.05106799333 -0.0318368973  0.0450208056
[5,]  0.01080944196 -0.0294491540  0.0218271660 -0.00541588722  0.02269311835 -0.0180959750  0.0190990783
[6,] -0.00218121839  0.0096778587 -0.0110017742  0.00148441044  0.00070780115  0.0098347719 -0.0065376621
[7,]  0.00001586807  0.0005137182 -0.0004222563  0.00007431056  0.00004439516 -0.0004388761  0.0009371029

$ycoef:
             [,1]         [,2]        [,3]
[1,] -0.023889982  0.079116806 -0.02793426
[2,]  0.006589414  0.007931868 -0.13448882
[3,]  -0.001183075 -0.026269281  0.02816080

$xcenter:
   maxat    minat    avat     maxh  minh     avh     wind
90.73913 70.06522   190.5 94.71739  48.5 396.913 277.6739

$ycenter:
   maxst    minst     avst
87.56522 71.26087 173.5217

Proposition 9.4.1. Let (ρ̂1*)² ≥ ... ≥ (ρ̂p*)² be the p ordered eigenvalues of S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} with corresponding eigenvectors ê1,..., êp, where the Skl are defined in (9.5) and p ≤ q. Let f̂1,..., f̂p be the eigenvectors of

$$S_{22}^{-1/2}S_{21}S_{11}^{-1}S_{12}S_{22}^{-1/2},$$
where the first p eigenvectors f̂k may be obtained from

$$\hat f_k = \frac{1}{\hat\rho_k^*}\,S_{22}^{-1/2}S_{21}S_{11}^{-1/2}\hat e_k, \qquad k = 1,\dots,p.$$
Then the kth sample canonical variate pair is
$$\hat U_k = \underbrace{\hat e_k'S_{11}^{-1/2}}_{\hat a_k'}x^{(1)}, \qquad \hat V_k = \underbrace{\hat f_k'S_{22}^{-1/2}}_{\hat b_k'}x^{(2)},$$

where x(1) and x(2) are the values of the variables X(1) and X(2) for a particular experimental unit.

Figure 9.1: Results of the canonical correlation analysis: scatterplots of the soil versus air canonical variables and loadings of the air variables (maxat, minat, avat, maxh, minh, avh, wind) and soil variables (maxst, minst, avst) for the first, second and third canonical variate pairs. Data set: Soil Evaporation, p. 1-8.

Remark. The interpretation of Ûk and V̂k is often aided by computing the sample correlations between the canonical variates and the variables in the sets X(1) and X(2). We define the matrices

Â = (â1,..., âp)'   and   B̂ = (b̂1,..., b̂q)',

whose rows are the coefficient vectors for the sample canonical variates. We find

Û = Â x(1)   and   V̂ = B̂ x(2)

and can define

R_{Û,x(1)} = matrix of sample correlations of Û with x(1)
R_{V̂,x(2)} = matrix of sample correlations of V̂ with x(2)
R_{Û,x(2)} = matrix of sample correlations of Û with x(2)
R_{V̂,x(1)} = matrix of sample correlations of V̂ with x(1).

Corresponding to (9.3) and (9.4) we find
$$R_{\hat U, x^{(1)}} = \hat A S_{11} D_{11}^{-1/2}, \qquad R_{\hat V, x^{(2)}} = \hat B S_{22} D_{22}^{-1/2},$$
$$R_{\hat U, x^{(2)}} = \hat A S_{12} D_{22}^{-1/2}, \qquad R_{\hat V, x^{(1)}} = \hat B S_{21} D_{11}^{-1/2},$$
where D11^{-1/2} and D22^{-1/2} are the (p × p) and (q × q) diagonal matrices with ith diagonal elements [sample variance(xi(1))]^{-1/2} and [sample variance(xi(2))]^{-1/2}, respectively.
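A short R sketch of these sample correlation matrices on illustrative simulated data: the canonical scores are formed from the centred data and the cancor() coefficients, and cor() then yields R_{Û,x(1)} and R_{V̂,x(2)} directly:

set.seed(4)
x1 <- matrix(rnorm(50 * 3), ncol = 3)     # hypothetical X^(1), p = 3
x2 <- matrix(rnorm(50 * 4), ncol = 4)     # hypothetical X^(2), q = 4

cc <- cancor(x1, x2)
U  <- scale(x1, center = cc$xcenter, scale = FALSE) %*% cc$xcoef   # canonical scores U-hat
V  <- scale(x2, center = cc$ycenter, scale = FALSE) %*% cc$ycoef   # canonical scores V-hat

round(cor(U, x1), 2)          # correlations of the U-hat's with the x^(1) variables
round(cor(V, x2), 2)          # correlations of the V-hat's with the x^(2) variables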

9.5 Canonical Correlation Analysis applied to Fields and Forecasting with Canonical Correlation Analysis

Source: Wilks (2006) pp. 519-522 (see next pages).


10 Discrimination and Classification

10.1 Introduction

Discrimination and classification are multivariate techniques concerned with separating distinct sets of objects (or observations) and with allocating new objects (observations) to previously defined groups. Discriminant analysis is rather exploratory in nature. As a separative procedure, it is often employed on a one-time basis in order to investigate observed differences when causal relationships are not well understood. Classification procedures are less exploratory in the sense that they lead to well-defined rules, which can be used for assigning new objects. Classification ordinarily requires more problem structure than discrimination does. Thus, the immediate goals of discrimination and classification, respectively, are as follows:

• Discrimination (or separation): describes the differential features of objects (observations) from several known collections (populations). We try to find “discriminants” whose numerical values are such that the collections are separated as much as possible.

• Classification (or allocation): sorts objects (observations) into two or more labeled classes. The emphasis is on deriving a rule that can be used to optimally assign new objects to the labeled classes.

In practice, the two goals frequently overlap, and the distinction between separation and allocation becomes blurred.

10.2 Separation and Classification for Two Populations

Example (El Niño). Consider the data set in Table 10.1. With discriminant analysis we want to classify the years in the data set as either El Niño or non-El Niño, on the basis of the corresponding temperature and pressure data.

It is convenient to label the classes π1 and π2. The objects are ordinarily separated or classified on the basis of measurements on, for instance, p associated random variables X' = (X1,...,Xp). The observed values of X differ to some extent from one class to the other. We can think of the totality of values from the first class as being the population of x values for π1 and those from the second class as the population of x values for π2. These two populations can then be described by probability density functions f1(x) and

10-1 Table 10.1: June climate data for Guayaquil, Ecuador, 1951-1970. Source: Wilks (2006).

Year   Temp. (°C)   Prec. (mm)   Pressure (hPa)   El Niño (yes=1, no=0)
1951   26.1          43          1009.5           1
1952   24.5          10          1010.9           0
1953   24.8           4          1010.7           1
1954   24.5           0          1011.2           0
1955   24.1           2          1011.9           0
1956   24.3          NA          1011.2           0
1957   26.4          31          1009.3           1
1958   24.9           0          1011.1           0
1959   23.7           0          1012.0           0
1960   23.5           0          1011.4           0
1961   24.0           2          1010.9           0
1962   24.1           3          1011.5           0
1963   23.7           0          1011.0           0
1964   24.3           4          1011.2           0
1965   26.6          15          1009.9           1
1966   24.6           2          1012.5           0
1967   24.8           0          1011.1           0
1968   24.4           1          1011.8           0
1969   26.8         127          1009.3           1
1970   25.2           2          1010.6           0

f2(x), and consequently, we can talk of assigning observations to populations or objects to classes interchangeably. A good classification procedure should:

• result in few misclassifications; in other words, the chances, or probabilities, of misclassification should be small;

• take into account that one class or population has a greater likelihood of occurrence than another because one of the two populations is relatively much larger than the other;

Example. A randomly selected firm should be classified as nonbankrupt unless the data overwhelmingly favors bankruptcy.

• whenever possible, account for the costs associated with misclassification.

Classification rules cannot usually provide an error-free method of assignment. This is because there may not be a clear distinction between the measured characteristics of

the populations; that is, the groups may overlap. It is then possible, for example, to incorrectly classify a π2 object as belonging to π1 or a π1 object as belonging to π2. Let f1(x) and f2(x) be the probability density functions associated with the (p × 1) random vector X for the populations π1 and π2, respectively. An object with associated measurements x must be assigned to either π1 or π2. Let Ω be the sample space, R1 the set of x values for which we classify objects as π1, and R2 = Ω − R1 the remaining x values, for which we classify objects as π2. Since every object must be assigned to one and only one of the two populations, the sets R1 and R2 are mutually exclusive and exhaustive. For p = 2, we might have a case like the one pictured in Figure 10.1.

Figure 10.1: Classification regions for two populations. Source: Johnson and Wichern (2007).

The conditional probability, P(2|1), of classifying an object as π2 when, in fact, it is from π1, is

$$P(2|1) = P(X \in R_2 \mid \pi_1) = \int_{R_2} f_1(x)\,dx.$$
Similarly, the conditional probability, P(1|2), of classifying an object as π1 when it is really from π2, is

$$P(1|2) = P(X \in R_1 \mid \pi_2) = \int_{R_1} f_2(x)\,dx.$$
This is illustrated in Figure 10.2 for the univariate case, p = 1.

Let p1 = P(π1) be the prior probability of π1 and p2 = P(π2) be the prior probability of π2, where p1 + p2 = 1.

Figure 10.2: Misclassification probabilities for hypothetical classification regions. Source: Johnson and Wichern (2007).

Then the overall probabilities of correctly or incorrectly classifying objects can be derived as the products of the prior and conditional classification probabilities:

$$P(\text{observation is correctly classified as } \pi_1) = P(\text{observation comes from } \pi_1 \text{ and is correctly classified as } \pi_1) = P(X \in R_1 \mid \pi_1)P(\pi_1) = P(1|1)\,p_1$$

$$P(\text{observation is misclassified as } \pi_1) = P(\text{observation comes from } \pi_2 \text{ and is misclassified as } \pi_1) = P(X \in R_1 \mid \pi_2)P(\pi_2) = P(1|2)\,p_2$$

$$P(\text{observation is correctly classified as } \pi_2) = P(\text{observation comes from } \pi_2 \text{ and is correctly classified as } \pi_2) = P(X \in R_2 \mid \pi_2)P(\pi_2) = P(2|2)\,p_2$$

$$P(\text{observation is misclassified as } \pi_2) = P(\text{observation comes from } \pi_1 \text{ and is misclassified as } \pi_2) = P(X \in R_2 \mid \pi_1)P(\pi_1) = P(2|1)\,p_1.$$

Classification schemes are often evaluated in terms of their misclassification probabilities, but this ignores misclassification costs. A rule that ignores costs may cause problems. The costs of misclassification can be defined by a cost matrix (Table 10.2). The costs are zero for correct classification, c(1|2) when an observation from π2 is incorrectly classified as π1, and c(2|1) when a π1 observation is incorrectly classified as π2. For any rule, the average, or expected cost of misclassification (ECM) is obtained by multiplying the off-diagonal entries in Table 10.2 by their probabilities of occurrence.

Table 10.2: Costs of misclassification

                                  classify as:
                                  π1          π2
true population:    π1            0           c(2|1)
                    π2            c(1|2)      0

Consequently,
$$\mathrm{ECM} = c(2|1)P(2|1)\,p_1 + c(1|2)P(1|2)\,p_2.$$
A reasonable classification rule should have an ECM as small as possible.
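As a small numerical illustration (all numbers are hypothetical, not taken from any of the data sets): with prior probabilities p1 = 0.2, p2 = 0.8, costs c(2|1) = 100, c(1|2) = 50 and misclassification probabilities P(2|1) = 0.1, P(1|2) = 0.05, we obtain
$$\mathrm{ECM} = 100 \cdot 0.1 \cdot 0.2 + 50 \cdot 0.05 \cdot 0.8 = 2 + 2 = 4.$$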

Proposition 10.2.1. The regions R1 and R2 that minimize the ECM are defined by the values x for which the following inequalities hold:

$$R_1: \quad \underbrace{\frac{f_1(x)}{f_2(x)}}_{\text{density ratio}} \ \ge\ \underbrace{\frac{c(1|2)}{c(2|1)}}_{\text{cost ratio}}\ \underbrace{\frac{p_2}{p_1}}_{\text{prior probability ratio}} \tag{10.1}$$
$$R_2: \quad \frac{f_1(x)}{f_2(x)} \ <\ \frac{c(1|2)}{c(2|1)}\,\frac{p_2}{p_1}$$

ˆ When the prior probabilities are unknown, they are often taken to be equal, i.e. p2/p1 = 1, and the minimum ECM rule involves comparing the ratio of the popu- lation densities to the ratio of the appropriate misclassification costs;

ˆ equal misclassification costs: c(1 2)/c(2 1) = 1; | |

ˆ equal prior probabilities and equal misclassification costs: p2/p1 = c(1 2)/c(2 1) = 1. | |

If x0 is a new observation, it is assigned to π1, if

f1(x0) c(1 2) p2 | . f2(x0) ≥ c(2 1) p1 | On the other hand, if f1(x0) c(1 2) p2 < | , f2(x0) c(2 1) p1 | we assign x0 to π2.

Remark. There are other criteria for deriving “optimal” classification procedures. For example, choose R1 and R2 to minimize the total probability of misclassification (TPM):

$$\mathrm{TPM} = P(\text{misclassifying a } \pi_1 \text{ observation or misclassifying a } \pi_2 \text{ observation})$$
$$= P(\text{observation comes from } \pi_1 \text{ and is misclassified}) + P(\text{observation comes from } \pi_2 \text{ and is misclassified})$$
$$= p_1 \int_{R_2} f_1(x)\,dx + p_2 \int_{R_1} f_2(x)\,dx.$$
Mathematically, this problem is equivalent to minimizing the expected cost of misclassification when the costs of misclassification are equal.

10.2.1 Classification with Two Multivariate Normal Populations

Classification procedures based on normal populations predominate in statistical practice because of their simplicity and reasonably high efficiency across a wide variety of population models. We now assume that f1(x) and f2(x) are multivariate normal densities, the first with mean vector µ1 and covariance matrix Σ1 and the second with mean vector µ2 and covariance matrix Σ2. There are two cases to distinguish:

i) Case Σ1 = Σ2 = Σ

Suppose that the joint densities of X' = (X1,...,Xp) for populations π1 and π2 are given by

$$f_i(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2}(x - \mu_i)'\Sigma^{-1}(x - \mu_i)\right], \qquad i = 1, 2. \tag{10.2}$$

• Suppose that µ1, µ2, and Σ are known. Then, the minimum ECM regions in (10.1) become

$$R_1: \quad \exp\left[-\tfrac{1}{2}(x - \mu_1)'\Sigma^{-1}(x - \mu_1) + \tfrac{1}{2}(x - \mu_2)'\Sigma^{-1}(x - \mu_2)\right] \ \ge\ \frac{c(1|2)}{c(2|1)}\,\frac{p_2}{p_1}$$
$$R_2: \quad \exp\left[-\tfrac{1}{2}(x - \mu_1)'\Sigma^{-1}(x - \mu_1) + \tfrac{1}{2}(x - \mu_2)'\Sigma^{-1}(x - \mu_2)\right] \ <\ \frac{c(1|2)}{c(2|1)}\,\frac{p_2}{p_1}$$

Proposition 10.2.2. Let the populations π1 and π2 be described by multivariate normal densities of the form (10.2). Then the allocation rule that minimizes the ECM is as follows: Allocate x0 to π1 if

$$(\mu_1 - \mu_2)'\Sigma^{-1}x_0 - \tfrac{1}{2}(\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 + \mu_2) \ \ge\ \ln\left[\frac{c(1|2)}{c(2|1)}\,\frac{p_2}{p_1}\right]. \tag{10.3}$$
Allocate x0 to π2 otherwise.

• Suppose that µ1, µ2, and Σ are unknown. Suppose, then, that we have n1 observations of the multivariate normal random vector X' = (X1,...,Xp) from π1 and n2 observations of this quantity from π2, with n1 + n2 − 2 ≥ p. The respective data matrices are X1, of dimension (n1 × p) with rows x1j', and X2, of dimension (n2 × p) with rows x2j'. From these data matrices, the sample mean vectors and covariance matrices are determined by
$$\bar x_1 = \frac{1}{n_1}\sum_{j=1}^{n_1} x_{1j}, \qquad S_1 = \frac{1}{n_1 - 1}\sum_{j=1}^{n_1} (x_{1j} - \bar x_1)(x_{1j} - \bar x_1)',$$

$$\bar x_2 = \frac{1}{n_2}\sum_{j=1}^{n_2} x_{2j}, \qquad S_2 = \frac{1}{n_2 - 1}\sum_{j=1}^{n_2} (x_{2j} - \bar x_2)(x_{2j} - \bar x_2)'.$$
Since it is assumed that the parent populations have the same covariance matrix Σ, the sample covariance matrices S1 and S2 are combined (pooled) to derive a single, unbiased estimate of Σ. In particular, the weighted average

$$S_{\text{pooled}} = \frac{n_1 - 1}{(n_1 - 1) + (n_2 - 1)}\,S_1 + \frac{n_2 - 1}{(n_1 - 1) + (n_2 - 1)}\,S_2$$
is an unbiased estimate of Σ if the data matrices X1 and X2 contain random samples from the populations π1 and π2, respectively.

Substituting x̄1 for µ1, x̄2 for µ2, and S_pooled for Σ in (10.3) gives the “sample” classification rule:

Proposition 10.2.3. Allocate x0 to π1 if

$$(\bar x_1 - \bar x_2)'S_{\text{pooled}}^{-1}x_0 - \tfrac{1}{2}(\bar x_1 - \bar x_2)'S_{\text{pooled}}^{-1}(\bar x_1 + \bar x_2) \ \ge\ \ln\left[\frac{c(1|2)}{c(2|1)}\,\frac{p_2}{p_1}\right].$$
Allocate x0 to π2 otherwise.

ii) Case Σ1 ≠ Σ2

As might be expected, the classification rules are more complicated when the population covariance matrices are unequal. For details see Johnson and Wichern (2007), pp. 593–595.

Remark. If the data are not multivariate normal, they can be transformed to data more nearly normal, and a test for the equality of covariance matrices can be conducted.
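Returning to Proposition 10.2.3, here is a minimal R sketch of the sample minimum-ECM rule on simulated two-group data (the data, priors and costs are all illustrative):

set.seed(5)
X1 <- matrix(rnorm(30 * 2, mean = 0), ncol = 2)   # hypothetical sample from pi_1
X2 <- matrix(rnorm(40 * 2, mean = 1), ncol = 2)   # hypothetical sample from pi_2

n1 <- nrow(X1); n2 <- nrow(X2)
xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)
Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)   # pooled covariance S_pooled

p1 <- 0.5; p2 <- 0.5          # illustrative prior probabilities
c12 <- 1; c21 <- 1            # illustrative misclassification costs c(1|2), c(2|1)

classify <- function(x0) {    # allocation rule of Proposition 10.2.3
  lhs <- drop((xbar1 - xbar2) %*% solve(Sp, x0)) -
    0.5 * drop((xbar1 - xbar2) %*% solve(Sp, xbar1 + xbar2))
  if (lhs >= log((c12 / c21) * (p2 / p1))) "pi_1" else "pi_2"
}

classify(c(0, 0))             # allocate a new observation x_0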

10.2.2 Fisher’s Approach to Classification with Two Populations

Fisher’s idea was to transform the multivariate observations x to univariate observations y such that the y’s derived from populations π1 and π2 were separated as much as possible. Fisher suggested taking linear combinations of x to create y’s because they are simple enough functions of x to be handled easily. Fisher’s approach does not assume that the populations are normal. It does, however, implicitly assume that the population covariance matrices are equal.

Proposition 10.2.4. The linear combination ŷ = â'x = (x̄1 − x̄2)'S_pooled^{-1} x maximizes the ratio

$$\frac{(\text{squared distance between sample means of } y)}{(\text{sample variance of } y)} = \frac{(\bar y_1 - \bar y_2)^2}{s_y^2} = \frac{(\hat a'\bar x_1 - \hat a'\bar x_2)^2}{\hat a'S_{\text{pooled}}\,\hat a}$$
over all possible coefficient vectors â (see Figure 10.3). The maximum of the ratio is

$$D^2 = (\bar x_1 - \bar x_2)'S_{\text{pooled}}^{-1}(\bar x_1 - \bar x_2).$$

Figure 10.3: Fisher’s procedure for two populations with p = 2. Source: Johnson and Wichern (2007).

Proposition 10.2.5. An allocation rule based on Fisher’s discriminant function: Allocate x0 to π1 if

$$\hat y_0 = (\bar x_1 - \bar x_2)'S_{\text{pooled}}^{-1}x_0 \ \ge\ \hat m = \tfrac{1}{2}(\bar x_1 - \bar x_2)'S_{\text{pooled}}^{-1}(\bar x_1 + \bar x_2).$$

Allocate x0 to π2 if ŷ0 < m̂.

Example (El Niño, p. 10-2). In Figure 10.4 the Fisher discriminant function is shown for the variable pairs pressure (Pres)–temperature (Temp) and precipitation (Prec)–temperature (Temp), respectively. There is one misclassification, namely the year 1953.

Remark. The performance of any classification procedure should always be checked. Ideally, there will be enough data available to provide for “training” samples and “validation” samples. The training samples can be used to develop the classification function, and the validation samples can be used to evaluate its performance.
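One convenient way to carry out such a check is linear discriminant analysis with lda() from the MASS package, sketched here on the temperature and pressure values transcribed from Table 10.1; with equal priors this applies essentially the same pooled-covariance rule as Proposition 10.2.5 (the variable names temp, press and nino are ours):

# Temperature, pressure and El Nino indicator from Table 10.1 (years 1951-1970)
temp  <- c(26.1, 24.5, 24.8, 24.5, 24.1, 24.3, 26.4, 24.9, 23.7, 23.5,
           24.0, 24.1, 23.7, 24.3, 26.6, 24.6, 24.8, 24.4, 26.8, 25.2)
press <- c(1009.5, 1010.9, 1010.7, 1011.2, 1011.9, 1011.2, 1009.3, 1011.1, 1012.0, 1011.4,
           1010.9, 1011.5, 1011.0, 1011.2, 1009.9, 1012.5, 1011.1, 1011.8, 1009.3, 1010.6)
nino  <- factor(c(1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0))

library(MASS)
fit  <- lda(nino ~ temp + press, prior = c(0.5, 0.5))   # pooled-covariance discriminant rule
pred <- predict(fit)$class                               # in-sample ("training") classifications
table(observed = nino, predicted = pred)                 # confusion table

With more data one would instead classify a held-out validation sample via predict(fit, newdata = ...).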

Figure 10.4: Discriminant function of the data for Guayaquil, shown in the temperature–pressure and temperature–precipitation planes (El Niño and non-El Niño years plotted with different symbols). Data set: Table 10.1, p. 10-2.