Chapter 2 Principal Components Analysis

Math 3210

Dr. Zeng

Outline

- What is Principal Components Analysis (PCA)?
- Examples
- Ideas behind PCA
- PC loadings and scores
- Performing PCA in R
- Practice

What is Principal Components Analysis (PCA)?

Principal component analysis is performed in order to simplify the description of a set of interrelated variables. The technique can be summarized as a method of transforming the original variables into new, uncorrelated variables. The new variables are called the principal components. Each principal component is a linear combination of the original variables. The goal of PCA is to find the best low-dimensional representation of the variation in a multivariate data set. It is thus one of the classical methods of dimension reduction: it reduces the number of variables/dimensions without losing much of the information.

Data Example

In the case of the wine data set, we have 13 chemical concentrations describing wine samples from three different cultivars. We can carry out a principal component analysis to investigate whether we can capture most of the variation between samples using a smaller number of new variables (principal components), where each of these new variables is a linear combination of all or some of the 13 chemical concentrations. One way to read the data into R is sketched below.
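The slides never show the step that loads the wine data, so here is a minimal sketch, assuming the classic UCI repository file is used. With header = FALSE, R names the columns V1-V14, matching the variable names used throughout this chapter (V1 is the cultivar label).

# Read the UCI wine data set; V1 = cultivar, V2-V14 = the 13 chemical concentrations
wine <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",
                 header = FALSE)
table(wine$V1) # 59, 71 and 48 samples from the three cultivars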

Ideas behind PCA

Suppose that there are p variables x1, x2, ..., xp in the data set. In PCA, we create a set of p new variables, C1, C2, ..., Cp, each of which is a linear combination of x1, x2, ..., xp. These linear combinations, C1, C2, ..., Cp, are called the principal components.

C1 = a11x1 + a12x2 + ... + a1pxp

C2 = a21x1 + a22x2 + ... + a2pxp
...

Cp = ap1x1 + ap2x2 + ... + appxp

The coefficients a11, a12, ..., app are chosen to satisfy three requirements.

- The variance of C1 is as large as possible; in general, var(C1) ≥ var(C2) ≥ ... ≥ var(Cp).
- The values of any two principal components are uncorrelated.

- For any principal component, the sum of the squares of the coefficients is one. That is, ai1^2 + ai2^2 + ... + aip^2 = 1 for any i = 1, ..., p.
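These three requirements can be checked directly in R. For standardised variables, the loading vectors are the eigenvectors of the correlation matrix, ordered by decreasing eigenvalue. A minimal sketch, using the built-in USArrests data purely as a stand-in for x1, ..., xp:

X <- scale(USArrests)        # standardise the variables
eig <- eigen(cor(USArrests)) # eigen-decomposition of the correlation matrix
A <- eig$vectors             # columns are the loading vectors (ai1, ..., aip)
C <- X %*% A                 # columns are the principal components C1, ..., Cp
round(apply(C, 2, var), 4)   # requirement 1: variances in decreasing order
round(cor(C), 4)             # requirement 2: components are uncorrelated
colSums(A^2)                 # requirement 3: each loading vector has unit length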

The mathematical solution for the coefficients was originally derived by Hotelling (1933). The solution is illustrated graphically in the next figure.

Illustration of Two Principal Components

[Figure: scatterplot of V5 against V4 for the wine samples, used to illustrate the directions of the first two principal components.]

Question: How do we locate the two principal components C1 and C2 in this figure? A sketch for this two-variable case is given below.
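A sketch of the answer, assuming the wine data frame loaded earlier: in the V4-V5 scatterplot, C1 points along the direction of greatest spread and C2 is perpendicular to it. The arrow length 2 is arbitrary, chosen only for visibility.

sub <- scale(wine[, c("V4", "V5")]) # standardise the two variables
e <- eigen(cor(sub))                # 2-D loading vectors (eigenvectors)
plot(sub, xlab = "V4 (standardised)", ylab = "V5 (standardised)")
# draw both PC directions as red arrows from the origin
arrows(0, 0, 2 * e$vectors[1, ], 2 * e$vectors[2, ], col = "red", length = 0.1)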

Principal Components Loadings

- C1 = a11x1 + a12x2 + ... + a1pxp is called the first principal component of x1, x2, ..., xp.

- The coefficients a11, ..., a1p are called the loadings of the first principal component.

- The loadings jointly make up the principal component loading vector, denoted by

φ1 = (a11, a12, ..., a1p)^T

Principal Component Scores

- Once we obtain the loadings (aij) for the principal components, we can calculate the principal component scores for each observation.

- The first PC score for observation 1 (row 1) is obtained by plugging the observed variable values for that observation into the first PC linear combination:

C1 = a11x1 + a12x2 + ... + a1pxp

- You will have n PC scores for the first principal component because there is a total of n observations.

- You can repeat the same process to obtain n PC scores for the second principal component, and so on. A sketch of this calculation is given below.
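A minimal sketch of the score calculation, assuming the objects stdconcentrations and wine.pca created in the "Performing PCA in R" section below: the scores are just the standardised observations multiplied by the loading vector.

phi1 <- wine.pca$loadings[, 1]                   # first PC loading vector
scores1 <- as.matrix(stdconcentrations) %*% phi1 # n scores, one per observation
head(scores1)                                    # agrees with wine.pca$scores[, 1]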

Questions to consider:

- How do we estimate the PC loadings (the coefficients aij)? [use R]
- How do we estimate the corresponding PC scores? [use R]

- How do we determine how many principal components we actually need? [numerically and graphically]

Performing PCA in R

To carry out a principal component analysis (PCA) on a multivariate data set, the first step is often to standardise the variables under study using the scale() function. This is necessary if the input variables have very different variances, which is true in this case: the concentrations of the 13 chemicals have very different variances.
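A quick way to see this, assuming the wine data frame from earlier:

round(sapply(wine[2:14], var), 3) # raw variances differ by orders of magnitude
# after running the scale() step below, every variance becomes exactly 1:
# round(sapply(stdconcentrations, var), 3)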

For example, to standardise the concentrations of the 13 chemicals in the wine samples:

stdconcentrations <- as.data.frame(scale(wine[2:14])) # standardise the variables

Once you have standardised your variables, you can carry out a principal component analysis on the standardised concentrations using the princomp() function in R:

wine.pca <- princomp(stdconcentrations) # do a PCA
summary(wine.pca)

Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
Standard deviation     2.1631951 1.5757366 1.1991447 0.9559347 0.92110518
Proportion of Variance 0.3619885 0.1920749 0.1112363 0.0706903 0.06563294
Cumulative Proportion  0.3619885 0.5540634 0.6652997 0.7359900 0.80162293
                           Comp.6     Comp.7     Comp.8     Comp.9
Standard deviation     0.79878171 0.74022473 0.58867607 0.53596364
Proportion of Variance 0.04935823 0.04238679 0.02680749 0.02222153
Cumulative Proportion  0.85098116 0.89336795 0.92017544 0.94239698
                          Comp.10    Comp.11    Comp.12     Comp.13
Standard deviation     0.49949266 0.47383559 0.40966094 0.320619963
Proportion of Variance 0.01930019 0.01736836 0.01298233 0.007952149
Cumulative Proportion  0.96169717 0.97906553 0.99204785 1.000000000

Note:

- The summary above gives us the standard deviation of each component and the proportion of variance explained by each component.

- The first five principal components account for over 80% of the variability in the data. Subsequent principal components contribute smaller amounts.

The standard deviations of the components are stored in a named element called "sdev" of the output object made by princomp():

wine.pca$sdev # standard deviations of the PCs

   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7
2.1631951 1.5757366 1.1991447 0.9559347 0.9211052 0.7987817 0.7402247
   Comp.8    Comp.9   Comp.10   Comp.11   Comp.12   Comp.13
0.5886761 0.5359636 0.4994927 0.4738356 0.4096609 0.3206200

We can also extract the PC loadings from the object wine.pca by using:

wine.pca$loadings # loadings

Loadings:
    Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
V2  -0.144 -0.484 -0.207 0.266 0.214 0.396 0.509 0.212
V3  0.245 -0.225 0.537 0.537 -0.421 -0.309
V4  -0.316 0.626 -0.214 0.143 0.154 0.149 -0.170 -0.308
V5  0.239 0.612 -0.101 0.287 0.428 0.200
V6  -0.142 -0.300 0.131 -0.352 -0.727 -0.323 -0.156 0.271
V7  -0.395 0.146 0.198 0.149 -0.406 0.286 -0.320
V8  -0.423 0.151 0.152 0.109 -0.187 -0.163
V9  0.299 0.170 -0.203 0.501 -0.259 -0.595 -0.233 0.196 0.216
V10 -0.313 0.149 0.399 -0.137 -0.534 -0.372 0.368 -0.209 0.134
V11 -0.530 -0.137 -0.419 0.228 -0.291
V12 -0.297 0.279 -0.428 0.174 0.106 -0.232 0.437 -0.522
V13 -0.376 0.164 0.166 0.184 0.101 0.266 0.137 0.524
V14 -0.287 -0.365 -0.127 -0.232 0.158 0.120 0.120 -0.576 0.162
    Comp.11 Comp.12 Comp.13
V2  -0.226 -0.266
V3  0.122
V4  -0.499 0.141
V5  0.479
V6
V7  0.304 -0.304 0.464
V8  -0.832
V9  0.117 -0.114
V10 -0.237 0.117
V11 0.604
V12 0.259
V13 0.601 0.157
V14 0.539

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.077  0.077  0.077  0.077  0.077  0.077  0.077  0.077
Cumulative Var  0.077  0.154  0.231  0.308  0.385  0.462  0.538  0.615
               Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
SS loadings     1.000   1.000   1.000   1.000   1.000
Proportion Var  0.077   0.077   0.077   0.077   0.077
Cumulative Var  0.692   0.769   0.846   0.923   1.000

Note:

- The output above gives the estimated loadings for each component.

- Note that some values appear to be missing. These values are not zero; they are merely close to zero and are suppressed in the printout. We mainly concern ourselves with the loadings that are largest in absolute value.

- Taking the first PC as an example, its three largest loadings in absolute value are -0.423 for V8, -0.395 for V7, and -0.376 for V13.

- The first PC accounts for 36.19% of the variability in the original data.

We can also extract the PC scores from the object wine.pca by using:

winescores <- wine.pca$scores # PC scores
winescores[1:5, 1] # PC scores of the first 5 observations on the first PC

[1] -3.307421 -2.203250 -2.509661 -3.746497 -1.006070

Determine how many principal components to retain

In order to decide how many principal components should be retained, it is common to summarise the results of a principal components analysis by making a scree plot, which we can do in R using the screeplot() function:

screeplot(wine.pca,main="")

[Scree plot: bar chart of the variances of the components.]

screeplot(wine.pca, type="lines")

[Scree plot, line version, of the component variances.]

There are three ways to decide how many principal components to retain.

Method 1: Scree plots

A scree plot helps you visualise how much of the variability in the data is accounted for by each component. The most obvious change in slope in the scree plot occurs at component 4, which is the "elbow" of the scree plot. Therefore, it could be argued on the basis of the scree plot that the first three components should be retained.

Method 2: Kaiser's criterion

Kaiser's criterion says that we should retain only the principal components whose variance is above 1 (when the principal component analysis was applied to standardised data). We can check this by finding the variance of each of the principal components:

(wine.pca$sdev)^2

   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7
4.6794129 2.4829458 1.4379480 0.9138111 0.8484348 0.6380522 0.5479326
   Comp.8    Comp.9   Comp.10   Comp.11   Comp.12   Comp.13
0.3465395 0.2872570 0.2494929 0.2245202 0.1678221 0.1027972

Only the first three components have variances above 1, so Kaiser's criterion also suggests retaining three components.

Method 3: A third way to decide how many principal components to retain is to keep the number of components required to explain at least some minimum amount of the total variance. For example, if it is important to explain at least 80% of the variance, we would retain the first five principal components, since the first five principal components explain 80.2% of the variance, while the first four components explain only 73.6%. Both rules can be checked with a short calculation, sketched below.
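A minimal sketch of both checks, assuming the fitted wine.pca object:

pve <- wine.pca$sdev^2 / sum(wine.pca$sdev^2) # proportion of variance explained
sum(wine.pca$sdev^2 > 1)      # Kaiser's criterion: number of variances above 1
which(cumsum(pve) >= 0.80)[1] # smallest number of PCs explaining at least 80%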

Interpret the principal components graphically

A biplot is an enhanced scatterplot that uses both points and vectors to represent structure. The biplot() function produces a biplot from the output of princomp() or prcomp().

biplot(wine.pca)

[Biplot of the wine data: points are the 178 wine samples plotted on Comp.1 (horizontal) and Comp.2 (vertical); red arrows are the loading vectors of V2-V14.]

#which(wine$V1==1)
#which(wine$V1==2)
#which(wine$V1==3)

Interpreting Points:

- A biplot uses points to represent the scores of each observation (i.e. each wine sample) on the first two principal components. In this example, the points represent wine samples.

- The left and bottom axes show the normalized principal component scores (points). For example, the score of the 4th observation on the first principal component is -3.746497 (winescores[4,1]) and on the second principal component is -2.748618 (winescores[4,2]).

- The relative locations of the points can be interpreted. Points that are close together correspond to observations that have similar scores on the components displayed in the plot; such observations have similar profiles.

- In this example, wine samples that are close together tend to be from the same group. Note that there are three clusters on the graph because there are three groups; the sketch below makes the clusters explicit.
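A minimal sketch of colouring the score plot by group, assuming (as the commented which() calls above suggest) that wine$V1 holds the cultivar label:

plot(winescores[, 1], winescores[, 2], col = wine$V1, pch = 19,
     xlab = "Comp.1", ylab = "Comp.2") # one colour per cultivar
legend("topright", legend = 1:3, col = 1:3, pch = 19, title = "Cultivar")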

Interpreting Vectors:

- A biplot uses vectors (red arrows) to represent the loading of each variable (i.e. each chemical concentration) on the first two principal components. In this example, the vectors represent chemical concentrations.

- The top and right axes show the loadings (vectors). For example, the loading of V12 on the first principal component is -0.297 and its loading on the second principal component is 0.279 (wine.pca$loadings).

- Vectors that point in the same direction correspond to variables that have similar response profiles, and can be interpreted as having similar meaning in the context set by the data.

- In this example, the concentrations of both V5 and V9 are high in group 3. Also, the concentrations of V7, V8 and V10 are high in group 1.

Practice

Use the built-in dataset “USArrests” in this practice.

- Following the instructions in Chapter 1, analyse the data graphically and numerically.

- Following the instructions in Chapter 2, conduct the principal component analysis on the USArrests data and interpret the results.

head(USArrests)
stdconcentrations <- as.data.frame(scale(USArrests))
USArrests.pca <- princomp(stdconcentrations) # do a PCA
print(USArrests.pca)
summary(USArrests.pca)
screeplot(USArrests.pca, main="scree plot")
screeplot(USArrests.pca, type="lines")
USArrests.pca$loadings
biplot(USArrests.pca)
biplot(USArrests.pca, scale=0, cex=.7)

Interpretation of PC and loadings:

- PC1 explains about 62% of the variance and PC2 about 24%, so PC1 and PC2 together explain about 86% of the variance. PC4 contributes very little to the variance of the original data.

- From the loadings output, the first loading vector places approximately equal weight on Assault, Murder and Rape, with much less weight on UrbanPop. Hence this component roughly corresponds to a measure of the overall rate of serious crimes.

- The second loading vector places most of its weight on UrbanPop and much less weight on the other three features. Hence, this component roughly corresponds to the level of urbanisation of the state.

Interpretation of biplot:

- Points that are close to each other correspond to states with similar profiles. For example, Mississippi, North Carolina and South Carolina have similar crime rates.

- Overall, we see that the crime-related variables are located close to each other, and that the UrbanPop variable is far from the other three. This indicates that the crime-related variables are correlated with each other: states with high murder rates tend to have high assault and rape rates. The UrbanPop variable is less correlated with the other three. (See the quick check after this list.)

- The states lying in the direction a red arrow points have higher values of that variable, and the states lying in the opposite direction have lower values. For example, the murder rate is high in Georgia, Alaska and Louisiana, and low in Connecticut and Washington.
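A one-line check of the correlation claim above (the exact values are not shown in the slides, so treat this as a sketch to run yourself):

round(cor(USArrests), 2) # Murder, Assault and Rape correlate strongly; UrbanPop much less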