Chapter 2 Principal Components Analysis

Math 3210

Dr. Zeng

Outline

- What is Principal Components Analysis (PCA)?
- Examples
- Ideas behind PCA
- PC loadings and scores
- Performing PCA in R
- Practice

What is Principal Components Analysis (PCA)?

Principal component analysis is performed in order to simplify the description of a set of interrelated variables. The technique can be summarized as a method of transforming the original variables into new, uncorrelated variables. The new variables are called the principal components. Each principal component is a linear combination of the original variables. The goal of PCA is to find the best low-dimensional representation of the variation in a multivariate data set. It is thus one of the classical methods of dimension reduction: it reduces the number of variables/dimensions without losing much of the information.

Data Example

In the case of the wine data set, we have 13 chemical concentrations describing wine samples from three different cultivars. We can carry out a principal component analysis to investigate whether we can capture most of the variation between samples using a smaller number of new variables (principal components), where each of these new variables is a linear combination of all or some of the 13 chemical concentrations. One way to read the data into R is sketched below.
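The slides never show the step that loads the wine data, so here is a minimal sketch, assuming the classic UCI repository file is used. With header = FALSE, R names the columns V1-V14, matching the variable names used throughout this chapter (V1 is the cultivar label).

# Read the UCI wine data set; V1 = cultivar, V2-V14 = the 13 chemical concentrations
wine <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",
                 header = FALSE)
table(wine$V1) # 59, 71 and 48 samples from the three cultivars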

Ideas behind PCA

Suppose that there are p variables x1, x2, ..., xp in the data set. In PCA, we create a set of p new variables, C1, C2, ..., Cp, each of which is a linear combination of x1, x2, ..., xp. These linear combinations, C1, C2, ..., Cp, are called the principal components.

C1 = a11x1 + a12x2 + ... + a1pxp

C2 = a21x1 + a22x2 + ... + a2pxp
...

Cp = ap1x1 + ap2x2 + ... + appxp

The coefficients a11, a12, ..., app are chosen to satisfy three requirements.

- The variance of C1 is as large as possible; in general, var(C1) ≥ var(C2) ≥ ... ≥ var(Cp).
- The values of any two principal components are uncorrelated.

- For any principal component, the sum of the squares of the coefficients is one. That is, ai1^2 + ai2^2 + ... + aip^2 = 1 for any i = 1, ..., p.
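These three requirements can be checked directly in R. For standardised variables, the loading vectors are the eigenvectors of the correlation matrix, ordered by decreasing eigenvalue. A minimal sketch, using the built-in USArrests data purely as a stand-in for x1, ..., xp:

X <- scale(USArrests)        # standardise the variables
eig <- eigen(cor(USArrests)) # eigen-decomposition of the correlation matrix
A <- eig$vectors             # columns are the loading vectors (ai1, ..., aip)
C <- X %*% A                 # columns are the principal components C1, ..., Cp
round(apply(C, 2, var), 4)   # requirement 1: variances in decreasing order
round(cor(C), 4)             # requirement 2: components are uncorrelated
colSums(A^2)                 # requirement 3: each loading vector has unit length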

The mathematical solution for the coefficients was originally derived by Hotelling (1933). The solution is illustrated graphically in the next figure.

Illustration of Two Principal Components

[Figure: scatterplot of V5 against V4 for the wine samples, used to illustrate the directions of the first two principal components.]

Question: How do we locate the two principal components C1 and C2 in this figure? A sketch for this two-variable case is given below.
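A sketch of the answer, assuming the wine data frame loaded earlier: in the V4-V5 scatterplot, C1 points along the direction of greatest spread and C2 is perpendicular to it. The arrow length 2 is arbitrary, chosen only for visibility.

sub <- scale(wine[, c("V4", "V5")]) # standardise the two variables
e <- eigen(cor(sub))                # 2-D loading vectors (eigenvectors)
plot(sub, xlab = "V4 (standardised)", ylab = "V5 (standardised)")
# draw both PC directions as red arrows from the origin
arrows(0, 0, 2 * e$vectors[1, ], 2 * e$vectors[2, ], col = "red", length = 0.1)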

Principal Components Loadings

- C1 = a11x1 + a12x2 + ... + a1pxp is called the first principal component of x1, x2, ..., xp.

- The coefficients a11, ..., a1p are called the loadings of the first principal component.

- The loadings jointly make up the principal component loading vector, denoted by

φ1 = (a11, a12, ..., a1p)^T

Principal Component Scores

- Once we obtain the loadings (aij) for the principal components, we can calculate the principal component scores for each observation.

- The first PC score for observation 1 (row 1) is obtained by plugging the observed variable values for that observation into the first PC linear combination:

C1 = a11x1 + a12x2 + ... + a1pxp

- You will have n PC scores for the first principal component because there is a total of n observations.

- You can repeat the same process to obtain n PC scores for the second principal component, and so on. A sketch of this calculation is given below.
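A minimal sketch of the score calculation, assuming the objects stdconcentrations and wine.pca created in the "Performing PCA in R" section below: the scores are just the standardised observations multiplied by the loading vector.

phi1 <- wine.pca$loadings[, 1]                   # first PC loading vector
scores1 <- as.matrix(stdconcentrations) %*% phi1 # n scores, one per observation
head(scores1)                                    # agrees with wine.pca$scores[, 1]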

Questions to consider:

- How do we estimate the PC loadings (the coefficients aij)? [use R]
- How do we estimate the corresponding PC scores? [use R]

- How do we determine how many principal components we actually need? [numerically and graphically]

Performing PCA in R

To carry out a principal component analysis (PCA) on a multivariate data set, the first step is often to standardise the variables under study using the scale() function. This is necessary if the input variables have very different variances, which is true in this case: the concentrations of the 13 chemicals have very different variances.
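A quick way to see this, assuming the wine data frame from earlier:

round(sapply(wine[2:14], var), 3) # raw variances differ by orders of magnitude
# after running the scale() step below, every variance becomes exactly 1:
# round(sapply(stdconcentrations, var), 3)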

For example, to standardise the concentrations of the 13 chemicals in the wine samples:

stdconcentrations <- as.data.frame(scale(wine[2:14])) # standardise the variables

Once you have standardised your variables, you can carry out a principal component analysis on the standardised concentrations using the princomp() function in R:

wine.pca <- princomp(stdconcentrations) # do a PCA
summary(wine.pca)

Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
Standard deviation     2.1631951 1.5757366 1.1991447 0.9559347 0.92110518
Proportion of Variance 0.3619885 0.1920749 0.1112363 0.0706903 0.06563294
Cumulative Proportion  0.3619885 0.5540634 0.6652997 0.7359900 0.80162293
                           Comp.6     Comp.7     Comp.8     Comp.9
Standard deviation     0.79878171 0.74022473 0.58867607 0.53596364
Proportion of Variance 0.04935823 0.04238679 0.02680749 0.02222153
Cumulative Proportion  0.85098116 0.89336795 0.92017544 0.94239698
                          Comp.10    Comp.11    Comp.12     Comp.13
Standard deviation     0.49949266 0.47383559 0.40966094 0.320619963
Proportion of Variance 0.01930019 0.01736836 0.01298233 0.007952149
Cumulative Proportion  0.96169717 0.97906553 0.99204785 1.000000000

Note:

- The summary above gives us the standard deviation of each component and the proportion of variance explained by each component.

- The first five principal components account for over 80% of the variability in the data. Subsequent principal components contribute smaller amounts.

The standard deviations of the components are stored in a named element called "sdev" of the output object made by princomp():

wine.pca$sdev # standard deviations of the PCs

   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7
2.1631951 1.5757366 1.1991447 0.9559347 0.9211052 0.7987817 0.7402247
   Comp.8    Comp.9   Comp.10   Comp.11   Comp.12   Comp.13
0.5886761 0.5359636 0.4994927 0.4738356 0.4096609 0.3206200

We can also extract the PC loadings from the object wine.pca by using:

wine.pca$loadings # loadings

Loadings:
    Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
V2  -0.144 -0.484 -0.207 0.266 0.214 0.396 0.509 0.212
V3  0.245 -0.225 0.537 0.537 -0.421 -0.309
V4  -0.316 0.626 -0.214 0.143 0.154 0.149 -0.170 -0.308
V5  0.239 0.612 -0.101 0.287 0.428 0.200
V6  -0.142 -0.300 0.131 -0.352 -0.727 -0.323 -0.156 0.271
V7  -0.395 0.146 0.198 0.149 -0.406 0.286 -0.320
V8  -0.423 0.151 0.152 0.109 -0.187 -0.163
V9  0.299 0.170 -0.203 0.501 -0.259 -0.595 -0.233 0.196 0.216
V10 -0.313 0.149 0.399 -0.137 -0.534 -0.372 0.368 -0.209 0.134
V11 -0.530 -0.137 -0.419 0.228 -0.291
V12 -0.297 0.279 -0.428 0.174 0.106 -0.232 0.437 -0.522
V13 -0.376 0.164 0.166 0.184 0.101 0.266 0.137 0.524
V14 -0.287 -0.365 -0.127 -0.232 0.158 0.120 0.120 -0.576 0.162
    Comp.11 Comp.12 Comp.13
V2  -0.226 -0.266
V3  0.122
V4  -0.499 0.141
V5  0.479
V6
V7  0.304 -0.304 0.464
V8  -0.832
V9  0.117 -0.114
V10 -0.237 0.117
V11 0.604
V12 0.259
V13 0.601 0.157
V14 0.539

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.077  0.077  0.077  0.077  0.077  0.077  0.077  0.077
Cumulative Var  0.077  0.154  0.231  0.308  0.385  0.462  0.538  0.615
               Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
SS loadings     1.000   1.000   1.000   1.000   1.000
Proportion Var  0.077   0.077   0.077   0.077   0.077
Cumulative Var  0.692   0.769   0.846   0.923   1.000

Note:

- The output above gives the estimated loadings for each component.

- Note that some values appear to be missing. These values are not zero; they are merely close to zero and are suppressed in the printout. We mainly concern ourselves with the loadings that are largest in absolute value.

- Taking the first PC as an example, its three largest loadings in absolute value are -0.423 for V8, -0.395 for V7, and -0.376 for V13.

- The first PC accounts for 36.19% of the variability in the original data.

We can also extract the PC scores from the object wine.pca by using:

winescores <- wine.pca$scores # PC scores
winescores[1:5, 1] # PC scores of the first 5 observations on the first PC

[1] -3.307421 -2.203250 -2.509661 -3.746497 -1.006070

Determine how many principal components to retain

In order to decide how many principal components should be retained, it is common to summarise the results of a principal components analysis by making a scree plot, which we can do in R using the screeplot() function:

screeplot(wine.pca,main="")

[Scree plot: bar chart of the variances of the components.]

screeplot(wine.pca, type="lines")

[Scree plot, line version, of the component variances.]

There are three ways to decide how many principal components to retain.

Method 1: Scree plots

A scree plot helps you visualise how much of the variability in the data is accounted for by each component. The most obvious change in slope in the scree plot occurs at component 4, which is the "elbow" of the scree plot. Therefore, it could be argued on the basis of the scree plot that the first three components should be retained.

Method 2: Kaiser's criterion

Kaiser's criterion says that we should retain only the principal components whose variance is above 1 (when the principal component analysis was applied to standardised data). We can check this by finding the variance of each of the principal components:

(wine.pca$sdev)^2

   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7
4.6794129 2.4829458 1.4379480 0.9138111 0.8484348 0.6380522 0.5479326
   Comp.8    Comp.9   Comp.10   Comp.11   Comp.12   Comp.13
0.3465395 0.2872570 0.2494929 0.2245202 0.1678221 0.1027972

Only the first three components have variances above 1, so Kaiser's criterion also suggests retaining three components.

Method 3: A third way to decide how many principal components to retain is to keep the number of components required to explain at least some minimum amount of the total variance. For example, if it is important to explain at least 80% of the variance, we would retain the first five principal components, since the first five principal components explain 80.2% of the variance, while the first four components explain only 73.6%. Both rules can be checked with a short calculation, sketched below.
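A minimal sketch of both checks, assuming the fitted wine.pca object:

pve <- wine.pca$sdev^2 / sum(wine.pca$sdev^2) # proportion of variance explained
sum(wine.pca$sdev^2 > 1)      # Kaiser's criterion: number of variances above 1
which(cumsum(pve) >= 0.80)[1] # smallest number of PCs explaining at least 80%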

Interpret the principal components graphically

A biplot is an enhanced scatterplot that uses both points and vectors to represent structure. The biplot() function produces a biplot from the output of princomp() or prcomp().

biplot(wine.pca)

[Biplot of the wine data: points are the 178 wine samples plotted on Comp.1 (horizontal) and Comp.2 (vertical); red arrows are the loading vectors of V2-V14.]

#which(wine$V1==1)
#which(wine$V1==2)
#which(wine$V1==3)

Interpreting Points:

- A biplot uses points to represent the scores of each observation (i.e. each wine sample) on the first two principal components. In this example, the points represent wine samples.

- The left and bottom axes show the normalized principal component scores (points). For example, the score of the 4th observation on the first principal component is -3.746497 (winescores[4,1]) and on the second principal component is -2.748618 (winescores[4,2]).

- The relative locations of the points can be interpreted. Points that are close together correspond to observations that have similar scores on the components displayed in the plot; such observations have similar profiles.

- In this example, wine samples that are close together tend to be from the same group. Note that there are three clusters on the graph because there are three groups; the sketch below makes the clusters explicit.
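A minimal sketch of colouring the score plot by group, assuming (as the commented which() calls above suggest) that wine$V1 holds the cultivar label:

plot(winescores[, 1], winescores[, 2], col = wine$V1, pch = 19,
     xlab = "Comp.1", ylab = "Comp.2") # one colour per cultivar
legend("topright", legend = 1:3, col = 1:3, pch = 19, title = "Cultivar")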

Interpreting Vectors:

- A biplot uses vectors (red arrows) to represent the loading of each variable (i.e. each chemical concentration) on the first two principal components. In this example, the vectors represent chemical concentrations.

- The top and right axes show the loadings (vectors). For example, the loading of V12 on the first principal component is -0.297 and its loading on the second principal component is 0.279 (wine.pca$loadings).

- Vectors that point in the same direction correspond to variables that have similar response profiles, and can be interpreted as having similar meaning in the context set by the data.

- In this example, the concentrations of both V5 and V9 are high in group 3. Also, the concentrations of V7, V8 and V10 are high in group 1.

Practice

Use the built-in dataset “USArrests” in this practice.

- Following the instructions in Chapter 1, analyse the data graphically and numerically.

- Following the instructions in Chapter 2, conduct the principal component analysis on the USArrests data and interpret the results.

head(USArrests)
stdconcentrations <- as.data.frame(scale(USArrests))
USArrests.pca <- princomp(stdconcentrations) # do a PCA
print(USArrests.pca)
summary(USArrests.pca)
screeplot(USArrests.pca, main="scree plot")
screeplot(USArrests.pca, type="lines")
USArrests.pca$loadings
biplot(USArrests.pca)
biplot(USArrests.pca, scale=0, cex=.7)

Interpretation of PC and loadings:

- PC1 explains about 62% of the variance and PC2 about 24%, so PC1 and PC2 together explain about 86% of the variance. PC4 contributes very little to the variance of the original data.

- From the loadings output, the first loading vector places approximately equal weight on Assault, Murder and Rape, with much less weight on UrbanPop. Hence this component roughly corresponds to a measure of the overall rate of serious crimes.

- The second loading vector places most of its weight on UrbanPop and much less weight on the other three features. Hence, this component roughly corresponds to the level of urbanisation of the state.

Interpretation of biplot:

- Points that are close to each other correspond to states with similar profiles. For example, Mississippi, North Carolina and South Carolina have similar crime rates.

- Overall, we see that the crime-related variables are located close to each other, and that the UrbanPop variable is far from the other three. This indicates that the crime-related variables are correlated with each other: states with high murder rates tend to have high assault and rape rates. The UrbanPop variable is less correlated with the other three. (See the quick check after this list.)

- The states lying in the direction a red arrow points have higher values of that variable, and the states lying in the opposite direction have lower values. For example, the murder rate is high in Georgia, Alaska and Louisiana, and low in Connecticut and Washington.
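A one-line check of the correlation claim above (the exact values are not shown in the slides, so treat this as a sketch to run yourself):

round(cor(USArrests), 2) # Murder, Assault and Rape correlate strongly; UrbanPop much less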