
Principal Components

Arnab Maity ~ NCSU Department of Statistics ~ 5240 SAS Hall ~ 919-515-1937 ~ amaity[at]ncsu.edu

Contents

Introduction
Principal Components Analysis
A quick example: weekly stock return data
Computational details
Geometry of PCA
Sample PCA
Practical considerations
Number of PCs to retain
Linear dependence among variables
Principal components scores
Outlier detection
PCA is not invariant to scaling
Heptathlon data: prediction
The Romano-British pottery data: classification

Introduction

Often multivariate data sets contain too many variables, which can lead to the curse of dimensionality (Bellman, 1961):[1] standard graphing techniques, as well as the usual analysis methods, become problematic. Thus arises the need to reduce the dimensionality and to identify/summarize the crucial variables.

Principal components analysis (PCA) is a dimension reduction technique that is widely used in multivariate statistics. The objective is to condense the information that is present in the original set of variables via linear combinations[2] of the variables while losing as little information as possible. Typically, the number of linear combinations is much smaller than the number of original variables; hence the reduction in the dimensionality of the data. This can be useful in different ways, such as providing better visualization and computational advantages. PCA also decorrelates the data, that is, PCA produces linear combinations of the variables that are mutually uncorrelated.

[1] R. E. Bellman. Adaptive Control Processes. Princeton University Press, Princeton, NJ, 1961.
[2] Recall: given a vector X = (X1, ..., Xp)^T, a linear combination of X is defined as a1 X1 + ... + ap Xp. If we define the vector a = (a1, ..., ap)^T, then the linear combination can be written as a^T X.

Principal Components Analysis

Suppose we have a p × 1 random vector X = (X1, ..., Xp)^T. The main goal of PCA is to identify linear combinations of X of the form

Yi = ai^T X,   i = 1, 2, ..., q,

that explain most of the variability in X.

Total variation

The total variation (TV) of X is defined as the sum of the individual variances,

TV = var(X1) + ... + var(Xp).

In other words, if Σ = cov(X), then TV = trace(Σ).[3]

Typically q < p, and the new variables Yi are ordered according to their importance. Specifically, Y1 is designed to capture the most variability in the original variables (i.e., TV) by any linear combination; Y2 then captures most of the remaining variability while being uncorrelated with Y1, and so on. In the end, we hope that the first few Yi's will capture most of the variability in X.

[3] The trace of a square matrix is the sum of the diagonal elements of the matrix.
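As a quick numerical illustration of the total variation, a minimal sketch with simulated data (not tied to any dataset used in these notes):

# Total variation: sum of the marginal variances = trace of the covariance matrix
set.seed(1)
X <- matrix(rnorm(200), ncol = 4)    # 50 observations on 4 variables
sum(apply(X, 2, var))                # var(X1) + ... + var(X4)
sum(diag(cov(X)))                    # trace of the covariance matrix; same value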


Loadings

The individual components (the elements of the vector ai) of each PC are called loadings. The loadings tell us how the original variables are weighted to get the PCs.

A quick example: weekly stock return data

Consider the weekly stock return data,[4] which records 103 weekly rates of return on five stocks: JPMorgan (JPM), Citibank (CITI), WellsFargo (WF), Shell (SH), and Exxon (EX).

[4] Table 8-4 of Johnson and Wichern (2007); available on the course webpage.

We define

weekly return = (current week closing price − previous week closing price) / (previous week closing price),

adjusted for stock splits and dividends. Rates of return across stocks are expected to be correlated. A snapshot of the data is shown below, after a brief illustrative sketch of the return computation.
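Here is a brief illustrative sketch of the return formula using a made-up series of closing prices for a single stock (the numbers are hypothetical; the course data file already contains the computed returns):

# Hypothetical weekly closing prices for one stock (made-up numbers)
close <- c(40.1, 40.8, 40.5, 41.3, 41.0, 41.6)
# weekly return = (current close - previous close) / previous close
weekly.return <- diff(close) / head(close, -1)
round(weekly.return, 4)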

dat <- read.table("data/T8-4.DAT", header = F)
colnames(dat) <- c("JPM", "CITI", "WF", "SH", "EX")

A pairs-plot of the data and the correlation matrix are shown in Figures 1 and 2, respectively.

# Pairs-plot
pairs(dat, pch = 19, col = "#990000")

# Correlation plot
library(GGally)
ggcorr(dat, label = T, label_size = 3, label_round = 2)

[Figure 1: Pairs-plot of the stock return data.]
[Figure 2: Correlation matrix of the stock return data.]

Before performing PCA, we should always check whether each variable in the dataset has similar standard deviations (or variances). If not, we need to standardize the variables.

round(apply(dat, 2, sd), 3)

##   JPM  CITI    WF    SH    EX
## 0.021 0.021 0.015 0.027 0.028

It seems that the variables have very different standard deviations; e.g., sd(EX) is almost twice sd(WF). Thus we need to standardize each variable.

std.dat <- scale(dat, center = T, scale = T)
apply(std.dat, 2, sd)


##  JPM CITI   WF   SH   EX
##    1    1    1    1    1

Scaling the variables

When variables are on very different scales or have very different variances, a principal components analysis should be performed on the standardized variables.

PCA can be performed using the prcomp() function in base R.

# Perform PCA
data.pca <- prcomp(std.dat)
# Extract the importance of each component
summary(data.pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.5612 1.1862 0.7075 0.63248 0.50514
## Proportion of Variance 0.4874 0.2814 0.1001 0.08001 0.05103
## Cumulative Proportion  0.4874 0.7689 0.8690 0.94897 1.00000

In the output above, the row marked Standard deviation gives the standard deviation of each PC, that is, sd(Yi).[5] Thus, the variance of the first PC is var(Y1) = (1.5612)^2 ≈ 2.437. Recall that we have standardized the data and thus each variable has variance 1. The first PC alone has variance 2.437.

The second row, marked Proportion of Variance, shows the proportion of TV captured by each PC, that is, var(Yi)/TV for i = 1, 2, ....[6] Thus, for the 1st PC, the proportion of variance captured is var(Y1)/TV = (1.5612)^2/5 ≈ 0.487.

The third row, marked Cumulative Proportion, gives the proportion of total variation explained cumulatively by the first few PCs. For example, the first two PCs explain almost 77% of TV. Such a criterion enables us to choose how many PCs to keep. For example, if we are satisfied with capturing at least 75% of the total variation, we only need to keep two PCs.

[5] For example, we defined the first linear combination as Y1 = a1^T X. So sd(Y1) = sqrt(var(Y1)) = sqrt(a1^T Σ a1).
[6] Since we standardized each variable, the variance of each standardized variable is 1, giving us TV = 5.

The loadings for the first two PCs are shown below. Recall that we stored the PCA output in the variable data.pca. It contains several fields as shown below.

names(data.pca)

## [1] "sdev" "rotation" "center" ## [4] "scale" "x"

The rotation field contains the loadings.


# Loadings of the first two PCs, rounded to 3
# decimal places
round(data.pca$rotation[, 1:2], 3)

##         PC1    PC2
## JPM  -0.469  0.368
## CITI -0.532  0.236
## WF   -0.465  0.315
## SH   -0.387 -0.585
## EX   -0.361 -0.606

Hence the first PC can be constructed as

Y1 = −0.469(JPM) − ... − 0.361(EX).

We might interpret the first PC as a roughly equally weighted sum of the five variables. This might be a general market component. The second PC can be constructed as

Y2 = 0.368(JPM) + ... − 0.606(EX).

We might view the second PC as a contrast between the banking stocks and the oil stocks. This might be called an industry component. Thus, instead of looking at the original five variables, we can simply look at the two PCs Y1 and Y2, which explain approximately 77% of the total variation of the original data. These two PCs can be thought of as two summaries or indices of the five original stocks.

Computational details

Suppose X is a random vector with cov(X) = Σ. For now, we assume Σ is known. Later we will replace Σ by S or R computed from the data. Recall that the “variability” in X is represented by the total variation, TV = trace(Σ). Consider the linear combination, the first principal component,

Y1 = a11 X1 + ... + a1p Xp = a1^T X

for some a1 = (a11, ..., a1p)^T that we need to determine. The constants are called loadings. We determine the loadings by solving the following problem:

maximize var(Y1) subject to the constraint a1^T a1 = 1.[7]

[7] Recall var(Y1) = a1^T Σ a1. As var(Y1) has no upper bound, just maximizing var(Y1) with respect to a1 would lead to an infinite variance. To make this a well-defined problem, we need to add the constraint a1^T a1 = 1.

The solution can be obtained using the Lagrange multiplier method. Specifically, one can show that the optimal choice of a1 is an eigenvector of Σ corresponding to the largest eigenvalue, and that var(Y1) = λ1, the largest eigenvalue.


The loadings a2 = (a21, ..., a2p)^T of the second principal component

Y2 = a21 X1 + ... + a2p Xp = a2^T X

are chosen by solving the following problem:

maximize var(Y2) subject to the constraints a2^T a2 = 1 and a1^T a2 = 0.[8]

[8] The constraint a1^T a2 = 0 implies that the vector a2 is orthogonal to a1.

Again, using Lagrange multipliers, one can show that the optimal choice of a2 is an eigenvector (orthogonal to the first direction) of Σ corresponding to the second largest eigenvalue, and that var(Y2) = λ2, the second largest eigenvalue. Since the loading vector of the 2nd PC is orthogonal to that of the 1st PC, it readily follows that the 1st and 2nd PCs are uncorrelated.

Since X has p elements, i.e., p dimensions, the matrix Σ has size p × p. Thus we can have only p eigenvectors, and there can be at most p PCs. We continue the process described above until we get all p PCs. The last PC is simply the p-th eigenvector, corresponding to the smallest eigenvalue. It can be shown that[9]

TV = var(Y1) + ... + var(Yp),

that is, the total variation in X is fully captured by retaining all p PCs.

[9] Result from linear algebra: trace(Σ) equals the sum of its eigenvalues. Thus, TV = trace(Σ) = λ1 + ... + λp = var(Y1) + ... + var(Yp).

Spectral decomposition of Σ

Since Σ is a symmetric matrix, it follows that it has the spectral decomposition

Σ = λ1 a1 a1^T + ... + λp ap ap^T,

where the aj's are eigenvectors and the λj's are the corresponding eigenvalues. The vectors aj can be chosen so that they are orthonormal, that is, ai^T ai = 1 and ai^T aj = 0 for j ≠ i. The vector ai is the vector of loadings for the i-th PC Yi. Also, var(Yi) = λi, the corresponding eigenvalue.
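Both facts above (trace(Σ) = λ1 + ... + λp, and the spectral decomposition) are easy to verify numerically. A minimal sketch with an arbitrary symmetric, positive definite matrix chosen only for illustration:

# An arbitrary symmetric, positive definite "covariance" matrix
Sigma <- matrix(c(4,   1,   1,
                  1,   3,   0.5,
                  1,   0.5, 2), nrow = 3, byrow = TRUE)
eg <- eigen(Sigma)

# trace(Sigma) equals the sum of the eigenvalues
sum(diag(Sigma))
sum(eg$values)

# Spectral decomposition: Sigma = lambda_1 a_1 a_1^T + ... + lambda_p a_p a_p^T
Sigma.rebuilt <- matrix(0, 3, 3)
for (j in 1:3) {
  Sigma.rebuilt <- Sigma.rebuilt + eg$values[j] * tcrossprod(eg$vectors[, j])
}
all.equal(Sigma, Sigma.rebuilt)   # TRUE up to numerical error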

The proportion of the variance that is explained by the jth PC is

λj / (λ1 + ... + λp).

The proportion of the variance explained by the first j PCs together is

(λ1 + ... + λj) / (λ1 + ... + λp).


Geometry of PCA

From a geometric point of view, PCA attempts to find the directions along which most of the variability is present. Let us consider the simple case with number of variables p = 2. Define a1 = (a11, a12)^T. Thus the constraint a1^T a1 = 1 becomes

a11^2 + a12^2 = 1.

This is the equation of a circle, centered at zero, with radius one. So we only need to look at points (a11, a12)^T that are on the perimeter of the circle. This is what we mean by a direction; see Figure 3.

[Figure 3: First PC direction.]

Thus, given a data scatterplot, the 1st PC points to the direction along which most of the variation lies. In Figure 4, the grey points represent a data scatter. PCA first places a circle of unit radius at the center of the data (the black circle in the plot) and finds the direction with the most variation (the red arrow). The direction orthogonal to PC1 containing the second largest amount of variation is PC2 (the blue arrow).

[Figure 4: Geometry of PCA in two and three dimensions (left and right panels, respectively).]

Let us now consider the case with three variables, p = 3. In this case, a1 = (a11, a12, a13)^T and the constraint a1^T a1 = 1 becomes

a11^2 + a12^2 + a13^2 = 1.

This is the equation of a sphere, centered at zero, with radius one. Thus we only need to look at points (a11, a12, a13)^T that are on the surface of the sphere.


Now consider a data scatter in three dimensions (gray points in Figure 4, right panel). We first place a sphere of unit radius at the center of the data (the light-blue sphere). Then the first PC points to the direction (represented by a vector on the surface of the sphere) with the most variation (the red arrow). The second PC is the direction orthogonal to the first PC containing the second largest amount of variation. The third PC is the direction orthogonal to both the first and second PCs.

Note that any direction represented by the vector a is also represented by −a (just like the x-axis corresponds to both positive and negative directions). Thus if a is a PC then so is −a.

Interpretation of the PCs

Here Y1 = a1^T X is interpreted as the projection of X onto the direction a1. Thus the 1st PC loading vector a1 is the direction such that the projection of the data onto this direction has the largest possible variance; the 1st PC captures most of the total variation. The 2nd PC loading vector a2 is the direction such that it is orthogonal to the 1st PC and the projection of the data onto this direction has the second largest possible variance, and so on. Finally, the last PC loading vector ap is the direction such that it is orthogonal to all the other PCs and the projection of the data onto this direction has the smallest variance.
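To make this concrete, here is a minimal sketch with simulated two-dimensional data (the simulated data, the MASS::mvrnorm call, and the 1-degree grid of directions are all illustrative assumptions, not part of the notes). It scans candidate unit directions, computes the variance of the projected data for each, and compares the best one with the first PC from prcomp():

# Simulate a correlated two-dimensional data cloud (illustration only)
set.seed(42)
Sig <- matrix(c(2, 1.2, 1.2, 1), 2, 2)
x <- MASS::mvrnorm(200, mu = c(0, 0), Sigma = Sig)

# Scan unit directions a(theta) = (cos(theta), sin(theta)) and record the
# variance of the projection of the data onto each direction
theta <- seq(0, pi, length.out = 181)
proj.var <- sapply(theta, function(th) var(x %*% c(cos(th), sin(th))))

# Best direction found by the scan (up to sign and the 1-degree grid)...
best <- theta[which.max(proj.var)]
c(cos(best), sin(best))
# ...should be close to the first PC loading vector,
prcomp(x)$rotation[, 1]
# and the maximal projection variance close to var(Y1) = lambda_1
max(proj.var)
prcomp(x)$sdev[1]^2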

Sample PCA

In practice, the true covariance matrix Σ is unknown, and we only have a random sample X1,..., Xn. Thus we can estimate Σ by the sample covariance matrix

S = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T.

When the variables are standardized, we essentially use the sample correlation matrix R.[10] In our stock price example, we suggested standardizing the variables. Thus, for the standardized data, using the sample covariance matrix and using the sample correlation matrix are equivalent.

[10] Recall, for standardized data, the covariance and correlation matrices are identical.
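As a quick check of this formula, a sketch that recomputes S from the definition and compares it with R's cov() (std.dat is the standardized stock-return matrix created earlier):

# Compute S directly from the definition and compare with cov()
n <- nrow(std.dat)
Xc <- scale(std.dat, center = TRUE, scale = FALSE)   # centered data matrix
S.manual <- crossprod(Xc) / (n - 1)                  # (1/(n-1)) * t(Xc) %*% Xc
all.equal(S.manual, cov(std.dat), check.attributes = FALSE)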

# Recall, std.dat contains the standardized
# variables
S <- cov(std.dat)


# The function eigen() can compute the
# eigenvectors/eigenvalues
eig.out <- eigen(S)
names(eig.out)

## [1] "values" "vectors"

The field "values" gives the eigenvalues (ordered from largest to smallest). These numbers represent the variance of each PC. Let us also compute the proportion of variance explained by each PC and the cumulative proportion of variance explained.

# eigenvalues/variances
lam <- eig.out$values
# standard deviations
sdev <- sqrt(lam)
# proportion of variance explained
prop <- lam/sum(lam)
# cumulative proportion of var explained
pve <- cumsum(lam)/sum(lam)
# combined in a table
tab <- rbind(sdev, prop, pve)
rownames(tab) <- c("Standard deviation", "Proportion of variance",
                   "Cumulative proportion")
colnames(tab) <- paste0("PC", 1:5)
round(tab, 4)

##                           PC1    PC2    PC3    PC4    PC5
## Standard deviation     1.5612 1.1862 0.7075 0.6325 0.5051
## Proportion of variance 0.4875 0.2814 0.1001 0.0800 0.0510
## Cumulative proportion  0.4875 0.7689 0.8690 0.9490 1.0000

The output above is exactly the same as the one from using prcomp().[11] We can compare the loading vectors obtained directly using eigen() to those produced by prcomp().

[11] See the prcomp() output shown earlier.

Output from eigen():

# Eigenvectors
loadings <- eig.out$vectors
colnames(loadings) <- paste0("PC", 1:5)
rownames(loadings) <- c("JPM", "CITI", "WF", "SH", "EX")
round(loadings, 4)

##          PC1     PC2     PC3     PC4     PC5
## JPM  -0.4691  0.3680  0.6043  0.3630  0.3841


## CITI -0.5324  0.2365  0.1361 -0.6292 -0.4962
## WF   -0.4652  0.3152 -0.7718  0.2890  0.0712
## SH   -0.3873 -0.5850 -0.0934 -0.3813  0.5947
## EX   -0.3607 -0.6058  0.1088  0.4934 -0.4976

Output from prcomp():[12]

[12] Recall that the prcomp() output was stored in the data.pca object; we need to access its rotation field.

round(data.pca$rotation, 4)

##          PC1     PC2     PC3     PC4     PC5
## JPM  -0.4691  0.3680 -0.6043  0.3630  0.3841
## CITI -0.5324  0.2365 -0.1361 -0.6292 -0.4962
## WF   -0.4652  0.3152  0.7718  0.2890  0.0712
## SH   -0.3873 -0.5850  0.0934 -0.3813  0.5947
## EX   -0.3607 -0.6058 -0.1088  0.4934 -0.4976

It is evident that the eigenvectors are exactly the PC loadings that prcomp() produces, up to the sign of the third column.

Sign of loadings

The sign of a loading vector cannot be interpreted. Specifically, if a is an eigenvector, then −a is also an eigenvector. Thus we cannot say whether a variable impacts a PC positively or negatively.

However, we can still interpret the sign of a variable relative to that of another variable. For example, we can say all five stocks contribute similarly to PC1. For PC2, the oil stocks and the banking stocks contribute in a different manner.
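A quick check in R (a sketch using the eig.out, data.pca, and std.dat objects created above):

# Loadings from eigen() and prcomp() agree up to sign
all.equal(abs(eig.out$vectors), abs(unname(data.pca$rotation)))
# Flipping the sign of a loading vector leaves the PC variance unchanged
a1 <- data.pca$rotation[, 1]
var(std.dat %*% a1)
var(std.dat %*% (-a1))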

Practical considerations

Number of PCs to retain

A common approach to determining the number of components to retain is to keep the first few components that explain a pre-specified large percentage of the total variation of the original variables. We typically use values between 70% and 95%.

Other possible rules have been suggested by various authors. For example, we could plot the ordered eigenvalues λi (the variance captured by the i-th component) versus i, as proposed by Cattell (1966).[13] This plot is called the scree plot. Later, Farmer (1971)[14] suggested plotting log(λi) instead of λi.

In our stock price example, the scree plot is shown in Figure 5. We can use the function screeplot() for this purpose.

[13] Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioural Research, 1, 245–276.
[14] Farmer, S. A. (1971). An investigation into the results of principal components analysis of data derived from random numbers, 20, 63–72.


screeplot(data.pca, type = "line", main = "")

[Figure 5: Variance explained by each PC.]

We look for a bend (elbow) in the plot. In Figure 5, we see that the curve becomes less steep after the 3rd PC.

Thus we might wish to retain three PCs. From the PCA output shown before, the first three PCs cumulatively explain almost 87% of the variability.

Another rule of thumb is to discard those PCs whose variance is less than the average variance TV/p. When the variables are standardized, that is, each of them has variance 1, we have TV = p, and the average variance is p/p = 1. Thus, when the variables are standardized, the PCs with λi < 1 will be rejected. Jolliffe (1972)[15] proposed a modified rule to reject PCs with λi < 0.7.

It should be noted that the choice of PCs should not be based only on the percent of variation explained. One should also look at their subject-matter interpretation. If we obtain a PC which we cannot interpret, the usability of such a component may be limited.

[15] Jolliffe, I. (1972). Discarding variables in a principal component analysis. I: Artificial data. Journal of the Royal Statistical Society, Series C, 21, 160–173.
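These rules are easy to apply in code; a short sketch using the stock-return PCA stored in data.pca (the 75% threshold below is just an example):

lam <- data.pca$sdev^2                      # variances of the PCs
# Rule 1: smallest number of PCs explaining at least 75% of TV
which(cumsum(lam) / sum(lam) >= 0.75)[1]
# Rule 2 (standardized data): keep PCs with variance above the average, 1
sum(lam > 1)
# Jolliffe's modified cutoff of 0.7
sum(lam > 0.7)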

Linear dependence among variables

Even though we only retain the first few PCs with the highest variances, we should not completely ignore the components with small variances. A near-zero variance indicates the presence of a linear relationship among the variables (i.e., collinearity) in the data. In such a case, one or more of the variables are redundant and should be deleted. Consider the following example, where we have four variables X1, ..., X4 such that X1 = (0.2)X2 + (0.1)X3 + (0.2)X4.

set.seed(1001)
n <- 100
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
x1 <- 0.2 * x2 + 0.1 * x3 + 0.2 * x4
X <- cbind(x1, x2, x3, x4)

A quick look at the pairs plot does not really reveal the perfect linear relation between X1 and (X2, X3, X4). Similarly, inspecting the correlation matrix does not reveal anything unusual.

pairs(X, pch = 19)
round(cor(X), 3)

[Pairs-plot of the simulated variables x1, x2, x3, x4.]

##       x1     x2     x3    x4
## x1 1.000  0.721  0.225 0.657


## x2 0.721  1.000 -0.058 0.017
## x3 0.225 -0.058  1.000 0.019
## x4 0.657  0.017  0.019 1.000

A PCA on the data reveals a possible linear dependency in the data, as the last eigenvalue is essentially zero.[16]

[16] Here, for demonstration purposes, we have not standardized the data.

pcout <- prcomp(X)
summary(pcout)

## Importance of components:
##                           PC1    PC2    PC3       PC4
## Standard deviation     1.1735 1.0268 0.8045 1.291e-16
## Proportion of Variance 0.4473 0.3425 0.2102 0.000e+00
## Cumulative Proportion  0.4473 0.7898 1.0000 1.000e+00

In fact, the loading vector corresponding to the last PC also gives the estimated linear relationship among the variables.

round(pcout$rotation[,4],3)

##     x1     x2     x3     x4
##  0.958 -0.192 -0.096 -0.192

Thus PCA estimates that there is collinearity among the variables, and that the relationship is

(0.958)X1 − (0.192)X2 − (0.096)X3 − (0.192)X4 = 0.

This is, up to an overall scaling, very close to the actual relationship X1 − (0.2)X2 − (0.1)X3 − (0.2)X4 = 0. Thus, we should not entirely ignore the near-zero eigenvalues, as they might point out linear dependencies that could become problematic in subsequent analysis.
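To see the connection more directly, we can rescale the last loading vector so that the coefficient of x1 equals 1 (a small sketch using the pcout object above):

# Rescale the last loading vector so its first entry is 1; the result
# approximates the coefficients in x1 - 0.2 x2 - 0.1 x3 - 0.2 x4 = 0
a.last <- pcout$rotation[, 4]
round(a.last / a.last[1], 3)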

Principal components scores

The principal component scores are the values of the Yi's, calculated for each PC and each subject in the dataset. Recall that in the stock price example, we calculated two PCs

Y1 = −0.469(JPM) − ... − 0.361(EX);

Y2 = 0.368(JPM) + ... − 0.606(EX).

So for each row of the data (weeks in our example), we can compute the values of Y1 and Y2. For example, let us look at the standardized data.[17]

[17] Recall that we standardized the dataset before running PCA in our example.


std.dat[1:3,]

##             JPM       CITI         WF         SH          EX
## [1,]  0.5751116 -0.4057430 -0.3217345 -1.8162094  0.04251621
## [2,]  0.3566358  0.7654685 -0.5236027  0.2941623  0.34152797
## [3,] -0.9117447 -0.4437558  0.5619462 -0.1506411 -0.36794871

So for week 1 (the first row of the matrix above), we can compute the first and second PC scores as

y11 = −0.469(0.5751116) − ... − 0.361(0.04251621) ≈ 0.784,

y12 = 0.368(0.5751116) − ... − 0.606(0.04251621) ≈ 1.051.

In general, for the i-th week (the i-th row of the data matrix), we can similarly compute yi1 and yi2. These summaries are called the PC scores corresponding to PC1 and PC2 for week i. In general, suppose we retain k PCs. For the i-th subject with observed data vector xi, the principal component scores for the first k PCs are defined as

yi1 = a1^T xi, ..., yik = ak^T xi,

where a1, ..., ak are the corresponding loading vectors. If the variables were not standardized, we often center the variables before computing the scores, so that the scores have mean zero.[18] Specifically,

yi1 = a1^T (xi − x̄), ..., yik = ak^T (xi − x̄).

This centering does not change the variance of the PCs.

[18] In our example, we already standardized the variables, so this step was not needed.

In R, we can access the PC scores as follows.

# First two PC scores for the first 3 weeks
round(data.pca$x[1:3, 1:2], 3)

##         PC1    PC2
## [1,]  0.784  1.051
## [2,] -0.568 -0.232
## [3,]  0.594  0.048
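The scores returned in data.pca$x can also be reproduced by hand from the loading vectors (a sketch; std.dat and data.pca are the stock-return objects defined earlier):

# Manual computation of PC scores: project the (already centered,
# standardized) data onto the loading vectors
scores.manual <- std.dat %*% data.pca$rotation
round(scores.manual[1:3, 1:2], 3)   # matches data.pca$x[1:3, 1:2]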

Outlier detection

Plots of PC scores can reveal suspect observations and possible outliers. Let us consider the lumber stiffness dataset,[19] where four measures of stiffness, x1, ..., x4, are recorded on each of n = 30 boards. We used this dataset in the lecture on assessing multivariate normality and to demonstrate outlier detection strategies.

[19] Table 4-3 in Johnson and Wichern (2007), Applied Multivariate Analysis.


# Reading the data set
dat <- read.table("data/T4-3.DAT", header = F)
colnames(dat) <- c("x1", "x2", "x3", "x4", "d2")
# snapshot
head(dat)

##     x1   x2   x3   x4   d2
## 1 1889 1651 1561 1778 0.60
## 2 2403 2048 2087 2197 5.48
## 3 2119 1700 1815 2222 7.62
## 4 1645 1627 1110 1533 5.21
## 5 1976 1916 1614 1883 1.40
## 6 1712 1712 1439 1546 2.22

Let us now perform PCA of this dataset and compute the PC scores.

std.data <- scale(dat[, 1:4], center = T, scale = T)
data.pca <- prcomp(std.data)
summary(data.pca)

## Importance of components:
##                          PC1     PC2     PC3     PC4
## Standard deviation     1.897 0.51735 0.28119 0.23047
## Proportion of Variance 0.900 0.06691 0.01977 0.01328
## Cumulative Proportion  0.900 0.96695 0.98672 1.00000

The first two PCs explain almost 97% of the variability. Let us see a scatterplot of PC1 versus PC2; see Figure 6. Recall that we previously flagged observations 9 and 16 as potential outliers.

# scatterplot
plot(data.pca$x[, 1:2], pch = 19)
points(data.pca$x[c(9, 16), 1:2], cex = 3, col = "#990000")
text(data.pca$x[c(9, 16), 1:2], labels = c(9, 16), pos = 3)

[Figure 6: Scatterplot of PC scores for the stiffness data.]

The two outliers are clearly separated in the scatterplot above. Looking at the PC loadings, we see that the first PC is essentially an average of the four variables, while the second PC represents the difference between X2 and (X3, X4).

data.pca$rotation[, 1:2]

##          PC1        PC2
## x1 0.5137718 -0.2060665
## x2 0.4841620 -0.7316902
## x3 0.4999301  0.4657684
## x4 0.5016927  0.4530186


We notice that observation 9 has a high (positive) PC1 score, indicating that all the standardized variables for observation 9 were high compared to the other data points. On the other hand, observation 16 has a high (negative) PC2 score; this indicates that the difference between the X2 and (X3, X4) measurements is larger compared to other data points. We can see these phenomena by observing the corresponding standardized data (z-scores).

tab <- cbind(c(9, 16), std.data[c(9, 16), ])
round(tab, 3)

##            x1    x2     x3     x4
## [1,]  9 3.314 3.278  2.978  2.652
## [2,] 16 0.147 1.254 -1.086 -1.375

Overall, Johnson and Wichern (2007) suggest making scatterplots of the first few PC scores and also of the last few PCs; see Figure 7. These plots help identify suspect observations.

[Figure 7: Pairwise scatterplot of PC scores for the stiffness data.]
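For example, here is a sketch of one way to flag observations with unusually large scores on the last (smallest-variance) PC; the 2.5-standard-deviation cutoff is arbitrary and only for illustration:

# Observations with unusually large scores on the last PC
last.pc <- data.pca$x[, ncol(data.pca$x)]
which(abs(last.pc) > 2.5 * sd(last.pc))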

PCA is not invariant to scaling

In all our examples discussed previously, we standardized the data matrix. This is because the PCA result can change if one changes the unit of measurement (e.g., pounds to kilograms) for the variables. Also, if the variables have very different variances, then the top PCs will be dominated by the variables with the largest variances.

As an example, consider the heptathlon dataset in the HSAUR3 package. The dataset contains results of the 1988 Olympic heptathlon competition held in Seoul. The competition contained seven events: 100m hurdles, shot, high jump, 200m, long jump, javelin, and 800m. The last column of the dataset shows the total score of the athletes.

library(HSAUR3)
head(heptathlon)

##                     hurdles highjump  shot run200m longjump javelin run800m score
## Joyner-Kersee (USA)   12.69     1.86 15.80   22.56     7.27   45.66  128.51  7291
## John (GDR)            12.85     1.80 16.23   23.65     6.71   42.56  126.12  6897
## Behmer (GDR)          13.20     1.83 14.20   23.10     6.68   44.54  124.20  6858
## Sablovskaite (URS)    13.61     1.80 15.23   23.92     6.25   42.78  132.24  6540
## Choubenkova (URS)     13.51     1.74 14.76   23.93     6.32   47.46  127.90  6540
## Schulz (GDR)          13.75     1.83 13.50   24.65     6.33   42.82  125.79  6411

Before the analysis, let us transform the variables so that they are comparable. Notice that larger values of highjump, shot, longjump and javelin indicate better performance while for the other three


categories (hurdles, run200m and run800m) larger values indicate poorer performance. Let us first shift hurdles, run200m and run800m so that larger values correspond to better performance for these variables, as follows.

hep.dat <- heptathlon
for (i in c(1, 4, 7)) {
  hep.dat[, i] <- max(heptathlon[, i]) - heptathlon[, i]
}

Let us consider the first seven columns and visualize their variances.

boxplot(hep.dat[, 1:7])
apply(hep.dat[, 1:7], 2, var)

[Boxplots of the seven (shifted) event variables.]

##    hurdles   highjump       shot    run200m
##  0.5426500  0.0060750  2.2257190  0.9400410
##   longjump    javelin    run800m
##  0.2248773 12.5716773 68.7421417

Clearly, the variances of javelin and run800m far exceed those of the remaining variables. A PCA of the non-standardized variables produces PCs that are dominated by javelin and run800m.

pcout <- prcomp(hep.dat[, 1:7])
summary(pcout)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     8.3646 3.5910 1.38570 0.58571 0.32382 0.14712 0.03325
## Proportion of Variance 0.8207 0.1513 0.02252 0.00402 0.00123 0.00025 0.00001
## Cumulative Proportion  0.8207 0.9720 0.99448 0.99850 0.99973 0.99999 1.00000

round(pcout$rotation[, 1:2], 3)

##             PC1    PC2
## hurdles  -0.070  0.009
## highjump -0.006  0.001
## shot     -0.078  0.136
## run200m  -0.073  0.101
## longjump -0.040  0.015
## javelin   0.007  0.985
## run800m  -0.991 -0.013


We can see that PC1 is essentially the effect of run800m and PC2 is essentially javelin. This analysis completely overshadows any possible patterns due to the other variables, and may give us a misleading interpretation. On the other hand, the standardized dataset provides new insights.

Z <- scale(hep.dat[, 1:7])
pcstd <- prcomp(Z)
summary(pcstd)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.1119 1.0928 0.72181 0.67614 0.49524 0.27010 0.2214
## Proportion of Variance 0.6372 0.1706 0.07443 0.06531 0.03504 0.01042 0.0070
## Cumulative Proportion  0.6372 0.8078 0.88223 0.94754 0.98258 0.99300 1.0000

round(pcstd$rotation[, 1:2], 3)

##             PC1    PC2
## hurdles  -0.453  0.158
## highjump -0.377  0.248
## shot     -0.363 -0.289
## run200m  -0.408 -0.260
## longjump -0.456  0.056
## javelin  -0.075 -0.842
## run800m  -0.375  0.224

While the 2nd PC weights javelin highly, the 1st PC is essentially an overall performance metric based on all events except javelin. To summarize, when variables have widely different variances, the data should be standardized before performing PCA. As a personal preference, I always standardize the variables before performing PCA.
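As an aside, prcomp() can do the standardization internally through its scale. argument; the following sketch should reproduce the analysis of the standardized data above:

# Let prcomp() center and scale the variables itself
pc.alt <- prcomp(hep.dat[, 1:7], center = TRUE, scale. = TRUE)
round(pc.alt$sdev, 4)   # should match pcstd$sdev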

Heptathlon data: prediction

PC scores can be used to build predictive models as well. For example, if the end goal is to predict a specific response based on a number of predictor variables, one approach is to perform a PCA of the predictors first (to detect any collinearity, etc.), and then use the PC scores as covariates (instead of the actual variables) in the prediction model. This is one way to tackle the situation where a high number of covariates is present.

As a fun example, consider again the heptathlon data (shifted and transformed so that large values of each variable indicate better performance, as done earlier). Recall that the eighth column of the


original heptathlon data contained the official scores, which we did not use in our PC analysis.

# Standardize the first seven columns
Z <- scale(hep.dat[, 1:7])
# PCA
pcstd <- prcomp(Z)
# official scores
off.score <- heptathlon[, 8]

Here we do not know how the total official scores were computed.[20] Let us now ask ourselves: if we were the ones scoring the athletes (i.e., coming up with a measure of their performance and ordering them according to that measure), how would we do so?

From the previous section, we know the first two PCs jointly capture about 80% of the total variation in the data. Let us use the PC1 and PC2 scores as our measure of performance, and see how they relate to the official scores. We can do so by running a linear regression with the official scores as the response and the PC1 and PC2 scores as predictors.[21]

[20] The variable off.score in the code above.
[21] Recall, the variable off.score contains the official scores, and we can obtain the PC1 and PC2 scores from the PCA output pcstd.

# PC1 and PC2 scores
PC1 <- pcstd$x[, 1]
PC2 <- pcstd$x[, 2]
# regression
out <- lm(off.score ~ PC1 + PC2)
summary(out)

##
## Call:
## lm(formula = off.score ~ PC1 + PC2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -218.102  -12.853    3.512   27.091   55.780
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6090.600     10.716 568.357  < 2e-16 ***
## PC1         -266.774      5.179 -51.513  < 2e-16 ***
## PC2          -50.917     10.008  -5.088 4.26e-05 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 53.58 on 22 degrees of freedom
## Multiple R-squared: 0.9919, Adjusted R-squared: 0.9911
## F-statistic: 1340 on 2 and 22 DF, p-value: < 2.2e-16


The model is highly useful, as indicated by the R² value of 0.99. Let us look at plots of the official score versus the PC1 and PC2 scores, respectively.

par(mfrow = c(1, 2))
plot(PC1, off.score, pch = 19,
     main = "Official scores vs. PC1 scores")
plot(PC2, off.score, pch = 19,
     main = "Official scores vs. PC2 scores")

[Scatterplots: official scores vs. PC1 scores (left) and official scores vs. PC2 scores (right).]

We can see that PC1 has an almost perfectly linear relationship with the official score. In fact, cor(official score, PC1 score) = −0.991. Thus PC1 alone is almost equivalent to the official total score.
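As a quick follow-up (a sketch; off.score and PC1 were created in the code above), we can confirm this by computing the correlation and fitting the one-predictor model:

# Correlation reported in the text, and the fit using PC1 only
cor(off.score, PC1)
out1 <- lm(off.score ~ PC1)
summary(out1)$r.squared   # still very close to the two-PC model's R-squared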

The Romano-British pottery data: classification

Another application of PCA is in classification. Let us consider the Romano-British pottery data in the HSAUR3 package. The dataset consists of 45 observations on 9 chemical measurements on specimens of Romano-British pottery.

library(HSAUR3)
dim(pottery)

## [1] 45 10

head(pottery)

##   Al2O3 Fe2O3  MgO  CaO Na2O  K2O TiO2   MnO   BaO kiln
## 1  18.8  9.52 2.00 0.79 0.40 3.20 1.01 0.077 0.015    1


## 2  16.9  7.33 1.65 0.84 0.40 3.05 0.99 0.067 0.018    1
## 3  18.2  7.64 1.82 0.77 0.40 3.07 0.98 0.087 0.014    1
## 4  16.9  7.29 1.56 0.76 0.40 3.05 1.00 0.063 0.019    1
## 5  17.8  7.24 1.83 0.92 0.43 3.12 0.93 0.061 0.019    1
## 6  18.8  7.45 2.06 0.87 0.25 3.26 0.98 0.072 0.017    1

The variable kiln shows the region where the specimen was made.

table(pottery$kiln)

##
##  1  2  3  4  5
## 21 12  2  5  5

There are three regions:[22] region 1 (kiln 1 = Gloucester), region 2 (kilns 2 = Llanedeyrn and 3 = Caldicot), and region 3 (kilns 4 = Islands Thorns and 5 = Ashley Rails). Let us create a new region variable to reflect these three regions.

[22] See http://people.tamu.edu/~dcarlson/quant/data/index.html

region <- rep(NA, nrow(pottery))
region[pottery$kiln == 1] = 1
region[pottery$kiln == 2 | pottery$kiln == 3] = 2
region[pottery$kiln == 4 | pottery$kiln == 5] = 3
dat <- cbind(pottery[, 1:9], region)
head(dat)

##   Al2O3 Fe2O3  MgO  CaO Na2O  K2O TiO2   MnO   BaO region
## 1  18.8  9.52 2.00 0.79 0.40 3.20 1.01 0.077 0.015      1
## 2  16.9  7.33 1.65 0.84 0.40 3.05 0.99 0.067 0.018      1
## 3  18.2  7.64 1.82 0.77 0.40 3.07 0.98 0.087 0.014      1
## 4  16.9  7.29 1.56 0.76 0.40 3.05 1.00 0.063 0.019      1
## 5  17.8  7.24 1.83 0.92 0.43 3.12 0.93 0.061 0.019      1
## 6  18.8  7.45 2.06 0.87 0.25 3.26 0.98 0.072 0.017      1

We want to ask the following questions:

• Do we need to examine all the chemicals or just a few summaries are sufficient to capture variation in the data?

• Can we separate the specimens into different regions just by using the variables or their summaries?

Let us visualize the data using a pairs plot.

plot(dat[, 1:9], pch = 19, col = region)


[Pairs-plot of the nine chemical variables, colored by region.]

It seems some chemicals indeed separate the three regions quite well. However, looking at all possible scatterplots is not efficient, especially for a large number of variables. Let us standardize the data and then apply PCA.

# standardize the data
std.dat <- scale(dat[, 1:9], scale = T)
# PCA
pcout <- prcomp(std.dat)
summary(pcout)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7     PC8    PC9
## Standard deviation     2.0503 1.5885 0.93699 0.67538 0.61647 0.51840 0.34325 0.30190 0.2846
## Proportion of Variance 0.4671 0.2804 0.09755 0.05068 0.04223 0.02986 0.01309 0.01013 0.0090
## Cumulative Proportion  0.4671 0.7475 0.84501 0.89570 0.93792 0.96778 0.98087 0.99100 1.0000

A scree plot of the data is shown in Figure 8. It seems that the first two or three components are desirable.

screeplot(pcout, type = "lines", lwd = 2, col = "#990000")

[Figure 8: Scree plot of the pottery data.]

Let us inspect the scores of the first two PCs. Recall that there are three regions where the specimens were made. We did not use the


region information at all while performing the PCA. Let us plot PC1 score versus PC2 score and see whether we can discover any groups among the specimens.

[Scatterplots of PC1 versus PC2 scores: without region labels (left), with region labels superimposed (middle), and with the proposed classification rule marked (right).]

The left panel shows the plot of PC1 vs. PC2 scores without superimposing the actual regions. It seems that there are three groups visible in the plot. To see whether these three groups actually correspond to the regions, we superimpose the region information in the plot in the middle panel. Clearly, the three groups have been perfectly separated by plotting the first two PCs (we only need PC1, actually). A classification rule can be displayed in tree form using just the first two PCs. For example, one classification rule can be as follows.

A specimen belongs to region 1 if PC1 ∈ [−1.7, 1.5), region 2 if PC1 ≥ 1.5, and region 3 if PC1 < −1.7.
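A sketch of how this rule could be implemented and checked against the true regions (pcout and region come from the code above; note that the signs of PC scores returned by prcomp() are arbitrary, so the cutoffs assume the orientation shown in the plots):

# Apply the PC1-based rule and cross-tabulate with the true regions
pc1 <- pcout$x[, 1]
pred.region <- ifelse(pc1 >= 1.5, 2, ifelse(pc1 < -1.7, 3, 1))
table(predicted = pred.region, actual = region)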

This rule is displayed in the plot in the right panel. Notice that the proposed rule perfectly classifies the specimens into their regions. We will learn about classification techniques in a future lecture.
