COMP6053 lecture: Principal components analysis

[email protected]

Dealing with many variables

• Last week we looked at multivariate ANOVA, in which multiple continuous outcome variables are related to categorical predictor variables.
• What about situations in which we have many variables and are not yet sure which ones we want to regard as predictors and which as outcomes?

Dealing with many variables

• Suppose you've given personality tests to a large sample of people, and recorded scores on multiple traits such as extraversion, neuroticism, openness, etc.
• Or you've asked 500 people to rate 100 different foods on a scale from 1 to 10 to say how much they like each food.

Dealing with many variables

• You have a large number of continuous variables for each subject or case.
• What structure might there be in the relationships between these variables?
• You don't have explicit predictors and outcomes: you just want to know whether liking wine goes with liking cheese, whether or not one causes or predicts the other.

Dealing with many variables

• Principal components analysis (PCA) is the right tool for the job here.
• Closely related technique: factor analysis.
• Essentially we're looking for structure in the covariance or correlation matrix.
• Can also view it as reducing our large set of variables to a smaller set that can explain much of the variance.

An example data set: heptathlon

• The heptathlon is a seven-event Olympic sport for women. It includes:

o hurdles
o high jump
o shot put
o 200m run
o long jump
o javelin
o 800m run

An example data set: heptathlon

• Our sample data set is the results from the heptathlon at the 1988 Olympic Games in Seoul.
• Won by Jackie Joyner-Kersee.

An example data set: heptathlon

• We're asking about the relationship between the results for the component events.
• Is there a "running factor", or a "jumping factor", or a "throwing factor", for example?
• Seven variables is not a large number: PCA comes into its own in larger data sets, but seven will keep things manageable for presentation here.

Some housekeeping in R

• To get access to the heptathlon data set, we need to download the HSAUR package ("Handbook of Statistical Analyses Using R").
• In the R GUI, set a CRAN mirror from the packages window (e.g., Bristol).
• Then type:

install.packages("HSAUR")
library(HSAUR)

Examining the data

• There were 25 competitors in the heptathlon; here are the top 3.
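• In R, these rows can be displayed as follows (a minimal sketch, assuming the HSAUR package from above is installed):

# Load the heptathlon data frame and show its first three rows.
data("heptathlon", package = "HSAUR")
head(heptathlon, 3)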

                    hurdles highjump  shot run200m longjump javelin run800m score
Joyner-Kersee (USA)   12.69     1.86 15.80   22.56     7.27   45.66  128.51  7291
John (GDR)            12.85     1.80 16.23   23.65     6.71   42.56  126.12  6897
Behmer (GDR)          13.20     1.83 14.20   23.10     6.68   44.54  124.20  6858

• Some results are measured in seconds (lower numbers better), others in metres (higher numbers better).

Transforming the data

• In order to get all of the scores pointing in the same direction, we transform the times to be differences from the slowest time:

heptathlon$hurdles = max(heptathlon$hurdles) - heptathlon$hurdles
heptathlon$run200m = max(heptathlon$run200m) - heptathlon$run200m
heptathlon$run800m = max(heptathlon$run800m) - heptathlon$run800m

• Now higher numbers are better in all cases.

The correlation matrix
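• The matrix below can be computed directly from the transformed data (a minimal sketch):

# Pairwise correlations between the seven event results, rounded to 2 d.p.
round(cor(heptathlon[, 1:7]), 2)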

         hurdles highjump shot run200m longjump javelin run800m
hurdles     1.00     0.81 0.65    0.77     0.91    0.01    0.78
highjump    0.81     1.00 0.44    0.49     0.78    0.00    0.59
shot        0.65     0.44 1.00    0.68     0.74    0.27    0.42
run200m     0.77     0.49 0.68    1.00     0.82    0.33    0.62
longjump    0.91     0.78 0.74    0.82     1.00    0.07    0.70
javelin     0.01     0.00 0.27    0.33     0.07    1.00   -0.02
run800m     0.78     0.59 0.42    0.62     0.70   -0.02    1.00

• Can we see any structure here?
• What goes with what?

The correlation matrix

         hurdles highjump shot run200m longjump javelin run800m
hurdles             0.81 0.65    0.77     0.91            0.78
highjump    0.81         0.44    0.49     0.78            0.59
shot        0.65     0.44         0.68     0.74    0.27    0.42
run200m     0.77     0.49 0.68             0.82    0.33    0.62
longjump    0.91     0.78 0.74    0.82                     0.70
javelin                   0.27    0.33
run800m     0.78     0.59 0.42    0.62     0.70

• With some colour-coding (here, blanking out the diagonal and the weakest correlations) we see a sprinting / jumping factor start to emerge.
• Javelin results stand out as being unrelated.

Pairs plot

• Scatterplots of all seven variables against each other reinforce the impression we get from the correlation matrix.
• Clear linear relationships between hurdles, high jump, shot put, 200m, and long jump.
• Javelin and, to some extent, 800m results are less correlated with the other events.
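• One way to draw this plot (a minimal sketch using base R graphics):

# All pairwise scatterplots of the seven event results.
pairs(heptathlon[, 1:7])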

Scaling the data

• Note that the seven events have very different variances. The standard deviation for the 800m is 8.29 (sec) whereas for the high jump it's only 0.078 (m).
• If we work with unscaled scores, the 800m results will have a disproportionate effect.
• Thus we will tell the PCA function to scale all results to have a variance of 1.0.
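• These figures are easy to check directly (a minimal sketch):

# Per-event standard deviations, before any rescaling.
apply(heptathlon[, 1:7], 2, sd)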

What are the components in PCA?

• We're ready to run the PCA analysis now.
• PCA is going to give us seven components: but what are they?
• The first component is a line through the centroid of the data that covers as much variance as possible.
• Your score on C1 is where you sit on that line.

What are the components in PCA?

• The second component also goes through the centroid but is orthogonal to (i.e., uncorrelated with) the first.
• It is also chosen to explain the maximum amount of the remaining variance.
• Third and subsequent components continue in the same way, explaining less variance each time.
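• In symbols (a standard formulation, added here for reference): writing X for the centred, scaled data matrix and w for a candidate direction, the components solve

w_1 = \arg\max_{\|w\|=1} \mathrm{Var}(Xw)
w_k = \arg\max_{\|w\|=1,\; w \perp w_1, \ldots, w_{k-1}} \mathrm{Var}(Xw)

so each component captures the largest possible share of the variance left over by its predecessors.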

What are the components in PCA?

• Knowing only an individual's score on component 1 (i.e., where they fit along the first line) we could still recover a pretty good guess about their results, i.e., their location in 7-dimensional data-space.
• If we knew components 1 and 2, the guess would be more accurate.

What are the components in PCA?

• If we knew all 7 components we could completely reconstruct the person's scores.
• However, there would be little point to this: we started out knowing their scores.
• The idea is to choose a small subset of the most useful components and thereby compress the high-dimensional data down to a shorter description.

Running the PCA

• The R commands are easy:

hepPCA = prcomp(heptathlon[,c(1:7)], scale=TRUE)
print(hepPCA)
summary(hepPCA)

• heptathlon[,c(1:7)] restricts the analysis to the first seven columns only (the event results, leaving out the official score).
• scale=TRUE forces automatic scaling of the event results to a variance of 1.0.

What results do we get?

• print(hepPCA) gives us the seven principal components, expressed in terms of their loadings on each event result.
• hepPCA$rotation[,1] gives us the loadings just for the first component:

   hurdles   highjump       shot    run200m   longjump    javelin    run800m
-0.4528710 -0.3771992 -0.3630725 -0.4078950 -0.4562318 -0.0754090 -0.3749594

• So it's essentially the transformation from event space to principal component space.
• Bigger (absolute) loadings mean a variable (event) is more important in determining the C1 score.
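• As a sanity check (a minimal sketch, assuming the hepPCA object from above), the component scores are simply the centred, scaled data multiplied by the rotation matrix:

# Centre and scale the event results, then project onto the loadings.
X = scale(heptathlon[, 1:7])
manual_scores = X %*% hepPCA$rotation
# Should agree with R's own component scores up to floating-point error:
all.equal(unname(manual_scores), unname(predict(hepPCA)))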

Component loadings

• We can see that component 1 contains a little of everything, except for the javelin.
• It also happens to be negative: the sign is not really important here, as C1 is an artificial variable designed to explain variance.
• With predict(hepPCA)[,1] we can ask what each woman's C1 score is.

Component 1 scores

Joyner-Kersee (USA)          John (GDR)        Behmer (GDR)  Sablovskaite (URS)
       -4.121447626        -2.882185935        -2.649633766        -1.343351210
  Choubenkova (URS)        Schulz (GDR)       Fleming (AUS)       Greiner (USA)
       -1.359025696        -1.043847471        -1.100385639        -0.923173639
   Lajbnerova (CZE)       Bouraga (URS)       Wijnsma (HOL)     Dimitrova (BUL)
       -0.530250689        -0.759819024        -0.556268302        -1.186453832
     Scheider (SWI)         Braun (FRG)  Ruotsalainen (FIN)        Yuping (CHN)
        0.015461226         0.003774223         0.090747709        -0.137225440
        Hagger (GB)         Brown (USA)       Mulliner (GB)    Hautenauve (BEL)
        0.171128651         0.519252646         1.125481833         1.085697646
       Kytola (FIN)      Geremias (BRA)       Hui-Ing (TAI)      Jeong-Mi (KOR)
        1.447055499         2.014029620         2.880298635         2.970118607
        Launa (PNG)
        6.270021972

Component 1 and total heptathlon score

• In the real heptathlon, there's a scoring system that gives each competitor a points total.
• The correlation between scores on our component 1 and the official scores is -0.99.
• Thus we have clearly explained a lot of the relevant variance with our first component alone.

How much variance does each component explain?

• summary(hepPCA) gives us information on how much variance each component covers:

Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6    PC7
Standard deviation     2.1119 1.0928 0.72181 0.67614 0.49524 0.27010 0.2214
Proportion of Variance 0.6372 0.1706 0.07443 0.06531 0.03504 0.01042 0.0070
Cumulative Proportion  0.6372 0.8078 0.88223 0.94754 0.98258 0.99300 1.0000
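• The -0.99 figure quoted above can be verified in one line (a sketch, assuming the official score column was left untransformed):

# Correlation between component 1 scores and the official points totals.
cor(predict(hepPCA)[, 1], heptathlon$score)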

• C1 covers 63.7%, C2 covers 17.1%, C3 covers 7.4%, etc.
• plot(hepPCA) shows this graphically.

Variance explained by each component

• The plot shows the scaled variance explained by each component.
• Also known as a "scree plot".

Variance explained by each component

• There were 7 event results, each scaled to have a variance of 1.0.
• This makes a total of 7.0 units of scaled variance to be explained, so the area under the scree plot is 7.0.
• The first factor always "takes" the largest share of this variance.
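• A quick check that the scaled variances do sum to 7.0 (a minimal sketch, assuming the hepPCA object from earlier):

# Each component's variance is its standard deviation squared.
sum(hepPCA$sdev^2)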

How many components do we need?

• If we take all 7 components, we've achieved nothing in terms of compressing the high-dimensional data.
• So we want a small subset of the components: One? Two? Three?
• Later components explain very little variance anyway.

How many components do we need?

• The usual criterion is to accept components that do at least their share of explanation.
• The original measurements explain 1.0 units of variance each (given the scaling).
• Thus only a component that explains more than 1.0 units of variance is really helpful (see the sketch below).
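• This rule (often called the Kaiser criterion) is easy to apply, since each component's variance is its standard deviation squared; a minimal sketch:

# Component variances; keep the components whose variance exceeds 1.0.
variances = hepPCA$sdev^2
variances
which(variances > 1)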

How many components do we need?

• By this criterion we only want components 1 and 2 (i.e., we've compressed the data set from 7 dimensions to 2).
• In doing that we still explain 81% of the variance, so we have lost little.
• This criterion is sometimes expressed in terms of wanting a factor to have an "eigenvalue" of 1.0 or more.

How many components do we need?

• The criterion is not absolute. Some people argue for choosing all components before the scree plot "falls away".
• If a component explains slightly less than 1.0 units of variance but has an easily interpreted meaning (based on its loading scores) we would be inclined to keep it.

[Figure: the "knee" of the scree plot]

Assessing components 2 and 3

> hepPCA$rotation[,2]
    hurdles    highjump        shot     run200m    longjump     javelin     run800m
 0.15792058  0.24807386 -0.28940743 -0.26038545  0.05587394 -0.84169212  0.22448984

> hepPCA$rotation[,3]
    hurdles    highjump        shot     run200m    longjump     javelin     run800m
-0.04514996 -0.36777902  0.67618919  0.08359211  0.13931653 -0.47156016 -0.39585671

• Component 2 has a strong (negative) loading on javelin ability.
• Component 3 seems to cover shot put ability.

Interpreting our PCA

• Let's go with 2 components that explain 81% of the total variance.
• Component 1 seems to reflect general athletic ability.
• Component 2 covers the specific skill of javelin throwing, but with a minor influence from shot put and 200m running ability.

Interpreting our PCA

• We can plot our 25 competitors in our reduced 2-D component space with biplot(hepPCA). This plot also shows the way each event loads on each component.
• We now have a better understanding of the structure implicit in the correlation matrix, and could use our two components as predictor or outcome variables in further analyses.

Relationship to factor analysis?

• Factor analysis is a closely related technique.
• Rather than breaking the data down into principal components, FA hypothesizes hidden variables that explain variation in the observed data.
• There is only one PCA result for a data set, but there can be many reasonable FA models.

Additional material

• The R script for running the PCA analysis of the heptathlon data set.
• The lecture borrows heavily from the PCA tutorial in "A Handbook of Statistical Analyses Using R" by Brian S. Everitt and Torsten Hothorn.