COMP6053 Lecture: Principal Components Analysis
[email protected] [email protected]

Dealing with many variables

• Last week we looked at multivariate ANOVA, in which multiple continuous outcome variables are related to categorical predictor variables.
• What about situations in which we have many variables and are not yet sure which ones we want to regard as predictors and which as outcomes?
• Suppose you've given personality tests to a large sample of people, and recorded scores on multiple traits such as extraversion, neuroticism, openness, etc.
• Or you've asked 500 people to score 100 different foods on a scale between 1 and 10 to say whether they like each food.
• Either way, you have a large number of continuous variables for each subject or case. What structure might there be in the relationships between these variables?
• You don't have explicit predictors and outcomes: you just want to know whether liking wine goes with liking cheese, whether or not one causes or predicts the other.
• Principal components analysis (PCA) is the right tool for the job here. A closely related technique is factor analysis.
• Essentially we're looking for structure in the covariance or correlation matrix.
• We can also view PCA as reducing our large set of variables to a smaller set that can explain much of the variance.

An example data set: heptathlon

• The heptathlon is a seven-event Olympic sport for women. It includes:
  o hurdles
  o high jump
  o shot put
  o 200m run
  o long jump
  o javelin
  o 800m run
• Our sample data set is the results from the heptathlon at the 1988 Olympic Games in Seoul, won by Jackie Joyner-Kersee.
• We're asking about the relationships between the results for the component events.
• Is there a "running factor", a "jumping factor", or a "throwing factor", for example?
• Seven variables is not a large number: PCA comes into its own with larger data sets, but seven will keep things manageable for presentation here.

Some housekeeping in R

• To get access to the heptathlon data set, we need to download the HSAUR package ("Handbook of Statistical Analyses Using R").
• In the R GUI, set a CRAN mirror from the packages window (e.g., London or Bristol). Then type:

  install.packages("HSAUR")
  library(HSAUR)

Examining the data

• There were 25 competitors in the heptathlon; here are the top 3.

                      hurdles highjump  shot run200m longjump javelin run800m score
  Joyner-Kersee (USA)   12.69     1.86 15.80   22.56     7.27   45.66  128.51  7291
  John (GDR)            12.85     1.80 16.23   23.65     6.71   42.56  126.12  6897
  Behmer (GDR)          13.20     1.83 14.20   23.10     6.68   44.54  124.20  6858

• Some results are measured in seconds (lower numbers are better), others in metres (higher numbers are better).

Transforming the data

• To get all of the scores pointing in the same direction, we transform the times to be differences from the slowest time:

  heptathlon$hurdles = max(heptathlon$hurdles) - heptathlon$hurdles
  heptathlon$run200m = max(heptathlon$run200m) - heptathlon$run200m
  heptathlon$run800m = max(heptathlon$run800m) - heptathlon$run800m

• Now higher numbers are better in all cases.

The correlation matrix

           hurdles highjump  shot run200m longjump javelin run800m
  hurdles     1.00     0.81  0.65    0.77     0.91    0.01    0.78
  highjump    0.81     1.00  0.44    0.49     0.78    0.00    0.59
  shot        0.65     0.44  1.00    0.68     0.74    0.27    0.42
  run200m     0.77     0.49  0.68    1.00     0.82    0.33    0.62
  longjump    0.91     0.78  0.74    0.82     1.00    0.07    0.70
  javelin     0.01     0.00  0.27    0.33     0.07    1.00   -0.02
  run800m     0.78     0.59  0.42    0.62     0.70   -0.02    1.00

• Can we see any structure here? What goes with what?
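The time-flipping transformation only changes the sign of correlations, not their strength, which is why it is harmless to do before PCA. A minimal sketch of that point, written in Python with NumPy rather than R, using a few made-up times and distances (the numbers are illustrative, not the real data set):

```python
import numpy as np

# Hypothetical 800m times (seconds) and long-jump distances (metres)
# for four athletes. Lower times are better; longer jumps are better.
run800m = np.array([128.51, 126.12, 124.20, 134.00])
longjump = np.array([7.27, 6.71, 6.68, 5.50])

# The lecture's transformation: each time becomes its difference from
# the slowest time, so that higher numbers are better everywhere.
run800m_flipped = np.max(run800m) - run800m

# Correlation with long jump before and after flipping:
# the sign reverses, the magnitude is unchanged.
r_before = np.corrcoef(run800m, longjump)[0, 1]
r_after = np.corrcoef(run800m_flipped, longjump)[0, 1]
print(round(r_before, 3), round(r_after, 3))
```

With fast runners tending to jump further, the raw correlation is negative and the flipped one is the same value with a positive sign.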
The correlation matrix

• (On the slide, the same matrix is repeated with the larger correlations highlighted in colour; the diagonal and the near-zero javelin entries are blanked out.)
• With that colour-coding we see a sprinting / jumping factor start to emerge.
• Javelin results stand out as being unrelated.

Pairs plot

• Scatterplots of all seven variables against each other reinforce the impression we get from the correlation matrix.
• There are clear linear relationships between hurdles, high jump, shot put, 200m, and long jump.
• Javelin and, to some extent, 800m results are less correlated with the other events.

Scaling the data

• Note that the seven events have very different variances. The standard deviation for the 800m is 8.29 (seconds), whereas for the high jump it's only 0.078 (metres).
• If we work with unscaled scores, the 800m results will have a disproportionate effect.
• Thus we will tell the PCA function to scale all results to have a variance of 1.0.

What are the components in PCA?

• We're ready to run the PCA now. PCA is going to give us seven components: but what are they?
• The first component is a line through the centroid of the data that covers as much variance as possible. Your score on C1 is where you sit on that line.
• The second component also goes through the centroid but is orthogonal to (i.e., uncorrelated with) the first. It is also chosen to explain the maximum amount of the remaining variance.
• Third and subsequent components continue in the same way, each explaining less variance than the one before.
• Knowing only an athlete's score on component 1 (i.e., where they fall along the first line), we could still recover a pretty good guess about their results, i.e., their location in 7-dimensional data space.
• If we knew components 1 and 2, the guess would be more accurate.
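The geometric description above — orthogonal directions through the centroid, each chosen to maximize the remaining variance — corresponds to the eigenvectors of the correlation matrix of the standardized data, ordered by eigenvalue. A NumPy sketch of the mechanics (the 25×7 data here are random stand-ins, not the heptathlon results):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 7))   # 25 athletes, 7 events (random stand-in data)

# Standardize each column to mean 0, variance 1 (the scale=TRUE step).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# The principal components are the eigenvectors of the correlation matrix;
# each eigenvalue is the variance explained along that direction.
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)          # eigh returns ascending order
order = np.argsort(eigvals)[::-1]             # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The components are mutually orthogonal, and total variance is preserved:
# with 7 standardized variables the eigenvalues must sum to 7.
print(np.allclose(eigvecs.T @ eigvecs, np.eye(7)))
print(round(eigvals.sum(), 6))
```

The eigenvalue sum equalling the number of variables is the "conservation of variance" that makes the proportion-of-variance figures later in the lecture add up to 1.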
What are the components in PCA?

• If we knew all 7 component scores, we could completely reconstruct the athlete's results. However, there would be little point to this: we started out knowing their scores.
• The idea is to choose a small subset of the most useful components and thereby compress the high-dimensional data down to a shorter description.

Running the PCA

• The R commands are easy:

  hepPCA = prcomp(heptathlon[,c(1:7)], scale=TRUE)
  print(hepPCA)
  summary(hepPCA)

• The heptathlon[,c(1:7)] part specifies the first seven columns only (excluding the official score column).
• The scale=TRUE part forces automatic scaling of the event results to a variance of 1.0.

What results do we get?

• print(hepPCA) gives us the seven principal components, expressed in terms of their loadings on each event result.
• hepPCA$rotation[,1] gives us the loadings just for the first component:

     hurdles   highjump       shot    run200m   longjump    javelin    run800m
  -0.4528710 -0.3771992 -0.3630725 -0.4078950 -0.4562318 -0.0754090 -0.3749594

• The rotation matrix is essentially the transformation from event space to principal-component space.
• Loadings that are bigger in magnitude mean a variable (event) is more important in determining the C1 score.

Component loadings

• We can see that component 1 contains a little of everything, except for the javelin.
• The loadings also happen to be negative: the sign is not really important here, as C1 is an artificial variable designed to explain variance.
• With predict(hepPCA)[,1] we can ask what each woman's C1 score is.
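Conceptually, predict(hepPCA)[,1] just takes the dot product of each standardized row with the first loading vector. The sketch below (Python with NumPy, using made-up standardized data rather than the heptathlon results) shows that projection, and why the overall sign of a component is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(25, 7))                 # stand-in for the standardized event results
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

# First principal component = eigenvector of the correlation matrix
# with the largest eigenvalue.
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
v1 = eigvecs[:, np.argmax(eigvals)]          # loadings of the first component

# Each athlete's C1 score is her standardized results projected onto
# the loading vector (what predict() returns in column 1).
c1 = Z @ v1

# Flipping the sign of the loadings flips every score but changes
# nothing substantive: the variance explained is identical.
c1_flipped = Z @ (-v1)
print(np.allclose(c1.var(), c1_flipped.var()))
```

This is why the negative loadings on the slide are nothing to worry about: an eigenvector and its negation describe the same line through the data.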
Component 1 scores

  Joyner-Kersee (USA)   -4.121447626
  John (GDR)            -2.882185935
  Behmer (GDR)          -2.649633766
  Sablovskaite (URS)    -1.343351210
  Choubenkova (URS)     -1.359025696
  Schulz (GDR)          -1.043847471
  Fleming (AUS)         -1.100385639
  Greiner (USA)         -0.923173639
  Lajbnerova (CZE)      -0.530250689
  Bouraga (URS)         -0.759819024
  Wijnsma (HOL)         -0.556268302
  Dimitrova (BUL)       -1.186453832
  Scheider (SWI)         0.015461226
  Braun (FRG)            0.003774223
  Ruotsalainen (FIN)     0.090747709
  Yuping (CHN)          -0.137225440
  Hagger (GB)            0.171128651
  Brown (USA)            0.519252646
  Mulliner (GB)          1.125481833
  Hautenauve (BEL)       1.085697646
  Kytola (FIN)           1.447055499
  Geremias (BRA)         2.014029620
  Hui-Ing (TAI)          2.880298635
  Jeong-Mi (KOR)         2.970118607
  Launa (PNG)            6.270021972

Component 1 and total heptathlon score

• In the real heptathlon, there's a scoring system that gives each competitor a points total.
• The correlation between scores on our component 1 and the official scores is -0.99 (negative only because the C1 loadings happen to be negative).
• Thus we have clearly explained a lot of the relevant variance with our first component alone.

How much variance does each component explain?

• summary(hepPCA) gives us information on how much variance each component covers:

  Importance of components:
                            PC1    PC2     PC3     PC4     PC5     PC6    PC7
  Standard deviation     2.1119 1.0928 0.72181 0.67614 0.49524 0.27010 0.2214
  Proportion of Variance 0.6372 0.1706 0.07443 0.06531 0.03504 0.01042 0.0070
  Cumulative Proportion  0.6372 0.8078 0.88223 0.94754 0.98258 0.99300 1.0000

• C1 covers 63.7%, C2 covers 17.1%, C3 covers 7.4%, etc.
• plot(hepPCA) shows this graphically.

Variance explained by each component

• The plot shows the scaled variance explained by each component.
• This kind of plot is also known as a "scree plot".
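The proportions in the summary come straight from the component standard deviations: each proportion is that component's variance divided by the total variance, which is 7 for seven standardized variables. A quick check in Python (NumPy) using the numbers actually printed by summary(hepPCA) on the slide:

```python
import numpy as np

# Component standard deviations as reported by summary(hepPCA).
sdev = np.array([2.1119, 1.0928, 0.72181, 0.67614, 0.49524, 0.27010, 0.2214])

variances = sdev ** 2
prop = variances / variances.sum()   # proportion of variance per component
cum = np.cumsum(prop)                # cumulative proportion

print(np.round(prop, 4))   # first entry ~0.6372, matching the slide
print(np.round(cum, 4))    # first two components cumulate to ~0.8078
```

The variances sum to (almost exactly) 7, confirming that scaling put one unit of variance into each of the seven events and that PCA merely redistributes it across the components.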