Primer of The 3rd Lecture

Hiroyoshi Iwata [email protected]

Experiments in agriculture and life sciences often measure multiple characteristics of the same samples. For example, in field trials of crops, various traits related to yield are examined simultaneously, even when the purpose is simply yield assessment. It is often the case that you can discover some kind of knowledge by drawing and viewing scatter plots of the multiple characteristics measured simultaneously. However, if the number of characteristics measured is large, it becomes difficult to grasp the variation of the data with scatter plots. The number of dimensions that humans can intuitively grasp with such plots is at most a few, and it is not easy to grasp the variation of data when there are more than 10 measured characteristics. Principal component analysis, described in this lecture, is a method for summarizing the variation contained in multidimensional data into low-dimensional features (i.e., principal component scores) while losing as little information as possible. For example, for the marker genotype data used in this lecture, the variation contained in 1,311 dimensions can be summarized in four dimensions. Principal component analysis is a very effective method for efficiently extracting the information contained in a large number of variables.

In this lecture, we will use the rice dataset (Zhao et al. 2011, Nature Communications 2: 467) as an example. We will use marker genotype data (RiceDiversityGeno.csv) as well as line (accession) data (RiceDiversityLine.csv) and phenotypic data (RiceDiversityPheno.csv). The marker genotype data consists of the genotypes of 1,311 SNPs that Zhao et al. (2010, PLoS One 5: e10780) used in their analysis. All data are based on the data downloaded from the Rice Diversity web page http://www.ricediversity.org/. As for the marker data, missing values were imputed using the software fastPHASE (Scheet and Stephens 2006, Am J Hum Genet 78: 629).

Load three types of data and combine them.

> line <- read.csv("RiceDiversityLine.csv")    # accession (line) data
> pheno <- read.csv("RiceDiversityPheno.csv")  # phenotypic data
> geno <- read.csv("RiceDiversityGeno.csv")    # marker genotype data
> line.pheno <- merge(line, pheno, by.x = "NSFTV.ID", by.y = "NSFTVID")    # combine line and pheno
> alldata <- merge(line.pheno, geno, by.x = "NSFTV.ID", by.y = "NSFTVID")  # combine line.pheno and geno

Analyze the variation of panicle length and flag leaf length in varieties and lines included in rice germplasm by principal component analysis. First, the data of both traits are extracted from all data (alldata). Then, samples which have at least one missing value among the data of both traits are excluded.

> mydata <- data.frame(
      panicle.length = alldata$Panicle.length,
      leaf.length = alldata$Flag.leaf.length
  )    # extract Panicle.length and Flag.leaf.length
> mydata
(omitted)
> missing <- apply(is.na(mydata), 1, sum) > 0
# is.na is a function for checking whether elements are NA or not
# the row sums of these values give the number of NAs in each sample
# missing is a vector of T or F
# if a sample has a missing value, the corresponding element becomes T
> mydata <- mydata[!missing, ]    # remove samples which have missing values
> mydata
(omitted)

Let's check the variation of panicle length and flag leaf length with a scatter plot.

> plot(mydata)                          # draw the scatter plot
> lim <- range(mydata)                  # set range of values in both variables
> plot(mydata, xlim = lim, ylim = lim)  # draw the plot with the same range for x and y


Figure 1. The relationship between panicle length and flag leaf length in 341 accessions of rice germplasm. Both lengths are measured in cm.

Look at the scatter plot. When one trait becomes large, the other also tends to become large. Let's calculate the variance-covariance matrix and the correlation matrix of both traits and confirm this numerically.

> cov(mydata)    # variance-covariance matrix
               panicle.length leaf.length
panicle.length       12.67168    11.57718
leaf.length          11.57718    33.41344
> cor(mydata)    # correlation matrix
               panicle.length leaf.length
panicle.length       1.000000    0.562633
leaf.length          0.562633    1.000000

Both the correlation and the covariance have positive values. Thus, we can confirm that both traits tend to vary together.

In order to simplify the following calculations and explanations, each trait is centered so that its mean is zero (the average is subtracted from the original variable).

> mydata <- sweep(mydata, 2, apply(mydata, 2, mean))
# calculate the column means and subtract them from the columns
> summary(mydata)
(omitted)
> cov(mydata)
               panicle.length leaf.length
panicle.length       12.67168    11.57718
leaf.length          11.57718    33.41344
> cor(mydata)
               panicle.length leaf.length
panicle.length       1.000000    0.562633
leaf.length          0.562633    1.000000
> lim <- range(mydata)
> plot(mydata, xlim = lim, ylim = lim)
> abline(h = 0, v = 0)    # draw the lines x = 0 and y = 0

Note that centering the variables does not change the relationships between the variables represented by the variance-covariance matrix and the correlation matrix.


Figure 2. Panicle length and flag leaf length normalized so that their means are equal to 0

Let's perform principal component analysis and plot the obtained principal component scores.

> res <- prcomp(mydata)                  # perform principal component analysis
> lim <- range(res$x)                    # the principal component scores are stored in res$x
> plot(res$x, xlim = lim, ylim = lim)    # draw the plot of the scores
> abline(h = 0, v = 0)


Figure 3. The horizontal axis is the first principal component (PC1) score, and the vertical axis is the second principal component (PC2) score

Let's check the relationship between the principal component scores and the original variables by drawing their scatter plots side by side.

> op <- par(mfrow = c(1,2))
# par is a function for setting graph parameters
# mfrow = c(1,2) means draw graphs in a 1 row by 2 columns format
> lim <- range(mydata)
> plot(mydata, xlim = lim, ylim = lim)
> abline(h = 0, v = 0)
> lim <- range(res$x)
> plot(res$x, xlim = lim, ylim = lim)
> abline(h = 0, v = 0)
> par(op)    # reset the graph parameters
# op holds the parameters as they were before we set them in the first line


Figure 4. (left) original variable, (right) principal component score

The side-by-side comparison of the principal component scores and the original variables shows that the principal component scores are a rotation (and inversion) of the original variables. Principal component analysis is a method for representing the variation of the original variables with new variables in as few dimensions as possible. For example, the horizontal axis in the right figure expresses the variation of the two variables, panicle length and flag leaf length, as the variation of one new variable (i.e., the first principal component). It can be seen that the first principal component alone can explain most of the variation of the original variables. In addition, the scores of the second principal component represent the variation that could not be explained by the first principal component.

Now let's check how much each principal component actually explains the variations.

> summary(res)    # summary of PCA
Importance of components:
                          PC1    PC2
Standard deviation     6.2117 2.7385
Proportion of Variance 0.8373 0.1627
Cumulative Proportion  0.8373 1.0000

It can be seen that the first principal component accounts for 83.7% of the total variation, and the second principal component accounts for the remaining 16.3%. That is, more than 80% of the variation in panicle length and flag leaf length can be represented by a single variable (the first principal component).
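Just as a quick check (reusing res computed above, not part of the original workflow), the proportions reported by summary() can be recovered directly from the standard deviations of the principal components:

> res$sdev^2 / sum(res$sdev^2)            # proportion of variance of each PC
> cumsum(res$sdev^2) / sum(res$sdev^2)    # cumulative proportion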

Let's look at the results in more detail.

> res    # the object which has the result of PCA
Standard deviations:
[1] 6.211732 2.738524

Rotation:
                      PC1        PC2
panicle.length -0.4078995 -0.9130268
leaf.length    -0.9130268  0.4078995

Standard deviations are the standard deviations of the first and second principal components, which are the new variables. In addition, Rotation contains the unit vectors representing the orientations of the axes of the first and second principal components (as described later, these unit vectors are called eigenvectors).

In addition, the results mentioned above can also be retrieved separately as follows.

> res$sdev        # standard deviations of the PC scores
[1] 6.211732 2.738524
> res$rotation    # eigenvectors
                      PC1        PC2
panicle.length -0.4078995 -0.9130268
leaf.length    -0.9130268  0.4078995
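As a small additional check (again only reusing res), the columns of the rotation matrix are unit vectors and are orthogonal to each other:

> colSums(res$rotation^2)                      # each column of Rotation has length 1
> sum(res$rotation[, 1] * res$rotation[, 2])   # the two columns are orthogonal (inner product is 0)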

Let's draw graphs for the result of principal component analysis.

> op <- par(mfrow = c(1,2))
> plot(res)      # variances (eigenvalues) of the PCs
> biplot(res)    # biplot showing the relationship between the PCs and the original variables
> par(op)


Fig. 5. (left) Variances (eigenvalues) of the principal components. (right) Biplot representing the relationship between principal component scores and variables

The left graph shows the variances of the principal component scores, which are the squares of the standard deviations of the principal components (note that the variance of a principal component score is an eigenvalue of the variance-covariance matrix). The right graph is a biplot that shows the relationship between principal component scores and variables. Looking at the biplot, both the flag leaf length (leaf.length) and the panicle length (panicle.length) have arrows pointing to the left, i.e., both traits load negatively on the first principal component (the horizontal axis). Samples that have smaller scores on this component therefore have larger values in both traits. That is, the first principal component can be interpreted as a variable that represents "size". On the other hand, the arrow of the flag leaf length points (slightly) upward, while the arrow of the panicle length points downward. A sample with a longer flag leaf has a larger second principal component score, while a sample with a longer panicle has a smaller one. That is, the second principal component can be interpreted as a variable that represents the "ratio" between the flag leaf length and the panicle length. The details of the biplot will be explained again later.

Here, I will outline the formulation of principal component analysis, using the two-dimensional data mentioned earlier as an example.

First, let us consider representing the variation of the two variables, panicle length and flag leaf length, by one new variable. The orientation of the axis representing this new variable is tentatively assumed to be (1/√2, 1/√2) here. The value of the new variable corresponds to the position of the foot of the perpendicular line dropped from each data point onto this axis. That is, the red line in the figure drawn below is the axis that represents the new variable, the gray lines are the perpendicular lines from the data points to the new axis, and the green + marks are the feet of the perpendiculars. The position of the foot of the perpendicular is the value of the new variable.

> lim <- range(mydata)
> plot(mydata, xlim = lim, ylim = lim)    # draw a scatter plot
> abline(h = 0, v = 0)                    # draw x and y axes
> u.temp <- c(1 / sqrt(2), 1 / sqrt(2))   # orientation of the new axis
> abline(0, u.temp[2] / u.temp[1], col = "red")    # draw the new axis
> score.temp <- as.matrix(mydata) %*% u.temp       # calculate the inner products
> x <- score.temp * u.temp[1]             # x coordinates of the feet
> y <- score.temp * u.temp[2]             # y coordinates of the feet
> segments(x, y, mydata$panicle.length, mydata$leaf.length, col = "gray")
# draw the perpendicular segments in gray
> points(x, y, pch = 4, col = "green")    # draw the feet in green


Figure 6. Variation of two variables represented by the temporarily set new variable axis

Now, let's focus on one sample and examine the relationship between the value of the original variable and the value of the new variable in more detail. Here, let's draw a figure to pay attention to one sample that has the longest flag leaf length.

> id <- which.max(mydata$leaf.length)
# find the index of the sample which has the largest leaf length
> arrows(0, 0,                                              # starting point of the arrow
         mydata$panicle.length[id], mydata$leaf.length[id], # end point
         col = "purple")                                    # color of the arrow
> arrows(x[id], y[id],
         mydata$panicle.length[id], mydata$leaf.length[id],
         col = "pink")
> arrows(0, 0, x[id], y[id], col = "blue")


Figure 7. Original variable (vector of purple arrows) and new variable (vector of blue arrows)

Now, if the original variable is represented by a new variable as shown in Figure 7, the information indicated by the pink arrow vector will be lost.

Now, denote the vector representing the original variable by xi, the vector representing the new variable by yi, and the vector representing the lost information by ei. The square of the variation (the squared length) of the original variable is then

  ||xi||^2 = ||yi + ei||^2
           = (yi + ei)^T (yi + ei)
           = yi^T yi + ei^T yi + yi^T ei + ei^T ei
           = ||yi||^2 + ||ei||^2 + 2 ei^T yi
           = ||yi||^2 + ||ei||^2                      (1)

(the last equality holds because ei is orthogonal to yi).

That is, the square of the variation of the original variable can be divided into the square of the variation of the new variable and the square of the variation that is lost in the new variable. This means that the minimization of lost information is synonymous with the maximization of the variability of new variables.
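As an illustrative numerical check of equation (1), we can verify this decomposition for the sample highlighted in Figure 7. This is only a sketch that reuses mydata, u.temp, x, and y created above; the objects xi, yi, and ei are introduced here solely for this check.

> id <- which.max(mydata$leaf.length)    # the sample with the longest flag leaf (as in Figure 7)
> xi <- unlist(mydata[id, ])             # original variable (the purple arrow)
> yi <- c(x[id], y[id])                  # projection onto the new axis (the blue arrow)
> ei <- xi - yi                          # lost information (the pink arrow)
> sum(xi^2)                              # squared length of the original vector
> sum(yi^2) + sum(ei^2)                  # equals the sum of the two squared components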

How can we find an axis that maximizes the variability of the new variable? Consider a vector u1 that determines the orientation of the axis, and seek the u1 that maximizes the variation of the new variable. Since there are infinite possibilities if we allow vectors of any length, we restrict u1 to have length 1 (a unit vector). That is,

  ||u1||^2 = u1^T u1 = u11^2 + u12^2 = 1        (2)

The variance of the new variable z1i under this condition is

  (1 / (n - 1)) Σ_{i=1}^{n} z1i^2        (3)

We will maximize this value. z1i is the position of the foot of the perpendicular, i.e., the inner product of xi and u1:

  z1i = xi^T u1 = u11 x1i + u12 x2i        (4)

Note that the relationship with yi in equation (1) is yi = z1i u1.

To maximize Equation (3) under the condition of Equation (2), use the method of Lagrange's undetermined multipliers. That is,

  L(u1, λ) = (1 / (n - 1)) Σ_{i=1}^{n} (u11 x1i + u12 x2i)^2 − λ(u11^2 + u12^2 − 1)

First, partially differentiate the above equation with respect to u11 and u12.

  ∂L/∂u11 = (1 / (n - 1)) Σ_{i=1}^{n} 2(u11 x1i + u12 x2i) x1i − 2λ u11 = 0
  ∂L/∂u12 = (1 / (n - 1)) Σ_{i=1}^{n} 2(u11 x1i + u12 x2i) x2i − 2λ u12 = 0

The above formulas can be rearranged as

  (1 / (n - 1)) (u11 Σ_{i=1}^{n} x1i^2   + u12 Σ_{i=1}^{n} x1i x2i) = λ u11
  (1 / (n - 1)) (u11 Σ_{i=1}^{n} x1i x2i + u12 Σ_{i=1}^{n} x2i^2  ) = λ u12

If the above two expressions are written using a matrix and a vector,

  (1 / (n - 1)) [ Σ x1i^2     Σ x1i x2i ] [ u11 ]  =  λ [ u11 ]
                [ Σ x1i x2i   Σ x2i^2   ] [ u12 ]       [ u12 ]

where the sums run over i = 1, ..., n. Here, note that in the matrix on the left-hand side (divided by n − 1), the diagonal components represent variances, while the off-diagonal components are covariances. Now, let the variance-covariance matrix be V and the vector representing the orientation of the axis be u1. Then

Vu1 = λu1 (5)

For equation (5), u1 = 0 is a trivial solution, but it is not the solution we are looking for. Finding a nonzero vector u1 for which Eq. (5) holds, given the matrix V, is called the eigenvalue problem. We call the vector u1 an eigenvector of V and λ its eigenvalue.

To summarize the results, "to find a new variable that best describes the variation of the original variable" eventually corresponds to "to find the variance-covariance matrix of the original variable and find its first eigenvector".

Equation (5) can be rewritten as follows.

  (V − λI) u1 = 0

The determinant of the coefficient matrix of the above equation must be 0 in order for this equation to have a solution other than u1 = 0. That is,

  |V − λI| = 0

This equation is called the eigen (or characteristic) equation.
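For the two-variable case, the characteristic equation is a quadratic in λ, so it can be solved directly. The following is only a quick sketch (V, tr, and detV are temporary names introduced here); the roots should match the eigenvalues computed with the function eigen below.

> V <- var(mydata)                                # 2 x 2 variance-covariance matrix
> tr <- V[1, 1] + V[2, 2]                         # trace of V
> detV <- V[1, 1] * V[2, 2] - V[1, 2] * V[2, 1]   # determinant of V
> (tr + c(1, -1) * sqrt(tr^2 - 4 * detV)) / 2     # roots of lambda^2 - tr*lambda + detV = 0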

Here, the case where the number of variables is two has been described as an example, but in general, if there are m variables, V is an m × m variance-covariance matrix. Since the matrix V is an m × m symmetric matrix (a variance-covariance matrix is always symmetric), V has m real eigenvalues λ1, ..., λm and corresponding eigenvectors u1, ..., um, which are unit vectors whose elements are all real numbers and which are orthogonal to each other. Now, if the eigenvectors are arranged in descending order of their eigenvalues (λ1 ≥ ... ≥ λm), then u1, ..., um become the eigenvectors of the first, ..., m-th principal components.

Let's perform principal component analysis according to the calculation procedure mentioned above. First, calculate the variance-covariance matrix V.

> cov <- var(mydata)    # calculate the variance-covariance matrix
> cov
               panicle.length leaf.length
panicle.length       12.67168    11.57718
leaf.length          11.57718    33.41344

Next, perform eigenvalue decomposition of the variance-covariance matrix. We use the function eigen for eigenvalue decomposition.

> eig <- eigen(cov)    # eigenvalue decomposition
> eig                  # eigenvalues and eigenvectors are obtained
$values
[1] 38.585610  7.499513

$vectors
          [,1]       [,2]
[1,] 0.4078995 -0.9130268
[2,] 0.9130268  0.4078995

Eigenvalue decomposition yields the eigenvalues λ1, ..., λm and the eigenvectors u1, ..., um. Let's compare them with the result of the function prcomp.

> res <- prcomp(mydata)
> res
Standard deviations:
[1] 6.211732 2.738524

Rotation:
                      PC1        PC2
panicle.length -0.4078995 -0.9130268
leaf.length    -0.9130268  0.4078995
> sqrt(eig$values)
[1] 6.211732 2.738524

The standard deviations are the square roots of the eigenvalues. This is because the variance of the new variables (called principal component scores) is the same as the eigenvalue, as described later. In addition, Rotation corresponds to the eigenvectors, and the two results are identical except for sign. The sign of the coefficients depends on which side of the axis is taken as positive, and it may be flipped because there is no rule that determines it uniquely. In the present results, the function prcomp and the function eigen give opposite signs for the first principal component.
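As a small check of this sign indeterminacy (reusing res and eig from above), the element-wise ratios of the two sets of eigenvectors show which signs are flipped:

> res$rotation[, 1] / eig$vectors[, 1]    # both elements are -1: the sign of the first eigenvector is flipped
> res$rotation[, 2] / eig$vectors[, 2]    # both elements are 1: the second eigenvector has the same sign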

Now let's calculate the value of the new variable, i.e., the principal component score. Principal component scores can be calculated using equation (4). For example, the principal component score of the first sample can be calculated as follows:

> mydata[1,]    # the data of the first sample
  panicle.length leaf.length
1      -3.995677   -2.221425
> eig$vectors[,1]    # the eigenvector of the first PC
[1] 0.4078995 0.9130268
> mydata[1,1] * eig$vectors[1,1] + mydata[1,2] * eig$vectors[2,1]
[1] -3.658055
> res$x[1,1]    # compare with the result of the function prcomp
[1] 3.658055

Principal component scores for all samples and all principal components can be calculated at once as the product of the data matrix and the matrix of eigenvectors:

> score <- as.matrix(mydata) %*% eig$vectors
> head(score)
       [,1]      [,2]
1 -3.658055 2.7420420
(omitted)
> head(res$x)
       PC1       PC2
1 3.658055 2.7420420
(omitted)

The principal component scores obtained here and those obtained using the function prcomp are identical except for the sign of the first principal component scores (the sign reversal is as described above).

Let's examine the variances and covariances of the principal component scores, and compare the elements of this matrix with the eigenvalues.

> var(score)
             [,1]         [,2]
[1,] 3.858561e+01 6.763202e-16
[2,] 6.763202e-16 7.499513e+00
> eig$values
[1] 38.585610  7.499513

The above results show two important points. One is that the covariance between the first and second principal components is zero, which shows that there is no redundancy between the two components. The other is that the variance of the principal component scores matches the eigenvalue of each principal component. This relationship can be derived as follows:

  (1 / (n - 1)) Σ_{i=1}^{n} zji^2
    = (1 / (n - 1)) zj^T zj
    = (1 / (n - 1)) (X uj)^T (X uj)
    = (1 / (n - 1)) uj^T X^T X uj
    = uj^T V uj          (because (1 / (n - 1)) X^T X = V)
    = λj uj^T uj         (because V uj = λj uj)
    = λj                 (because uj^T uj = 1)

where zji is the j-th principal component score of the i-th sample, zj = (zj1, ..., zjn)^T is the column vector consisting of the j-th principal component scores of all samples, and X = (x1, ..., xn)^T is the data matrix bundling the column vectors xi = (xi1, ..., xim)^T consisting of the original m variables of the i-th sample. Because of this relationship, as mentioned earlier, the standard deviations of the principal component scores in the result of the function prcomp coincide with the square roots of the eigenvalues in the result of the function eigen.
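This relationship can be verified numerically for the first principal component, reusing cov and eig computed above (just a small sketch):

> u1 <- eig$vectors[, 1]     # eigenvector of the first principal component
> t(u1) %*% cov %*% u1       # u1' V u1, i.e., the variance of the PC1 scores
> eig$values[1]              # equals the first eigenvalue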

Let's check another important relationship. The sum of the eigenvalues (i.e., the sum of the variances of the principal components) matches the sum of the variances of the original variables, as shown below.

> sum(eig$values)
[1] 46.08512
> sum(diag(cov))
[1] 46.08512

Therefore, calculating the ratio of the eigenvalue of the j-th principal component to the sum of the eigenvalues of all principal components is the same as calculating the ratio of the variance of the j-th principal component to the sum of the variances of the original variables. This ratio is called the contribution of the j-th principal component. Also, the sum of the contributions from the first principal component to the j-th principal component is called the cumulative contribution of the j-th principal component. The contribution and the cumulative contribution provide a good basis for determining the number of effective (necessary) principal components, as discussed later. Now let's calculate the contribution and the cumulative contribution of the principal components.

> eig$values / sum(eig$values)
[1] 0.8372682 0.1627318
> cumsum(eig$values) / sum(eig$values)
[1] 0.8372682 1.0000000
> summary(res)
(results omitted)

You can see that the first principal component accounts for 83.7% of the total variation (the sum of the variances of the original variables). This is the same as the result shown when displaying the result of the function prcomp with the function summary.

So far, we have discussed principal component analysis based on the variance-covariance matrix. This method cannot be applied when the variables are measured on different scales, because it is difficult to give meaning to the covariance between variables with different measurement scales. For example, when considering the covariance between a length and a count, the size of the covariance differs depending on whether the length is measured in meters or in centimeters (the latter will be 100 times larger). Therefore, the result of a principal component analysis based on the variance-covariance matrix of these two variables changes depending on the unit of measure of the length. Also, even when both variables are lengths, if one varies much more than the other, the larger variable mainly determines the magnitude of the covariance, and the principal component analysis yields results that depend mainly on the variation in the larger variable. These problems mainly come from the fact that the estimate of covariance has the following form:

  (1 / (n - 1)) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ)
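As a small illustration of this scale dependence (the variable names len.cm and florets are introduced here only for this check), changing the unit of length from centimeters to meters shrinks the covariance by a factor of 100, while the correlation is unchanged:

> len.cm <- alldata$Panicle.length                   # panicle length in cm
> florets <- alldata$Florets.per.panicle             # number of florets per panicle
> cov(len.cm, florets, use = "complete.obs")         # covariance with the length in cm
> cov(len.cm / 100, florets, use = "complete.obs")   # 100 times smaller when the length is in m
> cor(len.cm, florets, use = "complete.obs")         # the correlation does not depend on the unit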

Now let's calculate and check the problem specifically. First, let's draw a scatter plot by extracting the data of panicle length (Panicle.length) and the number of florets per panicle (Florets.per.panicle). The panicle length is a variable measured in centimeters (cm), and the number of florets per panicle is a count.

> mydata <- data.frame(    # extract two traits
      panicle.length = alldata$Panicle.length,
      panicle.florets = alldata$Florets.per.panicle
  )
> missing <- apply(is.na(mydata), 1, sum) > 0
# if a sample has a missing value, the corresponding element becomes T
> mydata <- mydata[!missing, ]    # remove samples with missing data
> plot(mydata)


Figure 8. Relationship between panicle length (horizontal axis) and floret number per panicle (vertical axis). When one becomes larger, the other also tends to become larger.

Next, let's do principal component analysis based on the variance- covariance matrix. The point to note is that the next analysis is an "incorrect analysis example".

> res <- prcomp(mydata)
> res
Standard deviations:
[1] 3.5623427 0.2901551

Rotation:
                       PC1         PC2
panicle.length  0.99926174 -0.03841834
panicle.florets 0.03841834  0.99926174

The eigenvectors show that the first principal component is a variable that mainly explains the variation in panicle length.

Let’s see what happens if the panicle length is measured in meters (the next analysis is also an "incorrect analysis example").

> mydata$panicle.length <- mydata$panicle.length / 100
# convert from the cm unit to the m unit
> res.2 <- prcomp(mydata)    # PCA with the m unit
> res.2
Standard deviations:
[1] 0.32097715 0.03220266

Rotation:
                       PC1         PC2
panicle.length  0.04750446 -0.99887103
panicle.florets 0.99887103  0.04750446

It turns out that the first principal component is a variable that mainly explains the variation in the number of florets. In other words, the result of the principal component analysis changes completely because the measurement scale is different.

How can we solve this problem? One way is to perform the principal component analysis after scaling each variable to mean 0 and variance 1. By normalizing in this way, principal component analysis can be performed without being influenced by the differences in the magnitude of variation among the variables. Let's actually calculate it.

> mydata.scaled <- scale(mydata)
# make each variable have mean 0 and variance 1
> var(mydata.scaled)    # the variances become 1; the covariances equal the correlations
                panicle.length panicle.florets
panicle.length       1.0000000       0.4240264
panicle.florets      0.4240264       1.0000000
> res.scaled <- prcomp(mydata.scaled)    # PCA with the scaled data
> res.scaled
Standard deviations:
[1] 1.1933258 0.7589292

Rotation:
                      PC1        PC2
panicle.length  0.7071068 -0.7071068
panicle.florets 0.7071068  0.7071068

As a result of the analysis, it can be seen that the first principal component is a variable that explains both variables becoming large together, and the second principal component is a variable that explains one becoming smaller when the other becomes larger.

Note that the variance-covariance matrix calculated between variables scaled in this way matches the correlation matrix calculated between the variables before scaling. In other words, the eigenvalue decomposition of the correlation matrix produces the same result as the eigenvalue decomposition of the variance-covariance matrix of the scaled variables. Let's confirm this using the function eigen.

> eigen(cov(mydata.scaled))
$values
[1] 1.4240264 0.5759736

$vectors
          [,1]       [,2]
[1,] 0.7071068 -0.7071068
[2,] 0.7071068  0.7071068

> eigen(cor(mydata))
(results omitted)

The prcomp function performs principal component analysis based on the correlation matrix when the option scale = T is specified.

> res.scaled.2 <- prcomp(mydata, scale = T)
> res.scaled.2
Standard deviations:
[1] 1.1933258 0.7589292

Rotation:
                      PC1        PC2
panicle.length  0.7071068 -0.7071068
panicle.florets 0.7071068  0.7071068
> res.scaled
(results omitted)

In fact, principal component analysis based on the correlation matrix of two variables always yields the same eigenvectors. Also, if the correlation between the two variables is r, the eigenvalues are always 1 + r and 1 − r. You can understand the mechanism by looking at the formulas shown below.

Assume that the correlation matrix between the two variables is

  R = [ 1  r ]
      [ r  1 ]

Setting the eigen (characteristic) equation to 0 gives

  |R − λI| = 0  ⇔  (1 − λ)^2 − r^2 = 0

The eigenvalues are obtained as its solutions:

λ1 = 1+ r, λ2 = 1− r

When the eigenvalue is λ1, the eigenvector satisfies R u1 = λ1 u1. That is,

  u11 + r u12 = (1 + r) u11
  r u11 + u12 = (1 + r) u12

Solving these gives

  u11 = u12 = 1/√2 ≈ 0.71

Similarly, the eigenvector u2 for the eigenvalue λ2 can be obtained as

  u21 = −1/√2,  u22 = 1/√2
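This can be confirmed with the current two-variable data (a quick check reusing mydata; the correlation is unaffected by the change of unit above):

> r <- cor(mydata)[1, 2]           # correlation between the two variables
> c(1 + r, 1 - r)                  # eigenvalues predicted by the formula
> eigen(cor(mydata))$values        # eigenvalues computed numerically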

So far, I have explained principal component analysis using examples with two variables. However, in most cases where principal component analysis is actually used, data consisting of a large number of variables are analyzed. Here, while analyzing data consisting of seven variables, I will explain how to determine the number of principal components and how to interpret the meaning of the principal components.

First, extract seven variables (leaf.length, leaf.width, plant.height, panicle.number, panicle.length, seed.length, seed.width).

> plot(res)
> mydata <- data.frame(
      leaf.length = alldata$Flag.leaf.length,
      leaf.width = alldata$Flag.leaf.width,
      plant.height = alldata$Plant.height,
      panicle.number = alldata$Panicle.number,
      panicle.length = alldata$Panicle.length,
      seed.length = alldata$Seed.length,
      seed.width = alldata$Seed.width
  )
> missing <- apply(is.na(mydata), 1, sum) > 0
> mydata <- mydata[!missing, ]

Let's perform principal component analysis based on the correlation matrix.

> res <- prcomp(mydata, scale = T)    # PCA after scaling each variable to variance 1
> summary(res)
Importance of components:
                          PC1    PC2    PC3     PC4    PC5     PC6     PC7
Standard deviation     1.5626 1.2797 1.0585 0.77419 0.7251 0.64540 0.50854
Proportion of Variance 0.3488 0.2339 0.1601 0.08562 0.0751 0.05951 0.03694
Cumulative Proportion  0.3488 0.5827 0.7428 0.82844 0.9035 0.96306 1.00000
> plot(res)    # show a bar plot of the variances (eigenvalues) of the PC scores


Figure 9. Bar plot showing the variances (eigenvalues) of the principal component scores. In principal component analysis based on the correlation matrix, the sum of the eigenvalues equals the number of variables.

While seven principal components are calculated for data consisting of seven variables, how many principal components should be used to summarize the data? Although various methods have been proposed for determining the number of effective principal components, here I present some simple rules.
1. Adopt the number of principal components whose cumulative contribution exceeds a specified percentage. 70% to 90% is often used as the pre-defined percentage.
2. Adopt the principal components whose contribution exceeds the average explanatory power per original variable. When the number of variables is q, principal components whose contribution exceeds 1/q are adopted.
3. In the case of the correlation matrix, the above rule corresponds to adopting the principal components whose eigenvalue exceeds 1. However, this criterion is often too strict; there is also a report that a threshold of about 0.7 is appropriate.
4. In the graph of the eigenvalues, use as the number of components the point at which the decrease changes from abrupt to gentle.

Assuming that the percentage chosen for the first rule is 80%, the first four principal components, with a cumulative contribution of 82.8%, are selected. Next, based on the second rule, the first three principal components, whose contributions exceed 1/7 = 14.3%, are selected. The third rule gives the same result (and, since the fourth eigenvalue is about 0.6, the relaxed threshold of 0.7 also selects the first three principal components here). Finally, with the fourth rule, the eigenvalues decrease rapidly up to the fourth principal component and then decrease gradually, so the first four principal components are chosen. Combining the above, the first three or four principal components are considered to be an appropriate number.
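The criteria of rules 1 to 3 can also be computed directly in R; the following is a small sketch reusing res from the seven-variable analysis (the eigenvalues of the correlation matrix are the squared standard deviations reported by prcomp):

> eig.val <- res$sdev^2                   # eigenvalues of the correlation matrix
> contrib <- eig.val / sum(eig.val)       # contributions of the principal components
> which(cumsum(contrib) >= 0.8)[1]        # rule 1: number of PCs needed for 80% cumulative contribution
> sum(contrib > 1 / length(contrib))      # rule 2: PCs whose contribution exceeds 1/q
> sum(eig.val > 1)                        # rule 3: PCs whose eigenvalue exceeds 1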

Let's draw a scatter plot of the first four principal components. In addition, let's color each subpopulation (Sub.population) in order to see the relationship with genetic structure.

> subpop <- alldata$Sub.population[!missing]
# extract subpopulation information (removing samples with missing data)
> op <- par(mfrow = c(1,2))    # arrange graphs in 1 row and 2 columns
> plot(res$x[,1:2], col = as.numeric(subpop))    # plot of PC1 and PC2
> plot(res$x[,3:4], col = as.numeric(subpop))    # plot of PC3 and PC4
> par(op)
> library(plotly)    # required for plot_ly
> df <- data.frame(subpop = subpop, res$x[,1:3])
> plot_ly(data = df, x = ~PC1, y = ~PC2, z = ~PC3, color = ~subpop,
          type = "scatter3d", mode = "markers")


Fig. 10. Scatter plot of the scores of the first to fourth principal components. It can be seen that there is a relationship between principal component scores and genetic background

When you draw a scatter plot, you can see that the dots of the same color are plotted closely. This suggests that there is some relationship between principal component scores and genetic background.

What kind of variation do the first four principal components capture? First, let's look at their eigenvectors.

> res$rotation[,1:4]
                       PC1        PC2         PC3         PC4
leaf.length     0.46468878 -0.0863622  0.33735685 -0.28604795
leaf.width      0.26873998 -0.5447022  0.19011509 -0.36565484
plant.height    0.43550572  0.2369710  0.35223063  0.55790981
panicle.number -0.03342277  0.6902669  0.07073948 -0.15465539
panicle.length  0.56777431  0.1140531  0.01542783  0.07533158
seed.length     0.27961838 -0.2343565 -0.67403236  0.42404985
seed.width     -0.34714081 -0.3086850  0.51615742  0.51361303

Looking at the eigenvectors, the elements of the first principal component are positive except for the number of panicles (panicle.number) and the width of the seed (seed.width), so it can be interpreted as a variable that explains overall "size" excluding the width of the seed. For the second principal component, the number of panicles (panicle.number) has a relatively large weight.

In this way, it is often difficult to interpret the meaning of a principal component from the numerical values alone. To visualize it, let's draw a biplot.

> op <- par(mfrow = c(1,2))    # arrange graphs in a 1 row by 2 columns format
> biplot(res, choices = 1:2)   # biplot of PC1 and PC2
> biplot(res, choices = 3:4)   # biplot of PC3 and PC4
> par(op)                      # reset the graph parameters


Fig. 11. Results of biplot of the first to fourth principal components

The size and orientation of each arrow labeled with a variable name indicate the relationship between that variable and the principal components. For example, from the left figure, samples that have large scores on the first principal component (for example, 174, 230, etc.) have larger values of plant height, panicle length, and flag leaf length. Among the variables measured as lengths, only the width of the seed points in the direction opposite to the other variables, suggesting that the first principal component takes a smaller score when the seed width is larger. It can also be seen that the number of panicles is not aligned with the direction of the first principal component, suggesting that it is not involved in the first principal component. The third and fourth principal components can be interpreted in the same way.

There is a statistic called the factor loading, which represents the relationship between an original variable and a principal component. The factor loading is the correlation coefficient between the values of the original variable and the principal component scores. When the absolute value of this correlation coefficient is close to 1, the relationship is strong, while when it is close to 0, the relationship is weak or absent.

Now let's calculate the factor loadings of the variables.

> factor.loadings <- cor(mydata, res$x[,1:4])
# correlations between the original variables and the principal component scores
> factor.loadings
                       PC1        PC2         PC3         PC4
leaf.length     0.72611038 -0.1105166  0.35710695 -0.22145439
leaf.width      0.41992598 -0.6970483  0.20124513 -0.28308496
plant.height    0.68050970  0.3032487  0.37285150  0.43192611
panicle.number -0.05222553  0.8833255  0.07488083 -0.11973208
panicle.length  0.88718910  0.1459523  0.01633103  0.05832068
seed.length     0.43692427 -0.2999030 -0.71349267  0.32829357
seed.width     -0.54243303 -0.3950201  0.54637516  0.39763215

Let's draw graphs of this result. Because factor loadings always fall within a circle of radius 1, they can be drawn as follows.

> theta <- 2 * pi * (0:100 / 100)
> x <- cos(theta)
> y <- sin(theta)
> op <- par(mfrow = c(1,2))
> plot(factor.loadings[,1:2], xlim = c(-1,1), ylim = c(-1,1), pch = 4)
> text(factor.loadings[,1:2], rownames(factor.loadings), col = "red")
> lines(x, y, col = "gray")
> abline(v = 0, h = 0)
> plot(factor.loadings[,3:4], xlim = c(-1,1), ylim = c(-1,1), pch = 4)
> text(factor.loadings[,3:4], rownames(factor.loadings), col = "red")
> lines(x, y, col = "gray")
> abline(v = 0, h = 0)
> par(op)


Figure 12. Factor loadings of the first to fourth principal components. The variables scattered closer to the circle have a stronger relationship with their principal components. Conversely, variables scattered near the origin have a weak relationship with their principal components

In the principal component analysis based on the correlation matrix, the factor loading can also be calculated as follows.

> factor.loadings <- t(res$sdev * t(res$rotation))[,1:4]
> factor.loadings
(omitted)

Finally, let's perform principal component analysis of marker genotype data. First, we extract marker genotype data.

> mydata <- alldata[, 50:ncol(alldata)]
> dim(mydata)
[1]  374 1311
> head(mydata)[,1:10]
(omitted)

This dataset has a very large number of variables (1,311 variables). Another feature is that the number of variables is greater than the number of samples. Let's use this data to perform principal component analysis based on the variance-covariance matrix.

> res.pca <- prcomp(mydata)    # PCA based on the variance-covariance matrix
> summary(res.pca)             # show the result
(results omitted)
> plot(res.pca)                # draw the graph


Figure 13. Bar plot of the variances (eigenvalues) of the principal component scores

According to the fourth rule shown above, it can be judged that it is better to use the first four principal components (the other rules would select too many principal components).
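To see why the other rules would select many more components here, the cumulative contributions can be checked as follows (a small sketch reusing res.pca; summary(res.pca)$importance is a matrix whose third row is the cumulative proportion):

> imp <- summary(res.pca)$importance     # rows: standard deviation, proportion, cumulative proportion
> round(imp[3, 1:10], 3)                 # cumulative contribution of the first 10 PCs
> which(imp[3, ] >= 0.8)[1]              # number of PCs required by rule 1 with an 80% threshold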

Let's draw a scatter plot of the first four principal components.

> subpop <- alldata$Sub.population    # information about the subpopulations
> op <- par(mfrow = c(1,2))
> plot(res.pca$x[,1:2], col = as.numeric(subpop))
> plot(res.pca$x[,3:4], col = as.numeric(subpop))
> legend(-10, 20, levels(subpop), col = 1:nlevels(subpop), pch = 1, cex = 0.5)    # add a legend
> par(op)


Figure 13. Results of principal component analysis based on marker genotype data

Varieties can be divided into five groups (+1 group) based on the scores of the first to fourth principal components

Let's also draw a 3D scatter plot.

> df <- data.frame(subpop = subpop, res.pca$x[,1:4])
> plot_ly(data = df, x = ~PC1, y = ~PC2, z = ~PC3, color = ~subpop,
          type = "scatter3d", mode = "markers")
> plot_ly(data = df, x = ~PC2, y = ~PC3, z = ~PC4, color = ~subpop,
          type = "scatter3d", mode = "markers")

Finally, let's compare PC1-4 included in alldata with the first to fourth principal components calculated this time.

> cor(alldata[,c("PC1","PC2","PC3","PC4")], res.pca$x[,1:4])
             PC1         PC2         PC3         PC4
PC1  0.988907541 -0.11988231 -0.03045304 -0.03589106
PC2  0.006737731 -0.07579808  0.96846220 -0.18191250
PC3 -0.129282100 -0.97046613 -0.08514082 -0.03141488
PC4  0.012470575 -0.02915991  0.16366284  0.87422607

Although the match is not perfect because the calculation methods differ slightly, PC1-4 contained in alldata (rows 1-4) and the first to fourth principal components calculated this time (columns 1-4) provide almost the same information.

Even when it is not possible to directly measure the characteristics of samples, it may be possible to assess the differences in characteristics between samples. In other words, although the characteristics of the samples cannot be measured as points in a multi-dimensional space, it may be possible to measure the distances among the samples. For example, in population genetics, genetic distances between populations are calculated based on polymorphisms of genetic markers. In this case, although the distances between the populations are known, the genetic characteristics of the populations are not measured as multivariate data. Another example is the similarity of human impressions of certain objects. For example, let's say that you show photographs of flowers of many varieties of pansy to 100 people, and ask them to classify the flowers into any number of groups. When the flowers of two varieties are classified into the same group by many people, it means that the flowers of the two varieties are similar. Conversely, if the flowers of two varieties are classified into different groups by many people, it means that they were judged to be dissimilar. Using such data, the number of people who classified two varieties into different groups can be used as the distance between the varieties. Again, in this case, although we do not characterize the flower of each variety in a multidimensional space of human impressions, we can measure the difference in human impressions between varieties as a distance. Here, based on data measured as distances, I will outline methods for summarizing the variation of samples with low-dimensional variables. There are a variety of such methods; here I will introduce classical multidimensional scaling (MDS). The distances among rice accessions (lines and varieties) are calculated based on the marker genotype data, and an analysis by multidimensional scaling is performed based on the distance matrix.

First, let's extract marker genotype data and calculate the distance matrix based on them.

> mydata <- alldata[, 50:ncol(alldata)]    # columns 50 to the last column are the marker data
> D <- dist(mydata)    # the function dist calculates distances
# distances based on various definitions can be calculated,
# but if nothing is specified, Euclidean distances are calculated

Let's perform multidimensional scaling based on the calculated distance matrix.

> res.mds <- cmdscale(D, k = 10, eig = T)
# cmdscale is a function for classical multidimensional scaling
# D is the distance matrix; k = 10 is the number of dimensions to compute
# eig = T is an option for returning the calculated eigenvalues
> res.mds    # show the result

Eigenvalues are calculated as in principal component analysis. In addition, the contribution and the cumulative contribution can be calculated based on the eigenvalues.

> res.mds$eig[1:10]    # show the eigenvalues
[1] 115029.059  40849.407  20648.953  11530.683  10069.314   6591.745
[7]   4996.271   4819.066   3932.298   3581.676
> res.mds$eig[1:10] / sum(res.mds$eig)    # divide the eigenvalues by their sum (contributions)
[1] 0.310125555 0.110132562 0.055670871 0.031087446 0.027147501 0.017771758
[7] 0.013470260 0.012992505 0.010601722 0.009656423
> cumsum(res.mds$eig[1:10]) / sum(res.mds$eig)    # cumulative sums of the eigenvalues divided by their sum (cumulative contributions)
[1] 0.3101256 0.4202581 0.4759290 0.5070164 0.5341639 0.5519357 0.5654060
[8] 0.5783985 0.5890002 0.5986566
> barplot(res.mds$eig[1:10])    # bar plot of the eigenvalues


Figure 14. Top 10 eigenvalues calculated by multidimensional scaling

According to the fourth rule introduced for principal component analysis, applied to the bar chart of the eigenvalues in Figure 14, it is considered that the variation contained in the data should be represented in four dimensions.

Now let's draw a scatter plot that shows the placement in 4D space.

> subpop <- alldata$Sub.population    # extract subpopulation information
> op <- par(mfrow = c(1,2))           # draw graphs in a 1 row by 2 columns format
> plot(res.mds$points[,1:2], col = as.numeric(subpop))
# the coordinates are stored in res.mds$points
# plot the 1st and 2nd axes
> plot(res.mds$points[,3:4], col = as.numeric(subpop))    # plot the 3rd and 4th axes
> legend(5, -10, levels(subpop), col = 1:nlevels(subpop), pch = 1, cex = 0.5)    # add a legend
> par(op)


Figure 15. The placement of accessions (varieties and lines) in four- dimensional space determined by multidimensional scaling

Let's also draw a three-dimensional scatter plot, as we did for the principal component analysis.

> df <- data.frame(subpop = subpop, res.mds$points[,1:4])
> plot_ly(data = df, x = ~X1, y = ~X2, z = ~X3, color = ~subpop,
          type = "scatter3d", mode = "markers")
> plot_ly(data = df, x = ~X2, y = ~X3, z = ~X4, color = ~subpop,
          type = "scatter3d", mode = "markers")

The results of multidimensional scaling look very similar to the results of principal component analysis. Let's calculate the correlations between the coordinates obtained by multidimensional scaling and the principal component scores.

> cor(res.pca$x[,1:4], res.mds$points[,1:4])
             [,1]          [,2]          [,3]          [,4]
PC1 -1.000000e+00  6.392724e-15 -3.856842e-16  1.819451e-16
PC2  1.946384e-14  1.000000e+00  7.024318e-16 -2.342063e-16
PC3 -4.646783e-16  8.465118e-16 -1.000000e+00 -1.033884e-15
PC4 -4.482051e-16 -4.494095e-16  8.594246e-16 -1.000000e+00

You can see that the first four principal components (rows) and the first four dimensions of the multidimensional scaling (columns) are perfectly correlated (-1 or 1) with each other. If you look at this result from a slightly different point of view, even if the values of the original variables are not given, a Euclidean distance matrix based on those values can be used to perform the same analysis as principal component analysis.

Given a distance matrix, being able to represent sample variation with fewer variables, as in principal component analysis, is one of the most useful aspects of multidimensional scaling. I will explain this point with a slightly more concrete example. In the second lecture, it was shown that there is a strong relationship between the principal component scores representing the genetic background (PC1-4) and plant height (Plant.height). Using multidimensional scaling, it is possible to calculate, from "the relationships (distances) among accessions (varieties and lines)", variables with the same variation as the principal component scores, and based on these you can check whether the genetic relationship affects plant height. In other words, by combining multidimensional scaling with another analysis (e.g., multiple regression), you can examine whether the distance relationships among samples are related to the variation in another characteristic observed in those samples.

Finally, let's make a concrete calculation and check the above-mentioned points.

> mydata <- data.frame(    # combine plant height with the MDS result
      plant.height = alldata$Plant.height,
      res.mds$points[,1:4]
  )
> mydata <- na.omit(mydata)    # remove missing data
> model <- lm(plant.height ~ ., data = mydata)    # multiple regression analysis
> anova(model)
(omitted)
> plot(mydata$plant.height, predict(model))
# plot the observed values against the values predicted by the model

Here, let's consider the problem of finding the arrangement of n samples distributed in q-dimensional space from the "Euclidean distances between samples".

Let the coordinates of the i-th sample in q-dimensional space be represented by a column vector xi = (xi1, ..., xiq)^T. Using the n × q matrix X = (x1, ..., xn)^T obtained by bundling and transposing the column vectors of the n samples, the matrix whose elements are the inner products between the samples can be expressed as

  B = XX^T

Here, the (i, j) element of the inner product matrix B is the inner product of the i-th and j-th samples. That is,

  bij = xi^T xj = Σ_{k=1}^{q} xik xjk

At this time, the Euclidean distance dij between the i-th sample and the j-th sample can be expressed as

  dij^2 = Σ_{k=1}^{q} (xik − xjk)^2
        = Σ_{k=1}^{q} xik^2 + Σ_{k=1}^{q} xjk^2 − 2 Σ_{k=1}^{q} xik xjk
        = bii + bjj − 2 bij

If the center of gravity of x is located at the origin, then

  Σ_{i=1}^{n} xik = 0    (for each k = 1, ..., q)

If bij is summed over i (for a fixed j), then

  Σ_{i=1}^{n} bij = Σ_{i=1}^{n} Σ_{k=1}^{q} xik xjk = Σ_{k=1}^{q} xjk Σ_{i=1}^{n} xik = 0

Therefore, the following relations hold.

  Σ_{i=1}^{n} dij^2 = Σ_{i=1}^{n} (bii + bjj − 2 bij) = Σ_{i=1}^{n} bii + n bjj
  Σ_{j=1}^{n} dij^2 = Σ_{j=1}^{n} (bii + bjj − 2 bij) = Σ_{j=1}^{n} bjj + n bii = Σ_{i=1}^{n} bii + n bii
  Σ_{i=1}^{n} Σ_{j=1}^{n} dij^2 = Σ_{i=1}^{n} Σ_{j=1}^{n} (bii + bjj − 2 bij) = n Σ_{i=1}^{n} bii + n Σ_{j=1}^{n} bjj = 2n Σ_{i=1}^{n} bii

Using the above equations, the following relationship holds:

  bij = −(1/2) (dij^2 − di.^2 − d.j^2 + d..^2)        (6)

where

  di.^2 = (1/n) Σ_{j=1}^{n} dij^2 = (1/n) Σ_{j=1}^{n} bjj + bii
  d.j^2 = (1/n) Σ_{i=1}^{n} dij^2 = (1/n) Σ_{i=1}^{n} bii + bjj
  d..^2 = (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} dij^2 = (2/n) Σ_{i=1}^{n} bii

Equation (6) means that if the distance matrix D = {dij} is given, the inner product matrix B can be calculated back from D. The calculation procedure is as follows. First calculate the squares dij^2 of the elements of the distance matrix, then calculate their row means di.^2, column means d.j^2, and overall mean d..^2. Finally, the inner product matrix B is obtained by computing bij according to equation (6).

Once the inner product matrix B is obtained, it suffices to find X that satisfies B = XXT . To find X, perform a spectral decomposition of matrix B, as shown below.

Suppose the matrix B is a symmetric matrix with q eigenvalues and eigenvectors. That is,

  B u1 = λ1 u1,  ...,  B uq = λq uq

Now, bundling these expressions, we obtain

  B (u1, ..., uq) = (u1, ..., uq) diag(λ1, ..., λq)

Letting U = (u1, ..., uq) and Λ = diag(λ1, ..., λq), where diag(λ1, ..., λq) is the diagonal matrix with diagonal elements λ1, ..., λq, the above equation becomes

  BU = UΛ        (7)

Since the eigenvectors are unit vectors (ui^T ui = 1) and mutually orthogonal (ui^T uj = 0 for i ≠ j), the matrix U satisfies

  U^T U = I  ⇔  U^T = U^(-1)  ⇔  U U^T = I        (8)

That is, the inverse of the matrix U is simply the transpose of U.

From equations (7) and (8), the matrix B is

  B = U Λ U^T

Setting ui* = λi^(1/2) ui, the above equation becomes

  B = U* U*^T

where U* = (u1*, ..., uq*).

Therefore, X that satisfies B = XX^T is

  X = U Λ^(1/2)        (9)

where Λ^(1/2) = diag(λ1^(1/2), ..., λq^(1/2)).

Thus, the classical multidimensional scaling method eventually becomes the eigenvalue problem of the inner product matrix B obtained from the distance matrix D. And by solving the eigenvalue problem, we can find the placement of n samples in the q-dimensional space.

Now let's analyze using the classical multidimensional scaling method without using the function cmdscale in the procedure described above.

First, prepare the data. As before, we use the marker genotype data. Calculate the Euclidean distances using the function dist and prepare the distance matrix D.

> mydata <- alldata[, 50:ncol(alldata)]
> D <- dist(mydata)

First, square each element of the distance matrix D. Next, find the row means, column means, and overall mean of the squared elements. The row means and column means can be easily calculated using the function apply. Finally, calculate the inner product matrix B according to equation (6). The R code is as follows.

> D2 <- as.matrix(D^2)          # D2 is the matrix of squared elements of the matrix D
> D2i. <- apply(D2, 1, mean)    # row means
> D2.j <- apply(D2, 2, mean)    # column means
> D2.. <- mean(D2)              # overall mean
> B <- - 0.5 * (sweep(sweep(D2, 1, D2i.), 2, D2.j) + D2..)    # same as equation (6)

Perform the eigenvalue decomposition of the inner product matrix B, and calculate the coordinates according to equation (9).

> eig <- eigen(B)               # eigenvalue decomposition
> eval <- eig$values[1:10]      # use the first 10 components
> evec <- eig$vectors[,1:10]    # eigenvectors
> points <- evec * rep(sqrt(eval), each = nrow(evec))    # calculation following equation (9)

Let's compare it with the result calculated using the function cmdscale. You will see that the results are in agreement.

> head(points)
(omitted)
> head(res.mds$points)
(omitted)

Finally, let's draw the plots. You should see the same picture as in Figure 15. Let's check it.

> subpop <- alldata$Sub.population
> op <- par(mfrow = c(1,2))
> plot(points[,1:2], col = as.numeric(subpop))
> plot(points[,3:4], col = as.numeric(subpop))
> legend(5, -10, levels(subpop), col = 1:nlevels(subpop), pch = 1, cex = 0.5)
> par(op)


Analyze the phenotypic data of rice using principal component analysis (try analyzing with a combination of variables different from those analyzed in the lecture).

Submission method:
• Create a report as a pdf file and submit it as an email attachment.
• Send the e-mail to "[email protected]".
• At the beginning of the report, do not forget to write your affiliation, student number, and name.
• The deadline for submission is June 7.

K.W.
