Introduction to R and Statistical Data Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Microarray Center IntroductionIntroduction toto RR andand StatisticalStatistical DataData AnalysisAnalysis PARTPART IIII Petr Nazarov petr.nazarov@crp -sante.lu 22-11-2010 LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu OUTLINE PART II Descriptive statistics in R (8) sum, mean, median, sd, var, cor, etc . Principle component analysis and clustering (9) PCA, k-means clustering, hierarchical clustering Random numbers (10) random number generators, distributions Statistical tests (11) t-test, Wilcoxon test, multiple test correction. ANOVA and Linear regression (12) ANOVA, linear regression LookLook for for corresponding corresponding scripts scripts at at http://edu.sablab.net/r2010/scriptshttp://edu.sablab.net/r2010/scripts LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 8. DESCRIPTIVE STATISTICS IN R 8.1-8.3. Center, Variation, Dependency LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.1. Iris Data from R.A.Fisher TheThe IrisIris flowerflower datadata setset oror Fisher'sFisher's IrisIris datadata setset isis aa multivariatemultivariate datadata setset introducedintroduced byby SirSir RonaldRonald AylmerAylmer FisherFisher (1936)(1936) asas anan exampleexample ofof discridiscriminantminant analysis.analysis. ItIt isis sometimessometimes calledcalled Anderson'sAnderson's IrisIris datadata setset becausebecause EdgarEdgar AndersonAnderson colcollectedlected thethe datadata toto quantifyquantify thethe geographicgeographic variationvariation ofof IrisIris flowersflowers inin thethe GaspéGaspé Pe Peninsula.ninsula. TheThe datasetdataset consistsconsists ofof 5050 samplessamples fromfrom eacheach ofof thrthreeee speciesspecies ofof Iris Iris flowersflowers (Iris(Iris setosasetosa ,, IrisIris virginicavirginica andand IrisIris versicolorversicolor ).). FourFour featuresfeatures werewere measuredmeasured fromfrom eacheach sample,sample, ththeyey areare thethe lengthlength andand thethe widthwidth ofof sepalsepal andand petal,petal, inin cencentimeters.timeters. BasedBased onon thethe combinationcombination ofof thethe fourfour features,features, FisherFisher developeddeveloped aa linearlinear discriminadiscriminantnt modelmodel toto distinguishdistinguish thethe speciesspecies fromfrom eacheach other.other. Iris setosa Iris versicolor Iris virginica LuciLinx: Luxembourg Bioinformatics Network http://wikipedia.prgwww.lucilinx.lu 9. PCA AND CLUSTERING 9.1. Data Presentation iris str (iris) 2.0 3.0 4.0 0.5 1.5 2.5 ## plot iris data x11 () Sepal.Length plot (iris[,-5]) plot (iris[,-5], 4.5 6.0 7.5 col = iris[,5]) Sepal.Width 2.0 3.0 4.0 Petal.Length 1 3 5 7 Petal.Width 0.5 1.5 2.5 4.5 5.5 6.5 7.5 1 2 3 4 5 6 7 http://urbanext.illinois.edu/gpe/case4/c4facts1a.html LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.2 Principle Component Analysis (PCA) PrincipalPrincipal component component analysis analysis (PCA) (PCA) isis a a vector vector space space transform transform used used to to reduce reduce multidim multidimensionalensional 20000 genes → 2 dimensions datadata sets sets to to lower lower dimensions dimensions for for analysis. analysis. It It sele selectscts the the 2 dimensions coordinatescoordinates along along which which the the variation variation of of the the data data i sis bigger bigger.. For the simplicity let us consider 2 parametric situation both i n terms of data and resulting PCA. Scatter plot in Scatter plot in PC “natural ” coordinates Variable 2 Variable Variable 2 Variable Second component Second Second component Second Variable 1 First component Instead of using 2 “natural ” parameters for the classification, we can use the first compone nt! LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.2. Data Transformation for PCA Data = as.matrix (iris[,-5]) row.names (Data) = as.character (iris[,5]) classes = as.integer (iris[,5]) ## plot data in 3d Iris library (scatterplot3d) x11 () scatterplot3d (iris[,1],iris[,2],iris[,3], pch=19,color=classes, main = "Iris", setosa xlab = names (iris)[1], versicolor ylab = names (iris)[2], virginica zlab = names (iris)[3]) legend (4,7, levels (iris$Species), col=c(1,2,3),pch=19) Petal.Length 4.5 4.0 3.5 Sepal.Width 3.0 2.5 1 2 3 4 5 6 7 2.0 4 5 6 7 8 Sepal.Length LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.2. PCA LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.3. k-Means Clustering kk-Means-Means Clustering Clustering kk-means-means clustering clustering is is a a method method of of cluster cluster analysis analysis wh whichich aims aims to to p partitionartition n n observations observations intointo k k clusters clusters in in which which each each observation observation belongs belongs t oto the the cluster clusterwith with the the nearest nearest mean. mean. 1) k initial "means" (in 4) Steps 2 and 3 are this case k=3) are repeated until 2) k clusters are created randomly selected from convergence has been by associating every the data set (shown in reached. color). observation with the 3) The centroid of each nearest mean. of the k clusters becomes the new means. http://wikipedia.org LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.3. k-Means Clustering clusters = kmeans (x=Data,centers=3,nstart=10)$cluster x11 () plot (PC$x[,1],PC$x[,2],col = classes,pch=clusters) legend (2,1.4, levels (iris$Species),col=c(1,2,3),pch=19) legend (-2.5,1.4,c("c1","c2","c3"),col=4,pch=c(1,2,3)) c1 setosa c2 versicolor c3 virginica PC$x[, 2] PC$x[, -1.0 -0.5 0.0 0.5 1.0 -3 -2 -1 0 1 2 3 4 PC$x[, 1] LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.4. Hierarchical Clustering HierarchicalHierarchical Clustering Clustering HierarchicalHierarchical clustering clustering creates creates a a hierarchy hierarchy of of clus clustersters which which ma mayy be be represented represented in in a a tree tree structurestructure called called a a dendrogram. dendrogram. TheThe root root of of the the tree tree consists consists of of a a single single cluster cluster c containingontaining allall observations, observations, and and the the leaves leaves correspond correspond to to indi individualvidual observ observations.ations. AlgorithmsAlgorithms for for hierarchical hierarchical clustering clustering are are generall generallyy either either agglomerative agglomerative , ,in in which which one one startsstarts at at the the leaves leaves and and successively successively merges merges cluste clustersrs together; together; or or divisive divisive , ,in in which which one one startsstarts at at the the root root and and recursively recursively splits splits the the clust clusters.ers. Elements Divisive Agglomerative Dendrogram Distance : Euclidean http://wikipedia.org LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.4. Hierarchical Clustering ## use heatmap heatmap (data) ## use heatmap with colors color = character (length (classes)) color[classes == 1] = "black" color[classes == 2] = "red" color[classes == 3] = "green" heatmap (data,RowSideColors=color) Iris setosa Iris versicolor Iris virginica LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.5. Example: Task 8a http://edu.sablab.net/data/txt/all_data.txthttp://edu.sablab.net/data/txt/all_data.txt Acute lymphoblastic leukemia (ALL), is a form of leukemia, or ca ncer of the white blood cells characterized by excess lymphoblasts. all_data.xls contains the results of full -trancript profiling for ALL patients and healthy donors using Affymetrix microarrays. The data were downloaded from ArrayExpress repository and normalized. The expression values in the table are in log 2 scale. LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 10. RANDOM NUMBERS AND DISTRIBUTIONS See Source Code LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 11. STATISTICAL TESTS See Source Code LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 12. ANOVA and LINEAR REGRESSION 12.1. Why ANOVA? If we would use pairwise comparisons, MeansMeans for for more more than than 2 2 populations populations what will be the probability of getting error? WeWe have have measurements measurements for for 5 5 !5 Number of comparisons: C 5 = = 10 conditions.conditions. Are Are the the means means for for these these 2 !3!2 conditionsconditions equal? equal? Probability of an error : 1 –(0.95) 10 = 0.4 ValidationValidation of of the the effects effects WeWe assume assume that that we we have have several several factorsfactors affecting affecting our our data. data. Which Which factorsfactors are are most most significant? significant? Which Which cancan be be neglected? neglected? ANOVA example from Partek™ http://easylink.playstream.com/affymetrix/ambsymposium/partek_08.wvx LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 12. ANOVA and LINEAR REGRESSION 12.1. Meaning of ANOVA H : µµµ = µµµ = µµµ H00: µµµ11= µµµ22= µµµ33 H : not all 3 means are equal Haa: not all 3 means are equal 14 12 10 m2 8 m3 6 m1 Depression level Depression 4 2 0 FL FL FL FL FL FL FL NY NY NY NY NY NY NY NC NC NC NC NC NC Measures LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 12. ANOVA and LINEAR REGRESSION 12.1. ANOVA in R: Fast and Simple salaries.txtsalaries.txt LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 12. ANOVA and LINEAR REGRESSION 12.2. Regression Model and Regression Line RegressionRegression model model