Microarray Center
IntroductionIntroduction toto RR andand StatisticalStatistical DataData AnalysisAnalysis
PARTPART IIII
Petr Nazarov
petr.nazarov@crp -sante.lu
22-11-2010
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu OUTLINE PART II
Descriptive statistics in R (8) sum, mean, median, sd, var, cor, etc . Principle component analysis and clustering (9) PCA, k-means clustering, hierarchical clustering Random numbers (10) random number generators, distributions Statistical tests (11) t-test, Wilcoxon test, multiple test correction. ANOVA and Linear regression (12) ANOVA, linear regression
LookLook for for corresponding corresponding scripts scripts at at http://edu.sablab.net/r2010/scriptshttp://edu.sablab.net/r2010/scripts
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 8. DESCRIPTIVE STATISTICS IN R 8.1-8.3. Center, Variation, Dependency
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.1. Iris Data from R.A.Fisher
TheThe IrisIris flowerflower datadata setset oror Fisher'sFisher's IrisIris datadata setset isis aa multivariatemultivariate datadata setset introducedintroduced byby SirSir RonaldRonald AylmerAylmer FisherFisher (1936)(1936) asas anan exampleexample ofof discridiscriminantminant analysis.analysis. ItIt isis sometimessometimes calledcalled Anderson'sAnderson's Iris Iris data data set set because because Edgar Edgar Anderson Anderson col collectedlected the the data data to to quantify quantify the the geographicgeographic variationvariation ofof IrisIris flowersflowers inin thethe GaspéGaspé Pe Peninsula.ninsula. TheThe datasetdataset consistsconsists ofof 5050 samplessamples fromfrom eacheach ofof thrthreeee speciesspecies ofof Iris Iris flowersflowers (Iris(Iris setosasetosa ,, IrisIris virginicavirginica andand IrisIris versicolorversicolor ).). FourFour featuresfeatures werewere measuredmeasured fromfrom eacheach sample,sample, ththeyey areare thethe lengthlength andand thethe widthwidth ofof sepalsepal andand petal,petal, inin cencentimeters.timeters. BasedBased onon thethe combinationcombination ofof thethe fourfour features,features, FisherFisher developeddeveloped aa linearlinear discriminadiscriminantnt modelmodel toto distinguishdistinguish thethe speciesspecies fromfrom eacheach other.other.
Iris setosa Iris versicolor Iris virginica
LuciLinx: Luxembourg Bioinformatics Network http://wikipedia.prgwww.lucilinx.lu 9. PCA AND CLUSTERING 9.1. Data Presentation
iris str (iris) 2.0 3.0 4.0 0.5 1.5 2.5 ## plot iris data x11 () Sepal.Length plot (iris[,-5]) plot (iris[,-5], 4.5 6.0 7.5 col = iris[,5]) Sepal.Width 2.0 3.0 4.0
Petal.Length 1 3 5 7
Petal.Width 0.5 1.5 2.5
4.5 5.5 6.5 7.5 1 2 3 4 5 6 7
http://urbanext.illinois.edu/gpe/case4/c4facts1a.html
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.2 Principle Component Analysis (PCA)
PrincipalPrincipal component component analysis analysis (PCA) (PCA) isis a a vector vector space space transform transform used used to to reduce reduce multidim multidimensionalensional 20000 genes → 2 dimensions datadata sets sets to to lower lower dimensions dimensions for for analysis. analysis. It It sele selectscts the the 2 dimensions coordinatescoordinates along along which which the the variation variation of of the the data data i sis bigger bigger..
For the simplicity let us consider 2 parametric situation both i n terms of data and resulting PCA.
Scatter plot in Scatter plot in PC “natural ” coordinates Variable 2 Variable Variable 2 Variable Second component Second Second component Second
Variable 1 First component
Instead of using 2 “natural ” parameters for the classification, we can use the first compone nt!
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.2. Data Transformation for PCA
Data = as.matrix (iris[,-5]) row.names (Data) = as.character (iris[,5]) classes = as.integer (iris[,5])
## plot data in 3d Iris library (scatterplot3d) x11 () scatterplot3d (iris[,1],iris[,2],iris[,3], pch=19,color=classes, main = "Iris", setosa xlab = names (iris)[1], versicolor ylab = names (iris)[2], virginica zlab = names (iris)[3]) legend (4,7, levels (iris$Species), col=c(1,2,3),pch=19)
Petal.Length 4.5 4.0
3.5 Sepal.Width 3.0 2.5
1 2 3 4 5 6 7 2.0 4 5 6 7 8
Sepal.Length
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.2. PCA
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.3. k-Means Clustering
kk-Means-Means Clustering Clustering kk-means-means clustering clustering is is a a method method of of cluster cluster analysis analysis wh whichich aims aims to to p partitionartition n n observations observations intointo k k clusters clusters in in which which each each observation observation belongs belongs t oto the the cluster clusterwith with the the nearest nearest mean. mean.
1) k initial "means" (in 4) Steps 2 and 3 are this case k=3) are repeated until 2) k clusters are created randomly selected from convergence has been by associating every the data set (shown in reached. color). observation with the 3) The centroid of each nearest mean. of the k clusters becomes the new means. http://wikipedia.org
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.3. k-Means Clustering
clusters = kmeans (x=Data,centers=3,nstart=10)$cluster x11 () plot (PC$x[,1],PC$x[,2],col = classes,pch=clusters) legend (2,1.4, levels (iris$Species),col=c(1,2,3),pch=19) legend (-2.5,1.4,c("c1","c2","c3"),col=4,pch=c(1,2,3))
c1 setosa c2 versicolor c3 virginica PC$x[, 2] PC$x[, -1.0 -0.5 0.0 0.5 1.0
-3 -2 -1 0 1 2 3 4
PC$x[, 1] LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.4. Hierarchical Clustering
HierarchicalHierarchical Clustering Clustering HierarchicalHierarchical clustering clustering creates creates a a hierarchy hierarchy of of clus clustersters which which ma mayy be be represented represented in in a a tree tree structurestructure called called a a dendrogram. dendrogram. TheThe root root of of the the tree tree consists consists of of a a single single cluster cluster c containingontaining allall observations, observations, and and the the leaves leaves correspond correspond to to indi individualvidual observ observations.ations. AlgorithmsAlgorithms for for hierarchical hierarchical clustering clustering are are generall generallyy either either agglomerative agglomerative , ,in in which which one one startsstarts at at the the leaves leaves and and successively successively merges merges cluste clustersrs together; together; or or divisive divisive , ,in in which which one one startsstarts at at the the root root and and recursively recursively splits splits the the clust clusters.ers. Elements Divisive Agglomerative Dendrogram Distance : Euclidean http://wikipedia.org
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.4. Hierarchical Clustering
## use heatmap heatmap (data)
## use heatmap with colors color = character (length (classes)) color[classes == 1] = "black" color[classes == 2] = "red" color[classes == 3] = "green" heatmap (data,RowSideColors=color)
Iris setosa Iris versicolor Iris virginica
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 9. PCA AND CLUSTERING 9.5. Example: Task 8a
http://edu.sablab.net/data/txt/all_data.txthttp://edu.sablab.net/data/txt/all_data.txt
Acute lymphoblastic leukemia (ALL), is a form of leukemia, or ca ncer of the white blood cells characterized by excess lymphoblasts. all_data.xls contains the results of full -trancript profiling for ALL patients and healthy donors using Affymetrix microarrays. The data were downloaded from ArrayExpress repository and normalized. The expression values in the table are in log 2 scale.
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 10. RANDOM NUMBERS AND DISTRIBUTIONS See Source Code
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 11. STATISTICAL TESTS See Source Code
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 12. ANOVA and LINEAR REGRESSION 12.1. Why ANOVA?
If we would use pairwise comparisons, MeansMeans for for more more than than 2 2 populations populations what will be the probability of getting error? WeWe have have measurements measurements for for 5 5 !5 Number of comparisons: C 5 = = 10 conditions.conditions. Are Are the the means means for for these these 2 !3!2 conditionsconditions equal? equal? Probability of an error : 1 –(0.95) 10 = 0.4
ValidationValidation of of the the effects effects WeWe assume assume that that we we have have several several factorsfactors affecting affecting our our data. data. Which Which factorsfactors are are most most significant? significant? Which Which cancan be be neglected? neglected?
ANOVA example from Partek™ http://easylink.playstream.com/affymetrix/ambsymposium/partek_08.wvx LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 12. ANOVA and LINEAR REGRESSION 12.1. Meaning of ANOVA
H : µµµ = µµµ = µµµ H00: µµµ11= µµµ22= µµµ33 H : not all 3 means are equal Haa: not all 3 means are equal
14
12
10 m2 8 m3
6 m1 Depression level Depression 4
2
0 FL FL FL FL FL FL FL NY NY NY NY NY NY NY NC NC NC NC NC NC Measures
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 12. ANOVA and LINEAR REGRESSION 12.1. ANOVA in R: Fast and Simple
salaries.txtsalaries.txt
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 12. ANOVA and LINEAR REGRESSION 12.2. Regression Model and Regression Line
RegressionRegression model model TheThe equation equation describing describing how how y y is is related related to to x x and and a ann error error term; term; in in simple simple linear linear regression, the regression model is y = βββ0 + βββ1x + εεε regression, the regression model is y = βββ0 + βββ1x + εεε
RegressionRegression equation equation TheThe equation equation that that describes describes how how the the mean mean or or expecte expectedd value value of of the the dependent dependent variablevariable is is related related to to the the independent independent variable; variable; in in simple simple linear linear regression, regression, E( y) = βββ0 + βββ1x E( y) = βββ0 + βββ1x
450 Model for a simple linear regression: 400
350 300 y(x) = β1x + β0 + ε 250
200
Number of cells of Number 150
100
50
0 15 25 35 45 Temperature
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 12. ANOVA and LINEAR REGRESSION 12.2. Comparison of ANOVA and Linear Regression
450 14 400 12 350
10 300 m2 8 250 m3 200 6 m1 Number of cells of Number
Depression level Depression 150 4 100 2 50
0 0 FL FL FL FL FL FL FL NY NY NY NY NY NY NY NC NC NC NC NC NC 15 20 25 30 35 40 45 Measures Temperature
SST = SSTR + SSE SST = SSR + SSE
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu 12. ANOVA and LINEAR REGRESSION 12.2. Linear Regression in R
cells.txtcells.txt y 100 200 300 400
20 25 30 35 40
x
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu REGRESSION ANALYSIS 12.3. Solution of Task 12
leukemia.txtleukemia.txt Survival data1$Survival 0 50 100 150
3.0 3.5 4.0 4.5 5.0
data1$WBC
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu Thank you for your attention
Questions ?
LuciLinx: Luxembourg Bioinformatics Network www.lucilinx.lu