Microarray Center

Introduction to R and Statistical Data Analysis

PART II

Petr Nazarov

petr.nazarov@crp-sante.lu

22-11-2010

OUTLINE PART II

Descriptive statistics in R (8): sum, mean, median, sd, var, cor, etc.
Principal component analysis and clustering (9): PCA, k-means clustering, hierarchical clustering
Random numbers (10): random number generators, distributions
Statistical tests (11): t-test, Wilcoxon test, multiple test correction
ANOVA and linear regression (12): ANOVA, linear regression

Look for corresponding scripts at http://edu.sablab.net/r2010/scripts

8. DESCRIPTIVE STATISTICS IN R
8.1-8.3. Center, Variation, Dependency
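The code for this slide is not included in this extraction; a minimal sketch of the descriptive functions listed in the outline, applied to the built-in iris data, would be:

## descriptive statistics on the built-in iris data (a sketch, not the original slide code)
x = iris$Sepal.Length
y = iris$Petal.Length
sum(x)       # sum of the values
mean(x)      # center: arithmetic mean
median(x)    # center: median
var(x)       # variation: variance
sd(x)        # variation: standard deviation
cor(x, y)    # dependency: Pearson correlation between two variables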

9. PCA AND CLUSTERING
9.1. Data from R.A. Fisher

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Sir Ronald Aylmer Fisher (1936) as an example of discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the geographic variation of Iris flowers in the Gaspé Peninsula. The dataset consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of sepal and petal, in centimeters. Based on the combination of the four features, Fisher developed a linear discriminant model to distinguish the species from each other.

[Photo: Iris setosa]

9. PCA AND CLUSTERING
9.1. Data Presentation

iris
str(iris)

## plot iris data
x11()
plot(iris[,-5])
plot(iris[,-5], col = iris[,5])

[Figure: pairs plot of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, colored by species]

9. PCA AND CLUSTERING
9.2. Principal Component Analysis (PCA)

Principal component analysis (PCA) is a vector space transform used to reduce multidimensional data sets to lower dimensions for analysis (e.g. 20,000 genes → 2 dimensions). It selects the coordinates along which the variation of the data is the biggest.

For simplicity, let us consider a two-parameter situation, both in terms of the data and the resulting PCA.

[Figure: the same data as a scatter plot in the "natural" coordinates (Variable 1 vs Variable 2) and in the PC coordinates (first vs second component)]

Instead of using the 2 "natural" parameters for the classification, we can use the first component alone!

9. PCA AND CLUSTERING
9.2. Data Transformation for PCA

Data = as.matrix(iris[,-5])
row.names(Data) = as.character(iris[,5])
classes = as.integer(iris[,5])

## plot data in 3d
library(scatterplot3d)
x11()
scatterplot3d(iris[,1], iris[,2], iris[,3],
              pch = 19, color = classes,
              main = "Iris",
              xlab = names(iris)[1],
              ylab = names(iris)[2],
              zlab = names(iris)[3])
legend(4, 7, levels(iris$Species), col = c(1,2,3), pch = 19)

[Figure: 3D scatter plot of Sepal.Length, Sepal.Width and Petal.Length, colored by species (setosa, versicolor, virginica)]

9. PCA AND CLUSTERING
9.2. PCA
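The PCA code itself is not in this extraction; the PC object used on the following slides can be produced, for example, with base R's prcomp (a sketch under that assumption):

## PCA on the iris measurement matrix (a sketch; the original slide code may differ)
PC = prcomp(Data)           # principal components of the 150 x 4 matrix Data
summary(PC)                 # proportion of variance explained by each component
x11()
plot(PC$x[,1], PC$x[,2],    # scores of the samples on the first two components
     col = classes, pch = 19, xlab = "PC1", ylab = "PC2")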

9. PCA AND CLUSTERING
9.3. k-Means Clustering

k-Means Clustering
k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

1) k initial "means" (in this case k = 3) are randomly selected from the data set (shown in color).
2) k clusters are created by associating every observation with the nearest mean.
3) The centroid of each of the k clusters becomes the new means.
4) Steps 2 and 3 are repeated until convergence has been reached.
http://wikipedia.org

9. PCA AND CLUSTERING
9.3. k-Means Clustering

clusters = kmeans(x = Data, centers = 3, nstart = 10)$cluster
x11()
plot(PC$x[,1], PC$x[,2], col = classes, pch = clusters)
legend(2, 1.4, levels(iris$Species), col = c(1,2,3), pch = 19)
legend(-2.5, 1.4, c("c1","c2","c3"), col = 4, pch = c(1,2,3))

[Figure: PCA scores PC$x[,1] vs PC$x[,2]; color shows the true species (setosa, versicolor, virginica), plotting symbol shows the k-means cluster (c1, c2, c3)]

9. PCA AND CLUSTERING
9.4. Hierarchical Clustering

Hierarchical Clustering
Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations. Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively merges clusters together; or divisive, in which one starts at the root and recursively splits the clusters.
[Figure: elements and the corresponding dendrogram, with agglomerative (bottom-up) and divisive (top-down) directions; distance: Euclidean]
http://wikipedia.org
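The next slide builds a heatmap, which clusters rows and columns internally; for the dendrogram itself, a minimal hclust sketch (mine, not from the slides) could look like this:

## agglomerative hierarchical clustering of the iris measurements (a sketch)
d  = dist(Data, method = "euclidean")   # pairwise Euclidean distances
hc = hclust(d)                          # agglomerative clustering (complete linkage by default)
x11()
plot(hc, labels = FALSE)                # dendrogram
groups = cutree(hc, k = 3)              # cut the tree into 3 clusters
table(groups, iris$Species)             # compare clusters with the true species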

9. PCA AND CLUSTERING
9.4. Hierarchical Clustering

## use heatmap
heatmap(Data)

## use heatmap with colors
color = character(length(classes))
color[classes == 1] = "black"
color[classes == 2] = "red"
color[classes == 3] = "green"
heatmap(Data, RowSideColors = color)

[Figure: heatmap of the iris data with row-side colors marking Iris setosa, Iris versicolor and Iris virginica]

9. PCA AND CLUSTERING
9.5. Example: Task 8a

http://edu.sablab.net/data/txt/all_data.txt

Acute lymphoblastic leukemia (ALL) is a form of leukemia, or cancer of the white blood cells, characterized by excess lymphoblasts. all_data.xls contains the results of full-transcript profiling for ALL patients and healthy donors using Affymetrix microarrays. The data were downloaded from the ArrayExpress repository and normalized. The expression values in the table are in log2 scale.
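The task text itself is not in this extraction; a possible starting point, assuming a tab-separated file with a header row, is:

## load the ALL data set (a sketch; the exact file layout is an assumption)
all = read.table("http://edu.sablab.net/data/txt/all_data.txt",
                 header = TRUE, sep = "\t")
str(all)   # inspect the structure before applying PCA / clustering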

10. RANDOM NUMBERS AND DISTRIBUTIONS
See Source Code
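The source code is not part of this extraction; a minimal sketch of the topics listed in the outline (random number generators and distributions) is:

## random numbers and distributions (a sketch, not the original course script)
set.seed(123)                       # make the draws reproducible
x = rnorm(1000, mean = 0, sd = 1)   # 1000 normally distributed random numbers
u = runif(1000)                     # 1000 uniform random numbers on [0,1]
hist(x, breaks = 30)                # empirical distribution of the sample
dnorm(0)       # density of N(0,1) at 0
pnorm(1.96)    # cumulative probability P(X <= 1.96)
qnorm(0.975)   # quantile with 97.5% of the probability below it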

11. STATISTICAL TESTS
See Source Code
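Again, the source code is not in this extraction; a minimal sketch of the tests named in the outline (t-test, Wilcoxon test, multiple test correction) is:

## statistical tests (a sketch, not the original course script)
x = rnorm(20, mean = 0)
y = rnorm(20, mean = 1)
t.test(x, y)                    # two-sample t-test
wilcox.test(x, y)               # Wilcoxon (Mann-Whitney) rank-sum test
p = c(0.001, 0.01, 0.02, 0.04, 0.2)
p.adjust(p, method = "fdr")     # multiple testing correction (Benjamini-Hochberg FDR)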

12. ANOVA and LINEAR REGRESSION
12.1. Why ANOVA?

Means for more than 2 populations
We have measurements for 5 conditions. Are the means for these conditions equal?

If we used pairwise comparisons instead, what would be the probability of getting an error?
Number of comparisons: C(5,2) = 5! / (2!·3!) = 10
Probability of an error: 1 − 0.95^10 ≈ 0.4
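These numbers can be checked directly in R (this snippet is mine, not from the slides):

choose(5, 2)    # 10 pairwise comparisons among 5 conditions
1 - 0.95^10     # ~0.40 probability of at least one false positive at alpha = 0.05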

Validation of the effects
We assume that we have several factors affecting our data. Which factors are most significant? Which can be neglected?

ANOVA example from Partek™: http://easylink.playstream.com/affymetrix/ambsymposium/partek_08.wvx

12. ANOVA and LINEAR REGRESSION
12.1. Meaning of ANOVA

H0: µ1 = µ2 = µ3
Ha: not all 3 means are equal

[Figure: depression level measured repeatedly in three locations (FL, NY, NC), with group means m1, m2, m3]

12. ANOVA and LINEAR REGRESSION
12.1. ANOVA in R: Fast and Simple

salaries.txt
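The slide's own code is not in this extraction; a minimal one-way ANOVA sketch, in which the column names salary and city are assumptions about salaries.txt, would be:

## one-way ANOVA (a sketch; column names are assumed, not taken from the slide)
salaries = read.table("salaries.txt", header = TRUE, sep = "\t")
res = aov(salary ~ city, data = salaries)   # do the mean salaries differ between cities?
summary(res)                                # F statistic and p-value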

12. ANOVA and LINEAR REGRESSION
12.2. Regression Model and Regression Line

Regression model
The equation describing how y is related to x and an error term; in simple linear regression, the regression model is y = β0 + β1x + ε

Regression equation
The equation that describes how the mean or expected value of the dependent variable is related to the independent variable; in simple linear regression, E(y) = β0 + β1x

Model for a simple linear regression: y(x) = β1x + β0 + ε

[Figure: scatter plot of number of cells vs temperature]

12. ANOVA and LINEAR REGRESSION
12.2. Comparison of ANOVA and Linear Regression

[Figure: the ANOVA example (depression level by location FL, NY, NC, with means m1, m2, m3) shown next to the regression example (number of cells vs temperature)]

ANOVA: SST = SSTR + SSE        Regression: SST = SSR + SSE
(SST: total sum of squares; SSTR: treatment sum of squares; SSR: regression sum of squares; SSE: error sum of squares)

12. ANOVA and LINEAR REGRESSION
12.2. Linear Regression in R

cells.txt

[Figure: scatter plot of y vs x from cells.txt]
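The regression code itself is not in this extraction; a minimal sketch using the x and y column names suggested by the plot axes is:

## simple linear regression on cells.txt (a sketch; column names taken from the plot axes)
cells = read.table("cells.txt", header = TRUE, sep = "\t")
fit = lm(y ~ x, data = cells)     # fit y = b0 + b1*x
summary(fit)                      # coefficients, R-squared, p-values
plot(cells$x, cells$y)
abline(fit, col = "red")          # add the fitted regression line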

REGRESSION ANALYSIS
12.3. Solution of Task 12

leukemia.txt

[Figure: data1$Survival vs data1$WBC from leukemia.txt]

Thank you for your attention

Questions?
