Descriptive Statistics and ANOVA
Basic statistics
Descriptive statistics and ANOVA

Thomas Alexander Gerds
Department of Biostatistics, University of Copenhagen

Contents
- Data are variable
- Statistical uncertainty
- Summary and display of data
- Confidence intervals
- ANOVA

Data are variable

A statistician is used to receiving a value, such as 3.17%, together with an explanation, such as "this is the expression of 1-B6.DBA-GTM in mouse 12". The value from the next mouse in the list is 4.88%.

The two values can differ for several reasons:
- The measurement is difficult
- Data processing is done by humans
- Two mice have different genes
- They are exposed and treated differently

Decomposing variance

Variability of data is usually a composite of
- Measurement error, sampling scheme
- Random variation
- Genotype
- Exposure, lifestyle, environment
- Treatment

Statistical conclusions can often be obtained by explaining the sources of variation in the data.

Example 1

In the yeast experiment of Smith and Kruglyak (2008) [1], transcript levels were profiled in 6 replicates of the same strain, called 'RM', in glucose under controlled conditions.

[1] The article is available at http://biology.plosjournals.org

Figure: Sources of the variation of these 6 values
- Measurement error
- Random variation

In the same yeast experiment, Smith and Kruglyak (2008) also profiled 6 replicates of a different strain, called 'By', in glucose. The order in which the 12 samples were processed was randomized to minimize systematic experimental effects.

Figure: Sources of the variation of these 12 values
- Measurement error
- Study design/experimental environment
- Genotype

Furthermore, Smith and Kruglyak (2008) cultured 6 'RM' and 6 'By' replicates in ethanol. The order in which the 24 samples were processed was randomized to minimize systematic experimental effects.
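The idea of explaining sources of variation can be sketched numerically. The block below is only an illustration on simulated values, not the data of Smith and Kruglyak: the strain effect of 1.0, the noise SD of 0.3 and the baseline of 5 are assumptions made up for this sketch. It splits the total sum of squares of 12 hypothetical measurements into a part explained by strain and an unexplained remainder.

```r
set.seed(2008)  # arbitrary seed, only for reproducibility

# Hypothetical values: 6 replicates per strain, NOT the published data
strain <- rep(c("RM", "By"), each = 6)
strain_effect <- ifelse(strain == "By", 1.0, 0.0)        # assumed genotype effect
y <- 5 + strain_effect + rnorm(12, mean = 0, sd = 0.3)   # assumed noise level

grand_mean  <- mean(y)
group_means <- ave(y, strain)   # each value replaced by the mean of its strain

ss_total   <- sum((y - grand_mean)^2)             # all variation
ss_between <- sum((group_means - grand_mean)^2)   # variation explained by strain
ss_within  <- sum((y - group_means)^2)            # variation left unexplained

print(c(total = ss_total, between = ss_between, within = ss_within))
print(ss_between / ss_total)   # share of the variation attributed to strain
```

The total always equals the sum of the between-strain and within-strain parts; this identity is what an analysis of variance builds on.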
Sources of variation

Figure: Sources of variation
- Measurement error
- Experimental environment
- Genes
- Exposure, environmental factors

Example 2

Festing and Weigler in the Handbook of Laboratory Animal Science ... consider the results of an experiment using a completely randomized design... in which adult C57BL/6 mice were randomly allocated to one of four dose levels of a hormone compound. The uterus weight was measured after an appropriate time interval.

Figure: Example 2 (three plots)

Conclusions from the figures
- The uterus weight depends on the dose
- The variation of the data increases with increasing dose

Question: Why could these first conclusions be wrong?

Descriptive statistics

Descriptive statistics (summarizing data)

Categorical variables: count (%).

Continuous variables:
- raw values (if n is small)
- range (min, max)
- location: median (IQR = interquartile range)
- location: mean (SD)

Sample: Table 1 [2]

[2] Quality of life (QOL), supportive care, and spirituality in hematopoietic stem cell transplant (HSCT) patients. Sirilla & Overcash. Supportive Care in Cancer, October 2012.

R excursion: calculating descriptive statistics in groups

    library(Publish)      # provides the Diabetes data
    library(data.table)
    data(Diabetes)
    setDT(Diabetes)       # make data.table
    # mean and SD of age, median cholesterol, separately for each location
    Diabetes[, .(mean.age = mean(age),
                 sd.age = sd(age),
                 median.chol = median(chol, na.rm = TRUE)),
             by = location]

         location mean.age   sd.age median.chol
    1: Buckingham 47.07500 16.74849         202
    2:     Louisa 46.63054 15.90929         206

R excursion: making table one

    library(Publish)
    data(Diabetes)
    # Q(chol) requests median and IQR instead of mean and SD
    tab1 <- summary(utable(location ~ gender + age + Q(chol) + BMI, data = Diabetes))
    tab1

    Variable  Level         Buckingham (n=200)    Louisa (n=203)        Total (n=403)         p-value
    gender    female        114 (57.0)            120 (59.1)            234 (58.1)
              male          86 (43.0)             83 (40.9)             169 (41.9)            0.7422
    age       mean (sd)     47.1 (16.7)           46.6 (15.9)           46.9 (16.3)           0.7847
    chol      median [iqr]  202.0 [174.0, 231.0]  206.0 [183.5, 229.0]  204.0 [179.0, 230.0]  0.2017
              missing       1                     0                     1
    BMI       mean (sd)     28.6 (7.0)            29.0 (6.2)            28.8 (6.6)            0.5424
              missing       3                     3                     6

R excursion: exporting a table

Method 1: Write table to file

    write.csv(tab1, file = "tables/tab1.csv")

Then open file tab1.csv with Excel.

Method 2: Use kable [3] and include in dynamic report [4]

    ```{r, results='asis'}
    knitr::kable(tab1)
    ```

[3] https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
[4] https://www.rdocumentation.org/packages/knitr/versions/1.17/topics/kable

Dynamite plots are deprecated (DO NOT USE)

Exercise
- Read and discuss the documentation of why dynamite plots are not good: http://biostat.mc.vanderbilt.edu/wiki/Main/DynamitePlots

Dot plots are appreciated when n is small

Figure: Group A (n=3), group B (n=3, one replicate), group C (n=4). Vertical axis: measurement scale.

Box plots are appreciated when n is large

Figure: Group A (n=300), group B (n=400), group C (n=400). Vertical axis: measurement scale.

Making boxplots with ggplot2

    library(ggplot2)
    # cholesterol by location, one box per location
    bp <- ggplot(Diabetes, aes(location, chol))
    bp <- bp + geom_boxplot(aes(fill = location))
    print(bp)

Find the ggplot2 cheat sheet via the Help menu in RStudio.

Figure: Box plots of chol by location (Buckingham, Louisa), filled by location.

    # one panel per gender
    bp + facet_grid(. ~ gender)

Figure: Box plots of chol by location, in separate panels for female and male.

Making dotplots with ggplot2

    # 'mice' is the data set used in the lecture (columns Dose and BodyWeight);
    # it is not defined on the slide
    dp <- ggplot(mice, aes(x = Dose, fill = Dose, y = BodyWeight))
    dp <- dp + geom_dotplot(binaxis = "y")
    print(dp)

Figure: Dot plot of BodyWeight by Dose (0, 1, 2.5, 7.5, 50), filled by Dose.

R excursion: exporting a figure

Write figure to pdf (vector graphics, also eps [5], infinite zoom)

    ggsave("dotplot-mice-bodyweight.pdf", plot = dp)
    # or
    pdf("figures/dotplot-mice-bodyweight.pdf")
    print(dp)   # print explicitly so the plot is also drawn when the script is sourced
    dev.off()

Write figure to jpg (image file, also tiff, gif, etc.)

    jpeg("figures/dotplot-mice-bodyweight.jpg")
    print(dp)
    dev.off()

[5] postscript("figures/dotplot-mice-bodyweight.eps")

Quantifying variability

A sample of data $X_1, \dots, X_N$ has a standard deviation (SD); it is defined by

\[
\mathrm{SD} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2}, \qquad \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i
\]

SD measures the variability of the measurements in the sample. The variance of the sample is defined as $\mathrm{SD}^2$. The term 'standard deviation' relates to the normal distribution.

Normal distribution

What is so special about the normal distribution?
- It is symmetric around the mean, thus the mean is equal to the median.
- The mean is the most likely value. Mean and standard deviation describe the full distribution.
- The distribution of measurements, like height, distance, volume, is often normal.
- The distribution of statistics, like mean, proportion, mean difference, etc., is very often approximately normal.
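The last point, that statistics such as the mean tend to be approximately normal even when the data are not, can be checked with a small simulation. This is a minimal sketch under assumptions chosen only for illustration: the raw data are exponential with mean 1 (strongly right-skewed), each hypothetical study has 30 observations, and `skewness` is a helper defined here, not a base R function.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

n_studies <- 5000   # number of hypothetical studies (assumption)
n         <- 30     # observations per study (assumption)

# Skewed raw data: exponential with mean 1
raw <- rexp(n_studies, rate = 1)

# Mean of each hypothetical study
means <- replicate(n_studies, mean(rexp(n, rate = 1)))

# Simple moment-based skewness: 0 for a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw   <- skewness(raw)     # clearly positive: the data are right-skewed
skew_means <- skewness(means)   # much closer to 0: the means are nearly symmetric

print(c(raw = skew_raw, means = skew_means))
```

Calling `hist(raw)` and `hist(means)` afterwards shows the same contrast visually.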
Quantifying statistical uncertainty

For statistical inference and conclusion making, via p-values and confidence intervals, it is crucial to quantify the variability of the statistic (mean, proportion, mean difference, risk ratio, etc.):

The standard error is the standard deviation of the statistic.

The standard error is a measure of the statistical uncertainty.

Illustration

Figure: Population; panels labelled Mean = 2.13, Mean = 3.81, Mean = 4.01.

Quantifying statistical uncertainty

Example: We want to estimate the unknown mean uterus weight for untreated mice. The standard error of the mean is defined as

\[
\mathrm{SE} = \mathrm{SD} / \sqrt{N}
\]

where N is the sample size.

Based on N = 4 values, 0.012, 0.0088, 0.0069, 0.009:
- mean: $\hat{\beta} = 0.0092$
- standard deviation: SD = 0.002108
- empirical variance: var = 0.0000044
- standard error: SE = 0.002108/2 = 0.001054

The standard error is the standard deviation of the mean

Figure: Mean uterus weight (g) in our study and in hypothetical studies 1, 47 and 100, shown against the unknown true average uterus weight.

The (hypothetical) mean values are approximately normally distributed, even if the data are not normally distributed!

Variance vs statistical uncertainty

"The terms standard error and standard deviation are often confused. The contrast between these two terms reflects the important distinction between data description and inference, one that all researchers should appreciate." [6]

Rules:
- The higher the unexplained variability of the data, the higher the statistical uncertainty.
- The higher the sample size, the lower the statistical uncertainty.

[6] Altman & Bland, Statistics Notes, BMJ, 2005; Nagele P, Br J Anaesthesiol 2003;90:514-6

Confidence intervals

Constructing confidence limits

A 95% confidence interval for the parameter $\beta$ is

\[
[\hat{\beta} - 1.96 \cdot \mathrm{SE},\; \hat{\beta} + 1.96 \cdot \mathrm{SE}]
\]

Example: a confidence interval for the mean uterus weight of untreated mice is given by

\[
95\%\,\mathrm{CI} = [0.0092 - 1.96 \cdot 0.001054,\; 0.0092 + 1.96 \cdot 0.001054] = [0.007,\; 0.011].
\]

The standard error SE measures the variability of the mean $\hat{\beta}$ around the (unknown) population value $\beta$, under the assumption that the model is correctly specified.
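The arithmetic of the uterus weight example can be retraced in a few lines of R. This is a sketch that follows the formulas of these slides step by step, using the four values quoted in the example; the variable names are chosen here for illustration.

```r
# The four uterus weights (g) quoted in the example
x <- c(0.012, 0.0088, 0.0069, 0.009)
n <- length(x)

x_bar <- sum(x) / n                           # sample mean
sd_x  <- sqrt(sum((x - x_bar)^2) / (n - 1))   # SD with the N - 1 denominator
se    <- sd_x / sqrt(n)                       # standard error of the mean

# 95% limits using the normal quantile 1.96
ci <- c(lower = x_bar - 1.96 * se, upper = x_bar + 1.96 * se)

# Cross-check the hand-computed SD against R's built-in sd()
stopifnot(isTRUE(all.equal(sd_x, sd(x))))

print(round(c(mean = x_bar, sd = sd_x, se = se), 6))
print(round(ci, 3))
```

This reproduces SD = 0.002108, SE = 0.001054 and the interval [0.007, 0.011]. With only four observations, some analysts would replace 1.96 by the t quantile `qt(0.975, df = n - 1)`, which gives a wider interval; that is a different choice from the formula used on these slides.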