Basic statistics
Descriptive statistics and ANOVA
Thomas Alexander Gerds
Department of Biostatistics, University of Copenhagen

Contents

▶ Data are variable
▶ Statistical uncertainty
▶ Summary and display of data
▶ Confidence intervals
▶ ANOVA

Data are variable
A statistician is used to receiving a value, such as
3.17 %,
together with an explanation, such as
"this is the expression of 1-B6.DBA-GTM in mouse 12".
The value from the next mouse in the list is 4.88% . . .

▶ The measurement is difficult
▶ Data processing is done by humans
▶ Two mice have different genes
▶ They are exposed . . . and treated differently

Decomposing variance
Variability of data is usually a composite of

▶ Measurement error, sampling scheme
▶ Random variation
▶ Genotype
▶ Exposure, life style, environment
▶ Treatment

Statistical conclusions can often be obtained by explaining the sources of variation in the data.

Example 1
In the yeast experiment of Smith and Kruglyak (2008)¹, transcript levels were profiled in 6 replicates of the same strain called 'RM' in glucose under controlled conditions.

¹ The article is available at http://biology.plosjournals.org

Example 1
Figure: Sources of the variation of these 6 values

▶ Measurement error
▶ Random variation

Example 1
In the same yeast experiment, Smith and Kruglyak (2008) also profiled 6 replicates of a different strain called 'By' in glucose. The order in which the 12 samples were processed was random to minimize a systematic experimental effect.

Example 1
Figure: Sources of the variation of these 12 values

▶ Measurement error
▶ Study design/experimental environment
▶ Genotype

Example 1
Furthermore, Smith and Kruglyak (2008) cultured 6 'RM' and 6 'By' replicates in ethanol. The order in which the 24 samples were processed was random to minimize a systematic experimental effect.

Sources of variation
Figure: Sources of variation

▶ Measurement error
▶ Experimental environment
▶ Genes
▶ Exposure, environmental factors

Example 2
Festing and Weigler in the Handbook of Laboratory Animal Science ...
. . . consider the results of an experiment using a completely randomized design...
. . . in which adult C57BL/6 mice were randomly allocated to one of four dose levels of a hormone compound.
The uterus weight was measured after an appropriate time interval.

Example 2
Figure: Example 2
Conclusions from the figures
▶ The uterus weight depends on the dose
▶ The variation of the data increases with increasing dose

Question: Why could these first conclusions be wrong?

Descriptive statistics

Descriptive statistics (summarizing data)
Categorical variables: count (%).
Continuous variables:

▶ raw values (if n is small)
▶ range (min, max)
▶ location: median (IQR = inter-quartile range)
▶ location: mean (SD)

Sample: Table 1
² Quality of life (QOL), supportive care, and spirituality in hematopoietic stem cell transplant (HSCT) patients. Sirilla & Overcash. Supportive Care in Cancer, October 2012.

R excursion: calculating descriptive statistics in groups

library(Publish)
library(data.table)
data(Diabetes)
setDT(Diabetes) ## make data.table
Diabetes[,.(mean.age=mean(age),
            sd.age=sd(age),
            median.chol=median(chol,na.rm=TRUE)),by=location]

     location mean.age   sd.age median.chol
1: Buckingham 47.07500 16.74849         202
2:     Louisa 46.63054 15.90929         206

R excursion: making table one

library(Publish)
data(Diabetes)
tab1 <- summary(utable(location~gender + age + Q(chol) + BMI,
                       data=Diabetes))
tab1
Variable  Level         Buckingham (n=200)    Louisa (n=203)        Total (n=403)         p-value
gender    female        114 (57.0)            120 (59.1)            234 (58.1)
          male          86 (43.0)             83 (40.9)             169 (41.9)            0.7422
age       mean (sd)     47.1 (16.7)           46.6 (15.9)           46.9 (16.3)           0.7847
chol      median [iqr]  202.0 [174.0, 231.0]  206.0 [183.5, 229.0]  204.0 [179.0, 230.0]  0.2017
          missing       1                     0                     1
BMI       mean (sd)     28.6 (7.0)            29.0 (6.2)            28.8 (6.6)            0.5424
          missing       3                     3                     6

R excursion: exporting a table

Method 1: Write table to file

write.csv(tab1,file="tables/tab1.csv")

Then open the file tab1.csv with Excel.

Method 2: Use kable³ and include in a dynamic report⁴

```{r,results='asis'}
knitr::kable(tab1)
```

³ https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
⁴ https://www.rdocumentation.org/packages/knitr/versions/1.17/topics/kable

Dynamite plots are deprecated (DO NOT USE)

Exercise
▶ Read and discuss the documentation of why dynamite plots are not good: http://biostat.mc.vanderbilt.edu/wiki/Main/DynamitePlots

Dot plots are appreciated when n is small

Figure: Group A (n=3), group B (n=3, one replicate), group C (n=4)

Box plots are appreciated when n is large
Figure: Group A (n=300), group B (n=400), group C (n=400)

Making boxplots with ggplot2
library(ggplot2)
bp <- ggplot(Diabetes, aes(location,chol))
bp <- bp + geom_boxplot(aes(fill=location))
print(bp)
Find the ggplot2 cheat sheet via the help menu in RStudio.

Making boxplots with ggplot2

Figure: Boxplots of chol by location (Buckingham, Louisa)

Making boxplots with ggplot2
bp + facet_grid(.~gender)

Figure: Boxplots of chol by location, faceted by gender (female, male)

Making dotplots with ggplot2

dp <- ggplot(mice,aes(x=Dose,fill=Dose,y=BodyWeight))
dp <- dp + geom_dotplot(binaxis="y")
print(dp)
Figure: Dotplots of BodyWeight by Dose (0, 1, 2.5, 7.5, 50)

R excursion: exporting a figure
Write figure to pdf (vector graphics with infinite zoom; also eps⁵):

ggsave(dp,file="dotplot-mice-bodyweight.pdf")
# or
pdf("figures/dotplot-mice-bodyweight.pdf")
dp
dev.off()

Write figure to jpg (image file; also tiff, gif, etc.):

jpeg("figures/dotplot-mice-bodyweight.jpg")
dp
dev.off()

⁵ postscript("figures/dotplot-mice-bodyweight.eps")

Quantifying variability
A sample of data X₁, …, X_N has a standard deviation (SD); it is defined by

SD = √( Σ(Xᵢ − X̄)² / (N − 1) ),   where X̄ = ΣXᵢ / N is the sample mean and the sums run over i = 1, …, N.

The SD measures the variability of the measurements in the sample.
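The definition can be checked directly in R; a minimal sketch with made-up values (the vector x is hypothetical):

```r
x <- c(3.17, 4.88, 2.95, 4.10)           # hypothetical expression values (%)
N <- length(x)
xbar <- sum(x)/N                         # sample mean
SD <- sqrt(sum((x - xbar)^2)/(N - 1))    # standard deviation by the formula
all.equal(SD, sd(x))                     # agrees with R's built-in sd()
all.equal(SD^2, var(x))                  # the sample variance is SD squared
```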
The variance of the sample is defined as SD². The term 'standard deviation' relates to the normal distribution.

Normal distribution

What is so special about the normal distribution?
▶ It is symmetric around the mean, thus the mean is equal to the median.
▶ The mean is the most likely value. Mean and standard deviation describe the full distribution.
▶ The distribution of measurements, like height, distance, or volume, is often normal.
▶ The distribution of statistics, like the mean, a proportion, or a mean difference, is very often approximately normal.
For statistical inference and conclusion making, via p-values and confidence intervals, it is crucial to quantify the variability of the statistic (mean, proportion, mean difference, risk ratio, etc.):
The standard error is the standard deviation of the statistic.
The standard error is a measure of the statistical uncertainty.

Illustration

Population:

Figure: Mean values of repeated samples drawn from the population: 2.13, 3.81, 4.01
Quantifying statistical uncertainty

Example: We want to estimate the unknown mean uterus weight for untreated mice. The standard error of the mean is defined as

SE = SD/√N

where N is the sample size.

Based on the N = 4 values 0.012, 0.0088, 0.0069, 0.009:

▶ mean: β̂ = 0.0091
▶ standard deviation: SD = 0.002108
▶ empirical variance: var = 0.0000044
▶ standard error: SE = 0.002108/2 = 0.001054
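These numbers can be reproduced in R:

```r
w <- c(0.012, 0.0088, 0.0069, 0.009)   # the four uterus weights (g)
mean(w)                                # 0.009175, rounded to 0.0091 above
sd(w)                                  # 0.002108
var(w)                                 # 0.0000044
sd(w)/sqrt(length(w))                  # standard error: 0.002108/2 = 0.001054
```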
The standard error is the standard deviation of the mean

Figure: The unknown true average uterus weight (g) and the mean uterus weights of our study and hypothetical studies 1, 47, and 100
The (hypothetical) mean values are approximately normally distributed, even if the data are not normally distributed! Variance vs statistical uncertainty
"The terms standard error and standard deviation are often confused. The contrast between these two terms reflects the important distinction between data description and inference, one that all researchers should appreciate."⁶
Rules:

▶ The higher the unexplained variability of the data, the higher the statistical uncertainty.
▶ The higher the sample size, the lower the statistical uncertainty.
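The second rule can be made concrete in R: quadrupling the sample size halves the standard error of the mean (the SD value here is an assumption for illustration):

```r
SD <- 0.002                                   # assumed standard deviation
sapply(c(4, 16, 64), function(N) SD/sqrt(N))  # 0.001, 0.0005, 0.00025
```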
⁶ Altman & Bland, Statistics Notes, BMJ, 2005; Nagele P, Br J Anaesth 2003; 90: 514-6.

Confidence intervals

Constructing confidence limits
A 95% confidence interval for the parameter β is
[β̂ − 1.96 · SE; β̂ + 1.96 · SE]

Example: a confidence interval for the mean uterus weight of untreated mice is given by

95% CI = [0.0091 − 1.96 · 0.001054; 0.0091 + 1.96 · 0.001054] = [0.007; 0.011].
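The interval can be recomputed in R from the four values above:

```r
w  <- c(0.012, 0.0088, 0.0069, 0.009)   # uterus weights of the untreated mice
m  <- mean(w)                           # 0.0091 (rounded)
se <- sd(w)/sqrt(length(w))             # 0.001054 (rounded)
round(c(lower = m - 1.96*se,
        upper = m + 1.96*se), 3)        # 0.007 and 0.011
```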
The standard error SE measures the variability of the mean β̂ around the (unknown) population value β, under the assumption that the model is correctly specified.

The idea of a 95% confidence interval
Figure: 95% confidence intervals for the mean uterus weight (g) in our study and hypothetical studies 1, 47, and 100, with the unknown true average uterus weight marked
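This coverage property can be imitated by simulation; a sketch in which the true mean, the SD, and the sample size are made-up values:

```r
set.seed(17)
truth <- 0.009; sigma <- 0.002; N <- 50       # assumed population and sample size
covers <- replicate(100, {                    # 100 hypothetical studies
  w  <- rnorm(N, mean = truth, sd = sigma)
  se <- sd(w)/sqrt(N)
  (truth >= mean(w) - 1.96*se) && (truth <= mean(w) + 1.96*se)
})
sum(covers)                                   # typically close to 95 of 100
```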
By construction, we expect about 5 of the 100 confidence intervals not to cover (include) the true value.

Confidence limits for the mean uterus weights (long code)
library(Publish)
cidat <- mice[,{
    mean <- mean(UterusWeight)
    se <- sqrt(var(UterusWeight)/.N)
    list(mean=mean,
         lower=mean-se*qnorm(1 - 0.05/2),
         upper=mean+se*qnorm(1 - 0.05/2))
},by=Dose]
publish(cidat,digits=1)

 Dose  mean lower upper
  0.0 0.009 0.007  0.01
  1.0 0.025 0.020  0.03
  2.5 0.051 0.046  0.06
  7.5 0.089 0.079  0.10
 50.0 0.087 0.066  0.11

Confidence limits for the mean uterus weights (short code)
library(Publish)
cidat <- mice[,ci.mean(UterusWeight),by=Dose]
publish(cidat,digits=1)

 Dose  mean    se lower upper level  statistic
  0.0 0.009 0.001 0.006  0.01  0.05 arithmetic
  1.0 0.025 0.002 0.018  0.03  0.05 arithmetic
  2.5 0.051 0.002 0.044  0.06  0.05 arithmetic
  7.5 0.089 0.005 0.072  0.11  0.05 arithmetic
 50.0 0.087 0.011 0.053  0.12  0.05 arithmetic

Confidence limits for the geometric mean uterus weights (short code)
library(Publish)
gcidat <- mice[,ci.mean(UterusWeight,statistic="geometric"),by=Dose]
publish(gcidat,digits=1)

 Dose geomean  se lower upper level statistic
  0.0   0.009 1.1 0.006  0.01  0.05 geometric
  1.0   0.024 1.1 0.018  0.03  0.05 geometric
  2.5   0.051 1.0 0.044  0.06  0.05 geometric
  7.5   0.089 1.1 0.073  0.11  0.05 geometric
 50.0   0.085 1.1 0.057  0.13  0.05 geometric

ggplot2: Plot of means with confidence intervals
library(ggplot2)
pom <- ggplot(cidat) + geom_pointrange(aes(x=Dose,
                                           y=mean,
                                           ymin=lower,
                                           ymax=upper),
                                       color=4)
pom + coord_flip() + ylab("Uterus weight (g)") + xlab("Dose")
Figure: Plot of the mean uterus weights (g) with confidence intervals by dose
library(Publish)
u <- plotConfidence(x=cidat$mean,
                    lower=cidat$lower,
                    upper=cidat$upper,
                    labels=cidat$Dose,
                    title.labels="Hormone dose",
                    title.values=expression(bold(paste("Mean (",CI[95],")"))),
                    cex=1.8,
                    stripes=TRUE,
                    stripes.col=c("gray95","white"),
                    xratio=c(.2,.3),
                    xlim=c(0,.15),
                    xlab="Uterus weight (g)")
Hormone dose   Mean (CI95)
0              0.01 (0.01-0.01)
1              0.02 (0.02-0.03)
2.5            0.05 (0.04-0.06)
7.5            0.09 (0.07-0.11)
50             0.09 (0.05-0.12)

Parameters
It is generally difficult to interpret a p-value without further quantification of the parameter of interest.
Parameters are interpretable characteristics that have to be estimated based on data.
Examples that we will study during the course:

▶ Means
▶ Mean differences
▶ Probabilities
▶ Risk ratios, odds ratios, hazard ratios
▶ Association parameters, regression coefficients

Juonala et al. (part I)
Aims: The objective was to produce reference values and to analyse the associations of age and sex with carotid intima-media thickness (IMT), carotid compliance (CAC), and brachial flow-mediated dilatation (FMD) in young healthy adults.
Methods and results: We measured IMT, CAC, and FMD with ultrasound in 2265 subjects aged 24–39 years. The mean values (mean ± SD) in men and women were 0.592 ± 0.10 vs. 0.572 ± 0.08 mm (P < 0.0001) for IMT, 2.00 ± 0.66 vs. 2.31 ± 0.77 %/10 mmHg (P < 0.0001) for CAC, and 6.95 ± 4.00 vs. 8.83 ± 4.56 % (P < 0.0001) for FMD.
The sex differences in IMT (95% confidence interval = [−0.013; 0.004] mm, P = 0.37) and CAC (95% CI = [−0.01; 0.18] %/10 mmHg, P = 0.09) became non-significant after adjustment for risk factors and carotid diameter.

Confidence intervals
A confidence interval is a range of values which covers the unknown true population parameter with high probability; roughly, the probability is (100 − α)%, where α is the level of significance in percent.
For example: −0.013 to 0.004 is a 95% confidence interval for the unknown average difference in IMT between men and women.
Confidence intervals have the advantage over p-values, that their absolute value has a direct interpretation.7
⁷ Confidence intervals rather than P values: estimation rather than hypothesis testing. Statistics with Confidence, Altman et al.

Relation between confidence intervals and p-values
If we estimate the parameter β, e.g.
β = mean(IMT men) − mean(IMT women)
and have computed a 95% confidence interval for this parameter,
[lower95, upper95]
then the null hypothesis
β = 0 "There is no difference"
can be rejected at the 5% significance level if the value 0 is not included in the interval: 0 ∉ [lower95, upper95].

ANOVA

Example (DGA p. 208)
22 cardiac bypass operation patients were randomized to 3 types of ventilation.

Outcome: Red cell folate level (µg/l)

Group  Ventilation                         N  Mean   SD
I      50% N2O, 50% O2 in 24 hours         8  316.6  58.7
II     50% N2O, 50% O2 during operation    9  256.4  37.1
III    30–50% O2 (no N2O) in 24 hours      5  278.0  33.8

ANOVA
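Although the raw measurements are not shown, the F-statistic can be approximated from these group summaries alone (the means and SDs are rounded, so the result differs slightly from an analysis of the raw data):

```r
n <- c(8, 9, 5)                        # group sizes
m <- c(316.6, 256.4, 278.0)            # group means
s <- c(58.7, 37.1, 33.8)               # group standard deviations
grand <- sum(n*m)/sum(n)               # grand mean of all 22 responses
SS.between <- sum(n*(m - grand)^2)     # between-group sum of squares
SS.within  <- sum((n - 1)*s^2)         # within-group sum of squares
Fstat <- (SS.between/2)/(SS.within/19) # df: 3 - 1 = 2 and 22 - 3 = 19
round(Fstat, 2)                        # close to the 3.71 of the ANOVA table
```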
# R code
anova(lm(cell~group,data=RedCellData))

ANOVA table for red cell folate levels
Source of variation   Degrees of freedom   Sum of squares   Mean square      F      P
Between groups                         2         15515.88        7757.9    3.71   0.04
Within groups                         19         39716.09        2090.3
Total                                 21         55231.97

What are sum of squares and degrees of freedom?
Recall the definition of the variance for a sample of N values X₁, …, X_N with mean X̄:

Var = (1/(N − 1)) · {(X₁ − X̄)² + ⋯ + (X_N − X̄)²}

where {(X₁ − X̄)² + ⋯ + (X_N − X̄)²} is called the sum of squares and N − 1 the degrees of freedom.

In ANOVA terminology the variance is referred to as a mean square, which is short for: mean squared deviation from the mean.

ANOVA methods
▶ Independent observations:
  ▶ t-test for two groups
  ▶ One-way ANOVA for more groups
  ▶ More-way ANOVA for more grouping variables
▶ Dependent observations:
  ▶ Repeated measures ANOVA
  ▶ Mixed effect models
▶ Rank statistics (non-parametric ANOVA tests):
  ▶ Nonparametric ANOVA (Kruskal-Wallis test)
▶ Mixture of discrete and continuous factors:
  ▶ ANCOVA
▶ Model comparison and model selection . . .

Nice methods, but what is the question?

Typical F-test hypotheses
H0  Null hypothesis         The red cell folate does not depend on the treatment
H1  Alternative hypothesis  The red cell folate does depend on the treatment
This means
H0 : Mean group I = Mean group II = Mean group III
H1 : Mean group I ≠ Mean group II or Mean group III ≠ Mean group II or Mean group I ≠ Mean group III
Usually we want to know which treatment yields the best response. F-test statistic
Central idea: The deviation of a subject's response from the grand mean of all responses is attributable to the deviation of that value from its group mean plus the deviation of that group mean from the grand mean.
F = (between-group variability) / (within-group variability)

  = (variance of the mean response values between the groups) / (variance of the values within the groups)
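Plugging the sums of squares from the ANOVA table of the red cell folate example into this recipe reproduces the F-statistic and its p-value (pf gives the upper tail of the F-distribution):

```r
SS.between <- 15515.88; df.between <- 2   # from the ANOVA table
SS.within  <- 39716.09; df.within  <- 19
SS.between + SS.within                    # 55231.97, the total sum of squares
MS.between <- SS.between/df.between       # between-group mean square: 7757.9
MS.within  <- SS.within/df.within         # within-group mean square: 2090.3
Fstat <- MS.between/MS.within             # 3.71
pf(Fstat, df1 = df.between, df2 = df.within, lower.tail = FALSE)  # ~ 0.04
```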
If the between-group variability is large relative to the within-group variability, then the grouping factor contributes to the systematic part of the variability of the response values.

Conclusions from the ANOVA table
Source of variation   Degrees of freedom   Sum of squares   Mean square      F      P
Between groups                         2         15515.88        7757.9    3.71   0.04
Within groups                         19         39716.09        2090.3
Total                                 21         55231.97
Conclusion: The red cell folate depends significantly on the treatment. Take home messages
▶ The variation of data can be decomposed into a systematic and a random part.
▶ The standard deviation quantifies the variability of the data.
▶ The standard error quantifies the uncertainty of statistical conclusions.
▶ ANOVA is an old and general statistical technique with many different applications.