
Basic statistics and ANOVA

Thomas Alexander Gerds

Department of Biostatistics, University of Copenhagen

Contents

- Data are variable
- Statistical uncertainty
- Summary and display of data
- Confidence intervals
- ANOVA

Data are variable

A measurement yields a value, such as

3.17 %,

together with an explanation, such as

"this is the expression of 1-B6.DBA-GTM in mouse 12".

The value from the next mouse in the list is 4.88% ...

- The measurement is difficult
- Data processing is done by humans
- Two mice have different genes
- They are exposed ... and treated differently

Decomposing variability

Variability of data is usually a composite of:

- Measurement error, measurement scheme
- Random variation
- Genotype
- Exposure, life style, environment
- Treatment

Statistical conclusions can often be obtained by explaining the sources of variation in the data.

Example 1

In the yeast experiment of Smith and Kruglyak (2008)¹, transcript levels were profiled in 6 replicates of the same strain, called 'RM', in glucose under controlled conditions.

¹ The article is available at http://biology.plosjournals.org

Example 1

Figure: Sources of the variation of these 6 values

- Measurement error
- Random variation

Example 1

In the same yeast experiment, Smith and Kruglyak (2008) also profiled 6 replicates of a different strain, called 'By', in glucose. The order in which the 12 samples were processed was random, to minimize a systematic experimental effect.

Example 1

Figure: Sources of the variation of these 12 values

- Measurement error
- Study design/experimental environment
- Genotype

Example 1

Furthermore, Smith and Kruglyak (2008) cultured 6 'RM' and 6 'By' replicates in ethanol. The order in which the 24 samples were processed was random, to minimize a systematic experimental effect.

Sources of variation

Figure: Sources of variation

- Measurement error
- Experimental environment
- Genes
- Exposure, environmental factors

Example 2

Festing and Weigler in the Handbook of Animal ...

... consider the example of an experiment using a completely randomized design ...

. . . in which adult C57BL/6 mice were randomly allocated to one of four dose levels of a hormone compound.

The uterus weight was measured after an appropriate time interval.

Example 2

Figure: Example 2 (uterus weight by hormone dose)

Conclusions from the figures

- The uterus weight depends on the dose
- The variation of the data increases with increasing dose

Question: Why could these first conclusions be wrong?

Descriptive statistics

Descriptive statistics (summarizing data)

Categorical variables: count (%).

Continuous variables:

- raw values (if n is small)
- range (min, max)
- location: median (IQR = inter-quartile range)
- location: mean (SD = standard deviation)

Sample: Table 1²

² Quality of life (QOL), supportive care, and spirituality in hematopoietic stem cell transplant (HSCT) patients. Sirilla & Overcash. Supportive Care in Cancer, October 2012.

Figure: Table 1 from Sirilla & Overcash (not reproduced here)

R excursion: calculating descriptive statistics in groups

library(Publish)
library(data.table)
data(Diabetes)
setDT(Diabetes) ## make data.table
Diabetes[, .(mean.age=mean(age),
             sd.age=sd(age),
             median.chol=median(chol, na.rm=TRUE)),
         by=location]

     location mean.age   sd.age median.chol
1: Buckingham 47.07500 16.74849         202
2:     Louisa 46.63054 15.90929         206

R excursion: making table one

library(Publish)
data(Diabetes)
tab1 <- summary(utable(location ~ gender + age + Q(chol) + BMI,
                       data=Diabetes))
tab1

Variable  Level         Buckingham (n=200)    Louisa (n=203)        Total (n=403)         p-value
gender    female        114 (57.0)            120 (59.1)            234 (58.1)
          male           86 (43.0)             83 (40.9)            169 (41.9)            0.7422
age       mean (sd)     47.1 (16.7)           46.6 (15.9)           46.9 (16.3)           0.7847
chol      median [iqr]  202.0 [174.0, 231.0]  206.0 [183.5, 229.0]  204.0 [179.0, 230.0]  0.2017
          missing       1                     0                     1
BMI       mean (sd)     28.6 (7.0)            29.0 (6.2)            28.8 (6.6)            0.5424
          missing       3                     3                     6

R excursion: exporting a table

Method 1: Write table to file

write.csv(tab1, file="tables/tab1.csv")

Then open file tab1.csv with Excel

Method 2: Use kable³ and include in a dynamic report⁴

```{r, results='asis'}
knitr::kable(tab1)
```

³ https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
⁴ https://www.rdocumentation.org/packages/knitr/versions/1.17/topics/kable

Dynamite plots are deprecated (DO NOT USE)

Exercise

- Read and discuss the documentation of why dynamite plots are not good: http://biostat.mc.vanderbilt.edu/wiki/Main/DynamitePlots

Dot plots are appreciated when n is small

Figure: dot plot on the measurement scale; group A (n=3), group B (n=3, one replicate), group C (n=4)

Box plots are appreciated when n is large

Figure: box plots on the measurement scale; group A (n=300), group B (n=400), group C (n=400)

Making boxplots with ggplot2

library(ggplot2)
bp <- ggplot(Diabetes, aes(location, chol))
bp <- bp + geom_boxplot(aes(fill=location))
print(bp)

Find the ggplot2 cheat sheet via the help menu in RStudio.

Making boxplots with ggplot2

Figure: box plots of chol by location (Buckingham, Louisa)

Making boxplots with ggplot2

bp + facet_grid(. ~ gender)

Figure: box plots of chol by location (Buckingham, Louisa), faceted by gender (female, male)

Making dotplots with ggplot2

dp <- ggplot(mice, aes(x=Dose, fill=Dose, y=BodyWeight))
dp <- dp + geom_dotplot(binaxis="y")
print(dp)

Figure: dot plot of BodyWeight by Dose (0, 1, 2.5, 7.5, 50)

R excursion: exporting a figure

Write figure to pdf (vector graphics, also eps⁵; infinite zoom)

ggsave(dp, file="dotplot-mice-bodyweight.pdf")
# or
pdf("figures/dotplot-mice-bodyweight.pdf")
dp
dev.off()

Write figure to jpg (image file, also tiff, gif, etc.)

jpeg("figures/dotplot-mice-bodyweight.jpg")
dp
dev.off()

⁵ postscript("figures/dotplot-mice-bodyweight.eps")

Quantifying variability

A sample of data X_1, ..., X_N has a standard deviation (SD); it is defined by

$$\mathrm{SD} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})^2}, \qquad \bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i.$$

The SD measures the variability of the values in the sample.

The variance of the sample is defined as SD².

The term 'standard deviation' relates to the normal distribution.
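As a quick check of these definitions, here is a minimal R sketch with a small made-up vector x (the values are only for illustration): the built-in sd() and var() agree with the formulas above.

x <- c(3.1, 4.8, 2.9, 5.2, 4.0)               # made-up example values
N <- length(x)
xbar <- mean(x)
SD.manual <- sqrt(sum((x - xbar)^2)/(N - 1))  # the N-1 formula from above
c(manual=SD.manual, builtin=sd(x))            # identical
c(manual=SD.manual^2, builtin=var(x))         # the variance is SD squared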

Normal distribution

What is so special about the normal distribution?

- It is symmetric around the mean, thus the mean is equal to the median.
- The mean is the most likely value. Mean and standard deviation describe the full distribution.
- The distribution of measurements, like height, distance, volume, is often normal.

- The distributions of statistics, like the mean, proportion, mean difference, etc., are very often approximately normal.

Quantifying statistical uncertainty

For statistical inference and conclusion making, via p-values and confidence intervals, it is crucial to quantify the variability of the statistic (mean, proportion, mean difference, risk ratio, etc.):

The standard error is the standard deviation of the statistic.

The standard error is a measure of the statistical uncertainty.

Illustration

Population:

Figure: a population and samples drawn from it (means shown: 3.81, 2.13, 4.01)

Quantifying statistical uncertainty

Example: We want to estimate the unknown mean uterus weight for untreated mice. The standard error of the mean is defined as

$$\mathrm{SE} = \mathrm{SD}/\sqrt{N}$$

where N is the sample size.

Based on N = 4 values, 0.012, 0.0088, 0.0069, 0.009:

- mean: β̂ = 0.0091
- standard deviation: SD = 0.002108
- empirical variance: var = 0.0000044
- standard error: SE = 0.002108/2 = 0.001054

The standard error is the standard deviation of the mean

Figure: mean uterus weight (g) in our study and in hypothetical studies 1, ..., 100, together with the unknown true average uterus weight

The (hypothetical) mean values are approximately normally distributed, even if the data are not normally distributed!
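A small simulation sketch of this point (purely illustrative, with made-up skewed data): across many hypothetical studies, the standard deviation of the study means is close to SD/sqrt(N), and the means look roughly normal even though the data are exponential.

set.seed(1)
N <- 25
means <- replicate(1000, mean(rexp(N, rate=1)))  # 1000 hypothetical studies with skewed data
sd(means)      # close to the theoretical SE = SD/sqrt(N) = 1/sqrt(25) = 0.2
hist(means)    # roughly bell-shaped, although the data are far from normal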

"’The terms standard error and standard deviation are often confused. The between these two terms reflects the important distinction between data description and inference, one that all researchers should appreciate."’ 6

Rules:

- The higher the unexplained variability of the data, the higher the statistical uncertainty.
- The larger the sample size, the lower the statistical uncertainty.

⁶ Altman & Bland, Statistics Notes, BMJ, 2005; Nagele P, Br J Anaesthesiol 2003;90:514-6.

Confidence intervals

Constructing confidence limits

A 95% confidence interval for the parameter β is

$$[\hat{\beta} - 1.96 \cdot \mathrm{SE};\ \hat{\beta} + 1.96 \cdot \mathrm{SE}]$$

Example: a confidence interval for the mean uterus weight of untreated mice is given by

$$95\%\,\mathrm{CI} = [0.0091 - 1.96 \cdot 0.001054;\ 0.0091 + 1.96 \cdot 0.001054] = [0.007;\ 0.011].$$
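The same interval computed in R, continuing the N = 4 example above (a sketch; 1.96 is qnorm(0.975)):

betahat <- 0.0091
se <- 0.001054
c(lower = betahat - qnorm(0.975)*se,
  upper = betahat + qnorm(0.975)*se)   # approximately [0.007; 0.011]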

The standard error SE measures the variability of the mean β̂ around the (unknown) population value β, under the assumption that the model is correctly specified.

The idea of a 95% confidence interval

Figure: 95% confidence intervals for the mean uterus weight (g) in our study and in hypothetical studies 1, ..., 100, together with the unknown true average uterus weight

By construction, we expect about 5 of the 100 confidence intervals not to cover (include) the true value.

Confidence limits for the mean uterus weights (long code)

library(Publish)
library(data.table)  # mice: data.table with columns Dose and UterusWeight
cidat <- mice[, {
    mean = mean(UterusWeight)
    se = sqrt(var(UterusWeight)/.N)
    list(mean=mean,
         lower=mean - se*qnorm(1 - 0.05/2),
         upper=mean + se*qnorm(1 - 0.05/2))
}, by=Dose]
publish(cidat, digits=1)

 Dose  mean lower upper
  0.0 0.009 0.007  0.01
  1.0 0.025 0.020  0.03
  2.5 0.051 0.046  0.06
  7.5 0.089 0.079  0.10
 50.0 0.087 0.066  0.11

Confidence limits for the mean uterus weights (short code)

library(Publish)
cidat <- mice[, ci.mean(UterusWeight), by=Dose]
publish(cidat, digits=1)

 Dose  mean    se lower upper level  statistic
  0.0 0.009 0.001 0.006  0.01  0.05 arithmetic
  1.0 0.025 0.002 0.018  0.03  0.05 arithmetic
  2.5 0.051 0.002 0.044  0.06  0.05 arithmetic
  7.5 0.089 0.005 0.072  0.11  0.05 arithmetic
 50.0 0.087 0.011 0.053  0.12  0.05 arithmetic

Confidence limits for the uterus weights (short code)

library(Publish)
gcidat <- mice[, ci.mean(UterusWeight, statistic="geometric"), by=Dose]
publish(gcidat, digits=1)

 Dose geomean  se lower upper level statistic
  0.0   0.009 1.1 0.006  0.01  0.05 geometric
  1.0   0.024 1.1 0.018  0.03  0.05 geometric
  2.5   0.051 1.0 0.044  0.06  0.05 geometric
  7.5   0.089 1.1 0.073  0.11  0.05 geometric
 50.0   0.085 1.1 0.057  0.13  0.05 geometric

ggplot2: Plot of means with confidence intervals

library(ggplot2)
pom <- ggplot(cidat) + geom_pointrange(aes(x=Dose, y=mean, ymin=lower, ymax=upper), color=4)
pom + coord_flip() + ylab("Uterus weight (g)") + xlab("Dose")

Figure: means with confidence intervals of uterus weight (g) by dose (0, 1, 2.5, 7.5, 50)

Publish: Plot of means with confidence intervals (code)

library(Publish)
u <- plotConfidence(x=cidat$mean,
                    lower=cidat$lower,
                    upper=cidat$upper,
                    labels=cidat$Dose,
                    title.labels="Hormone dose",
                    title.values=expression(bold(paste("Mean (", CI[95], ")"))),
                    cex=1.8,
                    stripes=TRUE,
                    stripes.col=c("gray95", "white"),
                    xratio=c(.2, .3),
                    xlim=c(0, .15),
                    xlab="Uterus weight (g)")

Publish: Plot of means with confidence intervals (output)

Figure: plot of the mean uterus weight (g) with 95% confidence intervals by hormone dose:

Hormone dose   Mean (CI95)
0              0.01 (0.01-0.01)
1              0.02 (0.02-0.03)
2.5            0.05 (0.04-0.06)
7.5            0.09 (0.07-0.11)
50             0.09 (0.05-0.12)

Parameters

It is generally difficult to interpret a p-value without further quantification of the parameter of interest.

Parameters are interpretable characteristics that have to be estimated based on data.

Examples that we study during the course:

- Means
- Mean differences
- Risk ratios, odds ratios, hazard ratios
- Association parameters, regression coefficients

Juonala et al. (part I)

Aims: The objective was to produce reference values and to analyse the associations of age and sex with carotid intima-media thickness (IMT), carotid compliance (CAC), and brachial flow-mediated dilatation (FMD) in young healthy adults.

Methods and results: We measured IMT, CAC, and FMD with ultrasound in 2265 subjects aged 24–39 years. The mean values (mean ± SD) in men and women were 0.592 ± 0.10 vs. 0.572 ± 0.08 mm (P < 0.0001) for IMT, 2.00 ± 0.66 vs. 2.31 ± 0.77 %/10 mmHg (P < 0.0001) for CAC, and 6.95 ± 4.00 vs. 8.83 ± 4.56% (P < 0.0001) for FMD.

The sex differences in IMT (95% confidence interval = [-0.013; 0.004] mm, P = 0.37) and CAC (95% CI = [-0.01; 0.18] %/10 mmHg, P = 0.09) became non-significant after adjustment for risk factors and carotid diameter.

Confidence intervals

A confidence interval is a range of values which covers the unknown true population parameter with high probability. Roughly, the coverage probability is (100 − α)%, where α is the level of significance.

For example: −0.013 to 0.004 is a 95% confidence interval for the unknown average difference in IMT between men and women.

Confidence intervals have the advantage over p-values that their absolute value has a direct interpretation.⁷

⁷ Confidence intervals rather than P values: estimation rather than testing. Statistics with Confidence, Altman et al.

Relation between confidence intervals and p-values

If we estimate the parameter β, e.g.

β = mean(IMT in men) − mean(IMT in women)

and have computed a 95% confidence interval for this parameter,

[lower95, upper95]

then the null hypothesis

β = 0 "There is no difference"

can be rejected at the 5% significance level if the value 0 is not included in the interval: 0 ∉ [lower95, upper95].
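A quick illustration of this duality in R, using simulated data (the numbers are made up): t.test() reports both a 95% confidence interval for the mean difference and a p-value, and p < 0.05 exactly when 0 lies outside the interval.

set.seed(2)
men   <- rnorm(100, mean=0.59, sd=0.10)   # simulated IMT values (mm)
women <- rnorm(100, mean=0.57, sd=0.08)
tt <- t.test(men, women)
tt$conf.int   # 95% confidence interval for the mean difference
tt$p.value    # below 0.05 exactly when 0 is outside the interval above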

ANOVA

Example (DGA p.208)

22 cardiac bypass operation patients were randomized to 3 types of ventilation.

Outcome: Red cell folate level (µg/l)

Group   Ventilation                        N   Mean    Sd
I       50% N2O, 50% O2 for 24 hours       8   316.6   58.7
II      50% N2O, 50% O2 during operation   9   256.4   37.1
III     30–50% O2 (no N2O) for 24 hours    5   278.0   33.8

ANOVA

# R code
anova(lm(cell ~ group, data=RedCellData))

ANOVA table for red cell folate levels

Source of variation   Degrees of freedom   Sum of squares   Mean squares   F      P
Between groups        2                    15515.88         7757.9         3.71   0.04
Within groups         19                   39716.09         2090.3
Total                 21                   55231.97

What are sum of squares and degrees of freedom?

Recall the definition of the variance for a sample of N values X_1, ..., X_N with mean X̄:

$$\mathrm{Var} = \frac{1}{N-1}\big\{(X_1 - \bar{X})^2 + \cdots + (X_N - \bar{X})^2\big\}$$

Here the term in curly braces is the sum of squares and N − 1 is the degrees of freedom.

In ANOVA terminology the variance is referred to as a mean square, which is short for: mean squared deviation from the mean.

ANOVA methods

- Independent observations:
  - t test for two groups
  - One-way ANOVA for more groups
  - More-way ANOVA for more grouping variables
- Dependent observations:
  - Repeated measures ANOVA
  - Mixed effect models
- Rank statistics (non-parametric ANOVA tests):
  - Nonparametric ANOVA (Kruskal-Wallis test)
- Mixture of discrete and continuous factors:
  - ANCOVA
  - Model comparison and ...

Nice methods

Nice methods, but what is the question?

Typical F-test hypotheses

H0   Null hypothesis          The red cell folate does not depend on the treatment
H1   Alternative hypothesis   The red cell folate does depend on the treatment

This means

H0 : Mean group I = Mean group II = Mean group III

H1 : Mean group I ≠ Mean group II or Mean group III ≠ Mean group II or Mean group I ≠ Mean group III

Usually we want to know which treatment yields the best response.

The F-statistic

Central idea: The deviation of a subject's response from the grand mean of all responses is attributable to a deviation of that value from its group mean plus the deviation of that group mean from the grand mean.

$$F = \frac{\text{between-group variability}}{\text{within-group variability}} = \frac{\text{variance of the mean response values between groups}}{\text{variance of the values within the groups}}$$

If the between-group variability is large relative to the within-group variability, then the grouping factor contributes to the systematic part of the variability of the response values.
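As a sketch of this idea, the F-statistic of the red cell folate example can be reproduced in R from the group sizes, means, and SDs reported in the table above (small discrepancies are expected because the table values are rounded):

n <- c(8, 9, 5)               # group sizes
m <- c(316.6, 256.4, 278.0)   # group means
s <- c(58.7, 37.1, 33.8)      # group standard deviations
grand <- sum(n*m)/sum(n)                        # grand mean
ss.between <- sum(n*(m - grand)^2)              # between-group sum of squares
ss.within  <- sum((n - 1)*s^2)                  # within-group sum of squares
F <- (ss.between/(length(n) - 1)) / (ss.within/(sum(n) - length(n)))  # about 3.7
pf(F, df1=length(n) - 1, df2=sum(n) - length(n), lower.tail=FALSE)    # about 0.04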

Conclusions from the ANOVA table

Source of variation   Degrees of freedom   Sum of squares   Mean squares   F      P
Between groups        2                    15515.88         7757.9         3.71   0.04
Within groups         19                   39716.09         2090.3
Total                 21                   55231.97

Conclusion: The red cell folate depends significantly on the treatment.

Take home messages

- The variation of data can be decomposed into a systematic and a random part.
- The standard deviation quantifies the variability of the data.
- The standard error quantifies the uncertainty of statistical conclusions.
- ANOVA is an old and general statistical technique with many different applications.