Design and Analysis of Experiments
Course notes for STAT 568

Adam B. Kashlak
Mathematical & Statistical Sciences
University of Alberta
Edmonton, Canada, T6G 2G1
March 27, 2019

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

Contents
Preface
1 One-Way ANOVA
  1.0.1 Terminology
  1.1 Analysis of Variance
    1.1.1 Sample size computation
    1.1.2 Contrasts
  1.2 Multiple Comparisons
  1.3 Random Effects
    1.3.1 Derivation of the F statistic
  1.4 Cochran's Theorem
2 Multiple Factors
  2.1 Randomized Block Design
    2.1.1 Paired vs Unpaired Data
    2.1.2 Tukey's One DoF Test
  2.2 Two-Way Layout
    2.2.1 Fixed Effects
  2.3 Latin Squares
    2.3.1 Graeco-Latin Squares
  2.4 Balanced Incomplete Block Designs
  2.5 Split-Plot Designs
  2.6 Analysis of Covariance
3 Multiple Testing
  3.1 Family-wise Error Rate
    3.1.1 Bonferroni's Method
    3.1.2 Sidak's Method
    3.1.3 Holm's Method
    3.1.4 Stepwise Methods
  3.2 False Discovery Rate
    3.2.1 Benjamini-Hochberg Method

4 Factorial Design
  4.1 Full Factorial Design
    4.1.1 Estimating effects with regression
    4.1.2 Lenth's Method
    4.1.3 Key Concepts
    4.1.4 Dispersion and Variance Homogeneity
    4.1.5 Blocking with Factorial Design
  4.2 Fractional Factorial Design
    4.2.1 How to choose a design
  4.3 3^k Factorial Designs
    4.3.1 Linear and Quadratic Contrasts
    4.3.2 3^(k-q) Fractional Designs
    4.3.3 Agricultural Example
5 Response Surface Methodology
  5.1 First and Second Order
  5.2 Some Response Surface Designs
    5.2.1 Central Composite Design
    5.2.2 Box-Behnken Design
    5.2.3 Uniform Shell Design
  5.3 Search and Optimization
    5.3.1 Ascent via First Order Designs
  5.4 Chemical Reaction Data Example
6 Nonregular, Nonnormal, and other Designs
  6.1 Prime Level Factorial Designs
    6.1.1 5 level designs
    6.1.2 7 level designs
    6.1.3 Example of a 25-run design
  6.2 Mixed Level Designs
    6.2.1 2^n 4^m Designs
    6.2.2 36-Run Designs
  6.3 Nonregular Designs
    6.3.1 Plackett-Burman Designs
    6.3.2 Aliasing and Correlation
    6.3.3 Simulation Example
  6.A Paley's Construction of H_N

Preface
It's the random factors. You can't be sure, ever. All of time and space for them to happen in. How can we know where to look or turn? Do we have to fight our way through every possible happening to get the thing we want?

    Time and Again, Clifford D. Simak (1951)
The following are lecture notes originally produced for a graduate course on experimental design at the University of Alberta in the winter of 2018. The goal of these notes is to cover the classical theory of design as born from some of the founding fathers of statistics. The proper approach is to begin with a hypothesis to test, design an experiment to test that hypothesis, collect data as the design requires, and run the test. These days, data is collected en masse and often subjected to many tests in an exploratory search. Still, understanding how to design experiments remains critical for determining which factors affect the observations.
These notes were produced by consolidating two sources. One is the text of Wu and Hamada, Experiments: Planning, Analysis, and Optimization. The second is lecture notes and lecture slides from Dr. Doug Wiens and Dr. Linglong Kong, respectively.
Adam B Kashlak Edmonton, Canada January 2018
Additional notes on multiple testing were included based on the text Large-Scale Inference by Bradley Efron, which is quite relevant to the many hypothesis tests considered in factorial designs.
ABK, Jan 2019
Chapter 1
One-Way ANOVA
Introduction
We begin by considering an experiment in which k groups are compared. The primary goal is to determine whether or not there is a significant difference among all of the groups. The secondary goal is then to determine which specific pairs of groups differ the most. One example would be sampling n residents from each of the k = 10 Canadian provinces and comparing their heights, perhaps to test whether or not stature is affected by province. Here, the province is the single factor in the experiment with 10 different factor levels. Another example would be contrasting the heights of k = 3 groups of flowers where group one is given just water, group two is given water and nutrients, and group three is given water and vinegar. In this case, the factor is the liquid given to the flowers. It is also often referred to as a treatment. When more than one factor is considered, the treatment refers to specific levels of all factors.
1.0.1 Terminology

In the design of experiments literature, there is much terminology to consider. The following is a list of some of the common terms:
• Size or level of a test: the probability of a false positive. That is, the probability of falsely rejecting the null hypothesis.
• Power of a test: the probability of a true positive. That is, the probability of correctly rejecting the null hypothesis.
• Response: the dependent variable or output of the model. It is what we are interested in modelling.
• Factor: an explanatory variable or an input into the model. Often controlled by the experimenter.
• Factor Level: the different values that a factor can take. Often this is categorical.
• Treatment: the overall combination of many factors and levels.
• Blocking: grouping subjects by type in order to understand the variation between the blocks versus the variation within the blocks.
• Fixed effects: When a factor is chosen by the experimenter, it is considered fixed.
• Random effects: When a factor is not controlled by the experimenter, it is considered random.
One example comes from the Rabbit dataset from the MASS library in R. Here, five rabbits are given a drug (MDL) and a placebo in different dosages, and the effect on their blood pressure is recorded. The blood pressure would be the response. The factors are drug, dosage, and rabbit, where the factor levels for drug are {MDL, placebo}, the levels for dosage are {6.25, 12.5, 25, 50, 100, 200}, and the levels for rabbit are {R1, . . . , R5}. A specific treatment could be rabbit R3 with a dosage of 25 of the placebo.
1.1 Analysis of Variance¹
In statistics in general, analysis of variance or ANOVA is concerned with decomposing the total variation of the data by the factors. That is, it determines how much of the variation can be explained by each factor and how much is left to random noise.
We begin with the setting of one-way fixed effects. Consider a sample of size N = nk and k different treatments. Thus, we have k different groups of size n. Each group is given a different treatment, and measurements yij for i = 1, . . . , k and j = 1, . . . , n are collected. The one-way ANOVA is concerned with comparing the between group variation to the within group variation, which is the variation explained by the treatments versus the unexplained variation.
Remark 1.1.1 (Randomization). In the fixed effects setting in practice, the N subjects are randomly assigned to one of the k treatment groups. However, this is not always possible for a given experiment.
¹ See Wu & Hamada Section 2.1
The model for the observations y_ij is
\[
y_{ij} = \mu + \tau_i + \varepsilon_{ij}
\]
where μ is the global mean and τ_i is the effect of the ith category or treatment. The ε_ij are random noise variables generally assumed to be iid N(0, σ²) with σ² unknown. Based on this model, we can rewrite the observation y_ij as
\[
y_{ij} = \hat{\mu} + \hat{\tau}_i + r_{ij} \tag{1.1.1}
\]
where
\[
\hat{\mu} = \bar{y}_{\cdot\cdot} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n} y_{ij}, \qquad
\hat{\tau}_i = \bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot} = \frac{1}{n}\sum_{j=1}^{n} y_{ij} - \frac{1}{N}\sum_{l=1}^{k}\sum_{j=1}^{n} y_{lj}, \qquad
r_{ij} = y_{ij} - \bar{y}_{i\cdot}.
\]
Equation 1.1.1 can be rearranged into
\[
y_{ij} - \bar{y}_{\cdot\cdot} = (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}) + (y_{ij} - \bar{y}_{i\cdot}).
\]
This, in turn, can be squared and summed to get
\[
\sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij} - \bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^{k} n(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n} (y_{ij} - \bar{y}_{i\cdot})^2,
\]
which is just the total sum of squares, SS_tot, decomposed into the sum of the treatment sum of squares, SS_tr, and the error sum of squares, SS_err. The cross term vanishes because the residuals sum to zero within each group. Under the assumption that the errors ε_ij are normally distributed, the usual F statistic can be derived to test the hypothesis that
\[
H_0 : \tau_1 = \cdots = \tau_k \quad \text{vs} \quad H_1 : \exists\, i_1, i_2 \text{ s.t. } \tau_{i_1} \neq \tau_{i_2}.
\]
Indeed, under this model, it can be shown that SS_err/σ² ∼ χ²(N − k) and that, under the null hypothesis, SS_tr/σ² ∼ χ²(k − 1), independently of SS_err. Hence, the test statistic is
\[
F = \frac{SS_{tr}/(k-1)}{SS_{err}/(N-k)} \sim F(k-1, N-k).
\]
Often, for example in R, all of these terms from the one-way ANOVA experiment are represented in a table as follows:
              DoF    Sum Squares   Mean Squares     F value   p-value
    Treatment k − 1  SS_tr         SS_tr/(k − 1)    F         P(> F)
    Residuals N − k  SS_err        SS_err/(N − k)
Remark 1.1.2 (Degrees of Freedom). In an intuitive sense, the degrees of freedom (DoF) can be thought of as the difference between the sample size and the number of estimated parameters. The DoF for the total sum of squares is N − 1, where N is the total sample size and the −1 accounts for estimating ȳ··. Similarly, for SS_tr, the DoF is k − 1, corresponding to the k group means minus the one global mean. The DoF for the residuals is N − k, which is the sample size minus the number of group means.
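The sum-of-squares decomposition and F statistic above are easy to verify numerically. The following Python sketch (illustrative only; the notes' own computations use R) computes SS_tot, SS_tr, and SS_err by hand for simulated balanced data and forms the F statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, n = 3, 8                      # number of groups, per-group sample size
N = k * n
y = rng.normal(size=(k, n))      # y[i, j] = observation j in group i (null model)

grand = y.mean()                 # overall mean  \bar{y}_..
group = y.mean(axis=1)           # group means   \bar{y}_i.

ss_tot = ((y - grand) ** 2).sum()
ss_tr = n * ((group - grand) ** 2).sum()
ss_err = ((y - group[:, None]) ** 2).sum()

# The decomposition SS_tot = SS_tr + SS_err holds exactly
assert np.isclose(ss_tot, ss_tr + ss_err)

F = (ss_tr / (k - 1)) / (ss_err / (N - k))
p = stats.f.sf(F, k - 1, N - k)  # p-value: P(F(k-1, N-k) > F)
print(F, p)
```

As a cross-check, `scipy.stats.f_oneway(*y)` returns the same F statistic and p-value.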
This model falls under the general linear model setting
Y = Xβ + ε
where β = (μ, τ_1, . . . , τ_k)^T is the vector of parameters, Y and ε are the N-long vectors of y_ij and ε_ij, respectively, and X is the design matrix. In the context of the above model, the N × (k + 1) matrix X is of the form
\[
X = \begin{pmatrix}
\mathbf{1}_n & \mathbf{1}_n & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{1}_n & \mathbf{0} & \mathbf{1}_n & \cdots & \mathbf{0} \\
\vdots & \vdots & & \ddots & \vdots \\
\mathbf{1}_n & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{1}_n
\end{pmatrix}
\]
where \(\mathbf{1}_n\) and \(\mathbf{0}\) denote n-long columns of ones and zeros; the first column corresponds to μ, and the ith indicator column flags membership in group i.
Using the usual least squares estimator from linear regression, we could try to estimate the parameter vector β as
\[
\hat{\beta} = (X^T X)^{-1} X^T Y.
\]
The problem is that X^T X is not an invertible matrix. Hence, we have to add a constraint to the parameters to make this viable.
Remark 1.1.3 (Identifiability). The easiest way to see the problem with the model as written is that you could have a global mean μ = 0 with group effects τ_i, or a global mean μ = 1 with group effects τ_i − 1, and the two would give identical models. That is, the parameters are not identifiable.
The common constraint to apply is to require that \(\sum_{i=1}^{k} \tau_i = 0\), as this is satisfied by the estimators τ̂_i = ȳ_i· − ȳ··. Here, the interpretation of the parameters is as before: μ is the global mean and τ_i is the offset of the ith group. However, now the parameter \(\tau_k = -\sum_{i=1}^{k-1} \tau_i\). As a result, the new design matrix is now of dimension N × k and is of the form
\[
X = \begin{pmatrix}
\mathbf{1}_n & \mathbf{1}_n & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{1}_n & \mathbf{0} & \mathbf{1}_n & \cdots & \mathbf{0} \\
\vdots & \vdots & & \ddots & \vdots \\
\mathbf{1}_n & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{1}_n \\
\mathbf{1}_n & -\mathbf{1}_n & -\mathbf{1}_n & \cdots & -\mathbf{1}_n
\end{pmatrix}
\]
where the final block of n rows corresponds to group k.
This new X allows for X^T X to be invertible. Furthermore, the hypothesis test is now stated slightly differently as
\[
H_0 : \tau_1 = \cdots = \tau_{k-1} = 0 \quad \text{vs} \quad H_1 : \exists\, i \text{ s.t. } \tau_i \neq 0.
\]
Note that the above equations all hold even if the k groups have unequal sample sizes n_i. In that case, \(N = \sum_{i=1}^{k} n_i\).
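The identifiability issue can be seen concretely by building both design matrices and checking the rank of X^T X. This Python sketch (an illustration, not from the notes) uses k = 3 groups of size n = 4:

```python
import numpy as np

k, n = 3, 4
N = k * n

# Overparametrized design: intercept column plus one indicator column per group
X_full = np.zeros((N, k + 1))
X_full[:, 0] = 1.0
for i in range(k):
    X_full[i * n:(i + 1) * n, i + 1] = 1.0

# The group indicator columns sum to the intercept column, so X^T X is singular
print(np.linalg.matrix_rank(X_full.T @ X_full))  # k = 3, not k + 1 = 4

# Sum-to-zero parametrization: drop tau_k and give group k's rows a -1 in
# every remaining effect column
X_sum = np.zeros((N, k))
X_sum[:, 0] = 1.0
for i in range(k - 1):
    X_sum[i * n:(i + 1) * n, i + 1] = 1.0
X_sum[(k - 1) * n:, 1:] = -1.0

print(np.linalg.matrix_rank(X_sum.T @ X_sum))    # k = 3: full rank, invertible
```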
Example

Imagine we have k = 4 categories with n = 10 samples each. The categories will be labelled A, B, C, and D. The category means are, respectively, −1, −0.1, 0.1, and 1, and the added noise is ε ∼ N(0, 1). A model can be fit to the data via the aov() function in R. The result is a table such as

              Df  Sum Sq  Mean Sq  F value    Pr(>F)
    label      3   19.98    6.662    8.026  0.000319 ***
    Residuals 36   29.88    0.830
The significant p-value for the F test indicates that we can (correctly!) reject the null hypothesis that the category means are all equal.
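The same experiment is easy to replicate outside R. The sketch below (a single simulated draw in Python, so the numbers will not match the table above exactly) runs the one-way F test with scipy.stats.f_oneway:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10
true_means = {"A": -1.0, "B": -0.1, "C": 0.1, "D": 1.0}

# Simulate n observations per category with N(0, 1) noise
data = {lab: mu + rng.normal(size=n) for lab, mu in true_means.items()}

# One-way ANOVA F test of equal category means
F, p = stats.f_oneway(*data.values())
print(F, p)
```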
1.1.1 Sample size computation

In practice, researchers are often concerned with determining the minimal sample size needed to achieve a reasonable amount of statistical power. Computing the sample size exactly is usually impossible as it depends on unknown quantities. However, if guesses are available from similar past or pilot studies, then a sample size estimate can be computed. Such a computation can be performed by the R function power.anova.test(). As an example, if the number of groups is k = 6, the between group variation is 2, the within group variation is 4, the size of the test is α = 0.05, and the desired power is 0.9, then the required sample size for each of the 6 groups is n = 7.57, which rounds up to 8.
power.anova.test(groups = 6, between.var = 2, within.var = 4, power=0.9)
1.1.2 Contrasts²

When an ANOVA is run in R, it defaults to taking the first alphabetical factor level as the intercept or reference term. Instead, we can use the contrasts() command to tell R to apply the above sum-to-zero constraint. Many more complicated contrasts can be used for a variety of testing purposes.
Continuing from the above example with n = 10 and k = 4, we can use the summary.lm() function to look at the parameter estimates for the four categories.

                 Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)   -0.8678      0.2881   -3.012   0.00472 **
    labB           1.0166      0.4074    2.495   0.01731 *
    labC           0.8341      0.4074    2.047   0.04799 *
    labD           1.9885      0.4074    4.881  2.16e-05 ***

In this table, the Intercept corresponds to the mean of category A. Meanwhile, the labB, labC, and labD estimates correspond to the difference between the mean of category A and the means of categories B, C, and D, respectively. We can use contr.sum to construct a sum-to-zero contrast. The result of refitting the model is

                 Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)   0.09203     0.14405    0.639   0.52697
    lab1         -0.95982     0.24950   -3.847   0.00047 ***
    lab2          0.05682     0.24950    0.228   0.82115
    lab3         -0.12570     0.24950   -0.504   0.61747
Now, the Intercept estimate corresponds to the global mean ȳ··, which is generally preferable. Furthermore, the estimates for lab1, lab2, and lab3 correspond to the differences between the category means for A, B, and C and the global mean; e.g. check that 0.092 − 0.959 = −0.867 for category A. The t-tests in the table now test whether or not each category mean is equal to the global mean.
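The relationship between the two codings can be verified by hand. Using the rounded estimates from the tables above, this Python sketch recovers the sum-to-zero parameters from the reference-coded ones (agreement is only to about three decimal places because the inputs are rounded):

```python
import numpy as np

# Category means implied by the reference (treatment) coding: the intercept
# is the mean of A, and labB/labC/labD are offsets from it
mean_A = -0.8678
means = np.array([mean_A,
                  mean_A + 1.0166,   # B
                  mean_A + 0.8341,   # C
                  mean_A + 1.9885])  # D

# Sum-to-zero coding: intercept = grand mean, effects = group mean - grand mean
# (with equal group sizes the grand mean is the unweighted mean of group means)
grand = means.mean()
effects = means - grand

print(grand)        # ~0.092, the (Intercept) of the second table
print(effects[:3])  # ~(-0.960, 0.057, -0.126), i.e. lab1, lab2, lab3
```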
² See Wu & Hamada Section 2.3
1.2 Multiple Comparisons³
Imagine that we have run the above hypothesis test and have rejected the null hypothesis. A natural follow-up question is: which specific τ_i is non-zero? This is equivalent to asking which pairs τ_i, τ_j differ significantly. To compare the means of two different groups of sizes n_i and n_j, we can use the t-statistic
\[
t_{ij} = \frac{\bar{y}_{i\cdot} - \bar{y}_{j\cdot}}{\sqrt{(n_i^{-1} + n_j^{-1})\, SS_{err}/(N-k)}} \sim t(N-k)
\]
and the usual two sample t-test. Thus, we can reject the hypothesis that the two group means are equal at the α level if |t_ij| > t_{N−k,α/2}. However, if we were to run such a test for each of the k̃ = k(k − 1)/2 pairings, then the probability of a false positive would no longer be α but much larger.
One approach to correcting this problem is the Bonferroni method. The Bonferroni method simply states that one should run all of the tests at a new level α′ = α/k̃. As a result, the probability of at least one false positive is