
Design and Analysis of Experiments
Course notes for STAT 568

Adam B Kashlak
Mathematical & Statistical Sciences
University of Alberta
Edmonton, Canada, T6G 2G1

March 27, 2019

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

Contents

Preface

1 One-Way ANOVA
  1.0.1 Terminology
  1.1 Analysis of Variance
    1.1.1 Sample size computation
    1.1.2 Contrasts
  1.2 Multiple Comparisons
  1.3 Random Effects
    1.3.1 Derivation of the F statistic
  1.4 Cochran's Theorem

2 Multiple Factors
  2.1 Randomized Block Design
    2.1.1 Paired vs Unpaired Data
    2.1.2 Tukey's One DoF Test
  2.2 Two-Way Layout
    2.2.1 Fixed Effects
  2.3 Latin Squares
    2.3.1 Graeco-Latin Squares
  2.4 Balanced Incomplete Block Designs
  2.5 Split-Plot Designs
  2.6 Analysis of Covariance

3 Multiple Testing
  3.1 Family-wise Error Rate
    3.1.1 Bonferroni's Method
    3.1.2 Sidak's Method
    3.1.3 Holms' Method
    3.1.4 Stepwise Methods
  3.2 False Discovery Rate
    3.2.1 Benjamini-Hochberg Method

4 Factorial Design
  4.1 Full Factorial Design
    4.1.1 Estimating effects with regression
    4.1.2 Lenth's Method
    4.1.3 Key Concepts
    4.1.4 Dispersion and Variance Homogeneity
    4.1.5 … with Factorial Design
  4.2 Fractional Factorial Design
    4.2.1 How to choose a design
  4.3 3^k Factorial Designs
    4.3.1 Linear and Quadratic Contrasts
    4.3.2 3^(k−q) Fractional Designs
    4.3.3 Agricultural Example

5 Response Surface Methodology
  5.1 First and Second Order
  5.2 Some Response Surface Designs
    5.2.1 …
    5.2.2 Box-Behnken Design
    5.2.3 Uniform Shell Design
  5.3 Search and Optimization
    5.3.1 Ascent via First Order Designs
  5.4 Chemical Reaction Data Example

6 Nonregular, Nonnormal, and other Designs
  6.1 Prime Level Factorial Designs
    6.1.1 5 level designs
    6.1.2 7 level designs
    6.1.3 Example of a 25-run design
  6.2 Mixed Level Designs
    6.2.1 2^n 4^m Designs
    6.2.2 36-Run Designs
  6.3 Nonregular Designs
    6.3.1 Plackett-Burman Designs
    6.3.2 Aliasing and Correlation
    6.3.3 Simulation Example
  6.A Paley's Construction of H_N

Preface

It's the random factors. You can't be sure, ever. All of time and space for them to happen in. How can we know where to look or turn? Do we have to fight our way through every possible happening to get the thing we want?
— Time and Again, Clifford D Simak (1951)

The following are lecture notes originally produced for a graduate course on experimental design at the University of Alberta in the winter of 2018. The goal of these notes is to cover the classical theory of design as born from some of the founding fathers of statistics. The proper approach is to begin with a hypothesis to test, design an experiment to test that hypothesis, collect data as needed by the design, and run the test. These days, data is collected en masse and often subjected to many tests in an exploratory search. Still, understanding how to design experiments remains critical for determining which factors affect the observations. These notes were produced by consolidating two sources. One is the text of Wu and Hamada, Experiments: Planning, Analysis, and Optimization. The second is lecture notes and lecture slides from Dr. Doug Wiens and Dr. Linglong Kong, respectively.

Adam B Kashlak Edmonton, Canada January 2018

Additional notes on multiple testing were included based on the text Large-Scale Inference by Bradley Efron, which is quite relevant to the many hypothesis tests considered in factorial designs.

ABK, Jan 2019

Chapter 1

One-Way ANOVA

Introduction

We begin by considering an experiment in which k groups are compared. The primary goal is to determine whether or not there is a significant difference among all of the groups. The secondary goal is then to determine which specific pairs of groups differ the most. One example would be sampling n residents from each of the k = 10 Canadian provinces and comparing their heights, perhaps to test whether or not stature is affected by province. Here, the province is the single factor in the experiment with 10 different factor levels. Another example would be contrasting the heights of k = 3 groups of flowers where group one is given just water, group two is given water and nutrients, and group three is given water and vinegar. In this case, the factor is the liquid given to the flowers. It is also often referred to as a treatment. When more than one factor is considered, the treatment refers to specific levels of all factors.

1.0.1 Terminology

In the literature, there is much terminology to consider. The following is a list of some of the common terms:

• Size or level of a test: the probability of a false positive. That is, the probability of falsely rejecting the null hypothesis.

• Power: the probability of a true positive. That is, the probability of correctly rejecting the null hypothesis.

• Response: the dependent variable or output of the model. It is what we are interested in modelling.

• Factor: an explanatory variable or an input into the model. Often controlled by the experimenter.

• Factor Level: the different values that a factor can take. Often this is categorical.

• Treatment: the overall combination of many factors and levels.

• Blocking: grouping subjects by type in order to understand the variation between the blocks versus the variation within the blocks.

• Fixed effects: When a factor is chosen by the experimenter, it is considered fixed.

• Random effects: When a factor is not controlled by the experimenter, it is considered random.

One example comes from the Rabbit dataset from the MASS library in R. Here, five rabbits are given a drug (MDL) and a placebo in different dosages, and the effect on their blood pressure is recorded. The blood pressure would be the response. The factors are drug, dosage, and rabbit, where the factor levels for drug are {MDL, placebo}, the levels for dosage are {6.25, 12.5, 25, 50, 100, 200}, and the levels for rabbit are {R1,. . . ,R5}. A specific treatment could be rabbit R3 with a dosage of 25 of the placebo.
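To get a feel for this dataset, a minimal sketch in R (assuming the MASS package is installed) is:

# Inspect the Rabbit data: the response (BPchange) and the factors described above
library(MASS)
str(Rabbit)
with(Rabbit, table(Treatment, Dose))   # how the drug/placebo and dosages are crossed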

1.1 Analysis of Variance1

In statistics in general, analysis of variance or ANOVA is concerned with decomposing the total variation of the data by the factors. That is, it determines how much of the variation can be explained by each factor and how much is left to random noise. We begin with the setting of one-way fixed effects. Consider a sample of size N = nk and k different treatments. Thus, we have k different groups of size n. Each group is given a different treatment, and measurements yij for i = 1, . . . , k and j = 1, . . . , n are collected. The one-way ANOVA is concerned with comparing the between group variation to the within group variation, which is the variation explained by the treatments vs the unexplained variation.

Remark 1.1.1 (Randomization). In the fixed effects setting in practice, the N subjects are randomly assigned to one of the k treatment groups. However, this is not always possible for a given experiment.

1 See Wu & Hamada Section 2.1

The model for the observations yij is

yij = µ + τi + εij

where µ is the global mean and τi is the effect of the ith category or treatment. The εij are random noise variables generally assumed to be iid N(0, σ²) with σ² unknown. Based on this model, we can rewrite the observation yij as

yij =µ ˆ +τ ˆi + rij (1.1.1)

where

$$\hat\mu = \bar y_{\cdot\cdot} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n} y_{ij}, \qquad \hat\tau_i = \bar y_{i\cdot} - \bar y_{\cdot\cdot} = \frac{1}{n}\sum_{j=1}^{n} y_{ij} - \frac{1}{N}\sum_{l=1}^{k}\sum_{j=1}^{n} y_{lj}, \qquad r_{ij} = y_{ij} - \bar y_{i\cdot}.$$

Equation 1.1.1 can be rearranged into

yij − y¯·· = (¯yi· − y¯··) + (yij − y¯i·).

This, in turn, can be squared and summed to get

$$\sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij} - \bar y_{\cdot\cdot})^2 = \sum_{i=1}^{k} n(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n}(y_{ij} - \bar y_{i\cdot})^2,$$

which is just the total sum of squares, SStot, decomposed into the sum of the treatment sum of squares, SStr, and the error sum of squares, SSerr. Under the assumption that the errors εij are normally distributed, the usual F statistic can be derived to test the hypothesis that

H0 : τ1 = · · · = τk vs H1 : ∃ i1, i2 s.t. τi1 ≠ τi2.

Indeed, under this model, it can be shown that SSerr/σ² ∼ χ²(N − k) and that, under the null hypothesis, SStr/σ² ∼ χ²(k − 1). Hence, the test statistic is

$$F = \frac{SS_{tr}/(k-1)}{SS_{err}/(N-k)} \sim F(k-1,\, N-k).$$

Often, for example in R, all of these terms from the one-way ANOVA experiment are represented in a table as follows:

            DoF     Sum Squares   Mean Squares     F value   p-value
Treatment   k − 1   SStr          SStr/(k − 1)     F         P(> F)
Residuals   N − k   SSerr         SSerr/(N − k)

Remark 1.1.2 (Degrees of Freedom). In an intuitive sense, the degrees of freedom (DoF) can be thought of as the difference between the sample size and the number of estimated parameters. The DoF for the total sum of squares is just N − 1 where N is the total sample size and −1 is for estimating ȳ··. Similarly, for SStr, the DoF is k − 1 corresponding to the k group means minus the one global mean. The DoF for the remainder is N − k, which is the sample size minus the number of group means.

This model falls under the general setting

Y = Xβ + ε

where β = (µ, τ1, . . . , τk)^T is the vector of parameters, Y and ε are the N-long vectors of the yij and εij, respectively, and X is the design matrix. In the context of the above model, the N × (k + 1) matrix X is of the form

1 1 0 ···  1 0 1 0 ···   . ..  . .    1 0 ··· 0 1    X = 1 1 0 ···  .   .  .    .  .  1 0 ··· 0 1

Using the usual least squares estimator from linear regression, we could try to estimate the parameter vector β as

βˆ = (XTX)−1XTY.

The problem is that XTX is not an invertible matrix. Hence, we have to add a constraint to the parameters to make this viable.

Remark 1.1.3 (Identifiability). The easiest way to see the problem with the model as written is that you could have a global mean µ = 0 with group means τi, or you could have a global mean µ = 1 with group means τi − 1, which would give identical models. That is, the parameters are not identifiable. The common constraint to apply is to require that $\sum_{i=1}^{k}\tau_i = 0$, as this is satisfied by the estimators $\hat\tau_i = \bar y_{i\cdot} - \bar y_{\cdot\cdot}$. Here, the interpretation of the parameters is as before where µ is the global mean and τi is the offset of the ith group. However, now the parameter $\tau_k = -\sum_{i=1}^{k-1}\tau_i$. As a result, the new design matrix is now of dimension

N × k and is of the form

1 1 0 ···  1 0 1 0 ···   . ..  . .    1 0 ··· 0 1    1 −1 · · · · · · −1   X = 1 1 0 ···  .   .  .    .  .    1 0 ··· 0 1  1 −1 · · · · · · −1

This new X allows for XTX to be invertible. Furthermore, the hypothesis test is now stated slightly differently as

H0 : τ1 = · · · = τk−1 = 0 vs H1 : ∃ i s.t. τi ≠ 0.

Note that the above equations all hold even if the k groups have unequal sample sizes. In that case, $N = \sum_{i=1}^{k} n_i$.

Example

Imagine we have k = 4 categories with n = 10 samples each. The categories will be labelled A, B, C, and D. The category means are, respectively, -1, -0.1, 0.1, and 1, and the added noise is ε ∼ N(0, 1). A model can be fit to the data via the aov() function in R. The result is a table such as

            Df  Sum Sq  Mean Sq  F value    Pr(>F)
label        3   19.98    6.662    8.026  0.000319 ***
Residuals   36   29.88    0.830

The significant p-value for the F test indicates that we can (correctly!) reject the null hypothesis that the category means are all equal.
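The data and fit above can be reproduced with a short simulation along the following lines (a sketch; the seed, and hence the exact numbers, are arbitrary):

# Simulate k = 4 groups of n = 10 with means -1, -0.1, 0.1, 1 and N(0,1) noise
set.seed(1)
label <- factor(rep(c("A", "B", "C", "D"), each = 10))
mu    <- c(A = -1, B = -0.1, C = 0.1, D = 1)
y     <- mu[as.character(label)] + rnorm(40)
fit   <- aov(y ~ label)
summary(fit)   # a one-way ANOVA table like the one shown above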

1.1.1 Sample size computation

In practice, researchers are often concerned with determining the minimal sample size needed to achieve a certain reasonable amount of statistical power. To compute the sample size exactly is usually impossible as it is based on unknowns. However, if guesses are available from similar past or pilot studies, then a sample size estimate can be computed. Such a computation can be performed by the R function power.anova.test(). As an example, if the number of groups is k = 6, the between group variation is

2, the within group variation is 4, the size of the test is α = 0.05, and the desired power is 0.9, then the sample size for each of the 6 groups is n = 7.57, which will round up to 8.

power.anova.test(groups = 6, between.var = 2, within.var = 4, power=0.9)

1.1.2 Contrasts2

When an ANOVA is run in R, it defaults to taking the first alphabetical factor level to use as the intercept or reference term. Instead, we can use the contr.sum() contrast to tell R to apply the above sum-to-zero constraint. Many more complicated contrasts can be used for a variety of testing purposes. Continuing from the above example with n = 10 and k = 4, we can use the summary.lm() function to look at the parameter estimates for the four categories.

              Estimate  Std. Error  t value  Pr(> |t|)
(Intercept)    -0.8678      0.2881   -3.012    0.00472 **
labB            1.0166      0.4074    2.495    0.01731 *
labC            0.8341      0.4074    2.047    0.04799 *
labD            1.9885      0.4074    4.881   2.16e-05 ***

In this table, the Intercept corresponds to the mean of category A. Meanwhile, the labB, labC, and labD estimates correspond to the difference between the mean of category A and the means of categories B, C, and D, respectively. We can use contr.sum to construct a sum-to-zero contrast. The result of refitting the model is

              Estimate  Std. Error  t value  Pr(> |t|)
(Intercept)    0.09203     0.14405    0.639    0.52697
lab1          -0.95982     0.24950   -3.847    0.00047 ***
lab2           0.05682     0.24950    0.228    0.82115
lab3          -0.12570     0.24950   -0.504    0.61747

Now, the Intercept estimate corresponds to the global mean ȳ··, which is generally preferable. Furthermore, the estimates for lab1, lab2, lab3 correspond to the difference between the category means for A, B, C and the global mean–e.g. check that 0.092 − 0.959 = −0.867 for category A. The t-tests in the table are now testing for whether or not the category mean is equal to the global mean.
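One way to obtain the sum-to-zero fit in R is to pass the contrast explicitly when fitting; a sketch, continuing the simulated example above:

# Refit using the sum-to-zero parameterization instead of the default treatment contrasts
fit2 <- aov(y ~ label, contrasts = list(label = contr.sum))
summary.lm(fit2)   # (Intercept) is the grand mean; lab1-lab3 are offsets from it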

2 See Wu & Hamada Section 2.3

1.2 Multiple Comparisons3

Imagine that we have run the above hypothesis test and have rejected the null hypothesis. A natural follow up question is: which specific τi is non-zero? This is equivalent to asking which pairs τi, τj have significant differences. To compare the means of two different groups of sizes ni and nj, we can use the t-statistic

$$t_{ij} = \frac{\bar y_{i\cdot} - \bar y_{j\cdot}}{\sqrt{(n_i^{-1} + n_j^{-1})\,SS_{err}/(N-k)}} \sim t(N-k)$$

and the usual two sample t-test. Thus, we can reject the hypothesis that the group means are equal at the α level if |tij| > t_{N−k,α/2}. However, if we were to run such a test for each of the k̃ = k(k − 1)/2 pairings, then the probability of a false positive would no longer be α but much larger. One approach to correcting this problem is the Bonferroni method. The Bonferroni method simply states that one should run all of the tests at a new level α′ = α/k̃. As a result, the probability of at least one false positive is

$$\mathbb{P}\left(\exists\, i,j \text{ s.t. } |t_{ij}| > t_{N-k,\alpha'} \mid H_0\right) = \mathbb{P}\left(\bigcup_{i,j}\{|t_{ij}| > t_{N-k,\alpha'}\} \,\Big|\, H_0\right) \le \sum_{i,j}\mathbb{P}\left(|t_{ij}| > t_{N-k,\alpha'} \mid H_0\right) = \tilde k \cdot \frac{\alpha}{\tilde k} = \alpha.$$

This method is quite versatile. However, it is also quite conservative in practice. Often, it is recommended to instead use the Tukey method. In Tukey's method, the same test statistic, tij, is used. However, instead of comparing it to a t distribution, it is instead compared to a distribution known as the Studentized Range distribution:4 reject the hypothesis that groups i and j do not differ if $|t_{ij}| \ge \tfrac{1}{\sqrt 2}\, q_{k,N-k,\alpha}$. The value q_{k,N−k,α} comes from first assuming that n1 = . . . = nk = n, and noting that, for some constant c,

$$\mathbb{P}(\exists\, i,j \text{ s.t. } |t_{ij}| > c \mid H_0) = \mathbb{P}\left(\max_{i,j}|t_{ij}| > c \,\Big|\, H_0\right).$$

The distribution of max_{i,j}{|tij|} can be shown to be related to the Studentized Range distribution, and c is set to be $q_{k,N-k,\alpha}/\sqrt 2$ where

$$\mathbb{P}\left(\frac{\sqrt n\,\left(\max_{i=1,\ldots,k}\bar y_{i\cdot} - \min_{i=1,\ldots,k}\bar y_{i\cdot}\right)}{\sqrt{SS_{err}/(N-k)}} > q_{k,N-k,\alpha} \,\Big|\, H_0\right) = \alpha.$$

3 See Wu & Hamada Section 2.2
4 https://en.wikipedia.org/wiki/Studentized_range

When the categories are balanced–i.e. when n1 = . . . = nk = n–then the error rate for Tukey's method is exactly α. However, it can still be applied when the categories are not balanced. In general, Tukey's method will result in tighter confidence intervals than Bonferroni's method. However, it may not be applicable in the more complicated settings to come. Tukey confidence intervals can be computed in R via the function TukeyHSD().

Example

Continuing again with the above example, if TukeyHSD() is used to construct simultaneous confidence intervals for the differences of the means of the four categories, the result is:

          diff         lwr        upr      p adj
B-A  1.0166422 -0.08068497  2.1139694  0.0777351
C-A  0.8341240 -0.26320318  1.9314511  0.1900675
D-A  1.9885274  0.89120026  3.0858546  0.0001230
C-B -0.1825182 -1.27984537  0.9148089  0.9695974
D-B  0.9718852 -0.12544193  2.0692124  0.0981594
D-C  1.1544034  0.05707629  2.2517306  0.0360519

Note that the most significant p-value comes from the difference between categories A and D, which is reasonable as those groups have means of -1 and 1, respectively.
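The table above comes from a call of roughly the following form (a sketch, reusing the aov fit from the simulated example):

tk <- TukeyHSD(aov(y ~ label))   # simultaneous 95% intervals for all pairwise differences
tk
plot(tk)                         # optional visual display of the intervals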

1.3 Random Effects5

The difference between fixed and random effects can be confusing. The models are the same, but the interpretation is different. For example, consider again the Rabbit data. The dosage of the blood pressure drug is a fixed effect; it is chosen by the experimenter, and we are interested in the difference in effect between two different dosages. In contrast, the rabbit itself is a random effect; it is selected at random from the population, and we do not care about the difference between two specific animals, but instead about the overall variation in the population. For a random effects model, we begin as before with

yij = µ + τi + εij

with µ fixed and εij iid N(0, σ²) random variables. However, now the τi are treated as iid random variables with distribution N(0, ν²). The τi and εij are assumed to be independent of each other. In this case, we are not interested in estimating the τi, but instead in estimating ν², which is the between-group variance. Hence, we want to test the hypothesis

H0 : ν² = 0 vs H1 : ν² > 0,

5 See Wu & Hamada Section 2.5

which, in words, asks whether or not there is any variation among the categories. The test statistic is identical to that of the fixed effects setting:

$$F = \frac{SS_{tr}/(k-1)}{SS_{err}/(N-k)} \sim F(k-1,\, N-k).$$

As we do not care about the individual category means in this setting, we also do not care about comparison tests like Tukey's method. Instead, we are interested in estimating the value of ν². As in the fixed effects setting,

SSerr/σ² ∼ χ²(N − k).

Hence, SSerr/(N − k) is an unbiased estimator of σ². Next, we will compute E(SStr/(k − 1)) assuming n1 = . . . = nk = n for simplicity.6

$$\begin{aligned}
\mathbb{E}(SS_{tr}) &= \mathbb{E}\left(\sum_{i=1}^{k} n(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2\right) \\
&= n\sum_{i=1}^{k}\mathbb{E}\left[(\tau_i - \bar\tau)^2 + (\bar\varepsilon_{i\cdot} - \bar\varepsilon_{\cdot\cdot})^2 + 2(\tau_i - \bar\tau)(\bar\varepsilon_{i\cdot} - \bar\varepsilon_{\cdot\cdot})\right] \\
&= nk\left[(1 - k^{-1})\nu^2 + n^{-1}(1 - k^{-1})\sigma^2\right] + 0 \\
&= (k-1)n\nu^2 + (k-1)\sigma^2.
\end{aligned}$$

Hence, E[SStr/(k − 1)] = nν² + σ². Furthermore, we have the unbiased estimator

$$\hat\nu^2 = n^{-1}\left(\frac{SS_{tr}}{k-1} - \frac{SS_{err}}{N-k}\right).$$

Lastly, we can use this estimator to construct a confidence interval for the global mean µ about the estimator µ̂ = ȳ··. That is,

$$\mathrm{Var}(\hat\mu) = \mathrm{Var}(\mu + \bar\tau + \bar\varepsilon) = 0 + k^{-1}\nu^2 + (nk)^{-1}\sigma^2 = (nk)^{-1}(n\nu^2 + \sigma^2) = (nk)^{-1}\,\mathbb{E}[SS_{tr}]/(k-1) \approx \frac{SS_{tr}}{nk(k-1)}.$$

Hence, we get the following 1 − α confidence interval:

$$|\mu - \hat\mu| \le t_{k-1,\alpha/2}\sqrt{SS_{tr}/(nk(k-1))}.$$

6 See Equation 2.45 in Wu & Hamada for the necessary correction when the ni do not all coincide.

Example

Once more continuing with the example, we could choose to consider the labels A, B, C, D as random effects instead of fixed effects. Using the lmer() function from the lme4 package results in

Random effects:
 Groups   Name         Variance  Std.Dev.
 label    (Intercept)  0.5832    0.7637
 Residual              0.8300    0.9111

Fixed effects:
             Estimate  Std. Error  t value
(Intercept)   0.09203     0.40810    0.226

One can check by hand that the variance of the label factor is ν̂² and that the standard error for the intercept term is $\sqrt{SS_{tr}/(nk(k-1))}$.
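A sketch of that check in R, assuming the simulated y and label from before and the lme4 package:

# Random-effects fit and a by-hand check of the variance component
library(lme4)
rfit <- lmer(y ~ 1 + (1 | label))
summary(rfit)                        # variance components for label (nu^2) and residual (sigma^2)

ms   <- summary(aov(y ~ label))[[1]][["Mean Sq"]]
(ms[1] - ms[2]) / 10                 # hat(nu)^2 = (MS_tr - MS_err)/n; matches lmer here (balanced design)
SStr <- ms[1] * (4 - 1)
sqrt(SStr / (10 * 4 * 3))            # sqrt(SS_tr/(nk(k-1))): the intercept's standard error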

1.3.1 Derivation of the F statistic

In both the fixed and random effects settings, we arrive at the same F statistic to test the hypothesis. Namely,

$$F = \frac{SS_{tr}/(k-1)}{SS_{err}/(N-k)} \sim F(k-1,\, N-k).$$

Why is this necessarily the case?

Fixed Effects

In the fixed effects setting, we are interested in testing the null hypothesis that the τi are all equal vs the alternative that at least one pair τi and τj are not equal. We have, in general, that SSerr/σ² ∼ χ²(N − k). Under the null hypothesis, we have that SStr/σ² ∼ χ²(k − 1) and that these two random variables are independent. This can be shown directly or one can use Cochran's theorem. Hence, the above F does indeed follow an F distribution under the null hypothesis. Under the alternative hypothesis, we can show that SStr/σ² ∼ χ²(k − 1, θ) where χ²(k − 1, θ) is a non-central chi squared distribution7 with non-centrality parameter θ > 0. Whereas the mean of a χ²(k − 1) random variable is k − 1, the mean of a χ²(k − 1, θ) random variable is k − 1 + θ. Hence, it is shifted by θ. Consequently, under the alternative hypothesis, the statistic F has a non-central F distribution8 with the same non-centrality parameter θ. While the standard F distribution with DoFs (d1, d2) has mean d2/(d2 − 2), which is approximately 1

7 https://en.wikipedia.org/wiki/Noncentral_chi-squared_distribution
8 https://en.wikipedia.org/wiki/Noncentral_F-distribution

for large sample sizes, the mean of a non-central F distribution with parameters (d1, d2, θ) is

$$\left(\frac{d_2}{d_2-2}\right)\left(\frac{d_1+\theta}{d_1}\right).$$

Hence, as the non-centrality parameter θ grows–i.e. as we move farther from the null hypothesis setting–the mean of the non-central F increases, providing statistical power to reject the null.

Remark 1.3.1. It seems reasonable that one could try to use the Neyman-Pearson lemma to show that this is a most powerful test. However, I have not tried that at this time.

Random Effects

The random effects case follows the fixed effects case almost identically. The main difference is in the beginning. We desire to test the null hypothesis that ν² = 0 against the alternative that it is greater than zero, where ν² is the variance of the τi. Under the null hypothesis, the fact that ν² = 0 implies that all of the τi are equal to zero as they have the degenerate distribution N(0, 0). Thus, we again arrive at an F distribution. Under the alternative hypothesis, ν² > 0, which allows the τi to differ, and we can follow the same argument as above to arrive at a non-central F distribution. Thus, we can use the same F statistic in both the fixed and random effects settings.

1.4 Cochran’s Theorem9

Cochran's theorem allows us to decompose sums of squared normal random variables, as often occurs in statistics, to get independent chi-squared random variables. This ultimately allows for the ubiquitously used F-tests. The theorem is as follows.

Theorem 1.4.1 (Cochran's Theorem, 1934 10). Let Z1, . . . , Zm be independent and identically distributed N(0, σ²) random variables and Z ∈ R^m be the vector of the Zi. For k = 1, . . . , s, let Ak be an m × m symmetric matrix with ijth entry a^k_{ij}. Let Qk be the following quadratic form

$$Q_k = Z^{\mathrm T} A_k Z = \sum_{i,j=1}^{m} a^k_{ij} Z_i Z_j.$$

Lastly, let

$$\sum_{i=1}^{m} Z_i^2 = \sum_{k=1}^{s} Q_k.$$

9 See Wu & Hamada Section 2.4
10 https://doi.org/10.1017/S0305004100016595

Denoting the rank of Ak by rk, we then have that $\sum_{k=1}^{s} r_k = m$ if and only if the Qk are independent random variables with Qk/σ² ∼ χ²(rk).

To prove this theorem, we begin with some lemmas. The first lemma concerns the distribution of a single quadratic form.

Lemma 1.4.2. Let Z1, . . . , Zm be m iid N(0, 1) random variables. Let A ∈ R^{m×m} be a symmetric matrix–i.e. A = A^T. Define $Q = Z^{\mathrm T} A Z = \sum_{i,j=1}^{m} a_{ij} Z_i Z_j$. Then,

$$Q = \sum_{i=1}^{m}\lambda_i W_i^2$$

where the λi are the eigenvalues of A and the Wi are iid N(0, 1) random variables.

Proof. By the spectral theorem for symmetric matrices, we can write A = U^T D U where D is the diagonal matrix of real eigenvalues λ1, . . . , λm and U is the orthonormal matrix of eigenvectors–i.e. U U^T = U^T U = I. As the columns of U form an orthonormal basis for R^m, we have that W = UZ ∼ N(0, Im). Hence,

$$Q = Z^{\mathrm T} A Z = Z^{\mathrm T} U^{\mathrm T} D U Z = (UZ)^{\mathrm T} D (UZ) = W^{\mathrm T} D W = \sum_{i=1}^{m}\lambda_i W_i^2.$$

Remark 1.4.3. Note that if Wi ∼ N(0, 1), then Wi² ∼ χ²(1). Hence, Q from Lemma 1.4.2 is a weighted sum of chi-squared random variables.

Lemma 1.4.4. Given the setup of Lemma 1.4.2, if λ1 = . . . = λr = 1 and λ_{r+1} = . . . = λm = 0 for some 0 < r < m, then Q ∼ χ²(r) and Q′ = Z^T(I − A)Z ∼ χ²(m − r), and Q and Q′ are independent.

Proof. First, if λ1 = . . . = λr = 1 with the other eigenvalues being zero, then by Lemma 1.4.2

$$Q = \sum_{i=1}^{m}\lambda_i W_i^2 = \sum_{i=1}^{r} W_i^2 \sim \chi^2(r)$$

as it is the sum of r independent χ²(1) random variables. Secondly, I − A is also diagonalized by U as

I − A = U TU − U TDU = U T(I − D)U.

The diagonal entries of I − D are r zeros and m − r ones. Hence, as before,

$$Q' = \sum_{i=r+1}^{m} W_i^2 \sim \chi^2(m-r).$$

Lastly, Q is a function of W1, . . . , Wr and Q′ is a function of W_{r+1}, . . . , Wm. We can write Q = f(W1, . . . , Wr) and Q′ = g(W_{r+1}, . . . , Wm) for some functions f : R^r → R and g : R^{m−r} → R. As the Wi are independent random variables, Q and Q′ are functions of independent random variables and, hence, also independent.

Proof (Cochran's Theorem). Without loss of generality, we set σ² = 1. Given the setup of Theorem 1.4.1, assume first that the Qk are independent random variables with Qk ∼ χ²(rk). Then, we have that

$$\sum_{i=1}^{m} Z_i^2 \sim \chi^2(m) \qquad\text{and}\qquad \sum_{k=1}^{s} Q_k \sim \chi^2\!\left(\sum_{k=1}^{s} r_k\right)$$

as the degrees of freedom add when independent chi-squared random variables are summed. Hence, $\sum_{k=1}^{s} r_k = m$.

Next, assume that $\sum_{k=1}^{s} r_k = m$. We begin by considering $Q_1 = Z^{\mathrm T} A_1 Z$. Denote $A_{-1} = \sum_{k=2}^{s} A_k$. From Lemma 1.4.4, we have that if A1 is an m × m symmetric matrix then A1 and $A_{-1} = I_m - A_1$ are simultaneously diagonalizable. That is,

$$I_m = A_1 + A_{-1} = U^{\mathrm T} D_1 U + U^{\mathrm T} D_{-1} U = U^{\mathrm T}(D_1 + D_{-1})U.$$

Furthermore, $D_1 + D_{-1} = I_m$ where $\mathrm{rank}(D_1) = r_1$ and $\mathrm{rank}(D_{-1}) = m - r_1$. This implies that only r1 and m − r1 of the diagonal entries of D1 and D−1 are non-zero, respectively. Hence, without loss of generality, we can reorder the basis vectors in U such that

$$D_1 = \begin{pmatrix} I_{r_1} & 0 \\ 0 & 0_{m-r_1} \end{pmatrix} \qquad\text{and}\qquad D_{-1} = \begin{pmatrix} 0_{r_1} & 0 \\ 0 & I_{m-r_1} \end{pmatrix}.$$

From Lemma 1.4.4, we have that Q1 and $Q_{-1} = \sum_{k=2}^{s} Q_k$ are independent random variables with distributions χ²(r1) and χ²(m − r1), respectively. Now, looking at Q2, . . . , Qs, we have that

$$U(A_2 + \cdots + A_s)U^{\mathrm T} = \begin{pmatrix} 0_{r_1} & 0 \\ 0 & I_{m-r_1} \end{pmatrix}.$$

Thus, we can simultaneously diagonalize A2 and $A_{-2} = \sum_{k=3}^{s} A_k$ and proceed as above to get that

$$D_2 = \begin{pmatrix} 0_{r_1} & 0 & 0 \\ 0 & I_{r_2} & 0 \\ 0 & 0 & 0_{m-r_1-r_2} \end{pmatrix} \qquad\text{and}\qquad D_{-2} = \begin{pmatrix} 0_{r_1} & 0 & 0 \\ 0 & 0_{r_2} & 0 \\ 0 & 0 & I_{m-r_1-r_2} \end{pmatrix}$$

and that Q2 ∼ χ²(r2) is independent of Q−2 ∼ χ²(m − r1 − r2). This process can be iterated to get the desired result.
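As a sanity check rather than a proof, the one-way decomposition SStot = SStr + SSerr can be examined numerically under the null; a sketch with arbitrary simulation settings:

# Simulate pure-noise one-way data and look at the behaviour of SS_tr and SS_err
set.seed(2)
k <- 4; n <- 10; N <- n * k
g <- factor(rep(1:k, each = n))
sims <- replicate(2000, {
  z  <- rnorm(N)                                   # all tau_i = 0, sigma = 1
  gm <- tapply(z, g, mean)
  c(tr = n * sum((gm - mean(z))^2),                # SS_tr
    er = sum((z - rep(gm, each = n))^2))           # SS_err
})
rowMeans(sims)                   # approx (k-1, N-k) = (3, 36), the chi-squared means
cor(sims["tr", ], sims["er", ])  # approx 0, reflecting the independence of the two terms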

Chapter 2

Multiple Factors

Introduction

This chapter continues from the previous one: whereas before we only considered a single factor, now we will consider multiple factors. Having more than one factor leads to many new issues to consider, which mainly revolve around proper experimental design to tease out the effects of each factor as well as potential cross effects.

2.1 Randomized Block Design1

As mentioned in the terminology section of Chapter 1, blocking is a key concept in this course. The goal is to group similar subjects into blocks with the idea that the within block variance should be much smaller than the between block variance. The purpose of such blocking is to remove or control unwanted variability to allow for better estimation and testing of the actual factor under consideration. Lastly, a blocking is called complete if every treatment is considered in every block. Some examples of blocking are as follows. Blocking by gender in a drug trial could be used to remove the variation between the different ways the drug affects men vs women. In an agricultural experiment, one could block by fields to take into account soil differences, amount of sunlight, and other factors that we are not explicitly considering. First, we consider a model with two factors: a treatment factor and a blocking factor. Let b be the number of blocks and k be the number of treatments. Then, we write

yij = µ + βi + τj + εij

where yij ∈ R is the measurement for block i and treatment j, µ is the global mean, βi is the effect of block i, τj is the effect of treatment j, and εij is the random noise, which is again assumed to be iid N(0, σ²). Note that the sample size here is

1 See Wu & Hamada Section 3.2

N = bk, which is the complete randomized block design. The experiment could be replicated n times to achieve a new sample of size N = nbk if desired.

Remark 2.1.1 (Constraints / Contrasts). As before with the one-way ANOVA, we will require some constraint on the βi and the τj to allow for the model to be well-posed. Unless otherwise stated, we will use the sum-to-zero constraint for both terms. That is,

$$\sum_{i=1}^{b}\beta_i = 0 = \sum_{j=1}^{k}\tau_j.$$

Decomposing each yij into a sum of block, treatment, and residual effects gives us

$$y_{ij} = \bar y_{\cdot\cdot} + (\bar y_{i\cdot} - \bar y_{\cdot\cdot}) + (\bar y_{\cdot j} - \bar y_{\cdot\cdot}) + (y_{ij} - \bar y_{i\cdot} - \bar y_{\cdot j} + \bar y_{\cdot\cdot}),$$

which, as before, can be squared and summed to get

$$SS_{tot} = \sum_{i=1}^{b}\sum_{j=1}^{k}(y_{ij} - \bar y_{\cdot\cdot})^2 = \sum_{i=1}^{b} k(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 + \sum_{j=1}^{k} b(\bar y_{\cdot j} - \bar y_{\cdot\cdot})^2 + \sum_{i=1}^{b}\sum_{j=1}^{k}(y_{ij} - \bar y_{i\cdot} - \bar y_{\cdot j} + \bar y_{\cdot\cdot})^2 = SS_{bl} + SS_{tr} + SS_{err}.$$

As we are interested in determining whether or not the treatment has had any effect on the observed yij, the hypotheses to test are identical to those from the previous chapter. Namely,

H0 : τ1 = · · · = τk vs H1 : ∃ j1, j2 s.t. τj1 ≠ τj2.

By Cochran’s Theorem, under the null hypothesis, the terms SSbl, SStr, and SSerr are all independent chi squared random variables with degrees of freedom b − 1, k − 1, and (b − 1)(k − 1), respectively. The test statistic for the above hypothesis test is another F statistic of the form

$$F = \frac{SS_{tr}/(k-1)}{SS_{err}/[(b-1)(k-1)]} \sim F(k-1,\,(b-1)(k-1)) \quad\text{under } H_0.$$

This can all be summarized as before in a table.

            DoF             Sum Squares  Mean Squares             F value
Block       b − 1           SSbl         SSbl/(b − 1)             Fbl
Treatment   k − 1           SStr         SStr/(k − 1)             Ftr
Residuals   (b − 1)(k − 1)  SSerr        SSerr/[(b − 1)(k − 1)]

Remark 2.1.2 (Multiple Comparisons). Similarly to the case of one-way ANOVA, a post-hoc Tukey test can be used to construct simultaneous confidence intervals for every paired difference $\tau_{j_1} - \tau_{j_2}$ for 1 ≤ j1 < j2 ≤ k. In Chapter 1, we divided by $\sqrt{n_i^{-1} + n_j^{-1}}$. Now, as there are b observations for each treatment, we replace that with $\sqrt{b^{-1} + b^{-1}}$ and get a t-statistic of the form

$$t_{j_1 j_2} = \frac{\bar y_{\cdot j_1} - \bar y_{\cdot j_2}}{\sqrt{2b^{-1}}\sqrt{SS_{err}/[(b-1)(k-1)]}}.$$

Example

Consider the setting where b = 6 and k = 4 with

β = (−3, −2, −1, 1, 2, 3)/3, and τ = (0, 0, 1, 2)

with εij iid N(0, 0.5625). Such data was generated in R and run through the aov() function to get the following table:

            Df  Sum Sq  Mean Sq  F value  Pr(> F)
block        5   14.46    2.893    4.577  0.00981 **
treat        3   15.80    5.266    8.332  0.00169 **
Residuals   15    9.48    0.632

Two box plots of the data partitioned by blocking factor and by treatment factor are displayed in Figure 2.1. If we were to ignore the blocking factor, then the treatment would yield a less significant test statistic.

            Df  Sum Sq  Mean Sq  F value  Pr(> F)
treat        3   15.80    5.266    4.398   0.0157 *
Residuals   20   23.94    1.197
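The two tables can be generated with a sketch along these lines (the seed, and hence the exact values, are arbitrary):

# Randomized block design: b = 6 blocks, k = 4 treatments, one observation per cell
set.seed(3)
block <- factor(rep(1:6, times = 4))
treat <- factor(rep(LETTERS[1:4], each = 6))
beta  <- c(-3, -2, -1, 1, 2, 3) / 3
tau   <- c(A = 0, B = 0, C = 1, D = 2)
y     <- beta[block] + tau[as.character(treat)] + rnorm(24, sd = 0.75)
summary(aov(y ~ block + treat))   # with the blocking factor
summary(aov(y ~ treat))           # ignoring the blocking factor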

2.1.1 Paired vs Unpaired Data2

An example of a common randomized block design is known as a paired comparison design. This occurs when the blocks are of size k = 2. This implies that each block contains two subjects, which can be randomly assigned to a treatment or control group. For example, if you wanted to test for the effect difference between a drug and a placebo, you could consider twins where one sibling is randomly given the drug and the other is given the placebo. Due to the similar genetic makeup of each pair of twins,

2 See Wu & Hamada Section 3.1


Figure 2.1: Box plots of blocking effects and treatment effects on the observed data.

such a study will have more power in detecting the effect of the drug as opposed to a similar study where the entire set of subjects is partitioned into two groups, control and test, at random. Another example could be human eyes. Instead of randomly assigning b subjects into either a control or a test group, each of the subjects could have a single eye randomly assigned to one group and the other eye assigned to the other group.

In the randomized block design notation, a paired t-test would have the statistic

$$t_{paired} = \frac{\bar y_{\cdot 1} - \bar y_{\cdot 2}}{\sqrt{s^2/b}}$$

where $s^2 = (b-1)^{-1}\sum_{i=1}^{b}\left((y_{i1} - y_{i2}) - (\bar y_{\cdot 1} - \bar y_{\cdot 2})\right)^2$ is the sample variance of the differences. This statistic has a t(b − 1) distribution under the null hypothesis. In comparison, an unpaired test would take on the form of the one-way ANOVA from Chapter 1. The test statistic would be

$$t_{unpaired} = \frac{\bar y_{\cdot 1} - \bar y_{\cdot 2}}{\sqrt{(s_1^2 + s_2^2)/b}}$$

where s1² and s2² are the sample variances of the yi1 and the yi2, respectively. This statistic will have a t(2b − 2) distribution under the null hypothesis.

If the pairing is effective, then we should have s² < s1² + s2². This would make the test statistic larger and, hence, more significant. In the unpaired case, the DoF increases, so the decrease in the variance estimate must be large enough to counteract the increase in the DoF.
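A small simulation makes the comparison concrete; this is a sketch with hypothetical numbers, not data from the notes:

# Paired vs unpaired t-tests when there is a strong block (pair) effect
set.seed(4)
b    <- 20
pair <- rnorm(b, sd = 2)            # shared effect within each pair (e.g. twins)
y1   <- 1 + pair + rnorm(b)         # treatment measurement
y2   <- 0 + pair + rnorm(b)         # control measurement
t.test(y1, y2, paired = TRUE)       # uses the b differences; the pair effect cancels
t.test(y1, y2, var.equal = TRUE)    # unpaired two-sample test; pair effect inflates s1^2 + s2^2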

2.1.2 Tukey's One DoF Test3

In Tukey's article, he proposes to consider “when the analysis has been conducted in terms where the effects of rows and columns are not additive” where row and column would refer to the two factors in the model.

3 One Degree of Freedom for Non-Additivity, JW Tukey, Biometrics, Vol. 5, No. 3 (Sep., 1949) or https://en.wikipedia.org/wiki/Tukey%27s_test_of_additivity

18 Consider the two factor model

yij = µ + αi + βj + εij

where we only have N = kA·kB observations. Without replicating this experiment, we do not have enough observations to estimate the interaction effects (αβ)ij. Often, researchers will claim that such terms are negligible due to "expertise" in the field of study. However, Tukey has provided us with a way to test for non-additive interactions between factors A and B. As noted, we cannot test a general interaction term in a model such as

yij = µ + αi + βj + (αβ)ij + εij without replicating the experiment. However, we can consider a model of the form

yij = µ + αi + βj + λαiβj + εij,

and run the hypothesis test H0 : λ = 0 vs H1 : λ ≠ 0. As we have already estimated the αi and βj, we only require a single DoF to estimate and test λ. Doing the usual sum of squares decomposition results in

Term    Formula                                                                                                                  DoF
SSA     $k_B \sum_{i=1}^{k_A} (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2$                                                         kA − 1
SSB     $k_A \sum_{j=1}^{k_B} (\bar y_{\cdot j} - \bar y_{\cdot\cdot})^2$                                                        kB − 1
SSλ     $\left[\sum_{i=1}^{k_A}\sum_{j=1}^{k_B} y_{ij}\hat\alpha_i\hat\beta_j\right]^2 \left[\sum_i \hat\alpha_i^2 \sum_j \hat\beta_j^2\right]^{-1}$   1
SSerr   SStot − SSA − SSB − SSλ                                                                                                  (kA − 1)(kB − 1) − 1

Hence, the F-statistic4 is

$$\frac{SS_\lambda}{SS_{err}/[(k_A-1)(k_B-1)-1]} \sim F(1,\,(k_A-1)(k_B-1)-1).$$

This test is implemented in the dae library with the tukey.1df() function. Tukey's recommendation is

“The occurrence of a large non-additivity mean square should lead to consideration of a transformation followed by a new analysis of the transformed variable. This consideration should include two steps: (a) inquiry whether the non-additivity was due to analysis in the wrong form or to one or more unusually discrepant values; (b) in case no unusually discrepant values are found or indicated, inquiry into how much of a transformation is needed to restore additivity.”

This procedure can be run on the Rabbit data with the commands

4 Note that I have not verified that the SSλ formula is correct.

(1) md = aov( BPchange ~ Error(Animal) + Dose:Treatment, data=Rabbit )

(2) tukey.1df(md, data=Rabbit)

which gives an F statistic of 2.39 and a p-value of 0.129. Hence, we do not reject the λ = 0 assumption.

2.2 Two-Way Layout

In the first chapter, we were concerned with one-way layouts of the form

yij = µ + τi + εij.

Now, we will generalize this to two-way layout–with higher order layouts also possible– as yijl = µ + αi + βj + γij + εijl.

Here, we have two factors, say A and B, with kA and kB factor levels, respectively. Then, µ is still the global mean, αi for i = 1, . . . , kA is the effect of the ith level of factor A, similarly βj for j = 1, . . . , kB is the effect of the jth level of factor B, and lastly, γij is the interaction effect between the ith level of A and the jth level of B. Meanwhile, εijl is once again the iid N(0, σ²) random noise for l = 1, . . . , n. Once again for simplicity, we will assume all of the treatments–combinations of factor levels–are observed the same number of times, n.

Remark 2.2.1 (Blocks vs Experimental Factors). Note that the two-way layout is different from the randomized block design mainly in that a blocking factor is not treated as an experimental factor. A block is a way of splitting your data into more homogeneous groups in order to (hopefully) reduce the variance and achieve a more powerful test. An additional experimental factor results in kA × kB different treatments to consider and apply randomly to n subjects each. Furthermore, we are generally concerned with interaction effects between experimental factors, but not concerned with interaction effects concerning a blocking factor.

2.2.1 Fixed Effects5 Besides the added complexity of an additional factor and the interaction terms, we can proceed as in the one-way setting to decompose

$$\sum_{i=1}^{k_A}\sum_{j=1}^{k_B}\sum_{l=1}^{n}(y_{ijl} - \bar y_{\cdot\cdot\cdot})^2 = SS_A + SS_B + SS_{A\times B} + SS_{err}$$

5 See Wu & Hamada Section 3.3

where

$$SS_A = nk_B\sum_{i=1}^{k_A}(\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2, \qquad SS_B = nk_A\sum_{j=1}^{k_B}(\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot})^2,$$
$$SS_{A\times B} = n\sum_{i=1}^{k_A}\sum_{j=1}^{k_B}(\bar y_{ij\cdot} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot})^2, \qquad SS_{err} = \sum_{i,j,l}(y_{ijl} - \bar y_{ij\cdot})^2.$$

Under the null hypothesis that the αi, βj, γij are all zero, we have that these sums of squares are all chi squared distributed by Cochran’s theorem. The degrees of freedom are

$$SS_A \sim \chi^2(k_A - 1), \quad SS_B \sim \chi^2(k_B - 1), \quad SS_{A\times B} \sim \chi^2((k_A-1)(k_B-1)), \quad SS_{err} \sim \chi^2(nk_Ak_B - k_Ak_B).$$

Thus, if we want to test for the significance of factor A, factor B, or the interaction effects, we can construct an F test based on taking the corresponding sum of squares and dividing by the error sum of squares, after normalizing by the DoFs, of course. Note that if three F tests are run to test for the effects of A, B, and A×B, then we should adjust for multiple testing using Bonferroni's method. That is, declare a test to be significant at the α level if the p-value is smaller than α/3.
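In R this amounts to fitting the full two-way model and adjusting the three p-values; a sketch with simulated data (the factors A, B and response y here are hypothetical):

# Two-way layout with a Bonferroni adjustment over the three F tests
set.seed(5)
A <- factor(rep(1:3, each = 8))
B <- factor(rep(1:2, times = 12))
y <- 0.5 * (A == "2") + rnorm(24)
tab <- summary(aov(y ~ A * B))[[1]]
p.adjust(tab[["Pr(>F)"]][1:3], method = "bonferroni")  # adjusted p-values for A, B, A:B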

2.3 Latin Squares6

A k dimensional Latin square is a k × k array of symbols a1, . . . , ak such that each row and column contains exactly one instance of each of the ai. An example would be a Sudoku puzzle where k = 9.7 Such a combinatorial object can be used in an experiment with precisely 2 blocking factors and 1 experimental factor when all three of these factors take on precisely k levels. Consequently, there are N = k² total observations in such a design. The term Latin is used as the experimental factor levels are often denoted by the letters A, B, C, etc, resulting in a table that may look like

                    Block 2
            1   2   3   4
         1  A   B   C   D
Block 1  2  B   A   D   C
         3  C   D   A   B
         4  D   C   B   A

6 See Wu & Hamada Section 3.6 7https://en.wikipedia.org/wiki/Latin_square

21 Note that the resulting matrix does not necessarily have to be symmetric. In this setting, we have a model of the form

yijl = µ + αi + βj + τl + εijl where αi is the effect of the ith row, βj is the effect of the jth column, τl is the 2 effect of the lth letter, and εijl is random N 0, σ noise. Note that we do not have enough data in this setting to consider interaction terms. If this experiment were replicated n times, then we could consider such terms. As with the previous models, we get a sum of squares decomposition. Note that the sum over i, j, l only contains k2 terms as given a row i and a column j, the choice for l has already been made.

X 2 SStot = (¯yijl − y¯···) = SSrow + SScol + SStr + SSerr = (i,j,l) k k k X 2 X 2 X 2 = k(¯yi·· − y¯···) + k(¯y·j· − y¯···) + k(¯y··l − y¯···) + i=1 j=1 l=1 X 2 + (yijl − y¯i·· − y¯·j· − y¯··l + 2¯y···) (i,j,k)

The degrees of freedom for the four terms are k − 1 for SSrow, SScol, and SStr and 2 is (k − 1) − 3(k − 1) = (k − 2)(k − 1) for SSerr. Using Cochran’s theorem, we can perform an F test for the hypotheses

H0 : τ1 = ... = τk H1 : ∃l1 6= l2 s.t. τl1 6= τl2 .

Namely, under the null hypothesis, we have that SS /(k − 1) F = tr ∼ F (k − 1, (k − 1)(k − 2)) . SSerr/(k − 1)(k − 2)

Given that we reject H0, we can follow this test with a post-hoc Tukey test to determine which differences τi − τj are significant. Remark 2.3.1 (Why would we want to do this?). In the below data example, there is little difference between blocking and not blocking. The key point to remember is that a “good” blocking will reduce variation and result in a more significant test statistic. Thus, it can sometimes be the difference between rejecting and not rejecting H0.

Example: Fungicide

From the text of C. H. Goulden, Methods of Statistical Analysis, and also from the R package agridat, we consider dusting wheat crops with sulphur for the purposes

of controlling the growth of a certain fungus. In this experiment, k = 5 and the blocking factors are literally the rows and columns of a plot of land partitioned into 25 subplots. The responses yijl are the yields of the crop in bushels per acre. Each treatment–A, B, C, D, E–corresponds to a different dusting method of the sulphur, including treatment E which is no dusting. The design and the data are, respectively,

BDEAC 4.9 6.4 3.3 9.5 11.8 CABED 9.3 4.0 6.2 5.1 5.4     DCABE 7.6 15.4 6.5 6.0 4.6 .     EBCDA 5.3 7.6 13.2 8.6 4.9 AEDCB 9.3 6.3 11.8 15.9 7.6

Running an ANOVA in R gives the following output:

            Df  Sum Sq  Mean Sq  F value   Pr(> F)
row          4   46.67    11.67    4.992    0.0133 *
col          4   14.02     3.50    1.500    0.2634
trt          4  196.61    49.15   21.032  2.37e-05 ***
Residuals   12   28.04     2.34

A post-hoc Tukey test indicates significant differences between (A, C), (B, C), (D, C), and (E, C). Hence, treatment C, which was “dust weekly”, led to increases in yield over all other cases. If the same experiment is run without considering the blocking factors, the same conclusions are reached, but the test statistics are less significant. Note that the p-values for the blocking factors are generally not of interest. However, the mean squares do quantify the variation explained by the blocking factors, which can be used to determine whether or not the blocking was successful. From Goulden (1957),

“The column mean square is not significant. This is probably due to the shape of the plots. They were long and narrow; hence the columns are narrow strips running the length of the rectangular area. Under these conditions the Latin square may have little advantage on the average over a randomized block plan.”
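A sketch of the analysis in R, assuming the fungicide data are available as the goulden.latin dataset in agridat with columns yield, trt, row, and col (the exact dataset and column names are an assumption here):

# Latin square ANOVA: rows and columns as blocking factors, trt as the treatment
library(agridat)
data(goulden.latin)
fit <- aov(yield ~ factor(row) + factor(col) + trt, data = goulden.latin)
summary(fit)
TukeyHSD(fit, which = "trt")   # pairwise treatment differences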

2.3.1 Graeco-Latin Squares8

The Graeco-Latin square is an extension of the Latin square where two blocking factors and two experimental factors, each with k levels, are considered. The sample size is again k². The idea is to superimpose two orthogonal Latin squares such that every entry is unique. Generally, the second square uses Greek letters to differentiate

8 See Wu & Hamada Section 3.7

it from the first.

A B C D      α β γ δ        Aα Bβ Cγ Dδ
B A D C      γ δ α β        Bγ Aδ Dα Cβ
C D A B  ,   δ γ β α   ⇒    Cδ Dγ Aβ Bα
D C B A      β α δ γ        Dβ Cα Bδ Aγ

The analysis follows exactly as above given the model with one additional term corresponding to the Greek letters:

yijlm = µ + αi + βj + τl + κm + εijlm

The degrees of freedom for all of the factors are still k − 1 and the degrees of freedom for the residuals are now (k² − 1) − 4(k − 1) = (k − 3)(k − 1).

Remark 2.3.2 (Fun Fact). There are no Graeco-Latin squares of order 6.

2.4 Balanced Incomplete Block Designs9

Returning to the concept of the Randomized Block Design from Section 2.1, imagine that we have a single experimental factor and a single blocking factor. Originally, we had b blocks, k treatments, and a total of bk observations. The block size–i.e. the number of observations per block–was precisely k. In the Balanced Incomplete Block Design (BIBD), we consider the case where the block size k is less than t, the total number of treatments. Furthermore, we require the notation r for the number of blocks containing each treatment and λ for the number of blocks containing each pair of treatments. For such a design to be balanced, every pair of treatments must be considered in the same number of blocks. As an example, consider

              Treatment
block      A      B      C      D
  1     y1,1   y1,2      .      .
  2        .   y2,2   y2,3      .
  3        .      .   y3,3   y3,4
  4     y4,1      .   y4,3      .
  5        .   y5,2      .   y5,4
  6     y6,1      .      .   y6,4

Here, we have t = 4 treatments, b = 6 blocks, k = 2 treatments per block, r = 3 blocks per treatment, and λ = 1 block for each pair of treatments. This type of design occurs when only so many treatments can be applied to a single block before that block is saturated. For example, if one were to apply

9 See Wu & Hamada Section 3.8

24 rust protective coatings to steel girders, then there is only so much surface area per girder to apply differing treatments. In taste testing scenarios, only so many samples can be tasted by a judge before they lose the ability to be discerning. For testing antibiotics, only so many different drugs can be applied to a bacteria culture. There are some relationships that are required for such a blocking to be valid. They are

bk = rt              Both are the total sample size
r(k − 1) = λ(t − 1)  More equivalent counting
t > k                More total treatments than those in a given block
r > λ                Single occurrences greater than paired occurrences
b > r                More total blocks than those given a certain treatment
rk > λt              Follows from above

The model in this setting is identical to that of Section 2.1, but with missing data points:

yij = µ + βi + τj + εij.

The sum of squares decomposition results in

Term    DoF             Equation
SSbl    b − 1           $k\sum_{i=1}^{b}(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2$
SStr    t − 1           $k\sum_{j=1}^{t} Q_j^2 / (\lambda t)$
SSerr   bk − b − t + 1  SStot − SStr − SSbl
SStot   bk − 1          $\sum_{i,j}(y_{ij} - \bar y_{\cdot\cdot})^2$

where, for the treatment sum of squares,

$$Q_j = r\left(\bar y_{\cdot j} - \frac{1}{r}\sum_{i=1}^{b}\bar y_{i\cdot}\,\mathbb{1}[y_{ij}\text{ exists}]\right).$$

This formula arises from the usual ȳ·j − ȳ··, but replacing the global average with the average of the block averages for only the r blocks where treatment j occurred. Once again, we can divide the treatment sum of squares by the error sum of squares to get an F statistic via Cochran's theorem. This can be followed up with a Tukey test for individual differences τi − τj.

Example

Continuing with the example from Section 2.1, we remove 12 of the entries from the data table to match the pattern from the example design at the beginning of this section. In this case, R produces the following table from aov().

            Df  Sum Sq  Mean Sq  F value  Pr(> F)
block        5  19.122    3.824   39.059  0.00623 **
treat        3   2.233    0.744    7.602  0.06489 .
Residuals    3   0.294    0.098

The treatment effect is very weak with the reduced sample, but perhaps is strong enough to warrant a follow up study. On the other hand, if we were to ignore the blocking factor, then the results are not significant at all.

            Df  Sum Sq  Mean Sq  F value  Pr(> F)
treat        3   8.002    2.667    1.564    0.272
Residuals    8  13.647    1.706
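A sketch of the corresponding R calls (the reduced data frame, here called dat with columns y, block, and treat, is hypothetical). With an incomplete design the block and treatment factors are no longer orthogonal, so the treatment term should be listed after the block term to obtain the block-adjusted treatment sum of squares.

# BIBD analysis: treatments adjusted for blocks (order matters here)
summary(aov(y ~ block + treat, data = dat))
# Ignoring the blocking factor, as in the second table above
summary(aov(y ~ treat, data = dat))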

2.5 Split-Plot Designs10

The concept of a split plot design–as well as the name split plot–comes directly from agricultural research. Given two experimental factors with k1 and k2 levels that we wish to apply to a field of some crops, the ideal way to proceed is to cut the field, or plot of land, into k1k2 subplots and apply one of the k1k2 treatments to each subplot at random. However, such an approach is often impractical if one of the treatments is only able to be applied to large areas of land. As a result, we could instead split the plot of land into k1 regions, apply different levels of the first experimental factor to each, then split each of those regions into k2 subregions and apply different levels of the second experimental factor. In this case, the first factor is sometimes referred to as the whole plot factor and the second as the subplot factor. To begin, say we have experimental factors A and B with kA and kB levels, respectively. Furthermore, assume the experiment is replicated n times. Thus, the total sample size is nkAkB. If we were able to fully randomize the kAkB treatments, then we would have a two-way fixed effects model:

yijl = µ + αi + βj + γij + εijl.

However, as we have instead randomized factor A and then, given a level for A, randomized factor B, the model has to change. Consequently, the specific replicate can have an effect on the observation yijl, which leads to a three-way mixed effects model:

yijl = µ + αi + βj + τl + (αβ)ij + (ατ)il + (βτ)jl + (αβτ)ijl

where αi is the effect of the ith level of factor A, βj is the effect of the jth level of fac- tor B, τl is the effect of the lth replicate, the three paired terms, (αβ)ij, (ατ)il, (βτ)jl,

10 See Wu & Hamada Section 3.9

26 quantify the three pairwise interaction effects, and lastly, (αβτ)ijl quantifies the three-way interaction effect. Usually, the two experimental factors A and B are treated as fixed effects and the replication factor as a random effect resulting in a so-called mixed effects model. We can categorize the eight terms as

             Fixed          Random
Whole Plot   αi             τl, (ατ)il
Sub Plot     βj, (αβ)ij     (βτ)jl, (αβτ)ijl

Note that there is no εijl term (even though there is one in Wu & Hamada, Equation 3.59): if we consider the degrees of freedom, there are none left to estimate an εijl term.

Term   Total    µ   α        β        τ       (αβ)
DoFs   nkAkB    1   kA − 1   kB − 1   n − 1   (kA − 1)(kB − 1)

Term   (ατ)              (βτ)              (αβτ)
DoFs   (kA − 1)(n − 1)   (kB − 1)(n − 1)   (kA − 1)(kB − 1)(n − 1)

Hence, if we were to include such a term, it would not be identifiable. Instead, we consider two error terms based on the random effects:

$$\varepsilon^{whole}_{il} = (\alpha\tau)_{il} \qquad\text{and}\qquad \varepsilon^{sub}_{ijl} = (\beta\tau)_{jl} + (\alpha\beta\tau)_{ijl}.$$

For split-plot designs, as with the previous, we are interested in testing for whether or not there are differences in the effects of the factor levels. Applying the sum-to-zero constraints,

$$\sum_{i=1}^{k_A}\alpha_i = 0, \qquad \sum_{j=1}^{k_B}\beta_j = 0, \qquad\text{and}\qquad \sum_{i,j}(\alpha\beta)_{ij} = 0,$$

results in three null hypotheses of interest,

H01 : αi = 0 ∀i, H02 : βj = 0 ∀j, and H03 :(αβ)ij = 0 ∀i, j.

To test these, we decompose the total sum of squares as usual:

SStot = SSA + SSB + SSR + SSA×B + SSA×R + SSB×R + SSA×B×R, which can be rewritten as

SStot = SSA + SSB + SSR + SSA×B + SSwhole + SSsub.

Finally, we can write out the F statistics in the following table:

Sum Sq    DoF                  F stat
SSA       kA − 1               (SSA/dfA) / (SSwhole/dfwhole)
SSR       n − 1
SSwhole   (kA − 1)(n − 1)
SSB       kB − 1               (SSB/dfB) / (SSsub/dfsub)
SSA×B     (kA − 1)(kB − 1)     (SSA×B/dfA×B) / (SSsub/dfsub)
SSsub     kA(kB − 1)(n − 1)

Remark 2.5.1 (Example of two-way vs three-way). The two-way model would be used if, for example, we had kAkB human subjects who were each randomly assigned a specific level of factors A and B, and then this experiment was replicated with n groups of subjects. In the three-way model, we are assuming that each of the n replicates corresponds to a single subject, like a plot of land; hence, there could be replicate effects that, unlike in the previous setting, are not removed by randomization.

Example As an example, we can look at the stroup.splitplot dataset from the R package agridat. This is simulated data in the form of a split plot design. The dataset has N = 24 observations, yijl, with whole-plot factor A with 3 levels, sub-plot factor B with 2 levels, and n = 4 replicates. The design looks like

              A1
  r1       r2       r3       r4
B1, B2   B1, B2   B1, B2   B1, B2

for treatment A1, with two other identical blocks for A2 and A3. First, we will analyse the data as a two-way fixed effects model,

yijl = µ + αi + βj + (αβ)ij + εijl, using the R command aov( y∼a*b, data=stroup.splitplot ). This yields three F statistics none of which are significant.

            Df  Sum Sq  Mean Sq  F value  Pr(> F)
a            2   326.6   163.29    1.874    0.182
b            1   181.5   181.50    2.083    0.166
a:b          2    75.3    37.63    0.432    0.656
Residuals   18  1568.5    87.14

Next, we can include the replications as a random effect,

yijl = µ + αi + βj + (αβ)ij + τl + εijl,

28 using the command aov( y∼a*b + Error(rep), data=stroup.splitplot ). The result is now a significant result for factors A and B, but not for the interaction term.

Error: rep
            Df  Sum Sq  Mean Sq  F value  Pr(> F)
Residuals    3    1244    414.5

Error: Within
            Df  Sum Sq  Mean Sq  F value  Pr(> F)
a            2   326.6   163.29    7.537  0.00542 **
b            1   181.5   181.50    8.377  0.01112 *
a:b          2    75.3    37.63    1.737  0.20972
Residuals   15   325.0    21.67

Notice that the degrees of freedom for the within residuals have been reduced by 3 and a large amount of the error sum of squares is now contained in the rep factor. Lastly, we can fit the split plot design,

yijl = µ + αi + τl + εwhole + βj + (αβ)ij + εsub,

using the command aov( y ~ a*b + Error(rep/a), data=stroup.splitplot ). As a result, the interaction term a:b looks more significant whereas factor A looks less significant.

Replication Effect
Error: rep
            Df  Sum Sq  Mean Sq  F value  Pr(> F)
Residuals    3    1244    414.5

Whole Plot Terms
Error: rep:a
            Df  Sum Sq  Mean Sq  F value  Pr(> F)
a            2   326.6   163.29     4.07   0.0764 .
Residuals    6   240.8    40.13

Sub Plot Terms
Error: Within
            Df  Sum Sq  Mean Sq  F value  Pr(> F)
b            1  181.50   181.50   19.389  0.00171 **
a:b          2   75.25    37.63    4.019  0.05658 .
Residuals    9   84.25     9.36

Often, the significance of the whole plot factor A will be overstated if a split plot design is not analysed as such.
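An alternative, and often more convenient, way to fit the same split-plot structure is as a mixed model; a sketch assuming lme4 and the stroup.splitplot data from agridat:

# Mixed-model formulation: whole plots (rep x a combinations) as a random effect
library(lme4)
library(agridat)
data(stroup.splitplot)
m <- lmer(y ~ a * b + (1 | rep) + (1 | rep:a), data = stroup.splitplot)
anova(m)     # F statistics for a, b, and a:b, comparable to the split-plot aov above
summary(m)   # variance components for rep and rep:a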

29 2.6 Analysis of Covariance11

Sometimes when we make observations yij, we are faced with a variable xij that cannot be controlled and varies with every observation. We can include xij, referred to as a covariate, in the model in order to account for it. One common type of covariate is size, such as the body weight of a test subject or the population of a city. In the below example, we consider the number of plants in a plot as a covariate of the yield of the plot. The model we consider combines linear regression with the one-way ANOVA:

Model 1: yij = µ + τi + βxij + εij

where µ is the global mean, τi is the ith treatment effect, xij ∈ R is the covariate, β is the unknown regression coefficient, and, as usual, εij is the iid normal noise. To test whether or not the treatment terms, τi’s, have any detectable effect on the yij, we compare the above model to the simpler model

Model 0: yij = µ + βxij + εij.

Assume, as before, we have k treatment levels, n observations of each treatment, and a total sample of size N. Following the usual procedure from linear regression, we can compute the residual sum of squares for each model to get RSS0 and RSS1 with degrees of freedom N − 2 and (N − 2) − (k − 1) = N − k − 1, respectively.12 The resulting F statistic is

$$F = \frac{(RSS_0 - RSS_1)/(k-1)}{RSS_1/(N-k-1)},$$

which, under the null hypothesis–i.e. that all of the τi coincide–has an F distribution with degrees of freedom k − 1 and N − k − 1, respectively. The above Model 1 assumes that the slope β is independent of the treatment category i. We could allow for variable slopes as well with

Model 2: yij = µ + τi + (β + αi)xij + εij where the αi term also has k − 1 degrees of freedom. Note that as µ and τi are the global and category mean effects, we have β and αi as the global and category slope effects. Testing would proceed as before taking into account the new RSS2 with degrees of freedom N − 2 − 2(k − 1) = N − 2k.

Remark 2.6.1 (Warning!). All of the previous models when put into R’s aov() function are independent of the variable order. This is because when the factors

11 See Wu & Hamada Section 3.10
12 Note that the covariate xij is considered as a continuous variable and has only 1 degree of freedom.

are translated into a linear model, the columns of the design matrix are orthogonal. However, when dealing with a continuous regressor xij, the results of summary.aov() are dependent on the variables' order. Hence, we compare nested models as above to test for the significance of the experimental factor.

Example

As an example, we consider the cochran.beets dataset from the agridat library. This dataset considers sugarbeet yield based on k = 7 different fertilizer treatments. It also includes a blocking factor with b = 6 levels and a single continuous covariate, the number of plants per plot. Ignoring the blocking factor and covariate and using a one-way ANOVA results in a very significant p-value for the fertilizer factor.

            Df  Sum Sq  Mean Sq  F value  Pr(> F)
fert         6  112.86   18.809    22.28  1.3e-10 ***
Residuals   35   29.55    0.844

Next we take the ANCOVA approach by comparing the simple linear regression,

(yield) = µ + β(plants) + ε to the model with variable means based on the fertilizer and then to the model with variable means and variable slopes. This results in the following table from R.

Model 1: yield ~ plants
Model 2: yield ~ fert + plants
Model 3: yield ~ fert * plants

    Res.Df     RSS  Df  Sum of Sq       F  Pr(> F)
1       40  28.466
2       34  20.692   6     7.7734  2.1822  0.07495 .
3       28  16.623   6     4.0694  1.1424  0.36431

The p-value for the fertilizer effect is now no longer significant at the 0.05 level. We can also compare the same three models as previously done by also incorporating the blocking factor. Now, we get

Model 1: yield ∼ block + plants Model 2: yield ∼ fert + block + plants Model 3: yield ∼ fert * plants + block Res.Df RSS Df Sum of Sq F Pr(> F ) 1 35 9.4655 2 29 6.9971 6 2.4684 1.891 0.1256 3 23 5.0038 6 1.9933 1.527 0.2138

31 Here, the blocking resulted in a reduction in the sum of squares for comparing model 2 to model 1. The conclusion is that the fertilizer effect disappears once the covariate is taken into account.

32 Chapter 3

Multiple Testing

Introduction

Multiple testing is a ubiquitous problem throughout statistics. Consider flipping 10 coins; if we observe 10 heads, then we may conclude that the coins are not fairly weighted as the probability of getting all 10 heads assuming fair coins is 1/1024. However, if we were to repeat this experiment with 100 sets of ten coins, then the probability that at least one set yields 10 heads is

1023100 1 − ≈ 0.093. 1024

Hence, there is a 9.3% chance that we may incorrectly conclude that a set of coins is not fairly weighted. Similarly, running a single hypothesis test of size α means that the probability that we falsely reject the null hypothesis is α. If m independent tests are conducted with size α, then the probability of at least one false positive is

1 − (1 − α)m  α.

In the following chapter, we consider factorial designs where m = 2k indepen- dent hypothesis tests can be conducted. This leads to the potential for many false positives unless we correct our methodology. This chapter contains some methods for correcting for multiple testing. For the remainder of this chapter, consider that we have conducted m indepen- dent hypothesis tests that have yielded a set of p-values p1, . . . , pm ∈ [0, 1]. The ith null hypothesis is denoted H0i while the ith alternative hypothesis is denoted H1i. We assume that under the null hypothesis that pi ∼ Uniform [0, 1]. Given that we want a false positive rate of α, without correcting, we would reject the ith null hypothesis if pi < α.

33 3.1 Family-wise Error Rate

There are many proposed methods to control the Family-wise Error Rate (FWER), which is the probability of at least one false positive:

FWER = P (reject any H0i | H0i) .

Thus, instead of having a test size α such that P (reject H0i | H0i) ≤ α, we want to devise a methodology such that F W ER < α. What follows are some methods for doing just that.

3.1.1 Bonferroni’s Method As already mentioned in Chapter 1, Bonferroni’s method is an effective but ex- tremely conservative method for multiple testing correction. Its strength is that it requires few assumptions; its weakness is that it is too aggressive often resulting in too many false negatives. Bonferroni’s method simply says that if we want the FWER to be no greater than α, then we should reject H0i if pi ≤ α/m. Indeed, let I0 ⊂ {1, . . . , m} be the set of indices such that H0i is true. Then, applying a union bound or Boole’s inequality, we have   [ X α FWER = P {p ≤ α/m} ≤ P(p ≤ α/m) = |I | ≤ α.  i  i 0 m i∈I0 i∈I0

iid While we will mostly assume our hypothesis tests to be independent–i.e. pi ∼ Uniform [0, 1] under H0i–this derivation works regardless of any dependency or lack thereof among the hypothesis tests.

3.1.2 Sidak’s Method

Sidak’s method is similar to Bonferroni. In this method, we reject H0i if pi ≤ 1 − (1 − α)1/m. Unlike Bonferroni’s method, Sidak’s requires the hypothesis tests to be independent. To see this, note that the above rejection region can be equivalently 1/m writen as Don’t reject H0i if (1 − α) ≤ 1 − pi. Therefore,   \ 1/m 1 − (FWER) = P  {pi ≥ (1 − α) } i∈I0 |I |  1/m 0 = P pi ≥ (1 − α) = (1 − α)|I0|/m FWER = 1 − (1 − α)|I0|/m ≤ 1 − (1 − α) = α.

34 3.1.3 Holms’ Method Holms’ Method is the first stepwise method we will consider. It specifically imple- ments Bonferroni’s method in a stepwise fashion. In this approach, we first order the p-values to get p(1) ≤ p(2) ≤ ... ≤ p(m).

We then reject hypothesis H0i if, for all j = 1, . . . , i,

p(j) ≤ α/(m − j + 1).

Hence, we can only reject H0i if we have already rejected all H0j for j < i. Fur- thermore, when considering H01, we check whether or not p(1) ≤ α/m, which is just Bonferroni’s method. Then, the threshold for rejection is relaxed as i increases. Thus, this approach is necessarily more powerful than regular Bonferroni as the standard Bonferroni only rejects hypotheses when p(i) ≤ α/m regardless of order. It can also be shown that this method still achieves a FWER ≤ α. Indeed, denote ˆi to be the index such that H0i is rejected for all i ≤ ˆi, and let i0 be the smallest index such that H0i0 should not be rejected. Then, (step 1) applying Bonferroni in reverse, (step 2) noting that |I0| < m − i0 + 1, and (step 3) noting that the event ˆ H0i0 not being rejected implies that i < i0 gives that

 α   α  1 − α ≤ P p(i) > for all i ∈ I0 = P p(i0) > ≤ |I0| |I0|   α ˆ  ≤ P p(i0) > ≤ P i < i0 = P (No null is rejected) . m − i0 + 1 Thus, this proceedure makes sense.

3.1.4 Stepwise Methods Step-down Methods Holms’ method is a specific type of step-down method using Bonferroni’s method iteratively. Such approaches to multiple testing iteratively consider the p-values from smallest to largest and only reject hypothesis H0i if all hypotheses H0j for j < i have already been rejected. In general, let I ⊂ {1, . . . , m}. And let H0I be the joint null that all of H0i are T true for i ∈ I. That is, the intersection i∈I H0i, and thus rejection of H0I implies that at least one H0i is rejected for i ∈ I. If J ⊃ I, then not rejecting H0J implies not rejecting H0I . Assume for every subset I, we have a level-α nonrandomized test function—i.e. some φ(x; I) = 0, 1 such that P (φ(x; I) = 1 | H0I ) ≤ α where 1 indicates rejection of H0I . The simultaneous test function is

Φ(x; I) = min{φ(x; J)}, J⊇I

35 which implies that we reject H0I at level α only when all H0J are rejected at level α. Hence, if I0 contains the indices of all true null hypotheses, then for any I ⊆ I0,

if P (φ(x; I0) = 1) ≤ α, then P (Φ(x; I) = 1) ≤ α.

Thus, the test Φ simultaneously controls the probability of falsely rejecting any H0I comprised of true nulls. To see this idea in practice, we return to Holms’ method. Here, Bonferroni’s method is φ and Holms’ method is Φ. Indeed, assuming that the p-values are ordered p1 ≤ p2 ≤ ... ≤ pm, consider the indices Ii = {i, i + 1, . . . , m}. The cardinality of this set is |Ii| = m − i + 1, so applying Bonferroni to this set says reject any H0k for k ∈ Ii if pk < α/(m − i + 1). Since the p-values are ordered, this

means that we reject H0Ii if pi < α/(m − i + 1). As Ij ⊇ Ii for any j ≤ i, we have the simultaneous rule which says

reject H0Ii only if we reject all H0Ij , ∀j ≤ i, which equivalently is

reject H0Ii only if pj < α/(m − j + 1), ∀j ≤ i, and this is Holms’ method.

Step-up Methods Similar to step-down procedures, there are step-up procedures where we begin with p(n) and accept null hypothesis H0i only if we have first accepted all H0j for j > i.

3.2 False Discovery Rate

As an alternative to FWER, we have the False Discovery Rate (FDR). To understand these ideas, consider that we have run m hypothesis tests where m0 are from the null and m1 are from the alternative. Furthermore, assume that we rejected r null hypotheses. This results in the following table: Decided Decided Null Alternative True Null m0 − a a m0 True Alternative m1 − b b m1 m − r r m Here, we have r rejected null hypotheses with a of them false rejections and b of them true rejections. Using this notation, we have

FWER = E (a/m0) = P (a > 0) FDR = E (a/r)

36 3.2.1 Benjamini-Hochberg Method The Benjamini-Hochberg Method, published in 1995 in JRSSB with currently more than 50,000 citations on Google Scholar, is one of the most important advancements beyond standard FWER control, which was the standard approach for most of the 20th century. The method is in its essence quite simple. Given ordered p-values p(1) ≤ p(2) ≤ ... ≤ p(m), we choose a q ∈ (0, 1) which will be the desired FDR. Then, we find the maximal index i such that i p ≤ q (i) m and then we reject all H0j for j ≤ i. This proceedure is validated by the following theorem.

Theorem 3.2.1. Given m ordered p-values p(1) ≤ p(2) ≤ ... ≤ p(m), if the p-values corresponding to null hypotheses are independent then

FDR = E (a/r) = π0q ≤ q where π0 = |I0|/m.

Here, π0 is the proportion of true null hypotheses to all tests. Proof. For t ∈ (0, 1), let r(t) be the total number of p-values less than t and let a(t) be the number of p-values from null hypotheses less than t. Hence, the false discovery proportion at t is FDP (t) = a(t)/ max{r(t), 1}.

Also, let Q(t) = mt/ max{r(t), 1} and tq = sup{t : Q(t) ≤ q}. Considering the ordered p-values, r(p(i)) = i and Q(p(i)) = mp(i)/i. Therefore, the Benjamini- Hochberg method can be rewritten as

Reject H0(i) for p(i) ≤ tq. Under the null hypothesis, p-values are uniformly distributed on [0, 1]. Therefore Ea(t) = t|I0|, which is t|I0| p-values on average will be in the interval [0, t]. Now, defining A(t) = a(t)/t, we have that for s < t E(a(s) | a(t)) = (s/t)a(t) and E(A(s) | A(t)) = A(t) and E A(s) | A(t0) ∀t0 ≥ t = A(t)

This makes A(t) a martingale as t goes from 1 to 0. Considering tq as a stopping time—that is, as t decreases from 1 to 0, then tq is the first time that Q(t) drops below q—we can apply the Optional Stopping Theorem1 to get that

EA(tq) = EA(1) = Ea(1) = |I0|.

1 A powerful result from the theory of martingales: https://en.wikipedia.org/wiki/ Optional_stopping_theorem

37 Therefore, a(tq) Q(tq) a(tq) q FDP (tq) = = a(tq) = , max{r(tq), 1} mtq tq m and finally |I | E[FDP (t )] = q 0 ≤ q. q m

Remark 3.2.2. Note that Benjamini-Hochberg requires the p-values to be indepen- dent. A more conservative approach that works under arbitrary dependence is to find the maximal index i such that 1 i p ≤ q (i) c(m) m

Pm −1 where c(m) = i=1 i and then we reject all H0j for j ≤ i.

38 Chapter 4

Factorial Design

Introduction

In this chapter, we will consider designs where we wish to consider k factors all with 2 different levels. Hence, there are 2k total treatments to consider, which is full factorial design. As one might expect, this can quickly become impractical as k grows. Hence, we will often consider fractional factorial design where some subset of treatments is “optimally” selected. Data from a factorial design can be displayed as

Factor ABC Data --- y1 - - + y2 - + - y3 - + + y4 + - - y5 + - + y6 + + - y7 + + + y8 where - and + refer to the two levels of each of the k = 3 factors. For each of the 23 = 8 treatments, we have a some observed data. If each factor level occurs in the same number of treatments, then the design is balanced. If two factors have all of their level combinations occurring an equal number of times, then they are said to be orthogonal. If all factor pairs are orthogonal, then the design is said to be orthogonal. When running such an experiment, the ideal situation is to randomize the order of the rows above and thus test the treatments in a random order. Often some factors are harder to change than others, hence we can also consider the case of where, for example, factor A may be set to - and the levels

39 of B and C are randomized. Then A is set to + and B and C are randomized again. An experiment can also be replicated, say n times, to collect multiple yij for j = 1, . . . , n. Each time an experiment is replicated, it should be randomized again.

4.1 Full Factorial Design1

A first consideration is to just treat such a design as a multi-way fixed effects model. The problem is that without replication, we have too few observations to take an ANOVA approach. And even with replication, we are faced with 2k − 1 hypothesis tests, which require a fix for multiple testing. The Bonferroni method is too ineffi- cient in this setting, so other techniques are developed. First, however, we need to define the main and interaction effects. Consider the experiment with k factors, A1,...,Ak, each with 2 levels, {−, +}, such that every combination of levels–i.e. treatments–is tested once. The data size k is then N = 2 . We estimate the main effect of factor Ai by

ME(i) =y ¯(i+) − y¯(i−)

k−1 wherey ¯(i+) is the average of all of the 2 observations such that factor Ai is at level + and similarly fory ¯(i+). As we are averaging over all other factor levels, a significant main effect should be reproducible in the sense that it has a direct effect on the response regardless of other factor levels. Next, we can consider the conditional main effect of one factor given another. That is, for example, we could compute the main effect of Ai given Aj is at level −.

ME(i|j+) =y ¯(i + |j+) − y¯(i − |j+)

k−2 wherey ¯(i − |j+) is the average of all of the 2 observations with Ai at level i and Aj at level +. We are also interested in the interaction effects between two or more factors, which we can write as INT(i, j, . . .) with as many arguments as desired. In the case of two-way interactions, we have that 1 INT(i, j) = [ME(i|j+) − ME(i|j−)] . 2 Note that this function is symmetric in the arguments–i.e. INT(i, j) = INT(j, i). Continuing down the rabbit hole, we can define conditional interaction effects for Ai and Aj conditionally on Al taking a certain level: 1 INT(i, j|l+) = [ME(i|j+, l+) − ME(i|j−, l+)] . 2 1 See Wu & Hamada Sections 4.3.1 & 4.3.2

40 This allows us to write down three-way interaction terms as 1 INT(i, j, l) = [INT(i, j|l+) − INT(i, j|l−)] . 2 This function is similarly invariant under permutations of the arguments. This conditioning process can be iterated to consider m-way interactions with 1 INT(1, 2, . . . , m) = [INT(1, 2, . . . , m − 1|m+) − INT(1, 2, . . . , m − 1|m−)] . 2

4.1.1 Estimating effects with regression2 We can write such a factorial design in the framework of linear regression. To do this, we require the xor operator ⊕ such that

a b a ⊕ b 0 0 1 1 0 0 0 1 0 1 1 1

In this case, we have k regressors x1, . . . , xk ∈ {0, 1}. The model is

2k   X M y = β0 + βi  xj i=1 j∈Ji

−j where Ji = {j = 1, . . . , k | bi2 c mod 2 = 1} 3 The estimators βˆi can be computed as usual. If we wished to take the standard sum of squares approach, assume the experiment has been replicated n times. Each main effect and each interaction effect would have a single degree of freedom totalling 2k −1 in all. Hence the DoFs for error sum of squares would be (n2k −1)−(2k −1) = 2k(n − 1). Hence, if we do not replicate the experiment, then we cannot estimate the variance via the SSerr. Furthermore, if we do replicate the experiment n times, we are faced with a multiple testing problem as a result of the 2k − 1 hypothesis tests performed. We could apply Bonferroni’s method in this case. However, that approach is often inefficient–i.e. too conservative–when a large number of tests is considered. 2 See Wu & Hamada Sections 4.4 & 4.5 3 There is also a method due to that computes the least squares estimates for the 2k parameters. https://en.wikipedia.org/wiki/Yates_analysis

41 Normal Q−Q Plot

D ● 2.0 1.5 C AB ● ● 1.0 0.5 ● ● ● ● ● ● 0.0 ● ● ● ● ● ●

Sample Quantiles −0.5 −1.0 ● AD

−2 −1 0 1 2

Theoretical Quantiles

Figure 4.1: Applying qqnorm() to the coefficient estimates from the example.

Example As an example, consider the 24 unreplicated experiment with factors A, B, C, and D. Data was randomly generated with the C effect of 1, the D effect of 2, the AB effect of 1, and the AD effect of -1. The data has normal errors with σ = 0.1. This results in the following least squares estimates.

Effect A B C D AB AC BC AD Truth ·· 1 2 1 ·· -1 Estimate 0.0810 -0.0209 1.10 2.12 1.16 -0.196 0.0367 -1.17 Effect BD CD ABC ABD ACD BCD ABCD µ Truth ········ Estimate 0.115 -0.215 -0.022 -0.186 0.125 0.0443 0.0884 -0.0626

These coefficient estimates can be plugged into R’s qqnorm function to produce a normal qq-plot for the 16 values as displayed in Figure 4.1. In this example, we can see the four significant effects standing out from the rest. However, in practice, the significant effects may not be as obvious.

42 4.1.2 Lenth’s Method4 As we cannot rely on sums of squares and ANOVA for testing for the significance of a main or interaction effect, other approaches are considered. Wu & Hamada Section 4.8, as well as other sources, consider looking at plots to determine the effect qualitatively. But as this is a bit ad-hoc compared to proper testing, we move onward and consider Lenth’s method. Consider the unreplicated full –i.e. n = 1. If we wish to test all 2k − 1 effects, then there are no degrees of freedom remaining to estimate the variance σ2. Consequently, Lenth (1989) proposed the psuedo standard error, which is a robust estimator in terms of the . In this case, robust means that if a few of the effects, θi, are non-zero then they will not skew the estimator of the standard k error. Let θˆi be the estimated effect of the ith treatment for i = 1,..., 2 − 1, then n o PSE = 1.5median |θˆi| : |θ|i < 2.5median{|θˆi|} , which says, compute the median of |θˆi|, then consider on the |θˆi| that are less than 2.5 times the median, and take their median and multiply by 1.5. The claim is that this is a consistent estimator under H0 given the usual normality assumption. k 2 Claim 4.1.1. Under 2 factorial design with iid N 0, σ errors, if θ1 = ... = P θ2k−1 = 0, then PSE −→ σ as n → ∞. For testing for the significance of a given treatment, it is recommended to con- struct a “t-like” statistic ˆ tPSE,i = θi/P SE, which can be compared to tabulated critical values. Note that we still have to correct for multiple testing in this case. This leads to two type of thresholds for determining whether or not tPSE,i is significant. The un- corrected threshold–or individual error rate (IER) from Wu & Hamada or marginal error rate (ME) from the R package BsMD–is a real number such that

k P(|tPSE,i| > IERα | H0) = α ∀i = 1,..., 2 − 1. The corrected threshold–or experimentwise error rate (EER) from Wu & Hamada or simultaneous margin of error (SME) from R–is a real number such that  

P max |tPSE,i| > IERα H0 = α. i=1,...,2k−1 Remark 4.1.2 (Advice from Wu & Hamada). In their text, Wu & Hamada claim that it is preferable to use the uncorrected thresholds as (1) the corrected threshold is too conservative and (2) it is better to include an uninteresting effect (false positive) than miss a significant one (false negative). 4 See Wu & Hamada Sections 4.9

43 4 3

2 SME

1 ME effects 0

ME SME −2 A B C D AB AC BC AD BD CD ABC ABD ACD BCDABCD

factors

Figure 4.2: Results from applying Lenth’s method with α = 0.05 to our 24 factorial design. Here, ME is the uncorrected threshold and SME is the threshold corrected for multiple testing. The significant effects are C, D, AB, and AC.

Remark 4.1.3 (Advice from Lenth). According to RV Lenth, “Lenths method should not be used in replicated experiments, where the standard error can be estimated more efficiently by traditional analysis-of-variance methods.”

Example Applying Lenth’s method via the BsMD library in R with the function LenthPlot() to the data from the previous example, we have the plot in Figure 4.2. Here, α = 0.05, the psuedo standard error is 0.265, the ME or IER is 0.682, and the SME or EER is 1.38. The result is that we correctly detect the four significant effects.

4.1.3 Key Concepts5 There are a few principles of Factorial Design that are worth mentioning:

1. Hierarchy: Lower order effects are often more important than higher order effects.

2. Sparsity: The number of significant terms is generally small.

3. Heredity: Interaction terms are often only significant if one of there interacting factors is solely significant.

5 See Wu & Hamada Sections 4.6

44 The first point says that we should probably focus on estimating the lower order effects before investing in for the higher order effects. Though, every problem has its own peculiarities. The second point is common assumption in many statistical modelling settings. The third point gives intuition about which factors to test.

4.1.4 Dispersion and Variance Homogeneity6 Thus far, we have been concerned with mean effects between different treatments. We can also consider the if the experiment has been replicated. Indeed, for the 2k treatments, if the experiment is replicated n > 1 times, then we can compute sample variances for each treatment

n 1 X s2 = (y − y¯ )2. i n − 1 ij i· j=1 Furthermore, under the normality assumption, we have that (n − 1)s2 i ∼ χ2 (n − 1) , σ2 2 2 2 and thus Xi = si /σ is a random variable with mean 1 and variance 2/(n − 1). We then consider log X2. The mean of this random variable lies in the interval [−(n − 1)−1, 0]. The upper bound follows from Jensen’s Inequality,

E log X2 ≤ log EX2 = 0

and the concavity of the logarithm. The lower bound comes from Taylor’s Theorem,  1  1 1 E log X2 ≥ E (X2 − 1) − (X2 − 1)2 = E(X2 − 1) − Var X2 = − . 2 2 n − 1

Hence, the claim in Wu & Hamada that “E log X2 is approximately zero” is asymp- totically true even if it is necessarily negative. The variance of log X2 can be approximated via the delta method–which is basically just Taylor’s theorem again. That is, for a continuously differentiable function h : R → R, 2 Var h(X2) ≈ h0(EX2) Var X2 .

Hence, for the log transform, Var log X2 ≈ 2/(n − 1). This is often referred to as a variance stabilizing transform. 2 2 2 2 2 Therefore, as log si = log X + log σ , we have that the mean E log si ≈ log σ 2 7 2 and Var log si ≈ 2/(n − 1) for n sufficiently large. The critical property of log si 6 See Wu & Hamada Sections 4.11 & 4.13 7 Wu & Hamada say n ≥ 10.

45 is that the variance does not depend on the value of σ2. Hence, we can use this to test for variance homogeneity. The desired hypothesis test is as follows. Consider m categories with n1, . . . , nm 2 observations each, which are not necessarily the same. Let σi be the unknown variance of the ith category. Then we wish to test

2 2 2 2 H0 : σ1 = ... = σm,H1 : ∃i, j s.t. σi 6= σj .

2 −1 Pni 2 Sample variances can be computed as si = (ni − 1) j=1(yij − y¯i·) . From here, many tests are possible. 2 Wu & Hamada make the unsubstantiated claim that log si has an approximate 2 −1 normal distribution with mean log σi and variance 2(ni −1) . Then, they construct 8 2 a chi squared test based on this assumption. Let zi = log si andz ¯ be the mean of 1 Pm the zi weighted by the observations per category. That is,z ¯ = N−m i=1 (ni − 1) zi. Then, m   X ni − 1 d (z − z¯)2 −→ χ2 (m − 1) . 2 i i=1 Hence, we can compare the test statistic to a threshold based on the chi squared distribution. 2 Bartlett’s test for homogeneity of variance is also based around log si . However, 9 Pm it is derived from the likelihood ratio test. Let N = i=1 ni be the total sample size. The test statistic for Bartlett’s method is (N − m) logs ¯2 − Pm (n − 1) log s2 B = i=1 i i . 1 Pm −1 −1 1 + 3(m−1) [ i=1(ni − 1) − (N − m) ]

The statistic B −→d χ2 (m − 1), which is a consequence of Wilks’ Theorem that says that the log likelihood ratio converges in distribution to a chi squared distribution. This test can be performed in R by the function bartlett.test(). Note that as Bartlett’s test is based on the likelihood ratio, it is dependent on the assumption of normality. If the data is non-normal, then other methods should be considered. A family of tests for variance homogeneity which contains many standard ap- proaches can be found in On the Admissibility and Consistency of Tests for Homo- geneity of Variances, Cohen & Marden (1989).

4.1.5 Blocking with Factorial Design When blocking in a full factorial design without replication, we are forced to sacrifice the ability to estimate one or more of the interaction effects. That is, if we have

8 2 The approximate normality of log si is demonstrated by Bartlett & Kendall (1946), The Statistical Analysis of Variance-Heterogeneity and the Logarithmic Transformation, where they 2 demonstrate that the higher order cumulants of log si decay quickly to zero as n → ∞. Note that the normal distribution is the only distribution with zero cumulants of all orders greater than two. 9See Snedecor & Cochran, Statistical Methods.

46 a 2k factorial design and a desired blocking that splits the data into two blocks of equal size 2k−1, Then, one of the interaction terms, say 1268, will always take on a value of + in one block and - in the others. Meanwhile, all other effects will be balanced. This is referred to as . As an example, consider the 23 design,

Treatments Interactions 1 2 3 12 13 23 123 Block 1 . . . + + + . 2 . . + + . . + 3 . + . . + . + 4 . + + . . + . 5 + . . . . + + 6 + . + . + . . 7 + + . + . . . 8 + + + + + + +

Any blocking scheme filled in on the right that is balanced, so that each on the levels of 1,2,3 are considered equal number of times, has to coincide with one of the interaction terms. For a single blocking factor, the usual approach is to confound it with the highest order interaction term as, given the hierarchy assumption, it is usually one of the the least important effects to consider. More complications arise when you block more than once. Two separate blocks that split the data in half can be combined into one block that splits the data into four pieces. Then we can consider three different separations: B1 = −/+, B2 = −/+, and B1B2 = −/+. In this case, we actually have a mathematical group structure. Specifically, ({0, 1}k, ⊕) is a finite albelian group and we can choose a few elements (blocks) which will generates a subgroup. For example, if block one 10 confounds 123–we write B1 = 123–and if B2 = 12, then B1B2 = (12)(123) = 3. This implies that we also confound the main effect of factor 3, which is generally not desirable. To determine whether one blocking scheme is better than another with respect to confounded interactions, we say that one blocking has less aberration than another if the smallest order in which the number of confounded effects differs is smaller in the 1 1 1 first blocking scheme than the second. That is, for a blocking B = (B1 ,...,Bm) 2 2 2 and B = (B1 ,...,Bl ), we can define the function ψ : B × {1, . . . , k} → Z, where B is the space of blocking schemes, by ψ(B1, i) = The number of confounded factors of order i by block B1.

10Here, we are effectively xoring the columns 12 and 123 of the above table to get column 3. This group is actually known as the Klein-4 group https://en.wikipedia.org/wiki/Klein_ four-group. In general, these groups will all be Boolean groups as they are products of copies of the cycle group Z/2Z.

47 Then, find the smallest i such that ψ(B1, i) 6= ψ(B2, i). If ψ(B1, i) < ψ(B2, i), then we say that B1 has less aberration than B2. If B1 has no more aberration than any other blocking scheme, then it is said to be a minimal aberration blocking scheme.

4.2 Fractional Factorial Design

Often, it is not practical to consider a full 2k factorial experiment as k becomes larger as the number of observations required, even before replication, grows exponentially. Hence, we need to consider the consequences of testing 2k treatments with only 2k−p observations. As an example, consider the 23 design from the beginning of this chapter, but with only 23−1 treatments considered:

Main Interaction ABC AB AC BC ABC Data y1 y2 - + - - + - + y3 - + + - - + - y4 + - - - - + + y5 + - + - + - - y6 y7 y8 Notice that the interaction AB is only considered at the - level and thus cannot be estimated. Notice further that the main effects of A and B are such that ME(A) =y ¯(A+) − y¯(A−) =y ¯(B−) − y¯(B+) = −ME(B). Hence, we cannot separate the estimation of the A and B main effects. This is known as aliasing. Similarly, C and ABC are aliased, and AC and BC are aliased. Hence, we only have three independent terms to estimate, which is reasonable as the total sum of squares will have 22 − 1 = 3 degrees of freedom. In general, we can define an aliasing relation by writing I = AB, which is when column A and column B are xored together, we get the identity element. This is sometimes referred to as the defining relation. In this case, we can only estimate the effect of A if the effect of B is negligible and vice versa. This is an extreme case, however. Often, if the defining relation is I = ABCDE, then each main effect is aliased with a four-way interaction effect. If all four-way interactions are negligible, as they often our assuming the hierarchy principle, then we can still estimate the main effects without interference. Generally, prior knowledge is necessary to guess which effects will be likely to be important and which are negligible and can be aliased. There are some properties to consider when choosing a design as will be discussed in the following subsection.

48 Example Continuing with the same example from before, we assume instead of conducting a 24 full factorial experiment, that we conduct a 24−1 fractional factorial experiment with the relation I = ABCD. Hence, each of the two factor effects is aliased with another two factor effect. Running the eight data points through R gives

Effect A/BCD B/ACD C/ABD D/ABC AB/CD AC/BD BC/AD Truth ·· 1 2 1 · -1 Estimate -0.465 0.716 1.50 1.49 0.971 -0.05 -1.11 1000 reps -0.500 0.499 1.498 1.5 0.995 -0.001 -0.993

The third row gives the results of replicating this experiment 1000 times and aver- aging the results. Certainly, there are issues with the aliasing.

4.2.1 How to choose a design First, we will require some terminology. A main or two-factor effect is referred to as clear if none of its aliased effects are main or two-factor effects. It is referred to as strongly clear if none of its aliased effects are one, two, or three way effects. If we have a 2k−q design, then we will have q defining relations that will generate a group. This is called the defining contrast subgroup, which consists of 2q elements being the 2q − 1 strings of factors and identity element.

Example 4.2.1. Considering the 25−2 design, if we have the relations I = 123 and I = 345, then we would also have

I = (123)(345) = 1245.

Thus, we have the group with four elements {I, 123, 345, 1245}.

Given the defining contrast subgroup, the resolution is the smallest length of any of the elements. In the above example, the resolution would be 3. One criterion for design selection is to choose the design 2k−q design with the largest resolution. This is known as the maximal resolution criterion. This is based around the idea of hierarchy, which is that the lower order effects are probably more significant, so we do not want to lose the ability to estimate those effects. A design of resolution r is such that effects involving i factors cannot be aliased with any effect with less than R − i factors. For example, resolution 4 implies that the main and two-way effects are not aliased, which is generally desirable. Similar to blocking in the full factorial setting, we can also consider aberration between two 2k−q designs and the minimal aberration design. The development here is completely analogous to that for blocking.

49 Follow up testing and de-aliasing From the above example, if we have a 24−1 design with I = ABCD, then we cannot differentiate between the significance of two aliased terms. If the significance is between C and ABD, then we often yield to the hierarchy principle and choose C and the significant effect. But how do we decide between AB and CD? In general, if we have a 2k−q design, then the fold-over technique requires us to run a second but slightly different 2k−q design in order to augment the first. In the above example, this would merely be considering the other 24−1 design with I = −ABCD, which would be the full factorial design when combined with the original data. In a more interesting setting, consider the 25−2 design and assume the defining relations I = 123 = 145 = 2345. In this case, 1 is aliased with 23 and with 45 while 2,3,4,5 are each aliased with a two factor interaction. The corresponding table is

1/23/45 2/13/345 3/12/245 4/15/235 5/14/234 run 1 . . + . + . 2 . . + + . . 3 . + . . + . 4 . + . + . . 5 + . . . . . 6 + . . + + . 7 + + + . . . 8 + + + + + .

If we wished to de-alias 1 with 23 and 45, then we could rerun the same design but flip the sign of all of the entries in column 1. Namely, we change the relation 1 = 23 = 45 into −1 = 23 = 45. The result is

-1 2/-13/345 3/-12/245 4/-15/235 5/-14/234 run 1 + . + . + + 2 + . + + . + 3 + + . . + + 4 + + . + . + 5 ..... + 6 . . . + + + 7 . + + . . + 8 . + + + + +

Combining these two tables into one 25−1 design results in de-aliasing factor 1 from the two factor interactions. Similarly, we also de-alias factors 2,3,4,5 from the two

50 factor interactions. As this negation removes the relations I = 123 and I = 145, we are left with only I = 2345. Thus, main effects 2,3,4,5 are only aliased with three-way terms and 1 is only aliased with the five way term 12345. Note that as we have performed a second 25−2 design, we could add an additional factor in the place of “run” in the tables to test a 26−2 design. We could alternatively use this term as a blocking factor to compare the first and second runs.

Measures of Optimality Perhaps we do not wish to test a second set of 2k−q treatments either because it is too costly and impractical or because we only have a few aliased terms that require de-aliasing. We could consider adding on a few more rows to our design. Note that the aliases derived from the defining relations, I = 123 = 145 = 2345, are 1 = 23 = 45 = 12345, 2 = 13 = 1245 = 345, 3 = 12 = 1345 = 245 4 = 1234 = 15 = 235, 5 = 1235 = 14 = 234, 24 = 134 = 125 = 35, 25 = 135 = 124 = 34, I = 123 = 145 = 2345 Considering the 25−2 design from before, we can consider adding two more treat- ments such as 1 2 3 4 5 24 25 run 1 . . + . + + . . 2 . . + + . . + . 3 . + . . + . + . 4 . + . + . + . . 5 + . . . . + + . 6 + . . + + .. . 7 + + + . . .. . 8 + + + + + + + . 9 ..... + + + 10 . + . . + . + +

As each of the factors 1,. . . ,5 can be considered at the levels {−, +}, we have 32 pos- 24 sible treatments minus the 8 already considered for each. This results in 2 = 276 possibilities. Note that the run column allows us to block the first set of treatments and the second for comparison if desired. The obvious question is, which treatments are optimal to include? A collection of different methods is detailed on Wikipedia.11 In Wu & Hamada, they consider T D-optimality which seeks to maximize the determinant of Xd Xd where Xd is the (2k−q + t) × k matrix with t being the additional number of treatments considered. The two treatments presented in the above table were selected by this criteria. Note that the choice of treatments is not necessarily unique.

11 https://en.wikipedia.org/wiki/Optimal_design

51 A list of possible optimality criteria is presented in the following table.

Name Formula Intuition T D-optimality max{Xd Xd} Minimize volume of confidence ellipsoid T −1 2 T −1 A-optimality min{trace([Xd Xd] )} Minimize Var (β) = σ (Xd Xd) T T-optimality max{trace(Xd Xd)} Similar to A-optimality T −1 T G-optimality min maxi[X(X X) X ]i,i Minimize maximal variance of predictions

Instead of optimizing globally, we may also be interested in de-aliasing two spe- cific effects from one another. To achieve this, consider writing the design matrix Xd in block form as Xd = (X1 X2) k−q where X2 corresponds to the (2 + t) × 2 matrix with two columns corresponding to two effects to be de-aliased. Consequently,

 T T  T X1 X1 X1 X2 Xd Xd = T T X2 X1 X2 X2

and using the formula for inverting a block matrix12 gives   T −1 ?? (Xd Xd) = T T T −1 T −1 ? (X2 X2 − X2 X1(X1 X1) X1 X2) Hence, we can optimize over two specific effects, or alternatively a general subset of more than two effects, by choosing the new treatments to optimize

T T T −1 T max{X2 X2 − X2 X1(X1 X1) X1 X2}.

This is referred to as Ds-optimality. It is similar to how D-optimality looks for the design that optimizes

T T −1 max{Xd Xd} = min{(Xd Xd) }.

4.3 3k Factorial Designs13

In some cases, we may be interested in testing three instead of two factor levels. Hence, in this section, we will replace {−, +} with {0, 1, 2} to denote the levels that each of the k factors can take. In some sense, very little has changed from the previous 2k setting. The main difference it added complexity with respect to testing effects and aliasing when considering fractional designs.

12 https://en.wikipedia.org/wiki/Block_matrix#Block_matrix_inversion 13 See Wu & Hamada Section 6.3

52 When we considered a 22 design with factors A and B, we were able to estimate the main effects of A and B as well as the interaction term A × B. In that setting, all of these terms have one degree of freedom. The 32 analogue is slightly different. We can still consider the two main and one interaction effect. However, now the main effects have 2 degrees of freedom while the interaction effect has 4 = (3 − 1)2. The main effect of factor A considers the differences among the three levels of A = 0, 1, 2. Writing the interaction term as A × B can be a bit misleading as it is actually comparing the responses at

A + B = 0, 1, 2 mod 3 and A + 2B = 0, 1, 2 mod 3, which are displayed in the following tables.

A+B A A+2B A 0 1 2 0 1 2 0 0 1 2 0 0 1 2 B 1 1 2 0 B 1 2 0 1 2 2 0 1 2 1 2 0

The above two tables are 3 × 3 Latin squares. Furthermore, replacing (0, 1, 2) on the left with (α, β, γ), we can see that they are orthogonal Latin squares which can be combined into α0 β1 γ2 β2 γ0 α1 γ1 α2 β0 Thus, the interaction term can be thought of as a Graeco-Latin square testing two treatments referring to the above two equations. As a consequence of this, the sum of squares for A×B can be further decomposed into sums of squares for two terms denoted AB and AB2 corresponding, respectively, to A+B and A+2B mod 3. Recalling the Latin squares design, each of these effects has 3 − 1 = 2 degrees of freedom. These ideas can be extended in the 33 design to the three way interaction term A × B × C. This term will have 8 degrees of freedom and can be decomposed into four effects with 2 DoFs.

ABC : A + B + C = 0, 1, 2 mod 3 ABC2 : A + B + 2C = 0, 1, 2 mod 3 AB2C : A + 2B + C = 0, 1, 2 mod 3 AB2C2 : A + 2B + 2C = 0, 1, 2 mod 3

53 Note that we set the coefficient of the first factor equal to 1. Otherwise, we will have repeats as, for example,

2A + B + 2C = 2(A + 2B + C) = 0, 1, 2 mod 3 showing that A2BC2 is the same effect as AB2C. This method of decomposing the interaction terms is referred to as the orthog- onal component system. This is because all of the terms in the decomposition are orthogonal, which is always a desirable property. In a 3k design, we have 3k − 1 degrees of freedom and each of these terms requires 2. Hence, there are (3k − 1)/2 effects to estimate.

Remark 4.3.1. Even though we can estimate all of these effects, they may quite hard to interpret in practise. If an F-test or Lenth’s method implies significance for AB2C2, what does that mean?

4.3.1 Linear and Quadratic Contrasts14 One benefit to the 3k design over the 2k design is the ability to estimate quadratic effects. That is, when we only have two levels {−, +} to consider, the best we can do to claim that one results in larger responses than the other. However, in the 3k setting, we can look for increases and decreases and even use such models to find an optimal value for a given factor with respect to minimizing or maximizing the response. The main effect of factor A in a 3k design has two degrees of freedom and compares the difference among the responses for A = 0, 1, and 2. However, we can decompose this main effect into a linear and quadratic contrast with √ √ Al = (−1, 0, 1)/ 2, and Aq = (1, −2, 1)/ 6.

Lety ¯A0, y¯A1, y¯A2 denote the average–i.e. averaged over all other factor levels– response for A = 0, 1, 2, respectively. Then, the linear contrast considersy ¯A2 − y¯A0. If this value is significantly removed from zero, then there is evidence of an increase or decrease between the two extreme values of A = 0 and A = 2. This is, in fact, just the combination of two linear contrasts

y¯A2 − y¯A0 = (¯yA2 − y¯A1) − (¯yA1 − y¯A0).

The quadratic contrast is orthogonal to the linear contrast. It tests for the case wherey ¯A1 is either significantly larger or smaller thany ¯A0 +y ¯A2. Each of these contrasts has 1 degree of freedom.

14 See Wu & Hamada Section 6.6

54 Now when considering an interaction term like A × B, we can forego testing AB 2 and AB and instead test the four terms (AB)ll, (AB)lq, (AB)ql, (AB)qq each with one degree of freedom. These contrasts can be written as 1 1 (AB) : [(y − y ) − (y − y )] = [(y − y ) − (y − y )] ll 2 22 20 02 00 2 22 02 20 00 1 (AB)lq : √ [(y22 − 2y12 + y02) − (y20 − 2y10 + y00)] 2 3 1 (AB)ql : √ [(y22 − 2y21 + y20) − (y02 − 2y01 + y00)] 2 3 1 (AB) : [(y − 2y + y ) − 2(y − 2y + y ) + (y − 2y + y )] qq 6 22 21 20 12 11 10 02 01 00 Compare the above formulae to the following table to understand what they are testing.

y02 y12 y22 y01 y11 y21 y00 y10 y20

The linear-linear contrast is looking for a difference in the linear contrasts of A conditional on B = 0 or 2. This is equivalent to looking for a difference in the linear contrasts of B conditional on A = 0 or 2. A significant value could indicate, for example, that the response is increasing in A when B = 0 but decreasing in A when B = 2. The linear-quadratic contrasts are similar in interpretation. (AB)ql tests for differences between the quadratic contrast of A when B = 0 and when B = 2. Similarly, (AB)lq tests the same thing with the roles of A and B reversed. The quadratic-quadratic term is a bit harder to interpret. It is effectively looking for quadratic changes in the quadratic contrasts of one factor conditional on another.

4.3.2 3k−q Fractional Designs15 Even more so than in the 2k factorial design, the number of required observations for a 3k design grows quite rapidly. As a result, 3k−q fractional factorial designs are often preferred when k ≥ 4. Similarly to the 2k setting, we need to define a defining relation such as, for a 34 design, AB2C2D = I. However, our modular arithmetic is now performed in mod 3 instead of mod 2. Hence, the above relationship is equivalent to

A + 2B + 2C + D = 0 mod 3. 15 See Wu & Hamada Section 6.4

55 This equation can be used to derive all of the alias relations, but doing so is quite tedious. For example, Adding A to both sides of the above equation gives

A = 2A + 2B + 2C + D mod 3 = 2(A + B + C + 2D) mod 3 and adding 2A gives 2A = 0 + 2B + 2C + d mod 3. Hence, we have A = ABCD2 = B2C2D. This process can be continued for the other factors in order to construct the 13 aliasing relations. In general, a single defining relation with take a 3k to a 3k−1 design. The number of orthogonal terms is reduced from

3k − 1 (3k − 1)/2 − 1 3k−1 − 1 ⇒ = . 2 3 2 Hence, we lose the defining the relation, and the remaining effects are aliased into groups of three. If we have two defining relations like

I = AB2C2D = ABCD,

then we will also have an additional two relations. This can be seen by sequentially adding the defining relations. Including I = AB2C2D implies that all remaining terms are aliased in groups of 3. Hence, ABCD is aliased with

(ABCD)(AB2C2D) = A2B3C3D2 = AD

and (A2B2C2D2)(AB2C2D) = A3B4C4D3 = BC. Hence, including a second defining relation I = ABCD immediately includes AD and BC as well. Thus, our defining contrast subgroup is

I = AD = BC = ABCD = AB2C2D.

We can count effects to make sure all are accounted. In a 34 design, we have (34 − 1)/2 = 40 effects. In a 34−1 design, we have (40 − 1)/3 = 13 aliased groups with 3 effects each resulting in 13 × 3 = 39 plus 1 for the defining relation gives 40. In a 34−2 design, we have (13 − 1)/3 = 4 aliased groups with 9 effects each resulting in 4 × 9 = 36 plus the 4 defining relations gives 40 again.

56 Partial Aliasing The above aliasing relations apply to the orthogonal components system, but what is the effect of such aliasing on the linear and quadratic contrasts? If a design has resolution 5, then all of the main and two-way effects are not aliased with any other main or two-way effects. Thus, their linear and quadratic contrasts are also free of aliasing with each other. Note, however, that they can still be aliased with higher order terms. For more complex aliasing relationships, consider the 33−1 design with defining relation I = ABC. Then we have the following four aliased groups, 1 A = BC = AB2C2 2 B = AC = AB2C 3 C = AB = ABC2 4 AB2 = AC2 = BC2 and a design matrix that has 9 rows such that A + B + C = 0 mod 3

A B C AB2 1 0 0 0 0 2 0 1 2 2 3 0 2 1 1 4 1 2 0 2 5 1 0 2 1 6 1 1 1 0 7 2 1 0 1 8 2 0 1 2 9 2 2 2 0

Each of these four terms comes with 2 degrees of freedom. We can furthermore decompose the above into a design matrix of linear and quadratic contrasts by mapping (0, 1, 2) → (−1, 0, 1) in the linear case and mapping (0, 1, 2) → (1, −2, 1) in the quadratic case. The interaction term columns are filled in with the product mod 3 of the corresponding main effects columns.

Al Aq Bl Bq Cl Cq (AB)ll (AB)lq (AB)ql (AB)qq 1 -1 1 -1 1 -1 1 1 -1 -1 1 2 -1 1 0 -2 1 1 0 2 0 -2 3 -1 1 1 1 0 -2 -1 -1 1 1 4 0 -2 1 1 -1 1 0 0 -2 -2 5 0 -2 -1 1 1 1 0 0 2 -2 6 0 -2 0 -2 0 -2 0 0 0 1 7 1 1 0 -2 -1 1 0 -2 0 -2 8 1 1 -1 1 0 -2 -1 1 -1 1 9 1 1 1 1 1 1 1 1 1 1

57 However, as we only have 2 degrees of freedom from AB2, we must select two of the four to estimate in practice. Note that if you add (mod 3) the columns of, for example, Al and Aq that you will get the above column for A. We know from before that C is aliased with AB but what about the relationships between Cl, Cq, and (AB)ll,(AB)lq,(AB)ql,(AB)qq? We can compute the correlation between each pair of vectors16 to get

Cl Cq (AB)ll (AB)lq (AB)ql (AB)qq p p Cl 1 . . 1/2 1/2 . p p Cq . 1 1/2 . . − 3/5 p (AB)ll . 1/2 1 . . . p (AB)lq 1/2 . . 1 . . p (AB)ql 1/2 . . . 1 . p (AB)qq . − 3/5 . . . 1

Hence, none of the columns coincide implying that we can estimate the terms simul- taneously. However, the columns are not orthogonal meaning that the estimation is not as efficient as if they were orthogonal. This is the partial aliasing situation. In Wu & Hamada, they recommend using a approach on the entire set of 6 main linear and quadratic effects and 18 two-way interactions.

4.3.3 Agricultural Example In the agridat library, there is the dataset chinloy.fractionalfactorial, which considers a 35−1 fractional factorial design with an additional blocking variable. The five factors considered are denoted N, P, K, B, and F.17 We first run the commands

dat <- chinloy.fractionalfactorial dat <- transform(dat,N=factor(N), P=factor(P), K=factor(K), B=factor(B), F=factor(F)) to tell R that the levels of N,P,K,B,F are to be treated as factors. If we were to ignore the aliasing and just include as many terms as possible with

md1 = aov( yield∼(N+P+K+B+F)^5,data=dat ) then we would get sums of squares for the 5 main effects, the 10 two-way effects, and only 6 of the three-way effects. However, an additional peculiarity is that K × B, K × F , and B × F all have only two degrees of freedom instead of four. Also,

16 hx,yi Correlation of x and y is kxkkyk 17 The original paper associated with this dataset can be found at https://doi.org/10.1017/ S0021859600044567

58 the corresponding three-way terms with N attached only have four instead of eight degrees of freedom. The defining relation for the aliasing in this experiment is I = PK2B2F or, in modular arithmetic form, P + 2K + 2B + F = 0 mod 3. This results in the following aliased relations for the main effects,

N = NPK2B2F = NP 2KBF 2 P = PKBF 2 = KBF 2 K = PB2F = PKB2F B = PK2F = PK2BF F = PK2B2F 2 = PK2B2

For the second order effects, N does not occur in the defining relation and thus all Nx terms are aliased with 4th and 5th order interactions. For the terms involving P , we have

PK = PBF 2 = KB2F PK2 = PKBF 2 = BF 2 PB = PKF 2 = KB2F 2 PB2 = PB2KF 2 = KF 2 PF = PKFB = KB PF 2 = PKB = KBF

We can see that BF 2, KB, and KF 2 are each aliased with a two-way term involving P . Hence, the general interactions B × F , K × B, and K × F each lose 2 degrees of freedom in the ANOVA table. In the original study, the authors were only concerned with main and two-way interaction effects and chose to save the remaining degrees of freedom to estimate the residual sum of squares. Ignoring the blocking factor for the , we can test for significant effects by md2 = aov( yield∼(N+P+K+B+F)^2,data=dat ) This results in N, P , B, F , and P × F all significant at the 0.05 level–without any multiple testing correction, that is. If we set the contrasts to contr.poly, then we can look at the linear and quadratic effects. This can be achieved in R by adding the argument split=list(P=list(L=1,Q=2),F=list(L=1,Q=2)) to the summary() command. For the P × F term, we get from the ANOVA table that DoF Sum Sq Mean Sq F value Pr(>F) P:F 4 6.109 1.527 2.674 0.047424 * L.L 1 5.816 5.816 10.182 0.002938 ** Q.L 1 0.197 0.197 0.344 0.560909 L.Q 1 0.000 0.000 0.000 0.994956 Q.Q 1 0.096 0.096 0.169 0.683578

59 ntelna-iercnrs.W a okdrcl tte3 the at directly of look levels can different We across responses contrast. between interaction linear-linear the the of in significance the that indicates This h oe n r omdlaltowyitrcin with interactions two-way all model to try and happen. orthogo- model to all the have are necessarily model not reduced does this this in that contrasts Note polynomial nal. the that check can One that see can we backwards 4.4 , a just Running Figure to interactions correlation. In non-zero the the have via terms. selection contrasts use variable the polynomial further the between can of We correlations some the matrix. compute design to the extract to command 4.3. Figure in plotted are table this of columns and rows The factors between interaction the in of changes Plots see 4.3: Figure

Yield hsdtstas otisabokn atrwihhs9lvl.I eadi into it add we If levels. 9 has which factor blocking a contains also dataset This osdrn l ftemi n w-a ffcs ecnuethe use can we effects, two-way and main the of all Considering

3.5 4.0 4.5 5.0 0 ● Average PgivenF Response, d o(yield aov( = md3 P odtoa on conditional yield Level ofP ● 1 ● Yield atPgiven F=2 Yield atPgiven F=1 Yield atPgiven F=0 step() ∼ : P:F + N:P + F + B + K + P + N P 2 1 0 \ F ∼ F ucinrdcstemdlcnann l two-way all containing model the reduces function ntergtpo,w e h ees conditioning. reverse the see we plot, right the In . lc+NPKBF^,aadt) block+(N+P+K+B+F)^2,data=dat P 2 ● .653 5.19 5.36 5.23 4.96 5.25 5.11 4.45 4.44 3.27 and 2 1 0 60 F Yield oseteitrcinbehaviour. interaction the see to 3.5 4.0 4.5 5.0 0 ● Average FgivenP Response, P and Level ofF F × ● P ntelf lt we plot, left the In . 1 ● arxo average of matrix 3 Yield atFgiven P=2 Yield atFgiven P=1 Yield atFgiven P=0 and model.matrix() cor() F , scontained is command 2 ● Correlation of Contrasts Correlation with blocks

Figure 4.4: Correlations of the polynomial contrasts (Left) and correlations with blocks (Right). Most are zero. However, some of the two-way contrasts have non- zero correlation. then we see that the residual degrees of freedom has reduced from 36 to 30 and that the degrees of freedom for P × K has dropped from 4 to 2, which accounts for the 8 DoFs required for the blocking factor. Including the three-way interactions in our model allows us to use R see that we lose 2 DoFs from each of P × K, N × P × B, N × B × F , N × P × F by running the command

summary(aov( yield∼Error(block)+(N+P+K+B+F)^5,data=dat )).

Two confounded factors by the blocking relations as detailed in Chinloy et al (1953) are PK and NPB2. From here, we can construct the remaining confounded factors using. Combining these two factors gives two more

(PK)(NPB2) = NP 2KB2 (P 2K2)(NPB2) = NK2B2.

Finally, we have to consider the two other terms aliased with the above four by the relation I = PK2B2F .

PK = PBF 2 = KB2F NPB2 = NP 2K2B2F = NKB2F 2 NP 2KB2 = NB2F = NPK2B2F 2 NK2B2 = NPKBF = NPF

Thus, the two DoFs that we see being taken away come from the interactions PK, NPB2, NB2F , and NPF . In the correlation plot of Figure 4.4, we have correlation between the P × K polynomial terms and the block factor.

61 If we similarly apply backwards variable selection to the model with the blocking variable included, then the result is that more interaction terms remain. However, the polynomial contrasts are still orthogonal to one another.

yield ∼ block + N + P + K + B + F + N:P + N:K + N:B + N:F + P:B + P:F + B:F.

Problem of reordering your terms As a result of the partial aliasing, we miss significant results if we were to fit the effects in a different order. To see this, running

md2 = aov( yield∼(N+P+K+B+F)^2,data=dat ) as above yields an ANOVA table with a slightly significant P:F term with 4 degrees of freedom and, subsequently, a (PF )ll term that is very significant. If instead, we ran

md2 = aov( yield∼(N+B+P+K+F)^2,data=dat ), then the P:F term will only have 2 degrees of freedom and a p-value of 0.147. The polynomial contrasts will no longer be significant.

DoF Sum Sq Mean Sq F value Pr(>F) P:F 2 2.308 1.154 2.020 0.147 L.L 1 2.207 2.207 3.864 0.057 . Q.L 1 0.101 0.101 0.176 0.677 L.Q 1 Q.Q 1

We can, however, solve this problem using the R function step(), which will result in the same submodel as before retaining the term P : F but not the term B : K. The original 1953 paper on this dataset came before the invention of AIC. In that case, they directly identify that the mean sum of squares for the aliased PF = KB term is larger than the rest and then look at each interaction more closely to determine which is significant.

62 Chapter 5

Response Surface Methodology

Introduction

In this chapter, we consider the regression setting where given k input variables, we are interested in modelling the response variable y as

y = f(x1, . . . , xk) + ε

where f is some function to be specified and ε is iid mean zero variance σ2 noise. Here, the variables xi will be treated as coded factors. If the factor takes on three levels, then it is generally mapped to {−1, 0, 1}. If it takes on five levels, it is often mapped to {−α, −1, 0, 1, α} for some α > 1. For example, if an experi- ment were taking place in an oven, the temperature factor of {325, 350, 375} → {−1, 0, 1}. If we wanted five levels, we could choose α = 1.6 and consider tempera- tures {310, 325, 350, 375, 390} → {−1.6, −1, 0, 1, 1.6}. The goal of response surface methods is to model the function f(·) as a poly- nomial surface. While higher orders are possible, we will consider only first and second order. These tools can be used to choose treatments sequentially with the goal of optimizing with respect to the response. In general, an experimenter can use a first order model to get a local linear approximation of f(·) which provides an estimate of the slope of the surface. Second order models can be used to test the local curvature of the surface. Before proceeding with such an optimization, it is often wise to run a factorial or other design for the purposes of variable selection as optimization for large values of k can be quite costly and time consuming. Response surface methods are often used to improve some process in, say, and engineering context. As an example, we could try to optimize cake baking with respect to the variables temperature, baking time, and sugar content where the response y could be a score given by a panel of judges. As obtaining a single y is time consuming, we would want to search the parameter space as efficiently as possible.

63 Note that this is only a brief overview of the subject of response surface methods. Entire textbooks are dedicated to the subject.

5.1 First and Second Order1

A first order model for f(·) is of the form

k X y = β0 + βixi + ε, i=1 which is effectively a linear regression but where the uncoded values of xi were chosen carefully with respect to some experimental design. This could be, for example, a 2k−q fractional factorial design, which, by considering different combinations of ±1 allows for estimation of the βi. The second order model is of the form

k k X X X 2 y = β0 + βixi + βijxixj + βiixi + ε i=1 i j. In order to estimate all of the parameters in the second order model, more treatment combinations need to be considered. These will be discussed in the following section. For now, assume a design exists that allows us to estimate the βi and the βij. Given such a second order model fitted locally to our experimental data, we can solve for the critical or stationary point xs where the derivative is equal to zero. Namely, −1 0 = b + 2Bxs, or xs = −B b/2 As B is a real symmetric matrix, the spectral theorem allows us to write B as U TDU where U is the orthonormal matrix of eigenvectors and D is the diagonal matrix of eigenvalues λ1 ≥ ... ≥ λk. To understand the behaviour of the fitted valuey ˆ about the critical point xs, we can rewrite

ŷ = β̂_0 + x^T b + x^T B x
  = β̂_0 + (x_s + z)^T b + (x_s + z)^T B (x_s + z)
  = β̂_0 + x_s^T b + x_s^T B x_s + z^T b + 2 z^T B x_s + z^T B z
  = ŷ_s + z^T B z

1 See Wu & Hamada Section 10.4

as z^T b + 2 z^T B x_s = z^T (b + 2Bx_s) = 0. Furthermore,

ŷ = ŷ_s + z^T B z
  = ŷ_s + z^T U^T D U z
  = ŷ_s + v^T D v
  = ŷ_s + Σ_{i=1}^{k} λ_i v_i^2,

where v = Uz. Hence, the behaviour of ŷ about the critical point is determined by the eigenvalues of the matrix B.
If all of the eigenvalues are positive, then the surface is locally elliptic and convex, implying that ŷ_s is a local minimum. If all of the eigenvalues are negative, we conversely have that ŷ_s is a local maximum. If the signs of the eigenvalues are mixed, then we have a saddle point and should continue searching for the desired optimum. If one or more of the eigenvalues is zero–or very close to zero when compared to the other eigenvalues–then there will be a linear subspace of critical values spanned by the corresponding eigenvectors. For example, if all of the eigenvalues are positive except for λ_k ≈ 0, then we have a minimum for any input value of v_k.
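As a minimal sketch of this analysis, suppose we have already extracted the estimates b and B from a fitted second order model; the coefficient values below are hypothetical:

# Hypothetical fitted second-order coefficients for k = 2 coded factors.
b <- c(0.9, 0.6)                       # linear coefficients (beta_1, beta_2)
B <- matrix(c(-1.0, 0.2,
               0.2, -0.8), nrow = 2)   # B[i,i] = beta_ii, B[i,j] = beta_ij / 2

xs <- -solve(B, b) / 2                 # stationary point: -B^{-1} b / 2
eigen(B)$values                        # all negative here, so a local maximum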

5.2 Some Response Surface Designs

If we are interested in only the first order properties of the response surface–i.e. the slope–then we can use factorial, fractional factorial, or similar designs. In order to test for second order properties–i.e. curvature–we will require new factor level combinations chosen carefully to allow us to estimate the β_{ij} terms.
Note that a second order model has 1 + k + k(k−1)/2 + k = (k + 1)(k + 2)/2 parameters to estimate. Hence, we will need at least that many observations to estimate the parameters. However, it is often inefficient to use a model with more than the minimal required parameters. Choice in design is often made to minimize the number of required observations, especially when these designs are used sequentially.

5.2.1 Central Composite Design2

A central composite design consists of three types of points: cube points from a fractional factorial design; central points that lie in the centre of the cube; and axial points where all but one factor is at the zero level. A theorem due to Hartley (1959) is presented in Wu & Hamada as Theorem 10.1 and is restated below.

2 See Wu & Hamada Section 10.7

Figure 5.1: A comparison of the number of parameters in a second order model with the data size of a 2^k factorial design, on the log2 scale. [Plot omitted: number of factors (2 to 6) on the x-axis, log2(data size) on the y-axis.]

Theorem 5.2.1. For a central composite design based on a 2^{k−q} fractional factorial design such that no defining relations contain a main factor, all of the β_i and β_{ii} are estimable, as well as one β_{ij} for each aliased group of factors.

The theorem implies that if we want to estimate all of the β_{ij}, then we cannot have any two-factor terms aliased with another two-factor term. Hence, we must avoid defining relations of length 4. Oddly enough, defining relations of length 3 are allowed even though they alias a main effect with a two-factor effect. The reason is that the axial points allow us to estimate the main effects and, hence, de-alias them from the two factor effects. A design with resolution 3 but with no defining relations of length 4 is referred to as resolution III*.

Cube Points

The cube points are those from a fractional factorial design, i.e. points whose factors take on the values ±1 and hence lie on the corners of an hypercube. As mentioned in the theorem above, ideally we would choose a 2^{k−q} design of resolution III*. Other designs beyond fractional factorials can be considered, such as the Plackett-Burman designs, which we will discuss in a later chapter.

Central Points

The central points are those where all factors are set to level zero and hence the points lie in the centre of the design. Without central points, we merely have a composite design. The central points allow for estimation of the second order parameters. Depending on the choice of axial points, multiple replications of the central point may be required. Wu & Hamada recommend anywhere from 1 to 5 replications of this point.

Axial Points

The axial points lie on the main axes of the design. That is, all but a single factor are set to zero and the single factor is tested at the levels ±α, where α is chosen in [1, √k]. The addition of axial points adds 2k treatments to the design.
If the value of α is chosen to be 1, then these points lie on the faces of the hypercube from the factorial design. A benefit of this is that only 3 factor levels are required–i.e. {−1, 0, 1}–instead of 5. As the shape of the design is now a cube, such a choice is also useful when optimizing over a cuboidal region, which is often the case.
If the value of α is chosen to be √k, then all of these and the cube points lie the same distance from the origin, namely a distance of √k. As a result, this is sometimes referred to as a spheroidal design. Values of α closer to √k allow for more varied observations, which can help with estimation efficiency. However, when α = √k, the variance estimate of ŷ is infinite without the inclusion of central points to stabilize the calculation.
One useful design criterion is to choose a design that is rotatable. This occurs when the variance of the predicted value ŷ can be written as a function of ||x||_2. That is, the variance of the prediction is only dependent on how far away the inputs are from the centre of the design. Allegedly, a 2^{k−q} design with resolution 5 and α = 2^{(k−q)/4} is rotatable. Note that while rotatability is a nice property, it is dependent on how the factors are coded and hence is not invariant to such choices.
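A central composite design is easy to assemble by hand; the following is a minimal sketch for k = 2 coded factors with a chosen α and n_c centre points (this is a hand-rolled construction, not the rsm package's own constructor):

# Cube, centre, and axial points of a central composite design in k = 2 factors.
alpha <- sqrt(2)   # spherical (and rotatable) choice for k = 2
n_c   <- 3         # number of centre-point replicates

cube   <- expand.grid(x1 = c(-1, 1), x2 = c(-1, 1))
center <- data.frame(x1 = rep(0, n_c), x2 = rep(0, n_c))
axial  <- data.frame(x1 = c(-alpha, alpha, 0, 0),
                     x2 = c(0, 0, -alpha, alpha))
ccd2 <- rbind(cube, center, axial)
ccd2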

5.2.2 Box-Behnken Design3

In 1960, Box and Behnken proposed a method of combining the factorial designs of Section 4.1 with the balanced incomplete block designs of Section 2.4. The idea is that if each block tests k of the factors, then we replace each block with a 2^k factorial design in those factors, with the untested factors set to zero. As an example, we could begin with a BIBD such as

3 See Wu & Hamada Section 10.7

            Factor
 block    A        B        C        D
   1    y_{1,1}  y_{1,2}     .        .
   2       .     y_{2,2}  y_{2,3}     .
   3       .        .     y_{3,3}  y_{3,4}
   4    y_{4,1}     .     y_{4,3}     .
   5       .     y_{5,2}     .     y_{5,4}
   6    y_{6,1}     .        .     y_{6,4}

However, we would replace each block with a 2^2 factorial design. For example, the first row would become

            Factor
 block    A    B    C    D
   1      1    1    0    0
          1   -1    0    0
         -1    1    0    0
         -1   -1    0    0

The total number of treatments to test would be #{blocks} × 2^2, or 24 in this example. Geometrically, we are testing points on the faces of the hypercube. However, all of the points are equal in Euclidean norm and hence lie on the same sphere. This is geometrically nice as our points lie on the edges of the same cube and the surface of the same sphere. Also, only three factor levels are required for each factor. One additional requirement, however, is that we will need to replicate the central point multiple times in order to compute the variance.
One more property of note is that such a design is said to be orthogonally blocked. This is the case where the block effects and parameter effects are orthogonal and hence do not interfere with one another.
The total number of treatments to test becomes too large as the number of factors increases. Hence, other methods are more efficient for larger numbers of factors. A table of the sizes can be found in the Chapter 10 Appendix of Wu & Hamada and also on the Wikipedia page.4
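A minimal sketch of the construction for the four-factor example above: take each block (pair of factors) from the BIBD, cross those two factors at ±1, and hold the remaining factors at zero. The pairs below are the blocks of the BIBD shown earlier.

# Box-Behnken construction: replace each BIBD block by a 2^2 factorial
# in the factors it contains, holding the other factors at zero.
pairs <- list(c("A","B"), c("B","C"), c("C","D"),
              c("A","C"), c("B","D"), c("A","D"))
blocks <- lapply(pairs, function(p) {
  run <- expand.grid(rep(list(c(-1, 1)), 2))
  names(run) <- p
  out <- data.frame(A = 0, B = 0, C = 0, D = 0)[rep(1, 4), ]
  out[, p] <- run
  out
})
bbd4 <- do.call(rbind, blocks)   # 6 blocks x 4 runs = 24 treatments
nrow(bbd4)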

5.2.3 Uniform Shell Design5

Uniform Shell Designs are due to Doehlert (1970) in a paper by the same name. The idea is that "designs are generated which have an equally spaced distribution of points lying on concentric spherical shells" and that more points can be added to fill in the entire sphere if desired. For k factors, this design considers k^2 + k points

4 https://en.wikipedia.org/wiki/Box%E2%80%93Behnken_design 5 See Wu & Hamada Section 10.7

uniformly spaced over the surface of a sphere and a single point at the centre of the sphere. The total, k^2 + k + 1, is roughly twice the minimal requirement of (k^2 + 3k + 2)/2 for large values of k. Hence, this method is generally only used for small values of k. As these designs can be rotated, they are often rotated to minimize the number of factor levels that need to be considered.
While the Box-Behnken and central composite designs with α = √k are also spherical in nature, the important feature of the uniform shell design is the fact that the points are uniformly spread across the sphere's surface. In general, spherical designs can be repeated for spheres of different radii in order to better estimate the change in the response as the inputs move away from the central point.

Remark 5.2.2. Fun Fact: For k = 3, the uniform shell design coincides with the Box-Behnken design.

5.3 Search and Optimization6

Experiments such as those discussed above can be repeated sequentially at different factor levels in order to search the input space for an optimal–minimal or maximal–response.

5.3.1 Ascent via First Order Designs

To find the direction of greatest change in the response surface, we only require estimates of the first order components of the model, so a fractional factorial design will suffice at this stage. However, in order to test for overall significance of the local curvature, we will add some central points to the design.
Specifically, consider a 2^{k−q} design with n_c central points included. Recall that the second order model is

y = β_0 + Σ_{i=1}^{k} β_i x_i + Σ_{i<j} β_{ij} x_i x_j + Σ_{i=1}^{k} β_{ii} x_i^2 + ε.

When testing a central point, we have y = β_0 + ε. Hence, the average of the n_c central points is an unbiased estimator of β_0. That is, E ȳ_c = β_0. In contrast, consider the 2^{k−q} points from the factorial design. The average of

6 See Wu & Hamada Section 10.3

those points, denoted ȳ_f, is

ȳ_f = 2^{q−k} Σ_{treatments} ( β_0 + Σ_{i=1}^{k} β_i x_i + Σ_{i<j} β_{ij} x_i x_j + Σ_{i=1}^{k} β_{ii} x_i^2 + ε ).

Over the factorial points each x_i = ±1, so the linear and interaction terms sum to zero while x_i^2 = 1. Hence E ȳ_f = β_0 + Σ_{i=1}^{k} β_{ii}, and ȳ_f − ȳ_c is an unbiased estimator of Σ_{i=1}^{k} β_{ii}. A t-test of H_0 : Σ β_{ii} = 0 based on ȳ_f − ȳ_c therefore provides an overall check for local curvature; this is illustrated in the example of the next section.

5.4 Chemical Reaction Data Example

In this section, we consider the ChemReact dataset from the R package rsm. This dataset contains two factors, time and temperature, and the response is the yield of the chemical reaction. The first block consists of a 2^2 design with n_c = 3 centre points. The second block contains an additional 3 centre points as well as 4 axial points. First, we have to code the factor levels by running the command

CR <- coded.data(ChemReact, x1 ~ (Time - 85)/5, x2 ~ (Temp - 175)/5)

As a result, α = √2, making this a spherical (and rotatable) design. Considering the first block, we can perform a curvature check to test the null hypothesis H_0 : β_11 + β_22 = 0. The t-statistic is

|ȳ_f − ȳ_c| / ( s_p √(1/3 + 1/4) ) ≈ 13.78.

Comparing this to a t distribution with 2 degrees of freedom results in a p-value of about 0.005, indicating that there is evidence of curvature in this section of the response surface. A first order model can be fit to the 2^2 factorial portion with the command

rsm( Yield∼FO(x1,x2), data=CR, subset=1:4 )

In this case, we get the model

ŷ = 81.875 + 0.875 x_1 + 0.625 x_2.

We can also include the center points in the first order model,

rsm( Yield∼FO(x1,x2), data=CR, subset=1:7 )

Then, the coefficient estimates for x1 and x2 do not change. However, the intercept term is now β̂_0 = 82.8. Furthermore, the standard errors, degrees of freedom, and thus the p-values for the t-tests change. In both cases, we get a vector for the direction of steepest ascent, (0.814, 0.581). We can further try to estimate the pure quadratic terms–i.e. the coefficients for x_1^2 and x_2^2–with only the factorial and central points.

rsm( Yield∼FO(x1,x2)+PQ(x1,x2), data=CR, subset=1:7 )

However, they will be aliased, and similarly to the above t-test, we can only estimate the sum β_11 + β_22 until the second set of data is collected. Considering the entire dataset, we can fit a full second order model with

rsm( Yield∼SO(x1,x2), data=CR ).

In this case, we get a stationary point at (0.37, 0.33), which corresponds to a time and temperature of (86.9, 176.7). The eigenvalue estimates are −0.92 and −1.31 indicating we have a local maximum. It is worth noting, however, that performing an F-test on whether or not the quadratic terms improve the fit of the model yields a non-significant p-value. This may indicate that we should continue searching the parameter space. A plot of the response surface and design points is displayed in Figure 5.2.

Figure 5.2: A plot of the fitted second order response surface to the chemical reaction dataset. The solid red point is the centre point of the design. The four circle points are from the 2^2 design. The x points are the axial points. The solid blue point is the critical point, and the green arrow is the direction of greatest increase based on the first order model. [Plot omitted: contours of the fitted surface over Time (x-axis) and Temp (y-axis).]

Design Matrices for Chemical Data

The design matrix can be extracted via the model.matrix() function. For this dataset, we get

      β0    β1    β2   β12   β11   β22
  1    1    -1    -1     1     1     1
  2    1    -1     1    -1     1     1
  3    1     1    -1    -1     1     1
  4    1     1     1     1     1     1
  5    1     0     0     0     0     0
  6    1     0     0     0     0     0
  7    1     0     0     0     0     0
  8    1     0     0     0     0     0
  9    1     0     0     0     0     0
 10    1     0     0     0     0     0
 11    1    √2     0     0     2     0
 12    1   -√2     0     0     2     0
 13    1     0    √2     0     0     2
 14    1     0   -√2     0     0     2

Denoting this matrix as X, we can see the effects of including or excluding the centre points from the model. For comparison, the two matrices X^T X for the above data, with and without the centre points, are respectively

14 ... 8 8  8 ... 8 8   . 8 ....  . 8 ....       .. 8 ...  .. 8 ...    and   .  ... 4 ..  ... 4 ..       8 ... 12 4  8 ... 12 4  8 ... 4 12 8 ... 4 12

The second matrix here is not invertible. The eigenvalues for the above matrices are, respectively,

(26.35, 8, 8, 8, 4, 3.64) and (24, 8, 8, 8, 4, 0).

Hence, recalling that the variance of the estimator β̂ is Var(β̂) = σ^2 (X^T X)^{-1}, we have that the variance is infinite when there are no centre points in the design. This is specifically because the designer chose α = √2. If a smaller value closer to 1 had been chosen, then the variance would not be infinite. However, the addition of centre points stabilizes the computation regardless of the choice of α.
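These matrices and eigenvalues are easy to verify; a minimal sketch that builds the design matrix directly rather than extracting it with model.matrix():

# Second-order model matrix for the central composite design with alpha = sqrt(2):
# 4 cube points, 6 centre points, 4 axial points (rows in the order shown above).
a  <- sqrt(2)
x1 <- c(-1, -1,  1, 1, rep(0, 6),  a, -a,  0,  0)
x2 <- c(-1,  1, -1, 1, rep(0, 6),  0,  0,  a, -a)
X  <- cbind(1, x1, x2, x1 * x2, x1^2, x2^2)

eigen(crossprod(X))$values              # with centre points: all positive
eigen(crossprod(X[-(5:10), ]))$values   # without them: smallest eigenvalue is (numerically) 0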

Chapter 6

Nonregular, Nonnormal, and other Designs

Introduction

In this chapter, we will consider a collection of other designs beyond the scope of what was discussed previously. These include factorial designs with more than 2 or 3 levels, mixed level factorial designs, and nonregular designs such as the Plackett-Burman designs. Often we will not have precisely 2^k or 3^k observations to work with. If we are given a maximum of, say, 20 or 25 measurements, then how best can we design an experiment to test for the significance of all of the factors to be considered?

6.1 Prime Level Factorial Designs1

In Chapter 4, we discussed factorial designs at 2 and 3 levels. These same concepts can be extended to factorial designs at r levels for any prime number r. The most common designs beyond 2 and 3 levels are 5 and 7 levels, which will be discussed in the subsections below.

Remark 6.1.1. Note that the requirement that r be prime comes from the fact that the integers modulo r form a finite field when r is prime, so that every nonzero element has a multiplicative inverse. For example, for r = 5, we have that

1 × 1 = 2 × 3 = 4 × 4 = 1 mod 5.

Whereas if r = 4, the element 2 does not have such an inverse, since

2 × 1 = 2 mod 4, 2 × 2 = 0 mod 4, 2 × 3 = 2 mod 4.

1 See Wu & Hamada Section 7.8

If r is a power of a prime like 4, 8, or 9, then one can use Galois theory to construct a design.

In general, to construct an r^{k−q} fractional factorial design, we begin with k − q orthogonal columns of length r^{k−q} with entries 0, 1, . . . , r − 1. These columns correspond to the main effects to be tested. Denoting them by x_1, . . . , x_{k−q}, we can construct the interaction effects by taking linear combinations of these columns modulo r:

Σ_{i=1}^{k−q} c_i x_i mod r

for c_i = 0, 1, . . . , r − 1. Counting the total number of unique columns: we have r^{k−q} choices for the c_i; we subtract 1 for the case when c_i = 0 for all i; and we impose the restriction that the first non-zero c_i = 1 for uniqueness of the factors. As a result, we have a total of

(r^{k−q} − 1) / (r − 1)

factors to test. Recall that this coincides with the 2 and 3 level settings where we had 2^{k−q} − 1 and (3^{k−q} − 1)/2 effects, respectively, to test. As an example, consider the 3^{3−1} design with factors A, B, C in the following table:

      A   B   AB   C = AB^2
  1   0   0    0      0
  2   0   1    1      2
  3   0   2    2      1
  4   1   0    1      1
  5   1   1    2      0
  6   1   2    0      2
  7   2   0    2      2
  8   2   1    0      1
  9   2   2    1      0
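A minimal sketch of this construction, reproducing the table above by taking linear combinations of the two base columns modulo r = 3:

# Build the 3^(3-1) design: interaction columns are linear combinations mod r.
r    <- 3
base <- expand.grid(B = 0:(r - 1), A = 0:(r - 1))[, c("A", "B")]  # 9 runs
AB   <- (base$A + base$B)     %% r   # interaction column AB
C    <- (base$A + 2 * base$B) %% r   # defining relation C = AB^2
cbind(base, AB = AB, C = C)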

6.1.1 5 level designs

In this section, we will consider the special case of the 25-run design, which is a 5^{k−(k−2)} fractional factorial design based on 25 observations and k factors taking on 5 levels each. In such a design, we will be able to test (25 − 1)/(5 − 1) = 6 effects. Similar to the general case presented above, we begin with two orthogonal columns of length 25 taking values 0, 1, 2, 3, 4. Then, the four interaction columns can be computed via

x_1 + c x_2 mod 5

for c = 1, 2, 3, 4, giving terms AB, AB^2, AB^3, AB^4. The main effects A and B each have 5 − 1 = 4 degrees of freedom. The A × B interaction has the remaining 16, of which 4 are given to each of the sub-interaction terms. We could treat this as a 5^2 full factorial design. However, if there are more 5-level factors to include in the model, we can add defining relations such as

C = AB, D = AB^2, E = AB^3, F = AB^4

to treat this as a 5^{k−(k−2)} design for k = 3, 4, 5, 6. For example, if k = 3, we would have three factors A, B, C and the defining relation I = ABC^4. Then, for example, the term AB^4 would be aliased with

AB^4     = A^2 B^5 C^4  = AC^2
(AB^4)^2 = A^3 B^9 C^4  = A^3 B^4 C^4 = AB^3 C^3
(AB^4)^3 = A^4 B^{13} C^4 = A^4 B^3 C^4 = AB^2 C
(AB^4)^4 = A^5 B^{17} C^4 = B^2 C^4    = BC^2

(multiplying by the defining relation each time and normalizing so that the leading exponent is 1). Thus, the aliased group is {AB^4, AC^2, BC^2, AB^3C^3, AB^2C}.
If we consider the entire 25 × 6 table–displayed in Table 6.1–the first two factors A and B can be treated as the rows and columns, respectively, of a Latin square design. We can thus include factors C, D, E, and/or F to construct a Latin, Graeco-Latin, or Hyper-Graeco-Latin square design. As there are 4 mutually orthogonal 5 × 5 Latin squares, each of the 5^{k−(k−2)} designs can be treated in this way for k = 3, 4, 5, 6.

Remark 6.1.2. A general formula for the number of mutually orthogonal n × n Latin squares does not exist. Denoting this number by a(n), it is known2 that a(r^k) = r^k − 1 for all primes r and integers k > 0. Hence, we can consider any r^{k−(k−2)} factorial design as a hyper-Graeco-Latin square design.

When analyzing such a design, the main effects have 4 degrees of freedom each. Hence, if they are ordinal variables, then we can extend the polynomial contrasts considered in the 3-level factorial setting to linear, quadratic, cubic, and quartic contrasts. The contrast vectors are

Linear:    A1 = (−2, −1, 0, 1, 2)/√10
Quadratic: A2 = (2, −1, −2, −1, 2)/√14
Cubic:     A3 = (−1, 2, 0, −2, 1)/√10
Quartic:   A4 = (1, −4, 6, −4, 1)/√70

2See https://oeis.org/A001438
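In R, these contrast vectors are (up to rounding) the columns returned by contr.poly() for a 5-level factor:

# Orthogonal polynomial contrasts for 5 ordered levels; the columns are the
# linear, quadratic, cubic, and quartic vectors above, scaled to unit length.
round(contr.poly(5), 4)
# e.g. the linear column is (-2, -1, 0, 1, 2)/sqrt(10) = (-0.6325, -0.3162, 0, 0.3162, 0.6325)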

      A   B   C = AB   D = AB^2   E = AB^3   F = AB^4
  1   0   0     0         0          0          0
  2   0   1     1         2          3          4
  3   0   2     2         4          1          3
  4   0   3     3         1          4          2
  5   0   4     4         3          2          1
  6   1   0     1         1          1          1
  7   1   1     2         3          4          0
  8   1   2     3         0          2          4
  9   1   3     4         2          0          3
 10   1   4     0         4          3          2
 11   2   0     2         2          2          2
 12   2   1     3         4          0          1
 13   2   2     4         1          3          0
 14   2   3     0         3          1          4
 15   2   4     1         0          4          3
 16   3   0     3         3          3          3
 17   3   1     4         0          1          2
 18   3   2     0         2          4          1
 19   3   3     1         4          2          0
 20   3   4     2         1          0          4
 21   4   0     4         4          4          4
 22   4   1     0         1          2          3
 23   4   2     1         3          0          2
 24   4   3     2         0          3          1
 25   4   4     3         2          1          0

Table 6.1: A 5^{k−(k−2)} factorial design.

The interaction term A × B can be similarly broken up into 16 contrasts (AB)_{i,j} for i, j = 1, 2, 3, 4.

6.1.2 7 level designs

The 7-level design of most interest is the 49-run design, similar to the 25-run design considered in the previous subsection. That is because 7^3 = 343 is a large number of observations to require, and most likely the desired hypotheses can be tested with a smaller design. Mathematically, this design is nearly identical to the previous one except that now each factor can take on 7 levels, making the arithmetic performed modulo 7. We can similarly consider all 7^{k−(k−2)} designs as hyper-Graeco-Latin squares. The maximal number of factors to test will be (49 − 1)/(7 − 1) = 8. Each main effect has 6 degrees of freedom. Hence, we could feasibly consider polynomial contrasts from linear up to 6th degree. Generally, these higher order polynomials are not wise to consider as (1) they may lead to overfitting and (2) they are often difficult to interpret.

Remark 6.1.3. Note that (r^2 − 1)/(r − 1) = r + 1 for integers r ≥ 2.

6.1.3 Example of a 25-run design

Data was randomly generated based on a 25-run design with k = 4 factors to test. The "yield" was generated as a normal random variate with variance 1 and mean

µ(C,D) = D/2 + φ(C) where φ : {0, 1, 2, 3, 4} → {−1, 1, 0, −1, 1} and C,D = 0, 1, 2, 3, 4. That is, the yield is linear in D and cubic in C. Running a standard ANOVA model only considering main effects gives

            Df  Sum Sq  Mean Sq  F value  Pr(>F)
 A            4   0.560    0.140    0.082  0.9856
 B            4   1.561    0.390    0.229  0.9143
 C            4  23.930    5.982    3.515  0.0614 .
 D            4  16.504    4.126    2.425  0.1333
 Residuals    8  13.614    1.702

Here, we see some marginal significance in C and none in D. However, expanding with respect to polynomial contrasts gives


If we also tried to include A × B interactions in our model, we would have no extra degrees of freedom to compute the residuals. However, the mean sum of squares for A × B will still be large. We could quickly apply the hierarchy principal to debunk this as a false positive. The source of this significance comes from the aliasing relations as C = AB, and D = AB2. The terms AB3 and AB4 are still orthogonal and thus A × B still has 8 degrees of freedom remaining.

6.2 Mixed Level Designs

In this section, we will consider mixed 2 & 4 and 2 & 3 level designs. That is, some factors in these models will have two levels and others will have 3 or 4 levels. This will, of course, allow for more flexible modelling and testing scenarios. Before discussing such designs, we require the notion of a symmetric orthogonal array,3 which is one with the same number of levels in each column, and the more general mixed level orthogonal array.

Definition 6.2.1 (Symmetric Orthogonal Array). Given s symbols denoted 1, 2, . . . , s and an integer t > 0 referred to as the strength of the table, an N × m orthogonal array of strength t, denoted OA(N, s^m, t), is an N × m table with entries 1, . . . , s such that in any set of t columns every combination of t symbols appears an equal number of times.

We have already seen many examples of orthogonal arrays, which include the 2^{k−q} and 3^{k−q} designs. For example, a 2^{4−1} design would be an OA(8, 2^7, t). In the

3 https://en.wikipedia.org/wiki/Orthogonal_array

following table, we have a 2^{4−1} design with defining relation I = ABCD, giving us a resolution of 4.

Remark 6.2.2 (Resolution and Strength). In Wu & Hamada, it is claimed that the strength of an orthogonal array is the resolution minus 1. However, the three columns A, B, and AB will only ever take on 4 of the 8 possible binary combinations. Instead, it is the strength of the array restricted to the main effect columns that equals the resolution minus 1.

      A  B  C  D  AB  AC  BC
  1   .  .  .  .   .   .   .
  2   .  .  +  +   .   +   +
  3   .  +  .  +   +   .   +
  4   .  +  +  .   +   +   .
  5   +  .  .  +   +   +   .
  6   +  .  +  .   +   .   +
  7   +  +  .  .   .   +   +
  8   +  +  +  +   .   .   .

Here, for example, the first three columns will contain all 8 unique binary sequences, while the last three columns will contain 4 unique sequences each repeated twice. Similarly, for 3^{k−q} designs, we have an orthogonal array OA(3^{k−q}, 3^{(3^{k−q}−1)/2}, t). The strength of this array is also the resolution of the design minus 1. As we will be concerned with mixed-level designs in this section, we need a more general definition of orthogonal arrays.

Definition 6.2.3 (Mixed Level Orthogonal Array). Given γ sets of symbols, the ith consisting of s_i symbols denoted 1, 2, . . . , s_i, with m = Σ_{i=1}^{γ} m_i, and an integer t > 0 referred to as the strength of the table, an N × m orthogonal array of strength t, denoted

OA(N, s_1^{m_1} · · · s_γ^{m_γ}, t),

is an N × m table with m_i columns having entries 1, . . . , s_i, such that in any set of t columns every combination of symbols appears an equal number of times.

A simple example of a mixed level orthogonal array is the OA(8, 2^4 4^1, 2) below. It has m = 4 + 1 columns and strength 2, resulting from the inclusion of the A column.

      A   1   2   3   123
  1   0   .   .   .    .
  2   1   .   .   +    +
  3   2   .   +   .    +
  4   3   .   +   +    .
  5   3   +   .   .    +
  6   2   +   .   +    .
  7   1   +   +   .    .
  8   0   +   +   +    +

6.2.1 2^n 4^m Designs4

In Wu & Hamada, a "Method of Replacement" is detailed for constructing a 2^n 4^m design from a 2-level design. To do this, we begin with a symmetric orthogonal array OA(N, 2^m, t) with t ≥ 2, which is any 2^{k−q} fractional factorial design with resolution 3 or greater.
Consequently, any two columns in OA(N, 2^m, t) will contain all 4 possible combinations of {·, +}, as otherwise the two columns would be aliased. Denoting these two columns as α and β, there is a third column in the array constructed by the product of columns α and β, which will be denoted αβ. Considering only these three columns, we have four possible triples, which can be mapped to the levels 0, 1, 2, 3. Hence, we replace these three 2-level columns with a single 4-level column.

  α   β   αβ        A
  .   .   .    →    0
  .   +   +    →    1
  +   .   +    →    2
  +   +   .    →    3
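A minimal sketch of one replacement step: with 0/1 coding (and αβ = α + β mod 2 determined by the other two columns), the mapping in the table above is simply 2α + β.

# Collapse the triple (alpha, beta, alpha*beta) of 2-level columns into
# a single 4-level column.
replace_triple <- function(alpha, beta) 2L * alpha + beta

alpha <- c(0, 0, 1, 1)
beta  <- c(0, 1, 0, 1)
replace_triple(alpha, beta)   # 0 1 2 3, matching (..)->0, (.+)->1, (+.)->2, (++)->3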

A similar example occurs in the previous two example arrays, where the columns AB, AC, and BC were combined into a single 4-level factor and A, B, C, D were relabeled as 1, 2, 3, 123. This process can be iterated n times to transform an OA(N, 2^m) into an OA(N, 2^{m−3n} 4^n).
If k is even, say k = 2l, then 2^k − 1 = 2^{2l} − 1 = 4^l − 1, which is divisible by 3 as 4^l ≡ 1 mod 3 for any integer l ≥ 0. It has been proven that it is possible to, in fact, decompose a 2^k design with k even into (2^k − 1)/3 mutually exclusive sets of three columns, each of which can in turn be replaced with a 4-level column. Hence, any design of the form OA(2^k, 2^{m−3n} 4^n) exists for k even, m = 2^k − 1, and n = 0, 1, . . . , (2^k − 1)/3.

4 See Wu & Hamada Sections 7.2, 7.3, & 7.4

For k odd, say k = 2l + 1, we have that

2^k − 2 = 2^{2l+1} − 2 = 2(4^l − 1).

Hence, in this case 2^k − 2 is divisible by 3. However, it has been proven5 that actually only (2^k − 5)/3 mutually exclusive sets of three columns exist when k is odd. Therefore, when k is odd, designs of the form OA(2^k, 2^{m−3n} 4^n) exist for m = 2^k − 1 and n = 0, 1, . . . , (2^k − 5)/3. This implies that for the 16 and 64 run designs (k = 4 and k = 6), we can accommodate up to 5 and 21 four-level factors, respectively.

Construction of an OA(16, 4^5) design

Beginning with a full 2^4 factorial design, written with all of its columns as the orthogonal array OA(2^4, 2^{2^4 − 1}), we can replace each of five mutually exclusive sets of three columns by a single 4-level column as follows.

      1  2  12  3  4  34  13  24  1234  23  124  134  123  14  234
  1   .  .  .   .  .  .    .   .    .    .   .    .    .    .   .
  2   .  .  .   .  +  +    .   +    +    .   +    +    .    +   +
  3   .  .  .   +  .  +    +   .    +    +   .    +    +    .   +
  4   .  .  .   +  +  .    +   +    .    +   +    .    +    +   .
  5   .  +  +   .  .  .    .   +    +    +   +    .    +    .   +
  6   .  +  +   .  +  +    .   .    .    +   .    +    +    +   .
  7   .  +  +   +  .  +    +   +    .    .   +    +    .    .   .
  8   .  +  +   +  +  .    +   .    +    .   .    .    .    +   +
  9   +  .  +   .  .  .    +   .    +    .   +    +    +    +   .
 10   +  .  +   .  +  +    +   +    .    .   .    .    +    .   +
 11   +  .  +   +  .  +    .   .    .    +   +    .    .    +   +
 12   +  .  +   +  +  .    .   +    +    +   .    +    .    .   .
 13   +  +  .   .  .  .    +   +    .    +   .    +    .    +   +
 14   +  +  .   .  +  +    +   .    +    +   +    .    .    .   .
 15   +  +  .   +  .  +    .   +    +    .   .    .    +    +   .
 16   +  +  .   +  +  .    .   .    .    .   +    +    +    .   +

5 Wu, C. F. J. Construction of 2^m 4^n designs via a grouping scheme. The Annals of Statistics (1989): 1880-1885.

82 (1,2,12) (3,4,34) (13,24,1234) (23,124,134) (123,14,234) 1 0 0 0 0 0 2 0 1 1 1 1 3 0 2 2 2 2 4 0 3 3 3 3 5 1 0 1 3 2 6 1 1 0 2 3 7 1 2 3 1 0 8 1 3 2 0 1 9 2 0 2 1 3 10 2 1 3 0 2 11 2 2 0 3 1 12 2 3 1 2 0 13 3 0 3 2 1 14 3 1 2 3 0 15 3 2 1 0 3 16 3 3 0 1 2

Counting degrees of freedom, we started with 15 = 2^4 − 1. We now have 5 × (4 − 1) = 15, which coincides. This is an OA(2^4, 4^5, 2). Furthermore, as 4 is a prime power–i.e. 4 = 2^2–there are 3 mutually orthogonal 4 × 4 Latin squares. Hence, in the above design, the first two columns can be considered as the row and column of an hyper-Graeco-Latin square with 3 experimental factors to test.

Choosing and Analysing a Design

Briefly, the concept of aberration can be extended to this setting for choosing which sets of three 2-level columns to collapse into a 4-level column. It is often preferred to have defining relations that include the 4-level factors, as they have more degrees of freedom to spare than the 2-level factors. Hence, significance may still be detectable.
For analysis, polynomial contrasts for the 4-level factors can be considered to capture linear, quadratic, and cubic effects. It is recommended, though, to consider an alternative system of contrasts related to the polynomials:

A1 = (−1, −1, 1, 1)

A2 = (−1, 1, 1, −1)

A3 = (−1, 1, −1, 1)

The reason for this system is that it coincides with the construction of the 4-level factor. Recalling the table from above,

  α   β   αβ        A
  .   .   .    →    0
  .   +   +    →    1
  +   .   +    →    2
  +   +   .    →    3

we see that the contrasts A1, A2, and A3 correspond, respectively, to the factorial effects α, αβ, and β.
For two 4-level factors, A and B, we cannot decompose their interaction as we did for the prime level factors because the integers mod 4 do not form a field. To illustrate the problem, for 3-level factors we have

  A        0  0  0  1  1  1  2  2  2
  B        0  1  2  0  1  2  0  1  2
  AB       0  1  2  1  2  0  2  0  1
  A^2B^2   0  2  1  2  1  0  1  0  2

The last two rows are identical up to swapping 1 and 2. Thus, the interactions AB and A^2B^2 are equivalent. However, if each of these factors has 4 levels, then

  A        0 0 0 0  1 1 1 1  2 2 2 2  3 3 3 3
  B        0 1 2 3  0 1 2 3  0 1 2 3  0 1 2 3
  AB       0 1 2 3  1 2 3 0  2 3 0 1  3 0 1 2
  A^2B^2   0 2 0 2  2 0 2 0  0 2 0 2  2 0 2 0

Hence, the term A^2B^2 only takes on 2 of the possible 4 levels. Additionally, the term AB^2 will only take on even values when A is even and odd values when A is odd, so not every possible level combination will occur.

6.2.2 36-Run Designs6

We can construct a 36-run design by combining a 2^{3−1} and a 3^{3−1} design. In the first case, we impose the defining relation I = ABC. In the second case, it is I = DEF^2. These two designs can be considered as symmetric orthogonal arrays OA(4, 2^3, 2) and OA(9, 3^4, 2).

6 See Wu & Hamada Section 7.7

OA(4, 2^3, 2):

      A  B  C = AB
  1   .  .    .
  2   .  +    +
  3   +  .    +
  4   +  +    .

OA(9, 3^4, 2):

      D  E  F = DE  DE^2
  1   0  0    0       0
  2   0  1    1       2
  3   0  2    2       1
  4   1  0    1       1
  5   1  1    2       0
  6   1  2    0       2
  7   2  0    2       2
  8   2  1    0       1
  9   2  2    1       0

We can combine these two orthogonal arrays into an OA(36, 2^3 3^3, 2) by applying a "tensor"-like operation. The first set of four rows would be

      A  B  C  D  E  F
  1   .  .  .  0  0  0
  2   .  +  +  0  0  0
  3   +  .  +  0  0  0
  4   +  +  .  0  0  0

This can then be repeated eight more times for the other factor level combinations of D, E, and F. The defining contrast subgroup of this 2^{3−1} 3^{3−1} design can be constructed by multiplying the defining words together to get

I = ABC = DEF^2 = ABCDEF^2 = ABCD^2E^2F.

This design has 35 degrees of freedom to work with. The 2-level main effects have 1 DoF each and the 3-level main effects have 2 DoFs each, totalling 9. There is one 3 × 3 interaction, DE^2, which is not aliased with any main effects and has 2 DoFs. There are nine 2 × 3 interaction factors, which come from the defining words of length 6; these can be formed by choosing one of {A, B, C} and pairing it with one of {D, E, F}. Note that, for example, AD and AD^2 are equivalent effects. Furthermore, each of these has (3 − 1) × (2 − 1) = 2 DoFs. In total, thus far, we have 9 + 2 + 2 × 9 = 29.
The remaining 6 degrees of freedom come from the interactions of {A, B, C} with DE^2, each having 2 DoFs. These final three interaction terms can be ignored and their 6 total degrees of freedom reserved for the residual sum of squares.
Linear and quadratic contrasts can be considered for the 3-level factors. Furthermore, interactions between 2 and 3 level factors can be interpreted as, for example with AD, a difference in the linear or quadratic behaviour of D conditional on whether A = − or A = +.

Remark 6.2.4 (Other 36-Run Designs). Similarly to the previous section, we can also construct 36-run designs by combining one design from {2^2, 2^{3−1}} with one design from {3^2, 3^{3−1}, 3^{4−2}}.

6.3 Nonregular Designs

Thus far, every design considered can be classified as regular, meaning that all of the factorial effects are either orthogonal or completely aliased–that is, they have correlation either 0 or ±1. A nonregular design allows for correlations between factorial effects that are neither zero nor one in absolute value. This is similar to how polynomial contrasts were shown to have partial aliasing in the 3^{k−q} designs. However, now we are concerned with the actual factorial effects and not correlations between specific contrasts.
The reason for considering such designs is one of economy and efficiency. If, for example, we were interested in a 2^{k−q} factorial design to test 6 main effects and 15 two-way interactions, we would require a 2^{6−1} design with 32 observations, whereas we would only be using 21 of the 31 degrees of freedom for parameter estimation. Instead, we could consider a Plackett-Burman design on 24 runs. A reduction of 8 observations can save a lot of resources, especially if this design were to be used iteratively in a response surface search procedure.

6.3.1 Plackett-Burman Designs

For designs involving only 2-level factors, we have so far considered only cases where the number of runs is a power of 2. The design matrix for these designs can be thought of as an Hadamard matrix.

Definition 6.3.1 (Hadamard matrix). An n × n Hadamard matrix, denoted H_n, is a matrix with entries ±1 whose columns are mutually orthogonal, i.e. H_n^T H_n = n I_n.

The simplest example is

H_2 = [ 1   1 ]
      [ 1  -1 ].

From this matrix, we can construct any H_n for n = 2^k by successive application of the tensor or Kronecker product for matrices. From an Hadamard matrix of size n, we can construct one of size 2n by

H_{2n} = H_2 ⊗ H_n = [ H_n   H_n ]
                     [ H_n  -H_n ].

Using this formula to construct an H_{2^k} matrix gives one where the first column is all ones. We can remove this column–it corresponds to the intercept term in our model–and consider the 2^k × (2^k − 1) matrix as our design matrix. For example,

when k = 3, we have

1 1 1 1 1 1 1 1 1 −1 1 −1 1 −1 1 −1   1 1 −1 −1 1 1 −1 −1   1 −1 −1 1 1 −1 −1 1 H8 =   1 1 1 1 −1 −1 −1 −1   1 −1 1 −1 −1 1 −1 1   1 1 −1 −1 −1 −1 1 1 1 −1 −1 1 −1 1 1 −1

We can remove the first column and consider the 2^3 design

      A   B  AB   C  AC  BC  ABC
  1   1   1   1   1   1   1    1
  2  -1   1  -1   1  -1   1   -1
  3   1  -1  -1   1   1  -1   -1
  4  -1  -1   1   1  -1  -1    1
  5   1   1   1  -1  -1  -1   -1
  6  -1   1  -1  -1   1  -1    1
  7   1  -1  -1  -1  -1   1    1
  8  -1  -1   1  -1   1   1   -1
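A minimal sketch of this construction in R, building H_8 by repeated Kronecker products and dropping the first (all-ones) column to obtain the 2^3 design above:

# Hadamard matrices of order 2^k by repeated Kronecker products.
H2 <- matrix(c(1, 1,
               1, -1), nrow = 2, byrow = TRUE)
H8 <- kronecker(H2, kronecker(H2, H2))

all(crossprod(H8) == 8 * diag(8))   # TRUE: columns are orthogonal
design <- H8[, -1]                  # drop the intercept column: the 8 x 7 design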

From here, we can construct fractional factorial designs on 8 runs by including aliasing relations such as D = ABC. If we have reason to believe that all of the interactions between the factors are negligible, we could feasibly pack 7 factors into this design.
In the case where the number of runs is N = 2^k, the Plackett-Burman design coincides with the set of 2^{(k+q)−q} fractional factorial designs for q = 0, 1, . . . , 2^k − k − 1. These correspond to orthogonal arrays OA(2^k, 2^m, t) where m is the number of factors and t is the strength of the design, which is the resolution minus 1.
If we are interested in 2-level designs with N not a power of 2, we need to work harder to construct the matrix H_N. First of all, N cannot be odd, as every column must have an equal number of 1's and -1's. Secondly, N must be divisible by 4. Otherwise, we have, without loss of generality, a first column of all 1's and a second column of N/2 1's followed by N/2 -1's. Any subsequent column must be orthogonal to both of these. Orthogonality with respect to column 1 requires an equal number of 1's and -1's. Denoting the ith column by X_i, we have

0 = X_1 · X_3 = Σ_{i=1}^{N} (X_3)_i.

equal the sum of the second N/2 entries, as

0 = X_2 · X_3 = Σ_{i=1}^{N/2} (X_3)_i − Σ_{i=N/2+1}^{N} (X_3)_i.

Hence, these two conditions can only hold simultaneously if Σ_{i=1}^{N/2} (X_3)_i = 0, which is impossible if N/2 is odd–i.e. if N is not divisible by 4. This leads to the Hadamard conjecture.

Conjecture 6.3.2. An N × N Hadamard matrix exists for all integers N divisible by 4.

Thus far, such a matrix has been found for all N < 668. The N = 668 case is still open at the time of writing these notes. Luckily, for our purposes, we are only interested in smaller Hadamard matrices–namely, those of orders N = 12, 20, 24, 28, 36, 44. These are among the multiples of 4 that are not powers of 2.
Constructing an Hadamard matrix with N not a power of 2 requires much more sophistication than in the power of 2 case. Some matrices result from the Paley Construction, which relies on quadratic residues in finite fields. This is covered in the chapter appendix for completeness, but is, in general, beyond the scope of this course. The end result of the construction is a generating row of length N − 1. This can then be cyclically shifted to construct an (N − 1) × (N − 1) matrix. In turn, this matrix becomes H_N by appending an (N − 1)-long row of -1's to the bottom and then a column of N 1's on the left. For example, the generating row for N = 12 is

(1, 1, −1, 1, 1, 1, −1, −1, −1, 1, −1).

This results in 1 1 1 −1 1 1 1 −1 −1 −1 1 −1 1 1 −1 1 1 1 −1 −1 −1 1 −1 1   1 −1 1 1 1 −1 −1 −1 1 −1 1 1   1 1 1 1 −1 −1 −1 1 −1 1 1 −1   1 1 1 −1 −1 −1 1 −1 1 1 −1 1     1 1 −1 −1 −1 1 −1 1 1 −1 1 1 H12 =   1 −1 −1 −1 1 −1 1 1 −1 1 1 1   1 −1 −1 1 −1 1 1 −1 1 1 1 −1   1 −1 1 −1 1 1 −1 1 1 1 −1 −1   1 1 −1 1 1 −1 1 1 1 −1 −1 −1   1 −1 1 1 −1 1 1 1 −1 −1 −1 1 1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1

The first column can be removed to get an OA(12, 2^{11}, 2). It can be checked, in R for example, that H_12^T H_12 = 12 I_12. Note that we can use this to immediately construct

a design with N = 24 or 48 by using the fact that

H24 = H12 ⊗ H2, and

H48 = H24 ⊗ H2

Remark 6.3.3. Note that such designs can be generated by the pb() function in the R library FrF2. As an example,

pb(12, randomize = FALSE)

which produces the same design as H_12 above, except that the function shifts to the right whereas we shifted to the left.

6.3.2 Aliasing and Correlation

In general, these designs are used only in the case where the interaction effects are negligible. This is because of the complex aliasing that occurs in such a design. We will not discuss this in general, but instead look at an example. Consider the 12-run design presented above. We can fit 11 factors into this design assuming no interaction effects. If we consider the interactions between A, B, and C, we have the following correlation structure.

         A     B     C     D     E     F     G     H     I     J     K
 AB      0     0   -1/3  -1/3  -1/3   1/3  -1/3  -1/3   1/3   1/3  -1/3
 AC      0   -1/3    0    1/3  -1/3  -1/3   1/3  -1/3   1/3  -1/3  -1/3
 BC    -1/3    0     0   -1/3  -1/3  -1/3   1/3  -1/3  -1/3   1/3   1/3
 ABC     0     0     0   -1/√8  1/√8 -1/√8  1/√8  1/√8  1/√8  1/√8 -1/√8

The correlation matrix for all main effects and two-factor interactions is displayed in Figure 6.1. In that graphic, the red squares correspond to a correlation of −1/3 and the blue squares to +1/3.
Plackett-Burman designs are peculiar in the sense that while, for example, the effects A, B, and AB are orthogonal, the main effect A is partially aliased with every two-way interaction that does not include the factor A. Hence, A is partially correlated with C(10, 2) = 45 of the 55 two-way interactions. This is why the main use of these designs is for screening many factors to decide which few are important. From a Minitab Blog,

“Plackett-Burman designs exist so that you can quickly evaluate lots of factors to see which ones are important. Plackett-Burman designs are often called screening designs because they help you screen out unim- portant factors.”

In general, when fitting a linear model but leaving out non-negligible terms, bias is introduced into the parameter estimates.

89 Figure 6.1: Correlation matrix for the 12-Run Plackett-Burman design with main effects and two-way interactions.

This can occur in regression when using a variable selection technique, or similarly in one of these models with partial correlations if the partially correlated terms are ignored. Consider a linear model

Y = Xβ1 + Zβ2 + ε

where X is the design matrix containing the effects we are interested in–e.g. the main effects–and Z is the design matrix containing effects that are to be ignored–e.g. interaction effects. Solving for the least squares estimator for the reduced model

Y = Xβ1 + ε

yields the usual β̂_1 = (X^T X)^{-1} X^T Y. However, when taking the expected value of β̂_1, we no longer have an unbiased estimator. Indeed, we have

E β̂_1 = (X^T X)^{-1} X^T E Y
       = (X^T X)^{-1} X^T [X β_1 + Z β_2]
       = β_1 + (X^T X)^{-1} X^T Z β_2.

Resulting from this derivation, if the parameters β_2 = 0 or if the X and Z terms are orthogonal–i.e. X^T Z = 0–then our estimate of β_1 is unbiased. Otherwise, we have a biased estimator. The matrix (X^T X)^{-1} X^T Z is often referred to as the alias matrix.
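A minimal sketch of the alias-matrix computation for the 12-run Plackett-Burman design above, rebuilding H_12 from the generating row (the object names are generic):

# Alias matrix (X'X)^{-1} X'Z for the 12-run Plackett-Burman design:
# X holds the intercept and 11 main effects, Z holds the 55 two-way interactions.
g    <- c(1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1)
rows <- t(sapply(0:10, function(s) g[((0:10 + s) %% 11) + 1]))  # cyclic left shifts
H12  <- rbind(cbind(1, rows), c(1, rep(-1, 11)))

X <- H12                                   # intercept + 11 main effects
pairs <- combn(2:12, 2)                    # the 55 pairs of main-effect columns
Z <- apply(pairs, 2, function(p) H12[, p[1]] * H12[, p[2]])
alias_mat <- solve(crossprod(X)) %*% crossprod(X, Z)
table(round(alias_mat, 3))                 # entries are 0 and +/- 1/3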

6.3.3 Simulation Example

Consider a 12-run Plackett-Burman design applied to 6 factors, {A, B, C, D, E, F}, taking values {0, 1}, where the true model is

yield = (D) − 1.5(F) + 1.25(F ⊕ E) + ε, where ε ∼ N(0, 0.25). A first step may be to consider just the main effects, as they are all orthogonal to each other.

            Df  Sum Sq  Mean Sq  F value  Pr(>F)
 A            1   0.806    0.806    1.612  0.2601
 B            1   0.150    0.150    0.301  0.6069
 C            1   0.198    0.198    0.396  0.5570
 D            1   3.029    3.029    6.057  0.0571 .
 E            1   0.212    0.212    0.423  0.5439
 F            1   5.378    5.378   10.754  0.0220 *
 Residuals    5   2.501    0.500

This leads us to suspect that one or both of D and F are significant main effects. However, a strong two-way interaction effect could be causing this result due to aliasing. Applying the heredity principle, we can assume that any two-way effects of interest involve either D or F. Hence, we can fit models of the form

output ~ x + D + F + x:D + x:F

for x being A, B, C, and E. From here, we see only two significant interaction terms, B:D and E:F. Thus, we can fit the model

output ~ B + D + E + F + B:D + E:F

which gives

            Df  Sum Sq  Mean Sq  F value  Pr(>F)
 E            1   0.212    0.212    1.367  0.29502
 B            1   0.150    0.150    0.971  0.36965
 D            1   3.029    3.029   19.554  0.00688 **
 F            1   5.378    5.378   34.715  0.00200 **
 B:D          1   1.575    1.575   10.164  0.02432 *
 E:F          1   1.156    1.156    7.458  0.04123 *
 Residuals    5   0.775    0.155

This table uses Type I (sequential) sums of squares. Using Type II sums of squares instead, only three significant terms remain.

            Sum Sq  Df  F value  Pr(>F)
 E           0.0002   1   0.0011  0.974604
 B           0.0291   1   0.1880  0.682631
 D           4.6727   1  30.1604  0.002733 **
 F           2.9780   1  19.2217  0.007126 **
 B:D         0.4530   1   2.9242  0.147951
 E:F         1.1555   1   7.4583  0.041231 *
 Residuals   0.7746   5

As neither B nor B:D is significant, we can try refitting the model without those terms, giving

            Df  Sum Sq  Mean Sq  F value  Pr(>F)
 E            1   0.212    0.212     1.18  0.313412
 D            1   3.029    3.029    16.87  0.004529 **
 F            1   5.378    5.378    29.95  0.000933 ***
 E:F          1   2.398    2.398    13.36  0.008125 **
 Residuals    7   1.257    0.180

When the interaction term is strongly significant

If we were to instead consider a model like yield = (D) − 1.5(F) + 5(F ⊕ E) + ε, then the strong coefficient for the E:F term makes it hard to find significance in the main effects:

            Df  Sum Sq  Mean Sq  F value  Pr(>F)
 A            1    9.38    9.382    1.254   0.314
 B            1    6.52    6.518    0.871   0.394
 C            1    6.81    6.811    0.910   0.384
 D            1    0.18    0.180    0.024   0.883
 E            1    0.21    0.212    0.028   0.873
 F            1    5.38    5.378    0.719   0.435
 Residuals    5   37.41    7.482

One way to proceed would be to use stepwise regression between the main effects model and the two-way effects model using the following commands.

mdFirst  <- aov(yield ~ ., data = dat)
mdSecond <- aov(yield ~ (.)^2, data = dat)
out <- step(mdFirst, scope = list(lower = mdFirst, upper = mdSecond), direction = 'both')

One problem with this is that step() will continue adding variables until the model is saturated. Backing up from the saturated model, we have the new model

output ~ A + B + C + D + E + F + E:F + B:C + A:D.

Considering the Type 2 or Type 3 ANOVA table, we find that neither A nor A:D is significant. Removing them from this model gives

output ~ B + C + D + E + F + E:F + B:C.

Now that A and A:D are gone, we see from another Type 2 or Type 3 ANOVA table that B, C, and B:C are no longer significant. Removing them results in the desired model

output ~ D + E + F + E:F

with the Type 2 or 3 ANOVA table

            Sum Sq  Df  F value   Pr(>F)
 D            4.654   1  25.9190  0.0014135 **
 E            0.212   1   1.1796  0.3134116
 F            5.378   1  29.9555  0.0009327 ***
 E:F         58.866   1 327.8620  3.875e-07 ***
 Residuals    1.257   7

Alternatively, one could try every possible interaction term on its own, i.e. models of the form

output ~ A + B + C + D + E + F + x:y.

The corresponding F-statistics for these models all have (7, 4) degrees of freedom. The values are

A:B 1.15, A:C 0.49, A:D 0.98, A:E 0.82, A:F 0.53, B:C 0.43, B:D 1.32, B:E 0.44, B:F 0.45, C:D 0.46, C:E 0.47, C:F 0.47, D:E 0.50, D:F 0.77, E:F 33.77, which quickly identifies E:F as a significant term.

6.A Paley's Construction of H_N

The above construction of the Hadamard matrices H_N only works for N a power of 2. A more powerful construction due to Raymond Paley allows for the construction of H_N for any N = r^m + 1 with r a prime number, given that r^m + 1 is divisible by 4. This construction, combined with the Kronecker product, allows for the construction of H_N for all N < 100 except for N = 92. More details on this construction can be found in The Theory of Error-Correcting Codes by FJ MacWilliams and NJA Sloane, Chapter 2 Section 3.
First, we require the concept of a quadratic residue. For a prime r, the elements k^2 mod r of the finite field, for k = 1, 2, . . . , r − 1, are the quadratic residues mod r. That is, these are the square numbers in that field. Second, we require the definition of the Legendre symbol. For a prime r, the Legendre symbol is a function χ(·) defined as

χ(k) =   0   for k divisible by r
         1   for k a quadratic residue mod r
        −1   otherwise

This allows us to construct the Jacobsthal matrix Q, which is a skew symmetric r × r matrix with ijth entry equal to χ(i − j). Then, letting 1_r be the r × r matrix of all 1's, we claim that

Q Q^T + 1_r = r I_r.

Indeed, the diagonal entries of Q Q^T are just Σ_{i=0}^{r−1} χ(i)^2 = r − 1. Furthermore, the off-diagonal ijth entry is

Σ_{k=0}^{r−1} χ(k − i) χ(k − j).

It can be shown that this sum is equal to −1 for all i ≠ j.7

7See Theorem 6, MacWilliams & Sloane.

As a result, for N = r + 1 and 1_r now denoting an r-long column vector of 1's, we can write

H_N = [ 1     1_r^T   ]
      [ 1_r   Q − I_r ].

Then,

H_N H_N^T = [ r + 1                    1_r^T + 1_r^T (Q − I_r)^T        ]
            [ 1_r + (Q − I_r) 1_r      1_r 1_r^T + (Q − I_r)(Q − I_r)^T ].

The off-diagonal blocks are zero, as it can be shown that there are, in each row and column of Q, precisely (r − 1)/2 entries of +1 and precisely (r − 1)/2 entries of −1. Using the above claim together with the skew symmetry of Q (so that Q + Q^T = 0), the bottom right block becomes (r + 1) I_r. Hence, H_N H_N^T = N I_N, making it an Hadamard matrix.
