<<

A First Course in Experimental Design Notes from Stat 263/363

Art B. Owen Stanford University

Autumn 2020

© Art Owen 2020. Dec 6 2020 version. Do not distribute or post electronically without author’s permission.

Preface

This is a collection of scribed lecture notes about experimental design. The course covers classical experimental design methods (ANOVA and factorial settings) along with Taguchi methods for robust design, A/B testing and bandits for electronic commerce, and computer experiments and supersaturated designs suitable for fitting sparse regression models.

There are lots of different ways to do an experiment. Most of the time the analysis of the resulting data is familiar from a regression class. Since this is one of the few classes about making data we proceed assuming that the students are already familiar with t-tests and regression methods for analyzing data. There are a few places where the analysis of experimental data differs from that of observational data, so the lectures cover some of those.

The opening lecture sets the scene for experimental design by grounding it in causal inference. Most of causal inference is about inferring causality from purely observational data where you cannot randomize (or did not randomize). While inferring causality from purely observational data requires us to rely on untestable assumptions, in some problems it is the only choice. When one can randomize then much stronger causal claims are possible. This course is about how to use such randomization. Several of the underlying ideas from causal inference are important even when you did randomize. These include the potential outcomes framework, conveniently depicted via a device sometimes called the science table, the Stable Unit Treatment Value Assumption (SUTVA) that the outcome for one experimental unit is not affected by the treatment for any other one, and the issue of external validity that comes up when generalizing from the units in a study to others. Even with randomized experiments, we should be on the lookout for SUTVA violations and issues of external validity.

The next lecture dives into how experimentation is being used now for A/B testing in electronic commerce. Then there is a lecture on bandits, especially Thompson sampling. That is an alternative to classic A/B testing that we can use when there is rapid feedback. It lets us avoid assigning a suboptimal

treatment very often.

The largest segment in the course covers classical experimental design that is now about 100 years old. A lot of useful ideas and methods have been developed there, motivated by problems from agriculture, medicine, psychology, manufacturing and more. We look at blocking and Latin squares and factorial experiments and fractional factorials. Taguchi’s methods for robust design fit into this segment. Key ideas that come up are about which categorical predictors are of primary interest, which are used to get more precise comparisons among those that are of main interest, and how to use confounding (ordinarily a bad thing) to our advantage. The subject matter in this portion of the course could easily fill a full semester-long course. Here we consider it in about half of an academic quarter, which is already shorter than a semester. As a compromise we skip over some of the most challenging issues that arise from unbalanced data sets, complicated mixed effects models, subtle nesting structures and so on, citing instead the text books one would need to read in order to handle those issues.

That long core segment is followed by a midterm and then a survey of related topics in experimental design. This includes computer experiments for investigating deterministic functions, space-filling designs and designs for sparse models.

These notes are a scribed account of lectures. The references I draw on are the ones I know about and reflect a personal view of what belongs in an experimental design course for students studying data science. I’m sure that there are more good references out there. I have a personal interest in bridging old topics in design with some newer topics. Revisiting some old ideas in the light of modern computational power and algorithms and data set sizes could uncover some useful and interesting connections.

During the lectures, given online, I got many good questions and comments. Several people contributed that way but Lihua Lei and Kenneth Tay deserve special mention for the many good points that they raised.

In these notes I cite textbooks where readers can go beyond the level of detail in a first course. In some cases I find that the most recent edition of a book is not the best one for this level of course. Also, I use as a publication date the first year in which the given edition appeared, and not the year that edition went online as a digital book, which can be decades after the fact.

Art B. Owen
Stanford, CA
December 2020

Contents

1 Introduction
   1.1 History of design
   1.2 Confounding and related issues
   1.3 Neyman-Rubin Causal Model
   1.4 Random assignment and ATE
   1.5 Random science tables
   1.6 External validity
   1.7 More about causality

2 A/B testing
   2.1 Why is this hard?
   2.2 Selected points from the Hippo book
   2.3 Questions raised in class
   2.4 Winner’s curse
   2.5 Sequential testing
   2.6 Near impossibility of measuring returns

3 Bandit methods
   3.1 Exploration and exploitation
   3.2 Regret
   3.3 Upper confidence limit
   3.4 Thompson sampling
   3.5 Theoretical findings on Thompson sampling
   3.6 More about bandits

4 Paired and blocked data, randomization inference
   4.1 The ordinary two sample t-test
   4.2 Randomization fixes assumptions
   4.3 Paired analysis
   4.4 Blocking
   4.5 Basic ANOVA
   4.6 ANOVA for blocks
   4.7 Latin squares
   4.8 Esoteric blocking
   4.9 Biased coins

5 Analysis of variance
   5.1 Potatoes and sugar
   5.2 One at a time experiments
   5.3 Interactions
   5.4 Multiway ANOVA
   5.5 Replicates
   5.6 High order ANOVA tables
   5.7 Distributions of sums of squares
   5.8 Fixed and random effects

6 Two level factorials
   6.1 Replicates and more
   6.2 Notation for 2^k experiments
   6.3 Why n = 1 is popular
   6.4 Factorial with no replicates
   6.5 Generalized interactions
   6.6 Blocking

7 Fractional factorials
   7.1 Half replicates
   7.2 Catapult example
   7.3 Quarter fractions
   7.4 Resolution
   7.5 Overwriting notation
   7.6 Saturated designs
   7.7 Followup fractions
   7.8 More data analysis

8 Covariates and crossover designs
   8.1 Combining an ANOVA with continuous predictors
   8.2 Before and after
   8.3 More general regression
   8.4 Post treatment variables
   8.5 Crossover trials
   8.6 More general crossover

9 Split-plot and nested designs
   9.1 Split-plot experiments
   9.2 More about split-plots
   9.3 Nested ANOVAs
   9.4 Expected squares and random effects
   9.5 Additional models
   9.6 Cluster randomized trials

10 Taguchi methods
   10.1 Context and philosophy
   10.2 Bias and variance
   10.3 Taylor expansion
   10.4 Inner and outer arrays
   10.5 Controversy

11 Some data analysis
   11.1 Contrasts
   11.2 Normality assumption
   11.3 Variance components
   11.4 Unbalanced settings
   11.5 Estimating or predicting the a_i
   11.6
   11.7 Choice of response

12 Response surfaces
   12.1 Center points
   12.2 Three level factorials
   12.3 Central composite designs
   12.4 Box-Behnken designs
   12.5 Uses of response surface methodology
   12.6 Optimal designs
   12.7 Mixture designs

13 Super-saturated designs
   13.1 Hadamard matrices
   13.2 Group testing and puzzles
   13.3 Random balance
   13.4 Quasi-regression
   13.5 Supersaturated criteria and designs
   13.6 Designs for compressed sensing

14 Computer experiments
   14.1 Motivating problems
   14.2
   14.3 Orthogonal arrays
   14.4 Exploratory analysis
   14.5
   14.6 Covariance functions for kriging
   14.7 Interpolation, noise and nuggets
   14.8 Optimization
   14.9 Further designs for computer experiments
   14.10 Quasi-Monte Carlo
   14.11 Variable importance

15 Guest lectures and hybrids of experimental and observational data
   15.1 Guest lecture by Min Liu
   15.2 Guest lecture by Michael Sklar
   15.3 First hybrid
   15.4 Second hybrid

16 Wrap-up
   16.1 What statistics is about
   16.2 Principles from experimental design

Bibliography


1 Introduction

To gain understanding from data, we must contend with noise, bias, correlation and confounding, among other issues. Often there is nothing we can do about some of those things, because we are just handed data with flaws embedded. Choosing or designing the data gives us much better possibilities. By carefully designing an experiment we can gain information more efficiently, meaning lower variance for a given expenditure in time or money or subjects. More importantly, experimentation provides the most convincing empirical evidence of causality. That is, it is not just about more efficient estimation of regression coefficients and similar parameters. It is about gaining causal insight. If we think of efficiency as better handling of noise, we can think of the causal estimation as better handling of correlations among predictors as well as interactions and bias.

We all know that “correlation does not imply causation”. Without a causal understanding, all we can do is predict outcomes, not confidently influence them. There are settings where prediction alone is very useful. Predicting the path of a hurricane is enough to help people get out of the way and prepare for the aftermath. Predicting stock prices is useful for an investor whose decisions are too small to move the market. However, much greater benefits are available from causal understanding. For instance, a physician who could only predict patient outcomes in the absence of treatment but not influence them would not be as effective as one who can choose a treatment that brings a causal benefit. In manufacturing, causal understanding is needed to design better products. In social science, causal understanding is needed to understand policy choices.

Our main tool will be randomizing treatment assignments. Injecting randomness into the independent variables provides the most convincing way to establish causality, though we will see that it is not perfect.


Experimental design is the science of choosing how to gather data. It has a long and continuing history spanning: agriculture, medicine and public health, education, simulation, computer experiments, A/B testing in e-commerce, philanthropy and more. The design problem forces the statistical investigator to think carefully about the underlying domain topic, its assumptions and costs, benefits, goals and prior history. This happens in ordinary data analysis too, but the task of choosing what data to gather amplifies the problem. More than other areas of statistics, you really need to have some use case in mind to understand how to interpret variables, models and estimators. Predictors that you can change are different from ones set by a user or the environment. Response variables that can be cheaply measured are different from ultimate ones. Design is often sequential, so things we learn at one stage help at the next. There are choices between safe but perhaps expensive experiments and risky but perhaps faster or cheaper experiments.

1.1 History of design

Experimental design has been systematically studied for about 100 years or more. There are many recent innovations coming from online A/B testing as well as new developments in clinical trials. We will look at new innovations but not ignore the older ones that they evolved from. Some milestones:

• Fisher and others working on agriculture at Rothamsted, starting in the 1920s.
• Box working on 2^{k−p} factorial experiments for industry, starting in the 1960s.
• Computer experiments from Sacks, Welch, Wynn, Ylvisaker and others, starting in the 1980s.
• The realization that Monte Carlo simulations are (or should be) designed experiments, starting in the 1970s.
• Online A/B testing in electronic commerce, starting in the 2010s (Kohavi, Tang, Xu and many others).
• Experiments on networks.
• Experiments in economics for philanthropy.

1.2 Confounding and related issues

We will begin by comparing outcomes for subjects given one of two levels. Sometimes it makes sense to call them ‘treatment’ versus ‘control’. Control could be a default or usual or null setting. Other times the two levels are on the same footing: e.g., raspberry vs blueberry or Harvard vs Yale. If the treatment subjects are measured in the morning and control in the afternoon, then the difference we measure is the joint effect of treatment-AM versus control-PM and any difference we measure might be partly or largely due to time of day. The treatment is confounded with time of day. Confounding can happen easily. Maybe treatment and control were done at different labs or by different people in the same lab or on different equipment by the same person in a given lab. Treatment and control could also be confounded with some variable that we didn’t measure or don’t even know about.


A common source of confounding is the use of historical controls. A physician might compare a survival rate in their clinic to what was the historical norm for that or some other clinic. The standard of care may have changed and then the new treatment is confounded with the time period of study. A better comparison would include concurrent controls. One ethical standard there is equipoise. The physician should be genuinely uncertain about which treatment is better in order to justify doing an experiment. Cox (1958) has an example where historical controls would have supported the Lamarckian theory that learned effects are inherited by offspring. Each generation of rats performed better at a certain task. Because there was no control group, an alternative explanation is that something else about those animals or their condition was changing with time.

Confounding might not matter. Maybe there is sufficient scientific knowledge to know that the confounding variable could not affect the response. Of course that knowledge might be subject to uncertainty and debate. If we use some randomization to assign treatment versus control then we can statistically control the confounding, making extreme confounding have negligible probability. Later on, in settings with k binary treatments and a budget that does not allow running all 2^k cases, we will indulge in some purposeful confounding. The trick will be to control what is confounded with what.

Confounding induces a correlation of 1 (or −1) between our treatment and some other variable. When the treatment or that other variable varies continuously then we get a nonzero correlation between our treatment variable and the confounder. So perfect confounding is a special case of correlation. Randomizing the treatment will make the expected value of that correlation zero. The confounder might be known and measured or it might be unknown (which is worse). When it is unknown, it is a missing predictor problem. A regression that is missing one or more of the predictors leads to biased estimates of the coefficient vector β. If our treatment variable is uncorrelated with that missing variable then we can put that variable into the regression error term. [Working this out might become an exercise.]

Another reason to randomly control the treatment is to put sufficient variance into its values. For a continuous variable we know that the variance of a regression coefficient is reduced by varying the corresponding predictor over a larger range.
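As a quick illustration of the missing predictor issue (this is my own minimal sketch, not part of the original notes and not the suggested exercise itself; the effect sizes and variable names are invented), the following Python simulation compares the regression slope when the treatment is correlated with an unmeasured confounder to the slope when the treatment is randomized.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(size=n)                       # unmeasured confounder

# Observational setting: treatment is correlated with the confounder.
w_obs = (u + rng.normal(size=n) > 0).astype(float)
# Randomized setting: treatment assigned by a fair coin, independent of u.
w_rnd = rng.binomial(1, 0.5, size=n).astype(float)

tau = 1.0                                    # true treatment effect

def response(w):
    # Response depends on the treatment and on the unmeasured confounder.
    return tau * w + 2.0 * u + rng.normal(size=n)

def slope(w, y):
    # Least squares slope of y on w (with an intercept).
    return np.polyfit(w, y, 1)[0]

print("observational estimate:", slope(w_obs, response(w_obs)))  # biased away from 1
print("randomized estimate:   ", slope(w_rnd, response(w_rnd)))  # close to tau = 1
```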

1.3 Neyman-Rubin Causal Model

We will use the Neyman-Rubin framework to think about causality. See the book by Imbens and Rubin (2015) for full details. Begin with Lecture 1 in Wager’s course notes in the class web page. For experimental unit i (e.g., a subject) we think that their response would have been Yi1 if treated and Yi0 if not treated. There is a variable Wi ∈ {0, 1} with Wi = 1 for treated and Wi = 0 for control. The pair (Yi0, Yi1) contains the two potential outcomes for subject i. In this setting we get

Yi = WiYi1 + (1 − Wi)Yi0, i = 1, . . . , n.

We can arrange all of these outcomes into a science table $Y \in \mathbb{R}^{n\times 2}$ as follows:

$$
Y = \begin{array}{c|cc}
 & W_i = 0 & W_i = 1 \\ \hline
i=1 & Y_{10} & Y_{11} \\
2 & Y_{20} & Y_{21} \\
3 & Y_{30} & Y_{31} \\
\vdots & \vdots & \vdots \\
i & Y_{i0} & Y_{i1} \\
\vdots & \vdots & \vdots \\
n & Y_{n0} & Y_{n1}
\end{array}
\qquad (1.1)
$$

Row i shows data for subject i. There is a column each for treatment and control. If we knew Y then we would know every subject’s own personal treatment effect ∆i = Yi1 − Yi0.

There is an important implicit assumption involved in writing the science table this way. We are assuming that the response for subject i is Yi = Yi(Wi) and does not depend on Wi′ for any i′ ≠ i. Imagine instead the opposite where the value Yi depends on the whole vector W = (W1, . . . , Wn) ∈ {0, 1}^n. Then our science table would have n rows and 2^n columns, one for each possible W. We can use the small science table in (1.1) under the Stable Unit Treatment Value Assumption (SUTVA). Under SUTVA, Yi does not depend on Wi′ for any i′ ≠ i. It might not even depend on Wi, but if it depends on W at all it can only be through Wi. In applications we have to consider whether SUTVA is realistic. If patients in a trial swap drugs with each other, SUTVA is violated. If we are experimenting on subjects in a network we might find that the response for one subject depends on the treatment of their neighbors. That would violate SUTVA.
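To make the notation concrete, here is a small Python sketch (my own illustration, not from the notes; the numbers are invented) that stores a science table and produces the observed responses Yi = WiYi1 + (1 − Wi)Yi0 for a given assignment vector W.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
# Science table: column 0 holds Y_{i0} (control), column 1 holds Y_{i1} (treated).
Y = np.column_stack([rng.normal(10, 2, size=n),         # Y_{i0}
                     rng.normal(10, 2, size=n) + 1.5])  # Y_{i1}

W = rng.permutation([0] * (n // 2) + [1] * (n // 2))    # a completely randomized assignment

# Under SUTVA each unit reveals only the potential outcome for its own W_i.
Y_obs = W * Y[:, 1] + (1 - W) * Y[:, 0]

delta = Y[:, 1] - Y[:, 0]   # per-unit effects; never observable in practice
print("ATE of this science table:", delta.mean())
print("Observed responses:", np.round(Y_obs, 2))
```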

1.4 Random assignment and ATE

Very often the thing we are most interested in is the average treatment effect (ATE)
$$
\tau = \frac{1}{n}\sum_{i=1}^n \Delta_i = \frac{1}{n}\sum_{i=1}^n (Y_{i1} - Y_{i0}) = \frac{1}{n}\sum_{i=1}^n Y_{i1} - \frac{1}{n}\sum_{i=1}^n Y_{i0}.
$$

No matter what assignment vector W we choose, we never see any of the ∆i. This is sometimes called the fundamental problem of causal inference. Let’s write

τ = E(Yi(1)) − E(Yi(0))

where in this instance E(·) describes a simple average of n numbers. We will use E(·) later for other things, so let’s take

$$
\tau = \mu_1 - \mu_0 \quad\text{where}\quad \mu_j \equiv \frac{1}{n}\sum_{i=1}^n Y_{ij}, \qquad j = 0, 1.
$$
Given W, let $n_1 = \sum_{i=1}^n W_i$ and $n_0 = n - n_1$ be the number of treated and control subjects, respectively. If min(n0, n1) > 0 then we can compute

$$
\bar Y_1 = \frac{1}{n_1}\sum_{i=1}^n W_i Y_i \quad\text{and}\quad \bar Y_0 = \frac{1}{n_0}\sum_{i=1}^n (1 - W_i) Y_i
$$
and estimate τ by τ̂ = Ȳ1 − Ȳ0. By choosing W randomly we can get E(τ̂) = τ where this E(·) refers to randomness in W. Suppose we have a completely randomized design where all $\binom{n}{n_1}$ ways of picking n1 subjects to treat have equal probability. We saw in class that then E(Ȳj) = µj, j = 0, 1, assuming that min(n0, n1) > 0. It then follows that

E(τ̂) = τ.

What if we just tossed independent coins taking Wi = 1 with probability 0 < p < 1? Then we would get n1 ∼ Bin(n, p). Also, conditionally on this random n1, the units getting Wi = 1 would be a simple random sample. So right away E(τ̂ | 0 < n1 < n) = τ.

Ordinarily the event 0 < n1 < n has overwhelming probability under independent sampling. Also, if anybody planned an experiment and got n1 ∈ {0, n} they would almost surely just re-randomize. [Exercise: how ok is that?] If somehow it is necessary to analyze independent assignments without assuming that n1 ∉ {0, n} then we can regard
$$
\bar Y_1 = \frac{\sum_i W_i Y_i}{\sum_i W_i}
$$
as a ratio of two random variables and work out approximate mean and variance via the delta-method. For the gory details see Rosenman et al. (2018).

Suppose that we want var(Ȳj) under simple random sampling. It’s kind of a pain in the neck to work that out using the theory of finite population sampling (survey sampling) from Cochran (1977) or (Rice, 2007, Chapter 7). It is even worse if we toss coins because we hit the n1 ∈ {0, n} problem with probability just enough bigger than zero to be a theoretical nuisance. Both of those get worse if we want var(Ȳ1 − Ȳ0). Let’s just avoid it! We will see later an argument by Box et al. (1978) that will let us use plain old regression methods to get inferences. That is much simpler, and there is no reason to pick the cumbersome way to do things. There are also permutation test methods to get randomization based confidence intervals by Monte Carlo sampling. Those can work well and be straightforward to use. There is also a book in the works by Tirthankar Dasgupta and Donald Rubin on using the actual randomization in more complicated settings. I’m looking forward to seeing how that goes.

After writing the above, I saw Imbens (2019) describing a conservative estimate of var(τ̂) due to Neyman. It is
$$
\frac{1}{n_1(n_1 - 1)}\sum_i W_i (Y_i - \bar Y_1)^2 + \frac{1}{n_0(n_0 - 1)}\sum_i (1 - W_i)(Y_i - \bar Y_0)^2.
$$
This is just the sum of the two sample variance estimators that we might have used in regular modeling (like Box et al. advise). Let’s still avoid digging into why that is conservative.
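Here is a short Python sketch, not part of the original notes, that randomizes a fixed science table many times to check that τ̂ = Ȳ1 − Ȳ0 is unbiased for the ATE, and that computes Neyman’s conservative variance estimate for one draw. All of the numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
Y0 = rng.normal(0, 1, size=n)
Y1 = Y0 + rng.normal(0.5, 0.3, size=n)   # fixed science table, heterogeneous effects
tau = np.mean(Y1 - Y0)                    # true ATE for this table

def tau_hat(W):
    return Y1[W == 1].mean() - Y0[W == 0].mean()

# Completely randomized design: n1 = 25 treated, all assignments equally likely.
W = np.zeros(n, dtype=int)
W[:25] = 1
draws = np.array([tau_hat(rng.permutation(W)) for _ in range(20000)])
print("true ATE:", tau, " mean of tau-hat over randomizations:", draws.mean())

# Neyman's conservative variance estimate for a single realized assignment.
Wr = rng.permutation(W)
y1, y0 = Y1[Wr == 1], Y0[Wr == 0]
v_hat = y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0)
print("tau-hat:", tau_hat(Wr), " conservative variance estimate:", v_hat)
```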

1.5 Random science tables

Maybe the ij entry of Y should really be a distribution, like N(µij, σ²). Then we get a table of random numbers. If instead we want to account for possible correlations between Yij and Yi′j′ then we need a more general model making a random table of numbers. Suppose W satisfies SUTVA. Then for a random science table we also want Wi to be independent of (Yi0, Yi1). We write this as

Wi ⊥⊥ (Yi0,Yi1).

Think how bad it would be otherwise. If somebody purposely set Wi = 1 for the largest ∆i, they would get a very biased answer.

Let Y be the random science table and suppose we compute τ̂. Is it estimating the ATE for our given science table, or is it estimating the average ATE in the underlying process that made our science table? That is a trick question. It is actually estimating both of those quantities even though they are different from each other. Let’s write τ(Y) for the ATE when the science table is Y and τ̂(Y, W) for our estimate, making explicit that it depends on the Wi. Let’s assume that τ0 = E(τ(Y)) exists. It is enough for all of the Yij to have a finite mean. Assume that the randomization in W never gives min(n0, n1) = 0. Now under the randomness in W,

$$
E(\hat\tau(Y, W) \mid Y) = \tau(Y),
$$
so τ̂ estimates the actual random ATE. Next
$$
E(\tau(Y)) = E\Bigl(\frac{1}{n}\sum_{i=1}^n (Y_{i1} - Y_{i0})\Bigr) = \frac{1}{n}\sum_{i=1}^n \bigl(E(Y_{i1}) - E(Y_{i0})\bigr) = \frac{1}{n}\sum_{i=1}^n (\mu_{i1} - \mu_{i0}).
$$

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 1.6. External validity 11

This is the average ATE over the distribution of random science tables. We could argue whether τ(Y) or E(τ(Y)) is the more important thing to estimate but in practice they may well be very close. For instance, if Yij = µij + εij with i.i.d. noise εij ∼ N(0, σ²) then

$$
\tau(Y) - E(\tau(Y)) = \frac{1}{n}\sum_{i=1}^n (\varepsilon_{i1} - \varepsilon_{i0}) \sim \mathcal{N}\Bigl(0, \frac{2\sigma^2}{n}\Bigr).
$$

In a large experiment the two quantities are close. If the quantities are close then we may choose to study whichever one gives the most clarity to the analysis. In a theoretical study we can model reasonable science tables by specifying a distribution for Y that fits the applied context.
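For intuition, here is a small Python check (mine, not from the notes) of the claim that τ(Y) − E(τ(Y)) has variance about 2σ²/n; it uses the special case of constant means µi0 = 0 and µi1 = 0.3, which are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 1000, 1.0
mu0, mu1 = 0.0, 0.3                     # process-level means, so E(tau(Y)) = 0.3

# Generate many random science tables and record each table's own ATE.
taus = np.array([
    (mu1 + sigma * rng.normal(size=n)).mean() - (mu0 + sigma * rng.normal(size=n)).mean()
    for _ in range(5000)
])
print("mean of tau(Y):", taus.mean())                              # close to 0.3
print("var of tau(Y):", taus.var(), "vs 2*sigma^2/n =", 2 * sigma**2 / n)
```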

1.6 External validity

One of the difficulties of randomized controlled trials is in applying lessons from a given data set to another one, such as future uses. Think of two science tables, one right now and one that we will get later:

$$
Y_{\mathrm{now}} = \begin{array}{c|cc}
 & W_i=0 & W_i=1 \\ \hline
i=1 & Y_{10} & Y_{11} \\
2 & Y_{20} & Y_{21} \\
3 & Y_{30} & Y_{31} \\
\vdots & \vdots & \vdots \\
n & Y_{n0} & Y_{n1}
\end{array}
\qquad
Y_{\mathrm{later}} = \begin{array}{c|cc}
 & W_i=0 & W_i=1 \\ \hline
i=n+1 & Y_{n+1,0} & Y_{n+1,1} \\
n+2 & Y_{n+2,0} & Y_{n+2,1} \\
n+3 & Y_{n+3,0} & Y_{n+3,1} \\
\vdots & \vdots & \vdots \\
n+m & Y_{n+m,0} & Y_{n+m,1}
\end{array}
\qquad (1.2)
$$

These have ATEs τnow and τlater and our experiment will give us the estimate τ̂now. If we think of τ̂now as the future ATE we incur an error

$$
\hat\tau_{\mathrm{now}} - \tau_{\mathrm{later}} = (\hat\tau_{\mathrm{now}} - \tau_{\mathrm{now}}) + (\tau_{\mathrm{now}} - \tau_{\mathrm{later}}).
$$

The first term can be studied using our data. We shied away from doing that once we saw the sampling theory issues it raises, but will look into it later. The second term is about whether the ATE might have changed. It is about external validity.

External validity is some sort of extrapolation. It can be extremely reasonable and well supported empirically. E.g., gravity works the same here as elsewhere. It can be unreasonable. E.g., experimental findings on undergraduates who are required to participate for a grade might not generalize to people of all ages, everywhere. Findings on present customers might not generalize perfectly to others. Findings for mice might not generalize well to humans. Findings in the US might not generalize to the EU. Findings in a clinical trial with given enrolment conditions might not generalize to future patients who are sicker (or are healthier) than those in the study.

You may have heard the expression ‘your mileage may vary’. This referred originally to ratings of fuel efficiency for cars. If you’re considering two cars, the advertised mileages µ0 and µ1 might not apply to you because you drive differently or live near different kinds of roads or differ in some other way from the test conditions. It commonly holds that the difference µ1 − µ0 might still apply well to you. The things that make your driving different from test conditions could affect both cars nearly equally. In that case, the ATE has reasonable external validity. This is a common occurrence and gives reason for more optimism about external validity.

External validity can be judged but not on the basis of observing a subset of Ynow. The exact same Y could appear in one problem with external validity and another without it. External validity can be based on past experiences of similar quantities generalizing where tested. That is a form of meta-analysis, a study of studies. External validity can also be based on scientific understanding that may ultimately have been gleaned by looking at what generalizes and what does not, organized around an underlying theory that has successfully predicted many generalizations.

1.7 More about causality

We are anchoring our causal discussion in the Neyman-Rubin framework. We emphasize ‘effects of causes’ not ‘causes of effects’. The first involves statements like ‘if I water my plant it will grow’. The second involves statements like ‘my plant grew because I watered it’. We could investigate the first problem by watering some randomly chosen plants and comparing the results to similar ones that were not watered. The second problem is a much harder thing to disentangle. The plant could have grown for some other reason.

Looking backwards for a cause, there might be 10 potential causes for why a plant grew or an accident happened or a product succeeded. Maybe changing any one of those 10 things would have given a different outcome. Or maybe some of those causes could have changed the outcome on their own while others would have only changed it if paired with some other changes. Then it is quite difficult to say what the real reason was even if you know all 2^10 potential outcomes. This is a problem of attribution, which remains hard even when all 2^10 causal possibilities are known without error.

For coverage of methods to infer causality from purely observational data, see Imbens and Rubin (2015), Angrist and Pischke (2008) and Angrist and Pischke (2014), and the notes by Stefan Wager in the class web page. There are several excellent courses on it here at Stanford. To make a causal conclusion requires a causal assumption. The assumption may be that a treatment was applied ‘as if at random’ or it may be that any important causal variables have been observed. For this course we will be relying on randomizing the treatments and more generally, randomizing treatment combinations.

There is another approach to causality using directed acyclic graphs, explained in Pearl and Mackenzie (2018). Imbens (2019) remarks that it is not widely used in substantive problems and gives reasons. Roughly, it requires inputs from the user about the true underlying causal structure that are hard to come by.



2 A/B testing

This lecture was about A/B testing primarily in electronic commerce. An A/B test is a comparison of two treatments, just like we saw for causal inference. Much of the content was cherry-picked from the text Kohavi et al. (2020) (by three illustrious authors). The toy hippopotamus on the cover illustrates the “HIghest Paid Person’s Opinion”. They advocate basing decisions on experimental data instead of the hippo. That book is available digitally through the Stanford library. They have further material at https://experimentguide.com/about/. Ronny Kohavi kindly sent me some comments that helped me improve these notes.

There are also some other materials that I’ve learned about through availability bias: people I know from either a Stanford or Google connection worked on them. I quite like the data science blog from Stitch Fix https://multithreaded.stitchfix.com/blog which includes some postings on experimentation.

The lecture began with the example of an A/B test by Candy Japan, https://www.candyjapan.com/behind-the-scenes/results-from-box-design-ab-test. It was about whether a fancy new box to send the candy out in would reduce the number of subscription cancellations. It did not significantly do so despite costing more.

2.1 Why is this hard?

In the 1960s George Box and co-authors were working out how to study the effects of say k = 10 binary variables on an output, with a budget that might only allow n = 64 data points. We will see that they had to contend with interactions among those variables. Modern e-commerce might have k = 1 treatment variable to consider in an A/B test with n in the millions. So maybe the modern problems are only 1/1024 times as hard as the old ones. Yet, they involve large teams of data scientists and engineers developing specialized experimental platforms. Why? Online experimentation is a new use case for randomized trials and it brings with it new costs, constraints and opportunities. Here are some complicating factors about online experimentation:

• there can be thousands of experiments per year,
• many going on at the same time,
• there are numerous threats to SUTVA,
• tiny effects can have enormous value,
• effects can drift over time,
• the users may be quite different from the data scientists and engineers,
• there are adversarial elements, e.g., bots and spammers, and
• short term metrics might not lead to long term value.

There are also some key statistical advantages in the online setting. It is possible to experiment directly on the web page or other product. This is quite different from pharmaceutical or aerospace industries. People buying Tylenol are not getting a random Tylenol-A versus Tylenol-B. When your plane pulls up to the gate it is not randomly 787-A versus 787-B. Any change to a high impact product with strong safety issues requires careful testing and perhaps independent certification about the results of those tests.

A second advantage is speed. Agriculture experiments often have to be designed to produce data in one growing season per year. If the experiment fails, a whole year is lost. Clinical trials may take years to obtain enough patients. Software or web pages can be changed much more quickly than those industries’ products can. Online settings involve strong day of week effects (weekend versus work days) and hour of day effects (work hours and time zones) and with fast data arrival a clear answer might be available in one week. Or if something has gone badly wrong an answer could be available much faster.

Now, most experimental changes do almost nothing. This could be because the underlying system is near some kind of local optimum. If the changes are doing nearly nothing then it is reasonable to suppose that they don’t interfere much with each other either. In the language of Chapter 1, the science table for one experiment does not have to take account of the setting levels for other concurrent experiments. So in addition to rapid feedback, this setting also allows great parallelization of experimentation.

The industrial settings that Box studied often have significant interactions among the k variables of interest. Kohavi et al. (2009, Section 4) describe reasons for using single experiments instead of varying k factors at a time. Some combinations of factors might be undesirable or even impossible to use. Also combining factors can introduce undesirable couplings between investigations. For instance, it could be necessary to wait until all k teams are ready to start, thus delaying k − 1 of the teams.


2.2 Selected points from the Hippo book

2.2.1 Some scale

Kohavi et al. (2020) report that some companies run thousands or tens of thousands of experiments per year. With most of them running for multiple weeks at a time, it is clear that there must be lots of simultaneous experiments happening at once. One user’s experience could be influenced by many ongoing experiments, but as mentioned above there may be little interference.

They give one example of the result of an A/B test adding $100,000,000 in annual revenue to the company. Clearly, that does not happen thousands of times per year. In fact, the vast majority of experiments are for changes that bring no noticeable value or even reduce value. Despite that, there can be large year over year improvements in revenue. The book cites 15–25% revenue growth for some period in Bing. They have another example where shaving a seemingly imperceptible 4 milliseconds off a display time improves revenue by enough to pay for an engineer’s salary.

2.2.2 Twyman’s law

They devote a chapter to Twyman’s law. There does not seem to be one unique version. Two that they give are “Any figure that looks interesting or different is usually wrong” and “Any statistic that appears interesting is almost certainly a mistake”. As a result the most extreme or interesting findings have to be checked carefully. The specific way to test can get very specific to implementation details.

If we are to turn the statistic or figure into a claim then we face a similar statement by Carl Sagan that “Extraordinary claims require extraordinary evidence”. There is a role for Bayes or empirical Bayes here. If the most recent experiment is sharply different from the results of thousands of similar ones then we have reason to investigate further.

Suppose that we double check all of the strange looking findings but simply accept all of the ordinary ones. That poses the risk of confirmation bias. In a setting where 99+% of experiments are null it does not make sense to subject the apparent null findings to the same scrutiny that the outliers get. Confirmation bias seems to be the lesser evil here.

2.2.3 Randomization unit

When we write Wi ∈ {0, 1} and later record Yi, that index i identifies the randomization unit. In medicine that might be a subject. In agriculture it could be a plot of land and the term plot shows up in experimental design language. They describe the usual unit as being a ‘user’. A user id might be a person or a cookie or an account. To the extent possible, you want a unit in treatment A to always be in treatment A. For instance if user i returns, you want them to get the same Wi that they got earlier. This is done by using a deterministic hash function that turns the user id into a number b ∈ {0, 1, 2, . . . , B − 1} where B is a number of buckets. For instance B = 1000 is common. You can think of the hash function as a random number generator and the user id as a seed. Then we could give W = 0 whenever b < 500 and W = 1 when b ≥ 500. When the user returns with the same user id they get the same treatment.

We don’t want a user to be in the treatment arm (or control arm) of every A/B test that they are in. We would want independent draws instead. So we should pick the bucket b based on hash(userid + experimentid) or similar.

There is an important difference between a person and a user id. A person’s user id might change if they delete a cookie, or use different hardware (e.g., their phone and a laptop) for the same service. It is also possible that multiple people share an account. When that user id returns it might be a family member. The link between people and user ids may be almost, but not quite, one to one.

They discuss other experimental units that might come up in practice. Perhaps i denotes a web page that might get changed. Or a browser session. Vaver and Koehler (2011) make the unit a geographical region. Somebody could increase their advertising in some regions, leaving others at the nominal level (or decreasing to keep spending constant) and then look for a way to measure total sales of their product in those regions.

For a cloud based document tool with shared users it makes sense to experiment on a whole cluster of users that work together. It might not even be possible to have people in both A and B arms of the trial share a document. Also there is a SUTVA issue if people in the same group have different treatments.
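A minimal Python sketch of the hash-based bucketing described above follows. It is my own illustration, not code from the book; the choice of SHA-256 and the exact concatenation of ids are assumptions. It makes the same user land in the same arm of a given experiment, while assignments are effectively independent across experiments.

```python
import hashlib

B = 1000  # number of buckets

def bucket(user_id: str, experiment_id: str) -> int:
    # Deterministic hash of user id + experiment id, reduced to a bucket in {0, ..., B-1}.
    digest = hashlib.sha256((user_id + ":" + experiment_id).encode()).hexdigest()
    return int(digest, 16) % B

def assignment(user_id: str, experiment_id: str, treated_buckets: int = 500) -> int:
    # W = 1 for the first `treated_buckets` buckets, W = 0 otherwise.
    # Using 10 buckets instead of 500 would give the 1% ramp-up discussed below.
    return int(bucket(user_id, experiment_id) < treated_buckets)

# The same user gets a stable assignment within an experiment,
# but typically different assignments across experiments.
print(assignment("user-42", "exp-nav-bar"), assignment("user-42", "exp-nav-bar"))
print(assignment("user-42", "exp-checkout"))
```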

2.2.4 SUTVA and similar issues

They describe two sided markets, such as drivers and riders for ride hailing services or renters and hosts for Airbnb. Any treatment B that changes how some drivers work might then affect all riders in a region and that in turn can change the experience of the drivers who were in arm A. A similar example arises when A puts a large load on the servers and slows things for B. Then what we see when B is 50% of the data might not hold true when it is 100%.

2.2.5 Ramping up

Because most experiments involve changes that are unhelpful or even harmful, it is not wise to start them out at 50:50 on the experimental units. It is better to start small, perhaps with only 1% getting the new treatment. You can do that by allocating buckets 0 through 9 of 0 through 999 to the new treatment. Also, if something is going badly wrong it can be detected quickly on a small sample.


2.2.6 Guardrail metrics

We test A versus B to see if Y changes. It is helpful to also include some measure Y′ that we know should not change. In biology such things are called ‘negative controls’. Guardrail has connotations of safety. If your new treatment changes Y′ you might not want to do it even if it improves Y. For instance Y′ might be the time it takes a page to load.

One important guardrail metric is the fraction n1/n of observations in group 1. In a 50:50 trial it should be close to 1/2. It should only fluctuate like O_p(1/√n) around 1/2, and n may be very large. So if n1/n = 0.99 × 1/2 that could be very statistically significant (judged by n1/n being approximately N(1/2, 1/(4n))). They give a possible explanation: the treatment might change the fraction of downstream data that gets removed because it looks like it came from bots or spam. Also when you’re looking at tiny effects, losing 1% of a data stream could be as big an effect as what you were trying to detect.

The measure above is called sample ratio mismatch (SRM). It is interesting to think what the response Y′i would be for that metric. It would be something like an indicator function about whether unit i survives into the analysis period. We do not have to make sure that the Y′i = 0 cases get into the SRM analysis. Their absence is detected already by comparing n1 to n0.
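The sample ratio mismatch check can be carried out with a simple two-sided binomial z-test (equivalently a chi-squared test). This Python sketch is my own illustration, not from the book, and the counts in the example are invented.

```python
import math

def srm_z_test(n1: int, n: int, p: float = 0.5):
    # Two-sided z-test of H0: the treated fraction equals p.
    # Under H0, n1/n is approximately N(p, p(1-p)/n).
    z = (n1 / n - p) / math.sqrt(p * (1 - p) / n)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# A 1% shortfall in the treatment arm of a large 50:50 experiment is wildly significant.
print(srm_z_test(n1=495_000, n=995_000))
```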

2.2.7 Additional points

They have discussions about what makes a good metric Y to study. It needs to arrive fast enough to inform your decision. But it should ideally be closely tied to longer term goals. This is like an external validity issue, but differs a bit. A clear external validity issue would be whether the short term improvement in Y holds into the future. This issue is about whether your short term metric (e.g., immediate engagement with a web site) is a good proxy for a long term goal (e.g., some notion of long term customer/user value).

It may be possible/reasonable to do an experiment in the future reverting some of those A/B decisions to a previous setting. Then you can measure whether the effect is still present. There is a medical experiment of similar form. It is called a randomized withdrawal. One takes a set of people habituated to some medicine and randomly reverts some of them to placebos for a period of time.

2.3 Questions raised in class

There were some good questions in class right after the discussion of Hippo-book ideas. They are edited for length and to fix typos. I didn’t make note of all the private ones so they’re recreated from memory. I don’t divulge who asked them because that could be a privacy issue. Thanks to those who asked. Hopefully the answers below are at least as good as the ones I actually gave.

Q: Could you please comment on the blinding in an online A/B test setting?


A: This is very subtle. Blinding is about whether you know you’re in an experiment. If A is the usual version and B is new, then the people getting A have no real way to know that they’re in an experiment at all, much less what treatment they are getting. The people getting B might well recognize that it is different from usual. They could be left to guess that it is a permanent change rolled out to everybody or they might realize that they’re in an experiment. If they see two different versions because they have two user ids, or a friend getting something different, then they might well guess that they’re in an A/B test. Also, if an interface momentarily goes bad, then some people will speculate that they’re in an A/B test.

Q: How would you test for SUTVA violations after doing the A/B test?

A: I think you have to have some kind of hunch about what kind of SUTVA violation you might be facing. The fully general science table has n rows and 2^n columns, so how could we know what the other columns are like in general? Now suppose that we suspect something about which users’ treatments might affect which other users’ responses. Maybe they are neighbors in a network or similar geographically. We could do a side experiment to test the hunch. After all, a SUTVA violation is fundamentally about how causality works and randomization is the way to test it. Then again, maybe the randomization we already did is adequate to catch some kinds of SUTVA violations. That is, the needed side experiment may already be baked into our initial randomization. I think I can make a good homework question out of this. I’m thinking of a case where we have pairs (i, i′) of subjects where we know or suspect that the SUTVA violation happens within those known pairs.

Q: Is an A/A test purely based on historical data? If so, wouldn’t it be missing all issues of spam/bot/non-compliance/interference etc. in the A/B test? Seems that the A/A test may provide very misleading results.

A: I think you are right that there are things that an A/A test could not catch. What those are might depend on how the experimental system is configured. An A/A test might then give a false sense of security, though better than not doing it because at least it can catch some things. Maybe we should take the A subjects and split them randomly into two groups. Similarly for the B subjects. That would be non-historical. It would involve smaller sample sizes.

Q: Professor Athey once mentioned that large companies often use shared controls for experiments. Is there a correction to make for the non-random assignment across experiments?

A: It sounds like testing A versus B1, B2, . . . , Bk by splitting the units into k + 1 groups. Then use A for the control in all k tests. Each of the tests should be ok. But now your tests are correlated with each other. If the average in the control group A fluctuates down then all k treatment effect estimates go up and vice versa. You also get an interesting multiple comparisons problem. Suppose that you want to bound the probability of wrongly finding any of the new treatments differ significantly from the control. I believe that this gets you to Dunnett’s test. Checking just now, Dunnett’s test is indeed there in Chapter 4.3.5 of Berger et al. (2018).

Q: Do people usually use an A/A test to get the significance level, and does it differ a lot in case you have a lot of data? I assume with a huge amount of data the test statistic might be more or less normal or t-distributed (generally).

A: They might. I more usually hear about it being used to spot check something that seems odd. If n is large then the t-test might be ok statistically but A/A tests can catch other things. For example if you have k highly correlated tests then the A/A sampling may capture the way false discoveries are dependent. A t-test is unrobust to different variances. You’re ok if you’ve done a nearly 50:50 split. But if you’ve done an imbalanced split then the t-test can be wrong. If the A/A test is a permutation it will be wrong there too because the usual t-test is asymptotically equivalent to a permutation test. Ouch. Maybe a clever bootstrap would do. Or Welch’s t test. Or a permutation strategy based on Welch’s. If you have heavy-tailed responses, so heavy that they look like they’ve come from a setting with infinite variance, then the t-test will not work well for you. However maybe nothing else will be very good in that case. These super heavy-tailed responses come up often when the response is dollar valued. Think of online games where a small number of players, sometimes called ‘whales’, spend completely unreasonable (to us at least) amounts of money. (You can use or take logs but those don’t answer a question about expectations.)

To put A/A testing into an experimental platform you would have to find a way to let the user specify what data are the right ones to run A/A tests on for each experiment. Then you have to get that data into the system from whatever other system it was in. That would be more complicated than just using χ² tables or similar.

Q: Is A/A testing a form of bootstrapping?

A: It is a Monte Carlo method very much like bootstrapping. It might be more accurately described as permutation testing. There’s nothing to stop somebody doing a bootstrap instead. However the A/A test has very strong intuitive rationale. It describes a treatment method that we are confident cannot find any real discovery because we shift treatment labels completely at random.

2.4 Winner’s curse

Suppose we adopt a treatment because we think it raises our metric 1%. Then the next one looks like 2% increase so we adopt it too followed by another at 3%. We then expect to gain about 6% overall. A bit more due to compounding. Then we do an A/B test reversing all three changes and find that the new system is only 4% better. Why would that happen?


One explanation is the winner’s curse. It is well known that the stock or mutual fund that did best last year is not likely to repeat this year. Also athletes that had a super year are not necessarily going to dominate the next year. These can be understood as regression to the mean, https://en.wikipedia.org/wiki/Regression_toward_the_mean.

This section is based on Lee and Shen (2018) who cite some prior work in the area. Suppose that the B version in experiment j has true effect τj for j = 1, . . . , J. We get τ̂j ∼ N(τj, σj²). The central limit theorem may make the normal distribution an accurate approximation here. Note that this σj² is really what we would normally call σ²/n, so it could be very small. Suppose that we adopt the B version if τ̂j > Z^{1−αj} σj. For instance Z^{1−αj} = 1.96 corresponds to a one sided p-value below 0.025. Let Sj = 1{τ̂j > Z^{1−αj} σj} be the indicator of the event that the experiment was accepted. Ignoring multiplicative effects and just summing, the true gain from accepted trials is $\sum_{j=1}^J \tau_j S_j$ while the estimated gain is $\sum_{j=1}^J \hat\tau_j S_j$. The estimated gain is over-optimistic on average by

$$
E\Bigl(\sum_{j=1}^J (\hat\tau_j - \tau_j) S_j\Bigr) = \sum_{j=1}^J \int_{Z^{1-\alpha_j}\sigma_j}^{\infty} \frac{\hat\tau_j - \tau_j}{\sigma_j}\,\varphi\Bigl(\frac{\hat\tau_j - \tau_j}{\sigma_j}\Bigr)\,\mathrm{d}\hat\tau_j > 0,
$$
where ϕ(·) is the N(0, 1) probability density function. We know that the bias is positive because the unconditional expectation is E(τ̂j − τj) = 0. Lee and Shen (2018) have a clever way to show this. If Z^{1−αj}σj − τj ≥ 0 then the integrand is everywhere non-negative. If not, then the left out integrand over −∞ to Z^{1−αj}σj is everywhere non-positive, so the part left out has to have a negative integral, giving the retained part a positive one.

They go on to plot the bias as a function of τ for different critical p-values. Smaller p-values bring less bias (also fewer acceptances). The bias for τj > 0 is roughly proportional to τj while for τj < 0 the bias is nearly zero. They go on to estimate the size of the bias from given data and present bootstrap confidence intervals on it to get a range of sizes.

While we are thinking of regression to the mean, we should note that it is a possible source of the placebo effect. If you do a randomized trial giving people either nothing at all, or a pill with no active ingredient (placebo), you might find that the people getting the placebo do better. That could be established causally via randomization. One explanation is that they somehow expected to get better and this made them better or led them to report being better. A real pill ought to be better than a placebo so an experiment for it could test real versus placebo to show that the resulting benefit goes beyond possible psychological effects.

If you don’t randomize then the placebo effect could be from regression to the mean. Suppose people’s symptoms naturally fluctuate between better and worse. If they take treatment when their symptoms are worse than average, then by regression to the mean, the expected value of symptoms some time later will be better than when they were treated. In that case, no psychological explanation based on placebos is needed. In other words, even a placebo effect can be illusory. By the way, nothing anywhere in any of these notes is meant to be medical advice! These examples are included because they help understand experimental design issues.

The winner’s curse can also be connected to multiple hypothesis testing and selective inference. Suppose that one does N experiments and then selects out the M ≤ N significantly successful ones to implement. If tests are done at one-sided level α then the expected number of ineffectual changes that get included is αN. For large N, we can be sure of making a few mistakes. We could take all αj ≪ 1/N to control the probability that any null or harmful changes have been included. This may be too conservative.
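The selection bias behind the winner’s curse is easy to reproduce by simulation. The following Python sketch is my own illustration with invented effect sizes, not a reproduction of Lee and Shen (2018): it adopts experiments whose estimates clear a one-sided threshold and compares the estimated gain of the adopted changes to their true gain.

```python
import numpy as np

rng = np.random.default_rng(3)
J = 10_000
tau = rng.normal(0.0, 0.01, size=J)       # true effects, mostly tiny
sigma = np.full(J, 0.01)                   # standard errors of the estimates
tau_hat = tau + sigma * rng.normal(size=J)

z = 1.96                                   # one-sided alpha = 0.025
S = tau_hat > z * sigma                    # adopted experiments

print("number adopted:", S.sum())
print("estimated gain of adopted changes:", tau_hat[S].sum())
print("true gain of adopted changes:     ", tau[S].sum())
# The estimated gain exceeds the true gain on average: the winner's curse.
```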

2.5 Sequential testing

This section is based on Johari et al. (2017). In an A/B test, the data might come in especially fast and be displayed in a dashboard. It is then tempting to watch it until p < 0.05, or p is below a smaller and more reliable threshold, and then declare significance. In some settings one could watch as each data point arises; however, when there is a strong weekly cycle it makes sense to observe for some number of weeks. Let the p-value at time t be p(t), and suppose that it is valid, meaning that Pr(p(t) ≤ α; H0) = α for all 0 < α < 1, or conservative, meaning that Pr(p(t) ≤ α; H0) ≤ α for all 0 < α < 1. If we declare a difference the first time that p(t) ≤ α then we are taking the minimum of more than one p-value and that generally makes it invalid. We are essentially using $P(t) = \min_{0 < t' \le t} p(t')$.
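A small Python simulation, not from Johari et al. (2017), shows why stopping the first time p(t) ≤ α inflates the false positive rate under the null. The number of looks and the sample size per look are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_reps, n_looks, per_look, alpha = 2000, 20, 100, 0.05
false_positives = 0

for _ in range(n_reps):
    # Null case: both arms have the same distribution.
    a = rng.normal(size=(n_looks, per_look))
    b = rng.normal(size=(n_looks, per_look))
    for t in range(1, n_looks + 1):
        # Two-sample t-test on all data accumulated up to look t.
        _, p = stats.ttest_ind(a[:t].ravel(), b[:t].ravel())
        if p <= alpha:
            false_positives += 1
            break

print("nominal level:", alpha, " realized rate with peeking:", false_positives / n_reps)
```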

2.6 Near impossibility of measuring returns

This section reports an observation by Lewis and Rao (2015) on how hard it can be to measure returns to advertising. The original title of their article was “On the Near Impossibility of Measuring the Returns to Advertising”. They had an unusually rich data set connecting advertising effort to customer purchases. The amount spent by a potential customer can have an enormous coefficient of variation. For instance, if most people don’t buy something and a few do buy it, then the standard deviation can be much larger than the mean. The ads might be inexpensive per person reached, so a small lift in purchase could be very valuable. Combining these facts they give a derivation in which an extremely successful advertising campaign might end up with the variable describing advertising exposure having R² = 0.0000054 in a regression. It is hard to separate such small effects from zero. It could even be hard to be sure of the sign of the effect.

One way to mitigate the small sample size in experiments is to pool multiple related experiments. Owen and Launay (2016) describe a strategy of doing B different but related experiments in G different geographical regions (Geos). An advertiser might then be able to estimate the average return of advertising for B different brands more accurately than any individual brand. The campaign could be defined via a table X like this

$$
X = \begin{array}{c|cccccc}
 & \text{Geo 1} & \text{Geo 2} & \text{Geo 3} & \text{Geo 4} & \cdots & \text{Geo } G \\ \hline
\text{Brand 1} & + & + & - & - & \cdots & + \\
\text{Brand 2} & - & + & + & - & \cdots & - \\
\vdots & \vdots & \vdots & \vdots & \vdots & & \vdots \\
\text{Brand } B & - & - & + & - & \cdots & +
\end{array}
$$
with + representing a brand given increased advertising in a specific Geo, perhaps paid for by reductions in the cells marked −. When B and G are both even then it is possible to let each column have B/2 settings each for ± and each row have G/2 for each of ±1. That is, each row and column of the table sum to 0. If we let Xij ∈ {−1, 1} for i = 1, . . . , B and j = 1, . . . , G we can arrange a random sampling as follows:

1) start with a checkerboard Xij = (−1)^{i+j}, and then repeatedly

2) pick rows i < i′ and columns j < j′ and if the resulting 2 × 2 submatrix is either $\bigl(\begin{smallmatrix} + & - \\ - & + \end{smallmatrix}\bigr)$ or $\bigl(\begin{smallmatrix} - & + \\ + & - \end{smallmatrix}\bigr)$, swap it for the other.

Step 2 runs a Markov chain from Diaconis and Gangolli (1995). In their simulations, Owen and Launay (2016) run it 50 × G × B times. It would be unfortunate for two columns of X to be identical because then two Geos get identical treatments. Similarly we don’t want any two identical rows. Neither do we want completely opposite columns Xij = −Xij′ for all i, or completely opposite rows. It is only possible to avoid having any one of these four problems for min(B, G) > 6 (Owen and Launay, 2016).

The resulting data can be analyzed by either Bayesian methods or Stein shrinkage. Both of those estimate the causal effect for advertising brand b as a weighted average of data from that brand and data from all other brands. That reduces the highest estimated returns and raises the lowest estimates, countering though not necessarily eliminating the winner’s curse problem.
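Below is a hedged Python sketch of the checkerboard start and the 2 × 2 swapping step described above. It is my own minimal illustration, not Owen and Launay’s code; the default number of swap attempts mirrors the 50 × G × B figure mentioned above, but the rejection-style update is an implementation assumption.

```python
import numpy as np

def balanced_design(B, G, n_steps=None, seed=0):
    # Start from the checkerboard X_ij = (-1)^(i+j): every row and column sums to 0.
    rng = np.random.default_rng(seed)
    X = np.fromfunction(lambda i, j: (-1) ** (i + j), (B, G), dtype=int)
    if n_steps is None:
        n_steps = 50 * B * G
    for _ in range(n_steps):
        i, ip = sorted(rng.choice(B, size=2, replace=False))
        j, jp = sorted(rng.choice(G, size=2, replace=False))
        sub = X[np.ix_([i, ip], [j, jp])]
        # Swap (+ -, - +) <-> (- +, + -); other 2x2 patterns are left unchanged,
        # so row and column sums stay zero after every step.
        if sub[0, 0] == -sub[0, 1] == -sub[1, 0] == sub[1, 1]:
            X[np.ix_([i, ip], [j, jp])] = -sub
    return X

X = balanced_design(B=6, G=8)
print(X.sum(axis=0), X.sum(axis=1))   # all zeros: balanced rows and columns
```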


An interesting feature of pooling experiments here is that it becomes possible to discover from a regression model that advertising works especially well (or poorly) in one particular Geo compared to the others.



3 Bandit methods

When you say you’re going to do an A/B test somebody usually suggests using bandit methods instead. And vice versa. In the bandit framework you try to optimize as you go and ideally spend next to no time on the suboptimal choice between A and B, or other options. It can be as good as having only O(log(n)) tries of any sub-optimal treatment in n trials. We review some theory of bandit methods. The main point is to learn the goals and methods and tradeoffs with bandits. We also go more deeply into Thompson sampling proposed originally in Thompson (1933).

Puzzlers and opinions

Most texts and articles are all about facts, not opinions. Facts are better. Sometimes opinions fill in where we don’t have a desired fact. I’ll be putting some of those in, and I expect you will be able to tell. I have been reading the blog of Andrew Gelman https://statmodeling.stat.columbia.edu/ for many years and have benefitted enormously. Some of the opinions he shares are things I have long believed but never saw in print. It was nice to know that I was not alone. Sometimes my opinion is different from his. One opinion I’m planning to describe is that theorems have issues of external validity. We have to think carefully about when and how to apply them. I’m also going to put in some ‘puzzlers’. These are things that are potentially confusing or apparently contradictory about the methods and their properties. Sometimes there is a momentary puzzle while we figure things out. Other times we do not get a clean answer. One of my goals is for students to learn to find and solve their own puzzlers. When you spot and resolve a puzzler it deepens

your understanding. So, get confused and then get out of it. Spotting and resolving puzzlers is also a way to find research ideas. Here is a quote from Paul Halmos about reading mathematics:

Don’t just read it; fight it! Ask your own questions, look for your own examples, discover your own proofs. Is the hypothesis necessary? Is the converse true? What happens in the classical special case? What about the degenerate cases? Where does the proof use the hypothesis?

It is good to poke at statistical ideas in much the same way, with a view to which problems they suit.

3.1 Exploration and exploitation

In a regular experiment we get data on n subjects and estimate E(Y | A) and E(Y | B). We pick what seems to be the better of A and B from our data and retain that choice hypothetically forever, or perhaps only for some N ≫ n future uses before we contemplate another change. In this setting, we use the first n subjects to explore treatment differences. Once we have the apparent best one, we exploit that knowledge for the next N subjects by using the winning treatment. One problem with experimentation is that something like n/2 of the subjects will be getting the suboptimal treatment in the experiment. Maybe we can avoid much of that cost by biasing the experiment towards the seemingly better treatment at each stage as the data come in. The theoretically most effective way to do this is through what are called bandit methods. The name comes from slot machines for gambling that are sometimes called one-armed bandits. Each time you pull that arm you win a random amount of money that has expected value less than what you paid to play. The image for a multi-armed bandit is such a machine that offers you K ≥ 2 arms to pick from. Each arm has its own distribution of random payoffs. In the gambling context of pulling n arms, the goal might be to minimize your expected loss. Of course, the best move is not to play at all, so the metaphor is imperfect. In the experimental settings we care about, the goal is to maximize your expected winnings. If we knew the best arm, we’d choose it every time. But we don’t. Instead we sample from the arms to learn about payoffs on the fly while also trying to get the best payoff. We will see bandit methods that pick a suboptimal arm only O(log(n)) times as the number n of subjects goes to infinity. A very old method is called ‘play the winner’. Suppose that Yi ∈ {0, 1} with 1 being the desired outcome. Then if Yi = 1 we could take Wi+1 = Wi, while for Yi = 0 we switch to Wi+1 = 1 − Wi when the choices are W ∈ {0, 1} or, for K > 2 choices, switch to a random other arm. These methods were studied intensely in the 1960s and 1970s and the term ‘play the winner’ seems to be

used to describe numerous different strategies. If there is a really great treatment in the mix then play the winner can have long streaks of Yi = 1. If instead the best arm has a small payoff, like Pr(Yi = 1) = 0.03, and the other arm (out of 2) has Pr(Yi = 1) = 0, then play the winner will alternate too much between the best and worst arms.

3.2 Regret

These definitions are based on Bubeck and Cesa-Bianchi (2012). Suppose that at time i = 1, 2, 3,... we have arms j = 1, 2,...,K to pick from. If at time i we pick arm j then we would get Yi,j ∼ νj. Notice that the distribution νj here is assumed to not depend on i. We let µj = E(Yi,j) be the mean of νj and

µ∗ = max_{1≤j≤K} µj ≡ µj∗.

So µ∗ is the optimal expected payout and j∗ is the optimal arm (or one that is tied for optimal). If we knew the µj we would choose arm j∗ every time and get expected payoff nµ∗ in n tries. Instead we randomize our choice of arm, searching for the optimal one. At time i we choose a random arm Ji ∈ {1, 2, . . . , K} and get payoff Yi,Ji. Because we choose just one arm, we do not get to see what would have happened for the other K − 1 arms. That is, we never see Yi,j′ for any j′ ≠ Ji, so we cannot learn from those values. There are various ways to quantify how much worse off we are than optimal play would be. The regret at time n is

Rn = max_j ∑_{i=1}^n Yi,j − ∑_{i=1}^n Yi,Ji.

This is how much worse off we are compared to whatever arm would have been the best one to use continually for the first n tries. Be sure that you understand why Pr(Rn < 0) > 0 with this definition. A harsher definition is

∑_{i=1}^n max_j Yi,j − ∑_{i=1}^n Yi,Ji.

This is how much worse off we would be compared to a psychic who knew the future data. It is not a reasonable comparison so it is not the focus of our study. The expected regret is

E(Rn) = E( max_j ∑_{i=1}^n Yi,j − ∑_{i=1}^n Yi,Ji ).

It is awkward to study because the maximizing j is inside the expectation.


Bubeck and Cesa-Bianchi (2012) define the pseudo-regret

R̄n = max_j E( ∑_{i=1}^n Yi,j − ∑_{i=1}^n Yi,Ji )
   = max_j ( nµj − ∑_{i=1}^n E(Yi,Ji) )
   = nµ∗ − ∑_{i=1}^n E(µJi).

Each time we move max_j outside of a sum or expectation, things get easier. What is random in the E(·) of R̄n is the sequence J1, . . . , Jn of chosen arms. Other authors call R̄n the expected regret. Now let ∆j = µ∗ − µj ≥ 0 be the suboptimality of arm j and define Tj(s) = ∑_{i=1}^s 1{Ji = j}. This is the number of times that arm j was chosen in the first s tries. Then

R̄n = nµ∗ − ∑_{i=1}^n E(µJi) = ∑_{j=1}^K E(Tj(n))µ∗ − ∑_{j=1}^K E(Tj(n))µj = ∑_{j=1}^K E(Tj(n))∆j.

Our pseudo-regret comes from the expected number of each kind of suboptimal pull times its suboptimality. To derive this notice that

n = ∑_{j=1}^K Tj(n) = ∑_{j=1}^K E(Tj(n))

because exactly one arm is chosen for every i.
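The identity R̄n = ∑_j E(Tj(n))∆j is easy to check numerically. The sketch below is a toy example of mine, not from Bubeck and Cesa-Bianchi (2012); it uses a policy that ignores the data and picks arms uniformly at random, with made-up arm means.

    # Check pseudo-regret = sum_j E(T_j(n)) Delta_j by Monte Carlo,
    # for a policy that picks arms uniformly at random.
    set.seed(2)
    mu     <- c(0.3, 0.5, 0.45)      # arm means; arm 2 is optimal
    K      <- length(mu); n <- 1000; nsim <- 2000
    Delta  <- max(mu) - mu
    Tj     <- matrix(0, nsim, K)     # pull counts per simulation
    payoff <- numeric(nsim)
    for (s in 1:nsim) {
      J  <- sample(K, n, replace = TRUE)      # uniform random arm choices
      Tj[s, ] <- tabulate(J, nbins = K)
      payoff[s] <- sum(rbinom(n, 1, mu[J]))   # Bernoulli rewards
    }
    n * max(mu) - mean(payoff)                # n*mu_star - sum_i E(mu_{J_i})
    sum(colMeans(Tj) * Delta)                 # sum_j E(T_j(n)) Delta_j

Both computations give essentially the same number, as the identity says they must.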

3.3 Upper confidence limit

Figure 3.1 depicts a hypothetical setting with three treatment arms, there denoted by A, B and C, with confidence intervals for the expected value of Y in all three arms. Based on this information, which arm should we choose? There is an argument for arm B because the point estimate at the center of its confidence interval is the highest of the three. But that does not take account of uncertainty. If we were to consider two restaurants, one with six ratings that were all 5 stars and another with 999 5 star ratings and one 4 star rating, we would be more confident about the second restaurant. One way to take that into account and play it safe is to rank by a lower confidence limit on the expected value. By that measure, treatment C has the highest lower limit and so it seems best for the cautious user. The answer however, spoiled by the title of this section, is to choose arm A because it has the highest upper confidence limit.



Figure 3.1: Hypothetical confidence intervals for E(Y | A), E(Y | B) and E(Y | C).

Suppose that we went with B. Then its confidence interval would tend to get narrower with further samples. Its center could also shift up or down as sampling goes on, tending towards the true mean, which we anticipate to be somewhere inside the current confidence interval, though that won’t always hold. What could happen is that the confidence interval converges on a value above the center for A but below the upper limit for A. Then if A were really as good as its upper limit, we would never sample it and find out. The same argument holds for sampling from C. Now suppose that we sample from A. Its confidence interval will narrow and the center could move up or down. If A is really bad then sampling will move the mean down and narrow the confidence interval and it will no longer have the top upper confidence limit. We would then stop playing it, at least for a while. If instead A was really good, we would find that out. Given that we want to use the upper limit of a confidence interval, what confidence level should we choose? The definitive sources on that point are Gittins (1979), Lai and Robbins (1985) and Lai (1987). We can begin with a finite horizon n, for instance the next n patients or visitors to a web page. At step i we could use the 100(1 − αi)% upper confidence limit. It is easy to choose the treatment for the n’th subject. We just take the arm that we think has the highest mean. There is no (n + 1)’st subject to benefit from what we learn from subject n. So we could take αn = 1/2. That would be the center of our confidence interval (if it is symmetric). If we are picking a fixed sequence α1, . . . , αn then it makes sense to have αi increasing towards 0.5 because as time goes on, there is less opportunity to take advantage of any learning. The αi should start small, especially if n is large.


A finite horizon might not be realistic. We might not know how many subjects will be in the study. Another approach is to define the discounted regret

∑_{i=1}^∞ (µ∗ − E(µJi)) θ^{i−1},   0 < θ < 1.

This regret is the immediate regret plus θ times a similar future quantity

µ∗ − E(µJ1) + θ ∑_{i=1}^∞ (µ∗ − E(µJi+1)) θ^{i−1}.

It is much more reasonable to use some constant α in the discounted setting than in the fixed n setting. Choosing θ can be complicated. The average index is

∑_{i=1}^∞ i θ^{i−1} / ∑_{i=1}^∞ θ^{i−1} = 1/(1 − θ)

by considering the mean of a geometric distribution. So if we pick θ = 0.99 then the weighted average index is 100. Or, if we have a time horizon like n = H in mind we can set θ = 1 − 1/H. Finding the critical αi values is complicated and depends on the underlying parametric distribution one might assume for the distributions νj. For us, the main idea is that betting on optimism paid off by keeping pseudo-regret O(log(n)).
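For concreteness, here is a generic upper confidence bound rule in R. The particular bonus √(2 log(i)/Tj) is a common textbook choice that I am using as a stand-in; it is not the carefully calibrated αi sequence discussed above.

    # A generic upper-confidence-bound bandit for Bernoulli arms.
    ucb_bandit <- function(mu, n) {
      K  <- length(mu)
      Tj <- rep(0, K); Sj <- rep(0, K)    # pull counts and total rewards
      for (i in 1:n) {
        if (i <= K) {
          j <- i                          # pull each arm once to start
        } else {
          ucb <- Sj / Tj + sqrt(2 * log(i) / Tj)
          j <- which.max(ucb)             # bet on the most optimistic arm
        }
        Sj[j] <- Sj[j] + rbinom(1, 1, mu[j])
        Tj[j] <- Tj[j] + 1
      }
      Tj                                  # how often each arm was pulled
    }
    set.seed(3)
    ucb_bandit(mu = c(0.45, 0.5), n = 5000)  # most pulls go to the better arm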

3.4 Thompson sampling

Thompson sampling goes back to Thompson (1933). The method was forgotten and rediscovered a few times. It is comparatively recently that most articles are online and findable over the internet, so the reinvention is understandable. The idea in Thompson sampling is to choose arm i with probability Pr(µi is the best). This probability is a Bayesian one, so that it can be updated based on the observations so far. Thompson’s motivation was for medical problems. The current surge in interest is from internet services. Agrawal and Goyal (2012) showed that Thompson sampling can have a pseudo-regret of O(log(n)). That puts it in the same performance league as UCB methods and Thompson is easier to deploy, at least in simple settings. We will focus on the Bernoulli case. The response values are Yi,j ∈ {0, 1}. Let µj = Pr(Yi,j = 1). Then the likelihood that we will use is

L(µ1, . . . , µK) = ∏_{i=1}^n p(Yi | Ji) = ∏_{i=1}^n µJi^{Yi,Ji} (1 − µJi)^{1−Yi,Ji}.    (3.1)

Each factor µJi^{Yi,Ji} (1 − µJi)^{1−Yi,Ji} is a likelihood contribution for (µ1, . . . , µK) based on the conditional distribution of Yi = Yi,Ji given Ji. There’s a brief discussion about using a conditional likelihood below.


Taking this conditional likelihood (3.1) as our likelihood, we then pick a conjugate prior distribution in the beta family. Taking µj ∼ Beta(aj, bj) independently, the µj have joint prior density proportional to

∏_{j=1}^K µj^{aj−1} (1 − µj)^{bj−1},   0 ≤ µj ≤ 1.

Taking aj = bj = 1 makes µj ∼ U[0, 1] independently. This is a popular choice. If K is very large and we know that the µj are likely to be very near zero from past experience then we could work with aj < bj. The mean of Beta(a, b) is µ = a/(a + b) and the variance is µ(1 − µ)/(a + b + 1), and these facts might help us settle on (aj, bj). Let

Sj = ∑_{i=1}^n Yi,Ji 1{Ji = j}   and   Fj = ∑_{i=1}^n (1 − Yi,Ji) 1{Ji = j}

be the numbers of successes and failures, respectively, observed in arm j. Then we can write our conditional likelihood as

∏_{j=1}^K µj^{Sj} (1 − µj)^{Fj}.

Notice that although (Sj,Fj) are defined by summing over all i = 1, . . . , n, they do not depend on any unobserved Y values. Multiplying our conditional likelihood by the prior density we find a posterior density proportional to

∏_{j=1}^K µj^{aj+Sj−1} (1 − µj)^{bj+Fj−1}.

This means that the posterior distribution has µj ∼ Beta(aj + Sj, bj + Fj) independently. This expression also shows that aj and bj can be viewed as numbers of prior pseudo-counts. We are operating as if we had already seen aj successes and bj failures from arm j before starting. Figure 3.2 has pseudo-code for running the Thompson sampler for Bernoulli data and beta priors. In this problem it is easy to pick arm j with probability equal to the probability that µj is largest. We sample µ1, . . . , µK one time each and let J be the index of the biggest one we get. Thompson sampling is convenient for web applications where we might not be able to update Sj and Fj as fast as the data come in. Maybe the logs can only be scanned hourly or daily to get the most recent (Ji, Yi,Ji) pairs. Then we just keep sampling with the fixed posterior distribution between updates. If instead we were using UCB then we might have to sample the arm with the highest upper confidence limit for a whole day between updates. That could be very suboptimal if that arm turns out to be a poor one.

Puzzler: the UCB analysis is pretty convincing that we win by betting on optimism. How does optimism enter the Thompson sampler? We get just one draw from the posterior for arm j. That draw could be better or worse than the mean.


Initialize: Sj ← aj, Fj ← bj, j = 1, . . . , K    # aj = bj = 1 starts µj ∼ U[0, 1]
Run: for i ≥ 1
        for j = 1, . . . , K
            θj ∼ Beta(Sj, Fj)        # make sure min(Sj, Fj) > 0
        J ← arg maxj θj              # call it Ji if you plan to save them
        SJ ← SJ + Xi,J
        FJ ← FJ + 1 − Xi,J

Figure 3.2: Pseudo-code for the Thompson sampler with Bernoulli responses and beta priors. As written it runs forever.

Just taking the mean would not bake in any optimism and would fail to explore. We could bake in more optimism by letting each arm take m > 1 draws and report its best result. I have not seen this proposal analyzed (though it might well be in the literature somewhere). It would play more towards optimism but that does not mean it will work better; optimism was just one factor. Intuitively, taking m > 1 should favor the arms with less data, other things being equal. Without some theory, we cannot be sure that m > 1 doesn’t actually slow down exploration. Maybe it would get us stuck in a bad arm forever (I doubt that on intuitive grounds only). If we wanted, we could take some high quantile of the beta distributions but deciding what quantile to use would involve the complexity that we avoided by moving from UCB to Thompson. For Bernoulli responses with a low success rate, the beta distributions will initially have a positive skewness. That is a sort of optimism.
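Here is a minimal runnable R version of the Figure 3.2 pseudo-code, stopped after a fixed horizon n rather than running forever; the simulated Bernoulli responses and arm means are my additions.

    # Thompson sampling for K Bernoulli arms with Beta(a_j, b_j) priors.
    thompson <- function(mu, n, a = rep(1, length(mu)), b = rep(1, length(mu))) {
      K <- length(mu)
      succ <- a; fail <- b                # posterior parameters, start at priors
      pulls <- rep(0, K)
      for (i in 1:n) {
        theta <- rbeta(K, succ, fail)     # one posterior draw per arm
        J <- which.max(theta)             # play the arm with the largest draw
        Y <- rbinom(1, 1, mu[J])          # observe a Bernoulli response
        succ[J] <- succ[J] + Y
        fail[J] <- fail[J] + 1 - Y
        pulls[J] <- pulls[J] + 1
      }
      list(pulls = pulls, posterior_mean = succ / (succ + fail))
    }
    set.seed(4)
    thompson(mu = c(0.03, 0.05), n = 10000)

Most of the pulls end up on the 5% arm, with only a modest number spent learning that the 3% arm is worse.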

Puzzler/rabbit hole: are we leaving out information about µj from the distribution of Ji? I think not, because the distribution of Ji is based on the past Yi which already contribute to the conditional likelihood terms. A bit of web searching did not turn up the answer. It is clear that if you were given J1,...,Jn it would be possible at the least to figure out which µj was µ∗. But that doesn’t mean they carry extra information. The random variables are observed in this order:

J1 → Y1 → J2 → Y2 → · · · → Ji → Yi → · · · → Jn → Yn.

Each arrow points to new information about µ. The distribution of J1 does not depend on µ = (µ1, . . . , µK ). The likelihood is

p(y1 | J1; µ)p(J2 | J1, y1; µ)p(y2 | J2,J1, y1; µ)p(J3 | y2,J2,J1, y1; µ) ···

Now in our Bernoulli Thompson sampler our algorithm for choosing J3 was just based on a random number generator that was making our beta random variables. That convinces me that p(J3 | y2,J2,J1, y1; µ) has nothing to do with µ. So the conditional likelihood is ok. At least for the Bernoulli bandit. Phew!


3.5 Theoretical findings on Thompson sampling

Agrawal and Goyal (2012) generalize Bernoulli Thompson sampling to handle bounded non-binary responses. They get a value Y ∈ {0, 1} from Y′ ∈ [0, 1] by randomly taking Y = 1 with probability Y′. This is pure randomness coming from their algorithm, not their data. If the response is actually Y″ ∈ [a, b] for known b > a we can start by setting Y′ = (Y″ − a)/(b − a) and then sampling Y ∼ Bern(Y′).
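In R the rescale-then-randomize step could look like the two-line sketch below; the function name and the bounds are made up.

    # Map a bounded response Ypp in [a, b] to a Bernoulli Y for the sampler.
    to_bernoulli <- function(Ypp, a, b) {
      Yp <- (Ypp - a) / (b - a)       # rescale to [0, 1]
      rbinom(length(Yp), 1, Yp)       # extra randomness from the algorithm
    }
    to_bernoulli(c(12, 37, 80), a = 0, b = 100)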

Theorem 1. For K = 2 and Y′i,j ∈ [0, 1] and Yi,j ∼ Bern(Y′i,j) the pseudo-regret is

R̄n = O( log(n)/∆ + 1/∆³ ),

as n → ∞, where ∆ is the suboptimality of the second best treatment.

Proof. This is Theorem 1 of Agrawal and Goyal (2012). They refer to expected regret but it appears to be pseudo-regret in the terminology of Bubeck and Cesa-Bianchi (2012).

The pseudo-regret grows like O(log(n)) for fixed ∆. If arm 1 is the suboptimal one then this means that E(T1(n)) = O(log(n)). If the cumulative number of mistakes grows logarithmically then the typical gap between mistake times has to be growing exponentially. For instance E(T1(2n) − T1(n)) = O(log(2n) − log(n)) = O(1) (because the constant log(2) is O(1)). Each doubling of n brings at most a constant expected number of suboptimal arm choices. Now let’s look into the denominator ∆. The pseudo-regret bound is larger when ∆ is smaller. If arms return 2.99% and 3% respectively, the bound leads us to expect much worse results than if they are 2% and 3%. The reason is that a suboptimal but nearly optimal arm will get chosen much more often than one that is very suboptimal. What about ∆ = 0? Something discontinuous happens here. The pseudo-regret bound is ∞. The actual pseudo-regret is exactly 0. It is not a contradiction: 0 < ∞. I promised a discussion of external validity of theorems. The big-O in our theorem means that there exist constants C < ∞ and N < ∞ such that

R̄n ≤ C × ( log(n)/∆ + 1/∆³ )   for all n ≥ N.

Sometimes it holds for N = 1. In a given situation we might expect R¯n to grow like log(n) but be disappointed. Maybe it happens for extremely large N (in our problem). Or maybe the value of C is so large that C log(n)/n is not very small for any n that we can afford. We need more information than the O(·) result in the theorem to know if things are going to be good. It can be valuable to spot-check theorems with some simulated examples. Simulated examples on their own are unsatisfying, also for external validity reasons. The ones in the literature might have been cherry-picked. Ideally the examples are known to be similar to our use case or perhaps to cover a wide range of possibilities.


A theorem can also be misleadingly pessimistic. If an error quantity En = o(g(n)) it means that lim_{n→∞} |En|/g(n) = 0. Anything that is o(g(n)) is automatically O(g(n)). In this case there is a theorem from Lai (1987) showing that no method could be o(log(n)), so that doesn’t happen here. In this case we can think of what is perhaps the best possible case for Thompson sampling. One arm has µj = 1 and the other has µj = 0.

Puzzler: In class, I wondered what would happen if instead of adding Yi,Ji to Sj and 1 − Yi,Ji to Fj we added the probabilities Y′i,Ji to Sj and 1 − Y′i,Ji to Fj. That takes some noise out of the algorithm. It leads to beta distributions with non-integral parameters, but those are ok. It would complicate the analysis behind the theorem. We know from Lai and Robbins that we would not get a better convergence rate than O(log(n)). Specifically, their bound is of the form

( ∑_{j=2}^K ∆j / KL(νj || ν∗) + o(1) ) log(n),

for the Kullback-Leibler divergence

KL(P || Q) = ∫_{−∞}^{∞} p(x) log( p(x)/q(x) ) dx

in the case of continuous distributions P and Q, with a natural modification for discrete distributions. Perhaps we get a better constant in the rate from using Y′ instead of Y. Agrawal and Goyal (2012) have additional results to cover the case K > 2.

Theorem 2. For K > 2 and optimal arm j∗ = 1,

R̄n = O( ( ∑_{j=2}^K 1/∆j² )² log(n) )

for suboptimalities ∆j. Also

R̄n = O( (∆max/∆min³) ( ∑_{j=2}^K 1/∆j² ) log(n) )

where ∆min = min_{2≤j≤K} ∆j and ∆max = max_{2≤j≤K} ∆j.

Proof. The first result is Theorem 2 of Agrawal and Goyal (2012) and the second is their Remark 3.

The second bound is better for large K, while the first is better for small ∆j.

3.6 More about bandits

Suppose that A is better than B. Then from a bandit we only get O(log(n)) samples from B. Therefore we do not get a good estimate of

∆ = E(Y |A) − E(Y |B).


We can be confident that we are taking the best arm but we cannot get a good estimate of the amount of improvement. For some purposes we might want to know how much better the best arm is. Maybe Wi = (Wi1, . . . , Wi,10) ∈ {0, 1}^10 because we have 10 decisions to make for subject i. We could run a bandit with K = 2^10 arms but that is awkward. An alternative is to come up with a model, such as Pr(Y = 1 | W) = Φ(W^T β) for unknown β. Or maybe a logistic regression would be better. We can place a prior on β and update it as data come in. Then we need a way to sample a W with probability proportional to the probability that it is the best one. Some details for this example are given in Scott (2010). These can be hard problems but the way forward via Thompson sampling appears easier than generalizing UCB. This setting has an interesting feature. Things we learn from one of the 1024 arms provide information on β and thereby update our prior on some of the other arms. For contextual bandits, we have a feature vector Xi that tells us something about subject i before we pick a treatment. Now our model might be Pr(Y = 1 | X, W; β) for parameters β. See Agrawal and Goyal (2013) for Thompson sampling and contextual bandits. In restless bandits, the distributions νj can be drifting over time. See Whittle (1988). Clearly we have to explore more often in this case because some other arm might have suddenly become much more favorable than the one we usually choose. It also means that the very distant past observations might not be relevant, and so the upper confidence limits or parameter distributions should be based on recent data with past data downweighted or omitted.



4 Paired and blocked data, randomization inference

In this lecture we begin to look at some more traditional areas of experimental design. Much of it is based on the work by George Box and co-authors. I quite like this book: Box et al. (1978). I’m citing the first edition which I prefer to the second. Wu and Hamada (2011) cover many of the same ideas with a rich collection of examples, mostly from industrial experimentation. These basic experimental design ideas have been used to give us about a century of exponential growth in the quality and abundance of food and medicine and industrial products. Ideas and insights from domain experts get boosted by the efficiency with which well designed experiments can speed up learning of causal relationships. In this work we take regular regression theory as a prerequisite. Things like normal theory regression, t-tests, p-values, confidence intervals and how to analyze such data are mostly assumed. This course is mostly about making data, while most other courses are about analyzing data. One exception: we will cover the analysis of variance (ANOVA) in more detail than usual statistics courses do. The ANOVA cannot be completely understood just in terms of adding binary predictors, sometimes called a one-hot encoding. There is a bit more going on. The class web page has notes from Stat 305A on the one way ANOVA. Read up through Chapter 1.2 and then Chapter 1.7 on random effects which we will cover later. In between there is material on statistical power, interpretation of treatment contrasts, multiple comparisons for ANOVA and false discovery rates.


4.1 The ordinary two sample t-test

Let’s recall how we would do a t-test for a treatment effect. We have data Yij for treatment groups i = 1, 2 and observations j = 1, . . . , ni. Think of i as W + 1, where W ∈ {0, 1} is the treatment variable in causal inference from Chapter 1. The goal is to learn about ∆ = E(Y1j) − E(Y2j). Defining this expectation will require a model, and unlike the science tables in potential outcomes, ∆ here does not depend on j. The t statistic is

t_obs = (Ȳ1• − Ȳ2• − ∆) / ( s √(1/n1 + 1/n2) ) ∼ t_(n1+n2−2).

This t_obs is the observed value of a t-distributed random variable. Here Ȳi• = (1/ni) ∑_{j=1}^{ni} Yij and s² is the pooled variance estimate. This result is a miracle. We have an algebraic expression t_obs involving our unknown ∆ and some quantities that are known after a short computation. Arithmetic that combines knowns and unknowns ordinarily returns an unknown. This result is special, because while t_obs is unknown it has a known distribution. It is then called a pivotal quantity. Using the pivotal quantity we can get a 99% confidence interval for ∆ as

{ ∆ : |t_obs| ≤ t^{0.995}_{(n1+n2−2)} }.

If a special value of ∆, call it ∆0, is not in the confidence interval then we reject H0 : ∆ = ∆0 at the 1% level. The usual ∆0 is of course 0, corresponding to a null hypothesis of no treatment effect. We can get a p-value for H0 : ∆ = ∆0 as

p = Pr( |t_(n1+n2−2)| ≥ |t_obs| ).

We can also get these results by pooling all n1 + n2 data into a regular regression model

Yi = β0 + β1 Wi + εi,   i = 1, . . . , N ≡ n1 + n2,    (4.1)

where Wi = 1 if observation i is from treatment 1 and Wi = 0 if observation i is from treatment 2. Defining the N × 2 matrix

        [ 1  W1     ]   [ 1  1 ]
        [ ⋮   ⋮      ]   [ ⋮  ⋮ ]
        [ 1  Wn1    ]   [ 1  1 ]
    X = [ 1  Wn1+1  ] = [ 1  0 ]
        [ ⋮   ⋮      ]   [ ⋮  ⋮ ]
        [ 1  Wn1+n2 ]   [ 1  0 ]

and similarly defining Y ∈ R^N with the treatment 1 data above the treatment 2 data, we can compute

β̂ = (X^T X)^{−1} X^T Y   and   s² = (1/(N − 2)) ∑_{i=1}^N (Yi − β̂0 − β̂1 Wi)²

and now

t_obs = (β̂1 − β1) / ( s √( ((X^T X)^{−1})_{22} ) ).

In order to get these pivotal inferences we need to make 4 assumptions: 1) the εi are normally distributed, 2) var(εi) does not depend on Wi, 3) the εi are independent, and 4) there are no missing predictors. For the last one, we need to know that E(Yi) is not really β0 + β1Wi + β2Ui for some other variable Ui. Assumption 1 is hard to believe, but the central limit theorem reduces the damage it causes. Assumption 2 can be serious but does little damage if n1 ≈ n2. We can also just avoid pooling the variances and use √(s1²/n1 + s2²/n2) in place of s√(1/n1 + 1/n2). Assumption 3 is critical and violations can be hard to detect. Assumption 4 is even more critical and harder to detect. We almost don’t even notice we are making an assumption about Ui because Ui is missing from equation (4.1).
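The agreement between the two-sample t-test and the regression on a binary W is easy to confirm in R on simulated data; this is a toy example of mine, not course data.

    # Two-sample t-test and regression on a treatment indicator agree.
    set.seed(5)
    n1 <- 40; n2 <- 50
    y  <- c(rnorm(n1, mean = 1.0), rnorm(n2, mean = 0.5))
    w  <- c(rep(1, n1), rep(0, n2))
    t.test(y[w == 1], y[w == 0], var.equal = TRUE)   # pooled two-sample t-test
    summary(lm(y ~ w))$coefficients["w", ]           # same t statistic and p-value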

4.2 Randomization fixes assumptions

Box et al. (1978) consider a hypothetical neighbor with two fertilizers and 11 tomato plants. Let’s go with 10 plants. We could plant them in a row like this: A A A A A B B B B B

That would not be a good design. Maybe there’s a hidden trend variable Ui = i where the plots correspond to i = 1, . . . , 10 from left to right. We could instead try:

A B A B A B A B A B

That is better but could still be problematic. For instance there could be correlations between the yield of adjacent plants. Those would be positive if nearby locations had similar favorability. Or they could be negative if one plant’s roots or shade adversely affected its neighbors. We could then try randomizing the run order perhaps getting this:

A B B B A A B A A B

A random order cannot correlate with any trend. Under our model, the t statistic numerator ∆̂ − ∆ = Ȳ•A − Ȳ•B − ∆ equals

( ε1 + ε2 + ε3 + ε4 + ε5 − ε6 − ε7 − ε8 − ε9 − ε10 ) / 5,   A’s first
( ε1 − ε2 + ε3 − ε4 + ε5 − ε6 + ε7 − ε8 + ε9 − ε10 ) / 5,   alternate
( ε1 − ε2 − ε3 − ε4 + ε5 + ε6 − ε7 + ε8 + ε9 − ε10 ) / 5,   random

in our three allocations.


If there is an unknown Ui then it is within the εi. If Ui = c × (i − 5.5) then our model has put that Ui inside εi and we get a bias of

E(∆̂) − ∆ =  −5c,   A’s first
             −c,    alternate
             0.6c,  random.

Putting A’s first gave the worst bias. The alternating plan improved a lot, but could have done very badly with some high frequency bias. The random plan came out best. The bias will be Op(1/√N) under randomization, whether the Ui constitute a trend or an oscillation or something else. Next, let’s consider what happens if there are correlations in the εi. We will consider local correlations

corr(Yi, Yi′) =  1,   i = i′
                 ρ,   |i − i′| = 1
                 0,   else.

Now

var(∆̂) = (1/25) v^T cov(ε) v

where vi = 1 for Wi = 1 and vi = −1 for Wi = 0. Using σ² for var(εi), we get

var(∆̂) = 2σ²/5 + (σ²/25) ×  14ρ,    A’s first
                             −18ρ,   alternate
                             2ρ,     random.

The data analyst will ordinarily proceed as if ρ = 0, especially in small data sets where we cannot estimate ρ very well. For the plants, ρ could well be positive or negative, making var(∆̂) quite different from 2σ²/5. Box et al. (1978) take the view that randomization makes it reasonably safe to use our usual statistical models. A forthcoming book by Tirthankar Dasgupta and Donald Rubin will, I expect, advocate for using the actual randomization that was done to drive the inferences.
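The bias computation above is easy to redo numerically. This R sketch of mine recomputes the −5c, −c and 0.6c biases for the same three allocations under the linear trend Ui = c(i − 5.5).

    # Bias of the A-minus-B mean difference under a linear trend,
    # for the three allocations of A and B to plots 1..10.
    bias_for <- function(alloc, slope = 1) {
      U <- slope * (1:10 - 5.5)
      mean(U[alloc == "A"]) - mean(U[alloc == "B"])
    }
    first10   <- rep(c("A", "B"), each = 5)
    alternate <- rep(c("A", "B"), times = 5)
    random1   <- c("A","B","B","B","A","A","B","A","A","B")
    sapply(list(first10, alternate, random1), bias_for)   # -5, -1, 0.6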

4.2.1 About permutation testing

The original motivation for the t-test by Fisher was based on the asymptotic equivalence between a t-test and a permutation test. As a result we do not expect permutation tests to repair any problems that would have affected the t-test. A t-test tests the ‘small’ null hypothesis H0 : E(Y | A) = E(Y | B) that the mean of Y is the same for W = 0 and W = 1. A permutation test addresses the ‘large’ null hypothesis H0 : L(Y | A) = L(Y | B). Here L(·) refers to the law or distribution of its argument, so this hypothesis makes the strong assumption that

the distribution of Y is exactly the same for W = 0 and W = 1. It is a test of the large null hypothesis, used in the hope of having power against violations of the small one. Permutation tests have the advantage that they are very easy to explain to non-users and they appear to have very clear validity. Permutation tests can be cumbersome. In an observational setting where we get (Xi, Yi, Wi) for Wi ∈ {0, 1} and Xi ∈ R it is tricky to use permutations to study whether Y ⊥⊥ W. We could permute (W, X) versus Y or we could permute W versus (X, Y). Neither gives an exact test. [This was studied by David Freedman.] Losing exactness loses a lot of the motivation behind permutations. One of the best analyses of permutation tests is in the book by Lehmann and Romano. They show how it comes from a group symmetry argument. We will take the BHH view that if our experiment was randomized then we are reasonably safe to use the usual regression models.

4.3 Paired analysis

The next (hypothetical) example from BHH involves 10 kids and running shoes. There were two different materials for the soles of those shoes. Each kid gets one material on the right shoe and the other one on the left. We can diagram the situation as follows, deciding randomly whether to use left or right for material A:

A B B A B A B A A B

A B B A A B A B B A

BHH contemplate very big differences between the kids. Suppose that some are in the chess club while others prefer skateboarding. Figure 4.1 shows an exaggerated simulated example of how this might come out. The left panel shows that tread wear varies greatly over the 30 subjects there but just barely between the treatments. The right panel shows a consistent tendency for tread B to show more wear than tread A, though with a few exceptions. The way to handle it is via a paired t-test. Let Di = Y1i − Y2i for i = 1, . . . , n (so there are N = 2n measurements). Then do a one-sample t-test for whether E(D) = ∆ where ∆ is ordinarily 0. The output from a paired t-test on this data is

    t = -2.7569, df = 29, p-value = 0.009989
    95 percent confidence interval: -0.59845150 -0.08868513

with of course more digits than we actually want. The difference is barely significant at the 1% level. An unpaired t-test on this data yields:

    t = -0.2766, df = 57.943, p-value = 0.7831
    95 percent confidence interval: -2.829992 2.142856



Figure 4.1: Hypothetical shoe wear numbers for 30 subjects and soles A versus B. The left panel shows wear by subject (A solid, B open); the right panel plots B versus A.

The unpaired difference is not statistically significant and its confidence interval is much wider. In this setting the paired analysis is correct, or at least less wrong, and that is not because of the smaller p-value. It is because the unpaired analysis ignores correlations between measurements for the left and right shoe of a given kid. In class somebody asked what would be missing from the science table for this example. We get both the A and B numbers. What we don’t get is what would have happened if a kid who got A B had gotten B A instead. The science table would have had a row like LA LB RA RB for each kid and we would only see two of those four numbers. We would never get LA and LB for any of the kids. It is certainly possible that there are trends where left shoes get a different wear pattern than right shoes. Randomization protects against that possibility. If we model the (Y1j, Y2j) pairs as random with a correlation of ρ and equal variance σ² then our model gives

var(Dj) = var(Y1j − Y2j) = 2σ²(1 − ρ)

and we see that the higher the correlation, the more variance reduction we get. Experimental design offers possibilities to reduce the variance of your data and this is perhaps the simplest such example. The regression model for this paired data is

Yij = µ + bj + ∆Wij + εij

where bj is a common effect from the j’th pair, ∆ is the treatment effect and Wij ∈ {0, 1} is the treatment variable. This model forces the treatment difference to be the same in every pair. Then

Dj = Y1j − Y2j = (µ + bj + ∆W1j + ε1j) − (µ + bj + ∆W2j + ε2j) = ∆ + ε1j − ε2j.
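A small simulation in the spirit of the shoe example, with made-up subject effects and a made-up treatment effect of 0.3, shows how pairing removes the between-kid variation.

    # Paired versus unpaired t-tests when subject effects dominate.
    set.seed(6)
    n     <- 30
    kid   <- rnorm(n, mean = 28, sd = 5)         # big subject-to-subject differences
    wearA <- kid + rnorm(n, sd = 0.4)
    wearB <- kid + 0.3 + rnorm(n, sd = 0.4)      # tread B wears a little more
    t.test(wearA, wearB, paired = TRUE)$p.value  # small: pairing removes kid effect
    t.test(wearA, wearB)$p.value                 # large: kid variation swamps 0.3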


4.4 Blocking

Pairs are blocks of size 2. We can use blocks of any size k ≥ 2. They are very suitable when there are k ≥ 2 treatments to compare. Perhaps the oven can hold k = 3 cakes at a time. Or the car has k = 4 wheels on it at a time. If we have k = 3 treatments and block size of 3 we can arrange the treatments as follows:

B A C   |   C B A   |   B A C   |   •••   |   A B C
block 1     block 2     block 3              block B

with independent random assignments within each of B blocks. Suppose that there are positive correlations for measurements within blocks but independence between blocks. Then differences of averages ȲA• − ȲB•, ȲA• − ȲC•, and ȲB• − ȲC• should cancel out block effects just like we saw with paired tests, and be more accurate than an unblocked experiment:

A C B B C B A A ··· B

with N = kB cells. This latter design would be randomized completely in one of N!/(B!)^k ways. There are lots of use cases for blocked experiments in agriculture and a few from medicine and industry. In each of the settings below we might have B blocks that each can have k experimental runs.

Treatments               Block                              Response
Potato variety           Farm split into k plots            Yield
Cake recipe              Bake event, oven holds k cakes     Moisture
Diets                    Litters of k animals               Weight gain
Cholesterol meds         Volunteer                          Chol. levels
Sunscreen                Volunteer                          Damage
Technician               Shift                              Production
Ways to teach reading    School                             Comprehension
Ion injection            Cassette of Si wafers              Yield or speed

A block is usually about the same size as our number of treatments. If the problem is to compare a control treatment to k − 1 alternatives and the block has size k + 1 then we might apply the control treatment twice within each block, especially if comparisons to the control are of greatest importance.

4.5 Basic ANOVA

The class web page has Stat 305A notes on how to use regression to analyze this ANOVA. The model for the most basic ANOVA comparing k ≥ 2 treatments is

Yij = µ + αi + εij, i = 1, . . . , k j = 1, . . . , ni. (4.2)


This is called the one-way ANOVA because it has only one treatment factor. We will later consider multiple treatment factors. This model is not identified, because we could replace µ by µ − η and αi by αi + η for any η ∈ R without changing Yij. One way to handle that problem is to impose the constraint ∑_{i=1}^k ni αi = 0. Many regression packages would force α1 = 0. This model can be written

Yij = µi + εij (4.3) which is known as the cell mean model. We can think of a grid of boxes or cells µ1 µ2 ··· µk and we want to learn the mean response in each of them. The null hypothesis is that the treatments all have the same mean. That can be written as

H0 : µ1 = µ2 = ··· = µk or as

H0 : α1 = α2 = ··· = αk = 0.

The ‘big null’ is that L(Y1j) = L(Y2j) = · · · = L(Ykj) and that is what permutations test. We can test H0 by standard regression methods. Under H0 the linear model is just

Yij = µ + εij. (4.4)

We could reject H0 by a likelihood ratio test if the ‘full model’ (4.3) has a much higher likelihood than the ‘sub model’ (4.4). When the likelihoods involve Gaussian models, log likelihoods become sums of squares and the results simplify. Here are the results in the balanced setting where ni = n is the same for all i = 1, . . . , k. The full model has MLE

µ̂i = Ȳi• = (1/n) ∑_{j=1}^n Yij

and sum of squares

∑_{i=1}^k ∑_{j=1}^n (Yij − µ̂i)² = ∑_{i=1}^k ∑_{j=1}^n (Yij − Ȳi•)².

The sub-model from the null hypothesis has MLE

µ̂ = Ȳ•• = (1/k) ∑_{i=1}^k Ȳi• = (1/(nk)) ∑_{i=1}^k ∑_{j=1}^n Yij.


Source       df       SS     MS                    F
Treatments   k − 1    SSB    MSB = SSB/(k − 1)     MSB/MSW
Error        N − k    SSW    MSW = SSW/(N − k)
Total        N − 1    SST

Table 4.1: This is the ANOVA table for a one way analysis of variance.

These sums of squared errors are connected by the ANOVA identity

∑_{i=1}^k ∑_{j=1}^n (Yij − Ȳ••)² = ∑_{i=1}^k ∑_{j=1}^n (Ȳi• − Ȳ••)² + ∑_{i=1}^k ∑_{j=1}^n (Yij − Ȳi•)²,

where the three terms are SST, SSB and SSW respectively. The total sum of squares is equal to the sum of squares between treatment groups plus the sum of squares within treatment groups. This can be seen algebraically by expanding ∑_{i=1}^k ∑_{j=1}^n (Yij − Ȳi• + Ȳi• − Ȳ••)². It is also just Pythagoras (orthogonality of the space of fits and residuals) from a first course in regression. The F-test statistic based on the extra sum of squares principle is

F = [ (SSE_null − SSE_full)/(k − 1) ] / [ SSE_full/(N − k) ] = [ (SST − SSW)/(k − 1) ] / [ SSW/(N − k) ] = [ SSB/(k − 1) ] / [ SSW/(N − k) ] ≡ MSB/MSW.

Here, N = ∑_i ni = nk is the total sample size. When we divide a sum of squares by its degrees of freedom the ratio is called a mean square. We should reject the null hypothesis if MSB is large. The question ‘how large?’ is answered by requiring it to be a large enough multiple of MSW. We reject H0 if p = Pr(F_{k−1,N−k} ≥ F; H0) is small.
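In R the whole table comes from fitting the factor model and calling anova. This is a toy example of mine with k = 3 balanced groups.

    # One way ANOVA by regression on a factor.
    set.seed(7)
    k <- 3; n <- 10
    g <- factor(rep(1:k, each = n))
    y <- c(rnorm(n, 5.0), rnorm(n, 5.5), rnorm(n, 6.0))
    anova(lm(y ~ g))        # gives SSB, SSW, MSB, MSW and the F test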

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 48 4. Paired and blocked data, randomization inference

4.6 ANOVA for blocks

The model for a blocked analysis is

Yij = µ + αi + bj + εij,   i = 1, . . . , k,   j = 1, . . . , B.

Note that this model does not include an interaction. The treatment differences αi − αi′ are the same in every block j. All values in block j are adjusted up or down by the same constant bj. We denote it by bj instead of βj because we may not be very interested in block j per se. A block might be a litter of animals or one specific run through of our laboratory equipment. In a surfing competition it might be about one wave with three athletes on it. That wave is never coming back so we are only interested in αi, and maybe how that wave helps us compare αi for different i, but not bj. The parameter estimates here are µ̂ = Ȳ••, α̂i = Ȳi• − Ȳ••, b̂j = Ȳ•j − Ȳ••, and

ε̂ij = Yij − µ̂ − α̂i − b̂j = Yij − Ȳi• − Ȳ•j + Ȳ•• = (Yij − Ȳi•) − (Ȳ•j − Ȳ••).

We should get used to seeing these alternating sign and difference of differences patterns. The ANOVA decomposition is

SST = SSA + SSB + SSE where

SST = ∑_{i=1}^k ∑_{j=1}^B (Yij − Ȳ••)²,
SSA = ∑_{i=1}^k ∑_{j=1}^B (Ȳi• − Ȳ••)²,
SSB = ∑_{i=1}^k ∑_{j=1}^B (Ȳ•j − Ȳ••)²,  and
SSE = ∑_{i=1}^k ∑_{j=1}^B (Yij − Ȳi• − Ȳ•j + Ȳ••)².

The ANOVA table for it is in Table 4.2. You could write SSA as ∑_i B(Ȳi• − Ȳ••)² and that is definitely what you would do in a hand calculation. The way it is written above is more intuitive: all the sums of squares are sums over all data points. We test for treatment effects via

p = Pr( F_{k−1,(k−1)(B−1)} ≥ MSA/MSE ).

It is sometimes argued that one ought not to test for block effects. I don’t quite understand that.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 4.7. Latin squares 49

Source       df                SS     MS                             F
Treatments   k − 1             SSA    MSA = SSA/(k − 1)              MSA/MSE
Blocks       B − 1             SSB    MSB = SSB/(B − 1)              (∗)
Error        (k − 1)(B − 1)    SSE    MSE = SSE/((k − 1)(B − 1))
Total        N − 1             SST

Table 4.2: This is the ANOVA table for a blocked design.

If it turns out that blocking is not effective, then we could just not do it in the next experiment, which might then be simpler to run and have more degrees of freedom for error. A test can be based on MSB/MSE ∼ F_{B−1,(k−1)(B−1)}. The very old text books going back to the 1930s place a lot of emphasis on getting sufficiently many degrees of freedom for error. That concern is very relevant when the error degrees of freedom are small, say under 10. The reason can be seen by looking at quantiles of F_{num,den} such as F^{0.995}_{num,den} and F^{0.005}_{num,den} when the denominator degrees of freedom den is small. Check out qf in R, or its counterpart in python or matlab. It is not a concern in A/B testing with thousands or millions of observations.
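In R the blocked analysis is a regression on the treatment and block factors, and qf shows why small error degrees of freedom hurt. A sketch with made-up data:

    # Randomized block analysis and F quantiles for small error df.
    set.seed(8)
    k <- 4; B <- 6
    trt   <- factor(rep(1:k, times = B))
    block <- factor(rep(1:B, each = k))
    y <- rnorm(k * B) + rep(rnorm(B, sd = 2), each = k) + (trt == "2") * 0.8
    anova(lm(y ~ trt + block))                # treatment and block lines of Table 4.2
    qf(0.995, df1 = 3, df2 = c(5, 15, 1000))  # critical values shrink as error df grow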

4.7 Latin squares

Latin squares let you block on two sources of unwanted variation at once. Suppose that you are testing 4 battery chemistries: A, B, C, D. You have 4 different drivers and 4 different cars. The following diagram has each of A, B, C and D exactly once per row and exactly once per column.

              Driver
              1  2  3  4
    Car  1    A  B  C  D
         2    C  D  A  B
         3    B  C  D  A
         4    D  A  B  C

You could have driver 1 (column) test car 1 with treatment A. Then driver 2 takes car 2 with B and so on through all 16 cases ending up with driver 4 taking car 4 with treatment C. Now if there are car to car differences they are balanced out with respect to treatments. Driver to driver differences are also balanced out. This design only lets one car and one driver be on the track at once. The model for this design is

Yijk = µ + ai + bj + τk + εijk,

with ai the row (car) effect, bj the column (driver) effect, τk the treatment effect and εijk the error.


k: 1 2 3 4 5 6 7 #: 1 1 1 4 56 9,408 16,942,080

Table 4.3: This is integer sequence number A000315 in the online encyclopedia of integer sequences by Neil J. A. Sloane: https://oeis.org/A000315.

It does not allow for any interactions between cars and drivers, cars and batteries or drivers and batteries. Later when we take a closer study of interactions we will see that an interaction between cars and drivers could look like an effect of batteries. If there are no significant interactions like this then a Latin square can be an extremely efficient way to gather information. Otherwise it is risky. Sometimes a risky strategy pays off better than a cautious one. Other times not. To use a Latin square we start with a basic Latin square, perhaps like this one

A B C D
B C D A
C D A B
D A B C

and then randomly permute the rows and columns. We might as well also permute the symbols in it. Even if that is not necessary, it is easy to do, and simpler to just do it than think about whether you should. The above Latin square is called a cyclic Latin square because the rows after the first are simply their predecessor shifted left one space with wraparound. The number of distinct k × k Latin squares to start with is given in Table 4.3. Two Latin squares are distinct if you cannot change one into the other by permuting the rows and columns and symbols. The number grows quickly with k. Be sure to permute the Latin square, especially if your starting pattern is cyclic. The cyclic pattern will be very bad if there is a diagonal trend in the layout. In many of the original uses the Latin square was made up of k² plots of land for agriculture. Not only are Latin squares prone to trouble with interactions, they also have only a few degrees of freedom. With k² data points there are k² − 1 degrees of freedom about the mean. We use up k − 1 of them for each of rows, columns and treatments. That leaves k² − 1 − 3(k − 1) = (k − 1)(k − 2) degrees of freedom for error. Box et al. (1978, Chapter 8) provide a good description of how to analyze Latin squares. I changed their car and driver example to have electric cars. They give ANOVA tables for Latin squares and describe how to replicate them in order to get more degrees of freedom for error. In a short course like this one, we will not have time to go into those analyses.
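Randomizing a cyclic Latin square takes only a few lines in R; this sketch of mine permutes rows, columns and symbols as recommended above.

    # Start from a cyclic k x k Latin square, then permute rows, columns, symbols.
    random_latin <- function(k) {
      L <- outer(1:k, 1:k, function(i, j) ((i + j - 2) %% k) + 1)  # cyclic square
      L <- L[sample(k), sample(k)]                # permute rows and columns
      matrix(LETTERS[sample(k)][L], k, k)         # relabel the symbols
    }
    set.seed(9)
    random_latin(4)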


4.8 Esoteric blocking

There are a lot of more complicated and intricate ways to design experiments in blocks. I describe a few of them below. I consider them things to “know about”. If you ever find that you need them, then being able to connect the problem they solve to their name will help you search for designs and analysis strategies. They’re interesting to contemplate and we can really admire them from an aesthetic point of view. We will return to one of them later when we do space filling designs for computer experiments. For the rest, we don’t have time to study them carefully in a short course like this one. In the tableaux below:

A α   B β   C γ   D δ
B δ   A γ   D β   C α
C β   D α   A δ   B γ
D γ   C δ   B α   A β

the Latin letters (A, B, C, D) form a Latin square. So do the Greek letters (α, β, γ, δ). These two Latin squares are mutually orthogonal, meaning that every combination of one Latin letter with one Greek letter appears the same number of times (actually once). From two mutually orthogonal Latin squares (MOLS) we get a Graeco-Latin square like the one shown. We could use a Graeco-Latin square with treatments A, B, C and D blocked out against three factors: one for rows, one for columns and one for Greek letters. We are now in the setting of combinatoric existence and non-existence results. For instance, no Graeco-Latin square exists for k = 6. Euler thought there would be none for k = 10 but that was proved wrong in the 1950s. The operational difficulties of arranging a real-world Graeco-Latin square experiment are daunting. It is easy to do in software on the computer. You can even do hyper-Graeco-Latin square experiments with three or more MOLS. For instance if k is a prime number you can have k − 1 MOLS and then block out k factors at k levels in addition to a treatment factor at k levels. Or you can embed k² points into [0, 1]^{k+1} and have every pairwise scatterplot be a k × k grid. We will see this later for computer experiments and space-filling designs. Be sure to randomize! Sometimes the number of levels in a block is less than the number of treatments we have in mind. For instance, consider a club of people tasting 12 different wines where we don’t want anybody to taste more than 6 of them. Then we would like to arrange our tastings so that each person tastes 6 wines. Those people then represent incomplete blocks. In an ideal world, each pair of wines would be tasted together by the same number of tasters. That would give us balanced incomplete blocks. This makes sense because the best comparisons between wines A and B will come from people who tasted both A and B. That is, from within block comparisons. There will also be between block comparisons. For instance if many people found A better than B and many found B better than C that provides evidence (through a regression model) that A is better than C.


                10–19   20–29   30–39   40–49   Total
    Treatment     7       5       4       1      17
    Control       2       5       3       2      12

Table 4.4: Treatment and control sample sizes by age group in a Hodgkin’s disease investigation.

But the within block evidence from having A and C compared by the same people is more informative if the block effects (people) are large. In sporting leagues we have k teams and we compare them in games that are (ordinarily) blocks of size B = 2. A tournament in which each pair of teams played together the same number of times would be a balanced incomplete block design. There are also partially balanced incomplete block designs where the number of blocks where two treatments are together is either λ or λ + 1. So, while not equal, they are close to equal. We will not consider how to analyze incomplete block designs. If you use one in your project, the other topics from this course will prepare you to read about them and adopt them. There are even design strategies where one blocking factor has k levels and another has fewer than k levels. So the design is incomplete in that second factor. If you find yourself facing a situation like this, look for Youden squares.

4.9 Biased coins

Efron (1971) describes a small randomized controlled trial on Hodgkin’s disease. The sample sizes were as in Table 4.4. Patients were assigned at random to treatment or control as the study progressed. The assignment was double blind. As shown in the table, the allocations to treatment and control were not well balanced within age groups. We could do better to adjust the sampling proportions within groups as the experiment goes on. However, it is desirable to avoid sampling probabilities that get too far from 1/2 because then somebody with knowledge of past assignments might be able to predict future assignments in violation of the blinding protocol. The biased coin design gives treatment with some probability q < 1/2 if the group a subject belongs to has already had more treated than control. Efron (1971) likes q = 1/3. If the subject’s group has had fewer treated than control, then the biased coin allocates treatment with probability 1 − q. If those counts are equal, as they must be for the first subject in a group, then the treatment probability is 1/2. That is, a fair coin is used. The excess of treated minus control subject counts follows a Markov chain and does not get far from balance. For instance, with even n, the probability of an exact split between treatment and control approaches 1/2 when using q = 1/3. The balance properties are more valuable in settings with small sample

counts per group, especially when the number of groups is large.
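Here is a sketch of the biased coin rule in R for a single group; the function name and the run of 40 subjects are my own choices.

    # Efron's biased coin for one group: prob 1/2 when balanced, else push back.
    biased_coin_assign <- function(n, q = 1/3) {
      w <- integer(n)                      # 1 = treatment, 0 = control
      for (i in 1:n) {
        excess <- sum(w[seq_len(i - 1)] == 1) - sum(w[seq_len(i - 1)] == 0)
        p <- if (excess > 0) q else if (excess < 0) 1 - q else 0.5
        w[i] <- rbinom(1, 1, p)
      }
      w
    }
    set.seed(10)
    table(biased_coin_assign(40))          # counts stay close to 20/20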



5 Analysis of variance

This chapter goes deeper into the Analysis of Variance (ANOVA). We consider multiple factors and we also introduce the notions of fixed and random effects. Most of these notes assume familiarity with statistics and data analysis in order to study how to make data more than how to analyze it. This chapter is a bit of an exception. We will also extend the theory to cover what might be gaps in the usual way regression courses are taught. ANOVA is a bit more complicated than just running regressions with the categories coded up as indicator variables, and so we will need to add some extra theory. Where possible, the additional theory will be anchored in things that one would remember from an earlier regression course. Suppose that we have two categorical variables, say A and B. Each has two levels. That makes four treatment combinations. We could just study it as one with four levels. However we benefit from working with the underlying 2 × 2 structure. We have two choices to make (which A) and (which B) and a third thing about interaction, which we will see includes considering whether the best A depends on B and vice versa. Note that here A and B are two different treatment options. In A/B testing A and B are two different versions of one treatment option. Experimental design is complicated enough that no single notational convention can carry us through. We have to use local notation or else we would not be able to read the literature.

5.1 Potatoes and sugar

The oldest ANOVA reference I know of is Fisher and Mackenzie (1923). They were using it to study the effects of different fertilizers on potato yields. It is (historically) interesting that even in this first paper they are thinking of nonadditivity and even consider something that looks like one term of a singular value decomposition.

To illustrate how a two factor experiment works, consider the following hypothetical yields for 3 fertilizers and 4 varieties of potatoes:

Yield (kg)     V1       V2       V3       V4
    F1        109.0    110.9     94.2    125.9
    F2        104.9    113.4    110.1    138.0
    F3        151.8    160.9    111.9    145.0

Based on these values, we can wonder which fertilizer is best, which variety is best, and the extent to which one decision depends on the other. Taking the yield data to be a 3 × 4 matrix of $Y_{ij}$ values, the overall average yield is $\bar Y_{\bullet\bullet} = 123$. If we think fertilizer i raised or lowered the yield it must be about yields higher or lower than 123. So we can subtract 123 from all the $Y_{ij}$ and then take $(1/J)\sum_{j=1}^J (Y_{ij} - \bar Y_{\bullet\bullet}) = \bar Y_{i\bullet} - \bar Y_{\bullet\bullet}$ as the incremental effect of fertilizer i. We get:

     F1       F2       F3
   −13.0     −6.4     19.4

By this measure, fertilizer 1 lowers the yield by 13 while fertilizer 3 raises it by 19.4. The same idea applied to varieties yields $\bar Y_{\bullet j} - \bar Y_{\bullet\bullet}$:

     V1       V2       V3       V4
    −1.1      5.4    −17.6     13.3

Variety 3 underperforms quite a bit while variety 4 comes out best. We have just computed the grand mean $\bar Y_{\bullet\bullet} = 123$ and the main effects for fertilizer and variety. Note that the main effects each average to zero, because of the way that they are constructed. We can decompose the table into grand mean, main effects and a residual term, as follows:

 109.0 110.9 94.2 125.9   123 123 123 123   104.9 113.4 110.1 138.0  =  123 123 123 123  151.8 160.9 111.9 145.0 123 123 123 123

 −13.0 −13.0 −13.0 −13.0   −1.1 5.4 −17.6 13.3  +  −6.4 −6.4 −6.4 −6.4  +  −1.1 5.4 −17.6 13.3  19.4 19.4 19.4 19.4 −1.1 5.4 −17.6 13.3

 0.1 −4.5 1.8 2.6  +  −10.6 −8.6 11.1 8.1 . 10.5 13.1 −12.9 −10.7

The last term captures the extent to which the yields are not additive. It is called the interaction. The grand mean, main effects and interactions we want are defined in terms of $E(Y_{ij})$.
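
The same decomposition can be checked numerically. The following sketch (numpy, not part of the original notes) reproduces the grand mean, main effects and residual interaction from the yield table above.

import numpy as np

Y = np.array([[109.0, 110.9,  94.2, 125.9],
              [104.9, 113.4, 110.1, 138.0],
              [151.8, 160.9, 111.9, 145.0]])

grand = Y.mean()                              # 123.0
fert  = Y.mean(axis=1) - grand                # fertilizer effects: -13.0, -6.4, 19.4
vari  = Y.mean(axis=0) - grand                # variety effects: -1.1, 5.4, -17.6, 13.3
resid = Y - grand - fert[:, None] - vari[None, :]   # interaction term

# the pieces add back to Y exactly
assert np.allclose(Y, grand + fert[:, None] + vari[None, :] + resid)
print(np.round(resid, 1))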


Figure 5.1: Sugar yields in a 2 × 2 experiment where N denotes use of nitrogen and D denotes increased depth of ploughing. The four cell means and the within-row and within-column differences are:

                  LO Nitr.   HI Nitr.    effect of N
   Depth HI         42.4       50.2          7.8
   Depth LO         40.9       47.8          6.9
   effect of D       1.5        2.4

The ones we get are noisy versions of those, defined through the observed $Y_{ij}$. Much of ANOVA is about coping with noise that makes a sample table of $Y_{ij}$ differ from a hypothetical population table of $E(Y_{ij})$. Perhaps more is about coping with interactions that complicate estimation and interpretation of data. Here is another example motivated by agriculture. My notes say that I got it from Cochran and Cox, which has gone through many editions, but I cannot now find it in that book. It is about the yield of sugar in 100s of pounds per acre. There's an old unit called the 'hundredweight' which is about 45.4 kilograms. They considered two treatments. Treatment N involved either no nitrogen, or application of 300 lbs of nitrogen per acre. Treatment D involved either ploughing to the usual depth of 7" or going further to 11". The results, which might or might not be hypothetical, are depicted in Figure 5.1. When both variables are at their 'low' level the yield is 40.9. We see that going to the high level of N raises the yield by either 7.8 if D is at the high level (meaning greater depth) or 6.9 if D is at the low level. The overall estimate of the treatment effect is then their average, roughly 7.4. Similarly, we see two different effects for D, one at each of the high and low levels of N, that average to about 2. There seems to be a positive interaction of about 0.9 meaning that applying both treatments gives more than we would expect from them individually. This is a kind of synergy.

5.2 One at a time experiments

For the sugar problem, we tried all four combinations varying both factors. We could instead have done two separate experiments, one for each factor. That is called a one at a time (OAAT) experiment. Sometimes people even advocate for changing just one thing at a time to learn its effects. Here we show that if you have two things to investigate it is better to investigate them in a combined experiment.


Experiment A Take n observations at (N, D) = (0, 0) and n more at (N, D) = (0, 1) to test D. Then take another n at (N, D) = (0, 0) and n more at (N, D) = (1, 0) to test N. Experiment A costs 4n observations and delivers $\hat N = \bar Y_{10} - \bar Y_{00}$ with

$$\mathrm{var}(\hat N) = \frac{\sigma^2}{n} + \frac{\sigma^2}{n} = \frac{2\sigma^2}{n}.$$

Similarly $\mathrm{var}(\hat D) = 2\sigma^2/n$.

Experiment B Take n/2 observations at each of (0, 0), (0, 1), (1, 0) and (1, 1). Experiment B costs only 2n. It has $\hat N = [(\bar Y_{10} - \bar Y_{00}) + (\bar Y_{11} - \bar Y_{01})]/2$ with

$$\mathrm{var}(\hat N) = \frac{1}{4}\Bigl(\frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2}\Bigr) = \frac{2\sigma^2}{n} = \mathrm{var}(\hat D).$$

Experiment B delivers the same accuracy as the OAAT experiment at half the cost. By that measure, it is twice as good. It is actually better than twice as good. The factorial experiment can be used to investigate whether the factors interact while OAAT cannot. Experiment A is not the best OAAT design that we could do. It misses an opportunity to reuse the data at (0, 0). We could also consider experiment C below.

Experiment C Take n observations at (0, 0), and n more at (0, 1) and n more at (1, 0). Experiment C costs 1.5 times as much as experiment B and has the same variance. It also makes $\mathrm{corr}(\hat N, \hat D) \ne 0$. So, while OAAT via experiment A is only half as good as a factorial experiment, OAAT via experiment C is 2/3 as good. Since the observation at (0, 0) is used twice, we might want to give it extra samples. Of course a better idea is not to do OAAT. OAAT could leave us with the following information

   GOOD          ?
     ↑
   FAIR   →   GOOD

where two changes from our starting point are both beneficial but we don't know what happens if we make both changes. We would probably try that. We might get a result like

   GOOD   →   BAD
     ↑          ↑
   FAIR   →   GOOD

hopefully by trying it out first before committing to it. In a case like this we learn that making both changes is a bad idea, but we would have learned sooner with a factorial experiment.


Another possibility is that an OAAT experiment has this result

   BAD           ?
     ↑
   FAIR   →   BAD

where both changes are adverse. We don't know what would happen if both changes were made. Based on the sketch above, many people would not even try making both changes. It is possible that the underlying truth is like this

   BAD    →   BEST EVER
     ↑             ↑
   FAIR   →   BAD

where making both changes would be extremely valuable. In a factorial experiment we would learn this, while with OAAT it could very well go undiscovered.
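
The variance comparison between experiments A and B is easy to verify by simulation. The sketch below uses a made-up additive response surface with no interaction; the constants mu, alpha_N, alpha_D and sigma are illustrative assumptions, not values from the notes.

import numpy as np

rng = np.random.default_rng(1)
mu, alpha_N, alpha_D, sigma, n, reps = 10.0, 1.0, 0.5, 1.0, 50, 20000

def cell(N, D, m):
    # made-up additive response surface with no interaction
    return mu + alpha_N * N + alpha_D * D + rng.normal(0, sigma, m)

est_A, est_B = [], []
for _ in range(reps):
    # Experiment A (OAAT): uses 4n runs in total, 2n of them for N
    est_A.append(cell(1, 0, n).mean() - cell(0, 0, n).mean())
    # Experiment B (factorial): 2n runs, n/2 per cell
    m = n // 2
    est_B.append((cell(1, 0, m).mean() - cell(0, 0, m).mean()
                  + cell(1, 1, m).mean() - cell(0, 1, m).mean()) / 2)

print(np.var(est_A), np.var(est_B), 2 * sigma**2 / n)   # all close to 2 sigma^2 / n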

5.3 Interactions

One severe problem with OAAT is that if there are important interactions, then we don't learn about them and might be forever stuck in a suboptimal setting. Interactions cause severe difficulties. We can think of a failure of external validity as being an interaction between treatment choices (e.g., aspirin vs tylenol) and another variable describing the past versus the future. Or that second variable could be data in our study versus data we want to generalize to. It is bad enough that 'your mileage may vary' and worse still that mileage differences may vary. That could mean that the optimal choice changes between the data we learned from and the setting where we will make future decisions.

Interactions underlie lots of accidents and disasters. An accident might only have happened because of a wet road, inattentive driver, bad street lights and poor brakes. If all of those things were needed to create the accident then it is a sort of interaction. It's good that the accident is not in the grand mean or main effects, because then we could get more of them, but having it be an interaction makes it harder to prevent.

Andrew Gelman has written that interactions take 16 times as much data to estimate as main effects do: https://statmodeling.stat.columbia.edu/2018/03/15/need-16-times-sample-size-estimate-interaction-estimate-main-effect/ It is informative to see how that 16 arises. In a 2 × 2 experiment the main effect estimate would have

$$\mathrm{var}(\bar Y_{1\bullet} - \bar Y_{2\bullet}) = \frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2} = \frac{4\sigma^2}{n}$$

while the interaction estimate would have variance

$$\mathrm{var}\bigl(\bar Y_{11} - \bar Y_{01} - \bar Y_{10} + \bar Y_{00}\bigr) = 4 \times \frac{\sigma^2}{n/4} = \frac{16\sigma^2}{n}.$$


If an interaction would have the same size as a main effect, then it would be 4 times as hard to estimate, meaning that we would need to raise n by a factor of 4 to do as well. Now suppose that the main effect is $\theta_{\mathrm{main}}$ and the interaction is $\theta_{\mathrm{inter}} = \lambda\theta_{\mathrm{main}}$. We use sample size n for the main effect and get a relative error of

$$\frac{|\theta_{\mathrm{main}}|}{\sqrt{4\sigma^2/n}} = \frac{\sqrt{n}\,|\theta_{\mathrm{main}}|}{2\sigma}.$$

If we use $n'$ observations for the interaction our relative error would be

$$\frac{|\theta_{\mathrm{inter}}|}{\sqrt{16\sigma^2/n'}} = \frac{\sqrt{n'}\,|\lambda\theta_{\mathrm{main}}|}{4\sigma}.$$

We make these equal by solving $|\lambda|\sqrt{n'} = 2\sqrt{n}$, giving $n' = 4n/\lambda^2$. If we take λ = ±1/2, then we find that interactions are 16 times as hard to estimate as main effects. Relative error is important because statistical significance and power depend on the ratio of the (absolute) effect size to the standard deviation of the effect estimate. The factor of 1/2 is a ballpark estimate. The actual size ratio between a typical interaction and a typical main effect will vary with the setting.

Interactions can raise severe problems of multiple comparisons. When there are d experimental factors then there are $\binom{d}{2}$ different pairwise comparisons. If we test everything at level α then we expect up to dα false discoveries for main effects and up to d(d − 1)α/2 among interactions. If we think that interactions are more likely to be null (or close enough to it to be practically null) then the problem is even more severe for interactions than main effects. Or, if we use methods to control false discovery rates, then we have to be more conservative with interactions. The problem gets worse for higher order interactions. For instance if an A × B interaction depends on the level of factor C, then we have a three factor interaction. If in baseball a pitcher has a very good record against tall left handed batters in evening games with a light wind, then that finding is a kind of high order interaction that was probably based on slim evidence. Patterns that appear as high order interactions can sound very subtle and important but they could also be noise.

Interactions are not all bad. It is good that not everybody likes the exact same restaurants. That is an interaction between people and restaurants. More generally, personalization of services involves measuring interactions. I believe that much of the work by George Box and his co-authors on experimental design was a response to the challenge of learning scientific facts despite the presence of many interactions. Ordinary statistical modeling is well able to handle noise. Randomization of treatments is a good way to counter missing variables and get causal estimates. Factorial designs and, later on, fractional factorials help us cope with the complexity that comes from interactions.
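
Here is a small numeric check of the n' = 4n/λ² calculation; the values of sigma, n and theta_main are arbitrary illustrations.

import numpy as np

sigma, n, theta_main = 1.0, 100, 1.0
for lam in (1.0, 0.5, 0.25):
    theta_inter = lam * theta_main
    n_prime = 4 * n / lam**2            # sample size that equalizes the relative errors
    rel_main  = np.sqrt(4 * sigma**2 / n) / abs(theta_main)
    rel_inter = np.sqrt(16 * sigma**2 / n_prime) / abs(theta_inter)
    print(lam, n_prime / n, round(rel_main, 4), round(rel_inter, 4))
# lam = 0.5 gives n_prime = 16 n, matching the "16 times as much data" rule of thumb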


5.4 Multiway ANOVA

Now suppose we have 4 factors, A, B, C and D. We can write the regression model as

$$\begin{aligned}
Y_{ijk\ell} = \mu &+ \alpha_i + \beta_j + \gamma_k + \delta_\ell \\
&+ (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\alpha\delta)_{i\ell} + (\beta\gamma)_{jk} + (\beta\delta)_{j\ell} + (\gamma\delta)_{k\ell} \\
&+ (\alpha\beta\gamma)_{ijk} + (\alpha\beta\delta)_{ij\ell} + (\alpha\gamma\delta)_{ik\ell} + (\beta\gamma\delta)_{jk\ell} \\
&+ \varepsilon_{ijk\ell}.
\end{aligned}$$

Factor A has levels $1 \le i \le I$, factor B has levels $1 \le j \le J$, factor C has levels $1 \le k \le K$ and factor D has levels $1 \le \ell \le L$. The model above has numerous interactions. We could run out of Greek letters and so we use notation like $(\alpha\beta)$ as its own symbol to describe the parameters in an A × B interaction. Here we have an error/noise term $\varepsilon_{ijk\ell}$. To make the model identifiable we make each main effect sum to zero and each interaction sum to zero over all values of any index in it. For instance $\sum_{j=1}^J (\alpha\beta\gamma)_{ijk} = 0$ for all i and k.

It is easy to work out what the estimates are when $\varepsilon_{ijk\ell} \sim \mathcal{N}(0, \sigma^2)$. We get

$$\begin{aligned}
\hat\mu &= \bar Y_{\bullet\bullet\bullet\bullet} \\
\hat\alpha_i &= \frac{1}{JKL}\sum_j\sum_k\sum_\ell (Y_{ijk\ell} - \hat\mu) = \bar Y_{i\bullet\bullet\bullet} - \bar Y_{\bullet\bullet\bullet\bullet} \\
\widehat{(\alpha\beta)}_{ij} &= \frac{1}{KL}\sum_k\sum_\ell \bigl(Y_{ijk\ell} - \hat\mu - \hat\alpha_i - \hat\beta_j\bigr) \\
&= \bar Y_{ij\bullet\bullet} - \hat\mu - \hat\alpha_i - \hat\beta_j \\
&= \bar Y_{ij\bullet\bullet} - \bar Y_{\bullet\bullet\bullet\bullet} - (\bar Y_{i\bullet\bullet\bullet} - \bar Y_{\bullet\bullet\bullet\bullet}) - (\bar Y_{\bullet j\bullet\bullet} - \bar Y_{\bullet\bullet\bullet\bullet}) \\
&= \bar Y_{ij\bullet\bullet} - \bar Y_{i\bullet\bullet\bullet} - \bar Y_{\bullet j\bullet\bullet} + \bar Y_{\bullet\bullet\bullet\bullet}, \quad\text{and} \\
\widehat{(\alpha\beta\gamma)}_{ijk} &= \frac{1}{L}\sum_\ell \bigl(Y_{ijk\ell} - \hat\mu - \hat\alpha_i - \hat\beta_j - \hat\gamma_k - \widehat{(\alpha\beta)}_{ij} - \widehat{(\alpha\gamma)}_{ik} - \widehat{(\beta\gamma)}_{jk}\bigr) \\
&= \bar Y_{ijk\bullet} - \bar Y_{ij\bullet\bullet} - \bar Y_{i\bullet k\bullet} - \bar Y_{\bullet jk\bullet} + \bar Y_{i\bullet\bullet\bullet} + \bar Y_{\bullet j\bullet\bullet} + \bar Y_{\bullet\bullet k\bullet} - \bar Y_{\bullet\bullet\bullet\bullet}
\end{aligned}$$

and the others are similar. In each case we subtract sub-effect estimates and average over variables not in the interaction of interest. The results are differences of certain data averages.
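
These averaging formulas are straightforward to compute. The sketch below (numpy, with made-up data) computes the grand mean, the main effects for A and B, and the A × B interaction estimates, and checks the sum-to-zero constraints.

import numpy as np

rng = np.random.default_rng(0)
I, J, K, L = 3, 4, 2, 5
Y = rng.normal(size=(I, J, K, L))          # one observation per cell, stand-in data

mu_hat    = Y.mean()
alpha_hat = Y.mean(axis=(1, 2, 3)) - mu_hat                    # Ybar_{i...} - Ybar_{....}
beta_hat  = Y.mean(axis=(0, 2, 3)) - mu_hat
ab_hat = (Y.mean(axis=(2, 3))                                  # Ybar_{ij..}
          - Y.mean(axis=(1, 2, 3))[:, None]                    # - Ybar_{i...}
          - Y.mean(axis=(0, 2, 3))[None, :]                    # - Ybar_{.j..}
          + mu_hat)

# each estimated effect sums to zero over any of its indices
print(np.round(alpha_hat.sum(), 12), np.round(ab_hat.sum(axis=0), 12))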


We also have an ANOVA identity. It is a bit ungainly:

$$\begin{aligned}
\sum_{ijk\ell}(Y_{ijk\ell} - \hat\mu)^2 &= \sum_{ijk\ell}\hat\alpha_i^2 + \sum_{ijk\ell}\hat\beta_j^2 + \sum_{ijk\ell}\hat\gamma_k^2 + \sum_{ijk\ell}\hat\delta_\ell^2 \\
&\phantom{=}+ \sum_{ijk\ell}\widehat{\alpha\beta}_{ij}^2 + \sum_{ijk\ell}\widehat{\alpha\gamma}_{ik}^2 + \sum_{ijk\ell}\widehat{\alpha\delta}_{i\ell}^2 + \sum_{ijk\ell}\widehat{\beta\gamma}_{jk}^2 + \sum_{ijk\ell}\widehat{\beta\delta}_{j\ell}^2 + \sum_{ijk\ell}\widehat{\gamma\delta}_{k\ell}^2 \\
&\phantom{=}+ \sum_{ijk\ell}\widehat{\alpha\beta\gamma}_{ijk}^2 + \sum_{ijk\ell}\widehat{\alpha\beta\delta}_{ij\ell}^2 + \sum_{ijk\ell}\widehat{\alpha\gamma\delta}_{ik\ell}^2 + \sum_{ijk\ell}\widehat{\beta\gamma\delta}_{jk\ell}^2 \\
&\phantom{=}+ \sum_{ijk\ell}\widehat{\alpha\beta\gamma\delta}_{ijk\ell}^2 + \sum_{ijk\ell}\hat\varepsilon_{ijk\ell}^2.
\end{aligned}$$

For d factors there are $2^d - 1$ non-empty subsets of them that all explain some amount of variance that sums to the total variance among all N values. We ordinarily ignore $\sum_{ijk\ell}\hat\mu^2$, which when added to the above gives us $\sum_{ijk\ell} Y_{ijk\ell}^2$. The reason that µ and $\hat\mu$ are of less interest is that the grand mean has no impact on our choices of i or j or k or ℓ. Later we will look at the functional ANOVA. This was used by Hoeffding (1948) to study U-statistics, by Sobol' (1969) to study numerical integration and by Efron and Stein (1981) to study the jackknife. We will use a more abstract notation for it that simplifies some expressions. Suppose that $x = (x_1, x_2, \ldots, x_d)$ where the $x_j$ are independent random inputs. Now let $Y = f(x)$. If $E(Y^2) < \infty$ then $\mathrm{var}(f(x))$ exists and there is a way to do an ANOVA on it. We proceed by analogy. The grand mean is $\mu = E(f(x))$. Then the main effect for variable j is

$$f_{\{j\}}(x) = E(f(x) - \mu \mid x_j) = E(f(x)\mid x_j) - \mu$$

and the j, k interaction is

$$f_{\{j,k\}}(x) = E\bigl(f(x) - \mu - f_{\{j\}}(x) - f_{\{k\}}(x) \mid x_j, x_k\bigr) = E(f(x)\mid x_j, x_k) - \mu - f_{\{j\}}(x) - f_{\{k\}}(x).$$

We keep subtracting sub-effects and averaging over variables not in the inter- action of interest. The xj do not have to be categorical random variables like the levels of the factors we use above. They could be U[0, 1] random variables or vectors, sound clips, images or genomes. All that matters is that they are independent and that f(x) is real-valued with finite variance. We will look at https://statweb.stanford.edu/~owen/mc/A-anova.pdf later when we get to computer experiments.
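
A small Monte Carlo sketch may make the functional ANOVA concrete. The function f below is a toy example chosen only for illustration (it is not from the notes); its main effect for x1 can be computed exactly and compared to the conditional-mean estimate.

import numpy as np

rng = np.random.default_rng(0)

def f(x1, x2):
    return x1 + 2.0 * x2 + x1 * x2        # toy function with independent U(0,1) inputs

n  = 200_000
mu = np.mean(f(rng.uniform(size=n), rng.uniform(size=n)))   # grand mean E f(x), about 1.75

# main effect of x1: f_{1}(x1) = E(f(x) | x1) - mu, estimated on a grid of x1 values
for x1 in (0.1, 0.5, 0.9):
    cond_mean = np.mean(f(x1, rng.uniform(size=n)))
    exact = 1.5 * x1 - 0.75               # exact value of f_{1}(x1) for this toy f
    print(x1, round(cond_mean - mu, 3), round(exact, 3))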

5.5 Replicates

Suppose that we were interested in a four-factor interaction. We could take R ≥ 2 independent measurements at each ABCD combination. Then our model would be


$$\begin{aligned}
Y_{ijk\ell r} = \mu &+ \alpha_i + \beta_j + \gamma_k + \delta_\ell \\
&+ (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\alpha\delta)_{i\ell} + (\beta\gamma)_{jk} + (\beta\delta)_{j\ell} + (\gamma\delta)_{k\ell} \\
&+ (\alpha\beta\gamma)_{ijk} + (\alpha\beta\delta)_{ij\ell} + (\alpha\gamma\delta)_{ik\ell} + (\beta\gamma\delta)_{jk\ell} \\
&+ (\alpha\beta\gamma\delta)_{ijk\ell} + \varepsilon_{ijk\ell r} \qquad (5.1)
\end{aligned}$$

where $1 \le r \le R$. Suppose that we are making soup. We could go to store i, buy vegetables j, use recipe k, find person ℓ and have them taste the resulting pot of soup R times for r = 1, ..., R. Or, on R separate days we could go to all I stores, buy all J vegetables at each store, try all K recipes on each set of vegetables from each store, and ask all L people to try all IJK of those soups once. These are clearly quite different things. Not all kinds of replicate are equal. Equation (5.1) is a plausible model for the setting where the tasters taste each pot of soup R times in a row. It is an entirely unsuitable model for the setting where the whole experiment is completely redone R times. For that we would at a minimum want to include a main effect $\eta_r$ for replicate r. There could even be a case for making the day of the experiment be its own fifth factor E with R levels. What is going on here is that the meaning of r = 1 is different in the two cases. In the first case it is just the first time one person tastes a given soup. In the second case it is one entire run of the experiment. Just looking at the file of N = IJKLR numbers we might not be able to tell which way the experiment was done. We will think more about replicates when they come up in specific examples.

5.6 High order ANOVA tables

If factors A and B have I and J levels, respectively, then their interaction has (I − 1)(J − 1) degrees of freedom. To see why consider this table

(αβ)_{ij}    j=1   j=2   j=3   j=4
   i=1        X     X     X     ?
   i=2        X     X     X     ?
   i=3        ?     ?     ?     ?

where we suppose that there are known values in all the cells marked X. There are (I − 1)(J − 1) of them. For any set of choices we make, the table can be completed by first making row sums zero and then making column sums zero. Had we omitted one of those (I − 1)(J − 1) values, there would not be a unique way to complete the table. Table 5.1 shows a portion of the ANOVA table for a four way factorial experiment where each cell has R independent values. We see IJKL(R − 1) degrees of freedom for error. We could get this by subtracting all of the other degrees of freedom numbers from N − 1 = IJKLR − 1. Or we can view it as gathering R − 1 degrees of freedom for error from each of I × J × K × L cells.


Source              DF
A or α              I − 1
B or β              J − 1
...                 ...
AB or αβ            (I − 1)(J − 1)
...                 ...
ABC or αβγ          (I − 1)(J − 1)(K − 1)
...                 ...
ABCD or αβγδ        (I − 1)(J − 1)(K − 1)(L − 1)
Error               IJKL(R − 1)
Total               IJKLR − 1

Table 5.1: Selected rows of the ANOVA table for a four way table where each cell has R independent repeats.

It is clear from the above that these full factorial experiments are big and bulky and hence probably expensive. If I = J = K = L = 11 then we need N = 11^4 R = 14,641R observations. They will give us 10 df for each main effect, 100 per two factor interaction, 1000 df for each three factor interaction and 10,000 df for the four factor interaction, which could be the least useful of them all. Next we will look at what happens when all factors are at 2 levels. We still need N = 2^d data points for d factors, or R2^d if we have replicated the experiment. A common practice is to design an experiment that learns about the main effects and low order interactions while partially or completely ignoring the high order interactions. To do this seems a bit hypocritical. We earlier argued that OAAT is bad because it can miss the two factor interactions and now we are getting ready to possibly ignore some other higher order interactions. This common practice is a gamble. In statistics, we are usually against gambling and favor a cautious approach that covers all possibilities. However the cautious approach is really expensive and could be suboptimal. Experimental design has room for both bold and cautious choices. With a bold choice we do a small experiment that could be inexpensive or fast. If it goes well, then we learn more quickly. If it goes badly, then the experiment might not be informative at all and we have to do another one. The bold strategies we will look at are motivated by a principle of factor sparsity. This holds that the important quantities are mostly lower order and perhaps even many of the main effects are unimportant. Or at least relatively unimportant. If there are 2^10 − 1 effects and interactions they cannot all be relatively important! There is a related bet on sparsity principle (Friedman et al., 2001) motivating the use of the lasso. If things are sparse, you win. If they're not, maybe nothing would have worked.


5.7 Distributions of sums of squares

The ANOVA table involves a lot of algebraic manipulations. We get a table based on the values of $Y_{ijk\ell}$. We would rather have the table based on $E(Y_{ijk\ell})$. So here we must study the distribution of the quantities in the table we get, using some model assumptions. The most common model assumptions are based on the Gaussian distribution. Sometimes that's robust due to the central limit theorem and sometimes it is not. Balanced experiments are more robust. The F-tests that we do for main effects and interactions are more robust than those we might do to compare variances of two sources of noise. Time permitting, we might cover this. For now, let's see how the mechanics of these tests work out. These derivations are to the extent possible based on things that are easy to remember from an introductory regression class. To that we add some things about noncentral distributions that are not always included.

First, we recall that if $Z_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0,1)$ then

$$\sum_{i=1}^n Z_i^2 \sim \chi^2_{(n)}.$$

Also, if $Q_j \stackrel{\mathrm{ind}}{\sim} \chi^2_{(n_j)}$ then

$$F = \frac{Q_1/n_1}{Q_2/n_2} \sim F_{n_1,n_2}$$

and this is the very definition of the F distribution. The F distribution has numerator and denominator degrees of freedom and we write $F_{\mathrm{num},\mathrm{den}}$ for the general case. Now we consider noncentral distributions. These appear less often in introductory courses. The main place we need them is in power calculations, such as choosing a sample size. Choosing a sample size is maybe the simplest design problem. It does not however come up if we are just looking at pre-existing data sets. If $Z_i \stackrel{\mathrm{ind}}{\sim} \mathcal{N}(\mu_i, 1)$ then

$$\sum_{i=1}^n Z_i^2 \sim \chi'^2_{(n)}(\lambda), \qquad \lambda = \sum_{i=1}^n \mu_i^2.$$

This is the noncentral $\chi^2$ distribution on n degrees of freedom with noncentrality parameter λ. There are alternative parameterizations out there so we always have to check books, articles and software documentation to see which one was used. Note particularly that the n means $\mu_i$ only affect the distribution through their sum of squares $\sum_i \mu_i^2$. That extremely convenient fact depends on the Gaussian distribution chosen for $Z_i$.


If $Q_1 \sim \chi'^2_{(n_1)}(\lambda)$ and $Q_2 \sim \chi^2_{(n_2)}$ then

$$F' = \frac{Q_1/n_1}{Q_2/n_2} \sim F'_{n_1,n_2}(\lambda).$$

This is the noncentral F distribution. Here is how we use it. Our F statistics will have central numerators under $H_0$ but noncentral ones under $H_A$. They will usually have central denominators under both because most of our denominators will involve sums of squares of differences of noise. If our noise model is wrong then the denominator might be noncentral. That would leave us with a doubly noncentral F distribution. That second noncentrality would make the denominator bigger and hence the F statistic smaller, and that implies lower statistical power to reject $H_0$. Next we look at the distributions of sums of squares for a simple one way layout with $Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$ for $i = 1,\ldots,I$ and $j = 1,\ldots,n$. Recall from introductory regression that when $X_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ then

$$(n-1)s^2 \equiv \sum_{i=1}^n (X_i - \bar X)^2 \sim \sigma^2\chi^2_{(n-1)}.$$

Also if $Q_i \stackrel{\mathrm{ind}}{\sim} \chi^2_{(n_i)}$ then

$$\sum_i Q_i \sim \chi^2_{(N)}, \qquad N = \sum_i n_i.$$

Now

$$\begin{aligned}
\mathrm{SSE} &= \sum_{i=1}^I\sum_{j=1}^n (Y_{ij} - \bar Y_{i\bullet})^2 = \sum_{i=1}^I\sum_{j=1}^n \bigl((\mu + \alpha_i + \varepsilon_{ij}) - (\mu + \alpha_i + \bar\varepsilon_{i\bullet})\bigr)^2 \\
&= \sum_{i=1}^I\sum_{j=1}^n (\varepsilon_{ij} - \bar\varepsilon_{i\bullet})^2 = \sum_{i=1}^I Q_i \quad\text{for } Q_i \stackrel{\mathrm{iid}}{\sim} \sigma^2\chi^2_{(n-1)} \\
&\sim \sigma^2\chi^2_{(I(n-1))}.
\end{aligned}$$

Exercise: what is $\mathcal{L}(\mathrm{SSE})$ in the unbalanced case with $n_i$ observations at level i of the factor A? Notice that the $\alpha_i$ did not affect SSE. They canceled out. Next, under $H_0: \alpha_i = 0$ we have

$$\begin{aligned}
\mathrm{SSA} &= \sum_{i=1}^I\sum_{j=1}^n (\bar Y_{i\bullet} - \bar Y_{\bullet\bullet})^2 = n\sum_{i=1}^I (\bar Y_{i\bullet} - \bar Y_{\bullet\bullet})^2 \\
&= n\sum_{i=1}^I \bigl((\mu + \alpha_i + \bar\varepsilon_{i\bullet}) - (\mu + \bar\alpha_\bullet + \bar\varepsilon_{\bullet\bullet})\bigr)^2 \\
&= n\sum_{i=1}^I (\bar\varepsilon_{i\bullet} - \bar\varepsilon_{\bullet\bullet})^2 \sim n\,\frac{\sigma^2}{n}\,\chi^2_{(I-1)} = \sigma^2\chi^2_{(I-1)}.
\end{aligned}$$


If $H_0$ does not hold then $\bar Y_{i\bullet} - \bar Y_{\bullet\bullet} = \alpha_i + \bar\varepsilon_{i\bullet} - \bar\varepsilon_{\bullet\bullet}$. Now

$$\alpha_i + \bar\varepsilon_{i\bullet} \sim \mathcal{N}\Bigl(\alpha_i, \frac{\sigma^2}{n}\Bigr) \quad\text{and so}\quad \frac{\alpha_i + \bar\varepsilon_{i\bullet}}{\sigma/\sqrt{n}} \sim \mathcal{N}\Bigl(\frac{\alpha_i}{\sigma/\sqrt{n}}, 1\Bigr).$$

Therefore

$$\sum_{i=1}^I \Bigl(\frac{\alpha_i + \bar\varepsilon_{i\bullet}}{\sigma/\sqrt{n}}\Bigr)^2 \sim \chi'^2_{(I)}(\lambda) \quad\text{for}\quad \lambda = \frac{n}{\sigma^2}\sum_{i=1}^I \alpha_i^2.$$

From this we find that

$$\sum_{i=1}^I (\alpha_i + \bar\varepsilon_{i\bullet})^2 \sim \frac{\sigma^2}{n}\chi'^2_{(I)}\Bigl(\frac{n}{\sigma^2}\sum_{i=1}^I\alpha_i^2\Bigr) \quad\text{and so}\quad \sum_{i=1}^I\sum_{j=1}^n (\alpha_i + \bar\varepsilon_{i\bullet})^2 \sim \sigma^2\chi'^2_{(I)}\Bigl(\frac{n}{\sigma^2}\sum_{i=1}^I\alpha_i^2\Bigr).$$

The answer we are going for is

$$\mathrm{SSA} = \sum_{i=1}^I\sum_{j=1}^n (\alpha_i + \bar\varepsilon_{i\bullet} - \bar\varepsilon_{\bullet\bullet})^2 \sim \sigma^2\chi'^2_{(I-1)}\Bigl(\frac{n}{\sigma^2}\sum_{i=1}^I\alpha_i^2\Bigr).$$

Notice that the degrees of freedom drop by one when we center at $\bar\varepsilon_{\bullet\bullet}$. I have not found a nice way to get this using just things one might remember from an introductory regression class plus the noncentral distribution definitions. Now let's look at our F statistic, under $H_0$:

$$F = \frac{\mathrm{MSA}}{\mathrm{MSE}} = \frac{\frac{1}{I-1}\mathrm{SSA}}{\frac{1}{I(n-1)}\mathrm{SSE}} \sim \frac{\frac{1}{I-1}\sigma^2\chi^2_{(I-1)}}{\frac{1}{I(n-1)}\sigma^2\chi^2_{(I(n-1))}} = \frac{\frac{1}{I-1}\chi^2_{(I-1)}}{\frac{1}{I(n-1)}\chi^2_{(I(n-1))}} = F_{I-1,\,I(n-1)}.$$

Under $H_A$,

$$F \sim F'_{I-1,\,I(n-1)}(\lambda), \qquad \lambda = \frac{n}{\sigma^2}\sum_{i=1}^I\alpha_i^2.$$

Under HA, λ > 0 and the larger λ is the larger E(MSA) is. Thus HA tends to increase F and so we should reject H0 for unusually large values. We reject H0 at level α when the observed value Fobs satisfies

$$F_{\mathrm{obs}} > F^{1-\alpha}_{I-1,\,I(n-1)}.$$

We set a p-value of $p = \Pr(F_{I-1,\,I(n-1)} > F_{\mathrm{obs}})$


where Fobs is the observed F statistic. The power of our test is

$$\Pr\bigl(F'_{I-1,\,I(n-1)}(\lambda) > F^{1-\alpha}_{I-1,\,I(n-1)}\bigr).$$

If we look at λ, then we see that it is a signal to noise ratio

$$\lambda = \frac{\sum_{i=1}^I\alpha_i^2}{\sigma^2/n} = I \times \frac{(1/I)\sum_{i=1}^I\alpha_i^2}{\sigma^2/n}.$$

Here $(1/I)\sum_{i=1}^I\alpha_i^2$ is the variance of $\alpha_i$ for a randomly chosen level i and $\sigma^2/n$ is the variance of $\bar\varepsilon_{i\bullet}$. We get more power from larger n (get a bigger sample) or smaller $\sigma^2$ (e.g., buy better equipment) and from studying phenomena with larger effects.
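
With scipy this becomes a power calculation. The sketch below assumes hypothetical values of I, n, σ and the α_i; scipy's ncf is the noncentral F distribution.

import numpy as np
from scipy import stats

I, n, sigma = 4, 10, 1.0
alpha_effects = np.array([0.5, 0.0, 0.0, -0.5])       # hypothetical level effects, sum to zero
lam = n * np.sum(alpha_effects**2) / sigma**2          # noncentrality parameter
dfn, dfd = I - 1, I * (n - 1)

f_crit = stats.f.ppf(0.95, dfn, dfd)                   # reject H0 when F_obs exceeds this
power  = 1 - stats.ncf.cdf(f_crit, dfn, dfd, lam)      # noncentral F tail probability
print(round(f_crit, 3), round(power, 3))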

5.8 Fixed and random effects

In an ANOVA we need to decide whether each of our factors represents a fixed effect or a random effect. It is a fixed effect if we care about the exact set of levels in the experiment. That could be 3 headache pills, 4 gas additives or 5 machine operators. It is a random effect if we care about the population from which those levels were sampled. That could be 24 patients in a trial, 8 rolls of vinyl or those same 5 machine operators. Whether we care about the levels in our experiment or a population that they represent depends on the uses we plan to make of the data. While the health of those 24 patients is very important, the reason for the trial is likely to be to learn how to treat some condition for people in general. Those 24 patients may then be viewed as a sample from a population of many millions to be treated later. It is unlikely that they really are a random sample of the population to be treated, but hopefully they are reasonably representative. Once the 8 rolls of vinyl are used up and put into tested products we have no specific interest in them per se beyond how they represent the process that made them. For machine operators we might be interested in comparing 5 specific employees' productivity. Or we might be interested in which of two machines will work better on average for a population of operators. In a random effects model we write

Yij = µ + ai + εij

where $a_i \sim \mathcal{N}(0, \sigma^2_A)$ and $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2_E)$ are all independent. We might be interested in learning $\sigma^2_A/\sigma^2_E$ and the usual null hypothesis is $H_0: \sigma^2_A = 0$. The ANOVA is exactly the same as before. That is, SST = SSB + SSW or SST = SSA + SSE are the two different notations we have used for the one factor ANOVA. This identity is just algebraic so it holds for any numbers we would put into it. Using methods like those in Section 5.7 we find that

$$\mathrm{SSW} \sim \sigma^2_E\,\chi^2_{I(n-1)} \quad\text{and}\quad \mathrm{SSB} \sim n\bigl(\sigma^2_A + \sigma^2_E/n\bigr)\chi^2_{(I-1)}.$$


Source    DF                  Expected mean square
A         I − 1               σ²_E + nσ²_AB + nJσ²_A
B         J − 1               σ²_E + nIσ²_B
AB        (I − 1)(J − 1)      σ²_E + nσ²_AB
Err       IJ(n − 1)           σ²_E

Table 5.2: Expected values of the mean squares in a mixed effects model.

Now $\mathrm{MSB}/\mathrm{MSW} \sim F_{I-1,\,I(n-1)}$ under $H_0$ and is larger than that under $H_A$, though it is not noncentrally distributed any more. So we do the same F test as before. Not much changed. When there are two random effects we use this model

$$Y_{ijk} = \mu + a_i + b_j + (ab)_{ij} + \varepsilon_{ijk}, \qquad a_i \sim \mathcal{N}(0,\sigma^2_A), \quad b_j \sim \mathcal{N}(0,\sigma^2_B), \quad (ab)_{ij} \sim \mathcal{N}(0,\sigma^2_{AB}), \quad \varepsilon_{ijk} \sim \mathcal{N}(0,\sigma^2_E)$$

where all the Gaussian random variables are independent. In the balanced case there are the same number n of observations at each ij combination. For $n \ge 2$ we find that

$$\begin{aligned}
\mathrm{MSA} &\sim \bigl(\sigma^2_E + n\sigma^2_{AB} + nJ\sigma^2_A\bigr)\frac{\chi^2_{I-1}}{I-1} \\
\mathrm{MSB} &\sim \bigl(\sigma^2_E + n\sigma^2_{AB} + nI\sigma^2_B\bigr)\frac{\chi^2_{J-1}}{J-1} \\
\mathrm{MSAB} &\sim \bigl(\sigma^2_E + n\sigma^2_{AB}\bigr)\frac{\chi^2_{(I-1)(J-1)}}{(I-1)(J-1)}, \quad\text{and} \\
\mathrm{MSE} &\sim \sigma^2_E\,\frac{\chi^2_{IJ(n-1)}}{IJ(n-1)}.
\end{aligned}$$

If we use $F_A = \mathrm{MSA}/\mathrm{MSE}$, then we have a problem. It won't have the F distribution if $\sigma^2_A = 0$ but $\sigma^2_{AB} > 0$. What we must do instead is test it via $F_A = \mathrm{MSA}/\mathrm{MSAB}$. Similarly we test $\sigma^2_B = 0$ using $F_B = \mathrm{MSB}/\mathrm{MSAB}$ and we test $\sigma^2_{AB} = 0$ using $F_{AB} = \mathrm{MSAB}/\mathrm{MSE}$. Now suppose that A is a fixed effect while B is a random effect. We are in for a bit of a surprise. This is called a mixed effects model. Table 5.2 shows the expected values of the mean squares for this case. What we see is that we should test the fixed effect A via the ratio MSA/MSAB and the random effect B via the ratio MSB/MSE. If the other effect is fixed use MSE in the denominator while if the other effect is random use MSAB. It is the 'other effect rule' for external validity. To understand this result intuitively let's consider an extreme example. Suppose that the fixed effect A is about 3 headache medicines. We have 12 subjects in a random effect B. Each subject tests each medicine n times.


Let's exaggerate and suppose that n = 10^6. It naively looks like we have 36 million observations. But we only have 12 people in the data set. If we did let n → ∞ then we would have a 3 × 12 table of exact means. Our sample size would actually be 36. With a small sample size n our data have to be less useful than those 36 values would have been. For purposes of learning the effect of 3 medicines on a population we have something less informative than 36 observations and nothing like 36,000,000 observations worth of information. Exercise: work out the limit as n → ∞ of the test for A.
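
A simulation sketch can illustrate why the denominator matters when σ²_AB > 0. The code below simulates a balanced layout with A fixed and B random in which A truly has no effect, and compares rejection rates for MSA/MSE and MSA/MSAB; all the variance components are made-up values.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
I, J, n = 3, 12, 5
sigma_AB, sigma_B, sigma_E = 1.0, 1.0, 1.0   # A has no effect, so H0 for the fixed factor holds

def mean_squares(Y):
    # balanced two-way layout with n replicates per (i, j) cell
    gm   = Y.mean()
    Ai   = Y.mean(axis=(1, 2))
    Bj   = Y.mean(axis=(0, 2))
    ABij = Y.mean(axis=2)
    MSA  = n * J * np.sum((Ai - gm)**2) / (I - 1)
    MSAB = n * np.sum((ABij - Ai[:, None] - Bj[None, :] + gm)**2) / ((I - 1) * (J - 1))
    MSE  = np.sum((Y - ABij[:, :, None])**2) / (I * J * (n - 1))
    return MSA, MSAB, MSE

rej_wrong, rej_right, reps = 0, 0, 2000
for _ in range(reps):
    b  = rng.normal(0, sigma_B, J)
    ab = rng.normal(0, sigma_AB, (I, J))
    Y  = b[None, :, None] + ab[:, :, None] + rng.normal(0, sigma_E, (I, J, n))
    MSA, MSAB, MSE = mean_squares(Y)
    rej_wrong += MSA / MSE  > stats.f.ppf(0.95, I - 1, I * J * (n - 1))
    rej_right += MSA / MSAB > stats.f.ppf(0.95, I - 1, (I - 1) * (J - 1))

print(rej_wrong / reps, rej_right / reps)   # first is far above 0.05, second is near 0.05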


6 Two level factorials

Suppose that we have k factors to explore. If we go with 2 levels each that yields 2^k different treatment combinations, which is the smallest possible product of k integers larger than 1. A variable could originally be binary, such as a choice between two polymers, or the choice between doing or not doing a step in a process. Or the variable could have more levels or even take on a continuum of levels such as an oven temperature. For such a variable we could select two levels such as 1020°C and 1050°C. When we choose two levels for a continuous variable, some judgment must be exercised. If the effect of that variable on the response is monotone or even linear then we will have the greatest statistical power using widely separated values. The flip side is that we should avoid closely spaced values because then there will be low power and we might not get good relative accuracy on the effect. Then if we suspect a variable is not important we might use more widely separated values just to be able to check on that. If we know a good operating range then spanning that range makes sense for 'external validity' reasons. If we space things too far apart then our experimental analysis might give that variable too much importance compared to the others. The possibilities that we want to study might not be perfectly orthogonal. For instance if the experiment has an oven at two temperatures for two different time periods, then either the (LO, LO) combination or the (HI, HI) combination might be unsuitable. In that case one can use sliding levels such as 30 vs 50 minutes when the temperature is high and 45 vs 75 minutes when it is low. Wu and Hamada (2011) discuss these tradeoffs. If we suspect curvature then two values will not be enough. We will look at 'response surface' designs later that let us estimate curved responses. If we see each treatment combination n times then we need to gather n2^k data points.


Randomized blocks               Completely randomized
Source   df                     Source   df
TRTs     2^k − 1                TRTs     2^k − 1
DAYS     n − 1                  ERR      2^k(n − 1)
ERR      (n − 1)(2^k − 1)
TOT      n2^k − 1               TOT      n2^k − 1

Table 6.1: Left: Randomized blocks ANOVA table. Right: completely randomized design ANOVA table.

6.1 Replicates and more

Suppose that we get n data points from each of our 2^k treatment combinations. There are different ways that we could have done this and the way we did it can and should affect the analysis. Suppose that we are making soup. We have two different stores, we use two different vegetables (carrots vs cabbage), cook at two different temperatures, and add two amounts of salt. There is more to our soup than just these factors but those other aspects are not varying. We now have 2^4 = 16 procedures. The value Y is a measure of how good the soup tastes. We are going to get 16n data points for some n ≥ 1. One way to do this is to make all 16 kinds of soup on n different days. On each of those days we go to both stores, buy both kinds of vegetable, cook at both temperatures and use both salt levels. If we do it this way then we have 16 experimental treatments in n blocks, a randomized blocks design. Within each block (i.e., each day) we make the 16 soups in a random order. We could also simply spend one day making soup 16 ways and taste each pot n times. This is really very different. It is sometimes called a repeated measures setup. It is much faster, easier and cheaper than the blocked version but also much less informative. The order in which those tastes were made could be impactful, due to correlations between consecutive measurements. Maybe the soup is cooling off as we keep on tasting. We might just average those n values into one presumably better value. A third way to do this is to make soup once on each of 16n different days. Each of the 16 combinations appears n times in a randomized order. This is a completely randomized design and it does not impose the balance that the randomized blocks version would. Table 6.1 shows ANOVA tables for the randomized blocks and completely randomized 2^k factorial experiment. If we did just average the repeated measures then we would not have any degrees of freedom for error and so Table 6.1 does not have a corresponding table. Later on we will see ways to analyze such an unreplicated experiment.


 A   B   C   D     Y
 −   −   −   −    (1)
 +   −   −   −    a
 −   +   −   −    b
 +   +   −   −    ab
 −   −   +   −    c
 +   −   +   −    ac
 −   +   +   −    bc
 +   +   +   −    abc
 −   −   −   +    d
 +   −   −   +    ad
 −   +   −   +    bd
 +   +   −   +    abd
 −   −   +   +    cd
 +   −   +   +    acd
 −   +   +   +    bcd
 +   +   +   +    abcd

Table 6.2: Standard notation for 16 runs in a 2^4 experiment.

6.2 Notation for 2^k experiments

Let's consider factors A, B, C and so forth at two levels each. Because these factors have 2 levels, they have 2 − 1 = 1 degree of freedom. Furthermore the interaction between a factor with I levels and one with J levels has (I − 1)(J − 1) degrees of freedom, and so in this setting the two factor interactions also have 1 degree of freedom. Indeed, any interaction among any number of two level factors has just one degree of freedom. There is something special about an effect having just one degree of freedom. We can attach a sign to it, and consider factors that increase or decrease a response. That is, given levels $\mu_1$ and $\mu_2$ we can compare them via $\mu_2 - \mu_1$. We cannot easily and uniquely attach a sign to $(\mu_1 - \mu_2, \mu_1 - \mu_3, \mu_2 - \mu_3) \in \mathbb{R}^3$. We could reduce them to $\mu_1 - (\mu_2 + \mu_3)/2$ and $\mu_3 - \mu_2$ but that is just one of many potential comparisons to describe 2 degrees of freedom among three means. To exploit this signing possibility we use a special notation for the observations in a 2^k experiment. For each factor we decide that one of its levels will be called the high level and the other will be the low level. For a numerical variable we would ordinarily take the higher value to be the high level. For an attribute like presence or absence of something, presence makes sense as the high level. In other settings the high level could be the new treatment, or the more expensive one, or the one that we anticipate is more likely to increase E(Y). Table 6.2 shows all 16 possible combinations in a 2^4 experiment using + for high levels and − for low levels of factors A, B, C and D. The name of an observation is the string of letters at which it takes the high level. For instance 'abd' is the observation at high levels of A and B and D with the low level of C.


     Y              A           B           AB
 (1)    b         −   −       −   +       +   −
  a     ab        +   +       −   +       −   +

Table 6.3: The first square diagrams Y for a 2^2 experiment. The next ones show which observations are at high and low levels of A, B and AB.

When everything is at the low level we use the symbol '(1)' instead of a blank or null string. We will be using some multiplicative formulas in which (1) makes a natural multiplicative identity. The treatment effect for A is defined to be the average of E(Y) over the 2^{k−1} observations at the high level of A minus the average of E(Y) over the 2^{k−1} observations at the low level of A. It is called $\alpha_A$ and its estimate $\hat\alpha_A$ is the average of observed Y at the high level of A minus the average of observed Y at the low level of A. For instance with k = 3 and no replicates

$$\alpha_A = \frac{1}{4}\bigl(E(Y_a) + E(Y_{ab}) + E(Y_{ac}) + E(Y_{abc})\bigr) - \frac{1}{4}\bigl(E(Y_{(1)}) + E(Y_b) + E(Y_c) + E(Y_{bc})\bigr) \quad\text{and}$$

$$\hat\alpha_A = \frac{1}{4}\bigl(Y_a + Y_{ab} + Y_{ac} + Y_{abc}\bigr) - \frac{1}{4}\bigl(Y_{(1)} + Y_b + Y_c + Y_{bc}\bigr).$$

If there are replicates, we could call the observations $Y_{a,j}$ and $Y_{b,j}$ for $j = 1,\ldots,n$, et cetera. This is different from what we had before. With k = 1 we used to have $E(Y_{1j}) = \mu + \alpha_1$ and $E(Y_{2j}) = \mu + \alpha_2$ with $\alpha_1 + \alpha_2 = 0$. With that notation $E(Y_{2j}) - E(Y_{1j}) = \alpha_2 - \alpha_1 = 2\alpha_2$. Now we have $E(Y_{a,j}) - E(Y_{(1),j}) = \alpha_A$. That is, $\alpha_a = 2\alpha_2$. In a 2^1 experiment we write

$$Y_{a,j} = \mu + \tfrac{1}{2}\alpha_a + \varepsilon_{a,j} \quad\text{and}\quad Y_{(1),j} = \mu - \tfrac{1}{2}\alpha_a + \varepsilon_{(1),j}.$$

Because interactions have one degree of freedom, we can give them a sign too. The AB interaction is the effect of A at the high level of B minus the effect of A at the low level of B. We can write it as

(ab − b) − (a − (1)) = ab − a − b + (1).

If instead we look at the effect of B at the high level of A minus the effect of B at the low level of A we get

(ab − a) − (b − (1)) = ab − a − b + (1)

again. That is, the AB interaction is the same as the BA interaction. Observations ab and (1) are at the high level of the AB interaction while a and b are at the low level. Table 6.3 depicts the AB interaction as a diagonal comparison in a 2 × 2 square of data values.


For a 2^2 experiment we may write the expected observation values as follows:

$$\begin{aligned}
E(y_{ab}) &= \mu + \tfrac{1}{2}\bigl(\alpha_a + \alpha_b + \alpha_{ab}\bigr) \\
E(y_{a})  &= \mu + \tfrac{1}{2}\bigl(\alpha_a - \alpha_b - \alpha_{ab}\bigr) \\
E(y_{b})  &= \mu + \tfrac{1}{2}\bigl(-\alpha_a + \alpha_b - \alpha_{ab}\bigr), \quad\text{and} \\
E(y_{(1)}) &= \mu + \tfrac{1}{2}\bigl(-\alpha_a - \alpha_b + \alpha_{ab}\bigr).
\end{aligned}$$

The estimated AB interaction parameter $\hat\alpha_{ab}$ equals the average of the $2^{k-1}$ data values at the high level of the AB interaction minus the average of the $2^{k-1}$ data values at the low level of the AB interaction. Now we find that

$$\mathrm{var}(\hat\alpha_a) = \mathrm{var}(\hat\alpha_b) = \mathrm{var}(\hat\alpha_{ab}) = \mathrm{var}(\hat\alpha_{abc}) = \cdots = \mathrm{var}(\hat\alpha_{abc\cdots z}) = \frac{\sigma^2}{N/2} + \frac{\sigma^2}{N/2} = \frac{4\sigma^2}{N} = \frac{4\sigma^2}{n2^k} = \frac{\sigma^2}{n2^{k-2}}.$$

Every main effect and every interaction of whatever order are all estimated with the same precision. We can define interactions of all orders using this approach. The ABC interaction is the BC interaction at the high level of A minus the BC interaction at the low level of A:

$$(abc - ab - ac + a) - (bc - b - c + (1)) = \underbrace{(abc + a + b + c)}_{\text{HI level of ABC}} - \underbrace{(ab + ac + bc + (1))}_{\text{LO level of ABC}}.$$

Just like other main effects and interactions, half the data are at the high level of ABC and half are at the low level. Because each effect has one degree of freedom, we can use a t test for it instead of an F test. The t statistic for $H_0: \alpha_a = 0$ is

$$\frac{\hat\alpha_a}{s/\sqrt{2^{k-2}n}},$$

where s is the estimate of σ. The degrees of freedom are $(n-1)(2^k - 1)$ if the experiment was run in n blocks of 2^k runs (and the model should then include a block effect) and the degrees of freedom are $2^k(n-1)$ if it is a completely randomized allocation.
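
The sign-based effect estimates are easy to compute directly. The sketch below builds a 2^3 design in standard order, generates made-up responses, and computes every effect as the average at the high level minus the average at the low level.

import numpy as np
from itertools import product

# runs of a 2^3 experiment in standard (Yates) order: (1), a, b, ab, c, ac, bc, abc
levels = np.array(list(product([-1, 1], repeat=3)))[:, ::-1]   # columns A, B, C (A varies fastest)
A, B, C = levels.T

rng = np.random.default_rng(0)
y = 50 + 4 * A + 3 * B + 2 * A * B + rng.normal(0, 1, 8)   # made-up responses
# with this ±1 coding the true A effect (high minus low) is 2 * 4 = 8, plus noise

def effect(sign_column, y):
    # average of y at the high level minus average at the low level
    return y[sign_column == 1].mean() - y[sign_column == -1].mean()

for name, col in [("A", A), ("B", B), ("C", C),
                  ("AB", A * B), ("AC", A * C), ("BC", B * C), ("ABC", A * B * C)]:
    print(name, round(effect(col, y), 2))
print("grand mean", round(y.mean(), 2))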

6.3 Why n = 1 is popular

If we take n = 1 then there are no replicates and there are then 0 degrees of freedom for error. Were we to compute s^2 we would get 0/0 (not a number).


To see why n = 1 is popular suppose that we have a choice between a 2^4 experiment to study factors ABCD with 2 replicates, and a 2^5 experiment to study factors ABCDE without replication. Given the chance to explore one more factor without a variance estimate many people would do that. Experimental design has room for both cautious approaches and bold ones. In class we considered a famous example about an experiment to improve ball bearings. The engineers were planning n = 4 plant runs for two levels of one factor. A statistician persuaded them to use those same 8 plant runs to investigate 3 factors in a 2^3 layout. The additional factors proved to be much more effective than the originally contemplated one. The story is at this link https://en.wikipedia.org/wiki/Factorial_experiment after scrolling down a bit. What I like about this story is that it has the statistician playing the part of the bold person. As always, there is selection bias in the stories that get included in a course or text book or article.

There is a notion of factor sparsity that underlies this decision. The idea is that most of the α's are close to 0. Then the estimates that we get are mostly $\mathcal{N}(0, 4\sigma^2/N)$ plus a few real effects providing outliers. Then using QQ-plots of the estimated effects or their absolute values we can hope to spot the outliers. This approach was developed by Cuthbert Daniel (1959). Even if the α's are not exactly zero, it could happen that most of them are relatively small. They cannot all be relatively important. Suppose that the small ones can be modeled by effects that are $\mathcal{N}(\mu_*, \sigma_*^2)$. Then their estimates will look like $\mathcal{N}(\mu_*, \sigma_*^2 + 4\sigma^2/2^k)$. If the large effects have α values that are much greater than $|\mu_*| + \sqrt{\sigma_*^2 + 4\sigma^2/N}$ then they will still look like outliers. When this factor sparsity is in play, then it would be wasteful to take n > 1 replicates instead of inspecting additional factors. If you replicate and leave out the most important factor, that seemingly safe choice can be very costly. It is common to see that the important interactions in a QQ plot involve the important main effects. We might see $\hat\alpha_A$, $\hat\alpha_B$, $\hat\alpha_C$ and $\hat\alpha_{BC}$ as the outliers. In class we saw such a QQ plot in the fractional factorial class. Daniel's article closes with an interesting comment about why we just analyze one response at a time:

   The third set of questions plaguing the conscience of one applied statistician concerns the repeated use of univariate statistical methods instead of a single multivariate system. I know of no industrial product that has but one important property. And yet, mainly because of convenience, partly because of ignorance, and perhaps partly because of lack of full development of methods applicable to industrial research problems, I find myself using one-at-a-time methods on responses, even at the same time that I derogate a one-at-a-time approach to the factors influencing a response. To what extent does this simplification invalidate the judgments I make concerning the effects of all factors on all responses?

Factor sparsity supports a concept called design projection. Suppose that in a 2^3 experiment factor A is completely null: $\alpha_a = \alpha_{ab} = \alpha_{ac} = \alpha_{abc} = 0$.


Then our experiment is just like a 2^2 experiment in B and C with two replicates, depicted as follows:

   A   B   C          B   C          B   C
  −1  −1  −1         −1  −1         −1  −1
  −1  −1  +1         −1  +1         −1  −1
  −1  +1  −1         +1  −1         −1  +1
  −1  +1  +1   −→    +1  +1    =    −1  +1
  +1  −1  −1         −1  −1         +1  −1
  +1  −1  +1         −1  +1         +1  −1
  +1  +1  −1         +1  −1         +1  +1
  +1  +1  +1         +1  +1         +1  +1

In the rightmost array the runs are reordered to show the replication. The same thing happens if either B or C is null. In the multi-response setting that Daniel was concerned with, there might be a different null factor for each of the responses being measured. In practice it is unlikely that A is perfectly null but it might be relatively null, with some other |α|'s being orders of magnitude larger than the ones involving A. It could be tricky to decide whether A is null, and that would involve the issues raised in the next section on factorial experiments with no replicates.

6.4 Factorial with no replicates

If we had an I × J experiment run just n = 1 time, then our N = I × J × 1 data points could be used to get SSA, SSB and SSE. If the factors interact, then SSE will be inflated by that interaction. It will have a noncentral $\chi^2$ distribution (under our Gaussian assumption). Then F = MSA/MSE will be doubly noncentral. Noncentrality in SSE will make MSE larger, making F smaller, making for less significance (larger p-values). So our test would be conservative, having lower power. If the resulting p-value is below 0.01 or some other threshold then it likely would have been even further below it in the event of no interaction. In the 2^k experiment we could designate a priori some high order interactions to use in forming an error estimate. We could take a mean square of the corresponding $\hat\alpha$'s as our estimate of $4\sigma^2/N$. This would have the same conservative property as the doubly noncentral F described above. We could possibly just take the 10 smallest $\hat\alpha^2$'s but then we would have to adjust them for this selection. For instance, a Monte Carlo computation can find for us the distribution of $\sum_{j=1}^{10}(\hat\alpha^2)_{(j)}/(4\sigma^2/N)$, where $(\hat\alpha^2)_{(j)}$ denotes the j'th smallest squared estimate. There are lots of methods to get approximate inferences without replication. Hamada and Balakrishnan (1998) describe 24 different methods. One prominent one is due to Lenth (1989). In recent years there has been much interest in inference after selection and perhaps progress in that area could lead to some new methods here.
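
As one illustration, here is a sketch of Lenth's pseudo standard error in the form it is usually quoted (an initial estimate of 1.5 times the median absolute effect, then a re-median after dropping effects larger than 2.5 times that, with nominal degrees of freedom m/3); the effect values below are made up, and the details should be checked against Lenth (1989) before relying on this.

import numpy as np
from scipy import stats

def lenth_pse(effects):
    # effects: the 2^k - 1 estimated contrasts from an unreplicated 2^k design
    abs_eff = np.abs(np.asarray(effects, dtype=float))
    s0  = 1.5 * np.median(abs_eff)
    pse = 1.5 * np.median(abs_eff[abs_eff < 2.5 * s0])
    return pse

effects = np.array([7.9, 6.1, 0.3, -0.4, 3.8, 0.2, -0.1])   # made-up 2^3 effect estimates
pse = lenth_pse(effects)
d   = len(effects) / 3                                       # nominal degrees of freedom
margin = stats.t.ppf(0.975, d) * pse
print(pse, margin, np.abs(effects) > margin)                 # flags the apparently active effects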


  y     M   A   B   C   AB  AC  BC  ABC
 (1)    +   −   −   −   +   +   +   −
  a     +   +   −   −   −   −   +   +
  b     +   −   +   −   −   +   −   +
  ab    +   +   +   −   +   −   −   −
  c     +   −   −   +   +   −   −   +
  ac    +   +   −   +   −   +   −   −
  bc    +   −   +   +   −   −   +   −
  abc   +   +   +   +   +   +   +   +

Table 6.4: Each row is an observation in a 2^k = 2^3 experiment. Each column shows an effect of interest with M representing the grand mean. The estimate of an effect is the inner product of the corresponding column of ±1's with the vector of y values, divided by 2^{k−1}, except that for the grand mean (column M) we divide by 2^k.

6.5 Generalized interactions

For n = 1, the effects that we are interested in are dot products of the vector of Y values with some coefficients ±1, divided by 2^k for $\hat\mu$ or by 2^{k−1} for the α's. Table 6.4 shows those coefficients for k = 3. Using these signs we define some generalized interactions. The generalized interaction of AB and BC is what we get if we multiply the column of signs for AB times that for BC. We get AB × BC = AB²C = AC because B² is a column of +1s. Similarly AB × ABC = C. We can compare our 2^k effects to a plain regression model

$$Y = X\beta + \varepsilon \in \mathbb{R}^N$$

with parameter estimates $\hat\beta = (X^\mathsf{T}X)^{-1}X^\mathsf{T}Y$ and $\mathrm{var}(\hat\beta) = (X^\mathsf{T}X)^{-1}\sigma^2$. We could take X to be the matrix in Table 6.4 or a generalization of it to some other k. Now $X^\mathsf{T}X = 2^k I_{2^k}$, so $\hat\beta$ would have $\hat\mu$ in it and $\hat\alpha/2$ for the other effects. What we see here is that all of the $\hat\beta_j$ (and hence all of our effects) are uncorrelated because $\mathrm{var}(\hat\beta) = \sigma^2 I_{2^k}/2^k$. Ordinarily a regression with N observations and p parameters has a computational cost proportional to $Np^2$. Here $N = p = 2^k$ and the cost would be $O(N^3)$. In this instance, the orthogonality of X reduces the cost to just that of p one at a time regressions, actually dot products, and the cost is $O(N^2)$. Yates's algorithm (Yates, 1937) reduces the cost to $O(N\log(N))$. When N = 16 we are not much interested in the cost difference between $N^2$ and $N\log(N)$. If however $N = 2^{30}$, as it might be in an experiment carried out solely in software, then Yates' algorithm is very useful. His algorithm is a fast Fourier transform used for hand calculations decades before the fast Fourier transform became prominent in signal processing. We can get an algebraic expression to show which observations are at the high and low levels of an interaction. Consider the BCD interaction.


  b      ab         (1)    a
  c      ac          bc    abc
  d      ad          cd    acd
  bcd    abcd        bd    abd

Table 6.5: The 8 observations on the left are at the high level of BCD. The 8 observations on the right are at the low level of BCD.

We can write it as

$$\begin{aligned}
BCD &= \frac{1}{2^{k-1}}(a + 1)(b - 1)(c - 1)(d - 1) \\
    &= \frac{1}{2^{k-1}}(ab + b - a - 1)(cd - c - d + 1) \\
    &= \cdots \\
    &= \frac{1}{2^{k-1}}(b + c + d + bcd + ab + ac + ad + abcd) \\
    &\phantom{=} - \frac{1}{2^{k-1}}\bigl((1) + bc + cd + bd + a + abc + acd + abd\bigr).
\end{aligned}$$

The high and low levels of BCD are depicted in Table 6.5. What we see is that bcd is at the high level. Anything that is missing an odd number of those letters is at the low level. Anything missing an even number is at the high level (just like zero, which is even). The presence of another factor like A does not change it.

6.6 Blocking

We might want to (or have to) run our 2^k experiment in blocks. One common choice has blocks of size 2^{k−1}. So our experiment has 2 blocks of size 2^{k−1}. Blocking could be done for greater accuracy (balancing out some source of noise) or we might be forced into it: the equipment holds exactly 8 pieces, no more. The best plan for 2 blocks is to run one block at the high level of the k-factor interaction and one at the low level. For k = 3 it would be

   a  b  c  abc    vs    (1)  ab  ac  bc

with run order randomized within the blocks. With this choice the ABC interaction is confounded with the blocks. The effects for A, B, C, AB, BC and AC are orthogonal to blocks (because they're orthogonal to ABC). If we only have these 3 factors we might replicate the block structure as follows:

   abc   ab       abc   ab       abc   ab
   a     ac       a     ac       a     ac
   b     bc       b     bc       b     bc
   c     (1)      c     (1)      c     (1)


Source                     df
Replicates                  2
Blocks = ABC                1
Blocks × replicates         2
A, B, C, AB, AC, BC         1 each
Error                      12
Total                      23

Table 6.6: This is the ANOVA table for a blocked 2^3 experiment with n = 3 replicates each having 2 blocks.

Of course if there were additional factors to investigate we might introduce those instead of replicating our blocked 2^3 experiment. The ANOVA table for this setup is shown in Table 6.6. With 3 replicates, there are 2 degrees of freedom for replicates. One degree of freedom goes to comparing the block at the high level of ABC to the one at the low level. There are also 2 degrees of freedom for blocks by replicates and 1 degree of freedom for each of our main effects and two factor interactions. This table will let us study the main effects and two factor interactions. To study the 3 factor interaction we can test block level averages. We form the 2 × 3 matrix

                  r=1         r=2         r=3
   ABC = +     $\bar Y_{+1\bullet}$    $\bar Y_{+2\bullet}$    $\bar Y_{+3\bullet}$
   ABC = −     $\bar Y_{-1\bullet}$    $\bar Y_{-2\bullet}$    $\bar Y_{-3\bullet}$

and test the row effect. Wu and Hamada (2011) warn that there will be trouble if the blocks interact with treatments. That will also mess up other blocked designs. With many replicates, there is also the possibility of confounding ABC with blocks in some replicates and other interactions such as BC in different replicates. The analysis of such combinations could get pretty complicated. In class I mentioned that it might be wise to simply use a probabilistic programming language (e.g., Stan or others mentioned here https://en.wikipedia.org/wiki/Probabilistic_programming#Probabilistic_programming_languages) to analyze such a thing instead of wrangling mean squares and ANOVA tables. That would handle the analysis. To actually choose a design one could simulate data from it many times and pipe the simulated output through the probabilistic programming language.
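
For the simplest case, assigning runs to two blocks by the sign of ABC can be automated; the sketch below reproduces the k = 3 blocking above.

from itertools import product

runs = []
for a, b, c in product([0, 1], repeat=3):
    name = "".join(letter for letter, on in zip("abc", (a, b, c)) if on) or "(1)"
    sign_abc = (2 * a - 1) * (2 * b - 1) * (2 * c - 1)     # +1 or -1
    runs.append((name, sign_abc))

block_hi = [name for name, s in runs if s == +1]   # block confounded with the + level of ABC
block_lo = [name for name, s in runs if s == -1]
print(block_hi)   # ['a', 'b', 'c', 'abc'] in some order
print(block_lo)   # ['(1)', 'ab', 'ac', 'bc'] in some order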


7 Fractional factorials

Even at just two levels, studying k factors brings in 2^k treatment combinations. It is a curse of dimension. We cannot necessarily afford that. And maybe we don't have to. Fractional factorials look at k factors of 2 levels each using fewer than 2^k runs. We will study k factors using only N = 2^{k−p} runs for some integer 1 ≤ p < k. These are called fractional factorials. The name stems from the fact that they are fractional replicates of 2^k. It is as if we are doing n replicates for n = 1/2. There are many good references for this topic, such as Wu and Hamada (2011) and Box et al. (1978) and Montgomery (1997). The last two examples are not the most recent editions of those books. I prefer the cited editions. For instance later versions of the book by Montgomery seem less suitable for a graduate level class, though they are better suited for their target audience than the cited edition. We will study k binary inputs in 2^{k−p} runs. This necessarily involves aliasing/confounding relationships among the 2^k estimands of interest. When the phenomenon is dominated by main effects and low order interactions and the high order interactions are small, then aliasing the high order interactions with each other makes sense. We still get to learn about the important quantities at low cost. One problem with a 2^k factorial experiment is that as k grows larger, most of the degrees of freedom are used up in interactions of order about k/2. We have much more interest in learning about main effects and low order interactions. There's a sense that they're more likely to be large and they are also going to be simpler to interpret if we can estimate them. It can be puzzling to wonder why degrees of freedom are so important. The explanation is that each extra degree of freedom costs us one more data point. See Table 7.1 which illustrates the point with k = 7 and runs costing $1000 each.


Interaction order:        0     1     2     3     4     5     6     7
Degrees of freedom:       1     7    21    35    35    21     7     1
Cost:                   $1k   $7k  $21k  $35k  $35k  $21k   $7k   $1k

Table 7.1: For a hypothetical 2^7 experiment where each run costs $1000, the table shows how much of the budget goes to interactions of each size.

We end up spending $8000 on the grand mean and main effects along with $99,000 on interactions of order three and higher.
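
The budget split in Table 7.1 is just binomial coefficient arithmetic, as this quick check shows.

from math import comb

cost_per_run = 1000
df = {order: comb(7, order) for order in range(8)}        # degrees of freedom per interaction order
print(df)                                                  # {0: 1, 1: 7, 2: 21, 3: 35, 4: 35, 5: 21, 6: 7, 7: 1}
print(cost_per_run * (df[0] + df[1]))                      # grand mean + main effects: $8,000
print(cost_per_run * sum(df[o] for o in range(3, 8)))      # order three and higher: $99,000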

7.1 Half replicates

Let's do 2^{k−1} runs, i.e., half of the full 2^k experiment. We will get confounding. To make the confounding minimally damaging, we will confound "done" versus "not done" with the k-fold interaction. For k = 3 we could do the 4 runs at the high level of ABC:

           I   A   B   AB   C   AC   BC   ABC
   a       +   +   −   −    −   −    +    +
   b       +   −   +   −    −   +    −    +
   c       +   −   −   +    +   −    −    +
   abc     +   +   +   +    +   +    +    +

We see right away that the intercept column I is identical to the one for ABC. We say that the grand mean is aliased with ABC. It is also confounded with ABC. We then get

$$E(\hat\mu) = \mu + \alpha_{ABC}.$$

In our half replicate, I = ABC holds. If we multiply both sides by A, we get A × I = A × ABC. Of course A × I = A and A × ABC = A²BC = BC. It follows that A = BC, so the main effect for A is aliased with the BC interaction. Therefore

$$E(\hat\alpha_A) = \alpha_A + \alpha_{BC}.$$

By the same argument B = AC and C = AB. If we do a 2^{k−1} experiment running everything at the high level of the k-fold interaction then every effect will be aliased with one other, the one that complements its set of variables.


The set of all eight runs looks like this:

         I    A    B    AB   C    AC   BC   ABC
a        +    +    −    −    −    −    +    +
b        +    −    +    −    −    +    −    +
c        +    −    −    +    +    −    −    +
abc      +    +    +    +    +    +    +    +
bc       +    −    +    −    +    −    +    −
ac       +    +    −    −    +    +    −    −
ab       +    +    +    +    −    −    −    −
(1)      +    −    −    +    −    +    +    −

We could do either the top half, with ABC = I, or the bottom half with ABC = −I. That is, ABC = ±I gives a 2^{3−1} experiment. If we use the bottom half we get

E(α̂_A) = α_A − α_{BC}.

If any one of the k factors in our 2^{k−1} fractional factorial experiment is “null” then we get a full 2^{k−1} factorial experiment in the remaining k − 1 factors. Practically, null means “relatively null” in comparison to some larger effects. Geometrically the sampled points of a 2^{3−1} experiment lie on 4 corners of the cube {−1, 1}^3 (or if you prefer [0, 1]^3). If we draw in the edges on each face of the cube, no pair of sampled points share an edge. A 2^{k−1} design looks like one block of a blocked 2^k where the blocks are defined by high and low levels of the k-factor interaction. If we have just done the ABC···Z = I block we could analyze it as a 2^{k−1} design and perhaps decide that we have learned what we need and don't have to do the other block.

7.2 Catapult example

The lightning calculator Cat-100 catapult is described here: http://www.qualitytng.com/cat-100-catapult/. In class we looked at data from a 2^{5−1} experiment on it. The data are shown in Table 7.2. Distance is in centimeters. It appears that most of the energy from the surgical tubing goes into moving the wooden arm, so the projectiles remain safe for classroom use (and are not suitable for a siege weapon). Table 7.3 shows those same data in standard (Yates) order. That is (1), a, b, ab, et cetera for four of the five variables. The reason to do this is that the data were analyzed by Yates' algorithm, which is not really necessary on a problem this small. Yates' algorithm pairs off the data into n/2 consecutive pairs. It takes sums within those n/2 pairs and places them above differences. If we do that k times in a full factorial or k − 1 times in a fractional factorial we get a column with main effects and interaction estimates in the standard order (scaled up by n for


     Front   Back   Fixed   Moving   Bucket     Dist
 1      1     -1      -1       1        1      210.3
 2      1      1       1       1        1      343.0
 3      1     -1      -1      -1       -1       50.0
 4     -1     -1       1       1        1      263.5
 5     -1     -1      -1       1       -1      134.5
 6     -1     -1      -1      -1        1       94.5
 7     -1      1      -1       1        1      310.8
 8      1      1      -1      -1        1       94.8
 9     -1      1      -1      -1       -1       91.5
10      1      1      -1       1       -1      168.5
11     -1      1       1      -1        1      277.4
12      1      1       1      -1       -1      145.5
13      1     -1       1      -1        1      157.5
14     -1      1       1       1       -1      266.5
15     -1     -1       1      -1       -1      120.5
16      1     -1       1       1       -1      166.5

Table 7.2: Experimental output from a catapult experiment.

     Front   Back   Fixed   Moving   Bucket     Dist
 6     -1     -1      -1      -1        1       94.5
 3      1     -1      -1      -1       -1       50.0
 9     -1      1      -1      -1       -1       91.5
 8      1      1      -1      -1        1       94.8
15     -1     -1       1      -1       -1      120.5
13      1     -1       1      -1        1      157.5
11     -1      1       1      -1        1      277.4
12      1      1       1      -1       -1      145.5
 5     -1     -1      -1       1       -1      134.5
 1      1     -1      -1       1        1      210.3
 7     -1      1      -1       1        1      310.8
10      1      1      -1       1       -1      168.5
 4     -1     -1       1       1        1      263.5
16      1     -1       1       1       -1      166.5
14     -1      1       1       1       -1      266.5
 2      1      1       1       1        1      343.0

Table 7.3: Catapult data in Yates’ order.

the main effect and n/2 for the others). For k = 2 we get

y_(1)     y_a + y_(1)      y_ab + y_b + y_a + y_(1)     → 4µ̂
y_a       y_ab + y_b       y_ab − y_b + y_a − y_(1)     → 2α̂_A
y_b       y_a − y_(1)      y_ab + y_b − y_a − y_(1)     → 2α̂_B
y_ab      y_ab − y_b       y_ab − y_b − y_a + y_(1)     → 2α̂_AB

and for k = 30 it would only take 30 of these operations to compute the 2^30 effects of interest (most likely in a computer experiment). In a fractional factorial we need to account for the aliasing. Here it is (or at least part of it) for k = 3:

y_(1)     y_a + y_(1)      y_ab + y_b + y_a + y_(1)      y_abc + y_bc + y_ac + y_c + y_ab + y_b + y_a + y_(1)
y_a       y_ab + y_b       y_abc + y_bc + y_ac + y_c     y_abc − y_bc + y_ac − y_c + y_ab − y_b + y_a − y_(1)
y_b       y_ac + y_c       y_ab − y_b + y_a − y_(1)      et cetera
y_ab      y_abc + y_bc     y_abc − y_bc + y_ac − y_c
y_c       y_a − y_(1)      y_ab + y_b − y_a − y_(1)
y_ac      y_ab − y_b       y_abc + y_bc − y_ac − y_c
y_bc      y_ac − y_c       y_ab − y_b − y_a + y_(1)
y_abc     y_abc − y_bc     y_abc − y_bc − y_ac + y_c
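A minimal R sketch of one pass of this algorithm, in the spirit of the yate() calls that appear in the axis label of Figure 7.1, is below. It assumes the response vector is already in standard (Yates) order and has length 2^m.

  yate <- function(y) {
    pairs <- matrix(y, nrow = 2)      # consecutive pairs sit in the columns
    c(pairs[1, ] + pairs[2, ],        # sums go on top
      pairs[2, ] - pairs[1, ])        # differences go underneath
  }
  # For the catapult data in Table 7.3 order, four passes give 16 contrasts; dropping
  # the first (the grand total) and dividing by 8 gives the effects plotted in Figure 7.1:
  # yate(yate(yate(yate(dist))))[2:16] / 8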

The digression on Yates' algorithm demystifies the axis label in Figure 7.1. There we see that main effects for ‘back stop’, ‘fixed arm’, ‘moving arm’ and ‘bucket’ all have positive values clearly separated from the noise level. The effect for ‘front stop’ appears to be negative but is not clearly separated from the others. The reference line is based on a fit to the 11 smallest effects (F and the interactions). Effects are named after their lowest alias. In this instance there was a clear break between large effects and most small effects and some reasonable doubt as to whether F belongs with the large or the small ones. In many examples in the literature some main effects end up in the bulk of small effects and a handful of two factor interactions show large effects. Usually one or both of the interacting factors appears also as a large main effect. One reason to be interested in statistical insignificance is that when it happens we clearly do not know the sign of the effect, even if we're certain that an exact zero cannot be true. Statistical significance makes it more reasonable to confidently state the sign of the effect. There can however be doubt about the sign if the confidence interval for the effect has an edge too close to zero. This could happen in a low power setting. See Owen (2017). Here is a naive regression analysis of the data.

Coefficients:
                Value   Std. Error   t value
(Intercept)    180.96        7.441    24.318
Front          -13.94        7.441    -1.874
Back            31.29        7.441     4.205
Fixed           36.59        7.441     4.918


[Figure 7.1 about here: normal QQ plot of the 15 effect estimates, yate(yate(yate(yate(pult3o$Dist))))[2:16]/8, with points labeled F = Front, B = Back, X = Fixed, M = Moving, T = Bucket, and Normal Quantiles on the vertical axis.]

Figure 7.1: QQ plot of catapult experiment.


Moving          51.99        7.441     6.987
Bucket          38.02        7.441     5.109

Residual standard error: 29.77 on 10 degrees of freedom
Multiple R-Squared: 0.9233
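A fit like the one above could be produced along the following lines (the data frame name pult and its column names are placeholders for Table 7.2; the notes refer to an object called pult3o):

  fit <- lm(Dist ~ Front + Back + Fixed + Moving + Bucket, data = pult)
  summary(fit)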

It would be interesting to see how modern post-selection inference methods might work with factorial and fractional factorial models.

7.3 Quarter fractions

For a 2^{k−2} experiment we can set two combinations of effects equal to I. Or we could set one or both of them equal to −I. For k = 6, if we have set ABCDEF = I we might first consider (bad idea) also setting ABCDE = I. This is bad because then the product of ABCDEF and ABCDE becomes I² = I and this product is

ABCDEF × ABCDE = F.

We certainly don’t want to alias one of our main effects to the grand mean. A better choice is to set ABCD = I and CDEF = I. We then also get ABCD × CDEF = ABEF = I. Then we find that

A = A × ABCD = BCD,    A = A × CDEF = ACDEF,    and    A = A × ABEF = BEF,

and so we get

E(α̂_A) = α_A + α_{BCD} + α_{ACDEF} + α_{BEF},

and similarly

E(α̂_{AB}) = α_{AB} + α_{CD} + α_{ABCDEF} + α_{EF},    and
E(α̂_{ABC}) = α_{ABC} + α_D + α_{ABDEF} + α_{CEF}.

In a quarter replicate, any effect that we compute estimates a combination of four effects. One is the desired effect. The other three are aliases. The aliases above all have a positive sign. If we had set ABCD = −I then E(α̂_A) would include some aliased effects with negative signs. Here are some of the aliasing patterns in this design:

I  = ABCD = CDEF = ABEF
AB = CD = ABCDEF = EF
AC = BD = ADEF = BCEF
AD = BC = ACEF = BDEF
AE = BCDE = ACDF = BF
AF = BCDF = ACDE = BE
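These defining relations amount to generating two of the factors from the others. A small R sketch (my own, not from the notes) builds the 16 runs from a full factorial in A, B, C and E with D = ABC and F = CDE, and then checks a couple of the aliases numerically.

  d <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1), E = c(-1, 1))
  d$D <- with(d, A * B * C)      # so ABCD = I
  d$F <- with(d, C * D * E)      # so CDEF = I, and hence ABEF = I as well
  all(with(d, A * D == B * C))   # TRUE: AD is aliased with BC
  all(with(d, A * E == B * F))   # TRUE: AE is aliased with BF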


The cited books have tables of design choices for fractional factorial experiments. Those tables can run to several pages. Probably nobody actually memorizes or even uses them all but it is good to have those choices. One of the key quantities in those designs is the “resolution” that we discuss next.

7.4 Resolution

The resolution of a fractional factorial experiment is the number of letters in the shortest word in the defining relation. A defining relation is an expression like ABCD = ±I. In this case the word is ABCD (not I!) and it has length 4. The quarter fraction example from the previous section had defining relations ABCD = I, CDEF = I and ABEF = I, so the shortest word length is 4. In a 2^{k−1} fraction whose defining relation sets the k-factor interaction equal to ±I, the shortest word length is k.

When there are k factors of two levels each and we have 2^{k−p} runs in a fractional factorial of resolution R, then the design is labeled 2^{k−p}_R. Resolution is conventionally given in Roman numerals. The three most important ones are III, IV and V.

For resolution R = III, no main effects are aliased/confounded with each other. Some main effects are confounded with two factor interactions.

For resolution R = IV, no main effects are confounded with each other, no main effects are confounded with any two factor interactions, and some two factor interactions are confounded with each other.

For resolution R = V, no main effects are confounded with each other, no main effects are confounded with two factor interactions and no two factor interactions are confounded with each other. However, some main effects are confounded with four factor interactions, and some two factor interactions are confounded with three factor interactions.

Table 7.4 has an informal summary of what these resolutions require. Resolution V has the least confounding but requires the most expense. Resolution III has the least expense but could be misleading if we do not have enough factor sparsity. Resolution IV is a compromise but it requires careful thought. Some but not all of the two factor interactions may be aliased with others. We can check the tables or defining relations to see which they are. If we have good knowledge or guesses ahead of time we can keep the interactions most likely to be important unconfounded with each other. Similarly, after the experiment a better understanding of the underlying science would help us guess which interaction in a confounded pair might actually be most important.

For a specific pattern 2^{k−p}_R there can be multiple designs and they are not all equally good. For instance with R = IV we would prefer a design with the smallest number of aliased 2FIs (two factor interactions). If there's a tie we would break it in favor of the design with the fewest aliased 3FIs. Carrying on until the tie is broken we reach a minimum aberration design. An investigator would of course turn to tables of minimum aberration designs constructed by researchers who specialize in this.


For resolution R    you need to be    because you need
III                 lucky             few significant effects, all of low order
IV                  smart             to untangle confounded two factor interactions
V                   rich              the largest sample size

Table 7.4: Informal synopsis of what you need for the three most common resolutions.

For the 2^{k−1} experiment with the k-fold interaction aliased to ±I we find that R = k. So k = 3 gives 2^{3−1}_{III}, k = 4 gives 2^{4−1}_{IV} and k = 5 gives 2^{5−1}_{V}. There is a projection property for resolution R. If we select any R − 1 factors then we get all 2^{R−1} possible combinations of their levels the same number of times. That has to be 2^{k−p}/2^{R−1} times, that is 2^{k−p−R+1} times.

7.5 Overwriting notation

We need to figure out how to actually conduct our 2^{k−p} experiment. For a 2^{4−1} we can get a table of runs in factors A, B, and C. Then we replace/overwrite the ABC column by D, like this

         I    A    B    AB   C    AC   BC   D = ABC
a        +    +    −    −    −    −    +    +
b        +    −    +    −    −    +    −    +
c        +    −    −    +    +    −    −    +
abc      +    +    +    +    +    +    +    +
bc       +    −    +    −    +    −    +    −
ac       +    +    −    −    +    +    −    −
ab       +    +    +    +    −    −    −    −
(1)      +    −    −    +    −    +    +    −

Now AD = BC as before (check that for yourself). The more recent edition of Box, Hunter and Hunter (Box et al., 2005) describes a nodal design. It has n = 16 runs. You could analyze 15 effects: A, B, C, D, and interactions up to ABCD. For 2^{5−1}_{V} they alias a fifth factor to ABCD. For 2^{8−4}_{IV} they alias four more factors as follows: L = ABC, M = ABD, N = ACD and O = BCD. For 2^{15−11}_{III} they alias further factors as follows: E = AB, F = AC, G = AD, H = BC, J = BD and K = CD. They skip the letter I because it can mean ‘intercept’.

7.6 Saturated designs

Using 2^{15−11}_{III} we can estimate a grand mean and 15 main effects, but no interactions, and we get no degrees of freedom for error. Or if we like we could study only 15 − r effects and get r degrees of freedom for error.


There is a special setting where we can dispense with any concern over interactions. The classic example is when we are weighing objects. The weight of a pair of objects is simply the sum of their weights, with interactions being zero. Experimental designs tuned to a setting where interactions are known to be impossible are called weighing designs. A special kind of weighing design is the Plackett-Burman design. These exist when n = 4m for (most) values of m, and so n does not have to be a power of 2. We will see them later as Hadamard designs.

7.7 Followup fractions

Suppose we have done a 1/4 fraction. Then we can follow up with a second 1/4 in more than one different way. We would ordinarily want to treat that second fraction as a block. Suppose that factor A looks really important in the first 1/4 that we do. Maybe the aliasing pattern is

E(α̂_A) = α_A + α_{BC} ± · · · .

In the second part of our experiment, we could flip the signs of the A assignments but leave everything else unchanged. In that second half we will have

E(α̂_A) = α_A − α_{BC} ∓ · · · .

If we pool the two parts of our experiment, then

α̂_A = (1/2)(first expt α̂_A + second expt α̂_A),   so that
E(α̂_A) = (1/2)[(α_A + α_{BC} + · · ·) + (α_A − α_{BC} − · · ·)] = α_A.

We get rid of any aliasing for A. Of course if B had looked important after the first 1/4 we could have flipped B. We would not have to know which one is important before doing the first 1/4 of the factorial. A second kind of followup is the foldover. We flip the signs of all the factors. (Note that we cannot and don't attempt to flip the intercept.) Foldovers deconfound a main effect from any 2FI that it was confounded with. To see this, suppose that A = BC in the first experiment. Flipping signs of everything makes (−A) = (−B)(−C). That means −A = BC or A = −BC. Then

E(α̂_A) = (1/2)[(α_A + α_{BC} + · · ·) + (α_A − α_{BC} + · · ·)] = α_A + · · · .

If however A = BCD then flipping the signs makes (−A) = (−B)(−C)(−D) = −BCD so A still equals BCD.
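Here is a small numerical check of the foldover idea (again my own sketch, not from the notes): take the 2^{4−1} design with D = ABC, append its foldover, and verify that A becomes orthogonal to BC while remaining aliased with BCD.

  d <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))
  d$D <- with(d, A * B * C)
  combined <- rbind(d, -d)                       # original fraction plus its foldover
  sum(combined$A * with(combined, B * C))        # 0: A de-aliased from BC
  all(combined$A == with(combined, B * C * D))   # TRUE: A still aliased with BCD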


[Figure 7.2 about here: Y plotted against jittered A.]

Figure 7.2: Exaggerated plot when A is a strong main effect and there is a second strong main effect.

7.8 More data analysis

Figure 7.2 shows what we might see in a plot of Y versus an important factor A. The bimodal pattern is what we would see if there were a second important factor. If A were a relatively very weak (almost null) factor then we would see almost the same point pattern at each side of the plot. Figure 7.3 shows what we might see if A has a strong interaction with one or more other factors. The variance of Y depends strongly on A. A graphical way to look for such patterns is to compute F_A = log(var(y | A = 1)/var(y | A = −1)) and similar quantities for other main effects and interactions and produce a QQ plot of them. The outliers could be factors that are involved in many interactions. This could be fooled: it would be possible to have AB place a lot of variance at the low level of A while AC places a lot at the high level of A, making F_A close to zero.
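The dispersion check can be sketched in R as follows (main effects only, using the hypothetical pult data frame from before):

  disp <- sapply(c("Front", "Back", "Fixed", "Moving", "Bucket"), function(v)
    log(var(pult$Dist[pult[[v]] == 1]) / var(pult$Dist[pult[[v]] == -1])))
  qqnorm(disp)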


[Figure 7.3 about here: Y plotted against jittered A.]

Figure 7.3: Exaggerated plot when A is a strong main effect and it has a strong interaction with one or more other factors.

8

Analysis of covariance and crossover designs

What if we have a categorical treatment variable and some continuous ones? It is easy to fold them all into one regression. The combination is known as the analysis of covariance (ANCOVA). The name derives from some specialized shortcut algorithms to fit this model. Those were very useful before computers became routinely used. Now it would take a staggeringly large data set for those algorithms to bring a meaningful savings. Cutting 10^{−6} seconds in half is not interesting. What is more interesting and useful is the statistical thinking involved in using this method, especially in regard to measurements taken both before and after a treatment is applied. In this chapter we also consider a related way to compare before and after measurements. It is known as the cross-over design and it applies two or more treatments to the same subject, one treatment at a time.

8.1 Combining an ANOVA with continuous predictors

Suppose that we have Y_ij for treatments i = 1, . . . , k and replicates j = 1, . . . , n. Now suppose that we additionally have continuously distributed covariates x_ij ∈ R^p. We can combine the ANOVA model with the continuous predictors by writing

Y_ij = µ + α_i + x_ij^T β + ε_ij.

It seems better to center the xij and fit

Y_ij = µ + α_i + (x_ij − x̄_••)^T β + ε_ij

where

x̄_•• = (1/(nk)) Σ_{i=1}^{k} Σ_{j=1}^{n} x_ij.

This way µ = E(ȳ_••). We also still impose Σ_i α_i = 0. We can similarly add the (x_ij − x̄_••)^T β term to any of our other models: blocked designs, Latin squares, factorials and fractional factorials. For instance, in a randomized block experiment we would fit

Y_ij = µ + α_i + b_j + (x_ij − x̄_••)^T β + ε_ij

where b_j are block effects.
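In R, an ANCOVA like this is an ordinary linear model with a treatment factor and a centered covariate. A minimal sketch (data frame and variable names are hypothetical):

  dat$xc <- dat$x - mean(dat$x)           # centered covariate
  fit <- lm(Y ~ trt + xc, data = dat)     # trt is a factor; add block terms as needed
  summary(fit)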

8.2 Before and after

An extremely important special case arises when x_ij ∈ R is the pre-treatment version of the same quantity we measure as Y_ij after the treatment. Even with a random assignment of treatments to experimental units there can still be a meaningfully large amount of variance in x_ij between treatment groups i = 1, . . . , k. The randomization still means that we get valid and reliable variance estimates, confidence intervals for the treatment effect and p-values for H_0 : E(Y | A) = E(Y | B) for treatments A and B. We can often improve our analysis, getting reliable inferences with smaller variance, by taking account of the x_ij.

Yij = α + µi + εij ¯ ¯ then var(Y1• − Y2•) under random assignment would be very large. In a setting like this we might opt to analyze differences Dij = Yij − xij and then test H0 : E(D | A) = E(D | B). This can be a great improvement over comparing unadjusted post-treatment means. In class we saw a dental example from Fleiss (1986). There xij and Yij were measurements of gingivitis (gum inflammation) prior to and following upon one of two treatments. Fleiss analyzes Dij = xij −Yij, the opposite of the difference described above. With this choice positive values indicate better outcomes, since gingivitis is undesirable. The post treatment data are as follows: Trt n Mean s.dev. 1 74 0.5514 0.3054 2 64 0.3927 0.1988


Fleiss gives four significant figures for most numerical values in this example. These data lead to an estimated benefit for treatment 2 of 0.1587. A t-test (pooling the variance estimates) gives t = 3.56 which is significant enough. [Exercise: figure out the degrees of freedom and the p-value.] Prior to the treatment, however, the subjects started off as follows:

Trt     n     Mean     s.dev.
 1     74    0.6065    0.2541
 2     64    0.5578    0.2293

So group 2 started out with an advantage of 0.0487. That is roughly one third of the apparent post-treatment gain. We could reasonably want to get a sharper estimate of the treatment benefit. In some problems it might be enough to simply be confident about which of two treatments is best. For other purposes it is important to estimate how much better that treatment is. The table of treatment differences x_ij − Y_ij is

Trt     n     Mean     s.dev.
 1     74    0.0551    0.2192
 2     64    0.1651    0.2235

The estimated benefit from treatment 2 is now 0.1100 with t = 2.91 (still significant). Fleiss articulates two goals. One is to properly account for pre-treatment differences. Another is to reduce the variance of the treatment effect estimate. Differences do not always achieve the second goal. To see why, let ρ = corr(x_ij, Y_ij), σ_y² = var(Y_ij) and σ_x² = var(x_ij). Then var(x_ij − Y_ij) = σ_x² + σ_y² − 2ρσ_xσ_y. Now var(x_ij − Y_ij) < var(Y_ij) if and only if

ρ > (1/2)(σ_x/σ_y).

If σ_x ≈ σ_y, then we need ρ ≳ 1/2 for differencing to improve accuracy. In the hypothetical weight loss example it seems quite plausible that ρ would be large enough to make differences more accurate than using post-treatment weight. Now let's look at a regression model

Y_ij = µ_i + β(x_ij − x̄_••) + ε_ij,    i = 1, 2,   j = 1, . . . , n_i.

If we take β = 0, we get the post-treatment analysis. If we take β = 1, we get an analysis of differences (in this case future minus past). There is a value of β that would minimize the variance of Y_ij − β(x_ij − x̄_••). If we knew that β we could test H_0 : E(Y − βx | A) = E(Y − βx | B) which is the same as H_0 : E(Y − β(x − x̄) | A) = E(Y − β(x − x̄) | B) so centering does not affect the null hypothesis we are testing. It is also the same as

H0 : E(Y − µ − β(x − x¯)|A) = E(Y − µ − β(x − x¯)|B).


In this version we can think of µ + β(x − x̄) as an estimate of where the subject would be post-treatment and then the experiment is comparing the average amount that Y exceeds this baseline, between levels A and B of the treatment. We don't know β, but we can estimate it by least squares, and that's what ANCOVA does. For the gingivitis data, the estimated treatment difference from ANCOVA was 0.1263 and the t statistic was 3.57 (vs 3.56 for the original data and 2.91 for differences). The standard error of the estimate was 0.0354 vs 0.0446 (for Y) or 0.0374 (for x − Y). In this instance the precision was improved by regression on the prior measurement.

8.3 More general regression

The regression does not have to be on a prior version of Y. Montgomery (1997) considers a setting where Y_ij is the strength of a cable and x_ij is the diameter of the cables prior to treatment. Given multiple variables x_ij ∈ R^p, measured prior to treatment, we could put them all into the ANCOVA. If p becomes large then it is possible that ANCOVA would increase the variance of the adjusted responses Y_ij − (x_ij − x̄_••)^T β̂ beyond the variance of Y_ij and give a less precise comparison. In some cases, even the reduced degrees of freedom could widen the confidence interval on the treatment difference. Deciding the number and identity of predictors to include is a common tradeoff in statistics. We can think of ANCOVA as adjusting for pre-treatment differences in much the same way that blocking does. The difference is that blocking is used for categorical prior variables while ANCOVA can handle continuous ones. Suppose that Y_ij = µ_i + g(x_ij) + ε_ij where g(·) is not a linear function. In his causal inference class notes, Wager shows that regressing on prior variables will be asymptotically better than not doing it even if the linear model used is not right. We would like to compare E(Y_ij − g(x_ij) | A) and E(Y_ij − g(x_ij) | B). We get to compare E(Y_ij − g̃(x_ij) | A) and E(Y_ij − g̃(x_ij) | B) for some imperfect function g̃(·), such as a linear model approximation to g(·). One way to see this is to consider just doing a plain ANOVA without the regression on x_ij. That will have validity from the randomization while it also corresponds to taking β = 0, or g̃(·) = 0, which will ordinarily not be the best regression model to have used.

8.4 Post treatment variables

Suppose that we measure xij prior to treatment, then apply the treatment and then measure a response Yij and additional post-treatment variables zij. We might contemplate the model

Y_ij = µ_i + (x_ij − x̄_••)^T β + (z_ij − z̄_••)^T γ + ε_ij    (bad idea!).


This model is a very bad choice if our goal is to test whether the treatment has a causal impact on Y. Suppose that the treatment has a causal impact on Z which then has a causal impact on Y. We might then find that Z explains Y so well that the treatment variable appears to be insignificant. For instance, the causal implications in an agriculture setting might be that a pesticide treatment has a causal impact on the quantity Z of insects in an orchard, and that in turn has a causal impact on the size Y of the apple harvest. Including Z in the regression could change the magnitude of the treatment differences and lead us to conclude that the treatment does not causally affect the size of the apple harvest. This could be exactly wrong if the treatment really does increase the apple harvest because of its impact on the pests. With data like this we could do two ANCOVAs. One relates Y to the treatment and any pre-treatment predictors. Another relates Z to the treatment and any pre-treatment variables. We could then learn whether the treatment affects Z and that might be a plausible mechanism for how it affects Y. If there is strong predictive value in x_ij^T β then we get a more precise comparison of treatment groups from the ANCOVA than we would from an ANOVA that ignored x_ij.

8.5 Crossover trials

Consider two painkillers A and B. In a regular trial, n people get A and n people get B by a random assignment. Then later we compare the two groups of people. In a cross-over trial, each of the people would get both A and B and we would then compare their experiences with the two medications. The potential advantage is that now the different people are analogous to blocks. If there are strong person to person variations in the response we can balance them out. It is like the running shoe example from before, except that while kids can wear two kinds of running shoes at once, we cannot have people take two medicines at once because we would not know how to separate the effects that they had on the people. For the shoes, the response was the effect that the person had on two separate shoes and that's different. In the cross-over trial n people get A for a period (say one week), then they wait some time (perhaps also a week) for the effects to wash out, and then they get B for the same length of time that they got A. Another n people go through the trial in the opposite order, taking B first, then waiting for it to wash out and then taking A. We can sketch the layout like this:

                    Period 1    Period 2
Subject Group 1        A           B
Subject Group 2        B           A

The complication in cross-over designs is that the treatment in period one might affect the measurement in period two. If we are confident that the washout

period is long enough then we might bet on a cross-over design. Clearly if the first period treatment can bring a permanent cure (or irreversible damage) then the second period is affected and a cross-over is problematic. Cross-over designs are well suited for chronic issues that require continual medication. Here is a sketch from Brown (1980) of how cross-over data looks.

                         Group 1                                     Group 2
Period    Trt    Subjects S_{11}, . . . , S_{1n_1}     Trt    Subjects S_{21}, . . . , S_{2n_2}
   1       A     Y_{111}, . . . , Y_{1n_1 1}            B     Y_{211}, . . . , Y_{2n_2 1}
   2       B     Y_{112}, . . . , Y_{1n_1 2}            A     Y_{212}, . . . , Y_{2n_2 2}

There are two groups of subjects, two treatments and two periods. For instance, in period 1, group 1 gets treatment A and yields observations Y111, ··· ,Y1n11. Here is his regression model after the carry-over term has been removed.

Y_ijk = µ + π_k + τ_{u(i,k)} + η_ij + ε_ijk

i = group, 1 or 2
j = subject, 1 to n_i
k = period, 1 or 2
u = u(i, k) = treatment, A or B, with u = A iff i = k
τ_u = treatment effect
η_ij = subject effect.

Notice the subscript on the treatment effect τ. The treatment level is u(i, k) which is A if it is group i = 1 in period k = 1 or group i = 2 in period k = 2. If i ≠ k then the treatment is B. As in ANCOVA, we will consider some differences. Let D_j be the period 1 value minus the period 2 value for subject j. Then

E(D_j) = π_1 − π_2 + τ_1 − τ_2    for group i = 1, and
E(D_j) = π_1 − π_2 + τ_2 − τ_1    for group i = 2,

where π_1 − π_2 is the period effect and ±(τ_1 − τ_2) is the treatment effect, and so

E(D_j | i = 1) − E(D_j | i = 2) = 2(τ_1 − τ_2).

That is, the group differences of D_j inform us about the treatment difference τ_1 − τ_2. We can use this to test for treatment differences via a t-statistic

t = [(D̄_gp1 − D̄_gp2)/s_D] √(n_1 n_2/(n_1 + n_2)),    where
s_D² = [(n_1 − 1)s²_{D,gp1} + (n_2 − 1)s²_{D,gp2}] / (n_1 + n_2 − 2).

Or we could do an unpooled t-test. Similarly E(Dj |i = 1)+E(Dj |i = 2) = 2(π1 − π2).


Therefore we can test for period differences by summing the group averages of D_j. From the table that Brown gives we can see that the A/B treatment effect is exactly the same as an interaction between periods and groups. A crossover would be a bad choice if we thought such an interaction likely. Now suppose that we want to test for a carryover effect. We could add a predictor variable to the regression model. We would construct a predictor variable that takes the value 0 for all period one data, takes the value +1 in period 2 if the period 1 treatment was A and takes the value −1 in period 2 if the period 1 treatment was B. This predictor times its coefficient does nothing in period 1, as appropriate, because there was no prior effect that could carry over to period 1. In period 2 it models twice the difference at period 2 from having had treatment A versus B in period 1. Some authors advise against crossovers if carryover is at all possible. Some revert to the period 1 data after testing for carryover and finding that it is present. There is some controversy over this method. The test for carryover may not be reliable enough to use. See Brown (1980). There is a numerical example, also from dentistry, in Brown's paper at https://www.jstor.org/stable/2530496.
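A minimal sketch of this analysis in R, assuming a hypothetical data frame co with one row per subject and columns group (1 or 2), y1 and y2 for the period 1 and period 2 responses:

  co$D <- co$y1 - co$y2                            # period 1 minus period 2
  t.test(D ~ group, data = co, var.equal = TRUE)   # group difference estimates 2(τ1 − τ2)

The carryover predictor described above would be built in the long layout (one row per subject and period), taking the value 0 in period 1 and ±1 in period 2 according to the period 1 treatment.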

8.6 More general crossover

Suppose that we have 3 treatments. We could arrange them in a 3 period cross-over as follows.

      Per1   Per2   Per3
       A      B      C
       A      C      B
       B      A      C
       B      C      A
       C      B      A
       C      A      B

This uses 6 groups. Cox (1958) gives an example using 3 groups

P er1 P er2 P er3  BAC   CBA  ACB This is a Latin square. Each treatment appears once in each of the three pe- riods. It misses one kind of balance that we might want. If we thought that carryover were possible then we might want every treatment to have a balanced set of immediate predecessors. For instance each of AB, AC, BC, BA, CA, CB should appear the same number of times in consecutive time periods. The Latin square above has BA twice but never has BC. The top design has all 6 possible permutations once and so by symmetry it is balanced. For instance A is first twice, follow B twice and follows C twice. Letters B and C are similarly balanced.


If we were interested in carryover effects and wanted to measure them well, we could add more groups. With two treatments, we could have four groups: AB, BA, AA and BB. Or we could have groups like: ABB, ABA, BAB, and BAA. These and similar ideas are taken up at length in the book Jones and Kenward (2014).

9

Split-plot and nested designs

In this chapter we look at split-plot designs. The terminology comes from agriculture. One treatment factor might be applied to large plots of land. A second treatment factor is then applied to smaller areas, nested within the plots. The original plots are then split up for the second treatment. The more general term is split-unit design because the experimental unit doesn't have to be a plot of land. For instance in steel production one factor might be applied to 350 tons of steel while it is being produced and a second factor might be applied to ingots weighing 10 kilograms from one production run (called a “heat”). There is a clear economic advantage to split-plot experiments. For instance, it would be extremely expensive to increase the number of heats of steel being made and quite inexpensive to look at a large number of ingots from each heat. Cox (1958, Chapter 7.4) mentions another issue. We might have a limit on the natural size of blocks that stops us from putting all AB combinations into one block. Split-plot designs can fit into the block size more easily. We will also look at two closely related topics. These are nested ANOVAs and cluster randomized trials.

9.1 Split-plot experiments

In an agricultural setting, one might drive a tractor for one kilometer while applying factor A (e.g., fertilizer) to plots of land. Then factor B (e.g., seed type) could be applied to smaller units within that kilometer, called sub-plots. We do this when factor A is expensive to change in time or money, while factor B is inexpensive to change. The plots for factor A serve almost exactly like blocks for factor B. They're somewhat different from the usual blocks because

we have purposely made them differ and want to study them in their own right. When we do computer experiments or Monte Carlo simulations, it often makes sense to analyze them as designed experiments. If one factor A is set at the beginning of an hour long computation and then a second factor B can be set when there are just two minutes left in the computation, then a split-plot design makes sense. We choose A, compute for 58 minutes, save the internal state of the computation, and then vary factor B several times. A split-plot design will ordinarily give us better comparisons for the inner factor B because it is blocked by the outer factor and the outer factor is not blocked. This is perhaps not surprising. If it is so much cheaper to vary factor B then it is expected that we can study it with more precision. To begin with, we will suppose that both A and B are fixed effects. We consider random effects and mixed effects later. Let's vary factor A at I ≥ 2 levels. We will have n plots at each of those levels for a total of n × I plots. We depict them as follows

     A1                A3                        A2
  B2 B3 B1          B1 B3 B2       · · ·      B1 B2 B3
   plot 1            plot 2                  plot n × I

Here, factor A varies at the level of whole plots. Each level i appears n times. Factor B varies at the level of sub-plots: j = 1, . . . , J, with J = 3 in the diagram. Each level j appears nI times. The n appearances of each level of A could be from n replicates or from a completely randomized design where I treatments were applied n times each in random order to n × I plots. When replicates are used it is usual to include an additive shift for them in the model. We can compare levels of B using ‘within-plot’ differences such as Y_ijk − Y_ij'k for levels j ≠ j' of factor B. We can compare levels of A using ‘between-plot’ differences such as Ȳ_i•k − Ȳ_i'•k. We expect between-plot differences to be less informative when the plots vary a lot. The AB interaction is estimated with ‘within-plot’ differences. More precisely, it uses between plot differences of within plot differences that are also within plot differences of between plot differences:

Ȳ_ij• − Ȳ_i•• − Ȳ_•j• + Ȳ_••• = (Ȳ_ij• − Ȳ_i••) − (Ȳ_•j• − Ȳ_•••),   where
Ȳ_ij• − Ȳ_i•• = (1/J) Σ_{j'=1}^{J} (Ȳ_ij• − Ȳ_ij'•)      is within, and
Ȳ_•j• − Ȳ_••• = (1/I) Σ_{i=1}^{I} (Ȳ_ij• − Ȳ_i••)        is also within.

We see from the last line that the interaction is an average of within-plot differences and so it gets within-plot accuracy.


The ANOVA for factor A is based on Ȳ_i•k for i = 1, . . . , I and k = 1, . . . , n. If the plots are in a randomized block design then our analysis uses those I × n numbers in a table like the following:

Source               df
Replications         n − 1
A                    I − 1
Whole plot error     (I − 1)(n − 1)
Total                In − 1

If the plots are not in n blocks of I units then we use

Source               df
A                    I − 1
Whole plot error     I(n − 1)
Total                In − 1

Let's prefer the blocked analysis. Then the sub-plot ANOVA table is

Source               df
B                    J − 1
AB                   (I − 1)(J − 1)
“Sub-plot error”     I(J − 1)(n − 1)    by subtraction
Total                I(J − 1)n          by subtraction

The subtraction to get the subplot degrees of freedom is

(IJn − 1) − (In − 1) − (J − 1) − (I − 1)(J − 1) = I(J − 1)(n − 1),

where the (In − 1) term is the whole plot degrees of freedom.

It is the same df as for I replicates of a J × n experiment. The subtraction to get the total degrees of freedom is

(IJn − 1) − (In − 1) = I(J − 1)n.

We still have to figure out the sums of squares that go in these tables. We could do a full I × J × n ANOVA with replicates k = 1, . . . , n treated as a third factor C crossed with factors A and B. Then the subplot error sum of squares is SS_ABC + SS_BC. The whole plot error sum of squares is SS_AC. The replicates sum of squares in the whole plot analysis is SS_C. The sums of squares for A, B and AB are, unsurprisingly, SS_A, SS_B and SS_AB. There is a different analysis in Montgomery (1997, Chapter 12-4). In our notation, his model is

Y_ijk = µ + α_i + β_j + (αβ)_ij + c_k + (αc)_ik + (βc)_jk + (αβc)_ijk + ε_ijk,

where c_k is the effect of block k = 1, . . . , n. That analysis treats blocks as random effects and also allows them to have interactions with the fixed effects. Most other authors choose models in which blocks simply do not interact with


other effects, for better or worse. There would be no way to estimate E(ε²_ijk) separately from SS_ABC in that model, without another level of replication. We will stay with the analysis based on the two tables described above. However, if you are in a setting where you suspect that there could be meaningful interactions between blocks and treatments, then Montgomery's approach provides a way forward. Yandell (1997, Chapter 23) is a whole chapter on split-plot designs mostly motivated by agriculture.
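One common way to get the two-table analysis in R is through an error stratum for the whole plots. A minimal sketch with hypothetical variable names (block, A and B are factors in the data frame sp):

  fit <- aov(y ~ A * B + Error(block/A), data = sp)
  summary(fit)   # one stratum for whole plots (A tested there) and one for sub-plots (B and A:B)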

9.2 More about split-plots

Jones and Nachtsheim (2009) focus on split-plot designs in industry. They consider unsuspected split-plots and quote Cuthbert Daniel as saying that most or all industrial experiments are split-plots. They raise an issue of unrecognized split-plot experiments. Suppose that an experiment varies oven temperature A, position in the oven B and recipe C. We think of an A × B × C experiment. Suppose that temperature is at three levels and is done on a random order schedule

350  400  375  375  350  400  · · ·

The question that arises is whether after run three they turned the oven off and back on again or just kept it going at 375 degrees. If the other temperature changes involved resetting the oven temperature but this one didn't, then the experiment may really be partly of split plot type, with the two consecutive runs at 375 degrees being a double-size whole plot. It is also possible that the operators might undo the randomization into:

350  350  400  400  375  375  · · ·

to save time and expense, which would be a genuine split-plot. Jones and Nachtsheim (2009) advocate taking care to make sure that the analysis matches how the experiment was done. In a split-split-plot experiment, the sub-plots are split further into sub-sub-plots for a third treatment factor C. The analysis involves three tables, one at the plot level, one at the sub-plot level and a new third one at the sub-sub-plot level. In a strip-plot experiment, there are two whole plot factors that when crossed define the whole plots. The name comes from agriculture. A tractor might drive North to South placing 8 different kinds of fertilizer on the ground. Later that year a crop-dusting plane might fly East to West trajectories over the farm land, spraying 12 different pesticide treatments. That crossed structure generates 96 different whole plots. Each of those whole plots can then be divided into 4 subplots for four kinds of broccoli. Yandell (1997, Chapter 24.2) discusses strip-plot experiments.


9.3 Nested ANOVAs

Sometimes the levels of B are only defined and make sense with respect to a specific setting of factor A. Kirk (2013) describes an experiment where rats were exposed to ionized air. There were four animals per cage. There were eight cages. Each cage and the animals in it got exposed to either positive or negative ionization. They then studied a measure of the animals' activity level. We can sketch the setup as follows:

Cage:     1      2      3      4      5      6      7      8
+ve:    ••••   ••••   ••••   ••••
−ve:                                 ••••   ••••   ••••   ••••

We could also depict it this way:

Cage:     1      2      3      4
+ve:    ••••   ••••   ••••   ••••
−ve:    ••••   ••••   ••••   ••••

However, in this diagram there is no meaningful connection between cage 1 at positive ionization and cage 1 at negative ionization. If we treat ‘cage’ as a factor then its meaning is dependent on the ionization level. It would not make sense to make comparisons between cage numbers 1, 2, 3 and 4 in general. In the setting above we say that the factor ‘cage’ is nested within the factor ‘ionization’. If we were studying schools we might label first grade classrooms with numbers 1 through 4 but classroom would ordinarily be nested within school. Schools can be nested within school boards and those can be nested within counties within states. Nesting is a hierarchical relationship often drawn using branching tree diagrams instead of a grid of boxes formed by horizontal and vertical lines as we have for crossed effects. When B is nested within A, we write B(A). This is a little odd because we just put A inside the parentheses. But since we read left to right, “B nested within A” gets the label B(A). Whether something is nested or crossed can become subtle. Ingots might ordinarily be nested within heats. For instance, we might randomly select three ingots from each heat to study. Then they are clearly nested. Then again somebody might always take the first and last and middle ones from a subsequent step in the production line. In that case the first one out of one heat does have a meaningful connection with the first one out of another heat and we have a crossed structure. Suppose we get Y_ijk for animal k in cage j that gets treatment i. Then for 1 ≤ i ≤ I, 1 ≤ j ≤ J and 1 ≤ k ≤ n our model has

Y_ijk = µ + α_i + β_{j(i)} + ε_ijk   (or ε_{k(ij)}).

For identifiability we set Σ_{i=1}^{I} α_i = 0 and Σ_{j=1}^{J} β_{j(i)} = 0 for each i = 1, . . . , I. We do not require β_{j(i)} to sum to zero over i for any j because that would be a

sum of values that had no meaningful connection. The ANOVA decomposition for this setting is

SS_{E(A,B)} = Σ_{ijk} (Y_ijk − Ȳ_ij•)²
SS_{B(A)}   = Σ_{ijk} (Ȳ_ij• − Ȳ_i••)²  = n Σ_{ij} (Ȳ_ij• − Ȳ_i••)²
SS_A        = Σ_{ijk} (Ȳ_i•• − Ȳ_•••)²  = Jn Σ_{i} (Ȳ_i•• − Ȳ_•••)²,   and
SS_T        = Σ_{ijk} (Y_ijk − Ȳ_•••)²  = SS_{E(A,B)} + SS_{B(A)} + SS_A.

The new quantity is

SS_{B(A)} = n Σ_{i=1}^{I} Σ_{j=1}^{J} (Ȳ_ij• − Ȳ_i••)².

For each level i, it has J − 1 degrees of freedom and so it has I(J − 1) degrees of freedom in total. Recall that the AB interaction has (I − 1)(J − 1) df. This B(A) sum of squares gets I(J − 1) − (I − 1)(J − 1) = J − 1 more df. They are the df for the B main effect. The B main effect is meaningless when j = 1 has no persistent meaning as i varies. As a result we lump the B main effect in with the prior AB interaction to get SS_{B(A)} = SS_B + SS_{AB}.
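In R the nested model can be specified with the / operator in a model formula. A minimal sketch with hypothetical variable names for the rat example:

  fit <- aov(activity ~ ionization / cage, data = rats)
  summary(fit)   # rows for ionization (A) and ionization:cage, i.e. B nested in A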

9.4 Expected mean squares and random effects

Now let’s consider a model with a random effect B nested within another random effect A. The model has

Y_ijk = µ + a_i + b_{j(i)} + ε_ijk

where a_i ~ iid (0, σ²_A), independently of b_{j(i)} ~ iid (0, σ²_B), and ε_ijk ~ iid (0, σ²). When we say something has distribution (µ, σ²) it is like N(µ, σ²) but without assuming normality. Derivations of F distributions require a normal distribution but expected mean squares do not. The expected mean squares in this setting are

E(MS_A)      = σ² + nσ²_{B(A)} + nJσ²_A
E(MS_{B(A)}) = σ² + nσ²_{B(A)},   and
E(MS_E)      = σ².

Let's derive the first one. We start with SS_A = nJ Σ_{i=1}^{I} (Ȳ_i•• − Ȳ_•••)². So it is just nJ times a sample variance among

Ȳ_i•• = µ + a_i + b̄_{•(i)} + ε̄_i•• ∼ (µ, σ²_A + σ²_{B(A)}/J + σ²/(nJ)).


We know from formulas for a sample variance that

E(SS_A) = nJ(I − 1) (σ²_A + σ²_{B(A)}/J + σ²/(nJ))

and so

E(MS_A) = nJ (σ²_A + σ²_{B(A)}/J + σ²/(nJ)) = σ² + nσ²_{B(A)} + nJσ²_A.

The others are similar. If B is a random effect nested in a fixed effect A, then

E(MS_A)      = σ² + nσ²_{B(A)} + nJ Σ_{i=1}^{I} α_i² / (I − 1)
E(MS_{B(A)}) = σ² + nσ²_{B(A)}
E(MS_E)      = σ²

and we see that σ²_A is replaced by a sample variance among the α_i. If both A and B are fixed, then

E(MS_A)      = σ² + nJ Σ_{i=1}^{I} α_i² / (I − 1)
E(MS_{B(A)}) = σ² + n Σ_{i=1}^{I} Σ_{j=1}^{J} β²_{j(i)} / (I(J − 1)),   and
E(MS_E)      = σ².
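A quick simulation (my own check, not from the notes) of the both-random case, using the balanced layout above:

  set.seed(1)
  I <- 4; J <- 5; n <- 6
  sigA <- 2; sigB <- 1.5; sig <- 1
  A <- factor(rep(1:I, each = J * n))
  B <- factor(rep(1:(I * J), each = n))       # a unique label per cage, so B is nested in A
  ms <- replicate(1000, {
    y <- rnorm(I, 0, sigA)[as.integer(A)] +
         rnorm(I * J, 0, sigB)[as.integer(B)] + rnorm(I * J * n, 0, sig)
    anova(lm(y ~ A + B))[["Mean Sq"]][1:3]    # MS_A, MS_B(A), MS_E
  })
  rowMeans(ms)                                # should be near the formulas:
  c(sig^2 + n * sigB^2 + n * J * sigA^2, sig^2 + n * sigB^2, sig^2)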

9.5 Additional models

There are more nesting patterns. We might have A, B(A) and C nested within B within A, i.e., C(B(A)). Or we could have B and C and B × C nested within A. Or we could have A crossed with B while C is nested within A. For instance hospital A crossed with drug B and ward C nested within A. These designs can be completely randomized or arranged in randomized block structures. There is a comprehensive treatment of these situations in Kirk (2013, Chapter 11). The online version has 71 pages with worked examples and formulas for expected mean squares. It is striking how complicated an analysis can become from combinations of a small number of choices about how things are nested or crossed and how the blocks are structured. We are faced with a combinatorial explosion of cases. It would be very useful to have a tool such as a probabilistic programming language that lets a user describe how the data were organized and then sets up the analysis.


9.6 Cluster randomized trials

In cluster randomized trials, we might apply a treatment at random to a whole village or a school or a sports team or a county or a marketing region, like the Bay Area versus Phoenix or Chicago. The experimental unit is a cluster of one of those types. We might also be able to get data on individuals within the cluster. Perhaps people or, in a marketing context, stores. Let the data be Y_ij for i = 1, . . . , n and j = 1, . . . , n_i. Suppose that cluster i got treatment trt(i) ∈ {1, . . . , I} where ordinarily I is much less than n. We can model the individual data via

Y_ij = µ + α_trt(i) + ε_ij,

and we can model the cluster data via

Ȳ_i• = µ + α_trt(i) + ε̄_i•,   i = 1, . . . , n.

The most straightforward analysis is to model the cluster level data using either a randomization (permutation) approach or a one way ANOVA. It seems like a shame to greatly reduce the sample size from N = Σ_{i=1}^{n} n_i individuals to just n ≪ N clusters. It is then tempting to analyze the data on the individual level. It would however be quite wrong to analyze the individual data as if they were N independent measurements. There are instead intra-cluster correlations. An individual level analysis can be quite unreliable if it considers the individuals to be independent when they are in fact correlated. Suppose for instance that the ε_ij have variance σ², correlation ρ within a cluster and are independent between clusters. Then

var(Ȳ_i•) = var(ε̄_i•) = var((1/n_i) Σ_{j=1}^{n_i} ε_ij)
          = (1/n_i²) (n_i σ² + n_i(n_i − 1)ρσ²)
          = (σ²/n_i) (1 + (n_i − 1)ρ).

It is larger by a factor 1 + (n_i − 1)ρ than it would be under independence. This factor is called the design effect. If we knew ρ then we could consider an individual level analysis. For two treatments we could work out

var(avg(Y_ij | A) − avg(Y_ij | B))

in terms of ρ and σ² and all the n_i. Murray (1998) is an entire book on cluster randomized trials, also known as group randomized trials.
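A tiny helper (my own, not from the notes) makes the design effect concrete: even a modest ρ inflates the variance a lot when clusters are large.

  design_effect <- function(ni, rho) 1 + (ni - 1) * rho
  var_cluster_mean <- function(sigma, rho, ni) sigma^2 / ni * design_effect(ni, rho)
  design_effect(ni = 50, rho = 0.05)   # 3.45: a naive i.i.d. analysis understates the variance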

10

Taguchi methods

Genichi Taguchi developed a methodology also called robust design. It is used in manufacturing to drive variance out of a product. Product quality can be much more sensitive to variance than to the mean. A table whose legs are all 3mm short is better than one with just one leg 3mm short, because that one will wobble. If a product has hundreds or thousands of measured attributes, then variance in their levels can cause unpredictable problems. A door that is slightly too big, combined with a frame that it has to fit into which is slightly too small, becomes a severe problem. Quality is not well described by a rectangular region in the space of attributes. Our prior chapters were mostly about “getting more”. By modeling E(Y | x) we could learn how to get more potatoes or survival or speed or learning by choosing x well. In robust design, we seek x to get a lower value of var(Y | x). The reason to reduce variance is that we want to get closer to a target. Methods that reduce variance could move the mean response away from the target. In the robust design world there is usually some way to adjust the mean onto target after the variance has been reduced. They call it a signal factor. This is not something that we would anticipate just thinking in terms of distributions and regression models and expected mean squares. It may however be quite obvious to people working on a product. For instance, if the quantity of interest is the thickness of paint on a car, then that amount will be more or less directly proportional to the amount of time that the paint is being sprayed and so the mean thickness can be controlled that way after designing a process to reduce variance. That only really works if adjusting the signal factor does not more than undo the variance reduction we have obtained.


10.1 Context and philosophy

The topic of robust design and Taguchi methods is not purely mathematical or statistical. There are elements of philosophy and it takes advantage of empirically observed phenomena common to many physical systems. The history of the problem is also an important component. In the 1980s there was significant concern in the US about a loss of quality in manufactured products, especially compared to high quality output from Japan. Methods developed by Taguchi in Japan were introduced to the US, especially at AT&T. This was to some extent returning a favor: Deming had worked to help Japanese manufacturers improve their quality some decades earlier.

One of the ideas in robust design is to use a quadratic loss function instead of specification limits. If the ideal value of a variable Y is a target T then instead of keeping score by the fraction of times that |Y − T| ≤ δ, or more generally that L ≤ Y ≤ U, we could study instead k(Y − T)² for some value k > 0. This idea is often accompanied by the experience of Sony making television sets in both the US and Japan. It is described in Phadke (1989) who references the newspaper ‘The Asahi’ from April 17, 1979. The color densities of the US made sets were all within the specification limits but were widely scattered in the interval from L to U. For the Japanese sets, 0.3% were outside the interval [L, U] but the distribution was more sharply peaked around the target level T. Phadke (1989) reports that US consumers thought the Japanese made sets had higher quality. (Somehow I remember the story being lower return rates for Japanese-made sets. That would be harder data than just a survey of perceptions, but I'm unable to find that aspect of the story in print.)

A quadratic loss function has several advantages. It allows one to keep track of continued progress beyond the point where 100% of products are meeting a given set of specs, without having to keep ‘moving the goalposts’. Another reason to avoid using a binary spec to track quality is that those specs may well have been set by some arbitrary tradeoffs between the interests of producers and consumers of a product along with its costs. A spec like T ± δ might not really be from a law of nature, but some negotiation based on what is possible or reasonable. Then getting zero defects does not imply perfect quality. A rectangular spec on many features, such as |Y_j − T_j| ≤ δ_j for all j = 1, . . . , J, is even less likely to be physically appropriate. Some sort of binary rule is necessary if one has to accept some product and reject others. However continuous data are also more informative than binary data when it comes to understanding and improving a system. Even with a spec such as 0.013 ± 0.001, getting values (0.01301, 0.01303, 0.01298, 0.01302) inspires more confidence than just getting ‘OK’ four times out of four tries. The quadratic loss is sometimes motivated as the total cost to society of a product missing its target value. It is difficult to judge the whole cost to society of a product, but perhaps reasonable that a quadratic formula is more useful than a binary one. For some quantities Y that cannot be negative, the smaller they are the better they are. These then have a target of T = 0 and are scored by kY². For other quantities, the larger they are, the better, and those are scored via kY^{−2}; that is, smaller values of 1/Y are better.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 10.2. Bias and variance 111

[Figure 10.1 about here: three archery targets, labeled Good, Bias and Variance.]

Figure 10.1: An archery example to illustrate bias and variance. The first archer is quite good, or at least best. The second has higher bias. The third has higher variance.

10.2 Bias and variance

A great convenience of a quadratic loss function is that it splits into variance and bias squared. Given n observations we have

(1/n) Σ_{i=1}^{n} (Y_i − T)² = (1/n) Σ_{i=1}^{n} (Y_i − Ȳ)² + (Ȳ − T)².

The first, variance term, is generally harder to control. A common analogy is made to a coach in archery as shown in Figure 10.1. The coach might be able to help the second archer with one or two suggestions about where to aim. The third archer might have issues that involve many variables, just like the noise in a model is often the result of many uncontrolled variables. Whether the archery model is appropriate to manufacturing is not something we can prove philosophically or mathematically. It might be known to hold empirically in some problems. Then the strategy in robust design is to first find a way to reduce the variance. Later we fix any bias created by the first step, which is assumed to be the hard one, without causing the variance to go back up. One model to explain robust design has

Y = f(x_1, . . . , x_k, z_1, . . . , z_r) + ε = f(x, z) + ε

where x = (x_1, . . . , x_k) is a vector of variables that we can control and z = (z_1, . . . , z_r) is a vector of noise variables that we cannot control. At first this sounds like it will be tough to experiment on variables we cannot control. What is happening is that those are variables we can control in an experiment but

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 112 10. Taguchi methods not afterwards. In an example mentioned by Shin Taguchi in a round table dis- cussion (Nair et al., 1992) a company making a paper feeder has noise variables including “paper type, paper size, paper warp, paper surface, paper alignment, stack height, roller wear, and humidity.” They can vary all of those things in their experiment to understand how they affect paper jams, but they cannot control them in their customers’ uses once the product has been sold. Variables x are different. They might describe how the paper feeder was constructed. In an automotive application, z might include the weather that a car will be driven in. In robust design, we get to choose x but our product has to work very generally for lots of z. We might formulate the problem as min var(Y |x) x where the randomness in Y is from an assumed distribution on z and from ε. Then change some signal variable to adjust E(Y | x) = T . That variable cannot be one of the zj. It must be one of the xj or some other variable either implicitly or explicitly in f(·). A good example is picking x to minimize variance of the paint thickness on a car. Then if the mean is off target, adjust the spray time. There is a very famous example about the Ina Tile company. See Phadke (1989). Their tiles came out of the kiln in unequal sizes and quality. One of the noise factors was the position of a tile in the kiln. The tiles were packed within the kiln. No matter how you do that some tiles will be central and some at the edge. So that position z could be a noise factor that you don’t want to affect product quality. After doing a robust design experiment they found that they were able to reduce the variability of their tiles. The solution even involved using less of an expensive ingredient and more of an inexpensive one and so the result was greater quality at lower cost.

10.3 Taylor expansion

Suppose that we set targets x0 for x and z0 for z. Under random assignment 2 2 E(z) = µ which is not necessarily z0 and var(z) = diag(σ1, . . . , σn). Here we are making a great simplification that components zj are uncorrelated. We could weaken that. With this assumption a Taylor expansion gives n . X ∂ f(x , z) = f(x , z ) + f(x , z )(z − z ) 0 0 0 ∂z 0 0 j 0j j=1 j and then n X ∂ 2 var(f(x , z)) ≈ f σ2. 0 ∂z j j=1 j 2 Since z and hence σj is out of our control, our best chance to reduce this 2 variance is to reduce (∂f/∂zj) through our choice of x0. That is, we want to reduce the sensitivity of our output to fluctuations in the input.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 10.4. Inner and outer arrays 113 125 120 115 Output Voltage 110 105

Transistor Gain

Figure 10.2: Hypothetical curve modeled after the account in Phadke (1989) of power supply design.

Phadke (1989) has an example where z is gain of a transistor Y is output voltage of a power supply. Figure 10.2 illustrates his setting. The horizontal axis depicts a transistor gain quantity z (without units). There is a curve that translates this gain into output voltage Y = f(z). There is a value z110 that gives Y = f(z110) = 110, the desired value, but in that region z has some noise 0 that gets amplified by the steep slope f (z110) to give a very noisy voltage. Moving to a higher value of z, such as z125 with f(z125) = 125 leads to less 0 noisy Y . This happens despite increased variance in z there because f (z125) 0 is much smaller than f (z110). In order to bring the system on target he uses a resistor that reduces the voltage from a value centered around 125 to one centered around the target T = 110. The impact of this signal factor does not change variance due to linearity of the voltage versus resistance relationship.

10.4 Inner and outer arrays

Taguchi’s strategy is to choose an experiment varying x in n1 runs. This is called the inner experiment. For each of those n1 runs he has a second experiment in the noise factors z using n2 runs. That is the outer experiment. The total experiment then has n1×n2 runs, yielding yij for i = 1, . . . , n1 and j = 1, . . . , n2. At each level of the inner experiment, the method computes

n  1 X2  η = −10 log (y − y¯ )2 . i 10 n − 1 ij i• 2 j=1

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 114 10. Taguchi methods

This is the sample variance over the outer experiment recorded in decibels. Recall that 10 log10(·) converts a quantity to decibels. A logarithmic scale is convenient because any model we get for E(η) when exponentiated to give a variance (or inverse of a variance) will never give a negative value. It is also more plausible that the factors in x might have mutiplicative effects on this variance than an additive effect. In its most basic form Taguchi’s method finds settings that maximize the signal to noise ratio η. It is often an additive model in the components of x. In the “bigger the better” setting, the analysis is of

n  1 X2 1  η = −10 log . i 10 n y 2 j=1 ij while in the “bigger the better” setting, the analysis is of

n  1 X2  η = −10 log y . i 10 n ij 2 j=1

The experimental designs are usually orthgonal arrays like the following one at 3 levels: 0 0 0 0 0 1 1 2 0 2 2 1 1 0 1 1 1 1 2 0 1 2 0 2 2 0 2 2 2 1 0 1 2 2 1 0

This design is called an because in any pair of columns each of the 9 possible combinations appears the same number of times, i.e., once. We will see more orthogonal arrays when we look at computer experiments. We recognize this as a 34−2 design because it handles 4 variables at 3 levels each using only 9 runs not 81. It is usual for the outer experiment to also be an orthogonal array. Then the design is made up of an inner array and an outer array. There is a good worked example of robust design in Byrne and Taguchi (1987) (which seems not to be available online). The value Y was the force needed to pull a nylon tube off of a connector. This pull-off force was studied with an inner array varying 4 quantities at three levels each (interference, wall thickness, insertion depth and % adhesive). The outer array was a 23 experiment in the time, temperature and relative humidity during conditioning. There were thus 72 runs in all.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 10.5. Controversy 115

10.5 Controversy

There were some quite heated discussions about which aspects of Taguchi’s ro- bust design methods were new and which were good. The round table discussion in Nair et al. (1992) includes many points of view and dozens of references. One issue is whether it is a good idea to use that combination of inner and outer arrays. There every value of x is paired with every value of z. It might be less expensive to run a joint factorial experiment on (x, z) ∈ Rk+n instead. As usual, costs come into it. If the cost is dominated by the number of unique x runs made, then a split plot structure like Taguchi uses is quite efficient. If changing z and changing x cost the same then a joint factorial experiment and the corresponding analysis will be more efficient. Another issue is whether an analysis of signal to noise ratios is the best way to solve the problem. In the roundtable discussion James Lucas says “The designs that Taguchi recommends have the two most important characteristics of experimental designs: (1) They have factorial structure, and (2) they get run.” That is, an analysis that is not optimal may be better because more people can understand it and use it.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 116 10. Taguchi methods

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 11

Some data analysis

Most of these notes are about designing how to experimentally gather data with the assumption that they can be largely analyzed with methods familiar from linear regression. Here we look at some ways of analyzing data that are especially suited to designed experiments. The course begin by grounding experimentation in causal inference. The notions of potential outcomes, randomization, the SUTVA assumption and ex- ternal validity help us think about experimentation. Then A/B testing and bandit methods bridge us to problems of great current interest in industry. Then we began with more classical experimental design. The chapters so far have included a number of categorical quantities. There are experimental units which may be plots or subjects and there are sub-units. There are experimental factors. A combination of factors comprises a treatment which may or may not involve important interactions. Those factors can be fixed or random, nested or crossed. We also saw blocks and replicates and repeated measures.

11.1 Contrasts

Beyond just rejecting H0 or not rejecting it, we have an interest in the different expected values of Y . For a one way fixed effects model with

E(Yij) = µ + αi ≡ µi the comparisons of interest involve certain differences among the µi or αi. We might want to compare two expected outcomes through µ2 − µ7. If we are compariing effectiveness of five soaps where the first three contain phosphates

117 118 11. Some data analysis and the other two do not then we might be interested in

(µ1 + µ2 + µ3)/3 − (µ4 + µ5)/2.

If we are comparing a new product to three old ones we might study

µ1 − (µ2 + µ3 + µ4)/3.

PI These are all examples of contrasts. Contrasts take the form i=1 λiµi P P 2 where i λi = 0 and, to remove an uninteresting case, i λi > 0. A PI also satisfies i=1 λiαi. The reason why we have so much less interest in µ than αi is that µ does not affect any comparisons of the levels of this factor and so does not affect many of our choices. Perhaps if µ is bad enough we might not want to use any of the levels of our factor, but when as usual we have to choose, µ plays no role in E(Y ). In the one way layout we can test a contrast with a t test, via

P ¯ √ I λiYi• X t = i ∼ t s = MSE N = n . pP 2 (N−k) i s i λi /n i=1

We saw earlier that the presence of a random effect can complicate the inference on a fixed effect with which it is crossed. If A is fixed and B is random we can use P λ Y¯ √ t = i i i•• ∼ t s = MSAB. pP 2 ((I−1)(J−1)) s i λi /(nB) This formula is for a balanced setting. When MSAB is the appropriate denom- inator for our F test it provides the appropriate value of s for our t-test. The degrees of freedom to use are the number underlying the estimate s. Suppose that we have a factor that represents I equispaced levels of a con- tinuous variable. For instance 20kg, 40kg, 60kg and 80kg of fertilizer. It is then interesting to test the extent of a linear trend in the average responses Y . Centering these levels produces a contrast λ = (−30, −10, 10, 30). A test P of i λiαi = 0 is equivalent to one with λ = (−3, −1, 1, 3). This is a linear contrast. When there are an odd number of levels then the central element in the contrast has λi = 0. A test for curvature can be based on a quadratic contrast. If the levels are linearly related to i then we can take

1 X λ = (i − ¯i)2 − (i − ¯i)2 i I i where ¯i = (I + 1)/2. 0 P 0 P ¯ Two contrasts λ and λ are orthogonal if i λiλi = 0. Then i λiYi• and P 0 ¯ i λiYi• are uncorrelated.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 11.2. Normality assumption 119

11.2 Normality assumption

We have used reference distributions like the t and F distributions derived from an assumption that the errors are normally distributed. ¯ By the central limit theorem, Yi• is more nearly normally distributed than P ¯ ¯ the Yij are. Similarly i λiYi• involves more averaging than the individual Yi• do and so we anticipate that they more nearly follow a normal distribution. There is a special aspect of contrasts that helps. The main departure of an average from the normal distribution is typically from it’s skewness E((Y − 3 3 ¯ ¯ µ) )/σ being nonzero. If Y1• and Y2• have similar skewness then much of it ¯ ¯ cancels in a difference like Y1• − Y2•. Since λi sum to zero, contrasts must also introduce some degree of cancellation in . Our t-test for a contrast is based on the approximation X X X σ2  λ Y¯ ≈ N λ α , λ2 i i• i i i n i i i or, for unbalanced samples, X X X σ2  λ Y¯ ≈ N λ α , λ2 . i i• i i i n i i i i

The F test for H0 is similarly robust to small departures from normality by the CLT because  ¯     Y1• µ + α1 ¯ 2 2 2 Y2• µ + α2 σ σ σ    ∼· N   , diag , , ··· ,   .   .    .   .  n n n  ¯ YI• µ + αI This is all we need for our usual derivation. The central limit theorem yields ¯ approximately Gaussian Yi• values and then sums of squares among them are approximately χ2. The denominator in the F test uses a mean square such as MSE or MSAB as an estimate of σ2. It commonly has many more degrees of freedom than the numerator. We do not need a central limit theorem for the denominator just that it yields a good approximation to σ2. Tests for a variance or ratio of variances are not robust to non-Gaussianity. A typical MSE is like n 1 X s2 = (Y − Y¯ )2. n − 1 i i=1 2 2 2 Under Gaussianity s ∼ σ χ(n−1)/(n − 1). More generally  2 κ  var(s2) = + σ4 n − 1 n (Miller, 1997, Chapter 7) where ((Y − µ)4) κ = E − 3 σ4

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 120 11. Some data analysis is the of Y . The kurtosis is zero for Gaussian random variables but not necessarily for other variables. When Y has ‘heavier tails’ than the Gaussian distribution has, then κ > 0 and s2 has higher variance than under a Gaussian 2 2 2 2 assumption (and is not χ ). Whenσ ˆ1 andσ ˆ2 are not scaled χ random variables 2 2 then we cannot expect their ratioσ ˆ1/σˆ2 to be approximately F distributed. The situation is a better for n 1 X (Y¯ − Y¯ )2 n − 1 i• •• i=1 ¯ because the CLT is making each Yi• more nearly normally distributed than individual Yij are.

11.3 Variance components

Our model for a one way layout with random effects is

Yij = µ + ai + εij, i = 1, . . . , I j = 1, . . . , n

2 2 where ai ∼ N (0, σA) and εij ∼ N (0, σE) are all independent. We have renamed 2 2 2 2 σ to be σE here. The variances σA and σE are called variance components. For a thorough treatment of variance components see Searle et al. (1992). In this simple variance components model we have

2 2 2 E(MSA) = nσA + σE and E(MSE) = σE. It is quite common in more complicated variance components settings to have 2 σE in every expected mean square. The reason is that the errors εij contribute variance to every observation and there is no way to cancel them out. 2 2 We are often most interested in estimating σA and σE and related ratios 2 2 2 such as σA/(σA + σE). We can get unbiased estimates of them by taking MSA − MSE σˆ2 = MSE andσ ˆ2 = . E A n 2 The estimate of σA is potentially awkward because it can be negative. It is then common to take MSA − MSE  σˆ2 = max , 0 . A n 2 2 This estimate is no longer unbiased. It satisfies E(ˆσA) > σA because we some- times increase an unbiased estimate to zero, but never decrease it. If we are averaging estimates like this over many data sets we might prefer to use any negative values we get so as not to get a biased average. We can also look at this setting through the correlation patterns in the data. If j 6= j0 then

2 cov(Yij,Yij0 ) = cov(ai + εij, ai + εij0 ) = cov(ai, ai) = σA

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 11.4. Unbalanced settings 121 and so 2 2 σA σA ρ ≡ corr(Yij,Yij0 ) = = 2 2 . var(Yij) σA + σE 2 We can interpretσ ˆA as an indication that ρ < 0. Negative correlations are impossible under our random effects model but distributions with those negative correlations do exist. For instance the correlation matrix for Yij for j = 1, . . . , n could be 1 ρ ρ ··· ρ ρ 1 ρ ··· ρ   ρ ρ 1 ··· ρ n×n   ∈ R ......  . . . . . ρ ρ ρ ··· 1 for −1/(n − 1) 6 ρ 6 1. The lower limit on ρ is there to keep the correlation matrix positive semi-definite. In this correlation model n ! X 2 2 var Yij = nσ + n(n − 1)ρσ j=1 or equivalently σ2 var(Y¯ ) = (1 + (n − 1)ρ). i• n Here 1+(n−1)ρ) is the design effect we saw in cluster randomized trials. What ¯ P we see with ρ < 0 is that the variance of Yi• or of j Yij is less than what it would be for independent observations. If there is some mechanism keeping the total more constant than under independence that could explain negative correlations. Cox (1958) considers animals that share a pen into which some constant amount of food is placed. That could introduce negative correlations in their weights. In a ride hailing setting with a fixed number of passengers we might see negative correlations among the number of rides per driver. In both of those settings we could get positive correlations too. The quantity of food or of passengers could fluctuate up and down generating positive correlations. If negative correlations are statistically convincing then we can move away from the ANOVA and model the covariance matrix of the data instead.

11.4 Unbalanced settings

In most of these notes we look at balanced data settings. In a few cases the unbalanced sample sizes cause no difficulty. For instance this is true for the one way layout with fixed effects. In other settings unbalanced sample sizes cause severe complications that we do not delve into in a first course on experimental design. For an illustration consider the one way random effects model

Yij = µ + ai + εij, i = 1, . . . , I, j = 1, . . . , ni

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 122 11. Some data analysis and suppose that we want to estimate the grand mean µ. Two natural estimates are I n I 1 X Xi 1 X µˆ = Y andµ ˆ = Y¯ . N ij I i• i=1 j=1 i=1 That is we can average all of the data or average all of the group means. If 2 2 we actually knew σA and σE then we could compute the minimum variance ¯ unbiased linear combination of Yi• as

I I X ¯ ¯ . X ¯ µˆ = Yi•/var(Yi•) 1/var(Yi•) i=1 i=1 I I X ¯ 2 2 . X 2 2 = Yi•/(σA + σE/ni) 1/(σA + σE/ni). i=1 i=1

2 2 ¯ Now if σA  σE/ni for all i then averaging the Yi• would be nearly optimal. 2 2 If instead, σA  σE/ni for all i then averaging all the data would be nearly optimal. In practice we don’t ordinarily know these variance components but this analysis would let us make a reasonable choice between the two natural estimates above given a guess or assumption on the variance components. For much more about variance components and unbalanced data, see Searle et al. (1992).

11.5 Estimating or predicting the ai

There are settings where we actually want to know something about ai for a specific experimental unit i, even though ai are thought to be sampled from some distribution. Searle et al. (1992) give an example from dairy science. Suppose that i represents a bull and j represents a cow that is a daughter of bull i. The setting is nested because cow j = 1 for bull i0 has nothing to do with cow j = 1 for bull i. Now let

Yij = µ + ai + εij be some measure of milk yield or quality from cow j ∈ {1, 2, . . . , . . . , ni} of bull i. We might want to estimate ai in order to judge whether to keep using bull i. Sometimes this problem is described as predicting ai because ai is random. The term “predicting” seems unnatural here because a random effect is not necessarily a quantity defined through the future. To see how this works we once again make a simplifying assumption that we 2 2 know µ and σA and σE. If we want to estimate µ + ai we can do better than ¯ just using Yi•. Following Searle et al. (1992, Chapter 3.4) suppose that

     2 2  ai 0 σA σA ¯ ∼ N , 2 2 2 . Yi• µ σA σA + σE/ni

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 11.6. Missing data 123

¯ 0 Our best estimate of ai is E(ai |Yi•) (after arguing that observations from i 6= i ¯ don’t help and that only the sufficient statistic Yi• is useful). Using properties of the bivariate Gaussian distribution

¯ ¯ ¯ −1 ¯ E(ai |Yi•) = E(ai) + cov(ai, Yi•)var(Yi•) (Yi• − µ) 2 σA ¯  = 2 2 Yi• − µ . σA + σE/ni

2 2 Under very strong assumptions of normality and knowing µ, σA and σE we would estimate (predict) ai by

2 σA ¯  2 2 Yi• − µ . σA + σE/ni

¯ 2 2 We shrink Yi• −µ towards zero, shirking it a lot if σA  σE/ni. So we estimate µ + ai by

2 2 σE/ni σA ¯ 2 2 µ + 2 2 Yi•. σA + σE/ni σA + σE/ni

This is a linear combination of the population mean µ and the average for unit i. ¯ As ni increases we trust Yi• more. This estimate is the BLUP, for best linear unbiased predictor. It minimizes variance among linear combinations of data. ¯ With our simplifying assumptions here, the data is just Yi•. The approach generalizes but becomes complex to depict.

11.6 Missing data

Suppose we have randomized blocks

Yij = µ + αi + bj + εij viewing the block as a random effect, and the observation Yi0j0 is missing. We could replace it with whatever minimizes

X X ∗ 2 (Yij − µ − αi − bj) i j

∗ where Yij is Yij if we have it and a parameter if we don’t. This amounts to running a regression on the row and column indicators 0 0 with a special variable X with Xij = 1 if i = i and j = j and Xij = 0 otherwise. Because we have fit one more parameter we subtract one from the error df getting (I − 1)(J − 1) − 1. We can adjust for a small number of missing responses this way. For more details see Montgomery (1997).

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 124 11. Some data analysis

11.7 Choice of response

. µ+αi+βj +γ Suppose that E(Yijk) = e k . We can still write ˜ E(Yijk) =µ ˜ +α ˜i + βj + αβf ij + γek + αγf ik + βγf jk + αβγg ijk for some new parameters. But we may have made the problem much harder by introducing high order interactions. In a setting like this, log(Y ) may have a more nearly additive model than Y does. If log(Y ) is nearly additive then Y may not be. The expression above has log(E(Yijk)) additive which is not the same has having E(log(Yijk)) additive. Conversely, sometimes Y is more nearly additive than log(Y ). There is a strong simplification from modeling on a nearly additive scale because interactions bring in so many more parameters. Also many of our models and methods use aliasing of the interactions and that is less harmful when they are much smaller. It may then require some after thought to translate a model for transformed Y to get conclusions for E(Y ). A very difficult situation arises when Y is measured in dollars and the model works with log(Y )¿ As a second example, suppose that . E(Yijk) = µ + αi + βj + γk, and let ( 0, |Y − τ| > δ Y˜ = 1, |Y − τ| 6 δ. I.e. Y˜ = “Y is ok”. Even if we ultimately care about Y˜ it can be much simpler to model Y because Y˜ can have lots of interactions. For instance, suppose that larger i implies larger αi. Then Y˜ increases with i when βj + γk is small but decreases with i when βj + γk is large. That translates into greater impact from interactions. It is better to model Y statistically and then derive consequences for Y˜ from the model for Y .

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 12

Response surfaces

Very often we want to model E(Y | x) for continuosly distributed x, not just binary variables as we could handle with 2k factorials. Those can be used to study continuous variables by choosing two levels for them. However, once we know which variables are the most important and have perhaps a rough idea of the range in which to explore them we may then want to map out E(Y | x) more precisely for the subset of most important variables. We would like to estimate an arbitrary response surface E(Y | x) in those key variables. The literature on response surface models is mostly about estimating first order (linear) and second order (quadratic) polynomial models in x, so most of the practical methods do not have the full generality that the term ‘response surface’ suggests.

If we are operating near an optimum value of E(Y | x) then a quadratic model might capture the most important aspects of E(Y ). If that optimum is on the boundary of a constraint region then a local linear model might be very suitable. Local linear models are also very suitable in screening out the most important variables.

The material for this lecture is based largely on these texts Box and Draper (1987), Myers et al. (2016) and Wu and Hamada (2011) and these survey articles Myers et al. (1989) and Khuri and Mukhopadhyay (2010). Additional material on comes from Atkinson et al. (2007).

125 126 12. Response surfaces

12.1 Center points

Designs at just two levels for each component of x let us fit the first order model

p . X E(Y |x) = β0 + βjxj. (12.1) j=1

We can also fit a “ruled surface” model like p . X X E(Y |x) = β0 + βjxj + βjkxjxk (12.2) j=1 16j

x1 x2  −1 −1   −1 1     1 −1     1 1   .  0 0     . .   . .  0 0

From the repeated center points, we can estimate σ2 or at least var(Y |x = 0). We can also use that data to estimate this model p p X X X 2 β0 + βjxj + βjkxjxk + γ xj . j=1 16j

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 12.2. Three level factorials 127 is curvature and a first order model is problematic. Ifγ ˆ is not significantly different from zero then this is consistent with there being no curvature but does not prove it. The true βjj might sum to nearly zero. In other words we have a one way diagnostic. When it provides evidence of curvature we can be confident that it is there, but when it does not provide such evidence we cannot be confident that there isn’t any curvature. Readers might be familiar with one way diagnostics in Markov chain Monte Carlo methods: they can reliably detect slow mixing but ordinarily cannot establish good mixing. Another use for centerpoints is that, as remarked above, we might want an estimate of var(Y | x = 0). The variance of Y at the centerpoint might be a reasonable variance to use for planning even if the true variance depends on x. Yet another use for them is to serve as ‘control runs’ that this page from NIST https://www.itl.nist.gov/div898/handbook/pri/section3/pri337.htm de- scribes as “To provide a measure of process stability and inherent variability”. Their advice is not to include the centerpoints in the randomized experimental order. Instead they recommend spacing those points out evenly through the run. For instance with 16 points in a 2 level design, they might add a center- point at the beginning, middle and end of experimentation, bringing the total to n = 19 runs. The other 16 points would be placed in a random order in the remaining 16 experimental positions.

12.2 Three level factorials

There is a theory of 3k factorial designs and 3k−p fractional factorial designs that parallels the case for two level designs. Not surprisingly, the expense grows more quickly with k than we get for 2 level designs.

In a three level design we take xij ∈ {−1, 0, 1} after rescaling. For variables that take widely different values these levels might be what we get after a log- arithmic transformation. For instance we might use xij ∈ {−1, 0, 1} to encode a quantity that originally took values 100, 200 and 400. When k = 1 our experiment at 3 level has two degrees of freedom for treat- ments. These are usually expressed through two contrasts: a linear contrast Y¯1 − Y¯−1, and a quadratic contrast 2Y¯0 − Y¯1 − Y¯−1. These are orthogonal con- trasts. With k effects A, B, C, ··· we find 2 degrees of freedom for A, 4 degrees of freedom for a two factor interaction like AB, 8 degrees of freedom for a three factor interaction like ABC, and so on. So things get expensive. The ANOVA table for a three level design can be partitioned as in Table 12.1. The ANOVA table for a three level design has terms for the product of k quadratic effects such as AQ × BQ × CQ. We might well use some of those interactions in an error term, just as we did for two factor designs to mitigate the high cost of three level experiments. We could also plot estimated effects in a QQ plot to identify important variables.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 128 12. Response surfaces

Source df A 2 AL 1 AQ 1 B 2 BL 1 BQ 1 AB 4 AL × BL 1 AL × BQ 1 AQ × BL 1 AQ × BQ 1

Table 12.1: An ANOVA table for a three level factorial.

x y x+y mod 3 x+2y mod 3 0 0 0 0 0 1 1 2 0 2 2 1 1 0 1 1 1 1 2 0 1 2 0 2 2 0 2 2 2 1 0 1 2 2 1 0

Table 12.2: This is a 34−2 fractional factorial. It is also an orthogonal array in that every pair of columns has all 9 possible rows the same number of time (i.e., once).

Given data from a 3k factorial experiment we can fit the two level model

k k X X 2 X β0 + βjxij + βjjxij + βjjxijxij0 0 j=1 j=1 16j

k k X X 2 X β0 + βjxij + βjj(xij − 2/3) + βjjxijxij0 0 j=1 j=1 16j

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 12.3. Central composite designs 129 arrays later. The top of Table 12.2 shows a construction in modular arithmetic. We will see more of that construction too. When using an orthogonal array, be sure to randomize the levels. That is, there are 6 possible ways to map the levels 0, 1 and 2 of the array onto the experimental levels −1, 0 and 1 and one of those should be chosen at random. An independent randomization should be made for each column. It would be a very poor practice to just subtract 1 from each entry in the array. The run order should also be randomized. These 3k−p designs can also be run in blocks whose size is a power of 3. There is an extensive selection of three level designs here: http://neilsloane. com/oadir/#3_2. For a comprehensive account of orthogonal arrays, see He- dayat et al. (1999).

12.3 Central composite designs

In the of Box and Wilson (1951) we begin with a two level design with values {−1, 1}k, ordinarily a 2k−p fractional factorial, then add some number n0 of center points (0, 0, 0,..., 0) and then 2k “star points” varying one factor at a time. These take the form (±α, 0, 0,..., 0), (0, ±α, 0,..., 0), (0, 0, ±α, . . . , 0), and so on, up to (0, 0, 0,..., ±α), for some α > 0. The points are then used in random order, sometimes within blocks. In setting up a central composite design we have to choose our three com- ponents: the two level design to use, the number of center points, and the value α for the star points. The analysis is usually a quadratic linear regression. Given a point x ∈ Rd we form a vector of features

2 2 T p f(x) = (1, x1, . . . , xd, x1x2, . . . , xd−1xd, x1, . . . , xd) ∈ R for p = 1 + d + d(d − 1)/2 + d = 1 + d(d + 3)/2 and fit by least squares. That is

βˆ = (F TF )−1F TY

p n where F ∈ R has i’th row fi = f(xi) and Y ∈ R with i’th element Yi. In choosing the two-level experiment, we will want all quadratic and cross terms to be estimable. That is, F TF should be invertible, which we can easily check before commencing to experiment. Naively that would be solved by using a resolution V experiment that keeps the cross terms uncounded with each other. By the time the center points and star points are included, the true condition becomes much more subtle. See Wu and Hamada (2011, Chapter 9.7) for a very careful exposition. For instance, one can use what they call resolution III∗ designs. Those have resolution III with no words of length exactly four. That is one cannot have ABCD = ±I. They also point out that one can use Plackett- Burman (i.e., Hadamard) points for the two level design. In dimension k = 2, even a 22−1 experiment plus center points and axial points can make the model estimable.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 130 12. Response surfaces

2k points OAAT ±α × ej n0 center points https://www.jstor.org/stable/2983966 https://www.google.com/search?q=central+composite+design

2 Pn 2 It is also common to code the pure quadratic features as xij −(1/n) i0=1 xi0j to make them orthogonal to the intercept. This also makes them orthogonal to the linear terms because, breaking the design into its three parts we find that

X 2 2 X 2 X (xij − xj )xij0 = xijxij0 from xij0 = 0 i i i X = xij0 + (α ∗ 0) + (−α ∗ 0) i∈Factorial = 0.

2 2 0 Exercise: show that xij − xj is orthogonal to xijxij0 for j 6= j and to xij0 xij00 when no two of j, j0 and j00 are equal. One very tricky issue is choosing the value of α. Taking α = 1 is convenient k because it keeps all factors at three levels. We√ will have a subset of a 3 factorial 2 experiment. Another choice is to take α = k because this keeps kxik = k on the star points just like it is for the factorial points. This is called a “” because then√ both the star and factorial points are embedded within a sphere of radius k. When we choose a spherical design we need some zero Pk 2 points or else j=1 xij = k for all i and we will have a singular matrix F . Exercise: is this exactly the same problem that we saw with a centerpoint design and the parameter√ γ or is it different? √ If we choose α = k then we might find that values xij ∈ ± k are too far from the region of interest even though they are exactly the same distance from the center as the factorial points are. The issue stems from factor sparsity. If x1 is√ a very important variable, much more so than the others, then taking xi1 = ± k represents a much more consequential change than just taking everything in {−1, 1}. Something is too far from the region of interest if the quadratic model that serves over [−1, 1]k does not extrapolate well there. Perhaps changing a geometric parameter for a transistor by√ that much turns it into a diode. It is even possible that operating at xij = ± k is unsafe if xj represents temperature or pressure. This sort of non-statistical issue can be much more important than designing to reduce var(βˆ) and it requires the input of a domain expert. One possible way to choose α is to obtain orthogonality, that is a diagonal matrix F TF ∈ Rp×p. If the pure quadratic terms are centered then it remains possible that they are not orthogonal to each other. There is one value of α that makes them orthogonal. After some algebra, this is QF 1/4 α = 4 for Q = [(F + T )1/2 − F 1/2]2 where F is number of factorial observations and ˆ ˆ T = 2k + n0. This then makes corr(βjj, βj0j0 ) = 0.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 12.4. Box-Behnken designs 131

Another way to choose α is to obtain rotatability:

T T −1 2 var(Eˆ(Y |x)) = f(x) (F F ) f(x)σ = g(kxk) for some function g(·). Now the statistical information when predicting E(Y |x) depends only on kxk and not on the angle between x and any of the coordinate axes. There is not a strong motivation for choosing rotatability. Rather it is a tie-breaker condition when choices are otherwise equal. Also, when factor sparsity is present then sparse vectors x such as (1, 0, 0,..., 0) and (0, 1, 0,..., 0) should be more important than others of the form (cos(θ), sin(θ), 0,..., 0) for arbitrary 0 < θ < π/2. Box and Draper (1987) include some blocking schemes for central composite designs. In a blocked analysis we would use indictor variables taking the value 1 in a given block and zero outside of that block. It is then important to have those indicator variables be orthogonal to the linear, quadratic and mixed terms in the second order regression model. NIST shows some examples of central composite designs in blocks at https://www.itl.nist.gov/div898/ handbook/pri/section3/pri3364.htm. If that link doesn’t work well, look for section 5.3.3.6.4 entitled “Blocking a response surface design” in their online handbook.

12.4 Box-Behnken designs

The second major class of response surface designs are the Box-Behnken designs from Box and Behnken (1960). Those designs have three levels 0 and ±1. A small example for factors A, B and C looks like this:

ABC ±1 ±1 0 0 ±1 ±1 ±1 0 ±1 0 0 0

The first row denotes a 22 factorial in A and B with factor C held at zero. There follow two similar rows with A and then B held at zero. Finally there is a row representing n0 runs at x = 0. If, for instance n0 = 3 then this experiment would have 15 runs. According to Box and Behnken (1960), “The exact number of center points is not critical”. Geometrically this design has 12 points on the surface of the unit cube [−1, 1]3, one at the midpoint of each of the 12 edges connecting the 8 vertices within six faces. There are also center points. We can recognize the strategy in this design. There is a balanced incomplete block structure designating some number r < k of the factors that take values ±1 while the remaining k − r factors are held at zero. Then some number of center points are added. Table 12.3 shows another Box-Behnken design, this time for 4 factors. It is arranged in three blocks each of which has its own center point.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 132 12. Response surfaces

ABCD ±1 ±1 0 0 0 0 ±1 ±1 0 0 0 0 ±1 0 0 ±1 0 ±1 ±1 0 0 0 0 0 ±1 0 ±1 0 0 ±1 0 ±1 0 0 0 0

Table 12.3: A schematic for a Box-Behnken design in four factors with three blocks and one center point per block.

Exercise: are the block indicator variables for the Box-Behnken design in 12.3 orthogonal to the regression model? If we change it to n0 = 4 are the orthogo- nal? Since there are three block variables you can drop the intercept column. Like 3k−p designs Box-Behnken designs can easily handle categorical vari- ables at 3 levels. The tabled designs in the literature and textbooks involve only modest dimensions k. For large k, Box-Behnken designs use many more runs than parameters. While that may be useful in some settings, in others there is a premium on minimizing n by taking it just slightly larger than the number of parameters.

12.5 Uses of response surface methodology

One of the main uses is to fit a quadratic model, and estimate the direction of steepest ascent. Then, assuming that larger E(Y |x) is better move the region of interest in the direction of apparent improvement and run another experiment. The approach features a ‘human in the loop’ deciding which variables to explore and how at each iteration of the experiment. This is called evolutionary operation by Box (1957). There is a large related field of stochastic optimization that takes possibly noisy data and uses it to decide where next to sample with the goal of finding an optimum. See Kushner and Yin (2003) and Spall (2003). Those methods often emphasize iterations taking one or two new data points at each round.

12.6 Optimal designs

k This section is based primarily on Atkinson et al. (2007). We will pick xi ∈ R p and then compute features f(xi) ∈ R (such as for a quadratic regression). T ˆ T −1 T Then our model is Yi = f(xi) β + εi we estimate β by β = (F F ) F Y

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 12.6. Optimal designs 133 and our accuracy is described by var(βˆ) = (F TF )−1σ2 ∈ Rp×p. When we predict Y for a given x then we may use Yb(x) = f(x)Tβˆ with var(Yb(x)) = f(x)T(F TF )−1f(x)σ2. ˆ We will want to choose xi in some way that var(β) is small and there are many ways that a matrix can be considered small. Before that however it is worth noting that in this setting var(βˆ) does not depend on the true β! The fact that β does not appear in the formula (F TF )−1σ2 is an enormous simplification, that we usually take for granted. Otherwise the best design for learning β would depend on the unknown β and then the design problem would be intrinsically circular. This actually happens for logistic regression which we consider briefly below. Once again, finding that a symbol is not in a formula is quite special. Before getting started we should rule out three approaches. First, we don’t want to solve the problem by letting n → ∞. That’s expensive and may be wasteful. We want to be as efficient as we can with the n that we can afford. Second, we don’t want to solve it by letting σ2 → 0. We want to be as efficient as we can with a given quality of measurement. Finally, while small var(βˆ) T corresponds to large F F , we don’t want to solve the problem by letting kfik → ∞ where fi = f(xi). The linear model is only approximate and so we need to keep xi in or near the region of interest. Also, the extreme kxik that we would ordinarily need for extreme kfik may be expensive or unsafe. d We formulate the design problem by fixing n and requiring xi ∈ X ⊂ R . d Depending on the problem we might require xi ∈ X = {x ∈ R | kxk 6 1} or d xi ∈ X = {x ∈ R | maxj |xj| 6 1}. There are numerous notions of optimality, including A-optimality, D-optimality, E-optimality, G-optimality and I-optimality. Using

n σ2 F TF 1 X var(βˆ) = (F TF )−1σ2 = M −1 where M ≡ = f f T n n n i i i=1 we can describe the optimality criteria via the matrix M. Perhaps the most famous and widely used notion is D-optimality. A D- optimal design minimizes det(F TF )−1σ2 (sometimes called the generalized vari- ance of βˆ) or equivalently it maximizes det(M). This is the product of the eigenvalues of M. The ‘D’ stands for determinant. P ˆ A-optimality with ‘A’ for average refers to minimizing j var(βj) or maxi- mizing the sum or average of the eigenvalues of M −1. E-optimality with ‘E’ for extreme refers to minimizing maxj 1/λj where λj are the eigenvalues of M. T ˆ T ˆ G-optimality refers to minimizing maxx∈X var(f(x) β) where var(f(x) β) ∝ f(x)TM −1f(x). This notion goes back to Smith (1918) for polynomial regres- sion, which is the first optimal design paper. R T ˆ I-optimal design also called V -optimal refers minimizing X var(f(x) β)g(x) dx where g(·) > 0 measures interest level. It could be a distribution but does not R T ˆ 0 have to be. One could also minimize X 0 var(f(x) β)g(x) dx where the set X includes extrapolations to points that are not in X . DA optimality refers to minimizing det(var(ATβˆ)) for some matrix A.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 134 12. Response surfaces

These definitions raise the question “which optimality is best?” To address that we need to consider design measures. In a design measure, we generalize Pn T from M(x1,..., xn) = (1/n) i=1 f(xi)f(xi) to Z T T M(µ) = f(x)f(x) µ(x) dx = E(f(x)f(x) ), x ∼ µ X for a distribution µ. Then instead of finding the best list of n points in X we relax to problem to just seeking the optimal distribution µ. While M(µ) is written above as an integral as if µ were continuous, µ can also be discrete as all we need is the expectation. In fact, most of the solutions we see will involve discrete distributions µ. Given an optimal or near optimal design measure µ we can then pick xi to approximate µ. Atkinson et al. (2007) consider a problem of quadratic regression through 2 origin. The model is E(Y | x) = β1x + β2x with 0 6 x 6 1. That is, X = [0, 1]. This is not a model we would often want to fit, but it is an excellent small illustration of how optimal design works. They find that to maximize T the√ D-optimality√ condition det(E(f(√x)f(x) )) under x ∼ µ one should choose µ( 2−1) = 1/ 2 and µ(1) = 1−1/ 2. This optimal design√ measure only uses two different points x. Because the fraction of data at 2 − 1 is not a rational number we cannot actually find a finite set of points with√ empirical distribution µ.√ Instead, we would approximate it taking roughly n/ 2 observations at x = 1/ 2 and the rest at x = 1. Pn For a design measure D-optimality maximizes − log det(M(µ)), If µ = i=1 ωi1xi Pn T then for fixed xi, we get a convex function i=1 ωififi of (ω1, . . . , ωn) because log(det(·)) is a convex operation on matrices. Some algorithms use a large n and then get most ωi equal to zero or close to it. It is typical for the optimal design measure to have p points xi with positive probabililty where p is the number of parameters in the regression model. This leaves us with no way to estimate σ2 or test whether a model with more than p parameters would be suitable. Perhaps that is not surprising. Criteria like D-optimality do not include either variance estimation or testing goodness of fit. The general equivalence theorem is an important result of Kiefer and Wolfowitz (1960). It is that D-optimality and G-optimality are equivalent for design measures that continuously weight the same set of xi. A special property of D-optimality is equivariance. If we change units from meters to centimeters, the D-optimal designs scale accordingly and we would run exactly the same set of physical experiments either way. Generally replacing ˜ ∗ fi by A × fi + b the new optimal points have fi = Afi + b and the same ωi. Optimal designs are not always good enough to use because they focus only on variance and might require awkward input combinations instead of for in- stance using just three levels of a continuous factor which could be convenient for implementation. In other words, we might not have been able to put all of the important criteria into the objective function or encoded all of the desired constraints. We can however compare a design such as a central composite or Box-Behnken to the optimal design and see how close to 100% efficiency we get.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 12.7. Mixture designs 135

Optimal design can also help us when we have a more complicated constraint region X to work with than the usual cubes and balls. It is not always possible to do the global optimization that we would want in optimal design. Box and Draper (1987) are skeptical about the optimal design approaches sometimes called “alphabetic optimality”. The work some examples that opti- mize mean squared error taking account of bias and variance. The bias comes from the possibility that the model is a polynomial of higher order than the one fit by the response surface model. They find that the optimal designs for mean squared error are similar to those we would get optimizing for bias squared and they can be quite different from what we would find optimizing just for vari- ance like the alphabetic optimality designs do. They ‘match moments’ over the Pn T T region, making (1/n) i=1 fifi = E(ff ). Now we turn briefly to logistic regression. It is not a pre-requisite for this course but many readers will have encountered it. Logistic regression is for binary responses Y ∈ {0, 1}. The model relating Y to x ∈ R is exp(β + β x) Pr(Y = 1|X = x) = 0 1 ≡ p(x; β) 1 + exp(β0 + β1x) and in general it uses Pr(Y = 1 | x) = exp(βTf(x))/(1 + exp(f(x)Tβ)). The likelihood function is

n Y Yi 1−Yi L(β) = p(xi; β) (1 − p(xi; β)) i=1 and βˆ is optained by maximizing log(L(β)) numerically. The design problem is about where to take xi. If we were to set n/2 of the xi = ∞ (or as large as possible) and n/2 of the xi = −∞ (or as small as possible) we would not ordinarily get the best design. We might well get all Yi = 1 at one extreme and all Yi = 0 at the other, with no idea of the shape of the Pr(Y = 1 | x) curve (and a degenerate log likelihood as well). It turns out that the optimal design for estimating β takes n/2 observations at a point x with p(x; β) ≈ 0.15 and n/2 observations and x with p(x; β) ≈ 0.85. This design has the same number of distinct design points as parameters. Those design points depend on the true β. A starting point in this literature is Chaloner and Larntz (1989).

12.7 Mixture designs

Suppose that xij > 0 is proportion of input j in used in observation i, with PJ j=1 xij = 1. In some settings Yi depends mostly on the proportions and not the absolute levels of the inputs. For instance, when mixing paints the ratios will matter more than the absolute amounts to the color of the final produce, assuming that the mixing has been done well. In many recipes, proportions matter much more than absolute amounts.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 136 12. Response surfaces

For k = 3 we then have an experimental region defined by an equilateral triangle with corners (1, 0, 0), (0, 1, 0) and (0, 0, 1). In general, our input space k Pk is a simplex {x ∈ [0, 1] | j=1 xj = 1} with intrinsic dimension k − 1 embed- ded in the k dimensional unit cube. This different shape motivates different experimental designs. See the book by Cornell (0002). The changed space also has consequences for polynomial models. The first order model is

k k k X X X E(Y |x) = a0 + ajxj = (a0 + aj)xj ≡ bjxj, j=1 j=1 j=1 where bj = a0 + aj. We don’t need an intercept term. The Second order model is k k−1 k X X X E(Y |x) = bjxj + bjj0 xjxj0 . j=1 j=1 j0=j+1 2 We don’t need pure quadratic terms because x1 = x1(1 − x2 − x3 − ... − xk), which can be written as the linear term and some cross terms, and of course, the same holds for all xj.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 13

Super-saturated designs

Sometimes the number p of regression variables is larger than the number n of observations we can take. Just as saturated models have one parameter per obervation, supersaturated models have p > n or even p  n. In this chapter we look at how do design for such cases. This leads us to consider Hadamard matrices, some history of ‘random balance designs’, com- pressed sensing, and the Johnson-Lindenstrauss lemma. There is a survey of supersaturated designs in Georgiou (2014). Krahmer and Ward (2011) discusses experimental design for compressed sensing. We have already seen supersaturated designs defined via fractional factorials where there are fewer observations than parameters. Here we focus on problems where there are fewer observations than main effects (plus intercept). A plain least squares fit will then interpolate the data, both signal and noise, assuming as is reasonable that no two predictors xi are equal. There is no obvious way to estimate the error variance and now obvious way to check for lack of fit. The designs we consider are also called screening designs because their goal is to identify the perhaps small number of relatively important predictors. They are well suited to settings where the regression model is thought to be sparse with the majority of regression coefficients either zero or at least negligi- ble.

13.1 Hadamard matrices

We begin with Hadamard matrices that are suitable for saturated settings. The matrix H ∈ {−1, 1}n×n is a Hadamard matrix if HTH = HHT = nI. For

137 138 13. Super-saturated designs instance, with n = 4, + + + + + + − −   + − + − + − − + is a Hadamard matrix. We could use it as a saturated design to fit an intercept and 3 binary variables. We have seen it before as a 23−1 factorial. There is a good account of Hadamard matrices in (Hedayat et al., 1999, Chapter 7). It is definitive apart from a few recent contributions that one can find online either at Wikipedia or Neil Sloane’s web site. The Sylvester construction, which actually pre-dates Hadamard’s interest in these matrices is as follows:        1 1 H1 H1 H2 H2 H1 = 1 ,H2 = = ,H4 = 1 −1 H1 −H1 H2 −H2 and so on with   H2k H2k H2k+1 = H2k −H2k for k > 1 in general. Sylvester’s construction is a special case of a Kronecker construction that works as follows. If A ∈ {−1, 1}n×n and B ∈ {−1, 1}m×m then their Kronecker product is   A11BA12B ··· A1nB A21BA22B ··· A2nB A ⊗ B =   ∈ {−1, 1}nm×nm  . . .. .   . . . .  An1BAn2B ··· AnnB

If A is a Hn and B is an Hm then A ⊗ B is an Hn×m Now if A is an Hn and B is an Hm then A ⊗ B is an Hn×m. The proof is simple and it illustrates some basic rules for manipulating Kronecker products:

(A ⊗ B)T(A ⊗ B) = (AT ⊗ BT)(A ⊗ B) = (ATA) ⊗ (BTB)

= (nIn) ⊗ (mIm)

= (nm)In ⊗ Im

= nmInm.

Every step follows directly from the definition of the Hadamard product, so readers seeing Kronecker products for the first time should take a to check each step. It is known that if a matrix Hn exists then n = 1 or 2 or 4m for some integer m > 1. The Hadamard conjecture is that Hn exists whenever n = 4m for m > 1. There is no known counter example, but matrices for n ∈ {668, 716, 892}

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 13.1. Hadamard matrices 139

n # distinct Hn 1,2,4,8,12 1 16 5 20 3 24 60 28 487 32 13,710,027

Table 13.1: Number of distinct Hadamard matrices of sizes up to 32. From https://oeis.org/A007299 as of October 2020.

have not (as of October 2020) been found and there are 10 more missing cases for n 6 2000. These missing values are not a problem for experimental design. If we want H668 but have to use H772 instead, it would not be a costly increase in sample size. The most plausible uses for such large matrices are in software and four additional function evaluations are unlikely to be problematic. Two Hadamard matrices are equivalent if one can be turned into the other by permuting its rows, or by permuting its columns, or by flipping the signs of an entire row or by flipping the signs of an entire column. The number of distinct (non-equivalent) Hadamard matrices that exist for some small values of n are in Table 13.1. Given a Hadamard matrix we can always find an equivalent one whose first row and column are all +1. Hadamard matrices are often depicted in this form. In an experiment we would then use the first column to represent the intercept and the next n − 1 columns for n − 1 predictor variables. What we get is a Resolution III design (main effects clear) in n − 1 binary variables. In addition to the Sylvester construction there is a construction of Paley (1933) that is worth noting. If n = 4m and s = n − 1 = pr for a prime number p and exponent r > 1, then Paley’s first construction provides Hn. The construction is available whenever pr ≡ 3 mod 4. Figure 13.1 shows one of these matrices for n = 252 and prime number p = 251 ≡ 3 mod 4. Apart from a border of +1 at the left and top, each row of this matrix is a cyclic shift of the row above it. That means we can construct the matrix ‘on the fly’ and need not store it. That feature would be very useful for n > 106. There is a construction in (Hedayat et al., 1999, Theorem 7.1) that is quite simple to use for a prime number p mod 4. For n − 1 = pr it would require finite field arithmetic that when r = 1 reduces to arithmetic modulo p. Note that the theorem gives a first  row of Hn equal to 1 −1 −1 · · · −1 instead of all 1s. However, we would randomize all the colunns once constructed. (Be sure to pick one randomization for each of the n − 1 columns and use it for all n rows.) Paley (1933) has a second construction, but it does not have quite the same simply implemented striping pattern. We can use foldovers of Hadamard matrices. First split the intercept column

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 140 13. Super-saturated designs

Paley Type 1 Hadamard Matrix n=252

Figure 13.1: Image of a Hadamard matrix constructed using Paley’s first construction with prime p = 251. Raw data from http://neilsloane.com/ hadamard/had.252.pal.txt downloaded October 2020.

4m×4m−1 off producing H˜4m ∈ {−1, 1} as follows 1  . ˜  4m×4m H4m = . H4m ∈ {−1, 1} . 1

Then flip all the non-intercept columns yielding

1   H˜4m .  8m×4m .   ∈ {−1, 1} ˜ 1 −H4m

Any three columns of this matrix have all eight of {−1, 1}3 m times each.

© Art Owen 2020 Dec 6 2020 version. Do not distribute or post electronically without author’s permission 13.2. Group testing and puzzles 141

13.2 Group testing and puzzles

Suppose that n n×p Y = Xβ + ε ∈ R X ∈ R p > n. Finding β is doable if β is sparse, having only a few nonzero entries. As a familiar example of recent interest, suppose that one is doing group testing of blood samples. For convenience we might be testing p = 1000 people and we give them labels 000, 001,..., 999. Now suppose that we know that with high probability either none of them have covid or just one of them has it. Suppose further that if we pool samples from 30 people that we can still get a valid indication of whether at least one of those 30 has covid. We could then use group testing. We take three samples from each person. We pool one sample from everybody whose first digit was 0. If that comes back positive then we have narrowed the set of candidates to 100 people. If it does not we can do nine more tests for those with first digits 1 through 9. Next we test groups of people based on their second digits and third digits. If all 30 tests come back negative then we have learned that none of those 1000 people have it. If somebody does have it, then three of the tests will come back positive and we will have identified the person. As an exercise, formulate this group testing into the linear regreassion model above. Figure out what Xij would be and what β is and even what is assumed about ε. Group testing is faster and less expensive when the phenomenon is sparse. There are some closely related ideas in old puzzles about finding which coin in a set has the wrong weight in a small number of weighings. When the person who has to figure this out has a equal arm balance then putting a coin on one sides corresponds to Xij = 1, the other side corresponds to Xij = −1 and leaving the coin off the balance corresponds to Xij = 0. Those puzzles are usually sequential where the outcome of one test informs the following ones.

13.3 Random balance

Random balance is an idea proposed by Satterthwaite (1959) and supported by industry experience of Budne (1959), with a discussion by Youden et al. (1959). The idea is to simply take $x_{ij} \overset{\mathrm{iid}}{\sim} F_j$ for $i = 1, \ldots, n$. There are versions of random balance where all variables are sampled this way, and other settings where perhaps only some of them are sampled randomly while others are balanced carefully in Latin squares, factorial experiments or other such designs. In a two level design we might take all $x_{ij} \overset{\mathrm{iid}}{\sim} U\{-1,1\}$ after rescaling the predictors. The author Satterthwaite is now best known for using the method of moments to approximate a weighted sum of $\chi^2$ random variables by a single scaled $\chi^2$ random variable (Satterthwaite, 1946). Random balance was a provocative proposal and it evoked a strong response.

For instance, Box wrote in his discussion that "The only thing wrong with random balance is random balance". Tukey was more supportive. Satterthwaite defined exact balance between two variables as what we now call orthogonality: under the empirical distribution on that pair of variables, they are independent. Random balance then means that they are sampled from a distribution under which they are independent, with observed values that could violate orthogonality. A variable has random balance with respect to a set of other variables if it is independently sampled conditionally on them. When all variables are sampled randomly and independently the design is one of "pure random balance". If a bad randomly drawn experiment is discarded in favor of trying again, the design is "restricted pure random balance". This process is of renewed interest in the causal inference literature, where it is known as rerandomization. See Morgan and Rubin (2012) and Li and Ding (2020) for more.

Much of the controversy around random balance is about its efficiency, or lack thereof. Satterthwaite mentions several settings that favor random balance. Sometimes the data can be collected very cheaply. Sometimes random balance simplifies the administration of an experiment. A related point is that people with very little statistical training might be more able to run random balance experiments than others. The reason to include random balance in this section is that one of the use cases was for high dimensional input spaces, and the random balance proposal was instrumental in raising this issue. Satterthwaite claims that random balance designs are nearly as efficient as exactly balanced designs, becoming more so as the number of data sets increases (page 121). However the reasoning behind his evaluations is not given.

Budne (1959) writes in favor of random balance for screening experiments. Those settings have a large number of variables but only a small number of them affect the mean response, or only a small number affect the variance of the response. He then shows some example experiments with a graphical analysis that identifies the few important variables. It is interesting to see the issues brought up in the paper and discussion. The discussion reveals things that the experimenters knew about but might not have emphasized in many of their other writings. Multiple comparisons were well known (Tukey had worked on them a few years earlier). Pooling bias refers to selecting the small effects to create an error estimate (without adequate adjustments). Budne refers to screening experiments for both the mean and the variance. Youden reports running experiments with "dummy treatments", where the same treatment is given two labels, like the A/A tests now used in industry. Youden notes that an experiment to identify the large effects without balancing the smaller ones adds those smaller ones to the noise variance, reducing power. It will then be difficult to find the moderately sized effects. Kempthorne makes a similar comment with more detail about power. Kempthorne also notes that random balance is being proposed without regard to what we now call selective inference issues: the multiple comparisons underlying screening many variables and the bias in forming a variance estimate from the ones found to be smaller.

Box makes some analyses of the efficiency of random balance and in particular of a stepwise approach of Satterthwaite's. He finds random balance mostly inefficient, but one exception arises when there is only one nonzero coefficient (extreme sparsity) and the stepwise approach is used. Tukey was more supportive than the other authors. He mentioned that broadening the base of inference (which is a form of external validity) is a worthy tradeoff for lower efficiency. He also remarked that he learned nearly as much from Budne's scatterplots as from the complete analysis of variance. There was also some discussion about random balance being so easy to do that it gets used more than classical designs that are hard to understand.

In a stepwise analysis, we might begin by estimating a slope for each $x_{ij}$ individually, via

$$ \tilde\beta_j \equiv \frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar x_{\bullet j})\,Y_i \Big/ \frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar x_{\bullet j})^2. $$

For a binary $x_{ij}$ the most efficient allocation would have half of the observations at each of the two levels. With random balance they would be somewhat unequally split. The second inefficiency in random balance is that the terms $x_{ij'}\beta_{j'}$ would raise the variance of $Y_i$ in a regression model to $\sigma^2 + \sum_{j' \ne j} \beta_{j'}^2 \mathrm{var}(x_{ij'})$. On the other hand, in a full regression model the matrix $X^TX/n$ is far from the identity, and then $\det((X^TX)^{-1})$ is infinite for $p > n$ no matter what design is used.
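Here is a minimal R sketch of these one-at-a-time slope estimates (the function name is ours); X is the n by p design and Y the response vector.

marginal_slopes <- function(X, Y) {
  Xc <- scale(X, center = TRUE, scale = FALSE)   # subtract column means
  colSums(Xc * Y) / colSums(Xc^2)                # beta_tilde_j for each column j
}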

13.4 Quasi-regression

Suppose that we know the distribution of the feature vector $f(x_i) \in \mathbb{R}^p$. Then the regression parameter that minimizes squared error in predicting a finite variance value $Y_i$ with a linear combination of these features is

$$ \beta = \big(E(f(x_i)f(x_i)^T)\big)^{-1} E(f(x_i)Y_i). \qquad (13.1) $$

This $\beta$ minimizes $E((Y - f(x)^T\beta)^2)$ whether or not $E(Y \mid x) = f(x)^T\beta$. In linear regression with $n > p$ we estimate $\beta$ by

$$ \hat\beta = \Big(\frac{1}{n} F^TF\Big)^{-1} \frac{1}{n} F^TY \qquad (13.2) $$

where $F \in \mathbb{R}^{n \times p}$ has $i$'th row $f(x_i)$ and $Y \in \mathbb{R}^n$ has $i$'th component $Y_i$. The expectations in (13.1) have been replaced by corresponding sample averages in (13.2) to get $\hat\beta$. A very popular choice is to take $f(x)$ to be products of orthogonal polynomials. The resulting expansion approximating $E(Y \mid x)$ is known as polynomial chaos. Now suppose that the $x_i$ have been sampled at random. When we choose the $x_i$ we may well know $E(f(x_i)f(x_i)^T)$ because we also chose the features.

For instance, if the features are all polynomials and the $x_i$ have a convenient distribution such as uniform or Gaussian, we would have easily computable moments of $f$. In this case, we can estimate $\beta$ by

$$ \tilde\beta = \big(E(f(x_i)f(x_i)^T)\big)^{-1} \frac{1}{n} F^TY, \qquad (13.3) $$
replacing the estimate $(F^TF)/n$ by its expected value. The estimate (13.3) is known as quasi-regression. See An and Owen (2001) and Jiang and Owen (2003) who used quasi-regression to get interpretable approximations to black box functions. Maybe $E(f(x_i)f(x_i)^T)$ is diagonal or block structured. If $E(f(x_i)f(x_i)^T) = I$ then

$$ \tilde\beta = \frac{1}{n}\sum_{i=1}^n f(x_i) Y_i, $$
which can be computed at $O(np)$ cost instead of the $O(np^2)$ that least squares costs. Quasi-regression can be used when $p > n$ and it avoids the $O(p^2)$ space required for linear regression. When $p > n$ then shrinkage estimators as in Jiang and Owen (2003) are advised. Ordinarily $\mathrm{var}(\tilde\beta) > \mathrm{var}(\hat\beta)$ when both are possible. This is a counterexample to the usual rule in Monte Carlo sampling, where plugging in a known expectation in place of a sampled quantity ordinarily helps. Blatman and Sudret (2010) report that sparse regression methods outperform methods based on numerically estimating $E(ff^T)$ and $E(fY)$.
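Here is a small R sketch of quasi-regression (names ours) using features that are orthonormal under the sampling distribution, in this case scaled Legendre polynomials for $x \sim U[-1,1]$, so that $E(f(x)f(x)^T) = I$ and $\tilde\beta$ is just an average.

features <- function(x) cbind(1, sqrt(3) * x, sqrt(5) * (3 * x^2 - 1) / 2)
quasi_regression <- function(x, y) {
  Fx <- features(x)
  colMeans(Fx * y)          # (1/n) F^T Y, valid because E(F^T F / n) = I for these features
}
# x <- runif(10000, -1, 1); y <- exp(x)
# quasi_regression(x, y)    # estimates the best coefficients for these three features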

13.5 Supersaturated criteria and designs

We can see in some of the discussions of Satterthwaite (1959) the beginnings of criteria to improve upon random balance for supersaturated settings. We clearly cannot use D-optimality because we are sure to have the regression matrix $X$ satisfy $\mathrm{rank}(X^TX) \le n < p$. Then $X^TX$ is singular with $\det(X^TX) = 0$ and then effectively "$\det((X^TX)^{-1}) = \infty$". Georgiou (2014) is a survey of criteria and algorithms for supersaturated designs. Another thing that changes in the supersaturated setting is that we will need adaptive methods that decide which $\beta_j$ to estimate and which to leave at a default value like 0. These adaptive methods will have to use the observed $Y_i$ values. In the end the estimate of $\beta$ is not ordinarily a linear combination of the $Y_i$ like it is in least squares. The variance of the estimated $\beta$ becomes more complicated and will depend on what the true $\beta$ is. For instance, if the true $\beta$ is all zeros except the intercept, then methods based on sparsity could be very accurate and even get most of the coefficients exactly right. If instead $\beta$ has all nonzero elements of roughly equal size, perhaps described by $\beta_j \overset{\mathrm{iid}}{\sim} N(0,1)$, then no algorithm can find them. In other words, the accuracy with which $\beta$ can be estimated now depends on the true value of $\beta$, because even though the model relating $E(Y)$ to $X$ is linear in $X$, the estimator of $\beta$ is not linear in $Y_i$.


Booth and Cox (1962), citing Box's discussion of random balance, introduce some criteria. They consider the matrix $X \in \{-1,1\}^{n \times p}$ with $p > n - 1$ and each column of $X$ containing $n/2$ values for each of $\pm 1$. Then they consider the matrix $S = X^TX$, which is proportional to the sample correlations among the predictors.

Their criterion is $\max_{1 \le j < j' \le p} |S_{jj'}|$. They also consider the average of the squared off-diagonal entries,

$$ E(S^2) \equiv \binom{p}{2}^{-1} \sum_{1 \le j < j' \le p} S_{jj'}^2. $$

There is a lower bound
$$ E(S^2) \ge \frac{n^2(p - n + 1)}{(n-1)(p-1)}, $$
where the definition of $E(S^2)$ used in that bound includes the intercept column of all ones. If we normalize each $S_{jj'}$ to $S_{jj'}/n$ then
$$ E((S/n)^2) \ge \frac{p - n + 1}{(n-1)(p-1)}. $$

Let's suppose that $n \gg 1$ and $p - n \gg 1$. Then, ignoring the $\pm 1$s above, we get

$$ E((S/n)^2) \ge \frac{p - n + 1}{(n-1)(p-1)} \approx \frac{p - n}{np} = \frac{1 - n/p}{n}. $$

Exercise: what would we get for $X_{ij} \overset{\mathrm{iid}}{\sim} U\{-1,1\}$? What would we get if half of column $j$ were randomly chosen to be each of $\pm 1$ and column $j'$ were chosen that way too, but independently of column $j$? Georgiou (2014) presents numerous design strategies for the supersaturated setting. Of special interest is the proposal of Lin (1993). This design works with a Hadamard matrix $H_n$. It picks one of the columns and keeps only the $n/2$ runs with +1 in that column. This is appealing because Hadamard matrices are so abundant and both the Sylvester and first Paley constructions are easy to use. Lin's approach provides $n/2$ experimental runs in $p = n - 2$ variables. One of the $n$ columns is lost to the intercept term and one more is lost because of the column chosen to define the selected runs.

The specific column chosen makes little difference. Exercise: compare the $E(S^2)$ criterion that Lin gets to what he would get choosing runs with $-1$ in the given column. Lin (1993) runs forward stepwise regression on data from his design. Phoa et al. (2009) use $L_1$ regularization based on the Dantzig selector of Candes and Tao (2007). Lin (1995) looks for ways to maximize the number $p$ of binary variables in a model subject to a constraint on $\max_{j \ne j'} |S_{jj'}|$. He breaks ties based on the number of pairs at the same maximum level. Practical investigations: how would it go to choose 1/4 or 1/8 et cetera of a Hadamard design? Would Paley or Sylvester constructions be about equally good or would one be better? This last point requires a way to make a fair comparison between Hadamard designs with different values of $n$.
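To experiment with these criteria, here is a rough R sketch (names ours) of Lin's half fraction of a Hadamard matrix and of the average squared off-diagonal criterion; it could be used as a starting point for the exercises and practical investigations above.

lin_design <- function(H, branch_col = 2) {
  H <- H * H[, 1]                  # sign the rows so that column 1 is the intercept
  keep <- H[, branch_col] == 1     # keep the n/2 runs with +1 in the branching column
  H[keep, -c(1, branch_col)]       # n/2 runs in n - 2 two-level factors
}
ave_s2 <- function(X) {
  S <- crossprod(X)                # S_{jj'} = sum_i x_{ij} x_{ij'}
  mean(S[upper.tri(S)]^2)          # average of S_{jj'}^2 over j < j'
}
# D <- lin_design(paley_hadamard(23))   # 12 runs in 22 factors
# ave_s2(D)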

13.6 Designs for compressed sensing

Supersaturated designs are well suited to settings where we have $p > n$ but can reasonably expect $\beta$ to be sparse or nearly so. Donoho (2006) describes compressed sensing and Tibshirani (1996) introduces the lasso, both of which can be used when $p > n$. Krahmer and Ward (2011) discuss experimental design for this setting. This section only surveys the issue, because a detailed discussion goes beyond the prerequisites for this course. In this context it is desirable for the matrix $X \in \mathbb{R}^{n \times p}$ (excluding intercept) to satisfy the restricted isometry property (RIP), defined in terms of a level $\delta \in (0,1)$ and an integer order $k \ge 1$. The vector $v \in \mathbb{R}^p$ is said to be $k$-sparse if $\sum_{j=1}^p 1_{v_j \ne 0} \le k$. Then $X$ satisfies the $(k, \delta)$-RIP if

$$ (1 - \delta)\|v\|^2 \le \|Xv\|^2 \le (1 + \delta)\|v\|^2 $$
holds for all $k$-sparse $v$. Krahmer and Ward (2011) give RIP conditions (and references) under which minimizing $\|\beta\|_1$ subject to $X\beta = Y$ exactly recovers a sparse $\beta$. If $\beta$ is sparse and there is no noise it can be recovered by $L_1$ penalized regression. That is, a very tractable convex optimization problem can be used to find $\beta$, where searching directly for a sparse solution would otherwise be quite costly. There are generalizations to handle additive noise, $Y = X\beta + \varepsilon$ (i.e., measured with noise), but sparsity remains a critical ingredient.

One of the most basic constructions of a matrix with an RIP property is to take a random subset of rows of a Hadamard matrix. The resulting design will satisfy an RIP property with very high probability. Specifically, for $p$ predictors we can ignore the intercept column and take $n$ randomly selected rows (without replacement) from $H_{4m}$ where $4m \ge p + 1$. Compared to the method of Lin (1993) this approach allows $n$ other than $4m/2 = 2m$. When we want half of the rows it makes more sense to use Lin's design because then every column will be equally split between $\pm 1$ values. The designs described for compressed sensing require only very small values of $n$, just over $\delta^{-2} k \log(p)^4$.

A related approach is to choose a random subset of rows from a discrete Fourier matrix. That matrix has complex entries $X_{jk} = \omega^{jk}/\sqrt{n}$ for $0 \le j, k < n$ (note the zero based indexing), where $\omega = \exp(2\pi\sqrt{-1}/n)$.


They split out the real and imaginary parts of $X_{jk}$, doubling the number of columns obtained. (Those $2n$ real valued vectors in $\mathbb{R}^n$ cannot of course be mutually orthogonal.) Another approach they describe is to take $X \in \mathbb{R}^{n \times p}$ with IID $N(0,1)$ or IID $U\{-1,1\}$ entries multiplied by $\sqrt{p/n}$. The amazing thing is that this, after many years, provides some justification for Satterthwaite's random balance.

Part of the argument in Krahmer and Ward (2011) is based on the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984). Think of a saturated design as $p - 1$ column vectors in $\mathbb{R}^p$, orthogonal to each other and to the vector of ones. Now we project those columns $v_1, \ldots, v_{p-1}$ into a lower dimensional space by multiplying them by a matrix $\Phi$. That is, we get $u_i = \Phi v_i \in \mathbb{R}^n$ for $i = 1, \ldots, p-1$ for a matrix $\Phi \in \mathbb{R}^{n \times p}$. The Johnson-Lindenstrauss lemma shows that there is a mapping from $\mathbb{R}^p$ to $\mathbb{R}^n$ that preserves interpoint distances to within a factor of $1 \pm \epsilon$. When that mapping is the linear one described above this means that

$$ (1 - \epsilon)\|v_j - v_{j'}\|^2 \le \|u_j - u_{j'}\|^2 \le (1 + \epsilon)\|v_j - v_{j'}\|^2. $$
The original Johnson-Lindenstrauss lemma allowed a nonlinear but Lipschitz continuous function $u_j = g(v_j)$ instead of the $u_j = \Phi v_j$ that we use here. If the interpoint distances are nearly preserved in this mapping then so are angles. Think of a triangle defined by $v_j$, $v_{j'}$ and $v_{j''}$. If all three interpoint distances are nearly preserved then so are all three angles defined by the points. Now when the $v_j$ are orthogonal (right angles) then the $u_j$ are nearly orthogonal. In the Johnson-Lindenstrauss lemma the number $n$ does not have to be very large compared to $p$. It only needs to be $O(\epsilon^{-2}\log(p))$.
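As a rough numerical illustration (a sanity check rather than a proof of RIP; the scaling and parameter values are our choices), the R sketch below samples rows of a Hadamard matrix and checks that $\|Xv\|^2/\|v\|^2$ is roughly 1 for a few random sparse vectors.

set.seed(1)
H <- paley_hadamard(127)                   # a 128 x 128 Hadamard matrix
n <- 40
X <- H[sample(nrow(H), n), -1] / sqrt(n)   # n random rows, intercept column dropped
k <- 3
for (r in 1:5) {
  v <- numeric(ncol(X)); v[sample(ncol(X), k)] <- rnorm(k)   # a k-sparse vector
  print(sum((X %*% v)^2) / sum(v^2))       # roughly 1, as near-isometry suggests
}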



14 Computer experiments

In a computer experiment we investigate a deterministic function f(x) on what may be a high dimensional domain and where f might be very expensive to evaluate. We then have to carefully pick out which input values to use and how to interpolate/extrapolate to the uncomputed values. That extension can then be a cheap surrogate function which we can explore intensively. The odd thing is that we are used to applying statistical methods in a setting with Y = f(x) + ε. Here we are ordinarily missing the +ε term, which is the usual entry point for statistical methods into a problem setting. What we will do is sample Yi = f(xi) for strategically chosen xi, essentially turning numerical problems into statistical ones. The designs we use are ordinarily described informally as space filling. When f(·) has no measurement error we normally turn to interpolation methods to estimate f(x) at points where we have not sampled. The most common choices are based on kriging, which stems from modeling f(·) as if it were a randomly chosen continuous function. For background on computer experiments, see the texts Santner et al. (2018) and Fang et al. (2006) as well as the articles Sacks et al. (1989), Koehler and Owen (1996) and Roustant et al. (2012). This chapter is based on two lectures. It might have been good to have one lecture for design and another for analysis. That however is not a good fit. The designs and analyses are more naturally developed jointly. This chapter is also a survey of computer experiment ideas; readers will need to follow up in the references for more details.


14.1 Motivating problems

In class we first looked at animations of some computer experiment output. One was a finite element analysis of a car crashing into a post and another showed fuel sloshing in the back of a transport truck that was braking. Similar videos are readily found online though any specific one might disappear. The users of that code can extract quantities of interest like the amount of energy absorbed by the front of the car (and hence not delivered to the passenger compartment) in a collision or the amount of force on the sides and ends of the fuel container. The designer can then vary baffles in the truck or the configuration of the car to get the best tradeoff among numerous quantities of interest. Codes of this type are used in the design of aircraft, semiconductors and factories. They are also used in modeling of climate and oil reservoirs. For instance we might have a model that predicts how much the arctic will warm at night in 10 years time given some assumptions about CO2 emissions and reflectivity of clouds among other things. In settings like this we have

$$ Y = f(x), \qquad x \in \mathbb{R}^d, \quad Y \in \mathbb{R}^q. $$
The function f(·) is usually deterministic and expensive (e.g. 24 hours to run). The number d of input variables may be large and the output dimension q can be large too. The output might even be a curve such as a time trajectory of some scalar quantity. The authors of these codes put knowledge from physics and chemistry into them in order to describe as faithfully as possible what Y will be like given x. The users often have the opposite goal. They may want the x values that give some specific value of Y or that optimize some function g(Y). It is as if they want "$f(\cdot)^{-1}$", the inverse of the physics or chemistry that was used to write f. In computer experiments we make a strategic choice of input points x1, ..., xn and compute f(xi). We then use those function values to answer questions about f(·). Those questions usually depend on values like f(x0) for new points x0 that are not among those n points. We do this by finding an emulator function $\tilde f(\cdot)$. Ordinarily $\tilde f(x_i) = f(x_i)$ for i = 1, ..., n. That is, the emulator usually interpolates the known values exactly. We might be able to evaluate $\tilde f$ millions of times if it is fast enough. There are compelling advantages to computer experiments. They are usually cheaper than physical experiments. Safer too. Design iterations can happen more quickly in computation than physically. For problems involving the future or objects deep in space, computer experiments may be the only option. An interesting feature of computer experiments for statisticians is that we are very used to Y = f(x) + ε. Now, all we have is Y = f(x) without the noise term that is usually the invitation to think statistically. Experimental design has more to offer than just reducing the impact of noise. It also helps one cope with the complexity arising from interactions. In computer experiments, there may be no issue at all about causality. If we change x then f(x) changes in a perfectly predictable way. There can still be issues of external validity arising from the assumptions baked into f.


As an example function, consider the wing weight function from Surjanovic and Bingham (2013). This is available online at https://www.sfu.ca/~ssurjano with some code. The function is
$$ 0.036\, S_w^{0.758} W_{fw}^{0.0035} \Big(\frac{A}{\cos^2(\Lambda)}\Big)^{0.6} q^{0.006} \lambda^{0.04} \Big(\frac{100\, t_c}{\cos(\Lambda)}\Big)^{-0.3} (N_z W_{dg})^{0.49} + S_w W_p. $$

It represents the weight of the wing of an aircraft depending on variables defined at Surjanovic and Bingham (2013). Diaconis (1988) asks whether seeing the functional form is enough to understand a function. We might ask: which variables are most important? How do we get the wing weight below some target? Which variables affect the function monotonically, and what sizeable interactions are there? Even with a closed form expression like the above it is difficult to address these questions. Usually in computer experiments, we don't even have a formula, just code. The questions above are actually trick questions too. Their answers depend on the set X of interesting values x. We may also need to introduce a distribution on x to pose the questions well. Even with a known distribution for x on X, the point that Diaconis raises remains. It is not obvious how to interpret even a moderately complicated function from its formula. Variable importance is mentioned briefly near the end of this chapter.
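For readers who want to reproduce this kind of exploration, here is the formula above coded as an R function (our transcription; variable definitions and ranges are at the Surjanovic and Bingham site, and treating Λ as degrees is our assumption about the parameterization).

wing_weight <- function(Sw, Wfw, A, Lambda, q, lambda, tc, Nz, Wdg, Wp) {
  L <- Lambda * pi / 180                     # sweep angle converted to radians
  0.036 * Sw^0.758 * Wfw^0.0035 * (A / cos(L)^2)^0.6 * q^0.006 *
    lambda^0.04 * (100 * tc / cos(L))^(-0.3) * (Nz * Wdg)^0.49 + Sw * Wp
}
# e.g. wing_weight(175, 260, 8, 0, 30, 0.75, 0.13, 4, 2000, 0.05)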

14.2 Latin hypercube sampling

We seldom want to take a grid of $N = n^d$ points because the cost grows too quickly with $d$. We could just take $x_i \sim U[-1,1]^d$ or $U[0,1]^d$ or $U(\mathcal{X})$ for a set $\mathcal{X}$ of interest. We can however do better than this. The first space filling design to consider is the Latin hypercube sample of McKay et al. (1979). We spread out $x_{1j}, \ldots, x_{nj}$ to fill the range 0 to 1 for each $j = 1, \ldots, d$. That is, we take $\mathcal{X} = [0,1]^d$. For a Latin hypercube sample of $n$ points in $d$ dimensions we take

$$ x_{ij} = \frac{\pi_j(i-1) + u_{ij}}{n}, \qquad 1 \le i \le n, \quad 1 \le j \le d, $$
where the $\pi_j$ are uniform random permutations of $0, 1, \ldots, n-1$ and $u_{ij} \sim U[0,1)$. All $d$ permutations and $nd$ uniform random variables are mutually independent. Many computing environments include methods to make a uniform random permutation. Figure 14.1 shows a small Latin hypercube sample. We see that the $n$ values of $x_{ij}$ for $i = 1, \ldots, n$ are equispaced (stratified) and this is true simultaneously for all $j = 1, \ldots, d$. It balances $nd$ prespecified rectangles in $[0,1]^d$ using just $n$ points. If any one of those inputs is extremely important, we have sampled it evenly without even knowing which one it was. A notable strength of a Latin hypercube sample is that it allows $d > n$ or even $d \gg n$. It is also easy to show that each $x_i \sim U[0,1]^d$.



Figure 14.1: This shows a Latin hypercube sample of n = 25 points in d = 2 dimensions. From Owen (2020).

The name of this design is connected to Latin squares. For Figure 14.1, imagine placing letters A, B, C, ..., Y in the 25 × 25 grid with each letter appearing once per row and once per column. That would be a Latin square. In an LHS we sample within the cells occupied by just one of those 25 letters, perhaps 'A'. This design is also used in computer graphics by Shirley (1991), who gives it the name n rooks because if the sampled points were rooks on an n × n chessboard, no rook could take any other. Figure 14.1 showed points uniformly and randomly distributed inside tiny squares. If we prefer, we could evaluate f at the centers of those squares by taking

$$ x_{ij} = \frac{\pi_j(i-1) + 1/2}{n}, \qquad 1 \le i \le n, \quad 1 \le j \le d. $$
This centered version of Latin hypercube sampling was described by Patterson (1954) for an agricultural setting (crediting Yates).
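A minimal R sketch of both versions (the function name is ours):

lhs <- function(n, d, centered = FALSE) {
  perms <- replicate(d, sample(n) - 1)        # each column is a random permutation of 0, ..., n-1
  u <- if (centered) 0.5 else matrix(runif(n * d), n, d)
  (perms + u) / n
}
# x <- lhs(25, 2)   # each column has exactly one point in each of the 25 strata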

14.3 Orthogonal arrays

Orthogonal array designs are a generalization of Latin hypercube samples that allow us to balance more than one input variable at a time in our sampling. Randomized orthogonal arrays, described below, are suitable designs for exploring deterministic functions.

The definitive reference for orthogonal arrays is Hedayat et al. (1999). There is also Neil Sloane's web site http://neilsloane.com/oadir/ which has results from after the publication of that book as well as some files containing orthogonal arrays.

For an integer base $b \ge 2$, a strength $t \ge 1$, and a dimension $d \ge t$, the matrix $A \in \{0, 1, \ldots, b-1\}^{n \times d}$ is an orthogonal array of strength $t$ if each $n \times t$ submatrix of $A$ has all $b^t$ possible distinct rows the same number $\lambda$ of times. The nomenclature for it is OA($n$, $d$, $b$, $t$). It is clear that $n = \lambda b^t$. Often $\lambda = 1$. Geometrically, if we were to pick any $t$ columns of $A$ and make a $t$-dimensional plot of their values, we would get a full $b^t$ grid, with $\lambda$ copies per point.

In computer experiments, odd things can happen in the "corners". Those are points where two variables are both at extremes, be they maxima or minima or a mix of the two. For instance, our codes might crash there. Orthogonal arrays get points into the corners that Latin hypercube samples could miss.

There are many constructions of orthogonal arrays. We will look at one from Bose (1938). It is very useful and the construction can be understood using just one fact about prime numbers. A prime number is an integer $p \ge 2$ whose only divisors are 1 and itself. That is not the fact about primes; that's the definition. The fact is that if positive integers $a$ and $b$ are such that $ab$ is a multiple of $p$, then at least one of $a$ or $b$ is a multiple of $p$. This follows from the prime factorization theorem.

The Bose construction will give us OA($p^2$, $p+1$, $p$, 2). Therefore we can handle $p+1$ variables, and every $p^2 \times 2$ submatrix has all $p^2$ possible rows once. We will be able to use that design to get into all 4 corners of all $p(p+1)/2$ pairwise scatterplots of our data. We will be sampling in more corners than there are data points. Any two columns of the orthogonal array could be reordered to look like the columns labeled $x$ and $y$ in Table 14.1. Exercise: explain why this gives us $p-1$ mutually orthogonal Latin squares.

It is especially convenient that $p = 11$ is prime. Using $p = 11$ we get $n = 121$ points in $[0,1]^{12}$ after rescaling. All bivariate plots are $11 \times 11$ grids. We could use round number values 0.0, 0.1, 0.2, ..., 1.0. Similarly, for $p = 101$ we get $n = 10{,}201$ points in $[0,1]^{102}$. All bivariate plots are $101 \times 101$ grids. We could use round number values 0.00, 0.01, 0.02, ..., 1.00.

The construction goes as follows. We start with the two columns in Table 14.1, and then we adjoin columns $x + y$, $x + 2y$, $x + 3y$, ..., $x + (p-1)y$, all in arithmetic mod $p$. Note that this construction does not work if $p$ isn't prime. There is a generalization for prime powers $b = p^r$, but when $r > 1$ the generalization does not use arithmetic modulo $b$.

To see why it works, let's ignore the $y$ column for now (leaving an exercise for later). Then any column can be written $x + cy \bmod p$ for some integer $c$. It is enough to show that any pair of columns has all $p^2$ possible vectors $(a_1, a_2) \in \{0, 1, \ldots, p-1\}^2$. Since there are only $n = p^2$ rows, each of those vectors must then be present exactly once.


    x        y
    0        0
    0        1
    .        .
    0      p − 1
    1        0
    .        .
    1      p − 1
    .        .
  p − 1      0
    .        .
  p − 1    p − 1

Table 14.1: Here are the first 2 columns of a Bose OA($p^2$, $p+1$, $p$, 2) orthogonal array. We have labeled them x and y for future use.

Let's pick any two of these columns indexed by $c_1$, $c_2$ with $c_1 < c_2$. The $(a_1, a_2)$ that we need satisfies

$$ \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} = \begin{pmatrix} 1 & c_1 \\ 1 & c_2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \bmod p $$
where $(x, y)^T \in \{0, 1, \ldots, p-1\}^2$ indexes the row of Table 14.1 that we want. We can solve this formally by taking

$$ \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1 & c_1 \\ 1 & c_2 \end{pmatrix}^{-1} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} = \frac{1}{c_2 - c_1} \begin{pmatrix} c_2 & -c_1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}. $$
The matrix multiplication above is well defined in arithmetic modulo $p$, but dividing by the determinant $d = c_2 - c_1$ has to be checked. First, because $c_1 \ne c_2$ (for the two distinct columns) we have $d = c_2 - c_1 \ne 0 \bmod p$, so the determinant is not zero. Next we show that the value "$1/d$" is an integer $d^{-1} \in \{0, 1, \ldots, p-1\}$ such that $dd^{-1} = 1 \bmod p$, and that there is only one such integer. There are $p-1$ candidates $\{1, 2, \ldots, p-1\}$ for $d^{-1}$. Suppose that we multiply all of them by $d$, getting $\{d, 2d, \ldots, (p-1)d\} \bmod p$. None of these can be zero because $0 < j, d < p$ with $jd = 0 \bmod p$ would make $jd$ a multiple of $p$, contradicting the prime number fact above. These $p-1$ products are also distinct modulo $p$ by the same fact, so one of them has to be equal to $1 \bmod p$. As for uniqueness, suppose that there are two values $0 < j_1 < j_2 < p$ with $j_1 d = j_2 d = 1 \bmod p$. Then $(j_2 - j_1)d = 0 \bmod p$, which we have already ruled out. It follows that we get all $p^2$ possible rows from any two columns of the Bose OA apart from cases where one of those columns is labeled '$y$' above. Exercise: use a similar argument for the case of columns $y$ and $x + cy \bmod p$ for $c \ne 0$.
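Here is a short R sketch of the construction (names ours); the commented check confirms strength 2 on a pair of columns.

bose_oa <- function(p) {
  xy <- expand.grid(y = 0:(p - 1), x = 0:(p - 1))    # all p^2 rows of Table 14.1
  x <- xy$x; y <- xy$y
  unname(cbind(x, y, sapply(1:(p - 1), function(c) (x + c * y) %% p)))
}
# A <- bose_oa(11); dim(A)      # 121 x 12
# table(A[, 3], A[, 7])         # every pair of levels appears exactly once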



Figure 14.2: The left panel shows three columns of OA(121, 12, 11, 2) unscrambled. The right panel shows a scramble of them. From Owen (2020).

The Bose OA "balances" $\binom{p+1}{2} p^2 = p^3(p+1)/2$ hyperrectangular subsets of $[0,1]^{p+1}$ with just $p^2$ points. For $p = 101$ we have balanced more than $5 \times 10^7$ strata with only about $10^4$ points.

To use an orthogonal array as a space filling design it is important to randomize the levels. In a randomized orthogonal array (Owen, 1992) we begin with an OA($n$, $d$, $b$, $t$) matrix $A$ and take
$$ x_{ij} = \frac{\pi_j(a_{ij}) + u_{ij}}{b} \qquad\text{or}\qquad x_{ij} = \frac{\pi_j(a_{ij}) + 1/2}{b} $$
for independent uniform random permutations $\pi_j$ of $\{0, 1, \ldots, b-1\}$ and $u_{ij} \sim U[0,1)$. Random offsets $u_{ij}$ produce $x_i \sim U[0,1)^d$ (which is the same distribution as $U[0,1]^d$). These points are dependent because by construction they must avoid each other in any $t$-dimensional coordinate projection. The centered versions might be better for plotting contours of $f$ in the plane given by $x_j$ and $x_{j'}$ for two variables $1 \le j, j' \le d$. Exercise: is a Latin hypercube sample a randomized orthogonal array?

It is important to apply the permutations $\pi_j$. Without permutation, the points will lie in or near two planes and then not fill the space well. See Figure 14.2. Figure 14.3 shows several pairwise scatterplots from a randomized orthogonal array. Tang (1993) perturbs the points of an orthogonal array slightly, to make them also a Latin hypercube sample. Compared to a randomized orthogonal array the result has more uniform univariate marginal distributions.

There are many more orthogonal arrays to choose from. See Hedayat et al. (1999).
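A small R sketch of the randomization step (names ours), usable with the Bose construction sketched earlier:

randomize_oa <- function(A, b, centered = FALSE) {
  n <- nrow(A); d <- ncol(A)
  for (j in 1:d) A[, j] <- (sample(b) - 1)[A[, j] + 1]   # apply a random level permutation pi_j
  u <- if (centered) 0.5 else matrix(runif(n * d), n, d)
  (A + u) / b
}
# X <- randomize_oa(bose_oa(11), b = 11)   # 121 points in [0, 1]^12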



Figure 14.3: Some coordinate projections of randomized orthogonal arrays based on OA(121, 12, 11, 2). From Owen (2020).

As mentioned above, the Bose construction can be generalized to OA($b^2$, $b+1$, $b$, 2) for prime powers $b = p^r$. There are constructions of the form OA($2b^2$, $2b+1$, $b$, 2) for $b = p^r$. These constructions from Bose & Bush or Addelman & Kempthorne nearly double the number of variables that can be explored by doubling the number $n$ of function evaluations. Higher strength constructions due to Bush in 1952 produce OA($b^t$, $b+1$, $b$, $t$) for prime powers $b = p^r$ with $b \ge t-1 \ge 0$. There are also arrays with mixed levels. That is, the number of distinct values can be different from one column to another. Some can be downloaded from http://neilsloane.com/oadir/.

14.4 Exploratory analysis

Given a space filling sample $x_1, \ldots, x_n$, we can compute $Y_i = f(x_i)$, yielding data pairs $(x_1, Y_1), \ldots, (x_n, Y_n)$. We can explore these using statistical and graphical methods. For instance we could select the "good Y values" and see what $x_i$ they have. If $f$ is expensive to compute then we could also do that for the values $\tilde f(x_i)$ for $i = 1, \ldots, N$ with $N \gg n$, once we have worked out how to extend $f$ to an emulator $\tilde f$.

Sobol' sequences are a kind of quasi-Monte Carlo (QMC) sampling described below. Like orthogonal arrays there are advantages to scrambling QMC points. Using 1024 scrambled Sobol' points in $[0,1]^{10}$ as inputs to the wing weight function, we obtain Figure 14.4. When taking this many points is infeasible for our original function $f$, we might do it for an emulator.



Figure 14.4: Histogram of the wing weight function evaluated at 1024 scrambled Sobol' points.

It is surprising to see that the wing weights of interest vary by a factor of about 3. Perhaps the input ranges are quite wide or there are strong effects in the corners. If we were interested in the lightest wings, then we could select out the points with $f(x_i) < 200$ and plot their input values as in Figure 14.5. In many of the scatterplots we see very empty corners and densely sampled corners. For instance, variables 3 and 8 together seem to have a strong impact on whether the wing weight is below 200. Figures like this are exploratory in nature; we might see something we did not anticipate, or we might not see anything that we can interpret.

Figure 14.6 shows linear regression output for the wing weight function on the randomized Sobol' inputs. This function is nearly linear, with $R^2 \approx 0.9833$ (adjusted $R^2$ virtually identical at 0.9831). Because the sampling model is not linear plus noise, the usual interpretation of regression output does not apply. Notwithstanding that, this function is surprisingly close to linear, even though the formula did not look linear. By Taylor's theorem, a smooth and very nonlinear function could look locally linear, especially in a small region not centered at a point where the gradient vanishes. Reading the variables' described ranges online does not make them appear to be very local. Also, a small region of interest in design space would seem like it ought to restrict the wing's weight to a much narrower range than 3-fold. It appears that this nonlinear looking function is actually close to linear over a wide range of inputs. A plain regression as a quadratic polynomial with cross terms and pure quadratic terms scores an adjusted $R^2$ of 0.9997. The function is simpler than it looks.

This is by no means what always happens with computer experiments. It is however also quite common that potentially quite complicated functions that come in real applications are not maximally complex.



Figure 14.5: Scatterplot matrix of sample points for which the wing weight was below 200.

There are additional examples in Constantine (2015). See also Caflisch et al. (1997), who find that a complicated 360-dimensional function arising in financial valuation is almost perfectly additive.

14.5 Kriging

The usual way to predict $f(x_0)$ at a point $x_0 \notin \{x_1, \ldots, x_n\}$ where we have evaluated $f$ is based on a method called kriging from geostatistics. Kriging originated in the work of Krige (1951). Kriging is based on Gaussian process models of the unknown function $f$. Stein (2012) gives the theoretical background. Sacks et al. (1989) applies it to computer experiments drawing on a body of work developed by J. Sacks and D. Ylvisaker. Much of the description below is based on Roustant et al. (2012), who present software for kriging.


> summary(lm( wingwt1024pts ~ rsobo1024pts ))
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     143.64941    1.08843 131.978   <2e-16 ***
rsobo1024pts1    58.76020    0.67690  86.807   <2e-16 ***
rsobo1024pts2     0.16170    0.67693   0.239   0.8112
rsobo1024pts3    78.22236    0.67691 115.557   <2e-16 ***
rsobo1024pts4    -0.00758    0.67697  -0.011   0.9911
rsobo1024pts5     1.50798    0.67699   2.227   0.0261 *
rsobo1024pts6     7.45687    0.67693  11.016   <2e-16 ***
rsobo1024pts7   -62.02511    0.67697 -91.622   <2e-16 ***
rsobo1024pts8   106.70114    0.67698 157.614   <2e-16 ***
rsobo1024pts9    48.60273    0.67693  71.798   <2e-16 ***
rsobo1024pts10    9.46825    0.67694  13.987   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.253 on 1013 degrees of freedom
Multiple R-squared: 0.9833, Adjusted R-squared: 0.9831
F-statistic: 5956 on 10 and 1013 DF, p-value: < 2.2e-16

Figure 14.6: Some regression output for the wing weight function.

For an intuitive understanding of kriging consider Figure 14.7 and suppose we would like to know how much gold there might be per cubic meter at the location marked '?' given the other measured values in that figure. We would like to estimate it by a weighted average of those values. Because the point labeled 1.9 is closest, we should give it more weight than the point labeled 1.7. The three values somewhat larger than 2 should get more weight than the value 1.1 because there are three of them, and the distances from the target point are similar between them and the point labeled 1.7. They should not get triple the weight in aggregate because, by being so close together, we anticipate that the three readings might be somewhat redundant.

To turn these intuitive notions into a formula for computation we introduce a Gaussian model for the gold values with a covariance function to describe how similar nearby points are to each other. First we review properties of the multivariate Gaussian distribution. We assume that the standard normal distribution $N(0,1)$ is familiar. If $z \in \mathbb{R}^d$ has independent $N(0,1)$ components then we write this as $z \sim N(0, I)$ where here 0 is a vector of $d$ zeroes and $I$ is the $d$-dimensional identity matrix. The vector $x = \mu + Cz$ then has the $N(\mu, CC^T)$ distribution. The general case is written $x \sim N(\mu, \Sigma)$ where $\mu = E(x)$ and $\Sigma = \mathrm{cov}(x) = E((x-\mu)(x-\mu)^T)$. If $x \sim N(\mu, \Sigma)$, then $x = \mu + Cz$ for some matrix $C$ satisfying $CC^T = \Sigma$. The matrix square root $C$ is not uniquely determined.


[Figure 14.7 shows measured values 1.1, 1.9, 1.7, 2.4, 2.1 and 2.3 at scattered locations, with the location to be predicted marked '?'.]

Figure 14.7: Hypothetical setting to illustrate kriging.

Linear combinations of Gaussian vectors are also Gaussian: $Ax + b \sim N(A\mu + b, A\Sigma A^T)$. If $x_1 = (x_1, \ldots, x_r)^T$ and $x_2 = (x_{r+1}, \ldots, x_d)^T$ and
$$ x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right) $$
then $x_1 \perp\!\!\!\perp x_2 \iff \Sigma_{12} = 0$. For our present purposes, the most useful property of the multivariate Gaussian distribution is that, if $\Sigma_{22}$ is invertible, then

$$ \mathcal{L}(x_1 \mid x_2) = N\big( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \big). $$

This condition is not very restrictive. If Σ22 is singular then some component of x2 is linearly dependent on the others. We could just condition on those others if they have a nonsingular covariance. More generally, we need only condition on a subvector of x2 that has an invertible covariance.


In a computer experiment context, we have $Y_i = f(x_i)$, $i = 1, \ldots, n$ and we want to know about $f(x_0)$. We adopt a multivariate Gaussian model
$$ Y = \begin{pmatrix} f(x_0) \\ f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix} \sim N(\mu, \Sigma), $$
with $\mu$ and $\Sigma$ chosen as described below. Then, to predict $f(x_0)$ we may use
$$ \tilde f(x_0) = E\big(f(x_0) \mid f(x_1), \cdots, f(x_n)\big). $$

We also get $\mathrm{var}(f(x_0) \mid f(x_1), \cdots, f(x_n))$ for uncertainty quantification, from the multivariate Gaussian distribution. Because $x_0$ could be anywhere in the set $\mathcal{X}$ of interest, we need a model for $f(x)$ at all $x \in \mathcal{X}$. For this, we select functions $\mu(x) = E(f(x))$ defined on $\mathcal{X}$ and $\Sigma(x, x') = \mathrm{cov}(f(x), f(x'))$ defined on $\mathcal{X} \times \mathcal{X}$. The covariance function $\Sigma(\cdot, \cdot)$ must satisfy

$$ \mathrm{var}\Big( \sum_{i=1}^n a_i f(x_i) \Big) = \sum_{i=1}^n \sum_{i'=1}^n a_i a_{i'} \Sigma(x_i, x_{i'}) \ge 0 \qquad (14.1) $$

for all $a \in \mathbb{R}^n$ and all $x_i \in \mathcal{X}$ for all $n \ge 1$, or else it yields negative variances that are invalid. Interestingly, the condition (14.1) is sufficient for us to get a well defined Gaussian process.

There is an interesting question about the precise sense in which $f(\cdot)$ is random. We simply treat it as if the function $f$ were drawn at random from a set of possible functions that we could have been studying. The function is usually chosen to fit a scientific purpose, though perhaps a random function model describes our state of knowledge about $f(\cdot)$. Perhaps not. However, the kriging method is widely used because it often gives very accurate emulator functions $\tilde f(\cdot)$. They interpolate because for $1 \le i \le n$,
$$ \tilde f(x_i) = E\big(f(x_i) \mid f(x_1), \cdots, f(x_n)\big) = f(x_i). $$
Knowledge about $f(\cdot)$ can be used to guide the choice of $\mu(\cdot)$ and $\Sigma(\cdot, \cdot)$. After choosing $\Sigma(\cdot, \cdot)$, we may pick a model like

$$ Y = f(x) = \mu(x) + Z(x) $$
where $\mu(\cdot)$ is a known function and $Z$ is a mean zero Gaussian process with covariance $\Sigma$. This method is known as simple kriging. For $\mu(\cdot)$, we might choose an older/cheaper/simpler version of the function $f$. A second choice, known as universal kriging, takes
$$ Y = f(x) = \sum_j \beta_j f_j(x) + Z(x) $$


where once again $Z(\cdot)$ is a mean zero Gaussian process. Here the $\beta_j$ are unknown coefficients, while the $f_j(\cdot)$ are known predictor functions. They could for instance be polynomials or sinusoids, or, as above, some prior versions of $f$. Sometimes the $\beta_j$ are given a Gaussian prior, and other times they are treated as unknown constants. That makes $\sum_j \beta_j f_j$ a fixed effect term which, in combination with a random effect $Z(\cdot)$, makes this a mixed effects model. It has greater complexity than simple kriging. Roustant et al. (2012) describe how to analyze this situation. The third model we consider is ordinary kriging. Here

$$ Y = f(x) = \mu + Z(x) $$
for an unknown $\mu \in \mathbb{R}$ and a Gaussian process $Z(\cdot)$. We can view this as $Z$ absorbing all of the $f_j(\cdot)$ from universal kriging. Ordinary kriging is the most commonly used method for computer experiments. It has a simpler theory and estimation strategy than universal kriging.
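To make the conditional Gaussian formulas concrete, here is a rough R sketch of simple kriging with a known mean of zero and a squared exponential covariance using a common $\theta$ for every coordinate (all names and parameter values are ours; in practice $\theta$, $\sigma^2$ and the mean are estimated, for example with the software of Roustant et al. (2012)).

sqexp_cov <- function(X1, X2, theta = 10, sigma2 = 1) {
  D2 <- outer(rowSums(X1^2), rowSums(X2^2), "+") - 2 * X1 %*% t(X2)   # squared distances
  sigma2 * exp(-theta * D2)
}
krige_predict <- function(X, y, Xnew, theta = 10, sigma2 = 1, nugget = 1e-8) {
  K  <- sqexp_cov(X, X, theta, sigma2) + nugget * diag(nrow(X))   # Sigma_22, jittered for stability
  k0 <- sqexp_cov(Xnew, X, theta, sigma2)                         # Sigma_12
  m  <- k0 %*% solve(K, y)                                        # Sigma_12 Sigma_22^{-1} y
  v  <- sigma2 - rowSums(k0 * t(solve(K, t(k0))))                 # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
  list(mean = drop(m), var = pmax(v, 0))
}
# X <- lhs(20, 2); y <- sin(5 * X[, 1]) + X[, 2]^2   # using the lhs() sketch from Section 14.2
# krige_predict(X, y, matrix(0.5, 1, 2))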

14.6 Covariance functions for kriging

The usual choice of covariance has $\Sigma(x, x') = \sigma^2 R(x, x')$ for a correlation function $R(x, x') = \mathrm{corr}(f(x), f(x'))$. It is also common to choose a stationary covariance, meaning one with

$$ R(x, x') \equiv R(x - x'). $$

For $x \in \mathbb{R}^d$ we now have R depending just on a difference vector in $\mathbb{R}^d$ instead of depending on two vectors in $\mathbb{R}^d$. A radial or isotropic correlation function takes the form $R(x, x') = \rho(\|x - x'\|; \theta)$ for a parameter θ. While these are commonly used in the earth sciences (see Stein (2012)), computer experiments more commonly use a tensor product model

$$R(x, x') = \prod_{j=1}^d \rho_j(x_j - x'_j) = \prod_{j=1}^d \rho(x_j - x'_j; \theta_j).$$

This model fits with factor sparsity. The value of $\theta_j$ may make some $x_j$ very important and others unimportant. It reduces our problem to finding covariances for the d = 1 case. If all d correlations $\rho_j(\cdot)$ are valid then so is their tensor product. One common choice is the squared exponential covariance

$$\rho(s - t; \theta) = \exp\bigl(-\theta (s - t)^2\bigr)$$
for a parameter θ > 0. This covariance resembles a Gaussian probability density function and so it is sometimes called a Gaussian correlation, though that term is potentially confusing since the adjective 'Gaussian' is already being used in another sense to describe the distribution of f(x). For $s - t = \epsilon$, we get $\exp(-\theta\epsilon^2) \approx 1 - \theta\epsilon^2$. Nearby points are very strongly correlated and as a consequence f(·) must be very smooth. Stein (1989) points out that the realizations of f are infinitely differentiable, which is unrealistically smooth for many applications.

[Figure 14.8 appears here: a Brownian motion path B(t) for t in [0, 1], together with piecewise linear skeletons connecting a few of its points.]

Figure 14.8: A Brownian motion path and some 'skeletons' of it. From Owen (2020).

The exponential covariance has $\rho(s - t; \theta) = \exp(-\theta|s - t|)$, for a parameter θ > 0. For $s - t = \epsilon$, we get $\exp(-\theta|\epsilon|) \approx 1 - \theta|\epsilon|$. The correlation drops off much faster than for the squared exponential covariance. The realizations of f(·) are much less smooth. They are not even differentiable once, resembling instead Brownian motion. This of course is not smooth enough for many applications. Figure 14.8 shows a sample path of Brownian motion, which has a covariance with a decay similar to that of the exponential correlation function. The piecewise linear 'skeletons' there give $\tilde f(x)$ based on the (t, f(t)) values that they connect.

The effect of θ also has implications for the smoothness of the realizations. A large θ implies that the correlation between f(t) and f(t + δ) drops rapidly as δ > 0 is increased. If we think of t as a time, then the process rapidly forgets f(t) when θ is large. This can induce more rapid oscillations whether the sample paths of f are smooth or rough.

A compromise between the exponential and squared exponential correlations can be attained via the Matérn kernels. They are defined in terms of Bessel functions. When the parameter ν is set to m + 1/2, the realizations have m derivatives and the correlation function has a closed form.


[Figure 14.9 appears here: six panels of Matérn process realizations on [0, 1], with ν = 3/2, 5/2, 7/2 in the rows and θ = 2, 10 in the columns.]

Figure 14.9: Some realizations of Gaussian processes with one dimensional Matérn covariances. Larger values of the parameter ν increase their number of derivatives while larger θ values increase the speed at which they oscillate by forgetting their past. From Owen (2020).

Here are the first four of them:

$$\begin{aligned}
\rho(s, t; 1/2) &= \exp(-\theta|s - t|), \\
\rho(s, t; 3/2) &= \exp(-\theta|s - t|)\bigl(1 + \theta|s - t|\bigr), \\
\rho(s, t; 5/2) &= \exp(-\theta|s - t|)\Bigl(1 + \theta|s - t| + \tfrac{1}{3}\theta^2|s - t|^2\Bigr), \quad\text{and} \\
\rho(s, t; 7/2) &= \exp(-\theta|s - t|)\Bigl(1 + \theta|s - t| + \tfrac{2}{5}\theta^2|s - t|^2 + \tfrac{1}{15}\theta^3|s - t|^3\Bigr).
\end{aligned}$$
Notice that ν = 1/2 gives the exponential covariance. Letting ν → ∞ produces the squared exponential covariance. Figure 14.9 shows some sample paths.

To understand the connection between process smoothness and the correlation function, we proceed informally. We begin with the correlation between f at one point and a divided difference of f some place else:

$$\mathrm{cov}\Bigl(\frac{f(x + h) - f(x)}{h},\, f(\tilde x)\Bigr) = \frac{1}{h}\,\mathrm{cov}\bigl(f(x + h) - f(x),\, f(\tilde x)\bigr).$$
Next, if we let h → 0 on both sides, and are not rigorous about limits, we find that
$$\mathrm{cov}\Bigl(\frac{d}{dx} f(x),\, f(\tilde x)\Bigr) = \frac{d}{dx}\,\rho(x, \tilde x).$$


A similarly informal argument gives
$$\mathrm{cov}\Bigl(\frac{d}{dx} f(x),\, \frac{d}{d\tilde x} f(\tilde x)\Bigr) = \frac{\partial^2}{\partial x\,\partial \tilde x}\,\rho(x, \tilde x).$$
For stationary correlations this becomes
$$\mathrm{cov}\Bigl(\frac{d}{dx} f(x),\, \frac{d}{d\tilde x} f(\tilde x)\Bigr) = \frac{\partial^2}{\partial x\,\partial \tilde x}\,\rho(x - \tilde x) = -\rho''(x - \tilde x)$$
and then
$$\mathrm{var}\Bigl(\frac{d}{dx} f(x)\Bigr) = -\rho''(0).$$
The exponential covariance ρ(δ) = exp(−θ|δ|) is not twice differentiable at the origin, and the resulting process does not have a derivative. For a properly rigorous account, see Stein (2012).
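To make the correlation choices above concrete, here is a small sketch in Python (my own illustration, not code from any package cited in these notes; the function names are made up) of the half-integer Matérn, squared exponential and tensor product correlations in the parameterization used in this chapter.

```python
import numpy as np

def matern_halfint(d, theta, nu=2.5):
    """One dimensional Matern correlation at lag d, in the parameterization of
    these notes: exp(-theta*|d|) times a polynomial in theta*|d|.
    nu = 0.5 gives the exponential correlation."""
    u = theta * np.abs(np.asarray(d, dtype=float))
    e = np.exp(-u)
    if nu == 0.5:
        return e
    if nu == 1.5:
        return e * (1.0 + u)
    if nu == 2.5:
        return e * (1.0 + u + u**2 / 3.0)
    if nu == 3.5:
        return e * (1.0 + u + 2.0 * u**2 / 5.0 + u**3 / 15.0)
    raise ValueError("only half integer nu up to 7/2 in this sketch")

def sqexp(d, theta):
    """Squared exponential ('Gaussian') correlation exp(-theta*(s-t)^2)."""
    return np.exp(-theta * np.asarray(d, dtype=float) ** 2)

def tensor_product_corr(x, xprime, theta, nu=2.5):
    """Tensor product correlation R(x, x') = prod_j rho(x_j - x'_j; theta_j)."""
    x, xprime, theta = (np.asarray(v, dtype=float) for v in (x, xprime, theta))
    return float(np.prod([matern_halfint(x[j] - xprime[j], theta[j], nu)
                          for j in range(len(x))]))
```

For instance, `tensor_product_corr([0.1, 0.4], [0.2, 0.9], theta=[2.0, 10.0])` gives the correlation between two points in the unit square with a different rate θ for each coordinate, which is how factor sparsity enters the model.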

14.7 Interpolation, noise and nuggets

Suppose that we obtain $f(x_1), \dots, f(x_n)$ and some partial derivatives like $\partial f(x_1)/\partial x_{1j}$ and even some integrals $\int_A f(x)\,dx$. When f is a Gaussian process for which the derivatives exist, then all of these quantities are jointly Gaussian along with any unmeasured function values, derivatives and integrals. This is a very powerful property. It means that using the recipe

$$E(\text{any unknowns} \mid \text{all the knowns})$$
we get a function $\tilde f$ that matches all known values, derivatives and integrals while also being consistently extendable to the unknown values, derivatives and integrals. It is quite common for automatic differentiation codes to deliver gradients along with function values. Kriging can make use of known gradients to better predict function values. If there is some Monte Carlo sampling embedded in f(x), then we may need to model f as having an unknown true value somewhat different from the value we computed. Let's suppose that $Y_i = f(x_i) + \varepsilon_i$ for measurement noise $\varepsilon_i$ that is independent of the Gaussian process distribution of f. Then

$$\mathrm{cov}(Y_i, Y_{i'}) = \mathrm{cov}\bigl(f(x_i) + \varepsilon_i,\, f(x_{i'}) + \varepsilon_{i'}\bigr) = \mathrm{cov}\bigl(f(x_i), f(x_{i'})\bigr) + \mathrm{cov}(\varepsilon_i, \varepsilon_{i'}) = \sigma_1^2 R(x_i, x_{i'}) + \sigma_0^2 1_{i = i'}$$

if we assume that the $\varepsilon_i$ are IID with mean zero and variance $\sigma_0^2$. Here R(·, ·) may be any one of our prior correlation functions. Now suppose that there is sporadic roughness in f(·) such as small step discontinuities that we described as numerical noise above. We can model that using what is called a nugget effect. The covariance is

$$\mathrm{cov}\bigl(f(x_i), f(x_{i'})\bigr) = \sigma_1^2 R(x_i, x_{i'}) + \sigma_0^2 1_{x_i = x_{i'}}.$$


This looks a lot like the way we handled measurement noise above. What changed is that $1_{i=i'}$ has become $1_{x_i = x_{i'}}$. These would be different if we had two observations $i \ne i'$ with $x_i = x_{i'}$, that is, replicates. A nugget effect is a kind of "reproducible noise".

In class we looked at and discussed Figures 1, 2 and 3 from Roustant et al. (2012). Figure 1 shows $\tilde f(x)$ and 95% confidence bands. It is simple kriging with a quadratic curve $11x + 2x^2$. The covariance is Matérn with ν = 5/2, θ = 0.4 and σ = 5. Figure 2 shows three different θ. The leftmost panel of Figure 3 shows a mean reversion issue. As we move $x_0$ away from the region sampled by $x_1, \dots, x_n$, then $\tilde f(x_0)$ reverts towards (the estimated value of) µ. The dissertation Lee (2017) provides an alternative form of kriging that reverts toward the nearest neighbor.

The interpolating prediction for simple kriging is
$$\tilde f(x_0) = \mu(x_0) + c(x_0)^T C^{-1} (Y - \mu)$$
where
$$Y = (f(x_1), \dots, f(x_n))^T, \qquad c(x_0) = \bigl(\Sigma(x_1, x_0), \dots, \Sigma(x_n, x_0)\bigr)^T,$$
and
$$C = \begin{pmatrix} \Sigma(x_1, x_1) & \cdots & \Sigma(x_1, x_n) \\ \vdots & \ddots & \vdots \\ \Sigma(x_n, x_1) & \cdots & \Sigma(x_n, x_n) \end{pmatrix}.$$
This follows from the multivariate Gaussian model. For $\Sigma(x, \tilde x) = \sigma^2 R(x, \tilde x)$ we can replace Σ(·, ·) by R(·, ·).

One difficulty with kriging is that the cost of the linear algebra ordinarily grows proportionally to $n^3$. This may be ok if f(·) is so expensive that only a tiny value of n is possible. Otherwise we might turn to polynomial chaos or quasi-regression. A second difficulty with kriging is that C is very often nearly singular. Indeed, that is in a sense good news: if $f(x_n)$ is almost identical to the linear combination of $f(x_1), \dots, f(x_{n-1})$ that we would have used to get $\tilde f(x_n)$ from the first n − 1 points, then C is nearly singular.

Ritter (1995) shows that kriging can attain prediction errors $\tilde f(x_0) - f(x_0)$ that are $O(n^{-r-1/2+\epsilon})$ for n evaluations of a function with r derivatives. Here $\epsilon > 0$ hides powers of $\log(n)$. The interpolations in kriging have some connections to classical numerical analysis methods that may explain why they work so well (Diaconis, 1988). When the process is Brownian motion, f(t) = B(t), the predictions are linear splines (in one dimension). For a process that is once integrated Brownian motion, $f(t) = \int_0^t B(x)\,dx$, we find that $\tilde f$ is a cubic spline interpolator.
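As a concrete illustration of the interpolating predictor above, here is a minimal simple kriging sketch in Python (my own illustration of the stated formulas, not the DiceKriging implementation; the names are hypothetical and estimation of θ and σ² is omitted).

```python
import numpy as np

def simple_kriging(x0, X, y, mu, corr, sigma2=1.0, nugget=0.0):
    """Simple kriging prediction of f(x0) from y[i] = f(X[i]) (+ noise).

    mu     : the known mean function mu(x)
    corr   : correlation function R(x, x'), so Sigma(x, x') = sigma2 * R(x, x')
    nugget : variance sigma0^2 added to the diagonal of C; zero gives exact
             interpolation of the observed values
    Returns the conditional mean and variance of f(x0) given the data."""
    n = len(y)
    C = sigma2 * np.array([[corr(X[i], X[j]) for j in range(n)] for i in range(n)])
    C = C + nugget * np.eye(n)
    c0 = sigma2 * np.array([corr(X[i], x0) for i in range(n)])
    resid = np.asarray(y, dtype=float) - np.array([mu(X[i]) for i in range(n)])
    w = np.linalg.solve(C, c0)            # C^{-1} c(x0) without forming an explicit inverse
    mean = mu(x0) + w @ resid             # mu(x0) + c(x0)^T C^{-1} (Y - mu)
    var = sigma2 * corr(x0, x0) - c0 @ w  # conditional variance of f(x0)
    return mean, var
```

With `nugget = 0` and `x0` equal to one of the design points, the returned mean reproduces the observed value, which is the interpolation property noted in Section 14.5.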

14.8 Optimization

One of the main uses of kriging is to approximately find the optimum of the function f(x) on $\mathcal{X}$. That is, we might seek $x_* = \arg\min_x f(x)$.


Given $f(x_1), \dots, f(x_n)$, we could estimate $x_*$ by $\tilde x_* = \arg\min_x \tilde f(x)$ where $\tilde f(x) = E(f(x) \mid f(x_1), \dots, f(x_n))$. However, if we are still searching for $x_*$ then $\tilde x_*$ is not necessarily the best choice for $x_{n+1}$. Suppose for instance that a confidence interval yields $f(\tilde x_*) = 10 \pm 1$ while at a different location $x'$ we have $f(x') = 11 \pm 5$. Then $x'$ could well be a better choice for the next function evaluation. The DiceOptim package of Roustant et al. (2012) chooses $x_{n+1}$ to be the point with the greatest expected improvement, as described next. First let $f^* \equiv \min_i f(x_i)$. Then, if $f(x) < f^*$ we improve by $f^* - f(x)$. Otherwise, we improve by 0. The expected improvement at x is

$$\mathrm{EI}(x) \equiv E\bigl((f^* - f(x))_+ \mid f(x_1), \dots, f(x_n)\bigr).$$

There is a closed form expression for EI in terms of the standard normal density ϕ(·) and cumulative distribution function Φ(·). We could then take $x_{n+1} = \arg\max_x \mathrm{EI}(x)$. Like bandit methods, this approach involves some betting on optimism. The optimization of EI may be difficult because that function could be very multimodal. However, if each evaluation of f(·) takes many hours, then there is plenty of time available to search for the optimum of EI, since it will ordinarily be inexpensive to evaluate. Figure 22 of Roustant et al. (2012) illustrates this process. That paper also describes how to choose multiple candidate points at once, in case the function evaluations can be run in parallel. Balandat et al. (2019) use randomized QMC in their search for the best expected improvement.
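For reference, here is a sketch of the usual closed form for EI under a Gaussian predictive distribution (a generic illustration, not DiceOptim's code); `mean` and `sd` would come from the kriging predictor at x, and `fbest` is $f^* = \min_i f(x_i)$.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, sd, fbest):
    """EI(x) = E[(fbest - f(x))_+] when f(x) | data ~ N(mean, sd^2), written in
    terms of the standard normal pdf (phi) and cdf (Phi):
    EI = (fbest - mean) * Phi(z) + sd * phi(z), with z = (fbest - mean) / sd."""
    mean = np.asarray(mean, dtype=float)
    sd = np.asarray(sd, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (fbest - mean) / sd
        ei = (fbest - mean) * norm.cdf(z) + sd * norm.pdf(z)
    # At points with sd = 0 (e.g., already evaluated points) the improvement is known exactly.
    return np.where(sd > 0, ei, np.maximum(fbest - mean, 0.0))
```

One could then approximate $\arg\max_x \mathrm{EI}(x)$ by evaluating this over a dense set of candidate x values, since EI itself is cheap relative to f.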

14.9 Further designs for computer experiments

Here we consider some optimal designs for computer experiments. There are figures illustrating some of those designs in Koehler and Owen (1996). One criterion is to maximize entropy, choosing $x_1, \dots, x_n$ to
$$\max_{x_1,\dots,x_n} E\bigl(-\log(p(Y))\bigr).$$
Figures 8(abc) of Koehler and Owen (1996) show some of these for d = 2. Another criterion is
$$\min_{x_1,\dots,x_n} \int_{[0,1]^d} E\bigl((\tilde f(x) - f(x))^2\bigr)\,dx$$
to minimize the integrated mean square error (MISE). Figures 9(ab) show some examples for d = 2. The maximum entropy designs place points on the boundary of $[0,1]^2$ while the MISE designs are internal. Some other design approaches presented in Johnson et al. (1990) are called minimax and maximin designs. They are related to packing and covering as described in Conway and Sloane (2013).


The minimax design chooses x1,..., xn to

$$\min_{x_1,\dots,x_n}\ \max_{x_0 \in [0,1]^d}\ \min_{1 \le i \le n} \|x_0 - x_i\|.$$

If you were placing coffee shops at points $x_1, \dots, x_n$ you might want to minimize the maximum distance that a customer (who might be at $x_0$) has to go to get to one of your shops. We can also think of "covering" the cube $[0,1]^d$ with n disks of small radius and centers $x_1, \dots, x_n$. The maximin design chooses $x_1, \dots, x_n$ to

$$\max_{x_1,\dots,x_n}\ \min_{1 \le i \le n,\ i' \ne i} \|x_i - x_{i'}\|.$$
In coffee terms, we would not want any two coffee shops to be very close to each other. We can think of this as successfully "packing" n disks into $[0,1]^d$ without any of them overlapping. Johnson et al. (1990) show that minimax designs are G-optimal in the θ → ∞ limit. This is the limit in which values of f(x) and f(x') most quickly become independent as x and x' move away from each other. Park (1994) looks at ways to numerically optimize criteria such as the above among Latin hypercube sample designs.
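Both criteria are easy to compute for a candidate design. The sketch below is my own illustration (names are hypothetical): the maximin packing criterion is computed exactly, while the minimax covering criterion is approximated by sampling many reference points $x_0$ rather than maximizing over all of $[0,1]^d$.

```python
import numpy as np

def maximin_criterion(X):
    """Smallest pairwise distance among design points X (n x d); a maximin
    design makes this as large as possible ('packing')."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = [np.linalg.norm(X[i] - X[j]) for i in range(n) for j in range(i + 1, n)]
    return min(dists)

def minimax_criterion(X, m=10_000, rng=None):
    """Monte Carlo approximation of the largest distance from a point of
    [0,1]^d to its nearest design point; a minimax design makes this small
    ('covering').  The exact criterion maximizes over every x0 in [0,1]^d."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    x0 = rng.random((m, X.shape[1]))
    nearest = np.min(np.linalg.norm(x0[:, None, :] - X[None, :, :], axis=2), axis=1)
    return nearest.max()
```

A space filling design search would then maximize the first criterion, or minimize the second, over candidate designs such as Latin hypercube samples, as in Park (1994).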

14.10 Quasi-Monte Carlo

Quasi-Monte Carlo methods are designed for numerical integration. They can attain much better convergence rates than plain Monte Carlo does. Niederreiter (1992) and Dick and Pillichshammer (2010) provide comprehensive references. In quasi-Monte Carlo methods we choose $x_1, \dots, x_n$ to minimize a "discrepancy". There are many different notions of discrepancy. We can think of them as distances $\|U[0,1]^d - U\{x_1, \dots, x_n\}\|$ between the discrete uniform distribution on our n points and the continuous uniform distribution on the unit cube. As such they have a "space filling" interpretation. See Figure 14.10.
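Recent versions of SciPy ship generators for such point sets in scipy.stats.qmc. The sketch below is my own illustration, assuming that module is available; it draws plain Monte Carlo, Sobol' and scrambled Sobol' points like those in Figure 14.10 and compares a discrepancy measure for each.

```python
import numpy as np
from scipy.stats import qmc

n, d = 512, 2                                          # a power of 2 suits Sobol' points
mc = np.random.default_rng(0).random((n, d))           # plain Monte Carlo
sob = qmc.Sobol(d, scramble=False).random(n)           # Sobol' sequence (QMC)
rqmc = qmc.Sobol(d, scramble=True, seed=0).random(n)   # scrambled Sobol' (RQMC)

for name, pts in [("MC", mc), ("QMC", sob), ("RQMC", rqmc)]:
    # Smaller discrepancy means closer to the uniform distribution on [0,1]^d.
    print(name, qmc.discrepancy(pts))
```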

14.11 Variable importance

Variable importance is a harder problem than causal inference. Even when there is absolutely no doubt about the causal effect of x on f, we can still debate about ways to measure the importance of variables and rank them. Variable importance involves the "causes of effects" that we avoided in the causal inference chapter in favor of studying "effects of causes". Variable importance is about the extent to which f(x) changes when one component $x_j$ is changed. There are local versions, concerned with small changes to components of x. There are also global sensitivity analysis methods where components of x are changed completely at random.


[Figure 14.10 appears here: three panels of points in the unit square, labeled 512 MC, 512 QMC and 512 RQMC.]

Figure 14.10: The left panel shows 1024 random points. The middle panel shows a Sobol’ sequence. The right panel shows some scrambled Sobol’ points.

For an introduction to global sensitivity analysis, see the book by Saltelli et al. (2008). Some of the key quantities are Sobol’ indices defined in Sobol’ (1993). There is currently a lot of interest in explaining black box functions, such as those in artificial intelligence and machine learning. See Molnar (2018), Ribeiro et al. (2016) and Lundberg and Lee (2017) for an entry to that literature. For some methods that take special care about dependent input variables, see Owen and Prieur (2017) and Mase et al. (2019).
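As one concrete global method, here is a small pick-freeze sketch of the first order Sobol' indices $S_j = \mathrm{Var}(E(f(x)\mid x_j))/\mathrm{Var}(f(x))$. It is my own illustration of the general idea, not any particular package's estimator, and the names are hypothetical.

```python
import numpy as np

def first_order_sobol(f, d, n=4096, rng=None):
    """Pick-freeze estimates of the first order Sobol' indices
    S_j = Var(E(f(x)|x_j)) / Var(f(x)) for x ~ U[0,1]^d.

    Two independent uniform samples A and B are drawn.  C_j equals B except
    that column j is 'frozen' to the values in A, so f(A) and f(C_j) share
    only x_j and their covariance estimates Var(E(f(x)|x_j))."""
    rng = np.random.default_rng(rng)
    A = rng.random((n, d))
    B = rng.random((n, d))
    yA = f(A)
    S = np.empty(d)
    for j in range(d):
        Cj = B.copy()
        Cj[:, j] = A[:, j]
        yC = f(Cj)
        S[j] = np.cov(yA, yC)[0, 1] / np.var(yA, ddof=1)
    return S

# Example on an additive test function where the answers are known:
# f(x) = x_1 + 2*x_2 has S_1 = 1/5 and S_2 = 4/5.
S = first_order_sobol(lambda X: X[:, 0] + 2 * X[:, 1], d=2)
```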


15 Guest lectures and hybrids of experimental and observational data

We had two guest lectures by people using experimental design and developing new methods to handle the new problems. We also had a lecture on methods to mix some randomization in with what would otherwise be an observational causal inference.

15.1 Guest lecture by Min Liu

We had a lecture by Min Liu from LinkedIn. Min Liu has an M.S. in Statistics from Stanford where she took this course. Her talk was entitled "Online Experimentation at LinkedIn". They face great challenges in measuring the causal impact of changes to their product. People with LinkedIn accounts (members) are connected to each other. Changes to the experience of one member might affect the behavior of others. That may then produce a SUTVA violation. Very small and hard to detect effect sizes can be economically meaningful because of the scale (675 million members at the time of that presentation). There are over 3000 different metrics to track when deciding whether to launch a new feature or not. It is not reasonable to expect a change that improves some metrics to not be detrimental to some others. Where possible, they like to get an experiment to completion within two weeks. They need to go beyond mean responses and they find that quantiles are very useful. For instance a variable like page load time is important to the user experience. Raising all page load times by 0.5 seconds is meaningfully

different from raising 10% of them by 5 seconds. They compare the 50th and 90th percentiles within the A and B populations. Comparing quantiles is more complicated than comparing means, and bootstrapping is too slow at scale. Some networks can be chopped up into pieces that barely overlap at all, and then treatments can be randomized to those pieces. This becomes very difficult in networks of people where some may have thousands of neighbors. A second SUTVA violation arises in two-sided markets visualized as bipartite graphs. Think of links between advertisers and members. An experiment on one side of the graph can affect participants on the other and indirectly spill over to the first side. What that means is that, for instance, a difference observed between members in treatment and control groups might not end up as the real difference seen when making the change for all members.

15.2 Guest lecture by Michael Sklar

We had a lecture by Michael Sklar, a PhD candidate at Stanford working with Professor T.-L. Lai, entitled "Trial Design for Precision Medicine + Applications to Oncology". His lecture focused on the high and rising costs of pharmaceutical research in the US and how this is spurring the development of new, complex experimental trial designs. There is an especially great need for new designs for cancer drugs because drug development for oncology has an unusually low success rate (3 percent versus 20 percent outside of oncology). One method he described is the basket trial, where for instance one drug is tested against multiple cancer types within one experiment. Another is the umbrella trial, in which multiple drugs are tested against one cancer type. The third kind was the platform trial where, similarly to a bandit method, the protocol calls for algorithmic addition or exclusion of new treatment arms over time. A platform trial might also be a basket trial or an umbrella trial. The term master protocol is used to describe basket, umbrella and platform trials.

15.3 First hybrid

Sometimes we have observational and experimental data on the same phenomenon. It would be worthwhile to use them both together, especially if the resulting method is better than either of them on their own. In other settings we might face resistance to doing an experiment. It may then still be possible to inject a small amount of randomness into a plan to gather data. This lecture presented results from Rosenman et al. (2018) on merging a small experimental data set with a larger observational one. The motivating setting is that a large insurer or national health organization might have enormous observational records along with a small randomized clinical trial on the same disease.


One of the methods was based on a causal inference approach involving propensity scores. The propensity e(x) = Pr(W = 1 | x) is simply the chance of getting the experimental treatment given the covariates x. See Imbens and Rubin (2015) for an explanation of how propensity methods can be used to estimate a causal effect, as well as the additional assumptions one must make in order for the causal interpretation to be justified. One approach to estimating the causal effect of a treatment is to stratify the population based on the values $e_i = e(x_i)$. The treatment effect in each stratum is estimated by the simple difference between average Y values for control and treated stratum members. The overall treatment effect is a weighted average of stratum values.

The first proposal in Rosenman et al. (2018) is to simply find counterfactual propensities $e(x_i)$ for the subjects i in the randomized trial. Those subjects are then added to the corresponding propensity strata of the observational data and contribute to the averages there. This is called the spike-in method. There are several other proposed methods, some designed to fix possible biases in the spike-in method.

The Women's Health Initiative has data of this type relating hormone therapy to coronary heart disease (among other responses). It was a good test case for these methods because it had both observational and experimental studies of this issue. Furthermore, the experiment was large enough that it could be split into two subsets, with one of them held out to define the true treatment effect and the other combined with the observational data to estimate that effect. The spike-in method turned out to have less bias than simply using the large observational data set, less variance than using just a smallish experiment, and less mean squared error than either study had on its own.
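The stratified estimator described above can be written in a few lines. The sketch below is my own schematic of propensity stratification, not the code or exact estimator of Rosenman et al. (2018); the spike-in idea corresponds to appending the trial subjects, with their counterfactual propensities, to y, w and e before calling it.

```python
import numpy as np

def stratified_effect(y, w, e, n_strata=5):
    """Propensity-stratified treatment effect estimate.
    y : outcomes, w : treatment indicators (0/1), e : estimated propensities.
    Strata are propensity quantile bins; the overall estimate is a weighted
    average of within-stratum mean differences, weighted here by stratum size
    (other weightings are possible)."""
    y, w, e = map(np.asarray, (y, w, e))
    edges = np.quantile(e, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, e, side="right") - 1, 0, n_strata - 1)
    total, n = 0.0, len(y)
    for s in range(n_strata):
        idx = strata == s
        treated, control = y[idx & (w == 1)], y[idx & (w == 0)]
        # A simplification: a stratum lacking treated or control units is skipped.
        if len(treated) and len(control):
            total += idx.sum() / n * (treated.mean() - control.mean())
    return total
```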

15.4 Second hybrid

The second hybrid method from that lecture was about the tie-breaker design as analyzed by Owen and Varian (2020). That design inserts some randomness into a regression discontinuity design, or RDD. In an RDD we have an assignment variable x with a threshold t. Subjects with $x_i > t$ get the treatment while subjects with $x_i \le t$ get the control. In an observational setting we might suppose that subjects with $x_i$ barely larger than t are almost the same as subjects with $x_i$ barely smaller than t, at least in terms of how they would respond to the treatment. An RDD then looks for a discontinuity in the regression function $\mu(x) = E(Y \mid x)$ at the point x = t. The size of the discontinuity may have a causal interpretation. See Imbens and Rubin (2015) for more.

In a tie-breaker design there are potentially two thresholds A and B with $A \le t \le B$. If $x_i \le A$ then subject i gets the control. If $x_i \ge B$ then subject i gets the treatment. If $A < x_i < B$ then subject i gets the treatment with probability 1/2. Tie-breaker designs have been used to award scholarships (!).

The paper Owen and Varian (2020) was motivated by loyalty reward programs that companies might offer to their best customers. For instance they might offer an upgrade to the top 10% of customers ranked in some way appropriate

to the business. In a tie-breaker they could offer it instead to the top 5% of customers and randomly to half of the next 10% of customers. The analysis in Owen and Varian (2020) shows that statistical efficiency is monotonically increasing in the amount of experimentation. Of course there is a cost issue preventing one from just making all of the awards at random. The paper analyzes that tradeoff. It also shows that there is no benefit to making the award probability take values other than 0%, 50% or 100%, for example on a sliding scale.
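Here is a sketch of the assignment rule itself (my own illustration; the function name and interface are made up): subjects below A get the control, subjects above B get the treatment, and those in between are randomized with probability 1/2.

```python
import numpy as np

def tie_breaker_assign(x, A, B, rng=None):
    """Tie-breaker assignment for assignment variable x with thresholds A <= B.
    Returns 1 for treatment, 0 for control; the 'tie' region A < x < B gets a
    fair coin flip.  Setting A = B recovers (up to the boundary point) the
    sharp regression discontinuity rule."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    z = np.where(x >= B, 1, 0)                       # clear winners get the treatment
    tie = (x > A) & (x < B)
    z = np.where(tie, rng.integers(0, 2, size=x.shape), z)
    return z
```

In the loyalty program example, B would be the 95th percentile of the ranking score and A the 85th, so that the top 5% are treated for sure and the next 10% are randomized.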

16 Wrap-up

The final lecture of the class had several of the students presenting their final projects. These were about understanding or tuning household tasks like cuisine or entertaining young children or morning wakeup rituals or hobbies such as horticulture. It was very nice to see a range of design ideas. From a survey of the class it seemed that fractional factorials and analysis of covariance ideas turned out to be most widely used. Prior to those examples was a short note summarizing the topics of the course. That was preceded by a brief overview of the statistics problem in general.

16.1 What statistics is about

At a very high level, our primary challenge in statistics is to say something about numbers we don't have using numbers we do have. In prediction settings we want to know about future values of Y for some x, using past $(x_i, Y_i)$ data. Other settings manifest differently but we are still using known values to say something about unknowns. Phrased this way, our task seems at first like it might be impossible. We connect our knowns to unknowns by choosing a model that we can think of as generating both kinds of data, as depicted in the left panel of Figure 16.1. The issue of external validity that we frequently raised involves the model not changing between those two settings. We also need the model to hold for the data we do have. In statistical inference, we reverse the direction of one of the arrows, letting us learn something about the model from the known data. This is a problem of inductive inference, describing the general model from the specific known data.



Figure 16.1: The left figure shows how we envision one and the same statistical model producing both our known data and some unknown data of interest. In inference we reverse the arrow from the model to the known data. Then the known data tell us something about the model (with some quantified uncertainty). From that we can derive consequences about the unknown data.

Using what we know about the model, deductive inference lets us derive consequences for the unknown data. Induction leaves us with some uncertainty about the model. When we derive something about the unknown data we can propagate that uncertainty. One could argue that it is logically impossible to learn the general case from specific observations. For a survey of the problem of induction in philosophy, see Henderson (2018). We do it anyway, accepting that the certainty possible in mathematics may not be available in other settings.

A famous observation from George Box is that all models are wrong, but some are useful. Nobody, not even Box, could give us a list of which one is useful when. As applied statisticians, it is our responsibility to do that in any case that we handle. There are settings where we believe that we can get a usable answer from an incorrect model. Sometimes we know that small errors in the model will yield only small errors in our inferences. This is a notion of robustness. In other settings we can get consequences from our model that can be tested later in new settings. This is a notion of validation, as if we were doing "guess and check". Then, even if the model had errors we can get a measure of how well it performs.

There are approaches to inference that de-emphasize or perhaps remove the model. We can imagine the path being like the right hand side of Figure 16.2 that avoids the model entirely and then unlocks more computational possibilities. There may well be an implicit model there, such as unknown (x, Y) pairs being IID from the same distribution that produced the known ones. IID sampling would provide a justification for using cross-validation or the bootstrap. For a discussion of the role of models in statistics and whether they are really necessary, see Breiman (2001).



Figure 16.2: Sketch of an algorithmic approach to learning about unknown data from known data. There is only a remnant of a model.

The setting we considered in this course relates to the usual inference problem as shown in Figure 16.3. We were down in the lower left hand corner looking at how to make the data that would then be fed into the inferential machinery.

16.2 Principles from experimental design

Statistical inference from data faces several obstacles: 1) cost of data, 2) confounding of causes, 3) correlation of predictors, 4) noise, 5) interactions, 6) missing variables, and 7) external validity. In the face of these obstacles, experimental design offers the following techniques: a) Randomization, b) Blocking and balancing, c) Factorials, d) Fractional replication, e) Covariate adjustments (ancova), f) Adaptation (bandits and sequential DOE), g) Nesting and split-plots, h) Random effects models, and i) Replication.

Obstacle 1, the cost of data, is often forgotten in data analysis because, once the data are available, that cost is not pertinent to its analysis. It is a sunk cost. Cost is important in experimental design and many of the designs we saw were chosen to reduce that cost.



Figure 16.3: The place held by experimental design in statistical inference.

For instance, in fractional factorial experiments, we would purposely sacrifice statistical correctness (e.g., an unbiased variance estimate) in order to study more variables at a fixed cost. Nesting and split-plot designs take advantage of the fact that some experimental factors are cheaper to change than others. Adaptive sampling via bandits is cost driven. It can reduce the cost incurred on the experimental units themselves. Sequential experiments also counter the cost of continuing to experiment once the better treatment has been clearly identified.

Confounding of potential causes was one of the primary motivations for the use of randomization. Randomization reduces the risk that some important cause other than the treatment is perfectly or even strongly associated with the treatment. Missing or unmeasured variables interfere with causal claims. We can think of them as continuously varying quantities that cause similar problems to confounding. Randomization makes it unlikely that those missing variables are strongly associated with the treatment.

Correlated predictors can be problematic in regression settings because they make it harder to tell which variable is important. Many of the designs we studied produced perfectly uncorrelated predictors. Regression methods average away the noise. In optimal designs we found ways to minimize the effect of noise on the variance of regression coefficients. Replication raises the sample size and thus helps us reduce noise. We also used blocking methods to get better comparisons of "like with like" by arranging that treatment and control would both be applied to similar experimental units,

with similarity defined by one or more categorical variables. The analysis of covariance was useful to balance out the impacts of continuous variables. Interactions severely complicate interpretation of the effect of variables. We looked at factorial designs that allowed us to estimate those interaction effects. External validity is a critical problem for causal inferences. Having an experiment contain a wide variety of settings helps to improve its external validity. Random effect models also provide greater external validity.


Bibliography

Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pages 39–1.

Agrawal, S. and Goyal, N. (2013). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135.

An, J. and Owen, A. (2001). Quasi-regression. Journal of complexity, 17(4):588–607.

Angrist, J. D. and Pischke, J.-S. (2008). Mostly harmless econometrics: An empiricist's companion. Princeton University Press.

Angrist, J. D. and Pischke, J.-S. (2014). Mastering ‘metrics: The path from cause to effect. Princeton University Press, Princeton, NJ.

Atkinson, A., Donev, A., and Tobias, R. (2007). Optimum experimental designs, with SAS. Oxford University Press.

Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. (2019). Botorch: Programmable Bayesian optimization in PyTorch. Technical report, arXiv:1910.06403.

Berger, P. D., Maurer, R. E., and Celli, G. B. (2018). Experimental Design. Springer, Cham, Switzerland, second edition.

Blatman, G. and Sudret, B. (2010). Efficient computation of global sensitivity indices using sparse polynomial chaos expansions. Reliability Engineering & System Safety, 95(11):1216–1229.

Booth, K. H. and Cox, D. R. (1962). Some systematic supersaturated designs. Technometrics, 4(4):489–495.


Bose, R. C. (1938). On the application of the properties of Galois fields to the problem of construction of hyper-Graeco-Latin squares. Sankhyā: The Indian Journal of Statistics, pages 323–338.

Box, G. E., Hunter, J. S., and Hunter, W. G. (2005). Statistics for experimenters: design, innovation, and discovery. Wiley, New York.

Box, G. E. (1957). Evolutionary operation: A method for increasing industrial productivity. Journal of the Royal Statistical Society: Series C, 6(2):81–101.

Box, G. E. and Behnken, D. W. (1960). Some new three level designs for the study of quantitative variables. Technometrics, 2(4):455–475.

Box, G. E. and Draper, N. R. (1987). Empirical model-building and response surfaces. John Wiley & Sons, New York.

Box, G. E., Hunter, W. H., and Hunter, S. (1978). Statistics for experimenters, volume 664. John Wiley and sons, New York.

Box, G. E. and Wilson, K. B. (1951). On the experimental attainment of optimum conditions. Journal of the Royal Statistical Society: Series B, 13(1):1–38.

Breiman, L. (2001). Statistical modeling: The two cultures. Statistical science, 16(3):199–231.

Brown, B. W. (1980). The crossover experiment for clinical trials. Biometrics, pages 69–79.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721.

Budne, T. A. (1959). The application of random balance designs. Technometrics, 1(2):139–155.

Byrne, D. M. and Taguchi, S. (1987). The Taguchi approach to parameter design. Quality progress, 20(12):19–26.

Caflisch, R. E., Morokoff, W., and Owen, A. B. (1997). Valuation of mortgage backed securities using Brownian bridges to reduce effective dimension. Journal of Computational Finance, 1(1):27–46.

Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The annals of Statistics, 35(6):2313–2351.

Chaloner, K. and Larntz, K. (1989). Optimal Bayesian design applied to logistic regression experiments. Journal of Statistical Planning and Inference, 21(2):191–208.

Cochran, W. G. (1977). Sampling techniques. John Wiley & Sons, New York.


Constantine, P. G. (2015). Active subspaces: Emerging ideas for dimension reduction in parameter studies. SIAM, Philadelphia.

Conway, J. H. and Sloane, N. J. A. (2013). Sphere packings, lattices and groups, volume 290. Springer Science & Business Media.

Cornell, J. A. (2002). Experiments with mixtures: designs, models, and the analysis of mixture data. John Wiley & Sons, New York, third edition.

Cox, D. R. (1958). Planning of experiments. Wiley, New York.

Daniel, C. (1959). Use of half-normal plots in interpreting factorial two-level experiments. Technometrics, 1(4):311–341.

Diaconis, P. (1988). Bayesian numerical analysis. In Statistical Decision Theory and Related Topics IV, in two volumes, volume 1, pages 163–176.

Diaconis, P. and Gangolli, A. (1995). Rectangular arrays with fixed margins. In Aldous, D., Diaconis, P., Spencer, J., and Steele, J. M., editors, Discrete probability and algorithms, volume 72. Springer, New York.

Dick, J. and Pillichshammer, F. (2010). Digital nets and sequences: discrepancy theory and quasi-Monte Carlo integration. Cambridge University Press, Cambridge.

Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306.

Efron, B. (1971). Forcing a sequential experiment to be balanced. Biometrika, 58(3):403–417.

Efron, B. and Stein, C. (1981). The jackknife estimate of variance. Annals of Statistics, 9(3):586–596.

Fang, K.-T., Li, R., and Sudjianto, A. (2006). Design and modeling for computer experiments. CRC press, Boca Raton, FL.

Fisher, R. and Mackenzie, W. (1923). Studies in crop variation: The manurial response of different potato varieties. Journal of Agricultural Sciences, 13:311–320.

Fleiss, J. L. (1986). Design and analysis of clinical experiments. John Wiley & Sons, New York.

Friedman, J., Hastie, T., and Tibshirani, R. (2001). The elements of statistical learning. Springer, New York.

Georgiou, S. D. (2014). Supersaturated designs: A review of their construction and analysis. Journal of Statistical Planning and Inference, 144:92–109.

Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B, 41(2):148–164.


Hamada, M. and Balakrishnan, N. (1998). Analyzing unreplicated factorial experiments: a review with some new proposals. Statistica Sinica, pages 1–28.

Hedayat, A. S., Sloane, N. J. A., and Stufken, J. (1999). Orthogonal arrays: theory and applications. Springer Science & Business Media, New York.

Henderson, L. (2018). The problem of induction. In The Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/induction-problem/.

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19:293–325.

Imbens, G. (2019). Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics. Technical report, National Bureau of Economic Research.

Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.

Jiang, T. and Owen, A. B. (2003). Quasi-regression with shrinkage. Mathematics and Computers in Simulation, 62(3-6):231–241.

Johari, R., Koomen, P., Pekelis, L., and Walsh, D. (2017). Peeking at A/B tests: Why it matters, and what to do about it. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1517–1525.

Johnson, M. E., Moore, L. M., and Ylvisaker, D. (1990). Minimax and maximin distance designs. Journal of statistical planning and inference, 26(2):131–148.

Johnson, W. B. and Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary mathematics, 26(189–206):1.

Jones, B. and Kenward, M. G. (2014). Design and analysis of cross-over trials. CRC press.

Jones, B. and Nachtsheim, C. J. (2009). Split-plot designs: What, why, and how. Journal of quality technology, 41(4):340–361.

Khuri, A. I. and Mukhopadhyay, S. (2010). Response surface methodology. Wiley Interdisciplinary Reviews: Computational Statistics, 2(2):128–149.

Kiefer, J. and Wolfowitz, J. (1960). The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366.

Kirk, R. E. (2013). Experimental design: Procedures for the Behavioral Sciences. SAGE Publications, Inc., Thousand Oaks, CA.


Koehler, J. R. and Owen, A. B. (1996). Computer experiments. Design and analysis of experiments, 13:261–308.

Kohavi, R., Longbotham, R., Sommerfield, D., and Henne, R. M. (2009). Controlled experiments on the web: survey and practical guide. Data mining and knowledge discovery, 18(1):140–181.

Kohavi, R., Tang, D., and Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.

Krahmer, F. and Ward, R. (2011). New and improved Johnson–Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269–1281.

Krige, D. G. (1951). A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Southern African Institute of Mining and Metallurgy, 52(6):119–139.

Kushner, H. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications, volume 35. Springer-Verlag, New York.

Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pages 1091–1114.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22.

Lee, M. R. (2017). Prediction and Dimension Reduction Methods in Computer Experiments. PhD thesis, Stanford University.

Lee, M. R. and Shen, M. (2018). Winner’s curse: Bias estimation for total effects of features in online controlled experiments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 491–499.

Lenth, R. V. (1989). Quick and easy analysis of unreplicated factorials. Technometrics, 31(4):469–473.

Lewis, R. A. and Rao, J. M. (2015). The unfavorable economics of measuring the returns to advertising. The Quarterly Journal of Economics, 130(4):1941–1973.

Li, X. and Ding, P. (2020). Rerandomization and regression adjustment. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Lin, D. K. (1993). A new class of supersaturated designs. Technometrics, 35(1):28–31.

Lin, D. K. (1995). Generating systematic supersaturated designs. Technometrics, 37(2):213–225.


Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774.

Mase, M., Owen, A. B., and Seiler, B. (2019). Explaining black box decisions by Shapley cohort refinement. Technical report, arXiv:1911.00467.

McKay, M. D., Beckman, R. J., and Conover, W. J. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245.

Miller, R. G. (1997). Beyond ANOVA: basics of applied statistics. CRC press, Boca Raton, FL.

Molnar, C. (2018). Interpretable machine learning: A Guide for Making Black Box Models Explainable. Leanpub.

Montgomery, D. C. (1997). Design and analysis of experiments. John Wiley & sons, 4 edition.

Morgan, K. L. and Rubin, D. B. (2012). Rerandomization to improve covariate balance in experiments. The Annals of Statistics, 40(2):1263–1282.

Murray, D. M. (1998). Design and analysis of group-randomized trials, volume 29. Oxford University Press, USA.

Myers, R. H., Khuri, A. I., and Carter, W. H. (1989). Response surface methodology: 1966–1988. Technometrics, 31(2):137–157.

Myers, R. H., Montgomery, D. C., and Anderson-Cook, C. M. (2016). Response surface methodology: process and product optimization using designed experiments. John Wiley & Sons, New York, fourth edition.

Nair, V. N., Abraham, B., MacKay, J., Box, G., Kacker, R. N., Lorenzen, T. J., Lucas, J. M., Myers, R. H., Vining, G. G., Nelder, J. A., Phadke, M. S., Sacks, J., Welch, W. J., Shoemaker, A. C., Tsui, K. L., Taguchi, S., and Wu, C. F. J. (1992). Taguchi’s parameter design: a panel discussion. Technometrics, 34(2):127–161.

Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods. S.I.A.M., Philadelphia, PA.

Owen, A. B. (1992). Orthogonal arrays for computer experiments, integration and visualization. Statistica Sinica, 2(2):439–452.

Owen, A. B. (2017). Confidence intervals with control of the sign error in low power settings. Technical report, Stanford University.

Owen, A. B. (2020). Monte Carlo theory, methods and examples. https: //statweb.stanford.edu/~owen/mc/.


Owen, A. B. and Launay, T. (2016). Multibrand geographic experiments. Tech- nical report, arXiv:1612.00503.

Owen, A. B. and Prieur, C. (2017). On Shapley value for measuring importance of dependent inputs. SIAM/ASA Journal on Uncertainty Quantification, 5(1):986–1002.

Owen, A. B. and Varian, H. (2020). Optimizing the tie-breaker regression discontinuity design. Electronic Journal of Statistics, 14(2):4004–4027.

Paley, R. E. (1933). On orthogonal matrices. Journal of Mathematics and Physics, 12(1-4):311–320.

Park, J.-S. (1994). Optimal Latin-hypercube designs for computer experiments. Journal of statistical planning and inference, 39(1):95–111.

Patterson, H. D. (1954). The errors of lattice sampling. Journal of the Royal Statistical Society, Series B, 16(1):140–149.

Pearl, J. and Mackenzie, D. (2018). The book of why: the new science of cause and effect. Basic Books.

Phadke, M. S. (1989). Quality engineering using robust design. Prentice Hall, New York.

Phoa, F. K., Pan, Y.-H., and Xu, H. (2009). Analysis of supersaturated designs via the Dantzig selector. Journal of Statistical Planning and Inference, 139(7):2362–2372.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, New York. ACM.

Rice, J. A. (2007). Mathematical Statistics and Data Analysis. Duxbury, third edition.

Ritter, K. (1995). Average case analysis of numerical problems.

Rosenman, E., Owen, A. B., Baiocchi, M., and Banack, H. (2018). Propensity score methods for merging observational and experimental datasets. Technical report, arXiv:1804.07863.

Roustant, O., Ginsbourger, D., and Deville, Y. (2012). DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by Kriging-based metamodeling and optimization. Journal of Statistical Software, 51(1).

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989). Design and analysis of computer experiments. Statistical science, pages 409–423.


Saltelli, A., Ratto, M., Andres, T., Campolongo, F., Cariboni, J., Gatelli, D., Saisana, M., and Tarantola, S. (2008). Global Sensitivity Analysis. The Primer. John Wiley & Sons, Ltd, New York.

Santner, T. J., Williams, B. J., and Notz, W. I. (2018). The design and analysis of computer experiments. Springer, New York, second edition.

Satterthwaite, F. (1959). Random balance experimentation. Technometrics, 1(2):111–137.

Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics bulletin, 2(6):110–114.

Scott, S. L. (2010). A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658.

Searle, S. R., Casella, G., and McCulloch, C. E. (1992). Variance components. John Wiley & Sons, New York.

Shirley, P. S. (1991). Physically based lighting calculations for computer graphics. PhD thesis, University of Illinois at Urbana-Champaign.

Siegmund, D. (1985). Sequential analysis: tests and confidence intervals. Springer Science & Business Media.

Smith, K. (1918). On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations. Biometrika, 12(1/2):1–85.

Sobol’, I. M. (1969). Multidimensional Quadrature Formulas and Haar Func- tions. Nauka, Moscow. (In Russian).

Sobol’, I. M. (1993). Sensitivity estimates for nonlinear mathematical models. Mathematical Modeling and Computational Experiment, 1:407–414.

Spall, J. C. (2003). Introduction to stochastic search and optimization: estimation, simulation, and control. John Wiley & Sons, Hoboken, NJ.

Stein, M. L. (1989). Design and analysis of computer experiments: Comment. Statistical Science, 4(4):432–433.

Stein, M. L. (2012). Interpolation of spatial data: some theory for kriging. Springer Science & Business Media.

Surjanovic, S. and Bingham, D. (2013). Virtual library of simulation experiments: test functions and datasets. https://www.sfu.ca/~ssurjano/.

Tang, B. (1993). Orthogonal array-based Latin hypercubes. Journal of the American statistical association, 88(424):1392–1397.


Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288.

Vaver, J. and Koehler, J. (2011). Measuring ad effectiveness using geo experiments. Technical report, https://research.google/pubs/pub38355/.

Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. Journal of applied probability, pages 287–298.

Wu, C. J. and Hamada, M. S. (2011). Experiments: planning, analysis, and optimization, volume 552. John Wiley & Sons.

Yandell, B. S. (1997). Practical data analysis for designed experiments. CRC Press.

Yates, F. (1937). The design and analysis of factorial experiments. Technical Report 35, Imperial Bureau of Soil Science.

Youden, W., Kempthorne, O., Tukey, J. W., Box, G., and Hunter, J. (1959). Discussion of the papers of Messrs. Satterthwaite and Budne. Technometrics, 1(2):157–184.
