Springer Series in Statistics
Advisors: P. Bickel, P. Diggle, S. Fienberg, U. Gather, I. Olkin, S. Zeger
For other titles published in this series, go to http://www.springer.com/series/692

Paul R. Rosenbaum

Design of Observational Studies

Paul R. Rosenbaum
Statistics Department, Wharton School
University of Pennsylvania
Philadelphia, PA 19104-6340, USA
[email protected]
ISBN 978-1-4419-1212-1
e-ISBN 978-1-4419-1213-8
DOI 10.1007/978-1-4419-1213-8
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2009938109
© Springer Science+Business Media, LLC 2010

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is a part of Springer Science+Business Media (www.springer.com).

For Judy

"Simplicity of form is not necessarily simplicity of experience."
Robert Morris, writing about art.

"Simplicity is not a given. It is an achievement."
William H. Gass, writing about literature.

"Simplicity . . . is a very important matter which must be constantly borne in mind."
Sir David Cox, writing about experiments.

Preface
An observational study is an empiric investigation of effects caused by treatments when randomized experimentation is unethical or infeasible. The quality and strength of evidence provided by an observational study is determined largely by its design. Excellent methods of analysis will not salvage a poorly designed study. The line between design and analysis is easier to draw in practice than it is in theory. In practice, the design of an observational study consists of all activities that precede the examination or use of those outcome measures that will be the basis for the study's conclusions. Unlike experiments, in some observational studies, the outcomes may exist as measurements prior to the design of the study; it is their examination and use, not their existence, that separates design from analysis. Aspects of design include the framing of scientific questions to permit empirical investigation, the choice of a context in which to conduct the investigation, decisions about what data to collect, where and how to collect it, matching to remove bias from measured covariates, strategies and tactics to limit uncertainty caused by covariates not measured, and sample splitting to guide design using individuals who will not be included in the final analysis. In practice, design ends and analysis begins when outcomes are examined for individuals who will be the basis of the study's conclusions. An observational study that begins by examining outcomes is a formless, undisciplined investigation that lacks design. In theory, design anticipates analysis. Analysis is ever present in design, as any goal is ever present in any organized effort, as a goal is necessary to organize effort. One seeks to ask questions and collect data so that results will be decisive when analyzed. To end well, how should we begin?
Philadelphia, PA
Paul Rosenbaum
5 August 2009
Acknowledgments
I am in debt to many people: to Jeff Silber, Dylan Small, and Ruth Heller for recent collaborations I describe here in detail; to my teacher, adviser, and coauthor Don Rubin, from whom I learned a great deal; to Ben Hansen, Bo Lu, and Robert Greevy for making network optimization algorithms for matching generally accessible inside R; to colleagues, coauthors or former students Katrina Armstrong, Susan Bakewell-Sachs, Lisa Bellini, T. Behringer, Avital Cnaan, Shoshana Daniel, Gabriel Escobar, Orit Even-Shoshan, Joe Gastwirth, Robert Greevy, Sam Gu, Amelia Haviland, Robert Hornik, Abba Krieger, Marshall Joffe, Yunfei Paul Li, Scott Lorch, Bo Lu, Barbara Medoff-Cooper, Lanyu Mi, Andrea Millman, Kewei Ming, Dan Nagin, Dan Polsky, Kate Propert, Tom Randall, Amy Rosen, Richard Ross, Sandy Schwartz, Tom Ten Have, Richard Tremblay, Kevin Volpp, Yanli Wang, Frank Yoon, and Elaine Zanutto, for collaborations I describe more briefly; to Judith McDonald, John Newell, Luke Keele, Dylan Small, and anonymous reviewers for comments on drafts of the book; to Joshua Angrist, David Card, Susan Dynarski, Alan Krueger, and Victor Lavy for making micro-data from their research available in one form or another. Parts of this book were written while I was on sabbatical. The hospitality of the Department of Statistics at Columbia University and the Department of Economics and the Statistics Program at the National University of Ireland at Galway are gratefully acknowledged. The work was supported in part by the Methodology, Measurement and Statistics Program of the U.S. National Science Foundation. Of course, my greatest debts are to Judy, Sarah, Hannah, and Aaron.
Contents
Part I Beginnings
1 Dilemmas and Craftsmanship
   1.1 Those Confounded Vitamins
   1.2 Cochran's Basic Advice
       1.2.1 Treatments, covariates, outcomes
       1.2.2 How were treatments assigned?
       1.2.3 Were treated and control groups comparable?
       1.2.4 Eliminating plausible alternatives to treatment effects
       1.2.5 Exclusion criteria
       1.2.6 Exiting a treatment group after treatment assignment
       1.2.7 Study protocol
   1.3 Maimonides' Rule
   1.4 Seat Belts in Car Crashes
   1.5 Money for College
   1.6 Nature's 'Natural Experiment'
   1.7 What This Book Is About
   1.8 Further Reading
   References
2 Causal Inference in Randomized Experiments
   2.1 Two Versions of the National Supported Work Experiment
       2.1.1 A version with 185 pairs and a version with 5 pairs
       2.1.2 Basic notation
   2.2 Treatment Effects in Randomized Experiments
       2.2.1 Potential responses under alternative treatments
       2.2.2 Covariates and outcomes
       2.2.3 Possible treatment assignments and randomization
       2.2.4 Interference between units
   2.3 Testing the Null Hypothesis of No Treatment Effect
       2.3.1 Treated−control differences when the null hypothesis is true
       2.3.2 The randomization distribution of the mean difference
       2.3.3 The randomization distribution of Wilcoxon's statistic
   2.4 Testing Other Hypotheses; Confidence Intervals; Point Estimates
       2.4.1 Testing a constant, additive treatment effect
       2.4.2 Confidence intervals for a constant, additive effect
       2.4.3 Hodges-Lehmann point estimates of effect
       2.4.4 Testing general hypotheses about treatment effects
       2.4.5 Multiplicative effects; Tobit effects
   2.5 Attributable Effects
   2.6 Internal and External Validity
   2.7 Summary
   2.8 Further Reading
   2.9 Appendix: Randomization Distribution of m-statistics
   References
3 Two Simple Models for Observational Studies
   3.1 The Population Before Matching
   3.2 The Ideal Matching
   3.3 A Naïve Model: People Who Look Comparable Are Comparable
   3.4 Sensitivity Analysis: People Who Look Comparable May Differ
   3.5 Welding Fumes and DNA Damage
   3.6 Bias Due to Incomplete Matching
   3.7 Summary
   3.8 Further Reading
   3.9 Appendix: Exact Computations for Sensitivity Analysis
   References
4 Competing Theories Structure Design
   4.1 How Stones Fall
   4.2 The Permanent-Debt Hypothesis
   4.3 Guns and Misdemeanors
   4.4 The Dutch Famine of 1944–1945
   4.5 Replicating Effects and Biases
   4.6 Reasons for Effects
   4.7 The Drive for System
   4.8 Further Reading
   References
5 Opportunities, Devices, and Instruments
   5.1 Opportunities
   5.2 Devices
       5.2.1 Disambiguation
       5.2.2 Multiple control groups
       5.2.3 Coherence among several outcomes
       5.2.4 Known effects
       5.2.5 Doses of treatment
       5.2.6 Differential effects and generic biases
   5.3 Instruments
   5.4 Summary
   5.5 Further Reading
   References
6 Transparency
   References
Part II Matching
7 A Matched Observational Study
   7.1 Is More Chemotherapy More Effective?
   7.2 Matching for Observed Covariates
   7.3 Outcomes in Matched Pairs
   7.4 Summary
   7.5 Further Reading
   References
8 Basic Tools of Multivariate Matching
   8.1 A Small Example
   8.2 Propensity Score
   8.3 Distance Matrices
   8.4 Optimal Pair Matching
   8.5 Optimal Matching with Multiple Controls
   8.6 Optimal Full Matching
   8.7 Efficiency
   8.8 Summary
   8.9 Further Reading
   References
9 Various Practical Issues in Matching
   9.1 Checking Covariate Balance
   9.2 Almost Exact Matching
   9.3 Exact Matching
   9.4 Missing Covariate Values
   9.5 Further Reading
   References
10 Fine Balance
   10.1 What Is Fine Balance?
   10.2 Constructing an Exactly Balanced Control Group
   10.3 Controlling Imbalance When Exact Balance Is Not Feasible
   10.4 Fine Balance and Exact Matching
   10.5 Further Reading
   References
11 Matching Without Groups
   11.1 Matching Without Groups: Nonbipartite Matching
       11.1.1 Matching with doses
       11.1.2 Matching with several groups
   11.2 Some Practical Aspects of Matching Without Groups
   11.3 Matching with Doses and Two Control Groups
       11.3.1 Does the minimum wage reduce employment?
       11.3.2 Optimal matching to form two independent comparisons
       11.3.3 Difference in change in employment with two control groups
   11.4 Further Reading
   References
12 Risk-Set Matching
   12.1 Does Cardiac Transplantation Prolong Life?
   12.2 Risk-Set Matching in a Study of Surgery for Interstitial Cystitis
   12.3 Maturity at Discharge from a Neonatal Intensive Care Unit
   12.4 Joining a Gang at Age 14
   12.5 Some Theory
   12.6 Further Reading
   References
13 Matching in R
   13.1 R
   13.2 Data
   13.3 Propensity Score
   13.4 Covariates with Missing Values
   13.5 Distance Matrix
   13.6 Constructing the Match
   13.7 Checking Covariate Balance
   13.8 College Outcomes
   13.9 Further Reading
   13.10 Appendix: A Brief Introduction to R
   13.11 Appendix: R Functions for Distance Matrices
   References
Part III Design Sensitivity
14 The Power of a Sensitivity Analysis and Its Limit
   14.1 The Power of a Test in a Randomized Experiment
   14.2 Power of a Sensitivity Analysis in an Observational Study
   14.3 Design Sensitivity
   14.4 Summary
   14.5 Further Reading
   Appendix: Technical Remarks and Proof of Proposition 14.1
   References
15 Heterogeneity and Causality
   15.1 J.S. Mill and R.A. Fisher: Reducing Heterogeneity or Introducing Random Assignment
   15.2 A Larger, More Heterogeneous Study Versus a Smaller, Less Heterogeneous Study
   15.3 Heterogeneity and the Sensitivity of Point Estimates
   15.4 Examples of Efforts to Reduce Heterogeneity
   15.5 Summary
   15.6 Further Reading
   References
16 Uncommon but Dramatic Responses to Treatment
   16.1 Large Effects, Now and Then
   16.2 Two Examples
   16.3 Properties of a Paired Version of Salsburg's Model
   16.4 Design Sensitivity for Uncommon but Dramatic Effects
   16.5 Summary
   16.6 Further Reading
   16.7 Appendix: Sketch of the Proof of Proposition 16.1
   References
17 Anticipated Patterns of Response
   17.1 Using Design Sensitivity to Evaluate Devices
   17.2 Coherence
   17.3 Doses
   17.4 Example: Maimonides' Rule
   17.5 Further Reading
   17.6 Appendix: Proof of Proposition 17.1
   References
Part IV Planning Analysis
18 After Matching, Before Analysis
   18.1 Split Samples and Design Sensitivity
   18.2 Are Analytic Adjustments Feasible?
   18.3 Matching and Thick Description
   18.4 Further Reading
   References
19 Planning the Analysis
   19.1 Plans Enable
   19.2 Elaborate Theories
   19.3 Three Simple Plans with Two Control Groups
   19.4 Sensitivity Analysis for Two Outcomes and Coherence
   19.5 Sensitivity Analysis for Tests of Equivalence
   19.6 Sensitivity Analysis for Equivalence and Difference
   19.7 Summary
   19.8 Further Reading
   19.9 Appendix: Testing Hypotheses in Order
   References
Summary: Key Elements of Design

Solutions to Common Problems
   References

Symbols

Acronyms

Glossary of Statistical Terms

Some Books
   References

Suggested Readings for a Course
   References
Index

Part I Beginnings

Chapter 1 Dilemmas and Craftsmanship
Abstract This introductory chapter mentions some of the issues that arise in observational studies and describes a few well-designed studies. Section 1.7 outlines the book, describes its structure, and suggests alternative ways to read it.
1.1 Those Confounded Vitamins
On 22 May 2004, the Lancet published two articles, one entitled "When are observational studies as credible as randomized trials?" by Jan Vandenbroucke [53], the other entitled "Those confounded vitamins: What can we learn from the differences between observational versus randomized trial evidence?" by Debbie Lawlor, George Smith, Richard Bruckdorfer, Devi Kundu, and Shah Ebrahim [32]. In a randomized experiment or trial, a coin is flipped to decide whether the next person is assigned to treatment or control, whereas in an observational study, treatment assignment is not under experimental control. Despite the optimism of the first title and the pessimism of the second, both articles struck a balance, perhaps with a slight tilt towards pessimism. Vandenbroucke reproduced one of Jim Borgman's political cartoons in which a TV newsman sits below both a banner reading "Today's Random Medical News" and three spinners which have decided that "coffee can cause depression in twins." Deadpan, the newsman says, "According to a report released today . . . ." The cartoon reappeared in a recent report of the Academy of Medical Sciences that discusses observational studies in some detail [43, page 19].

Lawlor et al. begin by noting that a large observational study published in the Lancet [30] had found a strong, statistically significant negative association between coronary heart disease mortality and level of vitamin C in blood, having used a model to adjust for other variables such as age, blood pressure, diabetes, and smoking. Adjustments using a model attempt to compare people who are not directly comparable — people of somewhat different ages or smoking habits — removing these differences using a mathematical structure that has elements estimated from the data at hand. Investigators often have great faith in their models, a faith that
is expressed in the large tasks they expect their models to successfully perform. Lawlor et al. then note that a large randomized controlled trial published in the Lancet [20] compared a placebo pill with a multivitamin pill including vitamin C, finding slightly but not significantly lower death rates under placebo. The randomized trial and the observational study seem to contradict one another. Why is that? There are, of course, many possibilities. There are some important differences between the randomized trial and the observational study; in particular, the treatments are not really identical, and it is not inconceivable that each study correctly answered questions that differ in subtle ways. In particular, Khaw et al. emphasize vitamin C from fruit and vegetable intake rather than from vitamin supplements. Lawlor et al. examine a different possibility, namely that, because of the absence of randomized treatment assignment, people who were not really comparable were compared in the observational study. Their examination of this possibility is indirect, using data from another study, the British Women's Heart and Health Study, in which several variables were measured that were not included in the adjustments performed by Khaw et al. Lawlor et al. find that women with low levels of vitamin C in their blood are more likely to smoke cigarettes, to exercise less than one hour per week, to be obese, and less likely to consume a low fat diet, a high fiber diet, and daily alcohol. Moreover, women with low levels of vitamin C in their blood are more likely to have had a childhood in a "manual social class," with no bathroom or hot water in the house, a shared bedroom, no car access, and to have completed full time education by eighteen years of age. And the list goes on.
The concern is that one or more of these differences, or some other difference that was not measured, not the difference in vitamin C, is responsible for the higher coronary mortality among individuals with lower levels of vitamin C in their blood. To a large degree, this problem was avoided in the randomized trial, because there, only the turn of a coin distinguished placebo and multivitamin.
1.2 Cochran’s Basic Advice
The planner of an observational study should always ask himself the question, ‘How would the study be conducted if it were possible to do it by controlled experimentation?’
William G. Cochran [9, page 236], attributing the point to H.F. Dorn.

At the most elementary level, a well designed observational study resembles, as closely as possible, a simple randomized experiment. By definition, the resemblance is incomplete: randomization is not used to assign treatments in an observational study. Nonetheless, elementary mistakes are often introduced and opportunities missed by unnecessary deviations from the experimental template. The current section briefly mentions these most basic ingredients.

1.2.1 Treatments, covariates, outcomes
Randomized experiment: There is a well-defined treatment that began at a well-defined time, so there is a clear distinction between covariates measured prior to treatment and outcomes measured after treatment.

Better observational study: There is a well-defined treatment that began at a well-defined time, so there is a clear distinction between covariates measured prior to treatment and outcomes measured after treatment.

Poorer observational study: It is difficult to say when the treatment began, and some variables labeled as covariates may have been measured after the start of treatment, so they might have been affected by the treatment. The distinction between covariates and outcomes is not clear. See [34].
1.2.2 How were treatments assigned?
Randomized experiment: Treatment assignment is determined by a truly random device. At one time, this actually meant coins or dice, but today it typically means random numbers generated by a computer.

Better observational study: Treatment assignment is not random, but circumstances for the study were chosen so that treatment seems haphazard, or at least not obviously related to the outcomes subjects would exhibit under treatment or under control. When investigators are especially proud, having found unusual circumstances in which treatment assignment, though not random, seems unusually haphazard, they may speak of a 'natural experiment.'

Poorer observational study: Little attention is given to the process that made some people into treated subjects and others into controls.
1.2.3 Were treated and control groups comparable?
Randomized experiment: Although a direct assessment of comparability is possible only for covariates that were measured, a randomized trial typically has a table demonstrating that the randomization was reasonably effective in balancing these observed covariates. Randomization provides some basis for anticipating that many covariates that were not measured will tend to be similarly balanced.

Better observational study: Although a direct assessment of comparability is possible only for covariates that were measured, a matched observational study typically has a table demonstrating that the matching was reasonably effective in balancing these observed covariates. Unlike randomization, matching for observed covariates provides absolutely no basis for anticipating that unmeasured covariates are similarly balanced.

Poorer observational study: No direct assessment of comparability is presented.

1.2.4 Eliminating plausible alternatives to treatment effects
Randomized experiment: The most plausible alternatives to an actual treatment effect are identified, and the experimental design includes features to shed light on these alternatives. Typical examples include the use of placebos and other forms of sham or partial treatment, or the blinding of subjects and investigators to the identity of the treatment received by a subject.

Better observational study: The most plausible alternatives to an actual treatment effect are identified, and the design of the observational study includes features to shed light on these alternatives. Because there are many more plausible alternatives to a treatment effect in an observational study than in an experiment, much more effort is devoted to collecting data that would shed light on these alternatives. Typical examples include multiple control groups thought to be affected by different biases, or a sequence of longitudinal baseline pretreatment measurements of the variable that will be the outcome after treatment. When investigators are especially proud of devices included to distinguish treatment effects from plausible alternatives, they may speak of a 'quasi-experiment.'

Poorer observational study: Plausible alternatives to a treatment effect are mentioned in the discussion section of the published report.
1.2.5 Exclusion criteria
Randomized experiment: Subjects are included or excluded from the experiment based on covariates, that is, on variables measured prior to treatment assignment and hence unaffected by treatment. Only after the subject is included is the subject randomly assigned to a treatment group and treated. This ensures that the same exclusion criteria are used in treated and control groups.

Better observational study: Subjects are included or excluded from the study based on covariates, that is, on variables measured prior to treatment assignment and hence unaffected by treatment. The same criteria are used in treated and control groups.

Poorer observational study: A person included in the control group might have been excluded if assigned to treatment instead. The criteria for membership in the treated and control groups differ. In one particularly egregious case, to be discussed in §12.1, treatment was not immediately available, and any patient who died before the treatment became available was placed in the control group; then came the exciting news that treated patients lived longer than controls.

1.2.6 Exiting a treatment group after treatment assignment
Randomized experiment: Once assigned to a treatment group, subjects do not exit. A subject who does not comply with the assigned treatment, or switches to another treatment, or is lost to follow-up, remains in the assigned treatment group with these characteristics noted. An analysis that compares the groups as randomly assigned, ignoring deviations between intended and actual treatment, is called an 'intention-to-treat' analysis, and it is one of the central analyses reported in a randomized trial. Randomization inference may partially address noncompliance with assigned treatment by viewing treatment assignment as an instrumental variable for treatment received; see §5.3 and [18].

Better observational study: Once assigned to a treatment group, subjects do not exit. A subject who does not comply with the assigned treatment, or switches to another treatment, or is lost to follow-up, remains in the assigned treatment group with these characteristics noted. Inference may partially address noncompliance by viewing treatment assignment as an instrumental variable for treatment received; see §5.3 and [22].

Poorer observational study: There is no clear distinction between assignment to treatment, acceptance of treatment, receipt of treatment, or switching treatments, so problems that arise in experiments seem to be avoided, when in fact they are simply ignored.
1.2.7 Study protocol
Randomized experiment: Before beginning the actual experiment, a written protocol describes the design, exclusion criteria, primary and secondary outcomes, and proposed analyses.

Better observational study: Before examining outcomes that will form the basis for the study's conclusions, a written protocol describes the design, exclusion criteria, primary and secondary outcomes, and proposed analyses; see Chapter 19.

Poorer observational study: If sufficiently many analyses are performed, something publishable will turn up sooner or later.
1.3 Maimonides’ Rule
In 1999, Joshua Angrist and Victor Lavy [3] published an unusual and much admired study of the effects of class size on academic achievement. They wrote [3, pages 533–535]:

[C]ausal effects of class size on pupil achievement have proved very difficult to measure. Even though the level of educational inputs differs substantially both between and within
schools, these differences are often associated with factors such as remedial training or students' socioeconomic background . . . The great twelfth century Rabbinic scholar, Maimonides, interprets the Talmud's discussion of class size as follows: 'Twenty-five children may be put in charge of one teacher. If the number in the class exceeds twenty-five but is not more than forty, he should have an assistant to help with instruction. If there are more than forty, two teachers must be appointed.' . . . The importance of Maimonides' rule for our purposes is that, since 1969, it has been used to determine the division of enrollment cohorts into classes in Israeli public schools.
In most places at most times, class size has been determined by the affluence or poverty of a community, its enthusiasm or skepticism about the value of education, the special needs of students for remedial or advanced instruction, the obscure, transitory, barely intelligible obsessions of bureaucracies, and each of these determinants of class size clouds its actual effect on academic performance. However, if adherence to Maimonides' rule were perfectly rigid, then what would separate a school with a single class of size 40 from the same school with two classes whose average size is 20.5 is the enrollment of a single student. Maimonides' rule has the largest impact on a school with about 40 students in a grade cohort. With cohorts of size 40, 80, and 120 students, the steps down in average class size required by Maimonides' rule when an additional student enrolls are, respectively, from 40 to 20.5, from 40 to 27, and from 40 to 30.25. For this reason, we will look at schools with fifth grade cohorts in 1991 with between 31 and 50 students, where average class sizes might be cut in half by Maimonides' rule. There were 211 such schools, with 86 of these schools having between 31 and 40 students in fifth grade, and 125 schools having between 41 and 50 students in the fifth grade.

Adherence to Maimonides' rule is not perfectly rigid. In particular, Angrist and Lavy [3, page 538] note that the percentage of disadvantaged students in a school "is used by the Ministry of Education to allocate supplementary hours of instruction and other school resources." Among the 211 schools with between 31 and 50 students in fifth grade, the percentage disadvantaged has a slightly negative Kendall's correlation of −0.10 with average class size, which differs significantly from zero (P-value = 0.031), and it has more strongly negative correlations of −0.42 and −0.55, respectively, with performance on verbal and mathematics test scores.
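The arithmetic behind these steps is easy to verify: under a rigid version of the rule, a cohort is split into the fewest classes of at most 40 students, and the average class size is the cohort size divided by that number of classes. The sketch below is a minimal illustration, not the authors' code (the book's own software examples use R; plain Python is used here for a self-contained check).

```python
from math import ceil

def average_class_size(enrollment, cap=40):
    """Average class size when a cohort is split into the fewest
    classes of at most `cap` students, as a rigid Maimonides' rule requires."""
    n_classes = ceil(enrollment / cap)
    return enrollment / n_classes

# One extra student forces an additional class, cutting the average sharply:
# 40 -> 20.5, 80 -> 27, 120 -> 30.25, matching the steps given in the text.
for n in (40, 80, 120):
    print(n, average_class_size(n), "->", n + 1, average_class_size(n + 1))
```

The discontinuity at each multiple of 40 is what makes the rule useful: near those cutoffs, a one-student difference in enrollment produces a large difference in class size that is plausibly unrelated to the other determinants of class size listed above.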
For this reason, 86 matched pairs of two schools were formed, matching to minimize the total absolute difference in percentage disadvantaged. Figure 1.1 shows the paired schools, 86 schools with between 31 and 40 students in fifth grade, and 86 schools with between 41 and 50 students in the fifth grade. After matching, the upper left panel in Figure 1.1 shows that the percentage of disadvantaged students was balanced; indeed, the average absolute difference within a pair was less than 1%. The upper right panel in Figure 1.1 shows Maimonides' rule at work: with some exceptions, the slightly larger schools had substantially smaller class sizes. The bottom panels of Figure 1.1 show the average mathematics and verbal test performance of these fifth graders, with somewhat higher scores in the schools with between 41 and 50 fifth graders, where class sizes tended to be smaller.
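Pairing schools to minimize the total absolute difference in percentage disadvantaged is an optimal bipartite matching problem. A minimal sketch, using made-up percentages rather than the actual Angrist-Lavy data, and assuming `scipy` is available (the book's own matching examples, in Chapter 13, use R):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical percent-disadvantaged values for five smaller and five
# larger schools; these numbers are illustrative, not the study data.
small = np.array([12.0, 3.5, 40.0, 8.0, 22.0])   # cohorts of 31-40 students
large = np.array([10.0, 25.0, 5.0, 38.0, 15.0])  # cohorts of 41-50 students

# Cost of pairing smaller school i with larger school j is the
# absolute difference in percentage disadvantaged.
cost = np.abs(small[:, None] - large[None, :])

# The optimal pairing minimizes the total absolute difference.
rows, cols = linear_sum_assignment(cost)
total = cost[rows, cols].sum()
print(list(zip(small[rows], large[cols])), total)
```

With one covariate, the optimal pairing simply matches the sorted lists; with many covariates combined into a distance, the same assignment algorithm still finds the minimum-cost pairing, which is why optimal matching is treated in detail in Part II.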
[Figure 1.1 appeared here: four panels, labeled Disadvantaged, Class Size, Math Test Score, and Verbal Test Score, each comparing schools with 31−40 versus 41−50 fifth-grade students; the vertical axes show percent disadvantaged, average students per class, and average test score.]

Fig. 1.1 Eighty-six pairs of two Israeli schools, one with between 31 and 40 students in the fifth grade, the other with between 41 and 50 students in the fifth grade, matched for percentage of students in the school classified as disadvantaged. The figure shows that the percentage of disadvantaged students is balanced, that imperfect adherence to Maimonides' rule has yielded substantially different average class sizes, and that test scores were higher in the group of schools with predominantly smaller class sizes.
1.4 Seat Belts in Car Crashes
Do safety belts prevent fatalities in automobile accidents? Car crashes vary in severity, depending upon speed, road traction, the reaction time of a driver stepping on the brakes, and physical forces that are rarely, if ever, measured. Wearing safety belts is a precaution. Many people, perhaps most people, who wear safety belts think that a serious accident is possible; this possibility is salient, and small inconveniences seem tolerable if the risk is reduced. In contrast, small inconveniences may seem intolerable if a serious accident is seen as a remote possibility. Does one take a single precaution? Perhaps some people do, but others will take several precautions. If cautious drivers wear seat belts, but also drive at slower speeds, at a greater distance from the car ahead, with greater allowance for road conditions – if
Table 1.1 Crashes in FARS 1975–1983 in which the front seat had two occupants, a driver and a passenger, with one belted, the other unbelted, and one died and one survived. Driver Not Belted Belted Passenger Belted Not Belted Driver Died Passenger Survived 189 153 Driver Survived Passenger Died 111 363 risk-tolerant drivers do not wear seat belts, drive faster and closer, ignore road con- ditions – then a simple comparison of belted and unbelted drivers may credit seat belts with effects that reflect, in part, the severity of the crash. Using data from the U.S. Fatal Accident Reporting System (FARS), Leonard Evans [14] looked at crashes in which there were two individuals in the front seat, one belted, the other unbelted, with at least one fatality. In these crashes, several otherwise uncontrolled features are the same for driver and passenger: speed, road traction, distance from the car ahead, reaction time. Admittedly, risk in the pas- senger seat may differ from risk in the driver seat, but in this comparison there are belted drivers with unbelted passengers and unbelted drivers with belted passengers, so this issue may be examined. Table 1.1 is derived from Evans’ [14] more detailed tables. In this table, when the passenger is belted and the driver is not, more often than not, the driver dies; conversely, when the driver is belted and the passenger is not, more often than not, the passenger dies. Everyone in Table 1.1 is at least sixteen years of age. Nonetheless, the roles of driver and passenger are connected to law and custom, for parents and children, husbands and wives, and others. For this reason, Evans did further analyses, for instance taking account of the ages of driver and passenger, with similar results. Evans [14, page 239]wrote:
The crucial information for this study is provided by cars in which the safety belt use of the subject and other occupant differ . . . There is a strong tendency for safety belt use or non-use to be the same for different occupants of the same vehicle . . . Hence, sample sizes in the really important cells are . . . small . . .
This study is discussed further in §5.2.6.
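One simple summary of Table 1.1 is how often the unbelted occupant was the one who died, separately for each belt configuration. The sketch below (Python used for illustration; the counts come from the table, but this particular summary is an illustration, not Evans’s full analysis) makes that comparison:

```python
# Counts from Table 1.1 (FARS 1975-1983, Evans [14]): two front-seat
# occupants, one belted and one unbelted, exactly one fatality.
crashes = {
    "driver unbelted, passenger belted": {"driver_died": 189, "passenger_died": 111},
    "driver belted, passenger unbelted": {"driver_died": 153, "passenger_died": 363},
}

for config, n in crashes.items():
    total = n["driver_died"] + n["passenger_died"]
    # Identify which occupant was unbelted in this configuration.
    unbelted_died = (n["driver_died"] if config.startswith("driver unbelted")
                     else n["passenger_died"])
    print(f"{config}: unbelted occupant died in {unbelted_died}/{total} "
          f"= {unbelted_died / total:.1%} of crashes")
```

In both configurations the unbelted occupant died in well over half of the crashes, and because the comparison is within a single crash, speed, traction, and reaction time are held fixed for both occupants.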
1.5 Money for College
To what extent, if any, does financial aid increase college attendance? It would not do simply to compare those who received aid with those who did not. Decisions about the allocation of financial aid are often made person by person, with consideration of financial need and academic promise, together with many other factors. A grant of financial aid is often a response to an application for aid, and the decision to apply or not is likely to reflect an individual’s motivation for continued education and competing immediate career prospects.
To estimate the effect of financial aid on college attendance, Susan Dynarski [13] used “a shift in aid policy that affect[ed] some students but not others.” Between 1965 and 1982, a program of the U.S. Social Security Administration provided substantial financial aid to attend college for the children of deceased Social Security beneficiaries, but the U.S. Congress voted in 1981 to end the program. Using data from the National Longitudinal Survey of Youth, Dynarski [13] compared college attendance of high school seniors with deceased fathers and high school seniors whose fathers were not deceased, in 1979–1981 when aid was available, and in 1982–1983 after the elimination of the program. Figure 1.2 depicts the comparison. In 1979–1981, while the Social Security Student Benefit Program provided aid to students with deceased fathers, these students were more likely than others to attend college, but in 1982–1983, after the program was eliminated, these students were less likely than others to attend college. In Figure 1.2, the group that faced a change in incentives exhibited a change in behavior, whereas the group that faced no change in incentives exhibited little change in behavior. In the spirit of §1.2.4, Figure 1.2 studies one treatment using four groups, where only certain patterns of response among the four groups are compatible with a treatment effect; see also [7, 37] and [47, Chapter 5]. Is being the child of a deceased father a random event? Apparently not. It is unrelated to the child’s age and gender, but the children of deceased fathers had mothers and fathers with less education and were more likely to be black; however, these differences were about the same in 1979–1981 and 1982–1983, so these differences alone are not good explanations of the shift in college attendance [13, Table 1]. This study is discussed further in Chapter 13.
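The four-group comparison is a difference-in-differences: the change over time in attendance among children of deceased fathers, minus the change among other children. A minimal sketch, using invented proportions loosely in the range displayed in Figure 1.2 (the published estimates differ; see Dynarski [13]):

```python
# Hypothetical attendance proportions (invented for illustration).
fd_before, fd_after = 0.56, 0.35     # fathers deceased: aid available / aid ended
fnd_before, fnd_after = 0.48, 0.46   # fathers not deceased: no change in incentives

change_fd = fd_after - fd_before     # group that faced a change in incentives
change_fnd = fnd_after - fnd_before  # group that faced no change in incentives

# The second difference removes time trends common to both groups.
did = change_fd - change_fnd
print(f"difference-in-differences estimate: {did:+.2f}")
```

Only certain patterns among the four numbers are compatible with a treatment effect: a large change in the group whose incentives changed, little change in the other group.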
1.6 Nature’s ‘Natural Experiment’
In asking whether a particular gene plays a role in causing a particular disease, a key problem is that the frequencies of various forms of a gene (its alleles) vary somewhat from one human community to the next. At the same time, habits, customs, diets, and environments also vary somewhat from one community to the next. In consequence, an association between a particular allele and a particular disease may not be causal: gene and disease may both be associated with some cause, such as diet, that is not genetic. Conveniently, nature has created a natural experiment. With the exception of sex-linked genes, a person receives two versions of each gene, perhaps identical, one from each parent, and transmits one copy to each child. To a close approximation, in the formation of a fertilized egg, each parent contributes one of two possible alleles, each with probability 1/2, the contributions of the two parents being independent of each other, and independent for different children of the same parents. (The transmissions of different genes that are neighbors on the same chromosome are not generally independent; see [51, §15.4]. In consequence, a particular gene may be associated with a disease not because it is a cause of the disease, but rather because it is a marker for a neighboring gene that is a cause.)
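The transmission mechanism just described is easy to simulate. The sketch below (an illustration with invented heterozygous parents, ignoring linkage between neighboring genes) draws one allele from each parent with probability 1/2:

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

def child_genotype(parent1, parent2):
    """Each parent transmits one of their two alleles, each with
    probability 1/2, independently of the other parent."""
    return (random.choice(parent1), random.choice(parent2))

# Hypothetical parents, each heterozygous: one 'A' and one 'a' allele.
p1, p2 = ("A", "a"), ("A", "a")
children = [child_genotype(p1, p2) for _ in range(100_000)]
n_AA = sum(1 for c in children if c == ("A", "A"))
print(f"fraction of AA children: {n_AA / len(children):.3f}  (expected 0.25)")
```

With two heterozygous parents, each of the four equally likely combinations occurs with probability 1/4, which is the randomization that nature’s natural experiment supplies.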
Several strategies use this observation to create natural experiments that study genetic causes of a specific disease. Individuals with the disease are identified. Richard Spielman, Ralph McGinnis, and Warren Ewens [49] used genetic information on the diseased individual and both parents in their Transmission/Disequilibrium Test (TDT). The test compares the diseased individuals to the known distributions of alleles for the hypothetical children their parents could produce. David Curtis [12], Richard Spielman and Warren Ewens [50], and Michael Boehnke and Carl Langefeld [5] suggested using genetic information on the diseased individual and one or more siblings from the same parents, which Spielman and Ewens called the sib-TDT. If the disease has no genetic cause linked to the gene under study, then the alleles from the diseased individual and her siblings should be exchangeable.

Fig. 1.2 College attendance by age 23 in four groups: before (1979–1981) and after (1982–1983) the end of the Social Security Student Benefit Program for children whose fathers were deceased (FD) or not deceased (FND). Values are proportions with standard errors (se).

Table 1.2 Alzheimer disease and the apolipoprotein E ε4 allele in 112 sibling pairs, one with Alzheimer disease (affected), the other without (unaffected). The table counts pairs, not individuals. The rows and columns of the table indicate the number (0, 1, or 2) of ApoE ε4 alleles for the affected and unaffected sibling.

                        Unaffected Sib # ApoE ε4 Alleles
                              0      1      2
  Affected Sib    0          23      4      0
                  1          25     36      2
                  2           8      8      6

The idea underlying the sib-TDT is illustrated in Table 1.2, using data from Boehnke and Langefeld [5, Table 5], their table being derived from work of Margaret Pericak-Vance and Ann Saunders; see [44]. Table 1.2 gives the frequency of the ε4 allele of the apolipoprotein E gene in 112 individuals with Alzheimer disease and in an unaffected sibling of the same parents. Table 1.2 counts sibling pairs, not individuals, so the total count in the table is 112 pairs of an affected and an unaffected sibling. Each person can receive 0, 1, or 2 copies of the ε4 allele from parents. For any one pair, write (aff, unaff) for the number of ε4 alleles possessed by, respectively, the affected and unaffected sibling. In Table 1.2, there are 25 pairs with (aff, unaff) = (1, 0). If Alzheimer disease had no genetic link with the apolipoprotein E ε4 allele, then nature’s natural experiment implies that the chance that (aff, unaff) = (2, 0), say, is the same as the chance that (aff, unaff) = (0, 2), and more generally, the chance that (aff, unaff) = (i, j) equals the chance that (aff, unaff) = (j, i), for i, j = 0, 1, 2. In fact, this does not appear to be the case in Table 1.2. For instance, there are eight sibling pairs such that (aff, unaff) = (2, 0) and none such that (aff, unaff) = (0, 2). Also, there are 25 pairs such that (aff, unaff) = (1, 0) and only 4 pairs such that (aff, unaff) = (0, 1). A distribution with the property
Pr{(aff, unaff) = (i, j)} = Pr{(aff, unaff) = (j, i)} for all i, j

is said to be exchangeable. In the absence of a genetic link with disease, nature’s natural experiment ensures that the distribution of allele frequencies in affected/unaffected sib pairs is exchangeable. This creates a test, the sib transmission/disequilibrium test [50], that is identical to a certain randomization test appropriate in a randomized experiment [31].
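The sib-TDT itself is a Mantel-Haenszel test [31]. As a simpler illustration of what exchangeability implies, consider only the (1, 0) and (0, 1) cells of Table 1.2: under exchangeability, each of those 29 pairs is equally likely to be (1, 0) or (0, 1), so the count of (1, 0) pairs is Binomial(29, 1/2). The sketch below is a sign-test-style check of that implication, not the test used in the cited work:

```python
from math import comb

# From Table 1.2: 25 pairs with (aff, unaff) = (1, 0), 4 with (0, 1).
# Under exchangeability, each of these 29 pairs is (1,0) with probability 1/2.
n, k = 25 + 4, 25

def binom_sf_ge(n, k):
    """P(X >= k) for X ~ Binomial(n, 1/2), computed exactly."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

p_one_sided = binom_sf_ge(n, k)
print(f"P(X >= {k} of {n} under exchangeability) = {p_one_sided:.2e}")
```

The tiny p-value says the imbalance 25 versus 4 is far from what exchangeability predicts; the full sib-TDT combines all the symmetric cell pairs, including (2, 0) versus (0, 2).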
1.7 What This Book Is About
Basic structure
Design of Observational Studies has four parts, ‘Beginnings,’ ‘Matching,’ ‘Design Sensitivity,’ and ‘Planning Analysis,’ plus a brief summary. Part I, ‘Beginnings,’ is a conceptual introduction to causal inference in observational studies. Chapters 2, 3, and 5 of Part I cover concisely, in about one hundred pages, many of the ideas discussed in my book Observational Studies [38], but in a far less technical and less general fashion. Parts II–IV cover material that, for the most part, has not previously appeared in book form. Part II, ‘Matching,’ concerns the conceptual, practical, and computational aspects of creating a matched comparison that balances many observed covariates. Because matching does not make use of outcome information, it is part of the design of the study, what Cochran [9] called “setting up the comparisons”; that is, setting up the structure of the experimental analog. Even if the matching in Part II is entirely successful, so that after matching, matched treated and control groups are comparable with respect to all observed covariates, the question or objection or challenge will inevitably be raised that subjects who look comparable in observed data may not actually be comparable in terms of covariates that were not measured. Chapters 3 and 5 and Parts III and IV address this central concern. Part III, ‘Design Sensitivity,’ discusses a quantitative tool for appraising how well competing designs (or data generating processes) resist such challenges. In part, ‘Design Sensitivity’ will provide a formal appraisal of the design strategies introduced informally in Chapter 5. Part IV discusses those activities that follow matching but precede analysis, notably planning the analysis.
Structure of Part I: Beginnings
Observational studies are built to resemble simple experiments, and Chapter 2 reviews the role that randomization plays in experiments. Chapter 2 also introduces elements and notation shared by experiments and observational studies. Chapter 3 discusses two simple models for observational studies, one claiming that adjustments for observed covariates suffice, the other engaging the possibility that they do not. Chapter 3 introduces the propensity score and sensitivity analysis. Observational studies are built from three basic ingredients: opportunities, devices, and instruments. Chapter 5 introduces these ideas in an informal manner, with some of the formalities developed in Part III and others developed in [38, Chapters 4, 6–9]. My impression is that many observational studies dissipate either by the absence of a focused objective or by becoming mired in ornate analyses that may overwhelm an audience but are unlikely to convince anyone. Neither problem is common in randomized experiments, and both problems are avoidable in observational studies. Chapter 4 discusses the first problem, while Chapter 6 discusses the second. In a successful experiment or observational study, competing theories make conflicting predictions; this is the concern of Chapter 4. Transparency means making evidence evident, and Chapter 6 discusses how this is done.
Structure of Part II: Matching
Part II, entitled ‘Matching,’ is partly conceptual, partly algorithmic, partly data analytic. Chapter 7 is introductory: it presents a matched comparison as it might (and did) appear in a scientific journal. The basic tools of multivariate matching are described and illustrated in Chapter 8, and various common practicalities are discussed in Chapter 9. Later chapters in Part II discuss specific topics in matching, including fine balance, matching with multiple groups or without groups, and risk-set matching. Matching in the computer package R is discussed in Chapter 13.
Structure of Part III: Design Sensitivity
In Chapter 3, it is seen that some observational studies are sensitive to small unobserved biases, whereas other studies are insensitive to quite large unobserved biases. What features of the design of an observational study affect its sensitivity to bias from covariates that were not measured? This is the focus of Part III. Chapter 14 reviews the concept of power in a randomized experiment, then defines the power of a sensitivity analysis and, from it, the design sensitivity: a number that characterizes the sensitivity of an observational study design to unmeasured biases when the sample size is large. Many factors affect the design sensitivity, including the issues discussed informally in Chapter 5. Chapter 15 revives a very old debate between John Stuart Mill and Sir Ronald Fisher about the relevance to causal inference of the heterogeneity of experimental material. Mill believed it mattered quite a bit; Fisher denied this. Sometimes a treatment has little effect on most people and a dramatic effect on some people. In one sense, the effect is small — on average it is small — but for a few people it is large. Is an effect of this sort highly sensitive to unmeasured biases? Chapter 16 provides the answer. Chapter 17 takes up themes from Chapter 5, specifically coherence and dose-response, and evaluates their contribution to design sensitivity.
Structure of Part IV: Planning Analysis
The sample has been successfully matched — treated and control groups look comparable in terms of measured covariates — and Part IV turns to planning the analysis. Chapter 18 concerns three empirical steps that aid planning the analysis: sample splitting to improve design sensitivity, checking that analytical adjustments are feasible, and thick description of a few matched pairs. After reviewing Fisher’s advice — “make your theories elaborate” — Chapter 19 discusses planning the analysis of an observational study.
A less technical introduction to observational studies
The mathematician Paul Halmos wrote two essays, “How to write mathematics” and “How to talk mathematics.” In the latter, he suggested that in a good mathematical talk, you don’t prove everything, but you do prove something to give the flavor of proofs for the topic under discussion. In the spirit of that remark, Observational Studies [38] writes about statistics, where Design of Observational Studies talks about statistics. This is done in several ways.

We often develop an understanding by taking something apart and putting it back together. In statistics, this typically means looking at an unrealistically small example in which it is possible to see the details of what goes on. For this reason, I discuss several unrealistically small examples in parallel with real examples of practical size. For instance, Chapter 2 discusses two versions of a paired randomized experiment, one with five pairs, the other with 185 pairs. The five pairs are a random sample from the 185 pairs. With 5 pairs, there are 2^5 = 32 possible treatment assignments, and it is possible to see what is happening. With 185 pairs, there are 2^185 ≈ 4.9 × 10^55 possible treatment assignments, and it is not possible to see what is happening, although nothing new is happening beyond what you saw happening with five pairs. The larger experiment is just larger, somewhat awkward to inspect, but no different conceptually. In a similar way, Chapter 7 discusses the construction of 344 pairs matched for many covariates, while Chapter 8 discusses the construction of 21 pairs matched for three covariates. With 21 pairs, you can see what is happening, whereas with 344 pairs you cannot see as much, but nothing new is happening. Chapter 2 discusses a number of very old, very central concepts in statistics.
These include: the role of randomization in experiments, the nature of randomization tests, obtaining confidence intervals by inverting hypothesis tests, building an estimator using an estimating equation, and so on. This material is so old and central to the subject that an article in a statistical journal might reduce the entire chapter to a paragraph, and that would be fine for someone who had been around this track a few times. My goal in Chapter 2 is not concise expression. My goal in Chapter 2 is to take you around this track a few times.

To prove something, not everything, I develop statistical theory only for the case of matched pairs with continuous responses. The case of matched pairs is the simplest nontrivial case. All of the important concepts appear in the case of matched pairs, but most of the technical issues are easy. Randomization distributions for matched pairs are little more than a series of independent coin flips. Everybody can do coin flips. In a randomized experiment, the coin flips are fair, but in a sensitivity analysis, the coin flips may be biased. The matching methods in Part II are not restricted to pair matching — matching with multiple controls, matching with variable controls, full matching, risk set matching are all there — however, if you want to work through the derivations of the associated statistical analyses for continuous, discrete, censored, and multivariate responses, you will need to turn to [38] or the references discussed in ‘Further Reading.’

Focusing the theoretical presentation on matched pairs permits discussion of key concepts with the minimum of mathematics. Unlike statistical analysis, research design yields decisions rather than calculations — decisions to ask certain questions, in certain settings, collecting certain data, adding certain design elements, attending to certain patterns — and for such decisions, the concepts are more important than details of general computations.
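The coin-flip randomization distribution for matched pairs can be sketched concretely. With five pairs there are 2^5 = 32 equally likely treatment assignments; under the null hypothesis of no effect, flipping the sign of a pair’s treated-minus-control difference corresponds to reversing which subject was treated. The pair differences below are invented for illustration:

```python
from itertools import product

# Five invented treated-minus-control pair differences.
diffs = [3.0, 1.5, 4.0, -0.5, 2.0]
observed = sum(diffs)

# Under the null hypothesis of no effect, randomization within pairs makes
# each difference equally likely to carry either sign: 2**5 = 32 assignments.
count = 0
for signs in product([1, -1], repeat=len(diffs)):
    if sum(s * d for s, d in zip(signs, diffs)) >= observed:
        count += 1
p_value = count / 2**len(diffs)
print(f"exact one-sided randomization p-value: {count}/32 = {p_value}")
```

With 185 pairs, exactly the same logic applies over 2^185 assignments; nothing new happens, but the enumeration must be replaced by approximation.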
What is being left out by focusing on matched pairs? For one thing, sensitivity analyses in other cases are easy to do but require more mathematical machinery to justify. Some of this machinery is aesthetically pleasing, for instance exact results using Holley’s inequality [1, 6, 25] in I. R. Savage’s [45] finite distributive lattice of rank orders or samples; see [35, 36] and [38, §4]. Some of this machinery uses large sample approximations or asymptotics that work easily and well even in small samples, but discussion of these approximations means a step up in the level of technical detail; see [16] and [38, §4]. To get a feeling for the difference between the paired case and other cases, see [39], where both cases are discussed in parallel, one right after the other. If you need to do a sensitivity analysis for a situation other than matched pairs, see [38, §4] or other references in Further Reading. Another item that is left out of the current discussion is a formal notation and model for multiple groups, as opposed to a treated and a control group; see [38, §8] for such a notation and model. Such a model adds more subscripts and symbols with few additional concepts. The absence of such a notation and model has a small effect on the discussion of multiple control groups (§5.2.2), differential effects (§5.2.6), the presentation of the matching session in R for a difference-in-differences example in Chapter 13, and the planned analysis with two control groups (§19.3); specifically, these topics are described informally with reference to the literature for formal results.
Dependence among chapters
Design of Observational Studies is highly modular, so it is not necessary to read chapters in order. Part II may be read before or after Part I or not at all; Part II is not needed for Parts III and IV. Part III may be read before or after Part IV. In Part I, Chapter 5 depends on Chapter 3, which in turn depends on Chapter 2. The beginning of Chapter 2, up through §2.4.3, is needed for Chapters 3 and 5, but §2.5 is not essential except for Chapter 16, and the remainder of Chapter 2 is not used later in the book. Chapters 4 and 6 may be read at any time or not at all. In Part II, most chapters depend strongly on Chapter 8 but only weakly on each other. Read the introductory Chapter 7 and Chapter 8; then, read what you like in Part II. The situation is similar in Part III. All of the chapters of Part III depend upon Chapter 14, which in turn depends on Chapter 5. The remaining chapters of Part III may be read in any order or not at all. The two chapters in Part IV may be read out of sequence, and both depend upon Chapter 5. At the back of the book, there is a list of symbols and a glossary of statistical terms. In the index, a bold page number locates the definition of a technical term or symbol. Some books (e.g., [38]) contain practice problems for you to solve. My sense is that the investigator planning an observational study has problems enough, so instead of further problems, at the back of the book there is a list of solutions. As a scholar at a research university, I fall victim to periodic compulsions to make remarks that are largely unintelligible and totally unnecessary. These remarks are found in appendices and footnotes. Under no circumstances read them. If you read a footnote, you will suffer a fate worse than Lot’s wife.1
1 She turned into a pillar of salt when she looked where her husband instructed her not to look. Opposed to the story of Lot and his wife is the remark of Immanuel Kant: “Sapere aude” (Dare to know) [27, page 17]. You are, by the way, off to a bad start with these footnotes.

1.8 Further Reading
One might reasonably say that the distinction between randomized experiments and observational studies was introduced by Sir Ronald Fisher’s [15] invention of randomized experimentation. Fisher’s book [15] of 1935 is of continuing interest. As noted in §1.2, William Cochran [9] argued that observational studies should be understood in relation to experiments; see also the important paper in this spirit by Donald Rubin [41]. A modern discussion of quasi-experiments is given by William Shadish, Thomas Cook, and Donald Campbell [47], and Campbell’s [8] collected papers are of continuing interest. See also [17, 33, 54]. Natural experiments in medicine are discussed by Jan Vandenbroucke [53] and the report edited by Michael Rutter [43], and in economics by Joshua Angrist and Alan Krueger [2], Timothy Besley and Anne Case [4], Daniel Hamermesh [19], Bruce Meyer [33], and Mark Rosenzweig and Kenneth Wolpin [40]. Natural experiments are prominent also in recent developments in genetic epidemiology [5, 12, 31, 49, 50]. The papers by Jerry Cornfield and colleagues [11], Austin Bradford Hill [23], and Mervyn Susser [52] remain highly influential in epidemiology and are of continuing interest. Miguel Hernán and colleagues [22] illustrate the practical importance of adhering to the experimental template in designing an observational study. For a general discussion of observational studies, see [38].
References
1. Anderson, I.: Combinatorics of Finite Sets. New York: Oxford University Press (1987)
2. Angrist, J.D., Krueger, A.B.: Empirical strategies in labor economics. In: Ashenfelter, O., Card, D. (eds.) Handbook of Labor Economics, Volume 3, pp. 1277–1366. New York: Elsevier (1999)
3. Angrist, J.D., Lavy, V.: Using Maimonides’ rule to estimate the effect of class size on scholastic achievement. Q J Econ 114, 533–575 (1999)
4. Besley, T., Case, A.: Unnatural experiments? Estimating the incidence of endogenous policies. Econ J 110, 672–694 (2000)
5. Boehnke, M., Langefeld, C.D.: Genetic association mapping based on discordant sib pairs: The discordant alleles test. Am J Hum Genet 62, 950–961 (1998)
6. Bollobás, B.: Combinatorics. New York: Cambridge University Press (1986)
7. Campbell, D.T.: Factors relevant to the validity of experiments in social settings. Psychol Bull 54, 297–312 (1957)
8. Campbell, D.T.: Methodology and Epistemology for Social Science: Selected Papers. Chicago: University of Chicago Press (1988)
9. Cochran, W.G.: The planning of observational studies of human populations (with Discussion). J Roy Statist Soc A 128, 234–265 (1965)
10. Cook, T.D., Shadish, W.R.: Social experiments: Some developments over the past fifteen years. Annu Rev Psychol 45, 545–580 (1994)
11. Cornfield, J., Haenszel, W., Hammond, E., Lilienfeld, A., Shimkin, M., Wynder, E.: Smoking and lung cancer: Recent evidence and a discussion of some questions. J Natl Cancer Inst 22, 173–203 (1959)
12. Curtis, D.: Use of siblings as controls in case-control association studies. Ann Hum Genet 61, 319–333 (1997)
13. Dynarski, S.M.: Does aid matter? Measuring the effect of student aid on college attendance and completion. Am Econ Rev 93, 279–288 (2003)
14. Evans, L.: The effectiveness of safety belts in preventing fatalities. Accid Anal Prev 18, 229–241 (1986)
15. Fisher, R.A.: Design of Experiments. Edinburgh: Oliver and Boyd (1935)
16. Gastwirth, J.L., Krieger, A.M., Rosenbaum, P.R.: Asymptotic separability in sensitivity analysis. J Roy Statist Soc B 62, 545–555 (2000)
17. Greenstone, M., Gayer, T.: Quasi-experimental and experimental approaches to environmental economics. J Environ Econ Manag 57, 21–44 (2009)
18. Greevy, R., Silber, J.H., Cnaan, A., Rosenbaum, P.R.: Randomization inference with imperfect compliance in the ACE-inhibitor after anthracycline randomized trial. J Am Statist Assoc 99, 7–15 (2004)
19. Hamermesh, D.S.: The craft of labormetrics. Indust Labor Relat Rev 53, 363–380 (2000)
20. Heart Protection Study Collaborative Group: MRC/BHF Heart Protection Study of antioxidant vitamin supplementation in 20,536 high-risk individuals: A randomised placebo-controlled trial. Lancet 360, 23–33 (2002)
21. Heckman, J.J.: Micro data, heterogeneity, and the evaluation of public policy: Nobel lecture. J Polit Econ 109, 673–748 (2001)
22. Hernán, M.A., Alonso, A., Logan, R., Grodstein, F., Michels, K.B., Willett, W.C., Manson, J.E., Robins, J.M.: Observational studies analyzed like randomized experiments: An application to postmenopausal hormone therapy and coronary heart disease (with Discussion). Epidemiology 19, 766–793 (2008)
23. Hill, A.B.: The environment and disease: Association or causation? Proc Roy Soc Med 58, 295–300 (1965)
24. Holland, P.W.: Statistics and causal inference. J Am Statist Assoc 81, 945–960 (1986)
25. Holley, R.: Remarks on the FKG inequalities. Comm Math Phys 36, 227–231 (1974)
26. Imbens, G.W., Wooldridge, J.M.: Recent developments in the econometrics of program evaluation. J Econ Lit 47, 5–86 (2009)
27. Kant, I.: What is enlightenment? In: I. Kant, Toward Perpetual Peace and Other Writings. New Haven, CT: Yale University Press (1785, 2006)
28. Katan, M.B.: Apolipoprotein E isoforms, serum cholesterol, and cancer. Lancet 1, 507–508 (1986). Reprinted: Int J Epidemiol 33, 9 (2004)
29. Katan, M.B.: Commentary: Mendelian randomization, 18 years on. Int J Epidemiol 33, 10–11 (2004)
30. Khaw, K.T., Bingham, S., Welch, A., Luben, R., Wareham, N., Oakes, S., Day, N.: Relation between plasma ascorbic acid and mortality in men and women in EPIC-Norfolk prospective study. Lancet 357, 657–663 (2001)
31. Laird, N.M., Blacker, D., Wilcox, M.: The sib transmission/disequilibrium test is a Mantel-Haenszel test. Am J Hum Genet 63, 1915 (1998)
32. Lawlor, D.A., Smith, G.D., Bruckdorfer, K.R., Kundo, D., Ebrahim, S.: Those confounded vitamins: What can we learn from the differences between observational versus randomized trial evidence? Lancet 363, 1724–1727 (2004)
33. Meyer, B.D.: Natural and quasi-experiments in economics. J Business Econ Statist 13, 151–161 (1995)
34. Rosenbaum, P.R.: The consequences of adjustment for a concomitant variable that has been affected by the treatment. J Roy Statist Soc A 147, 656–666 (1984)
35. Rosenbaum, P.R.: On permutation tests for hidden biases in observational studies: An application of Holley’s inequality to the Savage lattice. Ann Statist 17, 643–653 (1989)
36. Rosenbaum, P.R.: Quantiles in nonrandom samples and observational studies. J Am Statist Assoc 90, 1424–1431 (1995)
37. Rosenbaum, P.R.: Stability in the absence of treatment. J Am Statist Assoc 96, 210–219 (2001)
38. Rosenbaum, P.R.: Observational Studies (2nd ed.). New York: Springer (2002)
39. Rosenbaum, P.R.: Sensitivity analysis for m-estimates, tests, and confidence intervals in matched observational studies. Biometrics 63, 456–464 (2007)
40. Rosenzweig, M.R., Wolpin, K.I.: Natural ‘natural experiments’ in economics. J Econ Lit 38, 827–874 (2000)
41. Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 66, 688–701 (1974)
42. Rubin, D.B.: The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statist Med 26, 20–36 (2007)
43. Rutter, M.: Identifying the Environmental Causes of Disease: How Do We Decide What to Believe and When to Take Action? London: Academy of Medical Sciences (2007)
44. Saunders, A.M., Strittmatter, W.J., Schmechel, D., et al.: Association of apolipoprotein E allele epsilon 4 with late-onset familial and sporadic Alzheimer’s disease. Neurology 43, 1467–1472 (1993)
45. Savage, I.R.: Contributions to the theory of rank order statistics: Applications of lattice theory. Rev Int Statist Inst 32, 52–63 (1964)
46. Sekhon, J.S.: Opiates for the matches: Matching methods for causal inference. Annu Rev Pol Sci 12, 487–508 (2009)
47. Shadish, W.R., Cook, T.D., Campbell, D.T.: Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton-Mifflin (2002)
48. Shadish, W.R., Cook, T.D.: The renaissance of field experimentation in evaluating interventions. Annu Rev Psychol 60, 607–629 (2009)
49. Spielman, R.S., McGinnis, R.E., Ewens, W.J.: Transmission test for linkage disequilibrium. Am J Hum Genet 52, 506–516 (1993)
50. Spielman, R.S., Ewens, W.J.: A sibship test for linkage in the presence of association: The sib transmission/disequilibrium test. Am J Hum Genet 62, 450–458 (1998)
51. Strachan, T., Read, A.P.: Human Molecular Genetics. New York: Garland (2004)
52. Susser, M.: Epidemiology, Health and Society: Selected Papers. New York: Oxford University Press (1987)
53. Vandenbroucke, J.P.: When are observational studies as credible as randomized trials? Lancet 363, 1728–1731 (2004)
54. West, S.G., Duan, N., Pequegnat, W., Gaist, P., Des Jarlais, D.C., Holtgrave, D., Szapocznik, J., Fishbein, M., Rapkin, B., Clatts, M., Mullen, P.D.: Alternatives to the randomized controlled trial. Am J Public Health 98, 1359–1366 (2008)

Chapter 2
Causal Inference in Randomized Experiments
Abstract An observational study is an empiric investigation of treatment effects when random assignment to treatment or control is not feasible. Because observational studies are structured to resemble simple randomized experiments, an understanding of the role randomization plays in experiments is important as background. As a prelude to the discussion of observational studies in later chapters, the current chapter contains a brief review of the logic of causal inference in a randomized experiment. Only one simple case is discussed in detail, namely a randomized paired experiment in which subjects are paired before randomization and one subject in each pair is picked at random to receive treatment, the other receiving control. Although a foundation for later chapters, much of the material in this chapter is quite old, dating from Sir Ronald Fisher's work in the 1920s and 1930s, and it is likely to be familiar from other contexts, such as a course in the design of experiments.
2.1 Two Versions of the National Supported Work Experiment
2.1.1 A version with 185 pairs and a version with 5 pairs
Though discussed daily in the newspaper, unemployment is a curious phenomenon. Defined abstractly, it implies that someone entered an active labor market intending to sell their labor but was unable to find a buyer, often for long periods of time. Of course, the abstract definition leaves out most of what is happening. Robert LaLonde [30] reviews several decades of
public sector sponsored employment and training programs . . . [intended to] enhance participants' productive skills and, in turn, increase their future earnings and tax payments and reduce their dependence on social welfare benefits. [. . . The] primary recipients of public sector sponsored training [have been] economically disadvantaged or dislocated workers. [30, page 149]
The National Supported Work Demonstration (NSW) included a randomized experiment evaluating the effects of one such program [7, 24]. Kenneth Couch writes:
The NSW provided work experiences primarily in service occupations for females and construction for males. The jobs were designed to be consistent with a concept known as graduated stress. Stress within the working environment increased gradually during the training period until it simulated the workplace norms of the private sector. At that point, not more than 18 months after entry, individuals who received services provided by the NSW had to attempt a transition to unsubsidized employment. Screening criteria for the NSW limited participation to individuals severely handicapped in the labor market. . . . After screening by a local social service agency and referral into the program, each participant was randomly assigned to a treatment (experimental) or control group. Treatments received the services offered by the NSW. Controls continued enrollment in other available social programs [7, pages 381–382].
The current chapter will use pieces of the NSW experiment to illustrate the logic of randomized experimentation. A reader interested not in randomized experimentation but rather in the full NSW program and its effects might begin with Couch's [7] study of the program's long term effects. The NSW became important in thinking about methodology for observational studies in economics because of a series of studies [12, 19, 29, 57], beginning with LaLonde's 1986 study [29], in which the randomized control group was set aside and various analytical methods were applied to nonrandomized controls from survey data.

Because the goal is to discuss the logic of randomized experiments in a manner that prepares for the discussion of observational studies in later chapters, several adjustments and simplifications were made. First, the data set is what Rajeev Dehejia and Sadek Wahba [12, Table 1] call the "RE74 subset," which in turn is a subset of the data used by LaLonde [29]. The subset consists of males who were randomized to treatment or control after December 1975 and who had left the program before January 1978, with annual earnings recorded in 1974, 1975 and 1978. Because of these requirements, earnings in 1974 and 1975 are pretreatment covariates unaffected by the NSW program, and earnings in 1978 is an outcome that may be affected by treatment. Furthermore, to emphasize the parallel with matched observational studies, the randomized treated and control groups were matched: specifically, all 185 treated men were matched to 185 untreated controls using eight covariates, forming 185 matched pairs.¹

For the 185 matched pairs, Table 2.1 shows the distribution of the eight covariates. Before treatment, the treated and control groups looked fairly similar.
¹ As mentioned in Chapter 1, to keep the theoretical technicalities to a minimum, statistical theory is developed only for the case of matched pairs, so I have slightly reshaped the NSW experiment into a matched pair study. Randomization inference for unmatched randomized experiments is similar and is discussed in [44, Chapter 2]. The matching used precisely the methods described later in §8.4, using a penalty function to implement calipers on the propensity score, with the rank-based Mahalanobis distance used within calipers. Multivariate matching before randomization can improve the efficiency of a randomized experiment; see [17]. For illustration only in this expository chapter, the matching is done after random assignment, which entails discarding some of the randomized controls, something I would not do if the goal were to perform the most efficient analysis of the NSW experiment.
Table 2.1 Pretreatment covariates for 185 matched pairs from the NSW randomized experiment. For age and years of education, the mean, median (50%) and quartiles (25% and 75%) are given. For earnings, the mean of all earnings (including zero earnings) and the percentage with zero earnings are given. For binary variables, the percentage is given.

Covariate             Group     Mean    25%   50%   75%
Age                   Treated   25.82   20    25    29
                      Control   25.70   20    25    29
Years of Education    Treated   10.35    9    11    12
                      Control   10.19    9    10    11

Covariate             Group     Mean    Percent $0
Earnings ($) in 1974  Treated   2096    71%
                      Control   2009    75%
Earnings ($) in 1975  Treated   1532    60%
                      Control   1485    64%

Covariate             Group     Percent
Black                 Treated   84%
                      Control   85%
Hispanic              Treated    6%
                      Control    5%
Married               Treated   19%
                      Control   20%
No High School        Treated   71%
Degree                Control   77%
Figure 2.1 displays the earnings in 1978 after treatment for the 185 matched pairs of men. In the boxplot of 185 matched pair differences, the dotted lines are at 0 and ±5000. It appears that treated men earned somewhat more. In general, in this book, I will use small examples to illustrate the details of what goes on, and examples of practical size to illustrate analysis and interpretation. In that spirit, Table 2.2 is a random sample of five of the 185 pairs; it will be used to illustrate the details of randomization inference. The sample happens to consist of pairs 15, 37, 46, 151, and 181 of the 185 pairs.
2.1.2 Basic notation
Table 2.2 exemplifies notation that will be used throughout the book. The index of the pair is i, i = 1, 2, ..., 5 = I, and the index of the person in the pair is j, j = 1, 2. In Table 2.2 and throughout the book, Z indicates treatment, Z = 1 for treated, Z = 0 for control, x is an observed covariate — there are eight covariates, xijk, k = 1, ..., 8 = K, in Table 2.2 — and R indicates a response, in this case, earnings after the end of treatment in 1978. The first pair in Table 2.2 consists of two unmarried, young black men with seven or eight years of education and no earnings in 1974 and 1975, before the start of treatment; after treatment, in 1978, the treated man earned more. The treated-minus-control matched pair difference in 1978 earnings
[Figure 2.1 appears here: side-by-side boxplots of 1978 earnings for the treated and control groups, and a boxplot of the treated−control differences in 1978 earnings.]

Fig. 2.1 Earnings in 1978, after treatment, for 185 pairs of men in the NSW randomized experiment. The dotted lines are at −$5000, $0, and $5000.
Table 2.2 Five pairs sampled at random from the 185 pairs in the NSW randomized experiment. The variables are: id = pair number among the 185 pairs; i = pair 1 to 5; j = person 1 or 2 in a pair; treat = 1 if treated, 0 if control; age in years; edu = education in years; black = 1 if black, 0 otherwise; hisp = 1 if hispanic, 0 otherwise; married = 1 if married, 0 otherwise; nodegree = 1 if no high school degree, 0 otherwise; and re74, re75, and re78 are earnings in dollars in 1974, 1975, and 1978. Also, in pair i, Yi is the treated-minus-control matched pair difference in 1978 earnings, Yi = (Zi1 − Zi2)(Ri1 − Ri2).

          treat  age   edu   black  hisp  married  nodegree  re74   re75    re78
 id  i j   Zij   xij1  xij2  xij3   xij4  xij5     xij6      xij7   xij8     Rij     Yi
 15  1 1    1     17     7    1      0     0        1            0      0   3024   1456
 15  1 2    0     18     8    1      0     0        1            0      0   1568
 37  2 1    1     25     5    1      0     0        1            0      0   6182   3988
 37  2 2    0     24     7    1      0     0        1            0      0   2194
 46  3 1    1     25    11    1      0     1        1            0      0      0    −45
 46  3 2    0     25    11    1      0     1        1            0      0     45
151  4 1    1     28    10    1      0     0        1            0   2837   3197  −2147
151  4 2    0     22    10    1      0     0        1            0   2175   5344
181  5 1    1     33    12    1      0     1        0        20280  10941  15953   3173
181  5 2    0     28    12    1      0     1        0        10585   5551  12780
is Yi = (Zi1 − Zi2)(Ri1 − Ri2), so Y1 = (1 − 0)(3024 − 1568) = $1456 for the first pair of men.

It is convenient to have a symbol that represents the K covariates together, and vector notation does this, so xij = (xij1, ..., xijK)^T contains the K covariate values for the jth person or unit in the ith pair. If you are not familiar with vector notation, it will not be a problem here. Vectors and vector notation have many uses, but for the most part, in the current book they are used simply to give concise names to arrays of data. For instance, in Table 2.2, x11 = (17, 7, 1, 0, 0, 1, 0, 0)^T contains the covariate values for the first man in the first pair. In the same way, Z indicates the treatment assignments for all 2I subjects in the I matched pairs, Z = (Z11, Z12, Z21, ..., ZI2)^T. For the I = 5 pairs in Table 2.2, Z = (1, 0, 1, 0, 1, 0, 1, 0, 1, 0)^T. The notation is slightly redundant, because Zi2 = 1 − Zi1, so that, now and then, when compact expression is needed, only the Zi1's are mentioned. Also, R = (R11, R12, ..., RI2)^T and Y = (Y1, Y2, ..., YI)^T.

It is possible to match for observed covariates, but not for a covariate that was not observed. There is an important sense in which a failure to match or control for an unobserved covariate presents no special problem in a randomized experiment but can present substantial problems in an observational study. Clarification of this distinction is one goal of the current chapter. For this purpose, it is convenient to give a name, uij, to the unobserved covariate. In the current chapter, which is focused on inference in randomized experiments, uij could be any unmeasured covariate (or any vector containing several unmeasured covariates).
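The pair differences in Table 2.2 can be recomputed directly from the definition Yi = (Zi1 − Zi2)(Ri1 − Ri2). A minimal sketch in Python (an illustration added here, not code from the book; the variable names simply mirror the notation):

```python
# 1978 earnings (Ri1, Ri2) for the five pairs in Table 2.2;
# person j = 1 is the treated man, j = 2 the control.
R = [(3024, 1568), (6182, 2194), (0, 45), (3197, 5344), (15953, 12780)]
Z = [(1, 0)] * 5  # Zi1 = 1, Zi2 = 0 in every one of these sampled pairs

# Treated-minus-control matched pair difference: Yi = (Zi1 - Zi2)(Ri1 - Ri2)
Y = [(z1 - z2) * (r1 - r2) for (z1, z2), (r1, r2) in zip(Z, R)]
print(Y)  # [1456, 3988, -45, -2147, 3173]
```

The printed values agree with the Yi column of Table 2.2; note that the formula returns the treated-minus-control difference regardless of which person in the pair happens to be listed first.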
Success in finding and keeping a job may depend on aspects of personality, intelligence, family and personal connections, and physical appearance; then, in this chapter in connection with the NSW experiment, uij could be a vector of measurements of these attributes, a vector that was not measured by the investigators. Also, u = (u11, u12, ..., uI2)^T.

To aid in reading the table, subject j = 1 is always the treated subject and subject j = 2 is always the control subject; however, strictly speaking, one should not do this. Strictly speaking, ij is the 'name' of a unique person; that person had his name, ij, before treatments were randomly assigned. In a paired randomized experiment, Zi1 is determined by I independent flips of a fair coin, so Z11 = 1 happens with probability 1/2 and Z11 = 0 happens with probability 1/2; then Zi2 = 1 − Zi1. That is, in Table 2.2, Zi1 is 1 and Zi2 is 0 for every i, but this would happen by chance in a paired randomized experiment with I = 5 pairs with probability (1/2)^5 = 1/32 = 0.03125. Strictly speaking, some of the Zi1's should be 1 and some of the Zi2's should be 1. Quantities in statistical computations, such as the treated-minus-control difference in responses, Yi = (Zi1 − Zi2)(Ri1 − Ri2), are unaffected by the ordering of the table. So there is a small inconsistency between the logic of random assignment and the form of a readable table. Having mentioned this once, I will adhere to the logic of random assignment in theoretical discussions, present tables in readable form, and ignore the small inconsistency.
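The paired coin-flipping mechanism described above can be sketched as follows (an illustration added for this discussion, not the NSW assignment procedure itself; the function name is invented):

```python
import random

def assign_pairs(I, rng):
    """Flip a fair coin independently in each of I pairs: Zi1 is 1 or 0
    with probability 1/2 each, and Zi2 = 1 - Zi1 is forced."""
    Z = []
    for _ in range(I):
        zi1 = rng.randint(0, 1)
        Z.append((zi1, 1 - zi1))
    return Z

rng = random.Random(2023)  # seeded only so the sketch is reproducible
Z = assign_pairs(5, rng)
assert all(zi1 + zi2 == 1 for zi1, zi2 in Z)  # exactly one treated per pair

# Chance that the first-listed man is treated in all 5 pairs, as in Table 2.2:
print(0.5 ** 5)  # 0.03125
```

Any particular pattern of five coin flips, including the all-treated-first pattern in Table 2.2, has the same probability, (1/2)^5 = 1/32.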
2.2 Treatment Effects in Randomized Experiments
2.2.1 Potential responses under alternative treatments
In Table 2.2, the first man, (i, j) = (1, 1), was randomly assigned to treatment, Z11 = 1, and had earnings of R11 = $3024 in 1978, but things might have been different. Had the coin fallen differently, the first man, (i, j) = (1, 1), might have been randomly assigned to control, Z11 = 0, and in this case his earnings in 1978 might have been different. We will never know what this first man's earnings would have been had the coin fallen differently, had he been assigned to control. Perhaps the treatment was completely ineffective, and perhaps this first man would have held the same job, with the same earnings, namely $3024, had he been assigned to control. Or perhaps the treatment raised his earnings, and his earnings under control would have been lower. We will never know about this one man, but with 370 men in 185 pairs, half randomly assigned to treatment, the others to control, we can say something about what would have happened to the 370 men under treatment and under control.

Not only the first man but each man (i, j) has two potential responses, a level of earnings he would exhibit in 1978 if assigned to treatment, rTij, and a level of earnings he would exhibit in 1978 if assigned to control, rCij. We see one of these. Specifically, if (i, j) were assigned to treatment, Zij = 1, we would see rTij, but if (i, j) were assigned to control, Zij = 0, we would see rCij. The response, Rij, we actually observe from (i, j) — that is, the 1978 earnings actually recorded in Table 2.2 — equals rTij if Zij = 1 or rCij if Zij = 0; that is, in a formula, Rij = Zij rTij + (1 − Zij) rCij. Also, rT = (rT11, rT12, ..., rTI2)^T and rC = (rC11, rC12, ..., rCI2)^T. To say that the treatment has no effect on this response from (i, j) is to say rCij = rTij. To say that the treatment caused (i, j)'s earnings to increase by $1000 is to say rTij = rCij + 1000 or rTij − rCij = 1000.
We will never be in a position to confidently assert either of these things about a single man. It is, however, a very different thing to assert that all 370 men were unaffected by the treatment — to assert that rCij = rTij for i = 1, 2, ..., 185, j = 1, 2; in a randomized experiment, we may be in a position to confidently deny that. The hypothesis that the treatment had no effect on anyone, namely H0: rCij = rTij for i = 1, 2, ..., 185, j = 1, 2, is known as Ronald Fisher's [13] sharp null hypothesis of no effect. This hypothesis may be written compactly as H0: rT = rC.

The notation that expresses treatment effects as comparisons of potential responses under alternative treatments was introduced into the design of experiments by Jerzy Neyman [35] in 1923 and was used to solve various problems in randomized experiments; e.g., [62, 64, 9, 40] and [28, §8.3]. Donald Rubin [50] first advocated use of this notation in observational studies.
2.2.2 Covariates and outcomes
In §1.2, the distinction between covariates and outcomes was emphasized. A covariate, such as xij or uij, is a pretreatment quantity, so there is only one version of a covariate. A response or outcome has two potential values, (rTij, rCij), one of which is observed, namely Rij, depending upon the treatment assignment Zij, that is, Rij = Zij rTij + (1 − Zij) rCij.

Notice that rTij, rCij, xij, uij do not change when treatments are assigned at random, that is, when Zij is determined by a coin flip or a random number, but in general the observed response, Rij, does change. It is convenient to have a symbol that represents the quantities that are not changed when treatments are randomized. Let F denote the array of quantities (rTij, rCij, xij, uij), i = 1, 2, ..., I, j = 1, 2, that do not change when Zij is determined. (These quantities are fixed in Fisher's theory of randomization inference; hence, the symbol is F.)
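The identity Rij = Zij rTij + (1 − Zij) rCij can be checked mechanically. In the sketch below (added for illustration; the function name is invented, and the potential responses are hypothetical numbers, not NSW data), the pair (rTij, rCij) stays fixed while the observed response changes with the coin flip:

```python
def observed_response(Zij, rTij, rCij):
    """Rij = Zij * rTij + (1 - Zij) * rCij: rTij is seen under treatment,
    rCij under control; the other potential response stays unobserved."""
    return Zij * rTij + (1 - Zij) * rCij

# Hypothetical potential responses: suppose treatment would raise this
# man's 1978 earnings by $1000.
rTij, rCij = 4024, 3024
assert observed_response(1, rTij, rCij) == 4024  # assigned to treatment
assert observed_response(0, rTij, rCij) == 3024  # assigned to control
print(rTij - rCij)  # 1000: the effect itself is never observed for one man
```

Only one of the two branches is ever realized for a given man, which is why rTij − rCij cannot be computed from data for a single individual.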
2.2.3 Possible treatment assignments and randomization
The observed treatment assignment in Table 2.2 is Z = (1, 0, 1, 0, 1, 0, 1, 0, 1, 0)^T, but random assignment of treatments within pairs might have picked a different assignment. Write Z for the set of the 2^I possible values of Z; that is, z ∈ Z if and only if z = (z11, z12, ..., zI2)^T with zij = 0 or zij = 1 for each i, j, and zi1 + zi2 = 1 for each i. For Table 2.2, there are 2^I = 2^5 = 32 possible values z ∈ Z. Generally, if A is a finite set, then |A| is the number of elements of A, so that, in particular, |Z| = 2^I. Table 2.3 lists, in abbreviated form, the 32 possible treatment assignments for Table 2.2; the abbreviation consists in listing zi1 but not zi2 because zi2 = 1 − zi1. The observed treatment assignment, Z = (1, 0, 1, 0, 1, 0, 1, 0, 1, 0)^T in Table 2.2, corresponds with the first row of Table 2.3. In the second row of Table 2.3, the treatment assignment for the fifth pair in Table 2.2 has been reversed, so the second man, not the first man, in that pair is assigned to treatment. For the I = 185 pairs in Table 2.1, the set Z of possible treatment assignments contains 2^I = 2^185 ≈ 4.9 × 10^55 possible treatment assignments, z ∈ Z. It would be inconvenient to list them.

What does it mean to randomly assign treatments? At an intuitive level, one assignment z ∈ Z is picked at random, each having probability 2^−I, or 2^−5 = 1/32 = 0.03125 for Table 2.3. For instance, one might flip a fair coin independently five times to determine Z in Table 2.3. The intuition is that randomization is making a statement about how Z alone was determined, that is, about the marginal distribution of Z. This intuition is not quite correct; in an important way, randomization means more than this. Specifically, in a randomized experiment, the information in (rTij, rCij, xij, uij) (or in F) is of no use in predicting Zij.
That is, the coin is fair not just in coming up heads half the time, independently in different pairs, but more importantly the coin knows nothing about the individual and is impartial in its treatment assignments. The design of the paired randomized experiment forces Z to fall in Z — that is, it forces the event Z ∈ Z to occur — which is denoted concisely by saying the event Z occurs. Randomization in the paired randomized experiment means:

    Pr(Z = z | F, Z) = 1/|Z| = 1/2^I for each z ∈ Z.    (2.1)
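The set Z and the uniform distribution in (2.1) can be enumerated directly for I = 5 (a sketch added for illustration; listing the set via `itertools.product` happens to reproduce the order of Table 2.3, from 11111 down to 00000):

```python
from fractions import Fraction
from itertools import product

I = 5
# Each z in the set Z is determined by (z11, z21, ..., zI1), since
# zi2 = 1 - zi1 is forced; product([1, 0], ...) lists all 2^I choices.
Zset = list(product([1, 0], repeat=I))
assert len(Zset) == 2 ** I  # |Z| = 32

# Equation (2.1): each assignment has probability 1/|Z| = 1/2^I.
prob = Fraction(1, len(Zset))
print(prob, float(prob))   # 1/32 0.03125
print(Zset[0], Zset[-1])   # (1, 1, 1, 1, 1) (0, 0, 0, 0, 0)
```

For I = 185 the same enumeration is impossible in practice (2^185 elements), which is why randomization inference works with the distribution (2.1) rather than an explicit list.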
Table 2.3 The set Z of 32 = 2^5 possible treatment assignments for the small version of the NSW experiment with I = 5 pairs. Only Zi1 is listed because Zi2 = 1 − Zi1.

Label   Z11  Z21  Z31  Z41  Z51      Label   Z11  Z21  Z31  Z41  Z51
  1      1    1    1    1    1        17      0    1    1    1    1
  2      1    1    1    1    0        18      0    1    1    1    0
  3      1    1    1    0    1        19      0    1    1    0    1
  4      1    1    1    0    0        20      0    1    1    0    0
  5      1    1    0    1    1        21      0    1    0    1    1
  6      1    1    0    1    0        22      0    1    0    1    0
  7      1    1    0    0    1        23      0    1    0    0    1
  8      1    1    0    0    0        24      0    1    0    0    0
  9      1    0    1    1    1        25      0    0    1    1    1
 10      1    0    1    1    0        26      0    0    1    1    0
 11      1    0    1    0    1        27      0    0    1    0    1
 12      1    0    1    0    0        28      0    0    1    0    0
 13      1    0    0    1    1        29      0    0    0    1    1
 14      1    0    0    1    0        30      0    0    0    1    0
 15      1    0    0    0    1        31      0    0    0    0    1
 16      1    0    0    0    0        32      0    0    0    0    0