Journal of Economic Literature 48 (June 2010): 399–423 http://www.aeaweb.org/articles.php?doi=10.1257/jel.48.2.399

Better LATE Than Nothing: Some Comments on Deaton (2009) and Heckman and Urzua (2009)

Guido W. Imbens*

Two recent papers, Deaton (2009) and Heckman and Urzua (2009), argue against what they see as an excessive and inappropriate use of experimental and quasi-experimental methods in empirical work in economics in the last decade. They specifically question the increased use of instrumental variables and natural experiments in labor economics and of randomized experiments in development economics. In these comments, I will make the case that this move toward shoring up the internal validity of estimates, and toward clarifying the description of the population these estimates are relevant for, has been important and beneficial in increasing the credibility of empirical work in economics. I also address some other concerns raised by the Deaton and Heckman–Urzua papers. (JEL C21, C31)

1. Introduction

Two recent papers, Angus S. Deaton (2009) (Deaton from hereon) and James J. Heckman and Sergio Urzua (2009) (HU from hereon), argue against what they see as an excessive and inappropriate use of experimental and quasi-experimental methods in empirical work in economics in the last decade.1 Deaton and HU reserve much of their scorn for the local average treatment effect (LATE) introduced in the econometric literature by Guido W. Imbens and Joshua D. Angrist (1994) (IA from hereon). HU write: “Problems of identification and interpretation are swept under the rug and replaced by ‘an effect’ identified by IV that is often very difficult to interpret as an answer to an interesting economic question” (HU, p. 20). Deaton writes: “This goes beyond the old story of looking [for] an object where the light is strong enough to see; rather, we have control over the light, but choose to let it fall where

* Imbens: . I have benefitted from discussions with , Susan Athey, Abhijit Banerjee, , Gary Chamberlain, , Kei Hirano, Geert Ridder, Chuck Manski, Sendhil Mullainathan, and Jeffrey Wooldridge, although they bear no responsibility for any of the views expressed here. Financial support for this research was generously provided through NSF grants 0631252 and 0820361.

1 The papers make similar arguments, perhaps not surprisingly given Deaton’s acknowledgement that “much of what I have to say is a recapitulation of his [Heckman’s] arguments” (Deaton, p. 4).

399 400 Journal of Economic Literature, Vol. XLVIII (June 2010) it may, and then proclaim that whatever it potential causes. In contrast to what Deaton illuminates is what we were looking for all and HU suggest, this issue of data quality along” (Deaton, p. 10) and “The LATE may, and study design is distinct from the choice or may not, be a parameter of interest . . . and between more or less structural or theory in general, there is no reason to suppose that driven models and estimation methods. In it will be . . . I find it hard to make any sense fact, recognizing this distinction, there has of the LATE” (Deaton, p. 10).2 Deaton also been much interesting work exploring the rails against the perceived laziness of these benefits of randomization for identification, researchers by raising the “futility of trying estimation, and assessment of structural to avoid thinking about how and why things models. For an early example, see Jerry A. work,” (Deaton, p. 14).3 HU wonder whether Hausman and David A. Wise (1979), who these researchers are of the opinion: “that dis- estimate a model for attrition with data from guising identification problems by a statistical a randomized income maintenance experi- procedure is preferable to an honest discus- ment, and, for recent examples, see, among sion of the limits of the data?” (HU, p. 19). others, Card and Dean R. Hyslop (2005), who The fact that two such distinguished econo- estimate a structural model of welfare partici- mists so forcefully4 question trends in current pation using experimental data from Canada; practice may suggest to those not familiar with Petra E. Todd and Kenneth Wolpin (2003), this literature that it is going seriously awry. In who analyze data from Mexico’s Progresa pro- these comments, I will argue that, to the con- gram; Imbens, Donald B. Rubin and Bruce I. trary, empirical work is much more credible Sacerdote (2001), who estimate dynamic labor as a result of the natural experiments revolu- supply models exploiting random variation in tion started by David Card, Angrist, Alan B. unearned earnings using data from lottery Krueger, and others in the late 1980s. Starting winners; Duflo, Rema Hanna, and Stephen in the late 1980s, their work, and more Ryan (2007), who look at the effect of ran- recently that by development economists domized monitoring and financial incentives such as Abhijit V. Banerjee, Esther Duflo, on teacher’s absences; Susan Athey, Jonathan and Michael Kremer arguing in favor of ran- Levin, and Enrique Seira (forthcoming), domized experiments, has had a profound who use randomized assignment of influence on empirical work. By emphasizing formats to estimate structural models of bid- internal validity and study design, this litera- ding behavior; , Adam Looney, and ture has shown the importance of looking for Kory Kroft (2009), who look at the effect of clear and exogenous sources of variation in tax saliency using experimental evidence from supermarket pricing policies; and Chetty and (2009), who exploit random 2 Somewhat incongruously, Deaton views Heckman’s local instrumental variables methods much more positively variation in information about the tax code. as the “appropriate” (Deaton, p. 
14) response to dealing There is much room for such work where with heterogeneity, although the marginal treatment effect experimental variation is used to improve that is the cornerstone of this approach is nothing more than the limit version of the LATE in the presence of con- the identification and credibility of the struc- tinuous instruments, e.g., Heckman, Urzua, and Edward tural models. It would put at risk the progress Vytlacil (2006), see also Angrist, Kathryn Graddy, and made in improving the credibility of empirical Imbens 2000). 3 Curiously, Deaton exempts the leaders of this move- work in economics if this message got lost in ment from these charges, by declaring them “too talented discussions about the relative merits of struc- to be bound by their own methodological prescriptions” tural work versus work less directly connected (Deaton, p. 4). 4 Deaton dismisses arguments of those he disagrees to economic theory or in minor squabbles with as merely “rhetoric” no less than six times in his paper. about second-order technicalities such as Imbens: Better LATE Than Nothing 401

­adjustments for heteroskedasticity in the cal- as evidence of a causal effect of smoking culation of standard errors and the Behrens– on lung cancer is now generally accepted, Fisher problem (e.g., Deaton, p. 33).5 without any direct experimental evidence In my view, it is helpful to separate the to support it. It would be unfortunate if the discussion regarding the merits of the recent current interest in credible causal inference, literature on experiments and natural experi- by insisting on sometimes unattainable stan- ments into two parts. The first part concerns dards of internal validity, leads researchers to the questions of interest and the second the avoid such questions. At the same time, the choice of methods conditional on the ques- long road toward general acceptance of the tion. In my opinion, the main concern with causal interpretation of the smoking and lung the current trend toward credible causal cancer correlation (including Fisher’s long- inference in general, and toward random- time skepticism about the causal interpreta- ized experiments in particular, is that it may tion of this correlation) shows the difficulties lead researchers to avoid questions where in gaining acceptance for causal claims with- randomization is difficult, or even conceptu- out randomization. ally impossible, and natural experiments are However, the importance of questions for not available. There are many such questions which randomization is difficult or infea- and many of them are of great importance. sible should not take away from the fact Questions concerning the causal effects of that, for answering the questions they are macroeconomic policies can rarely be settled designed for, randomized experiments, and by randomized experiments.6 The effect of other (what Card calls) design-based strat- mergers and acquisitions cannot be stud- egies, have many advantages. Specifically, ied using experiments. Similarly, questions conditional on the question of interest being involving general equilibrium effects cannot one for which randomized experiments are be answered by simple experiments. In other feasible, randomized experiments are supe- examples, randomized experiments raise eth- rior to all other designs in terms of credibil- ical concerns and are ultimately not feasible. ity. Deaton’s view that “experiments have These are not new concerns and I am sympa- no special ability to produce more credible thetic with the comments in this regard made knowledge than other methods” (Deaton, by, for example, (2008). There abstract) runs counter to the opinions of is clearly much room for nonexperimental many researchers who have considered these work and history abounds with examples issues previously. David A. Freedman, hailed where causality has ultimately found general by Deaton himself as “one of its [the world’s] acceptance without any experimental evi- greatest statisticians”7 (Deaton, title page, dence. The most famous example is perhaps acknowledgement) is unambiguous in his the correlation between smoking and lung opening sentence, “Experiments offer more ­cancer. The interpretation of this ­correlation reliable evidence on causation than observa- tional studies” (Freedman 2006, abstract). Edward E. Leamer (1983), in his influential 5 Moreover, there is nothing in these issues that makes criticism of the state of empirical work in the observational studies less vulnerable to them. 
1970s, writes, “There is therefore a sharp dif- 6 Although, for an interesting macroeconomic study in the spirit of the modern causal literature, see Christina ference between inference from ­randomized D. Romer and David H. Romer (2004), who study the experiments and inference from natural effects of monetary policy decisions on the macroeconomy, exploiting variation in Federal Reserve policies at times when markets viewed the Federal Reserve decisions as 7 I certainly have no disagreement with this qualification: unpredictable and, thus, essentially as random. see my endorsement on the back cover of Freedman (2010). 402 Journal of Economic Literature, Vol. XLVIII (June 2010) experiments” (Leamer, p. 33).8 That is not to its emphasis on obtaining credible causal say that one may not choose to do an obser- estimates and for developing a clear under- vational study for other reasons, e.g., finan- standing of the nature of the variation that cial costs, or ethical considerations, even in gives these estimates credibility, I will refer settings where randomized experiments are to this as the causal literature. Second, I will feasible. However, no other design will have discuss briefly the origins of this causal lit- the credibility that a randomized experiment erature, which takes its motivation partially would have. Suppose we are interested in a from the failure of specific structural models, question that can be addressed by random- such as the Heckman selection model (e.g., ized experiments, for example, whether a job Heckman 1978), to satisfactorily address training program has an effect on labor mar- endogeneity issues in the context of estima- ket outcomes or whether class size affects tion of causal effects of labor market pro- educational outcomes. In such settings, the grams. This was famously documented by evidence from a randomized experiment is LaLonde (1986) (see also Thomas Fraker unambiguously superior to that from obser- and Rebecca Maynard 1987). Third, I will vational studies. As a result, randomized elaborate on the point that, in cases where experiments have often been very influen- the focus is establishing the existence of tial in shaping policy debates, e.g., the 1965 causal effects and where experiments are Perry Preschool Project on early childhood feasible, experiments are unambiguously the interventions (see Constance Holden 1990 preferred approach: since Ronald A. Fisher and Charles F. Manski 1997 for some recent (1925) it has formally been established that discussions), the National Supported Work randomization gives such designs a credibil- Demonstration experiments on labor market ity unmatched by any other research design. programs (e.g., Robert J. LaLonde 1986), or Fourth, I will make the case that a key con- Project STAR on class size reductions (e.g., tribution of the recent theoretical literature Krueger 1999). More generally, and this on causality has been to clarify the merits, as is really the key point, in a situation where well as the limitations, of instrumental vari- one has control over the assignment mecha- ables, local average treatment effects, and nism, there is little to gain, and much to lose, regression discontinuity designs in settings by giving up this control through allowing with heterogenous causal effects. Far from individuals to choose their own treatment “disguising identification problems by a sta- regime. Randomization ensures exogeneity tistical procedure” (HU, p. 
19), it was shown of key variables, where, in a corresponding by IA that, in instrumental variables settings observational study, one would have to worry with heterogenous effects, instrumental about their potential endogeneity. variables methods do identify the average In these comments, I will make five treatment effect for a well defined subpopu- points from the perspective of an econome- lation (the compliers in the terminology from trician who is interested in, and has been Angrist, Imbens and Rubin 1996), indexed involved in, the methodological aspects of by the instrument.9 Although, in many cases this literature. First, I will give a different these, what are now known as local average characterization of goals and focus of the lit- erature Deaton and HU take issue with. For 9 Although Deaton credits Heckman (1997) with establishing that in the presence of heterogenous effects 8 By natural experiments Leamer here refers to studies the probability limit of instrumental variables estimators without formal randomization, that is, observational stud- depends on the instrument, this was previously shown in ies (personal communication). Imbens and Angrist (1994). Imbens: Better LATE Than Nothing 403 treatment effects or LATEs, and similarly the Imbens and Jeffrey M. Wooldridge 2009 for estimands in regression discontinuity designs, a survey) and in the experimental literature are not the average effects that researchers set (e.g., Banerjee and Duflo 2009). out to estimate, the internal validity of those estimands is often much higher than that of 2 Causal Models and Design-Based other estimands. I will also take issue with the Approaches Deaton and HU view that somehow instru- mental variables methods are atheoretical. The literature that does not conform to The exclusion and monotonicity restrictions the Deaton and HU standards of structural that underlie such methods are motivated by work is variously referred to, in a somewhat subject matter, that is economic, rather than pejorative manner, as atheoretical or statis- statistical, knowledge. Moreover, the focus tical (as opposed to economic).11 These are on instrumental variables estimands, rather not terms commonly used in this literature than on reduced form correlations between itself. They are also at odds with their histori- outcomes and exogenous variables (including cal use.12 The almost complete lack of instru- instruments), is motivated by the belief that mental variables methods in the statistical the former are more likely to be structural literature makes that label an unusual one than the latter.10 for the literature that Deaton and HU focus In the fifth point, I discuss issues related to on in their criticism. What is shared by this external validity, that is, the ability of the esti- literature is not so much a lack of theoretical mands to generalize to other populations and or economic motivation but rather an explicit settings. The causal literature has empha- emphasis on credibly estimating causal sized internal validity over external valid- effects, a recognition of the heterogene- ity, with the view that a credible estimate ity in these effects, clarity in the identifying of the average effect for a subpopulation is assumptions, and a concern about endogene- preferred to an estimate of the average for ity of choices and the role study design plays. the target population with little credibility. 
I will therefore refer to this interchangeably This is consistent with the biomedical litera- as the causal or design-based literature. Early ture. Although the primacy of internal valid- influential examples include the Card (1990) ity over external validity has been criticized study of the impact of immigration using the often in that literature, there is little support Mariel Boatlift, Angrist’s (1990) study of the for moving toward a system where stud- effect of veteran status on earnings using the ies with low internal validity receive much Vietnam era draft lottery as an instrument, weight in policy decisions. External valid- and the Angrist and Krueger (1991) study ity is generally a more substantial problem in economics than in biomedical settings, with considerable variation in both prefer- 11 This literature is also sometimes referred to as ences and constraints between individuals, “reduced form,” again a misnomer. In the classical, Cowles Commission, simultaneous equations setting, the term as well as variation over time. Understanding reduced form is used to refer to the regression of the ­heterogeneity in treatment effects is there- endogenous variables on the full set of exogenous variables fore of great importance in these settings (which is typically estimated by ordinary least squares). Equations estimated by instrumental variables methods and it has received considerable attention are, in this terminology, referred to as structural equations. in the theoretical evaluation literature (see 12 In an even more remarkable attempt to shape the debate by changing terminology, Deaton proposes to rede- fine the term “exogeneity” in such a way that “Even ran- 10 “Structural” is used here in the Arthur S. Goldberger dom numbers—the ultimate external variables—may be (1991) sense of invariant across populations. endogenous” (Deaton, p. 13). 404 Journal of Economic Literature, Vol. XLVIII (June 2010) of the effect of education on earnings using of the intervention of interest, often in com- variation in educational achievement related bination with the innovative collection of to compulsory schooling laws. More recently, original data sources, to remove any selec- this has led to many studies using regression tion bias. discontinuity designs (see David S. Lee and To focus the discussion, let me intro- Thomas Lemieux 2010 for a review). The duce a specific example. Suppose a state, recent work in development economics has say California, is considering reducing class taken the emphasis on internal validity even size in first through fourth grade by 10 further, stressing formal randomization as a percent. Entering in the California policy- systematic and robust approach to obtaining makers’ decision is the comparison of the credible causal effects (see Duflo, Rachel cost of such a class size reduction with its Glennerster, and Kremer 2008 for an over- benefits. Suppose that the policymakers view of this literature). This has led to a spec- have accurate information regarding the tacular increase in experimental evaluations cost of the program but are unsure about in development economics (see, for exam- the benefits. 
Ultimately the hope may be ple, the many experiments run by research- that such a reduction would improve labor ers associated with the Poverty Action Lab at market prospects of the students, but let us MIT), and in many other areas in econom- suppose that the state views the program ics, e.g., and Sendhil as worthwhile if it improves some measure Mullainathan (2004), Duflo and Saez (2003), of skills, say measured as a combination of Chetty, Looney, and Kroft (2009), and many test scores, by some amount. What is the others. relevance for this decision of the various Often the focus in this literature is on estimates available in the literature? Let us causal effects of binary interventions or treat- consider some of the studies of the effect of ments. See Imbens and Wooldridge (2009) class size on educational outcomes. There for a recent review of the methodological is a wide range of such studies but let me part of this literature. For example, one may focus on a few. First, there is experimental focus on the effect of universal exposure to evidence from the Tennessee STAR experi- the treatment, that is, the average treatment ments starting in 1985 (e.g., Krueger 1999). effect, or on the effect of exposure for those Second, there are estimates based on regres- currently exposed, the average effect on the sion discontinuity designs using Israeli data treated. Even if these interventions do not (Angrist and Victor Lavy 1999). Third, there directly correspond to plausible future poli- are estimates exploiting natural variation in cies, they are often useful summary statis- class size arising from natural variation in tics for such policies and, therefore, viewed cohort size using data from Connecticut as quantities of interest. A major concern reported in Caroline M. Hoxby (2000). in this literature is that simple comparisons None of these estimates directly answers between economic agents in the various the question facing the decisionmakers in regimes are often not credible as estimates California. So, are any of these three stud- of the average effects of interest because of ies useful for informing our California poli- the potential selection bias that may result cymaker? In my view, all three are. In all from the assignment to a particular regime three cases, finding positive effects of class being partly the result of choices by optimiz- size reductions on test scores would move ing agents. As a consequence, great care is my prior beliefs on the effect in California applied to the problem of finding credible toward bigger effects. Exactly how much sources of exogenous variation in the receipt each of the three studies would change my Imbens: Better LATE Than Nothing 405 prior beliefs would depend on the external when these studies were conducted. This is, and internal validity of the three studies. again, not a new point. The proponents of Specifically, the external validity of each randomization in the new development eco- study would depend on (i) its timing rela- nomics have argued persuasively in favor of tive to the target program, with older stud- doing multiple experiments (Duflo 2004; ies receiving less weight, (ii) differences Banerjee 2007; Banerjee and Duflo 2009). between the study population and the It is obvious that, as Deaton comments, sim- California target population, including the ply repeating the same experiment would targeted grade levels in each study, and (iii) not be very informative. 
However, conduct- differences between the study outcomes ing experiments on a variety of settings, and the goals of the California programs. In including different populations and differ- terms of these external validity criteria, the ent economic circumstances, would be. As Hoxby study with Connecticut data would Deaton suggests, informing these settings probably do best. In terms of internal valid- by economic theory, much as the original ity, that is, of the estimate having a credible negative income tax experiments were, causal interpretation, the Krueger study would clearly improve our understanding of using experimental Tennessee data would the processes as well as our ability to inform definitely, and, next, the Angrist–Lavy study public policy. with Israeli data might, do better. The main The focus of the causal literature has been point, though, is that all three studies are in on shoring up the internal validity of the my view useful. None of the three answers estimates and on clarifying the nature of the directly the question of interest but the population these estimates are relevant for. combination is considerably better than any This is where instrumental variables, local single one. We could clearly do better if average treatment effects, and regression we designed a study especially to study the discontinuity methods come in. These often California question. Ideally we would run do not answer exactly the question of inter- an experiment in California itself, which, est but provide estimates of causal effects for five years later, might give us a much more well-defined subpopulations under weaker reliable answer but it would not help the assumptions than those required for iden- policymakers at this moment very much. If tification of the effects of primary interest. we did an observational study in California, As a result, a single estimate is unlikely to however, I would still put some weight on provide a definitive and comprehensive basis the Connecticut, Tennessee, and Israeli for informing policy. Rather, the combina- studies. One may go further in formalizing tion of several such studies, based on differ- the decision process in this case and I will ent populations and in different settings, can do so in section 6. give guidance on the nature of interventions Reiterating the main point, having a vari- that work. ety of estimates, with a range of populations Let me mention one more example. Deaton and a range of identification strategies, can cites a study by Banerjee et al. (2007) who be useful to policymakers even if none of the find differences in average effects between individual studies directly answers the policy randomized evaluations of the same program question of interest. It is of course unrealis- in two locations. Banerjee et al. surmise that tic to expect that the California policymakers­ these differences are related to differential would be able to pick a single study from initial reading abilities. Deaton dismisses this the literature in order to get an answer to a conclusion as not justified by the randomiza- question that had not actually been posed yet tion because such a question was not part of 406 Journal of Economic Literature, Vol. XLVIII (June 2010) the original protocol and would therefore be taught in labor and courses in subject to data mining issues. 
This is formally economics PhD programs, LaLonde studies correct, but it is precisely the attempt to the ability of a number of econometric meth- understand differences in the results of past ods, including Heckman’s selection models, experiments that leads to further research to replicate the results from an experimen- and motivates subsequent experiments, thus tal evaluation of a labor market program on building a better understanding of the het- the basis of nonexperimental data. He con- erogeneity in the effects that can assist in cluded that they could not do so systemati- informing policy. See Card, Jochen Kluve, cally. LaLonde’s evidence, and subsequent and Andrea Weber (2009) for another exam- studies with similar conclusions, e.g., Fraker ple of such a meta analysis and section 6 for and Maynard (1987), had a profound impact additional discussion. in the economics literature and even played a role in influencing Congress to mandate experimental evaluations for many federally 3. LaLonde (1986): The Failure of funded programs. Nonexperimental Methods to Replicate It would appear to be uncontroversial Experimental Evaluations of that the focus in LaLonde’s study, the aver- Labor Market Programs age effect of the Nationally Supported Work Surprisingly, neither Deaton nor HU (NSW) program, meets Deaton’s criterion of discuss in much detail the origins of the being “useful for policy or understanding,” resurgence of interest in randomized and (Deaton, abstract). The most direct evidence natural experiments, and the concern with that it meets this criterion is the willingness the internal validity of some of the struc- of policymakers to provide substantial funds tural modeling. HU vaguely reference the for credible evaluations of similar labor mar- “practical difficulty in identifying, and pre- ket and educational programs. Nevertheless, cisely estimating, the full array of struc- the question remains whether evaluation tural parameters” (HU, p. 2), but mention methods other than those considered by only, without a formal reference, a paper LaLonde would have led to better results. by Hausman (presumably Hausman 1981) There is some evidence that matching meth- as one of the papers that according to HU ods would have done better. See the influen- “fueled the flight of many empirical econo- tial paper by Rajeev H. Dehejia and Sadek mists from structural models” (HU, p. 2, Wahba (1999), although this is still dis- footnote 6). I think the origins behind this puted, see, e.g., Jeffrey A. Smith and Todd flight are not quite as obscure or haphazard (2005) and Dehejia (2005). See Imbens and as may appear from reading Deaton and HU. Wooldridge (2009) for a recent review of Neither of them mentions the role played by such methods. Matching methods, however, LaLonde’s landmark 1986 paper, “Evaluating hardly meet Deaton’s criteria for “analysis the Econometric Evaluations of Training of models derived from economic theory” Programs with Experimental Data.”13 In (Deaton, p. 2). Until there are more suc- his 1986 paper, widely cited and still widely cessful attempts to replicate experimental results, it would therefore seem inescapable 13 At some level, LaLonde’s paper makes the same point that there is a substantial role to be played as Leamer did in his 1983 paper in which he criticized the by experimental evaluations in this literature state of empirical work based on observational studies. 
if we want data analyses to meet Leamer’s standard of being taken seriously by other researchers. See the symposium in the Spring 2010 Journal of Economic Perspectives for a recent discussion on the effects of Leamer’s criticism on subsequent empirical work.
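For readers less familiar with the matching estimators referred to in this section, here is a minimal sketch; the notation is mine rather than LaLonde’s or Dehejia and Wahba’s. These methods assume unconfoundedness, that is, that treatment assignment W_i is independent of the potential outcomes (Y_i(0), Y_i(1)) conditional on covariates X_i, and they estimate the average effect on the treated by pairing each treated unit with one or more comparable controls:

\[
\hat{\tau}_{\mathrm{ATT}} \;=\; \frac{1}{N_1} \sum_{i:\, W_i = 1} \bigl( Y_i - Y_{m(i)} \bigr),
\]

where N_1 is the number of treated units and m(i) indexes the closest control match for treated unit i, with closeness measured in terms of the covariates or of an estimated propensity score e(x) = Pr(W_i = 1 | X_i = x). Whether estimators of this kind reproduce the experimental benchmark is exactly what the replication exercises discussed in this section assess.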

4. The Benefits of Randomized follow. However, I strongly disagree with Experiments her claim that this is what gives randomized experiments their credibility. It is not the One of the most curious discussions in assumption of randomization but the actual Deaton concerns the merits of randomized act of randomization that allows for pre- experiments. He writes: “I argue that evi- cise quantifications of uncertainty, and this dence from randomized controlled trials has is what gives randomization a unique status no special priority . . . Randomized controlled among study designs. Constance Reid (1982, trials cannot automatically trump other evi- p. 45) quotes Jerzey Neyman concerning dence, they do not occupy any special place the importance of the actual randomization, in some hierarchy of evidence” (Deaton, p. and the attribution of this insight to Fisher: 4) and “Actual randomization faces similar “. . . the recognition that without randomiza- problems as quasi-randomization, notwith- tion an experiment has little value irrespective standing rhetoric to the contrary” (Deaton, of the subsequent treatment. The latter point abstract). These are remarkable statements. is due to Fisher, and I consider it as one of the If true, in the unqualified way Deaton states most valuable of Fishers’s achievements.” It them, it would throw serious doubt on the is interesting to see that Cartwright does not Food and Drug Administration’s (FDA) mention Fisher’s or Neyman’s views on these insistence on randomized evaluations of issues in the discussion of her claims. new drugs and treatments. But of course Far from being merely rhetoric as Deaton Deaton’s statements are wrong. Deaton is claims, the physical act of randomization is both formally wrong and wrong in spirit. what allows the researcher to precisely quan- Randomized experiments do occupy a special tify the uncertainty associated with the evi- place in the hierarchy of evidence, namely dence for an effect of a treatment, as shown at the very top. Again, this is not merely my originally by Fisher (1925). Specifically, it view: see the earlier quotes from Freedman allows for the calculation of exact p-values and Leamer for similar sentiments. of sharp null hypotheses. These p-values For support for his position that are free of assumptions on distributions of “Randomization is not a gold standard” outcomes, free from assumptions on the (Deaton, p. 4), Deaton quotes Nancy sampling process, and even free of assump- Cartwright (2007) as claiming that “there is tions on interactions between units and of no gold standard” (Cartwright 2007, quoted assumptions on compliance behavior, solely in Deaton, p. 4). It is useful to give a slightly relying on randomization and a sharp null longer quote from Cartwright (2007) to put hypothesis. No other design allows for this. her claim in perspective: “The claims of ran- Now this is strictly speaking a very narrow domized controlled trials (RCTs) to be the result, with the extensions to more interest- gold standard rest on the fact that the ideal ing questions somewhat subtle. We can estab- RCT is a deductive method: if the assump- lish the uncertainty regarding the existence tions of the test are met, a positive result of a causal effect through the calculation of implies the appropriate causal conclusion. 
­p-values but we cannot establish properties of This is a feature that RCT’s share with a vari- estimators for, say, the average effect without ety of other methods, which thus have equal additional assumptions and approximations. claim to being a gold standard” (Cartwright Unless we rule out interactions between indi- 2007, abstract). I agree with Cartwright that viduals, the average effect of the treatment many methods have the feature that if their depends on assignments to other ­individuals assumptions are met, the causal conclusions and, thus, needs to be defined carefully. In 408 Journal of Economic Literature, Vol. XLVIII (June 2010) the absence of ­interactions, we can estimate individuals assigned to the control group were the average effect without bias but the valid- in fact exposed to the treatment. We can still ity of confidence intervals still relies on large assess the null hypothesis of no effect of the sample approximations (e.g., Neyman 1923; treatment using the same analysis as before Freedman 2008). Nevertheless, even if experi- as long as we take care to use the randomiza- ments rely on some assumptions or large sam- tion distribution of the assignment to treat- ple approximations for inference on average ment rather than the receipt of treatment. treatment effects, they do so to a lesser extent There is no complication in this analysis than observational studies by not requiring arising from the noncompliance. The non- assumptions on the assignment mechanism. compliance does, however, compromise our Deaton himself hedges his remarkable ability to find an estimator that is unbiased claims by adding that “actual experiments are for the average treatment effect. However, frequently subject to practical problems that if, for example, the outcome is binary, we undermine any claims to statistical or epis- can still derive, in the spirit of the work by temic superiority” (Deaton, abstract), a some- Manski (1990, 1995, 2003), a range of val- what confusing statement given that according ues consistent with the average treatment to the earlier quotes in his view there is no effect that is valid without assumptions on initial superiority to undermine. It is true that the compliance behavior. These bounds may violations of assignment protocols, missing be informative, depending on the data, and, data, and other practical problems can create in particular if the rate of noncompliance is complications in the analyses of data from ran- low, will lead to a narrow range. In the pres- domized experiments. There is no evidence, ence of missing data, both the derivation of however, that giving up control of the assign- p-values and estimators will now lead to ranges ment mechanism and conducting an observa- of values without additional assumptions. An tional study improves these matters. Moreover, important role is played here by Manski’s the suggestion that any complication, such as insight that identification is not a matter of a violation of the assignment protocol, leads all or nothing. Thus, some of the benefits of to analyses that lose all credibility accorded to randomization formally remain even in the randomized experiments is wrong. Again, it is presence of practical complications such as both formally wrong and wrong in substance. noncompliance and missing data. That this suggestion is formally wrong is easi- In his paper, Deaton also questions what est illustrated in an example. 
we learn from experiments: “One immediate Consider a randomized experiment with consequence of this derivation is a fact that is N individuals, M randomly assigned to the often quoted by critics of RCTs but is often treatment group and the remaining N M ignored by practitioners, at least in econom- − assigned to the control group. In the absence ics: RCTs are informative about the mean of of complications such as noncompliance, the treatment effects, Y Y , but do not i1 − i0 interactions between units, and missing data, identify other features of the distribution. we can calculate the p-value associated with For example, the median of the ­difference is the null hypothesis of no effect of the treat- not the difference in medians, so an RCT is ment and we can also estimate the average not, by itself, informative about the median effect of the treatment without bias, both treatment effect, something that could be of based on the randomization distribution. as much interest to policymakers as the mean Now suppose that there is noncompliance. treatment effect” (Deaton, p. 26). He further Some individuals assigned to the treatment stresses this point by writing “Put differently, were not exposed to the treatment, and some the trial might reveal an average positive Imbens: Better LATE Than Nothing 409 effect although nearly all of the population marginal distributions, not about the distri- is hurt with a few receiving very large ben- bution of the difference. Suppose that the efits, a situation that cannot be revealed by planner’s choice is between two programs. In the RCT, although it might be disastrous if that case, the social planner would look at the implemented” (Deaton, p. 27). These state- welfare given the marginal distribution of out- ments are formally wrong in the claims about comes induced by the first program and com- the information content of randomized pare that to the welfare given the marginal experiments and misleading in their sugges- outcome distribution induced by the second tion about what is of interest to policymakers. program, and not at the joint distribution of Regarding the first claim, here is a simple outcomes. My argument against Deaton’s counterexample, similar to one discussed in claim that policymakers could be as much the Heckman and Smith (1995) paper cited interested in the median treatment effect as by Deaton. Suppose we have a randomized in the mean treatment effect is not novel. As experiment with binary outcomes. Assume Manski (1996) writes, “Thus, a planner maxi- that among the controls and treated the mizing a conventional social welfare func- outcome distributions are binomial with tion wants to learn P[ Y (1)] and P[ Y (0)], not mean p and p respectively. If the differ- P[ Y (1) Y (0)]” (Manski 1996, p. 714). (Here 0 1 − ence p p exceeds 1 2, one can infer P[ Y (w)] denotes the distribution of Y (w).) 1 − 0 / that the median effect of the treatment is The implication is that the planner’s decision one. In general, however, it is correct that may depend on the median of the marginal the evidence from randomized experiments distributions of Yi (0) and Yi (1) but would, regarding the joint distribution of the pair in general, not depend on the median of the (Y(0), Y(1)) is limited. Nevertheless, there is treatment effect Y (1) Y (0). To make this i − i more information regarding quantiles than specific, let us return to Deaton’s example of Deaton suggests. 
In the presence of covari- a program with few reaping large benefits and ates, experiments are directly informative many suffering small losses. Manski’s social concerning the two conditional distributions planner would compare the distribution given f (Y(0) X) and f (Y(1) X), and together these the treatment, P[ Y (1)], with the distribution | | may lead to more informative bounds on, say, in the absence of the treatment, P[ Y (0)]. The quantiles of the distribution of the difference comparison would not necessarily be based Y(1) Y(0) than simply the two marginal simply on means but might take into account − distributions f (Y(0)) and f (Y(1)). measures of dispersion, and so avoid the The more important issue is the second potential disasters Deaton is concerned about. claim in the Deaton quote, that the median Deaton also raises issues concerning could be of as much interest to policymakers the manner in which data from random- as the mean treatment effect or, more gen- ized experiments are analyzed in practice. erally, that it is the joint distribution we are Consider a carefully designed randomized interested in, beyond the two marginal dis- experiment with covariates present that tributions. In many cases, average effects of were not taken into account in the random- (functions of) outcomes are indeed what is ization.14 Deaton raises three issues. The of interest to policymakers, not quantiles of differences in potential outcomes. The key insight is an economic one—a social planner, 14 In fact, one should take these into account in the maximizing a welfare function that depends design because one would always, even in small samples, be at least as well off by stratifying on these covariates as on the distribution of outcomes in each state by ignoring them, e.g., Imbens et al. (2009), but that is a of the world, would only care about the two different matter. 410 Journal of Economic Literature, Vol. XLVIII (June 2010) first concerns inference or, more specifically, mating a structural model it is still helpful to estimation of standard errors. The second have experimental data. Although regression is concerned with finite sample biases. The estimators are generally not unbiased under third issue deals with specification search the randomization distribution, regression and the exploration of multiple hypotheses. estimators are made more robust and cred- I will address each in turn. Before doing so, ible by randomization because at least some let me make two general comments. First, in of the assumptions underlying regression my view, the three issues Deaton raises are analyses are now satisfied by design. decidedly second order ones. That is, sec- Now let me turn to the first issue raised ond order relative to the first order issues of by Deaton, concerning the standard errors. selection and endogeneity in observational This is an issue even in large samples. If evaluation studies that have long been high- the average effect is estimated as the dif- lighted in the econometric literature, promi- ference in means by treatment status, the nently in work by Heckman (e.g., Heckman appropriate variance, validated by the ran- 1978; Heckman and Richard Robb 1985). domization, is the robust one, allowing for Second, Deaton appears to be of the view heteroskedasticity (e.g., Friedhelm Eicker that the only way experiments should be 1967; Peter J. Huber 1967; Halbert White analyzed is based on randomization infer- 1980). 
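For concreteness, the variance estimator in question can be written out explicitly; the display below is my addition, in the notation used above for a two-arm comparison of means with N units of which M are in the treatment group:

\[
\hat{V} \;=\; \frac{s_1^2}{M} + \frac{s_0^2}{N - M},
\]

where s_1^2 and s_0^2 are the sample variances of the outcome in the treatment and control groups. This is the heteroskedasticity-robust variance for the difference in two means, and it is conservatively valid over the randomization distribution (Neyman 1923).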
Using the standard ordinary least ence.15 Randomization inference is still rela- squares variance based on homoskedastic- tively rare in economics (the few examples ity leads to confidence intervals that are not include Bertrand, Duflo, and Mullainathan necessarily justified even in large samples. 2002 and Alberto Abadie, Alexis Diamond, This point is correct and, in practice, it is and Jens Hainmueller forthcoming) and, certainly recommended to use the robust although personally I am strongly in favor variance here, at least in sufficiently large of its use (see the discussion in Imbens and samples.17 Moreover, the standard error Wooldridge 2009), it is not the only mode issue that is often the most serious concern of inference even if one has experimental in practice, clustering, is nowadays routinely data. If one uses model-based inference, taken into account. See Duflo, Glennerster, including regression methods, there are still and Kremer (2008) for more discussion. well established benefits from randomized The second issue concerns finite sample assignment of the treatment even if there issues. Researchers often analyze random- are no longer exact finite sample results ized experiments using regression methods, (e.g., Rubin 1978, 1990).16 As I wrote in the including as regressors both the treatment introduction to this paper, when one is esti- indicator and covariates not affected by the

15 Under randomization inference, properties such “Experiments should be analyzed as experiments, not as as bias and variance are calculated over the distribution observational studies” (Freedman, abstract). I have some induced by random assignment for the fixed population, sympathy for that view, although that does not take away keeping potential outcomes with and without treatment from the fact that, if one wants to estimate a structural and covariates fixed, and reassigning only the treatment model, one would still benefit from having experimental indicator. This contrasts with the model-based repeated data. sampling perspective often used in econometrics where 17 There are further complications in small samples. the covariates and the treatment indicator are fixed in The most commonly used version of robust standard repeated samples and the unobserved component in the errors performs worse than homoskedastic standard errors regression function is redrawn from its distribution. See in small samples. There are improvements available in the Paul R. Rosenbaum (1995) for a general discussion. literature, especially for the simple case where we compare 16 Freedman (2006) argues for randomization infer- two sample averages, which probably deserve more atten- ence whenever a randomized experiment is conducted: tion. See, among others, Henry Scheffe (1970). Imbens: Better LATE Than Nothing 411 treatment. If only the treatment ­indicator is example through the estimation of average included in the specification of the regression effects for various subgroups. This is formally function, the least squares estimator is iden- correct, and I would certainly encourage tical to the difference in average outcomes researchers to follow more closely the pro- by treatment status. As shown originally by tocols established by the FDA, which, for Neyman, this estimator is unbiased, in the example, insists on listing the analyses to be finite sample, over the distribution induced conducted prior to the collection of the data. by randomizing the treatment assignment. As Again there is of course nothing specific to Freedman (2008) points out, if one includes randomized experiments in this arguments: additional covariates in the specification of any time a researcher uses pretesting or esti- the regression function, the least squares esti- mates multiple versions of a statistical model mator is no longer exactly unbiased, where there should be concern that the final con- again the distribution is that induced by the fidence intervals no longer have the nomi- randomization.18 On the other hand, includ- nal coverage rate. See Leamer (1978) for a ing covariates can substantially improve the general discussion of these issues. However, I precision if these covariates are good pre- think that this is again a second order issue in dictors of the outcomes with or without the the context of the comparison between ran- treatment. In finite samples, there is there- domized experiments and observational stud- fore a tradeoff between some finite sample ies. In randomized experiments, one typically bias, and large sample precision gains. In finds, as in LaLonde (1986), that the results practice including some covariates that are a from a range of estimators and specifications priori believed to be substantially correlated are robust. Had Deaton added a real example with the outcomes, is likely to improve the of a case where results based on experiments expected squared error. 
An additional point is were sensitive to these issues, his argument that if the regression model is saturated, e.g., would have been more convincing. with a binary covariate including both the Ultimately, and this is really the key point of covariate and the interaction of the covariate this section, it seems difficult to argue that, in and the treatment indicator, there is no bias, a setting where it is possible to carry out a ran- even in finite samples.19 domized experiment, one would ever benefit The third issue Deaton raises concerns from giving up control over the assignment the exploration of multiple specifications, for mechanism by ­allowing ­individuals to choose

18 This result may come as a surprise to some researchers, so let me make it explicit in a simple example. Suppose there are three units, with covariate values X_1 = 0, X_2 = 1, and X_3 = 2. If assigned to the treatment, the outcomes for the three units are Y_1(1), Y_2(1), and Y_3(1) and, if assigned to the control treatment, the outcomes are Y_1(0), Y_2(0), and Y_3(0). The average treatment effect is τ = (Y_1(1) + Y_2(1) + Y_3(1))/3 − (Y_1(0) + Y_2(0) + Y_3(0))/3. Suppose the experiment assigns one of the three units to the treatment and the other two units to the control group. Thus, there are three possible values for the assignment vector, W ∈ {(1, 0, 0), (0, 1, 0), (0, 0, 1)}. Under each value of the assignment, we can calculate the value of the two estimators. The first estimator is equal to the difference in the average outcomes for the treated units and the average outcomes for the control units. The second estimator is based on the least squares regression of the observed outcome on a constant, the treatment indicator W_i, and the covariate X_i. For example, if W = (0, 0, 1), the first estimator is equal to τ̂_dif = Y_3(1) − Y_2(0)/2 − Y_1(0)/2 and the second estimator is equal to τ̂_ols = Y_3(1) − 2Y_2(0) + Y_1(0). It is simple to calculate the expectation of these two estimators over the randomization distribution. For the difference estimator, the expectation is equal to the average treatment effect τ but, for the least squares estimator, the expectation is equal to (Y_1(1) + Y_2(1) + Y_3(1))/3 − (−Y_1(0)/2 + 4Y_2(0) − Y_3(0)/2)/3, which in general differs from τ. This bias disappears at rate 1/N as the sample size increases, as shown in Freedman (2006).

19 A separate issue is that it is difficult to see how finite sample concerns could be used as an argument against actually doing experiments. There are few observational settings for which we have exact finite sample results.
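To make the calculation in footnote 18 easy to verify, here is a short numerical check; the potential outcome values are hypothetical and chosen purely for illustration. The script enumerates the three equally likely assignments and confirms that the difference in means averages to the true effect over the randomization distribution, while the covariate-adjusted least squares coefficient does not.

import numpy as np

# Three units with fixed covariates and hypothetical potential outcomes.
X = np.array([0.0, 1.0, 2.0])    # covariates X_1, X_2, X_3
Y0 = np.array([1.0, 3.0, 2.0])   # hypothetical outcomes without treatment
Y1 = np.array([2.0, 5.0, 7.0])   # hypothetical outcomes with treatment
tau = Y1.mean() - Y0.mean()      # true average treatment effect (8/3 here)

# The three possible assignments: one treated unit, two controls.
assignments = [np.array(w) for w in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]]
dif_estimates, ols_estimates = [], []
for W in assignments:
    Y = np.where(W == 1, Y1, Y0)                   # observed outcomes under this assignment
    dif_estimates.append(Y[W == 1].mean() - Y[W == 0].mean())
    design = np.column_stack([np.ones(3), W, X])   # constant, treatment indicator, covariate
    beta = np.linalg.lstsq(design, Y, rcond=None)[0]
    ols_estimates.append(beta[1])                  # coefficient on the treatment indicator

print("true average effect:         ", tau)
print("mean of difference estimator:", np.mean(dif_estimates))  # equals tau
print("mean of adjusted estimator:  ", np.mean(ols_estimates))  # differs from tau

With these values the true average effect is 8/3; the difference estimator also averages 8/3 over the three assignments, while the adjusted estimator averages 3.5/3, illustrating the finite sample bias that, as the footnote notes, vanishes at rate 1/N.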
formulates causal questions as compari- sons of unit-level potential outcomes.22 Although this potential outcome framework 5. Instrumental Variables, Local Average is a substantial departure from the Cowles Treatment Effects, and Regression Commission general set up of simultaneous Discontinuity Designs equations models, it is closely related to the In some settings, a randomized experi- interpretation of structural equations in, for ment would have been feasible, or at least example, Trygve Haavelmo (1943).23 On the conceivable, but was not actually conducted. other hand, statisticians gained an apprecia- This may have been the result of ethical tion for, and ­understanding of, instrumental considerations, or because there was no variables methods. See, for example, what particularly compelling reason to conduct is probably the first use of instrumental an experiment, or simply practical reasons. In some of those cases, credible evaluations 21 For a recent review of this literature, see Imbens and can be based on instrumental variables or Wooldridge (2009). regression discontinuity strategies. As a rule, 22 Compare, for example, the set up in Heckman such evaluations are second best to random- and Robb (1985), which predates the potential outcome set up, with that in Heckman (1990), which adopts that ized experiments for two reasons. First, they framework. rely on additional assumptions and, second, 23 Jan Tinbergen distinguishes between “any imaginable price” that enters into the demand and supply functions they have less external validity. Often, how- π and the “actual price” p that is determined by the mar- ever, such evaluations are all we have. The ket clearing in his notation. Subsequently this distinction theoretical econometrics literature in the between potential outcomes and realized outcomes has last two decades has clarified what we can become blurred. With the work of the Cowles Foundation, the standard textbook notation directly relates a matrix of learn, and under what conditions, about the observed endogenous outcomes (Y) to a matrix of exog- enous variables (X) and a matrix of unobserved “residuals” (U), linked by a set of unknown parameters (B and ): 20 Of course it is possible that the question of interest Γ Y XB U, itself involves the choice of treatment status. For example, Γ + = if we are interested in a job training program that would representing causal, structural relationships. This notation be implemented as a voluntary program, the experimental has obscured many of the fundamental issues and contin- design should involve randomization of the option to enroll ues to be an impediment to communication with other in the program, and not randomization of enrollment itself. disciplines. Imbens: Better LATE Than Nothing 413

­variables ­published in the mainstream medi- randomly assigned, this is true by design in cal literature, although still written by econo- this case. mists, Mark McClellan, Barbara J. McNeil, The second is that there is no direct effect and Joseph P. Newhouse (1994). Special of the instrument, the lottery number, on cases of these methods had been used pre- the outcome. This is what Angrist, Imbens, viously in the biostatistics literature, in and Rubin (1996) call the exclusion restric- particular in settings of randomized experi- tion.24 This is a substantive assumption that ments with one-sided compliance (e.g., M. may well be violated. See Angrist (1990) and Zelen 1979), but no links to the economet- Angrist, Imbens, and Rubin (1996) for dis- rics literature had been made. Furthermore, cussions of potential violations.25 economists have significantly generalized The third assumption is what IA call applicability and understanding of regres- monotonicity, which requires that any sion discontinuity designs (Jinyong Hahn, man who would serve if not draft eligible, Todd, and Wilbert van der Klaauw 2001; would also serve if draft eligible.26 In this Justin McCrary 2008; Lee 2008; Imbens setting, monotonicity, or as it is sometimes and Karthik Kalyanaraman 2008) and now called “no-defiers,” seems a very reason- in turn influence the psychology literature able assumption. Although Deaton quotes where these designs originated. See William Freedman as wondering “just why are R. Shadish, Thomas D. Cook, and Donald there no defiers” (Freedman 2006, quoted T. Campbell (2000) and Cook (2008) for a in Deaton, p. 37) and Heckman and Urzua historical perspective. Within economics, (2010) write about “arbitrary conditions like however, the results in IA and Hahn, Todd, ‘monotonicity’ that are central to LATE” and van der Klaauw (2001) are unusual (Heckman and Urzua 2010, p. 8), the mono- (“the opposite of standard statistical prac- tonicity assumption is often well motivated tice” Deaton, p. 9). As a consequence, these from the perspective of optimizing agents. papers have generated a substantial degree Increasing the value of the instrument, in of controversy as echoed in the quotes from the draft lottery example corresponding Deaton and HU. Let me offer some com- to giving the person a lottery number that ments on this. implies the person will be more likely to be The standard approach in econometrics subject to the draft, raises the cost of staying is to state precisely what the object of inter- out of the military. It would seem reasonable est is at the outset of an analysis. Let me use to assume that the response to this increase Angrist’s (1990) famous draft lottery study as in costs, for each optimizing individual, is an example. In that case, one may be inter- an increased likelihood of serving in the ested in the average causal effect of serving military. This interpretation of changes in in the military on earnings. Now suppose the value of the instrument corresponding one is concerned that simple comparisons to increases in the net benefits of receiving between veterans and nonveterans are not credible as estimates of average causal 24 Deaton actually calls this second assumption “exo- effects because of selection biases arising geneity” in an unnecessary and confusing change from from unobserved differences between vet- conventional terminology that leads him to argue that even random numbers can fail to be exogenous. erans and nonveterans. 
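The concern about such simple comparisons can be made explicit with a standard decomposition; the notation here is my addition, with W_i indicating veteran status and Y_i(0) and Y_i(1) earnings without and with military service:

\[
E[\,Y_i \mid W_i = 1\,] - E[\,Y_i \mid W_i = 0\,]
\;=\;
E[\,Y_i(1) - Y_i(0) \mid W_i = 1\,]
\;+\;
\bigl( E[\,Y_i(0) \mid W_i = 1\,] - E[\,Y_i(0) \mid W_i = 0\,] \bigr),
\]

so the raw veteran–nonveteran earnings gap equals the average effect on veterans plus a selection term that vanishes only if the two groups would have had comparable earnings in the absence of service, which is precisely what unobserved differences between them call into question.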
Let us consider the 25 For example, the extension of formal schooling to arguments advanced by Angrist in support of avoid the draft could lead to violations of the exclusion using the draft lottery number as an instru- restriction. 26 In another unnecessary attempt to change estab- ment. The first key assumption is that draft lished terminology, HU argue that this should be called eligibility is exogenous. Since it was actually “uniformity.” 414 Journal of Economic Literature, Vol. XLVIII (June 2010) the treatment does not hold in all cases and, to make. Manski’s is a coherent ­perspective when IA introduced the assumption, they and a useful one. While I have no disagree- discuss settings where it need not be plau- ment with the case for reporting the bounds sible. Nevertheless, it is far from an arbitrary on the overall average treatment effect, there assumption and often plausible in settings is, in my view, a strong case for also reporting with optimizing agents. In addition, Angrist, estimates for the subpopulation for which Imbens, and Rubin discuss the implications one can identify the average effect of inter- of violations of this assumption. est, that is the local average treatment effect. These three assumptions are not sufficient The motivation for this is that there may be to identify the average effect of serving in cases with wide bounds on the population the military for the full population. However, average effect, some of which are, and some as shown by IA, these assumptions are suf- of which are not, informative about the pres- ficient to identify the average effect on the ence of any effects. Consider an example of a subpopulation of what Angrist, Imbens, and randomized evaluation of a drug on survival, Rubin (1996) call compliers, the local aver- with one-sided noncompliance and with the age treatment effect or LATE. Compliers randomized assignment as an instrument for in this context are individuals who were receipt of treatment. Suppose the bounds induced by the draft lottery to serve in the for the average effect of the treatment are military, as opposed to never-takers who equal to [ 3 16, 5 16]. This can be con- − / / would not serve irrespective of their lottery sistent with a substantial negative average number, and always-takers, who would vol- effect for compliers, lowering survival rates unteer irrespective of their lottery number. by 1 4, or with a substantial positive average / But, Deaton might protest, this is not what effect for compliers, raising survival rates by we said we were interested in! That may 1 4.27 One would think that, in the first case, / be correct, depending on what is the policy a decisionmaker would be considerably less question. One could imagine that the policy likely to implement ­universal adoption of the interest is in compensating those who were treatment than in the second, and so report- involuntarily taxed by the draft, in which ing only the bounds might leave out relevant case the compliers are exactly the popula- information. tion of interest. If, on the other hand, the A second alternative approach to the question concerns future drafts that may be focus on the local average treatment effect more universal than the Vietnam era one, is to complement the three assumptions the overall population may be closer to the that allowed for identification of the average population of interest. 
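In the notation used later in this section, with Vi denoting veteran status, Zi draft eligibility, and Yi(1) and Yi(0) the potential earnings outcomes with and without military service, the IA identification result can be stated compactly: under the three assumptions above, the population instrumental variables (Wald) estimand equals the average effect for compliers,

\[
\tau_{\mathrm{LATE}} = E[\,Y_i(1) - Y_i(0) \mid \text{complier}\,]
= \frac{E[\,Y_i \mid Z_i = 1\,] - E[\,Y_i \mid Z_i = 0\,]}
       {E[\,V_i \mid Z_i = 1\,] - E[\,V_i \mid Z_i = 0\,]} .
\]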
But, Deaton might protest, this is not what we said we were interested in! That may be correct, depending on what the policy question is. One could imagine that the policy interest is in compensating those who were involuntarily taxed by the draft, in which case the compliers are exactly the population of interest. If, on the other hand, the question concerns future drafts that may be more universal than the Vietnam era one, the overall population may be closer to the population of interest. In that case, there are two alternatives that do focus on the average effect for the full population. Let me briefly discuss both in order to motivate the case for reporting the local average treatment effect. See also Manski (1996) for a discussion of these issues.

One principled approach is Manski's (1990, 1996, 2003) bounds, or partial identification, approach. Manski might argue that one should maintain the focus on the overall average effect and derive the bounds on this estimand given the assumptions one is willing to make. Manski's is a coherent perspective and a useful one. While I have no disagreement with the case for reporting the bounds on the overall average treatment effect, there is, in my view, a strong case for also reporting estimates for the subpopulation for which one can identify the average effect of interest, that is, the local average treatment effect. The motivation for this is that there may be cases with wide bounds on the population average effect, some of which are, and some of which are not, informative about the presence of any effects. Consider an example of a randomized evaluation of a drug on survival, with one-sided noncompliance and with the randomized assignment as an instrument for receipt of treatment. Suppose the bounds for the average effect of the treatment are equal to [−3/16, 5/16]. This can be consistent with a substantial negative average effect for compliers, lowering survival rates by 1/4, or with a substantial positive average effect for compliers, raising survival rates by 1/4.[27] One would think that, in the first case, a decisionmaker would be considerably less likely to implement universal adoption of the treatment than in the second, and so reporting only the bounds might leave out relevant information.

[27] To be specific, let the probabilities of compliers and never-takers each be equal to 1/2. With the endogenous regressor (receipt of treatment) denoted by Xi and the instrument (assignment of treatment) denoted by Zi, let p_zx = pr(Yi = 1 | Xi = x, Zi = z). In the first example, p_00 = 1/4, p_10 = 1/8, and p_11 = 1/8. In the second example, the corresponding probabilities are 1/2, 5/8, and 5/8. In both cases the sharp bounds on the average treatment effect are [−3/16, 5/16]; in the first example the LATE is −1/4, and in the second example it is 1/4.
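The arithmetic behind footnote 27 is easy to verify directly. The short Python sketch below assumes exactly the setup of the footnote (binary outcome, one-sided noncompliance, equal shares of compliers and never-takers) and computes the complier average effect and the sharp bounds on the overall average effect; the function name and the use of exact fractions are incidental choices made only for this illustration.

    from fractions import Fraction as F

    def late_and_bounds(p00, p10, p11, pc=F(1, 2)):
        """Complier average effect and sharp bounds on the overall average effect.
        p_zx = pr(Y = 1 | Z = z, X = x); pc is the population share of compliers;
        one-sided noncompliance, so there are no always-takers."""
        pn = 1 - pc                       # share of never-takers
        ey1_c = p11                       # E[Y(1) | complier], from the Z = 1, X = 1 cell
        ey0_n = p10                       # E[Y(0) | never-taker], from the Z = 1, X = 0 cell
        ey0_c = (p00 - pn * ey0_n) / pc   # back out E[Y(0) | complier] from the Z = 0 cell
        late = ey1_c - ey0_c
        # E[Y(1) | never-taker] is not identified; bounding it by 0 and 1 gives the sharp bounds.
        lower = pc * late + pn * (0 - ey0_n)
        upper = pc * late + pn * (1 - ey0_n)
        return late, lower, upper

    for label, p in [("first example ", (F(1, 4), F(1, 8), F(1, 8))),
                     ("second example", (F(1, 2), F(5, 8), F(5, 8)))]:
        late, lo, hi = late_and_bounds(*p)
        print(label, " LATE =", late, " bounds = [", lo, ",", hi, "]")

Both examples reproduce the bounds [−3/16, 5/16], with complier average effects of −1/4 and 1/4 respectively, which is the point of the example: identical bounds can hide very different identified effects.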

* erogenous effects: 1 if V i 0, Vi ​ ​ * > ​ = 0​ if V i 0. ≤ Yi ( i) Vi i , e ​ = α + β + ν · + ε ​ The inclusion of the instrument Z in the where captures the heterogeneity in the i νi utility function can be thought of as reflect- effect of veteran status for individual i. If ing the cost a low lottery number imposes we maintain joint normality (now of the on the action of not serving in the military. triple ( , , )), we can still identify the εi ηi η i Suppose that the only way to stay out of the parameters of the model, including , that β military if drafted is through medical exemp- is, the average effect of veteran status. See, tions. In that case, it may well be plausible for example, Anders Björklund and Robert that the instrument is valid. Health status Moffitt (1987). Unlike in the constant effect is captured by the unobserved component model, however, in this case the normality : individuals in poor health assumption is not innocuous. As Heckman ηi ηi < −π0 − π1 (never-takers in the AIR terminology) would (1990) shows, a nonparametric version of 416 Journal of Economic Literature, Vol. XLVIII (June 2010) this model is not identified unless the prob- point estimates for the overall average based ability of veteran status, as a function of the on additional assumptions is, thus, emphati- instrument Zi , is arbitrarily close to zero and cally not motivated by a claim that the local one for some choices of the instrument. As average treatment effect is the sole or even this is implied by the range of the instrument primary effect of interest. Rather, it is moti- being unbounded, this is often referred vated by a sober assessment that estimates to as “identification at infinity” (Gary for other subpopulations do not have the Chamberlain 1986; HU). In the case with a same internal validity and by an attempt to binary instrument, this assumption is easy to clarify what can be learned from the data in verify. In the Angrist study, the probability of the absence of identification of the popula- serving in the military for the draft eligible tion average effect. It is based on a realization and noneligible is far from zero and one, and that, because of heterogeneity in responses, so nonparametric identification arguments instrumental variables estimates are a dis- based on identification-at-infinity fail. The tinct second best to randomized experiments. key contribution of IA was the insight that, Let me end this discussion with a final com- although one could not identify the average ment on the substantive importance of what effect for the overall population, one could we learn in such settings. Although we do not still identify the average effect for compli- learn what the average effect is of veteran ers, or the LATE.28 In the structural model status, we can, in sufficiently large samples, above, compliers are the individuals with learn for a particular, well-defined subpopu- . Think again of the case lation, what the effect is. We may then wish π0 − π1 ≤ ηi < π0 where the never-takers with to extrapolate to other subpopulations, even if ηi< −π0 − π1 correspond to individuals in poor health. only qualitatively, but given that the nature of These individuals cannot be induced to serve those extrapolations is often substantially less in the military through the draft. 
Reporting the local average treatment effect, solely or in combination with bounds or point estimates for the overall average based on additional assumptions, is thus emphatically not motivated by a claim that the local average treatment effect is the sole or even primary effect of interest. Rather, it is motivated by a sober assessment that estimates for other subpopulations do not have the same internal validity, and by an attempt to clarify what can be learned from the data in the absence of identification of the population average effect. It is based on a realization that, because of heterogeneity in responses, instrumental variables estimates are a distinct second best to randomized experiments. Let me end this discussion with a final comment on the substantive importance of what we learn in such settings. Although we do not learn what the average effect of veteran status is, we can, in sufficiently large samples, learn what the effect is for a particular, well-defined subpopulation. We may then wish to extrapolate to other subpopulations, even if only qualitatively, but given that the nature of those extrapolations is often substantially less credible than the inferences for the particular subpopulation, it may be useful to keep these extrapolations separate from the identification of the effect for compliers.

These arguments are even more relevant for the regression discontinuity case. In the sharp regression discontinuity case, we learn about the average effect of a treatment at a fixed value of the covariate. Let us consider Jordan D. Matsudaira's (2008) example of the effect of summer school attendance on subsequent school performance. Matsudaira uses comparisons of students just above and just below the threshold on the test score that leads to mandatory summer school attendance. Students close to this margin are likely to be different from those far away from the margin. At the same time, there is no reason to think that only students at the margin are of interest: the effect of summer school on students with test scores far below the margin is likely to be of considerable interest as well but, in the absence of credible models for extrapolation, there may be no credible estimates for that group.

Fuzzy regression discontinuity designs rank even lower in terms of external validity. As pointed out by Hahn, Todd, and van der Klaauw (2001), in arguably the most important contribution of economists to the regression discontinuity design literature, fuzzy regression discontinuity designs combine the limitations of sharp regression discontinuity designs, in that they only refer to units with a particular value of the covariates, with those of instrumental variables estimates, in that they only reflect on compliers. However, for this subpopulation, these designs often have great internal validity. Many convincing examples have now been published.
See the survey paper by Lee and Lemieux (2010) and the special issue of the Journal of Econometrics (Imbens and Lemieux 2008). Again, researchers do not necessarily set out to estimate the average for these particular subpopulations but, in the face of the lack of internal validity of estimates for other subpopulations, they justifiably choose to report estimates for them.

6. Internal versus External Validity

Much of the debate between structural and causal approaches ultimately centers on the weight researchers put on external validity versus internal validity of a study. To be precise, by a study I have in mind a combination of a population, a causal effect of interest, and an estimator. By internal validity I mean the credibility of the estimator as an estimator of the causal effect of interest, and by external validity I mean the generalizability of the causal effect to other populations.[29] The concern is typically that randomized experiments may do well in terms of internal validity but poorly in terms of external validity, relative to structural models.[30] There is no disagreement that both internal and external validity are important. See Banerjee and Duflo (2009) for a recent discussion in the context of experimental evaluations in development economics. Returning to the class size example from section 2, Angrist and Lavy (1999), Hoxby (2000), and Krueger (1999) do not study the effect of class size as a historical phenomenon: they want to inform the policy debate on class size. Similarly, Card (1990) is presumably not interested solely in the effect of the Mariel Boatlift; rather, he is interested in informing the debate on the effects of immigration of low-skilled workers. In order to be useful in informing policy, a study needs to have internal validity (have a credible causal interpretation for the population it refers to) as well as external validity (be relevant for the populations the treatment may be extended to). In many disciplines, the weights placed on different studies are heavily loaded in favor of internal validity. The FDA insists on elaborate protocols to ensure the internal validity of estimates, with much less emphasis on their external validity. This has led, at times, to the approval of drugs with a subsequent reversal of that decision after the drug was found to have adverse effects on populations that were underrepresented in the original study populations.

[29] This is in line with, for example, Shadish, Cook, and Campbell (2002), who define internal validity as "the validity of inferences about whether observed covariation . . . reflects a causal relationship," and external validity as "the validity of inferences about whether the cause–effect relationship holds over variation in persons, settings, treatment variables, and measurement variables." It also agrees with Rosenbaum (2010), who writes "A randomized experiment is said to have a high level of 'internal validity' in the sense that the randomization provides a strong or 'reasoned' basis for inference about the effects of the treatment . . . on the . . . individuals in the experiment," and "'External' validity refers to the effects of the treatment on people not included in the experiment."
[30] Although Cartwright (2007) surprisingly has the opposite view: "Despite the claims of RCTs [randomized clinical trials] to be the gold standard, economic models have all the advantages when it comes to internal validity" and "But it seems that RCTs have the advantage over economic models with respect to external validity" (Cartwright 2007, p. 19).

Part of this is unavoidable. First, legally, randomized experiments can only be conducted with informed consent by participants, and there is no systematic method for ensuring that the population of those who consent is representative of the population of interest. Second, after a successful randomized experiment, the target population may well change. If a treatment is demonstrated in a randomized trial to be beneficial for moderately sick patients, physicians may well be tempted to use it for sicker patients who were not part of the original study. Doing a second experiment on a population of sicker patients would not always be an option, and would not be ethical if the first trial on the population of moderately sick individuals showed a substantial beneficial effect of the treatment. Third, other things may change between the experiment and the subsequent adoption that affect the efficacy of the treatment. Again, this is unavoidable in practice.

In economic applications, the issue of external validity is considerably more severe. In many biomedical treatments the effects operate through relatively stable biological mechanisms that often generalize readily to other populations. A vaccine for a particular strain of HIV that prevents infection in the United States has a high likelihood of working for the same strain in Africa as well. In contrast, an educational reform that is found to raise test scores in England is unlikely to be directly applicable to the United States given the differences in educational institutions and practices.

It may be helpful to put some more structure on this problem.[31] Suppose we have a number of units. To be specific, I will refer to them as states. We are interested in the effect of an intervention, e.g., putting a price cap into place at p1 versus at p0, on demand for a particular commodity in a particular state, say California. For ease of exposition, let us assume that p1 − p0 = 1. Let the expected difference in demand at the two potential values for the price cap be denoted by θs, indexed by state s. States may differ in the expected effect θs because they differ in terms of institutions or because they differ in terms of population composition. Let us denote the relevant characteristics of the states by Xs and, for purposes of this discussion, let us assume we observe Xs.

[31] This discussion is partly based on conversations with Banerjee and Mullainathan.
Now suppose we have a structural economic model for the household-level demand function:

\[
D_i = \beta_0 + \beta_1 \cdot p + \beta_2 \cdot I_i \cdot p + \varepsilon_i ,
\]

where Di is household-level demand, Ii is household income, and εi are unobserved differences between households. The parameters β are structural parameters, common to all states (structural in the Goldberger 1991 sense of being invariant to changes in the population). Given this model, the difference in expected demand in state s if the price is fixed at p1 versus p0 is

\[
\theta_s = E[\,D_i \mid S_i = s, P_i = p_1\,] - E[\,D_i \mid S_i = s, P_i = p_0\,]
= \beta_1 + \beta_2 \cdot E[\,I_i \mid S_i = s\,] .
\]

Let Xs = E[Ii | Si = s] be average income in state s, so that we can write

\[
\theta_s = g(X_s, \beta) = \beta_1 + \beta_2 \cdot X_s .
\]

We are interested in the difference in average outcomes in California,

\[
\theta_{ca} = g(X_{ca}, \beta) = \beta_1 + \beta_2 \cdot X_{ca} .
\]
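As a purely numerical illustration, with invented values rather than estimates: if β1 = −2 and β2 = 0.5, and average income in California is Xca = 3, then

\[
\theta_{ca} = \beta_1 + \beta_2 \cdot X_{ca} = -2 + 0.5 \times 3 = -0.5 ,
\]

so the model predicts that moving the price cap from p0 to p1 = p0 + 1 lowers average demand in California by 0.5 units. The same invented numbers are reused in the simulation sketch that follows the Connecticut discussion below.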

Now suppose we have data from an experiment in Tennessee, where randomly selected individuals were faced with a price of p1, and others with a price of p0. Thus, with a sufficiently large sample, we would learn from the Tennessee experiment the value of θtn = g(Xtn, β).

Suppose we also have data from an observational study from Connecticut. In this state, we have a random sample of demand, income, and prices, (Di, Ii, Pi), for i = 1, . . . , N. We may be concerned that in this state prices are endogenous, and so let us assume that we also observe an instrument for price, Zi. If the instrument is valid and, conditional on income, it is both correlated with prices and uncorrelated with εi, this will allow us to estimate the structural parameters β using two-stage least squares. Let us allow for the possibility that the instrument is not valid, or more generally for misspecification in the structural model. In that case, the estimator for β based on Connecticut data, $\hat{\beta}_{ct}$, need not be consistent for β. Let us denote the probability limit of the estimator by βct; we index this probability limit by the state to capture the possibility that, if the same structural model was estimated in a different state, the bias might well be different.

The first question now is how we would choose between two estimates of the intervention in California: the experimental one from Tennessee,

\[
\hat{\theta}^{\,exp}_{ca} = \theta_{tn} ,
\]

versus the structural one, based on parameter estimates from Connecticut, combined with the characteristics of California,

\[
\hat{\theta}^{\,struct}_{ca} = g(X_{ca}, \hat{\beta}_{ct}) .
\]

In principle, the choice between the two estimators would depend on the variation in the effect θs and on the variation in the pseudo-structural parameter βs. In the absence of additional information, one may need to rely on prior beliefs. If one believes there is little variation in θs, one might prefer the experimental estimate; if one believed the structural model was close to correctly specified, one would prefer the structural estimate. Note the benefits in this case of experimental data: if the structural model had actually been estimated on experimental data, there would be no bias, βct would be equal to β, and, thus, g(Xca, βct) would be equal to θca. That is not always the case. If the structural model was richer, a simple experiment with randomly assigned prices would not necessarily pin down all structural parameters. However, in general, it will help pin down some combination of the structural parameters by forcing the model to fit the experimental evidence. This is closely related to the sufficient statistics approach in Chetty (2009).
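The two steps just described, estimating β by two-stage least squares on Connecticut-style micro data and then forming the structural prediction for California, can be sketched as follows. The data are simulated and the parameter values (β0 = 10, β1 = −2, β2 = 0.5, and the state average incomes) are the same invented numbers as above, so this is an illustration of the mechanics rather than an analysis of any actual data.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000

    # Invented structural parameters and state characteristics (illustration only).
    beta0, beta1, beta2 = 10.0, -2.0, 0.5
    x_ca, x_ct = 3.0, 2.0                     # average income in California and Connecticut

    # Simulated Connecticut micro data: price is endogenous, z is an instrument for price.
    income = rng.normal(x_ct, 1.0, size=n)
    z = rng.normal(size=n)
    eps = rng.normal(size=n)
    price = 1.0 + 0.5 * z + 0.3 * eps + rng.normal(scale=0.5, size=n)   # correlated with eps
    demand = beta0 + beta1 * price + beta2 * income * price + eps

    # Two-stage least squares with instruments (1, income, z, z * income)
    # for the regressors (1, price, income * price).
    W = np.column_stack([np.ones(n), price, income * price])
    Zmat = np.column_stack([np.ones(n), income, z, z * income])
    What = Zmat @ np.linalg.solve(Zmat.T @ Zmat, Zmat.T @ W)   # first-stage fitted values
    beta_hat = np.linalg.solve(What.T @ W, What.T @ demand)    # 2SLS estimate of (b0, b1, b2)

    # Structural prediction of the effect of moving the price cap from p0 to p1 = p0 + 1
    # in California, using the Connecticut estimates and California's average income.
    theta_ca_struct = beta_hat[1] + beta_hat[2] * x_ca
    print("2SLS estimates:     ", np.round(beta_hat, 2))
    print("Predicted theta_ca: %.2f (true value here: %.2f)" % (theta_ca_struct, beta1 + beta2 * x_ca))

In this clean case the instrument is valid and the model is correctly specified, so the structural prediction recovers θca; the interesting comparisons in the text arise precisely when one of those conditions fails.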
The answer to the first question may also differ if the experiment in Tennessee focused on a question that differed from that in California. If the experiment in Tennessee involved randomly assigning prices of p2 and p3, rather than the price levels that enter into the California question, p0 and p1, it may be difficult to estimate θca from the Tennessee results. This would not pose any conceptual problems from the structural model perspective.

A second question is what one would do if one had both the experimental evidence from Tennessee and the observational data from Connecticut. In that case, one could, in the spirit of the LaLonde (1986) evaluation of econometric evaluation methods, compare the experimental estimate for Tennessee, θtn, with the structural one based on the Connecticut estimates, $\hat{\theta}^{\,struct}_{tn} = g(X_{tn}, \hat{\beta}_{ct})$. The comparison of θtn and $\hat{\theta}^{\,struct}_{tn}$ reflects on the adequacy of the structural model. If the structural model passes the test, there is a stronger case for using the structural model to predict the effect of the intervention in California. If the prediction fails, however, the conclusion is that the structural model is not adequate and, thus, invalidates $\hat{\theta}^{\,struct}_{ca}$. This test does not reflect in any way on the experimental estimate $\hat{\theta}^{\,exp}_{ca}$.

A third question concerns the information content of additional experiments. With two or more experiments we would be able to update our beliefs on the amount of variation in θs. It obviously would not help much if we did the second experiment in a state very similar to Tennessee but, if we did the second experiment in a state very different from Tennessee and ideally more similar to California, we would likely learn much about the amount of variation in θs. If we have detailed information on Xs, having a substantial number of experiments may enable us to approximate the function g(x; β) without directly estimating β, simply by fitting a flexible functional form to E[θs | Xs] = g(Xs; γ). If we can approximate this function accurately, we would be able to predict the effect of the intervention in California. In this case, one could also incorporate different experiments, e.g., those involving other price caps. If there is any choice, one should do the experiments in a wide range of settings, that is, in the current example, in states with different Xs. The analyses by Card, Kluve, and Weber (2009), V. Joseph Hotz, Imbens, and Julie H. Mortimer (2005), Kremer and Alaka Holla (2008), and Raghabendra Chattopadhyay and Duflo (2004) fit into this framework.
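For the case with several experimental states, the flexible-fit idea can be illustrated in a few lines; the state characteristics and effect estimates below are invented, and the quadratic in average income is only one of many possible choices of flexible functional form for E[θs | Xs].

    import numpy as np

    # Invented experimental estimates theta_s from five states, with observed
    # average state income X_s for each (illustration only).
    x_s     = np.array([1.5, 2.0, 2.5, 3.5, 4.0])
    theta_s = np.array([-1.22, -0.98, -0.74, -0.27, 0.02])

    # Fit a flexible (here quadratic) approximation to E[theta_s | X_s] = g(X_s; gamma) ...
    gamma = np.polyfit(x_s, theta_s, deg=2)

    # ... and use it to predict the effect in California, with average income X_ca.
    x_ca = 3.0
    print("Predicted theta_ca: %.2f" % np.polyval(gamma, x_ca))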
The fourth question concerns the benefits of multiple observational studies. This is not quite so clear. In many cases, one would expect that repeated observational studies in different locations would have similar biases, generated through similar selection mechanisms. Finding that multiple observational studies lead to the same results is therefore not necessarily informative. To get a handle on the bias, the difference βs − β, we would need observational studies from states that do not have the same biases as the first state, Connecticut. Identifying such states may be more difficult than finding a state with potentially different effects θs: it may well be that the biases in observational studies would be similar in all states, arising from the same selection mechanisms. Rosenbaum (1987) discusses similar issues arising in the presence of multiple control groups in observational studies.

7. Conclusion

Deaton offers a critical appraisal of the methodologies currently in fashion in development economics. He argues that randomized experiments have no special role in the hierarchy of evidence and, as do Heckman and Urzua, argues somewhat presumptuously that instrumental variables methods do not answer interesting questions. He suggests moving toward more theory-based studies and away from randomized and natural experiments. In these comments, I take issue with some of these positions and caution against his recommendations. The causal or design-based literature, going back to the work in labor economics by Angrist, Card, Krueger, and others, and the current experimental literature in development economics, including work by Duflo, Banerjee, and Kremer, has greatly improved the standards of empirical work by emphasizing internal validity and clarifying the nature of identifying assumptions. Although it would be regrettable if this trend led researchers to avoid questions that cannot be answered through randomized or natural experiments, it is important not to lose track of the great strides made by this literature toward improving the credibility of empirical work.

References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. Forthcoming. "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program." Journal of the American Statistical Association.
Angrist, Joshua D. 1990. "Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records." American Economic Review, 80(3): 313–36.
Angrist, Joshua D., Kathryn Graddy, and Guido W. Imbens. 2000. "The Interpretation of Instrumental Variables Estimators in Simultaneous Equations Models with an Application to the Demand for Fish." Review of Economic Studies, 67(3): 499–527.
Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin. 1996. "Identification of Causal Effects Using Instrumental Variables." Journal of the American Statistical Association, 91(434): 444–55.
Angrist, Joshua D., and Alan B. Krueger. 1991. "Does Compulsory School Attendance Affect Schooling and Earnings?" Quarterly Journal of Economics, 106(4): 979–1014.
Angrist, Joshua D., and Victor Lavy. 1999. "Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement." Quarterly Journal of Economics, 114(2): 533–75.
Athey, Susan, Jonathan Levin, and Enrique Seira. Forthcoming. "Comparing Open and Sealed Bid Auctions: Theory and Evidence from Timber Auctions." Quarterly Journal of Economics.
Banerjee, Abhijit V. 2007. Making Aid Work. Cambridge and London: MIT Press.
Banerjee, Abhijit V., Shawn Cole, Esther Duflo, and Leigh Linden. 2007. "Remedying Education: Evidence from Two Randomized Experiments in India." Quarterly Journal of Economics, 122(3): 1235–64.
Banerjee, Abhijit V., and Esther Duflo. 2009. "The Experimental Approach to Development Economics." Annual Review of Economics, 1: 151–78.
Banerjee, Abhijit V., and Ruimin He. 2008. "Making Aid Work." In Reinventing Foreign Aid, ed. William R. Easterly, 47–92. Cambridge and London: MIT Press.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2002. "How Much Should We Trust Differences-in-Differences Estimates?" National Bureau of Economic Research Working Paper 8841.
Bertrand, Marianne, and Sendhil Mullainathan. 2004. "Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review, 94(4): 991–1013.
Björklund, Anders, and Robert Moffitt. 1987. "The Estimation of Wage Gains and Welfare Gains in Self-Selection Models." Review of Economics and Statistics, 69(1): 42–49.
Card, David. 1990. "The Impact of the Mariel Boatlift on the Miami Labor Market." Industrial and Labor Relations Review, 43(2): 245–57.
Card, David, and Dean R. Hyslop. 2005. "Estimating the Effects of a Time-Limited Earnings Subsidy for Welfare-Leavers." Econometrica, 73(6): 1723–70.
Card, David, Jochen Kluve, and Andrea Weber. 2009. "Active Labor Market Policy Evaluations: A Meta-analysis." Institute for the Study of Labor Discussion Paper 4002.
Card, David, and Alan B. Krueger. 1994. "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania." American Economic Review, 84(4): 772–93.
Cartwright, Nancy. 2007. "Are RCTs the Gold Standard?" BioSocieties, 2(1): 11–20.
Chamberlain, Gary. 1986. "Asymptotic Efficiency in Semi-parametric Models with Censoring." Journal of Econometrics, 32(2): 189–218.
Chattopadhyay, Raghabendra, and Esther Duflo. 2004. "Women as Policy Makers: Evidence from a Randomized Policy Experiment in India." Econometrica, 72(5): 1409–43.
Chetty, Raj. 2009. "Sufficient Statistics for Welfare Analysis: A Bridge Between Structural and Reduced-Form Methods." Annual Review of Economics, 1: 451–88.
Chetty, Raj, Adam Looney, and Kory Kroft. 2009. "Salience and Taxation: Theory and Evidence." American Economic Review, 99(4): 1145–77.
Chetty, Raj, and Emmanuel Saez. 2009. "Teaching the Tax Code: Earnings Responses to an Experiment with EITC Recipients." National Bureau of Economic Research Working Paper 14836.
Cook, Thomas D. 2008. "'Waiting for Life to Arrive': A History of the Regression-Discontinuity Design in Psychology, Statistics and Economics." Journal of Econometrics, 142(2): 636–54.
Deaton, Angus S. 2009. "Instruments of Development: Randomization in the Tropics, and the Search for the Elusive Keys to Economic Development." National Bureau of Economic Research Working Paper 14690.
Dehejia, Rajeev H. 2005. "Practical Propensity Score Matching: A Reply to Smith and Todd." Journal of Econometrics, 125(1–2): 355–64.
Dehejia, Rajeev H., and Sadek Wahba. 1999. "Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs." Journal of the American Statistical Association, 94(448): 1053–62.
Duflo, Esther. 2004. "Scaling Up and Evaluation." In Annual World Bank Conference on Development Economics, 2004: Accelerating Development, ed. François Bourguignon and Boris Pleskovic, 341–69. Washington, D.C.: World Bank; Oxford and New York: Oxford University Press.
Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. "Using Randomization in Development Economics Research: A Toolkit." In Handbook of Development Economics, Volume 4, ed. T. Paul Schultz and John Strauss, 3895–3962. Amsterdam and San Diego: Elsevier, North-Holland.
Duflo, Esther, Rema Hanna, and Stephen Ryan. 2007. "Monitoring Works: Getting Teachers to Come to School." Unpublished.
Duflo, Esther, and Emmanuel Saez. 2003. "The Role of Information and Social Interactions in Retirement Plan Decisions: Evidence from a Randomized Experiment." Quarterly Journal of Economics, 118(3): 815–42.
Eicker, Friedhelm. 1967. "Limit Theorems for Regressions with Unequal and Dependent Errors." In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, 59–82. Berkeley: University of California Press.
Fisher, Ronald A. 1925. The Design of Experiments, First edition. London: Oliver and Boyd.
Fraker, Thomas, and Rebecca Maynard. 1987. "The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs." Journal of Human Resources, 22(2): 194–227.
Freedman, David A. 2006. "Statistical Models for Causation: What Inferential Leverage Do They Provide?" Evaluation Review, 30(6): 691–713.
Freedman, David A. 2008. "On Regression Adjustments to Experimental Data." Advances in Applied Mathematics, 40(2): 180–93.
Freedman, David A. 2010. Statistical Models and Causal Inference: A Dialogue with the Social Sciences, ed. David Collier, Jasjeet Sekhon, and Philip B. Stark. Cambridge and New York: Cambridge University Press.
Goldberger, Arthur S. 1991. A Course in Econometrics. Cambridge, Mass. and London: Harvard University Press.
Haavelmo, Trygve. 1943. "The Statistical Implications of a System of Simultaneous Equations." Econometrica, 11(1): 1–12.
Hahn, Jinyong, Petra E. Todd, and Wilbert van der Klaauw. 2001. "Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design." Econometrica, 69(1): 201–09.
Hausman, Jerry A. 1981. "Labor Supply." In How Taxes Affect Economic Behavior, ed. Henry J. Aaron and Joseph A. Pechman, 27–72. Washington, D.C.: Brookings Institution Press.
Hausman, Jerry A., and David A. Wise. 1979. "Attrition Bias in Experimental and Panel Data: The Gary Income Maintenance Experiment." Econometrica, 47(2): 455–73.
Heckman, James J. 1978. "Dummy Endogenous Variables in a Simultaneous Equation System." Econometrica, 46(4): 931–59.
Heckman, James J. 1990. "Varieties of Selection Bias." American Economic Review, 80(2): 313–18.
Heckman, James J. 1997. "Instrumental Variables: A Study of Implicit Behavioral Assumptions in One Widely Used Estimator." Journal of Human Resources, 32(3): 441–62.
Heckman, James J. 1999. "Instrumental Variables: Response." Journal of Human Resources, 34(4): 828–37.
Heckman, James J., and Richard Robb Jr. 1985. "Alternative Methods for Evaluating the Impact of Interventions." In Longitudinal Analysis of Labor Market Data, ed. James J. Heckman and Burton Singer, 156–245. Cambridge; New York and Sydney: Cambridge University Press.
Heckman, James J., and Jeffrey A. Smith. 1995. "Assessing the Case for Social Experiments." Journal of Economic Perspectives, 9(2): 85–110.
Heckman, James J., and Sergio Urzua. 2009. "Comparing IV with Structural Models: What Simple IV Can and Cannot Identify." National Bureau of Economic Research Working Paper 14706.
Heckman, James J., and Sergio Urzua. 2010. "Comparing IV with Structural Models: What Simple IV Can and Cannot Identify." Journal of Econometrics, 156(1): 27–37.
Heckman, James J., Sergio Urzua, and Edward Vytlacil. 2006. "Understanding Instrumental Variables in Models with Essential Heterogeneity." Review of Economics and Statistics, 88(3): 389–432.
Holden, Constance. 1990. "Head Start Enters Adulthood: After 25 Years We Don't Know Much about How Early Childhood Intervention Programs Work, but Current Research Suggests They Should Be Extended Beyond Early Childhood." Science, 247(4949): 1400–1402.
Holland, Paul W. 1986. "Statistics and Causal Inference." Journal of the American Statistical Association, 81(396): 945–60.
Hotz, V. Joseph, Guido W. Imbens, and Julie H. Mortimer. 2005. "Predicting the Efficacy of Future Training Programs Using Past Experiences at Other Locations." Journal of Econometrics, 125(1–2): 241–70.
Hoxby, Caroline M. 2000. "The Effects of Class Size on Student Achievement: New Evidence from Population Variation." Quarterly Journal of Economics, 115(4): 1239–85.
Huber, Peter J. 1967. "The Behavior of Maximum Likelihood Estimates under Nonstandard Conditions." In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, 221–33. Berkeley: University of California Press.
Imbens, Guido W., and Joshua D. Angrist. 1994. "Identification and Estimation of Local Average Treatment Effects." Econometrica, 62(2): 467–75.
Imbens, Guido W., and Karthik Kalyanaraman. 2008. "Optimal Bandwidth Selection in Regression Discontinuity Designs." Unpublished.
Imbens, Guido W., Gary King, David McKenzie, and Geert Ridder. 2009. "On the Benefits of Stratification in Randomized Experiments." Unpublished.
Imbens, Guido W., and Thomas Lemieux. 2008. "Special Issue Editors' Introduction: The Regression Discontinuity Design—Theory and Applications." Journal of Econometrics, 142(2): 611–14.
Imbens, Guido W., Donald B. Rubin, and Bruce I. Sacerdote. 2001. "Estimating the Effect of Unearned Income on Labor Earnings, Savings, and Consumption: Evidence from a Survey of Lottery Players." American Economic Review, 91(4): 778–94.
Imbens, Guido W., and Jeffrey M. Wooldridge. 2009. "Recent Developments in the Econometrics of Program Evaluation." Journal of Economic Literature, 47(1): 5–86.
Kremer, Michael, and Alaka Holla. 2008. "Pricing and Access: Lessons from Randomized Evaluations in Education and Health." Unpublished.
Krueger, Alan B. 1999. "Experimental Estimates of Education Production Functions." Quarterly Journal of Economics, 114(2): 497–532.
LaLonde, Robert J. 1986. "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." American Economic Review, 76(4): 604–20.
Leamer, Edward E. 1978. Specification Searches: Ad Hoc Inference with Nonexperimental Data. New York: Wiley.
Leamer, Edward E. 1983. "Let's Take the Con Out of Econometrics." American Economic Review, 73(1): 31–43.
Lee, David S. 2008. "Randomized Experiments from Non-random Selection in U.S. House Elections." Journal of Econometrics, 142(2): 675–97.
Lee, David S., and Thomas Lemieux. 2010. "Regression Discontinuity Designs in Economics." Journal of Economic Literature, 48(2): 281–355.
Manski, Charles F. 1990. "Nonparametric Bounds on Treatment Effects." American Economic Review, 80(2): 319–23.
Manski, Charles F. 1995. Identification Problems in the Social Sciences. Cambridge and London: Harvard University Press.
Manski, Charles F. 1996. "Learning About Treatment Effects from Experiments with Random Assignment of Treatments." Journal of Human Resources, 31(4): 709–33.
Manski, Charles F. 1997. "The Mixing Problem in Programme Evaluation." Review of Economic Studies, 64(4): 537–53.
Manski, Charles F. 2003. Partial Identification of Probability Distributions. New York and Heidelberg: Springer.
Manski, Charles F., Gary D. Sandefur, Sara McLanahan, and Daniel Powers. 1992. "Alternative Estimates of the Effect of Family Structure during Adolescence on High School Graduation." Journal of the American Statistical Association, 87(417): 25–37.
Matsudaira, Jordan D. 2008. "Mandatory Summer School and Student Achievement." Journal of Econometrics, 142(2): 829–50.
McClellan, Mark, Barbara J. McNeil, and Joseph P. Newhouse. 1994. "Does More Intensive Treatment of Acute Myocardial Infarction in the Elderly Reduce Mortality? Analysis Using Instrumental Variables." Journal of the American Medical Association, 272(11): 859–66.
McCrary, Justin. 2008. "Manipulation of the Running Variable in the Regression Discontinuity Design: A Density Test." Journal of Econometrics, 142(2): 698–714.
Miguel, Edward, and Michael Kremer. 2004. "Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities." Econometrica, 72(1): 159–217.
Neyman, Jerzy. 1990. "On the Application of Probability Theory to Agricultural Experiments. Essays on Principles. Section 9." Statistical Science, 5(4): 465–72. (Orig. pub. 1923.)
Ravallion, Martin. 2009. "Should the Randomistas Rule?" The Economists' Voice, 6(2).
Reid, Constance. 1982. Neyman: From Life. New York: Springer.
Rodrik, Dani. 2008. "The New Development Economics: We Shall Experiment, but How Shall We Learn?" Harvard University John F. Kennedy School of Government Working Paper 08-055.
Romer, Christina D., and David H. Romer. 2004. "A New Measure of Monetary Shocks: Derivation and Implications." American Economic Review, 94(4): 1055–84.
Rosenbaum, Paul R. 1987. "The Role of a Second Control Group in an Observational Study." Statistical Science, 2(3): 292–306.
Rosenbaum, Paul R. 1995. Observational Studies. New York; Heidelberg and London: Springer.
Rosenbaum, Paul R. 2010. Design of Observational Studies. New York: Springer.
Rosenbaum, Paul R., and Donald B. Rubin. 1983. "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika, 70(1): 41–55.
Rubin, Donald B. 1974. "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies." Journal of Educational Psychology, 66(5): 688–701.
Rubin, Donald B. 1978. "Bayesian Inference for Causal Effects: The Role of Randomization." Annals of Statistics, 6(1): 34–58.
Rubin, Donald B. 1990. "Formal Mode of Statistical Inference for Causal Effects." Journal of Statistical Planning and Inference, 25(3): 279–92.
Scheffe, Henry. 1970. "Practical Solutions to the Behrens–Fisher Problem." Journal of the American Statistical Association, 65(332): 1501–08.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.
Smith, Jeffrey A., and Petra E. Todd. 2005. "Does Matching Overcome LaLonde's Critique of Nonexperimental Estimators?" Journal of Econometrics, 125(1–2): 305–53.
Tinbergen, Jan. 1997. "Determination and Interpretation of Supply Curves: An Example." In The Foundations of Econometric Analysis, ed. David F. Hendry and Mary S. Morgan, 233–45. Cambridge and New York: Cambridge University Press.
Todd, Petra E., and Kenneth Wolpin. 2003. "Using a Social Experiment to Validate a Dynamic Behavioral Model of Child Schooling and Fertility: Assessing the Impact of a School Subsidy Program in Mexico." Penn Institute for Economic Research Working Paper 03-022.
White, Halbert. 1980. "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity." Econometrica, 48(4): 817–38.
Zelen, M. 1979. "A New Design for Randomized Clinical Trials." New England Journal of Medicine, 300(22): 1242–45.
