
Journal of Economic Literature 48 (June 2010): 424–455 http://www.aeaweb.org/articles.php?doi=10.1257/jel.48.2.424

Instruments, Randomization, and Learning about Development

Angus Deaton*

There is currently much debate about the effectiveness of foreign aid and about what kind of projects can engender economic growth. There is skepticism about the ability of econometric analysis to resolve these issues or of development agencies to learn from their own experience. In response, there is increasing use in development of randomized controlled trials (RCTs) to accumulate credible knowledge of what works, without overreliance on questionable theory or statistical methods. When RCTs are not possible, the proponents of these methods advocate quasi-randomization through instrumental variable (IV) techniques or natural experiments. I argue that many of these applications are unlikely to recover quantities that are useful for policy or understanding: two key issues are the misunderstanding of exogeneity and the handling of heterogeneity. I illustrate from the literature on aid and growth. Actual randomization faces similar problems as does quasi-randomization, notwithstanding rhetoric to the contrary. I argue that experiments have no special ability to produce more credible knowledge than other methods, and that actual experiments are frequently subject to practical problems that undermine any claims to statistical or epistemic superiority. I illustrate using prominent experiments in development and elsewhere. As with IV methods, RCT-based evaluation of projects, without guidance from an understanding of underlying mechanisms, is unlikely to lead to scientific progress in the understanding of economic development. I welcome recent trends in development experimentation away from the evaluation of projects and toward the evaluation of theoretical mechanisms. (JEL C21, F35, O19)

* Deaton: . This is a lightly revised and updated version of “Instruments of Development: Randomization in the Tropics and the Search for the Elusive Keys to Economic Development,” originally given as the Keynes Lecture, British Academy, October 9, 2008, and published in the Proceedings of the British Academy 102, 2008 Lectures, © The British Academy 2009. I am grateful to the British Academy for permission to reprint the unchanged material. I have clarified some points in response to Imbens (2009), but have not changed anything of substance and stand by my original arguments. For helpful discussions and comments on earlier versions, I am grateful to Tim Besley, Ron Brookmayer, Hank Farber, Bill Easterly, Erwin Diewert, Jean Drèze, Don Green, Jim Heckman, Ori Heffetz, Karla Hoff, Bo Honoré, Štěpán Jurajda, Dean Karlan, Aart Kraay, David Lee, Steve Levitt, Winston Lin, John List, David McKenzie, Margaret McMillan, Costas Meghir, Branko Milanovic, Chris Paxson, Franco Peracchi, Sam Schulhofer-Wohl, Jonathan Skinner, Jesse Rothstein, Rob Sampson, Burt Singer, Finn Tarp, Alessandro Tarozzi, Chris Udry, Gerard van den Berg, Eric Verhoogen, Susan Watkins, Frank Wolak, and John Worrall. I acknowledge a particular intellectual debt to Nancy Cartwright, who has discussed these issues patiently with me for several years, whose own work on causality has greatly influenced me, and who pointed me toward other important work; to Jim Heckman, who has long thought deeply about these issues, and many of whose views are recounted here; and to the late David Freedman, who consistently and effectively fought against the (mis)use of technique as a substitute for substance and thought. None of which removes the need for the usual disclaimer, that the views expressed here are entirely my own. I acknowledge financial support from NIA through grant P01 AG05842–14 to the National Bureau of Economic Research.


1. Introduction

The effectiveness of development assistance is a topic of great public interest. Much of the public debate among noneconomists takes it for granted that, if the funds were made available, poverty would be eliminated (Thomas Pogge 2005; Peter Singer 2004), and at least some economists agree (Jeffrey D. Sachs 2005, 2008). Others, most notably William Easterly (2006, 2009), are deeply skeptical, a position that has been forcefully argued at least since P. T. Bauer (1971, 1981). Few academic economists or political scientists agree with Sachs’s views, but there is a wide range of intermediate positions, well assembled by Easterly (2008). The debate runs the gamut from the macro—can foreign assistance raise growth rates and eliminate poverty?—to the micro—what sorts of projects are likely to be effective? Should aid focus on electricity and roads, or on the provision of schools and clinics or vaccination campaigns? Here I shall be concerned with both the macro and micro kinds of assistance. I shall have very little to say about what actually works and what does not—but it is clear from the literature that we do not know. Instead, my main concern is with how we should go about finding out whether and how assistance works and with methods for gathering evidence and learning from it in a scientific way that has some hope of leading to the progressive accumulation of useful knowledge about development.

I am not an econometrician, but I believe that econometric methodology needs to be assessed, not only by methodologists, but by those who are concerned with the substance of the issue. Only they (we) are in a position to tell when something has gone wrong with the application of econometric methods, not because they are incorrect given their assumptions, but because their assumptions do not apply, or because they are incorrectly conceived for the problem at hand. Or at least that is my excuse for meddling in these matters.

Any analysis of the extent to which foreign aid has increased economic growth in recipient countries immediately confronts the familiar problem of simultaneous causality; the effect of aid on growth, if any, will be disguised by effects running in the opposite direction, from poor economic performance to compensatory or humanitarian aid. It is not obvious how to disentangle these effects, and some have argued that the question is unanswerable and that econometric studies of it should be abandoned. Certainly, the econometric studies that use international evidence to examine aid effectiveness currently have low professional status. Yet it cannot be right to give up on the issue. There is no general or public understanding that nothing can be said, and to give up the econometric analysis is simply to abandon precise statements for loose and unconstrained histories of episodes selected to support the position of the speaker.

The analysis of aid effectiveness typically uses cross-country growth regressions with the simultaneity between aid and growth dealt with using instrumental variable methods. I shall argue in the next section that there has been a good deal of misunderstanding in the literature about the use of instrumental variables.
Econometric analysis has changed its focus over the years, away from the analysis of models derived from theory toward much looser specifications that are statistical representations of program evaluation. With this shift, instrumental variables have moved from being solutions to a well-defined problem of inference to being devices that induce quasi-randomization. Old and new understandings of instruments coexist, leading to errors, misunderstandings, and confusion, as well as unfortunate and unnecessary rhetorical barriers between disciplines working on the same problems. These abuses of technique have contributed to a general skepticism about the ability of econometric analysis to answer these big questions.

A similar state of affairs exists in the microeconomic area, in the analysis of the effectiveness of individual programs and projects, such as the construction of infrastructure—dams, roads, water supply, electricity—and in the delivery of services—education, health, or policing. There is frustration with aid organizations, particularly the World Bank, for allegedly failing to learn from its projects and to build up a systematic catalog of what works and what does not. In addition, some of the skepticism about macro econometrics extends to micro econometrics, so that there has been a movement away from such methods and toward randomized controlled trials. According to Esther Duflo, one of the leaders of the new movement in development, “Creating a culture in which rigorous randomized evaluations are promoted, encouraged, and financed has the potential to revolutionize social policy during the 21st century, just as randomized trials revolutionized medicine during the 20th,” this from a 2004 Lancet editorial headed “The World Bank is finally embracing science.”

In section 4 of this paper, I shall argue that, under ideal circumstances, randomized evaluations of projects are useful for obtaining a convincing estimate of the average effect of a program or project. The price for this success is a focus that is too narrow and too local to tell us “what works” in development, to design policy, or to advance scientific knowledge about development processes. Project evaluations, whether using randomized controlled trials or nonexperimental methods, are unlikely to disclose the secrets of development nor, unless they are guided by theory that is itself open to revision, are they likely to be the basis for a cumulative research program that might lead to a better understanding of development. This argument applies a fortiori to instrumental variables strategies that are aimed at generating quasi-experiments; the value of econometric methods cannot and should not be assessed by how closely they approximate randomized controlled trials. Following Nancy Cartwright (2007a, 2007b), I argue that evidence from randomized controlled trials can have no special priority. Randomization is not a gold standard because “there is no gold standard” (Cartwright 2007a). Randomized controlled trials cannot automatically trump other evidence, they do not occupy any special place in some hierarchy of evidence, nor does it make sense to refer to them as “hard” while other methods are “soft.” These rhetorical devices are just that; metaphor is not argument, nor does endless repetition make it so.
More positively, I shall argue that the analysis of projects needs to be refocused toward the investigation of potentially generalizable mechanisms that explain why and in what contexts projects can be expected to work. The best of the experimental work in development economics already does so because its practitioners are too talented to be bound by their own methodological prescriptions. Yet there would be much to be said for doing so more openly. I concur with Ray Pawson and Nick Tilley (1997), who argue that thirty years of project evaluation in sociology, education, and criminology was largely unsuccessful because it focused on whether projects worked instead of on why they worked. In economics, warnings along the same lines have been repeatedly given by James J. Heckman (see particularly

Heckman 1992 and Heckman and Jeffrey A. Smith 1995), and much of what I have to say is a recapitulation of his arguments.

The paper is organized as follows. Section 2 lays out some econometric preliminaries concerning instrumental variables and the vexed question of exogeneity. Section 3 is about aid and growth. Section 4 is about randomized controlled trials. Section 5 is about using empirical evidence and where we should go now.

2. Instruments, Identification, and the Meaning of Exogeneity

It is useful to begin with a simple and familiar econometric model that I can use to illustrate the differences between different flavors of econometric practice—this has nothing to do with economic development but it is simple and easy to contrast with the development practice that I wish to discuss. In contrast to the models that I will discuss later, I think of this as a model in the spirit of the Cowles Foundation. It is the simplest possible Keynesian macroeconomic model of national income determination taken from once-standard econometrics textbooks. There are two equations that together comprise a complete macroeconomic system. The first equation is a consumption function in which aggregate consumption is a linear function of aggregate national income, while the second is the national income accounting identity that says that income is the sum of consumption and investment. I write the system in standard notation as

(1) C = α + βY + u,

(2) Y ≡ C + I.

According to (1), consumers choose the level of aggregate consumption with reference to their income, while in (2) investment is set by the “animal spirits” of entrepreneurs in some way that is outside of the model. No modern macroeconomist would take this model seriously, though the simple consumption function is an ancestor of more satisfactory and complete modern formulations; in particular, we can think of it (or at least its descendents) as being derived from a coherent model of intertemporal choice. Similarly, modern versions would postulate some theory for what determines investment I—here it is simply taken as given and assumed to be orthogonal to the consumption disturbance u.

In this model, consumption and income are simultaneously determined so that, in particular, a stochastic realization of u—consumers displaying animal spirits of their own—will affect not only C but also Y through equation (2), so that there is a positive correlation between u and Y. As a result, ordinary least squares (OLS) estimation of (1) will lead to upwardly biased and inconsistent estimates of the parameter β.

This simultaneity problem can be dealt with in a number of ways. One is to solve (1) and (2) to get the reduced form equations

(3) C = α/(1 − β) + [β/(1 − β)]I + u/(1 − β),

(4) Y = α/(1 − β) + I/(1 − β) + u/(1 − β).

Both of these equations can be consistently estimated by OLS, and it is easy to show that the same estimates of α and β will be obtained from either one. An alternative method of estimation is to focus on the consumption function (1) and to use our knowledge of (2) to note that investment can be used as an instrumental variable (IV) for income. In the IV regression, there is a “first stage” regression in which income is regressed on investment; this is identical to equation (4), which is part of the reduced form.
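To make the simultaneity problem and its IV solution concrete, here is a minimal simulation sketch of the system (1)–(2) in Python. It is not part of the original article; the parameter values, sample size, and distributions are invented for illustration. OLS applied to (1) is biased upward because u feeds back into Y through (2), while using investment as an instrument recovers β and agrees with what the reduced form (4) implies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                      # large sample, to show the probability limits
alpha, beta = 10.0, 0.6          # illustrative parameter values, not estimates from data

# Equations (1)-(2): C = alpha + beta*Y + u and Y = C + I, with I external.
I = rng.uniform(50, 150, n)      # "animal spirits" investment, orthogonal to u
u = rng.normal(0, 10, n)

# Solve the system to generate the data, i.e., the reduced form (3)-(4).
Y = (alpha + I + u) / (1 - beta)
C = alpha + beta * Y + u

cov = lambda a, b: np.cov(a, b)[0, 1]

beta_ols = cov(C, Y) / np.var(Y)     # OLS on (1): biased upward because cov(u, Y) > 0
beta_iv  = cov(C, I) / cov(Y, I)     # IV using investment as the instrument
slope_rf = cov(Y, I) / np.var(I)     # slope of the reduced form (4), which is 1/(1 - beta)

print(f"true beta           : {beta:.3f}")
print(f"OLS estimate        : {beta_ols:.3f}   (upward biased)")
print(f"IV estimate         : {beta_iv:.3f}")
print(f"beta implied by (4) : {1 - 1/slope_rf:.3f}")   # consistent with the IV estimate
```

The sketch is only meant to show that, in this fully specified Cowles-type system, the instrument does a job that the model itself defines.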
In the ­second stage, consumption is special “infrastructure development area,” regressed on the predicted value of income or perhaps an earthquake that conveniently from (4). In this simple case, the IV estimate destroyed a selection of railway stations, or of is identical to the estimate from the even the existence of river confluence near β reduced form. This simple model may not the city, since rivers were an early source of be a very good model—but it is a model, if power, and railways served the power-based only a primitive one. industries. I am making fun, but not much. I now leap forward sixty years and consider And these instruments all have the real merit an apparently similar set up, again using an that there is some mechanism linking them absurdly simple specification. The World to whether or not the town has a railway Bank (let us imagine) is interested in whether station, something that is not automatically to advise the government of China to build guaranteed by the instrument being corre- more railway stations as part of its poverty lated with R and uncorrelated with v (see, for reduction strategy. The Bank economists example, Peter C. Reiss and Frank A. Wolak write down an econometric model in which 2007, pp. 4296–98). the poverty head count ratio in city c is taken My main argument is that the two econo- to be a linear function of an indicator R of metric structures, in spite of their resem- whether or not the city has a railway station, blance and the fact that IV techniques can be used for both, are in fact quite different. In (5) P R v , particular, the IV procedures that work for c = γ + θ c + c the effect of national income on consump- where (I hesitate to call it a parameter) tion are unlikely to give useful results for θ indicates the effect—presumably negative— the effect of railway stations on poverty. To of infrastructure (here a railway station) on explain the differences, I begin with the lan- poverty. While we cannot expect to get useful guage. In the original example, the reduced estimates of from OLS estimation of (5)— form is a fully specified system since it is θ railway stations may be built to serve more derived from a notionally complete model of prosperous cities, they are rarely built in des- the determination of income. Consumption erts where there are no people, or there may and income are treated symmetrically and be “third factors” that influence both—this is appear as such in the reduced form equa- seen as a “technical problem” for which there tions (3) and (4). In contemporary examples, is a wide range of econometric treatments such as the railways, there is no complete including, of course, instrumental variables. theoretical system and there is no symme- We no longer have the reduced form of try. Instead, we have a “main” equation (5), the previous model to guide us but, if we which used to be the “structural” equation can find an instrument Z that is correlated (1). We also have a “first-stage” equation, with whether a town has a railway station which is the regression of railway stations on but uncorrelated with v, we can do the same the instrument. The now rarely considered calculations and obtain a consistent estimate. regression of the variable of interest on the For the record, I write this equation instrument, here of poverty on earthquakes or on river confluences, is nowadays referred (6) R Z . 
to as the reduced form, although it was origi- c = ϕ + φ c + ηc nally one equation of a multiple equation Good candidates for Z might be indicators reduced form—equation (6) is also part of of whether the city has been designated by the reduced form—within which it had no the Government of China as belonging to a special significance. These language shifts Deaton: Instruments, Randomization, and Learning about Development 429 sometimes cause confusion but they are not ingenuity is often exercised. Such ingenuity the most important differences between the is often needed because it is difficult simul- two systems. taneously to satisfy both of the standard cri- The crucial difference is that the relation- teria required for an instrument, that it be ship between railways and poverty is not a correlated with Rc and uncorrelated with vc. model at all, unlike the consumption model However, if heterogeneity is indeed present, that embodied a(n admittedly crude) theory even satisfying the standard criteria is not of income determination. While it is clearly sufficient to prevent the probability limit of possible that the construction of a railway sta- the IV estimator depending on the choice tion will reduce poverty, there are many pos- of instrument (Heckman 1997). Without sible mechanisms, some of which will work explicit prior consideration of the effect in one context and not in another. In conse- of the instrument choice on the parameter quence, is unlikely to be constant over dif- being estimated, such a procedure is effec- θ ferent cities, nor can its variation be usefully tively the opposite of standard statistical thought of as random variation that is uncor- practice in which a parameter of interest is related with anything else of interest. Instead, defined first, followed by an estimator that it is precisely the variation in that encapsu- delivers that parameter. Instead, we have a θ lates the poverty reduction mechanisms that procedure in which the choice of the instru- ought to be the main objects of our enquiry. ment, which is guided by criteria designed Instead, the equation of interest—the so- for a situation in which there is no hetero- called “main equation” (5)—is thought of as geneity, is implicitly allowed to determine a representation of something more akin to the parameter of interest. This goes beyond an experiment or a biomedical trial in which the old story of looking for an object where some cities get “treated” with a station and the light is strong enough to see; rather, we some do not. The role of econometric analy- have at least some control over the light but sis is not, as in the Cowles example, to esti- choose to let it fall where it may and then mate and investigate a casual model, but “to proclaim that whatever it illuminates is what create an analogy, perhaps forced, between we were looking for all along. an observational study and an experiment” Recent econometric analysis has given us a (David A. Freedman 2006, p. 691). more precise characterization of what we can One immediate task is to recognize and expect from such a method. In the railway somehow deal with the variation in , which example, where the instrument is the designa- θ is typically referred to as the “heterogene- tion of a city as belonging to the “special infra- ity problem” in the literature. 
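The consequences of this kind of heterogeneity can be made concrete with a small simulation (my own illustration, with invented numbers). Two instruments that are both randomly assigned and both excluded from the poverty equation nevertheless estimate different averages of θ, because each one induces a different set of cities to build a station.

```python
import numpy as np

rng = np.random.default_rng(1)

def wald_iv(comply, n=200_000):
    """Simulate cities, assign a random binary instrument Z, and return the
    Wald/IV estimate of the effect of a railway station on poverty.
    `comply` decides which cities respond to the instrument.  Illustrative only."""
    x = rng.normal(size=n)                      # city characteristic
    theta = -2.0 + x                            # heterogeneous station effect on poverty
    Z = rng.integers(0, 2, n)                   # randomized "designation" instrument
    R = Z * comply(x)                           # station built only if designated and complying
    P = 10.0 + theta * R + rng.normal(size=n)   # poverty outcome; exclusion holds
    return (P[Z == 1].mean() - P[Z == 0].mean()) / (R[Z == 1].mean() - R[Z == 0].mean())

est_1 = wald_iv(lambda x: (x > 0).astype(int))    # an instrument that moves high-x cities
est_2 = wald_iv(lambda x: (x <= 0).astype(int))   # an instrument that moves low-x cities

print(f"average effect over all cities : {-2.0:.2f}")
print(f"IV estimate, instrument 1      : {est_1:.2f}")   # roughly E[theta | x > 0], about -1.2
print(f"IV estimate, instrument 2      : {est_2:.2f}")   # roughly E[theta | x <= 0], about -2.8
```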
The obvious structure zone,” the probability limit of the IV way is to define a parameter of interest in a estimator is the average of poverty reduction way that corresponds to something we want effects over those cities who were induced to to know for policy evaluation—perhaps the construct a railway station by being so des- average effect on poverty over some group ignated. This average is known as the “local of cities—and then devise an appropriate average treatment effect” (LATE) and its estimation strategy. However, this step is recovery by IV estimation requires a number often skipped in practice, perhaps because of nontrivial conditions, including, for exam- of a mistaken belief that the “main equa- ple, that no cities who would have constructed tion” (5) is a structural equation in which a railway station are perverse enough to be is a constant, so that the analysis can go actually deterred from doing so by the posi- θ immediately to the choice of instrument Z, tive designation (see Guido W. Imbens and over which a great deal of imagination and Joshua D. Angrist 1994, who established the 430 Journal of Economic Literature, Vol. XLVIII (June 2010)

LATE theorem). The LATE may or may not There is a related issue that bedevils a be a parameter of interest to the World Bank good deal of contemporary applied work, or the Chinese government and, in general, which is the understanding of exogeneity, a there is no reason to suppose that it will be. word that I have so far avoided. Suppose, for For example, the parameter estimated will the moment, that the effect of railway sta- typically not be the average poverty reduction tions on poverty is the same in all cities and effect over the designated cities, nor will it be we are looking for an instrument, which is the average effect over all cities. required to be exogenous in order to con- I find it hard to make any sense of the sistently estimate . According to Merriam- θ LATE. We are unlikely to learn much about Webster’s dictionary, “exogenous” means the processes at work if we refuse to say any- “caused by factors or an agent from outside thing about what determines ; heterogene- the organism or system,” and this common θ ity is not a technical problem calling for an usage is often employed in applied work. econometric solution but a reflection of the However, the consistency of IV estimation fact that we have not started on our proper requires that the instrument be orthogonal business, which is trying to understand what to the error term v in the equation of inter- is going on. Of course, if we are as skeptical est, which is not implied by the Merriam- of the ability of economic theory to deliver Webster definition (see Edward E. Leamer useful models as are many applied econo- 1985, p. 260). Jeffrey M. Wooldridge (2002, mists today, the ability to avoid modeling can p. 50) warns his readers that “you should not be seen as an advantage, though it should not rely too much on the meaning of ‘endoge- be a surprise when such an approach delivers nous’ from other branches of economics” and answers that are hard to interpret. Note that goes on to note that “the usage in economet- my complaint is not with the “local” nature of rics, while related to traditional definitions, is the LATE—that property is shared by many used broadly to describe any situation where estimation strategies and I will discuss later an explanatory variable is correlated with how we might overcome it. The issue here the disturbance.” Heckman (2000) suggests is rather the “average” and the lack of an ex using the term “external” (which he traces ante characterization of the set over which back to Wright and Frisch in the 1930s) for the averaging is done. Angrist and Jörn- the Merriam-Webster definition, for vari- Steffen Pischke (2010) have recently claimed ables whose values are not set or caused that the explosion of instrumental variables by the variables in the model and keeping methods, including LATE estimation, has “exogenous” for the orthogonality condition led to greater “credibility” in applied econo- that is required for consistent estimation in metrics. I am not entirely certain what cred- this instrumental variable context. The terms ibility means, but it is surely undermined if are hardly standard, but I adopt them here the parameter being estimated is not what because I need to make the distinction. The we want to know. 
While in many cases what main issue, however, is not the terminol- is estimated may be close to, or may contain ogy but that the two concepts be kept dis- information about, the parameter of interest, tinct so that we can see when the argument that this is actually so requires demonstration being offered is a justification for externality and is not true in general (see Heckman and when what is required is a justification for Sergio Urzua 2009, who analyze cases where exogeneity. An instrument that is external, the LATE is an uninteresting and potentially but not exogeneous, will not yield consistent misleading assemblage of parts of the under- estimates of the parameter of interest, even lying structure). when the parameter of interest is a constant. Deaton: Instruments, Randomization, and Learning about Development 431
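The distinction matters in practice, as the following sketch shows (illustrative only; the earthquake story and all numbers are invented). The instrument is external—a pure random draw, caused by nothing in the system—and strongly predicts railway construction, but it also affects poverty directly, so the exclusion restriction fails and the IV estimate of a constant θ is badly off.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
theta = -1.0                      # a constant effect of a station on poverty (illustrative)

quake = rng.integers(0, 2, n)     # external: random, caused by nothing in the model
v = rng.normal(size=n)

# Earthquakes destroy stations (relevance) ...
R = ((0.6 - 0.4 * quake + 0.3 * rng.normal(size=n)) > 0.5).astype(int)
# ... but they also raise poverty directly, so the exclusion restriction fails.
P = 10.0 + theta * R + 1.5 * quake + v

iv = (P[quake == 1].mean() - P[quake == 0].mean()) / \
     (R[quake == 1].mean() - R[quake == 0].mean())

print(f"true theta  : {theta:.2f}")
print(f"IV estimate : {iv:.2f}")   # about -4: external is not enough, exogeneity fails
```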

An alternative approach is to keep the data, empirical tests cannot settle the ques- Miriam-Webster (or “other branches of eco- tion. This does not prevent many attempts nomics”) definition for exogenous and to in the literature, often by misinterpreting require that, in addition to being exogenous, a satisfactory overidentification test as - evi an instrument satisfy the “exclusion restric- dence for valid identification. Such tests can tions” of being uncorrelated with the dis- tell us whether estimates change when we turbance. I have no objection to this usage, select different subsets from a set of possible though the need to defend these additional instruments. While the test is clearly useful restrictions is not always appreciated in prac- and informative, acceptance is consistent tice. Yet exogeneity in this sense has no con- with all of the instruments being invalid, sequences for the consistency of econometric while failure is consistent with a subset being estimators and so is effectively meaningless. correct. Passing an overidentification test Failure to separate externality and exoge- does not validate instrumentation. neity—or to build a case for the validity of In my running example, earthquakes and the exclusion restrictions—has caused, and rivers are external to the system and are nei- continues to cause, endless confusion in the ther caused by poverty nor by the construc- applied development (and other) literatures. tion of railway stations, and the designation Natural or geographic variables—distance as an infrastructure zone may also be deter- from the equator (as an instrument for per mined by factors independent of poverty or capita GDP in explaining religiosity, Rachel railways. But even earthquakes (or rivers) are M. McCleary and Robert J. Barro 2006), not exogenous if they have an effect on pov- rivers (as an instrument for the number of erty other than through their destruction (or school districts in explaining educational out- encouragement) of railway stations, as will comes, Caroline M. Hoxby 2000), land gradi- almost always be the case. The absence of ent (as an instrument for dam construction in simultaneity does not guarantee exogeneity— explaining poverty, Duflo and exogeneity requires the absence of simulta- 2007), or rainfall (as an instrument for eco- neity but is not implied by it. Even random nomic growth in explaining civil war, Edward numbers—the ultimate external variables— Miguel, Shanker Satyanath, and Ernest may be endogenous, at least in the presence Sergenti 2004 and the examples could be of heterogeneous effects if agents choose to multiplied ad infinitum)—are not affected accept or reject their assignment in a way that by the variables being explained, and are is correlated with the heterogeneity. Again, clearly external. So are historical variables— the example comes from Heckman’s (1997) the mortality of colonial settlers is not influ- discussion of Angrist’s (1990) famous use of enced by current institutional arrangements draft lottery numbers as an instrumental vari- in ex-colonial countries (, able in his analysis of the subsequent earn- Simon Johnson, and James A. Robinson ings of Vietnam veterans. 2001) nor does the country’s growth rate I can illustrate Heckman’s argument using today influence the identity of their past col- the Chinese railways example with the zone onizers (Barro 1998). Whether any of these designation as instrument. 
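Before the algebra that follows, Heckman's point can be previewed with a small simulation (my own sketch; the thresholds and distributions are invented). The designation D is a genuine coin flip with no direct effect on poverty, but cities act on their own gains: some build a station regardless, some build only if designated, and the rest never build. The IV estimate then tracks the average effect among the complying cities, and the compound error of equation (7) below is correlated with D even though D is randomized.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

theta = rng.normal(-2.0, 1.0, n)        # city-specific effect of a station on poverty
D = rng.integers(0, 2, n)               # designation: a genuine coin flip
v = rng.normal(size=n)

# Compliance depends on the cities' own gains (all thresholds invented for illustration):
always = theta < -3.5                             # build with or without designation
complier = (theta >= -3.5) & (theta < -2.5)       # build only if designated
R = (always | (complier & (D == 1))).astype(int)

P = 10.0 + theta * R + v                # poverty; note D has no direct effect on P

theta_bar = theta[R == 1].mean()        # mean effect among cities that get a station
w = v + (theta - theta_bar) * R         # the compound error of equation (7)

iv = (P[D == 1].mean() - P[D == 0].mean()) / (R[D == 1].mean() - R[D == 0].mean())

print(f"average effect, all cities      : {theta.mean():.2f}")            # about -2.0
print(f"average effect, complier cities : {theta[complier].mean():.2f}")  # about -2.9
print(f"IV estimate using D             : {iv:.2f}")                      # tracks the compliers
print(f"cov(D, w)                       : {np.cov(D, w)[0, 1]:.4f}")      # clearly nonzero
```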
Rewrite the equa- instruments is exogenous (or satisfies the tion of interest, (5), as exclusion restrictions) depends on the speci- fication of the equation of interest, and is not (7) P _​R w ­guaranteed by its externality. And because c = γ + ​θ c + c exogeneity is an identifying assumption that must be made prior to analysis of the _​R {v ( ​_​) R } , = γ + ​θ c + c + θ − θ c 432 Journal of Economic Literature, Vol. XLVIII (June 2010) where w is defined by the term in curly orthogonal to , ruled out here by the ran- c νc brackets, and ​_​ is the mean of over the domness of the instrument. θ θ cities that get the station so that the com- The general lesson is once again the ulti- pound error term w has mean zero. Suppose mate futility of trying to avoid thinking about the designation as an infrastructure zone is how and why things work—if we do not do Dc , which takes values 1 or 0, and that the so, we are left with undifferentiated hetero- Chinese bureaucracy, persuaded by young geneity that is likely to prevent consistent development economists, decides to ran- estimation of any parameter of interest. One domize and designates cities by flipping a appropriate response is to specify exactly yuan. For consistent estimation of _​ ​, we want how cities respond to their designation, θ the covariance of the instrument with the an approach that leads to Heckman’s local error to be zero. The covariance is instrumental variable methods (Heckman and Edward Vytlacil 1999, 2007; Heckman, Urzua, and Vytlacil 2006). In a similar vein, (8) E(D w ) E[( ​_​) RD] c c = θ − θ David Card (1999) reviews estimates of the rate of return to schooling and explores how E[( ​_​) D 1, R 1] the choice of instruments leads to estimates

= θ − θ | = = that are averages over different subgroups P(D 1, R 1), of the population so that, by thinking about × = = the implicit selection, evidence from differ- ent studies can be usefully summarized and which will be zero if either (a) the average compared. Similar questions are pursued in effect of building a railway station on poverty Gerard van den Berg (2008). among the cities induced to build one by the designation is the same as the average effect 3. Instruments of Development among those who would have built one any- way, or (b) no city not designated builds a The question of whether aid has helped railway station. If (b) is not guaranteed by economies grow faster is typically asked fiat, we cannot suppose that it will otherwise within the framework of standard growth hold, and we might reasonably hope that regressions. These regressions use data for among the cities who build railway stations, many countries over a period of years, usually those induced to do so by the designation from the Penn World Table, the current ver- are those where there is the largest effect sion of which provides data on real per cap- on poverty, which violates (a). In the exam- ita GDP and its components in purchasing ple of the Vietnam veterans, the instrument power dollars for more than 150 countries as (the draft lottery number) fails to be exog- far back as 1950. The model to be estimated enous because the error term in the earnings has the rate of growth of per capita GDP as equation depends on each individual’s rate the dependent variable, while the explana- of return to schooling, and whether or not tory variables include the lagged value of each potential draftee accepted their assign- GDP per capita, the share of investment in ment—their veteran’s status—depends on GDP, and measures of the educational level that rate of return. This failure of exogene- of the population (see, for example, Barro ity is referred to by Richard Blundell and and Xavier Sala-i-Martin 1995, chapter 12, Monica Costa Dias (2009) as selection on for an overview). Other variables are often idiosyncratic gain and it adds to any bias added, and my main concern here is with caused by any failure of the instrument to be one of these, external assistance (aid) as a Deaton: Instruments, Randomization, and Learning about Development 433 fraction of GDP. A typical specification can of aid from the effectiveness of investment. be written Even so, it presumably matters what kind of ­investment is promoted by aid, and aid for (9) ln Yct 1 0 1 ln Yct roads, for dams, for vaccination programs, Δ + = β + β or for humanitarian purposes after an earth- I ​ ct ​ H Z quake are likely to have different effects on + β2 _ + β3 ct + β4 ct Yct subsequent growth. More broadly, one of the A u , main issues of contention in the debate is what + θ ct + ct aid actually does. Just to list a few of the pos- where Y is per capita GDP, I is investment, sibilities, does aid increase investment, does H is a measure of or educa- aid crowd out domestic investment, is aid sto- tion, and A is the variable of interest, aid as len, does aid create rent-seeking, or does aid a share of GDP. Z stands for whatever other undermine the institutions that are required variables are included. The index c is for for growth? Once all of these possibilities are country and t for time. 
Growth is rarely mea- admitted, it is clear that the analysis of (9) is sured on a year to year basis—the data in the not a Cowles model at all, but is seen as analo- Penn World Table are not suitable for annual gous to a biomedical experiment in which analysis—so that growth may be measured different countries are “dosed” with different over ten, twenty, or forty year intervals. With amounts of aid, and we are trying to measure around forty years of data, there are four, the average response. As in the Chinese rail- two, or one observation for each country. ways case, a regression such as (9) will not give An immediate question is whether the us what we want because the doses of aid are growth equation (9) is a model-based Cowles- not randomly administered to different coun- type equation, as in my national income tries, so our first task is to find an instrumental example, or whether it is more akin to the variable that will generate quasi-randomness. atheoretical analysis in my invented Chinese The most obvious problem with a regres- railway example. There are elements of both sion of aid on growth is the simultaneous here. If we ignore the Z and A variables in feedback from growth to aid that is gener- (9), the model can be thought of as a Solow ated by humanitarian responses to economic growth model, extended to add human capital collapse or to natural or man-made disas- to physical capital (see again Barro and Sala- ters that engender economic collapse. More i-Martin, who derive their empirical specifi- generally, aid flows from rich countries to cations from the theory, and also N. Gregory poor countries, and poor countries, almost Mankiw, David Romer, and David N. Weil by definition, are those with poor records of 1992, who extended the Solow model to economic growth. This feedback, from low include education). However, the addition of growth to high aid, will obscure, nullify, or the other variables, including aid, is typically reverse any positive effects of aid. Most of the less well justified. In some cases, for example literature attempts to eliminate this feedback under the assumption that all aid is invested, by using one or more ­instrumental ­variables it is possible to calculate what effect we might and, although they would not express it in expect aid to have (see Raghuram G. Rajan these terms, the aim of the instrumentation and Arvind Subramanian 2008). If we follow is to a restore a situation in which the pure this route, (9) would not be useful—because effect of aid on growth can be observed as aid is already included—and we should if in a randomized situation. How close we instead investigate whether aid is indeed get to this ideal depends, of course, on the invested, and then infer the ­effectiveness choice of instrument. 434 Journal of Economic Literature, Vol. XLVIII (June 2010)
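The mechanics of this instrumentation strategy, and the reverse-causality problem it is meant to solve, can be illustrated with synthetic data. The sketch below is my own; the data-generating process, the parameter values, and the use of log population are invented for illustration, and the sketch builds in by construction the exclusion restriction—that population affects growth only through aid—which is exactly the assumption questioned below.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000   # "countries" -- absurdly many, only to show the probability limits

logpop = rng.normal(16.0, 1.5, n)            # log population, the Boone-type instrument
shock = rng.normal(0.0, 2.0, n)              # growth shock that also triggers compensatory aid
aid = 20.0 - 1.0 * logpop - 1.5 * shock + rng.normal(size=n)   # aid/GDP, invented DGP
growth = 2.0 + 0.1 * aid + shock             # true effect of aid on growth set to +0.1

cov = lambda a, b: np.cov(a, b)[0, 1]
ols = cov(growth, aid) / np.var(aid)
iv = cov(growth, logpop) / cov(aid, logpop)

print("true effect of aid : 0.10")
print(f"OLS estimate       : {ols:.2f}   (reverse causality pushes it negative)")
print(f"IV with log pop    : {iv:.2f}   (recovers 0.10 only because, by construction,")
print( "                             population affects growth through aid alone)")
```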

Although there is some variation across of GDP comes with a reduction in the growth studies, there is a standard set of ­instruments, rate of one tenth of a percentage point a year. originally proposed by Peter Boone (1996), Easterly (2006) provides many other (some- which includes the log of population size times spectacular) examples of negative and various country dummies, for example, associations between aid and growth. When a dummy for Egypt or for francophone West instrumental variables are used to eliminate Africa. One or both of these instruments are the reverse causality, Rajan and Subramanian used in almost all the papers in a large sub- find a weak or zero effect of aid and contrast sequent literature, including Craig Burnside that finding with the robust positive effects and David Dollar (2000), Henrik Hansen and of investment on growth in specifications like Finn Tarp (2000, 2001), Carl-Johan Dalgaard (9). I should note that, although Rajan and and Hansen (2001), Patrick Guillaumont and Subramanian’s study is an excellent one, it is Lisa Chauvet (2001), Robert Lensink and certainly not without its problems and, as the Howard White (2001), Easterly, Ross Levine, authors note, there are many difficult econo- and David Roodman (2004), Dalgaard, metric problems over and above the choice Hansen, and Tarp (2004), Michael Clemens, of instruments, including how to estimate Steven Radelet, and Rikhil Bhavnani (2004), dynamic models with country fixed effects Rajan and Subramanian (2008), and Roodman on limited data, the choice of countries and (2007). The rationale for population size is sample period, the type of aid that needs to that larger countries get less aid per capita be considered, and so on. Indeed, it is those because the aid agencies allocate aid on a other issues that are the focus of most of the country basis, with less than full allowance literature cited above. The substance of this for population size. The rationale for what I debate is far from over. shall refer to as the “Egypt instrument” is that My main concern here is with the use of Egypt gets a great deal of American aid as part the instruments, what they tell us, and what of the Camp David accords in which it agreed they might tell us. The first point is that to a partial rapprochement with Israel. The neither the “Egypt” (or colonial heritage) same argument applies to the francophone nor the population instrument are plausibly countries, which receive additional aid from exogenous; both are external—Camp David France because of their French colonial leg- is not part of the model, nor was it caused acy. By comparing these countries with coun- by Egypt’s economic growth, and similarly tries not so favored or by comparing populous for population size—but exogeneity would with less populous countries, we can observe require that neither “Egypt” nor popula- a kind of variation in the share of aid in GDP tion size have any influence on economic that is unaffected by the negative feedback growth except through the effects on aid from poor growth to compensatory aid. In flows, which makes no sense at all. We also effect, we are using the variation across popu- need to recognize the heterogeneity in the lations of different sizes as a natural experi- aid responses and try to think about how the ment to reveal the effects of aid. 
different instruments are implicitly choos- If we examine the effects of aid on growth ing different averages, involving different without any allowance for reverse causality, weightings or subgroups of countries. Or we for example by estimating equation (9) by could stop right here, conclude that there OLS, the estimated effect is typically nega- are no valid instruments, and that the aid tive. For example, Rajan and Subramanian to growth question is not answerable in this (2008), in one of the most careful recent stud- way. I shall argue otherwise, but I should also ies, find that an increase in aid by one percent note that similar challenges over the ­validity Deaton: Instruments, Randomization, and Learning about Development 435 of instruments have become routine in on ­instrumental variables ring particularly applied econometrics, leading to widespread hollow when it is economists themselves who skepticism by some, while others press on are so often misled. My argument is that, for undaunted in an ever more creative search both economists and noneconomists, the for exogeneity. direct consideration of the reduced form is Yet consideration of the instruments is not likely to generate productive lines of enquiry. without value, especially if we move away The case of the “Egypt” instrument is from instrumental variable estimation, with somewhat different. Once again the reduced the use of instruments seen as technical, not form is useful (Egypt doesn’t grow particu- substantive, and think about the reduced larly fast in spite of all the aid it gets in con- form which contains substantive informa- sequence of Camp David), though mostly tion about the relationship between growth for making it immediately clear that the and the instruments. For the case of popu- comparison of Egypt versus non-Egypt, lation size, we find that, conditional on the or francophone versus nonfrancophone, is other variables, population size is unrelated not a useful way of assessing the effective- to growth, which is one of the reasons that ness of aid on growth. There is no reason to the IV estimates of the effects of aid are suppose that “being Egypt” has no effect on small or zero. This (partial) regression coef- its growth other than through aid from the ficient is a much simpler object than is the . Yet almost every paper in this instrumental variable estimate; under stan- literature unquestioningly uses the Egypt dard assumptions, it tells us how much faster dummy as an instrument. Similar instru- large countries grow than small countries, ments based on colonial heritage face exactly once the standard effects of the augmented the same problem; colonial heritage cer- Solow model have been taken into account. tainly affects aid, and colonial heritage is not Does this tell us anything about the effec- influenced by current growth performance, tiveness of aid? Not directly, though it is but different colonists behaved differently surely useful to know that, while larger coun- and left different legacies of institutions and tries receive less per capita aid in relation infrastructure, all of which have their own to per capita income, they grow just as fast persistent effect on growth today. as countries that have received more, once The use of population size, an Egypt we take into account the amount that they dummy, or colonial heritage variables as invest, their levels of education, and their instruments in the analysis of aid effective- starting level of GDP. But we would hardly ness cannot be justified. 
These instruments conclude from this fact alone that aid does are external, not exogenous, or if we use not increase growth. Perhaps aid works less the Webster definition of exogeneity, they well in small countries, or perhaps there is an clearly fail the exclusion restrictions. Yet they offsetting positive effect of population size continue in almost universal use in the aid- on economic growth. Both are possible and effectiveness literature, and are endorsed for both are worth further investigation. More this purpose by the leading exponents of IV generally, such arguments are susceptible to methods (Angrist and Pischke 2010). fruitful discussions, not only among econo- I conclude this section with an example mists, but also with other social scientists that helps bridge the gap between analy- and historians who study these questions, ses of the macro and analyses of the micro something that is typically difficult with effects of aid. Many microeconomists agree instrumental variable methods. Economists’ that instrumentation in cross-country regres- claims to methodological superiority based sions is unlikely to be useful, while claiming 436 Journal of Economic Literature, Vol. XLVIII (June 2010) that microeconomic analysis is capable of rule, are strongly influenced by it and exhibit doing better. We may not be able to answer the same saw tooth pattern. They then plot ill-posed questions about the macroeco- test scores against enrollment, and show nomic effects of foreign assistance, but we that they display the opposite pattern, ris- can surely do better on specific projects and ing at each of the discontinuities where class programs. Abhijit Vinayak Banerjee and size abruptly falls. This is a natural experi- Ruimin He (2008) have provided a list of the ment, with Maimonides’ rule inducing quasi- sort of studies that they like and that they experimental variation, and generating a pre- believe should be replicated more widely. dicted class size for each level of enrollment One of these, also endorsed by Duflo (2004), which serves as an instrumental variable in a is a famous paper by Angrist and Victor Lavy regression of test scores on class size. These (1999) on whether schoolchildren do bet- IV estimates, unlike the OLS estimates, show ter in smaller classes, a position frequently that children in smaller classes do better. endorsed by parents and by teacher’s unions Angrist and Lavy’s paper, the creativity of but not always supported by empirical work. its method, and the clarity of its result has set The question is an important one for devel- the standard for micro empirical work since opment assistance because smaller class sizes it was published, and it has had a far-reach- cost more and are a potential use for foreign ing effect on subsequent empirical work in aid. Angrist and Lavy’s paper uses a natural labor and development economics. Yet there experiment, not a real one, and relies on IV is a problem, which has become apparent estimation, so it provides a bridge between over time. Note first the heterogeneity; it is the relatively weak natural experiments in improbable that the effect of lower class size this section, and the actual randomized con- is the same for all children so that, under trolled trials in the next. the assumptions of the LATE theorem, the Angrist and Lavy’s study is about the allo- IV estimate recovers a weighted average cation of children enrolled in a school into of the effects for those children who are classes. 
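For readers who want to see the mechanics of the design discussed next, here is a small Python sketch. It is not Angrist and Lavy's data or code; enrollments, class sizes, and test scores are invented. It constructs the rule-predicted class size for a cap of 40 and uses it, after partialling out a smooth enrollment control, as an instrument for actual class size.

```python
import numpy as np

def maimonides(enrollment, cap=40):
    """Predicted class size when classes split as soon as enrollment exceeds the cap."""
    return enrollment / np.ceil(enrollment / cap)

for e in (40, 41, 80, 81, 120, 121):
    print(f"enrollment {e:>3d} -> predicted class size {maimonides(float(e)):5.1f}")

# A stylized version of the instrumental-variable step (invented data).
rng = np.random.default_rng(5)
enroll = rng.integers(20, 160, 20_000).astype(float)
pred = maimonides(enroll)                          # instrument: rule-predicted class size
actual = pred + rng.normal(0, 2, enroll.size)      # schools roughly follow the rule
score = 75.0 - 0.3 * actual + 0.02 * enroll + rng.normal(0, 5, enroll.size)

def residualize(y, x):
    """Partial a linear enrollment control out of y (a Frisch-Waugh step)."""
    X = np.column_stack([np.ones_like(x), x])
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

s, a, p = (residualize(v, enroll) for v in (score, actual, pred))
print(f"IV effect of class size on scores: {(s @ p) / (a @ p):.2f}   (true value set to -0.30)")
```

The enrollment control matters: the discontinuities in the rule, not the overall association between enrollment and class size, are what give the instrument whatever credibility it has.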
Many countries set their class sizes shifted by Maimonides’ rule from a larger to to conform to some version of Maimonides’ a smaller class. Those children might not be rule, which sets a maximum class size, the same as other children, which makes it beyond which additional teachers must be hard to know how useful the numbers might found. In Israel, the maximum class size is be in other contexts, for example when all set at 40. If there are less than 40 children children are put in smaller class sizes. The enrolled, they will all be in the same class. If underlying reasons for this heterogeneity there are 41, there will be two classes, one are not addressed in this quasi-experimental of 20, and one of 21. If there are 81 or more approach. To be sure of what is happening children, the first two classes will be full, and here, we need to know more about how dif- more must be set up. Angrist and Lavy’s fig- ferent children finish up in different classes, ure 1 plots actual class size and Maimonides’ which raises the possibility that the varia- rule class size against the number of children tion across the discontinuities may not be enrolled; this graph starts off running along orthogonal to other factors that affect test the 45-degree line, and then falls discontinu- scores. ously to 20 when enrollment is 40, increas- A recent paper by Miguel Urquiola and ing with slope of 0.5 to 80, falling to 27.7 (80 Eric Verhoogen (2009) explores how it is divided by 3) at 80, rising again with a slope that children are allocated to different class of 0.25, and so on. They show that actual class sizes in a related, but different, situation in sizes, while not exactly ­conforming to the where a version of Maimonides’ rule is Deaton: Instruments, Randomization, and Learning about Development 437 in place. Urquiola and Verhoogen note that to authorities, finding a different school, or parents care a great deal about whether their even moving—and these actions will gen- children are in the 40 child class or the 20 erally differ by rich and poor children, by child class, and for the private schools they children with more or less educated parents, study, they construct a model in which there or by any factor that affects the cost to the is sorting across the boundary, so that the child of being in a larger class. The behav- children in the smaller classes have richer, ioral response to the quasi-randomization more educated parents than the children (or indeed randomization) means that the in the larger classes. Their data match such groups being compared are not identical to a model, so that at least some of the differ- start with (see also Justin McCrary 2008 and ences in test scores across class size come David S. Lee and Thomas Lemieux 2009 for from differences in the children that would further discussion and for methods of detec- be present whatever the class size. This tion when this is happening). paper is an elegant example of why it is so In preparation for the next section, I note dangerous to make inferences from natu- that the problem here is not the fact that ral experiments without understanding the we have a quasi-experiment rather than a mechanisms at work. It also strongly suggests real experiment, so that there was no actual that the question of the effects of class size randomization. 
If children had been ran- on student performance is not well-defined domized into class size, as in the Belgrade without a description of the environment in example, the problems would have been the which class size is being changed (see also same unless there had been some mecha- Christopher A. Sims 2010). nism for forcing the children (and their par- Another good example comes to me in pri- ents) to accept the assignment. vate correspondence from Branko Milanovic, who was a child in Belgrade in a school 4. Randomization in the Tropics only half of whose teachers could teach in English, and who was randomly assigned to a Skepticism about econometrics, doubts class taught in Russian. He remembers losing about the usefulness of structural models friends whose parents (correctly) perceived in economics, and the endless wrangling the superior value of an English education, over identification and instrumental vari- and were insufficiently dedicated socialists to ables has led to a search for alternative ways accept their assignment. The two language of learning about development. There has groups of children remaining in the school, also been frustration with the World Bank’s although “randomly” assigned, are far from apparent failure to learn from its own proj- identical, and IV estimates using the ran- ects and its inability to provide a convinc- domization could overstate the superiority of ing ­argument that its past activities have the English medium education. enhanced economic growth and poverty More generally, these are examples where reduction. Past development practice is an instrument induces actual or quasi-ran- seen as a succession of fads, with one sup- dom assignment, here of children into dif- posed magic bullet replacing another—from ferent classes, but where the assignment can planning to infrastructure to human capi- be undone, at least partially, by the actions of tal to structural adjustment to health and the subjects. If children—or their parents— social capital to the environment and back care about whether they are in small or large to infrastructure—a process that seems not classes, or in Russian or English classes, to be guided by progressive learning. For some will take evasive action—by protesting many economists, and ­particularly for the 438 Journal of Economic Literature, Vol. XLVIII (June 2010) group at the Poverty Action Lab at MIT, the rizing, evidence of what works, when, why solution has been to move toward random- and for how much,” (International Initiative ized controlled trials of projects, programs, for Impact Evaluation 2008), although not and policies. RCTs are seen as generating exclusively by randomized controlled trials. gold standard evidence that is superior to The Poverty Action Lab lists dozens of com- econometric evidence and that is immune pleted and ongoing projects in a large num- to the methodological criticisms that are ber of countries, many of which are project typically directed at econometric analyses. evaluations. Many development economists Another aim of the program is to persuade would join many physicians in subscribing the World Bank to replace its current evalu- to the jingoist view proclaimed by the edi- ation methods with RCTs; Duflo (2004) tors of the British Medical Journal (quoted argues that randomized trials of projects by Worrall 2007, p. 
996) which noted that would generate knowledge that could be “Britain has given the world Shakespeare, used elsewhere, an international public Newtonian physics, the theory of evolution, good. Banerjee (2007b, chapter 1) accuses parliamentary democracy—and the random- the Bank of “lazy thinking,” of a “resistance ized trial.” to knowledge,” and notes that its recom- 4.1 The Ideal RCT mendations for poverty reduction and empowerment show a striking “lack of dis- Under ideal conditions and when cor- tinction made between strategies founded rectly executed, an RCT can estimate certain on the hard evidence provided by random- quantities of interest with minimal assump- ized trials or natural experiments and the tions, thus absolving RCTs of one complaint rest.” In all this there is a close parallel with against econometric methods—that they the evidence-based movement in medi- rest on often implausible economic models. cine that preceded it, and the successes of It is useful to lay out briefly the (standard) RCTs in medicine are frequently cited. Yet framework for these results, originally due the parallels are almost entirely rhetorical, to Jerzy Neyman in the 1920s, currently and there is little or no reference to the dis- often referred to as the Holland–Rubin senting literature, as surveyed for example framework or the Rubin causal model (see in John Worrall (2007) who documents the Freedman 2006 for a discussion of the his- rise and fall in medicine of the rhetoric used tory). According to this, each member of the by Banerjee. Nor is there any recognition of population under study, labeled i, has two the many problems of medical RCTs, some possible values associated with it, Y0 i and Y1 i , of which I shall discuss as I go. which are the outcomes that i would display The movement in favor of RCTs is currently if it did not get the treatment, T 0, and if i = very successful. The World Bank is now con- it did get the treatment, T 1. Since each i i = ducting substantial numbers of randomized is either in the treatment group or in the con- trials, and the methodology is sometimes trol group, we observe one of Y0 i and Y1 i but explicitly requested by governments, who not both. We would like to know something supply the World Bank with funds for this pur- about the distribution over i of the effects pose (see World Bank 2008 for details of the of the treatment, Y Y , in particular its i1 − i0 Spanish Trust Fund for Impact Evaluation). mean ​_Y ​ ​_Y ​. In a sense, the most surpris- 1 − 0 There is a new International Initiative for ing thing about this set-up is that we can say Impact Evaluation which “seeks to improve anything at all without further assumptions the lives of poor people in low- and middle- or without any modeling. But that is the income countries by providing, and summa- magic that is wrought by the randomization. Deaton: Instruments, Randomization, and Learning about Development 439
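A short simulation (my own illustration, with invented numbers) may help fix the point developed below: randomization delivers the mean of the treatment effects, but not the median of the effects, nor the fraction of the population harmed.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

Y0 = rng.normal(0.0, 1.0, n)                       # outcome without treatment
# Treatment effects: most units slightly hurt, a few helped a lot (invented numbers).
tau = np.where(rng.random(n) < 0.9, -0.2, 3.0)
Y1 = Y0 + tau
T = rng.integers(0, 2, n)                          # randomized assignment
Y = np.where(T == 1, Y1, Y0)                       # only one potential outcome is observed

print(f"difference in means   : {Y[T == 1].mean() - Y[T == 0].mean():.3f}")
print(f"true mean of tau      : {tau.mean():.3f}   (recovered by the RCT)")
print(f"difference in medians : {np.median(Y[T == 1]) - np.median(Y[T == 0]):.3f}")
print(f"true median of tau    : {np.median(tau):.3f}   (not the same thing)")
print(f"fraction harmed       : {(tau < 0).mean():.2f}   (invisible to the experimenter)")
```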

What we can observe in the data is the difference between the average outcome in the treatments and the average outcome in the controls, or E(Y_i | T_i = 1) − E(Y_i | T_i = 0). This difference can be broken up into two terms

(10)   E(Y_{i1} | T_i = 1) − E(Y_{i0} | T_i = 0)
          = [E(Y_{i1} | T_i = 1) − E(Y_{i0} | T_i = 1)]
          + [E(Y_{i0} | T_i = 1) − E(Y_{i0} | T_i = 0)].

Note that on the right hand side the second term in the first square bracket cancels out with the first term in the second square bracket. But the term in the second square bracket is zero by randomization; the nontreatment outcomes, like any other characteristic, are identical in expectation in the control and treatment groups. We can therefore write (10) as

(11)   E(Y_{i1} | T_i = 1) − E(Y_{i0} | T_i = 0) = [E(Y_{i1} | T_i = 1) − E(Y_{i0} | T_i = 1)],

so that the difference in the two observable outcomes is the difference between the average treated outcome and the average untreated outcome in the treatment group. The last term on the right hand side would be unobservable in the absence of randomization.

We are not quite done. What we would like is the average of the difference, rather than the difference of averages that is currently on the right-hand side of (11). But the expectation is a linear operator, so that the difference of the averages is identical to the average of the differences, so that we reach, finally,

(12)   E(Y_{i1} | T_i = 1) − E(Y_{i0} | T_i = 0) = E(Y_{i1} − Y_{i0} | T_i = 1).

The difference in means between the treatments and controls is an estimate of the average treatment effect among the treated, which, since the treatment and controls differ only by randomization, is an estimate of the average treatment effect for all. This standard but remarkable result depends both on randomization and on the linearity of expectations.

One immediate consequence of this derivation is a fact that is often quoted by critics of RCTs, but often ignored by practitioners, at least in economics: RCTs are informative about the mean of the treatment effects, Y_{i1} − Y_{i0}, but do not identify other features of the distribution. For example, the median of the difference is not the difference in medians, so an RCT is not, by itself, informative about the median treatment effect, something that could be of as much interest to policymakers as the mean treatment effect. It might also be useful to know the fraction of the population for which the treatment effect is positive, which once again is not identified from a trial. Put differently, the trial might reveal an average positive effect although nearly all of the population is hurt with a few receiving very large benefits, a situation that cannot be revealed by the RCT, although it might be disastrous if implemented. Indeed, Ravi Kanbur (2001) has argued that much of the disagreement about development policy is driven by differences of this kind.
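The limits of what the two observed marginal distributions can reveal are easy to see in a small simulation. The sketch below is illustrative only; the outcome model and all numbers are invented. Ninety percent of units are slightly harmed by treatment and ten percent gain a great deal, so the difference in means recovers the (positive) average effect, but neither the median treatment effect nor the fraction of the population that benefits can be read off the two observed distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical individual treatment effects: 90 percent of units are slightly
# hurt, 10 percent gain a great deal, so the mean effect is positive even
# though the median effect is negative and most units are worse off.
effects = np.where(rng.random(n) < 0.9, -1.0, 15.0)

y0 = rng.normal(50.0, 10.0, n)        # untreated potential outcomes
y1 = y0 + effects                     # treated potential outcomes

treated = rng.random(n) < 0.5         # random assignment
y_obs = np.where(treated, y1, y0)     # what the trial actually records

diff_means = y_obs[treated].mean() - y_obs[~treated].mean()
diff_medians = np.median(y_obs[treated]) - np.median(y_obs[~treated])

print(f"true mean effect            {effects.mean():6.2f}")      # about +0.6
print(f"difference in means         {diff_means:6.2f}")          # recovers it
print(f"true median effect          {np.median(effects):6.2f}")  # -1.0
print(f"difference in medians       {diff_medians:6.2f}")        # does not recover it
print(f"true fraction who benefit   {np.mean(effects > 0):6.2f}")  # not identified by the trial
```

Nothing computed from the two observed marginal distributions identifies the last two quantities; that is the point of the paragraph above.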
Given the minimal assumptions that go into an RCT, it is not surprising that it cannot tell us everything that we would like to know. Heckman and Smith (1995) discuss these issues at greater length and also note that, in some circumstances, more can be learned. Essentially, the RCT gives us two marginal distributions from which we would like to infer a joint distribution; this is impossible, but the marginal distributions limit the joint distribution in ways that can be useful. For example, Charles F. Manski (1996) notes that a planner who is maximizing the expected value of a social welfare function needs only the two marginal distributions to check the usefulness of the treatment. Beyond that, if the probability distribution of outcomes among the treated stochastically dominates the distribution among the controls, we know that appropriately defined classes of social welfare functions will show an improvement without having to know what the social welfare function is. Not all relevant cases are covered by these examples; even if a drug saves lives on average, we need to know whether it is uniformly beneficial or kills some and saves more. To answer such questions, we will have to make assumptions beyond those required for an RCT; as usual, some questions can be answered with fewer assumptions than others.

In practice, researchers who conduct randomized controlled trials often do present results on statistics other than the mean. For example, the results can be used to run a regression of the form

(13)   Y_i = β_0 + β_1 T_i + Σ_j θ_j X_{ij} + Σ_j ϕ_j X_{ij} × T_i + u_i,

where T is a binary variable that indicates treatment status, and the X’s are various characteristics measured at baseline that are included in the regression both on their own (main effects) and as interactions with treatment status (see Suresh de Mel, David McKenzie, and Christopher Woodruff 2008 for an example of a field experiment with micro-enterprises in Sri Lanka). The estimated treatment effect now varies across the population, so that it is possible, for example, to estimate whether the average treatment effect is positive or negative for various subgroups of interest. These estimates depend on more assumptions than the trial itself, in particular on the validity of running a regression like (13), on which I shall have more to say below. One immediate charge against such a procedure is data mining. A sufficiently determined examination of any trial will eventually reveal some subgroup for which the treatment yielded a significant effect of some sort, and there is no general way of adjusting standard errors to protect against the possibility. A classic example from medicine comes from the ISIS-2 trial of the use of aspirin after heart attacks (Richard Peto, Rory Collins, and Richard Gray 1995). A randomized trial established a beneficial effect with a significance level of better than 10⁻⁶, yet ex post analysis of the data showed that there was no significant effect for trial subjects whose astrological signs were Libra or Gemini. In drug trials, the FDA rules require that analytical plans be submitted prior to trials and drugs cannot be approved based on ex post data analysis. As noted by Robert J. Sampson (2008), one analysis of the recent Moving to Opportunity experiment has an appendix listing tests of many thousands of outcomes (Lisa Sanbonmatsu et al. 2006).

I am not arguing against posttrial subgroup analysis, only that, as is enshrined in the FDA rules, any special epistemic status (as in “gold standard,” “hard,” or “rigorous” evidence) possessed by RCTs does not extend to ex post subgroup analysis if only because there is no guarantee that a new RCT on post-experimentally defined subgroups will yield the same result. Such analyses do not share any special evidential status that might arguably be accorded to RCTs and must be assessed in exactly the same way as we would assess any nonexperimental or econometric study. These issues are wonderfully exposed by the subgroup analysis of drug effectiveness by Ralph I. Horwitz et al. (1996), criticized by Douglas G. Altman (1998), who refers to such studies as “a false trail,” by Stephen Senn and Frank Harrell (1997), who call them “wisdom after the event,” and by George Davey Smith and Matthias Egger (1998), who call them “incommunicable knowledge,” drawing the response by Horwitz et al. (1997) that their critics have reached “the tunnel at the end of the light.”

While it is clearly absurd to discard data because we do not know how to analyze it with sufficient purity, and while many important findings have come from posttrial analysis of experimental data, both in medicine and in economics, for example of the negative income experiments of the 1960s, the concern about data-mining remains real enough. In large-scale, expensive trials, a zero or very small result is unlikely to be welcomed, and there is likely to be considerable pressure to search for some subpopulation or some outcome that shows a more palatable result, if only to help justify the cost of the trial.

The mean treatment effect from an RCT may be of limited value for a physician or a policymaker contemplating specific patients or policies. A new drug might do better than a placebo in an RCT, yet a physician might be entirely correct in not prescribing it for a patient whose characteristics, according to the physician’s theory of the disease, might lead her to suppose that the drug would be harmful. Similarly, if we are convinced that dams in India do not reduce poverty on average, as in Duflo and Pande’s (2007) IV study, there is no implication about any specific dam, even one of the dams included in the study, yet it is always a specific dam that a policymaker has to approve. Their evidence certainly puts a higher burden of proof on those proposing a new dam, as would be the case for a physician prescribing in the face of an RCT, though the force of the evidence depends on the size of the mean effect and the extent of the heterogeneity in the responses. As was the case with the material discussed in sections 2 and 3, heterogeneity poses problems for the analysis of RCTs, just as it posed problems for nonexperimental methods that sought to approximate randomization. For this reason, in his Planning of Experiments, David R. Cox (1958, p. 15) begins his book with the assumption that the treatment effects are identical for all subjects. He notes that the RCT will still estimate the mean treatment effect with heterogeneity but argues that such estimates are “quite misleading,” citing the example of two internally homogeneous subgroups with distinct average treatment effects, so that the RCT delivers an estimate that applies to no one. Cox’s recommendation makes a good deal of sense when the experiment is being applied to the parameter of a well-specified model, but it could not be further away from most current practice in either medicine or economics.

One of the reasons why subgroup analysis is so hard to resist is that researchers, however much they may wish to escape the straitjacket of theory, inevitably have some mechanism in mind, and some of those mechanisms can be “tested” on the data from the trial. Such “testing,” of course, does not satisfy the strict evidential standards that the RCT has been set up to satisfy and, if the investigation is constrained to satisfy those standards, no ex post speculation is permitted. Without a prior theory and within its own evidentiary standards, an RCT targeted at “finding out what works” is not informative about mechanisms, if only because there are always multiple mechanisms at work. For example, when two independent but identical RCTs in two cities in India find that children’s scores improved less in Mumbai than in Vadodara, the authors state “this is likely related to the fact that over 80 percent of the children in Mumbai had already mastered the basic language skills the program was covering” (Duflo, Rachel Glennerster, and Michael Kremer 2008). It is not clear how “likely” is established here, and there is certainly no evidence that conforms to the “gold standard” that is seen as one of the central justifications for the RCTs. For the same reason, repeated successful replications of a “what works” experiment, i.e., one that is unrelated to some underlying or guiding mechanism, are both unlikely and unlikely to be persuasive. Learning about theory, or mechanisms, requires that the investigation be targeted toward that theory, toward why something works, not whether it works. Projects can rarely be replicated, though the mechanisms underlying success or failure will often be replicable and transportable. This means that, if the World Bank had indeed randomized all of its past projects, it is unlikely that the cumulated evidence would contain the key to economic development.

Cartwright (2007a) summarizes the benefits of RCTs relative to other forms of evidence. In the ideal case, “if the assumptions of the test are met, a positive result implies the appropriate causal conclusion,” that the intervention “worked” and caused a positive outcome. She adds “the benefit that the conclusions follow deductively in the ideal case comes with great cost: narrowness of scope” (p. 11).

4.2 Tropical RCTs in Practice

How well do actual RCTs approximate the ideal? Are the assumptions generally met in practice? Is the narrowness of scope a price that brings real benefits or is the superiority of RCTs largely rhetorical? RCTs allow the investigator to induce variation that might not arise nonexperimentally, and this variation can reveal responses that could never have been found otherwise. Are these responses the relevant ones? As always, there is no substitute for examining each study in detail, and there is certainly nothing in the RCT methodology itself that grants immunity from problems of implementation. Yet there are some general points that are worth discussion.

The first is the seemingly obvious practical matter of how to compute the results of a trial. In theory, this is straightforward—we simply compare the mean outcome in the experimental group with the mean outcome in the control group and the difference is the causal effect of the intervention. This simplicity, compared with the often baroque complexity of econometric estimators, is seen as one of the great advantages of RCTs, both in generating convincing results and in explaining those results to policymakers and the lay public. Yet any difference is not useful without a standard error, and the calculation of the standard error is rarely quite so straightforward. As Ronald A. Fisher (1935) emphasized from the very beginning, in his famous discussion of the tea lady, randomization plays two separate roles. The first is to guarantee that the probability law governing the selection of the control group is the same as the probability law governing the selection of the experimental group. The second is to provide a probability law that enables us to judge whether a difference between the two groups is significant. In his tea lady example, Fisher uses combinatoric analysis to calculate the exact probabilities of each possible outcome, but in practice this is rarely done. Duflo, Glennerster, and Kremer (2008, p. 3921) (DGK) explicitly recommend what seems to have become the standard method in the development literature, which is to run a restricted version of the regression (13), including only the constant and the treatment dummy,

(14)   Y_i = β_0 + β_1 T_i + u_i.

As is easily shown, the OLS estimate of β_1 is simply the difference between the mean of the Y_i in the experimental and control groups, which is exactly what we want. However, the standard error of β_1 from the OLS regression is not generally correct. One problem is that the variance among the experimentals may be different from the variance among the controls, and to assume that the experiment does not affect the variance is very much against the minimalist spirit of RCTs. If the regression (14) is run with the standard heteroskedasticity correction to the standard error, the result will be the same as the formula for the standard error of the difference between two means, but not otherwise except in the special case where there are equal numbers of experimentals and controls, in which case it turns out that the correction makes no difference and the OLS standard error is correct. It is not clear in the experimental development literature whether the correction is routinely done in practice, and the handbook review by DGK makes no mention of it, although it provides a thoroughly useful review of many other aspects of standard errors.

Even with the correction for unequal variances, we are not quite done. The general problem of testing the significance of the differences between the means of two normal populations with different variances is known as the Fisher–Behrens problem. The test statistic computed by dividing the difference in means by its estimated standard error does not have the t-distribution when the variances are different in treatments and controls, and the significance of the estimated difference in means is likely to be overstated if no correction is made. If there are equal numbers of treatments and controls, the statistic will be approximately distributed as Student’s t but with degrees of freedom that can be as little as half the nominal degrees of freedom when one of the two variances is zero. In general, there is also no reason to suppose that the heterogeneity in the treatment effects is normal, which will further complicate inference in small samples.
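The standard-error point can be made concrete with a short sketch. It is illustrative only, with made-up group sizes and variances: it computes the treatment–control difference, the classical OLS standard error from the regression (14), and the unequal-variance standard error that corresponds to the formula for the standard error of a difference between two means.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated trial with unequal group sizes and unequal outcome variances
n_t, n_c = 200, 800
y_t = rng.normal(1.0, 5.0, n_t)   # treatment outcomes, sd = 5
y_c = rng.normal(0.0, 1.0, n_c)   # control outcomes,   sd = 1

diff = y_t.mean() - y_c.mean()

# Standard error of a difference in two means with unequal variances
se_unequal = np.sqrt(y_t.var(ddof=1) / n_t + y_c.var(ddof=1) / n_c)

# Classical OLS standard error from regressing y on a constant and the
# treatment dummy (equation (14)), which pools the two variances
y = np.concatenate([y_t, y_c])
t = np.concatenate([np.ones(n_t), np.zeros(n_c)])
n = n_t + n_c
resid = y - np.where(t == 1, y_t.mean(), y_c.mean())
s2 = resid @ resid / (n - 2)
se_ols = np.sqrt(s2 * (1 / n_t + 1 / n_c))

print(f"difference in means:           {diff:.3f}")
print(f"unequal-variance (Welch) SE:   {se_unequal:.3f}")
print(f"classical OLS SE (pooled):     {se_ols:.3f}")
```

With equal numbers in the two groups the two standard errors coincide, as noted above; here, with unequal numbers and very different variances, the pooled OLS formula overstates the precision considerably.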

Another standard practice, recommended by DGK, and which is also common in medical RCTs according to Freedman (2008), is to run the regression (14) with additional controls taken from the baseline data or, equivalently, (13) with the X_i but without the interactions,

(15)   Y_i = β_0 + β_1 T_i + Σ_j θ_j X_{ij} + u_i.

The standard argument is that, if the randomization is done correctly, the X_i will be orthogonal to the treatment variable T_i so that their inclusion does not affect the estimate of β_1, which is the parameter of interest. However, by absorbing variance, as compared with (14), they will increase the precision of the estimate—this is not necessarily the case, but will often be true. DGK (p. 3924) give an example: “controlling for baseline test scores in evaluations of educational interventions greatly improves the precision of the estimates, which reduces the cost of these evaluations when a baseline test can be conducted.”

There are two problems with this procedure. The first, which is noted by DGK, is that, as with posttrial subgroup analysis, there is a risk of data mining—trying different control variables until the experiment “works”—unless the control variables are specified in advance. Again, it is hard to tell whether or how often this dictum is observed. The second problem is analyzed by Freedman (2008), who notes that (15) is not a standard regression because of the heterogeneity of the responses. Write α_i for the (hypothetical) treatment response of unit i, so that, in line with the discussion in the previous subsection, α_i = Y_{i1} − Y_{i0}, and we can write the identity

(16)   Y_i = Y_{i0} + α_i T_i = Ȳ_0 + α_i T_i + (Y_{i0} − Ȳ_0),

which looks like the regression (15) with the X’s and the error term capturing the variation in Y_{i0}. The only difference is that the coefficient on the treatment term has an i suffix because of the heterogeneity. If we define ᾱ = E(α_i | T_i = 1), the average treatment effect among the treated, as the parameter of interest, as in section 4.1, we can rewrite (16) as

(17)   Y_i = Ȳ_0 + ᾱ T_i + (Y_{i0} − Ȳ_0) + (α_i − ᾱ) T_i.

Finally, and to illustrate, suppose that we model the variation in Y_{i0} as a linear function of an observable scalar X_i and a residual η_i, so that we have

(18)   Y_i = β_0 + ᾱ T_i + θ (X_i − X̄) + [η_i + (α_i − ᾱ) T_i],

with β_0 = Ȳ_0, which is in the regression form (15) but allows us to see the links with the experimental quantities.

Equation (18) is analyzed in some detail by Freedman (2008). It is easily shown that T_i is orthogonal to the compound error, but that this is not true of X_i − X̄. However, the two right hand side variables are uncorrelated because of the randomization, so the OLS estimate of β_1 = ᾱ is consistent. This is not true of θ, though this may not be a problem if the aim is simply to reduce the sampling variance. A more serious issue is that the dependency between T_i and the compound error term means that the OLS estimate of the average treatment effect ᾱ is biased, and in small samples this bias—which comes from the heterogeneity—may be substantial. Freedman notes that the leading term in the bias of the OLS estimate of ᾱ is φ/n, where n is the sample size and

(19)   φ = −lim_{n→∞} (1/n) Σ_{i=1}^{n} (α_i − ᾱ) Z_i²,

where Z_i is the standardized (z-score) version of X_i. Equation (19) shows that the bias comes from the heterogeneity, or more specifically, from a covariance between the heterogeneity in the treatment effects and the squares of the included covariates. With the sample sizes typically encountered in these experiments, which are often expensive to conduct, the bias can be substantial. One possible strategy here would be to compare the estimates of ᾱ with and without covariates; even ignoring pretest bias, it is not clear how to make such a comparison without a good estimate of the standard error. Alternatively, and as noted by Imbens (2009) in his commentary on this paper, it is possible to remove bias using a “saturated” regression model, for example by estimating (15) when the covariates are discrete and there is a complete set of interactions. This is equivalent to stratifying on each combination of values of the covariates, which diminishes any effect on reducing the standard errors and, unless the stratification is done ex ante, raises the usual concerns about data mining.

Of these and related issues in medical trials, Freedman (2008, p. 13) writes “Practitioners will doubtless be heard to object that they know all this perfectly well. Perhaps, but then why do they so often fit models without discussing assumptions?”

All of the issues so far can be dealt with, either by appropriately calculating standard errors or by refraining from the use of covariates, though this might involve drawing larger and more expensive samples. However, there are other practical problems that are harder to fix. One of these is that subjects may fail to accept assignment, so that people who are assigned to the experimental group may refuse, and controls may find a way of getting the treatment, and either may drop out of the experiment altogether. The classical remedy of double blinding, so that neither the subject nor the experimenter knows which subject is in which group, is rarely feasible in social experiments—children know their class size—and is often not feasible in medical trials—subjects may decipher the randomization, for example by asking a laboratory to check that their medicine is not a placebo. Heckman (1992) notes that, in contrast to people, “plots of grounds do not respond to anticipated treatments of fertilizer, nor can they excuse themselves from being treated.” This makes the important point, further developed by Heckman in later work, that the deviations from assignment are almost certainly purposeful, at least in part. The people who struggle to escape their assignment will do so more vigorously the higher are the stakes, so that the deviations from assignment cannot be treated as random measurement error, but will compromise the results in fundamental ways.

Once again, there is a widely used technical fix, which is to run regressions like (15) or (18), with actual treatment status in place of the assigned treatment status T_i. This replacement will destroy the orthogonality between treatment and the error term, so that OLS estimation will no longer yield a consistent estimate of the average treatment effect among the treated. However, the assigned treatment status, which is known to the experimenter, is orthogonal to the error term and is correlated with the actual treatment status, and so can serve as an instrumental variable for the latter. But now we are back to the discussion of instrumental variables in section 2, and we are doing econometrics, not an ideal RCT. Under the assumption of no “defiers”—people who do the opposite of their assignment just because of the assignment (and it is not clear “just why are there no defiers,” Freedman 2006)—the instrumental variable converges to the LATE. As before, it is unclear whether this is what we want, and there is no way to find out without modeling the behavior that is responsible for the heterogeneity of the response to assignment, as in the local instrumental variable approach developed by Heckman and his coauthors, Heckman and Vytlacil (1999, 2007). Alternatively, and as recommended by Freedman (2005, p. 4; 2006), it is always informative to make a simple unadjusted comparison of the average outcomes between treatments and controls according to the original assignment. This may also be enough if what we are concerned with is whether the treatment works or not, rather than with the size of the effect. In terms of instrumental variables, this is a recommendation to look at the reduced form, and again harks back to similar arguments in section 2 on aid effectiveness.
One common problem is that the people who agree to participate in the experiment (as either experimentals or controls) are not themselves randomly drawn from the general population so that, even if the experiment itself is perfectly executed, the results are not transferable from the experimental to the parent population and will not be a reliable guide to policy in the parent population. In effect, the selection or omitted variable bias that is a potential problem in nonexperimental studies comes back in a different form and, without an analysis of the two biases, it is impossible to conclude which estimate is better—a biased nonexperimental analysis might do better than a randomized controlled trial if enrollment into the trial is nonrepresentative. In drug trials, ethical protocols typically require the principle of “equipoise”—that the physician, or at least physicians in general, believe that the patient has an equal chance of improvement with the experimental and control drug. Yet risk-averse patients will not accept such a gamble, so that there is selection into treatment, and the requirements of ethics and of representativity come into direct conflict. While this argument does not apply to all trials, there are many ways in which the experimental (experimentals plus controls) and parent populations can differ.
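The instrumental-variable fix for noncompliance described above—using assigned status as an instrument for the treatment actually received—can be sketched as follows. The example is hypothetical and all numbers are invented: compliance is one-sided and purposeful, with subjects assigned to treatment refusing it when they are already doing well without it, so the “as treated” comparison is biased while the Wald ratio of the reduced form to the first stage recovers the mean effect among those who comply. In this particular construction that mean happens to equal the population mean effect, because refusal depends only on the untreated outcome; in general the IV converges only to the LATE, as the text notes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

alpha = rng.normal(2.0, 3.0, n)        # heterogeneous treatment effects
y0 = rng.normal(10.0, 1.0, n)          # untreated potential outcomes

z = rng.random(n) < 0.5                # randomized assignment

# One-sided noncompliance: subjects assigned to treatment refuse it when they
# are already doing well without it (a purposeful, not random, deviation)
complier = y0 < 10.0
d = z & complier                       # treatment actually received

y = y0 + alpha * d                     # observed outcome

# "As treated" comparison: biased, because refusal is related to y0
as_treated = y[d].mean() - y[~d].mean()

# Wald / IV estimate: reduced form (by assignment) over the first stage
reduced_form = y[z].mean() - y[~z].mean()
first_stage = d[z].mean() - d[~z].mean()
iv = reduced_form / first_stage

print(f"mean treatment effect:           {alpha.mean():.2f}")
print(f"mean effect among compliers:     {alpha[complier].mean():.2f}")
print(f"as-treated difference in means:  {as_treated:.2f}")  # biased: badly understated here
print(f"IV (Wald) estimate:              {iv:.2f}")          # recovers the complier mean effect
```

The sketch is doing econometrics, not an ideal RCT: the credibility of the IV estimate rests on the exclusion restriction and on what the complier subpopulation represents, exactly the issues discussed in section 2.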

There are also operational problems that flip charts in schools by Paul Glewwe et al. afflict every actual experiment; these can (2004); this paper, like “Worms,” is much be mitigated by careful planning—in RCTs cited as evidence in favor of the virtues of compared with econometric analysis, most of randomization. the work is done before data collection, not Alphabetization may be a reasonable solu- after—but not always eliminated. tion when randomization is impossible but In this context, I turn to the flagship study we are then in the world of quasi- or natural of the new movement in development eco- experiments, not randomized experiments; nomics—Miguel and Kremer’s (2004) study in the latter, the balance of observable and of intestinal worms in Kenya. This paper is unobservable factors in treatments and repeatedly cited in DGK’s manual and it is controls is guaranteed by design, at least one of the exemplary studies cited by Duflo in expectation, in the former, it has to be (2004) and by Banerjee and He (2008). It argued for each case, and the need for such was written by two senior authors at leading argument is one of the main claims for the research universities and published in the superiority of the randomized approach. As most prestigious technical journal in eco- is true with all forms of quasi-randomization, nomics. It has also received a great deal of alphabetization does not guarantee orthogo- positive attention in the popular press (see, nality with potential confounders, however for example, David Leonhardt 2008) and has plausible it may be in the Kenyan case. been influential in policy (see Poverty Action Resources are often allocated alphabetically Lab 2007). In this study, a group of “seventy- because that is how many lists are presented five rural primary schools were phased into (see, for example, Št​ e˘ ​pán Jurajda and Daniel treatment in a randomized order,” with the Münich 2009 for documentation of stu- finding “that the program reduced school dents being admitted into selective schools absenteeism by at least one quarter, with (partly) based on their position in the alpha- particularly large participation gains among bet). If this were the case in Kenya, schools the youngest children, making deworming a higher in the alphabet would be systemati- highly effective way to boost school partici- cally different and this difference would be pation among young children” (p. 159). The inherited in an attenuated form by the three point of the RCT is less to show that deworm- groups. Indeed, this sort of contamination ing medicines are effective, but to show that is described by Cox (1958, pp. 74–75) who school-based treatment is more effective explicitly warns against such designs. Of than individual treatment because children course, it is also possible that, in this case, infect one another. As befits a paper that aims the alphabetization causes no ­confounding to change method, there is emphasis on the with factors known or unknown. If so, there virtues of randomization, and the word “ran- is still an issue with the calculation of stan- dom” or its derivatives appears some sixty dard errors. Without a probability law, we times in the paper. However, the “random- have no way of discovering whether the dif- ization” in the study is actually an assignment ference between treatments and controls of schools to three groups by their order in could have arisen by chance. 
We might think the alphabet as in Albert Infant School to of modeling the situation here by imagining group 1, Alfred Elementary to group 2, Bell’s that the assignment was equivalent to taking Academy to group 3, Christopher School a random starting value and assigning every to group 1 again, Dean’s Infants to group third school to treatment. If so, the fact that 2, and so on. Alphabetization, not random- there are only three possible assignments of ization, was also used in the experiment on schools would have to be taken into account Deaton: Instruments, Randomization, and Learning about Development 447 in calculating the standard errors, and noth- who were not represented in the original ing of this kind is reported. As it is, it is trials (Jerome Groopman 2009). In RCTs impossible to tell whether the experimental of some medical procedures, the hospitals differences in these studies are or are not chosen to participate are carefully selected, due to chance. and favorable trial results may not be obtain- In this subsection, I have dwelt on practice able elsewhere (see David E. Wennberg et not to critique particular studies or particu- al. 1998 for an example in which actual mor- lar results; indeed it seems entirely plausible tality is many times higher than in the trials, that deworming is a good idea and that the sufficiently so as to reverse the desirability costs are low relative to other interventions. of the adoption of the procedure). There is My main point here is different—that con- also a concern that those who sponsor trials, ducting good RCTs is exacting and often those who analyze them, and those who set expensive, so that problems often arise that evidence-based guidelines using them some- need to be dealt with by various economet- times have financial stakes in the outcome, ric or statistical fixes. There is nothing wrong which can cast doubts on the results. This with such fixes in principle—though they is currently not a problem in economics but often compromise the substance, as in the would surely become one if, as the advocates instrumental variable estimation to correct argue, successful RCTs became a precondi- for failure of assignment—but their applica- tion for the rollout of projects. Beyond that, tion takes us out of the world of ideal RCTs John Concato, Nirav Shah, and Horwitz and back into the world of everyday econo- (2000) argue that, in practice, RCTs do not metrics and statistics. So that RCTs, although provide useful information beyond what can frequently useful, carry no special exemption be learned from well-designed and carefully from the routine statistical and substantive interpreted observational studies. scrutiny that should be routinely applied to any empirical investigation. 5. Where Should We Go from Here? Although it is well beyond my scope in this paper, I should note that RCTs in medi- Cartwright (2007b) maintains a useful cine—the gold standard to which develop- distinction between “hunting” causes and ment RCTs often compare themselves—also “using” them, and this section is about the encounter practical difficulties, and their pri- use of randomized controlled trials for policy. macy is not without challenge. 
In particular, Here I address the issue of generalizability ethical (human subjects) questions surround- or external validity—as opposed to internal ing RCTs in medicine have become severe validity as discussed in the previous sec- enough to seriously limit what can be under- tion—grounds on which development RCTs taken, and there is still no general agree- are sometimes criticized (see, for example, ment on a satisfactory ethical basis for RCTs. Dani Rodrik 2009). We need to know when Selection into medical trials is not random we can use local results, from instrumental from the population; in particular, patients are variables, from RCTs, or from nonexperi- typically excluded if they suffer from condi- mental analyses, in contexts other than those tions other than those targeted in the trial, so in which they were obtained. that researchers can examine effects without There are certainly cases in both medi- contamination from comorbidities. Yet many cine and economics where an RCT has had actual patients do have comorbidities—this a major effect on the way that people think is particularly true among the elderly—so beyond the original local context of the drugs are ­frequently prescribed to those trial. In the recent development literature, 448 Journal of Economic Literature, Vol. XLVIII (June 2010) my favorite is Raghabendra Chattopadhyay scale that are absent in a pilot, or because and Duflo’s (2004) study of women lead- the outcomes are different when everyone is ers in India. The Government of India covered by the treatment rather than just a randomly selected some panchayats and selected group of experimental subjects who forced them to have female leaders, and the are not representative of the population to paper explores the differences in outcomes be covered by the policy. Small development between such villages and others with male projects that help a few villagers or a few leaders. There is a theory (of sorts) under- villages may not attract the attention of cor- lying these experiments—the development rupt public officials because it is not worth community had briefly adopted the view that their while to undermine or exploit them, a key issue in development was the empow- yet they would do so as soon as any attempt erment of women (or perhaps just giving were made to scale up. The scientists who them “voice”) and, if this was done, more run the experiments are likely to do so more children would be educated, more money carefully and conscientiously than would the would be spent on food and on health, and so bureaucrats in charge of a full scale opera- on. Women are altruistic and men are selfish. tion. In consequence, there is no guarantee Chattopadhyay and Duflo’s analysis of the that the policy tested by the RCT will have Indian government’s experiments shows that the same effects as in the trial, even on the this is most likely wrong; I say most likely subjects included in the trial or in the popu- because, as with all experiments, the mecha- lation from which the trialists were selected. nisms are unclear. It is possible, for example, For an RCT to produce “useful knowledge” that women do indeed want the socially beyond its local context, it must illustrate desirable outcomes but are unable to obtain some general tendency, some effect that is them in a male dominated society, even the result of mechanism that is likely to apply when women are nominally in power. Even more broadly. 
so, this study reversed previously held gen- It is sometimes argued that skepticism eral beliefs. There are also many examples about generalizability is simply “a version in medicine where knowledge of the mean of David Hume’s famous demonstration of treatment effect among the treated from a the lack of a rational basis for induction” trial, even with some allowance for practi- (Banerjee 2005, p. 4341). But what is going cal problems, has reversed previously held on here is often a good deal more mundane. beliefs (see Davey Smith and Shah Ebrahim Worrall (2007, p. 995) responds to the same 2002, who note that “Observational studies ­argument with the following: “One example propose, RCTs dispose”). is the drug benoxaprophen (trade name: Yet I also believe that RCTs of “what Opren), a nonsteroidal inflammatory treat- works,” even when done without error or ment for arthritis and musculo-skeletal pain. contamination, are unlikely to be helpful for This passed RCTs (explicitly restricted to policy, or to move beyond the local, unless 18 to 65 year olds) with flying colours. It is they tell us something about why the pro- however a fact that musculo-skeletal pain gram worked, something to which they are predominately afflicts the elderly. It turned often neither targeted nor well-suited. Some out, when the (on average older) ‘target pop- of the issues are familiar and are widely ulation’ were given Opren, there were a sig- discussed in the literature. Actual policy is nificant number of deaths from hepato-renal always likely to be different from the experi- failure and the drug was withdrawn.” ment, for example because there are general In the same way, an educational protocol equilibrium effects that operate on a large that was successful when randomized across Deaton: Instruments, Randomization, and Learning about Development 449 villages in India holds many things constant from experiments to policy. In economics, that would not be constant if the program the language would refer to theory, building were transported to Guatemala or Vietnam, models, and tailoring them to local condi- even as a randomized controlled trial, let tions. Policy requires a causal model; without alone when enacted into policy (Cartwright it, we cannot understand the welfare con- 2010). These examples demonstrate a failure sequences of a policy, even a policy where to control for relevant factors or nonrandom causality is established and that is proven participation in the trial, not the general to work on its own terms. Banerjee (2007a) impossibility of induction. RCTs, like non- describes an RCT by Duflo, Rema Hanna, experimental results, cannot automatically and Stephen Ryan (2008) as “a new eco- be extrapolated outside the context in which nomics being born.” This experiment used they were obtained. cameras to monitor and prevent teacher Perhaps the most famous randomization absenteeism in villages in the Indian state in development economics is Progresa (now of Rajasthan. Curiously, Pawson and Tilley Oportunidades) in Mexico, a conditional cash (1997, pp. 78–82) use the example of cam- transfer scheme in which welfare benefits to eras (to deter crime in car parks) as one of parents were paid conditional on their chil- their running examples. They note that cam- dren attending schools and clinics. Angrist eras do not, in and of themselves, prevent and Pischke (2010) approvingly quote Paul crime because they do not make it impossi- Gertler’s statement that “Progresa is why ble to break into a car. 
Instead, they depend now thirty countries worldwide have con- on triggering a series of behavioral changes. ditional cash transfer programs,” and there Some of those changes show positive experi- is no doubt that the spread of Progresa mental outcomes—crime is down in the car depended on the fact that its successes were parks with cameras—but are undesirable, supported by a randomized evaluation. Yet for example because crime is shifted to other it is unclear that this wholesale imitation car parks or because the cameras change is a good thing. Santiago Levy (2006), the the mix of patrons of the car park. There architect of Progresa, notes that the Mexican are also cases where the experiment fails scheme cannot simply be exported to other but has beneficial effects. It would not be countries if, for example, those countries difficult to construct similar arguments for have a preexisting antipoverty program with the cameras in the Indian schools, and wel- which conditional cash transfers might not fare conclusions cannot be supported unless fit, or if they do not have the capacity to we understand the behavior of teachers, meet the additional demand for education pupils, and their parents. Duflo, Hanna, and or healthcare, or if the political support is Ryan (2008) understand this and use their absent. Incentivizing parents to take their experimental results to construct a model children to clinics will not improve child of teacher behavior. Other papers that use health if there are no clinics to serve them, structural models to interpret experimental a detail that can easily be overlooked in the results include Petra E. Todd and Kenneth I. enthusiasm for the credibility of the Mexican Wolpin (2006) and , Costas evaluation. Meghir, and Ana Santiago (2005); these Pawson and Tilley (1997) argue that it is and the other studies reviewed in Todd and the combination of mechanism and context Wolpin (forthcoming) are surely a good ave- that generates outcomes and that, without nue for future explanation. understanding that combination, scientific Cartwright (2007a) draws a contrast progress is unlikely. Nor can we safely go between the rigor applied to establish 450 Journal of Economic Literature, Vol. XLVIII (June 2010)

­internal validity—to establish the gold stan- are not theoretically guided are unlikely to dard status of RCTs—and the much looser have more than local validity, a warning that arguments that are used to defend the trans- applies equally to nonexperimental work. plantation of the experimental results to pol- In (forthcoming), where I icy. For example, running RCTs to find out develop these arguments further, I discuss whether a project works is often defended on a number of examples of nonexperimental the grounds that the experimental project is work in economic development where theo- like the policy that it might support. But the ries are developed to the point where they are “like” is typically argued by an appeal to simi- capable of being tested on nonexperimental lar circumstances, or a similar environment, data, with the results used to refute, refine, arguments that depend entirely on observ- or further develop the theory. Randomized able variables. Yet controlling for observables experiments, which allow the researcher to is the key to the matching estimators that are induce controlled variance, should be a pow- one of the main competitors for RCTs and erful tool in such programs, and make it pos- that are typically rejected by the advocates sible to construct tests of theory that might of RCTs on the grounds that RCTs control otherwise be difficult or impossible. The dif- not only for the things that we observe but ference is not in the methods, experimental things that we cannot. As Cartwright notes, and nonexperimental, but in what is being the validity of evidence-based policy depends investigated, projects on the one hand, and on the ­weakest link in the chain of argument mechanisms on the other. and evidence, so that by the time we seek to One area in which this is already hap- use the experimental results, the advantage pening is in behavioral economics, and of RCTs over matching or other econometric the merging of economics and psychol- methods has evaporated. In the end, there ogy, whose own experimental tradition is no substitute for careful evaluation of the is clearly focused on behavioral regulari- chain of evidence and reasoning by people ties. The experiments reviewed in Steven who have the experience and expertise in D. Levitt and John A. List (2008), often the field. The demand that experiments be involving both economists and psycholo- theory-driven is, of course, no guarantee of gists, cover such issues as loss aversion, success, though the lack of it is close to a procrastination, hyperbolic discounting, or guarantee of failure. the availability heuristic—all of which are It is certainly not always obvious how to examples of behavioral mechanisms that combine theory with experiments. Indeed, promise applicability beyond the specific much of the interest in RCTs—and in instru- experiments. There also appears to be a mental variables and other econometric tech- good deal of convergence between this niques that mimic random allocation—comes line of work, inspired by earlier experi- from a deep skepticism of economic theory, mental traditions in economic theory and and impatience with its ability to deliver in psychology, and the most recent work structures that seem at all helpful in inter- in development. Instead of using experi- preting reality. Applied and theoretical econ- ments to evaluate projects, looking for omists seem to be further apart now than at which projects work, this development any period in the last quarter century. 
Yet fail- work designs its experiments to test predic- ure to reintegrate is hardly an option because tions of theories that are generalizable to without it there is no chance of long-term sci- other situations. Without any attempt to be entific progress or of maintaining and extend- comprehensive, some examples are Dean ing the results of experimentation. RCTs that Karlan and Jonathan Zinman (2008), who Deaton: Instruments, Randomization, and Learning about Development 451 are concerned with the price elasticity of of reduced-form social ­experimentation to the demand for credit; Marianne Bertrand the exclusion of structural evaluation based et al. (2010), who take predictions about on observational data is not warranted.” the importance of context from the psy- Their statement still holds good, and it chology laboratory to the study of advertis- would be worth our while trying to return to ing for small loans in South Africa; Duflo, something like Orcutt and Orcutt’s vision in Kremer, and Jonathan Robinson (2009), experimental work, as well as to reestablish- who construct and test a behavioral model ing the relevance and importance of theory- of procrastination for the use of fertiliz- driven nonexperimental work. The more ers by small farmers in Kenya; and Xavier recent Moving to Opportunity experiment is Giné, Karlan, and Zinman (2009), who use another example of an experiment that was an experiment in the Philippines to test more successful as a black-box test of “what the efficacy of a smoking-cessation product works,” in this case giving housing vouchers designed around behavioral theory. In all of to a small segment of the most disadvan- this work, the project, when it exists at all, is taged population, than it was in illuminat- an embodiment of the theory that is being ing general long-standing questions about tested and refined, not the object of evalu- the importance of neighborhoods (Sampson ation in its own right, and the field experi- 2008). Sampson concludes his review of the ments are a bridge between the laboratory Moving to Opportunity experiment with “a and the analysis of “natural” data (List plea for the old-fashioned but time-proven 2006). The collection of purpose-designed benefits of theoretically motivated descrip- data and the use of randomization often tive research and synthetic analytical efforts” make it easier to design an acid test that (p. 227). can be more difficult to construct without Finally, I want to return to the issue of them. If we are lucky, this work will pro- “heterogeneity,” a running theme in this vide the sort of behavioral realism that has paper. Heterogeneity of responses first been lacking in much of economics while, appeared in section 2 as a technical problem at the same time, identifying and allowing for instrumental variable estimation, dealt us to retain the substantial parts of existing with in the literature by local average treat- economic theory that are genuinely useful. ment estimators. Randomized controlled In this context, it is worth looking back trials provide a method for estimating quan- to the previous phase of experimentation in tities of interest in the presence of heteroge- economics that started with the New Jersey neity, and can therefore be seen as another income tax experiments. A rationale for these technical ­solution for the “heterogeneity experiments is laid out in Guy H. Orcutt and problem.” They allow estimation of mean Alice G. 
Orcutt (1968) in which the vision responses under extraordinarily weak condi- is a formal model of labor supply with the tions. But as soon as we deviate from ideal experiments used to estimate its parameters. conditions and try to correct the random- By the early 1990s, however, experimenta- ization for inevitable practical difficulties, tion had moved on to a “what works” basis, heterogeneity again rears its head, biasing and Manski and Irwin Garfinkel (1992), estimates, and making it difficult to interpret surveying the experience, write “there is, at what we get. In the end, the technical fixes present, no basis for the popular belief that fail and compromise our attempts to learn extrapolation from social experiments is less from the data. What this should tell us is that problematic than extrapolation from observa- the heterogeneity is not a technical problem tional data. As we see it, the recent embrace but a symptom of something deeper, which 452 Journal of Economic Literature, Vol. XLVIII (June 2010) is the failure to specify causal models of the References processes we are examining. Technique is Acemoglu, Daron, Simon Johnson, and James A. Rob- never a substitute for the business of doing inson. 2001. “The Colonial Origins of Comparative economics. Development: An Empirical Investigation.” Ameri- Perhaps the most successful example of can Economic Review, 91(5): 1369–1401. Altman, Douglas G. 1998. “Within Trial Variation—A learning about economic development is False Trail?” Journal of Clinical Epidemiology, 51(4): the Industrial Revolution in Britain, recently 301–03. described by Joel Mokyr (2009). Much of the Angrist, Joshua D. 1990. “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social learning was indeed accomplished by trial Security Administrative Records.” American Eco- and error—RCTs were yet to be invented— nomic Review, 80(3): 313–36. though the practical and often uneducated Angrist, Joshua D., and Victor Lavy. 1999. “Using Mai- monides’ Rule to Estimate the Effect of Class Size people who did the experimentation were on Scholastic Achievement.” Quarterly Journal of constantly in contact with leading scien- Economics, 114(2): 533–75. tists, often through hundreds of local soci- Angrist, Joshua D., and Jörn-Steffen Pischke. 2010. “The Credibility Revolution in Empirical Econom- eties dedicated to the production of “useful ics: How Better Research Design is Taking the Con knowledge” and its application to the ben- Out of Econometrics.” Journal of Economic Perspec- efit of mankind (see also Roy Porter 2000). tives, 24(2): 3–30. Attanasio, Orazio, , and Ana Santiago. Mokyr sees the Enlightenment as a neces- 2005. “Education Choices in Mexico: Using a Struc- sary precondition for the revolution and of tural Model and a Randomized Experiment to Eval- the escape from poverty and disease. Yet the uate Progresa.” Unpublished. Banerjee, Abhijit Vinayak. 2005. “‘New Development application and development of scientific Economics’ and the Challenge to Theory.” Economic ideas was uneven, and progress was faster and Political Weekly, 40(40): 4340–44. where there was less heterogeneity, for exam- Banerjee, Abhijit Vinayak. 2007a. “Inside the Machine: Toward a New Development Economics.” Boston ple in chemical or mechanical processes, Review, 32(2): 12–18. which tended to work in much the same way Banerjee, Abhijit Vinayak. 2007b. Making Aid Work. everywhere, than where heterogeneity was Cambridge and London: MIT Press. Banerjee, Abhijit Vinayak, and Ruimin He. 
2008. “Mak- important, as in agriculture, where soils and ing Aid Work.” In Reinventing Foreign Aid, ed. William farms differed by the mile (Mokyr 2009, p. Easterly, 47–92. Cambridge and London: MIT Press. 192). Even so, as with sanitation, progress Barro, Robert J. 1998. Determinants of Economic Growth: A Cross-Country Empirical Study. Cam- was often made without a correct under- bridge and London: MIT Press. standing of mechanisms and with limited Barro, Robert J., and Xavier Sala-i-Martin. 1995. or no experimentation; Richard J. Murnane ­Economic Growth. New York; London and Mon- treal: McGraw-Hill. and Richard R. Nelson (2007) argue that the Bauer, P. T. 1971. Dissent on Development: Studies and same is frequently true in modern medicine. Debates in Development Economics. London: Wei- In the end, many problems were simply too denfeld and Nicolson. Bauer, P. T. 1981. Equality, the Third World and Eco- hard to be solved without theoretical guid- nomic Delusion. Cambridge and London: Harvard ance, which in areas such as soil chemistry, or University Press. the germ theory of disease, lay many decades Bertrand, Marianne, Dean Karlan, Sendhil Mullaina- than, Eldar Shafir, and Jonathan Zinman. 2010. in the future. It took scientific understanding “What’s Advertising Content Worth? Evidence from to overcome the heterogeneity of experience a Consumer Credit Marketing Field Experiment.” which ultimately defeats trial and error. As Quarterly Journal of Economics, 125(1): 263–305. Blundell, Richard, and Monica Costa Dias. 2009. was the case then, so now, and I believe that “Alternative Approaches to Evaluation in Empiri- we are unlikely to banish poverty in the mod- cal .” Journal of Human Resources, ern world by trials alone, unless those trials 44(3): 565–640. Boone, Peter. 1996. “Politics and the Effectiveness of are guided by and contribute to theoretical Foreign Aid.” European Economic Review, 40(2): understanding. 289–329. Deaton: Instruments, Randomization, and Learning about Development 453

Burnside, Craig, and David Dollar. 2000. “Aid, Policies, and Growth.” American Economic Review, 90(4): 847–68.
Card, David. 1999. “The Causal Effect of Education on Earnings.” In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1801–63. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.
Cartwright, Nancy. 2007a. “Are RCTs the Gold Standard?” Biosocieties, 2(1): 11–20.
Cartwright, Nancy. 2007b. Hunting Causes and Using Them: Approaches in Philosophy and Economics. Cambridge and New York: Cambridge University Press.
Cartwright, Nancy. 2010. “What Are Randomised Controlled Trials Good For?” Philosophical Studies, 147(1): 59–70.
Chattopadhyay, Raghabendra, and Esther Duflo. 2004. “Women as Policy Makers: Evidence from a Randomized Policy Experiment in India.” Econometrica, 72(5): 1409–43.
Clemens, Michael, Steven Radelet, and Rikhil Bhavnani. 2004. “Counting Chickens When They Hatch: The Short Term Effect of Aid on Growth.” Center for Global Development Working Paper 44.
Concato, John, Nirav Shah, and Ralph I. Horwitz. 2000. “Randomized, Controlled Trials, Observational Studies, and the Hierarchy of Research Designs.” New England Journal of Medicine, 342(25): 1887–92.
Cox, David R. 1958. Planning of Experiments. New York: Wiley.
Dalgaard, Carl-Johan, and Henrik Hansen. 2001. “On Aid, Growth and Good Policies.” Journal of Development Studies, 37(6): 17–41.
Dalgaard, Carl-Johan, Henrik Hansen, and Finn Tarp. 2004. “On the Empirics of Foreign Aid and Growth.” Economic Journal, 114(496): F191–216.
Davey Smith, George, and Shah Ebrahim. 2002. “Data Dredging, Bias, or Confounding: They Can All Get You into the BMJ and the Friday Papers.” British Medical Journal, 325: 1437–38.
Davey Smith, George, and Matthias Egger. 1998. “Incommunicable Knowledge? Interpreting and Applying the Results of Clinical Trials and Meta-analyses.” Journal of Clinical Epidemiology, 51(4): 289–95.
Deaton, Angus. Forthcoming. “Understanding the Mechanisms of Economic Development.” Journal of Economic Perspectives.
de Mel, Suresh, David McKenzie, and Christopher Woodruff. 2008. “Returns to Capital in Microenterprises: Evidence from a Field Experiment.” Quarterly Journal of Economics, 123(4): 1329–72.
Duflo, Esther. 2004. “Scaling Up and Evaluation.” In Annual World Bank Conference on Development Economics, 2004: Accelerating Development, ed. François Bourguignon and Boris Pleskovic, 341–69. Washington, D.C.: World Bank; Oxford and New York: Oxford University Press.
Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. “Using Randomization in Development Economics Research: A Toolkit.” In Handbook of Development Economics, Volume 4, ed. T. Paul Schultz and John Strauss, 3895–3962. Amsterdam and Oxford: Elsevier, North-Holland.
Duflo, Esther, Rema Hanna, and Stephen Ryan. 2008. “Monitoring Works: Getting Teachers to Come to School.” Center for Economic and Policy Research Discussion Paper 6682.
Duflo, Esther, Michael Kremer, and Jonathan Robinson. 2009. “Nudging Farmers to Use Fertilizer: Evidence from Kenya.” National Bureau of Economic Research Working Paper 15131.
Duflo, Esther, and Rohini Pande. 2007. “Dams.” Quarterly Journal of Economics, 122(2): 601–46.
Easterly, William. 2006. The White Man’s Burden: Why the West’s Efforts to Aid the Rest Have Done So Much Ill and So Little Good. New York: Penguin Press.
Easterly, William, ed. 2008. Reinventing Foreign Aid. Cambridge and London: MIT Press.
Easterly, William. 2009. “Can the West Save Africa?” Journal of Economic Literature, 47(2): 373–447.
Easterly, William, Ross Levine, and David Roodman. 2004. “Aid, Policies, and Growth: Comment.” American Economic Review, 94(3): 774–80.
Fisher, Ronald A. 1935. The Design of Experiments, Eighth edition. New York: Hafner, 1960.
Freedman, David A. 2005. Statistical Models: Theory and Practice. Cambridge and New York: Cambridge University Press.
Freedman, David A. 2006. “Statistical Models for Causation: What Inferential Leverage Do They Provide?” Evaluation Review, 30(6): 691–713.
Freedman, David A. 2008. “On Regression Adjustments to Experimental Data.” Advances in Applied Mathematics, 40(2): 180–93.
Giné, Xavier, Dean Karlan, and Jonathan Zinman. 2009. “Put Your Money Where Your Butt Is: A Commitment Contract for Smoking Cessation.” World Bank Policy Research Working Paper 4985.
Glewwe, Paul, Michael Kremer, Sylvie Moulin, and Eric Zitzewitz. 2004. “Retrospective vs. Prospective Analyses of School Inputs: The Case of Flip Charts in Kenya.” Journal of Development Economics, 74(1): 251–68.
Groopman, Jerome. 2009. “Diagnosis: What Doctors Are Missing.” New York Review of Books, 56(17), November 5th.
Guillaumont, Patrick, and Lisa Chauvet. 2001. “Aid and Performance: A Reassessment.” Journal of Development Studies, 37(6): 66–92.
Hansen, Henrik, and Finn Tarp. 2000. “Aid Effectiveness Disputed.” Journal of International Development, 12(3): 375–98.
Hansen, Henrik, and Finn Tarp. 2001. “Aid and Growth Regressions.” Journal of Development Economics, 64(2): 547–70.
Heckman, James J. 1992. “Randomization and Social Policy Evaluation.” In Evaluating Welfare and Training Programs, ed. Charles F. Manski and Irwin Garfinkel, 201–30. Cambridge and London: Harvard University Press. Available as National Bureau of Economic Research Technical Working Paper 107.
Heckman, James J. 1997. “Instrumental Variables: A Study of Implicit Behavioral Assumptions Used in Making Program Evaluations.” Journal of Human Resources, 32(3): 441–62.
Heckman, James J. 2000. “Causal Parameters and Policy Analysis in Economics: A Twentieth Century Retrospective.” Quarterly Journal of Economics, 115(1): 45–97.
Heckman, James J., and Jeffrey A. Smith. 1995. “Assessing the Case for Social Experiments.” Journal of Economic Perspectives, 9(2): 85–110.
Heckman, James J., and Sergio Urzua. 2009. “Comparing IV with Structural Models: What Simple IV Can and Cannot Identify.” National Bureau of Economic Research Working Paper 14706.
Heckman, James J., Sergio Urzua, and Edward Vytlacil. 2006. “Understanding Instrumental Variables in Models with Essential Heterogeneity.” Review of Economics and Statistics, 88(3): 389–432.
Heckman, James J., and Edward Vytlacil. 1999. “Local Instrumental Variables and Latent Variable Models for Identifying and Bounding Treatment Effects.” Proceedings of the National Academy of Sciences, 96(8): 4730–34.
Heckman, James J., and Edward Vytlacil. 2007. “Econometric Evaluation of Social Programs, Part II: Using the Marginal Treatment Effect to Organize Alternative Econometric Estimators to Evaluate Social Programs, and to Forecast their Effects in New Environments.” In Handbook of Econometrics, Volume 6B, ed. James J. Heckman and Edward E. Leamer, 4875–5143. Amsterdam and Oxford: Elsevier, North-Holland.
Horwitz, Ralph I., Burton H. Singer, Robert W. Makuch, and Catherine M. Viscoli. 1996. “Can Treatment That is Helpful on Average Be Harmful to Some Patients? A Study of the Conflicting Information Needs of Clinical Inquiry and Drug Regulation.” Journal of Clinical Epidemiology, 49(4): 395–400.
Horwitz, Ralph I., Burton H. Singer, Robert W. Makuch, and Catherine M. Viscoli. 1997. “On Reaching the Tunnel at the End of the Light.” Journal of Clinical Epidemiology, 50(7): 753–55.
Hoxby, Caroline M. 2000. “Does Competition among Public Schools Benefit Students and Taxpayers?” American Economic Review, 90(5): 1209–38.
Imbens, Guido W. 2009. “Better Late than Nothing: Some Comments on Deaton (2009) and Heckman and Urzua (2009).” National Bureau of Economic Research Working Paper 14896.
Imbens, Guido W., and Joshua D. Angrist. 1994. “Identification and Estimation of Local Average Treatment Effects.” Econometrica, 62(2): 467–75.
International Initiative for Impact Evaluation (3IE). 2008. http://www.3ieimpact.org/.
Jurajda, Štěpán, and Daniel Münich. 2009. “Admission to Selective Schools, Alphabetically.” http://home.cerge-ei.cz/jurajda/alphabet07.pdf.
Kanbur, Ravi. 2001. “Economic Policy, Distribution and Poverty: The Nature of Disagreements.” World Development, 29(6): 1083–94.
Karlan, Dean, and Jonathan Zinman. 2008. “Credit Elasticities in Less-Developed Economies: Implications for Microfinance.” American Economic Review, 98(3): 1040–68.
Leamer, Edward E. 1985. “Vector Autoregressions for Causal Inference?” Carnegie–Rochester Conference Series on Public Policy, 22: 255–303.
Lee, David S., and Thomas Lemieux. 2009. “Regression Discontinuity Designs in Economics.” National Bureau of Economic Research Working Paper 14723.
Lensink, Robert, and Howard White. 2001. “Are There Negative Returns to Aid?” Journal of Development Studies, 37(6): 42–65.
Leonhardt, David. 2008. “Making Economics Relevant Again.” New York Times, February 20.
Levitt, Steven D., and John A. List. 2008. “Field Experiments in Economics: The Past, the Present, and the Future.” National Bureau of Economic Research Working Paper 14356.
Levy, Santiago. 2006. Progress against Poverty: Sustaining Mexico’s Progresa-Oportunidades Program. Washington, D.C.: Brookings Institution Press.
List, John A. 2006. “Field Experiments: A Bridge between Lab and Naturally Occurring Data.” B.E. Journal of Economic Analysis and Policy: Advances in Economic Analysis and Policy, 6(2).
Mankiw, N. Gregory, David Romer, and David N. Weil. 1992. “A Contribution to the Empirics of Economic Growth.” Quarterly Journal of Economics, 107(2): 407–37.
Manski, Charles F. 1996. “Learning About Treatment Effects from Experiments with Random Assignment of Treatments.” Journal of Human Resources, 31(4): 709–33.
Manski, Charles F., and Irwin Garfinkel. 1992. “Evaluating Welfare and Training Programs: Introduction.” In Evaluating Welfare and Training Programs, ed. Charles F. Manski and Irwin Garfinkel, 1–22. Cambridge and London: Harvard University Press.
McCleary, Rachel M., and Robert J. Barro. 2006. “Religion and Economy.” Journal of Economic Perspectives, 20(2): 49–72.
McCrary, Justin. 2008. “Manipulation of the Running Variable in the Regression Discontinuity Design: A Density Test.” Journal of Econometrics, 142(2): 698–714.
Miguel, Edward, and Michael Kremer. 2004. “Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities.” Econometrica, 72(1): 159–217.
Miguel, Edward, Shanker Satyanath, and Ernest Sergenti. 2004. “Economic Shocks and Civil Conflict: An Instrumental Variables Approach.” Journal of Political Economy, 112(4): 725–53.
Mokyr, Joel. 2009. The Enlightened Economy: An Economic History of Britain 1700–1850. New Haven and London: Yale University Press.
Murnane, Richard J., and Richard R. Nelson. 2007. “Improving the Performance of the Education Sector: The Valuable, Challenging, and Limited Role of Random Assignment Evaluations.” Economics of Innovation and New Technology, 16(5): 307–22.
Orcutt, Guy H., and Alice G. Orcutt. 1968. “Incentive and Disincentive Experimentation for Income Maintenance Policy Purposes.” American Economic Review, 58(4): 754–72.
Pawson, Ray, and Nick Tilley. 1997. Realistic Evaluation. London and Thousand Oaks, Calif.: Sage Publications.
Peto, Richard, Rory Collins, and Richard Gray. 1995. “Large-Scale Randomized Evidence: Large, Simple Trials and Overviews of Trials.” Journal of Clinical Epidemiology, 48(1): 23–40.
Pogge, Thomas. 2005. “World Poverty and Human Rights.” Ethics and International Affairs, 19(1): 1–7.
Porter, Roy. 2000. The Creation of the Modern World: The Untold Story of the British Enlightenment. New York and London: Norton.
Poverty Action Lab. 2007. “Clinton Honors Global Deworming Effort.” http://www.povertyactionlab.org/deworm/.
Rajan, Raghuram G., and Arvind Subramanian. 2008. “Aid and Growth: What Does the Cross-Country Evidence Really Show?” Review of Economics and Statistics, 90(4): 643–65.
Reiss, Peter C., and Frank A. Wolak. 2007. “Structural Econometric Modeling: Rationales and Examples from Industrial Organization.” In Handbook of Econometrics, Volume 6A, ed. James J. Heckman and Edward E. Leamer, 4277–4415. Amsterdam and Oxford: Elsevier, North-Holland.
Rodrik, Dani. 2009. “The New Development Economics: We Shall Experiment, but How Shall We Learn?” In What Works in Development? Thinking Big and Thinking Small, ed. Jessica Cohen and William Easterly, 24–47. Washington, D.C.: Brookings Institution Press.
Roodman, David. 2007. “The Anarchy of Numbers: Aid, Development, and Cross-Country Empirics.” World Bank Economic Review, 21(2): 255–77.
Sachs, Jeffrey D. 2005. The End of Poverty: Economic Possibilities for Our Time. New York: Penguin.
Sachs, Jeffrey D. 2008. Common Wealth: Economics for a Crowded Planet. New York: Penguin Press.
Sampson, Robert J. 2008. “Moving to Inequality: Neighborhood Effects and Experiments Meet Social Structure.” American Journal of Sociology, 114(1): 189–231.
Sanbonmatsu, Lisa, Jeffrey R. Kling, Greg J. Duncan, and Jeanne Brooks-Gunn. 2006. “Neighborhoods and Academic Achievement: Results from the Moving to Opportunity Experiment.” Journal of Human Resources, 41(4): 649–91.
Senn, Stephen, and Frank Harrell. 1997. “On Wisdom after the Event.” Journal of Clinical Epidemiology, 50(7): 749–51.
Sims, Christopher A. 2010. “But Economics Is Not an Experimental Science.” Journal of Economic Perspectives, 24(2): 59–68.
Singer, Peter. 2004. One World: The Ethics of Globalization. New Haven and London: Yale University Press, 2002.
The Lancet. 2004. “The World Bank Is Finally Embracing Science: Editorial.” The Lancet, 364: 731–32.
Todd, Petra E., and Kenneth I. Wolpin. 2006. “Assessing the Impact of a School Subsidy Program in Mexico: Using a Social Experiment to Validate a Dynamic Behavioral Model of Child Schooling and Fertility.” American Economic Review, 96(5): 1384–1417.
Todd, Petra E., and Kenneth I. Wolpin. Forthcoming. “Structural Estimation and Policy Evaluation in Developing Countries.” Annual Review of Economics.
Urquiola, Miguel, and Eric Verhoogen. 2009. “Class-Size Caps, Sorting, and the Regression-Discontinuity Design.” American Economic Review, 99(1): 179–215.
van den Berg, Gerard. 2008. “An Economic Analysis of Exclusion Restrictions for Instrumental Variable Estimation.” Unpublished.
Wennberg, David E., F. L. Lucas, John D. Birkmeyer, Carl E. Bredenberg, and Elliott S. Fisher. 1998. “Variation in Carotid Endarterectomy Mortality in the Medicare Population: Trial Hospitals, Volume, and Patient Characteristics.” Journal of the American Medical Association, 279(16): 1278–81.
Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel Data. Cambridge and London: MIT Press.
World Bank. 2008. “Spanish Impact Evaluation Fund.” http://web.worldbank.org/WBSITE/EXTERNAL/TOPICS/EXTPOVERTY/EXTISPMA/0,,contentMDK:21419502~menuPK:384336~pagePK:148956~piPK:216618~theSitePK:384329,00.html.
Worrall, John. 2007. “Evidence in Medicine and Evidence-Based Medicine.” Philosophy Compass, 2(6): 981–1022.