American Economic Review: Papers & Proceedings 2015, 105(5): 467–470 http://dx.doi.org/10.1257/aer.p20151015

Heterogeneous Treatment Effects in Impact Evaluation†

By Eva Vivalt*

The past few years have seen an exponential which programs work best where. To date, it growth in impact evaluations. These evalua- has conducted 20 meta-analyses and systematic tions are supposed to be useful to policymakers, reviews of different development programs.1 development practitioners, and researchers Data gathered through meta-analyses are the designing future studies. However, it is not yet ideal data with which to answer the question of clear to what extent we can extrapolate from how much we can extrapolate from past results, past impact evaluation results or under which as what one would want is a large database of conditions Deaton 2010; Pritchett and Sandefur impact evaluation results. Since data on these 20 2013 . Further,( it has been shown that even a topics were collected in the same way, we can similar) program, in a similar environment, can also look across different types of programs to yield different results Duflo et al. 2012; Bold see if there are any more general trends. et al. 2013 . The different( findings that have The rest of this paper proceeds as follows. been obtained) in such similar conditions point First, I briefly discuss the framework of hetero- to substantial context-dependence of impact geneous treatment effects. I then describe the evaluation results. It is critical to understand this data in more detail and provide some illustrative context-dependence in order to know what we results. A fuller treatment of the topic is found can learn from any impact evaluation. in Vivalt 2015 . While the most important reason to exam- ( ) ine generalizability is to aid interpretation and I. Heterogeneous Treatment Effects improve predictions, it could also help direct research attention to where it is most needed. If I define generalizability as the ability to pre- we could identify which results would be more dict results outside of the sample.2 apt to generalize, we could reallocate effort to I model treatment effects as depending on the those topics which are less well understood. context of the intervention. Suppose we have Though impact evaluations are still rapidly an intervention and are investigating a particu- increasing both in number and in terms of the lar outcome, Y​ ​. In the simplest model, context amount of resources devoted to them, a few can be represented as a “contextual variable,” C​ ​ , thousand are already complete. We are thus which interacts with the treatment such that: at the point where we can begin to answer the question more generally. 1 ​Y​ ​ ​T​ ​ ​C​ ​ ​T​ ​ ​C​ ​ ​ ​ ​ , I do this using a large, unique dataset of ( ) j = α + β j + δ j + γ j j + εj impact evaluation results. These data were gath- ered by a nonprofit research organization that where ​j​ indexes individuals in the study and ​T​ I founded, AidGrade, that seeks to determine indicates treatment status. A particular impact evaluation might estimate an equation without the interaction term: * , 14A Washington Mews 303, New York, NY 10003 e-mail: [email protected] . I# thank ( ) Edward Miguel, Hunt Allcott, David Card, Ernesto Dal Bó, 2 ​Y​ j​ ​T​ j​ ​ ​ j​. Bill Easterly, Vinci Chow, Willa Friedman, Xing Huang, ( ) = α + β ′ + ε Steven Pennings, Edson Severnini, and seminar partici- pants at the University of California, Berkeley, Columbia 1 Throughout, I will refer to all 20 as meta-analyses, University, New York University, and the for but some did not have enough comparable outcomes to be helpful comments. included in a meta-analysis and became systematic reviews. † Go to http://dx.doi.org/10.1257/aer.p20151015 to visit 2 Technically, all that can be measured is “local general- the article page for additional materials and author disclo- izability”—the ability to predict results in a particular out of sure statement. sample group. Vivalt 2015 expands on this. ( ) 467 468 AEA PAPERS AND PROCEEDINGS MAY 2015

​ ​ would then capture ​ C​. In this situa- pay; rural electrification; safe water storage; tion,β′ when the contextualβ +variableγ changes, in scholarships; school meals; unconditional a new context, the observed effect attributed to cash transfers; water treatment; and women’s the treatment also changes. This appears as low empowerment programs. generalizability. The subset of the data on which I am focusing The example described is the simplest case. for this paper is based on those papers that passed One can imagine that the true state of the world all screening stages in the meta-analyses and are has “interaction effects all the way down.” publicly available online AidGrade 2015 . The search and screening criteria( were very )broad II. Data and, after passing the full text screening, the vast majority of papers that were later excluded were This paper uses a database of impact evaluation excluded merely because they had no outcome results collected by AidGrade, a US nonprofit­ that variables in common. The small overlap of out- I founded in 2012. AidGrade focuses on gathering come variables is a surprising and notable feature the results of impact evaluations and analyzing of the data. the data, including through meta-analysis. Its data When considering the variation of effect sizes on impact evaluation results were collected in the within a set of papers, the definition of the set course of its meta-analyses from 2012–2014. is clearly critical. If the outcome variable is AidGrade’s meta-analyses follow the standard defined very narrowly e.g., “height in centime- stages: i topic selection; ii a search for rel- ters” , it is clear what( is being measured, and evant papers;( ) iii screening( of) papers; iv data one potential) source of dispersion in results is extraction; and( )v data analysis. In (addition,) removed. On the other hand, if outcomes are it pays attention (to) vi dissemination and vii defined too narrowly, there may be little overlap updating of results AidGrade( ) 2013 . These stages( ) in outcomes between papers. Therefore, multi- are described below,( but the reader) is referred to ple coding rules were used, with this paper con- Vivalt 2015 for a more detailed summary and sidering narrowly-defined outcomes. The reader Higgins( and Green) 2008 for further information is referred to Vivalt 2015 for more details. about meta-analyses( and systematic) reviews. ( ) The interventions that were selected for III. Results meta-analysis were selected largely on the basis of there being a sufficient number of studies on I will restrict attention here to discussing that topic. how the treatment effects vary within interven- A comprehensive literature search was car- tion-outcome, using the data’s original units ried out using a mix of the search aggrega- e.g., percentage points . tors SciVerse, Google Scholar, and EBSCO ( The first thing we) might care about is: if PubMed. The online databases of J-PAL, IPA,/ we were considering the results of a particular CEGA, and 3ie were also searched for com- impact evaluation, how likely is it that the point pleteness. Finally, the references of any exist- estimate for an outcome will fall within the con- ing systematic reviews or meta-analyses were fidence interval of another impact evaluation’s collected. estimate for the same intervention and outcome? Any impact evaluation which appeared to be What is the probability that the confidence inter- on the intervention in question was included, vals of the two studies will overlap? The mean barring those in developed countries. Any will be contained in the confidence interval paper that tried to consider the counterfactual about 53 percent of the time; the studies’ con- was considered an impact evaluation. Both fidence intervals will overlap approximately 83 published papers and working papers were percent of the time. included. Twenty topics were covered to date: Second, how far away are the results from conditional cash transfers; contract teachers; one another? If we were to take the mean result deworming; financial literacy training; HIV within a particular intervention-outcome combi- education; improved stoves; insecticide-treated nation, what is the average difference between bed nets; irrigation; micro health insurance; that and a given study’s result? Putting the abso- microfinance; micronutrient supplementation; lute differences in terms of percents, the average­ mobile phone-based reminders; performance difference is 114 percent; the median across VOL. 105 NO. 5 HETEROGENEOUS TREATMENT EFFECTS IN IMPACT EVALUATION 469 intervention-outcomes is 48 percent. Excluding more opportunities to explicitly integrate these a given study from the calculation of the mean analyses. Efforts to coordinate across different result, to focus on prediction out of sample, the studies, asking the same questions or looking absolute differences increase to an average of at some of the same outcome variables, would 311 percent and median of 52 percent. also help. Any subgroup analyses should be It should be emphasized these numbers prespecified so as to avoid specification search- include outcomes one may not wish to consider ing Casey et al. 2012 . The framing of hetero- together, such as percentage points and rate geneous( treatment effects) could also provide ratios. Rate ratios, as the name suggests, com- positive motivation for replication projects in prise a ratio of two rates: the rate e.g., of the different contexts: different findings would incidence of a disease in the treatment( group not necessarily negate the earlier ones but add in the numerator and )the related rate in the another level of information. control group in the denominator. A change in Ultimately, knowing how much results extrap- 0.1 of a ratio, should be interpreted differently olate and when is critical if we are to know how than a change in 0.1 of an outcome measured to interpret an impact evaluation’s results or apply on another scale, such as in percentage points. its findings. More work is needed in this vein. Vivalt 2015 uses standardized data and does further (analysis.) REFERENCES

IV. Conclusions AidGrade. 2013. “Process map and methodol- ogy.” http://www.aidgrade.org/methodology/ How much impact evaluation results general- process-map-and-methodology accessed ize to other settings is an important topic, and March 9, 2013 . ( data from meta-analyses are the ideal data with AidGrade. 2015.) http://www.aidgrade.org/ which to examine the question. With data on 20 accessed January 12, 2015 . different types of interventions, all collected in Bold,( Tessa, Mwangi Kimenyi,) Germano Mwabu, the same way, we can begin to speak a bit more Alice Ng’ang’a, and Justin Sandefur. 2013. generally about how results tend to vary across “Scaling-up What Works: Experimental Evi- contexts and what that implies for impact evalu- dence on External Validity in Kenyan Educa- ation design and policy recommendations. tion.” University of Oxford, Center for Global This paper presents some evidence on gener- Development Working Paper 321. alizability. In Vivalt 2015 I further explore the Casey, Katherine, Rachel Glennerster, and variation in results. Overall,( ) I find a large amount Edward Miguel. 2012. “Reshaping Institutions: of dispersion, with most papers declining to take Evidence on Aid Impacts Using a Pre-Analy- on the tasks that would make their findings more sis Plan.” Quarterly Journal of 127 useful, such as: specifying a model or “causal 4 : 1755–1812. chain” through which the intervention is sup- Chassang,( ) Sylvain, Gerard Padró i Miquel, and posed to work; reporting results for outcome Erik Snowberg. 2012. “Selective Trials: A variables that other studies also consider; or pro- Principal-Agent Approach to Randomized viding basic information about the context of the Controlled Experiments.” American Economic intervention. Review 102 4 : 1279–1309. There are some steps that researchers can Deaton, Angus.( 2010.) “Instruments, Randomiza- take that may improve the generalizability of tion, and Learning about Development.” Jour- their own studies. First, just as with hetero- nal of Economic Literature 48 2 : 424–55. geneous selection into treatment Chassang, Duo, Esther, Pascaline Dupas,( )and Michael Padró i Miquel, and Snowberg 2012( , one Kremer. 2012. “School Governance, Teacher solution would be to ensure one’s impact) eval- Incentives and Pupil-Teacher Ratios: Exper- uation varied some of the contextual variables imental Evidence from Kenyan Primary that we might think underlie the heterogeneous Schools.” National Bureau of Economic treatment effects. Given that many studies are Research Working Paper 17939. underpowered, that may not be likely; however, Higgins, J. P. T., and Sally Green. 2008. Cochrane large organizations and governments have been Handbook for Systematic Reviews of Interven- supporting more impact evaluations, providing tions. West Sussex: Wiley, 2011. 470 AEA PAPERS AND PROCEEDINGS MAY 2015

Pritchett, Lant, and Justin Sandefur. 2013. “Con- 336. text Matters for Size: Why External ­Validity Vivalt, Eva. 2015. “How Much Can We Claims and Development Practice Don’t Mix.” ­Generalize from Impact Evaluations?” Center for Global Development Working Paper ­Unpublished.