NONPARAMETRIC ESTIMATION OFAVERAGE TREATMENT EFFECTS UNDER EXOGENEITY: AREVIEW* GuidoW .Imbens

Abstract—Recently there has been asurge in econometric work focusing edness and exogeneity interchangeably to denote the as- on estimating average treatment effects under various sets of assumptions. One strand of this literature has developed methods for estimating average sumption that the receipt of treatmentis independent of the treatment effects for abinary treatment under assumptions variously potential outcomes with and without treatmentif certain described as exogeneity, unconfoundedness, or selection on observables. observable covariates areheld constant. The implication of The implication of these assumptions is that systematic (for example, average or distributional) differences in outcomes between treated and these assumptions is that systematic (forexample, average control units with the same values for the covariates are attributable to the or distributional) differences in outcomes between treated treatment. Recent analysis has considered estimation and inference for and control units with the samevalues forthese covariates average treatment effects under weaker assumptions than typical of the earlier literature by avoiding distributional and functional-form assump- areattributable to the treatment. tions. Various methods of semiparametric estimation have been proposed, Much of the recentwork, building on the statistical including estimating the unknown regression functions, , meth- literatureby Cochran (1968), Cochran and Rubin (1973), ods using the propensity score such as weighting and blocking, and combinations of these approaches. In this paper Ireview the state of this Rubin (1973a, 1973b, 1977, 1978), Rosenbaum and Rubin literature and discuss some of its unanswered questions, focusing in (1983a, 1983b, 1984), Holland (1986), and others, considers particular on the practical implementation of these methods, the plausi- estimation and inference without distributional and func- bility of this exogeneity assumption in economic applications, the relative performance of the various semiparametric estimators when the key tional formassumptions. Hahn (1998) derived efŽciency assumptions (unconfoundedness and overlap) are satisŽed, alternative bounds assuming only unconfoundedness and some regu- estimands such as quantile treatment effects, and alternate methods such as Bayesian inference. larity conditions and proposed an efŽcient estimator. Vari- ous alternative estimators have been proposed given these I.Introduction conditions. These estimation methods can be grouped into Žve categories: (i)methods based on estimating the un- INCEthe work by Ashenfelter (1978), Card and Sulli- known regression functions of the outcome on the covari- Svan (1988), Heckman and Robb (1984), Lalonde ates (Hahn, 1998; Heckman, Ichimura, &Todd, 1997, 1998; (1986), and others, there has been much interest in econo- Imbens, Newey,&Ridder, 2003), (ii)matching on covari- metricmethods forestimating the effectsof active labor ates (Rosenbaum, 1995; Abadie and Imbens, 2002) (iii) marketprograms such asjob search assistance or classroom methods based on the propensity score, including blocking teaching programs. This interest has led to asurge in (Rosenbaum &Rubin, 1984) and weighting (Hirano, Im- theoretical work focusing on estimating average treatment bens, &Ridder, 2003), (iv)combinations of these ap- effectsunder various sets of assumptions. Seefor general proaches, forexample, weighting and regression (Robins & surveys of this literatureAngrist and Krueger (2000), Heck- Rotnizky,1995) or matching and regression (Abadie & man, LaLonde, and Smith (2000), and Blundell and Costa- Imbens, 2002), and (v)Bayesian methods, which have Dias (2002). found relatively little following since Rubin (1978). Inthis One strand of this literaturehas developed methods for paper Iwill review the state of this literature—with partic- estimating the average effectof receiving or not receiving a ular emphasis on implications forempirical work— and binary treatmentunder the assumption that the treatment discuss some of the remaining questions. satisŽes some formof exogeneity.Differentversions of this The organization of the paper is as follows. In section II assumption arereferred to as unconfoundedness (Rosen- Iwill introduce the notation and the assumptions used for baum &Rubin, 1983a), selection on observables (Barnow, identiŽcation. Iwill also discuss the differencebetween Cain, &Goldberger, 1980; Fitzgerald, Gottschalk, &Mof- population- and sample-average treatmenteffects. The re- Žtt, 1998), or conditional independence (Lechner, 1999). In the remainderof this paper Iwill use the termsunconfound- cent econometric literaturehas largely focused on estima- tion of the population-average treatmenteffect and its coun- terpartfor the subpopulation of treated units. An alternative, Received for publication October 22, 2002. Revision accepted for publication June 4, 2003. following the early experimental literature(Fisher, 1925; *University of California at Berkeley and NBER Neyman, 1923), is to consider estimation of the average This paper was presented as an invited lecture at the Australian and European meetings of the Econometric Society in July and August 2003. effectof the treatmentfor the units in the sample. Many of Iam also grateful to Joshua Angrist, Jane Herr, Caroline Hoxby,Charles the estimators proposed can be interpreted as estimating Manski, Xiangyi Meng, Robert MofŽtt, and Barbara Sianesi, and two either the average treatmenteffect for the sample athand, or referees for comments, and to anumber of collaborators, Alberto Abadie, Joshua Angrist, Susan Athey,Gary Chamberlain, Keisuke Hirano, V. the average treatmenteffect for the population. Although the Joseph Hotz, Charles Manski, Oscar Mitnik, Julie Mortimer, Jack Porter, choice of estimand may not affectthe formof the estimator, Whitney Newey,Geert Ridder, Paul Rosenbaum, and Donald Rubin for it has implications forthe efŽciency bounds and forthe many discussions on the topics of this paper. Financial support for this research was generously provided through NSFgrants SBR9818644 and formof estimators of the asymptotic variance; the variance SES0136789 and the Giannini Foundation. of estimators forthe sample average treatmenteffect are

TheReview ofEconomicsand , February2004, 86(1): 4– 29 © 2004by the President and Fellows of HarvardCollege and the Massachusetts Instituteof Technology AVERAGE TREATMENTEFFECTS 5 generally smaller.In section II,Iwill also discuss alterna- createdeither to ful Ž ll the unconfoundedness assumption or tive estimands. Almost the entire literaturehas focused on to failit aknown way —designed to comparethe applica- average effects.However, in many cases such measures bility of the various treatmenteffect estimators in these maymask important distributional changes. These can be diverse settings. captured moreeasily by focusing on quantiles of the distri- This survey will not address alternatives forestimating butions of potential outcomes, in the presence and absence average treatmenteffects that do not rely on exogeneity of the treatment(Lehman, 1974; Docksum, 1974; Firpo, assumptions. This includes approaches whereselected ob- 2003). served covariates arenot adjusted for, such as instrumental In section III,Iwill discuss in moredetail some of the variables analyses (Bjo¨rklund & MofŽ t, 1987; Heckman & recently proposed semiparametricestimators forthe average Robb, 1984; Imbens &Angrist, 1994; Angrist, Imbens, & treatmenteffect, including those based on regression, Rubin, 1996; Ichimura& Taber, 2000; Abadie, 2003a; matching, and the propensity score. Iwill focus particularly Chernozhukov &Hansen, 2001). Iwill also not discuss on implementation, and compare the differentdecisions methods exploiting the presence of additional data, such as facedregarding smoothing parametersusing the various differencein differences in repeated cross sections (Abadie, estimators. 2003b; Blundell etal., 2002; Athey and Imbens, 2002) and Insection IV,Iwill discuss estimation of the variances of regression discontinuity wherethe overlap assumption is these average treatmenteffect estimators. For most of the violated (van der Klaauw, 2002; Hahn, Todd, &van der estimators introduced in the recentliterature, corresponding Klaauw, 2000; Angrist &Lavy,1999; Black, 1999; Lee, estimators forthe variance have also been proposed, typi- 2001; Porter, 2003). Iwill also limitthe discussion to binary cally requiring additional nonparametric regression. Inprac- treatments, excluding models with static multivalued treat- tice, however, researchersoften rely on bootstrapping, al- ments as inImbens (2000) and Lechner (2001) and models though this method has not been formallyjusti Ž ed. In with dynamic treatmentregimes as in Hamand LaLonde addition, ifone is interested in the average treatmenteffect (1996), Gill and Robins (2001), and Abbring and van den forthe sample, bootstrapping isclearly inappropriate. Here Berg (2003). Reviews of many of these methods can be Idiscuss in moredetail asimple estimator forthe variance found in Shadish, Campbell, and Cook (2002), Angrist and formatching estimators, developed by Abadie and Imbens Krueger (2000), Heckman, LaLonde, and Smith (2000), and Blundell and Costa-Dias (2002). (2002), that does not require additional nonparametric esti- mation. Section Vdiscusses differentapproaches to assessing the II.Estimands,IdentiŽ cation, and EfŽ ciency Bounds plausibility of the two key assumptions: exogeneity or unconfoundedness, and overlap in the covariate distribu- A.DeŽnitions tions. The Ž rst of these assumptions is in principle untest- able. Nevertheless anumber of approaches have been pro- Inthis paper Iwill use the potential-outcome notation that dates back to the analysis of randomized by posed that areuseful foraddressing its credibility (Heckman Fisher (1935) and Neyman (1923). Afterbeing forcefully and Hotz, 1989; Rosenbaum, 1984b). One may also wish to advocated in aseries of papers by Rubin (1974, 1977, assess the responsiveness of the results to this assumption 1978), this notation is now standard in the literatureon both using asensitivity analysis (Rosenbaum &Rubin, 1983b; experimental and nonexperimental program evaluation. Imbens, 2003), or, in its extremeform, a bounds analysis Webegin with N units, indexed by i 5 1, . . . , N, (Manski, 1990, 2003). The second assumption is that there viewed as drawn randomly froma large population. Each exists appropriate overlap in the covariate distributions of unit is characterizedby apair of potential outcomes, Y (0) the treated and control units. That iseffectively an assump- i forthe outcome under the control treatmentand Yi(1) for tion on the joint distribution of observable variables. How- the outcome under the active treatment.In addition, each ever, as it only involves inequality restrictions, there areno unit has avector of characteristics, referredto as covariates, direct tests of this null. Nevertheless, in practiceit is often pretreatmentvariables, or exogenous variables, and denoted very important to assess whether there is suf Ž cient overlap 1 by Xi. Itis important that these variables arenot affectedby to draw credible inferences. Lacking overlap forthe full the treatment.Often they take their values prior to the unit sample, one may wish to limitinferences to the average being exposed to the treatment,although this is not suf Ž - effectfor the subset of the covariate space wherethere exists cient forthe conditions they need to satisfy.Importantly, overlap between the treated and control observations. this vector of covariates can include lagged outcomes. InSection VI,Idiscuss anumber of implementations of average treatmenteffect estimators. The Ž rst set of imple- 1 Calling such variables exogenous is somewhat at odds with several mentations involve comparisons of the nonexperimental formal deŽ nitions of exogeneity (e.g., Engle, Hendry, &Richard, 1974), estimators to results based on randomized experiments, as knowledge of their distribution can be informative about the average treatment effects. It does, however, agree with common usage. See for allowing direct tests of the unconfoundedness assumption. example, Manski et al. (1992, p. 28). See also Fro¨lich (2002) and Hirano The second set consists of simulation studies —using data et al. (2003) for additional discussion. 6 THEREVIEW OFECONOMICS AND STATISTICS

Finally,each unit is exposed to asingle treatment; Wi 5 0 effectof such aprogram on individuals with strong labor if unit i receives the control treatment,and Wi 5 1 if unit marketattachment. i receives the active treatment.W ethereforeobserve for Iwill also look atsample-average versions of these two each unit the triple ( Wi, Yi, Xi), where Yi is the realized population measures. These estimands focus on the average outcome: of the treatmenteffect in the speci Ž csample, ratherthan in the population at large. They include the sample-average Yi~0! if Wi 5 0, treatmenteffect (SA TE) Yi ; Yi~Wi! 5 H Yi~1! if Wi 5 1. 1 N Distributions of ( W, Y, X)referto the distribution induced tS 5 @Y ~1! 2 Y ~0!#, N O i i by the random sampling fromthe superpopulation. i51 Several additional pieces of notation will be useful in the remainderof the paper. First, the propensity score (Rosen- and the sample-average treatmenteffect for the treated baum and Rubin, 1983a) is de Ž ned as the conditional (SATT) probability of receiving the treatment,

S 1 e~ x! ; Pr~W 5 1uX 5 x! 5 E@WuX 5 x#. tT 5 O @Yi~1! 2 Yi~0!#, NT i:Wi51

Also, deŽ ne, for w [ {0,1}, the two conditional regression N and variance functions where NT 5 ¥i51 Wi is the number of treated units. The SATEand the SATThave received little attention in the 2 mw~ x! ; E@Y~w!uX 5 x#, sw~ x! ; V~Y~w!uX 5 x!. recenteconometric literature, although the SATEhas along tradition in the analysis of randomized experiments (for Finally,let r( x)be the conditional correlation coef Ž cient of example, Neyman, 1923). Without further assumptions, the Y(0) and Y(1) given X 5 x.As one never observes Yi(0) sample contains no information about the PATEbeyond the and Yi(1)forthe sameunit i,the data only contain indirect SATE.Tosee this, consider the case wherewe observe the and very limited information about this correlation coef Ž - sample (Yi(0), Yi(1), Wi, Xi), i 5 1, . . . , N; that is, we cient.2 observe both potential outcomes foreach unit. In that case S t 5 ¥i [Yi(1) 2 Yi(0)]/N can be estimated without error. B.Estimands:Average Treatment Effects Obviously,the best estimator forthe population-average effect tP is tS.However, wecannot estimate tP without In this discussion Iwill primarily focus on anumber of erroreven with asample whereall potential outcomes are average treatmenteffects (A TEs).This is less limiting than observed, because welack the potential outcomes forthose it mayseem, however, as it includes averages of arbitrary population membersnot included in the sample. This simple transformations of the original outcomes. LaterI will return argument has two implications. First, one can estimatethe brie ytoalternative estimands that cannot be written inthis SATEatleast as accurately as the PATE,and typically more form. so. In fact,the differencebetween the two variances isthe The Ž rst estimand, and the most commonly studied in the variance of the treatmenteffect, which is zero only when the econometric literature, is the population-average treatment treatmenteffect is constant. Second, agood estimator for effect(P ATE): one ATEis automatically agood estimator forthe other. One tP 5 E@Y~1! 2 Y~0!#. can thereforeinterpret many of the estimators forP ATEor PATTasestimators forSA TEor SATT,with lower implied Alternatively wemay be interested in the population- standard errors, as discussed in moredetail in section IIE. average treatmenteffect for the treated (PATT;for example, Athird pair of estimands combines featuresof the other Rubin, 1977; Heckman &Robb, 1984): two. These estimands, introduced by Abadie and Imbens (2002), focus on the ATEconditional on the sample distri- P tT 5 E@Y~1! 2 Y~0!uW 5 1#. bution of the covariates. Formally,the conditional ATE (CATE) is deŽ ned as Heckman and Robb (1984) and Heckman, Ichimura, and

Todd (1997) argue that the subpopulation of treated units is N often of moreinterest than the overall population in the 1 t~X! 5 E@Yi~1! 2 Yi~0!uXi#, N O context of narrowly targeted programs. For example, ifa i51 program is speci Ž cally directed atindividuals disadvan- taged in the labor market,there is often little interest in the and the SATEforthe treated (CATT)is de Ž ned as

2 As Heckman, Smith, and Clemens (1997) point out, however, one can 1 draw some limited inferences about the correlation coef Ž cient from the t~X!T 5 O E@Yi~1! 2 Yi~0!uXi#. NT shape of the two marginal distributions of Y(0) and Y(1). i:Wi51 AVERAGE TREATMENTEFFECTS 7

Using the sameargument as in the previous paragraph, it For many of the formalresults one will also need smooth- can be shown that one can estimateCA TEand CATTmore ness assumptions on the conditional regression functions accurately than PATEand PATT,but generally less accu- and the propensity score [ mw( x) and e( x)],and moment rately than SATEand SATT. conditions on Y(w).Iwill not discuss these regularity The differencein asymptotic variances forces the re- conditions here. Details can be found in the referencesfor searcherto take astance on what the quantity of interest is. the speciŽ cestimators given below. For example, in aspeci Ž capplication one can legitimately Therehas been some controversy about the plausibility of reachthe conclusion that there is no evidence, at the 95% Assumptions 2.1 and 2.2 in economic settings, and thus level, that the PATEis differentfrom zero, whereas there about the relevance of the econometric literaturethat fo- may be compelling evidence that the SATEand CATEare cuses on estimation and inference under these conditions for positive. Typically researchersin econometrics have fo- empiricalwork. In this debate it has been argued that cused on the PATE,but one can argue that it is of interest, agents’ optimizing behavior precludes their choices being when one cannot ascertain the sign of the population-level independent of the potential outcomes, whether or not effect,to know whether one can determine the sign of the conditional on covariates. This seemsan unduly narrow effectfor the sample. Especially in cases, which areall too view. In response Iwill offerthree arguments forconsider- common, whereit is not clearwhether the sample is repre- ing these assumptions. sentative of the population of interest, results forthe sample The Ž rst is astatistical, data-descriptive motivation. A at hand may be of considerable interest. natural starting point inthe evaluation of any program is a comparison of average outcomes fortreated and control C. IdentiŽ cation units. Alogical next step is to adjust any differencein Wemakethe following key assumption about the treat- average outcomes fordifferences in exogenous background ment assignment: characteristics (exogenous in the sense of not being affected by the treatment).Such an analysis may not lead to the Ž nal ASSUMPTION 2.1 (UNCONFOUNDEDNESS ): word on the ef Ž cacy of the treatment,but its absence would seem difŽ cult to rationalize in aserious attempt to under- ~Y~0!, Y~1!! \ WuX. stand the evidence regarding the effectof the treatment. Asecond argument is that almost any evaluation of a This assumption was Ž rst articulated in this formby treatmentinvolves comparisons of units who received the Rosenbaum and Rubin (1983a), who referto it as “ignorable treatmentwith units who did not. The question is typically treatmentassignment. ” Lechner (1999, 2002) refersto this not whether such acomparison should be made, but rather as the “conditional independence assumption. ” Following which units should be compared, that is, which units best work by Barnow, Cain, and Goldberger (1980) in aregres- represent the treated units had they not been treated. Eco- sion setting it is also referredto as “selection on observ- nomic theory can help in classifying variables into those ables.” that need to be adjusted forversus those that do not, on the To see the link with standard exogeneity assumptions, basis of their role in the decision process (forexample, suppose that the treatmenteffect is constant: t 5 Y (1) 2 i whether they enter the utility function or the constraints). Y (0) for all i.Suppose also that the control outcome is i Given that, the unconfoundedness assumption merelyas- linear in X : i serts that all variables that need to be adjusted forare observed by the researcher.This is an empiricalquestion, Yi~0! 5 a 1 X9ib 1 ei, and not one that should be controversial as ageneral with ei \ Xi.Then wecan write principle. Itis clearthat settings wheresome of these covariates arenot observed will require strong assumptions Yi 5 a 1 t z Wi 1 X9ib 1 ei. toallow foridenti Ž cation. Such assumptions include instru- Given the assumption of constant treatmenteffect, uncon- mental variables settings where some covariates areas- sumed to be independent of the potential outcomes. Absent foundedness is equivalent to independence of W and e i i those assumptions, typically only bounds can be identi Ž ed conditional on Xi,which would also capture the idea that Wi is exogenous. Without this assumption, however, uncon- (asin Manski, 1990, 2003). foundedness does not imply alinear relation with (-)- Athird, related argument isthat even when agents choose independent errors. their treatmentoptimally ,two agents with the samevalues Next, wemake a second assumption regarding the joint forobserved characteristics maydiffer in their treatment distribution of treatmentsand covariates: choices without invalidating the unconfoundedness assump- tion ifthe differencein their choices is driven by differences ASSUMPTION 2.2 (OVERLAP): inunobserved characteristics that arethemselves unrelated to the outcomes of interest. The plausibility of this will 0 , Pr~W 5 1uX! , 1. depend critically on the exact nature of the optimization 8 THEREVIEW OFECONOMICS AND STATISTICS process facedby the agents. In particular it may be impor- and thus mw( x) is identiŽ ed. Thus one can estimatethe tant that the objective of the decision makeris distinct from average treatmenteffect t by Ž rst estimating the average the outcome that is of interest to the evaluator. For example, treatmenteffect for a subpopulation with covariates X 5 x: suppose weare interested inestimating the average effectof abinary input (such as anew technology) on a Ž rm’s t~ x! ; E@Y~1! 2 Y~0!uX 5 x# 5 E@Y~1!uX 5 x# output.3 Assume production is astochastic function of this E E input because other inputs (such as weather)are not under 2 @Y~0!uX 5 x# 5 @Y~1!uX 5 x, W 5 1# the Ž rm’s control: Yi 5 g(W, ei).Suppose that pro Ž ts are 2 E@Y~0!uX 5 x, W 5 0# 5 E@YuX, W 5 1# output minus costs ( pi 5 Yi 2 ci z Wi),and also that a Ž rm chooses aproduction level to maximizeexpected pro Ž ts, 2 E@YuX, W 5 0#; equal to output minus costs, conditional on the cost of adopting new technology, followed by averaging over the appropriate distribution of x. To makethis feasible, one needs tobe able to estimatethe expectations E[Y X 5 x, W 5 w]forall values of w and Wi 5 arg max E@p~w!uci# u w[$0,1% x in the support of these variables. This is wherethe second assumption enters. Ifthe overlap assumption is violated at 5 arg max E@ g~w, ei! 2 ci z wuci#, X 5 x,it would be infeasible to estimateboth E[YuX 5 x, w[$0,1% W 5 1] and E[YuX 5 x, W 5 0],because atthose values of x there would be either only treated or only control units. implying Some researchersuse weakerversions of the uncon- foundedness assumption (forexample, Heckman, Ichimura, Wi 5 1$E@ g~1, e! 2 g~0, ei! $ ciuci#% 5 h~ci!. and Todd, 1998). Ifthe interest is in the PATE,itis suf Ž cient toassume that Ifunobserved marginal costs ci differbetween Ž rms, and these marginal costs areindependent of the errors ei in the ASSUMPTION 2.3 (MEAN INDEPENDENCE ): Ž rms’ forecastof production given inputs, then unconfound- edness will hold, as E@Y~w!uW, X# 5 E@Y~w!uX#,

for w 5 0, 1. ~ g~0, e!, g~1, ei!! \ ci. Although this assumption is unquestionably weaker, in practiceit is rarethat aconvincing case is madefor the Note that under the sameassumptions one cannot necessar- weakerassumption 2.3 without the case being equally ily identify the effectof the input on pro Ž ts, for (pi(0), strong forthe stronger version 2.1. The reason is that the p(1))are not independent of ci.For arelated discussion, in weakerassumption is intrinsically tied to functional-form the context of instrumental variables, see Athey and Stern assumptions, and as aresult one cannot identify average (1998). Heckman, LaLonde, and Smith (2000) discuss al- effectson transformations of the original outcome (such as ternative models that justify unconfoundedness. In these logarithms) without the stronger assumption. models individuals do attempt to optimize the sameout- One can weaken the unconfoundedness assumption in a comethat is the variable of interest to the evaluator. They differentdirection ifone is only interested in the average show that selection-on-observables assumptions can be jus- effectfor the treated (see, forexample, Heckman, Ichimura, tiŽ ed by imposing restrictions on the way individuals form &Todd, 1997). In that case one need only assume their expectations about the unknown potential outcomes. In general, therefore, aresearchermay wish to consider, either ASSUMPTION 2.4 (UNCONFOUNDEDNESS FOR CONTROLS): as a Ž nal analysis or as part of alargerinvestigation, estimates based on the unconfoundedness assumption. Y~0! \ WuX. Given the two key assumptions, unconfoundedness and overlap, one can identify the average treatmenteffects. The and the weakeroverlap assumption key insight is that given unconfoundedness, the following equalities hold: ASSUMPTION 2.5 (WEAK OVERLAP):

Pr~W 5 1 X! , 1. mw~ x! 5 E@Y~w!uX 5 x# 5 E@Y~w!uW 5 w, X 5 x# u These two assumptions aresuf Ž cient foridenti Ž cation of 5 E@YuW 5 w, X 5 x#, PATTand SATT,because the moments of the distribution of Y(1) forthe treated aredirectly estimable. 3 If we are interested in the average effect for Ž rms that did adopt the new technology (PATT),the following assumptions can be weakened An important result building on the unconfoundedness slightly. assumption shows that one need not condition simulta- AVERAGE TREATMENTEFFECTS 9 neously on all covariates. The following result shows that Bitler, Gelbach, and Hoynes (2002) estimatethese in a all biases due to observable covariates can be removed by randomized evaluation of asocial program. In instrumental conditioning solely on the propensity score: variables settings Abadie, Angrist, and Imbens (2002) and Chernozhukov and Hansen (2001) investigate estimation of Lemma 2.1 (Unconfoundedness Given the Propensity differences in quantiles of the two marginal potential out- Score: Rosenbaum and Rubin, 1983a): Suppose that as- comedistributions, either forthe entire population or for sumption 2.1 holds. Then subpopulations. Assumptions 2.1 and 2.2 also allow foridenti Ž cation of ~Y~0!, Y~1!! \ Wue~X!. the full marginal distributions of Y(0) and Y(1).Tosee this, Ž rst note that wecan identify not just the average treatment Proof: Wewill show that Pr( W 5 1uY(0), Y(1), e(X)) 5 effect t( x),but also the averages of the two potential Pr(W 5 1ue(X)) 5 e(X),implying independence of ( Y(0), outcomes, m0( x) and m0( x).Second, by these assumptions Y(1)) and W conditional on e(X).First, note that wecan similarly identify the averages of any function of the basic outcomes, E[ g(Y(0))] and E[ g(Y(1))].Hence wecan Pr~W 5 1uY~0!, Y~1!, e~X!! 5 E@W 5 1uY~0!, Y~1!, e~X!# identify the average values of the indicators 1{ Y(0) # y} 5 E@E@WuY~0!, Y~1!, e~X!, X#uY~0!, Y~1!, e~X!# and 1{Y(1) # y},and thus the distribution function of the potential outcomes at y.Given identi Ž cation of the two 5 E@E@WuY~0!, Y~1!, X#uY~0!, Y~1!, e~X!# distribution functions, it is clearthat one can also identify 5 E@E@WuX#uY~0!, Y~1!, e~X!# quantiles of the two potential outcome distributions. Firpo (2002) develops an estimatorfor such quantiles under un- 5 E@e~X!uY~0!, Y~1!, e~X!# 5 e~X!, confoundedness. wherethe last equality follows fromunconfoundedness. The sameargument shows that E. EfŽ ciencyBounds and Asymptotic V ariancesfor Population-AverageTreatment Effects Pr~W 5 1ue~X!! 5 E@W 5 1ue~X!# 5 E@E@W 5 1uX#ue~X!# Next Ireview some results on the ef Ž ciency bound for 5 E@e~X! e~X!# 5 e~X!. P P u estimators of the ATEs t , and tT.This requires both the assumptions of unconfoundedness and overlap (Assump- Extensions of this result to the multivalued treatmentcase tions 2.1 and 2.2) and some smoothness assumptions on the aregiven in Imbens (2000) and Lechner (2001). To provide conditional expectations of potential outcomes and the treat- intuition forRosenbaum and Rubin ’sresult, recallthe text- ment indicator (fordetails, see Hahn, 1998). Formally,Hahn book formulafor omitted variable bias in the linear regres- (1998) shows that forany regular estimator for tP, denoted sion model. Suppose wehave aregression model with two by tˆ, with regressors: d ÎN z ~tˆ 2 tP! ? 1~0, V!, Yi 5 b0 1 b1 z Wi 1 b92Xi 1 ei. The bias of omitting X fromthe regression on the coef Ž cient itmust be that on W is equal to b92d, where d isthe vector of coef Ž cients on 2 2 W in regressions of the elements of X on W.By condition- s1~X! s0~X! V $ E 1 1 ~t~X! 2 tP!2 . ing on the propensity score weremove the correlation F e~X! 1 2 e~X! G between X and W, because X \ Wue(X).Hence omitting X no longer leads to any bias (although it maystill lead to Knowledge of the propensity score does not affectthis some efŽ ciency loss). efŽ ciency bound. Hahn also shows that asymptotically linear estimators D.Distributionaland Quantile T reatmentEffects exist with such variance, and hence such ef Ž cient estimators Most of the literaturehas focused on estimating ATEs. can be approximated as Thereare, however, many cases whereone may wish to estimateother featuresof the joint distribution of outcomes. 1 N P P 21/ 2 Lehman (1974) and Doksum (1974) introduce quantile tˆ 5 t 1 c~Yi, Wi, Xi, t ! 1 op~N !, N O treatmenteffects as the differencein quantiles between the i51 two marginal treated and control outcome distributions. 4 where c[ is the efŽ cient score: 4 In contrast, Heckman, Smith, and Clemens (1997) focus on estimation of bounds on the joint distribution of ( Y(0), Y(1)). One cannot without can never observe both potential outcomes simultaneously ,but one can strong untestable assumptions identify the full joint distribution, since one nevertheless derive bounds on functions of the two distributions. 10 THEREVIEW OFECONOMICS AND STATISTICS

wy ~1 2 w! y E@c~Y, W, X, tPuY~0!, Y~1!, X!# 5 Y~1! 2 Y~0! 2 tP, c~ y, w, x, tP! 5 2 S e~ x! 1 2 e~ x! D (1) m1~ x! m0~ x! and thus 2 tP 2 1 @w 2 e~ x!#. S e~ x! 1 2 e~ x!D N P E@t˜u~Yi~0!, Yi~1!, Xi!i51# 5 E@c# # 1 t

1 N P Hahn (1998) also reports the ef Ž ciency bound for t , 5 ~Yi~1! 2 Yi~0!!. T N O both with and without knowledge of the propensity score. i51 P For tT the efŽ ciency bound given knowledge of e(X) is Hence 2 e~X! Var~Y~1!uX! e~X! Var~Y~0!uX! S N E 1 E@t˜ 2 t u~Yi~0!, Yi~1!, Xi!i # F E@e~X!#2 E@e~X!#2~1 2 e~X!! 51 N e~X!2 1 P 2 S 1 ~t~X! 2 t ! . 5 ~Yi~1! 2 Yi~0!! 2 t 5 0. T E@e~X!#2G N O i51

Ifthe propensity score is not known, unlike the bound for Next, consider the normalized variance: P P P t , the efŽ ciency bound for tT isaffected.For tT the bound without knowledge of the propensity score is VP 5 N z E@~t˜ 2 tS!2# 5 N z E@~c# 1 tP 2 tS!2#.

e~X! Var~Y~1!uX! e~X!2 Var~Y~0!uX! Note that the variance of t˜ as an estimator of tP can be E 1 F E@e~X!#2 E@e~X!#2~1 2 e~X!! expressed, using the factthat c[ is the efŽ cient score, as e~X! P 2 # 2 P 2 N z E@~t˜ 2 t ! # 5 N z E@~c! # 5 1 ~t~X! 2 tT! 2 , E P P S P S @e~X!# G N z E@~c# ~Y, W, X, t ! 1 ~t 2 t ! 2 ~t 2 t !!2#. which is higher by Because

e~X!~1 2 e~X!! P P S P S P 2 E # Y, W, X, 1 2 z 2 5 0 E ~t~X! 2 t ! z . @~c~ t ! ~t t !! ~t t !# F T E@e~X!#2 G [asfollows by using iterated expectations, Ž rst conditioning The intuition that knowledge of the propensity score affects on X, Y(0), and Y(1)],it follows that the efŽ ciency bound forthe average effectfor the treated P 2 S 2 S P 2 (PATT),but not forthe overall average effect(P ATE),goes N z E@~t˜ 2 t ! # 5 N z E@~t˜ 2 t ! # 1 N z E@~t 2 t ! # as follows. Both areweighted averages of the treatment 5 N z E@~t˜ 2 tS!2# 1 N z E@~Y~1! 2 Y~0! 2 tP!2#. effectconditional on the covariates, t( x).For the PATEthe weight is proportional to the density of the covariates, Thus, the samestatistic that as an estimator of the popula- whereas forthe PATTthe weight is proportional to the tion average treatmenteffect tP has anormalized variance product of the density of the covariates and the propensity equal to VP,asanestimator of tS has the property score (see, forexample, Hirano, Imbens, and Ridder, 2003). d Knowledge of the propensity score implies one does not S S ÎN~t˜ 2 t ! ? 1~0, V !, need to estimatethe weight function and thus improves precision. with

S P P 2 F. EfŽ ciencyBounds and Asymptotic V ariancesfor V 5 V 2 E@~Y~1! 2 Y~0! 2 t ! #. Conditionaland Sample Average Treatment Effects As an estimator of tS the variance of t˜ is lower than its Consider the leading termof the ef Ž cient estimator for variance as an estimator of tP,with the differenceequal to P the PATE, t˜ 5 t 1 c# , where c# 5 (1/N) ¥ c(Yi, Wi, Xi, the variance of the treatmenteffect. tP),and let us view this as an estimator forthe SATE, The sameline of reasoning can be used to show that instead of as an estimatorfor the PATE.Iwill show that, d Ž rst, this estimator is unbiased, conditional on the covariates ÎN~t˜ 2 t~X!! ? 1~0, V t~X!!, and the potential outcomes, and second, it has lower vari- ance asan estimator of the SATEthan as anestimator of the with PATE.To see that the estimatoris unbiased note that with the efŽ cient score c( y, w, x, t)given in equation (1), V t~X! 5 VP 2 E@t~X! 2 tP!2], AVERAGE TREATMENTEFFECTS 11 and and use this upper-bound variance estimateto construct conŽ dence intervals that areguaranteed to be conservative. VS 5 V t~X! 2 E@~Y~1! 2 Y~0! 2 t~X!!2#. Note the connection with Neyman ’s(1923) discussion of conservative con Ž dence intervals foraverage treatmentef- An example to illustrate these points maybe helpful. fectsin experimental settings. Itshould be noted that the differencebetween these variances is of the sameorder as Suppose that X [ {0, 1}, with Pr( X 5 1) 5 px and Pr(W 5 1uX) 5 1/2. Suppose that t( x) 5 2x 2 1, and the variance itself, and thereforenot asmall-sampleprob- 2 lem.Only when the treatmenteffect is known tobe constant sw( x)is very smallfor all x and w.In that case the average can itbe ignored. Depending on the correlation between the treatmenteffect is px z 1 1 (1 2 px) z (21) 5 2px 2 1. The efŽ cient estimatorin this case, assuming only uncon- outcomes and the covariates, this may change the standard foundedness, requires separately estimating t( x) for x 5 0 errorsconsiderably .Itshould also be noted that bootstrap- P 2 and 1, and averaging these two by the empiricaldistribution ping methods in general lead toestimation of E[(t˜ 2 t ) ], 2 of X.The variance of =N(tˆ 2 tS)will be smallbecause rather than E[(t˜ 2 t(X)) ],which aregenerally too big. 2 sw( x)is small, and according to the expressions above, the P variance of =N(t 2 t )will be largerby 4 px(1 2 px). If III.Estimating A verageTreatment Effects px differsfrom 1/ 2, and so PATEdiffersfrom 0, the conŽ dence interval forP ATEin smallsamples will tend to Therehave been anumber of statistics proposed for include zero. In contrast, with s2 ( x)smallenough and N estimating the PATEand PATT,all of which arealso w appropriate estimators of the sample versions (SATEand odd [and both N0 and N1 atleast equal to 2, so that one can 2 S SATT)and the conditional average versions (CATEand estimate sw( x)],the standard con Ž dence interval for t will exclude 0with probability 1. The intuition is that tP is much CATT).(Theimplications of focusing on SATEor CATE moreuncertain because it depends on the distribution of the ratherthan PATEonly arisewhen estimating the variance, covariates, whereas the uncertainty about tS depends only and so Iwill return tothis distinction in section IV.In the on the conditional outcome variances and the propensity current section all discussion applies equally to all esti- score. mands.) HereI review some of these estimators, organized The differencein asymptotic variances raises the issue of into Ž ve groups. how to estimatethe variance of the sample average treat- The Ž rst set, referredto as regression estimators, consists ment effect.Speci Ž cestimators forthe variance will be of methods that rely on consistent estimation of the two discussed in section IV,but here Iwill introduce some conditional regression functions, m0( x) and m1( x). These general issues surrounding their estimation. Because the estimators differin the way that they estimatethese ele- two potential outcomes forthe sameunit arenever observed ments, but all rely on estimators that areconsistent forthese simultaneously, one cannot directly inferthe variance of the regression functions. treatmenteffect. This is the sameissue as the nonidenti Ž - The second set, matching estimators, compare outcomes cation of the correlation coef Ž cient. One can, however, across pairs of matched treated and control units, with each estimatea lower bound on the variance of the treatment unit matched to a Ž xed number of observations with the effect,leading to an upper bound on the variance of the opposite treatment.The bias of these within-pair estimates of the average treatmenteffect disappears asthe sample size estimator of the SATE,which is equal to V t~X!. Decompos- increases, although their variance does not go to zero, ing the variance as because the number of matches remains Ž xed. The third set of estimators is characterizedby acentral E@~Y~1! 2 Y~0! 2 tP!2# 5 V~E@Y~1! 2 Y~0! 2 tP X#! u role forthe propensity score. Four leading approaches in 1 E@V~Y~1! 2 Y~0! 2 tPuX!#, this set areweighting by the reciprocal of the propensity score, blocking on the propensity score, regression on the P 2 2 propensity score, and matching on the propensity score. 5 V~t~X! 2 t ! 1 E@s1~X! 1 s0~X! The fourth category consists of estimators that rely on a 2 2r~X!s0~X!s1~X!#, combination of these methods, typically combining regres- sion with one of its alternatives. The motivation forthese wecan consistently estimatethe Ž rst term,but generally say combinations is that although in principle any one of these little about the second other than that it is nonnegative. One methods can remove all of the bias associated with the S can thereforebound the variance of t˜ 2 t fromabove by covariates, combining two maylead to morerobust infer- ence. For example, matching leads to consistent estimators P 2 P 2 E@c~Y, W, X, t ! # 2 E@~Y~1! 2 Y~0!! 2 t ! ] foraverage treatmenteffects under weak conditions, so # E@c~Y, W, X, tP!2# 2 E@~t~X! 2 tP!2# matching and regression can combine some of the desirable variance properties of regression with the consistency of 2 2 s1~X! s0~X! matching. Similarly,acombination of weighting and regres- 5 E 1 5 V t~X!, F e~X! 1 2 e~X!G sion, using parametricmodels forboth the propensity score 12 THEREVIEW OFECONOMICS AND STATISTICS and the regression functions, can lead to an estimatorthat is high-dimensional covariate spaces, as they can be consistent even ifonly one of the models is correctly masked forany single variable. speciŽ ed (“doubly robust ” in the terminology of Robins & Ritov,1997). A.Regression Finally,in the Ž fth group Iwill discuss Bayesian ap- The Ž rst class of estimators relieson consistent estima- proaches toinference foraverage treatmenteffects. tion of m ( x) for w 5 0, 1. Given mˆ ( x) for these Only some of the estimators discussed below achieve the w w regression functions, the PATE,SATE,and CATEareesti- semiparametricef Ž ciency bound, yet this does not mean matedby averaging their differences over the empirical that these should necessarily be preferredin practice —that distribution of the covariates: is, in Ž nite samples. Moregenerally ,the debate concerning the practicaladvantages of the various estimators, and the 1 N settings in which some aremore attractive than others, is tˆ 5 @mˆ ~X ! 2 mˆ ~X !#. (2) reg N O 1 i 0 i still ongoing, with asof yet no Ž rmconclusions. Although i51 all estimators, either implicitly or explicitly,estimatethe two unknown regression functions or the propensity score, In most implementations the average of the predicted they do so in very differentways. Differencesin smoothness treated outcome forthe treated is equal to the average of the regression function or the propensity score, or relative observed outcome forthe treated [so that ¥i Wi z mˆ 1(Xi) 5 discreteness of the covariates in speci Ž capplications, may ¥i Wi z Yi],and similarly forthe controls, implying that tˆreg affectthe relative attractiveness of the estimators. can also be written as In addition, even the appropriateness of the standard asymptotic distributions as aguide towards Ž nite-sample 1 N W z @Y 2 mˆ ~X !# 1 ~1 2 W ! z @mˆ ~X ! 2 Y #. performanceis still debated (see, forexample, Robins & N O i i 0 i i 1 i i Ritov,1997, and Angrist &Hahn, 2004). Akey featurethat i51 casts doubt on the relevance of the asymptotic distributions For the PATTand SATTtypically only the control regres- = is that the N consistency is obtained by averaging a sion function is estimated; weonly need predict the out- nonparametric estimator of aregression function that itself comeunder the control treatmentfor the treated units. The has aslow nonparametric convergence rateover the empir- estimator then averages the differencebetween the actual icaldistribution of its argument. The dimension of this outcomes forthe treated and their estimated outcomes under argument affectsthe rateof convergence forthe unknown the control: function [the regression functions mw( x)or the propensity score e( x)],but not the rateof convergence forthe estima- N 1 tor of the parameterof interest, the average treatmenteffect. tˆ reg,T 5 O Wi z @Yi 2 mˆ 0~Xi!#. (3) NT In practice, however, the resulting approximations of the i51 ATEcan be poor ifthe argument is of high dimension, in which case information about the propensity score is of Earlyestimators for mw( x)included parametricregres- particular relevance. Although Hahn (1998) showed, as sion functions —forexample, linear regression (asin Rubin, discussed above, that forthe standard asymptotic distribu- 1977). Such parametricalternatives include least squares tions knowledge of the propensity score is irrelevant (and estimators with the regression function speci Ž ed as conditioning only on the propensity score is in factless m ~ x! 5 b9x 1 t z w, efŽ cient than conditioning on all covariates), conditioning w on the propensity score involves only one-dimensional non- inwhich case the average treatmenteffect is equal to t. In parametricregression, suggesting that the asymptotic ap- this case one can estimate t directly by least squares proximations may be moreaccurate. Inpractice, knowledge estimation using the regression function of the propensity score may thereforebe very informative. Another issue that is important in judging the various Yi 5 a 1 b9Xi 1 t z Wi 1 ei. estimatorsis how well they performwhen there is only limited overlap in the covariate distributionsof the two Moregenerally ,one can specify separate regression func- treatmentgroups. Ifthere areregions in the covariate tions forthe two regimes: space with little overlap (propensity score close to 0or mw~ x! 5 b9wx. 1), ATEestimatorsshould have relatively high variance. However, this isnot always the case forestimator sbased In that case one can estimatethe two regression functions on tightly parametrized models forthe regression func- separately on the two subsamples and then substitute the tions, where outliers in covariate values can lead to predicted values in equation (2). These simple regression spurious precision forregression parameters. Regions of estimators may be very sensitive to differences in the smalloverlap can also be dif Ž cult to detect directly in covariate distributions fortreated and control units. The AVERAGE TREATMENTEFFECTS 13 reason is that in that case the regression estimators rely of the dimension of the covariates again forother estima- heavily on extrapolation. To see this, note that the regression tors. function forthe controls, m0( x),is used to predict missing For the average treatmenteffect for the treated (PATT),it outcomes forthe treated. Hence on average one wishes to is important to note that with the propensity score known, predict the control outcome at X# T,the average covariate the estimatorgiven in equation (3)is generally not ef Ž cient, value forthe treated. With alinear regression function, the irrespective of the estimator for m0( x).Intuitively,this is average prediction can be written as Y# C 1 bˆ 9(X# T 2 X# C). because with the propensity score known, the average ¥ With X# T very close to the average covariate value forthe WiYi/NT is not efŽ cient forthe population expectation controls, X# C,the precise speci Ž cation of the regression E[Y(1)uW 5 1]. An efŽ cient estimator (asin Hahn, 1998) function will not mattervery much forthe average predic- can be obtained by weighting all the estimated treatment tion. However, with the two averages very different, the effects, mˆ 1(Xi) 2 mˆ 0(Xi),by the probability of receiving the prediction based on alinear regression function can be very treatment: sensitive to changes in the speci Ž cation. Morerecently ,nonparametric estimators have been pro- N N posed. Hahn (1998) recommends estimating Ž rst the three t˜ reg,T 5 O e~Xi! z @mˆ 1~Xi! 2 mˆ 0~Xi!# O e~Xi!. (4) i51 Y i51 conditional expectations g1( x) 5 E[WYuX ], g0( x) 5 E[(1 2 W )YuX ], and e( x) 5 E[WuX]nonparametrically Inother words, instead of estimating E[Y(1)uW 5 1] as ¥ using series methods. Hethen estimates mw( x) as WiYi/NT using only the treated observations, it is estimated using all units, as ¥ mˆ 1(Xi) z e(Xi)/¥ e(Xi).Knowledge of gˆ 1~x! gˆ 0~x! mˆ ~x! 5 , mˆ ~x! 5 , the propensity score improves the accuracy because it al- 1 eˆ x 0 1 2 eˆ x ~ ! ~ ! lows one to exploit the control observations to adjust for and shows that the estimators forboth PATEand PATT imbalances in the sampling of the covariates. achieve the semiparametricef Ž ciency bounds discussed in For all of the estimators inthis section animportant issue section IIE(the lattereven when the propensity score is is the choice of the smoothing parameter.In Hahn ’s case, unknown). afterchoosing the formof the series and the sequence, the Using this series approach, however, it is unnecessary to smoothing parameteris the number of termsin the series. In estimateall three of these conditional expectations Heckman, Ichimura, and Todd ’scase it is the bandwidth of (E[YWuX], E[Y(1 2 W)uX], and E[WuX])to estimate the kernel chosen. The evaluation literaturehas been largely silent concerning the optimal choice of the smoothing pa- mw( x).Instead one can use series methods to directly rameters,although the largerliterature on nonparametric estimatethe two regression functions mw( x),eliminating the need to estimatethe propensity score (Imbens, Newey, estimation of regression functions does provide some guid- and Ridder, 2003). ance, offering data-driven methods such as cross-validation Heckman, Ichimura, and Todd (1997, 1998) and Heck- criteria.The optimality properties of these criteria,however, man, Ichimura, Smith, and Todd (1998) consider kernel arefor estimation of the entire function, in this case mw( x). methods forestimating m ( x),in particular focusing on Typically the focus is on mean-integrated-squared-error w 2 local linear approaches. The simple kernel estimatorhas the criteriaof the form *x [mˆ w( x) 2 mw( x)] fX( x) dx, with form possibly anadditional weight function. Inthe current prob- lem,however, one is interested speci Ž cally in the average

Xi 2 x Xi 2 x treatmenteffect, and so such criteriaare not necessarily mˆ ~x! 5 Y z K K , w O i S h DY O S h D optimal. In particular, global smoothing parametersmay be i:Wi5w i:Wi5w inappropriate, because they can be driven by the shape of the regression function and distribution of covariates in with akernel K[ and bandwidth h.In the local linear regions that arenot important forthe average treatment kernel regression the regression function mw( x)is estimated asthe intercept b in the minimization problem effectof interest. LaLonde ’s(1986) data set isawell-known 0 example of this where much of probability mass of the

Xi 2 x nonexperimental control group is in aregion with moderate min @Y 2 b 2 b9~X 2 x!#2 z K . O i 0 1 i S h D tohigh earnings wherefew of the treated group arelocated. b0,b1 i:Wi5w Thereis little evidence whether results foraverage treat- ment effectsare more or less sensitive to the choice of In order to control the bias of their estimators, Heckman, smoothing parameterthan results forestimation of the Ichimura, and Todd (1998) require that the order of the regression functions themselves. kernel be atleast aslarge asthe dimension of the covariates. That is, they require the use of akernel function K( z) such r B. Matching that *z z K( z) dz 5 0 for r # dim (X),so that the kernel must be negative on part of the range, and the implicit Asseen above, regression estimators impute the missing averaging involves negative weights. Weshall see this role potential outcomes using the estimated regression function. 14 THEREVIEW OFECONOMICS AND STATISTICS

5 Thus, if Wi 5 1, then Yi(1)is observed and Yi(0)is missing Abadie etal., 2003). Formally,given asample, {( Yi, Xi, N and imputed with aconsistent estimator mˆ 0(Xi) for the Wi)}i51, let ,m(i)be the index l that satisŽ es Wl Þ Wi and conditional expectation. Matching estimators also impute the missing potential outcomes, but do so using only the O 1$iXj 2 Xii # iXl 2 Xii% 5 m, outcomes of nearest neighbors of the opposite treatment juWjÞWi group. Inthat respect matching is similarto nonparametric where 1{z }is the indicator function, equal to 1ifthe kernel regression methods, with the number of neighbors expression in brackets is true and 0otherwise. In other playing the role of the bandwidth inthe kernel regression. A words, , (i)is the index of the unit in the opposite treat- formaldifference is that the asymptotic distribution is de- m ment group that is the mth closest to unit i in termsof the rived conditional on the implicit bandwidth, that is, the distance measurebased on the norm i z i.Inparticular, , (i) number of neighbors, which is often Ž xed atone. Using 1 isthe nearest matchfor unit i. Let )M(i)denote the set of such asymptotics, the implicit estimate mˆ w( x)is (close to) indices forthe Ž rst M matches forunit i: )M(i) 5 unbiased, but not consistent for mw( x).In contrast, the {,1(i), . . . , ,M(i)}. DeŽ ne the imputed potential outcomes regression estimators discussed inthe previous section rely as on the consistency of mw( x).

Matching estimators have the attractive featurethat given Yi if Wi 5 0, the matching metric,the researcheronly has to choose the 1 Yˆ i~0! 5 number of matches. In contrast, forthe regression estima- Yj M O if Wi 5 1, tors discussed above, the researchermust choose smoothing 5 j[)M~i! parametersthat aremore dif Ž cult to interpret: either the number of termsin aseries or the bandwidth in kernel and regression. Within the class of matching estimators, using 1 only asingle matchleads to the most credible inference with Yj if W 5 0, ˆ M O i the least bias, at most sacri Ž cing some precision. This can Yi~1! 5 j[)M~i! makethe matching estimatoreasier to use than those esti- 5 Yi if Wi 5 1. mators that require morecomplex choices of smoothing parameters,and mayexplain some of its popularity. The simple matching estimatordiscussed by Abadie and Matching estimators have been widely studied in practice Imbens is then and theory (forexample, Gu &Rosenbaum, 1993; Rosen- baum, 1989, 1995, 2002; Rubin, 1973b, 1979; Heckman, 1 N tˆ sm 5 @Yˆ ~1! 2 Yˆ ~0!#. (5) Ichimura, &Todd, 1998; Dehejia &Wahba, 1999; Abadie & M N O i i Imbens, 2002). Most often they have been applied in set- i51 tings with the following two characteristics: (i)the interest They show that the bias of this estimator is O(N21/k), where is in the average treatmenteffect for the treated, and (ii) k is the dimension of the covariates. Hence, ifone studies there is alarge reservoir of potential controls. This allows the asymptotic distribution of the estimator by normalizing the researcherto matcheach treated unit to one or more by = [ascan be justi Ž edby the factthat the variance of distinct controls (referredto as matching without replace- N the estimator is O(1/N)],the bias does not disappear ifthe ment).Given the matched pairs, the treatmenteffect within dimension of the covariates is equal to 2, and will dominate apair is then estimated as the differencein outcomes, with the large sample variance if k isatleast 3. an estimatorfor the PATTobtained by averaging these Letme makeclear three caveats to Abadie and Imbens ’s within-pair differences. Since the estimator is essentially the result. First, it is only the continuous covariates that should differencebetween two sample , the variance is cal- be counted in this dimension, k.With discrete covariates the culated using standard methods fordifferences in means or matching will be exact in large samples; thereforesuch methods forpaired randomized experiments. The remaining covariates do not contribute to the order of the bias. Second, bias is typically ignored in these studies. The literaturehas ifone matches only the treated, and the number of potential studied fastalgorithms formatching the units, as fully controls is much largerthan the number of treated units, one efŽ cient matching methods arecomputationally cumber- can justify ignoring the bias by appealing to anasymptotic some (see, forexample, Gu and Rosenbaum, 1993; Rosen- sequence wherethe number of potential controls increases baum, 1995). Note that in such matching schemes the order fasterthan the number of treated units. Speci Ž cally,ifthe inwhich the units arematched may be important. number of controls, N0,and the number of treated, N1, Abadie and Imbens (2002) study both bias and variance 4/k satisfy N1/N0 ? 0,then the bias disappears in large in amoregeneral setting whereboth treated and control samples afternormalization by =N1.Third, even though units are(potentially) matched and matching is done with replacement(as in Dehejia &Wahba, 1999). The Abadie- 5 See Becker and Ichino (2002) and Sianesi (2001) for alternative Stata Imbens estimator is implemented in Matlab and Stata (see implementations of estimators for average treatment effects. AVERAGE TREATMENTEFFECTS 15 the order of the bias maybe high, the actual bias may still Zhao (2004), in an interesting discussion of the choice of be smallif the coef Ž cients in the leading termare small. metrics,suggests some alternatives that depend on the This is possible ifthe biases fordifferent units areat least correlation between covariates, treatmentassignment, and partially offsetting. For example, the leading termin the outcomes. Hestarts by assuming that the propensity score bias relieson the regression function being nonlinear, and has alogistic form the density of the covariates having anonzero slope. Ifone of these two conditions is atleast close to being satis Ž ed, the exp~x9g! e~ x! 5 , resulting bias may be fairlylimited. To remove the bias, 1 1 exp~x9g! Abadie and Imbens suggest combining the matching pro- cess with aregression adjustment, as Iwill discuss in and that the regression functions arelinear: section IIID. Another point madeby Abadie and Imbens isthat match- mw~ x! 5 aw 1 x9b. ing estimators aregenerally not ef Ž cient. Even in the case where the bias is of low enough order to be dominated by Hethen considers two alternative metrics.The Ž rst weights the variance, the estimators arenot ef Ž cient given a Ž xed absolute differences in the covariates by the coef Ž cient in number of matches. Toreachef Ž ciency one would need to the propensity score: increase the number of matches with the sample size. If K M ? `, with M/N ? 0, then the matching estimator is essentially like aregression estimator, with the imputed dZ1~ x, z! 5 O uxk 2 zku z ugku, missing potential outcomes consistent fortheir conditional k51 expectations. However, the ef Ž ciency gain of such estima- and the second weights them by the coef Ž cients in the tors is of course somewhat arti Ž cial. Ifin agiven data set regression function: one uses M matches, one can calculate the variance as ifthis number of matches increased atthe appropriate ratewith the K sample size, in which case the estimatorwould be ef Ž cient, d ~ x, z! 5 ux 2 z u z ub u, or one could calculate the variance conditional on the Z2 O k k k k51 number of matches, in which case the sameestimator would be inefŽ cient. Littleis yet known about the optimal number th where xk and zk are the k elements of the K-dimensional of matches, or about data-dependent ways of choosing this vectors x and z respectively. number. In light of this discussion, it is interesting to consider In the above discussion the distance metricin choosing optimality of the metric.Suppose, following Zhao (2004), the optimal matches was the standard Euclidean metric: that the regression functions arelinear with coef Ž cients bw. Now consider atreated unit with covariate vector x who will dE~ x, z! 5 ~ x 2 z!9~ x 2 z!. be matched to acontrol unit with covariate vector z. The All of the distance metricsused in practicestandardize the bias resulting fromsuch amatchis ( z 2 x)9b0. If one is covariates in some manner. Abadie and Imbens use the interested in minimizing foreach matchthe squared bias, diagonal matrixof the inverse of the covariate variances: one should choose the Ž rst matchby minimizing over the

21 control observations ( z 2 x)9b0b90( z 2 x).Yettypically dAI~ x, z! 5 ~ x 2 z!9 diag ~SX ! ~x 2 z!, one does not know the value of the regression coef Ž cients, in which case one may wish to minimizethe expected where SX is the covariance matrixof the covariates. The most common choice is the Mahalanobis metric(see, for squared bias. Using anormal distribution forthe regression example, Rosenbaum and Rubin, 1985), which uses the errors, and a  at prior on b0,the posterior distribution for b0 ˆ 21 2 inverse of the covariance matrixof the pretreatmentvari- isnormal with mean b0 and variance SX s /N. Hence the ables: expected squared bias froma matchis

21 dM~ x, z! 5 ~ x 2 z!9SX ~ x 2 z!. potential matches are unit j with Xj1 5 Xj2 5 5 and unit k with Xk1 5 4 and Xk2 5 0.The difference in covariates for the Ž rst match is the vector This metrichas the attractive property that it reduces dif- (5, 5)9,and the difference in covariates for the second match is (4, 0) 9. Intuitively it may seem that the second match is better: it is strictly closer ferencesin covariates within matched pairs in all direc- to the treated unit than the Ž rst match for both covariates. Using the 6 21 tions. Seefor more formal discussions Rubin and Thomas Abadie-Imbens metric diag ( SX ),this is in fact true. Under that metric (1992). the distance between the second match and the treated unit is 16, considerably smaller than 50, the distance between the Ž rst match and the treated unit. Using the Mahalanobis metric, however, the distance for the 6 However, using the Mahalanobis metric can also have less attractive Ž rst match is 26, and the distance for the second match is much higher at implications. Consider the case where one matches on two highly corre- 84. Because of the correlation between the covariates in the sample, the lated covariates, X1 and X2 with equal variances. For speci Ž city, suppose difference between the matches is interpreted very differently under the that the correlation coef Ž cient is 0.9 and both variances are 1. Suppose two metrics. To choose between the standard and the Mahalanobis metric that we wish to match atreated unit i with Xi1 5 Xi2 5 0. The two one needs to consider what the appropriate match would be in this case. 16 THEREVIEW OFECONOMICS AND STATISTICS

ˆ ˆ 2 21 E@~ z 2 x!9b0b90~ z 2 x!# 5 ~ z 2 x!9~b0b09 1 s SX /N! two conditional expectations mw( x),they require instead the equally high-dimensional nonparametric regression of the 3 ~z 2 x!. treatmentindicator on the covariates. Inpracticethe relative meritsof these estimators will depend on whether the Inthis argument the optimal metricis acombination of the propensity score is moreor less smooth than the regression sample covariance matrixplus the outer product of the functions, and on whether additional information is avail- regression coef Ž cients, with the formerscaled down by a able about either the propensity score or the regression factor 1/N: functions. d*~ z, x! 5 ~ z 2 x!9~bˆ bˆ 9 1 s2 S21 /N!~z 2 x!. w w w X,w Weighting: The Ž rst set of propensity-score estimators Aclearproblem with this approach is that when the regres- use the propensity scores as weights to createa balanced sion function is misspeci Ž ed, matching with this particular sample of treated and control observations. Simply taking metricmay not lead to aconsistent estimator. Onthe other the differencein average outcomes fortreated and controls, hand, when the regression function is correctly speci Ž ed, it WiYi ~1 2 Wi!Yi would be moreef Ž cient to use the regression estimators tˆ 5 O 2 O , than any matching approach. In practiceone maywant to O Wi O 1 2 Wi use ametricthat combines some of the optimal weighting with some safeguards in case the regression function is is not unbiased for tP 5 E[Y(1) 2 Y(0)],because, misspeciŽ ed. conditional on the treatmentindicator, the distributions of So farthere is little experience with any alternative the covariates differ.By weighting the units by the recipro- metricsbeyond the Mahalanobis metric.Zhao (2004) re- calof the probability of receiving the treatment,one can ports the results of some simulations using his proposed undo this imbalance. Formally,weighting estimators rely on metrics, Ž nding no clearwinner given his speci Ž c design, the equalities although his Ž ndings suggest that using the outcomes in WY WY~1! WY~1! deŽ ning the metricis apromising approach. E 5 E 5 E E X F e~X!G F e~X! G F F e~X! U GG C.PropensityScore Methods e~X! z E@Y~1!uX# 5 E 5 E@Y~1!#, Since the work by Rosenbaum and Rubin (1983a) there F e~X! G has been considerable interest in methods that avoid adjust- ing directly forall covariates, and instead focus on adjusting using unconfoundedness in the second to last equality,and fordifferences in the propensity score, the conditional similarly probability of receiving the treatment.This can be imple- mented in anumber of differentways. One can weight the ~1 2 W!Y E 5 E@Y~0!#, observations using the propensity score (and indirectly also F 1 2 e~X! G intermsof the covariates) to createbalance between treated and control units in the weighted sample. Hirano, Imbens, implying and Ridder (2003) show how such estimators can achieve the semiparametricef Ž ciency bound. Alternatively one can W z Y ~1 2 W! z Y tP 5 E 2 . divide the sample into subsamples with approximately the F e~X! 1 2 e~X! G samevalue of the propensity score, atechnique known as blocking. Finally,one can directly use the propensity score With the propensity score known one can directly imple- asaregressor in aregression approach. ment this estimator as In practicethere aretwo important cases. First, suppose the researcherknows the propensity score. In that case all N 1 WiYi ~1 2 Wi!Yi three of these methods arelikely to be effectivein elimi- t˜ 5 O 2 . (6) N Se~Xi! 1 2 e~Xi! D nating bias. Even ifthe resulting estimator is not fully i51 efŽ cient, one can easily modify it by using aparametric estimateof the propensity score to capture most of the Inthis particular formthis isnot necessarily an attractive efŽ ciency loss. Furthermore, since these estimators do not estimator. The main reason is that, although the estimator rely on high-dimensional nonparametric regression, this can be written asthe differencebetween aweighted average suggests that their Ž nite-sample properties arelikely to be of the outcomes forthe treated units and aweighted average relatively attractive. of the outcomes forthe controls, the weights do not neces- Ifthe propensity score is not known, the advantages of sarily add to 1. Speci Ž cally,in equation (6), the weights for the estimators discussed below areless clear.Although they the treated units add up to [ ¥ Wi/e(Xi)]/N.In expectation avoid the high-dimensional nonparametric regression of the this is equal to 1, but because its variance ispositive, in any AVERAGE TREATMENTEFFECTS 17 given sample some of the weights arelikely to deviate contribution forunit i by the propensity score e( xi). If the from 1. propensity score is known, this leads to One approach forimproving this estimatoris simply to normalize the weights to unity.One can further normalize N e X N e X the weights to unity within subpopulations as de Ž ned by the ~ i! ~ i! tˆ weight,tr 5 O Wi z Yi z O Wi eˆ~Xi! eˆ~Xi! covariates. In the limitthis leads to an estimatorproposed i51 Y i51 by Hirano, Imbens, and Ridder (2003), who suggest using a nonparametric series estimator for e( x).Moreprecisely , N N they Ž rst specify asequence of functions of the covariates, e~Xi! e~Xi! 2 O ~1 2 Wi! z Yi z O ~1 2 Wi! , such as power series h ( x), l 5 1, . . . , `. Next, they 1 2 eˆ~Xi! 1 2 eˆ ~Xi! l i51 Y i51 choose anumber of terms, L(N),as afunction of the sample size, and then estimatethe L-dimensional vector g in L wherethe propensity score enters in some places as the true score (forthe weights to get the appropriate estimand) and exp@~h1~x!, . . . , hL~x!!gL# Pr~W 5 1uX 5 x! 5 , inother cases asthe estimated score (to achieve ef Ž ciency). 1 1 exp@~h1~x!, . . . , hL~x!!gL# Inthe unknown propensity score case one always uses the estimated propensity score, leading to by maximizing the associated likelihood function. Let gˆ L be the maximumlikelihood estimate.In the third step, the 1 estimated propensity score is calculated as tˆ weight,tr 5 O Yi N1 F i : Wi51 G exp@~h1~x!, . . . , hL~x!!gˆ L# eˆ~ x! 5 . 1 1 exp@~h1~x!, . . . , hL~x!!gˆ L# eˆ~Xi! eˆ~Xi! 2 O Yi z O . 1 2 eˆ~Xi! 1 2 eˆ~Xi! Finally they estimatethe average treatmenteffect as F i : Wi50 Y i : Wi50 G

N W z Y N W One difŽ culty with the weighting estimators that are i i i based on the estimated propensity score is again the prob- tˆ weight 5 O O eˆ~Xi! eˆ~Xi! i51 Y i51 lemof choosing the smoothing parameters.Hirano, Imbens, (7) and Ridder (2003) use series estimators, which requires N N choosing the number of termsin the series. Ichimuraand ~1 2 Wi! z Yi 1 2 Wi 2 O O . Linton (2001) consider akernel version, which involves 1 2 eˆ ~Xi! 1 2 eˆ~Xi! i51 Y i51 choosing abandwidth. Theirs is currently one of the few studies considering optimal choices forsmoothing parame- Hirano, Imbens, and Ridder show that with anonparametric tersthat focuses speci Ž cally on estimating average treat- estimatorfor e( x)this estimatoris ef Ž cient, whereas with ment effects.A departure fromstandard problems in choos- the true propensity score the estimatorwould not be fully ing smoothing parametersis that here one wants to use efŽ cient (and in factnot very attractive). nonparametric regression methods even ifthe propensity This estimator highlights one of the interesting featuresof score isknown. For example, ifthe probability of treatment the problem of ef Ž ciently estimating average treatment isconstant, standard optimality results would suggest using effects.One solution is to estimatethe two regression ahigh degree of smoothing, asthis would lead to the most functions mw( x)nonparametrically,asdiscussed in Section accurateestimator for the propensity score. However, this IIIA;that solution completely ignores the propensity score. would not necessarily lead toanef Ž cient estimator forthe Asecond approach is to estimatethe propensity score average treatmenteffect of interest. nonparametrically,ignoring entirely the two regression functions. Ifappropriately implemented, both approaches Blocking on the Propensity Score: In their original lead to fully ef Ž cient estimators, but clearly their Ž nite- propensity-score paper Rosenbaum and Rubin (1983a) sug- sample properties maybe very different, depending, for gest the following blocking-on-the-propensity-score estima- example, on the smoothness of the regression functions tor. Using the (estimated)propensity score, divide the sam- versus the smoothness of the propensity score. Ifthere is ple into M blocks of units of approximately equal only asingle binary covariate, or moregenerally ifthere are probability of treatment,letting Jim be an indicator forunit only discrete covariates, the weighting approach with afully i being in block m.One way of implementing this is by nonparametric estimator forthe propensity score is numer- dividing the unit interval into M blocks with boundary ically identical to the regression approach with afully values equal to m/M for m 5 1, . . . , M 2 1, so that nonparametric estimator forthe two regression functions. To estimatethe average treatmenteffect for the treated m 2 1 m ratherthan forthe full population, one should weight the J 5 1 , e~X ! # im H M i MJ 18 THEREVIEW OFECONOMICS AND STATISTICS for m 5 1, . . . , M.Within each block there are Nwm speciŽ cation of the propensity score. Often some informal observations with treatmentequal to w, Nwm 5 ¥i 1{Wi 5 version of the following algorithm isused: Ifwithin ablock w, Jim 5 1}. Given these subgroups, estimatewithin each the propensity score itself is unbalanced, the blocks aretoo block the average treatmenteffect as ifrandom assignment large and need to be split. If,conditional on the propensity held: score being balanced, the covariates areunbalanced, the speciŽ cation of the propensity score is not adequate. No 1 N 1 N formalalgorithm has been proposed forimplementing these tˆ m 5 O JimWiYi 2 O Jim~1 2 Wi!Yi. blocking methods. N1m N0m i51 i51 Analternative approach to Ž nding the optimal number of blocks is to relatethis approach to the weighting estimator Then estimatethe overall average treatmenteffect as discussed above. One can view the blocking estimator as identical toaweighting estimator, with amodi Ž ed estimator M N1m 1 N0m forthe propensity score. Speci Ž cally,given the original tˆ 5 tˆ z . block O m N estimator eˆ( x),in the blocking approach the estimatorfor m51 the propensity score is discretized to Ifone is interested inthe average effectfor the treated, one M will weight the within-block average treatmenteffects by 1 m e˜~ x! 5 1 # eˆ~ x! . the number of treated units: M O H M J m51

M N1m Using e˜( x)as the propensity score in the weighting estima- tˆ T,block 5 O tˆm z . tor leads to an estimator forthe average treatmenteffect NT m51 identical to that obtained by using the blocking estimator with eˆ( x)as the propensity score and M blocks. With Blocking can be interpreted asacrude formof nonpara- sufŽ ciently large M,the blocking estimatoris suf Ž ciently metricregression wherethe unknown function is approxi- close to the original weighting estimator that it shares its matedby astep function with Ž xed jump points. To estab- Ž rst-order asymptotic properties, including its ef Ž ciency. lish asymptotic properties forthis estimatorwould require This suggests that in general there islittle harmin choosing establishing conditions on the rateat which the number of alarge number of blocks, atleast with regard to asymptotic blocks increases with the sample size. With the propensity properties, although again the relevance of this for Ž nite score known, these areeasy todetermine; no formalresults samples has not been established. have been established forthe unknown propensity score case. Regression on the Propensity Score: The third method The question arises how many blocks to use in practice. of using the propensity score is to estimatethe conditional Cochran (1968) analyzes acase with asingle covariate and, expectation of Y given W and e(X). DeŽ ne assuming normality,shows that using Ž ve blocks removes at least 95% of the bias associated with that covariate. Since nw~e! 5 E@Y~w!ue~X! 5 e#. all bias, under unconfoundedness, is associated with the propensity score, this suggests that under normality the use Byunconfoundedness this is equal to E[YuW 5 w, e(X) 5 of Ž ve blocks removes most of the bias associated with all e].Given an estimator nˆw(e),one can estimatethe average the covariates. This has often been the starting point of treatmenteffect as empiricalanalyses using this estimator(for example, Rosenbaum and Rubin, 1983b; Dehejia and Wahba, 1999) 1 N and has been implemented in Stata by Becker and Ichino tˆ regprop 5 O @nˆ 1~e~Xi!! 2 nˆ 0~e~Xi!!#. 7 N (2002). Often, however, researcherssubsequently check i51 the balance of the covariates within each block. Ifthe true propensity score per block is constant, the distribution of the Heckman, Ichimura, and Todd (1998) consider alocal linear covariates among the treated and controls should be identi- version of this forestimating the average treatmenteffect cal, or, in the evaluation terminology,the covariates should forthe treated. Hahn (1998) considers aseries version and be balanced. Hence one can assess the adequacy of the shows that it is not as ef Ž cient as the regression estimator statistical model by comparing the distribution of the co- based on adjustment forall covariates. variates among treated and controls within blocks. Ifthe distributions arefound to be different, one can either split Matching on the Propensity Score: Rosenbaum and the blocks into anumber of subblocks, or generalize the Rubin’sresult implies that it issuf Ž cient to adjust solely for differences in the propensity score between treated and 7 Becker and Ichino also implement estimators that match on the control units. Since one of the ways in which one can adjust propensity score. fordifferences in covariates is matching, another natural AVERAGE TREATMENTEFFECTS 19 way to use the propensity score is through matching. Be- with the sameweights li.Such an estimator, using amore cause the propensity score is ascalarfunction of the co- general semiparametricregression model, was suggested by variates, the bias results in Abadie and Imbens (2002) imply Robins and Rotnitzky (1995), Robins, Roznitzky,and Zhao that the bias termis of lower order than the variance term (1995), and Robins and Ritov (1997), and implemented by and matching leads to a =N-consistent, asymptotically Hirano and Imbens (2001). Inthe parametriccontext Robins normally distributed estimator. The variance forthe case and Ritov argue that the estimatoris consistent as long as with matching on the true propensity score also follows either the regression model or the propensity score (and thus directly fromtheir results. Morecomplicated is the case the weights) arespeci Ž edcorrectly.That is, in Robins and with matching on the estimated propensity score. Ido not Ritov’sterminology,the estimator isdoubly robust. know of any results that give the variance forthis case. Blocking and Regression: Rosenbaum and Rubin (1983b) suggest modifying the basic blocking estimator by D.MixedMethods using least squares regression within the blocks. Without the Anumber of approaches have been proposed that com- additional regression adjustment the estimated treatment bine two of the three methods described in the previous effectwithin blocks can be written as aleast squares sections, typically regression with one of its alternatives. estimator of tm forthe regression function The reason forthese combinations is that, although one method alone is often suf Ž cient to obtain consistent or even Yi 5 am 1 tm z Wi 1 ei, efŽ cient estimates, incorporating regression mayeliminate remaining bias and improve precision. This is particularly using only the units in block m.Asabove, one can also add useful in that neither matching nor the propensity-score covariates to the regression function methods directly address the correlation between the covari- ates and the outcome. The bene Ž tassociated with combin- Yi 5 am 1 b9m Xi 1 tm z Wi 1 ei, ing methods is madeexplicit in the notion developed by Robins and Ritov (1997) of double robustness. They pro- again estimated on the units in block m. pose acombination of weighting and regression where, as long as the parametricmodel foreither the propensity score Matching and Regression: Because Abadie and Imbens or the regression functions is speci Ž edcorrectly,the result- (2002) have shown that the bias of the simple matching ing estimator forthe average treatmenteffect is consistent. estimatorcan dominate the variance ifthe dimension of the Similarly,matching leads to consistency without additional covariates is too large, additional bias corrections through assumptions; thus methods that combine matching and re- regression can be particularly relevant in this case. Anum- gressions arerobust against misspeci Ž cation of the regres- ber of such corrections have been proposed, Ž rst by Rubin sion function. (1973b) and Quade (1982) inaparametricsetting. Follow- ing the notation of section IIIB,let Yˆ i(0) and Yˆ i(1) be the Weighting and Regression: One can rewritethe weight- observed or imputed potential outcomes forunit i; the ing estimator discussed above as estimating the following estimated potential outcomes equal the observed outcomes regression function by weighted least squares: forsome unit i and forits match ,(i).The bias in their comparison, E[Yˆ i(1) 2 Yˆ i(0)] 2 [Yi(1) 2 Yi(0)],arises fromthe factthat the covariates Xi and X,(i) for units i and Yi 5 a 1 t z Wi 1 ei, ,(i)arenot equal, although they areclose because of the with weights equal to matching process. To further explore this, focusing on the single-match case, deŽ ne foreach unit Wi 1 2 Wi li 5 1 . Î e~Xi! 1 2 e~Xi! Xi if Wi 5 0, Xˆ i~0! 5 H X,1~i! if Wi 5 1 Without the weights the least squares estimatorwould not be consistent forthe average treatmenteffect; the weights and ensure that the covariates areuncorrelated with the treat- ment indicator and hence the weighted estimator isconsis- X,1~i! if Wi 5 0, Xˆ i~1! 5 tent. H Xi if Wi 5 1. This weighted-least-squares representation suggests that one mayadd covariates to the regression function to im- Ifthe matching is exact, Xˆ i(0) 5 Xˆ i(1)foreach unit. Ifnot, prove precision, forexample, these discrepancies may lead to bias. The difference Xˆ i(1) 2 Xˆ i(0) will thereforebe used to reduce the bias of Yi 5 a 1 b9Xi 1 t z Wi 1 ei, the simple matching estimator. 20 THEREVIEW OFECONOMICS AND STATISTICS

Suppose unit i is atreated unit ( Wi 5 1), so that Yˆ i(1) 5 aBayesian perspective. Dehejia (2002) goes further, study- Yi(1) and Yˆ i(0)is an imputed value for Yi(0).This imputed ing the policy decision problem of assigning heterogeneous value is unbiased for (X ) (since Yˆ (0) Y ), but individuals tovarious training programs with uncertain and m0 ,1(i) i 5 ,(i) not necessarily for m0(Xi).One maytherefore wish to adjust variable effects. Yˆ (0)by anestimateof (X ) (X ).Typically these To my knowledge, however, there areno applications i m0 i 2 m0 ,1(i) corrections aretaken to be linear in the differencein the using the Bayesian approach that focus on estimating the covariates forunit i and its match, that is, of the form average treatmenteffect under unconfoundedness, either for [Xˆ (1) Xˆ (0)] (X X ).Rubin (1973b) the whole population or just forthe treated. Neither arethere b90 i 2 i 5 b90 i 2 ,1(i) proposed three corrections, which differin how b0 is esti- simulation studies comparing operating characteristics of mated. Bayesian methods with the frequentist methods discussed in To introduce Rubin ’s Ž rst correction, note that one can the earliersections of this paper. Such aBayesian approach writethe matching estimator as the least squares estimator can be easily implemented with the regression methods forthe regression function discussed insection IIIA.Interestingly,it is less clearhow Bayesian methods would be used with pairwise matching, Yˆ i~1! 2 Yˆ i~0! 5 t 1 ei. which does not appear to have anatural likelihood interpre- tation. This representation suggests modifying the regression func- ABayesian approach to the regression estimators may be tion to useful fora number of reasons. First, one of the leading problems with regression estimators is the presence of many ˆ ˆ ˆ ˆ Yi~1! 2 Yi~0! 5 t 1 @Xi~1! 2 Xi~0!#9b 1 ei, covariates relative to the number of observations. Standard frequentist methods tend to either include those covariates and again estimating t by least squares. without any restrictions, or exclude them entirely.In con- The second correction is to estimate m ( x)directly by 0 trast, Bayesian methods would allow researchersto include taking all control units, and estimatea linear regression of covariates with moreor less informative prior distributions. the form For example, ifthe researcherhas anumber of lagged outcomes, one mayexpect recent lags to be moreimportant Yi 5 a 1 b9 Xi 1 ei 0 0 inpredicting future outcomes than longer lags; this can be by least squares. [Ifunit i is acontrol unit, the correction re ected in tighter prior distributions around zero forthe will be done using an estimator forthe regression function older information. Alternatively,with anumber of similar m1( x)based on alinear speci Ž cation Yi 5 a1 1 b91Xi covariates one maywish to use hierarchical models that estimated on the treated units.] Abadie and Imbens (2002) avoid problems with large-dimensional parameterspaces. show that ifthis correction is done nonparametrically,the Asecond argument forconsidering Bayesian methods is resulting matching estimatoris consistent and asymptoti- that in an areaclosely related to this process of estimated cally normal, with its bias dominated by the variance. unobserved outcomes —that of missing data with the miss- The third method is to estimatethe sameregression ing atrandom (MAR)assumption —Bayesian methods have function forthe controls, but using only those that areused found widespread applicability.As advocated by Rubin asmatches forthe treated units, with weights corresponding (1987), multiple imputation methods often rely on aBayes- to the number of timesa control observations is used as a ian approach forimputing the missing data, taking account match(see Abadie and Imbens, 2002). Compared to the of the parameterheterogeneity in amanner consistent with second method, this approach may be less ef Ž cient, as it the uncertainty in the missing-data model itself. The same discards some control observations and weights some more methods could be used with little modi Ž cation forcausal than others. Ithas the advantage, however, of only using the models, with the main complication that arelatively large most relevant matches. The controls that arediscarded in the proportion—namely 50% of the total number of potential matching process arelikely to be outliers relative to the outcomes—is missing. treated observations, and they may thereforeunduly affect the least squares estimates. Ifthe regression function is in IV.EstimatingV ariances factlinear, this may be anattractive feature,but ifthere is The variances of the estimators considered so fartypi- uncertainty over its functional form,one maynot wish to cally involve unknown functions. For example, as discussed allow these observations such in  uence. in section IIE,the variance of ef Ž cient estimators of the PATEisequal to E.BayesianApproaches 2 2 s1~X! s0~X! VP 5 E 1 1 ~m ~X! 2 m ~X! 2 t!2 , Littlehas been done using Bayesian methods to estimate F e~X! 1 2 e~X! 1 0 G average treatmenteffects, either in methodology or in ap- plication. Rubin (1978) introduces ageneral approach to involving the two regression functions, the two conditional estimating average and distributional treatmenteffects from variances, and the propensity score. AVERAGE TREATMENTEFFECTS 21

Thereare a number of ways wecan estimatethis asymp- we can Ž nd two treated units with X 5 x, say units i and j. 2 totic variance. The Ž rst is essentially by brute force.All Ž ve Inthat case anunbiased estimator for s1( x) is 2 2 components of the variance, s0( x), s1( x), m0( x), m1( x), 2 2 and e( x),areconsistently estimable using kernel methods sˆ 1~x! 5 ~Yi 2 Yj! /2. or series, and hence the asymptotic variance can be esti- matedconsistently .However, ifone estimatesthe average In general it is again dif Ž cult to Ž nd exact matches, but treatmenteffect using only the two regression functions, itis again, this is not necessary.Instead, one uses the closest an additional burden to estimatethe conditional variances matchwithin the set of units with the sametreatment and the propensity score inorder to estimate VP.Similarly, indicator. Let vm(i) be the mth closest unit to i with the sametreatment indicator ( W 5 W ), and if one efŽ ciently estimates the average treatmenteffect by vm(i) i weighting with the estimated propensity score, it is acon- 1$iX 2 xi # iX 2 xi% 5 m. siderable additional burden to estimatethe Ž rst two mo- O l vm~i! luWl5Wi,lÞi ments of the conditional outcome distributions just to esti- matethe asymptotic variance. Given a Ž xed number of matches, M,this gives us M units Asecond method applies to the case whereeither the with the sametreatment indicator and approximately the regression functions or the propensity score is estimated samevalues forthe covariates. The sample variance of the using series or sieves. In that case one can interpret the outcome variable forthese M units can then be used to 2 estimators, given the number of termsin the series, as estimate s1( x).Doing the samefor the control variance 2 2 parametricestimators, and calculate the variance this way. function, s0( x),wecan estimate sw( x)atall values of the Under some conditions that will lead tovalid standard errors covariates and for w 5 0, 1. and conŽ dence intervals. Note that these arenot consistent estimators of the con- Athird approach is to use bootstrapping (Efronand ditional variances. As the sample size increases, the bias of Tibshirani, 1993; Horowitz, 2002). Thereis little formal these estimators will disappear, just as wesaw that the bias evidence speci Ž cforthese estimators, but, given that the of the matching estimatorfor the average treatmenteffect estimators areasymptotically linear, it is likely that boot- disappears under similarconditions. The rateat which this strapping will lead to valid standard errorsand con Ž dence bias disappears depends on the dimension of the covariates. 2 intervals atleast forthe regression and propensity score The variance of the estimators for sw(Xi),namely atspe- methods. Bootstrapping maybe morecomplicated for ciŽ cvalues of the covariates, will not go to zero; however, matching estimators, asthe process introduces discreteness this is not important, as weare interested not in the vari- in the distribution that will lead to ties in the matching ances atspeci Ž cpoints in the covariates distribution, but in algorithm. Subsampling (Politis and Romano, 1999) will the variance of the average treatmenteffect, VS. Following still work in this setting. the process introduce above, this last step is estimated as These Ž rst three methods provide variance estimates for P estimators of t .As argued above, however, one may N 2 2 1 sˆ 1~Xi! sˆ 0~Xi! instead wish to estimate tS or t(X),in which case the Vˆ S 5 O 1 . N S eˆ~Xi! 1 2 eˆ~Xi!D appropriate (conservative) variance is i51

2 2 s1~X! s0~X! Under standard regularity conditions this is consistent for VS 5 E 1 . F e~X! 1 2 e~X!G the asymptotic variance of the average treatmenteffect estimator. For matching estimators even estimation of the Asabove, this variance can be estimated by estimating the propensity score can be avoided. Abadie and Imbens show conditional moments of the outcome distributions, with the that one can estimatethe variance of the matching estimator accompanying inherent dif Ž culties. VS cannot, however, be for SATE as: estimated by bootstrapping, since the estimand itself N 2 changes across bootstrap samples. 1 KM~i! Vˆ E 5 1 1 sˆ 2 ~X !, Thereis, however, an alternative method forestimating N O S M D Wi i this variance that does not require additional nonparametric i51 estimation. The idea behind this matching variance estima- tor, asdeveloped by Abadie and Imbens (2002), is that even where M isthe number of matches and KM(i)is the number though the asymptotic variance depends on the conditional of timesunit i is used as amatch. 2 variance sw( x),one need not actually estimatethis variance consistently atall values of the covariates. Rather, one needs V.Assessingthe Assumptions only the average of this variance over the distribution, A.IndirectT estsof the Unconfoundedness Assumption weighted by the inverse of either e( x)or its complement 1 2 e( x).The key is thereforeto obtain aclose-to-unbiased The unconfoundedness assumption relied upon through- 2 estimatorfor the variance sw( x).Moregenerally ,suppose out this discussion is not directly testable. As discussed 22 THEREVIEW OFECONOMICS AND STATISTICS above, it states that the conditional distribution of the An implication of this independence condition is being outcome under the control treatment, Y(0),given receipt of tested by the tests discussed above. Whether this test has the active treatmentand given covariates, is identical to the much bearing on the unconfoundedness assumption de- distribution of the control outcome given receipt of the pends on whether the extension of the assumption is plau- control treatmentand given covariates. The sameis as- sible given unconfoundedness itself. sumed forthe distribution of the active treatmentoutcome, The second set of tests of unconfoundedness focuses on Y(1).Because the data arecompletely uninformative about estimating the causal effectof the treatmenton avariable the distribution of Y(0) forthose who received the active known to be unaffected by it, typically because its value is treatmentand of Y(1) forthose who received the control, determined prior tothe treatmentitself. Such avariable can the data cannot directly rejectthe unconfoundedness as- be time-invariant, but the most interesting case is in con- sumption. Nevertheless, there areoften indirect ways of sidering the treatmenteffect on alagged outcome. Ifthis is assessing this assumption, anumber of which aredeveloped not zero, this implies that the treated observations are inHeckman and Hotz (1989) and Rosenbaum (1987). These distinct fromthe controls; namely,that the distribution of methods typically rely on estimating acausal effectthat is Yi,21 forthe treated units is not comparable to the distribu- known to equal zero. Ifthe test then suggests that this causal tion of Yi,21 forthe controls. Ifthe treatmentis instead zero, effectdiffers from zero, the unconfoundedness assumption it is moreplausible that the unconfoundedness assumption isconsidered less plausible. These tests can be divided into holds. Ofcourse this does not directly test this assumption; two broad groups. inthis setting, being able to rejectthe null of no effectdoes The Ž rst set of tests focuses on estimating the causal not directly re  ecton the hypothesis of interest, uncon- effectof atreatmentthat is known not to have an effect, foundedness. Nevertheless, ifthe variables used in this relying on the presence of multiple control groups (Rosen- proxy test areclosely related to the outcome of interest, the baum, 1987). Suppose one has two potential control groups, test arguably has morepower. For these tests it is clearly forexample, eligible nonparticipants and ineligibles, as in helpful to have anumber of lagged outcomes. Heckman, Ichimura, and Todd (1997). One interpretation of To formalizethis, let us suppose the covariates consist of the test is to compareaverage treatmenteffects estimated anumber of lagged outcomes Yi,21, . . . , Yi,2T as well as using each of the control groups. This can also be inter- time-invariant individual characteristics Zi, so that Xi 5 preted as estimating an “average treatmenteffect ” using (Yi,21, . . . , Yi,2T, Zi).By construction only units in the only the two control groups, with the treatmentindicator treatmentgroup afterperiod 21receivethe treatment;all now adummy forbeing amemberof the Ž rst group. In that other observed outcomes arecontrol outcomes. Also sup- case the treatmenteffect is known to be zero, and statistical pose that the two potential outcomes Yi(0) and Yi(1) cor- evidence of anonzero effectimplies that atleast one of the respond to outcomes in period zero. Now consider the control groups is invalid. Again, not rejecting the test does following two assumptions. The Ž rst is unconfoundedness not imply the unconfoundedness assumption is valid (as given only T 2 1lags of the outcome: both control groups could sufferthe samebias), but nonre- jection in the case wherethe two control groups arelikely to Yi~1!, Yi~0! \ WiuYi,21, . . . , Yi,2~T21!, Zi, have differentpotential biases makes it moreplausible that and the second assumes stationarity and exchangeability: the unconfoundedness assumption holds. The key forthe power of this test is tohave available control groups that are fYi,s~0!uYi,s21~0!, . . . ,Yi,s2~T21!~0!,Zi,Wi~ ysuys21, . . . , ys2~T21!, z, w! likely to have different biases, ifany .Comparing ineligibles and eligible nonparticipants as in Heckman, Ichimura, and does not depend on i and s.Then it follows that Todd (1997) is aparticularly attractive comparison. Alter- natively one mayuse differentgeographic controls, for Yi,21 \ WiuYi,22, . . . , Yi,2T, Zi, example fromareas bordering on differentsides of the which istestable. This hypothesis is what the test described treatmentgroup. above tests. Whether this test has much bearing on uncon- One can formalizethis test by postulating athree-valued foundedness depends on the link between the two assump- indicator Ti [ {21,0,1} forthe groups (e.g., ineligibles, tions and the original unconfoundedness assumption. With a eligible nonparticipants, and participants), with the treat- sufŽ cient number of lags, unconfoundedness given all lags ment indicator equal to Wi 5 1{Ti 5 1}.Ifone extends the but one appears plausible, conditional on unconfoundedness unconfoundedness assumption to independence of the po- given all lags, so the relevance of the test depends largely on tential outcomes and the group indicator given covariates, the plausibility of the second assumption, stationarity and exchangeability. Yi~0!, Yi~1! \ TiuXi, B.Choosingthe Covariates then atestable implication is The discussion so farhas focused on the case wherethe Yi \ 1$Ti 5 0%uXi, Ti # 0. covariates set is known apriori. In practicethere can be two AVERAGE TREATMENTEFFECTS 23 issues with the choice of covariates. First, there maybe with one or two covariates one can do this directly.In some variables that should not be adjusted for. Second, even high-dimensional cases, however, this becomes moredif Ž - with variables that should be adjusted forin large samples, cult. One can inspect pairs of marginal distributions by the expected meansquared errormay be reduced by ignor- treatmentstatus, but these arenot necessarily informative ing those covariates that have only weak correlation with about lack of overlap. Itis possible that foreach covariate the treatmentindicator and the outcomes. This second issue the distributions forthe treatmentand control groups are is essentially astatistical one. Including acovariate in the identical, even though there areareas where the propensity adjustment procedure, through regression, matching or oth- score is 0or 1. erwise, will not lower the asymptotic precision of the Amoreuseful method is thereforeto inspect the distri- average treatmenteffect if the assumptions arecorrect. In bution of the propensity score in both treatmentgroups, which Ž nite samples, however, acovariate that is not, or is only can directly reveallack of overlap in high-dimensional weakly,correlated with outcomes and treatmentindicators covariate distributions. Its implementation requires non- mayreduce precision. Thereare few procedures currently parametricestimation of the propensity score, however, and available foroptimally choosing the set of covariates tobe misspeciŽ cation maylead to failurein detecting alack of included in matching or regression adjustments, taking into overlap, just as inspecting various marginal distributions account such Ž nite-sample properties. may be insufŽ cient. In practiceone maywish to under- The Ž rst issue is asubstantive one. The unconfounded- smooth the estimation of the propensity score, either by ness assumption mayapply with one set of covariates but choosing abandwidth smallerthan optimal fornonparamet- not apply with an expanded set. Aparticular concern is the ricestimation or by including higher-order termsin aseries inclusion of covariates that arethemselves affectedby the expansion. treatment,such as intermediate outcomes. Suppose, for Athird way to detect lack of overlap is to inspect the example, that in evaluating ajob training program, the quality of the worst matches in amatching procedure. Given primaryoutcome of interest is earnings two years later.In aset of matches, one can, foreach component k of the that case, employment status prior to the program is unaf- vector of covariates, inspect max i uxi,k 2 x,1(i),ku, the fected by the treatmentand thus avalid elementof the set of maximumover all observations of the matching discrep- adjustment covariates. In contrast, employment status one ancy.Ifthis differenceis large relative to the sample year afterthe program is an intermediate outcome and standard deviation of the kth component of the covariates, should not be controlled for. Itcould itself be an outcome of there is reason forconcern. The advantage of this method is interest, and should thereforenever be acovariate in an that itdoes not require additional nonparametric estimation. analysis of the effectof the training program. One guarantee Once one determines that there is alack of overlap, one that acovariate is not affectedby the treatmentis that itwas can either conclude that the average treatmenteffect of measured before the treatmentwas chosen. In practice, interest cannot be estimated with suf Ž cient precision, and/or however, the covariates areoften recorded atthe sametime decide to focus on an average treatmenteffect that is as the outcomes, subsequent to treatment.In that case one estimable with greateraccuracy .Todo the latterit can be has to assess on acase-by-case basis whether aparticular covariate should be used in adjusting outcomes. SeeRosen- useful to discard some of the observations on the basis of baum (1984b) and Angrist and Krueger (2000) formore their covariates. For example, one may decide to discard discussion. control (treated)observations with propensity scores below (above) acutoff level. The desired cutoff maydepend on the sample size; in avery large sample one maynot be con- C.Assessingthe Overlap Assumption cerned with apropensity score of 0.01, whereas in small The second of the key assumptions inestimating average samples such avalue may makeit dif Ž cult to Ž nd reason- treatmenteffects requires that the propensity score —the able comparisons. Tojudge such tradeoffs, it is useful to probability of receiving the active treatment —be strictly understand the relationship between aunit ’spropensity between zero and one. In principle this is testable, as it score and its implicit weight in the average-treatment-effect restrictsthe joint distribution of observables; but formal estimation. Using the weighting estimator, the average out- tests arenot necessarily the main concern. In practice, this comeunder the treatmentis estimated by summing up assumption raises anumber of questions. The Ž rst is how to outcomes forthe control units with weight approximately detect alack of overlap in the covariate distributions. A equal to 1divided by their propensity score (and 1divided second is how to deal with it, given that such alack exists. by 1minus the propensity score fortreated units). Hence Athird is how the individual methods discussed in section with N units, the weight of unit i is approximately 1/{ N z IIIaddress this lack of overlap. Ideally such alack would [1 2 e(Xi)]}ifit is atreated unit and 1/[ N z e(Xi)] if it is result in large standard errorsfor the average treatment acontrol. One maywish to limitthis weight to some effects. fraction, forexample, 0.05, so that no unit will have a The Ž rst method to detect lack of overlap is to plot weight of morethan 5% in the average. Under that ap- distributions of covariates by treatmentgroups. Inthe case proach, the limiton the propensity score in asample with 24 THEREVIEW OFECONOMICS AND STATISTICS

200 units is 0.1; units with apropensity score less than 0.1 trols, leading to possibly biased estimates. The standard or greaterthan 0.9 should be discarded. In asample with errorswould largely be unaffected. 1000 units, only units with apropensity score outside the Finally,consider propensity-score estimates. Estimatesof range [0.02, 0.98] will be ignored. the probability of receiving treatmentnow include values In matching procedures one need not rely entirely on close to 0and 1. The values close to 0forthe control comparisons of the propensity score distribution indiscard- observations would cause little dif Ž culty because these units ing the observations with insuf Ž cient matchquality . would get close to zero weight in the estimation. The control Whereas Rosenbaum and Rubin (1984) suggest accepting observations with apropensity score close to 1, however, only matches wherethe differencein propensity scores is would receivehigh weights, leading to an increase in the below acutoff point, alternatively one may wish to drop variance of the average-treatment-effectestimator, correctly matches whereindividual covariates areseverely mis- implying that one cannot estimatethe average treatment matched. effectvery precisely.Blocking on the propensity score would lead to similarconclusions. Finally,let us consider the three approaches to infer- Overall, propensity score and matching methods (and ence—regression, matching, and propensity score meth- likewise kernel-based regression methods) arebetter de- ods—and assess how each handles lack of overlap. Suppose signed to cope with limited overlap in the covariate distri- one is interested in estimating the average effecton the butions than areparametric or semiparametric(series) re- treated, and one has adata set with suf Ž cient overlap. Now gression models. In all cases it is useful to inspect suppose one adds afewtreated or control observations with histograms of the estimated propensity score inboth groups covariate values rarelyseen in the alternative treatment toassess whether limited overlap is an issue. group. Adding treated observations with outlying values implies one cannot estimatethe average treatmenteffect for the treated very precisely,because one lacks suitable con- VI. Applications trols against which to comparethese additional units. Thus Thereare many studies using some formof unconfound- with methods appropriately dealing with limited overlap edness or selection on observables, ranging fromsimple one will see the variance estimatesincrease. In contrast, least squares analyses to matching on the propensity score adding control observations with outlying covariate values (forexample, Ashenfelter and Card, 1985; LaLonde, 1986; should have little effect,since such controls areirrelevant Card and Sullivan, 1988; Heckman, Ichimura, and Todd, forthe average treatmenteffect for the treated. Therefore, 1997; Angrist, 1998; Dehejia and Wahba, 1999; Lechner, methods appropriately dealing with limited overlap should 1998; Friedlander and Robins, 1995; and many others). in this case show estimatesapproximately unchanged in HereI focus primarily on two sets of analyses that can help bias and precision. researchersassess the value of the methods surveyed in this Consider Ž rst the regression approach. Conditional on a paper: Ž rst, studies attempting to assess the plausibility of particular parametricspeci Ž cation forthe regression func- the assumptions, often using randomized experiments as a tion, adding observations with outlying values of the regres- yardstick; second, simulation studies focusing on the per- sors leads toconsiderably moreprecise parameterestimates; formanceof the various techniques in settings where the such observations arein  uential precisely because of their assumptions areknown to hold. outlying values. Ifthe added observations aretreated units, the precision of the estimated control regression function at A.Applications:Randomized Experiments as Checks on these outlying values will be lower (since fewif any control Unconfoundedness units arefound in that region); thus the variance will The basic idea behind these studies is simple: to use increase, as it should. One should note, however, that the experimental results as acheck on the attempted nonexperi- estimates in this region maybe sensitive to the speci Ž cation mental estimates. Given arandomized , one can chosen. In contrast, by the nature of regression functions, obtain unbiased estimates of the average effectof apro- adding control observations with outlying values will lead gram.Then, one can put aside the experimental control toaspurious increase inprecision of the control regression group and attempt to replicate these results using anonex- function. Regression methods can thereforebe misleading perimental control. Ifone can successfully replicate the incases with limited overlap. experimental results, this suggests that the assumptions and Next, consider matching. In estimating the average treat- methods areplausible. Such investigations areof course not ment effectfor the treated, adding control observations with generally conclusive, but areinvaluable in assessing the outlying covariate values will likely have little affecton the plausibility of the approach. The Ž rst such study, and one results, since such observations areunlikely to be used as that madean enormous impactin the econometrics litera- matches. The results would, however, be sensitive to adding ture, was by LaLonde (1986). Frakerand Maynard (1987) treated observations with outlying covariate values, because conducted asimilarinvestigation, and many morehave these observations would be matched to inappropriate con- followed. AVERAGE TREATMENTEFFECTS 25

LaLonde (1986) took the National Supported Work pro- speciŽ cguidance that should be the aimof such studies. gram,a fairlysmall program aimedat particularly They give clearand generalizable conditions that makethe disadvantaged people in the labor market(individuals with assumptions of unconfoundedness and overlap —at least poor labor markethistories and skills). Using these data, he according totheir study of alarge training program —more set aside the experimental control group and in its place plausible. These conditions include the presence of detailed constructed alternative controls fromthe Panel Study of earnings histories, and control groups that aregeographi- Income Dynamics (PSID)and Current Population Survey cally close to the treatmentgroup —preferably groups of (CPS), using various selection criteriadepending on prior ineligibles, or eligible nonparticipants fromthe sameloca- labor marketexperience. Hethen used anumber of meth- tion. In contrast, control groups fromvery differentloca- ods—ranging froma simple difference, to least squares tions arefound to be poor nonexperimental controls. Al- adjustment, aHeckman selection correction, and difference- though such conclusions areonly clearly generalizable to indifferences techniques —to createnonexperimental esti- evaluations of social programs, they arepotentially very matesof the average treatmenteffect. His general conclu- useful inproviding analysts with concrete guidance asto the sion was that the results werevery unpredictable and that no applicability of these assumptions. method could consistently replicate the experimental results Dehejia (2002) uses the GreaterAvenues to INdepen- using any of the six nonexperimental control groups con- dence (GAIN)data, using different counties as well as structed. Anumber of researchershave subsequently tested differentof Ž ces within the samecounty as nonexperimental new techniques using these samedata. Heckman and Hotz control groups. Similarly,Hotz, Imbens, and Klerman (1989) focused on testing the various models and argued (2001) use the basic GAINdata set supplemented with that the testing procedures they developed would have administrative data on long-term quarterly earnings (both eliminated many of LaLonde ’sparticularly inappropriate prior and subsequent to the randomization date), to inves- estimates. Dehejia and Wahba (1999) used several of the tigate the importance of detailed earnings histories. Such semiparametricmethods based on the unconfoundedness detailed histories can also provide moreevidence on the assumption discussed inthis survey,and found that forthe plausibility of nonexperimental evaluations forlong-term subsample of the LaLonde data that they used (with two outcomes. years of prior earnings), these methods replicated the ex- Two complications makethis literaturedif Ž cult to eval- perimental results moreaccurately —both overall and within uate. One is the differences in covariates used; it is rarethat subpopulations. Smith and Todd (2003) analyze the same variables aremeasured consistently across differentstudies. data and conclude that forother subsamples, including those For instance, some have yearly earnings data, others quar- forwhich only one year of prior earnings is available, the terly,others only earnings indicators on amonthly or quar- results areless robust. SeeDehejia (2003) foradditional terly basis. This makes itdif Ž cult to consistently investigate discussion of these results. the level of detail in earnings history necessary forthe Others have used differentexperiments to carryout the unconfoundedness assumption to hold. Asecond complica- sameor similaranalyses, using varying sets of estimators tion is that differentestimators aregenerally used; thus any and alternative control groups. Friedlander and Robins differences in results can be attributed to either estimators or (1995) focus on least squares adjustment, using data from assumptions. This is likely driven by the factthat fewof the the WIN(Work INcentive) demonstration programs con- estimators have been suf Ž ciently standardized that they can ducted in anumber of states, and construct control groups be implemented easily by empiricalresearchers. fromother counties in the samestate, as well as from All of these studies just discussed took data fromactual differentstates. They conclude that nonexperimental meth- randomized experiments to test the “true” treatmenteffect ods areunable to replicatethe experimental results. Hotz, against the estimators used on the nonexperimental data. To Imbens, and Mortimer(2003) use the samedata and con- some extent, however, such experimental data arenot re- sider matching methods with various sets of covariates, quired. The question of interest is whether an alternative using single or multiple alternative states as nonexperimen- control group is anadequate proxy fora randomized control tal control groups. They Ž nd that forthe subsample of in aparticular setting; note that this question does not individuals with positive earnings atsome date prior tothe require data on the treatmentgroup. Although these ques- program, nonexperimental methods work better than for tions have typically been studied by comparing experimen- those with no known positive earnings. tal with nonexperimental results, all that is really relevant is Heckman, Ichimura, and Todd (1997, 1998) and Heck- whether the nonexperimental control group can predict the man, Ichimura, Smith, and Todd (1998) study the national average outcomes forthe experimental control. Asin Heck- Job Training Partnership Act(JPT A)program, using data man, Ichimura, Smith, and Todd ’s(1998) analysis of the fromdifferent geographical locations to investigate the JTPAdata, one can take two groups, neither subject to the nature of the biases associated with differentestimators, and treatment,and ask the question whether —using data on the the importance of overlap in the covariates, including labor covariates forthe Ž rst control group in combination with markethistories. Theirconclusions provide the type of outcome and covariate information forthe second —one can 26 THEREVIEW OFECONOMICS AND STATISTICS predict the average outcome in the Ž rst. Ifso, this implies where, for r 5 0, 1,2, one has Sr 5 ¥ K((Xi 2 r r that, had there been anexperiment on the population from x)/h)(Xi 2 x) and Tr 5 ¥ K((Xi 2 x)/h)(Xi 2 x) Yi. The which the Ž rst control group was drawn, the second group Seifert-Gassermodi Ž cation is to use instead would provide an acceptable nonexperimental control. Fromthis perspective one can use data frommany different T0 T1 surveys. In particular, one can moresystematically investi- mˆ ~x! 5 1 ~x 2 x#!, S S 1 R gate whether control groups fromdifferent counties, states, 0 2 or regions or even differenttime periods makeacceptable wherethe recommended ridge parameteris R 5 ux 2 nonexperimental controls. 3 x#u[5/(16h)],given the Epanechnikov kernel k(u) 5 4 (1 2 2 B.Simulations u )1{uuu , 1}. Note that with high-dimensional covariates, such anonnegative kernel would lead to biases that do not Asecond question that is often confounded with that of vanish fastenough to be dominated by the variance (seethe the validity of the assumptions is that of the relative per- discussion in Heckman, Ichimura, and Todd, 1998). This is formanceof the various estimators. Suppose one iswilling not aproblem in Fro¨lich’ssimulations, ashe considers only to accept the unconfoundedness and overlap assumptions. cases with asingle covariate. Fro¨lich Ž nds that the local Which estimation method is most appropriate in aparticular linear estimator, with Seifertand Gassert ’s modiŽ cation, setting? Inmany of the studies comparing nonexperimental performsbetter than either the matching or the standard with experimental outcomes, researcherscompare results local linear estimator. fora number of the techniques described here. Yetin these Zhao (2004) uses simulation methods to comparematch- settings wecannot be certain that the underlying assump- ing and parametricregression estimators. Heuses metrics tions hold. Thus, although it is useful to compare these based on the propensity score, the covariates, and estimated techniques in such realisticsettings, it is also important to regression functions. Using designs with varying numbers compare them in an arti Ž cialenvironment whereone is of covariates and linear regression functions, Zhao Ž nds certain that the underlying assumptions arevalid. there is no clearwinner among the differentestimators, Thereexist afewstudies that speci Ž cally set out to do although he notes that using the outcome data in choosing this. Fro¨lich (2000) compares anumber of matching esti- the metricappears apromising strategy. mators and local linear regression methods, carefully for- Abadie and Imbens (2002) study their matching estimator malizing fully data-driven procedures forthe estimators using adata-generating process inspired by the LaLonde considered. To makethese comparisons he considers alarge study to allow forsubstantial nonlinearity, Ž tting aseparate number of data-generating processes, based on eight differ- binary response model to the zeros in the earnings outcome, ent regression functions (including some highly nonlinear and alog linear model forthe positive observations. The and multimodal ones), two differentsample sizes, and three regression estimators include linear and quadratic models differentdensity functions forthe covariate (one important (the latterwith afull set of interactions), with seven covari- limitation is that he restrictsthe investigation to asingle ates. This study Ž nds that the matching estimators, and in covariate). For the matching estimator Fro¨lich considered a particular the bias-adjusted alternatives, outperform the lin- single matchwith replacement;for the local linear regres- earand quadratic regression estimators (the formerusing 7 sion estimators he uses data-driven optimal bandwidth covariates, the latter35, afterdropping squares and interac- choices based on minimizing the meansquared errorof the tions that lead to perfectcollinearity). Theirsimulations also average treatmenteffect. The Ž rst local linear estimator suggest that with fewmatches —between one and four — considered isthe standard one: at x the regression function matching estimators arenot sensitive to the number of m( x)is estimated as b0 in the minimization problem matches used, and that their con Ž dence intervals have actual coverage ratesclose tothe nominal values. N X 2 x The results fromthese simulation studies areoverall 2 i min @Yi 2 b 2 b z ~Xi 2 x!# z K , somewhat inconclusive; it is clearthat morework is re- O 0 1 S h D b0,b1 i51 quired. Future simulations mayusefully focus on some of the following issues. First, it is obviously important to with an Epanechnikov kernel. He Ž nds that this has com- closely model the data-generating process on actual data putational problems, as wellas poor small-sampleproper- sets, to ensure that the results have some relevance for ties. Hetherefore also considers amodi Ž cation suggested by practice. Ideally one would build the simulations around a Seifertand Gasser (1996, 2000). For given x, deŽ ne x# 5 ¥ number of speci Ž cdata sets through arange of data- XiK((Xi 2 x)/h)/¥ K((Xi 2 x)/h),so that one can write generating processes. Second, it is important to have fully the standard local linear estimator as data-driven procedures that de Ž ne anestimator as afunction N of (Yi, Wi, Xi)i51,as seen in Fro¨lich (2000). For the T0 T1 mˆ ~x! 5 1 ~x 2 x#!, matching estimators this is relatively straightforward, but S0 S2 forsome others this requires morecare. This will allow AVERAGE TREATMENTEFFECTS 27 other researchersto consider meaningful comparisons performanceof these methods in realisticsettings with large across the various estimators. numbers of covariates and varying degrees of smoothness in Finally,weneed to learn which featuresof the data- the conditional means of the potential outcomes and the generating process areimportant forthe properties of the propensity score. various estimators. For example, do some estimators Once these issues have been resolved, today ’s applied deteriorate morerapidly than others when adata set has evaluators will bene Ž tfroma new set of reliable, econo- many covariates and fewobservations? Aresome estimators metricallydefensible, and robust methods forestimating the morerobust against high correlations between covariates average treatmenteffect of current social policy programs and outcomes, or high correlations between covariates and under exogeneity assumptions. treatmentindicators? Which estimators aremore likely to give conservative answers in termsof precision? Since it is clearthat no estimator is always going to dominate all REFERENCES others, what is important is to isolate salient featuresof the Abadie, A., “Semiparametric Instrumental Variable Estimation of Treat- data-generating processes that lead to preferring one alter- ment Response Models, ” Journal of Econometrics 113:2 (2003a), native over another. Ideally weneed descriptive statistics 231–263. Abadie, A., “Semiparametric Difference-in-Differences Estimators, ” summarizing the featuresof the data that provide guidance forthcoming, Review of Economic Studies (2003b). inchoosing the estimator that will performbest in agiven Abadie, A., J.Angrist, and G.Imbens, “Instrumental Variables Estimation situation. of Quantile Treatment Effects, ” Econometrica 70:1 (2002), 91 – 117. Abadie, A., D.Drukker, H.Herr, and G.Imbens, “Implementing Matching VII. Conclusion Estimators for Average Treatment Effects in STATA, ” Department of , University of California, Berkeley,unpublished manuscript (2003). Inthis paper Ihave attempted to review the current state Abadie, A.,and G.Imbens, “Simple and Bias-Corrected Matching Esti- of the literatureon inference foraverage treatmenteffects mators for Average Treatment Effects, ” NBERtechnical working under the assumption of unconfoundedness. This has re- paper no. 283 (2002). cently been avery active areaof researchwhere many new Abbring, J., and G.van den Berg, “The Non-parametric Identi Ž cation of Treatment Effects in Duration Models, ” Free University of Am- semi-and nonparametric econometric methods have been sterdam, unpublished manuscript (2002). applied and developed. The researchhas moved along way Angrist, J., “Estimating the Labor Market Impact of Voluntary Military fromrelying on simple least squares methods forestimating Service Using Social Security Data on Military Applicants, ” Econometrica 66:2 (1998), 249 –288. average treatmenteffects. Angrist, J.D., and J.Hahn, “When to Control for Covariates? Panel- The primaryestimators in the current literatureinclude Asymptotic Results for Estimates of Treatment Effects, ” NBER propensity-score methods and pairwise matching, aswell as technical working paper no. 241 (1999). Angrist, J.D.,G.W.Imbens, and D.B.Rubin, “IdentiŽ cation of Causal nonparametric regression methods. Ef Ž ciency bounds have Effects Using Instrumental Variables, ” Journal of the American been established fora number of the average treatment Statistical Association 91 (1996), 444 –472. effectsestimable with these methods, and avariety of these Angrist, J. D.,and A.B.Krueger, “Empirical Strategies in Labor Eco- nomics,” in A.Ashenfelter and D.Card (Eds.), Handbook of Labor estimators rely on the weakest assumptions that allow point Economics vol. 3(New York: Elsevier Science, 2000). identiŽ cation. Researchers have suggested several ways for Angrist, J., and V.Lavy, “Using Maimonides ’ Rule to Estimate the Effect estimating the variance of these average-treatment-effect of Class Size on Scholastic Achievement, ” Quarterly Journal of estimators. One, morecumbersome approach requires esti- Economics CXIV(1999), 1243. Ashenfelter, O., “Estimating the Effect of Training Programs on Earn- mating each component of the variance nonparametrically. ings,” this REVIEW 60 (1978), 47 –57. Amorecommon method relieson bootstrapping. Athird Ashenfelter, O., and D.Card, “Using the Longitudinal Structure of alternative, developed by Abadie and Imbens (2002) forthe Earnings to Estimate the Effect of Training Programs, ” this REVIEW 67 (1985), 648 –660. matching estimator, requires no additional nonparametric Athey, S., and G.Imbens, “IdentiŽ cation and Inference in Nonlinear estimation. Thereis, as yet, however, no consensus on Difference-in-Differences Models, ” NBERtechnical working pa- which arethe best estimation methods to apply in practice. per no. 280 (2002). Athey, S.,and S.Stern, “AnEmpirical Framework for Testing Theories Nevertheless, the applied researcherhas now alarge number about Complementarity in Organizational Design, ” NBERworking of new estimators ather disposal. paper no. 6600 (1998). Challenges remainin making the new tools moreeasily Barnow, B.S.,G.G.Cain, and A.S.Goldberger, “Issues in the Analysis of Selectivity Bias, ” in E.Stromsdorfer and G.Farkas (Eds.), applicable. Although software is available to implement Evaluation Studies vol. 5(San Francisco: Sage, 1980). some of the estimators (seeBecker and Ichino, 2002; Becker, S., and A.Ichino, “Estimation of Average Treatment Effects Based Sianesi, 2001; Abadie etal., 2003), many remaindif Ž cult to on Propensity Scores, ” The Stata Journal 2:4 (2002), 358 –377. apply.Aparticularly urgent task is thereforeto provide fully Bitler, M.,J.Gelbach, and H.Hoynes, “What Mean Impacts Miss: Distributional Effects of Welfare Reform Experiments, ” Depart- implementable versions of the various estimators that do not ment of Economics, University of Maryland, unpublished paper require the applied researcherto choose bandwidths or other (2002). smoothing parameters.This is less of aconcern formatch- Bjo¨rklund, A., and R.Mof Ž t, “The Estimation of Wage Gains and Welfare Gains in Self-Selection Models, ” this REVIEW 69 (1987), 42 –49. ing methods and probably explains alarge part of their Black, S., “DoBetter Schools Matter? Parental Valuation of Elementary popularity.Another outstanding question is the relative Education,” Quarterly Journal of Economics CXIV(1999), 577. 28 THEREVIEW OFECONOMICS AND STATISTICS

Blundell, R.,and Monica Costa-Dias, “Alternative Approaches to Evalu- “Matching as an Econometric Evaluation Estimator, ” Review of ation in Empirical Microeconomics, ” Institute for Fiscal Studies, Economic Studies 65 (1998), 261 –294. Cemmap working paper cwp10/02 (2002). Heckman, J., H.Ichimura, J. Smith, and P.Todd, “Characterizing Selec- Blundell, R., A.Gosling, H.Ichimura, and C.Meghir, “Changes in the tion Bias Using Experimental Data, ” Econometrica 66 (1998), Distribution of Male and Female Wages Accounting for the Em- 1017–1098. ployment Composition, ” Institute for Fiscal Studies, London, un- Heckman, J., R.LaLonde, and J.Smith, “The Economics and Economet- published paper (2002). rics of Active Labor Markets Programs, ” in A.Ashenfelter and D. Card, D., and D.Sullivan, “Measuring the Effect of Subsidized Training Card (Eds.), Handbook of Labor Economics vol. 3(New York: Programs on Movements In and Out of Employment, ” Economet- Elsevier Science, 2000). rica 56:3 (1988), 497 –530. Heckman, J., and R.Robb, “Alternative Methods for Evaluating the Chernozhukov, V.,and C.Hansen, “AnIVModel of Quantile Treatment Impact of Interventions, ” in J.Heckman and B.Singer (Eds.), Effects,” Department of Economics, MIT, unpublished working Longitudinal Analysis of Labor Market Data (Cambridge, U.K.: paper (2001). Cambridge University Press, 1984). Cochran, W., “The Effectiveness of Adjustment by Subclassi Ž cation in Heckman, J., J.Smith, and N.Clements, “Making the Most out of Removing Bias in Observational Studies, ” Biometrics 24, (1968), Programme Evaluations and Social Experiments: Accounting for 295–314. Heterogeneity in Programme Impacts, ” Review of Economic Stud- Cochran, W.,and D.Rubin, “Controlling Bias in Observational Studies: A ies 64 (1997), 487 –535. Review,” Sankhya# 35 (1973), 417 –446. Hirano, K., and G.Imbens, “Estimation of Causal Effects Using Propen- Dehejia, R., “Was There aRiverside Miracle? AHierarchical Framework sity Score Weighting: AnApplication of Data on Right Ear Cath- for Evaluating Programs with Grouped Data, ” Journal of Business eterization, ” Health Services and Outcomes Research Methodology and Economic Statistics 21:1 (2002), 1 –11. 2(2001), 259 –278. “Practical Propensity Score Matching: AReply to Smith and Hirano, K.,G.Imbens, and G.Ridder, “EfŽ cient Estimation of Average Todd,” forthcoming, Journal of Econometrics (2003). Treatment Effects Using the Estimated Propensity Score, ” Econo- Dehejia, R.,and S.Wahba, “Causal Effects in Nonexperimental Studies: metrica 71:4 (2003), 1161 –1189. Reevaluating the Evaluation of Training Programs, ” Journal of the Holland, P., “Statistics and Causal Inference ” (with discussion), Journal of American Statistical Association 94 (1999), 1053 –1062. the American Statistical Association 81 (1986), 945 –970. Doksum, K., “Empirical Probability Plots and for Horowitz, J., “The Bootstrap, ” in James J.Heckman and E.Leamer (Eds.), Nonlinear Models in the Two-Sample Case, ” Annals of Statistics 2 Handbook of Econometrics, vol. 5(Elsevier North Holland, 2002). (1974), 267–277. Hotz, J., G.Imbens, and J.Klerman, “The Long-Term Gains from GAIN: Efron, B.,and R.Tibshirani, AnIntroduction to the Bootstrap (New York: ARe-analysis of the Impacts of the California GAINProgram, ” Chapman and Hall, 1993). Department of Economics, UCLA,unpublished manuscript (2001). Engle, R.,D.Hendry,and J.-F. Richard, “Exogeneity, ” Econometrica 51:2 Hotz, J., G.Imbens, and J.Mortimer, “Predicting the Ef Ž cacy of Future (1974), 277–304. Training Programs Using Past Experiences, ” forthcoming, Journal Firpo, S., “EfŽ cient Semiparametric Estimation of Quantile Treatment of Econometrics (2003). Effects,” Department of Economics, University of California, Ichimura, H., and O.Linton, “Asymptotic Expansions for Some Semipa- Berkeley,PhDthesis (2002), chapter 2. rametric Program Evaluation Estimators, ” Institute for Fiscal Stud- Fisher, R.A., The Design of Experiments (Boyd, London, 1935). ies, cemmap working paper cwp04/01 (2001). Fitzgerald, J., P.Gottschalk, and R.Mof Ž tt, “An Analysis of Sample Ichimura, H., and C.Taber, “Direct Estimation of Policy Effects, ” De- Attrition in Panel Data: The Michigan Panel Study of Income partment of Economics, Northwestern University,unpublished Dynamics,” Journal of Human Resources 33 (1998), 251 –299. manuscript (2000). Fraker, T., and R.Maynard, “The Adequacy of Comparison Group Imbens, G., “The Role of the Propensity Score in Estimating Dose- Designs for Evaluations of Employment-Related Programs, ” Jour- Response Functions, ” Biometrika 87:3 (2000), 706 –710. nal of Human Resources 22:2 (1987), 194 –227. “Sensitivity to Exogeneity Assumptions in Program Evaluation, ” Friedlander, D., and P.Robins, “Evaluating Program Evaluations: New American Economic Review Papers and Proceedings (2003). Evidence on Commonly Used Nonexperimental Methods, ” Amer- Imbens, G.,and J.Angrist, “IdentiŽ cation and Estimation of Local ican Economic Review 85 (1995), 923 –937. Average Treatment Effects, ” Econometrica 61:2 (1994), 467 –476. Fro¨lich, M., “Treatment Evaluation: Matching versus Local Polynomial Imbens, G.,W.Newey, and G.Ridder, “Mean-Squared-Error Calculations Regression, ” Department of Economics, University of St. Gallen, for Average Treatment Effects, ” Department of Economics, UC discussion paper no. 2000-17 (2000). Berkeley,unpublished manuscript (2003). “What is the Value of Knowing the Propensity Score for Estimat- LaLonde, R.J., “Evaluating the Econometric Evaluations of Training ing Average Treatment Effects, ” Department of Economics, Uni- Programs with Experimental Data, ” American Economic Review versity of St. Gallen (2002). 76 (1986), 604 –620. Gill, R., and J.Robins, “Causal Inference for Complex Longitudinal Data: Lechner, M., “Earnings and Employment Effects of Continuous Off-the- The Continuous Case, ” Annals of Statistics 29:6 (2001), 1785 – Job Training in East Germany after Uni Ž cation,” Journal of Busi- 1811. ness and Economic Statistics 17:1 (1999), 74 –90. Gu, X.,and P.Rosenbaum, “Comparison of Multivariate Matching Meth- Lechner, M., “IdentiŽ cation and Estimation of Causal Effects of Multiple ods: Structures, Distances and Algorithms, ” Journal of Computa- Treatments under the Conditional Independence Assumption, ” in tional and Graphical Statistics 2(1993), 405 –420. M.Lechner and F.Pfeiffer (Eds.), Econometric Evaluations of Hahn, J., “Onthe Role of the Propensity Score in Ef Ž cient Semiparamet- Active Labor Market Policies in Europe (Heidelberg: Physica, ric Estimation of Average Treatment Effects, ” Econometrica 66:2 2001). (1998), 315–331. “Program Heterogeneity and Propensity Score Matching: An Hahn, J., P.Todd, and W. Van der Klaauw, “IdentiŽ cation and Estimation Application to the Evaluation of Active Labor Market Policies, ” of Treatment Effects with aRegression-Discontinuity Design, ” this REVIEW 84:2 (2002), 205 –220. Econometrica 69:1 (2000), 201 –209. Lee, D., “The Electoral Advantage of Incumbency and the Voter ’s Valu- Ham, J., and R.LaLonde, “The Effect of Sample Selection and Initial ation of Political Experience: ARegression Discontinuity Analysis Conditions in Duration Models: Evidence from Experimental Data of Close Elections, ” Department of Economics, University of on Training, ” Econometrica 64:1 (1996). California, unpublished manuscript (2001). Heckman, J., and J.Hotz, “Alternative Methods for Evaluating the Impact Lehman, E., Nonparametrics: Statistical Methods Based on Ranks (San of Training Programs ” (with discussion), Journal of the American Francisco: Holden-Day, 1974). Statistical Association 84:804 (1989), 862 –874. Manski, C., “Nonparametric Bounds on Treatment Effects, ” American Heckman, J., H.Ichimura, and P.Todd, “Matching as an Econometric Economic Review Papers and Proceedings 80 (1990), 319 –323. Evaluation Estimator: Evidence from Evaluating aJob Training Manski, C.,G.Sandefur, S.McLanahan, and D.Powers, “Alternative Program,” Review of Economic Studies 64 (1997), 605 –654. Estimates of the Effect of Family Structure During Adolescence on AVERAGE TREATMENTEFFECTS 29

High School, ” Journal of the American Statistical Association “Constructing aControl Group Using Multivariate Matched 87:417 (1992), 25 –37. Sampling Methods That Incorporate the Propensity Score, ” Amer- Partial Identi Ž cation of Probability Distributions (New York: ican Statistician 39 (1985), 33 –38. Springer-Verlag, 2003). Rubin, D., “Matching to Remove Bias in Observational Studies, ” Biomet- Neyman, J., “On the Application of Probability Theory to Agricultural rics 29 (1973a), 159 –183. Experiments. Essay on Principles. Section 9 ” (1923), translated “The Use of Matched Sampling and Regression Adjustments to (with discussion) in Statistical Science 5:4 (1990), 465 –480. Remove Bias in Observational Studies, ” Biometrics 29 (1973b), Politis, D., and J. Romano, Subsampling (Springer-Verlag, 1999). 185–203. Porter, J., “Estimation in the Regression Discontinuity Model, ” Harvard “Estimating Causal Effects of Treatments in Randomized and University, unpublished manuscript (2003). Non-randomized Studies, ” Journal of Educational Psychology 66 Quade, D., “Nonparametric Analysis of Covariance by Matching, ” Bio- (1974), 688–701. metrics 38 (1982), 597 –611. “Assignment to Treatment Group on the Basis of aCovariate, ” Robins, J., and Y.Ritov, “Towards aCurse of Dimensionality Appropriate Journal of Educational Statistics 2:1 (1977), 1 –26. (CODA)Asymptotic Theory for Semi-parametric Models, ” Statis- “Bayesian Inference for Causal Effects: The Role of Randomiza- tics in Medicine 16 (1997), 285 –319. tion,” Annals of Statistics 6 (1978), 34–58. Robins, J.M.,and A.Rotnitzky, “Semiparametric Ef Ž ciency in Multivar- “Using Multivariate Matched Sampling and Regression Adjust- iate Regression Models with Missing Data, ” Journal of the Amer- ment to Control Bias in Observational Studies, ” Journal of the ican Statistical Association 90 (1995), 122 –129. American Statistical Association 74 (1979), 318 –328. Robins, J.M.,Rotnitzky,A., Zhao, L.-P., “Analysis of Semiparametric Rubin, D.,and N.Thomas, “AfŽ nely Invariant Matching Methods with Regression Models for Repeated Outcomes in the Presence of Ellipsoidal Distributions, ” Annals of Statistics 20:2 (1992), 1079 – Missing Data, ” Journal of the American Statistical Association 90 1093. (1995), 106–121. Seifert, B.,and T.Gasser, “Finite-Sample Variance of Local Polynomials: Rosenbaum, P., “Conditional Permutation Tests and the Propensity Score Analysis and Solutions, ” Journal of the American Statistical As- in Observational Studies, ” Journal of the American Statistical Association 79 (1984a), 565 –574. sociation 91 (1996), 267 –275. “The Consequences of Adjustment for aConcomitant Variable “Data Adaptive Ridging in Local Polynomial Regression, ” Jour- That Has Been Affected by the Treatment, ” Journal of the Royal nal of Computational and Graphical Statistics 9:2 (2000), 338 – Statistical Society, Series A 147 (1984b), 656 –666. 360. “The Role of aSecond Control Group in an ” Shadish, W.,T.Campbell, and D.Cook, Experimental and Quasi- (with discussion), Statistical Science 2:3 (1987), 292 –316. experimental Designs for Generalized Causal Inference (Boston: “Optimal Matching in Observational Studies, ” Journal of the Houghton Mif  in, 2002). American Statistical Association 84 (1989), 1024 –1032. Sianesi, B., “psmatch: Propensity Score Matching in STATA, ” University Observational Studies (New York: Springer-Verlag, 1995). College London and Institute for Fiscal Studies (2001). “Covariance Adjustment in Randomized Experiments and Obser- Smith, J.A., and P.E.Todd, “Reconciling Con  icting Evidence on the vational Studies, ” Statistical Science 17:3 (2002), 286 –304. Performance of Propensity-Score Matching Methods, ” American Rosenbaum, P.,and D.Rubin, “The Central Role of the Propensity Score Economic Review Papers and Proceedings 91 (2001), 112 –118. in Observational Studies for Causal Effects, ” Biometrika 70 “Does Matching Address LaLonde ’sCritique of Nonexperimental (1983a), 41–55. Estimators, ” forthcoming, Journal of Econometrics (2003). “Assessing the Sensitivity to an Unobserved Binary Covariate in Van der Klaauw, W., “ARegression-Discontinuity Evaluation of the Effect an Observational Study with Binary Outcome, ” Journal of the of Financial Aid Offers on College Enrollment, ” International Royal Statistical Society, Series B 45 (1983b), 212 –218. Economic Review 43:4 (2002), 1249 –1287. “Reducing the Bias in Observational Studies Using Subclassi Ž - Zhao, Z., “Using Matching to Estimate Treatment Effects: Data Require- cation on the Propensity Score, ” Journal of the American Statisti- ments, Matching Metrics, and Monte Carlo Evidence, ” this REVIEW cal Association 79 (1984), 516 –524. this issue (2004).