
PROPENSITY SCORE-MATCHING METHODS FOR NONEXPERIMENTAL CAUSAL STUDIES
Rajeev H. Dehejia and Sadek Wahba*

Abstract: This paper considers causal inference and sample selection bias in nonexperimental settings in which (i) few units in the nonexperimental comparison group are comparable to the treatment units, and (ii) selecting a subset of comparison units similar to the treatment units is difficult because units must be compared across a high-dimensional set of pretreatment characteristics. We discuss the use of propensity score-matching methods, and implement them using data from the National Supported Work experiment. Following LaLonde (1986), we pair the experimental treated units with nonexperimental comparison units from the CPS and PSID, and compare the estimates of the treatment effect obtained using our methods to the benchmark results from the experiment. For both comparison groups, we show that the methods succeed in focusing attention on the small subset of the comparison units comparable to the treated units and, hence, in alleviating the bias due to systematic differences between the treated and comparison units.

I. Introduction

An important problem of causal inference is how to estimate treatment effects in observational studies, situations (like an experiment) in which a group of units is exposed to a well-defined treatment, but (unlike an experiment) no systematic methods of experimental design are used to maintain a control group. It is well recognized that the estimate of a causal effect obtained by comparing a treatment group with a nonexperimental comparison group could be biased because of problems such as self-selection or some systematic judgment by the researcher in selecting units to be assigned to the treatment. This paper discusses the use of propensity score-matching methods to correct for sample selection bias due to observable differences between the treatment and comparison groups.

Matching involves pairing treatment and comparison units that are similar in terms of their observable characteristics. When the relevant differences between any two units are captured in the observable (pretreatment) covariates, which occurs when outcomes are independent of assignment to treatment conditional on pretreatment covariates, matching methods can yield an unbiased estimate of the treatment impact.[1] The first generation of matching methods paired observations based on either a single variable or weighting several variables. (See, inter alia, Bassi (1984), Cave and Bos (1995), Czajka et al. (1992), Cochran and Rubin (1973), Raynor (1983), Rosenbaum (1995), Rubin (1973, 1979), Westat (1981), and studies cited by Barnow (1987).)

The motivation for focusing on propensity score-matching methods is that, in many applications of interest, the dimensionality of the observable characteristics is high. With a small number of characteristics (for example, two binary variables), matching is straightforward (one would group units in four cells). However, when there are many variables, it is difficult to determine along which dimensions to match units or which weighting scheme to adopt. Propensity score-matching methods, as we demonstrate, are especially useful under such circumstances because they provide a natural weighting scheme that yields unbiased estimates of the treatment impact.

The key contribution of this paper is to discuss and apply propensity score-matching methods, which are new to the economics literature. (Previous papers include Dehejia and Wahba (1999), Heckman et al. (1996, 1998), Heckman, Ichimura, and Todd (1997, 1998). See Friedlander, Greenberg, and Robins (1997) for a review.) This paper differs from Dehejia and Wahba (1999) by focusing on matching methods in detail, and it complements the Heckman et al. papers by discussing a different array of matching estimators in the context of a different data set.

An important feature of our method is that, after units are matched, the unmatched comparison units are discarded and are not directly used in estimating the treatment impact. Our approach has two motivations. First, in some settings of interest, data on the outcome variable for the comparison group are costly to obtain. For example, in economics, some data sets provide outcome information for only one year; if the outcome of interest takes place in a later period, possibly thousands of comparison units have to be linked across data sets or resurveyed. In such settings, the ability to obtain the needed data for a subset of relevant comparison units, discarding the irrelevant potential comparison units, is extremely valuable. Second, even if information on the outcome is available for all comparison units (as it is in our data), the process of searching for the best subset from the comparison group reveals the extent of overlap between the treatment and comparison groups in terms of pretreatment characteristics. Because methods that use the full set of comparison units extrapolate or smooth across the treatment and comparison groups, it is extremely useful to know how many of the comparison units are in fact comparable and hence how much smoothing one's estimator is expected to perform.

The data we use, obtained from LaLonde (1986), are from the National Supported Work (NSW) Demonstration, a labor market experiment in which participants were randomized between treatment (on-the-job training lasting between nine months and a year) and control groups. Following LaLonde, we use the experimental controls to obtain a benchmark estimate for the treatment impact and then set them aside, wedding the treated units from the experiment to comparison units from the Panel Study of Income Dynamics (PSID) and the Current Population Survey (CPS).[2] We compare estimates obtained using our nonexperimental methods to the experimental benchmark. We show that most of the nonexperimental comparison units are not good matches for the treated group. We succeed in selecting the comparison units that are most comparable to the treated units and in replicating the benchmark treatment impact.

The paper is organized as follows. In section II, we discuss the theory behind our estimation strategy. In section III, we discuss propensity score-matching methods. In section IV, we describe the NSW data, which we then use in section V to implement our matching procedures. Section VI tests the matching assumption and examines the sensitivity of our estimates to the specification of the propensity score. Section VII concludes the paper.

II. Matching Methods

A. The Role of Randomization

A cause is viewed as a manipulation or treatment that brings about a change in the variable of interest, compared to some baseline, called the control (Cox, 1992; Holland, 1986). The basic problem in identifying a causal effect is that the variable of interest is observed under either the treatment or control regimes, but never both.

Formally, let $i$ index the population under consideration. $Y_{i1}$ is the value of the variable of interest when unit $i$ is subject to treatment (1), and $Y_{i0}$ is the value of the same variable when the unit is exposed to the control (0). The treatment effect for a single unit, $i$, is defined as $\tau_i = Y_{i1} - Y_{i0}$. The primary treatment effect of interest in nonexperimental settings is the expected treatment effect for the treated population; hence

$$\tau|_{T=1} = E(\tau_i \mid T_i = 1) = E(Y_{i1} \mid T_i = 1) - E(Y_{i0} \mid T_i = 1), \tag{1}$$

where $T_i = 1$ ($= 0$) if the $i$th unit was assigned to treatment (control).[3] The problem of unobservability is summarized by the fact that we can estimate $E(Y_{i1} \mid T_i = 1)$, but not $E(Y_{i0} \mid T_i = 1)$.

The difference, $\tau^e = E(Y_{i1} \mid T_i = 1) - E(Y_{i0} \mid T_i = 0)$, can be estimated, but it is potentially a biased estimator of $\tau$. Intuitively, if $Y_{i0}$ for the treated and comparison units systematically differs, then in observing only $Y_{i0}$ for the comparison group we do not correctly estimate $Y_{i0}$ for the treated group. Such bias is of paramount concern in nonexperimental studies. The role of randomization is to prevent this:

$$Y_{i1}, Y_{i0} \perp T_i \;\Rightarrow\; E(Y_{i0} \mid T_i = 0) = E(Y_{i0} \mid T_i = 1) = E(Y_i \mid T_i = 0),$$

where $Y_i = T_i Y_{i1} + (1 - T_i) Y_{i0}$ (the observed value of the outcome) and $\perp$ is the symbol for independence. The treated and control groups do not systematically differ from each other, making the conditioning on $T_i$ in the expectation unnecessary (ignorable treatment assignment, in the terminology of Rubin (1977)), and yielding $\tau|_{T=1} = \tau^e$.[4]

B. Exact Matching on Covariates

To substitute for the absence of experimental control units, we assume that data can be obtained for a set of potential comparison units, which are not necessarily drawn from the same population as the treated units but for whom we observe the same set of pretreatment covariates, $X_i$. The following proposition extends the framework of the previous section to nonexperimental settings:

Proposition 1 (Rubin, 1977). If for each unit we observe a vector of covariates $X_i$ and $Y_{i0} \perp T_i \mid X_i$, $\forall i$, then the population treatment effect for the treated, $\tau|_{T=1}$, is identified: it is equal to the treatment effect conditional on covariates and on assignment to treatment, $\tau|_{T=1,X}$, averaged over the distribution $X \mid T_i = 1$.[5]

* Columbia University and Morgan Stanley, respectively. Previous versions of this paper were circulated under the title "An Oversampling Algorithm for Nonexperimental Causal Studies with Incomplete Matching and Missing Outcome Variables" (1995) and as National Bureau of Economic Research working paper no. 6829. We thank Robert Moffitt and two referees for detailed comments and suggestions that have improved the paper. We are grateful to Gary Chamberlain, Guido Imbens, and Donald Rubin for their support and encouragement, and greatly appreciate comments from Joshua Angrist, George Cave, and Jeff Smith. Special thanks are due to Robert LaLonde for providing, and helping to reconstruct, the data from his 1986 study. Valuable comments were received from seminar participants at Harvard, MIT, and the Manpower Demonstration Research Corporation. Any remaining errors are the authors' responsibility.

Received for publication February 12, 1998. Revision accepted for publication January 24, 2001.

The Review of Economics and Statistics, February 2002, 84(1): 151–161. © 2002 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology.

[1] More precisely, to estimate the treatment impact on the treated, the outcome in the untreated state must be independent of the treatment assignment.

[2] Fraker and Maynard (1987) also conduct an evaluation of nonexperimental methods using the NSW data. Their findings were similar to LaLonde's.

[3] In a nonexperimental setting, the treatment and comparison samples are either drawn from distinct groups or are nonrandom samples from a common population. In the former case, typically the interest is the treatment impact for the group from which the treatment sample is drawn. In the latter case, the interest could be in knowing the treatment effect for the subpopulation from which the treatment sample is drawn or the treatment effect for the full population from which both treatment and comparison samples were drawn. In contrast, in a randomized experiment, the treatment and control samples are randomly drawn from the same population, and thus the treatment effect for the treated group is identical to the treatment effect for the untreated group.

[4] We are also implicitly making what is sometimes called the stable-unit-treatment-value assumption (Rubin, 1980, 1986). This amounts to the assumption that $Y_{i1}$ ($Y_{i0}$) does not depend upon which units other than $i$ were assigned to the treatment group; that is, there are no within-group spillovers or general equilibrium effects.

[5] Randomization implies $Y_{i1}, Y_{i0} \perp T_i$, but $Y_{i0} \perp T_i \mid X_i$ is all that is required to estimate the treatment effect on the treated. The stronger assumption, $Y_{i1}, Y_{i0} \perp T_i \mid X_i$, would be needed to identify the treatment effect on the comparison group or the overall average. Note that we are estimating the treatment effect for the treatment group as it exists at the time of analysis. We are not estimating any program entry or exit effects that might arise if
the treatment were made more widely available. Estimation of such effects would require additional data as described by Moffitt (1992).

Intuitively, this assumes that, conditioning on observable covariates, we can take assignment to treatment to have been random and that, in particular, unobservables play no role in the treatment assignment; comparing two individuals with the same observable characteristics, one of whom was treated and one of whom was not, is by proposition 1 like comparing those two individuals in a randomized experiment. Under this assumption, the conditional treatment effect, $\tau|_{T=1}$, is estimated by first estimating $\tau|_{T=1,X}$ and then averaging over the distribution of $X$ conditional on $T = 1$.

One way to estimate this equation would be by matching units on their vector of covariates, $X_i$. In principle, we could stratify the data into subgroups (or bins), each defined by a particular value of $X$; within each bin, this amounts to conditioning on $X$. The limitation of this method is that it relies on a sufficiently rich comparison group so that no bin containing a treated unit is without a comparison unit. For example, if all $n$ variables are dichotomous, the number of possible values for the vector $X$ will be $2^n$. Clearly, as the number of variables increases, the number of cells increases exponentially, increasing the difficulty of finding exact matches for each of the treated units.

C. Propensity Score and Dimensionality Reduction

Rosenbaum and Rubin (1983, 1985a, b) suggest the use of the propensity score (the probability of receiving treatment conditional on covariates) to reduce the dimensionality of the matching problem discussed in the previous section.

Proposition 2 (Rosenbaum and Rubin, 1983). Let $p(X_i)$ be the probability of a unit $i$ having been assigned to treatment, defined as $p(X_i) \equiv \Pr(T_i = 1 \mid X_i) = E(T_i \mid X_i)$. Then,

$$Y_{i1}, Y_{i0} \perp T_i \mid X_i \;\Rightarrow\; Y_{i1}, Y_{i0} \perp T_i \mid p(X_i).$$

Proposition 3. $\tau|_{T=1} = E_{p(X)}\left[\left(\tau|_{T=1,p(X)}\right) \mid T_i = 1\right]$.

Thus, the conditional independence result extends to the use of the propensity score, as does by immediate implication our result on the computation of the conditional treatment effect, now $\tau|_{T=1,p(X)}$. The point of using the propensity score is that it substantially reduces the dimensionality of the problem, allowing us to condition on a scalar variable rather than in a general $n$-space.

III. Propensity Score-Matching Algorithms

In the discussion that follows, we assume that the propensity score is known, which of course it is not. The appendix discusses a straightforward method for estimating it.[6]

Matching on the propensity score is essentially a weighting scheme, which determines what weights are placed on comparison units when computing the estimated treatment effect:

$$\hat{\tau}|_{T=1} = \frac{1}{|N|} \sum_{i \in N} \left( Y_i - \frac{1}{|J_i|} \sum_{j \in J_i} Y_j \right),$$

where $N$ is the treatment group, $|N|$ the number of units in the treatment group, $J_i$ is the set of comparison units matched to treatment unit $i$ (see Heckman, Ichimura, and Todd (1998), who discuss more general weighting schemes), and $|J_i|$ is the number of comparison units in $J_i$.

This estimator follows from proposition 3. Expectations are replaced by sample means, and we condition on $p(X_i)$ by matching each treatment unit $i$ to a set of comparison units, $J_i$, with a similar propensity score. Taken literally, conditioning on $p(X_i)$ implies exact matching on $p(X_i)$. This is difficult in practice, so the objective becomes to match treated units to comparison units whose propensity scores are sufficiently close to consider the conditioning on $p(X_i)$ in proposition 3 to be approximately valid.

Three issues arise in implementing matching: whether or not to match with replacement, how many comparison units to match to each treated unit, and finally which matching method to choose. We consider each in turn.

Matching with replacement minimizes the propensity-score distance between the matched comparison units and the treatment unit: each treatment unit can be matched to the nearest comparison unit, even if a comparison unit is matched more than once. This is beneficial in terms of bias reduction. In contrast, by matching without replacement, when there are few comparison units similar to the treated units, we may be forced to match treated units to comparison units that are quite different in terms of the estimated propensity score. This increases bias, but it could improve the precision of the estimates. An additional complication of matching without replacement is that the results are potentially sensitive to the order in which the treatment units are matched (Rosenbaum, 1995).

The question of how many comparison units to match with each treatment unit is closely related. By using a single comparison unit for each treatment unit, we ensure the smallest propensity-score distance between the treatment and comparison units. By using more comparison units, one increases the precision of the estimates, but at the cost of increased bias. One method of selecting a set of comparison units is the nearest-neighbor method, which selects the $m$ comparison units whose propensity scores are closest to the treated unit in question. Another method is caliper matching, which uses all of the comparison units within a predefined propensity score radius (or "caliper"). A benefit of caliper matching is that it uses only as many comparison units as are available within the calipers, allowing for the use of extra (fewer) units when good matches are (not) available.

In the application that follows, we consider a range of simple estimators. For matching without replacement, we consider low-to-high, high-to-low, and random matching. In these methods, the treated units are ranked (from lowest to highest or highest to lowest propensity score, or randomly). The highest-ranked unit is matched first, and the matched comparison unit is removed from further matching. For matching with replacement, we consider single-nearest-neighbor matching and caliper matching for a range of calipers. In addition to using a weighted difference in means to estimate the treatment effect, we also consider a weighted regression using the treatment and matched comparison units, with the comparison units weighted by the number of times that they are matched to a treated unit. A regression can potentially improve the precision of the estimates.

The question that remains is which method to select in practice. In general, this depends on the data in question, and in particular on the degree of overlap between the treatment and comparison groups in terms of the propensity score. When there is substantial overlap in the distribution of the propensity score between the comparison and treatment groups, most of the matching algorithms will yield similar results. When the treatment and comparison units are very different, finding a satisfactory match by matching without replacement can be very problematic. In particular, if there are only a handful of comparison units comparable to the treated units, then once these comparison units have been matched, the remaining treated units will have to be matched to comparison units that are very different. In such settings, matching with replacement is the natural choice. If there are no comparison units for a range of propensity scores, then for that range the treatment effect could not be estimated. The application that follows will further clarify the choices that the researcher faces in practice.

IV. The Data

A. The National Supported Work Program

The NSW was a U.S. federally and privately funded program that aimed to provide work experience for individuals who had faced economic and social problems prior to enrollment in the program (Hollister, Kemper, and Maynard, 1984; Manpower Demonstration Research Corporation, 1983).[7] Candidates for the experiment were selected on the basis of eligibility criteria, and then were either randomly assigned to, or excluded from, the training program. Table 1 provides the characteristics of the sample we use, LaLonde's male sample (185 treated and 260 control observations).[8] The table highlights the role of randomization: the distribution of the covariates for the treatment and control groups are not significantly different. We use the two nonexperimental comparison groups constructed by LaLonde (1986), drawn from the CPS and PSID.[9]

TABLE 1. SAMPLE MEANS AND STANDARD ERRORS OF COVARIATES FOR MALE NSW PARTICIPANTS
National Supported Work Sample (Treatment and Control), Dehejia-Wahba Sample

| Variable | Treatment | Control |
|---|---|---|
| Age | 25.81 (0.52) | 25.05 (0.45) |
| Years of schooling | 10.35 (0.15) | 10.09 (0.1) |
| Proportion of school dropouts | 0.71 (0.03) | 0.83 (0.02) |
| Proportion of blacks | 0.84 (0.03) | 0.83 (0.02) |
| Proportion of Hispanic | 0.06 (0.017) | 0.10 (0.019) |
| Proportion married | 0.19 (0.03) | 0.15 (0.02) |
| Number of children | 0.41 (0.07) | 0.37 (0.06) |
| No-show variable | 0 (0) | n/a |
| Month of assignment (Jan. 1978 = 0) | 18.49 (0.36) | 17.86 (0.35) |
| Real earnings 12 months before training | 1,689 (235) | 1,425 (182) |
| Real earnings 24 months before training | 2,096 (359) | 2,107 (353) |
| Hours worked 1 year before training | 294 (36) | 243 (27) |
| Hours worked 2 years before training | 306 (46) | 267 (37) |
| Sample size | 185 | 260 |

B. Distribution of the Treatment and Comparison Samples

Tables 2 and 3 (rows 1 and 2) present the sample characteristics of the two comparison groups and the treatment group. The differences are striking: the PSID and CPS sample units are eight to nine years older than those in the NSW group, their ethnic composition is different, and they have on average completed high school degrees, whereas NSW participants were by and large high school dropouts, and, most dramatically, pretreatment earnings are much higher for the comparison units than for the treated units, by more than $10,000. A more synoptic way to view these differences is to use the estimated propensity score as a summary statistic. Using the method outlined in the appendix, we estimate the propensity score for the two composite samples (NSW-CPS and NSW-PSID), incorporating the covariates linearly and with some higher-order terms.

[6] Standard errors should adjust for the estimation error in the propensity score and the variation that it induces in the matching process. In the application, we use bootstrap standard errors. Heckman, Ichimura, and Todd (1998) provide asymptotic standard errors for propensity score estimators, but in their application paper, Heckman, Ichimura, and Todd (1997) also use bootstrap standard errors.

[7] Four groups were targeted: Women on Aid to Families with Dependent Children (AFDC), former addicts, former offenders, and young school dropouts. Several reports extensively document the NSW program. For a general summary of the findings, see Manpower Demonstration Research Corporation (1983).

[8] The data we use are a subsample of the data used in LaLonde (1986). The analysis in LaLonde is based on one year of pretreatment earnings. But, as Ashenfelter (1978) and Ashenfelter and Card (1985) suggest, the use of more than one year of pretreatment earnings is key in accurately estimating the treatment effect, because many people who volunteer for training programs experience a drop in their earnings just prior to entering the training program. Using the LaLonde sample of 297 treated and 425 control units, we exclude the observations for which earnings in 1974 could not be obtained, thus arriving at a reduced sample of 185 treated observations and 260 control observations. Because we obtain this subset by looking at pretreatment covariates, we do not disturb the balance in observed and unobserved characteristics between the experimental treated and control groups. See Dehejia and Wahba (1999) for a comparison of the two samples.

[9] These are the CPS-1 and PSID-1 comparison groups from LaLonde's paper.
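
The weighting scheme of section III can be illustrated with a minimal sketch. The Python below is not the authors' code: the function names (`match_with_replacement`, `matching_estimate`, `comparison_weights`) and the toy scores and outcomes are assumptions made for illustration, and the propensity scores are taken as given rather than estimated. It forms the matched sets J_i by single nearest neighbor or by caliper, falling back to the nearest unit when a caliper is empty, and then evaluates the difference-in-means estimator.

```python
from collections import Counter


def match_with_replacement(treated_scores, comparison_scores, caliper=None):
    """Return, for each treated unit, the list J_i of matched comparison indices.

    caliper=None -> single nearest neighbor.
    caliper=r    -> every comparison unit within distance r of the treated
                    score; if none qualifies, use the nearest unit instead.
    Comparison units may be reused (matching with replacement).
    """
    matches = []
    for p in treated_scores:
        distances = [abs(q - p) for q in comparison_scores]
        nearest = min(range(len(distances)), key=distances.__getitem__)
        if caliper is None:
            matched = [nearest]
        else:
            matched = [j for j, d in enumerate(distances) if d <= caliper]
            if not matched:
                matched = [nearest]
        matches.append(matched)
    return matches


def matching_estimate(y_treated, y_comparison, matches):
    """Average over treated units of Y_i minus the mean outcome of J_i."""
    diffs = [
        y_treated[i] - sum(y_comparison[j] for j in J) / len(J)
        for i, J in enumerate(matches)
    ]
    return sum(diffs) / len(diffs)


def comparison_weights(matches):
    """Number of times each comparison unit is matched to a treated unit."""
    return Counter(j for J in matches for j in J)


# Toy data (hypothetical): propensity scores and outcomes.
treated_scores = [0.30, 0.62, 0.65]
comparison_scores = [0.05, 0.10, 0.28, 0.33, 0.60]
y_treated = [10.0, 14.0, 15.0]
y_comparison = [20.0, 18.0, 9.0, 10.0, 12.0]

nn_matches = match_with_replacement(treated_scores, comparison_scores)
nn_estimate = matching_estimate(y_treated, y_comparison, nn_matches)

cal_matches = match_with_replacement(treated_scores, comparison_scores, caliper=0.04)
cal_estimate = matching_estimate(y_treated, y_comparison, cal_matches)
weights = comparison_weights(cal_matches)

print(nn_matches, nn_estimate)
print(cal_matches, cal_estimate, dict(weights))
```

In this toy example the two low-score comparison units are never matched, which mirrors the idea of discarding comparison units that are not comparable to the treated group. The counts from `comparison_weights` play the role of the weights described for the weighted regression.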

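The treatment-effect standard errors in the tables that follow are described as bootstrap standard errors with 500 replications. A rough sketch of that idea is given below under simplifying assumptions that are not taken from the paper: the propensity scores are held fixed (so the estimation error in the score is not captured), treated and comparison units are resampled separately, and the estimator is single-nearest-neighbor matching with replacement. The names `nn_estimate` and `bootstrap_se` and the data are hypothetical.

```python
import random
import statistics


def nn_estimate(treated, comparison):
    """Nearest-neighbor matching estimate (with replacement).

    treated, comparison: lists of (propensity_score, outcome) pairs.
    """
    total = 0.0
    for p, y in treated:
        # Outcome of the comparison unit whose score is closest to p.
        _, y_match = min(comparison, key=lambda c: abs(c[0] - p))
        total += y - y_match
    return total / len(treated)


def bootstrap_se(treated, comparison, replications=500, seed=0):
    """Standard deviation of the estimate across bootstrap resamples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        # Resample each group with replacement, keeping group sizes fixed.
        t_star = [rng.choice(treated) for _ in treated]
        c_star = [rng.choice(comparison) for _ in comparison]
        estimates.append(nn_estimate(t_star, c_star))
    return statistics.stdev(estimates)


# Toy data (hypothetical): (propensity score, outcome) pairs.
treated = [(0.30, 10.0), (0.62, 14.0), (0.65, 15.0)]
comparison = [(0.05, 20.0), (0.10, 18.0), (0.28, 9.0), (0.33, 10.0), (0.60, 12.0)]

point_estimate = nn_estimate(treated, comparison)
se = bootstrap_se(treated, comparison, replications=500, seed=0)
print(point_estimate, se)
```

A fuller version would re-estimate the propensity score inside each replication, so that the variation it induces in the matching process is reflected in the standard error.
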
TABLE 2. SAMPLE CHARACTERISTICS AND ESTIMATED IMPACTS FROM THE NSW AND CPS SAMPLES

| Control Sample | No. of Observations | Mean Propensity Score (A) | Age | School | Black | Hispanic | No Degree | Married | RE74 | RE75 | U74 | U75 | Treatment Effect (Diff. in Means) | Regression Treatment Effect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NSW | 185 | 0.37 | 25.82 | 10.35 | 0.84 | 0.06 | 0.71 | 0.19 | 2095 | 1532 | 0.29 | 0.40 | 1794 (B) | 1672 (C) |
| | | | | | | | | | | | | | (633) | (638) |
| Full CPS | 15992 | 0.01 | 33.23 | 12.03 | 0.07 | 0.07 | 0.30 | 0.71 | 14017 | 13651 | 0.88 | 0.89 | -8498 | 1066 |
| | | (0.02) (D) | (0.53) | (0.15) | (0.03) | (0.02) | (0.03) | (0.03) | (367) | (248) | (0.03) | (0.04) | (583) (E) | (554) |
| Without replacement: | | | | | | | | | | | | | | |
| Random | 185 | 0.32 | 25.26 | 10.30 | 0.84 | 0.06 | 0.65 | 0.22 | 2305 | 1687 | 0.37 | 0.51 | 1559 | 1651 |
| | | (0.03) | (0.79) | (0.23) | (0.04) | (0.03) | (0.05) | (0.04) | (495) | (341) | (0.05) | (0.05) | (733) | (709) |
| Low to high | 185 | 0.32 | 25.23 | 10.28 | 0.84 | 0.06 | 0.66 | 0.22 | 2286 | 1687 | 0.37 | 0.51 | 1605 | 1681 |
| | | (0.03) | (0.79) | (0.23) | (0.04) | (0.03) | (0.05) | (0.04) | (495) | (341) | (0.05) | (0.05) | (730) | (704) |
| High to low | 185 | 0.32 | 25.26 | 10.30 | 0.84 | 0.06 | 0.65 | 0.22 | 2305 | 1687 | 0.37 | 0.51 | 1559 | 1651 |
| | | (0.03) | (0.79) | (0.23) | (0.04) | (0.03) | (0.05) | (0.04) | (495) | (341) | (0.05) | (0.05) | (733) | (709) |
| With replacement: | | | | | | | | | | | | | | |
| Nearest neighbor | 119 | 0.37 | 25.36 | 10.31 | 0.84 | 0.06 | 0.69 | 0.17 | 2407 | 1516 | 0.35 | 0.49 | 1360 | 1375 |
| | | (0.03) | (1.04) | (0.31) | (0.06) | (0.04) | (0.07) | (0.06) | (727) | (506) | (0.07) | (0.07) | (913) | (907) |
| Caliper, 0.00001 | 325 | 0.37 | 25.26 | 10.31 | 0.84 | 0.07 | 0.69 | 0.17 | 2424 | 1509 | 0.36 | 0.50 | 1119 | 1142 |
| | | (0.03) | (1.03) | (0.30) | (0.06) | (0.04) | (0.07) | (0.06) | (845) | (647) | (0.06) | (0.06) | (875) | (874) |
| Caliper, 0.00005 | 1043 | 0.37 | 25.29 | 10.28 | 0.84 | 0.07 | 0.69 | 0.17 | 2305 | 1523 | 0.35 | 0.49 | 1158 | 1139 |
| | | (0.02) | (1.03) | (0.32) | (0.05) | (0.04) | (0.06) | (0.06) | (877) | (675) | (0.06) | (0.60) | (852) | (851) |
| Caliper, 0.0001 | 1731 | 0.37 | 25.19 | 10.36 | 0.84 | 0.07 | 0.69 | 0.17 | 2213 | 1545 | 0.34 | 0.50 | 1122 | 1119 |
| | | (0.02) | (1.03) | (0.31) | (0.05) | (0.04) | (0.06) | (0.06) | (890) | (701) | (0.06) | (0.06) | (850) | (843) |

Variables: Age, age of participant; School, number of school years; Black, 1 if black, 0 otherwise; Hisp, 1 if Hispanic, 0 otherwise; No degree, 1 if participant had no school degrees, 0 otherwise; Married, 1 if married, 0 otherwise; RE74, real earnings (1982 US$) in 1974; RE75, real earnings (1982 US$) in 1975; U74, 1 if unemployed in 1974, 0 otherwise; U75, 1 if unemployed in 1975, 0 otherwise; and RE78, real earnings (1982 US$) in 1978.
(A) The propensity score is estimated using a logit of treatment status on: Age, Age^2, Age^3, School, School^2, Married, No degree, Black, Hisp, RE74, RE75, U74, U75, School × RE74.
(B) The treatment effect for the NSW sample is estimated using the experimental control group.
(C) The regression treatment effect controls for all covariates linearly. For matching with replacement, weighted least squares is used, where treatment units are weighted at 1 and the weight for a control is the number of times it is matched to a treatment unit.
(D) The standard error applies to the difference in means between the matched and the NSW sample, except in the last two columns, where the standard error applies to the treatment effect.
(E) Standard errors for the treatment effect and regression treatment effect are computed using a bootstrap with 500 replications.

TABLE 3. SAMPLE CHARACTERISTICS AND ESTIMATED IMPACTS FROM THE NSW AND PSID SAMPLES

| Control Sample | No. of Observations | Mean Propensity Score (A) | Age | School | Black | Hispanic | No Degree | Married | RE74 (US$) | RE75 (US$) | U74 | U75 | Treatment Effect (Diff. in Means) | Regression Treatment Effect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NSW | 185 | 0.37 | 25.82 | 10.35 | 0.84 | 0.06 | 0.71 | 0.19 | 2095 | 1532 | 0.29 | 0.40 | 1794 (B) | 1672 (C) |
| | | | | | | | | | | | | | (633) | (638) |
| Full PSID | 2490 | 0.02 | 34.85 | 12.12 | 0.25 | 0.03 | 0.31 | 0.87 | 19429 | 19063 | 0.10 | 0.09 | -15205 | 4 |
| | | (0.02) (D) | (0.57) | (0.16) | (0.03) | (0.02) | (0.03) | (0.03) | (449) | (361) | (0.04) | (0.03) | (657) (E) | (1014) |
| Without replacement: | | | | | | | | | | | | | | |
| Random | 185 | 0.25 | 29.17 | 10.30 | 0.68 | 0.07 | 0.60 | 0.52 | 4659 | 3263 | 0.40 | 0.40 | 916 | 77 |
| | | (0.03) | (0.90) | (0.25) | (0.04) | (0.03) | (0.05) | (0.05) | (554) | (361) | (0.05) | (0.05) | (1035) | (983) |
| Low to high | 185 | 0.25 | 29.17 | 10.30 | 0.68 | 0.07 | 0.60 | 0.52 | 4659 | 3263 | 0.40 | 0.40 | 916 | 77 |
| | | (0.03) | (0.90) | (0.25) | (0.04) | (0.03) | (0.05) | (0.05) | (554) | (361) | (0.05) | (0.05) | (1135) | (983) |
| High to low | 185 | 0.25 | 29.17 | 10.30 | 0.68 | 0.07 | 0.60 | 0.52 | 4659 | 3263 | 0.40 | 0.40 | 916 | 77 |
| | | (0.03) | (0.90) | (0.25) | (0.04) | (0.03) | (0.05) | (0.05) | (554) | (361) | (0.05) | (0.05) | (1135) | (983) |
| With replacement: | | | | | | | | | | | | | | |
| Nearest neighbor | 56 | 0.70 | 24.81 | 10.72 | 0.78 | 0.09 | 0.53 | 0.14 | 2206 | 1801 | 0.54 | 0.69 | 1890 | 2315 |
| | | (0.07) | (1.78) | (0.54) | (0.11) | (0.05) | (0.12) | (0.11) | (1248) | (963) | (0.11) | (0.11) | (1202) | (1131) |
| Caliper, 0.00001 | 85 | 0.70 | 24.85 | 10.72 | 0.78 | 0.09 | 0.53 | 0.13 | 2216 | 1819 | 0.54 | 0.69 | 1893 | 2327 |
| | | (0.08) | (1.80) | (0.56) | (0.12) | (0.05) | (0.12) | (0.12) | (1859) | (1896) | (0.10) | (0.11) | (1198) | (1129) |
| Caliper, 0.00005 | 193 | 0.70 | 24.83 | 10.72 | 0.78 | 0.09 | 0.53 | 0.14 | 2247 | 1778 | 0.54 | 0.69 | 1928 | 2349 |
| | | (0.06) | (2.17) | (0.60) | (0.11) | (0.04) | (0.11) | (0.10) | (1983) | (1869) | (0.09) | (0.09) | (1196) | (1121) |
| Caliper, 0.0001 | 337 | 0.70 | 24.92 | 10.73 | 0.78 | 0.09 | 0.53 | 0.14 | 2228 | 1763 | 0.54 | 0.70 | 1973 | 2411 |
| | | (0.05) | (2.30) | (0.67) | (0.11) | (0.04) | (0.11) | (0.09) | (1965) | (1777) | (0.07) | (0.08) | (1191) | (1122) |
| Caliper, 0.001 | 2021 | 0.70 | 24.98 | 10.74 | 0.79 | 0.09 | 0.53 | 0.13 | 2398 | 1882 | 0.53 | 0.69 | 1824 | 2333 |
| | | (0.03) | (2.37) | (0.70) | (0.09) | (0.04) | (0.10) | (0.07) | (2950) | (2943) | (0.06) | (0.06) | (1187) | (1101) |

(A) The propensity score is estimated using a logit of treatment status on: Age, Age^2, School, School^2, Married, No degree, Black, Hisp, RE74, RE74^2, RE75, RE75^2, U74, U75, U74 × Hisp.
(B) The treatment effect for the NSW sample is estimated using the experimental control group.
(C) The regression treatment effect controls for all covariates linearly. For matching with replacement, weighted least squares is used, where treatment units are weighted at 1 and the weight for a control is the number of times it is matched to a treatment unit.
(D) The standard error applies to the difference in means between the matched and the NSW sample, except in the last two columns, where the standard error applies to the treatment effect.
(E) Standard errors for the treatment effect and regression treatment effect are computed using a bootstrap with 500 replications.

Figures 1 and 2 provide a simple diagnostic on the data examined, plotting the histograms of the estimated propensity scores for the NSW-CPS and NSW-PSID samples. Note that the histograms do not include the comparison units (11,168 units for the CPS and 1,254 units for the PSID) whose estimated propensity score is less
FIGURE 1.—HISTOGRAM OF ESTIMATED PROPENSITY SCORE, FIGURE 3.—PROPENSITY SCORE FOR TREATED AND MATCHED NSW AND CPS COMPARISON UNITS, RANDOM WITHOUT REPLACEMENT

than the minimumestimated propensity score forthe V.MatchingResults treated units. As well, the rst bins of both diagrams contain most of the remaining comparison units (4,398 Figures 3to 6provide asnapshot of the matching forthe CPSand 1,007 forthe PSID).Hence, it is clear methods described in section IIIand applied to the that very fewof the comparisonunits arecomparable to NSW-CPS sample, where the horizontal axis displays the treated units. In fact,one of the strengths of the treated units (indexed fromlowest to highest estimated propensity score method isthat itdramatically highlights propensity score) and the vertical axis depicts the pro- this fact.In comparing the other bins, wenote that the pensity scores of the treated units and their matched number ofcomparison units ineach bin isapproximately comparison counterparts. (Thecorrespondi ng gures for equal to the number of treated units in the NSW-CPS the NSW-PSID sample look very similar.)Figures 3to5, sample, but, inthe NSW-PSID sample, many ofthe upper which consider matching without replacement, share the bins have farmore treated units than comparison units. common featurethat the rst 100 or so treated units are This last observation will beimportant ininterpreting the well matched totheir comparison group counterparts: the results of the next section. solid and the dashed lines virtually overlap. But the

FIGURE 2.—HISTOGRAM OF ESTIMATED PROPENSITY SCORE, FIGURE 4.—PROPENSITY SCORE FOR TREATED AND MATCHED NSW AND PSID COMPARISON UNITS, LOWEST TO HIGHEST PROPENSITY SCORE-MATCHING METHODS FOR NONEXPERIMENTAL CAUSAL STUDIES 157

FIGURE 5.—PROPENSITY SCORE FOR TREATED AND MATCHED FIGURE 6.—PROPENSITY SCORE FOR TREATED AND MATCHED COMPARISON UNITS, HIGHEST TO LOWEST COMPARISON UNITS, NEAREST MATCH

treated units with estimated propensity scores of 0.4 or higher are not well matched.

In figure 3, units that are randomly selected to be matched earlier find better matches, but those matched later are poorly matched because the few comparison units comparable to the treated units have already been used. Likewise, in figure 4, where units are matched from lowest to highest, treated units in the 140th to 170th positions are forced to use comparison units with ever-higher propensity scores. Finally, for the remaining units (from approximately the 170th position on), the comparison units with high propensity scores are exhausted and matches are found among comparison units with much lower estimated propensity scores. Similarly, when we match from highest to lowest, the quality of matches begins to decline after the first few treated units, until we reach treated units whose propensity score is (approximately) 0.4.

Figure 6 depicts the matching achieved by the nearest-match method.¹⁰ We note immediately that by matching with replacement we are able to avoid the deterioration in the quality of matches noted in figures 3 to 5; the solid and the dashed lines largely coincide. Looking at the line depicting comparison units more carefully, we note that it has flat sections that correspond to ranges in which a single comparison unit is being matched to more than one treated unit. Thus, even though there is a smaller sample size, we are better able to match the distribution of the propensity scores of the treated units.

In table 2, we explore the matched samples and the estimated treatment impacts for the CPS. From rows 1 and 2, we already noted that the CPS sample is very different from the NSW. The aim of matching is to choose subsamples whose characteristics more closely resemble the NSW. Rows 3 to 5 of table 2 depict the matched samples that emerge from matching without replacement. Note that the characteristics of these samples are essentially identical, suggesting that these three methods yield the same comparison groups. (Figures 3 to 5 obscure this fact because they compare the order in which units are matched, not the resulting comparison groups.) The matched samples are much closer to the NSW sample than the full CPS comparison group. The matched CPS group has an age of 25.3 (compared with 25.8 and 33.2 for the NSW and full CPS samples), its ethnic composition is the same as the NSW sample (note especially the difference in the full CPS in terms of the variable Black), no degree and marital status align, and, perhaps most significantly, the pretreatment earnings are similar for both 1974 and 1975.¹¹ None of the differences between the matched groups and the NSW sample are statistically significant.¹² Looking at the nearest-match and caliper methods, little significant improvement can be discerned, although most of the variables are marginally better matched. This suggests that the observation made regarding figure 1 (that the CPS, in fact, has a sufficient number of comparison units overlapping with the NSW) is borne out in terms of the matched sample.

¹⁰ Note that, in implementing this method, if the set of comparison units within a given caliper is empty for a treated unit, we match it to the nearest comparison unit. The alternative is to drop unmatched treated units, but then one would no longer be estimating the treatment effect for the entire treated group.

¹¹ The matched earnings, like the NSW sample, exhibit the Ashenfelter (1978) “dip” in earnings in the year prior to program participation.

¹² Note that both LaLonde (1986) and Fraker and Maynard (1987) attempt to use “first-generation” matching methods to reduce differences between the treatment and comparison groups. LaLonde creates subsets of CPS-1 and PSID-1 by matching single characteristics (employment status and income). Dehejia and Wahba (1999) demonstrates that significant differences remain between the reduced comparison groups and the treatment group. Fraker and Maynard match on predicted earnings. Their matching method also fails to balance pretreatment characteristics (especially earnings) between the treatment and comparison group. (See Fraker and Maynard (1987, p. 205).)

Turning to the estimates of the treatment impact, in row 1 we see that the benchmark estimate of the treatment impact from the randomized experiment is $1,794. For the full CPS comparison group, the estimate is $8,498 using a difference in means and $1,066 using regression adjustment. The raw estimate is very misleading when compared with the benchmark, although the regression-adjusted estimate is better. The matching estimates are closer. For the without-replacement estimators, the estimate ranges from $1,559 to $1,605 for the difference in means and from $1,651 to $1,681 for the regression-adjusted estimator. The nearest-neighbor with-replacement estimates are $1,360 and $1,375. Essentially, these methods succeed by picking out the subset of the CPS that is the best comparison for the NSW. Based on these estimates, one might conclude that matching without replacement is the best strategy. The reason why all the methods perform well is that there is reasonable overlap between the treatment and CPS comparison samples. As we will see, for the PSID comparison group the estimates are very different.

When using caliper matching, a larger comparison group is selected: 325 for a caliper of 0.00001, 1,043 for a caliper of 0.0001, and 1,731 for a caliper of 0.0001. In terms of the characteristics of the sample, few significant differences are observed, although we know that the quality of the matches in terms of the propensity score is poorer. This is reflected in the estimated treatment impact, which ranges from $1,122 to $1,149.

Using the PSID sample (table 3), somewhat different conclusions are reached. Like the CPS, the PSID sample is very different from the NSW sample. Unlike the CPS, the matched-without-replacement samples are not fully comparable to the NSW. They are reasonably comparable in terms of age, schooling, and ethnicity, but, in terms of pretreatment earnings, we observe a large (and statistically significant) difference. As a result, it is not surprising that the estimates of the treatment impact, both by a difference in means and through regression adjustment, are far from the experimental benchmark (ranging from -$916 to $77). In contrast, the matched-with-replacement samples use even fewer (56) comparison units, but they are able to match the pretreatment earnings of the NSW sample and the other variables as well. This corresponds to our observation regarding figure 2, namely that there are very few comparison units in the PSID that are similar to units in the NSW; when this is the case, we expect more sensitivity to the method used to match observations, and we expect matching with replacement to perform better. The treatment impact as estimated by the nearest-neighbor method through a difference in means ($1,890) is very similar to the experimental benchmark, but differs by $425 when estimated through regression adjustment (although it is still closer than the estimates in rows 1 to 4). The difference in the two estimates is less surprising when we consider the sample size involved: we are using only 56 of the 2,490 potential comparison units from the PSID. For the PSID, caliper matching also performs well. The estimates range from $1,824 to $2,411. Slightly lower standard errors are achieved than with nearest-neighbor matching.

In conclusion, propensity score-matching methods are able to yield reasonably accurate estimates of the treatment impact, especially when contrasted with the range of estimates that emerged in LaLonde’s paper. By selecting an appropriate subset from the comparison group, a simple difference in means yields an estimate of the treatment effect close to the experimental benchmark. The choice among matching methods becomes important when there is minimal overlap between the treatment and comparison groups. When there is minimal overlap, matching with replacement emerges as a better choice. In principle, caliper matching can also improve standard errors relative to nearest-neighbor matching, although at the cost of greater bias. At least in our application, the benefits of caliper matching were limited. When there is greater overlap, the without-replacement estimators perform as well as the nearest-neighbor method, and their standard errors are somewhat lower than those of the nearest-neighbor method, so, when many comparison units overlap with the treatment group, matching without replacement is probably a better choice.

VI. Testing

A. Testing the Matching Assumption

The special structure of the data we use allows us to test the assumption that underlies propensity score matching. Because we have both an experimental control group (which we use to estimate the experimental benchmark estimate in row 1 of tables 2 and 3) and two nonexperimental comparison groups, we can test the assumption that, conditional on the propensity score, earnings in the nontreated state are independent of assignment to treatment (Heckman et al., 1998; Heckman, Ichimura, and Todd, 1997). In practice, this amounts to comparing earnings for the experimental control group with earnings for the two comparison groups using the propensity score. We apply the propensity score specifications from section V to the composite sample of NSW control units and CPS (or PSID) comparison units. Following Heckman et al. (1998), we compute the bias within strata defined on the propensity score.

The bias estimates (earnings for the experimental control group less earnings for the nonexperimental comparison group, conditional on the estimated propensity score) are presented graphically in figures 7 and 8. For both the CPS and PSID, we see a range of bias estimates that are particularly large for low values of the estimated propensity score. This group represents those who are least likely to have been in the treatment group, and, based on tables 2 and 3, this group has much higher earnings than those in the NSW. But none of the bias estimates are statistically significant.
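The within-stratum bias comparison described above can be illustrated with a minimal Python sketch. It assumes the propensity scores have already been estimated; the function name `stratum_bias`, the stratum edges, and the toy numbers are hypothetical and are not taken from the NSW, CPS, or PSID data. The sketch computes only the point estimate of the bias in each stratum and does not attempt the significance tests reported in the text.

```python
from statistics import mean

def stratum_bias(scores, earnings, is_control, edges):
    """Per stratum: mean earnings of experimental controls minus mean
    earnings of nonexperimental comparison units (None if a group is empty)."""
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        closed = hi == edges[-1]  # the top stratum includes its upper edge
        members = [i for i, s in enumerate(scores)
                   if lo <= s < hi or (closed and s == hi)]
        control = [earnings[i] for i in members if is_control[i]]
        comparison = [earnings[i] for i in members if not is_control[i]]
        bias = mean(control) - mean(comparison) if control and comparison else None
        out.append((lo, hi, bias))
    return out

# Hypothetical toy data: estimated scores, nontreated earnings, and a flag
# marking experimental control units (True) versus comparison units (False).
scores = [0.05, 0.10, 0.15, 0.30, 0.35, 0.70, 0.90]
earnings = [9000.0, 4000.0, 5000.0, 3000.0, 3500.0, 2000.0, 2500.0]
is_control = [False, True, False, True, False, True, False]
edges = [0.0, 0.2, 0.4, 1.0]  # assumed stratum boundaries

bias_by_stratum = stratum_bias(scores, earnings, is_control, edges)
for lo, hi, bias in bias_by_stratum:
    print(f"scores in [{lo}, {hi}]: bias = {bias}")
```

In this toy example the largest bias in absolute value appears in the lowest-score stratum, which mirrors the qualitative pattern described for figures 7 and 8 but is purely an artifact of the made-up numbers.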

FIGURE 7.—BIAS ESTIMATES, CPS

Of course, in practice a researcher will not be able to perform such tests, but it is a useful exercise when possible. It confirms that matching succeeds because the nontreated earnings of the comparison and control groups are not statistically significantly different, conditional on the estimated propensity score.

B. Testing Sensitivity to the Specification of the Propensity Score

One potential limitation of propensity score methods is the need to estimate the propensity score. In LaLonde’s (1986) paper, one of the cautionary findings was the sensitivity of the nonexperimental estimators to the specification adopted. The appendix suggests a simple method to choose a specification for the propensity score. In table 4, we consider sensitivity of the estimates to the choice of specification.

FIGURE 8.—ESTIMATED BIAS, PSID

In table 4, we consider dropping in succession the interactions and cubes, the indicators for unemployment, and finally squares of covariates in the specification. The final specification for both samples contains the covariates linearly. For the CPS, the estimate bounces from $1,037 to $1,874, and for the PSID from $1,004 to $1,845. The estimates are not particularly sensitive, especially compared to the variability of estimators in LaLonde’s original paper. Furthermore, a researcher who did not have the benefit of the experimental benchmark estimate would choose the full-specification estimates because (as explained in the appendix) these specifications succeed in balancing all the observed covariates, conditional on the estimated propensity score.

VII. Conclusion

This paper has presented a propensity score-matching method that is able to yield accurate estimates of the treatment effect in nonexperimental settings in which the treated group differs substantially from the pool of potential comparison units. The method is able to pare the large comparison group down to the relevant comparisons without using information on outcomes, thereby, if necessary, allowing outcome data to be collected only for the relevant subset of comparison units. Of course, the quality of the estimate that emerges from the resulting comparison is limited by the overall quality of the comparison group that is used. Using LaLonde’s (1986) data set, we demonstrate the ability of this technique to work in practice. Even though in a typical application the researcher would not have the benefit of checking his or her estimate against the experimental-benchmark estimate, the conclusion of our analysis is that it is extremely valuable to check the comparability of the treatment and comparison units in terms of pretreatment characteristics, which the researcher can check in most applications.

In particular, the propensity score method dramatically highlights the fact that most of the comparison units are very different from the treated units. In addition to this, when there are very few comparison units remaining after having discarded the irrelevant comparison units, the choice of matching algorithm becomes important. We demonstrate that, when there are a sufficient number of relevant comparison units (in our application, when using the CPS), the nearest-match method does no worse than the matching-without-replacement methods that would typically be applied, and, in situations in which there are very few relevant comparison units (in our application, when using the PSID), matching with replacement fares better than the alternatives. Extensions of matching with replacement (caliper matching), although interesting in principle, were of little value in our application.

TABLE 4.—SENSITIVITY OF MATCHING WITH REPLACEMENT TO THE SPECIFICATION OF THE ESTIMATED PROPENSITY SCORE

                                                    Difference-in-Means     Regression
                                     Number of      Treatment Effect        Treatment Effect (A)
Specification                        Observations   (Standard Error) (B)    (Standard Error) (B)

CPS
  Full specification                 119            1360 (633)              1375 (638)
  Dropping interactions and cubes    124            1037 (1005)             1109 (966)
  Dropping indicators:               142            1874 (911)              1529 (928)
  Dropping squares                   134            1637 (944)              1705 (965)
PSID
  Full specification                 56             1890 (1202)             2315 (1131)
  Dropping interactions and cubes    61             1004 (2412)             1729 (3621)
  Dropping indicators:               65             1845 (1720)             1592 (1624)
  Dropping squares                   69             1428 (1126)             1400 (1157)

For all specifications other than the full specifications, some covariates are not balanced across the treatment and comparison groups.
(A) The regression treatment effect controls for all covariates linearly. Weighted least squares is used where treatment units are weighted at 1 and the weight for a control is the number of times it is matched to a treatment unit.
(B) Standard errors for the treatment effect and regression treatment effect are computed using a bootstrap with 500 replications.
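To make the weighting in note (A) concrete, the following minimal Python sketch matches each treated unit to the comparison unit with the nearest estimated propensity score, with replacement, and counts how often each comparison unit is used. The function names, the tie-breaking rule (first closest unit wins), and the toy numbers are assumptions for illustration. Only the difference-in-means estimate is computed; the weighted regression adjustment and the bootstrap standard errors are not implemented here.

```python
from collections import Counter

def match_with_replacement(treated_scores, comparison_scores):
    """For each treated unit, return the index of the comparison unit whose
    score is closest. Comparison units may be reused; ties go to the first."""
    return [min(range(len(comparison_scores)),
                key=lambda j: abs(comparison_scores[j] - p))
            for p in treated_scores]

def matched_difference_in_means(treated_y, comparison_y, matches):
    """Treated mean minus the match-count-weighted mean of comparison units."""
    weights = Counter(matches)  # times each comparison unit is matched
    matched_mean = (sum(w * comparison_y[j] for j, w in weights.items())
                    / sum(weights.values()))
    treated_mean = sum(treated_y) / len(treated_y)
    return treated_mean - matched_mean, weights

# Hypothetical toy data (not the NSW, CPS, or PSID samples).
treated_scores = [0.62, 0.65, 0.90]
comparison_scores = [0.05, 0.10, 0.60, 0.88]
treated_y = [5000.0, 7000.0, 6000.0]
comparison_y = [12000.0, 11000.0, 4000.0, 5000.0]

matches = match_with_replacement(treated_scores, comparison_scores)
effect, weights = matched_difference_in_means(treated_y, comparison_y, matches)
print(matches, dict(weights), round(effect, 2))
```

Here the comparison unit with score 0.60 is matched twice, so it receives weight 2, while the two low-score comparison units are never used. The same counts would serve as the control weights in a weighted least squares regression of the kind note (A) describes.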

It is something of an irony that the data that we use were originally employed by LaLonde (1986) to demonstrate the failure of standard nonexperimental methods in accurately estimating the treatment effect. Using matching methods on both of his samples, we are able to replicate the experimental benchmark, but beyond this we focus attention on the value of flexibly adjusting for observable differences between the treatment and comparison groups. The process of trying to find a subset of the PSID group comparable to the NSW units demonstrated that the PSID is a poor comparison group, especially when compared to the CPS.

Given the success of propensity score methods in this application, how might a researcher choose which method to use in other settings? An important issue is whether the assumption of selection on observable covariates is valid, or whether the selection process depends on variables that are unobserved (Heckman and Robb, 1985). Only when the researcher is comfortable with the former assumption do propensity score methods come into play. Even then, the researcher still can use standard regression techniques with suitably flexible functional forms (Cain, 1975; Barnow, Cain, and Goldberger, 1980). The methods that we discuss in this paper should be viewed as a complement to the standard techniques in the researcher’s arsenal. By starting with a propensity score analysis, the researcher will have a better sense of the extent to which the treatment and comparison groups overlap and consequently of how sensitive estimates will be to the choice of functional form.

REFERENCES

Ashenfelter, Orley, “Estimating the Effects of Training Programs on Earnings,” this REVIEW 60:1 (February 1978), 47–57.

Ashenfelter, Orley, and D. Card, “Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs,” this REVIEW 67:4 (November 1985), 648–660.

Barnow, Burt, “The Impact of CETA Programs on Earnings: A Review of the Literature,” Journal of Human Resources 22:2 (Spring 1987), 157–193.

Barnow, Burt, Glen Cain, and Arthur Goldberger, “Issues in the Analysis of Selectivity Bias” (pp. 42–59), in Ernst W. Stromsdorfer and George Farkas (Eds.), Evaluation Studies Review Annual, 5 (Beverly Hills: Sage Publications, 1980).

Bassi, Laurie, “Estimating the Effects of Training Programs with Nonrandom Selection,” this REVIEW 66:1 (February 1984), 36–43.

Cain, Glen, “Regression and Selection Models to Improve Nonexperimental Comparisons” (pp. 297–317), in C. A. Bennett and A. A. Lumsdaine (Eds.), Evaluation and Experiments: Some Critical Issues in Assessing Social Programs (New York: Academic Press, 1975).

Cave, George, and Hans Bos, “The Value of a GED in a Choice-Based Experimental Sample” (New York: Manpower Demonstration Research Corporation, 1995).

Cochran, W. G., and D. B. Rubin, “Controlling Bias in Observational Studies: A Review,” Sankhya, ser. A, 35:4 (December 1973), 417–446.

Cox, D. R., “Causality: Some Statistical Aspects,” Journal of the Royal Statistical Society, series A, 155, part 2 (1992), 291–301.

Czajka, John, Sharon M. Hirabayashi, Roderick J. A. Little, and Donald B. Rubin, “Projecting from Advance Data Using Propensity Modeling: An Application to Income and Tax Statistics,” Journal of Business and Economic Statistics 10:2 (April 1992), 117–131.

Dehejia, Rajeev, and Sadek Wahba, “An Oversampling Algorithm for Non-experimental Causal Studies with Incomplete Matching and Missing Outcome Variables,” Harvard University mimeograph (1995).

Dehejia, Rajeev, and Sadek Wahba, “Causal Effects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs,” Journal of the American Statistical Association 94:448 (December 1999), 1053–1062.

Fraker, T., and R. Maynard, “Evaluating Comparison Group Designs with Employment-Related Programs,” Journal of Human Resources 22 (1987), 194–227.

Friedlander, Daniel, David Greenberg, and Philip Robins, “Evaluating Government Training Programs for the Economically Disadvantaged,” Journal of Economic Literature 35:4 (December 1997), 1809–1855.

Heckman, James, Hidehiko Ichimura, Jeffrey Smith, and Petra Todd, “Sources of Selection Bias in Evaluating Social Programs: An Interpretation of Conventional Measures and Evidence on the Effectiveness of Matching as a Program Evaluation Method,” Proceedings of the National Academy of Sciences 93:23 (November 1996), 13416–13420.

Heckman, James, Hidehiko Ichimura, Jeffrey Smith, and Petra Todd, “Characterizing Selection Bias Using Experimental Data,” Econometrica 66:5 (September 1998), 1017–1098.

Heckman, James, Hidehiko Ichimura, and Petra Todd, “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme,” Review of Economic Studies 64:4 (October 1997), 605–654.

Heckman, James, Hidehiko Ichimura, and Petra Todd, “Matching as an Econometric Evaluation Estimator,” Review of Economic Studies 65:2 (April 1998), 261–294.

Heckman, James, and Richard Robb, “Alternative Methods for Evaluating the Impact of Interventions” (pp. 63–113), in J. Heckman and B. Singer (Eds.), Longitudinal Analysis of Labor Market Data, Econometric Society Monograph, No. 10 (Cambridge, UK: Cambridge University Press, 1985).

Holland, Paul W., “Statistics and Causal Inference,” Journal of the American Statistical Association 81:396 (December 1986), 945–960.

Hollister, Robinson, Peter Kemper, and Rebecca Maynard, The National Supported Work Demonstration (Madison, WI: University of Wisconsin Press, 1984).

LaLonde, Robert, “Evaluating the Econometric Evaluations of Training Programs,” American Economic Review 76:4 (September 1986), 604–620.

Manpower Demonstration Research Corporation, Summary and Findings of the National Supported Work Demonstration (Cambridge, MA: Ballinger, 1983).

Moffitt, Robert, “Evaluation Methods for Program Entry Effects” (pp. 231–252), in Charles Manski and Irwin Garfinkel (Eds.), Evaluating Welfare and Training Programs (Cambridge, MA: Harvard University Press, 1992).

Raynor, W. J., “Caliper Pair-Matching on a Continuous Variable in Case Control Studies,” Communications in Statistics: Theory and Methods 12:13 (June 1983), 1499–1509.

Rosenbaum, Paul, Observational Studies (New York: Springer Verlag, 1995).

Rosenbaum, P., and D. Rubin, “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika 70:1 (April 1983), 41–55.

Rosenbaum, P., and D. Rubin, “Constructing a Control Group Using Multivariate Matched Sampling Methods that Incorporate the Propensity,” American Statistician 39:1 (February 1985a), 33–38.

Rosenbaum, P., and D. Rubin, “The Bias Due to Incomplete Matching,” Biometrics 41 (March 1985b), 103–116.

Rubin, D., “Matching to Remove Bias in Observational Studies,” Biometrics 29 (March 1973), 159–183.

Rubin, D., “Assignment to a Treatment Group on the Basis of a Covariate,” Journal of Educational Statistics 2:1 (Spring 1977), 1–26.

Rubin, D., “Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observation Studies,” Journal of the American Statistical Association 74:366 (June 1979), 318–328.

Rubin, D., “Discussion of Randomization Analysis of Experimental Data: The Fisher Randomization Test, by D. Basu,” Journal of the American Statistical Association 75:371 (September 1980), 591–593.

Rubin, D., “Discussion of Holland (1986),” Journal of the American Statistical Association 81:396 (December 1986), 961–964.

Westat, “Continuous Longitudinal Manpower Survey Net Impact Report No. 1: Impact on 1977 Earnings of New FY 1976 CETA Enrollees in Selected Program Activities,” report prepared for U.S. DOL under contract 23-24-75-07 (1981).

APPENDIX: ESTIMATING THE PROPENSITY SCORE

The first step in estimating the treatment effect is to estimate the propensity score. Any standard probability model can be used (for example, logit or probit). It is important to remember that the role of the propensity score is only to reduce the dimensions of the conditioning; as such, it has no behavioral assumptions attached to it. For ease of estimation, most applications in the statistics literature have concentrated on the logit model:

\[ \Pr(T_i = 1 \mid X_i) = \frac{e^{h(X_i)}}{1 + e^{h(X_i)}}, \]

where $T_i$ is the treatment status and $h(X_i)$ is made up of linear and higher-order terms of the covariates on which we condition to obtain an ignorable treatment assignment.¹³

¹³ Because we allow for higher-order terms in $X$, this choice is not very restrictive. By rearranging and taking logs, we obtain $\ln\bigl(\Pr(T_i = 1 \mid X_i) / (1 - \Pr(T_i = 1 \mid X_i))\bigr) = h(X_i)$. A Taylor-series expansion allows us an arbitrarily precise approximation. See also Rosenbaum and Rubin (1983).

In estimating the propensity score through a probability model, the choice of which interaction or higher-order term to include is determined solely by the need to condition fully on the observable characteristics that make up the assignment mechanism. The following proposition forms the basis of the algorithm we use to estimate the propensity score (Rosenbaum and Rubin, 1983):

Proposition A:

\[ X \perp T \mid p(X). \]

Proof: From the definition of $p(X)$ in proposition 2:

\[ E(T_i \mid X_i, p(X_i)) = E(T_i \mid X_i) = p(X_i). \]

The algorithm works as follows. Starting with a parsimonious logistic function with linear covariates to estimate the score, rank all observations by the estimated propensity score (from lowest to highest). Divide the observations into strata such that within each stratum the difference in propensity score for treated and comparison observations is insignificant. Proposition A tells us that within each stratum the distribution of the covariates should be approximately the same across the treated and comparison groups, once the propensity score is controlled for. Within each stratum, we can test for statistically significant differences between the distribution of covariates for treated and comparison units; operationally, t-tests on differences in the first moments are often sufficient, but a joint test for the difference in means for all the variables within each stratum could also be performed.¹⁴ When the covariates are not balanced within a particular stratum, the stratum may be too coarsely defined; recall that proposition A deals with observations with an identical propensity score. The solution adopted is to divide the stratum into finer strata and test again for no difference in the distribution of the covariates within the finer strata. If, however, some covariates remain unbalanced for many strata, the score may be poorly estimated, which suggests that additional terms (interaction or higher-order terms) of the unbalanced covariates should be added to the logistic specification to control better for these characteristics. This procedure is repeated for each given stratum until the covariates are balanced. The algorithm is summarized next.

¹⁴ More generally, one can also consider higher moments or interactions.

A Simple Algorithm for Estimating the Propensity Score

1. Start with a parsimonious logit specification to estimate the score.
2. Sort data according to estimated propensity score (ranking from lowest to highest).
3. Stratify all observations such that estimated propensity scores within a stratum for treated and comparison units are close (no significant difference); for example, start by dividing observations into strata of equal score range (0–0.2, ..., 0.8–1).
4. Statistical test: for all covariates, differences in means across treated and comparison units within each stratum are not significantly different from zero.
   a. If covariates are balanced between treated and comparison observations for all strata, stop.
   b. If covariates are not balanced for some stratum, divide the stratum into finer strata and reevaluate.
   c. If a covariate is not balanced for many strata, modify the logit by adding interaction terms and/or higher-order terms of the covariate and reevaluate.

A key property of this procedure is that it uses a well-defined criterion to determine which interaction terms to use in the estimation, namely those terms that balance the covariates. It also makes no use of the outcome variable, and embodies one of the specification tests proposed by LaLonde (1986) and others in the context of evaluating the impact of training on earnings, namely to test for the regression-adjusted difference in the earnings prior to treatment.
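Steps 3, 4a, and 4b of the algorithm can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not part of the paper: the scores are taken as already estimated by some logit routine (step 1 is not shown), balance is checked for a single covariate with a Welch-type t-statistic, 1.96 is used as a rough critical value, strata with fewer than two units in either group are skipped, and an unbalanced stratum is simply halved. All function names and data are hypothetical. Step 4c (respecifying the logit) is left out and would be the next move if splitting did not restore balance.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance

def t_stat(a, b):
    """Welch-type t-statistic for the difference in means of two samples."""
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se

def unbalanced_strata(scores, treated, covariate, edges, critical=1.96):
    """Step 4: list the strata in which the covariate means of treated and
    comparison units differ by more than the (assumed) critical value."""
    flagged = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        closed = hi == edges[-1]  # the top stratum includes its upper edge
        idx = [i for i, s in enumerate(scores)
               if lo <= s < hi or (closed and s == hi)]
        t = [covariate[i] for i in idx if treated[i]]
        c = [covariate[i] for i in idx if not treated[i]]
        if len(t) < 2 or len(c) < 2:
            continue  # too few units to test in this stratum
        if abs(t_stat(t, c)) > critical:
            flagged.append((lo, hi))
    return flagged

def split(edges, lo, hi):
    """Step 4b: divide the stratum (lo, hi) into two finer strata."""
    return sorted(set(edges) | {(lo + hi) / 2})

# Hypothetical units as (estimated score, treated?, age).
ages_hi = [39, 40, 41, 40, 39, 41, 40, 40]
ages_lo = [24, 25, 26, 25, 24, 26, 25, 25]
units = ([(0.05 + 0.02 * k, False, a) for k, a in enumerate(ages_hi)]
         + [(0.10, True, 39), (0.20, True, 41), (0.30, False, 24), (0.40, False, 26)]
         + [(0.26 + 0.02 * k, True, a) for k, a in enumerate(ages_lo)]
         + [(0.6, True, 22), (0.7, False, 23), (0.8, True, 23), (0.9, False, 22)])
scores = [u[0] for u in units]
treated = [u[1] for u in units]
age = [u[2] for u in units]

edges = [0.0, 0.5, 1.0]  # step 3: an initial coarse stratification
first_pass = unbalanced_strata(scores, treated, age, edges)
flagged = first_pass
for _ in range(5):  # bounded, so persistent imbalance cannot loop forever
    if not flagged:
        break
    for lo, hi in flagged:
        edges = split(edges, lo, hi)
    flagged = unbalanced_strata(scores, treated, age, edges)
print(first_pass, edges, flagged)
```

In this toy example age is unbalanced in the coarse stratum from 0 to 0.5 because comparison units cluster at low scores and treated units at higher scores; halving the stratum is enough to balance age within each finer stratum, so the loop stops at step 4a.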