<<

arXiv:1010.5586v1 [stat.ME] 27 Oct 2010 oki acig hc ea nte14s the 1940s, use. the and complexity in both in began increased have early which methods matching, Since in data. work with (nonrandomized) covariates replicate observed for observational to possible as how much as on examined this co- Work has unobserved. background methods and all matching observed on both another randomly variates, one only be from and to treated different guaranteed the are that groups is control effects causal estimating for eiwadaLo Forward Look a Stuart Inference: A. and Elizabeth Causal Review for A Methods Matching 00 o.2,N.1 1–21 1, DOI: No. 25, Vol. 2010, Science Statistical 10,UAe-mail: Maryland USA Baltimore, 21205, Health, Hopkins Public Johns of School , Bloomberg and Health Mental of

lzbt .Sur sAssatPoesr Departments Professor, Assistant is Stuart A. Elizabeth c ttsia Science article Statistical original the the of by reprint published electronic an is This ern iesfo h rgnli aiainand pagination in detail. original typographic the from differs reprint n ftekybnfiso admzdexperiments randomized of benefits key the of One nttt fMteaia Mathematical of Institute 10.1214/09-STS313 .INTRODUCTION 1. nttt fMteaia Statistics Mathematical of Institute , umr fweeteltrtr nmthn ehd snwa weighting. now classification, is phrases: methods and words matching Key headed. on be literature should and the it new) where where and of old summary (both pro research guidance a paper existing and This the methods research. coalescing matching use, current about and thinking past for structure about single a learn have to not me turn matching—do matching acr to using scattered related in methods been interested developing has are advice who Researchers related plines. and literature the science. now subje political fiel control and in medicine and matching popularity , treated gaining , on choose are work methods best 1970s, Matching to comparison. the how thereby Since examined groups, covariates. has control the ods well and to choosing treated due by original achieved the be cov of often similar samples can with closel goal groups as control This and tributions. treated randomized obtaining a by replicate ble to desirable is Abstract. 00 o.2,N.1 1–21 1, No. 25, Vol. 2010, [email protected] 2010 , hnetmtn asleet sn bevtoa aa it data, observational using effects causal estimating When . This . bevtoa td,poest crs sub- scores, propensity study, Observational in 1 fcvrae ntetetdadcnrlgop.This groups. control and distribution treated the the “balance”) in covariates (or of equate to re- aims that where headed. of be view should methods a matching matching of provides on search use paper the the on methods, guidance addi- providing In disciplines. to across tion aware— ideas not together are tying researchers and current many which methods— matching of bring- on work methods, original the matching together ing on literature diverse the ( science litical Harding, ( resources statis- as ( and such tics disciplines research across scattered the been have contrast, these ap- implementing In for in advice tech-methods. interested of and researchers summary methods a plied the nor in- of available, researchers overview niques for an information in of terested source been single has there no expanding, is field the while However, rohr tal. et Brookhart edfie“acig ral ob n method any be to broadly “matching” define We Rosenbaum 2006 ,eoois( economics ), oe al. et Ho , , 2002 2006 oee,until However, ; spossi- as y , raedis- ariate ,scooy(ognand (Morgan ), ssc as such ds thods—or Rubin providing 2007 -matched s disci- oss reducing ntheir on lc to place ie a vides t for cts meth- Imbens .Ti ae coalesces paper This ). , nd 2006 , ,epidemiology ), 2004 n po- and ) 2 E. A. STUART may involve 1 : 1 matching, weighting or subclas- the history and theory underlying matching meth- sification. The use of matching methods is in the ods. Sections 2–5 provide details on each of the steps broader context of the careful design of nonexperi- involved in implementing matching: defining a dis- mental studies (Rosenbaum, 1999, 2002; Rubin, tance measure, doing the matching, diagnosing the 2007). While extensive time and effort is put into matching, and then estimating the treatment effect the careful design of randomized , rela- after matching. The paper concludes with sugges- tively little effort is put into the corresponding “de- tions for future research and practical guidance in sign” of nonexperimental studies. In fact, precisely Section 6. because nonexperimental studies do not have the 1.1 Two Settings benefit of randomization, they require even more careful design. In this spirit of design, we can think Matching methods are commonly used in two types of any study aiming to estimate the effect of some of settings. The first is one in which the outcome intervention as having two key stages: (1) design, values are not yet available and matching is used and (2) outcome analysis. Stage (1) uses only back- to select subjects for follow-up (e.g., Reinisch et al., ground information on the individuals in the study, 1995; Stuart and Ialongo, 2009). It is particularly designing the nonexperimental study as would be a relevant for studies with cost considerations that , without access to the out- prohibit the collection of outcome data for the full come values. Matching methods are a key tool for control group. This was the setting for most of the stage (1). Only after stage (1) is finished does stage original work in matching methods, particularly the (2) begin, comparing the outcomes of the treated theoretical developments, which compared the ben- and control individuals. While matching is generally efits of selecting matched versus random samples of the control group (Althauser and Rubin, 1970; used to estimate causal effects, it is also sometimes Rubin, 1973a, 1973b). The second setting is one in used for noncausal questions, for example, to inves- which all of the outcome data is already available, tigate racial disparities (Schneider, Zaslavsky and and the goal of the matching is to reduce bias in the Epstein, 2004). estimation of the treatment effect. Alternatives to matching methods include adjust- A common feature of matching methods, which ing for background variables in a regression model, is automatic in the first setting but not the sec- instrumental variables, structural equation model- ond, is that the outcome values are not used in the ing or selection models. Matching methods have a matching process. Even if the outcome values are few key advantages over those other approaches. First, available at the time of the matching, the outcome matching methods should not be seen in conflict values should not be used in the matching process. with regression adjustment and, in fact, the two This precludes the selection of a matched sample methods are complementary and best used in com- that leads to a desired result, or even the appear- bination. Second, matching methods highlight areas ance of doing so (Rubin, 2007). The matching can of the covariate distribution where there is not suf- thus be done multiple times and the matched sam- ficient overlap between the treatment and control ples with the best balance—the most similar treated groups, such that the resulting treatment effect esti- and control groups—are chosen as the final matched mates would rely heavily on extrapolation. Selection samples; this is similar to the design of a random- models and regression models have been shown to ized experiment where a particular randomization perform poorly in situations where there is insuffi- may be rejected if it yields poor covariate balance cient overlap, but their standard diagnostics do not (Hill, Rubin and Thomas, 1999; Greevy et al., 2004). involve checking this overlap (Dehejia and Wahba, This paper focuses on settings with a treatment 1999, 2002; Glazerman, Levy and Myers, 2003). defined at some particular point in time, covariates Matching methods in part serve to make researchers measured at (or relevant to) some period of time aware of the quality of resulting inferences. Third, before the treatment, and outcomes measured af- matching methods have straightforward diagnostics ter the treatment. It does not consider more com- by which their performance can be assessed. plex longitudinal settings where individuals may go The paper proceeds as follows. The remainder of in and out of the treatment group, or where treat- Section 1 provides an introduction to matching meth- ment assignment date is undefined for the control ods and the scenarios considered, including some of group. Methods such as marginal structural models MATCHING METHODS 3

(Robins, Hernan and Brumback, 2000) or balanced from one another on all covariates, observed and un- risk set matching (Li, Propert and Rosenbaum, 2001) observed. In nonexperimental studies, we must posit are useful in those settings. an assignment mechanism, which determines which 1.2 Notation and Background: Estimating individuals receive treatment and which receive con- Causal Effects trol. A key assumption in nonexperimental studies is that of a strongly ignorable treatment assignment As first formalized in Rubin (1974), the estima- (Rosenbaum and Rubin, 1983b) which implies that tion of causal effects, whether from a randomized (1) treatment assignment (T ) is independent of the experiment or a nonexperimental study, is inher- potential outcomes (Y (0),Y (1)) given the covariates ently a comparison of potential outcomes. In par- (X): T ⊥(Y (0),Y (1))|X, and (2) there is a positive ticular, the causal effect for individual i is the com- probability of receiving each treatment for all val- parison of individual i’s outcome if individual i re- ues of X: 0 < P (T = 1|X) < 1 for all X. The first ceives the treatment (the potential outcome under component of the definition of strong ignorability treatment), Yi(1), and individual i’s outcome if in- is sometimes termed “ignorable,” “no hidden bias” dividual i receives the control (the potential out- or “unconfounded.” Weaker versions of the ignora- come under control), Yi(0). For simplicity, we use bility assumption are sufficient for some quantities the term “individual” to refer to the units that re- of interest, as discussed further in Imbens (2004). ceive the treatment of interest, but the formulation This assumption is often more reasonable than it would stay the same if the units were schools or may sound at first since matching on or controlling communities. The “fundamental problem of causal for the observed covariates also matches on or con- inference” (Holland, 1986) is that, for each individ- trols for the unobserved covariates, in so much as ual, we can observe only one of these potential out- they are correlated with those that are observed. comes, because each unit (each individual at a par- Thus, the only unobserved covariates of concern are ticular point in time) will receive either treatment those unrelated to the observed covariates. Analy- or control, not both. The estimation of causal effects ses can be done to assess sensitivity of the results to can thus be thought of as a problem the existence of an unobserved confounder related (Rubin, 1976a), where we are interested in predict- to both treatment assignment and the outcome (see ing the unobserved potential outcomes. Section 6.1.2). Heller, Rosenbaum and Small (2009) For efficient causal inference and good estimation of the unobserved potential outcomes, we would like also discuss how matching can make effect estimates to compare treated and control groups that are as less sensitive to an unobserved confounder, using a similar as possible. If the groups are very different, concept called “design sensitivity.” An additional as- the prediction of Y (1) for the control group will sumption is the Stable Unit Treatment Value be made using information from individuals who Assumption (SUTVA; Rubin, 1980), which states look very different from themselves, and likewise that the outcomes of one individual are not affected for the prediction of Y (0) for the treated group. by treatment assignment of any other individuals. A number of authors, including Cochran and Rubin While not always plausible—for example, in school (1973), Rubin (1973a, 1973b), Rubin (1979), settings where treatment and control children may Heckman, Ichimura and Todd (1998), Rubin and interact, leading to “spillover” effects—the plausibil- Thomas, (2000) and Rubin (2001), have shown that ity of SUTVA can often be improved by design, such methods such as adjustment can as by reducing interactions between the treated and actually increase bias in the estimated treatment ef- control groups. Recent work has also begun think- fect when the true relationship between the covari- ing about how to relax this assumption in analyses ate and outcome is even moderately nonlinear, espe- (Hong and Raudenbush, 2006; Sobel, 2006; cially when there are large differences in the Hudgens and Halloran, 2008). and of the covariates in the treated and To formalize, using notation similar to that in control groups. Rubin (1976b), we consider two populations, Pt and Randomized experiments use a known random- Pc, where the subscript t refers to a group exposed ized assignment mechanism to ensure “balance” of to the treatment and c refers to a group exposed the covariates between the treated and control to the control. Covariate data on p pre-treatment groups: The groups will be only randomly different covariates is available on random samples of sizes 4 E. A. STUART

Nt and Nc from Pt and Pc. The means and vari- ATT, better matching scenarios include situations ance covariance matrix of the p covariates in group with many more control than treated individuals, i are given by µi and Σi, respectively (i = t, c). For small initial bias between the groups, and smaller individual j, the p covariates are denoted by Xj , in the treatment group than the control treatment assignment by Tj (Tj = 0 or 1), and the group. observed outcome by Yj . Without loss of generality, Dealing with multiple covariates was a challenge we assume Nt < Nc. due to both computational and data problems. With To define the treatment effect, let E(Y (1)|X)= more than just a few covariates, it becomes very dif- R1(X) and E(Y (0)|X)= R0(X). In the matching ficult to find matches with close or exact values of context effects are usually defined as the difference all covariates. For example, Chapin (1947) finds that in potential outcomes, τ(x) = R1(x) − R0(x), with initial pools of 671 treated and 523 controls although other quantities, such as odds ratios, are there are only 23 pairs that match exactly on six also sometimes of interest. It is often assumed that categorical covariates. An important advance was the response surfaces, R0(x) and R1(x), are parallel, made in 1983 with the introduction of the propen- so that τ(x)= τ for all x. If the response surfaces sity score, defined as the probability of receiving the are not parallel (i.e., the effect varies), an average treatment given the observed covariates effect over some population is generally estimated. (Rosenbaum and Rubin, 1983b). The propensity Variation in effects is particularly relevant when the score facilitates the construction of matched sets estimands of interest are not difference in means, with similar distributions of the covariates, with- but rather odds ratios or relative risks, for which out requiring close or exact matches on all of the the conditional and marginal effects are not neces- individual variables. sarily equal (Austin, 2007; Lunt et al., 2009). The In a series of papers in the 1990s, Rubin and Thomas most common estimands in nonexperimental stud- (1992a, 1992b, 1996) provided a theoretical basis for ies are the “average effect of the treatment on the multivariate settings with affinely invariant match- treated” (ATT), which is the effect for those in the ing methods and ellipsoidally symmetric covariate treatment group, and the “average treatment effect” distributions (such as the normal or t-distribution), (ATE), which is the effect on all individuals (treat- again focusing on estimating the ATT. Affinely in- ment and control). See Imbens (2004), Kurth et al. variant matching methods, such as propensity score (2006) and Imai, King and Stuart (2008) for further or Mahalanobis metric matching, are those that yield discussion of these distinctions. The choice between the same matches following an affine (linear) trans- these estimands will likely involve both substantive formation of the data. Matching in this general set- reasons and data availability, as further discussed in ting is shown to be Equal Percent Bias Reducing Section 6.2. (EPBR; Rubin, 1976b). Rubin and Stuart (2006) later showed that the EPBR feature also holds under much 1.3 History and Theoretical Development of more general settings, in which the covariate dis- Matching Methods tributions are discriminant mixtures of ellipsoidally Matching methods have been in use since the first symmetric distributions. EPBR methods reduce bias half of the 20th Century (e.g., Greenwood, 1945; in all covariate directions (i.e., makes the covariate Chapin, 1947), however, a theoretical basis for these means closer) by the same amount, ensuring that if methods was not developed until the 1970s. This de- close matches are obtained in some direction (such velopment began with papers by Cochran and Rubin as the propensity score), then the matching is also (1973) and Rubin (1973a, 1973b) for situations with reducing bias in all other directions. The matching one covariate and an implicit focus on estimating thus cannot be increasing bias in an outcome that is the ATT. Althauser and Rubin (1970) provide an a linear combination of the covariates. In addition, early and excellent discussion of some practical is- matching yields the same percent bias reduction in sues associated with matching: how large the control bias for any linear function of X if and only if the “reservoir” should be to get good matches, how to matching is EPBR. define the quality of matches, how to define a “close- Rubin and Thomas (1992b) and Rubin and Thomas enough” match. Many of the issues identified in that (1996) obtain analytic approximations for the reduc- work are topics of continuing debate and discussion. tion in bias on an arbitrary linear combination of the The early papers showed that when estimating the covariates (e.g., the outcome) that can be obtained MATCHING METHODS 5 when matching on the true or estimated discrimi- rely on ignorability, which assumes that there are no nant (or propensity score) with normally distributed unobserved differences between the treatment and covariates. In fact, the approximations hold remark- control groups, conditional on the observed covari- ably well even when the distributional assumptions ates. To satisfy the assumption of ignorable treat- are not satisfied (Rubin and Thomas, 1996). The ment assignment, it is important to include in the approximations in Rubin and Thomas (1996) can matching procedure all variables known to be re- be used to determine in advance the bias reduc- lated to both treatment assignment and the outcome tion that will be possible from matching, based on (Rubin and Thomas, 1996; Heckman, Ichimura and the covariate distributions in the treated and con- Todd, 1998; Glazerman, Levy and Myers, 2003; trol groups, the size of the initial difference in the Hill, Reiter and Zanutto, 2004). Generally poor per- covariates between the groups, the original sample formance is found of methods that use a relatively sizes, the number of matches desired and the cor- small set of “predictors of convenience,” such as de- relation between the covariates and the outcome. mographics only (Shadish, Clark and Steiner, 2008). Unfortunately these approximations are rarely used When matching using propensity scores, detailed in practice, despite their ability to help researchers below, there is little cost to including variables that quickly assess whether their data will be useful for are actually unassociated with treatment assignment, estimating the causal effect of interest. as they will be of little influence in the propensity 1.4 Steps in Implementing Matching Methods score model. Including variables that are actually unassociated with the outcome can yield slight in- Matching methods have four key steps, with the creases in variance. However, excluding a potentially first three representing the “design” and the fourth important confounder can be very costly in terms of the “analysis”: increased bias. Researchers should thus be liberal in 1. Defining “closeness”: the distance measure used terms of including variables that may be associated to determine whether an individual is a good with treatment assignment and/or the outcomes. match for another. Some examples of matching have 50 or even 100 2. Implementing a matching method, given that mea- covariates included in the procedure (e.g., Rubin, sure of closeness. 2001). However, in small samples it may not be pos- 3. Assessing the quality of the resulting matched sible to include a very large set of variables. In that samples, and perhaps iterating with steps 1 and case priority should be given to variables believed to 2 until well-matched samples result. be related to the outcome, as there is a higher cost 4. Analysis of the outcome and estimation of the in terms of increased variance of including variables treatment effect, given the matching done in step unrelated to the outcome but highly related to treat- 3. ment assignment (Brookhart et al., 2006). Another effective strategy is to include a small set of covari- The next four sections go through these steps one ates known to be related to the outcomes of interest, at a time, providing an overview of approaches and do the matching, and then check the balance on all advice on the most appropriate methods. of the available covariates, including any additional variables that remain particularly unbalanced after 2. DEFINING CLOSENESS the matching. To avoid allegations of variable se- There are two main aspects to determining the lection based on estimated effects, it is best if the measure of distance (or “closeness”) to use in match- variable selection process is done without using the ing. The first involves which covariates to include, observed outcomes, and instead is based on previous and the second involves combining those covariates research and scientific understanding (Rubin, 2001). into one distance measure. One type of variable that should not be included in the matching process is any variable that may 2.1 Variables to Include have been affected by the treatment of interest The key concept in determining which covariates (Rosenbaum, 1984; Frangakis and Rubin, 2002; to include in the matching process is that of strong Greenland, 2003). This is especially important when ignorability. As discussed above, matching methods, the covariates, treatment indicator and outcomes and in fact most nonexperimental study methods, are all collected at the same point in time. If it is 6 E. A. STUART deemed to be critical to control for a variable poten- 4. Linear propensity score: tially affected by treatment assignment, it is better to exclude that variable in the matching procedure Dij = | logit(ei) − logit(ej)|. and include it in the analysis model for the outcome Rosenbaum and Rubin (1985b), Rubin and Thomas (as in Reinisch et al., 1995).1 (1996) and Rubin (2001) have found that match- Another challenge that potentially arises is when ing on the linear propensity score can be partic- variables are fully (or nearly fully) predictive of treat- ularly effective in terms of reducing bias. ment assignment. Excluding such a variable should be done only with great care, with the belief Below we use “propensity score” to refer to either that the problematic variable is completely unasso- the propensity score itself or the linear version. ciated with the outcomes of interest and that the Although exact matching is in many ways the ignorability assumption will still hold. More com- ideal (Imai, King and Stuart, 2008), the primary dif- monly, such a variable indicates a fundamental prob- ficulty with the exact and Mahalanobis distance mea- lem in estimating the effect of interest, whereby it sures is that neither works very well when X is high may not be possible to separate out the effect of the dimensional. Requiring exact matches often leads to treatment of interest from this problematic variable many individuals not being matched, which can re- using the data at hand. For example, if all adolescent sult in larger bias than if the matches are inexact but heavy drug users are also heavy drinkers, it will be more individuals remain in the analysis impossible to separate out the effect of heavy drug (Rosenbaum and Rubin, 1985b). A recent advance, use from the effect of heavy drinking. coarsened exact matching (CEM), can be used to do 2.2 Distance Measures exact matching on broader ranges of the variables; The next step is to define the “distance”: a mea- for example, using income categories rather than a sure of the similarity between two individuals. There continuous measure (Iacus, King and Porro, 2009). The Mahalanobis distance can work quite well when are four primary ways to define the distance Dij be- tween individuals i and j for matching, all of which there are relatively few covariates (fewer than 8; are affinely invariant: Rubin, 1979; Zhao, 2004), but it does not perform as well when the covariates are not normally dis- 1. Exact: tributed or there are many covariates 0, if Xi = Xj, (Gu and Rosenbaum, 1993). This is likely because Dij =  ∞, if Xi 6= Xj. Mahalanobis metric matching essentially regards all 2. Mahalanobis: interactions among the elements of X as equally im- ′ −1 portant; with more covariates, Mahalanobis match- Dij = (Xi − Xj) Σ (Xi − Xj). ing thus tries to match more and more of these If interest is in the ATT, Σ is the variance co- multi-way interactions. variance matrix of X in the full control group; A major advance was made in 1983 with the intro- if interest is in the ATE, then Σ is the variance duction of propensity scores (Rosenbaum and Rubin, covariance matrix of X in the pooled treatment 1983b). Propensity scores summarize all of the co- and full control groups. If X contains categorical variates into one scalar: the probability of being variables, they should be converted to a series treated. The propensity score for individual i is de- of binary indicators, although the distance works fined as the probability of receiving the treatment best with continuous variables. given the observed covariates: ei(Xi)= P (Ti = 1|Xi). 3. Propensity score: There are two key properties of propensity scores. The first is that propensity scores are balancing scores: Dij = |ei − ej|, At each value of the propensity score, the distri- where ek is the propensity score for individual k, bution of the covariates X defining the propensity defined in detail below. score is the same in the treated and control groups. Thus, grouping individuals with similar propensity 1The method is misstated in the footnote in Table 1 of that paper. In fact, the potential variables were scores replicates a mini-randomized experiment, at not used in the matching procedure, but were utilized in the least with respect to the observed covariates. Sec- outcome analysis (D. B. Rubin, personal communication). ond, if treatment assignment is ignorable given the MATCHING METHODS 7 covariates, then treatment assignment is also ignor- A more recently developed distance measure is the able given the propensity score. This justifies match- “prognosis score” (Hansen, 2008). Prognosis scores ing based on the propensity score rather than on are essentially the predicted outcome each individ- the full multivariate set of covariates. Thus, when ual would have under the control condition. The treatment assignment is ignorable, the difference in benefit of prognosis scores is that they take into ac- means in the outcome between treated and con- count the relationship between the covariates and trol individuals with a particular propensity score the outcome; the drawback is that it requires a model value is an unbiased estimate of the treatment effect for that relationship. Since it thus does not have the at that propensity score value. While most of the clear separation of the design and analysis stages propensity score results are in the context of finite that we advocate here, we focus instead on other ap- samples and the settings considered by proaches, but it is a potentially important advance Rubin and Thomas (1992a, 1996), Abadie and Im- in the matching literature. bens (2009a) discuss the asymptotic properties of 2.2.1 Propensity score estimation and model spec- propensity score matching. ification In practice, the true propensity scores are The distance measures described above can also rarely known outside of randomized experiments and be combined, for example, doing exact matching thus must be estimated. Any model relating a bi- on key covariates such as race or gender followed nary variable to a set of predictors can be used. by propensity score matching within those groups. The most common for propensity score estimation When exact matching on even a few variables is not is , although nonparametric meth- possible because of sample size limitations, meth- ods such as boosted CART and generalized boosted ods that yield “fine balance” (e.g., the same pro- models (gbm) often show very good performance portion of African American males in the matched (McCaffrey, Ridgeway and Morral, 2004; treated and control groups) may be a good alterna- Setoguchi et al., 2008; Lee, Lessler and Stuart, 2009). tive (Rosenbaum, Ross and Silber, 2007). If the key The model diagnostics when estimating propen- covariates of interest are continuous, Mahalanobis sity scores are not the standard model diagnostics matching within propensity score calipers for logistic regression or CART. With propensity (Rubin and Thomas, 2000) defines the distance be- score estimation, concern is not with the parameter tween individuals i and j as estimates of the model, but rather with the resulting ′ −1 balance of the covariates (Augurzky and Schmidt, (Zi − Zj) Σ (Zi − Zj), 2001). Because of this, standard concerns about Dij =  if | logit(ei) − logit(ej)|≤ c, collinearity do not apply. Similarly, since they do not  ∞, if | logit(ei) − logit(ej)| > c, use covariate balance as a criterion, model fit statis-  tics identifying classification ability (such as the c- where c is the caliper, Z is the set of “key covari- ) or stepwise selection models are not helpful ates,” and Σ is the variance covariance matrix of for variable selection (Rubin, 2004; Brookhart et al., Z. This will yield matches that are relatively well 2006; Setoguchi et al., 2008). One strategy that is matched on the propensity score and particularly helpful is to examine the balance of covariates (in- well matched on Z. Z often consists of pre-treatment cluding those not originally included in the propen- measures of the outcome, such as baseline test scores sity score model), their squares and interactions in in educational evaluations. Rosenbaum and Rubin the matched samples. If imbalance is found on par- (1985b) discuss the choice of caliper size, generaliz- ticular variables or functions of variables, those terms ing results from Table 2.3.1 of Cochran and Rubin can be included in a re-estimated propensity score (1973). When the variance of the linear propensity model, which should improve their balance in the score in the treatment group is twice as large as that subsequent matched samples (Rosenbaum and Rubin, in the control group, a caliper of 0.2 standard de- 1984; Dehejia and Wahba, 2002). viations removes 98% of the bias in a normally dis- Research indicates that misestimation of the propen- tributed covariate. If the variance in the treatment sity score (e.g., excluding a squared term that is in group is much larger than that in the control group, the true model) is not a large problem, and that smaller calipers are necessary. Rosenbaum and Rubin treatment effect estimates are more biased when the (1985b) generally suggest a caliper of 0.25 standard outcome model is misspecified than when the propen- deviations of the linear propensity score. sity score model is misspecified (Drake, 1993; 8 E. A. STUART

Dehejia and Wahba, 1999, 2002; Zhao, 2004). This most effective method for settings where the goal may in part be because the propensity score is used is to select individuals for follow-up. Nearest neigh- only as a tool to get covariate balance—the accuracy bor matching nearly always estimates the ATT, as it of the model is less important as long as balance is matches control individuals to the treated group and obtained. Thus, the exclusion of a squared term, for discards controls who are not selected as matches. example, may have less severe consequences for a In its simplest form, 1 : 1 nearest neighbor match- propensity score model than it does for the outcome ing selects for each treated individual i the control model, where interest is in interpreting a particular individual with the smallest distance from individ- regression coefficient (that on the treatment indica- ual i. A common complaint regarding 1 : 1 matching tor). However, these evaluations are fairly limited; is that it can discard a large number of observations for example, Drake (1993) considers only two covari- and thus would apparently lead to reduced power. ates. Future research should involve more systematic However, the reduction in power is often minimal, evaluations of propensity score estimation, perhaps for two main reasons. First, in a two-sample com- through more sophisticated simulations as well as parison of means, the precision is largely driven by analytic work, and consideration should include how the smaller group size (Cohen, 1988). So if the treat- the propensity scores will be used, for example, in ment group stays the same size, and only the con- weighting versus subclassification. trol group decreases in size, the overall power may 3. MATCHING METHODS not actually be reduced very much (Ho et al., 2007). Second, the power increases when the groups are Once a distance measure has been selected, the more similar because of the reduced extrapolation next step is to use that distance in doing the match- and higher precision that is obtained when com- ing. In this section we provide an overview of the paring groups that are similar versus groups that spectrum of matching methods available. The meth- are quite different (Snedecor and Cochran, 1980). ods primarily vary in terms of the number of individ- This is also what yields the increased power of using uals that remain after matching and in the relative matched pairs in randomized experiments weights that different individuals receive. One way (Wacholder and Weinberg, 1982). Smith (1997) pro- in which propensity scores are commonly used is as a predictor in the outcome model, where the set of vides an illustration where estimates from 1 : 1 match- individual covariates is replaced by the propensity ing have lower standard deviations than estimates score and the outcome models run in the full treated from a linear regression, even though thousands of and control groups (Weitzen et al., 2004). Unfortu- observations were discarded in the matching. An ad- nately the simple use of this method is not an op- ditional concern is that, without any restrictions, timal use of propensity scores, as it does not take k : 1 matching can lead to some poor matches, if, advantage of the balancing property of propensity for example, there are no control individuals with scores: If there is imbalance on the original covari- propensity scores similar to a given treated individ- ates, there will also be imbalance on the propensity ual. One strategy to avoid poor matches is to im- score, resulting in the same degree of model extrap- pose a caliper and only select a match if it is within olation as with the full set of covariates. However, the caliper. This can lead to difficulties in interpret- if the model regressing the outcome on the treat- ing effects if many treated individuals do not re- ment indicator and the propensity score is correctly ceive a match, but can help avoid poor matches. specified or if it includes nonlinear functions of the Rosenbaum and Rubin (1985a) discuss those trade- propensity score (such as quantiles or splines) and offs. their with the treatment indicator, then this can be an effective approach, with links to sub- 3.1.1 Optimal matching One complication of sim- classification (Schafer and Kang, 2008). Since this ple (“greedy”) nearest neighbor matching is that the method does not have the clear “design” aspect of order in which the treated subjects are matched may matching, we do not discuss it further. change the quality of the matches. Optimal match- ing avoids this issue by taking into account the over- 3.1 Nearest Neighbor Matching all set of matches when choosing individual matches, One of the most common, and easiest to imple- minimizing a global distance measure (Rosenbaum, ment and understand, methods is k : 1 nearest neigh- 2002). Generally, greedy matching performs poorly bor matching (Rubin, 1973a). This is generally the when there is intense competition for controls, and MATCHING METHODS 9 performs well when there is little competition to the treated individuals (e.g., Dehejia and Wahba, (Gu and Rosenbaum, 1993). Gu and Rosenbaum 1999). Additionally, when matching with replace- (1993) find that optimal matching does not in gen- ment, the order in which the treated individuals are eral perform any better than greedy matching in matched does not matter. However, inference be- terms of creating groups with good balance, but comes more complex when matching with replace- does do better at reducing the distance within pairs ment, because the matched controls are no longer (page 413): “. . . optimal matching picks about the independent—some are in the matched sample more same controls [as greedy matching] but does a bet- than once and this needs to be accounted for in the ter job of assigning them to treated units.” Thus, outcome analysis, for example, by using frequency if the goal is simply to find well-matched groups, weights. When matching with replacement, it is also greedy matching may be sufficient. However, if the possible that the treatment effect estimate will be goal is well-matched pairs, then optimal matching based on just a small number of controls; the num- may be preferable. ber of times each control is matched should be mon- 3.1.2 Selecting the number of matches: Ratio itored. matching When there are large numbers of control 3.2 Subclassification, Full Matching and individuals, it is sometimes possible to get multi- Weighting ple good matches for each treated individual, called ratio matching (Smith, 1997; Rubin and Thomas, For settings where the outcome data is already 2000). Selecting the number of matches involves a available, one apparent drawback of k : 1 nearest neigh- bias :variance trade-off. Selecting multiple controls bor matching is that it does not necessarily use all for each treated individual will generally increase the data, in that some control individuals, even some bias since the 2nd, 3rd and 4th closest matches are, of those with propensity scores in the of the by definition, further away from the treated indi- treatment groups’ scores, are discarded and not used vidual than is the 1st closest match. On the other in the analysis. Weighting, full matching and sub- hand, utilizing multiple matches can decrease vari- classification methods instead use all individuals. ance due to the larger matched sample size. Approx- These methods can be thought of as giving all indi- imations in Rubin and Thomas (1996) can help de- viduals (either implicit or explicit) weights between termine the best ratio. In settings where the out- 0 and 1, in contrast with nearest neighbor matching, come data has yet to be collected and there are cost in which individuals essentially receive a weight of constraints, researchers must also balance cost con- either 0 or 1 (depending on whether or not they are siderations. More methodological work needs to be selected as a match). The three methods discussed done to more formally quantify the trade-offs in- here represent a continuum in terms of the num- volved. In addition, k : 1 matching is not optimal ber of groupings formed, with weighting as the limit since it does not account for the fact that some of subclassification as the number of observations treated individuals may have many close matches and subclasses go to infinity (Rubin, 2001) and full while others have very few. A more advanced form of matching in between. ratio matching, variable ratio matching, allows the 3.2.1 Subclassification Subclassification forms ratio to vary, with different treated individuals re- groups of individuals who are similar, for example, ceiving differing numbers of matches as defined by quintiles of the propensity score distri- (Ming and Rosenbaum, 2001). Variable ratio match- bution. It can estimate either the ATE or the ATT, ing is related to full matching, described below. as discussed further in Section 5. One of the first uses 3.1.3 With or without replacement Another key of subclassification was Cochran (1968), which ex- issue is whether controls can be used as matches amined subclassification on a single covariate (age) for more than one treated individual: whether the in investigating the link between lung cancer and matching should be done “with replacement” or smoking. Cochran (1968) provides analytic expres- “without replacement.” Matching with replacement sions for the bias reduction possible using subclassi- can often decrease bias because controls that look fication on a univariate continuous covariate; using similar to many treated individuals can be used mul- just five subclasses removes at least 90% of the ini- tiple times. This is particularly helpful in settings tial bias due to that covariate. Rosenbaum and Rubin where there are few control individuals comparable (1985b) extended that to show that creating five 10 E. A. STUART propensity score subclasses removes at least 90% of An alternative weighting technique, weighting by the bias in the estimated treatment effect due to the odds, can be used to estimate the ATT all of the covariates that went into the propensity (Hirano, Imbens and Ridder, 2003). Formally, wi = ˆ score. Based on those results, the current conven- T + (1 − T ) ei . With this weight, treated individ- i i 1−eˆi tion is to use 5–10 subclasses. However, with larger uals receive a weight of 1. Control individuals are sample sizes more subclasses (e.g., 10–20) may be weighted up to the full sample using the 1 term, 1−eˆi feasible and appropriate (Lunceford and Davidian, and then weighted to the treated group using the 2004). More work needs to be done to help deter- eˆi term. In this way both groups are weighted to mine the optimal number of subclasses: enough to represent the treatment group. get adequate bias reduction but not too many that A third weighting technique, used primarily in the within-subclass effect estimates become unsta- economics, is kernel weighting, which averages over ble. multiple individuals in the control group for each 3.2.2 Full matching A more sophisticated form of treated individual, with weights defined by their dis- subclassification, full matching, selects the number tance (Imbens, 2000). Heckman, Hidehiko and Todd of subclasses automatically (Rosenbaum, 1991; (1997), Heckman et al. (1998) and Heckman, Ichimura Hansen, 2004; Stuart and Green, 2008). Full match- and Todd (1998) describe a local linear matching ing creates a series of matched sets, where each estimator that requires specifying a bandwidth pa- matched set contains at least one treated individ- rameter. Generally, larger bandwidths increase bias ual and at least one control individual (and each but reduce variance by putting weight on individu- matched set may have many from either group). als that are further away from the treated individ- Like subclassification, full matching can estimate ei- ual of interest. A complication with these methods ther the ATE or the ATT. Full matching is optimal is this need to define a bandwidth or smoothing pa- in terms of minimizing the average of the distances rameter, which does not generally have an intuitive between each treated individual and each control meaning; Imbens (2004) provides some guidance on individual within each matched set. Hansen (2004) that choice. demonstrates the method in the context of estimat- A potential drawback of the weighting approaches ing the effect of SAT coaching. In that example the is that, as with Horvitz–Thompson estimation, the original treated and control groups had propensity variance can be very large if the weights are extreme score differences of 1.1 standard deviations, but the (i.e., if the estimated propensity scores are close to matched sets from full matching differed by only 0 or 1). If the model is correctly specified and thus 0.01 to 0.02 standard deviations. Full matching may the weights are correct, then the large variance is thus have appeal for researchers who are reluctant to appropriate. However, a worry is that some of the discard some of the control individuals but who want extreme weights may be related more to the estima- to obtain optimal balance on the propensity score. tion procedure than to the true underlying proba- To achieve efficiency gains, Hansen (2004) also in- bilities. Weight trimming, which sets weights above troduces restricted ratios of the number of treated some maximum to that maximum, has been pro- individuals to the number of control individuals in posed as one solution to this problem (Potter, 1993; each matched set. Scharfstein, Rotnitzky and Robins, 1999). However, 3.2.3 Weighting adjustments Propensity scores there is relatively little guidance regarding the trim- can also be used directly as inverse weights in es- ming level. Because of this sensitivity to the size timates of the ATE, known as inverse probability of of the weights and potential model misspecification, treatment weighting (IPTW; Czajka et al., more attention should be paid to the accuracy of 1992; Robins, Hernan and Brumback, 2000; propensity score estimates when the propensity Lunceford and Davidian, 2004). Formally, the weight scores will be used for weighting vs. matching 1− w = Ti + Ti , wheree ˆ is the estimated propen- (Kang and Schafer, 2007). Another effective strat- i eˆi 1−eˆi k sity score for individual k. This weighting serves to egy is doubly-robust methods (Bang and Robins, weight both the treated and control groups up to 2005), which yield accurate effect estimates if either the full sample, in the same way that survey sam- the propensity score model or the outcome model pling weights weight a sample up to a population are correctly specified, as discussed further in Sec- (Horvitz and Thompson, 1952). tion 5. MATCHING METHODS 11

3.3 Assessing Common Support ing matched samples. All matching should be fol- lowed by an assessment of the covariate balance in One issue that comes up for all matching meth- the matched groups, where balance is defined as the ods is that of “common support.” To this point, we similarity of the empirical distributions of the full have assumed that there is substantial overlap of set of covariates in the matched treated and con- the propensity score distributions in the two groups, trol groups. In other words, we would like the treat- but potentially density differences. However, in some ment to be unrelated to the covariates, such that situations there may not be complete overlap in the p˜(X|T = 1) =p ˜(X|T =0), wherep ˜ denotes the em- distributions. For example, many of the control indi- pirical distribution. A matching method that results viduals may be very different from all of the treat- in highly imbalanced samples should be rejected, ment group members, making them inappropriate and alternative methods should be attempted until as points of comparison when estimating the ATT a well-balanced sample is attained. In some situa- (Austin and Mamdani, 2006). Nearest neighbor tions the diagnostics may indicate that the treated matching with calipers automatically only uses in- and control groups are too far apart to provide reli- dividuals in (or close to) the area of common sup- able estimates without heroic modeling assumptions port. In contrast, the subclassification and weight- (e.g., Rubin, 2001; Agodini and Dynarski, 2004). In ing methods generally use all individuals, regard- contrast to traditional regression models, which do less of the overlap of the distributions. When us- not examine the joint distribution of the predic- ing those methods it may be beneficial to explicitly tors (and, in particular, of treatment assignment restrict the analysis to those individuals and the covariates), matching methods will make in the region of common support (as in it clear when it is not possible to separate the ef- Heckman, Hidehiko and Todd, 1997; fect of the treatment from other differences between Dehejia and Wahba, 1999). the groups. A well-specified regression model of the Most analyses define common support using the outcome with many interactions would show this im- propensity score, discarding individuals with propen- balance and may be an effective method for estimat- sity score values outside the range of the other group. ing treatment effects (Schafer and Kang, 2008), but A second method involves examining the “convex complex models like that are only rarely used. hull” of the covariates, identifying the multidimen- When assessing balance we would ideally compare sional space that allows interpolation rather than the multidimensional of the covariates in extrapolation (King and Zeng, 2006). While these the matched treated and control groups. However, procedures can help identify who needs to be dis- multidimensional histograms are very coarse and/or carded, when many subjects are discarded it can will have many zero cells. We thus are left examin- help the interpretation of results if it is possible to ing the balance of lower-dimensional summaries of define the discard rule using one or two covariates that joint distribution, such as the marginal distri- rather than the propensity score itself. butions of each covariate. Since we are attempting to It is also important to consider the implications examine different features of the multidimensional of common support for the estimand of interest. Ex- distribution, though, it is helpful to do a number of amining the common support may indicate that it different types of balance checks, to obtain a more is not possible to reliably estimate the ATE. This complete picture. could happen, for example, if there are controls out- All balance metrics should be calculated in ways side the range of the treated individuals and thus similar to how the outcome analyses will be run, as no way to estimate Y (1) for the controls without discussed further in Section 5. For example, if sub- extensive extrapolation. When estimating the ATT classification was done, the balance measures should it may be fine (and in fact beneficial) to discard be calculated within each subclass and then aggre- controls outside the range of the treated individu- gated. If weights will be used in analyses (either as als, but discarding treated individuals may change IPTW or because of variable ratio or full matching), the group for which the results apply (Crump et al., they should also be used in calculating the balance 2009). measures (Joffe et al., 2004). 4. DIAGNOSING MATCHES 4.1 Numerical Diagnostics Perhaps the most important step in using match- One of the most common numerical balance diag- ing methods is to diagnose the quality of the result- nostics is the difference in means of each covariate, 12 E. A. STUART divided by the in the full treated leads to increased balance, simply because of the re- − group: Xt Xc . This measure, sometimes referred to duced power. In particular, hypothesis tests should σt as the “standardized bias” or “standardized differ- not be used as part of a stopping rule to select a ence in means,” is similar to an effect size and is matched sample when those samples have varying compared before and after matching sizes (or effective sample sizes). Some researchers ar- (Rosenbaum and Rubin, 1985b). The same standard gue that hypothesis tests are okay for testing balance deviation should be used in the standardization be- since the outcome analysis will also have reduced fore and after matching. The standardized difference power for estimating the treatment effect (Hansen, of means should be computed for each covariate, 2008), but that argument requires trading off Type as well as two-way interactions and squares. For I and Type II errors. The cost of those two types of binary covariates, either this same formula can be errors may differ for balance checking and treatment used (treating them as if they were continuous), or effect estimation. a simple difference in proportions can be calculated 4.2 Graphical Diagnostics (Austin, 2009). Rubin (2001) presents three balance measures With many covariates it can be difficult to care- based on the theory in Rubin and Thomas (1996) fully examine numeric diagnostics for each; graph- that provide a comprehensive view of covariate bal- ical diagnostics can be helpful for getting a quick ance: assessment of the covariate balance. A first step is to examine the distribution of the propensity scores 1. The standardized difference of means of the in the original and matched groups; this is also use- propensity score. ful for assessing common support. Figure 1 shows 2. The ratio of the variances of the propensity score an example with adequate overlap of the propensity in the treated and control groups. scores, with a good control match for each treated 3. For each covariate, the ratio of the variance of individual. For weighting or subclassification, plots the residuals orthogonal to the propensity score such as this can show the dots with their size pro- in the treated and control groups. portional to their weight. Rubin (2001) illustrates these diagnostics in an ex- For continuous covariates, we can also examine ample with 146 covariates. For regression adjust- quantile–quantile (QQ) plots, which compare the ment to be trustworthy, the absolute standardized differences of means should be less than 0.25 and the variance ratios should be between 0.5 and 2 (Rubin, 2001). These guidelines are based both on the as- sumptions underlying regression adjustment as well as on results in Rubin (1973b) and Cochran and Rubin (1973), which used simulations to estimate the bias resulting from a number of treatment effect estima- tion procedures when the true relationship between the covariates and outcome is even moderately non- linear. Although common, hypothesis tests and p-values that incorporate information on the sample size (e.g., t-tests) should not be used as measures of balance, for two main reasons (Austin, 2007; Imai, King and Stuart, 2008). First, balance is in- herently an in-sample property, without reference to any broader population or super-population. Sec- ond, hypothesis tests can be misleading as measures of balance, because they often conflate changes in Fig. 1. Matches chosen using 1 : 1 nearest neighbor match- balance with changes in statistical power. ing on propensity score. Black dots indicate matched individu- Imai, King and Stuart (2008) show an example where als; grey unmatched individuals. Data from Stuart and Green randomly discarding control individuals seemingly (2008). MATCHING METHODS 13

5. ANALYSIS OF THE OUTCOME Matching methods are not themselves methods for estimating causal effects. After the matching has created treated and control groups with adequate balance (and the thus “designed”), researchers can move to the outcome analysis stage. This stage will generally involve re- gression adjustments using the matched samples, with the details of the analysis depending on the structure of the matching. A key point is that match- ing methods are not designed to “compete” with modeling adjustments such as linear regression, and, in fact, the two methods have been shown to work best in combination (Rubin, 1973b; Carpenter, 1977; Rubin, 1979; Robins and Rotnitzky, 1995; Fig. 2. Plot of standardized difference of means of 10 covari- Heckman, Hidehiko and Todd, 1997; Rubin and ates before and after matching. Data from Stuart and Green Thomas, 2000; Glazerman, Levy and Myers, 2003; (2008). Abadie and Imbens, 2006). This is similar to the idea of “double robustness,” and the intuition is the same as that behind regression adjustment in ran- empirical distributions of each variable in the treated domized experiments, where the regression adjust- and control groups (this could also be done for the ment is used to “clean up” small residual covariate variables squared or two-way interactions, getting imbalance between the groups. Matching methods at second moments). QQ plots compare the quan- should also make the treatment effect estimates less tiles of a variable in the treatment group against the sensitive to particular outcome model specifications corresponding quantiles in the control group. If the (Ho et al., 2007). two groups have identical empirical distributions, all The following sections describe how outcome anal- points would lie on the 45 degree line. For weight- yses should proceed after each of the major types of ing methods, weighted boxplots can provide similar matching methods described above. When weight- information (Joffe et al., 2004). ing methods are used, the weights are used directly Finally, a plot of the standardized differences of in regression models, for example, using weighted means, as in Figure 2, gives us a quick overview of least squares. We focus on parametric modeling ap- whether balance has improved for individual covari- proaches since those are the most commonly used, ates (Ridgeway, McCaffrey and Morral, 2006). In however, nonparametric permutation-based tests, this example the standardized difference of means such as Fisher’s exact test, are also appropriate, as of each covariate has decreased after matching. In detailed in Rosenbaum (2002, 2010). The best re- some situations researchers may find that the stan- sults are found when estimating marginal treatment dardized difference of means of a few covariates will effects, such as differences in means or differences in proportions. Greenland, Robins and Pearl (1999) increase. This may be particularly true of covari- and Austin (2007) discuss some of the challenges in ates with small differences before matching, since estimating noncollapsible conditional treatment ef- they will not factor heavily into the propensity score fects and which matching methods perform best for model (since they are not predictive of treatment those situations. assignment). In these cases researchers should con- k : 1 sider whether the increase in bias on those covariates 5.1 After Matching is problematic, which it may be if those covariates When each treated individual has received k are strongly related to the outcome, and modify the matches, the outcome analysis proceeds using the matching accordingly (Ho et al., 2007). One solu- matched samples, as if those samples had been gen- tion for that may be to do Mahalanobis matching erated through randomization. There is debate about on those covariates within propensity score calipers. whether the analysis needs to account for the matched 14 E. A. STUART pair nature of the data (Austin, 2007). However, This is especially useful for full matching. This es- there are at least two reasons why it is not necessary timates a separate effect for each subclass, but as- to account for the matched pairs (Schafer and Kang, sumes that the relationship between the covariates 2008; Stuart, 2008). First, conditioning on the vari- X and the outcome is constant across subclasses. ables that were used in the matching process (such Specifically, models such as Yij = β0j +β1jTij +γXij + as through a regression model) is sufficient. Second, eij are fit, where i indexes individuals and j indexes propensity score matching, in fact, does not guaran- subclasses. In this model, β1j is the treatment ef- tee that the individual pairs will be well-matched on fect for subclass j, and these effects are aggregated the full set of covariates, only that groups of indi- across subclasses to obtain an overall treatment ef- Nj J viduals with similar propensity scores will have sim- fect: β = N j=1 β1j , where J is the number of sub- ilar covariate distributions. Thus, it is more com- classes, Nj isP the number of individuals in subclass mon to simply pool all the matches into matched j, and N is the total number of individuals. (This treated and control groups and run analyses using formula weights subclasses by their total size, and the groups as a whole, rather than using the indi- so estimates the ATE, but could be modified to es- vidual matched pairs. In essence, researchers can do timate the ATT.) This procedure is somewhat more the exact same analysis they would have done us- complicated for noncontinuous outcomes when the ing the original data, but using the matched data estimand of interest, for example, an odds ratio, is instead (Ho et al., 2007). noncollapsible. In that case the outcome proportions Weights need to be incorporated into the analysis in each treatment group should be aggregated and for matching with replacement or variable then combined. ratio matching (Dehejia and Wahba, 1999; 5.3 Variance Estimation Hill, Reiter and Zanutto, 2004). When matching with replacement, control group individuals receive One of the most debated topics in the literature a frequency weight that reflects the number of times on matching is variance estimation. Researchers dis- they were selected as a match. When using vari- agree on whether uncertainty in the propensity score able ratio matching, control group members receive estimation or the matching procedure needs to be a weight that is proportional to the number of con- taken into account, and, if so, how. Some researchers trols matched to “their” treated individual. For ex- (e.g., Ho et al., 2007) adopt an approach similar to ample, if 1 treated individual was matched to 3 con- randomized experiments, where the models are run trols, each of those controls receives a weight of 1/3. conditional on the covariates, which are treated as If another treated individual was matched to just 1 fixed and exogenous. Uncertainty regarding the match- control, that control receives a weight of 1. ing process is not taken into account. Other researchers argue that uncertainty in the propensity score model 5.2 After Subclassification or Full Matching needs to be accounted for in any analysis. With standard subclassification (e.g., the forma- However, in fact, under fairly general conditions tion of 5 subclasses), effects are generally estimated (Rubin and Thomas, 1996; Rubin and Stuart, 2006), within each subclass and then aggregated across sub- using estimated rather than true propensity scores classes (Rosenbaum and Rubin, 1984). Weighting the leads to an overestimate of variance, implying that subclass estimates by the number of treated individ- not accounting for the uncertainty in using estimated uals in each subclass estimates the ATT; weighting rather than true values will be conservative in the by the overall number of individuals in each subclass sense of yielding confidence intervals that are wider estimates the ATE. There may be fairly substantial than necessary. Robins, Mark and Newey (1992) also imbalance remaining in each subclass and, thus, it is show the benefit of using estimated rather than true important to do regression adjustment within each propensity scores. Analytic expressions for the bias subclass, with the treatment indicator and covari- and variance reduction possible for these situations ates as predictors (Lunceford and Davidian, 2004). are given in Rubin and Thomas (1992b). Specifically, When the number of subclasses is too large—and Rubin and Thomas (1992b) states that “. . . with large the number of individuals within each subclass too pools of controls, matching using estimated linear small—to estimate separate regression models within propensity scores results in approximately half the each subclass, a joint model can be fit, with subclass variance for the difference in the matched sample and subclass by treatment indicators (fixed effects). means as in corresponding random samples for all MATCHING METHODS 15 covariates uncorrelated with the population discrim- Qu and Lipkovich (2009) illustrate this method and inant.” This finding has been confirmed in simu- show good results for an adaptation that also in- lations (Rubin and Thomas, 1996) and an empiri- cludes indicators of missing data patterns in the cal example (Hill, Rubin and Thomas, 1999). Thus, propensity score model. when it is possible to obtain 100% or nearly 100% In addition to development and investigation of bias reduction by matching on true or estimated matching methods that account for missing data, propensity scores, using the estimated propensity one particular area needing development is balance scores will result in more precise estimates of the diagnostics for settings with missing covariate val- average treatment effect. The intuition is that the ues, including dignostics that allow for nonignorable estimated propensity score accounts for chance im- missing data mechanisms. D’Agostino, Jr. and Rubin balances between the groups, in addition to the sys- (2000) suggests a few simple diagnostics such as tematic differences—a situation where overfitting is assessing available-case means and standard devi- good. When researchers want to account for the un- ations of the continuous variables, and comparing certainty in the matching, a bootstrap procedure has available-case cell proportions for the categorical vari- been found to outperform other methods (Lechner, ables and missing-data indicators, but diagnostics 2002; Hill and Reiter, 2006). There are also some should be developed that explicitly consider the in- empirical formulas for variance estimation for par- teractions between the missing data and treatment ticular matching scenarios (e.g., Abadie and Imbens, assignment mechanisms. 2006, 2009b; Schafer and Kang, 2008), but this is an area for future research. 6.1.2 Violation of ignorable treatment assignment A critique of any nonexperimental study is that there 6. DISCUSSION may be unobserved variables related to both treat- ment assignment and the outcome, violating the as- 6.1 Additional Issues sumption of ignorable treatment assignment and bi- This section raises additional issues that arise when asing the treatment effect estimates. Since ignora- using any matching method, and also provides sug- bility can never be directly tested, researchers have gestions for future research. instead developed sensitivity analyses to assess its 6.1.1 Missing covariate values Most of the liter- plausibility, and how violations of ignorability may ature on matching and propensity scores assume affect study conclusions. One type of plausibility fully observed covariates, but of course most stud- test estimates an effect on a variable that is known ies have at least some missing data. One possibil- to be unrelated to the treatment, such as a pre- ity is to use generalized boosted models to estimate treatment measure of the outcome variable (as in propensity scores, as they do not require fully ob- Imbens, 2004), or the difference in outcomes be- served covariates. Another recommended approach tween multiple control groups (as in Rosenbaum, is to do a simple single imputation of the missing 1987b). If the test indicates that the effect is not covariates and include missing data indicators in equal to zero, then the assumption of ignorable treat- the propensity score model. This essentially matches ment assignment is deemed to be less plausible. based both on the observed values and on the miss- A second approach is to perform analyses of sen- ing data patterns. Although this is generally not an sitivity to an unobserved variable. Rosenbaum and appropriate strategy for dealing with missing data Rubin (1983a) extends the ideas of Cornfield (1959), (Greenland and Finkle, 1995), it is an effective ap- examining how strong the correlations would have proach in the propensity score context. Although it to be between a hypothetical unobserved covariate cannot balance the missing values themselves, this and both treatment assignment and the outcome to method will yield balance on the observed covari- make the observed treatment effect go away. Simi- ates and the missing data patterns (Rosenbaum and larly, bounds can be created for the treatment effect, Rubin, 1984). A more flexible method is to use mul- given a range of potential correlations of the unob- tiple imputation to impute the missing covariates, served covariate with treatment assignment and the run the matching and effect estimation separately outcome (Rosenbaum, 2002). Although sensitivity within each “complete” data set, and then use the analysis methods are becoming more and more de- multiple imputation combining rules to obtain fi- veloped, they are still used relatively nal effect estimates (Rubin, 1987; Song et al., 2001). infrequently. Newly available software 16 E. A. STUART

(McCaffrey, Ridgeway and Morral, 2004; Keele, propensity score (the propensity function), showing 2009) will hopefully help facilitate their adoption that it has properties similar to that of the propen- by more researchers. sity score in that adjusting for the low-dimensional (not always scalar, but always low-dimensional) 6.1.3 Choosing between methods There are a wide propensity function balances the covariates. They variety of matching methods available, and little advocate subclassification rather than matching, and guidance to help applied researchers select between provide two examples as well as simulations showing them (Section 6.2 makes an attempt). The primary the performance of adjustment based on the propen- advice to this point has been to select the method sity function. Diagnostics are also complicated in that yields the best balance (e.g., Harder, Stuart and Anthony, 2010; Ho et al., 2007; Rubin, 2007). But this setting, as it becomes more difficult to assess defining the best balance is complex, as it involves the balance of the resulting samples when there are trading off balance on multiple covariates. Possible multiple treatment levels. Future work is needed to ways to choose a method include the following: (1) examine these issues. the method that yields the smallest standardized 6.2 Guidance for Practice difference of means across the largest number of co- variates, (2) the method that minimizes the stan- So what are the take-away points and advice re- dardized difference of means of a few particularly garding when to use each of the many methods dis- prognostic covariates, and (3) the method that re- cussed? While more work is needed to definitively sults in the fewest number of “large” standardized answer that question, this section attempts to pull differences of means (greater than 0.25). Another together the current literature to provide advice for promising direction is work by Diamond and Sekhon researchers interested in estimating causal effects us- (2006), which automates the matching procedure, ing matching methods. The lessons can be summa- finding the best matches according to a set of bal- rized as follows: ance measures. Further research needs to compare 1. Think carefully about the set of covariates the performance of treatment effect estimates from to include in the matching procedure, and err on methods using criteria such as those in the side of including more rather than fewer. Is the Diamond and Sekhon (2006) and Harder, Stuart and ignorability assumption reasonable given that set of Anthony (2010), to determine what the proper cri- covariates? If not, consider in advance whether there teria should be and examine issues such as potential are other data sets that may be more appropriate, overfitting to particular measures. or if there are sensitivity analyses that can be done 6.1.4 Multiple treatment doses Throughout this to strengthen the inferences. discussion of matching, it has been assumed that 2. Estimate the distance measure that will be there are just two groups: treated and control. How- used in the matching. Linear propensity scores esti- ever, in many studies there are actually multiple lev- mated using logistic regression, or propensity scores els of the treatment (e.g., doses of a drug). estimated using generalized boosted models or Rosenbaum (2002) summarizes two methods for deal- boosted CART, are good choices. If there are a few ing with doses of treatment. In the first method, the covariates on which particularly close balance is de- propensity score is still a scalar function of the co- sired (e.g., pre-treatment measures of the outcome), variates (e.g., Joffe and Rosenbaum, 1999; Lu et al., consider using the Mahalanobis distance within 2001). In the second method, each of the levels of propensity score calipers. treatment has its own propensity score (e.g., 3. Examine the common support and implica- Rosenbaum, 1987a; Imbens, 2000) and each propen- tions for the estimand. If the ATE is of substan- sity score is used one at a time to estimate the distri- tive interest, is there enough overlap of the treated bution of responses that would have been observed and control groups’ propensity scores to estimate if all individuals had received that dose. the ATE? If not, could the ATT be estimated more Encompassing these two approaches, reliably? If the ATT is of interest, are there controls Imai and van Dyk (2004) generalizes the propensity across the full range of the treated group, or will it score to arbitrary treatment regimes (including or- be difficult to estimate the effect for some treated dinal, categorical and multidimensional). They pro- individuals? vide theorems for the properties of this generalized 4. Implement a matching method. MATCHING METHODS 17

• If estimating the ATE, good choices are generally matching methods. However, recent advances have IPTW or full matching. made these methods more and more accessible. • If estimating the ATT and there are many more This section lists some of the major matching pro- control than treated individuals (e.g., more than cedures available. A continuously updated version 3 times as many), k : 1 nearest neighbor match- is also available at http://www.biostat.jhsph.edu/ ing without replacement is a good choice for its ˜estuart/propensityscoresoftware.html. simplicity and good performance. • If estimating the ATT and there are not (or not • Matching software for R many) more control than treated individuals, ap- – cem, http://gking.harvard.edu/cem/ propriate choices are generally subclassification, Iacus, S. M., King, G. and Porro, G. (2009). cem: full matching and weighting by the odds. Coarsened exact matching software. Can also be 5. Examine the balance on covariates resulting implemented through MatchIt. from that matching method. – Matching, http://sekhon.berkeley.edu/matching • If adequate, move forward with treatment effect Sekhon, J. S. (in press). Matching: Multivariate estimation, using regression adjustment on the and propensity score matching with balance opti- matched samples. mization. Forthcoming, Journal of Statistical Soft- • If imbalance on just a few covariates, consider ware. Uses automated procedure to select matches, incorporating exact or Mahalanobis matching on based on univariate and multivariate balance di- those variables. agnostics. Primarily k : 1 matching, allows match- • If imbalance on quite a few covariates, try an- ing with or without replacement, caliper, exact. other matching method (e.g., move to k : 1 match- Includes built-in effect and variance estimation ing with replacement) or consider changing the procedures. estimand or the data. – MatchIt, http://gking.harvard.edu/matchit Ho, D. E., Imai, K., King, G. and Stuart, E. A. (in Even if for some reason effect estimates will not press). MatchIt: Nonparametric preprocessing for be obtained using matching methods, it is worth- parameteric causal inference. Forthcoming, Jour- while to go through the steps outlined here to assess nal of Statistical Software. Two-step process: does the adequacy of the data for answering the ques- tion of interest. Standard regression diagnostics will matching, then user does outcome analysis. Wide not warn researchers when there is insufficient over- array of estimation procedures and matching meth- lap to reliably estimate causal effects; going through ods available: nearest neighbor, Mahalanobis, caliper, the process of estimating propensity scores and as- exact, full, optimal, subclassification. Built-in nu- sessing balance before and after matching can be meric and graphical diagnostics. invaluable in terms of helping researchers move for- – optmatch, http://cran.r-project.org/web/ ward with causal inference with confidence. packages/optmatch/index.html Matching methods are important tools for applied Hansen, B. B. and Fredrickson, M. (2009). opt- researchers and also have many open research ques- match: Functions for optimal matching. Variable tions for statistical development. This paper has pro- ratio, optimal and full matching. Can also be im- vided an overview of the current literature on match- plemented through MatchIt. ing methods, guidance for practice and a road map – PSAgraphics, http://cran.r-project.org/web/ for future research. Much research has been done in packages/PSAgraphics/index.html the past 30 years on this topic, however, there are Helmreich, J. E. and Pruzek, R. M. (2009). PSA- still a number of open areas and questions to be an- graphics: Propensity score analysis graphics. Jour- swered. We hope that this paper, combining results nal of Statistical Software 29. Package to do graph- from a variety of disciplines, will promote awareness ical diagnostics of propensity score methods. of and interest in matching methods as an important – rbounds, http://cran.r-project.org/web/packages/ and interesting area for future research. rbounds/index.html Keele, L. J. (2009). rbounds: An R package for 7. SOFTWARE APPENDIX sensitivity analysis with matched data. Does anal- In previous years software limitations made it ysis of sensitivity to assumption of ignorable treat- difficult to implement many of the more advanced ment assignment. 18 E. A. STUART

– twang, http://cran.r-project.org/web/packages/ – SAS usage note: http://support.sas.com/kb/30/ twang/index.html 971.html Ridgeway, G., McCaffrey, D. and Morral, A. (2006). – Greedy 1 : 1 matching, http://www2.sas.com/ twang: Toolkit for weighting and analysis of proceedings/sugi25/25/po/25p225.pdf nonequivalent groups. Functions for propensity Parsons, L. S. (2005). Using SAS software to per- score estimating and weighting, nonresponse weight- form a case-control match on propensity score in ing, and diagnosis of the weights. Primarily uses an observational study. In SAS SUGI 30, Paper generalized boosted regression to estimate the 225-25. propensity scores. – gmatch macro, http://mayoresearch.mayo.edu/mayo/ research/biostat/upload/gmatch.sas • Matching software for Stata Kosanke, J. and Bergstralh, E. (2004). gmatch: Match 1 or more controls to cases using the – cem, http://gking.harvard.edu/cem/ GREEDY algorithm. Iacus, S. M., King, G. and Porro, G. (2009). cem: – Proc assign, http://pubs.amstat.org/doi/abs/10.1198/ Coarsened exact matching software. 106186001317114938 – match, http://www.economics.harvard.edu/faculty/ Can be used to perform optimal matching. imbens/software imbens – 1 : 1 Mahalanobis matching within propensity score Abadie, A., Drukker, D., Herr, J. L. and Imbens, calipers, www.lexjansen.com/pharmasug/2006/ G. W. (2004). Implementing matching estimators publichealthresearch/pr05.pdf for average treatment effects in Stata. The Stata Feng, W. W., Jun, Y. and Xu, R. (2005). A method/ Journal 4 290–311. Primarily k : 1 matching (with macro based on propensity score and Mahalanobis replacement). Allows estimation of ATT or ATE, distance to reduce bias in treatment comparison including robust variance estimators. in observational study. – pscore, http://www.lrz-muenchen.de/˜sobecker/ – vmatch macro, http://mayoresearch.mayo.edu/mayo/ pscore.html research/biostat/upload/vmatch.sas Becker, S. and Ichino, A. (2002). Estimation of av- Kosanke, J. and Bergstralh, E. (2004). Match cases erage treatment effects based on propensity scores. to controls using variable optimal matching. Vari- The Stata Journal 2 358–377. Does k : 1 nearest able ratio matching (optimal algorithm). neighbor matching, radius (caliper) matching and – Weighting, http://www.lexjansen.com/wuss/2006/ subclassification. Analytics/ANL-Leslie.pdf – psmatch2, http://econpapers.repec.org/software/ Leslie, S. and Thiebaud, P. (2006). Using propen- bocbocode/s432001.htm sity scores to adjust for treatment selection bias. Leuven, E. and Sianesi, B. (2003). psmatch2. Stata ACKNOWLEDGMENTS module to perform full Mahalanobis and propen- sity score matching, common support graphing, Supported in part by Award K25MH083846 from and covariate imbalance testing. Allows k : 1 match- the National Institute of Mental Health. The con- ing, kernel weighting, Mahalanobis matching. In- tent is solely the responsibility of the author and cludes built-in diagnostics and procedures for es- does not necessarily represent the official views of timating ATT or ATE. the National Institute of Mental Health or the Na- – Note: 3 procedures for analysis of sensitivity to tional Institutes of Health. The work for this paper the ignorability assumption are also available: was partially done while the author was a gradu- ate student at Harvard University, Department of rbounds (for continuous outcomes), mhbounds (for Statistics, and a Researcher at Mathematica Policy categorical outcomes), and sensatt (to be used af- Research. The author particularly thanks Jennifer ter the pscore procedures). Hill, Daniel Ho, Kosuke Imai, Gary King and Don- rbounds , http://econpapers.repec.org/software/ ald Rubin, as well as the reviewers and Editor, for bocbocode/s438301.htm; helpful comments. mhbounds, http://ideas.repec.org/p/diw/diwwpp/ dp659.html; sensatt, http://ideas.repec.org/c/boc/bocode/ REFERENCES s456747.html. Abadie, A. and Imbens, G. W. (2006). Large sample prop- erties of matching estimators for average treatment effects. • Matching software for SAS Econometrica 74 235–267. MR2194325 MATCHING METHODS 19

Abadie, A. and Imbens, G. W. (2009a). Bias cor- Dehejia, R. H. and Wahba, S. (1999). Causal effects in rected matching estimators for average treat- nonexperimental studies: Re-evaluating the evaluation of ment effects. Journal of Educational and Be- training programs. J. Amer. Statist. Assoc. 94 1053–1062. havioral Statistics. To appear. Available at Dehejia, R. H. and Wahba, S. (2002). Propensity score http://www.hks.harvard.edu/fs/aabadie/bcm.pdf. matching methods for non-experimental causal studies. Re- Abadie, A. and Imbens, G. W. (2009b). Matching on the view of Economics and Statistics 84 151–161. estimated propensity score. Working Paper 15301, National Diamond, A. and Sekhon, J. S. (2006). Genetic matching Bureau of Economic Research, Cambridge, MA. for estimating causal effects: A general multivariate match- Agodini, R. and Dynarski, M. (2004). Are experiments the ing method for achieving balance in observational stud- only option? A look at dropout prevention programs. Re- ies. Working paper. Univ. California, Berkeley. Available view of Economics and Statistics 86 180–194. at http://sekhon.berkeley.edu/papers/GenMatch.pdf. Althauser, R. and Rubin, D. (1970). The computerized Drake, C. (1993). Effects of misspecification of the propen- construction of a matched sample. American Journal of sity score on estimators of treatment effects. Biometrics 49 Sociology 76 325–346. 1231–1236. Augurzky, B. and Schmidt, C. (2001). The propensity Frangakis, C. E. and Rubin, D. B. (2002). Principal score: A means to an end. Discussion Paper 271, Institute stratification in causal inference. Biometrics 58 21–29. for the Study of Labor (IZA). MR1891039 Austin, P. C. (2007). The performance of different propen- Glazerman, S., Levy, D. M. and Myers, D. (2003). Nonex- sity score methods for estimating marginal odds ratios. perimental versus experimental estimates of earnings im- Stat. Med. 26 3078–3094. MR2380505 pacts. Annals of the American Academy of Political and Austin, P. (2009). Using the standardized difference to com- Social Science 589 63–93. pare the prevalence of a binary variable between two groups Greenland, S. (2003). Quantifying biases in causal models: in observational research. Comm. Statist. Simulation Com- Classical confounding vs collider-stratification bias. Epi- 14 put. 38 1228–1234. demiology 300–306. Greenland, S. and Finkle, W. D. (1995). A critical look at Austin, P. C. and Mamdani, M. M. (2006). A comparison methods for handling missing covariates in epidemiologic of propensity score methods: A case-study illustrating the regression analyses. American Journal of Epidemiology 142 effectiveness of post-ami statin use. Stat. Med. 25 2084– 1255–1264. 2106. MR2239634 Greenland, S., Robins, J. M. and Pearl, J. (1999). Con- Bang, H. and Robins, J. M. (2005). Doubly robust estima- founding and collapsibility in causal inference. Statist. Sci. tion in missing data and causal inference models. Biomet- 14 29–46. rics 61 962–972. MR2216189 Greenwood, E. (1945). Experimental Sociology: A Study in Brookhart, M. A., Schneeweiss, S., Rothman, K. J., Method. King’s Crown Press, New York. Glynn, R. J., Avorn, J. and Sturmer, T. (2006). Vari- Greevy, R., Lu, B., Silber, J. H. and Rosenbaum, P. able selection for propensity score models. American Jour- (2004). Optimal multivariate matching before randomiza- nal of Epidemiology 163 1149–1156. tion. Biostatistics 5 263–275. Carpenter, R. (1977). Matching when covariables are nor- Gu, X. and Rosenbaum, P. R. (1993). Comparison of mul- 64 mally distributed. Biometrika 299–307. tivariate matching methods: Structures, distances, and al- Chapin, F. (1947). Experimental Designs in Sociological Re- gorithms. J. Comput. Graph. Statist. 2 405–420. search. Harper, New York. Hansen, B. B. (2004). Full matching in an observational Cochran, W. G. (1968). The effectiveness of adjustment by study of coaching for the SAT. J. Amer. Statist. Assoc. subclassification in removing bias in observational studies. 99 609–618. MR2086387 Biometrics 24 295–313. MR0228136 Hansen, B. B. (2008). The essential role of balance tests Cochran, W. G. and Rubin, D. B. (1973). Controlling bias in propensity-matched observational studies: Comments on in observational studies: A review. Sankhy¯aSer. A 35 417– ‘A critical appraisal of propensity-score matching in the 446. medical literature between 1996 and 2003’ by Peter Austin, Cohen, J. (1988). Statistical Power Analysis for the Behav- Statistics in Medicine. Stat. Med. 27 2050–2054. ioral Sciences, 2nd ed. Earlbaum, Hillsdale, NJ. Hansen, B. B. (2008). The prognostic analogue of the Cornfield, J. (1959). Smoking and lung cancer: Recent ev- propensity score. Biometrika 95 481–488. MR2521594 idence and a discussion of some questions. Journal of the Harder, V. S., Stuart, E. A. and Anthony, J. (2010). National Cancer Institute 22 173–200. Propensity score techniques and the assessment of mea- Crump, R., Hotz, V. J., Imbens, G. W. and Mitnik, O. sured covariate balance to test causal associations in psy- (2009). Dealing with limited overlap in estimation of aver- chological research. Psychological Methods. To appear. age treatment effects. Biometrika 96 187–199. Heckman, J. J., Hidehiko, H. and Todd, P. (1997). Match- Czajka, J. C., Hirabayashi, S., Little, R. and Rubin, ing as an econometric evaluation estimator: Evidence from D. B. (1992). Projecting from advance data using propen- evaluating a job training programme. Rev. Econom. Stud. sity modeling. J. Bus. Econom. Statist. 10 117–131. 64 605–654. D’Agostino, Jr., R. B. and Rubin, D. B. (2000). Estimat- Heckman, J. J., Ichimura, H., Smith, J. and Todd, P. ing and using propensity scores with partially missing data. (1998). Characterizing selection bias using experimental J. Amer. Statist. Assoc. 95 749–759. data. Econometrica 66 1017–1098. MR1639419 20 E. A. STUART

Heckman, J. J., Ichimura, H. and Todd, P. (1998). Match- and marginal structural models. Amer. Statist. 58 272–279. ing as an econometric evaluation estimator. Rev. Econom. MR2109415 Stud. 65 261–294. MR1623713 Kang, J. D. and Schafer, J. L. (2007). Demystifying double Heller, R., Rosenbaum, P. and Small, D. (2009). Split robustness: A comparison of alternative strategies for esti- samples and design sensitivity in observational studies. mating a population from incomplete data. Statist. J. Amer. Statist. Assoc. 104 1090–1101. Sci. 22 523–539. MR2420458 Hill, J. L. and Reiter, J. P. (2006). Keele, L. (2009). rbounds: An R package for sensi- for treatment effects using propensity score matching. Stat. tivity analysis with matched data. R package. Avail- Med. 25 2230–2256. MR2240098 able at http://www.polisci.ohio-state.edu/faculty/lkeele/ Hill, J., Reiter, J. and Zanutto, E. (2004). A compar- rbounds.html. ison of experimental and observational data analyses. In King, G. and Zeng, L. (2006). The dangers of extreme coun- Applied Bayesian Modeling and Causal Inference From an terfactuals. Political Analysis 14 131–159. Incomplete-Data Perspective (A. Gelman and X.-L. Meng, Kurth, T., Walker, A. M., Glynn, R. J., Chan, K. A., eds.). Wiley, Hoboken, NJ. MR2134801 Gaziano, J. M., Berger, K. and Robins, J. M. Hill, J., Rubin, D. B. and Thomas, N. (1999). The de- (2006). Results of multivariable logistic regresion, propen- sign of the New York School Choice Scholarship Program sity matching, propensity adjustment, and propensity- evaluation. In Research Designs: Inspired by the Work of based weighting under conditions of nonuniform effect. Donald Campbell, (L. Bickman, ed.) 155–180. Sage, Thou- American Journal of Epidemiology 163 262–270. sand Oaks, CA. Lechner, M. (2002). Some practical issues in the evaluation Hirano, K., Imbens, G. W. and Ridder, G. (2003). Ef- of heterogeneous labour market programmes by match- ficient estimation of average treatment effects using the ing methods. J. Roy. Statist. Soc. Ser. A 165 59–82. estimated propensity score. Econometrica 71 1161–1189. MR1888488 MR1995826 Lee, B., Lessler, J. and Stuart, E. A. (2009). Improving Ho, D. E., Imai, K., King, G. Stuart, E. A. and (2007). propensity score weighting using machine learning. Stat. Matching as nonparametric preprocessing for reducing Med. 29 337–346. model dependence in parametric causal inference. Politi- Li, Y. P., Propert, K. J. and Rosenbaum, P. R. (2001). cal Analysis 15 199–236. Balanced risk set matching. J. Amer. Statist. Assoc. 96 Holland, P. W. (1986). Statistics and causal inference. 455, 870–882. MR1946360 J. Amer. Statist. Assoc. 81 945–960. MR0867618 Lu, B., Zanutto, E., Hornik, R. and Rosenbaum, P. R. Hong, G. and Raudenbush, S. W. (2006). Evaluating (2001). Matching with doses in an observational study of kindergarten retention policy: A case study of causal in- a media campaign against drug abuse. J. Amer. Statist. ference for multilevel observational data. J. Amer. Statist. Assoc. 96 1245–1253. MR1973668 Assoc. 101 901–910. MR2324091 Lunceford, J. K. and Davidian, M. (2004). Stratification Horvitz, D. and Thompson, D. (1952). A generalization and weighting via the propensity score in estimation of of without replacement from a finite universe. causal treatment effects: A comparative study. Stat. Med. J. Amer. Statist. Assoc. 47 663–685. MR0053460 23 Hudgens, M. G. and Halloran, M. E. (2008). Toward 2937–2960. Lunt, M., Solomon, D., Rothman, K., Glynn, R., causal inference with interference. J. Amer. Statist. Assoc. 103 832–842. MR2435472 Hyrich, K., Symmons, D. P., Sturmer, T., the Iacus, S. M., King, G. and Porro, G. (2009). British Society for Rheumatology Biologics Regis- CEM: Software for coarsened exact match- ter and the British Society for Rheumatology Bio- ing. J. Statist. Software 30 9. Available at logics Register Contrl Centre Consortium (2009). http://gking.harvard.edu/files/abs/cemR-abs.shtml. Different methods of balancing covariates leading to differ- Imai, K. and van Dyk, D. A. (2004). Causal inference ent effect estimates in the presence of effect modification. with general treatment regimes: Generalizing the propen- American Journal of Epidemiology 169 909–917. sity score. J. Amer. Statist. Assoc. 99 854–866. MR2090918 McCaffrey, D. F., Ridgeway, G. and Morral, A. R. Imai, K., King, G. and Stuart, E. A. (2008). Misunder- (2004). Propensity score estimation with boosted regres- standings among experimentalists and observationalists in sion for evaluating causal effects in observational studies. causal inference. J. Roy. Statist. Soc. Ser. A 171 481–502. Psychological Methods 9 403–425. MR2427345 Ming, K. and Rosenbaum, P. R. (2001). A note on optimal Imbens, G. W. (2000). The role of the propensity score in es- matching with variable controls using the assignment algo- timating dose–response functions. Biometrika 87 706–710. rithm. J. Comput. Graph. Statist. 10 455–463. MR1939035 MR1789821 Morgan, S. L. and Harding, D. J. (2006). Matching esti- Imbens, G. W. (2004). Nonparametric estimation of average mators of causal effects: Prospects and pitfalls in theory treatment effects under exogeneity: A review. Review of and practice. Sociological Methods & Research 35 3–60. Economics and Statistics 86 4–29. MR2247150 Joffe, M. M. and Rosenbaum, P. R. (1999). Propensity Potter, F. J. (1993). The effect of weight trimming on non- scores. American Journal of Epidemiology 150 327–333. linear survey estimates. In Proceedings of the Section on Joffe, M. M., Ten Have, T. R., Feldman, H. I. and Kim- Survey Research Methods of American Statistical Associa- mel, S. E. (2004). , confounder control, tion. Amer. Statist. Assoc., San Francisco, CA. MATCHING METHODS 21

Qu, Y. and Lipkovich, I. (2009). Propensity score estimation an observational study of treatment for ovarian cancer. with missing values using a multiple imputation missing- J. Amer. Statist. Assoc. 102 75–83. MR2345534 ness pattern (MIMP) approach. Stat. Med. 28 1402–1414. Rubin, D. B. (1973a). Matching to remove bias in observa- Reinisch, J., Sanders, S., Mortensen, E. and Rubin, tional studies. Biometrics 29 159–184. D. B. (1995). In utero exposure to phenobarbital and in- Rubin, D. B. (1973b). The use of matched sampling and re- telligence deficits in adult men. Journal of the American gression adjustment to remove bias in observational stud- Medical Association 274 1518–1525. ies. Biometrics 29 185–203. Ridgeway, G., McCaffrey, D. and Morral, A. (2006). Rubin, D. B. (1974). Estimating causal effects of treatments twang: Toolkit for weighting and analysis of nonequivalent in randomized and nonrandomized studies. Journal of Ed- groups. Software for using matching methods in R. Avail- ucational Psychology 66 688–701. able at http://cran.r-project.org/web/packages/twang/ Rubin, D. B. (1976a). Inference and missing data (with dis- index.html. cussion). Biometrika 63 581–592. MR0455196 Robins, J. and Rotnitzky, A. (1995). Semiparametric effi- Rubin, D. B. (1976b). Multivariate matching methods that ciency in multivariate regression models with missing data. are equal percent bias reducing, I: Some examples. Biomet- J. Amer. Statist. Assoc. 90 122–129. MR1325119 rics 32 109–120. MR0400555 Robins, J. M., Hernan, M. A. and Brumback, B. (2000). Rubin, D. B. (1979). Using multivariate matched sampling Marginal structural models and causal inference in epi- and regression adjustment to control bias in observational demiology. Epidemiology 11 550–560. studies. J. Amer. Statist. Assoc. 74 318–328. Robins, J. M., Mark, S. and Newey, W. (1992). Estimat- Rubin, D. B. (1980). Bias reduction using Mahalanobis met- ing exposure effects by modelling the expectation of expo- ric matching. Biometrics 36 293–298. sure conditional on confounders. Biometrics 48 479–495. Rubin, D. B. (1987). Multiple Imputation for Nonresponse in MR1173493 Surveys. Wiley, New York. MR0899519 Rubin, D. B. Rosenbaum, P. R. (1984). The consequences of adjustment (2001). Using propensity scores to help design for a concomitant variable that has been affected by the observational studies: Application to the tobacco litigation. Health Services & Outcomes Research Methodology 2 169– treatment. J. Roy. Statist. Soc. Ser. A 147 656–666. 188. Rosenbaum, P. R. (1987a). Model-based direct adjustment. Rubin, D. B. (2004). On principles for modeling propen- J. Amer. Statist. Assoc. 82 387–394. sity scores in medical research. Pharmacoepidemiology and Rosenbaum, P. R. (1987b). The role of a second control Drug Safety 13 855–857. group in an observational study (with discussion). Statist. Rubin, D. B. (2006). Matched Sampling for Causal Inference. Sci. 2 292–316. Cambridge Univ. Press, Cambridge. MR2307965 Rosenbaum, P. R. (1991). A characterization of optimal de- Rubin, D. B. (2007). The design versus the analysis of obser- signs for observational studies. J. Roy. Statist. Soc. Ser. B vational studies for causal effects: Parallels with the design 53 597–610. MR1125717 of randomized trials. Stat. Med. 26 20–36. MR2312697 Rosenbaum, P. R. (1999). Choice as an alternative to control Rubin, D. B. and Stuart, E. A. (2006). Affinely invariant in observational studies (with discussion). Statist. Sci. 14 matching methods with discriminant mixtures of propor- 259–304. tional ellipsoidally symmetric distributions. Ann. Statist. Rosenbaum, P. R. (2002). Observational Studies, 2nd ed. 34 1814–1826. MR2283718 Springer, New York. MR1899138 Rubin, D. B. and Thomas, N. (1992a). Affinely invari- Rosenbaum, P. R. (2010). Design of Observational Studies. ant matching methods with ellipsoidal distributions. Ann. Springer, New York. Statist. 20 1079–1093. MR1165607 Rosenbaum, P. R. and Rubin, D. B. (1983a). Assessing Rubin, D. B. and Thomas, N. (1992b). Characterizing the sensitivity to an unobserved binary covariate in an obser- effect of matching using linear propensity score meth- vational study with binary outcome. J. Roy. Statist. Soc. ods with normal distributions. Biometrika 79 797–809. Ser. B 45 212–218. MR1209479 Rosenbaum, P. R. and Rubin, D. B. (1983b). The cen- Rubin, D. B. and Thomas, N. (1996). Matching using esti- tral role of the propensity score in observational studies mated propensity scores, relating theory to practice. Bio- for causal effects. Biometrika 70 41–55. MR0742974 metrics 52 249–264. Rosenbaum, P. R. and Rubin, D. B. (1984). Reducing Rubin, D. B. and Thomas, N. (2000). Combining propensity bias in observational studies using subclassification on the score matching with additional adjustments for prognostic propensity score. J. Amer. Statist. Assoc. 79 516–524. covariates. J. Amer. Statist. Assoc. 95 573–585. Rosenbaum, P. R. and Rubin, D. B. (1985a). The bias Schafer, J. L. and Kang, J. D. (2008). Average causal due to incomplete matching. Biometrics 41 103–116. effects from nonrandomized studies: A practical guide and MR0793436 simulated case study. Psychological Methods 13 279–313. Rosenbaum, P. R. and Rubin, D. B. (1985b). Constructing Scharfstein, D. O., Rotnitzky, A. and Robins, J. M. a control group using multivariate matched sampling meth- (1999). Adjusting for non-ignorable drop-out using semi- ods that incorporate the propensity score. Amer. Statist. parametric non-response models. J. Amer. Statist. Assoc. 39 33–38. 94 1096–1120. MR1731478 Rosenbaum, P. R., Ross, R. N. and Silber, J. H. (2007). Schneider, E. C., Zaslavsky, A. M. and Epstein, A. M. Minimum distance matched sampling with fine balance in (2004). Use of high-cost operative procedures by Medicare 22 E. A. STUART

beneficiaries enrolled in for-profit and not-for-profit health Stuart, E. A. (2008). Developing practical recommendations plans. The New England Journal of Medicine 350 143–150. for the use of propensity scores: Discussion of “A critical Setoguchi, S., Schneeweiss, S., Brookhart, M. A., appraisal of propensity score matching in the medical lit- Glynn, R. J. and Cook, E. F. (2008). Evaluating uses of erature between 1996 and 2003” by P. Austin. Stat. Med. data mining techniques in propensity score estimation: A 27 2062–2065. MR2439885 simulation study. Pharmacoepidemiology and Drug Safety Stuart, E. A. and Green, K. M. (2008). Using full match- 17 546–555. ing to estimate causal effects in non-experimental studies: Shadish, W. R., Clark, M. and Steiner, P. M. (2008). Examining the relationship between adolescent marijuana Can nonrandomized experiments yield accurate answers? use and adult outcomes. Developmental Psychology 44 395– A randomized experiment comparing random and nonran- 406. dom assignments. J. Amer. Statist. Assoc. 103 1334–1344. Stuart, E. A. and Ialongo, N. S. (2009). Matching meth- Smith, H. (1997). Matching with multiple controls to esti- ods for selection of subjects for follow-up. Multivariate Be- mate treatment effects in observational studies. Sociological havioral Research. To appear. Methodology 27 325–353. Wacholder, S. and Weinberg, C. R. (1982). Paired versus Snedecor, G. W. and Cochran, W. G. (1980). Statisti- two-sample design for a of treatments with di- cal Methods, 7th ed. Iowa State Univ. Press, Ames, IA. chotomous outcome: Power considerations. Biometrics 38 MR0614143 801–812. Sobel, M. E. (2006). What do randomized studies of hous- Weitzen, S., Lapane, K. L., Toledano, A. Y., Hume, ing mobility demonstrate?: Causal inference in the face A. L. and Mor, V. (2004). Principles for modeling propen- of interference. J. Amer. Statist. Assoc. 101 1398–1407. sity scores in medical research: A systematic literature re- MR2307573 view. Pharmacoepidemiology and Drug Safety 13 841–853. Song, J., Belin, T. R., Lee, M. B., Gao, X. and Zhao, Z. (2004). Using matching to estimate treatment ef- Rotheram-Borus, M. J. (2001). Handling baseline dif- fects: Data requirements, matching metrics, and Monte ferences and missing items in a longitudinal study of HIV Carlo evidence. Review of Economics and Statistics 86 91– risk among runaway youths. Health Services & Outcomes 107. Research Methodology 2 317–329.