
11-Osborne (Best)-45409.qxd 10/9/2007 10:58 AM Page 155

11

BEST PRACTICES IN QUASI-EXPERIMENTAL DESIGNS
Matching Methods for Causal Inference

ELIZABETH A. STUART

DONALD B. RUBIN

Many studies in social science that aim to estimate the effect of an intervention suffer from treatment selection bias, where the units who receive the treatment may have different characteristics from those in the control condition. These preexisting differences between the groups must be controlled to obtain approximately unbiased estimates of the effects of interest. For example, in a study estimating the effect of bullying on high school graduation, students who were bullied are likely to be very different from students who were not bullied on a wide range of characteristics, such as socioeconomic status and academic performance, even before the bullying began. It is crucial to try to separate out the causal effect of the bullying from the effect of these preexisting differences between the “treated” and “control” groups. Matching methods provide a way to attempt to do so.

Random assignment of units to receive (or not receive) the treatment of interest ensures that there are no systematic differences between the treatment and control groups before treatment assignment. However, random assignment is often infeasible in social science research, due to either ethical or practical concerns. Matching methods constitute a growing collection of techniques that attempt to replicate, as closely as possible, the ideal of randomized experiments when using observational data.

There are two key ways in which the matching methods we discuss replicate a randomized experiment. First, matching aims to select subsamples of the treated and control groups that are, at worst, only randomly different from one another on all observed covariates. In other words, matching seeks to identify subsamples of treated and control units that are “balanced” with respect to observed covariates: The observed covariate distributions are essentially the same in the treatment and control groups. The methods described in this chapter examine how best to choose subsamples from the original treated and control groups such that the distributions of covariates in the matched groups are substantially more similar than in the original groups, when this is possible. A second crucial similarity between a randomized experiment


156 BEST PRACTICES IN

and a matched observational study is that each study has two clear stages. The first stage is design, in which the units to be compared are selected, without use of the values of the outcome variables. Like the design of a randomized experiment, the matches are chosen without access to any of the outcome data, thereby preventing intentional or unintentional bias when selecting a particular matched sample to achieve a desired result. Only after the design is set does the second stage begin, which involves the analyses of the outcome, estimating treatment effects using the matched sample. We only discuss propensity score methods that are applicable at the design stage in the sense that they do not involve any outcome data. Some methods that use propensity scores, including some weighting techniques, can involve outcome data, and such methods are not discussed here.

This chapter reviews the diverse literature on matching methods, with particular attention paid to providing practical guidance based on applied and simulation results that indicate the potential of matching methods for bias reduction in observational studies. We first provide an introduction to the goal of matching and a very brief history of these methods; the second section presents the theory and motivation behind propensity scores, discussing how they are a crucial component when using matching methods. We then discuss other methods of controlling for covariates in observational studies, such as regression adjustment, and explain why matching methods (particularly when combined with regression in the analysis stage) are more effective. The implementation of matching methods, including challenges and evaluations of their performance, is then discussed. We conclude with recommendations for researchers and a discussion of software currently available. Throughout the chapter, we motivate the methods using data from the National Supported Work Demonstration (Dehejia & Wahba, 1999; LaLonde, 1986).

Designing Observational Studies

The methods described here are relevant for two types of situations. The first, which is arguably more common in social science research, is a situation where all covariate and outcome data are already available on a large set of units, but a subset of those units will be chosen for use in the analysis. This subsetting (or “matching”) is done with the aim of selecting subsets of the treated and control groups with similar observed covariate distributions, thereby increasing robustness in observational studies by reducing reliance on modeling assumptions. The main objective of the matching is to reduce bias. But what about variance? Although discarding units in the matching process will result in smaller sample sizes and thus might appear to lead to increases in variance, this is not always the case because improved balance in the covariate distributions will decrease the variance of estimators (Snedecor & Cochran, 1980). H. Smith (1997) gives an empirical example where estimates from one-to-one matching have lower estimated standard deviations than estimates from a regression, even though thousands of observations were discarded in the one-to-one matching, and all were used in the regression.

The second situation is one in which outcome data are not yet collected on the units, and cost constraints prohibit measuring the outcome variables on all units. In that situation, matching methods can help choose for follow-up the control units most similar to those in the treated group. The matching identifies those control units who are most similar to the treated units so that rather than random samples of units being discarded, the units discarded are those most irrelevant as points of comparison with the treated units. This second situation motivated much of the early work in matching methods (Althauser & Rubin, 1970; Rubin, 1973a, 1973b), which compared the benefits of choosing matched versus random samples for follow-up.

Matching methods can be considered as one method for designing an observational study, in the sense of selecting the most appropriate data for reliable estimation of causal effects, as discussed in Cochran and Rubin (1973), Rubin (1977, 1997, 2004), Rosenbaum (1999, 2002), and Heckman, Hidehiko, and Todd (1997). These papers stress the importance of carefully designing an observational study by making appropriate choices when it is impossible to have full control (e.g., randomization). The careful design of an observational study must involve making careful choices about the data used in making comparisons of outcomes in treatment and control conditions.

Other approaches that attempt to control for covariate differences between treated and control units include regression analysis or selection models, which estimate parameters of

Quasi-Experimental Designs 157

a model for the outcome of interest conditional on the covariates (and a treatment/control indicator). Matching methods are preferable to these model-based adjustments for two key reasons. First, matching methods do not use the outcome values in the design of the study and thus preclude the selection of a particular design to yield a desired result. As stated by Rubin (2001),

Arguably, the most important feature of experiments is that we must decide on the way data will be collected before observing the outcome data. If we could try hundreds of designs and for each see the resultant answer, we could capitalize on random variation in answers and choose the design that generated the answer we wanted! The lack of availability of outcome data when designing experiments is a tremendous stimulus for “honesty” in experiments and can be in well-designed observational studies as well. (p. 169)

Second, when there are large differences in the covariate distributions between the groups, standard model-based adjustments rely heavily on extrapolation and model-based assumptions. Matching methods highlight these differences and also provide a way to limit reliance on the inherently untestable modelling assumptions and the consequential sensitivity to those assumptions.

Matching methods and regression-based model adjustments should also not be seen as competing methods but rather as complementary, which is a decades-old message. In fact, as discussed earlier, much research over a period of decades (Cochran & Rubin, 1973; Ho, Imai, King, & Stuart, 2007; Rubin, 1973b, 1979; Rubin & Thomas, 2000) has shown that the best approach is to combine the two methods by, for example, doing regression adjustment on matched samples. Selecting matched samples reduces bias due to covariate differences, and regression analysis on those matched samples can adjust for small remaining differences and increase the precision of estimates. These approaches are similar in spirit to the recent “doubly robust” procedures of Robins and Rotnitzky (2001), which provide consistent estimation of causal effects if either the model of treatment assignment (e.g., the propensity scores) or the model of the outcome is correct, although these later methods are more sensitive to a correctly specified model used for weighting and generally do not have the clear separation of stages of design and analysis that we advocate here.

The National Supported Work Demonstration

The National Supported Work (NSW) Demonstration was a federally and privately funded randomized experiment done in the 1970s to estimate the effects of a job training program for disadvantaged workers. Since a series of analyses beginning in the 1980s (Dehejia & Wahba, 1999, 2002; LaLonde, 1986; J. Smith & Todd, 2005), the data set from this study has become a canonical example in the literature on matching methods.

In the NSW Demonstration, eligible individuals were randomly selected to participate in the training program. Treatment group members and control group members (those not selected to participate) were followed up to estimate the effect of the program on later earnings. Because the NSW program was a randomized experiment, the difference in means in the outcomes between the randomized treated and control groups is an unbiased estimate of the average treatment effect for the subjects in the randomized experiment, and indicated that, on average, among all male participants, the program raised annual earnings by approximately $800.

To investigate whether certain nonexperimental methods yielded a result similar to that from the randomized experiment, LaLonde (1986) attempted to use certain nonexperimental methods to estimate the treatment effect, with the experimental estimate of the treatment effect as a benchmark. LaLonde used, in analogy with then-current econometric practice, two sources of comparison units, both large national databases: the Panel Survey of Income Dynamics (PSID) and the Current Population Survey (CPS). LaLonde found that the nonexperimental methods gave a wide range of impact estimates, ranging from approximately −$16,000 to $700, and concluded that it was difficult to replicate the experimental results with any of the nonexperimental methods available at that time.

In the 1990s, Dehejia and Wahba (1999) used propensity score matching methods to estimate the effect of the NSW program, using comparison groups similar to those used by LaLonde. They found that most of the comparison group


members used by LaLonde were in fact very dissimilar to the treated group members and that by restricting the analysis to the comparison group members who looked most similar to the treated group, they were able to replicate results found in the NSW experiment. Using the CPS, which had a larger pool of individuals comparable to those in the treated group, for the sample of men with 2 years of pretreatment earnings data available, Dehejia and Wahba (1999) obtained a range of treatment effect estimates of $1,559 to $1,681, quite close to the experimental estimate of approximately $1,800 for the same sample. Although there is still debate regarding the use of nonexperimental data to estimate the effects of the NSW program (see, e.g., Dehejia, 2005; J. Smith & Todd, 2005), this example has nonetheless remained an important illustration of the use of matching methods in practice.

We will use a subset of these data as an illustrative example throughout this chapter. The “full” data set that we use has 185 treated males who had 2 years of preprogram earnings data (1974 and 1975) as well as 429 comparison males from the CPS who were younger than age 55, unemployed in 1976, and had income below the poverty line in 1975. The goal of matching will be to select the comparison males who look most similar to the treated group on other covariates. The covariates available in this data set include age, education level, high school degree, marital status, race, ethnicity, and earnings in 1974 and 1975. In this chapter, we do not attempt to obtain a reliable estimate of the effect of the NSW program but rather use the data only to illustrate matching methods.¹

Notation and Background

As first formalized by Rubin (1974), the estimation of causal effects, whether from data in a randomized experiment or from information obtained from an observational study, is inherently a comparison of potential outcomes on individual units, where a unit is a physical object (e.g., a person or a school) at a particular point in time. In particular, the causal effect for unit i is the comparison of unit i’s outcome if unit i receives the treatment (unit i’s potential outcome under treatment), \(Y_i(1)\), and unit i’s outcome if unit i receives the control (unit i’s potential outcome under control), \(Y_i(0)\). The “fundamental problem of causal inference” (Holland, 1986; Rubin, 1978) is that, for each unit, we can observe only one of these potential outcomes because each unit will receive either treatment or control, not both. The estimation of causal effects can thus be thought of as a missing data problem, where at least half of the values of interest (the unobserved potential outcomes) are missing (Rubin, 1976a). We are interested in predicting the unobserved potential outcomes, thus enabling the comparison of the potential outcomes under treatment and control.

For efficient causal inference and good estimation of the unobserved potential outcomes, we would like to compare groups of treated and control units that are as similar as possible. If the groups are very different, the prediction of the \(Y_i(1)\) for the control group will be made using information from treated units, who look very different from those in the control group, and likewise, the prediction of the \(Y_i(0)\) for the treated units will be made using information from control units, who look very different from the treated units.

Randomized experiments use a known randomized assignment mechanism to ensure “balance” of the covariates between the treated and control groups: The groups will be only randomly different from one another on all background covariates, observed and unobserved. In observational studies, we must posit an assignment mechanism, which stochastically determines which units receive treatment and which receive control. A key initial assumption in observational studies is that of strongly ignorable treatment assignment (Rosenbaum & Rubin, 1983b), which implies that (a) treatment assignment (W) is unconfounded (Rubin, 1990); that is, it is independent of the potential outcomes (Y(0), Y(1)) given the covariates (X): \(W \perp (Y(0), Y(1)) \mid X\), and (b) there is a positive probability of receiving each treatment for all values of X: \(0 < P(W = 1 \mid X) < 1\) for all X. Part (b) essentially states that there is overlap in the propensity scores. However, since below we discuss methods to impose this by discarding units outside the region of overlap, in the rest of the chapter, we focus on the first part of the strong ignorability assumption: unconfounded treatment assignment, sometimes called “selection on observables” or “no hidden bias.” Imbens (2004) discusses the plausibility of this assumption in economics, and this issue is discussed further later in this chapter, including


tests for sensitivity to the assumption of unconfounded treatment assignment.

A second assumption that is made in nearly all studies estimating causal effects (including randomized experiments) is the stable unit treatment value assumption (SUTVA; Rubin, 1980). There are two components to this assumption. The first is that there is only one version of each treatment possible for each unit. The second component is that of no interference: The treatment assignment of one unit does not affect the potential outcomes of any other units. This is also sometimes referred to as the assumption of “no spillover.” Some recent work has discussed relaxing this SUTVA assumption, in the context of school effects (Hong & Raudenbush, 2006) or neighborhood effects (Sobel, 2006).

History of Matching Methods

Matching methods have been in use since the first half of the 20th century, with much of the early work in sociology (Althauser & Rubin, 1970; Chapin, 1947; Greenwood, 1945). However, a theoretical basis for these methods was not developed until the late 1960s and early 1970s. This development began with a paper by Cochran (1968), which particularly examined subclassification but had clear connections with matching, including Cochran’s occasional use of the term stratified matching to refer to subclassification. Cochran and Rubin (1973) and Rubin (1973a, 1973b) continued this development for situations with one covariate, and Cochran and Rubin (1973) and Rubin (1976b, 1976c) extended the results to multivariate settings.

Dealing with multiple covariates was a challenge due to both computational and data problems. With more than just a few covariates, it becomes very difficult to find matches with close or exact values of all covariates. An important advance was made in 1983 with the introduction of the propensity score by Rosenbaum and Rubin (1983b), a generalization of discriminant matching (Cochran & Rubin, 1973; Rubin, 1976b, 1976c). Rather than requiring close or exact matches on all covariates, matching on the scalar propensity score enables the construction of matched sets with similar distributions of covariates.

Developments were also made regarding the theory behind matching methods, particularly in the context of affinely invariant matching methods (such as most implementations of propensity score matching) with ellipsoidally symmetric covariate distributions (Rubin & Stuart, 2006; Rubin & Thomas, 1992a, 1992b, 1996). Affinely invariant matching methods are those that yield the same matches following an affine (e.g., linear) transformation of the data (Rubin & Thomas, 1992a). This theoretical development grew out of initial work on equal percent bias-reducing (EPBR) matching methods in Rubin (1976a, 1976c). EPBR methods reduce bias in all covariate directions by the same percentage, thus ensuring that if close matches are obtained in some direction (such as the discriminant), then the matching is also reducing bias in all other directions and so cannot be increasing bias in an outcome that is a linear combination of the covariates. Methods that are not EPBR will infinitely increase bias for some linear combinations of the covariates.

Since the initial work on matching methods, which was primarily in sociology and statistics, matching methods have been growing in popularity, with developments and applications in a variety of fields, including economics (Imbens, 2004), medicine (D’Agostino, 1998), (Christakis & Iwashyna, 2003), political science (Ho et al., 2007), and sociology (Morgan & Harding, 2006; Winship & Morgan, 1999). A review of the older work and more recent applications can also be found in Rubin (2006).

PROPENSITY SCORES

In applications, it is often very difficult to find close matches on each covariate. Rather than attempting to match on all of the covariates individually, propensity score matching matches on the scalar propensity score, which is the most important scalar summary of the covariates. Propensity scores, first introduced in Rosenbaum and Rubin (1983b), provided a key step in the continual development of matching methods by enabling the formation of matched sets that have balance on a large number of covariates.

The propensity score for unit i is defined as the probability of receiving the treatment given the observed covariates: \(e_i(X) = P(W_i = 1 \mid X)\). There are two key theorems relating to their use (Rosenbaum & Rubin, 1983b). The first is that propensity scores are balancing scores: At each value of the propensity score, the distribution of


the covariates, X, that define the propensity score is the same in the treated and control groups. In other words, within a small range of propensity score values, the treated and control groups’ observed covariate distributions are only randomly different from each other, thus replicating a mini-randomized experiment, at least with respect to these covariates. Second, if treatment assignment is unconfounded given the observed covariates (i.e., does not depend on the potential outcomes), then treatment assignment is also unconfounded given only the propensity score. This justifies matching or forming subclasses based on the propensity score rather than on the full set of multivariate covariates. Thus, when treatment assignment is unconfounded, for a specific value of the propensity score, the difference in means in the outcome between the treated and control units with that propensity score value is an unbiased estimate of the treatment effect at that propensity score value.

Abadie and Imbens (2006) present theoretical results that provide additional justification for matching on the propensity score, showing that creating estimates based on matching on one continuous covariate (such as the propensity score) is \(N^{1/2}\)-consistent, but attempting to match on more than one covariate without discarding any units is not. Thus, in this particular case, using the propensity score enables consistent estimation of treatment effects.

Propensity Score Estimation

In practice, the true propensity scores are rarely known outside of randomized experiments and thus must be estimated. Propensity scores are often estimated using logistic regression, although other methods such as classification and regression trees (CART; Breiman, Friedman, Olshen, & Stone, 1984), discriminant analysis, or generalized boosted models (McCaffrey, Ridgeway, & Morral, 2004) can also be used. Matching or subclassification is then done using the estimated propensity score (e.g., the fitted values from the logistic regression).

In the matching literature, there has been some discussion of the effects of matching using estimated rather than true propensity scores, especially regarding the variance of estimates. Theoretical and analytic work has shown that, although more bias reduction can be obtained using true propensity scores, matching on estimated propensity scores can control variance orthogonal to the discriminant and thus can lead to more precise estimates of the treatment effect (Rubin & Thomas, 1992b, 1996). Analytic expressions for the bias and variance reduction possible for these situations are given in Rubin and Thomas (1992b). Specifically, Rubin and Thomas (1992b) state that “with large pools of controls, matching using estimated linear propensity scores results in approximately half the variance for the difference in the matched sample means as in corresponding random samples for all covariates uncorrelated with the population discriminant” (p. 802). This finding is confirmed in simulation work in Rubin and Thomas (1996) and in an empirical example in Hill, Rubin, and Thomas (1999). Hence, in situations where nearly all bias can be eliminated relatively easily, matching on the estimated propensity scores is superior to matching on the true propensity score because it will result in more precise estimates of the average treatment effect.

Model Specification

The model specification and diagnostics when estimating propensity scores are not the standard model diagnostics for logistic regression or CART, as discussed by Rubin (2004). With propensity score estimation, concern is not with the parameter estimates of the model but rather with the quality of the matches and sometimes with the accuracy of the predictions of treatment assignment (the propensity scores themselves). When the propensity scores will be used for matching or subclassification, the key diagnostic is covariate balance in the resulting matched samples or subclasses. When propensity scores are used directly in weighting adjustments, more attention should be paid to the accuracy of the model predictions since the estimates of the treatment effect may be very sensitive to the accuracy of the propensity score values themselves.

Rosenbaum and Rubin (1984); Perkins, Tu, Underhill, Zhou, and Murray (2000); Dehejia and Wahba (2002); and Michalopoulos, Bloom, and Hill (2004) described propensity score model-fitting strategies that involve examining the resulting covariate balance in subclasses defined by the propensity score. If covariates (or their squares or cross-products) are found to be unbalanced, those terms are then included in the propensity


score specification, which should improve balance, subject to sample size limitations.

Drake (1993) stated that treatment effect estimates are more sensitive to misspecification of the model of the outcome than to misspecification of the propensity score model. Dehejia and Wahba (1999, 2002) and Zhao (2004) also provided evidence that treatment effect estimates may not be too sensitive to the propensity score specification. However, these evaluations are fairly limited; for example, Drake considered only two covariates.

WHEN IS REGRESSION ANALYSIS TRUSTWORTHY?

It has been known for many years that regression analysis can lead to misleading results when the covariate distributions in the groups are very different (e.g., Cochran, 1957; Cochran & Rubin, 1973; Rubin, 1973b). Rubin (2001, p. 174) stated the three basic conditions that must generally be met for regression analyses to be trustworthy, in the case of approximately normally distributed covariates:²

1. The difference in the means of the propensity scores in the two groups being compared must be small (e.g., the means must be less than half a standard deviation apart), unless the situation is benign in the sense that:
   a. the distributions of the covariates in both groups are nearly symmetric,
   b. the distributions of the covariates in both groups have nearly the same variances, and
   c. the sample sizes are approximately the same.
2. The ratio of the variances of the propensity score in the two groups must be close to 1 (e.g., 1/2 or 2 are far too extreme).
3. The ratio of the variances of the residuals of the covariates after adjusting for the propensity score must be close to 1 (e.g., 1/2 or 2 are far too extreme).

These guidelines arise from results on the bias resulting from regression analysis in samples with large initial covariate bias that show that linear regression adjustment can grossly overcorrect or undercorrect for bias when these conditions are not met (Cochran & Rubin, 1973; Rubin, 1973b, 1979, 2001). For example, when the propensity score means are one quarter of a standard deviation apart in the two groups, the ratio of the treated to control group variance is 1/2, and the model of the outcome is moderately nonlinear (\(y = e^{x/2}\)), linear regression adjustment can lead to 300% reduction in bias. In other words, the adjustment increases the magnitude of the original bias, leaving 200% of it in the opposite direction! Results are even more striking for larger initial bias between the groups, where the amount of bias remaining can be substantial even if most (in percentage) of the initial bias has been removed (see Rubin, 2001, Table 1).

Despite these striking results, regression adjustment on unmatched data is still a common method for attempting to estimate causal effects. Matching methods provide a way to avoid extrapolation and reliance on the modeling assumptions, by ensuring the comparison of treated and control units with similar covariate distributions, when this is possible, and warning of the inherent extrapolation in regression models when there is little overlap in distributions.

IMPLEMENTATION OF MATCHING METHODS

We now turn to the implementation of matching methods. There are five key steps when using matching methods to estimate causal effects. These are (1) choosing the covariates to be used in the matching process; (2) defining a distance measure, used to assess whether units are “similar”; (3) choosing a specific matching algorithm to form matched sets; (4) diagnosing the matches obtained (and iterating between (2) and (3)); and finally, (5) estimating the effect of the treatment on the outcome, using the matched sets found in (4) and possibly other adjustments. The following sections provide further information on each of these steps.

Choosing the Covariates

The first step is to choose the covariates on which close matches are desired. As discussed earlier, an underlying assumption when estimating causal effects using nonexperimental data is that treatment assignment is unconfounded (Rosenbaum & Rubin, 1983b) given the covariates used in the matching process. To make this assumption plausible, it is important to include in the matching procedure any covariates that may be related to treatment assignment and the


outcome; the most important covariates to include are those that are related to treatment assignment because the matching will typically be done for many outcomes. Theoretical and empirical research has shown the importance of including a large set of covariates in the matching procedure (Hill, Reiter, & Zanutto, 2004; Lunceford & Davidian, 2004; Rubin & Thomas, 1996). Greevy, Lu, Silber, and Rosenbaum (2004) provide an example where the power of the subsequent analysis in a randomized experiment is increased by matching on 14 covariates, even though only 2 of those covariates are directly related to the outcome (the other 12 are related to the outcome only through their correlation with the 2 on which the outcome explicitly depends).

A second consideration is that the covariates included in the matching must be “proper” covariates in the sense of not being affected by treatment assignment. It is well known that matching or subclassifying on a variable affected by treatment assignment can lead to substantial bias in the estimated treatment effect (Frangakis & Rubin, 2002; Greenland, 2003; Imbens, 2004). All variables should thus be carefully considered as to whether they are “proper” covariates. This is especially important in fields such as epidemiology and political science, where the treatment assignment date is often somewhat undefined. If it is deemed to be critical to control for a variable potentially affected by treatment assignment, it is better to exclude that variable in the matching procedure and include it in the analysis model for the outcome (Reinisch, Sanders, Mortensen, & Rubin, 1995) and hope for balance on it, or use principal stratification methods (Frangakis & Rubin, 2002) to deal with it.

Selecting a Distance Measure

The next step when using matching methods is to define the “distance” measure that will be used to decide whether units are “similar” in terms of their covariate values. “Distance” is in quotes because the measure will not necessarily be a proper “full-rank” distance in the mathematical sense. One extreme distance measure is that of exact matching, which groups units only if they have the same values of all the covariates. Because limited sample sizes (and large numbers of covariates) make it very difficult to obtain exact matches, distance measures that are not full rank and that combine distances on individual covariates, such as propensity scores, are commonly used in practice.

Two measures of the distance between units on multiple covariates are the Mahalanobis distance, which is full rank, and the propensity score distance, which is not. The Mahalanobis distance on covariates X between units i and j is $(X_i - X_j)^{\top} \Sigma^{-1} (X_i - X_j)$, where $\Sigma$ can be the true or estimated variance-covariance matrix in the treated group, the control group, or a pooled sample; the control group variance-covariance matrix is usually used. The propensity score distance is defined as the absolute difference in (true or estimated) propensity scores between two units. See the “Propensity Score Estimation” section for more details on estimating propensity scores. Gu and Rosenbaum (1993) and Rubin and Thomas (2000) compare the performance of matching methods based on Mahalanobis metric matching and propensity score matching and find that the two distance measures perform similarly when there are a relatively small number of covariates, but propensity score matching works better than Mahalanobis metric matching with large numbers of covariates (greater than 5). One reason for this is that the Mahalanobis metric is attempting to obtain balance on all possible interactions of the covariates (which is very difficult in multivariate space), effectively considering all of the interactions as equally important. In contrast, propensity score matching allows the exclusion of terms from the propensity score model and thereby the inclusion of only the important terms (e.g., main effects, two-way interactions) on which to obtain balance.

As discussed below, these distance measures can be combined or used in conjunction with exact matching on certain covariates. Combining these distance measures with exact matching on certain covariates sets the distance between two units equal to infinity if the units are not exactly matched on those covariates.

Selecting Matches

Once the distance measure is defined, the next step is to choose the matched samples. This section provides a summary of some of the most common types of matching methods, given a particular distance measure. These methods include nearest neighbor matching and its variations (such as caliper matching) and subclassification methods (such as full matching). We


provide an overview of each, as well as references for further information and examples.

Nearest Neighbor Matching

Nearest neighbor matching (Rubin, 1973a) generally selects k matched controls for each treated unit (often, k = 1). The simplest nearest neighbor matching uses a “greedy” algorithm, which cycles through the treated units one at a time, selecting for each the available control unit with the smallest distance to the treated unit. A more sophisticated algorithm, “optimal” matching, minimizes a global measure of balance (Rosenbaum, 2002). Rosenbaum (2002) argues that the collection of matches found using optimal matching can have substantially better balance than matches found using greedy matching, without much loss in computational speed. Generally, greedy matching performs poorly with respect to average pair differences when there is intense competition for controls and performs well when there is little competition. In practical situations, when assessing the matched groups’ covariate balance, Gu and Rosenbaum (1993) find that optimal matching does not in general perform any better than greedy matching in terms of creating groups with good balance but does do better at reducing the distance between pairs. As summarized by Gu and Rosenbaum (1993), “Optimal matching picks about the same controls [as greedy matching] but does a better job of assigning them to treated units” (p. 413).

Figure 11.1 illustrates the result of a one-to-one greedy nearest neighbor matching algorithm implemented using the NSW data described in “The National Supported Work Demonstration” section. The propensity score was estimated using all covariates available in the data set. Of the 429 available control individuals, the 185 with propensity scores closest to those of the 185 treated individuals were selected as matches. We see that there is fairly good overlap throughout most of the range of propensity scores and that most of the control individuals not used as matches had very low propensity scores and so were inapposite for use as points of comparison.

When there are large numbers of control units, it is sometimes possible to get multiple good matches for each treated unit, which can reduce sampling variance in the treatment effect estimates. Although one-to-one matching is the most common, a larger number of matches for each treated unit are often possible. Unless there are many units with the same covariate values, using multiple controls for each treated unit is expected to increase bias because the second, third, and fourth closest matches are, by definition,

[Figure: Distribution of Propensity Scores. Treatment units and control units plotted against the propensity score (horizontal axis, 0.0 to 0.8).]

Figure 11.1 Matches chosen using 1:1 nearest neighbor matching on the propensity score. Black units were matched; gray units were unmatched. A total of 185 treated units were matched to 185 control units; 244 control units were discarded.
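The greedy one-to-one procedure illustrated in Figure 11.1 can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used for the NSW analysis: the function name `greedy_match`, the optional `caliper` argument, and the toy propensity scores are all hypothetical, and the distance is assumed to be the absolute difference in propensity scores.

```python
from typing import Dict, List, Optional


def greedy_match(treated_ps: List[float],
                 control_ps: List[float],
                 caliper: Optional[float] = None) -> Dict[int, int]:
    """Greedy 1:1 nearest neighbor matching without replacement.

    Returns a dict mapping treated index -> matched control index.
    Treated units are processed in the order given; a treated unit is left
    unmatched if its nearest available control lies outside the caliper.
    """
    available = set(range(len(control_ps)))
    matches: Dict[int, int] = {}
    for t, pt in enumerate(treated_ps):
        if not available:
            break
        # Closest still-available control on the propensity score distance
        # (ties broken by the smaller control index).
        best = min(available, key=lambda c: (abs(control_ps[c] - pt), c))
        if caliper is None or abs(control_ps[best] - pt) <= caliper:
            matches[t] = best
            available.remove(best)
    return matches


# Toy propensity scores (hypothetical values, not the NSW data).
treated = [0.60, 0.62]
controls = [0.10, 0.61, 0.70, 0.58]
pairs = greedy_match(treated, controls)
pairs_caliper = greedy_match(treated, controls, caliper=0.03)
```

In this toy example the first treated unit takes the control at 0.61, leaving the second treated unit with the control at 0.58, for a total pair distance of 0.05. Assigning the controls the other way round would give a total of 0.03, which is the kind of improvement in pair distances that an optimal algorithm can achieve while selecting the same controls.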


further away from the treated unit than is the first closest match, but using multiple matches can decrease sampling variance due to the larger matched sample size. Of course, in settings where the outcome data have yet to be collected and there are cost constraints, researchers must balance the benefit of obtaining multiple matches for each unit with the increased costs. Examples using more than one control match for each treated unit include H. Smith (1997) and Rubin and Thomas (2000).

Another key issue is whether controls can be used as matches for more than one treated unit, that is, whether the matching should be done “with replacement” or “without replacement.” Matching with replacement can often yield better matches because controls that look similar to many treated units can be used multiple times. In addition, like optimal matching, when matching with replacement, the order in which the treated units are matched does not matter. However, a drawback of matching with replacement is that it may be that only a few unique control units will be selected as matches; the number of times each control is matched should be monitored and reflected in the estimated precision of estimated causal effects.

Using the NSW data, Dehejia and Wahba (2002) match with replacement from the PSID sample because there are few control individuals comparable to those in the treated group, making matching with replacement appealing. When one-to-one matching is done without replacement, nearly half of the treated group members end up with matches that are quite far away. They conclude that matching with replacement can be useful when there are a limited number of control units with values similar to those in the treated group.

Limited Exact Matching

Rosenbaum and Rubin (1985a) illustrate the futility in attempting to find matching treated and control units with the same values of all the covariates and thus not being able to find matches for most units. However, it is often desirable (and possible) to obtain exact matches on a few key covariates, such as race or sex. Combining exact matching on key covariates with propensity score matching can lead to large reductions in bias and can result in a design analogous to blocking in a randomized experiment. For example, in Rubin (2001), the analyses are done separately for males and females, with male smokers matched to male nonsmokers and female smokers matched to female nonsmokers. Similarly, in Dehejia and Wahba (1999), the analysis is done separately for males and females.

Mahalanobis Metric Matching on Key Covariates Within Propensity Score Calipers

Caliper matching (Althauser & Rubin, 1970) selects matches within a specified range (caliper c) of a one-dimensional covariate X (which may actually be a combination of multiple covariates, such as the propensity score): $|X_{tj} - X_{cj}| \le c$ for all treatment/control matched pairs, indexed by j. Cochran and Rubin (1973) investigate various caliper sizes and show that with a normally distributed covariate, a caliper of 0.2 standard deviations can remove 98% of the bias due to that covariate, assuming all treated units are matched. Althauser and Rubin (1970) find that even a looser matching (1.0 standard deviations of X) can still remove approximately 75% of the initial bias due to X. Rosenbaum and Rubin (1985b) show that if the caliper matching is done using the propensity score, the bias reduction is obtained on all of the covariates that went into the propensity score. They suggest that a caliper of 0.25 standard deviations of the logit transformation of the propensity score can work well in general.

For situations where there are some key continuous covariates on which particularly close matches are desired, Mahalanobis matching on the key covariates can be combined with propensity score matching, resulting in particularly good balance (Rosenbaum & Rubin, 1985b; Rubin & Thomas, 2000). The Mahalanobis distance is usually calculated on covariates that are believed to be particularly predictive of the outcome of interest or of treatment assignment. For example, in the NSW Demonstration data, Mahalanobis metric matching on the 2 years of preprogram earnings could be done within propensity score calipers.

Subclassification

Rosenbaum and Rubin (1984) discuss reducing bias due to multiple covariates in observational studies through subclassification on estimated propensity scores, which forms groups


of units with similar propensity scores and thus similar covariate distributions. For example, subclasses may be defined by splitting the treated and control groups at the quintiles of the propensity score in the treated group, leading to five subclasses with approximately the same number of treated units in each. That work builds on the work by Cochran (1968) on subclassification using a single covariate; when the conditional expectation of the outcome variable is a monotone function of the propensity score, creating just five propensity score subclasses removes at least 90% of the bias in the estimated treatment effect due to each of the observed covariates. Thus, five subclasses are often used, although with large sample sizes, more subclasses, or even variable-sized subclasses, are often desirable. This method is clearly related to making an ordinal version of a continuous underlying covariate.

Lunceford and Davidian (2004) assess subclassification on the propensity score and find that subclassification without subsequent within-strata model adjustment (as discussed earlier) can lead to biased answers due to residual imbalance within the strata. They suggest a need for further research on the optimal number of subclasses, a topic also discussed in Du (1998).

Subclassification is illustrated using the NSW data in Figure 11.2, where six propensity score subclasses were formed to have approximately equal numbers of treated units. All units are placed into one of the six subclasses. The control units within each subclass are given equal weight, proportional to the number of treated units in the subclass; thus the treated and control units in each subclass receive the same total weight.

If the balance achieved in matched samples selected using nearest neighbor matching is not adequate, subclassification of the matches chosen using nearest neighbor matching can be done to yield improved balance. This is illustrated in Figure 11.3, where, after one-to-one nearest neighbor matching, six subclasses have been formed with approximately the same number of treated units in each subclass. This process is illustrated in Rubin (2001) and Rubin (2007).

Full Matching

An extension of subclassification is “full matching” (Rosenbaum, 1991a, 2002), in which the matched sample is composed of matched sets (subclasses), where each matched set contains either (a) one treated unit and one or more controls or (b) one control unit and one or more treated units. Full matching is optimal in terms of minimizing a weighted average of the distances between each treated subject and each control subject within each matched set. Hansen (2004) gives a practical evaluation of the

[Figure: Distribution of Propensity Scores. Treatment units and control units plotted against the propensity score (horizontal axis, 0.0 to 0.8).]

Figure 11.2 Results from subclassification on the propensity score. Subclasses are indicated by vertical lines. The weight given to each unit is represented by its symbol size; larger symbols correspond to larger weight.
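The weighting scheme shown in Figure 11.2 can be written out as a short Python sketch. It assumes that subclass boundaries are quantiles of the propensity score among the treated units and that each control receives weight equal to the number of treated units in its subclass divided by the number of controls there, so that the two groups carry the same total weight within a subclass. The function name `subclassify`, the default of five subclasses, and the toy data are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def subclassify(ps, treated, n_subclasses=5):
    """Return (subclass index per unit, weight per unit).

    ps           -- estimated propensity scores, one per unit
    treated      -- 1/True for treated units, 0/False for controls
    n_subclasses -- number of subclasses (needs at least 2 treated units)
    """
    treated_ps = sorted(p for p, t in zip(ps, treated) if t)
    # Interior cut points: quantiles of the propensity score among treated units.
    cuts = quantiles(treated_ps, n=n_subclasses)
    subclass = [bisect_right(cuts, p) for p in ps]

    n_t = [0] * n_subclasses
    n_c = [0] * n_subclasses
    for s, t in zip(subclass, treated):
        if t:
            n_t[s] += 1
        else:
            n_c[s] += 1

    # Treated units get weight 1; the controls in subclass s share a total
    # weight equal to the number of treated units in that subclass.
    weights = [1.0 if t else n_t[s] / n_c[s] for s, t in zip(subclass, treated)]
    return subclass, weights


# Toy data (hypothetical): four treated units followed by five controls.
ps = [0.2, 0.4, 0.6, 0.8, 0.1, 0.3, 0.3, 0.7, 0.9]
treated = [1, 1, 1, 1, 0, 0, 0, 0, 0]
subclass, weights = subclassify(ps, treated, n_subclasses=2)
```

With two subclasses, the lower subclass holds two treated units and three controls, so each of those controls receives weight 2/3; a control in a subclass with no treated units would receive weight zero under this rule.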


[Figure: Distribution of Propensity Scores. Treatment units and control units plotted against the propensity score (horizontal axis, 0.0 to 0.8).]

Figure 11.3 One-to-one nearest neighbor matching on the propensity score followed by subclassification. Black units were matched; gray units were unmatched. Subclasses indicated by vertical lines.

method, estimating the effect of SAT coaching, illustrating that, although the original treated and control groups had propensity score differences of 1.1 standard deviations, the matched sets from full matching differed by less than 2% of a standard deviation. To achieve efficiency gains, Hansen (2004) also describes a variation of full matching that restricts the ratio of the number of treated units to the number of control units in each matched set, a method also applied in Stuart and Green (in press).

The output from full matching is illustrated using the NSW data in Figure 11.4. Because it is not feasible to show the individual matched sets (in these data, 103 matched sets were created), the units are represented by their relative weights. All treated units receive a weight of 1 (and thus the symbols are all the same size). Control units in matched sets with many control units and few treated units receive small weight (e.g., the units with propensity scores close to 0), whereas control units in matched sets with few control units and many treated units (e.g., the units with propensity scores close to 0.8) receive large weight. The weighted treated and control group covariate distributions look very similar. As in simple subclassification, all control units within a matched set receive equal weight. However, because there are many more matched sets than with simple subclassification, the variation in the weights is much larger across matched sets.

Because subclassification and full matching place all available units into one of the subclasses, these methods may have particular appeal for researchers who are reluctant to discard some of the control units. However, these methods are not relevant for situations where the matching is being used to select units for follow-up or for situations where some units have essentially zero probability of receiving the other treatment.

Weighting Adjustments

Another method that uses all units is weighting, where observations are weighted by their inverse propensity score (Czajka, Hirabayashi, Little, & Rubin, 1992; Lunceford & Davidian, 2004; McCaffrey et al., 2004). Weighting can also be thought of as the limit of subclassification as the number of observations and the number of subclasses go to infinity. Weighting methods are based on Horvitz-Thompson estimation (Horvitz & Thompson, 1952), used frequently in sample surveys. A drawback of weighting adjustments is that, as with Horvitz-Thompson estimation, the sampling variance of resulting weighted estimators can be very large if the weights are extreme (if the propensity scores are close to 0 or 1). Thus, the subclassification or full matching approaches, which also use all units, may be more appealing because the resulting weights are less variable.
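As a rough illustration of inverse propensity weighting, the sketch below assigns each treated unit the weight 1/e and each control unit the weight 1/(1 - e), where e is the unit's estimated propensity score, and then compares weight-normalized outcome means. The normalization by the sum of the weights is an assumption made here for simplicity rather than something prescribed above, and the function names and toy data are hypothetical.

```python
def ipw_weights(ps, treated):
    """Inverse propensity weights: 1/e for treated units, 1/(1 - e) for controls."""
    return [1.0 / e if t else 1.0 / (1.0 - e) for e, t in zip(ps, treated)]


def weighted_mean_difference(y, treated, weights):
    """Difference of weight-normalized outcome means (treated minus control)."""
    def wmean(flag):
        num = sum(w * yi for yi, t, w in zip(y, treated, weights) if bool(t) == flag)
        den = sum(w for t, w in zip(treated, weights) if bool(t) == flag)
        return num / den
    return wmean(True) - wmean(False)


# Toy data (hypothetical): two treated units followed by two controls.
ps = [0.5, 0.8, 0.5, 0.2]
treated = [1, 1, 0, 0]
y = [3.0, 5.0, 1.0, 2.0]

w = ipw_weights(ps, treated)
effect = weighted_mean_difference(y, treated, w)

# A treated unit with a propensity score near 0 receives a very large weight,
# which is the source of the variance problem described above.
extreme = ipw_weights([0.01], [1])[0]
```

Here a treated unit with an estimated propensity score of 0.01 counts roughly as much as 100 units, so a handful of such observations can dominate the estimate.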


[Figure: Distribution of Propensity Scores. Treatment units and control units plotted against the propensity score (horizontal axis, 0.0 to 0.8).]

Figure 11.4 Results from full matching on the propensity score. The weight given to each unit is represented by its size; larger symbols correspond to higher weight.

Another type of weighting procedure is that of kernel weighting adjustments, which average over multiple persons in the control group for each treated unit, with weights defined by their distance from the treated unit. Heckman, Ichimura, Smith, and Todd (1998) and Heckman, Ichimura, and Todd (1998) describe a local linear matching estimator that requires specifying a bandwidth parameter. Generally, larger bandwidths increase bias but reduce variance by putting weight on units that are further away from the treated unit of interest. A complication with these methods is this need to define a bandwidth or smoothing parameter, which does not generally have an intuitive meaning; Imbens (2004) provides some guidance on that choice.

With all of these weighting approaches, it is still important to separate clearly the design and analysis stages. The propensity score should be carefully estimated, using approaches such as those described earlier, and the weights set before any use of those weights in models relating the outcomes to covariates.

Diagnostics for Matching Methods

Diagnosing the quality of the matches obtained from a matching method is of primary importance. Extensive diagnostics and propensity score model specification checks are required for each data set, as discussed by Dehejia (2005). Matching methods have a variety of simple diagnostic procedures that can be used, most based on the idea of assessing balance between the treated and control groups. Although we would ideally compare the multivariate covariate distributions in the two groups, that is difficult when there are many covariates, and so generally comparisons are done for each univariate covariate separately, for two-way interactions of covariates, and for the propensity score, as the most important univariate summary of the covariates.

At a minimum, the balance diagnostics should involve comparing the mean covariate values in the groups, sometimes standardized by the standard deviation in the full sample; ideally, other characteristics of the distributions, such as variances, correlations, and interactions between covariates, should also be compared. Common diagnostics include t tests of the covariates, Kolmogorov-Smirnov tests, and other comparisons of distributions (e.g., Austin & Mamdani, 2006). Ho et al. (2007) provide a summary of numerical and graphical summaries of balance, including empirical quantile-quantile plots to examine the empirical distribution of each covariate in the matched samples. Rosenbaum and Rubin (1984) examine F ratios from a two-way analysis of variance performed for each covariate, where the factors are treatment/control and propensity score subclasses.
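Two of the simplest balance summaries mentioned above, the standardized difference in means and the ratio of variances, can be computed for any single covariate (or for the propensity score itself) as in the following sketch. Standardizing by the standard deviation of the combined sample is one of several conventions and is assumed here; the function names and the toy covariate values are hypothetical.

```python
from statistics import mean, stdev, variance


def standardized_mean_difference(x_treated, x_control):
    """Difference in group means divided by the SD of the combined sample."""
    full_sd = stdev(list(x_treated) + list(x_control))
    return (mean(x_treated) - mean(x_control)) / full_sd


def variance_ratio(x_treated, x_control):
    """Ratio of the sample variances, treated over control."""
    return variance(x_treated) / variance(x_control)


# Toy covariate values (hypothetical) in a matched treated and control group.
x_t = [2.0, 4.0, 6.0]
x_c = [1.0, 3.0, 5.0]

smd = standardized_mean_difference(x_t, x_c)
vr = variance_ratio(x_t, x_c)
```

In practice these two quantities would be tabulated for every covariate before and after matching, so that the improvement in balance attributable to the matching can be seen directly.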


Rubin (2001) presents diagnostics related to the conditions given in the previous section that indicate when regression analyses are trustworthy. These diagnostics include assessing the standardized difference in means of the propensity scores between the two treatment groups, the ratio of the variances of the propensity scores in the two groups, and, for each covariate, the ratio of the variance of the residuals orthogonal to the propensity score in the two groups. The standardized differences in means should generally be less than 0.25, and the variance ratios should be close to 1, certainly between 0.5 and 2, as discussed earlier.

Analysis of Outcome Data After Matching

The analysis of outcome(s) should proceed only after the observational study design has been set in the sense that the matched samples have been chosen, and it has been determined that the matched samples have adequate balance. In keeping with the idea of replicating a randomized experiment, the same methods that would be used in an experiment can be used in the matched data. In particular, matching methods are not designed to “compete” with modeling adjustments such as linear regression, and in fact, the two methods have been shown to work best in combination. Many authors have discussed for decades the benefits of combining matching or propensity score weighting and regression adjustment (Abadie & Imbens, 2006; Heckman et al., 1997; Robins & Rotnitzky, 1995; Rubin, 1973b, 1979; Rubin & Thomas, 2000).

The intuition for using both is the same as that behind regression adjustment in randomized experiments, where the regression adjustment is used to “clean up” small residual covariate imbalance between the treatment and control groups. The matching method reduces large covariate bias between the treated and control groups, and the regression is used to adjust for any small residual bias and to increase efficiency. These “bias-corrected” matching methods have been found by Abadie and Imbens (2006) and Glazerman, Levy, and Myers (2003) to work well in practice, using simulated and actual data. Rubin (1973b, 1979), Rubin and Thomas (2000), and Ho et al. (2007) show that models based on matched data are much less sensitive to model misspecification and more robust than are models fit in the full data sets.

Some slight adjustments to the analysis methods are required with some particular matching methods. With procedures such as full matching, subclassification, or matching with replacement, where there may be different numbers of treated and control units at each value of the covariates, the analysis should incorporate weights to account for these varying numbers of units. Examples of this can be found in Dehejia and Wahba (1999), Hill et al. (2004), and Michalopoulos et al. (2004). When subclassification has been used, estimates should be obtained separately within each subclass and then aggregated across subclasses to obtain an overall effect (Rosenbaum & Rubin, 1984). Estimates within each subclass are sometimes calculated using simple differences in means, although empirical (Lunceford & Davidian, 2004) and theoretical (Abadie & Imbens, 2006) work has shown that better results are obtained if regression adjustment is used in conjunction with the subclassification. When aggregating across subclasses, weighting the subclass estimates by the number of treated units in each subclass estimates the average treatment effect for the units in the treated group; if there was no matching done before subclassification, weighting by the overall number of units in each subclass estimates the overall average treatment effect for the population of treated and control units.

COMPLICATIONS IN USING MATCHING METHODS

Overlap in Distributions

In some analyses, some of the control units may be very dissimilar from all treated units, or some of the treated units may be very dissimilar from all control units, potentially exhibited by propensity scores outside the range of the other treatment group. Thus, it is sometimes desirable to explicitly discard units with “extreme” values of the propensity score (for example, treated units for whom there are no control units with propensity score values as large). Doing the analysis only in the areas where there is distributional overlap, that is, with “common support” (regions of the covariate space that have both treated and control units), will lead to more robust inference. This, in essence, is what matching is usually attempting to do; defining


the area of common support is a way to discard units that are unlike all units in the other treatment group.

However, it is often difficult to determine whether there is common support in multidimensional space. One way of doing so is to examine the overlap of the propensity score distributions. This is illustrated in Dehejia and Wahba (1999), where control units with propensity scores lower than the minimum propensity score for the treated units are discarded. A second method of examining the multivariate overlap involves examining the “convex hull” of the covariates, essentially identifying the multidimensional space that allows interpolation rather than extrapolation (King & Zeng, 2007). Imbens (2004) also discusses these issues in an economic context.

Missing Covariate Values

Most of the literature on matching and propensity scores assumes fully observed covariates, so that models such as logistic regression can be used to estimate the propensity scores. However, there are often missing values in the covariates, which complicates matching and propensity score estimation. Two complex statistical models used to estimate propensity scores in this case are pattern mixture models (Rosenbaum & Rubin, 1984) and general location models (D’Agostino & Rubin, 2000). A key consideration when thinking about missing covariate values is that the pattern of missing covariates can be prognostically important, and in such cases, the methods should condition on the observed values of the covariates and on the observed missing data indicators.

There has not been much theoretical work done on the appropriate procedures for dealing with missing covariate values. Multiple researchers have done empirical comparisons of methods, but this is clearly an area for further research. D’Agostino, Lang, Walkup, and Morgan (2001) compare three simpler methods of dealing with missing covariate values: The first uses only units with complete data and discards all units with any missing data, the second does a simple imputation for missing values and includes indicators for missing values in the propensity score model, and the third fits separate propensity score models for each pattern of missing data (a pattern mixture approach, as in Rosenbaum & Rubin, 1984). All three methods perform well in terms of creating well-matched samples. They find that the third method performs the best, evaluated by treating the original complete-case data set (i.e., all individuals with no missing values) as the “truth,” imposing additional nonignorable missing values on that complete-case data set and examining which method best reproduces the estimate observed in the original complete-case data, given the imposed missingness. Song, Berlin, Lee, Gao, and Rotheram-Borus (2001) compare two methods of using propensity scores with missing covariate data. The first uses mean imputation for the missing values and then estimates the propensity scores. The second multiply imputes the covariates (Rubin, 1987) and estimates propensity scores in each “complete” data set. A mixed effects model is used to analyze the longitudinal outcome data in each data set, and the multiple imputation combining rules are used to obtain one estimate of the treatment effect. Results are similar using the two methods. Both methods show that the covariates are very poorly balanced between the treated and control groups, and that good matches are hard to find (a finding that standard modeling approaches would not necessarily have discovered). Hill (2004) finds that methods using multiple imputation work better than complete data or complete variable methods (which use only units with complete data or only variables with complete data).

Unobserved Variables

A critique of any observational study is that there may be unobserved covariates that affect both treatment assignment and the outcome, thus violating the assumption of unconfounded treatment assignment. The approach behind matching is that of dealing as well as possible with the observed covariates; close matching on the observed covariates will also lessen the bias due to unobserved covariates that are correlated with the observed covariates. However, there may still be concern regarding unobserved differences between the treated and control groups.

The assumption of unconfounded treatment assignment can never be directly tested. However, some researchers have proposed tests in which an estimate is obtained for an effect that is “known” to be zero, such as the difference in a pretreatment measure of the outcome variable


(Imbens, 2004) or the difference in outcomes between multiple control groups (Rosenbaum, 1987b). If the test indicates that the effect is not equal to zero, then the assumption of unconfounded treatment assignment is deemed to be less plausible.

Analyses can also be performed to assess sensitivity to an unobserved variable. Rosenbaum and Rubin (1983a) extend the ideas of Cornfield et al. (1959), who examined how strong the correlations would have to be between a hypothetical unobserved covariate and both treatment assignment and the outcome to make the observed estimate of the treatment effect be zero. This approach is also discussed and applied to an economic application in Imbens (2003). Rosenbaum (1991b) describes a sensitivity analysis for case-control studies and discusses how sensitivity could also be assessed in situations where there are multiple sources of control units available: some closer on some (potentially unobserved) dimensions and others closer on other (potentially unobserved) dimensions. See Stuart and Rubin (in press) for another example of using multiple sources of control units.

Multiple Treatment Doses

Throughout this discussion of matching, it has been assumed that there are just two groups: treated and control. However, in many studies, there are actually multiple levels of the treatment (e.g., doses of a drug). Rosenbaum (2002) summarizes two methods for dealing with multiple treatment levels. In the first method, the propensity score is still a scalar function of the covariates (Joffe & Rosenbaum, 1999). This method uses a model such as an ordinal logit model to match on a linear combination of the covariates. This is illustrated in Lu, Zanutto, Hornik, and Rosenbaum (2001), where matching is used to form pairs that balance covariates but differ markedly in dose of treatment received. This differs from the standard matching setting in that there are not clear “treatment” and “control” groups, and thus any two subjects could conceivably be paired. An optimal matching algorithm for this setting is described and applied to the evaluation of a media campaign against drug abuse. In the second method, each of the levels of treatment has its own propensity score (e.g., Imbens, 2000; Rosenbaum, 1987a), and each propensity score is used one at a time to estimate the distribution of responses that would have been observed if all units had received that treatment level. These distributions are then compared.

Encompassing these two methods, Imai and van Dyk (2004) generalize the propensity score to arbitrary treatment regimes (including ordinal, categorical, and multidimensional). They provide theorems for the properties of this generalized propensity score (the propensity function), showing that it has properties similar to those of the propensity score for binary treatments in that adjusting for the low-dimensional (not always scalar, but always low-dimensional) propensity function balances the covariates. They advocate subclassification rather than matching and provide two examples as well as simulations showing the performance of adjustment based on the propensity function.

Diagnostics are especially crucial in this setting because it becomes more difficult to assess the balance of the resulting samples when there are multiple treatment levels. It is even unclear what balance precisely means in this setting; does there need to be balance among all of the levels or only among pairwise comparisons of dose levels? Future work is needed to examine these issues.

EVALUATION OF MATCHING METHODS

Two major types of evaluations of matching methods have been done, one using simulated data and another trying to replicate results from randomized experiments using observational data. Simulations that compare the performance of matching methods in terms of bias reduction include Cochran and Rubin (1973), Rubin (1973a, 1973b, 1979), Rubin and Thomas (2000), Gu and Rosenbaum (1993), Frolich (2004), and Zhao (2004). These generally include relatively small numbers of covariates drawn from known distributions. Many of the results from these simulations have been included in the discussions of methods provided in this chapter.

A second type of evaluation has attempted to replicate the results of randomized experiments using observational data. Glazerman et al. (2003) summarize the results from 12 case studies that attempted to replicate experimental estimates using nonexperimental data, all in the

Quasi-Experimental Designs 171

context of job training, welfare, and employment programs with earnings as the outcome of interest. The nonexperimental methods include matching and covariance adjustment. From the 12 studies, they extract 1,150 estimates of the bias (approximately 96 per study), where bias is defined as the difference between the result from the randomized experiment and the result using observational data. They determine that it is in general difficult to replicate experimental results consistently and that nonexperimental estimates are often dramatically different from experimental results. However, some general guidance can be obtained.

Glazerman et al. (2003) find that one-to-one propensity score matching performs better than other propensity score matching methods or non–propensity score matching and that standard econometric selection correction procedures, such as instrumental variables or the Heckman selection correction, tend to perform poorly. As discussed earlier, their results also show that combining methods, such as matching and covariance adjustment, is better than using those methods individually. They also stress the importance of high-quality data and a rich set of covariates, and they discuss the difficulties in trying to use large publicly available data sets for this purpose. However, there are counterexamples to this general guidance. For example, using the NSW data, Dehejia and Wahba (1999) found that propensity score matching methods using a large, publicly available national data set replicated experimental results very well.

A number of authors, particularly Heckman and colleagues, use data from the U.S. National Job Training Partnership Act (JTPA) study to evaluate matching methods (Heckman et al., 1997; Heckman, Ichimura, Smith, & Todd, 1998; Heckman, Ichimura, & Todd, 1998). Some of their results are similar to those of Glazerman et al. (2003), particularly stressing the importance of high-quality data. Matching is best able to replicate the JTPA experimental results when (a) the same data sources are used for the participants and nonparticipants, thereby ensuring similar covariate meaning and measurement; (b) participants and nonparticipants reside in the same local labor markets; and (c) the data set contains a rich set of covariates to model the probability of receiving the treatment. Reaching somewhat similar conclusions, Michalopoulos et al. (2004) use data on welfare-to-work programs that had random assignment and again find that within-state comparisons have less bias than out-of-state comparisons. They compare estimates from propensity score matching, ordinary least squares, a fixed-effects model, and a random-growth model and find that no method is consistently better than the others but that the matching method was more useful for diagnosing situations in which the data set was insufficient for the comparison. Hill et al. (2004) also stress the importance of matching on geography as well as other covariates; using data from a randomized experiment of a child care program for low-birth-weight children and comparison data from the National Longitudinal Survey of Youth, they were able to replicate well the experimental results using matching with a large set of covariates, including individual-level and geographic area–level covariates. Ordinary least squares with the full set of covariates or matching with a smaller set of covariates did not perform as well as the propensity score matching with the full set of covariates. Agodini and Dynarski (2004) describe an example where matching methods highlighted the fact that the data were insufficient to estimate causal effects without heroic assumptions.

ADVICE TO AN INVESTIGATOR

To conclude, this section provides advice to investigators interested in implementing matching methods.

Control Group and Covariate Selection

As discussed in Cochran (1965), Cochran and Rubin (1973), and Rosenbaum (1999), a key to estimating causal effects with observational data is to identify an appropriate control group, ideally with good overlap with the treated group. Care should be taken to find data sets that have units similar to those in the treated group, with comparable covariate meaning and availability. Approximations for the maximum percent bias reduction possible can be used to determine which of a set of control groups are likely to provide the best matches or to help guide sample sizes and matching ratios (Rubin, 1976c; Rubin & Thomas, 1992b, 1996). Large pools of potential controls are beneficial, as many articles show that
much better balance is achieved when there are many controls available for the matching (Rubin, 1976c; Rubin & Thomas, 1996). As discussed earlier, researchers should include all available covariates in the propensity score specification; excluding potentially relevant covariates can create bias in the estimation of treatment effects, but including potentially irrelevant covariates will typically not reduce the quality of the matches much (Rubin & Thomas, 1996).

Distance Measure

Once the control pool is selected, propensity score matching is the most effective method for reducing bias due to many covariates (Gu & Rosenbaum, 1993; Rosenbaum & Rubin, 1985b). As discussed earlier, propensity scores can be estimated using logistic regression, and the propensity score specification should be assessed using a method such as that in the “Model Specification” subsection. This generally involves examining the balance of covariates in subclasses defined by the propensity score. If there are a few covariates designated as particularly related to the outcome, and thus it is considered desirable to obtain especially close matches on those covariates, Mahalanobis matching on those key covariates can be done within propensity score calipers (Rosenbaum & Rubin, 1985b; Rubin & Thomas, 2000).

Recommended Matching Methods

Our advice for the matching method itself is very general: Try a variety of methods and use the diagnostics discussed earlier to determine which approach yields the most closely matched samples. Since the design and analysis stages are clearly separated and the outcome is not used in the matching process, trying a variety of methods and selecting the one that leads to the best covariate balance cannot bias the results. Although the best method will depend on the individual data set, below we highlight some methods that are likely to produce good results for the two general situations considered in this chapter.

To Select Units for Follow-Up

For the special case of doing matching for the purpose of selecting well-matched controls for follow-up (i.e., when the outcome values are not yet available), optimal matching is generally best for producing well-matched pairs (Gu & Rosenbaum, 1993). Optimal matching aims to reduce a global distance measure, rather than just considering each match one at a time, and thus reconsiders earlier matches if better overall balance could be obtained by breaking that earlier match. Further details are given in the “Nearest Neighbor Matching” subsection. However, if overall balanced samples are all that is desired (rather than specifically matched pairs), then an easier and more straightforward nearest neighbor greedy matching algorithm can be used to select the controls.

Researchers should also consider whether it is feasible (or desirable) to obtain more than one matched control for each treated unit (or even some treated units), as discussed earlier. With relatively small control pools, it may be difficult to obtain more than one match for each treated unit and still obtain large reductions in bias. However, with larger control pools, it may be possible to obtain more than one match for each treated unit without sacrificing bias reduction. This decision is also likely to involve cost considerations.

If Outcome Data Are Already Available

When the outcome values are already available, first put the outcome values away when matching! Then, a variety of good methods exist and should be considered and tried. In particular, one-to-one propensity score matching is often a good place to start (or k-to-one if there are many controls relative to the number of treated units). If there is still substantial bias between the groups in the matched samples (e.g., imbalance in the propensity score of more than 0.5 standard deviations), the nearest neighbor matching can be combined with subclassification on the matched samples, as discussed earlier. Full matching and subclassification on the full data sets also often work well in practice, where full matching can be thought of as in between the two extremes of one-to-one matching and weighting.

Outcome Analysis

After matched samples are selected, the outcome analysis can proceed: linear regression, logistic regression, hierarchical modeling, and so on. As discussed earlier, results should be less sensitive to the modeling assumptions and thus
should be fairly insensitive to the model specification, as compared with the same analysis on the original unmatched samples. With procedures such as full matching, subclassification, or matching with replacement, where there may be different numbers of treated and control units at each value of the covariates, the analysis should incorporate weights to account for these varying distributions.

SOFTWARE

A variety of software packages are currently available to implement matching methods. These include multiple R packages (MatchIt, Ho, Imai, King, & Stuart, 2006; twang, Ridgeway, McCaffrey, & Morral, 2006; Matching, Sekhon, 2006), multiple Stata packages (Abadie, Drukker, Herr, & Imbens, 2004; Becker & Ichino, 2002; Leuven & Sianesi, 2003), and SAS code for propensity score matching (D’Agostino, 1998; Parsons, 2001). A major benefit of the R packages (particularly MatchIt and twang) is that they clearly separate the design and analysis stages and have extensive propensity score diagnostics. The Stata and SAS packages and procedures do not explicitly separate these two stages.

NOTES

1. The data for this example are available at http://www.nber.org/%7Erdehejia/nswdata.html and in the MatchIt matching package for R, available at http://gking.harvard.edu/matchit.
2. With nonnormally distributed covariates, the conditions are even more complex.

REFERENCES

Abadie, A., Drukker, D., Herr, J. L., & Imbens, G. W. (2004). Implementing matching estimators for average treatment effects in Stata. The Stata Journal, 4(3), 290–311.
Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1), 235–267.
Agodini, R., & Dynarski, M. (2004). Are experiments the only option? A look at dropout prevention programs. Review of Economics and Statistics, 86(1), 180–194.
Althauser, R., & Rubin, D. (1970). The computerized construction of a matched sample. American Journal of Sociology, 76, 325–346.
Austin, P. C., & Mamdani, M. M. (2006). A comparison of propensity score methods: A case-study illustrating the effectiveness of post-AMI statin use. Statistics in Medicine, 25, 2084–2106.
Becker, S. O., & Ichino, A. (2002). Estimation of average treatment effects based on propensity scores. The Stata Journal, 2(4), 358–377.
Breiman, L. J., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Chapin, F. (1947). Experimental designs in sociological research. New York: Harper.
Christakis, N. A., & Iwashyna, T. I. (2003). The health impact of health care on families: A matched cohort study of hospice use by decedents and mortality outcomes in surviving, widowed spouses. Social Science & Medicine, 57, 465–475.
Cochran, W. G. (1957). Analysis of covariance: Its nature and uses. Biometrics, 13, 261–281.
Cochran, W. G. (1965). The planning of observational studies of human populations (with discussion). Journal of the Royal Statistical Society, Series A, 128, 234–255.
Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24, 295–313.
Cochran, W. G., & Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhya: The Indian Journal of Statistics, Series A, 35, 417–446.
Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., & Wynder, E. L. (1959). Smoking and lung cancer: Recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22, 173–203.
Czajka, J. C., Hirabayashi, S., Little, R., & Rubin, D. B. (1992). Projecting from advance data using propensity modeling. Journal of Business and Economic Statistics, 10, 117–131.
D’Agostino, R. B., Jr. (1998). Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics in Medicine, 17, 2265–2281.
D’Agostino, R. B., Jr., Lang, W., Walkup, M., & Morgan, T. (2001). Examining the impact of missing data on propensity score estimation in determining the effectiveness of self-monitoring of blood glucose (SMBG). Health Services & Outcomes Research Methodology, 2, 291–315.
D’Agostino, R. B., Jr., & Rubin, D. B. (2000). Estimating and using propensity scores with partially missing data. Journal of the American Statistical Association, 95, 749–759.
Dehejia, R. H. (2005). Practical propensity score matching: A reply to Smith and Todd. Journal of Econometrics, 125, 355–364.
Dehejia, R. H., & Wahba, S. (1999). Causal effects in nonexperimental studies: Re-evaluating the evaluation of training programs. Journal of the American Statistical Association, 94, 1053–1062.
Dehejia, R. H., & Wahba, S. (2002). Propensity score matching methods for non-experimental causal studies. Review of Economics and Statistics, 84, 151–161.
Drake, C. (1993). Effects of misspecification of the propensity score on estimators of treatment effects. Biometrics, 49, 1231–1236.
Du, J. (1998). Valid inferences after propensity score subclassification using maximum number of subclasses as building blocks. Unpublished doctoral dissertation, Harvard University, Department of Statistics.
Frangakis, C. E., & Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics, 58, 21–29.
Frolich, M. (2004). Finite-sample properties of propensity-score matching and weighting estimators. Review of Economics and Statistics, 86(1), 77–90.
Glazerman, S., Levy, D. M., & Myers, D. (2003). Nonexperimental versus experimental estimates of earnings impacts. The Annals of the American Academy of Political and Social Science, 589, 63–93.
Greenland, S. (2003). Quantifying biases in causal models: Classical confounding vs collider-stratification bias. Epidemiology, 14(3), 300–306.
Greenwood, E. (1945). Experimental sociology: A study in method. New York: King’s Crown Press.
Greevy, R., Lu, B., Silber, J. H., & Rosenbaum, P. (2004). Optimal multivariate matching before randomization. Biostatistics, 5, 263–275.
Gu, X., & Rosenbaum, P. R. (1993). Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2, 405–420.
Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467), 609–618.
Heckman, J. J., Ichimura, H., & Todd, P. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. Review of Economic Studies, 64, 605–654.
Heckman, J. J., Ichimura, H., Smith, J., & Todd, P. (1998). Characterizing selection bias using experimental data. Econometrica, 66(5), 1017–1098.
Heckman, J. J., Ichimura, H., & Todd, P. (1998). Matching as an econometric evaluation estimator. Review of Economic Studies, 65, 261–294.
Hill, J. (2004). Reducing bias in treatment effect estimation in observational studies suffering from missing data. Working Paper 04-01, Columbia University Institute for Social and Economic Research and Policy (ISERP).
Hill, J., Reiter, J., & Zanutto, E. (2004). A comparison of experimental and observational data analyses. In A. Gelman & X.-L. Meng (Eds.), Applied Bayesian modeling and causal inference from an incomplete-data perspective (pp. 44–56). New York: John Wiley.
Hill, J., Rubin, D. B., & Thomas, N. (1999). The design of the New York School Choice Scholarship Program evaluation. In L. Bickman (Ed.), Research designs: Inspired by the work of Donald Campbell (pp. 155–180). Thousand Oaks, CA: Sage.
Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2006). MatchIt: Nonparametric preprocessing for parametric causal inference. Software for using matching methods in R. Available at http://gking.harvard.edu/matchit/
Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15(3), 199–236. Available at http://pan.oxfordjournals.org/cgi/reprint/mpl013?ijkey=K17Pjban3gH2zs0&keytype=ref.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, 945–960.
Hong, G., & Raudenbush, S. W. (2006). Evaluating kindergarten retention policy: A case study of causal inference for multilevel observational data. Journal of the American Statistical Association, 101(475), 901–910.
Horvitz, D., & Thompson, D. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.
Imai, K., & van Dyk, D. A. (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association, 99(467), 854–866.
Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87, 706–710.
Imbens, G. W. (2003). Sensitivity to exogeneity assumptions in program evaluation. American Economic Review, 93(2), 126–132.
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1), 4–29.
Joffe, M. M., & Rosenbaum, P. R. (1999). Propensity scores. American Journal of Epidemiology, 150, 327–333.
King, G., & Zeng, L. (2007). When can history be our guide? The pitfalls of counterfactual inference. International Studies Quarterly, 51, 183–210.
LaLonde, R. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4), 604–620.
Leuven, E., & Sianesi, B. (2003). psmatch2. Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing. Available at http://www1.fee.uva.nl/scholar/mdw/leuven/stata
Lu, B., Zanutto, E., Hornik, R., & Rosenbaum, P. R. (2001). Matching with doses in an observational study of a media campaign against drug abuse. Journal of the American Statistical Association, 96, 1245–1253.
Lunceford, J. K., & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine, 23, 2937–2960.
McCaffrey, D. F., Ridgeway, G., & Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9(4), 403–425.
Michalopoulos, C., Bloom, H. S., & Hill, C. J. (2004). Can propensity-score methods match the findings from a random assignment evaluation of mandatory welfare-to-work programs? Review of Economics and Statistics, 86(1), 156–179.
Morgan, S. L., & Harding, D. J. (2006). Matching estimators of causal effects: Prospects and pitfalls in theory and practice. Sociological Methods & Research, 35(1), 3–60.
Parsons, L. S. (2001, April). Reducing bias in a propensity score matched-pair sample using greedy matching techniques. Paper presented at SAS SUGI 26, Long Beach, CA.
Perkins, S. M., Tu, W., Underhill, M. G., Zhou, X.-H., & Murray, M. D. (2000). The use of propensity scores in pharmacoepidemiological research. Pharmacoepidemiology and Drug Safety, 9, 93–101.
Reinisch, J., Sanders, S., Mortensen, E., & Rubin, D. B. (1995). In utero exposure to phenobarbital and intelligence deficits in adult men. Journal of the American Medical Association, 274, 1518–1525.
Ridgeway, G., McCaffrey, D., & Morral, A. (2006). twang: Toolkit for weighting and analysis of nonequivalent groups. Software for using matching methods in R. Available at http://cran.r-project.org/src/contrib/Descriptions/twang.html
Robins, J., & Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90, 122–129.
Robins, J., & Rotnitzky, A. (2001). Comment on P. J. Bickel and J. Kwon, “Inference for semiparametric models: Some questions and an answer.” Statistica Sinica, 11(4), 920–936.
Rosenbaum, P. R. (1987a). Model-based direct adjustment. Journal of the American Statistical Association, 82, 387–394.
Rosenbaum, P. R. (1987b). The role of a second control group in an observational study (with discussion). Statistical Science, 2(3), 292–316.
Rosenbaum, P. R. (1991a). A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society, Series B (Methodological), 53(3), 597–610.
Rosenbaum, P. R. (1991b). Sensitivity analysis for matched case-control studies. Biometrics, 47(1), 87–100.
Rosenbaum, P. R. (1999). Choice as an alternative to control in observational studies (with discussion and rejoinder). Statistical Science, 14(3), 259–304.
Rosenbaum, P. R. (2002). Observational studies (2nd ed.). New York: Springer-Verlag.
Rosenbaum, P. R., & Rubin, D. B. (1983a). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society, Series B, 45(2), 212–218.
Rosenbaum, P. R., & Rubin, D. B. (1983b). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.
Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516–524.
Rosenbaum, P. R., & Rubin, D. B. (1985a). The bias due to incomplete matching. Biometrics, 41, 103–116.
Rosenbaum, P. R., & Rubin, D. B. (1985b). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39, 33–38.
Rubin, D. B. (1973a). Matching to remove bias in observational studies. Biometrics, 29, 159–184.
Rubin, D. B. (1973b). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, 29, 185–203.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.
Rubin, D. B. (1976a). Inference and missing data (with discussion). Biometrika, 63, 581–592.
Rubin, D. B. (1976b). Multivariate matching methods that are equal percent bias reducing: I. Some examples. Biometrics, 32, 109–120.
Rubin, D. B. (1976c). Multivariate matching methods that are equal percent bias reducing: II. Maximums on bias reduction. Biometrics, 32, 121–132.
Rubin, D. B. (1977). Assignment to treatment group on the basis of a covariate. Journal of Educational Statistics, 2, 1–26.
Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, 6, 34–58.
Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74, 318–328.
Rubin, D. B. (1980). Discussion of “Randomization analysis of experimental data in the Fisher randomization test” by D. Basu. Journal of the American Statistical Association, 75, 591–593.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: John Wiley.
Rubin, D. B. (1990). Formal modes of statistical inference for causal effects. Journal of Statistical Planning and Inference, 25, 279–292.
Rubin, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127, 757–763.
Rubin, D. B. (2001). Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services & Outcomes Research Methodology, 2, 169–188.
Rubin, D. B. (2004). On principles for modeling propensity scores in medical research. Pharmacoepidemiology and Drug Safety, 13, 855–857.
Rubin, D. B. (2006). Matched sampling for causal inference. Cambridge, UK: Cambridge University Press.
Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine, 26(1), 20–30.
Rubin, D. B., & Stuart, E. A. (2006). Affinely invariant matching methods with discriminant mixtures of proportional ellipsoidally symmetric distributions. The Annals of Statistics, 34(4), 1814–1826.
Rubin, D. B., & Thomas, N. (1992a). Affinely invariant matching methods with ellipsoidal distributions. The Annals of Statistics, 20, 1079–1093.
Rubin, D. B., & Thomas, N. (1992b). Characterizing the effect of matching using linear propensity score methods with normal distributions. Biometrika, 79, 797–809.
Rubin, D. B., & Thomas, N. (1996). Matching using estimated propensity scores, relating theory to practice. Biometrics, 52, 249–264.
Rubin, D. B., & Thomas, N. (2000). Combining propensity score matching with additional adjustments for prognostic covariates. Journal of the American Statistical Association, 95, 573–585.
Sekhon, J. S. (2006). Matching: Multivariate and propensity score matching with balance optimization. Software for using matching methods in R. Available at http://sekhon.berkeley.edu/matching
Smith, H. (1997). Matching with multiple controls to estimate treatment effects in observational studies. Sociological Methodology, 27, 325–353.
Smith, J., & Todd, P. (2005). Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of Econometrics, 125, 305–353.
Snedecor, G. W., & Cochran, W. G. (1980). Statistical methods (7th ed.). Ames: Iowa State University Press.
Sobel, M. E. (2006). What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference. Journal of the American Statistical Association, 101(476), 1398–1407.
Song, J., Belin, T. R., Lee, M. B., Gao, X., & Rotheram-Borus, M. J. (2001). Handling baseline differences and missing items in a longitudinal study of HIV risk among runaway youths. Health Services & Outcomes Research Methodology, 2, 317–329.
Stuart, E. A., & Green, K. M. (in press). Using full matching to estimate causal effects in non-experimental studies: Examining the relationship between adolescent marijuana use and adult outcomes. Developmental Psychology.
Stuart, E. A., & Rubin, D. B. (in press). Matching with multiple control groups with adjustment for group differences. Journal of Educational and Behavioral Statistics.
Winship, C., & Morgan, S. L. (1999). The estimation of causal effects from observational data. Annual Review of Sociology, 25, 659–706.
Zhao, Z. (2004). Using matching to estimate treatment effects: Data requirements, matching metrics, and Monte Carlo evidence. Review of Economics and Statistics, 86(1), 91–107.
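
As a brief illustration of the advice under “If Outcome Data Are Already Available,” the following Python sketch performs one-to-one greedy nearest neighbor matching on an already-estimated propensity score and then checks the 0.5 standard deviation rule of thumb mentioned there. It is a minimal sketch under stated assumptions, not the implementation used by MatchIt, twang, or any other package cited above. The function names, the toy propensity scores, the choice to process treated units from highest to lowest score, and the use of the treated-group standard deviation as the denominator are all assumptions made for illustration.

```python
from statistics import mean, stdev


def greedy_match(treated_ps, control_ps, caliper=None):
    """One-to-one greedy nearest neighbor matching without replacement.

    Returns a list of (treated_index, control_index) pairs. Treated units
    are processed from highest to lowest propensity score, which is an
    assumed ordering chosen for this illustration.
    """
    available = set(range(len(control_ps)))
    order = sorted(range(len(treated_ps)),
                   key=lambda i: treated_ps[i], reverse=True)
    pairs = []
    for i in order:
        if not available:
            break  # control pool exhausted
        # Closest remaining control; ties are broken by the lower index.
        j = min(available,
                key=lambda c: (abs(control_ps[c] - treated_ps[i]), c))
        if caliper is not None and abs(control_ps[j] - treated_ps[i]) > caliper:
            continue  # no acceptable match within the caliper
        pairs.append((i, j))
        available.remove(j)
    return pairs


def standardized_difference(treated_values, control_values):
    """Difference in means, in units of the treated-group standard deviation
    (the choice of denominator is an assumption of this sketch)."""
    return (mean(treated_values) - mean(control_values)) / stdev(treated_values)


# Hypothetical propensity scores, for illustration only.
treated_ps = [0.80, 0.60, 0.48]
control_ps = [0.10, 0.20, 0.45, 0.55, 0.62, 0.90]

# The outcome is never used here: matching is part of the design stage.
pairs = greedy_match(treated_ps, control_ps)
matched_controls = [control_ps[j] for _, j in pairs]

before = standardized_difference(treated_ps, control_ps)
after = standardized_difference(treated_ps, matched_controls)
print("matched pairs:", pairs)
print("standardized difference before:", round(before, 2))
print("standardized difference after:", round(after, 2))
```

Because the outcome plays no part in this step, several candidate matched samples (for example, with and without a caliper) can be compared on this diagnostic and the best-balanced one retained. If the absolute standardized difference stayed above 0.5, the chapter's suggestion of adding subclassification on the matched samples would apply.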