Matching on the Estimated Propensity Score


Econometrica, Vol. 84, No. 2 (March, 2016), 781–807

NOTES AND COMMENTS

MATCHING ON THE ESTIMATED PROPENSITY SCORE

By Alberto Abadie and Guido W. Imbens¹

Propensity score matching estimators (Rosenbaum and Rubin (1983)) are widely used in evaluation research to estimate average treatment effects. In this article, we derive the large sample distribution of propensity score matching estimators. Our derivations take into account that the propensity score is itself estimated in a first step, prior to matching. We prove that first step estimation of the propensity score affects the large sample distribution of propensity score matching estimators, and derive adjustments to the large sample variances of propensity score matching estimators of the average treatment effect (ATE) and the average treatment effect on the treated (ATET). The adjustment for the ATE estimator is negative (or zero in some special cases), implying that matching on the estimated propensity score is more efficient than matching on the true propensity score in large samples. However, for the ATET estimator, the sign of the adjustment term depends on the data generating process, and ignoring the estimation error in the propensity score may lead to confidence intervals that are either too large or too small.

KEYWORDS: Matching estimators, propensity score matching, average treatment effects, causal inference, program evaluation.

1. INTRODUCTION

PROPENSITY SCORE MATCHING ESTIMATORS (Rosenbaum and Rubin (1983))² are widely used to estimate treatment effects. Rosenbaum and Rubin (1983) defined the propensity score as the conditional probability of assignment to a treatment given a vector of covariates. Suppose that adjusting for a set of covariates is sufficient to eliminate confounding. The key insight of Rosenbaum and Rubin (1983) is that adjusting only for the propensity score is also sufficient to eliminate confounding. Relative to matching directly on the covariates, propensity score matching has the advantage of reducing the dimensionality of matching to a single dimension. This greatly facilitates the matching process because units with dissimilar covariate values may nevertheless have similar values for their propensity scores.

¹We are grateful to the editor and three referees for helpful comments, to Ben Hansen, Judith Lok, James Robins, Paul Rosenbaum, Donald Rubin, and participants in many seminars for comments and discussions, and to Jann Spiess for expert research assistance. Financial support by the NSF through Grants SES 0820361 and SES 0961707 is gratefully acknowledged.

²Following the terminology in Abadie and Imbens (2006), the term “matching estimator” is reserved in this article for estimators that match each unit (or each unit of some sample subset, e.g., the treated) to a small number of units with similar characteristics in the opposite treatment arm. Thus, our discussion does not refer to regression imputation methods, like the kernel matching method of Heckman, Ichimura, and Todd (1998), which use a large number of matches per unit and nonparametric smoothing techniques to consistently estimate unit-level regression values under counterfactual treatment assignments. See Hahn (1998), Heckman, Ichimura, and Todd (1998), Imbens (2004), and Imbens and Wooldridge (2009) for a discussion of such estimators.

© 2016 The Econometric Society. DOI: 10.3982/ECTA11293
In observational studies, propensity scores are not known, so they have to be estimated prior to matching. In spite of the great popularity that propensity score matching methods have enjoyed since they were proposed by Rosenbaum and Rubin in 1983, their large sample distribution has not yet been derived for the case when the propensity score is estimated in a first step.³ A possible reason for this void in the literature is that matching estimators are non-smooth functionals of the distribution of the matching variables, which makes it difficult to establish an asymptotic approximation to the distribution of matching estimators when a matching variable is estimated in a first step. This has motivated the use of bootstrap standard errors for propensity score matching estimators. However, recently it has been shown that the bootstrap is not, in general, valid for matching estimators (Abadie and Imbens (2008)).⁴

In this article, we derive large sample approximations to the distribution of propensity score matching estimators. Our derivations take into account that the propensity score is itself estimated in a first step. We show that propensity score matching estimators have approximately Normal distributions in large samples. We demonstrate that first step estimation of the propensity score affects the large sample distribution of propensity score matching estimators, and derive adjustments to the large sample variance of propensity score matching estimators that correct for first step estimation of the propensity score. We do this for estimators of the average treatment effect (ATE) and the average treatment effect on the treated (ATET). The adjustment for the ATE estimator is negative (or zero in some special cases), implying that matching on the estimated propensity score is more efficient than matching on the true propensity score in large samples. As a result, treating the estimated propensity score as if it were the true propensity score for estimating the variance of the ATE estimator leads to conservative confidence intervals. However, for the ATET estimator, the sign of the adjustment depends on the data generating process, and ignoring the estimation error in the propensity score may lead to confidence intervals that are either too large or too small.

³Influential papers using matching on the estimated propensity score include Heckman, Ichimura, and Todd (1997), Dehejia and Wahba (1999), and Smith and Todd (2005).

⁴In contexts other than matching, Heckman, Ichimura, and Todd (1998), Hirano, Imbens, and Ridder (2003), Abadie (2005), Wooldridge (2007), and Angrist and Kuersteiner (2011) derived large sample properties of statistics based on a first step estimator of the propensity score. In all these cases, the second step statistics are smooth functionals of the propensity scores and, therefore, standard stochastic expansions for two-step estimators apply (see, e.g., Newey and McFadden (1994)).

2. MATCHING ESTIMATORS

The setup in this article is a standard one in the program evaluation literature, where the focus of the analysis is often the effect of a binary treatment, represented in this paper by the indicator variable W, on some outcome variable, Y. More specifically, W = 1 indicates exposure to the treatment, while W = 0 indicates lack of exposure to the treatment. Following Rubin (1974), we define treatment effects in terms of potential outcomes.
We define Y(1) as the potential outcome under exposure to treatment, and Y(0) as the potential outcome under no exposure to treatment. Our goal is to estimate the average treatment effect,

    τ = E[Y(1) − Y(0)],

where the expectation is taken over the population of interest. Alternatively, the goal may be estimation of the average effect for the treated,

    τ_t = E[Y(1) − Y(0) | W = 1].

Estimation of these average treatment effects is complicated by the fact that for each unit in the population, we observe at most one of the potential outcomes:

    Y = Y(0) if W = 0,
        Y(1) if W = 1.

Let X be a vector of covariates of dimension k. The propensity score is p(X) = Pr(W = 1 | X), and p* = Pr(W = 1) is the probability of being treated. The following assumption is often referred to as “strong ignorability” (Rosenbaum and Rubin (1983)). It means that adjusting for X is sufficient to eliminate all confounding.

ASSUMPTION 1: (i) (Y(1), Y(0)) ⊥⊥ W | X almost surely; (ii) p̲ ≤ p(X) ≤ p̄ almost surely, for some p̲ > 0 and p̄ < 1.

Assumption 1(i) uses the conditional independence notation in Dawid (1979). This assumption is often referred to as “unconfoundedness.” It will hold, for example, if all confounders are included in X, so that after controlling for X, treatment exposure is independent of the potential outcomes. Hahn (1998) derived asymptotic variance bounds and studied asymptotically efficient estimation under Assumption 1(i). Assumption 1(ii) implies that, for almost all values of X, the population includes treated and untreated units. Moreover, Assumption 1(ii) bounds the values of the propensity score away from zero and one. Khan and Tamer (2010) have shown that this condition is necessary for root-N consistent estimation of the average treatment effect.

Let μ(w, x) = E[Y | W = w, X = x] and σ²(w, x) = var(Y | W = w, X = x) be the conditional mean and variance of Y given W = w and X = x. Similarly, let μ̄(w, p) = E[Y | W = w, p(X) = p] and σ̄²(w, p) = var(Y | W = w, p(X) = p) be the conditional mean and variance of Y given W = w and p(X) = p. Under Assumption 1,

    τ = E[μ(1, X) − μ(0, X)]   and   τ_t = E[μ(1, X) − μ(0, X) | W = 1]

(see Rubin (1974)). Therefore, adjusting for differences in the distribution of X between treated and nontreated removes all confounding and, therefore, allows identification of the ATE and ATET. Rosenbaum and Rubin (1983) proved that W and X are independent conditional on the propensity score, p(X), which implies that under Assumption 1:

    τ = E[μ̄(1, p(X)) − μ̄(0, p(X))]   and   τ_t = E[μ̄(1, p(X)) − μ̄(0, p(X)) | W = 1].

In other words, under Assumption 1, adjusting for the propensity score only is enough to remove all confounding. This result motivates the use of propensity score matching estimators. A propensity score matching estimator for the average treatment effect can be defined as

    τ̂* = (1/N) ∑_{i=1}^{N} (2W_i − 1) (Y_i − (1/M) ∑_{j ∈ J_M(i)} Y_j),

where M is a fixed number of matches per unit and J_M(i) is the set of matches for unit i.⁵
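To fix ideas, here is a minimal sketch of the two steps just described: a logit first step for p(X), followed by matching each unit to its M nearest neighbors (with replacement) in the opposite treatment arm on the estimated score. The logit specification, the scikit-learn calls, and the simulated data are illustrative assumptions, not the paper's construction; indeed, the paper's point is that this first step changes the estimator's large sample variance.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import NearestNeighbors

    def psm_ate(Y, W, X, M=1):
        """ATE by matching on an estimated propensity score: each unit is
        matched to its M nearest neighbors (with replacement) in the
        opposite treatment arm, by distance on the estimated score."""
        # First step: estimate p(X) = Pr(W = 1 | X), here by logit.
        pscore = LogisticRegression().fit(X, W).predict_proba(X)[:, 1]
        p = pscore.reshape(-1, 1)
        effects = np.empty(len(Y))
        for w in (0, 1):
            this, other = W == w, W != w
            nn = NearestNeighbors(n_neighbors=M).fit(p[other])
            _, idx = nn.kneighbors(p[this])           # M matches per unit
            imputed = Y[other][idx].mean(axis=1)      # matched-outcome average
            # (2W_i - 1)(Y_i - imputed): the sign flips for untreated units.
            effects[this] = (2 * w - 1) * (Y[this] - imputed)
        return effects.mean()

    rng = np.random.default_rng(0)
    N = 2000
    X = rng.normal(size=(N, 3))
    W = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # treatment assignment
    Y = 1.0 * W + X.sum(axis=1) + rng.normal(size=N)  # true ATE = 1
    print(psm_ate(Y, W, X, M=4))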
Recommended publications
  • Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure
    Journal of Official Statistics, Vol. 17, No. 3, 2001, pp. 407–422. Statistical Matching: A Paradigm for Assessing the Uncertainty in the Procedure. Chris Moriarity and Fritz Scheuren. Statistical matching has been widely used by practitioners without always adequate theoretical underpinnings. The work of Kadane (1978) has been a notable exception and the present article extends his insights. Kadane's 1978 article is reprinted in this JOS issue. Modern computing can make possible, under techniques described here, a real advance in the application of statistical matching. Key words: multivariate normal; complex survey designs; robustness; resampling; variance-covariance structures; and application suggestions.

    1. Introduction. Many government policy questions, whether on the expenditure or tax side, lend themselves to microsimulation modeling, where "what if" analyses of alternative policy options are carried out (e.g., Citro and Hanushek 1991). Often, the starting point for such models, in an attempt to achieve a degree of verisimilitude, is to employ information contained in several survey microdata files. Typically, not all the variables wanted for the modeling have been collected together from a single individual or family. However, the separate survey files may have many demographic and other control variables in common. The idea arose, then, of matching the separate files on these common variables and thus creating a composite file for analysis. "Statistical matching," as the technique began to be called, has been more or less widely practiced since the advent of public use files in the 1960s. Arguably, the desire to employ statistical matching was even an impetus for the release of several of the early public use files, including those involving U.S.
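    To make the mechanics concrete, here is a hedged sketch of the file-fusion step described above: records in one file borrow a target variable from their nearest neighbor, on the common variables, in the other file. The files, variables, and nearest-neighbor rule are hypothetical; assessing the uncertainty of such procedures, the article's actual subject, is not attempted here.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        rng = np.random.default_rng(1)
        # Two hypothetical survey files sharing common covariates Z
        # (say, age and income) but observing different target variables.
        Z_a = rng.normal(size=(500, 2))
        spending = Z_a @ np.array([1.0, 2.0]) + rng.normal(size=500)
        Z_b = rng.normal(size=(800, 2))
        taxes = Z_b @ np.array([0.5, 1.5]) + rng.normal(size=800)

        # For each record in file A, take the closest donor record in file B
        # on the common variables and borrow its target value.
        nn = NearestNeighbors(n_neighbors=1).fit(Z_b)
        _, donor = nn.kneighbors(Z_a)
        composite_taxes = taxes[donor.ravel()]
        # The composite file now pairs (Z_a, spending, composite_taxes).
        print(composite_taxes[:5])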
  • A Machine Learning Approach to Census Record Linking∗
    A Machine Learning Approach to Census Record Linking. James J. Feigenbaum. March 28, 2016.

    Abstract: Thanks to the availability of new historical census sources and advances in record linking technology, economic historians are becoming big data genealogists. Linking individuals over time and between databases has opened up new avenues for research into intergenerational mobility, the long run effects of early life conditions, assimilation, discrimination, and the returns to education. To take advantage of these new research opportunities, scholars need to be able to accurately and efficiently match historical records and produce an unbiased dataset of links for analysis. I detail a standard and transparent census matching technique for constructing linked samples that can be replicated across a variety of cases. The procedure applies insights from machine learning classification and text comparison to record linkage of historical data. My method teaches an algorithm to replicate how a well trained and consistent researcher would create a linked sample across sources. I begin by extracting a subset of possible matches for each record, and then use training data to tune a matching algorithm that attempts to minimize both false positives and false negatives, taking into account the inherent noise in historical records. To make the procedure precise, I trace its application to an example from my own work, linking children from the 1915 Iowa State Census to their adult-selves in the 1940 Federal Census. In addition, I provide guidance on a number of practical questions, including how large the training data needs to be relative to the sample.

    (I thank Christoph Hafemeister, Jamie Lee, Christopher Muller, Martin Rotemberg, and Nicolas Ziebarth for detailed feedback and thoughts on this project, as well as seminar participants at the NBER DAE Summer Institute and the Berkeley Demography Conference on Census Linking.)
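    The two-stage logic of the abstract, extracting candidate pairs and then tuning a classifier on hand-labeled training data, can be sketched as follows. The similarity features, the toy training set, and the field names are hypothetical stand-ins, not Feigenbaum's actual specification.

        import difflib
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def features(a, b):
            """Similarity features for a candidate pair: name-similarity
            ratios plus the absolute difference in year of birth."""
            first = difflib.SequenceMatcher(None, a["first"], b["first"]).ratio()
            last = difflib.SequenceMatcher(None, a["last"], b["last"]).ratio()
            return [first, last, abs(a["yob"] - b["yob"])]

        # Hand-labeled candidate pairs (1 = same person, 0 = different).
        train = [
            ({"first": "john", "last": "smith", "yob": 1900},
             {"first": "john", "last": "smith", "yob": 1901}, 1),
            ({"first": "john", "last": "smith", "yob": 1900},
             {"first": "jon", "last": "smyth", "yob": 1900}, 1),
            ({"first": "john", "last": "smith", "yob": 1900},
             {"first": "james", "last": "smith", "yob": 1912}, 0),
            ({"first": "mary", "last": "jones", "yob": 1895},
             {"first": "martha", "last": "john", "yob": 1890}, 0),
        ]
        F = np.array([features(a, b) for a, b, _ in train])
        y = np.array([label for _, _, label in train])
        clf = LogisticRegression().fit(F, y)

        # Score a new candidate pair; the acceptance threshold trades off
        # false positives against false negatives.
        pair = ({"first": "john", "last": "smith", "yob": 1900},
                {"first": "johm", "last": "smith", "yob": 1900})
        print(clf.predict_proba(np.array([features(*pair)]))[0, 1])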
  • Efficiency of Average Treatment Effect Estimation When the True Propensity Is Parametric
    econometrics (Article). Efficiency of Average Treatment Effect Estimation When the True Propensity Is Parametric. Kyoo il Kim, Department of Economics, Michigan State University, 486 W. Circle Dr., East Lansing, MI 48824, USA; [email protected]; Tel.: +1-517-353-9008. Received: 8 March 2019; Accepted: 28 May 2019; Published: 31 May 2019.

    Abstract: It is well known that efficient estimation of average treatment effects can be obtained by the method of inverse propensity score weighting, using the estimated propensity score, even when the true one is known. When the true propensity score is unknown but parametric, it is conjectured from the literature that we still need nonparametric propensity score estimation to achieve the efficiency. We formalize this argument and further identify the source of the efficiency loss arising from parametric estimation of the propensity score. We also provide an intuition of why this overfitting is necessary. Our finding suggests that, even when we know that the true propensity score belongs to a parametric class, we still need to estimate the propensity score by a nonparametric method in applications. Keywords: average treatment effect; efficiency bound; propensity score; sieve MLE. JEL Classification: C14; C18; C21.

    1. Introduction. Estimating treatment effects of a binary treatment or a policy has been one of the most important topics in evaluation studies. In estimating treatment effects, a subject's selection into a treatment may contaminate the estimate, and two approaches are popularly used in the literature to remove the bias due to this sample selection. One is the regression-based control function method (see, e.g., Rubin (1973); Hahn (1998); and Imbens (2004)) and the other is the matching method (see, e.g., Rubin and Thomas (1996); Heckman et al. (1998); and Abadie and Imbens (2002, 2006)).
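    For reference, a minimal sketch of the inverse propensity score weighting estimator under discussion, with the score estimated by a parametric logit on simulated data. Note that this sketch deliberately omits the nonparametric (sieve) estimation step the article argues is needed to attain the efficiency bound.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def ipw_ate(Y, W, X):
            """IPW estimator of the ATE, weighting each observation by the
            inverse of its estimated propensity score e(X) = Pr(W = 1 | X)."""
            e = LogisticRegression().fit(X, W).predict_proba(X)[:, 1]
            return np.mean(W * Y / e - (1 - W) * Y / (1 - e))

        rng = np.random.default_rng(2)
        N = 5000
        X = rng.normal(size=(N, 2))
        W = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
        Y = 2.0 * W + X[:, 0] + rng.normal(size=N)    # true ATE = 2
        print(ipw_ate(Y, W, X))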
  • Stability and Median Rationalizability for Aggregate Matchings
    Games 2021, 12, 33 (Article). Stability and Median Rationalizability for Aggregate Matchings. Federico Echenique (California Institute of Technology), SangMok Lee (Washington University in St. Louis), Matthew Shum (California Institute of Technology), and M. Bumin Yenmez (Boston College).

    Abstract: We develop the theory of stability for aggregate matchings used in empirical studies and establish fundamental properties of stable matchings including the result that the set of stable matchings is a non-empty, complete, and distributive lattice. Aggregate matchings are relevant as matching data in revealed preference theory. We present a result on rationalizing matching data as the median stable matching. Keywords: aggregate matching; median stable matching; rationalizability; lattice.

    1. Introduction. Following the seminal work of [1], an extensive literature has developed regarding matching markets with non-transferable utility. This literature assumes that there are agent-specific preferences, and studies the existence of stable matchings in which each agent prefers her assigned partner to the outside option of being unmatched, and there are no pairs of agents that would like to match with each other rather than keeping their assigned partners. In this paper, we develop the theory of stability for aggregate matchings, which we
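    For readers new to stability, the following sketch shows the classic deferred-acceptance algorithm for an individual-level two-sided market, whose output is stable in the sense described (no blocking pairs). The preference lists are made up, and the paper's aggregate-matching theory is considerably more general than this special case.

        def deferred_acceptance(prop_prefs, recv_prefs):
            """Gale-Shapley deferred acceptance: proposers propose down their
            preference lists; receivers hold the best offer seen so far.
            The resulting matching is stable."""
            rank = {r: {p: i for i, p in enumerate(prefs)}
                    for r, prefs in recv_prefs.items()}
            free = list(prop_prefs)                   # currently unmatched proposers
            nxt = {p: 0 for p in prop_prefs}          # next preference to try
            match = {}                                # receiver -> proposer
            while free:
                p = free.pop()
                r = prop_prefs[p][nxt[p]]
                nxt[p] += 1
                if r not in match:
                    match[r] = p
                elif rank[r][p] < rank[r][match[r]]:  # r prefers new proposer
                    free.append(match[r])
                    match[r] = p
                else:
                    free.append(p)
            return match

        prop = {"a": ["x", "y"], "b": ["y", "x"]}
        recv = {"x": ["b", "a"], "y": ["a", "b"]}
        print(deferred_acceptance(prop, recv))        # {'y': 'b', 'x': 'a'}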
  • Report on Exact and Statistical Matching Techniques
    Statistical Policy Working Papers are a series of technical documents prepared under the auspices of the Office of Federal Statistical Policy and Standards. These documents are the product of working groups or task forces, as noted in the Preface to each report. These Statistical Policy Working Papers are published for the purpose of encouraging further discussion of the technical issues and to stimulate policy actions which flow from the technical findings and recommendations. Readers of Statistical Policy Working Papers are encouraged to communicate directly with the Office of Federal Statistical Policy and Standards with additional views, suggestions, or technical concerns. Office of Federal Statistical Policy and Standards, Joseph W. Duncan, Director. For sale by the Superintendent of Documents, U.S. Government Printing Office, Washington, D.C. 20402.

    Statistical Policy Working Paper 5: Report on Exact and Statistical Matching Techniques. Prepared by the Subcommittee on Matching Techniques, Federal Committee on Statistical Methodology. U.S. Department of Commerce: Philip M. Klutznick; Courtenay M. Slater, Chief Economist. Office of Federal Statistical Policy and Standards: Joseph W. Duncan, Director; Katherine K. Wallman, Deputy Director, Social Statistics; Gaylord E. Worden, Deputy Director, Economic Statistics; Maria E. Gonzalez, Chairperson, Federal Committee on Statistical Methodology. Issued: June 1980.

    Preface. This working paper was prepared by the Subcommittee on Matching Techniques, Federal Committee on Statistical Methodology. The Subcommittee was chaired by Daniel B. Radner, Office of Research and Statistics, Social Security Administration, Department of Health and Human Services. Members of the Subcommittee include Rich Allen, Economics, Statistics, and Cooperatives Service (USDA); Thomas B.
  • Alternatives to Randomized Control Trials: a Review of Three Quasi-Experimental Designs for Causal Inference
    Actualidades en Psicología, 29(119), 2015, 19–27. ISSN 2215-3535. http://revistas.ucr.ac.cr/index.php/actualidades. DOI: http://dx.doi.org/10.15517/ap.v29i119.18810. Alternatives to Randomized Control Trials: A Review of Three Quasi-experimental Designs for Causal Inference. Pavel Pavolovich Panko, Jacob D. Curtis, Brittany K. Gorrall, Todd Daniel Little. Texas Tech University, United States.

    Abstract: The Randomized Control Trial (RCT) design is typically seen as the gold standard in psychological research. As it is not always possible to conform to RCT specifications, many studies are conducted in the quasi-experimental framework. Although quasi-experimental designs are considered less preferable to RCTs, with guidance they can produce inferences which are just as valid. In this paper, the authors present three quasi-experimental designs which are viable alternatives to RCT designs: Regression Point Displacement (RPD), Regression Discontinuity (RD), and Propensity Score Matching (PSM). Additionally, the authors outline several notable methodological improvements to use with these designs. Keywords: psychometrics, quasi-experimental design, regression point displacement, regression discontinuity, propensity score matching.
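    As a concrete illustration of one of the three designs, here is a hedged sketch of a basic sharp regression discontinuity estimate: local linear fits on each side of the cutoff, differenced at the cutoff. The bandwidth, cutoff, and simulated data are arbitrary choices for the example, not recommendations from the paper.

        import numpy as np

        def rd_estimate(y, running, cutoff=0.0, bandwidth=1.0):
            """Sharp RD: fit a line on each side of the cutoff within the
            bandwidth, and take the jump between the two fits at the cutoff."""
            left = (running < cutoff) & (running > cutoff - bandwidth)
            right = (running >= cutoff) & (running < cutoff + bandwidth)
            coef_l = np.polyfit(running[left], y[left], 1)
            coef_r = np.polyfit(running[right], y[right], 1)
            return np.polyval(coef_r, cutoff) - np.polyval(coef_l, cutoff)

        rng = np.random.default_rng(3)
        r = rng.uniform(-2, 2, size=4000)             # running variable
        y = 0.5 * r + 1.5 * (r >= 0) + rng.normal(scale=0.5, size=4000)
        print(rd_estimate(y, r, bandwidth=0.75))      # true jump = 1.5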
  • Matching Via Dimensionality Reduction for Estimation of Treatment Effects in Digital Marketing Campaigns
    Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16). Matching via Dimensionality Reduction for Estimation of Treatment Effects in Digital Marketing Campaigns. Sheng Li (Northeastern University, Boston, MA, USA), Nikos Vlassis (Adobe Research, San Jose, CA, USA), Jaya Kawale (Adobe Research, San Jose, CA, USA), Yun Fu (Northeastern University, Boston, MA, USA).

    Abstract: A widely used method for estimating counterfactuals and causal treatment effects from observational data is nearest-neighbor matching. This typically involves pairing each treated unit with its nearest-in-covariates control unit, and then estimating an average treatment effect from the set of matched pairs. Although straightforward to implement, this estimator is known to suffer from a bias that increases with the dimensionality of the covariate space, which can be undesirable in applications that involve high-dimensional data. To address this problem, we propose a novel estimator that first projects the data to a number of random linear subspaces.

    Introduction (excerpt): Units in the control group are interpreted as counterfactuals, and the average treatment effect on treated (ATT) is estimated by comparing the outcomes of every matched pair. One of the widely used matching methods is Nearest Neighbor Matching (NNM) [Rubin, 1973a]. For each treated unit, NNM finds its nearest neighbor in the control group to generate a matched pair, and the ATT is then estimated from the set of matched pairs. Different NNM methods are characterized by the choice of a distance measure for determining such a match. Some of the popularly used distance measures are Exact Matching and its variant Coarsened Exact Matching (CEM) [Iacus et al., 2011], Mahalanobis distance matching [Rubin, 1979], and Propensity Score Matching (PSM) [Rosenbaum and Rubin, 1983].
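    The core idea, projecting the covariates to a random low-dimensional subspace before nearest-neighbor matching, can be sketched as follows. A single Gaussian projection and a 1-NN rule are simplifications for illustration; the paper's estimator aggregates over many random projections, which this sketch omits.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def att_after_projection(Y, W, X, dim=5, seed=0):
            """ATT by 1-NN matching on a random linear projection of X."""
            rng = np.random.default_rng(seed)
            P = rng.normal(size=(X.shape[1], dim)) / np.sqrt(dim)
            Z = X @ P                                 # low-dimensional covariates
            nn = NearestNeighbors(n_neighbors=1).fit(Z[W == 0])
            _, idx = nn.kneighbors(Z[W == 1])
            # Treated outcome minus its matched-control counterfactual.
            return np.mean(Y[W == 1] - Y[W == 0][idx.ravel()])

        rng = np.random.default_rng(4)
        N, k = 3000, 50                               # high-dimensional setting
        X = rng.normal(size=(N, k))
        W = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
        Y = 1.0 * W + X[:, 0] + rng.normal(size=N)    # true ATT = 1
        print(att_after_projection(Y, W, X, dim=5))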
  • Frequency Matching Case-Control Techniques: an Epidemiological Perspective
    Frequency Matching case-control techniques: an epidemiological perspective. Authors: Hai Nguyen, MS (first and corresponding author), Research Assistant; Trang Pham, MS; Garth Rauscher, PhD; all of the Division of Epidemiology and Biostatistics, School of Public Health, University of Illinois at Chicago.

    Abstract: In many cohort and case-control studies, subjects are matched with the intent to control confounding and to improve study efficiency by improving precision. An often-used approach is to check that the frequency distributions in each study group are alike. Being alike in the frequency distributions of key variables would provide evidence that the groups are comparable. However, there are instances where the overall distributions are alike but the individual cases vary substantially. While there are no methods that can guarantee comparability, individual case matching has often been used to provide assurance that the groups are comparable. We propose an algorithm, built as a SAS macro, to match controls to each case given a set of matching criteria, including an exact match on site and year, and an exact or 5-year interval match on age. Cases can be matched with a large number of controls. The algorithm was applied to a large 2000–2017 dataset from the Metropolitan Chicago Breast Cancer Registry, with more than 485,000 women obtaining breast screening or diagnostic imaging (more than 15,000 cases and 30,000 controls).
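    A sketch of the matching pass the abstract describes, exact on site and year with age within a fixed window and several controls per case drawn without reuse, written here in Python rather than as the authors' SAS macro; the field names and toy records are illustrative.

        import numpy as np

        def match_controls(cases, controls, n_per_case=2, age_window=5):
            """For each case, draw up to n_per_case controls matched exactly
            on site and year, with age within +/- age_window, without reuse."""
            rng = np.random.default_rng(5)
            used, matches = set(), {}
            for ci, case in enumerate(cases):
                eligible = [j for j, ctl in enumerate(controls)
                            if j not in used
                            and ctl["site"] == case["site"]
                            and ctl["year"] == case["year"]
                            and abs(ctl["age"] - case["age"]) <= age_window]
                take = min(n_per_case, len(eligible))
                picked = list(rng.choice(eligible, size=take, replace=False)) if take else []
                used.update(picked)
                matches[ci] = picked
            return matches

        cases = [{"site": "A", "year": 2010, "age": 52},
                 {"site": "B", "year": 2012, "age": 60}]
        controls = [{"site": "A", "year": 2010, "age": 50},
                    {"site": "A", "year": 2010, "age": 55},
                    {"site": "B", "year": 2012, "age": 63},
                    {"site": "B", "year": 2011, "age": 60}]
        print(match_controls(cases, controls))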
  • Package 'Matching'
    Package ‘Matching’. April 14, 2021. Version: 4.9-9. Date: 2021-03-15. Title: Multivariate and Propensity Score Matching with Balance Optimization. Author: Jasjeet Singh Sekhon <[email protected]>. Maintainer: Jasjeet Singh Sekhon <[email protected]>. Description: Provides functions for multivariate and propensity score matching and for finding optimal balance based on a genetic search algorithm. A variety of univariate and multivariate metrics to determine if balance has been obtained are also provided. For details, see the paper by Jasjeet Sekhon (2007, <doi:10.18637/jss.v042.i07>). Depends: R (>= 2.6.0), MASS (>= 7.2-1), graphics, grDevices, stats. Suggests: parallel, rgenoud (>= 2.12), rbounds. License: GPL-3. URL: http://sekhon.berkeley.edu/matching/. NeedsCompilation: yes. RoxygenNote: 7.1.1. Repository: CRAN. Date/Publication: 2021-04-13 22:00:15 UTC.

    R topics documented: balanceUV, GenMatch, GerberGreenImai, ks.boot, lalonde, Match, MatchBalance, Matchby, qqstats, summary.balanceUV, summary.ks.boot, summary.Match, summary.Matchby.

    balanceUV: Univariate Balance Tests. Description: This function provides a number of univariate balance metrics. Generally, users should call MatchBalance and not this function directly.

    Usage:

        balanceUV(Tr, Co, weights = rep(1, length(Co)), exact = FALSE, ks = FALSE,
                  nboots = 1000, paired = TRUE, match = FALSE,
                  weights.Tr = rep(1, length(Tr)), weights.Co = rep(1, length(Co)),
                  estimand = "ATT")

    Arguments: Tr, a vector containing the treatment observations. Co, a vector containing the control observations. weights, a vector containing the observation specific weights; only use this option when the treatment and control observations are paired (as they are after matching).
  • A Comparison of Different Methods to Handle Missing Data in the Context of Propensity Score Analysis
    European Journal of Epidemiology, https://doi.org/10.1007/s10654-018-0447-z. METHODS. A comparison of different methods to handle missing data in the context of propensity score analysis. Jungyeon Choi, Olaf M. Dekkers, Saskia le Cessie. Received: 21 May 2018 / Accepted: 25 September 2018. © The Author(s) 2018.

    Abstract: Propensity score analysis is a popular method to control for confounding in observational studies. A challenge in propensity methods is missing values in confounders. Several strategies for handling missing values exist, but guidance in choosing the best method is needed. In this simulation study, we compared four strategies of handling missing covariate values in propensity matching and propensity weighting. These methods include: complete case analysis, the missing indicator method, multiple imputation, and combining multiple imputation with the missing indicator method. Concurrently, we aimed to provide guidance in choosing the optimal strategy. Simulated scenarios varied regarding missing mechanism and the presence of effect modification or unmeasured confounding. Additionally, we demonstrated how missingness graphs help clarify the missing structure. When no effect modification existed, complete case analysis yielded valid causal treatment effects even when data were missing not at random. In some situations, complete case analysis was also able to partially correct for unmeasured confounding. Multiple imputation worked well if the data were missing (completely) at random, and if the imputation model was correctly specified. In the presence of effect modification, more complex imputation models than default options of commonly used statistical software were required. Multiple imputation may fail when data are missing not at random.
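    As an illustration of one of the four strategies compared, here is a minimal sketch of the missing indicator method applied before propensity score estimation: missing values are zero-filled and a dummy per covariate flags where filling occurred. The data and the logit model are illustrative assumptions.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def missing_indicator(X):
            """Zero-fill NaNs and append one dummy column per covariate
            flagging where the value was missing."""
            miss = np.isnan(X)
            return np.hstack([np.where(miss, 0.0, X), miss.astype(float)])

        rng = np.random.default_rng(6)
        N = 1000
        X = rng.normal(size=(N, 2))
        W = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
        X[rng.random((N, 2)) < 0.2] = np.nan          # 20% of entries missing
        D = missing_indicator(X)
        pscore = LogisticRegression().fit(D, W).predict_proba(D)[:, 1]
        print(pscore[:5])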
  • Week 10: Causality with Measured Confounding
    Week 10: Causality with Measured Confounding. Brandon Stewart, Princeton. November 28 and 30, 2016. (These slides are heavily influenced by Matt Blackwell, Jens Hainmueller, Erin Hartman, Kosuke Imai and Gary King.)

    Where We've Been and Where We're Going... Last week: regression diagnostics. This week: Monday, the experimental ideal and identification with measured confounding; Wednesday, regression estimation. Next week: identification with unmeasured confounding; instrumental variables. Long run: causality with measured confounding → unmeasured confounding → repeated data. Questions?

    Outline: 1. The Experimental Ideal; 2. Assumption of No Unmeasured Confounding; 3. Fun With Censorship; 4. Regression Estimators; 5. Agnostic Regression; 6. Regression and Causality; 7. Regression Under Heterogeneous Effects; 8. Fun with Visualization, Replication and the NYT; 9. Appendix (Subclassification; Identification under Random Assignment; Estimation Under Random Assignment; Blocking).

    Lancet 2001: negative correlation between coronary heart disease mortality and level of vitamin C in bloodstream (controlling for age, gender, blood pressure, diabetes, and smoking). Lancet 2002: no effect of vitamin C on mortality in controlled placebo trial (controlling for nothing). Lancet 2003: comparing among individuals with the same age, gender, blood pressure, diabetes, and smoking, those with higher vitamin C levels have lower levels of obesity, lower levels of alcohol consumption, are less likely to grow up in working class, etc.
  • STATS 361: Causal Inference
    STATS 361: Causal Inference. Stefan Wager, Stanford University, Spring 2020.

    Contents: 1. Randomized Controlled Trials; 2. Unconfoundedness and the Propensity Score; 3. Efficient Treatment Effect Estimation via Augmented IPW; 4. Estimating Treatment Heterogeneity; 5. Regression Discontinuity Designs; 6. Finite Sample Inference in RDDs; 7. Balancing Estimators; 8. Methods for Panel Data; 9. Instrumental Variables Regression; 10. Local Average Treatment Effects; 11. Policy Learning; 12. Evaluating Dynamic Policies; 13. Structural Equation Modeling; 14. Adaptive Experiments.

    Lecture 1: Randomized Controlled Trials. Randomized controlled trials (RCTs) form the foundation of statistical causal inference. When available, evidence drawn from RCTs is often considered gold-standard statistical evidence; and even when RCTs cannot be run for ethical or practical reasons, the quality of observational studies is often assessed in terms of how well the observational study approximates an RCT. Today's lecture is about estimation of average treatment effects in RCTs in terms of the potential outcomes model, and discusses the role of regression adjustments for causal effect estimation. The average treatment effect is identified entirely via randomization (or, by design of the experiment). Regression adjustments may be used to decrease variance, but regression modeling plays no role in defining the average treatment effect.

    The average treatment effect. We define the causal effect of a treatment via potential outcomes. For a binary treatment w ∈ {0, 1}, we define potential outcomes Y_i(1) and Y_i(0) corresponding to the outcome the i-th subject would have experienced had they respectively received the treatment or not.
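    A small sketch of the two estimators this lecture contrasts, on simulated data: the difference in means, which is unbiased under randomization alone, and an OLS regression adjustment, which targets the same ATE but typically with smaller variance.

        import numpy as np

        rng = np.random.default_rng(7)
        N = 10000
        X = rng.normal(size=N)                        # pre-treatment covariate
        W = rng.binomial(1, 0.5, size=N)              # randomized treatment
        Y = 1.0 * W + 2.0 * X + rng.normal(size=N)    # true ATE = 1

        # Difference in means: unbiased by randomization alone.
        diff_means = Y[W == 1].mean() - Y[W == 0].mean()

        # Regression adjustment: OLS of Y on (1, W, X); the coefficient on W
        # still targets the ATE but typically has smaller variance.
        design = np.column_stack([np.ones(N), W, X])
        beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
        print(diff_means, beta[1])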