<<

Section on in

Propensity score analysis with hierarchical

Fan Li, Alan M. Zaslavsky, Mary Beth Landrum Department of Health Care Policy, Harvard Medical School 180 Longwood Avenue, Boston, MA 02115

October 29, 2007

Abstract nors et al., 1996; D’Agostino, 1998, and references therein). This approach, which involves comparing subjects weighted Propensity score (Rosenbaum and Rubin, 1983) methods are (or stratified, matched) according to their propensity to re- being increasingly used as a less parametric alternative to tra- ceive treatment (i.e., propensity score), attempts to balance ditional regression methods in medical care and health policy subjects in treatment groups in terms of observed character- research. Data collected in these disciplines are often clus- istics as would occur in a randomized . Propensity tered or hierarchically structured, in the sense that subjects are score methods permit control of all observed fac- grouped together in one or more ways that may be relevant tors that might influence both choice of treatment and outcome to the analysis. However, propensity score was developed using a single composite measure, without requiring specifi- and has been applied in settings with unstructured data. In cation of the relationships between the control variables and this report, we present and compare several propensity-score- outcome. weighted estimators of treatment effect in the context of hier- Propensity score methods were developed and have been archically structured data. For the simplest case without co- applied in settings with unstructured data. However, data col- variates, we show the “double-robustness” of those weighted lected in medical care and health policy studies are typically estimators, that is, when both of the true underlying treatment clustered or hierarchically structured, in the sense that sub- assignment mechanism and the outcome generating mecha- jects are grouped together in one or more ways that may be nism are hierarchically structured, the estimator is consistent relevant to the analysis. For example, subjects maybe grouped as long as the hierarchical structure is taken into account in at by geographical area, treatment center (e.g., hospital or physi- least one of the two steps in the propensity score procedure. cians), or in the example we consider in this paper, health This result holds for any balancing weight. We obtain the ex- plan. Generally, subjects are assigned to clusters by an un- act form of bias when clustering is ignored in both steps. We known mechanism that may be associated with measured sub- apply those methods to study racial disparity in the service of ject characteristics that we are interested in (e.g., race, age, breast cancer screening among elders who participate Medi- clinical characteristics), measured subject characteristics that care health plans. are not of intrinsic interest and are believed to be unrelated to outcomes except through their effects on assignment to clus- KEY WORDS: double robustness, health policy research, hi- ters (e.g., location), and unmeasured subject characteristics erarchical data, propensity score, racial disparity, weighting. (e.g., unmeasured severity of disease, aggressiveness in seek- ing treatment). 1. Introduction When subjects are hierarchically structured, a number of issues appear that are not present with an unstructured collec- Population-based observational studies often are the best tion of subjects. First of all, calculations that methodology for obtaining generalizable results on access to, ignore the hierarchical structure will be inaccurate, leading to patterns of, and outcomes from medical care when large-scale incorrect inferences. A more interesting set of issues arises be- controlled are infeasible. Comparisons between cause there may be both measured and unmeasured factors at groups can be biased, however, when the groups are unbal- the cluster level that create variation among clusters in quality anced with respect to measured and unmeasured confounders. of treatment and hence in outcomes. Hierarchical regression Standard analytic methods adjust for observed differences be- models have been developed to give a more comprehensive de- tween treatment groups by stratifying or patients on scription than non-hierarchical models provide for such data a few observed covariates or with in the (e.g., Gatsonis et al., 1993). Despite the increasing popular- case of many observed confounders. But if treatment groups ity of propensity score analyses and the vast literature regard- differ greatly in observed characteristics, estimates of treat- ing regional and provider variation in medical care and health ment effects from regression models rely on model extrap- policy research (e.g., Nattinger et al., 1992; Farrow et al., olations and the resulting conclusions can be very sensitive 1996), however, to our knowledge, the implications of such to model mis-specification (Rubin, 1979). Propensity score data structures for propensity score analyses have been rarely methods (Rosenbaum and Rubin, 1983, 1984) have been pro- studied. Huang et al. (2005) applied propensity score methods posed as a less parametric alternative to regression adjustment to clustered health service data. But their goal was to rank the and are being increasingly used in health policy studies (Con- performance of multiple health service providers (clusters) in-

2474 Section on Statistics in Epidemiology stead of to estimate an overall treatment effect from data with ject k in cluster h (e.g., a clinical diagnosis); xhk the corre- clustered structure, which is the goal of this paper. Specif- sponding covariates (typically vector-valued, e.g., age, stage ically, we will present several propensity score models ana- of detection, comorbidity scores, etc.); vh the cluster-level co- logues to many of the commonly used regression models for variates (e.g., teaching status or measures of technical capac- clustered data in Section 2; investigate the behavior of those ity of a hospital); zhk the treatment assignment for the subject, estimators, especially the bias when clustering information is zhk ∈ {0, 1}; and ehk the propensity score. ignored in the analysis in Section 3; and apply the methods to study racial disparities in the service of breast cancer screen- 2.1 Step 1. Estimating the propensity score ing among elders in Section 4. Summaries and remarks will be provided in Section 5. Our discussion concerns the case To estimate the propensity score, several where a binary treatment is assigned at individual level. Also, models are available with various treatment of the hierarchical to illustrate the major point and yet without loss of generality, structure. we focus on data with two-level hierarchical structure. 2.1.1 Marginal model 2. Estimators As the name suggests, marginal regression models ignore clus- tering information. A typical marginal propensity score model The class of estimands considered in this paper is generally would be referred as a “treatment effect”,   ehk e e ∆ = E [E(Y |X,Z = 1)] − E [E(Y |X,Z = 0)], (1) log = β xhk + κ vh, (2) x x 1 − ehk i.e., the average difference in outcome between two treatment where ehk = P (zhk = 1 | xhk, vh). This model in fact as- groups that have same distribution of covariates. sumes the treatment assignment mechanism is the same across The propensity score e is defined as the conditional proba- all clusters. In other words, it assumes that two subjects are bility of being assigned to a particular treatment z given mea- exchangeable in terms of treatment propensity if they have the sured covariates x: e(x) = P (z = 1|x). In most observational same vector of covariates, whether or not they come from the studies, the propensity score is not known and thus needed to same cluster. be estimated. Therefore propensity score analysis usually in- This propensity score model can be thought of as a non- volves two steps. The first step is to estimate the propensity parametric alternative to a regression-based adjustment for in- score, typically by a logistic regression. The second step is to dividual and cluster covariates. The analogous marginal re- estimate the treatment effect by incorporating (e.g., by weight- gression model would be, ing or matching) the estimated propensity score. Hierarchi- y y cal structure leads to a of different choices of modeling yhk = γzhk + β xhk + κ vh + hk, (3) in both steps. In this section, we will introduce several most 2 widely used models. where hk ∼ N(0, δ ), and γ is the treatment effect. As Before going into more details, here we make a note regard- model (2), estimates derived from this regression model rely ing the targeted estimand “treatment effect” defined above, on the assumption that the outcome generating mechanism is which is slightly different from those “causal treatment effect” the same across all clusters. defined using the conventional potential outcomes framework. Models (2) and (3) have a manifest similarity of form. A The propensity score originated from and has been widely deeper connection is that the sufficient statistics to estimate used in causal inference, but its use is certainly not restricted the treatment effect that are balanced under propensity score to studying causal effects. For instance, in many health pol- estimator are the same that must be balanced under model (3). icy studies, the major interest is to compare the difference in the average of a feature (e.g., access to care) between two 2.1.2 Pooled within-cluster model groups (e.g., races, social economical status), rather than to make a causal statement. Moreover, the “treatment” is often a A pooled within-cluster model for propensity score conditions non-manipulable variable, e.g., race or gender, which does not on both the covariates and the cluster indicators, gives a well-defined casual effect in the sense of Rubin (1978) ehk e e (more discussion in Section 4). Nevertheless, propensity score log( ) = δh + β xhk, (4) 1 − ehk is still a valid and powerful tool to balance the covariates dis- e e tribution between groups for studies with non-causal purposes. where δh is a cluster-level main effect, δh ∼ N(0, ∞), and Therefore, we avoid the subtle issue of causality throughout ehk = P (zhk = 1 | xhk, h). This model implies the treatment the paper and note the results obtained here are applicable for assignment mechanism differs among clusters, and the differ- e studies with more general (non-causal) purposes. For ease of ence is controlled by a cluster-level main effect δh. Model (4) description, we still refer to our estimands discussed as “treat- involves a more general assumption (weaker) on the treatment ment effects” even though they are not necessarily causal. assignment mechanism than the marginal model (2), because Henceforth, let m denote the total number of clusters; nh the cluster-level covariate vh is a function of the cluster indi- the number of subjects in cluster h; yhk the outcome for sub- cator h.

2475 Section on Statistics in Epidemiology In the above model, if we assume the cluster-specific main 2.2.1 Marginal estimator e e 2 effects δh follow a distribution, δh ∼ N(0, σδ ), then we have a new propensity score model with random effects, Similar to the marginal model in step 1, the marginal estimator ignores clustering. A specific nonparametric estimator is the ehk e e e log( ) = δh + β xhk + κ vh. difference of the weighted overall of the outcome of 1 − ehk two treatment groups, More generally, βe can be allowed to vary across clusters and Pzhk=1 Pzhk=0 follow a distribution. In practice, results from the above ran- h,k whkyhk h,k whkyhk ∆ˆ .,marg = − , (7) dom effects model are usually similar to those from the pooled Pzhk=1 Pzhk=0 h,k whk h,k whk within-cluster model when the number of clusters is big.

A corresponding pooled within-cluster outcome model ad- where the weight whk is a function of the estimated propensity justing for cluster-level main effects and covariates is of the score. The choice of weight will be discussed in Section 2.3. 2 form: Assume yhk is homoscedastic and var(yhk) = σ , then the y y yhk = γzhk + δh + β xhk + hk, (5) large sample of the marginal estimator is, where δy is a cluster-level main effect, δy ∼ N(0, ∞). Under h h 2 ˆ this model, all information is obtained by comparisons within s.,marg = var(∆.,marg) y 2 Pzhk=1 2 2 Pzhk=0 2 clusters, since the δh term absorbs all between-cluster infor- σ w σ w = h,k hk + h,k hk . (8) mation. Pzhk=1 2 Pzhk=0 2 ( h,k whk) ( h,k whk)

2.1.3 Surrogate indicator model In practice σ2 can be estimated from the sample variance of When there are a large number of clusters with large sample yhk. size, the computational task of fitting the pooled within-cluster model can get demanding for standard software. Alternatively, 2.2.2 Clustered estimator P zhk define dh = , the cluster-specific proportion of be- k∈h nh ing treated, we can consider the following propensity score A second estimator is to first obtain the cluster-specific model weighted difference and then calculate the weighted average of these differences based on the sum of weights in each clus- ehk dh e e log( ) = λ log( ) + β xhk + κ vh. (6) ter. That is, for cluster h, 1 − ehk 1 − dh Pzhk=1 w y Pzhk=0 w y In the simplest situation where there is no covariates, ehk = ˆ k∈h hk hk k∈h hk hk ∆h = z =1 − z =0 . dh for any h, k. Therefore, comparing models (4) and (6), the P hk P hk k∈h whk k∈h whk logit of dh maybe expected to be a reasonable surrogate for the cluster indicator in the pooled within-cluster model with The variance of the cluster-specific estimator ∆ˆ h under the the coefficient λ being around 1. The inference is same as independent homoscedastic assumption of yhk within cluster in the marginal model with an additional covariate logit(dh). h is Usually the coefficients of the cluster-level covariates κe are 2 ˆ very small since most of their effects have been absorbed by sh = var(∆h) λ. The surrogate indicator model reduces the m parameters 2 Pzhk=1 2 2 Pzhk=0 2 σh k∈h whk σh k∈h whk (δh’s) in the pooled within-cluster model to a single parameter = + . Pzhk=1 2 Pzhk=0 2 λ, thus greatly reducing the computation required for model ( k∈h whk) ( k∈h whk) fitting. However, this reduction is based on the assumption 2 that logit of the empirical cluster-specific proportion of be- Similarly, σh can be estimated from its empirical counterpart within each cluster. ing treated, logit(dh), is linearly correlated with logit of the true propensity score. When the underlying truth is far from Let wh be a function of the weights in cluster h, e.g., the P this assumption, the surrogate indicator model could perform sum of weights wh = k∈h whk, or the precision of the esti- ˆ −2 poorly. mator ∆h, wh = sh . The overall clustered estimator is then The goodness of fit of these models can be checked by an average of the ∆ˆ h’s weighted by wh, conventional diagnostic procedures (e.g., Rosenbaum and Ru- bin, 1984). For example, one can check both the over- P w ∆ˆ ∆ˆ = h h h . (9) all and within-cluster balance of the distribution of covari- .,clu P h wh ates weighted by the estimated propensity score in different groups. And the overall variance is P (P w )2s2 2 ˆ h k∈h hk h 2.2 Step 2. Estimating the treatment effect s.,clu = var(∆.,clu) = P 2 . (10) ( h,k whk) Common approaches estimate treatment effects using propen- 2 2 sity score involve weighting, matching and stratification. We Standard errors of estimators s.,marg and s.,clu also be ob- will focus on weighting in this report. tained from methods such as the bootstrap.

2476 Section on Statistics in Epidemiology 2.2.3 Doubly-robust estimators averaged over the distribution of covariates in the population where the two treatment groups overlap The weighted can be regarded as a weighted regression without covariates. Therefore in step 2, we can replace the E[YZ{1 − e(X)} − Y (1 − Z)e(X)] nonparametric weighted mean (7) or (9) by a parametric re- = E[{(Y |Z = 1) − (Y |Z = 0)}e(X){1 − e(X)}]. gression (e.g., model (3) or (5)) weighted by the estimated propensity score. And the coefficient of the treatment assign- This population-overlap estimator can be calculated with ac- ment γ is the targeted estimand of treatment effect. This is ceptable variance when the H-T estimator cannot be practi- essentially the class of doubly-robust estimators proposed by cally estimated, because e(x) can approach 0 or 1 for some 1 1 Scharfstein et al. (1999). Doubly-robust estimators allow flex- part of x space such that e(x) or 1−e(x) would become ex- ible model choices in both steps, which can be very beneficial tremely large. In effect the H-T estimator attempts to estimate in applications. These estimators are coined “doubly-robust” a treatment effect for types of cases which are essentially un- in the sense that they are proven to be consistent if one but not represented in one or the other group, while the population- necessarily both of the step 1 and 2 models are correctly spec- overlap weighting focuses on the types of cases with a more ified under the Horvitz-Thompson weight (see below). De- balanced distribution of “treatment”. In addition to its statis- tailed discussion of this property with hierarchical data is pre- tical advantage, the latter analysis may be more scientifically sented in the next section. relevant since it focuses attention on comparison of outcomes among the kinds of cases which both “treatments” are cur- 2.3 Choice of weights rently observed, for example those in clinical equipoise be- tween treatments. We now consider the choice of weights. We call the class of weights which balances the distribution of covariates between 3. Bias of Estimators treatment groups balancing weights. The most widely used balancing weight is the Horvitz-Thompson (inverse probabil- In this section, we investigate the bias of each of the estimators ity) weight proposed in the previous section. We first look at the simplest case with two level-hierarchical structure and no covariates.  1 , for zhk = 1 ehk Let nh1(nh0) denote the number of subjects with z = whk = 1 , for zhk = 0. P 1−ehk 1(z = 0) in cluster h; and n+1 = h nh1, n+0 = P n , n = n + n . h i h h0 ++ +1 +0 XZ Assume the outcome generating mechanism for a continu- The H-T weight is a balancing weight because E e(X) = h i ous outcome follows a random effects model with cluster-level E X(1−Z) 1−e(X) . The H-T estimator compares the expected out- random intercepts and random treatment effects, come of the subjects placed in z = 0 versus that of the subjects placed in z = 1, averaging over the distribution of covariates yhk = δh + γhzhk + αdh + hk, (11) in the combined population. That is, 2 2 where δh ∼ N(0, σδ ), hk ∼ N(0, σ ), α is the effect of the  YZ Y (1 − Z) cluster-specific proportion of being treated dh on the outcome, E − = E[(Y |Z = 1) − (Y |Z = 0)]. 2 e(X) 1 − e(X) and the true treatment effect is γh with γh ∼ N(γ0, σγ ). We first look at the situation where clustering information In fact, the doubly-robust estimators in Scharfstein et al. is ignored in both steps. For the marginal model in step 1, (1999) are restricted to using the H-T weight because of this it is easy to show that the estimated propensity score is the n+1 same for each subject eˆhk = . Consequently, the marginal clear causal interpretation. However, the H-T estimator has n++ been well known to have excessively large variance when estimator is there are subjects with extremely small propensity score. Nev- ˆ ertheless, the same idea is readily extended to any balancing ∆marg,marg Pzhk=1 Pzhk=0 weight, although alternative weights might define different es- h,k yhk h,k yhk timands. For example, we can consider the population-overlap = − n+1 n+0 weight, X nh1 X nh1 nh0  = γh + ( − )δh 1 − ehk, for zhk = 1 n+1 n+1 n+0 whk = h h ehk, for zhk = 0. Pzhk=1 Pzhk=0 hk hk where each subject is weighted by the probability of being +( h,k − h,k ) n n assigned to the other treatment group. It is also a balancing +1 +0 n++ P − nhdh(1 − dh) weight because E[XZ{1 − e(X)}] = E[X(1 − Z)e(X)]. In n+1n+0 h +α theory, the population-overlap weight gives the smallest vari- n++ n+1n+0 ance under a homoscedastic model for Y given X. But it 2 defines a different estimand than the the Horvitz-Thompson P∞ nh1 Assume the common regularity conditions h 2 < ∞ and n+1 weight. Specifically, we call this the population-overlap 2 P∞ nh0 weight because it results in an average treatment effect that is h 2 < ∞ hold, then by the weak law of large num- n+0

2477 Section on Statistics in Epidemiology ∞ n2 bers for the weighted sum of independent and identically- P h h n2 < ∞) as distributed random variables (e.g., Chow and Lai, 1973), ++ P nh1 P γh converges to γ0 as the number of clusters goes to n γ n ,m→∞ h n+1 ˆ h h h h P nh1 nh0 ∆pool,marg = → γ0. infinity, and ( − )δh goes to 0, so does the third h n+1 n+0 n++ term in the above formula. In the fourth term, n++ is n+1n+0 However, the same estimator under the population-overlap in fact the variance of the total number of treated subjects, weight is var(n+1), if all clusters are exchangeable, i.e., if all sub- jects regardless of the clusters follow the same treatment as- ∆ˆ pool,marg signment mechanism, z ∼ Bernoulli( n+1 ). Furthermore, n++ P Pzhk=1 yhk P Pzhk=0 yhk nh0( ) − nh1( ) P h k∈h nh h k∈h nh h nhdh(1 − dh) is the sum of the variance of the number = P P nh1nh0 of treated subjects within each cluster, h var(nh1), if each h nh cluster separately follows a treatment assignment mechanism, P nh1nh0 Pzhk=1 hk Pzhk=0 hk h n (γh + k∈h n − k∈h n ) nh1 h h1 h0 zk∈h ∼ Bernoulli( ). Therefore, bias of the marginal = nh P nh1nh0 estimator with propensity score estimated from the marginal h nh nh,m→∞ model is → γ0.

 P  Even though this estimator is also asymptotically unbiased, its ˆ var(n+1) − h var(nh1) Bias(∆marg,marg) = α . small sample behavior can be quite different from that of the var(n+1) (12) estimator under H-T weight. The size of the bias is controlled by two factors: (1) the ratio of Under the homoscedasticity assumption of outcome, the ˆ ˆ ˆ the variance of the total number of treated subjects under a ho- three H-T estimators ∆pool,marg, ∆marg,clu, and ∆pool,clu mogeneous versus a cluster-heterogeneous treatment assign- that take into account clustering in at least one step have the ment mechanism; and (2) the effect that the cluster-specific same variance, proportion of being treated dh has in the response, i.e., |α|. 2 2 2 X σ nh 1 1 This is intuitive because the first factor measures the varia- s = 2 ( + ). n nh1 nh0 tion in the treatment assignment mechanism among clusters h ++ and the second measures the variation in the outcome generat- Similarly as the discussion on bias, this result is generally not ing mechanism, both of which are ignored in the analysis with applicable for other type of weights. Specifically, the vari- marginal models in both steps. When either but not necessar- ance of ∆ˆ pool,marg is usually larger than that of ∆ˆ marg,clu ily both of the two mechanisms is homogenous across clusters, and ∆ˆ . the marginal estimator, ∆ˆ , is also consistent. How- pool,clu marg,marg When there are no covariates, the surrogate indicator model ever, in reality, it is most likely that both of the mechanisms gives the estimated propensity score as the pooled within- are heterogenous among clusters. cluster model. Thus the results obtained above regarding the We now look at the opposite situation where clustering in- pooled within-cluster model automatically hold for the surro- formation is taken into account in both steps. For the pooled gate indicator model. But this is not the case for the general within-cluster model in step 1, it is easy to show that the es- nh1 situation with covariates. timated propensity score is eˆhk = . Then the clustered nh The proofs are analogous for data with a higher order of weighted estimator is hierarchical levels. For the simplest case without covari- ates, above we have shown the “double-robustness” of those ˆ ∆pool,clu propensity score estimators, that is, when both of the true un- P Pzhk=1 yhk P Pzhk=0 yhk ( k∈h ) ( k∈h ) derlying treatment assignment mechanism and outcome gen- = h nh1 − h nh0 m m erating mechanism are hierarchically structured, the estima- P P Pzhk=1 hk P Pzhk=0 hk tor using a balancing weight is consistent as long as the hi- γh h( k∈h n ) h( k∈h n ) = h + h1 − h0 erarchical structure is taken into account in at least one of m m m n ,m→∞ the two steps in the propensity score procedure. This can h → γ (13) 0 be viewed as both a special case and an extension of the “double-robustness” property of the estimator in Scharfstein which is asymptotically unbiased. The result is free of the et al. (1999). The extension lies in that our conclusion is in- form of weight. Simple calculation shows that the clustered stead free of the form of weight. weighted estimator combining the marginal model in step In the more general cases with covariates, usually there is no 1, ∆ˆ marg,clu, is of exactly the same form as that in (13) closed-form solution to the logistic models for estimating the and thus also unbiased. Furthermore, the marginal estima- propensity score. Consequently, there is no closed-form of the tor with propensity score estimated from the pooled within- bias of those estimators as above. Nevertheless, this situation cluster model, ∆ˆ pool,clu, follows the same form as in (13), can be explored either by large-scale simulations, or by adopt- but only under H-T weight and a balanced design (i.e., each ing a probit (instead of logistic) link for estimating the propen- cluster has same number of subjects). Under H-T weight but sity score. Intuitively, the “double-robustness” property still an unbalanced design, the estimator is also consistent (assume holds. But the bias of a marginal estimator ∆ˆ marg,marg is

2478 Section on Statistics in Epidemiology expected to also be affected by the size of the true treatment whites and blacks. Different models clearly give quite dif- effect γ (negative correlated) and the ratio of between-cluster ferent estimates of propensity score in this data, where the σ2 δ marginal model departs mostly from the other two models. and within-cluster variance g = 2 (positively correlated), σ var(n )−P var(n ) The variance of the estimated propensity score of blacks is in addition to α and +1 h h1 in (12). A com- var(n+1) much bigger than that of whites, regardless of the model. We prehensive discussion is beyond the scope this report and is checked the weighted distributions of covariates. Each model subject to further research. leads to good balance of the overall weighted covariates distri- butions between groups. However, the marginal model in gen- 4. Application eral does poorly in balancing covariates between races within each cluster, while the surrogate indicator model does better, We now apply the above methods to study racial disparity in and the pooled within-cluster model does the best. This sug- health services. Disparity refers to racial differences in care gests that there is important between-cluster variation. attributed to operations of health care system. Our applica- tion concerns the HEDIS R measures of health care provided of Propensity score of Whites Estimated from Marginal Model in Medicare health plans. Each of these measures is an es- 6000 timate of the rate at which a guideline-recommended clini- 2000 0 cal service is provided within the appropriate population. We 0.0 0.2 0.4 0.6 0.8 1.0 obtained individual-level data from the Centers for Medicare estimated propensity score Histogram of Propensity score of Whites Estimated from Pooled Within−Cluster Model and Medicaid Services (CMS) on breast cancer screening of women in Medicare managed care health plans (Schneider et 2000 Frequency al., 2002). Our main interest is the disparity between whites 1000 0 and blacks, so we exclude subjects of other races for whom 0.0 0.2 0.4 0.6 0.8 1.0 estimated propensity score racial identification is unreliable in this dataset. We focus on Histogram of Propensity score of Whites Estimated from Surrogate Indicator Model plans with at least 25 whites and 25 blacks, leaving 64 plans 3000 with a total sample size of 75012. For practical reasons, we 2000 Frequency 1000

drew a random subsample of size 3000 from each of the three 0 0.0 0.2 0.4 0.6 0.8 1.0 large plans with more than 3000 subjects, leaving a total sam- estimated propensity score ple size of 56480. All the covariates considered in the analysis are binary. The Figure 1: Histogram of propensity score estimated from dif- individual-level covariates xhk include two indicators of age ferent models for whites. category (70-80,>80) with reference group being 60-70; eli- gibility for Medicaid (1 yes); neighborhood status indicator (1

Histogram of Propensity score of Blacks Estimated from Marginal Model poor). The plan-level covariates vh include nine geographi- cal code indicators; non/for-profit status (1 for-profit); and the 300 practice model of providers (1 staff-group model; 0 network- Frequency 100 0 independent practice model). The outcome y is a binary vari- 0.0 0.2 0.4 0.6 0.8 1.0 able equal to 1 if the enrollee underwent breast cancer screen- estimated propensity score ing and equal to 0 otherwise, and the “treatment” z here is Histogram of Propensity score of Whites Estimated from Pooled Within−Cluster Model 300

race (1 black, 0 white). We want to estimate the difference in 200 Frequency the proportion of undergoing breast cancer screening between 100 0 whites and blacks. As mentioned before, race is not a valid 0.0 0.2 0.4 0.6 0.8 1.0 estimated propensity score

“treatment” in conventional sense in causal inference, because Histogram of Propensity score of Whites Estimated from Surrogate Indicator Model it is not manipulable (Holland, 1986). However, in this par- 300 ticular application, our goal is not to study the causal pathway 200 Frequency 100

between race and health service utilization, but simply to esti- 0 0.0 0.2 0.4 0.6 0.8 1.0 mate the magnitude of disparity under balanced distributions estimated propensity score of covariates between the two races. Hence, the propensity score in this application is merely an analytical tool to achieve Figure 2: Histogram of propensity score estimated from dif- this goal, and it should not be taken as having the explicit ferent models for blacks. meaning of the probability of being black. We first estimate the propensity score using the three mod- Using the estimated propensity score, we estimate racial els introduced in Section 2.1 with all the above covariates in- disparity in breast cancer screening among the elder women cluded. Details of the fitted models are omitted here since participating Medicare health plans by the estimators pro- the focus is the fitted values (estimated propensity score). All posed in Section 2.2. Although the outcome is binary in this models suggest that living in poor neighborhood, being eligi- case, the probabilities of outcome are in a range where the ble for Medicaid and enrollment in for-profit insurance plan linear probability model is an acceptable fit. Hence, for the are significantly associated with being black race. Figures 1 doubly-robust estimators, we adopt the combinations of the and 2 show of the estimated propensity score for three propensity score models (2), (4) and (6) in step 1 and

2479 Section on Statistics in Epidemiology the two outcome models (3) and (5) in step 2. Table 1 shows among the elders who participate in Medicare health plans, the estimates using the H-T weight. Each row represents one blacks on average have a significantly lower chance to receive step 1 model, and each column represents one type of step breast cancer screening than whites, after adjusting for age, 2 model/estimator. Analogous results using the population- geographical region, social economical status and health plan overlap weight are given in Table 2. characteristics.

weighted doubly-robust 5. Summary and Remarks marginal clustered marginal pooled within marginal -0.050 -0.020 -0.042 -0.021 Since first been proposed twenty-five years ago, propensity (0.008) (0.008) (0.004) (0.004) score methods have gained increasing popularity in observa- pooled -0.024 -0.021 -0.018 -0.022 tional studies in multiple disciplines. One example is health within (0.009) (0.008) (0.004) (0.004) surrogate -0.017 -0.015 -0.012 -0.015 care policy research, where data with hierarchical structure indicator (0.009) (0.008) (0.004) (0.004) are rule rather than exception nowadays. However, despite the wide appreciation of propensity score among both statisti- cians and health policy researchers, there is very limited liter- Table 1: Difference in the proportion of getting breast ature regarding the methodological issues of propensity score cancer screening between blacks and whites using Horvitz- methods in the context of hierarchical data, which motivates Thompson weight our exploration in this paper. Specifically, we present three typical models for estimating propensity score and two types All models show the proportion of receiving breast can- of nonparametric weighted (by estimated propensity score) es- cer screening is significantly lower among blacks than among timators of treatment effect for hierarchically structured data. whites with similar characteristics. The estimates are simi- Furthermore, for the simplest (conceptual) case without co- lar except for the analyses that ignore clustering in both steps, variates, we show the “double-robustness” of those weighted which overestimate the treatment effect. This pattern matches estimators: when both of the true underlying treatment assign- the double-robustness property. Results from the surrogate in- ment mechanism and outcome generating mechanism are hi- dicator model in step 1 are slightly different from the others, erarchically structured, the estimator is consistent as long as suggesting the cluster-specific proportion of being treated dh the hierarchical structure is taken into account in at least one is correlated with certain covariates. The doubly-robust esti- of the two steps in the propensity score procedure. We also mates have smaller standard errors because the extra variation quantify the bias of the estimator when clustering is ignored is explained by covariates in step 2. Not surprisingly, the esti- in both steps. mates using H-T weight have much larger than those We have focused on the case of treatment being assigned using the population-overlap weight. We also notice that the at the individual level in this paper. Treatment assigned at estimates incorporating clustering in step 2 have less variation the cluster level (e.g., hospital, health care provider) is also than those doing so in step 1. This observation suggests, in ap- common in medical care and health policy studies, where sev- plication, modeling the hierarchical structure for the outcome eral new challenging issues can arise. First, the number of generating mechanism leads to more stable estimates, even clusters is often relatively small despite a large total sample though in theory correct model specification in both steps are size. This could lead to poorly estimated propensity scores equivalent in terms of their effect on consistency. A possible with excessively large standard errors. Second, the cluster- explanation is the impact of misspecifying propensity score is level propensity score only balances the cluster-level covari- attenuated through weighting because the ultimate estimand is ates and the average individual-level coviariates. What are the a function of the outcome, rather than of the propensity score. consequences of the possible imbalance in the overall distri- Even though we do not know the underlying truth, the sim- butions of individual-level covariates? This also has a strong ilarity of various estimators suggests our analyses capture the connection to the ecological inference commonly encountered main information regarding disparity in this data. That is, in political science (e.g., King, 1997) where the estimand has an interpretation as an average effect on individual outcomes. Third, all the nonparametric weighted estimators discussed in weighted doubly-robust this paper do not make use of the individual-level covariates, marginal clustered marginal pooled within which often contain crucial information. The doubly-robust marginal -0.043 -0.030 -0.043 -0.032 estimators with flexible regression model choice in the second (0.007) (0.008) (0.004) (0.004) step appear to be preferable in this case. But what specific pooled -0.030 -0.031 -0.031 -0.031 within (0.007) (0.008) (0.004) (0.004) regression model to choose greatly depends on the specific surrogate -0.035 -0.030 -0.031 -0.030 data. Fourth, most interestingly, the foundational stable-unit- indicator (0.007) (0.008) (0.004) (0.004) treatment-value assumption (SUTVA) – “the observation on one unit should be unaffected by the particular assignment of treatments to the other units” (Cox 1958, 2.4) often no longer Table 2: Difference in the proportion of getting breast cancer holds under clustered treatment assignment, especially in the screening between blacks and whites using population-overlap studies with, for instance, behavioral outcomes and infectious weight disease. In that case, correct modeling of the interference

2480 Section on Statistics in Epidemiology among subjects is crucial for valid analysis. Those issues are among a range of open questions remained to be explored on this topic. Further systematic research efforts are desired to shed insight to the methodological issues and to provide guide- lines for practical applications.

REFERENCES

Connors, A., Speroff, T., Dawson, N., and et al. (1996). The effectiveness of right heart catheterization in the initial care of critically ill patients. Journal of the American Medical Association 276, 889-897. Cox. C.P. (1958). The Analysis of Latin Square Designs with Individual Curvatures in one Direction. Journal of the Royal Statistical Society. Series B. 20(1), 193-204. Chow, Y. S. and Lai, T.L. (1973). Limiting behavior of weighted sums of independent random variables. The Annals of Probability 1(5), 810-824. D’Agostino, R. (1998). Tutorial in : propensity score methods for bias reduction in the comparisons of a treatment to a non-randomized control. Statistics in Medicine 17, 2265-2281. Farrow, D., Samet, J. and Hunt, W. (1996). Regional variation in survival following the diagnosis of cancer. Journal of Clinical Epidemiology 49, 843-847. Gatsonis, C., Normand, S., Liu, C., and Morris, C. (1993). Geographic vari- ation of procedure utilization: a hierarchical model approach. Medical Care 31, YS54-YS59. King, G. (1997). A Solution to the Ecological Inference Problem: Recon- structing Individual Behavior from . Princeton University Press. Holland, P.W. (1986). Statistics and causal inference (with discussion). Jour- nal of the American Statistical Association 81, 945-970. Huang, I.C., Frangakis, C.E., Dominici, F., Diette, G. and Wu, A.W. (2005). Application of a propensity score approach for risk adjustment in profil- ing multiple physician groups on asthma care. Health Services Research 40, 253-278. Nattinger, A., Gottilieb, M., Veum, J., and et al. (1992). Geographic variation in the use of breast-conserving treatment for breast cancer. New England Journal of Medicine 326, 1102-1127. Rosenbaum, P.R. and Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41-55. Rosenbaum, P.R. and Rubin, D.B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association 79, 516-524. Rubin, D.B. (1978) for causal effects: the role of random- ization. Annals of Statistics 6, 34-58. Rubin, D.B. (1979). Using multivariate matched and regression ad- justment to control bias in observational studies. Journal of the American Statistical Association 74, 318-324. Scharfstein, D. O., Rotnitzky, A. and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion). Journal of the American Statistical Association 94, 1096- 1146. Schneider E.C., Zaslavsky A.M., Epstein, A.M. (2002). Racial disparities in the quality of care for enrollees in Medicare managed care. Journal of the American Medical Association 287(10), 1288-1294.

2481