Propensity Score Analysis with Hierarchical Data

Fan Li Alan Zaslavsky Mary Beth Landrum

Department of Health Care Policy Harvard Medical School

May 19, 2008 Introduction

I Population-based observational studies are increasingly important sources for estimating treatment effects.

I Proper adjustment for differences between treatment groups is crucial to valid comparison and causal inference.

I Regression has long been the standard method.

I Propensity score (Rosenbaum and Rubin, 1983) is a robust alternative to regression for adjusting for observed differences. Hierarchically structured data

I Propensity score has been developed and applied in cross-sectional settings with unstructured data.

I Data in medical care and health policy research are often hierarchically structured.

I Subjects are grouped in natural clusters, e.g., geographical area, hospitals, health service provider, etc.

I Signiﬁcant within- and between-cluster variations are often the case. Hierarchically structured data

I Ignoring cluster structure often leads to invalid inference.

I Standard errors could be underestimated. I Cluster level effect could be confounded with individual level effect.

I Hierarchical regression models provide a uniﬁed framework to study clustered data.

I Propensity score methods for hierarchical data have been less explored. Propensity score

I Propensity score: e(x) = P(z = 1|x).

I Balancing on propensity score also balances the covariates of different treatment groups: z ⊥ x|e(x).

I Two steps procedure.

I Step 1: estimate the propensity score, e.g., by logistic regression. I Step 2: estimate the treatment effect by incorporating (e.g., weighting, stratiﬁcation) the estimated propensity score.

I We will introduce and compare several possible estimators of treatment effect using propensity score in context of hierarchical data.

I We will investigate the large sample behavior of each estimator. Notation

I h cluster; k individual.

I m no. of clusters; nh no. of subjects in cluster h.

I zhk binary treatment assignment - individual level.

I xhk individual level covariates; vh cluster level covariates.

I ehk propensity score.

I yhk outcome.

I Estimand: “treatment effect” ∆ = Ex [E(Y |X, Z = 1)] − Ex [E(Y |X, Z = 0)].

I Note: ∆ does not necessarily have a causal interpretation, it is the difference of the average of a outcome between two populations controlling for covariates. Step 1: Marginal model

I Marginal analysis ignores clustering.

I Marginal propensity score model ehk e e log = β xhk + κ vh, 1 − ehk

where ehk = P(zhk = 1 | xhk , vh).

I If treatment assignment mechanism (TAM) follows above

(xhk , vh) ⊥ zhk | ehk . Step 1: Pooled within-cluster model

I Pooled within-cluster model for propensity score (ehk = P(zhk = 1 | xhk , h))

ehk e e log( ) = δh + β xhk , 1 − ehk

e e where δh is a cluster-level main effect, δh ∼ N(0, ∞). I General (weaker) assumption of TAM than marginal model:

(xhk , h) ⊥ zhk | ehk .

e 2 I Assuming δh ∼ N(0, σδ ) gives a similar random effects model. Step 1: Surrogate indicator model

P h zhk I Deﬁne d = = cluster-speciﬁc proportion of being h nh treated.

I Propensity score model

ehk dh e e log( ) = λ log( ) + β xhk + κ vh. 1 − ehk 1 − dh

I logit(dh) is “surrogate” for the cluster indicator with the coefﬁcient being around 1.

I Analytic model is same as the marginal analysis with proportion treated dh as additional variable.

I Greatly reduce computation, but based on strong linear assumption. Step 2: Estimate “treatment effect”

Estimate treatment effect using propensity score.

I Weighting - weight as function of propensity score.

I Stratiﬁcation.

I Matching.

I Regression using propensity score as a predictor. Marginal weighted estimator - ignore cluster structure

I wh1(wh0: sum of whk with z = 1(z = 0) in cluster h. P P I w1 = h wh1, w0 = h wh0, w = w1 + w0. I Marginal weighted estimator - difference of weighted mean

zhk =1 zhk =0 ˆ X whk X whk ∆.,marg = yhk − yhk . w1 w0 h,k h,k

I Large sample variance under homoscedasticity of yhk 2 ˆ s.,marg = var(∆.,marg) z =1 z =0 Xhk w 2 Xhk w 2 = σ2( hk + hk ). w 2 w 2 h,k 1 h,k 0

2 I In practice, σ estimated from sample variance of yhk . Clustered weighted estimator

I Cluster-speciﬁc weighted estimator

zhk =1 zhk =0 ˆ X whk X whk ∆h = yhk − yhk . wh1 wh0 k∈h k∈h

I The overall clustered estimator X w ∆ˆ = h ∆ˆ . .,clu w h h Clustered weighted estimator

ˆ I Variance of ∆h under within-cluster homoscedasticity

zhk =1 2 zhk =0 2 2 ˆ 2 X whk X whk s = var(∆h) = σ ( + ). h h w 2 w 2 k∈h h1 k∈h h0

I Overall variance

X w 2 s2 = var(∆ˆ ) = h s2. .,clu .,clu w 2 h h

I Standard error can also be obtained using bootstrap. Doubly-robust estimators (Scharfstein et al., 1999)

I Weighted mean can be viewed as a weighted regression without covariates.

I In step 2, replace the weighted mean by a weighted regression.

I Estimator is consistent if either or both of step 1 and 2 models are correctly speciﬁed.

I Numerous combination of regression models in two steps. Choice of weight

I Horvitz-Thompson (inverse probability) weight

( 1 , for zhk = 1 w = ehk hk 1 , for z = 0. 1−ehk hk

I Balance covariates distribution between two groups: XZ X(1 − Z ) E = E . e(X) 1 − e(X)

I H-T estimator compares the counterfactual scenario: all subjects placed in trt=0 vs. all subjects placed in trt=1. YZ Y (1 − Z) E − = E[(Y |Z = 1) − (Y |Z = 0)]. e(X) 1 − e(X)

I H-T has large variance if e(X) approaches 0 or 1. Choice of weight

I Population-overlap weight 1 − ehk , for zhk = 1 whk = ehk , for zhk = 0.

I Each subject is weighted by the probability of being assigned to the other trt group.

I Balance covariates distribution between two groups:

E{XZ[1 − e(X)]} = E[X(1 − Z )e(X)].

I Small variance, different estimand.

E{YZ[1 − e(X)] − Y (1 − Z )e(X)} = E{[(Y |Z = 1) − (Y |Z = 0)]e(X)[1 − e(X)]}. Bias of Estimators

I Focus on the simplest case with two-level hierarchical structure and no covariates.

I nh1(nh0): no. of subjects with z = 1(z = 0) in cluster h. P P I n1 = h nh1, n0 = h nh0, n = n1 + n0. I Assume outcome generating mechanism is:

yhk = δh + γhzhk + αdh + hk , (1)

2 2 where δh ∼ N(0, σδ ), hk ∼ N(0, σ ), and the true treatment 2 effect: γh ∼ N(γ0, σγ). Bias of Marginal Estimator

ˆ n1 I For marginal model in step 1, ehk = n , ∀h, k. I The marginal estimator is ˆ ∆marg,marg

zhk =1 zhk =0 X yhk X yhk = − n1 n0 h,k h,k

zhk =1 zhk =0 X nh1 X nh1 nh0 X hk X hk = γh + ( − )δh + ( − ) n1 n1 n0 n1 n0 h h h,k h,k n P − nhdh(1 − dh) n1n0 h +α n n1n0 Bias of Marginal Estimator

I By WLLN of weighted sum of i.i.d. random variables 2 P∞ nh1 (assuming h 2 < ∞): n1

X nh1 nh,m→∞ γh → γ0. n1 h

I Similarly the middle two parts go to 0 as nh, m → ∞ . n I = var(n ): variance of total no. of treated, if all n1n0 1 n1 clusters follow the same TAM, z ∼ Bernoulli( n ). P P I h nhdh(1 − dh) = h var(nh1): sum of variance of no. of treated within each cluster, if each cluster follows a nh1 separate TAM: z ∈ ∼ Bernoulli( ). k h nh Bias of Marginal Estimator

I Exact form of bias P var(n1) − h var(nh1) Bias(∆ˆ marg,marg) = α . (2) var(n1)

I Controlled by two factors: (1) variance ratio - treatment assignment mechanism; (2) |α| - outcome generating mechanism.

I Both are ignored by the marginal estimator ∆ˆ marg,marg. Bias of Clustered Estimator

nh1 I For pooled within-cluster model in step 1, eˆ = , k ∈ h. hk nh I The clustered estimator with p.s. estimated from pooled ˆ within-cluster model ∆pool,clu is consistent ˆ ∆pool,clu P (Pzhk =1 yhk ) P (Pzhk =0 yhk ) = h k∈h nh1 − h k∈h nh0 m m P Pzhk =1 hk P Pzhk =0 hk P γ ( ) ( ) = h h + h k∈h nh1 − h k∈h nh0 m m m nh,m→∞ → γ0 (3)

I This result is free of type of weights. Bias of Clustered Estimator

I Clustered estimator with p.s. estimated from marginal ˆ model, ∆marg,clu, exactly follows (3), thus consistent.

I Marginal estimator with p.s. estimated from pooled ˆ within-cluster model, ∆pool,marg, also consistent.

I But different small sample behavior between H-T and population-overlap weights. Extensions

I Without covariates, surrogate indicator model gives the estimated p.s. as pooled within-cluster model.

I Above results regarding pooled within-cluster model automatically hold for surrogate indicator model.

I Proofs are analogous for data with higher order of hierarchical level. Double-robustness

I For the simplest case without covariates, we show “double-robustness” of the p.s. estimators:

I When both of the true underlying treatment assignment mechanism and outcome generating mechanism are hierarchically structured: I Estimators using a balancing weight are consistent as if hierarchical structure is taken into account in at least one of the two steps in the p.s. procedure.

I A special case of Scharfstein et al. (1999), but free of form of weights. Cases with covariates

I No closed-form solution to p.s. models, thus no closed-form of the bias of those estimators.

I Can be explored by (1) large-scale simulations; or (2) adopting a probit (instead of logistic) link for estimating p.s.

I Intuitively, “double-robustness” property still holds.

I Bias of ∆ˆ marg,marg is affected by: P var(n1)− h var(nh1) I α and in (2); var(n1) I Size of true trt effect γ (negative correlated); I Ratio of between-cluster and within-cluster variance, 2 σδ g = 2 (positively correlated). σ Racial disparity data

I Disparity: racial differences in care attributed to operations of health care system.

I Breast cancer screening data are collected from health insurance plans.

I Focus on the plans with at least 25 whites and 25 blacks: 64 plans with a total sample size of 75012.

I Subsample 3000 subjects from large (>3000) clusters to restrict impact of extremely large clusters, resulting sample size 56480. Racial disparity data

I Cluster level covariates vh: geographical code, non/for-proﬁt status, practice model.

I Individual level covariates xhk : age category, eligibility for medicaid, poor neighborhood.

I “Treatment” variable zhk : black race (1=black, 0=white).

I Not strictly causal. Compare groups with balanced covariates.

I Outcome yhk : receive screening for breast cancer or not.

I Research aim: investigate racial disparity in breast cancer screening. Estimated propensity score Estimated propensity score

I Different propensity score models give quite different estimates.

I Each method leads to good overall covariates balance between groups in this data.

I Marginal analysis does not lead to balance in covariates in each cluster, surrogate indicator analysis does better, pooled- within the best. Analysis results: racial disparity estimated from Horvitz-Thompson weight

weighted doubly-robust regression pooled clustered marginal pooled-within marginal -0.050 -0.020 -0.042 -0.021 -0.044 (0.008) (0.008) (0.004) (0.004) (0.007) pooled- -0.024 -0.021 -0.018 -0.022 -0.032 within (0.009) (0.008) (0.004) (0.004) (0.007) surrogate -0.017 -0.015 -0.012 -0.015 -0.014 indicator (0.009) (0.008) (0.004) (0.004) (0.007) Analysis results: racial disparity estimated from population-overlap weight

weighted doubly-robust regression pooled clustered marginal pooled-within marginal -0.043 -0.030 -0.043 -0.032 -0.044 (0.007) (0.008) (0.004) (0.004) (0.007) pooled- -0.030 -0.031 -0.031 -0.031 -0.032 within (0.007) (0.008) (0.004) (0.004) (0.007) surrogate -0.035 -0.030 -0.031 -0.030 -0.014 indicator (0.007) (0.008) (0.004) (0.004) (0.007) Diagnostics

I Check the balance of weighted covariates between treatment groups. Each method leads to good balance in this data.

I Quantiles table. Remarks on results

I Ignoring cluster structure in both steps gives results greatly defer from others.

I Results from surrogate indicator analysis are different from others, suggesting Portion treated is correlated with covariates.

I Taking into account cluster structure in at least one of the two steps leads to similar results - “doubly-robustness”.

I Doubly-robust estimates have smaller s.e., extra variation is explained by covariates in step 2.

I Incorporating cluster structure in step 2 is preferable to step 1.

I Between-cluster variation is large in breast cancer data.

I Standard errors obtained from bootstrap are much larger than those from analytic formula. Summary

I We introduce and compare several possible propensity score analyses for hierarchical data.

I We show “double-robustness” property of propensity score weighted estimators: cluster structure must be taken into account in at least one of the two steps.

I We obtain the analytic form of bias of the marginal estimator.

I Case by case. In practice, total number of clusters, size of each cluster, within- and between- cluster variations can greatly affect the conclusion.