Propensity Score Analysis with Hierarchical Data

Section on Statistics in Epidemiology Propensity score analysis with hierarchical data Fan Li, Alan M. Zaslavsky, Mary Beth Landrum Department of Health Care Policy, Harvard Medical School 180 Longwood Avenue, Boston, MA 02115 October 29, 2007 Abstract nors et al., 1996; D’Agostino, 1998, and references therein). This approach, which involves comparing subjects weighted Propensity score (Rosenbaum and Rubin, 1983) methods are (or stratified, matched) according to their propensity to re- being increasingly used as a less parametric alternative to tra- ceive treatment (i.e., propensity score), attempts to balance ditional regression methods in medical care and health policy subjects in treatment groups in terms of observed character- research. Data collected in these disciplines are often clus- istics as would occur in a randomized experiment. Propensity tered or hierarchically structured, in the sense that subjects are score methods permit control of all observed confounding fac- grouped together in one or more ways that may be relevant tors that might influence both choice of treatment and outcome to the analysis. However, propensity score was developed using a single composite measure, without requiring specifi- and has been applied in settings with unstructured data. In cation of the relationships between the control variables and this report, we present and compare several propensity-score- outcome. weighted estimators of treatment effect in the context of hier- Propensity score methods were developed and have been archically structured data. For the simplest case without co- applied in settings with unstructured data. However, data col- variates, we show the “double-robustness” of those weighted lected in medical care and health policy studies are typically estimators, that is, when both of the true underlying treatment clustered or hierarchically structured, in the sense that sub- assignment mechanism and the outcome generating mecha- jects are grouped together in one or more ways that may be nism are hierarchically structured, the estimator is consistent relevant to the analysis. For example, subjects maybe grouped as long as the hierarchical structure is taken into account in at by geographical area, treatment center (e.g., hospital or physi- least one of the two steps in the propensity score procedure. cians), or in the example we consider in this paper, health This result holds for any balancing weight. We obtain the ex- plan. Generally, subjects are assigned to clusters by an un- act form of bias when clustering is ignored in both steps. We known mechanism that may be associated with measured sub- apply those methods to study racial disparity in the service of ject characteristics that we are interested in (e.g., race, age, breast cancer screening among elders who participate Medi- clinical characteristics), measured subject characteristics that care health plans. are not of intrinsic interest and are believed to be unrelated to outcomes except through their effects on assignment to clus- KEY WORDS: double robustness, health policy research, hi- ters (e.g., location), and unmeasured subject characteristics erarchical data, propensity score, racial disparity, weighting. (e.g., unmeasured severity of disease, aggressiveness in seek- ing treatment). 1. Introduction When subjects are hierarchically structured, a number of issues appear that are not present with an unstructured collec- Population-based observational studies often are the best tion of subjects. First of all, standard error calculations that methodology for obtaining generalizable results on access to, ignore the hierarchical structure will be inaccurate, leading to patterns of, and outcomes from medical care when large-scale incorrect inferences. A more interesting set of issues arises be- controlled experiments are infeasible. Comparisons between cause there may be both measured and unmeasured factors at groups can be biased, however, when the groups are unbal- the cluster level that create variation among clusters in quality anced with respect to measured and unmeasured confounders. of treatment and hence in outcomes. Hierarchical regression Standard analytic methods adjust for observed differences be- models have been developed to give a more comprehensive de- tween treatment groups by stratifying or matching patients on scription than non-hierarchical models provide for such data a few observed covariates or with regression analysis in the (e.g., Gatsonis et al., 1993). Despite the increasing popular- case of many observed confounders. But if treatment groups ity of propensity score analyses and the vast literature regard- differ greatly in observed characteristics, estimates of treat- ing regional and provider variation in medical care and health ment effects from regression models rely on model extrap- policy research (e.g., Nattinger et al., 1992; Farrow et al., olations and the resulting conclusions can be very sensitive 1996), however, to our knowledge, the implications of such to model mis-specification (Rubin, 1979). Propensity score data structures for propensity score analyses have been rarely methods (Rosenbaum and Rubin, 1983, 1984) have been pro- studied. Huang et al. (2005) applied propensity score methods posed as a less parametric alternative to regression adjustment to clustered health service data. But their goal was to rank the and are being increasingly used in health policy studies (Con- performance of multiple health service providers (clusters) in- 2474 Section on Statistics in Epidemiology stead of to estimate an overall treatment effect from data with ject k in cluster h (e.g., a clinical diagnosis); xhk the corre- clustered structure, which is the goal of this paper. Specif- sponding covariates (typically vector-valued, e.g., age, stage ically, we will present several propensity score models ana- of detection, comorbidity scores, etc.); vh the cluster-level co- logues to many of the commonly used regression models for variates (e.g., teaching status or measures of technical capac- clustered data in Section 2; investigate the behavior of those ity of a hospital); zhk the treatment assignment for the subject, estimators, especially the bias when clustering information is zhk 2 f0; 1g; and ehk the propensity score. ignored in the analysis in Section 3; and apply the methods to study racial disparities in the service of breast cancer screen- 2.1 Step 1. Estimating the propensity score ing among elders in Section 4. Summaries and remarks will be provided in Section 5. Our discussion concerns the case To estimate the propensity score, several logistic regression where a binary treatment is assigned at individual level. Also, models are available with various treatment of the hierarchical to illustrate the major point and yet without loss of generality, structure. we focus on data with two-level hierarchical structure. 2.1.1 Marginal model 2. Estimators As the name suggests, marginal regression models ignore clustering information. A typical marginal propensity score model The class of estimands considered in this paper is generally would be referred as a “treatment effect”, ehk e e ∆ = E [E(Y jX; Z = 1)] − E [E(Y jX; Z = 0)]; (1) log = β xhk + κ vh; (2) x x 1 − ehk i.e., the average difference in outcome between two treatment where ehk = P (zhk = 1 j xhk; vh). This model in fact as- groups that have same distribution of covariates. sumes the treatment assignment mechanism is the same across The propensity score e is defined as the conditional proba- all clusters. In other words, it assumes that two subjects are bility of being assigned to a particular treatment z given mea- exchangeable in terms of treatment propensity if they have the sured covariates x: e(x) = P (z = 1jx). In most observational same vector of covariates, whether or not they come from the studies, the propensity score is not known and thus needed to same cluster. be estimated. Therefore propensity score analysis usually in- This propensity score model can be thought of as a non- volves two steps. The first step is to estimate the propensity parametric alternative to a regression-based adjustment for in- score, typically by a logistic regression. The second step is to dividual and cluster covariates. The analogous marginal re- estimate the treatment effect by incorporating (e.g., by weight- gression model would be, ing or matching) the estimated propensity score. Hierarchi- y y cal structure leads to a range of different choices of modeling yhk = γzhk + β xhk + κ vh + hk; (3) in both steps. In this section, we will introduce several most 2 widely used models. where hk ∼ N(0; δ ), and γ is the treatment effect. As Before going into more details, here we make a note regard- model (2), estimates derived from this regression model rely ing the targeted estimand “treatment effect” defined above, on the assumption that the outcome generating mechanism is which is slightly different from those “causal treatment effect” the same across all clusters. defined using the conventional potential outcomes framework. Models (2) and (3) have a manifest similarity of form. A The propensity score originated from and has been widely deeper connection is that the sufficient statistics to estimate used in causal inference, but its use is certainly not restricted the treatment effect that are balanced under propensity score to studying causal effects. For instance, in many health pol- estimator are the same that must be balanced under model (3). icy studies, the major interest is to compare

Propensity Score Analysis with Hierarchical Data

Auditor: an R Package for Model-Agnostic Visual Validation and Diagnostics

ROC Curve Analysis and Medical Decision Making

Estimating the Variance of a Propensity Score Matching Estimator for the Average Treatment Effect

An Introduction to Logistic Regression: from Basic Concepts to Interpretation with Particular Attention to Nursing Domain

Njit-Etd2007-041

Logistic Regression, Part I: Problems with the Linear Probability Model

Homoskedasticity

HOMOSCEDASTICITY PLOT Graphics Commands

Chapter 10 Heteroskedasticity

Assumptions of Multiple Linear Regression

Simple Linear Regression 80 60 Rating 40 20

Discriminatory Accuracy of Serological Tests for Detecting Trypanosoma