
Causal Inference in Observational Studies with Complex Design: Multiple Arms, Complex Surveys and Intervention Effects

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Giovanni Nattino, M.S.

Graduate Program in Biostatistics

The Ohio State University

2019

Dissertation Committee:
Dr. Bo Lu, Advisor
Dr. Stanley Lemeshow, Co-Advisor
Dr. Eloise Kaizar

© Copyright by Giovanni Nattino 2019

Abstract

Observational studies are major sources to infer causal relationships. When using observational data to estimate causal effects, researchers must consider appropriate statistical methodology to account for the non-random allocation of the units to the treatment groups. Such methodology is well-established when the research question involves two treatment groups and results do not need to be generalized to the population from which the study sample has been selected. Relatively few studies have focused on research questions that do not fit into this framework. The goal of this work is to introduce statistical methods to perform causal inference in complex designs. First, I introduce a design for estimating treatment effects in the presence of multiple treatment groups. I devise a novel matching algorithm, generating samples that are well-balanced with respect to pre-treatment variables, and discuss the post-matching statistical analyses. Second, I focus on the generalization of causal effects to the population level, specifically when the sample selection is based on complex survey designs. I discuss the extension of the propensity score methodology to survey data, describe a weighted estimator for the common two-stage cluster sample and study its asymptotic properties. Third, I consider the estimation of population intervention effects, which evaluate the impact of realistic changes in the distribution of the treatment in a cohort. I describe estimators for upper and lower bounds of effects of this type, highlighting the implications for policy makers. For each of these three areas of causal inference, I use Monte Carlo simulations to

assess the reliability of the proposed methods and compare them with competing approaches. The new methods are illustrated with real-data applications. Finally, I discuss limitations and aspects requiring further work.

Acknowledgments

First of all, I would like to express my sincere gratitude to my advisors. Thanks to Dr. Stan Lemeshow, who has been the catalyst of this incredible journey. Without you, I would not be where I am now. I am grateful for your unconditional help, which often went beyond the university walls, and countless pieces of advice. An equal thanks goes to Dr. Bo Lu, who introduced me to the world of causal inference. Thank you for your guidance and trust, which simultaneously directed me to the finish line and left me space to set my own pace. Thanks for all the pragmatic suggestions and for helping me navigate the statistical conferences I have been fortunate to attend. Thanks to all the staff of the Government Resource Center, in particular to Lorin Ranbom and Colin Odden, for the continuous support and for the invaluable opportunity of continuously working on the Infant Mortality Research Partnership project. A special thank you to all the researchers I was fortunate to meet within this project. Thank you “Task 4” members, especially Dr. Pat and Steve Gabbe and Dr. Courtney Hebert. Your enthusiasm and genuine devotion to impacting the well-being of our society have truly inspired me. I would also like to thank Dr. Henry Xiang and Dr. Junxin Shi, from Nationwide Children's Hospital, for their expert advice and the help with the trauma data, which motivated part of this work. Thanks to all the faculty and students I have met during my time at The Ohio State University. In particular, I would like to thank Dr. Elly Kaizar, for your

valuable feedback on my work. I am grateful to Dr. Matt Pratola and Hengrui Luo and to Dr. Mike Pennell. Even though the results of our collaborations do not appear in these pages, working with you was a truly stimulating, refreshing and enjoyable experience. A special thanks also to Dr. Amy Ferketich and Dr. Mario Peruggia, for your friendly advice and for being my “Little Italy” in Columbus. I would like to thank the researchers of the Laboratory of Clinical Epidemiology at the Mario Negri Institute for Pharmacological Research, in Italy, where I developed my interests in research and in biostatistics. Thank you all, especially Dr. Guido Bertolini, for helping me embark on this journey. Thanks to all the friends who have been my Columbus family in these years. In particular, thank you Sebastian, Guilherme, Aziz, Júlia, Armand, Shuyuan, Jason, Natalia, Jafar, Julián, Alejandro and Andreas. Thanks for all the dinners together, the Friday night gatherings, the endless barbecues, the bike rides, the rock climbing sessions, the racquetball and disc golf games. You will be missed. A special thanks to my parents, Daniela and Beppe, and my brothers, Francesco and Stefano. If I am where I am, it is because of your education, encouragement and love. Finally, a profound thank you to my fiancée, Melissa. You understood the importance of this goal for me, despite the time together that I had to sacrifice along the way. Thanks for your patience and heartening words. I could not have asked for a better travel companion.

Vita

1987 ...... Born in Lecco (LC), Italy

Education

2009 ...... B.S. Applied Mathematics, University of Milan, Milan, Italy
2011 ...... M.S. Applied Mathematics, University of Milan, Milan, Italy
2014 ...... Post-graduate certificate in Biomedical Research, Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Ranica (BG), Italy

Professional Experience

2011-2015 ...... Research Associate, Laboratory of Clinical Epidemiology, Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Ranica (BG), Italy
2016-2019 ...... Graduate Research Associate, Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, Ohio
2017-2019 ...... Graduate Research Associate, Ohio Colleges of Medicine Government Resource Center, The Ohio State University Wexner Medical Center, Columbus, Ohio

Publications

1. Giovanni Nattino, Michael L Pennell, and Stanley Lemeshow. Assessing the goodness of fit of logistic regression models in large samples: a modification of the Hosmer-Lemeshow test. Submitted to Biometrics, 2019.

2. Giovanni Nattino, Bo Lu, Junxin Shi, Stanley Lemeshow, and Henry Xiang. Triplet matching for estimating causal effects with three treatment arms: a comparative study of mortality by trauma center level. Submitted to Journal of the American Statistical Association, 2019.

3. Courtney L Hebert, Giovanni Nattino, Steven G Gabbe, Patricia T Gabbe, Jason Benedict, Gary Phillips, and Stanley Lemeshow. A predictive model for very preterm birth: developing a point of care tool. Submitted to American Journal of Obstetrics and Gynecology, 2019.

4. Erinn M Hade, Giovanni Nattino, Heather A Frey, and Bo Lu. Propensity Score Matching for Treatment Delay Effects with Observational Survival Data. Submitted to Statistical Methods in Medical Research, 2019.

5. Giovanni Nattino and Bo Lu. Model assisted sensitivity analyses for hidden bias with binary outcomes. Biometrics, 74: 1141–1149, 2018.

6. Stefano Skurzak, Greta Carrara, Carlotta Rossi, Giovanni Nattino, Daniele Crespi, Michele Giardino, and Guido Bertolini. Cirrhotic patients admitted to the ICU for medical reasons: Analysis of 5506 patients admitted to 286 ICUs in 8 years. Journal of Critical Care, 45: 220–228, 2018.

7. Guido Bertolini, Giovanni Nattino, Carlo Tascini, Daniele Poole, Bruno Viaggi, Greta Carrara, Carlotta Rossi, Daniele Crespi, Matteo Mondini, Martin Langer, Gian Maria Rossolini, and Paolo Malacarne. Mortality attributable to different Klebsiella susceptibility patterns and to the coverage of empirical antibiotic therapy: a cohort study on patients admitted to the ICU with infection. Intensive Care Medicine, 44(10): 1709–1719, 2018.

8. Giovanni Nattino, Stanley Lemeshow, Gary Phillips, Stefano Finazzi, and Guido Bertolini. Assessing the calibration of dichotomous outcome models with the calibra- tion belt. Stata Journal, 17(4): 1003–1014, 2017.

9. Daniele Poole, Stefano Finazzi, Giovanni Nattino, Danilo Radrizzani, Giuseppe Gristina, Paolo Malacarne, Sergio Livigni, and Guido Bertolini. The prognostic importance of chronic end-stage diseases in geriatric patients admitted to 163 Italian ICUs. Minerva Anestesiologica, 83: 1283–1293, 2017.

10. Giovanni Nattino, Stefano Finazzi, and Guido Bertolini. A new test and graphical tool to assess the goodness of fit of logistic regression models. Statistics in Medicine, 35(5): 709–720, 2016.

11. Daniele Poole, Giovanni Nattino, and Guido Bertolini. Overoptimism in the interpretation of statistics. Intensive Care Medicine, 40(12): 1927–1929, 2014.

12. Giovanni Nattino, Stefano Finazzi, and Guido Bertolini. Comments on ‘Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers’ by Peter C. Austin and Ewout W. Steyerberg. Statistics in Medicine, 33(15): 2696–2698, 2014.

13. Giovanni Nattino, Stefano Finazzi, and Guido Bertolini. A new calibration test and a reappraisal of the calibration belt for the assessment of prediction models based on dichotomous outcomes. Statistics in Medicine, 33(14): 2390–2407, 2014.

14. Nicola Latronico, Giovanni Nattino, Bruno Guarneri, Nazzareno Fagoni, Aldo Amantini, and Guido Bertolini. Validation of the peroneal nerve test to diagnose critical illness polyneuropathy and myopathy in the intensive care unit: the multicentre Italian CRIMYNE-2 diagnostic accuracy study. F1000Research, 3(127), 2014.

Fields of Study

Major Field: Biostatistics

Table of Contents

Page
Abstract ...... ii
Acknowledgments ...... iv
Vita ...... vi
List of Figures ...... xii
List of Tables ...... xiv
List of Abbreviations ...... xv

Chapters

1 Introduction ...... 1
1.1 Causal Inference in Observational Studies ...... 1
1.2 Target of Inference ...... 3
1.3 Treatment Effects ...... 6
1.4 Estimation of Treatment Effects ...... 7
1.4.1 Identifiability Assumptions ...... 7
1.4.2 The Propensity Score Framework ...... 9
1.4.3 G-methods ...... 12
1.5 Modern Challenges in Causal Inference ...... 13
1.5.1 Multiple Treatment Groups ...... 13
1.5.2 Complex Survey Data ...... 17
1.5.3 Generalized Intervention Effects ...... 20

2 Multiple Treatment Groups ...... 24
2.1 Conditionally Optimal Matching Algorithm ...... 24
2.1.1 Algorithm Setup ...... 24
2.1.2 Matching Algorithm for Three Treatment Groups ...... 26
2.1.3 Extensions to More than Three Treatment Groups ...... 33
2.2 Post-matching Outcome Analysis ...... 36
2.2.1 Covariate Balance ...... 36
2.2.2 Statistical Setup ...... 37
2.2.3 Evidence Factors ...... 38
2.2.4 Estimation of Treatment Effects ...... 40
2.2.5 Sensitivity Analysis to Hidden Bias ...... 42
2.3 Simulation Study ...... 45
2.3.1 Setup ...... 45
2.3.2 Results ...... 47
2.4 Application: Mortality Differences among Trauma Center Levels ...... 49
2.4.1 Background ...... 49
2.4.2 Data ...... 50
2.4.3 Methods ...... 51
2.4.4 Results ...... 52
2.4.5 Conclusions ...... 56

3 Propensity Score Adjustment With Cluster Sampling Data ...... 58
3.1 Weighted Estimators for Population ATE in Complex Survey Data ...... 59
3.1.1 Weighting in Causal Inference and Survey Sampling ...... 59
3.1.2 Treatment and Sample Selections ...... 60
3.1.3 Weighted or Unweighted Propensity Score? ...... 63
3.2 Two-Stage Cluster Sample Surveys ...... 65
3.2.1 Cluster Sampling Design: Notation ...... 65
3.2.2 Weighted Estimator for Population ATE ...... 67
3.2.3 Propensity Score Estimation ...... 69
3.2.4 Asymptotic Properties ...... 71
3.2.5 Design Effect in Simple Two-Stage Cluster Sampling ...... 77
3.3 Simulation Study ...... 78
3.3.1 Setup ...... 78
3.3.2 Results ...... 82
3.4 Application: Effect of Insurance Status on Decision to Seek Care After Injury ...... 85
3.4.1 Background ...... 85
3.4.2 Data ...... 87
3.4.3 Methods ...... 88
3.4.4 Results ...... 89
3.4.5 Conclusions ...... 92

4 Population Intervention Effects ...... 96
4.1 Definition ...... 97
4.2 Interventions ...... 100
4.3 Upper and Lower Bounds ...... 101
4.4 Estimation of Upper and Lower Bounds ...... 104
4.5 Properties of the Proposed Estimators ...... 106
4.5.1 Asymptotic Distribution ...... 107
4.5.2 Bootstrap ...... 111
4.6 Outcome Models ...... 112
4.7 Simulation Study ...... 114
4.7.1 Setup ...... 114
4.7.2 Results ...... 116
4.8 Application: Tobacco Cessation Interventions and Nicotine Addiction during Pregnancy ...... 119
4.8.1 Background ...... 119
4.8.2 Data ...... 121
4.8.3 Methods ...... 122
4.8.4 Results ...... 122
4.8.5 Conclusions ...... 124

5 Discussion and Future Work ...... 126
5.1 Multiple Treatment Groups ...... 126
5.1.1 Discussion ...... 126
5.1.2 Limitations ...... 127
5.1.3 Future Work ...... 128
5.2 Complex Survey Designs ...... 130
5.2.1 Discussion ...... 130
5.2.2 Limitations ...... 130
5.2.3 Future Work ...... 131
5.3 Population Intervention Effects ...... 133
5.3.1 Discussion ...... 133
5.3.2 Limitations ...... 134
5.3.3 Future Work ...... 134

Bibliography 136

Appendices

A Additional Results of Simulation Study in Chapter 3 147

List of Figures

Figure Page

1.1 Sampling and Treatment Selections in Population Structure ...... 4
1.2 Causal Contrasts of Average and Intervention Effects ...... 21

2.1 First Step of the Conditionally Optimal Matching Algorithm ...... 28
2.2 Distributions of the Matching Variable in the Scenarios of the Simulation Study ...... 46
2.3 Result of the Sensitivity Analysis in the Comparison of Mortality among Trauma Center Levels ...... 57

3.1 Result of Simulation Study: Continuous Outcome, (S, Ns) = (5, 80), All the Covariates Considered, Sampling Scheme Independent from the Treatment ...... 83
3.2 Result of Simulation Study: Continuous Outcome, (S, Ns) = (5, 80), All the Covariates Considered, Sampling Scheme Dependent on the Treatment ...... 83
3.3 Result of Simulation Study: Continuous Outcome, (S, Ns) = (20, 20), All the Covariates Considered, Sampling Scheme Dependent on the Treatment ...... 84
3.4 Result of Simulation Study: Continuous Outcome, (S, Ns) = (5, 80), Covariate X3 Omitted, Sampling Scheme Dependent on the Treatment ...... 85
3.5 Result of Simulation Study: Binary Outcome, (S, Ns) = (5, 80), All the Covariates Considered, Sampling Scheme Dependent on the Treatment ...... 86
3.6 Estimates of Average Treatment Effect for Insurance Status on Decision to Seek Care after Injury ...... 91

4.1 Result of Simulation Study: Outcome Model Correctly Specified ...... 118
4.2 Result of Simulation Study: Scale of Nonlinear Covariates Misspecified in Outcome Model ...... 119
4.3 Result of Simulation Study: One Covariate Omitted from Outcome Model ...... 120
4.4 Estimates of Bounds of Intervention Effect as Function of Proportion of Treated Subjects ...... 123

List of Tables

Table Page

2.1 Result of Simulation Study ...... 48
2.2 Balance of Covariates between Treatment Groups in Matched Sample ...... 53
2.3 Mortality by Trauma Center Level before and after Matching ...... 55

3.1 Estimates of Coefficients of Population Propensity Score Model ...... 94
3.2 Balance of Covariates between Treatment Groups in Weighted Sample ...... 95

4.1 Logistic Regression Model Estimating Probabilities of Preterm Delivery 125

List of Abbreviations

ATE Average Treatment Effect.

ATT Average Treatment Effect for the Treated.

IE Intervention Effect.

MEPS Medical Expenditure Panel Survey.

NEDS Nationwide Emergency Department Sample.

NN Nearest Neighbor.

NTC Nontrauma Centers.

SUTVA Stable Unit Treatment Value Assumption.

TC Trauma Centers.

TC I Level I Trauma Centers.

TC II Level II Trauma Centers.

Chapter 1

Introduction

1.1 Causal Inference in Observational Studies

The goal of causal inference is to measure causal effects of treatments or exposures on outcomes. Treatment and outcome are denoted with Z and Y, respectively. For introductory purposes, I focus on studies involving two treatment levels, indicated with values 1 and 0. Subjects assigned to treatment 1 and 0 will be referred to as treated and controls, respectively.

Causal effects are traditionally defined at the individual level. For any given subject k of the cohort under study, imagine being able to observe the outcome of interest in two counterfactual scenarios. In one scenario, the unit receives the treatment 1 (Z_k = 1). In the other scenario, the same unit receives the treatment 0 (Z_k = 0). Denote the outcomes observed under the two scenarios with Y_k^1 and Y_k^0, the potential outcomes of subject k. For this specific subject, the treatment has a causal effect on the outcome if Y_k^1 differs from Y_k^0.

In cohorts of subjects, causal effects may be quantified in several ways. For example, the difference between the averages of Y^1 and Y^0 in a cohort is a popular measure of average effect and is referred to as the Average Treatment Effect (ATE). Other measures of causal effects are discussed in Section 1.3.

For most treatments and outcomes, it is impossible to observe more than one potential outcome per unit. When a subject receives a treatment, only the corresponding potential outcome can be observed. In this sense, the treatment assignment can be interpreted as a selection from the set of all the potential outcomes of the units under study.

Historically, randomized experiments are considered the gold standard to measure causal effects. The randomization of the treatment guarantees that, for each subject, the potential outcome to be observed is selected at random. As a consequence, the distributions of the observed outcomes in the treated and control groups are expected to represent the distributions of the potential outcomes Y^1 and Y^0, respectively. In this case, causal effects can be quantified straightforwardly. For example, the ATE can be estimated using the difference between the sample averages of the observed outcomes in the treatment groups.

The estimation of causal effects in observational studies requires stronger assumptions and more complex statistical methodology. Treatments are not assigned at random and, most of the time, the underlying assignment mechanism is unknown. The literature provides different methods to deal with a variety of scenarios. All of these methods require assumptions, which are often reasonable but rarely testable.

Nevertheless, causal inference in observational studies has become increasingly popular over the past decades. There are several reasons motivating such a marked increase in popularity. First of all, observational studies have much lower costs than interventional experiments. In particular, holding the study budget fixed, the possibility to study larger samples is an attractive feature when dealing with small treatment effects. Second, in the fields of medicine and environmental sciences, randomized experiments are not always ethical. Whenever researchers are interested in treatments that have known risks for the experimental units, observational studies are the only option for estimating causal effects. Third, the generalizability of the results of interventional studies is often hampered by the rigid criteria controlling the conditions of the experiments. Samples collected in carefully-designed observational studies are much more likely to be representative of the population of interest. Finally, modern technologies and computer science developments are constantly increasing the availability of large observational datasets. Large scale surveys, electronic health records and social networks are just a few of the many sources of “big data”, which are invaluable resources for observational studies.
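The logic of estimating the ATE under randomization can be sketched in a few lines of Python. All the numbers below (cohort size, a constant individual effect of 3, coin-flip assignment) are illustrative assumptions, not values from this dissertation: the point is that, when both potential outcomes are generated explicitly, the difference in sample means between the randomized arms recovers the true ATE.

```python
import random

random.seed(7)

# Hypothetical cohort in which BOTH potential outcomes are generated, so the
# true ATE is known and can be compared with the randomized estimate.
N = 10_000
cohort = []
for _ in range(N):
    x = random.gauss(0, 1)                   # a baseline characteristic
    y0 = 1.0 + 2.0 * x + random.gauss(0, 1)  # potential outcome under control
    y1 = y0 + 3.0                            # constant individual effect of 3
    cohort.append((y0, y1))

true_ate = sum(y1 - y0 for y0, y1 in cohort) / N  # exactly 3 by construction

# Randomization: a fair coin decides which potential outcome is revealed.
treated, controls = [], []
for y0, y1 in cohort:
    if random.random() < 0.5:
        treated.append(y1)   # Z = 1: only Y^1 is observed
    else:
        controls.append(y0)  # Z = 0: only Y^0 is observed

est_ate = sum(treated) / len(treated) - sum(controls) / len(controls)
print(round(true_ate, 2), round(est_ate, 2))
```

With this sample size the difference in means lands close to the true effect of 3; in an observational study, where the coin flip is replaced by an unknown assignment mechanism, this simple contrast would generally be biased.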

1.2 Target of Inference

Researchers are often interested in quantifying causal effects at the aggregate level. Depending on the research question, the target cohort may be the study sample or the population from which the sample has been drawn. Figure 1.1 provides a graphical representation of a comprehensive population framework, which includes the cohorts targeted by the methods presented throughout the dissertation. The most inclusive cohort is the infinite potential outcome superpopulation, which is the top-left set of the figure. Because both the potential outcomes are known for each subject of this superpopulation, the distribution of its infinite subjects can be thought of as a bivariate distribution of (Y^0, Y^1). In the most general case, subjects with different characteristics may show different values of the potential outcomes. This feature is represented with a multi-modal pattern in the figure, where the colors indicate different subgroups. The bottom-right set is the smallest of the cohorts, the observed sample. Only a finite set of subjects is observed and only one potential outcome is known, because subjects have received the treatment. Two types of selections are involved in the process that identifies the observed sample from the potential outcome superpopulation. On the one hand, there are

[Figure 1.1 about here; the graphic is not reproduced in this text version.]

Figure 1.1: Population structure assumed throughout the document. Arrows represent selections. Treatment selections at the superpopulation, population and sample level are indicated with T_SP, T_P and T_S. P_PO and P_T indicate the population selections from the superpopulation of potential outcomes and the one after treatment selection. S_PO and S_T indicate the sample selections from the finite populations of potential outcomes and the one after treatment selection.

subject selections, which draw finite populations from infinite superpopulations (selections P_PO and P_T) and samples from finite populations (S_PO and S_T). These selections are represented with vertical arrows in the figure. On the other hand, there are treatment selections, which identify the potential outcome to be revealed for each unit. Depending on the study design, the treatment selection can be thought to be applied to the infinite superpopulation (T_SP), finite population (T_P) or study sample (T_S).

Typically, the finite population is considered to be representative of the superpopulation, because it is assumed to be drawn via simple random sampling. The size of the finite population is denoted with N. The study sample is subsequently selected from the finite population. The sample size is denoted with n, with n < N. In practice, at this stage, researchers may employ sophisticated sampling designs, to ensure the generalizability of sample-based results while optimizing the costs of the study. Therefore, the sample may be drawn by simple random sampling or with other complex sampling designs.

The target cohort of the causal inference research question varies from study to study. Causal parameters may be defined on any of the sets of the figure. When the interest is on sample-level effects, randomization-based inference is the most popular methodology to verify hypotheses and quantify causal effects (Rosenbaum, 2002b). In this framework, the target cohort is the sample of potential outcomes and the only source of randomness is attributed to the treatment selection. On the other hand, policy makers are often interested in causal parameters defined in the population or superpopulation from which the study sample is drawn (Westreich, 2017). In this case, the sampling selection introduces an additional source of randomness, which must be considered in the statistical analysis.

The hierarchical framework illustrated in Figure 1.1 provides an overarching map of the possible paths leading to the observed sample. When the target cohort is identified, researchers need to recognize the selections that have resulted in the observed sample and ensure the identifiability of the causal parameter of interest with the available data.
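The distinction between population- and sample-level estimands can be made concrete with a small simulation. The sketch below uses made-up outcome distributions: a finite population is drawn from a simulated superpopulation with heterogeneous individual effects, and a simple random sample is then drawn from the population, so the population and sample ATEs differ only because of the sampling selection.

```python
import random

random.seed(3)

# Hypothetical setting: finite population of size N, then a simple random
# sample of size n drawn from it (the selection labeled S_PO in Figure 1.1).
N, n = 2000, 100
population = []
for _ in range(N):
    y0 = random.gauss(0, 1)
    y1 = y0 + random.gauss(2.0, 0.5)  # individual effects vary around 2
    population.append((y0, y1))

pate = sum(y1 - y0 for y0, y1 in population) / N  # population ATE
sample = random.sample(population, n)             # simple random sampling
sate = sum(y1 - y0 for y0, y1 in sample) / n      # sample ATE

# pate and sate differ only because of the randomness of the sampling step.
print(round(pate, 3), round(sate, 3))
```

Inference targeting the sample needs to account only for the treatment selection, while inference targeting the population must also reflect the extra variability introduced by the sampling step.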

1.3 Treatment Effects

I focus on marginal effects, which are averages of individual-level effects over the target cohort. When the target cohort is either an infinite superpopulation or a finite population, the most popular marginal effect is the ATE, which is formally defined as

∆_ATE = E[Y^1] − E[Y^0].    (1.1)

The operator E[·] is used to denote the average over the cohort of interest (as in Hernán and Robins (2018)). In particular, it is possible to define more specific versions of the ATE depending on the target cohort. For example, denoting the index set of the subjects in the finite population with U, the population ATE is defined as

∆_PATE = (1/N) Σ_{k∈U} Y_k^1 − (1/N) Σ_{k∈U} Y_k^0.    (1.2)

A different way to quantify a causal effect is to focus on a subset of the units under study. A common choice is to look at the effect of the treatment on the subjects assigned to the treatment group (i.e., such that Z = 1). The Average Treatment Effect for the Treated (ATT) is defined as:

∆_ATT = E[Y^1 | Z = 1] − E[Y^0 | Z = 1].    (1.3)

Similarly to the ATE, it is possible to define specific versions of the ATT for the cohort of interest. Notably, the ATE and ATT quantify different aspects of the impact of the treatment on the outcome. In practice, the most appropriate measure to be considered depends on the research question to be addressed.

The definitions of ATE and ATT are naturally extended to the sample. For instance, denoting the index set of the subjects in the study sample with S, the sample ATE is defined as

∆_SATE = (1/n) Σ_{k∈S} Y_k^1 − (1/n) Σ_{k∈S} Y_k^0.    (1.4)

The sample ATT is defined accordingly, restricting the average to treated subjects. These sample parameters are the principal target of inference in the framework introduced by Neyman (1935). From the author's perspective, hypotheses about causal effects should be formulated in terms of average effects. For example, the null hypothesis of no treatment effect can be expressed as H_0: ∆_SATE = 0.

Fisher (1935) proposed a different framework, where sample effects are established by verifying a null hypothesis of no treatment effect for all the subjects in the sample, i.e., H_0: Y_k^1 = Y_k^0 for all k in S. This hypothesis, referred to as Fisher's sharp null, is stronger than Neyman's null hypothesis of no average effect. Because the sharp null hypothesis has a central role in Fisher's framework, effects are usually quantified by inverting the test that is employed to verify the hypothesis.
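The sample estimands above can be illustrated with a toy cohort in which, artificially, both potential outcomes are known. The data values here are entirely hypothetical; they only make the arithmetic of the sample ATE and ATT explicit.

```python
# Toy cohort with hypothetical, fully known potential outcomes (y0, y1) and
# an observed treatment z; in real data only one of y0, y1 would be observed.
cohort = [
    # (y0, y1, z)
    (1.0, 3.0, 1),
    (0.0, 1.0, 0),
    (2.0, 2.0, 1),
    (1.0, 2.0, 0),
]

n = len(cohort)
# Sample ATE: average individual effect over the whole sample.
sample_ate = sum(y1 - y0 for y0, y1, _ in cohort) / n

# Sample ATT: same average restricted to the treated units (z == 1).
treated = [(y0, y1) for y0, y1, z in cohort if z == 1]
sample_att = sum(y1 - y0 for y0, y1 in treated) / len(treated)

print(sample_ate, sample_att)  # 1.0 1.0
```

In this toy cohort the two estimands happen to coincide; with effects that differ systematically between treated and control units, the ATE and ATT would generally take different values.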

1.4 Estimation of Treatment Effects

1.4.1 Identifiability Assumptions

In order to quantify causal effects in observational studies, we need to account for pre-treatment covariates and pose some identifiability assumptions. Let X be the vector of covariates. I assume a traditional set of assumptions in causal inference (Hernán and Robins, 2018). These assumptions are:

1. Consistency: There is no interference among subjects and, when the treatment level is fixed, the same potential outcome is consistently observed. This condition implies that the observed outcome Y is always defined as Y = I(Z = 1)Y^1 + I(Z = 0)Y^0.

2. Exchangeability: Treatment and potential outcomes are assumed to be independent given the covariates (i.e., Z ⊥⊥ Y^z | X for z = 0, 1).

3. Positivity: All of the subjects in the cohort of interest are eligible to receive all the treatment levels (i.e., 0 < P (Z = z|X) < 1 for z = 0, 1).
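Positivity is the one condition in this list with a direct empirical counterpart in discrete covariate strata. The sketch below is a hypothetical helper (the function name and data are mine, not the dissertation's) that flags strata in which the sample proportion treated is 0 or 1, i.e. where 0 < P(Z = 1 | X = x) < 1 visibly fails.

```python
from collections import defaultdict

def positivity_violations(data):
    """Return covariate strata in which every subject (or no subject) is
    treated, so that 0 < P(Z = 1 | X = x) < 1 fails in the sample."""
    counts = defaultdict(lambda: [0, 0])  # x -> [n controls, n treated]
    for x, z in data:
        counts[x][z] += 1
    return sorted(x for x, (n0, n1) in counts.items() if n0 == 0 or n1 == 0)

# Hypothetical data: (covariate stratum, treatment indicator).
data = [("young", 1), ("young", 0), ("old", 1), ("old", 1)]
print(positivity_violations(data))  # ['old']: no control subjects in that stratum
```

An empty sample proportion in a stratum does not prove that the population probability is 0 or 1, but strata flagged this way warrant scrutiny before any effect estimation.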

Analogous conditions have been referred to with different terminology in the literature. For example, the condition of consistency has been referred to as the Stable Unit Treatment Value Assumption (SUTVA) in the randomization-based inference framework (Rubin, 1980). Similarly, the conditions of exchangeability and positivity have been referred to as ignorability of the treatment assignment or weak unconfoundedness (Rosenbaum and Rubin, 1983; Imbens, 2000).

If these assumptions are met, causal effects can be estimated with the available data. The statistical method to be used depends on the specific effect to be estimated, which in turn depends on the research question to be addressed. While a comprehensive presentation of the methodology to estimate causal effects is out of the scope of this dissertation, the available approaches can be classified into nonparametric methods and methods requiring modeling (Hernán and Robins, 2018).

Matching and stratifying subjects according to the value of the covariates are examples of nonparametric approaches (Cochran and Chambers, 1965). The basic idea is to identify subgroups where the treatment groups are balanced with respect to the covariates and, therefore, comparable in terms of the outcome. The key limitation of these approaches is the difficulty of handling large numbers of covariates. This is a common issue in observational studies, where many factors may influence both the treatment and the outcome.

The second class of methods involves modeling. The propensity score framework is a popular methodology that belongs to this family, where models are used to estimate the probability of receiving the treatment given the covariates, namely the propensity score (Rosenbaum and Rubin, 1983). G-methods represent another class of methods of this family (Robins, 1986). In this case, models are used to estimate either the probability of receiving the treatment or the conditional distribution of the outcome (or both). These two comprehensive frameworks are briefly introduced in the following sections.

1.4.2 The Propensity Score Framework

The Propensity Score

The propensity score is the probability of receiving the treatment given the covariates, i.e., e(X) = P(Z = 1|X). Under the identifiability assumptions, Rosenbaum and Rubin (1983) showed that the exchangeability property holds if the possibly high-dimensional vector of covariates is replaced by the propensity score. This data-reduction property makes the propensity score extremely attractive in empirical research.

In most practical applications, the probability e(X) is unknown and must be estimated. If the treatment is binary, a common approach is to estimate the propensity score with a logistic regression model, using the treatment variable Z as the dependent variable and the available covariates X as predictors.

The propensity score can be used in different ways to estimate causal effects (Rosenbaum and Rubin, 1983). I focus on propensity score matching and weighting, which are considered the most reliable approaches to estimate treatment effects. These

9 approaches are briefly described in the following subsections.
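As a minimal sketch of the estimation step just described, the following pure-Python code fits a one-covariate logistic propensity score model by gradient ascent on the Bernoulli log-likelihood. The simulated coefficients (−0.5, 1.0) and sample size are illustrative assumptions; a real analysis would use standard GLM software rather than this hand-rolled optimizer.

```python
import math
import random

random.seed(1)

# Simulate a binary treatment whose log-odds are linear in one covariate:
# logit P(Z = 1 | X) = a + b*X, with illustrative values (a, b) = (-0.5, 1.0).
a_true, b_true = -0.5, 1.0
data = []
for _ in range(4000):
    x = random.gauss(0, 1)
    p = 1 / (1 + math.exp(-(a_true + b_true * x)))
    data.append((x, 1 if random.random() < p else 0))

# Maximize the Bernoulli log-likelihood by plain gradient ascent.
a = b = 0.0
lr = 0.5
for _ in range(300):
    grad_a = grad_b = 0.0
    for x, z in data:
        p = 1 / (1 + math.exp(-(a + b * x)))
        grad_a += z - p        # score equation for the intercept
        grad_b += (z - p) * x  # score equation for the slope
    a += lr * grad_a / len(data)
    b += lr * grad_b / len(data)

def e_hat(x):
    """Fitted propensity score for a subject with covariate value x."""
    return 1 / (1 + math.exp(-(a + b * x)))

print(round(a, 2), round(b, 2))  # estimates should be near (-0.5, 1.0)
```

The fitted e_hat(x) can then feed either of the two uses of the propensity score described next, matching or weighting.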

Propensity Score Matching

Rosenbaum and Rubin (1983) showed that the propensity score is a balancing score, which means that treatment and covariates are independent conditional on the value of the propensity score. As a consequence of this property, if treated units could be perfectly paired to controls with the same propensity score values, the distribution of the covariates would be expected to be the same in the matched treatment groups, as it might have been in a randomized experiment. This is the rationale of propensity score matching. Matching algorithms construct matched sets formed by control and treated units that are similar with respect to the propensity score. In this way, researchers hope to generate matched samples where the covariates in the treatment groups are well balanced.

The main limitation of matching is the fact that only a subset of the units in the control group is selected to enter the matched sample, even though discarding control units that are not comparable to the treated might be a desirable property. On the other hand, matching offers several advantages over the other propensity score-based methods (Austin, 2011). First of all, by attempting to recreate the balanced design resulting from a randomized study, the results of matching are easy to interpret. Second, because post-matching analyses do not need to rely on parametric outcome models, they are robust to the misspecification of the functional forms of the covariates in these models. Third, because of the necessity to evaluate the quality of matching in terms of covariate balance, researchers are required to critically assess the overlap of the treatment groups in terms of the observed confounders. A lack of overlap between treatment groups implies an undesired extrapolation from the available data when estimating treatment effects, and might pass unnoticed with model-based approaches.

Finally, matching offers the possibility of "outcome blinding". The matching step, which creates the balanced design to be used for the subsequent analysis, can be performed blindly with respect to the outcome of interest. This practice ensures the robustness of the final analysis to conscious and unconscious choices that may influence the study results.
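As a concrete, deliberately simple illustration of the matching idea, the following sketch performs greedy 1:1 nearest-neighbor matching on estimated propensity scores, without replacement. This is a heuristic for exposition only, not the optimal matching discussed later in this work; all names are illustrative.

```python
# Illustrative greedy 1:1 nearest-neighbor matching on the propensity
# score, without replacement (a sketch, not the dissertation's algorithm).
import numpy as np

def greedy_ps_match(e_treated, e_control):
    """Return (treated index, matched control index) pairs."""
    available = list(range(len(e_control)))
    matches = []
    # process treated units from highest to lowest score (common heuristic)
    for i in np.argsort(-e_treated):
        j_best = min(available, key=lambda j: abs(e_treated[i] - e_control[j]))
        available.remove(j_best)
        matches.append((i, j_best))
    return matches

# toy scores: each treated unit is paired with its closest control
e_t = np.array([0.8, 0.4, 0.6])
e_c = np.array([0.35, 0.82, 0.55, 0.10])
pairs = dict(greedy_ps_match(e_t, e_c))
# e.g. the treated unit with e = 0.8 pairs with the control with e = 0.82
```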

Propensity Score Weighting

The rationale of weighting is to recreate the sample, population or superpopulation where both potential outcomes are known (left sets in Figure 1.1), by assigning an appropriate weight to each subject of the observed sample. For example, the sample ATE can be estimated with a weighted average of the outcomes, where each subject is weighted by the inverse of the probability of receiving the treatment that he/she actually received (e(X) for treated subjects and 1 − e(X) for controls). Similarly, a different family of weights can be used to define a weighted estimator for the ATT. In this sense, weighting is a versatile approach for the estimation of different causal effects.

As opposed to matching, the estimates of the propensity score are explicitly involved in the estimation of the treatment effect. Traditional weighting estimators are therefore more sensitive to misspecification of the propensity score model than matched estimators. To address this limitation, doubly-robust weighting estimators have been proposed in the literature (Robins et al., 1994). Briefly, the idea is to appropriately incorporate an outcome model within the estimator. Estimation of the treatment effect is then guaranteed to be unbiased if either the propensity score model or the outcome model is correctly specified; it can fail only if both are misspecified.
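The weighting logic described above can be sketched in a few lines. The following toy example (illustrative names and simulated data; the Hajek-style normalization of the weights is an assumption, not prescribed by the text) estimates the sample ATE by inverse-probability weighting with the true propensity score.

```python
# Hedged sketch of the inverse-probability-weighted (IPW) estimator of
# the sample ATE: treated units weighted by 1/e(X), controls by 1/(1-e(X)).
import numpy as np

def ipw_ate(y, z, e):
    w1 = z / e              # weights for treated units
    w0 = (1 - z) / (1 - e)  # weights for control units
    # Hajek-style normalization of each weighted average
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# toy data with a known constant treatment effect of 2
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
e = 0.1 + 0.8 / (1 + np.exp(-x))   # true propensity, bounded away from 0 and 1
z = rng.binomial(1, e)
y = x + 2 * z + rng.normal(size=n)
est = ipw_ate(y, z, e)             # should be close to the true effect of 2
```

A naive unweighted difference in means would be biased here, because x affects both treatment assignment and outcome; the weights remove that confounding.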

1.4.3 G-methods

G-methods (or generalized methods) are model-based approaches to estimate a variety of causal contrasts, in cross-sectional and longitudinal designs (Robins, 1986). This broad family includes marginal structural models, structural nested models and the parametric g-formula. Extensive descriptions of these approaches are available in the literature (e.g., Naimi et al. (2016)). Marginal structural models and structural nested models are families of models whose coefficients are directly related to marginal causal parameters, such as E[Y^z], the ATE or the ATE in subgroups of the study sample. Marginal structural models require a model for the probability of receiving a treatment level (i.e., the propensity score). Structural nested models need models for both a function of the outcome and the propensity score, but they are doubly-robust: estimates are consistent if either of the two models is correctly specified. The parametric g-formula relies on outcome models that include the treatment Z and the covariates X. I provide some background information about this methodology, which is used in Chapter 4. In fixed-time settings, the g-formula estimator for ATEs is motivated by the following equality:

∆_ATE = E[Y^1] − E[Y^0]

= E[E[Y^1 | X]] − E[E[Y^0 | X]]

= E[E[Y^1 | Z = 1, X]] − E[E[Y^0 | Z = 0, X]]

= E[E[Y | Z = 1, X]] − E[E[Y | Z = 0, X]],   (1.5)

where E[Y^z | X] = E[Y^z | Z = z, X] because of the exchangeability assumption (Y^z and Z are independent given X). The equality suggests that E[Y^z] can be estimated via standardization of the mean of Y across values of Z and X. Theoretically, E[Y | Z, X] could be estimated nonparametrically, by computing sample averages across strata of Z and X. However, this is only possible if the dimension of X is small and the strata corresponding to each value of X in the observed sample are well populated. In most practical settings, these conditions are not met and E[Y | Z, X] is estimated with a parametric outcome model. A natural estimator of the sample ATE follows from Equation (1.5):

∆̂_SATE = (1/n) Σ_{k=1}^{n} { Ê[Y | Z = 1, X_k] − Ê[Y | Z = 0, X_k] }.   (1.6)

Importantly, the estimators based on the g-formula are consistent only if the outcome model is correctly specified. G-methods can be easily applied to complex designs, in the presence of time-varying treatments or when the target causal contrast involves interventions on the treatment mechanism (Hernán and Robins, 2018). In study designs with fixed-time treatments, and when the goal is to estimate traditional causal effects (such as the ATE or the ATT), nonparametric methods and the propensity score framework are simpler alternative approaches.
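A minimal implementation of the estimator in Equation (1.6) can be sketched as follows, assuming a linear outcome model. The linear specification is an illustrative choice; any regression model for E[Y | Z, X] could be substituted, and all names and data are hypothetical.

```python
# Sketch of the g-formula (standardization) estimator of the sample ATE:
# fit an outcome model E[Y | Z, X], then average the predicted difference
# between "everyone treated" and "everyone untreated" over the sample.
import numpy as np

def gformula_sate(y, z, X):
    # design matrix: intercept, treatment indicator, covariates
    D = np.column_stack([np.ones(len(y)), z, X])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)  # OLS outcome model
    D1 = D.copy(); D1[:, 1] = 1   # counterfactual: everyone treated
    D0 = D.copy(); D0[:, 1] = 0   # counterfactual: everyone untreated
    return np.mean(D1 @ beta - D0 @ beta)

# toy data with a known constant treatment effect of 3
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 2))
z = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
y = 1 + 3 * z + X @ np.array([1.0, -0.5]) + rng.normal(size=2000)
est = gformula_sate(y, z, X)   # should be close to the true ATE of 3
```

Consistent with the text, the estimate is only trustworthy when the outcome model is correctly specified, as it is in this simulated example.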

1.5 Modern Challenges in Causal Inference

1.5.1 Multiple Treatment Groups

Causal Inference Setup and Estimation of Treatment Effects

Even though most of the traditional causal inference literature has focused on designs with two treatment levels, simultaneously evaluating the effect of multiple treatments is vital in modern public health and medical research, where several alternative treatments are often available. The potential outcome framework naturally generalizes to settings with multiple

treatment groups. In the presence of K treatment levels, the treatment variable Z assumes values in the set Z = {1, ..., K} and each treatment level z ∈ Z corresponds to one potential outcome, Y^z. The population structure described in Section 1.2 also applies to designs with multi-valued treatments. The definition of causal effects proceeds analogously to the binary-treatment case.

The seminal work by Imbens (2000) discussed extensions of the traditional identifiability assumptions. In particular, the exchangeability and positivity conditions are extended to multi-valued treatments, by assuming that the potential outcomes Y^z are independent of the treatment assignment given the covariates X for all z ∈ Z, and that each subject in the sample is eligible to receive any of the treatments under study, i.e., 0 < P(Z = z|X) < 1 for any z ∈ Z.

Imbens (2000) also generalized the propensity score to multiple-treatment settings. Defining e_z(X) = P(Z = z|X), the generalized propensity score is the K-dimensional vector of probabilities e(X) = (e_1(X), e_2(X), ..., e_K(X)). Notably, since the treatments are mutually exclusive, these probabilities are subject to the constraint Σ_{z∈Z} e_z(X) = 1 for any value of the covariates X. Since each probability e_z(X) can be expressed as one minus the sum of the other probabilities, the generalized propensity score belongs to a (K − 1)-dimensional space. The author showed that the generalized propensity score offers data-reduction properties similar to those of the traditional propensity score in the two-group case. In particular, the treatment assignment is independent of each potential outcome given the corresponding component of the propensity score, i.e., Z ⊥⊥ Y^z | e_z(X) for each z ∈ Z.

Different models can be used to estimate the generalized propensity score, depending on the characteristics of the treatment values. If the treatment values are qualitatively different, Imbens (2000) suggested the use of multinomial logit or probit regression. On the other hand, if there is a logical ordering of the treatment levels

(e.g., when the treatment levels under study are different doses of a drug), ordinal logistic regression is better suited. These results provide the theoretical foundation for the estimation of causal effects using observational data. Most of the statistical approaches designed for the two-group case are potentially extendable to the multiple-treatment case. Linden et al. (2016) provided an overview of regression adjustment, stratification and weighted estimators. Even though matching is a common approach when dealing with two treatment groups, the method has received limited attention for studies with multiple treatments. This is unfortunate, given the unique advantages of matching over alternative approaches (see Section 1.4.2). The following section provides an overview of the available matching procedures for designs with multiple treatments.
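As a sketch of how the generalized propensity score might be estimated in the unordered case, the following fits a multinomial logit model by plain gradient ascent. The self-contained fitting routine, names and simulated data are illustrative; in practice a standard multinomial regression function would be used instead.

```python
# Sketch: generalized propensity score e_z(X) = P(Z = z | X) for K
# treatment levels, via a multinomial logit fit with gradient ascent.
import numpy as np

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)  # numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def generalized_ps(X, z, K, lr=0.1, n_iter=2000):
    n = len(z)
    Xd = np.column_stack([np.ones(n), X])   # add intercept
    B = np.zeros((Xd.shape[1], K))          # one coefficient column per level
    Y = np.eye(K)[z]                        # one-hot treatment indicators
    for _ in range(n_iter):
        P = softmax(Xd @ B)
        B += lr * Xd.T @ (Y - P) / n        # gradient of the log-likelihood
    return softmax(Xd @ B)                  # rows sum to one by construction

# illustrative data with K = 3 qualitatively different treatments
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 2))
true_P = softmax(np.column_stack([np.zeros(600), X[:, 0], -X[:, 1]]))
z = np.array([rng.choice(3, p=p) for p in true_P])
e_hat = generalized_ps(X, z, K=3)  # estimated K-dimensional score vectors
```

Each row of `e_hat` is one subject's generalized propensity score vector, satisfying the sum-to-one constraint noted above.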

Matching Algorithms

Lopez and Gutman (2017) recently discussed the limitations in scope of existing matching algorithms for multi-valued treatments. Part of the reason is that these algorithms are much harder to implement than in the two-group case and no optimal solution is available. Notably, given any finite sample, the optimal matched sample minimizing the total distance within matched sets does exist. However, this optimization problem is NP-hard when the number of treatments is larger than two, meaning that no algorithm is known that identifies the solution in polynomial time (Karp, 1972). Therefore, optimal solutions exist, but they are not practically identifiable in a reasonable computation time.

To fill this gap, Lu et al. (2001) and Lu and Rosenbaum (2004) introduced the optimal nonbipartite matching design for multiple treatment groups. The optimality is achieved by relaxing the requirement of having units from each of the treatment groups in the matched sets. Unfortunately, the resulting design is a paired structure

that cannot be used to compare all groups directly. To create matched sets with subjects from all treatment arms, Rassen et al. (2013) discussed applications of the popular Nearest Neighbor (NN) algorithm, using different distance metrics. The major issue with NN algorithms is that the overall matching quality can be poor, as has been shown in the two-group case (Rosenbaum, 1989). Simple extensions of optimal two-group matching to three-group settings have been implemented in empirical research (Lu et al., 2012; Shi et al., 2016). These studies used optimal matching to generate pairs between a reference, or anchor, group (arbitrarily selected) and the other treatment groups. However, the distances between units in the non-anchor groups were not taken into account, and it is easy to construct examples where this approach performs poorly.

Lopez and Gutman (2017) described a two-step algorithm that can be used to form matched sets on the basis of the generalized propensity score. First, the dimensionality of the matching problem is reduced by grouping subjects on the basis of a subset of the components of the propensity score, using clustering techniques. Then, subjects are matched within clusters on the remaining components of the propensity score. One of the main limitations of the algorithm is the necessity to trim the study sample to create good overlap of the propensity score distributions across treatment groups. This limits the interpretability and the generalizability of the results.

Recently, Bennett et al. (2018) proposed a procedure to construct matched samples satisfying fine balance constraints in multiple-treatment designs. In the presence of binary treatments, computationally efficient fine-balance algorithms prioritize good marginal balance in the covariates over small within-pair distances in terms of covariates or propensity score (Zubizarreta, 2012).
To extend this approach to multiple-treatment designs, the authors proposed to match all the treatment groups

to a template sample, which is chosen to be similar to the ideal target population. Nevertheless, the distances within matched sets can be far from optimal, because the adopted two-group algorithm primarily targets marginal balance instead of small total distances.

Other researchers have focused on algorithms that generate matched sets with less stringent structures, producing stratification-like designs. Sävje et al. (2017) recently proposed a computationally efficient algorithm that generalizes full matching to the multiple-treatment case. To make the solution feasible, the authors relax the classic full matching design, allowing the construction of matched sets with more than one subject from each treatment group. Despite its computational efficiency, such designs tend to produce imbalanced matched sets, which complicates subsequent statistical analyses (Gu and Rosenbaum, 1993).

Chapter 2 introduces a matching algorithm for the multiple-treatment case, which is designed to generate matched sets characterized by small total distance. The chapter also describes post-matching statistical analyses. The methodology is applied to a comparative study of mortality across trauma center levels, which motivated the methodological research discussed throughout the chapter.

1.5.2 Complex Survey Data

Population surveys are invaluable data sources for policy research. Representativeness of the target population is guaranteed by appropriate sampling designs. As introduced in Section 1.2, in complex survey designs the sample may not be selected with simple random sampling from the finite population. Common sampling methods include systematic, stratified and cluster sampling (Levy and Lemeshow, 2013). The appropriate method is generally chosen to guarantee that the sample will be representative of the finite population while minimizing the costs of the data

collection. Methods to infer causal effects in the sample, such as the sample ATE or ATT, are well-established and discussed in Section 1.4. Little attention has been dedicated to the estimation of population and superpopulation effects when the study sample is selected with complex sampling designs, where researchers must take into account the survey design, the sampling weights and the observational nature of the data (Lenis et al., 2017).

The first attempts to estimate population treatment effects in complex designs used heuristic methodology based on weighted estimators (Zanutto et al., 2005; Zanutto, 2006). The idea is to interpret the study sample as the result of two selection stages from the finite population, as represented in Figure 1.1: on one hand, the treatment selection, based on the individual probabilities of receiving the treatment (namely, the propensity score); on the other hand, the sample selection, based on the survey sampling probabilities, which are often known by design. The authors proposed to construct the overall probability underlying the two-stage selection as the product of the survey probability and the propensity score. Estimates of population treatment effects were generated using weights defined as the inverse of this probability.

Formal methodological justifications of this family of estimators have been described in the literature (Wang et al., 2009; Ashmead, 2014; Ridgeway et al., 2015). However, previous studies are mainly confined to single-stage sampling designs. The only exception is the very recent work of Yang (2018), who described a weighted estimator of the population ATE in two-stage cluster sample surveys. Matched estimators have also been considered to estimate population treatment effects. Ashmead (2014), Austin et al. (2018) and Lenis et al. (2017) recently described simulation analyses investigating the performance of propensity score matching to estimate population effects in complex survey designs.
On the basis of the simulation

results, the authors provide guidelines for matching designs. Despite the recent developments in the field, there is still no consensus on all aspects of causal inference methodology for complex survey designs. The method that should be used to estimate the propensity score is one central element on which previous studies disagree. Wang et al. (2009), Ashmead (2014) and Ridgeway et al. (2015) recommended estimating the propensity score model with weighted regression models. Yang (2018) proposed a complex algorithm to estimate a calibrated propensity score model, which is designed to provide good covariate balance between treatment groups. The author presents treatment effect estimators that use the calibrated propensity score, which is described as robust with respect to misspecification of the model form and with respect to unmeasured cluster-specific variables. Zanutto (2006), DuGoff et al. (2014) and Lenis et al. (2017) affirmed that incorporating survey weights in estimating the propensity score is not necessary, since the balancing property of the propensity score is only required at the sample level. In the simulations carried out by Lenis et al. (2017), weighted and unweighted propensity score models performed similarly in the estimation of the treatment effect.

Chapter 3 is devoted to an estimator of the population and superpopulation ATE for two-stage cluster sampling survey designs, which have received little attention in the literature. I describe a weighted estimator, which naturally combines the survey weights with the propensity score. I introduce the properties of this estimator and a comparison of its performance with competing methods. The role of survey weights in the estimation of the propensity score is a key factor evaluated in the simulation analysis.
The methodology is applied to the 2015 Medical Expenditure Panel Survey (MEPS) data, to quantify the causal effect of health insurance coverage on the decision to seek medical care after an injury.
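The heuristic combined-weight idea described above (survey weight times inverse propensity weight) can be sketched as follows. The data, the constant treatment effect and all names are illustrative, and the true selection probabilities and propensity scores are assumed known; this is a sketch of the weighting logic, not the estimator developed in Chapter 3.

```python
# Sketch: estimating the population ATE by combining the survey weight
# w = 1/pi (inverse selection probability) with the inverse propensity
# weight, so each sampled unit is reweighted for both selection stages.
import numpy as np

def population_ate(y, z, e, survey_w):
    wt1 = survey_w * z / e              # treated: survey weight x 1/e(X)
    wt0 = survey_w * (1 - z) / (1 - e)  # controls: survey weight x 1/(1-e(X))
    return np.sum(wt1 * y) / np.sum(wt1) - np.sum(wt0 * y) / np.sum(wt0)

# toy finite population with a constant treatment effect of 2,
# sampled with unequal (known) selection probabilities
rng = np.random.default_rng(4)
N = 20000
x = rng.normal(size=N)
e_pop = 0.1 + 0.8 / (1 + np.exp(-x))   # treatment selection probability
z_pop = rng.binomial(1, e_pop)
y_pop = x + 2 * z_pop + rng.normal(size=N)
pi = 0.05 + 0.15 / (1 + np.exp(-x))    # survey selection probability
s = rng.binomial(1, pi).astype(bool)   # sampled units
est = population_ate(y_pop[s], z_pop[s], e_pop[s], 1 / pi[s])
```

Both the covariate-dependent sampling and the covariate-dependent treatment assignment are corrected by the combined weights, so `est` targets the finite-population ATE.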

1.5.3 Generalized Intervention Effects

Figure 1.2 provides a graphical representation of the comparisons evaluated by the most traditional marginal causal effects, the ATE and ATT, in an example with a binary treatment. The ATE compares the average outcome value in two counterfactual scenarios where all the subjects receive either of two treatment levels (panel (b)). For instance, in a cohort of patients eligible to receive two drugs, this effect may be used to determine the drug that is associated, on average, with better outcomes. The ATT evaluates a similar comparison, but focuses on the subjects that received the treatment Z = 1 (panel (c)).

Policy makers, however, are often interested in a different type of effect, evaluating the impact of interventions that modify the distribution of the treatment in the target population. For instance, decision makers may be interested in quantifying the effect of a policy change that increases the proportion of treated subjects. In the study motivating my work in this area, state agencies are interested in evaluating how the preterm birth rate would change if it were possible to increase the proportion of nicotine-dependent pregnant women enrolling in smoking cessation programs. In particular, given the finite amount of resources to implement programs and interventions, stakeholders need to consider scenarios with partial modifications of the treatment status, as the possibility of increasing the proportion of the treated to 100% of the cohort is unrealistic in most applications. For example, in the context of the motivating study, suppose that only 5% of the nicotine-dependent pregnant women currently receive a particular type of smoking cessation treatment. The study team believes that the preterm birth rate could be reduced if a greater percentage of women could be convinced to receive this treatment. While it is very unlikely that 100% of women could be convinced to enroll, it might be possible to dedicate some

[Figure 1.2 shows four panels: (a) the observed cohort, a mix of Z = 0 and Z = 1 subjects; (b) the ATE comparison, Z = 1 vs. Z = 0 over the whole cohort; (c) the ATT comparison, Z = 1 vs. Z = 0 among the treated; (d) the IE comparison, a modified Z = 0/Z = 1 mix vs. the observed Z = 0/Z = 1 mix.]

Figure 1.2: Graphical representation of the causal comparisons evaluated by the ATE, ATT and IE.

additional resources in order to increase the enrollment in this smoking cessation program to, say, 10% or 20%. A reasonable question is what would be the impact on the preterm birth rate if the proportion of subjects receiving smoking cessation treatment could be increased by some specified amount. In such cases, the comparison between a counterfactual scenario with an increased proportion of treated subjects and the real, factual, cohort is the most informative contrast for policy makers (Figure 1.2, panel (d)). Such an effect has been referred to as a “generalized intervention effect” or “population intervention effect” (Ahern, 2016; Westreich, 2017). I will denote it simply as Intervention Effect (IE). It has been the

target of sporadic studies over the past three decades (Browner, 1986; Bulterys et al., 1997) and it has recently gained popularity, because it translates research efforts into valuable measures for policy makers (Ahern, 2016). Nevertheless, a formal definition of this effect within the potential outcome framework has not yet been provided in the literature.

Only a few studies have provided guidance on the methodology to estimate the IE and, in most cases, they have adopted the parametric g-formula (Ahern et al., 2009; Westreich, 2014; Ahern et al., 2016). Both fixed-time and time-varying treatments have been considered in the literature. In the latter case, the estimation of the effect is more complex, as the intervention on the treatment Z may vary over time. Hence, it is necessary to deal with correlated outcome occurrences and to resort to longitudinal models for the outcome (Taubman et al., 2009; Westreich, 2014). Moreover, because interventions on time-varying treatments may have very complex effects on both future treatment status and time-varying covariates, these interventions have been studied under several simplifying assumptions (Westreich, 2014).

Most of the research targeting the estimation of the IE has focused on scenarios with continuous exposures. A possible explanation is that, for quantitative treatments, it is easier to specify plausible interventions on the treatment Z without having to define, one by one, the units selected for such modification. For example, Ahern et al. (2016) described a study investigating the effect of alcohol outlet density on binge drinking. The authors considered modifications of the distribution of the quantitative treatment (alcohol outlet density) by setting pre-specified upper limits to the value of the treatment (e.g., 60 outlets per square mile). All the treatment values exceeding the pre-specified threshold were replaced with the value of the upper limit.
In this way, the study estimated the effect of a reduction in the maximum alcohol outlet density on the overall rate of binge drinking in the cohort

under study. The same strategy does not immediately translate to scenarios with categorical treatments. In this case, it is not possible to truncate the treatment value at pre-specified thresholds. In order to evaluate the impact of an intervention, researchers must specify the subset of subjects whose treatment levels are modified. This comes with an additional challenge whenever the effect of the treatment is heterogeneous: the impact of the intervention will depend on the selected subset. Westreich (2014) studied the estimation of the IE in a scenario where both treatment and outcome are binary. The author simulated several fictitious cohorts where, in each simulation, a constant proportion of control units was assigned to the treatment. The control units targeted by the modification of the treatment assignment were randomly selected. The effect estimated with this procedure corresponds to an average over simulated interventions. This approach is feasible, but computationally intensive. Moreover, this estimate might poorly predict the impact of an intervention modifying the treatment distribution in the cohort, if such intervention does not target a representative sample of the cohort.

Chapter 4 is devoted to the estimation of population intervention effects. I introduce a formal definition of the IE within the potential outcome framework and propose a simple estimator for the upper and lower bounds of this effect. I focus on scenarios with binary and categorical treatments, which have received little attention in previous research. I illustrate the proposed approach with a study investigating the effect of smoking cessation interventions on the number of preterm deliveries in a cohort of pregnant women.
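The simulated-intervention idea attributed to Westreich (2014) above can be sketched as follows, under an assumed linear outcome model. The model, names and simulated data are purely illustrative; the original work considers different models and settings.

```python
# Sketch of a simulated-intervention IE estimate for a binary treatment:
# repeatedly move a random subset of controls to treatment until the
# treated proportion reaches a target, predict outcomes under the modified
# assignment with an outcome model, and average over simulations.
import numpy as np

def simulated_ie(y, z, X, target_prop, n_sims=200, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    D = np.column_stack([np.ones(len(y)), z, X])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)  # outcome model (OLS)
    controls = np.flatnonzero(z == 0)
    n_switch = int(round(target_prop * len(y))) - int(z.sum())
    means = []
    for _ in range(n_sims):
        z_new = z.copy()
        # randomly pick which controls are moved to treatment
        z_new[rng.choice(controls, size=n_switch, replace=False)] = 1
        Dn = np.column_stack([np.ones(len(y)), z_new, X])
        means.append(np.mean(Dn @ beta))
    # intervention scenario vs. the factual cohort
    return np.mean(means) - np.mean(y)

# toy cohort: true effect 2, treated proportion raised from ~30% to 50%
rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)
z = rng.binomial(1, 0.3, size=n)
y = 1 + 2 * z + x + rng.normal(size=n)
ie = simulated_ie(y, z, x[:, None], target_prop=0.5, rng=rng)
```

With a homogeneous effect of 2 and roughly 20 percentage points of additional treated subjects, the IE estimate should be near 0.4, illustrating the averaging-over-random-subsets logic and, implicitly, why the choice of the targeted subset matters when effects are heterogeneous.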

Chapter 2 Multiple Treatment Groups

Matching has unique advantages over other approaches to estimate causal effects in observational studies. However, this methodology is rarely used in the presence of multiple treatment groups, partially because of the limitations of the available algorithms. This chapter introduces a new matching algorithm for observational studies with multiple-treatment designs, aiming to create matched sets characterized by small total distance. The chapter is organized as follows. The new matching algorithm is described in Section 2.1. Section 2.2 discusses a strategy to conduct post-matching outcome analyses. A simulation study comparing the performance of the proposed algorithm with the principal competing method, the NN algorithm, is described in Section 2.3. Section 2.4 describes a comparative study of mortality across trauma center levels, which motivated the methodological research discussed throughout the chapter.

2.1 Conditionally Optimal Matching Algorithm

2.1.1 Algorithm Setup

The goal of the proposed matching algorithm is to identify matched samples characterized by small total distance. The algorithm is structured in two main steps. First, a starting matched sample is generated. I suggest one specific solution to construct this starting point, even though, for this purpose, existing matching algorithms might be employed as well. The second step involves an iterative procedure, which explores improvements in the quality of matching. At each iteration, a subset of L of the K treatment groups is selected and each K-tuple is split into two matched sets: one L-tuple from the L selected groups and one (K − L)-tuple from the remaining groups. In other words, this process relaxes the links between subjects from the L groups and the remaining groups. Then, the two families of fixed L-tuples and (K − L)-tuples are rematched, using the optimal bipartite algorithm. The process is iterated until the total distance cannot be reduced further. In particular, a measure of within-K-tuple dissimilarity is used to quantify the quality of matching.

The algorithm takes advantage of the optimal solution to the two-group matching problem, which can be found in polynomial time. Because the algorithm iteratively matches two families of fixed matched sets and the optimality is achieved conditioning on a partially matched structure, we refer to it as conditionally optimal. The setup of the algorithm is formally introduced in this section in the general scenario of K treatment groups. Section 2.1.2 presents the algorithm in the case where K = 3. In this case there is only one possible way to split the existing matched sets (L = 1, K − L = 2). Extensions to designs with K > 3 are discussed in Section 2.1.3.

Denote by n_z the size of treatment group z ∈ Z in the study sample. The proposed algorithm constructs a matched sample with S = min_{z∈Z} n_z matched sets, with one subject per treatment group. Without loss of generality, suppose that the first group is the smallest, i.e., S = n_1. Let I be the index set of the units in the first treatment group (the smallest) and let J_1, ..., J_{K−1} be the index sets of the units in the other K − 1 treatment groups.

The algorithm is based on a distance, which measures differences within K-tuples in terms of the matching variables, which may be the propensity score vector or the covariates. For multiple-treatment matching, a K-dimensional distance metric must be defined. Let d^K(i, j_1, ..., j_{K−1}) be the distance within the K-tuple involving units {i, j_1, ..., j_{K−1}}. I focus on distances of the form

d^K(i, j_1, ..., j_{K−1}) = Σ_{z=1}^{K−1} d^2(i, j_z) + Σ_{1 ≤ a < b ≤ K−1} d^2(j_a, j_b),   (2.1)

where d^2(·,·) is a two-way distance between units. A natural choice based on the generalized propensity score is d^2(j_a, j_b) = ‖e(X_{j_a}) − e(X_{j_b})‖_2, where ‖·‖_2 is the Euclidean norm. In the case of three treatment groups, the three-way distance d^3(i, j_1, j_2) induced by this choice corresponds to the perimeter of the triangle defined by the points e(X_i), e(X_{j_1}) and e(X_{j_2}).

The matched sample is denoted by M = {(i, j_1(i), ..., j_{K−1}(i))}_{i∈I}, a collection of S K-tuples, where the index j_z(i) identifies the subject from group z matched to subject i, with i ∈ I and j_z(i) ∈ J_z. The total distance associated with the matched sample M is defined as

D(M) = Σ_{i∈I} d^K(i, j_1(i), ..., j_{K−1}(i)),   (2.2)

i.e., the sum of the distances within K-tuples.
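A direct transcription of the distances in Equations (2.1) and (2.2), for the Euclidean choice of d^2, can be sketched as follows; all names are illustrative.

```python
# Distances (2.1) and (2.2) with Euclidean two-way distances between
# generalized propensity score vectors.
import numpy as np
from itertools import combinations

def d2(a, b):
    # two-way distance: Euclidean norm between propensity score vectors
    return np.linalg.norm(a - b)

def dK(points):
    # K-way distance (Eq. 2.1): sum of all pairwise two-way distances
    return sum(d2(a, b) for a, b in combinations(points, 2))

def total_distance(matched_sets):
    # total distance D(M) (Eq. 2.2): sum of within-tuple distances
    return sum(dK(tpl) for tpl in matched_sets)

# K = 3: the distance is the perimeter of the propensity score triangle
e_i = np.array([0.2, 0.5])
e_j1 = np.array([0.2, 0.1])
e_j2 = np.array([0.5, 0.5])
perim = dK([e_i, e_j1, e_j2])  # sides 0.4, 0.3, 0.5 -> perimeter 1.2
```

Note that Equation (2.1) simply enumerates every pair within the K-tuple once, which is why a sum over all pairwise combinations reproduces it.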

2.1.2 Matching Algorithm for Three Treatment Groups

The iterative procedure introduced in Section 2.1.1 can be implemented in only one way in the case of K = 3 treatment groups. At each step, one treatment group (L = 1) is selected and the connection of the existing triplets to the selected group is relaxed. Subjects from the selected group are then optimally rematched to the fixed pairs of the remaining K − L = 2 groups.

To generate the starting matched sample, I propose a simple procedure that uses two two-group matching steps. A formal description of the algorithm is provided by the following points:

Step 1: Generate the starting matched sample.

Step 1.1: Select two treatment groups and match them with the optimal two-group matching procedure. Without loss of generality, label these two groups as 1 and 2 and the remaining group as 3.

Step 1.2: Optimally match subjects from group 3 to the 1-2 pairs defined in Step 1.1. Let M^(0)_{1,2} be this set of initial matched triplets and let D(M^(0)_{1,2}) be the total distance associated with the constructed matched sample. The subscript "1,2" emphasizes the fact that the matching is conditional on the fixed 1-2 pairs.

Step 2: Explore potential reductions of the total distance with conditional iterations. For each n ≥ 1, consider the matched set M^(n−1)_{z_1,z_2} and the associated total distance D(M^(n−1)_{z_1,z_2}) resulting from the previous iteration, and repeat the following steps:

Step 2.1: Fix the z_2-z_3 pairs within the triplets of M^(n−1)_{z_1,z_2} and optimally rematch such pairs with the subjects in group z_1.

Step 2.2: Fix the z_1-z_3 pairs within the triplets of M^(n−1)_{z_1,z_2} and optimally rematch such pairs with the subjects in group z_2.

Step 2.3: Let M^(n)_{z_2,z_3} and M^(n)_{z_1,z_3} be the matched sets generated at Steps 2.1 and 2.2, and let D(M^(n)_{z_2,z_3}) and D(M^(n)_{z_1,z_3}) be their respective total distances. If both D(M^(n)_{z_2,z_3}) and D(M^(n)_{z_1,z_3}) are greater than D(M^(n−1)_{z_1,z_2}), stop the iterations: the new matched sets do not decrease the total distance. Otherwise, select the matched sample corresponding to the smallest total distance.

Figure 2.1 provides a graphical representation of the first step of the algorithm. At each iteration, the procedure explores a potential reduction in the total distance by changing the two groups whose pairs are fixed.


Figure 2.1: First step of the conditionally optimal matching algorithm in a three- group design. First, groups 1 and 2 are optimally matched. Second, subjects in group 3 are optimally matched to the pairs formed in Step 1.1.
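One conditional rematching step (as in Steps 2.1 and 2.2) can be sketched as a bipartite assignment problem. In this hedged sketch, scipy's Hungarian-algorithm solver stands in for the optimal two-group matching procedure, and all names and toy scores are illustrative.

```python
# Sketch of one conditional iteration for K = 3: fix the pairs between two
# groups and optimally rematch the third group to those fixed pairs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def perimeter(a, b, c):
    # three-way distance: perimeter of the propensity score triangle
    return (np.linalg.norm(a - b) + np.linalg.norm(a - c)
            + np.linalg.norm(b - c))

def rematch_third_group(fixed_pairs, e_third):
    """Optimally assign third-group subjects to fixed two-group pairs.

    fixed_pairs: list of (score_a, score_b) tuples for the fixed pairs.
    e_third: array of score vectors for the subjects in the relaxed group.
    """
    cost = np.array([[perimeter(a, b, e_third[j])
                      for j in range(len(e_third))]
                     for a, b in fixed_pairs])
    rows, cols = linear_sum_assignment(cost)  # optimal bipartite matching
    return cols, cost[rows, cols].sum()

# toy example: two fixed pairs, two candidate subjects in group 3
fixed_pairs = [(np.array([0.10]), np.array([0.15])),
               (np.array([0.80]), np.array([0.75]))]
e_third = np.array([[0.78], [0.12]])
cols, total = rematch_third_group(fixed_pairs, e_third)
# the pair near 0.1 receives the 0.12 subject, the pair near 0.8 the 0.78 one
```

Iterating this step while cycling the relaxed group, and keeping the best of the resulting matches, is exactly the conditional improvement loop described above.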

There is no guarantee that the algorithm converges to the global optimum, i.e., the matched sample attaining the minimum total distance. However, by design, each iteration cannot decrease the quality of matching. This is shown in Proposition 2.1, which proves that the total distance of the solution cannot be larger than the total distance of the starting matched sample.

Proposition 2.1. Given any starting triplet match M_0, the conditionally optimal matching algorithm will produce a new match, M_CO, with total distance no larger than the initial one, i.e., D(M_CO) ≤ D(M_0).

Proof. The total distance of the matched sample M_0 = {(i, j_1(i), j_2(i))}_{i∈I} is D(M_0) = Σ_{i∈I} d^3(i, j_1(i), j_2(i)). When applying the conditionally optimal matching algorithm, we first fix one edge of the triplet, say between groups 1 and 2. We then try to find a new set of subjects from group 3 to minimize the total distance with the fixed {(i, j_1(i))}_{i∈I} pairs:

arg min_{j'_2(i) ∈ J_2} Σ_{i∈I} d^3(i, j_1(i), j'_2(i)).

This becomes a two-group matching problem, where the goal is to match the pairs {(i, j_1(i))}_{i∈I} to the subjects in group 3. Define the two-way distance between the pair (i, j_1) and subject j_2 as

d^2((i, j_1), j_2) = d^3(i, j_1, j_2).

The optimal bipartite matching algorithm can be used to identify the optimal solution to this problem. That is, we can identify subjects {j_2^*(i)}_{i∈I} from group 3 such that Σ_{i∈I} d^2((i, j_1(i)), j_2^*(i)) ≤ Σ_{i∈I} d^2((i, j_1(i)), j'_2(i)) for any choice {j'_2(i)}_{i∈I}. In particular, Σ_{i∈I} d^2((i, j_1(i)), j_2^*(i)) ≤ Σ_{i∈I} d^2((i, j_1(i)), j_2(i)).

Denote the new triplet match {(i, j_1(i), j_2^*(i))}_{i∈I} with M^(1)_{1,2}. Notably,

D(M^(1)_{1,2}) = Σ_{i∈I} d^3(i, j_1(i), j_2^*(i))
= Σ_{i∈I} d^2((i, j_1(i)), j_2^*(i))
≤ Σ_{i∈I} d^2((i, j_1(i)), j_2(i))
= Σ_{i∈I} d^3(i, j_1(i), j_2(i)) = D(M_0).

Therefore, we have D(M^(1)_{1,2}) ≤ D(M_0). This result applies to each iteration of the algorithm and proves that the total distance cannot increase at any iteration. Therefore, denoting with M_CO the final matched set, we have D(M_CO) ≤ D(M_0).

The proposition has two implications. First, even though the algorithm does not necessarily converge to the globally optimal solution, the iterations end in a local optimum, where relaxing the connection to each group and optimally rematching it to the remaining pairs cannot reduce the total distance further. Second, the final result potentially depends on the arbitrary choice of the two treatment groups matched in Step 1.1. To obtain the best result, the procedure can be applied three times, once for each starting combination, and the matched sample with the smallest total distance can be selected.

Step 1 describes a simple approach to generate the starting matched sample. However, the algorithm is very flexible and the initializing set of triplets can be constructed with any matching procedure. This allows the use of the proposed algorithm to explore potential improvements upon the result of any existing three-way matching algorithm. For example, the result of the NN procedure can be used as the starting point and the conditionally optimal algorithm can be used to search for possible reductions in the total distance. Proposition 2.1 guarantees that the resulting matched sample cannot be worse than the NN solution.

The solution of the proposed algorithm has another appealing property. Even though the algorithm might not converge to the global optimum, the total distance of the solution is bounded by the optimal distance multiplied by a factor of at most 2. Proposition 2.2 describes this property.

Proposition 2.2. Let $\{(i, j_1^{2\text{-opt}}(i))\}_{i \in I}$ and $\{(i, j_2^{2\text{-opt}}(i))\}_{i \in I}$ be the optimal pairs resulting from the optimal two-group matching between groups 1-2 and 1-3, respectively, and let $M_{OPT} = \{(i, j_1^{opt}(i), j_2^{opt}(i))\}_{i \in I}$ be the optimal set of triplets. Then:

$$D(M_{CO}) \le D(M_{OPT}) + \min\left\{\sum_{i \in I} d^2(j_1^{2\text{-opt}}(i), j_1^{opt}(i)), \sum_{i \in I} d^2(j_2^{2\text{-opt}}(i), j_2^{opt}(i))\right\} \le 2 D(M_{OPT}).$$

A similar formulation of the inequality is:

$$D(M_{CO}) \le D(M_{OPT}) + \frac{1}{2}\left[\sum_{i \in I} d^2(j_1^{2\text{-opt}}(i), j_1^{opt}(i)) + \sum_{i \in I} d^2(j_2^{2\text{-opt}}(i), j_2^{opt}(i))\right] \le 2 D(M_{OPT}).$$

Proof. Consider the set $M^{(0)}_{1,2} = \{(i, j_1^{2\text{-opt}}(i), j_2^*(i))\}_{i \in I}$, generated as the first iteration of the three-way conditionally optimal algorithm. In particular, the elements $\{j_2^*(i)\}_{i \in I}$ from the third group are chosen to minimize the total distance between the pairs $\{(i, j_1^{2\text{-opt}}(i))\}_{i \in I}$ and the elements of the third group. By the distance shortening property, we have:

$$D(M_{CO}) \le D(M^{(0)}_{1,2}) = \sum_{i \in I} \left[ d^2(i, j_1^{2\text{-opt}}(i)) + d^2(i, j_2^*(i)) + d^2(j_1^{2\text{-opt}}(i), j_2^*(i)) \right].$$

Because the $\{j_2^*(i)\}_{i \in I}$ minimize the total distance given the pairs $\{(i, j_1^{2\text{-opt}}(i))\}_{i \in I}$, the total distance of the triplets $\{(i, j_1^{2\text{-opt}}(i), j_2^{opt}(i))\}_{i \in I}$ is no smaller than $D(M^{(0)}_{1,2})$:

$$\sum_{i \in I} \left[ d^2(i, j_1^{2\text{-opt}}(i)) + d^2(i, j_2^*(i)) + d^2(j_1^{2\text{-opt}}(i), j_2^*(i)) \right] \quad (2.3)$$

$$\le \sum_{i \in I} \left[ d^2(i, j_1^{2\text{-opt}}(i)) + d^2(i, j_2^{opt}(i)) + d^2(j_1^{2\text{-opt}}(i), j_2^{opt}(i)) \right]. \quad (2.4)$$

Moreover, since the pairs $\{(i, j_1^{2\text{-opt}}(i))\}_{i \in I}$ are optimal:

$$\sum_{i \in I} d^2(i, j_1^{2\text{-opt}}(i)) \le \sum_{i \in I} d^2(i, j_1^{opt}(i)). \quad (2.5)$$

Using this result and the triangle inequality $d^2(j_1^{2\text{-opt}}(i), j_2^{opt}(i)) \le d^2(j_1^{2\text{-opt}}(i), j_1^{opt}(i)) + d^2(j_1^{opt}(i), j_2^{opt}(i))$ on the last component of Equation (2.4):

$$D(M_{CO}) \le \sum_{i \in I} \left[ d^2(i, j_1^{2\text{-opt}}(i)) + d^2(i, j_2^{opt}(i)) + d^2(j_1^{2\text{-opt}}(i), j_2^{opt}(i)) \right]$$

$$\le \sum_{i \in I} \left[ d^2(i, j_1^{opt}(i)) + d^2(i, j_2^{opt}(i)) + d^2(j_1^{opt}(i), j_2^{opt}(i)) + d^2(j_1^{2\text{-opt}}(i), j_1^{opt}(i)) \right]$$

$$\le D(M_{OPT}) + \sum_{i \in I} d^2(j_1^{2\text{-opt}}(i), j_1^{opt}(i)). \quad (2.6)$$

Using analogous inequalities starting from the set of pairs $\{(i, j_2^{2\text{-opt}}(i))\}_{i \in I}$, it is possible to show the following result:

$$D(M_{CO}) \le D(M_{OPT}) + \sum_{i \in I} d^2(j_2^{2\text{-opt}}(i), j_2^{opt}(i)). \quad (2.7)$$

From Equations (2.6) and (2.7), we have:

$$D(M_{CO}) \le D(M_{OPT}) + \min\left\{\sum_{i \in I} d^2(j_1^{2\text{-opt}}(i), j_1^{opt}(i)), \sum_{i \in I} d^2(j_2^{2\text{-opt}}(i), j_2^{opt}(i))\right\}.$$

Summing Equations (2.6) and (2.7), we have the second inequality:

$$D(M_{CO}) \le D(M_{OPT}) + \frac{1}{2}\left[\sum_{i \in I} d^2(j_1^{2\text{-opt}}(i), j_1^{opt}(i)) + \sum_{i \in I} d^2(j_2^{2\text{-opt}}(i), j_2^{opt}(i))\right].$$

If $\sum_{i \in I} d^2(j_1^{2\text{-opt}}(i), j_1^{opt}(i)) \le D(M_{OPT})$ and $\sum_{i \in I} d^2(j_2^{2\text{-opt}}(i), j_2^{opt}(i)) \le D(M_{OPT})$, the proof is complete, because

$$D(M_{CO}) \le D(M_{OPT}) + \min\{D(M_{OPT}), D(M_{OPT})\} = 2 D(M_{OPT}),$$

and

$$D(M_{CO}) \le D(M_{OPT}) + \frac{1}{2}\left[D(M_{OPT}) + D(M_{OPT})\right] = 2 D(M_{OPT}).$$

We show that $\sum_{i \in I} d^2(j_1^{2\text{-opt}}(i), j_1^{opt}(i)) \le D(M_{OPT})$ using triangle inequalities and the result in Equation (2.5); the proof that $\sum_{i \in I} d^2(j_2^{2\text{-opt}}(i), j_2^{opt}(i)) \le D(M_{OPT})$ is analogous:

$$\sum_{i \in I} d^2(j_1^{2\text{-opt}}(i), j_1^{opt}(i)) \le \sum_{i \in I} \left[ d^2(j_1^{2\text{-opt}}(i), j_2^{opt}(i)) + d^2(j_2^{opt}(i), j_1^{opt}(i)) \right]$$

$$\le \sum_{i \in I} \left[ d^2(i, j_1^{2\text{-opt}}(i)) + d^2(i, j_2^{opt}(i)) + d^2(j_2^{opt}(i), j_1^{opt}(i)) \right]$$

$$\le \sum_{i \in I} \left[ d^2(i, j_1^{opt}(i)) + d^2(i, j_2^{opt}(i)) + d^2(j_2^{opt}(i), j_1^{opt}(i)) \right]$$

$$= D(M_{OPT}).$$

2.1.3 Extensions to More than Three Treatment Groups

The same general idea of the algorithm can be extended to designs with more than three treatment groups. Before describing this extension, however, it is important to note that the advantages of distance-based matching procedures scale poorly to designs involving a very large number of treatments. Suppose that the vector of the covariates is high-dimensional, as is the case in most practical applications. In such designs, distance-based matching algorithms rely on the dimensionality reduction property of the propensity score methodology (see Section 1.4.2). However, when K increases, the dimension of the propensity score increases as well, and the matching problem takes place in a space that becomes increasingly sparse. In this case, identifying subjects with similar values of the propensity score is a problem that suffers from the "curse of dimensionality" (Linden et al., 2016). Restricting attention, then, to designs with a small-to-moderate number of treatment groups, there are multiple possible implementations of the general idea of our algorithm, because there are multiple ways to split the K-tuples at each step.

For example, with K = 4 treatment groups, the matched sets could be split either into two sets of pairs or into one set of triplets and one set of singletons. The number of possible strategies increases with K. For K = 5, matched sets could be split into singletons and quadruplets, into pairs and triplets, or into three groups (two sets of pairs and one set of singletons). Among the many possible strategies, the approach that most easily generalizes to any value of K is to split matched sets into units from one treatment group and (K − 1)-tuples from the remaining K − 1 treatment groups. At each step, the selected group rotates among the K groups, to explore possible reductions in the total distance. The iterative portion of the conditionally optimal algorithm (i.e., Step 2) is therefore extended to the general case of K treatment groups by the following procedure:

Step 2: For each $n \ge 1$, consider the matched set $M^{(n-1)}$ and the associated total distance $D(M^{(n-1)})$, resulting from the previous iteration. Repeat the following steps:

Step 2.1: For $s$ from 1 to $K$, relax the connection of the matched sample $M^{(n-1)}$ to treatment group $s$ and rematch the resulting $(K-1)$-tuples to group $s$ with optimal bipartite matching. Denote the generated matched sample with $M^{(n)}_s$.

Step 2.2: If $D(M^{(n-1)}) \le D(M^{(n)}_s)$ for all $s = 1, \dots, K$, stop the iterations: the new matched sets do not decrease the total distance. Otherwise, define $M^{(n)} = M^{(n)}_{\tilde{s}}$, where

$$\tilde{s} = \operatorname{argmin}_{s = 1, \dots, K} D(M^{(n)}_s),$$

and proceed with the next iteration.
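The iterative step above can be sketched in code. The snippet below is a minimal illustration, not the dissertation's implementation: it assumes a single matching variable per subject, uses the sum of pairwise squared differences as the K-tuple distance, and relies on `scipy.optimize.linear_sum_assignment` for the optimal bipartite rematching; all function names are mine.

```python
import itertools

import numpy as np
from scipy.optimize import linear_sum_assignment


def total_distance(X, match):
    """Total distance of a matched sample: for each matched K-tuple, the sum
    of pairwise squared covariate differences (one possible choice of d)."""
    D = 0.0
    for row in match:
        vals = [X[g][row[g]] for g in range(len(X))]
        D += sum((a - b) ** 2 for a, b in itertools.combinations(vals, 2))
    return D


def rematch_group(X, match, s):
    """Step 2.1: relax group s and optimally rematch its subjects to the
    fixed (K-1)-tuples via optimal bipartite matching."""
    others = [g for g in range(len(X)) if g != s]
    fixed = np.stack([X[g][match[:, g]] for g in others])      # (K-1, S)
    # cost[t, j]: distance added by assigning subject j of group s to tuple t
    cost = ((fixed[:, :, None] - X[s][None, None, :]) ** 2).sum(axis=0)
    rows, cols = linear_sum_assignment(cost)                   # needs S <= n_s
    new_match = match.copy()
    new_match[rows, s] = cols
    return new_match


def conditionally_optimal(X, match0):
    """Step 2: rotate the relaxed group until no single-group rematch
    reduces the total distance (a local optimum)."""
    match, D = match0, total_distance(X, match0)
    while True:
        candidates = [rematch_group(X, match, s) for s in range(len(X))]
        dists = [total_distance(X, m) for m in candidates]
        best = int(np.argmin(dists))
        if dists[best] >= D:
            return match
        match, D = candidates[best], dists[best]
```

By Proposition 2.1, each accepted rematch can only decrease the total distance, so the loop terminates at a local optimum.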

The problem now moves to identifying the best strategy to form the starting matched sample (Step 1). Again, there are multiple ways to generate it. I propose three different strategies:

1. Matching pairs of treatment groups. Organize the treatment groups in pairs (or pairs plus one group, in the case of an odd number of groups). The pairs of groups can be matched with optimal bipartite matching. Keeping the pairs fixed, it is possible to match two sets of pairs, then the quadruplets to another set of pairs, and so on, until all the groups have been matched.

2. Matching groups from the smallest to the largest. Sort the treatment groups from the smallest to the largest. Match the two smallest treatment groups with optimal bipartite matching. Then, fix the pairs generated in the previous step and use optimal bipartite matching to match these pairs to the subjects in the third smallest group. Iterate this procedure, sequentially matching the smallest of the remaining groups to the matched sample, until all the groups have been matched.

3. Matching "by induction". Suppose that a matching algorithm is available for a design with K − 1 treatment groups. Select K − 1 treatment groups and match them with this algorithm. Then, fix the (K−1)-tuples generated in the previous step and use optimal bipartite matching to match them to the subjects in the remaining treatment group.

Selecting the best of these strategies requires further research. I conducted a small batch of exploratory simulations to understand the potential of the three approaches. The preliminary results suggest two hypotheses, which should be verified with extensive simulations. First, leaving the largest treatment groups to the end of the matching procedure appears to yield the best performance. This is because the most numerous treatment groups offer the largest pool of subjects to choose from: when a partially matched structure is fixed, finding subjects similar to existing matched sets is easier if the pool of available subjects offers more possibilities. Second, it appears that the iterative procedure (Step 2 of the algorithm) can "compensate" for low-quality starting points. Therefore, spending computational time generating a first matched set with a very small total distance might not be worthwhile, and simple procedures (such as 1 or 2) might be preferable to time-consuming ones (such as 3).
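The second strategy, smallest-to-largest sequential matching, can be sketched as follows. As above, this is an illustrative sketch with a single matching variable, squared-difference distances, and `scipy`'s assignment solver; the function name is mine.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def initial_match_smallest_first(X):
    """Starting sample, strategy 2: match groups sequentially from the
    smallest to the largest, fixing the partial tuples at each step.
    X: list of 1-D covariate arrays, one per treatment group."""
    K = len(X)
    order = sorted(range(K), key=lambda g: len(X[g]))
    g0, g1 = order[0], order[1]
    # optimal bipartite matching of the two smallest groups
    r, c = linear_sum_assignment((X[g0][:, None] - X[g1][None, :]) ** 2)
    match = {g0: r, g1: c}
    for g in order[2:]:
        # cost of attaching each subject of group g to each fixed partial tuple
        fixed = np.stack([X[h][match[h]] for h in match])
        cost = ((fixed[:, :, None] - X[g][None, None, :]) ** 2).sum(axis=0)
        r, c = linear_sum_assignment(cost)
        match = {h: idx[r] for h, idx in match.items()}
        match[g] = c
    return np.stack([match[g] for g in range(K)], axis=1)  # (S, K) indices
```

The returned array can serve as the starting point (Step 1) for the iterative Step 2 described above.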

2.2 Post-matching Outcome Analysis

2.2.1 Covariate Balance

The principal goal of matching is to create a matched sample where treatment groups are similar in terms of the distribution of the covariates. Before proceeding with the outcome analysis, researchers should verify that covariates are well balanced in the sample generated by the matching algorithm. In the presence of two groups, a common approach is to compute the standardized mean differences (Austin, 2011). A popular rule-of-thumb suggests that values smaller than 10% can be considered as negligible differences. The same strategy can be easily extended in the presence of multiple treatment groups (Lopez and Gutman, 2017). The covariate balance can be evaluated by com- puting all pairwise standardized mean differences among the treatment groups. If the maximum of the pairwise standardized mean differences is smaller than 10%, the covariates may be considered sufficiently balanced and it is possible to move to the outcome analysis.
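The balance check can be sketched for a single covariate as below; the pooled-SD denominator is one common convention for the standardized mean difference, and the function name is mine.

```python
import itertools

import numpy as np


def max_pairwise_smd(groups):
    """Maximum pairwise standardized mean difference of one covariate across
    treatment groups, using the pooled-SD denominator (a common convention)."""
    smds = []
    for a, b in itertools.combinations(groups, 2):
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
        smds.append(abs(a.mean() - b.mean()) / pooled_sd)
    return max(smds)
```

Following the 10% rule-of-thumb, the covariate may be considered balanced when `max_pairwise_smd(...)` is below 0.10; in practice the check is repeated for each matching covariate.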

2.2.2 Statistical Setup

The type of outcome analysis depends on a variety of factors, including the target cohort (e.g., the sample or a superpopulation), the primary hypothesis to test (e.g., Fisher's null or hypotheses about average effects) and the matching design that has been adopted (e.g., one subject from each treatment group or full matching). I focus on matched designs with one subject from each treatment group, which is the structure generated by the proposed algorithm. The matched sample consists of $S$ matched sets, where $\mathbf{Z}_s = (Z_{s1}, Z_{s2}, \dots, Z_{sK})$ and $\mathbf{Y}_s = (Y_{s1}, Y_{s2}, \dots, Y_{sK})$ denote the treatments received and the observed outcomes of individuals in the matched set $s$. Let $\mathbf{Z} = (\mathbf{Z}_1, \dots, \mathbf{Z}_S)$ and $\mathbf{Y} = (\mathbf{Y}_1, \dots, \mathbf{Y}_S)$ be the collections of all the treatment statuses and outcome values.

I restrict my attention to sample-level effects and, in particular, to Fisher's sharp null hypothesis, i.e., $H_0 : Y^1_{si} = Y^2_{si} = \dots = Y^K_{si}$ for all $s$ and $i$, which is the most popular hypothesis in matched designs (Rosenbaum, 2002b). In this framework, the plausibility of the null hypothesis is evaluated with randomization-based tests, where the distribution of statistics $t(\mathbf{Z}, \mathbf{Y})$ is defined by the possible permutations of the treatment values $\mathbf{Z}_s$ within matched sets.

The McNemar test (for binary outcomes) and the Wilcoxon signed-rank test (for continuous outcomes) are common choices for the paired structures generated with two treatment groups (Rosenbaum, 2002b). In the presence of multiple treatment groups, a strategy to test Fisher's sharp null is to use the evidence factor methodology. Rosenbaum (2010) introduced this approach to combine evidence from two or more statistical tests that investigate different pieces of the overall sharp null hypothesis. A key advantage of this framework is that it naturally incorporates the possibility to assess the robustness of the result to hidden bias. I describe this methodology as a general way to test hypotheses about causal effects in matched designs with K treatment groups.
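The randomization-based logic can be sketched generically: under the sharp null, treatments are exchangeable within each matched set, so permuting outcomes within sets reproduces the null distribution of any test statistic. The sketch below is mine (Monte Carlo approximation, one-sided test, arbitrary statistic), not a specific test from the text.

```python
import numpy as np


def randomization_p_value(Y, t_stat, n_perm=2000, seed=0):
    """Monte Carlo randomization p-value for Fisher's sharp null.
    Y: (S, K) array; entry (s, k) is the outcome of the subject in matched
    set s who received treatment k+1. Under the sharp null, the treatment
    labels are exchangeable within each matched set, so permuting each row
    reproduces the null distribution of t_stat."""
    rng = np.random.default_rng(seed)
    observed = t_stat(Y)
    hits = 0
    for _ in range(n_perm):
        Y_perm = np.stack([rng.permutation(row) for row in Y])
        if t_stat(Y_perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one Monte Carlo correction
```

Any of the statistics discussed in the next section could be plugged in as `t_stat`.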

2.2.3 Evidence Factors

Evidence factors are tests addressing different fragments of the overall null hypothesis of no treatment effect (Rosenbaum, 2010). The idea is to perform the statistical tests separately and to combine their results into a single p-value, which evaluates the plausibility of the overall null. Such tests must be independent, or nearly independent, under $H_0$, as they should use different information in the data. Moreover, the bias that may affect one test should not affect the others. This condition enables the combination of test-specific sensitivity analyses to hidden bias into a single sensitivity analysis for the overall test. Originally, Rosenbaum (2010) introduced the methodology for independent test statistics. The framework was subsequently generalized, relaxing the necessity of using independent statistics and providing sufficient conditions for the conditions above to hold (Rosenbaum, 2011, 2017). Even though several aspects of the evidence factor methodology are well-established in the literature, this framework has not been formally discussed as a structured testing strategy to evaluate Fisher's sharp null hypothesis in matched designs with multiple groups.

In particular, to assess the overall null hypothesis of no difference among the K treatments, one solution is to use K − 1 tests. First, one treatment is compared to the remaining K − 1, pooled together as a single treatment group. Then, the second treatment is compared to the following K − 2, and so on. The last test compares the last two treatment groups. In this way, the combination of the null hypotheses of the K − 1 tests corresponds to the overall hypothesis of no difference among all of the treatments. To combine the results of the different tests, one possibility is to use Fisher's

method (Rosenbaum, 2010). In its traditional formulation, the p-values $P_1, \dots, P_{K-1}$ of $K-1$ independent tests can be combined in the statistic $-2 \sum_{k=1}^{K-1} \log(P_k)$, which follows a chi-squared distribution with $2(K-1)$ degrees of freedom if the null hypotheses of all the tests are true. Rosenbaum (2011) discussed how the type-I error of the overall test is controlled even if the individual tests are not independent, provided that the p-values $P_1, \dots, P_{K-1}$ are stochastically larger than uniform, i.e., such that $P\left(\bigcap_{k=1}^{K-1} \{P_k \le p_k\}\right) \le \prod_{k=1}^{K-1} p_k$ for every choice of constants $p_1, \dots, p_{K-1} \in [0, 1]$.

For the proposed strategy of iteratively nested comparisons, this property is satisfied by many randomization-based statistics for matched data (Rosenbaum, 2011). For continuous outcomes, the Huber-Maritz m-statistic can be used to perform each comparison. The statistic comparing group k to groups $k+1, \dots, K$ is

$$t_k(\mathbf{Z}, \mathbf{Y}) = \sum_{s=1}^{S} \sum_{k'=k+1}^{K} \psi\left(\frac{\sum_{r=1}^{K} I(Z_{sr} = k) Y_{sr} - \sum_{r=1}^{K} I(Z_{sr} = k') Y_{sr}}{\mu}\right), \quad (2.8)$$

where the two sums in the numerator are the outcomes of the subjects receiving treatments $k$ and $k'$ in matched set $s$, $\mu$ is a scale factor and $\psi(\cdot)$ is an odd function. A popular choice is to set $\mu$ to the median of the absolute within-set differences $\left\{\left|\sum_{r=1}^{K} I(Z_{sr} = k) Y_{sr} - \sum_{r=1}^{K} I(Z_{sr} = k') Y_{sr}\right|\right\}$ over $s = 1, \dots, S$ and $1 \le k < k' \le K$.
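A sketch of statistic (2.8) follows. The clipped-identity function is used as one common choice of Huber's odd $\psi$, and $\mu$ is taken as the median absolute within-set pair difference; both are illustrative choices, and the function names are mine.

```python
import numpy as np


def huber_maritz_stat(Y, Z, k, c=1.345):
    """Sketch of statistic (2.8) comparing treatment k with k+1, ..., K.
    Y, Z: (S, K) arrays of outcomes and treatment labels in {1, ..., K}.
    psi is a clipped identity (odd, bounded) and mu the median absolute
    within-set pair difference (both illustrative choices)."""
    S, K = Y.shape
    # y[s, g-1]: outcome of the subject in set s who received treatment g
    y = np.array([[Y[s][Z[s] == g][0] for g in range(1, K + 1)]
                  for s in range(S)])
    all_diffs = [y[:, a] - y[:, b] for a in range(K) for b in range(a + 1, K)]
    mu = np.median(np.abs(np.concatenate(all_diffs)))   # scale factor
    psi = lambda u: np.clip(u, -c, c)                   # odd function
    return sum(psi((y[:, k - 1] - y[:, g]) / mu).sum() for g in range(k, K))
```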

For binary outcomes, a Mantel-Haenszel-type statistic can be used. Let $m_{sk} = \sum_{r=1}^{K} \sum_{k'=k}^{K} I(Z_{sr} = k') Y_{sr}$ be the number of events in matched set $s$ among the subjects involved in the $k$-th comparison. Given $m_{sk}$, each term $\sum_{r=1}^{K} I(Z_{sr} = k) Y_{sr}$ is distributed as a Bernoulli with probability $p_{sk} = m_{sk}/(K - k + 1)$. Therefore, if the number of matched sets is large, it is possible to consider a standardized version of the statistic, whose distribution is approximated well by a standard normal distribution:

$$t_k(\mathbf{Z}, \mathbf{Y}) = \frac{\sum_{s=1}^{S} \sum_{r=1}^{K} I(Z_{sr} = k) Y_{sr} - \sum_{s=1}^{S} p_{sk}}{\sqrt{\sum_{s=1}^{S} p_{sk}(1 - p_{sk})}}. \quad (2.9)$$

Notably, the last of the $K - 1$ tests is a comparison between two treatment groups in a paired structure. In this case, the Mantel-Haenszel statistic is equivalent to the McNemar test for paired data (McNemar, 1947). If the null hypothesis is true, the treatment can be considered to be given at random within each matched set. The null distributions of both the Huber-Maritz and Mantel-Haenszel statistics are valid under the assumption of no hidden bias.
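The combination of the K − 1 individual p-values with Fisher's method is a one-line computation; the sketch below uses `scipy.stats.chi2` for the reference distribution, and the function name is mine.

```python
import numpy as np
from scipy.stats import chi2


def fisher_combined_p(p_values):
    """Combine the K-1 evidence-factor p-values with Fisher's method:
    -2 * sum(log P_k) is compared with a chi-squared distribution with
    2(K-1) degrees of freedom (valid, and conservative, when the P_k are
    stochastically larger than uniform)."""
    stat = -2.0 * np.sum(np.log(p_values))
    return chi2.sf(stat, df=2 * len(p_values))
```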

2.2.4 Estimation of Treatment Effects

If Fisher’s null hypothesis is rejected, the data provide evidence that the treatment has a causal effect on the outcome and interest moves to quantifying such effect. In order to provide a measure of the causal effect that is coherent with the statistical test, the effect is estimated by inversion of the test. In particular, the type of effect mainly depends on the type of outcome.

Continuous Outcomes: Constant Additive Effects

For a continuous outcome, the effect is often assumed to be constant and additive (Rosenbaum, 2011). Considering the first treatment group as the reference, the observed outcomes are assumed to be equal to $Y_{sr} = Y^1_{sr} + \sum_{k=2}^{K} I(Z_{sr} = k) \tau^k$. Inference on the parameters $\tau = (\tau^2, \dots, \tau^K)$ can be carried out by inversion of the Huber-Maritz test. That is, for any given vector of parameters $\tau_0$, $\tau_0$ belongs to the $(1 - \alpha)100\%$ confidence set of $\tau$ if the p-value of the test $H_0 : \tau = \tau_0$ is larger than $\alpha$. To carry out this test, note that $H_0 : \tau = 0$ is equivalent to Fisher's null, which is tested with the statistic described in Section 2.2.3. For $\tau_0 \ne 0$, it is sufficient to define modified responses $Y^{\tau_0}_{sr} = Y_{sr} - \sum_{k=2}^{K} I(Z_{sr} = k) \tau_0^k$ and evaluate the hypothesis of no effect on these fictitious responses (Rosenbaum, 2011).
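The adjustment that reduces $H_0 : \tau = \tau_0$ to Fisher's null can be sketched as follows (function name mine; array layout is an assumption).

```python
import numpy as np


def adjusted_responses(Y, Z, tau0):
    """Remove the hypothesized constant additive effects tau0 = (tau_2, ...,
    tau_K), with group 1 as reference, so that H0: tau = tau0 reduces to
    Fisher's sharp null on the adjusted outcomes."""
    tau_full = np.concatenate(([0.0], np.asarray(tau0, dtype=float)))
    return Y - tau_full[Z - 1]   # Z holds treatment labels in {1, ..., K}
```

Scanning a grid of $\tau_0$ values and retaining those whose adjusted-data p-value exceeds $\alpha$ yields the confidence set by test inversion.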

Binary Outcomes: Attributable Effect

A similar strategy has been suggested to quantify causal effects in the presence of binary outcomes (Rosenbaum, 2002a). Because the outcomes can only assume values 0 or 1, the assumption of a constant additive effect is unrealistic in this case. Instead, the effect is quantified in terms of the attributable effect. With the same convention of considering the first treatment group as the reference, the attributable effect of treatment $k$ is $A_k = \sum_{s=1}^{S} \sum_{r=1}^{K} I(Z_{sr} = k)\left(Y_{sr} - Y^1_{sr}\right) = \sum_{s=1}^{S} \sum_{r=1}^{K} I(Z_{sr} = k)\,\delta^k_{sr}$, i.e., the number of events observed among the subjects who received treatment $k$ that would not have been observed if the subjects had received treatment 1.

Notably, in the context of randomization-based inference, the attributable effect is a random variable, because it is a function of the random treatment statuses $Z_{sr}$. Therefore, the effect is not estimated, as it is not possible to "estimate" random variables. However, it is possible to gain information about this effect by identifying plausible values (Rosenbaum, 2002a). In particular, plausible values of the attributable effect $A_k$ are strictly related to the confidence region of the parameters $\delta^k = (\delta^k_{sr})_{s=1,\dots,S;\ r=1,\dots,K}$. Consider a constant $a_k$ and suppose that there exists a vector $\delta^k_0$ belonging to the $(1 - \alpha)100\%$ confidence set $C_k$ of $\delta^k$ and such that

$$\sum_{s=1}^{S} \sum_{r=1}^{K} I(Z_{sr} = k)\,\delta^k_{0sr} \le a_k. \quad (2.10)$$

Then, it is plausible that $a_k$ or fewer of the outcome events would not have been observed if the subjects who received treatment $k$ had received treatment 1 instead, because we cannot exclude $\delta^k_0$ from the values about which we are confident. Conversely, if there are no vectors $\delta^k_0$ in $C_k$ satisfying the condition above, we can conclude that it is implausible that $a_k$ or fewer events are caused by treatment $k$.

To sum up, to quantify the effect of a treatment on a binary outcome, the strategy is to construct an appropriate confidence set $C_k$ for $\delta^k$ and to identify the smallest $a_k$ such that no vector $\delta^k_0$ in $C_k$ satisfies the condition in Equation (2.10). The two-group procedure described above can be extended to the multiple-treatment comparison. First, a confidence set $C$ for $\delta = (\delta^2, \dots, \delta^K)$ is identified, by inversion of the overall test described in Section 2.2.3. Then, for each $k$ from 2 to $K$, it is possible to compute the smallest of the values $a_k$ satisfying Equation (2.10).

2.2.5 Sensitivity Analysis to Hidden Bias

Hidden bias arises in the presence of unmeasured confounders. Rosenbaum (2002b) described a comprehensive framework to assess the robustness of the conclusions to unobserved confounders. The application of this framework to observational studies with multiple evidence factors has been discussed in the literature (Rosenbaum, 2010). In this section, I describe the implementation of the sensitivity analysis in matched designs with multiple treatment groups.

The idea is to relax the assumption of no hidden bias independently for each of the $K - 1$ evidence factors. Consider a set of $K - 1$ unobserved confounders $U^1, \dots, U^{K-1}$, assuming values in the interval $[0, 1]$. The $k$-th confounder is assumed to introduce bias in the $k$-th comparison (i.e., the test between treatment $k$ and the following $K - k$ groups). Rosenbaum (2010) noted that these factors may or may not be distinct, so that the same unobserved covariate may bias two or more tests. These unobserved confounders, together with the observed covariates $\mathbf{X}$, are assumed to make the treatment assignment $\mathbf{Z}$ exchangeable. Given the possible imbalance of the unobserved confounders in the matched sample, subjects in the same $K$-tuple might have had unequal probabilities of receiving the $K$ treatments.

To formally describe the key assumption of the framework, I introduce a set of auxiliary binary variables $Z^k_{sr}$, where $Z^k_{sr} = 1$ if $Z_{sr} = k$, i.e., if subject $r$ in matched set $s$ receives treatment level $k$ (Rosenbaum, 2010). Denote with $J_{sk}$ the set of subjects in matched set $s$ involved in the $k$-th comparison, i.e., $J_{sk} = \{j : Z^{k'}_{sj} = 0 \text{ for all } k' < k\}$. A parameter $\gamma_k$ controls the impact of the unobserved confounder $U^k$ on the probability of receiving treatment $k$. Let $\pi^k_{sr}$ be the probability that subject $r$ is the one in matched set $s$ receiving treatment $k$, given the matched structure and the observed and unobserved covariates. The following model is assumed:

$$\pi^k_{sr} = \begin{cases} \dfrac{\exp(\gamma_k U^k_{sr})}{\sum_{j \in J_{sk}} \exp(\gamma_k U^k_{sj})} & \text{if } r \in J_{sk} \\[2ex] 0 & \text{if } r \notin J_{sk} \end{cases} \quad (2.11)$$

In the special case $\gamma_k = 0$, in each matched set $s$, all the subjects $r$ involved in the $k$-th comparison (i.e., subjects $r \in J_{sk}$) have equal probability $\pi^k_{sr}$ of receiving treatment $k$, and such probability is $\pi^k_{sr} = 1/\left(\sum_{j \in J_{sk}} 1\right) = 1/(K - k + 1)$. In other words, the unobserved confounder $U^k$ does not bias the $k$-th test and treatment $k$ can be considered as given at random within the subjects of the $K$-tuple involved in the test. For values $\gamma_k > 0$, Equation (2.11) implies that, for any given subjects $r$ and $r'$ in $J_{sk}$,

$$\frac{\pi^k_{sr}}{\pi^k_{sr'}} = \exp\left(\gamma_k (U^k_{sr} - U^k_{sr'})\right) \in \left[\frac{1}{\Gamma_k}, \Gamma_k\right], \quad (2.12)$$

where $\Gamma_k = \exp(\gamma_k)$.

Under the assumption formalized by Equation (2.11), the randomization-based distribution of the statistic $t_k(\mathbf{Z}, \mathbf{Y})$ depends on $\gamma_k$ and on the values of the unobserved confounder $U^k$. To conduct the sensitivity analysis on the $k$-th comparison, a value of $\Gamma_k$ is fixed and the upper bound $P^k_{\Gamma_k}$ of the p-value is computed, considering all the possible values $U^k_{sr}$ of the unobserved confounder. If small values of $\Gamma_k$ yield large values of $P^k_{\Gamma_k}$, the result based on observed data is considered sensitive to hidden bias. Conversely, if $P^k_{\Gamma_k}$ is small for large $\Gamma_k$, unobserved confounders strongly associated with the treatment received cannot alter the conclusion of the test.

The procedure to compute the individual $P^k_{\Gamma_k}$ depends on the statistic $t_k(\mathbf{Z}, \mathbf{Y})$. The procedure for the Huber-Maritz statistic is provided by Rosenbaum (2007). For the Mantel-Haenszel statistic, and its special case, the McNemar statistic, the procedure is provided by Rosenbaum (1987). In particular, Nattino and Lu (2018) recently described a simple approach to identify the smallest value of $\Gamma_k$ resulting in an upper bound $P^k_{\Gamma_k}$ larger than a fixed significance level $\alpha$ in the Mantel-Haenszel test. The authors showed the identity between the tipping point, denoted as sensitivity value, and the lower bound of the $(1 - \alpha)100\%$ one-sided confidence interval of the conditional odds ratio between treatment assignment and outcome. This value can be computed by fitting a conditional logistic regression model predicting the outcome of interest with the treatment indicator as the only independent variable. Using this approach, it is possible to identify the sensitivity value of the test in a single computation, using off-the-shelf methods.

The procedure described above can be used to assess the sensitivity to hidden bias of each of the $K - 1$ comparisons. The result that allows one to carry out a sensitivity analysis of the overall test, which combines the $K - 1$ p-values with Fisher's method, is described by Rosenbaum (2010). Suppose that the Huber-Maritz or the Mantel-Haenszel statistics are used for the individual tests. Fixing the vector of sensitivity parameters $\Gamma = (\Gamma_1, \dots, \Gamma_{K-1})$, the upper bound $P_\Gamma$ of the overall test described in Section 2.2.3 can be computed by comparing the statistic $-2 \sum_{k=1}^{K-1} \log(P^k_{\Gamma_k})$ with a chi-squared distribution with $2(K - 1)$ degrees of freedom.
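For the paired (McNemar) case, the upper-bound p-value at a given $\Gamma$ has a closed form in Rosenbaum's framework: with bias at most $\Gamma$, the count of discordant pairs favoring the treatment is stochastically bounded by a binomial with success probability $\Gamma/(1+\Gamma)$. A sketch (function name mine):

```python
from scipy.stats import binom


def mcnemar_sensitivity_upper_p(T, D, gamma):
    """Upper bound of the one-sided McNemar p-value under sensitivity
    parameter Gamma: with hidden bias at most Gamma, the null distribution
    of T (discordant pairs favoring treatment, out of D discordant pairs)
    is bounded by Binomial(D, Gamma / (1 + Gamma))."""
    p_plus = gamma / (1.0 + gamma)
    return binom.sf(T - 1, D, p_plus)   # P(Binomial(D, p_plus) >= T)
```

Setting `gamma = 1` recovers the exact McNemar p-value; the bound grows with `gamma`, and the smallest `gamma` pushing it above $\alpha$ is the sensitivity value discussed above.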

2.3 Simulation Study

2.3.1 Setup

I designed a simulation analysis to evaluate the performance of the proposed matching algorithm. I focused on the case of K = 3 treatment groups. In each of the considered scenarios, I generated 1,000 samples. The subjects in the simulated samples were assigned to one of three treatment groups. Each subject was characterized by a single covariate value X, as the unique matching variable. I considered two sets of simulations. First, I generated samples where each subject in the smallest group had perfect matches in the other two groups. To simulate this scenario, I considered three groups of sizes n1 = 100, n2 = 200 and n3 = 300. The values of the matching variable X for the n1 subjects in group 1 were generated from a standard normal distribution. The same values were assigned to 100 subjects in groups 2 and 3, in order to construct perfect triplets across the groups. The values of the matching variable for the remaining 100 and 200 subjects in groups 2 and 3 were sampled from the same distribution (standard normal). The conditionally optimal algorithm was applied to each simulated sample. I applied the algorithm considering each of the three starting setups. The purpose of this simulation was to evaluate the capability of the algorithm to identify perfect triplets—when they exist—and to evaluate the sensitivity of the result to the arbitrary choice of the starting setup. The second family of simulations investigated the performance of the matching algorithm in more realistic settings. I generated the values of the matching variable for subjects in groups 1, 2 and 3 as independent samples from beta distributions. This family of distributions is characterized by two shape parameters. If both these parameters are greater than one, the density is unimodal and the location of the peak can be set by appropriately choosing the values of the two parameters. I considered

45 four choices of the parameters. In the first case, the matching variables of the three groups were sampled from the same symmetric distribution. The other choices of the beta parameters correspond to a decreasing overlap among the distributions of the

three groups. In particular, the matching variables Xi,Xj and Xk for subjects i in group 1, j in group 2 and k in group 3 were sampled as follows:

1. Xi, Xj, Xk ∼ beta(2, 2),

2. Xi ∼ beta(2, 3), Xj ∼ beta(3, 2) and Xk ∼ beta(2, 2),

3. Xi ∼ beta(2, 4), Xj ∼ beta(4, 2) and Xk ∼ beta(2, 2),

4. Xi ∼ beta(2, 5), Xj ∼ beta(5, 2) and Xk ∼ beta(2, 2).

A graphical representation of the four scenarios is reported in Figure 2.2.


Figure 2.2: Distributions of the matching variable in the four scenarios considered in the simulation study (from left to right, scenario 1 to 4). The distributions of X in group 1, 2 and 3 are plotted with solid, dotted and dashed lines, respectively.
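The data-generating step of the second family of simulations can be sketched as below; the scenario table mirrors the list above (indexed 0-3 here), and the function name is mine.

```python
import numpy as np

# beta shape parameters (groups 1, 2, 3) for the four scenarios
SCENARIOS = [
    ((2, 2), (2, 2), (2, 2)),
    ((2, 3), (3, 2), (2, 2)),
    ((2, 4), (4, 2), (2, 2)),
    ((2, 5), (5, 2), (2, 2)),
]


def simulate_sample(scenario, sizes, rng):
    """Draw the matching variable X for the three treatment groups."""
    return [rng.beta(a, b, size=n)
            for (a, b), n in zip(SCENARIOS[scenario], sizes)]
```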

For each of the four scenarios, I considered samples of different sizes:

i. three groups of equal sizes: n1 = n2 = n3 = 500;

ii. one central group smaller than the other two groups: n1 = n2 = 1,000 and n3 = 500;

iii. one central group larger than the other two groups: n1 = n2 = 500 and n3 = 1,000;

iv. asymmetric case with larger group on the left side: n1 = 1,000 and n2 = n3 = 500.

Notably, the smallest group had size 500 in each of the four cases. Therefore, the number of matched triplets was 500 for all the simulated samples. I applied both the conditionally optimal and the NN algorithms to each generated sample. The conditionally optimal algorithm was applied three times, once for each of the possible starting setups. Of the three results, I selected the matched sample associated with the smallest total distance and compared it with the total distance from the NN matching. I evaluated the performance of the proposed algorithm by computing the attained percentage reduction of the total distance with respect to the NN matching.

2.3.2 Results

In the first set of simulations, the proposed algorithm correctly identified the existing perfect match 100% of the time. The starting setups did not make a difference: the perfect match was identified from all three starting conditions.

For the second family of simulations, Table 2.1 presents the distributions of the percentage reduction in the simulated samples. The minimum percentage reduction was greater than or equal to zero in each implementation (see column Min-Max). This indicates that the quality of matching with the proposed algorithm was, in each instance, at least as good as that of the NN procedure. The conditionally optimal algorithm attained better performance (largest reductions in the total distance) in the settings characterized by a higher degree of overlap between the three distributions. When the matching variables were sampled from the same distribution (first scenario), the algorithm produced matched samples with a 10% to 29% average reduction in the total distance with respect to the NN matching. The reduction was smaller in cases with less overlap between the distributions (0-5% average reduction in the fourth scenario). Interestingly, in the scenarios characterized by non-perfect distribution overlap (second to fourth scenarios), the largest reduction was attained when the central group was the smallest one (second line in each scenario). Simulations based on samples with different sizes returned very similar results.

Table 2.1: Percentage reductions of the total distance attained with the conditionally optimal algorithm with respect to the nearest neighbor procedure.

Group distributions      n1    n2    n3    Min - Max       Median (Q1-Q3)         Mean (SD)
1: beta(2,2)             500   500   500   11.35 - 44.28   29.52 (25.10-33.02)    29.00 (5.94)
2: beta(2,2)            1000  1000   500    3.11 - 20.39    9.56 (7.97-11.03)      9.64 (2.41)
3: beta(2,2)             500   500  1000    6.78 - 45.45   25.60 (19.98-31.04)    25.42 (7.74)
                        1000   500   500    5.05 - 53.13   25.74 (20.13-30.95)    25.58 (7.70)

1: beta(2,3)             500   500   500    0.37 -  9.85    3.96 (2.90-5.04)       4.05 (1.57)
2: beta(3,2)            1000  1000   500    1.19 - 15.91    3.91 (2.91-5.26)       4.32 (1.99)
3: beta(2,2)             500   500  1000    0.14 -  1.09    0.35 (0.29-0.43)       0.38 (0.13)
                        1000   500   500    0.63 -  7.90    2.18 (1.68-2.97)       2.42 (1.03)

1: beta(2,4)             500   500   500    0.05 -  3.36    0.91 (0.57-1.34)       1.00 (0.58)
2: beta(4,2)            1000  1000   500    1.80 -  7.05    4.01 (3.53-4.54)       4.04 (0.74)
3: beta(2,2)             500   500  1000    0.02 -  0.41    0.07 (0.05-0.10)       0.08 (0.04)
                        1000   500   500    0.11 -  1.50    0.43 (0.32-0.57)       0.47 (0.20)

1: beta(2,5)             500   500   500    0.00 -  1.49    0.27 (0.13-0.45)       0.31 (0.24)
2: beta(5,2)            1000  1000   500    3.57 -  7.40    5.20 (4.79-5.58)       5.19 (0.60)
3: beta(2,2)             500   500  1000    0.00 -  0.16    0.02 (0.01-0.03)       0.03 (0.02)
                        1000   500   500    0.02 -  0.45    0.10 (0.07-0.16)       0.12 (0.07)

2.4 Application: Mortality Differences among Trauma Center Levels

2.4.1 Background

Traumatic injury is a leading cause of mortality and morbidity in the US (Centers for Disease Control and Prevention, 2016). Trauma centers provide specialized medical services and resources to patients suffering from traumatic injuries. The classification of trauma centers in the US is based on key resources and expertise for the care of trauma patients. The commonly used classification is: Level I Trauma Centers (TC I), providing multidisciplinary treatment and specialized resources for trauma patients and operating an organized teaching and research effort; Level II Trauma Centers (TC II), providing similar experienced medical services and resources without carrying out trauma research; Nontrauma Centers (NTC), including other medical units providing limited trauma care. The trauma-level designation is assigned by state/regional authorities and verified by the American College of Surgeons Committee on Trauma (Committee on Trauma American College of Surgeons, 2006).

An important aspect of trauma system evaluation is the comparison of patient outcomes across different levels of trauma care. Such comparison provides important information to optimally use finite medical resources and achieve the best patient outcomes. Previous research has discussed the beneficial effect of being treated in Trauma Centers (TC) versus NTC (Smith Jr et al., 1990; MacKenzie et al., 2006). However, comparisons between TC I and TC II are less conclusive. Some studies have suggested that they provide a similar level of care (Clancy et al., 2001; MacKenzie et al., 2003). Other studies have found mortality differences between the two trauma center levels in head-injury patients (McConnell et al., 2005). I am not aware of direct comparisons between TC II and NTC.

A key research question is whether TC II are a justified investment of finite trauma care resources. If trauma patients treated at TC II had, instead, been treated at TC I or NTC, would their outcomes have been different? The major outcome of interest that can be used to evaluate the trauma system is emergency department mortality. If the mortality of TC II patients would have been similar had these patients been treated at NTC, with both these mortality rates being much higher than what would have been observed in TC I, the utility of TC II would be brought into question. The answer to this question is expected to have broad translational and policy importance in the treatment and transfer of trauma patients in the US.

2.4.2 Data

The Nationwide Emergency Department Sample (NEDS) data were used to investigate the research question. The NEDS is a nationally representative dataset designed by the Agency for Healthcare Research and Quality to enable analyses of emergency department utilization patterns. It is the largest all-payer emergency department database in the US, including approximately 30 million records each year. Key variables that are available to researchers are the injury severity score, which is a score correlating with mortality and other clinical outcomes, age, sex, comorbidity of chronic conditions, multiple injuries, median household income by zip code, expected primary payer, and urban-rural designation for the patient's county of residence. Further information about the NEDS design and data can be found in the section dedicated to the Healthcare Cost and Utilization Project-NEDS on the website of the Agency for Healthcare Research and Quality (2019).

2.4.3 Methods

Estimating the causal effect of a three-dimensional treatment (the trauma center level where the patient is admitted) on a binary outcome (emergency department mortality) was the main goal of the analysis. In particular, because the interest was in TC II patients, a matched design linking each TC II patient to one TC I and one NTC patient is a robust approach to address the research question. I applied the proposed matching algorithm to the NEDS dataset. I considered the same set of covariates considered in a previous analysis of the same data (Shi et al., 2016). The covariates included age, sex, the injury severity score (ISS), presence of chronic conditions, multiple injury, primary expected payer, patient location and income.

To reduce the dimensionality of the matching problem, subjects were matched on the basis of the propensity score. A multinomial logistic regression model was used to estimate the three-dimensional propensity score. In particular, subjects admitted to different trauma-level facilities were matched on the basis of the two estimated log-odds associated with the propensity score, i.e., the two linear components of the multinomial logistic model. I considered the three-way distance induced by the Euclidean norm in the matching procedure. The conditionally optimal algorithm was run using each of the three possible starting setups. The matched sample with the smallest total distance was used for further analysis.

The primary hypothesis of the analysis was verified testing Fisher's sharp null,

i.e., H0 : Y_sr^1 = Y_sr^2 = Y_sr^3 for each subject r = 1, 2, 3 in all the triplets s = 1, ..., S. To test this hypothesis, the methodology of evidence factors discussed in Section 2.2 was applied to the matched sample. The overall null hypothesis was evaluated with K − 1 = 2 comparisons. First, one treatment group was compared to the other two, pooled together. The second comparison involved the two groups pooled in the first

step. From a purely statistical perspective, the order of the trauma levels in the two comparisons is not relevant, because any order can be used to verify the hypothesis of no difference among the three groups. However, from a scientific standpoint, some comparisons might be more interesting than others. For example, in this application, comparing NTC versus TC I and TC II together and then TC I versus TC II is a set of comparisons that is more informative than other choices. In this way, while testing the overall hypothesis, it is possible to explore whether there is a difference between NTC and TC and whether TC I differs from TC II. Because the outcome is binary, the tests are carried out with the Mantel-Haenszel statistic. In particular, the second comparison, between TC I and TC II, reduces to the special case of the McNemar test. The p-values P1 and P2 of the two tests are combined with Fisher's method, comparing the statistic −2 log(P1) − 2 log(P2) to a chi-squared distribution with four degrees of freedom. The effect of the treatment is quantified in terms of the attributable effect. Finally, the robustness of the test to hidden bias is evaluated with the sensitivity analysis framework.
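For illustration, the propensity score step described above can be sketched in Python. This is a hypothetical re-implementation, not the dissertation's code: the multinomial logistic model is fitted by plain gradient ascent, and the three-way distance of a triplet is assumed here to be the sum of the three pairwise Euclidean distances between the two estimated log-odds.

```python
import numpy as np

def fit_multinomial_logit(X, y, n_groups=3, lr=0.1, iters=2000):
    """Fit a softmax (multinomial logistic) regression by gradient ascent.
    Returns a (p + 1) x n_groups coefficient matrix; group 0 is the reference."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])          # add intercept column
    B = np.zeros((p + 1, n_groups))
    Y = np.eye(n_groups)[y]                        # one-hot encoded outcome
    for _ in range(iters):
        P = np.exp(Xd @ B)
        P /= P.sum(axis=1, keepdims=True)          # softmax probabilities
        B += lr * Xd.T @ (Y - P) / n               # ascent on the log-likelihood
        B[:, 0] = 0.0                              # identifiability constraint
    return B

def log_odds(X, B):
    """The two linear predictors log P(k|x)/P(0|x), k = 1, 2."""
    Xd = np.column_stack([np.ones(X.shape[0]), X])
    return (Xd @ B)[:, 1:]

def triplet_distance(u, v, w):
    """Three-way distance of a triplet, taken as the sum of the three
    pairwise Euclidean distances (an assumption of this sketch)."""
    return (np.linalg.norm(u - v) + np.linalg.norm(u - w)
            + np.linalg.norm(v - w))
```

Candidate (NTC, TC I, TC II) triplets would then be scored with `triplet_distance` on rows of `log_odds(X, B)`, and the matching algorithm would minimize the total distance over all triplets.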

2.4.4 Results

Matching

Out of the 21,855 patients considered, 5,314 (24.3%) were admitted to non-trauma centers, while 13,383 (61.2%) and 3,158 (14.4%) patients were admitted to level I and level II trauma centers, respectively. The three matched samples generated by the different setups of the algorithm yielded extremely similar total distances. When the procedure was started from optimally matching the groups NTC with TC I, NTC with TC II or TC I with TC II, the total distances attained by the matching algorithm were 1025.33, 1020.55 and 1020.46, respectively. The third matched sample was selected for the causal inference analysis,

because it attained the smallest total distance. Balance across the treatment groups in the matched sample was assessed using the absolute standardized differences, reported in Table 2.2. For all of the covariates, the average value of the standardized differences was below the threshold of 10%, with the largest single value being 13.97%. Considering the substantially increased complexity of matching in the three-dimensional space, these results are regarded as signs of good balance across covariates.

Table 2.2: Standardized differences after matching.

Variable                                       NTC vs. TC I   TC I vs. TC II   NTC vs. TC II   Average
Age                                                 1.05%           3.65%            2.60%       2.43%
Sex (Female)                                        3.75%           2.93%            0.87%       2.52%
ISS                                                 0.44%           0.09%            0.36%       0.30%
Multiple injury                                     0.50%           0.65%            0.00%       0.38%
Chronic conditions                                 10.43%           3.36%           13.62%       9.14%
Median household income by patient zip code
  Q1 (0%-25%)                                       8.33%          12.13%            3.75%       8.07%
  Q2 (25%-50%)                                      6.10%           1.12%            4.82%       4.01%
  Q3 (50%-75%)                                      3.40%           3.50%            0.08%       2.33%
  Q4 (75%-100%)                                     9.17%          12.48%            3.42%       8.36%
Primary expected payer
  Medicare                                          5.11%           0.00%            5.46%       3.53%
  Medicaid                                          2.89%           1.22%            1.68%       1.93%
  Private insurance                                 3.75%          10.17%           13.97%       9.30%
  Self-pay                                          4.41%           7.32%           11.39%       7.71%
  No charge                                         1.35%           0.97%            3.06%       1.79%
  Other                                             1.79%           1.18%            2.89%       1.95%
Patient location
  Large central metropolitan area                   5.40%           0.41%            6.50%       4.10%
  Large fringe metropolitan area                    5.15%          13.10%            8.34%       8.86%
  Medium metropolitan area                          5.93%           2.28%            7.88%       5.36%
  Small metropolitan area                           2.77%           6.60%            8.72%       6.03%
  Micropolitan area                                 0.44%           4.36%            3.66%       2.82%
  Neither metropolitan nor micropolitan area       10.69%           9.74%            1.28%       7.24%
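As a point of reference for Table 2.2, the absolute standardized difference reported for each pairwise comparison is the absolute difference in group means divided by the pooled standard deviation, expressed as a percentage. A minimal sketch (hypothetical helper, not the dissertation's code):

```python
import numpy as np

def abs_std_diff(x1, x2):
    """Absolute standardized difference (in %) between two groups:
    |mean(x1) - mean(x2)| / sqrt((var(x1) + var(x2)) / 2) * 100.
    For a binary covariate, x1 and x2 are 0/1 indicator vectors."""
    pooled_sd = np.sqrt((np.var(x1, ddof=1) + np.var(x2, ddof=1)) / 2.0)
    return 100.0 * abs(np.mean(x1) - np.mean(x2)) / pooled_sd
```

The "Average" column of Table 2.2 is then the mean of the three pairwise values for each covariate.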

Outcome Analysis

Table 2.3 provides the mortality rates for the three trauma center levels in both the unmatched and matched samples. Prior to matching, the mortality observed in NTC, TC I and TC II was 14.3%, 3.8% and 4.2%, respectively. In particular, the NTC mortality was 10 percentage points higher than that in TC I or TC II. The matching design reduced the effect of observed covariates and restricted the comparison to subjects with similar characteristics to those treated at TC II. As a result, the difference among the mortality rates decreased. However, the post-matching mortality rates still pointed in the direction of lower quality of care at NTC, with rates of 10.1%, 4.2% and 4.2% in NTC, TC I and TC II, respectively.

The Mantel-Haenszel test was used to compare outcome rates between NTC and TC. The statistic of the first evidence factor was t1 = 11.45, corresponding to an extremely small p-value (P1 < 0.0001). Barring any hidden bias, this result provides strong evidence of a beneficial effect of TC. Second, McNemar's test was used to compare TC I and TC II. The value of the statistic was t2 = 0 and the corresponding one-sided p-value was P2 = 0.5. Given the large p-value, there was no evidence that being admitted to TC I or TC II had a causal effect on mortality. The two results can be combined in a single test, using Fisher's method. The value of the combined statistic was −2 log(P1) − 2 log(P2) = 139.25, which corresponds to an extremely small p-value (<0.0001 when compared to a chi-squared distribution with 4 degrees of freedom). Clearly, there was strong evidence that the sharp null hypothesis did not hold.

Since the comparison between TC I and TC II was non-significant, there was no evidence that any of the deaths among the TC II patients would have been prevented if the patients had been admitted to TC I. In contrast, the very small value of P1 suggested that some of the deaths among NTC patients could have been prevented by admission to TC. Therefore, we pooled TC I and TC II patients, assuming the same potential outcomes for TC I and TC II admission, and estimated the effect attributable to NTC. This estimate is based on the inversion of the Mantel-Haenszel test used in the NTC-TC comparison. Following the procedure described in Section 2.2.4, the inversion of the 0.05-level test provided evidence that 162 or more of the 319 NTC deaths were caused by the NTC admission. This corresponds to an attributable mortality of at least 5.1%.
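The outcome analysis above can be sketched as follows. This is a simplified, hypothetical re-implementation: the Mantel-Haenszel statistic treats each matched triplet as a 2x2 stratum with one NTC patient and two pooled TC patients, the McNemar statistic is computed from the discordant (TC I, TC II) pairs, and the two one-sided p-values are combined with Fisher's method (for 4 degrees of freedom the chi-squared survival function has the closed form exp(-x/2)(1 + x/2)).

```python
import math

def mh_statistic(ntc_death, tc_deaths):
    """Mantel-Haenszel statistic over matched triplets. Each stratum has
    1 NTC patient (death indicator ntc_death[s]) and 2 pooled TC patients
    (tc_deaths[s] = number of TC deaths in triplet s, between 0 and 2)."""
    a = e = v = 0.0
    for d_ntc, d_tc in zip(ntc_death, tc_deaths):
        m = d_ntc + d_tc                         # total deaths in the stratum
        a += d_ntc
        e += m / 3.0                             # hypergeometric mean (1 of 3)
        v += 1 * 2 * m * (3 - m) / (9.0 * 2)     # hypergeometric variance
    return (a - e) / math.sqrt(v)

def mcnemar_statistic(n10, n01):
    """Signed McNemar statistic from the two discordant-pair counts."""
    return (n10 - n01) / math.sqrt(n10 + n01)

def norm_sf(z):
    """One-sided p-value from a standard normal reference distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def fisher_combination(p1, p2):
    """Fisher's method: -2 log P1 - 2 log P2 ~ chi-squared with 4 df."""
    x = -2.0 * (math.log(p1) + math.log(p2))
    return x, math.exp(-x / 2.0) * (1.0 + x / 2.0)
```

With equal discordant counts the McNemar statistic is 0 and its one-sided p-value is 0.5, as in the reported TC I versus TC II comparison.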

Table 2.3: Emergency department (ED) mortality by trauma center level before and after matching.

                Before matching                 After matching
          N         ED mortality - N (%)   N        ED mortality - N (%)
NTC       5,314     760 (14.3%)            3,158    319 (10.1%)
TC I     13,383     503 (3.8%)             3,158    134 (4.2%)
TC II     3,158     134 (4.2%)             3,158    134 (4.2%)

Sensitivity Analysis for Hidden Bias

The comparison of NTC and TC in mortality resulted in a very significant difference (P1 < 0.0001). How sensitive is this result to potentially unmeasured confounders? Under the sensitivity analysis assumptions, if Γ1 = 2 (i.e., subjects within each triplet can differ in the odds of being the patient admitted to NTC up to a factor 2), the p-value of the test can be as high as 0.0004. If Γ1 = 3, the upper bound of the p-value becomes very large (0.86). Using the method described by Nattino and Lu (2018), the sensitivity value at the 0.05 level is Γ1 = 2.34, because the 95% one-sided confidence interval of the conditional odds ratio between admission to NTC and mortality is (2.34, +∞). That is, the significance of the result at the α = 0.05 level cannot be explained away by unobserved confounders whose association with the admission to NTC (in the odds ratio scale) is smaller than 2.34.

The second comparison of our analysis was not significant (P2 = 0.5). Assuming that no hidden bias was present, no evidence of a mortality difference between TC I and TC II emerged from our data. Since the upper bounds of the p-value are larger than the reported p-value by definition, a sensitivity analysis cannot change the conclusions about this evidence factor.

A sensitivity analysis can be used to evaluate the robustness of the combined test. For example, assuming that subjects can differ in the odds of being the patient admitted to NTC up to a factor 2 within each triplet (Γ1 = 2) and that, in the (TC I, TC II) pairs, subjects can differ in the odds of being the patient admitted to TC I up to a factor 1.5 (Γ2 = 1.5), the p-value of the combined test can be as large as 0.003. To obtain a complete picture, we need to evaluate how the significance changes across a wide range of Γ1 and Γ2 values. This is done by considering a dense grid of values of the parameters and depicting the region where the upper bound of the p-value falls beyond a desired significance level α. Figure 2.3 presents such a region for α = 0.05. The upper bound is above 0.05 when the parameter Γ1 is slightly larger than 2. This implies that a lurking confounder U might explain away the significance of the overall test if it is able to cause the odds of being admitted to NTC to be more than twice the odds of being admitted to TC. Given the non-significance of McNemar's test, Γ2 has little impact on the overall test.
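To make the grid search concrete, here is a simplified, hypothetical sketch. It uses the matched-pairs special case, where the Rosenbaum upper bound on a one-sided sign-score p-value at sensitivity Γ is the binomial tail probability P(Bin(D, Γ/(1+Γ)) ≥ T), with D discordant pairs and T events among the treated; the bound for the triplet comparison in the actual analysis is analogous but more involved.

```python
import math

def binom_sf_ge(k, n, p):
    """P(Bin(n, p) >= k), computed from the exact binomial pmf."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k, n + 1))

def pair_upper_bound(t, d, gamma):
    """Rosenbaum upper bound for a one-sided matched-pairs p-value:
    t events among d discordant pairs, hidden bias at most gamma."""
    return binom_sf_ge(t, d, gamma / (1.0 + gamma))

def fisher_combination(p1, p2):
    """Combined p-value via the 4-df chi-squared closed form."""
    x = -2.0 * (math.log(p1) + math.log(p2))
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

def significance_region(t1, d1, t2, d2, gammas, alpha=0.05):
    """Grid of (gamma1, gamma2) pairs where the upper bound of the
    combined p-value exceeds alpha (the gray region of Figure 2.3)."""
    region = []
    for g1 in gammas:
        for g2 in gammas:
            p1 = pair_upper_bound(t1, d1, g1)
            p2 = pair_upper_bound(t2, d2, g2)
            if fisher_combination(p1, p2) > alpha:
                region.append((g1, g2))
    return region
```

The bound is increasing in Γ, so the region is an upper-right set in the (Γ1, Γ2) plane, matching the shape described for Figure 2.3.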

2.4.5 Conclusions

Overall, if there were no unmeasured confounders, the data provided strong evidence of a reduced mortality in the TC I and TC II groups if the same patient population (i.e., the patients receiving trauma care at TC II) had been treated at all three levels of trauma centers. The mortality of TC II hospitals was very similar to the one observed in TC I hospitals and the data did not provide evidence of a difference in quality of care between the two groups. This has important implications for regionalized trauma care planning. Upgrading the existing TC II to TC I appeared unnecessary in terms of patients' outcomes. On the other hand, the downgrade of TC II to NTC would likely be harmful for the patients. These results were consistent with previous studies that analyzed the same data but focused on NTC patients (Vickers et al., 2015; Shi et al., 2016). The authors found increased mortality rates in NTC centers and similar mortalities in TC I and TC II. The sensitivity analysis revealed that an unmeasured confounder with a moderate-to-large association with received trauma care may change the findings qualitatively.

Figure 2.3: Set (represented with the gray area) containing the values of the sensitivity parameters (Γ1, Γ2) that correspond to an upper bound of the p-value of the overall test larger than α = .05. (Horizontal axis: ΓMH; vertical axis: ΓMN.)

Chapter 3

Propensity Score Adjustment With Cluster Sampling Data

To infer causal effects in survey data, researchers need to take into account the unequal sampling probabilities of the subjects, the design of the survey and the observational nature of the data. Few methodological studies have focused on the generalization of propensity-score methods to the estimation of population-level effects in complex survey designs, especially cluster sampling.

This chapter discusses the estimation of population average treatment effects in the presence of complex survey data. In particular, I propose an estimator for two-stage cluster sample surveys, which have received little attention in previous research despite the popularity of the sampling design. The remainder of the chapter is organized as follows. Section 3.1 describes the role of the propensity score when inferring causal effects in survey data, clarifying the type of model that should be used for its estimation. The estimator of the population ATE for two-stage cluster sampling designs and its asymptotic properties are described in Section 3.2. Section 3.3 describes an extensive simulation study, which evaluates the proposed theoretical results. Finally, I apply the proposed methodology to real data in Section 3.4, to assess the effect of health insurance status on the decision to seek care after an injury.

3.1 Weighted Estimators for Population Average Treatment Effects in Complex Survey Data

3.1.1 Weighting in Causal Inference and Survey Sampling

Weighted estimators are popular in both the causal inference and the survey sampling literature. In the first field of study, they are commonly used to estimate sample marginal effects (see Section 1.4.2). For example, to estimate the sample ATE, each subject is weighted by the inverse of the probability of receiving the treatment that was actually received (i.e., the propensity score e(X) = P(Z = 1|X) for treated subjects, 1 − e(X) for controls). Such a weighting scheme aims to reconstruct the potential outcome sample (Hernán and Robins, 2018). The most intuitive weighted estimator of the sample ATE is

\hat{\Delta}_{SATE} = \frac{1}{n} \sum_{k \in S} \left[ \frac{Z_k}{e(X_k)} Y_k - \frac{1 - Z_k}{1 - e(X_k)} Y_k \right]. \qquad (3.1)

Similar weighted estimators have been extensively used in the survey sampling literature, to estimate population parameters. The Horvitz-Thompson estimator is the most common weighted estimator in this field (Levy and Lemeshow, 2013). Each observation is weighted by the inverse of the sampling probability, which is often known by design. Let S be the binary variable indicating whether a unit is sampled and let f(X) = P(S = 1|X) be its sampling probability. The Horvitz-Thompson estimator for the total of Y in the population, T_Y = \sum_{k \in U} Y_k, is

\hat{T}_Y = \sum_{k \in U} \frac{S_k}{f(X_k)} Y_k = \sum_{k \in S} \frac{1}{f(X_k)} Y_k. \qquad (3.2)

The inverse of the sampling probability is the survey weight and is denoted with w_k = 1/f(X_k). The average value of Y in the population can be estimated by \hat{T}_Y / N.
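A minimal numpy sketch of the two weighting schemes just described, with hypothetical toy data:

```python
import numpy as np

def sample_ate_ipw(y, z, e):
    """Equation (3.1): inverse-propensity-weighted estimator of the sample ATE.
    y: outcomes, z: treatment indicators, e: propensity scores e(X_k)."""
    return np.mean(z * y / e - (1 - z) * y / (1 - e))

def horvitz_thompson_total(y, f):
    """Equation (3.2): Horvitz-Thompson estimator of the population total of Y,
    from the sampled outcomes y and their sampling probabilities f(X_k)."""
    return np.sum(y / f)
```

Each treated subject contributes with weight 1/e(X_k), each control with weight 1/(1 − e(X_k)), and each sampled unit with survey weight 1/f(X_k).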

The causal parameter of interest in this chapter is the population ATE, Δ_PATE, which is formally introduced in Section 1.3. To estimate this parameter in the presence of complex sampling designs, a natural choice is to merge the two approaches presented above, by considering weights that are the product of survey and propensity score weights (Ashmead, 2014; Yang, 2018). The naive estimator of the population ATE is

\hat{\Delta}_{PATE} = \frac{1}{N} \sum_{k \in U} \left[ \frac{Z_k S_k}{e(X_k) f(X_k)} Y_k - \frac{(1 - Z_k) S_k}{(1 - e(X_k)) f(X_k)} Y_k \right]
                    = \frac{1}{N} \sum_{k \in S} \left[ \frac{Z_k}{e(X_k) f(X_k)} Y_k - \frac{1 - Z_k}{(1 - e(X_k)) f(X_k)} Y_k \right]. \qquad (3.3)

This estimator was originally proposed more than a decade ago (Zanutto, 2006). However, there is still no consensus on the approach that should be used to estimate the propensity score e(X), which is unknown in most cases (Lenis et al., 2017). As discussed in Section 1.5.2, the literature is still divided on whether survey weights should be considered in the estimation procedure of the propensity score model.

3.1.2 Treatment and Sample Selections

The population ATE is formally defined in the population of potential outcomes (left set on the second row of Figure 1.1). Notably, there are two distinct paths that may be considered for the selection of the study sample (bottom-right set of Figure 1.1). One possibility is that subjects receive the treatment first (treatment selection T_P) and are subsequently sampled via the survey design (sample selection S_T). The second option reverses the order of the selections. First, subjects are sampled via the survey design (sample selection S_PO). Then, the sampled subjects receive the treatment under study (treatment selection T_S).

Notably, the selection probabilities of the treatment selections T_P and T_S are different. It is possible to consider a population propensity score, denoted with e_P(X) = P(Z = 1|X), as the selection probability of the population treatment selection T_P. Alternatively, the selection probability of the sample treatment selection T_S is the sample propensity score, e_S(X, S) = P(Z = 1|X, S), which is conditional on the indicator specifying whether a subject has been drawn from the population.

A similar discussion involves the sample selections S_PO and S_T. I distinguish the potential outcome sampling probability, f_PO(X) = P(S = 1|X), which characterizes the sampling selection S_PO, and the sampling probability after treatment, f_T(X, Z) = P(S = 1|X, Z), which characterizes the sampling selection S_T and is conditional on the treatment received. It is worth noting that the two distinctions are only theoretical if the treatment and sampling selections are independent. In this special case, P(Z = 1|X, S) = P(Z = 1|X) and P(S = 1|X, Z) = P(S = 1|X). These equivalences imply that the two propensity scores and the two sets of sampling probabilities can be used interchangeably. However, in the most general case, the independence does not hold.

In the idealistic scenario where all four selection probabilities are known, Figure 1.1 suggests two different estimators for Δ_PATE, built upon the naive estimator in Equation (3.3). They are introduced in Proposition 3.1, which shows that both are unbiased.

Proposition 3.1. The following estimators are unbiased estimators for Δ_PATE:

\hat{\Delta}_{PATE,TS} = \frac{1}{N} \sum_{k \in S} \left[ \frac{Z_k}{e_P(X_k) f_T(X_k, Z_k)} Y_k - \frac{1 - Z_k}{(1 - e_P(X_k)) f_T(X_k, Z_k)} Y_k \right], \qquad (3.4)

\hat{\Delta}_{PATE,ST} = \frac{1}{N} \sum_{k \in S} \left[ \frac{Z_k}{e_S(X_k, S_k) f_{PO}(X_k)} Y_k - \frac{1 - Z_k}{(1 - e_S(X_k, S_k)) f_{PO}(X_k)} Y_k \right]. \qquad (3.5)

Proof. The estimators can be equivalently formulated as sums over all of the subjects

in the finite population:

\hat{\Delta}_{PATE,TS} = \frac{1}{N} \sum_{k \in U} \left[ \frac{Z_k S_k}{e_P(X_k) f_T(X_k, Z_k)} Y_k - \frac{(1 - Z_k) S_k}{(1 - e_P(X_k)) f_T(X_k, Z_k)} Y_k \right],

\hat{\Delta}_{PATE,ST} = \frac{1}{N} \sum_{k \in U} \left[ \frac{Z_k S_k}{e_S(X_k, S_k) f_{PO}(X_k)} Y_k - \frac{(1 - Z_k) S_k}{(1 - e_S(X_k, S_k)) f_{PO}(X_k)} Y_k \right].

If E\left[ \frac{Z_k S_k}{e_P(X_k) f_T(X_k, Z_k)} Y_k \right] = Y_k^1 and E\left[ \frac{(1 - Z_k) S_k}{(1 - e_P(X_k)) f_T(X_k, Z_k)} Y_k \right] = Y_k^0, the unbiasedness of \hat{\Delta}_{PATE,TS} is proven, because

E\left[ \hat{\Delta}_{PATE,TS} \right] = \frac{1}{N} \sum_{k \in U} \left( E\left[ \frac{Z_k S_k}{e_P(X_k) f_T(X_k, Z_k)} Y_k \right] - E\left[ \frac{(1 - Z_k) S_k}{(1 - e_P(X_k)) f_T(X_k, Z_k)} Y_k \right] \right) = \frac{1}{N} \sum_{k \in U} \left( Y_k^1 - Y_k^0 \right) = \Delta_{PATE}.

Similarly, E\left[ \frac{Z_k S_k}{e_S(X_k, S_k) f_{PO}(X_k)} Y_k \right] = Y_k^1 and E\left[ \frac{(1 - Z_k) S_k}{(1 - e_S(X_k, S_k)) f_{PO}(X_k)} Y_k \right] = Y_k^0 imply the unbiasedness of \hat{\Delta}_{PATE,ST}.

I will show only the first equality of each estimator; the second follows analogously. For the first estimator:

E\left[ \frac{Z_k S_k}{e_P(X_k) f_T(X_k, Z_k)} Y_k \right]
= E\left[ E\left[ E\left[ \frac{Z_k S_k}{e_P(X_k) f_T(X_k, Z_k)} Y_k^1 \mid Z_k \right] \mid X_k \right] \right]
= E\left[ \frac{Y_k^1}{e_P(X_k) f_T(X_k, Z_k)} E\left[ Z_k E[S_k \mid Z_k] \mid X_k \right] \right]
= E\left[ \frac{Y_k^1}{e_P(X_k) f_T(X_k, Z_k)} E[Z_k \mid X_k] E[S_k \mid X_k, Z_k] \right]
= E\left[ \frac{Y_k^1}{e_P(X_k) f_T(X_k, Z_k)} e_P(X_k) f_T(X_k, Z_k) \right]
= Y_k^1. \qquad (3.6)

Similarly, for the second estimator:

E\left[ \frac{Z_k S_k}{e_S(X_k, S_k) f_{PO}(X_k)} Y_k \right]
= E\left[ E\left[ E\left[ \frac{Z_k S_k}{e_S(X_k, S_k) f_{PO}(X_k)} Y_k^1 \mid S_k \right] \mid X_k \right] \right]
= E\left[ \frac{Y_k^1}{e_S(X_k, S_k) f_{PO}(X_k)} E\left[ S_k E[Z_k \mid S_k] \mid X_k \right] \right]
= E\left[ \frac{Y_k^1}{e_S(X_k, S_k) f_{PO}(X_k)} E[S_k \mid X_k] E[Z_k \mid X_k, S_k] \right]
= E\left[ \frac{Y_k^1}{e_S(X_k, S_k) f_{PO}(X_k)} f_{PO}(X_k) e_S(X_k, S_k) \right]
= Y_k^1. \qquad (3.7)

Importantly, the unbiasedness does not hold for estimators combining the population propensity score e_P(X) with the sampling probabilities before treatment f_PO(X), or the sample propensity score e_S(X, S) with the sampling probabilities after treatment f_T(X, Z). In these cases, the propensity score and the sampling probabilities would not cancel out in Equations (3.6) and (3.7), introducing a bias in the estimation of Δ_PATE.

3.1.3 Weighted or Unweighted Propensity Score?

Most of the time, the propensity score is unknown and must be estimated. In particular, the sample propensity score e_S(X, S) is a sample parameter and it can be estimated with a traditional, unweighted regression model. In contrast, the population propensity score e_P(X) is a population parameter. To estimate it, it is necessary to account for the sample selection S_T, reconstructing the population after treatment. Therefore, the estimation of e_P(X) requires a survey-weighted model, with weights equal to the inverse of the probabilities f_T(X, Z).

Assuming that the sampling probabilities f_T(X_k, Z_k) are known and that it is therefore possible to estimate both e_S(X_k, S_k) and e_P(X_k) for each subject in the sample, which propensity score model should be used to estimate Δ_PATE? Section 3.1.2 provides a clear answer to this question. If the only available sampling probabilities are the f_T(X_k, Z_k), researchers should use the estimates of the population propensity score e_P(X_k), because the unbiasedness of the estimator of Δ_PATE is guaranteed only if f_T(X, Z) is combined with e_P(X) (Equation (3.4)). On the other hand, if the sampling probabilities f_PO(X_k) are available, it is also possible to use the estimator in Equation (3.5), which combines these sampling probabilities with the sample propensity score e_S(X, S).

In practice, survey data are usually accompanied by only one set of sampling weights, which are computed on the basis of the sampling design and considered as known. Consequently, the choice of the propensity score model depends on the probabilities on which the sampling weights are based: f_T(X_k, Z_k) or f_PO(X_k). Identifying the type of survey weights provided with the data is not a statistical task; rather, the answer should be indicated by the study design. Nonetheless, there are some practical considerations worth noting. From a purely sampling perspective, the selection of the sample is often based on design variables, which are assumed to be part of the covariates X. In this ideal situation, treatment and sample selections are independent and the two sets of propensity scores coincide, as do the two sets of sampling probabilities (as noted in Section 3.1.2). In this case, whatever set of sampling weights is provided with the survey data, both propensity score models can be used. However, realistically, the sample selection is likely to depend on the treatment. For example, the treatment might affect the probability that a subject responds to the survey. Because survey weights are often adjusted for nonresponse, it is likely that the available weights depend on the received treatment. For this reason, it is more likely that the set of sampling probabilities provided with the survey data are conditional on the treatment received and correspond to the set of f_T(X_k, Z_k). In this case, the population propensity score (i.e., the one estimated with a survey-weighted model) is the correct choice to estimate Δ_PATE.

In the remainder of the chapter, I will assume that the sampling probabilities f_T(X, Z) are known by design. Consequently, I will consider the estimation of the population propensity score e_P(X) in order to estimate the population ATE. To simplify the notation, these selection probabilities will be denoted with f(X, Z) and e(X).
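As a sketch of the survey-weighted estimation of e_P(X): the weighted logistic score equations, sum_k w_k (Z_k − expit(x_k'β)) x_k = 0 with weights w_k = 1/f_T(X_k, Z_k), can be solved with a few Newton (IRLS) steps. This is a hypothetical, minimal implementation, not the dissertation's code; in practice one would use survey-aware software.

```python
import numpy as np

def weighted_logit(X, z, w, iters=25):
    """Weighted logistic regression by iteratively reweighted least squares.
    Solves sum_k w_k (z_k - expit(x_k' b)) x_k = 0 for b (with intercept)."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    b = np.zeros(p + 1)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(Xd @ b)))    # fitted probabilities
        W = w * mu * (1 - mu)                   # IRLS working weights
        score = Xd.T @ (w * (z - mu))           # weighted score vector
        hess = Xd.T @ (Xd * W[:, None])         # weighted information matrix
        b += np.linalg.solve(hess, score)       # Newton step
    return b
```

With all weights equal to one, the fit reduces to the ordinary (unweighted) logistic regression that would be used for the sample propensity score e_S(X, S).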

3.2 Two-Stage Cluster Sample Surveys

3.2.1 Cluster Sampling Design: Notation

Proposition 3.1 shows the unbiasedness of two naive weighted estimators for the population ATE. The result holds for any complex sampling design, provided that the appropriate sampling weights are available. However, the variance of such estimators depends on the specific sampling design. Ashmead (2014) focused on single-stage sampling frameworks, discussing applications to stratified sampling. I will focus on multi-stage sampling designs, which have received little attention in the literature and, in particular, on two-stage cluster sampling, which is the most popular multi-stage sampling design (Levy and Lemeshow, 2013).

In order to study the asymptotic properties of estimators targeting finite-population parameters, I follow the approach described by Fuller (2011). The finite population is assumed to be sampled from an infinite superpopulation (as discussed in Section 1.2) and asymptotic results are derived considering sequences of growing finite populations. In two-stage cluster sampling, units are grouped in clusters, also called level-1 units or primary sampling units. Each cluster is a collection of level-2 units, simply referred to as units.

The superpopulation is assumed to be an infinite set of finite clusters. The finite population F_N is a set of N clusters, drawn from the superpopulation by simple random sampling. The subscript emphasizes the number of primary sampling units in the population, i.e., the number of clusters N. Let M_i be the number of units belonging to cluster i. Denote with U = {1, ..., N} and U_i = {1, ..., M_i} the index sets for clusters and units in cluster i, respectively.

Despite the hierarchical structure of multi-stage sampling designs, I focus on unit-level treatments. In particular, I assume the identifiability conditions discussed in Section 1.4.1 to hold, to guarantee the estimability of causal effects with observational data. As discussed in Section 3.1, I also consider a sampling design where units receive the treatment before the selection of the study sample.

In two-stage cluster sampling, the sample is drawn in two steps. First, n clusters are sampled from the N in the finite population. Define X_i = {X_ij}_{j in U_i} and Z_i = {Z_ij}_{j in U_i}, the collections of covariates and treatment statuses of the subjects in cluster i. The sampling probability of cluster i is denoted with f(X_i, Z_i) = P(S_i = 1|X_i, Z_i), where S_i indicates whether cluster i is sampled (S_i = 1) or not (S_i = 0). Let S ⊂ U be the index set of the sampled clusters. In the second sampling step, units are selected within clusters. In particular, m_i units are sampled from cluster i ∈ S. The sampling probability of unit j in cluster i is denoted with f(X_ij, Z_ij|S_i = 1) = P(S_ij = 1|X_ij, Z_ij, S_i = 1). The index set of the units sampled in cluster i is denoted with S_i. The sampling probabilities are considered as known.

For each cluster i ∈ S and unit j ∈ S_i, define the cluster and unit weights as w_i = 1/f(X_i, Z_i) and w_{j|i} = 1/f(X_ij, Z_ij|S_i = 1), respectively. Finally, denote with A_ij = (Y_ij^0, Y_ij^1, Z_ij, X_ij^T)^T the vector of the potential outcomes, treatment assignment and covariates of unit j in cluster i.

3.2.2 Weighted Estimator for Population ATE

To estimate the population ATE in two-stage cluster sampling designs, one possibility is to consider the weighted estimator that is routinely used for the population mean of a variable. First of all, cluster-specific total treatment effects T = P Y 1 − Y 0 ∆,i j∈Ui ij ij are estimated with   X Zij 1 − Zij Tb∆,i = wj|i Yij − Yij , (3.8) e(Xij) 1 − e(Xij) j∈Si

for each $i \in S$. Then, the cluster-specific estimates are pooled with appropriate weights to estimate the population total treatment effect $T_\Delta = \sum_{i \in U} T_{\Delta,i}$. The estimator of the population ATE is obtained by dividing the estimated total treatment effect by the population size:

$$\hat{\Delta}_{PATE} = \frac{1}{\sum_{i \in U} M_i}\sum_{i \in S} w_i \hat{T}_{\Delta,i} = \frac{1}{\sum_{i \in U} M_i}\sum_{i \in S} w_i \sum_{j \in S_i} w_{j|i}\left(\frac{Z_{ij}}{e(X_{ij})}Y_{ij} - \frac{1-Z_{ij}}{1-e(X_{ij})}Y_{ij}\right). \qquad (3.9)$$

The following result proves the unbiasedness of the estimator.

Proposition 3.2. The weighted estimator in Equation (3.9) is unbiased for $\Delta_{PATE}$.

Proof. Rewrite the estimator as a sum over all the clusters in the population:

$$\hat{\Delta}_{PATE} = \frac{1}{\sum_{i \in U} M_i}\sum_{i \in U} w_i S_i \hat{T}_{\Delta,i}.$$

The expected value of this estimator is:

$$E\left[\hat{\Delta}_{PATE}\right] = \frac{1}{\sum_{i \in U} M_i}\sum_{i \in U} E\left[w_i S_i \hat{T}_{\Delta,i}\right] = \frac{1}{\sum_{i \in U} M_i}\sum_{i \in U} E\left[E\left[w_i S_i \hat{T}_{\Delta,i} \,\middle|\, Z_i, X_i\right]\right]$$
$$= \frac{1}{\sum_{i \in U} M_i}\sum_{i \in U} E\left[w_i\, E\left[S_i \,\middle|\, Z_i, X_i\right] E\left[\hat{T}_{\Delta,i} \,\middle|\, Z_i, X_i\right]\right] = \frac{1}{\sum_{i \in U} M_i}\sum_{i \in U} E\left[\frac{P(S_i = 1 | Z_i, X_i)}{P(S_i = 1 | Z_i, X_i)}\, E\left[\hat{T}_{\Delta,i} \,\middle|\, Z_i, X_i\right]\right].$$

The proof that $E\left[\hat{T}_{\Delta,i} \,\middle|\, Z_i, X_i\right] = T_{\Delta,i}$ is analogous to the proof of Proposition 3.1. Therefore,

$$E\left[\hat{\Delta}_{PATE}\right] = \frac{1}{\sum_{i \in U} M_i}\sum_{i \in U}\sum_{j \in U_i}\left(Y_{ij}^1 - Y_{ij}^0\right) = \Delta_{PATE}.$$
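For concreteness, the computation in Equation (3.9) can be sketched in a few lines. This is an illustrative Python sketch with a hypothetical data layout (one tuple per sampled cluster, holding the cluster weight and an array of unit records), not code from the dissertation:

```python
import numpy as np

def pate_total_weighted(samples, M_total):
    """Weighted ATE estimator of Equation (3.9).

    `samples`: list with one entry per sampled cluster, (w_i, records),
    where `records` has columns (Y, Z, e, w_j_given_i) for the sampled units.
    `M_total`: number of level-2 units in the finite population.
    """
    total = 0.0
    for w_i, rec in samples:
        Y, Z, e, w_ji = rec[:, 0], rec[:, 1], rec[:, 2], rec[:, 3]
        # cluster-specific estimate of the total treatment effect, Eq. (3.8)
        T_hat = np.sum(w_ji * (Z / e * Y - (1 - Z) / (1 - e) * Y))
        total += w_i * T_hat
    return total / M_total
```

With a single self-representing cluster (all weights equal to one) the function reduces to the usual inverse-propensity-weighted mean difference.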

If the population size is unknown, it can be replaced by the sum of the weights in the sample (Levy and Lemeshow, 2013). Defining $w_{ij} = w_i w_{j|i}$, an alternative estimator of the population ATE is

$$\hat{\Delta}_{PATE,1} = \frac{1}{\sum_{i \in S}\sum_{j \in S_i} w_{ij}}\sum_{i \in S}\sum_{j \in S_i} w_{ij}\left(\frac{Z_{ij}}{e(X_{ij})}Y_{ij} - \frac{1-Z_{ij}}{1-e(X_{ij})}Y_{ij}\right). \qquad (3.10)$$

Alternatively, it is possible to show that

$$E\left[\sum_{i \in U}\sum_{j \in U_i} w_{ij}\frac{Z_{ij}S_{ij}}{e(X_{ij})}\right] = E\left[\sum_{i \in U}\sum_{j \in U_i} w_{ij}\frac{(1-Z_{ij})S_{ij}}{1-e(X_{ij})}\right] = \sum_{i \in U} M_i, \qquad (3.11)$$

using derivations analogous to the proof of Proposition 3.2. This result suggests another weighted estimator, which separately computes the averages of treated and controls. Following the work of Lunceford and Davidian (2004) for traditional simple random sampling designs, the estimator in Equation (3.9) is split into the difference of the averages among treated and controls. The denominators of the two averages are replaced by the estimators presented in Equation (3.11), generating two ratio estimators for the average values of $Y^1$ and $Y^0$ in the population. The difference of these two terms is the following estimator of the population ATE:

$$\hat{\Delta}_{PATE,2} = \left(\sum_{i \in S}\sum_{j \in S_i} w_{ij}\frac{Z_{ij}}{e(X_{ij})}\right)^{-1}\sum_{i \in S}\sum_{j \in S_i} w_{ij}\frac{Z_{ij}}{e(X_{ij})}Y_{ij} - \left(\sum_{i \in S}\sum_{j \in S_i} w_{ij}\frac{1-Z_{ij}}{1-e(X_{ij})}\right)^{-1}\sum_{i \in S}\sum_{j \in S_i} w_{ij}\frac{1-Z_{ij}}{1-e(X_{ij})}Y_{ij}. \qquad (3.12)$$

Notably, ratio estimators are not necessarily unbiased and, therefore, this characteristic is inherited by $\hat{\Delta}_{PATE,2}$ (Levy and Lemeshow, 2013). Nevertheless, Section 3.2.4 discusses the asymptotic properties of the proposed estimators and shows that they are consistent. Because the population is assumed to be drawn from the infinite superpopulation, the estimators defined in Equations (3.10) and (3.12) can also be thought of as estimators for the superpopulation ATE, even though, in practice, we may have $\Delta_{SPATE} \neq \Delta_{PATE}$. Given the possibility to use the estimators above for both population and superpopulation parameters, I will simply denote the estimators in Equations (3.10) and (3.12) with $\hat{\Delta}_1$ and $\hat{\Delta}_2$.
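The two normalized estimators (3.10) and (3.12) can be sketched as follows, assuming flattened arrays of outcomes, treatments, propensity scores and overall weights $w_{ij}$ for the sampled units (a hypothetical data layout, for illustration only):

```python
import numpy as np

def delta_1(Y, Z, e, w):
    """Estimator of Eq. (3.10): IPW contrasts normalized by the summed weights."""
    ipw = Z / e * Y - (1 - Z) / (1 - e) * Y
    return np.sum(w * ipw) / np.sum(w)

def delta_2(Y, Z, e, w):
    """Ratio estimator of Eq. (3.12): separately normalized treated and
    control weighted means, then their difference."""
    wt = w * Z / e              # weights for the treated mean
    wc = w * (1 - Z) / (1 - e)  # weights for the control mean
    return np.sum(wt * Y) / np.sum(wt) - np.sum(wc * Y) / np.sum(wc)
```

The two estimators differ only in the normalizing denominators, which is exactly what drives the small efficiency differences observed in the simulation study.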

3.2.3 Propensity Score Estimation

Proposition 3.2 provides an unbiased estimator for the population and superpopulation ATE. However, the expression relies on the propensity score, which must be estimated in most cases. As discussed in Section 3.1, I consider estimators of the population ATE using the population propensity score, which is estimated with a survey-weighted parametric logistic regression model. Under this modeling assumption, I emphasize the dependence of the propensity score on $\beta$, the coefficients of the logistic model. If the model is correctly specified, the propensity score of subject j in cluster i corresponds to

$$e(X_{ij}, \beta) = \frac{1}{1 + \exp\left(-\beta_0 - \sum_{k=1}^{p}\beta_k X_{kij}\right)}. \qquad (3.13)$$

The estimator $\hat{\beta}$ of the parameters does not have a closed form; it is implicitly defined as the solution to the $p + 1$ estimating equations

$$\sum_{i \in S}\sum_{j \in S_i} w_{ij}\psi_{\beta ij} = \sum_{i \in S}\sum_{j \in S_i} w_{ij}\left(Z_{ij} - e(X_{ij}, \beta)\right)X_{ij} = 0. \qquad (3.14)$$
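The weighted score equations (3.14) can be solved with a few Newton-Raphson steps. The sketch below is a minimal illustration under my own naming (a real analysis would use dedicated survey-regression software):

```python
import numpy as np

def weighted_logistic_ps(X, Z, w, iters=25):
    """Newton-Raphson solver for the survey-weighted score equations (3.14).

    X: (n, p) covariates (an intercept column is appended internally),
    Z: 0/1 treatment indicators, w: survey weights w_ij.
    Returns the coefficient estimates and the fitted propensity scores.
    """
    Xd = np.column_stack([np.ones(len(Z)), X])   # prepend intercept
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        e = 1.0 / (1.0 + np.exp(-Xd @ beta))
        score = Xd.T @ (w * (Z - e))             # weighted score, Eq. (3.14)
        info = Xd.T @ (Xd * (w * e * (1 - e))[:, None])
        beta = beta + np.linalg.solve(info, score)
    e = 1.0 / (1.0 + np.exp(-Xd @ beta))
    return beta, e
```

At the solution, the weighted score is numerically zero, which is exactly the defining condition in Equation (3.14).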

A natural approach to estimate $\Delta_{PATE}$ and $\Delta_{SPATE}$ is to deal with the estimation of the propensity score and of the treatment effect as separate steps. First, the parameters of the propensity score are estimated with $\hat{\beta}$. Then, $e(X_{ij}, \hat{\beta})$ is substituted in Equation (3.10) or (3.12). Although this procedure generates a consistent estimator of the treatment effects, it is important to account for the variance of $\hat{\beta}$ when computing the variance of $\hat{\Delta}_1$ or $\hat{\Delta}_2$.

Lunceford and Davidian (2004) described a framework to incorporate the uncertainty introduced by the estimation of the propensity score in the variability of a weighted treatment effect estimator. The idea is to consider the joint estimation of

$\Delta_{PATE}$ and $\beta$ via a single set of estimating equations. Such equations are constructed by augmenting the $p+1$ estimating equations of the propensity-score parameters with ad-hoc linear equations, having the causal parameters of interest as unique roots. Define the function $g_1$ as

$$g_1(A_{ij}, \theta) = \begin{pmatrix}\psi_{\Delta ij}\\[2pt] \psi_{\beta ij}\end{pmatrix} = \begin{pmatrix}\left(\dfrac{Z_{ij}}{e(X_{ij},\beta)} - \dfrac{1-Z_{ij}}{1-e(X_{ij},\beta)}\right)Y_{ij} - \Delta_{PATE}\\[8pt] \left(Z_{ij} - e(X_{ij},\beta)\right)X_{ij}\end{pmatrix}, \qquad (3.15)$$

where $\theta = (\Delta_{PATE}, \beta^T)^T$ is the vector of unknown parameters. The desired estimators, $\hat{\Delta}_1$ and $\hat{\beta}$, are the components of the solution $\hat{\theta}$ of the following estimating equations:

$$\sum_{i \in S}\sum_{j \in S_i} w_{ij}\, g_1(A_{ij}, \theta) = 0. \qquad (3.16)$$

Similarly, it is possible to define the function $g_2$ as

$$g_2(A_{ij}, \eta) = \begin{pmatrix}\psi_{\mu_1 ij}\\ \psi_{\mu_0 ij}\\ \psi_{\beta ij}\end{pmatrix} = \begin{pmatrix}\dfrac{Z_{ij}}{e(X_{ij},\beta)}\left(Y_{ij} - \mu_1\right)\\[8pt] \dfrac{1-Z_{ij}}{1-e(X_{ij},\beta)}\left(Y_{ij} - \mu_0\right)\\[8pt] \left(Z_{ij} - e(X_{ij},\beta)\right)X_{ij}\end{pmatrix}, \qquad (3.17)$$

where the parameters $\eta = (\mu_1, \mu_0, \beta^T)^T$ are estimated via the set of equations

$$\sum_{i \in S}\sum_{j \in S_i} w_{ij}\, g_2(A_{ij}, \eta) = 0. \qquad (3.18)$$

Notably, the estimator $\hat{\Delta}_2$ can be expressed as $\hat{\Delta}_2 = \hat{\mu}_1 - \hat{\mu}_0$ or, defining the vector $u = (1, -1, 0^T)^T$, as $\hat{\Delta}_2 = u^T\hat{\eta}$.

3.2.4 Asymptotic Properties

The estimators $\hat{\Delta}_1$ and $\hat{\Delta}_2$ can be considered as estimators for both the population and superpopulation ATE (see Section 3.2.2). However, their asymptotic properties and variance depend on the target parameter. I will use the notation $V_{N,1} = Var(\hat{\theta}\,|\,F_N)$ and $V_{N,2} = Var(\hat{\eta}\,|\,F_N)$ to denote the design variances of $\hat{\theta}$ and $\hat{\eta}$, i.e., the covariance matrices considering only the randomness due to the sampling from the finite population (Fuller, 2011). I will denote the variances of the same estimators at the superpopulation level with $V_{0,1} = Var(\hat{\theta})$ and $V_{0,2} = Var(\hat{\eta})$, which account for the additional variability introduced by the sampling of the finite population from the infinite superpopulation. With this notation, the variance of $\hat{\Delta}_1$ as an estimator of $\Delta_{PATE}$ and of $\Delta_{SPATE}$ is the top-left entry of the matrices $V_{N,1}$ and $V_{0,1}$, respectively. Similarly, the variance of $\hat{\Delta}_2$ as an estimator of $\Delta_{PATE}$ and of $\Delta_{SPATE}$ is $u^T V_{N,2} u$ and $u^T V_{0,2} u$, respectively.

To describe the asymptotic distribution of the estimators $\hat{\Delta}_1$ and $\hat{\Delta}_2$, I follow the framework described by Fuller (2011) and Binder (1983). Asymptotic properties are established by considering the finite population $F_N$ as the N-th element of an increasing sequence of populations. Each population in the sequence is considered as a sample from the infinite superpopulation.

Notably, large-sample properties are described as the number of primary sampling units goes to infinity. This implies an important difference between multi-stage sampling (such as two-stage cluster sampling) and single-stage sampling designs, where the primary sampling units are the observational units. Here, the primary sampling units are the clusters and the limiting assumption is that the number of clusters, N, goes to infinity. As will be noted later, the cluster sizes $M_i$ do not need to be large.

Denote the totals of $g_1(A_{ij}, \theta)$ and $g_2(A_{ij}, \eta)$ in the population $F_N$ and their natural sample-based estimators with

$$T_{g,1}(\theta) = \sum_{i \in U}\sum_{j \in U_i} g_1(A_{ij}, \theta), \qquad \hat{T}_{g,1}(\theta) = \sum_{i \in S}\sum_{j \in S_i} w_{ij}\, g_1(A_{ij}, \theta), \qquad (3.19)$$

$$T_{g,2}(\eta) = \sum_{i \in U}\sum_{j \in U_i} g_2(A_{ij}, \eta), \qquad \hat{T}_{g,2}(\eta) = \sum_{i \in S}\sum_{j \in S_i} w_{ij}\, g_2(A_{ij}, \eta). \qquad (3.20)$$

Using this notation, the estimators $\hat{\theta}$ and $\hat{\eta}$ are the solutions to $\hat{T}_{g,1}(\theta) = 0$ and $\hat{T}_{g,2}(\eta) = 0$, respectively. Denote with $\theta_N$ and $\theta^0$ the solutions to $T_{g,1}(\theta) = 0$ and $E[T_{g,1}(\theta)] = 0$, respectively. Similarly, denote with $\eta_N$ and $\eta^0$ the solutions to $T_{g,2}(\eta) = 0$ and $E[T_{g,2}(\eta)] = 0$. Finally, denote the total number of level-2 units in the population $F_N$ with $M_N^{tot} = \sum_{i \in U} M_i$ and the expected number of sampled units with $m_N^{tot} = E\left[\sum_{i \in S} m_i \,\middle|\, F_N\right]$.

The following proposition provides the asymptotic distribution of $\hat{\Delta}_1$. This result requires a set of technical assumptions about the estimating equations, the superpopulation model and the sampling design.

Proposition 3.3. Let $\{F_N\}_{N > 1}$ be a sequence of finite populations, where each population $F_N$ is composed of N clusters. Let $A_i = \{A_{ij}\}_{j \in U_i}$ be the collection of level-2 units in cluster i. Consider a two-stage cluster sampling procedure, as described in Section 3.2.1, and the following assumptions:

1. The $A_i$ are independent random draws from a distribution with finite fourth moments.

2. $(m_N^{tot})^{1/2}(M_N^{tot})^{-1}\left(\hat{T}_{g,1}(\theta) - T_{g,1}(\theta)\right) \big|\, F_N \xrightarrow{L} N(0, \Sigma_{gg,1})$ for all $\theta$ in a neighborhood of $\theta^0$, where $\Sigma_{gg,1}$ is positive definite.

3. $(M_N^{tot})^{-1}\sum_{i \in S,\, j \in S_i} w_{ij} H_1(A_{ij}, \theta) = (M_N^{tot})^{-1}\sum_{i \in U,\, j \in U_i} H_1(A_{ij}, \theta) + O_p\left((m_N^{tot})^{-1/2}\right)$ and $\lim_{N\to\infty}(M_N^{tot})^{-1}\sum_{i \in U,\, j \in U_i} H_1(A_{ij}, \theta) = H_{\infty,1}(\theta)$ almost surely, for all $\theta$ in a neighborhood of $\theta^0$ and for all $A_{ij}$, where $H_{\infty,1}(\theta)$ is nonsingular and $H_1(A_{ij}, \theta) = (\partial/\partial\theta)\, g_1(A_{ij}, \theta)$.

Then, as $N \to \infty$,

$$\hat{V}_{N,1}^{-1/2}\left(\hat{\theta} - \theta_N\right) \big|\, F_N \xrightarrow{d} N(0, I), \qquad (3.21)$$

$$\hat{V}_{0,1}^{-1/2}\left(\hat{\theta} - \theta^0\right) \xrightarrow{d} N(0, I), \qquad (3.22)$$

where

$$\hat{V}_{N,1} = \hat{H}_1^{-1}\hat{V}_{gg,1}\left(\hat{H}_1^{-1}\right)^T, \qquad \hat{V}_{0,1} = \hat{H}_1^{-1}\left(\hat{V}_{gg,1} + M_N^{tot}\hat{\Sigma}_{gg,1}\right)\left(\hat{H}_1^{-1}\right)^T,$$

$$\hat{H}_1 = \sum_{i \in S,\, j \in S_i} w_{ij} H_1(A_{ij}, \hat{\theta}), \qquad M_N^{tot}\hat{\Sigma}_{gg,1} = \sum_{i \in S,\, j \in S_i} w_{ij}\, g_1(A_{ij}, \hat{\theta})\left(g_1(A_{ij}, \hat{\theta})\right)^T,$$

and $\hat{V}_{gg,1}$ is an estimator of the design variance $V_{gg,1} = Var(\hat{T}_{g,1}|F_N)$.

Proof. The results follow from Theorem 1.3.9 and Corollary 1.3.9.1 in Fuller (2011). To prove the results of the Proposition, it suffices to show that the assumptions of the theorem are met. In particular:

- $E[T_g(\theta)]$ is a one-to-one function in $\theta$. For the first component of the vector $T_g(\theta)$, $E\left[\sum_{i \in U}\sum_{j \in U_i}\psi_{\Delta ij}\right]$ is linear in $\Delta_{PATE}$ and is therefore a one-to-one function of $\Delta_{PATE}$. For the second component of the vector, $E\left[\sum_{i \in U}\sum_{j \in U_i}\psi_{\beta ij}\right]$ is the expectation of the score function of a logistic regression model and it is a one-to-one function of $\beta$. Therefore, $E[T_g(\theta)]$ is a one-to-one function of $\theta = (\Delta_{PATE}, \beta^T)^T$.

- $V_{gg,1}$ is positive definite. This follows from the definition of $V_{gg,1}$ in the considered sampling design.

- For any value of $A_{ij}$, $g_1(A_{ij}, \cdot)$ is continuous in a closed set $B$ containing $\theta^0$ as an interior point. Note that $g_1(A_{ij}, \cdot)$ is smooth everywhere. To verify this, note that $e(X_{ij}, \beta)$ is the probability assumed by a logistic regression model and, therefore, it is smooth with respect to $\beta$ and $0 < e(X_{ij}, \beta) < 1$. Both $\psi_{\Delta ij}$ and $\psi_{\beta ij}$, the components of $g_1(A_{ij}, \cdot)$, are smooth with respect to $\beta$, as they are compositions of continuous functions. $\psi_{\Delta ij}$ is also smooth in $\Delta_{PATE}$, while $\psi_{\beta ij}$ does not depend on $\Delta_{PATE}$. Hence, $g_1(A_{ij}, \cdot)$ is smooth everywhere and, in particular, it is continuous in a closed set $B$ containing $\theta^0$ as an interior point.

- For any value of $A_{ij}$, $H_1(A_{ij}, \cdot)$ is continuous for all $\theta$ in $B$. Note that $H_1(A_{ij}, \cdot)$ is continuous everywhere, as it is the gradient of $g_1(A_{ij}, \cdot)$, which is a smooth function. Therefore, $H_1(A_{ij}, \cdot)$ is continuous for all $\theta$ in $B$.

- $\|g_1(A_{ij}, \theta)\|_2 < \xi(A_{ij})$ for all $\theta$ in $B$, with $\xi(A_{ij})$ having finite fourth moment. Let $c_\Delta$ be the maximum of $|\Delta_{PATE}|$ in $B$. Under the assumption that $0 < \underline{c}_e \leq e(X_{ij}, \beta) \leq \bar{c}_e < 1$, note that

$$\left|\left(\frac{Z_{ij}}{e(X_{ij},\beta)} - \frac{1-Z_{ij}}{1-e(X_{ij},\beta)}\right)Y_{ij} - \Delta_{PATE}\right| \leq \left(\frac{1}{e(X_{ij},\beta)} + \frac{1}{1-e(X_{ij},\beta)}\right)|Y_{ij}| + |\Delta_{PATE}| \leq c_e|Y_{ij}| + c_\Delta \leq c_e\left(|Y_{ij}^1| + |Y_{ij}^0|\right) + c_\Delta,$$

where $c_e$ is a suitable constant depending on $\underline{c}_e$ and $\bar{c}_e$. In addition, $\|(Z_{ij} - e(X_{ij}, \beta))X_{ij}\|_2 \leq \|X_{ij}\|_2$. Therefore

$$\|g_1(A_{ij}, \theta)\|_2 = \sqrt{\left(\left(\frac{Z_{ij}}{e(X_{ij},\beta)} - \frac{1-Z_{ij}}{1-e(X_{ij},\beta)}\right)Y_{ij} - \Delta_{PATE}\right)^2 + \left\|(Z_{ij} - e(X_{ij},\beta))X_{ij}\right\|_2^2} \leq \sqrt{\left(c_e(|Y_{ij}^1| + |Y_{ij}^0|) + c_\Delta\right)^2 + \|X_{ij}\|_2^2},$$

which has finite fourth moment as $Y_{ij}^1$, $Y_{ij}^0$ and $X_{ij}$ have finite fourth moments by assumption.
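As an illustration, the bread matrix $\hat{H}_1$ and the sandwich $\hat{V}_{N,1} = \hat{H}_1^{-1}\hat{V}_{gg,1}(\hat{H}_1^{-1})^T$ for the stacked equations (3.15) can be assembled as in the sketch below. Names are mine, and a design-variance estimate $\hat{V}_{gg,1}$ (e.g., from Eq. 3.24) is taken as input:

```python
import numpy as np

def sandwich_VN1(Y, Z, X, w, beta, Vgg1):
    """Plug-in sandwich variance for theta = (Delta_PATE, beta).

    X must already include the intercept column; Vgg1 is an estimate of the
    design variance of the total of g1. The per-unit Jacobian H1 is built from
    the analytic derivatives of psi_Delta and psi_beta in Eq. (3.15).
    """
    e = 1.0 / (1.0 + np.exp(-X @ beta))
    d = 1 + X.shape[1]                      # dimension of theta
    H = np.zeros((d, d))
    for i in range(len(Y)):
        Hi = np.zeros((d, d))
        Hi[0, 0] = -1.0                     # d psi_Delta / d Delta
        # d psi_Delta / d beta (derivative of the IPW contrast w.r.t. beta)
        Hi[0, 1:] = -Y[i] * (Z[i] * (1 - e[i]) / e[i]
                             + (1 - Z[i]) * e[i] / (1 - e[i])) * X[i]
        # d psi_beta / d beta: minus the logistic information kernel
        Hi[1:, 1:] = -e[i] * (1 - e[i]) * np.outer(X[i], X[i])
        H += w[i] * Hi
    Hinv = np.linalg.inv(H)
    return Hinv @ Vgg1 @ Hinv.T
```

The top-left entry of the returned matrix is the variance estimate for $\hat{\Delta}_1$ that accounts for the estimation of the propensity-score parameters.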

The top-left entry of the matrices $\hat{V}_{N,1}$ and $\hat{V}_{0,1}$ estimates the variance of $\hat{\Delta}_1$ as an estimator of $\Delta_{PATE}$ and $\Delta_{SPATE}$, respectively. An analogous asymptotic result for the estimator $\hat{\eta}$ is obtained replacing $g_1$, $T_{g,1}$ and $\hat{T}_{g,1}$ with $g_2$, $T_{g,2}$ and $\hat{T}_{g,2}$. Estimators for the variance of $\hat{\Delta}_2$ when estimating $\Delta_{PATE}$ and $\Delta_{SPATE}$ are $u^T\hat{V}_{N,2}u$ and $u^T\hat{V}_{0,2}u$, respectively.

A key assumption of Proposition 3.3 is the existence of a central limit theorem for the total $\hat{T}_{g,1}$ under the two-stage cluster sampling design (assumption 2). Sen (1988) described sufficient conditions for this assumption in the one-dimensional case. In order to reduce our framework to a scenario where a one-dimensional parameter is estimated, suppose that the propensity score were known and consider the estimator in Equation (3.9). The author discussed the difference between multi-stage sampling designs, such as two-stage cluster sampling, and stratified designs, where samples are taken from each stratum of the population. In the latter case, the central limit theorem for one-stage sampling with unequal probabilities applies promptly. In the case of multi-stage sampling, there are other regularity conditions that need to be satisfied. In particular, applying the results discussed by Sen (1988) to our setup, the asymptotic normality of the total treatment effect is guaranteed under two conditions.

First, for each cluster total $T_{\Delta,i}$ and its respective estimator $\hat{T}_{\Delta,i}$ (see definitions in Section 3.2.2), the following Lindeberg-type condition must hold:

$$\max_{i \in U}\, E\left[\left(\hat{T}_{\Delta,i} - T_{\Delta,i}\right)^2 I\left(\left|\hat{T}_{\Delta,i} - T_{\Delta,i}\right| > \xi\sqrt{N}\right) \,\middle|\, F_N\right] \to 0$$

for any $\xi > 0$ as $N \to \infty$. This is guaranteed if the uncertainty of the within-cluster estimation of the treatment effect does not increase when increasing the number of clusters. The second condition requires that the ratios of cluster sampling probabilities $\max_i\{f(X_i, Z_i)\}/\min_i\{f(X_i, Z_i)\}$ and of cluster sizes $\max_i\{M_i\}/\min_i\{M_i\}$ are asymptotically finite. This assumption rules out scenarios where 1) the number of clusters increases but the sampling probability for a subset of clusters becomes smaller and smaller and 2) the population is sequentially increased by adding larger and larger clusters. Importantly, these assumptions do not require the cluster sizes $M_i$ or the numbers of sampled level-2 units $m_i$ to be large.

The result in Proposition 3.3 applies to any implementation of two-stage cluster sampling. Possible generalizations to multi-stage sampling designs are discussed in Section 5.2.3. In particular, the design variances $V_{gg,1}$ and $V_{gg,2}$ and their estimators $\hat{V}_{gg,1}$ and $\hat{V}_{gg,2}$ depend on the specific sampling design. The following section provides an expression of these matrices for simple two-stage cluster sampling designs.

3.2.5 Design Variance in Simple Two-Stage Cluster Sampling

In simple two-stage cluster sampling designs, both clusters and units are drawn by simple random sampling. In this case, the sampling probabilities reduce to $f(X_i, Z_i) = n/N$ and $f(X_{ij}, Z_{ij}|S_i = 1) = m_i/M_i$. In particular, the weights are constant within cluster and equal to $w_{ij} = (N M_i)/(n m_i)$.

To derive the design variance $V_{gg,1}$, the one-dimensional expressions in Fuller (2011) can be extended to the multivariate case. In particular, we have

$$V_{gg,1} = Var\left(E\left[\hat{T}_{g,1} \,\middle|\, S, \{S_i\}, F_N\right] \,\middle|\, F_N\right) + E\left[Var\left(\hat{T}_{g,1} \,\middle|\, S, \{S_i\}, F_N\right) \,\middle|\, F_N\right]$$
$$= N^2\left(1 - \frac{n}{N}\right)\frac{\Sigma_1}{n} + \frac{N^2}{n^2}\sum_{i \in S}\frac{M_i^2}{m_i}\left(1 - \frac{m_i}{M_i}\right)\Sigma_{2,i}, \qquad (3.23)$$

where $\Sigma_1$ is the empirical covariance matrix of the level-1 units, i.e., of the N totals of $g_1$ by cluster. The matrices $\Sigma_{2,i}$ are the empirical covariance matrices of the level-2 units within the sampled clusters.

The first term can be estimated using the empirical covariance matrix $\hat{\Sigma}_1$ of the n estimated totals in the sampled clusters. The second term can be estimated using the empirical covariance matrices $\hat{\Sigma}_{2,i}$, based on the sampled units in cluster i. Therefore, an estimator of the design variance of the total is:

$$\hat{V}_{gg,1} = N^2\left(1 - \frac{n}{N}\right)\frac{\hat{\Sigma}_1}{n} + \frac{N}{n}\sum_{i \in S}\frac{M_i^2}{m_i}\left(1 - \frac{m_i}{M_i}\right)\hat{\Sigma}_{2,i}. \qquad (3.24)$$

Replacing $g_1$ with $g_2$ and $\hat{T}_{g,1}$ with $\hat{T}_{g,2}$, it is possible to obtain an analogous estimator of $V_{gg,2}$.
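For a scalar estimating function (the one-dimensional analogue of Eq. 3.24), the design-variance estimator can be computed as in this sketch; the function and argument names are illustrative:

```python
import numpy as np

def vgg_hat_two_stage(cluster_totals, unit_values, N, M):
    """Scalar version of the design-variance estimator of Eq. (3.24).

    cluster_totals: estimated totals for the n sampled clusters;
    unit_values: one array per sampled cluster with the sampled level-2
    contributions; N: clusters in the population; M: sizes M_i of the
    sampled clusters.
    """
    n = len(cluster_totals)
    s1 = np.var(cluster_totals, ddof=1)       # between-cluster variance
    between = N**2 * (1 - n / N) * s1 / n
    within = 0.0
    for vals, Mi in zip(unit_values, M):
        mi = len(vals)
        s2 = np.var(vals, ddof=1)             # within-cluster variance
        within += (N / n) * Mi**2 / mi * (1 - mi / Mi) * s2
    return between + within
```

When all sampled units within each cluster are identical, the within-cluster term vanishes and only the between-cluster component remains.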

3.3 Simulation Study

3.3.1 Setup

Surveys often use several sampling stages with different designs, to ensure the representativeness of the target population. For example, the MEPS data used in the application (Section 3.4) are the result of a complex sampling scheme, which includes a stratified sampling stage and two cluster sampling stages (Agency for Healthcare Research and Quality, 2019). The simulation study mimics such a complex design, considering the combination of stratified and two-stage cluster sampling designs.

Finite Population

I considered a finite population of S strata. Each stratum contained $N_s$ clusters of variable size. The size of cluster i in stratum s was defined as $M_{si} = 10{,}000 + \zeta_{si}$, where the $\zeta_{si}$ were iid samples from an exponential distribution with mean 10,000 (rounded to the closest integer). This choice provided a range of clusters with minimum size 10,000 and average size 20,000. I considered three choices of the pair $(S, N_s)$: (5, 80), (10, 40), (20, 20). These choices corresponded to finite populations with similar size (about 8 million units) and the same total number of clusters (400), and investigated the properties of the estimators in designs with different numbers of clusters per stratum.

Each level-2 unit was characterized by p = 5 covariates $X = (X_1, ..., X_5)$, whose distributions were stratum-specific. In particular, the value of covariate k of subject j in cluster i in stratum s was sampled as $X_{ksij} \sim N(\mu_{ks}, 1)$. The stratum-specific means of the covariates were sampled from a standard normal distribution: $\mu_{ks} \sim N(0, 1)$.

I considered two types of outcomes: continuous and binary. For the continuous case, potential outcomes were generated as follows:

$$Y_{sij}^0 = \gamma_C^T X_{sij} + \epsilon_{sij},$$
$$Y_{sij}^1 = \gamma_C^T X_{sij} + \epsilon_{sij} + \delta_C\left(1 + X_{1sij}/5 - X_{5sij}/5\right),$$

where $\gamma_C$ was a vector of p equispaced values between .01 and 2, $\epsilon_{sij} \sim N(0, (0.1)^2)$ was a noise term, and $\delta_C = 3$ imposed a treatment effect. Its coefficient $(1 + X_{1sij}/5 - X_{5sij}/5)$ made the treatment effect heterogeneous across units (the effect depended on the values of covariates 1 and 5) and across strata, because the distribution of the covariates was stratum-specific.

For the binary case, I defined the probabilities $p_{sij}^0$ and $p_{sij}^1$ as

$$\mathrm{logit}\left(p_{sij}^0\right) = -1 + \gamma_B^T X_{sij},$$
$$\mathrm{logit}\left(p_{sij}^1\right) = -1 + \gamma_B^T X_{sij} + \delta_B.$$

The vector $\gamma_B$ was defined by equispaced values between .01 and .3. I set $\delta_B = 1$. The binary potential outcomes $Y_{sij}^0$ and $Y_{sij}^1$ were sampled as Bernoulli trials with probabilities $p_{sij}^0$ and $p_{sij}^1$. Notably, the impact of the treatment was constant on the log-odds scale and, therefore, heterogeneous on the probability scale, which is the scale of the ATE. Therefore, treatment effects are expected to be heterogeneous, as in the continuous-outcome simulations.

Once the potential outcomes were fixed, the treatment was assigned using the propensity score model

$$\mathrm{logit}\left(e(X_{sij})\right) = -2 + \beta^T X_{sij},$$

where the vector $\beta$ was set to p equispaced values between 1 and 0.01, so that the covariates had different relative importance on the outcome and on the treatment received. The treatment value $Z_{sij}$ was sampled as a Bernoulli trial with probability $e(X_{sij})$. When the treatment was assigned, the observed outcome $Y_{sij}$ was defined by the relationship $Y_{sij} = Z_{sij}Y_{sij}^1 + (1 - Z_{sij})Y_{sij}^0$.
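The continuous-outcome generation described above can be sketched as follows. This is a Python rendering under my reading of the design (the dissertation's own simulation code is not shown, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2019)  # arbitrary seed

def generate_units(mu_s, n_units, p=5, delta_c=3.0):
    """Generate covariates, potential outcomes and treatment for one
    stratum's units, following the continuous-outcome design."""
    gamma = np.linspace(0.01, 2.0, p)     # equispaced outcome coefficients
    beta = np.linspace(1.0, 0.01, p)      # equispaced propensity coefficients
    X = rng.normal(mu_s, 1.0, size=(n_units, p))
    eps = rng.normal(0.0, 0.1, size=n_units)
    y0 = X @ gamma + eps
    y1 = y0 + delta_c * (1 + X[:, 0] / 5 - X[:, 4] / 5)  # heterogeneous effect
    e = 1.0 / (1.0 + np.exp(2.0 - X @ beta))             # logit(e) = -2 + beta'X
    Z = rng.binomial(1, e)
    Y = np.where(Z == 1, y1, y0)                         # observed outcome
    return X, Z, Y, y0, y1
```

Stratum-specific means `mu_s` are drawn once per stratum from a standard normal distribution, as described above.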

Sampling

The study sample was drawn with a combination of stratified and two-stage cluster sampling. The number of clusters selected in each stratum was not constant: in half of the strata, $n_s = 5$ clusters were sampled with equal probability, while $n_s = 10$ clusters were sampled from each of the remaining strata. From each cluster, $m_s = 100$ units were drawn.

I considered two sampling schemes for the level-2 units. In the first scheme, units were sampled with equal probabilities. In the second scheme, the sampling probability depended on the treatment and on the values of the covariates. The sampling probability within cluster was constant for all the units, except for treated units with positive values of $X_1$ and $X_2$, whose sampling probability was doubled, and control units with positive values of $X_4$ and $X_5$, whose sampling probability was halved.

In both sampling schemes, the survey weights based on the true sampling probabilities were considered to be known. In particular, the known sampling probabilities were conditional on the treatment status (i.e., the weights were based on the probabilities $f_T$ described in Section 3.1). Notably, the two sampling schemes simulate designs where treatment and sample selection are independent (first scheme) and where the sample selection depends on the treatment received (second scheme).
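The treatment-dependent scheme can be sketched as follows. The rescaling of the probabilities to a fixed expected sample size m is my assumption of how the within-cluster probabilities were normalized, and the function name is illustrative:

```python
import numpy as np

def treatment_dependent_probs(X, Z, m):
    """Second-stage inclusion probabilities for the treatment-dependent scheme:
    start from the equal-probability value m/M, double it for treated units
    with X1, X2 > 0, halve it for controls with X4, X5 > 0, then rescale so
    the probabilities sum to the target sample size m."""
    M = len(Z)
    p = np.full(M, m / M)
    p[(Z == 1) & (X[:, 0] > 0) & (X[:, 1] > 0)] *= 2.0   # treated, X1 > 0, X2 > 0
    p[(Z == 0) & (X[:, 3] > 0) & (X[:, 4] > 0)] *= 0.5   # control, X4 > 0, X5 > 0
    return p * (m / p.sum())
```

The corresponding survey weights are the reciprocals of these probabilities, multiplied by the cluster-level weights.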

Estimators

For each of the six scenarios (two types of outcomes times three combinations of numbers of strata and clusters), I generated one finite population. The two sampling schemes were used to draw 1,000 samples from each population. On each sample, I applied different estimators of $\Delta_{PATE}$, which was a known parameter of each generated population.

First of all, I applied the two weighted estimators, $\hat{\Delta}_1$ and $\hat{\Delta}_2$, in different versions. The first varying factor was the propensity score: I considered the estimators induced by the population propensity score (i.e., $e_P$) and by the sample propensity score (i.e., $e_S$), fit with weighted and unweighted logistic regression, respectively. The second varying factor was the estimator for the variance matrices $V_{gg,1}$ and $V_{gg,2}$. I considered two estimators. First, I used the most appropriate estimator, using the full information of the sampling scheme. Second, I applied an approximate estimator, assuming that the second stage of the cluster sampling scheme was performed with replacement. This approximation is suggested for the analysis of the MEPS data (Agency for Healthcare Research and Quality, 2019). By considering this factor in the simulation study, I was able to assess the effect of this approximation on the performance of $\hat{\Delta}_1$ and $\hat{\Delta}_2$.

The parameter $\Delta_{PATE}$ was also estimated with weighted estimators that included either the propensity score or the survey weights (but not both). In scenarios where the outcome was continuous, I also applied a survey-weighted linear regression model, using the available covariates and the treatment variable as predictors. In such a linear model, the coefficient of the treatment is an estimator of $\Delta_{PATE}$. Importantly, the model was not correctly specified, because the effect of the treatment was not constant across units. Finally, for the weighted estimators based on the propensity score and for the model-based estimator, I considered analogous estimators omitting one of the covariates from the analysis. The omitted covariate was $X_3$, i.e., the factor with medium importance in both the true propensity score and the outcome generation.

The performance of the estimators was evaluated in terms of percent bias and coverage of the 95% confidence interval. For each estimator, I also computed the mean of the estimated variance and the empirical variance across simulations, to compare the efficiency of the different methods.

3.3.2 Results

In the first set of results, I focus on scenarios dealing with the continuous outcome. Figure 3.1 provides the results of the simulations in the scenario where the number of strata and clusters per stratum were $(S, N_s) = (5, 80)$ and where the sampling probabilities did not depend on the treatment received. In this case, $\hat{\Delta}_1$ and $\hat{\Delta}_2$ showed little bias and coverage close to the nominal level, regardless of the type of propensity-score model (first four estimators of top panels). Interestingly, $\hat{\Delta}_1$ performed slightly better than $\hat{\Delta}_2$ in terms of coverage and efficiency. Ignoring the appropriate sampling design at the second stage did not affect the performance of the estimator (bottom panels). Weighted estimators including either the propensity score or the survey weights (but not both) and the model-based estimator resulted in important bias and poor coverage (last three estimators of top panels).

When the sampling probabilities were treatment dependent, as expected, the type of propensity-score model heavily affected the performance of the estimators (Figure 3.2). In this sampling scheme, estimators based on unweighted propensity-score models showed important bias and poor coverage (third and fourth estimators in upper and lower panels). Again, $\hat{\Delta}_1$ performed slightly better than $\hat{\Delta}_2$, and the approximate design variance resulted in performance comparable to that of the design variance based on the complete information of the survey design.

Figure 3.3 provides the results in the scenario where the number of strata and clusters per stratum were $(S, N_s) = (20, 20)$, in the presence of treatment-dependent

[Figure 3.1 appears here. Panels (left to right): Percent Bias, Coverage of 95% CI, Estimated and Empirical Variance. Estimators (top to bottom): IPW1 − SW PS, IPW2 − SW PS, IPW1 − not SW PS, IPW2 − not SW PS, IPW − PS only, IPW − SW only, SW model (correct survey design); IPW1 − SW PS, IPW2 − SW PS, IPW1 − not SW PS, IPW2 − not SW PS (approximate design variance). Figures 3.2 to 3.5 share this layout.]

Figure 3.1: The black dots in left, central and right panels represent the percent bias, coverage and mean variance of estimators of ∆P AT E in scenario with continuous outcome, (S,Ns) = (5, 80) and sampling scheme independent from the treatment. Dashed red lines indicate the value of 0 in the left panels and the 90 and 95% values in the central panels. The empirical variance of the estimated effects is reported as a red star in the right panel.


Figure 3.2: The black dots in left, central and right panels represent the percent bias, coverage and mean variance of estimators of ∆P AT E in scenario with continuous outcome, (S,Ns) = (5, 80) and sampling scheme dependent on the treatment. Dashed red lines indicate the value of 0 in the left panels and the 90 and 95% values in the central panels. The empirical variance of the estimated effects is reported as a red star in the right panel.

sampling probabilities. Notably, the coverage of weighted $\hat{\Delta}_2$ was slightly lower than the nominal level, while the performance of $\hat{\Delta}_1$ was optimal (first and second estimators in upper and lower panels). Because the sampling probabilities were treatment-dependent, estimators based on unweighted propensity-score models performed poorly.


Figure 3.3: The black dots in left, central and right panels represent the percent bias, coverage and mean variance of estimators of ∆P AT E in scenario with continuous out- come, (S,Ns) = (20, 20) and sampling scheme dependent on the treatment. Dashed red lines indicate the value of 0 in the left panels and the 90 and 95% values in the central panels. The empirical variance of the estimated effects is reported as a red star in the right panel.

Figure 3.4 reports the results in the presence of model misspecification. Even though only one covariate was omitted from the propensity score (for weighted estimators) or from the model (for model-based estimators), all the estimators showed important bias and very poor coverage.

Finally, Figure 3.5 shows one of the results in the presence of binary outcomes. In particular, the figure provides the performance of the estimators in the case with $(S, N_s) = (5, 80)$ and treatment-dependent sampling probabilities. The results de-


Figure 3.4: The black dots in left, central and right panels represent the percent bias, coverage and mean variance of estimators of ∆P AT E in scenario with continuous outcome, (S,Ns) = (5, 80) and sampling scheme dependent on the treatment. The models in the estimators omit the covariate X3. Dashed red lines indicate the value of 0 in the left panels and the 90 and 95% values in the central panels. The empirical variance of the estimated effects is reported as a red star in the right panel.

picted in the Figure are similar to the case with continuous outcomes. Results for all the remaining scenarios are reported in the Appendix.

3.4 Application: Effect of Insurance Status on Decision to Seek Care After Injury

3.4.1 Background

Several factors contribute to the decision of a person to seek medical care, including predisposing, enabling and need factors (Andersen and Newman, 1973). Predisposing factors include characteristics that increase the likelihood of health care utilization for subjects, without being the direct cause of a specific use of health services. Examples of these factors are demographics, social factors and health beliefs. Enabling factors characterize the means that each subject has to seek care, such as income or the availability of health services in the community. Finally, the need component includes

[Figure 3.5: panels for Percent Bias, Coverage of 95% CI, and Estimated and Empirical Variance, comparing the estimators IPW1 − SW PS, IPW2 − SW PS, IPW1 − not SW PS, IPW2 − not SW PS, IPW − PS only and IPW − SW only, under the correct survey design and the approximate design variance.]

Figure 3.5: The black dots in the left, central and right panels represent the percent bias, coverage and mean variance of estimators of ∆PATE in the scenario with binary outcome, (S, Ns) = (5, 80) and sampling scheme dependent on the treatment. Dashed red lines indicate the value of 0 in the left panels and the 90% and 95% values in the central panels. The empirical variance of the estimated effects is reported as a red star in the right panel.

factors describing the health status, such as presence of illness, its severity and the subjective perception of it. Health insurance status has been documented as a relevant enabling factor in the US health care system (Andersen, 1995). Previous research has widely described how insured subjects are more likely to use health services (Newacheck et al., 1998; Zuvekas and Taliaferro, 2003; Andrulis, 1998). Unfortunately, most of the literature has been limited to describing the association between insurance status and utilization of health care, rather than assessing a causal relationship between the two. The difference between insured and uninsured subjects in the pattern of access to care is not confined to preventive services. On the contrary, studies have documented that the gap also exists for the treatment of symptomatic conditions (Baker et al., 2000). In particular, for injured patients, previous studies established that insurance status is an important predictor of patient outcomes, even though the etiology of the phenomenon is not clear (Haas and Goldman, 1994; Greene et al., 2010; Salim et al., 2010). Sacks et al. (2011) discussed how insured patients are more likely to receive specialized postacute care, which is essential for recovery from severe injuries. The following analysis focuses on a different aspect of the phenomenon, investigating whether insurance status has a causal effect on the decision to seek care after injury. By evaluating the existence and, possibly, the magnitude of this causal effect, the goal of the analysis is to better understand the role played by insurance coverage in the outcome of injured patients.

3.4.2 Data

I used the 2015 MEPS data, which contain detailed information about the use and cost of health services across the US (Agency for Healthcare Research and Quality, Medical Expenditure Panel Survey (MEPS), 2019). The survey is based on a complex design, involving stratified and cluster sampling stages to guarantee the representativeness of the US civilian population (Chowdhury et al., 2019). Because the research question is focused on the decision to seek care after injuries, I selected only injured subjects. The treatment of interest was insurance status. Notably, the positivity assumption for such a treatment would be violated for most minors and elderly subjects, who commonly have access to insurance coverage, either private or public, in the US health care system. Therefore, I selected adult subjects, aged 18-64. The decision to seek medical treatment was the outcome of the analysis. The outcome variable included use of ambulatory and home-health services, prescription medicines, emergency department visits and hospital admissions. The data included variables about demographic and socioeconomic status, health conditions and access to care.

3.4.3 Methods

The research question is addressed by estimating the population ATE of insurance status. To estimate this effect, I applied the proposed estimators ∆̂1 and ∆̂2 to the MEPS data. Notably, the survey weights provided with the MEPS data are adjusted to account for nonresponse (Chowdhury et al., 2019). Because the treatment variable, insurance status, might affect the health status of the patients and therefore the response to the survey, the survey weights of the subjects are likely to depend on the treatment. Therefore, as discussed in Section 3.1, the propensity score model was treated as a population parameter and developed with a weighted logistic regression model. The covariates included in the propensity score were selected with the help of subject-matter experts. Survey weights were combined with the probabilities estimated by the propensity score model, as described in Section 3.2.2. The resulting weights are designed to reconstruct the potential outcome population, where each subject is represented by two counterfactual outcomes, corresponding to the levels of the treatment. A common way to verify the appropriateness of the weights is to check the balance of the covariates between treatment groups in the resulting weighted cohort. The balance is generally assessed with weighted standardized differences (Austin and Stuart, 2015). As for traditional standardized differences, values smaller than 10% are considered negligible imbalances. The estimates of the population ATE of the proposed methods were compared with the results of other approaches. I applied the proposed weighted estimators using the sample propensity score, which included the same covariates as the population propensity score but was fit with a traditional, unweighted logistic regression model. In addition, I applied two model-based estimators. Both the estimators relied

on weighted outcome models including, as predictors, the treatment variable and the other covariates. First, I considered a model with quasi-binomial likelihood and identity link, so that the coefficient of the treatment variable could be interpreted as a marginal effect on the risk difference scale and compared to the other estimates. Second, I fit a logistic model and estimated the marginal effect as the difference between the average probabilities estimated by the model when all the subjects were considered insured and uninsured, respectively, holding the values of the other covariates as observed.
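As an illustration of the balance check described above, the following sketch computes weighted standardized differences under survey weights alone and under a plausible combination of survey and inverse-propensity weights. The data are simulated for illustration; this does not reproduce the exact MEPS weights or the estimators of Section 3.2.2.

```python
import numpy as np

def weighted_std_diff(x, z, w):
    """Weighted standardized difference (in %) of covariate x between
    treatment groups (z = 1 vs z = 0) under weights w."""
    def wmean(v, wt):
        return np.sum(wt * v) / np.sum(wt)
    def wvar(v, wt):
        return wmean((v - wmean(v, wt)) ** 2, wt)
    m1, m0 = wmean(x[z == 1], w[z == 1]), wmean(x[z == 0], w[z == 0])
    v1, v0 = wvar(x[z == 1], w[z == 1]), wvar(x[z == 0], w[z == 0])
    return 100 * (m1 - m0) / np.sqrt((v1 + v0) / 2)

# Simulated data: covariate, propensity score, treatment, survey weights.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-0.5 * x))        # illustrative propensity scores
z = rng.binomial(1, e)
sw = rng.uniform(0.5, 2.0, size=n)    # illustrative survey weights

# Combined weight: survey weight times inverse-probability weight,
# sw/e for the treated and sw/(1 - e) for the untreated.
w = np.where(z == 1, sw / e, sw / (1 - e))

sd_sw_only = weighted_std_diff(x, z, sw)   # imbalance under survey weights only
sd_combined = weighted_std_diff(x, z, w)   # near zero after PS weighting
print(sd_sw_only, sd_combined)
```

In the application, combined-weight standardized differences below the conventional 10% threshold were taken as evidence of adequate balance.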

3.4.4 Results

Data

The analysis included 3,585 injured subjects, representing a population of more than 39 million subjects when accounting for the survey weights. In the study sample, 3,179 subjects (88.7%) had health insurance and 2,379 (66.4%) sought medical treatment. In particular, treatment was sought by 67.6% of insured and by 56.4% of uninsured subjects. When accounting for the survey weights, 91.6% (95% CI: 90.2, 93.0) of the population was estimated to have insurance coverage and 66.9% (95% CI: 64.2, 70.0) was estimated to seek medical treatment. It was also estimated that 67.7% (95% CI: 64.9, 70.5) and 58.6% (95% CI: 51.5, 65.7) of the population sought medical care among insured and uninsured, respectively.

Propensity Score Model

Table 3.1 provides the estimates of the coefficients of the propensity score model, predicting insurance status. The predicted probabilities were used to generate weights, in combination with the survey weights provided with the MEPS data. Table 3.2 provides the distribution of the covariates by treatment group, in the

cohorts generated by weighting subjects with survey weights only (first two columns) and with the proposed combination of survey and propensity score weights (fourth and fifth columns). The standardized mean differences in the third column quantify the imbalance between the treatment groups in the population after treatment (right set of the second row in Figure 1.1), because subjects are only weighted by survey weights. When the propensity score was also included in the weights (sixth column), only one standardized difference slightly exceeded the recommended value of 10% (the value for education was 11.2%). We considered these results as evidence of good balance between the insured and uninsured.

Estimates of Treatment Effect

Because the outcome was binary, the ATE is quantified as a risk difference. The estimates resulting from the different methods are provided in Figure 3.6. In particular, the estimates based on the proposed estimators ∆̂1 and ∆̂2 were 4.99% (95% CI: -6.08, 16.06) and 2.84% (95% CI: -7.15, 12.82). Methods based on the population and sample propensity scores returned similar values. The weighted estimator that only accounted for the survey weights resulted in the most extreme estimate of the effect (9.10%, 95% CI: 1.74, 16.47). However, such estimates are not reliable, because they are based on a comparison of treatment groups that are not balanced in terms of important covariates (see Table 3.2). Slightly smaller estimates of the effect were returned by the model-based estimators. In particular, the estimate based on the quasi-binomial model with identity link appeared to be significantly different from zero at the 0.05 level.

[Figure 3.6 shows the point estimates and 95% confidence intervals for: IPW1 − SW PS, IPW2 − SW PS, IPW1 − not SW PS, IPW2 − not SW PS, IPW − PS only, IPW − SW only, SW model − identity link, SW model − logit link (x-axis: Risk Difference).]

Figure 3.6: Estimates of the average treatment effect of insurance status on the decision to seek care, with 95% confidence intervals.

3.4.5 Conclusions

According to the estimator ∆̂1, 4.99% more subjects would seek medical treatment after injury if all US injured adults had health insurance, compared with the case in which no one had health insurance. The estimate based on the second proposed estimator is very similar. However, both estimates are non-significant at the 0.05 level, suggesting that there is no strong evidence that insurance status has a causal effect on the decision to seek medical treatment after injury. Model-based methods produced more extreme estimates and, as expected, narrower confidence intervals. However, these models rely on the assumption that the treatment effect is constant (on the risk difference scale for the quasi-binomial model, on the odds ratio scale for the logistic model). This is a strong assumption and it is hardly justified in this context, because the effect of insurance status is likely to be weaker for very severe or very mild injuries than for moderate injuries. The simulation study has shown how the estimates of the treatment effect can be severely biased if the assumption of homogeneous treatment effect does not hold. Because our methods do not rely on this assumption, our estimates should be considered more reliable. Overall, these results suggest a limited or null causal effect of insurance status on the decision to seek care after injury in the US population. It is important to note that the target estimand was the average effect over the injured patients who participated in the MEPS survey. Even though detailed data about the severity of the injury are not freely accessible, it is reasonable to believe that very mild injuries represent the vast majority in the target population. The weakness of the estimated effect might be explained if insurance status does not impact the decision to seek care in this subgroup of subjects.
Similarly, a weak or null effect would also be expected for subjects incurring life-threatening injuries, who are often transported to the emergency department regardless of insurance status. The subgroup of

subjects where insurance status might causally affect the decision to seek care is the group with moderate injuries, where the need for care might balance economic considerations. Unfortunately, these hypotheses cannot be verified with the available data.

Table 3.1: Estimates of the coefficients of the population propensity score model.

Variable                              Coefficient (95% CI)
Intercept                             1.681 (1.142; 2.220)
Age/10                                0.113 (0.000; 0.225)
Sex
  Male                                0.000
  Female                              0.519 (0.242; 0.796)
Race
  NH White                            0.000
  Hispanic                            -0.479 (-0.846; -0.113)
  NH Black                            -0.058 (-0.416; 0.299)
  Other                               -0.185 (-0.787; 0.417)
Education
  HS or less                          0.000
  More than HS                        0.474 (0.134; 0.813)
Married
  No                                  0.000
  Yes                                 0.312 (-0.073; 0.697)
Family Income
  Poor                                -0.848 (-1.251; -0.446)
  Near Poor                           -0.542 (-1.050; -0.033)
  Low Income                          -0.346 (-0.787; 0.095)
  Middle Income                       0.000
  High Income                         0.653 (0.189; 1.117)
Employment Status
  Non-self employed or unemployed     0.000
  Self employed                       -1.016 (-1.580; -0.452)
Perceived Health Status
  Good or better                      0.000
  Fair/Poor                           -0.104 (-0.498; 0.290)
Perceived Mental Health Status
  Good or better                      0.000
  Fair/Poor                           0.506 (-0.049; 1.060)
Limitation in Work/house/school
  No                                  0.000
  Yes                                 -0.012 (-0.546; 0.521)
Born in USA
  Yes                                 0.000
  No                                  -0.541 (-0.958; -0.124)
How well speaks English
  English spoken at home              0.000
  Well or better                      -0.015 (-0.461; 0.431)
  Not well or not at all              -0.729 (-1.339; -0.119)

Table 3.2: Distribution of covariates in treatment groups (insured and uninsured) and standardized mean differences (SD) in weighted cohorts. The table shows the balance of the covariates when weighting by survey weights only (left columns) and when weighting by both survey and propensity score weights (right columns).

                                        Weighting by SW only     Weighting by SW and PS
Variable                               Ins.   Unins.    SD       Ins.   Unins.    SD
Age - Mean                             42.6    39.4    23.7      42.3    41.9     3.4
Sex, Female - %                        47.8    35.3    25.5      46.8    51.3    -9.1
Race - %
  NH white                             70.9    53.0    37.6      69.4    68.4     2.1
  Hispanic                             11.9    28.3   -42.0      13.3    15.0    -5.1
  NH black                             10.2    10.8    -1.8      10.3     8.6     5.6
  Other                                 7.0     7.9    -3.5       7.1     7.9    -3.3
Education, More than HS - %            61.7    38.4    47.9      59.7    54.1    11.2
Married - %                            48.4    33.6    30.5      47.1    45.3     3.7
Family Income - %
  Poor                                 11.5    27.7   -41.6      12.9    14.4    -4.3
  Near poor                             3.9     7.6   -15.9       4.2     5.0    -3.6
  Low income                           12.2    18.8   -18.2      12.8    14.4    -4.7
  Middle income                        25.6    25.8    -0.5      25.6    24.6     2.3
  High income                          46.8    20.2    58.8      44.5    41.7     5.7
Employment Status - %
  Non-self employed                    64.9    56.0    18.3      63.8    59.7     8.5
  Self employed                         6.6    12.6   -20.7       7.1     7.4    -0.9
  Unemployed                           28.5    31.4    -6.3      29.1    33.0    -8.5
Perc. Health, Fair/Poor - %            16.5    21.6   -13.0      17.0    16.4     1.5
Perc. Mental Health, Fair/Poor - %     10.3     9.3     3.4      10.2    11.5    -4.0
Limitation in Work/house/school - %    15.2    17.2    -5.6      15.4    18.3    -7.8
Born in USA - %                        90.5    77.0    37.3      89.5    87.8     5.1
How well speaks English - %
  English spoken at home               85.0    70.4    35.7      83.8    81.4     6.2
  Well or better                       13.2    19.5   -17.3      13.7    16.1    -6.6
  Not well or not at all                1.8    10.1   -35.5       2.5     2.5     0.0

Chapter 4 Population Intervention Effects

Policy makers in public health are often interested in comparing outcomes in the factual, "real", cohort to a counterfactual cohort where an intervention modifies the treatment mechanism. From the point of view of decision makers, it is important to quantify the impact of realistic interventions, targeting changes in the treatment mechanism in subgroups of the cohort. Unfortunately, this type of causal contrast cannot be quantified with the marginal effects discussed in the previous chapters. The population intervention effect (or IE) is a causal parameter that addresses this family of research questions, and it is the topic of this chapter. The remainder of the chapter is organized as follows. Section 4.1 rigorously defines the IE within the potential outcome framework. Section 4.2 describes the family of interventions considered in this dissertation. Section 4.3 introduces upper and lower bounds for IEs, while Section 4.4 provides simple estimators for these bounds. Asymptotic properties and variances of the estimators are discussed in Section 4.5. Section 4.6 provides some considerations about the type of outcome model to be considered in the estimating procedure. In Section 4.7, a simulation study is described that evaluates the properties of the proposed estimators. Finally, Section 4.8 applies the methodology discussed in the chapter to a study assessing the impact of tobacco cessation interventions on preterm delivery in a cohort of nicotine-dependent pregnant

women.

4.1 Definition

The IE compares the average value of the outcome between the target cohort and a counterfactual intervention cohort, where the intervention potentially modifies the treatment status of the subjects. For simplicity, the methodology is introduced in the case of binary treatment levels. Moreover, I assume that the study sample is drawn by simple random sampling from the target cohort, which is assumed to be an infinite superpopulation. Generalizations to more than two treatment groups and to complex sampling designs are discussed in Section 5.3.3. As in the previous chapters, I denote the treatment status and observed outcome in the original cohort with $Z$ and $Y$, respectively. Let $\tilde Z$ be the treatment variable in the intervention cohort. As $\tilde Z$ and $Z$ may differ, the potential outcomes revealed in the original and intervention cohorts may differ as well. Note that the observed outcome in the original cohort may also be expressed as $Y^Z$, where $Y^Z = Y^1$ if $Z = 1$ and $Y^Z = Y^0$ if $Z = 0$. Using this notation, I denote the outcome revealed in the intervention cohort with $Y^{\tilde Z}$. In particular, subjects who receive the same treatment in the two counterfactual populations are characterized by the same outcomes, as

$\tilde Z = Z$ implies $Y^{\tilde Z} = Y^Z = Y$. This chapter focuses on the estimation of the IE, the causal effect that compares the average value of the outcome after the intervention to the average in the target cohort:

$$\Delta_{IE} = E\big[Y^{\tilde Z}\big] - E[Y]. \tag{4.1}$$

Notably, this parameter does not belong to the family of causal parameters discussed by Hubbard and Van der Laan (2008), who described estimators of contrasts

such as $E[Y^z] - E[Y]$, where $z$ is a constant value (0 or 1 in the binary treatment case). Equation (4.1) can be manipulated to provide an estimable expression. Consider the partition of the target cohort into four subgroups, defined by all the possible combinations of values of $Z$ and $\tilde Z$. Using this partition, the two terms in Equation (4.1) can be expanded as follows:

$$E\big[Y^{\tilde Z}\big] = \sum_{z_1, z_2 \in \{0,1\}} E\big[Y^{\tilde Z} \mid Z = z_1, \tilde Z = z_2\big]\, P(Z = z_1, \tilde Z = z_2), \tag{4.2}$$

$$E[Y] = \sum_{z_1, z_2 \in \{0,1\}} E\big[Y \mid Z = z_1, \tilde Z = z_2\big]\, P(Z = z_1, \tilde Z = z_2). \tag{4.3}$$

In particular, if $\tilde Z = Z$, then $Y^{\tilde Z} = Y$ and $E[Y^{\tilde Z} \mid Z = z, \tilde Z = z] = E[Y \mid Z = z, \tilde Z = z]$ for $z = 0, 1$. Using the linearity of the expectation operator and the fact that $Y = Y^Z$, the intervention effect simplifies to:

$$\begin{aligned}
\Delta_{IE} ={}& E\big[Y^{\tilde Z} - Y \mid Z = 0, \tilde Z = 1\big]\, P(Z = 0, \tilde Z = 1) + E\big[Y^{\tilde Z} - Y \mid Z = 1, \tilde Z = 0\big]\, P(Z = 1, \tilde Z = 0) \\
={}& E\big[\, E[Y^{\tilde Z} \mid X] - E[Y \mid X] \mid Z = 0, \tilde Z = 1\big]\, P(Z = 0, \tilde Z = 1) \\
&+ E\big[\, E[Y^{\tilde Z} \mid X] - E[Y \mid X] \mid Z = 1, \tilde Z = 0\big]\, P(Z = 1, \tilde Z = 0) \\
={}& E\big[\, E[Y^1 \mid X] - E[Y^0 \mid X] \mid Z = 0, \tilde Z = 1\big]\, P(Z = 0, \tilde Z = 1) \\
&+ E\big[\, E[Y^0 \mid X] - E[Y^1 \mid X] \mid Z = 1, \tilde Z = 0\big]\, P(Z = 1, \tilde Z = 0),
\end{aligned}$$

where the second equality expresses the averages as integrals over the values of the covariates, $X$. Note that the conditional expectations $E[Y^z \mid X]$ are implicitly conditional on the treatments $Z$ and $\tilde Z$. In addition to the identifiability conditions discussed in Section 1.4.1, I assume the conditional independence of the potential outcomes and $\tilde Z$

given $Z$ and $X$, i.e., $Y^z \perp\!\!\!\perp \tilde Z \mid Z, X$. This is reasonable, because it means that the treatment in the intervention cohort is related to the potential outcomes only through the treatment $Z$ and the covariates $X$. Under this assumption, the conditional distribution of $Y^z$ given $X$, $Z$ and $\tilde Z$ does not depend on $\tilde Z$. Moreover, because of the traditional exchangeability of $Z$, the same conditional distribution does not depend on $Z$ either. The difference between the conditional means $E[Y^1 \mid X]$ and $E[Y^0 \mid X]$ can therefore be replaced by

$$\delta(X) = E[Y \mid Z = 1, X] - E[Y \mid Z = 0, X], \tag{4.4}$$

whose terms are estimable with the observed data. Using the definition in Equation (4.4) and the identifiability assumptions discussed above, $\Delta_{IE}$ can be expressed as

$$\Delta_{IE} = E\big[\delta(X) \mid Z = 0, \tilde Z = 1\big]\, P(Z = 0, \tilde Z = 1) - E\big[\delta(X) \mid Z = 1, \tilde Z = 0\big]\, P(Z = 1, \tilde Z = 0). \tag{4.5}$$

The two terms of Equation (4.5) are integrations over the subgroups where the treatment status is modified ($Z = 0, \tilde Z = 1$ and $Z = 1, \tilde Z = 0$). Notably, if the treatment effect is homogeneous in the target population (i.e., $Y^1 - Y^0 = \delta$ for all of the subjects), $\delta(X)$ is equal to a constant $\delta$. Under this strong assumption, $\Delta_{IE}$ reduces to

$$\Delta_{IE} = \delta\, \big\{ P(Z = 0, \tilde Z = 1) - P(Z = 1, \tilde Z = 0) \big\}. \tag{4.6}$$

In this over-simplified case, the IE only depends on the constant treatment effect, $\delta$, and on the proportion of units whose treatment status is modified. However, in the more general case where the effect of the treatment is not homogeneous, $\Delta_{IE}$ depends

on the individuals belonging to the subgroups where $Z \neq \tilde Z$.

4.2 Interventions

In the presence of heterogeneous treatment effects, $\Delta_{IE}$ depends on the subjects who belong to the subgroups $Z = 0, \tilde Z = 1$ and $Z = 1, \tilde Z = 0$. These subgroups are determined by the intervention, which defines the modified treatment status, $\tilde Z$. Interventions have been classified as static, dynamic and stochastic (Díaz and Van der Laan, 2013). Static interventions fix the modified treatment status to constant values. For example, the intervention that assigns all the subjects to the treatment group, i.e., such that $\tilde Z = 1$ for all the subjects, is static. Dynamic interventions define $\tilde Z$ on the basis of the covariates. For example, in a study where sex is a covariate, the intervention that assigns males and females to the treatment and control groups, respectively, is dynamic. Both static and dynamic interventions assign degenerate densities to $\tilde Z$, because its values are deterministically set: unconditionally for static interventions and conditionally on the covariates for dynamic interventions (Díaz and Van der Laan, 2013). Conversely, stochastic interventions assume traditional, non-degenerate distributions for $\tilde Z$. Both the frameworks of dynamic and stochastic interventions provide the methodology to assess the impact of a wide range of manipulations of the causal system. However, both are difficult to implement in practice. On the one hand, to specify dynamic interventions, it is necessary to list the covariate patterns where $\tilde Z_k = 1$ and

$\tilde Z_k = 0$. This necessity might have contributed to the overly simplistic families of interventions considered by previous studies (Westreich, 2014; Ahern et al., 2009, 2016). In particular, once the deterministic mechanism defining the modified treatment statuses, $\tilde Z$, is set, the g-formula can be used to estimate $\Delta_{IE}$. The estimator follows

naturally from Equation (4.5):

$$\hat\Delta_{IE} = \frac{1}{n} \Bigg( \sum_{\substack{k:\, Z_k = 0, \\ \tilde Z_k = 1}} \hat\delta(X_k) - \sum_{\substack{k:\, Z_k = 1, \\ \tilde Z_k = 0}} \hat\delta(X_k) \Bigg), \tag{4.7}$$

where $\hat\delta(X_k) = \hat E[Y \mid Z = 1, X_k] - \hat E[Y \mid Z = 0, X_k]$ and $\hat E[Y \mid Z = z, X]$ is the estimate of $E[Y \mid Z = z, X]$, computed with an outcome model. On the other hand, stochastic interventions require the specification of the conditional distribution of $\tilde Z$ given $X$. Specifying this distribution is practically as challenging and restrictive as specifying the deterministic mechanism for dynamic interventions. For example, Westreich (2014) considered interventions modifying the proportion of treated subjects in the study cohort. The author assumed that all the subjects in the cohort had the same probability of receiving the treatment $\tilde Z = 1$, an assumption that rarely holds in real settings. Muñoz and Van der Laan (2012) proposed a different framework for continuous treatments, where interventions truncate the distribution of the treatment in the population. I consider the framework of dynamic interventions, which allows for the quantification of the effect of a wide variety of interventions with estimating procedures that are simpler than those necessary to assess the effect of stochastic interventions. However, I tackle the problem from a different perspective with respect to previous research, proposing estimators for upper and lower bounds of the effect of interventions.
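A minimal sketch of the g-formula estimator in Equation (4.7), assuming a linear outcome model with a treatment-covariate interaction and a hypothetical dynamic intervention (treat every untreated subject with x > 0). All names and the data-generating process are illustrative, not the dissertation's simulation design.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
z = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x)))
y = 1.0 + 0.5 * z + 0.3 * x + 0.4 * z * x + rng.normal(size=n)  # heterogeneous effect

# Outcome model E[Y | Z, X] with a treatment-covariate interaction, fit by OLS.
D = np.column_stack([np.ones(n), z, x, z * x])
beta = np.linalg.lstsq(D, y, rcond=None)[0]
delta_hat = beta[1] + beta[3] * x        # estimated delta-hat(X_k)

# Hypothetical dynamic intervention: treat every untreated subject with x > 0.
z_tilde = np.where((z == 0) & (x > 0), 1, z)

# g-formula estimator (Equation 4.7); here no subject moves from Z = 1 to Z~ = 0.
up = (z == 0) & (z_tilde == 1)
down = (z == 1) & (z_tilde == 0)
Delta_IE_hat = (delta_hat[up].sum() - delta_hat[down].sum()) / n
print(Delta_IE_hat)
```

With this data-generating process the true effect is positive, since the targeted subjects have x > 0 and hence individual effects above 0.5.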

4.3 Upper and Lower Bounds

In this section, I introduce an upper and a lower bound for $\Delta_{IE}$ across a family of plausible interventions, to match relevant research questions for policy makers. To do so, I focus on a specific family of interventions: interventions that do not affect the treatment status of subjects who received the treatment ($Z = 1$). On the

other hand, a subset of known size of the untreated subjects ($Z = 0$) is targeted by the intervention and the treatment status of its members is modified. In this scenario, the treatment is thought of as a possibly beneficial exposure, and policy makers are interested in estimating the impact of an increased proportion of subjects receiving the exposure/treatment. By simply swapping the labels of the two treatment levels, it is possible to consider the diametrically opposite scenario, where the treatment is a possibly harmful factor and researchers are interested in scenarios with fewer treated subjects than in the actual cohort. Note that, with this choice, the intervention is defined conditionally on the treatment $Z$, which was factually received by the subjects in the cohort. Under this framework, the cohort target of inference is the superpopulation after treatment, i.e., the top right set in Figure 1.1, and the research question investigates what the impact on the average outcome level would be if a subgroup of factually untreated subjects had received the treatment instead.

Under this simplifying constraint, the conditions $Z = 1$ and $\tilde Z = 0$ are incompatible, i.e., $P(Z = 1, \tilde Z = 0) = 0$. Therefore, the only subgroup where the treatment status is modified is composed of subjects with $Z = 0$ and $\tilde Z = 1$. This subgroup is the target of the intervention and its size is assumed to be a known input parameter, provided by the investigators. By varying this input parameter, researchers can evaluate the impact of interventions targeting subgroups of different sizes. In particular, using the result in Equation (4.5), the definition of $\Delta_{IE}$ within this family of interventions and its estimator reduce to

$$\Delta_{IE} = E\big[\delta(X) \mid Z = 0, \tilde Z = 1\big]\, P(Z = 0, \tilde Z = 1), \tag{4.8}$$

$$\hat\Delta_{IE} = \frac{1}{n} \sum_{\substack{k:\, Z_k = 0, \\ \tilde Z_k = 1}} \hat\delta(X_k). \tag{4.9}$$

The size of the intervention can be measured in terms of the proportion of targeted individuals, denoted by $\pi = P(Z = 0, \tilde Z = 1)$. Because this quantity can also be expressed as $P(\tilde Z = 1 \mid Z = 0)\, P(Z = 0)$, and because the proportion of untreated subjects $P(Z = 0)$ can be reliably estimated in most practical applications, an alternative way to specify the size of the intervention is by fixing the proportion of targeted subjects among the untreated, i.e., $\pi_0 = P(\tilde Z = 1 \mid Z = 0)$. Indicating the proportion of treated subjects after the intervention, i.e., $\tilde\pi = P(\tilde Z = 1)$, is a third way to specify the size of the intervention. Because of the constraint on the type of considered interventions, the set of subjects with $Z = 1$ is contained in the set where $\tilde Z = 1$, and $P(\tilde Z = 1) = P(Z = 1) + P(Z = 0, \tilde Z = 1)$. So $\pi$, $\pi_0$ and $\tilde\pi$ can all be used to characterize the size of an intervention, and they are related by the equalities $\pi = \pi_0\, P(Z = 0)$ and $\tilde\pi = (1 - P(Z = 0)) + \pi$.

Fixing the size, $\pi$, of the intervention, and denoting specific values of the modified treatment status with $\tilde Z$, let $\mathcal{Z}_\pi$ be the set of all admissible treatment allocations $\tilde Z$ in the target cohort. Because of the restriction on admissible interventions, every modified treatment status $\tilde Z$ in $\mathcal{Z}_\pi$ satisfies $P(Z = 0, \tilde Z = 1) = \pi$ and $P(Z = 1, \tilde Z = 0) = 0$. The maximum and minimum values of $\Delta_{IE}$ across all of the admissible interventions in $\mathcal{Z}_\pi$ are defined as $\Delta_{IE}^{\max}$ and $\Delta_{IE}^{\min}$, where

$$\Delta_{IE}^{\min} = \min_{\tilde Z \in \mathcal{Z}_\pi} \Delta_{IE} \quad \text{and} \quad \Delta_{IE}^{\max} = \max_{\tilde Z \in \mathcal{Z}_\pi} \Delta_{IE}. \tag{4.10}$$
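As a quick numeric illustration of the relations between the three ways of specifying the intervention size (the numbers are hypothetical):

```python
# Relations between the three ways of specifying the intervention size
# (illustrative numbers).
p_untreated = 0.4                    # P(Z = 0)
pi0 = 0.5                            # pi_0 = P(Z~ = 1 | Z = 0)
pi = pi0 * p_untreated               # pi = P(Z = 0, Z~ = 1)
pi_tilde = (1 - p_untreated) + pi    # pi~ = P(Z~ = 1) after the intervention
print(pi, pi_tilde)
```

Here an intervention targeting half of the untreated 40% of the cohort changes the treatment status of 20% of the cohort, raising the treated proportion from 60% to 80%.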

The interpretation of these causal parameters is straightforward: they correspond to the best-case and worst-case effects within the family of considered interventions. In order to evaluate these bounds, we need to identify the interventions that attain the largest and smallest changes in the average outcome value. For explanatory purposes, focus on the lower bound $\Delta_{IE}^{\min}$. From the expression in Equation (4.8), we have

$$\Delta_{IE}^{\min} = \min_{\tilde Z \in \mathcal{Z}_\pi} E\big[\delta(X) \mid Z = 0, \tilde Z = 1\big]\, P(Z = 0, \tilde Z = 1) = \pi\, E\big[\delta(X) \mid Z = 0,\ \delta(X) \le \delta_{[\pi_0]}\big], \tag{4.11}$$

where $\delta_{[\pi_0]}$ is the $\pi_0$-quantile of the set of values $\delta(X)$ among the untreated subjects in the target cohort. The equality in Equation (4.11) is motivated by two observations. First, $P(Z = 0, \tilde Z = 1) = \pi$ for any $\tilde Z$ in $\mathcal{Z}_\pi$. Second, the minimum of $E[\delta(X) \mid Z = 0, \tilde Z = 1]$ is attained when the intervention subgroup of fixed size $\pi$ contains the proportion $\pi_0 = \pi / P(Z = 0)$ of the untreated subjects where $\delta(X)$ is smallest or, more formally, if $Z = 0$ and $\delta(X) \le \delta_{[\pi_0]}$. Similarly, $\Delta_{IE}^{\max}$ can be expressed as

$$\Delta_{IE}^{\max} = \pi\, E\big[\delta(X) \mid Z = 0,\ \delta(X) \ge \delta_{[1 - \pi_0]}\big], \tag{4.12}$$

because the maximum of $E[\delta(X) \mid Z = 0, \tilde Z = 1]$ is attained when the intervention subgroup contains the proportion $\pi_0$ of the untreated subjects with the largest values of $\delta(X)$.

4.4 Estimation of Upper and Lower Bounds

The final remark of Section 4.3 provides expressions of $\Delta_{IE}^{\min}$ and $\Delta_{IE}^{\max}$ that suggest the estimators

$$\hat\Delta_{IE}^{\min} = \frac{1}{n} \sum_{\substack{k:\, Z_k = 0, \\ \hat\delta(X_k) \le \hat\delta_{[\pi_0]}}} \hat\delta(X_k), \qquad \hat\Delta_{IE}^{\max} = \frac{1}{n} \sum_{\substack{k:\, Z_k = 0, \\ \hat\delta(X_k) \ge \hat\delta_{[1 - \pi_0]}}} \hat\delta(X_k), \tag{4.13}$$

where $\hat\delta_{[\pi_0]}$ and $\hat\delta_{[1 - \pi_0]}$ are the lower and upper sample $\pi_0$-quantiles of the set of values $\hat\delta(X_k)$ among the untreated, and where $\hat\delta(X_k)$ is defined in Section 4.2.
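The estimators in Equation (4.13) can be sketched with a simple order-statistic implementation, which selects the π0 fraction of untreated subjects below (resp. above) the sample quantile. Here the δ̂(X_k) values are simulated directly rather than estimated from an outcome model, and the quantile handling is simplified.

```python
import numpy as np

def intervention_bounds(delta_hat, z, pi0):
    """Plug-in bounds of Equation (4.13): average, over the whole sample size n,
    the pi0 fraction of untreated subjects with the smallest (lower bound)
    or the largest (upper bound) estimated individual effects."""
    n = len(z)
    d0 = np.sort(delta_hat[z == 0])       # delta-hat among the untreated
    m = int(np.floor(pi0 * len(d0)))      # size of the targeted subgroup
    lower = d0[:m].sum() / n              # smallest delta-hat values
    upper = d0[len(d0) - m:].sum() / n    # largest delta-hat values
    return lower, upper

rng = np.random.default_rng(2)
n = 1000
z = rng.binomial(1, 0.5, size=n)
delta_hat = rng.normal(0.5, 1.0, size=n)  # hypothetical delta-hat(X_k) values
lo, hi = intervention_bounds(delta_hat, z, pi0=0.3)
print(lo, hi)
```

With heterogeneous effects centered at 0.5, the lower bound is negative (the targeted subjects with the smallest effects are harmed on average) while the upper bound is positive, illustrating how wide the range of effects across admissible interventions can be.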

In summary, estimating $\Delta_{IE}^{\min}$ and $\Delta_{IE}^{\max}$ requires two steps. First, the intervention subgroups defining $\hat\Delta_{IE}^{\min}$ and $\hat\Delta_{IE}^{\max}$ must be identified. For the lower bound, the subgroup is defined by the proportion $\pi_0$ of untreated subjects with the lowest values of $\hat\delta(X_k) = \hat E[Y \mid Z = 1, X_k] - \hat E[Y \mid Z = 0, X_k]$. The subgroup defining the upper bound is defined by the proportion $\pi_0$ of untreated subjects with the largest $\hat\delta(X_k)$. Second, $\Delta_{IE}^{\min}$ and $\Delta_{IE}^{\max}$ are estimated with the estimator provided in Equation (4.9). The expressions in Equation (4.13) follow from the application of this estimator to the intervention subgroups defined in the first step. Different expressions of the estimators in Equation (4.13) can be formulated in terms of the treatment status, $\tilde Z$, after the intervention. Define

$$\tilde Z_k^{\min} = \begin{cases} 1 & \text{if } Z_k = 1 \\ 1 & \text{if } Z_k = 0 \text{ and } \hat\delta(X_k) \le \hat\delta_{[\pi_0]} \\ 0 & \text{if } Z_k = 0 \text{ and } \hat\delta(X_k) > \hat\delta_{[\pi_0]} \end{cases} \tag{4.14}$$

and

$$\tilde Z_k^{\max} = \begin{cases} 1 & \text{if } Z_k = 1 \\ 1 & \text{if } Z_k = 0 \text{ and } \hat\delta(X_k) \ge \hat\delta_{[1 - \pi_0]} \\ 0 & \text{if } Z_k = 0 \text{ and } \hat\delta(X_k) < \hat\delta_{[1 - \pi_0]} \end{cases} \tag{4.15}$$

for each subject, $k$, in the sample. Subjects in the sample with $\tilde Z_k^{\min} = 1$ and $Z_k = 0$ are those who are selected to "change" their treatment status (from untreated to treated) in order to estimate the smallest effect of the intervention. Analogously, subjects with $\tilde Z_k^{\max} = 1$ and $Z_k = 0$ are those selected into the intervention subgroup identified to estimate the largest value of $\Delta_{IE}$. Using these definitions, we have

$$\hat\Delta_{IE}^{\min} = \frac{1}{n} \sum_{\substack{k:\, Z_k = 0, \\ \hat\delta(X_k) \le \hat\delta_{[\pi_0]}}} \hat\delta(X_k) = \frac{1}{n} \sum_{\substack{k:\, Z_k = 0, \\ \tilde Z_k^{\min} = 1}} \hat\delta(X_k) = \frac{1}{n} \sum_{k=1}^{n} \Big\{ \hat E\big[Y \mid \tilde Z_k^{\min}, X_k\big] - \hat E[Y \mid Z_k, X_k] \Big\}, \tag{4.16}$$

because $\hat\delta(X_k) = \hat E[Y \mid \tilde Z_k^{\min}, X_k] - \hat E[Y \mid Z_k, X_k]$ if $Z_k = 0$ and $\tilde Z_k^{\min} = 1$, while $\hat E[Y \mid \tilde Z_k^{\min}, X_k] - \hat E[Y \mid Z_k, X_k] = 0$ when $Z_k = \tilde Z_k^{\min} = 0$ or $Z_k = \tilde Z_k^{\min} = 1$. A similar expression can be derived for the estimator of the upper bound:

$$\hat\Delta_{IE}^{\max} = \frac{1}{n} \sum_{k=1}^{n} \Big\{ \hat E\big[Y \mid \tilde Z_k^{\max}, X_k\big] - \hat E[Y \mid Z_k, X_k] \Big\}. \tag{4.17}$$

The expressions in Equations (4.16) and (4.17) are more convenient than the equivalent expressions in Equation (4.13) for discussing the properties of the estimators.

4.5 Properties of the Proposed Estimators

As noted in Section 4.4, the proposed estimators are based on two steps. First, the modified treatment statuses attaining the upper and lower bounds of the effects are identified. Second, the estimates of the causal effects are computed. Notably, both steps are based on the estimates $\hat\delta(X_k)$, which depend on the conditional means estimated by the outcome model. Unfortunately, the convoluted form of the estimators complicates the formulation of asymptotic results. I propose two strategies to tackle the uncertainty quantification of the proposed estimators. First, I describe the asymptotic properties of slightly modified estimators, where it is assumed that the identification of the intervention subgroups attaining the upper and lower bounds (i.e., the first step of the estimating procedure) is based on superpopulation parameters instead of sample-based estimates. For these simplified versions of the estimators, it is possible to use the theory of M-estimation to derive asymptotic distributions. Second, I describe a non-parametric bootstrapping strategy. This approach can be used to estimate the variance of $\hat\Delta_{IE}^{\min}$ and $\hat\Delta_{IE}^{\max}$ and construct confidence intervals for $\Delta_{IE}^{\min}$ and $\Delta_{IE}^{\max}$, accounting for the uncertainty in both steps of the estimating procedure. However, this procedure is computationally intensive. The two approaches are presented in the following subsections and compared in a simulation study in Section 4.7.

4.5.1 Asymptotic Distribution

Suppose that the modified treatment variables defined in Equations (4.14) and (4.15) are replaced by

\[
\tilde{Z}_k^{\min} =
\begin{cases}
1 & \text{if } Z_k = 1 \\
1 & \text{if } Z_k = 0 \text{ and } \delta(X_k) \le \delta_{[\pi_0]} \\
0 & \text{if } Z_k = 0 \text{ and } \delta(X_k) > \delta_{[\pi_0]}
\end{cases}
\tag{4.18}
\]

and

\[
\tilde{Z}_k^{\max} =
\begin{cases}
1 & \text{if } Z_k = 1 \\
1 & \text{if } Z_k = 0 \text{ and } \delta(X_k) \ge \delta_{[1-\pi_0]} \\
0 & \text{if } Z_k = 0 \text{ and } \delta(X_k) < \delta_{[1-\pi_0]}
\end{cases}
\tag{4.19}
\]

where the sample estimates $\hat{\delta}(X_k)$, $\hat{\delta}_{[\pi_0]}$, $\hat{\delta}_{[1-\pi_0]}$ are replaced by the population parameters $\delta(X_k)$, $\delta_{[\pi_0]}$, $\delta_{[1-\pi_0]}$. In this way, $\tilde{Z}_k^{\min}$ and $\tilde{Z}_k^{\max}$ are functions of $X_k$ and $Z_k$ only and can be considered as deterministically known given the study sample. Under this simplifying condition, the asymptotic distribution of the estimators in Equations (4.16) and (4.17) can be formulated, provided some conditions on the outcome model hold. In particular, it will be assumed that the mean outcome model

is based on a parametric regression model, with form

\[
g\left(E\left[Y \mid Z, X\right]\right) = \eta(Z, X, \beta),
\]

where $g(\cdot)$ is a smooth and invertible link function, $\eta$ is the systematic component and $\beta$ is the vector of parameters. For example, $g$ is equal to the identity function in traditional linear models, while $g$ is the logit function (i.e., $g(x) = \log(x/(1-x))$) for logistic regression. The simplest form of the systematic component is

\[
\eta(Z, X, \beta) = (1, Z, X^T)\beta = \beta_0 + \beta_1 Z + \beta_2 X_1 + \ldots + \beta_{p+1} X_p.
\]

I assume that the parameters $\beta$ of the outcome model are estimated via maximum likelihood, i.e., solving the estimating equation

\[
\sum_{k=1}^{n} \frac{\partial}{\partial \beta}\, l(Y_k \mid Z_k, X_k, \beta) = 0,
\]

where $l(Y \mid Z, X, \beta)$ is the log-likelihood of the outcome model.

Define $\mu(Z, X, \beta) = g^{-1}(\eta(Z, X, \beta))$. With this notation, the estimators $\hat{\Delta}_{IE}^{\min}$ and $\hat{\Delta}_{IE}^{\max}$ can be expressed as

\[
\hat{\Delta}_{IE}^{\min} = \frac{1}{n} \sum_{k=1}^{n} \left\{ \mu\left(\tilde{Z}_k^{\min}, X_k, \hat{\beta}\right) - \mu\left(Z_k, X_k, \hat{\beta}\right) \right\},
\]
\[
\hat{\Delta}_{IE}^{\max} = \frac{1}{n} \sum_{k=1}^{n} \left\{ \mu\left(\tilde{Z}_k^{\max}, X_k, \hat{\beta}\right) - \mu\left(Z_k, X_k, \hat{\beta}\right) \right\}.
\]

For presentation purposes, I focus on the estimator $\hat{\Delta}_{IE}^{\min}$. An analogous result can be formulated for $\hat{\Delta}_{IE}^{\max}$. Under the assumptions discussed above, the estimators $\hat{\Delta}_{IE}^{\min}$ and $\hat{\beta}$ for $\Delta_{IE}^{\min}$ and $\beta$ can be seen as the solution of the estimating equations

\[
\sum_{k=1}^{n} \psi(\tilde{Z}_k^{\min}, Z_k, X_k, \theta) = 0,
\]

where $\theta = (\Delta_{IE}^{\min}, \beta^T)^T$ and

\[
\psi(\tilde{Z}_k^{\min}, Z_k, X_k, \theta) =
\begin{pmatrix}
\mu\left(\tilde{Z}_k^{\min}, X_k, \beta\right) - \mu\left(Z_k, X_k, \beta\right) - \Delta_{IE}^{\min} \\[4pt]
\dfrac{\partial}{\partial \beta}\, l(Y_k \mid Z_k, X_k, \beta)
\end{pmatrix}.
\]

The following proposition describes the asymptotic behavior of $\hat{\Delta}_{IE}^{\min}$.

Proposition 4.1. Under the assumptions discussed above, as $n \to \infty$,

\[
\sqrt{n}\left(\hat{\Delta}_{IE}^{\min} - \Delta_{IE}^{\min}\right) \xrightarrow{d} N\left(0,\; u^T A^{-1} B (A^{-1})^T u\right),
\tag{4.20}
\]

where:

\[
u = (1, \mathbf{0}_{p+2})^T, \qquad
A = E\left[-\frac{\partial}{\partial \theta^T}\, \psi(\tilde{Z}^{\min}, Z, X, \theta)\right], \qquad
B = E\left[\psi(\tilde{Z}^{\min}, Z, X, \theta)\, \psi^T(\tilde{Z}^{\min}, Z, X, \theta)\right].
\]

Proof. The result is an application of the theory of M-estimation (Huber, 1967). To prove the result, it suffices to show that the assumptions of this theory are met. In particular:

- $E[\psi(\tilde{Z}^{\min}, Z, X, \theta)]$ is a one-to-one function in $\theta$. In particular, the system of equations

\[
E\left[\psi(\tilde{Z}^{\min}, Z, X, \theta)\right] = 0
\tag{4.21}
\]

has a unique solution, the true value of the superpopulation parameters. To show this result, note that the first component of the vector $E[\psi(\tilde{Z}^{\min}, Z, X, \theta)]$ is $E[\mu(\tilde{Z}^{\min}, X, \beta) - \mu(Z, X, \beta) - \Delta_{IE}^{\min}]$, which is linear in $\Delta_{IE}^{\min}$ and is therefore a one-to-one function of $\Delta_{IE}^{\min}$. For the second component of the vector, $E[\frac{\partial}{\partial \beta} l(Y \mid Z, X, \beta)]$ is the expectation of the score function of the regression model and, under regularity conditions on $g$ and $\eta$, it is a one-to-one function of $\beta$. Therefore, the full vector is a one-to-one function of $\theta = (\Delta_{IE}^{\min}, \beta^T)^T$ and Equation (4.21) admits a unique solution.

- For any value of $Z_k$ and $X_k$, $\psi(\tilde{Z}_k^{\min}, Z_k, X_k, \cdot)$ is smooth in a closed set $B$ containing the solution to Equation (4.21) as an interior point. Note that $\psi(\tilde{Z}_k^{\min}, Z_k, X_k, \cdot)$ is smooth everywhere. Indeed, the first component of the vector $\psi$ is linear in $\Delta_{IE}^{\min}$ (and therefore smooth with respect to this parameter) and smooth with respect to $\beta$ under regularity conditions on $g$ and $\eta$. The second component of $\psi$ is one element of the sum in the score function and it is smooth in $\beta$ under regularity conditions on $g$ and $\eta$.

- The matrices $A$ and $B$ exist and

\[
\frac{\partial}{\partial \theta^T} \left\{ \frac{1}{n} \sum_{k=1}^{n} \psi(\tilde{Z}_k^{\min}, Z_k, X_k, \theta) \right\},
\tag{4.22}
\]

when evaluated at the vector of parameters solving Equation (4.21), is non-singular for large $n$. These conditions are met under regularity conditions on $g$ (Stefanski and Boos, 2002).

Proposition 4.1 provides asymptotic results for $\hat{\Delta}_{IE}^{\min}$ and $\hat{\Delta}_{IE}^{\max}$. Notably, the asymptotic variance in Equation (4.20) depends on the true population parameters. Following the recommendation of Stefanski and Boos (2002), an estimator of the asymptotic variance of $\hat{\Delta}_{IE}^{\min}$ is

\[
\widehat{Var}\left(\hat{\Delta}_{IE}^{\min}\right) = \frac{1}{n}\, u^T \hat{A}^{-1} \hat{B} (\hat{A}^{-1})^T u,
\tag{4.23}
\]

where

\[
\hat{A} = -\frac{1}{n} \sum_{k=1}^{n} \left. \frac{\partial}{\partial \theta^T}\, \psi(\tilde{Z}_k^{\min}, Z_k, X_k, \theta) \right|_{\theta = \hat{\theta}}, \qquad
\hat{B} = \frac{1}{n} \sum_{k=1}^{n} \psi(\tilde{Z}_k^{\min}, Z_k, X_k, \hat{\theta})\, \psi^T(\tilde{Z}_k^{\min}, Z_k, X_k, \hat{\theta}).
\]

In particular, taking advantage of the fact that $\hat{A}$ is block triangular (the bottom-left block is $\frac{\partial}{\partial \Delta_{IE}^{\min}} \left\{ \frac{\partial}{\partial \beta} l(Y_k \mid Z_k, X_k, \beta) \right\} = 0$), the expression in Equation (4.23) can be simplified to

\[
\widehat{Var}\left(\hat{\Delta}_{IE}^{\min}\right) = \frac{1}{n}\, a^T \hat{B}\, a,
\tag{4.24}
\]

where

\[
a = \left(1,\; \left\{ \sum_{k=1}^{n} \frac{\partial}{\partial \beta} \left[ \mu\left(\tilde{Z}_k^{\min}, X_k, \beta\right) - \mu\left(Z_k, X_k, \beta\right) \right] \right\} I(\hat{\beta})^{-1} \right)^T
\tag{4.25}
\]

and $I(\beta)$ is the observed information matrix of the outcome model. As discussed above, these asymptotic properties are theoretically designed for the estimators where the intervention subgroups are identified with the population parameters $\delta(X_k)$. In practice, this step is based on the estimates of the outcome model. Nevertheless, if the outcome model is correctly specified, the estimates $\hat{\delta}(X_k)$ based on the maximum likelihood estimate $\hat{\beta}$ are consistent. Therefore the values $\hat{\delta}(X_k)$
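For a logistic outcome model, the simplified sandwich variance of Equation (4.24) can be assembled directly from the fitted means and the score contributions. The sketch below is a check of the algebra under a logistic model with design $(1, Z, X)$; the variable names are illustrative and the inputs are assumed to come from a previously fitted model, so this is not a definitive implementation.

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def sandwich_variance(beta_hat, y, z, z_tilde, X, delta_ie_hat):
    """Variance estimate of Eq. (4.24) for a logistic outcome model.

    z_tilde is the modified treatment attaining the bound; delta_ie_hat is
    the corresponding plug-in estimate of the intervention effect.
    """
    n = len(y)
    D = np.column_stack([np.ones(n), z, X])          # design (1, Z, X)
    D_tilde = np.column_stack([np.ones(n), z_tilde, X])
    mu = expit(D @ beta_hat)                          # fitted means, observed Z
    mu_tilde = expit(D_tilde @ beta_hat)              # fitted means, modified Z
    # Observed information of the logistic model, I(beta)
    I_beta = (D * (mu * (1 - mu))[:, None]).T @ D
    # Sum over k of d/d(beta) [mu(z_tilde_k) - mu(z_k)]
    grad = ((mu_tilde * (1 - mu_tilde))[:, None] * D_tilde
            - (mu * (1 - mu))[:, None] * D).sum(axis=0)
    a = np.concatenate([[1.0], grad @ np.linalg.inv(I_beta)])   # Eq. (4.25)
    # Stacked estimating functions psi_k: effect contrast and logistic score
    psi = np.column_stack([mu_tilde - mu - delta_ie_hat,
                           (y - mu)[:, None] * D])
    B_hat = psi.T @ psi / n
    return a @ B_hat @ a / n                          # Eq. (4.24)

# Illustration on synthetic data (all quantities hypothetical):
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 1))
z = rng.integers(0, 2, size=n)
z_tilde = np.maximum(z, rng.integers(0, 2, size=n))   # some untreated switched
beta_hat = np.array([-0.5, 0.3, 0.2])                 # pretend fitted MLE
mu_obs = expit(np.column_stack([np.ones(n), z, X]) @ beta_hat)
y = rng.binomial(1, mu_obs)
mu_mod = expit(np.column_stack([np.ones(n), z_tilde, X]) @ beta_hat)
delta_ie_hat = np.mean(mu_mod - mu_obs)
var_hat = sandwich_variance(beta_hat, y, z, z_tilde, X, delta_ie_hat)
```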

4.5.2 Bootstrap

The non-parametric bootstrap is an alternative approach to estimating the variance of $\hat{\Delta}_{IE}^{\min}$ and $\hat{\Delta}_{IE}^{\max}$. Bootstrapping is a popular strategy for complex estimators derived by g-methods (Hernán and Robins, 2018). In particular, it is the approach considered by previous studies dealing with population intervention effects (Ahern et al., 2009, 2016; Westreich, 2014). The idea is to apply the algorithm described in Section 4.4 to a large number $B$ of bootstrap samples, drawn with replacement from the study sample. The variance of $\hat{\Delta}_{IE}^{\min}$ and $\hat{\Delta}_{IE}^{\max}$ is estimated with the sample variance of the $B$ estimates in the bootstrap samples. Confidence intervals for $\Delta_{IE}^{\min}$ and $\Delta_{IE}^{\max}$ can be computed with appropriate percentiles of the empirical distribution of the $B$ estimates. For example, the 2.5% and 97.5% percentiles of the $B$ estimates of $\Delta_{IE}^{\min}$ in the bootstrap samples provide the limits of a 95% confidence interval for $\Delta_{IE}^{\min}$.

The bootstrap approach can be used to estimate the variance of $\hat{\Delta}_{IE}^{\min}$ and $\hat{\Delta}_{IE}^{\max}$ whatever type of outcome model is involved in the estimation: the model does not need to be a parametric regression model, as assumed in Section 4.5.1. Moreover, this methodology accounts for the variability of the procedure at both steps. However, bootstrapping is computationally intensive, because the estimating procedure must be repeated on several bootstrap samples in order to provide accurate estimates of uncertainty. In the simulation study and application, I followed the recommended choice of $B = 1{,}000$ (Efron and Tibshirani, 1986; Hernán and Robins, 2018).
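The resampling scheme just described is generic and does not depend on the form of the estimator. A minimal percentile-interval sketch (illustrative names; any estimator that maps a resampled dataset to a number can be plugged in) might look like:

```python
import numpy as np

def bootstrap_ci(estimator, data, B=1000, level=0.95, seed=0):
    """Non-parametric bootstrap percentile confidence interval.

    estimator : callable mapping a dict of aligned arrays to a point estimate
    data      : dict of equal-length arrays (one entry per variable)
    """
    rng = np.random.default_rng(seed)
    n = len(next(iter(data.values())))
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)               # resample rows with replacement
        stats[b] = estimator({k: v[idx] for k, v in data.items()})
    alpha = (1 - level) / 2
    return np.quantile(stats, [alpha, 1 - alpha])      # percentile limits

# Toy usage: percentile CI for a sample mean (illustrative only)
data = {"y": np.arange(100, dtype=float)}
lo, hi = bootstrap_ci(lambda d: d["y"].mean(), data, B=500)
```

In the application of this chapter, the `estimator` argument would wrap the full two-step procedure of Section 4.4, so the interval reflects the uncertainty of both the subgroup identification and the effect estimation.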

4.6 Outcome Models

As with any implementation of the parametric g-formula methodology, the procedure described in Section 4.4 to estimate $\Delta_{IE}^{\min}$ and $\Delta_{IE}^{\max}$ depends on a model for $E[Y \mid Z, X]$, the conditional mean of the outcome. Chapter 15 of Hernán and Robins (2018) provides guidelines about the characteristics of outcome models in causal inference applications. In particular, the authors contrast the properties of models of this type with those of predictive models, whose purpose is to achieve good predictive performance by exploiting associations between predictors and outcome. The most important difference between the two modeling approaches is the selection of the covariates. Outcome models for causal inference must include the factors $X$ that ensure the property of exchangeability. These factors are identified by thinking through the causal mechanism under study, with the help of subject-matter experts. In predictive models, variables are selected on the basis of statistical significance or measures of predictive power and goodness of fit.

The most popular strategy to generate outcome models is parametric regression, even though machine learning methods (e.g., generalized boosting, BART, random forests) have also been used in this framework. Among the regression models, the family to be considered depends on the type of outcome. Linear models are the most popular for continuous outcomes. For binary outcomes, the conditional mean $E[Y \mid Z, X]$ is a proportion and logistic regression is certainly the most common family of models.

The use of outcome models with nonlinear links, such as logistic regression, requires a clarification about the heterogeneity of the treatment effect. As pointed out in Section 4.1, non-constant treatment effects complicate the estimation of $\Delta_{IE}$. Notably, this heterogeneity must be evaluated on the scale of the effect and not on the scale of the systematic component of the outcome model. For instance, in the case of binary outcomes and logistic models, the effect $\Delta_{IE}$ is quantified as a risk difference, but the outcome model measures the association between treatment and response on the odds ratio scale. In this case, because of the non-linearity of the link, the treatment has a heterogeneous effect for the sake of the estimation of $\Delta_{IE}$, even if it is assumed to have a homogeneous conditional effect in the logistic model, i.e., if the treatment variable is not involved in interaction terms. This is evident with a numerical example. If the odds of a binary outcome are homogeneously halved in the presence of a treatment for each subject in a cohort (constant conditional effect in the logistic model), the impact of the treatment on the probability scale will be different for untreated subjects with risks of 1% or 20%. The risk would be reduced by 0.5 percentage points in the first case and by 8.9 percentage points in the second case. Therefore, the effect is heterogeneous on the probability scale.
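The odds-halving arithmetic can be checked in a few lines; `risk_after_halving_odds` below is a hypothetical helper, not from the text.

```python
def risk_after_halving_odds(p0):
    """Risk implied by homogeneously halving the baseline odds p0/(1-p0)."""
    odds = 0.5 * p0 / (1.0 - p0)      # halved odds
    return odds / (1.0 + odds)        # back-transform odds to probability

# Risk reduction on the probability scale for baseline risks of 1% and 20%:
drop_low = 0.01 - risk_after_halving_odds(0.01)   # about 0.005 (0.5 points)
drop_high = 0.20 - risk_after_halving_odds(0.20)  # about 0.089 (8.9 points)
```

The same conditional odds ratio of 0.5 thus yields risk differences that differ by more than an order of magnitude, which is the heterogeneity relevant to $\Delta_{IE}$.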

4.7 Simulation Study

4.7.1 Setup

I conducted a simulation study to verify the consistency of $\hat{\Delta}_{IE}^{\min}$ and $\hat{\Delta}_{IE}^{\max}$ and to compare the two variance estimators. I generated a large population of $N = 1$ million subjects, characterized by three covariates. The values of the covariates $X_1$ and $X_3$ were sampled from a standard normal distribution, while the values of $X_2$ were sampled from a gamma distribution with shape and rate equal to 1. Because the outcome of the application study was binary, I focused on this type of outcome in the simulations. For each subject $k$ of the population, I defined

\[
\operatorname{logit} p_k^1 = \beta_0 - 1 + X_1 + .3X_1^2 + \sqrt{X_2} + X_3,
\]
\[
\operatorname{logit} p_k^0 = \beta_0 + X_1 + .3X_1^2 + \sqrt{X_2} + X_3,
\]

and the potential outcomes $Y_k^1$ and $Y_k^0$ were sampled as Bernoulli trials with probabilities $p_k^1$ and $p_k^0$, respectively. The value of the intercept was set to $\beta_0 = -2$. The treatment of each subject in the population was assigned as a Bernoulli trial with probability

\[
\operatorname{logit} p_k^Z = .5X_1 - X_2 + .5X_3.
\]

The observed outcome was defined as $Y_k = Z_k Y_k^1 + (1 - Z_k) Y_k^0$. The choices of the distribution of the covariates and of the coefficients defining $p_k^1$, $p_k^0$, $p_k^Z$ result in populations with about 30% of treated subjects ($Z = 1$) and a similar prevalence of the outcome event ($Y = 1$). The target estimands were the upper and lower bound of the effect of an intervention that would increase the proportion of treated subjects to 50%, 70% and

90%, i.e., with the notation of Section 4.3, I considered $\tilde{\pi} = .5$, $.7$ and $.9$. Because both the potential outcomes were known in the simulations, these effects can be measured exactly in each simulated population and correspond to the target parameters $\Delta_{IE}^{\min}$ and $\Delta_{IE}^{\max}$.

I considered the estimation of the target parameters with three outcome models. First, the correctly specified logistic model, which included the three covariates in the correct form (Model 1). Second, a misspecified logistic model that included all three covariates in linear form, failing to capture the nonlinear relationships between the covariates and the logit of the probability of the response (Model 2). Third, a misspecified logistic model where the covariate $X_3$ was omitted and $X_1$ and $X_2$ were included with the correct nonlinear transformation (Model 3).

The estimators described in Section 4.4 were evaluated on samples of different sizes: $n = 500$, $1{,}000$ and $5{,}000$. For each sample size, 1,000 samples were drawn from the population by simple random sampling. For each sample, I computed $\hat{\Delta}_{IE}^{\min}$ and $\hat{\Delta}_{IE}^{\max}$, as described in Section 4.4, and the 95% confidence intervals based on the two approaches described in Section 4.5: bootstrapping and asymptotic normal approximation. The performances of the estimators were primarily evaluated in terms of percent bias and coverage of the 95% confidence intervals. In addition, in each scenario, the means of the estimated variances (based on the bootstrap and on the asymptotic result) were compared to the empirical variance of the estimated effects across simulations.
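Assuming the linear predictors of Section 4.7.1 (read here as including a quadratic term in $X_1$ and a square-root term in $X_2$), the data-generating process can be sketched as follows; function and variable names are illustrative.

```python
import numpy as np

def simulate_population(N=1_000_000, beta0=-2.0, seed=1):
    """Sketch of the simulated population of Section 4.7.1."""
    rng = np.random.default_rng(seed)
    expit = lambda u: 1.0 / (1.0 + np.exp(-u))
    X1 = rng.normal(size=N)
    X2 = rng.gamma(shape=1.0, scale=1.0, size=N)     # rate 1 <=> scale 1
    X3 = rng.normal(size=N)
    lin = X1 + 0.3 * X1**2 + np.sqrt(X2) + X3        # shared systematic part
    Y1 = rng.binomial(1, expit(beta0 - 1.0 + lin))   # potential outcome, treated
    Y0 = rng.binomial(1, expit(beta0 + lin))         # potential outcome, untreated
    Z = rng.binomial(1, expit(0.5 * X1 - X2 + 0.5 * X3))   # treatment assignment
    Y = Z * Y1 + (1 - Z) * Y0                        # observed outcome (consistency)
    return Z, Y

# A smaller draw is enough to see the roughly 30% treated proportion:
Z, Y = simulate_population(N=100_000)
```

Because both potential outcomes are retained, the target parameters $\Delta_{IE}^{\min}$ and $\Delta_{IE}^{\max}$ can be computed exactly in each simulated population before any sample is drawn.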

4.7.2 Results

Figures 4.1, 4.2 and 4.3 summarize the results of the simulations based on Models 1, 2 and 3, respectively. When the correct outcome model was used to estimate the intervention effects, the estimated percent bias was very small (left panel of Figure 4.1). As expected, the bias was smaller in larger samples. Interestingly, intervention scenarios where the proportion of treated subjects was closer to 1 were characterized by smaller bias. There is a possible explanation for this phenomenon. Focus on the estimation of the upper bound. In an extreme case, to estimate the effect of an intervention that would make the proportion of treated subjects equal to 1, there is only one subgroup where the treatment status must be changed: the subgroup of all the untreated subjects. In this case, there is no uncertainty in the identification of the intervention subgroup. The estimation of the intervention effect only depends on the second step of the proposed procedure, where the individual effects $\hat{\delta}(X_k)$ of all the untreated subjects are averaged. However, when the proportion of treated after the intervention is much lower than one, there are several possible intervention subgroups. In this case, there is only one type of error that can be made when identifying the proportion of untreated subjects with treatment effect in the upper quantile of the distribution (i.e., to select subjects whose real treatment effect is not in the upper quantile). Therefore, it is more likely to underestimate the upper bound of the effect than to overestimate it, and the estimator is relatively more prone to bias for values of $\tilde{\pi}$ corresponding to wider families of subgroups. Nonetheless, the simulations show that the bias is very small, provided that the outcome model is correctly specified.

The central panel of Figure 4.1 shows that the coverage of both types of 95% CIs was close to the nominal level (red dashed line). However, the bootstrap method was superior in samples of smaller sizes. This result was expected. The confidence intervals based on asymptotic results do not account for the uncertainty in the identification of the intervention subgroups, and such uncertainty is larger in smaller samples. This makes the asymptotic approximate interval slightly liberal in small samples. Because both the averages of the bootstrap-based and asymptotic-based variances matched the empirical variance of the estimated effect (right panel of Figure 4.1), the simulations confirm the overall reliability of both variance estimators. Naturally, the variance of all the effects decreased as the sample size increased. Moreover, the panel shows that the variance of the bounds on intervention effects increases when the interventions involve larger subgroups of subjects. This is a desirable property, because larger intervention subgroups correspond to larger extrapolations from the observed data.

The estimators showed worse performance in the presence of a mild misspecification of the outcome model, where two of the three covariates were not included with the correct nonlinear form. A non-trivial bias (about 10%) was observed when evaluating the lower bound of the effect of interventions that would increase the proportion of treated to 50%. Importantly, such bias did not decrease as the sample size increased. Accordingly, in this scenario, the coverage of the confidence intervals was slightly lower than the nominal level. Nevertheless, small bias and appropriate coverage were observed in the other scenarios. Again, bootstrap-based confidence intervals appeared to be more reliable than those based on the asymptotic results.

The results were drastically different when the estimators used the outcome model that did not include one of the relevant covariates. In this case, the estimators showed strong bias in most of the scenarios and the coverage of the confidence intervals was correspondingly very low. Severe bias is a well-known problem of causal inference methods whenever important covariates are not taken into account. The possibility to assess the robustness of the result to hidden bias is discussed in Section 5.3.3.


Figure 4.1: Percent bias, coverage and variance of $\hat{\Delta}_{IE}^{\max}$ (left side of panels) and $\hat{\Delta}_{IE}^{\min}$ (right side of panels) based on Model 1 (outcome model correctly specified). CIs and variance are computed with the bootstrap method (blue circles) and with the asymptotic approximate result (green triangles). The empirical variance of the estimated effects is reported as a red star in the right panel.


Figure 4.2: Percent bias, coverage and variance of $\hat{\Delta}_{IE}^{\max}$ (left side of panels) and $\hat{\Delta}_{IE}^{\min}$ (right side of panels) based on Model 2 (misspecification of the scale of nonlinear covariates). CIs and variance are computed with the bootstrap method (blue circles) and with the asymptotic approximate result (green triangles). The empirical variance of the estimated effects is reported as a red star in the right panel.

4.8 Application: Tobacco Cessation Interventions and Nicotine Addiction during Pregnancy

4.8.1 Background

Tobacco use during pregnancy remains one of the most common risk factors associated with preterm birth and poor pregnancy outcomes (Moore et al., 2016). Massive public health campaigns have raised the awareness of the risks connected to such behavior and reduced its prevalence over the last decades. However, the number of pregnant women smoking during pregnancy is still high. Tobacco cessation interventions have


Figure 4.3: Percent bias, coverage and variance of $\hat{\Delta}_{IE}^{\max}$ (left side of panels) and $\hat{\Delta}_{IE}^{\min}$ (right side of panels) based on Model 3 (omission of one relevant covariate). CIs and variance are computed with the bootstrap method (blue circles) and with the asymptotic approximate result (green triangles). The empirical variance of the estimated effects is reported as a red star in the right panel.

been proven to help pregnant women to quit smoking and are therefore of primary importance to reduce the burden of preterm deliveries (Wagijo et al., 2017). The U.S. Preventive Services Task Force recommends behavioral and counseling treatments to help pregnant women stop smoking. However, the effect of such treatments on poor pregnancy outcomes has not been quantified yet. The Infant Mortality Research Partnership is a collaborative effort to reduce infant mortality in Ohio (Ohio Colleges of Medicine Government Resource Center, Infant Mortality Research Partnership, 2018). The project involved multidisciplinary teams of university researchers. A rich dataset of women of reproductive age is at the core

of the project. Insurance claims data were linked to vital statistics data and made available to researchers after deidentification. In this application, I investigate the effect of behavioral and counseling tobacco cessation interventions on preterm birth, which is defined as a delivery before 36 weeks of gestation. Because preterm and very preterm births account for the largest proportion of infant mortality, they are key outcomes to consider to indirectly lower infant mortality (MacDorman, 2011). I apply the proposed methodology to inform policy makers about the potential effect on the preterm birth rate of interventions that would provide smoking cessation treatments to nicotine-dependent pregnant women who did not receive such treatments. The key research question is: if a different proportion of pregnant women had received the tobacco cessation treatment, how different would the preterm birth rate in the study cohort have been?

4.8.2 Data

I considered four years of data, including babies born from 2014 to 2017. Only mothers who were nicotine dependent during pregnancy were selected. Deliveries before 23 weeks were excluded, because the babies were considered non-viable. Because mothers-to-be were the subjects receiving the treatment, each mother in the dataset was only considered once, in correspondence to her most recent recorded pregnancy. The selection led to a total of 66,264 mothers and babies. The number of preterm deliveries was 8,477 (12.8%). Only 2,933 women (4.4%) received a behavioral/counseling tobacco cessation treatment. Available covariates included demographics and information about obstetric history, medical risk during pregnancy, prenatal care and delivery.

4.8.3 Methods

Logistic regression was used to develop the outcome model employed in the estimator. The accuracy of the estimates of the conditional means $\hat{E}[Y \mid Z_k, X_k]$ across the subjects was assessed by evaluating the calibration of the model. Relevant covariates were chosen with the help of subject-matter experts. The calibration was assessed with the Hosmer-Lemeshow test, using ten groups (Hosmer et al., 2013).
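The Hosmer-Lemeshow statistic groups subjects by fitted risk and compares observed and expected event counts within each group. A minimal sketch follows (illustrative names; splitting into ten equal-size risk groups is one common convention for the deciles).

```python
import numpy as np

def hosmer_lemeshow(y, p_hat, g=10):
    """Hosmer-Lemeshow goodness-of-fit statistic with g risk groups (a sketch)."""
    order = np.argsort(p_hat)                      # sort subjects by fitted risk
    stat = 0.0
    for idx in np.array_split(order, g):           # ~equal-size groups
        n_g = len(idx)
        obs = y[idx].sum()                         # observed events in group
        exp = p_hat[idx].sum()                     # expected events in group
        stat += (obs - exp) ** 2 / (exp * (1.0 - exp / n_g))
    return stat   # compare to a chi-squared distribution with g - 2 df

# Toy check on a perfectly calibrated model (illustrative data):
rng = np.random.default_rng(2)
p = rng.uniform(0.05, 0.5, size=5000)
y = rng.binomial(1, p)
stat = hosmer_lemeshow(y, p)
```

Under good calibration the statistic should be close to its degrees of freedom (8 for ten groups), consistent with the non-significant p-value reported below.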

The model was used to compute $\hat{\Delta}_{IE}^{\min}$ and $\hat{\Delta}_{IE}^{\max}$, as described in Section 4.4. I considered interventions increasing the proportion of treated to a range of 25 equispaced values between the observed proportion (4.4%) and 50%. The estimators were used to quantify the possible impacts on the preterm birth rate of such interventions. For each given intervention size, I computed the 95% confidence intervals with both the approximate asymptotic method and the bootstrap, using $B = 1{,}000$ bootstrap samples.

4.8.4 Results

Outcome Model

Table 4.1 provides the estimates of the coefficients of the logistic regression model. The model showed satisfactory calibration, with a p-value of the Hosmer-Lemeshow test of .097 (value of the chi-squared statistic: 13.45, degrees of freedom: 8). Notably, the conditional odds ratio of the treatment in the model was 0.83 (95% CI: 0.73-0.93). This suggests that, controlling for the other covariates in the model, the treatment appears to reduce the risk of preterm birth.


Figure 4.4: Estimates of the upper bound (red solid line) and lower bound (blue solid line) of intervention effects as a function of the overall proportion of treated subjects (horizontal axis). The left and right panels provide estimates of the pointwise 95% confidence intervals based on the bootstrap and the asymptotic result, respectively.

Estimates of Intervention Effect

The estimates of the upper and lower bounds of the intervention effects are provided in Figure 4.4. The solid lines depict the point estimates of the possible change in the preterm birth rate corresponding to interventions increasing the proportion of treated to the value set on the horizontal axis. Blue and red lines correspond to the point estimates of the best-case and worst-case scenarios. The corresponding 95% confidence intervals are provided with the dashed lines. The left and right panels provide the bootstrap and asymptotic confidence intervals, respectively. Notably, the two methods provide extremely similar results. As noted in the simulation study, the variance of the bounds of the intervention effect, and the related width of the confidence intervals, increased when the methodology was used to assess the impact of larger changes in the treatment status.

4.8.5 Conclusions

Conditional effects are the most common measures of effect reported in the medical literature (Ahern, 2016; Westreich, 2017). In our example, when controlling for the relevant covariates, the odds ratio of the counseling treatment for smoking cessation on preterm delivery was 0.83 (95% CI: 0.73, 0.93). Under the identifiability assumptions and if the outcome model is correctly specified, this effect suggests that the treatment reduces the odds of preterm birth by almost 20% in tobacco-dependent pregnant women. However, this causal effect is not informative for policy makers, who may be more interested in evaluating what impact an intervention designed to increase the proportion of women receiving a treatment might have.

Similar arguments can be made against traditional marginal effects, such as the average treatment effect. Using the traditional plug-in estimator based on the outcome model, the average treatment effect is estimated to be -1.93% (95% CI: -3.04, -0.82). This estimate suggests that the proportion of preterm deliveries would be almost 2 percentage points lower in the cohort where all the women receive the smoking cessation treatment than in the cohort where none of the women receive the treatment. Again, this estimate fails to quantify the causal effect of the treatment on the outcome on a scale that might help program planning and allocation of resources.

The proposed methodology provides the tools to quantify the potential impact of realistic interventions on the distribution of the treatment. For example, if a subgroup of the untreated women with size corresponding to 5% of the cohort had received the treatment, increasing the total size of the treatment group from 4.42% to 9.42%, it is estimated that the proportion of preterm deliveries would decrease by a percentage that is bounded by -0.21% (95% CI: -0.34, -0.09) and -0.06% (95% CI: -0.09, -0.03). Figure 4.4 provides the estimates corresponding to other possible sizes of interventions.

Table 4.1: Logistic regression model estimating the proportion of preterm delivery.

Variable                                      Coef.    95% CI             OR      95% CI
Behavioral/counseling interv. for tobacco use
  No                                           0.000
  Yes                                         -0.192   (-0.311, -0.074)   0.825   (0.733, 0.929)
Age                                           -0.070   (-0.109, -0.031)
Age^2                                          0.002   (0.001, 0.002)
Body Mass Index
  >18.5                                        0.000
  <18.5                                        0.377   (0.286, 0.467)     1.457   (1.331, 1.595)
Mother's Race
  Non-hispanic white                           0.000
  Non-hispanic black                           0.328   (0.275, 0.382)     1.388   (1.316, 1.465)
  Hispanic                                     0.167   (0.044, 0.290)     1.181   (1.045, 1.336)
  Other                                        0.110   (-0.192, 0.411)    1.116   (0.825, 1.509)
Education
  Higher than high school diploma              0.000
  High school diploma or less                  0.145   (0.092, 0.198)     1.156   (1.096, 1.219)
Marital status
  Married                                      0.000
  Not married                                  0.032   (-0.031, 0.095)    1.033   (0.970, 1.100)
Parity
  0                                            0.000
  >=1                                         -0.117   (-0.193, -0.041)   0.890   (0.824, 0.960)
Opiate addiction
  No                                           0.000
  Yes                                          0.439   (0.375, 0.504)     1.552   (1.455, 1.655)
Depression
  No                                           0.000
  Yes                                          0.123   (0.071, 0.174)     1.131   (1.074, 1.191)
Previous preterm birth
  No                                           0.000
  Yes                                          1.259   (1.198, 1.319)     3.521   (3.315, 3.739)
Primary eligibility to Medicaid
  Pregnancy related                            0.000
  Disabled                                     0.265   (0.127, 0.404)     1.304   (1.135, 1.497)
  Medicaid expansion                           0.166   (0.096, 0.235)     1.180   (1.101, 1.265)
  Other                                        0.033   (-0.024, 0.089)    1.033   (0.976, 1.093)
  Welfare related                              0.414   (0.156, 0.673)     1.514   (1.169, 1.959)
Intercept                                     -1.850   (-2.392, -1.308)

Chapter 5
Discussion and Future Work

5.1 Multiple Treatment Groups

5.1.1 Discussion

Matched analyses on observational data are very popular for binary treatments. In the presence of more than two treatment arms, however, matching designs have sel- dom been employed, primarily due to the lack of good matching algorithms. Chapter 2 described a conditionally optimal matching algorithm and a specific implementation to a design with three treatment groups. In this specific design, the algorithm is guar- anteed to identify a matched sample that is bounded away from the optimal solution by a known factor. Simulations showed how the proposed algorithm outperformed theNN procedure, the principal competing matching algorithm. The algorithm is relatively easy to implement, since it only uses off-the-shelf pro- cedures. It is based on an iterative procedure, which repeatedly applies two-group optimal matching steps to search for matched samples with small total distance. In particular, each optimal step solves a problem characterized by a time complexity of O(n3), with n being the sample size (Rosenbaum, 1989). Specifically, I used the implementation of the optimal matching algorithm of the R package optmatch, which solves the problem as a minimum-cost flow optimization in no more than O(n3log(n))

steps (Hansen and Klopfer, 2006). If the maximum number of iterations of our algorithm is set to a constant, the time complexity of our algorithm is also O(n^3 log(n)). In practice, the mean number of iterations in our simulation study was 4.7 (SD: 2.8, min: 1, max: 29), indicating that the algorithm converges in a limited number of steps. Even though further research should focus on the computational aspects of this procedure, these results suggest convergence in polynomial time, with a complexity between O(n^3) and O(n^4). I also described a comprehensive strategy to test the sharp null hypothesis of no treatment effect in the matched sample, based on Rosenbaum's evidence factors methodology. This analytic framework naturally provides the tools to assess the impact of hidden bias on the result of the causal inference procedure.
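The alternating two-group steps can be illustrated with off-the-shelf assignment routines. The sketch below is a minimal Python illustration of the idea, not the optmatch-based implementation used in this work: it matches three equally sized groups by repeatedly solving optimal two-group assignments (scipy's linear_sum_assignment) with the third group's pairing held fixed, so that the total triplet distance never increases. The function name and the Euclidean triplet distance are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iterative_triplet_match(X1, X2, X3, max_iter=20, tol=1e-10):
    """Illustrative sketch: alternate two-group optimal matching steps to build
    triples with small total distance (sum of the three pairwise distances)."""
    n = len(X1)
    pairwise = lambda A, B: np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    d12, d13, d23 = pairwise(X1, X2), pairwise(X1, X3), pairwise(X2, X3)

    # Step 1: start from the optimal two-group match of groups 1 and 2.
    _, p2 = linear_sum_assignment(d12)   # p2[i]: group-2 unit matched to triple i
    p3 = np.arange(n)                    # arbitrary starting pairing for group 3

    def total(p2, p3):
        i = np.arange(n)
        return (d12[i, p2] + d13[i, p3] + d23[p2, p3]).sum()

    prev = np.inf
    for _ in range(max_iter):
        # Step 2a: re-match group 3 optimally, with the (1,2) pairs held fixed.
        cost3 = d13 + d23[p2, :]         # cost3[i, k]: add unit k of group 3 to triple i
        _, p3 = linear_sum_assignment(cost3)
        # Step 2b: re-match group 2 optimally, with the (1,3) pairs held fixed.
        cost2 = d12 + d23[:, p3].T       # cost2[i, j]: add unit j of group 2 to triple i
        _, p2 = linear_sum_assignment(cost2)
        cur = total(p2, p3)
        if prev - cur < tol:             # each step is optimal, so total never increases
            break
        prev = cur
    return p2, p3, total(p2, p3)
```

Because each step is conditionally optimal given the other pairing, the procedure terminates as soon as a full iteration fails to reduce the total distance, mirroring the small iteration counts reported above.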

5.1.2 Limitations

I described a viable procedure to identify matched samples with small total distance in polynomial time. Nevertheless, time complexities of O(n^3) or O(n^4) are practically prohibitive if the sample size is very large. With the available technology, the two-group optimal matching algorithm terminates in a reasonable time span (within minutes or hours) for treatment groups on the order of thousands of records or, at most, a few tens of thousands. For very large datasets, where the size of the treatment groups exceeds hundreds of thousands of records, solving two-group optimal matching steps is infeasible, particularly within an iterative procedure that applies this step multiple times. This is a well-known limitation of optimal matching procedures, whose scope of application is confined to small-to-moderate sample sizes (Bennett et al., 2018). The second limitation of the proposed methodology is its unsuitability for a large number of treatment groups. First of all, the dimensionality of the propensity score

vector increases with the number of groups, K. As K increases, the dimensionality-reduction property offered by the propensity score vanishes and the space of the matching variables becomes increasingly sparse. Therefore, the problem of identifying similar subjects increases in complexity for large K. The sensitivity analysis for hidden bias also shows an important limitation in these settings. The evidence factor methodology introduces a sensitivity parameter for each of the K − 1 tests. Each parameter controls the potential bias in the corresponding comparison. As the number of treatment groups increases, the number of sensitivity parameters increases accordingly, and summarizing the impact of hidden bias on the result of the overall test becomes challenging. As suggested in Section 2.1.3, these problems motivate the recommendation of using the proposed methodology for a small-to-moderate number of treatment groups. Heuristically, matching problems for K > 10 are likely to be infeasible, while there are certainly practical scenarios that would benefit from matching designs with K ≤ 6. For intermediate values of K, researchers should carefully evaluate the characteristics of the available data and the rationale of the study when considering a matching design.

5.1.3 Future Work

There are several aspects of the methodology discussed in Chapter 2 requiring further investigation. First of all, the principal limitation of the conditionally optimal matching algorithm is its computational cost in large datasets. In these settings, if all of the treatment groups are large, the proposed procedure cannot be employed with the current technology and algorithm implementation. One workaround may be possible in specific scenarios where the overall size of the sample is large but the target treatment group has small-to-moderate size. This is a recurrent situation in observational studies, where data about controls or comparison groups are often

selected from large registries. In these scenarios, researchers might opt for efficient matching algorithms to generate a first matched sample. For example, Bennett et al. (2018) recently proposed a matching procedure that constructs matched samples in linear time, prioritizing marginal balance in the covariates over small distances within each matched set. The result of this procedure would be a matched sample of limited size, which could then be used as input to the conditionally optimal matching procedure. Methodological and applied research should explore possible combinations of different matching procedures. I only discussed the use of matched samples to estimate sample-level effects. This is a common strategy in matched designs, where randomization-based inference is the most popular framework applied in post-matching statistical analysis. Nevertheless, recent research has discussed the use of matched estimators for population- and superpopulation-level effects, although it has focused on binary treatments (Ashmead, 2014; Lenis et al., 2017; Austin et al., 2018). The reliability of the proposed matching procedure should be evaluated in the context of survey data. Finally, the detailed description of the algorithm, the simulation study and the application focused on a three-group design. Section 2.1.3 provided insights about the generalization of the matching procedure to the general case of K groups. The section also discussed preliminary results, suggesting hypotheses about the strategy to construct the starting matched samples (i.e., Step 1 of the algorithm). First, leaving the largest treatment groups until last in the matching procedure appeared to yield the best performance. Second, the iterations of the algorithm (Step 2) appeared to compensate for poor choices of the starting matched samples; therefore, investing computational time in the construction of the starting point may not be cost-effective. Further research should be dedicated to verifying these hypotheses.

5.2 Complex Survey Designs

5.2.1 Discussion

Chapter 3 discussed the estimation of population treatment effects in complex survey designs. In this framework, I focused on weighted estimators, which provide a natural way to combine propensity score and survey weights. By providing a clear description of the selection mechanisms underlying the identification of the study sample from the population or superpopulation, the chapter sheds light on the conditions under which the propensity score should be considered a population or a sample parameter and, consequently, on when this parameter should be estimated with a weighted or an unweighted model. The simulation study showed how using the wrong modeling strategy results in severe bias in the treatment effect estimation. The theoretical and simulation results are expected to provide a valuable contribution to the open debate about whether propensity score models should be based on survey-weighted models. The chapter also discussed the variance estimation and asymptotic properties of weighted estimators under a popular multi-stage survey design, i.e., two-stage cluster sampling. The calculations and theoretical properties were confirmed by the simulations.
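As a concrete illustration of how the two sets of weights combine, the sketch below, in Python rather than the R used for the analyses in this work, fits a survey-weighted logistic propensity score model by Newton-Raphson and plugs the resulting inverse propensity weights, multiplied by the survey weights, into a Hajek-type estimator of the population average treatment effect. The function names are hypothetical and the estimator is a generic example of the weighted estimators discussed in Chapter 3, not the exact estimator studied there.

```python
import numpy as np

def fit_weighted_logit(X, z, w, n_iter=25):
    """Survey-weighted logistic regression via Newton-Raphson (minimal sketch)."""
    X1 = np.column_stack([np.ones(len(z)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X1 @ beta))
        grad = X1.T @ (w * (z - p))                      # weighted score
        hess = (X1 * (w * p * (1 - p))[:, None]).T @ X1  # weighted information
        beta += np.linalg.solve(hess, grad)
    return beta, 1 / (1 + np.exp(-X1 @ beta))

def ipw_pate(y, z, e, w):
    """Hajek-type estimate of the population average treatment effect,
    combining survey weights w with inverse propensity weights 1/e and 1/(1-e)."""
    wt1, wt0 = w * z / e, w * (1 - z) / (1 - e)
    return (wt1 @ y) / wt1.sum() - (wt0 @ y) / wt0.sum()
```

In a simulation with treatment independent of sample selection, either the weighted or the unweighted propensity score model could be supplied to `ipw_pate`; the Hajek normalization by the sum of the weights mitigates, but does not remove, the instability caused by extreme weights.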

5.2.2 Limitations

As with any method estimating causal effects in observational studies, weighted estimators rely on assumptions. Specifically, these estimators assume that the propensity score model is correctly specified and, in particular, that the observed covariates are sufficient to guarantee the exchangeability of the treatment mechanism. The simulation study showed how the omission of a single covariate from the propensity score model

might completely invalidate the inference about the target parameter. To partially address this limitation, Robins et al. (1994) proposed doubly robust weighted estimators, which take advantage of both a propensity score and an outcome model to generate consistent estimates when one of the two models (but not both) is misspecified. On the other hand, some studies have argued that matching estimators are more robust to model misspecification, because well-balanced matched samples can be constructed even if the propensity score model is not correctly specified, and the estimator of the treatment effect does not directly depend on model estimates (Waernbaum, 2012). Nevertheless, when important confounders are not observed, causal effects are not identifiable from observational data and, without further context-specific assumptions, any method is prone to some degree of bias. Rosenbaum's sensitivity analysis provides a comprehensive family of methods to assess the robustness of causal inference tests to hidden bias (see Section 2.2.5). Unfortunately, this methodology is developed within the randomization-based inference framework, which inherently focuses on sample-level effects.
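A minimal sketch may clarify the mechanics of the doubly robust idea. The code below is an assumption-laden illustration rather than the exact estimator of Robins et al. (1994): it augments inverse-propensity terms with predictions from linear outcome models fit within each arm, so the estimate remains consistent if either the propensity score or the outcome model is correctly specified.

```python
import numpy as np

def aipw(y, z, X, e):
    """Doubly robust (AIPW-type) sketch: augment inverse-propensity terms with
    predictions m1, m0 from linear outcome models fit within each treatment arm."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b1, *_ = np.linalg.lstsq(X1[z == 1], y[z == 1], rcond=None)
    b0, *_ = np.linalg.lstsq(X1[z == 0], y[z == 0], rcond=None)
    m1, m0 = X1 @ b1, X1 @ b0
    return np.mean(m1 - m0 + z * (y - m1) / e - (1 - z) * (y - m0) / (1 - e))
```

If the outcome models are correct, the augmentation terms have mean zero even under a wrong propensity score; if the propensity score is correct, the inverse-propensity terms correct any bias in the outcome-model predictions.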

5.2.3 Future Work

There is still considerable work to be done to address the open research questions in this area. First of all, further research should be dedicated to developing methods to assess the sensitivity to hidden bias of weighted estimators targeting population-level effects. As discussed in the previous section, the existence of unobserved confounders potentially undermines inference from observational data. The ability to assess robustness to hidden bias is vital to drawing causal conclusions. Other aspects requiring further clarification concern the propensity score model. Section 3.1 contrasted the use of population and sample models. The section discussed how the population propensity score is likely to be the appropriate choice in

several practical scenarios. Nevertheless, in scenarios where treatment and sample selection are independent, both propensity score models can be used to estimate treatment effects. In such scenarios, the efficiency of the resulting estimator might be a criterion for choosing the type of propensity score to consider. The simulation study did not find a clear difference between the two propensity scores in terms of the variances of the estimators. However, more formal investigations are needed to clarify whether one propensity score model is preferable over the other under particular conditions. For example, it is well known that the variance of weighted estimators is inflated by extreme weights (Hernán and Robins, 2018). To use the most efficient estimator, one possible strategy could be to select the propensity score corresponding to a distribution of weights with less extreme values. I focused on models for the propensity score based on weighted logistic regression and including subject-level variables only. However, cluster-level variables might be available to researchers and might be relevant confounding factors (Lenis et al., 2017; Yang, 2018). It is not clear how these variables should be considered in the propensity score model. Treating cluster-level confounders as if they were subject-level covariates would result in underestimating the variance of the estimators of the propensity-score parameters. Such underestimation would likely introduce bias in the estimator of the variance of the treatment effect. To address this problem, robust estimators might be used to estimate the variance of the propensity-score estimates, considering independence at the level of the cluster. Alternatively, further research should explore other modeling approaches for the estimation of the propensity score.
For example, under simple random sampling designs, Li et al. (2013) considered fixed-effect and mixed-effect propensity score models in the presence of multilevel data structures. Potential extensions of these modeling frameworks to complex survey sampling designs should be investigated.
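The robust-variance suggestion above, treating clusters rather than subjects as the independent units, can be sketched for the simplest case of a weighted-mean-type estimator. The code below is a generic illustration, assuming per-unit influence contributions are summed within clusters and the cluster totals are then used as the independent units; it is not the specific variance estimator developed in Chapter 3.

```python
import numpy as np

def cluster_robust_var(y, w, cluster):
    """Generic sketch: cluster-robust variance of the weighted mean of y,
    summing per-unit influence contributions within each cluster."""
    y, w, cluster = map(np.asarray, (y, w, cluster))
    mu = np.average(y, weights=w)
    infl = w * (y - mu) / w.sum()       # influence contribution of each unit
    totals = np.array([infl[cluster == c].sum() for c in np.unique(cluster)])
    G = len(totals)                     # number of clusters
    return G / (G - 1) * (totals ** 2).sum()
```

When observations within a cluster are positively correlated, the cluster totals are more variable than independent units would be, and this estimator is correspondingly larger than the naive subject-level variance.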

Finally, Chapter 3 focused on weighted estimators. As noted in Section 5.1.3, matching estimators have also been used in the context of survey data (Ashmead, 2014; Lenis et al., 2017; Austin et al., 2018). A formal comparison of weighted and matching estimators in the context of multi-stage sampling has not yet been described in the literature. In particular, the two families of estimators should be compared in terms of efficiency and robustness to misspecification of the propensity score model.

5.3 Population Intervention Effects

5.3.1 Discussion

The estimation of population intervention effects was the goal of Chapter 4. The chapter provided a formal definition of intervention effects within the potential outcome framework and described estimators for upper and lower bounds of these effects. Intervention effects are informative for policy makers, who are interested in estimating the consequences of realistic modifications of exposures or treatments on the outcome of the target population. The bounds of the intervention effect have very straightforward interpretations: with a fixed size of the intervention subgroup, they correspond to the best-case and worst-case scenarios that could have been observed if the target population had received the intervention. The proposed estimators can be easily implemented. They are based on simple algebraic calculations involving the estimates from an outcome model. I described two approaches for the calculation of the variance of the estimators: a bootstrap-based estimator and an asymptotic result. The simulation study showed that the two approaches perform similarly in large samples.
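To fix ideas, the sketch below computes best- and worst-case bounds under a simplifying assumption introduced here for illustration only (the formal definitions are given in Chapter 4): the intervention modifies a fraction π0 of the cohort, and the bounds average the π0 largest and π0 smallest estimated individual effects, normalized by the cohort size.

```python
import numpy as np

def intervention_effect_bounds(delta_hat, pi0):
    """Hypothetical sketch: average the floor(pi0 * n) largest (best case) and
    smallest (worst case) estimated individual effects, normalized by n."""
    delta_hat = np.sort(np.asarray(delta_hat, dtype=float))
    n = len(delta_hat)
    m = int(np.floor(pi0 * n))          # size of the intervention subgroup
    lower = delta_hat[:m].sum() / n     # worst case: smallest individual effects
    upper = delta_hat[-m:].sum() / n    # best case: largest individual effects
    return lower, upper
```

The bounds coincide when the estimated individual effects are constant, and widen as effect heterogeneity increases.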

5.3.2 Limitations

As with methods based on the g-formula methodology, the proposed estimators rely on the correct specification of the outcome model, in addition to the traditional identifiability assumptions. The simulation study showed how minor model misspecifications, such as those introduced by omitting interactions or nonlinear terms, may introduce moderate bias in the estimation of the target effects. Severe bias was observed when one of the covariates was omitted from the model. This was consistent with the results of the simulations presented in Chapter 3. Consistent with the work of Ahern et al. (2016) and Westreich (2014), I assumed that interventions would directly modify the treatment status of a subgroup of the subjects. Notably, this assumption does not hold in some research fields. For instance, when the exposure of interest is a behavior (e.g., smoking or binge drinking), policy makers can act on instruments of the exposure but cannot modify the exposure status directly. Muñoz and Van der Laan (2012) and Díaz and Van der Laan (2013) have described methodologies accounting for such an indirect impact of the intervention. However, these authors focused on estimands that do not allow modifications of the treatment status in subgroups.

5.3.3 Future Work

As discussed in Section 5.2.3, the ability to assess sensitivity to hidden bias is vital to establishing causal relationships in observational studies. Unfortunately, there is no well-established procedure to quantify the impact of unobserved confounders in estimators based on the g-formula methodology. This is a broad area of future work. For simplicity, I described population intervention effects assuming that the study sample was selected from the target population by simple random sampling. However, the methodology can be naturally extended to complex survey designs. A naive

extension would replace the estimators in Equation (4.13) with weighted averages of the individual treatment effects δ̂(X_k), where each subject is weighted using his/her survey weight. In particular, the lower and upper sample π0-quantiles, δ̂_[π0] and δ̂_[1−π0], should be replaced by the corresponding population quantiles, defined as the upper and lower π0-quantiles of the weighted distribution of the set of δ̂(X_k) values. Extending the proposed variance estimators to complex survey designs is less straightforward. The asymptotic result described in Proposition 4.1 should be modified to account for sampling designs from a finite population and for unequal weights of the subjects. Bootstrap estimators can also be generalized to complex survey data, but require design-specific strategies (Levy and Lemeshow, 2013).

Finally, Chapter 4 focused on binary treatments and on interventions aiming to modify the treatment status of subjects from one level to the other. The proposed methodology can be easily extended to scenarios with multiple treatment levels, if the intervention of interest aims to move subjects from one treatment level, z1, to another given level, z2. In this case, the setup would only differ from the one presented in Section 4.1 insofar as subjects receiving treatments different from z1 and z2 would not be "eligible" for a modification of treatment status. Future studies might focus on more complex interventions, impacting the allocation of subjects to the treatment groups in different ways. For example, one intervention might remove subjects from one treatment level, z1, and equally distribute those subjects among the other treatment levels.
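The weighted quantiles required by the naive survey extension can be computed directly from the estimated individual effects and the survey weights. The helper below is a generic illustration (the non-interpolating rule is an assumption): it returns the smallest value whose cumulative normalized weight reaches the requested level.

```python
import numpy as np

def weighted_quantile(x, w, q):
    """Sketch: smallest x whose cumulative normalized survey weight reaches q."""
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    order = np.argsort(x)
    x, w = x[order], w[order]
    cum = np.cumsum(w) / w.sum()        # cumulative weight share up to each x
    return x[np.searchsorted(cum, q)]   # first index where cum >= q
```

With equal weights this reduces to an ordinary sample quantile, so the survey extension nests the simple-random-sampling case.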

Bibliography

Agency for Healthcare Research and Quality, Medical Expenditure Panel Survey (MEPS) (2019).

URL: meps.ahrq.gov [Accessed: March 22nd, 2019]

Agency for Healthcare Research and Quality, Overview of the Nationwide Emergency Department Sample (NEDS) (2019).

URL: www.hcup-us.ahrq.gov/nedsoverview.jsp [Accessed: March 22nd, 2019]

Ahern, J. (2016), ‘Population Intervention Measures to Connect Research Findings to Policy’, American Journal of Public Health 106(12), 2152–2153.

Ahern, J., Colson, K. E., Margerson-Zilko, C., Hubbard, A. and Galea, S. (2016), ‘Predicting the Population Health Impacts of Community Interventions: The Case of Alcohol Outlets and Binge Drinking’, American Journal of Public Health 106(11), 1938–1943.

Ahern, J., Hubbard, A. and Galea, S. (2009), ‘Estimating the Effects of Potential Pub- lic Health Interventions on Population Disease Burden: A Step-by-Step Illustration of Causal Inference Methods’, American Journal of Epidemiology 169(9), 1140– 1147.

Andersen, R. M. (1995), ‘Revisiting the behavioral model and access to medical care: Does it matter?’, Journal of Health and Social Behavior 36(1), 1–10.

Andersen, R. and Newman, J. F. (1973), ‘Societal and individual determinants of medical care utilization in the United States’, The Milbank Memorial Fund Quarterly. Health and Society 51(1), 95–124.

Andrulis, D. P. (1998), ‘Access to care is the centerpiece in the elimination of socioe- conomic disparities in health’, Annals of Internal Medicine 129(5), 412–416.

Ashmead, R. D. (2014), Propensity Score Methods for Estimating Causal Effects from Complex Survey Data, PhD thesis, Ohio State University.

Austin, P. C. (2011), ‘An introduction to propensity score methods for reducing the effects of confounding in observational studies’, Multivariate Behavioral Research 46(3), 399–424.

Austin, P. C., Jembere, N. and Chiu, M. (2018), ‘Propensity score matching and complex surveys’, Statistical Methods in Medical Research 27(4), 1240–1257.

Austin, P. C. and Stuart, E. A. (2015), ‘Moving towards best practice when using inverse probability of treatment weighting (iptw) using the propensity score to estimate causal treatment effects in observational studies’, Statistics in Medicine 34(28), 3661–3679.

Baker, D. W., Shapiro, M. F. and Schur, C. L. (2000), ‘Health insurance and access to care for symptomatic conditions’, Archives of Internal Medicine 160(9), 1269–1274.

Bennett, M., Vielma, J. P. and Zubizarreta, J. R. (2018), ‘Building representative matched samples with multi-valued treatments in large observational studies: Analysis of the impact of an earthquake on educational attainment’, arXiv preprint arXiv:1810.06707.

Binder, D. A. (1983), ‘On the variances of asymptotically normal estimators from complex surveys’, International Statistical Review 51(3), 279–292.

Browner, W. S. (1986), ‘Estimating the impact of risk factor modification programs’, American Journal of Epidemiology 123(1), 143–153.

Bulterys, M., Morgenstern, H. and Weed, D. L. (1997), ‘Quantifying the expected vs potential impact of a risk-factor intervention program’, American Journal of Public Health 87(5), 867–868.

Centers for Disease Control and Prevention, National Center for Injury Prevention and Control, Key Injury and Violence Data (2016).

URL: www.cdc.gov/injury/wisqars/overview/key_data.html [Accessed: June 26th, 2018]

Chowdhury, S. R., Machlin, S. R. and Gwet, K. L. (2019), ‘Sample designs of the medical expenditure panel survey household component, 1996-2006 and 2007-2016’.

URL: https://meps.ahrq.gov/data_files/publications/mr33/mr33.shtml

Clancy, T. V., Maxwell, J. G., Covington, D. L., Brinker, C. C. and Blackman, D. (2001), ‘A statewide analysis of Level I and II trauma centers for patients with major injuries’, Journal of Trauma and Acute Care Surgery 51(2), 346–351.

Cochran, W. G. and Chambers, S. P. (1965), ‘The planning of observational studies of human populations’, Journal of the Royal Statistical Society (Series A) 128(2), 234–266.

Committee on Trauma, American College of Surgeons (2006), Resources for Optimal Care of the Injured Patient: 2006, American College of Surgeons, Chicago, IL.

Díaz, I. and Van der Laan, M. J. (2013), ‘Assessing the Causal Effect of Policies: An Example Using Stochastic Interventions’, The International Journal of Biostatistics 9(2), 161–174.

DuGoff, E. H., Schuler, M. and Stuart, E. A. (2014), ‘Generalizing results: Applying propensity score methods to complex surveys’, Health Services Research 49(1), 284–303.

Efron, B. and Tibshirani, R. (1986), ‘Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy’, Statistical Science 1(1), 54–75.

Fisher, R. A. (1935), The Design of Experiments, Oliver and Boyd, Edinburgh, UK.

Fuller, W. A. (2011), Sampling statistics, Vol. 560, John Wiley & Sons, New York, NY.

Greene, W. R., Oyetunji, T. A., Bowers, U., Haider, A. H., Mellman, T. A., Cornwell, E. E., Siram, S. M. and Chang, D. C. (2010), ‘Insurance status is a potent predictor of outcomes in both blunt and penetrating trauma’, The American Journal of Surgery 199(4), 554–557.

Gu, X. S. and Rosenbaum, P. R. (1993), ‘Comparison of multivariate matching meth- ods: Structures, distances, and algorithms’, Journal of Computational and Graph- ical Statistics 2(4), 405–420.

Haas, J. S. and Goldman, L. (1994), ‘Acutely injured patients with trauma in Massachusetts: differences in care and mortality, by insurance status’, American Journal of Public Health 84(10), 1605–1608.

Hansen, B. B. and Klopfer, S. O. (2006), ‘Optimal full matching and related designs via network flows’, Journal of Computational and Graphical Statistics 15(3), 609–627.

Hernán, M. A. and Robins, J. M. (2018), Causal Inference, Chapman & Hall/CRC, forthcoming, Boca Raton, FL.

Hosmer, D. W. J., Lemeshow, S. and Sturdivant, R. X. (2013), Applied Logistic Regression, Vol. 398, John Wiley & Sons, Chichester.

Hubbard, A. E. and Van der Laan, M. J. (2008), ‘Population intervention models in causal inference’, Biometrika 95(1), 35–47.

Huber, P. J. (1967), The behavior of maximum likelihood estimates under nonstandard conditions, University of California Press, Berkeley, Calif.

Imbens, G. W. (2000), ‘The role of the propensity score in estimating dose-response functions’, Biometrika 87(3), 706–710.

Karp, R. M. (1972), Reducibility among Combinatorial Problems, Springer US, Boston, MA, pp. 85–103.

Lenis, D., Nguyen, T. Q., Dong, N. and Stuart, E. A. (2017), ‘It’s all about balance: Propensity score matching in the context of complex survey data’, Biostatistics . URL: http://dx.doi.org/10.1093/biostatistics/kxx063

Levy, P. S. and Lemeshow, S. (2013), Sampling of Populations: Methods and Appli- cations, John Wiley & Sons, Hoboken, NJ.

Li, F., Zaslavsky, A. M. and Landrum, M. B. (2013), ‘Propensity score weighting with multilevel data’, Statistics in Medicine 32(19), 3373–3387.

Linden, A., Uysal, S. D., Ryan, A. and Adams, J. L. (2016), ‘Estimating causal effects for multivalued treatments: A comparison of approaches’, Statistics in Medicine 35(4), 534–552.

Lopez, M. J. and Gutman, R. (2017), ‘Estimation of causal effects with multiple treatments: A review and new ideas’, Statistical Science 32(3), 432–454.

Lu, B., Qian, Z., Cunningham, A. and Li, C.-L. (2012), ‘Estimating the effect of pre- marital cohabitation on timing of marital disruption using propensity score match- ing in event history analysis’, Sociological Methods & Research 41(3), 440–466.

Lu, B. and Rosenbaum, P. R. (2004), ‘Optimal pair matching with two control groups’, Journal of computational and graphical statistics 13(2), 422–434.

Lu, B., Zanutto, E., Hornik, R. and Rosenbaum, P. R. (2001), ‘Matching with doses in an observational study of a media campaign against drug abuse’, Journal of the American Statistical Association 96(456), 1245–1253.

Lunceford, J. K. and Davidian, M. (2004), ‘Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study’, Statistics in Medicine 23(19), 2937–2960.

MacDorman, M. F. (2011), ‘Race and ethnic disparities in fetal mortality, preterm birth, and infant mortality in the United States: An overview’, Seminars in Perinatology 35(4), 200–208.

MacKenzie, E. J., Hoyt, D. B., Sacra, J. C., Jurkovich, G. J., Carlini, A. R., Teitelbaum, S. D. and Teter Jr, H. (2003), ‘National inventory of hospital trauma centers’, Journal of the American Medical Association 289(12), 1515–1522.

MacKenzie, E. J., Rivara, F. P., Jurkovich, G. J., Nathens, A. B., Frey, K. P., Egleston, B. L., Salkever, D. S. and Scharfstein, D. O. (2006), ‘A national evaluation of the effect of trauma-center care on mortality’, New England Journal of Medicine 354(4), 366–378.

Mantel, N. and Haenszel, W. (1959), ‘Statistical aspects of the analysis of data from retrospective studies of disease’, Journal of the National Cancer Institute 22(4), 719–748.

McConnell, K. J., Newgard, C. D., Mullins, R. J., Arthur, M. and Hedges, J. R. (2005), ‘Mortality benefit of transfer to Level I versus Level II trauma centers for head-injured patients’, Health Services Research 40(2), 435–458.

McNemar, Q. (1947), ‘Note on the sampling error of the difference between correlated proportions or percentages’, Psychometrika 12(2), 153–157.

Moore, E., Blatt, K., Chen, A., Van Hook, J. and DeFranco, E. A. (2016), ‘Factors associated with smoking cessation in pregnancy’, American Journal of Perinatology 33(6), 560.

Muñoz, I. D. and Van der Laan, M. (2012), ‘Population Intervention Causal Effects Based on Stochastic Interventions’, Biometrics 68(2), 541–549.

Naimi, A. I., Cole, S. R. and Kennedy, E. H. (2016), ‘An introduction to g methods’, International Journal of Epidemiology 46(2), 756–762.

Nattino, G. and Lu, B. (2018), ‘Model assisted sensitivity analyses for hidden bias with binary outcomes’, Biometrics . URL: http://doi.org/10.1111/biom.12919

Newacheck, P. W., Stoddard, J. J., Hughes, D. C. and Pearl, M. (1998), ‘Health insurance and access to primary care for children’, New England Journal of Medicine 338(8), 513–519.

Neyman, J. (1935), ‘Statistical problems in agricultural experimentation’, Supplement to the Journal of the Royal Statistical Society 2(2), 107–180.

Ohio Colleges of Medicine Government Resource Center, Infant Mortality Research Partnership (2018).

URL: http://grc.osu.edu/projects/IMRP [Accessed: June 26th, 2018]

Rassen, J. A., Shelat, A. A., Franklin, J. M., Glynn, R. J., Solomon, D. H. and Schneeweiss, S. (2013), ‘Matching by propensity score in cohort studies with three treatment groups’, Epidemiology 24(3), 401–409.

Ridgeway, G., Kovalchik, S. A., Griffin, B. A. and Kabeto, M. U. (2015), ‘Propensity score analysis with survey weighted data’, Journal of Causal Inference 3(2), 237– 249.

Robins, J. (1986), ‘A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect’, Mathematical Modelling 7(9), 1393–1512.

Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994), ‘Estimation of regression coef- ficients when some regressors are not always observed’, Journal of the American Statistical Association 89(427), 846–866.

Rosenbaum, P. R. (1987), ‘Sensitivity analysis for certain permutation inferences in matched observational studies’, Biometrika 74(1), 13–26.

Rosenbaum, P. R. (1989), ‘Optimal matching for observational studies’, Journal of the American Statistical Association 84(408), 1024–1032.

Rosenbaum, P. R. (2002a), ‘Attributing effects to treatment in matched observational studies’, Journal of the American Statistical Association 97(457), 183–192.

Rosenbaum, P. R. (2002b), Observational Studies, Springer, New York, NY.

Rosenbaum, P. R. (2007), ‘Sensitivity analysis for m-estimates, tests, and confidence intervals in matched observational studies’, Biometrics 63(2), 456–464.

Rosenbaum, P. R. (2010), ‘Evidence factors in observational studies’, Biometrika 97(2), 333–345.

Rosenbaum, P. R. (2011), ‘Some approximate evidence factors in observational stud- ies’, Journal of the American Statistical Association 106(493), 285–295.

Rosenbaum, P. R. (2017), ‘The general structure of evidence factors in observational studies’, Statist. Sci. 32(4), 514–530. URL: https://doi.org/10.1214/17-STS621

Rosenbaum, P. R. and Rubin, D. B. (1983), ‘The central role of the propensity score in observational studies for causal effects’, Biometrika 70(1), 41–55.

Rubin, D. B. (1980), ‘Discussion of “Randomization analysis of experimental data in the Fisher randomization test” by D. Basu’, Journal of the American Statistical Association 75(371), 591–593.

Sacks, G. D., Hill, C. and Rogers Jr, S. O. (2011), ‘Insurance status and hospital discharge disposition after trauma: inequities in access to postacute care’, Journal of Trauma and Acute Care Surgery 71(4), 1011–1015.

Salim, A., Ottochian, M., DuBose, J., Inaba, K., Teixeira, P., Chan, L. S. and Margulies, D. R. (2010), ‘Does insurance status matter at a public, Level I trauma center?’, Journal of Trauma and Acute Care Surgery 68(1), 211–216.

Sävje, F., Higgins, M. J. and Sekhon, J. S. (2017), ‘Generalized full matching’, arXiv preprint arXiv:1703.03882. URL: https://arxiv.org/abs/1703.03882

Sen, P. K. (1988), Asymptotics in finite population sampling, in ‘Sampling’, Vol. 6 of Handbook of Statistics, Elsevier, pp. 291–331.

Shi, J., Lu, B., Wheeler, K. K. and Xiang, H. (2016), ‘Unmeasured confounding in observational studies with multiple treatment arms’, Epidemiology 27(5), 624–632.

Smith Jr, J. S., Martin, L. F., Young, W. W. and Macioce, D. P. (1990), ‘Do trauma centers improve outcome over non-trauma centers: the evaluation of regional trauma care using discharge abstract data and patient management categories’, The Journal of Trauma 30(12), 1533–1538.

Stefanski, L. A. and Boos, D. D. (2002), ‘The calculus of m-estimation’, The American Statistician 56(1), 29–38.

Taubman, S. L., Robins, J. M., Mittleman, M. A. and Hernán, M. A. (2009), ‘Intervening on risk factors for coronary heart disease: An application of the parametric g-formula’, International Journal of Epidemiology 38, 1599–1611.

Vickers, B. P., Shi, J., Lu, B., Wheeler, K. K., Peng, J., Groner, J. I., Haley, K. J. and Xiang, H. (2015), ‘Comparative study of ED mortality risk of US trauma patients treated at Level I and Level II vs nontrauma centers’, The American Journal of Emergency Medicine 33(9), 1158–1165.

Waernbaum, I. (2012), ‘Model misspecification and robustness in causal inference: comparing matching with doubly robust estimation’, Statistics in Medicine 31(15), 1572–1581.

Wagijo, M.-a., Sheikh, A., Duijts, L. and Been, J. V. (2017), ‘Reducing tobacco smoking and smoke exposure to prevent preterm birth and its complications’, Paediatric Respiratory Reviews 22, 3–10.

Wang, W., Scharfstein, D., Tan, Z. and MacKenzie, E. J. (2009), ‘Causal inference in outcome-dependent two-phase sampling designs’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71(5), 947–969.

Westreich, D. (2014), ‘From Exposures to Population Interventions: Pregnancy and Response to HIV Therapy’, American Journal of Epidemiology 179(7), 797–806.

Westreich, D. (2017), ‘From patients to policy: Population intervention effects in epidemiology’, Epidemiology 28(4), 525–528.

Yang, S. (2018), ‘Propensity score weighting for causal inference with clustered data’, Journal of Causal Inference 6(2).

Zanutto, E. L. (2006), ‘A comparison of propensity score and linear regression analysis of complex survey data’, Journal of Data Science 4(1), 67–91.

Zanutto, E., Lu, B. and Hornik, R. (2005), ‘Using propensity score subclassification for multiple treatment doses to evaluate a national antidrug media campaign’, Journal of Educational and Behavioral Statistics 30(1), 59–73.

Zubizarreta, J. R. (2012), ‘Using mixed integer programming for matching in an observational study of kidney failure after surgery’, Journal of the American Statistical Association 107(500), 1360–1371.

Zuvekas, S. H. and Taliaferro, G. S. (2003), ‘Pathways to access: Health insurance, the health care delivery system, and racial/ethnic disparities, 1996–1999’, Health Affairs 22(2), 139–153.

Appendix A: Additional Results of Simulation Study in Chapter 3

The following figures show results of the simulation study described in Section 3.3 that were not reported in the main body of the dissertation. Each figure depicts the percent bias, coverage and variance of the estimators of ∆PATE. Dashed red lines indicate the value of 0 in the left panels and the 90% and 95% values in the central panels. The empirical variance of the estimated effects is reported as a red star in the right panels.
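The three summaries reported in each figure can be computed directly from the Monte Carlo replications. The sketch below illustrates the calculations on simulated placeholder values (the true effect, number of replications and distributions here are illustrative assumptions, not the dissertation's actual simulation output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Monte Carlo output: point estimates of Delta_PATE and their
# estimated variances over R replications (illustrative values only).
true_effect = 1.0
R = 1000
estimates = rng.normal(loc=1.02, scale=0.15, size=R)          # point estimates
est_variances = rng.normal(loc=0.0225, scale=0.002, size=R)   # estimated variances

# Percent bias: average deviation from the truth, relative to the truth.
percent_bias = 100 * (estimates.mean() - true_effect) / true_effect

# Coverage of the 95% CI: share of replications whose Wald interval
# (estimate +/- 1.96 * sqrt(estimated variance)) contains the truth.
half_width = 1.96 * np.sqrt(est_variances)
coverage = np.mean((estimates - half_width <= true_effect)
                   & (true_effect <= estimates + half_width))

# Estimated vs. empirical variance: the mean of the estimated variances
# should track the variance of the point estimates across replications.
mean_estimated_var = est_variances.mean()
empirical_var = estimates.var(ddof=1)

print(f"percent bias: {percent_bias:.1f}%")
print(f"coverage: {100 * coverage:.1f}%")
print(f"estimated var: {mean_estimated_var:.4f}, empirical var: {empirical_var:.4f}")
```

In the figures, a well-behaved estimator shows percent bias near 0, coverage near the nominal 95%, and an average estimated variance close to the empirical variance (the red star).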

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.1: Continuous outcome, (S, Ns) = (5, 80), covariate X3 omitted, sampling scheme independent from the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.2: Continuous outcome, (S, Ns) = (10, 40), all the covariates considered, sampling scheme independent from the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.3: Continuous outcome, (S, Ns) = (10, 40), covariate X3 omitted, sampling scheme independent from the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.4: Continuous outcome, (S, Ns) = (10, 40), all the covariates considered, sampling scheme dependent on the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.5: Continuous outcome, (S, Ns) = (10, 40), covariate X3 omitted, sampling scheme dependent on the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.6: Continuous outcome, (S, Ns) = (20, 20), all the covariates considered, sampling scheme independent from the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.7: Continuous outcome, (S, Ns) = (20, 20), covariate X3 omitted, sampling scheme independent from the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.8: Continuous outcome, (S, Ns) = (20, 20), covariate X3 omitted, sampling scheme dependent on the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.9: Binary outcome, (S, Ns) = (5, 80), all the covariates considered, sampling scheme independent from the treatment.

Figure A.10: Binary outcome, (S, Ns) = (5, 80), all the covariates considered, sampling scheme independent from the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.11: Binary outcome, (S, Ns) = (5, 80), covariate X3 omitted, sampling scheme dependent on the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.12: Binary outcome, (S, Ns) = (10, 40), all the covariates considered, sampling scheme independent from the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.13: Binary outcome, (S, Ns) = (10, 40), covariate X3 omitted, sampling scheme independent from the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.14: Binary outcome, (S, Ns) = (10, 40), all the covariates considered, sampling scheme dependent on the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.15: Binary outcome, (S, Ns) = (10, 40), covariate X3 omitted, sampling scheme dependent on the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.16: Binary outcome, (S, Ns) = (20, 20), all the covariates considered, sampling scheme independent from the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.17: Binary outcome, (S, Ns) = (20, 20), covariate X3 omitted, sampling scheme independent from the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.18: Binary outcome, (S, Ns) = (20, 20), all the covariates considered, sampling scheme dependent on the treatment.

[Figure: percent bias, coverage of the 95% CI, and estimated and empirical variance for each estimator.]

Figure A.19: Binary outcome, (S, Ns) = (20, 20), covariate X3 omitted, sampling scheme dependent on the treatment.
