Causal Inference in Observational Studies with Complex Design: Multiple Arms, Complex Sampling and Intervention Effects
DISSERTATION
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By
Giovanni Nattino, M.S.
Graduate Program in Biostatistics
The Ohio State University
2019
Dissertation Committee:
Dr. Bo Lu, Advisor
Dr. Stanley Lemeshow, Co-Advisor
Dr. Eloise Kaizar

© Copyright by Giovanni Nattino 2019

Abstract
Observational studies are a major data source for inferring causal relationships. When using observational data to estimate causal effects, researchers must consider appropriate statistical methodology to account for the non-random allocation of the units to the treatment groups. Such methodology is well established when the research question involves two treatment groups and results do not need to be generalized to the population from which the study sample has been selected. Relatively few studies have focused on research questions that do not fit into this framework. The goal of this work is to introduce statistical methods to perform causal inference in complex designs.

First, I introduce a matching design for estimating treatment effects in the presence of multiple treatment groups. I devise a novel matching algorithm, generating samples that are well balanced with respect to pre-treatment variables, and discuss the post-matching statistical analyses. Second, I focus on the generalization of causal effects to the population level, specifically when the sample selection is based on complex survey designs. I discuss the extension of the propensity score methodology to survey data, describe a weighted estimator for the common two-stage cluster sample and study its asymptotic properties. Third, I consider the estimation of population intervention effects, which evaluate the impact of realistic changes in the distribution of the treatment in a cohort. I describe estimators for upper and lower bounds of effects of this type, highlighting the implications for policy makers. For each of these three areas of causal inference, I use Monte Carlo simulations to
assess the reliability of the proposed methods and compare them with competing approaches. The new methods are illustrated with real-data applications. Finally, I discuss limitations and aspects requiring further work.
Acknowledgments
First of all, I would like to express my sincere gratitude to my advisors. Thanks to Dr. Stan Lemeshow, who has been the catalyst of this incredible journey. Without you, I would not be where I am now. I am grateful for your unconditional help, which often went beyond the university walls, and for your countless pieces of advice. An equal thanks goes to Dr. Bo Lu, who introduced me to the world of causal inference. Thank you for your guidance and trust, which simultaneously directed me to the finish line and left me space to set my own pace. Thanks for all the pragmatic suggestions and for helping me navigate the statistical conferences I have been fortunate to attend.

Thanks to all the staff of the Government Resource Center, in particular to Lorin Ranbom and Colin Odden, for the continuous support and for the invaluable opportunity of continuously working on the Infant Mortality Research Partnership project. A special thank you to all the researchers I was fortunate to meet within this project. Thank you, “Task 4” members, especially Dr. Pat and Steve Gabbe and Dr. Courtney Hebert. Your enthusiasm and genuine devotion to improving the well-being of our society have truly inspired me. I would also like to thank Dr. Henry Xiang and Dr. Junxin Shi, from Nationwide Children's Hospital, for their expert advice and their help with the trauma data, which motivated part of this work.

Thanks to all the faculty and students I have met during my time at The Ohio State University. In particular, I would like to thank Dr. Elly Kaizar, for your
valuable feedback on my work. I am grateful to Dr. Matt Pratola and Hengrui Luo and to Dr. Mike Pennell. Even though the results of our collaborations do not appear in these pages, working with you was a truly stimulating, refreshing and enjoyable experience. A special thanks also to Dr. Amy Ferketich and Dr. Mario Peruggia, for your friendly advice and for being my “Little Italy” in Columbus.

I would like to thank the researchers of the Laboratory of Clinical Epidemiology at the Mario Negri Institute for Pharmacological Research, in Italy, where I developed my interest in research and in biostatistics. Thank you all, especially Dr. Guido Bertolini, for helping me embark on this journey.

Thanks to all the friends who have been my Columbus family in these years. In particular, thank you Sebastian, Guilherme, Aziz, Júlia, Armand, Shuyuan, Jason, Natalia, Jafar, Julián, Alejandro and Andreas. Thanks for all the dinners together, the Friday night gatherings, the endless barbecues, the bike rides, the rock climbing sessions, the racquetball and disc golf games. You will be missed.

A special thanks to my parents, Daniela and Beppe, and my brothers, Francesco and Stefano. If I am where I am, it is because of your education, encouragement and love.

Finally, a profound thank you to my fiancée, Melissa. You understood the importance of this goal for me, despite the time together that I had to sacrifice along the way. Thanks for your patience and heartening words. I could not have asked for a better travel companion.
Vita
1987 ...... Born in Lecco (LC), Italy
Education
2009 ...... B.S. Applied Mathematics, University of Milan, Milan, Italy
2011 ...... M.S. Applied Mathematics, University of Milan, Milan, Italy
2014 ...... Post-graduate certificate in Biomedical Research, Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Ranica (BG), Italy

Professional Experience
2011-2015 ...... Research Associate, Laboratory of Clinical Epidemiology, Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Ranica (BG), Italy
2016-2019 ...... Graduate Research Associate, Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, Ohio
2017-2019 ...... Graduate Research Associate, Ohio Colleges of Medicine Government Resource Center, The Ohio State University Wexner Medical Center, Columbus, Ohio
Publications
1. Giovanni Nattino, Michael L Pennell, and Stanley Lemeshow. Assessing the goodness of fit of logistic regression models in large samples: a modification of the Hosmer-Lemeshow test. Submitted to Biometrics, 2019.
2. Giovanni Nattino, Bo Lu, Junxin Shi, Stanley Lemeshow, and Henry Xiang. Triplet matching for estimating causal effects with three treatment arms: a comparative study of mortality by trauma center level. Submitted to Journal of the American Statistical Association, 2019.
3. Courtney L Hebert, Giovanni Nattino, Steven G Gabbe, Patricia T Gabbe, Jason Benedict, Gary Phillips, and Stanley Lemeshow. A predictive model for very preterm birth: developing a point of care tool. Submitted to American Journal of Obstetrics and Gynecology, 2019.
4. Erinn M Hade, Giovanni Nattino, Heather A Frey, and Bo Lu. Propensity Score Matching for Treatment Delay Effects with Observational Survival Data. Submitted to Statistical Methods in Medical Research, 2019.
5. Giovanni Nattino and Bo Lu. Model assisted sensitivity analyses for hidden bias with binary outcomes. Biometrics, 74: 1141–1149, 2018.
6. Stefano Skurzak, Greta Carrara, Carlotta Rossi, Giovanni Nattino, Daniele Crespi, Michele Giardino, and Guido Bertolini. Cirrhotic patients admitted to the ICU for medical reasons: analysis of 5506 patients admitted to 286 ICUs in 8 years. Journal of Critical Care, 45: 220–228, 2018.
7. Guido Bertolini, Giovanni Nattino, Carlo Tascini, Daniele Poole, Bruno Viaggi, Greta Carrara, Carlotta Rossi, Daniele Crespi, Matteo Mondini, Martin Langer, Gian Maria Rossolini, and Paolo Malacarne. Mortality attributable to different Klebsiella susceptibility patterns and to the coverage of empirical antibiotic therapy: a cohort study on patients admitted to the ICU with infection. Intensive Care Medicine, 44(10): 1709–1719, 2018.
8. Giovanni Nattino, Stanley Lemeshow, Gary Phillips, Stefano Finazzi, and Guido Bertolini. Assessing the calibration of dichotomous outcome models with the calibration belt. Stata Journal, 17(4): 1003–1014, 2017.
9. Daniele Poole, Stefano Finazzi, Giovanni Nattino, Danilo Radrizzani, Giuseppe Gristina, Paolo Malacarne, Sergio Livigni, and Guido Bertolini. The prognostic importance of chronic end-stage diseases in geriatric patients admitted to 163 Italian ICUs. Minerva Anestesiologica, 83: 1283–1293, 2017.

10. Giovanni Nattino, Stefano Finazzi, and Guido Bertolini. A new test and graphical tool to assess the goodness of fit of logistic regression models. Statistics in Medicine, 35(5): 709–720, 2016.
11. Daniele Poole, Giovanni Nattino, and Guido Bertolini. Overoptimism in the interpretation of statistics. Intensive Care Medicine, 40(12): 1927–1929, 2014.
12. Giovanni Nattino, Stefano Finazzi, and Guido Bertolini. Comments on ‘Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers’ by Peter C. Austin and Ewout W. Steyerberg. Statistics in Medicine, 33(15): 2696–2698, 2014.
13. Giovanni Nattino, Stefano Finazzi, and Guido Bertolini. A new calibration test and a reappraisal of the calibration belt for the assessment of prediction models based on dichotomous outcomes. Statistics in Medicine, 33(14): 2390–2407, 2014.
14. Nicola Latronico, Giovanni Nattino, Bruno Guarneri, Nazzareno Fagoni, Aldo Amantini, and Guido Bertolini. Validation of the peroneal nerve test to diagnose critical illness polyneuropathy and myopathy in the intensive care unit: the multicentre Italian CRIMYNE-2 diagnostic accuracy study. F1000Research, 3(127), 2014.
Fields of Study
Major Field: Biostatistics
Table of Contents
Abstract
Acknowledgments
Vita
List of Figures
List of Tables
List of Abbreviations
Chapters
1 Introduction
  1.1 Causal Inference in Observational Studies
  1.2 Target of Inference
  1.3 Treatment Effects
  1.4 Estimation of Treatment Effects
    1.4.1 Identifiability Assumptions
    1.4.2 The Propensity Score Framework
    1.4.3 G-methods
  1.5 Modern Challenges in Causal Inference
    1.5.1 Multiple Treatment Groups
    1.5.2 Complex Survey Data
    1.5.3 Generalized Intervention Effects
2 Multiple Treatment Groups
  2.1 Conditionally Optimal Matching Algorithm
    2.1.1 Algorithm Setup
    2.1.2 Matching Algorithm for Three Treatment Groups
    2.1.3 Extensions to More than Three Treatment Groups
  2.2 Post-matching Outcome Analysis
    2.2.1 Covariate Balance
    2.2.2 Statistical Setup
    2.2.3 Evidence Factors
    2.2.4 Estimation of Treatment Effects
    2.2.5 Sensitivity Analysis to Hidden Bias
  2.3 Simulation Study
    2.3.1 Setup
    2.3.2 Results
  2.4 Application: Mortality Differences among Trauma Center Levels
    2.4.1 Background
    2.4.2 Data
    2.4.3 Methods
    2.4.4 Results
    2.4.5 Conclusions
3 Propensity Score Adjustment With Cluster Sampling Data
  3.1 Weighted Estimators for Population ATE in Complex Survey Data
    3.1.1 Weighting in Causal Inference and Survey Sampling
    3.1.2 Treatment and Sample Selections
    3.1.3 Weighted or Unweighted Propensity Score?
  3.2 Two-Stage Cluster Sample Surveys
    3.2.1 Cluster Sampling Design: Notation
    3.2.2 Weighted Estimator for Population ATE
    3.2.3 Propensity Score Estimation
    3.2.4 Asymptotic Properties
    3.2.5 Design Variance in Simple Two-Stage Cluster Sampling
  3.3 Simulation Study
    3.3.1 Setup
    3.3.2 Results
  3.4 Application: Effect of Insurance Status on Decision to Seek Care After Injury
    3.4.1 Background
    3.4.2 Data
    3.4.3 Methods
    3.4.4 Results
    3.4.5 Conclusions
4 Population Intervention Effects
  4.1 Definition
  4.2 Interventions
  4.3 Upper and Lower Bounds
  4.4 Estimation of Upper and Lower Bounds
  4.5 Properties of the Proposed Estimators
    4.5.1 Asymptotic Distribution
    4.5.2 Bootstrap
  4.6 Outcome Models
  4.7 Simulation Study
    4.7.1 Setup
    4.7.2 Results
  4.8 Application: Tobacco Cessation Interventions and Nicotine Addiction during Pregnancy
    4.8.1 Background
    4.8.2 Data
    4.8.3 Methods
    4.8.4 Results
    4.8.5 Conclusions
5 Discussion and Future Work
  5.1 Multiple Treatment Groups
    5.1.1 Discussion
    5.1.2 Limitations
    5.1.3 Future Work
  5.2 Complex Survey Designs
    5.2.1 Discussion
    5.2.2 Limitations
    5.2.3 Future Work
  5.3 Population Intervention Effects
    5.3.1 Discussion
    5.3.2 Limitations
    5.3.3 Future Work
Bibliography
Appendices
A Additional Results of Simulation Study in Chapter 3
List of Figures
1.1 Sampling and Treatment Selections in Population Structure
1.2 Causal Contrasts of Average and Intervention Effects
2.1 First Step of the Conditionally Optimal Matching Algorithm
2.2 Distributions of the Matching Variable in the Scenarios of the Simulation Study
2.3 Result of the Sensitivity Analysis in the Comparison of Mortality among Trauma Center Levels
3.1 Result of Simulation Study: Continuous Outcome, (S, N_s) = (5, 80), All the Covariates Considered, Sampling Scheme Independent from the Treatment
3.2 Result of Simulation Study: Continuous Outcome, (S, N_s) = (5, 80), All the Covariates Considered, Sampling Scheme Dependent on the Treatment
3.3 Result of Simulation Study: Continuous Outcome, (S, N_s) = (20, 20), All the Covariates Considered, Sampling Scheme Dependent on the Treatment
3.4 Result of Simulation Study: Continuous Outcome, (S, N_s) = (5, 80), Covariate X3 Omitted, Sampling Scheme Dependent on the Treatment
3.5 Result of Simulation Study: Binary Outcome, (S, N_s) = (5, 80), All the Covariates Considered, Sampling Scheme Dependent on the Treatment
3.6 Estimates of Average Treatment Effect for Insurance Status on Decision to Seek Care after Injury
4.1 Result of Simulation Study: Outcome Model Correctly Specified
4.2 Result of Simulation Study: Scale of Nonlinear Covariates Misspecified in Outcome Model
4.3 Result of Simulation Study: One Covariate Omitted from Outcome Model
4.4 Estimates of Bounds of Intervention Effect as Function of Proportion of Treated Subjects
List of Tables
2.1 Result of Simulation Study
2.2 Balance of Covariates between Treatment Groups in Matched Sample
2.3 Mortality by Trauma Center Level before and after Matching
3.1 Estimates of Coefficients of Population Propensity Score Model
3.2 Balance of Covariates between Treatment Groups in Weighted Sample
4.1 Logistic Regression Model Estimating Probabilities of Preterm Delivery
List of Abbreviations
ATE Average Treatment Effect.
ATT Average Treatment Effect for the Treated.
IE Intervention Effect.
MEPS Medical Expenditure Panel Survey.
NEDS Nationwide Emergency Department Sample.
NN Nearest Neighbor.
NTC Nontrauma Centers.
SUTVA Stable Unit Treatment Value Assumption.
TC Trauma Centers.
TC I Level I Trauma Centers.
TC II Level II Trauma Centers.
Chapter 1 Introduction
1.1 Causal Inference in Observational Studies
The goal of causal inference is to measure causal effects of treatments or exposures on outcomes. Treatment and outcome are denoted with Z and Y, respectively. For introductory purposes, I focus on studies involving two treatment levels, indicated with values 1 and 0. Subjects assigned to treatments 1 and 0 will be referred to as treated and controls, respectively.

Causal effects are traditionally defined at the individual level. For any given subject k of the cohort under study, imagine being able to observe the outcome of interest in two counterfactual scenarios. In one scenario, the unit receives treatment 1 ($Z_k = 1$); in the other scenario, the same unit receives treatment 0 ($Z_k = 0$). Denote the outcomes observed under the two scenarios with $Y^1_k$ and $Y^0_k$, the potential outcomes of subject k. For this specific subject, the treatment has a causal effect on the outcome if $Y^1_k$ differs from $Y^0_k$.

In cohorts of subjects, causal effects may be quantified in several ways. For example, the difference between the averages of $Y^1$ and $Y^0$ in a cohort is a popular measure of average effect and is referred to as the Average Treatment Effect (ATE). Other measures of causal effects are discussed in Section 1.3. For most treatments and outcomes, it is impossible to observe more than one
potential outcome per unit. When a subject receives a treatment, only the corresponding potential outcome can be observed. In this sense, the treatment assignment can be interpreted as a selection from the set of all the potential outcomes of the units under study.

Historically, randomized experiments are considered the gold standard to measure causal effects. The randomization of the treatment guarantees that, for each subject, the potential outcome to be observed is selected at random. As a consequence, the distributions of the observed outcomes in the treated and control groups are expected to represent the distributions of the potential outcomes $Y^1$ and $Y^0$, respectively. In this case, causal effects can be quantified straightforwardly. For example, the ATE can be estimated using the difference between the sample averages of the observed outcomes in the two treatment groups.

The estimation of causal effects in observational studies requires stronger assumptions and more complex statistical methodology. Treatments are not assigned at random and, most of the time, the underlying assignment mechanism is unknown. The literature provides different methods to deal with a variety of scenarios. All of these methods require assumptions, which are often reasonable but rarely testable.

Nevertheless, causal inference in observational studies has become increasingly popular over the past decades. There are several reasons motivating such a marked increase in popularity. First of all, observational studies have much lower costs than interventional experiments. In particular, holding the study budget fixed, the possibility of studying larger samples is an attractive feature when dealing with small treatment effects. Second, in the fields of medicine and environmental sciences, randomized experiments are not always ethical.
Whenever researchers are interested in treatments that have known risks for the experimental units, observational studies are the only option for estimating causal effects. Third, the generalizability of the results of interventional studies is often hampered by the rigid criteria controlling the conditions of the experiments. Samples collected in carefully designed observational studies are much more likely to be representative of the population of interest. Finally, modern technologies and developments in computer science are constantly increasing the availability of large observational datasets. Large-scale surveys, electronic health records and social networks are just a few of the many sources of “big data”, which are invaluable resources for observational studies.
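The mechanics of the potential-outcome framework can be made concrete with a small simulation. The sketch below is a toy illustration (the data-generating values are arbitrary assumptions, not from this dissertation): it builds both potential outcomes for every unit, reveals one of them through a randomized treatment, and shows that the difference in sample means recovers the ATE.

```python
import random
import statistics

random.seed(0)
n = 10_000

# Hypothetical cohort: each unit has two potential outcomes, Y^0 and Y^1.
# In this illustration the individual causal effect is a constant +2.
y0 = [random.gauss(0, 1) for _ in range(n)]
y1 = [y + 2.0 for y in y0]

# Randomized experiment: the treatment selects which potential outcome
# is revealed for each unit.
z = [random.random() < 0.5 for _ in range(n)]
y_obs = [y1[k] if z[k] else y0[k] for k in range(n)]

# Under randomization, the difference in sample means estimates the ATE.
ate_hat = (statistics.mean([y_obs[k] for k in range(n) if z[k]])
           - statistics.mean([y_obs[k] for k in range(n) if not z[k]]))
print(round(ate_hat, 2))  # close to the true ATE of 2
```

Because the individual effect is constant here, the sample ATE equals 2 exactly, and the estimate deviates from it only by the sampling noise induced by the random treatment assignment.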
1.2 Target of Inference
Researchers are often interested in quantifying causal effects at the aggregate level. Depending on the research question, the target cohort may be the study sample or the population from which the sample has been drawn. Figure 1.1 provides a graphical representation of a comprehensive population framework, which includes the cohorts targeted by the methods presented throughout the dissertation.

The most inclusive cohort is the infinite potential outcome superpopulation, which is the top-left set of the figure. Because both potential outcomes are known for each subject of this superpopulation, the distribution of its infinite subjects can be thought of as a bivariate distribution of $(Y^0, Y^1)$. In the most general case, subjects with different characteristics may show different values of the potential outcomes. This feature is represented with a multi-modal pattern in the figure, where the colors indicate different subgroups. The bottom-right set is the smallest of the cohorts, the observed sample. Only a finite set of subjects is observed and only one potential outcome per subject is known, because subjects have received the treatment.

Two types of selections are involved in the process that identifies the observed sample from the potential outcome superpopulation. On the one hand, there are
[Figure 1.1: Population structure assumed throughout the document. Arrows represent selections. Treatment selections at the superpopulation, population and sample levels are indicated with $T_{SP}$, $T_P$ and $T_S$. $P_{PO}$ and $P_T$ indicate the population selections from the superpopulations of potential outcomes and the one after treatment. $S_{PO}$ and $S_T$ indicate the sample selections from the finite populations of potential outcomes and the one after treatment.]
subject selections, which draw finite populations from infinite superpopulations (selections $P_{PO}$ and $P_T$) and samples from finite populations ($S_{PO}$ and $S_T$). These selections are represented with vertical arrows in the figure. On the other hand, there are treatment selections, which identify the potential outcome to be revealed for each unit. Depending on the study design, the treatment selection can be thought to be applied to the infinite superpopulation ($T_{SP}$), the finite population ($T_P$) or the study sample ($T_S$).

Typically, the finite population is considered representative of the superpopulation, because it is assumed to be drawn via simple random sampling. The size of the finite population is denoted with N. The study sample is subsequently selected from the finite population; the sample size is denoted with n, with n < N. In practice, at this stage, researchers may employ sophisticated sampling designs to ensure the generalizability of sample-based results while optimizing the costs of the study. Therefore, the sample may be drawn by simple random sampling or with other, more complex sampling designs.

The target cohort of the causal inference research question varies from study to study, and causal parameters may be defined on any of the sets in the figure. When the interest is in sample-level effects, randomization-based inference is the most popular methodology to verify hypotheses and quantify causal effects (Rosenbaum, 2002b). In this framework, the target cohort is the sample of potential outcomes and the only source of randomness is attributed to the treatment selection. On the other hand, policy makers are often interested in causal parameters defined in the population or superpopulation from which the study sample is drawn (Westreich, 2017). In this case, the sampling selection introduces an additional source of randomness, which must be accounted for in the statistical analysis.

The hierarchical framework illustrated in Figure 1.1 provides an overarching map of the possible paths leading to the observed sample. Once the target cohort is identified, researchers need to recognize the selections that have resulted in the observed sample and ensure the identifiability of the causal parameter of interest with the available data.
1.3 Treatment Effects
I focus on marginal effects, which are averages of individual-level effects over the target cohort. When the target cohort is either an infinite superpopulation or a finite population, the most popular marginal effect is the ATE, which is formally defined as
$\Delta_{ATE} = E[Y^1] - E[Y^0]. \qquad (1.1)$
The operator $E[\cdot]$ is used to denote the average over the cohort of interest (as in Hernán and Robins (2018)). In particular, it is possible to define more specific versions of the ATE depending on the target cohort. For example, denoting the index set of the subjects in the finite population with U, the population ATE is defined as
$\Delta_{PATE} = \frac{1}{N} \sum_{k \in U} Y_k^1 - \frac{1}{N} \sum_{k \in U} Y_k^0. \qquad (1.2)$

A different way to quantify a causal effect is to focus on a subset of the units under study. A common choice is to look at the effect of the treatment on the subjects assigned to the treatment group (i.e., such that Z = 1). The Average Treatment Effect for the Treated (ATT) is defined as:
$\Delta_{ATT} = E[Y^1 \mid Z = 1] - E[Y^0 \mid Z = 1]. \qquad (1.3)$
Similarly to the ATE, it is possible to define specific versions of the ATT for the cohort of interest. Notably, ATE and ATT quantify different aspects of the impact
of the treatment on the outcome. In practice, the most appropriate measure depends on the research question to be addressed. The definitions of ATE and ATT are naturally extended to the sample. For instance, denoting the index set of the subjects in the study sample with S, the sample ATE is defined as
$\Delta_{SATE} = \frac{1}{n} \sum_{k \in S} Y_k^1 - \frac{1}{n} \sum_{k \in S} Y_k^0. \qquad (1.4)$

The sample ATT is defined accordingly, restricting the average to treated subjects. These sample parameters are the principal target of inference in the framework introduced by Neyman (1935). From this perspective, hypotheses about causal effects should be formulated in terms of average effects. For example, the null hypothesis of no treatment effect can be expressed as $H_0: \Delta_{SATE} = 0$.

Fisher (1935) proposed a different framework, where sample effects are established by verifying a null hypothesis of no treatment effect for all the subjects in the sample, i.e., $H_0: Y_k^1 = Y_k^0$ for all $k$ in $S$. This hypothesis, referred to as Fisher's sharp null, is stronger than Neyman's null hypothesis of no average effect. Because the sharp null hypothesis has a central role in Fisher's framework, effects are usually quantified by inverting the test statistic employed to verify the hypothesis.
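Fisher's sharp null lends itself to a simple randomization test: under the sharp null, the observed outcomes are fixed and only the treatment assignment is random, so the null distribution of any test statistic can be approximated by re-randomizing the treatment labels. A minimal sketch with made-up data (the outcome values below are illustrative assumptions, not dissertation data):

```python
import random
import statistics

# Toy cohort: 8 controls and 8 treated with a large apparent effect.
y = [1, 2, 1, 2, 1, 2, 1, 2, 5, 6, 5, 6, 5, 6, 5, 6]
z = [0] * 8 + [1] * 8

def diff_means(y, z):
    treated = [yk for yk, zk in zip(y, z) if zk]
    control = [yk for yk, zk in zip(y, z) if not zk]
    return statistics.mean(treated) - statistics.mean(control)

t_obs = diff_means(y, z)

# Under the sharp null the outcomes are fixed and only the assignment is
# random: approximate the null distribution by re-randomizing the labels.
random.seed(1)
B = 2000
count = 0
for _ in range(B):
    z_perm = z[:]
    random.shuffle(z_perm)
    if abs(diff_means(y, z_perm)) >= abs(t_obs):
        count += 1
p_value = (count + 1) / (B + 1)
print(t_obs, p_value)  # t_obs is 4.0; the p-value is small
```

The test is exact up to Monte Carlo error and requires no distributional assumptions on the outcomes, which is the appeal of the randomization-based framework.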
1.4 Estimation of Treatment Effects
1.4.1 Identifiability Assumptions
In order to quantify causal effects in observational studies, we need to account for pre-treatment covariates and pose some identifiability assumptions. Let X be the vector of covariates. I assume a traditional set of assumptions in causal inference (Hernán and Robins, 2018). These assumptions are:
1. Consistency: There is no interference among subjects and, when the treatment level is fixed, the same potential outcome is consistently observed. This condition implies that the observed outcome Y is always defined as $Y = I(Z = 1)\,Y^1 + I(Z = 0)\,Y^0$.
2. Exchangeability: Treatment and potential outcomes are assumed to be independent given the covariates (i.e., $Z \perp\!\!\!\perp Y^z \mid X$ for $z = 0, 1$).
3. Positivity: All of the subjects in the cohort of interest are eligible to receive all the treatment levels (i.e., $0 < P(Z = z \mid X) < 1$ for $z = 0, 1$).
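A small simulated example may help fix ideas. The sketch below (a hypothetical data-generating process, not from this dissertation) constructs the observed outcome from the potential outcomes exactly as the consistency condition prescribes, and checks positivity empirically by verifying that both treatment levels occur in every covariate stratum:

```python
import random

random.seed(2)
n = 2000

# Hypothetical data-generating process with one binary covariate X.
x = [random.random() < 0.5 for _ in range(n)]

# Treatment probabilities depend on X but stay strictly inside (0, 1),
# as the positivity condition requires.
e = [0.7 if xk else 0.3 for xk in x]
z = [random.random() < ek for ek in e]

# Potential outcomes; consistency defines the observed outcome as
# Y = I(Z = 1) Y^1 + I(Z = 0) Y^0.
y0 = [1.0 if xk else 0.0 for xk in x]
y1 = [y + 2.0 for y in y0]
y = [y1[k] if z[k] else y0[k] for k in range(n)]

# Empirical check of positivity: both treatment levels should be
# observed in every covariate stratum.
p_hat = {}
for level in (False, True):
    zs = [z[k] for k in range(n) if x[k] == level]
    p_hat[level] = sum(zs) / len(zs)
print({k: round(v, 2) for k, v in p_hat.items()})  # both strictly in (0, 1)
```

In real data, positivity can only be probed empirically in this way; strata with no treated (or no control) subjects signal a violation, or at least near-violation, of the assumption.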
Analogous conditions have been referred to with different terminology in the literature. For example, the condition of consistency has been called the Stable Unit Treatment Value Assumption (SUTVA) in the randomization-based inference framework (Rubin, 1980). Similarly, the conditions of exchangeability and positivity have been referred to as ignorability of the treatment assignment or weak unconfoundedness (Rosenbaum and Rubin, 1983; Imbens, 2000).

If these assumptions are met, causal effects can be estimated with the available data. The statistical method to be used depends on the specific effect to be estimated, which in turn depends on the research question to be addressed. While a comprehensive presentation of the methodology to estimate causal effects is beyond the scope of this dissertation, the available approaches can be classified into nonparametric methods and methods requiring modeling (Hernán and Robins, 2018).

Matching and stratifying subjects according to the values of the covariates are examples of nonparametric approaches (Cochran and Chambers, 1965). The basic idea is to identify subgroups where the treatment groups are balanced with respect to the covariates and, therefore, comparable in terms of the outcome. The key limitation of these approaches is the difficulty of handling large numbers of covariates, a common issue in observational studies, where many factors may influence both the treatment and the outcome.

The second class of methods involves modeling. The propensity score framework is a popular methodology that belongs to this family, where models are used to estimate the probability of receiving the treatment given the covariates, namely the propensity score (Rosenbaum and Rubin, 1983). G-methods represent another class of methods of this family (Robins, 1986). In this case, models are used to estimate the probability of receiving the treatment, the conditional mean of the outcome, or both. These two comprehensive frameworks are briefly introduced in the following sections.
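As a toy illustration of the stratification idea, the sketch below (with an assumed data-generating process; none of the numbers come from this dissertation) compares the crude difference in means, which is biased by a confounder X, with a subclassification estimator that averages within-stratum contrasts weighted by stratum size:

```python
import random

random.seed(3)
n = 5000

# Hypothetical observational data: a binary confounder X drives both
# treatment assignment and outcome, so the crude contrast is biased.
x = [random.random() < 0.5 for _ in range(n)]
z = [random.random() < (0.8 if xk else 0.2) for xk in x]
# True individual effect is +1; X shifts the outcome by +3.
y = [3.0 * xk + 1.0 * zk + random.gauss(0, 1) for xk, zk in zip(x, z)]

def mean(v):
    return sum(v) / len(v)

# Crude (confounded) difference in means.
crude = (mean([y[k] for k in range(n) if z[k]])
         - mean([y[k] for k in range(n) if not z[k]]))

# Stratified estimator: within-stratum contrasts weighted by stratum size.
ate_hat = 0.0
for level in (False, True):
    idx = [k for k in range(n) if x[k] == level]
    t = [y[k] for k in idx if z[k]]
    c = [y[k] for k in idx if not z[k]]
    ate_hat += (len(idx) / n) * (mean(t) - mean(c))

print(round(crude, 2), round(ate_hat, 2))  # crude is biased; stratified is near 1
```

With a single binary covariate the strata are trivial to form; with many covariates the number of strata grows exponentially and cells quickly become empty, which is exactly the limitation of nonparametric approaches noted above.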
1.4.2 The Propensity Score Framework
The Propensity Score
The propensity score is the probability of receiving the treatment given the covariates, i.e., $e(X) = P(Z = 1 \mid X)$. Under the identifiability assumptions, Rosenbaum and Rubin (1983) showed that the exchangeability property holds if the possibly high-dimensional vector of covariates is replaced by the propensity score. This data-reduction property makes the propensity score extremely attractive in empirical research.

In most practical applications, the probability $e(X)$ is unknown and must be estimated. If the treatment is binary, a common approach is to estimate the propensity score with a logistic regression model, using the treatment variable Z as the dependent variable and the available covariates X as predictors.

The propensity score can be used in different ways to estimate causal effects (Rosenbaum and Rubin, 1983). I focus on propensity score matching and weighting, which are considered the most reliable approaches to estimate treatment effects. These approaches are briefly described in the following subsections.
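As a sketch of the estimation step just described, the following code fits a one-covariate logistic regression for $e(X)$ by Newton-Raphson on simulated data. The true coefficients are arbitrary assumptions of this illustration, and in practice one would use standard statistical software rather than a hand-rolled fit:

```python
import math
import random

random.seed(4)
n = 4000

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

# Simulated data: the true propensity score follows a logistic model
# with (assumed) coefficients -0.5 and 1.0.
x = [random.gauss(0, 1) for _ in range(n)]
z = [random.random() < expit(-0.5 + 1.0 * xk) for xk in x]

# Fit the logistic regression Z ~ X by Newton-Raphson (two parameters).
b0, b1 = 0.0, 0.0
for _ in range(25):
    p = [expit(b0 + b1 * xk) for xk in x]
    w = [pk * (1 - pk) for pk in p]
    # Gradient of the log-likelihood.
    g0 = sum(z[k] - p[k] for k in range(n))
    g1 = sum((z[k] - p[k]) * x[k] for k in range(n))
    # 2x2 observed information matrix and its inverse applied to g.
    h00 = sum(w)
    h01 = sum(w[k] * x[k] for k in range(n))
    h11 = sum(w[k] * x[k] ** 2 for k in range(n))
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

# Estimated propensity scores for all subjects.
e_hat = [expit(b0 + b1 * xk) for xk in x]
print(round(b0, 2), round(b1, 2))  # close to the true (-0.5, 1.0)
```

The fitted values `e_hat` are what downstream matching or weighting procedures consume; the treatment indicator, not the outcome, is the dependent variable throughout.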
Propensity Score Matching
Rosenbaum and Rubin (1983) showed that the propensity score is a balancing score, which means that treatment and covariates are independent conditional on the value of the propensity score. As a consequence of this property, if treated units could be perfectly paired to controls with the same propensity score values, the distribution of the covariates would be the same in the matched treatment groups, as it would be in a randomized experiment. This is the rationale of propensity score matching. Matching algorithms construct matched sets formed by control and treated units that are similar with respect to the propensity score. In this way, researchers hope to generate matched samples where the covariates in the treatment groups are well balanced.

The main limitation of matching is that only a subset of the units in the control group is selected to enter the matched sample, even though discarding control units that are not comparable to the treated may itself be desirable. On the other hand, matching offers several advantages over the other propensity score-based methods (Austin, 2011). First of all, by attempting to recreate the balanced design resulting from a randomized study, the results of matching are easy to interpret. Second, because post-matching analyses do not need to rely on parametric outcome models, they are robust to the misspecification of functional forms of the covariates in these models. Third, because of the necessity to evaluate the quality of matching in terms of covariate balance, researchers are required to critically assess the overlap of the treatment groups in terms of the observed confounders. A lack of overlap between treatment groups implies an undesired extrapolation from the available data when estimating treatment effects, and might pass unnoticed with model-based approaches.
Finally, matching offers the possibility of "outcome blinding". The matching step, which creates the balanced design to be used for the subsequent analysis, can be performed blindly with respect to the outcome of interest. This practice protects the final analysis from conscious and unconscious choices that may condition the study results.
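To make the mechanics of the approach concrete, the following is a minimal Python sketch of propensity score matching. It is purely illustrative (a greedy 1:1 nearest-neighbor pairing on hypothetical estimated scores), not one of the optimal matching procedures discussed in this work:

```python
import numpy as np

def greedy_ps_match(ps_treated, ps_control):
    """Greedy 1:1 nearest-neighbor matching on the propensity score.

    Each treated unit is paired, in turn, with the closest control
    that has not been matched yet; returns a list of
    (treated_index, control_index) pairs.
    """
    available = list(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        # index (within 'available') of the closest remaining control
        j = min(available, key=lambda c: abs(ps_control[c] - p))
        pairs.append((i, j))
        available.remove(j)
    return pairs

# Toy data: estimated propensity scores for 3 treated and 5 controls.
ps_t = np.array([0.30, 0.55, 0.70])
ps_c = np.array([0.10, 0.32, 0.50, 0.72, 0.90])
print(greedy_ps_match(ps_t, ps_c))  # [(0, 1), (1, 2), (2, 3)]
```

A greedy pairing of this kind depends on the order in which treated units are processed, which is one reason optimal algorithms are preferable; the limitations of existing algorithms in the multiple-treatment case motivate the methodology developed in Chapter 2.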
Propensity Score Weighting
The rationale of weighting is to recreate the sample, population or superpopulation where both potential outcomes are known (left sets in Figure 1.1), by assigning an appropriate weight to each subject of the observed sample. For example, the sample ATE can be estimated with a weighted average of the outcomes, where each subject is weighted by the inverse of the probability of receiving the treatment that he/she actually received (e(X) for treated and 1 − e(X) for controls). Similarly, a different family of weights can be used to define a weighted estimator for the ATT. In this sense, weighting is a versatile approach for the estimation of different causal effects. As opposed to matching, the estimates of the propensity score are explicitly involved in the estimation of the treatment effect. Traditional weighting estimators are therefore more sensitive to misspecifications of the propensity score model than matched estimators. To address this limitation, doubly-robust weighting estimators have been proposed in the literature (Robins et al., 1994). Briefly, the idea is to appropriately introduce an outcome model within the estimator. Estimation of the treatment effect remains consistent if either the propensity score model or the outcome model is misspecified (but not both).
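The inverse-probability-weighted estimator of the sample ATE described above can be sketched as follows. This is an illustrative example with known propensity scores (in practice e(X) would be estimated, e.g., by logistic regression); the normalized, Hájek-style form is used here:

```python
import numpy as np

def ipw_ate(y, z, e):
    """Normalized inverse-probability-weighted estimate of the ATE:
    treated units weighted by 1/e(X), controls by 1/(1 - e(X))."""
    y, z, e = map(np.asarray, (y, z, e))
    mean_treated = np.sum(z * y / e) / np.sum(z / e)
    mean_control = np.sum((1 - z) * y / (1 - e)) / np.sum((1 - z) / (1 - e))
    return mean_treated - mean_control

# Toy example with known propensity scores.
y = [3.0, 1.0, 4.0, 2.0]
z = [1, 0, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]   # constant e(X): reduces to a difference in means
print(ipw_ate(y, z, e))     # 2.0 = (3+4)/2 - (1+2)/2
```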
1.4.3 G-methods
G-methods (or generalized methods) are model-based approaches to estimate a variety of causal contrasts, in cross-sectional and longitudinal designs (Robins, 1986). This broad family includes marginal structural models, structural nested models and the parametric g-formula. Extensive descriptions of these approaches are available in the literature (e.g., Naimi et al., 2016). Marginal structural models and structural nested models are families of models whose coefficients are directly related to marginal causal parameters, such as E[Y^z], the ATE or the ATE in subgroups of the study sample. Marginal structural models require a model for the probability of receiving a treatment level (i.e., the propensity score). Structural nested models need models for both a function of the outcome and the propensity score, but they are doubly robust: estimates are consistent if either of the two models is correctly specified. The parametric g-formula relies on outcome models that include the treatment Z and the covariates X. I provide some background information about this methodology, which is used in Chapter 4. In fixed-time settings, the g-formula estimator for ATEs is motivated by the following equality:
$$\begin{aligned}
\Delta_{ATE} = E\left[Y^1\right] - E\left[Y^0\right] &= E\left[E\left[Y^1 \mid X\right]\right] - E\left[E\left[Y^0 \mid X\right]\right] \\
&= E\left[E\left[Y^1 \mid Z = 1, X\right]\right] - E\left[E\left[Y^0 \mid Z = 0, X\right]\right] \\
&= E\left[E\left[Y \mid Z = 1, X\right]\right] - E\left[E\left[Y \mid Z = 0, X\right]\right], \qquad (1.5)
\end{aligned}$$
where $E[Y^z \mid X] = E[Y^z \mid Z = z, X]$ because of the exchangeability assumption ($Y^z$ and $Z$ are independent given $X$). The equality suggests that $E[Y^z]$ can be estimated via standardization of the mean of Y across values of Z and X. Theoretically, $E[Y \mid Z, X]$ could be estimated nonparametrically, by computing sample averages across strata of Z and X. However, this is only possible if the dimension of X is small and the strata corresponding to each value of X in the observed sample are well populated. In most practical settings, these conditions are not met and $E[Y \mid Z, X]$ is estimated with a parametric outcome model. A natural estimator of the sample ATE follows from Equation (1.5):
$$\widehat{\Delta}_{SATE} = \frac{1}{n} \sum_{k=1}^{n} \left\{ \widehat{E}\left[Y \mid Z = 1, X_k\right] - \widehat{E}\left[Y \mid Z = 0, X_k\right] \right\}. \qquad (1.6)$$

Importantly, the estimators based on the g-formula are consistent only if the outcome model is correctly specified. G-methods can be easily applied to complex designs, in the presence of time-varying treatments or when the target causal contrast involves interventions on the treatment mechanism (Hernán and Robins, 2018). In study designs with fixed-time treatments and when the goal is to estimate traditional causal effects (such as the ATE or the ATT), nonparametric methods and the propensity score framework are simpler alternative approaches.
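A minimal sketch of the estimator in Equation (1.6) is shown below, with $E[Y \mid Z, X]$ fitted by ordinary least squares. The linear outcome model is an illustrative assumption; any correctly specified outcome model could be substituted:

```python
import numpy as np

def g_formula_ate(y, z, x):
    """Sample-ATE estimate via the g-formula: fit a linear outcome model
    E[Y|Z,X], then average the model predictions over the sample covariates
    with Z set to 1 and to 0 for every unit."""
    y, z, x = np.asarray(y, float), np.asarray(z, float), np.asarray(x, float)
    design = np.column_stack([np.ones_like(z), z, x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    pred1 = np.column_stack([np.ones_like(z), np.ones_like(z), x]) @ beta
    pred0 = np.column_stack([np.ones_like(z), np.zeros_like(z), x]) @ beta
    return np.mean(pred1 - pred0)

# Toy data generated from Y = 1 + 2*Z + X (no noise): the ATE is 2.
x = np.array([0.0, 1.0, 2.0, 3.0])
z = np.array([0, 1, 0, 1])
y = 1 + 2 * z + x
print(round(g_formula_ate(y, z, x), 6))  # 2.0
```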
1.5 Modern Challenges in Causal Inference
1.5.1 Multiple Treatment Groups
Causal Inference Setup and Estimation of Treatment Effects
Even though most of the traditional causal inference literature has focused on designs with two treatment levels, simultaneously evaluating the effect of multiple treatments is vital in modern public health and medical research, where several alternative treatments are often available. The potential outcome framework naturally generalizes to settings with multiple
treatment groups. In the presence of K treatment levels, the treatment variable Z assumes values in the set Z = {1, ..., K} and each treatment level z ∈ Z corresponds to one potential outcome, Y^z. The population structure described in Section 1.2 also applies to designs with multi-valued treatments. The definition of causal effects proceeds analogously to the binary-treatment case. The seminal work by Imbens (2000) discussed extensions of the traditional identifiability assumptions. In particular, the exchangeability and positivity conditions are extended to multi-valued treatments, by assuming that the potential outcomes Y^z are independent of the treatment given the covariates X for all z ∈ Z and that each subject in the sample is eligible to receive any of the treatments under study, i.e., 0 < P(Z = z|X) < 1 for any z ∈ Z. Imbens (2000) also generalized the propensity score to multiple-treatment settings. Defining $e_z(X) = P(Z = z|X)$, the generalized propensity score is the K-dimensional vector of probabilities $e(X) = (e_1(X), e_2(X), \ldots, e_K(X))$. Notably, since the treatments are mutually exclusive, these probabilities are subject to the constraint $\sum_{z \in \mathcal{Z}} e_z(X) = 1$ for any value of the covariates X. Since each probability $e_z(X)$ can be expressed as one minus the sum of the other probabilities, the generalized propensity score belongs to a (K − 1)-dimensional space. The author showed that the generalized propensity score offers data-reduction properties similar to the ones of the traditional propensity score in the two-group case. In particular, the treatment assignment is independent of each potential outcome given the propensity score, i.e.,
$Z \perp\!\!\!\perp Y^z \mid e_z(X)$ for each z ∈ Z. Different models can be used to estimate the generalized propensity score, depending on the characteristics of the treatment values. If the treatment values are qualitatively different, Imbens (2000) suggested the use of multinomial logit or probit regression. On the other hand, if there is a logical ordering of the treatment levels
(e.g., when the treatment levels under study are different doses of a drug), ordinal logistic regression is better suited. These results provide the theoretical foundation for the estimation of causal effects using observational data. Most of the statistical approaches designed for the two-group case are potentially extendable to the multiple-treatment case. Linden et al. (2016) provided an overview of regression adjustment, stratification and weighted estimators. Even though matching is a common approach when dealing with two treatment groups, the method has received limited attention for studies with multiple treatments. This is unfortunate, because of the unique advantages of matching over alternative approaches (see Section 1.4.2). The following section provides an overview of the available matching procedures for designs with multiple treatments.
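To illustrate the structure of the generalized propensity score, the sketch below computes e(X) under a multinomial-logit parameterization with hypothetical coefficients. In practice, the coefficient matrix would be estimated from the data, e.g., by multinomial logistic regression as suggested by Imbens (2000):

```python
import numpy as np

def generalized_ps(X, B):
    """Generalized propensity score under a multinomial-logit model:
    e_z(X) = exp(X @ B[:, z]) / sum_k exp(X @ B[:, k]).
    X: (n, p) covariate matrix; B: (p, K) coefficient matrix, one column
    per treatment level (one column may be fixed at 0 as the reference)."""
    scores = X @ B
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)

# Toy example: K = 3 treatments, p = 2 covariates (hypothetical coefficients).
X = np.array([[1.0, 0.5], [1.0, -1.0]])
B = np.array([[0.0, 0.2, -0.1], [0.0, 0.5, 0.3]])  # first column: reference
e_gps = generalized_ps(X, B)
print(np.allclose(e_gps.sum(axis=1), 1.0))  # True: each row sums to 1
```

The row-sum check reflects the constraint discussed above: the K probabilities are mutually exclusive and exhaustive, so the score effectively lives in a (K − 1)-dimensional space.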
Matching Algorithms
Lopez and Gutman (2017) recently discussed the limitations in scope of existing matching algorithms for multi-valued treatments. Part of the reason is that these algorithms are much harder to implement than in the two-group case and no optimal solution is available. Notably, given any finite sample, the optimal matched sample minimizing the total distance within matched sets does exist. However, this optimization problem is NP-hard when the number of treatments is larger than two, meaning that no algorithm is known that identifies the solution in polynomial time (Karp, 1972). Therefore, optimal solutions exist, but they are not practically identifiable in a reasonable computation time. To fill this gap, Lu et al. (2001) and Lu and Rosenbaum (2004) introduced the optimal nonbipartite matching design for multiple treatment groups. The optimality is achieved by relaxing the requirement of having units from each of the treatment groups in the matched sets. Unfortunately, the resulting design is a paired structure
that cannot be used to compare all groups directly. To create matched sets with subjects from all treatment arms, Rassen et al. (2013) discussed applications of the popular Nearest Neighbor (NN) algorithm, using different distance metrics. The major issue with NN algorithms is that the overall matching quality can be poor, as has been shown in the two-group case (Rosenbaum, 1989). Simple extensions of optimal two-group matching to three-group settings have been implemented in empirical research (Lu et al., 2012; Shi et al., 2016). These studies used optimal matching to generate pairs between a reference, or anchor, group (arbitrarily selected) and the other treatment groups. However, the distances between units in the non-anchor groups were not taken into account, and it is easy to construct examples where this approach performs poorly. Lopez and Gutman (2017) described a two-step algorithm that can be used to form matched sets on the basis of the generalized propensity score. First, the dimensionality of the matching problem is reduced by grouping subjects on the basis of a subset of the components of the propensity score, using clustering techniques. Then, subjects are matched within clusters on the remaining components of the propensity score. One of the main limitations of the algorithm is the necessity to trim the study sample, to create a good overlap of the distributions of the propensity score across treatment groups. This limits the interpretability and the generalizability of the results. Recently, Bennett et al. (2018) proposed a procedure to construct matched samples satisfying fine balance constraints in multiple-treatment designs. In the presence of binary treatments, computationally efficient fine-balance algorithms prioritize good marginal balance in the covariates over small within-pair distances in terms of covariates or propensity score (Zubizarreta, 2012).
To extend this approach to multiple-treatment designs, the authors proposed to match all the treatment groups
to a template sample, which is chosen to be similar to the ideal target population. Nevertheless, the distances within matched sets can be far from optimal, because the adopted two-group algorithm primarily targets marginal balance instead of small total distances. Other researchers have focused on algorithms that generate matched sets with less stringent structures, producing stratification-like designs. Sävje et al. (2017) recently proposed a computationally efficient algorithm that generalizes full matching to the multiple-treatment case. In order to make the solution to the problem feasible, the authors relax the classic full matching design, allowing the construction of matched sets with more than one subject from all treatment groups. Despite its computational efficiency, such designs tend to produce imbalanced matched sets. This complicates subsequent statistical analyses (Gu and Rosenbaum, 1993). Chapter 2 introduces a matching algorithm for the multiple-treatment case, which is designed to generate matched sets characterized by small total distance. The chapter also describes post-matching statistical analyses. The methodology is applied to a comparative study of mortality across trauma center levels, which motivated the methodological research discussed throughout the chapter.
1.5.2 Complex Survey Data
Population surveys are invaluable data sources for policy research. The representativeness of the target population is guaranteed by appropriate sampling designs. As introduced in Section 1.2, in complex survey designs the sample may not be selected by simple random sampling from the finite population. Common sampling methods include systematic, stratified and cluster sampling (Levy and Lemeshow, 2013). The appropriate method is generally chosen to guarantee that the sample will be representative of the finite population while minimizing the costs of the data
collection. Methods to infer causal effects in the sample, such as the sample ATE or ATT, are well established and discussed in Section 1.4. Little attention has been dedicated to the estimation of population and superpopulation effects when the study sample is selected with complex sampling designs, where researchers must take into account the survey design, the sampling weights and the observational nature of the data (Lenis et al., 2017). The first attempts to estimate population treatment effects in complex designs used heuristic methodology based on weighted estimators (Zanutto et al., 2005; Zanutto, 2006). The idea is to interpret the study sample as the result of two selection stages from the finite population, as represented in Figure 1.1: on one hand, the treatment selection, based on the individual probabilities of receiving the treatment (namely, the propensity score); on the other hand, the sample selection, based on the survey sampling probabilities, which are often known by design. The authors proposed to construct the overall probability underlying the two-stage selection as the product of the survey probability and the propensity score. Estimates of population treatment effects were generated using weights defined as the inverse of this probability. Formal methodological justifications of this family of estimators have been described in the literature (Wang et al., 2009; Ashmead, 2014; Ridgeway et al., 2015). However, previous studies are mainly confined to single-stage sampling designs. The only exception is the very recent work of Yang (2018), who described a weighted estimator of the population ATE in two-stage cluster sample surveys. Matched estimators have also been considered to estimate population treatment effects. Ashmead (2014), Austin et al. (2018) and Lenis et al. (2017) recently described simulation analyses investigating the performance of propensity score matching to estimate population effects in complex survey designs.
On the basis of the simulation results, the authors provide guidelines for matching designs. Despite the recent developments in the field, there is still no consensus on all aspects of causal inference methodology for complex survey designs. The method that should be used to estimate the propensity score is one central element on which previous studies disagree. Wang et al. (2009), Ashmead (2014) and Ridgeway et al. (2015) recommended estimating the propensity score model with weighted regression models. Yang (2018) proposed a complex algorithm to estimate a calibrated propensity score model, which is designed to provide good balance in the covariates between treatment groups. The author presents treatment effect estimators that use the calibrated propensity score, which is described as robust with respect to misspecification of the model form and with respect to unmeasured cluster-specific variables. Zanutto (2006), DuGoff et al. (2014) and Lenis et al. (2017) affirmed that incorporating survey weights in estimating the propensity score is not necessary, since the balancing property of the propensity score is only needed at the sample level. In the simulations carried out by Lenis et al. (2017), weighted and unweighted propensity score models performed similarly in the estimation of the treatment effect. Chapter 3 is devoted to an estimator of the population and superpopulation ATE for two-stage cluster sampling survey designs, which have received little attention in the literature. I describe a weighted estimator, which naturally combines the survey weights with the propensity score. I introduce the properties of this estimator and a comparison of its performance with competing methods. The role of survey weights in the estimation of the propensity score is a key factor that is evaluated in the simulation analysis.
The methodology is applied to the 2015 Medical Expenditure Panel Survey (MEPS) data, to quantify the causal effect of health insurance coverage on the decision to seek medical care after an injury.
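As a schematic illustration of the weighting idea discussed above, combining survey weights with inverse propensity score weights, consider the following Python sketch. It is a deliberately simplified, hypothetical version that ignores the clustering of the design and the estimation of the propensity score:

```python
import numpy as np

def population_ate(y, z, e, w):
    """Population-ATE estimate combining survey weights w (inverse
    sampling probabilities) with inverse propensity score weights:
    treated weight w/e(X), control weight w/(1 - e(X))."""
    y, z, e, w = (np.asarray(a, float) for a in (y, z, e, w))
    wt = w * z / e               # weights recreating the treated population
    wc = w * (1 - z) / (1 - e)   # weights recreating the control population
    return np.sum(wt * y) / np.sum(wt) - np.sum(wc * y) / np.sum(wc)

# Toy example: equal survey weights reduce this to the usual IPW estimator.
y = [3.0, 1.0, 4.0, 2.0]
z = [1, 0, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]
w = [10, 10, 10, 10]
print(population_ate(y, z, e, w))  # 2.0
```

The product weight w/e(X) (or w/(1 − e(X))) is exactly the inverse of the two-stage selection probability described above: the probability of being sampled times the probability of receiving the observed treatment.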
1.5.3 Generalized Intervention Effects
Figure 1.2 provides a graphical representation of the comparisons evaluated by the most traditional marginal causal effects, the ATE and the ATT, in an example with a binary treatment. The ATE compares the average outcome value in two counterfactual scenarios where all the subjects receive either of two treatment levels (panel (b)). For instance, in a cohort of patients eligible to receive two drugs, this effect may be used to determine the drug that is associated, on average, with better outcomes. The ATT evaluates a similar comparison, but focuses on the subjects that received the treatment Z = 1 (panel (c)). Policy makers, however, are often interested in a different type of effect, evaluating the impact of interventions that modify the distribution of the treatment in the target population. For instance, decision makers may be interested in quantifying the effect of a policy change that increases the proportion of treated subjects. In the study motivating my work in this area, state agencies are interested in evaluating how the preterm birth rate would change if it were possible to increase the proportion of women enrolling in smoking cessation programs among nicotine-dependent pregnant women. In particular, given the finite amount of resources to implement programs and interventions, stakeholders need to consider scenarios with partial modifications of the treatment status, as the possibility of increasing the proportion of the treated to 100% of the cohort is unrealistic in most applications. For example, in the context of the motivating study, suppose that only 5% of the nicotine-dependent pregnant women currently receive a particular type of smoking cessation treatment. The study team believes that the preterm birth rate could be reduced if a greater percentage of women could be convinced to receive this treatment. While it is very unlikely that 100% of women could be convinced to enroll, it might be possible to dedicate some
[Figure 1.2: Graphical representation of the causal comparisons evaluated by the ATE, ATT and IE. Panels: (a) observed cohort; (b) ATE, Z = 1 vs. Z = 0; (c) ATT, Z = 1 vs. Z = 0; (d) IE, a cohort with a modified treatment distribution vs. the observed cohort.]
additional resources in order to increase the enrollment in this smoking cessation program to, say, 10% or 20%. A reasonable question is what would be the impact on the preterm birth rate if the proportion of subjects receiving smoking cessation treatment could be increased by some specified amount. In such cases, the comparison between a counterfactual scenario with an increased proportion of treated subjects and the real, factual, cohort is the most informative contrast for policy makers (Figure 1.2, panel (d)). Such an effect has been referred to as a “generalized intervention effect” or “population intervention effect” (Ahern, 2016; Westreich, 2017). I will denote it simply as Intervention Effect (IE). It has been the
target of sporadic studies over the past three decades (Browner, 1986; Bulterys et al., 1997) and it has recently gained popularity, because it translates research efforts into valuable measures for policy makers (Ahern, 2016). Nevertheless, a formal definition of this effect within the potential outcome framework has not yet been provided in the literature. Only a few studies have provided guidance on the methodology to estimate the IE and, in most cases, they have adopted the parametric g-formula (Ahern et al., 2009; Westreich, 2014; Ahern et al., 2016). Both fixed-time and time-varying treatments have been considered in the literature. In the latter case, the estimation of the effect is more complex, as the intervention on the treatment Z may vary over time. Hence, it is necessary to deal with correlated outcome occurrences and to resort to longitudinal models for the outcome (Taubman et al., 2009; Westreich, 2014). Moreover, because interventions on time-varying treatments may have very complex effects on both future treatment status and time-varying covariates, these interventions have been studied under several simplifying assumptions (Westreich, 2014). Most of the research targeting the estimation of the IE has focused on scenarios with continuous exposures. A possible explanation is that, for quantitative treatments, it is easier to specify plausible interventions on the treatment Z without having to define, one by one, the units selected for such modification. For example, Ahern et al. (2016) described a study investigating the effect of alcohol outlet density on binge drinking. The authors considered modifications of the distribution of the quantitative treatment (alcohol outlet density) by setting pre-specified upper limits to the value of the treatment (e.g., 60 outlets per square mile). All the treatment values exceeding the pre-specified threshold were replaced with the value of the upper limit.
In this way, the study estimated the effect of a reduction in the maximum alcohol outlet density on the overall rate of binge drinking in the cohort
under study. The same strategy does not immediately translate to scenarios with categorical treatments. In this case, it is not possible to truncate the treatment value at pre-specified thresholds. In order to evaluate the impact of an intervention, researchers must specify the subset of subjects whose treatment levels are to be modified. This comes with an additional challenge whenever the effect of the treatment is heterogeneous: the impact of the intervention will depend on the selected subset. Westreich (2014) studied the estimation of the IE in a scenario where both treatment and outcome are binary. The author simulated several fictitious cohorts where, in each simulation, a constant proportion of control units was assigned to the treatment. The control units targeted by the modification of the treatment assignment were randomly selected. The effect estimated with this procedure corresponds to an average over simulated interventions. This approach is feasible, but computationally intensive. Moreover, this estimate might poorly predict the impact of an intervention modifying the treatment distribution in the cohort, if such an intervention does not target a representative sample of the cohort. Chapter 4 is devoted to the estimation of population intervention effects. I introduce a formal definition of the IE with the potential outcome framework and propose a simple estimator for the upper and lower bounds of this effect. I focus on scenarios with binary and categorical treatments, which have received little attention in previous research. I illustrate the proposed approach with a study investigating the effect of smoking cessation interventions on the number of preterm deliveries in a cohort of pregnant women.
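A minimal sketch in the spirit of the simulated-intervention strategy of Westreich (2014) described above, for a binary treatment and outcome. The function name and the array `risk_if_treated` are hypothetical; in practice the counterfactual risks of the switched units would come from a fitted outcome model:

```python
import numpy as np

def simulated_ie(y, z, risk_if_treated, prop_switched, n_sim=1000, seed=0):
    """Monte Carlo estimate of a population intervention effect: in each
    replicate, a random fraction of control units is switched to treatment
    and their outcomes are replaced by (assumed known) risks under
    treatment; the IE is the average change in the outcome mean relative
    to the factual cohort, averaged over simulated interventions."""
    rng = np.random.default_rng(seed)
    y, z = np.asarray(y, float), np.asarray(z, int)
    risk_if_treated = np.asarray(risk_if_treated, float)
    controls = np.flatnonzero(z == 0)
    n_switch = int(round(prop_switched * len(controls)))
    diffs = []
    for _ in range(n_sim):
        switched = rng.choice(controls, size=n_switch, replace=False)
        y_new = y.copy()
        y_new[switched] = risk_if_treated[switched]  # counterfactual risk
        diffs.append(y_new.mean() - y.mean())
    return float(np.mean(diffs))

# Toy cohort: 2 treated, 4 untreated; assume (hypothetically) that the
# outcome risk under treatment is 0 for everyone.
y = np.array([0, 0, 1, 1, 1, 1], dtype=float)
z = np.array([1, 1, 0, 0, 0, 0])
risk = np.zeros(6)
print(round(simulated_ie(y, z, risk, prop_switched=0.5, n_sim=200), 4))
# -0.3333: switching half the controls removes 2 events out of 6 subjects
```

The Monte Carlo loop illustrates the computational burden noted above, and the random selection of switched controls illustrates why the estimate may be misleading when the actual intervention would not target a representative subset of the cohort.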
Chapter 2 Multiple Treatment Groups
Matching has unique advantages over other approaches to estimate causal effects in observational studies. However, this methodology is rarely used in the presence of multiple treatment groups, partially because of the limitations of the available algorithms. This chapter introduces a new matching algorithm for observational studies with multiple-treatment designs, aiming to create matched sets characterized by small total distance. The chapter is organized as follows. The new matching algorithm is described in Section 2.1. Section 2.2 discusses a strategy to conduct post-matching outcome analyses. A simulation study comparing the performance of the proposed algorithm with the principal competing method, the NN algorithm, is described in Section 2.3. Section 2.4 describes a comparative study of mortality across trauma center levels, which motivated the methodological research discussed throughout the chapter.
2.1 Conditionally Optimal Matching Algorithm
2.1.1 Algorithm Setup
The goal of the proposed matching algorithm is to identify matched samples characterized by small total distance. The algorithm is structured in two main steps. First, a starting matched sample is generated. I suggest one specific solution to construct this starting point, even though, for this purpose, existing matching algorithms might be employed as well. The second step involves an iterative procedure, which explores improvements in the quality of matching. At each iteration, a subset of L of the K treatment groups is selected and each K-tuple is split into two matched sets: one L-tuple from the L selected groups and one (K − L)-tuple from the remaining groups. In other words, this process relaxes the links between subjects from the L groups and the remaining groups. Then, the two sets of fixed L-tuples and (K − L)-tuples are rematched, using the optimal bipartite algorithm. The process is iterated until the total distance cannot be reduced further. In particular, a measure of within-K-tuple dissimilarity is used to quantify the quality of matching. The algorithm takes advantage of the optimal solution to the two-group matching problem, which can be found in polynomial time. Because the algorithm iteratively matches two families of fixed matched sets and the optimality is achieved conditioning on a partially matched structure, we refer to it as conditionally optimal. The setup of the algorithm is formally introduced in this section in the general scenario of K treatment groups. Section 2.1.2 presents the algorithm in the case where K = 3. In this case there is only one possible way to split the existing matched sets: L = 1, K − L = 2. Extensions to designs with K > 3 are discussed in Section 2.1.3.
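The iterative structure just described can be illustrated, for three groups, with the short Python sketch below. It uses an off-the-shelf assignment solver as the optimal bipartite step and matches scalar scores in equally sized groups; it is a simplified illustration of the idea, not the algorithm formalized in the following sections:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def perimeter(a, b, c):
    """Triangle-perimeter distance for a triplet of scalar scores."""
    return abs(a - b) + abs(a - c) + abs(b - c)

def rematch(fixed_pairs, free_scores):
    """Optimally rematch the free group's units to the fixed pairs,
    minimizing the total triplet perimeter (optimal bipartite step)."""
    cost = np.array([[perimeter(a, b, c) for c in free_scores]
                     for (a, b) in fixed_pairs])
    rows, cols = linear_sum_assignment(cost)
    return cols, cost[rows, cols].sum()

# Toy example: three groups of three units each, scored (say) by one
# component of the generalized propensity score.
g1, g2, g3 = [0.2, 0.5, 0.8], [0.25, 0.55, 0.75], [0.1, 0.6, 0.9]

# Step 1.1-1.2: match groups 1-2 optimally, then attach group 3
# to the resulting fixed pairs with a second optimal bipartite match.
c12 = np.array([[abs(a - b) for b in g2] for a in g1])
r, c = linear_sum_assignment(c12)
pairs = [(g1[i], g2[j]) for i, j in zip(r, c)]
idx3, total = rematch(pairs, g3)
print(idx3, round(total, 3))
```

In the full algorithm, subsequent iterations would repeatedly fix the pairs from two of the groups and rematch the third, keeping the new triplets whenever the total distance decreases.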
Denote by $n_z$ the size of treatment group $z \in \mathcal{Z}$ in the study sample. The proposed algorithm constructs a matched sample with $S = \min_{z \in \mathcal{Z}} \{n_z\}$ matched sets, with one subject per treatment group. Without loss of generality, suppose that the first group is the smallest, i.e., $S = n_1$. Let $I$ be the index set of the units in the first treatment group (the smallest) and let $J_1, \ldots, J_{K-1}$ be the index sets of the units in the other $K - 1$ treatment groups. The algorithm is based on a distance, which measures differences within K-tuples in terms of the matching variables, which may be the propensity score vector or the covariates. For multiple-treatment matching, a K-dimensional distance metric must be defined. Let $d^K(i, j_1, \ldots, j_{K-1})$ be the distance within the K-tuple involving units $\{i, j_1, \ldots, j_{K-1}\}$. I focus on distances of the form
$$d^K(i, j_1, \ldots, j_{K-1}) = \sum_{z=1}^{K-1} d^2(i, j_z) + \sum_{1 \le a < b \le K-1} d^2(j_a, j_b), \qquad (2.1)$$

where $d^2(\cdot, \cdot)$ is a two-way distance between units, for example $d^2(j_a, j_b) = \| e(X_{j_a}) - e(X_{j_b}) \|_2$, where $\| \cdot \|_2$ is the Euclidean norm. In the case of three treatment groups, the three-way distance $d^3(i, j_1, j_2)$ induced by this choice corresponds to the perimeter of the triangle defined by the points $e(X_i)$, $e(X_{j_1})$ and $e(X_{j_2})$.

The matched sample is denoted by $\mathcal{M} = \{(i, j_1(i), \ldots, j_{K-1}(i))\}_{i \in I}$, a collection of $S$ K-tuples, where the index $j_z(i)$ identifies the subject from group $z$ matched to subject $i$, with $i \in I$ and $j_z(i) \in J_z$. The total distance associated with the matched sample $\mathcal{M}$ is defined as

$$D(\mathcal{M}) = \sum_{i \in I} d^K(i, j_1(i), \ldots, j_{K-1}(i)), \qquad (2.2)$$

i.e., the sum of the distances within K-tuples.

2.1.2 Matching Algorithm for Three Treatment Groups

The iterative procedure introduced in Section 2.1.1 can be implemented in a single way in the case of K = 3 treatment groups. At each step, one treatment group (L = 1) is selected and the connection of the existing triplets to the selected group is relaxed. Subjects from the selected group are then optimally rematched to the fixed pairs of the remaining K − L = 2 groups.

To generate the starting matched sample, I propose a simple procedure that uses two two-group matching steps. A formal description of the algorithm is provided by the following points:

Step 1: Generate the starting matched sample.

Step 1.1: Select two treatment groups and match them with the optimal two-group matching procedure. Without loss of generality, label these two groups as 1 and 2 and the remaining group as 3.

Step 1.2: Optimally match subjects from group 3 to the 1-2 pairs defined in Step 1.1. Let $\mathcal{M}^{(0)}_{1,2}$ be this set of initial matched triplets and let $D(\mathcal{M}^{(0)}_{1,2})$ be the total distance associated with the constructed matched sample. The subscript "1,2" emphasizes the fact that the matching is conditional on the fixed 1-2 pairs.

Step 2: Explore potential reductions of the total distance with conditional iterations.
For each $n \ge 1$, consider the matched set $\mathcal{M}^{(n-1)}_{z_1,z_2}$ and the associated total distance $D(\mathcal{M}^{(n-1)}_{z_1,z_2})$, resulting from the previous iteration. Repeat the following steps:

Step 2.1: Fix the $z_2$-$z_3$ pairs within the triplets $\mathcal{M}^{(n-1)}_{z_1,z_2}$ and optimally rematch such pairs with the subjects in group $z_1$.

Step 2.2: Fix the $z_1$-$z_3$ pairs within the triplets $\mathcal{M}^{(n-1)}_{z_1,z_2}$ and optimally rematch such pairs with the subjects in group $z_2$.

Step 2.3: Let $\mathcal{M}^{(n)}_{z_2,z_3}$ and $\mathcal{M}^{(n)}_{z_1,z_3}$ be the matched sets generated at Steps 2.1 and 2.2, and let $D(\mathcal{M}^{(n)}_{z_2,z_3})$ and $D(\mathcal{M}^{(n)}_{z_1,z_3})$ be their respective total distances. If both $D(\mathcal{M}^{(n)}_{z_2,z_3})$ and $D(\mathcal{M}^{(n)}_{z_1,z_3})$ are greater than $D(\mathcal{M}^{(n-1)}_{z_1,z_2})$, stop the iterations: the new matched sets do not decrease the total distance. Otherwise, select the matched sample corresponding to the smallest total distance.

Figure 2.1 provides a graphical representation of the first step of the algorithm. At each iteration, the procedure explores a potential reduction in the total distance by changing the two groups whose pairs are fixed.

[Figure 2.1: First step of the conditionally optimal matching algorithm in a three-group design. First, groups 1 and 2 are optimally matched (Step 1.1). Second, subjects in group 3 are optimally matched to the pairs formed in Step 1.1 (Step 1.2).]

There is no guarantee that the algorithm converges to the global optimum, i.e., the matched sample attaining the minimum total distance. However, by design, each iteration cannot decrease the quality of matching. This is shown in Proposition 2.1, which proves that the total distance of the solution cannot be larger than the total distance of the starting matched sample.

Proposition 2.1. Given any starting triplet match $\mathcal{M}_0$, the conditionally optimal matching algorithm will produce a new match, $\mathcal{M}_{CO}$, with total distance no larger than the initial one, i.e., $D(\mathcal{M}_{CO}) \le D(\mathcal{M}_0)$.

Proof.
The total distance of the matched sample $\mathcal{M}_0 = \{(i, j_1(i), j_2(i))\}_{i \in I}$ is $D(\mathcal{M}_0) = \sum_{i \in I} d^3(i, j_1(i), j_2(i))$. When applying the conditionally optimal matching algorithm, we first fix one edge of the triplet, say between groups 1 and 2. We then try to find a new set of subjects from group 3 to minimize the total distance with the fixed $\{(i, j_1(i))\}_{i \in I}$ pairs:

$$\arg\min_{j_2'(i) \in J_2} \sum_{i \in I} d^3(i, j_1(i), j_2'(i)).$$

This becomes a two-group matching problem, where the goal is to match the pairs $\{(i, j_1(i))\}_{i \in I}$ to the subjects in group 3. Define the two-way distance between the pair $(i, j_1)$ and subject $j_2$ as

$$d^2((i, j_1), j_2) = d^3(i, j_1, j_2).$$

The optimal bipartite matching algorithm can be used to identify the optimal solution to this problem. That is, we can identify subjects $\{j_2^*(i)\}_{i \in I}$ from group 3 such that $\sum_{i \in I} d^2((i, j_1(i)), j_2^*(i)) \le \sum_{i \in I} d^2((i, j_1(i)), j_2'(i))$ for any choice $\{j_2'(i)\}_{i \in I}$. In particular, $\sum_{i \in I} d^2((i, j_1(i)), j_2^*(i)) \le \sum_{i \in I} d^2((i, j_1(i)), j_2(i))$.

Denote the new triplet match $\{(i, j_1(i), j_2^*(i))\}_{i \in I}$ by $\mathcal{M}^{(1)}_{1,2}$. Notably,

$$D(\mathcal{M}^{(1)}_{1,2}) = \sum_{i \in I} d^3(i, j_1(i), j_2^*(i)) = \sum_{i \in I} d^2((i, j_1(i)), j_2^*(i)) \le \sum_{i \in I} d^2((i, j_1(i)), j_2(i)) = \sum_{i \in I} d^3(i, j_1(i), j_2(i)) = D(\mathcal{M}_0).$$

Therefore, we have $D(\mathcal{M}^{(1)}_{1,2}) \le D(\mathcal{M}_0)$. This result applies to each iteration of the algorithm and proves that the total distance cannot increase at any iteration. Therefore, denoting by $\mathcal{M}_{CO}$ the final matched set, we have $D(\mathcal{M}_{CO}) \le D(\mathcal{M}_0)$.

The proposition has two implications. First, even though the algorithm does not necessarily converge to the globally optimal solution, the iterations will end in a local optimum, where relaxing the connection to each group and optimally rematching it to the remaining pairs cannot reduce the total distance further. Second, the final result potentially depends on the arbitrary choice of the two treatment groups matched in Step 1.1.
To obtain the best result, the procedure can be applied three times, once for each starting combination, and the matched sample with the smallest total distance selected.

Step 1 describes a simple approach to generate the starting matched sample. However, the algorithm is very flexible and the initializing set of triplets can be constructed with any matching procedure. This allows the use of the proposed algorithm to explore potential improvements upon the result of any existing three-way matching algorithm. For example, the result of the NN procedure can be used as the starting point and the conditionally optimal algorithm can be used to search for possible reductions in the total distance. Proposition 2.1 guarantees that the resulting matched sample cannot be worse than the NN solution.

The solution of the proposed algorithm has another appealing property. Even though the algorithm might not converge to the global optimum, the total distance of the solution is bounded above by twice the total distance of the optimal solution. Proposition 2.2 describes this property.

Proposition 2.2. Let $\{(i, j_1^{\text{2-opt}}(i))\}_{i \in I}$ and $\{(i, j_2^{\text{2-opt}}(i))\}_{i \in I}$ be the optimal pairs resulting from the optimal two-group matching between groups 1-2 and 1-3, respectively, and let $M_{\text{OPT}} = \{(i, j_1^{\text{opt}}(i), j_2^{\text{opt}}(i))\}_{i \in I}$ be the optimal set of triplets. Then:
$$D(M_{\text{CO}}) \leq D(M_{\text{OPT}}) + \min\left\{\sum_{i \in I} d^2(j_1^{\text{2-opt}}(i), j_1^{\text{opt}}(i)),\; \sum_{i \in I} d^2(j_2^{\text{2-opt}}(i), j_2^{\text{opt}}(i))\right\} \leq 2\, D(M_{\text{OPT}}).$$
A similar formulation of the inequality is:
$$D(M_{\text{CO}}) \leq D(M_{\text{OPT}}) + \frac{1}{2}\left[\sum_{i \in I} d^2(j_1^{\text{2-opt}}(i), j_1^{\text{opt}}(i)) + \sum_{i \in I} d^2(j_2^{\text{2-opt}}(i), j_2^{\text{opt}}(i))\right] \leq 2\, D(M_{\text{OPT}}).$$

Proof. Consider the set $M_{1,2}^{(0)} = \{(i, j_1^{\text{2-opt}}(i), j_2^*(i))\}_{i \in I}$, generated as the first iteration of the three-way conditionally optimal algorithm. In particular, the elements $\{j_2^*(i)\}_{i \in I}$ from the third group are chosen to minimize the total distance between the pairs $\{(i, j_1^{\text{2-opt}}(i))\}_{i \in I}$ and the elements of the third group.
By the distance shortening property, we have:
$$D(M_{\text{CO}}) \leq D(M_{1,2}^{(0)}) = \sum_{i \in I} \left[d^2(i, j_1^{\text{2-opt}}(i)) + d^2(i, j_2^*(i)) + d^2(j_1^{\text{2-opt}}(i), j_2^*(i))\right].$$
Because the $\{j_2^*(i)\}_{i \in I}$ minimize the total distance given the pairs $\{(i, j_1^{\text{2-opt}}(i))\}_{i \in I}$, the total distance of the triplets $\{(i, j_1^{\text{2-opt}}(i), j_2^{\text{opt}}(i))\}_{i \in I}$ is larger than $D(M_{1,2}^{(0)})$:
$$\sum_{i \in I} \left[d^2(i, j_1^{\text{2-opt}}(i)) + d^2(i, j_2^*(i)) + d^2(j_1^{\text{2-opt}}(i), j_2^*(i))\right] \tag{2.3}$$
$$\leq \sum_{i \in I} \left[d^2(i, j_1^{\text{2-opt}}(i)) + d^2(i, j_2^{\text{opt}}(i)) + d^2(j_1^{\text{2-opt}}(i), j_2^{\text{opt}}(i))\right]. \tag{2.4}$$
Moreover, since the pairs $\{(i, j_1^{\text{2-opt}}(i))\}_{i \in I}$ are optimal:
$$\sum_{i \in I} d^2(i, j_1^{\text{2-opt}}(i)) \leq \sum_{i \in I} d^2(i, j_1^{\text{opt}}(i)). \tag{2.5}$$
Using this result and the triangular inequality $d^2(j_1^{\text{2-opt}}(i), j_2^{\text{opt}}(i)) \leq d^2(j_1^{\text{2-opt}}(i), j_1^{\text{opt}}(i)) + d^2(j_1^{\text{opt}}(i), j_2^{\text{opt}}(i))$ on the last component of Equation (2.4):
$$D(M_{\text{CO}}) \leq \sum_{i \in I} \left[d^2(i, j_1^{\text{2-opt}}(i)) + d^2(i, j_2^{\text{opt}}(i)) + d^2(j_1^{\text{2-opt}}(i), j_2^{\text{opt}}(i))\right]$$
$$\leq \sum_{i \in I} \left[d^2(i, j_1^{\text{opt}}(i)) + d^2(i, j_2^{\text{opt}}(i)) + d^2(j_1^{\text{opt}}(i), j_2^{\text{opt}}(i)) + d^2(j_1^{\text{2-opt}}(i), j_1^{\text{opt}}(i))\right]$$
$$= D(M_{\text{OPT}}) + \sum_{i \in I} d^2(j_1^{\text{2-opt}}(i), j_1^{\text{opt}}(i)). \tag{2.6}$$
Using analogous inequalities starting from the set of pairs $\{(i, j_2^{\text{2-opt}}(i))\}_{i \in I}$, it is possible to show the following result:
$$D(M_{\text{CO}}) \leq D(M_{\text{OPT}}) + \sum_{i \in I} d^2(j_2^{\text{2-opt}}(i), j_2^{\text{opt}}(i)). \tag{2.7}$$
From Equations (2.6) and (2.7), we have:
$$D(M_{\text{CO}}) \leq D(M_{\text{OPT}}) + \min\left\{\sum_{i \in I} d^2(j_1^{\text{2-opt}}(i), j_1^{\text{opt}}(i)),\; \sum_{i \in I} d^2(j_2^{\text{2-opt}}(i), j_2^{\text{opt}}(i))\right\}.$$
Summing Equations (2.6) and (2.7), we have the second inequality:
$$D(M_{\text{CO}}) \leq D(M_{\text{OPT}}) + \frac{1}{2}\left[\sum_{i \in I} d^2(j_1^{\text{2-opt}}(i), j_1^{\text{opt}}(i)) + \sum_{i \in I} d^2(j_2^{\text{2-opt}}(i), j_2^{\text{opt}}(i))\right].$$
If $\sum_{i \in I} d^2(j_1^{\text{2-opt}}(i), j_1^{\text{opt}}(i)) \leq D(M_{\text{OPT}})$ and $\sum_{i \in I} d^2(j_2^{\text{2-opt}}(i), j_2^{\text{opt}}(i)) \leq D(M_{\text{OPT}})$, the proof is complete, because
$$D(M_{\text{CO}}) \leq D(M_{\text{OPT}}) + \min\left\{D(M_{\text{OPT}}), D(M_{\text{OPT}})\right\} = 2\, D(M_{\text{OPT}}),$$
and
$$D(M_{\text{CO}}) \leq D(M_{\text{OPT}}) + \frac{1}{2}\left[D(M_{\text{OPT}}) + D(M_{\text{OPT}})\right] = 2\, D(M_{\text{OPT}}).$$
We show that $\sum_{i \in I} d^2(j_1^{\text{2-opt}}(i), j_1^{\text{opt}}(i)) \leq D(M_{\text{OPT}})$ using triangular inequalities and the result in Equation (2.5). The proof that $\sum_{i \in I} d^2(j_2^{\text{2-opt}}(i), j_2^{\text{opt}}(i)) \leq D(M_{\text{OPT}})$ is analogous.
$$\sum_{i \in I} d^2(j_1^{\text{2-opt}}(i), j_1^{\text{opt}}(i)) \leq \sum_{i \in I} \left[d^2(j_1^{\text{2-opt}}(i), j_2^{\text{opt}}(i)) + d^2(j_2^{\text{opt}}(i), j_1^{\text{opt}}(i))\right]$$
$$\leq \sum_{i \in I} \left[d^2(i, j_1^{\text{2-opt}}(i)) + d^2(i, j_2^{\text{opt}}(i)) + d^2(j_2^{\text{opt}}(i), j_1^{\text{opt}}(i))\right]$$
$$\leq \sum_{i \in I} \left[d^2(i, j_1^{\text{opt}}(i)) + d^2(i, j_2^{\text{opt}}(i)) + d^2(j_2^{\text{opt}}(i), j_1^{\text{opt}}(i))\right] = D(M_{\text{OPT}}).$$

2.1.3 Extensions to More than Three Treatment Groups

The same general idea of the algorithm can be extended to designs with more than three treatment groups. However, before moving to the description of this extension, it is important to note that the advantages of distance-based matching procedures scale poorly to designs involving a very large number of treatments. Suppose that the vector of the covariates is high-dimensional, as is the case in most practical applications. In such designs, distance-based matching algorithms rely on the dimensionality reduction property of the propensity score methodology (see Section 1.4.2).
However, when $K$ increases, the dimension of the propensity score increases as well, and the matching problem takes place in a space that becomes increasingly sparse. In this case, identifying subjects with similar values of the propensity score is a problem that suffers from the "curse of dimensionality" (Linden et al., 2016).

Confining the plausibility of matched designs to small-to-moderate numbers of treatment groups, there are multiple possible implementations of the general idea of our algorithm, because there are multiple ways to split the $K$-tuples at each step. For example, with $K = 4$ treatment groups, the matched sets could be split either into two sets of pairs or into one set of triplets and one set of singletons. The number of possible strategies increases with $K$. For $K = 5$, matched sets could be split into singletons and quadruplets, into pairs and triplets, or into three groups (two sets of pairs and one set of singletons). Among the many possible strategies, the approach that most easily generalizes to any value of $K$ is to split the matched sets into units from one treatment group and $(K-1)$-tuples from the remaining $K-1$ treatment groups. At each step, the selected group rotates among the $K$ groups, to explore possible reductions in the total distance.

The iterative portion of the conditionally optimal algorithm (i.e., Step 2) is therefore extended to the general case of $K$ treatment groups by the following procedure:

Step 2: For each $n \geq 1$, consider the matched set $M^{(n-1)}$ and the associated total distance $D(M^{(n-1)})$ resulting from the previous iteration. Repeat the following steps:

Step 2.1: For $s$ from 1 to $K$, relax the connection of the matched sample $M^{(n-1)}$ to treatment group $s$ and rematch the resulting $(K-1)$-tuples to group $s$, using the optimal bipartite result. Denote the generated matched sample with $M_s^{(n)}$.