Causal Inference in a 2^2 Factorial Design Using Generalized Propensity Score

By Matilda Nilsson
Department of Statistics, Uppsala University
Supervisors: Johan Lyhagen and Ronnie Pingel
2013

Abstract

When estimating causal effects, typically one binary treatment is evaluated at a time. This thesis aims to extend the causal inference framework using the potential outcomes scheme to a situation in which it is of interest to simultaneously estimate the causal effects of two treatments, as well as their interaction effect. The model proposed is a 2^2 factorial model, where two methods have been used to estimate the generalized propensity score to assure unconfoundedness of the estimators. The main focus is on the inverse probability weighting (IPW) estimator and the doubly robust (DR) estimator for causal effects. An estimator based on linear regression is also included. A Monte Carlo simulation study is performed to evaluate the proposed estimators under both constant and variable treatment effects. Furthermore, an empirical application is conducted. The empirical application is an assessment of the causal effects of two social factors (parents’ educational background and students’ Swedish background) on average grades for ninth graders in Swedish compulsory schools. The data are from 2012 and are measured at school level. The results show that the IPW and DR estimators produce unbiased estimates for both constant and variable treatment effects, while the estimator based on linear regression is biased when treatment effects vary.

Keywords: Potential outcomes, two treatments, inverse probability weighting estimator, doubly robust estimator.

Contents

1 Introduction
2 The Causal Inference Framework
3 Causal Inference in a 2^2 Factorial Design
  3.1 Estimators for the Average Treatment Effect
  3.2 Models for Multivalued Treatment Assignments
4 Simulation Study
  4.1 Simulation Setup
  4.2 Results from the Simulation Study
5 Empirical Study
  5.1 Data
  5.2 Results from Empirical Study
6 Conclusion
References
Appendix A Tables and Graphs
Appendix B Estimators

1 Introduction

The modern approach to causal inference in observational studies began to develop in the early 1970s, foremost through the work of Donald B. Rubin. What Rubin proposed was a framework for estimating average causal effects, commonly known as the Rubin Causal Model (RCM) (Rubin, 1974). It builds on the concept of potential outcomes in randomized experiments, first formulated by Neyman (1923). Of main interest is whether or not a treatment of some sort has a causal effect on an outcome. Treatment in this case refers to a factor and should be interpreted in a broader sense than merely a medical treatment or similar. When units are randomly assigned to the treatment groups there is no reason to believe that the units in the groups differ systematically from each other in aspects other than the treatment status. It is then straightforward to assess the effect of the treatment by comparing the groups, often by comparing the group means (Imbens and Wooldridge, 2008).

In many sciences, however, such as the social sciences and economics, the units are often individuals, and it is seldom feasible to construct a randomized experiment for ethical, practical, economic or other reasons, as Imbens and Wooldridge (2008) point out. Nevertheless, it is often desirable to evaluate the effect of treatments such as labor market policies, educational programs and other educational policies.

The causal inference framework has mainly focused on the case where there is only one treatment to evaluate, with extensions to longitudinal settings and to a single multivalued treatment. However, in both experimental and non-experimental designs, the researcher might be interested in evaluating two treatments simultaneously. One motivation for this is to see whether or not they interact.
In experimental settings this is often formulated as a factorial design, in which the causal effects of two (or more) treatments with two (or more) levels are estimated. This gives a main effect for each treatment, as well as interaction effects between the factors. Dasgupta et al. (2012) propose an extension of the RCM to 2^k designs by defining factorial effects in terms of potential outcomes in an experimental setting. However, they do not propose estimators for factorial experiments with covariates nor for observational studies; such estimators have not yet been developed for the non-experimental setting.

The aim of this thesis is to extend the causal inference framework for observational studies using the potential outcomes scheme to a situation in which it is of interest to simultaneously estimate the causal effects of two treatments as well as their interaction effect. This is done using a 2^2 factorial model. The estimators assessed are based on linear regression (OLS) and inverse probability weighting (IPW). Also included is a doubly robust (DR) estimator that combines techniques from the former two. These estimators are chosen since they are commonly used within the causal inference framework for single-treatment studies; see for instance Imbens (2004) and Lunceford and Davidian (2004). In the single-treatment case, the latter two estimators are conditioned on the propensity score to assure unconfoundedness. For the two-treatment case proposed here, a generalized propensity score is used for this purpose. Hence the question of how the generalized propensity score should be estimated is also of importance. Here, the multinomial and the nested logit models are considered; see for example Imbens (2000) and Tchernis et al. (2005).

The estimators are assessed in terms of bias and mean squared error (MSE), under both constant and variable treatment effects. For this purpose a Monte Carlo simulation study is performed.
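
To make the idea of weighting by a generalized propensity score concrete, the following is a minimal sketch rather than the thesis's own implementation. It simulates a 2^2 observational design with one confounder and compares normalized IPW estimates of the factorial effects with naive group-mean contrasts. Several things here are assumptions made for illustration: the covariate `x`, the slopes in `SLOPES`, the outcome model, the use of the true generalized propensity score (the thesis estimates it with multinomial or nested logit models), the normalized form of the weights, and the scaling of the main and interaction effects as half-contrasts of the four group means, which is one common convention for 2^2 designs and may differ from the definitions used later in the thesis.

```python
import math
import random

GROUPS = [(0, 0), (0, 1), (1, 0), (1, 1)]   # (z1, z2) treatment combinations
SLOPES = [0.0, 0.4, 0.8, 1.2]               # hypothetical multinomial-logit slopes on x


def gps(x):
    """True generalized propensity score r_g(x) for the four groups (softmax)."""
    expo = [math.exp(b * x) for b in SLOPES]
    total = sum(expo)
    return [e / total for e in expo]


def simulate(n, rng):
    """Draw (x, group index, observed outcome); assignment depends on x."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        g = rng.choices(range(4), weights=gps(x))[0]
        z1, z2 = GROUPS[g]
        # Assumed constant effects: E[Y(z1, z2)] = 1 + 2*z1 + 1*z2 + 0.5*z1*z2
        y = 1 + 2 * z1 + z2 + 0.5 * z1 * z2 + x + rng.gauss(0.0, 1.0)
        data.append((x, g, y))
    return data


def group_means(data, weighted):
    """Estimate E[Y(g)] per group: normalized IPW if weighted, else raw means."""
    num, den = [0.0] * 4, [0.0] * 4
    for x, g, y in data:
        w = 1.0 / gps(x)[g] if weighted else 1.0
        num[g] += w * y
        den[g] += w
    return [a / b for a, b in zip(num, den)]


def factorial_effects(mu):
    """Main effects and interaction as half-contrasts of the four group means."""
    m00, m01, m10, m11 = mu
    tau1 = 0.5 * ((m11 - m01) + (m10 - m00))
    tau2 = 0.5 * ((m11 - m10) + (m01 - m00))
    tau12 = 0.5 * ((m11 - m01) - (m10 - m00))
    return tau1, tau2, tau12


rng = random.Random(2013)
data = simulate(20000, rng)
ipw = factorial_effects(group_means(data, weighted=True))
naive = factorial_effects(group_means(data, weighted=False))
print("true  :", (2.25, 1.25, 0.25))   # implied by the assumed outcome model
print("IPW   :", tuple(round(v, 3) for v in ipw))
print("naive :", tuple(round(v, 3) for v in naive))
```

Because `x` raises both the outcome and the probability of receiving either treatment, the naive contrasts for the two main effects should be pulled away from the true values, while the weighted contrasts should land close to them. Repeating the simulation over many seeds and averaging the errors and squared errors would give Monte Carlo estimates of bias and MSE of the kind described above.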
For completeness, both non-random and completely random treatment assignment mechanisms are included to highlight similarities and differences between causal inference in observational and randomized studies. The results show that the IPW and DR estimators produce unbiased estimates of the treatment effects both when treatment effects are constant and when they vary across individuals. The estimator based on OLS, however, only produces unbiased estimates when treatment effects are constant, since the method cannot take variable effects into account.

The use of the model and methods is illustrated using data from Swedish compulsory schools. The data are from 2012 and were collected by the Swedish National Agency for Education. The observations are measured at school level. The first treatment is a factor based on the proportion of students whose parents have higher (tertiary) education. The second factor is based on the proportion of students with Swedish background. In these data, students born in Sweden with at most one parent born elsewhere are defined as having Swedish background. The dichotomization of the variables is discussed in Section 5. The outcome in the study is the average grade of the ninth graders in each school. The results indicate that parents’ educational background has a large positive effect on students’ average grades, while the effect of the students’ background is close to zero; it is insignificant for all estimators except the estimator based on linear regression. The interaction effect is, somewhat surprisingly, negative. It is small but significant. The confidence intervals are 95% bootstrap percentile intervals.

The thesis is structured as follows. The theory section is divided into two parts, Section 2 and Section 3. In Section 2 the framework of causal inference is presented, as well as the theory of conditioning on propensity scores to assure unconfoundedness.
In Section 3 the 2^2 factorial design is specified as a generalized potential outcomes model. Furthermore, the estimators to be assessed are specified here. In Section 4 the simulation study is outlined and its results are presented, and Section 5 contains the empirical study. The conclusions are discussed in Section 6.

2 The Causal Inference Framework

In this part of the theory section the framework of causal inference in observational studies is presented, as well as a formulation of the estimands of interest. The focus lies on the treatment assignment mechanism and identification, and a short introduction to the propensity score and its function within the framework is given.

The idea of causal inference in observational studies is drawn from classical randomized experiments, in which it is possible to obtain estimators for the average effect of the treatment, e.g. the difference in means by treatment status. In the case where the treatment has two levels (often "treatment" or "control") this implies a comparison between the two outcomes for the same unit under both treatment and no treatment. However, in observational studies it is not possible to observe the outcome for the same unit under both treatment and no treatment (Holland, 1986). Instead, in practice, each individual can be exposed to only one level of the treatment, and thus we can only observe one of the outcomes. Holland (1986) calls this the fundamental problem of causal inference. As opposed to an experimental setting, since treatment generally cannot be randomly assigned in observational studies, individuals are self-selected into different treatment regimes. This might lead to systematic differences, which can affect the outcomes