Causal Inference in a $2^2$ Factorial Design Using Generalized Propensity Score

By Matilda Nilsson

Department of Statistics, Uppsala University

Supervisors: Johan Lyhagen and Ronnie Pingel

2013

Abstract

When estimating causal effects, typically one binary treatment is evaluated at a time. This thesis aims to extend the causal inference framework using the potential outcomes scheme to a situation in which it is of interest to simultaneously estimate the causal effects of two treatments, as well as their interaction effect. The model proposed is a $2^2$ factorial model, where two methods have been used to estimate the generalized propensity score to assure unconfoundedness of the estimators. The main focus is on the inverse probability weighting (IPW) estimator and the doubly robust (DR) estimator for causal effects. An estimator based on linear regression is also included. A Monte Carlo simulation study is performed to evaluate the proposed estimators under both constant and variable treatment effects. Furthermore, an application to an empirical study is conducted. The empirical application is an assessment of the causal effects of two social factors (parents' educational background and students' Swedish background) on average grades for ninth graders in Swedish compulsory schools. The data are from 2012 and are measured on school level. The results show that the IPW and DR estimators produce unbiased estimates for both constant and variable treatment effects, while the estimator based on linear regression is biased when treatment effects vary.

Keywords: Potential outcomes, two treatments, inverse probability weighting estimator, doubly robust estimator.

Contents

1 Introduction

2 The Causal Inference Framework

3 Causal Inference in a $2^2$ Factorial Design
3.1 Estimators for the Average Treatment Effect
3.2 Models for Multivalued Treatment Assignments

4 Simulation Study
4.1 Simulation Setup
4.2 Results from the Simulation Study

5 Empirical Study
5.1 Data
5.2 Results from Empirical Study

6 Conclusion

References

Appendix A Tables and Graphs

Appendix B Estimators

1 Introduction

The modern approach to causal inference in observational studies started to develop in the early 1970s, foremost through the work of Donald B. Rubin. Rubin proposed a framework for estimating average causal effects, commonly known as the Rubin Causal Model (RCM) (Rubin, 1974). It builds on the concept of potential outcomes in randomized experiments, first formulated by Neyman (1923). The main interest is in finding whether or not a treatment of some sort has a causal effect on an outcome. Treatment in this case refers to a factor and should be interpreted in a broader sense than merely a medical treatment or similar. When units are randomly assigned to the treatment groups, there is no reason to believe that the units in the groups differ systematically from each other in aspects other than treatment status. It is then straightforward to compare the groups, often by comparing the group means, to assess the effect of the treatment (Imbens and Wooldridge, 2008). In many sciences, however, such as the social sciences and economics, the units are often individuals, as Imbens and Wooldridge (2008) point out, and it is seldom feasible to construct a randomized experiment for ethical, practical, economic or other reasons. Nevertheless, it is often desirable to evaluate the effect of treatments such as labor market policies, educational programs and other educational policies.

The causal inference framework has mainly focused on the case where there is only one treatment to evaluate, with extensions to a longitudinal setting or to one multivalued treatment. However, in both experimental and non-experimental designs, the researcher might be interested in evaluating two treatments simultaneously. One motivation for this is to see whether or not they interact. In experimental settings this is often formulated as a factorial design, in which the causal effect of two (or more) treatments with two (or more) levels is estimated. This gives a main effect for each treatment, as well as interaction effects between the factors. Dasgupta et al. (2012) propose an extension of the RCM to $2^k$ designs by defining factorial effects in terms of potential outcomes in an experimental setting. However, they do not propose estimators for factorial experiments with covariates, nor for observational studies; such estimators have not yet been developed for the non-experimental setting.

The aim of this thesis is to extend the causal inference framework for observational studies using the potential outcomes scheme to a situation in which it is of interest to simultaneously estimate the causal effects of two treatments as well as their interaction effect. This is done using a $2^2$ factorial model. The estimators assessed are based on linear regression (OLS) and inverse probability weighting (IPW). Also included is a doubly robust (DR) estimator that combines techniques from the former two. These estimators are chosen since they are commonly used within the causal inference framework for single-treatment studies, see for instance Imbens (2004) and Lunceford and Davidian (2004). In the single treatment case, the latter two estimators are conditioned on the propensity score to assure unconfoundedness. For the two-treatment case proposed here, a generalized propensity score is used for this purpose. Hence the question of how the generalized propensity score should be estimated is also of importance. Here, the multinomial and the nested logit models are considered, see for example Imbens (2000) and Tchernis et al. (2005).

The estimators are assessed in terms of bias and mean squared error (MSE), under both constant and variable treatment effects. For this purpose a Monte Carlo simulation study is performed. Furthermore, for completeness, both non-random and completely random treatment assignment mechanisms are included to highlight similarities and differences between causal inference in observational and randomized studies. The results show that the IPW and DR estimators produce unbiased estimates of the treatment effects both when treatment effects are constant and when they vary across individuals. The estimator based on OLS, however, produces unbiased estimates only when treatment effects are constant, since the method cannot take variable effects into account.

The use of the model and methods is illustrated using data from Swedish compulsory schools. The data are from 2012 and are collected by the Swedish National Agency for Education. The observations are measured on school level. The first treatment is a factor based on the proportion of students with parents with higher education (tertiary education). The second factor is based on the proportion of students with Swedish background. In these data, students born in Sweden with at most one parent born elsewhere are defined as having Swedish background. The dichotomization of the variables is discussed in Section 5. The outcome in the study is the average grade for the ninth graders in each school. The results indicate that parents' educational background has a large positive effect on students' average grades, while the effect of the students' background is close to zero; it is insignificant for all estimators except the estimator based on linear regression. The interaction effect is, somewhat surprisingly, negative. It is small but significant. The confidence intervals are 95% bootstrap intervals.

The thesis is structured as follows. The theory section is divided into two parts, Section 2 and Section 3. In Section 2 the framework of causal inference is presented, as well as the theory of conditioning on propensity scores to assure unconfoundedness. In Section 3 the $2^2$ factorial design is specified as a generalized potential outcomes model. Furthermore, the estimators to be assessed are specified there. In Section 4 the simulation study is outlined and the results are presented, and Section 5 contains the empirical study. The conclusions are discussed in Section 6.

2 The Causal Inference Framework

In this part of the theory section the framework of causal inference in observational studies is presented, together with a formulation of the estimands of interest. The focus lies on the treatment assignment mechanism and identification, and a short introduction to the propensity score and its function within the framework is given.

The idea of causal inference in observational studies is drawn from classical randomized experiments, in which it is possible to obtain estimators for the average effect of the treatment, e.g. the difference in means by treatment status. In the case where the treatment has two levels (often "treatment" or "control") this implies a comparison between the two outcomes for the same unit under both treatment and no treatment. However, in observational studies it is not possible to observe the outcome for the same unit under both treatment and no treatment (Holland, 1986). Instead, in practice, each individual can be exposed to only one level of the treatment, and thus we can only observe one of the outcomes. Holland (1986) calls this the fundamental problem of causal inference. As opposed to an experimental setting, since treatment generally cannot be randomly assigned in observational studies, individuals are self-selected into different treatment regimes. This might lead to systematic differences, which can affect the outcomes and bias the effects (Imbens and Wooldridge, 2008).

The issue of self-selection into treatment in observational studies must hence be addressed, and bias due to this issue must be removed. This is done through adjustment for differences in pre-treatment variables, also referred to as confounders, between the treatment and control groups. This is the notion of unconfoundedness (also labeled exogeneity, ignorability or selection on observables). If unconfoundedness does not hold, there is no general approach to estimating treatment effects (Imbens and Wooldridge, 2008).

To clarify the setting and introduce notation, the single treatment case is presented below, while the extension to the factorial setting is presented in the next section. The notation roughly follows that of Imbens and Wooldridge (2008) and the common notation in the causal inference literature. As mentioned above, in the basic single treatment model we have one factor with two levels, "treatment" and "control". Observations are made on a random sample of $N$ individuals, $i = 1, \ldots, N$, where some of the individuals have been exposed to treatment, while the rest have been exposed to the control. The indicator $W_i$ indicates whether individual $i$ experienced the treatment or not, with $W_i = 1$ if the individual did and $W_i = 0$ if the individual did not. Then $\mathbf{W}$ denotes the $N$-vector with the $i$-th element equal to $W_i$. Since treatment is often not randomly assigned in observational studies, the outcome is most likely dependent on $\mathbf{W}$ (Imbens and Wooldridge, 2008).

For individual $i$, two potential outcomes are possible, denoted by $Y_{it}$, $t = 0, 1$, where $Y_{i0}$ is the outcome that would be realized if the individual did not experience the treatment and $Y_{i1}$ is the outcome that would be realized if the individual did. Before the assignment is determined, both outcomes can potentially be realized; as soon as one of the two is realized, the other becomes a counterfactual outcome. The realized outcome for the $i$-th individual is denoted by $Y_i$, which is the $i$-th element of the $N$-vector $\mathbf{Y}$. Lastly, for each individual we also observe a $K$-dimensional column vector of confounders, $X_i$, with $\mathbf{X}$ denoting the $N \times K$ matrix with the $i$-th row equal to $X_i'$. The potential outcomes can be written as

$$Y_{it} = \alpha_t + X_i'\beta_t + u_{it}, \tag{1}$$

where $t = 0, 1$ (Imbens and Wooldridge, 2008). In this formulation, the intercept, $\alpha_t$, gives the treatment effect while the slope coefficients, $\beta_t$, give the covariate effect (Montgomery, 2001). The realized outcome can then be written as

$$Y_i = (1 - W_i) \cdot Y_{i0} + W_i \cdot Y_{i1} = \begin{cases} Y_{i0} & \text{if } W_i = 0, \\ Y_{i1} & \text{if } W_i = 1. \end{cases} \tag{2}$$

The estimand of interest in a causal inference setting with one binary treatment is typically the average treatment effect (ATE). Here, the ATE is defined as the population expectation of the unit-level causal effect, $Y_{i1} - Y_{i0}$:

$$\tau = E[Y_1 - Y_0] = E[Y_1] - E[Y_0] = \mu_1 - \mu_0. \tag{3}$$

For other estimands, see for example Imbens and Wooldridge (2008).
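Equations (2) and (3) can be illustrated with a small simulation. The following is a minimal Python sketch (the thesis itself works in R); the data-generating process, the function names and the constant effect of 2 are illustrative assumptions, not taken from the thesis. Because the simulation reveals both potential outcomes, the true ATE can be computed directly and compared with the difference in group means under randomized assignment.

```python
import random


def simulate_single_treatment(n, tau=2.0, seed=1):
    """Simulate potential outcomes and a completely randomized binary treatment.

    The data-generating process (standard normal Y0, constant effect tau,
    assignment probability 0.5) is an illustrative assumption.
    """
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)            # potential outcome under control
        y1 = y0 + tau                       # potential outcome under treatment
        w = 1 if rng.random() < 0.5 else 0  # randomized treatment indicator W_i
        y = (1 - w) * y0 + w * y1           # Equation (2): realized outcome
        units.append((w, y, y0, y1))
    return units


def difference_in_means(units):
    """Estimate the ATE by comparing group means of the realized outcomes."""
    treated = [y for w, y, _, _ in units if w == 1]
    control = [y for w, y, _, _ in units if w == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


units = simulate_single_treatment(20000)
# Equation (3), computable only because the simulation reveals both outcomes.
true_ate = sum(y1 - y0 for _, _, y0, y1 in units) / len(units)
estimated_ate = difference_in_means(units)
print(round(true_ate, 3), round(estimated_ate, 3))
```

In real data only `w` and `y` are available for each unit, which is why the comparison of group means has to stand in for the unit-level differences.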

In an ideal world, both $Y_{i1}$ and $Y_{i0}$ would be observable for the $i$-th individual and the estimation of $\tau$ would be straightforward. But it is not possible to both treat and not treat the same unit. Furthermore, if the $i$-th individual were exposed to both levels of the treatment one after the other, carryover effects would most likely bias the average treatment effect in a way that cannot be controlled for (Rubin, 1974).

The assignment mechanism is defined as the conditional probability of receiving the treatment, as a function of potential outcomes and observed covariates (Rosenbaum and Rubin, 1983). Since the mechanism is often not randomized in observational studies, we must instead rely on the notion of unconfoundedness to be able to identify the treatment effects. The idea is to condition on confounders so that the assignment mechanism does not depend on the potential outcomes, formally put as follows.

Assumption 1 (UNCONFOUNDEDNESS).

$$W \perp\!\!\!\perp (Y_0, Y_1) \mid X.$$

The assumption states conditional independence of $W$ and $(Y_0, Y_1)$ given the covariates $X$. Further, it assumes that there are no additional characteristics of the individual associated with both the potential outcomes and the treatment (Imbens and Wooldridge, 2008). The second assumption needed to identify the treatment effect is the assumption of overlap.

Assumption 2 (OVERLAP).

$$0 < \mathrm{pr}(W_i = 1 \mid X_i = x) < 1, \quad \forall i.$$

This is called the overlap assumption since it implies that the support of the conditional distribution of $X_i$ given $W_i = 0$ overlaps completely with that of the conditional distribution of $X_i$ given $W_i = 1$ (Imbens and Wooldridge, 2008). In other words, for all possible values of the covariates there are both treated and control units. Rosenbaum and Rubin (1983) refer to the situation where both Assumption 1 and Assumption 2 hold as strong ignorability.

As explained by Rubin (1974), we generally neither know nor are able to control for all variables that may differ systematically due to self-selection; hence, trying to control for bias due to self-selection might result in even more bias. One common approach to this issue is to condition on the propensity score (Rosenbaum and Rubin, 1983). Furthermore, if the dimension of $X$ is large, it is easier to condition on a scalar function of the covariates, such as the propensity score, than on all possible combinations of the covariate values (Imbens, 2004).

As defined by Rosenbaum and Rubin (1983), the propensity score is the conditional probability of assignment to a particular treatment given a vector of observed covariates:

$$p(x) \equiv \mathrm{pr}(W = 1 \mid X = x) = E(W \mid X = x). \tag{4}$$

Rosenbaum and Rubin (1983) show that under Assumption 1 of unconfoundedness, the treatment indicator $W$ and the potential outcomes $(Y_0, Y_1)$ are independent after conditioning on the propensity score:

$$W \perp\!\!\!\perp (Y_0, Y_1) \mid p(X). \tag{5}$$

They also show that, conditional on the propensity score, the treatment assignment and the observed covariates are independent (Rosenbaum and Rubin, 1983):

$$W \perp\!\!\!\perp X \mid p(X), \tag{6}$$

or, as Imbens and Wooldridge (2008, p. 28) explain:

"Within subpopulations with the same value for the propensity score, covariates are independent of the treatment indicator and thus cannot lead to biases (the same way in a regression framework omitted variables that are uncorrelated with included covariates do not introduce bias)".

Hence, in the single treatment case, under Assumption 1 it suffices to adjust only for differences in the propensity score between the treated and the control units. Different methods have been proposed for this purpose.

Lastly, in order to make causal interpretations of the effects, the Stable Unit Treatment Value Assumption (SUTVA) must hold. It states that treatments received by one individual do not affect outcomes for another individual (Imbens and Wooldridge, 2008).

For the two-treatment model suggested below, two semi-parametric estimators that are common in the single-treatment case are assessed, namely the inverse probability weighting (IPW) estimator and the doubly robust (DR) estimator. Furthermore, an estimator based on linear regression estimated with OLS is included. These are presented in more detail in Section 3. For other estimators for the single treatment case, see for example Imbens (2004).

3 Causal Inference in a $2^2$ Factorial Design

We now extend the potential outcomes framework to a case with two factors, formulated as a $2^2$ factorial model. In a factorial design, each possible combination of the levels of the factors is investigated. A $2^2$ factorial design includes two factors with two levels each, which means that there are $2 \times 2 = 4$ possible combinations. The two treatment factors are denoted $A$ and $B$, where each is coded as 0 for "control" and 1 for "treatment". The four treatment combinations are presented in Figure 1. Following the same notation as in the single treatment case, the treatment combinations are denoted by $t$, but now $t = 1, 2, 3, 4$, according to the numbering in the figure.

If we apply the potential outcomes scheme for observational studies, the four treatment combinations in the $2^2$ factorial design can be seen as four separate potential outcomes as in Equation (1), with $t = 1, 2, 3, 4$. Hence, the four treatment combinations can be viewed as four possible treatments, denoted by $T$, which takes on the value $t$ among these four. The treatment indicators are then defined as

$$W_{it}(T_i) = \begin{cases} 1 & \text{if } T_i = t, \\ 0 & \text{otherwise,} \end{cases}$$

which gives the four possible $4 \times 1$ treatment vectors

$$\mathbf{W}_i(T_i) = \begin{cases} (1, 0, 0, 0)' & \text{if } T_i = 1, \\ (0, 1, 0, 0)' & \text{if } T_i = 2, \\ (0, 0, 1, 0)' & \text{if } T_i = 3, \\ (0, 0, 0, 1)' & \text{if } T_i = 4. \end{cases}$$

| | Factor B: Control (0) | Factor B: Treatment (1) |
|---|---|---|
| Factor A: Control (0) | $Y_{00} = Y_1$ | $Y_{01} = Y_2$ |
| Factor A: Treatment (1) | $Y_{10} = Y_3$ | $Y_{11} = Y_4$ |

Figure 1: A $2^2$ factorial design

This means that for each individual $i$, only one of the four outcomes is realized, and the remaining three are counterfactuals. For example, if individual $i$ does not receive treatment $A$ ($A = 0$) but only treatment $B$ ($B = 1$), then $t = 2$, $\mathbf{W}_i(T_i) = \mathbf{W}_i(2)$, and the realized outcome belongs to the upper right cell in Figure 1 and is denoted $Y_{it} = Y_{i2}$. The realized outcome is thus

$$Y_i = \mathbf{W}_i' \cdot (Y_{i1}, Y_{i2}, Y_{i3}, Y_{i4})' = \begin{cases} Y_{i1} & \text{if } \mathbf{W}_i' = \mathbf{W}_i(1)', \\ Y_{i2} & \text{if } \mathbf{W}_i' = \mathbf{W}_i(2)', \\ Y_{i3} & \text{if } \mathbf{W}_i' = \mathbf{W}_i(3)', \\ Y_{i4} & \text{if } \mathbf{W}_i' = \mathbf{W}_i(4)'. \end{cases} \tag{7}$$
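The mapping from the two factor levels to the treatment label $t$, the indicator vector and the realized outcome in Equation (7) can be sketched in a few lines of Python. The function names and the numeric potential outcomes are hypothetical and serve only to illustrate the bookkeeping.

```python
def treatment_group(a, b):
    """Map factor levels (A, B) to the label t used in Figure 1."""
    return 1 + 2 * a + b  # (0,0) -> 1, (0,1) -> 2, (1,0) -> 3, (1,1) -> 4


def indicator_vector(t, n_groups=4):
    """Return the treatment vector W_i(t) with a single one in position t."""
    return [1 if k == t else 0 for k in range(1, n_groups + 1)]


def realized_outcome(w, potential_outcomes):
    """Equation (7): inner product of W_i and the four potential outcomes."""
    return sum(w_k * y_k for w_k, y_k in zip(w, potential_outcomes))


# Hypothetical potential outcomes (Y_i1, Y_i2, Y_i3, Y_i4) for one individual.
potential = [10.0, 12.0, 15.0, 16.0]
t = treatment_group(a=0, b=1)   # treatment B only
w = indicator_vector(t)
y = realized_outcome(w, potential)
print(t, w, y)
```

Only `y` would be observed in practice; the other three entries of `potential` are counterfactuals.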

The estimands of interest are the main effects and the interaction effect. The main effect of a treatment in a $2^2$ factorial design is defined as the change in response produced by a change in the level of the factor (Montgomery, 2001). If the difference in response between the levels of one factor is not the same at both levels of the other factor, there is an interaction effect between the factors. In a randomized experiment these effects have a causal interpretation. The estimands can be formulated as

$$\tau_A = \frac{\mu_3 + \mu_4}{2} - \frac{\mu_1 + \mu_2}{2}, \tag{8}$$

$$\tau_B = \frac{\mu_2 + \mu_4}{2} - \frac{\mu_1 + \mu_3}{2}, \tag{9}$$

$$\tau_{AB} = \frac{\mu_1 + \mu_4}{2} - \frac{\mu_2 + \mu_3}{2}, \tag{10}$$

where $\mu_t$ is the expected outcome for each treatment group, i.e., the cell means of Figure 1.
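As a small numerical illustration of Equations (8) to (10), the sketch below computes the three factorial effects from four cell means. The function name and the cell means are made up for the example.

```python
def factorial_effects(mu):
    """Return (tau_A, tau_B, tau_AB) from the cell means (mu_1, ..., mu_4)."""
    mu1, mu2, mu3, mu4 = mu
    tau_a = (mu3 + mu4) / 2 - (mu1 + mu2) / 2    # Equation (8)
    tau_b = (mu2 + mu4) / 2 - (mu1 + mu3) / 2    # Equation (9)
    tau_ab = (mu1 + mu4) / 2 - (mu2 + mu3) / 2   # Equation (10)
    return tau_a, tau_b, tau_ab


# Hypothetical cell means for t = 1, 2, 3, 4.
tau_a, tau_b, tau_ab = factorial_effects([10.0, 12.0, 15.0, 16.0])
print(tau_a, tau_b, tau_ab)  # 4.5 1.5 -0.5
```

In this example the effect of $A$ is 5 when $B = 0$ but only 4 when $B = 1$; half of that difference is the interaction effect of $-0.5$.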

If the theoretical results in the previous section can be generalized, the $\mu_t$'s can be estimated without bias, and hence it is possible to obtain unbiased estimates of the two main effects and the interaction effect in Equations (8) to (10) also in a non-randomized study. This means that the effects in a two-treatment setting can be interpreted as causal effects.

Using the notation proposed above, the four treatment combinations can be viewed as a multivalued treatment setting, and hence theoretical results derived for multivalued treatments should be valid also for the proposed setting. However, in a multivalued treatment setting there is one treatment with many levels, such as different doses of a drug or different levels of education. Thus, the structure of a $2^2$ factorial design differs from the multivalued case both in the interpretation of effects and in the fact that the model enables estimation of the interaction effect, since what is dealt with is essentially two binary treatments.

According to Imbens (2000), adjusting for confounders in a multivalued setting results in weak unconfoundedness of the assignment to treatment $T$ if the treatment indicator $W_t$ is independent of the respective potential outcome given the pre-treatment variables, such that the following assumption holds.

Assumption 3 (WEAK UNCONFOUNDEDNESS).

$$W_t \perp\!\!\!\perp Y_t \mid X, \quad \forall t \in T.$$

If this assumption holds, the expected value of $Y_t$ can be estimated by adjusting for $X$, since

$$E(Y_t \mid X) = E(Y_t \mid W_t = 1, X) = E(Y \mid W_t = 1, X) = E(Y \mid T = t, X),$$

and thus

$$E(Y_t) = E[E(Y_t \mid X)] = \mu_t. \tag{11}$$

To formulate ignorability also in the multivalued case, the assumption of generalized overlap proposed by Imbens (2000) must also hold.

Assumption 4 (GENERALIZED OVERLAP).

$$0 < \mathrm{pr}(T_i = t \mid X_i = x) < 1, \quad \forall i, \ \forall t \in T.$$

Since what we seek is an unbiased estimate of each $\mu_t$, we only need unconfoundedness to hold within each treatment group. In this sense, the suggested setting is equivalent to a multivalued treatment setting, and Assumptions 3 and 4 are valid for the proposed case.

As mentioned in Section 2, if the dimension of $X$ is large, it may be difficult to condition on $X$, and hence the propensity score can be used instead. For multivalued treatments, Imbens (2000) shows that the generalized propensity score has the same properties as its single treatment counterpart. He defines it as

$$r(t, x) \equiv \mathrm{pr}(T = t \mid X = x) = E(W_t \mid X = x), \tag{12}$$

in analogy to the propensity score in the case with one binary treatment. Then, if Assumption 3 of weak unconfoundedness holds, the treatment assignment is weakly unconfounded given the generalized propensity score, such that

$$W_t \perp\!\!\!\perp Y_t \mid r(t, X), \quad \forall t \in T.$$

A more thorough review of the generalized propensity function can be found in, for instance, Imai and van Dyk (2004).

If Assumption 3 and Assumption 4 hold, unbiased estimators of the expected values for each treatment combination can be formulated. If these estimates are substituted into Equations (8) to (10), unbiased estimates of $\tau_A$, $\tau_B$ and $\tau_{AB}$ are obtained. In the following section the proposed methods for estimating $\mu_t$ are presented. Complete expressions of the estimators of $\tau_A$, $\tau_B$ and $\tau_{AB}$ for each method can be found in Appendix B.

First, the well-known method of linear regression to estimate conditional expectations is presented, after which the IPW and DR estimators follow.

3.1 Estimators for the Average Treatment Effect

In a completely randomized experiment, the $2^2$ factorial model can be expressed as the regression model

$$y = \beta_0 + \beta_A A + \beta_B B + \beta_{AB} AB + \varepsilon, \tag{13}$$

where $\varepsilon$ is a random error term (Montgomery, 2001). Under an assumption of constant treatment effects, covariates can be added to Equation (13) such that

$$y = \beta_0 + \beta_A A + \beta_B B + \beta_{AB} AB + X\beta_X + \varepsilon. \tag{14}$$

These two models can be estimated using ordinary least squares (OLS). The assumption of constant treatment effects implies equal slope coefficients in the model that generates the potential outcomes, i.e., all $\beta_t$'s are equal in Equation (1).

The isolated effects of the treatment factors $A$ and $B$ as in Equations (8) to (10) cannot be found by simply interpreting the estimated coefficients from Equation (14) if the interaction coefficient $\hat\beta_{AB}$ is significant, since the interaction term implies different marginal effects for each factor depending on the level of the other factor [1]. Furthermore, the interaction term in a linear regression has a different interpretation than that of the factorial experiment. In the linear regression, the interaction effect is the additional effect of receiving both treatments. In the $2^2$ factorial design, it is the average difference in levels of one factor across the levels of the other. As is shown in Equation (25), Appendix B, if the model is correctly specified, the interaction effect in a $2^2$ factorial design equals half the interaction term in Equation (14), i.e., the average across the two factors. Instead, the expected values in Equation (11) can be retrieved from Equation (14), such that

$$\begin{aligned} \mu_{1,OLS} &= E(Y \mid T = 1, X) = \beta_0, \\ \mu_{2,OLS} &= E(Y \mid T = 2, X) = \beta_0 + \beta_B, \\ \mu_{3,OLS} &= E(Y \mid T = 3, X) = \beta_0 + \beta_A, \\ \mu_{4,OLS} &= E(Y \mid T = 4, X) = \beta_0 + \beta_A + \beta_B + \beta_{AB}. \end{aligned} \tag{15}$$

[1] Here, dummy coding (0, 1) is used, which is not the common coding scheme in a $2^2$ factorial design. Since dummy coding is the most common coding in applications, it is used here. When $\pm 1$ coding is used ("control" = −1, "treatment" = 1), the regression coefficients can be interpreted as half the respective effects, such that $\hat\tau_A = 2 \times \beta_A$, $\hat\tau_B = 2 \times \beta_B$ and $\hat\tau_{AB} = 2 \times \beta_{AB}$.
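To make the relation between the regression coefficients and the factorial effects concrete, the cell means in Equation (15) can be substituted into Equations (8) to (10). The following is a worked sketch under the dummy coding used here and a correctly specified model; the complete expressions given in Appendix B may be arranged differently.

$$\begin{aligned} \tau_A &= \frac{(\beta_0 + \beta_A) + (\beta_0 + \beta_A + \beta_B + \beta_{AB})}{2} - \frac{\beta_0 + (\beta_0 + \beta_B)}{2} = \beta_A + \frac{\beta_{AB}}{2}, \\ \tau_B &= \frac{(\beta_0 + \beta_B) + (\beta_0 + \beta_A + \beta_B + \beta_{AB})}{2} - \frac{\beta_0 + (\beta_0 + \beta_A)}{2} = \beta_B + \frac{\beta_{AB}}{2}, \\ \tau_{AB} &= \frac{\beta_0 + (\beta_0 + \beta_A + \beta_B + \beta_{AB})}{2} - \frac{(\beta_0 + \beta_B) + (\beta_0 + \beta_A)}{2} = \frac{\beta_{AB}}{2}. \end{aligned}$$

The last line agrees with the statement that the factorial interaction effect equals half the interaction term, and the first two lines show why $\beta_A$ and $\beta_B$ alone are not the main effects when $\beta_{AB} \neq 0$.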

Hence, the estimated averages for each treatment level using linear regression are given by

$$\begin{aligned} \hat\mu_{1,OLS} &= \hat\beta_0, \\ \hat\mu_{2,OLS} &= \hat\beta_0 + \hat\beta_B, \\ \hat\mu_{3,OLS} &= \hat\beta_0 + \hat\beta_A, \\ \hat\mu_{4,OLS} &= \hat\beta_0 + \hat\beta_A + \hat\beta_B + \hat\beta_{AB}. \end{aligned} \tag{16}$$

Substituting the $\hat\mu_{t,OLS}$'s into Equations (8) to (10) yields the unbiased estimators $\hat\tau_{A,OLS}$, $\hat\tau_{B,OLS}$ and $\hat\tau_{AB,OLS}$. For the complete expressions, see Equations (23) to (25) in Appendix B. Furthermore, if the regression model is correctly specified, it follows from the Gauss-Markov theorem that the variance of $\hat\tau_{A,OLS}$, $\hat\tau_{B,OLS}$ and $\hat\tau_{AB,OLS}$ is the smallest among linear unbiased estimators.

The IPW estimator is the first of the proposed estimators that conditions on the propensity score for unconfoundedness; the inverses of the propensity scores are used as weights. By weighting the outcome by the inverse of the generalized propensity score, the unconditional expectation of the response under each treatment in Equation (11) can be found, and unbiased estimators of $\tau_A$, $\tau_B$ and $\tau_{AB}$ can be constructed using Equations (8) to (10). The result is an extension of the simple case with one treatment shown in, for instance, Imbens and Wooldridge (2008), using the generalized result proposed by Imbens (2000). Considering each treatment group separately, since $W_t \cdot Y = W_t \cdot Y_t$, the expected outcome for each treatment group can be found through

$$E\left[\frac{W_t \cdot Y}{r(t, X)}\right] = \mu_t.$$

The derivation showing that the IPW estimator is unbiased is given in Appendix B. Since the weights do not always add up to one, a normalized estimator has been proposed, see for instance Imbens (2004). The estimated averages for each treatment level using the IPW estimator are hence given by

$$\hat\mu_{t,IPW} = \left(\sum_{i=1}^{N} \frac{W_{it}}{\hat r(t, X_i)}\right)^{-1} \sum_{i=1}^{N} \frac{W_{it} \cdot Y_i}{\hat r(t, X_i)}. \tag{17}$$
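The normalized IPW estimator in Equation (17) can be sketched as follows in Python. This is an illustration under stated assumptions rather than the thesis's implementation: the assignment probabilities in `PROPENSITY`, the binary covariate and the outcome model are all made up, and the true generalized propensity scores are plugged in where an application would use estimated ones.

```python
import random


def ipw_mean(y, w_t, r_t):
    """Normalized IPW estimate of mu_t as in Equation (17).

    y   : realized outcomes Y_i
    w_t : indicators W_it (1 if unit i belongs to group t, else 0)
    r_t : generalized propensity scores r(t, X_i), known or already estimated
    """
    weights = [w / r for w, r in zip(w_t, r_t)]
    return sum(wt * yi for wt, yi in zip(weights, y)) / sum(weights)


# Hypothetical pr(T = t | X = x), t = 1..4, for a binary covariate X.
PROPENSITY = {0: [0.4, 0.3, 0.2, 0.1], 1: [0.1, 0.2, 0.3, 0.4]}


def simulate(n, seed=1):
    """Draw (X, T, Y) with confounded assignment; here mu_t = t + 1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        t = rng.choices([1, 2, 3, 4], weights=PROPENSITY[x])[0]
        y = t + 2.0 * x + rng.gauss(0.0, 1.0)
        data.append((x, t, y))
    return data


data = simulate(20000)
t = 4
y = [yi for _, _, yi in data]
w_t = [1 if ti == t else 0 for _, ti, _ in data]
r_t = [PROPENSITY[xi][t - 1] for xi, _, _ in data]

naive = sum(yi for yi, w in zip(y, w_t) if w) / sum(w_t)  # raw group mean
ipw = ipw_mean(y, w_t, r_t)
print(round(naive, 2), round(ipw, 2))
```

In this setup units with $X = 1$ are over-represented in group 4, so the raw group mean overstates $\mu_4 = 5$, while the weighted mean restores the covariate balance.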

For the complete expressions of $\hat\tau_{A,IPW}$, $\hat\tau_{B,IPW}$ and $\hat\tau_{AB,IPW}$, see Equations (26) to (28) in Appendix B.

The last estimator considered here is the DR estimator, which, under correct model specification, has the smallest large-sample variance among the semi-parametric estimators (Lunceford and Davidian, 2004). The DR estimator has a term similar to the IPW estimator but also a term involving a regression model fitted by OLS separately for each treatment group. As described by Lunceford and Davidian (2004), the expected average for each treatment combination in Equation (11) can be found by

$$E\left[\frac{W_t \cdot Y}{r(t, X)} - \frac{W_t - r(t, X)}{r(t, X)}\, m_t(X, \alpha_t)\right] = \mu_t,$$

where $m_t$ is the regression of the outcome on $X$ for treatment group $t$. The derivation showing that the DR estimator is unbiased is given in Appendix B. The name, doubly robust, arises from the fact that the estimator is still consistent if (1) the propensity score model is correctly specified but the regression model is not or (2) the regression model is correctly specified but the propensity score model is not (Lunceford and Davidian, 2004). The DR estimator is then constructed such that

$$\hat\mu_{t,DR} = \frac{1}{N} \sum_{i=1}^{N} \left[\frac{W_{it} \cdot Y_i}{\hat r(t, X_i)} - \frac{W_{it} - \hat r(t, X_i)}{\hat r(t, X_i)}\, m_t(X_i, \hat\alpha_t)\right], \tag{18}$$

where $m_t(X_i, \hat\alpha_t) = E(Y \mid T = t, X)$ is the regression of the outcome on $X$ for treatment group $t$, with parameter $\alpha_t$ estimated by $\hat\alpha_t$ based on data from individuals with $T = t$.
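A minimal Python sketch of Equation (18) is given below, assuming a single scalar covariate so that the group-specific regression $m_t$ is a simple least-squares line. The helper names and the toy data are hypothetical. The example is chosen so that the outcome model is exactly right while the plugged-in propensity scores are arbitrary, which illustrates case (2) of the double robustness property.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single covariate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def dr_mean(y, x, w_t, r_t):
    """Doubly robust estimate of mu_t following Equation (18)."""
    # Outcome regression m_t, fitted only on units with T = t.
    x_t = [xi for xi, w in zip(x, w_t) if w == 1]
    y_t = [yi for yi, w in zip(y, w_t) if w == 1]
    intercept, slope = fit_line(x_t, y_t)
    total = 0.0
    for yi, xi, w, r in zip(y, x, w_t, r_t):
        m = intercept + slope * xi                # m_t(X_i, alpha_hat_t)
        total += w * yi / r - (w - r) / r * m     # summand of Equation (18)
    return total / len(y)


# Toy data: in group t the outcome is exactly 1 + 2x; the second unit belongs
# to another group, so its outcome (100.0) never enters the estimate of mu_t.
x = [0.0, 1.0, 2.0, 3.0]
w_t = [1, 0, 1, 1]
y = [1.0, 100.0, 5.0, 7.0]
arbitrary_scores = [0.3, 0.6, 0.2, 0.9]  # deliberately not a fitted model
estimate = dr_mean(y, x, w_t, arbitrary_scores)
print(estimate)
```

Because the residuals $Y_i - m_t(X_i, \hat\alpha_t)$ are zero for every unit in group $t$, the propensity scores cancel and the estimate reduces to the average fitted value, $1 + 2 \cdot 1.5 = 4$, up to floating-point error.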

Hence, if either model or both are correctly specified, $\hat\mu_{t,DR}$ is an unbiased estimator of $\mu_t$, and by substituting the $\hat\mu_{t,DR}$'s into Equations (8) to (10), unbiased estimates of $\tau_A$, $\tau_B$ and $\tau_{AB}$ are obtained. For the complete expressions of $\hat\tau_{A,DR}$, $\hat\tau_{B,DR}$ and $\hat\tau_{AB,DR}$, see Equations (30) to (32) in Appendix B.

The latter two estimators utilize the generalized propensity score to assure unconfoundedness. The propensity score is rarely known and must thus be estimated. It has been shown that, in terms of large sample efficiency, it is better to use the estimated rather than the true propensity score (Hahn, 1998). It has also been shown that semi-parametric estimators of treatment effects based on propensity scores are robust to many sorts of misspecification in the parametric model used to estimate the propensity score (Waernbaum, 2008). One note of caution, however, is that if the propensity scores are too close to zero or one, the estimators might become unstable and their variance inflated (Imbens, 2004). Two methods for estimating the generalized propensity score are used. These are described next.

3.2 Models for Multivalued Treatment Assignments

In the binary treatment case it is common practice to use logistic regression to estimate the propensity score. For unordered, multivalued settings, researchers such as Imbens (2000), Imai and van Dyk (2004) and Tchernis et al. (2005) suggest multinomial logistic regression or nested logit to estimate the generalized propensity score. Since these are commonly suggested discrete choice models for multivalued treatments, they are used to estimate the generalized propensity score in the simulation.

The generalization of the logistic regression model to the multivalued case is the multinomial logistic model, where the probability of treatment assignment is specified as

$$P^m_{it} = \mathrm{pr}(T_i = t \mid X_i = x) = \frac{e^{X_i'\gamma_t}}{\sum_{k=1}^{T} e^{X_i'\gamma_k}}, \tag{19}$$

in which the first treatment category is a reference category, i.e., $\gamma_1 = 0$. For more information about the multinomial logistic model, see for instance Greene (2008). A special characteristic of the multinomial logistic model is that the estimated odds of each treatment level relative to the reference treatment level are independent of the other alternatives. This is called the "independence from irrelevant alternatives assumption" (IIA), which implies that the model might not take all information about the relationships among treatments into consideration.

A model that relaxes the IIA is the nested logit, in which the probability of treatment assignment is specified sequentially. In the $2^2$ factorial design the probability of each treatment combination in Figure 1 can be viewed as the intersection between the probabilities of the specific level of each factor. Using the same example as previously, the probability of not receiving treatment $A$ ($A = 0$) but only treatment $B$ ($B = 1$), which is the probability of belonging to treatment group 2 in the upper right cell in Figure 1, can be written

$$\mathrm{pr}(T = 2) = \mathrm{pr}(A = 0 \cap B = 1) = \mathrm{pr}(A = 0 \mid B = 1) \times \mathrm{pr}(B = 1).$$
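The multinomial logit probabilities in Equation (19) can be computed as in the following Python sketch. The covariate values and coefficient vectors are hypothetical; in an application the $\gamma_t$ would be estimated by maximum likelihood. Subtracting the largest linear predictor before exponentiating is a standard numerical safeguard and does not change the probabilities.

```python
import math


def multinomial_logit_probs(x, gammas):
    """Equation (19): pr(T = t | X = x) for every treatment group t.

    x      : covariate vector X_i (list of floats)
    gammas : one coefficient vector per group; gammas[0] is the reference
             group and is assumed to be all zeros (gamma_1 = 0)
    """
    scores = [sum(xj * gj for xj, gj in zip(x, g)) for g in gammas]
    shift = max(scores)                        # guards against overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical covariates and coefficients for the four treatment groups.
x = [1.0, 0.5]
gammas = [[0.0, 0.0], [0.2, 0.4], [1.0, 0.0], [-0.5, 0.3]]
probs = multinomial_logit_probs(x, gammas)
print([round(p, 3) for p in probs])

# IIA: the odds of group 2 against the reference depend only on gamma_2.
odds_2_vs_1 = probs[1] / probs[0]
```

The ratio `odds_2_vs_1` equals $e^{X_i'\gamma_2}$ regardless of the coefficients of groups 3 and 4, which is the IIA property described above.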

Such probabilities can be estimated with the nested logit model, in which

0 Xiγt/ρb ρbIib n e e P = pr(Ti = t|Xi = x) = Pa|b × Pb = × , (20) it Jb 0 B P Xiγk/ρb P ρbIib k=1 e b=1 e

where $P_b$ is the probability of receiving treatment level $b$ of factor B, $b = 0, 1$, $J_b$ is the number of treatments in treatment level $b$, and

$$I_{ib} = \ln\left\{\sum_{k=1}^{J_b} e^{X_i'\gamma_k/\rho_b}\right\}$$

is the so-called inclusive value, see for instance Greene (2008) or Tchernis et al. (2005). The structure of the nested logit probabilities is illustrated with the decision tree in Figure 2. The term $\rho_b$ may be interpreted as a measure of dissimilarity between treatments within levels of factor B and is restricted to the (0, 1] interval. If $\rho_b = 1$ the treatments are dissimilar and the model reduces to a multinomial logit, and if $0 < \rho_b < 1$, the nested logit model is the correct model (Tchernis et al., 2005).

B
  B = 0
    A = 0 | B = 0 (t = 1)
    A = 1 | B = 0 (t = 3)
  B = 1
    A = 0 | B = 1 (t = 2)
    A = 1 | B = 1 (t = 4)

Figure 2: Decision tree for the nested logit treatment assignments

For the IPW and DR estimators the subscript m is used to indicate that the propensity scores are estimated with multinomial regression, while the subscript n indicates that the nested logit regression has been used. The difference between the discrete choice situations can be illustrated using causal chains as in Figure 3, in which each arrow means "causing".

[Figure 3 diagram: left panel with the nodes X, T and Y; right panel with the nodes X, B, A and Y.]

Figure 3: To the left are the causal chains for when a multinomial treatment assignment is correct (T = 1, 2, 3, 4), and to the right are the causal chains for when a nested logit treatment assignment is correct (A = 0, 1, B = 0, 1).
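The relationship between Equations (19) and (20) can be made concrete with a small numerical sketch. The Python code below is an illustration only (the thesis itself uses R): it assumes a single scalar covariate, takes the nest membership from Figure 2, borrows the values of γ and ρ that are used later in the simulation setup, and uses a hypothetical covariate value x = 0.5. The function names are introduced here for illustration.

```python
import math

# Nest membership as in Figure 2: treatments 1 and 3 have B = 0, treatments 2 and 4 have B = 1.
NESTS = {0: (1, 3), 1: (2, 4)}

def multinomial_probs(x, gamma):
    """Equation (19) with a scalar covariate: P_t = exp(x*gamma_t) / sum_k exp(x*gamma_k)."""
    scores = {t: math.exp(x * g) for t, g in gamma.items()}
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

def nested_probs(x, gamma, rho):
    """Equation (20): P_t = P(t | nest b) * P(nest b), using the inclusive value I_b."""
    inclusive = {
        b: math.log(sum(math.exp(x * gamma[k] / rho[b]) for k in members))
        for b, members in NESTS.items()
    }
    nest_total = sum(math.exp(rho[b] * inclusive[b]) for b in NESTS)
    probs = {}
    for b, members in NESTS.items():
        p_nest = math.exp(rho[b] * inclusive[b]) / nest_total
        within_total = sum(math.exp(x * gamma[k] / rho[b]) for k in members)
        for t in members:
            probs[t] = math.exp(x * gamma[t] / rho[b]) / within_total * p_nest
    return probs

gamma = {1: 0.0, 2: 0.3, 3: 1.0, 4: 0.8}  # gamma_1 = 0 is the reference category
x = 0.5                                    # hypothetical covariate value

p_mnl = multinomial_probs(x, gamma)
p_nl_equal = nested_probs(x, gamma, {0: 1.0, 1: 1.0})  # rho_b = 1 for both nests
p_nl = nested_probs(x, gamma, {0: 0.5, 1: 0.5})        # rho_b = 0.5 for both nests
```

With both dissimilarity parameters equal to one, `p_nl_equal` coincides with `p_mnl`, which is the reduction to the multinomial logit described in the text. With ρ = 0.5 the two probability vectors differ.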

4 Simulation Study

A Monte Carlo simulation study is conducted to evaluate and compare the proposed causal estimators for the two main effects and the interaction effect in terms of accuracy and efficiency. The propensity scores are estimated with the two discrete choice models described in the previous section. Hence, for each estimator that utilizes the propensity score to assure unconfoundedness there are two estimates per effect. In the following sections the simulation setup is described and the results from the simulation are presented. All simulations and calculations are done in the software R, version 2.15.1.

4.1 Simulation Setup

In the data simulation, outcomes are simulated with both the multinomial and the nested logit models as the true assignment mechanisms. Furthermore, a study design similar to an experimental setting is simulated, in which assignment to each of the four groups has equal probability. This is done to compare the results and highlight the differences between non-randomized and randomized studies. It is also done to assess the efficiency of estimators that utilize the propensity score in the completely randomized case.

For each true treatment assignment model, $\tau_A$, $\tau_B$ and $\tau_{AB}$ are estimated with the estimator based on linear regression fitted by OLS, and with the IPW and DR estimators. A so-called crude estimator is used as reference, in which the cell means are calculated without any adjustment for confounders. Each potential outcome is simulated according to Equation (1), with one confounder, such that

$$Y_{it} = \alpha_t + \beta_t X_i + u_{it}, \qquad t = 1, 2, 3, 4,$$

where $X_i \sim \mathrm{Uniform}(-1, 1)$ and $u_{it} \sim N(0, 1)$. In the data simulation, both treatment A and treatment B have an effect, and there is an interaction effect between them. Since the expected value of the confounder X is zero, the true causal treatment effects are determined only by the intercepts of the potential outcomes. The parameter vector $\alpha = (-6, 6, 8, 16)'$ is chosen, which gives the effects $\tau_A = 12$, $\tau_B = 10$ and $\tau_{AB} = -2$.

Both variable and constant treatment effects are simulated. For the constant treatment effects, the vector $\beta = (2, 2, 2, 2)'$ is used, and for the variable treatment effects the vector $\beta = (5, 8, 3, 9)'$ is used. All these parameters are arbitrarily chosen, but with the notion that there must exist effects in order to estimate them. One aim of the thesis is to establish whether the performance of the estimators differs between situations with and without variable treatment effects and, if it does, how the methods differ from each other. Since the chosen covariate effects do not affect the true treatment effects $\tau_A$, $\tau_B$ and $\tau_{AB}$, it does not matter how the variable effects are chosen, in the sense that interest does not lie in investigating how the degree of variation between the treatment groups affects the estimation. The choice of $\beta$ may affect the variance of the estimators, but since the number of replications is large, potential effects of this are negligible.

For each true model the treatment assignment probabilities for each treatment group are calculated. The 4 × 1 treatment assignment vector $W_i$ for each individual is then randomized based on the treatment assignment probabilities. Once the treatment assignment is generated, the potential outcome for the assigned treatment for each individual is considered the realized outcome and the remaining three counterfactuals are discarded. Furthermore, from the information in the treatment assignment vector the indicators for the two treatment factors A and B are retrieved.

The $\gamma$-parameters in the true treatment assignment models, based on the multinomial logistic and the nested logistic models, are also chosen somewhat arbitrarily, but with two clear guidelines. First, there must be some dependence between the treatment assignment probabilities and the confounders, i.e., the elements of the treatment probability vector for an individual cannot all be equal, since this constitutes the completely randomized case. Second, the propensity score methods, especially the IPW estimators, become unstable if too many of the propensity scores are too close to zero or one, since dividing by values close to zero will inflate the variance and put excessive weight on unusual observations. Hence the treatment assignment probability vectors cannot contain elements too close to zero or one, since the estimated propensity scores would then also be close to zero or one.

For the multinomial model, treatment assignment probabilities are calculated using Equation (19), with model parameters $\gamma = (0, 0.3, 1, 0.8)'$, where the first element is set to zero because the reference category must be normalized. The choice of the three remaining parameters is arbitrary but follows the guidelines mentioned above. The treatment assignment vector is randomized using the random multinomial function in the package "nnet" in R.

For the second true model the treatment assignment probabilities are generated from a nested logit. The probabilities are calculated using Equation (20), also with model parameters $\gamma = (0, 0.3, 1, 0.8)'$ as in the multinomial case. Furthermore, the additional model parameters $\rho = (0.5, 0.5)'$ are chosen following the article of Tchernis et al. (2005), and represent moderate similarity. A random nested logit function is not available in R. Instead, a two-step function using random Bernoulli trials based on the probabilities $P_{B=1}$ and $P_{A=1|b}$ is written, where the treatment assignment mechanism follows the decision tree in Figure 2.

Last, for the true model in which each treatment combination has equal probability, i.e., the completely randomized treatment assignment, the multinomial scheme outlined above is used, but with $\gamma = (0, 0, 0, 0)'$. This results in a treatment assignment probability equal to 0.25 for each individual and treatment combination. Hence, the treatment assignment does not depend on the confounder.

The two types of generalized propensity score are also estimated. The multinomial propensity scores are estimated using the multinomial logistic regression for $E(W \mid X)$ in the R package "nnet". The fitted values are 4 × 1 probability vectors $\pi_m(x_i)$, where each element in the vector is an estimate of $P^m_{it} = \mathrm{pr}(T_i = t \mid X_i = x)$, $t = 1, 2, 3, 4$. Again, there exists no straightforward function for the nested logit, and a two-step procedure based on logistic regression is used to estimate the generalized propensity scores. First, $P_{A=1|b}$ is estimated by regressing A on (1, B, X), and $P_{B=1}$ is estimated by regressing B on (1, X), both by ordinary logistic regression. For each individual, the elements of the desired propensity score vector $\pi_n(x_i)$ are then calculated such that each element equals $P^n_{it} = \mathrm{pr}(T_i = t \mid X_i = x)$, $t = 1, 2, 3, 4$, by using the fitted values and their complements according to Equation (20).
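The mechanics described above can be sketched in a few lines of Python (an illustration only; the thesis itself uses R). The sketch assumes that the effects are the usual half-scaled factorial contrasts of the four cell means, which reproduces the stated values 12, 10 and −2; the actual definitions are Equations (8) to (10), which are not repeated here. The probabilities `p_b1` and `p_a1_given_b` are hypothetical stand-ins for the fitted values of the two logistic regressions, and all function names are introduced for illustration.

```python
import random

ALPHA = {1: -6.0, 2: 6.0, 3: 8.0, 4: 16.0}  # intercepts of the potential outcomes
# Treatment group t -> (A, B), as in Figure 2.
LEVELS = {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1)}

def factorial_effects(mu):
    """Main effects and interaction from four cell means (assumed half-scaled contrasts)."""
    tau_a = 0.5 * ((mu[3] - mu[1]) + (mu[4] - mu[2]))
    tau_b = 0.5 * ((mu[2] - mu[1]) + (mu[4] - mu[3]))
    tau_ab = 0.5 * ((mu[4] - mu[2]) - (mu[3] - mu[1]))
    return tau_a, tau_b, tau_ab

def compose_propensity(p_b1, p_a1_given_b):
    """Build pr(T = t | x) for t = 1..4 from pr(B = 1 | x) and pr(A = 1 | B = b, x)."""
    probs = {}
    for t, (a, b) in LEVELS.items():
        p_b = p_b1 if b == 1 else 1.0 - p_b1
        p_a = p_a1_given_b[b] if a == 1 else 1.0 - p_a1_given_b[b]
        probs[t] = p_a * p_b
    return probs

def draw_treatment(p_b1, p_a1_given_b, rng):
    """Two-step draw following the decision tree: first B, then A given B."""
    b = 1 if rng.random() < p_b1 else 0
    a = 1 if rng.random() < p_a1_given_b[b] else 0
    return next(t for t, ab in LEVELS.items() if ab == (a, b))

true_effects = factorial_effects(ALPHA)

# Hypothetical probabilities for one individual
p_b1 = 0.55
p_a1_given_b = {0: 0.7, 1: 0.4}
pi = compose_propensity(p_b1, p_a1_given_b)

rng = random.Random(2013)
draws = [draw_treatment(p_b1, p_a1_given_b, rng) for _ in range(20000)]
freq = {t: draws.count(t) / len(draws) for t in LEVELS}
```

Here `true_effects` holds the three effects implied by α, `pi` is the composed 4-element propensity vector, and `freq` shows that the two-step Bernoulli draws reproduce those probabilities up to sampling noise.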

The simulated data for each individual consist of the vector $(y_i, a_i, b_i, x_i, \pi(x_i))$. For each proposed estimator, using both propensity score estimation models, the causal effects $\tau_A$, $\tau_B$ and $\tau_{AB}$ are estimated under each of the three true treatment assignment models. The performance of the estimators is evaluated by comparing the bias and mean squared error (MSE) for each effect. The bias for each effect is calculated as

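$$\mathrm{Bias} = \frac{1}{r}\sum_{q=1}^{r}\left(\hat{\tau}_k^{(q)} - \tau_k\right), \tag{21}$$

where $r$ is the number of replications in the study, $\hat{\tau}_k^{(q)}$ is the estimated causal effect $k$ in data set $q$, and $\tau_k$ is the true causal effect $k$, $k = A, B, AB$. Similarly, MSE is calculated as

$$\mathrm{MSE} = \frac{1}{r}\sum_{q=1}^{r}\left(\hat{\tau}_k^{(q)} - \tau_k\right)^2. \tag{22}$$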

For each simulation design, the two sample sizes N = 500 and N = 5000 are used. Within the field of causal inference in observational studies, a sample size of 500 observations is generally considered small. A sample size of 5000 is chosen to examine whether there are differences between small and large samples. Furthermore, r = 10,000 replications are performed for each design.
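Equations (21) and (22) translate directly into code. The following Python sketch is illustrative; the estimates in `tau_hat` are hypothetical values, not results from the study.

```python
def bias(estimates, true_effect):
    """Equation (21): average deviation of the replicate estimates from the true effect."""
    return sum(est - true_effect for est in estimates) / len(estimates)

def mse(estimates, true_effect):
    """Equation (22): average squared deviation of the estimates from the true effect."""
    return sum((est - true_effect) ** 2 for est in estimates) / len(estimates)

# Hypothetical estimates of tau_A = 12 from r = 4 replications
tau_hat = [11.5, 12.5, 12.0, 13.0]
b = bias(tau_hat, 12.0)
m = mse(tau_hat, 12.0)
```

For these values the bias is 0.25 and the MSE is 0.375. The MSE equals the variance of the estimates across replications (0.3125) plus the squared bias (0.0625), which is why the tables report both quantities.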

4.2 Results from the Simulation Study

The simulation results are presented in Table 1 and Table 2. The bias and MSE are multiplied by 100 since they would otherwise be inconveniently small to interpret. As mentioned above, the parameters in the models are chosen so that there is bias present in the effects calculated from the unconditional means, i.e., for the crude estimates, and it is this bias the estimators based on the different conditional means are supposed to avoid. Thus, the term "estimators" still refers to the proposed estimators, not the crude estimator. In the simulations where the treatment effects are set to be constant all estimates are unbiased, see Table 1. As expected, in the completely randomized simulation the crude estimates are also unbiased.
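The contrast between the crude and the weighted estimates can be illustrated with one simulated data set. The Python sketch below is not the thesis code: it uses the multinomial design with constant treatment effects from Section 4.1, weights by the true propensity scores rather than estimated ones, and assumes a normalized form of the IPW estimator and a half-scaled contrast for $\tau_A$; the estimators actually used are defined in Section 3.1 and may differ in detail.

```python
import math
import random

ALPHA = [-6.0, 6.0, 8.0, 16.0]  # intercepts for t = 1..4 (index 0..3)
BETA = [2.0, 2.0, 2.0, 2.0]     # constant treatment effects
GAMMA = [0.0, 0.3, 1.0, 0.8]    # multinomial assignment parameters

def propensity(x):
    """True multinomial logit assignment probabilities, Equation (19)."""
    scores = [math.exp(x * g) for g in GAMMA]
    total = sum(scores)
    return [s / total for s in scores]

def tau_a(mu):
    """Main effect of A from the four cell means (assumed half-scaled contrast)."""
    return 0.5 * ((mu[2] - mu[0]) + (mu[3] - mu[1]))

def simulate(n, rng):
    """Draw (outcome, treatment index, propensity of the received treatment)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        pi = propensity(x)
        t = rng.choices(range(4), weights=pi)[0]
        y = ALPHA[t] + BETA[t] * x + rng.gauss(0.0, 1.0)
        data.append((y, t, pi[t]))
    return data

def crude_means(data):
    """Unadjusted cell means."""
    sums, counts = [0.0] * 4, [0] * 4
    for y, t, _ in data:
        sums[t] += y
        counts[t] += 1
    return [s / c for s, c in zip(sums, counts)]

def ipw_means(data):
    """Cell means weighted by the inverse propensity, with normalized weights."""
    num, den = [0.0] * 4, [0.0] * 4
    for y, t, p in data:
        num[t] += y / p
        den[t] += 1.0 / p
    return [a / b for a, b in zip(num, den)]

rng = random.Random(1)
data = simulate(20000, rng)
crude = tau_a(crude_means(data))
ipw = tau_a(ipw_means(data))
```

In this design the crude estimate of $\tau_A$ lands roughly half a point above the true value of 12, in line with the crude bias reported in Table 1, while the weighted estimate stays close to 12.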

Worth noting is that the IPW estimator based on the multinomial propensity score (IPWm) gives the least accurate estimates for $\tau_A$ and $\tau_{AB}$ when the true treatment assignment is generated with the nested logit model. This result follows the results in Tchernis et al. (2005), where it is shown that the same applies for multivalued treatments. This is probably due to the IIA property of the multinomial logit model.

Regarding the efficiency of the estimators, in the case where the treatment effects are constant, the OLS estimator outperforms all the other estimators for N = 500. For N = 5000 this is less apparent. This result follows the theoretical expectations, since all assumptions of OLS are fulfilled in this case. The differences are small, but cannot be explained by simulation variance, since the simulation variances are small due to the large number of replicates, see Table 5 in Appendix A. Between the IPW estimators and the DR estimators, the DR estimators are slightly more efficient. Again, this is more apparent for N = 500 than for N = 5000. Even though the crude estimator is unbiased in the completely randomized case, all other estimators are more efficient, which indicates that gains in terms of power of tests in randomized experiments can be achieved.

The simulation results for the sets of data where the treatment effects vary are presented in Table 2. For the corresponding simulation variances, see Table 6 in Appendix A. Since variable treatment effects violate the OLS assumption of equal slopes for the covariate effect for all potential outcomes, the OLS estimates of the causal effects are biased. Furthermore, the OLS adjustment actually inflates the bias for the interaction effect $\tau_{AB}$ relative to the crude estimate in these cases. The slightly poorer performance of the IPWm estimator under the true nested treatment assignment, seen in the case with constant treatment effects, is present also when treatment effects vary; it has in fact increased. In this setting, where the linear model is misspecified, it becomes apparent that the DR estimators are robust against misspecifications.
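The source of the OLS bias can be sketched as follows. The sketch assumes that the regression estimator fits a separate intercept for each of the four treatment groups (equivalently A, B and A × B) together with one common slope for $X_i$; the exact specification is given in Section 3.1 and may differ. The symbols $\hat{\mu}_t^{\mathrm{OLS}}$, $\hat{b}$ and $b$ (the large-sample limit of the common slope) are introduced here for illustration only.

```latex
\begin{align*}
\hat{\mu}_t^{\mathrm{OLS}} &= \bar{Y}_t - \hat{b}\,\bar{X}_t, \\
E(\bar{Y}_t) &\approx \alpha_t + \beta_t\, E(X_i \mid T_i = t), \\
\hat{\mu}_t^{\mathrm{OLS}} &\approx \alpha_t + (\beta_t - b)\, E(X_i \mid T_i = t)
  \quad \text{in large samples.}
\end{align*}
```

Under these assumptions the remaining term vanishes either when all $\beta_t$ are equal, so that $b = \beta_t$ (constant treatment effects), or when $E(X_i \mid T_i = t) = 0$ for every group (complete randomization with a confounder centered at zero). Otherwise the term carries over into the contrasts that define $\tau_A$, $\tau_B$ and $\tau_{AB}$.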

Table 1: Simulation results. Constant treatment effects

True treatment model
              Multinomial Logit         Nested Logit            Compl. Random
Estimator       τA      τB     τAB      τA      τB     τAB      τA      τB     τAB

N = 500, Bias × 100
Crude        48.20    3.31  -15.78   88.47    2.84  -26.13    0.20    0.23   -0.13
OLS          -0.09   -0.03    0.09    0.05    0.02    0.04    0.13    0.10   -0.00
IPWm         -0.01    0.01    0.02    1.05    0.01   -2.46    0.13    0.10    0.00
IPWn         -0.03    0.01    0.06    0.28    0.03   -0.18    0.13    0.10    0.00
DRm          -0.07   -0.00    0.08    0.04    0.06   -0.02    0.13    0.10   -0.00
DRn          -0.07   -0.00    0.08    0.04    0.06   -0.00    0.13    0.10   -0.00

N = 500, MSE × 100
Crude        25.07    1.92    4.33   80.00    1.75    8.54    1.87    1.87    1.88
OLS           0.83    0.81    0.81    0.96    0.79    0.81    0.82    0.80    0.80
IPWm          0.86    0.91    0.91    1.12    1.24    1.31    0.82    0.81    0.81
IPWn          0.86    0.92    0.87    1.13    1.30    1.09    0.82    0.81    0.81
DRm           0.85    0.85    0.86    1.03    0.99    1.01    0.82    0.80    0.81
DRn           0.85    0.86    0.86    1.05    1.00    1.03    0.82    0.80    0.81

N = 5000, Bias × 100
Crude        48.38    3.08  -15.90   88.36    2.73  -26.24   -0.03   -0.01   -0.07
OLS           0.02   -0.03    0.01   -0.03   -0.01   -0.05    0.01    0.01   -0.07
IPWm          0.02   -0.04    0.02    0.74   -0.02   -2.26    0.01    0.01   -0.07
IPWn          0.02   -0.04    0.01   -0.00   -0.01   -0.09    0.01    0.01   -0.07
DRm           0.02   -0.03    0.02   -0.03    0.01   -0.07    0.01    0.01   -0.07
DRn           0.02   -0.03    0.02   -0.03    0.01   -0.07    0.01    0.01   -0.07

N = 5000, MSE × 100
Crude        23.55    0.28    2.70   78.25    0.25    7.05    0.19    0.18    0.19
OLS           0.08    0.08    0.08    0.10    0.08    0.08    0.08    0.08    0.08
IPWm          0.09    0.09    0.08    0.11    0.12    0.17    0.08    0.08    0.08
IPWn          0.09    0.09    0.09    0.11    0.13    0.11    0.08    0.08    0.08
DRm           0.09    0.08    0.09    0.10    0.10    0.10    0.08    0.08    0.08
DRn           0.09    0.08    0.09    0.10    0.10    0.10    0.08    0.08    0.08

τA = main effect for factor A, τB = main effect for factor B, τAB = interaction effect.
m denotes that the propensity scores are estimated with multinomial logistic regression.
n denotes that the propensity scores are estimated with nested logit regression.
Replicates = 10,000; β = (2, 2, 2, 2).

Finally, the efficiency of the OLS estimator is now rather poor in comparison to the other estimators, while the DR estimators are the most efficient. Worth noting here is the fact that the MSE of the OLS estimator is larger than for the other estimators even in the completely randomized case. This implies that estimators that condition on propensity scores might improve the power of tests also in completely randomized experiments.

Table 2: Simulation results. Variable treatment effects

True treatment model
              Multinomial Logit         Nested Logit            Compl. Random
Estimator       τA      τB     τAB      τA      τB     τAB      τA      τB     τAB

N = 500, Bias × 100
Crude       134.50   29.96    4.07  247.60   45.30   17.00    0.58    0.74    0.14
OLS         -16.93   20.37   54.20  -33.68   36.80  100.90    0.10    0.21    0.08
IPWm          0.16    0.01   -0.00    0.31    0.77   -6.41   -0.01    0.14   -0.13
IPWn          0.14    0.06   -0.05    0.44    0.38   -0.21   -0.01    0.14   -0.12
DRm           0.07    0.05   -0.06    0.01    0.11    0.02    0.02    0.15   -0.09
DRn           0.07    0.05   -0.06    0.01    0.11    0.02    0.02    0.15   -0.09

N = 500, MSE × 100
Crude       193.22   21.42   12.50  624.59   32.04   14.62   13.11   13.03   12.88
OLS           5.45    6.46    3.67   14.65   15.62  103.97    2.34    2.40    2.37
IPWm          0.97    2.61    1.42    1.50    4.17    3.67    0.86    2.23    1.01
IPWn          0.94    2.61    1.07    1.50    4.25    1.65    0.85    2.23    0.99
DRm           0.87    2.22    1.00    1.04    2.32    1.16    0.82    2.18    0.97
DRn           0.87    2.22    1.00    1.06    2.34    1.18    0.82    2.18    0.97

N = 5000, Bias × 100
Crude       134.59   29.99    4.70  247.59   45.74   17.21    0.11    0.02    0.41
OLS         -16.73   20.25   54.13  -33.55   36.72  100.75    0.13   -0.00    0.00
IPWm          0.03   -0.06   -0.00   -0.07    0.58   -6.06   -0.01    0.01    0.02
IPWn          0.02   -0.01   -0.00    0.05    0.22    0.03   -0.02    0.01    0.01
DRm           0.01   -0.04   -0.02   -0.00    0.07    0.03   -0.02    0.02    0.01
DRn           0.01   -0.04   -0.01    0.00    0.07    0.03   -0.02    0.02    0.01

N = 5000, MSE × 100
Crude       182.38   10.23    1.46  614.17   22.06    4.13    1.25    1.27    1.29
OLS           3.06    4.32   29.52   11.57   13.69  101.71    0.23    0.23    0.23
IPWm          0.09    0.25    0.14    0.14    0.41    0.61    0.08    0.21    0.10
IPWn          0.09    0.25    0.10    0.14    0.41    0.15    0.08    0.21    0.10
DRm           0.09    0.22    0.10    0.10    0.23    0.11    0.08    0.21    0.09
DRn           0.09    0.22    0.10    0.10    0.23    0.11    0.08    0.21    0.09

τA = main effect for factor A, τB = main effect for factor B, τAB = interaction effect.
m denotes that the propensity scores are estimated with multinomial logit regression.
n denotes that the propensity scores are estimated with nested logit regression.
Replicates = 10,000; β = (5, 8, 3, 9).

5 Empirical Study

An empirical study is conducted to illustrate the proposed models and methods using data from Swedish compulsory schools. In a summarizing report from 2009 the Swedish National Agency for Education states that many studies show that social factors such as sex, ethnicity and parents’ educational background influence students’ educational outcomes. Moreover, they argue that the impact of socioeconomic background is larger on school level than on individual level, and that a more homogeneous composition of students strengthens the effect. The latter is probably due to peer effects and teachers’ expectations (Skolverket, 2009).

On the Swedish National Agency for Education’s website² they present the database SALSA, which is based on a model that compares students’ average grades in the ninth grade between municipalities and schools after controlling for social background factors. It is reported that the level of parents’ education has the largest impact and that the coefficient of determination of the model is around 40%. However, no measures of the magnitude of the coefficients can be found, and their model therefore cannot be used as reference.

Knowing which factors influence students’ results is a starting point in working towards equal possibilities for all students. It is therefore well motivated to seek to estimate causal effects of some of these factors. To achieve this goal, the mechanisms of how the factors influence the results must of course be known, but that question is beyond the scope of this case study. In the next sections the data set is described and the results presented.

5.1 Data

Data on school level can be retrieved from the Swedish National Agency for Education’s website. The data used here are collected for 2012. To employ the methods proposed, the treatments must be factors with two levels each. The two treatments chosen for this study are a factor based on the proportion of students with parents with higher education and a factor based on the proportion of students with Swedish background. Dichotomizing continuous variables can be problematic, with loss of information and power as well as uncertainty in defining the cut point (Royston et al., 2006). However, based on the fact that it has been shown that more homogeneous compositions strengthen the effect, it can be motivated to regard dichotomized levels. The thresholds for the shares are not set to optimize the effect in the data at hand, but are instead set somewhat arbitrarily, such that the data are reasonably balanced and such that the two levels are reasonable to interpret. As is described below, the cut points are set in the neighborhood of the means and medians of the respective variable, which are specific for the data at hand, and this might lead to loss of generality. However, there has been no a priori testing of cut point values to optimize the result in this specific data set. Again, the dichotomization is done foremost to illustrate the use of the proposed methods. The question at hand is thus: Do social factors such as parents’ education and students’ Swedish background have an effect on students’ performance?

All variables are presented in Table 3. The first part of the table contains descriptive statistics of the continuous variables, while the second part covers the factors. The outcome in the study, used as a measure of students’ performance, is the variable Average Grades (AG). The two variables Parents’ Education (PE) and Swedish Background (SB) are dichotomized, and thus appear as factors (FPE and FSB) in the second part of Table 3. Below follows a short description of all variables.

² www.skolverket.se

Table 3: Descriptive statistics

Continuous variables

Variable name              Label    Min.    Max.    Mean     Sd.     25%  Median     75%
Average Grades             AG         27  289.70  211.52   24.26  198.40  210.30  225.30
Parents’ Education*        PE       7.00  100.00   49.32   17.06   36.00   47.00   61.00
Swedish Background*        SB       2.00  100.00   80.43   20.81   76.00   88.00   94.00
No. of Students            NS      10.00  310.00   68.28   41.20   35.00   62.00   94.00
Females*                   Fe       8.00   95.00   49.25    9.79   44.00   49.00   54.00
Teachers’ Qualifications*  TQ      20.10  100.00   83.70   11.76   77.50   85.90   92.10
Students per Teacher       ST       0.00   53.80   12.08    3.06   10.40   11.90   13.60

Factors

Variable name              Label    Min.    Max.   Prop.  Comment
Parents’ Education         FPE         0       1    0.45  1 if Parents’ Education ≥ 50%
Swedish Background         FSB         0       1    0.61  1 if Swedish Background ≥ 85%
Private School             PS          0       1    0.22  1 = Private school

* Values in percent. n = 1401.

[Figure 4 image: histogram with average grades (axis from 50 to 300) on the horizontal axis.]

Figure 4: Histogram of students’ average grades.

Average Grades (AG) is the average final grade for the ninth graders in each school. The final grade for each student is the sum of the individual’s 16 best subject grades where G (pass) = 10p, VG = 15p and MVG = 20p. Only students with at least the grade G in at least one subject are included. Students who received grades from other systems are not included (for instance Waldorf schools). Three schools included in the data have exceptionally low average grades (below 100), with a minimum value of 27 credits on average. However, this does not seem to affect the mean (mean = 211.52, median = 210.30). Furthermore, the average grades are roughly symmetrically distributed, see Figure 4.

Parents’ Education (PE) is defined as the share of students whose parents’ highest education includes at least 30 credits from college or university studies, or attendance at a four year technical program in upper secondary school.

Treatment factor for Parents’ Education (FPE) is defined as 1 if PE ≥ 50%. The chosen threshold of 50% is very close to the mean of 49.32 %, and comprises 45% of the schools included.

Swedish Background (SB) is defined as the share of students who are born in Sweden with at least one parent also born in Sweden.

Treatment factor for Swedish Background (FSB) is defined as 1 if SB ≥ 85%. The cut point for this variable is chosen to be 85%, which is between the mean of 80.43% and the median of 88%. The treatment comprises 61% of the included schools.

No. of Students (NS) is the number of students in each school.

Females (Fe) is the share of females in each school expressed in percent.

Teachers’ Qualification (TQ) is the share of teachers in each school with a higher educational degree.

Students per Teacher (ST) is the student-teacher ratio at each school.

Private School (PS) is defined as 1 if private school.

The original data set retrieved from the Swedish National Agency for Education consists of 1664 schools that teach up to the ninth grade. Data from schools with fewer than 10 students are not reported and these schools are excluded from the analysis. The variables concerning teachers have been corrected for non-response in some municipalities. After schools with missing data have been excluded the data set consists of 1401 observations. For more details regarding definitions and missing data, please visit the Swedish National Agency for Education’s website³.

³ See footnote 2.

The study is conducted as follows. First, the common approach of fitting a regression model with OLS is performed. Two models are proposed, one without interaction between the treatment factors and the confounders, and one with these additional interactions. In this step the models and included variables are assessed. In the second step the models and methods presented in Section 3 are used to evaluate the causal effects of interest. The results are compared.

For the empirical study the standard significance level of 5% will be employed, and 95% bootstrap percentile intervals will be used for inference on the causal treatment effects. For each estimator, 10,000 bootstrap samples are drawn with replacement and estimates for each effect are calculated. As described by for instance Funk et al. (2011), the intervals are the 2.5% and 97.5% limits of the distribution of the bootstrap estimates. Bootstrapping as a method of inference is commonly used in the literature. Furthermore, Funk et al. (2011) recommend that bootstrap confidence intervals should be reported for the doubly robust estimator because it provides nominal coverage even for small sample sizes. For the bootstrap procedure, the package "boot" in R has been used.

5.2 Results from Empirical Study

Figure 5 displays the crude interaction plot between FPE and FSB. The two lines with positive slopes show the mean of average grades for the two levels of SB at the two different levels of PE. The horizontal line marks the overall mean. The fact that the two lines are separated and have a slope different from zero indicates that both factors have a crude effect, and the fact that the two lines are non-parallel indicates the presence of an interaction effect. The question then is if the effects are significant and, if so, how large they are.

[Figure 5 image: mean of average grades (vertical axis, 195 to 225) against the factor of parents’ education (horizontal axis, 0 and 1), with one line for each level of FSB (0 and 1).]

Figure 5: Interaction plot of FPE versus FSB. The horizontal line represents the overall mean. Positive, separated slopes indicate crude effects of FPE and FSB. Non-parallel lines indicate presence of an interaction effect.

First, the following model is fitted using OLS:

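$$\begin{aligned}
AG_i = {} & \beta_0 + \beta_A FPE_i + \beta_B FSB_i + \beta_{X_1} NS_i + \beta_{X_2} Fe_i + \beta_{X_3} TQ_i + \beta_{X_4} ST_i + \beta_{X_5} PS_i \\
& + \beta_{AB}\, FPE_i \times FSB_i + \varepsilon_i
\end{aligned} \tag{Model 1}$$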

where FPE and FSB are the treatment factors of interest, and the remaining five variables are confounders. The estimates of Model 1 are presented in Table 7, Appendix A. The overall model fit is good, with an adjusted $R^2$ of 40%. Furthermore, in Figure 8, Appendix A, a plot of residuals versus fitted values is presented. The randomness of the residuals indicates that the model fits the data well. Also worth noting is that the coefficients for the treatment factors and their interaction are significant, indicating that the causal effects are significant.

However, the fact that treatment effects might not be constant for all individuals implies that there might be significant interaction effects between the treatment factors and the confounders. Using backward selection based on the Akaike Information Criterion (AIC), Model 1 is extended with interaction effects between the confounders and both treatment factors as well as between the confounders and the interaction of the treatment factors. The selection procedure results in Model 2, defined as:

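$$\begin{aligned}
AG_i = {} & \beta_0 + \beta_A FPE_i + \beta_B FSB_i + \beta_{X_1} NS_i + \beta_{X_2} Fe_i + \beta_{X_3} TQ_i + \beta_{X_4} ST_i + \beta_{X_5} PS_i \\
& + \beta_{AB}\, FPE_i \times FSB_i + \beta_{AX_1} FPE_i \times NS_i + \beta_{AX_3} FPE_i \times TQ_i + \beta_{AX_4} FPE_i \times ST_i \\
& + \beta_{BX_1} FSB_i \times NS_i + \beta_{BX_2} FSB_i \times Fe_i + \beta_{BX_3} FSB_i \times TQ_i + \beta_{BX_4} FSB_i \times ST_i \\
& + \beta_{ABX_1} FPE_i \times FSB_i \times NS_i + \beta_{ABX_3} FPE_i \times FSB_i \times TQ_i \\
& + \beta_{ABX_4} FPE_i \times FSB_i \times ST_i + \varepsilon_i.
\end{aligned} \tag{Model 2}$$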

The estimated coefficients of Model 2 are presented in Table 8. Now the adjusted $R^2$ has improved by 2 percentage points to 42% and there are slight changes in the residuals, shown in Figure 9, Appendix A. The inclusion of more interaction terms came at the cost of larger standard errors for the treatment factors and their interaction, probably due to multicollinearity. Furthermore, in Model 2 both the treatment factor FSB and its interaction with FPE have insignificant regression coefficients.

When comparing the information criteria between the two models, the model with the lowest value of the criterion is preferable. The AIC is lower for Model 2, which is not surprising since the selection procedure is based on AIC. However, the Bayesian information criterion (BIC) and the Hannan-Quinn information criterion (HQ) support one model each. Hence, no model has stronger support than the other. It is then reasonable to prefer Model 1, based on the law of parsimony (Dobson and Barnett, 2008).

The fact that the linear regression demands constant treatment effects to give unbiased estimates of the causal effects and the fact that it can be difficult to determine which model is correctly specified are two drawbacks of the method.
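For reference, the three criteria are commonly written as below. These are the standard textbook forms, with $\hat{L}$ the maximized likelihood, $k$ the number of estimated parameters and $n$ the number of observations; which exact variants were computed for Models 1 and 2 is not stated, so the forms should be read as an assumption.

```latex
\begin{align*}
\mathrm{AIC} &= -2\ln\hat{L} + 2k, \\
\mathrm{BIC} &= -2\ln\hat{L} + k\ln n, \\
\mathrm{HQ}  &= -2\ln\hat{L} + 2k\ln(\ln n).
\end{align*}
```

With $n = 1401$ the penalty per additional parameter is 2 for AIC, about 3.96 for HQ and about 7.24 for BIC. Criteria that penalize the ten extra interaction terms of Model 2 differently can therefore rank the two models differently.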

[Figure 6 image: four histogram panels with frequency on the vertical axis, one each for Pr(T=1|X=x), Pr(T=2|X=x), Pr(T=3|X=x) and Pr(T=4|X=x).]

Figure 6: Histograms of the propensity scores estimated with multinomial logit regression

[Figure 7 image: four histogram panels with frequency on the vertical axis, one each for Pr(T=1|X=x), Pr(T=2|X=x), Pr(T=3|X=x) and Pr(T=4|X=x).]

Figure 7: Histograms of the propensity scores estimated with nested logit regression

Instead, the methods proposed in Section 3 are more flexible for evaluating the causal effects of interest. To be able to make causal interpretations based on the estimators, Assumption 3 of weak unconfoundedness and Assumption 4 of generalized overlap must hold. These assumptions cannot be tested; however, the overlap assumption can be graphically assessed by plotting the propensity scores. What is important is that no propensity scores equal zero or one, meaning that all four treatments are possible for each individual given its observed values of the confounders. Otherwise it would not be possible to consider the non-observed treatments as counterfactuals for those individuals. Furthermore, as mentioned previously, for the methods to work properly, not too many probabilities can be close to zero, or the estimates based on the propensity scores become unstable and their variances inflated.

Histograms of the propensity scores estimated with multinomial logistic regression are shown in Figure 6 and histograms of the propensity scores estimated with nested logistic regression are presented in Figure 7. As is shown, there are no peaks at zero or one, and the distributions of the propensity scores for both models are reasonably well spread. Hence the overlap assumption seems to be fulfilled.
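The graphical check can be complemented by a simple numerical summary. The Python sketch below is illustrative only: the function name, the threshold `eps = 0.05` for "close to zero or one" and the three propensity vectors are hypothetical and are not taken from the study.

```python
def overlap_summary(propensity_rows, eps=0.05):
    """Per treatment: min, max and share of scores within eps of 0 or 1."""
    n_groups = len(propensity_rows[0])
    summary = []
    for t in range(n_groups):
        column = [row[t] for row in propensity_rows]
        extreme = sum(1 for p in column if p < eps or p > 1.0 - eps)
        summary.append({
            "treatment": t + 1,
            "min": min(column),
            "max": max(column),
            "share_extreme": extreme / len(column),
        })
    return summary

# Hypothetical estimated propensity vectors for three schools
rows = [
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.02, 0.18, 0.30, 0.50],
]
report = overlap_summary(rows)
```

In this toy input one of the three scores for treatment 1 lies below the threshold, so a third of the observations would receive a very large weight for that treatment; the other treatments show no extreme scores.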

Lastly, the assumption of SUTVA must be fulfilled, meaning in this case that the composition of students regarding parents’ education and Swedish background in one school does not affect the average grades of the students in another school. There is no reason to believe such dependencies are present, and hence SUTVA is also considered fulfilled.

For the data at hand, equations (8) to (10) are estimated with the estimator based on linear regression and the IPW and DR estimators. Furthermore, the IPW and DR estimators are calculated with propensity scores estimated with both multinomial and nested logistic regression. The causal effects from the OLS regressions are calculated using the estimated coefficients in both Model 1 and Model 2. The results are presented in Table 4. All estimates have the same sign as the crude estimate. For $\tau_A$ and $\tau_B$ IPW and DR produce smaller estimates of the causal effect, while for $\tau_{AB}$, which (somewhat surprisingly) is negative, the absolute values of the IPW and DR estimates are larger. The estimates produced by Model 2 deviate remarkably from the others, and have very wide confidence intervals, probably because of the inflated standard errors of the estimates in the linear regression. For the other estimates the confidence intervals are much narrower, with OLS Model 1 having the narrowest interval. The lengths of the intervals indicate that treatment effects might be constant.

The crude estimate of the causal effect of PE ($\tau_A$) of 28.46 points is larger than the upper limit of all confidence intervals except that of Model 2. The IPW and DR estimates of the same effect are between 21.4 and 22.0 points, all with intervals spanning roughly 15.5 to 25.5 points. Hence the models confirm that parents’ education has a large causal impact on students’ grades.

For the causal effect of the second treatment factor, SB ($\tau_B$), the estimates are smaller and also deviate less among the estimators. The OLS estimate calculated from Model 1 of 5.5 points is larger than the crude estimate of 4.85 points, and is significantly different from zero at the 95% confidence level, with an interval ranging from 3.30 to 7.76, while the IPW and DR estimates are smaller than the crude estimate, with intervals that include zero, ranging from around -2.6 to 5.8 points for the IPW estimates and -4 to 6 points for the DR estimates. Hence, the results for $\tau_B$ are not conclusive. Again, the OLS estimate from Model 2 stands out, now with a confidence interval that ranges from -24.49 to 29.64 points.

Table 4: Estimated causal effects of parents’ educational background and students’ Swedish background on average grade

Estimator   τA                        τB                        τAB
Crude       28.46 ···                 4.85 ···                  -5.66 ···
OLS 1*      23.58 (20.67; 25.42)      5.50 (3.30; 7.76)         -5.42 (-7.47; -3.26)
OLS 2**     49.51 (26.51; 70.21)      3.08 (-24.49; 29.64)      -6.01 (-28.76; 14.05)
IPWm        21.36 (16.92; 25.19)      1.95 (-2.69; 5.85)        -6.92 (-11.63; -2.93)
IPWn        21.54 (17.24; 25.37)      1.95 (-2.54; 5.73)        -6.83 (-11.46; -2.85)
DRm         21.83 (15.45; 25.39)      2.41 (-4.06; 6.07)        -6.61 (-12.72; -2.77)
DRn         22.02 (16.07; 25.55)      2.34 (-3.75; 5.92)        -6.53 (-12.53; -2.56)

* Model 1: OLS without interactions between treatment factors and confounders.
** Model 2: OLS with interactions between treatment factors and confounders.
95% bootstrap percentile intervals are given in parentheses.
τA = effect of parents’ educational background, τB = effect of students’ Swedish background, τAB = interaction effect.

Lastly, the estimates of the interaction effect of PE and SB (τAB) are negative for all estimators. Furthermore, the IPW and DR estimates deviate from each other by less than one point, and the OLS Model 1 estimate deviates only slightly more. These estimates of τAB lie between roughly -5.4 and -6.9 points, with intervals that do not include zero. Hence these estimators produce interaction effects that are significant at the 95% confidence level: there seems to be a significant, but small, negative interaction effect between the two treatment factors. However, OLS Model 2 again stands out, with a 95% confidence interval that is very wide and includes zero.
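The intervals in Table 4 are 95% bootstrap percentile intervals, and an effect is read as significant when its interval excludes zero. The following is a minimal sketch of that procedure for a generic estimator, not the code used in the thesis: the function names, the number of resamples, the seed, and the toy data are all assumptions made for illustration.

```python
import random
import statistics


def percentile_interval(data, estimator, n_boot=2000, level=0.95, seed=0):
    """Bootstrap percentile interval for estimator(data) (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(data)
    stats = []
    for _ in range(n_boot):
        # Resample the units with replacement and re-compute the estimate
        resample = [data[rng.randrange(n)] for _ in range(n)]
        stats.append(estimator(resample))
    stats.sort()
    alpha = 1.0 - level
    lo_idx = int((alpha / 2) * (n_boot - 1))
    hi_idx = int((1 - alpha / 2) * (n_boot - 1))
    return stats[lo_idx], stats[hi_idx]


def excludes_zero(lower, upper):
    """True when the interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0


# Toy data: hypothetical values, not the school data
toy = [201.0, 215.5, 198.2, 230.1, 209.7, 222.4, 205.3, 218.8, 211.0, 226.9]
lower, upper = percentile_interval(toy, statistics.mean)
print(lower, upper)

# Intervals for tau_B as reported in Table 4
print(excludes_zero(3.30, 7.76))    # OLS 1
print(excludes_zero(-2.69, 5.85))   # IPWm
```

Applied to the τB intervals of Table 4, the check reproduces the reading in the text: the OLS 1 interval excludes zero, whereas the IPWm interval does not.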

6 Conclusion

As described in the introduction, researchers might want to estimate the causal effects of two binary treatments in observational studies. For this purpose, the causal inference framework is extended to a 2^2 factorial design with proposed estimators for the two main effects and the interaction effect. The estimation rests on the notion of potential outcomes and on a generalized propensity score to assure unconfoundedness. Other estimation methods exist besides those chosen here; the estimators assessed are an estimator based on linear regression and the IPW and DR estimators, and the estimation of the propensity score is restricted to two methods.

The performance of the estimators under each regime is evaluated in terms of bias and MSE using Monte Carlo simulations. The results show that under constant treatment effects all estimators produce unbiased estimates, whereas under variable treatment effects the OLS estimator is biased while the IPW and DR estimators remain unbiased. These results are consistent with what is expected from the single treatment case, and hence imply that the generalization of the method is valid.

Regarding the MSE, the simulation results are also in line with what is expected from the single treatment case. In the case of constant treatment effects the OLS estimator produces the estimates with the smallest MSE, and among the semi-parametric estimators the DR estimator is more efficient than the IPW estimator. When the linear model is misspecified, i.e., when the equal slope assumption of the potential outcomes does not hold, the OLS estimator is not only biased but also has a larger MSE; in that case the DR estimator is the most efficient. Altogether, the results show that the extension of the potential outcomes framework and the single treatment model into a 2^2 factorial model is feasible. Furthermore, the semi-parametric estimators of the effects are still more flexible, in that they allow more general model specifications of the potential outcomes while still being robust.
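To make the two performance measures concrete, the sketch below computes Monte Carlo bias and MSE for the crude difference-in-means estimator of τA in a completely randomized 2^2 design. It is only an illustration of the evaluation criteria: the potential-outcome means, the noise distribution, the sample size, and the number of replicates are assumptions and do not reproduce the simulation setup of the thesis.

```python
import random

# Hypothetical potential-outcome means for groups 1..4 (not the thesis design)
TRUE_MU = {1: 10.0, 2: 12.0, 3: 14.0, 4: 15.0}


def tau_a(mu):
    """Main effect of factor A as a contrast of the four group means."""
    return (mu[3] + mu[4]) / 2 - (mu[1] + mu[2]) / 2


def one_replicate(n, rng):
    """Crude (difference-in-means) estimate of tau_A for one simulated sample."""
    sums = {t: 0.0 for t in TRUE_MU}
    counts = {t: 0 for t in TRUE_MU}
    for i in range(n):
        t = i % 4 + 1  # balanced assignment that does not depend on the outcomes
        sums[t] += TRUE_MU[t] + rng.gauss(0.0, 1.0)
        counts[t] += 1
    return tau_a({t: sums[t] / counts[t] for t in TRUE_MU})


def mc_bias_mse(n=400, replicates=200, seed=1):
    """Monte Carlo bias and MSE of the crude estimator of tau_A."""
    rng = random.Random(seed)
    truth = tau_a(TRUE_MU)
    estimates = [one_replicate(n, rng) for _ in range(replicates)]
    bias = sum(e - truth for e in estimates) / replicates
    mse = sum((e - truth) ** 2 for e in estimates) / replicates
    return bias, mse


bias, mse = mc_bias_mse()
print(bias, mse)
```

Because assignment here is unrelated to the outcomes, the crude estimator should show a bias close to zero; under confounded assignment the same two quantities would be computed for each of the adjusted estimators.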

Two methods for estimating the generalized propensity score are utilized, namely the multinomial and the nested logistic models. These two are also used as the true treatment assignment mechanism in the data simulations, together with a completely randomized treatment assignment mechanism for comparison. The difference between the two models is that the nested logit relaxes the IIA assumption, which makes it more flexible. Tchernis et al. (2005) recommend more flexible models for causal inference in multivalued treatment settings. The results from the simulation study are in line with their results, since the IPW estimator based on the multinomial propensity score performs worse in terms of both bias and efficiency when the true treatment assignment follows the nested logit model. The DR estimator, on the other hand, seems less sensitive in all aspects.

The empirical results seem to be in line with the simulation results. All estimates have the same sign as the crude estimate. The IPW and DR estimates deviate by less than one point from each other, while the OLS 1 estimates lie closer to the crude estimates. Furthermore, the OLS 1 estimates of τB and τAB are both larger than the crude estimates, while the semi-parametric estimates are smaller. Moreover, the OLS 1 estimate of τB has a 95% confidence interval that does not cover zero, while the IPW and DR intervals for this effect do. The OLS 2 estimator performs poorly both in terms of the relation of its point estimates to the crude estimates (especially for τA) and, foremost, in terms of precision as expressed by the width of the confidence intervals.

The results from the classic linear regression are inconclusive when it comes to determining whether the treatment effects are constant or variable. If they are variable, the OLS estimator cannot remove all the bias. This might cause the differences between the estimates from the OLS 1 estimator and the semi-parametric estimators. That in turn would imply that the effect τB, which is significant using OLS 1, is in fact insignificant. However, for all effects the OLS 1 estimator produces narrower confidence intervals than the semi-parametric estimators, while the DR estimator produces the widest intervals. Since the DR estimator is expected to be the most efficient and the OLS 1 estimator the least efficient under variable treatment effects, these results suggest that treatment effects might be constant. If that is the case, the OLS 1 estimator produces the most efficient estimates, indicating that τB is significant. Since the true treatment effects are not known, but both the theoretical and the simulation results are clear in terms of efficiency, it might be preferable to rely on the OLS 1 estimator in this case. It could also be that this result is due to a particular model specification not included in the simulation. Hence the situation merits further analysis.

In summary, the results show that it is possible to extend the potential outcomes framework to a 2^2 factorial model, with results that remain consistent with the single treatment case. They also show that there are advantages in using semi-parametric estimators to estimate the causal effects of interest, especially when there are doubts about the linear model specification. However, it is not clear whether, and if so how, other types of model misspecification affect the estimates. Hence more extensive research is needed to draw definitive conclusions in this regard. Furthermore, the results support the suggestion of Tchernis et al. (2005) that the choice of model for estimating the generalized propensity score might have an impact on the estimates, especially for the IPW estimator. It is thus preferable to use flexible discrete choice models for this purpose.

References

Dasgupta, T., N. S. Pillai, and D. B. Rubin (2012). Causal inference from 2^k factorial designs using the potential outcomes model. Unpublished Manuscript. Harvard University, Cambridge.

Dobson, A. J. and A. G. Barnett (2008). An Introduction to Generalized Linear Models (3 ed.). CRC Press.

Funk, M. J., D. Westreich, C. Wiesen, T. Stürmer, M. A. Brookhart, and M. Davidian (2011). Doubly robust estimation of causal effects. American Journal of Epidemiology 173(7), 761–767.

Greene, W. H. (2008). Econometric Analysis (6 ed.). Pearson Prentice Hall.

Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66, 315–331.

Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association 81(396), 945–960.

Imai, K. and D. A. van Dyk (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association 99(467), 854–866.

Imbens, G. W. and J. M. Wooldridge (2008). Recent developments in the econometrics of program evaluation. NBER Working Paper.

Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika 87(3), 706–710.

Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. The Review of Economics and Statistics 86(1), 4–29.

Lunceford, J. K. and M. Davidian (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23(19), 2937–2960.

Montgomery, D. C. (2001). Design and Analysis of Experiments (5 ed.). John Wiley and Sons.

Neyman, J. (1923). Sur les applications de la théorie des probabilités aux experiences agricoles: Essai des principes. Roczniki Nauk Rolniczych X, 1–51. In Polish, English translation by D. Dabrowska and T. Speed in Statistical Science, 5, 465–472, 1990.

Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55.

Royston, P., D. G. Altman, and W. Sauerbrei (2006). Dichotomizing continuous predictors in multiple regression: a bad idea. Statistics in Medicine 25, 127–141.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), 688–701.

Skolverket (2009). Vad påverkar resultaten i svensk grundskola? Kunskapsöversikt: sammanfattande analys, Skolverket.

Tchernis, R., M. Horvitz-Lennon, and S.-L. Normand (2005). On the use of discrete choice models for causal inference. Statistics in Medicine 24(14), 2197–2212.

Waernbaum, I. (2008). Covariate selection and propensity score specification in causal inference. Ph.D. thesis, Umeå.

Appendix A Tables and Graphs

Table 5: Monte Carlo variances. Constant treatment effects

True treatment model
            Multinomial Logit         Nested Logit              Compl. Random
Estimator   τA      τB      τAB       τA      τB      τAB       τA      τB      τAB

N = 500, MC variance for bias × 100
Crude       0.0002  0.0002  0.0002    0.0002  0.0002  0.0002    0.0002  0.0002  0.0002
OLS         0.0001  0.0001  0.0001    0.0001  0.0001  0.0001    0.0001  0.0001  0.0001
IPWm        0.0001  0.0001  0.0001    0.0001  0.0001  0.0001    0.0001  0.0001  0.0001
IPWn        0.0001  0.0001  0.0001    0.0001  0.0001  0.0001    0.0001  0.0001  0.0001
DRm         0.0001  0.0001  0.0001    0.0001  0.0001  0.0001    0.0001  0.0001  0.0001
DRn         0.0001  0.0001  0.0001    0.0001  0.0001  0.0001    0.0001  0.0001  0.0001

N = 500, MC variance for MSE × 100
Crude       0.0002  0.0000  0.0000    0.0005  0.0000  0.0001    0.0000  0.0000  0.0000
OLS         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
IPWm        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
IPWn        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
DRm         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
DRn         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000

N = 5000, MC variance for bias × 100
Crude       0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
OLS         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
IPWm        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
IPWn        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
DRm         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
DRn         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000

N = 5000, MC variance for MSE × 100
Crude       0.0000  0.0000  0.0000    0.0001  0.0000  0.0000    0.0000  0.0000  0.0000
OLS         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
IPWm        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
IPWn        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
DRm         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
DRn         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000

τA = Main effect for factor A, τB = Main effect for factor B, τAB = Interaction effect
m denotes that the propensity scores are estimated with multinomial logit regression
n denotes that the propensity scores are estimated with nested logit regression
β = (2, 2, 2, 2)
replicates = 10,000

Table 6: Monte Carlo variances. Variable treatment effects

True treatment model
            Multinomial Logit         Nested Logit              Compl. Random
Estimator   τA      τB      τAB       τA      τB      τAB       τA      τB      τAB

N = 500, MC variance for bias × 100
Crude       0.0012  0.0012  0.0012    0.0012  0.0012  0.0012    0.0013  0.0013  0.0013
OLS         0.0003  0.0002  0.0002    0.0003  0.0002  0.0002    0.0002  0.0002  0.0002
IPWm        0.0001  0.0003  0.0001    0.0001  0.0004  0.0003    0.0001  0.0002  0.0001
IPWn        0.0001  0.0003  0.0001    0.0002  0.0004  0.0002    0.0001  0.0002  0.0001
DRm         0.0001  0.0002  0.0001    0.0001  0.0002  0.0001    0.0001  0.0002  0.0001
DRn         0.0001  0.0002  0.0001    0.0001  0.0002  0.0001    0.0001  0.0002  0.0001

N = 500, MC variance for MSE × 100
Crude       0.0091  0.0008  0.0003    0.0284  0.0012  0.0004    0.0003  0.0003  0.0003
OLS         0.0000  0.0000  0.0003    0.0002  0.0001  0.0009    0.0000  0.0000  0.0000
IPWm        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
IPWn        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
DRm         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
DRn         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000

N = 5000, MC variance for bias × 100
Crude       0.0001  0.0001  0.0001    0.0001  0.0001  0.0001    0.0001  0.0001  0.0001
OLS         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
IPWm        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
IPWn        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
DRm         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
DRn         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000

N = 5000, MC variance for MSE × 100
Crude       0.0009  0.0000  0.0000    0.0028  0.0001  0.0000    0.0000  0.0000  0.0000
OLS         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
IPWm        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
IPWn        0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
DRm         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000
DRn         0.0000  0.0000  0.0000    0.0000  0.0000  0.0000    0.0000  0.0000  0.0000

τA = Main effect for factor A, τB = Main effect for factor B, τAB = Interaction effect
m denotes that the propensity scores are estimated with multinomial logit regression
n denotes that the propensity scores are estimated with nested logit regression
β = (5, 83, 9)
replicates = 10,000

Table 7: Estimated coefficients from the OLS regression on Average Grades without interactions of confounders

Variable      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   165.91     5.23         31.70     0.0000
FPE           28.48      1.7091       16.67     0.0000
FSB           10.92      1.37         7.97      0.0000
NS            0.02       0.01         1.76      0.0789
Fe            0.26       0.05         5.03      0.0000
TQ            -0.05      0.05         -0.98     0.3276
ST            1.44       0.17         8.30      0.0000
PS            6.83       1.55         4.41      0.0000
FPE×FSB       -10.85     2.08         -5.22     0.0000

AIC∗ = 12198.36, BIC∗∗ = 12250.81, HQ∗∗∗ = 12214.01
R2 = 0.41, Adj. R2 = 0.40
F-statistic = 119.3 with 8 and 1392 df, p-value < 0.00

∗ Akaike’s Information Criterion
∗∗ Bayesian Information Criterion
∗∗∗ Hannan-Quinn Information Criterion

Table 8: Estimated coefficients from the OLS regression on Average Grades with interactions of confounders

Variable      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   152.3697   9.6240       15.83     0.0000
FPE           55.5228    12.0544      4.61      0.0000
FSB           9.0903     13.1074      0.69      0.4881
NS            0.0530     0.0250       2.12      0.0340
Fe            0.4086     0.0819       4.99      0.0000
TQ            -0.1834    0.0911       -2.01     0.0444
ST            2.8296     0.3748       7.55      0.0000
PS            6.7106     1.5501       4.33      0.0000
FPE×FSB       -12.0264   16.5503      -0.73     0.4676
FPE×NS        -0.0533    0.0377       -1.41     0.1580
FPE×TQ        0.0159     0.1217       0.13      0.8961
FPE×ST        -2.1201    0.5507       -3.85     0.0001
FSB×NS        -0.0526    0.0353       -1.49     0.1360
FSB×Fe        -0.2446    0.1052       -2.33     0.0202
FSB×TQ        0.4494     0.1278       3.52      0.0005
FSB×ST        -1.8548    0.5148       -3.60     0.0003
FPE×FSB×NS    0.0811     0.0500       1.62      0.1053
FPE×FSB×TQ    -0.3731    0.1742       -2.14     0.0324
FPE×FSB×ST    2.4243     0.7103       3.41      0.0007

AIC∗ = 12173.85, BIC∗∗ = 12278.75, HQ∗∗∗ = 12209.10
R2 = 0.43, Adj. R2 = 0.42
F-statistic = 56.84 with 18 and 1382 df, p-value < 0.00

∗ Akaike’s Information Criterion
∗∗ Bayesian Information Criterion
∗∗∗ Hannan-Quinn Information Criterion

Figure 8: Residuals versus fitted values from the OLS regression of Model 1. A random pattern indicates that the model fits the data well. [Plot not reproduced; axes: Fitted Values (horizontal), Residuals (vertical).]

Figure 9: Residuals versus fitted values from the OLS regression of Model 2. A random pattern indicates that the model fits the data well. [Plot not reproduced; axes: Fitted Values (horizontal), Residuals (vertical).]

Appendix B Estimators

Below follow the full expressions for each causal effect for all three estimators. For the IPW and DR estimators, the derivations showing that they are unbiased are also given.

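The OLS Estimators

\[
\begin{aligned}
\hat{\tau}_{A,OLS} &= \frac{\hat{\mu}_{3,OLS} + \hat{\mu}_{4,OLS}}{2} - \frac{\hat{\mu}_{1,OLS} + \hat{\mu}_{2,OLS}}{2} \\
&= \frac{1}{2}\left[\hat{\beta}_0 + \hat{\beta}_A + \hat{\beta}_0 + \hat{\beta}_A + \hat{\beta}_B + \hat{\beta}_{AB} - (\hat{\beta}_0 + \hat{\beta}_0 + \hat{\beta}_B)\right] \\
&= \hat{\beta}_A + \frac{1}{2}\hat{\beta}_{AB}.
\end{aligned}
\tag{23}
\]

\[
\begin{aligned}
\hat{\tau}_{B,OLS} &= \frac{\hat{\mu}_{2,OLS} + \hat{\mu}_{4,OLS}}{2} - \frac{\hat{\mu}_{1,OLS} + \hat{\mu}_{3,OLS}}{2} \\
&= \frac{1}{2}\left[\hat{\beta}_0 + \hat{\beta}_B + \hat{\beta}_0 + \hat{\beta}_A + \hat{\beta}_B + \hat{\beta}_{AB} - (\hat{\beta}_0 + \hat{\beta}_0 + \hat{\beta}_A)\right] \\
&= \hat{\beta}_B + \frac{1}{2}\hat{\beta}_{AB}.
\end{aligned}
\tag{24}
\]

\[
\begin{aligned}
\hat{\tau}_{AB,OLS} &= \frac{\hat{\mu}_{1,OLS} + \hat{\mu}_{4,OLS}}{2} - \frac{\hat{\mu}_{2,OLS} + \hat{\mu}_{3,OLS}}{2} \\
&= \frac{1}{2}\left[\hat{\beta}_0 + \hat{\beta}_0 + \hat{\beta}_A + \hat{\beta}_B + \hat{\beta}_{AB} - (\hat{\beta}_0 + \hat{\beta}_A + \hat{\beta}_0 + \hat{\beta}_B)\right] \\
&= \frac{1}{2}\hat{\beta}_{AB}.
\end{aligned}
\tag{25}
\]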
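As a numerical check of Equations (23) to (25), the sketch below computes the three effects once from the group means implied by the regression coefficients and once from the closed-form expressions, and confirms that the two routes agree. The coefficient values and the function names are hypothetical and chosen only for illustration; the group coding (1 = neither treatment, 2 = B only, 3 = A only, 4 = both) is the one implied by the expressions above.

```python
def group_means(b0, b_a, b_b, b_ab):
    """Fitted means of the four treatment groups from the OLS coefficients."""
    return {
        1: b0,                     # A = 0, B = 0
        2: b0 + b_b,               # A = 0, B = 1
        3: b0 + b_a,               # A = 1, B = 0
        4: b0 + b_a + b_b + b_ab,  # A = 1, B = 1
    }


def effects_from_means(mu):
    """Contrasts of the group means (first line of each equation)."""
    tau_a = (mu[3] + mu[4]) / 2 - (mu[1] + mu[2]) / 2
    tau_b = (mu[2] + mu[4]) / 2 - (mu[1] + mu[3]) / 2
    tau_ab = (mu[1] + mu[4]) / 2 - (mu[2] + mu[3]) / 2
    return tau_a, tau_b, tau_ab


def effects_from_coefficients(b_a, b_b, b_ab):
    """Closed forms (last line of each equation); the intercept cancels."""
    return b_a + 0.5 * b_ab, b_b + 0.5 * b_ab, 0.5 * b_ab


# Hypothetical coefficients, not estimates from the thesis
b0, b_a, b_b, b_ab = 10.0, 4.0, 2.0, -1.0
via_means = effects_from_means(group_means(b0, b_a, b_b, b_ab))
via_coefs = effects_from_coefficients(b_a, b_b, b_ab)
print(via_means, via_coefs)
```

With these values both routes give 3.5, 1.5 and -0.5 for τA, τB and τAB respectively.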

The Inverse Probability Weighting Estimators

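To show that the IPW estimator is unbiased, we note that since \(W_t \cdot Y = W_t \cdot Y_t\), it follows that

\[
\begin{aligned}
E\left[\frac{W_t \cdot Y}{r(X,t)}\right]
&= E\left[\frac{W_t \cdot Y_t}{r(X,t)}\right]
= E\left[E\left[\left.\frac{W_t \cdot Y_t}{r(X,t)} \,\right|\, X\right]\right]
= E\left[\frac{E(W_t \mid X) \cdot E(Y_t \mid X)}{r(X,t)}\right] \\
&= E\left[\frac{r(X,t) \cdot E(Y_t \mid X)}{r(X,t)}\right]
= E\left[E(Y_t \mid X)\right] = E[Y_t] = \mu_t,
\end{aligned}
\]

where the third equality holds by unconfoundedness.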

36 µˆ +µ ˆ µˆ +µ ˆ τˆ = 3,IP W 4,IP W − 1,IP W 2,IP W A,IP W 2 2  N !−1 N N !−1 N  1 X Wi3 X Wi3 · Yi X Wi4 X Wi4 · Yi = + 2  rˆ(X, 3) rˆ(X, 3) rˆ(X, 4) rˆ(X, 4)  i=1 i=1 i=1 i=1

 N !−1 N N !−1 N  1 X Wi1 X Wi1 · Yi X Wi2 X Wi2 · Yi − + . (26) 2  rˆ(X, 1) rˆ(X, 1) rˆ(X, 2) rˆ(X, 2)  i=1 i=1 i=1 i=1

µˆ +µ ˆ µˆ +µ ˆ τˆ = 2,IP W 4,IP W − 1,IP W 3,IP W B,IP W 2 2  N !−1 N N !−1 N  1 X Wi2 X Wi2 · Yi X Wi4 X Wi4 · Yi = + 2  rˆ(X, 2) rˆ(X, 2) rˆ(X, 4) rˆ(X, 4)  i=1 i=1 i=1 i=1

 N !−1 N N !−1 N  1 X Wi1 X Wi1 · Yi X Wi3 X Wi3 · Yi − + . (27) 2  rˆ(X, 1) rˆ(X, 1) rˆ(X, 3) rˆ(X, 3)  i=1 i=1 i=1 i=1

µˆ +µ ˆ µˆ +µ ˆ τˆ = 1,IP W 4,IP W − 2,IP W 3,IP W AB,IP W 2 2  N !−1 N N !−1 N  1 X Wi1 X Wi1 · Yi X Wi4 X Wi4 · Yi = + 2  rˆ(X, 1) rˆ(X, 1) rˆ(X, 4) rˆ(X, 4)  i=1 i=1 i=1 i=1

 N !−1 N N !−1 N  1 X Wi2 X Wi2 · Yi X Wi3 X Wi3 · Yi − + . (28) 2  rˆ(X, 2) rˆ(X, 2) rˆ(X, 3) rˆ(X, 3)  i=1 i=1 i=1 i=1
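The normalized IPW estimators in Equations (26) to (28) can be sketched in a few lines. The code below is an illustration under stated assumptions, not the implementation used in the thesis: the function names, the data layout (a group label in 1..4 per unit and a list of four estimated propensity scores per unit), and the toy data are hypothetical, and the propensity scores are taken as given rather than estimated.

```python
def ipw_means(y, group, gps):
    """Normalized IPW estimates of mu_1..mu_4.

    y[i]        : observed outcome of unit i
    group[i]    : observed treatment group of unit i, coded 1..4
    gps[i][t-1] : estimated generalized propensity score r_hat(X_i, t)
    """
    means = {}
    for t in (1, 2, 3, 4):
        # W_it = 1 only for units observed in group t, so both sums run over that group
        weights = [1.0 / r[t - 1] for g, r in zip(group, gps) if g == t]
        outcomes = [yi for yi, g in zip(y, group) if g == t]
        means[t] = sum(w * yi for w, yi in zip(weights, outcomes)) / sum(weights)
    return means


def factorial_effects(mu):
    """Contrasts of the four means giving tau_A, tau_B and tau_AB."""
    tau_a = (mu[3] + mu[4]) / 2 - (mu[1] + mu[2]) / 2
    tau_b = (mu[2] + mu[4]) / 2 - (mu[1] + mu[3]) / 2
    tau_ab = (mu[1] + mu[4]) / 2 - (mu[2] + mu[3]) / 2
    return tau_a, tau_b, tau_ab


# Toy data (hypothetical): two units per group, assignment probability 0.25 for every group
y = [1.0, 3.0, 2.0, 4.0, 5.0, 7.0, 6.0, 10.0]
group = [1, 1, 2, 2, 3, 3, 4, 4]
gps = [[0.25, 0.25, 0.25, 0.25] for _ in y]

mu_hat = ipw_means(y, group, gps)
print(mu_hat)
print(factorial_effects(mu_hat))
```

With constant propensity scores, as under complete randomization, the weights cancel and the normalized IPW means reduce to the group sample means (2, 3, 6 and 8 here), so the effects are 4.5, 1.5 and 0.5. With unequal scores, units that were unlikely to receive their observed treatment receive proportionally more weight.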

The Doubly Robust Estimators

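To show that the DR is unbiased for each expected value, we note that

\[
\begin{aligned}
&E\left[\frac{W_t \cdot Y}{r(X,t)} - \frac{W_t - r(X,t)}{r(X,t)}\, m_t(X,\alpha_t)\right] \\
&\quad = E\left[\frac{W_t \cdot Y_t}{r(X,t)} - \frac{W_t - r(X,t)}{r(X,t)}\, m_t(X,\alpha_t)\right] \\
&\quad = E\left[Y_t + \frac{W_t - r(X,t)}{r(X,t)}\,\{Y_t - m_t(X,\alpha_t)\}\right] \\
&\quad = E(Y_t) + E\left[\frac{W_t - r(X,t)}{r(X,t)}\,\{Y_t - m_t(X,\alpha_t)\}\right],
\end{aligned}
\tag{29}
\]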

implying that \(\hat{\mu}_{t,DR}\) is an unbiased estimator of \(\mu_t\) when the second term of the last expression equals zero. When the propensity score is correctly specified, we have that

\[
r(X,t) = E(W_t \mid X) = E(W_t \mid Y_t, X),
\]

where the second equality follows from unconfoundedness.

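Using this equality, we can rewrite the last term in Equation (29) to show that it equals zero when the propensity score is correctly specified, which means the estimator is unbiased. Hence,

\[
\begin{aligned}
&E\left[\frac{W_t - r(X,t)}{r(X,t)}\,\{Y_t - m_t(X,\alpha_t)\}\right] \\
&\quad = E\left[E\left[\left.\frac{W_t - r(X,t)}{r(X,t)}\,\{Y_t - m_t(X,\alpha_t)\} \,\right|\, Y_t, X\right]\right] \\
&\quad = E\left[\{Y_t - m_t(X,\alpha_t)\}\, E\left[\left.\frac{W_t - r(X,t)}{r(X,t)} \,\right|\, Y_t, X\right]\right] \\
&\quad = E\left[\{Y_t - m_t(X,\alpha_t)\}\, \frac{\{E(W_t \mid Y_t, X) - r(X,t)\}}{r(X,t)}\right] \\
&\quad = E\left[\{Y_t - m_t(X,\alpha_t)\}\, \frac{\{E(W_t \mid X) - r(X,t)\}}{r(X,t)}\right] \\
&\quad = E\left[\{Y_t - m_t(X,\alpha_t)\}\, \frac{\{r(X,t) - r(X,t)\}}{r(X,t)}\right] = 0.
\end{aligned}
\]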

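When instead the regression model is correctly specified, the second term also equals zero since

\[
m_t(X,\alpha_t) = E(Y \mid T = t, X) = E(Y_t \mid W_t, X) = E(Y_t \mid X)
\]

and

\[
\begin{aligned}
&E\left[\frac{W_t - r(X,t)}{r(X,t)}\,\{Y_t - m_t(X,\alpha_t)\}\right] \\
&\quad = E\left[\frac{W_t - r(X,t)}{r(X,t)}\,\{Y_t - E(Y \mid T = t, X)\}\right] \\
&\quad = E\left[E\left[\left.\frac{W_t - r(X,t)}{r(X,t)}\,\{Y_t - E(Y \mid T = t, X)\} \,\right|\, W_t, X\right]\right] \\
&\quad = E\left[\frac{W_t - r(X,t)}{r(X,t)}\, E\left[\left.\{Y_t - E(Y \mid T = t, X)\} \,\right|\, W_t, X\right]\right] \\
&\quad = E\left[\frac{W_t - r(X,t)}{r(X,t)}\,\{E(Y_t \mid W_t, X) - E(Y \mid T = t, X)\}\right] \\
&\quad = E\left[\frac{W_t - r(X,t)}{r(X,t)}\,\{E(Y_t \mid X) - E(Y_t \mid X)\}\right] = 0.
\end{aligned}
\]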

Hence, when either model (or both) is correctly specified the DR estimator is unbiased.

38 µˆ +µ ˆ µˆ +µ ˆ τˆ = 3,DR 4,DR − 1,DR 2,DR = A,DR 2 2 " N  # 1 X Wi3 · Yi [Wi3 − rˆ(X, 3)] = − m (X , αˆ ) 2N rˆ(X, 3) rˆ(X, 3) 3 i 3 i=1 " N  # 1 X Wi4 · Yi [Wi4 − rˆ(X, 4)] + − m (X , αˆ ) 2N rˆ(X, 4) rˆ(X, 4) 4 i 4 i=1 " N  # 1 X Wi1 · Yi [Wi1 − rˆ(X, 1)] − − m (X , αˆ ) 2N rˆ(X, 1) rˆ(X, 1) 1 i 1 i=1 " N  # 1 X Wi2 · Yi [Wi2 − rˆ(X, 2)] − − m (X , αˆ ) . (30) 2N rˆ(X, 2) rˆ(X, 2) 2 i 2 i=1

µˆ +µ ˆ µˆ +µ ˆ τˆ = 2,DR 4,DR − 1,DR 3,DR B,DR 2 2 " N  # 1 X Wi2 · Yi [Wi2 − rˆ(X, 2)] = − m (X , αˆ ) 2N rˆ(X, 2) rˆ(X, 2) 2 i 2 i=1 " N  # 1 X Wi4 · Yi [Wi4 − rˆ(X, 4)] + − m (X , αˆ ) 2N rˆ(X, 4) rˆ(X, 4) 4 i 4 i=1 " N  # 1 X Wi1 · Yi [Wi1 − rˆ(X, 1)] − − m (X , αˆ ) 2N rˆ(X, 1) rˆ(X, 1) 1 i 1 i=1 " N  # 1 X Wi3 · Yi [Wi3 − rˆ(X, 3)] − − m (X , αˆ ) . (31) 2N rˆ(X, 3) rˆ(X, 3) 3 i 3 i=1

µˆ +µ ˆ µˆ +µ ˆ τˆ = 1,DR 4,DR − 2,DR 3,DR AB,DR 2 2 " N  # 1 X Wi1 · Yi [Wi1 − rˆ(X, 1)] = − m (X , αˆ ) 2N rˆ(X, 1) rˆ(X, 1) 1 i 1 i=1 " N  # 1 X Wi4 · Yi [Wi4 − rˆ(X, 4)] + − m (X , αˆ ) 2N rˆ(X, 4) rˆ(X, 4) 4 i 4 i=1 " N  # 1 X Wi2 · Yi [Wi2 − rˆ(X, 2)] − − m (X , αˆ ) 2N rˆ(X, 2) rˆ(X, 2) 2 i 2 i=1 " N  # 1 X Wi3 · Yi [Wi3 − rˆ(X, 3)] − − m (X , αˆ ) . (32) 2N rˆ(X, 3) rˆ(X, 3) 3 i 3 i=1
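The DR estimators in Equations (30) to (32) combine the weighted outcome with a correction term based on the outcome regression. The sketch below illustrates the computation of the four DR means under stated assumptions: the function names, the data layout, and the toy data are hypothetical, and both the estimated propensity scores and the regression predictions are taken as given inputs rather than fitted.

```python
def dr_means(y, group, gps, m_hat):
    """Doubly robust estimates of mu_1..mu_4.

    y[i]          : observed outcome of unit i
    group[i]      : observed treatment group of unit i, coded 1..4
    gps[i][t-1]   : estimated propensity score r_hat(X_i, t)
    m_hat[i][t-1] : outcome-regression prediction m_t(X_i, alpha_hat_t)
    """
    n = len(y)
    means = {}
    for t in (1, 2, 3, 4):
        total = 0.0
        for yi, g, r, m in zip(y, group, gps, m_hat):
            w = 1.0 if g == t else 0.0  # indicator W_it
            r_t, m_t = r[t - 1], m[t - 1]
            total += w * yi / r_t - (w - r_t) / r_t * m_t
        means[t] = total / n
    return means


def factorial_effects(mu):
    """Contrasts of the four means giving tau_A, tau_B and tau_AB."""
    tau_a = (mu[3] + mu[4]) / 2 - (mu[1] + mu[2]) / 2
    tau_b = (mu[2] + mu[4]) / 2 - (mu[1] + mu[3]) / 2
    tau_ab = (mu[1] + mu[4]) / 2 - (mu[2] + mu[3]) / 2
    return tau_a, tau_b, tau_ab


# Toy data (hypothetical): two units per group, propensity score 0.25 for every group
y = [1.0, 3.0, 2.0, 4.0, 5.0, 7.0, 6.0, 10.0]
group = [1, 1, 2, 2, 3, 3, 4, 4]
gps = [[0.25, 0.25, 0.25, 0.25] for _ in y]
# Predictions chosen so that the observed-group entry equals the observed outcome
m_hat = [
    [1.0, 2.0, 5.0, 7.0], [3.0, 4.0, 7.0, 9.0],
    [1.0, 2.0, 5.0, 7.0], [3.0, 4.0, 7.0, 9.0],
    [1.0, 2.0, 5.0, 7.0], [3.0, 4.0, 7.0, 9.0],
    [0.0, 1.0, 4.0, 6.0], [4.0, 5.0, 8.0, 10.0],
]

mu_dr = dr_means(y, group, gps, m_hat)
print(mu_dr)
print(factorial_effects(mu_dr))
```

In this toy case the regression predictions reproduce the observed outcomes exactly, so each DR mean equals the average prediction for that group over all units. If the predictions are instead set to zero, the same function returns the unnormalized inverse-probability-weighted means, which shows how the estimator falls back on the propensity score when the outcome model carries no information.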
