
To Adjust or Not to Adjust? Sensitivity Analysis of M-Bias and Butterfly-Bias


Citation: Ding, Peng, and Luke W. Miratrix. 2015. “To Adjust or Not to Adjust? Sensitivity Analysis of M-Bias and Butterfly-Bias.” Journal of Causal Inference 3 (1) (January 1). doi:10.1515/jci-2013-0021.


Abstract

“M-Bias”, as it is called in the epidemiological literature, is the bias introduced by conditioning on a pretreatment covariate due to a particular “M-Structure” between two latent factors, an observed treatment, an outcome, and a “collider”. This potential source of bias, which can occur even when the treatment and the outcome are not confounded, has been a source of considerable controversy. We here present formulae for identifying under which circumstances biases are inflated or reduced. In particular, we show that the magnitude of M-Bias in Gaussian linear structural equation models tends to be relatively small compared to confounding bias, suggesting that it is generally not a serious concern in many applied settings. These theoretical results are consistent with recent empirical findings from simulation studies. We also generalize the M-Bias setting to allow the correlation between the latent factors to be nonzero, and to allow the collider to also be a confounder between the treatment and the outcome. These results demonstrate that mild deviations from the M-Structure tend to increase confounding bias more rapidly than M-Bias, suggesting that conditioning on any given covariate is generally the superior choice. As an application, we re-examine a controversial example between Professors Donald Rubin and Judea Pearl.

Key Words: Causality; Collider; Confounding; Controversy; Covariate.

1 Introduction

Many statisticians believe that “there is no reason to avoid adjustment for a variable describing subjects before treatment” in observational studies (Rosenbaum, 2002, pp. 76), because “typically, the more conditional an assumption, the more generally acceptable it is” (Rubin, 2009). This advice, recently dubbed the “pretreatment criterion” (VanderWeele and Shpitser, 2011), is widely used in empirical studies, as more covariates generally seem to make the ignorability assumption, i.e., the assumption that, conditionally on the observed pretreatment covariates, treatment assignment is independent of the potential outcomes (Rosenbaum and Rubin, 1983), more plausible. As the validity of causal inference in observational studies relies strongly on this (untestable) assumption (Rosenbaum and Rubin, 1983), it seems reasonable to make all efforts to render it plausible.

However, other researchers (Pearl, 2009a,b; Shrier, 2008, 2009; Sjölander, 2009), mainly from the causal diagram community, do not accept this view because of the possibility of a so-called M-Structure, illustrated in Figure 1(c). As a main opponent to Rubin and Rosenbaum’s advice, Pearl (2009a) and Pearl (2009b) warn practitioners that spurious bias may arise from adjusting for a collider M in an M-Structure, even if it is a pretreatment covariate. This form of bias, typically called M-Bias, a special version of so-called “collider bias”, has since generated considerable controversy and confusion.

We attempt to resolve some of these debates by an analysis of M-Bias under the causal diagram or directed acyclic graph (DAG) framework. For readers unfamiliar with the terminology of the DAG (or Bayesian Network) literature, more details can be found in Pearl (1995) or Pearl (2000). We here use only a small part of this larger framework.
Arguably the most important structure in the DAG, and certainly the one at the root of almost all controversy, is the “V-Structure” illustrated in Figure 1(b). Here, U and W are marginally independent with a common outcome M, which forms a “V” with the vertex M being called a “collider”. From a data-generation viewpoint, one might imagine Nature generating data in two steps: she first picks two values for U and W independently from two distributions, and then she combines them (possibly along with some additional random variable) to create M. Given this, conditioning on M can cause a spurious correlation between U and W, which is known as collider bias (Greenland, 2002) or, in epidemiology, Berkson’s Paradox (Berkson, 1946). Conceptually, this correlation arises because if one cause of an observed outcome is known to have not occurred, the other cause becomes more likely. Consider an automatic-timer sprinkler system where the sprinkler being on is independent of whether it is raining. Here, the weather gives no information on the sprinkler. However, given wet grass, if one observes a sunny day, one will likely conclude that the sprinklers have recently run.

Correlation has been induced.

Figure 1: Three DAGs. (a) A simple DAG; (b) V-Structure: U ⊥ W, but U and W are dependent given M; (c) M-Structure: T ⊥ Y, but T and Y are dependent given M.

Where things get interesting is when this collider is made into a pretreatment variable. Consider Figure 1(c), an extension of Figure 1(b). Here U and W are now also causes of the treatment T and the outcome Y, respectively. Nature, as a last, third step, generates T as a function of U and some randomness, and Y as a function of W and some randomness. This structure is typically used to represent a circumstance where a researcher observes T, Y, and M in nature and is attempting to derive the causal impact of T on Y. However, U and W are sadly unobserved, or latent. Clearly, the causal effect of T on Y is zero, which is also equal to the marginal association between T and Y. If a researcher regressed Y on T, he or she would obtain 0 in expectation, which is correct for estimating the causal effect of T on Y. But perhaps there is a concern that M, a pretreatment covariate, may be a confounder that is masking a treatment effect. Typically, one would then “adjust” for M to take this possibility into account, e.g., by including M in a regression or by matching units on similar values of M. If we do this in this circumstance, however, then we will not find zero, in expectation. This is the so-called “M-Bias”, and this special structure is called the “M-Structure” in the DAG literature.

Previous qualitative analysis for binary variables shows that collider bias generally tends to be small (Greenland, 2002), and simulation studies (Liu, Brookhart, Schneeweiss, Mi, and Setoguchi, 2012) again demonstrate that M-Bias is small in many realistic settings.
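The collider effect described above is easy to see numerically. Below is a minimal simulation sketch (our illustration, not from the paper): U and W are independent causes of M, and restricting the sample to a narrow band of M stands in for conditioning on the collider.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# V-Structure: U and W are independent causes of the collider M.
U = rng.standard_normal(n)
W = rng.standard_normal(n)
M = (U + W + rng.standard_normal(n)) / np.sqrt(3)  # unit-variance collider

# Marginally, U and W are uncorrelated.
marginal_corr = np.corrcoef(U, W)[0, 1]

# Conditioning on M (here: keeping only units with M near zero)
# induces a spurious negative correlation between U and W.
near_zero = np.abs(M) < 0.1
conditional_corr = np.corrcoef(U[near_zero], W[near_zero])[0, 1]

print(round(marginal_corr, 2), round(conditional_corr, 2))
```

With this particular DGP the conditional correlation is close to −0.5: given a fixed M, a large U must be offset by a small W.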
While mathematically describing the magnitudes of M-Bias in general models is intractable, it is possible to derive exact formulae for the biases as functions of the correlation coefficients in Gaussian linear structural equation models (GLSEMs). The GLSEM has a long history in statistics (Wright, 1921, 1934) as a tool for describing dependence among multiple random variables. Spirtes (2002) uses linear models to illustrate M-Bias in observational studies, and Pearl (2013a,b) also utilizes the transparency of such linear models to examine various types of

causal phenomena, biases, and paradoxes. We here extend these works and provide exact formulae for biases, allowing a more detailed quantitative analysis of M-Bias.

While M-Bias does exist when the true underlying data generating process (DGP) follows the exact M-Structure, it might be rather sensitive to various deviations from the exact M-Structure. Furthermore, some might argue that an exact M-Structure is unlikely to hold in practice. Gelman (2011), for example, doubts the exact independence assumption required for the M-Structure in the social sciences by arguing that there are “(almost) no true zeros” in this discipline. Indeed, since U and W are often latent characteristics of the same individual, the independence assumption U ⊥ W is a rather strong structural assumption. Furthermore, it might be plausible that the pretreatment covariate M is also a confounder between, i.e., has some causal impact on both, the treatment and the outcome. We extend our work by accounting for these departures from a pure M-Structure, and find that even slight departures from the M-Structure can dramatically change the forms of the biases.

This paper theoretically compares the bias from conditioning on M to that from not conditioning under several scenarios, and finds that M-Bias is indeed small unless there is a strong correlation structure for the variables. We further show that these findings extend to a binary treatment regime as well. The argument proceeds in several stages. First, in Section 2, we examine a pure M-Structure and introduce our GLSEM framework. We then discuss the cases where the latent variables U and W may be correlated and M may also be a confounder between the treatment T and the outcome Y. In Section 3, we generalize the results of Section 2 to a binary treatment.
In Section 4, we illustrate the theoretical findings using a controversial example between Professors Donald Rubin and Judea Pearl (Rubin, 2007; Pearl, 2009b). We conclude with a brief discussion and present all technical details in the Appendix.

2 M-Bias and Butterfly-Bias in GLSEMs

We begin by examining pure M-Bias in a GLSEM. As our primary focus is bias, we assume data is ample and that anything estimable is estimated with nearly perfect precision. In particular, when we say we obtain some result from a regression, we implicitly mean we obtain that result in expectation; in practice an estimator will be near the given quantities. We do not compare the relative uncertainties of different estimators given the need to estimate more or fewer parameters. There are likely degrees-of-freedom issues that would implicitly advocate using estimators with fewer parameters, but in the circumstances considered here these concerns are likely to be minor, as all the models have few parameters.

A causal DAG can be viewed as a hierarchical DGP. In particular, any variable R on the graph can be viewed as a function of its parents and some additional noise, i.e., if R had parents A, B, and C we would have

R = f(A, B, C, ε_R) with ε_R ⊥ (A, B, C).

Generally, noise terms such as ε_R are considered to be independent of each other, but they can also be given an unknown correlation structure corresponding to earlier variables not explicitly included in the diagram. This is typically represented by drawing the dependent noise terms jointly from some multivariate distribution. This framework is quite general; we can represent any distribution that can be factored as a product of conditional distributions corresponding to a DAG (which is one representation of the Markov Condition, a fundamental assumption for DAGs).

GLSEMs are special cases of the above with additional linearity, Gaussianity, and additivity constraints. For simplicity, and without loss of generality, we also rescale all primary variables (U, W, M, T, Y) to have zero mean and unit variance. For example, consider this data generating process corresponding to Figure 1(a):

M, ε_T, ε_Y ~ iid N(0, 1),
T = aM + √(1 − a²) ε_T,
Y = bT + cM + √(1 − b² − c²) ε_Y.

In the causal DAG literature, we think about causality as reaching in and fixing a given node to a set value, but letting Nature take her course otherwise. For example, if we were able to set T at t, the above data generation process would be transformed to:

M, ε_T, ε_Y ~ iid N(0, 1),
T = t,
Y = bt + cM + √(1 − b² − c²) ε_Y.

The previous cause, M, of T has been broken, but the impact of T on Y remains intact. This changes the distribution of Y but not of M. More importantly, this results in a distribution distinct from that of conditioning on T = t. Consider the case of positive a, b, and c.
If we observe a high T, we can infer a high M (as T and M are correlated) and hence a high Y, due to both the bT and cM terms in Y’s equation. However, if we set T to a high value, M is unchanged. Thus, while we will still have the large bT term for Y, the cM term will be 0 in expectation, and so the expected value of Y will be less.
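This gap between conditioning and intervening can be checked directly by simulation. The sketch below (our illustration; the values a = b = c = 0.5 are arbitrary) simulates the DGP of Figure 1(a) and compares the mean of Y among units with T near t against the mean of Y when T is set to t.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
a = b = c = 0.5

# Observational DGP of Figure 1(a): M causes both T and Y.
M = rng.standard_normal(n)
T = a * M + np.sqrt(1 - a**2) * rng.standard_normal(n)
Y = b * T + c * M + np.sqrt(1 - b**2 - c**2) * rng.standard_normal(n)

# Conditioning: E[Y | T near t] follows the regression slope b + a*c.
t = 1.0
observed = Y[np.abs(T - t) < 0.05].mean()

# Intervening, do(T = t): M is left alone, so E[Y | do(T) = t] = b*t.
M2 = rng.standard_normal(n)
Y_do = b * t + c * M2 + np.sqrt(1 - b**2 - c**2) * rng.standard_normal(n)
intervened = Y_do.mean()

print(round(observed, 2), round(intervened, 2))  # roughly 0.75 vs 0.50
```

The observational mean tracks (b + ac)t = 0.75, while the interventional mean is bt = 0.50, exactly the gap described above.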

This setting, as compared to conditioning, is represented with the “do” operator. Given the “do” operator, we define the local causal effect of T on Y at T = t as:

τ_t = ∂E{Y | do(T) = x}/∂x, evaluated at x = t.

For linear models, the local causal effect is a constant, and thus we do not need to specify t. We use “do” here purely to indicate the different distributions. For a more technical overview, see Pearl (1995) or Pearl (2000). Our results, with more formality, can easily be expressed in this more technical notation.

Figure 2: M-Structure with Possibly Correlated Hidden Causes

If we extend the M-Structure in Figure 1(c) by allowing possible correlation between the two hidden causes U and W, we obtain the DAG in Figure 2. This in turn gives the following DGP:

ε_M, ε_T, ε_Y ~ iid N(0, 1),
(U, W) ~ N₂( (0, 0)′, [1, ρ; ρ, 1] ),
M = bU + cW + √(1 − b² − c²) ε_M,
T = aU + √(1 − a²) ε_T,
Y = dW + √(1 − d²) ε_Y.

Here, the true causal effect of T on Y is zero; namely, τ_t = 0 for all t. The unadjusted estimator for the causal effect, obtained by regressing Y onto T, is the same as the covariance between T and Y, and hence equals its bias:

Bias_unadj = Cov(T, Y) = Cov(aU, dW) = ad Cov(U, W) = adρ.
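As a quick numerical check of this expression (our sketch; the coefficient values are arbitrary), the slope from regressing Y on T in data simulated from this DGP should match adρ:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
a, b, c, d, rho = 0.4, 0.3, 0.3, 0.4, 0.5

# Correlated hidden causes (U, W), as in Figure 2.
cov = np.array([[1.0, rho], [rho, 1.0]])
U, W = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

M = b * U + c * W + np.sqrt(1 - b**2 - c**2) * rng.standard_normal(n)
T = a * U + np.sqrt(1 - a**2) * rng.standard_normal(n)
Y = d * W + np.sqrt(1 - d**2) * rng.standard_normal(n)

# The true effect of T on Y is zero, so the regression slope of Y on T
# is pure bias; it should be close to a*d*rho.
slope = np.cov(T, Y)[0, 1] / np.var(T)
print(round(slope, 3), round(a * d * rho, 3))  # both ≈ 0.08
```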

The adjusted estimator (see Lemma 2 in Appendix A for a proof), obtained by regressing Y onto (T, M), has bias

Bias_adj = {adρ(1 − b² − c² − bcρ) − abcd} / {1 − (ab + acρ)²}.

The results above, and some of the results discussed later in this paper, can be obtained directly from traditional path analysis (Wright, 1921, 1934). However, we provide elementary proofs, which can easily be extended to a binary treatment, in the Appendix. In the Gaussian case, if we allowed for a treatment effect, our results would remain essentially unchanged; the only difference would be due to restrictions on the correlation terms needed to maintain unit variance for all variables.

The above can also be expressed in the potential outcomes framework (Neyman, 1923/1990; Rubin, 1974). In particular, for a given unit let Nature draw ε_M, ε_T, ε_Y, U, and W as before. Let T be the “natural treatment” for that unit, i.e., what treatment it would receive sans intervention. Then calculate Y(t) for any t of interest using the “do” operator. These are what we would see if we set T = t. How Y(t) changes for a particular unit defines that unit’s collection of potential outcomes. Then E{Y(t)} is the expected potential outcome over the population for a particular t. We can examine the derivative of this function as above to get a local treatment effect. This connection is exact: the findings in this paper are the same as what one would find using this DGP and the potential outcomes framework. We here examine regression as the estimator. Note that matching would produce identical results as the amount of data grew (assuming the data generating process ensures common support, etc.).

Exact M-Bias. The M-Bias originally considered in the literature is the special case where the correlation coefficient between U and W is ρ = 0.
In this case, the unadjusted estimator is unbiased, and the absolute bias of the adjusted estimator is |abcd|/{1 − (ab)²}. With moderate correlation coefficients a, b, c, d, the denominator 1 − (ab)² is close to one, and the bias is close to −abcd. Since abcd is a product of four correlation coefficients, it can be viewed as a “higher order bias”. For example, if a = b = c = d = 0.2, then 1 − (ab)² = 0.9984 ≈ 1, and the bias of the adjusted estimator is −abcd/{1 − (ab)²} = −0.0016 ≈ 0; if a = b = c = d = 0.3, then 1 − (ab)² = 0.9919 ≈ 1, and the bias of the adjusted estimator is −abcd/{1 − (ab)²} = −0.0082 ≈ 0. Even moderate correlation results in little bias.

In Figure 3, we plot the bias of the adjusted estimator as a function of the correlation coefficients, and let these coefficients change to see how the bias changes.

Figure 3: M-Bias with Independent U and W. The three subfigures (a = b = c = d; a = b = 2c = 2d; 2a = 2b = c = d) correspond to the cases where (U, W) are equally, more, or less predictive of the treatment than of the outcome. Each subfigure plots the absolute value of the bias against a, and shows the proportion of the area where the adjusted estimator has a bias smaller than 0.01 (0.446, 0.495, and 0.499, respectively).

In the first subfigure, we assume all the correlation coefficients have the same magnitude (a = b = c = d), and we plot the absolute bias of the adjusted estimator versus a. Here, the bias is low unless a is large. Note that the constraints on variance and correlation allow only some combinations of values for a, b, c, and d, which limits the domain of the figures. In this case, for example, |a| ≤ √2/2, due to the requirement that b² + c² = 2a² ≤ 1. Other figures have limited domains due to similar constraints.

In the second subfigure of Figure 3, we assume that M is more predictive of the treatment T than of the outcome Y, with a = b = 2c = 2d. In the third subfigure of Figure 3, we assume that M is more predictive of the outcome Y, with 2a = 2b = c = d. In all these cases, although the biases do blow up when the correlation coefficients are extremely large, they are generally very small within a wide range of the feasible region of the correlation coefficients. In examining the formula, note that if any correlation is low, the bias will be low.

In Figure 4(a), we assume a = b and c = d and examine a broader range of relationships. Here, the grey area satisfies |Bias_adj| < min(|a|, |c|)/20. For example, when the absolute values of the correlation coefficients are smaller than 0.5 (the square with dashed boundary in Figure 4(a)), the corresponding area is almost entirely grey, implying small bias.
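Both the closed-form bias and a Monte Carlo counterpart are easy to verify in a few lines (our sketch, reusing the a = b = c = d = 0.3 example from above):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
a = b = c = d = 0.3

# Closed-form adjusted bias for the exact M-Structure (rho = 0).
bias_formula = -a * b * c * d / (1 - (a * b) ** 2)

# Monte Carlo check: simulate the M-Structure, regress Y on (T, M),
# and read off the coefficient of T.
U = rng.standard_normal(n)
W = rng.standard_normal(n)
M = b * U + c * W + np.sqrt(1 - b**2 - c**2) * rng.standard_normal(n)
T = a * U + np.sqrt(1 - a**2) * rng.standard_normal(n)
Y = d * W + np.sqrt(1 - d**2) * rng.standard_normal(n)

X = np.column_stack([T, M])
coef_T = np.linalg.lstsq(X, Y, rcond=None)[0][0]

print(round(bias_formula, 4), round(coef_T, 4))  # both near -0.0082
```

No intercept is needed here because all variables are mean zero by construction.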
As a side note, Pearl (2013b) noticed a surprising fact: the stronger the correlation between T and M, the larger the absolute bias of the adjusted estimator, since the absolute bias is monotone increasing in |ab|. From the second and third subfigures of Figure 3, we see that when M is more predictive of the treatment, the bias of the adjusted estimator indeed tends to be larger.

Correlated latent variables. When the latent variables U and W are correlated with ρ ≠ 0, both the unadjusted and adjusted estimators may be biased. The question then becomes: which is worse? The ratio of the absolute biases of the adjusted and unadjusted estimators is

Bias_adj / Bias_unadj = {ρ(1 − b² − c² − bcρ) − bc} / [ρ{1 − (ab + acρ)²}],

which does not depend on d (the relationship between W and Y). For example, if the correlation coefficients a, b, c, ρ all equal 0.2, the ratio above is 0.714; in this case the adjusted estimator is superior to the unadjusted one by a factor of 1.4. Figure 4(b) compares this ratio to 1 for all combinations of ρ and a (= b = c). Generally, the adjusted estimator has smaller bias except when a, b, and c are quite large.
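The ratio is simple enough to tabulate directly; a small sketch (ours) reproduces the 0.714 figure from the text:

```python
def bias_ratio(a, b, c, rho):
    """Ratio of adjusted to unadjusted bias in the M-Structure with
    correlated latents; note that d cancels and does not appear."""
    num = rho * (1 - b**2 - c**2 - b * c * rho) - b * c
    den = rho * (1 - (a * b + a * c * rho) ** 2)
    return num / den

# Example from the text: all four coefficients equal to 0.2.
ratio = bias_ratio(0.2, 0.2, 0.2, 0.2)
print(round(ratio, 3))  # 0.714
```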

Figure 4: M-Bias under different scenarios. (a) Pure M-Bias, with a = b and c = d: within the grey region, the absolute bias of the adjusted estimator is less than 1/20 of the minimum of |a| (= |b|) and |c| (= |d|). (b) M-Bias with correlated U and W: within the grey region, the adjusted estimator is superior.

Figure 5: M-Bias with correlated U and W when ρ ≠ 0 (ρ = 0.1, 0.2, 0.4) and a = b = c = d. Each subfigure plots the absolute biases of the unadjusted and adjusted estimators against a, and shows the proportion of the area where the adjusted estimator has a smaller bias than the unadjusted estimator (0.570, 0.726, and 0.845, respectively).

In Figure 5, we again assume a = b = c = d and investigate the absolute biases as functions of a for ρ fixed at 0.1, 0.2, and 0.4. When the correlation coefficients a (= b = c = d) are not dramatically larger than ρ, the adjusted estimator has smaller bias than the unadjusted one.

The Disjunctive Cause Criterion. In order to remove biases in observational studies, VanderWeele and Shpitser (2011) propose a new “disjunctive cause criterion” for selecting confounders, which requires controlling for all the covariates that are causes of the treatment, causes of the outcome, or causes of both. According to the disjunctive cause criterion, when ρ ≠ 0 we should control for (U, W) if possible. Unfortunately, neither U nor W is observable. However, controlling for M, a “proxy variable” for (U, W), may reduce bias when ρ is relatively large. In the special case with b = 0, the ratio of the absolute biases is

Bias_adj / Bias_unadj = (1 − c²) / {1 − (acρ)²} ≤ 1;

in another special case with c = 0, the ratio of the absolute biases is

Bias_adj / Bias_unadj = (1 − b²) / {1 − (ab)²} ≤ 1.  (1)
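A quick grid check (our sketch) confirms both inequalities numerically; they follow because (acρ)² ≤ c² and (ab)² ≤ b² whenever all coefficients lie in (−1, 1):

```python
# Special cases of the bias ratio when M is caused by only one latent:
# b = 0 gives (1 - c^2) / (1 - (a*c*rho)^2);
# c = 0 gives (1 - b^2) / (1 - (a*b)^2).
def ratio_b0(a, c, rho):
    return (1 - c**2) / (1 - (a * c * rho) ** 2)

def ratio_c0(a, b):
    return (1 - b**2) / (1 - (a * b) ** 2)

# Both ratios stay at or below one over a grid of admissible coefficients,
# so adjustment can only reduce the bias in these special cases.
grid = [i / 10 for i in range(-9, 10)]
ok = all(ratio_b0(a, c, r) <= 1 and ratio_c0(a, c) <= 1
         for a in grid for c in grid for r in grid)
print(ok)  # True
```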

Therefore, if either U or W is not causative of M, the adjusted estimator is always better than the unadjusted one.

Butterfly-Bias: M-Bias with Confounding Bias. Models, especially in the social sciences, are approximations. They rarely hold exactly. In particular, for any covariate M of interest, there is likely to be some concern that M is indeed a confounder, even if it is also a possible source of M-Bias. If we let M be both a confounder and the middle of an M-Structure, we obtain a “Butterfly-Structure” (Pearl, 2013b), as shown in Figure 6. In this circumstance, conditioning will help with confounding bias but hurt with M-Bias. Ignoring M will not resolve any confounding, but will avoid M-Bias. The question then becomes that of determining which is the lesser of the two evils.

Figure 6: Butterfly-Structure

We now examine this trade-off for a GLSEM corresponding to Figure 6. The DGP is given by the following equations:

U, W, ε_M, ε_T, ε_Y ~ iid N(0, 1),
M = bU + cW + √(1 − b² − c²) ε_M,
T = aU + eM + √(1 − a² − e²) ε_T,
Y = dW + fM + √(1 − d² − f²) ε_Y.

Again, the true causal effect of T on Y is zero. The unadjusted estimator obtained by regressing Y onto T is the covariance between T and Y:

Bias_unadj = Cov(T, Y) = abf + cde + ef.

It is not, in general, 0, implying bias. The adjusted estimator (see Lemma 3 in the Appendix for a proof), obtained by regressing Y onto (T, M), has bias

Bias_adj = −abcd / {1 − (ab + e)²}.
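Both butterfly-bias formulae can be verified numerically (our sketch; all six coefficients are set to 0.2 for illustration): the unadjusted covariance and the adjusted regression coefficient of T match their closed forms.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
a = b = c = d = e = f = 0.2

# Butterfly-Structure DGP (Figure 6): M is both a collider and a confounder.
U = rng.standard_normal(n)
W = rng.standard_normal(n)
M = b * U + c * W + np.sqrt(1 - b**2 - c**2) * rng.standard_normal(n)
T = a * U + e * M + np.sqrt(1 - a**2 - e**2) * rng.standard_normal(n)
Y = d * W + f * M + np.sqrt(1 - d**2 - f**2) * rng.standard_normal(n)

# Unadjusted bias: Cov(T, Y) = a*b*f + c*d*e + e*f.
bias_unadj = a * b * f + c * d * e + e * f
print(round(np.cov(T, Y)[0, 1], 3), round(bias_unadj, 3))  # both ≈ 0.056

# Adjusted bias: -a*b*c*d / (1 - (a*b + e)^2), an order of magnitude smaller.
bias_adj = -a * b * c * d / (1 - (a * b + e) ** 2)
coef_T = np.linalg.lstsq(np.column_stack([T, M]), Y, rcond=None)[0][0]
print(round(coef_T, 4), round(bias_adj, 4))  # both ≈ -0.0017
```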

Figure 7: Butterfly-Bias. (a) Absolute biases of both estimators with a = b = c = d = e = f. (b) Comparison of the absolute biases with a = b = c = d and e = f. Within 71.2% (in grey) of the feasible region, the adjusted estimator has smaller bias than the unadjusted one.

If the values of e and f are relatively high (i.e., M has a strong effect on both T and Y), the confounding bias is large and the unadjusted estimator will be severely biased. For example, if a, b, c, d, e, and f all equal 0.2, the bias of the unadjusted estimator is 0.056, but the bias of the adjusted estimator is only −0.0017, an order of magnitude smaller. Generally, the largest term of the unadjusted bias is the second-order term ef, while the adjusted bias has only, ignoring the denominator, a fourth-order term abcd. This suggests adjustment is generally preferable and that M-Bias is in some respect a “higher order bias”.

Detailed comparison of the ratio of the biases is difficult, since we can vary six parameters (a, b, c, d, e, f). In Figure 7(a), we assume all the correlation coefficients have the same magnitude, and plot the bias of both estimators as a function of the correlation coefficient within the feasible region −√2/2 < a < (−1 + √5)/2, which follows from the restrictions

b² + c² < 1,  a² + e² < 1,  d² + f² < 1,  and  |ab + e| < 1.  (2)

Within 74.9% of the feasible region, the adjusted estimator has smaller bias than the

unadjusted one. The unadjusted estimator has smaller bias than the adjusted estimator only when the correlation coefficients are extremely large. In Figure 7(b), we assume a = b = c = d and e = f, and compare |Bias_adj| and |Bias_unadj| within the feasible region of (a, e) defined by (2). We can see that the adjusted estimator is superior to the unadjusted one for 71% (colored in grey in Figure 7(b)) of the feasible region. In the area satisfying |e| > |a| in Figure 7(b), where the connections from M to T and to Y are stronger than the other connections, the region is almost entirely grey, suggesting that the adjusted estimator is preferable. This is sensible because here the confounding bias has larger magnitude than the M-Bias. In the area satisfying |a| > |e|, where M-Bias is stronger than confounding bias, the unadjusted estimator is superior for some values, but still tends to be inferior when the correlations are roughly the same size.

3 Extensions to a Binary Treatment

Figure 8: M-Structure with Possibly Correlated Hidden Causes and a Binary Treatment

If the treatment is binary, one might worry that the conclusions in the previous section are not applicable. It turns out, however, that they are. In this section, we extend the results in Section 2 to binary treatments by representing the treatment through a latent Gaussian variable. In particular, we extend Figure 2 to Figure 8. Here, T* plays the role of the old T. The generating equations for T and T* become

T = I(T* ≥ α),  and  T* = aU + √(1 − a²) ε_T.

Other variables and noise terms remain the same. The intercept α determines the proportion of individuals receiving the treatment: Φ(−α) = P(T = 1), where Φ(·) is the cumulative distribution function of the standard Normal distribution.
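As a quick sanity check on this latent-threshold representation (an illustrative simulation of ours, not part of the paper; the function names are hypothetical), one can simulate T* and verify that the rule T = I(T* ≥ α) reproduces P(T = 1) = Φ(−α):

```python
import math
import random

def norm_cdf(x):
    """Standard Normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def treated_fraction(a, alpha, n=200_000, seed=1):
    """Simulate T* = a*U + sqrt(1 - a^2)*eps_T and threshold at alpha."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - a ** 2)
    treated = 0
    for _ in range(n):
        t_star = a * rng.gauss(0.0, 1.0) + s * rng.gauss(0.0, 1.0)
        treated += t_star >= alpha
    return treated / n

for alpha in (-0.5, 0.0, 0.4):
    frac = treated_fraction(a=0.3, alpha=alpha)
    # T* is standard Normal regardless of a, so P(T = 1) = Phi(-alpha).
    assert abs(frac - norm_cdf(-alpha)) < 0.01
```

Note that the treated fraction does not depend on a, since T* is standard Normal for any a; only α moves the treatment proportion, with a balanced design at α = 0.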
When α = 0, the individuals exposed to the treatment and the control are

balanced; when α < 0, more individuals are exposed to the treatment; when α > 0, the reverse.

Figure 9: M-Bias with correlated U and W and a binary treatment; compare to Figure 5. Each panel (ρ = 0.1, 0.2, 0.4) plots the absolute biases of the unadjusted and adjusted estimators against a = b = c = d; the proportion of the region with |Bias_adj| < |Bias_unadj| is 0.572, 0.733, and 0.863, respectively.

The true causal effect of T on Y is again zero. Let φ(·) = Φ′(·) and η(α) ≡ φ(α)/{Φ(α)Φ(−α)}. Then Lemma 6 in the Appendix shows that the unadjusted estimator has bias

Bias_unadj = adρη(α),

and the adjusted estimator has bias

Bias_adj = adη(α){ρ(1 − b² − c² − bcρ) − bc} / {1 − (ab + acρ)²φ(α)η(α)}.

When ρ = 0, the unadjusted estimator is unbiased, but the adjusted estimator has bias

−abcdη(α) / {1 − (ab)²φ(α)η(α)}.

When ρ ≠ 0, the ratio of the absolute biases of the adjusted and unadjusted estimators is

Bias_adj / Bias_unadj = {ρ(1 − b² − c² − bcρ) − bc} / [ρ{1 − (ab + acρ)²φ(α)η(α)}].

The patterns for a binary treatment do not differ much from those for a Gaussian treatment. As before, if the correlation coefficient is moderately small, the M-Bias

also tends to be small. As shown in Figure 9 (analogous to Figure 5), when |ρ| is comparable to |a| (= |b| = |c| = |d|), the adjusted estimator is less biased than the unadjusted estimator. Only when |a| is much larger than |ρ| is the unadjusted estimator superior.

Butterfly-Bias with a Binary Treatment

We can extend the GLSEM Butterfly-Bias setup to a binary treatment just as we extended the M-Bias setup. Compare Figure 10 to Figure 6 and Figure 8 to Figure 2. T becomes T*, and T is built from T* as above. The structural equations for T and T* for Butterfly-Bias in the binary case are then

T = I(T* ≥ α),  and  T* = aU + eM + √(1 − a² − e²) ε_T.

The other equations and variables remain the same as before.

Figure 10: Butterfly-Structure with a Binary Treatment

Although the true causal effect of T on Y is zero, Lemma 7 in the Appendix shows that the unadjusted estimator has bias

Bias_unadj = (cde + abf + ef)η(α),   (3)

and the adjusted estimator has bias

Bias_adj = −abcdη(α) / {1 − (ab + e)²φ(α)η(α)}.   (4)

Therefore, the ratio of the absolute biases of the adjusted and unadjusted estimators is (the η(α) factors cancel)

Bias_adj / Bias_unadj = abcd / [(cde + abf + ef){1 − (ab + e)²φ(α)η(α)}].

Complete investigation of the ratio of the biases is intractable with seven varying parameters (a, b, c, d, e, f, α). However, in the very common case with

α = 0, which gives equal-sized treatment and control groups, we again find trends similar to the Gaussian case. See Figure 11. As before, only in the cases with very small e (= f) but large a (= b = c = d) does the unadjusted estimator tend to be superior. Within a reasonable region of α, these patterns are quite similar.

Figure 11: Butterfly-Bias with a Binary Treatment. (a) Absolute biases with a = b = c = d = e = f (proportion with |Bias_adj| < |Bias_unadj|: 0.777). (b) Comparison of the absolute biases with a = b = c = d and e = f. Within 74.4% (in grey) of the feasible region, the adjusted estimator is better than the unadjusted estimator.

4 Illustration: Rubin-Pearl Controversy

Pearl (2009b) cites Rubin (2007)'s example about the causal effect of smoking habits (T) on lung cancer (Y), and argues that conditioning on the pretreatment covariate "seat-belt usage" (M) would introduce spurious associations, since M could reasonably be thought of as an indicator of a person's attitudes toward societal norms (U) as well as safety and health related measures (W). Assuming all the analysis is already conditioned on other observed covariates, we focus our discussion on the five variables (U, W, M, T, Y), whose dependence structure is illustrated by

Figure 12. Since the patterns with a Gaussian treatment and a binary treatment are similar, we focus our discussion under the assumption of a Gaussian treatment.

Figure 12: A Potential M-Structure in Rubin's Example (Rubin, 2007, Pearl, 2009b)

As Pearl (2009b) points out,

    If we have good reasons to believe that these two types of attitudes are marginally independent, we have a pure M-structure on our hand.

In the case with ρ = 0, conditioning on M will lead to spurious correlation between T and Y under the null, and will bias the estimation of the causal effect of T on Y. However, Pearl (2009b) also recognizes that the independence assumption seems very strong in this example, since U and W are both background variables about the habits and personality of a person. Pearl (2009b) further argues:

    But even if marginal independence does not hold precisely, conditioning on "seat-belt usage" is likely to introduce spurious associations, hence bias, and should be approached with caution.

Although we believe most things should be approached with caution, our work above suggests that even mild perturbations of an M-Structure can switch which of the two approaches, conditioning or not conditioning, is likely to remove more bias. In particular, Pearl (2009b) is correct in that the adjusted estimator indeed tends to introduce more bias than the unadjusted one when an exact M-Structure holds, and thus the general advice to "condition on all observed covariates" may not be sensible in this context.
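This collider effect is easy to see in simulation. The following sketch is our own illustration with arbitrary coefficient values, not code from the paper: it generates a pure M-Structure with ρ = 0, where the coefficient from regressing Y on T alone is approximately zero, while adding M to the regression induces a coefficient close to the value −abcd/{1 − (ab)²} implied by Lemma 2 with ρ = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
a = b = c = d = 0.5  # illustrative path coefficients

# Pure M-Structure: U and W independent (rho = 0), no T -> Y effect.
U, W = rng.standard_normal(n), rng.standard_normal(n)
M = b * U + c * W + np.sqrt(1 - b**2 - c**2) * rng.standard_normal(n)
T = a * U + np.sqrt(1 - a**2) * rng.standard_normal(n)
Y = d * W + np.sqrt(1 - d**2) * rng.standard_normal(n)

def coef_on_T(y, X):
    """OLS coefficient on the first regressor in X (with intercept)."""
    Z = np.column_stack([np.ones(len(y))] + list(X))
    return np.linalg.lstsq(Z, y, rcond=None)[0][1]

unadj = coef_on_T(Y, [T])    # ~ 0: T and Y are marginally independent
adj = coef_on_T(Y, [T, M])   # ~ -abcd / (1 - (ab)^2): collider bias

assert abs(unadj) < 0.01
assert abs(adj - (-a * b * c * d / (1 - (a * b) ** 2))) < 0.01
```

With a = b = c = d = 0.5, conditioning on the collider M pulls the estimated effect from roughly zero to roughly −1/15, even though T has no causal effect on Y.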
However, in the example of Rubin (2007), the exact independence between a person's attitude toward societal norms U and safety and health related measures W is questionable, since we have good reasons to believe that other hidden variables such as income and family background will affect both U and W simultaneously, and thus Pearl's fears may be unfounded.

Figure 13: Sensitivity Analysis of Pearl (2009b)'s Critique of Rubin (2007)'s Example. (a) M-Bias with an additional arrow from W to T. (b) Biases of the resulting unadjusted and adjusted estimators with a = b = c = d = g.

To examine this further, we consider two possible deviations from the exact M-Structure, and investigate the biases of the unadjusted and adjusted estimators for each.

(a) (Correlated U and W) Assume the DGP follows the DAG in Figure 12, with an additional correlation between the attitudes U and W, as shown in Figure 2. If we then assume that all the correlation coefficients have the same positive magnitude, earlier results demonstrate that the adjusted estimator is preferable, as it strictly dominates the unadjusted estimator except for extremely large values of the correlation coefficients. Furthermore, in Rubin (2007)'s example, attitudes toward societal norms U are more likely to affect the "seat-belt usage" variable M than are safety and health related measures W, which further strengthens the case for adjustment. If we were willing to assume that c is zero but ρ is not, equation (1) again shows that the adjusted estimator is superior.

(b) (An arrow from W to T) Pearl's example seems a bit confusing on further inspection, even if we accept his independence assumption U ⊥ W. In particular, one's attitudes toward safety and health related measures likely impact one's decisions about smoking. Therefore, we might reasonably expect an arrow from W to T. In Figure 13(a), we remove the correlation between U and W, but we allow an arrow from W to T; i.e., the generating equation for T becomes

T = aU + gW + √(1 − a² − g²) ε_T.
Lemma 8 in the Appendix gives the associated

formulae for the biases of the adjusted and unadjusted estimators. Figure 13(b) shows that, assuming a = b = c = d = g (i.e., equal correlations), the adjusted estimator is uniformly better.

5 Discussion

For objective causal inference, Rubin and Rosenbaum suggest balancing all the pretreatment covariates in observational studies to parallel the design of randomized experiments (Rubin, 2007, 2008, 2009, Rosenbaum, 2002), which is called the "pretreatment criterion" (VanderWeele and Shpitser, 2011). However, Pearl and other researchers (Pearl, 2009a,b, Shrier, 2008, 2009, Sjölander, 2009) criticize the "pretreatment criterion" by pointing out that this criterion may lead to biased inference in the presence of a possible M-Structure, even if the treatment assignment is unconfounded. While we agree that Pearl's warning is very insightful, our theoretical and numerical results show that this conclusion is quite sensitive to various deviations from the exact M-Structure, e.g., to circumstances where latent causes may be correlated or where the M variable may also be a confounder between the treatment and the outcome. In particular, results derived from causal DAGs appear to be quite dependent on the absolute zeros that correspond to missing arrows. It is important, then, to examine the sensitivity of these findings to even modest departures from these pure independence assumptions. And indeed, the above results suggest that, in many cases, adjusting for all the pretreatment covariates is in fact a reasonable choice. Many on both sides of this debate argue that scientific knowledge is key for a proper design, either in the generation of the DAG or in the justification of a strong ignorability assumption. We agree, but in either case this scientific knowledge should also shed light on whether a pure M-Structure is a real possibility or not.
Acknowledgment

The authors thank all the participants in the "Causal Graphs in Low and High Dimensions" seminar at the Harvard Statistics Department in Fall 2012, and thank Professor Peter Spirtes for sending us his slides presented at the WNAR/IMS meeting in Los Angeles, CA, in 2002.

Appendix: Lemmas and Proofs

The proofs in this paper are based on some simple results from regression applied to the structural equations defined by the DAGs. We heavily use the following fact:

Lemma 1 In the linear regression model Y = β₀ + β_T T + β_M M + ε with ε ⊥ (T, M) and E(ε) = 0, we have

β_T = {Cov(Y,T)Var(M) − Cov(Y,M)Cov(M,T)} / {Var(T)Var(M) − Cov²(M,T)}.

Proof. Solve for (β_T, β_M) using the following moment conditions:

Cov(Y,T) = β_T Var(T) + β_M Cov(M,T),
Cov(Y,M) = β_T Cov(M,T) + β_M Var(M).

Lemma 2 Under the model generated by the DAG in Figure 2, the regression coefficient of T from regressing Y onto (T, M) is

β_T = {adρ(1 − b² − c² − bcρ) − abcd} / {1 − (ab + acρ)²}.

Proof. We simply expand and rearrange the terms in Lemma 1. In particular, all variance terms such as Var(M) are 1. The covariance terms are easily calculated as well. For example, we have

Cov(Y,T) = Cov(dW + √(1 − d²) ε_Y, aU + √(1 − a²) ε_T) = Cov(dW, aU) = adρ

and

Cov(Y,M) = Cov(dW + √(1 − d²) ε_Y, bU + cW + √(1 − b² − c²) ε_M) = Cov(dW, bU) + Cov(dW, cW) = bdρ + cd.

Lemma 3 Under the model generated by the DAG in Figure 6, the regression coefficient of T from regressing Y onto (T, M) is

β_T = −abcd / {1 − (ab + e)²}.

Proof. Follow Lemma 2. Expand Lemma 1 to obtain

β_T = {(abf + cde + ef) − (cd + f)(ab + e)} / {1 − (ab + e)²}.

Now rearrange.

Lemma 4 Assume that (X₁, X₂) follows a bivariate Normal distribution:

(X₁, X₂)ᵀ ~ N( (0, 0)ᵀ, [ 1 r ; r 1 ] ).

Then E(X₁ | X₂ ≥ z) − E(X₁ | X₂ < z) = rη(z), where η(z) = φ(z)/{Φ(z)Φ(−z)}.

Proof. Since X₁ = rX₂ + √(1 − r²) Z with Z ~ N(0,1) and Z ⊥ X₂, we have

E(X₁ | X₂ ≥ z) = rE(X₂ | X₂ ≥ z) = r/Φ(−z) ∫_z^∞ xφ(x)dx = −r/Φ(−z) ∫_z^∞ dφ(x) = rφ(z)/Φ(−z).

Similarly, we have E(X₁ | X₂ < z) = E(X₁ | −X₂ > −z) = −rφ(−z)/Φ(z) = −rφ(z)/Φ(z). Therefore, we have

E(X₁ | X₂ ≥ z) − E(X₁ | X₂ < z) = rφ(z){1/Φ(−z) + 1/Φ(z)} = rφ(z)/{Φ(z)Φ(−z)}.

Lemma 5 The covariance between X and B ~ Bernoulli(p) is

Cov(X, B) = p(1 − p){E(X | B = 1) − E(X | B = 0)}.

Proof. We have Cov(X, B) = E(XB) − E(X)E(B) = pE(X | B = 1) − p{pE(X | B = 1) + (1 − p)E(X | B = 0)} = p(1 − p){E(X | B = 1) − E(X | B = 0)}.

Lemma 6 Under the model generated by the DAG in Figure 8, the regression coefficient of T from regressing Y onto (T, M) is

β_T = adη(α){ρ(1 − b² − c² − bcρ) − bc} / {1 − (ab + acρ)²φ(α)η(α)}.

Proof. We have the following joint Normality of (Y, M, T*):

Y  = dW + √(1 − d²) ε_Y,
M  = bU + cW + √(1 − b² − c²) ε_M,
T* = aU + √(1 − a²) ε_T,

so that (Y, M, T*)ᵀ ~ N₃(0, Σ), where

Σ = [ 1          bdρ + cd    adρ
      bdρ + cd   1           ab + acρ
      adρ        ab + acρ    1        ].

Derive this by computing all covariance terms between Y and M, etc., and then plugging them into the above matrix. From Lemma 4, we have

E(M | T = 1) − E(M | T = 0) = E(M | T* ≥ α) − E(M | T* < α) = (ab + acρ)η(α),
E(Y | T = 1) − E(Y | T = 0) = E(Y | T* ≥ α) − E(Y | T* < α) = adρη(α).

Therefore, from Lemma 5, the covariances are

Cov(M, T) = Φ(α)Φ(−α)(ab + acρ)η(α),
Cov(Y, T) = Φ(α)Φ(−α)adρη(α).

According to Lemma 1, noting that Var(T) = Φ(α)Φ(−α), the regression coefficient β_T is

β_T = {Φ(α)Φ(−α)adρη(α) − (bdρ + cd)Φ(α)Φ(−α)(ab + acρ)η(α)} / {Φ(α)Φ(−α) − Φ²(α)Φ²(−α)(ab + acρ)²η²(α)}
    = {adρη(α) − (bdρ + cd)(ab + acρ)η(α)} / {1 − (ab + acρ)²φ(α)η(α)}
    = adη(α){ρ(1 − b² − c² − bcρ) − bc} / {1 − (ab + acρ)²φ(α)η(α)}.

Lemma 7 Under the model generated by the DAG in Figure 10, the regression coefficient of T from regressing Y onto (T, M) is

β_T = −abcdη(α) / {1 − (ab + e)²φ(α)η(α)}.

Proof. We have the following joint Normality of (Y, M, T*):

Y  = dW + fM + √(1 − d² − f²) ε_Y,
M  = bU + cW + √(1 − b² − c²) ε_M,
T* = aU + eM + √(1 − a² − e²) ε_T,

so that (Y, M, T*)ᵀ ~ N₃(0, Σ), where

Σ = [ 1                cd + f    cde + abf + ef
      cd + f           1         ab + e
      cde + abf + ef   ab + e    1              ].

From Lemma 4, we have

E(M | T = 1) − E(M | T = 0) = E(M | T* ≥ α) − E(M | T* < α) = (ab + e)η(α),
E(Y | T = 1) − E(Y | T = 0) = E(Y | T* ≥ α) − E(Y | T* < α) = (cde + abf + ef)η(α).

From Lemma 5, we obtain the covariances

Cov(M, T) = Φ(α)Φ(−α)(ab + e)η(α),
Cov(Y, T) = Φ(α)Φ(−α)(cde + abf + ef)η(α).

According to Lemma 1, the regression coefficient β_T is

β_T = {Φ(α)Φ(−α)(cde + abf + ef)η(α) − (cd + f)Φ(α)Φ(−α)(ab + e)η(α)} / {Φ(α)Φ(−α) − Φ²(α)Φ²(−α)(ab + e)²η²(α)}
    = {(cde + abf + ef)η(α) − (cd + f)(ab + e)η(α)} / {1 − (ab + e)²φ(α)η(α)}
    = −abcdη(α) / {1 − (ab + e)²φ(α)η(α)}.

Lemma 8 Under the model generated by the DAG in Figure 13(a), the unadjusted estimator has bias adρ + dg, and the adjusted estimator has bias

{dg − cd(ab + cg)} / {1 − (ab + cg)²}.

Proof of Lemma 8. The unadjusted estimator is the covariance Cov(T, Y) = adρ + dg. Expanding Lemma 1 gives the stated expression as the regression coefficient of T from regressing Y onto (T, M).
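As a numeric check of Lemma 8 (our own sketch, not from the paper; the function name arrow_biases is ours), the code below restricts to ρ = 0, as in Figure 13(a), plugs the model-implied covariances into Lemma 1 with unit variances, and confirms that with a = b = c = d = g = t the adjusted bias reduces to t²/(1 + 2t²), uniformly smaller in magnitude than the unadjusted bias dg = t², matching the pattern in Figure 13(b).

```python
import math

def arrow_biases(a, b, c, d, g):
    """Unadjusted and adjusted biases for the M-Structure of
    Figure 13(a): U independent of W (rho = 0) plus an arrow W -> T
    with coefficient g; Lemma 1 with unit variances."""
    cov_ty = d * g            # Cov(T, Y)
    cov_ym = c * d            # Cov(Y, M)
    cov_mt = a * b + c * g    # Cov(M, T)
    unadj = cov_ty
    adj = (cov_ty - cov_ym * cov_mt) / (1 - cov_mt ** 2)
    return unadj, adj

# With equal coefficients a = b = c = d = g = t, the adjusted bias
# reduces to t^2 / (1 + 2 t^2) < t^2 = unadjusted bias, for t != 0.
for t in [x / 100 for x in range(-65, 66) if x != 0]:
    unadj, adj = arrow_biases(t, t, t, t, t)
    assert math.isclose(adj, t ** 2 / (1 + 2 * t ** 2))
    assert abs(adj) < abs(unadj)
```

The grid stays within |t| < √2/2 so that the restriction a² + g² < 1 holds throughout.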

References

Berkson, J. (1946): "Limitations of the application of fourfold table analysis to hospital data," Biometrics Bulletin, 2, 47–53.
Gelman, A. (2011): "Causality and statistical learning," American Journal of Sociology, 117, 955–966.
Greenland, S. (2002): "Quantifying biases in causal models: classical confounding vs collider-stratification bias," Epidemiology, 14, 300–306.
Liu, W., M. A. Brookhart, S. Schneeweiss, X. Mi, and S. Setoguchi (2012): "Implications of M bias in epidemiologic studies: a simulation study," American Journal of Epidemiology, 176, 938–948.
Neyman, J. (1923/1990): "On the application of probability theory to agricultural experiments. Essay on principles. Section 9," Statistical Science, 5, 465–472.
Pearl, J. (1995): "Causal diagrams for empirical research," Biometrika, 82, 669–688.
Pearl, J. (2000): Causality: Models, Reasoning and Inference, Cambridge University Press.
Pearl, J. (2009a): "Letter to the editor: Remarks on the method of propensity score," Statistics in Medicine, 28, 1415–1416.
Pearl, J. (2009b): "Myth, confusion, and science in causal analysis," Technical Report.
Pearl, J. (2013a): "Discussion on 'Surrogate measures and consistent surrogates'," Biometrics, 69, 573–577.
Pearl, J. (2013b): "Linear models: a useful microscope for causal analysis," Journal of Causal Inference, 1, 155–170.
Rosenbaum, P. R. (2002): Observational Studies, Springer.
Rosenbaum, P. R. and D. B. Rubin (1983): "The central role of the propensity score in observational studies for causal effects," Biometrika, 70, 41–55.
Rubin, D. B. (1974): "Estimating causal effects of treatments in randomized and nonrandomized studies," Journal of Educational Psychology, 66, 688.
Rubin, D. B. (2007): "The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials," Statistics in Medicine, 26, 20–36.
Rubin, D. B. (2008): "For objective causal inference, design trumps analysis," The Annals of Applied Statistics, 808–840.
Rubin, D. B. (2009): "Should observational studies be designed to allow lack of balance in covariate distributions across treatment groups?" Statistics in Medicine, 28, 1420–1423.
Shrier, I. (2008): "Letter to the editor," Statistics in Medicine, 27, 2740–2741.
Shrier, I. (2009): "Propensity scores," Statistics in Medicine, 28, 1315–1318.
Sjölander, A. (2009): "Propensity scores and M-structures," Statistics in Medicine, 28, 1416–1420.
Spirtes, P. (2002): Presented at the WNAR/IMS Meeting, Los Angeles, CA, June 2002.
VanderWeele, T. J. and I. Shpitser (2011): "A new criterion for confounder selection," Biometrics, 67, 1406–1413.
Wright, S. (1921): "Correlation and causation," Journal of Agricultural Research, 20, 557–585.
Wright, S. (1934): "The method of path coefficients," The Annals of Mathematical Statistics, 5, 161–215.