
1 Causal Inference in Randomized and Non-Randomized Studies: The Definition, Identification, and Estimation of Causal Parameters

Michael E. Sobel

DESIGN AND INFERENCE

INTRODUCTION

The distinction between causation and association has figured prominently in science and philosophy for several hundred years at least, and, more recently, in statistical science as well, indeed, since Galton, Pearson, and Yule developed the theory of correlation. Statisticians have pioneered two approaches to causal inference that have proven influential in the natural and behavioral sciences. The oldest dates back to Yule (1896), who wrote extensively about ‘illusory’ correlations, by which he meant correlations that should not be endowed with a causal interpretation. To distinguish between the illusory and non-illusory correlations, Yule invented partial correlation to ‘control’ for the influence of a common factor, arguing in context that because the relationship between pauperism and out relief did not vanish when ‘controlling’ for poverty, this relationship could be deemed causal. A half century later, philosophers, psychologists and social scientists (e.g., Reichenbach, 1956; Simon, 1954; Suppes, 1970) rediscovered Yule’s approach to distinguishing between causal and non-causal relationships, and econometricians (e.g., Granger, 1969) extended this idea to the time-series setting. Graphical models, path analysis and, more generally, structural equation models, when these methods are used to make causal inferences, also rely on this type of reasoning.

The theory of experimental design, which emerged in the 1920s and thereafter, and is associated especially with Neyman (1923) and Fisher (1925), forms the basis for a second approach to inferring causal relationships. Here, the use of good design, especially randomization, is emphasized, apparently obviating the need to worry about spurious relationships.

Despite these important contributions, during the majority of the twentieth century, most statisticians espoused the view that statistics had little to do with causation. But the situation has reversed dramatically in the last 30 years, since Rubin (1974, 1977, 1978, 1980) rediscovered Neyman’s potential outcomes notation and extended the theory of experimental design to observational studies. Currently, there is a large and growing literature in statistics on the topic of causal inference, and this second approach to inferring causal relationships is coming to dominate the first approach, even in disciplines that rely on observational studies and where the first approach has traditionally dominated.

This chapter provides an introduction, tailored to the concerns of behavioral scientists, to this second approach to causal inference. Because causal inference is the act of making inferences about the causal relation, and notions of the causal relation differ, it is important to understand what notion of causation is under consideration when such an inference is made. Thus, in the next section, I briefly review several notions of causation and also briefly examine the approach to causal inference that derives from Yule. In the section ‘Unit and average causal effects’ the second approach, which is built on the idea that a causal relation should sustain a counterfactual conditional statement, is introduced, and a number of estimands of interest are defined. The section ‘Identification of causal parameters under ignorable treatment assignment’ discusses the identification of causal effects and the next section, ‘Estimation of causal parameters in randomized studies’, discusses estimation. The section ‘Mediation analyses’ takes up the topic of mediation, which is of special interest to psychologists and prevention scientists. I show that the usual approach to mediation, which uses structural equation modeling, does not yield estimates of causal parameters, even in randomized studies. Several other approaches to mediation, including principal stratification and instrumental variables, are also considered.

CAUSATION AND PROBABILISTIC CAUSATION

Regularity theories of causation are concerned with the full (or philosophical) cause of an effect, by which is meant a set of conditions that is sufficient (or necessary or necessary and sufficient) for the effect to occur. This type of theory descends from Hume, who claimed that causation (as it exists in the real world) consists only of the following: (1) temporal priority, i.e., the cause must precede the effect in time; (2) spatiotemporal contiguity, i.e., the cause and effect are ‘near’ in time and space; and (3) constant conjunction, i.e., if the same circumstances are repeated, the same outcome will occur. Many subsequent writers argued that Hume’s analysis cannot distinguish between regularities that are not causal, such as a relation between two events brought about by a common factor, and genuine causation. At the minimum, this suggests that Hume’s account is incomplete. A number of philosophers [e.g., Bunge (1979) and Harré and Madden (1975)] argue that the causal relation is generative. While this idea is appealing, especially to modern scientists who speak of mechanisms, attempts to elaborate this idea have not been entirely successful. Another approach (examined later) is to require causal relationships to sustain counterfactual conditional statements.

Hume’s analysis is also deterministic, and the literature on probabilistic causation that descends from Yule can be viewed as an attempt to both relax this feature and distinguish between causal and non-causal regularities. The basic idea is as follows. First, there is a putative cause Z prior in some sense to an outcome Y. Further, Z and Y are associated (correlated). However, if the Z-Y association vanishes when a (set of) variable(s) X prior to Z is conditioned on (or in some accounts, if such a set exists), this is taken to mean that Z ‘does not cause’ Y, that is, the relationship is ‘spurious’. To complete the picture, various examples where this criterion would seem to work well have been constructed.

Granger causation and structural equation models use this type of reasoning to distinguish between empirical relationships that are regarded as causal and not causal. For example, consider a structural equation model, with X a vector of variables, Z an outcome occurring after X, and Y an outcome after Z. If the ‘direct effect’ of Z on Y is 0 (not 0), Z is not viewed (is viewed) as a cause of Y. A sufficient condition for this direct effect to be 0 is that Y and Z are conditionally independent, given X. Of course, this same kind of reasoning can be extended to the consideration of other types of direct and indirect effects. For example, consider the case of a variable X associated with Z, with X and Y conditionally independent, given Z, implying the ‘direct effect’ of X on Y is 0, and Z and Y not conditionally independent given X, implying the ‘direct effect’ of Z on Y is non-zero. Here there is an effect of X on Y, but the effect is indirect, through Z.

There are a number of problems with this approach. First and foremost, it confounds causation with the act of inferring causation, as evidenced by the fact that the criteria above for inferring causation are typically put forth independently of any explicit notion of the causal relation. As notions of the causal relation vary, this method of inferring causation may be appropriate for some notions of causation, for example, the case where causation is regarded as a predictive relationship among variables, but not for others. Nor (because the nature of the causal relation is not explicitly considered) is it clear whether causation is viewed as probabilistic in some inherent sense or if probability arises in some other way. The deficiencies of this approach are evident in the psychological literature on causal modeling, where a variety of extra-mathematical considerations (such as model specification) are used to suggest that model coefficients can be endowed with a causal interpretation (see, for example, Sobel, 1995, on this point).

By way of contrast to regularity theories, manipulability theories view causes as variables that can be manipulated, with the outcome depending on the state of the manipulated variable. Here, as opposed to specifying all the variables and the functional relationship between these and the outcome (which would constitute a successful causal account of a phenomenon under a regularity theory), the goal is more modest, to examine the ‘effect’ of a particular variable. See Sobel (1995) for a reconciliation of these two approaches.

Manipulability theories require the causal relation to sustain a counterfactual conditional statement (e.g., eating the poison caused John to die means that John ate the poison and died, but had John not eaten the poison, he would not have died). This is closer to the way an experimentalist thinks of causation. However, many philosophers regard manipulability theories as anthropomorphic. Further, many questions that scientists ask are not amenable to experimentation, e.g., the effect of education on longevity or the effect of marriage on happiness. This would appear to seriously limit the value of this approach for addressing real scientific questions.

However, even without manipulating a person’s level of education, one might imagine that had this person’s level of education taken on a different value than that actually realized, this person might also have a different outcome. This suggests adopting the broader view that it is not the manipulation per se, but the idea that the causal variable could take on a different value than it actually takes on, which is key. This is the idea underlying counterfactual theories of causation (e.g., Lewis, 1973), where the closest possible world to the one we live in serves as the basis for the counterfactual.

Counterfactual theories also have their difficulties, both theoretical and practical. One criticism goes under the rubric of ‘preemption’. Person A shoots person C in the head, and C dies. It seems natural to claim that person A caused person C to die. Yet, suppose that if A had not shot C, B would have done so and C would also have died. In that case, C dies whether or not A shoots him, so one cannot (under a simple counterfactual theory) say A caused C to die. This seems wrong.

In practice, the outcome may also depend on the way in which the cause is brought about. When an experiment is performed, this issue is not, per se, problematic, and the effect corresponding to that manipulation is well defined. Otherwise, as there may be different outcomes, ‘the effect’ is ill-defined, unless the closest world is specified; in some instances, this will be a very difficult task. This suggests that some questions, e.g., the effect of marriage on happiness, may be better left unasked (or at the minimum, one must specify the hypothetical intervention by which persons are exposed/not exposed to marriage and to marriage partners).

I now turn to the recent statistical literature on causal inference, which is also based on the idea that causal relations sustain counterfactual conditionals. This approach to causal inference (as suggested by the preceding material) is not concerned with elucidating the various causes of the outcome (effect) and the way in which these causes produce the effect, but with the more limited goal of inferring the effect (in a sense to be described) of a particular causal variable. Scientists who are interested in a fuller accounting of the causes of an effect and the pathways through which the effect is produced may find this approach less than entirely satisfying. However, as discussed subsequently, this approach can also be used to evaluate methods (such as structural equation models) that researchers sometimes use to provide a fuller account, and it is not hard to show that these methods rest on a number of implausible assumptions.

UNIT AND AVERAGE CAUSAL EFFECTS

The notion of causation congruous to recent statistical work on causal inference has two important properties. First, the causal relation is singular, i.e., it is meaningful to speak of effects at the individual level and these effects may vary over individuals (heterogeneity). Second, causal statements sustain counterfactual conditionals. Thus, we might state that attending health class caused Bill to drink less, by which we mean that Bill went to health class and later drank amount $y$, whereas had he not attended health class, later he would have drunk $y^* > y$. For Mary, perhaps the outcome is the same whether or not she attends, in which case we would say that attending health class did not cause Mary to drink less (or more). Note that only attending health class is considered as the cause. Other possible causes, e.g., sex, are regarded as part of the causal background (pretreatment covariates in statistical language).

The single most important contribution in this literature is the potential outcomes notation developed by Neyman (1923) and Rubin (1974) to express the ideas above. Using this notation allows causal effects to be defined independently of the association parameters that are actually estimated in studies; one can then ask whether and under what conditions these associations equal the causal effects. Consider the case of an experiment where unit i in a population P is assigned (or not) to receive a treatment. The data for this unit is typically written as $(Z_i, Y_i, X_i)$, where $Z_i = 1$ if i is assigned to receive the treatment, 0 otherwise, $Y_i$ is the value of the outcome and $X_i$ is a vector of covariates. Although this representation is adequate for descriptive modeling [e.g., the regression function $E(Y \mid Z, X)$], it does not adequately express the idea that Z might take on different values and that i’s outcome might vary with this. One way to formalize this idea is to consider two outcomes for i,

$Y_{zi}(0)$, the outcome i would have if he is not assigned to receive treatment and $Y_{zi}(1)$, the outcome i would have if he is assigned to receive treatment. With this notation, it is then straightforward to define singular causal effects (unit effects) $h(Y_{zi}(0), Y_{zi}(1))$ [where there is no effect if $Y_{zi}(0) = Y_{zi}(1)$]; the unit effects then serve as the building blocks for various types of average effects. Were it possible to take a sample from $(Y_z(0), Y_z(1))$, it would be a simple matter to obtain the unit effects $h(Y_{zi}(0), Y_{zi}(1))$ for the sampled units and to estimate various parameters that are functions of these. The literature on causal inference arises from the fact that it is only possible to observe one of the potential outcomes.

A limitation of the notation above is that only a scalar outcome and a binary treatment are considered. Because the generalization to a random object and to arbitrary types of treatments is trivial and does not generate substantially new issues, I continue to treat the case of a binary treatment and scalar outcome. A more serious limitation (though it is again not difficult to generalize this notation) that does generate new issues is that the notation above does not allow for interference (Cox, 1958), that is, i’s potential outcomes are not allowed to depend on the treatment received by other units. Rubin (1980) calls this the stable unit treatment value assumption (SUTVA). Although this assumption is often reasonable and almost universally made, there are many instances in the social and behavioral sciences where it is untenable. For example, in schools, children in the same (or even different) classrooms may interfere with one another. This case has been studied by Gitelman (2005); for the more general case, see Halloran and Struchiner (1995) and Sobel (2006a, 2006b). Hereafter, I shall assume SUTVA holds.

Although the unit effects above cannot be determined (since only one of the potential outcomes is ever observed), it turns out remarkably that under suitable conditions (discussed later) various types of averages of these effects are nevertheless identifiable and can be consistently estimated.

As a simple example, consider the case above, with $h(Y_{zi}(0), Y_{zi}(1)) = Y_{zi}(1) - Y_{zi}(0)$. The ‘intent to treat’ estimand (hereafter ITT), which is commonly featured in connection with randomized clinical trials, is defined as:

$$E(Y_z(1) - Y_z(0)), \tag{1}$$

the average of the unobserved unit effects. Because the expected value is a linear operator, the ITT can also be expressed as $E(Y_z(1)) - E(Y_z(0))$. Thus, if it is possible to take a random sample of size n from P and then take random sub-samples from $Y_z(1)$ and $Y_z(0)$, the difference between the sample means:

$$\frac{\sum_{i=1}^{n} Z_i Y_i}{\sum_{i=1}^{n} Z_i} - \frac{\sum_{i=1}^{n} (1 - Z_i) Y_i}{\sum_{i=1}^{n} (1 - Z_i)} \tag{2}$$

is an unbiased and consistent estimator of the ITT.

The ITT is one of a number of possible parameters of interest and may not always be of greatest scientific or policy relevance. It measures the effect of treatment assignment, and as subjects may not always take up the treatments to which they are assigned, the ITT does not measure the effect of treatment itself. A policy maker might nevertheless argue that the ITT is of primary interest because it measures the effect that would actually be observed in the real world. As an example, consider the effect of a universally free school-breakfast program vs. the current Federal program (Crepinsek et al., 2006) on total food and nutrient intake. Some students will take up the free breakfast, others will not. From the policy maker’s perspective, if the program is highly effective amongst those who take it up, but the takers are a small percentage of those who might benefit, the program may be judged a failure.

In observational studies where treatments are not assigned, one observes only whether a subject takes up a treatment (D = 1) or not (D = 0); defining the potential outcomes as $Y_{di}(0)$ and $Y_{di}(1)$, interest often centers on the average treatment effect (hereafter ATE):

$$E(Y_d(1) - Y_d(0)) \tag{3}$$

or the effect of treatment on the treated (ATT):

$$E(Y_d(1) - Y_d(0) \mid D = 1). \tag{4}$$

Neyman (1923) first considered the ATE. The ATT was first considered by Belsen (1956) and discussed in detail by Rubin (1978). The ATE measures the average effect if all persons in the population are given the treatment, whereas the ATT measures the average effect of the treatment in the subpopulation that takes up the treatment. The ATE is a natural parameter of interest if the treatment corresponds to a policy that is under consideration for universal and mandatory adoption. In cases where adoption is voluntary, some economists have argued that only the ATT is relevant, because it reflects what would actually occur if the policy were to be implemented. However, one might also want to know if those who do not adopt the policy would benefit, because an affirmative answer might suggest to a policy maker that efforts focus on increasing the take-up rate. Additionally, if the ATT is positive, and persons who have not taken up the policy have access to this type of information, they might be more motivated to do so. This suggests that in general, for policy purposes, one might wish to know, in addition to the ATT, the average effect of treatment on the untreated (ATU) and the ATE, which is a weighted average of the ATT and ATU. The ATE will also be a more natural parameter of interest than the ATT in many contexts where the focus is on the basic science, where the causal variable may not be one that can be manipulated for policy purposes.

Various other parameters may also be of interest. First, for both scientific and policy reasons, one often wants to know whether the effects above vary in different sub-populations defined by characteristics of the units. Let X denote a vector of variables that are not affected by the treatment (or assignment variable), for example, take X to be a vector of pretreatment covariates. This leads to consideration of the parameters ITT(X), ATE(X), and ATT(X), where, for example, ATE(X) is defined as $E(Y_d(1) - Y_d(0) \mid X)$, and the other parameters are defined analogously.

Although attention herein focuses on the parameters above, a number of other interesting and/or useful parameters have been defined and considered. Björklund and Moffitt (1987) defined (and discussed the economic relevance of) the marginal treatment effect for subjects indifferent between participating or not in a program of interest. Quantile treatment effects, the difference between the marginal quantiles of Y(1) and Y(0), were defined by Doksum (1974) and Lehmann (1974); these effects have received some attention recently (e.g., Abadie, Angrist and Imbens, 2002). A parameter (discussed subsequently) that has received much attention lately is the local average treatment effect (LATE) considered by Angrist, Imbens and Rubin (1996).

Many other parameters might also be considered. A decision maker might wish to consider the utilities of the potential outcome values and ask whether a treatment increases average utility or some other measure of social welfare; this is a matter of considering U(Y) as opposed to Y.

The average effects above take as building blocks the unit differences $Y_{di}(1) - Y_{di}(0)$ (or $Y_{zi}(1) - Y_{zi}(0)$). As will be evident later, because averages of these depend only upon the marginal distributions of $Y_d(0)$ and $Y_d(1)$, the effects in question are identified if the marginal distributions are identified. Parameters that depend on the joint distribution of the potential outcomes may also be defined (e.g., the proportion who would benefit from treatment), but the data, even from a randomized experiment, typically contain little or no information about this joint distribution, so these parameters will not be identifiable without introducing additional assumptions. While this may appear to be a serious limitation, several comments are in order. First, sometimes a transformation may produce an estimand of the desired form.

For example, for positive variables, with $h(Y_d(0), Y_d(1)) = Y_d(1)/Y_d(0)$, redefining the potential outcomes as $\log Y_d(0)$ and $\log Y_d(1)$ gives transformed effects in the desired form. Second, in decision making, the additional information contained in the joint distribution may be irrelevant (Imbens and Rubin, 1997) to the policy maker. Third, at least occasionally, plausible substantive assumptions could lead to identification of the joint distribution of the potential outcomes. For example, let the outcome be survival (1 if alive, 0 if dead) and suppose one wants to know the proportion benefiting under treatment; it is easy to see that if treatment is at least not harmful, the joint distribution is identified from the marginal distributions of the potential outcomes. However, in general, this is not the case, and as there is likely to be very little scientific knowledge about quantities like the joint distribution of potential outcomes in a study in which the marginal distribution is not even assumed to be known, the identification of parameters involving the joint distribution will typically require making assumptions that are substantively heroic (although perhaps mathematically convenient) and possibly quite sensitive to violations.

I now consider the identification of causal parameters.

IDENTIFICATION OF CAUSAL PARAMETERS UNDER IGNORABLE TREATMENT ASSIGNMENT

Random assignment is an assumption about the way units are assigned to treatments. At the heart of randomized experiments, this assumption enables identification of causal parameters. In the simplest case where each subject is assigned with probability $0 < \pi < 1$ to the control condition and probability $1 - \pi$ to the treatment condition, random assignment implies treatment assignment is “ignorable”, i.e., Z is independent of background covariates and potential outcomes:

$$Z \perp\!\!\!\perp X, Y_z(0), Y_z(1) \tag{5}$$

To see how (5) is used for identification, note that whether or not randomization is assumed to assign subjects to treatments, what can actually be observed is a sample from the joint distribution of (Y, Z, X). From this distribution, the conditional distributions $Y \mid Z = z$ for $z = 0, 1$ are identified, and as $Y = Z Y_z(1) + (1 - Z) Y_z(0)$, the distribution $Y \mid Z = z$ is the distribution $Y_z(z) \mid Z = z$. Thus, the population means $E(Y \mid Z = 1) = E(Y_z(1) \mid Z = 1)$ and $E(Y \mid Z = 0) = E(Y_z(0) \mid Z = 0)$ are identifiable, so the difference:

$$E(Y \mid Z = 1) - E(Y \mid Z = 0) = E(Y_z(1) \mid Z = 1) - E(Y_z(0) \mid Z = 0) \tag{6}$$

is also identified. In general, (6) does not equal (1) because the identified conditional distributions $Y_z(z) \mid Z = z$, $z = 0, 1$, are not equal to the corresponding marginal distributions $Y_z(z)$, $z = 0, 1$. But under the random assignment assumption, (5) holds, implying equality of the two sets of distributions; hence (6) = (1).

Often an investigator will also want to know if the value of the ITT depends on covariates of interest. The parameter of interest is then $\mathrm{ITT}(X) = E(Y_z(1) - Y_z(0) \mid X)$. As (for the case above):

$$0 < \pi_z(X) \equiv \Pr(Z = 1 \mid X) < 1, \tag{7}$$

and since assumption (5) implies:

$$Z \perp\!\!\!\perp Y_z(0), Y_z(1) \mid X, \tag{8}$$

that is, assignment is random within levels of X, ITT(X) is identifiable and equal to:

$$E(Y \mid Z = 1, X) - E(Y \mid Z = 0, X). \tag{9}$$

Just as the assumption of random assignment is the key to identifying causal parameters from the randomized experiment, the assumption of random assignment within blocks (sub-populations) is the key to identifying causal parameters from the randomized block experiment. It is also the key to making causal inferences from observational studies.
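A small simulation can make the role of (5) concrete. The sketch below is illustrative only: the data-generating process (one standard normal covariate, a unit effect of 2 + 0.5X, and the two assignment rules) and the function name `simulate` are assumptions made for the example and do not come from any study discussed here. It compares the average of the unit effects, which is available only because the data are simulated, with the observed difference in group means in (6), first when assignment ignores X and the potential outcomes, and then when assignment depends on X.

```python
import random


def simulate(n, randomized, seed=0):
    """Return (average unit effect, observed difference in group means)."""
    rng = random.Random(seed)
    sum_effect = 0.0
    treated, control = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)          # background covariate X (hypothetical)
        y0 = x + rng.gauss(0.0, 1.0)     # potential outcome Y_z(0)
        y1 = y0 + 2.0 + 0.5 * x          # potential outcome Y_z(1); effect varies with X
        if randomized:
            p = 0.5                      # assignment ignores X and outcomes, as in (5)
        else:
            p = 0.8 if x > 0 else 0.2    # assignment depends on X, so (5) fails
        z = 1 if rng.random() < p else 0
        y = z * y1 + (1 - z) * y0        # only one potential outcome is observed
        (treated if z else control).append(y)
        sum_effect += y1 - y0
    itt = sum_effect / n                 # sample analogue of (1)
    diff = sum(treated) / len(treated) - sum(control) / len(control)  # estimator (2)
    return itt, diff


if __name__ == "__main__":
    for randomized in (True, False):
        itt, diff = simulate(20000, randomized, seed=1)
        print(f"randomized={randomized}: ITT={itt:.3f}, difference in means={diff:.3f}")
```

Under the randomized rule the two numbers should agree up to sampling error, whereas under the second rule the difference in means also picks up the difference in X between the groups, which is the sense in which (6) need not equal (1).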

When assumptions (7) and (8) hold, treatment assignment is said to be strongly ignorable, given X (Rosenbaum and Rubin, 1983). In observational studies in the social and behavioral sciences, where subjects choose the treatment received (D), the assumption:

$$D \perp\!\!\!\perp X, Y_d(0), Y_d(1) \tag{10}$$

[akin to (5)] is likely to be unreasonable, as typically evidenced by differences in the distribution of covariates in the treatment and control groups. However, if the investigator knows (as in a randomized block experiment) the covariates that account for the differential assignment of subjects into the treatment and control groups:

$$D \perp\!\!\!\perp Y_d(0), Y_d(1) \mid X, \tag{11}$$

and if:

$$0 < \pi_d(X) \equiv \Pr(D = 1 \mid X) < 1, \tag{12}$$

for all X, i.e., treatment received D is strongly ignorable given X, the parameter ATE(X):

$$E(Y_d(1) - Y_d(0) \mid X) \tag{13}$$

is identified and equals $E(Y \mid X, D = 1) - E(Y \mid X, D = 0)$. It also follows that ATE(X) = ATT(X):

$$E(Y_d(1) - Y_d(0) \mid X, D = 1). \tag{14}$$

Typically, the investigator will be interested not only in ATE(X) and/or ATT(X), but also in $\mathrm{ATE} = E_P(\mathrm{ATE}(X))$ and $\mathrm{ATT} = E_{P^*}(\mathrm{ATT}(X))$, where $P^*$ is the sub-population of units that receive treatment. Note that, in general, ATT ≠ ATE because these parameters are obtained by averaging over different units: the ATE is a weighted average of the ATT and the ATU.

More generally, as at the beginning of this section, under the types of ignorability assumptions above, it is possible to identify the marginal and conditional (given X) distributions of the potential outcomes in P and therefore to consider any causal estimand that can be defined in terms of these distributions. I do not consider this matter further here, save to note that estimating these distributions will (when the outcome Y is metrical) typically be more difficult than estimating the average causal effects above.

Second, the average effects above can be identified under weaker ignorability assumptions than those given here. For example, ATE(X) and ATT(X) are identified under the marginal ignorability assumption:

$$D \perp\!\!\!\perp Y_d(d) \mid X \tag{15}$$

for d = 0, 1, and occasionally using this weaker assumption is advantageous. It is also obvious that for estimating means, ignorability assumptions can be replaced by the weaker condition of so-called ‘mean independence’, e.g., $E(Y_d(d) \mid X, D = d) = E(Y_d(d) \mid X)$. However, it is difficult to think of situations where mean independence holds and ignorability does not. Additionally, mean independence does not hold for functions of Y, such as U(Y), the utility of Y. Thus, I do not consider this further.

In observational studies, it will often be the case that an investigator is not sure if he/she has measured all the covariates X predictive of both the treatment and the outcome. Not surprisingly (as it is not possible to observe both potential outcomes), ignorability assumptions are not, per se, testable; attempts to assess such assumptions invariably rely on various types of auxiliary assumptions (Rosenbaum, 1987, 2002).

When an investigator believes there are variables he/she has not measured that predict both the treatment and the potential outcomes, it is nevertheless sometimes possible (using other types of assumptions) to estimate the parameters above (or parameters similar to these). The section ‘Mediation analyses’ examines the use of this approach in the context of mediation. Another approach that has been used involves the use of fixed effects models (and differences in differences) to remove the effects of unmeasured variables. When the investigator knows the treatment assignment rule, but the assignment probabilities are 0 and 1, as is the case in risk-based allocation (Thistlethwaite and Campbell, 1960), causal inferences necessarily rely on extrapolation. Nevertheless, in some cases, reasonable inferences can be made (Finkelstein, Levin and Robbins, 1996).

Other approaches in the absence of ignorability include bounding causal effects (Manski, 1995; Robins, 1989). Bounds that make few assumptions are often quite wide and not especially useful. Nevertheless, when assumptions leading to tighter bounds are credible, this approach may be quite helpful. Additionally, sensitivity analyses can also be very useful; if ignorability is violated due to an unmeasured covariate, but the results are robust to this violation, credible inferences can nevertheless be made (see Rosenbaum, 2002, for further material on this topic).

ESTIMATION OF CAUSAL PARAMETERS IN RANDOMIZED STUDIES

I consider ITT(X) and ITT in this section, using these cases to introduce the primary ideas underlying the estimation of causal effects in the simplest setting. The discussion is organized around two broad approaches: (1) using potential outcomes imputed by regression or some other method and using the observed and imputed outcomes to estimate ITT; and (2) reweighting the data in the treatment and control groups to reflect the composition of the population P (or an appropriate subpopulation thereof). The estimators considered under the first approach have been used in the experimental design literature for many years and will be familiar to most readers.

Estimation of ITT(X) and ITT in randomized studies

The simplest case, previously considered, estimates the ITT using (2) under the identification condition (5). It is also useful to note that (2) is also the coefficient $\hat{\tau}$ in the ordinary regression of Y on Z:

$$Y_i = \alpha + \tau Z_i + \varepsilon_i, \tag{16}$$

$i = 1, \ldots, n$, where the parameters are identified by the assumption $E(\varepsilon) = 0$.

The estimator (2) also arises by predicting the missing outcomes $Y_{zi}(0)$ (if $Z_i = 1$) or $Y_{zi}(1)$ (if $Z_i = 0$) using the estimated minimum mean square error predictor $\hat{E}(Y \mid Z)$. Let $\hat{Y}_{zi}(0) = Y_{zi}(0)$ if $Z_i = 0$, and $\hat{\alpha} = \hat{E}(Y \mid Z = 0)$ otherwise, and let $\hat{Y}_{zi}(1) = Y_{zi}(1)$ if $Z_i = 1$, and $\hat{\alpha} + \hat{\tau} = \hat{E}(Y \mid Z = 1)$ otherwise; thus, (2) can also be written as:

$$n^{-1} \sum_{i=1}^{n} (\hat{Y}_{zi}(1) - \hat{Y}_{zi}(0)). \tag{17}$$

In the case of a randomized block experiment, where the probability of assignment to the treatment group depends on known covariates, assumption (5) will be violated, but if the covariates are unrelated to the potential outcomes, then:

$$Z \perp\!\!\!\perp Y_z(0), Y_z(1), \tag{18}$$

in which case (2) is still unbiased and consistent for ITT.

When the covariates are related to both treatment assignment and the potential outcomes, (8) provides the basis for extending the approach above. Let the covariates X take on L distinct values, corresponding to blocks $b = 1, \ldots, L$, and let $g(X) \equiv B$ be the one-to-one onto function mapping X onto the variable B. Within each block b, $n_{1b} = n_b \Pr(Z = 1 \mid B = b)$ of the $n_b$ units are assigned to the treatment group, where $0 < \Pr(Z = 1 \mid B = b) < 1$ for all b. The matched pairs design is the special case where the sample size is $2n$, $L = n$, $n_{1b} = 1$, $n_{0b} = n_b - n_{1b} = 1$. The regression corresponding to (16) is:

$$Y_i = \sum_{b=1}^{L} 1_{\{B_i\}}(b)(\alpha_b + \tau_b Z_i + \varepsilon_i), \tag{19}$$

where $1_{\{B_i\}}(b) = 1$ if $B_i = b$, 0 otherwise, and $E(\varepsilon \mid X, Z) = 0$. Thus, $\tau_b$ is the value of ITT(X) in block b, with estimator $\hat{\tau}_b = \bar{Y}_{\{Z=1,B=b\}} - \bar{Y}_{\{Z=0,B=b\}}$, the difference between the treatment group and control group means in this block. The ITT can then be estimated using the estimated marginal distribution (or the marginal distribution if it is known) of the blocking variable:

$$\widehat{\mathrm{ITT}} = \sum_{b=1}^{L} (\bar{Y}_{\{Z=1,B=b\}} - \bar{Y}_{\{Z=0,B=b\}}) \widehat{\Pr}(B = b). \tag{20}$$

Under random sampling from P, $\widehat{\Pr}(B = b) = n_b/n$, and (20) = (17); for the matched pair design, in addition, (17) = (2).

As above, it is also easy to see that:

$$\bar{Y}_{\{Z=1,B=b\}} - \bar{Y}_{\{Z=0,B=b\}} = (n_b)^{-1} \sum_{i=1}^{n} 1_{\{B_i\}}(b)(\hat{Y}_{zi}(1) - \hat{Y}_{zi}(0)), \tag{21}$$

where the missing outcomes are imputed using the estimated ‘best’ predictor; thus the estimator (20) can also be obtained by imputing missing potential outcomes.

Another approach to estimating the ITT under (8) is to reweight the treatment group (control group) observations in such a way that the reweighted data from the treatment group (control group) would be a random sample from the distribution of $(Y_z(1), X)$ ($(Y_z(0), X)$) and then apply the simple estimator (2) to the weighted data. This is the essence of ‘inverse probability weighting’ (Horvitz and Thompson, 1952).

To see how this works, suppose that a proportion $\pi_z(x)$ of the observations at level x of X are in the treatment group. Under random sampling from the distributions $(Y_z(1), X)$ and $(Y_z(0), X)$, the treatment and control groups should have the same distribution on X. Weighting each treatment group observation by $\pi_z^{-1}(X_i)$ and each control group observation by $(1 - \pi_z(X_i))^{-1}$ achieves this; the simple estimator (2) is then applied to the weighted data, yielding the IPW estimator:

$$\frac{\sum_{i=1}^{n} \pi_z^{-1}(X_i) Z_i Y_i}{\sum_{i=1}^{n} \pi_z^{-1}(X_i) Z_i} - \frac{\sum_{i=1}^{n} (1 - \pi_z(X_i))^{-1} (1 - Z_i) Y_i}{\sum_{i=1}^{n} (1 - \pi_z(X_i))^{-1} (1 - Z_i)}. \tag{22}$$

Using $YZ = Y_z(1)Z$, elementary properties of conditional expectation, and assumption (8) leads to a more formal justification:

$$E(\pi_z^{-1}(X) Z Y) = E(E(\pi_z^{-1}(X) Z Y \mid X)) = E(\pi_z^{-1}(X) E(Z Y_z(1) \mid X)) = E(\pi_z^{-1}(X) E(Z \mid X) E(Y_z(1) \mid X)) = E(Y_z(1)).$$

Finally, note that in the randomized block experiment $\pi_z(x) = \pi_z(g(x)) = \pi_z(b) = n_{1b}/n_b$, and the estimate (22) is identical to (20) under random sampling from P.

Estimation of treatment effects in observational studies

In observational studies in the social and behavioral sciences, the assumption that treatment D is unrelated to the potential outcomes $Y_d(0)$ and $Y_d(1)$ is unlikely to hold. Estimation of the treatment effects ATE(X), ATT(X), ATE and ATT is therefore considered under the assumption (given by (11) and (12)) that treatment assignment is strongly ignorable, given the covariates X (Rosenbaum and Rubin, 1983). For a more extensive treatment of estimation under strongly ignorable treatment assignment, see the reviews by Imbens (2004) and Schafer and Kang (2007).

In principle, this case has already been considered. Nevertheless, new issues arise in attempting to use the estimators previously considered. There are several reasons for this. First, in a randomized block experiment, the treatment and control group probabilities depend on the covariates in a known way, that is, $\pi_z(X)$ is known.
Thus, for example, If the treated observations at level x are if inverse probability weighting is used to −1 weighting by πz (x) and the control group estimate the ITT, the weights are known. This −1 observations at x by (1 − πz(x)) the treated is not the case in an . and controls will have the same distribution Second, in a randomized block experiment (8) on X in the weighted data set and (2) can be is a byproduct of the study design, whereas CAUSAL INFERENCE IN RANDOMIZED AND NON-RANDOMIZED STUDIES 13 in observational studies the analogue (10) is the sample mean vector for the covariates is an assumption; this issue was briefly in the treatment (control) group. It is easy to discussed earlier. In practice, making a see that (25) = (23). compelling argument that a particular set of There are essentially two problems with the covariates renders (8) true is the most difficult estimator τ.ˆ First, the investigator typically challenge facing empirical workers who want does not know the form of the response to use the methods below to make inferences function and the linear form is chosen out about various types of treatment effects. of convenience. This form has very strong Third, in a randomized block experiment, implications: τ = ATE( X) = ATE = ATT, the covariates used in blocking take on (in that is, not only are the effects the same at all principle) relatively few values, and thus levels of X, but the ATE is also the ATT. When the ITT can be estimated non-parametrically the regression functions are misspecified, using the first appproach (as in (20)). In an using τˆ can yield misleading inferences. observational study,where X is most likely Second, when there are ‘regions’ with little high dimensional, this is no longer the case. 
overlap between covariate values in the And finally, it is necessary (though not treatment and control groups, imputed values difficult) to modify estimators of the ATE( X) are then based on extrapolations outside the andATEto apply when it is desired to estimate of the data. For example, if the treatment ATT( X) and ATT. group members have ‘large’ values of a Regression estimators use estimates ( E) of covariate X1 and the control group members the regressions E(Yd(1) | X) and E(Yd(0)b| X) have ‘small’ values, the imputations Yd(0) to impute missing potential outcomes: if (Yd(1)) for treatment (control) group membersb Di = 1, Ydi (1) = Ydi (1), Ydi (0) = E(Yd(0) | willb involve extrapolating the control group b b b X = xi), and if Di = 0, Ydi (0) = Ydi (0), (treatment group) regression to large (small) b Ydi (1) = E(Yd(1) | X = xi). These are then X1 values. This may produce very misleading bused to imputeb the unit effects and the ATE is results. then estimated by averaging over these: To deal with the first of these difficulties, n a natural alternative is to use nonlinear −1 regresssion, or in the typical case where the n X(Ydi (1) − Ydi (0)) . (23) form of the regression functions is not known, i=1 b b to estimate these non-parametrically (as in As ATE(X ) = ATT(X ), the ATT can be (19)). Imbens (2004) reviews this approach. obtained as above, averaging only over the n1 Non-parametric regression can work well if treated observations. X is not high dimensional, but when there are As a starting point, consider the simplest many covariates to control for, as is typical (and still most widely used) regression in observational studies, the precision of the estimator, where the covariates enter the estimator regression may be quite low. This response function linearly: problem then spills over to the imputations. 
[But see also Hill and McCulloch (2007), who Y = α + τD + β-X +  , (24) i i i i propose using Bayesian Additive Regression where the parameters are identified by the Trees to fit the regression functions, finding condition E( | X, D) = 0. This leads to the that estimates based on this approach are well-known regression adjusted estimator of superior to those obtained using many other ATE(X): typically employed methods.] Sub-classification (also called blocking) ¯ ¯ ˆ - ¯ ¯ τˆ =(Y{D=1} −Y{D=0})−β (X{D=1} −X{D=0}), is an older method used to estimate causal (25) effects that is also non-parametric in spirit. Here, units with ‘similar’ values of X are where Y¯{D=1} is the treatment group mean, grouped into blocks and the ATE is estimated ¯ ¯ ¯ Y{D=0} is the control group mean, and X1(X0) as in the case of a randomized block 14 DESIGN AND INFERENCE experiment considered above. To estimate the the control group. When this is done, however, ATT, the distribution of the blocking variable the quantity estimated is no longer the ATE or B in the group receiving treatment (rather than ATT because the average is only taken over the overall population) is used; equivalently, the region of common support. the imputed unit effects are averaged over the In an important paper, Rosenbaum and treatment group only. In a widely cited paper, Rubin (1983) addressed the issue of overlap, Cochran (1968) shows in a concrete example proving that when (11) and (12) hold: with one covariate that subclassification with five blocks removes 90% of the . Y(0) , Y(1) %D | πd(X), (26) Matching is another long-standing method that has been used which avoids paramet- 0 < Pr( D = 1 | πd(X)) < 1, (27) rically modeling the regression functions. 
Although matching can be used to estimate implying that any of the methods just dis- the ATE, it has most commonly been used cussed may be applied using the ‘propensity to estimate the ATT in the situation where score’ πd(X) (which is a many to one function the control group is substantially larger of X), rather than X; this cannot exacerbate than the treatment group. In this case, each the potential overlap problem and may help unit i = 1,..., n1 in the treatment group is to lessen this problem. [For generalizations of matched to one or more units ‘closest’ in the the propensity score, applicable to the case control group and the outcome values from where the treatment is categorical, ordinal or the matched control(s) are used to impute continuous, see Imai and van Dyk (2004), Ydi (0). In the case of ‘one to one’ matching, Imbens (2000) and Joffe and Rosenbaum bunit i is matched to one control with value (1999).] Y∗ ≡ Ydi (0); if i is matched to more than Rosenbaum and Rubin (1983) also discuss one control,b the average of the control group (their corollary 4.3) using the propensity outcomes can be used. score to estimate the regression functions There are many possible matching E(Yd(1) | πd(X)) and E(Yd(0) | πd(X)) schemes. A unit can be matched with one when these are linear. Typically, the true or more others using various metrics to form of the regression functions relating measure the distance between covariates X, potential outcomes to the propensity score and various criteria for when two units have will be unknown. But these functions can be covariate values close enough to constitute estimated non-parametrically more precisely a ‘match’ can be used. In some schemes, using πd(X) than X. However, the advantage matches are not reused, but in others are used of this approach is somewhat illusory, as in again. In some schemes, not all units are observational studies, πd(X) will be unknown necessarily matched. See Gu and Rosenbaum and must be estimated. 
is (1993) for a nice discussion of the issues often used, but this form is typically chosen involved in matching. Despite the intuitive for convenience. To the best of my knowledge, appeal of matching, estimators that match the impact of using a misspecified propensity on X typically have poor large sample score in this case has not been studied. properties (see Abadie and Imbens, 2006, for If, however, a non-parametric estimator is details). used (for example, a sieve estimator as The procedures above do not contend described in Imbens, 2004), the so-called with the frequently encountered problem of ‘curse of dimensionality’is simply transferred insufficient overlap in the treatment and from estimation of the regression function to control groups. One alternative is to only estimation of the propensity score. consider regions where there is sufficient A straightforward way to use subclas- overlap, for example, to match only those sification on the propensity score is to treatment units with covariate values that are divide the unit interval into L equal length ‘sufficiently’ close to the values observed in intervals and group the observations by their CAUSAL INFERENCE IN RANDOMIZED AND NON-RANDOMIZED STUDIES 15 estimated propensity scores. Within interval more control group units. The control group I`, ` = 1, . . ., L, the ATE is estimated as: outcomes are then used to impute Ydi (0) and the ATT is estimated as in the caseb of DiYi (1 − Di)Yi matching on X. Pi∈I` − Pi∈I` . (28) D (1 − D ) Because the propensity score is a balancing Pi∈I` i Pi∈I` i score, after matching, the distribution of the The ATE is then estimated by averaging covariates should be similar in the treatment n 1 (`) and control groups. In practice, a researcher the estimates (28), using weights Pi=1 Ai , n should check this balance and, if there is a where 1 (`) = 1 if i ∈ I , 0 oth- Ai ` problem, the model for the propensity score erwise. 
To estimate the ATT, the weights can be refitted (perhaps including interactions should be modified to reflect the distribution among covariates and/or other higher order of the observations receiving treatment: n terms) and the balance rechecked. In this Pi=1 1Ai (`)Di n . Lunceford and Davidian (2004) sense, proper specification of the propensity Pi=1 Di study subclassification on the propensity score model is not really at issue here: the score and compare this with IPW estimators question is whether the matched sample is (discussed below). Using simulations, Drake balanced. (1993) compares the bias in the case where Finally, in practice, it is often found that the propensity score is known to the case when the estimated propensity scores are near where it is estimated, finding no additional 0 or 1, the problem of insufficient overlap bias is introduced in the latter case. She in the treatment and control groups may be also finds that when the model for the lessened, but it is still present. In this case, propensity score is misspecified, the bias the same kinds of issues previously discussed incurred is smaller than that incurred by reappear. misspecifying the regression function. This As seen above, the propensity score also may be suggestive, but without knowing features prominently when methods that use how to put the misspecification in the two inverse probability weighting are used to esti- different models onto a common ground, it mate treatment effects. There the covariates is difficult to attribute too much meaning to took on L distinct values, each with positive this finding. probability, and the IPW estimator is identical Matching on propensity scores is widely to the non-parametric regression estimator. used in empirical work and has also been This will no longer be the case. Parallelling the shown to perform well in some situations material above, the ATE may be estimated as: (Dehejia and Wahba, 1999). 
Corollary 4.1 in n −1 Rosenbaum and Rubin (1983) shows the ATE πˆ (X )DiYi Pi=1 d i − can be estimated by drawing a random sample n −1 Pi=1 πˆd (Xi)Di πd(X1), . . ., π d(Xn) from the distribution of n −1 (1 −π ˆ d(X )) (1 − Di)Yi πd(X), then randomly choosing a unit from Pi=1 i . n −1 (29) the treatment group and the control group with Pi=1(1 −π ˆ d(Xi)) (1 − Di) this value πd(x), and taking the difference Y(1) − Y(0), then averaging the n differences. To estimate the ATT, it is necessary to weight In practice, of course, the propensity scores the expression above by πd(X), giving is usually unknown and must be estimated. Y¯ − [Rosenbaum (1987) explains the seemingly {D=1} n −1 paradoxical finding that using the estimated πˆd(X )(1 −π ˆ d(X )) (1 − Di)Yi Pi=1 i i . propensity score tends to produce better n −1 Pi=1 πˆd(Xi)(1 −π ˆ d(Xi)) (1 − Di) balance than using the true propensity score.] (30) Typically, the ATT is estimated by matching (using an estimate of the propensity score) A problem with using these estimators is each treated unit i = 1,... n1 to one or that probabilities near 0 and 1 assign large 16 DESIGN AND INFERENCE weights to relatively few cases (Rosenbaum, observations, and if the population model 1987). IPW estimators do not require estimat- is correct and gˆ is consistent, the estimator −1 n ing the regression functions, but the weights n Pi=1 gˆ(Xi) is consistent for E(Yd(1)). must be estimated consistently in order that If the regression function is misspecified, the estimator be consistent. When the model the errors may not have 0 mean over for the propensity score is misspecifed, the P. But if a good estimate of the δi can weights will be estimated incorrectly and be obtained, E(Yd(1)) can be estimated as −1 n −1 n ˆ the IPW estimator will not be consistent; n Pi=1 gˆ(Xi) + n Pi=1 δi. To estimate if the estimated probabilities near 0 and 1 the δi in P, the propensity score can be used. 
If are not close to the true probabilities, the the model for the propensity score is correct, bias can be substantial. To contend with this, E(Diπd(Xi)δi) = E(δi), and thus the estimator Hirano, Imbens and Ridder (2003) propose the use of a sieve estimator for the propensity score, while Shafer and Kang (2007) propose n n −1 ˆ using a ‘robit’ model (a more robust model n X gˆ(Xi) + X Diπˆd(Xi)δi (32) based on the cumulative distribution function i=1 i=1 of the t distribution, as opposed to the normal distribution (probit model) or logistic distribution (logistic regression). will be consistent for E(Yd(1)). Strategies for estimating treatment effects On the other hand, if the model for the that combine one or more of the methods regression function is correct, then whether have also been proposed. For example, sub- or not the model for the propensity score is classification may still leave an imbalance correct, E(Diπd(Xi)δi) = EE (Diπd(Xi)δi | between the covariates in the treatment X) = E(πd(Xi)E(Di | X)E(δi | X) = 0 as and control groups. To reduce bias, linear E(δi | X) = 0. regression of the outcome on D and X in A consistent estimate for E(Yd(0)) can be each block may be used to adjust for the constructed in a similar manner. The weighted imbalance, e.g., as in (25) (Rosenbaum and least-squares estimator, with appropriately Rubin, 1983). Matching estimators may be chosen weights, is another example; more similarly modified. generally, the regression function can be Recently, a number of estimators that estimated semiparametrically (Robins and combine inverse probability weighting with Rotnitsky, 1995). Kang and Schafer (2007) regression have been proposed. These estima- discuss a number of other estimators that are tors have the property that so long as either the consistent so long as either the regression model for the propensity score is correct or the function or the propensity score is specified model for the regression function is correct, correctly. 
Such estimators are often called the estimator is consistent. Kang and Schafer ‘doubly robust’; however, the reader should (2007) do a nice job of explaining this idea, note that this terminology is a bit misleading. which originates in the sampling literature Statistical methods that operate well when (Cassel, Sarndal and Wretman, 1976, 1977), the assumptions underlying their usage are and of summarizing the literature on this topic. violated are typically called robust. Here, the To give some intuition, consider estimation estimator is robust with misspecification to of theATE.Suppose the population regression either the propensity score or the regression function is assumed to have the form: function, but not both. In that vein, Kang and Schafer’s (2007) simulations suggest Ydi (1) = g(Xi) + δi, (31) that when neither the propensity score nor the regression function is correctly specified, with E(δi | Xi) = 0, giving E(Yd(1) | doubly robust estimators are often more X) = g(X). As before, because (11) holds, biased than estimators without this attractive the model can be estimated using the treated theoretical property. CAUSAL INFERENCE IN RANDOMIZED AND NON-RANDOMIZED STUDIES 17
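To make the three main estimators of this section concrete, the following is a minimal sketch in Python (NumPy only), not a definitive implementation. It computes the regression imputation estimator (23), the IPW estimator (29), and the combined estimator built from (32) on simulated data. Several things here are assumptions made purely for illustration: the function names, the simulated data-generating process (constant treatment effect of 2, one covariate), the use of linear least squares for the regression functions, the choice of the residuals $Y_i - \hat{g}(X_i)$ as $\hat{\delta}_i$, and the use of the true propensity score in place of an estimate.

```python
import numpy as np

def ols_predict(X_fit, y_fit, X_new):
    """Fit y = a + b'x by least squares on (X_fit, y_fit) and predict at X_new."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def regression_imputation_ate(y, d, X):
    """Estimator (23): impute each unit's missing potential outcome, then average."""
    g1 = ols_predict(X[d == 1], y[d == 1], X)  # estimate of E(Y_d(1) | X)
    g0 = ols_predict(X[d == 0], y[d == 0], X)  # estimate of E(Y_d(0) | X)
    y1_hat = np.where(d == 1, y, g1)           # observed if treated, else imputed
    y0_hat = np.where(d == 0, y, g0)           # observed if control, else imputed
    return np.mean(y1_hat - y0_hat)

def ipw_ate(y, d, ps):
    """Estimator (29): normalized inverse probability weighting."""
    w1 = d / ps
    w0 = (1 - d) / (1 - ps)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def doubly_robust_ate(y, d, X, ps):
    """Estimator (32) for E(Y_d(1)) minus its analogue for E(Y_d(0))."""
    g1 = ols_predict(X[d == 1], y[d == 1], X)
    g0 = ols_predict(X[d == 0], y[d == 0], X)
    # Residuals play the role of the delta-hats; they enter only for the
    # units whose outcome under that treatment is actually observed.
    mu1 = np.mean(g1) + np.mean(d * (y - g1) / ps)
    mu0 = np.mean(g0) + np.mean((1 - d) * (y - g0) / (1 - ps))
    return mu1 - mu0

# Simulated observational study (hypothetical data-generating process).
rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 1))
ps = 1.0 / (1.0 + np.exp(-0.8 * X[:, 0]))      # true propensity score
d = rng.binomial(1, ps)
y = 1.0 + 2.0 * d + 1.5 * X[:, 0] + rng.normal(size=n)

naive = y[d == 1].mean() - y[d == 0].mean()    # confounded comparison of means
est_reg = regression_imputation_ate(y, d, X)
est_ipw = ipw_ate(y, d, ps)
est_dr = doubly_robust_ate(y, d, X, ps)
print(naive, est_reg, est_ipw, est_dr)
```

In this simulation both the regression model and the propensity score are correctly specified, so all three adjusted estimates should lie near the simulated effect, while the unadjusted difference in means should not. Replacing `ps` with a deliberately wrong score, or fitting the regressions without the covariate, is a simple way to explore the misspecification issues discussed above.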

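Subclassification on the propensity score, as in (28) and the two weighting schemes that follow it, can be sketched in the same way. This is an illustrative sketch only: the function name `subclassify`, the simulated data, the use of ten intervals, and the use of the true propensity score are all assumptions, and strata containing no treated or no control units are simply dropped, which (as noted in the discussion of overlap) changes the quantity being estimated.

```python
import numpy as np

def subclassify(y, d, ps, n_strata=10):
    """Subclassification on the propensity score; returns (ATE, ATT) estimates."""
    # Interior cut points of n_strata equal-length intervals on [0, 1].
    cuts = np.linspace(0.0, 1.0, n_strata + 1)[1:-1]
    stratum = np.digitize(ps, cuts)            # values 0, ..., n_strata - 1
    effects, w_ate, w_att = [], [], []
    for level in range(n_strata):
        treated = (stratum == level) & (d == 1)
        control = (stratum == level) & (d == 0)
        if treated.sum() == 0 or control.sum() == 0:
            continue  # assumption: strata without overlap are dropped
        # Within-stratum difference in means, as in (28).
        effects.append(y[treated].mean() - y[control].mean())
        w_ate.append(treated.sum() + control.sum())  # all units in the stratum
        w_att.append(treated.sum())                  # treated units only
    effects = np.array(effects)
    ate = np.average(effects, weights=w_ate)   # weights proportional to stratum size
    att = np.average(effects, weights=w_att)   # weights proportional to treated count
    return ate, att

# Simulated data (hypothetical): constant effect of 2, so ATE = ATT = 2.
rng = np.random.default_rng(1)
n = 20000
x = rng.normal(size=n)
ps = 1.0 / (1.0 + np.exp(-0.8 * x))            # true propensity score
d = rng.binomial(1, ps)
y = 1.0 + 2.0 * d + 1.5 * x + rng.normal(size=n)

ate_hat, att_hat = subclassify(y, d, ps)
print(ate_hat, att_hat)
```

Because units within an interval still differ somewhat on the covariate, some residual bias remains; using fewer intervals (for example, `n_strata=5`) should make this more visible.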
MEDIATION ANALYSES

Mediation is a difficult topic and a thorough discussion would require an essay-length treatment. The topic arises in several ways. First, even in randomized experiments, subjects do not always ‘comply’ with their treatment assignments. Thus, the treatment received $D$ is an intermediate outcome intervening between $Z$ and $Y$, and an investigator might want to know, in addition to the ITT (which measures the effect of $Z$ on $Y$), the effect of $D$ on $Y$. This might be of interest scientifically, and may also point, if the effect is substantial but subjects don’t take up the treatment, to the need to improve the delivery of the treatment package. Traditional methods of analysis that compare subjects by the treatment actually received or which compare only those subjects in the treatment and control groups that follow the experimental protocol are flawed because treatment received $D$ is not ignorable with respect to $Y$. To handle this, Bloom (1984) first proposed using $Z$ as an instrument for $D$. Subsequently, Angrist, Imbens and Rubin (1996) clarified the meaning of the IV estimand. Second, and more generally, researchers often have theories about the pathways (intervening variables) through which a particular cause (or set of causes) affects the response variable, and the effects of both the particular cause (causes) and the intervening variables are of interest. To quantify these effects, psychologists and others often use structural equation models, following Baron and Kenny (1986), for example. However, the ‘direct effects’ of $D$ on $Y$ and $Z$ on $Y$ in structural equation models should not generally be interpreted as effects; conditions (which are unlikely to be met) under which these parameters can be given a causal interpretation are also given below, as are conditions under which the IV estimand admits a causal interpretation.

Throughout, only the case of a randomized experiment with no covariates is considered; the results extend immediately to the case of an observational study where treatment assignment is ignorable only after conditioning on the covariates.

The local average treatment effect

As before, let $Z_i = 1$ if unit $i$ is assigned to the treatment group, 0 otherwise. Let $D_{zi}(0)$ ($D_{zi}(1)$) denote the treatment $i$ takes up when assigned to the control (treatment) group. Similarly, for $z = 0, 1$ and $d = 0, 1$, let $Y_{zdi}(0, 0)$ denote the response when $i$ is assigned to treatment $z = 0$ and receives treatment $d = 0$; $Y_{zdi}(0, 1)$, $Y_{zdi}(1, 0)$, $Y_{zdi}(1, 1)$ are defined analogously. Let $Y_{zdi}(Z_i, D_{zi}(Z_i))$ denote $i$’s observed response.

In a randomized experiment, the potential outcomes are assumed to be independent of treatment assignment:

$$D_z(0), D_z(1), Y_z(0, D_z(0)), Y_z(1, D_z(1)) \perp\!\!\!\perp Z. \tag{33}$$

The two ITTs (hereafter ITT$_D$ and ITT$_Y$) are identified as before by virtue of assumption (5); while these parameters are clearly of interest (and some would say these are the only parameters that should be of interest), neither parameter measures the effect of $D$ on $Y$. That is because $D$ is (in econometric parlance) ‘endogenous’. To deal with such problems, economists have long used instrumental variables (including two stage least squares), in which ‘exogenous’ variables that are believed to affect $Y$ only through $D$ are used as instruments for $D$. The IV estimand (in the simple case herein) is:

$$\frac{\mathrm{cov}(Z, Y)}{\mathrm{cov}(Z, D)} = \frac{E(Y \mid Z = 1) - E(Y \mid Z = 0)}{E(D \mid Z = 1) - E(D \mid Z = 0)} = \frac{ITT_Y}{ITT_D}. \tag{34}$$

Recently, Imbens and Angrist (1994) and Angrist, Imbens and Rubin (1996) clarified the meaning of the IV estimand (34) and the sense in which this estimand is a causal parameter. ITT$_Y$ is a weighted average over four compliance types: (1) compliers, with $D_{zi}(0) = 0$, $D_{zi}(1) = 1$; (2) never takers, with $D_{zi}(0) = 0$, $D_{zi}(1) = 0$; (3) always takers, with $D_{zi}(0) = 1$, $D_{zi}(1) = 1$; and (4) defiers, with $D_{zi}(0) = 1$, $D_{zi}(1) = 0$, who take up treatment if not assigned to treatment and who do not take up treatment if assigned to treatment. Often it will be substantively reasonable to assume there are no defiers; this is the ‘weak monotonicity assumption’ $D_{zi}(1) \geq D_{zi}(0)$ for all $i$. Because the never takers and always takers receive the same treatment irrespective of their assignment, any effect of treatment assignment on $Y$ for these types cannot be due to treatment $D$. If it is reasonable to assume the effect of treatment assignment operates only via the treatment, i.e., there is no ‘direct effect’ of $Z$ on $Y$, then the unit effect of $Z$ on $Y$ for never takers and always takers is 0; this is called the exclusion restriction. Under weak monotonicity and exclusion, ITT$_Y$ therefore reduces to:

$$E(Y_z(1, D_z(1)) - Y_z(0, D_z(0))) = E(Y_{zd}(1, 1) - Y_{zd}(0, 0) \mid D_z(0) = 0, D_z(1) = 1) \times \Pr(D_z(0) = 0, D_z(1) = 1). \tag{35}$$

As $\Pr(D_z(0) = 0, D_z(1) = 1) = E(D_z(1) - D_z(0))$ in the absence of defiers, provided this is greater than 0 (weak monotonicity together with this assumption is sometimes called ‘strong monotonicity’), (34) is the average effect of $Z$ on $Y$ for the compliers. If the direct effect of $Z$ on $Y$ for the compliers is also 0, (34) is also the effect of $D$ on $Y$ in this subpopulation; this is sometimes called the complier average causal effect (CACE) or the local average treatment effect (LATE). [For some further statistical work on compliance, see Imbens and Rubin (1997), Little and Yau (1998), Jo (2002), and Hirano, Imbens, Rubin and Zhou (2000).]

Because compliance is such an important issue, empirical researchers have been quick to apply the results above. But researchers who want to know the ATT or ATE might find the average effect of $Z$ on $Y$ for compliers or LATE to be of limited interest when the proportion of compliers is small (e.g., about 15% in the example presented by Angrist et al. (1996)). Researchers who estimate the IV estimand (or who use instrumental variables or two stage least squares) should be careful not to forget that compliers may differ systematically from the never takers and always takers. However, LATE = ATT in the not uncommon case where the only way to obtain the treatment is by being in the treatment group. In addition, although the ATT is defined under the assumption that the exclusion restriction holds, the average effect of $Z$ on $Y$ for compliers equals the average effect of $Z$ on $Y$ for the treated in this case. Finally, in the (unlikely) case where the treatment effects are constant, LATE = ATT = ATE.

Empirical workers should also remember the exclusion restriction is very strong (even if applied only to the never takers and always takers), and in a ‘natural’ experiment or a randomized experiment that is not double blinded, this restriction may not hold. Researchers who are in the position of being able to design a double-blinded, randomized experiment should do so, and researchers who are relying on a natural experiment should think very seriously about whether or not this restriction is plausible. Finally, it is also important to remember that the compliers constitute an unobserved subpopulation of $P$, so that even if a policy maker were able to offer the treatment only to subjects in this subpopulation, he/she cannot identify these subjects with certainty.

The approach above also serves as the basis for the idea of principal stratification (Frangakis and Rubin, 2002). The essential idea is that for any intermediate outcome $D$ (not necessarily binary), causal effects of $D$ are defined within principal strata (subpopulations with identical values of $D_{zi}(0)$ and $D_{zi}(1)$).

Mediation and structural equation modeling

To facilitate comparison with the psychological literature, in which structural equation models are typically used to study mediation (Baron and Kenny, 1986; MacKinnon and Dwyer, 1993), I discuss the special case where $D$ and $Y$ are continuous, $Z$ and $D$ have additive effects on $Y$ and the average effect of $D$ on $Y$ is linear (as described below); for a more general discussion, see Sobel (2008). As above, I make assumption (5) and examine the IV estimand; the extension to the case where ignorability holds, conditional on covariates, is immediate.

Using potential outcomes, a linear causal model analogous to a linear structural equation model may be constructed:

$$D_{zi}(z) = \alpha_1^c + \gamma_1^c z + \varepsilon_{1zi}^c(z) \tag{36}$$

$$Y_{zdi}(z, d) = \alpha_2^c + \gamma_2^c z + \beta_2^c d + \varepsilon_{2zdi}^c(z, d), \tag{37}$$

where $E(\varepsilon_{1z}^c(z)) = E(\varepsilon_{2zd}^c(z, d)) = 0$; thus, $\gamma_1^c = ITT_D$, $\gamma_2^c = E(Y_{zd}(1, d) - Y_{zd}(0, d))$ for any $d$ is the average unmediated effect of $Z$ on $Y$, and $\beta_2^c = E(Y_{zd}(z, d + 1) - Y_{zd}(z, d))$ for $z = 0, 1$ is the average effect of a one unit increase in $D$ on $Y$.

A linear structural equation model for the relationship between $Z$, $D$ and $Y$ is given by:

$$D_i = \alpha_1^s + \gamma_1^s Z_i + \varepsilon_{1i}^s \tag{38}$$

$$Y_i = \alpha_2^s + \gamma_2^s Z_i + \beta_2^s D_i + \varepsilon_{2i}^s, \tag{39}$$

where the parameters are identified by the assumptions $E(\varepsilon_1^s \mid Z) = 0$ and $E(\varepsilon_2^s \mid Z, D) = 0$. Thus, the ‘direct effects’ of $Z$ on $D$ and $Y$, respectively, are: $\gamma_1^s = E(D \mid Z = 1) - E(D \mid Z = 0)$, $\gamma_2^s = E(Y \mid Z = 1, D = d) - E(Y \mid Z = 0, D = d)$. The ‘direct effect’ of $D$ on $Y$ is given by $\beta_2^s = E(Y \mid Z = z, D = d + 1) - E(Y \mid Z = z, D = d)$. The ‘total effect’ is $\tau^s \equiv \gamma_2^s + \gamma_1^s \beta_2^s$.

By virtue of (33), $\gamma_1^s = ITT_D$ and $\tau^s = ITT_Y$ (Holland, 1988). However, neither $\gamma_2^s$ nor $\beta_2^s$ should generally be given a causal interpretation. To illustrate, consider $E(Y \mid Z = z, D = d) = E(Y_{zd}(z, D_z(z)) \mid Z = z, D_z(z) = d) = E(Y_{zd}(z, D_z(z)) \mid D_z(z) = d)$, where the last equality follows from (33). This gives $\gamma_2^s = E(Y_{zd}(1, D_z(1)) \mid D_z(1) = d) - E(Y_{zd}(0, D_z(0)) \mid D_z(0) = d)$. Because subjects with $D_z(0) = d$ are not the same subjects as those with $D_z(1) = d$, unless the unit effects of $Z$ on $D$ are 0, $\gamma_2^s$ is a descriptive parameter comparing subjects across different subpopulations. Similar remarks apply to $\beta_2^s$.

It is also easy to see from the above that $\gamma_2^s = \gamma_2^c$ if $E(Y_{zd}(z, D_z(z)) \mid D_z(z) = d) = E(Y_{zd}(z, D_z(z)))$; a sufficient condition for this to hold is:

$$Y_z(z, D_z(z)) \perp\!\!\!\perp D_z(z). \tag{40}$$

Similarly, $\beta_2^c = \beta_2^s$ under this condition. Results along these lines are reported in Eggleston, Scharfstein, Munoz and West (2006), Sobel (2008) and Ten Have, Joffe, Lynch, Brown and Maisto (2005). Unfortunately, this condition is unlikely to be met in applications, as it requires the intermediate outcome $D$ to be ignorable with respect to $Y$, as if $D$ had been randomized.

Holland (1988) showed that if (33) holds, the exclusion restriction $Y_{zdi}(1, d) - Y_{zdi}(0, d) = 0$ for all $i$ holds, $ITT_D \neq 0$, and the other unit effects $D_{zi}(1) - D_{zi}(0)$ and $Y_{zdi}(z, d) - Y_{zdi}(z, d')$ are constant for all $i$, then $\beta_2^c$ is equal to the IV estimand (34). Unfortunately, the assumption that the effects are constant is even more implausible in the kinds of studies typically carried out in the behavioral and medical sciences than the assumptions needed to justify using structural equation models.

Sobel (2008) relaxes the assumption of constant effects, assuming instead:

$$E(\varepsilon_{2zd}^c(1, D_z(1)) - \varepsilon_{2zd}^c(0, D_z(0))) = 0. \tag{41}$$

Under (41), (33), the exclusion restriction $\gamma_2^c = 0$, and the assumption $\gamma_1^c \neq 0$, $\beta_2^c$ equals the IV estimand (34). Further, assumption (41) is also weaker than the assumption (40) needed to justify using structural equation models.

The results above can be extended to the case where there are multiple instruments and multiple mediators. The results can also be extended to the case where compliance is an intermediate outcome prior to the mediating variable (Sobel, 2008) to obtain a complier average effect of the continuous mediator $D$ on $Y$; this is the effect of the continuous mediator $D$ on $Y$ within the principal stratum (of the binary outcome denoting whether or not treatment is taken) composed of the compliers. Principal stratification itself can also be used to approach the problem of estimating the effect of $D$ on $Y$ (Jo, 2008); here the idea would be to consider the effects of $D$ on $Y$ within strata defined by the pair of values of $(D_z(0), D_z(1))$.

DISCUSSION

In the last three decades, statisticians have generated a literature on causal inference that formally expresses the idea that causal relationships sustain counterfactual conditional statements. The potential outcomes notation allows causal estimands to be defined independently of the expected values of estimators. Thus, one can assess and give conditions (e.g., ignorability) under which estimators commonly employed actually estimate causal parameters. Prior to this, researchers estimated descriptive parameters and verbally argued these were causal based on other considerations, such as model specification, a practice that led workers in many disciplines, e.g., sociology and psychology, to interpret just about any parameter from a regression or structural equation model as a causal effect.

While this literature has led most researchers to a better understanding that a good study design (especially a randomized study) leads to more credible estimation of causal parameters than approaches using observational studies in conjunction with many unverifiable substantive assumptions, it is also important to remember the old lesson (Campbell and Stanley, 1963) that randomized studies do not always estimate parameters that are generalizable to the desired population. This is especially true for natural experiments, where the investigator has no control over the experiment, although the randomization assumption is plausible.

This literature has also led to clarification of existing procedures. In the process, new challenges have been generated. For example, while this literature reveals that the framework psychologists have been using for 25 years to study mediation is seriously flawed, as of yet, this literature cannot give adequate expression to and/or indicate how to assess the substantive theories that investigators have about the manner in which a treatment package may work through multiple mediators and the causal relationships among these mediators. These and many other issues are in need of much further work.

REFERENCES

Abadie, A. and Imbens, G. (2006) ‘Large sample properties of matching estimators for average treatment effects’, Econometrica, 74: 235–267.
Abadie, A., Angrist, J. and Imbens, G. (2002) ‘Instrumental variables estimation of quantile treatment effects’, Econometrica, 70: 91–117.
Angrist, J.D., Imbens, G.W. and Rubin, D.B. (1996) ‘Identification of causal effects using instrumental variables’ (with discussion), Journal of the American Statistical Association, 91: 444–472.
Baron, R.M. and Kenny, D.A. (1986) ‘The moderator–mediator variable distinction in social psychological research: Conceptual, strategic and statistical considerations’, Journal of Personality and Social Psychology, 51: 1173–1182.
Belson, W.A. (1956) ‘A technique for studying the effects of a television broadcast’, Applied Statistics, 5: 195–202.
Björklund, A. and Moffitt, R. (1987) ‘The estimation of wage gains and welfare gains in self-selection models’, The Review of Economics and Statistics, 69: 42–49.
Bloom, H.S. (1984) ‘Accounting for no-shows in experimental evaluation designs’, Evaluation Review, 8: 225–246.
Bunge, M.A. (1979) Causality and Modern Science (3rd edn.). New York: Dover.
Campbell, D.T. and Stanley, J.C. (1963) Experimental and Quasi-experimental Designs for Research. Chicago: Rand McNally.
Cassel, C.M., Särndal, C.E. and Wretman, J.H. (1976) ‘Some results on generalized difference estimation and generalized regression estimation for finite populations’, Biometrika, 63: 615–620.
Cassel, C.M., Särndal, C.E. and Wretman, J.H. (1977) Foundations of Inference in Survey Sampling. New York: Wiley.
Cochran, W.G. (1968) ‘The effectiveness of adjustment by subclassification in removing bias in observational studies’, Biometrics, 24: 205–213.
Cox, D.R. (1958) The Planning of Experiments. New York: John Wiley.

Crepinsek, M.K., Singh, A., Bernstein, L.S. and McLaughlin, J.E. (2006) ‘Dietary effects of universal-free school breakfast: findings from the evaluation of the school breakfast program pilot project’, Journal of the American Dietetic Association, 106: 1796–1803.
Dehejia, R.H. and Wahba, S. (1999) ‘Causal effects in nonexperimental studies: reevaluating the evaluation of training programs’, Journal of the American Statistical Association, 94: 1053–1062.
Doksum, K. (1974) ‘Empirical probability plots and statistical inference for nonlinear models in the two-sample case’, Annals of Statistics, 2: 267–277.
Drake, C. (1993) ‘Effects of misspecification of the propensity score on estimators of treatment effect’, Biometrics, 49: 1231–1236.
Eggleston, B., Scharfstein, D., Munoz, B. and West, S. (2006) ‘Investigation mediation when counterfactuals are well-defined: does sunlight exposure mediate the effect of eye-glasses on cataracts?’. Unpublished manuscript, Johns Hopkins University.
Finkelstein, M.O., Levin, B. and Robbins, H. (1996) ‘Clinical and prophylactic trials with assured new treatment for those at greater risk: II. Examples’, American Journal of Public Health, 86: 696–702.
Fisher, R.A. (1925) Statistical Methods for Research Workers. London: Oliver and Boyd.
Frangakis, C.E. and Rubin, D.B. (2002) ‘Principal stratification in causal inference’, Biometrics, 58: 21–29.
Gitelman, A.I. (2005) ‘Estimating causal effects from multilevel group-allocation data’, Journal of Educational and Behavioral Statistics, 30: 397–412.
Granger, C.W. (1969) ‘Investigating causal relationships by econometric models and cross-spectral methods’, Econometrica, 37: 424–438.
Gu, X.S. and Rosenbaum, P.R. (1993) ‘Comparison of multivariate matching methods: structures, distances and algorithms’, Journal of Computational and Graphical Statistics, 2: 405–420.
Halloran, M.E. and Struchiner, C.J. (1995) ‘Causal inference in infectious diseases’, Epidemiology, 6: 142–151.
Harre, R. and Madden, E.H. (1975) Causal Powers: A Theory of Natural Necessity. Oxford: Basil Blackwell.
Hill, J.L. and McCulloch, R.E. (2007) ‘Bayesian nonparametric modeling for causal inference’. Unpublished manuscript, Columbia University.
Hirano, K., Imbens, G.W., Rubin, D.B. and Zhou, X. (2000) ‘Assessing the effect of an influenza vaccine in an encouragement design with covariates’, Biostatistics, 1: 69–88.
Hirano, K., Imbens, G.W. and Ridder, G. (2003) ‘Efficient estimation of average treatment effects using the estimated propensity score’, Econometrica, 71: 1161–1189.
Holland, P.W. (1988) ‘Causal inference, path analysis, and recursive structural equation models’ (with discussion), in Clogg, C.C. (ed.), Sociological Methodology. Washington, D.C.: American Sociological Association. pp. 449–493.
Horvitz, D.G. and Thompson, D.J. (1952) ‘A generalization of sampling without replacement from a finite universe’, Journal of the American Statistical Association, 47: 663–685.
Imai, K. and van Dyk, D.A. (2004) ‘Causal inference with general treatment regimes: generalizing the propensity score’, Journal of the American Statistical Association, 99: 854–866.
Imbens, G.W. (2000) ‘The role of the propensity score in estimating dose-response functions’, Biometrika, 87: 706–710.
Imbens, G.W. (2004) ‘Nonparametric estimation of average treatment effects under exogeneity: a review’, Review of Economics and Statistics, 86: 4–29.
Imbens, G.W. and Angrist, J.D. (1994) ‘Identification and estimation of local average treatment effects’, Econometrica, 62: 467–475.
Imbens, G.W. and Rubin, D.B. (1997) ‘Estimating outcome distributions for compliers in instrumental variables models’, Review of Economic Studies, 64: 555–574.
Jo, B. (2002) ‘Estimation of intervention effects with noncompliance: alternative model specifications’ (with discussion), Journal of Educational and Behavioral Statistics, 27: 385–415.
Jo, B. (2008) ‘Causal inference in randomized experiments with mediational processes’, Psychological Methods, 13: 314–336.
Joffe, M.M. and Rosenbaum, P.R. (1999) ‘Propensity scores’, American Journal of Epidemiology, 150: 327–333.
Kang, J.D.Y. and Schafer, J.L. (2007) ‘Demystifying double robustness: a comparison of alternative strategies for estimating population means from incomplete data’, Statistical Science, 22: 523–580.
Lehmann, E.L. (1974) Nonparametrics: Statistical Methods Based on Ranks. San Francisco, CA: Holden-Day.
Lewis, D. (1973) ‘Causation’, Journal of Philosophy, 70: 556–567.
Little, R.J. and Yau, L.H.Y. (1998) ‘Statistical techniques for analyzing data from prevention trials: treatment of no-shows using Rubin’s causal model’, Psychological Methods, 3: 147–159.
Lunceford, J.K. and Davidian, M. (2004) ‘Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study’, Statistics in Medicine, 23: 2937–2960.
MacKinnon, D.P. and Dwyer, J.H. (1993) ‘Estimating mediating effects in prevention studies’, Evaluation Review, 17: 144–158.
Manski, C.F. (1995) Identification Problems in the Social Sciences. Cambridge, MA: Harvard University Press.
Neyman, J. ([1923] 1990) ‘On the application of probability theory to agricultural experiments. Essays on principles. Section 9’ (with discussion), Statistical Science, 4: 465–480.
Reichenbach, H. (1956) The Direction of Time. Berkeley: University of California Press.
Robins, J.M. (1989) ‘The analysis of randomized and nonrandomized AIDS treatment trials using a new approach to causal inference in longitudinal studies’, in Sechrest, L., Freedman, H. and Mulley, A. (eds.), Health Services Research Methodology: A Focus on AIDS. Rockville, MD: US Department of Health and Human Services. pp. 113–159.
Robins, J.M. and Rotnitsky, A. (1995) ‘Semiparametric efficiency in multivariate regression models with missing data’, Journal of the American Statistical Association, 90: 122–129.
Rosenbaum, P.R. (1987) ‘The role of a second control group in an observational study’, Statistical Science, 2: 292–316.
Rosenbaum, P.R. (2002) Observational Studies. New York: Springer-Verlag.
Rosenbaum, P.R. and Rubin, D.B. (1983) ‘The central role of the propensity score in observational studies for causal effects’, Biometrika, 70: 41–55.
Rubin, D.B. (1974) ‘Estimating causal effects of treatments in randomized and nonrandomized studies’, Journal of Educational Psychology, 66: 688–701.
Rubin, D.B. (1977) ‘Assignment to treatment groups on the basis of a covariate’, Journal of Educational Statistics, 2: 1–26.
Rubin, D.B. (1978) ‘Bayesian inference for causal effects: the role of randomization’, The Annals of Statistics, 6: 34–58.
Rubin, D.B. (1980) ‘Comment on “Randomization analysis of experimental data: the Fisher randomization test” by D. Basu’, Journal of the American Statistical Association, 75: 591–593.
Schafer, J.L. and Kang, J.D.Y. (2007) ‘Average causal effects from observational studies: a practical guide and simulated example’. Unpublished manuscript, Pennsylvania State University.
Simon, H.A. (1954) ‘Spurious correlation: a causal interpretation’, Journal of the American Statistical Association, 49: 467–492.
Sobel, M.E. (1995) ‘Causal inference in the social and behavioral sciences’, in Arminger, G., Clogg, C.C. and Sobel, M.E. (eds.), Handbook of Statistical Modeling for the Social and Behavioral Sciences. New York: Plenum. pp. 1–38.
Sobel, M.E. (2006a) ‘Spatial concentration and social stratification: does the clustering of disadvantage “beget” bad outcomes?’, in Bowles, S., Durlauf, S.N. and Hoff, K. (eds.), Poverty Traps. New York: Russell Sage Foundation. pp. 204–229.
Sobel, M.E. (2006b) ‘What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference’, Journal of the American Statistical Association, 101: 1398–1407.
Sobel, M.E. (2008) ‘Identification of causal parameters in randomized studies with mediating variables’, Journal of Educational and Behavioral Statistics, 33: 230–251.
Suppes, P. (1970) A Probabilistic Theory of Causality. Amsterdam: North Holland.
TenHave, T.R., Joffe, M.M., Lynch, K.G., Brown, G.K., Maisto, S.A. and Beck, A.T. (2007) ‘Causal mediation analyses with rank preserving models’, Biometrics, 63: 926–934.
TenHave, T.R., Marshall, J., Kevin, L., Brown, G. and Maisto, S. (2005) Causal Mediation Analysis with Structural Mean Models. University of Pennsylvania Biostatistics, Working Paper.
Thistlethwaite, D.L. and Campbell, D.T. (1960) ‘Regression-discontinuity analysis: an alternative to the ex post facto experiment’, Journal of Educational Psychology, 51: 309–317.
Yule, G.U. (1896) ‘On the correlation of total pauperism with proportion of out-relief II: males over 65’, Economic Journal, 6: 613–623.