arXiv:1503.02894v1 [stat.ME] 10 Mar 2015

Statistical Science
2014, Vol. 29, No. 4, 459–484
DOI: 10.1214/14-STS505
© Institute of Mathematical Statistics, 2014

Causal Etiology of the Research of James M. Robins

Thomas S. Richardson and Andrea Rotnitzky

Abstract. This issue of Statistical Science draws its inspiration from the work of James M. Robins. Jon Wellner, the Editor at the time, asked the two of us to edit a special issue that would highlight the research topics studied by Robins and the breadth and depth of Robins' contributions. Between the two of us, we have collaborated closely with Jamie for nearly 40 years. We agreed to edit this issue because we recognized that we were among the few in a position to relate the trajectory of his research career to date.[1]

Thomas S. Richardson is Professor and Chair, Department of Statistics, University of Washington, Box 354322, Seattle, Washington 98195, USA e-mail: [email protected]. Andrea Rotnitzky is Professor, Department of Economics, Universidad Torcuato Di Tella & CONICET, Av. Figueroa Alcorta 7350, Sáenz Valiente 1010, Buenos Aires, Argentina e-mail: [email protected].

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2014, Vol. 29, No. 4, 459–484. This reprint differs from the original in pagination and typographic detail.

[1] Here, we restrict attention to Robins' contributions to the research literature. Robins has also contributed by mentoring and training leading researchers in causal inference: among others, Elizabeth Halloran, Miguel Hernán, Eric Tchetgen Tchetgen and Tyler VanderWeele worked with him as graduate students. Both of the editors of this Special Issue were greatly influenced by Robins as a graduate student and post-doc and have been fortunate to have subsequently collaborated with him over many years.

Many readers may be unfamiliar with Robins' singular career trajectory and in particular how his early practical experience motivated many of the inferential problems with which he was subsequently involved. Robins majored in mathematics at Harvard College, but then, in the spirit of the times, left college to pursue more activist social and political goals. Several years later, Robins enrolled in Medical School at Washington University in St. Louis, graduating in 1976. His M.D. degree remains his only degree, other than his high school diploma.

After graduating, he interned in medicine at Harlem Hospital in New York. After completing the internship, Robins spent a year working as a primary care physician in a community clinic in the Roxbury neighborhood of Boston. During that year, he helped organize a vertical union affiliate of the Service Employees International Union that included all personnel, from maintenance to salaried physicians, working at the health center. In retaliation, he was dismissed by the director of the clinic and found that he was somewhat unwelcome at the other Boston community clinics. Unable to find a job and with his unemployment insurance running out, he was surprisingly able to obtain a prestigious residency in Internal Medicine at Yale University, a testament, he says with some irony, to the enduring strength of one's Ivy League connections.

At Yale, Robins and his college friend Mark Cullen, now head of General Medicine at Stanford Medical School, founded an occupational health clinic, with the goal of promoting occupational health and safety in working with trade unions. When testifying in workers' compensation cases, Robins was regularly asked whether it was "more probable than not that a worker's death or illness was caused by exposure to chemicals in the workplace." Robins' lifelong interest in causal inference began with his need to provide an answer. As the relevant scientific papers consisted of epidemiologic studies and biostatistical analyses, Robins enrolled in biostatistics and epidemiology classes at Yale. He was dismayed to learn that the one question he needed to answer was the one question excluded from formal discussion in the mainstream biostatistical literature.[2] At the time, most biostatisticians insisted that evidence for causation could only be obtained through randomized controlled trials; since, for ethical reasons, potentially harmful chemicals could not be randomly assigned, it followed that statistics could play little role in disentangling causation from spurious correlation.

[2] Robins and Greenland (1989a, 1989b) provided a formal definition of the probability of causation and a definitive answer to the question in the following sense. They proved that the probability of causation was not identified from epidemiologic data even in the absence of confounding, but that sharp upper and lower bounds could be obtained. Specifically, under the assumption that a workplace exposure was never beneficial, the probability $P(t)$ that a worker's death occurring $t$ years after exposure was due to that exposure was sharply upper bounded by 1 and lower bounded by $\max[0, \{f_1(t) - f_0(t)\}/f_1(t)]$, where $f_1(t)$ and $f_0(t)$ are, respectively, the marginal densities in the exposed and unexposed cohorts of the random variable $T$ encoding time to death.

1. CONFOUNDING

In his classes, Robins was struck by the gap present between the informal, yet insightful, language of epidemiologists such as Miettinen and Cook (1981) expressed in terms of "confounding, comparability, and bias," and the technical language of mathematical statistics in which these terms either did not have analogs or had other meanings. Robins' first major paper "The foundations of confounding in Epidemiology" written in 1982, though only published in 1987, was an attempt to bridge this gap. As one example, he offered a precise mathematical definition for the informal epidemiologic concept of a "confounding variable" that has apparently stood the test of time (see VanderWeele and Shpitser, 2013). As a second example, Efron and Hinkley (1978) had formally considered inference accurate to order $n^{-3/2}$ in variance conditional on exact or approximate ancillary statistics. Robins showed, surprisingly, that long before their paper, epidemiologists had been intuitively and informally referring to an estimator as "unbiased" just when it was asymptotically unbiased conditional on either exact or approximate ancillary statistics; furthermore, they intuitively required that the associated conditional Wald confidence interval be accurate to $O(n^{-3/2})$ in variance. As a third example, he solved the problem of constructing the tightest Wald-type intervals guaranteed to have conservative coverage for the average causal effect among the $n$ study subjects participating in a completely randomized experiment with a binary response variable; he showed that this interval can be strictly narrower than the usual binomial interval even under the Neyman null hypothesis of no average causal effect. To do so, he constructed an estimator of the variance of the empirical difference in treatment means that improved on a variance estimator earlier proposed by Neyman (1923). Aronow, Green and Lee (2014) have recently generalized this result in several directions including to nonbinary responses.

2. TIME-DEPENDENT CONFOUNDING AND THE g-FORMULA

It was also in 1982 that Robins turned his attention to the subject that would become his grail: causal inference from complex longitudinal data with time-varying treatments, that eventually culminated in his revolutionary papers Robins (1986, 1987b). His interest in this topic was sparked by (i) a paper of Gilbert (1982)[3] on the healthy worker survivor effect in occupational epidemiology, wherein the author raised a number of questions Robins answered in these papers and (ii) his medical experience of trying to optimally adjust a patient's treatments in response to the evolution of the patient's clinical and laboratory data.

[3] The author, Ethel Gilbert, is the mother of Peter Gilbert who is a contributor to this special issue; see Richardson et al. (2014).

2.1 Overview

Robins' career from this point on became a "quest" to solve this problem, and thereby provide methods that would address central epidemiological questions, for example, is a given long-term exposure harmful or a treatment beneficial? If beneficial, what interventions, that is, treatment strategies, are optimal or near optimal?

In the process, Robins created a "bestiary" of causal models and analytic methods.[4] There are the basic "phyla" consisting of the g-formula, marginal structural models and structural nested models. These phyla then contain "species," for example, structural nested failure time models, structural nested distribution models, structural nested (multiplicative, additive and logistic) mean models and yet further "subspecies": direct-effect structural nested models and optimal-regime structural nested models.

[4] In the epidemiologic literature, this bestiary is sometimes referred to as the collection of "g-methods."

Each subsequent model in this taxonomy was developed to help answer particular causal questions in specific contexts that the "older siblings" were not quite up to. Thus, for example, Robins' creation of structural nested and marginal structural models was driven by the so-called null paradox, which could lead to falsely finding a treatment effect where none existed, and was a serious nonrobustness of the estimated g-formula, his then current methodology. Similarly, his research on higher-order influence function estimators was motivated by a concern that, in the presence of confounding by continuous, high dimensional confounders, even doubly robust methods might fail to adequately control for confounding bias.

This variety also reflects Robins' belief that the best analytic approach varies with the causal question to be answered, and, even more importantly, that confidence in one's substantive findings only comes when multiple, nearly orthogonal, modeling strategies lead to the same conclusion.

2.2 Causally Interpreted Structured Tree Graphs

Suppose one wishes to estimate from longitudinal data the causal effect of time-varying treatment or exposure, say cigarette smoking, on a failure time outcome such as all-cause mortality. In this setting, a time-dependent confounder is a time-varying covariate (e.g., presence of emphysema) that is a predictor of both future exposure and of failure. In 1982, the standard analytic approach was to model the conditional probability (i.e., the hazard) of failure at time $t$ as a function of past exposure history using a time-dependent Cox proportional hazards model. Robins formally showed that, even when confounding by unmeasured factors and model misspecification are absent, this approach may result in estimates of effect that may fail to have a causal interpretation, regardless of whether or not one also adjusts for the measured time-dependent confounders in the analysis. In fact, if previous exposure also predicts the subsequent evolution of the time-dependent confounders (e.g., since smoking is a cause of emphysema, it predicts this disease) then the standard approach can find an artifactual exposure effect even under the sharp null hypothesis of no net, direct or indirect effect of exposure on the failure time of any subject.

Prior to Robins (1986), although informal discussions of net, direct and indirect (i.e., mediated) effects of time varying exposures were to be found in the discussion sections of most epidemiologic papers, no formal mathematical definitions existed. To address this, Robins (1986) introduced a new counterfactual model, the finest fully randomized causally interpreted structured tree graph (FFRCISTG)[5] model that extended the point treatment counterfactual model of Neyman (1923) and Rubin (1974, 1978a)[6] to longitudinal studies with time-varying treatments, direct and indirect effects and feedback of one cause on another. Due to his lack of formal statistical training, the notation and formalisms in Robins (1986) differ from those found in the mainstream literature; as a consequence the paper can be a difficult read.[7] Richardson and Robins (2013, Appendix C) present the FFRCISTG model using a more familiar notation.[8]

[5] A complete list of acronyms used is given before the References.

[6] See Freedman (2006) and Sekhon (2008) for historical reviews of the counterfactual point treatment model.

[7] Robins published an informal, accessible, summary of his main results in the epidemiologic literature (Robins (1987a)). However, it was not until 1992 (and many rejections) that his work on causal inference with time-varying treatments appeared in a major statistical journal.

[8] The perhaps more familiar Non-Parametric Structural Equation Model with Independent Errors (NPSEM-IE) considered by Pearl may be viewed as a submodel of Robins' FFRCISTG. A Non-Parametric Structural Equation Model (NPSEM) assumes that all variables ($V$) can be intervened on. In contrast, the FFRCISTG model does not require one to assume this. However, if all variables in $V$ can be intervened on, then the FFRCISTG specifies a set of one-step ahead counterfactuals, $V_m(v_{m-1})$, which may equivalently be written as structural equations $V_m(v_{m-1}) = f_m(v_{m-1}, \varepsilon_m)$ for functions $f_m$ and (vector-valued) random errors $\varepsilon_m$. Thus, leaving aside notational differences, structural equations and one-step ahead counterfactuals are equivalent. All other counterfactuals, as well as factual variables, are then obtained by recursive substitution. However, the NPSEM-IE model of Pearl (2000) further assumes the errors $\varepsilon_m$ are jointly independent. In contrast, though an FFRCISTG model is also an NPSEM, the errors (associated with incompatible counterfactual worlds) may be dependent, though any such dependence could not be detected in a RCT. Hence, Pearl's model is a strict submodel of an FFRCISTG model.

We illustrate the basic ideas using a simplified example. Suppose that we obtain data from an observational or randomized study in which $n$ patients are treated at two times. Let $A_1$ and $A_2$ denote the treatments. Let $L$ be a measurement taken just prior to the second treatment and let $Y$ be a final outcome, higher values of which are desirable. To simplify matters, for now we will suppose that all of the treatments and responses are binary. As a concrete example, consider a study of HIV infected subjects with $(A_1, L, A_2, Y)$, respectively, being binary indicators of anti-retroviral treatment at time 1, high CD4 count just before time 2, anti-retroviral therapy at time 2, and survival at time 3 (where for simplicity we assume no deaths prior to assignment of $A_2$). There are $2^4 = 16$ possible observed data sequences for $(A_1, L, A_2, Y)$; these may be depicted as an event tree as in Figure 1.[9] Robins (1986) referred to such event trees as "structured tree graphs."

Fig. 1. Causal tree graph depicting a simple scenario with treatments at two times $A_1$, $A_2$, a response $L$ measured prior to $A_2$, and a final response $Y$. Blue circles indicate evolution of the process determined by Nature; red dots indicate potential treatment choices.

[9] In practice, there will almost always exist baseline covariates measured prior to $A_1$. In that case, the analysis in the text is to be understood as being within a given joint stratum of a set of baseline covariates sufficient to adjust for confounding due to baseline factors.

We wish to assess the effect of the two treatments $(a_1, a_2)$ on $Y$. In more detail, for a given subject we suppose the existence of four potential outcomes $Y(a_1, a_2)$ for $a_1, a_2 \in \{0, 1\}$ which are the outcome a patient would have if (possibly counter-to-fact) they were to receive the treatments $a_1$ and $a_2$. Then $E[Y(a_1, a_2)]$ is the mean outcome (e.g., the survival probability) if everyone in the population were to receive the specified level of the two treatments. The particular instance of this regime under which everyone is treated at both times, so $a_1 = a_2 = 1$, is depicted in Figure 4(a). We are interested in estimation of these four means since the regime $(a_1, a_2)$ that maximizes $E[Y(a_1, a_2)]$ is the regime a new patient exchangeable with the $n$ study subjects should follow.

There are two extreme scenarios: If in an observational study, the treatments are assigned, for example, by doctors, based on additional unmeasured predictors $U$ of $Y$ then $E[Y(a_1, a_2)]$ is not identified since those receiving $(a_1, a_2)$ within the study are not representative of the population as a whole.

At the other extreme, if the data comes from a completely randomized clinical trial (RCT) in which treatment is assigned independently at each time by the flip of a coin, then it is simple to see that the counterfactual $Y(a_1, a_2)$ is independent of the treatments $(A_1, A_2)$ and that the average potential outcomes are identified since those receiving $(a_1, a_2)$ in the study are a simple random sample of the whole population. Thus,

$$Y(a_1, a_2) \perp\!\!\!\perp \{A_1, A_2\}, \tag{1}$$

$$E[Y(a_1, a_2)] = E[Y \mid A_1 = a_1, A_2 = a_2], \tag{2}$$

where the right-hand side of (2) is a function of the observed data distribution. In a completely randomized experiment, association is causation: the associational quantity on the right-hand side of (2) equals the causal quantity on the left-hand side.

Robins, however, considered an intermediate trial design in which both treatments are randomized, but the probability of receiving $A_2$ is dependent on both the treatment received initially ($A_1$) and the observed response ($L$); a scenario now termed a sequential randomized trial. Robins viewed his analysis as also applicable to observational data as follows. In an observational study, the role of an epidemiologist is to use subject matter knowledge to try to collect in $L$ sufficient data to eliminate confounding by unmeasured factors, and thus to have the study mimic a sequential RCT. If successful, the only difference between an actual sequential randomized trial and an observational study is that in the former the randomization probabilities $\Pr(A_2 = 1 \mid L, A_1)$ are known by design while in the latter they must be estimated from the data.[10]

[10] Of course, one can never be certain that the epidemiologists were successful which is the reason RCTs are generally considered the gold standard for establishing causal effects.

Robins viewed the sequential randomized trial as a collection of five trials in total: the original trial at $t = 1$, plus a set of four randomized trials at $t = 2$ nested within the original trial.[11] Let the counterfactual $L(a_1)$ be the outcome $L$ when $A_1$ is set to $a_1$. Since the counterfactuals $Y(a_1, a_2)$ and $L(a_1)$ do not depend on the actual treatment received, they can be viewed, like a subject's genetic make-up, as a fixed (possibly unobserved) characteristic of a subject and therefore independent of the randomly assigned treatment conditional on pre-randomization covariates. That is, for each $(a_1, a_2)$ and $l$:

$$\{Y(a_1, a_2), L(a_1)\} \perp\!\!\!\perp A_1, \tag{3}$$

$$Y(a_1, a_2) \perp\!\!\!\perp A_2 \mid A_1 = a_1, L = l. \tag{4}$$

[11] That is, the trials starting at $t = 2$ are on study populations defined by specific $(A_1, L)$-histories.

These independences suffice to identify the joint density $f_{Y(a_1,a_2),L(a_1)}(y, l)$ of $(Y(a_1, a_2), L(a_1))$ from the distribution of the factual variables by the "g-computation algorithm formula" (or simply g-formula) density

$$f^*_{a_1,a_2}(y, l) \equiv f(y \mid a_1, l, a_2)\, f(l \mid a_1) \tag{5}$$

provided the conditional probabilities on the right-hand side are well-defined (Robins, 1986, page 1423). Note that $f^*_{a_1,a_2}(y, l)$ is obtained from the joint density of the factuals by removing the treatment terms $f(a_2 \mid a_1, l)\, f(a_1)$. This is in line with the intuition that $A_1$ and $A_2$ cease to be random since, under the regime, they are set by intervention to constants $a_1$ and $a_2$. The g-formula was later referred to as the "manipulated density" by Spirtes, Glymour and Scheines (1993) and the "truncated factorization" by Pearl (2000).

Robins (1987b) showed that under the weaker condition that replaces (3) and (4) with

$$Y(a_1, a_2) \perp\!\!\!\perp A_1 \quad\text{and}\quad Y(a_1, a_2) \perp\!\!\!\perp A_2 \mid A_1 = a_1, L = l, \tag{6}$$

the marginal density of $Y(a_1, a_2)$ is still identified by

$$f^*_{a_1,a_2}(y) = \sum_{l} f(y \mid a_1, l, a_2)\, f(l \mid a_1), \tag{7}$$

the marginal under $f^*_{a_1,a_2}(y, l)$.[12] Robins called (6) randomization w.r.t. $Y$.[13] Furthermore, he provided substantive examples of observational studies in which only the weaker assumption would be expected to hold. It is much easier to describe these studies using representations of causal systems using Directed Acyclic Graphs and Single World Intervention Graphs, neither of which existed when Robins (1987b) was written.

[12] The g-formula density for $Y$ is a generalization of standardization of effect measures to time varying treatments. See Keiding and Clayton (2014) for a historical review of standardization.

[13] Note that the distribution of $L(a_1)$ is no longer identified under this weaker assumption.

2.3 Causal DAGs and Single World Intervention Graphs (SWIGs)

Causal DAGs were first introduced in the seminal work of Spirtes, Glymour and Scheines (1993); the theory was subsequently developed and extended by Pearl (1995a, 2000) among others.

A causal DAG with random variables $V_1, \ldots, V_M$ as nodes is a graph in which (1) the lack of an arrow from node $V_j$ to $V_m$ can be interpreted as the absence of a direct causal effect of $V_j$ on $V_m$ (relative to the other variables on the graph), (2) all common causes, even if unmeasured, of any pair of variables on the graph are themselves on the graph, and (3) the Causal Markov Assumption (CMA) holds. The CMA links the causal structure represented by the Directed Acyclic Graph (DAG) to the statistical data obtained in a study. It states that the distribution of the factual variables factors according to the DAG. A distribution factors according to the DAG if nondescendants of a given variable $V_j$ are independent of $V_j$ conditional on $pa_j$, the parents of $V_j$. The CMA is mathematically equivalent to the statement that the density $f(v_1, \ldots, v_M)$ of the variables on the causal DAG $\mathcal{G}$ satisfies the Markov factorization

$$f(v_1, \ldots, v_M) = \prod_{j=1}^{M} f(v_j \mid pa_j). \tag{8}$$

A graphical criterion, called d-separation (Pearl (1988)), characterizes all the marginal and conditional independences that hold in every distribution obeying the Markov factorization (8).

Causal DAGs may also be used to represent the joint distribution of the observed data under the counterfactual FFRCISTG model of Robins (1986).

This follows because an FFRCISTG model over the variables $\{V_1, \ldots, V_M\}$ induces a distribution that factors as (8). Figure 2(a) shows a causal Directed Acyclic Graph (DAG) corresponding to the sequentially randomized experiment described above: vertex $H$ represents an unmeasured common cause (e.g., immune function) of CD4 count $L$ and survival $Y$. Randomization of treatment implies $A_1$ has no parents and $A_2$ has only the observed variables $A_1$ and $L$ as parents.

Fig. 2. (a) A causal DAG $\mathcal{G}$ describing a sequentially randomized trial; (b) the SWIG $\mathcal{G}(a_1, a_2)$ resulting from intervening on $A_1$ and $A_2$.

Single-World Intervention Graphs (SWIGs), introduced in Richardson and Robins (2013), provide a simple way to derive the counterfactual independence relations implied by an FFRCISTG model. SWIGs were designed to unify the graphical and potential outcome approaches to causality. The nodes on a SWIG are the counterfactual random variables associated with a specific hypothetical intervention on the treatment variables. The SWIG in Figure 2(b) is derived from the causal DAG in Figure 2(a) corresponding to a sequentially randomized experiment. The SWIG represents the counterfactual world in which $A_1$ and $A_2$ have been set to $(a_1, a_2)$, respectively. Richardson and Robins (2013) show that under the (naturally associated) FFRCISTG model the distribution of the counterfactual variables on the SWIG factors according to the graph. Applying Pearl's d-separation criterion to the SWIG we obtain the independences (3) and (4).[14]

[14] More precisely, we obtain the SWIG independence $Y(a_1, a_2) \perp\!\!\!\perp A_2(a_1) \mid A_1, L(a_1)$, that implies (4) by the consistency assumption after instantiating $A_1$ at $a_1$. Note when checking d-separation on a SWIG all paths containing red "fixed" nonrandom vertices, such as $a_1$, are treated as always being blocked (regardless of the conditioning set).

Robins (1987b) in one of the aforementioned substantive examples described an observational study of the effect of formaldehyde exposure on the mortality of rubber workers which can be represented by the causal graph in Figure 3(a). (This graph cannot represent a sequential RCT because the treatment variable $A_1$ and the response $L$ have an unmeasured common cause.) Follow-up begins at time of hire; time 1 on the graph. The vertices $H_1$, $A_1$, $H_2$, $L$, $A_2$, $Y$ are indicators of sensitivity to eye irritants, formaldehyde exposure at time 1, lung cancer, current employment, formaldehyde exposure at time 2 and survival. Data on eye-sensitivity and lung cancer were not collected. Formaldehyde is a known eye-irritant. The presence of an arrow from $H_1$ to $A_1$ but not from $H_1$ to $A_2$ reflects the fact that subjects who believe their eyes to be sensitive to formaldehyde are given the discretion to choose a job without formaldehyde exposure at time of hire but not later. The arrow from $H_1$ to $L$ reflects the fact that eye sensitivity causes some subjects to leave employment. The arrows from $H_2$ to $L$ and $Y$ reflect the fact that lung cancer causes both death and loss of employment. The fact that $H_1$ and $H_2$ are independent reflects the fact that eye sensitivity is unrelated to the risk of lung cancer.

Fig. 3. Formaldehyde study: $H_1$, indicator of sensitivity to eye irritants; $A_1$, formaldehyde exposure at time 1; $H_2$, lung cancer; $L$, current employment; $A_2$, formaldehyde exposure at time 2; $Y$, survival. $H_1$ and $H_2$ are unmeasured. (a) A causal DAG $\mathcal{G}$ in which initial treatment is confounded, while the second treatment is sequentially randomized; (b) the SWIG $\mathcal{G}(a_1, a_2)$. $L$ is known to have no direct effect on $Y$ except indirectly via the effect on $A_2$; $H_1$ influences $A_1$ but not $A_2$. See text for further explanation.

From the SWIG in Figure 3(b), we can see that (6) holds so we have randomization with respect to $Y$ but $L(a_1)$ is not independent of $A_1$. It follows that the g-formula $f^*_{a_1,a_2}(y)$ equals the density of $Y(a_1, a_2)$ even though (i) the distribution of $L(a_1)$ is not identified and (ii) neither of the individual terms $f(l \mid a_1)$ and $f(y \mid a_1, l, a_2)$ occurring in the g-formula has a causal interpretation.[15]

[15] Above we have assumed the variables $A_1$, $L$, $A_2$, $Y$ occurring in the g-formula are temporally ordered. Interestingly, Robins (1986, Section 11) showed identification by the g-formula can require a nontemporal ordering. In his analysis of the Healthy Worker Survivor Effect, data were available on temporally ordered variables $(A_1, L_1, A_2, L_2, Y)$ where the $L_t$ are indicators of survival until year $t$, $A_t$ is the indicator of exposure to a lung carcinogen and there exists substantive background knowledge that carcinogen exposure at $t$ cannot cause death within a year. Under these assumptions, Robins proved that equation (6) was false if one respected temporal order and chose $L$ to be $L_1$, but was true if one chose $L = L_2$. Thus, $E[Y(a_1, a_2)]$ was identified by the g-formula $f^*_{a_1,a_2}(y)$ only for $L = L_2$. See Richardson and Robins (2013, page 54) for further details.

Subsequently, Tian and Pearl (2002a) developed a graphical algorithm for nonparametric identification that is "complete" in the sense that if the algorithm fails to derive an identifying formula, then the causal quantity is not identified (Shpitser and Pearl, 2006; Huang and Valtorta, 2006). This algorithm strictly extends the set of causal identification results obtained by Robins for static regimes.

2.4 Dynamic Regimes

The "g" in "g-formula" and elsewhere in Robins' work refers to generalized treatment regimes $g$. The set $\mathbb{G}$ of all such regimes includes dynamic regimes in which a subject's treatment at time 2 depends on the response $L$ to the treatment at time 1. An example of a dynamic regime is the regime in which all subjects receive anti-retroviral treatment at time 1, but continue to receive treatment at time 2 only if their CD4 count at time 2 is low, indicating that they have not yet responded to anti-retroviral treatment. In our study with no baseline covariates and $A_1$ and $A_2$ binary, a dynamic regime $g$ can be written as $g = (a_1, g_2(l))$ where the function $g_2(l)$ specifies the treatment to be given at time 2. The dynamic regime above has $(a_1 = 1, g_2(l) = 1 - l)$ and is highlighted in Figure 4. If $L$ is binary, then $\mathbb{G}$ consists of 8 regimes comprised of the 4 earlier static regimes $(a_1, a_2)$ and 4 dynamic regimes. The g-formula density associated with a regime $g = (a_1, g_2(l))$ is

$$f^*_g(y, l) \equiv f(l \mid a_1)\, f(y \mid A_1 = a_1, L = l, A_2 = g_2(l)).$$

Letting $Y(g)$ be a subject's counterfactual outcome under regime $g$, Robins (1987b) proves that if both of the following hold:

$$Y(g) \perp\!\!\!\perp A_1, \qquad Y(g) \perp\!\!\!\perp A_2 \mid A_1 = a_1, L = l \tag{9}$$

then $f_{Y(g)}(y)$ is identified by the g-formula density for $Y$:

$$f^*_g(y) = \sum_{l} f^*_g(y, l) = \sum_{l} f(y \mid A_1 = a_1, L = l, A_2 = g_2(l))\, f(l \mid a_1).$$

Robins (1987b) refers to (9) as the assumption that regime $g$ is randomized with respect to $Y$. Given a causal DAG, Dynamic SWIGs (dSWIGs) can be used to check whether (9) holds. Tian (2008) gives a complete graphical algorithm for identification of the effect of dynamic regimes based on DAGs.

Independences (3) and (4) imply that (9) is true for all $g \in \mathbb{G}$. For a drug treatment, for which, say, higher outcome values are better, the optimal regime $g_{\mathrm{opt}}$ maximizing $E[Y(g)]$ over $g \in \mathbb{G}$ is almost always a dynamic regime, as treatment must be discontinued when toxicity, a component of $L$, develops.
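To make the g-formula for a regime $g = (a_1, g_2(l))$ concrete, the following is a minimal sketch that evaluates $\sum_l f(y = 1 \mid a_1, l, g_2(l))\, f(l \mid a_1)$ for all 8 regimes with binary $(A_1, L, A_2, Y)$ and reports the maximizer. The probabilities in `p_l` and `p_y` are made-up illustrative numbers, not values from any study, and the function names are hypothetical. The result can be read as $E[Y(g)]$ only under the identifying assumption (9).

```python
from itertools import product

# Hypothetical observed-data quantities (illustrative numbers only).
# p_l[a1] = P(L = 1 | A1 = a1)
p_l = {0: 0.3, 1: 0.6}
# p_y[(a1, l, a2)] = P(Y = 1 | A1 = a1, L = l, A2 = a2)
p_y = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.45,
    (0, 1, 0): 0.60, (0, 1, 1): 0.55,
    (1, 0, 0): 0.35, (1, 0, 1): 0.70,
    (1, 1, 0): 0.85, (1, 1, 1): 0.75,
}


def g_formula_mean(a1, g2):
    """g-formula mean of Y under regime (a1, g2).

    g2 is a pair (g2(0), g2(1)) giving the time-2 treatment for L = 0 and L = 1.
    Computes sum over l of P(Y=1 | a1, l, g2(l)) * P(L=l | a1).
    """
    total = 0.0
    for l in (0, 1):
        prob_l = p_l[a1] if l == 1 else 1.0 - p_l[a1]
        total += p_y[(a1, l, g2[l])] * prob_l
    return total


def all_regimes():
    """Enumerate the 8 regimes: 2 choices of a1 times 4 functions g2."""
    for a1, g2_at_0, g2_at_1 in product((0, 1), repeat=3):
        yield a1, (g2_at_0, g2_at_1)


means = {regime: g_formula_mean(*regime) for regime in all_regimes()}
best = max(means, key=means.get)

for (a1, g2), value in sorted(means.items()):
    kind = "static" if g2[0] == g2[1] else "dynamic"
    print(f"a1={a1}, g2(0)={g2[0]}, g2(1)={g2[1]} ({kind}): {value:.3f}")
print("maximizing regime:", best)
```

With these invented numbers the maximizer is the dynamic regime $a_1 = 1$, $g_2(l) = 1 - l$, with g-formula mean 0.79, which mirrors the remark that the optimal regime is typically dynamic. Static regimes are the special case in which `g2` is constant.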

Fig. 4. Tree graphs depicting specific treatment regimes: (a) a1 = a2 = 1; (b) the dynamic regime a1 = 1, a2 = (1 − l). The red paths indicate all possible observed data sequences under these regimes.

Robins (1986, 1989, page 1423) used the g-notation $f(y \mid g)$ as a shorthand for $f_{Y(g)}(y)$ in order to emphasize that this was the density of $Y$ had intervention $g$ been applied to the population. In the special case of static regimes $(a_1, a_2)$, he wrote $f(y \mid g = (a_1, a_2))$.¹⁶

¹⁶ Pearl (1995a) introduced an identical notation except that he substituted the word “do” for “g =,” thus writing $f(y \mid do(a_1, a_2))$.

2.5 Statistical Limitations of the Estimated g-Formulae

Consider a sequentially randomized experiment. In this context, randomization probabilities $f(a_1)$ and $f(a_2 \mid a_1, l)$ are known by design; however, the densities $f(y \mid a_1, a_2, l)$ and $f(l \mid a_1)$ are not known and, therefore, they must be replaced by estimates $\hat{f}(y \mid a_1, a_2, l)$ and $\hat{f}(l \mid a_1)$ in the g-formula. If the sample size is moderate and $l$ is high dimensional, these estimates must come from fitting dimension-reducing models. Model misspecification will then lead to biased estimators of the mean of $Y(a_1, a_2)$.

Robins (1986) and Robins and Wasserman (1997) described a serious nonrobustness of the g-formula, the so-called “null paradox”: In biomedical trials, it is frequently of interest to consider the possibility that the sharp causal null hypothesis of no effect of either $A_1$ or $A_2$ on $Y$ holds. Under this null, the causal DAG generating the data is as in Figure 2 except without the arrows from $A_1$, $A_2$ and $L$ into $Y$.¹⁷ Then, under this null, although

$$f^*_{a_1,a_2}(y) = \sum_l f(y \mid a_1, l, a_2)\, f(l \mid a_1)$$

does not depend on $(a_1, a_2)$, nonetheless both $f(y \mid a_1, l, a_2)$ and $f(l \mid a_1)$ will, in general, depend on $a_1$ (as may be seen via d-connection).¹⁸ In general, if $L$ has discrete components, it is not possible for standard nonsaturated parametric models (e.g., logistic regression models) for both $f(y \mid a_1, a_2, l)$ and $f(l \mid a_1)$ to be correctly specified, and thus depend on $a_1$, and yet for $f^*_{a_1,a_2}(y)$ not to depend on $a_1$.¹⁹ As a consequence, inference based on the estimated g-formula must result in the sharp null hypothesis being falsely rejected with probability going to 1, as the trial size increases, even when it is true.

¹⁷ If the $L \to Y$ edge is present, then $A_1$ still has an effect on $Y$.
¹⁸ The dependence of $f(y \mid a_1, l, a_2)$ on $a_1$ does not represent causation but rather selection bias due to conditioning on the common effect $L$ of $H_1$ and $A_1$.
¹⁹ But see Cox and Wermuth (1999) for another approach.

2.6 Structural Nested Models²⁰

To overcome the null paradox, Robins (1989) and Robins et al. (1992) introduced the semiparametric structural nested distribution models (SNDMs) for continuous outcomes $Y$ and structural nested failure time models (SNFTMs) for time to event outcomes. See Robins (1997a, 1997b) for additional details.

²⁰ These models are discussed by Vansteelandt and Joffe (2014) in this issue.
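The null paradox of Section 2.5 can be checked numerically. The following is a minimal Python sketch, not taken from Robins' papers. It assumes a hypothetical all-binary data-generating process in which an unmeasured $H_1$ affects both $L$ and $Y$, $A_1$ affects $L$, and, under the sharp null, nothing else affects $Y$. The probability tables and the function names (`f_l`, `f_y1`, `g_formula`) are illustrative assumptions. The sketch computes the two observed-data ingredients of the g-formula exactly and shows that each depends on $a_1$ while their combination $\sum_l f(y \mid a_1, l, a_2) f(l \mid a_1)$ does not.

```python
# Exact (population-level) check of the null paradox with binary variables.
# All numbers below are illustrative assumptions, not values from the paper.

P_H = {0: 0.5, 1: 0.5}             # P(H1 = h); H1 is unmeasured
P_L1 = {(0, 0): 0.2, (0, 1): 0.6,  # P(L = 1 | H1 = h, A1 = a1), keyed by (h, a1)
        (1, 0): 0.5, (1, 1): 0.9}
P_Y1 = {0: 0.1, 1: 0.7}            # P(Y = 1 | H1 = h): sharp null, Y depends on H1 only


def p_l_given_h_a1(l, h, a1):
    """P(L = l | H1 = h, A1 = a1)."""
    p1 = P_L1[(h, a1)]
    return p1 if l == 1 else 1.0 - p1


def f_l(l, a1):
    """Observed f(l | a1). A1 is randomized, so H1 is independent of A1."""
    return sum(P_H[h] * p_l_given_h_a1(l, h, a1) for h in (0, 1))


def f_y1(a1, l, a2):
    """Observed P(Y = 1 | a1, l, a2).

    a2 does not enter the calculation: A2 is randomized given (A1, L), and
    under the sharp null Y depends only on H1.
    """
    joint = sum(P_Y1[h] * P_H[h] * p_l_given_h_a1(l, h, a1) for h in (0, 1))
    return joint / f_l(l, a1)


def g_formula(a1, a2):
    """g-formula value for P(Y(a1, a2) = 1): sum over l of f(y|a1,l,a2) f(l|a1)."""
    return sum(f_y1(a1, l, a2) * f_l(l, a1) for l in (0, 1))


if __name__ == "__main__":
    for a1 in (0, 1):
        print(f"a1={a1}: f(L=1|a1)={f_l(1, a1):.3f}, "
              f"P(Y=1|a1,L=1,a2)={f_y1(a1, 1, 0):.3f}")
        for a2 in (0, 1):
            print(f"   g-formula(a1={a1}, a2={a2}) = {g_formula(a1, a2):.3f}")
```

In this toy setting both conditional laws vary with $a_1$ because conditioning on $L$ opens the path through $H_1$, yet the g-formula is constant in $(a_1, a_2)$. A nonsaturated parametric model for each piece would have to reproduce this exact cancellation, which is the difficulty the null paradox describes.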

Robins (1986, Section 6) defined the g-null hypothesis as

$$H_0: \text{the distribution of } Y(g) \text{ is the same for all } g \in \mathbb{G}. \tag{10}$$

This hypothesis is implied by the sharp null hypothesis of no effect of $A_1$ or $A_2$ on any subject's $Y$. If (9) holds for all $g \in \mathbb{G}$, then the g-null hypothesis is equivalent to any one of the following assertions:

(i) $f^*_g(y)$ equals the factual density $f(y)$ for all $g \in \mathbb{G}$;
(ii) $Y \perp\!\!\!\perp A_1$ and $Y \perp\!\!\!\perp A_2 \mid L, A_1$;
(iii) $f^*_{a_1,a_2}(y)$ does not depend on $(a_1, a_2)$ and $Y \perp\!\!\!\perp A_2 \mid L, A_1$;

see Robins (1986, Section 6). In addition, any one of these assertions exhausts all restrictions on the observed data distribution implied by the sharp null hypothesis.

Robins' goal was to construct a causal model indexed by a parameter $\psi^*$ such that in a sequentially randomized trial (i) $\psi^* = 0$ if and only if the g-null hypothesis (10) was true and (ii) if known, one could use the randomization probabilities to both construct an unbiased estimating function for $\psi^*$ and to construct tests of $\psi^* = 0$ that were guaranteed (asymptotically) to reject under the null at the nominal level. The SNDMs and SNFTMs accomplish this goal for continuous and failure time outcomes $Y$. Robins (1989) and Robins (1994) also constructed additive and multiplicative structural nested mean models (SNMMs) which satisfied the above properties except with the g-null hypothesis replaced by the g-null mean hypothesis:

$$H_0: E[Y(g)] = E[Y] \text{ for all } g \in \mathbb{G}. \tag{11}$$

As an example, we consider an additive structural nested mean model. Define

$$\gamma(a_1, l, a_2) = E[Y(a_1, a_2) - Y(a_1, 0) \mid L = l, A_1 = a_1, A_2 = a_2]$$

and

$$\gamma(a_1) = E[Y(a_1, 0) - Y(0, 0) \mid A_1 = a_1].$$

Note $\gamma(a_1, l, a_2)$ is the effect of the last blip of treatment $a_2$ at time 2 among subjects with observed history $(a_1, l, a_2)$, while $\gamma(a_1)$ is the effect of the last blip of treatment $a_1$ at time 1 among subjects with history $a_1$. An additive SNMM specifies parametric models $\gamma(a_1, l, a_2; \psi_2)$ and $\gamma(a_1; \psi_1)$ for these blip functions with $\gamma(a_1; 0) = \gamma(a_1, l, a_2; 0) = 0$. Under the independence assumptions (9), $H_2(\psi_2)\, d(L, A_1)\{A_2 - E[A_2 \mid L, A_1]\}$ and $H_1(\psi)\{A_1 - E[A_1]\}$ are unbiased estimating functions for the true $\psi^*$, where $H_2(\psi_2) = Y - \gamma(A_1, L, A_2; \psi_2)$, $H_1(\psi) = H_2(\psi_2) - \gamma(A_1; \psi_1)$, and $d(L, A_1)$ is a user-supplied function of the same dimension as $\psi_2$. Under the g-null mean hypothesis (11), the SNMM is guaranteed to be correctly specified with $\psi^* = 0$. Thus, these estimating functions, when evaluated at $\psi^* = 0$, can be used in the construction of an asymptotically $\alpha$-level test of the g-null mean hypothesis when $f(a_1)$ and $f(a_2 \mid a_1, l)$ are known (or are consistently estimated).²¹ When $L$ is a high-dimensional vector, the parametric blip models may well be misspecified when the g-null mean hypothesis is false. However, because the functions $\gamma(a_1, l, a_2)$ and $\gamma(a_1)$ are nonparametrically identified under assumptions (9), one can construct consistent tests of the correctness of the blip models $\gamma(a_1, l, a_2; \psi_2)$ and $\gamma(a_1; \psi_1)$. Furthermore, one can also estimate the blip functions using cross-validation (Robins (2004)) and/or flexible machine learning methods in lieu of a prespecified parametric model (van der Laan and Rose (2011)). A recent modification of a multiplicative SNMM, the structural nested cumulative failure time model, designed for censored time to event outcomes, has computational advantages compared to a SNFTM, because, in contrast to a SNFTM, parameters are estimated using an unbiased estimating function that is differentiable in the model parameters; see Picciotto et al. (2012).

Robins (2004) also introduced optimal-regime SNMMs drawing on the seminal work of Murphy (2003) on semiparametric methods for the estimation of optimal treatment strategies. Optimal-regime SNMM estimation, called A-learning in computer science, can be viewed as a semiparametric implementation of dynamic programming (Bellman, 1957).²² Optimal-regime SNMMs differ from standard SNMMs only in that $\gamma(a_1)$ is redefined to be

$$\gamma(a_1) = E[Y(a_1, g_{2,\mathrm{opt}}(a_1, L(a_1))) - Y(0, g_{2,\mathrm{opt}}(0, L(0))) \mid A_1 = a_1],$$

where $g_{2,\mathrm{opt}}(a_1, l) = \arg\max_{a_2} \gamma(a_1, l, a_2)$ is the optimal treatment at time 2 given past history $(a_1, l)$. The overall optimal treatment strategy $g_{\mathrm{opt}}$ is then $(a_{1,\mathrm{opt}}, g_{2,\mathrm{opt}}(a_1, l))$ where $a_{1,\mathrm{opt}} = \arg\max_{a_1} \gamma(a_1)$. More on the estimation of optimal treatment regimes can be found in Schulte et al. (2014) in this volume.

²¹ In the literature, semiparametric estimation of the parameters of a SNM based on such estimating functions is referred to as “g-estimation.”
²² Interestingly, Robins (1989, page 127 and App. 1), unaware of Bellman's work, reinvented the method of dynamic programming but remarked that, due to the difficulty of the estimation problem, it would only be of theoretical interest for finding the optimal dynamic regimes from longitudinal epidemiological data.

2.7 Instrumental Variables and Bounds for the Average Treatment Effect

Robins (1989, 1993) also noted that structural nested models can be used to estimate treatment effects when assumptions (9) do not hold but data are available on a time dependent instrumental variable. As an example, patients sometimes fail to fill their prescriptions and thus do not comply with their prescribed treatment. In that case, we can take $A_j = (A_j^p, A_j^d)$ for each time $j$, where $A_j^p$ denotes the treatment prescribed and $A_j^d$ denotes the dose of treatment actually received at time $j$. Robins defined $A_j^p$ to be an instrumental variable if (9) still holds after replacing $A_j$ by $A_j^p$ and for all subjects $Y(a_1, a_2)$ depends on $a_j = (a_j^p, a_j^d)$ only through the actual dose $a_j^d$. Robins noted that unlike the case of full compliance (i.e., $A_j^p = A_j^d$ with probability 1) discussed earlier, the treatment effect functions $\gamma$ are not nonparametrically identified. Consequently, identification can only be achieved by correctly specifying (sufficiently restrictive) parametric models for $\gamma$.

If we are unwilling to rely on such parametric assumptions, then the observed data distribution only implies bounds for the $\gamma$'s. In particular, in the setting of a point treatment randomized trial with noncompliance and the instrument $A_1^p$ being the assigned treatment, Robins (1989) obtained bounds on the average causal effect $E[Y(a_d = 1) - Y(a_d = 0)]$ of the received treatment $A^d$. To the best of our knowledge, this paper was the first to derive bounds for nonidentified causal effects defined through potential outcomes.²³ The study of such bounds has become an active area of research. Other early papers include Manski (1990) and Balke and Pearl (1994).²⁴ See Richardson et al. (2014) in this volume for a survey of recent research on bounds.

²³ See also Robins and Greenland (1989a, 1989b).
²⁴ Balke and Pearl (1994) showed that Robins' bounds were not sharp in the presence of “defiers” (i.e., subjects who would never take the treatment assigned) and derived sharp bounds in that case.

2.8 Limitations of Structural Nested Models

Robins (2000) noted that there exist causal questions for which SNMs are not altogether satisfactory. As an example, for $Y$ binary, Robins (2000) proposed a structural nested logistic model in order to ensure estimates of the counterfactual mean of $Y$ were between zero and one. However, he noted that knowledge of the randomization probabilities did not allow one to construct an unbiased estimating function for its parameter $\psi^*$. More importantly, SNMs do not directly model the final object of public health interest, namely the distribution or mean of the outcome $Y$ as a function of the regimes $g$, as these distributions are generally functions not only of the parameters of the SNM but also of the conditional law of the time dependent covariates $L$ given the past history. In addition, SNMs constitute a rather large conceptual leap from standard associational regression models familiar to most statisticians. Robins (1998, 2000) introduced a new class of causal models, marginal structural models (MSMs), that overcame these particular difficulties. Robins also pointed out that MSMs have their own shortcomings, which we discuss below. Robins (2000) concluded that the best causal model to use will vary with the causal question of interest.

2.9 Dependent Censoring and Inverse Probability Weighting

Marginal Structural Models grew out of Robins' work on censoring and inverse probability of censoring weighted (IPCW) estimators. Robins' work on dependent censoring was motivated by the familiar clinical observation that patients who did not return to the clinic and were thus censored differed from other patients on important risk factors, for example measures of cardio-pulmonary reserve. In the 1970s and 1980s, the analysis of right censored data was a major area of statistical research, driven by the introduction of the proportional hazards model (Cox (1972); Kalbfleisch and Prentice (1980)) and by martingale methods for their analysis (Aalen (1978); Andersen et al. (1993); Fleming and Harrington (1991)). This research, however, was focused on independent censoring. An important insight in Robins (1986) was the recognition that by reframing the problem of censoring as a causal inference problem, as we will now explain, it was possible to adjust for dependent censoring with the g-formula.

Rubin (1978a) had pointed out previously that counterfactual causal inference could be viewed as a missing data problem. Robins (1986, page 1491) recognized that the converse was indeed also true: a missing data problem could be viewed as a problem in counterfactual causal inference.²⁵ Robins conceptualized right censoring as just another time dependent “treatment” $A_t$ and one's inferential goal as the estimation of the outcome $Y$ under the static regime $g$ “never censored.” Inference based on the g-formula was then licensed provided that censoring was explainable in the sense that (6) holds. This approach to dependent censoring subsumed independent censoring as the latter is a special case of the former.

Robins, however, recognized once again that inference based on the estimated g-formula could be nonrobust. To overcome this difficulty, Robins and Rotnitzky (1992) introduced IPCW tests and estimators whose properties are easiest to explain in the context of a two-armed RCT of a single treatment ($A_1$). The standard Intention-to-Treat (ITT) analysis for comparing the survival distributions in the two arms is a log-rank test. However, data are often collected on covariates, both pre- and post-randomization, that are predictive of the outcome as well as (possibly) of censoring. An ITT analysis that tries to adjust for dependent censoring by IPCW uses estimates of the arm-specific hazards of censoring as functions of past covariate history. The proposed IPCW tests have the following two advantages compared to the log-rank test. First, if censoring is dependent but explainable by the covariates, the log-rank test is not asymptotically valid. In contrast, IPCW tests asymptotically reject at their nominal level provided the arm-specific hazard estimators are consistent. Second, when censoring is independent, although both the IPCW tests and the log-rank test asymptotically reject at their nominal level, the IPCW tests, by making use of covariates, can be more powerful than the log-rank test even against proportional-hazards alternatives. Even under independent censoring, tests based on the estimated g-formula are not guaranteed to be asymptotically $\alpha$-level, and hence are not robust.

To illustrate, we consider here an RCT with $A_1$ being the randomization indicator, $L$ a post-randomization covariate, $A_2$ the indicator of censoring and $Y$ the indicator of survival. For simplicity, we assume that any censoring occurs at time 2 and that there are no failures prior to time 2. The IPCW estimator $\hat{\beta}$ of the ITT effect $\beta^* = E[Y \mid A_1 = 1] - E[Y \mid A_1 = 0]$ is defined as the solution to

$$P_n[I(A_2 = 0)\, U(\beta) / \widehat{\Pr}(A_2 = 0 \mid L, A_1)] = 0, \tag{12}$$

where $U(\beta) = (Y - \beta A_1)(A_1 - 1/2)$, throughout $P_n$ denotes the empirical mean operator and $\widehat{\Pr}(A_2 = 0 \mid L, A_1)$ is an estimator of the arm-specific conditional probability of being uncensored. When first introduced in 1992, IPCW estimators, even when taking the form of simple Horvitz–Thompson estimators, were met with both surprise and suspicion as they violated the then widely held belief that one should never adjust for a post-randomization variable affected by treatment in a RCT.

²⁵ A viewpoint recently explored by Mohan, Pearl and Tian (2013).

2.10 Marginal Structural Models

Robins (1993, Remark A1.3, pages 257–258) noted that, for any treatment regime $g$, if randomization w.r.t. $Y$, that is, (9), holds, $\Pr\{Y(g) > y\}$ can be estimated by IPCW if one defines a person's censoring time as the first time he/she fails to take the treatment specified by the regime. In this setting, he referred to IPCW as inverse probability of treatment weighted (IPTW). In actual longitudinal data in which either (i) treatment $A_k$ is measured at many times $k$ or (ii) the $A_k$ are discrete with many levels or continuous, one often finds that few study subjects follow any particular regime. In response, Robins (1998, 2000) introduced MSMs. These models address the aforementioned difficulty by borrowing across regimes. Additionally, MSMs represent another response to the g-null paradox complementary to Structural Nested Models.

To illustrate, suppose that in our example of Section 2, $A_1$ and $A_2$ now have many levels. An instance of an MSM for the counterfactual means $E[Y(a_1, a_2)]$ is a model that specifies that

$$\Phi^{-1}\{E[Y(a_1, a_2)]\} = \beta_0^* + \gamma(a_1, a_2; \beta_1^*),$$

where $\Phi^{-1}$ is a given link function such as the logit, log, or identity link and $\gamma(a_1, a_2; \beta_1)$ is a known function satisfying $\gamma(a_1, a_2; 0) = 0$. In this model, $\beta_1 = 0$ encodes the static-regime mean null hypothesis that

$$H_0: E[Y(a_1, a_2)] \text{ is the same for all } (a_1, a_2). \tag{13}$$

Robins (1998) proposed IPTW estimators $(\hat{\beta}_0, \hat{\beta}_1)$ of $(\beta_0^*, \beta_1^*)$. When the treatment probabilities are known, these estimators are defined as the solution to

$$P_n[W\, v(A_1, A_2)(Y - \Phi\{\beta_0 + \gamma(A_1, A_2; \beta_1)\})] = 0 \tag{14}$$

for a user supplied vector function $v(A_1, A_2)$ of the dimension of $(\beta_0^*, \beta_1^*)$ where

$$W = 1/\{f(A_1) f(A_2 \mid A_1, L)\}.$$

Informally, the product $f(A_1) f(A_2 \mid A_1, L)$ is the “probability that a subject had the treatment history he did indeed have.”²⁶ When the treatment probabilities are unknown, they are replaced by estimators.

Intuitively, the reason why the estimating function of (14) has mean zero at $(\beta_0^*, \beta_1^*)$ is as follows: Suppose the data had been generated from a sequentially randomized trial represented by the DAG in Figure 2. We may create a pseudo-population by making $1/\{f(A_1) f(A_2 \mid A_1, L)\}$ copies of each study subject. It can be shown that in the resulting pseudo-population $A_2 \perp\!\!\!\perp \{L, A_1\}$, and thus is represented by the DAG in Figure 2, except with both arrows into $A_2$ removed. In the pseudo-population, treatment is completely randomized (i.e., there is no confounding by either measured or unmeasured variables), and hence causation is association. Further, the mean of $Y(a_1, a_2)$ takes the same value in the pseudo-population as in the actual population. Thus if, for example, $\gamma(a_1, a_2; \beta_1) = \beta_{1,1} a_1 + \beta_{1,2} a_2$ and $\Phi^{-1}$ is the identity link, we can estimate $(\beta_0^*, \beta_1^*)$ by OLS in the pseudo-population. However, OLS in the pseudo-population is precisely weighted least squares in the actual study population with weights $1/\{f(A_1) f(A_2 \mid A_1, L)\}$.²⁷

Robins (2000, Section 4.3) also noted that the weights $W$ can be replaced by the so-called stabilized weights $SW = \{f(A_1) f(A_2 \mid A_1)\}/\{f(A_1) f(A_2 \mid A_1, L)\}$, and described settings where, for efficiency reasons, using $SW$ is preferable to using $W$.

MSMs are not restricted to models for the dependence of the mean of $Y(a_1, a_2)$ on $(a_1, a_2)$. Indeed, one can consider MSMs for the dependence of any functional of the law of $Y(a_1, a_2)$ on $(a_1, a_2)$, such as a quantile or the hazard function if $Y$ is a time-to-event variable. If the study is fully randomized, that is, (1) holds, then an MSM model for a given functional of the law of $Y(a_1, a_2)$ is tantamount to an associational model for the same functional of the law of $Y$ conditional on $A_1 = a_1$ and $A_2 = a_2$. Thus, under (1), the MSM model can be estimated using standard methods for estimating the corresponding associational model. If the study is only sequentially randomized, that is, (6) holds but (1) does not, then the model can still be estimated by the same standard methods but weighting each subject by $W$ or $SW$.

Robins (2000) discussed disadvantages of MSMs compared to SNMs. Here, we summarize some of the main drawbacks. Suppose (9) holds for all $g \in \mathbb{G}$. If the g-null hypothesis (10) is false but the static regime null hypothesis that the law of $Y(a_1, a_2)$ is the same for all $(a_1, a_2)$ is true, then by (iii) of Section 2.6, $f(y \mid A_1 = a_1, A_2 = a_2, L = l)$ will depend on $a_2$ for some stratum $(a_1, l)$, thus implying a causal effect of $A_2$ in that stratum; estimation of an SNM model would, but estimation of an MSM model would not, detect this effect. A second drawback is that estimation of MSM models suffers from marked instability and finite-sample bias in the presence of weights $W$ that are highly variable and skewed. This is not generally an issue in SNM estimation. A third limitation of MSMs is that when (6) fails but an instrumental variable is available, one can still consistently estimate the parameters of a SNM but not of an MSM.²⁸

An advantage of MSMs over SNMs that was not discussed in Section 2.8 is the following. MSMs can be constructed that are indexed by easily interpretable parameters that quantify the overall effects of a subset of all possible dynamic regimes (Hernán et al. (2006); van der Laan and Petersen (2007); Orellana, Rotnitzky and Robins (2010a, 2010b)). As an example, consider a longitudinal study of HIV infected patients with baseline CD4 counts exceeding 600 in which we wish to determine the optimal CD4 count at which to begin anti-retroviral treatment. Let $g_x$ denote the dynamic regime that specifies treatment is to be initiated the first time a subject's CD4 count falls below $x$, $x \in \{1, 2, \ldots, 600\}$. Let $Y(g_x)$ be the associated counterfactual response and suppose few study subjects follow any given regime. If we assume $E[Y(g_x)]$ varies smoothly with $x$, we can specify and fit (by IPTW) a dynamic regime MSM model $E[Y(g_x)] = \beta_0^* + \beta_1^{*T} h(x)$ where, say, $h(x)$ is a vector of appropriate spline functions.

²⁶ IPTW estimators and IPCW estimators are essentially equivalent. For instance, in the censoring example of Section 2.9, on the event $A_2 = 0$ of being uncensored, the IPCW denominator $\widehat{\Pr}(A_2 = 0 \mid L, A_1)$ equals $f(A_2 \mid A_1, L)$, the IPTW denominator.
²⁷ More formally, recall that under (6), $E[Y(a_1, a_2)] = \Phi\{\beta_0^* + \gamma(a_1, a_2; \beta_1^*)\}$ is equal to the g-formula $\int y f^*_{a_1,a_2}(y)\, dy$. Now, given the joint density of the data $f(A_1, L, A_2, Y)$, define
$$\tilde{f}(A_1, L, A_2, Y) = f(Y \mid A_1, L, A_2)\, \tilde{f}_2(A_2)\, f(L \mid A_1)\, \tilde{f}_1(A_1),$$
where $\tilde{f}_1(A_1) \tilde{f}_2(A_2)$ are user-supplied densities chosen so that $\tilde{f}$ is absolutely continuous with respect to $f$. Since the g-formula depends on the joint density of the data only through $f(Y \mid A_1, L, A_2)$ and $f(L \mid A_1)$, it is identical under $\tilde{f}$ and under $f$. Furthermore, for each $a_1, a_2$ the g-formula under $\tilde{f}$ is just equal to $\tilde{E}[Y \mid A_1 = a_1, A_2 = a_2]$ since, under $\tilde{f}$, $A_2$ is independent of $\{L, A_1\}$. Consequently, for any $q(A_1, A_2)$,
$$0 = \tilde{E}[q(A_1, A_2)(Y - \Phi\{\beta_0^* + \gamma(A_1, A_2; \beta_1^*)\})]$$
$$= E[q(A_1, A_2)\{\tilde{f}_1(A_1) \tilde{f}_2(A_2) / \{f(A_1) f(A_2 \mid A_1, L)\}\} \cdot (Y - \Phi\{\beta_0^* + \gamma(A_1, A_2; \beta_1^*)\})],$$
where the second equality follows from the Radon–Nikodym theorem. The result then follows by taking $q(A_1, A_2) = v(A_1, A_2)/\{\tilde{f}_1(A_1) \tilde{f}_2(A_2)\}$.
²⁸ Note that, as observed earlier, in this case identification is achieved through parametric assumptions made by the SNM.

3. DIRECT EFFECTS

Robins' analysis of sequential regimes leads immediately to the consideration of direct effects. Thus, perhaps not surprisingly, all three of the distinct direct effect concepts that are now an integral part of the causal literature are to be found in his early papers. Intuitively, all the notions of direct effect consider whether “the outcome ($Y$) would have been different had cause ($A_1$) been different, but the level of ($A_2$) remained unchanged.” The notions differ regarding the precise meaning of $A_2$ “remained unchanged.”

3.1 Controlled Direct Effects

In a setting in which there are temporally ordered treatments $A_1$ and $A_2$, it is natural to wonder whether the first treatment has any effect on the final outcome were everyone to receive the second treatment. Formally, we wish to compare the potential outcomes $Y(a_1 = 1, a_2 = 1)$ and $Y(a_1 = 0, a_2 = 1)$. Robins (1986, Section 8) considered such contrasts, which are now referred to as controlled direct effects. More generally, the average controlled direct effect of $A_1$ on $Y$ when $A_2$ is set to $a_2$ is defined to be

$$CDE(a_2) \equiv E[Y(a_1 = 1, a_2) - Y(a_1 = 0, a_2)], \tag{15}$$

where $Y(a_1 = 1, a_2) - Y(a_1 = 0, a_2)$ is the individual level direct effect. Thus, if $A_2$ takes $k$ levels then there are $k$ such contrasts.

Under the causal graph shown in Figure 5(a), in contrast to Figures 2 and 3, the effect of $A_2$ on $Y$ is unconfounded by either measured or unmeasured variables; association is causation and thus, under the associated FFRCISTG model:

$$CDE(a_2) = E[Y \mid A_1 = 1, A_2 = a_2] - E[Y \mid A_1 = 0, A_2 = a_2].$$

The CDE can be identified even in the presence of time-dependent confounding. For example, in the context of the FFRCISTG associated with either of the causal DAGs shown in Figures 2 and 3, the $CDE(a_2)$ will be identified via the difference in the expectations of $Y$ under the g-formula densities $f^*_{a_1=1,a_2}(y)$ and $f^*_{a_1=0,a_2}(y)$.²⁹

The CDE requires that the potential outcomes $Y(a_1, a_2)$ be well-defined for all values of $a_1$ and $a_2$. This is because the CDE treats both $A_2$ and $A_1$ as causes, and interprets “$A_2$ remained unchanged” to mean “had there been an intervention on $A_2$ fixing it to $a_2$.”

This clearly requires that the analyst be able to describe a well-defined intervention on the mediating variable $A_2$.

There are many contexts in which there is no clear well-defined intervention on $A_2$ and thus it is not meaningful to refer to $Y(a_1, a_2)$. The CDE is not applicable in such contexts.

²⁹ See (7).

3.2 Principal Stratum Direct Effects (PSDE)

Robins (1986) considered causal contrasts in the situation described in Section 2.9 in which death from a disease of interest, for example, a heart attack, may be censored by death from other diseases.

Fig. 5. (a) A causal DAG G with no (measured or unmeasured) confounding of A2 on Y ; (b) the SWIG G(a1,a2) resulting from intervening on A1 and A2.

To describe these contrasts, we suppose $A_1$ is a treatment of interest, $Y = 1$ is the indicator of death from the disease of interest (in a short interval subsequent to a given fixed time $t$) and $A_2 = 0$ is the “at risk indicator” denoting the absence of death either from other diseases or the disease of interest prior to time $t$.

Earlier Kalbfleisch and Prentice (1980) had argued that if $A_2 = 1$, so that the subject does not survive to time $t$, then the question of whether the subject would have died of heart disease subsequent to $t$ had death before $t$ been prevented is meaningless. In the language of counterfactuals, they were saying (i) that if $A_1 = a_1$ and $A_2 \equiv A_2(a_1) = 1$, the counterfactual $Y(a_1, a_2 = 0)$ is not well-defined and (ii) the counterfactual $Y(a_1, a_2 = 1)$ is never well-defined.

Robins (1986, Section 12.2) observed that if one accepts this then the only direct effect contrast that is well-defined is $Y(a_1 = 1, a_2 = 0) - Y(a_1 = 0, a_2 = 0)$ and that is well-defined only for those subjects who would survive to $t$ regardless of whether they received $a_1 = 0$ or $a_1 = 1$. In other words, even though $Y(a_1, a_2)$ may not be well-defined for all subjects and all $a_1, a_2$, the contrast:

$$E[Y(a_1 = 0, a_2) - Y(a_1 = 1, a_2) \mid A_2(a_1 = 1) = A_2(a_1 = 0) = a_2] \tag{16}$$

is still well-defined when $a_2 = 0$. As noted by Robins, this could provide a solution to the problem of defining the causal effect of the treatment $A_1$ on the outcome $Y$ in the context of censoring by death due to other diseases.

Rubin (1998) and Frangakis and Rubin (1999, 2002) later used this same contrast to solve precisely the same problem of “censoring by death.”³⁰

In the terminology of Frangakis and Rubin (2002), for a subject with $A_2(a_1 = 1) = A_2(a_1 = 0) = a_2$, the individual principal stratum direct effect is defined to be:³¹

$$Y(a_1 = 1, a_2) - Y(a_1 = 0, a_2)$$

(here, $A_1$ is assumed to be binary). The average PSDE in principal stratum $a_2$ is then defined to be

$$PSDE(a_2) \equiv E[Y(a_1 = 1, a_2) - Y(a_1 = 0, a_2) \mid A_2(a_1 = 1) = A_2(a_1 = 0) = a_2]$$
$$= E[Y(a_1 = 1) - Y(a_1 = 0) \mid A_2(a_1 = 1) = A_2(a_1 = 0) = a_2], \tag{17}$$

where the second equality here follows, since $Y(a_1, A_2(a_1)) = Y(a_1)$.³² In contrast to the CDE, the PSDE has the advantage that it may be defined, via (17), without reference to potential outcomes involving intervention on $a_2$. Whereas the CDE views $A_2$ as a treatment, the PSDE treats $A_2$ as a response. Equivalently, this contrast interprets “had $A_2$ remained unchanged” to mean “we restrict attention to those people whose value of $A_2$ would still have been $a_2$, even under an intervention that set $A_1$ to a different value.”

Although the PSDE is an interesting parameter in many settings (Gilbert, Bosch and Hudgens (2003)), it has drawbacks beyond the obvious (but perhaps less important) ones that neither the parameter itself nor the subgroup conditioned on are nonparametrically identified. In fact, having just defined the PSDE parameter, Robins (1986) criticized it for its lack of transitivity when there is a non-null direct effect of $A_1$ and $A_1$ has more than two levels; that is, for a given $a_2$, the PSDEs comparing $a_1 = 0$ with $a_1 = 1$ and $a_1 = 1$ with $a_1 = 2$ may both be positive but the PSDE comparing $a_1 = 0$ with $a_1 = 2$ may be negative. Robins, Rotnitzky and Vansteelandt (2007) noted that the PSDE is undefined when $A_1$ has an effect on every subject's $A_2$, a situation that can easily occur if $A_2$ is continuous. In that event, a natural strategy would be to, say, dichotomize $A_2$. However, Robins, Rotnitzky and Vansteelandt (2007) showed that the PSDE in principal stratum $a_2^*$ of the dichotomized variable may fail to retain any meaningful substantive interpretation.

³⁰ The analysis of Rubin (2004) was also based on this contrast, with $A_2$ no longer a failure time indicator so that the contrast (16) could be considered as well-defined for any value of $a_2$ for which the conditioning event had positive probability.
³¹ For subjects for whom $A_2(a_1 = 1) \neq A_2(a_1 = 0)$, no principal stratum direct effect (PSDE) is defined.
³² This follows from consistency.

3.3 Pure Direct Effects (PDE)³³

Once it has been established that a treatment $A_1$ has a causal effect on a response $Y$, it is natural to ask what “fraction” of the total effect may be attributed to a given causal pathway. As an example, consider a RCT in nonhypertensive smokers of the effect of an anti-smoking intervention ($A_1$) on the outcome myocardial infarction (MI) at 2 years ($Y$). For simplicity, assume everyone in the intervention arm and no one in the placebo arm quit cigarettes, that all subjects were tested for new-onset hypertension $A_2$ at the end of the first year, and no subject suffered an MI in the first year. Hence, $A_1$, $A_2$ and $Y$ occur in that order. Suppose the trial showed smoking cessation had a beneficial effect on both hypertension and MI. It is natural to consider the query: “What fraction of the total effect of smoking cessation $A_1$ on MI $Y$ is through a pathway that does not involve hypertension $A_2$?”

Robins and Greenland (1992) formalized this question via the following counterfactual contrast, which they termed the “pure direct effect”:

$$Y\{a_1 = 1, A_2(a_1 = 0)\} - Y\{a_1 = 0, A_2(a_1 = 0)\}.$$

The second term here is simply $Y(a_1 = 0)$.³⁴ The contrast is thus the difference between two quantities: first, the outcome $Y$ that would result if we set $a_1$ to 1, while “holding fixed” $a_2$ at the value $A_2(a_1 = 0)$ that it would have taken had $a_1$ been 0; second, the outcome $Y$ that would result from simply setting $a_1$ to 0 [and thus having $A_2$ again take the value $A_2(a_1 = 0)$]. Thus, the Pure Direct Effect interprets “had $A_2$ remained unchanged” to mean “had (somehow) $A_2$ taken the value that it would have taken had we fixed $A_1$ to 0.” The contrast thus represents the effect of $A_1$ on $Y$ had the effect of $A_1$ on hypertension $A_2$ been blocked. As for the CDE, to be well-defined, potential outcomes $Y(a_1, a_2)$ must be well-defined. As a summary measure of the direct effect of (a binary variable) $A_1$ on $Y$, the PDE has the advantage (relative to the CDE and PSDE) that it is a single number.

The average pure direct effect is defined as³⁵

$$PDE = E[Y\{a_1 = 1, A_2(a_1 = 0)\}] - E[Y\{a_1 = 0, A_2(a_1 = 0)\}].$$

Thus, the ratio of the PDE to the total effect $E[Y\{a_1 = 1\}] - E[Y\{a_1 = 0\}]$ is the fraction of the total that is through a pathway that does not involve hypertension ($A_2$).

Unlike the PSDE, the PDE is an average over the full population. However, unlike the CDE, the PDE is not nonparametrically identified under the FFRCISTG model associated with the simple DAG shown in Figure 5(a). Robins and Richardson (2011, App. C) computed bounds for the PDE under the FFRCISTG associated with this DAG.

Pearl (2001) obtains identification of the PDE under the DAG in Figure 5(a) by imposing stronger counterfactual independence assumptions, via a Nonparametric Structural Equation Model with Independent Errors (NPSEM-IE).³⁶ Under these assumptions, Pearl (2001) obtains the following identifying formula:

$$\sum_{a_2} \{E[Y \mid A_1 = 1, A_2 = a_2] - E[Y \mid A_1 = 0, A_2 = a_2]\} \cdot P(A_2 = a_2 \mid A_1 = 0), \tag{20}$$

which he calls the “Mediation Formula.”

Robins and Richardson (2011) noted that the additional assumptions made by the NPSEM-IE are not testable, even in principle, via a randomized experiment. Consequently, this formula represents a departure from the principle, originating with Neyman (1923), that causation be reducible to experimental interventions, often expressed in the slogan “no causation without manipulation.”³⁷ Robins and Richardson (2011) achieve a rapprochement between these opposing positions by showing that the formula (20) is equal to the g-formula associated with an intervention on two treatment variables not appearing on the graph (but having deterministic relations with $A_1$) under the assumption that one of the variables has no direct effect on $A_2$ and the other has no direct effect on $Y$. Hence, under this assumption and in the absence of confounding, the effect of this intervention on $Y$ is point identified by (20).³⁸

Although there was a literature on direct effects in linear structural equation models (see, e.g., Blalock (1971)) that preceded Robins (1986) and Robins and Greenland (1992), the distinction between the CDE and PDE did not arise since in linear models these notions are equivalent.³⁹

³³ Pearl (2001) adopted the definition given by Robins and Greenland (1992) but changed nomenclature. He refers to the pure direct effect as a “natural” direct effect.
³⁴ This follows by consistency.
³⁵ Robins and Greenland (1992) also defined the total indirect effect (TIE) of $A_1$ on $Y$ through $A_2$ to be
$$E[Y\{a_1 = 1, A_2(a_1 = 1)\}] - E[Y\{a_1 = 1, A_2(a_1 = 0)\}].$$
It follows that the total effect $E[Y\{a_1 = 1\}] - E[Y\{a_1 = 0\}]$ can then be decomposed as the sum of the PDE and the TIE.
³⁶ In more detail, the FFRCISTG associated with Figures 5(a) and (b) assumes for all $a_1, a_2$,
$$Y(a_1, a_2), A_2(a_1) \perp\!\!\!\perp A_1, \qquad Y(a_1, a_2) \perp\!\!\!\perp A_2(a_1) \mid A_1, \tag{18}$$
which may be read directly from the SWIG shown in Figure 5(b); recall that red nodes are always blocked when applying d-separation. In contrast, Pearl's NPSEM-IE also implies the independence
$$Y(a_1, a_2) \perp\!\!\!\perp A_2(a_1^*) \mid A_1, \tag{19}$$
when $a_1 \neq a_1^*$. Independence (19), which is needed in order for the PDE to be identified, is a “cross-world” independence since $Y(a_1, a_2)$ and $A_2(a_1^*)$ could never (even in principle) both be observed in any randomized experiment.

3.4 The Direct Effect Null

Robins (1986, Section 8) considered the null hypothesis that $Y(a_1, a_2)$ does not depend on $a_1$ for all $a_2$, which we term the sharp null hypothesis of no direct effect of $A_1$ on $Y$ (relative to $A_2$) or more simply the “sharp direct effect null.”

In the context of our running example with data $(A_1, L, A_2, Y)$, under (6) the sharp direct effect null implies the following constraint on the observed data distribution:

$$f^*_{a_1,a_2}(y) \text{ is not a function of } a_1 \text{ for all } a_2. \tag{21}$$

Robins (1986, Sections 8 and 9) noted that this constraint (21) is not a conditional independence. This is in contrast to the g-null hypothesis which we have seen is equivalent to the independencies in (ii) of Section 2.6 [when equation (9) holds for all $g \in \mathbb{G}$].⁴⁰ He concluded that, in contrast to the g-null hypothesis, the constraint (21), and thus the sharp direct effect null, cannot be tested using case control data with unknown case and control sampling fractions.⁴¹ This constraint (21) was later independently discovered by Verma and Pearl (1990) and for this reason is called the “Verma constraint” in the Computer Science literature.

Robins (1999b) noted that, though (21) is not a conditional independence in the observed data distribution, it does correspond to a conditional independence, but in a weighted distribution with weights proportional to $1/f(A_2 \mid A_1, L)$.⁴² This can be understood from the informal discussion following equation (14) in the previous section: there it was noted that given the FFRCISTG corresponding to the DAG in Figure 2, reweighting by $1/f(A_2 \mid A_1, L)$ corresponds to removing both edges into $A_2$. Hence, if the edges $A_1 \to Y$ and $L \to Y$ are not present, so that the sharp direct effect null holds,

³⁷ A point freely acknowledged by Pearl (2012) who argues that causation should be viewed as more primitive than intervention.
³⁸ This point identification is not a “free lunch”: Robins and Richardson (2011) show that it is these additional assumptions that have reduced the FFRCISTG bounds for the PDE to a point. This is a consequence of the fact that these assumptions induce a model for the original variables $\{A_1, A_2(a_1), Y(a_1, a_2)\}$ that is a strict submodel of the original FFRCISTG model.
as in Figure 6(a), then the reweighted population is Hence to justify applying the mediation formula by this route one must first be able to specify in detail the additional treatment variables and the associated intervention so as to 40 Results in Pearl (1995b) imply that under the sharp di- make the relevant potential outcomes well-defined. In addi- rect effect null the FFRCISTGs associated with the DAGs tion, one must be able to argue on substantive grounds for shown in Figures 2 and 3 also imply inequality restrictions the plausibility of the required no direct effect assumptions similar to Bell’s inequality in Quantum Mechanics. See Gill and deterministic relations. (2014) for discussion of statistical issues arising from experi- It should also be noted that even under Pearl’s NPSEM- mental tests of Bell’s inequality. IE model the PDE is not identified in causal graphs, such 41To our knowledge, it is the first such causal null hypoth- as those in Figures 2 and 3 that contain a variable (whether esis considered in Epidemiology for which this is the case. observed or unobserved) that is present both on a directed 42This observation motivated the development of graphical pathway from A1 to A2 and on a pathway from A1 to Y . “nested” Markov models that encode constraints such as (21) 39Note that in a linear structural equation model the PSDE in addition to ordinary conditional independence relations; is not defined unless A1 has no effect on A2. see the discussion of “Causal Discovery” in Section 7 below. ROBINS’ CAUSAL ETIOLOGY 17

Fig. 6. (a) A DAG representing the sequentially randomized experiment shown in Figure 2 but where there is no direct effect of A1 on Y relative to A2; (b) a DAG representing the pseudo-population obtained by re-weighting the distribution with weights proportional to 1/f(A2 | L, A1). described by the DAG in Figure 6(b). It then fol- of statistics and for Bayesian inference. To make lows from the d-separation relations on this DAG their argument transparent, we will assume in our that Y A A in the reweighted distribution. running example (from Section 2.2) that the den- ⊥⊥ 1 | 2 This fact can also be seen as follows. If, in our sity of L is known and that A1 = 1 with probabil- running example from Section 2.2, A1, A2, Y are ity 1 (hence we drop A1 from the notation). We all binary, the sharp direct effect null implies that will further assume the observed data are n i.i.d. ∗ ∗ β1 = β3 = 0 in the saturated MSM with copies of a random vector (L, A2,Y ) with A2 and −1 ∗ ∗ ∗ ∗ Y binary and L a d 1 continuous vector with Φ E[Y (a1, a2)] = β0 + β1 a1 + β2 a2 + β3 a1a2. × { } support on the unit cube (0, 1)d. We consider a ∗ ∗ Since β1 and β3 are the associational parameters model for the law of (L, A2,Y ) that assumes that of the weighted distribution, their being zero im- the density f ∗(l) of L is known, that the treat- ∗ plies the conditional independence Y A1 A2 un- ment probability π (l) Pr(A = 1 L = l) lies in ⊥⊥ | ≡ 2 | der this weighted distribution. the interval (c, 1 c) for some known c> 0 and that ∗ − In more complex longitudinal settings, with the b (l, a ) E[Y L = l, A = a ] is continuous in l. 2 ≡ | 2 2 number of treatment times k exceeding 2, all the Under this model, the likelihood function is parameters multiplying terms containing a particu- lar treatment variable in a MSM may be zero, yet (22) (b, π)= 1(b) 2(π), there may still be evidence in the data that the sharp L L L direct effect null for that variable is false. 
This is di- where rectly analogous to the limitation of MSMs relative n ∗ Y to SNMs with regard to the sharp null hypothesis 1(b)= f (Li)b(Li, A2,i) L (10) of no effect of any treatment that we noted at Yi=1 (23) the end of Section 2.10. To overcome this problem, − 1 b(L , A ) 1 Y , Robins (1999b) introduced direct effect structural · { − i 2,i } n nested models. In these models, which involve treat- − (24) (π)= π (L )A2,i 1 π (L ) 1 A2,i , ment at k time points, if all parameters multiplying L2 2 i { − 2 i } Yi=1 a given aj take the value 0, then we can conclude that the distribution of the observables do not refute and (b, π) Π. Here is the set of continuous ∈B× B the natural extension of (21) to k times. The latter functions from (0, 1)d 0, 1 to (0, 1) and Π is the × { } is implied by the sharp direct effect null that aj has set of functions from (0, 1)d to (c, 1 c). − no direct effect on Y holding aj+1, . . ., ak fixed. We assume the goal is inference about µ(b) where µ(b)= b(l, 1)f ∗(l) dl. Under randomization, that is ∗ 4. THE FOUNDATIONS OF STATISTICS AND (3) andR (4), µ(b ) is the counterfactual mean of Y BAYESIAN INFERENCE when treatment is given at both times. ∗ Robins and Ritov (1997) and Robins and Wasser- When π is unknown, Robins and Ritov (1997) ∗ man (2000) recognized that the lack of robustness showed that no estimator of µ(b ) exists that is uni- of estimators based on the g-formula in a sequen- formly consistent over all Π. They also showed ∗ B× tial randomized trial with known randomization that even if π is known, any estimator that does not probabilities had implications for the foundations use knowledge of π∗ cannot be uniformly consistent 18 T. S. RICHARDSON AND A. ROTNITZKY over π∗ for all π∗. However, there do exist es- This viewpoint led them to recognize that the timatorsB × {that} depend on π∗ that are uniformly √n- IPCW and IPTW estimators described earlier were ∗ ∗ ∗ consistent for µ(b ) over π for all π . The not fully efficient. 
To obtain efficient estimators, B × { } ∗ Horvitz–Thompson estimator P A Y/π (L) is a Robins and Rotnitzky (1992) and Robins, Rot- n{ 2 } simple example. nitzky and Zhao (1994) used the theory of semi- Robins and Ritov (1997) concluded that, in this parametric efficiency bounds (Bickel et al. (1993); example, any method of estimation that obeys the van der Vaart (1991)) to derive representations for likelihood principle such as maximum likelihood or the efficient score, the efficient influence function, Bayesian estimation with independent priors on b the semiparametric variance bound, and the influ- and π, must fail to be uniformly consistent. This ence function of any asymptotically linear estima- is because any procedure that obeys the likelihood tor in this general problem. The books by Tsiatis ∗ principle must result in the same inference for µ(b ) (2006) and by van der Laan and Robins (2003) pro- ∗ ∗ regardless of π , even when π becomes known. vide thorough treatments. The generality of these Robins and Wasserman (2000) noted that this ex- results allowed Robins and his principal collabora- ample illustrates that the likelihood principle and tors Mark van der Laan and Andrea Rotnitzky to frequentist performance can be in severe conflict in solve many open problems in the analysis of semi- that any procedure with good frequentist properties parametric models. For example, they used the ef- 43 must violate the likelihood principle. Ritov et al. ficient score representation theorem to derive lo- (2014) in this volume extends this discussion in cally efficient semiparametric estimators in many many directions. models of importance in biostatistics. Some exam- ples include conditional mean models with miss- 5. 
SEMIPARAMETRIC EFFICIENCY AND ing regressors and/or responses (Robins, Rotnitzky DOUBLE ROBUSTNESS IN MISSING DATA and Zhao (1994); Rotnitzky and Robins (1995)), bi- AND CAUSAL INFERENCE MODELS variate survival (Quale, van der Laan and Robins Robins and Rotnitzky (1992) recognized that the (2006)) and multivariate survival models with ex- inferential problem of estimation of the mean E[Y (g)] plainable dependent censoring (van der Laan, Hub- 45 (when identified by the g-formula) of a response Y bard and Robins (2002)). under a regime g is a special case of the general problem of estimating the parameters of an arbi- timators of the functional. In this section, we assume that trary semi-parametric model in the presence of data the distribution of the observables is compatible with CAR, that had been coarsened at random (Heitjan and and further, that in the estimation problems that we consider, Rubin (1991)).44 CAR may be assumed to hold without loss of generality. In fact, this is the case in the context of our run- ning causal inference example from Section 2.2. Specifi- 43 In response Robins (2004, Section 5.2) offered a Bayes– cally, let X = {Y (a1,a2), L(a1); aj ∈ {0, 1}, j = 1, 2}, R = frequentist compromise that combines honest subjective (A1,A2), and X(a1,a2) = {Y (a1,a2), L(a1)}. Consider a Bayesian decision making under uncertainty with good fre- model MX for X that specifies (i) {Y (1,a2), L(1); a2 ∈ quentist behavior even when, as above, the model is so large {0, 1}} ⊥⊥{Y (0,a2), L(0); a2 ∈ {0, 1}} and (ii) Y (a1, 1) ⊥⊥ and the likelihood function so complex that standard (un- Y (a1, 0) | L(a1) for a1 ∈ {0, 1}. Results in Gill and Robins compromised) Bayes procedures have poor frequentist per- (2001, Section 6) and Robins (2000, Sections 2.1 and 4.2) formance. 
The key to the compromise is that the Bayesian show that (a) model MX places no further restrictions decision maker is only allowed to observe a specified vector on the distribution of the observed data (A1,A2,L,Y ) = ∗ function of X [depending on the known π (X)] but not X (A1,A2, L(A1),Y (A1,A2)), (b) given model MX , the addi- itself. tional independences X ⊥⊥ A1 and X ⊥⊥ A2 | A1, L together 44Given complete data X, an always observed coarsening also place no further restrictions on the distribution of the variable R, and a known coarsening function x(r) = c(r, x), observed data (A1,A2,L,Y ) and are equivalent to assuming coarsening at random (CAR) is said to hold if Pr(R = r | CAR. Further, the independences in (b) imply (9) so that ∗ X) depends only on X(r), the observed data part of X. fY (g)(y) is identified by the g-formula fg (y). Robins and Rotnitzky (1992), Gill, van der Laan and Robins 45More recently, in the context of a RCT, Tsiatis et al. (1997) and Cator (2004) showed that in certain models as- (2008) and Moore and van der Laan (2009), following the suming CAR places no restrictions on the distribution of the strategy of Robins and Rotnitzky (1992), studied variants of observed data. For such models, we can pretend CAR holds the locally efficient tests and estimators of Scharfstein, Rot- when our goal is estimation of functionals of the observed nitzky and Robins (1999) to increase efficiency and power by data distribution. This trick often helps to derive efficient es- utilizing data on covariates. ROBINS’ CAUSAL ETIOLOGY 19
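As a small numerical illustration of the Mediation Formula (20) of Section 3.3, the sketch below evaluates it for binary $A_2$ and checks the decomposition of the total effect into PDE plus TIE described in footnote 35. The conditional means and probabilities are hypothetical numbers chosen only for illustration, and the function names are ours. The expressions used for the TIE and the total effect assume that $A_1$ is randomized and that the NPSEM-IE assumptions hold, so this is a sketch under those assumptions rather than a general recipe.

```python
from typing import Dict

Means = Dict[int, Dict[int, float]]   # e_y[a1][a2] = E[Y | A1 = a1, A2 = a2]
Probs = Dict[int, float]              # p[a2]       = P(A2 = a2 | A1 = a1) for a fixed a1


def mediation_formula(e_y: Means, p_a2_given_a1_0: Probs) -> float:
    """Formula (20): sum_{a2} {E[Y|A1=1,a2] - E[Y|A1=0,a2]} * P(A2=a2 | A1=0)."""
    return sum((e_y[1][a2] - e_y[0][a2]) * p for a2, p in p_a2_given_a1_0.items())


def total_indirect_effect(e_y: Means, p_a2_given_a1_1: Probs, p_a2_given_a1_0: Probs) -> float:
    """Assumed identifying expression for the TIE under the same assumptions as (20)."""
    return sum(e_y[1][a2] * (p_a2_given_a1_1[a2] - p_a2_given_a1_0[a2])
               for a2 in p_a2_given_a1_0)


def total_effect(e_y: Means, p_a2_given_a1_1: Probs, p_a2_given_a1_0: Probs) -> float:
    """E[Y | A1 = 1] - E[Y | A1 = 0], the total effect when A1 is randomized."""
    mean_1 = sum(e_y[1][a2] * p for a2, p in p_a2_given_a1_1.items())
    mean_0 = sum(e_y[0][a2] * p for a2, p in p_a2_given_a1_0.items())
    return mean_1 - mean_0


# Hypothetical inputs: A1 = smoking cessation, A2 = hypertension, Y = MI.
e_y = {1: {0: 0.05, 1: 0.15}, 0: {0: 0.10, 1: 0.30}}
p_a2_a1_1 = {0: 0.75, 1: 0.25}
p_a2_a1_0 = {0: 0.50, 1: 0.50}

pde = mediation_formula(e_y, p_a2_a1_0)
tie = total_indirect_effect(e_y, p_a2_a1_1, p_a2_a1_0)
te = total_effect(e_y, p_a2_a1_1, p_a2_a1_0)
print(f"PDE = {pde:.4f}, TIE = {tie:.4f}, total effect = {te:.4f}")
```

With these inputs the PDE and TIE sum to the total effect, as footnote 35 states; whether (20) has a causal interpretation depends on the cross-world assumptions discussed in the text.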

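The Horvitz–Thompson estimator $\mathbb{P}_n\{A_2 Y / \pi^*(L)\}$ of Section 4 can be sketched in a few lines. The simulation design below (the choices of $d$, $\pi^*$, $b^*$ and the sample size) is entirely hypothetical and is not taken from Robins and Ritov (1997); it only shows how the known randomization probabilities enter the estimator. The unweighted mean among the treated is included for comparison and makes no claim about the uniform-consistency results themselves.

```python
import numpy as np


def horvitz_thompson(a2: np.ndarray, y: np.ndarray, pi_star_l: np.ndarray) -> float:
    """Empirical mean of A2 * Y / pi*(L), using the known treatment probabilities."""
    return float(np.mean(a2 * y / pi_star_l))


rng = np.random.default_rng(0)
n, d = 200_000, 2

# L is uniform on the unit cube, so the known density f*(l) is 1 (an assumption).
L = rng.uniform(size=(n, d))

# Hypothetical known randomization probabilities, inside (c, 1 - c) with c = 0.1.
pi_star = 0.2 + 0.6 * L[:, 0]

# Hypothetical outcome regressions b*(l, 1) and b*(l, 0).
b_star_1 = 0.2 + 0.6 * L[:, 0]
b_star_0 = np.full(n, 0.3)

A2 = rng.binomial(1, pi_star)
Y = rng.binomial(1, np.where(A2 == 1, b_star_1, b_star_0))

# mu(b*) = integral of b*(l, 1) f*(l) dl = 0.2 + 0.6 * 0.5 for this design.
mu_true = 0.5

ht = horvitz_thompson(A2, Y, pi_star)
naive = float(Y[A2 == 1].mean())   # ignores pi*; confounded by L in this design

print(f"target mu(b*) = {mu_true:.3f}, Horvitz-Thompson = {ht:.3f}, treated-only mean = {naive:.3f}")
```

In this design the treated-only mean converges to $E[b^* \pi^*]/E[\pi^*] = 0.56$ rather than to $\mu(b^*) = 0.5$, because both the treatment probability and the outcome depend on the first coordinate of $L$.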
In coarsened at random data models, whether where now b(L) = expit m(L; η) + θ/π(L) and { } missing data or causal inference models, locally ef- (η, θ) are obtained by fitting by maximum likelihood b b b b ficient semiparametric estimators are also doubly the logistic regression model Pr(Y = 1 A1 = 1, robust (Scharfstein, Rotnitzky and Robins (1999), b | L,b A2 = 1) = expit m(L; η)+ θ/π(L) to subjects pages 1141–1144) and (Robins and Rotnitzky (2001)). { } with A1 = 1, A2 = 1. Here, m(L; η) is a user-specified See the book (van der Laan and Robins (2003)) function of L and of the Euclideanb parameter η. for details and for many examples of doubly ro- Robins (1999a) and Bang and Robins (2005) ob- bust estimators. Doubly robust estimators had tained plug-in DR regression estimators in longitu- been discovered earlier in special cases. In fact, dinal missing data and causal inference models by Firth and Bennett (1998) note that the so-called reexpressing the g-formula as a sequence of iterated model-assisted regression estimator of a finite popu- conditional expectations. lation mean of Cassel, S¨arndal and Wretman (1976) van der Laan and Rubin (2006) proposed a clever is design consistent which is tantamount to being general method for obtaining plug-in DR estima- doubly robust. See Robins and Rotnitzky (2001) for tors called targeted maximum likelihood. In our other precursors. setting, the method yields an estimator µ In the context of our running example, from Sec- dr,TMLE that differs from µ only in that b(L) is now tion 2.2, suppose (6) holds. An estimator µdr of dr,reg ∗ b µ = E[Y (a , a )] = f (1) for, say a = a =1, is given by expit m(L)+ θgreedy/π(L) where θgreedy 1 2 a1,a2 1 2 { b } b said to be doubly robust (DR) if it is consistentb when is again obtained by maximum likelihood but with b b b b either (i) a model for π(L) Pr(A2 = 1 A1 = 1,L) a fixed offset m(L). 
This offset is an estimator of or (ii) a model for b(L) E≡[Y A = 1,L,A| = 1] is Pr(Y = 1 A = 1,L,A = 1) that might be ob- 1 2 | 1 2 correct. When L is high≡ dimensional| and, as in an tained using flexibleb machine learning methods. observational study, π( ) is unknown, double robust- Similar comments apply to models considered by ness is a desirable property· because model misspec- Bang and Robins (2005). Since 2006 there has been ification is generally unavoidable, even when we use an explosion of research that has produced dou- flexible, high dimensional, semiparametric models in bly robust estimators with much improved large (i) and (ii). In fact, DR estimators have advantages sample efficiency and finite sample performance; even when, as is usually the case, the models in (i) Rotnitzky and Vansteelandt (2014) give a review. and (ii) are both incorrect. This happens because We note that CAR models are not the only models the bias of the DR estimator µdr is of second order, that admit doubly robust estimators. For example, and thus generally less than the bias of a non-DR Scharfstein, Rotnitzky and Robins (1999) exhibited estimator (such as a standardb IPTW estimator). By doubly robust estimators in models with nonignor- second order, we mean that the bias of µdr depends able missingness. Robins and Rotnitzky (2001) de- on the product of the error made in the estimation rived sufficient conditions, satisfied by many non- b of Pr(A2 = 1 A1 = 1,L) times the error made in the CAR models, that imply the existence of doubly ro- estimation of| E[Y A = 1,L,A = 1]. | 1 2 bust estimators. Recently, doubly robust estimators Scharfstein, Rotnitzky and Robins (1999) noted have been obtained in a wide variety of models. See that the locally efficient estimator of Robins, Rot- Dudik et al. (2014) in this volume for an interesting nitzky and Zhao (1994) example. −1 µdr = Pn[A1] { } 6. 
HIGHER ORDER INFLUENCE FUNCTIONS A A e P A 2 Y 2 1 b(L) · n 1π(L) − π(L) −   It may happen that the second-order bias of a b doubly-robust estimator µdr decreases slower to 0 is doubly robust whereb π(L) andb b(L) are estimators with n than n−1/2, and thus the bias exceeds the of π(L) and b(L). Unfortunately, in finite samples standard error of the estimator.b In that case, con- this estimator may failb to lie in theb parameter space fidence intervals for µ based on µdr fail to cover for µ, that is, the interval [0, 1] if Y is binary. In at their nominal rate even in large samples. Fur- response, Scharfstein, Rotnitzky and Robins (1999) thermore, in such a case, in termsb of mean squared proposed a plug-in DR estimator, the doubly robust error, µdr does not optimally trade off bias and regression estimator variance. In an attempt to address these problems, − µ = P [A ] 1P A b(L) , Robinsb et al. (2008) developed a theory of point and dr,reg { n 1 } n{ 1 } b b 20 T. S. RICHARDSON AND A. ROTNITZKY

2 interval estimation based on higher order influence subcube-specific estimates (Yi Yj) /2 over all the functions and use this theory to construct estima- sub-cubes with at least two observations.− The rate tors of µ that improve on µdr. Higher order in- of convergence of the estimator is maximized at fluence functions are higher order U-statistics. The n−(4β/d)/(1+4β/d) by taking k = n2/(1+4β/d).48 theory of Robins et al. (2008)b extends to higher or- Robins et al. (2008) conclude that the random de- der the first order semiparametric inference theory sign estimator has better bias control, and hence of Bickel et al. (1993) and van der Vaart (1991). In converges faster than the optimal equal-spaced this issue, van der Vaart (2014) gives a masterful re- fixed X estimator, because the random design es- 2 2/(1+4β/d) view of this theory. Here, we present an interesting timator exploits the Op(n /n ) random result found in Robins et al. (2008) that can be un- fluctuations for which the X’s corresponding to derstood in isolation from the general theory and two different observations are only a distance of conclude with an open estimation problem. O( n2/(1+4β/d) −1/d) apart. { } Robins et al. (2008) consider the question of 49 whether, for estimation of a conditional variance, An Open Problem random regressors provide for faster rates of con- Consider again the above setting with random X. vergence than do fixed regressors, and, if so, how? Suppose that β/d remains less than 1/4 but now They consider a setting in which n i.i.d. copies of β> 1. Does there still exist an estimator of σ2 that (Y, X) are observed with X a d-dimensional random converges at n−(4β/d)/(1+4β/d)? Analogy with other vector, with bounded density f( ) absolutely con- nonparametric estimation problems would suggest · tinuous w.r.t. the uniform measure on the unit cube the answer is “yes,” but the question remains un- (0, 1)d. 
The regression function b( )= E[Y X = ] solved.50 is assumed to lie in a given H¨older· ball with| H¨older· exponent β< 1.46 The goal is to estimate E[Var Y 7. OTHER WORK X ] under the homoscedastic semiparametric model{ | Var[} Y X]= σ2. Under this model, the authors con- The available space precludes a complete treat- struct| a simple estimator σ2 that converges at rate ment of all of the topics that Robins has worked on. n−(4β/d)/(1+4β/d), when β/d < 1/4. We provide a brief description of selected additional Wang et al. (2008) andb Cai, Levine and Wang topics and a guide to the literature. (2009) earlier proved that if Xi, i = 1,...,n, are Analyzing Observational Studies as Nested d nonrandom but equally spaced in (0, 1) , the min- Randomized Trials imax rate of convergence for the estimation of σ2 is n−2β/d (when β/d < 1/4) which is slower than Hern´an et al. (2008) and Hern´an, Robins and n−(4β/d)/(1+4β/d). Thus, randomness in X allows for Garc´ıa Rodr´ıguez (2005) conceptualize and ana- improved convergence rates even though no smooth- lyze observational studies of a time varying treat- ness assumptions are made regarding f( ). · 48 2 2 To explain how this happens, we describe the Observe that E[(Yi − Yj ) /2 | Xi, Xj ]= σ + {b(Xi) − 2 β b(Xj )} /2, |b(Xi) − b(Xj )| = O(kXi − Xj k ) as β < 1, and estimator of Robins et al. (2008). The unit cube 1/2 −1/d in Rd is divided into k = k(n)= nγ , γ > 1 identi- kXi − Xj k = d O(k ) when Xi and Xj are in the same −1/d subcube. It follows that the estimator has variance of order cal subcubes each with edge length k . A sim- k/n2 and bias of order O(k−2β/d). Variance and the squared ple probability calculation shows that the number bias are equated by solving k/n2 = k−4β/d which gives k = of subcubes containing at least two observations n2/(1+4β/d). 2 2 49 is Op(n /k). One may estimate σ in each such Robins has been trying to find an answer to this question 2 47 2 2 without success for a number of years. 
He suggested that it is subcube by (Yi Yj) /2. An estimator σ of σ − now time for some crowd-sourcing. may then be constructed by simply averaging the 50 b The estimator given above does not attain this rate when β> 1 because it fails to exploit the fact that b(·) is differen- 46A function b(·) lies in the H¨older ball H(β,C) with H¨older tiable. In the interest of simplicity, we have posed this as a exponent β> 0 and radius C> 0, if and only if b(·) is bounded problem in variance estimation. However, Robins et al. (2008) in supremum norm by C and all partial derivatives of b(x) up show that the estimation of the variance is mathematically to order ⌊β⌋ exist, and all partial derivatives of order ⌊β⌋ are isomorphic to the estimation of θ in the semi-parametric re- Lipschitz with exponent (β − ⌊β⌋) and constant C. gression model E[Y | A, X]= θA + h(X), where A is a binary 47If a subcube contains more than two observations, two treatment. In the absence of confounding, θ encodes the causal are selected randomly, without replacement. effect of the treatment. ROBINS’ CAUSAL ETIOLOGY 21 ment as a nested sequence of individual RCTs tri- are no unobserved common causes, the latter explic- als run by nature. Their analysis is closely re- itly allows for this possibility. lated to g-estimation of SNM (discussed in Sec- Robins and Wasserman (1999) and Robins et al. tion 2.6). The critical difference is that in these (2003) pointed out that although these procedures papers Robins and Hern´an do not specify a SNM were consistent they were not uniformly consistent. to coherently link the trial-specific effect estimates. More recent papers (Kalisch and B¨uhlmann (2007); This has benefits in that it makes the analysis eas- Colombo et al. (2012)) recover uniform consistency ier and also more familiar to users without training for these algorithms by imposing additional assump- in SNMs. The downside is that, in principle, this tions. 
Spirtes and Zhang (2014) in this volume ex- lack of coherence can result in different analysts rec- tend this work by developing a variant of the PC Al- ommending, as optimal, contradictory interventions gorithm which is uniformly consistent under weaker (Robins, Hern´an and Rotnitzky 2007). assumptions. Shpitser et al. (2012, 2014), building on Tian and Adjustment for “Reverse Causation” Pearl (2002b) and Robins (1999b) develop a theory Consider an epidemiological study of a time- de- of nested Markov models that relate the structure of pendent treatment (say cigarette smoking) on time a causal DAG to conditional independence relations to a disease of interest, say clinical lung cancer. that arise after re-weighting; see Section 3.4. This In this setting, uncontrolled confounding by unde- theory, in combination with the theory of graphical tected preclinical lung cancer (often referred to as Markov models based on Acyclic Directed Mixed “reverse causation”) is a serious problem. Robins Graphs (Richardson and Spirtes (2002); Richard- (2008) develops analytic methods that may still pro- son (2003); Wermuth (2011); Evans and Richardson vide an unconfounded effect estimate, provided that (2014); Sadeghi and Lauritzen (2014)), will facilitate (i) all subjects with preclinical disease severe enough the construction of more powerful51 causal discov- to affect treatment (i.e., smoking behavior) at a ery algorithms that could (potentially) reveal much given time t will have their disease clinically diag- more information regarding the structure of a DAG nosed within the next x, say 2 years and (ii) based containing hidden variables than algorithms (such on subject matter knowledge an upper bound, for as FCI) that solely use conditional independence. example, 3 years, on x is known. 
Extrapolation and Transportability of Treatment Causal Discovery Effects Spirtes, Glymour and Scheines (1993) and Pearl Quality longitudinal data is often only avail- and Verma (1991) proposed statistical methods that able in high resource settings. An important ques- allowed one to draw causal conclusions from asso- tion is when and how can such data be used to ciational data. These methods assume an underly- inform the choice of treatment strategy in low ing causal DAG (or equivalently an FFRCISTG). If resource settings. To help answer this question, the DAG is incomplete, then such a model imposes Robins, Orellana and Rotnitzky (2008) studied the conditional independence relations on the associ- extrapolation of optimal dynamic treatment strate- ated joint distribution (via d-separation). Spirtes, gies between two HIV infected patient populations. Glymour and Scheines (1993) and Pearl and Verma The authors considered the treatment strategies gx, (1991) made the additional assumption that all con- of the same form as those defined in Section 2.10, ditional independence relations that hold in the dis- namely, “start anti-retroviral therapy the first time tribution of the observables are implied by the un- at which the measured CD4 count falls below x.” derlying causal graph, an assumption termed “sta- Given a utility measure Y , their goal is to find the bility” by Pearl and Verma (1991), and “faithful- regime gxopt that maximizes E[Y (gx)] in the sec- ness” by Spirtes, Glymour and Scheines (1993). ond low-resource population when good longitudi- Under this assumption, the underlying DAG may nal data are available only in the first high-resource be identified up to a (“Markov”) equivalence class. population. 
Spirtes, Glymour and Scheines (1993) proposed two algorithms that recover such a class, entitled “PC” and “FCI.” While the former presupposes that there

51 But still not uniformly consistent!

Due to differences in resources, the frequency of CD4 testing in the first population is much greater than in the second and, furthermore, for logistical and/or financial reasons, the testing frequencies cannot be altered. In this setting, the authors derived conditions under which data from the first population is sufficient to identify $g_{x_{\mathrm{opt}}}$ and construct IPTW estimators of $g_{x_{\mathrm{opt}}}$ under those conditions. A key finding is that owing to the differential rates of testing, a necessary condition for identification is that CD4 testing has no direct causal effect on Y not through anti-retroviral therapy. In this issue, Pearl and Bareinboim (2014) study the related question of transportability between populations using graphical tools.

Interference, Interactions and Quantum Mechanics

Within a counterfactual causal model, Cox (1958) defined there to be interference between treatments if the response of some subject depends not only on their treatment but on that of others as well. On the other hand, VanderWeele and Robins (2009) defined two binary treatments $(a_1, a_2)$ to be causally interacting to cause a binary response Y if for some unit $Y(1,1) \neq Y(1,0) = Y(0,1)$; VanderWeele (2010a) defined the interaction to be epistatic if $Y(1,1) \neq Y(1,0) = Y(0,1) = Y(0,0)$. VanderWeele with his collaborators has developed a very general theory of empirical tests for causal interaction of different types (VanderWeele and Robins (2009); VanderWeele (2010a, 2010b); VanderWeele and Richardson (2012)).

Robins, VanderWeele and Gill (2012) showed, perhaps surprisingly, that this theory could be used to give a simple but novel proof of an important result in quantum mechanics known as Bell’s theorem. The proof was based on two insights: The first was that the consequent of Bell’s theorem could, by using the Neyman causal model, be recast as the statement that there is interference between a certain pair of treatments. The second was to recognize that empirical tests for causal interaction can be reinterpreted as tests for certain forms of interference between treatments, including the form needed to prove Bell’s theorem. VanderWeele et al. (2012) used this latter insight to show that existing empirical tests for causal interactions could be used to test for interference and spillover effects in vaccine trials and in many other settings in which interference and spillover effects may be present. The papers Ogburn and VanderWeele (2014) and VanderWeele, Tchetgen Tchetgen and Halloran (2014) in this issue contain further results on interference and spillover effects.

Multiple Imputation

Wang and Robins (1998) and Robins and Wang (2000) studied the statistical properties of the multiple imputation approach to missing data (Rubin (1987)). They derived a variance estimator that is consistent for the asymptotic variance of a multiple imputation estimator even under misspecification and incompatibility of the imputation and the (complete data) analysis model. They also characterized the large sample bias of the variance estimator proposed by Rubin (1978b).

Posterior Predictive Checks

Robins, van der Vaart and Ventura (2000) studied the asymptotic null distributions of the posterior predictive p-value of Rubin (1984) and Guttman (1967) and of the conditional predictive and partial posterior predictive p-values of Bayarri and Berger (2000). They found the latter two p-values to have an asymptotic uniform distribution; in contrast they found that the posterior predictive p-value could be very conservative, thereby diminishing its power to detect a misspecified model. In response, Robins et al. derived an adjusted version of the posterior predictive p-value that was asymptotically uniform.

Sensitivity Analysis

Understanding that epidemiologists will almost never succeed in collecting data on all covariates needed to fully prevent confounding by unmeasured factors and/or nonignorable missing data, Robins with collaborators Daniel Scharfstein and Andrea Rotnitzky developed methods for conducting sensitivity analyses. See, for example, Scharfstein, Rotnitzky and Robins (1999), Robins, Rotnitzky and Scharfstein (2000) and Robins (2002, pages 319–321). In this issue, Richardson et al. (2014) describe methods for sensitivity analysis and present several applied examples.

Public Health Impact

Finally, we have not discussed the large impact of the methods that Robins introduced on the substantive analysis of longitudinal data in epidemiology and other fields. Many researchers have been involved in transforming Robins’ work on time-varying treatments into increasingly reliable, robust analytic tools and in applying these tools to help answer questions of public health importance.
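As an illustration of the unit-level definitions given under “Interference, Interactions and Quantum Mechanics,” the following minimal sketch enumerates every possible response type of a single unit with two binary treatments and a binary response, and flags those that satisfy the causal-interaction condition Y(1,1) ≠ Y(1,0) = Y(0,1) and the epistatic condition Y(1,1) ≠ Y(1,0) = Y(0,1) = Y(0,0). It assumes the inequalities read as just stated; the names `is_causal_interaction`, `is_epistatic` and `RESPONSE_TYPES` are hypothetical and chosen only for this example.

```python
from itertools import product

# The four treatment combinations (a1, a2) for two binary treatments.
ARMS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def is_causal_interaction(y):
    """y maps (a1, a2) to the unit's potential outcome Y(a1, a2) in {0, 1}.

    Condition as read from the text: Y(1,1) != Y(1,0) = Y(0,1).
    """
    return y[(1, 1)] != y[(1, 0)] and y[(1, 0)] == y[(0, 1)]

def is_epistatic(y):
    """Epistatic interaction: additionally Y(0,1) = Y(0,0)."""
    return is_causal_interaction(y) and y[(0, 1)] == y[(0, 0)]

# All 2^4 = 16 deterministic response types of a single unit.
RESPONSE_TYPES = [dict(zip(ARMS, values)) for values in product((0, 1), repeat=4)]

causal = [y for y in RESPONSE_TYPES if is_causal_interaction(y)]
epistatic = [y for y in RESPONSE_TYPES if is_epistatic(y)]

print(len(RESPONSE_TYPES), len(causal), len(epistatic))  # prints: 16 4 2
```

Under this reading, 4 of the 16 response types exhibit a causal interaction and 2 of those are epistatic. The definitions in the text require only that some unit in the population be of such a type, which is why empirical tests are needed: individual response types are not observed.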

LIST OF ACRONYMS USED

CAR:       Section 5     coarsened at random.
CD4:       Section 2.2   (medical) cell line depleted by HIV.
CDE:       Section 3.1   controlled direct effect.
CMA:       Section 2.3   causal Markov assumption.
DAG:       Section 2.3   directed acyclic graph.
DR:        Section 5     doubly robust.
dSWIG:     Section 2.4   dynamic single-world intervention graph.
FFRCISTG:  Section 2.2   finest fully randomized causally interpreted structured tree graph.
HIV:       Section 2.2   (medical) human immunodeficiency virus.
IPCW:      Section 2.9   inverse probability of censoring weighted.
IPTW:      Section 2.10  inverse probability of treatment weighted.
ITT:       Section 2.9   intention to treat.
MI:        Section 3.3   (medical) myocardial infarction.
MSM:       Section 2.10  marginal structural model.
NPSEM:     Section 2.2   nonparametric structural equation model.
NPSEM-IE:  Section 2.2   nonparametric structural equation model with independent errors.
PDE:       Section 3.3   pure direct effects.
PSDE:      Section 3.2   principal stratum direct effects.
RCT:       Section 2.2   randomized clinical trial.
SNM:       Section 2.6   structural nested model.
SNDM:      Section 2.6   structural nested distribution model.
SNFTM:     Section 2.6   structural nested failure time model.
SNMM:      Section 2.6   structural nested mean model.
SWIG:      Section 2.3   single-world intervention graph.
TIE:       Section 3.3   total indirect effect.

ACKNOWLEDGMENTS

This work was supported by the US National Institutes of Health Grant R01 AI032475.

REFERENCES

Aalen, O. (1978). Nonparametric inference for a family of counting processes. Ann. Statist. 6 701–726. MR0491547
Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York. MR1198884
Aronow, P. M., Green, D. P. and Lee, D. K. K. (2014). Sharp bounds on the variance in randomized experiments. Ann. Statist. 42 850–871. MR3210989
Balke, A. and Pearl, J. (1994). Probabilistic evaluation of counterfactual queries. In Proceedings of the 12th Conference on Artificial Intelligence 1 230–237. MIT Press, Menlo Park, CA.
Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61 962–972. MR2216189
Bayarri, M. J. and Berger, J. O. (2000). p values for composite null models. J. Amer. Statist. Assoc. 95 1127–1142, 1157–1170. MR1804239
Bellman, R. (1957). Dynamic Programming. Princeton Univ. Press, Princeton, NJ. MR0090477
Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins Univ. Press, Baltimore, MD. MR1245941
Blalock, H. M., ed. (1971). Causal Models in the Social Sciences. Aldine Publishing, Chicago, IL.
Cai, T. T., Levine, M. and Wang, L. (2009). Variance function estimation in multivariate nonparametric regression with fixed design. J. Multivariate Anal. 100 126–136. MR2460482
Cassel, C. M., Särndal, C. E. and Wretman, J. H. (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika 63 615–620. MR0445666
Cator, E. A. (2004). On the testability of the CAR assumption. Ann. Statist. 32 1957–1980. MR2102499
Colombo, D., Maathuis, M. H., Kalisch, M. and Richardson, T. S. (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann. Statist. 40 294–321. MR3014308
Cox, D. R. (1958). Planning of Experiments. Wiley, New York. MR0095561
Cox, D. R. (1972). Regression models and life-tables. J. R. Stat. Soc. Ser. B Stat. Methodol. 34 187–220. MR0341758
Cox, D. R. and Wermuth, N. (1999). Likelihood factorizations for mixed discrete and continuous variables. Scand. J. Stat. 26 209–220. MR1707595
Dudik, M., Erhan, D., Langford, J. and Li, L. (2014). Doubly robust policy evaluation and learning. Statist. Sci. 29 485–511.
Efron, B. and Hinkley, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika 65 457–487. MR0521817
Evans, R. J. and Richardson, T. S. (2014). Markovian acyclic directed mixed graphs for discrete data. Ann. Statist. 42 1452–1482. MR3262457
Firth, D. and Bennett, K. E. (1998). Robust models in probability sampling. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 3–21. MR1625672
Fleming, T. R. and Harrington, D. P. (1991). Counting Processes and Survival Analysis. Wiley, New York. MR1100924
Frangakis, C. E. and Rubin, D. B. (1999). Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes. Biometrika 86 365–379. MR1705410
Frangakis, C. E. and Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics 58 21–29. MR1891039

Freedman, D. A. (2006). Statistical models for causation: What inferential leverage do they provide? Eval. Rev. 30 691–713.
Gilbert, E. S. (1982). Some confounding factors in the study of mortality and occupational exposures. American J. Epidemiology 116 177–188.
Gilbert, P. B., Bosch, R. J. and Hudgens, M. G. (2003). Sensitivity analysis for the assessment of causal vaccine effects on viral load in HIV vaccine trials. Biometrics 59 531–541. MR2004258
Gill, R. D. (2014). Statistics, causality and Bell’s theorem. Statist. Sci. 29 512–528.
Gill, R. D. and Robins, J. M. (2001). Causal inference for complex longitudinal data: The continuous case. Ann. Statist. 29 1785–1811. MR1891746
Gill, R. D., van der Laan, M. J. and Robins, J. M. (1997). Coarsening at random: Characterizations, conjectures, counter-examples. In Survival Analysis. Proceedings of the First Seattle Symposium in Biostatistics. Lecture Notes in Statistics 123 255–294. Springer, New York.
Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. J. R. Stat. Soc. Ser. B Stat. Methodol. 29 83–100. MR0216699
Heitjan, D. F. and Rubin, D. B. (1991). Ignorability and coarse data. Ann. Statist. 19 2244–2253. MR1135174
Hernán, M. A., Robins, J. M. and García Rodríguez, L. A. (2005). Discussion on “Statistical issues in the women’s health initiative.” Biometrics 61 922–930. MR2216183
Hernán, M. A., Lanoy, E., Costagliola, D. and Robins, J. M. (2006). Comparison of dynamic treatment regimes via inverse probability weighting. Basic & Clinical Pharmacology & Toxicology 98 237–242.
Hernán, M. A., Alonso, A., Logan, R., Grodstein, F., Michels, K. B., Stampfer, M. J., Willett, W. C., Manson, J. E. and Robins, J. M. (2008). Observational studies analyzed like randomized experiments: An application to postmenopausal hormone therapy and coronary heart disease. Epidemiology 19 766.
Huang, Y. and Valtorta, M. (2006). Pearl’s calculus of interventions is complete. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI-06). AUAI Press, Corvallis, OR.
Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York. MR0570114
Kalisch, M. and Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res. 8 613–636.
Keiding, N. and Clayton, D. (2014). Standardization and control for confounding in observational studies: A historical perspective. Statist. Sci. 29 529–580.
Manski, C. (1990). Non-parametric bounds on treatment effects. American Economic Review 80 351–374.
Miettinen, O. S. and Cook, E. F. (1981). Confounding: Essence and detection. American J. Epidemiology 114 593–603.
Mohan, K., Pearl, J. and Tian, J. (2013). Graphical models for inference with missing data. In Advances in Neural Information Processing Systems 26 1277–1285.
Moore, K. L. and van der Laan, M. J. (2009). Covariate adjustment in randomized trials with binary outcomes: Targeted maximum likelihood estimation. Stat. Med. 28 39–64. MR2655550
Murphy, S. A. (2003). Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol. 65 331–366. MR1983752
Neyman, J. (1923). Sur les applications de la théorie des probabilités aux experiences agricoles: Essai des principes. Roczniki Nauk Rolniczych X 1–51. In Polish. English translation by D. Dabrowska and T. Speed in Statist. Sci. 5 (1990) 463–472.
Ogburn, E. L. and VanderWeele, T. J. (2014). Causal diagrams for interference. Statist. Sci. 29 559–578.
Orellana, L., Rotnitzky, A. and Robins, J. M. (2010a). Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes. Part I: Main content. Int. J. Biostat. 6 Art. 8, 49. MR2602551
Orellana, L., Rotnitzky, A. and Robins, J. M. (2010b). Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes. Part II: Proofs of results. Int. J. Biostat. 6 Art. 9, 19. MR2602552
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA. MR0965765
Pearl, J. (1995a). Causal diagrams for empirical research. Biometrika 82 669–710. With discussion and a rejoinder by the author. MR1380809
Pearl, J. (1995b). On the testability of causal models with latent and instrumental variables. In Proceedings of the 11th Annual Conference on Uncertainty in Artificial Intelligence 435–443. Morgan Kaufmann, San Francisco, CA. MR1615027
Pearl, J. (2000). Causality. Models, Reasoning, and Inference. Cambridge Univ. Press, Cambridge. MR1744773
Pearl, J. (2001). Direct and indirect effects. In Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence 411–442. Morgan Kaufmann, San Francisco, CA.
Pearl, J. (2012). Eight myths about causality and structural equation models. Technical Report R-393, Computer Science Dept., UCLA.
Pearl, J. and Bareinboim, E. (2014). External validity: From do-calculus to transportability across populations. Statist. Sci. 29 579–595.
Pearl, J. and Verma, T. S. (1991). A theory of inferred causation. In Principles of Knowledge Representation and Reasoning (Cambridge, MA, 1991). Morgan Kaufmann Ser. Represent. Reason. 441–452. Morgan Kaufmann, San Mateo, CA. MR1142173
Picciotto, S., Hernán, M. A., Page, J. H., Young, J. G. and Robins, J. M. (2012). Structural nested cumulative failure time models to estimate the effects of interventions. J. Amer. Statist. Assoc. 107 886–900. MR3010878
Quale, C. M., van der Laan, M. J. and Robins, J. R. (2006). Locally efficient estimation with bivariate right-censored data. J. Amer. Statist. Assoc. 101 1076–1084. MR2324147

Richardson, T. S. (2003). Markov properties for acyclic directed mixed graphs. Scand. J. Stat. 30 145–157. MR1963898
Richardson, T. S. and Robins, J. M. (2013). Single World Intervention Graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Technical Report 128, Center for Statistics and the Social Sciences, Univ. Washington, Seattle, WA.
Richardson, T. S. and Spirtes, P. (2002). Ancestral graph Markov models. Ann. Statist. 30 962–1030. MR1926166
Richardson, A., Hudgens, M. G., Gilbert, P. B. and Fine, J. P. (2014). Nonparametric bounds and sensitivity analysis of treatment effects. Statist. Sci. 29 596–618.
Ritov, Y., Bickel, P. J., Gamst, A. C. and Kleijn, B. J. K. (2014). The Bayesian analysis of complex, high-dimensional models: Can it be CODA? Statist. Sci. 29 619–639.
Robins, J. M. (1986). A new approach to causal inference in mortality studies with a sustained exposure period—Application to control of the healthy worker survivor effect. Mathematical models in medicine: Diseases and epidemics. Part 2. Math. Modelling 7 1393–1512. MR0877758
Robins, J. M. (1987a). A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods. J. Chronic. Dis. 40 Suppl 2 139S–161S.
Robins, J. M. (1987b). Addendum to: “A new approach to causal inference in mortality studies with a sustained exposure period—Application to control of the healthy worker survivor effect.” Comput. Math. Appl. 14 923–945. MR0922792
Robins, J. M. (1989). The analysis of randomized and non-randomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In Health Service Research Methodology: A Focus on AIDS (L. Sechrest, H. Freeman and A. Mulley, eds.). U.S. Public Health Service, Washington, DC.
Robins, J. M. (1992). Estimation of the time-dependent accelerated failure time model in the presence of confounding factors. Biometrika 79 321–334. MR1185134
Robins, J. M. (1993). Analytic methods for estimating HIV-treatment and cofactor effects. In Methodological Issues in AIDS Behavioral Research 213–288. Springer, New York.
Robins, J. M. (1994). Correcting for non-compliance in randomized trials using structural nested mean models. Comm. Statist. Theory Methods 23 2379–2412. MR1293185
Robins, J. M. (1997a). Causal inference from complex longitudinal data. In Latent Variable Modeling and Applications to Causality (Los Angeles, CA, 1994). Lecture Notes in Statist. 120 69–117. Springer, New York. MR1601279
Robins, J. M. (1997b). Structural nested failure time models. In Survival Analysis (P. K. Andersen and N. Keiding, Section eds.), Encyclopedia of Biostatistics (P. Armitage and T. Colton, eds.) 4372–4389. Wiley, New York.
Robins, J. M. (1998). Marginal structural models. In 1997 ASA Proceedings of the Section on Bayesian Statistical Science 1–10. Amer. Statist. Assoc., Alexandria, VA.
Robins, J. M. (1999a). Robust estimation in sequentially ignorable missing data and causal inference models. In ASA Proceedings of the Section on Bayesian Statistical Science 6–10. Amer. Statist. Assoc., Alexandria, VA.
Robins, J. M. (1999b). Testing and estimation of direct effects by reparameterizing directed acyclic graphs with structural nested models. In Computation, Causation, and Discovery (C. Glymour and G. Cooper, eds.) 349–405. AAAI Press, Menlo Park, CA. MR1696459
Robins, J. M. (2000). Marginal structural models versus structural nested models as tools for causal inference. In Statistical Models in Epidemiology, the Environment, and Clinical Trials (Minneapolis, MN, 1997). IMA Vol. Math. Appl. 116 95–133. Springer, New York. MR1731682
Robins, J. M. (2002). Comment on “Covariance adjustment in randomized experiments and observational studies” by P. R. Rosenbaum. Statist. Sci. 17 309–321.
Robins, J. M. (2004). Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics. Lecture Notes in Statist. 179 189–326. Springer, New York. MR2129402
Robins, J. M. (2008). Causal models for estimating the effects of weight gain on mortality. Int. J. Obes. (Lond.) 32 Suppl 3 S15–S41.
Robins, J. M. and Greenland, S. (1989a). Estimability and estimation of excess and etiologic fractions. Stat. Med. 8 845–859.
Robins, J. M. and Greenland, S. (1989b). The probability of causation under a stochastic model for individual risk. Biometrics 45 1125–1138. MR1040629
Robins, J. M. and Greenland, S. (1992). Identifiability and exchangeability for direct and indirect effects. Epidemiology 3 143–155.
Robins, J. M., Hernán, M. A. and Rotnitzky, A. (2007). Invited commentary: Effect modification by time-varying covariates. American J. Epidemiology 166 994–1002.
Robins, J. M. and Morgenstern, H. (1987). The foundations of confounding in epidemiology. Comput. Math. Appl. 14 869–916. MR0922790
Robins, J. M., Orellana, L. and Rotnitzky, A. (2008). Estimation and extrapolation of optimal treatment and testing strategies. Stat. Med. 27 4678–4721. MR2528576
Robins, J. M. and Richardson, T. S. (2011). Alternative graphical causal models and the identification of direct effects. In Causality and Psychopathology: Finding the Determinants of Disorders and Their Cures (P. Shrout, K. Keyes and K. Ornstein, eds.) 1–52. Oxford Univ. Press, Oxford.
Robins, J. M. and Ritov, Y. (1997). Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Stat. Med. 16 285–319.
Robins, J. M. and Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In AIDS Epidemiology (N. P. Jewell, K. Dietz and V. T. Farewell, eds.) 297–331. Birkhäuser, Boston, MA.
Robins, J. M. and Rotnitzky, A. (2001). Comment on “Inference for semiparametric models: Some questions and an answer,” by P. Bickel. Statist. Sinica 11 920–936.
Robins, J. M., Rotnitzky, A. and Scharfstein, D. O. (2000). Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In Statistical Models in Epidemiology, the Environment, and Clinical Trials (Minneapolis, MN, 1997). IMA Vol. Math. Appl. 116 1–94. Springer, New York. MR1731681
Robins, J. M., Rotnitzky, A. and Vansteelandt, S. (2007). Discussion of “Principal stratification designs to estimate input data missing due to death” by C. E. Frangakis, D. B. Rubin, M.-W. An and E. MacKenzie. MR2395697. Biometrics 63 650–653. MR2395698
Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89 846–866. MR1294730
Robins, J. M., VanderWeele, T. J. and Gill, R. D. (2012). A proof of Bell’s inequality in quantum mechanics using causal interactions. Available at arXiv:1207.4913.
Robins, J. M., van der Vaart, A. and Ventura, V. (2000). Asymptotic distribution of p values in composite null models. J. Amer. Statist. Assoc. 95 1143–1167, 1171–1172. MR1804240
Robins, J. M. and Wang, N. (2000). Inference for imputation estimators. Biometrika 87 113–124. MR1766832
Robins, J. M. and Wasserman, L. (1997). Estimation of effects of sequential treatments by reparameterizing directed acyclic graphs. In Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence 309–420. Morgan Kaufmann, San Francisco, CA.
Robins, J. M. and Wasserman, L. (1999). On the impossibility of inferring causation from association without background knowledge. In Computation, Causation, and Discovery (C. Glymour and G. Cooper, eds.) 305–321. MIT Press, Cambridge, MA.
Robins, J. M. and Wasserman, L. (2000). Conditioning, likelihood, and coherence: A review of some foundational concepts. J. Amer. Statist. Assoc. 95 1340–1346. MR1825290
Robins, J. M., Blevins, D., Ritter, G. and Wulfsohn, M. (1992). G-estimation of the effect of prophylaxis therapy for pneumocystis carinii pneumonia on the survival of AIDS patients. Epidemiology 3 319–336.
Robins, J. M., Scheines, R., Spirtes, P. and Wasserman, L. (2003). Uniform consistency in causal inference. Biometrika 90 491–515. MR2006831
Robins, J. M., Li, L., Tchetgen, E. and van der Vaart, A. (2008). Higher order influence functions and minimax estimation of nonlinear functionals. In Probability and Statistics: Essays in Honor of David A. Freedman. Inst. Math. Stat. Collect. 2 (D. Nolan and T. Speed, eds.) 335–421. IMS, Beachwood, OH. MR2459958
Rotnitzky, A. and Robins, J. M. (1995). Semiparametric regression estimation in the presence of dependent censoring. Biometrika 82 805–820. MR1380816
Rotnitzky, A. and Vansteelandt, S. (2014). Double-robust methods. In Handbook of Missing Data Methodology (G. Fitzmaurice, M. Kenward, G. Molenberghs, A. Tsiatis and G. Verbeke, eds.). Chapman & Hall/CRC Press, Boca Raton, FL.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and non-randomized studies. J. Educational Psychology 66 688–701.
Rubin, D. B. (1978a). Bayesian inference for causal effects: The role of randomization. Ann. Statist. 6 34–58. MR0472152
Rubin, D. B. (1978b). Multiple imputations in sample surveys: A phenomenological Bayesian approach to nonresponse (C/R: P29–34). In ASA Proceedings of the Section on Survey Research Methods 20–28. Amer. Statist. Assoc., Alexandria, VA.
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Statist. 12 1151–1172. MR0760681
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley, New York. MR0899519
Rubin, D. B. (1998). More powerful randomization-based p-values in double-blind trials with non-compliance. Stat. Med. 17 371–385; discussion 387–389.
Rubin, D. B. (2004). Direct and indirect causal effects via potential outcomes. Scand. J. Stat. 31 161–170. MR2066246
Sadeghi, K. and Lauritzen, S. (2014). Markov properties for mixed graphs. Bernoulli 20 676–696. MR3178514
Scharfstein, D. O., Rotnitzky, A. and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. J. Amer. Statist. Assoc. 94 1096–1146. MR1731478
Schulte, P. J., Tsiatis, A. A., Laber, E. B. and Davidian, M. (2014). Q- and A-learning methods for estimating optimal dynamic treatment regimes. Statist. Sci. 29 640–661.
Sekhon, J. S. (2008). The Neyman–Rubin model of causal inference and estimation via matching methods. In The Oxford Handbook of Political Methodology (J. M. Box-Steffensmeier, H. E. Brady and D. Collier, eds.) 271–299. Oxford Handbooks Online, Oxford.
Shpitser, I. and Pearl, J. (2006). Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the 21st National Conference on Artificial Intelligence 1219–1226. AAAI Press, Menlo Park, CA.
Shpitser, I., Richardson, T. S., Robins, J. M. and Evans, R. J. (2012). Parameter and structure learning in nested Markov models. In Causal Structure Learning Workshop of the 28th Conference on Uncertainty in Artificial Intelligence (UAI-12).
Shpitser, I., Evans, R. J., Richardson, T. S. and Robins, J. M. (2014). Introduction to nested Markov models. Behaviormetrika 41 3–39.
Spirtes, P., Glymour, C. and Scheines, R. (1993). Causation, Prediction, and Search. Lecture Notes in Statistics 81. Springer, New York. MR1227558
Spirtes, P. and Zhang, J. (2014). A uniformly consistent estimator of causal effects under the k-triangle-faithfulness assumption. Statist. Sci. 29 662–678.
Tian, J. (2008). Identifying dynamic sequential plans. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI-08) 554–561. AUAI Press, Corvallis, OR.
Tian, J. and Pearl, J. (2002a). A general identification condition for causal effects. In Proceedings of the 18th National Conference on Artificial Intelligence 567–573. AUAI Press, Menlo Park, CA.

Tian, J. and Pearl, J. (2002b). On the testable implications of causal models with hidden variables. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI-02) 519–527. Morgan Kaufmann, San Francisco, CA.
Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer, New York. MR2233926
Tsiatis, A. A., Davidian, M., Zhang, M. and Lu, X. (2008). Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Stat. Med. 27 4658–4677. MR2528575
VanderWeele, T. J. (2010a). Epistatic interactions. Stat. Appl. Genet. Mol. Biol. 9 Art. 1, 24. MR2594940
VanderWeele, T. J. (2010b). Sufficient cause interactions for categorical and ordinal exposures with three levels. Biometrika 97 647–659. MR2672489
VanderWeele, T. J. and Richardson, T. S. (2012). General theory for interactions in sufficient cause models with dichotomous exposures. Ann. Statist. 40 2128–2161. MR3059079
VanderWeele, T. J. and Robins, J. M. (2009). Minimal sufficient causation and directed acyclic graphs. Ann. Statist. 37 1437–1465. MR2509079
VanderWeele, T. J. and Shpitser, I. (2013). On the definition of a confounder. Ann. Statist. 41 196–220. MR3059415
VanderWeele, T. J., Tchetgen Tchetgen, E. J. and Halloran, M. E. (2014). Interference and sensitivity analysis. Statist. Sci. 29 687–706.
VanderWeele, T. J., Vandenbroucke, J. P., Tchetgen Tchetgen, E. J. and Robins, J. M. (2012). A mapping between interactions and interference: Implications for vaccine trials. Epidemiology 23 285–292.
Vansteelandt, S. and Joffe, M. (2014). Structural nested models and G-estimation: The partially realized promise. Statist. Sci. 29 707–731.
van der Laan, M. J., Hubbard, A. E. and Robins, J. M. (2002). Locally efficient estimation of a multivariate survival function in longitudinal studies. J. Amer. Statist. Assoc. 97 494–507. MR1941466
van der Laan, M. J. and Petersen, M. L. (2007). Causal effect models for realistic individualized treatment and intention to treat rules. Int. J. Biostat. 3 Art. 3, 54. MR2306841
van der Laan, M. J. and Robins, J. M. (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer, New York. MR1958123
van der Laan, M. J. and Rose, S. (2011). Targeted Learning. Causal Inference for Observational and Experimental Data. Springer, New York. MR2867111
van der Laan, M. J. and Rubin, D. (2006). Targeted maximum likelihood learning. Int. J. Biostat. 2 Art. 11, 40. MR2306500
van der Vaart, A. (1991). On differentiable functionals. Ann. Statist. 19 178–204. MR1091845
van der Vaart, A. (2014). Higher order tangent spaces and influence functions. Statist. Sci. 29 679–686.
Verma, T. and Pearl, J. (1990). Equivalence and synthesis of causal models. In Proceedings of the 6th Annual Conference on Uncertainty in Artificial Intelligence (UAI-90) 220–227. Elsevier, New York.
Wang, N. and Robins, J. M. (1998). Large-sample theory for parametric multiple imputation procedures. Biometrika 85 935–948. MR1666715
Wang, L., Brown, L. D., Cai, T. T. and Levine, M. (2008). Effect of mean on variance function estimation in nonparametric regression. Ann. Statist. 36 646–664. MR2396810
Wermuth, N. (2011). Probability distributions with summary graph structure. Bernoulli 17 845–879. MR2817608