Quick viewing(Text Mode)

External Validity: from Do-Calculus to Transportability Across Populations

External Validity: from Do-Calculus to Transportability Across Populations

arXiv:1503.01603v1 [stat.ME] 5 Mar 2015 ilcm oasadtl.Hwvr h atthat fact the However, standstill. progress a scientific to and come transferred will be environment study can the nothing from different drastically or of that from aspects envi- many laboratory. an the in in differs elsewhere, that applied ronment and laboratory transported the be in obtained conclusions that requires ue er n la Bareinboim Elias and Pearl to PopulationsJudea Do-Calculus Across Transportability From Validity: External 04 o.2,N.4 579–595 4, DOI: No. 29, Vol. 2014, Science Statistical fClfri,LsAgls aiona90519,USA 90095-1596, California e-mail: Angeles, Los California, University of Department, Science Computer Candidate,

ue er sPoesradEisBribi sPh.D. is Bareinboim Elias and Professor is Pearl Judea c ttsia Science article Statistical original the the of by reprint published electronic an is This ern iesfo h rgnli aiainand pagination in detail. original typographic the from differs reprint lal,i h agtevrneti arbitrary, is environment target the if Clearly, generalization and generalization, about is Science nttt fMteaia Statistics Mathematical of Institute 10.1214/14-STS486 .ITOUTO:TRASVS. THREATS INTRODUCTION: 1. [email protected] nttt fMteaia Statistics Mathematical of Institute ASSUMPTIONS , ffcs xenlvalidity. external effects, transport. phrases: bias-free and ensure words to Key combined populat be two the can experime from they what obtained how be identify b W need procedures can population. findings observational the study population affirmative, the target in is the answer findings in experimental effects from causal ferred prior whether deciding, data, for do-calcany procedures the graph-based in yields derivations red duction symbolic we representation, to this transportability using of and, betw interest commonalities of diagr and ulations differences “selection about called cond knowledge representation pressing be formal can studies a observational introduce only stud experimental which in in learned population, effects a causal treats transfer “transportability,” to paper called cense This generalizability, val explorations. of “external scientific problem called most often in populations, sential or settings ments, Abstract. 04 o.2,N.4 579–595 4, No. 29, Vol. 2014, ; [email protected] 2014 , h eeaiaiiyo miia nig onwenviron- new to findings empirical of generalizability The . xeietldsg,gnrlzblt,causal generalizability, design, Experimental This . in 1 htcnb civdi rcie htpolm are problems what practice, in achieved on be limits can of what form the in guidance theoretical provides sim- by observational ulation. primarily evaluated and is performance dis- experimental regimes; explicit between make rarely tinction hi- and in modeling) pooling inverse-varianceerarchical partial meta-analysis, (e.g., in stan- procedures reweighting by together statistical pooled are dard studies diverse of ( sults Hill and (Gelman models chical ( ( Olkin (Glass and analysis Hedges meta of to are methods related by studies handled diverse problems from generalizing practice, and combining statistical In treat- formal ment. systematic received not have transport of transport ramifications. the their or justify results to experimental to environment similar study sufficiently usually the environment we target that the means deem elsewhere ap- results of intention the the plying with conducted are studies most osplmn hs ehdlge,orpaper our methodologies, these supplement To such permit that conditions the Remarkably, enda li- a as defined 1985 ls hsre- This ulus. c questions uce m”frex- for ams” oobserving to dt, ses- is idity,” e oanew a to ies ce.We ucted. particular ;Oe ( Owen ); os and ions, e pop- een tland ntal e the hen in- e 2007 2009 ) nwihre- which in )), ) rhierar- or )), 1976 ); 2 J. PEARL AND E. BAREINBOIM likely to be encountered when populations differ sig- language within which the notion “environment” (or nificantly from each other, what population differ- “population”) is given precise characterization, and ences can be circumvented by clever design and what differences among environments can be encoded and differences constitute theoretical impediments, pro- analyzed. hibiting generalization by any means whatsoever. The advent of causal diagrams (Wright (1921); On the theoretical front, the standard literature Heise (1975); Davis (1984); Verma and Pearl (1988); on this topic, falling under rubrics such as “exter- Spirtes, Glymour and Scheines (1993); Pearl (1995)) nal validity” (Campbell and Stanley (1963), Man- together with models of interventions (Haavelmo ski (2007)), “heterogeneity” (H¨ofler, Gloster and (1943); Strotz and Wold (1960)) and counterfactu- Hoyer (2010)), “quasi-” (Shadish, Cook als (Neyman (1923); Rubin (1974); Robins (1986); and Campbell (2002), Chapter 3; Adelman (1991)),1 Balke and Pearl (1995)) provides such a language consists primarily of “threats,” namely, explanations and renders the formalization of transportability of what may go wrong when we try to transport re- possible. sults from one study to another while ignoring their Armed with this language, this paper departs differences. Rarely do we find an analysis of “licens- from the tradition of communicating “threats” and ing assumptions,” namely, formal conditions under embarks instead on the task of formulating “licenses which the transport of results across differing envi- to transport,” namely, assumptions that, if they held ronments or populations is licensed from first prin- true, would permit us to transport results across ciples.2 studies. The reasons for this asymmetry are several. First, In addition, the paper uses the inferential ma- threats are safer to cite than assumptions. He chinery of the do-calculus (Pearl (1995); Koller and who cites “threats” appears prudent, cautious and Friedman (2009); Huang and Valtorta (2006); Sh- thoughtful, whereas he who seeks licensing assump- pitser and Pearl (2006)) to derive algorithms for de- tions risks suspicions of attempting to endorse those ciding whether transportability is feasible and how assumptions. experimental and observational findings can be com- Second, assumptions are self-destructive in their bined to yield unbiased estimates of causal effects in honesty. The more explicit the assumption, the more the target population. criticism it invites, for it tends to trigger a richer The paper is organized as follows. In Section 2, space of alternative scenarios in which the assump- we review the foundations of structural equations tion may fail. Researchers prefer therefore to declare modeling (SEM), the question of identifiability and threats in public and make assumptions in private. the do-calculus that emerges from these founda- Third, whereas threats can be communicated in tions. (This section can be skipped by readers fa- plain English, supported by anecdotal pointers to miliar with these concepts and tools.) In Section 3, familiar experiences, assumptions require a formal we motivate the question of transportability through simple examples, and illustrate how the solution de- 1Manski (2007) defines “” as follows: “An pends on the causal story behind the problem. In is said to have “external validity” if the distribu- Section 4, we formally define the notion of trans- tion of outcomes realized by a treatment group is the same as portability and reduce it to a problem of symbolic the distribution of outcome that would be realized in an ac- transformations in do-calculus. In Section 5, we pro- tual program.” Campbell and Stanley (1963), page 5, take a vide a graphical criterion for deciding transporta- slightly broader view: ““External validity” asks the question of generalizability: to what populations, settings, treatment bility and estimating transported causal effects. We variables, and measurement variables can this effect be gen- conclude in Section 6 with brief discussions of re- eralized?” lated problems of external validity, these include sta- 2Hern´an and VanderWeele (2011) studied such conditions tistical transportability, and meta-analysis. in the context of compound treatments, where we seek to predict the effect of one version of a treatment from experi- ments with a different version. Their analysis is a special case 2. PRELIMINARIES: THE LOGICAL of the theory developed in this paper (Petersen (2011)). A FOUNDATIONS OF CAUSAL INFERENCE related application is reported in Robins, Orellana and Rot- nitzky (2008) where a treatment strategy is extrapolated be- The tools presented in this paper were developed tween two biological similar populations under different ob- in the context of nonparametric Structural Equa- servational regimes. tions Models (SEM), which is one among several EXTERNAL VALIDITY: FROM DO-CALCULUS TO TRANSPORTABILITY 3 approaches to causal inference, and goes back to In principle, each query Qi ∈ Q should be “well (Haavelmo (1943); Strotz and Wold (1960)). Other defined,” that is, computable from any fully approaches include, for example, potential-outcomes specified model M compatible with A. (See Def- (Rubin (1974)), Structured Tree Graphs (Robins inition 1 for formal characterization of a model, (1986)), decision analytic (Dawid (2002)), Causal and also Section 2.4 for the problem of identifi- Bayesian Networks (Spirtes, Glymour and Scheines cation in partially specified models.) (2000); Pearl (2000), Chapter 1; Bareinboim, Brito I-3. A set D of experimental or non-experimental and Pearl (2012)), and Settable Systems (White and data, governed by a joint probability distribu- Chalak (2009)). We will first describe the generic tion presumably consistent with A. features common to all such approaches, and then summarize how these features are represented in The outputs are: 3 SEM. O-1. A set A∗ of statements which are the logical 2.1 Causal Models as Inference Engines implications of A, separate from the data at hand. For example, that X has no effect on Y if From a logical viewpoint, causal analysis relies on we hold Z constant, or that Z is an instrument causal assumptions that cannot be deduced from relative to {X, Y }. (nonexperimental) data. Thus, every approach to O-2. A set C of data-dependent claims concern- causal inference must provide a systematic way of ing the magnitudes or likelihoods of the tar- encoding, testing and combining these assumptions get queries in Q, each contingent on A. C may with data. Accordingly, we view causal modeling as contain, for example, the estimated mean and an inference engine that takes three inputs and pro- variance of a given structural parameter, or the duces three outputs. The inputs are: expected effect of a given intervention. Auxil- I-1. A set A of qualitative causal assumptions which iary to C, a causal model should also yield an the investigator is prepared to defend on sci- estimand Qi(P ) for each query in Q, or a de- entific grounds, and a model MA that encodes termination that Qi is not identifiable from P these assumptions mathematically. (In SEM, (Definition 2). MA takes the form of a diagram or a set of un- O-3. A list T of testable statistical implications specified functions. A typical assumption is that of A (which may or may not be part of O- no direct effect exists between a pair of vari- 2), and the degree g(Ti), Ti ∈ T , to which the ables (known as exclusion restriction), or that data agrees with each of those implications. an omitted factor, represented by an error term, A typical implication would be a conditional is independent of other such factors observed or independence assertion, or an equality con- unobserved, known as well as unknown. straint between two probabilistic expressions. I-2. A set Q of queries concerning causal or coun- Testable constraints should be read from the terfactual relationships among variables of in- model MA (see Definition 3), and used to con- terest. In linear SEM, Q concerned the magni- firm or disconfirm the model against the data. tudes of structural coefficients but, in general, Q may address causal relations directly, for ex- The structure of this inferential exercise is shown ample: schematically in Figure 1. For a comprehensive re- view on methodological issues, see Pearl (2009a, Q1: What is the effect of treatment X on out- come Y ? 2012a). Q2: Is this employer practicing gender discrim- 2.2 Assumptions in Nonparametric Models ination? A structural equation model (SEM) M is defined 3 We use the acronym SEM for both parametric and non- as follows. parametric representations though, historically, SEM practi- tioners preferred the former (Bollen and Pearl (2013)). Pearl Definition 1 (Structural equation model (Pearl (2011) has used the term Structural Causal Models (SCM) (2000), page 203)). to eliminate this confusion. While comparisons of the various approaches lie beyond the scope of this paper, we nevertheless 1. A set U of background or exogenous variables, propose that their merits be judged by the extent to which representing factors outside the model, which each facilitates the functions described below. nevertheless affect relationships within the model. 4 J. PEARL AND E. BAREINBOIM

Fig. 1. Causal analysis depicted as an inference engine converting assumptions (A), queries (Q), and data (D) into logical implications (A∗), conditional claims (C), and data-fitness indices (g(T )).

2. A set V = {V1,...,Vn} of endogenous variables, represents a causal process (or mechanism) that de- assumed to be observable. Each of these variables termines the value of the left variable (output) from is functionally dependent on some subset P Ai of the values on the right variables (inputs), and is as- U ∪ V . sumed to be invariant unless explicitly intervened 3. A set F of functions {f1,...,fn} such that each on. The absence of a variable from the right-hand fi determines the value of Vi ∈ V , vi = fi(pai, u). side of an equation encodes the assumption that na- 4. A joint probability distribution P (u) over U. ture ignores that variable in the process of determin- ing the value of the output variable. For example, A simple SEM model is depicted in Figure 2(a), the absence of variable Z from the arguments of fY which represents the following three functions: conveys the empirical claim that variations in Z will z = fZ(uZ ), leave Y unchanged, as long as variables UY and X remain constant. (2.1) x = fX (z, uX ), It is important to distinguish between a fully spec- y = fY (x, uY ), ified model in which P (U) and the collection of func- tions F are specified and a partially specified model, where in this particular example, UZ , UX and UY usually in the form of a diagram. The former entails are assumed to be jointly independent but otherwise one and only one observational distribution P (V ); arbitrarily distributed. Whenever dependence exists the latter entails a set of observational distributions between any two exogenous variables, a bidirected P (V ) that are compatible with the graph (those that arrow will be added to the diagram to represent this can be generated by specifying hF, P (u)i). dependence (e.g., Figure 4).4 Each of these functions 2.3 Representing Interventions, Counterfactuals and Causal Effects This feature of invariance permits us to derive powerful claims about causal effects and counter- factuals, even in nonparametric models, where all functions and distributions remain unknown. This is Fig. 2. The diagrams associated with (a) the structural done through a mathematical operator called do(x), model of equation (2.1) and (b) the modified model of equation which simulates physical interventions by deleting (2.2), representing the intervention do(X = x0). certain functions from the model, replacing them

4More precisely, the absence of bidirected arrows implies variables. In other words, the set of all bidirected edges con- marginal independences relative of the respective exogenous stitute an i-map of P (U) (Richardson (2003)). EXTERNAL VALIDITY: FROM DO-CALCULUS TO TRANSPORTABILITY 5 with a constant X = x, while keeping the rest of 2.4 Identification, d-Separation and Causal the model unchanged (Haavelmo (1943); Strotz and Calculus Wold (1960); Pearl (2014)). For example, to emulate A central question in causal analysis is the ques- an intervention do(x ) that sets X to a constant x 0 0 tion of identification of causal queries (e.g., the effect M x in model of Figure 2(a), the equation for in of intervention do(X = x )) from a combination of x x 0 equation (2.1) is replaced by = 0, and we obtain data and a partially specified model, for example, a new model, Mx , 0 when only the graph is given and neither the func- z = fZ (uZ ), tions F nor the distribution of U. In linear paramet- ric settings, the question of identification reduces (2.2) x = x0, to asking whether some model parameter, β, has a y = fY (x, uY ), unique solution in terms of the parameters of P (say the population covariance matrix). In the nonpara- the graphical description of which is shown in Fig- metric formulation, the notion of “has a unique so- ure 2(b). lution” does not directly apply since quantities such The joint distribution associated with this mod- as Q(M)= P (y| do(x)) have no parametric signa- ified model, denoted P (z,y| do(x0)) describes the ture and are defined procedurally by simulating an post-intervention distribution of variables Y and intervention in a causal model M, as in equation Z (also called “controlled” or “experimental” dis- (2.2). The following definition captures the require- tribution), to be distinguished from the preinter- ment that Q be estimable from the data: vention distribution, P (x,y,z), associated with the Definition 2 (Identifiability). A causal query original model of equation (2.1). For example, if Q(M) is identifiable, given a set of assumptions A, X represents a treatment variable, Y a response if for any two (fully specified) models, M and M , Z 1 2 variable, and some covariate that affects the that satisfy A, we have 7 amount of treatment received, then the distribution P (z,y| do(x0)) gives the proportion of individuals (2.4) P (M1)= P (M2) ⇒ Q(M1)= Q(M2). that would attain response level Y = y and covari- In words, the functional details of M1 and M2 do ate level Z = z under the hypothetical situation in not matter; what matters is that the assumptions which treatment X = x0 is administered uniformly A 5 in (e.g., those encoded in the diagram) would to the population. constrain the variability of those details in such a In general, we can formally define the postinter- way that equality of P ’s would entail equality of vention distribution by the equation Q’s. When this happens, Q depends on P only, and should therefore be expressible in terms of the pa- (2.3) PM (y| do(x)) = PMx (y). rameters of P . In words, in the framework of model M, the postin- When a query Q is given in the form of a do- tervention distribution of outcome Y is defined as expression, for example, Q = P (y| do(x), z), its iden- the probability that model Mx assigns to each out- tifiability can be decided systematically using an al- come level Y = y. From this distribution, which is gebraic procedure known as the do-calculus (Pearl readily computed from any fully specified model M, (1995)). It consists of three inference rules that per- we are able to assess treatment efficacy by compar- mit us to map interventional and observational dis- ing aspects of this distribution at different levels tributions whenever certain conditions hold in the 6 of x0. causal diagram G. The conditions that permit the application these

5 inference rules can be read off the diagrams using Equivalently, P (z,y| do(x0)) can be interpreted as the joint probability of (Z = z,Y = y) under a randomized ex- 7 periment among units receiving treatment level X = x0. An implication similar to (2.4) is used in the standard Readers versed in potential-outcome notations may interpret statistical definition of parameter identification, where it con- P (y| do(x),z) as the probability P (Yx = y|Zx = z), where Yx veys the uniqueness of a parameter set θ given a distribution is the potential outcome under treatment X = x. Pθ (Lehmann and Casella (1998)). To see the connection, one 6Counterfactuals are defined similarly through the equation should think about the query Q = P (y| do(x)) as a function

Yx(u)= YMx (u) (see Pearl (2009b), Chapter 7), but will not Q = g(θ) where θ is the pair F ∪ P (u) that characterizes a be needed for the discussions in this paper. fully specified model M. 6 J. PEARL AND E. BAREINBOIM a graphical criterion known as d-separation (Pearl Rule 1 (Insertion/deletion of observations). (1988)). P (y| do(x), z, w) Definition 3 (d-separation). A set S of nodes (2.5) G is said to block a path p if either = P (y| do(x), w) if (Y ⊥⊥Z|X,W ) X . 1. p contains at least one arrow-emitting node that Rule 2 (Action/observation exchange). S is in , or P (y| do(x), do(z), w) 2. p contains at least one collision node that is out- (2.6) side S and has no descendant in S. = P (y| do(x), z, w) if (Y ⊥⊥Z|X,W )GXZ . If S blocks all paths from set X to set Y , it is said to Rule 3 (Insertion/deletion of actions). “d-separate X and Y ,” and then, it can be shown that variables X and Y are independent given S, P (y| do(x), do(z), w) written X⊥⊥Y |S.8 (2.7) G = P (y| do(x), w) if (Y ⊥⊥Z|X,W ) XZ W , D-separation reflects conditional independencies ( ) that hold in any distribution P (v) that is compatible where Z(W ) is the set of Z-nodes that are not an- with the causal assumptions A embedded in the di- cestors of any W -node in GX . agram. To illustrate, the path UZ → Z → X → Y in To establish identifiability of a query Q, one needs Figure 2(a) is blocked by S = {Z} and by S = {X}, to repeatedly apply the rules of do-calculus to Q, since each emits an arrow along that path. Conse- until the final expression no longer contains a do- quently, we can infer that the conditional indepen- operator;10 this renders it estimable from nonexper- dencies UZ ⊥⊥Y |Z and UZ ⊥⊥Y |X will be satisfied in imental data. The do-calculus was proven to be com- any probability function that this model can gener- plete for the identifiability of causal effects in the ate, regardless of how we parameterize the arrows. form Q = P (y| do(x), z) (Shpitser and Pearl (2006); Likewise, the path UZ → Z → X ← UX is blocked by Huang and Valtorta (2006)), which means that if Q the null set {∅}, but it is not blocked by S = {Y } cannot be expressed in terms of the probability of since Y is a descendant of the collision node X. Con- observables P by repeated application of these three sequently, the marginal independence UZ ⊥⊥UX will rules, such an expression does not exist. In other hold in the distribution, but UZ ⊥⊥UX |Y may or may not hold.9 words, the query is not estimable from observational studies without making further assumptions, for ex- 2.5 The Rules of do-Calculus ample, linearity, monotonicity, additivity, absence of Let X, Y , Z and W be arbitrary disjoint sets interactions, etc. We shall see that, to establish transportability, of nodes in a causal DAG G. We denote by GX the graph obtained by deleting from G all arrows the goal will be different; instead of eliminating do- pointing to nodes in X. Likewise, we denote by GX operators from the query expression, we will need to the graph obtained by deleting from G all arrows separate them from a set of variables S that repre- emerging from nodes in X. To represent the deletion sent disparities between populations. of both incoming and outgoing arrows, we use the notation GXZ . 3. INFERENCE ACROSS POPULATIONS: The following three rules are valid for every inter- MOTIVATING EXAMPLES ventional distribution compatible with G: To motivate the treatment of Section 4, we first demonstrate some of the subtle questions that trans- 8See Hayduk et al. (2003), Glymour and Greenland (2008) portability entails through three simple examples, and Pearl (2009b), page 335, for a gentle introduction to d- informally depicted in Figure 3. separation. 9This special handling of collision nodes (or colliders, e.g., Example 1. Consider the graph in Figure 3(a) Z → X ← UX ) reflects a general phenomenon known as Berk- that represents cause-effect relationships in the pre- son’s paradox (Berkson (1946)), whereby observations on a treatment population in Los Angeles. We conduct common consequence of two independent causes render those causes dependent. For example, the outcomes of two indepen- dent coins are rendered dependent by the testimony that at 10Such derivations are illustrated in graphical details in least one of them is a tail. Pearl (2009b), page 87. EXTERNAL VALIDITY: FROM DO-CALCULUS TO TRANSPORTABILITY 7

Fig. 3. Causal diagrams depicting Examples 1–3. In (a) Z represents “age.” In (b), Z represents “linguistic skills” while age (in hollow circle) is unmeasured. In (c), Z represents a biological marker situated between the treatment (X) and a disease (Y ). a randomized trial in Los Angeles and estimate the We ask, for example, what must we assume about causal effect of exposure X on outcome Y for every other confounding variables beside age, both latent age group Z = z.11,12 We now wish to generalize the and observed, for equation (3.1) to be valid, or, results to the population of New York City (NYC), would the same transport formula hold if Z was but data alert us to the fact that the study distribu- not age, but some proxy for age, say, language pro- tion P (x,y,z) in LA is significantly different from ficiency. More intricate yet, what if Z stood for the one in NYC (call the latter P ∗(x,y,z)). In par- an exposure-dependent variable, say hyper-tension ticular, we notice that the average age in NYC is level, that stands between X and Y ? significantly higher than that in LA. How are we Let us examine the proxy issue first. to estimate the causal effect of X on Y in NYC, ∗ Example 2. Z denoted P (y| do(x))? Let the variable in Example 1 stand for subjects language proficiency, and let us Our natural inclination would be to assume that assume that Z does not affect exposure (X) or age-specific effects are invariant across cities and so, outcome (Y ), yet it correlates with both, being if the LA study provides us with (estimates of) age- a proxy for age which is not measured in either specific causal effects P (y| do(x),Z = z), the overall study [see Figure 3(b)]. Given the observed dispar- causal effect in NYC should be ity P (z) 6= P ∗(z), how are we to estimate the causal ∗ ∗ P ∗ y x (3.1) P (y| do(x)) = X P (y| do(x), z)P (z). effect ( | do( )) for the target population of NYC z from the z-specific causal effect P (y| do(x), z) esti- This transport formula combines experimental re- mated at the study population of LA? sults obtained in LA, P (y| do(x), z), with observa- The inequality P (z) 6= P ∗(z) in this example may ∗ tional aspects of NYC population, P (z), to obtain reflect either age difference or differences in the way ∗ 13 an experimental claim P (y| do(x)) about NYC. that Z correlates with age. If the two cities enjoy Our first task in this paper will be to explicate the identical age distributions and NYC residents ac- assumptions that renders this extrapolation valid. quire linguistic skills at a younger age, then since Z has no effect whatsoever on X and Y , the inequal- 11Throughout the paper, each graph represents the causal ity P (z) 6= P ∗(z) can be ignored and, intuitively, the structure of the population prior to the treatment, hence X proper transport formula would be stands for the level of treatment taken by an individual out ∗ of free choice. (3.2) P (y| do(x)) = P (y| do(x)). 12The arrow from Z to X represents the tendency of older people to seek treatment more often than younger people, If, on the other hand, the conditional probabilities and the arrow from Z to Y represents the effect of age on the ∗ P (z|age) and P (z|age) are the same in both cities, outcome. ∗ 13At first glance, equation (3.1) may be regarded as a rou- and the inequality P (z) 6= P (z) reflects genuine age tine application of “standardization” or “recalibration”—a differences, equation (3.2) is no longer valid, since statistical extrapolation method that can be traced back to a the age difference may be a critical factor in deter- century-old tradition in demography and political arithmetic mining how people react to X. We see, therefore, (Westergaard (1916); Yule (1934); Lane and Nelder (1982)). On a second thought, it raises the deeper question of why we that the choice of the proper transport formula de- consider age-specific effects to be invariant across populations. pends on the causal context in which population dif- See discussion following Example 2. ferences are embedded. 8 J. PEARL AND E. BAREINBOIM

This example also demonstrates why the invari- simple average of the z-specific causal effects. The ance of Z-specific causal effects should not be taken correct weighing rule is for granted. While justified in Example 1, with Z = ∗ age, it fails in Example 2, in which Z was equated P (y| do(x)) (3.3) with “language skills.” Indeed, using Figure 3(b) for ∗ ∗ = P (y| do(x), z)P (z| do(x)), Z X Y X guidance, the -specific effect of on in NYC is z given by ∗ which reduces to (3.1) only in the special case where P (y| do(x), z) Z is unaffected by X. Equation (3.2) is also wrong ∗ ∗ because we can no longer argue, as we did in Exam- = X P (y| do(x), z, age)P (age| do(x), z) age ple 2, that Z does not affect Y , hence it can be ig- nored. Here, Z lies on the causal pathway between X ∗ ∗ = X P (y| do(x), age)P (age|z) and Y so, clearly, it affects their relationship. What age then is the correct transport formula for this sce- ∗ nario? = X P (y| do(x), age)P (age|z). age To cast this example in a more realistic setting, let us assume that we wish to use Z as a “surrogate end- Thus, if the two populations differ in the relation point” to predict the efficacy of treatment X on out- between age and skill, that is, come Y , where Y is too difficult and/or expensive to ∗ measure routinely (Prentice (1989); Ellenberg and P (age|z) 6= P (age|z) Hamilton (1989)). Thus, instead of considering ex- the skill-specific causal effect would differ as well. perimental and observational studies conducted at The intuition is clear. A NYC person at skill level two different locations, we consider two such studies Z = z is likely to be in a totally different age group taking place at the same location, but at different from his skill-equals in Los Angeles and, since it is times. In the first study, we measure P (y, z| do(x)) age, not skill that shapes the way individuals re- and discover that Z is a good surrogate, namely, spond to treatment, it is only reasonable that Los knowing the effect of treatment on Z allows predic- Angeles residents would respond differently to treat- tion of the effect of treatment on the more clinically ment than their NYC counterparts at the very same relevant outcome (Y ) (Joffe and Greene (2009)). skill level. Once Z is proclaimed a “surrogate endpoint,” it in- The essential difference between Examples 1 and vites efforts to find direct means of controlling Z. 2 is that age is normally taken to be an exoge- For example, if cholesterol level is found to be a pre- nous variable (not assigned by other factors in the dictor of heart diseases in a long-run trial, drug man- model) while skills may be indicative of earlier fac- ufacturers would rush to offer cholesterol-reducing tors (age, education, ethnicity) capable of modifying substances for public consumption. As a result, both the causal effect. Therefore, conditional on skill, the the prior P (z) and the treatment-dependent proba- effect may be different in the two populations. bility P (z| do(x)) would undergo a change, resulting in P ∗(z) and P ∗(z| do(x)), respectively. Example 3. Examine the case where Z is a We now wish to reassess the effect of the drug X-dependent variable, say a disease bio-marker, ∗ P (y| do(x)) in the new population and do it in the standing on the causal pathways between X and Y cheapest possible way, namely, by conducting an ob- as shown in Figure 3(c). Assume further that the ∗ ∗ servational study to estimate P (z,x), acknowledg- disparity P (z|x) 6= P (z|x) is discovered and that, ing that confounding exists between X and Y and again, both the average and the z-specific causal ef- that the drug affects Y both directly and through fect P (y| do(x), z) are estimated in the LA experi- Z, as shown in Figure 3(c). ment, for all levels of X and Z. Can we, based on in- Using a graphical representation to encode the as- formation given, estimate the average (or z-specific) sumptions articulated thus far, and further assum- causal effect in the target population of NYC? ing that the disparity observed stems only from a Here, equation (3.1) is wrong because the overall difference in people’s susceptibility to X (and not causal effect (in both LA and NYC) is no longer a due to a change in some unobservable confounder), EXTERNAL VALIDITY: FROM DO-CALCULUS TO TRANSPORTABILITY 9 we will prove in Section 5 that the correct transport which population differences come about; differ- formula should be ent localization of these mechanisms yield different ∗ ∗ transport formulae. This can be seen most vividly in (3.4) P (y| do(x)) = X P (y| do(x), z)P (z|x), z Example 2 [Figure 3(b)] where we reasoned that no reweighing is necessary if the disparity P (z) 6= P ∗(z) which is different from both (3.1) and (3.2). It calls originates with the way language proficiency de- instead for the z-specific effects to be reweighted by ∗ pends on age, while the age distribution itself re- the conditional probability P (z|x), estimated in the mains the same. Yet, because age is not measured, 14 target population. this condition cannot be detected in the probability To see how the transportability problem fits into distribution P , and cannot be distinguished from an the general scheme of causal analysis discussed in alternative condition, Section 2.1 (Figure 1), we note that, in our case, the ∗ ∗ data comes from two sources, experimental (from P (age) 6= P (age) and P (z|age) = P (z|age), the study) and nonexperimental (from the target), assumptions are encoded in the form of selection di- one that may require reweighting according to equa- agrams, and the query stands for the causal effect tion (3.1). In other words, every probability distri- (e.g., P ∗(y| do(x))). Although this paper does not bution P (x,y,z) that is compatible with the process discuss the goodness-of-fit problem, standard meth- of Figure 3(b) is also compatible with that of Fig- ods are available for testing the compatibility of the ure 3(a) and, yet, the two processes dictate different selection diagram with the data available. transport formulas. Based on these observations, it is clear that if we 4. FORMALIZING TRANSPORTABILITY are to represent formally the differences between populations (similarly, between experimental set- 4.1 Selection Diagrams and Selection Variables tings or environments), we must resort to a represen- The pattern that emerges from the examples dis- tation in which the causal mechanisms are explicitly cussed in Section 3 indicates that transportability encoded and in which differences in populations are is a causal, not statistical notion. In other words, represented as local modifications of those mecha- the conditions that license transport as well as the nisms. formulas through which results are transported de- To this end, we will use causal diagrams aug- pend on the causal relations between the variables mented with a set, S, of “selection variables,” where in the domain, not merely on their statistics. For each member of S corresponds to a mechanism by instance, it was important in Example 3 to as- which the two populations differ, and switching be- certain that the change in P (z|x) was due to the tween the two populations will be represented by change in the way Z is affected by X, but not due conditioning on different values of these S vari- to a change in confounding conditions between the ables.15 two. This cannot be determined solely by compar- Intuitively, if P (v| do(x)) stands for the distribu- P z x P ∗ z x X Z ing ( | ) and ( | ). If and are confounded tion of a set V of variables in the experimental [e.g., Figure 6(e)], it is quite possible for the in- ∗ study (with X randomized) then we designate by equality P (z|x) 6= P (z|x) to hold, reflecting differ- ∗ P (v| do(x)) the distribution of V if we were to con- ences in confounding, while the way that Z is af- ∗ duct the study on population Π instead of Π. We fected by X (i.e., P (z| do(x))) is the same in the now attribute the difference between the two to the two populations—a different transport formula will then emerge for this case. 15 Consequently, licensing transportability requires Disparities among populations or subpopulations can also knowledge of the mechanisms, or processes, through arise from differences in design; for example, if two samples are drawn by different criteria from a given population. The prob- lem of generalizing between two such subpopulations is usu- 14Quite often the possibility of running a second random- ally called sampling selection bias (Heckman (1979); Hern´an, ized experiment to estimate P ∗(z| do(x)) is also available to Hern´andez-D´ıaz and Robins (2004); Cole and Stuart (2010); investigators, though at a higher cost. In such cases, a trans- Pearl (2013); Bareinboim, Tian and Pearl (2014)). In this pa- port formula would be derivable under more relaxed assump- per, we deal only with nature-induced, not man-made dispar- tions, for example, allowing for X and Z to be confounded. ities. 10 J. PEARL AND E. BAREINBOIM

Fig. 4. Selection diagrams depicting specific versions of Examples 1–3. In (a), the two populations differ in age distributions. In (b), the populations differs in how Z depends on age (an unmeasured variable, represented by the hollow circle) and the age distributions are the same. In (c), the populations differ in how Z depends on X. In all diagrams, dashed arcs (e.g., X L9999K Y ) represent the presence of latent variables affecting both X and Y . action of a set S of selection variables, and write16,17 Definition 4 (Selection diagram). Let hM, M ∗i ∗ ∗ be a pair of structural causal models (Definition 1) P v x P v x ,s . ∗ ( | do( )) = ( | do( ) ) relative to domains hΠ, Π i, sharing a causal dia- ∗ The selection variables in S may represent all fac- gram G. hM, M i is said to induce a selection dia- tors by which populations may differ or that may gram D if D is constructed as follows: “threaten” the transport of conclusions between 1. Every edge in G is also an edge in D. populations. For example, in Figure 4(a) the age dis- 2. D contains an extra edge Si → Vi whenever there P z P ∗ z ∗ parity ( ) 6= ( ) discussed in Example 1 will be might exist a discrepancy fi 6= fi or P (Ui) 6= ∗ ∗ represented by the inequality P (Ui) between M and M . P (z) 6= P (z|s), In summary, the S-variables locate the mecha- nisms where structural discrepancies between the where S stands for all factors responsible for draw- two populations are suspected to take place. Alter- ing subjects at age Z = z to NYC rather than LA. natively, the absence of a selection node pointing to Of equal importance is the absence of an S vari- a variable represents the assumption that the mech- able pointing to Y in Figure 4(a), which encodes anism responsible for assigning value to that vari- the assumption that age-specific effects are invari- able is the same in the two populations. In the ex- ant across the two populations. treme case, we could add selection nodes to all vari- This graphical representation, which we will call ables, which means that we have no reason to be- “selection diagrams” is defined as follows:18 lieve that the populations share any mechanism in common, and this, of course would inhibit any ex- 16Alternatively, one can represent the two populations’ dis- ∗ change of information among the populations. The tributions by P (v| do(x),s), and P (v| do(x),s ), respectively. invariance assumptions between populations, as we The results, however, will be the same, since only the location of S enters the analysis. will see, will open the door for the transport of some 17Pearl (1993, 2009b, page 71), Spirtes, Glymour and experimental findings. Scheines (1993) and Dawid (2002), for example, use condi- For clarity, we will represent the S variables by tioning on auxiliary variables to switch between experimen- squares, as in Figure 4, which uses selection dia- tal and observational studies. Dawid (2002) further uses such grams to encode the three examples discussed in variables to represent changes in parameters of probability S distributions. Section 3. (Besides the variables, these graphs 18The assumption that there are no structural changes be- also include additional latent variables, represented tween domains can be relaxed starting with D = G∗ and by bidirected edges, which makes the examples more adding S-nodes following the same procedure as in Defini- realistic.) In particular, Figures 4(a) and 4(b) repre- tion 4, while enforcing acyclicity. In extreme cases in which sent, respectively, two different mechanisms respon- the two domains differ in causal directionality (Spirtes, Gly- sible for the observed disparity P (z) 6= P ∗(z). The mour and Scheines (2000), pages 298–299), acyclicity cannot be maintained. This complication as well as one created when first [Figure 4(a)] dictates transport formula (3.1), G is a edge-super set of G∗ require a more elaborated graph- while the second [Figure 4(b)] calls for direct, un- ical representation and lie beyond the scope of this paper. adjusted transport (3.2). This difference stems from EXTERNAL VALIDITY: FROM DO-CALCULUS TO TRANSPORTABILITY 11 the location of the S variables in the two diagrams. they require one to demonstrate the nonexistence of In Figure 4(a), the S variable represents unspecified two competing models compatible with D, agreeing factors that cause age differences between the two on {P, P ∗,I}, and disagreeing on R(Π∗). populations, while in Figure 4(b), S represents fac- Definition 5 is declarative, and does not offer tors that cause differences in reading skills (Z) while an effective method of demonstrating transportabil- the age distribution itself (unobserved) remains the ity even in simple models. Theorem 1 offers such same. a method using a sequence of derivations in do- In this paper, we will address the issue of trans- calculus. portability assuming that scientific knowledge about Theorem 1. Let D be the selection diagram invariance of certain mechanisms is available and ∗ encoded in the selection diagram through the S characterizing two populations, Π and Π , and S nodes. Such knowledge is, admittedly, more de- a set of selection variables in D. The relation R ∗ ∗ manding than that which shapes the structure of = P (y| do(x), z) is transportable from Π to Π if each causal diagram in isolation. It is, however, a the expression P (y| do(x),z,s) is reducible, using the prerequisite for any attempt to justify transfer of rules of do-calculus, to an expression in which S findings across populations, which makes selection appears only as a conditioning variable in do-free diagrams a mathematical object worthy of analysis. terms. 4.2 Transportability: Definitions and Examples Proof. Every relation satisfying the condition of Theorem 1 can be written as an algebraic com- Using selection diagrams as the basic represen- tational language, and harnessing the concepts of bination of two kinds of terms, those that involve intervention, do-calculus, and identifiability (Sec- S and those that do not. The former can be writ- ten as P ∗-terms and are estimable, therefore, from tion 2), we can now give the notion of transporta- ∗ bility a formal definition. observations on Π , as required by Definition 5. All other terms, especially those involving do-operators, Definition 5 (Transportability). Let D be a ∗ do not contain S; they are experimentally identifi- selection diagram relative to domains hΠ, Π i. Let able therefore in Π.  hP,Ii be the pair of observational and interventional distributions of Π, and P ∗ be the observational This criterion was proven to be both sufficient and ∗ distribution of Π∗. The causal relation R(Π∗) = necessary for causal effects, namely R = P (y| do(x)) P ∗(y| do(x), z) is said to be transportable from Π (Bareinboim and Pearl (2012)). Theorem 1, though to Π∗ in D if R(Π∗) is uniquely computable from procedural, does not specify the sequence of rules P, P ∗,I in any model that induces D. leading to the needed reduction when such a se- quence exists. Bareinboim and Pearl (2013b) de- Two interesting connections between identifiabil- rived a complete procedural solution for this, based ity and transportability are worth noting. First, note on graphical method developed in (Tian and Pearl that all identifiable causal relations in D are also (2002); Shpitser and Pearl (2006)). Despite its com- transportable, because they can be computed di- rectly from P ∗ and require no experimental informa- pleteness, however, the procedural solution is not tion from Π. Second, note that given causal diagram trivial, and we take here an alternative route to es- G, one can produce a selection diagram D such that tablish a simple and transparent procedure for con- identifiability in G is equivalent to transportability firming transportability, guided by two recognizable in D. First set D = G, and then add selection nodes subgoals. pointing to all variables in D, which represents that Definition 6 (Trivial transportability). A caus- the target domain does not share any mechanism al relation R is said to be trivially transportable from with its counterpart—this is equivalent to the prob- Π to Π∗, if R(Π∗) is identifiable from (G∗, P ∗). lem of identifiability because the only way to achieve transportability is to identify R from scratch in the This criterion amounts to an ordinary test of iden- target population. tifiability of causal relations using graphs, as given ∗ While the problems of identifiability and trans- by Definition 2. It permits us to estimate R(Π ) di- ∗ portability are related, proofs of nontransportability rectly from observational studies on Π , unaided by are more involved than those of nonidentifiability for causal information from Π. 12 J. PEARL AND E. BAREINBOIM

Example 4. Let R be the causal effect P ∗(y| Theorem 2. Let D be the selection diagram do(x)) and let the selection diagram of Π and Π∗ characterizing two populations, Π and Π∗, and S be given by X → Y ← S, then R is trivially trans- the set of selection variables in D. The strata- ∗ ∗ portable, since R(Π )= P (y|x). specific causal effect P ∗(y| do(x), z) is transportable from Π to Π∗ if Z d-separates Y from S in the Another special case of transportability occurs X-manipulated version of D, that is, Z satisfies when a causal relation has identical form in both (Y ⊥⊥S|Z, X)D . domains—no recalibration is needed. X Proof. Definition 7 (Direct transportability). A causal ∗ ∗ relation R is said to be directly transportable from P (y| do(x), z)= P (y| do(x),z,s ). Π to Π∗, if R(Π∗)= R(Π). From Rule 1 of do-calculus we have: P (y| do(x), z, A graphical test for direct transportability of R = s∗)= P (y| do(x), z) whenever Z satisfies (Y ⊥⊥S|Z, P ∗(y| do(x), z) follows from do-calculus and reads: X) in DX . This proves Theorem 2.  (S⊥⊥Y |X,Z)GX ; in words, X blocks all paths from S to Y once we remove all arrows pointing to X and Definition 8 (S-admissibility). A set T of vari- condition on Z. As a concrete example, this test is ables satisfying (Y ⊥⊥S|T, X) in DX will be called satisfied in Figure 4(a) and, therefore, the z-specific S-admissible (with respect to the causal effect of effects is the same in both populations; it is directly X on Y ). transportable. Corollary 1. The average causal effect P ∗(y| Remark. The notion of “external validity” as do(x)) is transportable from Π to Π∗ if there ex- defined by Manski (2007) (footnote 1) corresponds ists a set Z of observed pretreatment covariates that to Direct Transportability, for it requires that R re- is S-admissible. Moreover, the transport formula is tains its validity without adjustment, as in equation given by the weighting of equation (3.1). (3.2). Such conditions preclude the use of informa- tion from Π∗ to recalibrate R. Example 7. The causal effect is transportable in Figure 4(a), since Z is S-admissible, and in Fig- Example 5. Let R be the causal effect of X ure 4(b), where the empty set is S-admissible. It on Y , and let D have a single S node pointing to is also transportable by the same criterion in Fig- X, then R is directly transportable, because causal ure 5(b), where W is S-admissible, but not in Fig- effects are independent of the selection mechanism ure 5(a) where no S-admissible set exists. (see Pearl (2009b), pages 72 and 73). Corollary 2. Any S variable that is point- Example 6. R z Let be the -specific causal ef- ing directly into X as in Figure 6(a), or that is d- X Y P ∗ y x , z Z fect of on ( | do( ) ) where is a set separated from Y in D can be ignored. of variables, and P and P ∗ differ only in the con- X ditional probabilities P (z| pa(Z)) and P ∗(z| pa(Z)) This follows from the fact that the empty set is such that (Z⊥⊥Y | pa(Z)), as shown in Figure 4(b). S-admissible relative to any such S variable. Con- Under these conditions, R is not directly trans- ceptually, the corollary reflects the understanding portable. However, the pa(Z)-specific causal effects that differences in propensity to receive treatment P ∗(y| do(x), pa(Z)) are directly transportable, and so is P ∗(y| do(x)). Note that, due to the confound- ing arcs, none of these quantities is identifiable.

5. TRANSPORTABILITY OF CAUSAL EFFECTS—A GRAPHICAL CRITERION We now state and prove two theorems that permit us to decide algorithmically, given a selection dia- gram, whether a relation is transportable between two populations, and what the transport formula Fig. 5. Selection diagrams illustrating S-admissibility. (a) should be. Has no S-admissible set while in (b), W is S-admissible. EXTERNAL VALIDITY: FROM DO-CALCULUS TO TRANSPORTABILITY 13

Fig. 6. Selection diagrams illustrating transportability. The causal effect P (y| do(x)) is (trivially) transportable in (c) but not in (b) and (f). It is transportable in (a), (d) and (e) (see Corollary 2).

∗ do not hinder the transportability of treatment ef- (5.6) = X P (y|w,s)P (w| do(x)) fects; the used in the experimental w study washes away such differences. (by Rule 3 of do-calculus) We now generalize Theorem 2 to cases involving ∗ ∗ treatment-dependent Z variables, as in Figure 4(c). (5.7) = X P (y|w)P (w| do(x)). ∗ w Theorem 3. The average causal effect P (y| ∗ do(x)) is transportable from Π to Π∗ if either one We similarly note that the transportability of P (w| of the following conditions holds: do(x)) should reduce P (w| do(x),s) to a star-free expression and would render P ∗(y| do(x)) trans- ∗ 1. P (y| do(x)) is trivially transportable. portable. This proves Theorem 3.  2. There exists a set of covariates, Z (possibly af- fected by X) such that Z is S-admissible and for Example 8. To illustrate the application of which P ∗(z| do(x)) is transportable. Theorem 3, let us apply it to Figure 4(c), which 3. There exists a set of covariates, W that satisfy corresponds to the surrogate endpoint problem dis- ∗ cussed in Section 3 (Example 3). Our goal is to (X⊥⊥Y |W )D and for which P (w| do(x)) is ∗ X(W ) estimate P (y| do(x))—the effect of X on Y in transportable. the new population created by changes in how Proof. 1. Condition 1 entails transportability. Z responds to X. The structure of the problem 2. If condition 2 holds, it implies permits us to satisfy condition 2 of Theorem 3, ∗ ∗ since Z is S-admissible and P (z| do(x)) is triv- P (y| do(x)) ially transportable. The former can be seen from (5.1) ∗ (S⊥⊥Y |X,Z)G , hence P (y| do(x), z)= P (y| do(x), = P (y| do(x),s) X z)); the latter can be seen from the fact that X and ∗ ∗ (5.2) = X P (y| do(x),z,s)P (z| do(x),s) Z and unconfounded, hence P (z| do(x)) = P (z|x). z Putting the two together, we get ∗ ∗ ∗ (5.3) = X P (y| do(x), z)P (z| do(x)). (5.8) P (y| do(x)) = X P (y| do(x), z)P (z|x), z z We now note that the transportability of P (z| do(x)) which proves equation (3.4). should reduce P ∗(z| do(x)) to a star-free expression ∗ Remark. and would render P (y| do(x)) transportable. The test entailed by Theorem 3 is re- 3. If condition 3 holds, it implies cursive, since the transportability of one causal ef- fect depends on that of another. However, given ∗ P (y| do(x)) that the diagram is finite and acyclic, the sets Z (5.4) and W needed in conditions 2 and 3 of Theorem 3 = P (y| do(x),s) would become closer and closer to X, and the it- erative process will terminate after a finite num- (5.5) = X P (y| do(x),w,s)P (w| do(x),s) w ber of steps. This occurs because the causal ef- 14 J. PEARL AND E. BAREINBOIM fects P ∗(z| do(x)) (likewise, P ∗(w| do(x))) is triv- ially transportable and equals P (z) for any Z node that is not a descendant of X. Thus, the need for re- iteration applies only to those members of Z that lie on the causal pathways from X to Y . Note further that the analyst need not terminate the procedure upon satisfying the conditions of Theorem 3. If one wishes to reduce the number of experiments, it can continue until no further reduction is feasible. Fig. 7. Selection diagram in which the causal effect is shown Example 9. Figure 6(d) requires that we invoke to be transportable in multiple iterations of Theorem 3 (see the both conditions of Theorem 3, iteratively. To satisfy Appendix). condition 2, we note that Z is S-admissible, and we ∗ need to prove the transportability of P (z| do(x)). in the Appendix): W To do that, we invoke condition 3 and note that ∗ d-separates X from Z in D. There remains to con- P (y| do(x)) firm the transportability of P ∗(w| do(x)), but this (5.10) = X P (y| do(x), z) is guaranteed by the fact that the empty set is z S-admissible relative to W , since (W ⊥⊥S). Hence, ∗ ∗ ∗ by Theorem 2 (replacing Y with W ) P (w| do(x)) · X P (z|w) X P (w| do(x),t)P (t). is transportable, which bestows transportability on w t ∗ P (y| do(x)). Thus, the final transport formula (de- The main power of this formula is to guide in- rived formally in the Appendix) is: vestigators in deciding what measurements need be ∗ P (y| do(x)) taken in both the experimental study and the target population. It asserts, for example, that variables U (5.9) = X P (y| do(x), z) and V need not be measured. It likewise asserts that z the W -specific causal effects need not be estimated ∗ in the experimental study and only the conditional · X P (w| do(x))P (z|w). ∗ ∗ w probabilities P (z|w) and P (t) need be estimated in the target population. The derivation of this for- The first two factors of the expression are estimable mulae is given in the Appendix. in the experimental study, and the third through ob- Despite its power, Theorem 3 in not complete, servational studies on the target population. Note namely, it is not guaranteed to approve all trans- that the joint effect P (y,w,z| do(x)) need not be es- portable relations or to disapprove all nontrans- timated in the experiment; a decomposition that re- portable ones. An example of the former is contrived sults in decrease of measurement cost and sampling in Bareinboim and Pearl (2012), where an alter- variability. native, necessary and sufficient condition is estab- A similar analysis proves the transportability of lished in both graphical and algorithmic form. The- the causal effect in Figure 6(e) (see Pearl and orem 3 provides, nevertheless, a simple and powerful Bareinboim (2011)). The model of Figure 6(f), how- method of establishing transportability in practice. ever, does not allow for the transportability of P ∗(y| do(x)) as witnessed by the absence of S- 6. CONCLUSIONS admissible set in the diagram, and the inapplica- Given judgements of how target populations may bility of condition 3 of Theorem 3. differ from those under study, the paper offers a for- Example 10. To illustrate the power of The- mal representational language for making these as- orem 3 in discerning transportability and deriving sessments precise and for deciding whether causal transport formulae, Figure 7 represents a more intri- relations in the target population can be inferred cate selection diagram, which requires several itera- from those obtained in an experimental study. When tion to discern transportability. The transport for- such inference is possible, the criteria provided by mula for this diagram is given by (derived formally Theorems 2 and 3 yield transport formulae, namely, EXTERNAL VALIDITY: FROM DO-CALCULUS TO TRANSPORTABILITY 15 principled ways of calibrating the transported re- one another. This knowledge is not vastly different lations so as to properly account for differences in from that required in any principled approach to the populations. These transport formulae enable causation in observational studies, since judgement the investigator to select the essential measurements about possible effects of omitted factors is crucial in both the experimental and observational studies, in any such analysis. Whereas such knowledge may and thus minimize measurement costs and sample only be partially available, the analysis presented in variability. this paper is nevertheless essential for understand- The inferences licensed by Theorem 2 and 3 repre- ing what knowledge is needed for the task to succeed sent worst case analysis, since we have assumed, in and how sensitive conclusions are to knowledge that the tradition of nonparametric modeling, that every we do not possess. variable may potentially be an effect-modifier (or Real-life situations will be marred, of course, with moderator). If one is willing to assume that certain additional complications that were not addressed relationships are noninteractive, or monotonic as is directly in this paper; for example, measurement the case in additive models, then additional trans- errors, selection bias, finite sample variability, un- port licenses may be issued, beyond those sanctioned certainty about the graph structure and the pos- by Theorems 2 and 3. sible existence of unmeasured confounders between While the results of this paper concern the trans- any two nodes in the diagram. Such issues are not fer of causal information from experimental to ob- unique to transportability; they plague any problem servational studies, the method can also benefit in causal analysis, regardless of whether they are in transporting statistical findings from one obser- represented formally or ignored by avoiding formal- vational study to another (Pearl and Bareinboim ism. The methods offered in this paper are represen- (2011)). The rationale for such transfer is two-fold. tative of what theory permits us to do in ideal situa- First, information from the first study may enable tions, and the graphical representation presented in researchers to avoid repeated measurement of cer- this paper makes the assumptions explicit and trans- tain variables in the target population. Second, by parent. Transparency is essential for reaching tenta- pooling data from both populations, we increase tive consensus among researchers and for facilitating the precision in which their commonalities are esti- discussions to distinguish that which is deemed plau- mated and, indirectly, also increase the precision by sible and important from that which is negligible or which the target relationship is transported. Sub- implausible. stantial reduction in sampling variability can be Finally, it is important to mention two recent thus achieved through this decomposition (Pearl extensions of the results reported in this article. (2012b)). Bareinboim and Pearl (2013a) have addressed the Clearly, the same data-sharing can be problem of transportability in cases where only used to guide Meta-Analysis (Glass (1976); Hedges a limited set of experiments can be conducted and Olkin (1985); Rosenthal (1995); Owen (2009)), at the source environment. Subsequently, the re- where one attempts to combine results from many sults were generalized to the problem of “meta- experimental and observational studies, each con- transportability,” that is, pooling experimental re- ducted on a different population and under a differ- sults from multiple and disparate sources to syn- ent set of conditions, so as to construct an aggregate thesize a consistent estimate of a causal relation at measure of effect size that is “better,” in some for- yet another environment, potentially different from mal sense, than any one study in isolation. While each of the former (Bareinboim and Pearl (2013c)). traditional approaches aims to average out differ- It is shown that such synthesis may be feasible from ences between studies, our theory exploits the com- multiple sources even in cases where it is not feasible monalities among the populations studied and the from any one source in isolation. target population. By pooling together commonali- ties and discarding areas of disparity, we gain max- APPENDIX imum use of the available samples (Bareinboim and Pearl (2013c)). Derivation of the transport formula for the causal To be of immediate use, our method relies on the effect in the model of Figure 6(d) [equation (5.9)]: assumption that the analyst is in possession of suf- ∗ P (y| do(x)) ficient background knowledge to determine, at least qualitatively, where two populations may differ from = P (y| do(x),s) 16 J. PEARL AND E. BAREINBOIM

′ ′ = X P (y| do(x),s,z)P (z| do(x),s) · X P (z| do(x),s,s , w)P (w| do(x),s,s ) z w

= X P (y| do(x), z)P (z| do(x),s) = X P (y| do(x), z) z z , ′ ′ (2nd condition of Theorem 3 · X P (z|s,s , w)P (w| do(x),s,s ) S-admissibility of Z of CE(X,Y )) w (3rd condition of Theorem 3, = P (y| do(x), z) X ′ z (X⊥⊥Z|W,S,S )D ) X(W ) (A.2) · X P (z| do(x),w,s)P (w| do(x),s) ′ w = X P (y| do(x), z) X P (z|s,s , w) z w = X P (y| do(x), z) ′ ′ z · X P (w| do(x),s,s ,t)P (t| do(x),s,s ) t (A.1) · X P (z|w,s)P (w| do(x),s) ′ w = X P (y| do(x), z) X P (z|s,s , w) z w (3rd condition of Theorem 3, ′ · X P (w| do(x),t)P (t| do(x),s,s ) (X⊥⊥Z|W, S)D ) X(W ) t (2nd condition of Theorem 3, = X P (y| do(x), z) z S-admissibility of T on CE(X,W )) ′ · X P (z|w,s)P (w| do(x)) = P (y| do(x), z) P (z|s,s , w) w X X z w (2nd condition of Theorem 3, ′ · X P (w| do(x),t)P (t|s,s ) S-admissibility of the t empty set {} of CE(X,W )) (1st condition of Theorem 3/ Rule 3 of do-calculus, (X⊥⊥T |S,S′) ) = X P (y| do(x), z) D z ∗ = X P (y| do(x), z) X P (z|w) ∗ · X P (z|w)P (w| do(x)). z w w ∗ · X P (w| do(x),t)P (t). Derivation of the transport formula for the causal t effect in the model of Figure 7 [equation (5.10)]:

∗ P (y| do(x)) ACKNOWLEDGMENTS ′ = P (y| do(x),s,s ) This paper benefited from discussions with Onye- ′ ′ = X P (y| do(x),s,s , z)P (z| do(x),s,s ) buchi Arah, Stuart Baker, Sander Greenland, z Michael Hoefler, Marshall Joffe, William Shadish, ′ Ian Shrier and Dylan Small. We are grateful to = X P (y| do(x), z)P (z| do(x),s,s ) z two anonymous referees for thorough reviews of this manuscript and for suggesting a simplification (2nd condition of Theorem 3, in the transport formula of Example 10. This re- S-admissibility of Z of CE(X,Z)) search was supported in parts by NIH Grant #1R01 LM009961-01, NSF Grant #IIS-0914211 and ONR = X P (y| do(x), z) z Grant #N000-14-09-1-0665. EXTERNAL VALIDITY: FROM DO-CALCULUS TO TRANSPORTABILITY 17

REFERENCES Glymour, M. M. and Greenland, S. (2008). Causal dia- grams. In Modern Epidemiology, 3rd ed. (K. J. Rothman, Adelman, L. (1991). Experiments, quasi-experiments, and S. Greenland and T. L. Lash, eds.) 183–209. Lippincott case studies: A review of empirical methods for evaluating Williams & Wilkins, Philadelphia, PA. decision support systems. IEEE Transactions on Systems, Haavelmo, T. (1943). The statistical implications of a sys- Man and Cybernetics 21 293–301. tem of simultaneous equations. Econometrica 11 1–12. Balke, A. and Pearl, J. (1995). Counterfactuals and pol- MR0007954 icy analysis in structural models. In Proceedings of the Hayduk, L., Cummings, G., Stratkotter, R., Nimmo, M., Eleventh Conference on Uncertainty in Artificial Intelli- Grygoryev, K., Dosman, D., Gillespie, M., gence (P. Besnard and S. Hanks, eds.) 11–18. Morgan Pazderka-Robinson, H. and Boadu, K. (2003). Kaufmann, San Francisco, CA. MR1615008 Pearl’s d-separation: One more step into causal thinking. Bareinboim, E., Brito, C. and Pearl, J. (2012). Local Struct. Equ. Model. 10 289–311. MR1977133 characterizations of causal Bayesian networks. In Graph Heckman, J. J. (1979). Sample selection bias as a specifica- Structures for Knowledge Representation and Reasoning. tion error. Econometrica 47 153–161. MR0518832 Lecture Notes in Artificial Intelligence 7205 1–17. Springer, Hedges, L. V. Olkin, I. Berlin. and (1985). Statistical Methods for Bareinboim, E. and Pearl, J. (2012). Transportability Meta-Analysis. Academic Press, Orlando, FL. MR0798597 Heise, D. R. of causal effects: Completeness results. In Proceedings of (1975). Causal Analysis. Wiley, New York. Hernan,´ M. A. Hernandez-D´ ´ıaz, S. Robins, J. M. the Twenty-Sixth National Conference on Artificial Intelli- , and gence 698–704. AAAI Press, Menlo Park, CA. (2004). A structural approach to selection bias. Epidemi- 15 Bareinboim, E. and Pearl, J. (2013a). Causal transporta- ology 615–625. ´ bility with limited experiments. In Proceedings of the Hernan, M. A. and VanderWeele, T. J. (2011). Com- Twenty-Seventh National Conference on Artificial Intelli- pound treatments and transportability of causal inference. 22 gence 95–101. AAAI Press, Menlo Park, CA. Epidemiology 368–377. ¨ Bareinboim, E. and Pearl, J. (2013b). A general algo- Hofler, M., Gloster, A. T. and Hoyer, J. (2010). rithm for deciding transportability of experimental results. Causal effects in psychotherapy: Counterfactuals counter- 20 J. Causal Inference 1 107–134. act overgeneralization. Psychotherapy 668–679. Bareinboim, E. and Pearl, J. (2013c). Meta- DOI:10.1080/10503307.2010.501041 transportability of causal effects: A formal approach. Huang, Y. and Valtorta, M. (2006). Pearl’s calculus of In Proceedings of the Sixteenth International Conference intervention is complete. In Proceedings of the Twenty- on Artificial Intelligence and Statistics (AISTATS 2013). Second Conference on Uncertainty in Artificial Intelligence J. Mach. Learn. Res. 31 135–143. (R. Dechter and T. S. Richardson, eds.) 217–224. Bareinboim, E., Tian, J. and Pearl, J. (2014). Recovering AUAI Press, Corvallis, OR. from selection bias in causal and statistical inference. In Joffe, M. M. and Greene, T. (2009). Related causal frame- 65 Proceedings of The Twenty-Eighth Conference on Artificial works for surrogate outcomes. Biometrics 530–538. Intelligence (C. E. Brodley and P. Stone, eds.). AAAI MR2751477 Press, Menlo Park, CA. To appear. Koller, D. and Friedman, N. (2009). Probabilistic Graphi- Berkson, J. (1946). Limitations of the application of fourfold cal Models: Principles and Techniques. Adaptive Computa- table analysis to hospital data. Biometrics 2 47–53. tion and Machine Learning. MIT Press, Cambridge, MA. Bollen, K. A. and Pearl, J. (2013). Eight myths about MR2778120 and structural equation models. In Handbook of Lane, P. W. and Nelder, J. A. (1982). Analysis of covari- Causal Analysis for (S. L. Morgan, ed.) ance and standardization as instances of prediction. Bio- Chapter 15. Springer, New York. metrics 38 613–621. Campbell, D. and Stanley, J. (1963). Experimental and Lehmann, E. L. and Casella, G. (1998). Theory of Point Quasi-Experimental Designs for Research. Wadsworth, Estimation, 2nd ed. Springer, New York. MR1639875 Chicago. Manski, C. (2007). Identification for Prediction and Deci- Cole, S. R. and Stuart, E. A. (2010). Generalizing evidence sion. Harvard Univ. Press, Cambridge, MA. from randomized clinical trials to target populations: The Neyman, J. (1923). Sur les applications de la thar des proba- ACTG 320 trial. Am. J. Epidemiol. 172 107–115. bilities aux experiences Agaricales: Essay des principle. En- Davis, J. A. (1984). Extending Rosenberg’s technique for glish translation of excerpts by D. Dabrowska and T. Speed standardizing percentage tables. Social Forces 62 679–708. in Statist. Sci. 5 (1990) 463–472. Dawid, A. P. (2002). Influence diagrams for causal modelling Owen, A. B. (2009). Karl Pearson’s meta-analysis revisited. and inference. Internat. Statist. Rev. 70 161–189. Ann. Statist. 37 3867–3892. MR2572446 Ellenberg, S. S. and Hamilton, J. M. (1989). Surrogate Pearl, J. (1988). Probabilistic Reasoning in Intelligent Sys- endpoints in clinical trials: Cancer. Stat. Med. 8 405–413. tems. Morgan Kaufmann, San Mateo, CA. Gelman, A. and Hill, J. (2007). Data Analysis Using Pearl, J. (1993). Graphical models, causality, and interven- Regression and Multilevel/Hierarchical Models. Analytical tion. Statist. Sci. 8 266–273. Methods for Social Research. Cambridge Univ. Press, New Pearl, J. (1995). Causal diagrams for empirical research. York. Biometrika 82 669–710. MR1380809 Glass, G. V. (1976). Primary, secondary, and meta-analysis Pearl, J. (2000). Causality: Models, Reasoning, and Infer- of research. Educational Researcher 5 3–8. ence. Cambridge Univ. Press, Cambridge. MR1744773 18 J. PEARL AND E. BAREINBOIM

Pearl, J. (2009a). Causal inference in statistics: An Rubin, D. B. (1974). Estimating causal effects of treatments overview. Stat. Surv. 3 96–146. MR2545291 in randomized and nonrandomized studies. J. Educational Pearl, J. (2009b). Causality: Models, Reasoning, and In- Psychology 66 688–701. ference, 2nd ed. Cambridge Univ. Press, Cambridge. Shadish, W. R., Cook, T. D. and Campbell, D. T. (2002). MR2548166 Experimental and Quasi-Experimental Designs for Gener- Pearl, J. (2011). The structural theory of causation. alized Causal Inference, 2nd ed. Houghton-Mifflin, Boston. In Causality in the Sciences (P. McKay Illari, F. Shpitser, I. and Pearl, J. (2006). Identification of condi- Russo and J. Williamson, eds.) 697–727. Clarendon tional interventional distributions. In Proceedings of the Press, Oxford. Twenty-Second Conference on Uncertainty in Artificial In- Pearl, J. (2012a). The causal foundations of structural equa- telligence (R. Dechter and T. S. Richardson, eds.) 437– tion modeling. In Handbook of Structural Equation Model- 444. AUAI Press, Corvallis, OR. ing (R. H. Hoyle, ed.). Guilford Press, New York. Spirtes, P., Glymour, C. and Scheines, R. (1993). Cau- Pearl, J. (2012b). Some thoughts concerning transfer learn- sation, Prediction, and Search. Lecture Notes in Statistics ing, with applications to meta-analysis and data-sharing es- 81. Springer, New York. MR1227558 timation. Technical Report R-387, Cognitive Systems Lab- Spirtes, P., Glymour, C. and Scheines, R. (2000). Cau- oratory, Dept. Computer Science, UCLA. sation, Prediction, and Search, 2nd ed. Adaptive Computa- Pearl, J. (2013). Linear models: A useful “microscope” for tion and Machine Learning. MIT Press, Cambridge, MA. 1 causal analysis. J. Causal Inference 155–170. MR1815675 Pearl, J. (2014). Trygve Haavelmo and the emergence Strotz, R. H. and Wold, H. O. A. (1960). Recursive vs. of causal calculus. Econometric Theory, Special Issue on nonrecursive systems: An attempt at synthesis. Economet- Haavelmo Centennial. Published online: 10 June 2014. rica 28 417–427. MR0120034 DOI:10.1017/S0266466614000231. Tian, J. and Pearl, J. (2002). A general identification condi- Pearl, J. and Bareinboim, E. (2011). Transportability tion for causal effects. In Proceedings of the Eighteenth Na- across studies: A formal approach. Technical Report R-372, tional Conference on Artificial Intelligence 567–573. AAAI Cognitive Systems Laboratory, Dept. Computer Science, Press/The MIT Press, Menlo Park, CA. UCLA. Verma, T. and Pearl, J. (1988). Causal networks: Seman- Petersen, M. L. (2011). Compound treatments, transporta- tics and expressiveness. In Proceedings of the Fourth Work- bility, and the structural causal model: The power and sim- 22 shop on Uncertainty in Artificial Intelligence 352–359. plicity of causal graphs. Epidemiology 378–381. 4 Prentice, R. L. (1989). Surrogate endpoints in clinical trials: Mountain View, CA. Also in Uncertainty in AI (1990) Definition and operational criteria. Stat. Med. 8 431–440. (R. Shachter, T. S. Levitt, L. N. Kanal and J. F. Lemmer, Richardson, T. (2003). Markov properties for acyclic eds.) 69–76. North-Holland, Amsterdam. MR1166827 directed mixed graphs. Scand. J. Stat. 30 145–157. Westergaard, H. (1916). Scope and method of statistics. 15 MR1963898 Publ. Amer. Statist. Assoc. 229–276. Robins, J. (1986). A new approach to causal inference White, H. and Chalak, K. (2009). Settable systems: An in mortality studies with a sustained exposure period— extension of Pearl’s causal model with optimization, equi- 10 Application to control of the healthy worker survivor effect. librium, and learning. J. Mach. Learn. Res. 1759–1799. Math. Modelling 7 1393–1512. MR0877758 MR2540776 Robins, J., Orellana, L. and Rotnitzky, A. (2008). Esti- Wright, S. (1921). Correlation and causation. J. Agricultural 20 mation and extrapolation of optimal treatment and testing Research 557–585. strategies. Stat. Med. 27 4678–4721. MR2528576 Yule, G. U. (1934). On some points relating to vital statis- Rosenthal, R. (1995). Writing meta-analytic reviews. Psy- tics, more especially statistics of occupational mortality. J. chological Bulletin 118 183–192. Roy. Statist. Soc. 97 1–84.