
The Causal Interpretations of Bayesian Hypergraphs

Zhiyu Wang1, Mohammad Ali Javidian2, Linyuan Lu1, Marco Valtorta2
1Department of Mathematics, 2Department of Computer Science and Engineering
University of South Carolina, Columbia, SC 29208

Abstract

We propose a causal interpretation of Bayesian hypergraphs (Javidian et al. 2018), a probabilistic graphical model whose structure is a directed acyclic hypergraph, that extends the causal interpretation of LWF chain graphs. We provide intervention formulas and a graphical criterion for intervention in Bayesian hypergraphs that specializes to a new graphical criterion for intervention in LWF chain graphs and sheds light on the causal interpretation of interaction as represented by undirected edges in LWF chain graphs or heads in Bayesian hypergraphs.

Introduction and Motivation

Probabilistic graphical models are graphs in which nodes represent random variables and edges represent conditional independence assumptions. They provide a compact way to represent the joint probability distribution of a set of random variables. In undirected graphical models, e.g., Markov networks (see (Darroch et al. 1980; Pearl 1988)), there is a simple rule for determining independence: two sets of nodes A and B are conditionally independent given C if removing C separates A and B. On the other hand, directed graphical models, e.g., Bayesian networks (see (Kiiveri, Speed, and Carlin 1984; Wermuth and Lauritzen 1983; Pearl 1988)), which consist of a directed acyclic graph (DAG) and a corresponding set of conditional probability tables, have a more complicated rule (d-separation) for determining independence. More complex graphical models include various types of graphs with edges of several types (e.g., (Cox and Wermuth 1993; 1996; Richardson and Spirtes 2002; Peña 2014)), including chain graphs (Lauritzen and Wermuth 1989; Lauritzen 1996), for which different interpretations have emerged (Andersson, Madigan, and Perlman 1996; Drton 2009).

Probabilistic Graphical Models (PGMs) enjoy a well-deserved popularity because they allow explicit representation of structural constraints in the language of graphs and similar structures. From the perspective of efficient belief update, factorization of the joint probability distribution of the random variables corresponding to the nodes of the graph is paramount, because it allows decomposition of the calculation of the evidence or of the posterior probability (Lauritzen and Jensen 1997). The proliferation of different PGMs that allow factorizations of different kinds leads us to consider a more general graphical structure in this paper, namely directed acyclic hypergraphs. Since there are many more hypergraphs than DAGs, undirected graphs, chain graphs, and, indeed, other graph-based networks, Bayesian hypergraphs can model much finer factorizations and thus are more computationally efficient. See (Javidian et al. 2018) for examples of how Bayesian hypergraphs can model graphically functional dependencies that are hidden in probability tables when using BNs or CGs. We provide a causal interpretation of Bayesian hypergraphs that extends the causal interpretation of LWF chain graphs (Lauritzen and Richardson 2002) by giving corresponding formulas and a graphical criterion for intervention, which operationalize the intuition that feedback processes in LWF chain graph and Bayesian hypergraph models turn into causal processes when variables in them are conditioned upon by intervention. The addition of feedback processes and their causal interpretation is a conceptual advance within the three-level causal hierarchy (Pearl 2009).

Bayesian Hypergraphs: Definitions

Hypergraphs are generalizations of graphs in which each edge is allowed to contain more than two vertices.
Formally, an (undirected) hypergraph is a pair H = (V, E), where V = {v_1, v_2, ..., v_n} is the set of vertices (or nodes) and E = {h_1, h_2, ..., h_m} is the set of hyperedges, where h_i ⊆ V for all i ∈ [m]. If |h_i| = k for every i ∈ [m], then we say H is a k-uniform (undirected) hypergraph. A directed hyperedge or hyperarc h is an ordered pair h = (X, Y) of (possibly empty) subsets of V with X ∩ Y = ∅; X is called the tail of h while Y is the head of h. We write X = T(h) and Y = H(h). We say a directed hyperedge h is fully directed if neither H(h) nor T(h) is empty. A directed hypergraph is a hypergraph such that all of the hyperedges are directed. An (s, t)-uniform directed hypergraph is a directed hypergraph such that the tail and head of every directed edge have size s and t, respectively. For example, any DAG is a (1, 1)-uniform directed hypergraph (but not vice versa), and an undirected graph is a (0, 2)-uniform directed hypergraph. Given a hypergraph H, we use V(H) and E(H) to denote the vertex set and edge set of H, respectively.

We say two vertices u and v are co-head (or co-tail) if there is a directed hyperedge h such that {u, v} ⊆ H(h) (or {u, v} ⊆ T(h), respectively). Given another vertex u ≠ v, we say u is a parent of v, denoted by u → v, if there is a directed hyperedge h such that u ∈ T(h) and v ∈ H(h). If u and v are co-head, then u is a neighbor of v; if u, v are neighbors, we denote them by u − v. Given v ∈ V, we define the parent (pa(v)), neighbor (nb(v)), boundary (bd(v)), ancestor (an(v)), anterior (ant(v)), descendant (de(v)), and non-descendant (nd(v)) sets for hypergraphs exactly as for graphs (and therefore use the same names). The same holds for the equivalent concepts for τ ⊆ V. Note that it is possible that some vertex u is both a parent and a neighbor of v.

A partially directed cycle in H is a sequence {v_1, v_2, ..., v_k} such that v_i is either a neighbor or a parent of v_{i+1} for all 1 ≤ i ≤ k, and v_i → v_{i+1} for some 1 ≤ i ≤ k, where v_{k+1} ≡ v_1. We say a directed hypergraph H is acyclic if H contains no partially directed cycle. For ease of reference, we call a directed acyclic hypergraph a DAH or a Bayesian hypergraph structure (as used in Definition 1). Note that for any two vertices u, v in a directed acyclic hypergraph H, u cannot be both a parent and a neighbor of v; otherwise we would have a partially directed cycle.

Remark 1. DAHs are generalizations of undirected graphs, DAGs, and chain graphs. In particular, an undirected graph can be viewed as a DAH in which every hyperedge is of the form (∅, {u, v}). A DAG is a DAH in which every hyperedge is of the form ({u}, {v}). A chain graph is a DAH in which every hyperedge is of the form (∅, {u, v}) or ({u}, {v}).

We define the chain components of H as the equivalence classes under the equivalence relation in which two vertices v_1, v_t are equivalent if there exists a sequence of distinct vertices v_1, v_2, ..., v_t such that v_i and v_{i+1} are co-head for all i ∈ [t − 1]. The chain components {τ : τ ∈ D} yield a unique natural partition of the vertex set V(H) = \bigcup_{\tau \in D} \tau with the following property:

Proposition 1. Let H be a DAH and {τ : τ ∈ D} be its chain components. Let G be the graph obtained from H by contracting each element of {τ : τ ∈ D} into a single vertex and creating a directed edge from τ_i ∈ V(G) to τ_j ∈ V(G) in G if and only if there exists a hyperedge h ∈ E(H) such that T(h) ∩ τ_i ≠ ∅ and H(h) ∩ τ_j ≠ ∅. Then G is a DAG.

Proof. See (Javidian et al. 2018). □

Note that the DAG obtained in Proposition 1 is unique; given a DAH H, we call such a G the canonical DAG of H.

Definition 1. A Bayesian hypergraph is a triple (V, H, P) such that V is a set of random variables, H is a DAH on the vertex set V, and P is a multivariate probability distribution on V such that the local Markov property holds with respect to the DAH H, i.e., for any vertex v ∈ V(H),

    v \perp (nd(v) \setminus cl(v)) \mid bd(v).    (1)

For a Bayesian hypergraph H whose underlying DAH is an LWF DAH, we call H an LWF Bayesian hypergraph.

Bayesian hypergraphs factorizations

The factorization of a probability measure P according to a Bayesian hypergraph is similar to that of a chain graph. Before we present the factorization property, let us introduce some additional terminology. Given a DAH H, we use H^u to denote the undirected hypergraph obtained from H by replacing each directed hyperedge h = (A, B) of H with an undirected hyperedge A ∪ B. Given a family of sets F, define a partial order (F, ≤) on F such that for two sets A, B ∈ F, A ≤ B if and only if A ⊆ B. Let M(F) denote the set of maximal elements of F, i.e., no element of M(F) is contained in another element of F. When F is a set of directed hyperedges, we abuse notation and write M(F) = M(F^u). Let H be a directed acyclic hypergraph and {τ : τ ∈ D} be its chain components. Assume that a probability distribution P has a density f with respect to some product measure μ = ×_{α∈V} μ_α on X = ×_{α∈V} X_α. We say that the probability measure P factorizes according to H if it has a density f such that:

(i) f factorizes as in the directed acyclic case:

    f(x) = \prod_{\tau \in D} f(x_\tau \mid x_{pa(\tau)}).    (2)

(ii) For each τ ∈ D, define H*_τ to be the subhypergraph of H_{τ∪pa(τ)} containing all edges h in H_{τ∪pa(τ)} such that H(h) ⊆ τ. Then

    f(x_\tau \mid x_{pa(\tau)}) = \prod_{h \in M(H^*_\tau)} \psi_h(x),    (3)

where the ψ_h are non-negative functions depending only on x_h and \int_{X_\tau} \prod_{h \in M(H^*_\tau)} \psi_h(x) \, \mu_\tau(dx_\tau) = 1. Equivalently, we can also write f(x_\tau \mid x_{pa(\tau)}) as

    f(x_\tau \mid x_{pa(\tau)}) = Z^{-1}(x_{pa(\tau)}) \prod_{h \in M(H^*_\tau)} \psi_h(x),    (4)

where Z(x_{pa(\tau)}) = \int_{X_\tau} \prod_{h \in M(H^*_\tau)} \psi_h(x) \, \mu_\tau(dx_\tau).
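To make the preceding definitions and Proposition 1 concrete, here is a minimal Python sketch; the encoding of a DAH as (tail, head) pairs and all class and function names are ours, for illustration only, not from (Javidian et al. 2018).

```python
from itertools import combinations

class DAH:
    """A directed acyclic hypergraph encoded as (tail, head) pairs of
    disjoint vertex sets. Illustrative encoding only."""

    def __init__(self, vertices, hyperedges):
        self.V = set(vertices)
        self.E = [(frozenset(t), frozenset(h)) for t, h in hyperedges]

    def chain_components(self):
        # Union-find over the co-head relation: u ~ v if some hyperedge
        # has both u and v in its head.
        parent = {v: v for v in self.V}
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        for _, head in self.E:
            for u, v in combinations(head, 2):
                parent[find(u)] = find(v)
        comps = {}
        for v in self.V:
            comps.setdefault(find(v), set()).add(v)
        return list(comps.values())

    def canonical_dag(self):
        # Proposition 1: contract each chain component to a single node and
        # add tau_i -> tau_j iff some hyperedge's tail meets tau_i and its
        # head meets tau_j.
        comps = [frozenset(c) for c in self.chain_components()]
        comp_of = {v: c for c in comps for v in c}
        edges = {(comp_of[u], comp_of[w])
                 for t, h in self.E for u in t for w in h
                 if comp_of[u] != comp_of[w]}
        return comps, edges

# The DAH of Figure 3(b) below, with hyperedges ({a},{c,d}) and ({a,b},{d,e}):
H = DAH("abcde", [({"a"}, {"c", "d"}), ({"a", "b"}, {"d", "e"})])
print(H.chain_components())  # [{'a'}, {'b'}, {'c', 'd', 'e'}] (order may vary)
```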
Remark 2. One of the key advantages of Bayesian hypergraphs is that they allow much finer factorizations of probability distributions compared to chain graph models. We illustrate with a simple example in Figure 1.

[Figure 1: (1) a chain graph G; (2) a Bayesian hypergraph H, both on vertices a, b, c, d.]

Note that in Figure 1 (1), the factorization according to G is

    f(x) = f(x_a) f(x_b) f(x_{cd} \mid x_{ab}) = f(x_a) f(x_b) \psi_{abcd}(x).

In Figure 1 (2), the factorization according to H is

    f(x) = f(x_a) f(x_b) f(x_{cd} \mid x_{ab}) = f(x_a) f(x_b) \psi_{abc}(x) \psi_{abd}(x) \psi_{cd}(x).

Note that although G and H have the same global Markov properties, the factorization according to H is one step finer than the factorization according to G. Suppose each of the variables in {a, b, c, d} can take k values. Then the factorization according to G requires a conditional probability table of size k^4, while the factorization according to H only needs tables of size Θ(k^3) asymptotically. Hence, a Bayesian hypergraph model allows much finer factorizations and thus achieves higher memory efficiency.

Remark 3. We remark that the factorization formula defined in (3) is in fact the most general possible, in the sense that it allows all possible factorizations of a probability distribution admitted by a DAH. In particular, given a Bayesian hypergraph H and one of its chain components τ, the factorization scheme in (3) allows a distinct function for each maximal subset of τ ∪ pa_D(τ) that intersects τ (pa_D(τ) is the parent set of τ in the canonical DAG of H). For each subset S of τ ∪ pa_D(τ) that does not intersect τ, recall that the factorization in (3) can be rewritten as

    f(x_\tau \mid x_{pa(\tau)}) = \frac{\prod_{h \in M(H^*_\tau)} \psi_h(x)}{\int_{X_\tau} \prod_{h \in M(H^*_\tau)} \psi_h(x) \, \mu_\tau(dx_\tau)}.

Observe that a factor ψ_S(x) for such an S does not depend on the values of the variables in τ. Hence ψ_S(x) can be factored out of the integral above and cancels with itself in f(x_\tau \mid x_{pa(\tau)}). Thus, the factorization formula in (3) or (4) in fact allows distinct functions for all possible maximal subsets of τ ∪ pa_D(τ).
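As a quick sanity check on the table-size claim in Remark 2, the following sketch counts the entries of the tables needed by the two factorizations; the function and variable names are ours, and k = 10 is an arbitrary choice.

```python
# Counting table entries for the two factorizations in Figure 1 (Remark 2);
# k is the number of values each variable can take.
def table_entries(scopes, k):
    return sum(k ** len(s) for s in scopes)

k = 10
size_G = table_entries([{"a"}, {"b"}, {"a", "b", "c", "d"}], k)
size_H = table_entries([{"a"}, {"b"}, {"a", "b", "c"}, {"a", "b", "d"}, {"c", "d"}], k)
print(size_G, size_H)  # 10020 vs. 2120: Theta(k^4) vs. Theta(k^3)
```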
Intervention in Bayesian hypergraphs

Formally, intervention in Bayesian hypergraphs can be defined analogously to intervention in LWF chain graphs (Lauritzen and Richardson 2002). In this section, we give graphical procedures that are consistent with the intervention formulas for chain graphs (Equations (5), (6)) and for Bayesian hypergraphs (Equations (7), (8)). Before we present the details, we need some additional definitions and tools to determine when factorizations according to two chain graphs or DAHs are equivalent in the sense that they can be written as products of the same type of functions (functions that depend on the same sets of variables). We say two chain graphs G_1, G_2 admit the same factorization decomposition if every probability density f that factorizes according to G_1 also factorizes according to G_2, and vice versa. Similarly, two DAHs H_1, H_2 admit the same factorization decomposition if every probability density f that factorizes according to H_1 also factorizes according to H_2, and vice versa.

Factorization equivalence and intervention in LWF chain graphs

In this subsection, we give graphical procedures to model intervention based on the formula introduced in (Lauritzen and Richardson 2002). Let us first give some background. In many statistical contexts, we would like to modify the distribution of a variable Y by intervening externally and forcing the value of another variable X to be x. This is commonly referred to as conditioning by intervention or conditioning by action and denoted by Pr(y‖x) or Pr(y | X ← x). Other expressions such as Pr(Y_x = y), P_{man(x)}(y), set(X = x), X = x̂, or do(X = x) have also been used to denote intervention conditioning (Splawa-Neyman 1990; Rubin 1974; Pearl 1993; 1995; 2009).

Let G be a chain graph with chain components {τ : τ ∈ D}. Moreover, assume that a subset A of the variables in V(G) is set such that for every a ∈ A, x_a = a_0. Lauritzen and Richardson, in (Lauritzen and Richardson 2002), generalized the conditioning-by-intervention formula for DAGs and gave the following formula for intervention in chain graphs (where it is understood that the probability of any configuration of variables inconsistent with the intervention is zero). A probability density f factorizes according to G (with A intervened) if

    f(x \| x_A) = \prod_{\tau \in D} f(x_{\tau \setminus A} \mid x_{pa(\tau)}, x_{\tau \cap A}).    (5)

Moreover, for each τ ∈ D,

    f(x_{\tau \setminus A} \mid x_{pa(\tau)}, x_{\tau \cap A}) = Z^{-1}(x_{pa(\tau)}, x_{\tau \cap A}) \prod_{h \in C} \psi_h(x),    (6)

where C is the set of maximal cliques in (G_{\tau \cup pa(\tau)})^m and Z(x_{pa(\tau)}, x_{\tau \cap A}) = \int_{X_{\tau \setminus A}} \prod_{h \in C} \psi_h(x) \, \mu_{\tau \setminus A}(dx_{\tau \setminus A}).

Definition 2. Let G_1 and G_2 be two chain graphs. Given subsets A_1 ⊆ V(G_1) and A_2 ⊆ V(G_2), we say (G_1, A_1) and (G_2, A_2) are factorization-equivalent[1] if they become the same chain graph after removing from G_i all vertices in A_i, together with the edges incident to vertices in A_i, for i ∈ {1, 2}. Typically, A_i in Definition 2 is a set of constant variables in V(G_i) created by intervention.

[1] This term was defined for a different purpose in (Studený 1992).

Theorem 1. Let G_1 and G_2 be two chain graphs defined on the same set of variables V. Moreover, suppose a common set of variables A in V is set by intervention such that for every a ∈ A, x_a = a_0. If (G_1, A) and (G_2, A) are factorization-equivalent, then G_1 and G_2 admit the same factorization decomposition.

Proof. Let G_0 be the chain graph obtained from G_1 by removing all vertices in A and the edges incident to A. It suffices to show that G_1 and G_2 both admit the same factorization decomposition as G_0. Let D_1, D_0 be the sets of chain components of G_1 and G_0, respectively. Let τ ∈ D_1 be an arbitrary chain component of G_1. By the factorization formula in (6), it follows that

    f(x_{\tau \setminus A} \mid x_{pa(\tau)}, x_{\tau \cap A}) = Z^{-1}(x_{pa(\tau)}, x_{\tau \cap A}) \prod_{h \in C} \psi_h(x),

where C is the set of maximal cliques in (G_{\tau \cup pa(\tau)})^m and Z(x_{pa(\tau)}, x_{\tau \cap A}) = \int_{X_{\tau \setminus A}} \prod_{h \in C} \psi_h(x) \, \mu_{\tau \setminus A}(dx_{\tau \setminus A}). Notice that for any maximal clique h_1 ∈ C such that h_1 ∩ A = ∅, h_1 is also a clique in (G_0[τ\A])^m. For h_1 ∈ C with h_1 ∩ A ≠ ∅, there are two cases:

Case 1: (h_1 ∩ τ) \ A ≠ ∅. In this case, observe that h_1 \ A is also a clique in (G_0[τ\A])^m, and thus is contained in some maximal clique h′ in (G_0[τ\A])^m. Since all variables in A are pre-set as constants, it follows that ψ_{h_1}(x) also appears in a factor in the factorization of f according to G_0.

Case 2: h_1 ∩ τ ⊆ A. In this case, note that h_1 ∩ τ is disjoint from τ \ A. Hence ψ_{h_1}(x) appears as a factor independent of x_{τ\A} in both Z^{-1}(x_{pa(\tau)}, x_{\tau \cap A}) and \prod_{h \in C} \psi_h(x), and thus cancels with itself.

Thus every probability density f that factorizes according to G_1 also factorizes according to G_0. On the other hand, it is easy to see that for every τ′ ∈ D_0 and every maximal clique h′ in (G_0[τ′])^m, h′ is contained in some maximal clique h in (G_1[τ])^m for some τ ∈ D_1. Hence we can conclude that G_1 and G_0 admit the same factorization decomposition. The same argument also works for G_2 and G_0. Thus, G_1 and G_2 admit the same factorization decomposition. □
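The deletion test in Definition 2 is easy to mechanize. The sketch below uses our own encoding of a chain graph as a pair (undirected edges, directed edges); the example graphs g and ghat are hypothetical and chosen only to exercise the test.

```python
# Definition 2: (G1, A1) and (G2, A2) are factorization-equivalent iff
# removing the intervened vertices and their incident edges leaves the
# same chain graph.
def delete_vertices(chain_graph, A):
    undirected, directed = chain_graph
    kept_u = {e for e in undirected if not (set(e) & A)}
    kept_d = {(u, w) for u, w in directed if u not in A and w not in A}
    return kept_u, kept_d

def factorization_equivalent(g1, A1, g2, A2):
    return delete_vertices(g1, A1) == delete_vertices(g2, A2)

# Hypothetical example: c -- d -- e undirected, b -> e directed; ghat is g
# with c's undirected edge turned into c -> d (cf. Theorem 2 below).
g    = ({frozenset("cd"), frozenset("de")}, {("b", "e")})
ghat = ({frozenset("de")}, {("b", "e"), ("c", "d")})
print(factorization_equivalent(g, {"c"}, ghat, {"c"}))  # True
```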
We now define a graphical procedure (call it the redirection procedure) that is consistent with the intervention formulas in Equations (5) and (6). Let G be a chain graph. Given an intervened set of variables A ⊆ V(G), let Ĝ be the chain graph obtained from G by performing the following operation: for every u ∈ A and every undirected edge e = {u, w} containing u, replace e by a directed edge from u to w; finally, remove all directed edges that point to some vertex in A. By replacing the undirected edges with directed edges, we replace any feedback mechanism that includes a variable in A with a causal mechanism. The intuition behind the procedure is the following: since a variable that is set by intervention cannot be modified, the symmetric feedback relation is turned into an asymmetric causal one. Similarly, we can justify this graphical procedure as equivalent to "striking out" some equations in the Gibbs process on top of p. 338 of (Lauritzen and Richardson 2002), as Lauritzen and Richardson (Richardson 2018) did for Equation (18) in (Lauritzen and Richardson 2002).

Theorem 2. Let G be a chain graph with a subset of variables A ⊆ V(G) set by intervention such that x_a = a_0 for every a ∈ A. Let Ĝ be obtained from G by the redirection procedure. Then G and Ĝ admit the same factorization decomposition.

Proof. It is not hard to see that removing from Ĝ and G all vertices in A and all edges incident to A results in the same chain graph. Hence, by Theorem 1, G and Ĝ admit the same factorization decomposition. □
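The redirection procedure itself can be sketched in the same encoding; this is our illustrative rendering of the operation just described, not code from the paper.

```python
# Redirection procedure for chain graphs: every undirected edge {u, w} with
# u in A becomes the directed edge u -> w, and directed edges pointing into
# A are removed.
def redirect(chain_graph, A):
    undirected, directed = chain_graph
    kept_u = {e for e in undirected if not (set(e) & A)}
    new_d = set(directed)
    for e in undirected - kept_u:
        u, w = tuple(e)
        if u in A:
            new_d.add((u, w))
        if w in A:
            new_d.add((w, u))
    # drop directed edges whose head is an intervened vertex
    new_d = {(u, w) for u, w in new_d if w not in A}
    return kept_u, new_d

# For the hypothetical g of the previous snippet, redirect(g, {"c"}) equals
# ghat, so the check suggested by Theorem 2 passes:
# factorization_equivalent(g, {"c"}, redirect(g, {"c"}), {"c"})  -> True
```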
Example 1. Consider the chain graph G shown in Figure 2. Let Ĝ be the graph obtained from G through the redirection procedure described in this subsection, and let G_0 be the chain graph obtained from G by deleting the vertex c_0 and the edges incident to c_0. We compare the factorization decompositions given by formulas (5), (6) with those given by the graph structures Ĝ and G_0.

[Figure 2: (a) A chain graph G; (b) the graph Ĝ obtained from G through the redirection procedure; (c) the graph G_0 obtained from G by deleting the variables in A.]

By formulas (5) and (6) proposed in (Lauritzen and Richardson 2002), when x_c is set to c_0 by intervention,

    f(x \| x_c) = f(x_a) f(x_b) f(x_{de} \mid x_{a b c_0}) = f(x_a) f(x_b) \frac{\psi_{a c_0 d}(x) \psi_{abde}(x)}{\sum_{d,e} \psi_{a c_0 d}(x) \psi_{abde}(x)}.

Now consider the factorization according to Ĝ. The chain components of Ĝ are {{a}, {b}, {c}, {d, e}}, with x_c set to c_0. The factorization according to Ĝ is

    f(x \| x_c) = f_{\hat G}(x_a) f_{\hat G}(x_b) f_{\hat G}(x_c) f_{\hat G}(x_{de} \mid x_{a b c_0}) = f_{\hat G}(x_a) f_{\hat G}(x_b) f_{\hat G}(x_c) \frac{\psi_{a c_0 d}(x) \psi_{abde}(x)}{\sum_{d,e} \psi_{a c_0 d}(x) \psi_{abde}(x)},

where f(x_c) = 1 when x_c = c_0 and 0 otherwise. Hence G and Ĝ admit the same factorization.

Now consider the factorization according to G_0. The chain components of G_0 are {{a}, {b}, {d, e}}. The factorization according to G_0 is

    f_0(x) = f_0(x_a) f_0(x_b) f_0(x_{de} \mid x_{ab}) = f_0(x_a) f_0(x_b) \frac{\psi_{ad}(x) \psi_{abde}(x)}{\sum_{d,e} \psi_{ad}(x) \psi_{abde}(x)}.

Observe that f_0(x) has the same form of decomposition as f(x \| x_c), since x_c is set to c_0 in ψ_{a c_0 d}(x) (with the understanding that the probability of any configuration of variables with x_c ≠ c_0 is zero). Hence we can conclude that G, Ĝ (with x_c intervened), and G_0 admit the same factorization decomposition.
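Example 1 can also be checked numerically. The sketch below fills the cliques of Example 1 with hypothetical random positive tables and verifies that the conditional obtained from G with x_c intervened (Equation (6)) coincides with the one obtained from G_0, whose factor ψ_ad is the c = c_0 slice of ψ_{a c_0 d}.

```python
import itertools, random

random.seed(0)
k, c0 = 2, 1  # binary variables; c is intervened to the value c0

# Hypothetical positive potential tables for the cliques of Example 1.
psi_acd  = {i: random.random() + 0.1 for i in itertools.product(range(k), repeat=3)}
psi_abde = {i: random.random() + 0.1 for i in itertools.product(range(k), repeat=4)}

def f_de_given(a, b, c):
    # Equation (6) for the chain component {d, e}: normalize over d, e only.
    w = {(d, e): psi_acd[a, c, d] * psi_abde[a, b, d, e]
         for d in range(k) for e in range(k)}
    z = sum(w.values())
    return {de: v / z for de, v in w.items()}

# G0's factor psi_ad is the c = c0 slice of psi_acd.
psi_ad = {(a, d): psi_acd[a, c0, d] for a in range(k) for d in range(k)}

def f0_de_given(a, b):
    w = {(d, e): psi_ad[a, d] * psi_abde[a, b, d, e]
         for d in range(k) for e in range(k)}
    z = sum(w.values())
    return {de: v / z for de, v in w.items()}

for a, b in itertools.product(range(k), repeat=2):
    lhs, rhs = f_de_given(a, b, c0), f0_de_given(a, b)
    assert all(abs(lhs[de] - rhs[de]) < 1e-12 for de in lhs)
print("G (with x_c intervened) and G0 induce the same conditional on {d, e}")
```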
Factorization equivalence and intervention in Bayesian hypergraphs

Intervention in Bayesian hypergraphs can be modeled analogously to the case of chain graphs. We use the same notation as before. Let H be a DAH and {τ : τ ∈ D} be its chain components. Moreover, assume that a subset A of the variables in V(H) is set such that for every a ∈ A, x_a = a_0. Then a probability density f factorizes according to H (with A intervened) as follows (where it is understood that the probability of any configuration of variables inconsistent with the intervention is zero):

    f(x \| x_A) = \prod_{\tau \in D} f(x_{\tau \setminus A} \mid x_{pa(\tau)}, x_{\tau \cap A}).    (7)

For each τ ∈ D, define H*_τ to be the subhypergraph of H_{τ∪pa_D(τ)} containing all edges h in H_{τ∪pa(τ)} such that H(h) ⊆ τ; then

    f(x_{\tau \setminus A} \mid x_{pa(\tau)}, x_{\tau \cap A}) = Z^{-1}(x_{pa(\tau)}, x_{\tau \cap A}) \prod_{h \in M(H^*_\tau)} \psi_h(x),    (8)

where Z(x_{pa(\tau)}, x_{\tau \cap A}) = \int_{X_{\tau \setminus A}} \prod_{h \in M(H^*_\tau)} \psi_h(x) \, \mu_{\tau \setminus A}(dx_{\tau \setminus A}) and the ψ_h are non-negative functions that depend only on x_h.

Definition 3. Let H_1 and H_2 be two Bayesian hypergraphs. Given subsets of variables A_1 ⊆ V(H_1) and A_2 ⊆ V(H_2), we say (H_1, A_1) and (H_2, A_2) are factorization-equivalent if performing the following operations on H_1 and H_2 results in the same directed acyclic hypergraph:

(i) deleting all hyperedges with empty head, i.e., hyperedges of the form (S, ∅);
(ii) deleting every hyperedge that is contained in some other hyperedge, i.e., deleting h if there is another h′ such that T(h) ⊆ T(h′) and H(h) ⊆ H(h′);
(iii) shrinking all hyperedges of H_i containing vertices in A_i, i.e., replacing every hyperedge h of H_i by h′ = (T(h)\A_i, H(h)\A_i), for i ∈ {1, 2}.

Typically, A_i in Definition 3 is a set of constant variables in V created by intervention.
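The three reduction operations of Definition 3 can be sketched as follows, reusing the (tail, head) encoding of the earlier snippets; applying them once in the order (iii), (i), (ii) suffices for the example shown, though the definition allows repeated application.

```python
# A sketch of the three operations in Definition 3.
def reduce_dah(hyperedges, A):
    A = frozenset(A)
    E = {(t - A, h - A) for t, h in hyperedges}          # (iii) shrink
    E = {(t, h) for t, h in E if h}                      # (i) drop empty heads
    return {e for e in E                                 # (ii) drop dominated
            if not any(e != f and e[0] <= f[0] and e[1] <= f[1] for f in E)}

# For the DAH of Figure 3(b) with c intervened, ({a},{c,d}) shrinks to
# ({a},{d}), which is contained in ({a,b},{d,e}) and is therefore removed:
E = {(frozenset("a"), frozenset("cd")), (frozenset("ab"), frozenset("de"))}
print(reduce_dah(E, "c"))  # {(frozenset({'a','b'}), frozenset({'d','e'}))}, up to ordering
```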
Theorem 3. Let H_1 and H_2 be two DAHs defined on the same set of variables V. Moreover, suppose a common set of variables A in V is set by intervention such that for every a ∈ A, x_a = a_0. If (H_1, A) and (H_2, A) are factorization-equivalent, then H_1 and H_2 admit the same factorization decomposition.

Proof. Similarly to the proof of Theorem 1, let H_0 be the DAH obtained from H_1 (or H_2) by performing the operations above repeatedly. Let D_1 and D_0 be the sets of chain components of H_1 and H_0, respectively. First, note that operation (i) does not affect the factorization, since hyperedges of the form h = (S, ∅) never appear in the factorization decomposition, because H(h) ∩ τ = ∅ for every τ ∈ D_1. Second, operation (ii) does not change the factorization decomposition either: if a hyperedge h is contained in another hyperedge h′ as defined, then ψ_h(x) can simply be absorbed into ψ_{h′}(x) by replacing ψ_{h′}(x) with ψ_{h′}(x) · ψ_h(x).

Now let τ ∈ D_1 be an arbitrary chain component of H_1 and h_1 ∈ H_1[τ]*, i.e., a hyperedge of H_1 whose head intersects τ. Suppose that τ is separated into several chain components τ′_1, τ′_2, ..., τ′_t in H_0 because of the shrinking operation. If h_1 ∩ A = ∅, then h_1 is also a hyperedge in H_0[τ\A]*. If h_1 ∩ A ≠ ∅, there are two cases:

Case 1: H(h_1) ⊆ A. Then, since the variables in A are constants, it follows that in Equation (8), ψ_{h_1}(x) does not depend on the variables in τ\A. Hence ψ_{h_1}(x) appears as a factor independent of the variables in τ\A in both Z^{-1}(x_{pa(\tau)}, x_{\tau \cap A}) and \prod_{h \in M(H^*_\tau)} \psi_h(x), and thus cancels with itself. Note that h_1 does not exist in H_0 either, since h_1 becomes a hyperedge with empty head after being shrunk and is thus deleted in operation (i).

Case 2: H(h_1)\A ≠ ∅. In this case, H(h_1)\A must be entirely contained in one of {τ′_1, ..., τ′_t}; without loss of generality, say H(h_1)\A ⊆ τ′_1 in H_0. Then note that h_1\A must be contained in some maximal hyperedge h′ in E(H_0) such that H(h′) ∩ τ′_1 ≠ ∅. Moreover, recall that the variables in A are constants. Hence ψ_{h_1} must appear in some factor in the factorization of f according to H_0.

Thus every probability density f that factorizes according to H_1 also factorizes according to H_0. On the other hand, it is not hard to see that for every τ′ ∈ D_0 and every hyperedge h′ in (H_0[τ′])*, h′ is contained in some maximal hyperedge h in (H_1[τ])* for some τ ∈ D_1. Hence we can conclude that H_1 and H_0 admit the same factorization decomposition. The same argument also works for H_2 and H_0. Thus, H_1 and H_2 admit the same factorization decomposition. □

We now present a graphical procedure (again called the redirection procedure) for modeling intervention in Bayesian hypergraphs. Let H be a DAH and {τ : τ ∈ D} be its chain components. Suppose a set of variables x_A is set by intervention. We then modify H as follows: for each hyperedge h ∈ E(H) such that S = H(h) ∩ A ≠ ∅, replace the hyperedge h by h′ = (T(h) ∪ S, H(h)\S). If a hyperedge has the empty set as its head, delete that hyperedge. Call the resulting hypergraph Ĥ_A. We will show that the factorization according to Ĥ_A is consistent with Equation (8).

Theorem 4. Let H be a Bayesian hypergraph and {τ : τ ∈ D} be its chain components. Given an intervened set of variables x_A, let Ĥ_A be the DAH obtained from H by replacing each hyperedge h ∈ E(H) satisfying S = H(h) ∩ A ≠ ∅ by the hyperedge h′ = (T(h) ∪ S, H(h)\S) and removing hyperedges with empty head. Then H and Ĥ_A admit the same factorization decomposition.

Proof. This is a corollary of Theorem 3, since performing the operations (i)–(iii) in the definition of factorization-equivalence of DAHs on H and Ĥ_A results in the same DAH. □
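The following sketch renders this procedure in the same encoding as the earlier snippets; applied to the DAH H of Example 2 below, it produces exactly the hyperedges of Ĥ in Figure 3(c).

```python
# Redirection procedure for DAHs (Theorem 4): intervened vertices move from
# heads to tails, and hyperedges left with an empty head are deleted.
def redirect_dah(hyperedges, A):
    A = frozenset(A)
    E = set()
    for t, h in hyperedges:
        s = h & A                                   # intervened part of head
        t, h = (t | s, h - s) if s else (t, h)
        if h:                                       # drop empty-head edges
            E.add((t, h))
    return E

# Figure 3(b) -> Figure 3(c): ({a},{c,d}) becomes ({a,c},{d}) when c is
# intervened, while ({a,b},{d,e}) is untouched.
E = {(frozenset("a"), frozenset("cd")), (frozenset("ab"), frozenset("de"))}
print(redirect_dah(E, "c"))
```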
Example 2. Let G be the chain graph shown in Figure 3(a) and H be the canonical LWF Bayesian hypergraph of G shown in Figure 3(b), constructed by the procedure in (Javidian et al. 2018, Section 2.4). H has two directed hyperedges, ({a}, {c, d}) and ({a, b}, {d, e}). Applying the redirection procedure for intervention in Bayesian hypergraphs leads to the Bayesian hypergraph Ĥ in Figure 3(c). We show that using Equations (5) and (6) for Figure 3(a) leads to the same result as using the factorization formula for the Bayesian hypergraph in Figure 3(c).

[Figure 3: (a) A chain graph G; (b) the canonical LWF DAH H of G; (c) the resulting hypergraph Ĥ after performing the graphical procedure on H when the variable c is intervened.]

First, we compute f(x \| x_c) for the chain graph in Figure 3(a). Based on Equation (5), we have

    f(x \| x_c) = f(x_a) f(x_b) f(x_{de} \mid x_{a b c_0}),

as the effect of the atomic intervention do(X_c = c_0). Then, using Equation (6) gives

    f(x \| x_c) = f(x_a) f(x_b) \frac{\psi_{a c_0 d}(x) \psi_{abde}(x)}{\sum_{d,e} \psi_{a c_0 d}(x) \psi_{abde}(x)}.    (9)

Now, we compute f(x) for the Bayesian hypergraph in Figure 3(c). Using Equation (2) gives

    f(x \| x_c) = f(x_a) f(x_b) f(x_c) f(x_{de} \mid x_{a b c_0}).

Applying formula (3) gives

    f(x \| x_c) = f(x_a) f(x_b) f(x_c) \frac{\psi_{a c_0 d}(x) \psi_{abde}(x)}{\sum_{d,e} \psi_{a c_0 d}(x) \psi_{abde}(x)}.    (10)

Note that f(x_c) = 1 when x_c = c_0, and f(x_c) = 0 otherwise. As a result, the right-hand sides of Equations (9) and (10) are the same.

[Figure 4: Commutative diagram of factorization equivalence.]

Remark 4. Figure 4 summarizes our results. Given a chain graph G and its canonical LWF DAH H, Theorem 4 in (Javidian et al. 2018) shows that G and H admit the same factorization decomposition. Suppose a set of variables A is set by intervention. Theorems 1 and 2 show that the chain graphs obtained from G by the redirection procedure or by deleting the variables in A admit the same factorization decomposition, which is also consistent with the intervention formula introduced in (Lauritzen and Richardson 2002). Similarly, Theorems 3 and 4 show that the DAHs obtained from H by the redirection procedure or by shrinking the variables in A admit the same factorization decomposition, which is consistent with a hypergraph analogue of the formula in (Lauritzen and Richardson 2002).

Acknowledgments. This work is primarily supported by Office of Naval Research grant ONR N00014-17-1-2842.

References
Andersson, S. A.; Madigan, D.; and Perlman, M. D. 1996. Alternative Markov properties for chain graphs. In Uncertainty in Artificial Intelligence, 40–48.
Cox, D. R., and Wermuth, N. 1993. Linear dependencies represented by chain graphs. Statistical Science 8(3):204–218.
Cox, D. R., and Wermuth, N. 1996. Multivariate Dependencies: Models, Analysis and Interpretation. Chapman and Hall.
Darroch, J. N.; Lauritzen, S. L.; and Speed, T. P. 1980. Markov fields and log-linear interaction models for contingency tables. The Annals of Statistics 8(3):522–539.
Drton, M. 2009. Discrete chain graph models. Bernoulli 15(3):736–753.
Javidian, M. A.; Lu, L.; Valtorta, M.; and Wang, Z. 2018. On a hypergraph probabilistic graphical model. https://arxiv.org/abs/1811.08372.
Kiiveri, H.; Speed, T. P.; and Carlin, J. B. 1984. Recursive causal models. Journal of the Australian Mathematical Society 36:30–52.
Lauritzen, S., and Jensen, F. V. 1997. Local computations with valuations from a commutative semigroup. Annals of Mathematics and Artificial Intelligence 21:51–69.
Lauritzen, S., and Richardson, T. 2002. Chain graph models and their causal interpretations. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 64(3):321–348.
Lauritzen, S., and Wermuth, N. 1989. Graphical models for associations between variables, some of which are qualitative and some quantitative. The Annals of Statistics 17(1):31–57.
Lauritzen, S. 1996. Graphical Models. Oxford Science Publications.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.
Pearl, J. 1993. [Bayesian analysis in expert systems]: Comment: Graphical models, causality and intervention. Statistical Science 8(3):266–269.
Pearl, J. 1995. Causal diagrams for empirical research. Biometrika 82(4):669–710.
Pearl, J. 2009. Causality: Models, Reasoning, and Inference. Cambridge University Press.
Peña, J. M. 2014. Marginal AMP chain graphs. International Journal of Approximate Reasoning 55:1185–1206.
Richardson, T. S., and Spirtes, P. 2002. Ancestral graph Markov models. The Annals of Statistics 30(4):962–1030.
Richardson, T. S. 2018. Personal communication.
Rubin, D. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5):688–701.
Splawa-Neyman, J. 1990. On the application of probability theory to agricultural experiments: Essay on principles. Statistical Science 5(4):465–472.
Studený, M. 1992. Conditional independence relations have no finite complete characterization. In Kubík, S., and Víšek, J., eds., Information Theory, Statistical Decision Functions and Random Processes: Proceedings of the 11th Prague Conference, volume B, 377–396.
Wermuth, N., and Lauritzen, S. L. 1983. Graphical and recursive models for contingency tables. Biometrika 70(3):537–552.