
Equivalence Between Minimal Generative Model Graphs and Directed Information Graphs

Christopher J. Quinn, Department of Electrical and Computer Engineering, University of Illinois, Urbana, Illinois 61801. Email: [email protected]
Negar Kiyavash, Department of Industrial and Enterprise Systems Engineering, University of Illinois, Urbana, Illinois 61801. Email: [email protected]
Todd P. Coleman, Department of Electrical and Computer Engineering, University of Illinois, Urbana, Illinois 61801. Email: [email protected]

Abstract—We propose a new type of probabilistic graphical model, based on directed information, to represent the causal dynamics between processes in a stochastic system. We show the practical significance of such graphs by proving their equivalence to generative model graphs, which succinctly summarize interdependencies for causal dynamical systems under mild assumptions. This equivalence means that directed information graphs may be used for causal inference and learning tasks in the same manner Bayesian networks are used for correlative statistical inference and learning.

I. INTRODUCTION

Graphical models facilitate design and analysis of interacting, multivariate probabilistic complex systems that arise in numerous fields such as statistics, information theory, machine learning, systems engineering, etc. In graphical models (or more precisely, probabilistic graphical models), nodes represent random variables, and the existence or absence of edges represents conditional dependence or independence, respectively. These graphs can be undirected, directed, or a mix (such as chain graphs). An overview of significant, common properties of graphical models relevant to analysis and inference is in [1].

Bayesian networks (or belief networks) are directed graphical models that are commonly used for inference and decision making in machine learning, statistics, and decision analysis [2]. In these graphs, a node is independent of all its non-descendants given its parents. While Bayesian networks have an advantage over undirected graphical models (or Markov random fields) in understanding causation, they cannot be used to distinguish causation from correlation alone, except for certain special topologies.

We investigate graphical models that can encode causal relationships. We use directed information, an information theoretic quantity formally introduced by Massey [3], as our causality metric. Directed information is analogous to mutual information, but unlike mutual information, which only captures statistical correlation, directed information captures statistical causation [4]–[6]. Akin to Granger's philosophical point of view of causality, directed information quantifies the improvement in predicting the future of one process using causal side information of another, as opposed to solely its own past [4], [5]. This improvement is measured in terms of the average reduction in the encoding length of the information. Clearly, such a definition of causality has the notion of timing ingrained in it. Therefore, we expect a graphical model defined based on directed information to deal with processes as its nodes.

In this work, we formally define directed information graphs, an alternative type of graphical model. In these graphs, nodes represent processes, and directed edges correspond to whether there is directed information from one process to another. To demonstrate the practical significance of directed information graphs, we validate their ability to capture causal interdependencies for a class of interacting processes corresponding to a causal dynamical system. For such a system, the causal dependencies are known by construction, and the factorization of the joint distribution of the processes can be succinctly summarized in terms of a generative model graph. Such graphs use directed edges to encode which processes take part in generating the others at each time instance. We prove the equivalence of our directed information graphs to generative model graphs under mild assumptions. This equivalence means that directed information graphs may be used for causal inference and learning tasks in the same manner that Bayesian networks are used for statistical inference and learning. These graphs could enhance research in causal inference, some recent works in this area including [5]–[10].

II. DEFINITIONS

• For a sequence a_1, a_2, ..., denote a_i^j as (a_i, ..., a_j) and a^k ≜ a_1^k.
• Denote [m] ≜ {1, ..., m} and the power set 2^{[m]} to be the set of all subsets of [m]. Let [m]_i ≜ [m] \ {i} and [m]_{i,k} ≜ [m] \ {i, k}.
• For any Borel space Z, denote its Borel sets by B(Z) and the space of probability measures on (Z, B(Z)) as P(Z).
• Consider two probability measures P, Q ∈ P(Z). P is absolutely continuous with respect to Q (P ≪ Q) if Q(A) = 0 implies that P(A) = 0 for all A ∈ B(Z). If P ≪ Q, denote the Radon-Nikodym derivative as the random variable dP/dQ : Z → R that satisfies

$$P(A) = \int_{z \in A} \frac{dP}{dQ}(z)\, Q(dz), \qquad A \in \mathcal{B}(Z).$$

• The Kullback-Leibler divergence between P ∈ P(Z) and Q ∈ P(Z) is defined as

$$D(P \,\|\, Q) \triangleq \mathbb{E}_P\!\left[\log \frac{dP}{dQ}\right] = \int_{z \in Z} \log \frac{dP}{dQ}(z)\, P(dz) \qquad (1)$$

if P ≪ Q and ∞ otherwise.
• Throughout this paper, we will consider m random processes, where the ith (with i ∈ {1, ..., m}) random process at time j (with j ∈ {1, ..., n}) takes values in a Borel space X.
• For a sample space Ω, sigma-algebra F, and probability measure P, denote the probability space as (Ω, F, P).
• Denote the ith random variable at time j by X_{i,j} : Ω → X, the ith random process as X_i = (X_{i,1}, ..., X_{i,n}) : Ω → X^n, the subset of random processes X_I = (X_i : i ∈ I)^T : Ω → X^{|I|n}, and the whole collection of all m random processes as X ≜ X_{[m]} : Ω → X^{mn}. Denote the whole collection of all m random processes from time j = 1 to n as X(1:n) ≜ (X_{1,1}, ..., X_{1,n}, ..., X_{m,1}, ..., X_{m,n}) : Ω → X^{mn}.
• The probability measure P thus induces a probability distribution on X_{i,j} given by P_{X_{i,j}}(·) ∈ P(X), a joint distribution on X_i given by P_{X_i}(·) ∈ P(X^n), and a joint distribution on X_I given by P_{X_I}(·) ∈ P(X^{|I|n}).
• With slight abuse of notation, denote Y ≡ X_i for some i and X ≡ X_k for some k ≠ i, and denote the conditional distribution and causally conditioned distribution of Y given X as

$$P_{Y|X=x}(dy) \triangleq P_{Y|X}(dy \,|\, x) = \prod_{j=1}^{n} P_{Y_j | Y^{j-1}, X^{n}}\big(dy_j \,\big|\, y^{j-1}, x^{n}\big) \qquad (2)$$

$$P_{Y\|X=x}(dy) \triangleq P_{Y\|X}(dy \,\|\, x) \triangleq \prod_{j=1}^{n} P_{Y_j | Y^{j-1}, X^{j-1}}\big(dy_j \,\big|\, y^{j-1}, x^{j-1}\big). \qquad (3)$$

Note the similarity between regular conditioning in (2) and causal conditioning in (3), except that in causal conditioning the future (x_j^n) is not conditioned on [11].¹ The notation for P_{Y|X=x} and P_{Y‖X=x} is used to emphasize that P_{Y|X=x} ∈ P(X^n) and P_{Y‖X=x} ∈ P(X^n).

¹ Note the slight difference in conditioning upon x^{j−1} in this definition as compared to x^j in the original causal conditioning definition. The purpose for doing this will be clear later in the manuscript.

• With slight abuse of notation, denote Y ≡ X_i for some i with 𝒴 = X^n, and W ≡ X_I for some I ⊆ [m]_i with 𝒲 = X^{|I|n}. Consider two sets of causally conditioned distributions {P_{Y‖W=w} ∈ P(𝒴) : w ∈ 𝒲} and {Q_{Y‖W=w} ∈ P(𝒴) : w ∈ 𝒲} along with a marginal P_W ∈ P(𝒲). Then the conditional KL divergence is given by

$$D(P_{Y\|W} \,\|\, Q_{Y\|W} \,|\, P_W) = \int_{\mathcal{W}} D(P_{Y\|W=w} \,\|\, Q_{Y\|W=w})\, P_W(dw). \qquad (4)$$

The following Lemma will be useful throughout:

Lemma 2.1: D(P_{Y‖W} ‖ Q_{Y‖W} | P_W) = 0 if and only if P_{Y‖W=w}(dy) = Q_{Y‖W=w}(dy) with P_W probability one.

• Let X ≡ X_i for some i, Y ≡ X_k for some k, and W ≡ X_I for some I ⊆ [m]_{i,k}. The mutual information, directed information [3], and causally conditioned directed information [11] are given by

$$I(X; Y) \triangleq D(P_{Y|X} \,\|\, P_Y \,|\, P_X) \qquad (5)$$
$$I(X \to Y) \triangleq D(P_{Y\|X} \,\|\, P_Y \,|\, P_X) \qquad (6)$$
$$I(X \to Y \,\|\, W) \triangleq D(P_{Y\|X,W} \,\|\, P_{Y\|W} \,|\, P_{X,W}) \qquad (7)$$

Conceptually, mutual information and directed information are related. However, while mutual information quantifies statistical correlation (in the colloquial sense of statistical interdependence), directed information quantifies statistical causation. For example, I(X; Y) = I(Y; X), but I(X → Y) ≠ I(Y → X) in general. Note that as a consequence of Lemma 2.1 and (7), we have:

Corollary 2.2: I(X → Y ‖ W) = 0 if and only if X is causally conditionally independent of Y given W:

$$P_{Y\|X=x, W=w}(dy) = P_{Y\|W=w}(dy), \quad P_{X,W}\text{-a.s.}$$

Equivalently, we denote that (X, W, Y) form a causal Markov chain.
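To make this asymmetry concrete, the following Python sketch (an illustration of my own, not part of the paper) enumerates a toy two-step binary joint distribution in which Y_2 simply copies X_1, and evaluates mutual and directed information exactly. It uses the chain-rule form I(X → Y) = Σ_j I(X^{j−1}; Y_j | Y^{j−1}), which follows from (3) and (6) under the paper's delayed conditioning convention; all function and variable names are mine.

```python
import itertools
from math import log2

def toy_pmf():
    """Joint pmf of (X_1, X_2, Y_1, Y_2): X_1, X_2, Y_1 are independent fair
    coins and Y_2 = X_1, so X causally influences Y but not vice versa."""
    return {(x1, x2, y1, x1): 0.5 ** 3
            for x1, x2, y1 in itertools.product([0, 1], repeat=3)}

def marginal(pmf, idxs):
    """Marginal pmf over the coordinates listed in idxs."""
    out = {}
    for z, p in pmf.items():
        key = tuple(z[i] for i in idxs)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_mutual_info(pmf, a_idx, b_idx, c_idx):
    """I(Z_A; Z_B | Z_C) in bits, computed by exhaustive enumeration."""
    p_abc = marginal(pmf, a_idx + b_idx + c_idx)
    p_ac = marginal(pmf, a_idx + c_idx)
    p_bc = marginal(pmf, b_idx + c_idx)
    p_c = marginal(pmf, c_idx)
    total = 0.0
    for key, p in p_abc.items():
        a = key[:len(a_idx)]
        b = key[len(a_idx):len(a_idx) + len(b_idx)]
        c = key[len(a_idx) + len(b_idx):]
        if p > 0:
            total += p * log2(p * p_c[c] / (p_ac[a + c] * p_bc[b + c]))
    return total

def directed_info(pmf, x_idx, y_idx):
    """I(X -> Y) = sum_j I(X^{j-1}; Y_j | Y^{j-1}), following the paper's
    convention of conditioning only on the strictly delayed past x^{j-1}."""
    return sum(cond_mutual_info(pmf, x_idx[:j], (y_idx[j],), y_idx[:j])
               for j in range(len(y_idx)))

pmf = toy_pmf()
X, Y = (0, 1), (2, 3)                      # coordinates of X^2 and Y^2
print(cond_mutual_info(pmf, X, Y, ()))     # I(X; Y)  = 1.0 bit (symmetric)
print(directed_info(pmf, X, Y))            # I(X -> Y) = 1.0 bit
print(directed_info(pmf, Y, X))            # I(Y -> X) = 0.0 bits
```

Mutual information cannot distinguish which process influences which, while the directed information is nonzero only in the causal direction.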

A. Generative model graphs

We will first discuss properties of causal, dynamical systems. Next we will define generative models and their corresponding graphs. Then we will introduce directed information graphs.

Consider the following simple dynamical system.

Example 1: Let x_t and y_t be two processes comprising a physical, dynamical system which can be fully described by coupled differential equations:

$$x_{t+\Delta} = x_t + \Delta\, g_1(x^t, y^t)$$
$$y_{t+\Delta} = y_t + \Delta\, g_2(x^t, y^t)$$

This system has causal dynamics, in that the state (x_t, y_t) evolves in time; the present state of the system (at time t + Δ) depends on the past and not on the future. If we consider the system with i.i.d. noises ε_t and ε̃_t,

$$X_{t+\Delta} = X_t + \Delta\, g_1(X^t, Y^t) + \epsilon_t$$
$$Y_{t+\Delta} = Y_t + \Delta\, g_2(X^t, Y^t) + \tilde{\epsilon}_t,$$

we can describe the system dynamics through distributions (discretizing time t as j):

$$P_{X^n, Y^n}(dx^n, dy^n) = \prod_{j=1}^{n} P_{X_j, Y_j | X^{j-1}, Y^{j-1}}\big(dx_j, dy_j \,\big|\, x^{j-1}, y^{j-1}\big). \qquad (8)$$

Note that although the original dynamical system is coupled (both x_{t+Δ} and y_{t+Δ} depend on the full past), the present states of the two processes are independent given the past (e.g. x_{t+Δ} does not depend on y_{t+Δ}). There are no simultaneous influences or "actions at a distance." In the stochastic case, since the noises are also independent, this corresponds to

$$P_{X^n, Y^n}(dx^n, dy^n) = \prod_{j=1}^{n} P_{X_j | X^{j-1}, Y^{j-1}}\big(dx_j \,\big|\, x^{j-1}, y^{j-1}\big) \times P_{Y_j | X^{j-1}, Y^{j-1}}\big(dy_j \,\big|\, x^{j-1}, y^{j-1}\big). \qquad (9)$$
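For concreteness, here is a small Python simulation of the discretized noisy system in Example 1. It is only an illustration under assumptions of my own (specific linear g_1, g_2, Gaussian noise of my chosen scale); the paper leaves these unspecified.

```python
import numpy as np

# Minimal simulation of the noisy system in Example 1, with time discretized
# as t = j * Delta.  The coupling functions g1, g2 and the noise scale below
# are arbitrary choices made for illustration only.
rng = np.random.default_rng(1)
Delta, n = 0.01, 1000
x = np.zeros(n)
y = np.zeros(n)
for j in range(n - 1):
    g1 = -x[j] + 0.5 * y[j]                    # hypothetical g1(x^t, y^t)
    g2 = 0.5 * x[j] - y[j]                     # hypothetical g2(x^t, y^t)
    x[j + 1] = x[j] + Delta * g1 + 0.05 * rng.standard_normal()
    y[j + 1] = y[j] + Delta * g2 + 0.05 * rng.standard_normal()
# Given (x^j, y^j), the two innovations are independent, so the joint law of
# the sampled paths factors exactly as in (9): no simultaneous influences.
```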
Now we will consider generalizations. Consider a stochastic dynamical system with a set of random processes X and joint distribution P_X. Using the chain rule over increasing time j, the distribution can be written as in (8):

$$P_X(dx) = \prod_{j=1}^{n} P_{X(j) | X(1:j-1)}\big(dx(j) \,\big|\, x(1:j-1)\big). \qquad (10)$$

This formulation emphasizes how the future of the processes depends on the past. For many causal dynamical systems (such as those governed by coupled differential equations as in Example 1), given the past, the futures of the processes are conditionally independent. Analogous to (9), this corresponds to:

$$P_X(dx) = \prod_{j=1}^{n} \prod_{i=1}^{m} P_{X_{i,j} | X(1:j-1)}\big(dx_{i,j} \,\big|\, x(1:j-1)\big). \qquad (11)$$

We will refer to this as spatial conditional independence.

Assumption 1: For the remainder of this paper, we only consider joint distributions P_X which satisfy spatial conditional independence (11) and for which nontrivial conditional distributions are nondegenerate.

Remark 1: That the nontrivial conditional distributions induced by P_X are nondegenerate is a technicality required for the following proofs.

Using causal conditioning notation, we can rewrite (11) as

$$P_X(dx) = \prod_{i=1}^{m} P_{X_i \| X_{[m]_i}}\big(dx_i \,\|\, x_{[m]_i}\big). \qquad (12)$$

In addition to having spatial conditional independence, some distributions also have stronger conditional independence properties. In equation (12), the dynamics of each process i are described as causally depending on all other processes. However, there might be some processes whose dynamics only causally depend on a subset of other processes. Consider the following example:

Example 2: Let X = (X_1, X_2, X_3) be a set of three random processes with a joint distribution P_X that has spatial conditional independence:

$$P_X(dx) = P_{X_1 \| X_2, X_3}(dx_1 \,\|\, x_2, x_3)\, P_{X_2 \| X_1, X_3}(dx_2 \,\|\, x_1, x_3) \times P_{X_3 \| X_1, X_2}(dx_3 \,\|\, x_1, x_2). \qquad (13)$$

Further suppose that causally conditioned on X_3, X_1 is causally conditionally independent of X_2. Also, suppose that causally conditioned on X_1, X_3 is causally conditionally independent of X_2. Thus,

$$P_X(dx) = P_{X_1 \| X_3}(dx_1 \,\|\, x_3)\, P_{X_2 \| X_1, X_3}(dx_2 \,\|\, x_1, x_3) \times P_{X_3 \| X_1}(dx_3 \,\|\, x_1). \qquad (14)$$

[Fig. 1. Generative model graphs for the generative models in Example 2: (a) the generative model graph corresponding to (13); (b) the generative model graph corresponding to (14).]

Clearly, the factorization in (14) represents the full dynamics in (13) succinctly. Having a simple model which captures the full statistics of a system can be useful to generate realizations of the system or to perform inference or learning tasks.

One way to develop models for a system is to specify, for each process X_i, a set of processes A(i) that will be used to describe the dynamics of X_i. Specifically, we will form factorizations of the joint P_X like in (12) but, for each X_i, only causally conditioning it on X_{A(i)}. In Example 2, letting A(1) = {3}, A(2) = {1, 3}, and A(3) = {1} allows us to fully characterize the dynamics of the system, since (14) holds. Choosing A(2) = ∅, however, might lead to a poor approximation of the dynamics.

Generalizing this for a stochastic system with a joint distribution P_X which has spatial conditional independence (11), consider a mapping A : [m] → 2^{[m]} from each process to subsets of processes, with i ∉ A(i) for each i ∈ [m]. We can consider the corresponding induced probability measure P_A:

$$P_A(dx) = \prod_{i=1}^{m} P_{X_i \| X_{A(i)}}\big(dx_i \,\|\, x_{A(i)}\big). \qquad (15)$$

(15) might be equivalent to the full joint (12), or it might only be an approximation. We want to focus on those choices of subsets {A(i)}_{i=1}^{m} that capture the full statistics.

Definition 2.3: Under Assumption 1, for a joint distribution P_X, a generative model is a function A : [m] → 2^{[m]} such that for each process i ∈ [m], i ∉ A(i) and

$$D(P_X \,\|\, P_A) = 0, \qquad (16)$$

where P_A is defined in (15).

The choice of subsets of processes {A(i)}_{i=1}^{m} corresponds to causal dependencies (with X_i causally dependent on X_{A(i)}). These dependencies can be represented graphically.

Definition 2.4: A generative model graph is a directed graph for a generative model where each process is represented by a node, and there is a directed edge from X_k to X_i for i, k ∈ [m] iff k ∈ A(i).

A generative model graph for the generative model corresponding to the full joint distribution (13) in Example 2 is shown in Figure 1(a). A generative model graph for the simpler factorization (14) is shown in Figure 1(b).

The statistics of the generative model corresponding to (14) are identical to those of the full distribution (13) because the individual terms in the product of (14) were found to be equivalent to those in (13). The definition of generative models only requires that the model be statistically identical to the full distribution (the product in (15) the same as the product in (12)). However, this is equivalent to each causally conditional probability term induced by the model being exact (each term in (15) equivalent to the corresponding term in (12)). This can be shown with an induction argument using marginalization over space i and time j (first down to time j = 1 to show equivalence for the first timestep, then back up to time j = n).

Consequently, it is meaningful to consider the simplest model for the whole system as the model where the modeling of each subpart (the causally conditional terms) is simplest. For each i ∈ [m], there is a minimum cardinality for A(i) (possibly m − 1 or 0), in the sense that for any subset A′(i) ⊂ [m]_i smaller than this cardinality,

$$D\big(P_{X_i \| X_{[m]_i}} \,\|\, P_{X_i \| X_{A'(i)}} \,\big|\, P_{X_{[m]_i}}\big) > 0.$$

We can correspondingly consider generative models where each A(i) is of minimal cardinality.

Definition 2.5: A minimal generative model is a generative model where for each i ∈ [m], A(i) is of minimal cardinality.

Note that there could potentially be multiple minimal generative models (for each i ∈ [m], they could have the same cardinalities but be different sets). We will later show in Lemma 3.2, however, that under Assumption 1, for any distribution P_X there is a unique minimal generative model.
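As a concrete illustration of Definition 2.3 and the induced measure in (15), the following Python sketch (my own construction, not from the paper) builds a small binary joint distribution with the dependence structure of (14), forms P_A for several candidate mappings A by marginalizing the joint, and evaluates D(P_X ‖ P_A). The particular transition probabilities, the two-step horizon, and all names are assumptions of mine made purely for illustration.

```python
import itertools
from math import log2

M, N = 3, 2                                  # three binary processes, two time steps
IDX = {(i, j): i * N + j for i in range(M) for j in range(N)}

def joint_pmf():
    """A joint P_X with spatial conditional independence (11) in which X1 is
    driven by X3's past, X2 by the pasts of X1 and X3, and X3 by X1's past
    (the structure of the simpler factorization (14) in Example 2)."""
    def p_one(i, j, x):                      # P(X_{i,j} = 1 | past of x)
        if j == 0:
            return 0.5
        x1, x3 = x[IDX[(0, j - 1)]], x[IDX[(2, j - 1)]]
        return 0.9 if {0: x3, 1: x1 ^ x3, 2: x1}[i] else 0.1
    pmf = {}
    for x in itertools.product([0, 1], repeat=M * N):
        p = 1.0
        for j in range(N):
            for i in range(M):
                q = p_one(i, j, x)
                p *= q if x[IDX[(i, j)]] else 1.0 - q
        pmf[x] = p
    return pmf

def marginal(pmf, coords):
    out = {}
    for x, p in pmf.items():
        key = tuple(x[c] for c in coords)
        out[key] = out.get(key, 0.0) + p
    return out

def induced_pa(pmf, A):
    """P_A(x) = prod_i prod_j P(X_{i,j} | own past, past of processes in A[i]),
    each causally conditioned term obtained from the joint by marginalization,
    as in (15)."""
    pa = {}
    for x in pmf:
        p = 1.0
        for i in range(M):
            for j in range(N):
                past = [IDX[(i, t)] for t in range(j)]
                past += [IDX[(k, t)] for k in sorted(A[i]) for t in range(j)]
                now = past + [IDX[(i, j)]]
                num = marginal(pmf, tuple(now))[tuple(x[c] for c in now)]
                den = marginal(pmf, tuple(past))[tuple(x[c] for c in past)] if past else 1.0
                p *= num / den
        pa[x] = p
    return pa

def kl(p, q):
    return sum(pi * log2(pi / q[x]) for x, pi in p.items() if pi > 0)

px = joint_pmf()
full  = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}    # A(i) = all other processes, as in (12)
small = {0: {2},    1: {0, 2}, 2: {0}}       # candidate minimal model, as in (14)
wrong = {0: {2},    1: set(),  2: {0}}       # drops X2's true parents
print(kl(px, induced_pa(px, full)))          # ~0: always a generative model
print(kl(px, induced_pa(px, small)))         # ~0: 'small' is a generative model too
print(kl(px, induced_pa(px, wrong)))         # > 0 (about 0.53 bits): 'wrong' is not
```

Searching over the mappings A that give zero divergence and keeping one with the smallest |A(i)| for every i recovers the minimal generative model of Definition 2.5 for this toy system.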

B. Directed information graphs

Generative model graphs are one graphical representation of the causal dynamics of a system. The edges encode causally conditional independence statements. An alternative graphical representation is a directed graph where the edges encode statements about causally conditioned directed information.

Definition 2.6: A directed information graph is a directed graph over a set of random processes X where each node represents a process and there is a directed edge from process X ≡ X_k to process Y ≡ X_i (for some i, k ∈ [m]) iff

$$I(X \to Y \,\|\, X_{[m]_{i,k}}) > 0. \qquad (17)$$

So there is an edge from X to Y if, causally conditioned on all other processes, Y still has causal dependence on X. Given a distribution P_X, the corresponding directed information graph is unique. For Example 2, if I(X_2 → X_1 ‖ X_3) = I(X_2 → X_3 ‖ X_1) = 0, and the other causally conditioned directed information values were nonzero, then the directed information graph would correspond to Figure 1(b).
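Definition 2.6 translates directly into a graph-construction procedure. The short Python sketch below is my own illustration, not an algorithm from the paper; `directed_info_graph`, `ccdi`, and `threshold` are hypothetical names, and the routine evaluating the causally conditioned directed information (an exact computation from a known joint, or a consistent estimator as discussed in Section IV) must be supplied separately.

```python
from itertools import permutations

def directed_info_graph(m, ccdi, threshold=1e-6):
    """Edge set of the directed information graph of Definition 2.6: draw an
    edge k -> i whenever I(X_k -> X_i || X_{[m]_{i,k}}) exceeds `threshold`.
    `ccdi(k, i, rest)` must return the causally conditioned directed
    information from process k to process i given the processes in `rest`.
    With estimated values, a small positive threshold stands in for the
    exact "> 0" test in (17)."""
    edges = set()
    for k, i in permutations(range(m), 2):
        rest = [r for r in range(m) if r not in (i, k)]
        if ccdi(k, i, rest) > threshold:
            edges.add((k, i))
    return edges
```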

III. EQUIVALENCE OF MINIMAL GENERATIVE MODELS AND DIRECTED INFORMATION GRAPHS

In this section, we show that minimal generative models are unique (Lemma 3.2). Moreover, we show that minimal generative model graphs and directed information graphs are equivalent (Theorem 3.3). First, we will introduce a lemma that will be used in later proofs.

Consider the following setup. Let P_X denote a distribution with spatial conditional independence. Let Y ≡ X_i for some i ∈ [m]. Let I_U (nonempty) and I_W be different subsets of processes such that I_U, I_W ⊆ [m]_i, U ≡ X_{I_U}, W ≡ X_{I_W}, and

$$D\big(P_{Y \| U} \,\|\, P_{Y \| W} \,\big|\, P_{X_{I_U \cup I_W}}\big) = 0. \qquad (18)$$

So both U and W causally provide the same information about Y. Let V be a strict subset of U containing the processes in the intersection of U and W: I_U ∩ I_W ⊆ I_V ⊂ I_U. Let Ũ be the remaining processes in U (Ũ ≡ X_{I_U \ I_V}), which are necessarily unique to U. See Figure 2 for a graphical depiction of the sets.

[Fig. 2. Sets of processes for Lemma 3.1.]

Then, causally conditioned on V, U causally provides no new information about Y in the following sense:

Lemma 3.1: For the process Y and sets of processes U, Ũ, V, and W as described above,

$$D\big(P_{Y \| U} \,\|\, P_{Y \| V} \,\big|\, P_{U}\big) = 0. \qquad (19)$$

That is, all of the causal information about Y in U is contained in V, so Ũ can be removed.

Proof: By sequentially marginalizing the processes in (18) from time j = n down to j = 1, and with the nonnegativity of KL divergence, we get for all time j:

$$D\big(P_{Y_j | Y^{j-1}, \tilde{U}^{j-1}, V^{j-1}} \,\|\, P_{Y_j | Y^{j-1}, W^{j-1}} \,\big|\, P_{Y^{j-1}, X^{j-1}_{I_U \cup I_W}}\big) = 0.$$

Applying the definition of conditional distributions to uncondition on Ũ^{j−1},

$$D\big(P_{Y_j, \tilde{U}^{j-1} | Y^{j-1}, V^{j-1}} \,\|\, P_{\tilde{U}^{j-1} | Y^{j-1}, V^{j-1}} \times P_{Y_j | Y^{j-1}, W^{j-1}} \,\big|\, P_{Y^{j-1}, X^{j-1}_{I_U \cup I_W}}\big) = 0.$$

This is equivalent to a conditional mutual information statement. Since the processes in Ũ are not in {Y} ∪ W, marginalizing over realizations of Ũ^{j−1} yields

$$D\big(P_{Y_j | Y^{j-1}, V^{j-1}} \,\|\, P_{Y_j | Y^{j-1}, W^{j-1}} \,\big|\, P_{Y^{j-1}, X^{j-1}_{I_V \cup I_W}}\big) = 0.$$

This implies D(P_{Y‖V} ‖ P_{Y‖W} | P_{X_{I_V ∪ I_W}}) = 0, and thus with (18) implies (19).

A. Uniqueness of minimal generative models

Minimal generative models were defined to have minimum cardinalities for each subset causally conditioned upon. The definition does not preclude the possibility of having multiple minimal generative models where the corresponding subsets in each of the models have the same minimum cardinalities. However, this is not the case.

Lemma 3.2: Under Assumption 1, for any distribution P_X there is only one minimal generative model.

Proof: Suppose there are at least two minimal generative models, A and B, for P_X. Let Y ≡ X_i for some i ∈ [m] be any process for which A(i) ≠ B(i). Then we have

$$D\big(P_{Y \| X_{[m]_i}} \,\|\, P_{Y \| X_{A(i)}} \,\big|\, P_{X_{[m]_i}}\big) = 0,$$
$$D\big(P_{Y \| X_{A(i)}} \,\|\, P_{Y \| X_{B(i)}} \,\big|\, P_{X_{A(i) \cup B(i)}}\big) = 0.$$

By Lemma 3.1,

$$D\big(P_{Y \| X_{[m]_i}} \,\|\, P_{Y \| X_{A(i) \cap B(i)}} \,\big|\, P_{X_{[m]_i}}\big) = 0.$$

This is a contradiction, as |A(i) ∩ B(i)| < |A(i)| but |A(i)| is minimal by definition.

B. Equivalence of minimal generative model graphs and directed information graphs

Theorem 3.3: Under Assumption 1, for any joint distribution P_X, the corresponding minimal generative model graph and directed information graph are equivalent.

Proof: Let A denote the minimal generative model, and let X ≡ X_k and Y ≡ X_i be processes for some i, k ∈ [m]. Suppose that there is not a directed edge in the minimal generative model graph from X to Y. By definition, k ∉ A(i). Thus, by Lemma 3.1,

$$D\big(P_{Y \| X_{[m]_i}} \,\|\, P_{Y \| X_{[m]_{i,k}}} \,\big|\, P_{X_{[m]_i}}\big) = 0, \qquad (20)$$

and consequently

$$I(X \to Y \,\|\, X_{[m]_{i,k}}) = 0, \qquad (21)$$

so there is also no edge in the directed information graph.

Now suppose that there is not an edge from X to Y in the directed information graph, so (21) holds, which implies (20). However, suppose there is a directed edge from X to Y in the minimal generative model graph. Thus k ∈ A(i), and |A(i)| is minimal. With (20), this implies by Lemma 3.1

$$D\big(P_{Y \| X_{[m]_i}} \,\|\, P_{Y \| X_{A(i) \setminus \{k\}}} \,\big|\, P_{X_{[m]_i}}\big) = 0. \qquad (22)$$

This is a contradiction because |A(i)| is minimal. Therefore there is no edge in the minimal generative model graph.

IV. DISCUSSION AND FUTURE DIRECTIONS

We have shown that directed information graphs capture the full causal structure of minimal generative models in networked stochastic systems. In many cases, the statistics of P_X are unknown, and thus statistical estimation methods are needed to estimate the causally conditioned directed information I(X_i → X_k ‖ X_{[m]_{i,k}}) in (17) from data. Under different assumptions, consistent estimators for directed information exist [5], [12], but this area of estimating dynamic information theoretic quantities still generally has undesirable performance-complexity tradeoffs. For example, the causally conditioned directed information I(X_i → X_k ‖ X_{[m]_{i,k}}) in (17) is a functional of the joint distribution P_X on all processes, whereas the directed information I(X_i → X_k) is a functional of only the pairwise distribution P_{X_i, X_k}. Sufficient conditions guaranteeing that I(X_i → X_k) satisfying some property implies that I(X_i → X_k ‖ X_{[m]_{i,k}}) satisfies another property do not hold in general: there are cases when I(X → Y) > 0 but I(X → Y ‖ Z) = 0, and others when I(X → Y) = 0 but I(X → Y ‖ Z) > 0. Consequently, some future directions of research require carefully understanding practically relevant situations when lower complexity methods guarantee correct inference in large graphs.
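To ground the estimation discussion, here is a deliberately naive plug-in sketch of my own (not one of the estimators of [5], [12]): it assumes the pair of binary processes is jointly stationary with first-order Markov memory, so the per-step directed information reduces to I(X_{j−1}; Y_j | Y_{j−1}) under the paper's delayed conditioning convention, which is then computed from empirical counts. All names and parameters are mine.

```python
import numpy as np

def plugin_di_rate(x, y, alpha=1e-3):
    """Naive plug-in estimate of the per-step directed information from X to Y
    for binary sequences, assuming joint stationarity and first-order Markov
    memory so that the rate is I(X_{j-1}; Y_j | Y_{j-1}).  `alpha` is a small
    additive smoothing constant that keeps every bin nonzero."""
    x, y = np.asarray(x), np.asarray(y)
    counts = np.full((2, 2, 2), alpha)             # bins for (x_{j-1}, y_{j-1}, y_j)
    for j in range(1, len(y)):
        counts[x[j - 1], y[j - 1], y[j]] += 1.0
    p_xyy = counts / counts.sum()                  # empirical P(x_{j-1}, y_{j-1}, y_j)
    p_xy = p_xyy.sum(axis=2, keepdims=True)        # P(x_{j-1}, y_{j-1})
    p_yy = p_xyy.sum(axis=0, keepdims=True)        # P(y_{j-1}, y_j)
    p_y = p_xyy.sum(axis=(0, 2), keepdims=True)    # P(y_{j-1})
    return float(np.sum(p_xyy * np.log2(p_xyy * p_y / (p_xy * p_yy))))

# Toy data: Y copies X with a one-step delay and 10% flipping noise, while X
# ignores Y entirely, so the estimate should be large for X -> Y and near
# zero for Y -> X.
rng = np.random.default_rng(0)
n = 200_000
x = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.1
y = np.empty(n, dtype=int)
y[0] = rng.integers(0, 2)
y[1:] = np.where(flip[1:], 1 - x[:-1], x[:-1])
print(plugin_di_rate(x, y), plugin_di_rate(y, x))  # roughly 0.53 and 0.00 bits/step
```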
ACKNOWLEDGMENT

Christopher Quinn was supported by the Department of Energy Computational Science Graduate Fellowship, which is provided under grant number DE-FG02-97ER25308. This work was supported by AFOSR FA9550-10-1-0345.

REFERENCES

[1] H. Loeliger, "An introduction to factor graphs," IEEE Signal Processing Magazine, vol. 21, no. 1, pp. 28–41, 2004.
[2] D. Heckerman, "A tutorial on learning with Bayesian networks," Innovations in Bayesian Networks, pp. 33–82, 2008.
[3] J. Massey, "Causality, feedback and directed information," in Proc. Int. Symp. Information Theory and its Applications (ISITA-90), 1990, pp. 303–305.
[4] J. Rissanen and M. Wax, "Measures of mutual and causal dependence between two time series (Corresp.)," IEEE Transactions on Information Theory, vol. 33, no. 4, pp. 598–601, 1987.
[5] C. Quinn, T. Coleman, N. Kiyavash, and N. Hatsopoulos, "Estimating the directed information to infer causal relationships in ensemble neural spike train recordings," Journal of Computational Neuroscience, pp. 1–28, 2010.
[6] A. Rao, A. Hero III, D. States, and J. Engel, "Motif discovery in tissue-specific regulatory sequences using directed information," EURASIP Journal on Bioinformatics and Systems Biology, vol. 2007, pp. 1–13, 2007.
[7] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
[8] A. Bolstad, B. Van Veen, and R. Nowak, "Causal network inference via group sparse regularization," IEEE Transactions on Signal Processing, 2010, submitted.
[9] D. Materassi and G. Innocenti, "Topological identification in networks of dynamical systems," IEEE Transactions on Automatic Control, vol. 55, no. 8, pp. 1860–1871, 2010.
[10] C. Quinn, T. Coleman, and N. Kiyavash, "Causal dependence tree approximations of joint distributions for multiple random processes," IEEE Transactions on Information Theory, 2011, submitted; arXiv preprint arXiv:1101.5108.
[11] G. Kramer, "Directed information for channels with feedback," Ph.D. dissertation, Swiss Federal Institute of Technology (ETH) Zurich, 1998.
[12] L. Zhao, H. Permuter, Y. Kim, and T. Weissman, "Universal estimation of directed information," in Proc. IEEE International Symposium on Information Theory (ISIT), 2010, pp. 1433–1437.