Equivalence Between Minimal Generative Model Graphs and Directed Information Graphs
Christopher J. Quinn, Department of Electrical and Computer Engineering, University of Illinois, Urbana, Illinois 61801. Email: [email protected]
Negar Kiyavash, Department of Industrial and Enterprise Systems Engineering, University of Illinois, Urbana, Illinois 61801. Email: [email protected]
Todd P. Coleman, Department of Electrical and Computer Engineering, University of Illinois, Urbana, Illinois 61801. Email: [email protected]

Abstract—We propose a new type of probabilistic graphical model, based on directed information, to represent the causal dynamics between processes in a stochastic system. We show the practical significance of such graphs by proving their equivalence to generative model graphs, which succinctly summarize interdependencies for causal dynamical systems under mild assumptions. This equivalence means that directed information graphs may be used for causal inference and learning tasks in the same manner that Bayesian networks are used for correlative statistical inference and learning.

I. INTRODUCTION

Graphical models facilitate the design and analysis of interacting, multivariate probabilistic complex systems that arise in numerous fields such as statistics, information theory, machine learning, and systems engineering. In graphical models (more precisely, probabilistic graphical models), nodes represent random variables, and the existence or absence of edges represents conditional dependence or independence, respectively. These graphs can be undirected, directed, or a mix (such as chain graphs). An overview of significant, common properties of graphical models relevant to analysis and inference is given in [1].

Bayesian networks (or belief networks) are directed graphical models that are commonly used for inference and decision making in machine learning, statistics, and decision analysis [2]. In these graphs, a node is independent of all its non-descendants given its parents. While Bayesian networks have an advantage over undirected graphical models (or Markov random fields) in understanding causation, they cannot be used to distinguish causation from correlation alone, except for certain special topologies.

We investigate graphical models that can encode causal relationships. We use directed information, an information-theoretic quantity formally introduced by Massey [3], as our causality metric. Directed information is analogous to mutual information, but unlike mutual information, which only captures statistical correlation, directed information captures statistical causation [4]–[6]. Akin to Granger's philosophical point of view of causality, directed information quantifies the improvement in predicting the future of one process using causal side information of another process, as opposed to using solely its own past [4], [5]. This improvement is measured in terms of the average reduction in the encoding length of the information. Clearly, such a definition of causality has the notion of timing ingrained in it. Therefore, we expect a graphical model defined based on directed information to deal with processes as its nodes.

In this work, we formally define directed information graphs, an alternative type of graphical model. In these graphs, nodes represent processes, and directed edges correspond to whether there is directed information from one process to another. To demonstrate the practical significance of directed information graphs, we validate their ability to capture causal interdependencies for a class of interacting processes corresponding to a causal dynamical system. For such a system, the causal dependencies are known by construction, and the factorization of the joint distribution of the processes can be succinctly summarized in terms of a generative model graph. Such graphs use directed edges to encode which processes take part in generating the others at each time instance. We prove the equivalence of our directed information graphs to generative model graphs under mild assumptions. This equivalence means that directed information graphs may be used for causal inference and learning tasks in the same manner that Bayesian networks are used for statistical inference and learning. These graphs could enhance research in causal inference; some recent works in this area include [5]–[10].
II. DEFINITIONS

• For a sequence $a_1, a_2, \ldots$, denote $a_i^j$ as $(a_i, \ldots, a_j)$ and $a^k \triangleq a_1^k$.
• Denote $[m] \triangleq \{1, \ldots, m\}$ and the power set $2^{[m]}$ on $[m]$ to be the set of all subsets of $[m]$. Let $[m]_i \triangleq [m] \setminus \{i\}$.
• For any Borel space $\mathbb{Z}$, denote its Borel sets by $\mathcal{B}(\mathbb{Z})$ and the space of probability measures on $(\mathbb{Z}, \mathcal{B}(\mathbb{Z}))$ as $\mathcal{P}(\mathbb{Z})$.
• Consider two probability measures $P$ and $Q$ on $\mathcal{P}(\mathbb{Z})$. $P$ is absolutely continuous with respect to $Q$ ($P \ll Q$) if $Q(A) = 0$ implies that $P(A) = 0$ for all $A \in \mathcal{B}(\mathbb{Z})$. If $P \ll Q$, denote the Radon–Nikodym derivative as the random variable $\frac{dP}{dQ} : \mathbb{Z} \to \mathbb{R}$ that satisfies
$$P(A) = \int_{z \in A} \frac{dP}{dQ}(z)\, Q(dz), \qquad A \in \mathcal{B}(\mathbb{Z}).$$
• The Kullback–Leibler divergence between $P \in \mathcal{P}(\mathbb{Z})$ and $Q \in \mathcal{P}(\mathbb{Z})$ is defined as
$$D(P \,\|\, Q) \triangleq \mathbb{E}_P\!\left[\log \frac{dP}{dQ}\right] = \int_{z \in \mathbb{Z}} \log \frac{dP}{dQ}(z)\, P(dz) \tag{1}$$
if $P \ll Q$ and $\infty$ otherwise.
• Throughout this paper, we consider $m$ random processes, where the $i$th process (with $i \in \{1, \ldots, m\}$) at time $j$ (with $j \in \{1, \ldots, n\}$) takes values in a Borel space $\mathbb{X}$.
• For a sample space $\Omega$, sigma-algebra $\mathcal{F}$, and probability measure $\mathbb{P}$, denote the probability space as $(\Omega, \mathcal{F}, \mathbb{P})$.
• Denote the $i$th random variable at time $j$ by $X_{i,j} : \Omega \to \mathbb{X}$, the $i$th random process as $\mathbf{X}_i = (X_{i,1}, \ldots, X_{i,n}) : \Omega \to \mathbb{X}^n$, the subset of random processes $\mathbf{X}_{\mathcal{I}} = (\mathbf{X}_i : i \in \mathcal{I})^{\mathsf{T}} : \Omega \to \mathbb{X}^{|\mathcal{I}| n}$, and the whole collection of all $m$ random processes as $\mathbf{X} \triangleq \mathbf{X}_{[m]} : \Omega \to \mathbb{X}^{mn}$. Denote the whole collection of all $m$ random processes from time $j = 1$ to $n$ as $\mathbf{X}(1{:}n) \triangleq (X_{1,1}, \ldots, X_{1,n}, \ldots, X_{m,1}, \ldots, X_{m,n}) : \Omega \to \mathbb{X}^{mn}$.
• The probability measure $\mathbb{P}$ thus induces a probability distribution on $X_{i,j}$ given by $P_{X_{i,j}}(\cdot) \in \mathcal{P}(\mathbb{X})$, a joint distribution on $\mathbf{X}_i$ given by $P_{\mathbf{X}_i}(\cdot) \in \mathcal{P}(\mathbb{X}^n)$, and a joint distribution on $\mathbf{X}_{\mathcal{I}}$ given by $P_{\mathbf{X}_{\mathcal{I}}}(\cdot) \in \mathcal{P}(\mathbb{X}^{|\mathcal{I}| n})$.
• With slight abuse of notation, denote $\mathbf{Y} \equiv \mathbf{X}_i$ for some $i$ and $\mathbf{X} \equiv \mathbf{X}_k$ for some $k \neq i$, and denote the conditional distribution and causally conditioned distribution of $\mathbf{Y}$ given $\mathbf{X}$ as
$$P_{\mathbf{Y} | \mathbf{X} = \mathbf{x}}(d\mathbf{y}) \triangleq P_{\mathbf{Y} | \mathbf{X}}(d\mathbf{y} \,|\, \mathbf{x}) = \prod_{j=1}^{n} P_{Y_j | Y^{j-1}, X^n}\!\left(dy_j \,\middle|\, y^{j-1}, x^n\right) \tag{2}$$
$$P_{\mathbf{Y} \| \mathbf{X} = \mathbf{x}}(d\mathbf{y}) \triangleq P_{\mathbf{Y} \| \mathbf{X}}(d\mathbf{y} \,\|\, \mathbf{x}) = \prod_{j=1}^{n} P_{Y_j | Y^{j-1}, X^{j-1}}\!\left(dy_j \,\middle|\, y^{j-1}, x^{j-1}\right). \tag{3}$$
Note the similarity of the causal conditioning in (3) to the regular conditioning in (2), except that in causal conditioning the future ($x_j^n$) is not conditioned on [11].¹ The notation $P_{\mathbf{Y} | \mathbf{X} = \mathbf{x}}$ and $P_{\mathbf{Y} \| \mathbf{X} = \mathbf{x}}$ is used to emphasize that $P_{\mathbf{Y} | \mathbf{X} = \mathbf{x}} \in \mathcal{P}(\mathbb{X}^n)$ and $P_{\mathbf{Y} \| \mathbf{X} = \mathbf{x}} \in \mathcal{P}(\mathbb{X}^n)$.
¹ Note the slight difference in conditioning upon $x^{j-1}$ in this definition as compared to $x^j$ in the original causal conditioning definition.
• With slight abuse of notation, denote $\mathbf{Y} \equiv \mathbf{X}_i$ for some $i$ with $\mathbb{Y} = \mathbb{X}^n$, and $\mathbf{W} \equiv \mathbf{X}_{\mathcal{I}}$ for some $\mathcal{I} \subseteq [m]_i$ with $\mathbb{W} = \mathbb{X}^{|\mathcal{I}| n}$. Consider two sets of causally conditioned distributions $\{P_{\mathbf{Y} \| \mathbf{W} = \mathbf{w}} \in \mathcal{P}(\mathbb{Y}) : \mathbf{w} \in \mathbb{W}\}$ and $\{Q_{\mathbf{Y} \| \mathbf{W} = \mathbf{w}} \in \mathcal{P}(\mathbb{Y}) : \mathbf{w} \in \mathbb{W}\}$, along with a marginal probability distribution $P_{\mathbf{W}} \in \mathcal{P}(\mathbb{W})$. Then the conditional KL divergence is given by
$$D\!\left(P_{\mathbf{Y} \| \mathbf{W}} \,\big\|\, Q_{\mathbf{Y} \| \mathbf{W}} \,\big|\, P_{\mathbf{W}}\right) = \int_{\mathbb{W}} D\!\left(P_{\mathbf{Y} \| \mathbf{W} = \mathbf{w}} \,\big\|\, Q_{\mathbf{Y} \| \mathbf{W} = \mathbf{w}}\right) P_{\mathbf{W}}(d\mathbf{w}). \tag{4}$$
The following lemma will be useful throughout:
Lemma 2.1: $D\!\left(P_{\mathbf{Y} \| \mathbf{W}} \,\big\|\, Q_{\mathbf{Y} \| \mathbf{W}} \,\big|\, P_{\mathbf{W}}\right) = 0$ if and only if $P_{\mathbf{Y} \| \mathbf{W} = \mathbf{w}}(d\mathbf{y}) = Q_{\mathbf{Y} \| \mathbf{W} = \mathbf{w}}(d\mathbf{y})$ with $P_{\mathbf{W}}$ probability one.
• Let $\mathbf{X} \equiv \mathbf{X}_i$ for some $i$, $\mathbf{Y} \equiv \mathbf{X}_k$ for some $k$, and $\mathbf{W} \equiv \mathbf{X}_{\mathcal{I}}$ for some $\mathcal{I} \subseteq [m]_{i,k}$. The mutual information, directed information [3], and causally conditioned directed information [11] are given by
$$I(\mathbf{X}; \mathbf{Y}) \triangleq D\!\left(P_{\mathbf{Y} | \mathbf{X}} \,\big\|\, P_{\mathbf{Y}} \,\big|\, P_{\mathbf{X}}\right) \tag{5}$$
$$I(\mathbf{X} \to \mathbf{Y}) \triangleq D\!\left(P_{\mathbf{Y} \| \mathbf{X}} \,\big\|\, P_{\mathbf{Y}} \,\big|\, P_{\mathbf{X}}\right) \tag{6}$$
$$I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{W}) \triangleq D\!\left(P_{\mathbf{Y} \| \mathbf{X}, \mathbf{W}} \,\big\|\, P_{\mathbf{Y} \| \mathbf{W}} \,\big|\, P_{\mathbf{X}, \mathbf{W}}\right) \tag{7}$$
Conceptually, mutual information and directed information are related. However, while mutual information quantifies statistical correlation (in the colloquial sense of statistical interdependence), directed information quantifies statistical causation. For example, $I(\mathbf{X}; \mathbf{Y}) = I(\mathbf{Y}; \mathbf{X})$, but $I(\mathbf{X} \to \mathbf{Y}) \neq I(\mathbf{Y} \to \mathbf{X})$ in general. Note that as a consequence of Lemma 2.1 and (7), we have:
Corollary 2.2: $I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{W}) = 0$ if and only if $\mathbf{X}$ is causally conditionally independent of $\mathbf{Y}$ given $\mathbf{W}$:
$$P_{\mathbf{Y} \| \mathbf{X} = \mathbf{x}, \mathbf{W} = \mathbf{w}}(d\mathbf{y}) = P_{\mathbf{Y} \| \mathbf{W} = \mathbf{w}}(d\mathbf{y}), \qquad P_{\mathbf{X}, \mathbf{W}}\text{-a.s.}$$
Equivalently, we denote that $(\mathbf{X}, \mathbf{W}, \mathbf{Y})$ form a causal Markov chain.
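As a concrete, simplified illustration of how (5)–(7) can be turned into numbers, the Python sketch below estimates a per-step directed information from one binary process to another using empirical (plug-in) conditional distributions, under the illustrative assumption that the joint process is stationary and first-order Markov, so that each causally conditioned factor in (3) reduces to $P_{Y_j | Y_{j-1}, X_{j-1}}$. The function name di_rate, the plug-in estimator, the base-2 logarithm, and the toy data are our own choices for illustration; they are not part of the paper's development.

import numpy as np
from collections import Counter

def di_rate(x, y):
    """Plug-in estimate of the per-step directed information from x to y,
    i.e. I(X_{j-1}; Y_j | Y_{j-1}) under a stationary, first-order Markov
    simplification of definition (6)."""
    n = len(x)
    counts = Counter((x[j - 1], y[j - 1], y[j]) for j in range(1, n))
    total = n - 1
    rate = 0.0
    for (a, b, c), cnt in counts.items():
        p_abc = cnt / total                                   # p(x_{j-1}=a, y_{j-1}=b, y_j=c)
        p_ab = sum(v for (a2, b2, _), v in counts.items() if (a2, b2) == (a, b)) / total
        p_bc = sum(v for (_, b2, c2), v in counts.items() if (b2, c2) == (b, c)) / total
        p_b = sum(v for (_, b2, _), v in counts.items() if b2 == b) / total
        # ratio p(c | a, b) / p(c | b): the per-step analogue of the density
        # ratio inside the KL divergence in (6)
        rate += p_abc * np.log2((p_abc / p_ab) / (p_bc / p_b))
    return rate

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=50_000)
y = np.zeros_like(x)
flips = rng.random(len(x) - 1) < 0.1
y[1:] = x[:-1] ^ flips                 # Y_j is a noisy copy of X_{j-1}

print(di_rate(x, y))                   # clearly positive: X causally influences Y
print(di_rate(y, x))                   # near zero: no causal influence of Y on X

On this toy data the estimate in the X-to-Y direction is substantially positive while the Y-to-X direction is close to zero, mirroring the asymmetry noted above: $I(\mathbf{X}; \mathbf{Y}) = I(\mathbf{Y}; \mathbf{X})$, but $I(\mathbf{X} \to \mathbf{Y}) \neq I(\mathbf{Y} \to \mathbf{X})$ in general.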
A. Generative model graphs

We first discuss properties of causal, dynamical systems. Next we define generative models and their corresponding graphs. Then we introduce directed information graphs. Consider the following simple dynamical system.

Example 1: Let $x_t$ and $y_t$ be two processes comprising a physical, dynamical system which can be fully described by coupled differential equations:
$$x_{t+\Delta} = x_t + \Delta\, g_1(x^t, y^t)$$
$$y_{t+\Delta} = y_t + \Delta\, g_2(x^t, y^t).$$
This system has causal dynamics, in that the state $(x_t, y_t)$ evolves in time; the present state of the system (at time $t + \Delta$) depends on the past and not on the future. If we consider the system with i.i.d. noises $\epsilon_t$ and $\tilde{\epsilon}_t$,
$$X_{t+\Delta} = X_t + \Delta\, g_1(X^t, Y^t) + \epsilon_t$$
$$Y_{t+\Delta} = Y_t + \Delta\, g_2(X^t, Y^t) + \tilde{\epsilon}_t,$$
we can describe the system dynamics through distributions (discretizing time $t$ as $j$):
$$P_{X^n, Y^n}(dx^n, dy^n) = \prod_{j=1}^{n} P_{X_j, Y_j | X^{j-1}, Y^{j-1}}\!\left(dx_j, dy_j \,\middle|\, x^{j-1}, y^{j-1}\right). \tag{8}$$
Note that although the original dynamical system is coupled (both $x_{t+\Delta}$ and $y_{t+\Delta}$ depend on the full past), the present states of the two processes are independent given the past (e.g.
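To make Example 1 concrete, the following sketch simulates one sample path of the noisy, discretized system. The particular drift functions g1 and g2, the step size, the noise scale, and the initial conditions are placeholders chosen only so the snippet runs; they are not values from the paper.

import numpy as np

def simulate(n=1000, delta=0.01, seed=0):
    """Euler-style simulation of the noisy system in Example 1."""
    rng = np.random.default_rng(seed)
    # placeholder drifts; in general g1, g2 may depend on the full past (x^t, y^t)
    g1 = lambda x, y: -x + y
    g2 = lambda x, y: x - y
    x = np.zeros(n)
    y = np.zeros(n)
    x[0], y[0] = 1.0, -1.0
    for j in range(n - 1):
        # each next state is generated from the past of both processes plus
        # an independent noise term, matching the factorization in (8)
        x[j + 1] = x[j] + delta * g1(x[j], y[j]) + np.sqrt(delta) * rng.normal()
        y[j + 1] = y[j] + delta * g2(x[j], y[j]) + np.sqrt(delta) * rng.normal()
    return x, y

x_path, y_path = simulate()

Because each update above uses only the past $(x^j, y^j)$ and its own independent noise, the joint law of the simulated processes factorizes step by step exactly as in (8).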