Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs
Christopher J. Quinn*, Student Member, IEEE, Negar Kiyavash, Senior Member, IEEE, and Todd P. Coleman, Senior Member, IEEE

Abstract—Recently, directed information graphs have been proposed as concise graphical representations of the statistical dynamics amongst multiple random processes. A directed edge from one node to another indicates that the past of one random process statistically affects the future of another, given the past of all other processes. When the number of processes is large, computing those conditional dependence tests becomes difficult. Also, when the number of interactions becomes too large, the graph no longer facilitates visual extraction of relevant information for decision-making. This work considers approximating the true joint distribution on multiple random processes by another, whose directed information graph has at most one parent for any node. Under a Kullback-Leibler (KL) divergence minimization criterion, we show that the optimal approximate joint distribution can be obtained by maximizing a sum of directed informations. In particular, (a) each directed information calculation only involves statistics amongst a pair of processes and can be efficiently estimated; (b) given all pairwise directed informations, an efficient minimum weight spanning directed tree algorithm can be solved to find the best tree. We demonstrate the efficacy of this approach using simulated and experimental data. In both, the approximations preserve the relevant information for decision-making.

EDICS classification identifier: MLR-GRKN

The material in this paper was presented (in part) at the International Symposium on Information Theory and its Applications, Taichung, Taiwan, October 2010. C. Quinn is with the Department of Electrical and Computer Engineering, University of Illinois at Urbana Champaign, Urbana, Illinois 61801 (email: [email protected]). N. Kiyavash is with the Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana Champaign, Urbana, Illinois 61801 (email: [email protected]). T. P. Coleman is with the Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093 (email: [email protected]).

I. INTRODUCTION

Many important inference problems involve reasoning about dynamic relationships between time series. In such cases, observations of multiple time series are recorded and the objective of the observer is to understand relationships between the past of some processes, and how they affect the future of others. In general, with knowledge of joint statistics amongst multiple random processes, such decision-making could in principle be done. However, if these processes exhibit complex dynamics, gaining knowledge can be prohibitive from computational and storage perspectives. As such, it is appealing to develop an approximation of the joint distribution on multiple random processes which can be calculated efficiently and is less complex for inference. Moreover, simplified representations of joint statistics can facilitate easier visualization and human comprehension of complex relationships. For instance, in situations such as network intrusion detection, decision making in adversarial environments, and first response tasks where a rapid decision is required, such representations can greatly aid the situation awareness and the decision making process.

Graphical models have been used to describe both full and approximating joint distributions of random variables [1]. For many graphical models, random variables are represented as nodes and edges between pairs encode conditional dependence relationships. Markov networks and Bayesian networks are two common examples. Markov networks are undirected graphs, while Bayesian networks are directed acyclic graphs. A Bayesian network's graphical structure depends on the variable indexing. This methodology could in principle be applied to describing multiple random processes. For example, if we have n time indices and m random processes, then we could create a Markov or Bayesian network on mn random variables. However, if m or n is large, this could be prohibitive from a complexity and visualization perspective.

We have recently developed an alternative graphical model, termed a "directed information graph," to parsimoniously describe statistical dynamics amongst a collection of random processes [2]. Each process is represented by a single node, and directed edges encode conditional independence relationships pertaining to how the past of one process affects the future of another, given the past of all other processes. As such, in this framework, directed edges represent directions of causal influence¹. They are motivated by simplified generative models of coupled dynamical systems. They admit cycles, and can thus represent feedback between processes. Under appropriate technical conditions, they do not depend on process indexing and moreover are unique [2].

¹"Causal" in this work refers to Granger causality [3], where a process X is said to causally influence a process Y if the past of X helps to predict the future of Y, already conditioned on the past of Y and all other processes.

Directed information graphs are particularly attractive when we have m processes and a large number n of time units: it collapses a graph of mn nodes to a graph on m nodes and a directed arrow encodes information about temporal dynamics. In some situations, however, the number m of processes we record itself can be very large, and in such a situation each conditional independence test, involving all m processes, can be difficult to evaluate. Moreover, even visualization of the directed information graph with up to $m^2$ edges can be cumbersome. As such, the benefits of having a precise picture of the statistical dynamics might be out-weighed by the costs in computation, storage, and ease-of-use to a human.

An approximation of the joint distribution that preserves a small number of important interactions could alleviate this problem. As an example, consider how a social network company negotiates the costs of advertisements to its users with another company. If the preferences or actions of certain users on average have a large "causal" influence on the subsequent preferences or actions of friends in their network, then a business might be willing to pay more money to advertise to those users, as compared to the down-stream friends with less influence. By paying to advertise to the influential users, the business is effectively advertising to many. For the social network company and the business to agree on pricing, however, it needs to be agreed on which users are the most influential. With a complicated social network, such as Figure 1(a), a simple procedure to identify who to advertise to, and for how much, might be onerous to develop. However, if the social network company could approximate the user interactions dynamics into a simplified - but accurate - picture, such as Figure 1(b), then it would be much easier for the business to see who to target to influence the whole network. This is the motivation of this work.

Fig. 1. Graphical models of the full user influence dynamics of the social network and an approximation of those dynamics. Arrow widths correspond to strengths of influence. Although some structural components are lost, the graph of the approximation makes it clear who to target and the paths of expected influence. (a) The full influence structure of the social network: it is difficult to determine which users to target to indirectly influence the whole network. (b) The graph of an approximation which captures key structural components: by targeting only the root of the tree, who is circled, it is possible influence will spread throughout the rest of the network.

Directed trees, such as Figure 1(b), are among the simplest structures that could be used for approximation - each node has at most one parent. In reducing the computational, storage, and visual complexity substantially, directed trees are much more amenable to analysis than the full structure. They also depict a clear hierarchy between nodes. We here consider the problem of finding the best approximation of a joint distribution on m random processes so that each node in the subsequent directed information graph has at most one parent. We will demonstrate the efficacy of this approach from complexity, visualization, and decision-making perspectives.

II. OUR CONTRIBUTION AND RELATED WORK

A. Our Contribution

In this paper, we consider the problem of approximating a joint distribution on m random processes by another joint distribution on m random processes, where each node in the subsequent directed information graph has at most one parent. We consider two variants, one in which the approximation's directed information graph need not be connected, and the second for which it is (i.e. it must be a directed tree). We use the KL divergence as the metric to find the best approximation, and show that the subsequent optimization problem is equivalent to maximizing a sum of pairwise directed informations. Both cases only require knowledge of statistics between pairs of processes to find the best such approximations. For the connected case, a minimum weight spanning tree algorithm can be solved in time that is quadratic in the number of processes. Both approximations have similar algorithmic and storage complexity. We demonstrate the utility of this approach in simulated and experimental data, where the relevant information for decision-making is maintained in the tree approximation.

B. Related work

Chow and Liu proposed an efficient algorithm to find an optimal tree structured approximation to the joint distribution on a collection of random variables [4]. Since then, many works have been developed to approximate joint distributions, in terms of underlying Markov and Bayesian networks. There have been other works that approximated with more complicated structures; see [1] for an overview.

In this work, we use directed information graphs to describe the joint distribution on random processes, in terms of how the past of processes statistically affect the future of others. These were recently introduced in [2], where it was also shown that they are a generalized embodiment of Granger's notion of causality [3] and that under mild assumptions, they are equivalent to minimal generative model graphs.

Many methods to estimate joint distributions on random processes from a generative model perspective have recently been developed. Shalizi et al. [5] have developed methods using a stochastic state reconstruction algorithm for discrete valued processes to identify interactions between processes and functional communities. "Group Lasso" is a method to infer the causal relationships between multivariate auto-regressive models [6]. Bolstad et al. recently showed conditions under which the estimates of Group Lasso are consistent [7]. Puig et al. have developed a multidimensional shrinkage-threshold operator which arises in problems with Group Lasso type penalties [8]. Tan and Willsky analyzed sample complexity for identifying the topology of a tree structured network of LTI systems [9]. Materassi et al. have developed methods based on Wiener filtering to statistically infer causal influences in linear stochastic dynamical systems; consistency results have been derived for the case when the underlying dynamics have a tree structure [10], [11]. For the setting where the directed information graph has a tree structure and some processes are not observed, Etesami et al. developed a procedure to recover the graph [12].

C. Paper organization

The paper organization is as follows. Section III establishes definitions and notations. In Section IV, we review directed information graphs and discuss their relationship with generative models of stochastic dynamical systems to motivate our approach. In Section V, we present our main results pertaining to finding the optimal approximations of the joint distribution where each node can have at most one parent, both unconstrained and when the structure is constrained to be a directed tree. Here we show that in both cases, the optimization simplifies to maximizing a sum of pairwise directed informations. In Section VI, we analyze the algorithmic and storage complexity of the approximations. In Section VII, we review parametric estimation, evaluate the performance of the approximations in a simulated binary classification experiment, and showcase the utility of this approach in elucidating the wave-like phenomena in the joint neural spiking activity of primary motor cortex.

III. DEFINITIONS AND NOTATION

This section presents probabilistic notations and information-theoretic definitions and identities that will be used throughout the remainder of the manuscript. Unless otherwise noted, the definitions and identities come from Cover & Thomas [13].

• For a sequence $a_1, a_2, \ldots$, denote $a^i \triangleq (a_1, \ldots, a_i)$.
• For any Borel space $\mathsf{Z}$, denote its Borel sets by $\mathcal{B}(\mathsf{Z})$ and the space of probability measures on $(\mathsf{Z}, \mathcal{B}(\mathsf{Z}))$ as $\mathcal{P}(\mathsf{Z})$.
• Consider two probability measures $P$ and $Q$ in $\mathcal{P}(\mathsf{Z})$. $P$ is absolutely continuous with respect to $Q$ (denoted as $P \ll Q$) if $Q(A) = 0$ implies that $P(A) = 0$ for all $A \in \mathcal{B}(\mathsf{Z})$. If $P \ll Q$, denote the Radon-Nikodym derivative as the random variable $\frac{dP}{dQ} : \mathsf{Z} \to \mathbb{R}$ that satisfies
$$P(A) = \int_{z \in A} \frac{dP}{dQ}(z)\, Q(dz), \qquad A \in \mathcal{B}(\mathsf{Z}).$$
• The Kullback-Leibler divergence between $P \in \mathcal{P}(\mathsf{Z})$ and $Q \in \mathcal{P}(\mathsf{Z})$ is defined as
$$D(P \,\|\, Q) \triangleq \mathbb{E}_P\left[\log \frac{dP}{dQ}\right] = \int_{z \in \mathsf{Z}} \log \frac{dP}{dQ}(z)\, P(dz) \qquad (1)$$
if $P \ll Q$ and $\infty$ otherwise.
• For a sample space $\Omega$, sigma-algebra $\mathcal{F}$, and probability measure $\mathbb{P}$, denote the probability space as $(\Omega, \mathcal{F}, \mathbb{P})$.
• Throughout this paper, we will consider $m$ random processes where the $i$th (with $i \in \{1, \ldots, m\}$) random process at time $t$ (with $t \in \{1, \ldots, n\}$) takes values in a Borel space $\mathsf{X}$. Denote the $i$th random variable at time $t$ by $X_{i,t} : \Omega \to \mathsf{X}$, the $i$th random process as $\mathbf{X}_i = (X_{i,1}, \ldots, X_{i,n})^\top$, and the whole collection of all $m$ random processes as $\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_m)^\top$.
• The probability measure $\mathbb{P}$ thus induces a joint distribution on $\mathbf{X}$ given by $P_{\mathbf{X}}(\cdot) \in \mathcal{P}(\mathsf{X}^{mn})$, a joint distribution on $\mathbf{X}_i$ given by $P_{\mathbf{X}_i}(\cdot) \in \mathcal{P}(\mathsf{X}^{n})$, and a marginal distribution on $X_{i,t}$ given by $P_{X_{i,t}}(\cdot) \in \mathcal{P}(\mathsf{X})$.
• With slight abuse of notation, denote $\mathbf{X} \equiv \mathbf{X}_i$ for some $i$ and $\mathbf{Y} \equiv \mathbf{X}_j$ for some $j \neq i$, and denote the conditional distribution and causally conditioned distribution of $\mathbf{Y}$ given $\mathbf{X}$ as
$$P_{\mathbf{Y}|\mathbf{X}=x}(dy) \triangleq P_{\mathbf{Y}|\mathbf{X}}(dy \mid x) = \prod_{t=1}^{n} P_{Y_t \mid Y^{t-1}, X^{n}}(dy_t \mid y^{t-1}, x^{n})$$
$$P_{\mathbf{Y}\|\mathbf{X}=x}(dy) \triangleq P_{\mathbf{Y}\|\mathbf{X}}(dy \,\|\, x) \triangleq \prod_{t=1}^{n} P_{Y_t \mid Y^{t-1}, X^{t-1}}(dy_t \mid y^{t-1}, x^{t-1}). \qquad (2)$$
Note the similarity with regular conditioning in (2), except in causal conditioning the future $(x_t^{n})$ is not conditioned on [14].
• The mutual information and directed information [15] between random process $\mathbf{X}$ and random process $\mathbf{Y}$ are
$$I(\mathbf{X}; \mathbf{Y}) = \int_x D\big(P_{\mathbf{Y}|\mathbf{X}=x} \,\|\, P_{\mathbf{Y}}\big)\, P_{\mathbf{X}}(dx)$$
$$I(\mathbf{X} \to \mathbf{Y}) = \int_x D\big(P_{\mathbf{Y}\|\mathbf{X}=x} \,\|\, P_{\mathbf{Y}}\big)\, P_{\mathbf{X}}(dx). \qquad (3)$$
Conceptually, mutual information and directed information are related. However, while mutual information quantifies statistical correlation (in the colloquial sense of statistical interdependence), directed information quantifies statistical causation. Note that $I(\mathbf{X}; \mathbf{Y}) = I(\mathbf{Y}; \mathbf{X})$, but $I(\mathbf{X} \to \mathbf{Y}) \neq I(\mathbf{Y} \to \mathbf{X})$ in general.
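To make the asymmetry concrete, here is a small worked example that is not from the paper: two binary processes of length $n = 2$ in which $\mathbf{Y}$ copies the previous value of $\mathbf{X}$. Using the chain-rule expansion $I(\mathbf{X} \to \mathbf{Y}) = \sum_{t=1}^{n} I(X^{t-1}; Y_t \mid Y^{t-1})$, which is equivalent to (3), take $X_1, X_2$ i.i.d. Bernoulli(1/2), $Y_1 = 0$, and $Y_2 = X_1$:
$$I(\mathbf{X} \to \mathbf{Y}) = I(X^{0}; Y_1) + I(X^{1}; Y_2 \mid Y_1) = 0 + H(X_1) = 1 \text{ bit},$$
$$I(\mathbf{Y} \to \mathbf{X}) = I(Y^{0}; X_1) + I(Y^{1}; X_2 \mid X_1) = 0 + 0 = 0 \text{ bits},$$
$$I(\mathbf{X}; \mathbf{Y}) = I\big((X_1, X_2); (Y_1, Y_2)\big) = H(X_1) = 1 \text{ bit}.$$
Thus $\mathbf{X}$ causally influences $\mathbf{Y}$ but not conversely, while mutual information is symmetric; here, because there is no instantaneous coupling, the two directed informations also sum to the mutual information, an identity used again in Section VI-A.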

Remark 1: Note that in (2), there is no conditioning on the present $x_t$. This follows Marko's definition [14] and is consistent with Granger causality [3]. Massey [15] and Kramer [16] later included conditioning on $x_t$ for the specific setting of communication channels. In such settings, since the directions of causation (e.g. that $\mathbf{X}$ is input and $\mathbf{Y}$ is output) are known, it is convenient to work with synchronized time, for which conditioning on $x_t$ is meaningful. Note, however, that by conditioning on the present $x_t$ in (2), that in a binary symmetric channel (for example) with input $\mathbf{X}$, output $\mathbf{Y}$, and no feedback, $I(\mathbf{Y} \to \mathbf{X}) > 0$, even though $\mathbf{Y}$ does not influence $\mathbf{X}$.

Directed information has been shown to play important roles in characterizing the capacity of channels with feedback [17]–[19], quantifying achievable rates for source coding with feedforward [20], for feedback control over noisy channels [21], [22], and gambling, hypothesis testing, and portfolio theory [23]. See [24] for examples and further discussion.

Remark 2: This work is in the setting of discrete time, such as sampled continuous-time processes. Under appropriate technical assumptions, directed information can be directly extended to continuous time on the $[0, T]$ interval. Define
$$\mathcal{F}_t = \sigma(X_\tau : 0 \le \tau < t,\ Y_\tau : 0 \le \tau < t)$$
to be the sigma-algebra generated by the past of all processes and
$$\widetilde{\mathcal{F}}_t = \sigma(Y_\tau : 0 \le \tau < t)$$
to be the sigma-algebra generated by the past of all processes excluding $\mathbf{X}$. If we assume that all processes are well-behaved (i.e. on Polish spaces), then we have that regular versions of $P(Y_t \in \cdot \mid \mathcal{F}_t)$ and $P(Y_t \in \cdot \mid \widetilde{\mathcal{F}}_t)$ exist almost-surely [25]. As such, we can denote the regular conditional probabilities by $P_t(\cdot) \in \mathcal{P}(\mathsf{Y})$ and $\widetilde{P}_t(\cdot) \in \mathcal{P}(\mathsf{Y})$ respectively. Then the directed information in continuous time is given in complete analogy with discrete time by
$$I(\mathbf{X} \to \mathbf{Y}) \triangleq \mathbb{E}\left[\int_0^{T} D\big(P_t \,\|\, \widetilde{P}_t\big)\, dt\right].$$
Connections between directed information in continuous time, causal continuous-time estimation, and communication in continuous time have also recently been proposed [26]. A treatment of the continuous-time setting is outside the scope of this work.

IV. BACKGROUND AND MOTIVATING EXAMPLE: APPROXIMATING THE STRUCTURE OF DYNAMICAL SYSTEMS

In this section, we describe the problem of identifying the structure of a stochastic, dynamical system, and then approximating it with another stochastic dynamical system. We will review the definitions and basic properties of directed information graphs. We first consider an example of a deterministic dynamical system described in state space format in terms of coupled differential equations.

Example 1: Consider a system with three deterministic processes, $\{x_t, y_t, z_t\}$, which evolves according to:
$$\dot{x} = g_1(x, y, z) \ \Leftrightarrow\ x_{t+\Delta} = x_t + \Delta\, g_1(x^t, y^t, z^t)$$
$$\dot{y} = g_2(x, y, z) \ \Leftrightarrow\ y_{t+\Delta} = y_t + \Delta\, g_2(x^t, y^t, z^t)$$
$$\dot{z} = g_3(x, y, z) \ \Leftrightarrow\ z_{t+\Delta} = z_t + \Delta\, g_3(x^t, y^t, z^t).$$
Given the full past of the whole network, $\{x^t, y^t, z^t\}$, the future of each process (at time $t + \Delta$) can be constructed. In many cases, some processes do not depend on the past of every other process, but only some subset of other processes. Suppose we can simplify the above equations by removing all of the dependencies of how one process evolves given others:
$$x_{t+\Delta} = x_t + \Delta\, g_1(x^t, y^t)$$
$$y_{t+\Delta} = y_t + \Delta\, g_2(x^t, y^t)$$
$$z_{t+\Delta} = z_t + \Delta\, g_3(x^t, y^t, z^t).$$
This structure can be depicted graphically (see Figure 2(a)).

Fig. 2. Directed information graph and a causal dependence tree approximation for the dynamical system in Example 1. (a) Full causal dependence structure, the directed information graph. (b) Causal dependence tree approximation (7).

We can further approximate this dynamical system by approximating the functions $\{g_1(x^t, y^t), g_2(x^t, y^t), g_3(x^t, y^t, z^t)\}$ with functions whose generative models have fewer inputs. One approximation for the system is:
$$x_{t+\Delta} = x_t + \Delta\, g_1(x^t, y^t) \approx x_t + \Delta\, g_1'(x^t)$$
$$y_{t+\Delta} = y_t + \Delta\, g_2(x^t, y^t)$$
$$z_{t+\Delta} = z_t + \Delta\, g_3(x^t, y^t, z^t) \approx z_t + \Delta\, g_3'(y^t, z^t).$$
Figure 2(b) depicts the corresponding directed tree structure. We refer to such structures as causal dependence trees.

A similar procedure can be used for networks of stochastic processes, where the system is described in a time-evolving manner through conditional probabilities. Consider three processes $\{\mathbf{X}, \mathbf{Y}, \mathbf{Z}\}$, formed by including i.i.d. noises $\{B_t, C_t, D_t\}_{t=1}^{n}$ to the above dynamical system and relabeling the time indices:
$$X_{t+1} = X_t + \Delta\, g_1(X^t, Y^t, Z^t) + B_{t+1}$$
$$Y_{t+1} = Y_t + \Delta\, g_2(X^t, Y^t, Z^t) + C_{t+1}$$
$$Z_{t+1} = Z_t + \Delta\, g_3(X^t, Y^t, Z^t) + D_{t+1}.$$
The system can alternatively be described through the joint distribution (up to time $n$) as
$$P_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(dx, dy, dz) = \prod_{t=1}^{n} P_{X_t, Y_t, Z_t \mid X^{t-1}, Y^{t-1}, Z^{t-1}}(dx_t, dy_t, dz_t \mid x^{t-1}, y^{t-1}, z^{t-1}).$$
Because of the causal structure of the dynamical system and the statistical independence of the noises, given the full past, the present values are conditionally independent:
$$P_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(dx, dy, dz) = \prod_{t=1}^{n} P_{X_t \mid X^{t-1}, Y^{t-1}, Z^{t-1}}(dx_t \mid x^{t-1}, y^{t-1}, z^{t-1}) \times P_{Y_t \mid X^{t-1}, Y^{t-1}, Z^{t-1}}(dy_t \mid x^{t-1}, y^{t-1}, z^{t-1}) \times P_{Z_t \mid X^{t-1}, Y^{t-1}, Z^{t-1}}(dz_t \mid x^{t-1}, y^{t-1}, z^{t-1}). \qquad (4)$$
More generally, we will make the analogous assumption about the chain rule and how each process at time $t$ is conditionally independent of one another, given the full past of all processes.

Assumption 1: Equation (4) holds and $\frac{dP_{\mathbf{X}}}{d\phi}(x) > 0$ for all $x$ and some measure $\phi$ with $P_{\mathbf{X}} \ll \phi$.

A large class of stochastic systems satisfy Assumption 1. For example, coupled stochastic processes described by an Ito stochastic differential equation with independent Brownian noise satisfy the continuous-time equivalent of this assumption [2]. Granger argued that this is a valid assumption for real world systems, provided the sampling rate $1/\Delta$ is high [3]. We can rewrite (4) using causal conditioning notation (2):
$$P_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(dx, dy, dz) = P_{\mathbf{X}\|\mathbf{Y},\mathbf{Z}}(dx \,\|\, y, z)\, P_{\mathbf{Y}\|\mathbf{X},\mathbf{Z}}(dy \,\|\, x, z)\, P_{\mathbf{Z}\|\mathbf{X},\mathbf{Y}}(dz \,\|\, x, y).$$
As in the deterministic case, often the evolution of one process does not depend on every other process, but only some subset. We can remove the unnecessary dependencies to obtain
$$P_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(dx, dy, dz) = P_{\mathbf{X}\|\mathbf{Y}}(dx \,\|\, y)\, P_{\mathbf{Y}\|\mathbf{X}}(dy \,\|\, x)\, P_{\mathbf{Z}\|\mathbf{X},\mathbf{Y}}(dz \,\|\, x, y).$$
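For illustration only (not from the paper), the sketch below simulates the stochastic network above; the linear forms of g1, g2, g3, the parameter values, and the Markov-1 simplification (the updates use only the most recent values rather than the full pasts) are assumptions made for concreteness, chosen to match the dependence structure of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 1000, 0.1

# Assumed linear update functions matching the dependence structure of Example 1:
# x and y depend on (x, y); z depends on (x, y, z).
def g1(x, y):    return -0.5 * x + 0.3 * y
def g2(x, y):    return  0.4 * x - 0.5 * y
def g3(x, y, z): return  0.2 * x + 0.3 * y - 0.5 * z

x, y, z = np.zeros(n), np.zeros(n), np.zeros(n)
B, C, D = (rng.normal(0, 0.1, n) for _ in range(3))   # i.i.d. noise sequences

for t in range(n - 1):
    # Given the full past, the three updates are conditionally independent, as in (4).
    x[t + 1] = x[t] + delta * g1(x[t], y[t]) + B[t + 1]
    y[t + 1] = y[t] + delta * g2(x[t], y[t]) + C[t + 1]
    z[t + 1] = z[t] + delta * g3(x[t], y[t], z[t]) + D[t + 1]
```

Data generated this way factorizes exactly as in the causally conditioned product above, which is what the approximations in the next section exploit.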

The dependence structure of this stochastic system is represented by Figure 2(a). We next generalize this procedure. For each process $\mathbf{X}_i$, let $A(i) \subseteq \{1, \ldots, m\} \setminus \{i\}$ denote a potential subset of parent processes. Define the corresponding induced probability measure $P_A$:
$$P_A(dx) = \prod_{i=1}^{m} P_{\mathbf{X}_i \| \mathbf{X}_{A(i)}}(dx_i \,\|\, x_{A(i)}). \qquad (5)$$
To find a minimal graph, for each process $\mathbf{X}_i$, we would like to find the smallest set of parents that fully describes the dynamics of $\mathbf{X}_i$ as well as the whole network does:
$$D\big(P_{\mathbf{X}} \,\|\, P_A\big) = 0. \qquad (6)$$
In Example 1, the $A(i)$'s would correspond to $\{\mathbf{Y}\}$, $\{\mathbf{X}\}$, and $\{\mathbf{X}, \mathbf{Y}\}$ for $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$, respectively. The parent sets $\{A(i)\}_{i=1}^{m}$ can be independently minimized so that (6) holds. With these minimal parent sets, we can define the graphical model we will use throughout this discussion.²

²In [2], minimal generative model graphs are defined by Definition 4.1. Under mild technical assumptions they are equivalent to directed information graphs; for clarity we refer to them together as directed information graphs.

Definition 4.1: A directed information graph is a directed graph, where each process is represented by a node, and there is a directed edge from $\mathbf{X}_j$ to $\mathbf{X}_i$ for $i, j \in [m]$ iff $j \in A(i)$, where the cardinalities $\{|A(i)|\}_{i=1}^{m}$ are minimal such that (6) holds.

Lemma 4.2 ([2]): Under Assumption 1, directed information graphs are unique. Furthermore, for a given process $\mathbf{X}_i$, a directed edge is placed from $j$ to $i$ ($j \in A(i)$) if and only if $I(\mathbf{X}_j \to \mathbf{X}_i \,\|\, \mathbf{X} \setminus \{\mathbf{X}_j, \mathbf{X}_i\}) > 0$.

Directed information graphs can have cycles, representing feedback between processes, and can even be complete. For some systems, there might be a large number of influences between processes, with varying magnitudes. For analysis and even storage purposes, it can be helpful to have succinct approximations. For the stochastic system of Example 1, we can apply a similar approximation to this system as was done in the deterministic case with:
$$P_{\mathbf{X}\|\mathbf{Y}}(dx \,\|\, y) \approx P_{\mathbf{X}}(dx), \qquad P_{\mathbf{Z}\|\mathbf{X},\mathbf{Y}}(dz \,\|\, x, y) \approx P_{\mathbf{Z}\|\mathbf{Y}}(dz \,\|\, y).$$
Thus, our causal dependence tree approximation to these stochastic processes, denoted by $\widehat{P}_{\mathbf{X}}$, is:
$$P_{\mathbf{X}}(dx) \approx \widehat{P}_{\mathbf{X}}(dx) \triangleq P_{\mathbf{X}}(dx)\, P_{\mathbf{Y}\|\mathbf{X}}(dy \,\|\, x)\, P_{\mathbf{Z}\|\mathbf{Y}}(dz \,\|\, y). \qquad (7)$$
This approximation is represented graphically in Figure 2(b). Although the system in Example 1 only had three processes, with a large number $m$ of processes, the directed information graph could be quite complex, difficult to compute and analyze visually. As we will show, it is possible to construct efficient optimal tree-like approximations to the directed information graph, and these approximations do not suffer greatly in decision-making performance nor in visualization of relevant features.

V. MAIN RESULT: BEST PARENT AND CAUSAL DEPENDENCE TREE APPROXIMATIONS

We now describe two approaches to approximate joint distributions of networks of stochastic processes, with corresponding low complexity directed information graphs. In both cases, at most a single parent will be kept. The first case is an unconstrained optimization. The second constrains the approximating structure to be a causal dependence tree (this was presented in part at [27]). Minimizing the KL divergence between the full and approximating joint distributions in both cases will result in a sum of pairwise directed informations.

Fig. 3. Examples of directed information graph approximations. The best parent approximation is better in terms of KL divergence. However, the best tree approximation is connected and has a clearly distinguished root with paths from the root to all other nodes. Thus, it is more useful for applications such as targeted advertising. (a) Best parent approximation. (b) Best tree approximation.

We first examine the problem of finding the best approximation where each process has at most one parent. See Figure 3(a). Consider the joint distribution $P_{\mathbf{X}}$ of $m$ random processes $\{\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_m\}$, each of length $n$. We will consider approximations of the form
$$\widehat{P}_{\mathbf{X}}(dx) \triangleq \prod_{i=1}^{m} P_{\mathbf{X}_i \| \mathbf{X}_{a(i)}}(dx_i \,\|\, x_{a(i)}), \qquad (8)$$
where $a(i) \in \{1, \ldots, m\} \setminus \{i\}$ selects the parent. Let $G_1$ denote the set of all such approximations. We want to find the $\widehat{P}_{\mathbf{X}} \in G_1$ that minimizes the KL divergence.

Theorem 1:
$$\arg\min_{\widehat{P}_{\mathbf{X}} \in G_1} D(P_{\mathbf{X}} \,\|\, \widehat{P}_{\mathbf{X}}) = \arg\max_{a(i) \in \{1,\ldots,m\}\setminus\{i\}} \sum_{i=1}^{m} I(\mathbf{X}_{a(i)} \to \mathbf{X}_i). \qquad (9)$$

Proof: First define the product distribution
$$\widetilde{P}_{\mathbf{X}}(dx) \triangleq \prod_{i=1}^{m} P_{\mathbf{X}_i}(dx_i), \qquad (10)$$
which is equivalent to $P_{\mathbf{X}}(x)$ when the processes are statistically independent. Note that $P_{\mathbf{X}}$, $\widehat{P}_{\mathbf{X}}$, $\widetilde{P}_{\mathbf{X}}$ all lie in $\mathcal{P}(\Omega)$, and moreover, $P_{\mathbf{X}} \ll \widehat{P}_{\mathbf{X}} \ll \widetilde{P}_{\mathbf{X}}$. Thus, the Radon-Nikodym derivative $\frac{dP_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}$ satisfies the chain rule [28]:
$$\frac{dP_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}} = \frac{dP_{\mathbf{X}}}{d\widehat{P}_{\mathbf{X}}} \cdot \frac{d\widehat{P}_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}. \qquad (11)$$
Thus,
$$\arg\min_{\widehat{P}_{\mathbf{X}} \in G_1} D(P_{\mathbf{X}} \,\|\, \widehat{P}_{\mathbf{X}}) = \arg\min_{\widehat{P}_{\mathbf{X}} \in G_1} \mathbb{E}_{P_{\mathbf{X}}}\left[\log \frac{dP_{\mathbf{X}}}{d\widehat{P}_{\mathbf{X}}}\right]$$
$$= \arg\min_{\widehat{P}_{\mathbf{X}} \in G_1} \mathbb{E}_{P_{\mathbf{X}}}\left[\log \frac{dP_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}\right] + \mathbb{E}_{P_{\mathbf{X}}}\left[-\log \frac{d\widehat{P}_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}\right] \qquad (12)$$
$$= \arg\max_{\widehat{P}_{\mathbf{X}} \in G_1} \mathbb{E}_{P_{\mathbf{X}}}\left[\log \frac{d\widehat{P}_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}\right] \qquad (13)$$
$$= \arg\max_{\widehat{P}_{\mathbf{X}} \in G_1} \sum_{i=1}^{m} \int_x \log \frac{dP_{\mathbf{X}_i \| \mathbf{X}_{a(i)}=x_{a(i)}}}{dP_{\mathbf{X}_i}}\, P_{\mathbf{X}}(dx) \qquad (14)$$
$$= \arg\max_{\widehat{P}_{\mathbf{X}} \in G_1} \sum_{i=1}^{m} \int_x D\Big(P_{\mathbf{X}_i \| \mathbf{X}_{a(i)}=x_{a(i)}} \,\|\, P_{\mathbf{X}_i}\Big)\, P_{\mathbf{X}_{a(i)}}(dx) \qquad (15)$$
$$= \arg\max_{\widehat{P}_{\mathbf{X}} \in G_1} \sum_{i=1}^{m} I(\mathbf{X}_{a(i)} \to \mathbf{X}_i) \qquad (16)$$
$$= \sum_{i=1}^{m} \arg\max_{a(i) \in \{1,\ldots,m\}\setminus\{i\}} I(\mathbf{X}_{a(i)} \to \mathbf{X}_i), \qquad (17)$$
where (12) applies the log to (11) and rearranges; (13) follows from $\frac{dP_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}$ not depending on $\widehat{P}_{\mathbf{X}}$; (14) follows from (8) and (10); (15) follows from (1); (16) follows from (3); and (17) follows from the choice of each $a(i)$ affecting only a single term in the sum.

Thus, finding the optimal structure where each node has at most one parent is equivalent to individually maximizing pairwise directed informations. The process is described in Algorithm 1. Let $R$ denote the set of all pairwise marginal distributions of $P_{\mathbf{X}}$:
$$R = \{P_{\mathbf{X}_i, \mathbf{X}_j} : i, j \in \{1, \ldots, m\}\}.$$

Algorithm 1. Best Parent
Input: R
1. For i ∈ {1, . . . , m}
2.   a(i) ← ∅
3.   For j ∈ {1, . . . , m}\{i}
4.     Compute I(X_j → X_i)
5.   a(i) ← arg max_j I(X_j → X_i)

Algorithm 1 will return the best possible approximation where only pairwise interactions are preserved. It is possible, though, that $\widehat{P}_{\mathbf{X}}$ could be disconnected. See Figure 3(a). For some applications, such as picking a single most influential user in a group of friends for targeted advertising, it is useful to have a connected structure with a dominant node. See Figure 3(b). Next consider the case where candidate approximations have causal dependence tree structures. The approximations have the form
$$\widehat{P}_{\mathbf{X}}(dx) \triangleq \prod_{i=1}^{m} P_{\mathbf{X}_{\pi(i)} \| \mathbf{X}_{l(\pi(i))}}(dx_{\pi(i)} \,\|\, x_{l(\pi(i))}), \qquad (18)$$
where $\pi$ is a permutation on $\{1, \ldots, m\}$ and $0 \le l(i) < i$ with $\mathbf{X}_0$ denoting a deterministic constant (for the root node's dependence). Let $T_C$ denote the set of all possible causal dependence tree approximations. Like before, we want to find the $\widehat{P}_{\mathbf{X}} \in T_C$ that minimizes the KL divergence.

Theorem 2:
$$\arg\min_{\widehat{P}_{\mathbf{X}} \in T_C} D(P_{\mathbf{X}} \,\|\, \widehat{P}_{\mathbf{X}}) = \arg\max_{\widehat{P}_{\mathbf{X}} \in T_C} \sum_{i=1}^{m} I(\mathbf{X}_{l(\pi(i))} \to \mathbf{X}_{\pi(i)}). \qquad (19)$$

Proof: The proof is similar to the proof of Theorem 1, except that (16) cannot be broken up, as the structural constraint couples choosing $\pi(\cdot)$ and $l(\cdot)$.

Because the maximization became decoupled in Theorem 1, there was a simple algorithm to find the best structure, and that algorithm could be run in a distributed manner. Although that does not happen here, note that the optimal $\widehat{P}_{\mathbf{X}} \in T_C$ is maximizing a sum of pairwise directed information values. Each value corresponds to an edge weight for one directed edge in a complete directed graph on the $m$ processes. To find the tree with maximal weight, we can employ a maximum-weight directed spanning tree (MWDST) algorithm. We discuss MWDST algorithms in Section VI-A. Algorithm 2 describes the procedure to find the best approximating distribution with a causal dependence tree structure.

Algorithm 2. Causal Dependence Tree
Input: R
1. For i ∈ {1, . . . , m}
2.   a(i) ← ∅
3.   For j ∈ {1, . . . , m}\{i}
4.     Compute I(X_j → X_i)
5. {a(i)}_{i=1}^m ← MWDST({I(X_j → X_i)}_{1≤i≠j≤m})

Since $T_C$ contains simpler approximations than $G_1$, Algorithm 1's approximations are superior to Algorithm 2's in terms of KL divergence. For some applications, however, having a directed tree can be more useful for analysis and allocation of resources.

Remark 3: Chow and Liu [4] solved an analogous problem for a collection of random variables. They developed an algorithm to efficiently find the best tree structured approximation for a Markov network (or, equivalently for that problem, a Bayesian network). They showed that using KL divergence, finding the best tree approximation was equivalent to maximizing a sum of mutual informations. They used a maximum weight spanning tree to solve the optimization. Thus, even though directed information graphs have different properties than Markov or Bayesian networks, and operate on a collection of random processes not variables, the method for finding the best tree is analogous.
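For concreteness, here is a minimal sketch of Algorithms 1 and 2, assuming the pairwise directed informations have already been estimated and stored in a matrix with entry [j, i] approximating I(X_j → X_i); using networkx's maximum spanning arborescence routine in place of a hand-rolled Chu–Liu/Edmonds implementation is an implementation choice, not something prescribed by the paper.

```python
import numpy as np
import networkx as nx

def best_parent(di):
    """Algorithm 1: for each process i, keep the parent j maximizing I(X_j -> X_i)."""
    m = di.shape[0]
    np.fill_diagonal(di, -np.inf)            # a process cannot be its own parent
    return {i: int(np.argmax(di[:, i])) for i in range(m)}

def best_tree(di):
    """Algorithm 2: maximum-weight directed spanning tree over the pairwise weights."""
    m = di.shape[0]
    G = nx.DiGraph()
    for j in range(m):
        for i in range(m):
            if i != j:
                G.add_edge(j, i, weight=di[j, i])   # edge j -> i weighted by I(X_j -> X_i)
    T = nx.maximum_spanning_arborescence(G, attr="weight")
    return {i: j for j, i in T.edges()}             # child -> parent map; the root has no entry

# Hypothetical 3-process weight matrix, for illustration only:
di = np.array([[0.0, 0.8, 0.1],
               [0.2, 0.0, 0.7],
               [0.0, 0.1, 0.0]])
print(best_parent(di.copy()))   # {0: 1, 1: 0, 2: 1} -- may contain a cycle
print(best_tree(di))            # {1: 0, 2: 1}      -- rooted tree
```

The toy weights illustrate the difference between the two outputs: the best-parent map may form a cycle between processes 0 and 1, whereas the arborescence is forced to pick a single root.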

Next, we consider the consistency of these algorithms in the setting of estimating from data. We discuss estimation in Section VII-A.

Theorem 3: Suppose $P_{\mathbf{X}} \in T_C$ and the estimates $\{\widehat{I}(\mathbf{X}_j \to \mathbf{X}_i)\}_{1 \le j \ne i \le m}$ converge almost surely (a.s.). Then for the output $\widehat{P}_{\mathbf{X}}$ of Algorithm 2,
$$\widehat{P}_{\mathbf{X}} \to P_{\mathbf{X}} \quad \text{a.s.} \qquad (20)$$

Proof: Since $P_{\mathbf{X}} \in T_C$, by Lemma 4.2, $P_{\mathbf{X}}$ is the unique tree structure with maximal sum of directed informations along its edges. Algorithm 2 finds the tree with maximal weight, and thus if the edge weights converge almost surely, then the tree estimate does also.

Note that an analogous result holds for Algorithm 1 in the case $P_{\mathbf{X}} \in G_1$. In general, there could be multiple approximation structures in $T_C$ or $G_1$ with the same maximal weight, so $\widehat{P}_{\mathbf{X}}$ might not converge, but the approximating structures picked would almost surely be among those of maximal weight.

VI. COMPLEXITY

In this section, we will discuss the complexity both of the algorithms and storage requirements for the solution.

A. Algorithmic complexity

Both algorithms first compute the directed information values between each pair. For discrete random processes, computing the directed information, a divergence (3), in general involves summations over exponentially large alphabets. Computing one directed information value for two processes of length $n$ is $O(|\mathsf{X}|^{2n})$. If the distributions are assumed to be jointly Markov of order $k$, then it becomes linear, $O(n|\mathsf{X}|^{2k}) = O(n)$ for fixed $k$. Thus computing the directed information for each ordered pair of processes is $O(nm^2)$ work when Markovicity is assumed.

For both algorithms, computation of the directed informations can be done independently: the for loops in lines 1 and 4 of both algorithms can be done in a distributed fashion. Note that computing only pairwise relationships is computationally much more tractable than in the full case. To identify the true directed information graph, divergence calculations using the whole network of processes are used [2], requiring $O(|\mathsf{X}|^{mn})$ time without Markovicity, and $O(n|\mathsf{X}|^{mk})$ with Markovicity.

Furthermore, the computation can be reduced by calculating mutual informations initially for line 4 in both algorithms. Equation (4) holding means $P_{\mathbf{X},\mathbf{Y}} = P_{\mathbf{X}\|\mathbf{Y}}\, P_{\mathbf{Y}\|\mathbf{X}}$, which implies $I(\mathbf{X}; \mathbf{Y}) = I(\mathbf{X} \to \mathbf{Y}) + I(\mathbf{Y} \to \mathbf{X})$ [14]. Since mutual and directed informations are non-negative, the mutual information bounds each directed information. Either directed information can later be computed to resolve both.

After computing the pairwise directed informations, Algorithm 1 then picks the best parent for each process, which takes $O(m^2)$ total, so the total runtime is $O(nm^2)$ assuming Markovicity. Algorithm 2 additionally computes a maximum weight spanning tree. Chu and Liu [29], Edmonds [30], and Bock [31] independently developed an efficient MWDST algorithm, which runs in $O(m^2)$ time. Thus, like Algorithm 1, Algorithm 2 also runs in $O(nm^2)$. Note that Humblet [32] proposed a distributed MWDST algorithm, which constructs the maximum weight tree for each node as root in $O(m^2)$ time. In some applications, it is useful to be able to choose from multiple potential roots.

B. Storage complexity

In the full joint distribution, there are $mn$ variables. Each possible outcome might have unique probability. Thus, for discrete variables with alphabet $\mathsf{X}$, the total storage for the joint distribution is $O(|\mathsf{X}|^{mn})$. Both approximations we consider reduce the full joint distribution to $m$ pairwise distributions. Thus, the storage is $O(m|\mathsf{X}|^{2n})$. Further, if the approximations have Markovicity of order $k$, the total storage becomes $O(mn|\mathsf{X}|^{2k}) = O(mn)$ for constant $k$.
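The following sketch illustrates the shortcut described in Section VI-A: when (4) holds, one mutual information and one directed information per unordered pair suffice to fill in both directed edge weights, since the reverse direction follows from the conservation identity. The estimator functions here are placeholders assumed to be supplied by the user; they are not part of the paper.

```python
import numpy as np

def pairwise_weights(processes, estimate_mi, estimate_di):
    """Fill di[a, b] ~ I(X_a -> X_b) with one MI and one DI estimate per unordered pair."""
    m = len(processes)
    di = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            mi = estimate_mi(processes[i], processes[j])        # I(X_i ; X_j), bounds both directions
            di[i, j] = estimate_di(processes[i], processes[j])  # I(X_i -> X_j)
            di[j, i] = max(mi - di[i, j], 0.0)                  # resolve I(X_j -> X_i); clip estimation noise
    return di
```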

VII. APPLICATIONS TO SIMULATED AND EXPERIMENTAL DATA

In this section, we demonstrate the efficacy of the approximations in a classification experiment with simulated time-series. We then show the approximations capture important structural characteristics of a network of brain cells from a neuroscience experiment. First we discuss parametric estimation of directed information from data.

A. Parametric Estimation

While a thorough discussion of estimation techniques is outside the scope of this work, for completeness we briefly describe the consistent parametric estimation technique for directed information proposed in [24] and [33] and applied to study brain cell networks. Afterwards, we discuss estimation for the specific setting of autoregressive time-series.

1) Point-Process Parametric Models: Let $\mathbf{Y}$ and $\mathbf{X}$ denote two binary time series of brain cell activity. $Y_t = 1$ if cell $Y$ was active at time $t$, otherwise $0$. Truccolo et al. [34] proposed modeling how $\mathbf{Y}$ depends on its own past and the past of $\mathbf{X}$ using a point process framework. The conditional log-likelihood has the form
$$\log f_{\mathbf{Y}\|\mathbf{X}}(y \,\|\, x; \theta) = \sum_{t=1}^{n} \Big[ \log\big(\lambda_\theta(t, y^{t-1}, x^{t-1})\big)\, y_t - \lambda_\theta(t, y^{t-1}, x^{t-1})\, \Delta \Big],$$
where $\Delta$ is the time length between samples and $\lambda_\theta(t, y^{t-1}, x^{t-1})$ is the conditional intensity function [34]
$$\log \lambda_\theta(t, y^{t-1}, x^{t-1}) = \alpha_0 + \sum_{j=1}^{J} \alpha_j y_{t-j} + \sum_{r=1}^{R} \beta_r x_{t-r}.$$
$\lambda_\theta(t, y^{t-1}, x^{t-1})$ can be interpreted as the propensity of $\mathbf{Y}$ to be active at time $t$ based on its past and the past of $\mathbf{X}$. The Markov orders $J$ and $R$ are assumed to be unknown. To avoid over-fitting, the minimum description length penalty [35] is used to select the MLE $\widehat{\theta}$:
$$(\widehat{J}, \widehat{R}, \widehat{\theta}) = \arg\min_{(J,R,\theta)} \left\{ -\frac{1}{n} \log f_{\mathbf{Y}\|\mathbf{X}}(y \,\|\, x; \theta) + \frac{(J+R)\log n}{2n} \right\}.$$
This penalty balances the Shannon code-length of encoding $\mathbf{Y}$ with causal side information $\mathbf{X}$ using a MLE $\widehat{\theta}(J, R)$ and the code-length required to describe the MLE $\widehat{\theta}(J, R)$. The directed information estimates are
$$\widehat{I}(\mathbf{X} \to \mathbf{Y}) \triangleq \frac{1}{n} \log \frac{f_{\mathbf{Y}\|\mathbf{X}}(y \,\|\, x; \widehat{\theta})}{f_{\mathbf{Y}}(y; \widehat{\theta}')}, \qquad (21)$$
where $\widehat{\theta}$ and $\widehat{\theta}'$ are the MLE parameter vectors for their respective models. Under stationarity, ergodicity, and Markovicity, almost sure convergence of $\widehat{I}(\mathbf{X} \to \mathbf{Y})$ is shown in [24]. These results extend to general parametric classes. Also note that for the setting of finite alphabets, [36] proposed universal estimation of directed information using context tree weighting.

2) Autoregressive Models: Next consider the specific parametric class of autoregressive time-series. Specifically, a Markov-order one autoregressive model (AR-1) for $\mathbf{X}$ is
$$X_t = B X_{t-1} + N_t, \qquad (22)$$
where $B$ is a coefficient matrix and $N_t$ is i.i.d. white Gaussian noise with variance matrix $\Sigma$. The noise components are assumed to be independent, so $\Sigma$ is diagonal. The coefficients $(B, \Sigma)$ are fixed, so for two processes $\mathbf{X} = (\mathbf{X}, \mathbf{Y})$ modeled as AR-1,
$$I(\mathbf{X} \to \mathbf{Y}) = \frac{1}{n} \sum_{t=1}^{n} I(X^{t-1}; Y_t \mid Y^{t-1}) = \frac{1}{n} \sum_{t=1}^{n} I(X_{n-1}; Y_n \mid Y_{n-1}) \qquad (23)$$
$$= \frac{1}{2} \log \left( \frac{|K_{Y_n, Y_{n-1}}|\, |K_{X_{n-1}, Y_{n-1}}|}{|K_{Y_{n-1}}|\, |K_{X_{n-1}, Y_n, Y_{n-1}}|} \right), \qquad (24)$$
where (23) follows from stationarity and Markovicity and (24) follows from (pg. 249 of [13]), with $|K_{Y_n, Y_{n-1}}|$ denoting the determinant of the covariance matrix of $\{Y_n, Y_{n-1}\}$. Note that by the recurrence relation (22), the covariance matrix $K_{X_t, X_{t'}}$ can be computed as
$$K_{X_t, X_{t'}} = \sum_{s=1}^{\min(t, t')} (B^{t-s})\, \Sigma\, (B^{t'-s})^\top. \qquad (25)$$
Thus, estimates of $\widehat{I}(\mathbf{X} \to \mathbf{Y})$ can be computed by first finding the least squares estimate $\widehat{B}$ of the coefficient matrix in (22), then computing covariance matrices using (25), and then computing (24).
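A minimal sketch of that computation follows, assuming the process starts from $X_0 = 0$ so that (25) applies directly; in practice B and Sigma would be least-squares estimates, and the example coefficient values at the end are assumptions chosen for illustration.

```python
import numpy as np

def cross_cov(B, Sigma, t, s):
    """K_{X_t, X_s} from the recursion (25), assuming X_0 = 0."""
    K = np.zeros_like(Sigma, dtype=float)
    for u in range(1, min(t, s) + 1):
        K += np.linalg.matrix_power(B, t - u) @ Sigma @ np.linalg.matrix_power(B, s - u).T
    return K

def directed_info_ar1(B, Sigma, n, src=0, tgt=1):
    """I(X -> Y) for a bivariate AR-1 model via the Gaussian identity (24);
    src/tgt index the components playing the roles of X and Y."""
    K11 = cross_cov(B, Sigma, n - 1, n - 1)      # Cov(X_{n-1}, X_{n-1})
    K12 = cross_cov(B, Sigma, n - 1, n)          # Cov(X_{n-1}, X_n)
    K22 = cross_cov(B, Sigma, n, n)              # Cov(X_n, X_n)
    # Joint covariance of (X_{n-1}, Y_{n-1}, Y_n):
    J = np.empty((3, 3))
    J[:2, :2] = K11[np.ix_([src, tgt], [src, tgt])]
    J[:2, 2] = K12[[src, tgt], tgt]
    J[2, :2] = J[:2, 2]
    J[2, 2] = K22[tgt, tgt]
    det = np.linalg.det
    num = det(J[np.ix_([1, 2], [1, 2])]) * det(J[np.ix_([0, 1], [0, 1])])  # |K_{Y_n,Y_{n-1}}| |K_{X_{n-1},Y_{n-1}}|
    den = J[1, 1] * det(J)                                                  # |K_{Y_{n-1}}| |K_{X_{n-1},Y_n,Y_{n-1}}|
    return 0.5 * np.log(num / den)

# Assumed stable coefficients in which X drives Y but Y does not drive X:
B = np.array([[0.5, 0.0], [0.4, 0.5]])
Sigma = np.diag([1.0, 1.0])
print(directed_info_ar1(B, Sigma, n=200))                  # positive
print(directed_info_ar1(B, Sigma, n=200, src=1, tgt=0))    # ~0
```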
B. Classification experiment

We tested the utility of the approximation methods using a binary classification experiment.

1) Setup: For the number of processes $m \in \{5, 10, 15\}$, 100 pairs of AR-1 models $(B, \Sigma)$ and $(B', \Sigma')$ were randomly generated. Each element of the $m \times m$ coefficient matrix $B$ was generated i.i.d. from a $\mathcal{N}(0, 1)$ distribution. $\Sigma$ was an $m \times m$ diagonal matrix with entries randomly selected from the interval $[\tfrac{1}{4}, 1]$ uniformly. For each AR model $(B, \Sigma)$, time-series of lengths $n \in \{50, 10^2, 10^3, 10^4\}$ were generated using (22). The coefficients of $(B, \Sigma)$ were estimated using least squares for each of the time-series. The best parent and best tree approximations were computed using estimated coefficients. The directed informations $\{I(\mathbf{X} \to \mathbf{Y})\}$ between each pair $(\mathbf{X}, \mathbf{Y})$ were estimated using the method in Section VII-A2 with $\mathbf{X} = (\mathbf{X}, \mathbf{Y})^\top$. To identify the MWDSTs, a Matlab implementation of Edmonds's algorithm [37] was used. Coefficients $(B', \Sigma')$ were generated and estimated likewise.

Next, classification was performed. For each pair of models $(B, \Sigma)$ and $(B', \Sigma')$, $n = 10^6$ length time-series were generated from each model using (22). First, the log-likelihoods of each time-step conditioned on the past were computed for the full distributions using estimates of $(B, \Sigma)$ and $(B', \Sigma')$. The frequency of correct classification was calculated. Next, the log-likelihoods using the best parent approximations with estimated coefficients were calculated and then those for the best tree approximations. This was repeated for each set of coefficient estimates, corresponding to $n \in \{50, 10^2, 10^3, 10^4\}$.

2) Results: The results of these classification experiments are shown in Figure 4. The classification rates are averaged over the 100 trials. Error bars show standard deviation. The best parent approximations only perform slightly better than the best tree approximations. Both performed close to 85% correct classification rate, slightly improving with larger $m$. Classification using the full distribution noticeably improves with $m$. This is due to the increased complexity of the distributions; with more processes, there are more relationships to distinguish the distributions. There are $m(m-1)$ edges in the full distribution compared to $m$ in the best parent and $m-1$ in the best tree approximations. Despite having significantly fewer edges, the approximations capture enough structure to distinguish models.

The effect of having a small number of samples to estimate AR coefficients is more dramatic as $m$ increases. For $m \in \{5, 10, 15\}$, coefficients estimated with $n = 10^3$ and $n = 10^4$ length time-series performed almost identically.

C. Application to Experimental Data

We now discuss an application of these methods to analysis of neural activity. A recent study computed the directed information graph for a group of neurons in a monkey's primary motor cortex [24]. Using that graph, they identified a dominant axis of local interactions, which corresponded to the known, primary direction of wave propagation of regional synchronous activity, believed to mediate information transfer [38]. We show that the best parent and best tree approximations preserve that dominant axis.

The monkey was performing a sequence of arm-reaching tasks. Its arm was constrained to move along a horizontal surface. Each task involved presentation of a randomly positioned, fixed target on the surface, the monkey moving its hand to meet the target, and a reward (drop of juice) given to the monkey if it was successful. For more details, see [24], [39]. Neural activity in the primary motor cortex was recorded by an implanted silicon micro-electrode array. The recorded waveforms were filtered and processed to produce, for each neuron that was detected, a sequence of times when that neuron became active (e.g. it "spiked"). The 37 neurons with the greatest total activity (number of spikes) were used for analysis.

To study the flow of activity between individual neurons, we constructed a directed information graph on the collection of neurons. To simplify computation, pairwise directed informations were estimated using the parametric estimation procedure discussed in Section VII-A.
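The classification step described above can be sketched as follows: a held-out series is scored under each fitted AR-1 model by its Gaussian one-step-ahead conditional log-likelihood, and the model with the larger score is chosen. This helper is illustrative only, not code from the paper; for the best-parent and best-tree approximations the same scoring applies with each row of B constrained to be nonzero only in the process's own column and its parent's column.

```python
import numpy as np

def ar1_loglik(X, B, Sigma):
    """Sum over t of log p(X_t | X_{t-1}) under X_t = B X_{t-1} + N_t, N_t ~ N(0, Sigma)."""
    d = X.shape[1]
    inv, logdet = np.linalg.inv(Sigma), np.linalg.slogdet(Sigma)[1]
    resid = X[1:] - X[:-1] @ B.T                      # one-step-ahead prediction errors
    quad = np.einsum('ti,ij,tj->t', resid, inv, resid)
    return -0.5 * np.sum(quad + logdet + d * np.log(2 * np.pi))

def classify(X, model_a, model_b):
    """Return 0 if model_a explains the series better, else 1. Each model is a (B, Sigma) pair."""
    return int(ar1_loglik(X, *model_b) > ar1_loglik(X, *model_a))
```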


Fig. 4. Classification rate between pairs of autoregressive series, for (a) $m = 5$, (b) $m = 10$, and (c) $m = 15$. For each $m \in \{5, 10, 15\}$, 100 pairs of autoregressive coefficients were generated randomly. Classification was performed using the full structures, best parent approximations, and best tree approximations, using coefficients estimated with $n \in \{50, 10^2, 10^3, 10^4\}$ length time-series. Error bars depict standard deviation.

(a) Graphical structure of non-zero pairwise directed information values. (b) Causal dependence tree approximation.

Fig. 5. Graphical structures of non-zero pairwise directed information values from [24] and causal dependence tree approximation. The best parent approximation was almost identical and is not shown. The blue arrow in Figure 5(a) depicts a dominant orientation of the edges. That orientation is consistent with the direction of propagation of local field potentials, which is believed to mediate information transfer [38]. Both approximations preserve that structure.

Figure 5(a) depicts the pairwise directed information graph. The relative positions of the neurons in the graph correspond to the relative positions of the recording electrodes. The blue arrow indicates a dominant orientation of the edges. This orientation along the rostro-caudal axis is consistent with the direction of propagation of local field potentials, which researchers believe mediates information transfer between regions [38].

We applied Algorithms 1 and 2 to this data set. The structure of the dependence tree approximation is shown in Figure 5(b). The best parent approximation is almost identical. The only differences are that the parents of nodes 28 and 13 are 27 and 3 respectively.

The original graph had 117 edges with many complicated loops. Both approximations reduced the number of edges to roughly a third, improving the clarity of the graph. Both approximations preserve the dominant edge orientation - pertaining to wave propagation - depicted by the blue arrow in Figure 5(a). This suggests that these approximation methodologies preserve relevant information for decision-making and visualization for analysis of mechanistic biological phenomena.

VIII. CONCLUSION

In this work, we presented efficient methods to optimally approximate networks of stochastic, dynamically interacting processes with low-complexity approximation methods. Both approximations only required pairwise marginal statistics between the processes, which computationally are significantly more tractable than the full joint distribution. Also, the corresponding directed information graphs are much more accessible to analysis and practical usage for many applications.

An important line of future work involves investigating methods to approximate with other, more complicated structures. Best-parent approximations and causal dependence tree approximations will always reduce the storage complexity dramatically and facilitate analysis. However, for some applications, it might be desirable to have slightly more complicated structures, such as connected graphs with at most three parents for each node. Such approximations highlight a richer set of interactions and feedback while still being visually and computationally simpler to analyze than the full structure. Although it might not always be possible to efficiently find optimal approximations of such graphical complexity, even near-optimal approximations could prove quite beneficial to real world applications.

ACKNOWLEDGMENTS

The authors thank Jalal Etesami and Mavis Rodrigues for their assistance with computer simulations. This work was supported in part to C. J. Quinn by the NSF IGERT fellowship under grant number DGE-0903622, and the Department of Energy Computational Science Graduate Fellowship under grant number DE-FG02-97ER25308; to N. Kiyavash by AFOSR under grants FA 9550-11-1-0016, FA 9550-10-1-0573, and FA 9550-10-1-0345; and by NSF grant CCF 10-54937 CAR; and to T. P. Coleman by NSF Science & Technology Center grant CCF-0939370 and NSF grant CCF 10-65352.

REFERENCES

[1] D. Koller and N. Friedman, Probabilistic graphical models: principles and techniques. The MIT Press, 2009.
[2] C. Quinn, N. Kiyavash, and T. Coleman, "Directed information graphs," arXiv preprint arXiv:1204.2003, 2012.
[3] C. Granger, "Investigating causal relations by econometric models and cross-spectral methods," Econometrica, vol. 37, no. 3, pp. 424–438, 1969.
[4] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. on Information Theory, vol. 14, no. 3, pp. 462–467, 1968.
[5] C. Shalizi, M. Camperi, and K. Klinkner, "Discovering functional communities in dynamical networks," Statistical Network Analysis: Models, Issues, and New Directions, pp. 140–157, 2007.
[6] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
[7] A. Bolstad, B. Van Veen, and R. Nowak, "Causal network inference via group sparse regularization," Signal Processing, IEEE Trans. on, vol. 59, no. 6, pp. 2628–2641, 2011.
[8] A. Puig, A. Wiesel, G. Fleury, and A. Hero, "Multidimensional shrinkage-thresholding operator and group lasso penalties," Signal Processing Letters, IEEE, vol. 18, no. 6, pp. 363–366, June 2011.
[9] V. Tan and A. Willsky, "Sample complexity for topology estimation in networks of LTI systems," in Decision and Control, 2011. IEEE Conference on. IEEE, 2011.
[10] D. Materassi and G. Innocenti, "Topological identification in networks of dynamical systems," Automatic Control, IEEE Trans. on, vol. 55, no. 8, pp. 1860–1871, 2010.
[11] D. Materassi and M. Salapaka, "On the problem of reconstructing an unknown topology via locality properties of the Wiener filter," Automatic Control, IEEE Trans. on, no. 99, pp. 1–1, 2011.
[12] J. Etesami, N. Kiyavash, and T. P. Coleman, "Learning minimal latent directed information trees," in Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on. IEEE, 2012, pp. 2726–2730.
[13] T. Cover and J. Thomas, Elements of information theory. Wiley-Interscience, 2006.
[14] H. Marko, "The bidirectional communication theory–a generalization of information theory," Communications, IEEE Trans. on, vol. 21, no. 12, pp. 1345–1351, Dec 1973.
[15] J. Massey, "Causality, feedback and directed information," in Proc. 1990 Intl. Symp. on Info. Th. and its Applications, 1990, pp. 27–30.
[16] G. Kramer, "Directed information for channels with feedback," Ph.D. dissertation, Swiss Federal Institute of Technology (ETH), Zürich, Switzerland, 1998.
[17] S. Tatikonda and S. Mitter, "The Capacity of Channels With Feedback," IEEE Trans. on Information Theory, vol. 55, no. 1, pp. 323–349, 2009.
[18] H. Permuter, T. Weissman, and A. Goldsmith, "Finite State Channels With Time-Invariant Deterministic Feedback," IEEE Trans. on Information Theory, vol. 55, no. 2, pp. 644–662, 2009.
[19] C. Li and N. Elia, "The information flow and capacity of channels with noisy feedback," arXiv preprint arXiv:1108.2815, 2011.
[20] R. Venkataramanan and S. Pradhan, "Source coding with feed-forward: rate-distortion theorems and error exponents for a general source," IEEE Trans. on Information Theory, vol. 53, no. 6, pp. 2154–2179, 2007.
[21] N. Martins and M. Dahleh, "Feedback control in the presence of noisy channels: "bode-like" fundamental limitations of performance," Automatic Control, IEEE Trans. on, vol. 53, no. 7, pp. 1604–1615, Aug. 2008.
[22] S. K. Gorantla, "The interplay between information and control theory within interactive decision-making problems," Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2012.
[23] H. Permuter, Y. Kim, and T. Weissman, "Interpretations of directed information in portfolio theory, data compression, and hypothesis testing," Information Theory, IEEE Trans. on, vol. 57, no. 6, pp. 3248–3259, 2011.
[24] C. Quinn, T. Coleman, N. Kiyavash, and N. Hatsopoulos, "Estimating the directed information to infer causal relationships in ensemble neural spike train recordings," Journal of Computational Neuroscience, vol. 30, no. 1, pp. 17–44, 2011.
[25] R. M. Gray, Probability, random processes, and ergodic properties. Springer, 2009.
[26] T. Weissman, Y.-H. Kim, and H. H. Permuter, "Directed Information, Causal Estimation, and Communication in Continuous Time," ArXiv e-prints, Sep. 2011.
[27] C. Quinn, T. Coleman, and N. Kiyavash, "Approximating discrete probability distributions with causal dependence trees," in Info. Theory and its App. (ISITA), 2010 Intl. Symp. on. IEEE, 2010, pp. 100–105.
[28] H. Royden and P. Fitzpatrick, Real analysis, 3rd ed. Macmillan New York, 1988.
[29] Y. Chu and T. Liu, "On the shortest arborescence of a directed graph," Science Sinica, vol. 14, no. 1396-1400, p. 270, 1965.
[30] J. Edmonds, "Optimum branchings," J. Res. Natl. Bur. Stand., Sect. B, vol. 71, pp. 233–240, 1967.
[31] F. Bock, "An algorithm to construct a minimum directed spanning tree in a directed network," Developments in operations research, vol. 1, pp. 29–44, 1971.
[32] P. Humblet, "A distributed algorithm for minimum weight directed spanning trees," Communications, IEEE Trans. on, vol. 31, no. 6, pp. 756–762, 1983.
[33] S. Kim, D. Putrino, S. Ghosh, and E. N. Brown, "A Granger causality measure for point process models of ensemble neural spiking activity," PLoS Comput Biol, vol. 7, no. 3, March 2011.
[34] W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, and E. N. Brown, "A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects," Journal of Neurophysiology, vol. 93, no. 2, pp. 1074–1089, 2005.
[35] P. D. Grünwald, The minimum description length principle. MIT press, 2007.
[36] J. Jiao, H. Permuter, L. Zhao, Y. Kim, and T. Weissman, "Universal estimation of directed information," arXiv preprint arXiv:1201.2334, 2012.
[37] G. Li, "Maximum Weight Spanning tree (Undirected)," Computer software, June 2009. [Online]. Available: http://www.mathworks.com/matlabcentral/fileexchange/23276-maximum-weight-spanning-tree-undirected
[38] D. Rubino, K. Robbins, and N. Hatsopoulos, "Propagating waves mediate information transfer in the motor cortex," Nature neuroscience, vol. 9, no. 12, pp. 1549–1557, 2006.
[39] W. Wu and N. Hatsopoulos, "Evidence against a single coordinate system representation in the motor cortex," Experimental brain research, vol. 175, no. 2, pp. 197–210, 2006.