Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs
Christopher J. Quinn*, Student Member, IEEE, Negar Kiyavash, Senior Member, IEEE, and Todd P. Coleman, Senior Member, IEEE

Abstract—Recently, directed information graphs have been proposed as concise graphical representations of the statistical dynamics amongst multiple random processes. A directed edge from one node to another indicates that the past of one random process statistically affects the future of another, given the past of all other processes. When the number of processes is large, computing those conditional dependence tests becomes difficult. Also, when the number of interactions becomes too large, the graph no longer facilitates visual extraction of relevant information for decision-making. This work considers approximating the true joint distribution on multiple random processes by another, whose directed information graph has at most one parent for any node. Under a Kullback-Leibler (KL) divergence minimization criterion, we show that the optimal approximate joint distribution can be obtained by maximizing a sum of directed informations. In particular, (a) each directed information calculation only involves statistics amongst a pair of processes and can be efficiently estimated; (b) given all pairwise directed informations, an efficient minimum weight spanning directed tree algorithm can be solved to find the best tree. We demonstrate the efficacy of this approach using simulated and experimental data. In both, the approximations preserve the relevant information for decision-making.

EDICS classification identifier: MLR-GRKN

The material in this paper was presented (in part) at the International Symposium on Information Theory and its Applications, Taichung, Taiwan, October 2010. C. Quinn is with the Department of Electrical and Computer Engineering, University of Illinois at Urbana Champaign, Urbana, Illinois 61801 (email: [email protected]). N. Kiyavash is with the Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana Champaign, Urbana, Illinois 61801 (email: [email protected]). T. P. Coleman is with the Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093 (email: [email protected]).

I. INTRODUCTION

Many important inference problems involve reasoning about dynamic relationships between time series. In such cases, observations of multiple time series are recorded and the objective of the observer is to understand relationships between the past of some processes, and how they affect the future of others. In general, with knowledge of joint statistics amongst multiple random processes, such decision-making could in principle be done. However, if these processes exhibit complex dynamics, gaining knowledge can be prohibitive from computational and storage perspectives. As such, it is appealing to develop an approximation of the joint distribution on multiple random processes which can be calculated efficiently and is less complex for inference. Moreover, simplified representations of joint statistics can facilitate easier visualization and human comprehension of complex relationships. For instance, in situations such as network intrusion detection, decision making in adversarial environments, and first response tasks where a rapid decision is required, such representations can greatly aid the situation awareness and the decision making process.

Graphical models have been used to describe both full and approximating joint distributions of random variables [1]. For many graphical models, random variables are represented as nodes and edges between pairs encode conditional dependence relationships. Markov networks and Bayesian networks are two common examples. Markov networks are undirected graphs, while Bayesian networks are directed acyclic graphs. A Bayesian network's graphical structure depends on the variable indexing. This methodology could in principle be applied to describing multiple random processes. For example, if we have n time indices and m random processes, then we could create a Markov or Bayesian network on mn random variables. However, if m or n is large, this could be prohibitive from a complexity and visualization perspective.

We have recently developed an alternative graphical model, termed a "directed information graph," to parsimoniously describe statistical dynamics amongst a collection of random processes [2]. Each process is represented by a single node, and directed edges encode conditional independence relationships pertaining to how the past of one process affects the future of another, given the past of all other processes. As such, in this framework, directed edges represent directions of causal influence¹. They are motivated by simplified generative models of coupled dynamical systems. They admit cycles, and can thus represent feedback between processes. Under appropriate technical conditions, they do not depend on process indexing and moreover are unique [2].

¹"Causal" in this work refers to Granger causality [3], where a process X is said to causally influence a process Y if the past of X helps to predict the future of Y, already conditioned on the past of Y and all other processes.

Directed information graphs are particularly attractive when we have m processes and a large number n of time units: it collapses a graph of mn nodes to a graph on m nodes and a directed arrow encodes information about temporal dynamics. In some situations, however, the number m of processes we record itself can be very large, and in such a situation each conditional independence test, involving all m processes, can be difficult to evaluate. Moreover, even visualization of the directed information graph with up to $m^2$ edges can be cumbersome. As such, the benefits of having a precise picture of the statistical dynamics might be out-weighed by the costs in computation, storage, and ease-of-use to a human.

An approximation of the joint distribution that preserves a small number of important interactions could alleviate this problem. As an example, consider how a social network company negotiates the costs of advertisements to its users with another company. If the preferences or actions of certain users on average have a large "causal" influence on the subsequent preferences or actions of friends in their network, then a business might be willing to pay more money to advertise to those users, as compared to the down-stream friends with less influence. By paying to advertise to the influential users, the business is effectively advertising to many. For the social network company and the business to agree on pricing, however, it needs to be agreed on which users are the most influential. With a complicated social network, such as Figure 1(a), a simple procedure to identify who to advertise to, and for how much, might be onerous to develop. However, if the social network company could approximate the user interactions dynamics into a simplified - but accurate - picture, such as Figure 1(b), then it would be much easier for the business to see who to target to influence the whole network. This is the motivation of this work.

Fig. 1. Graphical models of the full user influence dynamics of the social network and an approximation of those dynamics. Arrow widths correspond to strengths of influence. Although some structural components are lost, the graph of the approximation makes it clear who to target and the paths of expected influence. (a) The full influence structure of the social network: it is difficult to determine which users to target to indirectly influence the whole network. (b) The graph of an approximation which captures key structural components: by targeting only the root of the tree, who is circled, it is possible influence will spread throughout the rest of the network.

Directed trees, such as Figure 1(b), are among the simplest structures that could be used for approximation - each node has at most one parent. In reducing the computational, storage, and visual complexity substantially, directed trees are much more amenable to analysis than the full structure. They also depict a clear hierarchy between nodes. We here consider the problem of finding the best approximation of a joint distribution on m random processes so that each node in the subsequent directed information graph has at most one parent. We will demonstrate the efficacy of this approach from complexity, visualization, and decision-making perspectives.

II. OUR CONTRIBUTION AND RELATED WORK

A. Our Contribution

In this paper, we consider the problem of approximating a joint distribution on m random processes by another joint distribution on m random processes, where each node in the subsequent directed information graph has at most one parent. We consider two variants, one in which the approximation's directed information graph need not be connected, and the second for which it is (i.e. it must be a directed tree). We use the KL divergence as the metric to find the best approximation, and show that the subsequent optimization problem is equivalent to maximizing a sum of pairwise directed informations. Both cases only require knowledge of statistics between pairs of processes to find the best such approximations. For the connected case, a minimum weight spanning tree algorithm can be solved in time that is quadratic in the number of processes. Both approximations have similar algorithmic and storage complexity. We demonstrate the utility of this approach in simulated and experimental data, where the relevant information for decision-making is maintained in the tree approximation.

B. Related work

Chow and Liu proposed an efficient algorithm to find an optimal tree structured approximation to the joint distribution on a collection of random variables [4]. Since then, many works have been developed to approximate joint distributions, in terms of underlying Markov and Bayesian networks. There have been other works that approximated with more complicated structures; see [1] for an overview.

In this work, we use directed information graphs to describe the joint distribution on random processes, in terms of how the past of processes statistically affect the future of others. These were recently introduced in [2], where it was also shown that they are a generalized embodiment of Granger's notion of causality [3] and that under mild assumptions, they are equivalent to minimal generative model graphs.

Many methods to estimate joint distributions on random processes from a generative model perspective have recently been developed. Shalizi et al. [5] have developed methods using a stochastic state reconstruction algorithm for discrete valued processes to identify interactions between processes and functional communities. "Group Lasso" is a method to infer the causal relationships between multivariate auto-regressive models [6]. Bolstad et al. recently showed conditions under which the estimates of Group Lasso are consistent [7]. Puig et al. have developed a multidimensional shrinkage-threshold operator which arises in problems with Group Lasso type penalties [8]. Tan and Willsky analyzed sample complexity for identifying the topology of a tree structured network of LTI systems [9]. Materassi et al. have developed methods based on Wiener filtering to statistically infer causal influences in linear stochastic dynamical systems; consistency results have been derived for the case when the underlying dynamics have a tree structure [10], [11]. For the setting where the directed information graph has a tree structure and some processes are not observed, Etesami et al. developed a procedure to recover the graph [12].

C. Paper organization

The paper organization is as follows. Section III establishes definitions and notations. In Section IV, we review directed information graphs and discuss their relationship with generative models of stochastic dynamical systems to motivate our approach. In Section V, we present our main results pertaining to finding the optimal approximations of the joint distribution where each node can have at most one parent, both unconstrained and when the structure is constrained to be a directed tree. Here we show that in both cases, the optimization simplifies to maximizing a sum of pairwise directed informations. In Section VI, we analyze the algorithmic and storage complexity of the approximations. In Section VII, we review parametric estimation, evaluate the performance of the approximations in a simulated binary classification experiment, and showcase the utility of this approach in elucidating the wave-like phenomena in the joint neural spiking activity of primary motor cortex.

III. DEFINITIONS AND NOTATION

This section presents probabilistic notations and information-theoretic definitions and identities that will be used throughout the remainder of the manuscript. Unless otherwise noted, the definitions and identities come from Cover & Thomas [13].

• For a sequence $a_1, a_2, \ldots$, denote $a^i \triangleq (a_1, \ldots, a_i)$.
• For any Borel space $\mathsf{Z}$, denote its Borel sets by $\mathcal{B}(\mathsf{Z})$ and the space of probability measures on $(\mathsf{Z}, \mathcal{B}(\mathsf{Z}))$ as $\mathcal{P}(\mathsf{Z})$.
• Consider two probability measures $P$ and $Q$ in $\mathcal{P}(\mathsf{Z})$. $P$ is absolutely continuous with respect to $Q$ (denoted as $P \ll Q$) if $Q(A) = 0$ implies that $P(A) = 0$ for all $A \in \mathcal{B}(\mathsf{Z})$. If $P \ll Q$, denote the Radon-Nikodym derivative as the random variable $\frac{dP}{dQ} : \mathsf{Z} \to \mathbb{R}$ that satisfies
$$P(A) = \int_{z \in A} \frac{dP}{dQ}(z)\, Q(dz), \qquad A \in \mathcal{B}(\mathsf{Z}).$$
• The Kullback-Leibler divergence between $P \in \mathcal{P}(\mathsf{Z})$ and $Q \in \mathcal{P}(\mathsf{Z})$ is defined as
$$D(P \,\|\, Q) \triangleq \mathbb{E}_P\left[\log \frac{dP}{dQ}\right] = \int_{z \in \mathsf{Z}} \log \frac{dP}{dQ}(z)\, P(dz) \qquad (1)$$
if $P \ll Q$ and $\infty$ otherwise.
• For a sample space $\Omega$, sigma-algebra $\mathcal{F}$, and probability measure $\mathbb{P}$, denote the probability space as $(\Omega, \mathcal{F}, \mathbb{P})$.
• Throughout this paper, we will consider $m$ random processes where the $i$th (with $i \in \{1, \ldots, m\}$) random process at time $t$ (with $t \in \{1, \ldots, n\}$) takes values in a Borel space $\mathsf{X}$. Denote the $i$th random variable at time $t$ by $X_{i,t} : \Omega \to \mathsf{X}$, the $i$th random process as $\mathbf{X}_i = (X_{i,1}, \ldots, X_{i,n})^\top$, and the whole collection of all $m$ random processes as $\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_m)^\top$.
• The probability measure $\mathbb{P}$ thus induces a joint distribution on $\mathbf{X}$ given by $P_{\mathbf{X}}(\cdot) \in \mathcal{P}(\mathsf{X}^{mn})$, a joint distribution on $\mathbf{X}_i$ given by $P_{\mathbf{X}_i}(\cdot) \in \mathcal{P}(\mathsf{X}^{n})$, and a marginal distribution on $X_{i,t}$ given by $P_{X_{i,t}}(\cdot) \in \mathcal{P}(\mathsf{X})$.
• With slight abuse of notation, denote $\mathbf{X} \equiv \mathbf{X}_i$ for some $i$ and $\mathbf{Y} \equiv \mathbf{X}_j$ for some $j \neq i$, and denote the conditional distribution and causally conditioned distribution of $\mathbf{Y}$ given $\mathbf{X}$ as
$$P_{\mathbf{Y}|\mathbf{X}=x}(dy) \triangleq P_{\mathbf{Y}|\mathbf{X}}(dy \mid x) = \prod_{t=1}^{n} P_{Y_t \mid Y^{t-1}, X^{n}}(dy_t \mid y^{t-1}, x^{n})$$
$$P_{\mathbf{Y}\|\mathbf{X}=x}(dy) \triangleq P_{\mathbf{Y}\|\mathbf{X}}(dy \,\|\, x) \triangleq \prod_{t=1}^{n} P_{Y_t \mid Y^{t-1}, X^{t-1}}(dy_t \mid y^{t-1}, x^{t-1}). \qquad (2)$$
Note the similarity with regular conditioning in (2), except in causal conditioning the future $(x_t^{n})$ is not conditioned on [14].
• The mutual information and directed information [15] between random process $\mathbf{X}$ and random process $\mathbf{Y}$ are
$$I(\mathbf{X}; \mathbf{Y}) = \int_x D\big(P_{\mathbf{Y}|\mathbf{X}=x} \,\|\, P_{\mathbf{Y}}\big)\, P_{\mathbf{X}}(dx)$$
$$I(\mathbf{X} \to \mathbf{Y}) = \int_x D\big(P_{\mathbf{Y}\|\mathbf{X}=x} \,\|\, P_{\mathbf{Y}}\big)\, P_{\mathbf{X}}(dx). \qquad (3)$$
Conceptually, mutual information and directed information are related. However, while mutual information quantifies statistical correlation (in the colloquial sense of statistical interdependence), directed information quantifies statistical causation. Note that $I(\mathbf{X}; \mathbf{Y}) = I(\mathbf{Y}; \mathbf{X})$, but $I(\mathbf{X} \to \mathbf{Y}) \neq I(\mathbf{Y} \to \mathbf{X})$ in general.
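To make the asymmetry concrete, here is a small worked example that is not from the paper: two binary processes of length $n = 2$ in which $\mathbf{Y}$ copies the previous value of $\mathbf{X}$. Using the chain-rule expansion $I(\mathbf{X} \to \mathbf{Y}) = \sum_{t=1}^{n} I(X^{t-1}; Y_t \mid Y^{t-1})$, which is equivalent to (3), take $X_1, X_2$ i.i.d. Bernoulli(1/2), $Y_1 = 0$, and $Y_2 = X_1$:
$$I(\mathbf{X} \to \mathbf{Y}) = I(X^{0}; Y_1) + I(X^{1}; Y_2 \mid Y_1) = 0 + H(X_1) = 1 \text{ bit},$$
$$I(\mathbf{Y} \to \mathbf{X}) = I(Y^{0}; X_1) + I(Y^{1}; X_2 \mid X_1) = 0 + 0 = 0 \text{ bits},$$
$$I(\mathbf{X}; \mathbf{Y}) = I\big((X_1, X_2); (Y_1, Y_2)\big) = H(X_1) = 1 \text{ bit}.$$
Thus $\mathbf{X}$ causally influences $\mathbf{Y}$ but not conversely, while mutual information is symmetric; here, because there is no instantaneous coupling, the two directed informations also sum to the mutual information, an identity used again in Section VI-A.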

Remark 1: Note that in (2), there is no conditioning on the present $x_t$. This follows Marko's definition [14] and is consistent with Granger causality [3]. Massey [15] and Kramer [16] later included conditioning on $x_t$ for the specific setting of communication channels. In such settings, since the directions of causation (e.g. that $\mathbf{X}$ is input and $\mathbf{Y}$ is output) are known, it is convenient to work with synchronized time, for which conditioning on $x_t$ is meaningful. Note, however, that by conditioning on the present $x_t$ in (2), that in a binary symmetric channel (for example) with input $\mathbf{X}$, output $\mathbf{Y}$, and no feedback, $I(\mathbf{Y} \to \mathbf{X}) > 0$, even though $\mathbf{Y}$ does not influence $\mathbf{X}$.

Directed information has been shown to play important roles in characterizing the capacity of channels with feedback [17]–[19], quantifying achievable rates for source coding with feedforward [20], for feedback control over noisy channels [21], [22], and gambling, hypothesis testing, and portfolio theory [23]. See [24] for examples and further discussion.

Remark 2: This work is in the setting of discrete time, such as sampled continuous-time processes. Under appropriate technical assumptions, directed information can be directly extended to continuous time on the $[0, T]$ interval. Define
$$\mathcal{F}_t = \sigma(X_\tau : 0 \le \tau < t,\ Y_\tau : 0 \le \tau < t)$$
to be the sigma-algebra generated by the past of all processes and
$$\widetilde{\mathcal{F}}_t = \sigma(Y_\tau : 0 \le \tau < t)$$
to be the sigma-algebra generated by the past of all processes excluding $\mathbf{X}$. If we assume that all processes are well-behaved (i.e. on Polish spaces), then we have that regular versions of $P(Y_t \in \cdot \mid \mathcal{F}_t)$ and $P(Y_t \in \cdot \mid \widetilde{\mathcal{F}}_t)$ exist almost-surely [25]. As such, we can denote the regular conditional probabilities by $P_t(\cdot) \in \mathcal{P}(\mathsf{Y})$ and $\widetilde{P}_t(\cdot) \in \mathcal{P}(\mathsf{Y})$ respectively. Then the directed information in continuous time is given in complete analogy with discrete time by
$$I(\mathbf{X} \to \mathbf{Y}) \triangleq \mathbb{E}\left[\int_0^{T} D\big(P_t \,\|\, \widetilde{P}_t\big)\, dt\right].$$
Connections between directed information in continuous time, causal continuous-time estimation, and communication in continuous time have also recently been proposed [26]. A treatment of the continuous-time setting is outside the scope of this work.

IV. BACKGROUND AND MOTIVATING EXAMPLE: APPROXIMATING THE STRUCTURE OF DYNAMICAL SYSTEMS

In this section, we describe the problem of identifying the structure of a stochastic, dynamical system, and then approximating it with another stochastic dynamical system. We will review the definitions and basic properties of directed information graphs. We first consider an example of a deterministic dynamical system described in state space format in terms of coupled differential equations.

Example 1: Consider a system with three deterministic processes, $\{x_t, y_t, z_t\}$, which evolves according to:
$$\dot{x} = g_1(x, y, z) \ \Leftrightarrow\ x_{t+\Delta} = x_t + \Delta\, g_1(x^t, y^t, z^t)$$
$$\dot{y} = g_2(x, y, z) \ \Leftrightarrow\ y_{t+\Delta} = y_t + \Delta\, g_2(x^t, y^t, z^t)$$
$$\dot{z} = g_3(x, y, z) \ \Leftrightarrow\ z_{t+\Delta} = z_t + \Delta\, g_3(x^t, y^t, z^t).$$
Given the full past of the whole network, $\{x^t, y^t, z^t\}$, the future of each process (at time $t + \Delta$) can be constructed. In many cases, some processes do not depend on the past of every other process, but only some subset of other processes. Suppose we can simplify the above equations by removing all of the dependencies of how one process evolves given others:
$$x_{t+\Delta} = x_t + \Delta\, g_1(x^t, y^t)$$
$$y_{t+\Delta} = y_t + \Delta\, g_2(x^t, y^t)$$
$$z_{t+\Delta} = z_t + \Delta\, g_3(x^t, y^t, z^t).$$
This structure can be depicted graphically (see Figure 2(a)).

Fig. 2. Directed information graph and a causal dependence tree approximation for the dynamical system in Example 1. (a) Full causal dependence structure, the directed information graph. (b) Causal dependence tree approximation (7).

We can further approximate this dynamical system by approximating the functions $\{g_1(x^t, y^t), g_2(x^t, y^t), g_3(x^t, y^t, z^t)\}$ with functions whose generative models have fewer inputs. One approximation for the system is:
$$x_{t+\Delta} = x_t + \Delta\, g_1(x^t, y^t) \approx x_t + \Delta\, g_1'(x^t)$$
$$y_{t+\Delta} = y_t + \Delta\, g_2(x^t, y^t)$$
$$z_{t+\Delta} = z_t + \Delta\, g_3(x^t, y^t, z^t) \approx z_t + \Delta\, g_3'(y^t, z^t).$$
Figure 2(b) depicts the corresponding directed tree structure. We refer to such structures as causal dependence trees.

A similar procedure can be used for networks of stochastic processes, where the system is described in a time-evolving manner through conditional probabilities. Consider three processes $\{\mathbf{X}, \mathbf{Y}, \mathbf{Z}\}$, formed by including i.i.d. noises $\{B_t, C_t, D_t\}_{t=1}^{n}$ to the above dynamical system and relabeling the time indices:
$$X_{t+1} = X_t + \Delta\, g_1(X^t, Y^t, Z^t) + B_{t+1}$$
$$Y_{t+1} = Y_t + \Delta\, g_2(X^t, Y^t, Z^t) + C_{t+1}$$
$$Z_{t+1} = Z_t + \Delta\, g_3(X^t, Y^t, Z^t) + D_{t+1}.$$
The system can alternatively be described through the joint distribution (up to time $n$) as
$$P_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(dx, dy, dz) = \prod_{t=1}^{n} P_{X_t, Y_t, Z_t \mid X^{t-1}, Y^{t-1}, Z^{t-1}}(dx_t, dy_t, dz_t \mid x^{t-1}, y^{t-1}, z^{t-1}).$$
Because of the causal structure of the dynamical system and the statistical independence of the noises, given the full past, the present values are conditionally independent:
$$P_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(dx, dy, dz) = \prod_{t=1}^{n} P_{X_t \mid X^{t-1}, Y^{t-1}, Z^{t-1}}(dx_t \mid x^{t-1}, y^{t-1}, z^{t-1}) \times P_{Y_t \mid X^{t-1}, Y^{t-1}, Z^{t-1}}(dy_t \mid x^{t-1}, y^{t-1}, z^{t-1}) \times P_{Z_t \mid X^{t-1}, Y^{t-1}, Z^{t-1}}(dz_t \mid x^{t-1}, y^{t-1}, z^{t-1}). \qquad (4)$$
More generally, we will make the analogous assumption about the chain rule and how each process at time $t$ is conditionally independent of one another, given the full past of all processes.

Assumption 1: Equation (4) holds and $\frac{dP_{\mathbf{X}}}{d\phi}(x) > 0$ for all $x$ and some measure $\phi$ with $P_{\mathbf{X}} \ll \phi$.

A large class of stochastic systems satisfy Assumption 1. For example, coupled stochastic processes described by an Ito stochastic differential equation with independent Brownian noise satisfy the continuous-time equivalent of this assumption [2]. Granger argued that this is a valid assumption for real world systems, provided the sampling rate $1/\Delta$ is high [3]. We can rewrite (4) using causal conditioning notation (2):
$$P_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(dx, dy, dz) = P_{\mathbf{X}\|\mathbf{Y},\mathbf{Z}}(dx \,\|\, y, z)\, P_{\mathbf{Y}\|\mathbf{X},\mathbf{Z}}(dy \,\|\, x, z)\, P_{\mathbf{Z}\|\mathbf{X},\mathbf{Y}}(dz \,\|\, x, y).$$
As in the deterministic case, often the evolution of one process does not depend on every other process, but only some subset. We can remove the unnecessary dependencies to obtain
$$P_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(dx, dy, dz) = P_{\mathbf{X}\|\mathbf{Y}}(dx \,\|\, y)\, P_{\mathbf{Y}\|\mathbf{X}}(dy \,\|\, x)\, P_{\mathbf{Z}\|\mathbf{X},\mathbf{Y}}(dz \,\|\, x, y).$$
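For illustration only (not from the paper), the sketch below simulates the stochastic network above; the linear forms of g1, g2, g3, the parameter values, and the Markov-1 simplification (the updates use only the most recent values rather than the full pasts) are assumptions made for concreteness, chosen to match the dependence structure of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 1000, 0.1

# Assumed linear update functions matching the dependence structure of Example 1:
# x and y depend on (x, y); z depends on (x, y, z).
def g1(x, y):    return -0.5 * x + 0.3 * y
def g2(x, y):    return  0.4 * x - 0.5 * y
def g3(x, y, z): return  0.2 * x + 0.3 * y - 0.5 * z

x, y, z = np.zeros(n), np.zeros(n), np.zeros(n)
B, C, D = (rng.normal(0, 0.1, n) for _ in range(3))   # i.i.d. noise sequences

for t in range(n - 1):
    # Given the full past, the three updates are conditionally independent, as in (4).
    x[t + 1] = x[t] + delta * g1(x[t], y[t]) + B[t + 1]
    y[t + 1] = y[t] + delta * g2(x[t], y[t]) + C[t + 1]
    z[t + 1] = z[t] + delta * g3(x[t], y[t], z[t]) + D[t + 1]
```

Data generated this way factorizes exactly as in the causally conditioned product above, which is what the approximations in the next section exploit.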

The dependence structure of this stochastic system is represented by Figure 2(a). We next generalize this procedure. For each process $\mathbf{X}_i$, let $A(i) \subseteq \{1, \ldots, m\} \setminus \{i\}$ denote a potential subset of parent processes. Define the corresponding induced probability measure $P_A$:
$$P_A(dx) = \prod_{i=1}^{m} P_{\mathbf{X}_i \| \mathbf{X}_{A(i)}}(dx_i \,\|\, x_{A(i)}). \qquad (5)$$
To find a minimal graph, for each process $\mathbf{X}_i$, we would like to find the smallest set of parents that fully describes the dynamics of $\mathbf{X}_i$ as well as the whole network does:
$$D\big(P_{\mathbf{X}} \,\|\, P_A\big) = 0. \qquad (6)$$
In Example 1, the $A(i)$'s would correspond to $\{\mathbf{Y}\}$, $\{\mathbf{X}\}$, and $\{\mathbf{X}, \mathbf{Y}\}$ for $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$, respectively. The parent sets $\{A(i)\}_{i=1}^{m}$ can be independently minimized so that (6) holds. With these minimal parent sets, we can define the graphical model we will use throughout this discussion.²

²In [2], minimal generative model graphs are defined by Definition 4.1. Under mild technical assumptions they are equivalent to directed information graphs; for clarity we refer to them together as directed information graphs.

Definition 4.1: A directed information graph is a directed graph, where each process is represented by a node, and there is a directed edge from $\mathbf{X}_j$ to $\mathbf{X}_i$ for $i, j \in [m]$ iff $j \in A(i)$, where the cardinalities $\{|A(i)|\}_{i=1}^{m}$ are minimal such that (6) holds.

Lemma 4.2 ([2]): Under Assumption 1, directed information graphs are unique. Furthermore, for a given process $\mathbf{X}_i$, a directed edge is placed from $j$ to $i$ ($j \in A(i)$) if and only if $I(\mathbf{X}_j \to \mathbf{X}_i \,\|\, \mathbf{X} \setminus \{\mathbf{X}_j, \mathbf{X}_i\}) > 0$.

Directed information graphs can have cycles, representing feedback between processes, and can even be complete. For some systems, there might be a large number of influences between processes, with varying magnitudes. For analysis and even storage purposes, it can be helpful to have succinct approximations. For the stochastic system of Example 1, we can apply a similar approximation to this system as was done in the deterministic case with:
$$P_{\mathbf{X}\|\mathbf{Y}}(dx \,\|\, y) \approx P_{\mathbf{X}}(dx), \qquad P_{\mathbf{Z}\|\mathbf{X},\mathbf{Y}}(dz \,\|\, x, y) \approx P_{\mathbf{Z}\|\mathbf{Y}}(dz \,\|\, y).$$
Thus, our causal dependence tree approximation to these stochastic processes, denoted by $\widehat{P}_{\mathbf{X}}$, is:
$$P_{\mathbf{X}}(dx) \approx \widehat{P}_{\mathbf{X}}(dx) \triangleq P_{\mathbf{X}}(dx)\, P_{\mathbf{Y}\|\mathbf{X}}(dy \,\|\, x)\, P_{\mathbf{Z}\|\mathbf{Y}}(dz \,\|\, y). \qquad (7)$$
This approximation is represented graphically in Figure 2(b). Although the system in Example 1 only had three processes, with a large number $m$ of processes, the directed information graph could be quite complex, difficult to compute and analyze visually. As we will show, it is possible to construct efficient optimal tree-like approximations to the directed information graph, and these approximations do not suffer greatly in decision-making performance nor in visualization of relevant features.

V. MAIN RESULT: BEST PARENT AND CAUSAL DEPENDENCE TREE APPROXIMATIONS

We now describe two approaches to approximate joint distributions of networks of stochastic processes, with corresponding low complexity directed information graphs. In both cases, at most a single parent will be kept. The first case is an unconstrained optimization. The second constrains the approximating structure to be a causal dependence tree (this was presented in part at [27]). Minimizing the KL divergence between the full and approximating joint distributions in both cases will result in a sum of pairwise directed informations.

Fig. 3. Examples of directed information graph approximations. The best parent approximation is better in terms of KL divergence. However, the best tree approximation is connected and has a clearly distinguished root with paths from the root to all other nodes. Thus, it is more useful for applications such as targeted advertising. (a) Best parent approximation. (b) Best tree approximation.

We first examine the problem of finding the best approximation where each process has at most one parent. See Figure 3(a). Consider the joint distribution $P_{\mathbf{X}}$ of $m$ random processes $\{\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_m\}$, each of length $n$. We will consider approximations of the form
$$\widehat{P}_{\mathbf{X}}(dx) \triangleq \prod_{i=1}^{m} P_{\mathbf{X}_i \| \mathbf{X}_{a(i)}}(dx_i \,\|\, x_{a(i)}), \qquad (8)$$
where $a(i) \in \{1, \ldots, m\} \setminus \{i\}$ selects the parent. Let $G_1$ denote the set of all such approximations. We want to find the $\widehat{P}_{\mathbf{X}} \in G_1$ that minimizes the KL divergence.

Theorem 1:
$$\arg\min_{\widehat{P}_{\mathbf{X}} \in G_1} D(P_{\mathbf{X}} \,\|\, \widehat{P}_{\mathbf{X}}) = \arg\max_{a(i) \in \{1,\ldots,m\}\setminus\{i\}} \sum_{i=1}^{m} I(\mathbf{X}_{a(i)} \to \mathbf{X}_i). \qquad (9)$$

Proof: First define the product distribution
$$\widetilde{P}_{\mathbf{X}}(dx) \triangleq \prod_{i=1}^{m} P_{\mathbf{X}_i}(dx_i), \qquad (10)$$
which is equivalent to $P_{\mathbf{X}}(x)$ when the processes are statistically independent. Note that $P_{\mathbf{X}}$, $\widehat{P}_{\mathbf{X}}$, $\widetilde{P}_{\mathbf{X}}$ all lie in $\mathcal{P}(\Omega)$, and moreover, $P_{\mathbf{X}} \ll \widehat{P}_{\mathbf{X}} \ll \widetilde{P}_{\mathbf{X}}$. Thus, the Radon-Nikodym derivative $\frac{dP_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}$ satisfies the chain rule [28]:
$$\frac{dP_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}} = \frac{dP_{\mathbf{X}}}{d\widehat{P}_{\mathbf{X}}} \cdot \frac{d\widehat{P}_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}. \qquad (11)$$
Thus,
$$\arg\min_{\widehat{P}_{\mathbf{X}} \in G_1} D(P_{\mathbf{X}} \,\|\, \widehat{P}_{\mathbf{X}}) = \arg\min_{\widehat{P}_{\mathbf{X}} \in G_1} \mathbb{E}_{P_{\mathbf{X}}}\left[\log \frac{dP_{\mathbf{X}}}{d\widehat{P}_{\mathbf{X}}}\right]$$
$$= \arg\min_{\widehat{P}_{\mathbf{X}} \in G_1} \mathbb{E}_{P_{\mathbf{X}}}\left[\log \frac{dP_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}\right] + \mathbb{E}_{P_{\mathbf{X}}}\left[-\log \frac{d\widehat{P}_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}\right] \qquad (12)$$
$$= \arg\max_{\widehat{P}_{\mathbf{X}} \in G_1} \mathbb{E}_{P_{\mathbf{X}}}\left[\log \frac{d\widehat{P}_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}\right] \qquad (13)$$
$$= \arg\max_{\widehat{P}_{\mathbf{X}} \in G_1} \sum_{i=1}^{m} \int_x \log \frac{dP_{\mathbf{X}_i \| \mathbf{X}_{a(i)}=x_{a(i)}}}{dP_{\mathbf{X}_i}}\, P_{\mathbf{X}}(dx) \qquad (14)$$
$$= \arg\max_{\widehat{P}_{\mathbf{X}} \in G_1} \sum_{i=1}^{m} \int_x D\Big(P_{\mathbf{X}_i \| \mathbf{X}_{a(i)}=x_{a(i)}} \,\|\, P_{\mathbf{X}_i}\Big)\, P_{\mathbf{X}_{a(i)}}(dx) \qquad (15)$$
$$= \arg\max_{\widehat{P}_{\mathbf{X}} \in G_1} \sum_{i=1}^{m} I(\mathbf{X}_{a(i)} \to \mathbf{X}_i) \qquad (16)$$
$$= \sum_{i=1}^{m} \arg\max_{a(i) \in \{1,\ldots,m\}\setminus\{i\}} I(\mathbf{X}_{a(i)} \to \mathbf{X}_i), \qquad (17)$$
where (12) applies the log to (11) and rearranges; (13) follows from $\frac{dP_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}$ not depending on $\widehat{P}_{\mathbf{X}}$; (14) follows from (8) and (10); (15) follows from (1); (16) follows from (3); and (17) follows from the choice of each $a(i)$ affecting only a single term in the sum.

Thus, finding the optimal structure where each node has at most one parent is equivalent to individually maximizing pairwise directed informations. The process is described in Algorithm 1. Let $R$ denote the set of all pairwise marginal distributions of $P_{\mathbf{X}}$:
$$R = \{P_{\mathbf{X}_i, \mathbf{X}_j} : i, j \in \{1, \ldots, m\}\}.$$

Algorithm 1. Best Parent
Input: R
1. For i ∈ {1, . . . , m}
2.   a(i) ← ∅
3.   For j ∈ {1, . . . , m}\{i}
4.     Compute I(X_j → X_i)
5.   a(i) ← arg max_j I(X_j → X_i)

Algorithm 1 will return the best possible approximation where only pairwise interactions are preserved. It is possible, though, that $\widehat{P}_{\mathbf{X}}$ could be disconnected. See Figure 3(a). For some applications, such as picking a single most influential user in a group of friends for targeted advertising, it is useful to have a connected structure with a dominant node. See Figure 3(b). Next consider the case where candidate approximations have causal dependence tree structures. The approximations have the form
$$\widehat{P}_{\mathbf{X}}(dx) \triangleq \prod_{i=1}^{m} P_{\mathbf{X}_{\pi(i)} \| \mathbf{X}_{l(\pi(i))}}(dx_{\pi(i)} \,\|\, x_{l(\pi(i))}), \qquad (18)$$
where $\pi$ is a permutation on $\{1, \ldots, m\}$ and $0 \le l(i) < i$ with $\mathbf{X}_0$ denoting a deterministic constant (for the root node's dependence). Let $T_C$ denote the set of all possible causal dependence tree approximations. Like before, we want to find the $\widehat{P}_{\mathbf{X}} \in T_C$ that minimizes the KL divergence.

Theorem 2:
$$\arg\min_{\widehat{P}_{\mathbf{X}} \in T_C} D(P_{\mathbf{X}} \,\|\, \widehat{P}_{\mathbf{X}}) = \arg\max_{\widehat{P}_{\mathbf{X}} \in T_C} \sum_{i=1}^{m} I(\mathbf{X}_{l(\pi(i))} \to \mathbf{X}_{\pi(i)}). \qquad (19)$$

Proof: The proof is similar to the proof of Theorem 1, except that (16) cannot be broken up, as the structural constraint couples choosing $\pi(\cdot)$ and $l(\cdot)$.

Because the maximization became decoupled in Theorem 1, there was a simple algorithm to find the best structure, and that algorithm could be run in a distributed manner. Although that does not happen here, note that the optimal $\widehat{P}_{\mathbf{X}} \in T_C$ is maximizing a sum of pairwise directed information values. Each value corresponds to an edge weight for one directed edge in a complete directed graph on the $m$ processes. To find the tree with maximal weight, we can employ a maximum-weight directed spanning tree (MWDST) algorithm. We discuss MWDST algorithms in Section VI-A. Algorithm 2 describes the procedure to find the best approximating distribution with a causal dependence tree structure.

Algorithm 2. Causal Dependence Tree
Input: R
1. For i ∈ {1, . . . , m}
2.   a(i) ← ∅
3.   For j ∈ {1, . . . , m}\{i}
4.     Compute I(X_j → X_i)
5. {a(i)}_{i=1}^m ← MWDST({I(X_j → X_i)}_{1≤i≠j≤m})

Since $T_C$ contains simpler approximations than $G_1$, Algorithm 1's approximations are superior to Algorithm 2's in terms of KL divergence. For some applications, however, having a directed tree can be more useful for analysis and allocation of resources.

Remark 3: Chow and Liu [4] solved an analogous problem for a collection of random variables. They developed an algorithm to efficiently find the best tree structured approximation for a Markov network (or, equivalently for that problem, a Bayesian network). They showed that using KL divergence, finding the best tree approximation was equivalent to maximizing a sum of mutual informations. They used a maximum weight spanning tree to solve the optimization. Thus, even though directed information graphs have different properties than Markov or Bayesian networks, and operate on a collection of random processes not variables, the method for finding the best tree is analogous.
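For concreteness, here is a minimal sketch of Algorithms 1 and 2, assuming the pairwise directed informations have already been estimated and stored in a matrix with entry [j, i] approximating I(X_j → X_i); using networkx's maximum spanning arborescence routine in place of a hand-rolled Chu–Liu/Edmonds implementation is an implementation choice, not something prescribed by the paper.

```python
import numpy as np
import networkx as nx

def best_parent(di):
    """Algorithm 1: for each process i, keep the parent j maximizing I(X_j -> X_i)."""
    m = di.shape[0]
    np.fill_diagonal(di, -np.inf)            # a process cannot be its own parent
    return {i: int(np.argmax(di[:, i])) for i in range(m)}

def best_tree(di):
    """Algorithm 2: maximum-weight directed spanning tree over the pairwise weights."""
    m = di.shape[0]
    G = nx.DiGraph()
    for j in range(m):
        for i in range(m):
            if i != j:
                G.add_edge(j, i, weight=di[j, i])   # edge j -> i weighted by I(X_j -> X_i)
    T = nx.maximum_spanning_arborescence(G, attr="weight")
    return {i: j for j, i in T.edges()}             # child -> parent map; the root has no entry

# Hypothetical 3-process weight matrix, for illustration only:
di = np.array([[0.0, 0.8, 0.1],
               [0.2, 0.0, 0.7],
               [0.0, 0.1, 0.0]])
print(best_parent(di.copy()))   # {0: 1, 1: 0, 2: 1} -- may contain a cycle
print(best_tree(di))            # {1: 0, 2: 1}      -- rooted tree
```

The toy weights illustrate the difference between the two outputs: the best-parent map may form a cycle between processes 0 and 1, whereas the arborescence is forced to pick a single root.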

Next, we consider the consistency of these algorithms in the setting of estimating from data. We discuss estimation in Section VII-A.

Theorem 3: Suppose $P_{\mathbf{X}} \in T_C$ and the estimates $\{\widehat{I}(\mathbf{X}_j \to \mathbf{X}_i)\}_{1 \le j \ne i \le m}$ converge almost surely (a.s.). Then for the output $\widehat{P}_{\mathbf{X}}$ of Algorithm 2,
$$\widehat{P}_{\mathbf{X}} \to P_{\mathbf{X}} \quad \text{a.s.} \qquad (20)$$

Proof: Since $P_{\mathbf{X}} \in T_C$, by Lemma 4.2, $P_{\mathbf{X}}$ is the unique tree structure with maximal sum of directed informations along its edges. Algorithm 2 finds the tree with maximal weight, and thus if the edge weights converge almost surely, then the tree estimate does also.

Note that an analogous result holds for Algorithm 1 in the case $P_{\mathbf{X}} \in G_1$. In general, there could be multiple approximation structures in $T_C$ or $G_1$ with the same maximal weight, so $\widehat{P}_{\mathbf{X}}$ might not converge, but the approximating structures picked would almost surely be among those of maximal weight.

VI. COMPLEXITY

In this section, we will discuss the complexity both of the algorithms and storage requirements for the solution.

A. Algorithmic complexity

Both algorithms first compute the directed information values between each pair. For discrete random processes, computing the directed information, a divergence (3), in general involves summations over exponentially large alphabets. Computing one directed information value for two processes of length $n$ is $O(|\mathsf{X}|^{2n})$. If the distributions are assumed to be jointly Markov of order $k$, then it becomes linear, $O(n|\mathsf{X}|^{2k}) = O(n)$ for fixed $k$. Thus computing the directed information for each ordered pair of processes is $O(nm^2)$ work when Markovicity is assumed.

For both algorithms, computation of the directed informations can be done independently: the for loops in lines 1 and 4 of both algorithms can be done in a distributed fashion. Note that computing only pairwise relationships is computationally much more tractable than in the full case. To identify the true directed information graph, divergence calculations using the whole network of processes are used [2], requiring $O(|\mathsf{X}|^{mn})$ time without Markovicity, and $O(n|\mathsf{X}|^{mk})$ with Markovicity.

Furthermore, the computation can be reduced by calculating mutual informations initially for line 4 in both algorithms. Equation (4) holding means $P_{\mathbf{X},\mathbf{Y}} = P_{\mathbf{X}\|\mathbf{Y}}\, P_{\mathbf{Y}\|\mathbf{X}}$, which implies $I(\mathbf{X}; \mathbf{Y}) = I(\mathbf{X} \to \mathbf{Y}) + I(\mathbf{Y} \to \mathbf{X})$ [14]. Since mutual and directed informations are non-negative, the mutual information bounds each directed information. Either directed information can later be computed to resolve both.

After computing the pairwise directed informations, Algorithm 1 then picks the best parent for each process, which takes $O(m^2)$ total, so the total runtime is $O(nm^2)$ assuming Markovicity. Algorithm 2 additionally computes a maximum weight spanning tree. Chu and Liu [29], Edmonds [30], and Bock [31] independently developed an efficient MWDST algorithm, which runs in $O(m^2)$ time. Thus, like Algorithm 1, Algorithm 2 also runs in $O(nm^2)$. Note that Humblet [32] proposed a distributed MWDST algorithm, which constructs the maximum weight tree for each node as root in $O(m^2)$ time. In some applications, it is useful to be able to choose from multiple potential roots.

B. Storage complexity

In the full joint distribution, there are $mn$ variables. Each possible outcome might have unique probability. Thus, for discrete variables with alphabet $\mathsf{X}$, the total storage for the joint distribution is $O(|\mathsf{X}|^{mn})$. Both approximations we consider reduce the full joint distribution to $m$ pairwise distributions. Thus, the storage is $O(m|\mathsf{X}|^{2n})$. Further, if the approximations have Markovicity of order $k$, the total storage becomes $O(mn|\mathsf{X}|^{2k}) = O(mn)$ for constant $k$.
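The following sketch illustrates the shortcut described in Section VI-A: when (4) holds, one mutual information and one directed information per unordered pair suffice to fill in both directed edge weights, since the reverse direction follows from the conservation identity. The estimator functions here are placeholders assumed to be supplied by the user; they are not part of the paper.

```python
import numpy as np

def pairwise_weights(processes, estimate_mi, estimate_di):
    """Fill di[a, b] ~ I(X_a -> X_b) with one MI and one DI estimate per unordered pair."""
    m = len(processes)
    di = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            mi = estimate_mi(processes[i], processes[j])        # I(X_i ; X_j), bounds both directions
            di[i, j] = estimate_di(processes[i], processes[j])  # I(X_i -> X_j)
            di[j, i] = max(mi - di[i, j], 0.0)                  # resolve I(X_j -> X_i); clip estimation noise
    return di
```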

VII. APPLICATIONS TO SIMULATED AND EXPERIMENTAL DATA

In this section, we demonstrate the efficacy of the approximations in a classification experiment with simulated time-series. We then show the approximations capture important structural characteristics of a network of brain cells from a neuroscience experiment. First we discuss parametric estimation of directed information from data.

A. Parametric Estimation

While a thorough discussion of estimation techniques is outside the scope of this work, for completeness we briefly describe the consistent parametric estimation technique for directed information proposed in [24] and [33] and applied to study brain cell networks. Afterwards, we discuss estimation for the specific setting of autoregressive time-series.

1) Point-Process Parametric Models: Let $\mathbf{Y}$ and $\mathbf{X}$ denote two binary time series of brain cell activity. $Y_t = 1$ if cell $Y$ was active at time $t$, otherwise $0$. Truccolo et al. [34] proposed modeling how $\mathbf{Y}$ depends on its own past and the past of $\mathbf{X}$ using a point process framework. The conditional log-likelihood has the form
$$\log f_{\mathbf{Y}\|\mathbf{X}}(y \,\|\, x; \theta) = \sum_{t=1}^{n} \Big[ \log\big(\lambda_\theta(t, y^{t-1}, x^{t-1})\big)\, y_t - \lambda_\theta(t, y^{t-1}, x^{t-1})\, \Delta \Big],$$
where $\Delta$ is the time length between samples and $\lambda_\theta(t, y^{t-1}, x^{t-1})$ is the conditional intensity function [34]
$$\log \lambda_\theta(t, y^{t-1}, x^{t-1}) = \alpha_0 + \sum_{j=1}^{J} \alpha_j y_{t-j} + \sum_{r=1}^{R} \beta_r x_{t-r}.$$
$\lambda_\theta(t, y^{t-1}, x^{t-1})$ can be interpreted as the propensity of $\mathbf{Y}$ to be active at time $t$ based on its past and the past of $\mathbf{X}$. The Markov orders $J$ and $R$ are assumed to be unknown. To avoid over-fitting, the minimum description length penalty [35] is used to select the MLE $\widehat{\theta}$:
$$(\widehat{J}, \widehat{R}, \widehat{\theta}) = \arg\min_{(J,R,\theta)} \left\{ -\frac{1}{n} \log f_{\mathbf{Y}\|\mathbf{X}}(y \,\|\, x; \theta) + \frac{(J+R)\log n}{2n} \right\}.$$
This penalty balances the Shannon code-length of encoding $\mathbf{Y}$ with causal side information $\mathbf{X}$ using a MLE $\widehat{\theta}(J, R)$ and the code-length required to describe the MLE $\widehat{\theta}(J, R)$. The directed information estimates are
$$\widehat{I}(\mathbf{X} \to \mathbf{Y}) \triangleq \frac{1}{n} \log \frac{f_{\mathbf{Y}\|\mathbf{X}}(y \,\|\, x; \widehat{\theta})}{f_{\mathbf{Y}}(y; \widehat{\theta}')}, \qquad (21)$$
where $\widehat{\theta}$ and $\widehat{\theta}'$ are the MLE parameter vectors for their respective models. Under stationarity, ergodicity, and Markovicity, almost sure convergence of $\widehat{I}(\mathbf{X} \to \mathbf{Y})$ is shown in [24]. These results extend to general parametric classes. Also note that for the setting of finite alphabets, [36] proposed universal estimation of directed information using context tree weighting.

2) Autoregressive Models: Next consider the specific parametric class of autoregressive time-series. Specifically, a Markov-order one autoregressive model (AR-1) for $\mathbf{X}$ is
$$X_t = B X_{t-1} + N_t, \qquad (22)$$
where $B$ is a coefficient matrix and $N_t$ is i.i.d. white Gaussian noise with variance matrix $\Sigma$. The noise components are assumed to be independent, so $\Sigma$ is diagonal. The coefficients $(B, \Sigma)$ are fixed, so for two processes $\mathbf{X} = (\mathbf{X}, \mathbf{Y})$ modeled as AR-1,
$$I(\mathbf{X} \to \mathbf{Y}) = \frac{1}{n} \sum_{t=1}^{n} I(X^{t-1}; Y_t \mid Y^{t-1}) = \frac{1}{n} \sum_{t=1}^{n} I(X_{n-1}; Y_n \mid Y_{n-1}) \qquad (23)$$
$$= \frac{1}{2} \log \left( \frac{|K_{Y_n, Y_{n-1}}|\, |K_{X_{n-1}, Y_{n-1}}|}{|K_{Y_{n-1}}|\, |K_{X_{n-1}, Y_n, Y_{n-1}}|} \right), \qquad (24)$$
where (23) follows from stationarity and Markovicity and (24) follows from (pg. 249 of [13]), with $|K_{Y_n, Y_{n-1}}|$ denoting the determinant of the covariance matrix of $\{Y_n, Y_{n-1}\}$. Note that by the recurrence relation (22), the covariance matrix $K_{X_t, X_{t'}}$ can be computed as
$$K_{X_t, X_{t'}} = \sum_{s=1}^{\min(t, t')} (B^{t-s})\, \Sigma\, (B^{t'-s})^\top. \qquad (25)$$
Thus, estimates of $\widehat{I}(\mathbf{X} \to \mathbf{Y})$ can be computed by first finding the least squares estimate $\widehat{B}$ of the coefficient matrix in (22), then computing covariance matrices using (25), and then computing (24).
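A minimal sketch of that computation follows, assuming the process starts from $X_0 = 0$ so that (25) applies directly; in practice B and Sigma would be least-squares estimates, and the example coefficient values at the end are assumptions chosen for illustration.

```python
import numpy as np

def cross_cov(B, Sigma, t, s):
    """K_{X_t, X_s} from the recursion (25), assuming X_0 = 0."""
    K = np.zeros_like(Sigma, dtype=float)
    for u in range(1, min(t, s) + 1):
        K += np.linalg.matrix_power(B, t - u) @ Sigma @ np.linalg.matrix_power(B, s - u).T
    return K

def directed_info_ar1(B, Sigma, n, src=0, tgt=1):
    """I(X -> Y) for a bivariate AR-1 model via the Gaussian identity (24);
    src/tgt index the components playing the roles of X and Y."""
    K11 = cross_cov(B, Sigma, n - 1, n - 1)      # Cov(X_{n-1}, X_{n-1})
    K12 = cross_cov(B, Sigma, n - 1, n)          # Cov(X_{n-1}, X_n)
    K22 = cross_cov(B, Sigma, n, n)              # Cov(X_n, X_n)
    # Joint covariance of (X_{n-1}, Y_{n-1}, Y_n):
    J = np.empty((3, 3))
    J[:2, :2] = K11[np.ix_([src, tgt], [src, tgt])]
    J[:2, 2] = K12[[src, tgt], tgt]
    J[2, :2] = J[:2, 2]
    J[2, 2] = K22[tgt, tgt]
    det = np.linalg.det
    num = det(J[np.ix_([1, 2], [1, 2])]) * det(J[np.ix_([0, 1], [0, 1])])  # |K_{Y_n,Y_{n-1}}| |K_{X_{n-1},Y_{n-1}}|
    den = J[1, 1] * det(J)                                                  # |K_{Y_{n-1}}| |K_{X_{n-1},Y_n,Y_{n-1}}|
    return 0.5 * np.log(num / den)

# Assumed stable coefficients in which X drives Y but Y does not drive X:
B = np.array([[0.5, 0.0], [0.4, 0.5]])
Sigma = np.diag([1.0, 1.0])
print(directed_info_ar1(B, Sigma, n=200))                  # positive
print(directed_info_ar1(B, Sigma, n=200, src=1, tgt=0))    # ~0
```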
B. Classification experiment

We tested the utility of the approximation methods using a binary classification experiment.

1) Setup: For the number of processes $m \in \{5, 10, 15\}$, 100 pairs of AR-1 models $(B, \Sigma)$ and $(B', \Sigma')$ were randomly generated. Each element of the $m \times m$ coefficient matrix $B$ was generated i.i.d. from a $\mathcal{N}(0, 1)$ distribution. $\Sigma$ was an $m \times m$ diagonal matrix with entries randomly selected from the interval $[\tfrac{1}{4}, 1]$ uniformly. For each AR model $(B, \Sigma)$, time-series of lengths $n \in \{50, 10^2, 10^3, 10^4\}$ were generated using (22). The coefficients of $(B, \Sigma)$ were estimated using least squares for each of the time-series. The best parent and best tree approximations were computed using estimated coefficients. The directed informations $\{I(\mathbf{X} \to \mathbf{Y})\}$ between each pair $(\mathbf{X}, \mathbf{Y})$ were estimated using the method in Section VII-A2 with $\mathbf{X} = (\mathbf{X}, \mathbf{Y})^\top$. To identify the MWDSTs, a Matlab implementation of Edmonds's algorithm [37] was used. Coefficients $(B', \Sigma')$ were generated and estimated likewise.

Next, classification was performed. For each pair of models $(B, \Sigma)$ and $(B', \Sigma')$, $n = 10^6$ length time-series were generated from each model using (22). First, the log-likelihoods of each time-step conditioned on the past were computed for the full distributions using estimates of $(B, \Sigma)$ and $(B', \Sigma')$. The frequency of correct classification was calculated. Next, the log-likelihoods using the best parent approximations with estimated coefficients were calculated and then those for the best tree approximations. This was repeated for each set of coefficient estimates, corresponding to $n \in \{50, 10^2, 10^3, 10^4\}$.

2) Results: The results of these classification experiments are shown in Figure 4. The classification rates are averaged over the 100 trials. Error bars show standard deviation. The best parent approximations only perform slightly better than the best tree approximations. Both performed close to 85% correct classification rate, slightly improving with larger $m$. Classification using the full distribution noticeably improves with $m$. This is due to the increased complexity of the distributions; with more processes, there are more relationships to distinguish the distributions. There are $m(m-1)$ edges in the full distribution compared to $m$ in the best parent and $m-1$ in the best tree approximations. Despite having significantly fewer edges, the approximations capture enough structure to distinguish models.

The effect of having a small number of samples to estimate AR coefficients is more dramatic as $m$ increases. For $m \in \{5, 10, 15\}$, coefficients estimated with $n = 10^3$ and $n = 10^4$ length time-series performed almost identically.

C. Application to Experimental Data

We now discuss an application of these methods to analysis of neural activity. A recent study computed the directed information graph for a group of neurons in a monkey's primary motor cortex [24]. Using that graph, they identified a dominant axis of local interactions, which corresponded to the known, primary direction of wave propagation of regional synchronous activity, believed to mediate information transfer [38]. We show that the best parent and best tree approximations preserve that dominant axis.

The monkey was performing a sequence of arm-reaching tasks. Its arm was constrained to move along a horizontal surface. Each task involved presentation of a randomly positioned, fixed target on the surface, the monkey moving its hand to meet the target, and a reward (drop of juice) given to the monkey if it was successful. For more details, see [24], [39]. Neural activity in the primary motor cortex was recorded by an implanted silicon micro-electrode array. The recorded waveforms were filtered and processed to produce, for each neuron that was detected, a sequence of times when that neuron became active (e.g. it "spiked"). The 37 neurons with the greatest total activity (number of spikes) were used for analysis.

To study the flow of activity between individual neurons, we constructed a directed information graph on the collection of neurons. To simplify computation, pairwise directed informations were estimated using the parametric estimation procedure discussed in Section VII-A.
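The classification step described above can be sketched as follows: a held-out series is scored under each fitted AR-1 model by its Gaussian one-step-ahead conditional log-likelihood, and the model with the larger score is chosen. This helper is illustrative only, not code from the paper; for the best-parent and best-tree approximations the same scoring applies with each row of B constrained to be nonzero only in the process's own column and its parent's column.

```python
import numpy as np

def ar1_loglik(X, B, Sigma):
    """Sum over t of log p(X_t | X_{t-1}) under X_t = B X_{t-1} + N_t, N_t ~ N(0, Sigma)."""
    d = X.shape[1]
    inv, logdet = np.linalg.inv(Sigma), np.linalg.slogdet(Sigma)[1]
    resid = X[1:] - X[:-1] @ B.T                      # one-step-ahead prediction errors
    quad = np.einsum('ti,ij,tj->t', resid, inv, resid)
    return -0.5 * np.sum(quad + logdet + d * np.log(2 * np.pi))

def classify(X, model_a, model_b):
    """Return 0 if model_a explains the series better, else 1. Each model is a (B, Sigma) pair."""
    return int(ar1_loglik(X, *model_b) > ar1_loglik(X, *model_a))
```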


Fig. 4. Classification rate between pairs of autoregressive series, for (a) $m = 5$, (b) $m = 10$, and (c) $m = 15$. For each $m \in \{5, 10, 15\}$, 100 pairs of autoregressive coefficients were generated randomly. Classification was performed using the full structures, best parent approximations, and best tree approximations, using coefficients estimated with $n \in \{50, 10^2, 10^3, 10^4\}$ length time-series. Error bars depict standard deviation.

(a) Graphical structure of non-zero pairwise directed information values. (b) Causal dependence tree approximation.

Fig. 5. Graphical structures of non-zero pairwise directed information values from [24] and causal dependence tree approximation. The best parent approximation was almost identical and is not shown. The blue arrow in Figure 5(a) depicts a dominant orientation of the edges. That orientation is consistent with the direction of propagation of local field potentials, which is believed to mediate information transfer [38]. Both approximations preserve that structure.

Figure 5(a) depicts the pairwise directed information graph. The relative positions of the neurons in the graph correspond to the relative positions of the recording electrodes. The blue arrow indicates a dominant orientation of the edges. This orientation along the rostro-caudal axis is consistent with the direction of propagation of local field potentials, which researchers believe mediates information transfer between regions [38].

We applied Algorithms 1 and 2 to this data set. The structure of the dependence tree approximation is shown in Figure 5(b). The best parent approximation is almost identical. The only differences are that the parents of nodes 28 and 13 are 27 and 3 respectively.

The original graph had 117 edges with many complicated loops. Both approximations reduced the number of edges to roughly a third, improving the clarity of the graph. Both approximations preserve the dominant edge orientation - pertaining to wave propagation - depicted by the blue arrow in Figure 5(a). This suggests that these approximation methodologies preserve relevant information for decision-making and visualization for analysis of mechanistic biological phenomena.

VIII. CONCLUSION

In this work, we presented efficient methods to optimally approximate networks of stochastic, dynamically interacting processes with low-complexity approximation methods. Both approximations only required pairwise marginal statistics between the processes, which computationally are significantly more tractable than the full joint distribution. Also, the corresponding directed information graphs are much more accessible to analysis and practical usage for many applications.

An important line of future work involves investigating methods to approximate with other, more complicated structures. Best-parent approximations and causal dependence tree approximations will always reduce the storage complexity dramatically and facilitate analysis. However, for some applications, it might be desirable to have slightly more complicated structures, such as connected graphs with at most three parents for each node. Such approximations highlight a richer set of interactions and feedback while still being visually and computationally simpler to analyze than the full structure. Although it might not always be possible to efficiently find optimal approximations of such graphical complexity, even near-optimal approximations could prove quite beneficial to real world applications.

ACKNOWLEDGMENTS

The authors thank Jalal Etesami and Mavis Rodrigues for their assistance with computer simulations. This work was supported in part to C. J. Quinn by the NSF IGERT fellowship under grant number DGE-0903622, and the Department of Energy Computational Science Graduate Fellowship under grant number DE-FG02-97ER25308; to N. Kiyavash by AFOSR under grants FA 9550-11-1-0016, FA 9550-10-1-0573, and FA 9550-10-1-0345; and by NSF grant CCF 10-54937 CAR; and to T. P. Coleman by NSF Science & Technology Center grant CCF-0939370 and NSF grant CCF 10-65352.

REFERENCES

[1] D. Koller and N. Friedman, Probabilistic graphical models: principles and techniques. The MIT Press, 2009.
[2] C. Quinn, N. Kiyavash, and T. Coleman, "Directed information graphs," arXiv preprint arXiv:1204.2003, 2012.
[3] C. Granger, "Investigating causal relations by econometric models and cross-spectral methods," Econometrica, vol. 37, no. 3, pp. 424–438, 1969.
[4] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. on Information Theory, vol. 14, no. 3, pp. 462–467, 1968.
[5] C. Shalizi, M. Camperi, and K. Klinkner, "Discovering functional communities in dynamical networks," Statistical Network Analysis: Models, Issues, and New Directions, pp. 140–157, 2007.
[6] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
[7] A. Bolstad, B. Van Veen, and R. Nowak, "Causal network inference via group sparse regularization," Signal Processing, IEEE Trans. on, vol. 59, no. 6, pp. 2628–2641, 2011.
[8] A. Puig, A. Wiesel, G. Fleury, and A. Hero, "Multidimensional shrinkage-thresholding operator and group lasso penalties," Signal Processing Letters, IEEE, vol. 18, no. 6, pp. 363–366, June 2011.
[9] V. Tan and A. Willsky, "Sample complexity for topology estimation in networks of LTI systems," in Decision and Control, 2011. IEEE Conference on. IEEE, 2011.
[10] D. Materassi and G. Innocenti, "Topological identification in networks of dynamical systems," Automatic Control, IEEE Trans. on, vol. 55, no. 8, pp. 1860–1871, 2010.
[11] D. Materassi and M. Salapaka, "On the problem of reconstructing an unknown topology via locality properties of the Wiener filter," Automatic Control, IEEE Trans. on, no. 99, pp. 1–1, 2011.
[12] J. Etesami, N. Kiyavash, and T. P. Coleman, "Learning minimal latent directed information trees," in Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on. IEEE, 2012, pp. 2726–2730.
[13] T. Cover and J. Thomas, Elements of information theory. Wiley-Interscience, 2006.
[14] H. Marko, "The bidirectional communication theory–a generalization of information theory," Communications, IEEE Trans. on, vol. 21, no. 12, pp. 1345–1351, Dec 1973.
[15] J. Massey, "Causality, feedback and directed information," in Proc. 1990 Intl. Symp. on Info. Th. and its Applications, 1990, pp. 27–30.
[16] G. Kramer, "Directed information for channels with feedback," Ph.D. dissertation, Swiss Federal Institute of Technology (ETH), Zürich, Switzerland, 1998.
[17] S. Tatikonda and S. Mitter, "The Capacity of Channels With Feedback," IEEE Trans. on Information Theory, vol. 55, no. 1, pp. 323–349, 2009.
[18] H. Permuter, T. Weissman, and A. Goldsmith, "Finite State Channels With Time-Invariant Deterministic Feedback," IEEE Trans. on Information Theory, vol. 55, no. 2, pp. 644–662, 2009.
[19] C. Li and N. Elia, "The information flow and capacity of channels with noisy feedback," arXiv preprint arXiv:1108.2815, 2011.
[20] R. Venkataramanan and S. Pradhan, "Source coding with feed-forward: rate-distortion theorems and error exponents for a general source," IEEE Trans. on Information Theory, vol. 53, no. 6, pp. 2154–2179, 2007.
[21] N. Martins and M. Dahleh, "Feedback control in the presence of noisy channels: "bode-like" fundamental limitations of performance," Automatic Control, IEEE Trans. on, vol. 53, no. 7, pp. 1604–1615, Aug. 2008.
[22] S. K. Gorantla, "The interplay between information and control theory within interactive decision-making problems," Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2012.
[23] H. Permuter, Y. Kim, and T. Weissman, "Interpretations of directed information in portfolio theory, data compression, and hypothesis testing," Information Theory, IEEE Trans. on, vol. 57, no. 6, pp. 3248–3259, 2011.
[24] C. Quinn, T. Coleman, N. Kiyavash, and N. Hatsopoulos, "Estimating the directed information to infer causal relationships in ensemble neural spike train recordings," Journal of Computational Neuroscience, vol. 30, no. 1, pp. 17–44, 2011.
[25] R. M. Gray, Probability, random processes, and ergodic properties. Springer, 2009.
[26] T. Weissman, Y.-H. Kim, and H. H. Permuter, "Directed Information, Causal Estimation, and Communication in Continuous Time," ArXiv e-prints, Sep. 2011.
[27] C. Quinn, T. Coleman, and N. Kiyavash, "Approximating discrete probability distributions with causal dependence trees," in Info. Theory and its App. (ISITA), 2010 Intl. Symp. on. IEEE, 2010, pp. 100–105.
[28] H. Royden and P. Fitzpatrick, Real analysis, 3rd ed. Macmillan New York, 1988.
[29] Y. Chu and T. Liu, "On the shortest arborescence of a directed graph," Science Sinica, vol. 14, no. 1396-1400, p. 270, 1965.
[30] J. Edmonds, "Optimum branchings," J. Res. Natl. Bur. Stand., Sect. B, vol. 71, pp. 233–240, 1967.
[31] F. Bock, "An algorithm to construct a minimum directed spanning tree in a directed network," Developments in operations research, vol. 1, pp. 29–44, 1971.
[32] P. Humblet, "A distributed algorithm for minimum weight directed spanning trees," Communications, IEEE Trans. on, vol. 31, no. 6, pp. 756–762, 1983.
[33] S. Kim, D. Putrino, S. Ghosh, and E. N. Brown, "A Granger causality measure for point process models of ensemble neural spiking activity," PLoS Comput Biol, vol. 7, no. 3, March 2011.
[34] W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, and E. N. Brown, "A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects," Journal of Neurophysiology, vol. 93, no. 2, pp. 1074–1089, 2005.
[35] P. D. Grünwald, The minimum description length principle. MIT press, 2007.
[36] J. Jiao, H. Permuter, L. Zhao, Y. Kim, and T. Weissman, "Universal estimation of directed information," arXiv preprint arXiv:1201.2334, 2012.
[37] G. Li, "Maximum Weight Spanning tree (Undirected)," Computer software, June 2009. [Online]. Available: http://www.mathworks.com/matlabcentral/fileexchange/23276-maximum-weight-spanning-tree-undirected
[38] D. Rubino, K. Robbins, and N. Hatsopoulos, "Propagating waves mediate information transfer in the motor cortex," Nature neuroscience, vol. 9, no. 12, pp. 1549–1557, 2006.
[39] W. Wu and N. Hatsopoulos, "Evidence against a single coordinate system representation in the motor cortex," Experimental brain research, vol. 175, no. 2, pp. 197–210, 2006.