Causality Through Directed Information


Young-Han Kim, University of California, San Diego
SNU Institute for Research in Finance and Economics (SIRFE) Seminar, April

Joint work with Jiantao Jiao (Stanford), Haim Permuter (Ben Gurion), Tsachy Weissman (Stanford), and Lei Zhao (Jump Operations). Supported in part by the National Science Foundation (NSF), the US–Israel Binational Science Foundation (BSF), and the BSF Bergmann Memorial Award.

Related publications

∙ Haim H. Permuter, Young-Han Kim, and Tsachy Weissman, “Interpretations of directed information in portfolio theory, data compression, and hypothesis testing,” IEEE Transactions on Information Theory.
∙ Tsachy Weissman, Young-Han Kim, and Haim H. Permuter, “Directed information, causal estimation, and communication in continuous time,” IEEE Transactions on Information Theory.
∙ Jiantao Jiao, Lei Zhao, Haim H. Permuter, Young-Han Kim, and Tsachy Weissman, “Universal estimation of directed information,” to appear in IEEE Transactions on Information Theory.

For more information, visit http://circuit.ucsd.edu/~yhk

Shannon’s information measures

∙ Entropy: “uncertainty in a random variable X”

      H(X) = Σ_x p(x) log(1/p(x))

∙ Mutual information: “information about X provided by Y”

      I(X; Y) = H(X) + H(Y) − H(X, Y)

∙ Relative entropy (Kullback–Leibler divergence): “distinction between p and q”

      D(p‖q) = Σ_x p(x) log(p(x)/q(x))

Where do they come from?

∙ Mathematical communication theory (Shannon)
  – Fundamental limits on communication and compression
  – Probability theory and statistics
∙ Axiomatic definitions (Aczél–Daróczy)
  – “Reasonable” properties for information measures
  – Functional equations: f(p × q) = f(p) + f(q) ⇒ f ≅ H
∙ How about finance and economics?
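The three measures above can be computed directly from a joint pmf. A minimal sketch in Python; the joint distribution below is an assumed toy example, not one from the talk:

```python
import math

def H(p):
    """Shannon entropy in bits: H(X) = sum_x p(x) log2(1/p(x))."""
    return sum(px * math.log2(1 / px) for px in p if px > 0)

# Toy joint pmf p(x, y) over binary X and Y (values assumed for illustration).
pxy = [[0.4, 0.1],
       [0.1, 0.4]]
px = [sum(row) for row in pxy]           # marginal of X
py = [sum(col) for col in zip(*pxy)]     # marginal of Y
joint = [pr for row in pxy for pr in row]

# I(X; Y) = H(X) + H(Y) - H(X, Y)
I = H(px) + H(py) - H(joint)

# Relative-entropy form: I(X; Y) = D(p(x, y) || p(x) p(y))
D = sum(pxy[i][j] * math.log2(pxy[i][j] / (px[i] * py[j]))
        for i in range(2) for j in range(2))
```

The two expressions for mutual information agree identically, which is a handy sanity check for any hand-rolled implementation.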
Gambling in horse races

∙ Horses: 1, 2, ..., m
∙ Odds: o(1), o(2), ..., o(m) (say, o(x) ≡ m)
∙ Win probabilities: p(1), p(2), ..., p(m)

Optimal gambling

∙ Bets: b(1), b(2), ..., b(m)
  – No short: b(x) ≥ 0, x = 1, 2, ..., m
  – No margin: Σ_x b(x) = 1
  – In other words, b(x) lies in the probability simplex
∙ Payoff: if horse x wins (with probability p(x)), then each dollar bet turns into b(x)o(x)
∙ Question: how should we choose our portfolio b(x)?

Kelly gambling and log-optimal portfolio

∙ Kelly, “A new interpretation of information rate”: b*(x) = p(x)
∙ Maximize E[log(b(X)o(X))]
  – Logarithmic utility
  – Growth-rate optimality
  – Competitive optimality (Bell–Cover)
  – Other properties (MacLean–Thorp–Ziemba)
∙ Optimal growth rate:

      W*(X) = max_{b(x)} E[log(b(X)o(X))] = E[log o(X)] − H(X)

  With o(x) ≡ m,

      W*(X) = log m − H(X)

Entropy

      H(X) = Σ_x p(x) log(1/p(x)) = E[log(1/p(X))]

∙ Amount of randomness (information, uncertainty) in X
∙ Fundamental limit on lossless compression (Shannon)
∙ Can be generalized to measures other than the counting measure
∙ Conditional entropy:

      H(X|Y) = Σ_{x,y} p(x, y) log(1/p(x|y)) = E[log(1/p(X|Y))]

Gambling with side information

∙ Side information Y about the horse race outcome X
∙ Bets: b(x|y), x = 1, 2, ..., m
∙ Kelly gambling: b*(x|y) = p(x|y)
∙ Optimal growth rate:

      W*(X|Y) = max_{b(x|y)} E[log(b(X|Y)o(X))] = E[log o(X)] − H(X|Y)

∙ Value of side information (Kelly):

      ΔW = W*(X|Y) − W*(X) = I(X; Y)

Mutual information

      I(X; Y) = H(X) + H(Y) − H(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

∙ Amount of information about X provided by Y (and vice versa)
∙ For a general stock market (Barron–Cover): ΔW ≤ I(X; Y)
∙ Fundamental limit on communication (Shannon)
∙ Fundamental limit on lossy compression/quantization (Shannon)
∙ Can be generalized to any pair of random objects
∙ Conditional mutual information:

      I(X; Y|Z) = H(X|Z) + H(Y|Z) − H(X, Y|Z)

Repeated gambling in horse races with memory

∙ Win probabilities: p(x_1), p(x_2|x_1), p(x_3|x_1, x_2), ..., p(x_n|x^{n−1})
∙ Odds: o(x_i) ≡ m
∙ Bets: b(x_1), b(x_2|x_1), b(x_3|x_1, x_2), ..., b(x_n|x^{n−1})
∙ Kelly gambling: b*(x_i|x^{i−1}) = p(x_i|x^{i−1}), i = 1, 2, ...
∙ Optimal growth rate:

      W*(X^n) = log m − (1/n) H(X^n) = log m − (1/n) Σ_{i=1}^n H(X_i|X^{i−1})

∙ If the horse race process {X_n} is stationary ergodic, then
  – (1/n) H(X^n) → H*(X), the entropy rate
  – W*(X^n) → W*(X)
  – wealth ≐ 2^{nW*} almost surely (Shannon, McMillan, Breiman)

Gambling with causal side information

∙ Side information Y_1, Y_2, ...
∙ Bets: b(x_i|x^{i−1}, y^i)
∙ Kelly gambling: b*(x_i|x^{i−1}, y^i) = p(x_i|x^{i−1}, y^i), i = 1, 2, ...
∙ Optimal growth rate:

      W*(X^n‖Y^n) = log m − (1/n) Σ_{i=1}^n H(X_i|X^{i−1}, Y^i) = log m − (1/n) H(X^n‖Y^n)

∙ If {(X_n, Y_n)} is stationary ergodic, then (1/n) H(X^n‖Y^n) → H*(X‖Y)
∙ Value of causal side information (Permuter–Kim–Weissman):

      ΔW = W*(X^n‖Y^n) − W*(X^n) = (1/n)(H(X^n) − H(X^n‖Y^n)) = (1/n) I(Y^n → X^n)

Directed information

      I(Y^n → X^n) = H(X^n) − H(X^n‖Y^n) = Σ_{i=1}^n I(Y^i; X_i|X^{i−1})

∙ Amount of information about X causally provided by Y
∙ For a general stock market: ΔW ≤ (1/n) I(Y^n → X^n)
∙ Arrow of time: directed and asymmetric, I(Y^n → X^n) ≠ I(X^n → Y^n)
∙ Fundamental limit on feedback communication (Tatikonda–Mitter, Kim, Permuter–Weissman–Goldsmith)
∙ Can be generalized to continuous time (Weissman–Kim–Permuter)

Test for causal dependence

[Figure: two coupled controller/output-generator systems. Under hypothesis H_1 the output generator is p(y_i|x^i, y^{i−1}), so X causally influences Y; under hypothesis H_0 it is p(y_i|y^{i−1}); in both, the controller is p(x_i|x^{i−1}, y^{i−1}).]

∙ Type-I and type-II error probabilities: α = P(A^c|H_1), β = P(A|H_0)
∙ Chernoff–Stein lemma for the causal dependence test:

      β* = min_{A ⊆ X^n × Y^n : α < ε} β ≐ 2^{−I(X^n → Y^n)}

Brief history

∙ Marko, “The bidirectional communication theory: a generalization of information theory”
  – Direction of information flow for mutually coupled statistical systems
  – Cybernetics: group behavior with monkeys
∙ Massey, “Causality, feedback, and directed information”

Relationship to other notions of causality

∙ Granger causality (Granger, Geweke):

      G(X^n → Y^n) = Σ_{i=1}^n log( LMMSE(Y_i|Y_{i−p}^{i−1}) / LMMSE(Y_i|Y_{i−p}^{i−1}, X_{i−p}^{i−1}) )

  – The higher G(X^n → Y^n) is, the more X influences Y
  – If {(X_n, Y_n)} is Gauss–Markov of order p, then I(X^n → Y^n) ≡ G(X^n → Y^n)
∙ Transfer entropy (Schreiber):

      T_i(X → Y) = I(X^{i−1}; Y_i|Y^{i−1})

  – The higher T_i(X → Y) is, the more X influences Y (with one step delay)
  – If {(X_n, Y_n)} is stationary, then (1/n) I(X^{n−1} → Y^n) → T(X → Y)

Causal conditioning

∙ Causally conditional probability (Kramer):

      p(y^n‖x^n) = Π_{i=1}^n p(y_i|x^i, y^{i−1})
      p(y^n‖x^{n−1}) = Π_{i=1}^n p(y_i|x^{i−1}, y^{i−1})

∙ Causally conditional entropy:

      H(Y^n‖X^n) = −E[log p(Y^n‖X^n)],    H(Y^n‖X^{n−1}) = −E[log p(Y^n‖X^{n−1})]

∙ Chain rules:

      p(x^n, y^n) = p(x^n‖y^n) p(y^n‖x^{n−1}) = p(x^n‖y^{n−1}) p(y^n‖x^n)
      H(X^n, Y^n) = H(X^n‖Y^n) + H(Y^n‖X^{n−1}) = H(X^n‖Y^{n−1}) + H(Y^n‖X^n)

Properties of directed information

      I(X^n → Y^n) = H(Y^n) − H(Y^n‖X^n),    I(X^{n−1} → Y^n) = H(Y^n) − H(Y^n‖X^{n−1})

∙ I(X^n → Y^n) ≤ I(X^n; Y^n)
∙ I(X^n → Y^n) = I(X^n; Y^n) if p(x^n‖y^{n−1}) = p(x^n)
∙ I(X^n → Y^n) = I(X^n; Y^n) = nI(X; Y) if {(X_n, Y_n)} is IID
∙ Conservation law:

      I(X^n; Y^n) = I(X^n → Y^n) + I(Y^{n−1} → X^n) = I(X^{n−1} → Y^n) + I(Y^n → X^n)

∙ Measure of causal influence

Universal estimation of directed information

∙ In reality, the probability distribution may not be known or may not even exist
∙ Something out of nothing: can we perform as if the distribution were known?
∙ Can we perform as well as the best estimator in a given class?
∙ Answer: yes! (Jiao–Zhao–Permuter–Kim–Weissman)

Universal probability assignments

∙ Probability assignment: q(x^n)
∙ Sequential probability assignment: q(x_1), q(x_2|x_1), q(x_3|x_1, x_2), ..., q(x_n|x^{n−1})
∙ Probability assignment q is universal if

      lim_{n→∞} (1/n) D(p(x^n)‖q(x^n)) = 0

  for every stationary distribution p
∙ Probability assignment q is pointwise universal if

      lim sup_{n→∞} (1/n) log(p(X^n)/q(X^n)) ≤ 0    p-a.s.

  for every stationary ergodic distribution p
∙ (Pointwise) universal probability assignments
  – Compression-based approaches: Ziv–Lempel, Willems–Shtarkov–Tjalkens
  – Ergodic-theoretic approaches: Ornstein, Morvai–Yakowitz–Algoet

Algorithm

      Î(X^n → Y^n) = Ĥ(Y^n) − Ĥ(Y^n‖X^n),

  where Ĥ(Y^n) = −(1/n) log q(Y^n) and Ĥ(Y^n‖X^n) = −(1/n) log q(Y^n‖X^n)

∙ Consistency
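The directed-information identities and the conservation law can be verified numerically by brute-force marginalization over a small alphabet. A sketch, using an arbitrary randomly generated joint pmf (all names and the choice of distribution are illustrative):

```python
import itertools
import math
import random

n, A = 3, (0, 1)                          # horizon and binary alphabet
random.seed(1)

# A generic randomly chosen joint pmf over sequence pairs (x^n, y^n).
keys = [(x, y) for x in itertools.product(A, repeat=n)
               for y in itertools.product(A, repeat=n)]
w = [random.random() for _ in keys]
p = {k: wi / sum(w) for k, wi in zip(keys, w)}

def marginal(pmf, f):
    """Marginal pmf of the statistic f(x, y)."""
    m = {}
    for (x, y), pr in pmf.items():
        m[f(x, y)] = m.get(f(x, y), 0.0) + pr
    return m

def causal_entropy(pmf, d):
    """sum_i H(Y_i | Y^{i-1}, X^{i+d}) in bits, with the X-window clipped to [0, n]."""
    total = 0.0
    for i in range(1, n + 1):
        k = max(i + d, 0)
        num = marginal(pmf, lambda x, y: (x[:k], y[:i]))      # p(x^k, y^i)
        den = marginal(pmf, lambda x, y: (x[:k], y[:i - 1]))  # p(x^k, y^{i-1})
        for (x, y), pr in pmf.items():
            total -= pr * math.log2(num[x[:k], y[:i]] / den[x[:k], y[:i - 1]])
    return total

q = {(y, x): pr for (x, y), pr in p.items()}  # same pmf with X and Y roles swapped

H_X, H_Y = causal_entropy(q, -n), causal_entropy(p, -n)   # plain entropies H(X^n), H(Y^n)
H_XY = -sum(pr * math.log2(pr) for pr in p.values())
I_XY = H_X + H_Y - H_XY                                   # I(X^n; Y^n)
I_XtoY = H_Y - causal_entropy(p, 0)     # I(X^n -> Y^n) = H(Y^n) - H(Y^n || X^n)
I_YtoX_delayed = H_X - causal_entropy(q, -1)              # I(Y^{n-1} -> X^n)
```

Any joint pmf should satisfy the conservation law I(X^n; Y^n) = I(X^n → Y^n) + I(Y^{n−1} → X^n) exactly, and directed information is sandwiched between 0 and mutual information.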
Recommended publications
  • An Information-Theoretic Perspective on Credit Assignment in Reinforcement Learning
An Information-Theoretic Perspective on Credit Assignment in Reinforcement Learning
Dilip Arumugam (Department of Computer Science, Stanford University), Peter Henderson (Department of Computer Science, Stanford University), Pierre-Luc Bacon (Mila, University of Montreal)

Abstract: How do we formalize the challenge of credit assignment in reinforcement learning? Common intuition would draw attention to reward sparsity as a key contributor to difficult credit assignment and traditional heuristics would look to temporal recency for the solution, calling upon the classic eligibility trace. We posit that it is not the sparsity of the reward itself that causes difficulty in credit assignment, but rather the information sparsity. We propose to use information theory to define this notion, which we then use to characterize when credit assignment is an obstacle to efficient learning. With this perspective, we outline several information-theoretic mechanisms for measuring credit under a fixed behavior policy, highlighting the potential of information theory as a key tool towards provably-efficient credit assignment.

1 Introduction
The credit assignment problem in reinforcement learning [Minsky, 1961, Sutton, 1985, 1988] is concerned with identifying the contribution of past actions on observed future outcomes.
Of particular interest to the reinforcement-learning (RL) problem [Sutton and Barto, 1998] are observed future returns and the value function, which quantitatively answers “how does choosing an action a in state s affect future return?” Indeed, given the challenge of sample-efficient RL in long-horizon, sparse-reward tasks, many approaches have been developed to help alleviate the burdens of credit assignment [Sutton, 1985, 1988, Sutton and Barto, 1998, Singh and Sutton, 1996, Precup et al., 2000, Riedmiller et al., 2018, Harutyunyan et al., 2019, Hung et al., 2019, Arjona-Medina et al., 2019, Ferret et al., 2019, Trott et al., 2019, van Hasselt et al., 2020].
  • Task-Driven Estimation and Control Via Information Bottlenecks
Task-Driven Estimation and Control via Information Bottlenecks
Vincent Pacelli and Anirudha Majumdar

Abstract— Our goal is to develop a principled and general algorithmic framework for task-driven estimation and control for robotic systems. State-of-the-art approaches for controlling robotic systems typically rely heavily on accurately estimating the full state of the robot (e.g., a running robot might estimate joint angles and velocities, torso state, and position relative to a goal). However, full state representations are often excessively rich for the specific task at hand and can lead to significant computational inefficiency and brittleness to errors in state estimation. In contrast, we present an approach that eschews such rich representations and seeks to create task-driven representations. The key technical insight is to leverage the theory of information bottlenecks to formalize the notion of a “task-driven representation” in terms of information theoretic quantities that measure the minimality of a representation. We propose novel iterative algorithms for automatically synthesizing (offline) a task-driven representation (given in terms of a set of task-relevant variables (TRVs)) and a performant control policy. Second, since only a few prominent variables need to be estimated, fewer sources of measurement uncertainty result in a more robust policy.

Fig. 1. A schematic of our technical approach. We seek to synthesize (offline) a minimalistic set of task-relevant variables (TRVs) x̃_t that create a bottleneck between the full state x_t and the control input u_t. These TRVs are estimated online in order to apply the policy π_t. We demonstrate our approach on a spring-loaded inverted pendulum model whose goal is to run to a target location. Our approach automatically synthesizes a one-dimensional TRV x̃_t sufficient for achieving this task.
  • Interpretations of Directed Information in Portfolio Theory, Data
Interpretations of Directed Information in Portfolio Theory, Data Compression, and Hypothesis Testing
Haim H. Permuter, Young-Han Kim, and Tsachy Weissman
arXiv:0912.4872v1 [cs.IT], 24 Dec 2009

Abstract—We investigate the role of Massey’s directed information in portfolio theory, data compression, and statistics with causality constraints. In particular, we show that directed information is an upper bound on the increment in growth rates of optimal portfolios in a stock market due to causal side information. This upper bound is tight for gambling in a horse race, which is an extreme case of stock markets. Directed information also characterizes the value of causal side information in instantaneous compression and quantifies the benefit of causal inference in joint compression of two stochastic processes. In hypothesis testing, directed information evaluates the best error exponent for testing whether a random process Y causally influences another process X or not. These results give a natural interpretation of directed information I(Y^n → X^n) as the amount of information that a random sequence Y^n = (Y_1, Y_2, ..., Y_n) causally provides about another random sequence X^n = (X_1, X_2, ..., X_n). A new measure, directed lautum information, is also introduced and interpreted in portfolio theory, data compression, and hypothesis testing.

Index Terms—Causal conditioning, causal side information, directed information, hypothesis testing, instantaneous compression, Kelly gambling, Lautum information, portfolio theory.

I. INTRODUCTION
Mutual information I(X; Y) between two random variables X and Y arises as the canonical answer to a variety of questions in science and engineering. Most notably, Shannon [1] showed that the capacity C, the maximal data rate for reliable communication, of a discrete memoryless channel p(y|x) with input X and output Y is given by C = max_{p(x)} I(X; Y).
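The horse-race case mentioned above, where the growth-rate bound is tight, is easy to check numerically: with uniform odds, the Kelly bet b*(x) = p(x) achieves growth rate log m − H(X). A sketch with an assumed toy race (probabilities and seed are illustrative):

```python
import math
import random

p = [0.5, 0.25, 0.125, 0.125]   # win probabilities (assumed toy race)
m = len(p)                      # uniform odds o(x) = m ("m-for-1")
H = -sum(pi * math.log2(pi) for pi in p)

# Exact growth rate of the Kelly bet b*(x) = p(x) at uniform odds:
# W*(X) = E[log2(b(X) o(X))] = log2(m) - H(X)
W = sum(pi * math.log2(pi * m) for pi in p)

# Monte Carlo: log-wealth grows at rate W per race, almost surely.
random.seed(0)
n_races, log_wealth = 20000, 0.0
for _ in range(n_races):
    x = random.choices(range(m), weights=p)[0]   # winning horse
    log_wealth += math.log2(p[x] * m)            # wealth multiplies by b(x) o(x)
rate = log_wealth / n_races
```

With the toy numbers above, log2(m) − H(X) = 2 − 1.75 = 0.25 bits per race, and the simulated rate concentrates around that value.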
  • Analogy Between Gambling and Measurements-Based Work Extraction
IRWIN AND JOAN JACOBS CENTER FOR COMMUNICATION AND INFORMATION TECHNOLOGIES
Analogy Between Gambling and Measurement-Based Work Extraction
Dror A. Vinkler, Haim H. Permuter and Neri Merhav
CCIT Report #857, April 2014
Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel

Abstract—In information theory, one area of interest is gambling, where mutual information characterizes the maximal gain in wealth growth rate due to knowledge of side information; the betting strategy that achieves this maximum is named the Kelly strategy. In the field of physics, it was recently shown that mutual information can characterize the maximal amount of work that can be extracted from a single heat bath using measurement-based control protocols, i.e., using “information engines”. However, to the best of our knowledge, no relation between gambling and information engines has been presented before. In this paper, we briefly review the two concepts and then demonstrate an analogy between gambling, where bits are converted into wealth, and information engines, where bits representing measurements are converted into energy. From this analogy follows an extension of gambling to the continuous-valued case, which is shown to be useful for investments in currency exchange rates or in the stock market using options. Moreover, the analogy enables us to use well-known methods and results from one field to solve problems in the other. We present three such cases: maximum work extraction when the probability distributions governing the system and measurements are unknown, work extraction when some energy is lost in each cycle, e.g., due to friction, and an analysis of systems with memory.
  • Universal Estimation of Directed Information Lei Zhao, Haim Permuter, Young-Han Kim, and Tsachy Weissman
ISIT 2010, Austin, Texas, U.S.A., June 13 - 18, 2010
Universal Estimation of Directed Information
Lei Zhao, Haim Permuter, Young-Han Kim, and Tsachy Weissman

Abstract—In this paper, we develop a universal algorithm to estimate Massey’s directed information for stationary ergodic processes. The sequential probability assignment induced by a universal source code plays the critical role in the estimation. In particular, we use context tree weighting to implement the algorithm. Some numerical results are provided to illustrate the performance of the proposed algorithm.

Notation: We use capital letter X to denote a random variable and small letter x to denote the corresponding realization or constant. Calligraphic letter 𝒳 denotes the alphabet of X and |𝒳| denotes the cardinality of the alphabet.

I. INTRODUCTION
First introduced by Massey in [1], directed information arises as a natural counterpart of mutual information for channel capacity when feedback is present. In [2] and [3], Kramer extended the use of directed information to discrete memoryless networks with feedback, including the two-way channel and the multiple access channel. For a class of stationary channels with feedback, where the output is a function of the current and past m inputs and channel noise, Kim [4] proved that the feedback capacity is equal to the limit of the supremum of the normalized directed information from the input to the output.

II. PRELIMINARIES
We first give the mathematical definitions of directed information and causally conditional entropy, and then discuss the relation between universal sequential probability assignment and universal source coding.

A. Directed information
Directed information from X^n to Y^n is defined as

    I(X^n → Y^n) = H(Y^n) − H(Y^n‖X^n),    (1)

where H(Y^n‖X^n) is the causally conditional entropy [2], defined as

    H(Y^n‖X^n) = Σ_{i=1}^n H(Y_i|Y^{i−1}, X^i).
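The estimator described above pairs the identity Î(X^n → Y^n) = Ĥ(Y^n) − Ĥ(Y^n‖X^n) with a universal sequential probability assignment. The paper uses context tree weighting, but a per-context Krichevsky–Trofimov assignment with order-1 contexts (an assumed simplification, not the paper's algorithm) already shows the mechanics:

```python
import math
import random

def kt_entropy_rate(symbols, contexts):
    """-(1/n) log2 q(y^n) under a per-context Krichevsky-Trofimov assignment (binary)."""
    counts, logq = {}, 0.0
    for s, c in zip(symbols, contexts):
        n0, n1 = counts.get(c, (0, 0))
        logq += math.log2(((n1 if s else n0) + 0.5) / (n0 + n1 + 1.0))
        counts[c] = (n0, n1 + 1) if s else (n0 + 1, n1)
    return -logq / len(symbols)

def di_rate_estimate(x, y):
    """I_hat(X^n -> Y^n)/n = H_hat(Y^n) - H_hat(Y^n || X^n) with order-1 contexts."""
    y_prev = [0] + y[:-1]                 # boundary convention: Y_0 := 0
    h_y = kt_entropy_rate(y, [(yp,) for yp in y_prev])
    h_y_given_x = kt_entropy_rate(y, list(zip(y_prev, x)))  # context (Y_{i-1}, X_i)
    return h_y - h_y_given_x              # can be slightly negative for finite n

random.seed(0)
x = [random.randint(0, 1) for _ in range(2000)]
copy_rate = di_rate_estimate(x, list(x))                           # Y_i = X_i
indep_rate = di_rate_estimate(x, [random.randint(0, 1) for _ in x])
```

For a copy channel Y_i = X_i the estimate approaches 1 bit per symbol, while for independent sequences it hovers near zero; the gap between the two cases is the causal influence the estimator is designed to detect.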
  • Information Theoretic Causal Effect Quantification
entropy Article Information Theoretic Causal Effect Quantification Aleksander Wieczorek∗ and Volker Roth Department of Mathematics and Computer Science, University of Basel, CH-4051 Basel, Switzerland; [email protected] ∗ Correspondence: [email protected] Received: 31 August 2019; Accepted: 30 September 2019; Published: 5 October 2019 Abstract: Modelling causal relationships has become popular across various disciplines. Most common frameworks for causality are the Pearlian causal directed acyclic graphs (DAGs) and the Neyman-Rubin potential outcome framework. In this paper, we propose an information theoretic framework for causal effect quantification. To this end, we formulate a two step causal deduction procedure in the Pearl and Rubin frameworks and introduce its equivalent which uses information theoretic terms only. The first step of the procedure consists of ensuring no confounding or finding an adjustment set with directed information. In the second step, the causal effect is quantified. We subsequently unify previous definitions of directed information present in the literature and clarify the confusion surrounding them. We also motivate using chain graphs for directed information in time series and extend our approach to chain graphs. The proposed approach serves as a translation between causality modelling and information theory. Keywords: directed information; conditional mutual information; directed mutual information; confounding; causal effect; back-door criterion; average treatment effect; potential outcomes; time series; chain graph 1. Introduction Causality modelling has recently gained popularity in machine learning. Time series, graphical models, deep generative models and many others have been considered in the context of identifying causal relationships.
One hopes that by understanding causal mechanisms governing the systems in question, better results in many application areas can be obtained, varying from biomedical [1,2], climate related [3] to information technology (IT) [4], financial [5] and economic [6,7] data.
  • Capacity of Continuous Channels with Memory Via Directed Information Neural Estimator
Capacity of Continuous Channels with Memory via Directed Information Neural Estimator
Ziv Aharoni (Ben Gurion University), Dor Tsur (Ben Gurion University), Ziv Goldfeld (Cornell University), Haim H. Permuter (Ben Gurion University)

Abstract—Calculating the capacity (with or without feedback) of channels with memory and continuous alphabets is a challenging task. It requires optimizing the directed information (DI) rate over all channel input distributions. The objective is a multi-letter expression, whose analytic solution is only known for a few specific cases. When no analytic solution is present or the channel model is unknown, there is no unified framework for calculating or even approximating capacity. This work proposes a novel capacity estimation algorithm that treats the channel as a ‘black-box’, both when feedback is or is not present. The algorithm has two main ingredients: (i) a neural distribution transformer (NDT) model that shapes a noise variable into the channel input distribution, which we are able to sample, and (ii) the DI neural estimator (DINE) that estimates the communication rate.

(See [24] for further details.) Built on (3), for stationary processes, the DI rate is defined as

    I(X → Y) := lim_{n→∞} (1/n) I(X^n → Y^n).    (4)

As shown in [8], when feedback is not present, the optimization problem (2) (which amounts to optimizing over P_{X^n} rather than P_{X^n‖Y^n}) coincides with (1). Thus, DI provides a unified framework for representing both FF and FB capacities. Computing C_FF and C_FB requires solving a multi-letter optimization problem. Closed form solutions to this challenging task are known only in several special cases.
  • Information Structures for Causally Explainable Decisions
entropy Article Information Structures for Causally Explainable Decisions Louis Anthony Cox, Jr. Department of Business Analytics, University of Colorado School of Business, and MoirAI, 503 N. Franklin Street, Denver, CO 80218, USA; [email protected] Abstract: For an AI agent to make trustworthy decision recommendations under uncertainty on behalf of human principals, it should be able to explain why its recommended decisions make preferred outcomes more likely and what risks they entail. Such rationales use causal models to link potential courses of action to resulting outcome probabilities. They reflect an understanding of possible actions, preferred outcomes, the effects of action on outcome probabilities, and acceptable risks and trade-offs—the standard ingredients of normative theories of decision-making under uncertainty, such as expected utility theory. Competent AI advisory systems should also notice changes that might affect a user’s plans and goals. In response, they should apply both learned patterns for quick response (analogous to fast, intuitive “System 1” decision-making in human psychology) and also slower causal inference and simulation, decision optimization, and planning algorithms (analogous to deliberative “System 2” decision-making in human psychology) to decide how best to respond to changing conditions. Concepts of conditional independence, conditional probability tables (CPTs) or models, causality, heuristic search for optimal plans, uncertainty reduction, and value of information (VoI) provide a rich, principled framework for recognizing and responding to relevant changes and features of decision problems via both learned and calculated responses.
This paper reviews how these and related concepts can be used to identify probabilistic causal dependencies among variables, detect changes that matter for achieving goals, represent them efficiently to support responses on multiple time scales, and evaluate and update causal models and plans in light of new data.
  • A Unified Bellman Equation for Causal Information and Value in Markov
A Unified Bellman Equation for Causal Information and Value in Markov Decision Processes
Stas Tiomkin, Naftali Tishby

Abstract: The interaction between an artificial agent and its environment is bi-directional. The agent extracts relevant information from the environment, and affects the environment by its actions in return to accumulate high expected reward. Standard reinforcement learning (RL) deals with the expected reward maximization. However, there are always information-theoretic limitations that restrict the expected reward, which are not properly considered by the standard RL. In this work we consider RL objectives with information-theoretic limitations. For the first time we derive a Bellman-type recursive equation for the causal information between the environment and the agent, which is combined plausibly with the Bellman recursion for the value function.

Interaction patterns (infinite state and action trajectories) define the typical behavior of an organism in a given environment. This typical behavior is crucial for the design and analysis of intelligent systems. In this work we derive typical behavior within the formalism of a reinforcement learning (RL) model subject to information-theoretic constraints. In the standard RL model (Sutton & Barto, 1998), an artificial organism generates an optimal policy through an interaction with the environment. Typically, the optimality is taken with regard to a reward, such as energy, money, social network ’likes/dislikes’, time etc. Intriguingly, the reward-associated value function (an average accumulated reward) possesses the same property for different types of rewards: the optimal value function is a Lyapunov function (Perkins & Barto, 2002), a generalized energy function.
  • Transfer-Entropy-Regularized Markov Decision Processes Takashi Tanaka1 Henrik Sandberg2 Mikael Skoglund3
Transfer-Entropy-Regularized Markov Decision Processes
Takashi Tanaka, Henrik Sandberg, Mikael Skoglund

Abstract—We consider the framework of transfer-entropy-regularized Markov Decision Process (TERMDP) in which the weighted sum of the classical state-dependent cost and the transfer entropy from the state random process to the control random process is minimized. Although TERMDPs are generally formulated as nonconvex optimization problems, we derive an analytical necessary optimality condition expressed as a finite set of nonlinear equations, based on which an iterative forward-backward computational procedure similar to the Arimoto-Blahut algorithm is proposed. It is shown that every limit point of the sequence generated by the proposed algorithm is a stationary point of the TERMDP. Applications of TERMDPs are discussed in the context of networked control systems theory and non-equilibrium thermodynamics. The proposed algorithm is applied to an information-constrained maze navigation problem, whereby we study how the price of information qualitatively alters the optimal decision policies.

Transfer entropy can be used as a proxy for the data rate on communication channels, and thus solving TERMDP provides a fundamental performance limitation of such systems. The second application of TERMDP is non-equilibrium thermodynamics. There has been renewed interest in the generalized second law of thermodynamics, in which transfer entropy arises as a key concept [10]. TERMDP in this context can be interpreted as the problem of operating thermal engines at a nonzero work rate near the fundamental limitation of the second law of thermodynamics. In contrast to the standard MDP [11], TERMDP penalizes the information flow from the underlying state random process to the control random process.
  • Directed Information for Complex Network Analysis from Multivariate Time Series
DIRECTED INFORMATION FOR COMPLEX NETWORK ANALYSIS FROM MULTIVARIATE TIME SERIES
by Ying Liu
A DISSERTATION submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, Electrical Engineering, 2012

ABSTRACT
Complex networks, ranging from gene regulatory networks in biology to social networks in sociology, have received growing attention from the scientific community. The analysis of complex networks employs techniques from graph theory, machine learning and signal processing. In recent years, complex network analysis tools have been applied to neuroscience and neuroimaging studies to have a better understanding of the human brain. In this thesis, we focus on inferring and analyzing the complex functional brain networks underlying multichannel electroencephalogram (EEG) recordings. Understanding this complex network requires the development of a measure to quantify the relationship between multivariate time series, algorithms to reconstruct the network based on the pairwise relationships, and identification of functional modules within the network. Functional and effective connectivity are two widely studied approaches to quantify the connectivity between two recordings. Unlike functional connectivity which only quantifies the statistical dependencies between two processes by measures such as cross correlation, phase synchrony, and mutual information (MI), effective connectivity quantifies the influence one node exerts on another node. Directed information (DI) measure is one of the approaches that has been recently proposed to capture the causal relationships between two time series.
Two major challenges remain with the application of DI to multivariate data, which include the computational complexity of computing DI with increasing signal length and the accuracy of estimation from limited realizations of the data.
  • Echo State Network Models for Nonlinear Granger Causality
bioRxiv preprint doi: https://doi.org/10.1101/651679; this version posted June 25, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Echo State Network models for nonlinear Granger causality
Andrea Duggento (Department of Biomedicine and Prevention, University of Rome Tor Vergata, Rome, Italy), Maria Guerrisi (Department of Biomedicine and Prevention, University of Rome Tor Vergata, Rome, Italy), Nicola Toschi (Department of Radiology, Athinoula A. Martinos Center for Biomedical Imaging, Boston, MA, USA)

Abstract—While Granger Causality (GC) has been often employed in network neuroscience, most GC applications are based on linear multivariate autoregressive (MVAR) models. However, real-life systems like biological networks exhibit notable nonlinear behavior, hence undermining the validity of MVAR-based GC (MVAR-GC). Current nonlinear GC estimators only cater for additive nonlinearities or, alternatively, are based on recurrent neural networks (RNN) or Long short-term memory (LSTM) networks, which present considerable training difficulties and tailoring needs. We define a novel approach to estimating nonlinear, directed within-network interactions through a RNN class termed echo-state networks (ESN), where training is replaced by random initialization of an internal basis based on orthonormal matrices.

Such models are suited for reconstructing neither the nonlinear components of neural coupling, nor the multiple nonlinearities and time-scales which concur to generating the signals. Instead, neural network (NN) models more flexibly account for multiscale nonlinear dynamics and interactions [10]. For example, multi-layer perceptrons [11] or neural networks with non-uniform embeddings [12] have been used to introduce nonlinear estimation capabilities which also include “extended” GC [13] and wavelet-based approaches [14]. Also, recent preliminary work has employed deep learning to estimate bivariate GC interactions.