
Transfer-Entropy-Regularized Markov Decision Processes

Takashi Tanaka¹, Henrik Sandberg², Mikael Skoglund³

¹University of Texas at Austin, USA, [email protected]; ²KTH Royal Institute of Technology, Sweden, [email protected]; ³KTH Royal Institute of Technology, Sweden, [email protected].

arXiv:1708.09096v3 [math.OC] 27 May 2020

Abstract—We consider the framework of transfer-entropy-regularized Markov Decision Process (TERMDP), in which the weighted sum of the classical state-dependent cost and the transfer entropy from the state random process to the control random process is minimized. Although TERMDPs are generally formulated as nonconvex optimization problems, we derive an analytical necessary optimality condition expressed as a finite set of nonlinear equations, based on which an iterative forward-backward computational procedure similar to the Arimoto-Blahut algorithm is proposed. It is shown that every limit point of the sequence generated by the proposed algorithm is a stationary point of the TERMDP. Applications of TERMDPs are discussed in the context of networked control systems theory and non-equilibrium thermodynamics. The proposed algorithm is applied to an information-constrained maze navigation problem, whereby we study how the price of information qualitatively alters the optimal decision policies.

I. INTRODUCTION

Transfer entropy [1] is a quantity that can be understood as a measure of information flow between random processes. It is a generalization of directed information, a concept proposed in the information theory literature for the analysis of communication systems with feedback [2]–[4]. Closely related concepts include the Kullback causality measure [5], which was originally introduced in the economic statistics literature for causality analysis.¹ Recently, these concepts have been applied in a broad range of academic disciplines, including neuroscience [7], finance [8], and social science [9].

¹Sometimes (e.g., in statistical physics [6]), transfer entropy is used as a synonym for directed information. It appears that the concepts of transfer entropy [1], directed information [2], [3], and the Kullback causality measure [5] were introduced independently.

In this paper, we formulate the problem of transfer-entropy-regularized Markov Decision Process (TERMDP) and develop a numerical solution algorithm. TERMDP is an optimal control problem in which we seek a causal decision-making policy that minimizes the weighted sum of the classical state-dependent cost and the transfer entropy from the state random process to the control actions. As we will discuss in the sequel, TERMDP predicts a fundamental performance limitation of feedback control systems from an information-theoretic perspective.

The first context in which TERMDP naturally arises is networked control systems theory, where the trade-off between the best achievable control performance and the data rate at which sensor information is fed back to the controller is a central question. Prior work has shown that transfer entropy can be used as a proxy for the data rate on communication channels, and thus solving TERMDP characterizes a fundamental performance limitation of such systems. The second application of TERMDP is non-equilibrium thermodynamics. There has been renewed interest in the generalized second law of thermodynamics, in which transfer entropy arises as a key concept [10]. TERMDP in this context can be interpreted as the problem of operating thermal engines at a nonzero work rate near the fundamental limitation of the second law of thermodynamics.

In contrast to the standard MDP [11], TERMDP penalizes the information flow from the underlying state random process to the control random process. Consequently, TERMDP promotes "information-frugal" decision policies, under which control actions tend to be statistically less dependent on the underlying Markovian state dynamics. This is often a favorable property in various real-time decision-making scenarios (for both humans and robots) in which information acquisition, processing, and transmission are costly operations. Therefore, it is expected that TERMDP plays major roles in broader contexts beyond the aforementioned applications, although the interpretation of transfer entropy in each application must be carefully discussed.
In the literature, a few alternative approaches have been suggested to apply information-theoretic cost functions to capture decision-making costs in MDPs. Similarities and differences between TERMDP and the existing problem formulations are noteworthy. The rationally inattentive control problem [12], [13] has been motivated in a macroeconomic context, where Shannon's mutual information (a special case of transfer entropy) is adopted as an attention cost for decision-makers. The authors of [14]–[16] present a class of optimal control problems in which control costs are modeled as the Kullback-Leibler (KL) divergence from the "uncontrolled" state trajectories to the "controlled" state trajectories. Alternative information-theoretic decision costs in dynamic environments include predictive information [17], the past-future information bottleneck [18], and information-to-go [19], [20]. Information-theoretic bounded rationality and its analogy to thermodynamics are discussed in [21]. While intuitively plausible, some of these problem formulations lack physical (or coding-theoretic) justifications, unlike TERMDP, whose operational interpretation can be found in the aforementioned contexts.

An equivalent problem formulation to TERMDP first appeared in [22] and [23], where the problem was formulated in a general (Polish state space) setup. Linear-Quadratic-Gaussian (LQG) control with minimum directed information [24] is a version of TERMDP specialized to the LQG regime. While the problem in the LQG setup was shown to be tractable by semidefinite programming [24], algorithmic aspects of TERMDP beyond the LQG regime have not been thoroughly studied. Therefore, the primary goal of this paper is to provide an efficient computational algorithm to find a stationary point (an optimal solution candidate) of a given TERMDP. The contributions of this paper are as follows:

• We derive a necessary optimality condition expressed as a set of nonlinear equations involving a finite number of variables. This result recovers, and partly strengthens, results obtained in prior work [22].
• We propose a forward-backward iterative algorithm, which can be viewed as a generalization of the Arimoto-Blahut algorithm [25], [26], to solve the optimality condition numerically.

The proposed algorithm is the first application of the Arimoto-Blahut algorithm to transfer entropy minimization. Our algorithm should be compared with the generalized Arimoto-Blahut algorithm for transfer entropy maximization proposed in [27]. The algorithm in [27] can be viewed as a generalization of the Arimoto-Blahut "capacity algorithm" in [25], while our proposed algorithm can be viewed as a generalization of the Arimoto-Blahut "rate-distortion algorithm" in [25]. Unfortunately, we discover that the proposed algorithm may not converge to the global minimum due to the nonconvex nature of TERMDP. This result is somewhat surprising, as the global convergence of the original Arimoto-Blahut rate-distortion algorithm, which is a special case of our algorithm, is well known. Nevertheless, observing that the proposed algorithm belongs to the class of block coordinate descent (BCD) algorithms, we show that every limit point generated by the algorithm is guaranteed to be a stationary point of the given TERMDP.

Organization of the paper: The problem formulation of TERMDP is formally introduced in Section II. Mathematical preliminaries are summarized in Section III. Section IV presents the main results. Derivations of the main results are summarized in Section V. Section VI discusses applications of the TERMDP framework. A numerical demonstration of the proposed algorithm is presented in Section VII. We conclude with a list of future work in Section VIII.

Notation: Upper case symbols such as $X$ are used to represent random variables, while lower case symbols such as $x$ are used to represent specific realizations. The notation $x_k^l \triangleq (x_k, x_{k+1}, \dots, x_l)$ and $x^t \triangleq (x_1, x_2, \dots, x_t)$ is used to specify subsequences. We use the natural logarithm $\log(\cdot) = \log_e(\cdot)$ throughout the paper.

II. PROBLEM FORMULATION

We formulate TERMDP based upon the standard Markov Decision Process (MDP) formalism [11], defined by a time index $t = 1, 2, \dots, T$, state spaces $\mathcal{X}_t$, action spaces $\mathcal{U}_t$, transition probabilities $p_{t+1}(x_{t+1}|x_t, u_t)$, and cost functions $c_t : \mathcal{X}_t \times \mathcal{U}_t \to \mathbb{R}$ for each $t = 1, 2, \dots, T$ and $c_{T+1} : \mathcal{X}_{T+1} \to \mathbb{R}$. For simplicity, we assume that both $\mathcal{X}_t$ and $\mathcal{U}_t$ are finite. The decision policy to be synthesized can be probabilistic and history-dependent in general, and is represented by a conditional probability distribution

$$q_t(u_t \mid x^t, u^{t-1}). \quad (1)$$

The joint distribution of the state and control trajectories is denoted by $\mu_{t+1}(x^{t+1}, u^t)$, which is uniquely determined by the initial state distribution $\mu_1(x_1)$, the state transition probability $p_{t+1}(x_{t+1}|x_t, u_t)$, and the decision policy $q_t(u_t|x^t, u^{t-1})$ by the recursive formula

$$\mu_{t+1}(x^{t+1}, u^t) = p_{t+1}(x_{t+1}|x_t, u_t)\, q_t(u_t|x^t, u^{t-1})\, \mu_t(x^t, u^{t-1}). \quad (2)$$

A stage-additive cost functional

$$J(X^{T+1}, U^T) \triangleq \sum_{t=1}^{T} \mathbb{E}\, c_t(X_t, U_t) + \mathbb{E}\, c_{T+1}(X_{T+1}) \quad (3)$$

is a function of the random variables $X^{T+1}$ and $U^T$ with joint distribution $\mu_{T+1}(x^{T+1}, u^T)$. Transfer entropy is an information-theoretic quantity defined as follows:

Definition 1: For nonnegative integers $m$ and $n$, the transfer entropy of degree $(m, n)$ is defined by

$$I_{m,n}(X^T \to U^T) \triangleq \sum_{t=1}^{T} \mathbb{E} \log \frac{\mu_{t+1}(U_t \mid X_{t-m}^t, U_{t-n}^{t-1})}{\mu_{t+1}(U_t \mid U_{t-n}^{t-1})} = \sum_{t=1}^{T} \sum_{x^t \in \mathcal{X}^t,\, u^t \in \mathcal{U}^t} \mu_{t+1}(x^t, u^t) \log \frac{\mu_{t+1}(u_t \mid x_{t-m}^t, u_{t-n}^{t-1})}{\mu_{t+1}(u_t \mid u_{t-n}^{t-1})}. \quad (4)$$

Using conditional mutual information [28], transfer entropy can also be written as

$$I_{m,n}(X^T \to U^T) = \sum_{t=1}^{T} I(X_{t-m}^t;\, U_t \mid U_{t-n}^{t-1}). \quad (5)$$

When $m = \infty$ and $n = \infty$, (4) coincides with directed information [3]:

$$I(X^T \to U^T) \triangleq \sum_{t=1}^{T} I(X^t;\, U_t \mid U^{t-1}). \quad (6)$$

The main problem studied in this paper is now formulated as follows.

Problem 1 (TERMDP): Let the initial state distribution $\mu_1(x_1)$ and the state transition probability $p_{t+1}(x_{t+1}|x_t, u_t)$ be given, and assume that the joint distribution $\mu_{t+1}(x^{t+1}, u^t)$ is recursively given by (2). For a fixed constant $\beta \ge 0$, the transfer-entropy-regularized Markov decision process is the optimization problem

$$\min_{\{q_t(u_t|x^t, u^{t-1})\}_{t=1}^{T}} \; J(X^{T+1}, U^T) + \beta\, I_{m,n}(X^T \to U^T). \quad (7)$$
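To make the objective in (7) concrete, the following is a minimal numerical sketch of how $J + \beta I_{m,n}$ can be evaluated for the special case $(m, n) = (0, 0)$ with a memoryless policy $q_t(u_t|x_t)$, in which case the transfer entropy reduces to $\sum_t I(X_t; U_t)$. The function name and array conventions are our own, introduced only for illustration.

```python
import numpy as np

def termdp_objective(mu1, p, q, c, c_term, beta):
    """Evaluate J + beta * I_{0,0} in (7) for a memoryless policy.
    Conventions (ours): mu1[x] = mu_1(x_1); p[t][x, u, y] = p_{t+1}(y | x, u);
    q[t][x, u] = q_t(u | x); c[t][x, u] = c_t(x, u); c_term[x] = c_{T+1}(x)."""
    mu, J, I = mu1.copy(), 0.0, 0.0
    for t in range(len(q)):
        joint = mu[:, None] * q[t]                   # mu_t(x_t, u_t)
        nu = joint.sum(axis=0)                       # marginal nu_t(u_t)
        J += float((joint * c[t]).sum())             # E c_t(X_t, U_t)
        m = joint > 0                                # I(X_t; U_t), cf. (4) with (m,n)=(0,0)
        nu2d = np.broadcast_to(nu, q[t].shape)
        I += float((joint[m] * np.log(q[t][m] / nu2d[m])).sum())
        mu = np.einsum('xuy,xu->y', p[t], joint)     # state marginal of recursion (2)
    J += float(mu @ c_term)                          # terminal cost E c_{T+1}(X_{T+1})
    return J + beta * I

# Example with T = 2 and binary state/action spaces (random instance):
rng = np.random.default_rng(0)
T, X, U = 2, 2, 2
p = [rng.dirichlet(np.ones(X), size=(X, U)) for _ in range(T)]
q = [np.full((X, U), 1.0 / U) for _ in range(T)]
c = [rng.random((X, U)) for _ in range(T)]
print(termdp_objective(np.array([0.5, 0.5]), p, q, c, np.zeros(X), beta=1.0))
```

Note that under the uniform policy used above, the control is independent of the state, so the mutual-information term vanishes; this is the extreme case of the "information-frugal" behavior promoted by (7).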
A few remarks are in order regarding this problem formulation.

First, the transfer entropy term in (7) is interpreted as an additional cost corresponding to the information transfer from the state random process $X_t$ to the control random process $U_t$. The regularization parameter $\beta \ge 0$ can be thought of as the cost of information transfer. When $\beta = 0$, the standard MDP formulation is recovered. As we increase $\beta > 0$, the optimal decision policy for (7) tends to be "information frugal" in order to reduce $I_{m,n}(X^T \to U^T)$. That is, control actions generated by the policy become statistically less dependent on the state of the system.

Second, in contrast to the standard MDP (i.e., $\beta = 0$), for which optimal policies can be assumed deterministic and history-independent (Markovian) without loss of generality [11, Section 4.4], TERMDP (7) with $\beta > 0$ admits an optimal policy that is randomized and history-dependent. Thus, the cardinality of the solution space we must explore to solve (7) is much larger than that of the standard MDP. However, in Proposition 1 below, we show that one can assume without loss of performance an optimal policy of the form

$$q_t(u_t \mid x_t, u_{t-n}^{t-1}) \quad (8)$$

instead of (1). In other words, it is sufficient to consider a policy that depends only on the most recent realization of the state and the last $n$ realizations of the control inputs.

Finally, the structure of the problem (7) is similar (but not equivalent) to that of the KL control formulation in [14]. In particular, if $(m, n) = (0, 0)$, the transfer entropy cost (4) becomes

$$\sum_{t=1}^{T} \mathbb{E} \log \frac{\mu_{t+1}(U_t \mid X_t)}{\mu_{t+1}(U_t)}.$$

On the other hand, the KL control framework considers a KL divergence cost of the form

$$\sum_{t=1}^{T} \mathbb{E} \log \frac{\mu_{t+1}(U_t \mid X_t)}{r_{t+1}(U_t \mid X_t)},$$

where $r_{t+1}(u_t|x_t)$ is the conditional distribution specified by a predefined "reference" policy. Unlike the fixed reference distribution $r_{t+1}(u_t|x_t)$, the denominator $\mu_{t+1}(U_t)$ in the transfer entropy cost depends on the policy being optimized. Unfortunately, this difference renders (7) nonconvex, as we will observe in Section IV-C. Despite this disadvantage, we emphasize the importance of studying TERMDP, as it arises in practical problems in science and engineering, as discussed in Section VI.
III. PRELIMINARIES

In this section, we summarize some technical results needed to derive our main results.

A. Structure of the optimal solution

We first derive some important structural properties of the optimal policy, which allow us to rewrite the main problem (7) in a simpler form. The desired structural results can be obtained by applying the dynamic programming principle to (7). To this end, notice that (7) can be viewed as a $T$-stage optimal control problem in which the joint distribution $\mu_t$ is the "state" of the system, controlled by the multiplicative control action $q_t$ via the state evolution equation (2). In the developments below, we set $\beta = 1$; this is without loss of generality, since for any $\beta > 0$ the weighted objective can be reduced to this case by rescaling the stage costs by $1/\beta$. Introduce the value function as

$$V_k\big(\mu_k(x^k, u^{k-1})\big) \triangleq \min_{\{q_t\}_{t=k}^{T}} \sum_{t=k}^{T} \big\{ \mathbb{E}\, c_t(X_t, U_t) + I(X_{t-m}^t; U_t \mid U_{t-n}^{t-1}) \big\} + \mathbb{E}\, c_{T+1}(X_{T+1}). \quad (9)$$

The value function satisfies the Bellman equation

$$V_t\big(\mu_t(x^t, u^{t-1})\big) = \min_{q_t} \Big\{ \mathbb{E}\, c_t(X_t, U_t) + I(X_{t-m}^t; U_t \mid U_{t-n}^{t-1}) + V_{t+1}\big(\mu_{t+1}(x^{t+1}, u^t)\big) \Big\} \quad (10)$$

for $t = 1, 2, \dots, T$, with the terminal condition

$$V_{T+1}\big(\mu_{T+1}(x^{T+1}, u^T)\big) = \mathbb{E}^{\mu_{T+1}} c_{T+1}(X_{T+1}). \quad (11)$$

The next proposition summarizes the key structural results.

Proposition 1: For the optimization problem (7) and its dynamic programming formulation (9)–(11), the following statements hold for each $k = 1, 2, \dots, T$.

(a) For each sequence of policies $\{q_t(u_t|x^t, u^{t-1})\}_{t=k}^{T}$, there exists a sequence of policies of the form $\{q'_t(u_t|x_t, u_{t-n}^{t-1})\}_{t=k}^{T}$ such that the value of the objective function on the left-hand side of (9) attained by $\{q'_t\}_{t=k}^{T}$ is less than or equal to the value attained by $\{q_t\}_{t=k}^{T}$.

(b) If the policy at time step $k$ is of the form $q'_k(u_k|x_k, u_{k-n}^{k-1})$, the identity $I(X_{k-m}^k; U_k|U_{k-n}^{k-1}) = I(X_k; U_k|U_{k-n}^{k-1})$ holds.

(c) The value function $V_k(\mu_k(x^k, u^{k-1}))$ depends only on the marginal distribution $\mu_k(x_k, u_{k-n}^{k-1})$.

Proof: See Appendix A.

An implication of Proposition 1 (a) is that the search for the optimal policy for the original TERMDP (7) can be restricted to the class of policies of the form $q_t(u_t|x_t, u_{t-n}^{t-1})$ without loss of performance. Proposition 1 (b) implies that, as long as a policy of the form $q_t(u_t|x_t, u_{t-n}^{t-1})$ is used, the problem remains equivalent even after the transfer entropy term $I_{m,n}(X^T \to U^T)$ is replaced by the transfer entropy of degree $(0, n)$:

$$I_{0,n}(X^T \to U^T) = \sum_{t=1}^{T} I(X_t; U_t \mid U_{t-n}^{t-1}).$$

Proposition 1 (c) implies that the distribution $\mu_t(x_t, u_{t-n}^{t-1})$, rather than the original $\mu_t(x^t, u^{t-1})$, suffices as the state of the considered problem. This "reduced" state evolves according to

$$\mu_{t+1}(x_{t+1}, u_{t-n+1}^{t}) = \sum_{x_t \in \mathcal{X}_t} \sum_{u_{t-n} \in \mathcal{U}_{t-n}} p_{t+1}(x_{t+1}|x_t, u_t)\, q_t(u_t|x_t, u_{t-n}^{t-1})\, \mu_t(x_t, u_{t-n}^{t-1}). \quad (12)$$

Based on these observations, it can be seen that Problem 1 can be solved by solving the following simplified problem:

Problem 2 (Simplified TERMDP): Let the initial state distribution $\mu_1(x_1)$ and the state transition probability $p_{t+1}(x_{t+1}|x_t, u_t)$ be given, and assume that the joint distribution $\mu_t(x_t, u_{t-n}^{t-1})$ is recursively given by (12). For a fixed constant $\beta \ge 0$, the simplified TERMDP is the optimization problem

$$\min_{\{q_t(u_t|x_t, u_{t-n}^{t-1})\}_{t=1}^{T}} \; J(X^{T+1}, U^T) + \beta\, I_{0,n}(X^T \to U^T). \quad (13)$$

Notice that Proposition 1 implies that a global minimizer for (13) is a global minimizer for (7). With an appropriate notion of local optimality, it can also be shown that a local minimizer for (13) is a local minimizer for (7), as detailed in Appendix B. For this reason, in what follows, we develop an algorithm that solves the simplified TERMDP (13) rather than the original TERMDP (7).
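The reduced-state recursion (12) is straightforward to implement. The sketch below propagates $\mu_t(x_t, u_{t-1})$ for the case $n = 1$; the handling of the missing control at $t = 1$ (a dummy uniform $u_0$) and the array conventions are our own illustrative choices, assuming common action spaces across time.

```python
import numpy as np

def propagate_reduced_state(mu1, p, q, T):
    """Forward recursion (12) for n = 1: the sufficient statistic is
    mu_t(x_t, u_{t-1}).  Conventions (ours): mu1[x] = mu_1(x_1);
    p[t][x, u, y] = p_{t+1}(y | x, u); q[t][x, v, u] = q_t(u | x, u_{t-1} = v).
    At t = 1 there is no previous control; we attach a dummy uniform u_0,
    which is immaterial provided q[0] does not depend on its second index."""
    U = q[0].shape[2]
    mu = mu1[:, None] * np.full((1, U), 1.0 / U)     # mu_1(x_1, u_0)
    trajectory = [mu]
    for t in range(T):
        # mu_{t+1}(y, u) = sum_{x,v} p(y|x,u) q(u|x,v) mu(x,v), which is (12)
        joint = np.einsum('xvu,xv->xu', q[t], mu)    # mu_t(x_t, u_t)
        mu = np.einsum('xuy,xu->yu', p[t], joint)    # mu_{t+1}(x_{t+1}, u_t)
        trajectory.append(mu)
    return trajectory
```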

Proposition 1 implies that an optimal solution to both the original and the simplified TERMDP can be found by solving the Bellman equation

$$V_t\big(\mu_t(x_t, u_{t-n}^{t-1})\big) = \min_{q_t} \Big\{ \mathbb{E}\, c_t(X_t, U_t) + I(X_t; U_t \mid U_{t-n}^{t-1}) + V_{t+1}\big(\mu_{t+1}(x_{t+1}, u_{t-n+1}^{t})\big) \Big\} \quad (14)$$

with the state transition rule (12) and the terminal condition

$$V_{T+1}\big(\mu_{T+1}(x_{T+1}, u_{T-n+1}^{T})\big) = \mathbb{E}^{\mu_{T+1}} c_{T+1}(X_{T+1}). \quad (15)$$

Notice that the Bellman equation (14) is simpler than the original form (10). However, solving (14) remains computationally challenging, as the right-hand side of (14) involves a nonconvex optimization problem. In Section IV-C, we present a simple numerical example demonstrating this nonconvexity.

B. Transfer entropy and directed information

In some applications (e.g., networked control systems; see Section VI-A), we are interested in (7) with directed information ($m = \infty$ and $n = \infty$), even though solving such a problem is often computationally intractable. In such applications, approximating directed information by transfer entropy of finite degree is a natural idea. The next proposition describes the consequence of such an approximation.

Proposition 2: For any fixed decision policy of the form $q_t(u_t|x_t, u_{t-n}^{t-1})$, $t = 1, 2, \dots, T$, we have²

$$I_{0,n} \ge I_{0,n+1} \ge \dots \ge I_{0,\infty}.$$

Proof: See Appendix C.

²$I_{m,n}$ is a short-hand notation for $I_{m,n}(X^T \to U^T)$.

The following chain of inequalities shows that the optimal value of (13) with any finite $n$ provides an upper bound on the optimal value of (7) with $(m, n) = (\infty, \infty)$:

$$\min_{\{q_t(u_t|x^t, u^{t-1})\}_{t=1}^{T}} J(X^{T+1}, U^T) + \beta I_{\infty,\infty}$$
$$= \min_{\{q_t(u_t|x_t, u^{t-1})\}_{t=1}^{T}} J(X^{T+1}, U^T) + \beta I_{\infty,\infty} \quad (16a)$$
$$= \min_{\{q_t(u_t|x_t, u^{t-1})\}_{t=1}^{T}} J(X^{T+1}, U^T) + \beta I_{0,\infty} \quad (16b)$$
$$\le \min_{\{q_t(u_t|x_t, u_{t-n}^{t-1})\}_{t=1}^{T}} J(X^{T+1}, U^T) + \beta I_{0,\infty} \quad (16c)$$
$$\le \min_{\{q_t(u_t|x_t, u_{t-n}^{t-1})\}_{t=1}^{T}} J(X^{T+1}, U^T) + \beta I_{0,n}. \quad (16d)$$

Equalities (16a) and (16b) follow from Proposition 1 (a) and (b), respectively. The inequality (16c) is trivial, since any policy of the form $q_t(u_t|x_t, u_{t-n}^{t-1})$ is a special case of a policy of the form $q_t(u_t|x_t, u^{t-1})$. The final inequality (16d) is due to Proposition 2.

C. Rate-distortion theory and Arimoto-Blahut Algorithm

In the special case with $T = 1$, $n = 0$, $\beta = 1$, and $c_{T+1}(\cdot) = 0$, the optimization problem (13) becomes

$$\min_{q(u|x)} \; \mathbb{E}\, c(X, U) + I(X; U), \quad (17)$$

where the probability distribution $p(x)$ on $\mathcal{X}$ is given. In this special case, the problem (17) is convex, and the solution is well known in the context of rate-distortion theory [28].

Proposition 3: A conditional distribution $q^*(u|x)$ is a global minimizer for (17) if and only if it satisfies the following condition $p(x)$-almost everywhere:

$$q^*(u|x) = \frac{\nu^*(u) \exp\{-c(x, u)\}}{\sum_{u \in \mathcal{U}} \nu^*(u) \exp\{-c(x, u)\}} \quad (18a)$$
$$\nu^*(u) = \sum_{x \in \mathcal{X}} p(x)\, q^*(u|x). \quad (18b)$$

Proof: This result is standard, and hence the proof is omitted. See [29, Appendix A] and [30] for relevant discussions.

Condition (18) is required only $p(x)$-almost everywhere, since for $x$ such that $p(x) = 0$, $q^*(u|x)$ can be chosen arbitrarily. Commonly, the denominator in (18a) is called the partition function:

$$\phi^*(x) \triangleq \sum_{u \in \mathcal{U}} \nu^*(u) \exp\{-c(x, u)\}.$$

By substitution, it is easy to show that the optimal value of (17) can be written in terms of $\nu^*(u)$ as

$$- \sum_{x \in \mathcal{X}} p(x) \log \Big\{ \sum_{u \in \mathcal{U}} \nu^*(u) \exp\{-c(x, u)\} \Big\}, \quad (19)$$

or more compactly as $\mathbb{E}^{p(x)}\{-\log \phi^*(X)\}$. This quantity is often referred to as free energy [6], [31].

The Arimoto-Blahut algorithm is an iterative algorithm to compute a $q^*(u|x)$ satisfying (18) numerically. It is based on the alternating updates

$$\nu^{(k)}(u) = \sum_{x \in \mathcal{X}} p(x)\, q^{(k-1)}(u|x) \quad (20a)$$
$$q^{(k)}(u|x) = \frac{\nu^{(k)}(u) \exp\{-c(x, u)\}}{\sum_{u \in \mathcal{U}} \nu^{(k)}(u) \exp\{-c(x, u)\}}. \quad (20b)$$

The algorithm was first proposed for the computation of channel capacity [26] and for the computation of rate-distortion functions [25]. Clearly, the optimal solution $(q^*, \nu^*)$ is a fixed point of the algorithm (20). Under a mild assumption, convergence of the algorithm is guaranteed; see [26], [32], [33]. The main algorithm we propose in this paper to solve the simplified TERMDP (13) can be thought of as a generalization of the Arimoto-Blahut "rate-distortion algorithm."
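For concreteness, the following is a minimal sketch of the alternating updates (20); the function name and array conventions are ours. Starting from any strictly positive $q^{(0)}$, the iteration approaches a pair satisfying (18), and (19) gives the optimal value.

```python
import numpy as np

def blahut_arimoto(p_x, c, iters=500):
    """Alternating updates (20) for min_q E c(X,U) + I(X;U) in (17).
    p_x[x]: given source distribution; c[x, u]: cost matrix."""
    X, U = c.shape
    q = np.full((X, U), 1.0 / U)               # any strictly positive q^{(0)}
    for _ in range(iters):
        nu = p_x @ q                           # (20a): nu(u) = sum_x p(x) q(u|x)
        w = nu[None, :] * np.exp(-c)           # nu(u) exp{-c(x,u)}
        phi = w.sum(axis=1, keepdims=True)     # partition function phi(x)
        q = w / phi                            # (20b)
    free_energy = -float(p_x @ np.log(phi.ravel()))   # optimal value (19)
    return q, nu, free_energy

# Binary example with Hamming cost c(x,u) = 1{x != u}:
q, nu, F = blahut_arimoto(np.array([0.5, 0.5]),
                          np.array([[0.0, 1.0], [1.0, 0.0]]))
```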
D. Block Coordinate Descent Algorithm

The Arimoto-Blahut algorithm can be viewed as a block coordinate descent (BCD) algorithm applied to a special class of objective functions. In this subsection, we summarize elements of the BCD method and a version of its convergence results that is relevant to our analysis. Consider the problem

$$\min \; f(x) \quad (21a)$$
$$\text{s.t. } x \in X = X_1 \times X_2 \times \dots \times X_N, \quad (21b)$$

where the feasible set $X$ is the Cartesian product of closed, nonempty, and convex subsets $X_i \subseteq \mathbb{R}^{n_i}$ for $i = 1, 2, \dots, N$, and the function $f : \mathbb{R}^{n_1 + \dots + n_N} \to \mathbb{R} \cup \{\infty\}$ is continuously differentiable on the sublevel set $\{x \in X : f(x) \le f(x^{(0)})\}$, where $x^{(0)} \in X$ is a given initial point. We call $x^* \in X$ a stationary point for (21) if it satisfies $\nabla f(x^*)^\top (y - x^*) \ge 0$ for every $y \in X$, where $\nabla f(x^*)$ is the gradient of $f$ at $x^*$. If $f$ is convex, every stationary point is a global minimizer of $f$. The BCD algorithm for (21) is defined by the following cyclic update rule:

$$x_1^{(k)} \in \arg\min_{x_1} f(x_1, x_2^{(k-1)}, \dots, x_N^{(k-1)}) \quad (22a)$$
$$x_2^{(k)} \in \arg\min_{x_2} f(x_1^{(k)}, x_2, x_3^{(k-1)}, \dots, x_N^{(k-1)}) \quad (22b)$$
$$\cdots$$
$$x_N^{(k)} \in \arg\min_{x_N} f(x_1^{(k)}, x_2^{(k)}, \dots, x_N). \quad (22c)$$

A number of sufficient conditions for the convergence of the BCD algorithm are known in the literature (e.g., [33]–[35] and references therein). For instance, if $f$ is pseudoconvex and has compact level sets, then every limit point of the sequence $\{x^{(k)}\}$ generated by the BCD algorithm is a global minimizer of $f$ [35, Proposition 6]. This result can be applied to show the global convergence of the Arimoto-Blahut algorithm (20), simply by noticing that the objective function in (17) can be written as a convex function of $\nu$ and $q$ as

$$f(\nu, q) = \sum_{x \in \mathcal{X},\, u \in \mathcal{U}} p(x)\, q(u|x) \Big( c(x, u) + \log \frac{q(u|x)}{\nu(u)} \Big)$$

and that (20) is equivalent to the BCD update rule (22). In the absence of the convexity assumption on $f$, it is typically required that each coordinate-wise minimization be uniquely attained in order to guarantee that every limit point of the BCD algorithm is a stationary point [34, Proposition 2.7.1].³ Unfortunately, the generalized Arimoto-Blahut algorithm we introduce in this paper for the TERMDP is a BCD algorithm applied to a nonconvex objective function, and the uniqueness of the coordinate-wise minimizer cannot be assumed. Thus, none of the above results are applicable to prove convergence. Fortunately, the requirement of uniqueness of the coordinate-wise minimizer can be relaxed when $N = 2$ (two-block BCD algorithms). The following result is due to [35, Corollary 2] and [33, Theorem 4.2 (c)].

Lemma 1: Consider the problem (21) with $N = 2$, and suppose that the sequence $\{x^{(k)}\}$ generated by the two-block BCD algorithm (22) has limit points. Then every limit point $x^*$ of $\{x^{(k)}\}$ is a stationary point of the problem (21).

Lemma 1 is critical to obtain one of our main results (Theorem 2) below.

³The counterexample by Powell [36] with $N = 3$ shows that the lack of uniqueness of the coordinate-wise minimizer can result in a BCD algorithm with a limit point which is not a stationary point.
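As a minimal illustration of the cyclic rule (22) with $N = 2$, the sketch below alternates exact coordinate-wise minimizations on a smooth, strictly convex quadratic whose coordinate minimizers are available in closed form; the instance is our own toy example, not from the paper.

```python
def two_block_bcd(argmin_x1, argmin_x2, x1, x2, iters=100):
    """Two-block BCD (22): alternate exact coordinate-wise minimizations."""
    for _ in range(iters):
        x1 = argmin_x1(x2)   # (22a)
        x2 = argmin_x2(x1)   # (22b)
    return x1, x2

# Toy instance: f(x1, x2) = (x1 - 1)^2 + (x2 + 2)^2 + x1*x2 (strictly convex).
# Setting the partial derivatives to zero gives the coordinate minimizers below;
# the iterates converge to the unique stationary point (8/3, -10/3).
x1, x2 = two_block_bcd(lambda x2: (2.0 - x2) / 2.0,
                       lambda x1: (-4.0 - x1) / 2.0, 0.0, 0.0)
print(x1, x2)
```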
IV. MAIN RESULTS

This section summarizes the main results (Theorems 1 and 2) of this paper. Derivations are deferred to Section V.

A. Necessary Optimality Condition

The first technical result states that a necessary optimality condition for the simplified TERMDP (13) is given by the nonlinear condition (23) in terms of $(\mu^*, \nu^*, \rho^*, \phi^*, q^*)$.

Theorem 1: If $\{q_t^*\}_{t=1}^{T}$ is a local minimizer for (13), then there exist variables $\{\mu_{t+1}^*, \nu_t^*, \rho_t^*, \phi_t^*\}_{t=1}^{T}$ satisfying the set of nonlinear equations (23), together with the initial condition $\mu_1^*(x_1) = p_1(x_1)$ and the terminal condition $\phi_{T+1}^*(x_{T+1}, u_{T-n+1}^{T}) \triangleq \exp\{-c_{T+1}(x_{T+1})\}$.

The optimality condition (23) can be utilized to develop a numerical algorithm to find an optimal solution candidate for the simplified TERMDP (13). Unfortunately, it will soon be shown that the optimality condition (23) is only necessary in general. Since (23) is a nonlinear condition, it is possible that (23) admits multiple distinct solutions, some of which may correspond to local minima and saddle points of the simplified TERMDP (13). Theorem 1 is closely related to the previously obtained conditions in [22] and [23]. Theorem 1 refines the results of [22] and [23] by incorporating the underlying Markovian structure of the simplified TERMDP (13).

B. Forward-Backward Arimoto-Blahut Algorithm

As the second contribution, we propose an iterative algorithm to solve (23) numerically. To this end, we classify the five equations into two groups. Equations (23a) and (23b) form the first group (characterizing the variables $\mu^*$ and $\nu^*$), and equations (23c)–(23e) form the second group (characterizing the variables $\rho^*$, $\phi^*$, and $q^*$). Observe that if the variables $(\rho^*, \phi^*, q^*)$ are known, then the first set of equations, which can be viewed as the Kolmogorov forward equation, can be solved forward in time to compute $(\mu^*, \nu^*)$. Conversely, if the variables $(\mu^*, \nu^*)$ are known, then the second set of equations, which can be viewed as the backward Bellman equation, can be solved backward in time to compute $(\rho^*, \phi^*, q^*)$. Hence, to compute these unknowns simultaneously, the following bootstrapping method is natural: first, the forward computation is performed using the current best guess of the second set of unknowns, and then the backward computation is performed using the updated guess of the first set of unknowns. The forward-backward iteration is repeated sufficiently many times. The proposed algorithm is summarized in Algorithm 1. Notice that Algorithm 1 is a generalization of the standard Arimoto-Blahut algorithm (20), which can be recovered as a special case with $T = 1$, $m = n = 0$, and $c_{T+1}(\cdot) = 0$.

Clearly, solutions of (23) are fixed points of Algorithm 1. The second main result of this paper is stated as follows.

Theorem 2: For each initial condition (24), the sequence $q^{(k)}$ generated by Algorithm 1 has a limit point. Moreover, every limit point $q^*$ satisfies (23).

Theorems 1 and 2 together imply that every limit point of Algorithm 1 is an optimal solution candidate for TERMDP.

The optimality condition in Theorem 1 is the following set of nonlinear equations, to hold for $t = 1, 2, \dots, T$:

$$\mu_{t+1}^*(x_{t+1}, u_{t-n+1}^{t}) = \sum_{x_t \in \mathcal{X}_t} \sum_{u_{t-n} \in \mathcal{U}_{t-n}} p_{t+1}(x_{t+1}|x_t, u_t)\, q_t^*(u_t|x_t, u_{t-n}^{t-1})\, \mu_t^*(x_t, u_{t-n}^{t-1}) \quad (23a)$$

$$\nu_t^*(u_t|u_{t-n}^{t-1}) = \sum_{x_t \in \mathcal{X}_t} q_t^*(u_t|x_t, u_{t-n}^{t-1})\, \mu_t^*(x_t|u_{t-n}^{t-1}), \quad \forall u_{t-n}^{t-1} \text{ such that } \mu_t(u_{t-n}^{t-1}) > 0 \quad (23b)$$

$$\rho_t^*(x_t, u_{t-n}^{t}) = c_t(x_t, u_t) - \sum_{x_{t+1} \in \mathcal{X}_{t+1}} p_{t+1}(x_{t+1}|x_t, u_t) \log \phi_{t+1}^*(x_{t+1}, u_{t-n+1}^{t}) \quad (23c)$$

$$\phi_t^*(x_t, u_{t-n}^{t-1}) = \sum_{u_t \in \mathcal{U}_t} \nu_t^*(u_t|u_{t-n}^{t-1}) \exp\big\{-\rho_t^*(x_t, u_{t-n}^{t})\big\} \quad (23d)$$

$$q_t^*(u_t|x_t, u_{t-n}^{t-1}) = \frac{\nu_t^*(u_t|u_{t-n}^{t-1}) \exp\big\{-\rho_t^*(x_t, u_{t-n}^{t})\big\}}{\phi_t^*(x_t, u_{t-n}^{t-1})}, \quad \forall (x_t, u_{t-n}^{t-1}) \text{ such that } \mu_t(x_t, u_{t-n}^{t-1}) > 0 \quad (23e)$$

Algorithm 1: Forward-Backward Arimoto-Blahut Algorithm

Initialize:
  $q_t^{(0)}(u_t|x_t, u_{t-n}^{t-1}) > 0$ for $t = 1, 2, \dots, T$; (24)
  $\phi_{T+1}^{(k)}(x_{T+1}, u_{T+1-n}^{T}) \triangleq \exp\{-c_{T+1}(x_{T+1})\}$ for $k = 1, 2, \dots, K$;

for $k = 1, 2, \dots, K$ (until convergence) do
  // Forward path
  for $t = 1, 2, \dots, T$ do
    $$\mu_{t+1}^{(k)}(x_{t+1}, u_{t-n+1}^{t}) = \sum_{x_t \in \mathcal{X}_t} \sum_{u_{t-n} \in \mathcal{U}_{t-n}} p_{t+1}(x_{t+1}|x_t, u_t)\, q_t^{(k-1)}(u_t|x_t, u_{t-n}^{t-1})\, \mu_t^{(k)}(x_t, u_{t-n}^{t-1}); \quad (25a)$$
    $$\nu_t^{(k)}(u_t|u_{t-n}^{t-1}) = \sum_{x_t \in \mathcal{X}_t} q_t^{(k-1)}(u_t|x_t, u_{t-n}^{t-1})\, \mu_t^{(k)}(x_t|u_{t-n}^{t-1}); \quad (25b)$$
  // Backward path
  for $t = T, T-1, \dots, 1$ do
    $$\rho_t^{(k)}(x_t, u_{t-n}^{t}) = c_t(x_t, u_t) - \sum_{x_{t+1} \in \mathcal{X}_{t+1}} p_{t+1}(x_{t+1}|x_t, u_t) \log \phi_{t+1}^{(k)}(x_{t+1}, u_{t-n+1}^{t}); \quad (26a)$$
    $$\phi_t^{(k)}(x_t, u_{t-n}^{t-1}) = \sum_{u_t \in \mathcal{U}_t} \nu_t^{(k)}(u_t|u_{t-n}^{t-1}) \exp\big\{-\rho_t^{(k)}(x_t, u_{t-n}^{t})\big\}; \quad (26b)$$
    $$q_t^{(k)}(u_t|x_t, u_{t-n}^{t-1}) = \frac{\nu_t^{(k)}(u_t|u_{t-n}^{t-1}) \exp\big\{-\rho_t^{(k)}(x_t, u_{t-n}^{t})\big\}}{\phi_t^{(k)}(x_t, u_{t-n}^{t-1})}; \quad (26c)$$

Return $q_t^{(K)}(u_t|x_t, u_{t-n}^{t-1})$;
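The following is a minimal runnable sketch of Algorithm 1 for the memoryless case $n = 0$, in which $\mu_t$, $\nu_t$, $\rho_t$, and $\phi_t$ reduce to arrays indexed by $x_t$ and $u_t$; the array conventions are our own. As in (23), the stage costs are taken at $\beta = 1$; a general $\beta > 0$ can be handled by scaling the costs by $1/\beta$.

```python
import numpy as np

def forward_backward_ab(mu1, p, c, c_term, K=200):
    """Algorithm 1 specialized to n = 0 (policies q_t(u_t|x_t)).
    Conventions (ours): mu1[x]; p[t][x, u, y] = p_{t+1}(y|x,u);
    c[t][x, u] = c_t(x, u); c_term[x] = c_{T+1}(x)."""
    T = len(c)
    X, U = c[0].shape
    q = [np.full((X, U), 1.0 / U) for _ in range(T)]        # (24): q^{(0)} > 0
    log_phi_terminal = -c_term                              # phi_{T+1} = exp{-c_{T+1}}
    for _ in range(K):
        # Forward path: propagate mu_t and nu_t under q^{(k-1)}, cf. (25)
        mu, nu = [mu1], []
        for t in range(T):
            joint = mu[t][:, None] * q[t]                   # mu_t(x_t, u_t)
            nu.append(joint.sum(axis=0))                    # (25b)
            mu.append(np.einsum('xuy,xu->y', p[t], joint))  # (25a)
        # Backward path: recompute rho, phi and the policy, cf. (26)
        log_phi = log_phi_terminal
        for t in reversed(range(T)):
            rho = c[t] - np.einsum('xuy,y->xu', p[t], log_phi)   # (26a)
            w = nu[t][None, :] * np.exp(-rho)
            phi = w.sum(axis=1, keepdims=True)                   # (26b)
            q[t] = w / phi                                       # (26c)
            log_phi = np.log(phi.ravel())
    return q
```

Applied to the toy instance of Section IV-C below, different strictly positive initializations $q^{(0)}$ can drive this iteration to different fixed points, consistent with Remark 1.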

The following remarks clarify some limitations of our main results, which require further analysis in future work.

Remark 1:

1) Since (23) is only a necessary optimality condition, a limit point may not be a local minimum of the given TERMDP. Section IV-C below presents an example in which one of the limit points is a saddle point.

2) Theorem 2 does not guarantee the existence of $\lim_{k \to \infty} q^{(k)}$ (e.g., the sequence may oscillate between distinct limit points). However, we were not able to construct such a counterexample.

3) In the classical Arimoto-Blahut algorithm, there is a known stopping criterion by which the suboptimality of the iteration is estimated effectively (e.g., [25, Fig. 3]). Currently, such a stopping criterion is not known for Algorithm 1.

The number of arithmetic operations required to perform a single forward-backward pass in Algorithm 1 is estimated as $O(T |\mathcal{X}|^2 |\mathcal{U}|^{n+1})$. Notice that it is linear in $T$, grows exponentially with $n$, and does not depend on $m$.

C. Nonconvexity

Due to the nonconvexity of the value functions in (14), the optimality condition (23) is only necessary in general. In fact, depending on the initial condition, Algorithm 1 can converge to different stationary points corresponding to local minima and saddle points of the considered TERMDP (13). To demonstrate this, consider a simple problem instance of the TERMDP (13) with $T = 2$, $n = 0$, $\mathcal{X} = \{0, 1\}$, $\mathcal{U} = \{0, 1\}$,

$$c_t(x_t, u_t) = \begin{cases} 0 & \text{if } u_t = x_t \\ 1 & \text{if } u_t \ne x_t \end{cases}$$

for $t = 1, 2$, and $c_3(x_3) \equiv 0$. The initial state distribution is assumed to be $\mu_1(x_1 = 0) = \mu_1(x_1 = 1) = 0.5$, and the state transitions are deterministic in that

$$p_{t+1}(x_{t+1}|x_t, u_t) = \begin{cases} 1 & \text{if } x_{t+1} = u_t \\ 0 & \text{if } x_{t+1} \ne u_t. \end{cases}$$

To see that this problem has multiple distinct local minima, we solve the Bellman equation (14) numerically by gridding the space of probability distributions. Specifically, at $t = 2$, the value function $V_2(\mu_2(x_2))$ is computed by solving the minimization

$$V_2(\mu_2(x_2)) = \min_{q_2(u_2|x_2)} \big\{ \mathbb{E}\, c_2(X_2, U_2) + I(X_2; U_2) \big\}.$$

Since $\mu_2(x_2)$ is an element of a one-dimensional probability simplex, it can be parameterized as $\mu_2(x_2 = 0) = \lambda$ and $\mu_2(x_2 = 1) = 1 - \lambda$ with $\lambda \in [0, 1]$. For each fixed $\lambda$, the minimization above is solved by the standard Arimoto-Blahut iteration:

$$\nu_2^{(k)}(u_2) = \sum_{x_2 \in \{0,1\}} \mu_2(x_2)\, q_2^{(k-1)}(u_2|x_2)$$
$$\phi_2^{(k)}(x_2) = \sum_{u_2 \in \{0,1\}} \nu_2^{(k)}(u_2) \exp\{-c_2(x_2, u_2)\}$$
$$q_2^{(k)}(u_2|x_2) = \frac{\nu_2^{(k)}(u_2) \exp\{-c_2(x_2, u_2)\}}{\phi_2^{(k)}(x_2)}.$$

After convergence, the value function is computed as

$$V_2(\mu_2(x_2)) = - \sum_{x_2 \in \{0,1\}} \mu_2(x_2) \log \phi_2^{(k)}(x_2).$$

Fig. 1 (Left) shows $V_2(\mu_2(x_2))$ as a function of $\lambda$; it is clearly nonconvex.
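The gridding computation just described is easy to reproduce; the sketch below is our own transcription of it and returns the curve of Fig. 1 (Left) as an array rather than a plot.

```python
import numpy as np

def V2_of_lambda(lam, iters=2000):
    """Inner minimization at t = 2, solved by the standard Arimoto-Blahut
    iteration exactly as in the text; c2(x,u) = 1{u != x}."""
    mu2 = np.array([lam, 1.0 - lam])
    c2 = np.array([[0.0, 1.0], [1.0, 0.0]])
    q2 = np.full((2, 2), 0.5)
    for _ in range(iters):
        nu2 = mu2 @ q2                            # nu_2(u_2)
        w = nu2[None, :] * np.exp(-c2)
        phi2 = w.sum(axis=1, keepdims=True)       # phi_2(x_2)
        q2 = w / phi2
    return -float(mu2 @ np.log(phi2.ravel()))     # V2 = -sum_x mu2(x) log phi2(x)

lams = np.linspace(0.0, 1.0, 101)
V2 = [V2_of_lambda(l) for l in lams]   # gridding lambda reproduces the nonconvex
                                       # curve in Fig. 1 (Left) qualitatively
```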

After $V_2(\mu_2(x_2))$ is obtained, the Bellman equation at time $t = 1$ can be evaluated as

$$V_1(\mu_1(x_1)) = \min_{q_1(u_1|x_1)} \big\{ \mathbb{E}\, c_1(X_1, U_1) + I(X_1; U_1) + V_2(\mu_2(x_2)) \big\}. \quad (27)$$

Due to the nonconvexity of $V_2(\mu_2(x_2))$, the objective function in the minimization (27) is a nonconvex function of $q_1(u_1|x_1)$. Fig. 1 (Right) shows the objective function in (27) plotted as a function of $q_1$ parameterized by $\theta_0$ and $\theta_1$:

$$q_1(u_1 = 0|x_1 = 0) = \theta_0, \quad q_1(u_1 = 1|x_1 = 0) = 1 - \theta_0,$$
$$q_1(u_1 = 0|x_1 = 1) = \theta_1, \quad q_1(u_1 = 1|x_1 = 1) = 1 - \theta_1.$$

Clearly, it is a nonconvex function, admitting two local minima (A and B) and a saddle point (C). Each of them is a fixed point of Algorithm 1.

[Fig. 1. Left: the value function $V_2$ as a function of $\lambda$. Right: contour plot of the objective function $\mathbb{E} c_1(x_1, u_1) + I(X_1; U_1) + V_2(\mu_2(x_2))$ in (27) as a function of $\theta_0$ and $\theta_1$. Stationary points A and B are local minima, whereas C is a saddle point. Two sample trajectories of the proposed forward-backward Arimoto-Blahut algorithm (Algorithm 1), started from different initial conditions, are also shown. It can be shown that A, B, and C are all fixed points of Algorithm 1.]

V. DERIVATION OF MAIN RESULTS

This section provides the technical details of the proofs of Theorems 1 and 2.

A. Preparation

The first step is to rewrite the objective function in (13) as an explicit function of $q^T$. For each $t = 1, 2, \dots, T$, let $\mu_t(x_t|u_{t-n}^{t-1})$ be the conditional distribution obtained from $\mu_t(x_t, u_{t-n}^{t-1})$ whenever $\mu_t(u_{t-n}^{t-1}) > 0$. Define the conditional distribution $\nu_t$ by

$$\nu_t(u_t|u_{t-n}^{t-1}) = \sum_{x_t \in \mathcal{X}_t} q_t(u_t|x_t, u_{t-n}^{t-1})\, \mu_t(x_t|u_{t-n}^{t-1}). \quad (28)$$

When $\mu_t(u_{t-n}^{t-1}) = 0$, $\nu_t$ is defined to be the uniform distribution on $\mathcal{U}_t$. For each $t = 1, 2, \dots, T$, we consider $\nu_t$ and $q_t$ as elements of Euclidean spaces, i.e.,

$$\nu_t(u_t|u_{t-n}^{t-1}) \in \mathbb{R}^{|\mathcal{U}_{t-n}| \times \dots \times |\mathcal{U}_t|} \quad (29a)$$
$$q_t(u_t|x_t, u_{t-n}^{t-1}) \in \mathbb{R}^{|\mathcal{U}_{t-n}| \times \dots \times |\mathcal{U}_t| \times |\mathcal{X}_t|}. \quad (29b)$$

Since $\nu_t$ and $q_t$ are conditional probability distributions, they are entry-wise nonnegative (denoted by $\nu_t \ge 0$ and $q_t \ge 0$) and satisfy

$$\sum_{u_t \in \mathcal{U}_t} \nu_t(u_t|u_{t-n}^{t-1}) = 1 \quad \forall u_{t-n}^{t-1} \in \mathcal{U}_{t-n}^{t-1} \quad (30a)$$
$$\sum_{u_t \in \mathcal{U}_t} q_t(u_t|x_t, u_{t-n}^{t-1}) = 1 \quad \forall (x_t, u_{t-n}^{t-1}) \in \mathcal{X}_t \times \mathcal{U}_{t-n}^{t-1}. \quad (30b)$$

Thus, the feasibility sets for $\nu^T$ and $q^T$ are

$$\mathcal{X}_\nu = \{\nu : \text{(29a), (30a), and } \nu_t \ge 0 \text{ for every } t = 1, 2, \dots, T\};$$
$$\mathcal{X}_q = \{q : \text{(29b), (30b), and } q_t \ge 0 \text{ for every } t = 1, 2, \dots, T\}.$$

Using $\mu_t$, $\nu_t$, and $q_t$, the stage-wise cost in (13) can be written as

$$\ell_t(\mu_t, \nu_t, q_t) \triangleq \mathbb{E}\, c_t(X_t, U_t) + I(X_t; U_t|U_{t-n}^{t-1}) = \sum_{x_t \in \mathcal{X}_t} \sum_{u_{t-n}^{t} \in \mathcal{U}_{t-n}^{t}} \mu_t(x_t, u_{t-n}^{t-1})\, q_t(u_t|x_t, u_{t-n}^{t-1}) \Big( \log \frac{q_t(u_t|x_t, u_{t-n}^{t-1})}{\nu_t(u_t|u_{t-n}^{t-1})} + c_t(x_t, u_t) \Big)$$

for $t = 1, 2, \dots, T$, and

$$\ell_{T+1}(\mu_{T+1}) \triangleq \mathbb{E}\, c_{T+1}(X_{T+1}) = \sum_{x_{T+1} \in \mathcal{X}_{T+1}} \mu_{T+1}(x_{T+1})\, c_{T+1}(x_{T+1})$$

for the final time step. Thus, the objective function in (13) is

$$f(\nu^T, q^T) = \sum_{t=1}^{T} \ell_t(\mu_t, \nu_t, q_t) + \ell_{T+1}(\mu_{T+1}). \quad (31)$$

Notice that we consider (31) as a function of $\nu^T$ and $q^T$, which turns out to be convenient in order to view Algorithm 1 as a two-block BCD algorithm. Our problem is to minimize $f(\nu^T, q^T)$ over $\mathcal{X}_\nu \times \mathcal{X}_q$ subject to the equality constraint (28). However, the following result states that (28) is automatically satisfied by a coordinate-wise minimizer $\nu^T$ of $f(\nu^T, q^T)$ for a fixed $q^T$.

Lemma 2 ([25, Theorem 4(b)]): Let $q^T \in \mathcal{X}_q$ be fixed. Then

$$\min \; f(\nu^T, q^T) \quad (32a)$$
$$\text{s.t. } \nu^T \in \mathcal{X}_\nu \quad (32b)$$

is a convex optimization problem with respect to $\nu^T$, and an optimal solution is given by (28).

Therefore, the simplified TERMDP (13) can be written as

$$\min \; f(\nu^T, q^T) \quad (33a)$$
$$\text{s.t. } \nu^T \in \mathcal{X}_\nu, \; q^T \in \mathcal{X}_q. \quad (33b)$$

(Including the equality constraint (28) in (33) does not alter the optimal solution because of Lemma 2.) Notice that if $q^{T*}$ is a local minimizer for (13), then there exists a vector $\nu^{T*} \in \mathcal{X}_\nu$ such that $(\nu^{T*}, q^{T*})$ is a local minimizer for (33). Moreover, a local minimizer $(\nu^{T*}, q^{T*})$ is necessarily a coordinate-wise local minimizer. By coordinate-wise convexity of $f(\nu^T, q^T)$ (which will be shown below), we have

$$\nu^{T*} \in \arg\min_{\nu^T \in \mathcal{X}_\nu} f(\nu^T, q^{T*}) \quad (34a)$$
$$q^{T*} \in \arg\min_{q^T \in \mathcal{X}_q} f(\nu^{T*}, q^T). \quad (34b)$$

Therefore, (34) is a necessary condition for $q^{T*}$ to be a local minimizer for the simplified TERMDP (13). In the next subsection, we further show that (34) implies the condition (23), which thus shows that (23) is a necessary optimality condition for the simplified TERMDP (13). This argument outlines the proof of Theorem 1, which will be detailed in Section V-C.

B. Analysis of Algorithm 1

For the analysis of Algorithm 1, a key observation is that it is the two-block BCD algorithm applied to (34):

$$\nu^{T(k)} = \arg\min_{\nu^T \in \mathcal{X}_\nu} f(\nu^T, q^{T(k-1)}) \quad (35a)$$
$$q^{T(k)} = \arg\min_{q^T \in \mathcal{X}_q} f(\nu^{T(k)}, q^T). \quad (35b)$$

To prove that the backward path (26) is equivalent to the coordinate-wise minimization (35b), we remark that the minimization problem (35b) can be viewed as an optimal control problem with respect to the control actions $q^T = (q_1, \dots, q_T)$. The next lemma essentially shows that the backward path (26) solves (35b) by backward dynamic programming.

Lemma 3: Suppose

$$(\nu_1^{(k)}, \dots, \nu_T^{(k)}) \in \mathcal{X}_\nu, \quad (q_1^{(k-1)}, \dots, q_{\tau-1}^{(k-1)}, q_\tau^{(k)}, q_{\tau+1}^{(k)}, \dots, q_T^{(k)}) \in \mathcal{X}_q,$$

and $\{\mu_{t+1}\}_{t=1}^{T}$ is the sequence of probability measures generated by $(q_1^{(k-1)}, \dots, q_{\tau-1}^{(k-1)}, q_\tau^{(k)}, q_{\tau+1}^{(k)}, \dots, q_T^{(k)})$ via (12). Let $\rho_\tau^{(k)}$, $\phi_\tau^{(k)}$, and $q_\tau^{(k)}$ be the parameters obtained by computing (26) backward in time for $t = T, \dots, \tau$. Then, for each $\tau = T, T-1, \dots, 1$, the following statements hold:

(a) The function

$$f(\nu_1^{(k)}, \dots, \nu_T^{(k)}, q_1^{(k-1)}, \dots, q_{\tau-1}^{(k-1)}, q_\tau, q_{\tau+1}^{(k)}, \dots, q_T^{(k)})$$

is convex in $q_\tau \ge 0$, and any global minimizer $q_\tau^\circ$ satisfies $q_\tau^\circ = q_\tau^{(k)}$ almost everywhere with respect to $\mu_\tau(x_\tau, u_{\tau-n}^{\tau-1})$.

(b) The cost-to-go function under the policy $\{q_t^{(k)}\}_{t=\tau}^{T}$ is linear in $\mu_\tau$:

$$\sum_{t=\tau}^{T} \ell_t(\mu_t, \nu_t^{(k)}, q_t^{(k)}) + \ell_{T+1}(\mu_{T+1}) = - \sum_{x_\tau \in \mathcal{X}_\tau} \sum_{u_{\tau-n}^{\tau-1} \in \mathcal{U}_{\tau-n}^{\tau-1}} \mu_\tau(x_\tau, u_{\tau-n}^{\tau-1}) \log \phi_\tau^{(k)}(x_\tau, u_{\tau-n}^{\tau-1}).$$

Proof: The proof is by backward induction. For the time step $T$, we have

$$f(\nu_1^{(k)}, \dots, \nu_T^{(k)}, q_1^{(k-1)}, \dots, q_{T-1}^{(k-1)}, q_T) = \sum_{x_T \in \mathcal{X}_T} \sum_{u_{T-n}^{T} \in \mathcal{U}_{T-n}^{T}} \mu_T(x_T, u_{T-n}^{T-1})\, q_T(u_T|x_T, u_{T-n}^{T-1}) \Big( \log \frac{q_T(u_T|x_T, u_{T-n}^{T-1})}{\nu_T^{(k)}(u_T|u_{T-n}^{T-1})} + \rho_T^{(k)}(x_T, u_{T-n}^{T}) \Big) + \text{const.}, \quad (36)$$

where $\mu_T$, $\nu_T^{(k)}$, $\rho_T^{(k)}$, and "const." do not depend on $q_T$. Convexity of (36) in $q_T$ is clear, as $q_T \log q_T$ is a convex function of $q_T$. Moreover, the minimization of (36) in terms of $q_T$ has the same structure as the minimization problem (17) in terms of $q(u|x)$. Therefore, it follows from Proposition 3 that the minimizer $q_T^\circ$ for (36) is given by $q_T^\circ = q_T^{(k)}$. This establishes (a) for the time step $\tau = T$. The statement (b) for $\tau = T$ can be shown directly by substituting the expression of $q_t^{(k)}$ given by (26c) with $t = T$ into (36):

$$\ell_T(\mu_T, \nu_T^{(k)}, q_T^{(k)}) + \ell_{T+1}(\mu_{T+1}) = \sum_{x_T \in \mathcal{X}_T} \sum_{u_{T-n}^{T} \in \mathcal{U}_{T-n}^{T}} \mu_T(x_T, u_{T-n}^{T-1})\, q_T^{(k)}(u_T|x_T, u_{T-n}^{T-1}) \Big( \log \frac{q_T^{(k)}(u_T|x_T, u_{T-n}^{T-1})}{\nu_T^{(k)}(u_T|u_{T-n}^{T-1})} + \rho_T^{(k)}(x_T, u_{T-n}^{T}) \Big) \quad (37a)$$
$$= \sum_{x_T \in \mathcal{X}_T} \sum_{u_{T-n}^{T} \in \mathcal{U}_{T-n}^{T}} \mu_T(x_T, u_{T-n}^{T-1})\, q_T^{(k)}(u_T|x_T, u_{T-n}^{T-1}) \big( -\log \phi_T^{(k)}(x_T, u_{T-n}^{T-1}) \big) \quad (37b)$$
$$= - \sum_{x_T \in \mathcal{X}_T} \sum_{u_{T-n}^{T-1} \in \mathcal{U}_{T-n}^{T-1}} \mu_T(x_T, u_{T-n}^{T-1}) \log \phi_T^{(k)}(x_T, u_{T-n}^{T-1}) \underbrace{\sum_{u_T \in \mathcal{U}_T} q_T^{(k)}(u_T|x_T, u_{T-n}^{T-1})}_{=1}. \quad (37c)$$

To complete the proof, we show that if (b) holds for the time step $\tau + 1$, then both (a) and (b) hold for the time step $\tau$. Since (b) is hypothesized for $\tau + 1$, using $\rho_\tau^{(k)}$ it is possible to write

$$f(\nu_1^{(k)}, \dots, \nu_T^{(k)}, q_1^{(k-1)}, \dots, q_{\tau-1}^{(k-1)}, q_\tau, q_{\tau+1}^{(k)}, \dots, q_T^{(k)}) = \sum_{x_\tau \in \mathcal{X}_\tau} \sum_{u_{\tau-n}^{\tau} \in \mathcal{U}_{\tau-n}^{\tau}} \mu_\tau(x_\tau, u_{\tau-n}^{\tau-1})\, q_\tau(u_\tau|x_\tau, u_{\tau-n}^{\tau-1}) \Big( \log \frac{q_\tau(u_\tau|x_\tau, u_{\tau-n}^{\tau-1})}{\nu_\tau^{(k)}(u_\tau|u_{\tau-n}^{\tau-1})} + \rho_\tau^{(k)}(x_\tau, u_{\tau-n}^{\tau}) \Big) + \text{const.}, \quad (38)$$

where $\mu_\tau$, $\nu_\tau^{(k)}$, $\rho_\tau^{(k)}$, and "const." do not depend on $q_\tau$. Namely, (38) is convex in $q_\tau$. Proposition 3 is applicable once again to conclude that a minimizer coincides with $q_\tau^{(k)}$ almost everywhere with respect to $\mu_\tau(x_\tau, u_{\tau-n}^{\tau-1})$. Hence, (a) is established for the time step $\tau$. The statement (b) for $\tau$ can be shown by a direct substitution; the details are similar to (37).

C. Proof of main results

Based on the observations so far, the main theorems of this paper are established as follows.

Proof of Theorem 1: For a given $\{q_t^*\}_{t=1}^{T}$, we define $\{\mu_t^*\}_{t=1}^{T}$ via (23a), and hence (23a) is automatically satisfied. By Lemma 2, for any locally optimal solution $\{q_t^*\}_{t=1}^{T}$ to (13), there exist variables $\{\nu_t^*\}_{t=1}^{T}$ satisfying (23b) such that $\{\nu_t^*, q_t^*\}_{t=1}^{T}$ is a locally optimal solution to (33). Since local optimality implies coordinate-wise local optimality, for each $t = 1, 2, \dots, T$, $q_t^*$ is a local minimizer of

$$f(\nu_1^*, \dots, \nu_T^*, q_T^*, \dots, q_{t+1}^*, q_t, q_{t-1}^*, \dots, q_1^*). \quad (39)$$

However, Lemma 3 is applicable (with $\nu_\tau^{(k)} = \nu_\tau^*$ for $\tau = 1, \dots, T$, $q_\tau^{(k)} = q_\tau^*$ for $\tau = t+1, \dots, T$, and $q_\tau^{(k-1)} = q_\tau^*$ for $\tau = 1, \dots, t-1$) to conclude that (39) is convex in $q_t \ge 0$, and hence

$$q_t^* \in \arg\min_{q_t \ge 0} f(\nu_1^*, \dots, \nu_T^*, q_T^*, \dots, q_{t+1}^*, q_t, q_{t-1}^*, \dots, q_1^*).$$

Moreover, Lemma 3 (a) also implies that if the parameters $\{\rho_\tau^*, \phi_\tau^*\}_{\tau=t}^{T}$ are calculated by (23c)–(23d) backward in time, then any global minimizer

$$q_t^\circ \in \arg\min_{q_t \ge 0} f(\nu_1^*, \dots, \nu_T^*, q_T^*, \dots, q_{t+1}^*, q_t, q_{t-1}^*, \dots, q_1^*)$$

satisfies

$$q_t^\circ = \frac{\nu_t^*(u_t|u_{t-n}^{t-1}) \exp\{-\rho_t^*(x_t, u_{t-n}^{t})\}}{\phi_t^*(x_t, u_{t-n}^{t-1})}$$

$\mu_t$-almost everywhere. Hence (23e) must hold.

Proof of Theorem 2: The fact that the sequence $q^{(k)}$ generated by Algorithm 1 has a limit point follows from the fact that $\mathcal{X}_q$ is a compact set. To show that every limit point of the sequence $(\nu^{(k)}, q^{(k)})$ generated by Algorithm 1 is a stationary point for (33), observe that

(a) the update rule (25b) is equivalent to (35a); and
(b) the update rule (26c) is equivalent to (35b).

The fact (a) follows from Lemma 2, and (b) follows from Lemma 3. Therefore, Algorithm 1 is equivalent to the two-block BCD algorithm, to which Lemma 1 is applicable.

Finally, we claim that every stationary point for (33) satisfies (23). To see this, notice that every stationary point $(\nu^{T*}, q^{T*})$ is a coordinate-wise stationary point. Since $f(\nu^T, q^T)$ is coordinate-wise convex, $(\nu^{T*}, q^{T*})$ is a coordinate-wise minimizer, i.e., (34) holds. By Lemma 2, conditions (23a) and (23b) are necessary for (34a). By Lemma 3 (with $\nu_\tau^{(k)} = \nu_\tau^*$ for $\tau = 1, \dots, T$, $q_\tau^{(k)} = q_\tau^*$ for $\tau = t+1, \dots, T$, and $q_\tau^{(k-1)} = q_\tau^*$ for $\tau = 1, \dots, t-1$), conditions (23c)–(23e) are necessary for (34b).

VI. INTERPRETATIONS

In this section, we discuss two applications of TERMDP in engineering and scientific contexts in which transfer entropy plays a central role.

A. Networked Control Systems

The first application is the analysis of networked control systems, where the sensor data is transmitted to the controller over a rate-limited communication channel. Fig. 2 shows a discrete-time, finite-horizon MDP setup in which a decision policy must be realized by a joint design of encoder and decoder, together with an appropriate codebook for a discrete noiseless channel. Most generally, assume that the encoder is a stochastic kernel $e_t(w_t|x^t, w^{t-1})$ and the decoder is a stochastic kernel $d_t(u_t|w^t, u^{t-1})$. At each time step, a codeword $w_t$ is chosen from a codebook $\mathcal{W}_t$ such that $|\mathcal{W}_t| = 2^{R_t}$. We refer to $R = \sum_{t=1}^{T} R_t$ as the rate of communication. The next proposition claims that the rate of communication in Fig. 2 is fundamentally lower bounded by the directed information.

Proposition 4: Let the encoder and the decoder be any stochastic kernels of the form $e_t(w_t|x^t, w^{t-1})$ and $d_t(u_t|w^t, u^{t-1})$. Then $R \log 2 \ge I(X^T \to U^T)$.

Proof: Note that

$$R \log 2 = \sum_{t=1}^{T} R_t \log 2 \ge \sum_{t=1}^{T} H(W_t) \ge \sum_{t=1}^{T} H(W_t|W^{t-1}, U^{t-1}) \ge \sum_{t=1}^{T} \big( H(W_t|W^{t-1}, U^{t-1}) - H(W_t|X^t, W^{t-1}, U^{t-1}) \big) = \sum_{t=1}^{T} I(X^t; W_t|W^{t-1}, U^{t-1}) \triangleq I(X^T \to W^T \| U^{T-1}).$$

The first inequality is due to the fact that the entropy of a discrete random variable cannot be greater than its log-cardinality. Notice that a factor $\log 2$ appears since we are using the natural logarithm in this paper. The second inequality holds because conditioning reduces entropy. The third inequality follows since entropy is nonnegative. The last quantity is known as the causally conditioned directed information [37]. The feedback data-processing inequality [24],

$$I(X^T \to U^T) \le I(X^T \to W^T \| U^{T-1}),$$

is applicable to complete the proof.

[Fig. 2. MDP over a discrete noiseless channel: the state transition $p_t(x_{t+1}|x_t, u_t)$ is observed by an encoder, which transmits a codeword $W_t$ over a discrete noiseless channel with alphabet size $2^{R_t}$ to a decoder that produces the control input $U_t$.]

Proposition 4 provides a fundamental performance limitation of a communication system when both the encoder and the decoder have full memories of the past. However, it is also meaningful to consider restricted scenarios in which the encoder and decoder have limited memories. For instance:

(A) The encoder stochastic kernel is of the form $e_t(w_t|x_{t-m}^t)$ and the decoder stochastic kernel is of the form $d_t(u_t|w_t, u_{t-n}^{t-1})$; or

(B) The encoder stochastic kernel is $e_t(w_t|x_{t-m}^t, u_{t-n}^{t-1})$ and the decoder is a deterministic function $u_t = d_t(w_t)$. The encoder has access to the past control inputs $u_{t-n}^{t-1}$ since they are predictable from the past codewords $w_{t-n}^{t-1}$, the decoder being a deterministic map.

The next proposition shows that the transfer entropy of degree $(m, n)$ provides a tighter lower bound in these cases.

Proposition 5: Suppose that the encoder and the decoder have the structures specified by (A) or (B) above. Then

$$R \log 2 \ge I_{m,n}(X^T \to U^T).$$

Proof: See Appendix D.

By solving the TERMDP (7) with different $\beta \ge 0$, one can draw a trade-off curve between $J(X^{T+1}, U^T)$ and $I(X^T \to U^T)$. Proposition 5 means that this trade-off curve shows a fundamental limitation of the achievable control performance under a given data rate.

The tightness of the lower bounds provided by Propositions 4 and 5 (i.e., whether it is possible to construct an encoder-decoder pair such that the data rate matches its lower bound while satisfying the desired control performance) is the natural next question. In the LQG control setup, this question has been studied in [38]–[41]. In these references, it is shown that the conservativeness of the lower bound provided by Proposition 4 is no greater than a small constant. Beyond the LQG setting, the question is currently wide open.

B. Maxwell's demon

Maxwell's demon is a physical device that can seemingly violate the second law of thermodynamics, and it has become a prototypical thought experiment connecting statistical physics and information theory [6]. One of the simplest forms of Maxwell's demon is a device called the Szilard engine. Below, we introduce an application of TERMDP to the analysis of the efficiency of a generalized Szilard engine extracting work at a nonzero rate (in contrast to the common assumption that the engine is operated infinitely slowly).

Consider a single-molecule gas trapped in a box ("engine") that is immersed in a thermal bath of temperature $T_0$ (Fig. 3). The state of the engine at time $t$ is represented by the position and the velocity of the molecule, which is denoted by $X_t \in \mathcal{X}$. Assume that the state space is divided into finitely many cells so that $\mathcal{X}$ is a finite set. Also, assume that the evolution of $X_t$ is described by a discrete-time random process.

At each time step $t = 0, 1, \dots, T-1$, suppose that one of the following three possible control actions $U_t$ can be applied: (i) insert a weightless barrier into the middle of the engine box and move it to the left at a constant velocity $v$ for a unit time, (ii) insert a barrier into the middle of the box and move it to the right at the velocity $v$ for a unit time, or (iii) do nothing. At the end of the control action, the barrier is removed from the engine. We assume that the insertion and removal of the barrier are frictionless and as such do not consume any work. The sequence of operations is depicted in Fig. 3. Denote by $p(x_{t+1}|x_t, u_t)$ the transition probability from the state $x_t$ to another state $x_{t+1}$ when the control action $u_t$ is applied. By $\mathbb{E}\, c(X_t, U_t)$ we denote the expected work required to apply the control action $u_t$ at time $t$ when the state of the engine is $x_t$.⁴ This quantity is negative if the controller is expected to extract work from the engine. Work extraction occurs when the gas molecule collides with the barrier and "pushes" it in the direction of its movement.

⁴Here, we do not provide a detailed model of the function $c(x_t, u_t)$. See, for instance, [42] for a model of work extraction based on the Langevin equation.

Right before applying a control action $U_t$, suppose that the controller makes a (possibly noisy) observation of the engine state, and thus there is an information flow from $X^t$ to $U^t$. For our discussion, there is no need to describe what kind of sensing mechanism is involved in this step. However, notice that if an error-free observation of the engine state $X_t$ is performed, then the controller can choose a control action such that $\mathbb{E}\, c(X_t, U_t)$ is always non-positive. (Consider moving the barrier always in the opposite direction from the position of the gas molecule.) At first glance, this seems to imply that one can construct a device that is expected to cyclically extract work from a single thermal bath, which is a contradiction to the Kelvin-Planck statement of the second law of thermodynamics.
a contradiction to the Kelvin-Planck statement of the second The tightness of the lower bounds provided by Proposi- law of thermodynamics. tions4 and5 (i.e., whether it is possible to construct an It is now widely recognized that this paradox (Maxwell’s encoder-decoder pair such that the data rate matches its lower demon) can be resolved by including the “memory” of the bound while satisfying the desired control performance) is the controller into the picture. Recently, a generalized second law natural next question. In the LQG control setup, this question is proposed by [10], in which transfer entropy plays a critical has been studied in [38]–[41]. In these references, it is shown role. Viewing the combined engine and memory system as that the conservativeness of the lower bound provided by a Bayesian network comprised of Xt and Ut (see [10] for Proposition4 is no greater than a small constant. Beyond the details), and assuming that the free energy change of the LQG setting, the question is currently wide open. engine from t = 0 to t = T is zero (which is the case when the above sequence of operations are repeated in a cyclic manner B. Maxwell’s demon with period T ), the generalized second law [10, equation (10)] Maxwell’s demon is a physical device that can seemingly reads violate the second law of thermodynamics, which turns out to T −1 be a prototypical thought-experiment that connects statistical X T −1 T −1 Ec(Xt,Ut) + kBT0I(X0 → U0 ) ≥ 0 (40) physics and information theory [6]. One of the simplest t=0 forms of Maxwell’s demon is a device called the Szilard where k [J/K] is the Boltzmann constant. The above in- engine. Below, we introduce an application of TERMDP to B equality shows that a positive amount of work is extractable the analysis of the efficiency of a generalized Szilard engine (i.e., the first term can be negative), but this is possible only extracting work at a non-zero rate (in contrast to the common at the expense of the transfer entropy cost (the second term assumption that the engine is operated infinitely slowly). Consider a single-molecule gas trapped in a box (“engine”) 4 Here, we do not provide a detailed model of the function c(xt, ut). See for that is immersed in a thermal bath of temperature T0 (Fig.3). instance [42] for a model of work extraction based on the Langevin equation. 11

At each time step t = 1, 2, ..., T , the state-dependent cost is defined by ct(xt, ut) = 0 if xt is already the target cell indicated by “G” in Fig.4, and c (x , u ) = 1 otherwise. 𝑣𝑣 t t t The terminal cost is 0 if xT +1 = G and 10000 otherwise. 0 T T 𝑇𝑇 We consider transfer entropy Im,n(X → U ) to quantify the price of information that Bob must acquire about Alice’s 𝑣𝑣 location. With some nonnegative weight β, the overall control problem can be written as (7). As shown in Fig.4, there are two qualitatively different (a) (b) (c) paths from the origin to the target. The path A is shorter than the path B, and hence Bob will try to navigate Alice along path ...... A when no information-theoretic cost is considered (i.e., β = 0 1 + 1 1 0). However, navigating Alice along the path A requires Bob to have accurate information about Alice’s current location, as Fig. 3. Modified Szilard engine.𝑡𝑡 The controller𝑡𝑡 performs𝑇𝑇 −the following𝑇𝑇 steps this path is “risky” (there are many side roads with dead ends). in a unit time. (a) The controller makes (a possibly noisy) observation of the The path B is longer, but navigating through it is simpler; state Xt of the engine. (b) One of the three possible control actions Ut (move the barrier to the left or to the right, or do nothing) is applied. (c) At the end rough knowledge about Alice’s location is sufficient to provide of control action, the barrier is removed. correct instructions. Hence, it is expected that Bob would try to navigate Alice through A when information is relatively cheap (β is small), while he would choose B when information is 5 must be positive). Given a fundamental law (40), a natural expensive (β is large). question is how efficient the considered thermal engine can be Fig.6 shows the solutions to the considered problem. t t−1 by optimally designing a control policy q(ut|x , u ). This Solutions are obtained by iterating Algorithm1 sufficiently can be analyzed by minimizing the left hand side of (40), many times in four different conditions. Each plot shows a which is precisely the TERMDP problem (7). snapshot of the state probability distribution µt(xt) at time t = 25. Fig.6 (a) is obtained under the setting that the cost VII.NUMERICALEXPERIMENT of information is high (β = 10), the planning horizon is long (T = 55), and the transfer entropy of degree (m, n) = (0, 0) In this section, we apply the proposed forward-backward is considered. Accordingly, a decision policy of the form Arimoto-Blahut algorithm (Algorithm1) to study how the of qt(ut|xt) is considered. It can be seen that with high price of information affects the level of information-frugality, probability, the agent is navigated through the longer path. which yields qualitatively different decision policies. In Fig.6 (b), the cost of information is reduced ( β = 1) Consider a situation in which Alice, whose movements while the other settings are kept the same. As expected, the are described by Markovian dynamics controlled by Bob, solution chooses the shorter path. Fig.5 shows the time- is traveling through a maze shown in Fig.4. Suppose that dependent information usage in (a) and (b); it shows that the Bob knows the geometry of the maze (including start and total information usage is greater in situation (b) than in (a). goal locations), but observing Alice’s location is costly. 
We We note that this simulation result is consistent with a model this problem as an MDP where the state Xt is the cell prior work [20], where similar numerical experiments were where Alice is located at time step t, and Ut is a navigation conducted. Using Algorithm1, we can further investigate the instruction given by Bob. The observation cost is characterized nature of the problem. Fig.6 (c) considers the same setting by the transfer entropy. We assume five different instructions as in (a) except that the planning horizon is shorter (T = 45). are possible; u = N, E, S, W and R, corresponding to go This result shows that the solution becomes qualitatively north, go east, go south, go west, and rest. The initial state is different depending on how close the deadline is even if the the cell indicated by “S” in Fig.4, and the motion of Alice is cost of information is the same. Finally, Fig.6 (d) considers described by a transition probability p(xt+1|xt, ut). the case where the transfer entropy has degree (m, n) = (0, 1) The transition probability is defined by the following rules. and the decision policy is of the form of qt(ut|xt, ut−1). At each cell, a transition to the indicated direction occurs Although the rest of simulation parameters are unchanged w.p. 0.8 if there is no wall in the indicated direction, while from (a), we observe that the shorter path is chosen in this transitions to any open directions (directions without walls) case. This result demonstrates that the solution to (7) can be occurs w.p. 0.05 each. With the remaining probability, Alice qualitatively different depending on the considered degree of stays in the same cell. If there is a wall in the indicated transfer entropy costs. direction, or u = R, then transition to each open direction occurs w.p. 0.05, while Alice stays in the same cell with the VIII.SUMMARY AND FUTURE WORK remaining probability. In this paper, we considered a mathematical framework of transfer-entropy-regularized Markov Decision Process (TER- 5The consistency with the classical second law is maintained if one accepts MDP), which is motivated both in engineering (networked Landauer’s principle, which asserts that erasure of one bit of information from control systems) and scientific (non-equilibrium thermody- any sort of memory device in an environment at temperature T0 [K] requires at least kB T0 log 2 [J] of work. See [43] for the further discussions. namics) contexts. We derived structural properties of the 12

At each time step $t = 1, 2, \dots, T$, the state-dependent cost is defined by $c_t(x_t, u_t) = 0$ if $x_t$ is already the target cell indicated by "G" in Fig. 4, and $c_t(x_t, u_t) = 1$ otherwise. The terminal cost is 0 if $x_{T+1} = \mathrm{G}$ and 10000 otherwise. We consider the transfer entropy $I_{m,n}(X^T \to U^T)$ to quantify the price of the information that Bob must acquire about Alice's location. With some nonnegative weight $\beta$, the overall control problem can be written as (7).

As shown in Fig. 4, there are two qualitatively different paths from the origin to the target. Path A is shorter than path B, and hence Bob will try to navigate Alice along path A when no information-theoretic cost is considered (i.e., $\beta = 0$). However, navigating Alice along path A requires Bob to have accurate information about Alice's current location, as this path is "risky" (there are many side roads with dead ends). Path B is longer, but navigating through it is simpler; rough knowledge of Alice's location is sufficient to provide correct instructions. Hence, it is expected that Bob will try to navigate Alice through A when information is relatively cheap ($\beta$ is small), while he will choose B when information is expensive ($\beta$ is large).

Fig. 6 shows the solutions to the considered problem. The solutions are obtained by iterating Algorithm 1 sufficiently many times in four different conditions. Each plot shows a snapshot of the state probability distribution $\mu_t(x_t)$ at time $t = 25$. Fig. 6 (a) is obtained under the setting that the cost of information is high ($\beta = 10$), the planning horizon is long ($T = 55$), and the transfer entropy of degree $(m, n) = (0, 0)$ is considered. Accordingly, a decision policy of the form $q_t(u_t|x_t)$ is considered. It can be seen that, with high probability, the agent is navigated through the longer path. In Fig. 6 (b), the cost of information is reduced ($\beta = 1$) while the other settings are kept the same. As expected, the solution chooses the shorter path. Fig. 5 shows the time-dependent information usage in (a) and (b); it shows that the total information usage is greater in situation (b) than in (a).

We note that this simulation result is consistent with prior work [20], where similar numerical experiments were conducted. Using Algorithm 1, we can further investigate the nature of the problem. Fig. 6 (c) considers the same setting as in (a) except that the planning horizon is shorter ($T = 45$). This result shows that the solution becomes qualitatively different depending on how close the deadline is, even if the cost of information is the same. Finally, Fig. 6 (d) considers the case where the transfer entropy has degree $(m, n) = (0, 1)$ and the decision policy is of the form $q_t(u_t|x_t, u_{t-1})$. Although the rest of the simulation parameters are unchanged from (a), we observe that the shorter path is chosen in this case. This result demonstrates that the solution to (7) can be qualitatively different depending on the considered degree of the transfer entropy cost.

[Fig. 4. Information-regularized optimal navigation through a maze, with start cell "S", goal cell "G", and two qualitatively different paths A and B.]

[Fig. 5. Information usage by the policies (a) and (b) in Fig. 6: mutual information $I(X_t; U_t)$ [nats] plotted against time.]

[Fig. 6. State probability distribution $\mu_t(x_t)$ at $t = 25$ for (a) $\beta = 10$, $T = 55$, $(m,n) = (0,0)$; (b) $\beta = 1$, $T = 55$, $(m,n) = (0,0)$; (c) $\beta = 10$, $T = 45$, $(m,n) = (0,0)$; (d) $\beta = 10$, $T = 55$, $(m,n) = (0,1)$.]

VIII. SUMMARY AND FUTURE WORK

In this paper, we considered the mathematical framework of transfer-entropy-regularized Markov Decision Process (TERMDP), which is motivated both in engineering (networked control systems) and scientific (non-equilibrium thermodynamics) contexts. We derived structural properties of the optimal solution, and provided a necessary optimality condition written as a set of coupled nonlinear equations. We proposed an iterative numerical algorithm (the forward-backward Arimoto-Blahut algorithm) for which every limit point of the generated sequence is guaranteed to be a stationary point of the given TERMDP. A numerical simulation (information-constrained maze navigation) demonstrated that the proposed algorithm can be used to find an optimal solution candidate.

The proposed algorithm has several limitations, as summarized in Remark 1, which must be addressed in future work. Improvement of the convergence speed of the proposed algorithm is also necessary for many applications. Finally, the roles of transfer entropy in engineering and scientific applications (including networked control systems, non-equilibrium thermodynamics, and beyond) and the implications of the TERMDP solutions studied in this paper need further investigation.

ACKNOWLEDGEMENT

The authors would like to thank the anonymous reviewers for their valuable suggestions.

APPENDIX

A. Proof of Proposition 1

Our proof is based on backward induction.

$k = T$: We prove (a) first. For a given policy $q_T(u_T|x^T, u^{T-1})$, let $\lambda_T(x^T, u^T) \triangleq q_T(u_T|x^T, u^{T-1})\, \mu_T(x^T, u^{T-1})$ be the joint distribution induced by $q_T$. Notice that

$$\lambda_T(x^T, u^{T-1}) = \mu_T(x^T, u^{T-1}) \quad (41)$$

holds by construction. Let $\lambda_T(x_T, u_{T-n}^{T})$ and $\lambda_T(x_T, u_{T-n}^{T-1})$ be marginals of $\lambda_T(x^T, u^T)$. Construct a new policy $q'_T$ as

$$q'_T(u_T|x_T, u_{T-n}^{T-1}) \triangleq \frac{\lambda_T(x_T, u_{T-n}^{T})}{\lambda_T(x_T, u_{T-n}^{T-1})} \quad (42)$$

if $\lambda_T(x_T, u_{T-n}^{T-1}) > 0$, and as an arbitrary probability distribution on $\mathcal{U}_T$ if $\lambda_T(x_T, u_{T-n}^{T-1}) = 0$. Let

$$\lambda'_T(x^T, u^T) \triangleq q'_T(u_T|x_T, u_{T-n}^{T-1})\, \mu_T(x^T, u^{T-1}) \quad (43)$$

be the joint distribution induced by $q'_T$. Then we have

$$\lambda'_T(x_T, u_{T-n}^{T}) = \lambda_T(x_T, u_{T-n}^{T}), \quad (44)$$

which can be directly verified as

$$\lambda'_T(x_T, u_{T-n}^{T}) = \sum_{x^{T-1} \in \mathcal{X}^{T-1}} \sum_{u^{T-n-1} \in \mathcal{U}^{T-n-1}} q'_T(u_T|x_T, u_{T-n}^{T-1})\, \mu_T(x^T, u^{T-1}) \quad (45a)$$
$$= \sum_{x^{T-1} \in \mathcal{X}^{T-1}} \sum_{u^{T-n-1} \in \mathcal{U}^{T-n-1}} q'_T(u_T|x_T, u_{T-n}^{T-1})\, \lambda_T(x^T, u^{T-1}) \quad (45b)$$
$$= q'_T(u_T|x_T, u_{T-n}^{T-1})\, \lambda_T(x_T, u_{T-n}^{T-1}) = \lambda_T(x_T, u_{T-n}^{T}), \quad (45c)$$

where the equality (45a) holds by the definition (43), (45b) by (41), and (45c) by the construction (42). Now the Bellman equation at $k = T$ reads

$$V_T(\mu_T(x^T, u^{T-1})) = \min_{q_T} J_T^c(\lambda_T) + J_T^I(\lambda_T),$$

where

$$J_T^c(\lambda_T) = \mathbb{E}^{\lambda_T, p_{T+1}}\big( c_T(X_T, U_T) + c_{T+1}(X_{T+1}) \big), \quad J_T^I(\lambda_T) = I_{\lambda_T}(X_{T-m}^T; U_T|U_{T-n}^{T-1}).$$

To establish (a) for $k = T$, it is sufficient to show that $J_T^c(\lambda_T) = J_T^c(\lambda'_T)$ and $J_T^I(\lambda_T) \ge J_T^I(\lambda'_T)$. The first equality holds because of (44). To see the second inequality,

$$J_T^I(\lambda_T) = I_{\lambda_T}(X_{T-m}^T; U_T|U_{T-n}^{T-1}) \quad (46a)$$
$$\ge I_{\lambda_T}(X_T; U_T|U_{T-n}^{T-1}) \quad (46b)$$
$$= I_{\lambda'_T}(X_T; U_T|U_{T-n}^{T-1}) \quad (46c)$$
$$= I_{\lambda'_T}(X_T; U_T|U_{T-n}^{T-1}) + I_{\lambda'_T}(X_{T-m}^{T-1}; U_T|X_T, U_{T-n}^{T-1}) \quad (46d)$$
$$= I_{\lambda'_T}(X_{T-m}^T; U_T|U_{T-n}^{T-1}) \quad (46e)$$
$$= J_T^I(\lambda'_T). \quad (46f)$$
(46f) t−1 T T Assumption 1: For each t = 1, 2, ..., T and (xt, ut−n) ∈ X × U t−1, we have λ(x , ut−1 ) > 0. The equality (46c) follows from (44). The second term in (46d) t t−n t t−n T −1 T −1 Although this assumption is somewhat restrictive, it is valid is zero since U is independent of X given (X ,U ) T T −m T T −n when, for instance, the transition probability p(x |x , u ) by construction of λ0 in (43). Hence the statement (a) is t t−1 t−1 T and the policy q(u |xt, ut−1) assign a nonzero probability established for k = T . t mass to each element of X and U , respectively. Denote by The statement (b) follows immediately from the above t t λ0(xT , uT ) the joint distribution induced by q0, i.e., discussion. To establish (c), notice that due to (a), we can T −1 T assume the optimal policy of the form qT (uT |xT , uT −n) 0 T T Y 0 t t−1 without loss of generality. Hence, due to (b), the Bellman λ (x , u ) = q (ut|x , u )p(xt|xt−1, ut−1) equation at k = T can be written as t=1 T t T T −1 n T −1 Y λ(xt, ut−n) VT (µT (x , u )) = min I(XT ; UT |UT −n) = t−1 p(xt|xt−1, ut−1). (48) q (u |x ,uT −1 ) T T T T −n t=1 λ(xt, ut−n) o + EcT (XT ,UT ) + EcT +1(XT +1) . We write the construction of q0 from q via the procedure above as q0 = π(q), where π : Q → Q0 is a map- T T −1 Since the last expression depends on µT (x , u ) ping from the space Q of general policies of the form T −1 only through its marginal µ (x , u ), we conclude t t−1 T 0 T T T −n {q(ut|x , u )} to the space Q of simplified policies of T T −1 t=1 that VT (µT (x , u )) is a function of the marginal the form {q(u |x , ut−1 )}T . Notice that Q0 ⊂ Q. T −1 t t t−n t=1 µT (xT , uT −n) only. This establishes (c) for k = T . We first show that π is a continuous mapping with respect k = t: Next, we assume (a), (b) and (c) hold for k = t + 1. to an appropriate metric on Q. Suppose q¯ ∈ Q and q˜ ∈ Q are To establish (a), (b) and (c) for k = t, notice that under the given policies, and let λ¯ and λ˜ be joint distributions induced induction hypothesis, we can assume without loss of generality by q¯ and q˜, respectively. We define a metric kq¯− q˜k as the `1 that policies for k = t + 1, t + 2, ..., T are of the form distance between λ¯ and λ˜: k−1 qk(uk|xk, uk−n), and that the value function at k = t + 1 t X ¯ T T ˜ T T depends only on µt+1(xt+1, ut−n+1). With a slight abuse of kq¯− q˜k , |λ(x , u ) − λ(x , u )|. (49) notation, the latter fact is written as xT ∈X T ,uT ∈U T
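The marginalize-and-condition construction (42)–(44) is easy to check numerically. The sketch below is our own illustration, not part of the original proof: the toy alphabet sizes are arbitrary, and the "remaining past" $(x^{T-1}, u^{T-n-1})$ is collapsed into a single index $w$ for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (our choice): w collapses the remaining past
# (x^{T-1}, u^{T-n-1}); x is x_T, h is u_{T-n}^{T-1}, u is u_T.
W, X, H, U = 3, 4, 2, 5

# A generic prior mu_T(x^T, u^{T-1}), represented over (w, x, h).
mu = rng.random((W, X, H))
mu /= mu.sum()

# A generic policy q_T(u_T | x^T, u^{T-1}), represented as q(u | w, x, h).
q = rng.random((W, X, H, U))
q /= q.sum(axis=-1, keepdims=True)

# Induced joint lambda_T = q_T * mu_T (definition preceding (41)).
lam = q * mu[..., None]                       # shape (W, X, H, U)

# Marginals lambda_T(x_T, u_{T-n}^T) and lambda_T(x_T, u_{T-n}^{T-1}).
lam_xhu = lam.sum(axis=0)                     # (X, H, U)
lam_xh = lam_xhu.sum(axis=-1)                 # (X, H)

# Construction (42); the uniform branch covers the zero-probability case.
q_prime = np.where(lam_xh[..., None] > 0,
                   lam_xhu / lam_xh[..., None], 1.0 / U)

# Joint induced by q'_T as in (43), and a numerical check of (44).
lam_prime = q_prime[None, ...] * mu[..., None]
assert np.allclose(lam_prime.sum(axis=0), lam_xhu)
print("marginal identity (44) verified")
```

The passing assertion reflects the cancellation in (45): multiplying the conditional (42) back by its own denominator marginal recovers $\lambda_T(x_T, u^T_{T-n})$ exactly.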

$k = t$: Next, we assume that (a), (b), and (c) hold for $k = t + 1$. To establish (a), (b), and (c) for $k = t$, notice that under the induction hypothesis we can assume without loss of generality that the policies for $k = t+1, t+2, \dots, T$ are of the form $q_k(u_k|x_k, u^{k-1}_{k-n})$, and that the value function at $k = t+1$ depends only on $\mu_{t+1}(x_{t+1}, u^t_{t-n+1})$. With a slight abuse of notation, the latter fact is written as

$$V_{t+1}(\mu_{t+1}(x^{t+1}, u^t)) = V_{t+1}(\mu_{t+1}(x_{t+1}, u^t_{t-n+1})).$$

Thus, the Bellman equation at $k = t$ can be written as

$$V_t(\mu_t(x^t, u^{t-1})) = \min_{q_t(u_t|x^t, u^{t-1})} \Big\{ \mathbb{E}\, c_t(X_t, U_t) + I(X^t_{t-m}; U_t \,|\, U^{t-1}_{t-n}) + V_{t+1}(\mu_{t+1}(x_{t+1}, u^t_{t-n+1})) \Big\}. \tag{47}$$

Now, using a construction similar to the case $k = T$, one can show that for every $q_t(u_t|x^t, u^{t-1})$ there exists a policy of the form $q_t'(u_t|x_t, u^{t-1}_{t-n})$ such that the value of the objective function on the right-hand side of (47) attained by $q_t'$ is less than or equal to the value attained by $q_t$. This observation establishes (a) for $k = t$. Statements (b) and (c) for $k = t$ follow similarly.

B. Local minima of simplified TERMDP

Let $q = \{q(u_t|x^t, u^{t-1})\}_{t=1}^T$ be a given policy. We say that the joint distribution $\lambda(x^T, u^T)$ is induced by $q$ if it is defined as

$$\lambda(x^T, u^T) = \prod_{t=1}^T q(u_t|x^t, u^{t-1})\, p(x_t|x_{t-1}, u_{t-1})$$

with the initial time convention $p(x_1|x_0, u_0) = \mu_1(x_1)$. As in the proof of Proposition 1, we construct a new policy $q' = \{q'(u_t|x_t, u^{t-1}_{t-n})\}_{t=1}^T$ from $\lambda(x^T, u^T)$ by

$$q'(u_t|x_t, u^{t-1}_{t-n}) = \frac{\lambda(x_t, u^t_{t-n})}{\lambda(x_t, u^{t-1}_{t-n})}$$

for each $t = 1, 2, \dots, T$. For simplicity of the analysis, we assume that $\lambda(x^T, u^T)$ satisfies the following condition:

Assumption 1: For each $t = 1, 2, \dots, T$ and $(x_t, u^{t-1}_{t-n}) \in \mathcal{X}_t \times \mathcal{U}^{t-1}_{t-n}$, we have $\lambda(x_t, u^{t-1}_{t-n}) > 0$.

Although this assumption is somewhat restrictive, it is valid when, for instance, the transition probability $p(x_t|x_{t-1}, u_{t-1})$ and the policy $q(u_t|x^t, u^{t-1})$ assign a nonzero probability mass to each element of $\mathcal{X}_t$ and $\mathcal{U}_t$, respectively. Denote by $\lambda'(x^T, u^T)$ the joint distribution induced by $q'$, i.e.,

$$\lambda'(x^T, u^T) = \prod_{t=1}^T q'(u_t|x_t, u^{t-1}_{t-n})\, p(x_t|x_{t-1}, u_{t-1}) = \prod_{t=1}^T \frac{\lambda(x_t, u^t_{t-n})}{\lambda(x_t, u^{t-1}_{t-n})}\, p(x_t|x_{t-1}, u_{t-1}). \tag{48}$$

We write the construction of $q'$ from $q$ via the procedure above as $q' = \pi(q)$, where $\pi: \mathcal{Q} \to \mathcal{Q}'$ is a mapping from the space $\mathcal{Q}$ of general policies of the form $\{q(u_t|x^t, u^{t-1})\}_{t=1}^T$ to the space $\mathcal{Q}'$ of simplified policies of the form $\{q(u_t|x_t, u^{t-1}_{t-n})\}_{t=1}^T$. Notice that $\mathcal{Q}' \subset \mathcal{Q}$.

We first show that $\pi$ is a continuous mapping with respect to an appropriate metric on $\mathcal{Q}$. Suppose $\bar{q} \in \mathcal{Q}$ and $\tilde{q} \in \mathcal{Q}$ are given policies, and let $\bar{\lambda}$ and $\tilde{\lambda}$ be the joint distributions induced by $\bar{q}$ and $\tilde{q}$, respectively. We define a metric $\|\bar{q} - \tilde{q}\|$ as the $\ell_1$ distance between $\bar{\lambda}$ and $\tilde{\lambda}$:

$$\|\bar{q} - \tilde{q}\| \triangleq \sum_{x^T \in \mathcal{X}^T,\, u^T \in \mathcal{U}^T} |\bar{\lambda}(x^T, u^T) - \tilde{\lambda}(x^T, u^T)|. \tag{49}$$
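On small alphabets, the induced joint $\lambda(x^T, u^T)$, and hence the metric (49), can be evaluated by brute force. The sketch below is our own illustration under assumed toy sizes; the enumeration over all trajectories is only viable for tiny instances and is not how the algorithm in the paper operates.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
nx, nu, T = 2, 2, 3                             # toy sizes (our choice)

mu1 = rng.dirichlet(np.ones(nx))                # initial distribution mu_1(x_1)
p = rng.dirichlet(np.ones(nx), size=(nx, nu))   # p[x_prev, u_prev, x_next]

def random_policy():
    # q[t] holds one distribution over u_t per history (x^t, u^{t-1}).
    return [rng.dirichlet(np.ones(nu), size=(nx,) * (t + 1) + (nu,) * t)
            for t in range(T)]

def induced_joint(q):
    # lambda(x^T, u^T): multiply policy and kernel factors along each
    # trajectory (x_1, u_1, ..., x_T, u_T), as in the product formula above.
    lam = np.zeros((nx, nu) * T)
    ranges = [range(nx) if i % 2 == 0 else range(nu) for i in range(2 * T)]
    for traj in itertools.product(*ranges):
        xs, us = traj[0::2], traj[1::2]
        prob = mu1[xs[0]]
        for t in range(T):
            if t > 0:
                prob *= p[xs[t - 1], us[t - 1], xs[t]]
            prob *= q[t][xs[:t + 1] + us[:t]][us[t]]
        lam[traj] = prob
    return lam

qbar, qtil = random_policy(), random_policy()
# The metric (49): l1 distance between the induced joint distributions.
print(np.abs(induced_joint(qbar) - induced_joint(qtil)).sum())
```

Measuring the distance between policies through their induced joints, rather than through the policy tensors directly, is what turns Claim 1 into a statement about the continuity of the rational map (48).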

Claim 1: Suppose $\bar{q} \in \mathcal{Q}$ and $\tilde{q} \in \mathcal{Q}$ satisfy Assumption 1. For every $\epsilon > 0$, there exists $\delta > 0$ such that $\|\bar{q} - \tilde{q}\| < \delta \Rightarrow \|\pi(\bar{q}) - \pi(\tilde{q})\| < \epsilon$.

Proof: Let $\bar{\lambda}'$ and $\tilde{\lambda}'$ be the joint distributions induced by $\bar{q}' = \pi(\bar{q})$ and $\tilde{q}' = \pi(\tilde{q})$, respectively. Notice that $\|\bar{q} - \tilde{q}\| < \delta$ means that $\sum_{\mathcal{X}^T, \mathcal{U}^T} |\bar{\lambda}(x^T, u^T) - \tilde{\lambda}(x^T, u^T)| < \delta$, which implies that

$$|\bar{\lambda}(x_t, u^t_{t-n}) - \tilde{\lambda}(x_t, u^t_{t-n})| < \delta \quad \text{and} \quad |\bar{\lambda}(x_t, u^{t-1}_{t-n}) - \tilde{\lambda}(x_t, u^{t-1}_{t-n})| < \delta.$$

Since (48) shows that $\lambda'(x^T, u^T)$ for each $(x^T, u^T) \in \mathcal{X}^T \times \mathcal{U}^T$ is a rational function of $\lambda(x_t, u^t_{t-n})$ and $\lambda(x_t, u^{t-1}_{t-n})$, and a rational function is continuous on the domain of strictly positive denominators, for each $\epsilon > 0$ and $(x^T, u^T) \in \mathcal{X}^T \times \mathcal{U}^T$ there exists $\delta > 0$ such that $\|\bar{q} - \tilde{q}\| < \delta$ implies

$$|\bar{\lambda}'(x^T, u^T) - \tilde{\lambda}'(x^T, u^T)| < \frac{\epsilon}{|\mathcal{X}^T||\mathcal{U}^T|},$$

which further implies

$$\|\pi(\bar{q}) - \pi(\tilde{q})\| = \|\bar{q}' - \tilde{q}'\| = \sum_{x^T \in \mathcal{X}^T,\, u^T \in \mathcal{U}^T} |\bar{\lambda}'(x^T, u^T) - \tilde{\lambda}'(x^T, u^T)| < \sum_{x^T \in \mathcal{X}^T,\, u^T \in \mathcal{U}^T} \frac{\epsilon}{|\mathcal{X}^T||\mathcal{U}^T|} = \epsilon.$$

The next claim shows that $\pi$ leaves an element of $\mathcal{Q}'$ invariant.

Claim 2: Suppose $q \in \mathcal{Q}'$ and the induced joint distribution $\lambda$ satisfies Assumption 1. Then $\pi(q) = q$.

Proof: By definition,

$$\lambda(x^t, u^t) = \prod_{k=1}^t q(u_k|x_k, u^{k-1}_{k-n})\, p(x_k|x_{k-1}, u_{k-1}).$$

Therefore,

$$\lambda(x_t, u^t_{t-n}) = \sum_{\substack{x^{t-1}\in\mathcal{X}^{t-1} \\ u^{t-n-1}\in\mathcal{U}^{t-n-1}}} \prod_{k=1}^t q(u_k|x_k, u^{k-1}_{k-n})\, p(x_k|x_{k-1}, u_{k-1}) = q(u_t|x_t, u^{t-1}_{t-n}) \sum_{x^{t-1}\in\mathcal{X}^{t-1}} p(x_t|x_{t-1}, u_{t-1}) \sum_{u^{t-n-1}\in\mathcal{U}^{t-n-1}} \prod_{k=1}^{t-1} q(u_k|x_k, u^{k-1}_{k-n})\, p(x_k|x_{k-1}, u_{k-1}) \tag{50}$$

and

$$\lambda(x_t, u^{t-1}_{t-n}) = \sum_{u_t \in \mathcal{U}_t} \lambda(x_t, u^t_{t-n}) = \sum_{x^{t-1}\in\mathcal{X}^{t-1}} p(x_t|x_{t-1}, u_{t-1}) \sum_{u^{t-n-1}\in\mathcal{U}^{t-n-1}} \prod_{k=1}^{t-1} q(u_k|x_k, u^{k-1}_{k-n})\, p(x_k|x_{k-1}, u_{k-1}). \tag{51}$$

Let $q' = \pi(q)$. Then, from (50) and (51), we have

$$q'(u_t|x_t, u^{t-1}_{t-n}) = \frac{\lambda(x_t, u^t_{t-n})}{\lambda(x_t, u^{t-1}_{t-n})} = q(u_t|x_t, u^{t-1}_{t-n}).$$

Therefore, $\pi(q) = q$.

We are now ready to prove the following proposition, which states that a local minimum of the simplified TERMDP (13) is a local minimum of the original TERMDP (7). For a given policy $q \in \mathcal{Q}$, denote by

$$f(q) = J(X^{T+1}, U^T) + \beta I_{m,n}(X^T \to U^T)$$

the value of the TERMDP (7). By Proposition 1, it also follows that if $q \in \mathcal{Q}'$, then $f(q)$ is equal to the value of the simplified TERMDP (13), i.e.,

$$f(q) = J(X^{T+1}, U^T) + \beta I_{0,n}(X^T \to U^T).$$

Proposition 6: Suppose $q^* = \{q^*(u_t|x_t, u^{t-1}_{t-n})\}_{t=1}^T \in \mathcal{Q}'$ satisfies Assumption 1. If $q^*$ is a local minimizer for the simplified TERMDP (13), i.e.,

(a) there exists $\epsilon > 0$ such that $f(q^*) \leq f(q')$ holds for all $q' \in \mathcal{Q}'$ with $\|q' - q^*\| < \epsilon$,

then $q^*$ is a local minimizer for the original TERMDP (7), i.e.,

(b) there exists $\delta > 0$ such that $f(q^*) \leq f(q)$ holds for all $q \in \mathcal{Q}$ with $\|q - q^*\| < \delta$.

Proof: Pick a constant $\epsilon > 0$ such that the condition (a) holds. By the continuity of $\pi$ (Claim 1), there exists a constant $\delta > 0$ such that

$$q \in \mathcal{Q},\ \|q - q^*\| < \delta \ \Rightarrow\ \|\pi(q) - q^*\| < \epsilon. \tag{52}$$

Here, we have used the fact that $\pi(q^*) = q^*$ (Claim 2). To complete the proof by contradiction, suppose the negation of the condition (b) holds:

(¬b) for every $\delta > 0$, there exists $\bar{q} \in \mathcal{Q}$ such that $\|\bar{q} - q^*\| < \delta$ and $f(\bar{q}) < f(q^*)$.

Now, pick a policy $\bar{q} \in \mathcal{Q}$ such that $\|\bar{q} - q^*\| < \delta$ and

$$f(\bar{q}) < f(q^*). \tag{53}$$

If we write $\bar{q}' \triangleq \pi(\bar{q}) \in \mathcal{Q}'$, it also follows from (52) that

$$\|\bar{q}' - q^*\| < \epsilon. \tag{54}$$

By Proposition 1, (53) implies that

$$f(\bar{q}') \leq f(\bar{q}) < f(q^*). \tag{55}$$

However, (54) and (55) contradict (a). Therefore, the condition (b) must hold.

C. Proof of Proposition 2

For each $n' \geq n$,

$$I_{0,n'}(X^T \to U^T) = \sum_{t=1}^T I(X_t; U_t \,|\, U^{t-1}_{t-n'}) = \sum_{t=1}^T \Big( H(U_t|U^{t-1}_{t-n'}) - H(U_t|X_t, U^{t-1}_{t-n'}) \Big) = \sum_{t=1}^T \Big( H(U_t|U^{t-1}_{t-n'}) - H(U_t|X_t, U^{t-1}_{t-n}) \Big).$$

In the last step, we used the fact that $H(X|Y,Z) = H(X|Y)$ holds when $X$ and $Z$ are conditionally independent given $Y$. By the structure of $q_t(u_t|x_t, u^{t-1}_{t-n})$, $U_t$ and $U^{t-n-1}_{t-n'}$ are conditionally independent given $(X_t, U^{t-1}_{t-n})$. Now,

$$I_{0,n'}(X^T \to U^T) - I_{0,n'+1}(X^T \to U^T) = \sum_{t=1}^T \Big( H(U_t|U^{t-1}_{t-n'}) - H(U_t|U^{t-1}_{t-n'-1}) \Big) \geq 0$$

since entropy never increases by conditioning.
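The final step rests on the fact that conditioning never increases entropy. As a quick numerical sanity check, the sketch below (our own illustration; the function name and toy sizes are assumptions) compares $H(U|V)$ with $H(U|V,W)$ for a random joint distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

# A random joint distribution over (U, V, W), all binary (toy sizes).
p = rng.random((2, 2, 2))
p /= p.sum()

def cond_entropy(p_joint, cond_axes):
    """Entropy of axis 0 given the axes in cond_axes, in nats."""
    keep = (0,) + tuple(cond_axes)
    drop = tuple(ax for ax in range(p_joint.ndim) if ax not in keep)
    pj = p_joint.sum(axis=drop) if drop else p_joint  # joint of (U, conditioners)
    pc = pj.sum(axis=0, keepdims=True)                # conditioners' marginal
    return -(pj * np.log(pj / pc)).sum()

h_v = cond_entropy(p, (1,))      # H(U | V)
h_vw = cond_entropy(p, (1, 2))   # H(U | V, W)
assert h_v >= h_vw - 1e-12       # conditioning never increases entropy
print(h_v, h_vw)
```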

D. Proof of Proposition 5

For each $t = 1, 2, \dots, T$, we have

$$I(X^t_{t-m}; W_t \,|\, U^{t-1}_{t-n}) = I(X^t_{t-m}; W_t, U_t \,|\, U^{t-1}_{t-n}) = I(X^t_{t-m}; U_t \,|\, U^{t-1}_{t-n}) + I(X^t_{t-m}; W_t \,|\, U^t_{t-n}) \geq I(X^t_{t-m}; U_t \,|\, U^{t-1}_{t-n}).$$

The first equality is due to the particular structure of the decoder specified by (A) or (B). Thus

$$\sum_{t=1}^T I(X^t_{t-m}; W_t \,|\, U^{t-1}_{t-n}) \geq I_{m,n}(X^T \to U^T).$$

The proof of Proposition 5 is complete by noticing the following chain of inequalities:

$$R \log 2 = \sum_{t=1}^T R_t \log 2 \geq \sum_{t=1}^T H(W_t) \geq \sum_{t=1}^T H(W_t \,|\, W^{t-1}, U^{t-1}_{t-n}) \geq \sum_{t=1}^T \Big( H(W_t|U^{t-1}_{t-n}) - H(W_t|X^t_{t-m}, U^{t-1}_{t-n}) \Big) = \sum_{t=1}^T I(X^t_{t-m}; W_t \,|\, U^{t-1}_{t-n}).$$
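Two links of this chain can be checked numerically on a toy example: the cardinality bound $R_t \log 2 \geq H(W_t)$, since $W_t$ takes at most $2^{R_t}$ values, and the data-processing step behind the first display (here with the conditioning on $U^{t-1}_{t-n}$ dropped for brevity). In the sketch below, which is our own illustration with a hypothetical decoder map `g` and arbitrary sizes, $U_t$ is a deterministic function of the codeword $W_t$, mimicking the decoder structures (A)/(B).

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy joint over (X_t, W_t): 4 source symbols, 8 codewords, i.e. R_t = 3 bits
# (sizes are our own choice for illustration).
p_xw = rng.random((4, 8))
p_xw /= p_xw.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def mutual_info(p_joint):
    px = p_joint.sum(axis=1, keepdims=True)
    py = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return (p_joint[mask] * np.log(p_joint[mask] / (px * py)[mask])).sum()

# Link 1 of the chain: R_t log 2 >= H(W_t), since W_t takes 2^{R_t} values.
assert 3 * np.log(2) >= entropy(p_xw.sum(axis=0)) - 1e-12

# A hypothetical deterministic decoder U_t = g(W_t), many-to-one.
g = np.array([0, 0, 1, 1, 2, 2, 0, 1])
p_xu = np.zeros((4, 3))
for w, u in enumerate(g):
    p_xu[:, u] += p_xw[:, w]

# Data-processing step: I(X; W) >= I(X; U) along the chain X - W - U.
assert mutual_info(p_xw) >= mutual_info(p_xu) - 1e-12
print("both inequalities hold on this toy example")
```

Both checks correspond to individual links of the displayed chain; the conditional versions used in the proof follow from the same arguments applied per realization of $U^{t-1}_{t-n}$.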