
Transfer-Entropy-Regularized Markov Decision Processes

Takashi Tanaka¹, Henrik Sandberg², Mikael Skoglund³

¹University of Texas at Austin, USA, [email protected]; ²KTH Royal Institute of Technology, Sweden, [email protected]; ³KTH Royal Institute of Technology, Sweden, [email protected].

arXiv:1708.09096v3 [math.OC] 27 May 2020

Abstract—We consider the framework of transfer-entropy-regularized Markov Decision Process (TERMDP), in which the weighted sum of the classical state-dependent cost and the transfer entropy from the state random process to the control random process is minimized. Although TERMDPs are generally formulated as nonconvex optimization problems, we derive an analytical necessary optimality condition expressed as a finite set of nonlinear equations, based on which an iterative forward-backward computational procedure similar to the Arimoto-Blahut algorithm is proposed. It is shown that every limit point of the sequence generated by the proposed algorithm is a stationary point of the TERMDP. Applications of TERMDPs are discussed in the context of networked control systems theory and non-equilibrium thermodynamics. The proposed algorithm is applied to an information-constrained maze navigation problem, whereby we study how the price of information qualitatively alters the optimal decision policies.

I. INTRODUCTION

Transfer entropy [1] is a quantity that can be understood as a measure of information flow between random processes. It is a generalization of directed information, a concept proposed in the information theory literature for the analysis of communication systems with feedback [2]–[4]. Closely related concepts include the Kullback causality measure [5], which was originally introduced in the economic statistics literature for causality analysis.¹ Recently, these concepts have been applied in a broad range of academic disciplines, including neuroscience [7], finance [8], and social science [9].

¹Sometimes (e.g., in statistical physics [6]), transfer entropy is used as a synonym for directed information. It appears that the concepts of transfer entropy [1], directed information [2], [3], and the Kullback causality measure [5] were introduced independently.

In this paper, we formulate the problem of transfer-entropy-regularized Markov Decision Process (TERMDP) and develop a numerical solution algorithm. TERMDP is an optimal control problem in which we seek a causal decision-making policy that minimizes the weighted sum of the classical state-dependent cost and the transfer entropy from the state random process to the control actions. As we will discuss in the sequel, TERMDP predicts a fundamental performance limitation of feedback control systems from an information-theoretic perspective.

The first context in which TERMDP naturally arises is networked control systems theory, where the trade-off between the best achievable control performance and the data rate at which sensor information is fed back to the controller is a central question. Prior work has shown that transfer entropy can be used as a proxy for the data rate on communication channels, and thus solving TERMDP characterizes a fundamental performance limitation of such systems. The second application of TERMDP is non-equilibrium thermodynamics. There has been renewed interest in the generalized second law of thermodynamics, in which transfer entropy arises as a key concept [10]. TERMDP in this context can be interpreted as the problem of operating thermal engines at a nonzero work rate near the fundamental limitation of the second law of thermodynamics.

In contrast to the standard MDP [11], TERMDP penalizes the information flow from the underlying state random process to the control random process. Consequently, TERMDP promotes "information-frugal" decision policies, under which control actions tend to be statistically less dependent on the underlying Markovian state dynamics. This is often a favorable property in various real-time decision-making scenarios (for both humans and robots) in which information acquisition, processing, and transmission are costly operations. Therefore, it is expected that TERMDP plays major roles in broader contexts beyond the aforementioned applications, although the interpretation of transfer entropy in each application must be carefully discussed.
In the literature, a few alternative approaches have been suggested to apply information-theoretic cost functions to capture decision-making costs in MDPs. Similarities and differences between TERMDP and the existing problem formulations are noteworthy. The rationally inattentive control problem [12], [13] has been motivated in a macroeconomic context, where Shannon's mutual information (a special case of transfer entropy) is adopted as an attention cost for decision-makers. The authors of [14]–[16] present a class of optimal control problems in which control costs are modeled as the Kullback-Leibler (KL) divergence from the "uncontrolled" state trajectories to the "controlled" state trajectories. Alternative information-theoretic decision costs in dynamic environments include predictive information [17], the past-future information bottleneck [18], and information-to-go [19], [20]. Information-theoretic bounded rationality and its analogy to thermodynamics are discussed in [21]. While intuitively plausible, some of these problem formulations lack physical (or coding-theoretic) justifications, unlike TERMDP, whose operational interpretation can be found in the aforementioned contexts.

An equivalent problem formulation to TERMDP first appeared in [22] and [23], where the problem was formulated in a general (Polish state space) setup. Linear-Quadratic-Gaussian (LQG) control with minimum directed information [24] is a version of TERMDP specialized to the LQG regime. While the problem in the LQG setup was shown to be tractable by semidefinite programming [24], algorithmic aspects of TERMDP beyond the LQG regime have not been thoroughly studied. Therefore, the primary goal of this paper is to provide an efficient computational algorithm to find a stationary point (an optimal solution candidate) of a given TERMDP. The contributions of this paper are as follows:

• We derive a necessary optimality condition expressed as a set of nonlinear equations involving a finite number of variables. This result recovers, and partly strengthens, results obtained in prior work [22].
• We propose a forward-backward iterative algorithm, which can be viewed as a generalization of the Arimoto-Blahut algorithm [25], [26], to solve the optimality condition numerically.

The proposed algorithm is the first application of the Arimoto-Blahut algorithm to transfer entropy minimization. Our algorithm should be compared with the generalized Arimoto-Blahut algorithm for transfer entropy maximization proposed in [27]. The algorithm in [27] can be viewed as a generalization of the Arimoto-Blahut "capacity algorithm" in [25], while our proposed algorithm can be viewed as a generalization of the Arimoto-Blahut "rate-distortion algorithm" in [25]. Unfortunately, we discover that the proposed algorithm may not converge to the global minimum due to the nonconvex nature of TERMDP. This result is somewhat surprising, as the global convergence of the original Arimoto-Blahut rate-distortion algorithm, which is a special case of our algorithm, is well known. Nevertheless, observing that the proposed algorithm belongs to the class of block coordinate descent (BCD) algorithms, we show that every limit point generated by the algorithm is guaranteed to be a stationary point of the given TERMDP.

Organization of the paper: The problem formulation of TERMDP is formally introduced in Section II. Mathematical preliminaries are summarized in Section III. Section IV presents the main results. Derivations of the main results are summarized in Section V. Section VI discusses applications of the TERMDP framework. A numerical demonstration of the proposed algorithm is presented in Section VII. We conclude with a list of future work in Section VIII.

Notation: Upper case symbols such as $X$ are used to represent random variables, while lower case symbols such as $x$ are used to represent specific realizations. The notation $x_k^l \triangleq (x_k, x_{k+1}, \dots, x_l)$ and $x^t \triangleq (x_1, x_2, \dots, x_t)$ is used to specify subsequences. We use the natural logarithm $\log(\cdot) = \log_e(\cdot)$ throughout the paper.

II. PROBLEM FORMULATION

We formulate TERMDP based upon the standard Markov Decision Process (MDP) formalism [11], defined by a time index $t = 1, 2, \dots, T$, state spaces $\mathcal{X}_t$, action spaces $\mathcal{U}_t$, transition probabilities $p_{t+1}(x_{t+1}|x_t, u_t)$, and cost functions $c_t : \mathcal{X}_t \times \mathcal{U}_t \to \mathbb{R}$ for each $t = 1, 2, \dots, T$ and $c_{T+1} : \mathcal{X}_{T+1} \to \mathbb{R}$. For simplicity, we assume that both $\mathcal{X}_t$ and $\mathcal{U}_t$ are finite. The decision policy to be synthesized can be probabilistic and history-dependent in general, and is represented by a conditional probability distribution

$$q_t(u_t \mid x^t, u^{t-1}). \quad (1)$$

The joint distribution of the state and control trajectories is denoted by $\mu_{t+1}(x^{t+1}, u^t)$, which is uniquely determined by the initial state distribution $\mu_1(x_1)$, the state transition probability $p_{t+1}(x_{t+1}|x_t, u_t)$, and the decision policy $q_t(u_t|x^t, u^{t-1})$ by the recursive formula

$$\mu_{t+1}(x^{t+1}, u^t) = p_{t+1}(x_{t+1}|x_t, u_t)\, q_t(u_t|x^t, u^{t-1})\, \mu_t(x^t, u^{t-1}). \quad (2)$$

A stage-additive cost functional

$$J(X^{T+1}, U^T) \triangleq \sum_{t=1}^{T} \mathbb{E}\, c_t(X_t, U_t) + \mathbb{E}\, c_{T+1}(X_{T+1}) \quad (3)$$

is a function of the random variables $X^{T+1}$ and $U^T$ with joint distribution $\mu_{T+1}(x^{T+1}, u^T)$. Transfer entropy is an information-theoretic quantity defined as follows:

Definition 1: For nonnegative integers $m$ and $n$, the transfer entropy of degree $(m, n)$ is defined by

$$I_{m,n}(X^T \to U^T) \triangleq \sum_{t=1}^{T} \mathbb{E} \log \frac{\mu_{t+1}(U_t \mid X_{t-m}^t, U_{t-n}^{t-1})}{\mu_{t+1}(U_t \mid U_{t-n}^{t-1})} = \sum_{t=1}^{T} \sum_{x^t \in \mathcal{X}^t,\, u^t \in \mathcal{U}^t} \mu_{t+1}(x^t, u^t) \log \frac{\mu_{t+1}(u_t \mid x_{t-m}^t, u_{t-n}^{t-1})}{\mu_{t+1}(u_t \mid u_{t-n}^{t-1})}. \quad (4)$$

Using conditional mutual information [28], transfer entropy can also be written as

$$I_{m,n}(X^T \to U^T) = \sum_{t=1}^{T} I(X_{t-m}^t;\, U_t \mid U_{t-n}^{t-1}). \quad (5)$$

When $m = \infty$ and $n = \infty$, (4) coincides with directed information [3]:

$$I(X^T \to U^T) \triangleq \sum_{t=1}^{T} I(X^t;\, U_t \mid U^{t-1}). \quad (6)$$

The main problem studied in this paper is now formulated as follows.

Problem 1 (TERMDP): Let the initial state distribution $\mu_1(x_1)$ and the state transition probability $p_{t+1}(x_{t+1}|x_t, u_t)$ be given, and assume that the joint distribution $\mu_{t+1}(x^{t+1}, u^t)$ is recursively given by (2). For a fixed constant $\beta \ge 0$, the transfer-entropy-regularized Markov decision process is the optimization problem

$$\min_{\{q_t(u_t|x^t, u^{t-1})\}_{t=1}^{T}} \; J(X^{T+1}, U^T) + \beta\, I_{m,n}(X^T \to U^T). \quad (7)$$
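To make the objective in (7) concrete, the following is a minimal numerical sketch of how $J + \beta I_{m,n}$ can be evaluated for the special case $(m, n) = (0, 0)$ with a memoryless policy $q_t(u_t|x_t)$, in which case the transfer entropy reduces to $\sum_t I(X_t; U_t)$. The function name and array conventions are our own, introduced only for illustration.

```python
import numpy as np

def termdp_objective(mu1, p, q, c, c_term, beta):
    """Evaluate J + beta * I_{0,0} in (7) for a memoryless policy.
    Conventions (ours): mu1[x] = mu_1(x_1); p[t][x, u, y] = p_{t+1}(y | x, u);
    q[t][x, u] = q_t(u | x); c[t][x, u] = c_t(x, u); c_term[x] = c_{T+1}(x)."""
    mu, J, I = mu1.copy(), 0.0, 0.0
    for t in range(len(q)):
        joint = mu[:, None] * q[t]                   # mu_t(x_t, u_t)
        nu = joint.sum(axis=0)                       # marginal nu_t(u_t)
        J += float((joint * c[t]).sum())             # E c_t(X_t, U_t)
        m = joint > 0                                # I(X_t; U_t), cf. (4) with (m,n)=(0,0)
        nu2d = np.broadcast_to(nu, q[t].shape)
        I += float((joint[m] * np.log(q[t][m] / nu2d[m])).sum())
        mu = np.einsum('xuy,xu->y', p[t], joint)     # state marginal of recursion (2)
    J += float(mu @ c_term)                          # terminal cost E c_{T+1}(X_{T+1})
    return J + beta * I

# Example with T = 2 and binary state/action spaces (random instance):
rng = np.random.default_rng(0)
T, X, U = 2, 2, 2
p = [rng.dirichlet(np.ones(X), size=(X, U)) for _ in range(T)]
q = [np.full((X, U), 1.0 / U) for _ in range(T)]
c = [rng.random((X, U)) for _ in range(T)]
print(termdp_objective(np.array([0.5, 0.5]), p, q, c, np.zeros(X), beta=1.0))
```

Note that under the uniform policy used above, the control is independent of the state, so the mutual-information term vanishes; this is the extreme case of the "information-frugal" behavior promoted by (7).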
A few remarks are in order regarding this problem formulation.

First, the transfer entropy term in (7) is interpreted as an additional cost corresponding to the information transfer from the state random process $X_t$ to the control random process $U_t$. The regularization parameter $\beta \ge 0$ can be thought of as the cost of information transfer. When $\beta = 0$, the standard MDP formulation is recovered. As we increase $\beta > 0$, the optimal decision policy for (7) tends to be "information frugal" in order to reduce $I_{m,n}(X^T \to U^T)$. That is, control actions generated by the policy become statistically less dependent on the state of the system.

Second, in contrast to the standard MDP (i.e., $\beta = 0$), for which optimal policies can be assumed deterministic and history-independent (Markovian) without loss of generality [11, Section 4.4], TERMDP (7) with $\beta > 0$ admits an optimal policy that is randomized and history-dependent. Thus, the cardinality of the solution space we must explore to solve (7) is much larger than that of the standard MDP. However, in Proposition 1 below, we show that one can assume without loss of performance an optimal policy of the form

$$q_t(u_t \mid x_t, u_{t-n}^{t-1}) \quad (8)$$

instead of (1). In other words, it is sufficient to consider a policy that depends only on the most recent realization of the state and the last $n$ realizations of the control inputs.

Finally, the structure of the problem (7) is similar (but not equivalent) to that of the KL control formulation in [14]. In particular, if $(m, n) = (0, 0)$, the transfer entropy cost (4) becomes

$$\sum_{t=1}^{T} \mathbb{E} \log \frac{\mu_{t+1}(U_t \mid X_t)}{\mu_{t+1}(U_t)}.$$

On the other hand, the KL control framework considers a KL divergence cost of the form

$$\sum_{t=1}^{T} \mathbb{E} \log \frac{\mu_{t+1}(U_t \mid X_t)}{r_{t+1}(U_t \mid X_t)},$$

where $r_{t+1}(u_t|x_t)$ is the conditional distribution specified by a predefined "reference" policy. Unlike the fixed reference distribution $r_{t+1}(u_t|x_t)$, the denominator $\mu_{t+1}(U_t)$ in the transfer entropy cost depends on the policy being optimized. Unfortunately, this difference renders (7) nonconvex, as we will observe in Section IV-C. Despite this disadvantage, we emphasize the importance of studying TERMDP, as it arises in practical problems in science and engineering, as discussed in Section VI.
III. PRELIMINARIES

In this section, we summarize some technical results needed to derive our main results.

A. Structure of the optimal solution

We first derive some important structural properties of the optimal policy, which allow us to rewrite the main problem (7) in a simpler form. The desired structural results can be obtained by applying the dynamic programming principle to (7). To this end, notice that (7) can be viewed as a $T$-stage optimal control problem in which the joint distribution $\mu_t$ is the "state" of the system, controlled by the multiplicative control action $q_t$ via the state evolution equation (2). In the developments below, we set $\beta = 1$; this is without loss of generality, since for any $\beta > 0$ the weighted objective can be reduced to this case by rescaling the stage costs by $1/\beta$. Introduce the value function as

$$V_k\big(\mu_k(x^k, u^{k-1})\big) \triangleq \min_{\{q_t\}_{t=k}^{T}} \sum_{t=k}^{T} \big\{ \mathbb{E}\, c_t(X_t, U_t) + I(X_{t-m}^t; U_t \mid U_{t-n}^{t-1}) \big\} + \mathbb{E}\, c_{T+1}(X_{T+1}). \quad (9)$$

The value function satisfies the Bellman equation

$$V_t\big(\mu_t(x^t, u^{t-1})\big) = \min_{q_t} \Big\{ \mathbb{E}\, c_t(X_t, U_t) + I(X_{t-m}^t; U_t \mid U_{t-n}^{t-1}) + V_{t+1}\big(\mu_{t+1}(x^{t+1}, u^t)\big) \Big\} \quad (10)$$

for $t = 1, 2, \dots, T$, with the terminal condition

$$V_{T+1}\big(\mu_{T+1}(x^{T+1}, u^T)\big) = \mathbb{E}^{\mu_{T+1}} c_{T+1}(X_{T+1}). \quad (11)$$

The next proposition summarizes the key structural results.

Proposition 1: For the optimization problem (7) and its dynamic programming formulation (9)–(11), the following statements hold for each $k = 1, 2, \dots, T$.

(a) For each sequence of policies $\{q_t(u_t|x^t, u^{t-1})\}_{t=k}^{T}$, there exists a sequence of policies of the form $\{q'_t(u_t|x_t, u_{t-n}^{t-1})\}_{t=k}^{T}$ such that the value of the objective function on the left-hand side of (9) attained by $\{q'_t\}_{t=k}^{T}$ is less than or equal to the value attained by $\{q_t\}_{t=k}^{T}$.

(b) If the policy at time step $k$ is of the form $q'_k(u_k|x_k, u_{k-n}^{k-1})$, the identity $I(X_{k-m}^k; U_k|U_{k-n}^{k-1}) = I(X_k; U_k|U_{k-n}^{k-1})$ holds.

(c) The value function $V_k(\mu_k(x^k, u^{k-1}))$ depends only on the marginal distribution $\mu_k(x_k, u_{k-n}^{k-1})$.

Proof: See Appendix A.

An implication of Proposition 1 (a) is that the search for the optimal policy for the original TERMDP (7) can be restricted to the class of policies of the form $q_t(u_t|x_t, u_{t-n}^{t-1})$ without loss of performance. Proposition 1 (b) implies that, as long as a policy of the form $q_t(u_t|x_t, u_{t-n}^{t-1})$ is used, the problem remains equivalent even after the transfer entropy term $I_{m,n}(X^T \to U^T)$ is replaced by the transfer entropy of degree $(0, n)$:

$$I_{0,n}(X^T \to U^T) = \sum_{t=1}^{T} I(X_t; U_t \mid U_{t-n}^{t-1}).$$

Proposition 1 (c) implies that the distribution $\mu_t(x_t, u_{t-n}^{t-1})$, rather than the original $\mu_t(x^t, u^{t-1})$, suffices as the state of the considered problem. This "reduced" state evolves according to

$$\mu_{t+1}(x_{t+1}, u_{t-n+1}^{t}) = \sum_{x_t \in \mathcal{X}_t} \sum_{u_{t-n} \in \mathcal{U}_{t-n}} p_{t+1}(x_{t+1}|x_t, u_t)\, q_t(u_t|x_t, u_{t-n}^{t-1})\, \mu_t(x_t, u_{t-n}^{t-1}). \quad (12)$$

Based on these observations, it can be seen that Problem 1 can be solved by solving the following simplified problem:

Problem 2 (Simplified TERMDP): Let the initial state distribution $\mu_1(x_1)$ and the state transition probability $p_{t+1}(x_{t+1}|x_t, u_t)$ be given, and assume that the joint distribution $\mu_t(x_t, u_{t-n}^{t-1})$ is recursively given by (12). For a fixed constant $\beta \ge 0$, the simplified TERMDP is the optimization problem

$$\min_{\{q_t(u_t|x_t, u_{t-n}^{t-1})\}_{t=1}^{T}} \; J(X^{T+1}, U^T) + \beta\, I_{0,n}(X^T \to U^T). \quad (13)$$

Notice that Proposition 1 implies that a global minimizer for (13) is a global minimizer for (7). With an appropriate notion of local optimality, it can also be shown that a local minimizer for (13) is a local minimizer for (7), as detailed in Appendix B. For this reason, in what follows, we develop an algorithm that solves the simplified TERMDP (13) rather than the original TERMDP (7).
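The reduced-state recursion (12) is straightforward to implement. The sketch below propagates $\mu_t(x_t, u_{t-1})$ for the case $n = 1$; the handling of the missing control at $t = 1$ (a dummy uniform $u_0$) and the array conventions are our own illustrative choices, assuming common action spaces across time.

```python
import numpy as np

def propagate_reduced_state(mu1, p, q, T):
    """Forward recursion (12) for n = 1: the sufficient statistic is
    mu_t(x_t, u_{t-1}).  Conventions (ours): mu1[x] = mu_1(x_1);
    p[t][x, u, y] = p_{t+1}(y | x, u); q[t][x, v, u] = q_t(u | x, u_{t-1} = v).
    At t = 1 there is no previous control; we attach a dummy uniform u_0,
    which is immaterial provided q[0] does not depend on its second index."""
    U = q[0].shape[2]
    mu = mu1[:, None] * np.full((1, U), 1.0 / U)     # mu_1(x_1, u_0)
    trajectory = [mu]
    for t in range(T):
        # mu_{t+1}(y, u) = sum_{x,v} p(y|x,u) q(u|x,v) mu(x,v), which is (12)
        joint = np.einsum('xvu,xv->xu', q[t], mu)    # mu_t(x_t, u_t)
        mu = np.einsum('xuy,xu->yu', p[t], joint)    # mu_{t+1}(x_{t+1}, u_t)
        trajectory.append(mu)
    return trajectory
```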

Proposition 1 implies that an optimal solution to both the original and the simplified TERMDP can be found by solving the Bellman equation

$$V_t\big(\mu_t(x_t, u_{t-n}^{t-1})\big) = \min_{q_t} \Big\{ \mathbb{E}\, c_t(X_t, U_t) + I(X_t; U_t \mid U_{t-n}^{t-1}) + V_{t+1}\big(\mu_{t+1}(x_{t+1}, u_{t-n+1}^{t})\big) \Big\} \quad (14)$$

with the state transition rule (12) and the terminal condition

$$V_{T+1}\big(\mu_{T+1}(x_{T+1}, u_{T-n+1}^{T})\big) = \mathbb{E}^{\mu_{T+1}} c_{T+1}(X_{T+1}). \quad (15)$$

Notice that the Bellman equation (14) is simpler than the original form (10). However, solving (14) remains computationally challenging, as the right-hand side of (14) involves a nonconvex optimization problem. In Section IV-C, we present a simple numerical example demonstrating this nonconvexity.

B. Transfer entropy and directed information

In some applications (e.g., networked control systems; see Section VI-A), we are interested in (7) with directed information ($m = \infty$ and $n = \infty$), even though solving such a problem is often computationally intractable. In such applications, approximating directed information by transfer entropy of finite degree is a natural idea. The next proposition describes the consequence of such an approximation.

Proposition 2: For any fixed decision policy of the form $q_t(u_t|x_t, u_{t-n}^{t-1})$, $t = 1, 2, \dots, T$, we have²

$$I_{0,n} \ge I_{0,n+1} \ge \dots \ge I_{0,\infty}.$$

Proof: See Appendix C.

²$I_{m,n}$ is a short-hand notation for $I_{m,n}(X^T \to U^T)$.

The following chain of inequalities shows that the optimal value of (13) with any finite $n$ provides an upper bound on the optimal value of (7) with $(m, n) = (\infty, \infty)$:

$$\min_{\{q_t(u_t|x^t, u^{t-1})\}_{t=1}^{T}} J(X^{T+1}, U^T) + \beta I_{\infty,\infty}$$
$$= \min_{\{q_t(u_t|x_t, u^{t-1})\}_{t=1}^{T}} J(X^{T+1}, U^T) + \beta I_{\infty,\infty} \quad (16a)$$
$$= \min_{\{q_t(u_t|x_t, u^{t-1})\}_{t=1}^{T}} J(X^{T+1}, U^T) + \beta I_{0,\infty} \quad (16b)$$
$$\le \min_{\{q_t(u_t|x_t, u_{t-n}^{t-1})\}_{t=1}^{T}} J(X^{T+1}, U^T) + \beta I_{0,\infty} \quad (16c)$$
$$\le \min_{\{q_t(u_t|x_t, u_{t-n}^{t-1})\}_{t=1}^{T}} J(X^{T+1}, U^T) + \beta I_{0,n}. \quad (16d)$$

Equalities (16a) and (16b) follow from Proposition 1 (a) and (b), respectively. The inequality (16c) is trivial, since any policy of the form $q_t(u_t|x_t, u_{t-n}^{t-1})$ is a special case of a policy of the form $q_t(u_t|x_t, u^{t-1})$. The final inequality (16d) is due to Proposition 2.

C. Rate-distortion theory and Arimoto-Blahut Algorithm

In the special case with $T = 1$, $n = 0$, $\beta = 1$, and $c_{T+1}(\cdot) = 0$, the optimization problem (13) becomes

$$\min_{q(u|x)} \; \mathbb{E}\, c(X, U) + I(X; U), \quad (17)$$

where the probability distribution $p(x)$ on $\mathcal{X}$ is given. In this special case, the problem (17) is convex, and the solution is well known in the context of rate-distortion theory [28].

Proposition 3: A conditional distribution $q^*(u|x)$ is a global minimizer for (17) if and only if it satisfies the following condition $p(x)$-almost everywhere:

$$q^*(u|x) = \frac{\nu^*(u) \exp\{-c(x, u)\}}{\sum_{u \in \mathcal{U}} \nu^*(u) \exp\{-c(x, u)\}} \quad (18a)$$
$$\nu^*(u) = \sum_{x \in \mathcal{X}} p(x)\, q^*(u|x). \quad (18b)$$

Proof: This result is standard, and hence the proof is omitted. See [29, Appendix A] and [30] for relevant discussions.

Condition (18) is required only $p(x)$-almost everywhere, since for $x$ such that $p(x) = 0$, $q^*(u|x)$ can be chosen arbitrarily. Commonly, the denominator in (18a) is called the partition function:

$$\phi^*(x) \triangleq \sum_{u \in \mathcal{U}} \nu^*(u) \exp\{-c(x, u)\}.$$

By substitution, it is easy to show that the optimal value of (17) can be written in terms of $\nu^*(u)$ as

$$- \sum_{x \in \mathcal{X}} p(x) \log \Big\{ \sum_{u \in \mathcal{U}} \nu^*(u) \exp\{-c(x, u)\} \Big\}, \quad (19)$$

or more compactly as $\mathbb{E}^{p(x)}\{-\log \phi^*(X)\}$. This quantity is often referred to as free energy [6], [31].

The Arimoto-Blahut algorithm is an iterative algorithm to compute a $q^*(u|x)$ satisfying (18) numerically. It is based on the alternating updates

$$\nu^{(k)}(u) = \sum_{x \in \mathcal{X}} p(x)\, q^{(k-1)}(u|x) \quad (20a)$$
$$q^{(k)}(u|x) = \frac{\nu^{(k)}(u) \exp\{-c(x, u)\}}{\sum_{u \in \mathcal{U}} \nu^{(k)}(u) \exp\{-c(x, u)\}}. \quad (20b)$$

The algorithm was first proposed for the computation of channel capacity [26] and for the computation of rate-distortion functions [25]. Clearly, the optimal solution $(q^*, \nu^*)$ is a fixed point of the algorithm (20). Under a mild assumption, convergence of the algorithm is guaranteed; see [26], [32], [33]. The main algorithm we propose in this paper to solve the simplified TERMDP (13) can be thought of as a generalization of the Arimoto-Blahut "rate-distortion algorithm."
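For concreteness, the following is a minimal sketch of the alternating updates (20); the function name and array conventions are ours. Starting from any strictly positive $q^{(0)}$, the iteration approaches a pair satisfying (18), and (19) gives the optimal value.

```python
import numpy as np

def blahut_arimoto(p_x, c, iters=500):
    """Alternating updates (20) for min_q E c(X,U) + I(X;U) in (17).
    p_x[x]: given source distribution; c[x, u]: cost matrix."""
    X, U = c.shape
    q = np.full((X, U), 1.0 / U)               # any strictly positive q^{(0)}
    for _ in range(iters):
        nu = p_x @ q                           # (20a): nu(u) = sum_x p(x) q(u|x)
        w = nu[None, :] * np.exp(-c)           # nu(u) exp{-c(x,u)}
        phi = w.sum(axis=1, keepdims=True)     # partition function phi(x)
        q = w / phi                            # (20b)
    free_energy = -float(p_x @ np.log(phi.ravel()))   # optimal value (19)
    return q, nu, free_energy

# Binary example with Hamming cost c(x,u) = 1{x != u}:
q, nu, F = blahut_arimoto(np.array([0.5, 0.5]),
                          np.array([[0.0, 1.0], [1.0, 0.0]]))
```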
D. Block Coordinate Descent Algorithm

The Arimoto-Blahut algorithm can be viewed as a block coordinate descent (BCD) algorithm applied to a special class of objective functions. In this subsection, we summarize elements of the BCD method and a version of its convergence results that is relevant to our analysis. Consider the problem

$$\min \; f(x) \quad (21a)$$
$$\text{s.t. } x \in X = X_1 \times X_2 \times \dots \times X_N, \quad (21b)$$

where the feasible set $X$ is the Cartesian product of closed, nonempty, and convex subsets $X_i \subseteq \mathbb{R}^{n_i}$ for $i = 1, 2, \dots, N$, and the function $f : \mathbb{R}^{n_1 + \dots + n_N} \to \mathbb{R} \cup \{\infty\}$ is continuously differentiable on the sublevel set $\{x \in X : f(x) \le f(x^{(0)})\}$, where $x^{(0)} \in X$ is a given initial point. We call $x^* \in X$ a stationary point for (21) if it satisfies $\nabla f(x^*)^\top (y - x^*) \ge 0$ for every $y \in X$, where $\nabla f(x^*)$ is the gradient of $f$ at $x^*$. If $f$ is convex, every stationary point is a global minimizer of $f$. The BCD algorithm for (21) is defined by the following cyclic update rule:

$$x_1^{(k)} \in \arg\min_{x_1} f(x_1, x_2^{(k-1)}, \dots, x_N^{(k-1)}) \quad (22a)$$
$$x_2^{(k)} \in \arg\min_{x_2} f(x_1^{(k)}, x_2, x_3^{(k-1)}, \dots, x_N^{(k-1)}) \quad (22b)$$
$$\cdots$$
$$x_N^{(k)} \in \arg\min_{x_N} f(x_1^{(k)}, x_2^{(k)}, \dots, x_N). \quad (22c)$$

A number of sufficient conditions for the convergence of the BCD algorithm are known in the literature (e.g., [33]–[35] and references therein). For instance, if $f$ is pseudoconvex and has compact level sets, then every limit point of the sequence $\{x^{(k)}\}$ generated by the BCD algorithm is a global minimizer of $f$ [35, Proposition 6]. This result can be applied to show the global convergence of the Arimoto-Blahut algorithm (20), simply by noticing that the objective function in (17) can be written as a convex function of $\nu$ and $q$ as

$$f(\nu, q) = \sum_{x \in \mathcal{X},\, u \in \mathcal{U}} p(x)\, q(u|x) \Big( c(x, u) + \log \frac{q(u|x)}{\nu(u)} \Big)$$

and that (20) is equivalent to the BCD update rule (22). In the absence of the convexity assumption on $f$, it is typically required that each coordinate-wise minimization be uniquely attained in order to guarantee that every limit point of the BCD algorithm is a stationary point [34, Proposition 2.7.1].³ Unfortunately, the generalized Arimoto-Blahut algorithm we introduce in this paper for the TERMDP is a BCD algorithm applied to a nonconvex objective function, and the uniqueness of the coordinate-wise minimizer cannot be assumed. Thus, none of the above results are applicable to prove convergence. Fortunately, the requirement of uniqueness of the coordinate-wise minimizer can be relaxed when $N = 2$ (two-block BCD algorithms). The following result is due to [35, Corollary 2] and [33, Theorem 4.2 (c)].

Lemma 1: Consider the problem (21) with $N = 2$, and suppose that the sequence $\{x^{(k)}\}$ generated by the two-block BCD algorithm (22) has limit points. Then every limit point $x^*$ of $\{x^{(k)}\}$ is a stationary point of the problem (21).

Lemma 1 is critical to obtain one of our main results (Theorem 2) below.

³The counterexample by Powell [36] with $N = 3$ shows that the lack of uniqueness of the coordinate-wise minimizer can result in a BCD algorithm with a limit point which is not a stationary point.
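As a minimal illustration of the cyclic rule (22) with $N = 2$, the sketch below alternates exact coordinate-wise minimizations on a smooth, strictly convex quadratic whose coordinate minimizers are available in closed form; the instance is our own toy example, not from the paper.

```python
def two_block_bcd(argmin_x1, argmin_x2, x1, x2, iters=100):
    """Two-block BCD (22): alternate exact coordinate-wise minimizations."""
    for _ in range(iters):
        x1 = argmin_x1(x2)   # (22a)
        x2 = argmin_x2(x1)   # (22b)
    return x1, x2

# Toy instance: f(x1, x2) = (x1 - 1)^2 + (x2 + 2)^2 + x1*x2 (strictly convex).
# Setting the partial derivatives to zero gives the coordinate minimizers below;
# the iterates converge to the unique stationary point (8/3, -10/3).
x1, x2 = two_block_bcd(lambda x2: (2.0 - x2) / 2.0,
                       lambda x1: (-4.0 - x1) / 2.0, 0.0, 0.0)
print(x1, x2)
```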
IV. MAIN RESULTS

This section summarizes the main results (Theorems 1 and 2) of this paper. Derivations are deferred to Section V.

A. Necessary Optimality Condition

The first technical result states that a necessary optimality condition for the simplified TERMDP (13) is given by the nonlinear condition (23) in terms of $(\mu^*, \nu^*, \rho^*, \phi^*, q^*)$.

Theorem 1: If $\{q_t^*\}_{t=1}^{T}$ is a local minimizer for (13), then there exist variables $\{\mu_{t+1}^*, \nu_t^*, \rho_t^*, \phi_t^*\}_{t=1}^{T}$ satisfying the set of nonlinear equations (23), together with the initial condition $\mu_1^*(x_1) = p_1(x_1)$ and the terminal condition $\phi_{T+1}^*(x_{T+1}, u_{T-n+1}^{T}) \triangleq \exp\{-c_{T+1}(x_{T+1})\}$.

The optimality condition (23) can be utilized to develop a numerical algorithm to find an optimal solution candidate for the simplified TERMDP (13). Unfortunately, it will soon be shown that the optimality condition (23) is only necessary in general. Since (23) is a nonlinear condition, it is possible that (23) admits multiple distinct solutions, some of which may correspond to local minima and saddle points of the simplified TERMDP (13). Theorem 1 is closely related to the previously obtained conditions in [22] and [23]. Theorem 1 refines the results of [22] and [23] by incorporating the underlying Markovian structure of the simplified TERMDP (13).

B. Forward-Backward Arimoto-Blahut Algorithm

As the second contribution, we propose an iterative algorithm to solve (23) numerically. To this end, we classify the five equations into two groups. Equations (23a) and (23b) form the first group (characterizing the variables $\mu^*$ and $\nu^*$), and equations (23c)–(23e) form the second group (characterizing the variables $\rho^*$, $\phi^*$, and $q^*$). Observe that if the variables $(\rho^*, \phi^*, q^*)$ are known, then the first set of equations, which can be viewed as the Kolmogorov forward equation, can be solved forward in time to compute $(\mu^*, \nu^*)$. Conversely, if the variables $(\mu^*, \nu^*)$ are known, then the second set of equations, which can be viewed as the backward Bellman equation, can be solved backward in time to compute $(\rho^*, \phi^*, q^*)$. Hence, to compute these unknowns simultaneously, the following bootstrapping method is natural: first, the forward computation is performed using the current best guess of the second set of unknowns, and then the backward computation is performed using the updated guess of the first set of unknowns. The forward-backward iteration is repeated sufficiently many times. The proposed algorithm is summarized in Algorithm 1. Notice that Algorithm 1 is a generalization of the standard Arimoto-Blahut algorithm (20), which can be recovered as a special case with $T = 1$, $m = n = 0$, and $c_{T+1}(\cdot) = 0$.

Clearly, solutions of (23) are fixed points of Algorithm 1. The second main result of this paper is stated as follows.

Theorem 2: For each initial condition (24), the sequence $q^{(k)}$ generated by Algorithm 1 has a limit point. Moreover, every limit point $q^*$ satisfies (23).

Theorems 1 and 2 together imply that every limit point of Algorithm 1 is an optimal solution candidate for TERMDP.

The optimality condition in Theorem 1 is the following set of nonlinear equations, to hold for $t = 1, 2, \dots, T$:

$$\mu_{t+1}^*(x_{t+1}, u_{t-n+1}^{t}) = \sum_{x_t \in \mathcal{X}_t} \sum_{u_{t-n} \in \mathcal{U}_{t-n}} p_{t+1}(x_{t+1}|x_t, u_t)\, q_t^*(u_t|x_t, u_{t-n}^{t-1})\, \mu_t^*(x_t, u_{t-n}^{t-1}) \quad (23a)$$

$$\nu_t^*(u_t|u_{t-n}^{t-1}) = \sum_{x_t \in \mathcal{X}_t} q_t^*(u_t|x_t, u_{t-n}^{t-1})\, \mu_t^*(x_t|u_{t-n}^{t-1}), \quad \forall u_{t-n}^{t-1} \text{ such that } \mu_t(u_{t-n}^{t-1}) > 0 \quad (23b)$$

$$\rho_t^*(x_t, u_{t-n}^{t}) = c_t(x_t, u_t) - \sum_{x_{t+1} \in \mathcal{X}_{t+1}} p_{t+1}(x_{t+1}|x_t, u_t) \log \phi_{t+1}^*(x_{t+1}, u_{t-n+1}^{t}) \quad (23c)$$

$$\phi_t^*(x_t, u_{t-n}^{t-1}) = \sum_{u_t \in \mathcal{U}_t} \nu_t^*(u_t|u_{t-n}^{t-1}) \exp\big\{-\rho_t^*(x_t, u_{t-n}^{t})\big\} \quad (23d)$$

$$q_t^*(u_t|x_t, u_{t-n}^{t-1}) = \frac{\nu_t^*(u_t|u_{t-n}^{t-1}) \exp\big\{-\rho_t^*(x_t, u_{t-n}^{t})\big\}}{\phi_t^*(x_t, u_{t-n}^{t-1})}, \quad \forall (x_t, u_{t-n}^{t-1}) \text{ such that } \mu_t(x_t, u_{t-n}^{t-1}) > 0 \quad (23e)$$

Algorithm 1: Forward-Backward Arimoto-Blahut Algorithm

Initialize:
  $q_t^{(0)}(u_t|x_t, u_{t-n}^{t-1}) > 0$ for $t = 1, 2, \dots, T$; (24)
  $\phi_{T+1}^{(k)}(x_{T+1}, u_{T+1-n}^{T}) \triangleq \exp\{-c_{T+1}(x_{T+1})\}$ for $k = 1, 2, \dots, K$;

for $k = 1, 2, \dots, K$ (until convergence) do
  // Forward path
  for $t = 1, 2, \dots, T$ do
    $$\mu_{t+1}^{(k)}(x_{t+1}, u_{t-n+1}^{t}) = \sum_{x_t \in \mathcal{X}_t} \sum_{u_{t-n} \in \mathcal{U}_{t-n}} p_{t+1}(x_{t+1}|x_t, u_t)\, q_t^{(k-1)}(u_t|x_t, u_{t-n}^{t-1})\, \mu_t^{(k)}(x_t, u_{t-n}^{t-1}); \quad (25a)$$
    $$\nu_t^{(k)}(u_t|u_{t-n}^{t-1}) = \sum_{x_t \in \mathcal{X}_t} q_t^{(k-1)}(u_t|x_t, u_{t-n}^{t-1})\, \mu_t^{(k)}(x_t|u_{t-n}^{t-1}); \quad (25b)$$
  // Backward path
  for $t = T, T-1, \dots, 1$ do
    $$\rho_t^{(k)}(x_t, u_{t-n}^{t}) = c_t(x_t, u_t) - \sum_{x_{t+1} \in \mathcal{X}_{t+1}} p_{t+1}(x_{t+1}|x_t, u_t) \log \phi_{t+1}^{(k)}(x_{t+1}, u_{t-n+1}^{t}); \quad (26a)$$
    $$\phi_t^{(k)}(x_t, u_{t-n}^{t-1}) = \sum_{u_t \in \mathcal{U}_t} \nu_t^{(k)}(u_t|u_{t-n}^{t-1}) \exp\big\{-\rho_t^{(k)}(x_t, u_{t-n}^{t})\big\}; \quad (26b)$$
    $$q_t^{(k)}(u_t|x_t, u_{t-n}^{t-1}) = \frac{\nu_t^{(k)}(u_t|u_{t-n}^{t-1}) \exp\big\{-\rho_t^{(k)}(x_t, u_{t-n}^{t})\big\}}{\phi_t^{(k)}(x_t, u_{t-n}^{t-1})}; \quad (26c)$$

Return $q_t^{(K)}(u_t|x_t, u_{t-n}^{t-1})$;
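The following is a minimal runnable sketch of Algorithm 1 for the memoryless case $n = 0$, in which $\mu_t$, $\nu_t$, $\rho_t$, and $\phi_t$ reduce to arrays indexed by $x_t$ and $u_t$; the array conventions are our own. As in (23), the stage costs are taken at $\beta = 1$; a general $\beta > 0$ can be handled by scaling the costs by $1/\beta$.

```python
import numpy as np

def forward_backward_ab(mu1, p, c, c_term, K=200):
    """Algorithm 1 specialized to n = 0 (policies q_t(u_t|x_t)).
    Conventions (ours): mu1[x]; p[t][x, u, y] = p_{t+1}(y|x,u);
    c[t][x, u] = c_t(x, u); c_term[x] = c_{T+1}(x)."""
    T = len(c)
    X, U = c[0].shape
    q = [np.full((X, U), 1.0 / U) for _ in range(T)]        # (24): q^{(0)} > 0
    log_phi_terminal = -c_term                              # phi_{T+1} = exp{-c_{T+1}}
    for _ in range(K):
        # Forward path: propagate mu_t and nu_t under q^{(k-1)}, cf. (25)
        mu, nu = [mu1], []
        for t in range(T):
            joint = mu[t][:, None] * q[t]                   # mu_t(x_t, u_t)
            nu.append(joint.sum(axis=0))                    # (25b)
            mu.append(np.einsum('xuy,xu->y', p[t], joint))  # (25a)
        # Backward path: recompute rho, phi and the policy, cf. (26)
        log_phi = log_phi_terminal
        for t in reversed(range(T)):
            rho = c[t] - np.einsum('xuy,y->xu', p[t], log_phi)   # (26a)
            w = nu[t][None, :] * np.exp(-rho)
            phi = w.sum(axis=1, keepdims=True)                   # (26b)
            q[t] = w / phi                                       # (26c)
            log_phi = np.log(phi.ravel())
    return q
```

Applied to the toy instance of Section IV-C below, different strictly positive initializations $q^{(0)}$ can drive this iteration to different fixed points, consistent with Remark 1.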

The following remarks clarify some limitations of our main results, which require further analysis in future work.

Remark 1:

1) Since (23) is only a necessary optimality condition, a limit point may not be a local minimum of the given TERMDP. Section IV-C below presents an example in which one of the limit points is a saddle point.

2) Theorem 2 does not guarantee the existence of $\lim_{k \to \infty} q^{(k)}$ (e.g., the sequence may oscillate between distinct limit points). However, we were not able to construct such a counterexample.

3) In the classical Arimoto-Blahut algorithm, there is a known stopping criterion by which the suboptimality of the iteration is estimated effectively (e.g., [25, Fig. 3]). Currently, such a stopping criterion is not known for Algorithm 1.

The number of arithmetic operations required to perform a single forward-backward pass in Algorithm 1 is estimated as $O(T |\mathcal{X}|^2 |\mathcal{U}|^{n+1})$. Notice that it is linear in $T$, grows exponentially with $n$, and does not depend on $m$.

C. Nonconvexity

Due to the nonconvexity of the value functions in (14), the optimality condition (23) is only necessary in general. In fact, depending on the initial condition, Algorithm 1 can converge to different stationary points corresponding to local minima and saddle points of the considered TERMDP (13). To demonstrate this, consider a simple problem instance of the TERMDP (13) with $T = 2$, $n = 0$, $\mathcal{X} = \{0, 1\}$, $\mathcal{U} = \{0, 1\}$,

$$c_t(x_t, u_t) = \begin{cases} 0 & \text{if } u_t = x_t \\ 1 & \text{if } u_t \ne x_t \end{cases}$$

for $t = 1, 2$, and $c_3(x_3) \equiv 0$. The initial state distribution is assumed to be $\mu_1(x_1 = 0) = \mu_1(x_1 = 1) = 0.5$, and the state transitions are deterministic in that

$$p_{t+1}(x_{t+1}|x_t, u_t) = \begin{cases} 1 & \text{if } x_{t+1} = u_t \\ 0 & \text{if } x_{t+1} \ne u_t. \end{cases}$$

To see that this problem has multiple distinct local minima, we solve the Bellman equation (14) numerically by gridding the space of probability distributions. Specifically, at $t = 2$, the value function $V_2(\mu_2(x_2))$ is computed by solving the minimization

$$V_2(\mu_2(x_2)) = \min_{q_2(u_2|x_2)} \big\{ \mathbb{E}\, c_2(X_2, U_2) + I(X_2; U_2) \big\}.$$

Since $\mu_2(x_2)$ is an element of a one-dimensional probability simplex, it can be parameterized as $\mu_2(x_2 = 0) = \lambda$ and $\mu_2(x_2 = 1) = 1 - \lambda$ with $\lambda \in [0, 1]$. For each fixed $\lambda$, the minimization above is solved by the standard Arimoto-Blahut iteration:

$$\nu_2^{(k)}(u_2) = \sum_{x_2 \in \{0,1\}} \mu_2(x_2)\, q_2^{(k-1)}(u_2|x_2)$$
$$\phi_2^{(k)}(x_2) = \sum_{u_2 \in \{0,1\}} \nu_2^{(k)}(u_2) \exp\{-c_2(x_2, u_2)\}$$
$$q_2^{(k)}(u_2|x_2) = \frac{\nu_2^{(k)}(u_2) \exp\{-c_2(x_2, u_2)\}}{\phi_2^{(k)}(x_2)}.$$

After convergence, the value function is computed as

$$V_2(\mu_2(x_2)) = - \sum_{x_2 \in \{0,1\}} \mu_2(x_2) \log \phi_2^{(k)}(x_2).$$

Fig. 1 (Left) shows $V_2(\mu_2(x_2))$ as a function of $\lambda$; it is clearly nonconvex.
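The gridding computation just described is easy to reproduce; the sketch below is our own transcription of it and returns the curve of Fig. 1 (Left) as an array rather than a plot.

```python
import numpy as np

def V2_of_lambda(lam, iters=2000):
    """Inner minimization at t = 2, solved by the standard Arimoto-Blahut
    iteration exactly as in the text; c2(x,u) = 1{u != x}."""
    mu2 = np.array([lam, 1.0 - lam])
    c2 = np.array([[0.0, 1.0], [1.0, 0.0]])
    q2 = np.full((2, 2), 0.5)
    for _ in range(iters):
        nu2 = mu2 @ q2                            # nu_2(u_2)
        w = nu2[None, :] * np.exp(-c2)
        phi2 = w.sum(axis=1, keepdims=True)       # phi_2(x_2)
        q2 = w / phi2
    return -float(mu2 @ np.log(phi2.ravel()))     # V2 = -sum_x mu2(x) log phi2(x)

lams = np.linspace(0.0, 1.0, 101)
V2 = [V2_of_lambda(l) for l in lams]   # gridding lambda reproduces the nonconvex
                                       # curve in Fig. 1 (Left) qualitatively
```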

After $V_2(\mu_2(x_2))$ is obtained, the Bellman equation at time $t = 1$ can be evaluated as

$$V_1(\mu_1(x_1)) = \min_{q_1(u_1|x_1)} \big\{ \mathbb{E}\, c_1(X_1, U_1) + I(X_1; U_1) + V_2(\mu_2(x_2)) \big\}. \quad (27)$$

Due to the nonconvexity of $V_2(\mu_2(x_2))$, the objective function in the minimization (27) is a nonconvex function of $q_1(u_1|x_1)$. Fig. 1 (Right) shows the objective function in (27) plotted as a function of $q_1$ parameterized by $\theta_0$ and $\theta_1$:

$$q_1(u_1 = 0|x_1 = 0) = \theta_0, \quad q_1(u_1 = 1|x_1 = 0) = 1 - \theta_0,$$
$$q_1(u_1 = 0|x_1 = 1) = \theta_1, \quad q_1(u_1 = 1|x_1 = 1) = 1 - \theta_1.$$

Clearly, it is a nonconvex function, admitting two local minima (A and B) and a saddle point (C). Each of them is a fixed point of Algorithm 1.

[Fig. 1. Left: the value function $V_2$ as a function of $\lambda$. Right: contour plot of the objective function $\mathbb{E} c_1(x_1, u_1) + I(X_1; U_1) + V_2(\mu_2(x_2))$ in (27) as a function of $\theta_0$ and $\theta_1$. Stationary points A and B are local minima, whereas C is a saddle point. Two sample trajectories of the proposed forward-backward Arimoto-Blahut algorithm (Algorithm 1), started from different initial conditions, are also shown. It can be shown that A, B, and C are all fixed points of Algorithm 1.]

V. DERIVATION OF MAIN RESULTS

This section provides the technical details of the proofs of Theorems 1 and 2.

A. Preparation

The first step is to rewrite the objective function in (13) as an explicit function of $q^T$. For each $t = 1, 2, \dots, T$, let $\mu_t(x_t|u_{t-n}^{t-1})$ be the conditional distribution obtained from $\mu_t(x_t, u_{t-n}^{t-1})$ whenever $\mu_t(u_{t-n}^{t-1}) > 0$. Define the conditional distribution $\nu_t$ by

$$\nu_t(u_t|u_{t-n}^{t-1}) = \sum_{x_t \in \mathcal{X}_t} q_t(u_t|x_t, u_{t-n}^{t-1})\, \mu_t(x_t|u_{t-n}^{t-1}). \quad (28)$$

When $\mu_t(u_{t-n}^{t-1}) = 0$, $\nu_t$ is defined to be the uniform distribution on $\mathcal{U}_t$. For each $t = 1, 2, \dots, T$, we consider $\nu_t$ and $q_t$ as elements of Euclidean spaces, i.e.,

$$\nu_t(u_t|u_{t-n}^{t-1}) \in \mathbb{R}^{|\mathcal{U}_{t-n}| \times \dots \times |\mathcal{U}_t|} \quad (29a)$$
$$q_t(u_t|x_t, u_{t-n}^{t-1}) \in \mathbb{R}^{|\mathcal{U}_{t-n}| \times \dots \times |\mathcal{U}_t| \times |\mathcal{X}_t|}. \quad (29b)$$

Since $\nu_t$ and $q_t$ are conditional probability distributions, they are entry-wise nonnegative (denoted by $\nu_t \ge 0$ and $q_t \ge 0$) and satisfy

$$\sum_{u_t \in \mathcal{U}_t} \nu_t(u_t|u_{t-n}^{t-1}) = 1 \quad \forall u_{t-n}^{t-1} \in \mathcal{U}_{t-n}^{t-1} \quad (30a)$$
$$\sum_{u_t \in \mathcal{U}_t} q_t(u_t|x_t, u_{t-n}^{t-1}) = 1 \quad \forall (x_t, u_{t-n}^{t-1}) \in \mathcal{X}_t \times \mathcal{U}_{t-n}^{t-1}. \quad (30b)$$

Thus, the feasibility sets for $\nu^T$ and $q^T$ are

$$\mathcal{X}_\nu = \{\nu : \text{(29a), (30a), and } \nu_t \ge 0 \text{ for every } t = 1, 2, \dots, T\};$$
$$\mathcal{X}_q = \{q : \text{(29b), (30b), and } q_t \ge 0 \text{ for every } t = 1, 2, \dots, T\}.$$

Using $\mu_t$, $\nu_t$, and $q_t$, the stage-wise cost in (13) can be written as

$$\ell_t(\mu_t, \nu_t, q_t) \triangleq \mathbb{E}\, c_t(X_t, U_t) + I(X_t; U_t|U_{t-n}^{t-1}) = \sum_{x_t \in \mathcal{X}_t} \sum_{u_{t-n}^{t} \in \mathcal{U}_{t-n}^{t}} \mu_t(x_t, u_{t-n}^{t-1})\, q_t(u_t|x_t, u_{t-n}^{t-1}) \Big( \log \frac{q_t(u_t|x_t, u_{t-n}^{t-1})}{\nu_t(u_t|u_{t-n}^{t-1})} + c_t(x_t, u_t) \Big)$$

for $t = 1, 2, \dots, T$, and

$$\ell_{T+1}(\mu_{T+1}) \triangleq \mathbb{E}\, c_{T+1}(X_{T+1}) = \sum_{x_{T+1} \in \mathcal{X}_{T+1}} \mu_{T+1}(x_{T+1})\, c_{T+1}(x_{T+1})$$

for the final time step. Thus, the objective function in (13) is

$$f(\nu^T, q^T) = \sum_{t=1}^{T} \ell_t(\mu_t, \nu_t, q_t) + \ell_{T+1}(\mu_{T+1}). \quad (31)$$

Notice that we consider (31) as a function of $\nu^T$ and $q^T$, which turns out to be convenient in order to view Algorithm 1 as a two-block BCD algorithm. Our problem is to minimize $f(\nu^T, q^T)$ over $\mathcal{X}_\nu \times \mathcal{X}_q$ subject to the equality constraint (28). However, the following result states that (28) is automatically satisfied by a coordinate-wise minimizer $\nu^T$ of $f(\nu^T, q^T)$ for a fixed $q^T$.

Lemma 2 ([25, Theorem 4(b)]): Let $q^T \in \mathcal{X}_q$ be fixed. Then

$$\min \; f(\nu^T, q^T) \quad (32a)$$
$$\text{s.t. } \nu^T \in \mathcal{X}_\nu \quad (32b)$$

is a convex optimization problem with respect to $\nu^T$, and an optimal solution is given by (28).

Therefore, the simplified TERMDP (13) can be written as

$$\min \; f(\nu^T, q^T) \quad (33a)$$
$$\text{s.t. } \nu^T \in \mathcal{X}_\nu, \; q^T \in \mathcal{X}_q. \quad (33b)$$

(Including the equality constraint (28) in (33) does not alter the optimal solution because of Lemma 2.) Notice that if $q^{T*}$ is a local minimizer for (13), then there exists a vector $\nu^{T*} \in \mathcal{X}_\nu$ such that $(\nu^{T*}, q^{T*})$ is a local minimizer for (33). Moreover, a local minimizer $(\nu^{T*}, q^{T*})$ is necessarily a coordinate-wise local minimizer. By coordinate-wise convexity of $f(\nu^T, q^T)$ (which will be shown below), we have

$$\nu^{T*} \in \arg\min_{\nu^T \in \mathcal{X}_\nu} f(\nu^T, q^{T*}) \quad (34a)$$
$$q^{T*} \in \arg\min_{q^T \in \mathcal{X}_q} f(\nu^{T*}, q^T). \quad (34b)$$

Therefore, (34) is a necessary condition for $q^{T*}$ to be a local minimizer for the simplified TERMDP (13). In the next subsection, we further show that (34) implies the condition (23), which thus shows that (23) is a necessary optimality condition for the simplified TERMDP (13). This argument outlines the proof of Theorem 1, which will be detailed in Section V-C.

B. Analysis of Algorithm 1

For the analysis of Algorithm 1, a key observation is that it is the two-block BCD algorithm applied to (34):

$$\nu^{T(k)} = \arg\min_{\nu^T \in \mathcal{X}_\nu} f(\nu^T, q^{T(k-1)}) \quad (35a)$$
$$q^{T(k)} = \arg\min_{q^T \in \mathcal{X}_q} f(\nu^{T(k)}, q^T). \quad (35b)$$

To prove that the backward path (26) is equivalent to the coordinate-wise minimization (35b), we remark that the minimization problem (35b) can be viewed as an optimal control problem with respect to the control actions $q^T = (q_1, \dots, q_T)$. The next lemma essentially shows that the backward path (26) solves (35b) by backward dynamic programming.

Lemma 3: Suppose

$$(\nu_1^{(k)}, \dots, \nu_T^{(k)}) \in \mathcal{X}_\nu, \quad (q_1^{(k-1)}, \dots, q_{\tau-1}^{(k-1)}, q_\tau^{(k)}, q_{\tau+1}^{(k)}, \dots, q_T^{(k)}) \in \mathcal{X}_q,$$

and $\{\mu_{t+1}\}_{t=1}^{T}$ is the sequence of probability measures generated by $(q_1^{(k-1)}, \dots, q_{\tau-1}^{(k-1)}, q_\tau^{(k)}, q_{\tau+1}^{(k)}, \dots, q_T^{(k)})$ via (12). Let $\rho_\tau^{(k)}$, $\phi_\tau^{(k)}$, and $q_\tau^{(k)}$ be the parameters obtained by computing (26) backward in time for $t = T, \dots, \tau$. Then, for each $\tau = T, T-1, \dots, 1$, the following statements hold:

(a) The function

$$f(\nu_1^{(k)}, \dots, \nu_T^{(k)}, q_1^{(k-1)}, \dots, q_{\tau-1}^{(k-1)}, q_\tau, q_{\tau+1}^{(k)}, \dots, q_T^{(k)})$$

is convex in $q_\tau \ge 0$, and any global minimizer $q_\tau^\circ$ satisfies $q_\tau^\circ = q_\tau^{(k)}$ almost everywhere with respect to $\mu_\tau(x_\tau, u_{\tau-n}^{\tau-1})$.

(b) The cost-to-go function under the policy $\{q_t^{(k)}\}_{t=\tau}^{T}$ is linear in $\mu_\tau$:

$$\sum_{t=\tau}^{T} \ell_t(\mu_t, \nu_t^{(k)}, q_t^{(k)}) + \ell_{T+1}(\mu_{T+1}) = - \sum_{x_\tau \in \mathcal{X}_\tau} \sum_{u_{\tau-n}^{\tau-1} \in \mathcal{U}_{\tau-n}^{\tau-1}} \mu_\tau(x_\tau, u_{\tau-n}^{\tau-1}) \log \phi_\tau^{(k)}(x_\tau, u_{\tau-n}^{\tau-1}).$$

Proof: The proof is by backward induction. For the time step $T$, we have

$$f(\nu_1^{(k)}, \dots, \nu_T^{(k)}, q_1^{(k-1)}, \dots, q_{T-1}^{(k-1)}, q_T) = \sum_{x_T \in \mathcal{X}_T} \sum_{u_{T-n}^{T} \in \mathcal{U}_{T-n}^{T}} \mu_T(x_T, u_{T-n}^{T-1})\, q_T(u_T|x_T, u_{T-n}^{T-1}) \Big( \log \frac{q_T(u_T|x_T, u_{T-n}^{T-1})}{\nu_T^{(k)}(u_T|u_{T-n}^{T-1})} + \rho_T^{(k)}(x_T, u_{T-n}^{T}) \Big) + \text{const.}, \quad (36)$$

where $\mu_T$, $\nu_T^{(k)}$, $\rho_T^{(k)}$, and "const." do not depend on $q_T$. Convexity of (36) in $q_T$ is clear, as $q_T \log q_T$ is a convex function of $q_T$. Moreover, the minimization of (36) in terms of $q_T$ has the same structure as the minimization problem (17) in terms of $q(u|x)$. Therefore, it follows from Proposition 3 that the minimizer $q_T^\circ$ for (36) is given by $q_T^\circ = q_T^{(k)}$. This establishes (a) for the time step $\tau = T$. The statement (b) for $\tau = T$ can be shown directly by substituting the expression of $q_t^{(k)}$ given by (26c) with $t = T$ into (36):

$$\ell_T(\mu_T, \nu_T^{(k)}, q_T^{(k)}) + \ell_{T+1}(\mu_{T+1}) = \sum_{x_T \in \mathcal{X}_T} \sum_{u_{T-n}^{T} \in \mathcal{U}_{T-n}^{T}} \mu_T(x_T, u_{T-n}^{T-1})\, q_T^{(k)}(u_T|x_T, u_{T-n}^{T-1}) \Big( \log \frac{q_T^{(k)}(u_T|x_T, u_{T-n}^{T-1})}{\nu_T^{(k)}(u_T|u_{T-n}^{T-1})} + \rho_T^{(k)}(x_T, u_{T-n}^{T}) \Big) \quad (37a)$$
$$= \sum_{x_T \in \mathcal{X}_T} \sum_{u_{T-n}^{T} \in \mathcal{U}_{T-n}^{T}} \mu_T(x_T, u_{T-n}^{T-1})\, q_T^{(k)}(u_T|x_T, u_{T-n}^{T-1}) \big( -\log \phi_T^{(k)}(x_T, u_{T-n}^{T-1}) \big) \quad (37b)$$
$$= - \sum_{x_T \in \mathcal{X}_T} \sum_{u_{T-n}^{T-1} \in \mathcal{U}_{T-n}^{T-1}} \mu_T(x_T, u_{T-n}^{T-1}) \log \phi_T^{(k)}(x_T, u_{T-n}^{T-1}) \underbrace{\sum_{u_T \in \mathcal{U}_T} q_T^{(k)}(u_T|x_T, u_{T-n}^{T-1})}_{=1}. \quad (37c)$$

To complete the proof, we show that if (b) holds for the time step $\tau + 1$, then both (a) and (b) hold for the time step $\tau$. Since (b) is hypothesized for $\tau + 1$, using $\rho_\tau^{(k)}$ it is possible to write

$$f(\nu_1^{(k)}, \dots, \nu_T^{(k)}, q_1^{(k-1)}, \dots, q_{\tau-1}^{(k-1)}, q_\tau, q_{\tau+1}^{(k)}, \dots, q_T^{(k)}) = \sum_{x_\tau \in \mathcal{X}_\tau} \sum_{u_{\tau-n}^{\tau} \in \mathcal{U}_{\tau-n}^{\tau}} \mu_\tau(x_\tau, u_{\tau-n}^{\tau-1})\, q_\tau(u_\tau|x_\tau, u_{\tau-n}^{\tau-1}) \Big( \log \frac{q_\tau(u_\tau|x_\tau, u_{\tau-n}^{\tau-1})}{\nu_\tau^{(k)}(u_\tau|u_{\tau-n}^{\tau-1})} + \rho_\tau^{(k)}(x_\tau, u_{\tau-n}^{\tau}) \Big) + \text{const.}, \quad (38)$$

where $\mu_\tau$, $\nu_\tau^{(k)}$, $\rho_\tau^{(k)}$, and "const." do not depend on $q_\tau$. Namely, (38) is convex in $q_\tau$. Proposition 3 is applicable once again to conclude that a minimizer coincides with $q_\tau^{(k)}$ almost everywhere with respect to $\mu_\tau(x_\tau, u_{\tau-n}^{\tau-1})$. Hence, (a) is established for the time step $\tau$. The statement (b) for $\tau$ can be shown by a direct substitution; the details are similar to (37).

C. Proof of main results

Based on the observations so far, the main theorems of this paper are established as follows.

Proof of Theorem 1: For a given $\{q_t^*\}_{t=1}^{T}$, we define $\{\mu_t^*\}_{t=1}^{T}$ via (23a), and hence (23a) is automatically satisfied. By Lemma 2, for any locally optimal solution $\{q_t^*\}_{t=1}^{T}$ to (13), there exist variables $\{\nu_t^*\}_{t=1}^{T}$ satisfying (23b) such that $\{\nu_t^*, q_t^*\}_{t=1}^{T}$ is a locally optimal solution to (33). Since local optimality implies coordinate-wise local optimality, for each $t = 1, 2, \dots, T$, $q_t^*$ is a local minimizer of

$$f(\nu_1^*, \dots, \nu_T^*, q_T^*, \dots, q_{t+1}^*, q_t, q_{t-1}^*, \dots, q_1^*). \quad (39)$$

However, Lemma 3 is applicable (with $\nu_\tau^{(k)} = \nu_\tau^*$ for $\tau = 1, \dots, T$, $q_\tau^{(k)} = q_\tau^*$ for $\tau = t+1, \dots, T$, and $q_\tau^{(k-1)} = q_\tau^*$ for $\tau = 1, \dots, t-1$) to conclude that (39) is convex in $q_t \ge 0$, and hence

$$q_t^* \in \arg\min_{q_t \ge 0} f(\nu_1^*, \dots, \nu_T^*, q_T^*, \dots, q_{t+1}^*, q_t, q_{t-1}^*, \dots, q_1^*).$$

Moreover, Lemma 3 (a) also implies that if the parameters $\{\rho_\tau^*, \phi_\tau^*\}_{\tau=t}^{T}$ are calculated by (23c)–(23d) backward in time, then any global minimizer

$$q_t^\circ \in \arg\min_{q_t \ge 0} f(\nu_1^*, \dots, \nu_T^*, q_T^*, \dots, q_{t+1}^*, q_t, q_{t-1}^*, \dots, q_1^*)$$

satisfies

$$q_t^\circ = \frac{\nu_t^*(u_t|u_{t-n}^{t-1}) \exp\{-\rho_t^*(x_t, u_{t-n}^{t})\}}{\phi_t^*(x_t, u_{t-n}^{t-1})}$$

$\mu_t$-almost everywhere. Hence (23e) must hold.

Proof of Theorem 2: The fact that the sequence $q^{(k)}$ generated by Algorithm 1 has a limit point follows from the fact that $\mathcal{X}_q$ is a compact set. To show that every limit point of the sequence $(\nu^{(k)}, q^{(k)})$ generated by Algorithm 1 is a stationary point for (33), observe that

(a) the update rule (25b) is equivalent to (35a); and
(b) the update rule (26c) is equivalent to (35b).

The fact (a) follows from Lemma 2, and (b) follows from Lemma 3. Therefore, Algorithm 1 is equivalent to the two-block BCD algorithm, to which Lemma 1 is applicable.

Finally, we claim that every stationary point for (33) satisfies (23). To see this, notice that every stationary point $(\nu^{T*}, q^{T*})$ is a coordinate-wise stationary point. Since $f(\nu^T, q^T)$ is coordinate-wise convex, $(\nu^{T*}, q^{T*})$ is a coordinate-wise minimizer, i.e., (34) holds. By Lemma 2, conditions (23a) and (23b) are necessary for (34a). By Lemma 3 (with $\nu_\tau^{(k)} = \nu_\tau^*$ for $\tau = 1, \dots, T$, $q_\tau^{(k)} = q_\tau^*$ for $\tau = t+1, \dots, T$, and $q_\tau^{(k-1)} = q_\tau^*$ for $\tau = 1, \dots, t-1$), conditions (23c)–(23e) are necessary for (34b).

VI. INTERPRETATIONS

In this section, we discuss two applications of TERMDP in engineering and scientific contexts in which transfer entropy plays a central role.

A. Networked Control Systems

The first application is the analysis of networked control systems, where the sensor data is transmitted to the controller over a rate-limited communication channel. Fig. 2 shows a discrete-time, finite-horizon MDP setup in which a decision policy must be realized by a joint design of encoder and decoder, together with an appropriate codebook for a discrete noiseless channel. Most generally, assume that the encoder is a stochastic kernel $e_t(w_t|x^t, w^{t-1})$ and the decoder is a stochastic kernel $d_t(u_t|w^t, u^{t-1})$. At each time step, a codeword $w_t$ is chosen from a codebook $\mathcal{W}_t$ such that $|\mathcal{W}_t| = 2^{R_t}$. We refer to $R = \sum_{t=1}^{T} R_t$ as the rate of communication. The next proposition claims that the rate of communication in Fig. 2 is fundamentally lower bounded by the directed information.

Proposition 4: Let the encoder and the decoder be any stochastic kernels of the form $e_t(w_t|x^t, w^{t-1})$ and $d_t(u_t|w^t, u^{t-1})$. Then $R \log 2 \ge I(X^T \to U^T)$.

Proof: Note that

$$R \log 2 = \sum_{t=1}^{T} R_t \log 2 \ge \sum_{t=1}^{T} H(W_t) \ge \sum_{t=1}^{T} H(W_t|W^{t-1}, U^{t-1}) \ge \sum_{t=1}^{T} \big( H(W_t|W^{t-1}, U^{t-1}) - H(W_t|X^t, W^{t-1}, U^{t-1}) \big) = \sum_{t=1}^{T} I(X^t; W_t|W^{t-1}, U^{t-1}) \triangleq I(X^T \to W^T \| U^{T-1}).$$

The first inequality is due to the fact that the entropy of a discrete random variable cannot be greater than its log-cardinality. Notice that a factor $\log 2$ appears since we are using the natural logarithm in this paper. The second inequality holds because conditioning reduces entropy. The third inequality follows since entropy is nonnegative. The last quantity is known as the causally conditioned directed information [37]. The feedback data-processing inequality [24],

$$I(X^T \to U^T) \le I(X^T \to W^T \| U^{T-1}),$$

is applicable to complete the proof.

[Fig. 2. MDP over a discrete noiseless channel: the state transition $p_t(x_{t+1}|x_t, u_t)$ is observed by an encoder, which transmits a codeword $W_t$ over a discrete noiseless channel with alphabet size $2^{R_t}$ to a decoder that produces the control input $U_t$.]

Proposition 4 provides a fundamental performance limitation of a communication system when both the encoder and the decoder have full memories of the past. However, it is also meaningful to consider restricted scenarios in which the encoder and decoder have limited memories. For instance:

(A) The encoder stochastic kernel is of the form $e_t(w_t|x_{t-m}^t)$ and the decoder stochastic kernel is of the form $d_t(u_t|w_t, u_{t-n}^{t-1})$; or

(B) The encoder stochastic kernel is $e_t(w_t|x_{t-m}^t, u_{t-n}^{t-1})$ and the decoder is a deterministic function $u_t = d_t(w_t)$. The encoder has access to the past control inputs $u_{t-n}^{t-1}$ since they are predictable from the past codewords $w_{t-n}^{t-1}$, the decoder being a deterministic map.

The next proposition shows that the transfer entropy of degree $(m, n)$ provides a tighter lower bound in these cases.

Proposition 5: Suppose that the encoder and the decoder have the structures specified by (A) or (B) above. Then

$$R \log 2 \ge I_{m,n}(X^T \to U^T).$$

Proof: See Appendix D.

By solving the TERMDP (7) with different $\beta \ge 0$, one can draw a trade-off curve between $J(X^{T+1}, U^T)$ and $I(X^T \to U^T)$. Proposition 5 means that this trade-off curve shows a fundamental limitation of the achievable control performance under a given data rate.

The tightness of the lower bounds provided by Propositions 4 and 5 (i.e., whether it is possible to construct an encoder-decoder pair such that the data rate matches its lower bound while satisfying the desired control performance) is the natural next question. In the LQG control setup, this question has been studied in [38]–[41]. In these references, it is shown that the conservativeness of the lower bound provided by Proposition 4 is no greater than a small constant. Beyond the LQG setting, the question is currently wide open.

B. Maxwell's demon

Maxwell's demon is a physical device that can seemingly violate the second law of thermodynamics, and it has become a prototypical thought experiment connecting statistical physics and information theory [6]. One of the simplest forms of Maxwell's demon is a device called the Szilard engine. Below, we introduce an application of TERMDP to the analysis of the efficiency of a generalized Szilard engine extracting work at a nonzero rate (in contrast to the common assumption that the engine is operated infinitely slowly).

Consider a single-molecule gas trapped in a box ("engine") that is immersed in a thermal bath of temperature $T_0$ (Fig. 3). The state of the engine at time $t$ is represented by the position and the velocity of the molecule, which is denoted by $X_t \in \mathcal{X}$. Assume that the state space is divided into finitely many cells so that $\mathcal{X}$ is a finite set. Also, assume that the evolution of $X_t$ is described by a discrete-time random process.

At each time step $t = 0, 1, \dots, T-1$, suppose that one of the following three possible control actions $U_t$ can be applied: (i) insert a weightless barrier into the middle of the engine box and move it to the left at a constant velocity $v$ for a unit time, (ii) insert a barrier into the middle of the box and move it to the right at the velocity $v$ for a unit time, or (iii) do nothing. At the end of the control action, the barrier is removed from the engine. We assume that the insertion and removal of the barrier are frictionless and as such do not consume any work. The sequence of operations is depicted in Fig. 3. Denote by $p(x_{t+1}|x_t, u_t)$ the transition probability from the state $x_t$ to another state $x_{t+1}$ when the control action $u_t$ is applied. By $\mathbb{E}\, c(X_t, U_t)$ we denote the expected work required to apply the control action $u_t$ at time $t$ when the state of the engine is $x_t$.⁴ This quantity is negative if the controller is expected to extract work from the engine. Work extraction occurs when the gas molecule collides with the barrier and "pushes" it in the direction of its movement.

⁴Here, we do not provide a detailed model of the function $c(x_t, u_t)$. See, for instance, [42] for a model of work extraction based on the Langevin equation.

Right before applying a control action $U_t$, suppose that the controller makes a (possibly noisy) observation of the engine state, and thus there is an information flow from $X^t$ to $U^t$. For our discussion, there is no need to describe what kind of sensing mechanism is involved in this step. However, notice that if an error-free observation of the engine state $X_t$ is performed, then the controller can choose a control action such that $\mathbb{E}\, c(X_t, U_t)$ is always non-positive. (Consider moving the barrier always in the opposite direction from the position of the gas molecule.) At first glance, this seems to imply that one can construct a device that is expected to cyclically extract work from a single thermal bath, which is a contradiction to the Kelvin-Planck statement of the second law of thermodynamics.
a contradiction to the Kelvin-Planck statement of the second The tightness of the lower bounds provided by Proposi- law of thermodynamics. tions4 and5 (i.e., whether it is possible to construct an It is now widely recognized that this paradox (Maxwell’s encoder-decoder pair such that the data rate matches its lower demon) can be resolved by including the “memory” of the bound while satisfying the desired control performance) is the controller into the picture. Recently, a generalized second law natural next question. In the LQG control setup, this question is proposed by [10], in which transfer entropy plays a critical has been studied in [38]–[41]. In these references, it is shown role. Viewing the combined engine and memory system as that the conservativeness of the lower bound provided by a Bayesian network comprised of Xt and Ut (see [10] for Proposition4 is no greater than a small constant. Beyond the details), and assuming that the free energy change of the LQG setting, the question is currently wide open. engine from t = 0 to t = T is zero (which is the case when the above sequence of operations are repeated in a cyclic manner B. Maxwell’s demon with period T ), the generalized second law [10, equation (10)] Maxwell’s demon is a physical device that can seemingly reads violate the second law of thermodynamics, which turns out to T −1 be a prototypical thought-experiment that connects statistical X T −1 T −1 Ec(Xt,Ut) + kBT0I(X0 → U0 ) ≥ 0 (40) physics and information theory [6]. One of the simplest t=0 forms of Maxwell’s demon is a device called the Szilard where k [J/K] is the Boltzmann constant. The above in- engine. Below, we introduce an application of TERMDP to B equality shows that a positive amount of work is extractable the analysis of the efficiency of a generalized Szilard engine (i.e., the first term can be negative), but this is possible only extracting work at a non-zero rate (in contrast to the common at the expense of the transfer entropy cost (the second term assumption that the engine is operated infinitely slowly). Consider a single-molecule gas trapped in a box (“engine”) 4 Here, we do not provide a detailed model of the function c(xt, ut). See for that is immersed in a thermal bath of temperature T0 (Fig.3). instance [42] for a model of work extraction based on the Langevin equation. 11

At each time step t = 1, 2, ..., T , the state-dependent cost is defined by ct(xt, ut) = 0 if xt is already the target cell indicated by “G” in Fig.4, and c (x , u ) = 1 otherwise. 𝑣𝑣 t t t The terminal cost is 0 if xT +1 = G and 10000 otherwise. 0 T T 𝑇𝑇 We consider transfer entropy Im,n(X → U ) to quantify the price of information that Bob must acquire about Alice’s 𝑣𝑣 location. With some nonnegative weight β, the overall control problem can be written as (7). As shown in Fig.4, there are two qualitatively different (a) (b) (c) paths from the origin to the target. The path A is shorter than the path B, and hence Bob will try to navigate Alice along path ...... A when no information-theoretic cost is considered (i.e., β = 0 1 + 1 1 0). However, navigating Alice along the path A requires Bob to have accurate information about Alice’s current location, as Fig. 3. Modified Szilard engine.𝑡𝑡 The controller𝑡𝑡 performs𝑇𝑇 −the following𝑇𝑇 steps this path is “risky” (there are many side roads with dead ends). in a unit time. (a) The controller makes (a possibly noisy) observation of the The path B is longer, but navigating through it is simpler; state Xt of the engine. (b) One of the three possible control actions Ut (move the barrier to the left or to the right, or do nothing) is applied. (c) At the end rough knowledge about Alice’s location is sufficient to provide of control action, the barrier is removed. correct instructions. Hence, it is expected that Bob would try to navigate Alice through A when information is relatively cheap (β is small), while he would choose B when information is 5 must be positive). Given a fundamental law (40), a natural expensive (β is large). question is how efficient the considered thermal engine can be Fig.6 shows the solutions to the considered problem. t t−1 by optimally designing a control policy q(ut|x , u ). This Solutions are obtained by iterating Algorithm1 sufficiently can be analyzed by minimizing the left hand side of (40), many times in four different conditions. Each plot shows a which is precisely the TERMDP problem (7). snapshot of the state probability distribution µt(xt) at time t = 25. Fig.6 (a) is obtained under the setting that the cost VII.NUMERICALEXPERIMENT of information is high (β = 10), the planning horizon is long (T = 55), and the transfer entropy of degree (m, n) = (0, 0) In this section, we apply the proposed forward-backward is considered. Accordingly, a decision policy of the form Arimoto-Blahut algorithm (Algorithm1) to study how the of qt(ut|xt) is considered. It can be seen that with high price of information affects the level of information-frugality, probability, the agent is navigated through the longer path. which yields qualitatively different decision policies. In Fig.6 (b), the cost of information is reduced ( β = 1) Consider a situation in which Alice, whose movements while the other settings are kept the same. As expected, the are described by Markovian dynamics controlled by Bob, solution chooses the shorter path. Fig.5 shows the time- is traveling through a maze shown in Fig.4. Suppose that dependent information usage in (a) and (b); it shows that the Bob knows the geometry of the maze (including start and total information usage is greater in situation (b) than in (a). goal locations), but observing Alice’s location is costly. 
We We note that this simulation result is consistent with a model this problem as an MDP where the state Xt is the cell prior work [20], where similar numerical experiments were where Alice is located at time step t, and Ut is a navigation conducted. Using Algorithm1, we can further investigate the instruction given by Bob. The observation cost is characterized nature of the problem. Fig.6 (c) considers the same setting by the transfer entropy. We assume five different instructions as in (a) except that the planning horizon is shorter (T = 45). are possible; u = N, E, S, W and R, corresponding to go This result shows that the solution becomes qualitatively north, go east, go south, go west, and rest. The initial state is different depending on how close the deadline is even if the the cell indicated by “S” in Fig.4, and the motion of Alice is cost of information is the same. Finally, Fig.6 (d) considers described by a transition probability p(xt+1|xt, ut). the case where the transfer entropy has degree (m, n) = (0, 1) The transition probability is defined by the following rules. and the decision policy is of the form of qt(ut|xt, ut−1). At each cell, a transition to the indicated direction occurs Although the rest of simulation parameters are unchanged w.p. 0.8 if there is no wall in the indicated direction, while from (a), we observe that the shorter path is chosen in this transitions to any open directions (directions without walls) case. This result demonstrates that the solution to (7) can be occurs w.p. 0.05 each. With the remaining probability, Alice qualitatively different depending on the considered degree of stays in the same cell. If there is a wall in the indicated transfer entropy costs. direction, or u = R, then transition to each open direction occurs w.p. 0.05, while Alice stays in the same cell with the VIII.SUMMARY AND FUTURE WORK remaining probability. In this paper, we considered a mathematical framework of transfer-entropy-regularized Markov Decision Process (TER- 5The consistency with the classical second law is maintained if one accepts MDP), which is motivated both in engineering (networked Landauer’s principle, which asserts that erasure of one bit of information from control systems) and scientific (non-equilibrium thermody- any sort of memory device in an environment at temperature T0 [K] requires at least kB T0 log 2 [J] of work. See [43] for the further discussions. namics) contexts. We derived structural properties of the 12

At each time step $t = 1, 2, \dots, T$, the state-dependent cost is defined by $c_t(x_t, u_t) = 0$ if $x_t$ is already the target cell indicated by "G" in Fig. 4, and $c_t(x_t, u_t) = 1$ otherwise. The terminal cost is 0 if $x_{T+1} = \mathrm{G}$ and 10000 otherwise. We consider the transfer entropy $I_{m,n}(X^T \to U^T)$ to quantify the price of the information that Bob must acquire about Alice's location. With some nonnegative weight $\beta$, the overall control problem can be written as (7).

As shown in Fig. 4, there are two qualitatively different paths from the origin to the target. Path A is shorter than path B, and hence Bob will try to navigate Alice along path A when no information-theoretic cost is considered (i.e., $\beta = 0$). However, navigating Alice along path A requires Bob to have accurate information about Alice's current location, as this path is "risky" (there are many side roads with dead ends). Path B is longer, but navigating through it is simpler; rough knowledge of Alice's location is sufficient to provide correct instructions. Hence, it is expected that Bob will try to navigate Alice through A when information is relatively cheap ($\beta$ is small), while he will choose B when information is expensive ($\beta$ is large).

Fig. 6 shows the solutions to the considered problem. The solutions are obtained by iterating Algorithm 1 sufficiently many times in four different conditions. Each plot shows a snapshot of the state probability distribution $\mu_t(x_t)$ at time $t = 25$. Fig. 6 (a) is obtained under the setting that the cost of information is high ($\beta = 10$), the planning horizon is long ($T = 55$), and the transfer entropy of degree $(m, n) = (0, 0)$ is considered. Accordingly, a decision policy of the form $q_t(u_t|x_t)$ is considered. It can be seen that, with high probability, the agent is navigated through the longer path. In Fig. 6 (b), the cost of information is reduced ($\beta = 1$) while the other settings are kept the same. As expected, the solution chooses the shorter path. Fig. 5 shows the time-dependent information usage in (a) and (b); it shows that the total information usage is greater in situation (b) than in (a).

We note that this simulation result is consistent with prior work [20], where similar numerical experiments were conducted. Using Algorithm 1, we can further investigate the nature of the problem. Fig. 6 (c) considers the same setting as in (a) except that the planning horizon is shorter ($T = 45$). This result shows that the solution becomes qualitatively different depending on how close the deadline is, even if the cost of information is the same. Finally, Fig. 6 (d) considers the case where the transfer entropy has degree $(m, n) = (0, 1)$ and the decision policy is of the form $q_t(u_t|x_t, u_{t-1})$. Although the rest of the simulation parameters are unchanged from (a), we observe that the shorter path is chosen in this case. This result demonstrates that the solution to (7) can be qualitatively different depending on the considered degree of the transfer entropy cost.

[Fig. 4. Information-regularized optimal navigation through a maze, with start cell "S", goal cell "G", and two qualitatively different paths A and B.]

[Fig. 5. Information usage by the policies (a) and (b) in Fig. 6: mutual information $I(X_t; U_t)$ [nats] plotted against time.]

[Fig. 6. State probability distribution $\mu_t(x_t)$ at $t = 25$ for (a) $\beta = 10$, $T = 55$, $(m,n) = (0,0)$; (b) $\beta = 1$, $T = 55$, $(m,n) = (0,0)$; (c) $\beta = 10$, $T = 45$, $(m,n) = (0,0)$; (d) $\beta = 10$, $T = 55$, $(m,n) = (0,1)$.]

VIII. SUMMARY AND FUTURE WORK

In this paper, we considered the mathematical framework of transfer-entropy-regularized Markov Decision Process (TERMDP), which is motivated both in engineering (networked control systems) and scientific (non-equilibrium thermodynamics) contexts. We derived structural properties of the optimal solution, and provided a necessary optimality condition written as a set of coupled nonlinear equations. We proposed an iterative numerical algorithm (the forward-backward Arimoto-Blahut algorithm) for which every limit point of the generated sequence is guaranteed to be a stationary point of the given TERMDP. A numerical simulation (information-constrained maze navigation) demonstrated that the proposed algorithm can be used to find an optimal solution candidate.

The proposed algorithm has several limitations, as summarized in Remark 1, which must be addressed in future work. Improvement of the convergence speed of the proposed algorithm is also necessary for many applications. Finally, the roles of transfer entropy in engineering and scientific applications (including networked control systems, non-equilibrium thermodynamics, and beyond) and the implications of the TERMDP solutions studied in this paper need further investigation.

ACKNOWLEDGEMENT

The authors would like to thank the anonymous reviewers for their valuable suggestions.

APPENDIX

A. Proof of Proposition 1

Our proof is based on backward induction.

$k = T$: We prove (a) first. For a given policy $q_T(u_T|x^T, u^{T-1})$, let $\lambda_T(x^T, u^T) \triangleq q_T(u_T|x^T, u^{T-1})\, \mu_T(x^T, u^{T-1})$ be the joint distribution induced by $q_T$. Notice that

$$\lambda_T(x^T, u^{T-1}) = \mu_T(x^T, u^{T-1}) \quad (41)$$

holds by construction. Let $\lambda_T(x_T, u_{T-n}^{T})$ and $\lambda_T(x_T, u_{T-n}^{T-1})$ be marginals of $\lambda_T(x^T, u^T)$. Construct a new policy $q'_T$ as

$$q'_T(u_T|x_T, u_{T-n}^{T-1}) \triangleq \frac{\lambda_T(x_T, u_{T-n}^{T})}{\lambda_T(x_T, u_{T-n}^{T-1})} \quad (42)$$

if $\lambda_T(x_T, u_{T-n}^{T-1}) > 0$, and as an arbitrary probability distribution on $\mathcal{U}_T$ if $\lambda_T(x_T, u_{T-n}^{T-1}) = 0$. Let

$$\lambda'_T(x^T, u^T) \triangleq q'_T(u_T|x_T, u_{T-n}^{T-1})\, \mu_T(x^T, u^{T-1}) \quad (43)$$

be the joint distribution induced by $q'_T$. Then we have

$$\lambda'_T(x_T, u_{T-n}^{T}) = \lambda_T(x_T, u_{T-n}^{T}), \quad (44)$$

which can be directly verified as

$$\lambda'_T(x_T, u_{T-n}^{T}) = \sum_{x^{T-1} \in \mathcal{X}^{T-1}} \sum_{u^{T-n-1} \in \mathcal{U}^{T-n-1}} q'_T(u_T|x_T, u_{T-n}^{T-1})\, \mu_T(x^T, u^{T-1}) \quad (45a)$$
$$= \sum_{x^{T-1} \in \mathcal{X}^{T-1}} \sum_{u^{T-n-1} \in \mathcal{U}^{T-n-1}} q'_T(u_T|x_T, u_{T-n}^{T-1})\, \lambda_T(x^T, u^{T-1}) \quad (45b)$$
$$= q'_T(u_T|x_T, u_{T-n}^{T-1})\, \lambda_T(x_T, u_{T-n}^{T-1}) = \lambda_T(x_T, u_{T-n}^{T}), \quad (45c)$$

where the equality (45a) holds by the definition (43), (45b) by (41), and (45c) by the construction (42). Now the Bellman equation at $k = T$ reads

$$V_T(\mu_T(x^T, u^{T-1})) = \min_{q_T} J_T^c(\lambda_T) + J_T^I(\lambda_T),$$

where

$$J_T^c(\lambda_T) = \mathbb{E}^{\lambda_T, p_{T+1}}\big( c_T(X_T, U_T) + c_{T+1}(X_{T+1}) \big), \quad J_T^I(\lambda_T) = I_{\lambda_T}(X_{T-m}^T; U_T|U_{T-n}^{T-1}).$$

To establish (a) for $k = T$, it is sufficient to show that $J_T^c(\lambda_T) = J_T^c(\lambda'_T)$ and $J_T^I(\lambda_T) \ge J_T^I(\lambda'_T)$. The first equality holds because of (44). To see the second inequality,

$$J_T^I(\lambda_T) = I_{\lambda_T}(X_{T-m}^T; U_T|U_{T-n}^{T-1}) \quad (46a)$$
$$\ge I_{\lambda_T}(X_T; U_T|U_{T-n}^{T-1}) \quad (46b)$$
$$= I_{\lambda'_T}(X_T; U_T|U_{T-n}^{T-1}) \quad (46c)$$
$$= I_{\lambda'_T}(X_T; U_T|U_{T-n}^{T-1}) + I_{\lambda'_T}(X_{T-m}^{T-1}; U_T|X_T, U_{T-n}^{T-1}) \quad (46d)$$
$$= I_{\lambda'_T}(X_{T-m}^T; U_T|U_{T-n}^{T-1}) \quad (46e)$$
$$= J_T^I(\lambda'_T). \quad (46f)$$
(46f) t−1 T T Assumption 1: For each t = 1, 2, ..., T and (xt, ut−n) ∈ X × U t−1, we have λ(x , ut−1 ) > 0. The equality (46c) follows from (44). The second term in (46d) t t−n t t−n T −1 T −1 Although this assumption is somewhat restrictive, it is valid is zero since U is independent of X given (X ,U ) T T −m T T −n when, for instance, the transition probability p(x |x , u ) by construction of λ0 in (43). Hence the statement (a) is t t−1 t−1 T and the policy q(u |xt, ut−1) assign a nonzero probability established for k = T . t mass to each element of X and U , respectively. Denote by The statement (b) follows immediately from the above t t λ0(xT , uT ) the joint distribution induced by q0, i.e., discussion. To establish (c), notice that due to (a), we can T −1 T assume the optimal policy of the form qT (uT |xT , uT −n) 0 T T Y 0 t t−1 without loss of generality. Hence, due to (b), the Bellman λ (x , u ) = q (ut|x , u )p(xt|xt−1, ut−1) equation at k = T can be written as t=1 T t T T −1 n T −1 Y λ(xt, ut−n) VT (µT (x , u )) = min I(XT ; UT |UT −n) = t−1 p(xt|xt−1, ut−1). (48) q (u |x ,uT −1 ) T T T T −n t=1 λ(xt, ut−n) o + EcT (XT ,UT ) + EcT +1(XT +1) . We write the construction of q0 from q via the procedure above as q0 = π(q), where π : Q → Q0 is a map- T T −1 Since the last expression depends on µT (x , u ) ping from the space Q of general policies of the form T −1 only through its marginal µ (x , u ), we conclude t t−1 T 0 T T T −n {q(ut|x , u )} to the space Q of simplified policies of T T −1 t=1 that VT (µT (x , u )) is a function of the marginal the form {q(u |x , ut−1 )}T . Notice that Q0 ⊂ Q. T −1 t t t−n t=1 µT (xT , uT −n) only. This establishes (c) for k = T . We first show that π is a continuous mapping with respect k = t: Next, we assume (a), (b) and (c) hold for k = t + 1. to an appropriate metric on Q. Suppose q¯ ∈ Q and q˜ ∈ Q are To establish (a), (b) and (c) for k = t, notice that under the given policies, and let λ¯ and λ˜ be joint distributions induced induction hypothesis, we can assume without loss of generality by q¯ and q˜, respectively. We define a metric kq¯− q˜k as the `1 that policies for k = t + 1, t + 2, ..., T are of the form distance between λ¯ and λ˜: k−1 qk(uk|xk, uk−n), and that the value function at k = t + 1 t X ¯ T T ˜ T T depends only on µt+1(xt+1, ut−n+1). With a slight abuse of kq¯− q˜k , |λ(x , u ) − λ(x , u )|. (49) notation, the latter fact is written as xT ∈X T ,uT ∈U T
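The marginalize-and-condition construction (42)–(44) is easy to check numerically. The sketch below is our own illustration, not part of the original proof: the toy alphabet sizes are arbitrary, and the "remaining past" $(x^{T-1}, u^{T-n-1})$ is collapsed into a single index $w$ for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (our choice): w collapses the remaining past
# (x^{T-1}, u^{T-n-1}); x is x_T, h is u_{T-n}^{T-1}, u is u_T.
W, X, H, U = 3, 4, 2, 5

# A generic prior mu_T(x^T, u^{T-1}), represented over (w, x, h).
mu = rng.random((W, X, H))
mu /= mu.sum()

# A generic policy q_T(u_T | x^T, u^{T-1}), represented as q(u | w, x, h).
q = rng.random((W, X, H, U))
q /= q.sum(axis=-1, keepdims=True)

# Induced joint lambda_T = q_T * mu_T (definition preceding (41)).
lam = q * mu[..., None]                       # shape (W, X, H, U)

# Marginals lambda_T(x_T, u_{T-n}^T) and lambda_T(x_T, u_{T-n}^{T-1}).
lam_xhu = lam.sum(axis=0)                     # (X, H, U)
lam_xh = lam_xhu.sum(axis=-1)                 # (X, H)

# Construction (42); the uniform branch covers the zero-probability case.
q_prime = np.where(lam_xh[..., None] > 0,
                   lam_xhu / lam_xh[..., None], 1.0 / U)

# Joint induced by q'_T as in (43), and a numerical check of (44).
lam_prime = q_prime[None, ...] * mu[..., None]
assert np.allclose(lam_prime.sum(axis=0), lam_xhu)
print("marginal identity (44) verified")
```

The passing assertion reflects the cancellation in (45): multiplying the conditional (42) back by its own denominator marginal recovers $\lambda_T(x_T, u^T_{T-n})$ exactly.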

$k = t$: Next, we assume that (a), (b), and (c) hold for $k = t + 1$. To establish (a), (b), and (c) for $k = t$, notice that under the induction hypothesis we can assume without loss of generality that the policies for $k = t+1, t+2, \dots, T$ are of the form $q_k(u_k|x_k, u^{k-1}_{k-n})$, and that the value function at $k = t+1$ depends only on $\mu_{t+1}(x_{t+1}, u^t_{t-n+1})$. With a slight abuse of notation, the latter fact is written as

$$V_{t+1}(\mu_{t+1}(x^{t+1}, u^t)) = V_{t+1}(\mu_{t+1}(x_{t+1}, u^t_{t-n+1})).$$

Thus, the Bellman equation at $k = t$ can be written as

$$V_t(\mu_t(x^t, u^{t-1})) = \min_{q_t(u_t|x^t, u^{t-1})} \Big\{ \mathbb{E}\, c_t(X_t, U_t) + I(X^t_{t-m}; U_t \,|\, U^{t-1}_{t-n}) + V_{t+1}(\mu_{t+1}(x_{t+1}, u^t_{t-n+1})) \Big\}. \tag{47}$$

Now, using a construction similar to the case $k = T$, one can show that for every $q_t(u_t|x^t, u^{t-1})$ there exists a policy of the form $q_t'(u_t|x_t, u^{t-1}_{t-n})$ such that the value of the objective function on the right-hand side of (47) attained by $q_t'$ is less than or equal to the value attained by $q_t$. This observation establishes (a) for $k = t$. Statements (b) and (c) for $k = t$ follow similarly.

B. Local minima of simplified TERMDP

Let $q = \{q(u_t|x^t, u^{t-1})\}_{t=1}^T$ be a given policy. We say that the joint distribution $\lambda(x^T, u^T)$ is induced by $q$ if it is defined as

$$\lambda(x^T, u^T) = \prod_{t=1}^T q(u_t|x^t, u^{t-1})\, p(x_t|x_{t-1}, u_{t-1})$$

with the initial time convention $p(x_1|x_0, u_0) = \mu_1(x_1)$. As in the proof of Proposition 1, we construct a new policy $q' = \{q'(u_t|x_t, u^{t-1}_{t-n})\}_{t=1}^T$ from $\lambda(x^T, u^T)$ by

$$q'(u_t|x_t, u^{t-1}_{t-n}) = \frac{\lambda(x_t, u^t_{t-n})}{\lambda(x_t, u^{t-1}_{t-n})}$$

for each $t = 1, 2, \dots, T$. For simplicity of the analysis, we assume that $\lambda(x^T, u^T)$ satisfies the following condition:

Assumption 1: For each $t = 1, 2, \dots, T$ and $(x_t, u^{t-1}_{t-n}) \in \mathcal{X}_t \times \mathcal{U}^{t-1}_{t-n}$, we have $\lambda(x_t, u^{t-1}_{t-n}) > 0$.

Although this assumption is somewhat restrictive, it is valid when, for instance, the transition probability $p(x_t|x_{t-1}, u_{t-1})$ and the policy $q(u_t|x^t, u^{t-1})$ assign a nonzero probability mass to each element of $\mathcal{X}_t$ and $\mathcal{U}_t$, respectively. Denote by $\lambda'(x^T, u^T)$ the joint distribution induced by $q'$, i.e.,

$$\lambda'(x^T, u^T) = \prod_{t=1}^T q'(u_t|x_t, u^{t-1}_{t-n})\, p(x_t|x_{t-1}, u_{t-1}) = \prod_{t=1}^T \frac{\lambda(x_t, u^t_{t-n})}{\lambda(x_t, u^{t-1}_{t-n})}\, p(x_t|x_{t-1}, u_{t-1}). \tag{48}$$

We write the construction of $q'$ from $q$ via the procedure above as $q' = \pi(q)$, where $\pi: \mathcal{Q} \to \mathcal{Q}'$ is a mapping from the space $\mathcal{Q}$ of general policies of the form $\{q(u_t|x^t, u^{t-1})\}_{t=1}^T$ to the space $\mathcal{Q}'$ of simplified policies of the form $\{q(u_t|x_t, u^{t-1}_{t-n})\}_{t=1}^T$. Notice that $\mathcal{Q}' \subset \mathcal{Q}$.

We first show that $\pi$ is a continuous mapping with respect to an appropriate metric on $\mathcal{Q}$. Suppose $\bar{q} \in \mathcal{Q}$ and $\tilde{q} \in \mathcal{Q}$ are given policies, and let $\bar{\lambda}$ and $\tilde{\lambda}$ be the joint distributions induced by $\bar{q}$ and $\tilde{q}$, respectively. We define a metric $\|\bar{q} - \tilde{q}\|$ as the $\ell_1$ distance between $\bar{\lambda}$ and $\tilde{\lambda}$:

$$\|\bar{q} - \tilde{q}\| \triangleq \sum_{x^T \in \mathcal{X}^T,\, u^T \in \mathcal{U}^T} |\bar{\lambda}(x^T, u^T) - \tilde{\lambda}(x^T, u^T)|. \tag{49}$$
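On small alphabets, the induced joint $\lambda(x^T, u^T)$, and hence the metric (49), can be evaluated by brute force. The sketch below is our own illustration under assumed toy sizes; the enumeration over all trajectories is only viable for tiny instances and is not how the algorithm in the paper operates.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
nx, nu, T = 2, 2, 3                             # toy sizes (our choice)

mu1 = rng.dirichlet(np.ones(nx))                # initial distribution mu_1(x_1)
p = rng.dirichlet(np.ones(nx), size=(nx, nu))   # p[x_prev, u_prev, x_next]

def random_policy():
    # q[t] holds one distribution over u_t per history (x^t, u^{t-1}).
    return [rng.dirichlet(np.ones(nu), size=(nx,) * (t + 1) + (nu,) * t)
            for t in range(T)]

def induced_joint(q):
    # lambda(x^T, u^T): multiply policy and kernel factors along each
    # trajectory (x_1, u_1, ..., x_T, u_T), as in the product formula above.
    lam = np.zeros((nx, nu) * T)
    ranges = [range(nx) if i % 2 == 0 else range(nu) for i in range(2 * T)]
    for traj in itertools.product(*ranges):
        xs, us = traj[0::2], traj[1::2]
        prob = mu1[xs[0]]
        for t in range(T):
            if t > 0:
                prob *= p[xs[t - 1], us[t - 1], xs[t]]
            prob *= q[t][xs[:t + 1] + us[:t]][us[t]]
        lam[traj] = prob
    return lam

qbar, qtil = random_policy(), random_policy()
# The metric (49): l1 distance between the induced joint distributions.
print(np.abs(induced_joint(qbar) - induced_joint(qtil)).sum())
```

Measuring the distance between policies through their induced joints, rather than through the policy tensors directly, is what turns Claim 1 into a statement about the continuity of the rational map (48).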

Claim 1: Suppose $\bar{q} \in \mathcal{Q}$ and $\tilde{q} \in \mathcal{Q}$ satisfy Assumption 1. For every $\epsilon > 0$, there exists $\delta > 0$ such that $\|\bar{q} - \tilde{q}\| < \delta \Rightarrow \|\pi(\bar{q}) - \pi(\tilde{q})\| < \epsilon$.

Proof: Let $\bar{\lambda}'$ and $\tilde{\lambda}'$ be the joint distributions induced by $\bar{q}' = \pi(\bar{q})$ and $\tilde{q}' = \pi(\tilde{q})$, respectively. Notice that $\|\bar{q} - \tilde{q}\| < \delta$ means that $\sum_{\mathcal{X}^T, \mathcal{U}^T} |\bar{\lambda}(x^T, u^T) - \tilde{\lambda}(x^T, u^T)| < \delta$, which implies that

$$|\bar{\lambda}(x_t, u^t_{t-n}) - \tilde{\lambda}(x_t, u^t_{t-n})| < \delta \quad \text{and} \quad |\bar{\lambda}(x_t, u^{t-1}_{t-n}) - \tilde{\lambda}(x_t, u^{t-1}_{t-n})| < \delta.$$

Since (48) shows that $\lambda'(x^T, u^T)$ for each $(x^T, u^T) \in \mathcal{X}^T \times \mathcal{U}^T$ is a rational function of $\lambda(x_t, u^t_{t-n})$ and $\lambda(x_t, u^{t-1}_{t-n})$, and a rational function is continuous on the domain of strictly positive denominators, for each $\epsilon > 0$ and $(x^T, u^T) \in \mathcal{X}^T \times \mathcal{U}^T$ there exists $\delta > 0$ such that $\|\bar{q} - \tilde{q}\| < \delta$ implies

$$|\bar{\lambda}'(x^T, u^T) - \tilde{\lambda}'(x^T, u^T)| < \frac{\epsilon}{|\mathcal{X}^T||\mathcal{U}^T|},$$

which further implies

$$\|\pi(\bar{q}) - \pi(\tilde{q})\| = \|\bar{q}' - \tilde{q}'\| = \sum_{x^T \in \mathcal{X}^T,\, u^T \in \mathcal{U}^T} |\bar{\lambda}'(x^T, u^T) - \tilde{\lambda}'(x^T, u^T)| < \sum_{x^T \in \mathcal{X}^T,\, u^T \in \mathcal{U}^T} \frac{\epsilon}{|\mathcal{X}^T||\mathcal{U}^T|} = \epsilon.$$

The next claim shows that $\pi$ leaves an element of $\mathcal{Q}'$ invariant.

Claim 2: Suppose $q \in \mathcal{Q}'$ and the induced joint distribution $\lambda$ satisfies Assumption 1. Then $\pi(q) = q$.

Proof: By definition,

$$\lambda(x^t, u^t) = \prod_{k=1}^t q(u_k|x_k, u^{k-1}_{k-n})\, p(x_k|x_{k-1}, u_{k-1}).$$

Therefore,

$$\lambda(x_t, u^t_{t-n}) = \sum_{\substack{x^{t-1}\in\mathcal{X}^{t-1} \\ u^{t-n-1}\in\mathcal{U}^{t-n-1}}} \prod_{k=1}^t q(u_k|x_k, u^{k-1}_{k-n})\, p(x_k|x_{k-1}, u_{k-1}) = q(u_t|x_t, u^{t-1}_{t-n}) \sum_{x^{t-1}\in\mathcal{X}^{t-1}} p(x_t|x_{t-1}, u_{t-1}) \sum_{u^{t-n-1}\in\mathcal{U}^{t-n-1}} \prod_{k=1}^{t-1} q(u_k|x_k, u^{k-1}_{k-n})\, p(x_k|x_{k-1}, u_{k-1}) \tag{50}$$

and

$$\lambda(x_t, u^{t-1}_{t-n}) = \sum_{u_t \in \mathcal{U}_t} \lambda(x_t, u^t_{t-n}) = \sum_{x^{t-1}\in\mathcal{X}^{t-1}} p(x_t|x_{t-1}, u_{t-1}) \sum_{u^{t-n-1}\in\mathcal{U}^{t-n-1}} \prod_{k=1}^{t-1} q(u_k|x_k, u^{k-1}_{k-n})\, p(x_k|x_{k-1}, u_{k-1}). \tag{51}$$

Let $q' = \pi(q)$. Then, from (50) and (51), we have

$$q'(u_t|x_t, u^{t-1}_{t-n}) = \frac{\lambda(x_t, u^t_{t-n})}{\lambda(x_t, u^{t-1}_{t-n})} = q(u_t|x_t, u^{t-1}_{t-n}).$$

Therefore, $\pi(q) = q$.

We are now ready to prove the following proposition, which states that a local minimum of the simplified TERMDP (13) is a local minimum of the original TERMDP (7). For a given policy $q \in \mathcal{Q}$, denote by

$$f(q) = J(X^{T+1}, U^T) + \beta I_{m,n}(X^T \to U^T)$$

the value of the TERMDP (7). By Proposition 1, it also follows that if $q \in \mathcal{Q}'$, then $f(q)$ is equal to the value of the simplified TERMDP (13), i.e.,

$$f(q) = J(X^{T+1}, U^T) + \beta I_{0,n}(X^T \to U^T).$$

Proposition 6: Suppose $q^* = \{q^*(u_t|x_t, u^{t-1}_{t-n})\}_{t=1}^T \in \mathcal{Q}'$ satisfies Assumption 1. If $q^*$ is a local minimizer for the simplified TERMDP (13), i.e.,

(a) there exists $\epsilon > 0$ such that $f(q^*) \leq f(q')$ holds for all $q' \in \mathcal{Q}'$ with $\|q' - q^*\| < \epsilon$,

then $q^*$ is a local minimizer for the original TERMDP (7), i.e.,

(b) there exists $\delta > 0$ such that $f(q^*) \leq f(q)$ holds for all $q \in \mathcal{Q}$ with $\|q - q^*\| < \delta$.

Proof: Pick a constant $\epsilon > 0$ such that the condition (a) holds. By the continuity of $\pi$ (Claim 1), there exists a constant $\delta > 0$ such that

$$q \in \mathcal{Q},\ \|q - q^*\| < \delta \ \Rightarrow\ \|\pi(q) - q^*\| < \epsilon. \tag{52}$$

Here, we have used the fact that $\pi(q^*) = q^*$ (Claim 2). To complete the proof by contradiction, suppose the negation of the condition (b) holds:

(¬b) for every $\delta > 0$, there exists $\bar{q} \in \mathcal{Q}$ such that $\|\bar{q} - q^*\| < \delta$ and $f(\bar{q}) < f(q^*)$.

Now, pick a policy $\bar{q} \in \mathcal{Q}$ such that $\|\bar{q} - q^*\| < \delta$ and

$$f(\bar{q}) < f(q^*). \tag{53}$$

If we write $\bar{q}' \triangleq \pi(\bar{q}) \in \mathcal{Q}'$, it also follows from (52) that

$$\|\bar{q}' - q^*\| < \epsilon. \tag{54}$$

By Proposition 1, (53) implies that

$$f(\bar{q}') \leq f(\bar{q}) < f(q^*). \tag{55}$$

However, (54) and (55) contradict (a). Therefore, the condition (b) must hold.

C. Proof of Proposition 2

For each $n' \geq n$,

$$I_{0,n'}(X^T \to U^T) = \sum_{t=1}^T I(X_t; U_t \,|\, U^{t-1}_{t-n'}) = \sum_{t=1}^T \Big( H(U_t|U^{t-1}_{t-n'}) - H(U_t|X_t, U^{t-1}_{t-n'}) \Big) = \sum_{t=1}^T \Big( H(U_t|U^{t-1}_{t-n'}) - H(U_t|X_t, U^{t-1}_{t-n}) \Big).$$

In the last step, we used the fact that $H(X|Y,Z) = H(X|Y)$ holds when $X$ and $Z$ are conditionally independent given $Y$. By the structure of $q_t(u_t|x_t, u^{t-1}_{t-n})$, $U_t$ and $U^{t-n-1}_{t-n'}$ are conditionally independent given $(X_t, U^{t-1}_{t-n})$. Now,

$$I_{0,n'}(X^T \to U^T) - I_{0,n'+1}(X^T \to U^T) = \sum_{t=1}^T \Big( H(U_t|U^{t-1}_{t-n'}) - H(U_t|U^{t-1}_{t-n'-1}) \Big) \geq 0$$

since entropy never increases by conditioning.
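The final step rests on the fact that conditioning never increases entropy. As a quick numerical sanity check, the sketch below (our own illustration; the function name and toy sizes are assumptions) compares $H(U|V)$ with $H(U|V,W)$ for a random joint distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

# A random joint distribution over (U, V, W), all binary (toy sizes).
p = rng.random((2, 2, 2))
p /= p.sum()

def cond_entropy(p_joint, cond_axes):
    """Entropy of axis 0 given the axes in cond_axes, in nats."""
    keep = (0,) + tuple(cond_axes)
    drop = tuple(ax for ax in range(p_joint.ndim) if ax not in keep)
    pj = p_joint.sum(axis=drop) if drop else p_joint  # joint of (U, conditioners)
    pc = pj.sum(axis=0, keepdims=True)                # conditioners' marginal
    return -(pj * np.log(pj / pc)).sum()

h_v = cond_entropy(p, (1,))      # H(U | V)
h_vw = cond_entropy(p, (1, 2))   # H(U | V, W)
assert h_v >= h_vw - 1e-12       # conditioning never increases entropy
print(h_v, h_vw)
```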

D. Proof of Proposition 5

For each $t = 1, 2, \dots, T$, we have

$$I(X^t_{t-m}; W_t \,|\, U^{t-1}_{t-n}) = I(X^t_{t-m}; W_t, U_t \,|\, U^{t-1}_{t-n}) = I(X^t_{t-m}; U_t \,|\, U^{t-1}_{t-n}) + I(X^t_{t-m}; W_t \,|\, U^t_{t-n}) \geq I(X^t_{t-m}; U_t \,|\, U^{t-1}_{t-n}).$$

The first equality is due to the particular structure of the decoder specified by (A) or (B). Thus

$$\sum_{t=1}^T I(X^t_{t-m}; W_t \,|\, U^{t-1}_{t-n}) \geq I_{m,n}(X^T \to U^T).$$

The proof of Proposition 5 is complete by noticing the following chain of inequalities:

$$R \log 2 = \sum_{t=1}^T R_t \log 2 \geq \sum_{t=1}^T H(W_t) \geq \sum_{t=1}^T H(W_t \,|\, W^{t-1}, U^{t-1}_{t-n}) \geq \sum_{t=1}^T \Big( H(W_t|U^{t-1}_{t-n}) - H(W_t|X^t_{t-m}, U^{t-1}_{t-n}) \Big) = \sum_{t=1}^T I(X^t_{t-m}; W_t \,|\, U^{t-1}_{t-n}).$$
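Two links of this chain can be checked numerically on a toy example: the cardinality bound $R_t \log 2 \geq H(W_t)$, since $W_t$ takes at most $2^{R_t}$ values, and the data-processing step behind the first display (here with the conditioning on $U^{t-1}_{t-n}$ dropped for brevity). In the sketch below, which is our own illustration with a hypothetical decoder map `g` and arbitrary sizes, $U_t$ is a deterministic function of the codeword $W_t$, mimicking the decoder structures (A)/(B).

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy joint over (X_t, W_t): 4 source symbols, 8 codewords, i.e. R_t = 3 bits
# (sizes are our own choice for illustration).
p_xw = rng.random((4, 8))
p_xw /= p_xw.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def mutual_info(p_joint):
    px = p_joint.sum(axis=1, keepdims=True)
    py = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return (p_joint[mask] * np.log(p_joint[mask] / (px * py)[mask])).sum()

# Link 1 of the chain: R_t log 2 >= H(W_t), since W_t takes 2^{R_t} values.
assert 3 * np.log(2) >= entropy(p_xw.sum(axis=0)) - 1e-12

# A hypothetical deterministic decoder U_t = g(W_t), many-to-one.
g = np.array([0, 0, 1, 1, 2, 2, 0, 1])
p_xu = np.zeros((4, 3))
for w, u in enumerate(g):
    p_xu[:, u] += p_xw[:, w]

# Data-processing step: I(X; W) >= I(X; U) along the chain X - W - U.
assert mutual_info(p_xw) >= mutual_info(p_xu) - 1e-12
print("both inequalities hold on this toy example")
```

Both checks correspond to individual links of the displayed chain; the conditional versions used in the proof follow from the same arguments applied per realization of $U^{t-1}_{t-n}$.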