A Unified Bellman Equation for Causal Information and Value in Markov Decision Processes

Stas Tiomkin 1 Naftali Tishby 1 2

1 The Benin School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel. 2 The Edmond and Lily Safra Center for Brain Sciences, The Hebrew University, Jerusalem, Israel. Correspondence to: Stas Tiomkin, Naftali Tishby.

Abstract

The interaction between an artificial agent and its environment is bi-directional. The agent extracts relevant information from the environment, and affects the environment by its actions in return, to accumulate a high expected reward. Standard reinforcement learning (RL) deals with expected reward maximization. However, there are always information-theoretic limitations that restrict the expected reward, and these are not properly considered by standard RL. In this work we consider RL objectives with information-theoretic limitations. For the first time we derive a Bellman-type recursive equation for the causal information between the environment and the agent, which combines plausibly with the Bellman recursion for the value function. The unified equation serves to explore the typical behavior of artificial agents in an infinite time horizon.

1. Introduction

The interaction between an organism and the environment consists of three major components: utilization of past experience, observation of the current environmental state, and generation of a behavioral policy. The latter component, planning, is an essential feature of intelligent systems for survival in limited-resource environments. Organisms with long-term planning can accumulate more resources and avoid possible catastrophic states in the future, which makes them evolutionarily superior to 'short-term planners'.

An intelligent agent combines past experience with the current observations to act upon the environment. This feedback interaction induces complex statistical correlations over different time scales between environment state trajectories and the agent's action trajectories. Infinite-time interaction patterns (infinite state and action trajectories) define the typical behavior of an organism in a given environment. This typical behavior is crucial for the design and analysis of intelligent systems. In this work we derive the typical behavior within the formalism of a reinforcement learning (RL) model subject to information-theoretic constraints.

In the standard RL model (Sutton & Barto, 1998), an artificial organism generates an optimal policy through an interaction with the environment. Typically, the optimality is taken with regard to a reward, such as energy, money, social network 'likes/dislikes', time, etc. Intriguingly, the reward-associated value function (an average accumulated reward) possesses the same property for different types of rewards: the optimal value function is a Lyapunov function (Perkins & Barto, 2002), a generalized energy function.

A principled way to solve RL (as well as optimal control) problems is to find a corresponding generalized energy function. Specifically, the Bellman recursive equation (Bertsekas, 1995) is the Lyapunov function for Markov decision processes. Strictly speaking, the standard RL framework is about the discovery and minimization (maximization) of an energetic quantity, while the Bellman recursion is an elegant tool for finding the optimal policy for both finite and infinite planning horizons.

However, this model is incomplete. Physics suggests that there is always an interplay between energetic and entropic quantities. Recent advances in artificial intelligence show that the behavior of realistic artificial agents is indeed affected simultaneously by energetic and entropic quantities (Tishby & Polani, 2011; Rubin et al., 2012). The finite rate of information transfer is a fundamental constraint for any value accumulation mechanism.

Information-theoretic constraints are ubiquitous in the interaction between an agent and the environment. For example, a sensor has a limited bandwidth, a processor has a limited information processing rate, a memory has a limited information update rate, etc. These are not purely theoretical considerations, but rather practical requirements for building and analyzing realistic intelligent agents. Moreover, an artificial agent often needs to limit the available information to the information relevant for a particular task (Polani et al., 2006). For example, when looking at a picture, the visual stimulus strikes the retina at the rate of the speed of light, which is decreased dramatically to leave only the relevant information rate that is essential for understanding the picture.

Different types of entropic constraints have been studied in the context of RL. KL-control (Todorov et al., 2006; Tishby & Polani, 2011; Kappen et al., 2012) introduces a D_KL cost at each state, which penalizes complex policies, where complexity is measured by the KL divergence between a prior policy, or uncontrolled dynamics, and the actual policy. Importantly, KL-control does not directly address the question of the information transfer rate.

Some recent works do explicitly consider the effects of information constraints on the performance of a controller over an infinite time horizon (Tatikonda & Mitter, 2004; Tanaka & Sandberg, 2015; Tanaka et al., 2015). However, these works are mostly limited to linear dynamics.

Obviously, feedback implies causality: the current action depends on previous and/or present information alone. Consequently, the information constraints should address the causality of the interaction.

In this work we consider directed information constraints over an infinite time horizon in the framework of Markov decision processes, where the directed information is between the environment state trajectories and the agent's action trajectories. For the first time we derive a Bellman-type equation for the causal information, and combine it with the Bellman recursion for value. The unified Information-Value Bellman equation enables us to derive the typical optimal behavior over an infinite time horizon in the MDP setting.

In addition, this work has practical implications for the design criteria of optimal intelligent agents. Specifically, the information processing rate of the 'brain' (processor) of an artificial agent should be higher than the minimal information rate required to solve a particular MDP problem.

The interaction between an agent and the environment is bi-directional: an agent extracts information and affects the environment by its actions in return. To provide a comprehensive analysis of this bi-directional interaction, we consider both information channels. These channels are dual, comprising the action-perception cycle of reinforcement learning under information-theoretic constraints. We derive a Bellman-type equation for the directed information from action trajectories to state trajectories as well.

The minimal directed information rate from the environment to the agent comprises a constraint for expected reward maximization in standard goal-oriented RL test bench tasks. By contrast, the maximal directed information rate from the agent to the environment comprises a proxy for solving different AI test bench tasks without specifying the target, which can be considered as self-motivated behavior, being ready for any possible target. We elaborate on the difference between these two opposite information rates by their corresponding analytic expressions, and by numerical simulations of the different types of problems which can be solved by them.

The paper is organized as follows. In Section 2 we provide background on causal conditioning, directed information, and Markov decision processes. In Section 3 we present the Bellman-type recursive equation for the directed information from the environment to the agent's actions. In Section 4 we show the unified recursion for the directed information and the common recursion for the value function. The optimization problem for finding the minimal directed information rate required to achieve the maximal expected reward in an infinite time horizon is stated in Section 4.1. The numerical simulation of the unified Bellman equation is provided in Section 5, where we consider the standard maze-escaping problem with a predefined target. In Section 6 we consider the dual problem of the maximal directed information rate from the agent to the environment, and derive the dual Bellman equation. In Section 7 we provide a numerical simulation for the dual Bellman equation, where we consider tasks without a predefined target; we compare our solution to the standard algorithms for the average all-pairs shortest path problem. Finally, in Section 8 we summarize the paper and provide directions for the continuation of this work.

2. Background

In this section we overview the theoretical background for this work. Specifically, we briefly review the framework of Markov decision processes. Then, we review causal conditioning and directed information, and define the required quantities.

2.1. Interaction Model

We assume that the agent and the environment interact at discrete time steps for t ∈ (−∞, ..., T). At time t, the environment appears at state S_t ∈ S, the agent observes the state, and affects the environment by its action A_t ∈ A according to the policy π(A_t | S_t). The environmental evolution is given by the transition probability distribution, p(S_{t+1} | A_t, S_t). This interaction model is plausibly described by the probabilistic graph shown in Figure 1, where the arrows denote the directions of causal dependency.

Figure 1. A probabilistic directed graph for a Markov decision process.
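To make the interaction model concrete, the following minimal Python sketch rolls out the loop of Figure 1 for a small, randomly generated MDP. The state and action space sizes, the horizon, and the random instance are illustrative assumptions of this sketch, not part of the paper.

```python
import numpy as np

# Minimal rollout of the interaction model in Section 2.1:
# A_t ~ pi(. | S_t), then S_{t+1} ~ p(. | S_t, A_t).
rng = np.random.default_rng(0)
n_states, n_actions, horizon = 4, 2, 10      # illustrative sizes

p_next = rng.random((n_states, n_actions, n_states))
p_next /= p_next.sum(axis=2, keepdims=True)  # transition p(s' | s, a)
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)          # policy pi(a | s)

s = 0
states, actions = [s], []
for t in range(horizon):
    a = rng.choice(n_actions, p=pi[s])        # agent acts on the observed state
    s = rng.choice(n_states, p=p_next[s, a])  # environment responds
    actions.append(a)
    states.append(s)
```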

As a result of this feedback interaction, the environment state trajectories, S^T_{-∞}, defined by (1), and the agent action trajectories, A^T_{-∞}, defined by (2), become statistically correlated:

$$S_{-\infty}^{T} := (\ldots, S_{t-1}, S_t, S_{t+1}, \ldots, S_T), \quad (1)$$

$$A_{-\infty}^{T} := (\ldots, A_{t-1}, A_t, A_{t+1}, \ldots, A_T). \quad (2)$$

The corresponding trajectories of finite length T, and their time-delayed versions, are given by:

$$S_{t+1}^{T} := (S_{t+1}, S_{t+2}, \ldots, S_T), \qquad A_{t+1}^{T} := (A_{t+1}, A_{t+2}, \ldots, A_T),$$

$$DS_{t+1}^{T} := (S_{t}, S_{t+1}, \ldots, S_{T-1}), \qquad DA_{t+1}^{T} := (A_{t}, A_{t+1}, \ldots, A_{T-1}),$$

where D denotes the time-delay operator.

Remark 1. These definitions imply that the original sequence, S^T_{t+1}, includes the time-delayed sequence, DS^T_{t+1}, except for its first component, S_t. The union of S^T_{t+1} and DS^T_{t+1} is the sequence S^T_t, of length T - t + 1.

2.1.1. Bellman Recursion for Value

Extending the interaction model in Section 2.1 by introducing the reward function, r(S_{t+1} | A_t, S_t), makes the agent plan its actions 'consciously' by adapting the policy, π(A_t | S_t), in order to achieve a higher reward within a given time horizon. The optimal policy, π(A_t | S_t), that guarantees the maximal expected reward is found by the policy iteration algorithm (Sutton & Barto, 1998), where the value of π(A_t | S_t) is estimated by the value function. It can be computed recursively:

$$Q^{\pi}(S_t, A_t) = \big\langle r(S_{t+1}; A_t, S_t) \big\rangle_{p(S_{t+1}\mid S_t, A_t)} + \gamma_t \big\langle Q(S_{t+1}, A_{t+1}) \big\rangle_{\pi(A_{t+1}\mid S_{t+1})\, p(S_{t+1}\mid S_t, A_t)}, \quad (3)$$

where γ_t is the discounting factor. The Bellman recursion is a fundamental concept in optimal planning. A comprehensive review of reinforcement learning can be found in (Bertsekas, 1995; Sutton & Barto, 1998; Puterman, 2014).
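As a concrete illustration of the recursion in Eq. (3), the sketch below performs iterative policy evaluation on a small randomly generated MDP. The fixed uniform policy, the random reward tensor, and the constant discount factor are assumptions made only for this example.

```python
import numpy as np

# Iterating Eq. (3): Q(s,a) <- E_{s'}[ r(s'; a, s) + gamma * E_{a'~pi}[ Q(s', a') ] ]
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9       # illustrative sizes and discount

p_next = rng.random((n_states, n_actions, n_states))
p_next /= p_next.sum(axis=2, keepdims=True)  # p(s' | s, a)
reward = rng.standard_normal((n_states, n_actions, n_states))
pi = np.full((n_states, n_actions), 1.0 / n_actions)

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    V_next = (pi * Q).sum(axis=1)            # <Q(s', a')>_{pi(a'|s')}
    Q = (p_next * (reward + gamma * V_next[None, None, :])).sum(axis=2)
```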

2.2. Causal Conditioning

A random process can be conditioned by another random process either causally or non-causally, respectively:

$$p(A_{t+1}^{T} \,\|\, S_{t+1}^{T}) := \prod_{i=t+1}^{T} p(A_i \mid A_{t+1}^{i-1}, S_{t+1}^{i}), \quad (4)$$

$$p(A_{t+1}^{T} \mid S_{t+1}^{T}) := \prod_{i=t+1}^{T} p(A_i \mid A_{t+1}^{i-1}, S_{t+1}^{T}). \quad (5)$$

The causal conditioning assumes time-synchronization between the sequences. A random process can also be causally conditioned on a random variable at a particular time. Compare the following causal scalar conditioning,

$$p(A_{t+1}^{T} \,\|\, S_{t+\tau}) = \prod_{i=t+1}^{t+\tau-1} p(A_i \mid A_{t+1}^{i-1}) \prod_{i=t+\tau}^{T} p(A_i \mid A_{t+1}^{i-1}, S_{t+\tau}),$$

vs. non-causal scalar conditioning:

$$p(A_{t+1}^{T} \mid S_{t+\tau}) = \prod_{i=t+1}^{T} p(A_i \mid A_{t+1}^{i-1}, S_{t+\tau}).$$

A random sequence can be conditioned simultaneously by another random sequence and by a scalar random variable. In particular, the following two expressions will be useful in the next section:

$$p(A_{t+1}^{T} \,\|\, S_{t}^{T}, A_t) = \prod_{i=t+1}^{T} p(A_i \mid S_{t}^{i}, A_{t+1}^{i-1}, A_t),$$

$$p(A_{t+1}^{T} \,\|\, S_{t}^{T-1}, A_t) = \prod_{i=t+1}^{T} p(A_i \mid S_{t}^{i-1}, A_{t+1}^{i-1}, A_t),$$

which, due to the graph in Figure 1, reduce to:

$$p(A_{t+1}^{T} \,\|\, S_{t}^{T}, A_t) = \prod_{i=t+1}^{T} p(A_i \mid S_i), \quad (6)$$

$$p(A_{t+1}^{T} \,\|\, S_{t}^{T-1}, A_t) = \prod_{i=t+1}^{T} p(A_i \mid S_{i-1}, A_{i-1}). \quad (7)$$

2.2.1. Causal Entropy

Following the causally conditioned distribution (4), the causally conditioned entropy is defined by:

$$H(A_{t+1}^{T} \,\|\, S_{t+1}^{T}) = -\big\langle \ln p(A_{t+1}^{T} \,\|\, S_{t+1}^{T}) \big\rangle_{p(A_{t+1}^{T}, S_{t+1}^{T})}. \quad (8)$$

The causally conditioned entropy satisfies the chain rule (and the other properties of entropy, (Kramer, 1998)):

$$H(A_{t+1}^{T} \,\|\, S_{t+1}^{T}) = \sum_{i=t+1}^{T} H(A_i \mid S_{t+1}^{i}, A_{t+1}^{i-1}). \quad (9)$$

The causally conditioned entropy enables us to define the directed (causal) information between two sequences of random variables.

2.2.2. Directed Information

The Shannon mutual information between two sequences of random variables, X_1^T and Y_1^T, is given by the decrease in the entropy of one variable due to conditioning on the other (Cover & Thomas, 2012):

$$I[X; Y] = H(X) - H(X \mid Y). \quad (10)$$

The directed information is defined similarly as a decrease in entropy, but in this case the conditional entropy is the causally conditioned entropy (Massey & Massey, 2005; Kramer, 1998):

$$\vec{I}[X \to Y] := H(X) - H(X \,\|\, Y). \quad (11)$$

The essential difference between the ordinary mutual information (10) and the directed information (11) is the lack of symmetry:

$$\vec{I}[X \to Y] \ne \vec{I}[Y \to X]. \quad (12)$$

A comprehensive review of the properties of directed information can be found in (Kramer, 1998; Marko, 1973; Massey & Massey, 2005).
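The asymmetry stated in Eq. (12) is easy to check numerically. The sketch below evaluates Eqs. (10)-(11) for two-step binary sequences (X_1, X_2) and (Y_1, Y_2) drawn from an arbitrary joint distribution; the random joint distribution and the base-2 logarithm are assumptions of this illustration.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# joint distribution over (X1, Y1, X2, Y2), axes 0..3 (an arbitrary random example)
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2))
p /= p.sum()

def H(axes):
    """Entropy of the marginal over the listed axes."""
    drop = tuple(i for i in range(4) if i not in axes)
    return entropy(p.sum(axis=drop) if drop else p)

# ordinary mutual information, Eq. (10):  I[X;Y] = H(X) - H(X|Y)
I_sym = H((0, 2)) + H((1, 3)) - H((0, 1, 2, 3))

# directed information, Eq. (11):  I[X -> Y] = H(X) - H(X || Y),
# with H(X || Y) = H(X1|Y1) + H(X2|X1,Y1,Y2) by the chain rule (9)
H_X_given_Y_causal = (H((0, 1)) - H((1,))) + (H((0, 1, 2, 3)) - H((0, 1, 3)))
I_xy = H((0, 2)) - H_X_given_Y_causal

# the reverse direction, used in Eq. (12)
H_Y_given_X_causal = (H((0, 1)) - H((0,))) + (H((0, 1, 2, 3)) - H((0, 1, 2)))
I_yx = H((1, 3)) - H_Y_given_X_causal

print(I_sym, I_xy, I_yx)   # in general I_xy != I_yx, illustrating Eq. (12)
```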

3. Information from Environment to Agent

We equip the agent with the ability to remember the previous state and action. Memory is a desired feature of intelligent agents: memoryless agents can act only reactively, without learning from experience. Specifically, we model the memory by causally conditioning the directed information flow from the environment to the agent. Assuming that at time t the environment state and the agent's action were S_t and A_t, respectively, the causally conditioned directed information is given by:

$$\vec{I}\big[S_{t+1}^{T} \to A_{t+1}^{T} \,\big\|\, DS_{t+1}^{T}, A_t\big] = \Big\langle \ln \frac{p(A_{t+1}^{T} \,\|\, S_{t+1}^{T}, DS_{t+1}^{T}, A_t)}{p(A_{t+1}^{T} \,\|\, DS_{t+1}^{T}, A_t)} \Big\rangle_{p(S_{t+1}^{T}, A_{t+1}^{T} \mid S_t, A_t)},$$

which, following Remark 1, is equivalent to:

$$= \Big\langle \ln \frac{p(A_{t+1}^{T} \,\|\, S_{t}^{T}, A_t)}{p(A_{t+1}^{T} \,\|\, S_{t}^{T-1}, A_t)} \Big\rangle_{p(S_{t+1}^{T}, A_{t+1}^{T} \mid S_t, A_t)}, \quad (13)$$

and, according to Eq. (6) and Eq. (7), is decomposed to:

$$= \Big\langle \ln \frac{\prod_{i=t+1}^{T} p(A_i \mid S_i)}{\prod_{i=t+1}^{T} p(A_i \mid S_{i-1}, A_{i-1})} \Big\rangle_{p(S_{t+1}^{T}, A_{t+1}^{T} \mid S_t, A_t)} = \sum_{i=t+1}^{T} \Big\langle \ln \frac{p(A_i \mid S_i)}{p(A_i \mid S_{i-1}, A_{i-1})} \Big\rangle_{p(S_{t+1}^{T}, A_{t+1}^{T} \mid S_t, A_t)}. \quad (14)$$

3.1. Bellman Equation for I⃗[States → Actions]

The expression (14) enables us to write the directed information recursively:

$$\vec{I}\big[S_{t+1}^{T} \to A_{t+1}^{T} \,\big\|\, DS_{t+1}^{T}, A_t\big] = \Big\langle \ln \frac{p(A_{t+1} \mid S_{t+1})}{p(A_{t+1} \mid S_t, A_t)} \Big\rangle_{p(S_{t+1}, A_{t+1} \mid S_t, A_t)} + \sum_{i=t+2}^{T} \Big\langle \ln \frac{p(A_i \mid S_i)}{p(A_i \mid S_{i-1}, A_{i-1})} \Big\rangle_{p(S_{t+1}^{T}, A_{t+1}^{T} \mid S_t, A_t)}$$

$$= I[S_{t+1}; A_{t+1} \mid S_t, A_t] + \Big\langle \vec{I}\big[S_{t+2}^{T} \to A_{t+2}^{T} \,\big\|\, DS_{t+2}^{T}, A_{t+1}\big] \Big\rangle_{p(S_{t+1}, A_{t+1} \mid S_t, A_t)}. \quad (15)$$

The directed information in (13) is a function of S_t and A_t, whereas the directed information in (15), inside the expectation operator ⟨·⟩_{p(S_{t+1},A_{t+1}|S_t,A_t)}, is a function of S_{t+1} and A_{t+1}. Consequently, the recursive equation in (15) can be decomposed into the first step and all the future steps:

$$\vec{I}_T(S_t, A_t) = I[S_{t+1}; A_{t+1} \mid S_t, A_t] + \big\langle \vec{I}_T(S_{t+1}, A_{t+1}) \big\rangle_{\pi(A_{t+1}\mid S_{t+1})\, p(S_{t+1}\mid S_t, A_t)}, \quad (16)$$

where the first term is the local step of the recursion and the second term is the future directed information. This expression comprises the Bellman-type equation for the directed information between the environment and the agent. The time horizon of the future state and action trajectories is marked by the subscript in the directed information, I⃗_T(S_t, A_t).

Information processing is an integral part of the interaction between the agent and its environment. An optimal agent strives to accumulate as much value as possible under the constraints of limited bandwidth and information transfer rate. In the next section we show that the Bellman recursion for the directed information and the Bellman recursion for the value naturally combine into a unified Bellman recursion.

4. Unified Bellman Recursion

As mentioned above, the optimal control solution should adhere to the information processing constraints, which are always present in real-life intelligent systems. For example, the limited bandwidth and/or the time delays due to the distance between a controller and a plant are ubiquitous information-theoretic constraints, which may reduce the maximal achievable value.

Consequently, the designer of an intelligent agent should always address the question: what is the minimal information rate required to achieve the maximal value? This question is formulated by the unconstrained Lagrangian which comprises the unified Bellman recursion.

We found that the Bellman recursion for the directed information (16) and the Bellman recursion for the value (3) combine naturally to form the unified Information-Value Bellman recursion:

$$G^{\pi}_{T}(S_t, A_t, \beta) = \vec{I}^{\pi}_{T}(S_t, A_t) - \beta Q^{\pi}(S_t, A_t)$$

$$= I^{\pi}_{T}\big[S_{t+1}; A_{t+1} \mid S_t, A_t\big] - \beta \big\langle r(S_{t+1}; S_t, A_t) \big\rangle_{p(S_{t+1}\mid S_t, A_t)} + \big\langle \vec{I}^{\pi}_{T}(S_{t+1}, A_{t+1}) - \beta Q^{\pi}(S_{t+1}, A_{t+1}) \big\rangle_{\pi(A_{t+1}\mid S_{t+1})\, p(S_{t+1}\mid S_t, A_t)}. \quad (17)$$

Writing the expectations in (17) explicitly, and identifying the terms within the second expectation operator with the future averaged value of G^π_T(S_{t+1}, A_{t+1}, β), we get the unified Bellman recursion:

$$G^{\pi}_{T}(S_t, A_t, \beta) = \Big\langle \pi(A_{t+1} \mid S_{t+1}) \ln \frac{\pi(A_{t+1} \mid S_{t+1})}{p(A_{t+1} \mid S_t, A_t)} - \beta\, r(S_{t+1}; S_t, A_t) \Big\rangle_{p(S_{t+1}\mid S_t, A_t)} + \big\langle G^{\pi}_{T}(S_{t+1}, A_{t+1}, \beta) \big\rangle_{\pi(A_{t+1}\mid S_{t+1})\, p(S_{t+1}\mid S_t, A_t)}. \quad (18)$$

This recursive equation can be computed for any T. Specifically, we are interested in the typical behaviour, when the agent acts according to a stationary policy, π(A | S). In the next section we show how to find the optimal policy, π(A | S), that optimizes the unified Bellman recursion in (18).

4.1. Optimization Problem

To find the optimal policy (the one that minimizes the directed information under a constraint on the attained value), one needs to solve the unconstrained minimization of the corresponding Lagrangian in (17):

$$\operatorname*{argmin}_{\pi} G^{\pi}_{T}(S_t, A_t, \beta) = \operatorname*{argmin}_{\pi}\Big[\vec{I}^{\pi}_{T}(S_t, A_t) - \beta Q^{\pi}(S_t, A_t)\Big]. \quad (19)$$

The minimization with regard to π is done by augmenting the equation in (19) with a Lagrange term for the normalization of π(A | S), and computing the variational derivative, δG^π_T/δπ. The minimization with regard to q(A_{t'} | S, A) := p(A_{t'} | S, A) is done by setting the corresponding variational derivative, δG^q_T/δq, to zero. Straightforward computation of these variational derivatives gives the following set of self-consistent equations:

$$\pi^*(A \mid S) = \frac{\exp\big(-G^{\pi^*, q^*}(S_t, A_t, \beta)\big)}{\sum_{A} \exp\big(-G^{\pi^*, q^*}(S_t, A_t, \beta)\big)}, \quad (20)$$

$$p^*(A_{t'} \mid S, A) = \frac{1}{|A|}. \quad (21)$$

4.2. Stationary Policy

The typical behavior of the agent in its environment is defined by the stationary distribution over the states and the actions. This stationary distribution is given by the policy with the infinite time horizon:

$$\pi^*(A \mid S) = \frac{\exp\big(-G^{\pi^*, q^*}(S, A, \beta)\big)}{\sum_{A} \exp\big(-G^{\pi^*, q^*}(S, A, \beta)\big)}. \quad (22)$$

The unified Bellman recursion enables us to compute the stationary policy by iterating the recursion in Eq. (18) in the same way as for the Bellman recursion for the value function. Previously, the stationary policy for directed information minimization was derived for the LQG setting (Tanaka et al., 2015); in this work we derive the optimal stationary policy in the MDP setting.
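The sketch below iterates the unified recursion (18) together with the softmax policy update (20) on a small random MDP. For the action prior q(A_{t+1} | S_t, A_t) it uses the marginal induced by the policy and the dynamics, so that the first term of (18) equals the local conditional mutual information in Eq. (16); this choice of prior, the random MDP, the fixed number of sweeps (playing the role of the horizon), and the assumed zero-reward state are simplifications of this sketch rather than the exact prescription of Eqs. (20)-(21).

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, beta, sweeps = 6, 3, 0.05, 100     # illustrative sizes and trade-off

p_next = rng.random((n_s, n_a, n_s))
p_next /= p_next.sum(axis=2, keepdims=True)  # p(s' | s, a)
r = -np.ones((n_s, n_a, n_s))                # -1 per step, as in Section 5.1
r[:, :, 0] = 0.0                             # an assumed zero-reward target state

pi = np.full((n_s, n_a), 1.0 / n_a)          # policy pi(a | s)
G = np.zeros((n_s, n_a))                     # G(s, a, beta) of Eq. (18)
eps = 1e-30

for _ in range(sweeps):
    # action prior induced by pi and the dynamics: q(a' | s, a)
    q = np.einsum('xas,sb->xab', p_next, pi)
    # local information term  < ln pi(a'|s') / q(a'|s,a) >
    info = np.einsum('xas,sb->xa', p_next, pi * np.log(pi + eps)) \
         - (q * np.log(q + eps)).sum(axis=2)
    expected_r = (p_next * r).sum(axis=2)
    future = np.einsum('xas,s->xa', p_next, (pi * G).sum(axis=1))
    G = info - beta * expected_r + future                 # Eq. (18)
    pi = np.exp(-(G - G.min(axis=1, keepdims=True)))      # Eq. (20), softmax(-G)
    pi /= pi.sum(axis=1, keepdims=True)
```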

5. Numerical Simulations

To explore the long-term (typical) interaction between an agent and the environment, we solve different benchmark RL problems, such as maze-escaping problems (Sutton & Barto, 1998), using the information-value recursion in Eq. (18).

We emphasize that, even though in these experiments the target goal appears in a particular state (or in a number of states), there are different possible scenarios where a goal state can appear in any state. In those cases the agent should be prepared to reach any goal. This omni-goal strategy will be discussed in Section 7.

5.1. Maze with Predefined Targets

To investigate the solutions in (18) we simulated different mazes where the agent is asked to find a target cell. The agent has full information about the world (it knows its exact location within the maze), which is proportional to the logarithm of the maze size. But it cannot utilize all this information to generate the optimal actions due to its slow 'brain'. For example, the agent's sensor retrieves 8 bits of information about its location in each interaction with the environment, whereas the processor can transfer only 3 bits of information to the policy.

The dynamics is given by a transition probability p(S_{t+1} | A_t, S_t). If a certain action at time t is impossible, such as stepping into a wall, the agent remains in its current position. The target state is an absorbing state: the agent stays in this state for any action, p(S_target | A_t, S_t) = δ_{S_target, S_t}. The agent receives a reward of 0 when it is in the target state. In all other states the reward is given by r(S_{t+1} | A_t, S_t) = -1. This is the general setting for maze-escaping problems (Sutton & Barto, 1998).

Figure 2 shows the value function and the directed information for different values of the parameter β. The optimal solutions to equation (18) are characterized by phase transitions: a qualitative difference in the typical behavior as a function of the value of the parameter β. The optimal path is indicated by a solid white line from the start state, marked by S, to the goal state, marked by G.

Low values of β correspond to high stochasticity in the system. The more stochastic the system is, the harder it is for the agent to affect the environment. Specifically, the agent prefers not to take the risk of going through the narrow door for β = 0.005, as shown in Figure 2(b). However, it does go through the narrow door for β = 0.05, as shown in Figure 2(a).

Figure 2. Numerical simulation of the maze problem with 625 cells. Sub-figures (a), (c), and (e) correspond to β = 0.05; they show the value landscape, the directed information landscape, and the directed information along the optimal path. Sub-figures (b), (d), and (f) correspond to β = 0.005. A phase transition occurs between β = 0.05 and β = 0.005, where the typical behavior "to pass through the narrow door" changes to the qualitatively different behavior "to take a longer path without entering the narrow door". The peaks in sub-figures (e) and (f) correspond to states with high information consumption. For instance, as seen in (e), the 31st state along the optimal path (the entrance to the narrow door) corresponds to the maximal information rate that the agent should process in order to reach the goal state.

The typical distribution of the directed information along the optimal path is shown in the bottom plots. The information is higher when the agent needs to move carefully, e.g., near the corners and/or through the narrow door. The solution is derived for the infinite time horizon, which enables us to explore the typical stationary policy of the agent.

As seen in sub-figure (d), the agent needs different amounts of information at different times along the optimal path. For example, it must be more careful (it needs more information) at the narrow hole in order to exit the compartment in the middle of the maze. This is the maximum of the minimal directed information from the states to the actions, which should be taken into consideration when choosing a processor for the agent.
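The maze setting above can be encoded as a tabular MDP in a few lines. The sketch below builds the transition and reward tensors for an illustrative 5x5 grid; the grid size, wall cells, and goal position are assumptions, while the dynamics (blocked moves keep the agent in place, the goal is absorbing, every other step costs -1) follow the description in this subsection.

```python
import numpy as np

n = 5                                         # illustrative grid size
walls = {(2, 1), (2, 2), (2, 3)}              # assumed wall cells
goal = (4, 4)                                 # assumed goal cell
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

cells = [(i, j) for i in range(n) for j in range(n) if (i, j) not in walls]
idx = {c: k for k, c in enumerate(cells)}
n_s, n_a = len(cells), len(moves)

P = np.zeros((n_s, n_a, n_s))                 # p(s' | s, a)
R = np.full((n_s, n_a), -1.0)                 # reward of -1 per step
for c, k in idx.items():
    for a, (di, dj) in enumerate(moves):
        if c == goal:                         # absorbing target state, reward 0
            P[k, a, k], R[k, a] = 1.0, 0.0
            continue
        nxt = (c[0] + di, c[1] + dj)
        if nxt not in idx:                    # wall or border: stay in place
            nxt = c
        P[k, a, idx[nxt]] = 1.0
```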
6. Information from Agent to Environment

In this section we address the question of the maximal directed information from the agent to the environment, which is the dual of the question of the minimal directed information from the environment to the agent. These two directed information rates differ, because the directed information functional is asymmetric in general. We argue, and demonstrate by numerical simulations, that the maximal directed information between the agent and the environment is useful for acting optimally without an externally provided reward (unknown targets).

6.1. Background

The information capacity of the channel from the agent's actions to the environment states was shown to be useful in solving various AI benchmark problems without providing an external reward/utility function (Salge et al., 2014; Klyubin et al., 2005b;a; Mohamed & Rezende, 2015). This capacity, known as empowerment, evaluates the maximal number of environment states that the agent can potentially achieve by its actions.

The higher the value of the empowerment, the more potential states are reachable by actions. As in the empowerment method, when a goal state is not predefined but may occur in any state, as in Chess, Go and other interactive games, the optimal strategy for the agent is to keep as many states reachable as possible. As was shown in (Klyubin et al., 2005a), Figures 5 and 6, there is a qualitative similarity between the landscape of the empowerment and the landscape of the average shortest distance from a given state to any other state.

In this paper we extend the empowerment method by considering the feedback control policy, applying the directed information from the action trajectories to the environment state trajectories. In the next section we describe the causal conditioning of a sequence of states of length T on a sequence of actions of length T. We derive the Bellman-type recursion for the directed information from the actions to the states, which, together with the Bellman recursion from the states to the actions in Eq. (18), provides a complete picture of the bi-directional information exchange between the agent and the environment.

6.1.1. Causal Conditioning

We assume that at time t the environmental state is S_t. Given S_t, the agent computes the action A_t, and the environment moves to S_{t+1} according to the given transition probability, p(S_{t+1} | S_t, A_t). This interaction repeats for T - 1 cycles, driving the system to its final state, S_T, by a sequence of actions A_1^{T-1}.

The joint probability between A_t^{T-1} and S_{t+1}^T, conditioned on S_t, is given by p(S_{t+1}^T, A_t^{T-1} | S_t). The causal dependency between the time series A_t^{T-1} and S_t^T differs from the non-causal one in its conditional probability (for details see, e.g., (Kramer, 1998; Massey & Massey, 2005; Marko, 1973; Permuter et al., 2009)), as follows:

$$p(S_{t+1}^{T} \mid A_{t}^{T-1}, S_t) = \prod_{i=t+1}^{T} p(S_i \mid S_{t}^{i-1}, A_{t}^{T-1}), \quad (23)$$

$$p(S_{t+1}^{T} \,\|\, A_{t}^{T-1}, S_t) = \prod_{i=t+1}^{T} p(S_i \mid S_{t}^{i-1}, A_{t}^{i-1}). \quad (24)$$

The joint probability, p(S_{t+1}^T, A_t^{T-1} | S_t), is given by:

$$p(S_{t+1}^{T}, A_{t}^{T-1} \mid S_t) = p(S_{t+1}^{T} \,\|\, A_{t}^{T-1}, S_t)\, p(A_{t}^{T-1} \,\|\, S_{t}^{T-1}), \quad (25)$$

which is decomposed alternatively, according to Bayes' rule, as:

$$p(S_{t+1}^{T}, A_{t}^{T-1} \mid S_t) = p(A_{t}^{T-1} \mid S_{t+1}^{T}, S_t)\, p(S_{t+1}^{T} \mid S_t). \quad (26)$$

6.1.2. Directed Information

The two alternative decompositions of the joint distribution, p(S_{t+1}^T, A_t^{T-1} | S_t), enable us to decompose the directed information as follows:

$$I\big[A_{t}^{T-1} \to S_{t+1}^{T} \mid s_t\big] = \Big\langle \ln \frac{p(S_{t+1}^{T} \,\|\, A_{t}^{T-1}, S_t)}{p(S_{t+1}^{T} \mid S_t)} \Big\rangle_{p(S_{t+1}^{T}, A_{t}^{T-1} \mid s_t)} = \Big\langle \ln \frac{\prod_{j=t}^{T-1} p(A_j \mid S_j, S_{j+1})}{\prod_{j=t}^{T-1} \pi(A_j \mid S_j)} \Big\rangle_{p(S_{t+1}^{T}, A_{t}^{T-1} \mid s_t)}, \quad (27)$$

which follows from the alternative decompositions of the joint distribution in (25) and (26):

$$\frac{p(S_{t+1}^{T} \,\|\, A_{t}^{T-1}, S_t)}{p(S_{t+1}^{T} \mid S_t)} = \frac{p(A_{t}^{T-1} \mid S_{t+1}^{T}, S_t)}{p(A_{t}^{T-1} \,\|\, S_{t}^{T-1})}, \quad (28)$$

and the graph property:

$$\forall\, 1 \le t < T: \quad A_t \perp\!\!\!\perp \big(A_{1}^{t-1}, A_{t+1}^{T-1}, S_{1}^{t-1}, S_{t+1}^{T}\big) \;\big|\; S_t, S_{t+1}.$$

This decomposition leads to the Bellman equation for the directed information from the action trajectories to the state trajectories.

6.2. Bellman Equation for I⃗[Actions → States]

Separating the first term in (27) from the rest, we get the following recursion:

$$I\big[A_{t}^{T-1} \to S_{t+1}^{T} \mid s_t\big] = I[A_t; S_{t+1} \mid S_t] + \big\langle I\big[A_{t+1}^{T-1} \to S_{t+2}^{T} \mid S_{t+1}\big] \big\rangle_{p(S_{t+1} \mid s_t)}, \quad (29)$$

where I[A_t; S_{t+1} | S_t] is the mutual information between the t-th action and the (t+1)-th state, conditioned on the t-th state.

The conditional directed information in (29) is a function of the state. To improve readability, the recursion in (29) may be represented concisely as follows (where the right arrow in I⃗ distinguishes the directed information from the ordinary, symmetric mutual information):

$$\vec{I}(s_t) = I[A_t; S_{t+1} \mid s_t] + \big\langle \vec{I}(S_{t+1}) \big\rangle_{p(S_{t+1} \mid s_t)}, \quad (30)$$

where the free distribution is the feedback policy distribution, π(A | S). Assuming stationarity, the time index in (30) may be dropped:

$$\vec{I}(s) = I\big[A; S' \mid s\big] + \big\langle \vec{I}(S') \big\rangle_{p(S' \mid s)}. \quad (31)$$

The maximum of I⃗(s) with regard to π(A | S) gives the maximal number of states that the agent can achieve by its actions. In the next section we provide the formal solution.

6.3. Optimal Solution

To find the maximum of the directed information in (31), we apply alternating maximization between the policy distribution, π(A_i | S_i), and the inverse-channel distribution, p(A_i | S_i, S_{i+1}) (conventionally denoted by q(A_i | S_i, S_{i+1})), as in the standard problem of finding the channel capacity (Cover & Thomas, 2012; Blahut, 1972). The optimal formal solution is given by the following set of self-consistent equations:

$$q^*(A \mid S, S') = \frac{p(S' \mid S, A)\, \pi^*(A \mid S)}{\sum_{A} p(S' \mid S, A)\, \pi^*(A \mid S)}, \quad (32)$$

$$\vec{I}^{\pi^*}[s] = I^{\pi^*}\big[A; S' \mid S\big] + \big\langle \vec{I}^{\pi^*}[S'] \big\rangle_{p(S' \mid A, S)\, \pi^*(A \mid S)}, \quad (33)$$

$$\pi^*(A \mid S) = \frac{\exp\Big(\sum_{S'} p(S' \mid S, A)\big[\ln q^*(A \mid S, S') + \vec{I}^{\pi^*}[S']\big]\Big)}{\sum_{A} \exp\Big(\sum_{S'} p(S' \mid S, A)\big[\ln q^*(A \mid S, S') + \vec{I}^{\pi^*}[S']\big]\Big)}. \quad (34)$$

This set of self-consistent equations extends the common Blahut-Arimoto algorithm by the directed information of the future action and state trajectories, which appears in the exponent of π*(A | S) in Eq. (34). A numerical solution is derived by iterating equations (32)-(34) until convergence. The typical convergence is shown by the numerical simulations provided in the next section.
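A minimal sketch of the alternating maximization (32)-(34) is given below for a small random MDP. The random dynamics, the fixed number of backup sweeps (playing the role of the time horizon), and the small numerical floor inside the logarithms are assumptions of this sketch; the paper instead iterates until the relative-change criterion described in Section 7 is met.

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, sweeps = 6, 3, 50                   # illustrative sizes and horizon

p_next = rng.random((n_s, n_a, n_s))
p_next /= p_next.sum(axis=2, keepdims=True)   # p(s' | s, a)

pi = np.full((n_s, n_a), 1.0 / n_a)           # policy pi(a | s)
I = np.zeros(n_s)                             # directed-information-to-go, Eq. (33)
eps = 1e-30

for _ in range(sweeps):
    # Eq. (32): inverse channel q(a | s, s') proportional to p(s'|s,a) pi(a|s)
    q = p_next * pi[:, :, None]               # q[s, a, s'] before normalization
    q /= q.sum(axis=1, keepdims=True) + eps

    # Eq. (34): pi(a|s) proportional to exp( sum_s' p(s'|s,a) [ln q(a|s,s') + I(s')] )
    logits = (p_next * (np.log(q + eps) + I[None, None, :])).sum(axis=2)
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

    # Eq. (33): local conditional mutual information plus the expected future term
    local = (pi[:, :, None] * p_next *
             (np.log(q + eps) - np.log(pi[:, :, None] + eps))).sum(axis=(1, 2))
    future = (pi[:, :, None] * p_next * I[None, None, :]).sum(axis=(1, 2))
    I = local + future
```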

7. Numerical Simulations

To demonstrate the optimal solution given by equations (32)-(34), we consider the problem of the optimal location of a firehouse in a city. We model the city by a grid world, where a fire engine is an agent moving across the city according to a given transition probability, p(S_{t+1} | S_t, A_t). The goal is to locate the firehouse at a position from which the firefighters can reach every block in the city in the fastest way on average.

This is an example where the agent should be optimally prepared to reach any possible target state on average, rather than to be able to get to a particular state (or a number of states).

The naive solution is to compute the average shortest distance between each block in the city and all the other blocks, and to locate the firehouse at the block with the minimal average shortest distance. The algorithms for solving the all-pairs shortest path problem have a time complexity of O(V^3) (Cormen, 2009), where V is the number of vertices in the graph.

Figure 3. Sub-figures (a) and (c) show the exact values of the minimal average path from a given block to all the other blocks in the city. Sub-figures (b) and (d) show the value of the maximal directed information from the agent to the environment in the long time horizon at every block in the city. Black lines represent the fences in the city. Red corresponds to high values.

Sub-figures (3.a) and (3.c) show the exact values of the average shortest paths from a particular block to all the other blocks in two different cities: a city without any walls, and a city with perpendicular walls (drawn in black), respectively. Sub-figures (3.b) and (3.d) show the values of the directed information computed by iterating the self-consistent equations (32)-(34). Red blocks denote high values, and blue blocks denote low values.

The two approaches suggest qualitatively similar patterns for the location of the firehouse. The preferable locations are the brightest red blocks. In the case of the city without walls, both solutions give the same block.

Figure 4 shows the convergence of the iteration of the self-consistent equations (32)-(34). The convergence is sub-linear in the number of blocks in the city, where each iteration consumes O(|S| × |A|) multiplications of real numbers, which is overall faster, in order of magnitude, than Floyd/Dijkstra-type algorithms. Convergence is said to occur when the relative change in the directed information between two consecutive iterations is smaller than the tolerance τ = 10^{-6}: |I⃗_{k+1} - I⃗_k| / I⃗_k ≤ τ, where I⃗_{k+1} is given by Eq. (31) at the k-th iteration.

Figure 4. Convergence of the Bellman recursion for the directed information. x-axis: number of blocks; y-axis: number of iterations until convergence.
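For reference, the O(V^3) all-pairs baseline mentioned above can be realized with the Floyd-Warshall algorithm. The sketch below returns the average shortest distance from every block, the quantity displayed in Figures 3(a) and 3(c); the adjacency-matrix representation of the city grid is an assumption of this sketch.

```python
import numpy as np

def average_shortest_distance(adj):
    """adj[i, j] = 1 if blocks i and j are adjacent, else 0."""
    V = adj.shape[0]
    D = np.where(adj > 0, 1.0, np.inf)
    np.fill_diagonal(D, 0.0)
    for k in range(V):                         # Floyd-Warshall relaxation, O(V^3)
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D.mean(axis=1)                      # average distance from each block
```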
8. Conclusion

The feedback interaction between the agent and the environment includes two asymmetric information channels. Directed information from the environment constrains the maximal expected reward obtained by the agent in the standard RL setting. Directed information is a functional of the joint probability distribution of the agent's action trajectories and the environmental state trajectories in a given time horizon. As a result, the value of the directed information depends on the particular time horizon. However, it is worth exploring the typical behavior of an agent in an infinite time horizon, when all transient events have settled down or become statistically insignificant. In this work we derive the typical behavior of an agent that strives to increase its expected reward under the limitations of directed information. This is shown to be possible through the unified Information-Value Bellman-type recursion, which we present in this work.

The second channel transfers directed information from the agent to the environment. It has been shown that the capacity of this channel is useful for solving various AI test bench problems, where the agent needs to be ready to solve a problem optimally on average for any possible target. We derived a Bellman recursion for this channel as well, which enables us to explore the typical behavior of the agent in an infinite time horizon.

A direct continuation of this work would be to explore the entire action-perception cycle of the interaction between the intelligent agent and the environment, when both information channels operate simultaneously.

Acknowledgments

We thank Daniel Polani for many useful discussions. This work is partially supported by the Gatsby Charitable Foundation, the Israel Science Foundation, and the Intel ICRI-CI center.

References

Bertsekas, Dimitri P., et al. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995.

Blahut, Richard. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.

Cormen, Thomas H. Introduction to Algorithms. MIT Press, 2009.

Cover, Thomas M and Thomas, Joy A. Elements of Information Theory. John Wiley & Sons, 2012.

Kappen, Hilbert J, Gómez, Vicenç, and Opper, Manfred. Optimal control as a graphical model inference problem. Machine Learning, 87(2):159–182, 2012.

Klyubin, Alexander S, Polani, Daniel, and Nehaniv, Chrystopher L. All else being equal be empowered. In European Conference on Artificial Life, pp. 744–753. Springer, 2005a.

Klyubin, Alexander S, Polani, Daniel, and Nehaniv, Chrystopher L. Empowerment: A universal agent-centric measure of control. In Evolutionary Computation, 2005. The 2005 IEEE Congress on, volume 1, pp. 128–135. IEEE, 2005b.

Kramer, Gerhard. Directed Information for Channels with Feedback. PhD thesis, University of Manitoba, Canada, 1998.

Marko, Hans. The bidirectional communication theory - a generalization of information theory. IEEE Transactions on Communications, 21(12):1345–1351, 1973.

Massey, James L and Massey, Peter C. Conservation of mutual and directed information. In Information Theory, 2005. ISIT 2005. Proceedings. International Symposium on, pp. 157–158. IEEE, 2005.

Mohamed, Shakir and Rezende, Danilo Jimenez. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2125–2133, 2015.

Perkins, Theodore J and Barto, Andrew G. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3(Dec):803–832, 2002.

Permuter, Haim Henry, Weissman, Tsachy, and Goldsmith, Andrea J. Finite state channels with time-invariant deterministic feedback. IEEE Transactions on Information Theory, 55(2):644–662, 2009.

Polani, Daniel, Nehaniv, C, Martinetz, Thomas, and Kim, Jan T. Relevant information in optimized persistence vs. progeny strategies. In Artificial Life X: Proceedings of the Tenth International Conference on the Simulation and Synthesis of Living Systems, pp. 337–343. Citeseer, 2006.

Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Rubin, Jonathan, Shamir, Ohad, and Tishby, Naftali. Trading value and information in MDPs. In Decision Making with Imperfect Decision Makers, pp. 57–74. Springer, 2012.

Salge, Christoph, Glackin, Cornelius, and Polani, Daniel. Empowerment - an introduction. In Guided Self-Organization: Inception, pp. 67–114. Springer, 2014.

Sutton, Richard S and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

Tanaka, Takashi and Sandberg, Henrik. SDP-based joint sensor and controller design for information-regularized optimal LQG control. In Decision and Control (CDC), 2015 IEEE 54th Annual Conference on, pp. 4486–4491. IEEE, 2015.

Tanaka, Takashi, Esfahani, Peyman Mohajerin, and Mitter, Sanjoy K. LQG control with minimal information: Three-stage separation principle and SDP-based solution synthesis. arXiv preprint arXiv:1510.04214, 2015.

Tatikonda, Sekhar and Mitter, Sanjoy. Control under communication constraints. IEEE Transactions on Automatic Control, 49(7):1056–1068, 2004.

Tishby, Naftali and Polani, Daniel. Information theory of decisions and actions. In Perception-Action Cycle, pp. 601–636. Springer, 2011.

Todorov, Emanuel et al. Linearly-solvable Markov decision problems. In NIPS, pp. 1369–1376, 2006.