A Unified Bellman Equation for Causal Information and Value in Markov Decision Processes

Stas Tiomkin 1 Naftali Tishby 1 2

1 The Benin School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel. 2 The Edmond and Lily Safra Center for Brain Sciences, The Hebrew University, Jerusalem, Israel. Correspondence to: Stas Tiomkin, Naftali Tishby.

Abstract

The interaction between an artificial agent and its environment is bi-directional. The agent extracts relevant information from the environment, and affects the environment by its actions in return, to accumulate a high expected reward. Standard reinforcement learning (RL) deals with expected reward maximization. However, there are always information-theoretic limitations that restrict the expected reward, and these are not properly considered by standard RL. In this work we consider RL objectives with information-theoretic limitations. For the first time we derive a Bellman-type recursive equation for the causal information between the environment and the agent, which combines plausibly with the Bellman recursion for the value function. The unified equation serves to explore the typical behavior of artificial agents in an infinite time horizon.

1. Introduction

The interaction between an organism and the environment consists of three major components: utilization of past experience, observation of the current environmental state, and generation of a behavioral policy. The latter component, planning, is an essential feature of intelligent systems for survival in limited-resource environments. Organisms with long-term planning can accumulate more resources and avoid possible catastrophic states in the future, which makes them evolutionarily superior to 'short-term planners'.

An intelligent agent combines past experience with the current observations to act upon the environment. This feedback interaction induces complex statistical correlations over different time scales between environment state trajectories and the agent's action trajectories. Infinite-time interaction patterns (infinite state and action trajectories) define the typical behavior of an organism in a given environment. This typical behavior is crucial for the design and analysis of intelligent systems. In this work we derive the typical behavior within the formalism of a reinforcement learning (RL) model subject to information-theoretic constraints.

In the standard RL model (Sutton & Barto, 1998), an artificial organism generates an optimal policy through an interaction with the environment. Typically, the optimality is taken with regard to a reward, such as energy, money, social network 'likes/dislikes', time, etc. Intriguingly, the reward-associated value function (an average accumulated reward) possesses the same property for different types of rewards: the optimal value function is a Lyapunov function (Perkins & Barto, 2002), a generalized energy function.

A principled way to solve RL (as well as optimal control) problems is to find a corresponding generalized energy function. Specifically, the Bellman recursive equation (Bertsekas, 1995) is the Lyapunov function for Markov decision processes. Strictly speaking, the standard RL framework is about the discovery and minimization (maximization) of an energetic quantity, while the Bellman recursion is an elegant tool for finding the optimal policy for both finite and infinite planning horizons.

However, this model is incomplete. Physics suggests that there is always an interplay between energetic and entropic quantities. Recent advances in artificial intelligence show that the behavior of realistic artificial agents is indeed affected simultaneously by energetic and entropic quantities (Tishby & Polani, 2011; Rubin et al., 2012). The finite rate of information transfer is a fundamental constraint for any value accumulation mechanism.

Information-theoretic constraints are ubiquitous in the interaction between an agent and the environment. For example, a sensor has a limited bandwidth, a processor has a limited information processing rate, a memory has a limited information update rate, etc. These are not purely theoretical considerations, but rather practical requirements for building and analyzing realistic intelligent agents. Moreover, an artificial agent often needs to limit the available information to the information relevant for a particular task (Polani et al., 2006). For example, when looking at a picture, the visual stimulus strikes the retina at the rate of the speed of light, which is decreased dramatically to leave only the relevant information rate that is essential for understanding the picture.

Different types of entropic constraints have been studied in the context of RL. KL-control (Todorov et al., 2006; Tishby & Polani, 2011; Kappen et al., 2012) introduces a D_KL cost at each state, which penalizes complex policies, where complexity is measured by the KL divergence between a prior policy, or uncontrolled dynamics, and the actual policy. Importantly, KL-control does not directly address the question of the information transfer rate.

Some recent works do explicitly consider the effects of information constraints on the performance of a controller over an infinite time horizon (Tatikonda & Mitter, 2004; Tanaka & Sandberg, 2015; Tanaka et al., 2015). However, these works are mostly limited to linear dynamics.

Obviously, feedback implies causality: the current action depends on previous and/or present information alone. Consequently, the information constraints should address the causality of the interaction.

In this work we consider directed information constraints over an infinite time horizon in the framework of Markov decision processes, where the directed information is between the environment state trajectories and the agent's action trajectories. For the first time we derive a Bellman-type equation for the causal information, and combine it with the Bellman recursion for value. The unified Information-Value Bellman equation enables us to derive the typical optimal behavior over an infinite time horizon in the MDP setting.

In addition, this work has practical implications for the design criteria of optimal intelligent agents. Specifically, the information processing rate of the 'brain' (processor) of an artificial agent should be higher than the minimal information rate required to solve a particular MDP problem.

The interaction between an agent and the environment is bi-directional: an agent extracts information and affects the environment by its actions in return. To provide a comprehensive analysis of this bi-directional interaction, we consider both information channels. These channels are dual, comprising the action-perception cycle of reinforcement learning under information-theoretic constraints. We derive a Bellman-type equation for the directed information from action trajectories to state trajectories as well.

The minimal directed information rate from the environment to the agent comprises a constraint for expected reward maximization in standard goal-oriented RL test bench tasks. By contrast, the maximal directed information rate from the agent to the environment comprises a proxy for solving different AI test bench tasks without specifying the target, which can be considered as self-motivated behavior, being ready for any possible target. We elaborate on the difference between these two opposite information rates by their corresponding analytic expressions, and by numerical simulations of the different types of problems which can be solved by them.

The paper is organized as follows. In Section 2 we provide background on causal conditioning, directed information, and Markov decision processes. In Section 3 we present the Bellman-type recursive equation for the directed information from the environment to the agent's actions. In Section 4 we show the unified recursion for the directed information and the common recursion for the value function. The optimization problem for finding the minimal directed information rate required to achieve the maximal expected reward in an infinite time horizon is stated in Section 4.1. The numerical simulation of the unified Bellman equation is provided in Section 5, where we consider the standard maze-escaping problem with a predefined target. In Section 6 we consider the dual problem of the maximal directed information rate from the agent to the environment, and derive the dual Bellman equation. In Section 7 we provide a numerical simulation for the dual Bellman equation, where we consider tasks without a predefined target; we compare our solution to the standard algorithms for the average all-pairs shortest path problem. Finally, in Section 8 we summarize the paper and provide directions for the continuation of this work.

2. Background

In this section we overview the theoretical background for this work. Specifically, we briefly review the framework of Markov decision processes. Then, we review causal conditioning and directed information, and define the required quantities.

2.1. Interaction Model

We assume that the agent and the environment interact at discrete time steps for t ∈ (−∞, ..., T). At time t, the environment appears at state S_t ∈ S, the agent observes the state, and affects the environment by its action A_t ∈ A according to the policy π(A_t | S_t). The environmental evolution is given by the transition probability distribution, p(S_{t+1} | A_t, S_t). This interaction model is plausibly described by the probabilistic graph shown in Figure 1, where the arrows denote the directions of causal dependency.

Figure 1. A probabilistic directed graph for a Markov decision process.
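To make the interaction model concrete, the following minimal Python sketch rolls out the loop of Figure 1 for a small, randomly generated MDP. The state and action space sizes, the horizon, and the random instance are illustrative assumptions of this sketch, not part of the paper.

```python
import numpy as np

# Minimal rollout of the interaction model in Section 2.1:
# A_t ~ pi(. | S_t), then S_{t+1} ~ p(. | S_t, A_t).
rng = np.random.default_rng(0)
n_states, n_actions, horizon = 4, 2, 10      # illustrative sizes

p_next = rng.random((n_states, n_actions, n_states))
p_next /= p_next.sum(axis=2, keepdims=True)  # transition p(s' | s, a)
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)          # policy pi(a | s)

s = 0
states, actions = [s], []
for t in range(horizon):
    a = rng.choice(n_actions, p=pi[s])        # agent acts on the observed state
    s = rng.choice(n_states, p=p_next[s, a])  # environment responds
    actions.append(a)
    states.append(s)
```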

As a result of this feedback interaction, the environment state trajectories, S^T_{-∞}, defined by (1), and the agent action trajectories, A^T_{-∞}, defined by (2), become statistically correlated:

$$S_{-\infty}^{T} := (\ldots, S_{t-1}, S_t, S_{t+1}, \ldots, S_T), \quad (1)$$

$$A_{-\infty}^{T} := (\ldots, A_{t-1}, A_t, A_{t+1}, \ldots, A_T). \quad (2)$$

The corresponding trajectories of finite length T, and their time-delayed versions, are given by:

$$S_{t+1}^{T} := (S_{t+1}, S_{t+2}, \ldots, S_T), \qquad A_{t+1}^{T} := (A_{t+1}, A_{t+2}, \ldots, A_T),$$

$$DS_{t+1}^{T} := (S_{t}, S_{t+1}, \ldots, S_{T-1}), \qquad DA_{t+1}^{T} := (A_{t}, A_{t+1}, \ldots, A_{T-1}),$$

where D denotes the time-delay operator.

Remark 1. These definitions imply that the original sequence, S^T_{t+1}, includes the time-delayed sequence, DS^T_{t+1}, except for its first component, S_t. The union of S^T_{t+1} and DS^T_{t+1} is the sequence S^T_t, of length T - t + 1.

2.1.1. Bellman Recursion for Value

Extending the interaction model in Section 2.1 by introducing the reward function, r(S_{t+1} | A_t, S_t), makes the agent plan its actions 'consciously' by adapting the policy, π(A_t | S_t), in order to achieve a higher reward within a given time horizon. The optimal policy, π(A_t | S_t), that guarantees the maximal expected reward is found by the policy iteration algorithm (Sutton & Barto, 1998), where the value of π(A_t | S_t) is estimated by the value function. It can be computed recursively:

$$Q^{\pi}(S_t, A_t) = \big\langle r(S_{t+1}; A_t, S_t) \big\rangle_{p(S_{t+1}\mid S_t, A_t)} + \gamma_t \big\langle Q(S_{t+1}, A_{t+1}) \big\rangle_{\pi(A_{t+1}\mid S_{t+1})\, p(S_{t+1}\mid S_t, A_t)}, \quad (3)$$

where γ_t is the discounting factor. The Bellman recursion is a fundamental concept in optimal planning. A comprehensive review of reinforcement learning can be found in (Bertsekas, 1995; Sutton & Barto, 1998; Puterman, 2014).
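As a concrete illustration of the recursion in Eq. (3), the sketch below performs iterative policy evaluation on a small randomly generated MDP. The fixed uniform policy, the random reward tensor, and the constant discount factor are assumptions made only for this example.

```python
import numpy as np

# Iterating Eq. (3): Q(s,a) <- E_{s'}[ r(s'; a, s) + gamma * E_{a'~pi}[ Q(s', a') ] ]
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9       # illustrative sizes and discount

p_next = rng.random((n_states, n_actions, n_states))
p_next /= p_next.sum(axis=2, keepdims=True)  # p(s' | s, a)
reward = rng.standard_normal((n_states, n_actions, n_states))
pi = np.full((n_states, n_actions), 1.0 / n_actions)

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    V_next = (pi * Q).sum(axis=1)            # <Q(s', a')>_{pi(a'|s')}
    Q = (p_next * (reward + gamma * V_next[None, None, :])).sum(axis=2)
```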

2.2. Causal Conditioning

A random process can be conditioned by another random process either causally or non-causally, respectively:

$$p(A_{t+1}^{T} \,\|\, S_{t+1}^{T}) := \prod_{i=t+1}^{T} p(A_i \mid A_{t+1}^{i-1}, S_{t+1}^{i}), \quad (4)$$

$$p(A_{t+1}^{T} \mid S_{t+1}^{T}) := \prod_{i=t+1}^{T} p(A_i \mid A_{t+1}^{i-1}, S_{t+1}^{T}). \quad (5)$$

The causal conditioning assumes time-synchronization between the sequences. A random process can also be causally conditioned on a random variable at a particular time. Compare the following causal scalar conditioning,

$$p(A_{t+1}^{T} \,\|\, S_{t+\tau}) = \prod_{i=t+1}^{t+\tau-1} p(A_i \mid A_{t+1}^{i-1}) \prod_{i=t+\tau}^{T} p(A_i \mid A_{t+1}^{i-1}, S_{t+\tau}),$$

vs. non-causal scalar conditioning:

$$p(A_{t+1}^{T} \mid S_{t+\tau}) = \prod_{i=t+1}^{T} p(A_i \mid A_{t+1}^{i-1}, S_{t+\tau}).$$

A random sequence can be conditioned simultaneously by another random sequence and by a scalar random variable. In particular, the following two expressions will be useful in the next section:

$$p(A_{t+1}^{T} \,\|\, S_{t}^{T}, A_t) = \prod_{i=t+1}^{T} p(A_i \mid S_{t}^{i}, A_{t+1}^{i-1}, A_t),$$

$$p(A_{t+1}^{T} \,\|\, S_{t}^{T-1}, A_t) = \prod_{i=t+1}^{T} p(A_i \mid S_{t}^{i-1}, A_{t+1}^{i-1}, A_t),$$

which, due to the graph in Figure 1, reduce to:

$$p(A_{t+1}^{T} \,\|\, S_{t}^{T}, A_t) = \prod_{i=t+1}^{T} p(A_i \mid S_i), \quad (6)$$

$$p(A_{t+1}^{T} \,\|\, S_{t}^{T-1}, A_t) = \prod_{i=t+1}^{T} p(A_i \mid S_{i-1}, A_{i-1}). \quad (7)$$

2.2.1. Causal Entropy

Following the causally conditioned distribution (4), the causally conditioned entropy is defined by:

$$H(A_{t+1}^{T} \,\|\, S_{t+1}^{T}) = -\big\langle \ln p(A_{t+1}^{T} \,\|\, S_{t+1}^{T}) \big\rangle_{p(A_{t+1}^{T}, S_{t+1}^{T})}. \quad (8)$$

The causally conditioned entropy satisfies the chain rule (and the other properties of entropy, (Kramer, 1998)):

$$H(A_{t+1}^{T} \,\|\, S_{t+1}^{T}) = \sum_{i=t+1}^{T} H(A_i \mid S_{t+1}^{i}, A_{t+1}^{i-1}). \quad (9)$$

The causally conditioned entropy enables us to define the directed (causal) information between two sequences of random variables.

2.2.2. Directed Information

The Shannon mutual information between two sequences of random variables, X_1^T and Y_1^T, is given by the decrease in the entropy of one variable due to conditioning on the other (Cover & Thomas, 2012):

$$I[X; Y] = H(X) - H(X \mid Y). \quad (10)$$

The directed information is defined similarly as a decrease in entropy, but in this case the conditional entropy is the causally conditioned entropy (Massey & Massey, 2005; Kramer, 1998):

$$\vec{I}[X \to Y] := H(X) - H(X \,\|\, Y). \quad (11)$$

The essential difference between the ordinary mutual information (10) and the directed information (11) is the lack of symmetry:

$$\vec{I}[X \to Y] \ne \vec{I}[Y \to X]. \quad (12)$$

A comprehensive review of the properties of directed information can be found in (Kramer, 1998; Marko, 1973; Massey & Massey, 2005).
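The asymmetry stated in Eq. (12) is easy to check numerically. The sketch below evaluates Eqs. (10)-(11) for two-step binary sequences (X_1, X_2) and (Y_1, Y_2) drawn from an arbitrary joint distribution; the random joint distribution and the base-2 logarithm are assumptions of this illustration.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# joint distribution over (X1, Y1, X2, Y2), axes 0..3 (an arbitrary random example)
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2))
p /= p.sum()

def H(axes):
    """Entropy of the marginal over the listed axes."""
    drop = tuple(i for i in range(4) if i not in axes)
    return entropy(p.sum(axis=drop) if drop else p)

# ordinary mutual information, Eq. (10):  I[X;Y] = H(X) - H(X|Y)
I_sym = H((0, 2)) + H((1, 3)) - H((0, 1, 2, 3))

# directed information, Eq. (11):  I[X -> Y] = H(X) - H(X || Y),
# with H(X || Y) = H(X1|Y1) + H(X2|X1,Y1,Y2) by the chain rule (9)
H_X_given_Y_causal = (H((0, 1)) - H((1,))) + (H((0, 1, 2, 3)) - H((0, 1, 3)))
I_xy = H((0, 2)) - H_X_given_Y_causal

# the reverse direction, used in Eq. (12)
H_Y_given_X_causal = (H((0, 1)) - H((0,))) + (H((0, 1, 2, 3)) - H((0, 1, 2)))
I_yx = H((1, 3)) - H_Y_given_X_causal

print(I_sym, I_xy, I_yx)   # in general I_xy != I_yx, illustrating Eq. (12)
```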

3. Information from Environment to Agent

We equip the agent with the ability to remember the previous state and action. Memory is a desired feature of intelligent agents: memoryless agents can act only reactively, without learning from experience. Specifically, we model the memory by causally conditioning the directed information flow from the environment to the agent. Assuming that at time t the environment state and the agent's action were S_t and A_t, respectively, the causally conditioned directed information is given by:

$$\vec{I}\big[S_{t+1}^{T} \to A_{t+1}^{T} \,\big\|\, DS_{t+1}^{T}, A_t\big] = \Big\langle \ln \frac{p(A_{t+1}^{T} \,\|\, S_{t+1}^{T}, DS_{t+1}^{T}, A_t)}{p(A_{t+1}^{T} \,\|\, DS_{t+1}^{T}, A_t)} \Big\rangle_{p(S_{t+1}^{T}, A_{t+1}^{T} \mid S_t, A_t)},$$

which, following Remark 1, is equivalent to:

$$= \Big\langle \ln \frac{p(A_{t+1}^{T} \,\|\, S_{t}^{T}, A_t)}{p(A_{t+1}^{T} \,\|\, S_{t}^{T-1}, A_t)} \Big\rangle_{p(S_{t+1}^{T}, A_{t+1}^{T} \mid S_t, A_t)}, \quad (13)$$

and, according to Eq. (6) and Eq. (7), is decomposed to:

$$= \Big\langle \ln \frac{\prod_{i=t+1}^{T} p(A_i \mid S_i)}{\prod_{i=t+1}^{T} p(A_i \mid S_{i-1}, A_{i-1})} \Big\rangle_{p(S_{t+1}^{T}, A_{t+1}^{T} \mid S_t, A_t)} = \sum_{i=t+1}^{T} \Big\langle \ln \frac{p(A_i \mid S_i)}{p(A_i \mid S_{i-1}, A_{i-1})} \Big\rangle_{p(S_{t+1}^{T}, A_{t+1}^{T} \mid S_t, A_t)}. \quad (14)$$

3.1. Bellman Equation for I⃗[States → Actions]

The expression (14) enables us to write the directed information recursively:

$$\vec{I}\big[S_{t+1}^{T} \to A_{t+1}^{T} \,\big\|\, DS_{t+1}^{T}, A_t\big] = \Big\langle \ln \frac{p(A_{t+1} \mid S_{t+1})}{p(A_{t+1} \mid S_t, A_t)} \Big\rangle_{p(S_{t+1}, A_{t+1} \mid S_t, A_t)} + \sum_{i=t+2}^{T} \Big\langle \ln \frac{p(A_i \mid S_i)}{p(A_i \mid S_{i-1}, A_{i-1})} \Big\rangle_{p(S_{t+1}^{T}, A_{t+1}^{T} \mid S_t, A_t)}$$

$$= I[S_{t+1}; A_{t+1} \mid S_t, A_t] + \Big\langle \vec{I}\big[S_{t+2}^{T} \to A_{t+2}^{T} \,\big\|\, DS_{t+2}^{T}, A_{t+1}\big] \Big\rangle_{p(S_{t+1}, A_{t+1} \mid S_t, A_t)}. \quad (15)$$

The directed information in (13) is a function of S_t and A_t, whereas the directed information in (15), inside the expectation operator ⟨·⟩_{p(S_{t+1},A_{t+1}|S_t,A_t)}, is a function of S_{t+1} and A_{t+1}. Consequently, the recursive equation in (15) can be decomposed into the first step and all the future steps:

$$\vec{I}_T(S_t, A_t) = I[S_{t+1}; A_{t+1} \mid S_t, A_t] + \big\langle \vec{I}_T(S_{t+1}, A_{t+1}) \big\rangle_{\pi(A_{t+1}\mid S_{t+1})\, p(S_{t+1}\mid S_t, A_t)}, \quad (16)$$

where the first term is the local step of the recursion and the second term is the future directed information. This expression comprises the Bellman-type equation for the directed information between the environment and the agent. The time horizon of the future state and action trajectories is marked by the subscript in the directed information, I⃗_T(S_t, A_t).

Information processing is an integral part of the interaction between the agent and its environment. An optimal agent strives to accumulate as much value as possible under the constraints of limited bandwidth and information transfer rate. In the next section we show that the Bellman recursion for the directed information and the Bellman recursion for the value naturally combine into a unified Bellman recursion.

4. Unified Bellman Recursion

As mentioned above, the optimal control solution should adhere to the information processing constraints, which are always present in real-life intelligent systems. For example, the limited bandwidth and/or the time delays due to the distance between a controller and a plant are ubiquitous information-theoretic constraints, which may reduce the maximal achievable value.

Consequently, the designer of an intelligent agent should always address the question: what is the minimal information rate required to achieve the maximal value? This question is formulated by the unconstrained Lagrangian which comprises the unified Bellman recursion.

We found that the Bellman recursion for the directed information (16) and the Bellman recursion for the value (3) combine naturally to form the unified Information-Value Bellman recursion:

$$G^{\pi}_{T}(S_t, A_t, \beta) = \vec{I}^{\pi}_{T}(S_t, A_t) - \beta Q^{\pi}(S_t, A_t)$$

$$= I^{\pi}_{T}\big[S_{t+1}; A_{t+1} \mid S_t, A_t\big] - \beta \big\langle r(S_{t+1}; S_t, A_t) \big\rangle_{p(S_{t+1}\mid S_t, A_t)} + \big\langle \vec{I}^{\pi}_{T}(S_{t+1}, A_{t+1}) - \beta Q^{\pi}(S_{t+1}, A_{t+1}) \big\rangle_{\pi(A_{t+1}\mid S_{t+1})\, p(S_{t+1}\mid S_t, A_t)}. \quad (17)$$

Writing the expectations in (17) explicitly, and identifying the terms within the second expectation operator with the future averaged value of G^π_T(S_{t+1}, A_{t+1}, β), we get the unified Bellman recursion:

$$G^{\pi}_{T}(S_t, A_t, \beta) = \Big\langle \pi(A_{t+1} \mid S_{t+1}) \ln \frac{\pi(A_{t+1} \mid S_{t+1})}{p(A_{t+1} \mid S_t, A_t)} - \beta\, r(S_{t+1}; S_t, A_t) \Big\rangle_{p(S_{t+1}\mid S_t, A_t)} + \big\langle G^{\pi}_{T}(S_{t+1}, A_{t+1}, \beta) \big\rangle_{\pi(A_{t+1}\mid S_{t+1})\, p(S_{t+1}\mid S_t, A_t)}. \quad (18)$$

This recursive equation can be computed for any T. Specifically, we are interested in the typical behaviour, when the agent acts according to a stationary policy, π(A | S). In the next section we show how to find the optimal policy, π(A | S), that optimizes the unified Bellman recursion in (18).

4.1. Optimization Problem

To find the optimal policy (the one that minimizes the directed information under a constraint on the attained value), one needs to solve the unconstrained minimization of the corresponding Lagrangian in (17):

$$\operatorname*{argmin}_{\pi} G^{\pi}_{T}(S_t, A_t, \beta) = \operatorname*{argmin}_{\pi}\Big[\vec{I}^{\pi}_{T}(S_t, A_t) - \beta Q^{\pi}(S_t, A_t)\Big]. \quad (19)$$

The minimization with regard to π is done by augmenting the equation in (19) with a Lagrange term for the normalization of π(A | S), and computing the variational derivative, δG^π_T/δπ. The minimization with regard to q(A_{t'} | S, A) := p(A_{t'} | S, A) is done by setting the corresponding variational derivative, δG^q_T/δq, to zero. Straightforward computation of these variational derivatives gives the following set of self-consistent equations:

$$\pi^*(A \mid S) = \frac{\exp\big(-G^{\pi^*, q^*}(S_t, A_t, \beta)\big)}{\sum_{A} \exp\big(-G^{\pi^*, q^*}(S_t, A_t, \beta)\big)}, \quad (20)$$

$$p^*(A_{t'} \mid S, A) = \frac{1}{|A|}. \quad (21)$$

4.2. Stationary Policy

The typical behavior of the agent in its environment is defined by the stationary distribution over the states and the actions. This stationary distribution is given by the policy with the infinite time horizon:

$$\pi^*(A \mid S) = \frac{\exp\big(-G^{\pi^*, q^*}(S, A, \beta)\big)}{\sum_{A} \exp\big(-G^{\pi^*, q^*}(S, A, \beta)\big)}. \quad (22)$$

The unified Bellman recursion enables us to compute the stationary policy by iterating the recursion in Eq. (18) in the same way as for the Bellman recursion for the value function. Previously, the stationary policy for directed information minimization was derived for the LQG setting (Tanaka et al., 2015); in this work we derive the optimal stationary policy in the MDP setting.
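The sketch below iterates the unified recursion (18) together with the softmax policy update (20) on a small random MDP. For the action prior q(A_{t+1} | S_t, A_t) it uses the marginal induced by the policy and the dynamics, so that the first term of (18) equals the local conditional mutual information in Eq. (16); this choice of prior, the random MDP, the fixed number of sweeps (playing the role of the horizon), and the assumed zero-reward state are simplifications of this sketch rather than the exact prescription of Eqs. (20)-(21).

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, beta, sweeps = 6, 3, 0.05, 100     # illustrative sizes and trade-off

p_next = rng.random((n_s, n_a, n_s))
p_next /= p_next.sum(axis=2, keepdims=True)  # p(s' | s, a)
r = -np.ones((n_s, n_a, n_s))                # -1 per step, as in Section 5.1
r[:, :, 0] = 0.0                             # an assumed zero-reward target state

pi = np.full((n_s, n_a), 1.0 / n_a)          # policy pi(a | s)
G = np.zeros((n_s, n_a))                     # G(s, a, beta) of Eq. (18)
eps = 1e-30

for _ in range(sweeps):
    # action prior induced by pi and the dynamics: q(a' | s, a)
    q = np.einsum('xas,sb->xab', p_next, pi)
    # local information term  < ln pi(a'|s') / q(a'|s,a) >
    info = np.einsum('xas,sb->xa', p_next, pi * np.log(pi + eps)) \
         - (q * np.log(q + eps)).sum(axis=2)
    expected_r = (p_next * r).sum(axis=2)
    future = np.einsum('xas,s->xa', p_next, (pi * G).sum(axis=1))
    G = info - beta * expected_r + future                 # Eq. (18)
    pi = np.exp(-(G - G.min(axis=1, keepdims=True)))      # Eq. (20), softmax(-G)
    pi /= pi.sum(axis=1, keepdims=True)
```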

5. Numerical Simulations

To explore the long-term (typical) interaction between an agent and the environment, we solve different benchmark RL problems, such as maze-escaping problems (Sutton & Barto, 1998), using the information-value recursion in Eq. (18).

We emphasize that, even though in these experiments the target goal appears in a particular state (or in a number of states), there are different possible scenarios where a goal state can appear in any state. In those cases the agent should be prepared to reach any goal. This omni-goal strategy will be discussed in Section 7.

5.1. Maze with Predefined Targets

To investigate the solutions in (18) we simulated different mazes where the agent is asked to find a target cell. The agent has full information about the world (it knows its exact location within the maze), which is proportional to the logarithm of the maze size. But it cannot utilize all this information to generate the optimal actions due to its slow 'brain'. For example, the agent's sensor retrieves 8 bits of information about its location in each interaction with the environment, whereas the processor can transfer only 3 bits of information to the policy.

The dynamics is given by a transition probability p(S_{t+1} | A_t, S_t). If a certain action at time t is impossible, such as stepping into a wall, the agent remains in its current position. The target state is an absorbing state: the agent stays in this state for any action, p(S_target | A_t, S_t) = δ_{S_target, S_t}. The agent receives a reward of 0 when it is in the target state. In all other states the reward is given by r(S_{t+1} | A_t, S_t) = -1. This is the general setting for maze-escaping problems (Sutton & Barto, 1998).

Figure 2 shows the value function and the directed information for different values of the parameter β. The optimal solutions to equation (18) are characterized by phase transitions: a qualitative difference in the typical behavior as a function of the value of the parameter β. The optimal path is indicated by a solid white line from the start state, marked by S, to the goal state, marked by G.

Low values of β correspond to high stochasticity in the system. The more stochastic the system is, the harder it is for the agent to affect the environment. Specifically, the agent prefers not to take the risk of going through the narrow door for β = 0.005, as shown in Figure 2(b). However, it does go through the narrow door for β = 0.05, as shown in Figure 2(a).

Figure 2. Numerical simulation of the maze problem with 625 cells. Sub-figures (a), (c), and (e) correspond to β = 0.05; they show the value landscape, the directed information landscape, and the directed information along the optimal path. Sub-figures (b), (d), and (f) correspond to β = 0.005. A phase transition occurs between β = 0.05 and β = 0.005, where the typical behavior "to pass through the narrow door" changes to the qualitatively different behavior "to take a longer path without entering the narrow door". The peaks in sub-figures (e) and (f) correspond to states with high information consumption. For instance, as seen in (e), the 31st state along the optimal path (the entrance to the narrow door) corresponds to the maximal information rate that the agent should process in order to reach the goal state.

The typical distribution of the directed information along the optimal path is shown in the bottom plots. The information is higher when the agent needs to move carefully, e.g., near the corners and/or through the narrow door. The solution is derived for the infinite time horizon, which enables us to explore the typical stationary policy of the agent.

As seen in sub-figure (d), the agent needs different amounts of information at different times along the optimal path. For example, it must be more careful (it needs more information) at the narrow hole in order to exit the compartment in the middle of the maze. This is the maximum of the minimal directed information from the states to the actions, which should be taken into consideration when choosing a processor for the agent.
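The maze setting above can be encoded as a tabular MDP in a few lines. The sketch below builds the transition and reward tensors for an illustrative 5x5 grid; the grid size, wall cells, and goal position are assumptions, while the dynamics (blocked moves keep the agent in place, the goal is absorbing, every other step costs -1) follow the description in this subsection.

```python
import numpy as np

n = 5                                         # illustrative grid size
walls = {(2, 1), (2, 2), (2, 3)}              # assumed wall cells
goal = (4, 4)                                 # assumed goal cell
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

cells = [(i, j) for i in range(n) for j in range(n) if (i, j) not in walls]
idx = {c: k for k, c in enumerate(cells)}
n_s, n_a = len(cells), len(moves)

P = np.zeros((n_s, n_a, n_s))                 # p(s' | s, a)
R = np.full((n_s, n_a), -1.0)                 # reward of -1 per step
for c, k in idx.items():
    for a, (di, dj) in enumerate(moves):
        if c == goal:                         # absorbing target state, reward 0
            P[k, a, k], R[k, a] = 1.0, 0.0
            continue
        nxt = (c[0] + di, c[1] + dj)
        if nxt not in idx:                    # wall or border: stay in place
            nxt = c
        P[k, a, idx[nxt]] = 1.0
```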
6. Information from Agent to Environment

In this section we address the question of the maximal directed information from the agent to the environment, which is the dual of the question of the minimal directed information from the environment to the agent. These two directed information rates differ, because the directed information functional is asymmetric in general. We argue, and demonstrate by numerical simulations, that the maximal directed information between the agent and the environment is useful for acting optimally without an externally provided reward (unknown targets).

6.1. Background

The information capacity of the channel from the agent's actions to the environment states was shown to be useful in solving various AI benchmark problems without providing an external reward/utility function (Salge et al., 2014; Klyubin et al., 2005b;a; Mohamed & Rezende, 2015). This capacity, known as empowerment, evaluates the maximal number of environment states that the agent can potentially achieve by its actions.

The higher the value of the empowerment, the more potential states are reachable by actions. As in the empowerment method, when a goal state is not predefined but may occur in any state, as in Chess, Go and other interactive games, the optimal strategy for the agent is to keep as many states reachable as possible. As was shown in (Klyubin et al., 2005a), Figures 5 and 6, there is a qualitative similarity between the landscape of the empowerment and the landscape of the average shortest distance from a given state to any other state.

In this paper we extend the empowerment method by considering the feedback control policy, applying the directed information from the action trajectories to the environment state trajectories. In the next section we describe the causal conditioning of a sequence of states of length T on a sequence of actions of length T. We derive the Bellman-type recursion for the directed information from the actions to the states, which, together with the Bellman recursion from the states to the actions in Eq. (18), provides a complete picture of the bi-directional information exchange between the agent and the environment.

6.1.1. Causal Conditioning

We assume that at time t the environmental state is S_t. Given S_t, the agent computes the action A_t, and the environment moves to S_{t+1} according to the given transition probability, p(S_{t+1} | S_t, A_t). This interaction repeats for T - 1 cycles, driving the system to its final state, S_T, by a sequence of actions A_1^{T-1}.

The joint probability between A_t^{T-1} and S_{t+1}^T, conditioned on S_t, is given by p(S_{t+1}^T, A_t^{T-1} | S_t). The causal dependency between the time series A_t^{T-1} and S_t^T differs from the non-causal one in its conditional probability (for details see, e.g., (Kramer, 1998; Massey & Massey, 2005; Marko, 1973; Permuter et al., 2009)), as follows:

$$p(S_{t+1}^{T} \mid A_{t}^{T-1}, S_t) = \prod_{i=t+1}^{T} p(S_i \mid S_{t}^{i-1}, A_{t}^{T-1}), \quad (23)$$

$$p(S_{t+1}^{T} \,\|\, A_{t}^{T-1}, S_t) = \prod_{i=t+1}^{T} p(S_i \mid S_{t}^{i-1}, A_{t}^{i-1}). \quad (24)$$

The joint probability, p(S_{t+1}^T, A_t^{T-1} | S_t), is given by:

$$p(S_{t+1}^{T}, A_{t}^{T-1} \mid S_t) = p(S_{t+1}^{T} \,\|\, A_{t}^{T-1}, S_t)\, p(A_{t}^{T-1} \,\|\, S_{t}^{T-1}), \quad (25)$$

which is decomposed alternatively, according to Bayes' rule, as:

$$p(S_{t+1}^{T}, A_{t}^{T-1} \mid S_t) = p(A_{t}^{T-1} \mid S_{t+1}^{T}, S_t)\, p(S_{t+1}^{T} \mid S_t). \quad (26)$$

6.1.2. Directed Information

The two alternative decompositions of the joint distribution, p(S_{t+1}^T, A_t^{T-1} | S_t), enable us to decompose the directed information as follows:

$$I\big[A_{t}^{T-1} \to S_{t+1}^{T} \mid s_t\big] = \Big\langle \ln \frac{p(S_{t+1}^{T} \,\|\, A_{t}^{T-1}, S_t)}{p(S_{t+1}^{T} \mid S_t)} \Big\rangle_{p(S_{t+1}^{T}, A_{t}^{T-1} \mid s_t)} = \Big\langle \ln \frac{\prod_{j=t}^{T-1} p(A_j \mid S_j, S_{j+1})}{\prod_{j=t}^{T-1} \pi(A_j \mid S_j)} \Big\rangle_{p(S_{t+1}^{T}, A_{t}^{T-1} \mid s_t)}, \quad (27)$$

which follows from the alternative decompositions of the joint distribution in (25) and (26):

$$\frac{p(S_{t+1}^{T} \,\|\, A_{t}^{T-1}, S_t)}{p(S_{t+1}^{T} \mid S_t)} = \frac{p(A_{t}^{T-1} \mid S_{t+1}^{T}, S_t)}{p(A_{t}^{T-1} \,\|\, S_{t}^{T-1})}, \quad (28)$$

and the graph property:

$$\forall\, 1 \le t < T: \quad A_t \perp\!\!\!\perp \big(A_{1}^{t-1}, A_{t+1}^{T-1}, S_{1}^{t-1}, S_{t+1}^{T}\big) \;\big|\; S_t, S_{t+1}.$$

This decomposition leads to the Bellman equation for the directed information from the action trajectories to the state trajectories.

6.2. Bellman Equation for I⃗[Actions → States]

Separating the first term in (27) from the rest, we get the following recursion:

$$I\big[A_{t}^{T-1} \to S_{t+1}^{T} \mid s_t\big] = I[A_t; S_{t+1} \mid S_t] + \big\langle I\big[A_{t+1}^{T-1} \to S_{t+2}^{T} \mid S_{t+1}\big] \big\rangle_{p(S_{t+1} \mid s_t)}, \quad (29)$$

where I[A_t; S_{t+1} | S_t] is the mutual information between the t-th action and the (t+1)-th state, conditioned on the t-th state.

The conditional directed information in (29) is a function of the state. To improve readability, the recursion in (29) may be represented concisely as follows (where the right arrow in I⃗ distinguishes the directed information from the ordinary, symmetric mutual information):

$$\vec{I}(s_t) = I[A_t; S_{t+1} \mid s_t] + \big\langle \vec{I}(S_{t+1}) \big\rangle_{p(S_{t+1} \mid s_t)}, \quad (30)$$

where the free distribution is the feedback policy distribution, π(A | S). Assuming stationarity, the time index in (30) may be dropped:

$$\vec{I}(s) = I\big[A; S' \mid s\big] + \big\langle \vec{I}(S') \big\rangle_{p(S' \mid s)}. \quad (31)$$

The maximum of I⃗(s) with regard to π(A | S) gives the maximal number of states that the agent can achieve by its actions. In the next section we provide the formal solution.

6.3. Optimal Solution

To find the maximum of the directed information in (31), we apply alternating maximization between the policy distribution, π(A_i | S_i), and the inverse-channel distribution, p(A_i | S_i, S_{i+1}) (conventionally denoted by q(A_i | S_i, S_{i+1})), as in the standard problem of finding the channel capacity (Cover & Thomas, 2012; Blahut, 1972). The optimal formal solution is given by the following set of self-consistent equations:

$$q^*(A \mid S, S') = \frac{p(S' \mid S, A)\, \pi^*(A \mid S)}{\sum_{A} p(S' \mid S, A)\, \pi^*(A \mid S)}, \quad (32)$$

$$\vec{I}^{\pi^*}[s] = I^{\pi^*}\big[A; S' \mid S\big] + \big\langle \vec{I}^{\pi^*}[S'] \big\rangle_{p(S' \mid A, S)\, \pi^*(A \mid S)}, \quad (33)$$

$$\pi^*(A \mid S) = \frac{\exp\Big(\sum_{S'} p(S' \mid S, A)\big[\ln q^*(A \mid S, S') + \vec{I}^{\pi^*}[S']\big]\Big)}{\sum_{A} \exp\Big(\sum_{S'} p(S' \mid S, A)\big[\ln q^*(A \mid S, S') + \vec{I}^{\pi^*}[S']\big]\Big)}. \quad (34)$$

This set of self-consistent equations extends the common Blahut-Arimoto algorithm by the directed information of the future action and state trajectories, which appears in the exponent of π*(A | S) in Eq. (34). A numerical solution is derived by iterating equations (32)-(34) until convergence. The typical convergence is shown by the numerical simulations provided in the next section.
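A minimal sketch of the alternating maximization (32)-(34) is given below for a small random MDP. The random dynamics, the fixed number of backup sweeps (playing the role of the time horizon), and the small numerical floor inside the logarithms are assumptions of this sketch; the paper instead iterates until the relative-change criterion described in Section 7 is met.

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, sweeps = 6, 3, 50                   # illustrative sizes and horizon

p_next = rng.random((n_s, n_a, n_s))
p_next /= p_next.sum(axis=2, keepdims=True)   # p(s' | s, a)

pi = np.full((n_s, n_a), 1.0 / n_a)           # policy pi(a | s)
I = np.zeros(n_s)                             # directed-information-to-go, Eq. (33)
eps = 1e-30

for _ in range(sweeps):
    # Eq. (32): inverse channel q(a | s, s') proportional to p(s'|s,a) pi(a|s)
    q = p_next * pi[:, :, None]               # q[s, a, s'] before normalization
    q /= q.sum(axis=1, keepdims=True) + eps

    # Eq. (34): pi(a|s) proportional to exp( sum_s' p(s'|s,a) [ln q(a|s,s') + I(s')] )
    logits = (p_next * (np.log(q + eps) + I[None, None, :])).sum(axis=2)
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

    # Eq. (33): local conditional mutual information plus the expected future term
    local = (pi[:, :, None] * p_next *
             (np.log(q + eps) - np.log(pi[:, :, None] + eps))).sum(axis=(1, 2))
    future = (pi[:, :, None] * p_next * I[None, None, :]).sum(axis=(1, 2))
    I = local + future
```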

7. Numerical Simulations

To demonstrate the optimal solution given by equations (32)-(34), we consider the problem of the optimal location of a firehouse in a city. We model the city by a grid world, where a fire engine is an agent moving across the city according to a given transition probability, p(S_{t+1} | S_t, A_t). The goal is to locate the firehouse at a position from which the firefighters can reach every block in the city in the fastest way on average.

This is an example where the agent should be optimally prepared to reach any possible target state on average, rather than to be able to get to a particular state (or a number of states).

The naive solution is to compute the average shortest distance between each block in the city and all the other blocks, and to locate the firehouse at the block with the minimal average shortest distance. The algorithms for solving the all-pairs shortest path problem have a time complexity of O(V^3) (Cormen, 2009), where V is the number of vertices in the graph.

Figure 3. Sub-figures (a) and (c) show the exact values of the minimal average path from a given block to all the other blocks in the city. Sub-figures (b) and (d) show the value of the maximal directed information from the agent to the environment in the long time horizon at every block in the city. Black lines represent the fences in the city. Red corresponds to high values.

Sub-figures (3.a) and (3.c) show the exact values of the average shortest paths from a particular block to all the other blocks in two different cities: a city without any walls, and a city with perpendicular walls (drawn in black), respectively. Sub-figures (3.b) and (3.d) show the values of the directed information computed by iterating the self-consistent equations (32)-(34). Red blocks denote high values, and blue blocks denote low values.

The two approaches suggest qualitatively similar patterns for the location of the firehouse. The preferable locations are the brightest red blocks. In the case of the city without walls, both solutions give the same block.

Figure 4 shows the convergence of the iteration of the self-consistent equations (32)-(34). The convergence is sub-linear in the number of blocks in the city, where each iteration consumes O(|S| × |A|) multiplications of real numbers, which is overall faster, in order of magnitude, than Floyd/Dijkstra-type algorithms. Convergence is said to occur when the relative change in the directed information between two consecutive iterations is smaller than the tolerance τ = 10^{-6}: |I⃗_{k+1} - I⃗_k| / I⃗_k ≤ τ, where I⃗_{k+1} is given by Eq. (31) at the k-th iteration.

Figure 4. Convergence of the Bellman recursion for the directed information. x-axis: number of blocks; y-axis: number of iterations until convergence.
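For reference, the O(V^3) all-pairs baseline mentioned above can be realized with the Floyd-Warshall algorithm. The sketch below returns the average shortest distance from every block, the quantity displayed in Figures 3(a) and 3(c); the adjacency-matrix representation of the city grid is an assumption of this sketch.

```python
import numpy as np

def average_shortest_distance(adj):
    """adj[i, j] = 1 if blocks i and j are adjacent, else 0."""
    V = adj.shape[0]
    D = np.where(adj > 0, 1.0, np.inf)
    np.fill_diagonal(D, 0.0)
    for k in range(V):                         # Floyd-Warshall relaxation, O(V^3)
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D.mean(axis=1)                      # average distance from each block
```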
8. Conclusion

The feedback interaction between the agent and the environment includes two asymmetric information channels. Directed information from the environment constrains the maximal expected reward obtained by the agent in the standard RL setting. Directed information is a functional of the joint probability distribution of the agent's action trajectories and the environmental state trajectories in a given time horizon. As a result, the value of the directed information depends on the particular time horizon. However, it is worth exploring the typical behavior of an agent in an infinite time horizon, when all transient events have settled down or become statistically insignificant. In this work we derive the typical behavior of an agent that strives to increase its expected reward under the limitations of directed information. This is shown to be possible through the unified Information-Value Bellman-type recursion, which we present in this work.

The second channel transfers directed information from the agent to the environment. It has been shown that the capacity of this channel is useful for solving various AI test bench problems, where the agent needs to be ready to solve a problem optimally on average for any possible target. We derived a Bellman recursion for this channel as well, which enables us to explore the typical behavior of the agent in an infinite time horizon.

A direct continuation of this work would be to explore the entire action-perception cycle of the interaction between the intelligent agent and the environment, when both information channels operate simultaneously.

Acknowledgments

We thank Daniel Polani for many useful discussions. This work is partially supported by the Gatsby Charitable Foundation, the Israel Science Foundation, and the Intel ICRI-CI center.

References

Bertsekas, Dimitri P., et al. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995.

Blahut, Richard. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.

Cormen, Thomas H. Introduction to Algorithms. MIT Press, 2009.

Cover, Thomas M and Thomas, Joy A. Elements of Information Theory. John Wiley & Sons, 2012.

Kappen, Hilbert J, Gómez, Vicenç, and Opper, Manfred. Optimal control as a graphical model inference problem. Machine Learning, 87(2):159–182, 2012.

Klyubin, Alexander S, Polani, Daniel, and Nehaniv, Chrystopher L. All else being equal be empowered. In European Conference on Artificial Life, pp. 744–753. Springer, 2005a.

Klyubin, Alexander S, Polani, Daniel, and Nehaniv, Chrystopher L. Empowerment: A universal agent-centric measure of control. In Evolutionary Computation, 2005. The 2005 IEEE Congress on, volume 1, pp. 128–135. IEEE, 2005b.

Kramer, Gerhard. Directed Information for Channels with Feedback. PhD thesis, University of Manitoba, Canada, 1998.

Marko, Hans. The bidirectional communication theory - a generalization of information theory. IEEE Transactions on Communications, 21(12):1345–1351, 1973.

Massey, James L and Massey, Peter C. Conservation of mutual and directed information. In Information Theory, 2005. ISIT 2005. Proceedings. International Symposium on, pp. 157–158. IEEE, 2005.

Mohamed, Shakir and Rezende, Danilo Jimenez. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2125–2133, 2015.

Perkins, Theodore J and Barto, Andrew G. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3(Dec):803–832, 2002.

Permuter, Haim Henry, Weissman, Tsachy, and Goldsmith, Andrea J. Finite state channels with time-invariant deterministic feedback. IEEE Transactions on Information Theory, 55(2):644–662, 2009.

Polani, Daniel, Nehaniv, C, Martinetz, Thomas, and Kim, Jan T. Relevant information in optimized persistence vs. progeny strategies. In Artificial Life X: Proceedings of the Tenth International Conference on the Simulation and Synthesis of Living Systems, pp. 337–343. Citeseer, 2006.

Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Rubin, Jonathan, Shamir, Ohad, and Tishby, Naftali. Trading value and information in MDPs. In Decision Making with Imperfect Decision Makers, pp. 57–74. Springer, 2012.

Salge, Christoph, Glackin, Cornelius, and Polani, Daniel. Empowerment - an introduction. In Guided Self-Organization: Inception, pp. 67–114. Springer, 2014.

Sutton, Richard S and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

Tanaka, Takashi and Sandberg, Henrik. SDP-based joint sensor and controller design for information-regularized optimal LQG control. In Decision and Control (CDC), 2015 IEEE 54th Annual Conference on, pp. 4486–4491. IEEE, 2015.

Tanaka, Takashi, Esfahani, Peyman Mohajerin, and Mitter, Sanjoy K. LQG control with minimal information: Three-stage separation principle and SDP-based solution synthesis. arXiv preprint arXiv:1510.04214, 2015.

Tatikonda, Sekhar and Mitter, Sanjoy. Control under communication constraints. IEEE Transactions on Automatic Control, 49(7):1056–1068, 2004.

Tishby, Naftali and Polani, Daniel. Information theory of decisions and actions. In Perception-Action Cycle, pp. 601–636. Springer, 2011.

Todorov, Emanuel et al. Linearly-solvable Markov decision problems. In NIPS, pp. 1369–1376, 2006.