Harvard University, Neurobiology 101hfm. Fundamentals in Computational Neuroscience
Spring term 2014/15
Lecture 8 – Reinforcement Learning III: Bellman equation
Alexander Mathis, Ashesh Dhawale
April 7th, 2015, due date April 15th, 2015

In class we have treated various aspects of reinforcement learning (RL) so far, in particular the Rescorla-Wagner model, greedy and softmax policies, and temporal difference (TD) learning. Here we provide a more principled, overarching view of RL with a focus on the mathematics rather than the neurobiology. We largely follow the exposition of the classical book by Sutton and Barto called "Reinforcement Learning".^1

^1 Sutton/Barto: "Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning)", MIT Press 1998.

1 The Reinforcement learning problem

RL studies how an agent learns from interactions with an environment in order to achieve a goal. At each time $t$, which evolves in discrete steps, the agent perceives some state $s_t \in S$, where $S$ is the (discrete) set of possible states, and performs an action $a_t \in A(s_t)$, where $A(s_t)$ is the set of actions possible in state $s_t$. At the next time step the agent receives a scalar reward $r_{t+1} \in \mathbb{R}$ and senses state $s_{t+1}$. This agent-environment interface is shown in Figure 1.

Figure 1: The agent-environment interaction in RL. Figure from Sutton/Barto.

At each time point the agent acts according to a probabilistic 'rule' that maps states $S$ onto actions $A$ and is called the policy $\pi$. In order to emphasize these dependencies one can write $\pi_t(s,a)$. The agent's goal in RL is to maximize the total reward collected in the environment, called the return $R_t$. The agent is not (necessarily) interested in maximizing short-term reward, but cumulative reward in the long run. The return is defined as the sum of all (future) rewards:

$R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T$,   (1)

in case there is a final time step $T$, or as

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$,   (2)

for $T = \infty$. Thereby, $\gamma \in [0,1]$ is the discount rate. Note that for $\gamma = 0$ the agent only cares about immediate rewards, and that for $\gamma < 1$ this series converges as long as the rewards $r_k$ are bounded.

This abstract framework can easily be employed for many useful and highly different scenarios. Due to this generality it is studied in a wide array of fields, with applications from robotics and machine learning to game theory and economics. But the math quickly becomes very hard, so we focus on something "simple": Markov decision processes (MDPs), which according to Sutton and Barto make up 90% of modern RL.

1.1 (Finite) Markov decision processes

A finite MDP is defined by its (finite) state set $S$ and action set $A$. Given a particular state $s$ and action $a$, the transition probability to each possible next state $s'$ is

$p^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$.   (3)

For any state $s$, action $a$ and next state $s'$, the expected value of the reward is given by

$r^a_{ss'} = E(r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s')$.   (4)

These two quantities $p^a_{ss'}$ and $r^a_{ss'}$ fully describe the (mean) dynamics of an MDP. Figure 2 shows a simple example. Given such an MDP, what is the optimal policy? How should one act? One of the key ideas in RL is that the policy can be derived from value functions.

Figure 2: An example MDP for a 'recycling robot'. The robot has two battery states (low & high), depicted as nodes, and 3 actions (waiting, searching for recyclable material, and recharging), depicted as edges. The transition probabilities $p^a_{ss'}$ and rewards $r^a_{ss'}$ are shown next to each directed edge. They are given by the scalar constants $\alpha, \beta \in [0,1]$ as well as $R^{search}$ and $R^{wait}$. Figure from Sutton/Barto.
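As a concrete illustration (added here, not part of the original notes), the recycling-robot dynamics of Figure 2 can be written down directly as the two tables $p^a_{ss'}$ and $r^a_{ss'}$. The sketch below is a minimal Python encoding of that structure; the variable names and the numerical values of $\alpha$, $\beta$, $R^{search}$ and $R^{wait}$ are arbitrary placeholders.

```python
# Minimal sketch: the recycling-robot MDP of Figure 2 encoded as
# transition probabilities P[s][a] = {s': prob} and expected rewards
# R[s][a] = {s': reward}.  alpha, beta, R_search, R_wait are assumed values.

alpha, beta = 0.9, 0.4          # battery-survival probabilities (placeholders)
R_search, R_wait = 2.0, 1.0     # expected rewards (placeholders)

P = {
    "high": {
        "search": {"high": alpha, "low": 1 - alpha},
        "wait":   {"high": 1.0},
    },
    "low": {
        "search":   {"low": beta, "high": 1 - beta},
        "wait":     {"low": 1.0},
        "recharge": {"high": 1.0},
    },
}

R = {
    "high": {
        "search": {"high": R_search, "low": R_search},
        "wait":   {"high": R_wait},
    },
    "low": {
        "search":   {"low": R_search, "high": -3.0},  # -3: robot had to be rescued
        "wait":     {"low": R_wait},
        "recharge": {"high": 0.0},
    },
}

# Sanity check: transition probabilities out of each (state, action) sum to 1.
for s, actions in P.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-12
```

Together these two tables are all that is needed to write down (and later solve) the Bellman equations for this MDP.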
1.2 State-value and action-value functions

For an MDP the state-value function for policy $\pi$ is defined as

$V^\pi(s) = E_\pi(R_t \mid s_t = s) = E_\pi\left( \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \right)$.   (5)

The expectation is calculated over the agent's policy $\pi$. In the same vein, one can define the action-value function for policy $\pi$, $Q^\pi$, which denotes the expected return starting at $s$, taking the action $a$ and thereafter following policy $\pi$:

$Q^\pi(s,a) = E_\pi(R_t \mid s_t = s, a_t = a) = E_\pi\left( \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \right)$.   (6)

1.3 Self-consistency and Bellman equation

Self-consistency of the value function implies that certain recursive relationships have to be satisfied. This can be seen as follows:

$V^\pi(s) = E_\pi(R_t \mid s_t = s)$
$= E_\pi\left( \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \right)$
$= E_\pi\left( r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s \right)$
$= \sum_a \pi(s,a) \sum_{s'} p^a_{ss'} \left( r^a_{ss'} + \gamma \, E_\pi\left( \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_{t+1} = s' \right) \right)$
$= \sum_a \pi(s,a) \sum_{s'} p^a_{ss'} \left( r^a_{ss'} + \gamma V^\pi(s') \right)$.

This important recursive relation is called the Bellman equation for $V^\pi$. The value function $V^\pi$ is the solution to this Bellman equation.

1.4 Optimal value functions and policies

A policy $\pi$ is called better than or equal to $\pi'$, i.e. $\pi \geq \pi'$, if and only if $V^\pi(s) \geq V^{\pi'}(s)$ for all states. The relation $\geq$ defines a partial order^2 on the set of policies. Thus, there is a (not necessarily unique) optimal policy and a corresponding optimal state-value function $V^*$:

$V^*(s) = \max_\pi V^\pi(s) \quad \forall s \in S$.   (7)

^2 A partial order on a set $P$ is a binary relation $R$ that is reflexive ($aRa$ $\forall a \in P$), antisymmetric (if $aRb$ and $bRa$ then $a = b$) and transitive (if $aRb$ and $bRc$ then $aRc$).

Similarly, one can define an optimal action-value function^3 $Q^*$:

$Q^*(s,a) = \max_\pi Q^\pi(s,a) \quad \forall s \in S, a \in A$.   (8)

^3 Its existence follows from the existence of $V^*$ and the defining equation $Q^*(s,a) = E\left( r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \right)$.

Because $V^*$ is the value function of a particular policy $\pi^*$ it satisfies the Bellman equation. Since $\pi^*$ is also an optimal policy, one can conclude:

$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s,a)$
$= \max_a E_{\pi^*}(R_t \mid s_t = s, a_t = a)$
$= \max_a E_{\pi^*}\left( \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \right)$
$= \max_a E\left( r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \right)$
$= \max_a \sum_{s'} p^a_{ss'} \left( r^a_{ss'} + \gamma V^*(s') \right)$.   (9)

Similarly one can derive the Bellman equation for $Q^*$:

$Q^*(s,a) = E_{\pi^*}\left( r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \right)$
$= \sum_{s'} p^a_{ss'} \left( r^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right)$.

For a finite MDP the Bellman optimality equation (Eq. (9)) has a unique solution, independent of the policy. Note that for $N$ states there are actually $N$ nonlinear equations in $N$ unknowns that need to be solved to obtain $V^*$. Also for $Q^*(s,a)$ the number of equations and unknowns is equal. From either one of these functions one can define an optimal policy, denoted by $\pi^*$, in a straightforward way. Once one has $V^*$, an optimal policy can be defined as follows: for a given state one assigns nonzero probabilities only to the actions $a$ that attain the maximum in the Bellman optimality equation (9). In other words, the greedy policy with respect to $V^*$ is optimal. From $Q^*(s,a)$ one obtains an optimal policy by simply assigning nonvanishing probability only to the actions that maximize $Q^*(s,a)$.

Example: One can show that for the simple recycling robot the Bellman optimality equations are given by

$V^*(\text{high}) = \max\{\, R^{search} + \gamma(\alpha V^*(\text{high}) + (1-\alpha) V^*(\text{low})), \; R^{wait} + \gamma V^*(\text{high}) \,\}$   (10)
$V^*(\text{low}) = \max\{\, \beta R^{search} - 3(1-\beta) + \gamma((1-\beta) V^*(\text{high}) + \beta V^*(\text{low})), \; R^{wait} + \gamma V^*(\text{low}), \; \gamma V^*(\text{high}) \,\}$   (11)

For given constants, one can then find the corresponding optimal value function $V^*$.
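To see how Eqs. (10)-(11) pin down $V^*$ in practice, one can apply their right-hand sides repeatedly as a fixed-point update (value iteration). The sketch below does this for the recycling robot; it is an added illustration, not part of the original notes, and the constants $\alpha$, $\beta$, $\gamma$, $R^{search}$, $R^{wait}$ are arbitrary assumed values.

```python
# Sketch: solve the robot's Bellman optimality equations (10)-(11) by
# iterating the right-hand side until the values stop changing.
# alpha, beta, gamma, R_search, R_wait are assumed placeholder values.

alpha, beta, gamma = 0.9, 0.4, 0.9
R_search, R_wait = 2.0, 1.0

V = {"high": 0.0, "low": 0.0}   # arbitrary initial guess

for _ in range(1000):
    v_high = max(
        R_search + gamma * (alpha * V["high"] + (1 - alpha) * V["low"]),  # search
        R_wait + gamma * V["high"],                                       # wait
    )
    v_low = max(
        beta * R_search - 3 * (1 - beta)
        + gamma * ((1 - beta) * V["high"] + beta * V["low"]),             # search
        R_wait + gamma * V["low"],                                        # wait
        gamma * V["high"],                                                # recharge
    )
    if abs(v_high - V["high"]) < 1e-10 and abs(v_low - V["low"]) < 1e-10:
        break
    V = {"high": v_high, "low": v_low}

print(V)  # approximate V*(high), V*(low) for these constants
```

The greedy policy with respect to the resulting $V^*$ (picking, in each state, the action that attains the max) is then an optimal policy, as described above.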
1.5 Bellman equation and beyond

Solving the system of Bellman equations is one way to solve RL problems. Being able to do so relies on three key assumptions:

1. one needs to know the dynamics of the environment (the transition probabilities & rewards),
2. one needs to have enough computational power to crunch the Bellman equations,
3. the Markov property.

For practical problems these conditions are typically violated. For instance, backgammon or Atari games, as we will discuss in next week's paper, have so many 'states' that no computer is currently able to solve the system of Bellman equations. However, broadly speaking, there are various techniques to use when these conditions are violated:

1. Monte Carlo methods: The state-value and action-value functions can be estimated from experience, i.e. if an agent behaves according to policy $\pi$ then the average of the actual returns following a state $s$ will converge to $V^\pi(s)$. Similarly, by keeping separate counts for each action $a$ and state $s$ the agent can approximate $Q^\pi(s,a)$. We discussed something along these lines when we considered the n-armed bandit problem (see the sketch after this list).

2. When an RL problem has many states, this non-parametric estimation approach is highly impractical, and parametrized functions are used instead to approximate $V^\pi$ or $Q^\pi$, respectively, based on the experience of the agent. Among other tricks, Deep Q-learning uses deep neural networks to achieve just that.^4

^4 "Human-level control through deep reinforcement learning" by Mnih et al., Nature 2015; doi:10.1038/nature14236
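As a small illustration of the Monte Carlo idea in point 1 (again an added sketch, not from the notes): simulate the recycling robot under a fixed, uniformly random policy and average the discounted returns observed from each start state. Since the task is continuing, returns are truncated after a horizon $H$; the constants, the horizon, and the episode count are arbitrary choices.

```python
import random

# Sketch: Monte Carlo estimate of V^pi for the recycling robot under a
# uniformly random policy pi, by averaging truncated discounted returns.
# All numerical constants are assumed placeholder values.

alpha, beta, gamma = 0.9, 0.4, 0.9
R_search, R_wait = 2.0, 1.0
H = 200                     # truncation horizon (gamma**H is negligible)
ACTIONS = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

def step(s, a):
    """Sample (next_state, reward) according to the dynamics of Figure 2."""
    if s == "high":
        if a == "search":
            return ("high" if random.random() < alpha else "low"), R_search
        return "high", R_wait                      # wait
    if a == "search":
        if random.random() < beta:
            return "low", R_search
        return "high", -3.0                        # battery died, robot rescued
    if a == "wait":
        return "low", R_wait
    return "high", 0.0                             # recharge

def mc_value(start, n_episodes=5000):
    """Average truncated discounted return from `start` under the random policy."""
    total = 0.0
    for _ in range(n_episodes):
        s, G, discount = start, 0.0, 1.0
        for _ in range(H):
            a = random.choice(ACTIONS[s])
            s, r = step(s, a)
            G += discount * r
            discount *= gamma
        total += G
    return total / n_episodes

print({s: round(mc_value(s), 2) for s in ["high", "low"]})
```

No knowledge of $p^a_{ss'}$ or $r^a_{ss'}$ is used inside `mc_value`; the estimate comes purely from sampled experience, which is exactly what makes such methods applicable when the first of the three assumptions above fails.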