A Distributional Perspective on Reinforcement Learning

Marc G. Bellemare* 1, Will Dabney* 1, Rémi Munos 1

*Equal contribution. 1 DeepMind, London, UK. Correspondence to: Marc G. Bellemare <[email protected]>.
Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).
arXiv:1707.06887v1 [cs.LG] 21 Jul 2017

Abstract

In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning, which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter. We then use the distributional perspective to design a new algorithm which applies Bellman's equation to the learning of approximate value distributions. We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning. Finally, we combine theoretical and empirical evidence to highlight the ways in which the value distribution impacts learning in the approximate setting.

1. Introduction

One of the major tenets of reinforcement learning states that, when not otherwise constrained in its behaviour, an agent should aim to maximize its expected utility Q, or value (Sutton & Barto, 1998). Bellman's equation succinctly describes this value in terms of the expected reward and expected outcome of the random transition (x, a) → (X', A'):

    Q(x, a) = \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}\, Q(X', A').

In this paper, we aim to go beyond the notion of value and argue in favour of a distributional perspective on reinforcement learning. Specifically, the main object of our study is the random return Z whose expectation is the value Q. This random return is also described by a recursive equation, but one of a distributional nature:

    Z(x, a) \overset{D}{=} R(x, a) + \gamma\, Z(X', A').

The distributional Bellman equation states that the distribution of Z is characterized by the interaction of three random variables: the reward R, the next state-action pair (X', A'), and its random return Z(X', A'). By analogy with the well-known case, we call this quantity the value distribution.
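One operational reading of the distributional Bellman equation is by sampling: a draw from the right-hand side is obtained by sampling a reward, a next state-action pair, and a return from the current estimate of Z(X', A'). The sketch below illustrates this reading only; it is not the algorithm developed later in the paper, and the three callable interfaces are hypothetical.

```python
def sample_bellman_target(x, a, sample_transition, sample_policy, sample_return, gamma=0.99):
    """Draw one sample of R(x, a) + gamma * Z(X', A').

    Hypothetical interfaces, for illustration only:
      sample_transition(x, a) -> (reward, next_state)   samples R(x, a) and X'
      sample_policy(x)        -> action                 samples A' ~ pi(. | x)
      sample_return(x, a)     -> float                  samples from the current estimate of Z(x, a)
    """
    reward, next_x = sample_transition(x, a)
    next_a = sample_policy(next_x)
    return reward + gamma * sample_return(next_x, next_a)


def empirical_target(x, a, sample_transition, sample_policy, sample_return,
                     n_samples=10_000, gamma=0.99):
    """Approximate the distribution of the Bellman target with n_samples independent draws."""
    return [sample_bellman_target(x, a, sample_transition, sample_policy, sample_return, gamma)
            for _ in range(n_samples)]
```

Averaging such samples recovers the expected Bellman target that defines Q; the distributional perspective keeps the whole collection of outcomes rather than collapsing it to its mean.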
Although the distributional perspective is almost as old as Bellman's equation itself (Jaquette, 1973; Sobel, 1982; White, 1988), in reinforcement learning it has thus far been subordinated to specific purposes: to model parametric uncertainty (Dearden et al., 1998), to design risk-sensitive algorithms (Morimura et al., 2010b;a), or for theoretical analysis (Azar et al., 2012; Lattimore & Hutter, 2012). By contrast, we believe the value distribution has a central role to play in reinforcement learning.

Contraction of the policy evaluation Bellman operator. Basing ourselves on results by Rösler (1992), we show that, for a fixed policy, the Bellman operator over value distributions is a contraction in a maximal form of the Wasserstein (also called Kantorovich or Mallows) metric. Our particular choice of metric matters: the same operator is not a contraction in total variation, Kullback-Leibler divergence, or Kolmogorov distance.

Instability in the control setting. We will demonstrate an instability in the distributional version of Bellman's optimality equation, in contrast to the policy evaluation case. Specifically, although the optimality operator is a contraction in expected value (matching the usual optimality result), it is not a contraction in any metric over distributions. These results provide evidence in favour of learning algorithms that model the effects of nonstationary policies.

Better approximations. From an algorithmic standpoint, there are many benefits to learning an approximate distribution rather than its approximate expectation. The distributional Bellman operator preserves multimodality in value distributions, which we believe leads to more stable learning. Approximating the full distribution also mitigates the effects of learning from a nonstationary policy. As a whole, we argue that this approach makes approximate reinforcement learning significantly better behaved.

We will illustrate the practical benefits of the distributional perspective in the context of the Arcade Learning Environment (Bellemare et al., 2013). By modelling the value distribution within a DQN agent (Mnih et al., 2015), we obtain considerably increased performance across the gamut of benchmark Atari 2600 games, and in fact achieve state-of-the-art performance on a number of games. Our results echo those of Veness et al. (2015), who obtained extremely fast learning by predicting Monte Carlo returns.

From a supervised learning perspective, learning the full value distribution might seem obvious: why restrict ourselves to the mean? The main distinction, of course, is that in our setting there are no given targets. Instead, we use Bellman's equation to make the learning process tractable; we must, as Sutton & Barto (1998) put it, "learn a guess from a guess". It is our belief that this guesswork ultimately carries more benefits than costs.

[Figure 1. A distributional Bellman operator with a deterministic reward function: (a) next-state distribution under policy π, P^π Z; (b) discounting shrinks the distribution towards 0, γ P^π Z; (c) the reward shifts it, R + γ P^π Z; (d) projection step, Φ T^π Z (Section 4).]

2. Setting

We consider an agent interacting with an environment in the standard fashion: at each step, the agent selects an action based on its current state, to which the environment responds with a reward and the next state. We model this interaction as a time-homogeneous Markov Decision Process (X, A, R, P, γ). As usual, X and A are respectively the state and action spaces, P is the transition kernel P(· | x, a), γ ∈ [0, 1] is the discount factor, and R is the reward function, which in this work we explicitly treat as a random variable. A stationary policy π maps each state x ∈ X to a probability distribution over the action space A.

2.1. Bellman's Equations

The return Z^π is the sum of discounted rewards along the agent's trajectory of interactions with the environment. The value function Q^π of a policy π describes the expected return from taking action a ∈ A from state x ∈ X, then acting according to π:

    Q^\pi(x, a) := \mathbb{E}\, Z^\pi(x, a) = \mathbb{E} \Big[ \sum_{t=0}^{\infty} \gamma^t R(x_t, a_t) \Big],    (1)
    x_t \sim P(\cdot \mid x_{t-1}, a_{t-1}), \quad a_t \sim \pi(\cdot \mid x_t), \quad x_0 = x, \ a_0 = a.

Fundamental to reinforcement learning is the use of Bellman's equation (Bellman, 1957) to describe the value function:

    Q^\pi(x, a) = \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}_{P, \pi}\, Q^\pi(x', a').

In reinforcement learning we are typically interested in acting so as to maximize the return. The most common approach for doing so involves the optimality equation

    Q^*(x, a) = \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}_{P} \max_{a' \in A} Q^*(x', a').

This equation has a unique fixed point Q^*, the optimal value function, corresponding to the set of optimal policies Π^* (π^* is optimal if \mathbb{E}_{a \sim \pi^*} Q^*(x, a) = \max_a Q^*(x, a)).

We view value functions as vectors in R^{X×A}, and the expected reward function as one such vector. In this context, the Bellman operator T^π and optimality operator T are

    \mathcal{T}^\pi Q(x, a) := \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}_{P, \pi}\, Q(x', a')    (2)
    \mathcal{T} Q(x, a) := \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}_{P} \max_{a' \in A} Q(x', a').    (3)

These operators are useful as they describe the expected behaviour of popular learning algorithms such as SARSA and Q-Learning. In particular they are both contraction mappings, and their repeated application to some initial Q_0 converges exponentially to Q^π or Q^*, respectively (Bertsekas & Tsitsiklis, 1996).
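As a concrete illustration of this contraction property (not code from the paper; the small random MDP and all names are hypothetical), the sketch below applies the optimality operator T of Equation (3) to a tabular Q; the sup-norm change between successive iterates shrinks by a factor of at most γ per application.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9         # hypothetical toy MDP

# Expected rewards E[R(x, a)] and transition kernel P(x' | x, a).
R = rng.normal(size=(n_states, n_actions))
P = rng.random(size=(n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)

def optimality_operator(Q):
    """(T Q)(x, a) = E[R(x, a)] + gamma * E_P[ max_{a'} Q(X', a') ]   (Equation 3)."""
    return R + gamma * P @ Q.max(axis=1)       # (S, A, S) @ (S,) -> (S, A)

# Repeated application converges exponentially (rate gamma) towards Q*.
Q = np.zeros((n_states, n_actions))
for i in range(1000):
    Q_next = optimality_operator(Q)
    gap = np.abs(Q_next - Q).max()             # sup-norm change between iterates
    Q = Q_next
    if gap < 1e-10:
        break
print(f"converged after {i + 1} applications; final sup-norm change {gap:.2e}")
```

Replacing the max by the policy-weighted expectation over actions gives the analogous illustration for T^π and policy evaluation.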
3. The Distributional Bellman Operators

In this paper we take away the expectations inside Bellman's equations and consider instead the full distribution of the random variable Z^π. From here on, we will view Z^π as a mapping from state-action pairs to distributions over returns, and call it the value distribution.

Our first aim is to gain an understanding of the theoretical behaviour of the distributional analogues of the Bellman operators, in particular in the less well-understood control setting. The reader strictly interested in the algorithmic contribution may choose to skip this section.

3.1. Distributional Equations

It will sometimes be convenient to make use of the probability space (Ω, F, Pr). The reader unfamiliar with measure theory may think of Ω as the space of all possible outcomes of an experiment (Billingsley, 1995). We will write \|u\|_p to denote the L_p norm of a vector u ∈ R^X for 1 ≤ p ≤ ∞; the same applies to vectors in R^{X×A}. The L_p norm of a random vector U : Ω → R^X (or R^{X×A}) is then

    \|U\|_p := \big[\, \mathbb{E}\, \|U(\omega)\|_p^p \,\big]^{1/p},

and for p = ∞ we have \|U\|_\infty = \operatorname{ess\,sup} \|U(\omega)\|_\infty (we will omit the dependency on ω ∈ Ω whenever unambiguous). We will denote the c.d.f. of a random variable U by F_U(y) := \Pr\{U \le y\}, and its inverse c.d.f. by F_U^{-1}(q) := \inf\{y : F_U(y) \ge q\}.

3.2. The Wasserstein Metric

The main tool for our analysis is the Wasserstein metric d_p between cumulative distribution functions. For F, G two c.d.f.s over the reals, it is defined as

    d_p(F, G) := \inf_{U, V} \|U - V\|_p,

where the infimum is taken over all pairs of random variables (U, V) with respective cumulative distributions F and G. Given two random variables U, V with c.d.f.s F_U, F_V, we write d_p(U, V) := d_p(F_U, F_V). Consider a scalar a and a random variable A independent of U, V. The metric d_p has the following properties:

    d_p(aU, aV) \le |a|\, d_p(U, V)    (P1)
    d_p(A + U, A + V) \le d_p(U, V)    (P2)
    d_p(AU, AV) \le \|A\|_p\, d_p(U, V).    (P3)

We will need the following additional property, which makes no independence assumptions on its variables.
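To make d_p concrete, here is a minimal numerical sketch (not from the paper) that computes the p-Wasserstein distance between two one-dimensional empirical distributions via the inverse-c.d.f. (sorted-sample) formulation, and checks property (P1) on synthetic data; all sample values are made up for illustration.

```python
import numpy as np

def wasserstein_p(x_samples, y_samples, p=1.0):
    """p-Wasserstein distance between two empirical distributions.

    Uses the inverse-c.d.f. formulation: with equal sample counts and uniform
    weights, the optimal coupling matches sorted samples, so
        d_p = ( mean_i |x_(i) - y_(i)|^p )^(1/p).
    """
    x = np.sort(np.asarray(x_samples, dtype=float))
    y = np.sort(np.asarray(y_samples, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample counts")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

# Synthetic data; property (P1): d_p(aU, aV) <= |a| d_p(U, V)
# (for empirical distributions the relation holds with equality).
rng = np.random.default_rng(1)
u = rng.normal(0.0, 1.0, size=1000)
v = rng.normal(0.5, 2.0, size=1000)
a = -3.0
assert np.isclose(wasserstein_p(a * u, a * v), abs(a) * wasserstein_p(u, v))
```

For p = 1 and equal-weight samples this should agree with scipy.stats.wasserstein_distance evaluated on the same samples.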
