The Uncertainty Bellman Equation and Exploration

Brendan O'Donoghue, Ian Osband, Remi Munos, Volodymyr Mnih

Abstract

We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for ε-greedy improves DQN performance on 51 out of 57 games in the Atari suite.

DeepMind. Correspondence to: Brendan O'Donoghue. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

1. Introduction

We consider the reinforcement learning (RL) problem of an agent interacting with its environment to maximize cumulative rewards over time (Sutton & Barto, 1998). We model the environment as a Markov decision process (MDP), but where the agent is initially uncertain of the true dynamics and mean rewards of the MDP (Bellman, 1957; Bertsekas, 2005). At each time-step, the agent performs an action, receives a reward, and moves to the next state; from these data it can learn which actions lead to higher payoffs. This leads to the exploration versus exploitation trade-off: should the agent investigate poorly understood states and actions to improve future performance, or instead take actions that maximize rewards given its current knowledge?

Separating estimation and control in RL via 'greedy' algorithms can lead to premature and suboptimal exploitation. To offset this, the majority of practical implementations introduce some random noise or dithering into their action selection (such as ε-greedy). These algorithms will eventually explore every reachable state and action infinitely often, but can take exponentially long to learn the optimal policy (Kakade, 2003). By contrast, for any set of prior beliefs the optimal exploration policy can be computed directly by dynamic programming in the Bayesian belief space. However, this approach can be computationally intractable for even very small problems (Guez et al., 2012), while direct computational approximations can fail spectacularly badly (Munos, 2014).

For this reason, most provably-efficient approaches to reinforcement learning rely upon the optimism in the face of uncertainty (OFU) principle (Lai & Robbins, 1985; Kearns & Singh, 2002; Brafman & Tennenholtz, 2002). These algorithms give a bonus to poorly-understood states and actions and subsequently follow the policy that is optimal for this augmented optimistic MDP. This optimism incentivizes exploration but, as the agent learns more about the environment, the scale of the bonus should decrease and the agent's performance should approach optimality. At a high level these approaches to OFU-RL build up confidence sets that contain the true MDP with high probability (Strehl & Littman, 2004; Lattimore & Hutter, 2012; Jaksch et al., 2010). These techniques can provide performance guarantees that are 'near-optimal' in terms of the problem parameters. However, apart from the simple 'multi-armed bandit' setting with only one state, there is still a significant gap between the upper and lower bounds for these algorithms (Lattimore, 2016; Jaksch et al., 2010; Osband & Van Roy, 2016).

One inefficiency in these algorithms is that, although the concentration may be tight at each state and action independently, the combination of simultaneously optimistic estimates may result in an extremely over-optimistic estimate for the MDP as a whole (Osband & Van Roy, 2017). Other works have suggested that a Bayesian posterior sampling approach may not suffer from these inefficiencies and can lead to performance improvements over OFU methods

(Strens, 2000; Osband et al., 2013; Grande et al., 2014). In this paper we explore a related approach that harnesses the simple relationship of the uncertainty Bellman equation (UBE), where we define uncertainty to be the variance of the Bayesian posterior of the Q-values of a policy conditioned on the data the agent has collected, in a sense similar to the parametric variance of Mannor et al. (2007). Intuitively speaking, if the agent has high uncertainty (as measured by high posterior variance) in a region of the state-space then it should explore there, in order to get a better estimate of those Q-values. We show that, just as the Bellman equation relates the value of a policy beyond a single time-step, so too does the uncertainty Bellman equation propagate uncertainty values over multiple time-steps, thereby facilitating 'deep exploration' (Osband et al., 2017; Moerland et al., 2017).

The benefit of our approach (which learns the solution to the UBE and uses this to guide exploration) is that we can harness the existing machinery for deep reinforcement learning with minimal change to existing network architectures. The resulting algorithm shares a connection to the existing literature of both OFU and intrinsic motivation (Singh et al., 2004; Schmidhuber, 2009; White & White, 2010). Recent work has further connected these approaches through the notion of 'pseudo-count' (Bellemare et al., 2016; Ostrovski et al., 2017), a generalization of the number of visits to a state and action. Rather than adding a pseudo-count based bonus to the rewards, our work builds upon the idea that the more fundamental quantity is the uncertainty of the value function, and that naively compounding count-based bonuses may lead to inefficient confidence sets (Osband & Van Roy, 2017). The key difference is that the UBE compounds the variances at each step, rather than the standard deviations.

The observation that the higher moments of a value function also satisfy a form of Bellman equation is not new and has been observed by some of the early papers on the subject (Sobel, 1982). Unlike most prior work, we focus upon the epistemic uncertainty over the value function, as captured by the Bayesian posterior, i.e., the uncertainty due to estimating a parameter using a finite amount of data, rather than the higher moments of the reward-to-go (Lattimore & Hutter, 2012; Azar et al., 2012; Mannor & Tsitsiklis, 2011; Bellemare et al., 2017). For application to rich environments with complex generalization we will use a deep learning architecture to learn a solution to the UBE, in the style of Tamar et al. (2016).

2. Problem formulation

We consider a finite horizon, finite state and action space MDP, with horizon length H ∈ N, state space S, action space A and rewards at time period h denoted by r^h ∈ R. A policy π = (π^1, ..., π^H) is a sequence of functions, where each π^h : S × A → R_+ is a mapping from state-action pair to the probability of taking that action at that state, i.e., π^h_{sa} is the probability of taking action a at state s at time-step h, and Σ_a π^h_{sa} = 1 for all s ∈ S. At each time-step h the agent receives a state s^h and a reward r^h and selects an action a^h from the policy π^h, and the agent moves to the next state s^{h+1}, which is sampled with probability P^h_{s^{h+1} s^h a^h}, where P^h_{s'sa} is the probability of transitioning from state s to state s' after taking action a at time-step h. The goal of the agent is to maximize the expected total discounted return J under its policy π, where J(π) = E[ Σ_{h=1}^H r^h | π ]. Here the expectation is with respect to the initial state distribution, the state-transition probabilities, the rewards, and the policy π.

The action-value, or Q-value, at time-step l of a particular state under policy π is the expected total return from taking that action at that state and following π thereafter, i.e., Q^l_{sa} = E[ Σ_{h=l}^H r^h | s^l = s, a^l = a, π ] (we suppress the dependence on π in this notation). The value of state s under policy π at time-step h, V^h(s) = E_{a∼π^h_s} Q^h_{sa}, is the expected total discounted return of policy π from state s. The Bellman operator T^h for policy π at each time-step h relates the value at each time-step to the value at subsequent time-steps via dynamic programming (Bellman, 1957),

$$
T^h Q^{h+1}_{sa} = \mu^h_{sa} + \sum_{s',a'} \pi^h_{s'a'} P^h_{s'sa} Q^{h+1}_{s'a'} \qquad (1)
$$

for all (s, a), where μ^h_{sa} = E r^h is the mean reward. The Q-values are the unique fixed point of equation (1), i.e., the solution to T^h Q^{h+1} = Q^h for h = 1, ..., H, where Q^{H+1} is defined to be zero. Several reinforcement learning algorithms have been designed around minimizing the residual of equation (1) to propagate knowledge of immediate rewards to long term value (Sutton, 1988; Watkins, 1989). In the next section we examine a similar relationship for propagating the uncertainties of the Q-values; we call this relationship the uncertainty Bellman equation.
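As a concrete illustration of the recursion in equation (1), the following is a minimal NumPy sketch of finite-horizon policy evaluation by backward induction on a small synthetic MDP; the rewards, transitions, and uniform policy below are illustrative placeholders, not quantities from the paper.

```python
import numpy as np

# Minimal sketch of the finite-horizon Bellman recursion in equation (1)
# for a fixed policy. All arrays are toy placeholders.
rng = np.random.default_rng(0)
S, A, H = 4, 2, 5

mu = rng.normal(size=(S, A))                   # mean rewards mu_{sa}
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a, s'] transition probabilities
pi = np.full((S, A), 1.0 / A)                  # uniform policy pi_{sa}

# Backward induction: Q^{H+1} = 0, then
#   Q^h_{sa} = mu_{sa} + sum_{s'} P[s, a, s'] * sum_{a'} pi[s', a'] * Q^{h+1}[s', a'].
Q = np.zeros((H + 2, S, A))
for h in range(H, 0, -1):
    V_next = (pi * Q[h + 1]).sum(axis=1)       # V^{h+1}(s') = E_{a' ~ pi} Q^{h+1}_{s'a'}
    Q[h] = mu + P @ V_next                     # expectation over next states s'

print(Q[1])                                    # Q-values at the first time-step
```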
3. The uncertainty Bellman equation

In this section we derive a Bellman-style relationship that propagates the uncertainty (variance) of the Bayesian posterior distribution over Q-values across multiple time-steps. Propagating the potential value of exploration over many time-steps, or deep exploration, is important for statistically efficient RL (Kearns & Singh, 2002; Osband et al., 2017). Our main result, which we state in Theorem 1, is based upon nothing more than the dynamic programming recursion in equation (1) and some crude upper bounds of several intermediate terms. We will show that even in very simple settings this approach can result in well-calibrated uncertainty estimates where common count-based bonuses are inefficient (Osband & Van Roy, 2017).

3.1. Posterior variance estimation

We consider the Bayesian case, where we have priors over the mean reward μ and the transition probability matrix P, which we denote by φ_μ and φ_P respectively. We collect some data generated by the policy π and use it to derive posterior distributions over μ and P, given the data. We denote by F_t the sigma-algebra generated by all the history up to episode t (e.g., all the rewards, actions, and state transitions for all episodes), and let the posteriors over the mean reward and transition probabilities be denoted by φ_{μ|F_t} and φ_{P|F_t} respectively. If we sample μ̂ ∼ φ_{μ|F_t} and P̂ ∼ φ_{P|F_t}, then the resulting Q-values that satisfy

$$
\hat{Q}^h_{sa} = \hat{\mu}^h_{sa} + \sum_{s',a'} \pi^h_{s'a'} \hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'}, \qquad h = 1, \ldots, H,
$$

where Q̂^{H+1} = 0, are a sample from the implicit posterior over Q-values, conditioned on the history F_t (Strens, 2000). In this section we compute a bound on the variance (uncertainty) of the random variable Q̂. For the analysis we will require some additional assumptions.

Assumption 1. The MDP is a directed acyclic graph.

This assumption means that the agent cannot revisit a state within the same episode, and is a common assumption in the literature (Osband et al., 2014). Note that any finite horizon MDP that doesn't satisfy this assumption can be converted into one that does by 'unrolling' the MDP so that each state s is replaced by H copies of the state, one for each step in the episode.

Assumption 2. The mean rewards are bounded in a known interval, i.e., μ^h_{sa} ∈ [−R_max, R_max] for all (s, a).

This assumption means we can bound the absolute value of the Q-values as |Q^h_{sa}| ≤ Q_max = H R_max. We will use this quantity in the bound we derive below. This brings us to our first lemma.

Lemma 1. For any random variable x let

$$
\mathrm{var}_t\, x = \mathbb{E}\big( (x - \mathbb{E}(x \mid \mathcal{F}_t))^2 \mid \mathcal{F}_t \big)
$$

denote the variance of x conditioned on F_t. Under the assumptions listed above, the variance of the Q-values under the posterior satisfies the Bellman inequality

$$
\mathrm{var}_t \hat{Q}^h_{sa} \le \nu^h_{sa} + \sum_{s',a'} \pi^h_{s'a'} \, \mathbb{E}(\hat{P}^h_{s'sa} \mid \mathcal{F}_t) \, \mathrm{var}_t \hat{Q}^{h+1}_{s'a'}
$$

for all (s, a) and h = 1, ..., H, where var_t Q̂^{H+1} = 0 and where we call ν^h_{sa} the local uncertainty at (s, a), given by

$$
\nu^h_{sa} = \mathrm{var}_t \hat{\mu}^h_{sa} + Q_{\max}^2 \sum_{s'} \mathrm{var}_t \hat{P}^h_{s'sa} \,\big/\, \mathbb{E}(\hat{P}^h_{s'sa} \mid \mathcal{F}_t).
$$

Proof. See the appendix.

We refer to ν in the above lemma as the local uncertainty since it depends only on locally available quantities, and so can be calculated (in principle) at each state-action independently. Note that even though E(P̂^h_{s'sa} | F_t) appears in the denominator above, the local uncertainty is bounded, since for any random variable X on (0, 1] we have X² ≤ X and therefore

$$
\mathrm{var}\, X / \mathbb{E} X \le 1 - \mathbb{E} X \le 1,
$$

and if for any s' we have E(P̂^h_{s'sa} | F_t) = 0, then Markov's inequality implies that P̂^h_{s'sa} = 0 and so we can just remove that term from the sum, since it contributes no uncertainty.

With this lemma we are ready to prove our main theorem.

Theorem 1 (Solution of the uncertainty Bellman equation). Under assumptions 1 and 2, for any policy π there exists a unique u that satisfies the uncertainty Bellman equation

$$
u^h_{sa} = \nu^h_{sa} + \sum_{s',a'} \pi^h_{s'a'} \, \mathbb{E}(\hat{P}^h_{s'sa} \mid \mathcal{F}_t) \, u^{h+1}_{s'a'} \qquad (2)
$$

for all (s, a) and h = 1, ..., H, where u^{H+1} = 0, and furthermore u ≥ var_t Q̂ pointwise.

Proof. Let U^h be the Bellman operator that defines the uncertainty Bellman equation, i.e., rewrite equation (2) as

$$
u^h = \mathcal{U}^h u^{h+1};
$$

then to prove the result we use two essential properties of the Bellman operator for a fixed policy. Firstly, the solution to the Bellman equation exists and is unique, and secondly the Bellman operator is monotonically non-decreasing in its argument, i.e., if x ≥ y pointwise then U^h x ≥ U^h y pointwise (Bertsekas, 2005). The proof proceeds by induction; assume that for some h we have var_t Q̂^{h+1} ≤ u^{h+1}, then we have

$$
\mathrm{var}_t \hat{Q}^h \le \mathcal{U}^h \, \mathrm{var}_t \hat{Q}^{h+1} \le \mathcal{U}^h u^{h+1} = u^h,
$$

where we have used the fact that the variance satisfies the Bellman inequality from Lemma 1, and the base case holds because var_t Q̂^{H+1} = u^{H+1} = 0.

We conclude with a brief discussion on why the variance of the posterior is useful for exploration. If we had access to the true posterior distribution over the Q-values then we could take actions that lead to states with higher uncertainty by, for example, using Thompson sampling (Thompson, 1933; Strens, 2000), or constructing Q-values that are high probability upper bounds on the true Q-values and using the OFU principle (Kaufmann et al., 2012). However, calculating the true posterior is intractable for all but very small problems. Due to this difficulty prior work has sought to approximate the posterior distribution (Osband et al., 2017), and use that to drive exploration. In that spirit we develop another approximation of the posterior, in this case motivated by the Bayesian central limit theorem, which states that, under some mild conditions, the posterior distribution converges to a Gaussian as the amount of data increases (Berger, 2013). With that in mind, rather than computing the full posterior we approximate it as N(Q̄, diag(u)), where u is the solution to the uncertainty Bellman equation (2), and consequently is a guaranteed upper bound on the true variance of the posterior, and Q̄ denotes the mean Q-values under the posterior at episode t, i.e., the unique solution to

$$
\bar{Q}^h_{sa} = \mathbb{E}(\hat{\mu}^h_{sa} \mid \mathcal{F}_t) + \sum_{s',a'} \pi^h_{s'a'} \, \mathbb{E}(\hat{P}^h_{s'sa} \mid \mathcal{F}_t) \, \bar{Q}^{h+1}_{s'a'},
$$

for h = 1, ..., H, with Q̄^{H+1} = 0. With this approximate posterior we can perform Thompson sampling as an exploration heuristic. Specifically, at state s and time-step h we select the action using

$$
a = \operatorname*{argmax}_b \big( \bar{Q}^h_{sb} + \zeta_b (u^h_{sb})^{1/2} \big) \qquad (3)
$$

where ζ_b is sampled from N(0, 1). Our goal is for the agent to explore states and actions where it has higher uncertainty. This is in contrast to the commonly used ε-greedy (Mnih et al., 2013) and Boltzmann exploration strategies (Mnih et al., 2016; O'Donoghue et al., 2017; Haarnoja et al., 2017), which simply inject noise into the agent's actions. We shall see in the experiments that our strategy can dramatically outperform these naive heuristics.
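To show how equations (2) and (3) fit together, here is a minimal NumPy sketch that solves the UBE by backward induction and uses the result for Thompson-sampling action selection; the posterior statistics (P_mean, Q_bar, nu) are synthetic placeholders rather than quantities estimated from data.

```python
import numpy as np

# Sketch: solve the uncertainty Bellman equation (2) backwards, then select
# actions via equation (3). Posterior statistics below are placeholders.
rng = np.random.default_rng(1)
S, A, H = 4, 2, 5

P_mean = rng.dirichlet(np.ones(S), size=(S, A))   # E(P_hat | F_t), shape (S, A, S')
Q_bar = rng.normal(size=(H + 2, S, A))            # posterior mean Q-values (placeholder)
nu = rng.uniform(0.1, 1.0, size=(S, A))           # local uncertainties nu_{sa}
pi = np.full((S, A), 1.0 / A)                     # current policy

# u^{H+1} = 0, then u^h_{sa} = nu_{sa} + sum_{s'} E(P)[s,a,s'] * sum_{a'} pi[s',a'] * u^{h+1}[s',a'].
u = np.zeros((H + 2, S, A))
for h in range(H, 0, -1):
    w_next = (pi * u[h + 1]).sum(axis=1)          # policy-averaged next-step uncertainty
    u[h] = nu + P_mean @ w_next

def select_action(s, h):
    """Equation (3): argmax_b ( Qbar^h_{sb} + zeta_b * sqrt(u^h_{sb}) ), zeta_b ~ N(0, 1)."""
    zeta = rng.standard_normal(A)
    return int(np.argmax(Q_bar[h, s] + zeta * np.sqrt(u[h, s])))

print(u[1])
print(select_action(s=0, h=1))
```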
3.2. Comparison to traditional exploration bonus

Consider a simple decision problem with known deterministic transitions, unknown rewards, and two actions at a root node, as depicted in Figure 1. The first action leads to a single reward r_1 sampled from N(μ_1, σ²), at which point the episode terminates, and the second action leads to a chain of length H consisting of states, each having random reward r_2 independently sampled from N(μ_2/H, σ²/H).

Take the case where each action at the root has been taken n times and where the uncertainty over the rewards at each state concentrates like 1/n (e.g., when the prior is an improper Gaussian). In this case the true uncertainty about the value of each action is identical and given by σ²/n. This is also the answer we get from the uncertainty Bellman equation, since for action 1 we obtain u_1 = σ²/n (since var_t P = 0), and for action 2 the uncertainty about the reward at each state along the chain is given by σ²/(Hn) and so we have u_2 = Σ_{h=1}^H σ²/(Hn) = σ²/n.

Rather than considering the variance of the value as a whole, the majority of existing approaches to OFU provide exploration bonuses at each state and action independently and then combine these estimates via a union bound. In this context, even a state of the art algorithm such as UCRL2 (Jaksch et al., 2010) would augment the rewards at each state with a bonus proportional to the standard deviation of the reward estimate at each state (Bellemare et al., 2016). For the first action this would be ExpBonus_1 = σ/√n, but for the second action this would be accumulated along the chain to be

$$
\mathrm{ExpBonus}_2 = \sum_{h=1}^H \frac{\sigma}{\sqrt{Hn}} = \sigma \sqrt{\frac{H}{n}}.
$$

In other words, the bonus afforded to the second action is a factor of √H larger than the true uncertainty. The agent would have to take the second action a factor of H more times than the first action in order to have the same effective bonus given to each one. If the first action was actually superior in terms of expected reward, it would take the agent far longer to discover that than an agent using the correct uncertainties to select actions. The essential issue is that, unlike the variances, the standard deviations do not obey a Bellman-style relationship.

In Figure 2 we show the results of an experiment demonstrating this phenomenon. Action 1 had expected reward μ_1 = 1, and action 2 had expected reward μ_2 = 0. We set σ = 1 and H = 10, and the results are averaged over 500 seeds. We compare two agents, one using the uncertainty Bellman equation to drive exploration and the other agent using a count-based reward bonus. Both agents take actions and use the results to update their beliefs about the value of each action. The agent using the UBE takes the action yielded by Thompson sampling as in equation (3). The exploration-bonus agent takes the action i that maximizes Q̂_i + β log(t) ExpBonus_i (the log(t) term is required to achieve a regret bound (Jaksch et al., 2010), but doesn't materially affect the previous argument), where β > 0 is a hyper-parameter chosen by a sweep and where Q̂_i is the estimate of the value of action i. Figure 2 shows the regret of each agent versus the number of episodes. Regret measures how sub-optimal the rewards the agent has received so far are, relative to the (unknown) optimal policy, and lower regret is better (Cesa-Bianchi & Lugosi, 2006).

The agent using the uncertainty Bellman equation has well calibrated uncertainty estimates and consequently quickly figures out that the first action is better. By contrast, the exploration bonus agent takes significantly longer to determine that the first action is better, due to the fact that the bonus afforded to the second action is too large, and consequently it suffers significantly higher regret.
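The arithmetic behind this comparison fits in a few lines; the sketch below uses the example's σ = 1 and H = 10 (with an arbitrary visit count n) to check that the UBE assigns both actions the same uncertainty σ²/n, while the summed standard-deviation bonus for the chain is √H times larger.

```python
import numpy as np

# Uncertainty assigned to the two root actions of the chain MDP in Figure 1,
# after n pulls each, with reward noise sigma and chain length H.
sigma, H, n = 1.0, 10, 25

# UBE / true posterior variance: both actions get sigma^2 / n.
u1 = sigma**2 / n
u2 = sum(sigma**2 / (H * n) for _ in range(H))           # H states, each with variance sigma^2/(H n)

# Count-based bonus that sums per-state standard deviations along the chain.
bonus1 = sigma / np.sqrt(n)
bonus2 = sum(sigma / np.sqrt(H * n) for _ in range(H))   # = sigma * sqrt(H / n)

print(u1, u2)                            # equal: 0.04, 0.04
print(bonus1, bonus2, bonus2 / bonus1)   # ratio is sqrt(H) ~= 3.16
```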

[Figure 1: Simple tabular MDP. A root node with two actions: action 1 yields reward r_1 and terminates, while action 2 leads to a chain of H states (h = 1, 2, ..., H), each yielding reward r_2.]

[Figure 2: Regret over time for the simple tabular MDP.]

4. Estimating the local uncertainty

Section 3 outlined how the uncertainty Bellman equation can be used to propagate local estimates of the variance of Q̂ to global estimates for the uncertainty. In this section we present some pragmatic approaches to estimating the local uncertainty ν that we can then use for practical learning algorithms inspired by Theorem 1. We do not claim that these approaches are the only approaches to estimating the local uncertainty, or even that these simple approximations are in any sense the 'best'. Investigating these choices is an important area of future research, but outside the scope of this short paper. We present a simple progression from tabular representations, to linear function approximation, and then to non-linear neural network architectures.

Tabular value estimate. Consider the case where the posterior over the mean rewards concentrates at least as fast as

$$
\mathrm{var}_t \hat{\mu}^h_{sa} \le \sigma_r^2 / n^h_{sa},
$$

where σ_r² is the variance of the reward process and n^h_{sa} is the visit count of the agent to state s and action a at time-step h, up to episode t. This is the case when, for example, the rewards and the prior over the mean reward are both Gaussian. Furthermore, if we assume that the prior over the transition function is Dirichlet, then it is straightforward to show that

$$
\sum_{s'} \mathrm{var}_t \hat{P}^h_{s'sa} \,\big/\, \mathbb{E}(\hat{P}^h_{s'sa} \mid \mathcal{F}_t) \le |S| / n^h_{sa},
$$

where |S| is the number of next states reachable from (s, a). This holds since the likelihood of the transition function is a categorical distribution, which is conjugate to the Dirichlet distribution, and the variance of a Dirichlet concentrates like the reciprocal of the sum of the counts of each category. Under these assumptions we can bound the local uncertainty as

$$
\nu^h_{sa} \le (\sigma_r^2 + Q_{\max}^2 |S|) / n^h_{sa}.
$$

In other words, the local uncertainty can be modeled under these assumptions as a constant divided by the visit count.
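Under these Gaussian-reward and Dirichlet-transition assumptions, the bound above reduces to tracking visit counts; a small sketch with illustrative constants:

```python
import numpy as np

# Tabular local-uncertainty estimate: nu_hat^h_{sa} = (sigma_r^2 + Q_max^2 * |S|) / n^h_{sa}.
S, A, H = 6, 3, 10
sigma_r, R_max = 1.0, 1.0
Q_max = H * R_max

counts = np.ones((H, S, A))     # visit counts n^h_{sa}, started at 1 to avoid division by zero

def local_uncertainty(h, s, a):
    return (sigma_r**2 + Q_max**2 * S) / counts[h, s, a]

# The estimate shrinks like 1/n as (h, s, a) is visited.
for _ in range(5):
    counts[0, 2, 1] += 1
print(local_uncertainty(0, 2, 1))
```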

Linear value estimate. In the non-tabular case we need some way to estimate the inverse counts in order to approximate the local uncertainty. Consider a linear value function estimator Q̂^h_{sa} = φ(s)^T w_a for each state and action, with fixed basis functions φ(s) : S → R^D and learned weights w_a ∈ R^D, one for each action. This setting allows for some generalization between states and actions through the basis functions. For any fixed dataset we can find the least squares solution for each action a (Boyan, 1999),

$$
\operatorname*{minimize}_{w_a} \; \sum_{i=1}^N \big( \phi(s_i)^T w_a - y_i \big)^2,
$$

where each y_i ∈ R is a regression target (e.g., a Monte Carlo return from that state-action). The solution to this problem is w_a^⋆ = (Φ_a^T Φ_a)^{-1} Φ_a^T y, where Φ_a is the matrix consisting of the φ(s_i) vectors stacked row-wise (we use the subscript a to denote the fact that action a was taken at these states). We can compute the variance of this estimator, which will provide a proxy for the inverse counts (Bellemare et al., 2016). If we model the targets y_i as IID with unit variance, then var_t w_a^⋆ = (Φ_a^T Φ_a)^{-1}. Given a new state vector φ_s, the variance of the Q-value estimate at (s, a) is then var_t φ_s^T w_a^⋆ = φ_s^T (Φ_a^T Φ_a)^{-1} φ_s, which we can take to be our estimate of the inverse counts, i.e., we set (n̂^h_{sa})^{-1} = φ_s^T (Φ_a^T Φ_a)^{-1} φ_s. Now we can estimate the local uncertainty as

$$
\hat{\nu}^h_{sa} = \beta^2 (\hat{n}^h_{sa})^{-1} = \beta^2 \phi_s^T (\Phi_a^T \Phi_a)^{-1} \phi_s \qquad (4)
$$

for some β, which in the tabular case (i.e., where φ(s) = e_s and D = |S|) is equal to β²/n^h_{sa}, as expected.

An agent using this notion of uncertainty must maintain and update the matrix Σ_a = (Φ_a^T Φ_a)^{-1} as it receives new data. Given a new sample φ, the updated matrix Σ_a^+ is given by

$$
\Sigma_a^+ = \left( \begin{bmatrix} \Phi_a \\ \phi^T \end{bmatrix}^T \begin{bmatrix} \Phi_a \\ \phi^T \end{bmatrix} \right)^{-1}
= (\Phi_a^T \Phi_a + \phi \phi^T)^{-1}
= \Sigma_a - \frac{\Sigma_a \phi \phi^T \Sigma_a}{1 + \phi^T \Sigma_a \phi} \qquad (5)
$$

by the Sherman-Morrison-Woodbury formula (Golub & Van Loan, 2012); the cost of this update is one matrix multiply and one matrix-matrix subtraction per step.
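A minimal sketch of this bookkeeping, maintaining Σ_a with the rank-one update (5) and evaluating the inverse-count proxy of equation (4); the feature dimension, the ridge-style initialization Σ_a = μI (with μ an assumed hyper-parameter), and β are illustrative choices.

```python
import numpy as np

# Linear inverse-count proxy (eq. 4) with the Sherman-Morrison update (eq. 5).
D, num_actions = 8, 4
mu_init, beta = 1.0, 0.1                       # Sigma_a = mu * I initially; beta scales nu_hat

Sigma = [mu_init * np.eye(D) for _ in range(num_actions)]

def update(a, phi):
    """Sigma_a <- Sigma_a - (Sigma_a phi phi^T Sigma_a) / (1 + phi^T Sigma_a phi)."""
    Sp = Sigma[a] @ phi
    Sigma[a] = Sigma[a] - np.outer(Sp, Sp) / (1.0 + phi @ Sp)

def local_uncertainty(a, phi):
    """nu_hat = beta^2 * phi^T Sigma_a phi, i.e. beta^2 times the estimated inverse visit count."""
    return beta**2 * phi @ Sigma[a] @ phi

rng = np.random.default_rng(2)
phi = rng.standard_normal(D)
print(local_uncertainty(0, phi))
update(0, phi)
print(local_uncertainty(0, phi))               # shrinks after observing phi under action 0
```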
Neural networks value estimate. If we are approximating our Q-value function using a neural network then the above analysis does not hold. However, if the last layer of the network is linear, then the Q-values are approximated as Q^h_{sa} = φ(s)^T w_a, where w_a are the weights of the last layer associated with action a and φ(s) is the output of the network up to the last layer for state s. In other words, we can think of a neural network as learning a useful set of basis functions such that a linear combination of them approximates the Q-values. Then, if we ignore the uncertainty in the φ mapping, we can reuse the analysis for the purely linear case to derive an approximate measure of local uncertainty that might be useful in practice.

This scheme has some advantages. As the agent progresses it is learning a state representation that helps it achieve the goal of maximizing the return. The agent will learn to pay attention to small but important details (e.g., the ball in Atari 'Breakout') and learn to ignore large but irrelevant changes (e.g., if the background suddenly changes). This is a desirable property from the point of view of using these features to drive exploration, because states that differ only in irrelevant ways will be aliased to (roughly) the same state representation, and states that differ in small but important ways will be mapped to quite different state vectors, permitting a more task-relevant measure of uncertainty.

5. Deep Reinforcement Learning

Previously we proved that under certain conditions we can bound the variance of the posterior distribution of the Q-values, and we used the resulting uncertainty values to derive an exploration strategy. Here we discuss the application of that strategy to deep-RL. In this case several of the assumptions we made to derive Theorem 1 are violated, which puts us firmly in the territory of heuristics. Specifically, the MDPs we apply this to will not be directed acyclic graphs, the policy that we are estimating the uncertainty over will not be fixed, we cannot exactly compute the local uncertainty, and we won't be solving the UBE exactly. However, empirically, we demonstrate that this heuristic can perform well in practice, despite the underlying assumptions being violated.

Our strategy involves learning the uncertainty estimates, and then using them to sample Q-values from the approximate posterior, as in equation (3). The technique is described in pseudo-code in Algorithm 1. We refer to the technique as 'one-step' since the uncertainty values are updated using a one-step SARSA Bellman backup, but it is easily extendable to the n-step case. The algorithm takes as input a neural network which has two output 'heads', one which is attempting to learn the optimal Q-values as normal, and the other which is attempting to learn the uncertainty values of the current policy (which is constantly changing). We do not allow the gradients from the uncertainty output head to flow into the trunk of the network; this ensures the Q-value estimates are not perturbed by the changing uncertainty signal. For the local uncertainty measure we use the linear basis approximation described in Section 4. Algorithm 1 incorporates a discount factor γ ∈ (0, 1), since deep RL often uses a discount even in the purely episodic case. In this case the Q-learning update uses a γ discount and the uncertainty Bellman equation (2) is augmented with a γ² discount factor.

Algorithm 1: One-step UBE exploration with linear uncertainty estimates.

    Require: Neural network outputting Q and u estimates, Q-value learning subroutine qlearn, Thompson sampling hyper-parameter β > 0
    Initialize Σ_a = μI for each a, where μ > 0
    Get initial state s, take initial action a
    for episode t = 1, ... do
        for time-step h = 2, ..., H + 1 do
            Retrieve φ(s) from input to last network layer
            Receive new state s' and reward r
            Calculate Q̂^h_{s'b} and u^h_{s'b} for each action b
            Sample ζ_b ∼ N(0, 1) for each b and calculate
                a' = argmax_b ( Q̂^h_{s'b} + β ζ_b (u^h_{s'b})^{1/2} )
            Calculate
                y = φ(s)^T Σ_a φ(s),                  if h = H + 1
                    φ(s)^T Σ_a φ(s) + γ² u^h_{s'a'},   otherwise
            Take gradient step with respect to the error (y − u^{h−1}_{sa})²
            Update Q-values using qlearn(s, a, r, s', a')
            Update Σ_a according to eq. (5)
            Take action a'
            Set s ← s', a ← a'
        end for
    end for

5.1. Experimental results

Here we present results of Algorithm 1 on the Atari suite of games (Bellemare et al., 2012), where the network is attempting to learn the Q-values as in DQN (Mnih et al., 2013; 2015) and the uncertainties simultaneously. The only change to vanilla DQN we made was to replace the ε-greedy policy with Thompson sampling over the learned uncertainty values, where the β constant in (3) was chosen to be 0.01 for all games, by a parameter sweep. We used the exact same network architecture, learning rate, optimizer, pre-processing and replay scheme as described in Mnih et al. (2015). For the uncertainty sub-network we used a single fully connected hidden layer with 512 hidden units followed by the output layer. We trained the uncertainty head using a separate RMSProp optimizer (Tieleman & Hinton, 2012) with learning rate 10^{-3}. The addition of the uncertainty head, and the computation associated with it, only reduced the frame-rate compared to vanilla DQN by about 10% on a GPU, so the additional computational cost of the approach is negligible.

We compare two versions of our approach: a 1-step method and an n-step method where we set n to 150. The n-step method accumulates the uncertainty signal over n time-steps before performing an update, which should lead to the uncertainty signal propagating to earlier encountered states faster, at the expense of increased variance of the signal. Note that in all cases the Q-learning update is always 1-step; our n-step implementation only affects the uncertainty update.

We compare our approaches to vanilla DQN, and also to an exploration bonus intrinsic motivation approach, where the agent receives an augmented reward consisting of the extrinsic reward and the square root of the linear uncertainty in equation (4), scaled by a hyper-parameter chosen to be 0.1 by a sweep. In this case a stochastic policy was still required for good performance and so we used ε-greedy with the DQN annealing schedule.

We trained all strategies for 200M frames (about 8 days on a GPU). Each game and strategy was tested three times per method with the same hyper-parameters but with different random seeds, and all plots and scores correspond to an average over the seeds. All scores were normalized by subtracting the average score achieved by an agent that takes actions uniformly at random. Every 1M frames the agents were saved and evaluated (without learning) on 0.5M frames, where each episode is started from the random start condition described in Mnih et al. (2015). The final scores presented correspond to first averaging the evaluation score in each period across seeds, then taking the max average episodic score observed during any evaluation period. Of the tested strategies the n-step UBE approach was the highest performer in 32 out of 57 games, the 1-step UBE approach in 14 games, DQN in 1 game, the exploration bonus strategy in 7 games, and there were 3 ties. In Table 1 we give the mean and median normalized scores as a percentage of an expert human normalized score across all games, and the number of games where the agent is 'super-human', for each tested algorithm. Note that the mean scores are significantly affected by a single outlier with a very high score ('Atlantis'), and therefore the median score is a better indicator of agent performance. In Figure 3 we plot the number of games at super-human performance against frames for each method, and in Figure 4 we plot the median performance across all games versus frames, where a score of 1.0 denotes human performance. The results across all 57 games, as well as the learning curves for all 57 games, are given in the appendix.

Of particular interest is the game 'Montezuma's Revenge', a notoriously difficult exploration game where no one-step algorithm has managed to learn anything useful. Our 1-step strategy learns in 200M frames a policy that is able to consistently get about 500 points, which is the score the agent gets for picking up the first key and moving into the second room. In Figure 5 we show the learning progress of the agents for 500M frames, where we set the Thompson sampling parameter slightly higher: 0.016 instead of 0.01 (since this game is a challenging exploration task, it stands to reason that a higher exploration parameter is required). By the end of 500M frames the n-step agent is consistently getting around 3000 points, which is several rooms of progress. These scores are close to state-of-the-art, and are state-of-the-art for one-step methods (like DQN) to the best of our knowledge.

In the recent work by Bellemare et al. (2016), and the follow-up work by Ostrovski et al. (2017), the authors add an intrinsic motivation signal to a DQN-style agent that has been modified to use the full Monte Carlo return of the episode when learning the Q-values. Using Monte Carlo returns dramatically improves the performance of DQN in a way unrelated to exploration, and due to that change we cannot compare the numerical results directly. In order to have a point of comparison we implemented our own intrinsic motivation exploration signal, as discussed above. Similarly, we cannot compare directly to the numerical results obtained by Bootstrap DQN (Osband et al., 2016), since that agent uses Double-DQN, a variant of DQN that achieves higher performance in a way unrelated to exploration. However, we note that our approach achieves a higher evaluation score in 27 out of the 48 games tested in the Bootstrap DQN paper despite using an inferior base DQN implementation, and it runs at a significantly lower computational and memory cost.

                           mean      median    > human
DQN                        688.60     79.41      21
DQN Intrinsic Motivation   472.93     76.73      24
DQN UBE 1-step             776.40     94.54      26
DQN UBE n-step             439.88    126.41      35

Table 1: Scores for the Atari suite, as a percentage of human score.

[Figure 3: Number of games at super-human performance.]

6. Conclusion

In this paper we derived a Bellman equation for the uncertainty over the Q-values of a policy. This allows an agent to propagate uncertainty across many time-steps in the same way that value propagates through time in the standard dynamic programming recursion. This uncertainty can be used by the agent to make decisions about which states and actions to explore, in order to gather more data about the environment and learn a better policy. Since the uncertainty satisfies a Bellman recursion, the agent can learn it using the same reinforcement learning machinery that has been developed for value functions. We showed that a heuristic algorithm based on this learned uncertainty can boost the performance of standard deep-RL techniques. Our technique was able to significantly improve the performance of DQN across the Atari suite of games, when compared against naive strategies like ε-greedy.

[Figure 4: Normalized median performance across all games; a score of 1.0 is human-level performance.]

[Figure 5: Montezuma's Revenge performance.]

7. Acknowledgments

We thank Marc Bellemare, David Silver, Koray Kavukcuoglu, Daniel Kasenberg, and Mohammad Gheshlaghi Azar for useful discussion and suggestions on the paper.

References

Azar, M. G., Munos, R., and Kappen, B. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2012.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.

Bellman, R. Dynamic programming. Princeton University Press, 1957.

Berger, J. O. Statistical decision theory and Bayesian analysis. Springer Science & Business Media, 2013.

Bertsekas, D. P. Dynamic programming and optimal control, volume 1. Athena Scientific, 2005.

Boyan, J. A. Least-squares temporal difference learning. In ICML, pp. 49–56, 1999.

Brafman, R. I. and Tennenholtz, M. R-max: A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press, 2006.

Golub, G. H. and Van Loan, C. F. Matrix computations, volume 3. JHU Press, 2012.

Grande, R., Walsh, T., and How, J. Sample efficient reinforcement learning with Gaussian processes. In International Conference on Machine Learning, pp. 1332–1340, 2014.

Guez, A., Silver, D., and Dayan, P. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems, pp. 1025–1033, 2012.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

Kakade, S. M. On the sample complexity of reinforcement learning. PhD thesis, University of London, England, 2003.

Kaufmann, E., Cappé, O., and Garivier, A. On Bayesian upper confidence bounds for bandit problems. In Artificial Intelligence and Statistics, pp. 592–600, 2012.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Lattimore, T. Regret analysis of the anytime optimally confident UCB algorithm. arXiv preprint arXiv:1603.08661, 2016.

Lattimore, T. and Hutter, M. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pp. 320–334. Springer, 2012.

Mannor, S. and Tsitsiklis, J. Mean-variance optimization in Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 177–184, 2011.

Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. N. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. URL http://dx.doi.org/10.1038/nature14236.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1928–1937, 2016.

Moerland, T. M., Broekens, J., and Jonker, C. M. Efficient exploration with double uncertain value networks. In 31st Conference on Neural Information Processing Systems (NIPS), 2017.

Munos, R. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1–129, 2014.

O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. Combining policy gradient and Q-learning. In International Conference on Learning Representations (ICLR), 2017.

Osband, I. and Van Roy, B. On lower bounds for regret in reinforcement learning. stat, 1050:9, 2016.

Osband, I. and Van Roy, B. Why is posterior sampling better than optimism for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

Osband, I., Russo, D., and Van Roy, B. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013.

Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.

Osband, I., Russo, D., Wen, Z., and Van Roy, B. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.

Ostrovski, G., Bellemare, M. G., Oord, A. v. d., and Munos, R. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.

Schmidhuber, J. Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. In Anticipatory Behavior in Adaptive Learning Systems, pp. 48–76. Springer, 2009.

Singh, S. P., Barto, A. G., and Chentanez, N. Intrinsically motivated reinforcement learning. In NIPS, volume 17, pp. 1281–1288, 2004.

Sobel, M. J. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802, 1982.

Strehl, A. and Littman, M. Exploration via model-based interval estimation, 2004.

Strens, M. A Bayesian framework for reinforcement learning. In ICML, pp. 943–950, 2000.

Sutton, R. and Barto, A. Reinforcement Learning: an Introduction. MIT Press, 1998.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Tamar, A., Di Castro, D., and Mannor, S. Learning the variance of the reward-to-go. Journal of Machine Learning Research, 17(13):1–36, 2016.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.

Watkins, C. J. C. H. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.

White, M. and White, A. Interval estimation for reinforcement-learning algorithms in continuous-state domains. In Advances in Neural Information Processing Systems, pp. 2433–2441, 2010.

Appendix

Proof of Lemma 1

For ease of exposition in this derivation we shall use the notation E_t(·) to denote the expectation of a random variable conditioned on the history F_t, rather than the usual E(· | F_t). Recall that if we sample μ̂ from the posterior over the mean rewards φ_{μ|F_t}, and P̂ from the posterior over the transition probability matrix φ_{P|F_t}, then the Q-values that are the unique solution to

$$
\hat{Q}^h_{sa} = \hat{\mu}^h_{sa} + \sum_{s',a'} \pi^h_{s'a'} \hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'}
$$

are a sample from the (implicit) posterior over Q-values, conditioned on F_t (Strens, 2000). Using the definition of the conditional variance,

$$
\begin{aligned}
\mathrm{var}_t \hat{Q}^h_{sa}
&= \mathbb{E}_t \big( \hat{Q}^h_{sa} - \mathbb{E}_t \hat{Q}^h_{sa} \big)^2 \\
&= \mathbb{E}_t \Big( \hat{\mu}^h_{sa} - \mathbb{E}_t \hat{\mu}^h_{sa} + \textstyle\sum_{s',a'} \pi^h_{s'a'} \big( \hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'} - \mathbb{E}_t(\hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'}) \big) \Big)^2 \\
&= \mathbb{E}_t \big( \hat{\mu}^h_{sa} - \mathbb{E}_t \hat{\mu}^h_{sa} \big)^2 + \mathbb{E}_t \Big( \textstyle\sum_{s',a'} \pi^h_{s'a'} \big( \hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'} - \mathbb{E}_t(\hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'}) \big) \Big)^2,
\end{aligned}
$$

where we have used the fact that μ̂^h_{sa} is conditionally independent (conditioned on F_t) of P̂^h_{s'sa} Q̂^{h+1}_{s'a'}, because assumption 1 implies that Q̂^{h+1}_{s'a'} depends only on downstream quantities.

Now we make the assumption that E_t P̂^h_{s'sa} > 0 for all h, s', s, a. Note that this is not a restriction, because if E_t P̂^h_{s'sa} = 0 then the fact that P̂ is nonnegative combined with Markov's inequality implies that P̂^h_{s'sa} = 0, in which case we can just remove that term from the sum, since it contributes no variance (or, equivalently, s' is not reachable after taking (s, a)). Furthermore, note that any sample must satisfy Σ_{s'} P̂^h_{s'sa} = 1, which implies that Σ_{s'} E_t P̂^h_{s'sa} = 1 and therefore Σ_{s',a'} π^h_{s'a'} E_t P̂^h_{s'sa} = 1. In other words, π^h_{s'a'} E_t P̂^h_{s'sa} defines a probability distribution over (s', a'). With this we can bound the second term as

$$
\begin{aligned}
& \mathbb{E}_t \Big( \textstyle\sum_{s',a'} \pi^h_{s'a'} \big( \hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'} - \mathbb{E}_t(\hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'}) \big) \Big)^2 \\
&\quad = \mathbb{E}_t \Big( \textstyle\sum_{s',a'} \pi^h_{s'a'} \frac{\mathbb{E}_t \hat{P}^h_{s'sa}}{\mathbb{E}_t \hat{P}^h_{s'sa}} \big( \hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'} - \mathbb{E}_t(\hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'}) \big) \Big)^2 \\
&\quad \le \textstyle\sum_{s',a'} \pi^h_{s'a'} \, \mathbb{E}_t \hat{P}^h_{s'sa} \, \mathbb{E}_t \big( \hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'} - \mathbb{E}_t(\hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'}) \big)^2 \big/ \big( \mathbb{E}_t \hat{P}^h_{s'sa} \big)^2
\end{aligned}
$$

by applying Jensen's inequality to the quadratic. Combining these we have

$$
\mathrm{var}_t \hat{Q}^h_{sa} \le \mathrm{var}_t \hat{\mu}^h_{sa} + \sum_{s',a'} \pi^h_{s'a'} \, \mathbb{E}_t \hat{P}^h_{s'sa} \, \mathbb{E}_t \big( \hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'} - \mathbb{E}_t(\hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'}) \big)^2 \big/ \big( \mathbb{E}_t \hat{P}^h_{s'sa} \big)^2. \qquad (6)
$$

Assumption 1 also implies that P̂^h_{s'sa} and Q̂^{h+1}_{s'a'} are conditionally independent, and so we can write the middle expectation in the second term as

$$
\begin{aligned}
\mathbb{E}_t \big( \hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'} - \mathbb{E}_t(\hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'}) \big)^2
&= \mathbb{E}_t \Big( (\hat{P}^h_{s'sa} - \mathbb{E}_t \hat{P}^h_{s'sa}) \hat{Q}^{h+1}_{s'a'} + (\mathbb{E}_t \hat{P}^h_{s'sa}) (\hat{Q}^{h+1}_{s'a'} - \mathbb{E}_t \hat{Q}^{h+1}_{s'a'}) \Big)^2 \\
&= \mathbb{E}_t \big( (\hat{P}^h_{s'sa} - \mathbb{E}_t \hat{P}^h_{s'sa}) \hat{Q}^{h+1}_{s'a'} \big)^2 + \mathbb{E}_t \big( (\mathbb{E}_t \hat{P}^h_{s'sa}) (\hat{Q}^{h+1}_{s'a'} - \mathbb{E}_t \hat{Q}^{h+1}_{s'a'}) \big)^2.
\end{aligned}
$$

Now using the conditional independence property again and the fact that the Q-values are bounded, as implied by assumption 2, we have

$$
\mathbb{E}_t \big( (\hat{P}^h_{s'sa} - \mathbb{E}_t \hat{P}^h_{s'sa}) \hat{Q}^{h+1}_{s'a'} \big)^2
= \mathbb{E}_t \big( \hat{P}^h_{s'sa} - \mathbb{E}_t \hat{P}^h_{s'sa} \big)^2 \, \mathbb{E}_t \big( \hat{Q}^{h+1}_{s'a'} \big)^2
\le Q_{\max}^2 \, \mathrm{var}_t \hat{P}^h_{s'sa},
$$

and similarly

$$
\mathbb{E}_t \big( (\mathbb{E}_t \hat{P}^h_{s'sa}) (\hat{Q}^{h+1}_{s'a'} - \mathbb{E}_t \hat{Q}^{h+1}_{s'a'}) \big)^2
= (\mathbb{E}_t \hat{P}^h_{s'sa})^2 \, \mathbb{E}_t \big( \hat{Q}^{h+1}_{s'a'} - \mathbb{E}_t \hat{Q}^{h+1}_{s'a'} \big)^2
= (\mathbb{E}_t \hat{P}^h_{s'sa})^2 \, \mathrm{var}_t \hat{Q}^{h+1}_{s'a'}.
$$

And so we have

$$
\mathbb{E}_t \big( \hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'} - \mathbb{E}_t(\hat{P}^h_{s'sa} \hat{Q}^{h+1}_{s'a'}) \big)^2 \le Q_{\max}^2 \, \mathrm{var}_t \hat{P}^h_{s'sa} + (\mathbb{E}_t \hat{P}^h_{s'sa})^2 \, \mathrm{var}_t \hat{Q}^{h+1}_{s'a'}.
$$

Putting this together with equation (6) we obtain

$$
\mathrm{var}_t \hat{Q}^h_{sa} \le \nu^h_{sa} + \sum_{s',a'} \pi^h_{s'a'} \, \mathbb{E}_t \hat{P}^h_{s'sa} \, \mathrm{var}_t \hat{Q}^{h+1}_{s'a'},
$$

where ν^h_{sa} is the local uncertainty, and is given by

$$
\nu^h_{sa} = \mathrm{var}_t \hat{\mu}^h_{sa} + Q_{\max}^2 \sum_{s'} \mathrm{var}_t \hat{P}^h_{s'sa} \big/ \mathbb{E}_t \hat{P}^h_{s'sa}.
$$
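Not part of the proof, but a quick Monte Carlo sanity check of the bound is easy to run: sample (μ̂, P̂) from simple conjugate posteriors on a small layered MDP, compute Q̂ by backward induction, and compare the empirical variance against the UBE solution u. The sketch below uses arbitrary posterior parameters, with the reward samples clipped so that Assumption 2 holds.

```python
import numpy as np

# Monte Carlo sanity check of Theorem 1 on a tiny layered (acyclic) MDP:
# the empirical variance of Q_hat^1 should be dominated by the UBE solution u^1.
# Posteriors: clipped-Gaussian rewards and Dirichlet transitions (arbitrary parameters).
rng = np.random.default_rng(4)
S, A, H, R_max, n_samples = 3, 2, 4, 3.0, 20000
Q_max = H * R_max

mu_mean = rng.uniform(-1.0, 1.0, size=(H, S, A))
mu_var = np.full((H, S, A), 0.25)
alpha = rng.uniform(1.0, 5.0, size=(H, S, A, S))             # Dirichlet parameters
alpha0 = alpha.sum(-1, keepdims=True)
P_mean = alpha / alpha0
P_var = alpha * (alpha0 - alpha) / (alpha0**2 * (alpha0 + 1.0))
pi = np.full((S, A), 1.0 / A)

def backward(rewards, P):
    """Finite-horizon backup of eq. (1) (also computes eq. (2) when given nu and E[P])."""
    X = np.zeros((H + 2, S, A))
    for h in range(H, 0, -1):
        X[h] = rewards[h - 1] + P[h - 1] @ (pi * X[h + 1]).sum(-1)
    return X[1]

# UBE solution u^1 using the local uncertainty of Lemma 1.
nu = mu_var + Q_max**2 * (P_var / P_mean).sum(-1)
u1 = backward(nu, P_mean)

# Empirical variance of Q_hat^1 under the sampled posterior.
draws = []
for _ in range(n_samples):
    mu_hat = np.clip(rng.normal(mu_mean, np.sqrt(mu_var)), -R_max, R_max)
    gam = rng.gamma(alpha)                                    # Dirichlet via normalized Gammas
    P_hat = gam / gam.sum(-1, keepdims=True)
    draws.append(backward(mu_hat, P_hat))
var_emp = np.var(np.stack(draws), axis=0)

print(np.all(var_emp <= u1))                                  # expected: True (the bound is loose)
```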

Atari suite scores

Game                  DQN        DQN Intrinsic Motivation   DQN UBE 1-step   DQN UBE n-step
alien                 40.96      28.13                      46.90            43.61
amidar                58.17      41.50                      83.48            83.14
assault               479.34     647.16                     887.18           1112.28
asterix               67.26      74.01                      82.34            130.95
asteroids             1.86       2.25                       3.54             2.61
atlantis              28662.58   14382.27                   28655.71         6889.77
bank heist            56.43      54.20                      97.72            162.90
battle zone           70.47      77.80                      49.63            101.68
beam rider            63.22      58.60                      106.79           149.93
berzerk               18.29      21.30                      44.92            2284.45
bowling               13.03      20.10                      -3.64            14.12
boxing                782.87     784.74                     811.68           816.42
breakout              1438.45    1377.36                    2045.13          1474.66
centipede             34.54      25.04                      22.62            18.94
chopper command       58.06      74.19                      69.67            75.90
crazy climber         428.59     431.10                     496.21           499.34
defender              83.94      78.20                      94.54            209.70
demon attack          338.43     372.29                     765.46           897.89
double dunk           481.10     575.90                     750.00           1031.25
enduro                92.01      102.36                     9.94             154.21
fishing derby         85.03      95.68                      97.51            105.80
freeway               81.08      121.85                     0.02             126.41
frostbite             9.86       16.02                      11.26            21.22
gopher                392.09     558.74                     656.70           867.45
gravitar              2.38       4.07                       2.06             4.91
hero                  49.27      63.34                      47.63            34.58
ice hockey            64.17      64.49                      38.22            58.43
jamesbond             193.96     230.36                     198.40           430.89
kangaroo              266.39     311.21                     202.79           537.69
krull                 656.32     702.15                     1033.61          838.45
kung fu master        103.20     112.79                     128.30           153.40
montezuma revenge     -0.49      4.21                       11.43            0.80
ms pacman             16.11      16.17                      18.42            19.82
name this game        114.24     101.23                     129.62           127.99
phoenix               115.17     157.34                     199.39           167.51
pitfall               5.49       5.49                       5.49             5.49
pong                  112.04     112.06                     116.32           116.37
private eye           -0.04      0.44                       -0.44            0.33
qbert                 79.41      103.53                     124.96           125.85
riverraid             43.41      46.33                      60.73            68.24
road runner           524.43     531.48                     722.15           732.09
robotank              770.41     779.94                     414.29           803.06
seaquest              7.93       10.61                      9.01             9.31
skiing                11.47      15.26                      31.55            54.31
solaris               3.48       8.23                       18.56            -6.53
space invaders        87.42      72.95                      125.29           138.65
star gunner           309.47     398.98                     456.86           547.39
surround              16.67      19.44                      37.42            60.98
tennis                145.58     145.58                     227.93           145.58
time pilot            92.53      76.73                      88.36            121.64
tutankham             148.29     191.72                     132.62           138.12
up n down             92.62      95.73                      139.35           142.12
venture               4.05       10.77                      -1.24            8.73
video pinball         1230.64    2393.54                    3354.58          1992.87
wizard of wor         73.26      73.96                      187.48           118.76
yars revenge          37.93      23.24                      46.21            44.36
zaxxon                35.24      53.14                      62.20            56.44

Table 2: Normalized scores for the Atari suite from random starts, as a percentage of human normalized score.