
Towards Tractable Optimism in Model-Based Reinforcement Learning

Aldo Pacchiano*1, Philip Ball*2, Jack Parker-Holder*2, Krzysztof Choromanski3, Stephen Roberts2

1UC Berkeley, 2University of Oxford, 3Google Brain Robotics. *Equal Contribution.

Abstract

The principle of optimism in the face of uncertainty is prevalent throughout sequential decision making problems such as multi-armed bandits and reinforcement learning (RL). To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism), but not by so much that it is inaccurate (estimation error). In the tabular setting, many state-of-the-art methods produce the required optimism through approaches which are intractable when scaling to deep RL. We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise augmented MDP. This formulation achieves a competitive regret bound: $\tilde{\mathcal{O}}(H|\mathcal{S}|\sqrt{|\mathcal{A}|T})$ when augmenting using Gaussian noise, where $T$ is the total number of environment steps. We also explore how this trade-off changes in the deep RL setting, where we show empirically that estimation error is significantly more troublesome. However, we also show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.

1 INTRODUCTION

Reinforcement Learning (RL, Sutton and Barto [1998]) considers the problem of an agent taking sequential actions in an uncertain environment to maximize some notion of reward. Model-based reinforcement learning (MBRL) algorithms typically approach this problem by building a "world model" [Sutton, 1991], which can be used to simulate the true environment. This facilitates efficient learning, since the agent no longer needs to query the true environment for experience, and instead plans in the world model. In order to learn a world model that accurately represents the dynamics of the environment, the agent must collect data that is rich in experiences [Sekar et al., 2020]. However, for faster convergence, data collection must also be performed efficiently, wasting as few samples as possible [Ball et al., 2020]. Thus, the effectiveness of MBRL algorithms hinges on the exploration-exploitation dilemma.

This dilemma has been studied extensively in the tabular RL setting, which considers Markov Decision Processes (MDPs) with finite states and actions. Optimism in the face of uncertainty (OFU) [Audibert et al., 2007, Kocsis and Szepesvári, 2006] is a principle that emerged first from the multi-armed bandit literature, where actions having both large expected rewards (exploitation) and high uncertainty (exploration) are prioritized. OFU is a crucial component of several state-of-the-art algorithms in this setting [Silver et al., 2016], although its success has thus far failed to scale to larger settings.

However, in the field of deep RL, many of these theoretical advances have been overlooked in favor of heuristics [Burda et al., 2019a] or simple dithering based approaches for exploration [Mnih et al., 2013]. There are two potential reasons for this. First, many of the theoretically motivated OFU algorithms are intractable in larger settings. For example, UCRL2 [Jaksch et al., 2010], a canonical optimistic RL algorithm, requires the computation of an analytic uncertainty envelope around the MDP, which is infeasible for continuous MDPs. Despite its many extensions [Filippi et al., 2010, Jaksch et al., 2010, Fruit et al., 2018, Azar et al., 2017b, Bartlett and Tewari, 2012, Tossou et al., 2019], none address generalizing the techniques to continuous (or even large discrete) MDPs.

Second, OFU algorithms must strike a fine balance in what we call the Optimism Decomposition. That is, they need to be optimistic enough to upper bound the true value function, while maintaining low estimation error. Theoretically motivated OFU algorithms predominantly focus on the former. However, when moving to the deep RL setting, several sources of noise make estimation error a thorn in the side of optimistic approaches. We show that an optimistic algorithm can fixate on exploiting the least accurate models, which causes the majority of experience the agent learns from to be worthless, or even harmful for performance.

Correspondence to: [email protected], {ball, jackph}@robots.ox.ac.uk

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).

In this paper we seek to address both of these issues, paving the way for OFU-inspired algorithms to gain prominence in the deep RL setting. We make two contributions:

Making provably efficient algorithms tractable. Our first contribution is to introduce a new perspective on existing tabular RL algorithms such as UCRL2. We show that a comparable regret bound can be achieved by being optimistic with respect to a noise augmented MDP, where the noise is proportional to the amount of data collected during learning. We propose several mechanisms to inject noise, including count-scaled Gaussian noise and the variance from a bootstrap mechanism. Since the latter technique is used in many prominent state-of-the-art deep MBRL algorithms [Kurutach et al., 2018, Janner et al., 2019, Chua et al., 2018, Ball et al., 2020], we have all the ingredients we need to scale to that paradigm.

Addressing model estimation error in the deep RL paradigm. We empirically explore the Optimism Decomposition in the deep RL setting, and introduce a new approach to reduce the likelihood that the weakest models will be exploited. We show that we can indeed produce optimism with low model error, and thus match state-of-the-art MBRL performance.

The rest of the paper is structured as follows: 1) We begin with background and related work, where we formally introduce the Optimism Decomposition; 2) In Section 3 we introduce noise augmented MDPs, and draw connections with existing algorithms; 3) We next provide our main theoretical results, followed by empirical verification in the tabular setting; 4) We rigorously evaluate the Optimism Decomposition in the deep RL setting, demonstrating that optimism can match state-of-the-art performance when estimation error is controlled.

2 BACKGROUND

The learner does not have access to the true rewards and dynamics, and must instead approximate these through estimators. For a state action pair $(s,a)$, we denote the average reward estimator as $\hat{r}_k(s,a) \in \mathbb{R}$ and the average dynamics estimator as $\hat{P}_k(s,a) \in \Delta_{|\mathcal{S}|}$, where the index $k$ refers to the episode. When training, the learner collects dynamics tuples during its interactions with $\mathcal{M}$, which it in turn uses during each episode $k$ to produce a policy $\pi_k$ and an approximate MDP $\mathcal{M}_k = (\mathcal{S}, \mathcal{A}, \tilde{P}, H, \tilde{r}, P_0)$. In our theoretical results we will allow $\tilde{P}(s,a)$ to be a signed measure whose entries do not sum to one. This is purely a semantic device, rendering the exposition of our work easier and more general, and in no way affects the feasibility of our algorithms and arguments.

For any policy $\pi$, let $V(\pi)$ be the (scalar) value of $\pi$ and let $\tilde{V}_k(\pi)$ be the value of $\pi$ operating in the approximate MDP $\mathcal{M}_k$. We define $\mathbb{E}_\pi$ as the expectation under the dynamics of the true MDP $\mathcal{M}$ and using policy $\pi$ (analogously $\tilde{\mathbb{E}}_\pi$ as the expectation under $\mathcal{M}_k$). The true and approximate value functions for a policy $\pi$ are defined as follows:

$$V(\pi) = \mathbb{E}_\pi\Big[\sum_{h=0}^{H-1} r(s_h, a_h)\Big], \qquad \tilde{V}_k(\pi) = \tilde{\mathbb{E}}_\pi\Big[\sum_{h=0}^{H-1} \tilde{r}_k(s_h, a_h)\Big].$$

We will evaluate our method using regret, the difference between the value of the optimal policy and the value of the policies the agent executed. Formally, in the episodic RL setting the regret of an agent using policies $\{\pi_k\}_{k=1}^K$ is (where $K$ is the number of episodes and $T = KH$):

$$R(T) = \sum_{k=1}^{K} V(\pi^\star) - V(\pi_k),$$

where $\pi^\star$ denotes the optimal policy for $\mathcal{M}$, and $V(\pi_k)$ is the true value function of $\pi_k$. Furthermore, for each $h \in \{1, \ldots, H\}$ we call $V^h(\pi) \in \mathbb{R}^{|\mathcal{S}|}$ the value vector satisfying $V^h(\pi)(s) = \mathbb{E}_\pi\big[\sum_{h'=h}^{H-1} r(s_{h'}, a_{h'}) \mid s_h = s\big]$; similarly we define $\tilde{V}_k^h(\pi)$ with respect to $\mathcal{M}_k$.
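Written out, the Optimism Decomposition referenced below (Equation 1) presumably takes the standard form obtained by adding and subtracting the approximate value $\tilde{V}_k(\pi_k)$ inside the regret; the sketch below uses only the definitions above:

```latex
R(T) \;=\; \underbrace{\sum_{k=1}^{K} \Big[ V(\pi^\star) - \tilde{V}_k(\pi_k) \Big]}_{\text{Optimism}}
      \;+\; \underbrace{\sum_{k=1}^{K} \Big[ \tilde{V}_k(\pi_k) - V(\pi_k) \Big]}_{\text{Estimation Error}} \tag{1}
```

If the approximate MDP is optimistic, i.e. $\tilde{V}_k(\pi_k) \geq V(\pi^\star)$ for every episode, the first sum is non-positive and the regret is controlled by how far the optimistic value estimates drift from the true values.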

We refer to this as the Optimism Decomposition, since it breaks down the regret into its two major components. OFU algorithms must ensure that: the approximate value function is sufficiently optimistic (Optimism); and the estimated value function is not too far from the true value function (Estimation Error). Balancing these two requirements forms the basis of all optimism based algorithms that satisfy provable regret guarantees.

In this paper we aim to shed light on how to transfer the principle of optimism into the realm of model-based deep RL with deep function approximation. To our knowledge we are the first to propose algorithms for the deep RL setting inspired by the optimism principle prevalent throughout the theoretical RL literature.

Two of the most prominent model-based OFU algorithms are UCRL2 and UCBVI. In the case of UCRL2, optimism is produced by analytically optimizing over the entire dynamics uncertainty set. It is easy to see that this is intractable beyond the tabular setting. In the case of UCBVI, optimism is produced by adding a bonus directly at the value function level. This is also intractable in the deep RL setting, as it requires a count model over the visited states and actions.

Optimism beyond tabular models has been theoretically studied in Du et al. [2020] and Jin et al. [2019], which showed the value of OFU when the MDP satisfies certain linearity properties, but their practical impact has been limited.

Other methods which have successfully scaled from the tabular setting to deep RL are Posterior Sampling and RLSVI [Osband et al., 2016b, Russo, 2019]. First we note that RLSVI is not model based in spirit and it certainly does not use a model ensemble. While RLSVI is a philosophically different algorithm to ours, it also proposes the use of Gaussian noise perturbations and so it is closely related to our work.

Our approach is also inspired by Agrawal and Jia [2017] and Xu and Tewari. The parametric approach to posterior sampling studied in Agrawal and Jia [2017] can be easily analyzed under our framework, which can be seen as a generalization of their posterior sampling algorithm. Crucially however, our method is simple to implement in the deep RL setting, opening the door to new scalable algorithms.

Next we show that optimism can be achieved by a simple noise augmentation procedure. This gives rise to provably efficient algorithms for tabular RL problems. We discuss variants of both UCRL and UCBVI which make use of this to simultaneously scale to deep RL while maintaining their theoretical guarantees.

Table 1: Comparison of tabular RL algorithms by scalability and regret bound.

Algorithm | Scalable? | Model/Value Based | Regret Bound
UCRL [Jaksch et al., 2010] | No | Model Based | $\tilde{\mathcal{O}}(H\lvert\mathcal{S}\rvert\sqrt{\lvert\mathcal{A}\rvert T})$
UCBVI [Azar et al., 2017a] | No | Model Based | $\tilde{\mathcal{O}}(H\sqrt{\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert T})$
RLSVI [Osband et al., 2016b, Russo, 2019] | Yes | Value Based | $\tilde{\mathcal{O}}(H^{5/2}\lvert\mathcal{S}\rvert\sqrt{\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert T})$
Posterior Sampling [Osband et al., 2013] | Yes | Model Based | $\tilde{\mathcal{O}}(H\lvert\mathcal{S}\rvert\sqrt{\lvert\mathcal{A}\rvert T})$
Noise Augmented UCRL | Yes | Model Based | $\tilde{\mathcal{O}}(H\lvert\mathcal{S}\rvert\sqrt{\lvert\mathcal{S}\rvert\lvert\mathcal{A}\rvert T})$
Noise Augmented UCBVI | Yes | Model Based | $\tilde{\mathcal{O}}(H\lvert\mathcal{S}\rvert\sqrt{\lvert\mathcal{A}\rvert T})$
3 ALGORITHMS

The aim of this section is to show that we can produce new versions of two popular OFU algorithms solely making use of noise augmentation. First, we focus on showing these noise augmented algorithms are theoretically competitive, before discussing practical implementations of our approach in the deep RL context.

Algorithm 1: Noise Augmented RL (NARL)
Input: Finite horizon MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, H, r, P_0)$, episodes $K$, initial reward and dynamics augmentation noise distributions $\{\mathcal{P}_1^r(s,a)\}_{(s,a) \in \mathcal{S} \times \mathcal{A}}$ and $\{\mathcal{P}_1^P(s,a)\}_{(s,a) \in \mathcal{S} \times \mathcal{A}}$, sampling frequencies $M_r, M_P$.
Initialize: the transition and rewards data buffer $\mathcal{D}(s,a) = \emptyset$ for each $(s,a) \in \mathcal{S} \times \mathcal{A}$.
For $k = 1, \ldots, K-1$ do:
  UCRL2:
  (1) For all $(s,a)$, sample $M_P$ noise vectors $x_{k,P}^{(m)}(s,a) \sim \mathcal{P}_k^P(s,a)$.
  (2) For each $(s,a)$, sample $M_r$ noise values $x_k^{(m)}(s,a) \sim \mathcal{P}_k^r(s,a)$.
  (3) Compute policy $\pi_k$ by running Noise Augmented Extended Value Iteration as in Equation 2.
  UCBVI:
  (2) Compute policy $\pi_k$ by running Noise Augmented Value Iteration as in Equation 4.
  (*) Execute policy $\pi_k$ for a length-$H$ episode and update $\mathcal{D}$. Produce $\{\mathcal{P}_{k+1}^r(s,a)\}_{(s,a) \in \mathcal{S} \times \mathcal{A}}$ and (if UCRL) $\{\mathcal{P}_{k+1}^P(s,a)\}_{(s,a) \in \mathcal{S} \times \mathcal{A}}$.

Although we present our results for the case of undiscounted episodic reinforcement learning problems, our results extend to the average reward setting with bounded diameter MDPs as in [Jaksch et al., 2010, Agrawal and Jia, 2017]. We are inspired by UCRL, but shift towards noise augmentation rather than an intractable model cloud. We thus call our approach Noise Augmented Reinforcement Learning, or NARL.

NARL initializes an empty data buffer of rewards and transitions $\mathcal{D}$. We denote by $N_k(s)$ the number of times state $s$ has been encountered in the algorithm's run up to the beginning of episode $k$ (before $\pi_k$ is executed). Similarly we call $N_k(s,a)$ the number of times the pair $(s,a)$ has been encountered up to the beginning of episode $k$, and let $N_k(s) = \sum_{a \in \mathcal{A}} N_k(s,a)$. We make the following assumption:

Assumption 1 (Rewards). We assume the rewards are 1-(sub)Gaussian with mean values in $[0,1]$.
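To make steps (1) and (2) of Algorithm 1 concrete, the following is a minimal NumPy sketch of the per-episode Gaussian noise sampling for the UCRL2-style variant. The confidence-radius helpers, array shapes and synthetic counts are illustrative assumptions (the scalings of $\sigma_r$ and $\sigma_P$ anticipate the choices made in Theorem 1), not the authors' implementation.

```python
import numpy as np

S, A = 6, 2          # e.g. RiverSwim-sized state and action spaces
M_r, M_P = 10, 10    # number of reward / dynamics noise samples
delta = 0.05

def b_r(n, d):
    # Reward confidence radius, up to constants (cf. Section 3.1).
    n = np.maximum(n, 1)
    return np.sqrt(np.log((n + 1) / d) / n)

def b_P(n, d, num_states=S):
    # Dynamics (L1-norm) confidence radius, up to constants (cf. Section 3.1).
    n = np.maximum(n, 1)
    return np.sqrt(num_states * np.log((n + 1) / d) / n)

rng = np.random.default_rng(0)
N = rng.integers(1, 50, size=(S, A))      # visit counts N_k(s, a), synthetic here

# Step (1): M_P dynamics noise vectors per (s, a), x ~ N(0, sigma_P^2 I).
sigma_P = 2.0 * b_P(N, delta / (2 * S * A))
x_P = rng.normal(size=(M_P, S, A, S)) * sigma_P[None, :, :, None]

# Step (2): M_r reward noise scalars per (s, a), x ~ N(0, sigma_r^2).
sigma_r = 2.0 * b_r(N, delta / (2 * S * A))
x_r = rng.normal(size=(M_r, S, A)) * sigma_r[None, :, :]

print(x_r.shape, x_P.shape)   # noise consumed by the planner in step (3)
```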
3.1 CONCENTRATION

We start by recalling that the mean estimators $\{\hat{r}_k(s,a)\}_{(s,a) \in \mathcal{S} \times \mathcal{A}}$ and $\{\hat{P}_k(s,a)\}_{(s,a) \in \mathcal{S} \times \mathcal{A}}$ concentrate around their true values. We make use of a time-uniform concentration bound that leverages the theory of self normalization [Peña et al., 2008, Abbasi-Yadkori et al., 2011] to obtain the following:

Lemma 1 (Lemma 1 of Maillard and Asadi [2018]). For all $(s,a) \in \mathcal{S} \times \mathcal{A}$:

$$\mathbb{P}\big(\exists\, k \in \mathbb{N} : |r(s,a) - \hat{r}_k(s,a)| \geq b_r(N_k(s,a), \delta)\big) \leq \delta,$$
$$\mathbb{P}\big(\exists\, k \in \mathbb{N} : \|P(s,a) - \hat{P}_k(s,a)\|_1 \geq b_P(N_k(s,a), \delta)\big) \leq \delta,$$

where $b_r(n, \delta) \approx \sqrt{\tfrac{\log(n/\delta)}{n}}$ and $b_P(n, \delta) \approx \sqrt{\tfrac{|\mathcal{S}|\log(n/\delta)}{n}}$.

A more precise version of these bounds is stated in Appendix A.1. Equipped with these bounds, for the rest of the paper we condition on the event:

$$\mathcal{E} = \Big\{\forall k \in \mathbb{N},\ \forall (s,a) \in \mathcal{S} \times \mathcal{A} :\ |r(s,a) - \hat{r}_k(s,a)| \leq b_r(N_k(s,a), \delta'),\ \ \|P(s,a) - \hat{P}_k(s,a)\|_1 \leq b_P(N_k(s,a), \delta')\Big\}.$$

If $\delta' = \frac{\delta}{2|\mathcal{S}||\mathcal{A}|}$, Lemma 1 implies $\mathbb{P}(\mathcal{E}) \geq 1 - \delta$.

At the beginning of the $k$-th episode the learner produces $M_r$ reward augmentation noise scalars $x_k^{(m)}(s,a) \sim \mathcal{P}_k^r(s,a)$ and possibly $M_P$ dynamics augmentation $|\mathcal{S}|$-dimensional noise vectors $x_{k,P}^{(m)}(s,a) \sim \mathcal{P}_k^P(s,a)$, for each state action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$. This notation will become clearer in the subsequent discussion.

3.2 NOISE AUGMENTED UCRL

We start by showing that an appropriate choice for the noise variables $\{\mathcal{P}_k^r(s,a)\}_{(s,a) \in \mathcal{S} \times \mathcal{A}}$ and $\{\mathcal{P}_k^P(s,a)\}_{(s,a) \in \mathcal{S} \times \mathcal{A}}$ yields an algorithm akin to UCRL2 and with provable regret guarantees. Our main theoretical result of this section (Theorem 1) states that in the tabular setting, if we set $\mathcal{P}_k^r(s,a) = \mathcal{N}(0, \sigma^2_{t,r}(s,a))$ and $\mathcal{P}_k^P(s,a) = \mathcal{N}(0, I_{|\mathcal{S}|}\,\sigma^2_{t,P}(s,a))$, for appropriate values of $\sigma^2_{t,r}(s,a)$ and $\sigma^2_{t,P}(s,a)$ we can obtain a regret guarantee of order $\tilde{\mathcal{O}}(H|\mathcal{S}|\sqrt{|\mathcal{S}||\mathcal{A}|T})$, which is competitive w.r.t. UCRL2, which achieves $\tilde{\mathcal{O}}(H|\mathcal{S}|\sqrt{|\mathcal{A}|T})$. These results can be extended beyond Gaussian noise augmentation provided the noise distributions satisfy quantifiable anti-concentration properties. For example, when using the dynamics noise given by posterior sampling of dynamics vectors, we recover the results of Agrawal and Jia [2017] in the episodic setting. Our results can also be easily extended to the bounded diameter, average reward setting.

Noise Augmented Extended Value Iteration (NAEVI) proceeds as follows: at the beginning of episode $k$ we compute a value function $\tilde{V}_k$ as:

$$\tilde{V}_k^h(\pi_k)(s) = \max_{a \in \mathcal{A}} \Big\{ \hat{r}_k(s,a) + \mathbb{E}_{s' \sim \hat{P}_k(s,a)}\big[\tilde{V}_k^{h+1}(s')\big] + \max_m x_k^{(m)}(s,a) + \max_m \big\langle x_{k,P}^{(m)}(s,a),\, \tilde{V}_k^{h+1} \big\rangle \Big\}, \tag{2}$$

where $A := \max_m x_k^{(m)}(s,a)$ and $B := \max_m \langle x_{k,P}^{(m)}(s,a), \tilde{V}_k^{h+1} \rangle$ represent the optimism bonuses for the present and the future respectively. Many existing deep RL methods, such as Bellemare et al. [2016], Tang et al. [2017], Burda et al. [2019b], focus on adding bonuses that act like term A. By adding term B, Algorithm 1 is able to take into account not only present but future rewards, and act optimistically according to them.

In what follows, and for all states $(s,a) \in \mathcal{S} \times \mathcal{A}$, we will combine the noise values $x_k^{(m)}(s,a)$ and vectors $x_{k,P}^{(m)}(s,a)$ with the average empirical reward $\hat{r}_k(s,a)$ and empirical dynamics $\hat{P}_k(s,a)$, and think of them as forming sample rewards and sample dynamics vectors:

$$\tilde{r}_k^{(m)}(s,a) = \hat{r}_k(s,a) + x_k^{(m)}(s,a), \qquad \tilde{P}_k^{(m)}(s,a) = \hat{P}_k(s,a) + x_{k,P}^{(m)}(s,a).$$

Although $\tilde{P}_k^{(m)}(s,a)$ may not be a probability measure, for convenience we still treat it as a signed measure and write $\mathbb{E}_{s' \sim \tilde{P}_k^{(m)}(s,a)}[\cdot] = \big\langle \hat{P}_k(s,a) + x_{k,P}^{(m)}(s,a), \cdot \big\rangle$.

Let $\tilde{\mathcal{M}}_k = (\mathcal{S}, \mathcal{A}, \tilde{P}_k, H, \tilde{r}_k, P_0)$ be the approximate MDP resulting from collecting the maximizing rewards $\tilde{r}_k$ and dynamics vectors $\tilde{P}_k$ while executing NAEVI. In other words, for any state action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$:

$$\tilde{r}_k(s,a) = \max_{m=1,\ldots,M_r} \tilde{r}_k^{(m)}(s,a), \qquad \tilde{P}_k(s,a) = \arg\max_{\tilde{P}_k^{(m)}(s,a),\ m=1,\ldots,M_P} \big\langle \tilde{P}_k^{(m)}(s,a),\, \tilde{V}_k^{h+1} \big\rangle. \tag{3}$$

Our main result in this section is the following theorem:

Theorem 1. Let $\varepsilon \in (0,1)$, $\delta = \frac{\varepsilon}{4T}$, $M_r \geq \frac{1}{3}\log\big(\frac{2H|\mathcal{S}||\mathcal{A}|}{\delta}\big)$, and let $M_P$ be sufficiently large. The regret $R(T)$ of the UCRL variant of Algorithm 1 with Gaussian noise augmentation satisfies the following bound with probability at least $1 - \varepsilon$:

$$R(T) \leq \tilde{\mathcal{O}}\big(H|\mathcal{S}|\sqrt{|\mathcal{S}||\mathcal{A}|T}\big),$$

where $\tilde{\mathcal{O}}$ hides logarithmic factors in $|\mathcal{A}|$, $|\mathcal{S}|$, $\varepsilon$ and $T$, and:

$$x_{k,r}^{(m)} \sim \mathcal{N}(0, \sigma_r^2) \ \text{ s.t. } \ \sigma_r = 2\, b_r\Big(N_k(s,a), \tfrac{\delta}{2|\mathcal{S}||\mathcal{A}|}\Big), \qquad x_{k,P}^{(m)} \sim \mathcal{N}(0, \sigma_P^2 I) \ \text{ s.t. } \ \sigma_P = 2\, b_P\Big(N_k(s,a), \tfrac{\delta}{2|\mathcal{S}||\mathcal{A}|}\Big).$$

We remark that these bounds are not optimal in $H$ and $|\mathcal{S}|$; nevertheless, Theorem 1 shows this simple (and computationally scalable) noise augmented algorithm satisfies a regret guarantee. Our proof techniques are inspired by, but not the same as, those of Agrawal and Jia [2017]. Our proofs proceed in two parts; returning to the Optimism Decomposition (Equation 1), we deal with the Optimism and Estimation Error terms separately. The details of all proofs are in Appendix A.
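The NAEVI recursion in Equation 2 can be sketched as a simple backward pass; the toy empirical estimates and noise arrays below are assumptions made purely for illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, H, M_r, M_P = 5, 3, 10, 10, 10

# Synthetic empirical estimates and pre-drawn noise (in NARL these come from
# the data buffer and from the count-scaled Gaussians of Theorem 1).
r_hat = rng.uniform(0.0, 1.0, size=(S, A))        # \hat r_k(s, a)
P_hat = rng.dirichlet(np.ones(S), size=(S, A))    # \hat P_k(s, a)
x_r = 0.1 * rng.normal(size=(M_r, S, A))          # reward noise x_k^{(m)}(s, a)
x_P = 0.05 * rng.normal(size=(M_P, S, A, S))      # dynamics noise vectors

V = np.zeros((H + 1, S))                          # terminal value is zero
pi = np.zeros((H, S), dtype=int)

for h in range(H - 1, -1, -1):
    bonus_A = x_r.max(axis=0)                     # term A: max_m x_k^{(m)}(s, a)
    bonus_B = (x_P @ V[h + 1]).max(axis=0)        # term B: max_m <x_{k,P}^{(m)}, V^{h+1}>
    future = P_hat @ V[h + 1]                     # E_{s' ~ \hat P_k(s,a)}[V^{h+1}(s')]
    Q = r_hat + future + bonus_A + bonus_B        # bracketed term of Equation (2)
    pi[h] = Q.argmax(axis=1)                      # greedy (optimistic) action per state
    V[h] = Q.max(axis=1)

print("optimistic value at h = 0:", np.round(V[0], 3))
```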
3.3 NOISE AUGMENTED UCBVI

In this section we show that a simple modification of the previous algorithm can yield an even stronger regret guarantee. The chief insight is to note that under Assumption 1 the scale of the value function is at most $H$, and therefore instead of adding dynamics noise vectors $x_{k,P}^{(m)}$ it is enough to simply scale up the variance of the reward noise components to ensure optimism at the value function level.

Noise Augmented Value Iteration (NAVI) proceeds as follows: at the beginning of episode $k$ we compute a Q-function $\tilde{Q}_k$ as:

$$\tilde{Q}_{k,h}(s,a) = \min\Big\{\tilde{Q}_{k-1,h}(s,a),\ H,\ \tilde{r}_k(s,a) + \mathbb{E}_{s' \sim \hat{P}_k(s,a)}\big[\tilde{V}_{k,h+1}(s')\big]\Big\}, \qquad \tilde{V}_{k,h}(s) = \max_{a \in \mathcal{A}} \tilde{Q}_{k,h}(s,a), \tag{4}$$

where $\tilde{r}_k(s,a)$ is defined as in Equation 3. The policy executed at time $k$ by Noise Augmented UCBVI is the greedy policy w.r.t. $\tilde{Q}_{k,h}(s,a)$.

Our main result in this section is the following theorem:

Theorem 2. Let $\varepsilon \in (0,1)$, $\delta = \frac{\varepsilon}{4T}$ and $M_r \geq \frac{1}{3}\log\big(\frac{2H|\mathcal{S}||\mathcal{A}|}{\delta}\big)$. The regret $R(T)$ of the UCBVI variant of Algorithm 1 with Gaussian noise augmentation satisfies the following bound with probability at least $1 - \varepsilon$:

$$R(T) \leq \tilde{\mathcal{O}}\big(H|\mathcal{S}|\sqrt{|\mathcal{A}|T}\big),$$

where $\tilde{\mathcal{O}}$ hides logarithmic factors in $|\mathcal{A}|$, $|\mathcal{S}|$, $\varepsilon$ and $T$, and:

$$x_{k,r}^{(m)} \sim \mathcal{N}(0, \sigma_r^2) \ \text{ s.t. } \ \sigma_r = 2H\, b_r\Big(N_k(s,a), \tfrac{\delta}{2|\mathcal{S}||\mathcal{A}|}\Big).$$

3.3.1 Bootstrap Noise Augmentation

Drawing inspiration from [Vaswani et al., 2018, Kveton et al., 2019], we introduce the following algorithm:

1. Initialize $\mathcal{D}$ by adding $2M_B$ tuples $\{(s,a,1), (s,a,-1)\}_{i=1}^{M_B}$ to each state action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$.
2. Build $\mathcal{P}_k^r(s,a)$ via the following procedure: for each $(s_h^k, a_h^k)_{h=1}^{H}$ encountered during step (*) of Algorithm 1, add $(s_h^k, a_h^k, r_h^k)$ and $2M_B$ extra tuples $\{(s_h^k, a_h^k, 1), (s_h^k, a_h^k, -1)\}_{i=1}^{M_B}$ to the data buffer $\mathcal{D}$.
3. For all $(s,a) \in \mathcal{S} \times \mathcal{A}$, compute the empirical rewards $\{\hat{r}_k^{(m)}(s,a)\}_{m=1}^{M_r}$ by bootstrap sampling with replacement from the data buffer $\mathcal{D}(s,a)$ with probability parameter $1/2$ (sketched in code after Theorem 3 below):

$$\hat{r}_k^{(m)}(s,a) = \frac{1}{\sum_i x_i} \sum_{i=1}^{|\mathcal{D}_k(s,a)|} x_i\, r_i(s,a),$$

where the $x_i$ are i.i.d. Bernoulli random variables with parameter $\frac{1}{2}$ and the $r_i(s,a)$ are the reward samples in the data buffer $\mathcal{D}(s,a)$. The value of $\tilde{r}_k(s,a)$ is again computed via Equation 3.

Similar to RLSVI, this algorithm does not need to maintain visitation counts. The following theorem holds:

Theorem 3. Let $\varepsilon \in (0,1)$, $\delta = \frac{\varepsilon}{4T}$, $M_B = H \log T$ and $M_r \geq \frac{1}{3}\log\big(\frac{2H|\mathcal{S}||\mathcal{A}|}{\delta}\big)$. The regret $R(T)$ of Algorithm 1 with Bootstrap noise augmentation satisfies the following bound with probability at least $1 - \varepsilon$:

$$R(T) \leq \tilde{\mathcal{O}}\big(H|\mathcal{S}|\sqrt{|\mathcal{A}|T}\big),$$

where $\tilde{\mathcal{O}}$ hides logarithmic factors in $|\mathcal{A}|$, $|\mathcal{S}|$, $\varepsilon$ and $T$.
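To illustrate step 3 above, here is a small NumPy sketch of the Bernoulli(1/2) bootstrap mean for a single state-action pair, with the pseudo-rewards of steps 1 and 2 already appended to a synthetic buffer; the buffer contents and guard for an empty resample are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
M_B, M_r = 5, 10

# Data buffer D(s, a) for one state-action pair: observed rewards plus
# M_B copies each of the +1 / -1 pseudo-rewards added in steps 1-2.
observed = rng.uniform(0.0, 1.0, size=20)
buffer = np.concatenate([observed, np.ones(M_B), -np.ones(M_B)])

r_boot = np.empty(M_r)
for m in range(M_r):
    w = rng.integers(0, 2, size=buffer.size)    # i.i.d. Bernoulli(1/2) weights x_i
    if w.sum() == 0:                            # guard against an empty resample
        w[rng.integers(buffer.size)] = 1
    r_boot[m] = (w * buffer).sum() / w.sum()    # bootstrap mean \hat r_k^{(m)}(s, a)

r_tilde = r_boot.max()                          # optimistic sample, as in Equation 3
print(np.round(r_boot, 3), "->", round(r_tilde, 3))
```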
4 ANTI-CONCENTRATION AND OPTIMISM

The fundamental principle behind our bounds is that noise injection gives rise to optimism. In order to show this we rely on anti-concentration properties of the noise augmentation distributions. For the sake of simplicity we present simple results regarding Gaussian noise variables, their anti-concentration properties and one-step optimism. More nuanced results extending the discussion to Bootstrap sampling are in Appendix A.3.

Benign variance. We start by showing that whenever the noise is Gaussian and has an appropriate variance, with a constant probability each of the noise perturbed reward estimators $\tilde{r}_k^{(m)}$ is at least as large as the empirical mean plus the confidence radius $b_r\big(N(s,a), \tfrac{\delta}{|\mathcal{S}||\mathcal{A}|}\big)$. We can boost this probability by setting $M_r$ to be sufficiently large. The main ingredient behind this proof is the following Gaussian anti-concentration result:

Lemma 2 (Lower bound on the Gaussian tail of $\mathcal{N}(\mu, \sigma^2)$).

$$\mathbb{P}(X - \mu > t) \;\geq\; \frac{1}{\sqrt{2\pi}} \cdot \frac{\sigma t}{t^2 + \sigma^2} \cdot e^{-\frac{t^2}{2\sigma^2}}. \tag{5}$$

Using Lemma 2 we can show that, as long as the standard deviation of $x_k^{(m)}(s,a)$ is set to the right value, $\tilde{r}_k^{(m)}(s,a)$ overestimates the true reward $r(s,a)$ with constant probability.

Lemma 3. Let $(s,a) \in \mathcal{S} \times \mathcal{A}$. If $\tilde{r}_k^{(m)}(s,a) \sim \hat{r}_k(s,a) + \mathcal{N}(0, \sigma^2)$ for $\sigma = 2\, b_r\big(N_k(s,a), \tfrac{\delta}{2|\mathcal{S}||\mathcal{A}|}\big)$, then:

$$\mathbb{P}\big(\tilde{r}_k^{(m)}(s,a) \geq r(s,a) \mid \mathcal{E}\big) \geq \frac{1}{10}. \tag{6}$$

Lemma 3 implies that with constant probability the values $\tilde{r}_k^{(m)}(s,a)$ are an overestimate of the true rewards. It is also possible to show that, despite this property, $\tilde{r}_k(s,a)$ remains very close to $\hat{r}_k(s,a)$ and therefore to $r(s,a)$. In summary:

Corollary 1. The sampled rewards $\tilde{r}_k(s,a)$ are optimistic:

$$\mathbb{P}\Big(\tilde{r}_k(s,a) = \max_{m=1,\ldots,M_r} \tilde{r}_k^{(m)}(s,a) \geq r(s,a) \,\Big|\, \mathcal{E}\Big) \;\geq\; 1 - \Big(1 - \tfrac{1}{10}\Big)^{M_r}, \tag{7}$$

while at the same time not being too far from the true rewards:

$$\mathbb{P}\Big(|\tilde{r}_k(s,a) - r(s,a)| \geq L\, b_r\Big(N_k(s,a), \tfrac{\delta}{2|\mathcal{S}||\mathcal{A}|}\Big) \,\Big|\, \mathcal{E}\Big) \;\leq\; \frac{\delta}{2|\mathcal{S}||\mathcal{A}|}, \tag{8}$$

where $L = 2\sqrt{\log\big(\tfrac{4|\mathcal{S}||\mathcal{A}| M_r}{\delta}\big)} + 1$.

Corollary 1 shows the trade-off when increasing the number of models in an ensemble: it increases the amount of optimism, at the expense of greater estimation error of the sample rewards. A similar statement can be made of the dynamics in UCRL, but we defer the details to the appendix.

Bootstrap optimism. The necessary anti-concentration properties of the sampling distribution corresponding to Theorem 3 are explored in Appendix A.3.
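The lower bound of Lemma 2 is easy to sanity-check numerically; the sketch below compares a Monte Carlo estimate of $\mathbb{P}(X - \mu > t)$ against the right-hand side of Equation 5 for one illustrative choice of $\sigma$ and $t$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, t = 0.0, 1.5, 2.0

samples = rng.normal(mu, sigma, size=1_000_000)
empirical = np.mean(samples - mu > t)

# Right-hand side of Equation (5).
lower_bound = (1.0 / np.sqrt(2.0 * np.pi)) * (sigma * t / (t**2 + sigma**2)) \
              * np.exp(-t**2 / (2.0 * sigma**2))

print(f"P(X - mu > t) ~ {empirical:.4f} >= bound {lower_bound:.4f}")
```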
RLSVI Comparison. RLSVI's regret bound is of the order of $\mathcal{O}(H^3 S^{3/2}\sqrt{AK})$, while NARL-UCRL2 is of the order of $\mathcal{O}(H S^{3/2}\sqrt{AK})$ and NARL-UCBVI is of the order of $\mathcal{O}(H S\sqrt{AK})$. Removing an additional $\sqrt{S}$ factor may prove more challenging, since it is akin to the extra $\sqrt{d}$ factor present in the worst case regret bounds for Thompson sampling in linear bandits. Our rates for noise augmented NARL-UCBVI are superior to RLSVI, and even better in their $H$ dependence than the latest regret bounds for the RLSVI setting (see Agrawal et al. [2020]). NARL and RLSVI are incomparable when moving into the function approximation regime, since in this setting RLSVI will not be a model based algorithm. We also want to remark that the existing works on RLSVI are purely theoretical, and therefore there is no empirical evidence in either Russo [2019] or Agrawal et al. [2020] regarding its usefulness in a deep RL setting.

Comparison to Thompson Sampling using a Dirichlet prior. As explained above, the objective of this work is not to be state of the art and obtain the optimal regret guarantees. Our dependence on $S$ should be compared not with this approach but with RLSVI. Although the use of a Dirichlet prior allows the authors to get a better dependence on $S$, the resulting algorithm is infeasible in the deep RL paradigm. It is unclear what the equivalent of maintaining such a prior would be when making use of function approximation.

5 TABULAR EXPLORATION EXPERIMENTS

In this section we evaluate Noise Augmented UCRL (referred to as NARL) in the tabular setting, as is common for work in theoretical RL.¹ We consider two implementations of NARL: (1) Gaussian, where we use one model, but sample $M = 10$ noise vectors from a Gaussian distribution with variance $\frac{c}{N_k(s,a)}$ for a constant $c$, which we set to 1, and (2) Bootstrap, where we maintain $M = 10$ models, each having access to 50% of the data. We compare these against UCRL2 [Jaksch et al., 2010] and Optimistic Posterior Sampling (OPSRL), using an open source implementation.²

We begin with the RiverSwim environment [Strehl and Littman, 2008], with 6 states and an episode length of 20. We repeat each experiment for 20 seeds, to produce a median and IQR. In Fig. 1(a) we see that both versions of NARL exhibit strong performance, while UCRL2 performs poorly. In Fig. 1(b) we explore why this is the case, and plot the approximate value function for UCRL2 and NARL. We see that the weak performance of UCRL2 likely comes from over-estimation, i.e. being overly optimistic, and it takes much longer to converge to the true value function.

We also explore the choice of noise augmentation, using the Deep Sea environment [Osband et al., 2018] from bsuite [Osband et al., 2020]. This experiment shows the ability to scale with increasing problem dimension. We used ten environments with $N \in \{10, \ldots, 28\}$. As we see in Fig. 1(c), NARL solves all ten tasks. Interestingly, the Gaussian method is best, indicating promise for this approach.

¹See the following link to run these experiments in a notebook: https://bit.ly/3gVwsQF
²https://github.com/iosband/TabulaRL

Figure 1: Tabular RL experiments: (a) RiverSwim; (b) Q-values; (c) number of episodes to solve the Deep Sea task, where the x-axis corresponds to increasing dimensionality.

Now that we have seen NARL can compete empirically in the tabular setting, we next seek to demonstrate its scalability in the deep RL paradigm. Note that other methods, such as UCRL2, are intractable beyond tabular environments. Meanwhile, the noise augmentation we propose uses ingredients commonly found in state-of-the-art deep MBRL methods, such as bootstrap ensembles.

6 OPTIMISM IN DEEP RL

Despite being a popular theoretical approach, optimism is not prevalent in the deep RL literature. The most prominent theoretically motivated deep RL algorithm is Bootstrapped DQN [Osband et al., 2016a], which is inspired by PSRL. However, it is well known that Q-functions generally over-estimate the true Q-values [Thrun and Schwartz, 1993]; therefore, many methods not using a lower bound (as used in TD3, Fujimoto et al. [2018]) may in fact be using an optimistic estimate. More recently, Ciosek et al. [2019] and Rashid et al. [2020] present model-free approaches using optimistic policies to explore by shifting Q-values optimistically based on epistemic uncertainty. However, as far as we are aware, optimism is not widely used w.r.t. the dynamics in deep model based RL.

We know from our theoretical insights that an effective optimistic algorithm needs to balance the Optimism Decomposition. In the tabular setting we sought to add noise to boost the Optimism term, which led to too much variance in the case of UCRL2. For deep RL, the dynamics are very different, as we add significant noise from function approximation with neural networks. In this section we introduce a scalable implementation of NARL, which builds on top of the state-of-the-art continuous control (from states) MBRL algorithm. We also discuss the key factors influencing the Optimism Decomposition.

6.1 NOISE VIA BOOTSTRAPPED ENSEMBLES

We implement our algorithm by using an ensemble, as is common in existing state-of-the-art methods [Janner et al., 2019, Clavera et al., 2018, Kurutach et al., 2018, Chua et al., 2018, Ball et al., 2020]. For our implementation, we focus on Janner et al. [2019], using probabilistic dynamics models [Nix and Weigend, 1994] and a Soft Actor Critic (SAC, Haarnoja et al. [2018a,b]) agent learning inside the model. Dyna-style approaches [Sutton, 1991] are particularly sensitive to model bias [Deisenroth and Rasmussen, 2011], which often leads to catastrophic failure when policies are trained on inaccurate synthetic data. To prevent this, state-of-the-art methods such as MBPO randomly sample models from the ensemble to prevent the policy exploiting an individual (potentially biased) model. Rather than randomly sampling, we follow the Noise Augmented UCRL approach (Equation 2): we pass the same state-action tuple through each model, select the highest predicted reward, and assess which 'hallucinated' next state has the highest expected return according to the critic of the policy, thus providing us with an optimistic estimate of the transition dynamics.
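One natural reading of the optimistic selection rule just described is sketched below: the same state-action pair is passed through every ensemble member, and the member whose predicted reward plus critic value of its hallucinated next state is largest is used. The linear toy models and toy critic are placeholders, not the actual MBPO dynamics models or SAC critic.

```python
import numpy as np

rng = np.random.default_rng(4)
state_dim, action_dim, n_models = 4, 2, 7

# Toy ensemble: each member is a random linear model mapping (s, a) to
# (hallucinated next state, predicted reward).
W = rng.normal(size=(n_models, state_dim + action_dim, state_dim + 1))
critic_w = rng.normal(size=state_dim)           # stand-in for the SAC critic V(s')

def predict(member, s, a):
    out = np.concatenate([s, a]) @ W[member]
    return out[:-1], out[-1]                    # hallucinated next state, reward

def critic(s_next):
    return float(critic_w @ s_next)

def optimistic_step(s, a):
    # Pass the same (s, a) through every member; keep the most optimistic one,
    # scored by predicted reward plus critic value of its predicted next state.
    scores, preds = [], []
    for m in range(n_models):
        s_next, r = predict(m, s, a)
        scores.append(r + critic(s_next))
        preds.append((s_next, r))
    best = int(np.argmax(scores))
    return best, preds[best]

s, a = rng.normal(size=state_dim), rng.normal(size=action_dim)
member, (s_next, reward) = optimistic_step(s, a)
print("selected ensemble member:", member)
```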
Figure 2: Ensemble member selection frequency under optimism.

You're only as good as your worst model: However, in the deep RL setting we introduce a significant amount of noise due to function approximation with neural networks. It has been observed in practice that this variance is sufficient to induce optimism [Osband et al., 2016a]. A key consideration is the tendency for optimism to select the individual model with the highest variance, resulting in over-exploitation of the least accurate models. In Fig. 2 we demonstrate this phenomenon by training an ensemble of models (ordered here in increasing validation accuracy) and comparing the proportion each model was selected (red) against the average distance from the mean of the next state estimates (grey); we observe that these quantities are positively correlated. Thus, the optimistic approach selects (and exploits) the least accurate models.

Reducing Estimation Error: To balance the Optimism Decomposition in deep RL, we must focus on Estimation Error. We introduce a "Model Radius Constraint" $\varepsilon_M$: we calculate the empirical mean ($\mu_M$) of the expected returns, and exclude models that fall outside the permissible model sphere (defined as $\mu_M \pm \varepsilon_M$) as being overly optimistic. Interestingly, an undocumented feature in MBPO [Janner et al., 2019] that mirrors this is the idea of maintaining a subset of "elite" models. Briefly, even though $K$ models are trained and maintained, in reality only the top $E$ models are used for rollouts, where $E \leq K$. Even though models are sampled randomly in this approach, there is still a chance that "exploitable" samples are generated by these poorer performing models. The introduction of $\varepsilon_M$ allows us to reduce Estimation Error from optimism. We see in Fig. 2 that a wide radius ($\varepsilon_M = 5$, blue) has a small impact on reducing usage of the worst model. However, when we set a small radius ($\varepsilon_M = 0.1$, green), the models are selected almost uniformly. Details are in Appendix E.
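A sketch of the Model Radius Constraint: members whose expected-return estimate falls outside $\mu_M \pm \varepsilon_M$ are excluded before the optimistic selection. The synthetic return estimates and the value of $\varepsilon_M$ here are illustrative only (the paper's settings of 0.1, 5 and None apply to its own return scale).

```python
import numpy as np

rng = np.random.default_rng(5)
n_models, eps_M = 7, 2.0      # eps_M chosen for the synthetic scale below

# Per-member estimates of expected return under the current policy
# (synthetic here; in practice computed from rollouts in each model).
returns = rng.normal(loc=100.0, scale=2.0, size=n_models)

mu_M = returns.mean()
admissible = np.flatnonzero(np.abs(returns - mu_M) <= eps_M)
if admissible.size == 0:      # degenerate case: fall back to the closest member
    admissible = np.array([np.abs(returns - mu_M).argmin()])

# Optimistic choice restricted to the permissible sphere mu_M +/- eps_M.
chosen = admissible[returns[admissible].argmax()]
print("admissible members:", admissible, "chosen:", chosen)
```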

Figure 3: (a) InvertedPendulum, (b) Hopper, (c) HalfCheetah. Curves show the mean ± one standard deviation.

Table 2: The mean number of timesteps to solve the InvertedPendulum task, with standard deviations.

$\varepsilon_M$ | $M = 3$ | $M = 5$ | $M = 10$
0.1 | 1850 ± 166 | 2350 ± 300 | 3300 ± 953
5 | 2000 ± 274 | 2575 ± 251 | 3225 ± 675
None | 2000 ± 353 | 2775 ± 467 | 5850 ± 2037

6.2 DEEP RL FOR CONTINUOUS CONTROL

Now we evaluate the deep RL implementation of NARL. We focus on the InvertedPendulum task, as it is the simplest continuous environment and thus allows us to perform rigorous ablation studies. We run ten seeds for a variety of configurations, selecting the number of models $M$ from $\{3, 5, 10\}$ and $\varepsilon_M$ from $\{0.1, 5, \text{None}\}$. These two parameters trade off the amount of variance in the ensemble. Having more models means more noise. In addition, having a smaller $\varepsilon_M$ will reduce variance. The results are presented in Table 2.

Interestingly, we see strong evidence for our hypothesis that too much variance is a problem in the deep RL setting. This results in the phenomenon whereby having fewer models (e.g. $M = 3$) actually gives better performance, which has the added benefit of reduced computational cost. This is in contrast to methods based on random ensemble sampling [Kurutach et al., 2018, Osband et al., 2016a], where performance typically increases with the number of models. When using more models, the smaller model radius ($\varepsilon_M$) is crucial.

In Fig. 3 we compare NARL against the publicly released data from MBPO [Janner et al., 2019] on the InvertedPendulum, Hopper and HalfCheetah environments. For InvertedPendulum we show the results with $M = 3$ and $\varepsilon_M = 0.1$, the strongest result from Table 2. With these parameters selected appropriately, we get meaningful gains against a very strong baseline, using an almost identical implementation aside from the model selection and $\varepsilon_M$. For the larger Hopper and HalfCheetah tasks, we also used $M = 3$ and selected $\varepsilon_M$ from $\{0.1, 0.5\}$. Again, we are able to perform favorably vs. MBPO, demonstrating the potential for our approach to scale to larger environments. This performance comes despite using over 50% fewer models than MBPO (3 models vs. 7).

We do not claim these results are state of the art, but highlight the design choices to consider when using optimism for deep MBRL. In these settings we have been able to show that if variance can be controlled (e.g. by using $\varepsilon_M$), then optimism can perform comparably with the best random-sampling method.

7 CONCLUSION AND FUTURE WORK

We introduce a new perspective on optimistic model-based reinforcement learning (RL) algorithms, using noise augmented MDPs. Our approach, NARL, achieves comparable regret bounds in the tabular setting, while crucially making use of mechanisms which naturally occur in the deep RL setting. As such, NARL offers the potential for designing scalable optimistic model-based reinforcement learning algorithms. We explored the key factors for successfully implementing NARL in the deep RL paradigm, and opened the door for the application of optimistic algorithms to deep model-based RL.
REFERENCES

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Neural Information Processing Systems, 2011.

Marc Abeille, Alessandro Lazaric, et al. Linear Thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197, 2017.

Priyank Agrawal, Jinglin Chen, and Nan Jiang. Improved worst-case regret bounds for randomized least-squares value iteration. arXiv preprint arXiv:2010.12163, 2020.

Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Neural Information Processing Systems, 2017.

Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory, 2007.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, 2017a.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML, volume 70 of Proceedings of Machine Learning Research, pages 263–272, 2017b.

Philip Ball, Jack Parker-Holder, Aldo Pacchiano, Krzysztof Choromanski, and Stephen Roberts. Ready policy one: World building through active learning. In International Conference on Machine Learning, 2020.

Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. CoRR, abs/1205.2661, 2012.

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016.

Yuri Burda, Harrison Edwards, Deepak Pathak, Amos J. Storkey, Trevor Darrell, and Alexei A. Efros. Large-scale study of curiosity-driven learning. In 7th International Conference on Learning Representations, ICLR, 2019a.

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019b.

Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems 31, pages 4754–4765, 2018.

Kamil Ciosek, Quan Vuong, Robert Loftin, and Katja Hofmann. Better exploration with optimistic actor critic. In Neural Information Processing Systems, 2019.

Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, 2018.

Marc Peter Deisenroth and Carl Edward Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, 2011.

Simon S. Du, Sham M. Kakade, Ruosong Wang, and Lin F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020.

Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning based on Kullback-Leibler divergence. CoRR, abs/1004.5229, 2010.

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, 2018.

Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1587–1596, Stockholm, Sweden, 2018.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018a.

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. CoRR, abs/1812.05905, 2018b.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Neural Information Processing Systems, 2019.

Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. CoRR, abs/1907.05388, 2019.

Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, volume 4212 of Lecture Notes in Computer Science, pages 282–293. Springer, 2006.

Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.

Branislav Kveton, Csaba Szepesvári, Sharan Vaswani, Zheng Wen, Tor Lattimore, and Mohammad Ghavamzadeh. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In Proceedings of the 36th International Conference on Machine Learning, ICML, 2019.

Odalric-Ambrym Maillard and Mahsa Asadi. Upper confidence reinforcement learning exploiting state-action equivalence. 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Workshop, 2013.

D. A. Nix and A. S. Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of the 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 1, pages 55–60, 1994.

Ian Osband, Benjamin Van Roy, and Daniel Russo. (More) efficient reinforcement learning via posterior sampling. In Neural Information Processing Systems, 2013.

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Neural Information Processing Systems, 2016a.

Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, 2016b.

Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Neural Information Processing Systems, 2018.

Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, and Hado Van Hasselt. Behaviour suite for reinforcement learning. In International Conference on Learning Representations, 2020.

Victor H. Peña, Tze Leung Lai, and Qi-Man Shao. Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media, 2008.

Tabish Rashid, Bei Peng, Wendelin Boehmer, and Shimon Whiteson. Optimistic exploration even with a pessimistic initialisation. In International Conference on Learning Representations, 2020.

Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. In Advances in Neural Information Processing Systems, pages 14410–14420, 2019.

Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. CoRR, abs/2005.05960, 2020.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

Alexander Strehl and Michael Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin, 2(4):160–163, 1991.

Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In NIPS, 2017.

Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the Fourth Connectionist Models Summer School. Erlbaum, 1993.

Aristide C. Y. Tossou, Debabrota Basu, and Christos Dimitrakakis. Near-optimal optimistic reinforcement learning using empirical Bernstein inequalities. CoRR, abs/1905.12425, 2019.

Sharan Vaswani, Branislav Kveton, Zheng Wen, Anup Rao, Mark Schmidt, and Yasin Abbasi-Yadkori. New insights into bootstrapping for bandits. CoRR, abs/1805.09793, 2018.

Ziping Xu and Ambuj Tewari. Worst-case regret bound for perturbation based exploration in reinforcement learning.