Sample-Efficient Reinforcement Learning with Stochastic Ensemble

Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion Jacob Buckman∗ Danijar Hafner George Tucker Eugene Brevdo Honglak Lee [email protected] [email protected] [email protected] [email protected] [email protected] Google Brain, Mountain View, CA, USA Abstract many practical applications infeasible, particularly in real- world control problems (e.g., robotics) where data collec- tion is expensive. Integrating model-free and model-based approaches in reinforcement learning has the Model-based approaches aim to reduce the environment potential to achieve the high performance of samples required to learn a policy by modeling the dynam- model-free algorithms with low sample com- ics. A dynamics model can be used to increase sample plexity. However, this is difficult because an efficiency in various ways, for example training the policy imperfect dynamics model can degrade the per- on rollouts from the dynamics model Sutton [1990], us- formance of the learning algorithm, and in suf- ing rollouts to improve targets for TD learning Feinberg ficiently complex environments, the dynamics et al. [2018], and using rollout statistics as inputs to the model will almost always be imperfect. As a re- policy Weber et al. [2017]. Model-based algorithms such sult, a key challenge is to combine model-based as PILCO Deisenroth and Rasmussen [2011] have shown approaches with model-free learning in such a that it is possible to learn from orders-of-magnitude fewer way that errors in the model do not degrade samples. performance. We propose stochastic ensem- These successes have mostly been limited to environments ble value expansion (STEVE), a novel model- where the dynamics are simple to model. In noisy, com- based technique that addresses this issue. By plex environments, it is difficult to learn an accurate model dynamically interpolating between model roll- of the environment. When the model makes mistakes, it outs of various horizon lengths for each indi- can cause the wrong policy to be learned, hindering per- vidual example, STEVE ensures that the model formance. Recent work has begun to address this issue is only utilized when doing so does not intro- Sutton and Barto [1998], Kurutach et al. [2018], Kalweit duce significant errors. Our approach outper- and Boedecker [2017], Weber et al. [2017], Gal et al., forms model-free baselines on challenging con- Depeweg et al. [2016], Gu et al. [2016a]; see Appendix G tinuous control benchmarks with an order-of- for a more in-depth discussion of these approaches. magnitude increase in sample efficiency, and in contrast to previous model-based approaches, We propose stochastic ensemble value expansion performance does not degrade in complex envi- (STEVE), an extension to model-based value expansion ronments. (MVE), proposed by Feinberg et al. [2018]. Both tech- niques use a dynamics model to compute “rollouts” that are used to improve the targets for temporal difference 1 Introduction learning. MVE rolls out a fixed length into the future, potentially accumulating model errors or increasing value Deep model-free reinforcement learning has had great suc- estimation error along the way. In contrast, STEVE inter- cess in recent years, notably in playing video games Mnih polates between many different horizon lengths, favoring et al. [2013] and strategic board games Silver et al. [2016]. those whose estimates have lower uncertainty, and thus However, training agents using these algorithms requires lower error. To compute the interpolated target, we re- tens to hundreds of millions of samples, which makes place both the model and Q-function with ensembles, ap- proximating the uncertainty of an estimate by computing ∗ Work completed as part of the Google AI Residency Pro- its variance under samples from the ensemble. Through gram. these uncertainty estimates, STEVE dynamically utilizes parameters φ to minimize the negative Q-value: the model rollouts only when they do not introduce signifi- L = −Q^π(s; π (s)): (2) cant errors. We systematically evaluate STEVE on several φ θ φ challenging continuous control benchmarks and demon- 2.1.1 Model-Based Value Expansion strate that STEVE significantly outperforms model-free baselines with an order-of-magnitude increase in sample Recently, Feinberg et al. [2018] showed that a learned efficiency, and that in contrast to previous model-based dynamics model can be used to improve value estimation. approaches, the performance of STEVE does not degrade Their method, model-based value expansion, combines a as the environment gets more complex. short term value estimate formed by unrolling the model dynamics and a long term value estimate using the learned ^π 2 Background Qθ− function. When the model is accurate, this reduces the bias of the targets, leading to improved performance. Reinforcement learning aims to learn an agent policy that The learned dynamics model consists of three learned maximizes the expected (discounted) sum of rewards [Sut- functions: the transition function T^ξ(s; a), which returns ton and Barto, 1998]. We focus on the deterministic case 0 ^ a successor state s ; a termination function dξ(s), which for exposition; however, our method is applicable to the returns the probability that s is a terminal state; and the re- stochastic case as well. The agent starts at an initial state 0 ward function r^ (s; a; s ), which returns a scalar reward. s0 ∼ p(s0). Then, the agent chooses an action at ac- This model is trained to minimize cording to its policy πφ(st) with parameters φ, receives 0 2 0 ^ a reward rt = r(st; at), and transitions to a subsequent Lξ; = E(s;a;r;s ) jjTξ(s; a) − s jj state st+1 according to the Markovian dynamics T (at; st) 0 ^ ^ 0 2 + H d(t j s ); dξ(t j Tξ(s; a)) + (^r (s; a; s ) − r) ; of the environment. This generates a trajectory of states, (3) actions, and rewards τ = (s0; a0; r0; s1; a1;:::). We ab- breviate the trajectory by τ. The goal is to maximize where the expectation is over collected transitions the expected discounted sum of rewards along sampled (s; a; r; s0), d(t j s0) is an indicator function which re- P1 t 0 trajectories J(θ) = Eτ [ t=0 γ rt] where γ 2 [0; 1) is a turns 1 when s is a terminal state and 0 otherwise, and H discount parameter. is the cross-entropy. In this work, we consider continuous environments; for discrete environments, the first term 2.1 Value Estimation with TD-learning can be replaced by a cross-entropy loss term. π P1 t To incorporate the model into value estimation, we replace The action-value function Q (s0; a0) = t=0 γ rt is a critical quantity to estimate for many learning algorithms. our standard Q-learning target with an improved target, MVE Using the fact that Qπ(s; a) satisfies a recursion relation TH , computed by rolling the learned model out for H steps. π π 0 0 Q (s; a) = r(s; a) + γQ (s ; π(s )); 0 0 0 0 0 ^ 0 0 s0 = s ; ai = πφ(si); si = Tξ(si−1; ai−1); 0 π where s = T (s; a), we can estimate Q (s; a) off-policy i (4) 0 i Y 0 with collected transitions of the form (s; a; r; s ) sampled D = (1 − d(sj)) uniformly from a replay buffer [Sutton and Barto, 1998]. j=0 We approximate Qπ(s; a) with a deep neural network, H ! ^π MVE 0 X i i 0 0 0 Qθ (s; a). We learn parameters θ to minimize the mean TH (r; s ) = r + D γ r^ (si−1; ai−1; si) + squared error (MSE) between Q-value estimates of states i=1 and their corresponding temporal difference targets Sutton H+1 H+1 ^π 0 0 D γ Qθ− (sH ; aH ): and Barto [1998]: (5) TD 0 ^π 0 0 1 MVE TD T (r; s ) = r + γQθ− (s ; π(s )) To use this target , we substitute TH in place of T h ^π TD 0 2i when training θ using Equation 1. Note that when H = 0, Lθ = E(s;a;r;s0) (Qθ (s; a) − T (r; s )) : TD MVE MVE reduces to TD-learning (i.e., T = T0 ). See (1) Appendix B for further discussion of MVE. 1This formulation is a minor generalization of the original − Note that we use an older copy of the parameters, θ , MVE objective in that we additionally model the reward func- when computing targets [Mnih et al., 2013]. tion and termination function; Feinberg et al. [2018] consider fully observable environments in which the reward function and To approximate a policy which maximizes our Q-function, termination condition were known, deterministic functions of we use a neural network Lillicrap et al. [2015], learning the observations. MVE MVE MVE MVE 3 Stochastic Ensemble Value Expansion average of candidate targets T0 ,T1 ,T2 ,...,TH and the true Q-value. From a single rollout of H timesteps, we can compute 2 H !23 H + 1 distinct candidate targets by considering rollouts X MVE π MVE MVE MVE MVE E 4 wiTi − Q (s; a) 5 to various horizon lengths: T0 ,T1 ,T2 ,...,TH . MVE i=0 Standard TD learning uses T0 as the target, while MVE !2 ! MVE uses TH as the target. We propose interpolating X MVE X MVE all of the candidate targets to produce a target which is = Bias wiTi + Var wiTi better than any individual. Naïvely, one could average i i the candidate targets, or weight the candidate targets in !2 X MVE X 2 MVE an exponentially-decaying fashion, similar to TD-λ Sut- ≈ Bias wiTi + wi Var(Ti ); ton and Barto [1998].

Sample-Efficient Reinforcement Learning with Stochastic Ensemble

Elements of DSAI: Game Tree Search, Learning Architectures

Discrete Sequential Prediction of Continu

Bridging Imagination and Reality for Model-Based Deep Reinforcement Learning

Improving the Gating Mechanism of Recurrent

Bayesian Policy Selection Using Active Inference

Meta-Learning Deep Energy-Based Memory Models

Shixiang (Shane) Gu

Deep Learning Without Weight Transport

Arxiv:2010.02193V3 [Cs.LG] 3 May 2021

A General Reinforcement Learning Algorithm That Masters Chess, Shogi, and Go Through Self-Play

A 20-Year Community Roadmap for Artificial Intelligence Research in the US

A Brief Survey of Deep Reinforcement Learning Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, Anil Anthony Bharath