Deep Reinforcement Learning with Weighted Q-Learning

Andrea Cini 1 Carlo D’Eramo 2 Jan Peters 2 3 Cesare Alippi 4 1

1 Faculty of Informatics, Università della Svizzera italiana, Lugano, Switzerland. 2 IAS, TU Darmstadt, Darmstadt, Germany. 3 Max Planck Institute for Intelligent Systems, Tübingen, Germany. 4 DEIB, Politecnico di Milano, Milan, Italy. Correspondence to: Andrea Cini.

A preprint. arXiv:2003.09280v2 [cs.LG] 30 Mar 2020

Abstract

Overestimation of the maximum action-value is a well-known problem that hinders Q-Learning performance, leading to suboptimal policies and unstable learning. Among several Q-Learning variants proposed to address this issue, Weighted Q-Learning (WQL) effectively reduces the bias and shows remarkable results in stochastic environments. WQL uses a weighted sum of the estimated action-values, where the weights correspond to the probability of each action-value being the maximum; however, the computation of these probabilities is only practical in the tabular setting. In this work, we provide the methodological advances to benefit from the WQL properties in Deep Reinforcement Learning (DRL), by using neural networks with Dropout Variational Inference as an effective approximation of deep Gaussian processes. In particular, we adopt the Concrete Dropout variant to obtain calibrated estimates of epistemic uncertainty in DRL. We show that model uncertainty in DRL can be useful not only for action selection, but also for action evaluation. We analyze how our novel Deep Weighted Q-Learning algorithm reduces the bias w.r.t. relevant baselines and provide empirical evidence of its advantages on several representative benchmarks.

1. Introduction

Reinforcement Learning (RL) aims at learning how to take optimal decisions in unknown environments by solving credit assignment problems that extend in time. In order to be sample efficient learners, agents are required to constantly update their own beliefs about the world, about which actions are good and which are not. Temporal difference (TD) learning (Sutton & Barto, 1998) and off-policy learning are the constitutional elements of this kind of behavior. TD allows agents to bootstrap their current knowledge to learn from a new observation as soon as it is available. Off-policy learning gives the means for exploration and enables experience replay (Lin, 1991). Q-Learning (Watkins, 1989) implements both paradigms.

Algorithms based on Q-Learning are, in fact, driving Deep Reinforcement Learning (DRL) research towards solving complex problems and achieving super-human performance on many of them (Mnih et al., 2015; Hessel et al., 2018). Nonetheless, Q-Learning is known to be positively biased (Van Hasselt, 2010), since it learns by using the maximum over the - noisy - bootstrapped TD estimates. This overoptimism can be particularly harmful in stochastic environments and when using function approximation (Thrun & Schwartz, 1993), notably also in the case where the approximators are deep neural networks (Van Hasselt et al., 2016). Systematic overestimation of the action-values, coupled with the inherently high variance of DRL methods, can lead to incrementally accumulating errors, causing the learning to diverge.

Among the possible solutions, the Double Q-Learning algorithm (Van Hasselt, 2010) and its DRL variant Double DQN (Van Hasselt et al., 2016) tackle the overestimation problem by disentangling the choice of the target action and its evaluation. The resulting estimator, while achieving superior performance in many problems, is negatively biased (Van Hasselt, 2013). Underestimation, in fact, can lead in some environments to lower performance and slower convergence rates compared to standard Q-Learning (D'Eramo et al., 2016; Lan et al., 2020). Overoptimism, in general, is not uniform over the state space and may induce the agent to overestimate the value of arbitrarily bad actions, throwing it completely off. The same holds true, symmetrically, for overly pessimistic estimates that might undervalue a good course of action.

Ideally, we would like DRL agents to be aware of their own uncertainty about the optimality of each action, and be able to exploit it to make more informed estimations of the expected return. This is exactly what we achieve in this work.

We exploit recent developments in Bayesian Deep Learning to model the uncertainty of DRL agents using neural networks trained with dropout variational inference (Kingma et al., 2015; Gal & Ghahramani, 2016). We combine, in a novel way, the dropout uncertainty estimates with the Weighted Q-Learning algorithm (D'Eramo et al., 2016), extending it to the DRL settings. The proposed Deep Weighted Q-Learning algorithm, or Weighted DQN (WDQN), leverages an approximated posterior distribution on Q-networks to reduce the bias of deep Q-learning. The WDQN bias is neither always positive nor always negative, but depends on the state and the problem at hand. WDQN only requires minor modifications to the baseline algorithm and its computational overhead is negligible on specialized hardware.

The paper is organized as follows. In Section 2 we define the problem settings, introducing key aspects of value-based RL. In Section 3 we analyze in depth the problem of estimation biases in Q-Learning and sequential decision making problems. Then, in Section 4, we first discuss how neural networks trained with dropout can be used for Bayesian inference in RL and, from that, we derive the WDQN algorithm. In Section 5 we empirically evaluate the proposed method against relevant baselines on several benchmarks. Finally, we provide an overview of related works in Section 6, and we draw our conclusions and discuss future works in Section 7.

2. Preliminaries

A Markov Decision Process (MDP) is a tuple ⟨S, A, P, R, γ⟩, where S is a state space, A is an action space, P : S × A → S is a Markovian transition function, R : S × A → R is a reward function, and γ ∈ [0, 1] is a discount factor. A sequential decision maker ought to estimate, for each state s, the optimal value Q*(s, a) of each action a, i.e., the expected return obtained by taking action a in s and following the optimal policy π* afterwards. We can write Q* using the Bellman optimality equation (Bellman, 1954)

Q*(s, a) = E[ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a ].   (1)

(Deep) Q-Learning   A classical approach to solve finite MDPs is the Q-Learning algorithm (Watkins, 1989), an off-policy value-based RL algorithm based on TD. A Q-Learning agent learns the optimal value function using the following update rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α ( y_t^QL − Q(s_t, a_t) ),   (2)

where α is the learning rate and, following the notation introduced by Van Hasselt et al. (2016),

y_t^QL = r_t + γ max_a Q(s_{t+1}, a).   (3)

The popular Deep Q-Network algorithm (DQN) (Mnih et al., 2015) is a variant of Q-Learning designed to stabilize off-policy learning with deep neural networks in highly dimensional state spaces. The two most relevant architectural changes to standard Q-Learning introduced by DQN are the adoption of a replay memory, to learn offline from past experience, and the use of a target network, to reduce correlation between the current model estimate and the bootstrapped target value.

In practice, DQN learns the Q-values online, using a neural network with parameters θ, sampling the replay memory, and with a target network whose parameters θ⁻ are updated to match those of the online model every C steps. The model is trained to minimize the loss

L(θ) = E_{⟨s_i, a_i, r_i, s'_i⟩ ∼ m} [ ( y_i^DQN − Q(s_i, a_i; θ) )² ],   (4)

where m is a uniform distribution over the transitions stored in the replay buffer and y^DQN is defined as

y_i^DQN = r_i + γ max_a Q(s'_i, a; θ⁻).   (5)

Double DQN   Among the many studied improvements and extensions of the baseline DQN algorithm (Wang et al., 2016; Schaul et al., 2016; Bellemare et al., 2017; Hessel et al., 2018), Double DQN (DDQN) (Van Hasselt et al., 2016) reduces the overestimation bias of DQN with a simple modification of the update rule. In particular, DDQN uses the target network to decouple action selection and evaluation, and estimates the target value as

y_i^DDQN = r_i + γ Q(s'_i, argmax_a Q(s'_i, a; θ); θ⁻).   (6)

DDQN improves on DQN, converging to a more accurate approximation of the value function while maintaining the same model complexity and adding a minimal computational overhead.
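To make the difference between the two update rules concrete, the following sketch computes the targets of Eq. 5 and Eq. 6 for a batch of transitions. It is a minimal illustration in PyTorch, not the implementation used in the paper; the names (q_net, target_net, rewards, next_states, dones) and the terminal-state masking via dones are our own assumptions.

```python
import torch

def dqn_and_ddqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Sketch of the DQN (Eq. 5) and DDQN (Eq. 6) targets for a batch of transitions.

    q_net and target_net are assumed to map a batch of states to a
    (batch, n_actions) tensor of action-values.
    """
    with torch.no_grad():
        q_next_target = target_net(next_states)  # Q(s', .; θ⁻)
        # DQN: both selection and evaluation of the maximizing action use the target network.
        dqn_target = rewards + gamma * (1 - dones) * q_next_target.max(dim=1).values
        # DDQN: select the action with the online network, evaluate it with the target network.
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        ddqn_target = rewards + gamma * (1 - dones) * q_next_target.gather(1, best_actions).squeeze(1)
    return dqn_target, ddqn_target
```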
3. Estimation biases in Q-Learning

Choosing a target value for the Q-Learning update rule can be seen as an instance of the Maximum Expected Value (MEV) estimation problem for a set of random variables, here the action-values Q(s_{t+1}, · ). Q-Learning uses the Maximum Estimator (ME)¹ to estimate the maximum expected return and exploits it for policy improvement. It is well-known that ME is a positively biased estimator of MEV (Smith & Winkler, 2006). The divergent behaviour that may occur in Q-Learning, then, may be explained by the amplification over time of the error on the action-value estimates caused by the overestimation bias, which introduces a positive error at each update (Van Hasselt, 2010). Double Q-Learning (Van Hasselt, 2010), on the other hand, learns two value functions in parallel and uses an update scheme based on the Double Estimator (DE). It is shown that DE is a negatively biased estimator of MEV, which helps to avoid catastrophic overestimates of the Q-values. The DDQN algorithm, introduced in Section 2, is the extension of Double Q-Learning to the DRL settings. In DDQN, the target network is used as a proxy of the second value function of Double Q-Learning to preserve sample and model complexity.

¹ Details about the estimators considered in this section are provided in the appendix.

In practice, as shown by D'Eramo et al. (2016) and Lan et al. (2020), the overestimation bias of Q-Learning is not always harmful, and may also be convenient when the action-values are significantly different among each other (e.g., deterministic environments with a short time horizon, or small action spaces). Conversely, the underestimation of Double Q-Learning is effective when all the action-values are very similar (e.g., highly stochastic environments with a long or infinite time horizon, or large action spaces). In fact, depending on the problem, both algorithms have properties that can be detrimental for learning. Unfortunately, prior knowledge about the environment is not always available and, whenever it is, the problem may be too complex to decide which estimator should be preferred. Given the above, it is desirable to use a methodology which can robustly deal with heterogeneous problems.

3.1. Weighted Q-Learning

D'Eramo et al. (2016) propose the Weighted Q-Learning (WQL) algorithm, a variant of Q-Learning based on the therein introduced Weighted Estimator (WE). The WE estimates MEV as the weighted sum of the random variables' sample means, weighted according to their probability of corresponding to the maximum. Intuitively, the amount of uncertainty, i.e., the entropy of the WE weights, will depend on the nature of the problem, the number of samples and the variance of the mean estimator (critical when using function approximation). The WE bias is bounded by the biases of ME and DE (D'Eramo et al., 2016).

The target value of WQL can be computed as

y_t^WQL = r_t + γ Σ_{a ∈ A} w_a^{s_{t+1}} Q(s_{t+1}, a),   (7)

where w_a^{s_{t+1}} are the weights of the WE and correspond to the probability of each action-value being the maximum:

w_a^s = P( a = argmax_{a'} Q(s, a') ).   (8)

The update rule of WQL can be obtained by replacing y_t^QL with y_t^WQL in Equation 2. The weights of WQL are estimated in the tabular setting assuming the sample means to be normally distributed.

WE has been studied also in the Batch RL settings, with continuous state and action spaces, by using Gaussian Process regression (D'Eramo et al., 2017).
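As a concrete illustration of Eqs. 7-8, the sketch below computes a tabular WQL target by approximating each sample mean with a Gaussian, as in D'Eramo et al. (2016), and estimating the probability of each action being the maximum by Monte Carlo sampling. This is a simplified stand-in for the exact computation of the weights (which can also be done by numerical integration); the function and argument names are ours.

```python
import numpy as np

def wql_target(reward, gamma, q_means, q_std_errs, n_draws=10_000, rng=None):
    """Tabular WQL target (Eq. 7): the weights are P(a = argmax), estimated by
    sampling from a Gaussian approximation of each action-value sample mean."""
    rng = np.random.default_rng() if rng is None else rng
    n_actions = len(q_means)
    # One Gaussian sample per action and per Monte Carlo draw: shape (n_draws, n_actions).
    draws = rng.normal(q_means, q_std_errs, size=(n_draws, n_actions))
    # w_a ~ fraction of draws in which action a attains the maximum (Eq. 8).
    weights = np.bincount(draws.argmax(axis=1), minlength=n_actions) / n_draws
    return reward + gamma * np.dot(weights, q_means)
```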

4. Deep Weighted Q-Learning

A natural way to extend the WQL algorithm to the DRL settings is to consider the uncertainty over the model parameters using a Bayesian approach. Among the possible solutions to estimate uncertainty, bootstrapping has been the most successful in RL problems, with Bootstrapped DQN (BDQN) (Osband et al., 2016; 2018) achieving impressive results in environments where exploration is critical. On the other hand, using bootstrapping necessitates significant modifications to the baseline DQN architecture and requires training a model for each sample of the approximate posterior distribution. This limits the number of available samples considerably and is a major drawback in using BDQN to approximate the WE weights. Using dropout, conversely, does not impact model complexity and allows computing the weights of the WE by using infinitely many samples.

In the following we first introduce how neural networks trained with dropout can be used for approximate Bayesian inference and discuss how this approach has been used with success in RL problems. Then, we propose a novel approach to exploit the uncertainty over the model parameters for action evaluation, adapting the WE to the DRL settings. Finally, we analyze a possible shortcoming of the proposed method and identify a solution from the literature to address it.

4.1. Bayesian inference with dropout

Dropout (Srivastava et al., 2014) is a regularization technique used to train large neural networks by randomly dropping units during learning. In recent years, dropout has been analyzed from a Bayesian perspective (Kingma et al., 2015; Gal & Ghahramani, 2016), and interpreted as a variational approximation of a posterior distribution over the neural network parameters. In particular, Gal & Ghahramani (2016) show how a neural network trained with dropout and weight decay can be seen as an approximation of a deep Gaussian process (Damianou & Lawrence, 2013). The result is a theoretically grounded interpretation of dropout and a class of Bayesian neural networks (BNNs) that are cheap to train and can be queried to obtain uncertainty estimates. In fact, a single stochastic forward pass through the BNN can be interpreted as taking a sample from the model's predictive distribution, while the predictive mean can be computed as the average of multiple samples. This inference technique is known as Monte Carlo (MC) dropout and can be efficiently parallelized on modern GPUs.

A straightforward application of Bayesian models to RL is Thompson Sampling (TS) (Thompson, 1933). TS is an exploration technique that aims at improving the sample complexity of RL algorithms by selecting actions according to their probability of being optimal given the current agent's beliefs. A practical way to use TS in Deep Reinforcement Learning is to take a single sample from a Q-network trained with dropout and select the action that corresponds to the maximum sampled action-value (Gal & Ghahramani, 2016). TS based on dropout achieves superior data efficiency compared against naïve exploration strategies, such as ε-greedy, both in sequential decision making problems (Gal & Ghahramani, 2016; Stadie et al., 2015) and contextual bandits (Riquelme et al., 2018; Collier & Llorens, 2018). Furthermore, dropout has been successfully used in model-based RL, to estimate the agent's uncertainty over the environment dynamics (Gal et al., 2016; Kahn et al., 2017; Malik et al., 2019).

Here we focus on the problem of action evaluation. We show how to use approximate Bayesian inference to evaluate the WE estimator by introducing a novel approach to exploit uncertainty estimates in DRL. Our method empirically reduces Q-Learning bias, is grounded in theory and simple to implement.

Algorithm 1 Weighted DQN
  Input: Q-network parameters θ, dropout rates p_1, ..., p_L, a policy π, replay memory D
  θ⁻ ← θ
  Initialize memory D.
  for step t = 0, ... do
    Select an action a_t according to some policy π given the distribution over action-value functions Q(s_t, · ; θ, ω)
    Execute a_t and add ⟨s_t, a_t, r_t, s_{t+1}⟩ to D
    Sample a mini-batch of transitions {⟨s_i, a_i, r_i, s'_i⟩, i = 1, ..., M} from D
    for i = 1, ..., M do   ▷ can be done in parallel
      Take K samples from Q(s'_i, · ; θ⁻, ω) by performing K stochastic forward passes
      Use the samples to compute the WDQN weights (Eq. 14) and targets (Eq. 15)
    end for
    Perform an SGD step on the WDQN loss (Eq. 16)
    Eventually update θ⁻
  end for

4.2. Weighted DQN

Let Q( · , · ; θ, ω) be a BNN with weights θ trained with a Gaussian prior and Dropout Variational Inference to learn the optimal action-value function of a certain MDP. We indicate with ω the set of random variables that represents the dropout masks, with ω_i the i-th realization of the random variables and with Ω their joint distribution:

ω = {ω_lk : l = 1, ..., L, k = 1, ..., K_l},   (9)

ω_lk ∼ Bernoulli(p_l),   ω_i ∼ Ω(p_1, ..., p_L),   (10)

where L is the number of weight layers of the network and K_l is the number of units in layer l.

Consider a sample q(s, a) of the MDP return, obtained taking action a in s and following the optimal policy afterwards. Following the GP interpretation of dropout of Gal & Ghahramani (2016), we can approximate the likelihood of this observation as a Gaussian such that

q(s, a) ∼ N( Q(s, a; θ, ω); τ⁻¹ ),   (11)

where τ is the model precision.

We can approximate the predictive mean of the process, and the expectation over the posterior distribution of the Q-value estimates, as the average of T stochastic forward passes through the network:

E[Q(s, a; θ, ω)] ≈ Q̂_T(s, a; θ) = (1/T) Σ_{t=1}^T Q(s, a; θ, ω_t).   (12)

Q̂_T(s, a; θ) is the BNN prediction of the action-values associated to state s and action a.
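A minimal sketch of the MC dropout estimate of Eq. 12 in PyTorch is given below. Keeping the dropout layers in training mode at prediction time is what turns each forward pass into a sample Q(s, ·; θ, ω_t); we assume here that dropout is the only layer whose behaviour differs between train and eval mode, and the function name is illustrative.

```python
import torch

@torch.no_grad()
def mc_dropout_q(q_net, states, n_samples=100):
    """Approximate E[Q(s, .; θ, ω)] with T stochastic forward passes (Eq. 12)."""
    q_net.train()  # keep dropout active: each pass draws a fresh dropout mask ω_t
    samples = torch.stack([q_net(states) for _ in range(n_samples)])  # (T, batch, n_actions)
    q_net.eval()
    return samples.mean(dim=0), samples  # predictive mean Q̂_T(s, .; θ) and the raw samples
```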
A similar, more computationally efficient, approximation can be obtained through weight averaging, which consists in scaling the output of the neurons in layer l by 1 − p_l during training and leaving them unchanged at inference time. We indicate this estimate as Q̃(s, a; θ) and we use it for action selection during training.

The model epistemic uncertainty, i.e., the model uncertainty over its parameters, can be measured similarly as the sample variance across T realizations of the dropout random variables:

Var[Q(s, a; θ, ω)] ≈ (1/T) Σ_{t=1}^T Q(s, a; θ, ω_t)² − ( (1/T) Σ_{t=1}^T Q(s, a; θ, ω_t) )².   (13)

As shown in Gal & Ghahramani (2016), the predictive variance can be approximated with the variance of the estimator in Eq. 13 plus the model inverse precision τ⁻¹.

We can estimate the probability required to calculate the WE in a similar way. Given an action a, the probability that a corresponds to the maximum expected action-value can be approximated as the number of times in which, given T samples, the sampled action-value of a is the maximum, divided by the number of samples:

w_a^s(θ) = P( a = argmax_{a'} Q(s, a'; θ, ω) ) ≈ (1/T) Σ_{t=1}^T [[ a = argmax_{a'} Q(s, a'; θ, ω_t) ]],   (14)

where [[ · ]] are the Iverson brackets ([[P]] is 1 if P is true, 0 otherwise). The weights can be efficiently inferred in parallel with no impact on computational time.
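Given the stack of T sampled action-values from the previous sketch, the weights of Eq. 14 reduce to counting how often each action attains the maximum. A possible vectorized version (illustrative, not the authors' code):

```python
import torch

def wdqn_weights(q_samples):
    """Eq. 14: q_samples has shape (T, batch, n_actions); returns (batch, n_actions) weights."""
    n_actions = q_samples.shape[-1]
    argmax_idx = q_samples.argmax(dim=-1)                                  # (T, batch)
    iverson = torch.nn.functional.one_hot(argmax_idx, n_actions).float()  # 1 where a is the max
    return iverson.mean(dim=0)                                             # fraction of samples in which a is the max
```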

We can define the WE given the Bayesian target Q-network estimates using the obtained weights as:

y_i^WDQN = r_i + γ Σ_{a ∈ A} w_a^{s'_i}(θ⁻) Q̂_T(s'_i, a; θ⁻).   (15)

Finally, we report for completeness the loss minimized by WDQN, where the parameter updates are backpropagated using the dropout masks:

L(θ) = E_{⟨s_i, a_i, r_i, s'_i⟩ ∼ m, ω_i ∼ Ω} [ ( y_i^WDQN − Q(s_i, a_i; θ, ω_i) )² ] + λ Σ_{l=1}^L ||θ_l||²_2,   (16)

where θ_l are the weights of layer l and λ the weight decay coefficient. Using weight decay is necessary for the variational approximation to be valid. The complete WDQN algorithm is reported in Algorithm 1.

4.3. Concrete Dropout

The dropout probabilities are variational parameters and influence the quality of the approximation. Ideally, they should be tuned to maximize the log-likelihood of the observations using a validation method. This is clearly not possible in RL, where the available samples, and the underlying distribution generating them, change as the policy improves. In fact, using dropout with a fixed probability might lead to poor uncertainty estimates (Osband et al., 2016; 2018; Gal et al., 2017). Concrete Dropout (Gal et al., 2017) mitigates this problem by using a differentiable continuous relaxation of the Bernoulli distribution and learning the dropout rate from data.

In practice, this means that the distribution of the dropout random variables ω_lk becomes

ω_lk = σ( β ( log p_l − log(1 − p_l) + log u − log(1 − u) ) ),   (17)

where β is a temperature parameter (fixed at β = 10), u is a uniform random variable u ∼ U(0, 1) and σ( · ) is the sigmoid function. With this formulation the sampling procedure becomes differentiable and the loss in Eq. 16 can be rewritten as:

L(θ, p) = E_{⟨s_i, a_i, r_i, s'_i⟩ ∼ m, ω_i ∼ Ω} [ ( y_i^WDQN − Q(s_i, a_i; θ, ω_i) )² ] + Σ_{l=1}^L ( λ ||θ_l||²_2 − ζ K_l H(p_l) ),   (18)

where ζ is a dropout regularization coefficient, H(p) is the entropy of a Bernoulli distribution with parameter p and K_l the number of neurons in layer l.
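The relaxation of Eq. 17 amounts to a few differentiable operations. The sketch below is our paraphrase of the Concrete Dropout mask (with β = 10 as in the text), not the code used for the experiments; following Eqs. 9-10, ω is treated as the (relaxed) drop indicator, so the corresponding keep-mask is 1 − ω.

```python
import torch

def concrete_dropout_mask(logit_p, shape, beta=10.0, eps=1e-7):
    """Differentiable relaxation of the Bernoulli dropout variables ω_lk (Eq. 17).

    The dropout rate is parameterized as p = sigmoid(logit_p) so that it stays in (0, 1)
    while being learned by gradient descent together with the network weights.
    """
    p = torch.sigmoid(logit_p)
    u = torch.rand(shape).clamp(eps, 1.0 - eps)  # u ~ U(0, 1), clamped to avoid log(0)
    omega = torch.sigmoid(beta * (torch.log(p) - torch.log1p(-p)
                                  + torch.log(u) - torch.log1p(-u)))
    keep_mask = 1.0 - omega           # soft keep-probability per unit
    return keep_mask / (1.0 - p)      # rescale activations as in standard dropout
```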


Figure 1. Learning curves of the analyzed DQN variants on Asterix. The curves on the left show the evaluation scores. The figure in the middle reports, for each agent, the estimate of the expected return w.r.t. the starting screen of the game; the dashed lines indicate the real discounted return for each agent. The figure on the right shows the moving average of the entropy of the WDQN weights. Each epoch corresponds to 1M frames. The results shown here are the average of 3 independent runs and the shaded areas represent 95% confidence intervals. The curves are smoothed using a moving average of 10 epochs to improve readability.


Figure 2. Learning curves on Lunar Lander. On the left, the evaluation scores of each epoch (20k steps). On the right, the maximum action-value estimated by each agent at the initial state of the MDP; the dashed lines are the real obtained discounted return. The results shown are the average of 20 independent runs, shaded areas are 95% confidence intervals. The curves are smoothed using a moving average of 10 epochs.

Figure 3. Comparison of WDQN against WDQN with TS and vanilla DQN with Concrete Dropout (DropDQN). On the left, the evaluation scores. On the right, the maximum estimated action-value at the starting screen of the game; dashed lines are the real obtained discounted return. The results shown are the average across 3 different random seeds, the shaded areas are 95% confidence intervals. The curves are smoothed using a moving average of 10 epochs.

5. Experiments

In this section we compare WDQN against the standard DQN algorithm and its DDQN variant to measure the effect of the different estimators on the quality of the learned value functions and policies. First we run a proof of concept experiment on the Lunar Lander environment (Brockman et al., 2016). Then we perform an in-depth analysis on Asterix, one of the Atari games in the Arcade Learning Environment (ALE) (Bellemare et al., 2013) where DQN is known to overestimate the action-values (Van Hasselt et al., 2016). Finally, we test WDQN in three environments of the MinAtar benchmark (Young & Tian, 2019). WDQN achieves a more accurate estimation of the expected return, which results, on average, in better policies.

All the agents are evaluated after each epoch of training using the greedy policy; for Asterix we follow the evaluation protocol of Mnih et al. (2015). For each experiment we report both the average cumulative reward at each evaluation step and the prediction of the expected return, compared against the real discounted return obtained by the agents. Even if TS is the natural choice for WDQN, all the algorithms, unless explicitly stated, use an ε-greedy policy to guarantee the fairness of the comparison. For WDQN, the greedy action is selected during training as the action corresponding to the maximum Q-value estimated with weight averaging, while during evaluation we take the action with the highest probability of being optimal, computed as in Eq. 14. We found WDQN to be robust to the number of dropout samples used to compute the WE (e.g., T ≥ 30). We used low values for the weight decay term, as in Farebrother et al. (2018), and tuned the dropout regularization coefficient for the problem at hand. The algorithms and the experimental setup have been developed using the open-source RL libraries MushroomRL (D'Eramo et al., 2020) and OpenAI Gym (Brockman et al., 2016). A full description of the complete experimental setup and further results for each experiment can be found in the appendix.

5.1. Lunar Lander

Lunar Lander is an MDP from the Gym collection. In order to solve the environment, the agent has to control the thrusters of a spacecraft to safely land at a specific location. In order to make prediction and control more challenging, we increased the stochasticity of the environment by adding a 10% probability of repeating the last executed action instead of the one selected by the agent. The three agents use the same exact network architecture and hyperparameters; the only difference is that WDQN uses Concrete Dropout in each hidden layer. Figure 2 shows the learning curves of the three agents, with WDQN achieving a significantly higher average reward and prediction accuracy.
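The added stochasticity can be obtained with a thin wrapper around the environment. The sketch below is one possible implementation (ours, not the one used for the paper) that repeats the previously executed action with probability p_repeat, assuming the classic Gym step/reset interface.

```python
import random
import gym

class StickyActionWrapper(gym.Wrapper):
    """With probability p_repeat, execute the previously executed action instead of the chosen one."""

    def __init__(self, env, p_repeat=0.1):
        super().__init__(env)
        self.p_repeat = p_repeat
        self.last_action = None

    def reset(self, **kwargs):
        self.last_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        if self.last_action is not None and random.random() < self.p_repeat:
            action = self.last_action  # repeat the previous action instead of the selected one
        self.last_action = action
        return self.env.step(action)
```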

[Figure 4: one row of panels per game (MinAtar-Breakout, MinAtar-Seaquest, MinAtar-Freeway); columns show average reward, max_a Q(s_0, a) and WE entropy as a function of the training epoch for DQN, DDQN and WDQN.]

Figure 4. Learning curves of the analyzed DQN variants on three Minatar games. In the leftmost column, we show the evaluation scores after each of the 200 training epochs, where 1 epoch corresponds to 25k steps. In the middle column we report for each game and agent the estimate of the expected return w.r.t. the initial state of the environment; the dashed lines indicate the real discounted return obtained by the agents. In the rightmost column we show the moving average of the entropy of the WDQN weights. The results shown here are averaged over 20 independent runs and the shaded areas represent 95% confidence intervals. The curves are smoothed using a moving average of 5 epochs.

5.2. Asterix

We use the same neural network and hyperparameters of Mnih et al. (2015), except for the optimizer, which we replace with Adam (Kingma & Ba, 2015), following more recent best practices (Hessel et al., 2018; Dabney et al., 2018). For WDQN we use Concrete Dropout only in the fully connected layer after the convolutional block. We use sticky actions, with a probability of repeating the last action of 25%, as proposed by Machado et al. (2018). Figure 1 shows the result of the comparison in terms of average reward and prediction accuracy. Differently from what was observed by Van Hasselt et al. (2016), DQN does not diverge in our setting. In fact, while widely overestimating the discounted return, DQN achieves better sample-complexity and higher cumulative reward w.r.t. DDQN.

WDQN, on the other hand, is clearly the least biased option and, despite the slow start - which can be explained by the regularization introduced by dropout - reaches the highest average reward. After an initial phase, the average entropy of the WE weights remains almost constant, with two possible interpretations: 1) in many states there is not a single action that dominates the others; 2) the agent is not able to completely resolve its uncertainty over the action-value functions.

In order to gather more insights on WDQN and on how each component (namely the policy and the MEV estimator) influences the observed behavior, in Figure 3 we compare it against a baseline DQN agent trained with Concrete Dropout (DropDQN in the figure) and a version of WDQN that uses a TS policy as in Gal & Ghahramani (2016). In this setting TS is not beneficial and underperforms. DropDQN is clearly the worst performer of the tested algorithms, indicating that the use of WE is indeed beneficial and suggesting that WDQN may improve with ad hoc hyperparameter tuning.
5.3. MinAtar

MinAtar (Young & Tian, 2019) is an RL testbed with environments mimicking the dynamics of games from ALE, but with a simplified state representation. MinAtar also implements sticky actions and difficulty ramping (Machado et al., 2018). The results of the experiment are shown in Figure 4. We use the same convolutional neural network and hyperparameters of Young & Tian (2019), but we replace RMSProp with the Adam optimizer with learning rate 1e-4.² For WDQN we choose the dropout regularization coefficient with random search and use the same value in the three games.

WDQN achieves higher average reward in Breakout, performs on par with the other two algorithms in Seaquest and is more sample efficient in Freeway. For what concerns the estimation of the expected return, WDQN shows a much lower bias in the first epochs of the learning procedure and converges to more accurate estimates in Breakout and Freeway, while performing similarly to the other algorithms in Seaquest. Since the number of actions and the regularization coefficient are the same among the three environments, it is clear that the entropy of the WE weights heavily depends on the dynamics and input space of the problem.

6. Related works

The Weighted Q-Learning algorithm (D'Eramo et al., 2016) joins previous works in the attempt to address the bias of Q-Learning. Double Q-Learning (Van Hasselt, 2010) firstly showed how the use of DE, to underestimate the action-values, can be helpful to stabilize learning in highly stochastic environments. Bias-corrected Q-Learning (Lee et al., 2013) improves the learning stability of Q-Learning by subtracting a quantity from the target value that depends on the standard deviation of the reward. The Double Weighted Q-Learning algorithm (Zhang et al., 2017) (which, despite the name, is not related to WQL) uses instead a combination of ME and DE to balance between the overestimation and underestimation of the two estimators. DE and WE have also been introduced in Batch RL as alternatives to the ME used in Fitted Q-Iteration (Ernst et al., 2005), respectively in the Double Fitted Q-Iteration and Weighted Fitted Q-Iteration algorithms (D'Eramo et al., 2017). Among the three, Weighted Fitted Q-Iteration is the only algorithm able to handle continuous action spaces.

Overestimation of the action-values can be even more problematic in the DQN algorithm (Mnih et al., 2015), due to the high variance typical of DRL approaches. The Double DQN algorithm (Van Hasselt et al., 2016) introduces the use of DE in DQN, and shows a better estimate of the action-values and superior performance w.r.t. vanilla DQN. Other variants of DQN addressing the overestimation problem are based on exploiting multiple estimates of the action-values. For instance, Averaged DQN (Anschel et al., 2017) controls the variance of the estimation by averaging the target of the update over an arbitrary number of previous checkpoints of the target network. Then, the recent Maxmin DQN (Lan et al., 2020) reduces the bias by keeping several estimates of the action-values in parallel, and computing the target values using the maximum of the minimums of each action-value estimate. Overestimation is also detrimental in actor-critic DRL algorithms, e.g. Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015). Notably, the Twin Delayed DDPG (TD3) algorithm (Fujimoto et al., 2018) is a variant of DDPG exploiting several tricks, e.g. a clipped version of the Double Q-Learning update, that significantly improve the approximation of the value function and the overall performance.

7. Conclusion and future works

We present WDQN, a new value-based Deep Reinforcement Learning algorithm that extends the Weighted Q-Learning algorithm to work in environments with a highly dimensional state representation. WDQN is a principled and robust method to exploit uncertainty in DRL to accurately estimate the maximum action-value in a given state. We empirically support our claims by showing that WDQN consistently reduces the bias of DQN across different tasks. Our results corroborate the findings of previous works (Gal & Ghahramani, 2016; Gal et al., 2017), confirming that dropout can be used successfully for approximate Bayesian inference in DRL. Future works may explore the combination of WDQN with the other, orthogonal, DQN extensions and may attempt to adapt the WDQN approach to other techniques modelling uncertainty in deep neural networks.

² We found Adam to provide significantly more stable learning across all the agents.

References

Anschel, O., Baram, N., and Shimkin, N. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 176–185. JMLR.org, 2017.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, Jun 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 449–458. JMLR.org, 2017.

Bellman, R. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.

Blumenthal, S. and Cohen, A. Estimation of the larger of two normal means. Journal of the American Statistical Association, 63(323):861–876, 1968.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Collier, M. and Llorens, H. U. Deep contextual multi-armed bandits. CoRR, abs/1807.09809, 2018.

Dabney, W., Rowland, M., Bellemare, M., and Munos, R. Distributional reinforcement learning with quantile regression. In AAAI Conference on Artificial Intelligence, 2018.

Damianou, A. and Lawrence, N. Deep Gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research, pp. 207–215, Scottsdale, Arizona, USA, 29 Apr–01 May 2013. PMLR.

D'Eramo, C., Nuara, A., Pirotta, M., and Restelli, M. Estimating the maximum expected value in continuous reinforcement learning problems. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

D'Eramo, C., Tateo, D., Bonarini, A., Restelli, M., and Peters, J. MushroomRL: Simplifying reinforcement learning research. arXiv preprint arXiv:2001.01102, 2020.

D'Eramo, C., Restelli, M., and Nuara, A. Estimating maximum expected value through Gaussian approximation. In International Conference on Machine Learning, pp. 1032–1040, 2016.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Farebrother, J., Machado, M. C., and Bowling, M. Generalization and regularization in DQN, 2018.

Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596, 2018.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1050–1059, New York, New York, USA, 20–22 Jun 2016. PMLR.

Gal, Y., McAllister, R. T., and Rasmussen, C. E. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, ICML 2016, 2016.

Gal, Y., Hron, J., and Kendall, A. Concrete dropout. In Advances in Neural Information Processing Systems 30, pp. 3581–3590. Curran Associates, Inc., 2017.

Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.

Kahn, G., Villaflor, A., Pong, V., Abbeel, P., and Levine, S. Uncertainty-aware reinforcement learning for collision avoidance. CoRR, abs/1702.01182, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.

Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems 28, pp. 2575–2583. Curran Associates, Inc., 2015.

Lan, Q., Pan, Y., Fyshe, A., and White, M. Maxmin Q-learning: Controlling the estimation bias of Q-learning. In International Conference on Learning Representations, 2020.

Lee, D., Defourny, B., and Powell, W. B. Bias-corrected Q-learning to control max-operator bias in Q-learning. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 93–99. IEEE, 2013.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2015.

Lin, L.-J. Programming robots using reinforcement learning and teaching. In Proceedings of the Ninth National Conference on Artificial Intelligence - Volume 2, AAAI'91, pp. 781–786. AAAI Press, 1991.

Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. J., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.

Malik, A., Kuleshov, V., Song, J., Nemer, D., Seymour, H., and Ermon, S. Calibrated model-based deep reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4314–4323, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., and Ostrovski, G. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.

Osband, I., Aslanides, J., and Cassirer, A. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems 31, pp. 8617–8629. Curran Associates, Inc., 2018.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Riquelme, C., Tucker, G., and Snoek, J. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In International Conference on Learning Representations, 2018.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico, 2016.

Smith, J. E. and Winkler, R. L. The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3):311–322, 2006.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Stadie, B. C., Levine, S., and Abbeel, P. Incentivizing exploration in reinforcement learning with deep predictive models. CoRR, abs/1507.00814, 2015.

Stone, M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133, 1974.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, pp. 255–263. Lawrence Erlbaum, 1993.

Van Hasselt, H. Double Q-learning. In Advances in Neural Information Processing Systems, 2010.

Van Hasselt, H. Estimating the maximum expected value: an analysis of (nested) cross-validation and the maximum sample average. arXiv preprint arXiv:1302.7175, 2013.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In AAAI, volume 16, pp. 2094–2100, 2016.

Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1995–2003, New York, New York, USA, 20–22 Jun 2016. PMLR.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Young, K. and Tian, T. MinAtar: An Atari-inspired testbed for thorough and reproducible reinforcement learning experiments. arXiv preprint arXiv:1903.03176, 2019.

Zhang, Z., Pan, Z., and Kochenderfer, M. J. Weighted double Q-learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 3455–3461, 2017. doi: 10.24963/ijcai.2017/483.

Appendix

The appendix is organized as follows:

• in Section A we further discuss the problem of estimating the maximum expected value (MEV) of a set of random variables;

• in Section B we provide, together with some additional empirical results, details on the hyperparameters and experimental setup used to evaluate the agents.

A. Estimating the Maximum Expected Value

Let X = {X_1, ..., X_M} be a set of M ≥ 2 independent random variables with unknown means µ_i = E[X_i], and S = ∪_{i=1}^M S_i be a set of noisy samples, where S_i corresponds to the subset of samples drawn from the unknown distribution of the random variable X_i. The problem of estimating the MEV consists in using the sample means µ̂_i(S) to approximate the largest true mean max_i µ_i = µ_*. Unfortunately, for many distributions (e.g. Gaussian), no unbiased estimator of the MEV exists (Blumenthal & Cohen, 1968; Van Hasselt, 2013).

Maximum Estimator The Maximum Estimator (ME) can be defined as

µ̂_*^ME(S) = max_i µ̂_i ≈ µ_*.   (19)

The ME is the most straightforward way of estimating the MEV and is positively biased (Smith & Winkler, 2006).

Double Estimator   The Double Estimator (DE) is a negatively biased estimator for the MEV (Stone, 1974; Van Hasselt, 2013). DE splits the set of samples S into two independent sets S^A and S^B, with sample means µ̂_i(S^A) and µ̂_i(S^B). Using a cross-validation approach, the DE is computed by selecting the random variable with the highest sample mean in one set, and computing its sample mean from the samples in the other one. The resulting values are µ̂_{b*}(S^A) = µ̂_{argmax_i µ̂_i(S^B)}(S^A) and µ̂_{a*}(S^B) = µ̂_{argmax_i µ̂_i(S^A)}(S^B). Finally, the DE can be obtained as

µ̂_*^DE(S) = ( µ̂_{b*}(S^A) + µ̂_{a*}(S^B) ) / 2.   (20)

Weighted Estimator   D'Eramo et al. (2016) introduced the Weighted Estimator (WE), defined as

µ̂_*^WE(S) = Σ_{i=1}^M w_i^S µ̂_i(S),   (21)

where w_i^S is the probability of µ̂_i(S) being the maximum sample mean. Since the distribution of the sample mean of a random variable with an unknown distribution is itself unknown, D'Eramo et al. (2016) approximate the sample means as normally distributed random variables by the central limit theorem. WE can be both positively and negatively biased, and its bias is bounded by the one of ME and DE:

−(1/2) ( √( Σ_{i=1}^M σ_i² / |S_i^A| ) + √( Σ_{i=1}^M σ_i² / |S_i^B| ) ) ≤ Bias(µ̂_*^DE(S)) ≤ Bias(µ̂_*^WE(S)) ≤ Bias(µ̂_*^ME(S)) ≤ √( ((M − 1)/M) Σ_{i=1}^M σ_i² / |S_i| ),   (22)

where σ_i² is the variance of the i-th random variable. Furthermore, WE often has a variance that is empirically smaller than the variance of the other two estimators (D'Eramo et al., 2016). A variant of WE that extends to infinitely many random variables has been proposed in D'Eramo et al. (2017).
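As a quick sanity check of the ordering in Eq. 22, the sketch below computes the three estimators from a matrix of samples; the WE weights are obtained by Monte Carlo sampling from the Gaussian approximation of the sample means. The function is illustrative (not part of the experimental code) and the synthetic setup, such as sample sizes and noise levels, is left to the reader.

```python
import numpy as np

def me_de_we(samples, n_we_draws=5000, rng=None):
    """Compute the ME, DE and WE of the maximum expected value from an (n_samples, M) array."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = samples.shape
    means = samples.mean(axis=0)
    me = means.max()                                            # Eq. 19
    # Double Estimator: select the maximizing index on one half, evaluate it on the other (Eq. 20).
    a, b = samples[: n // 2], samples[n // 2 :]
    de = 0.5 * (a.mean(axis=0)[b.mean(axis=0).argmax()]
                + b.mean(axis=0)[a.mean(axis=0).argmax()])
    # Weighted Estimator: Gaussian approximation of the sample means (Eq. 21).
    std_err = samples.std(axis=0, ddof=1) / np.sqrt(n)
    draws = rng.normal(means, std_err, size=(n_we_draws, m))
    weights = np.bincount(draws.argmax(axis=1), minlength=m) / n_we_draws
    we = float(weights @ means)
    return me, de, we
```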

B. Experiments details

In this section we provide details on the experimental setup used for the empirical evaluation of the proposed method. For the implementation of the algorithms and the simulation environments we rely on the following open-source libraries:

• MushroomRL (D'Eramo et al., 2020);

• Gym (Brockman et al., 2016);

• ALE (Bellemare et al., 2013);

• MinAtar (Young & Tian, 2019);

• PyTorch (Paszke et al., 2019).

B.1. Asterix

For the experiment on the Asterix Atari game we use the settings of Mnih et al. (2015), but we add sticky actions (Machado et al., 2018), i.e., a probability p_repeat = 25% of repeating the action executed at the previous frame instead of the one selected by the agent. For the agents we use the same neural network of Mnih et al. (2015) with, as already mentioned, the Adam (Kingma & Ba, 2015) optimizer. The additional hyperparameters introduced by WDQN are tuned with a - small - random search. Concrete Dropout is used only on the neurons of the last hidden layer. Agents are evaluated during training every 1M frames, with the 30 no-op starting condition. During evaluation the episode length is capped at 30 minutes. Table 1 reports a list of additional relevant hyperparameters.

Table 1. DQN hyperparameters on Asterix. Hyperparameters marked with a * are used only for WDQN.

Hyperparameter                             Value
Optimizer                                  Adam
Learning rate                              5e-5
Adam epsilon (ε_Adam)                      0.01/32
Batch size                                 32
Loss function                              Huber
Training frequency                         4
Target network update frequency            10000
Min. memory size                           50000
Max. memory size                           1M
Discount factor (γ)                        0.99
Initial exploration rate (ε_start)         1.0
Final exploration rate (ε_end)             0.1
Exploration steps                          1M
Evaluation exploration rate (ε_test)       0.001
MC dropout samples*                        100
Weight decay coefficient (λ)*              1e-6
Dropout regularization coefficient (ζ)*    5e-4
Initial dropout rate (p)*                  0.5

B.2. Lunar Lander

For Lunar Lander (Brockman et al., 2016) we limit the length of an episode (during both evaluation and training) at 1000 steps. To increase the complexity of the problem, we make the environment stochastic using sticky actions (see the previous subsection) with p_repeat = 10%. We use a small neural network with only two hidden layers. For WDQN, we use Concrete Dropout on each hidden layer. Table 2 reports the hyperparameters used to train the agents.

Table 2. DQN hyperparameters on Lunar Lander. Hyperparameters marked with a * are used only for WDQN.

Hyperparameter                             Value
Units per layer                            [100, 100]
Activation                                 ReLU
Optimizer                                  Adam
Learning rate                              3e-4
Batch size                                 32
Loss function                              MSE
Training frequency                         1
Target network update frequency            300
Min. memory size                           250
Max. memory size                           10000
Discount factor (γ)                        0.99
Initial exploration rate (ε_start)         1.0
Final exploration rate (ε_end)             0.01
Exploration steps                          1000
Evaluation exploration rate (ε_test)       0.0
MC dropout samples*                        50
Weight decay coefficient (λ)*              1e-6
Dropout regularization coefficient (ζ)*    2.5e-3
Initial dropout rate (p)*                  0.2

B.3. MinAtar

MinAtar (Young & Tian, 2019) offers a collection of environments resembling games from the Arcade Learning Environment (Bellemare et al., 2013). The state representation of MinAtar environments is a matrix with multiple channels, where each channel gives specific information about some aspects of the environment (e.g., position and speed of a moving object). MinAtar implements the ALE modifications suggested by Machado et al. (2018), i.e., sticky actions and difficulty ramping. We use the same hyperparameters and convolutional neural network of Young & Tian (2019), but we use Adam for training. The WDQN dropout regularization coefficient is tuned with random search: we select the value providing qualitatively better learning curves and keep it fixed across the three games. Table 3 shows the relevant hyperparameters.

Table 3. DQN hyperparameters on MinAtar environments. Hyperparameters marked with a * are used only for WDQN.

Hyperparameter                             Value
Optimizer                                  Adam
Learning rate                              1e-4
Batch size                                 32
Loss function                              Huber
Training frequency                         1
Target network update frequency            1000
Min. memory size                           5000
Max. memory size                           100000
Discount factor (γ)                        0.99
Initial exploration rate (ε_start)         1.0
Final exploration rate (ε_end)             0.1
Exploration steps                          100000
Evaluation exploration rate (ε_test)       0.0
MC dropout samples*                        100
Weight decay coefficient (λ)*              1e-6
Dropout regularization coefficient (ζ)*    1e-4
Initial dropout rate (p)*                  0.1

For WDQN, we run an additional experiment to assess the impact of the dropout regularization coefficient. We set the initial dropout rate at p = 0.5 (which corresponds to the maximum entropy) and test the agents using regularization coefficients of different magnitude. The results, reported in Figure 5, show how higher levels of regularization generally correspond to a higher entropy of the weights used to compute the WE.

[Figure 5: one row of panels per game (MinAtar-Breakout, MinAtar-Seaquest, MinAtar-Freeway); columns show average reward, max_a Q(s_0, a) and WE entropy as a function of the training epoch for dropout regularization coefficients 1e-3, 1e-4 and 1e-5.]

Figure 5. Learning curves for 3 different regularization values on three Minatar games. In the leftmost column, we show the evaluation scores after each of the 200 training epochs, where 1 epoch corresponds to 25k steps. In the middle column we report for each game and agent the estimate of the expected return w.r.t. the initial state of the environment; the dashed lines indicate the real discounted return obtained by the agents. In the rightmost column we show the moving average of the entropy of the WDQN weights. The results shown here are averaged over 20 independent runs and the shaded areas represent 95% confidence intervals. The curves are smoothed using a moving average of 5 epochs.