Deep Reinforcement Learning with Weighted Q-Learning

Andrea Cini 1 Carlo D’Eramo 2 Jan Peters 2 3 Cesare Alippi 4 1

1 Faculty of Informatics, Università della Svizzera italiana, Lugano, Switzerland. 2 IAS, TU Darmstadt, Darmstadt, Germany. 3 Max Planck Institute for Intelligent Systems, Tübingen, Germany. 4 DEIB, Politecnico di Milano, Milan, Italy. Correspondence to: Andrea Cini.

A preprint. arXiv:2003.09280v2 [cs.LG] 30 Mar 2020

Abstract

Overestimation of the maximum action-value is a well-known problem that hinders Q-Learning performance, leading to suboptimal policies and unstable learning. Among several Q-Learning variants proposed to address this issue, Weighted Q-Learning (WQL) effectively reduces the bias and shows remarkable results in stochastic environments. WQL uses a weighted sum of the estimated action-values, where the weights correspond to the probability of each action-value being the maximum; however, the computation of these probabilities is only practical in the tabular setting. In this work, we provide the methodological advances to benefit from the WQL properties in Deep Reinforcement Learning (DRL), by using neural networks with Dropout Variational Inference as an effective approximation of deep Gaussian processes. In particular, we adopt the Concrete Dropout variant to obtain calibrated estimates of epistemic uncertainty in DRL. We show that model uncertainty in DRL can be useful not only for action selection, but also for action evaluation. We analyze how our novel Deep Weighted Q-Learning algorithm reduces the bias w.r.t. relevant baselines and provide empirical evidence of its advantages on several representative benchmarks.

1. Introduction

Reinforcement Learning (RL) aims at learning how to take optimal decisions in unknown environments by solving credit assignment problems that extend in time. In order to be sample efficient learners, agents are required to constantly update their own beliefs about the world, about which actions are good and which are not. Temporal difference (TD) learning (Sutton & Barto, 1998) and off-policy learning are the constitutional elements of this kind of behavior. TD allows agents to bootstrap their current knowledge to learn from a new observation as soon as it is available. Off-policy learning gives the means for exploration and enables experience replay (Lin, 1991). Q-Learning (Watkins, 1989) implements both paradigms.

Algorithms based on Q-Learning are, in fact, driving Deep Reinforcement Learning (DRL) research towards solving complex problems and achieving super-human performance on many of them (Mnih et al., 2015; Hessel et al., 2018). Nonetheless, Q-Learning is known to be positively biased (Van Hasselt, 2010), since it learns by using the maximum over the - noisy - bootstrapped TD estimates. This overoptimism can be particularly harmful in stochastic environments and when using function approximation (Thrun & Schwartz, 1993), notably also in the case where the approximators are deep neural networks (Van Hasselt et al., 2016). Systematic overestimation of the action-values, coupled with the inherently high variance of DRL methods, can lead to incrementally accumulating errors, causing the learning to diverge.

Among the possible solutions, the Double Q-Learning algorithm (Van Hasselt, 2010) and its DRL variant Double DQN (Van Hasselt et al., 2016) tackle the overestimation problem by disentangling the choice of the target action and its evaluation. The resulting estimator, while achieving superior performance in many problems, is negatively biased (Van Hasselt, 2013). Underestimation, in fact, can lead in some environments to lower performance and slower convergence rates compared to standard Q-Learning (D'Eramo et al., 2016; Lan et al., 2020). Overoptimism, in general, is not uniform over the state space and may induce the agent to overestimate the value of arbitrarily bad actions, throwing it completely off. The same holds true, symmetrically, for overly pessimistic estimates that might undervalue a good course of action.

Ideally, we would like DRL agents to be aware of their own uncertainty about the optimality of each action, and be able to exploit it to make more informed estimations of the expected return. This is exactly what we achieve in this work.

We exploit recent developments in Bayesian Deep Learning to model the uncertainty of DRL agents using neural networks trained with dropout variational inference (Kingma et al., 2015; Gal & Ghahramani, 2016). We combine, in a novel way, the dropout uncertainty estimates with the Weighted Q-Learning algorithm (D'Eramo et al., 2016), extending it to the DRL settings. The proposed Deep Weighted Q-Learning algorithm, or Weighted DQN (WDQN), leverages an approximated posterior distribution on Q-networks to reduce the bias of deep Q-learning. The WDQN bias is neither always positive nor always negative, but depends on the state and the problem at hand. WDQN only requires minor modifications to the baseline algorithm and its computational overhead is negligible on specialized hardware.

The paper is organized as follows. In Section 2 we define the problem settings, introducing key aspects of value-based RL. In Section 3 we analyze in depth the problem of estimation biases in Q-Learning and sequential decision making problems. Then, in Section 4, we first discuss how neural networks trained with dropout can be used for Bayesian inference in RL and, from that, we derive the WDQN algorithm. In Section 5 we empirically evaluate the proposed method against relevant baselines on several benchmarks. Finally, we provide an overview of related works in Section 6, and we draw our conclusions and discuss future works in Section 7.

2. Preliminaries

A Markov Decision Process (MDP) is a tuple ⟨S, A, P, R, γ⟩, where S is a state space, A is an action space, P : S × A → S is a Markovian transition function, R : S × A → R is a reward function, and γ ∈ [0, 1] is a discount factor. A sequential decision maker ought to estimate, for each state s, the optimal value Q*(s, a) of each action a, i.e., the expected return obtained by taking action a in s and following the optimal policy π* afterwards. We can write Q* using the Bellman optimality equation (Bellman, 1954)

Q*(s, a) = E[ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a ].   (1)

(Deep) Q-Learning   A classical approach to solve finite MDPs is the Q-Learning algorithm (Watkins, 1989), an off-policy value-based RL algorithm based on TD. A Q-Learning agent learns the optimal value function using the following update rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α ( y_t^QL − Q(s_t, a_t) ),   (2)

where α is the learning rate and, following the notation introduced by Van Hasselt et al. (2016),

y_t^QL = r_t + γ max_a Q(s_{t+1}, a).   (3)

The popular Deep Q-Network algorithm (DQN) (Mnih et al., 2015) is a variant of Q-Learning designed to stabilize off-policy learning with deep neural networks in highly dimensional state spaces. The two most relevant architectural changes to standard Q-Learning introduced by DQN are the adoption of a replay memory, to learn offline from past experience, and the use of a target network, to reduce correlation between the current model estimate and the bootstrapped target value.

In practice, DQN learns the Q-values online, using a neural network with parameters θ, sampling the replay memory, and with a target network whose parameters θ⁻ are updated to match those of the online model every C steps. The model is trained to minimize the loss

L(θ) = E_{⟨s_i, a_i, r_i, s'_i⟩ ∼ m} [ ( y_i^DQN − Q(s_i, a_i; θ) )² ],   (4)

where m is a uniform distribution over the transitions stored in the replay buffer and y^DQN is defined as

y_i^DQN = r_i + γ max_a Q(s'_i, a; θ⁻).   (5)

Double DQN   Among the many studied improvements and extensions of the baseline DQN algorithm (Wang et al., 2016; Schaul et al., 2016; Bellemare et al., 2017; Hessel et al., 2018), Double DQN (DDQN) (Van Hasselt et al., 2016) reduces the overestimation bias of DQN with a simple modification of the update rule. In particular, DDQN uses the target network to decouple action selection and evaluation, and estimates the target value as

y_i^DDQN = r_i + γ Q(s'_i, argmax_a Q(s'_i, a; θ); θ⁻).   (6)

DDQN improves on DQN, converging to a more accurate approximation of the value function while maintaining the same model complexity and adding a minimal computational overhead.
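To make the difference between the two update rules concrete, the following sketch computes the targets of Eq. 5 and Eq. 6 for a batch of transitions. It is a minimal illustration in PyTorch, not the implementation used in the paper; the names (q_net, target_net, rewards, next_states, dones) and the terminal-state masking via dones are our own assumptions.

```python
import torch

def dqn_and_ddqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Sketch of the DQN (Eq. 5) and DDQN (Eq. 6) targets for a batch of transitions.

    q_net and target_net are assumed to map a batch of states to a
    (batch, n_actions) tensor of action-values.
    """
    with torch.no_grad():
        q_next_target = target_net(next_states)  # Q(s', .; θ⁻)
        # DQN: both selection and evaluation of the maximizing action use the target network.
        dqn_target = rewards + gamma * (1 - dones) * q_next_target.max(dim=1).values
        # DDQN: select the action with the online network, evaluate it with the target network.
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        ddqn_target = rewards + gamma * (1 - dones) * q_next_target.gather(1, best_actions).squeeze(1)
    return dqn_target, ddqn_target
```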
3. Estimation biases in Q-Learning

Choosing a target value for the Q-Learning update rule can be seen as an instance of the Maximum Expected Value (MEV) estimation problem for a set of random variables, here the action-values Q(s_{t+1}, · ). Q-Learning uses the Maximum Estimator (ME)¹ to estimate the maximum expected return and exploits it for policy improvement. It is well-known that ME is a positively biased estimator of MEV (Smith & Winkler, 2006). The divergent behaviour that may occur in Q-Learning, then, may be explained by the amplification over time of the error on the action-value estimates caused by the overestimation bias, which introduces a positive error at each update (Van Hasselt, 2010). Double Q-Learning (Van Hasselt, 2010), on the other hand, learns two value functions in parallel and uses an update scheme based on the Double Estimator (DE). It is shown that DE is a negatively biased estimator of MEV, which helps to avoid catastrophic overestimates of the Q-values. The DDQN algorithm, introduced in Section 2, is the extension of Double Q-Learning to the DRL settings. In DDQN, the target network is used as a proxy of the second value function of Double Q-Learning to preserve sample and model complexity.

¹ Details about the estimators considered in this section are provided in the appendix.

In practice, as shown by D'Eramo et al. (2016) and Lan et al. (2020), the overestimation bias of Q-Learning is not always harmful, and may also be convenient when the action-values are significantly different among each other (e.g., deterministic environments with a short time horizon, or small action spaces). Conversely, the underestimation of Double Q-Learning is effective when all the action-values are very similar (e.g., highly stochastic environments with a long or infinite time horizon, or large action spaces). In fact, depending on the problem, both algorithms have properties that can be detrimental for learning. Unfortunately, prior knowledge about the environment is not always available and, whenever it is, the problem may be too complex to decide which estimator should be preferred. Given the above, it is desirable to use a methodology which can robustly deal with heterogeneous problems.

3.1. Weighted Q-Learning

D'Eramo et al. (2016) propose the Weighted Q-Learning (WQL) algorithm, a variant of Q-Learning based on the therein introduced Weighted Estimator (WE). The WE estimates MEV as the weighted sum of the random variables' sample means, weighted according to their probability of corresponding to the maximum. Intuitively, the amount of uncertainty, i.e., the entropy of the WE weights, will depend on the nature of the problem, the number of samples and the variance of the mean estimator (critical when using function approximation). The WE bias is bounded by the biases of ME and DE (D'Eramo et al., 2016).

The target value of WQL can be computed as

y_t^WQL = r_t + γ Σ_{a ∈ A} w_a^{s_{t+1}} Q(s_{t+1}, a),   (7)

where w_a^{s_{t+1}} are the weights of the WE and correspond to the probability of each action-value being the maximum:

w_a^s = P( a = argmax_{a'} Q(s, a') ).   (8)

The update rule of WQL can be obtained by replacing y_t^QL with y_t^WQL in Equation 2. The weights of WQL are estimated in the tabular setting assuming the sample means to be normally distributed.

WE has been studied also in the Batch RL settings, with continuous state and action spaces, by using Gaussian Process regression (D'Eramo et al., 2017).
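As a concrete illustration of Eqs. 7-8, the sketch below computes a tabular WQL target by approximating each sample mean with a Gaussian, as in D'Eramo et al. (2016), and estimating the probability of each action being the maximum by Monte Carlo sampling. This is a simplified stand-in for the exact computation of the weights (which can also be done by numerical integration); the function and argument names are ours.

```python
import numpy as np

def wql_target(reward, gamma, q_means, q_std_errs, n_draws=10_000, rng=None):
    """Tabular WQL target (Eq. 7): the weights are P(a = argmax), estimated by
    sampling from a Gaussian approximation of each action-value sample mean."""
    rng = np.random.default_rng() if rng is None else rng
    n_actions = len(q_means)
    # One Gaussian sample per action and per Monte Carlo draw: shape (n_draws, n_actions).
    draws = rng.normal(q_means, q_std_errs, size=(n_draws, n_actions))
    # w_a ~ fraction of draws in which action a attains the maximum (Eq. 8).
    weights = np.bincount(draws.argmax(axis=1), minlength=n_actions) / n_draws
    return reward + gamma * np.dot(weights, q_means)
```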

4. Deep Weighted Q-Learning

A natural way to extend the WQL algorithm to the DRL settings is to consider the uncertainty over the model parameters using a Bayesian approach. Among the possible solutions to estimate uncertainty, bootstrapping has been the most successful in RL problems, with Bootstrapped DQN (BDQN) (Osband et al., 2016; 2018) achieving impressive results in environments where exploration is critical. On the other hand, using bootstrapping necessitates significant modifications to the baseline DQN architecture and requires training a model for each sample of the approximate posterior distribution. This limits the number of available samples considerably and is a major drawback in using BDQN to approximate the WE weights. Using dropout, conversely, does not impact model complexity and allows computing the weights of the WE by using infinitely many samples.

In the following we first introduce how neural networks trained with dropout can be used for approximate Bayesian inference and discuss how this approach has been used with success in RL problems. Then, we propose a novel approach to exploit the uncertainty over the model parameters for action evaluation, adapting the WE to the DRL settings. Finally, we analyze a possible shortcoming of the proposed method and identify a solution from the literature to address it.

4.1. Bayesian inference with dropout

Dropout (Srivastava et al., 2014) is a regularization technique used to train large neural networks by randomly dropping units during learning. In recent years, dropout has been analyzed from a Bayesian perspective (Kingma et al., 2015; Gal & Ghahramani, 2016), and interpreted as a variational approximation of a posterior distribution over the neural network parameters. In particular, Gal & Ghahramani (2016) show how a neural network trained with dropout and weight decay can be seen as an approximation of a deep Gaussian process (Damianou & Lawrence, 2013). The result is a theoretically grounded interpretation of dropout and a class of Bayesian neural networks (BNNs) that are cheap to train and can be queried to obtain uncertainty estimates. In fact, a single stochastic forward pass through the BNN can be interpreted as taking a sample from the model's predictive distribution, while the predictive mean can be computed as the average of multiple samples. This inference technique is known as Monte Carlo (MC) dropout and can be efficiently parallelized on modern GPUs.

A straightforward application of Bayesian models to RL is Thompson Sampling (TS) (Thompson, 1933). TS is an exploration technique that aims at improving the sample complexity of RL algorithms by selecting actions according to their probability of being optimal given the current agent's beliefs. A practical way to use TS in Deep Reinforcement Learning is to take a single sample from a Q-network trained with dropout and select the action that corresponds to the maximum sampled action-value (Gal & Ghahramani, 2016). TS based on dropout achieves superior data efficiency compared against naïve exploration strategies, such as ε-greedy, both in sequential decision making problems (Gal & Ghahramani, 2016; Stadie et al., 2015) and contextual bandits (Riquelme et al., 2018; Collier & Llorens, 2018). Furthermore, dropout has been successfully used in model-based RL, to estimate the agent's uncertainty over the environment dynamics (Gal et al., 2016; Kahn et al., 2017; Malik et al., 2019).

Here we focus on the problem of action evaluation. We show how to use approximate Bayesian inference to evaluate the WE estimator by introducing a novel approach to exploit uncertainty estimates in DRL. Our method empirically reduces Q-Learning bias, is grounded in theory and simple to implement.

Algorithm 1 Weighted DQN
  Input: Q-network parameters θ, dropout rates p_1, ..., p_L, a policy π, replay memory D
  θ⁻ ← θ
  Initialize memory D.
  for step t = 0, ... do
    Select an action a_t according to some policy π given the distribution over action-value functions Q(s_t, · ; θ, ω)
    Execute a_t and add ⟨s_t, a_t, r_t, s_{t+1}⟩ to D
    Sample a mini-batch of transitions {⟨s_i, a_i, r_i, s'_i⟩, i = 1, ..., M} from D
    for i = 1, ..., M do   ▷ can be done in parallel
      Take K samples from Q(s'_i, · ; θ⁻, ω) by performing K stochastic forward passes
      Use the samples to compute the WDQN weights (Eq. 14) and targets (Eq. 15)
    end for
    Perform an SGD step on the WDQN loss (Eq. 16)
    Eventually update θ⁻
  end for

4.2. Weighted DQN

Let Q( · , · ; θ, ω) be a BNN with weights θ trained with a Gaussian prior and Dropout Variational Inference to learn the optimal action-value function of a certain MDP. We indicate with ω the set of random variables that represents the dropout masks, with ω_i the i-th realization of the random variables and with Ω their joint distribution:

ω = {ω_lk : l = 1, ..., L, k = 1, ..., K_l},   (9)

ω_lk ∼ Bernoulli(p_l),   ω_i ∼ Ω(p_1, ..., p_L),   (10)

where L is the number of weight layers of the network and K_l is the number of units in layer l.

Consider a sample q(s, a) of the MDP return, obtained taking action a in s and following the optimal policy afterwards. Following the GP interpretation of dropout of Gal & Ghahramani (2016), we can approximate the likelihood of this observation as a Gaussian such that

q(s, a) ∼ N( Q(s, a; θ, ω); τ⁻¹ ),   (11)

where τ is the model precision.

We can approximate the predictive mean of the process, and the expectation over the posterior distribution of the Q-value estimates, as the average of T stochastic forward passes through the network:

E[Q(s, a; θ, ω)] ≈ Q̂_T(s, a; θ) = (1/T) Σ_{t=1}^T Q(s, a; θ, ω_t).   (12)

Q̂_T(s, a; θ) is the BNN prediction of the action-values associated to state s and action a.
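A minimal sketch of the MC dropout estimate of Eq. 12 in PyTorch is given below. Keeping the dropout layers in training mode at prediction time is what turns each forward pass into a sample Q(s, ·; θ, ω_t); we assume here that dropout is the only layer whose behaviour differs between train and eval mode, and the function name is illustrative.

```python
import torch

@torch.no_grad()
def mc_dropout_q(q_net, states, n_samples=100):
    """Approximate E[Q(s, .; θ, ω)] with T stochastic forward passes (Eq. 12)."""
    q_net.train()  # keep dropout active: each pass draws a fresh dropout mask ω_t
    samples = torch.stack([q_net(states) for _ in range(n_samples)])  # (T, batch, n_actions)
    q_net.eval()
    return samples.mean(dim=0), samples  # predictive mean Q̂_T(s, .; θ) and the raw samples
```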
A similar, more computationally efficient, approximation can be obtained through weight averaging, which consists in scaling the output of the neurons in layer l by 1 − p_l during training and leaving them unchanged at inference time. We indicate this estimate as Q̃(s, a; θ) and we use it for action selection during training.

The model epistemic uncertainty, i.e., the model uncertainty over its parameters, can be measured similarly as the sample variance across T realizations of the dropout random variables:

Var[Q(s, a; θ, ω)] ≈ (1/T) Σ_{t=1}^T Q(s, a; θ, ω_t)² − ( (1/T) Σ_{t=1}^T Q(s, a; θ, ω_t) )².   (13)

As shown in Gal & Ghahramani (2016), the predictive variance can be approximated with the variance of the estimator in Eq. 13 plus the model inverse precision τ⁻¹.

We can estimate the probability required to calculate the WE in a similar way. Given an action a, the probability that a corresponds to the maximum expected action-value can be approximated as the number of times in which, given T samples, the sampled action-value of a is the maximum, divided by the number of samples:

w_a^s(θ) = P( a = argmax_{a'} Q(s, a'; θ, ω) ) ≈ (1/T) Σ_{t=1}^T [[ a = argmax_{a'} Q(s, a'; θ, ω_t) ]],   (14)

where [[ · ]] are the Iverson brackets ([[P]] is 1 if P is true, 0 otherwise). The weights can be efficiently inferred in parallel with no impact on computational time.
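Given the stack of T sampled action-values from the previous sketch, the weights of Eq. 14 reduce to counting how often each action attains the maximum. A possible vectorized version (illustrative, not the authors' code):

```python
import torch

def wdqn_weights(q_samples):
    """Eq. 14: q_samples has shape (T, batch, n_actions); returns (batch, n_actions) weights."""
    n_actions = q_samples.shape[-1]
    argmax_idx = q_samples.argmax(dim=-1)                                  # (T, batch)
    iverson = torch.nn.functional.one_hot(argmax_idx, n_actions).float()  # 1 where a is the max
    return iverson.mean(dim=0)                                             # fraction of samples in which a is the max
```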

We can define the WE given the Bayesian target Q-network estimates using the obtained weights as:

y_i^WDQN = r_i + γ Σ_{a ∈ A} w_a^{s'_i}(θ⁻) Q̂_T(s'_i, a; θ⁻).   (15)

Finally, we report for completeness the loss minimized by WDQN, where the parameter updates are backpropagated using the dropout masks:

L(θ) = E_{⟨s_i, a_i, r_i, s'_i⟩ ∼ m, ω_i ∼ Ω} [ ( y_i^WDQN − Q(s_i, a_i; θ, ω_i) )² ] + λ Σ_{l=1}^L ||θ_l||²_2,   (16)

where θ_l are the weights of layer l and λ the weight decay coefficient. Using weight decay is necessary for the variational approximation to be valid. The complete WDQN algorithm is reported in Algorithm 1.

4.3. Concrete Dropout

The dropout probabilities are variational parameters and influence the quality of the approximation. Ideally, they should be tuned to maximize the log-likelihood of the observations using a validation method. This is clearly not possible in RL, where the available samples, and the underlying distribution generating them, change as the policy improves. In fact, using dropout with a fixed probability might lead to poor uncertainty estimates (Osband et al., 2016; 2018; Gal et al., 2017). Concrete Dropout (Gal et al., 2017) mitigates this problem by using a differentiable continuous relaxation of the Bernoulli distribution and learning the dropout rate from data.

In practice, this means that the distribution of the dropout random variables ω_lk becomes

ω_lk = σ( β ( log p_l − log(1 − p_l) + log u − log(1 − u) ) ),   (17)

where β is a temperature parameter (fixed at β = 10), u is a uniform random variable u ∼ U(0, 1) and σ( · ) is the sigmoid function. With this formulation the sampling procedure becomes differentiable and the loss in Eq. 16 can be rewritten as:

L(θ, p) = E_{⟨s_i, a_i, r_i, s'_i⟩ ∼ m, ω_i ∼ Ω} [ ( y_i^WDQN − Q(s_i, a_i; θ, ω_i) )² ] + Σ_{l=1}^L ( λ ||θ_l||²_2 − ζ K_l H(p_l) ),   (18)

where ζ is a dropout regularization coefficient, H(p) is the entropy of a Bernoulli distribution with parameter p and K_l the number of neurons in layer l.
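The relaxation of Eq. 17 amounts to a few differentiable operations. The sketch below is our paraphrase of the Concrete Dropout mask (with β = 10 as in the text), not the code used for the experiments; following Eqs. 9-10, ω is treated as the (relaxed) drop indicator, so the corresponding keep-mask is 1 − ω.

```python
import torch

def concrete_dropout_mask(logit_p, shape, beta=10.0, eps=1e-7):
    """Differentiable relaxation of the Bernoulli dropout variables ω_lk (Eq. 17).

    The dropout rate is parameterized as p = sigmoid(logit_p) so that it stays in (0, 1)
    while being learned by gradient descent together with the network weights.
    """
    p = torch.sigmoid(logit_p)
    u = torch.rand(shape).clamp(eps, 1.0 - eps)  # u ~ U(0, 1), clamped to avoid log(0)
    omega = torch.sigmoid(beta * (torch.log(p) - torch.log1p(-p)
                                  + torch.log(u) - torch.log1p(-u)))
    keep_mask = 1.0 - omega           # soft keep-probability per unit
    return keep_mask / (1.0 - p)      # rescale activations as in standard dropout
```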


Figure 1. Learning curves of the analyzed DQN variants on Asterix. The curves on the left show the evaluation scores. The figure in the middle reports, for each agent, the estimate of the expected return w.r.t. the starting screen of the game; the dashed lines indicate the real discounted return for each agent. The figure on the right shows the moving average of the entropy of the WDQN weights. Each epoch corresponds to 1M frames. The results shown here are the average of 3 independent runs and the shaded areas represent 95% confidence intervals. The curves are smoothed using a moving average of 10 epochs to improve readability.


Figure 2. Learning curves on Lunar Lander. On the left, the evaluation scores of each epoch (20k steps). On the right, the maximum action-value estimated by each agent at the initial state of the MDP; the dashed lines are the real obtained discounted return. The results shown are the average of 20 independent runs, shaded areas are 95% confidence intervals. The curves are smoothed using a moving average of 10 epochs.

Figure 3. Comparison of WDQN against WDQN with TS and vanilla DQN with Concrete Dropout (DropDQN). On the left, the evaluation scores. On the right, the maximum estimated action-value at the starting screen of the game; dashed lines are the real obtained discounted return. The results shown are the average across 3 different random seeds, the shaded areas are 95% confidence intervals. The curves are smoothed using a moving average of 10 epochs.

5. Experiments

In this section we compare WDQN against the standard DQN algorithm and its DDQN variant to measure the effect of the different estimators on the quality of the learned value functions and policies. First we run a proof of concept experiment on the Lunar Lander environment (Brockman et al., 2016). Then we perform an in-depth analysis on Asterix, one of the Atari games in the Arcade Learning Environment (ALE) (Bellemare et al., 2013) where DQN is known to overestimate the action-values (Van Hasselt et al., 2016). Finally, we test WDQN in three environments of the MinAtar benchmark (Young & Tian, 2019). WDQN achieves a more accurate estimation of the expected return, which results, on average, in better policies.

All the agents are evaluated after each epoch of training using the greedy policy; for Asterix we follow the evaluation protocol of Mnih et al. (2015). For each experiment we report both the average cumulative reward at each evaluation step and the prediction of the expected return, compared against the real discounted return obtained by the agents. Even if TS is the natural choice for WDQN, all the algorithms, unless explicitly stated, use an ε-greedy policy to guarantee the fairness of the comparison. For WDQN, the greedy action is selected during training as the action corresponding to the maximum Q-value estimated with weight averaging, while during evaluation we take the action with the highest probability of being optimal, computed as in Eq. 14. We found WDQN to be robust to the number of dropout samples used to compute the WE (e.g., T ≥ 30). We used low values for the weight decay term, as in Farebrother et al. (2018), and tuned the dropout regularization coefficient for the problem at hand. The algorithms and the experimental setup have been developed using the open-source RL libraries MushroomRL (D'Eramo et al., 2020) and OpenAI Gym (Brockman et al., 2016). A full description of the complete experimental setup and further results for each experiment can be found in the appendix.

5.1. Lunar Lander

Lunar Lander is an MDP from the Gym collection. In order to solve the environment, the agent has to control the thrusters of a spacecraft to safely land at a specific location. In order to make prediction and control more challenging, we increased the stochasticity of the environment by adding a 10% probability of repeating the last executed action instead of the one selected by the agent. The three agents use the same exact network architecture and hyperparameters; the only difference is that WDQN uses Concrete Dropout in each hidden layer. Figure 2 shows the learning curves of the three agents, with WDQN achieving a significantly higher average reward and prediction accuracy.
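The added stochasticity can be obtained with a thin wrapper around the environment. The sketch below is one possible implementation (ours, not the one used for the paper) that repeats the previously executed action with probability p_repeat, assuming the classic Gym step/reset interface.

```python
import random
import gym

class StickyActionWrapper(gym.Wrapper):
    """With probability p_repeat, execute the previously executed action instead of the chosen one."""

    def __init__(self, env, p_repeat=0.1):
        super().__init__(env)
        self.p_repeat = p_repeat
        self.last_action = None

    def reset(self, **kwargs):
        self.last_action = None
        return self.env.reset(**kwargs)

    def step(self, action):
        if self.last_action is not None and random.random() < self.p_repeat:
            action = self.last_action  # repeat the previous action instead of the selected one
        self.last_action = action
        return self.env.step(action)
```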

[Figure 4: one row of panels per game (MinAtar-Breakout, MinAtar-Seaquest, MinAtar-Freeway); columns show average reward, max_a Q(s_0, a) and WE entropy as a function of the training epoch for DQN, DDQN and WDQN.]

Figure 4. Learning curves of the analyzed DQN variants on three Minatar games. In the leftmost column, we show the evaluation scores after each of the 200 training epochs, where 1 epoch corresponds to 25k steps. In the middle column we report for each game and agent the estimate of the expected return w.r.t. the initial state of the environment; the dashed lines indicate the real discounted return obtained by the agents. In the rightmost column we show the moving average of the entropy of the WDQN weights. The results shown here are averaged over 20 independent runs and the shaded areas represent 95% confidence intervals. The curves are smoothed using a moving average of 5 epochs.

5.2. Asterix

We use the same neural network and hyperparameters of Mnih et al. (2015), except for the optimizer, which we replace with Adam (Kingma & Ba, 2015), following more recent best practices (Hessel et al., 2018; Dabney et al., 2018). For WDQN we use Concrete Dropout only in the fully connected layer after the convolutional block. We use sticky actions, with a probability of repeating the last action of 25%, as proposed by Machado et al. (2018). Figure 1 shows the result of the comparison in terms of average reward and prediction accuracy. Differently from what was observed by Van Hasselt et al. (2016), DQN does not diverge in our setting. In fact, while widely overestimating the discounted return, DQN achieves better sample-complexity and higher cumulative reward w.r.t. DDQN.

WDQN, on the other hand, is clearly the least biased option and, despite the slow start - which can be explained by the regularization introduced by dropout - reaches the highest average reward. After an initial phase, the average entropy of the WE weights remains almost constant, with two possible interpretations: 1) in many states there is not a single action that dominates the others; 2) the agent is not able to completely resolve its uncertainty over the action-value functions.

In order to gather more insights on WDQN and on how each component (namely the policy and the MEV estimator) influences the observed behavior, in Figure 3 we compare it against a baseline DQN agent trained with Concrete Dropout (DropDQN in the figure) and a version of WDQN that uses a TS policy as in Gal & Ghahramani (2016). In this setting TS is not beneficial and underperforms. DropDQN is clearly the worst performer of the tested algorithms, indicating that the use of WE is indeed beneficial and suggesting that WDQN may improve with ad hoc hyperparameter tuning.
5.3. MinAtar

MinAtar (Young & Tian, 2019) is an RL testbed with environments mimicking the dynamics of games from ALE, but with a simplified state representation. MinAtar also implements sticky actions and difficulty ramping (Machado et al., 2018). The results of the experiment are shown in Figure 4. We use the same convolutional neural network and hyperparameters of Young & Tian (2019), but we replace RMSProp with the Adam optimizer with learning rate 1e-4.² For WDQN we choose the dropout regularization coefficient with random search and use the same value in the three games.

WDQN achieves higher average reward in Breakout, performs on par with the other two algorithms in Seaquest and is more sample efficient in Freeway. For what concerns the estimation of the expected return, WDQN shows a much lower bias in the first epochs of the learning procedure and converges to more accurate estimates in Breakout and Freeway, while performing similarly to the other algorithms in Seaquest. Since the number of actions and the regularization coefficient are the same among the three environments, it is clear that the entropy of the WE weights heavily depends on the dynamics and input space of the problem.

6. Related works

The Weighted Q-Learning algorithm (D'Eramo et al., 2016) joins previous works in the attempt to address the bias of Q-Learning. Double Q-Learning (Van Hasselt, 2010) firstly showed how the use of DE, to underestimate the action-values, can be helpful to stabilize learning in highly stochastic environments. Bias-corrected Q-Learning (Lee et al., 2013) improves the learning stability of Q-Learning by subtracting a quantity from the target value that depends on the standard deviation of the reward. The Double Weighted Q-Learning algorithm (Zhang et al., 2017) (which, despite the name, is not related to WQL) uses instead a combination of ME and DE to balance between the overestimation and underestimation of the two estimators. DE and WE have also been introduced in Batch RL as alternatives to the ME used in Fitted Q-Iteration (Ernst et al., 2005), respectively in the Double Fitted Q-Iteration and Weighted Fitted Q-Iteration algorithms (D'Eramo et al., 2017). Among the three, Weighted Fitted Q-Iteration is the only algorithm able to handle continuous action spaces.

Overestimation of the action-values can be even more problematic in the DQN algorithm (Mnih et al., 2015), due to the high variance typical of DRL approaches. The Double DQN algorithm (Van Hasselt et al., 2016) introduces the use of DE in DQN, and shows a better estimate of the action-values and superior performance w.r.t. vanilla DQN. Other variants of DQN addressing the overestimation problem are based on exploiting multiple estimates of the action-values. For instance, Averaged DQN (Anschel et al., 2017) controls the variance of the estimation by averaging the target of the update over an arbitrary number of previous checkpoints of the target network. Then, the recent Maxmin DQN (Lan et al., 2020) reduces the bias by keeping several estimates of the action-values in parallel, and computing the target values using the maximum of the minimums of each action-value estimate. Overestimation is also detrimental in actor-critic DRL algorithms, e.g. Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015). Notably, the Twin Delayed DDPG (TD3) algorithm (Fujimoto et al., 2018) is a variant of DDPG exploiting several tricks, e.g. a clipped version of the Double Q-Learning update, that significantly improve the approximation of the value function and the overall performance.

7. Conclusion and future works

We present WDQN, a new value-based Deep Reinforcement Learning algorithm that extends the Weighted Q-Learning algorithm to work in environments with a highly dimensional state representation. WDQN is a principled and robust method to exploit uncertainty in DRL to accurately estimate the maximum action-value in a given state. We empirically support our claims by showing that WDQN consistently reduces the bias of DQN across different tasks. Our results corroborate the findings of previous works (Gal & Ghahramani, 2016; Gal et al., 2017), confirming that dropout can be used successfully for approximate Bayesian inference in DRL. Future works may explore the combination of WDQN with the other, orthogonal, DQN extensions and may attempt to adapt the WDQN approach to other techniques modelling uncertainty in deep neural networks.

² We found Adam to provide significantly more stable learning across all the agents.

References

Anschel, O., Baram, N., and Shimkin, N. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 176–185. JMLR.org, 2017.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, Jun 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 449–458. JMLR.org, 2017.

Bellman, R. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.

Blumenthal, S. and Cohen, A. Estimation of the larger of two normal means. Journal of the American Statistical Association, 63(323):861–876, 1968.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Collier, M. and Llorens, H. U. Deep contextual multi-armed bandits. CoRR, abs/1807.09809, 2018.

Dabney, W., Rowland, M., Bellemare, M., and Munos, R. Distributional reinforcement learning with quantile regression. In AAAI Conference on Artificial Intelligence, 2018.

Damianou, A. and Lawrence, N. Deep Gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research, pp. 207–215, Scottsdale, Arizona, USA, 29 Apr–01 May 2013. PMLR.

D'Eramo, C., Nuara, A., Pirotta, M., and Restelli, M. Estimating the maximum expected value in continuous reinforcement learning problems. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

D'Eramo, C., Tateo, D., Bonarini, A., Restelli, M., and Peters, J. MushroomRL: Simplifying reinforcement learning research. arXiv preprint arXiv:2001.01102, 2020.

D'Eramo, C., Restelli, M., and Nuara, A. Estimating maximum expected value through Gaussian approximation. In International Conference on Machine Learning, pp. 1032–1040, 2016.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Farebrother, J., Machado, M. C., and Bowling, M. Generalization and regularization in DQN, 2018.

Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596, 2018.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1050–1059, New York, New York, USA, 20–22 Jun 2016. PMLR.

Gal, Y., McAllister, R. T., and Rasmussen, C. E. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, ICML 2016, 2016.

Gal, Y., Hron, J., and Kendall, A. Concrete dropout. In Advances in Neural Information Processing Systems 30, pp. 3581–3590. Curran Associates, Inc., 2017.

Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.

Kahn, G., Villaflor, A., Pong, V., Abbeel, P., and Levine, S. Uncertainty-aware reinforcement learning for collision avoidance. CoRR, abs/1702.01182, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.

Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems 28, pp. 2575–2583. Curran Associates, Inc., 2015.

Lan, Q., Pan, Y., Fyshe, A., and White, M. Maxmin Q-learning: Controlling the estimation bias of Q-learning. In International Conference on Learning Representations, 2020.

Lee, D., Defourny, B., and Powell, W. B. Bias-corrected Q-learning to control max-operator bias in Q-learning. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 93–99. IEEE, 2013.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2015.

Lin, L.-J. Programming robots using reinforcement learning and teaching. In Proceedings of the Ninth National Conference on Artificial Intelligence - Volume 2, AAAI'91, pp. 781–786. AAAI Press, 1991.

Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. J., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.

Malik, A., Kuleshov, V., Song, J., Nemer, D., Seymour, H., and Ermon, S. Calibrated model-based deep reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4314–4323, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., and Ostrovski, G. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.

Osband, I., Aslanides, J., and Cassirer, A. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems 31, pp. 8617–8629. Curran Associates, Inc., 2018.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Riquelme, C., Tucker, G., and Snoek, J. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In International Conference on Learning Representations, 2018.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In International Conference on Learning Representations, Puerto Rico, 2016.

Smith, J. E. and Winkler, R. L. The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3):311–322, 2006.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Stadie, B. C., Levine, S., and Abbeel, P. Incentivizing exploration in reinforcement learning with deep predictive models. CoRR, abs/1507.00814, 2015.

Stone, M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133, 1974.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, pp. 255–263. Lawrence Erlbaum, 1993.

Van Hasselt, H. Double Q-learning. In Advances in Neural Information Processing Systems, 2010.

Van Hasselt, H. Estimating the maximum expected value: an analysis of (nested) cross-validation and the maximum sample average. arXiv preprint arXiv:1302.7175, 2013.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In AAAI, volume 16, pp. 2094–2100, 2016.

Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1995–2003, New York, New York, USA, 20–22 Jun 2016. PMLR.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Young, K. and Tian, T. MinAtar: An Atari-inspired testbed for thorough and reproducible reinforcement learning experiments. arXiv preprint arXiv:1903.03176, 2019.

Zhang, Z., Pan, Z., and Kochenderfer, M. J. Weighted double Q-learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 3455–3461, 2017. doi: 10.24963/ijcai.2017/483.

Appendix

The appendix is organized as follows:

• in Section A we further discuss the problem of estimating the maximum expected value (MEV) of a set of random variables;

• in Section B we provide, together with some additional empirical results, details on the hyperparameters and experimental setup used to evaluate the agents.

A. Estimating the Maximum Expected Value

Let X = {X_1, ..., X_M} be a set of M ≥ 2 independent random variables with unknown means µ_i = E[X_i], and S = ∪_{i=1}^M S_i be a set of noisy samples, where S_i corresponds to the subset of samples drawn from the unknown distribution of the random variable X_i. The problem of estimating the MEV consists in using the sample means µ̂_i(S) to approximate the largest true mean max_i µ_i = µ_*. Unfortunately, for many distributions (e.g. Gaussian), no unbiased estimator of the MEV exists (Blumenthal & Cohen, 1968; Van Hasselt, 2013).

Maximum Estimator The Maximum Estimator (ME) can be defined as

µ̂_*^ME(S) = max_i µ̂_i ≈ µ_*.   (19)

The ME is the most straightforward way of estimating the MEV and is positively biased (Smith & Winkler, 2006).

Double Estimator   The Double Estimator (DE) is a negatively biased estimator for the MEV (Stone, 1974; Van Hasselt, 2013). DE splits the set of samples S into two independent sets S^A and S^B, with sample means µ̂_i(S^A) and µ̂_i(S^B). Using a cross-validation approach, the DE is computed by selecting the random variable with the highest sample mean in one set, and computing its sample mean from the samples in the other one. The resulting values are µ̂_{b*}(S^A) = µ̂_{argmax_i µ̂_i(S^B)}(S^A) and µ̂_{a*}(S^B) = µ̂_{argmax_i µ̂_i(S^A)}(S^B). Finally, the DE can be obtained as

µ̂_*^DE(S) = ( µ̂_{b*}(S^A) + µ̂_{a*}(S^B) ) / 2.   (20)

Weighted Estimator   D'Eramo et al. (2016) introduced the Weighted Estimator (WE), defined as

µ̂_*^WE(S) = Σ_{i=1}^M w_i^S µ̂_i(S),   (21)

where w_i^S is the probability of µ̂_i(S) being the maximum sample mean. Since the distribution of the sample mean of a random variable with an unknown distribution is itself unknown, D'Eramo et al. (2016) approximate the sample means as normally distributed random variables by the central limit theorem. WE can be both positively and negatively biased, and its bias is bounded by the one of ME and DE:

−(1/2) ( √( Σ_{i=1}^M σ_i² / |S_i^A| ) + √( Σ_{i=1}^M σ_i² / |S_i^B| ) ) ≤ Bias(µ̂_*^DE(S)) ≤ Bias(µ̂_*^WE(S)) ≤ Bias(µ̂_*^ME(S)) ≤ √( ((M − 1)/M) Σ_{i=1}^M σ_i² / |S_i| ),   (22)

where σ_i² is the variance of the i-th random variable. Furthermore, WE often has a variance that is empirically smaller than the variance of the other two estimators (D'Eramo et al., 2016). A variant of WE that extends to infinitely many random variables has been proposed in D'Eramo et al. (2017).
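As a quick sanity check of the ordering in Eq. 22, the sketch below computes the three estimators from a matrix of samples; the WE weights are obtained by Monte Carlo sampling from the Gaussian approximation of the sample means. The function is illustrative (not part of the experimental code) and the synthetic setup, such as sample sizes and noise levels, is left to the reader.

```python
import numpy as np

def me_de_we(samples, n_we_draws=5000, rng=None):
    """Compute the ME, DE and WE of the maximum expected value from an (n_samples, M) array."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = samples.shape
    means = samples.mean(axis=0)
    me = means.max()                                            # Eq. 19
    # Double Estimator: select the maximizing index on one half, evaluate it on the other (Eq. 20).
    a, b = samples[: n // 2], samples[n // 2 :]
    de = 0.5 * (a.mean(axis=0)[b.mean(axis=0).argmax()]
                + b.mean(axis=0)[a.mean(axis=0).argmax()])
    # Weighted Estimator: Gaussian approximation of the sample means (Eq. 21).
    std_err = samples.std(axis=0, ddof=1) / np.sqrt(n)
    draws = rng.normal(means, std_err, size=(n_we_draws, m))
    weights = np.bincount(draws.argmax(axis=1), minlength=m) / n_we_draws
    we = float(weights @ means)
    return me, de, we
```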

B. Experiments details

In this section we provide details on the experimental setup used for the empirical evaluation of the proposed method. For the implementation of the algorithms and the simulation environments we rely on the following open-source libraries:

• MushroomRL (D'Eramo et al., 2020);

• Gym (Brockman et al., 2016);

• ALE (Bellemare et al., 2013);

• MinAtar (Young & Tian, 2019);

• PyTorch (Paszke et al., 2019).

B.1. Asterix

For the experiment on the Asterix Atari game we use the settings of Mnih et al. (2015), but we add sticky actions (Machado et al., 2018), i.e., a probability p_repeat = 25% of repeating the action executed at the previous frame instead of the one selected by the agent. For the agents we use the same neural network of Mnih et al. (2015) with, as already mentioned, the Adam (Kingma & Ba, 2015) optimizer. The additional hyperparameters introduced by WDQN are tuned with a - small - random search. Concrete Dropout is used only on the neurons of the last hidden layer. Agents are evaluated during training every 1M frames, with the 30 no-op starting condition. During evaluation the episode length is capped at 30 minutes. Table 1 reports a list of additional relevant hyperparameters.

Table 1. DQN hyperparameters on Asterix. Hyperparameters marked with a * are used only for WDQN.

Hyperparameter                             Value
Optimizer                                  Adam
Learning rate                              5e-5
Adam epsilon (ε_Adam)                      0.01/32
Batch size                                 32
Loss function                              Huber
Training frequency                         4
Target network update frequency            10000
Min. memory size                           50000
Max. memory size                           1M
Discount factor (γ)                        0.99
Initial exploration rate (ε_start)         1.0
Final exploration rate (ε_end)             0.1
Exploration steps                          1M
Evaluation exploration rate (ε_test)       0.001
MC dropout samples*                        100
Weight decay coefficient (λ)*              1e-6
Dropout regularization coefficient (ζ)*    5e-4
Initial dropout rate (p)*                  0.5

B.2. Lunar Lander

For Lunar Lander (Brockman et al., 2016) we limit the length of an episode (during both evaluation and training) at 1000 steps. To increase the complexity of the problem, we make the environment stochastic using sticky actions (see the previous subsection) with p_repeat = 10%. We use a small neural network with only two hidden layers. For WDQN, we use Concrete Dropout on each hidden layer. Table 2 reports the hyperparameters used to train the agents.

Table 2. DQN hyperparameters on Lunar Lander. Hyperparameters marked with a * are used only for WDQN.

Hyperparameter                             Value
Units per layer                            [100, 100]
Activation                                 ReLU
Optimizer                                  Adam
Learning rate                              3e-4
Batch size                                 32
Loss function                              MSE
Training frequency                         1
Target network update frequency            300
Min. memory size                           250
Max. memory size                           10000
Discount factor (γ)                        0.99
Initial exploration rate (ε_start)         1.0
Final exploration rate (ε_end)             0.01
Exploration steps                          1000
Evaluation exploration rate (ε_test)       0.0
MC dropout samples*                        50
Weight decay coefficient (λ)*              1e-6
Dropout regularization coefficient (ζ)*    2.5e-3
Initial dropout rate (p)*                  0.2

B.3. MinAtar

MinAtar (Young & Tian, 2019) offers a collection of environments resembling games from the Arcade Learning Environment (Bellemare et al., 2013). The state representation of MinAtar environments is a matrix with multiple channels, where each channel gives specific information about some aspects of the environment (e.g., position and speed of a moving object). MinAtar implements the ALE modifications suggested by Machado et al. (2018), i.e., sticky actions and difficulty ramping. We use the same hyperparameters and convolutional neural network of Young & Tian (2019), but we use Adam for training. The WDQN dropout regularization coefficient is tuned with random search: we select the value providing qualitatively better learning curves and keep it fixed across the three games. Table 3 shows the relevant hyperparameters.

Table 3. DQN hyperparameters on MinAtar environments. Hyperparameters marked with a * are used only for WDQN.

Hyperparameter                             Value
Optimizer                                  Adam
Learning rate                              1e-4
Batch size                                 32
Loss function                              Huber
Training frequency                         1
Target network update frequency            1000
Min. memory size                           5000
Max. memory size                           100000
Discount factor (γ)                        0.99
Initial exploration rate (ε_start)         1.0
Final exploration rate (ε_end)             0.1
Exploration steps                          100000
Evaluation exploration rate (ε_test)       0.0
MC dropout samples*                        100
Weight decay coefficient (λ)*              1e-6
Dropout regularization coefficient (ζ)*    1e-4
Initial dropout rate (p)*                  0.1

For WDQN, we run an additional experiment to assess the impact of the dropout regularization coefficient. We set the initial dropout rate at p = 0.5 (which corresponds to the maximum entropy) and test the agents using regularization coefficients of different magnitude. The results, reported in Figure 5, show how higher levels of regularization generally correspond to a higher entropy of the weights used to compute the WE.

[Figure 5: one row of panels per game (MinAtar-Breakout, MinAtar-Seaquest, MinAtar-Freeway); columns show average reward, max_a Q(s_0, a) and WE entropy as a function of the training epoch for dropout regularization coefficients 1e-3, 1e-4 and 1e-5.]

Figure 5. Learning curves for 3 different regularization values on three Minatar games. In the leftmost column, we show the evaluation scores after each of the 200 training epochs, where 1 epoch corresponds to 25k steps. In the middle column we report for each game and agent the estimate of the expected return w.r.t. the initial state of the environment; the dashed lines indicate the real discounted return obtained by the agents. In the rightmost column we show the moving average of the entropy of the WDQN weights. The results shown here are averaged over 20 independent runs and the shaded areas represent 95% confidence intervals. The curves are smoothed using a moving average of 5 epochs.