Revisiting the Softmax Bellman Operator: New Benefits and New Perspective

Zhao Song 1 * Ronald E. Parr 1 Lawrence Carin 1

¹Duke University. *Work performed as a graduate student at Duke University; now at Baidu Research. Correspondence to: Zhao Song, Ronald E. Parr, Lawrence Carin.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

The impact of softmax on the value function itself in reinforcement learning (RL) is often viewed as problematic because it leads to sub-optimal value (or Q) functions and interferes with the contraction properties of the Bellman operator. Surprisingly, despite these concerns, and independent of its effect on exploration, the softmax Bellman operator, when combined with Deep Q-learning, leads to Q-functions with superior policies in practice, even outperforming its double Q-learning counterpart. To better understand how and why this occurs, we revisit theoretical properties of the softmax Bellman operator, and prove that (i) it converges to the standard Bellman operator exponentially fast in the inverse temperature parameter, and (ii) the distance of its Q-function from the optimal one can be bounded. These alone do not explain its superior performance, so we also show that the softmax operator can reduce the overestimation error, which may give some insight into why a sub-optimal operator leads to better performance in the presence of value function approximation. A comparison among different Bellman operators is then presented, showing the trade-offs when selecting them.

1. Introduction

The Bellman equation (Bellman, 1957) has been a fundamental tool in reinforcement learning (RL), as it provides a sufficient condition for the optimal policy in dynamic programming. The use of the max function in the Bellman equation further suggests that the optimal policy should be greedy w.r.t. the Q-values. On the other hand, the trade-off between exploration and exploitation (Thrun, 1992) motivates the use of exploratory and potentially sub-optimal actions during learning, and one commonly-used strategy is to add randomness by replacing the max function with the softmax function, as in Boltzmann exploration (Sutton & Barto, 1998). Furthermore, the softmax function is a differentiable approximation to the max function, and hence can facilitate analysis (Reverdy & Leonard, 2016).

The beneficial properties of the softmax Bellman operator are in contrast to its potentially negative effect on the accuracy of the resulting value or Q-functions. For example, it has been demonstrated that the softmax Bellman operator is not a contraction, for certain parameters (Littman, 1996, Page 205). Given this, one might expect that the convenient properties of the softmax Bellman operator would come at the expense of the accuracy of the resulting value or Q-functions, or the quality of the resulting policies. In this paper, we demonstrate that, in the case of deep Q-learning, this expectation is surprisingly incorrect. We combine the softmax Bellman operator with the deep Q-network (DQN) (Mnih et al., 2015) and double DQN (DDQN) (van Hasselt et al., 2016a) algorithms, by replacing the max function therein with the softmax function, in the target network. We then test the variants on several games in the Arcade Learning Environment (ALE) (Bellemare et al., 2013), a standard large-scale deep RL testbed. The results show that the variants using the softmax Bellman operator can achieve higher test scores, and reduce the Q-value overestimation as well as the gradient noise on most of them. This effect is independent of exploration and is entirely attributable to the change in the Bellman operator.

This surprising result suggests that a deeper understanding of the softmax Bellman operator is warranted. To this end, we prove that, starting from the same initial Q-values, we can upper and lower bound how far the Q-functions computed with the softmax operator can deviate from those computed with the regular Bellman operator. We further show that the softmax Bellman operator converges to the optimal Bellman operator at an exponential rate w.r.t. the inverse temperature parameter. This gives insight into why the negative convergence results may not be as discouraging as they initially seem, but it does not explain the superior performance observed in practice. Motivated by recent work (van Hasselt et al., 2016a; Anschel et al., 2017) targeting bias and instability of the original DQN (Mnih et al., 2015), we further investigate whether the softmax Bellman operator can alleviate these issues. As discussed in van Hasselt et al. (2016a), one possible explanation for the poor performance of the vanilla DQN on some Atari games was the overestimation bias when computing the target network, due to the max operator therein. We prove that, given the same assumptions as van Hasselt et al. (2016a), the softmax Bellman operator can reduce the overestimation bias, for any inverse temperature parameter. We also quantify the overestimation reduction by providing its lower and upper bounds.

Our results are complementary to and add new motivations for existing work that explores various ways of softening the Bellman operator. For example, entropy regularizers have been used to smooth policies. The motivations for such approaches include computational convenience, exploration, or robustness (Fox et al., 2016; Haarnoja et al., 2017; Schulman et al., 2017; Neu et al., 2017). With respect to value functions, Asadi & Littman (2017) proposed an alternative mellowmax operator and proved that it is a contraction. Their experimental results suggested that it can improve exploration, but the possibility that the sub-optimal Bellman operator could, independent of exploration, lead to superior policies was not considered. Our results, therefore, provide additional motivation for further study of operators such as mellowmax. Although very similar to softmax, the mellowmax operator needs some extra computation to represent a policy, as noted in Asadi & Littman (2017). Our paper discusses this and other trade-offs among different Bellman operators.

The rest of this paper is organized as follows: We provide the necessary background and notation in Section 2. The softmax Bellman operator is introduced in Section 3, where its convergence properties and performance bound are provided. Despite being sub-optimal on Q-functions, the softmax operator is shown in Section 4 to consistently outperform its max counterpart on several Atari games. Such a surprising result further motivates us to investigate why this happens in Section 5. A thorough comparison among different Bellman operators is presented in Section 6. Section 7 discusses the related work and Section 8 concludes this paper.

2. Background and Notation

A Markov decision process (MDP) can be represented as a 5-tuple ⟨S, A, P, R, γ⟩, where S is the state space, A is the action space, P is the transition kernel whose element P(s'|s, a) denotes the transition probability from state s to state s' under action a, R is a reward function whose element R(s, a) denotes the expected reward for executing action a in state s, and γ ∈ (0, 1) is the discount factor. The policy π in an MDP can be represented in terms of a probability mass function (PMF), where π(s, a) ∈ [0, 1] denotes the probability of selecting action a in state s, and Σ_{a∈A} π(s, a) = 1.

For a given policy π, its state-action value function Q^π(s, a) is defined as the accumulated, expected, discounted reward when taking action a in state s and following policy π afterwards, i.e., Q^π(s, a) = E_{a_t∼π}[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a ]. For the optimal policy π*, its corresponding Q-function satisfies the following Bellman equation:

    Q*(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) max_{a'} Q*(s', a').

In DQN (Mnih et al., 2015), the Q-function is parameterized with a neural network as Q_θ(s, a), which takes the state s as input and outputs the corresponding Q-value in the final fully-connected linear layer, for every action a. The training objective for the DQN can be represented as

    min_θ  (1/2) ( Q_θ(s, a) − [ R(s, a) + γ max_{a'} Q_{θ⁻}(s', a') ] )²,          (1)

where θ⁻ corresponds to the frozen weights in the target network, and is updated at fixed intervals. The optimization of Eq. (1) is performed via RMSProp (Tieleman & Hinton, 2012), with mini-batches sampled from a replay buffer.

To reduce the overestimation bias, the double DQN (DDQN) algorithm modified the target that Q_θ(s, a) aims to fit in Eq. (1) as

    R(s, a) + γ Q_{θ⁻}( s', arg max_a Q_{θ_t}(s', a) ).

Note that a separate network based on the latest estimate θ_t is employed for action selection, and the evaluation of this policy is due to the frozen network.

Notation.  The softmax function is defined as

    f_τ(x) = [exp(τ x_1), exp(τ x_2), ..., exp(τ x_m)]^T / Σ_{i=1}^m exp(τ x_i),

where the superscript T denotes the vector transpose. Subsequently, the softmax-weighted function is represented as g_x(τ) = f_τ(x)^T x, as a function of τ. Also, we define the vector Q(s, ·) = [Q(s, a_1), Q(s, a_2), ..., Q(s, a_m)]^T. We further set m to be the size of the action set A in the MDP. Finally, R_min and R_max denote the minimum and the maximum immediate rewards, respectively.
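As a concrete reading of this notation, the following minimal NumPy sketch (the helper names are ours, not from any released code) computes f_τ(x) and the softmax-weighted value g_x(τ) with the usual max-subtraction trick for numerical stability; τ = 0 recovers the mean, and large τ approaches the max.

```python
# Illustrative sketch of the Notation paragraph: f_tau(x) and g_x(tau) = f_tau(x)^T x.
import numpy as np

def softmax_weights(x, tau):
    """f_tau(x): probability vector proportional to exp(tau * x_i)."""
    x = np.asarray(x, dtype=float)
    z = tau * (x - x.max())          # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def softmax_weighted_value(x, tau):
    """g_x(tau) = f_tau(x)^T x, the softmax-weighted average of x."""
    x = np.asarray(x, dtype=float)
    return float(softmax_weights(x, tau) @ x)

q = np.array([1.0, 2.0, 3.5, 3.4])
for tau in [0.0, 1.0, 5.0, 100.0]:
    print(tau, softmax_weighted_value(q, tau))
# tau = 0 gives the mean of q; large tau approaches max(q) = 3.5.
```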

3. The Softmax Bellman Operator

We start by providing the following standard Bellman operator:

    T Q(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) max_{a'} Q(s', a').          (2)

We propose to use the softmax Bellman operator, defined as

    T_soft Q(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) Σ_{a'} ( exp[τ Q(s', a')] / Σ_{ā} exp[τ Q(s', ā)] ) Q(s', a'),          (3)

where the inner sum over a' is the softmax-weighted value sm_τ(Q(s', ·)), and τ ≥ 0 denotes the inverse temperature parameter. Note that T_soft reduces to T when τ → ∞.

The mellowmax operator was introduced in Asadi & Littman (2017) as an alternative to the softmax operator in Eq. (3), defined as

    mm_ω Q(s, ·) = (1/ω) log( (1/m) Σ_{a'} exp[ω Q(s, a')] ),          (4)

where ω > 0 is a tuning parameter. Similar to the softmax operator sm_τ(Q(s', ·)) in Eq. (3), mm_ω converges to the mean and max operators when ω approaches 0 and ∞, respectively. Unlike the softmax operator, however, the mellowmax operator in Eq. (4) does not directly provide a probability distribution over actions, and thus additional steps are needed to obtain a corresponding policy (Asadi & Littman, 2017).

In contrast to its typical use to improve exploration (also a purpose for the mellowmax operator in Asadi & Littman (2017)), the softmax Bellman operator in Eq. (3) is combined in this paper with the DQN and DDQN algorithms in an off-policy fashion, where the action is selected according to the same ε-greedy policy as in Mnih et al. (2015).

Even though the softmax Bellman operator was shown in Littman (1996) not to be a contraction, via a counterexample, we are not aware of any published work showing the performance bound obtained by using the softmax Bellman operator in Q-iteration. Furthermore, our surprising results in Section 4 show that test scores in DQNs and DDQNs can be improved by solely replacing the max operator in their target networks with the softmax operator. Such improvements are independent of the exploration strategy and hence motivate an investigation of the theoretical properties of the softmax Bellman operator.

Before presenting our main theoretical results, we first show how to bound the distance between the softmax-weighted and max operators, which is subsequently used in the proofs for the performance bound and overestimation reduction.

Definition 1. δ̂(s) ≜ sup_Q max_{i,j} |Q(s, a_i) − Q(s, a_j)| denotes the largest distance between Q-functions w.r.t. actions for state s, by taking the supremum of Q-functions and the maximum over all action pairs.

Lemma 2. By assuming δ̂(s) > 0,¹ we have ∀Q,

    δ̂(s) / (m exp[τ δ̂(s)])  ≤  max_a Q(s, a) − f_τ(Q(s, ·))^T Q(s, ·)  ≤  (m − 1) max{ 1/(τ + 2), 2 Q_max/(1 + exp(τ)) },²

where Q_max = R_max/(1 − γ) represents the maximum Q-value in Q-iteration with T_soft.³

¹ Note that if δ̂(s) = 0, then all actions are equivalent in state s, and softmax = max.
² The sup_Q is over all Q-functions that occur during Q-iteration, starting from Q_0 until the iteration terminates.
³ The Supplemental Material formally bounds Q_max.

Note that another upper bound for this gap was provided in O'Donoghue et al. (2017). Our proof here uses a different strategy, by considering possible values for the difference between Q-values with different actions. We further derive a lower bound for this gap.

3.1. Performance Bound for T_soft

The optimal state-action value function Q* is known to be a fixed point for the standard Bellman operator T (Williams & Baird III, 1993), i.e., T Q* = Q*. Since T is a contraction with rate γ, we also know that lim_{k→∞} T^k Q_0 = Q*, for arbitrary Q_0. Given these facts, one may wonder, in the limit, how far iteratively applying T_soft, in lieu of T, over Q_0 will be away from Q*, as a function of τ.

Theorem 3. Let T^k Q_0 and T_soft^k Q_0 denote that the operators T and T_soft are iteratively applied over an initial state-action value function Q_0 for k times. Then,

(I) ∀(s, a), lim sup_{k→∞} T_soft^k Q_0(s, a) ≤ Q*(s, a) and

    lim inf_{k→∞} T_soft^k Q_0(s, a) ≥ Q*(s, a) − ( γ(m − 1)/(1 − γ) ) max{ 1/(τ + 2), 2 Q_max/(1 + exp(τ)) }.

(II) T_soft converges to T at an exponential rate in terms of τ, i.e., the upper bound of T^k Q_0 − T_soft^k Q_0 decays exponentially fast as a function of τ; the proof of this part does not depend on the bound in part (I).

A noteworthy point about part (I) of Theorem 3 is that it does not contradict the negative convergence results for T_soft, since its proof and result do not need to assume the convergence of T_soft. Furthermore, this bound implies that non-convergence for T_soft is different from divergence or oscillation with an arbitrary range: even though T_soft may not converge in certain scenarios, the Q-values could still be within a reasonable range.⁴ T_soft also has other benefits over T, as shown in Section 5.

⁴ The lower bound can become loose when τ is extremely small, but this degenerate case has never been used in experiments.

Note that although the bound for mellowmax under the entropy-regularized MDP framework was shown in Lee et al. (2018) to have a better scalar term log(m) instead of (m − 1), its convergence rate is linear w.r.t. τ. In contrast, our softmax Bellman operator result has an exponential rate, as shown in part (II), which potentially gives insight into why very large values of τ are not required experimentally in Section 4. The error bound in the sparse Bellman operator (Lee et al., 2018) improves upon log(m), but is still linear w.r.t. τ. We further empirically illustrate the faster convergence for the softmax Bellman operator in Figure 7 of Section 6.
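To make Eq. (2) and Eq. (3) concrete, the sketch below applies T and T_soft to the same initial Q_0 on a small randomly generated MDP and prints the resulting gap for a few values of τ. It is a toy illustration under assumed dynamics (a random Dirichlet transition kernel and uniform rewards), not an experiment from the paper; the gap shrinking quickly in τ is the behavior described in Theorem 3, part (II).

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']: transition kernel
R = rng.uniform(-1.0, 1.0, size=(nS, nA))       # R(s, a): expected reward

def softmax_weights(Q, tau):
    z = tau * (Q - Q.max(axis=-1, keepdims=True))
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

def bellman(Q):
    """Eq. (2): standard operator T, backing up max_a' Q(s', a')."""
    return R + gamma * np.einsum('ijk,k->ij', P, Q.max(axis=1))

def soft_bellman(Q, tau):
    """Eq. (3): T_soft, backing up the softmax-weighted value of Q(s', .)."""
    v_soft = (softmax_weights(Q, tau) * Q).sum(axis=1)
    return R + gamma * np.einsum('ijk,k->ij', P, v_soft)

Q0 = np.zeros((nS, nA))
for tau in [1.0, 5.0, 10.0]:
    Q, Qs = Q0.copy(), Q0.copy()
    for _ in range(200):                         # iterate both operators from the same Q0
        Q, Qs = bellman(Q), soft_bellman(Qs, tau)
    print(tau, np.abs(Q - Qs).max())             # the gap shrinks rapidly as tau grows
```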

Figure 1. Performance of (top row) S-DQN vs. DQN, (middle row) S-DDQN vs. DDQN, and (bottom row) S-DQN vs. DDQN on the Atari games (Q*Bert, Ms. Pacman, Breakout, Crazy Climber, Asterix, Seaquest). X-axis and Y-axis correspond to the training epoch and test score, respectively. All curves are averaged over five independent random seeds.

Table 1. Mean of test scores for different values of τ in S-DDQN (standard deviation in parenthesis). Each game is denoted with its initial. Note that S-DDQN reduces to DDQN when τ = ∞.

    τ    1                    5                    10                   ∞
    Q    12068.66 (1085.65)   11049.28 (1565.57)   11191.31 (1336.35)   10577.76 (1508.27)
    M    2492.40 (183.71)     2566.44 (227.24)     2546.18 (259.82)     2293.73 (160.50)
    B    313.08 (20.13)       350.88 (35.58)       303.71 (65.59)       284.64 (60.83)
    C    107405.11 (4617.90)  111111.07 (5047.19)  104049.46 (6686.84)  96373.08 (9244.27)
    A    3476.91 (460.27)     10266.12 (2682.00)   6588.13 (1183.10)    5523.80 (694.72)
    S    272.20 (49.75)       2701.05 (10.06)      6254.01 (697.12)     5695.35 (1862.59)

4. Main Experiments

Our theoretical results apply to the case of a tabular value function representation and known next-state distributions, and they bound suboptimality rather than suggesting superiority of softmax. The goal of our main experiments is to show that, despite being sub-optimal w.r.t. the Q-values, the softmax Bellman operator is still useful in practice by improving the DQN and DDQN algorithms. Such an improvement is entirely attributable to changing the max operator to the softmax operator in the target network.

We combine the softmax operator with DQN and DDQN, by replacing the max function therein with the softmax function, with all other steps being the same. The corresponding new algorithms are termed S-DQN and S-DDQN, respectively. Their exploration strategies are set to be ε-greedy, the same as DQN and DDQN. Our code is provided at https://github.com/zhao-song/Softmax-DQN.
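The only change relative to the DQN and DDQN targets is the backup applied to the frozen network's Q-values. The sketch below spells this out for a mini-batch of transitions; the array names are ours, and the S-DDQN variant shown is one natural way to combine the softmax weighting with the double estimator, which may differ in details from the released implementation.

```python
import numpy as np

def softmax_weights(q, tau):
    z = tau * (q - q.max(axis=1, keepdims=True))
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

def dqn_target(rewards, q_target_next, gamma):
    # Standard DQN target: r + gamma * max_a' Q_target(s', a')
    return rewards + gamma * q_target_next.max(axis=1)

def s_dqn_target(rewards, q_target_next, gamma, tau):
    # S-DQN target: max replaced by the softmax-weighted value of Q_target(s', .)
    v_soft = (softmax_weights(q_target_next, tau) * q_target_next).sum(axis=1)
    return rewards + gamma * v_soft

def s_ddqn_target(rewards, q_target_next, q_online_next, gamma, tau):
    # One plausible S-DDQN analogue (our assumption): softmax weights computed from
    # the online network's Q-values, evaluated with the frozen target network,
    # mirroring the selection/evaluation split of double Q-learning.
    w = softmax_weights(q_online_next, tau)
    return rewards + gamma * (w * q_target_next).sum(axis=1)
```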
We tested on six Atari games: Q*Bert, Ms. Pacman, Crazy Climber, Breakout, Asterix, and Seaquest. Our code is built on the Theano+Lasagne implementation from https://github.com/spragunr/deep_q_rl/. The training contains 200 epochs in total. The test procedures and all the hyperparameters are set the same as DQN, with details described in Mnih et al. (2015). The inverse temperature parameter τ was selected based on a grid search over {1, 5, 10}. We also implemented the logarithmic cooling scheme (Mitra et al., 1986) from simulated annealing to gradually increase τ, but did not observe a better policy, compared with the constant temperature. The result statistics are obtained by running with five independent random seeds.

Figure 1 shows the mean and one standard deviation of the average test score on the Atari games, as a function of the training epoch. In the top two rows, S-DQN and S-DDQN generally achieve higher maximum scores and faster learning than their max counterparts, illustrating the promise of replacing max with softmax. For the game Asterix, often used as an example of the overestimation issue in DQN (van Hasselt et al., 2016a; Anschel et al., 2017), both S-DQN and S-DDQN have much higher test scores than their max counterparts, which suggests that the softmax operator can mitigate the overestimation bias. Although not as dramatic as for Asterix, the bottom row of Figure 1 shows that S-DQN generally outperforms DDQN with higher test scores on the entire test suite, which suggests the possibility of using the softmax operator instead of double Q-learning (van Hasselt, 2010) to reduce the overestimation bias, in the presence of function approximation.

Table 1 shows the test scores of S-DDQN for different values of τ, obtained by averaging the scores from the last 10 epochs. Note that S-DDQN achieves higher test scores than its max counterpart for most of the values of τ. Table 1 further suggests a trade-off when selecting the value of τ: setting τ too small will move T_soft further away from the Bellman optimality, since the max operator (corresponding to τ = ∞) is employed in the standard Bellman equation. On the other hand, using a larger τ will lead to the issues of overestimation and high gradient noise, as discussed in Section 5.

Table 2. Mean of test scores for different Bellman operators (standard deviation in parenthesis). The statistics are averaged over five independent random seeds.

                  Max                   Softmax               Mellowmax
    Q*Bert        8331.72 (1597.67)     11307.10 (1332.80)    11775.93 (1173.51)
    Ms. Pacman    2368.79 (219.17)      2856.82 (369.75)      2458.76 (130.34)
    C. Climber    90923.40 (11059.39)   106422.27 (4821.40)   99601.47 (19271.53)
    Breakout      255.32 (64.69)        345.56 (34.19)        355.94 (25.85)
    Asterix       196.91 (135.16)       8868.00 (2167.35)     11203.75 (3818.40)
    Seaquest      4090.36 (1455.73)     8066.78 (1646.51)     6476.20 (1952.12)

To compare against the mellowmax operator, we combine it with DQN in the same off-policy fashion as the softmax operator.⁵ Note that the root-finding approach for mellowmax in Asadi & Littman (2017) needs to compute the optimal inverse temperature parameter for every state, and thus is too computationally expensive to be applied here. Consequently, we tune the inverse temperature parameter for both the softmax and mellowmax operators over {1, 5, 10}, and report the best scores. Table 2 shows that both the softmax and mellowmax operators can achieve higher test scores than the max operator. Furthermore, the scores from softmax are higher than or similar to those of mellowmax on all games except Asterix. The competitive performance of the softmax operator here suggests that it is still preferable in certain domains, despite its non-contraction property.

⁵ We tried the mellowmax operator for exploration as in Asadi & Littman (2017), but observed much worse performance.

5. Why Softmax Helps?

One may wonder why the softmax Bellman operator can achieve the surprising performance in Section 4, as the greedy policy from the Bellman equation suggests that the max operator should be optimal. The softmax operator is not used for exploration in this paper, nor is it motivated by regularizing the policy, as in entropy-regularized MDPs (Fox et al., 2016; Schulman et al., 2017; Neu et al., 2017; Nachum et al., 2017; Asadi & Littman, 2017; Lee et al., 2018).

As discussed in earlier work (van Hasselt et al., 2016a; Anschel et al., 2017), the max operator leads to the significant issue of overestimation for the Q-function in DQNs, which further causes excessive gradient noise that destabilizes the optimization of neural networks. Here we aim to provide analysis of how the softmax Bellman operator can overcome these issues, and also the corresponding evidence for overestimation and gradient noise reduction in DQNs. Although our analysis is focused on the softmax Bellman operator, this work could potentially give further insight into practical benefits of entropy regularization as well.

5.1. Overestimation Bias Reduction

Q-learning's overestimation bias, due to the max operator, was first discussed in Thrun & Schwartz (1994). It was later shown in van Hasselt et al. (2016a) and Anschel et al. (2017) that overestimation leads to the poor performance of DQNs in some Atari games. Following the same assumptions as van Hasselt et al. (2016a), we can show that the softmax operator reduces the overestimation bias. Furthermore, we quantify the range of the overestimation reduction, by providing both lower and upper bounds.

Theorem 4. Given the same assumptions as van Hasselt et al. (2016a), where (A1) there exists some V*(s) such that the true state-action value function satisfies Q*(s, a) = V*(s) for different actions, and (A2) the estimation error is modeled as Q_t(s, a) = V*(s) + ε_a, then

(I) the overestimation errors from T_soft are smaller than or equal to those of T using the max operator, for any τ ≥ 0;

(II) the overestimation reduction by using T_soft in lieu of T is within [ δ̂(s)/(m exp[τ δ̂(s)]), (m − 1) max{ 1/(τ + 2), 2 Q_max/(1 + exp(τ)) } ];

(III) the overestimation error for T_soft monotonically increases w.r.t. τ ∈ [0, ∞).


Figure 2. The mean and one standard deviation for the overestimation error for different values of τ.

An observation about Theorem 4 is that for any positive value of τ, there will still be some potential for overestimation bias, because noise can also influence the softmax operator. Depending upon the amount of noise, it is possible that, unlike double DQN, the reduction caused by softmax could exceed the bias introduced by max. This can be seen in our experimental results below, which show that it is possible to have negative error (overcompensation) from the use of softmax. However, when combined with double Q-learning, this effect becomes very small and decreases with the number of actions.

To elucidate Theorem 4, we simulate standard normal variables ε_a with 100 independent trials for each action a, using the same setup as van Hasselt et al. (2016a). Figure 2 shows the mean and one standard deviation of the overestimation bias, for different values of τ in the softmax operator. Both the single and double implementations of the softmax operator achieve smaller overestimation errors than their max counterparts, thus validating our theoretical results. Note that the gap becomes smaller when τ increases, which is intuitive as T_soft → T when τ → ∞, and also consistent with the monotonicity result in Theorem 4.
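This simulation can be reproduced in a few lines; the sketch below is our reconstruction of the setup described above (standard normal errors ε_a, 100 trials), including a double-estimator variant whose weights come from an independent noise sample.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 10, 100                       # m actions, 100 independent trials as in the text

def softmax_weights(x, tau):
    z = tau * (x - x.max())
    w = np.exp(z)
    return w / w.sum()

for tau in [1.0, 5.0, 10.0]:
    errs_max, errs_soft, errs_double = [], [], []
    for _ in range(trials):
        eps = rng.standard_normal(m)      # Q_t(s, a) = V*(s) + eps_a, so the error is eps_a
        eps2 = rng.standard_normal(m)     # independent noise for the double-style estimate
        errs_max.append(eps.max())                              # max-operator overestimation
        errs_soft.append(softmax_weights(eps, tau) @ eps)       # softmax-operator overestimation
        errs_double.append(softmax_weights(eps2, tau) @ eps)    # weights from independent noise
    print(tau, np.mean(errs_max), np.mean(errs_soft), np.mean(errs_double))
# The softmax error never exceeds the max error (Theorem 4(I)) and grows with tau (III);
# the double-style estimate is centred near zero and can dip slightly below it.
```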

Figure 3. Mean and one standard deviation of the estimated Q-values on the Atari games, for different methods.

We further check the estimated Q-values on the Atari games,⁶ by plotting the corresponding mean and one standard deviation in Figure 3. Both S-DQN and S-DDQN achieve smaller values than DQN, which verifies that the softmax operator can reduce the overestimation bias. This reduction can partly explain the higher test scores for S-DQN and S-DDQN in Figure 1. Moreover, we plot the estimated Q-values of S-DDQN for different values of τ in Figure 4; the result is consistent with Theorem 4 in that increasing τ leads to larger estimated Q-values and, thus, more risk of overestimation. It also shows that picking a value of τ that is very small may introduce a risk of underestimation that can hurt performance, as shown for Asterix with τ = 1. (See also Table 1.)

⁶ Due to the space limit, we plot fewer games from here on, and provide the full plots in the Supplemental Material.

Figure 4. Mean and one standard deviation of the estimated Q-values on the Atari games, for different values of τ in S-DDQN.

5.2. Gradient Noise Reduction

Mnih et al. (2015) observed that large gradients can be detrimental to the optimization of DQNs. This was noted in the context of large reward values, but it is possible that overestimation bias can also introduce variance in the gradient and have a detrimental effect. To see this, we first notice that the gradient in the DQN can be represented as

    ∇_θ L = E_{s,a,r,s'}[ ( Q_θ(s, a) − T Q_{θ⁻}(s, a) ) ∇_θ Q_θ(s, a) ].          (5)

When the Q-value is overestimated, the magnitude of ∇_θ Q_θ(s, a) in Eq. (5) could become excessively large, and so does the gradient, which may further lead to instability during optimization. This is supported by Figure 3, which shows higher variance in Q-values for DQN vs. the others. Since we show in Section 5.1 that using softmax in lieu of max in the Bellman operator can reduce the overestimation bias, the gradient estimate in Eq. (5) can be more stable with softmax. Note that stabilizing the gradient during DQN training was also shown in van Hasselt et al. (2016b), but in the context of adaptive target normalization.

To evaluate the gradient noise on the Atari games, we report the ℓ2 norm of the gradient in the final fully-connected linear layer, averaged over 50 independent inputs. As shown in Figure 5, S-DQN and S-DDQN achieve smaller gradient norm and variance than their counterparts, which implies that the softmax operator can facilitate variance reduction in the optimization of the neural networks. For the game Asterix, we also observe that the overestimation of the Q-value in DQN eventually causes the gradient to explode, while its competitors avoid this issue by achieving reasonable Q-value estimation. Finally, Figure 6 shows that increasing τ generally leads to higher gradient variance, which is also consistent with our analysis in Theorem 4 that the overestimation monotonically increases w.r.t. τ.
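A toy instance of Eq. (5) under linear function approximation (our own construction, not the DQN training code) makes the mechanism visible: the semi-gradient is the TD error times ∇_θ Q_θ(s, a), so whatever inflates the target feeds directly into the update.

```python
import numpy as np

rng = np.random.default_rng(1)
d, gamma, tau = 8, 0.99, 5.0
theta = rng.normal(size=(4, d))                        # linear Q: Q_theta(s, a) = theta[a] @ phi(s)
theta_frozen = theta + 0.1 * rng.normal(size=(4, d))   # frozen target-network weights theta^-

def softmax_weights(q, tau):
    z = tau * (q - q.max())
    w = np.exp(z)
    return w / w.sum()

phi_s, phi_next = rng.normal(size=d), rng.normal(size=d)   # features of s and s'
a, r = 2, 0.5
q_next = theta_frozen @ phi_next                           # Q_{theta^-}(s', .)

target_max = r + gamma * q_next.max()                             # T Q_{theta^-}(s, a)
target_soft = r + gamma * softmax_weights(q_next, tau) @ q_next   # T_soft Q_{theta^-}(s, a)

for name, target in [("max", target_max), ("softmax", target_soft)]:
    td_error = theta[a] @ phi_s - target      # Q_theta(s, a) - target
    grad = np.zeros_like(theta)
    grad[a] = td_error * phi_s                # Eq. (5): TD error times grad of Q_theta(s, a)
    print(name, target, np.linalg.norm(grad))
# The softmax-weighted target is never above the max target, which is the source of the
# reduced overestimation and, in turn, of the tamer gradient norms observed in Figure 5.
```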

6. A Comparison of Bellman Operators

So far we have covered three different types of Bellman operators, where the softmax and mellowmax variants employ the softmax and the log-sum-exp functions in lieu of the max function in the standard Bellman operator, respectively. A thorough comparison among these Bellman operators will not only exhibit their differences and connections, but also reveal the trade-offs when selecting them.

Table 3. A comparison of different Bellman operators (B.O. Bellman optimality; O.R. overestimation reduction; P.R. policy representation; D.Q. double Q-learning).

                  B.O.   Tuning   O.R.   P.R.   D.Q.
    Max           Yes    No       -      Yes    Yes
    Mellowmax     No     Yes      Yes    No     No
    Softmax       No     Yes      Yes    Yes    Yes

Table 3 shows the comparison under different criteria: the Bellman equation implies that the optimal greedy policy corresponds to the max operator only. In contrast to the mellowmax and the softmax operators, max also does not need to tune an inverse temperature parameter. On the other hand, the overestimation bias rooted in the max operator can be alleviated by either of the softmax and mellowmax operators. Furthermore, as noted in Asadi & Littman (2017), the mellowmax operator itself cannot directly represent a policy, and needs to be transformed into a softmax policy, where numerical methods are necessary to determine the corresponding state-dependent temperature parameters. The lack of an explicit policy representation also prevents the mellowmax operator from being directly applied in double Q-learning.

Figure 5. Mean and one standard deviation of the gradient norm on the Atari games, for different methods.

For mellowmax and softmax, it is interesting to further investigate their relationship, especially given their comparable performance in the presence of function approximation, as shown in Table 2. First, we notice the following equivalence between these two operators, in terms of the Bellman updates: for the softmax-weighted Q-function, i.e., g_{Q(s',·)}(τ) = Σ_{a'} ( exp[τ Q(s', a')] / Σ_{ā} exp[τ Q(s', ā)] ) Q(s', a'), there exists an ω such that the mellowmax-weighted Q-function achieves the same value. This can be verified by the facts that both the softmax and mellowmax operators (i) converge to the mean and max operators when the inverse temperature parameter approaches 0 and ∞, respectively; and (ii) are continuous on τ and ω, with [0, ∞) as the support.⁷

⁷ The operators remain fundamentally different, since this equivalence would hold only for a particular set of Q-values. The conversion between softmax and mellowmax parameters would need to be done on a state-by-state basis, and would change as the Q-values are updated.

Figure 6. Mean and one standard deviation of the gradient norm on the Atari games, for different values of τ in S-DDQN.

Furthermore, we compare via simulation the approximation error to the max function and the overestimation error, for mellowmax and softmax. The setup is as follows: the inverse temperature parameter τ is chosen from a linear grid from 0.01 to 100. We generate standard normal random variables, and compute the weighted Q-functions for cases with different numbers of actions. The approximation error is then measured in terms of the difference between the max function and the corresponding softmax and mellowmax functions. The overestimation error is measured the same way as in Figure 2. The result statistics are reported by averaging over 100 independent trials. Figure 7 shows the trade-off between converging to the Bellman optimality and overestimation reduction in terms of the inverse temperature parameter, where the softmax operator approaches the max operator at a faster speed while the mellowmax operator can further reduce the overestimation error.

Figure 7. (Left) Approximation error to the max function; (Right) Overestimation error, as a function of τ (softmax) or ω (mellowmax), for |A| = 5 and |A| = 50.
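This simulation is straightforward to re-create; the sketch below follows our reading of the setup just described (it is not the authors' script), comparing the two weighted values on the same standard normal samples.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 5, 100                        # |A| = 5; the paper also reports |A| = 50
X = rng.standard_normal((trials, m))      # standard normal Q-values, as in the Figure 7 setup

def softmax_value(x, tau):
    w = np.exp(tau * (x - x.max()))
    return (w / w.sum()) @ x

def mellowmax_value(x, omega):
    # mm_omega(x) = log(mean(exp(omega * x))) / omega, Eq. (4), computed stably
    return x.max() + np.log(np.mean(np.exp(omega * (x - x.max())))) / omega

for t in [0.1, 1.0, 10.0, 100.0]:         # the paper sweeps a linear grid from 0.01 to 100
    approx_soft = np.mean([x.max() - softmax_value(x, t) for x in X])
    approx_mm = np.mean([x.max() - mellowmax_value(x, t) for x in X])
    over_soft = np.mean([softmax_value(x, t) for x in X])    # overestimation, as in Figure 2
    over_mm = np.mean([mellowmax_value(x, t) for x in X])
    print(t, approx_soft, approx_mm, over_soft, over_mm)
# At the same parameter value the softmax-weighted value sits closer to the max (smaller
# approximation error) while mellowmax keeps the overestimation error lower, which is the
# trade-off displayed in Figure 7.
```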

7. Related Work

Applying the softmax function in Bellman updates has long been viewed as problematic, as Littman (1996) showed, via a counterexample, that the corresponding Boltzmann operator could be expansive in a specific MDP. The primary use of the softmax function in RL has been focused on improving exploration performance (Sutton & Barto, 1998). Our work here instead aims to bring evidence and a new perspective that the softmax Bellman operator can still be beneficial beyond exploration.

The mellowmax operator (Asadi & Littman, 2017) was proposed based on the log-sum-exp function, which can be proven to be a contraction and has received recent attention due to its convenient mathematical properties. The use of the log-sum-exp function in the backup operator dates back to Todorov (2007), where the control cost was regularized by a Kullback-Leibler (KL) divergence on the transition probabilities. Fox et al. (2016) proposed a G-learning scheme with soft updates to reduce the bias in the Q-value estimation, by regularizing the cost with a KL divergence on policies. Asadi & Littman (2017) further applied the log-sum-exp function to on-policy updates, and showed that the state-dependent inverse temperature parameter can be numerically computed. More recently, the log-sum-exp function has also been used in the Boltzmann backup operator for entropy-regularized RL (Schulman et al., 2017; Haarnoja et al., 2017; Neu et al., 2017; Nachum et al., 2017). One interesting observation is that the optimal policies in these works admit a form of softmax, and thus can be beneficial for exploration. In Lee et al. (2018), a sparsemax operator based on Tsallis entropy was proposed to improve the error bound of mellowmax. Note that these operators were not shown to explicitly address the instability issues in the DQN, and their corresponding policies were used to improve exploration performance, which is different from our focus here.

Among variants of the DQN in Mnih et al. (2015), DDQN (van Hasselt et al., 2016a) first identified the overestimation bias issue caused by the max operator, and mitigated it via the double Q-learning algorithm (van Hasselt, 2010). Anschel et al. (2017) later demonstrated that averaging over the previous Q-value estimates in the learning target can also reduce the bias, although the analysis of the bias reduction is restricted to a small MDP with a special structure. Furthermore, an adaptive normalization scheme was developed in van Hasselt et al. (2016b) and was demonstrated to reduce the gradient norm. The softmax function was also employed in the categorical DQN (Bellemare et al., 2017), but with the different purpose of generating distributions in its distributional Bellman operator. Finally, we notice that other variants (Wang et al., 2016; Mnih et al., 2016; Schaul et al., 2016; He et al., 2017; Hessel et al., 2018; Dabney et al., 2018) have also been proposed to improve the vanilla DQN, though they were not explicitly designed to tackle the issues of overestimation error and gradient noise.

8. Conclusion and Future Work

We demonstrate surprising benefits of the softmax Bellman operator when combined with DQNs, which further suggests that the softmax operator can be used as an alternative to double Q-learning for reducing the overestimation bias. Our theoretical results provide new insight into the convergence properties of the softmax Bellman operator, and quantify how it reduces the overestimation bias. To gain a deeper understanding of the relationship among different Bellman operators, we further compare them under different criteria, and show the trade-offs when choosing them in practice.

An interesting direction for future work is to provide more theoretical analysis of the performance trade-off when selecting the inverse temperature parameter in T_soft, and subsequently to design an efficient cooling scheme.

Acknowledgements

We thank the anonymous reviewers for their helpful comments and suggestions. This work was partially supported by NSF RI grant 1815300. Opinions, findings, conclusions or recommendations herein are those of the authors and not necessarily those of the NSF.

References

Anschel, O., Baram, N., and Shimkin, N. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pp. 176–185, 2017.

Asadi, K. and Littman, M. L. An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pp. 243–252, 2017.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.

Bellman, R. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 202–211. AUAI Press, 2016.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361, 2017.

He, F. S., Liu, Y., Schwing, A. G., and Peng, J. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. In International Conference on Learning Representations, 2017.

Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Lee, K., Choi, S., and Oh, S. Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1466–1473, July 2018.

Littman, M. L. Algorithms for Sequential Decision Making. PhD thesis, Department of Computer Science, Brown University, February 1996.

Mitra, D., Romeo, F., and Sangiovanni-Vincentelli, A. Convergence and finite-time behavior of simulated annealing. Advances in Applied Probability, 18(3):747–771, 1986.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2772–2782, 2017.

Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. Combining policy gradient and Q-learning. In International Conference on Learning Representations, 2017.

Reverdy, P. and Leonard, N. E. Parameter estimation in softmax decision-making models with linear objective functions. IEEE Transactions on Automation Science and Engineering, 13(1):54–67, 2016.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In International Conference on Learning Representations, 2016.

Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Mozer, M. C., Smolensky, P., Touretzky, D. S., Elman, J. L., and Weigend, A. S. (eds.), Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ, 1994. Lawrence Erlbaum.

Thrun, S. B. The role of exploration in learning control. In White, D. A. and Sofge, D. A. (eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pp. 527–559. Van Nostrand Reinhold, New York, NY, 1992.

Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 1369–1376, 2007.

van Hasselt, H. Double Q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621, 2010.

van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2094–2100. AAAI Press, 2016a.

van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295, 2016b.

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and De Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1995–2003, 2016.

Williams, R. J. and Baird III, L. C. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Northeastern University, 1993.

Supplemental Material

A1. Proof for Performance Bound

We first show that for all Q-functions that occur during Q-iteration with T_soft, their corresponding Q-values are bounded.

Lemma A1. Assuming ∀(s, a) the initial Q-values Q_0(s, a) ∈ [R_min, R_max], the Q-values during Q-iteration with T_soft are within [Q_min, Q_max], with Q_min = R_min/(1 − γ) and Q_max = R_max/(1 − γ).

Proof. The upper bound can be obtained by showing that ∀(s, a), the Q-values at the i-th iteration are bounded as

    Q_i(s, a) ≤ Σ_{j=0}^{i} γ^j R_max.          (A1)

We then prove Eq. (A1) by induction as follows. The lower bound can be proven similarly.

(i) When i = 1, we start from the definition of T_soft in Eq. (3) and the assumption on Q_0 to have

    Q_1(s, a) = T_soft Q_0(s, a)
              ≤ R_max + γ Σ_{s'} P(s'|s, a) max_{a'} Q_0(s', a')
              ≤ R_max + γ Σ_{s'} P(s'|s, a) R_max
              = (1 + γ) R_max.

(ii) Assume Eq. (A1) holds when i = k, i.e., Q_k(s, a) ≤ Σ_{j=0}^{k} γ^j R_max. Then,

    Q_{k+1}(s, a) = T_soft Q_k(s, a)
                  ≤ R_max + γ Σ_{s'} P(s'|s, a) max_{a'} Q_k(s', a')
                  ≤ R_max + γ Σ_{s'} P(s'|s, a) Σ_{j=0}^{k} γ^j R_max
                  = Σ_{j=0}^{k+1} γ^j R_max.

Corollary A2. Assuming R_max ≥ −R_min ≥ 0 WLOG, we have |Q(s, a_i) − Q(s, a_j)| ≤ 2 R_max/(1 − γ), ∀Q and ∀s.

Proof. This follows by using the assumption and the results in Lemma A1.

Proof of Lemma 2. We first sort the sequence {Q(s, a_i)} such that Q(s, a_[1]) ≥ ... ≥ Q(s, a_[m]). Then, ∀Q and ∀s, we have

    max_a Q(s, a) − f_τ(Q(s, ·))^T Q(s, ·)
        = Q(s, a_[1]) − Σ_{i=1}^m exp[τ Q(s, a_[i])] Q(s, a_[i]) / Σ_{i=1}^m exp[τ Q(s, a_[i])]
        = Σ_{i=1}^m exp[τ Q(s, a_[i])] ( Q(s, a_[1]) − Q(s, a_[i]) ) / Σ_{i=1}^m exp[τ Q(s, a_[i])].          (A2)

By introducing δ_i(s) = Q(s, a_[1]) − Q(s, a_[i]), and noting that δ_i(s) ≥ 0 and δ_1(s) = 0, we can proceed from Eq. (A2) as

    Σ_{i=1}^m exp[τ Q(s, a_[i])] ( Q(s, a_[1]) − Q(s, a_[i]) ) / Σ_{i=1}^m exp[τ Q(s, a_[i])]
        = Σ_{i=1}^m exp[−τ δ_i(s)] δ_i(s) / Σ_{i=1}^m exp[−τ δ_i(s)]
        = Σ_{i=2}^m exp[−τ δ_i(s)] δ_i(s) / ( 1 + Σ_{i=2}^m exp[−τ δ_i(s)] ).          (A3)

Now, we can proceed from Eq. (A3) to prove each direction separately, as follows.

(i) Upper bound: First note that for any two non-negative sequences {x_i} and {y_i},

    Σ_i x_i / ( 1 + Σ_i y_i ) ≤ Σ_i x_i / (1 + y_i).          (A4)

We then apply Eq. (A4) to Eq. (A3) as

    Σ_{i=2}^m exp[−τ δ_i(s)] δ_i(s) / ( 1 + Σ_{i=2}^m exp[−τ δ_i(s)] )
        ≤ Σ_{i=2}^m exp[−τ δ_i(s)] δ_i(s) / ( 1 + exp[−τ δ_i(s)] )
        = Σ_{i=2}^m δ_i(s) / ( 1 + exp[τ δ_i(s)] ).          (A5)

Next, we bound each term in Eq. (A5) by considering the following two cases:

1) δ_i(s) > 1:  δ_i(s)/(1 + exp[τ δ_i(s)]) ≤ δ_i(s)/(1 + exp(τ)) ≤ 2 Q_max/(1 + exp(τ)), where we apply Corollary A2 to bound δ_i(s).

2) 0 ≤ δ_i(s) ≤ 1:  δ_i(s)/(1 + exp[τ δ_i(s)]) = 1 / ( 2/δ_i(s) + τ + 0.5 τ² δ_i(s) + ··· ) ≤ 1/(τ + 2), where we first expand the denominator using the Taylor series of the exponential function.

By combining these two cases with Eq. (A5), we achieve the upper bound.

(ii) Lower bound:

    Σ_{i=2}^m exp[−τ δ_i(s)] δ_i(s) / ( 1 + Σ_{i=2}^m exp[−τ δ_i(s)] )
        ≥ Σ_{i=2}^m exp[−τ δ_i(s)] δ_i(s) / m
        ≥ Σ_{i=2}^m δ_i(s) / ( m exp[τ δ̂(s)] )
        ≥ δ̂(s) / ( m exp[τ δ̂(s)] ).          (A6)

Proof of Theorem 3. We first prove the upper bound by induction as follows.

(i) When i = 1, we start from the definitions of T and T_soft in Eq. (2) and Eq. (3), and proceed as

    T Q_0(s, a) − T_soft Q_0(s, a) = γ Σ_{s'} P(s'|s, a) [ max_{a'} Q_0(s', a') − f_τ(Q_0(s', ·))^T Q_0(s', ·) ] ≥ 0.

(ii) Suppose this claim holds when i = l, i.e., T^l Q_0(s, a) ≥ T_soft^l Q_0(s, a). When i = l + 1, we have

    T^{l+1} Q_0(s, a) − T_soft^{l+1} Q_0(s, a)
        = T T^l Q_0(s, a) − T_soft T_soft^l Q_0(s, a)
        ≥ T_soft T^l Q_0(s, a) − T_soft T_soft^l Q_0(s, a)
        ≥ 0.

Since Q* is the fixed point for T, we know lim_{k→∞} T^k Q_0(s, a) = Q*(s, a). Therefore, lim sup_{k→∞} T_soft^k Q_0(s, a) ≤ Q*(s, a).

To prove the lower bound, we first conjecture that

    T^k Q_0(s, a) − T_soft^k Q_0(s, a) ≤ Σ_{j=1}^k γ^j ζ,          (A7)

where ζ = sup_Q max_s [ max_a Q(s, a) − f_τ(Q(s, ·))^T Q(s, ·) ] denotes the supremum of the difference between the max and softmax operators, over all Q-functions that occur during Q-iteration and over states s. Eq. (A7) is proven using induction as follows.

(i) When i = 1, we start from the definitions of T and T_soft in Eq. (2) and Eq. (3), and proceed as

    T Q_0(s, a) − T_soft Q_0(s, a) = γ Σ_{s'} P(s'|s, a) [ max_{a'} Q_0(s', a') − f_τ(Q_0(s', ·))^T Q_0(s', ·) ] ≤ γ Σ_{s'} P(s'|s, a) ζ = γ ζ.

(ii) Suppose the conjecture holds when i = l, i.e., T^l Q_0(s, a) − T_soft^l Q_0(s, a) ≤ Σ_{j=1}^l γ^j ζ. Then

    T^{l+1} Q_0(s, a) − T_soft^{l+1} Q_0(s, a)
        = T T^l Q_0(s, a) − T_soft^{l+1} Q_0(s, a)
        ≤ T ( T_soft^l Q_0(s, a) + Σ_{j=1}^l γ^j ζ ) − T_soft^{l+1} Q_0(s, a)
        = Σ_{j=1}^l γ^{j+1} ζ + (T − T_soft) T_soft^l Q_0(s, a)
        ≤ Σ_{j=1}^l γ^{j+1} ζ + γ ζ
        = Σ_{j=1}^{l+1} γ^j ζ,

where the last inequality follows from the definition of ζ. By using the fact that lim_{k→∞} T^k Q_0(s, a) = Q*(s, a) again and applying Lemma 2 to bound ζ, we finish the proof for Part (I).

To prove part (II), note that as a byproduct of Eq. (A5) in the proof of Lemma 2, Eq. (A7) can be bounded as

    T^k Q_0(s, a) − T_soft^k Q_0(s, a) ≤ ( γ(1 − γ^k)/(1 − γ) ) Σ_{i=2}^m δ_i(s) / ( 1 + exp[τ δ_i(s)] ).          (A8)

From the definition of δ_i(s), we know δ_m(s) ≥ δ_{m−1}(s) ≥ ... ≥ δ_2(s) ≥ 0. Furthermore, there must exist an index i* ≤ m such that δ_i > 0 for all i* ≤ i ≤ m (otherwise the upper bound becomes zero). Subsequently, we can proceed from Eq. (A8) as

    ( γ(1 − γ^k)/(1 − γ) ) Σ_{i=2}^m δ_i(s) / ( 1 + exp[τ δ_i(s)] )
        = ( γ(1 − γ^k)/(1 − γ) ) Σ_{i=i*}^m δ_i(s) / ( 1 + exp[τ δ_i(s)] )
        ≤ ( γ(1 − γ^k)/(1 − γ) ) Σ_{i=i*}^m δ_i(s) / exp[τ δ_i(s)]
        ≤ ( γ(1 − γ^k)/(1 − γ) ) Σ_{i=i*}^m δ_i(s) / exp[τ δ_{i*}(s)]
        = ( γ(1 − γ^k)/(1 − γ) ) exp[−τ δ_{i*}(s)] Σ_{i=i*}^m δ_i(s),

which implies an exponential convergence rate in terms of τ and hence proves part (II).

A2. Proofs for Overestimation Reduction

Lemma A3. g_x(τ) = Σ_{i=1}^m [exp(τ x_i) x_i] / Σ_{i=1}^m exp(τ x_i) is a monotonically increasing function for τ ∈ [0, ∞).

Proof. The gradient of g_x(τ) can be computed as

    ∂g_x(τ)/∂τ = { ( Σ_{i=1}^m exp(τ x_i) x_i² ) ( Σ_{i=1}^m exp(τ x_i) ) − ( Σ_{i=1}^m exp(τ x_i) x_i )² } / ( Σ_{j=1}^m exp(τ x_j) )² ≥ 0,

where the last step holds because of the Cauchy-Schwarz inequality.

The overestimation bias due to the max operator can be observed by plugging assumption (A2) of Theorem 4 into Eq. (2) as

    E[ max_a Q_t(s, a) − max_a Q*(s, a) ] = E[ max_a Q_t(s, a) − V*(s) ] = E[ max_a ε_a ],

and max_a ε_a is typically positive for a large action set and noise following a normal distribution, or a uniform distribution with symmetric support.

Proof of Theorem 4. First, the overestimation error from T_soft can be represented as

    E[ Σ_a ( exp[τ Q_t(s, a)] / Σ_{ā} exp[τ Q_t(s, ā)] ) Q_t(s, a) − V*(s) ]
        = E[ Σ_a ( exp[τ V*(s) + τ ε_a] / Σ_{ā} exp[τ V*(s) + τ ε_ā] ) ( V*(s) + ε_a ) − V*(s) ]
        = E[ Σ_a ( exp[τ ε_a] / Σ_{ā} exp[τ ε_ā] ) ε_a ]          (A9)
        ≤ E[ max_a ε_a ],

which proves Part (I).

To prove Part (II), note that the overestimation reduction of T_soft relative to T can then be represented as

    E[ max_a ε_a − Σ_a ( exp[τ ε_a] / Σ_{ā} exp[τ ε_ā] ) ε_a ]
        = E[ max_a ( ε_a + V*(s) ) − Σ_a ( exp[τ ε_a] / Σ_{ā} exp[τ ε_ā] ) ( ε_a + V*(s) ) ]
        = E[ max_a Q_t(s, a) − Σ_a ( exp[τ ε_a] / Σ_{ā} exp[τ ε_ā] ) Q_t(s, a) ]
        = E[ max_a Q_t(s, a) − Σ_a ( exp[τ ε_a + τ V*(s)] / Σ_{ā} exp[τ ε_ā + τ V*(s)] ) Q_t(s, a) ]
        = E[ max_a Q_t(s, a) − Σ_a ( exp[τ Q_t(s, a)] / Σ_{ā} exp[τ Q_t(s, ā)] ) Q_t(s, a) ].

Subsequently, we can employ Lemma 2 to obtain the range.

Finally, the monotonicity of the overestimation error in terms of τ follows by noting that the term inside the expectation of Eq. (A9) can be represented as g_ε(τ), which is a monotonic function of τ according to Lemma A3.

A3. Additional Plots and Setups

Figures A1 and A2 are the full versions of the corresponding figures in the main text, plotting all six games. The corresponding values of τ used for S-DQN and S-DDQN are provided in Table A1.

Table A1. Values of τ used for S-DQN and S-DDQN in Figures A1 and A2.

              Q    M    B    C    A    S
    S-DQN     1    1    5    1    5    5
    S-DDQN    1    5    5    5    5    10

Figures A3, A4, and A5 show the scores, Q-values, and gradient norm, respectively, for different values of τ in S-DDQN.
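As a quick numerical sanity check on Lemma 2, and hence on the reduction range of Theorem 4, Part (II) (our addition, not part of the original supplement), one can sample random Q-vectors and confirm that the softmax-max gap falls inside the stated interval when δ̂(s) is taken to be the sampled vector's own largest action gap.

```python
import numpy as np

rng = np.random.default_rng(0)
m, tau, q_max = 6, 2.0, 10.0                 # q_max plays the role of Q_max = R_max / (1 - gamma)

def gap(q, tau):
    """max_a Q(s, a) - f_tau(Q(s, .))^T Q(s, .) for a single Q-vector."""
    w = np.exp(tau * (q - q.max()))
    return q.max() - (w / w.sum()) @ q

for _ in range(1000):
    q = rng.uniform(-q_max, q_max, size=m)
    delta = q.max() - q.min()                # largest action gap of this particular Q
    if delta == 0:
        continue
    lower = delta / (m * np.exp(tau * delta))
    upper = (m - 1) * max(1.0 / (tau + 2), 2 * q_max / (1 + np.exp(tau)))
    assert lower <= gap(q, tau) <= upper + 1e-12   # bounds of Lemma 2 hold for every sample
```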

Figure A1. Mean and one standard deviation of the estimated Q-values on the Atari games (Q*Bert, Ms. Pacman, Breakout, Crazy Climber, Asterix, Seaquest), for different methods.

Figure A2. Mean and one standard deviation of the gradient norm on the Atari games, for different methods.

Figure A3. Mean and one standard deviation of test scores on the Atari games, for different values of τ in S-DDQN.

Figure A4. Mean and one standard deviation of the estimated Q-values on the Atari games, for different values of τ in S-DDQN.


Figure A5. Mean and one standard deviation of the gradient norm on the Atari games, for different values of τ in S-DDQN.