Revisiting the Softmax Bellman Operator: New Benefits and New Perspective

Zhao Song 1* Ronald E. Parr 1 Lawrence Carin 1

Abstract

The impact of the softmax on the value function itself in reinforcement learning (RL) is often viewed as problematic because it leads to sub-optimal value (or Q) functions and interferes with the contraction properties of the Bellman operator. Surprisingly, despite these concerns, and independent of its effect on exploration, the softmax Bellman operator, when combined with Deep Q-learning, leads to Q-functions with superior policies in practice, even outperforming its double Q-learning counterpart. To better understand how and why this occurs, we revisit theoretical properties of the softmax Bellman operator, and prove that (i) it converges to the standard Bellman operator exponentially fast in the inverse temperature parameter, and (ii) the distance of its Q-function from the optimal one can be bounded. These alone do not explain its superior performance, so we also show that the softmax operator can reduce the overestimation error, which may give some insight into why a sub-optimal operator leads to better performance in the presence of value function approximation. A comparison among different Bellman operators is then presented, showing the trade-offs when selecting them.

1. Introduction

The Bellman equation (Bellman, 1957) has been a fundamental tool in reinforcement learning (RL), as it provides a sufficient condition for the optimal policy in dynamic programming. The use of the max function in the Bellman equation further suggests that the optimal policy should be greedy w.r.t. the Q-values.

On the other hand, the trade-off between exploration and exploitation (Thrun, 1992) motivates the use of exploratory and potentially sub-optimal actions during learning, and one commonly-used strategy is to add randomness by replacing the max function with the softmax function, as in Boltzmann exploration (Sutton & Barto, 1998). Furthermore, the softmax function is a differentiable approximation to the max function, and hence can facilitate analysis (Reverdy & Leonard, 2016).

The beneficial properties of the softmax Bellman operator are in contrast to its potentially negative effect on the accuracy of the resulting value or Q-functions. For example, it has been demonstrated that the softmax Bellman operator is not a contraction, for certain temperature parameters (Littman, 1996, Page 205). Given this, one might expect that the convenient properties of the softmax Bellman operator would come at the expense of the accuracy of the resulting value or Q-functions, or the quality of the resulting policies. In this paper, we demonstrate that, in the case of deep Q-learning, this expectation is surprisingly incorrect. We combine the softmax Bellman operator with the deep Q-network (DQN) (Mnih et al., 2015) and double DQN (DDQN) (van Hasselt et al., 2016a) algorithms, by replacing the max function therein with the softmax function, in the target network. We then test the variants on several games in the Arcade Learning Environment (ALE) (Bellemare et al., 2013), a standard large-scale deep RL testbed. The results show that the variants using the softmax Bellman operator can achieve higher test scores, and reduce the Q-value overestimation as well as the gradient noise on most of them. This effect is independent of exploration and is entirely attributable to the change in the Bellman operator.

This surprising result suggests that a deeper understanding of the softmax Bellman operator is warranted. To this end, we prove that, starting from the same initial Q-values, we can upper and lower bound how far the Q-functions computed with the softmax operator can deviate from those computed with the regular Bellman operator. We further show that the softmax Bellman operator converges to the optimal Bellman operator at an exponential rate w.r.t. the inverse temperature parameter. This gives insight into why the negative convergence results may not be as discouraging as they initially seem, but it does not explain the superior performance observed in practice. Motivated by recent work (van Hasselt et al., 2016a; Anschel et al., 2017) targeting bias and instability of the original DQN (Mnih et al., 2015), we further investigate whether the softmax Bellman operator can alleviate these issues. As discussed in van Hasselt et al. (2016a), one possible explanation for the poor performance of the vanilla DQN on some Atari games was the overestimation bias when computing the target network, due to the max operator therein. We prove that, given the same assumptions as van Hasselt et al. (2016a), the softmax Bellman operator can reduce the overestimation bias, for any inverse temperature parameter. We also quantify the overestimation reduction by providing its lower and upper bounds.

Our results are complementary to, and add new motivations for, existing work that explores various ways of softening the Bellman operator. For example, entropy regularizers have been used to smooth policies. The motivations for such approaches include computational convenience, exploration, or robustness (Fox et al., 2016; Haarnoja et al., 2017; Schulman et al., 2017; Neu et al., 2017). With respect to value functions, Asadi & Littman (2017) proposed an alternative mellowmax operator and proved that it is a contraction. Their experimental results suggested that it can improve exploration, but the possibility that the sub-optimal Bellman operator could, independent of exploration, lead to superior policies was not considered. Our results, therefore, provide additional motivation for further study of operators such as mellowmax. Although very similar to softmax, the mellowmax operator needs some extra computation to represent a policy, as noted in Asadi & Littman (2017). Our paper discusses this and other trade-offs among different Bellman operators.

The rest of this paper is organized as follows: We provide the necessary background and notation in Section 2. The softmax Bellman operator is introduced in Section 3, where its convergence properties and performance bound are provided. Despite being sub-optimal on Q-functions, the softmax operator is shown in Section 4 to consistently outperform its max counterpart on several Atari games. Such a surprising result further motivates us to investigate why this happens in Section 5. A thorough comparison among different Bellman operators is presented in Section 6. Section 7 discusses the related work and Section 8 concludes this paper.

¹Duke University. *Work performed as a graduate student at Duke University; now at Baidu Research. Correspondence to: Zhao Song, Ronald E. Parr, Lawrence Carin.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

2. Background and Notation

A Markov decision process (MDP) can be represented as a 5-tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition kernel whose element $P(s'|s,a)$ denotes the transition probability from state $s$ to state $s'$ under action $a$, $R$ is a reward function whose element $R(s,a)$ denotes the expected reward for executing action $a$ in state $s$, and $\gamma \in (0,1)$ is the discount factor. The policy $\pi$ in an MDP can be represented in terms of a probability mass function (PMF), where $\pi(s,a) \in [0,1]$ denotes the probability of selecting action $a$ in state $s$, and $\sum_{a \in \mathcal{A}} \pi(s,a) = 1$.

For a given policy $\pi$, its state-action value function $Q^\pi(s,a)$ is defined as the accumulated, expected, discounted reward when taking action $a$ in state $s$ and following policy $\pi$ afterwards, i.e., $Q^\pi(s,a) = \mathbb{E}_{a_t \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$. For the optimal policy $\pi^*$, its corresponding Q-function satisfies the following Bellman equation:

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a').$$

In DQN (Mnih et al., 2015), the Q-function is parameterized with a neural network as $Q_\theta(s,a)$, which takes the state $s$ as input and outputs the corresponding Q-value in the final fully-connected linear layer, for every action $a$. The training objective for the DQN can be represented as

$$\min_\theta \ \frac{1}{2}\Big(Q_\theta(s,a) - \big[R(s,a) + \gamma \max_{a'} Q_{\theta^-}(s',a')\big]\Big)^2, \qquad (1)$$

where $\theta^-$ corresponds to the frozen weights in the target network, and is updated at fixed intervals. The optimization of Eq. (1) is performed via RMSProp (Tieleman & Hinton, 2012), with mini-batches sampled from a replay buffer.

To reduce the overestimation bias, the double DQN (DDQN) algorithm modified the target that $Q_\theta(s,a)$ aims to fit in Eq. (1) as

$$R(s,a) + \gamma\, Q_{\theta^-}\big(s', \arg\max_a Q_{\theta_t}(s',a)\big).$$

Note that a separate network based on the latest estimate $\theta_t$ is employed for action selection, and the evaluation of this policy is due to the frozen network.

Notation. The softmax function is defined as

$$f_\tau(x) = \frac{[\exp(\tau x_1), \exp(\tau x_2), \ldots, \exp(\tau x_m)]^T}{\sum_{i=1}^m \exp(\tau x_i)},$$

where the superscript $T$ denotes the vector transpose. Subsequently, the softmax-weighted function is represented as $g_x(\tau) = f_\tau(x)^T x$, as a function of $\tau$. Also, we define the vector $Q(s,\cdot) = [Q(s,a_1), Q(s,a_2), \ldots, Q(s,a_m)]^T$. We further set $m$ to be the size of the action set $\mathcal{A}$ in the MDP. Finally, $R_{\min}$ and $R_{\max}$ denote the minimum and the maximum immediate rewards, respectively.
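To make Eq. (1) and the DDQN target concrete, the following minimal NumPy sketch computes both targets for a batch of transitions. This is not the paper's released code; the array and function names (dqn_target, q_target_next, q_online_next, etc.) are ours, and the terminal-masking convention is a common implementation detail assumed here.

```python
import numpy as np

def dqn_target(rewards, q_target_next, gamma=0.99, terminal=None):
    """Standard DQN target: r + gamma * max_a' Q_{theta^-}(s', a')."""
    bootstrap = q_target_next.max(axis=1)
    if terminal is not None:
        bootstrap = bootstrap * (1.0 - terminal)   # no bootstrapping at episode end
    return rewards + gamma * bootstrap

def ddqn_target(rewards, q_target_next, q_online_next, gamma=0.99, terminal=None):
    """Double DQN target: the online network selects the action,
    the frozen target network evaluates it."""
    greedy_actions = q_online_next.argmax(axis=1)
    bootstrap = q_target_next[np.arange(len(rewards)), greedy_actions]
    if terminal is not None:
        bootstrap = bootstrap * (1.0 - terminal)
    return rewards + gamma * bootstrap

# Toy batch: 2 transitions, 3 actions.
rewards = np.array([1.0, 0.0])
q_target_next = np.array([[1.0, 2.0, 0.5], [0.2, 0.1, 0.3]])
q_online_next = np.array([[1.5, 1.0, 0.4], [0.0, 0.4, 0.2]])
print(dqn_target(rewards, q_target_next))
print(ddqn_target(rewards, q_target_next, q_online_next))
```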

3. The Softmax Bellman Operator

We start by providing the standard Bellman operator:

$$\mathcal{T} Q(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q(s',a'). \qquad (2)$$

We propose to use the softmax Bellman operator, defined as

$$\mathcal{T}_{\mathrm{soft}} Q(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \underbrace{\sum_{a'} \frac{\exp[\tau Q(s',a')]}{\sum_{\bar{a}} \exp[\tau Q(s',\bar{a})]}\, Q(s',a')}_{sm_\tau(Q(s',\cdot))}, \qquad (3)$$

where $\tau \ge 0$ denotes the inverse temperature parameter. Note that $\mathcal{T}_{\mathrm{soft}}$ will reduce to $\mathcal{T}$ when $\tau \to \infty$.

The mellowmax operator was introduced in Asadi & Littman (2017) as an alternative to the softmax operator in Eq. (3), defined as

$$mm_\omega\, Q(s,\cdot) = \frac{\log\!\big(\frac{1}{m} \sum_{a'} \exp[\omega Q(s,a')]\big)}{\omega}, \qquad (4)$$

where $\omega > 0$ is a tuning parameter. Similar to the softmax operator $sm_\tau\, Q(s',\cdot)$ in Eq. (3), $mm_\omega$ converges to the mean and max operators when $\omega$ approaches 0 and $\infty$, respectively. Unlike the softmax operator, however, the mellowmax operator in Eq. (4) does not directly provide a probability distribution over actions, and thus additional steps are needed to obtain a corresponding policy (Asadi & Littman, 2017).

In contrast to its typical use to improve exploration (also a purpose for the mellowmax operator in Asadi & Littman (2017)), the softmax Bellman operator in Eq. (3) is combined in this paper with the DQN and DDQN algorithms in an off-policy fashion, where the action is selected according to the same ε-greedy policy as in Mnih et al. (2015).

Even though the softmax Bellman operator was shown in Littman (1996), via a counterexample, not to be a contraction, we are not aware of any published work showing a performance bound for using the softmax Bellman operator in Q-iteration. Furthermore, our surprising results in Section 4 show that test scores of DQNs and DDQNs can be improved by solely replacing the max operator in their target networks with the softmax operator. Such improvements are independent of the exploration strategy and hence motivate an investigation of the theoretical properties of the softmax Bellman operator.

Before presenting our main theoretical results, we first show how to bound the distance between the softmax-weighted and max operators, which is subsequently used in the proofs of the performance bound and the overestimation reduction.

Definition 1. $\hat{\Delta}(s) \triangleq \sup_Q \max_{i,j} |Q(s,a_i) - Q(s,a_j)|$ denotes the largest distance between Q-values w.r.t. actions for state $s$, obtained by taking the supremum over Q-functions² and the maximum over all action pairs.

Lemma 2. By assuming $\hat{\Delta}(s) > 0$,¹ we have, $\forall Q$,

$$\frac{\hat{\Delta}(s)}{m \exp[\tau \hat{\Delta}(s)]} \;\le\; \max_a Q(s,a) - f_\tau\big(Q(s,\cdot)\big)^T Q(s,\cdot) \;\le\; (m-1)\, \max\Big\{\frac{1}{\tau+2},\ \frac{2 Q_{\max}}{1+\exp(\tau)}\Big\},$$

where $Q_{\max}$ represents the maximum Q-value in Q-iteration with $\mathcal{T}_{\mathrm{soft}}$.³

Note that another upper bound for this gap was provided in O'Donoghue et al. (2017). Our proof here uses a different strategy, by considering possible values for the difference between Q-values with different actions. We further derive a lower bound for this gap.

3.1. Performance Bound for T_soft

The optimal state-action value function $Q^*$ is known to be a fixed point of the standard Bellman operator (Williams & Baird III, 1993), i.e., $\mathcal{T} Q^* = Q^*$. Since $\mathcal{T}$ is a contraction with rate $\gamma$, we also know that $\lim_{k \to \infty} \mathcal{T}^k Q_0 = Q^*$, for arbitrary $Q_0$. Given these facts, one may wonder, in the limit, how far iteratively applying $\mathcal{T}_{\mathrm{soft}}$, in lieu of $\mathcal{T}$, to $Q_0$ will be away from $Q^*$, as a function of $\tau$.

Theorem 3. Let $\mathcal{T}^k Q_0$ and $\mathcal{T}_{\mathrm{soft}}^k Q_0$ denote the results of iteratively applying the operators $\mathcal{T}$ and $\mathcal{T}_{\mathrm{soft}}$ to an initial state-action value function $Q_0$ for $k$ times. Then,

(I) $\forall (s,a)$, $\limsup_{k\to\infty} \mathcal{T}_{\mathrm{soft}}^k Q_0(s,a) \le Q^*(s,a)$ and

$$\liminf_{k\to\infty} \mathcal{T}_{\mathrm{soft}}^k Q_0(s,a) \;\ge\; Q^*(s,a) - \frac{\gamma (m-1)}{1-\gamma}\, \max\Big\{\frac{1}{\tau+2},\ \frac{2 Q_{\max}}{1+\exp(\tau)}\Big\}.$$

(II) $\mathcal{T}_{\mathrm{soft}}$ converges to $\mathcal{T}$ at an exponential rate in terms of $\tau$, i.e., the upper bound of $\mathcal{T}^k Q_0 - \mathcal{T}_{\mathrm{soft}}^k Q_0$ decays exponentially fast as a function of $\tau$; the proof of this part does not depend on the bound in part (I).

A noteworthy point about part (I) of Theorem 3 is that it does not contradict the negative convergence results for $\mathcal{T}_{\mathrm{soft}}$, since its proof and result do not need to assume the convergence of $\mathcal{T}_{\mathrm{soft}}$. Furthermore, this bound implies that non-convergence for $\mathcal{T}_{\mathrm{soft}}$ is different from divergence or oscillation with an arbitrary range: Even though $\mathcal{T}_{\mathrm{soft}}$ may not converge in certain scenarios, the Q-values could still be within a reasonable range.⁴ $\mathcal{T}_{\mathrm{soft}}$ also has other benefits over $\mathcal{T}$, as shown in Section 5.

Note that although the bound for mellowmax under the entropy-regularized MDP framework was shown in Lee et al. (2018) to have a better scalar term, $\log(m)$ instead of $(m-1)$, its convergence rate is linear w.r.t. $\tau$. In contrast, our softmax Bellman operator result has an exponential rate, as shown in part (II), which potentially gives insight into why very large values of $\tau$ are not required experimentally in Section 4. The error bound in the sparse Bellman operator (Lee et al., 2018) improves upon $\log(m)$, but is still linear w.r.t. $\tau$. We further empirically illustrate the faster convergence of the softmax Bellman operator in Figure 7 of Section 6.

¹ Note that if $\hat{\Delta}(s) = 0$, then all actions are equivalent in state $s$, and softmax = max.
² The $\sup_Q$ is over all Q-functions that occur during Q-iteration, starting from $Q_0$ until the iteration terminates.
³ The Supplemental Material formally bounds $Q_{\max}$.
⁴ The lower bound can become loose when $\tau$ is extremely small, but this degenerate case has never been used in experiments.
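To make the operators in Eqs. (2)-(4) concrete, here is a minimal NumPy sketch of the three backup values (max, softmax-weighted, and mellowmax) for a single vector of next-state Q-values. The function names and the numerical-stability tricks are ours, not from the paper.

```python
import numpy as np

def max_backup(q_next):
    """Backup value used by the standard Bellman operator, Eq. (2)."""
    return float(q_next.max())

def softmax_backup(q_next, tau):
    """Softmax-weighted backup value sm_tau(Q(s',.)) from Eq. (3)."""
    z = tau * (q_next - q_next.max())          # subtract max for numerical stability
    w = np.exp(z) / np.exp(z).sum()            # Boltzmann weights f_tau(Q(s',.))
    return float(w @ q_next)

def mellowmax_backup(q_next, omega):
    """Mellowmax backup value mm_omega(Q(s',.)) from Eq. (4)."""
    m = len(q_next)
    c = q_next.max()                           # log-sum-exp trick for stability
    return float(c + np.log(np.exp(omega * (q_next - c)).sum() / m) / omega)

q_next = np.array([1.0, 0.5, -0.2])
for tau in (1.0, 5.0, 50.0):
    print(tau, softmax_backup(q_next, tau), mellowmax_backup(q_next, tau), max_backup(q_next))
```

As the parameter grows, both the softmax and mellowmax values approach the max backup, matching the limiting behavior discussed above; for small parameters they approach the mean of the Q-values.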

Figure 1. Performance of (top row) S-DQN vs. DQN, (middle row) S-DDQN vs. DDQN, and (bottom row) S-DQN vs. DDQN on the Atari games. The x-axis and y-axis correspond to the training epoch and test score, respectively. All curves are averaged over five independent random seeds.

4. Main Experiments

Our theoretical results apply to the case of a tabular value function representation and known next-state distributions, and they bound suboptimality rather than suggesting superiority of softmax. The goal of our main experiments is to show that, despite being sub-optimal w.r.t. the Q-values, the softmax Bellman operator is still useful in practice by improving the DQN and DDQN algorithms. Such an improvement is entirely attributable to changing the max operator to the softmax operator in the target network.

Table 1. Mean of test scores for different values of τ in S-DDQN (standard deviation in parentheses). Each game is denoted with its initial. Note that S-DDQN reduces to DDQN when τ = ∞.

      τ = 1                τ = 5                τ = 10               τ = ∞
Q     12068.66 (1085.65)   11049.28 (1565.57)   11191.31 (1336.35)   10577.76 (1508.27)
M     2492.40 (183.71)     2566.44 (227.24)     2546.18 (259.82)     2293.73 (160.50)
B     313.08 (20.13)       350.88 (35.58)       303.71 (65.59)       284.64 (60.83)
C     107405.11 (4617.90)  111111.07 (5047.19)  104049.46 (6686.84)  96373.08 (9244.27)
A     3476.91 (460.27)     10266.12 (2682.00)   6588.13 (1183.10)    5523.80 (694.72)
S     272.20 (49.75)       2701.05 (10.06)      6254.01 (697.12)     5695.35 (1862.59)
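As a concrete illustration of the only change involved, the S-DQN target simply swaps the max in the DQN target of Eq. (1) for the softmax-weighted value sm_τ from Eq. (3). This is a sketch under the same batched naming conventions as the DQN target sketch in Section 2, not the released implementation.

```python
import numpy as np

def sdqn_target(rewards, q_target_next, tau=5.0, gamma=0.99, terminal=None):
    """S-DQN target: r + gamma * sm_tau(Q_{theta^-}(s', .)),
    i.e., the DQN target with max replaced by the softmax-weighted value."""
    z = tau * (q_target_next - q_target_next.max(axis=1, keepdims=True))
    w = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   # Boltzmann weights per transition
    bootstrap = (w * q_target_next).sum(axis=1)            # softmax-weighted backup value
    if terminal is not None:
        bootstrap = bootstrap * (1.0 - terminal)
    return rewards + gamma * bootstrap
```

Exploration (ε-greedy), the replay buffer, and the optimizer are left untouched, which is what allows the gains in Figure 1 to be attributed to the operator alone.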

We combine the softmax operator with DQN and DDQN, by replacing the max function therein with the softmax function, with all other steps being the same. The corresponding new algorithms are termed S-DQN and S-DDQN, respectively. Their exploration strategies are set to be ε-greedy, the same as DQN and DDQN. Our code is provided at https://github.com/zhao-song/Softmax-DQN.

We tested on six Atari games: Q*Bert, Ms. Pacman, Crazy Climber, Breakout, Asterix, and Seaquest. Our code is built on the Theano+Lasagne implementation from https://github.com/spragunr/deep_q_rl/. The training contains 200 epochs in total. The test procedures and all the hyperparameters are set the same as DQN, with details described in Mnih et al. (2015). The inverse temperature parameter τ was selected based on a grid search over {1, 5, 10}. We also implemented the logarithmic cooling scheme (Mitra et al., 1986) from simulated annealing to gradually increase τ, but did not observe a better policy compared with the constant temperature. The result statistics are obtained by running with five independent random seeds.

Figure 1 shows the mean and one standard deviation of the average test score on the Atari games, as a function of the training epoch: In the top two rows, S-DQN and S-DDQN generally achieve higher maximum scores and faster learning than their max counterparts, illustrating the promise of replacing max with softmax. For the game Asterix, often used as an example of the overestimation issue in DQN (van Hasselt et al., 2016a; Anschel et al., 2017), both S-DQN and S-DDQN have much higher test scores than their max counterparts, which suggests that the softmax operator can mitigate the overestimation bias. Although not as dramatic as for Asterix, the bottom row of Figure 1 shows that S-DQN generally outperforms DDQN with higher test scores for the entire test suite, which suggests the possibility of using the softmax operator instead of double Q-learning (van Hasselt, 2010) to reduce the overestimation bias, in the presence of function approximation.

Table 1 shows the test scores of S-DDQN with different values of τ, obtained by averaging the scores from the last 10 epochs. Note that S-DDQN achieves higher test scores than its max counterpart for most values of τ. Table 1 further suggests a trade-off when selecting the value of τ: Setting τ too small will move T_soft further away from Bellman optimality, since the max operator (corresponding to τ = ∞) is employed in the standard Bellman equation. On the other hand, using a larger τ will lead to the issues of overestimation and high gradient noise, to be discussed in Section 5.

Table 2. Mean of test scores for different Bellman operators (standard deviation in parentheses). The statistics are averaged over five independent random seeds.

              Max                   Softmax               Mellowmax
Q*Bert        8331.72 (1597.67)     11307.10 (1332.80)    11775.93 (1173.51)
Ms. Pacman    2368.79 (219.17)      2856.82 (369.75)      2458.76 (130.34)
C. Climber    90923.40 (11059.39)   106422.27 (4821.40)   99601.47 (19271.53)
Breakout      255.32 (64.69)        345.56 (34.19)        355.94 (25.85)
Asterix       196.91 (135.16)       8868.00 (2167.35)     11203.75 (3818.40)
Seaquest      4090.36 (1455.73)     8066.78 (1646.51)     6476.20 (1952.12)

To compare against the mellowmax operator, we combine it with DQN in the same off-policy fashion as the softmax operator.⁵ Note that the root-finding approach for mellowmax in Asadi & Littman (2017) needs to compute the optimal inverse temperature parameter for every state, and thus is too computationally expensive to be applied here. Consequently, we tune the inverse temperature parameter for both the softmax and mellowmax operators over {1, 5, 10}, and report the best scores. Table 2 shows that both the softmax and mellowmax operators can achieve higher test scores than the max operator. Furthermore, the scores from softmax are higher than or similar to those of mellowmax on all games except Asterix. The competitive performance of the softmax operator here suggests that it is still preferable in certain domains, despite its non-contraction property.

⁵ We tried the mellowmax operator for exploration as in Asadi & Littman (2017), but observed much worse performance.

5. Why Softmax Helps?

One may wonder why the softmax Bellman operator can achieve the surprising performance in Section 4, as the greedy policy from the Bellman equation suggests that the max operator should be optimal. The softmax operator is not used for exploration in this paper, nor is it motivated by regularizing the policy, as in entropy-regularized MDPs (Fox et al., 2016; Schulman et al., 2017; Neu et al., 2017; Nachum et al., 2017; Asadi & Littman, 2017; Lee et al., 2018).

As discussed in earlier work (van Hasselt et al., 2016a; Anschel et al., 2017), the max operator leads to a significant overestimation issue for the Q-function in DQNs, which further causes excessive gradient noise that destabilizes the optimization of the neural networks. Here we aim to provide an analysis of how the softmax Bellman operator can overcome these issues, together with the corresponding evidence for overestimation and gradient-noise reduction in DQNs. Although our analysis is focused on the softmax Bellman operator, this work could potentially give further insight into practical benefits of entropy regularization as well.

5.1. Overestimation Bias Reduction

Q-learning's overestimation bias, due to the max operator, was first discussed in Thrun & Schwartz (1994). It was later shown in van Hasselt et al. (2016a) and Anschel et al. (2017) that overestimation leads to the poor performance of DQNs on some Atari games. Following the same assumptions as van Hasselt et al. (2016a), we can show that the softmax operator reduces the overestimation bias. Furthermore, we quantify the range of the overestimation reduction by providing both lower and upper bounds.

Theorem 4. Given the same assumptions as van Hasselt et al. (2016a), where (A1) there exists some $V^*(s)$ such that the true state-action value function satisfies $Q^*(s,a) = V^*(s)$ for different actions, and (A2) the estimation error is modeled as $Q_t(s,a) = V^*(s) + \epsilon_a$, then

(I) the overestimation errors from $\mathcal{T}_{\mathrm{soft}}$ are smaller than or equal to those of using the max operator $\mathcal{T}$, for any $\tau \ge 0$;

(II) the overestimation reduction by using $\mathcal{T}_{\mathrm{soft}}$ in lieu of $\mathcal{T}$ is within $\Big[\frac{\hat{\Delta}(s)}{m \exp[\tau \hat{\Delta}(s)]},\ (m-1)\max\big\{\frac{1}{\tau+2}, \frac{2 Q_{\max}}{1+\exp(\tau)}\big\}\Big]$;

(III) the overestimation error for $\mathcal{T}_{\mathrm{soft}}$ monotonically increases w.r.t. $\tau \in [0, \infty)$.

An observation about Theorem 4 is that, for any positive value of τ, there will still be some potential for overestimation bias, because noise can also influence the softmax operator. Depending upon the amount of noise, it is possible that, unlike double DQN, the reduction caused by softmax could exceed the bias introduced by max. This can be seen in our experimental results below, which show that it is possible to have negative error (overcompensation) from the use of softmax. However, when combined with double Q-learning, this effect becomes very small and decreases with the number of actions.

To elucidate Theorem 4, we simulate standard normal variables ε_a with 100 independent trials for each action a, using the same setup as van Hasselt et al. (2016a). Figure 2 shows the mean and one standard deviation of the overestimation bias, for different values of τ in the softmax operator. Both the single and double implementations of the softmax operator achieve smaller overestimation errors than their max counterparts, thus validating our theoretical results. Note that the gap becomes smaller when τ increases, which is intuitive since T_soft → T as τ → ∞, and is also consistent with the monotonicity result in Theorem 4.

Figure 2. The mean and one standard deviation of the overestimation error for different values of τ.

We further check the estimated Q-values on the Atari games,⁶ by plotting the corresponding mean and one standard deviation in Figure 3. Both S-DQN and S-DDQN achieve smaller values than DQN, which verifies that the softmax operator can reduce the overestimation bias. This reduction can partly explain the higher test scores for S-DQN and S-DDQN in Figure 1. Moreover, we plot the estimated Q-values of S-DDQN for different values of τ in Figure 4, and the result is consistent with Theorem 4 in that increasing τ leads to larger estimated Q-values and, thus, more risk of overestimation. It also shows that picking a value of τ that is very small may introduce a risk of underestimation that can hurt performance, as shown for Asterix with τ = 1. (See also Table 1.)

Figure 3. Mean and one standard deviation of the estimated Q-values on the Atari games, for different methods.

Figure 4. Mean and one standard deviation of the estimated Q-values on the Atari games, for different values of τ in S-DDQN.

⁶ Due to the space limit, we plot fewer games from here on, and provide the full plots in the Supplemental Material.
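The following self-contained simulation mirrors the setup described above (our own sketch, not the authors' code): for each trial we draw i.i.d. standard normal errors ε_a around a common true value V*(s) = 0, and compare the bias of the max estimate with that of the softmax-weighted estimate for several values of τ.

```python
import numpy as np

def softmax_value(q, tau):
    """Softmax-weighted estimate of max_a Q(s,a)."""
    z = tau * (q - q.max())
    w = np.exp(z) / np.exp(z).sum()
    return float(w @ q)

rng = np.random.default_rng(0)
n_trials, n_actions, v_true = 100, 10, 0.0     # Q*(s,a) = V*(s) for all a (assumption A1)

for tau in (1.0, 5.0, 10.0):
    max_bias, soft_bias = [], []
    for _ in range(n_trials):
        q_hat = v_true + rng.standard_normal(n_actions)   # Q_t(s,a) = V*(s) + eps_a (A2)
        max_bias.append(q_hat.max() - v_true)             # overestimation of the max operator
        soft_bias.append(softmax_value(q_hat, tau) - v_true)
    print(f"tau={tau}: max bias {np.mean(max_bias):.3f}, softmax bias {np.mean(soft_bias):.3f}")
```

Consistent with Theorem 4, the softmax bias is never larger than the max bias and grows toward it as τ increases.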

5.2. Gradient Noise Reduction

Mnih et al. (2015) observed that large gradients can be detrimental to the optimization of DQNs. This was noted in the context of large reward values, but it is possible that overestimation bias can also introduce variance in the gradient and have a detrimental effect. To see this, we first notice that the gradient in the DQN can be represented as

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{s,a,r,s'}\Big[\big(Q_\theta(s,a) - Q^{\mathcal{T}}_{\theta^-}(s,a)\big)\, \nabla_\theta Q_\theta(s,a)\Big]. \qquad (5)$$

When the Q-value is overestimated, the magnitude of $\nabla_\theta Q_\theta(s,a)$ in Eq. (5) could become excessively large, and so does the gradient, which may further lead to instability during optimization. This is supported by Figure 3, which shows higher variance in Q-values for DQN vs. the others. Since we show in Section 5.1 that using softmax in lieu of max in the Bellman operator can reduce the overestimation bias, the gradient estimate in Eq. (5) can be more stable with softmax. Note that stabilizing the gradient during DQN training was also shown in van Hasselt et al. (2016b), but in the context of adaptive target normalization.

To evaluate the gradient noise on the Atari games, we report the ℓ2 norm of the gradient in the final fully-connected linear layer, averaged over 50 independent inputs. As shown in Figure 5, S-DQN and S-DDQN achieve smaller gradient norm and variance than their counterparts, which implies that the softmax operator can facilitate variance reduction in the optimization of the neural networks. For the game Asterix, we also observe that the overestimation of the Q-value in DQN eventually causes the gradient to explode, while its competitors avoid this issue by achieving reasonable Q-value estimates. Finally, Figure 6 shows that increasing τ generally leads to higher gradient variance, which is also consistent with our analysis in Theorem 4 that the overestimation monotonically increases w.r.t. τ.

Figure 5. Mean and one standard deviation of the gradient norm on the Atari games, for different methods.

Figure 6. Mean and one standard deviation of the gradient norm on the Atari games, for different values of τ in S-DDQN.

6. A Comparison of Bellman Operators

So far we have covered three different types of Bellman operators, where the softmax and mellowmax variants employ the softmax and the log-sum-exp functions, respectively, in lieu of the max function in the standard Bellman operator. A thorough comparison among these Bellman operators will not only exhibit their differences and connections, but also reveal the trade-offs when selecting them.

Table 3. A comparison of different Bellman operators (B.O.: Bellman optimality; O.R.: overestimation reduction; P.R.: policy representation; D.Q.: double Q-learning).

             B.O.   Tuning   O.R.   P.R.   D.Q.
Max          Yes    No       -      Yes    Yes
Mellowmax    No     Yes      Yes    No     No
Softmax      No     Yes      Yes    Yes    Yes

Table 3 shows the comparison along different criteria: The Bellman equation implies that the optimal greedy policy corresponds to the max operator only. In contrast to the mellowmax and softmax operators, max also does not need to tune an inverse temperature parameter. On the other hand, the overestimation bias rooted in the max operator can be alleviated by either the softmax or the mellowmax operator. Furthermore, as noted in Asadi & Littman (2017), the mellowmax operator itself cannot directly represent a policy, and needs to be transformed into a softmax policy, where numerical methods are necessary to determine the corresponding state-dependent temperature parameters. The lack of an explicit policy representation also prevents the mellowmax operator from being directly applied in double Q-learning.

For mellowmax and softmax, it is interesting to further investigate their relationship, especially given their comparable performance in the presence of function approximation, as shown in Table 2. First, we notice the following equivalence between these two operators, in terms of the Bellman updates: For the softmax-weighted Q-function, i.e., $g_{Q(s',\cdot)}(\tau) = \sum_{a'} \frac{\exp[\tau Q(s',a')]}{\sum_{\bar{a}} \exp[\tau Q(s',\bar{a})]} Q(s',a')$, there exists an $\omega$ such that the mellowmax-weighted Q-function achieves the same value. This can be verified by the facts that both the softmax and mellowmax operators (i) converge to the mean and max operators when the inverse temperature parameter approaches 0 and ∞, respectively; and (ii) are continuous in τ and ω, with [0, ∞) as the support.⁷

Furthermore, we compare via simulation the approximation error to the max function and the overestimation error, for mellowmax and softmax. The setup is as follows: The inverse temperature parameter τ is chosen from a linear grid from 0.01 to 100. We generate standard normal random variables, and compute the weighted Q-functions for different numbers of actions. The approximation error is then measured in terms of the difference between the max function and the corresponding softmax and mellowmax functions. The overestimation error is measured the same way as in Figure 2. The result statistics are reported by averaging over 100 independent trials. Figure 7 shows the trade-off between converging to the Bellman optimality and overestimation reduction in terms of the inverse temperature parameter: the softmax operator approaches the max operator at a faster speed, while the mellowmax operator can further reduce the overestimation error.

Figure 7. (Left) Approximation error to the max function; (Right) Overestimation error, as a function of τ (softmax) or ω (mellowmax).

⁷ The operators remain fundamentally different since this equivalence would hold only for a particular set of Q-values. The conversion between softmax and mellowmax parameters would need to be done on a state-by-state basis, and would change as the Q-values are updated.
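A minimal version of this simulation, under our own assumptions about the unstated details (standard normal Q-values around a true value of zero, a coarse parameter grid, shared parameter values for both operators), might look as follows; it reproduces the qualitative trade-off shown in Figure 7 rather than the exact curves.

```python
import numpy as np

def softmax_value(q, tau):
    z = tau * (q - q.max())
    w = np.exp(z) / np.exp(z).sum()
    return float(w @ q)

def mellowmax_value(q, omega):
    c = q.max()
    return float(c + np.log(np.exp(omega * (q - c)).mean()) / omega)

rng = np.random.default_rng(0)
n_trials, n_actions = 100, 10
params = np.linspace(0.01, 100, 5)              # coarse version of the linear grid

for p in params:
    approx_sm, approx_mm, over_sm, over_mm = [], [], [], []
    for _ in range(n_trials):
        q = rng.standard_normal(n_actions)      # noisy Q-values around a true value of 0
        approx_sm.append(q.max() - softmax_value(q, p))     # distance to the max backup
        approx_mm.append(q.max() - mellowmax_value(q, p))
        over_sm.append(softmax_value(q, p))                 # overestimation vs. true value 0
        over_mm.append(mellowmax_value(q, p))
    print(f"param={p:7.2f}  approx err (sm/mm): {np.mean(approx_sm):.3f}/{np.mean(approx_mm):.3f}"
          f"  overest (sm/mm): {np.mean(over_sm):.3f}/{np.mean(over_mm):.3f}")
```

As the parameter grows, the softmax approximation error shrinks faster than the mellowmax one, while mellowmax keeps a smaller overestimation error, which is the trade-off summarized in Figure 7.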

7. Related Work

Applying the softmax function in Bellman updates has long been viewed as problematic, as Littman (1996) showed, via a counterexample, that the corresponding Boltzmann operator could be expansive in a specific MDP. The primary use of the softmax function in RL has been focused on improving exploration performance (Sutton & Barto, 1998). Our work here instead aims to bring evidence and a new perspective that the softmax Bellman operator can still be beneficial beyond exploration.

The mellowmax operator (Asadi & Littman, 2017) was proposed based on the log-sum-exp function, which can be proven to be a contraction and has received recent attention due to its convenient mathematical properties. The use of the log-sum-exp function in the backup operator dates back to Todorov (2007), where the control cost was regularized by a Kullback-Leibler (KL) divergence on the transition probabilities. Fox et al. (2016) proposed a G-learning scheme with soft updates to reduce the bias in Q-value estimation, by regularizing the cost with a KL divergence on policies. Asadi & Littman (2017) further applied the log-sum-exp function to on-policy updates, and showed that the state-dependent inverse temperature parameter can be numerically computed. More recently, the log-sum-exp function has also been used in the Boltzmann backup operator for entropy-regularized RL (Schulman et al., 2017; Haarnoja et al., 2017; Neu et al., 2017; Nachum et al., 2017). One interesting observation is that the optimal policies in these works admit a form of softmax, and thus can be beneficial for exploration. In Lee et al. (2018), a sparsemax operator based on Tsallis entropy was proposed to improve the error bound for mellowmax. Note that these operators were not shown to explicitly address the instability issues in the DQN, and their corresponding policies were used to improve exploration performance, which is different from our focus here.

Among variants of the DQN in Mnih et al. (2015), DDQN (van Hasselt et al., 2016a) first identified the overestimation bias issue caused by the max operator, and mitigated it via the double Q-learning algorithm (van Hasselt, 2010). Anschel et al. (2017) later demonstrated that averaging over previous Q-value estimates in the learning target can also reduce the bias, although their analysis of the bias reduction is restricted to a small MDP with a special structure. Furthermore, an adaptive normalization scheme was developed in van Hasselt et al. (2016b) and was demonstrated to reduce the gradient norm. The softmax function was also employed in the categorical DQN (Bellemare et al., 2017), but with the different purpose of generating distributions in its distributional Bellman operator. Finally, we note that other variants (Wang et al., 2016; Mnih et al., 2016; Schaul et al., 2016; He et al., 2017; Hessel et al., 2018; Dabney et al., 2018) have also been proposed to improve the vanilla DQN, though they were not explicitly designed to tackle the issues of overestimation error and gradient noise.

8. Conclusion and Future Work

We demonstrate surprising benefits of the softmax Bellman operator when combined with DQNs, which further suggests that the softmax operator can be used as an alternative to double Q-learning for reducing the overestimation bias. Our theoretical results provide new insight into the convergence properties of the softmax Bellman operator, and quantify how it reduces the overestimation bias. To gain a deeper understanding of the relationship among different Bellman operators, we further compare them along different criteria, and show the trade-offs when choosing among them in practice.

An interesting direction for future work is to provide more theoretical analysis of the performance trade-off when selecting the inverse temperature parameter in T_soft, and to subsequently design an efficient cooling scheme.

Acknowledgements

We thank the anonymous reviewers for their helpful comments and suggestions. This work was partially supported by NSF RI grant 1815300. Opinions, findings, conclusions or recommendations herein are those of the authors and not necessarily those of the NSF.

References

Anschel, O., Baram, N., and Shimkin, N. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pp. 176–185, 2017.

Asadi, K. and Littman, M. L. An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pp. 243–252, 2017.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.

Bellman, R. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 202–211. AUAI Press, 2016.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361, 2017.

He, F. S., Liu, Y., Schwing, A. G., and Peng, J. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. In International Conference on Learning Representations, 2017.

Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Lee, K., Choi, S., and Oh, S. Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1466–1473, July 2018.

Littman, M. L. Algorithms for Sequential Decision Making. PhD thesis, Department of Computer Science, Brown University, February 1996.

Mitra, D., Romeo, F., and Sangiovanni-Vincentelli, A. Convergence and finite-time behavior of simulated annealing. Advances in Applied Probability, 18(3):747–771, 1986.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2772–2782, 2017.

Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. Combining policy gradient and Q-learning. In International Conference on Learning Representations, 2017.

Reverdy, P. and Leonard, N. E. Parameter estimation in softmax decision-making models with linear objective functions. IEEE Transactions on Automation Science and Engineering, 13(1):54–67, 2016.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In International Conference on Learning Representations, 2016.

Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Mozer, M. C., Smolensky, P., Touretzky, D. S., Elman, J. L., and Weigend, A. S. (eds.), Proceedings of the 1993 Connectionist Models Summer School. Lawrence Erlbaum, Hillsdale, NJ, 1994.

Thrun, S. B. The role of exploration in learning control. In White, D. A. and Sofge, D. A. (eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pp. 527–559. Van Nostrand Reinhold, New York, NY, 1992.

Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 1369–1376, 2007.

van Hasselt, H. Double Q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621, 2010.

van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2094–2100. AAAI Press, 2016a.

van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295, 2016b.

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and De Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1995–2003, 2016.

Williams, R. J. and Baird III, L. C. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Northeastern University, 1993.