Revisiting the Softmax Bellman Operator: New Benefits and New Perspective
Zhao Song 1 * Ronald E. Parr 1 Lawrence Carin 1
Abstract

The impact of softmax on the value function itself in reinforcement learning (RL) is often viewed as problematic, because it leads to sub-optimal value (or Q) functions and interferes with the contraction properties of the Bellman operator. Surprisingly, despite these concerns, and independent of its effect on exploration, the softmax Bellman operator, when combined with Deep Q-learning, leads to Q-functions with superior policies in practice, even outperforming its double Q-learning counterpart. To better understand how and why this occurs, we revisit theoretical properties of the softmax Bellman operator, and prove that (i) it converges to the standard Bellman operator exponentially fast in the inverse temperature parameter, and (ii) the distance of its Q-function from the optimal one can be bounded. These alone do not explain its superior performance, so we also show that the softmax operator can reduce the overestimation error, which may give some insight into why a sub-optimal operator leads to better performance in the presence of value function approximation. A comparison among different Bellman operators is then presented, showing the trade-offs when selecting them.

1. Introduction

The Bellman equation (Bellman, 1957) has been a fundamental tool in reinforcement learning (RL), as it provides a sufficient condition for the optimal policy in dynamic programming. The use of the max function in the Bellman equation further suggests that the optimal policy should be greedy w.r.t. the Q-values. On the other hand, the trade-off between exploration and exploitation (Thrun, 1992) motivates the use of exploratory and potentially sub-optimal actions during learning, and one commonly used strategy is to add randomness by replacing the max function with the softmax function, as in Boltzmann exploration (Sutton & Barto, 1998). Furthermore, the softmax function is a differentiable approximation to the max function, and hence can facilitate analysis (Reverdy & Leonard, 2016).

The beneficial properties of the softmax Bellman operator stand in contrast to its potentially negative effect on the accuracy of the resulting value or Q-functions. For example, it has been demonstrated that the softmax Bellman operator is not a contraction for certain temperature parameters (Littman, 1996, Page 205). Given this, one might expect that the convenient properties of the softmax Bellman operator would come at the expense of the accuracy of the resulting value or Q-functions, or the quality of the resulting policies. In this paper, we demonstrate that, in the case of deep Q-learning, this expectation is surprisingly incorrect. We combine the softmax Bellman operator with the deep Q-network (DQN) (Mnih et al., 2015) and double DQN (DDQN) (van Hasselt et al., 2016a) algorithms, by replacing the max function in their target networks with the softmax function. We then test the variants on several games in the Arcade Learning Environment (ALE) (Bellemare et al., 2013), a standard large-scale deep RL testbed. The results show that the variants using the softmax Bellman operator can achieve higher test scores, and reduce the Q-value overestimation as well as the gradient noise, on most of them. This effect is independent of exploration and is entirely attributable to the change in the Bellman operator.

This surprising result suggests that a deeper understanding of the softmax Bellman operator is warranted. To this end, we prove that, starting from the same initial Q-values, we can upper and lower bound how far the Q-functions computed with the softmax operator can deviate from those computed with the regular Bellman operator. We further show that the softmax Bellman operator converges to the optimal Bellman operator at an exponential rate w.r.t. the inverse temperature parameter.

1 Duke University. * Work performed as a graduate student at Duke University; now at Baidu Research. Correspondence to: Zhao Song.
We propose to use the softmax Bellman operator, defined as

T_soft Q(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) Σ_{a'} ( exp[τ Q(s', a')] / Σ_{ā} exp[τ Q(s', ā)] ) Q(s', a'),    (3)

where the softmax weights exp[τ Q(s', a')] / Σ_{ā} exp[τ Q(s', ā)] are collectively denoted sm_τ(Q(s', ·)), and τ ≥ 0 denotes the inverse temperature parameter. Note that T_soft will reduce to T when τ → ∞.

The mellowmax operator was introduced in Asadi & Littman (2017) as an alternative to the softmax operator in Eq. (3), defined as

mm_ω Q(s, ·) = log( (1/m) Σ_{a'} exp[ω Q(s, a')] ) / ω,    (4)

where ω > 0 is a tuning parameter. Similar to the softmax operator sm_τ(Q(s', ·)) in Eq. (3), mm_ω converges to the mean and max operators when ω approaches 0 and ∞, respectively. Unlike the softmax operator, however, the mellowmax operator in Eq. (4) does not directly provide a probability distribution over actions, and thus additional steps are needed to obtain a corresponding policy (Asadi & Littman, 2017).
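To make the two operators concrete, the following NumPy sketch computes the softmax weights sm_τ(Q(s, ·)) of Eq. (3), the resulting softmax-weighted value, and the mellowmax value of Eq. (4) for a single state. It is a minimal illustration with toy Q-values of our own choosing; the function names are ours and the snippet is not taken from the paper's released code.

```python
import numpy as np

def softmax_weights(q, tau):
    """sm_tau(Q(s, .)): softmax weights over actions, as in Eq. (3)."""
    z = tau * (q - q.max())          # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def softmax_value(q, tau):
    """Softmax-weighted action value sm_tau(Q(s, .))^T Q(s, .)."""
    return softmax_weights(q, tau) @ q

def mellowmax_value(q, omega):
    """Mellowmax value of Eq. (4): log of the mean of exp(omega * Q), scaled by 1/omega."""
    c = q.max()                      # shift by the max before exponentiating, for stability
    return (np.log(np.mean(np.exp(omega * (q - c)))) + omega * c) / omega

q = np.array([1.0, 0.5, -0.2])       # toy Q-values for one state
for t in [0.0, 1.0, 5.0, 50.0]:
    # mellowmax requires omega > 0, so a tiny omega stands in for the tau = 0 case
    print(t, softmax_value(q, t), mellowmax_value(q, max(t, 1e-6)))
# As tau (or omega) grows, both values approach max_a Q(s, a) = 1.0;
# at tau = 0 the softmax value equals the mean of the Q-values.
```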
In contrast to its typical use to improve exploration (also a purpose for the mellowmax operator in Asadi & Littman (2017)), the softmax Bellman operator in Eq. (3) is combined in this paper with the DQN and DDQN algorithms in an off-policy fashion, where the action is selected according to the same ε-greedy policy as in Mnih et al. (2015).

Even though the softmax Bellman operator was shown in Littman (1996), via a counterexample, not to be a contraction, we are not aware of any published work establishing a performance bound for the softmax Bellman operator in Q-iteration. Furthermore, our surprising results in Section 4 show that test scores of DQNs and DDQNs can be improved by solely replacing the max operator in their target networks with the softmax operator. Such improvements are independent of the exploration strategy and hence motivate an investigation of the theoretical properties of the softmax Bellman operator.

Before presenting our main theoretical results, we first show how to bound the distance between the softmax-weighted and max operators, which is subsequently used in the proofs of the performance bound and the overestimation reduction.

Definition 1. δ_b(s) ≜ sup_Q max_{i,j} |Q(s, a_i) − Q(s, a_j)| denotes the largest distance between Q-functions w.r.t. actions for state s, obtained by taking the supremum over Q-functions and the maximum over all action pairs (the supremum is over all Q-functions that occur during Q-iteration, starting from Q_0 until the iteration terminates).

Lemma 2. By assuming δ_b(s) > 0 (if δ_b(s) = 0, then all actions are equivalent in state s, and softmax = max), we have, ∀Q,

δ_b(s) / (m exp[τ δ_b(s)]) ≤ max_a Q(s, a) − sm_τ(Q(s, ·))ᵀ Q(s, ·) ≤ (m − 1) max{ 1/(τ + 2), 2Q_max/(1 + exp(τ)) },

where Q_max = R_max/(1 − γ) represents the maximum Q-value in Q-iteration with T_soft (the Supplemental Material formally bounds Q_max).

Note that another upper bound for this gap was provided in O'Donoghue et al. (2017). Our proof here uses a different strategy, by considering the possible values of the difference between Q-values for different actions. We further derive a lower bound for this gap.

3.1. Performance Bound for T_soft

The optimal state-action value function Q* is known to be a fixed point of the standard Bellman operator T (Williams & Baird III, 1993), i.e., T Q* = Q*. Since T is a contraction with rate γ, we also know that lim_{k→∞} T^k Q_0 = Q* for arbitrary Q_0. Given these facts, one may wonder how far, in the limit, iteratively applying T_soft in lieu of T to Q_0 will be from Q*, as a function of τ.

Theorem 3. Let T^k Q_0 and T_soft^k Q_0 denote the result of iteratively applying the operators T and T_soft to an initial state-action value function Q_0 for k times. Then,

(I) ∀(s, a), lim sup_{k→∞} T_soft^k Q_0(s, a) ≤ Q*(s, a) and

lim inf_{k→∞} T_soft^k Q_0(s, a) ≥ Q*(s, a) − [γ(m − 1)/(1 − γ)] max{ 1/(τ + 2), 2Q_max/(1 + exp(τ)) }.

(II) T_soft converges to T at an exponential rate in terms of τ, i.e., the upper bound on T^k Q_0 − T_soft^k Q_0 decays exponentially fast as a function of τ; the proof of this part does not depend on the bound in part (I).

A noteworthy point about part (I) of Theorem 3 is that it does not contradict the negative convergence results for T_soft, since its proof and result do not require the convergence of T_soft. Furthermore, this bound implies that non-convergence of T_soft is different from divergence or oscillation with an arbitrary range: even though T_soft may not converge in certain scenarios, the Q-values could still remain within a reasonable range (the lower bound can become loose when τ is extremely small, but this degenerate case has never been used in experiments). T_soft also has other benefits over T, as shown in Section 5.

Note that although the bound for mellowmax under the entropy-regularized MDP framework was shown in Lee et al. (2018) to have a better scalar term, log(m) instead of (m − 1), its convergence rate is linear w.r.t. τ. In contrast, our softmax Bellman operator result has an exponential rate, as shown in part (II), which potentially gives insight into why very large values of τ are not required experimentally in Section 4.
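As a toy illustration of Theorem 3, one can run Q-iteration with T and with T_soft side by side on a small randomly generated MDP and observe that the gap between the two iterates stays bounded and shrinks quickly as τ grows. The sketch below uses our own construction (random transition probabilities and rewards), not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9

# Small random tabular MDP (our own toy construction, not from the paper).
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] transition probabilities
R = rng.uniform(0.0, 1.0, size=(n_s, n_a))         # R[s, a] rewards

def backup(Q, tau=None):
    """One application of T (tau=None, max backup) or T_soft (softmax backup, Eq. (3))."""
    if tau is None:
        V = Q.max(axis=1)
    else:
        W = np.exp(tau * (Q - Q.max(axis=1, keepdims=True)))
        W /= W.sum(axis=1, keepdims=True)
        V = (W * Q).sum(axis=1)                     # sm_tau(Q(s', .))^T Q(s', .)
    return R + gamma * P @ V

def iterate(tau=None, k=500):
    Q = np.zeros((n_s, n_a))                        # start from Q_0 = 0
    for _ in range(k):
        Q = backup(Q, tau)
    return Q

Q_star = iterate()                                  # T^k Q_0, essentially Q*
for tau in [1.0, 5.0, 10.0]:
    gap = np.abs(Q_star - iterate(tau)).max()
    print(f"tau = {tau}: max |T^k Q_0 - T_soft^k Q_0| = {gap:.6f}")
# The gap stays bounded (consistent with Theorem 3(I)) and shrinks quickly as tau grows.
```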
[Figure 1: six columns of learning curves, one per game (Q*Bert, Ms. Pacman, Breakout, Crazy Climber, Asterix, Seaquest), plotting test score against training epoch for S-DQN vs. DQN (top row), S-DDQN vs. DDQN (middle row), and S-DQN vs. DDQN (bottom row).]
Figure 1. Performance of (top row) S-DQN vs. DQN, (middle row) S-DDQN vs. DDQN, and (bottom row) S-DQN vs. DDQN on the Atari games. X-axis and Y-axis correspond to the training epoch and test score, respectively. All curves are averaged over five independent random seeds.

The error bound in the sparse Bellman operator (Lee et al., 2018) improves upon log(m), but is still linear w.r.t. τ. We further empirically illustrate the faster convergence of the softmax Bellman operator in Figure 7 of Section 6.

4. Main Experiments

Our theoretical results apply to the case of a tabular value function representation and known next-state distributions, and they bound suboptimality rather than suggesting superiority of softmax. The goal of our main experiments is to show that, despite being sub-optimal w.r.t. the Q-values, the softmax Bellman operator is still useful in practice, by improving the DQN and DDQN algorithms. Such an improvement is entirely attributable to changing the max operator to the softmax operator in the target network.

Table 1. Mean of test scores for different values of τ in S-DDQN (standard deviation in parentheses). Each game is denoted by its initial. Note that S-DDQN reduces to DDQN when τ = ∞.

τ    1                     5                     10                    ∞
Q    12068.66 (1085.65)    11049.28 (1565.57)    11191.31 (1336.35)    10577.76 (1508.27)
M    2492.40 (183.71)      2566.44 (227.24)      2546.18 (259.82)      2293.73 (160.50)
B    313.08 (20.13)        350.88 (35.58)        303.71 (65.59)        284.64 (60.83)
C    107405.11 (4617.90)   111111.07 (5047.19)   104049.46 (6686.84)   96373.08 (9244.27)
A    3476.91 (460.27)      10266.12 (2682.00)    6588.13 (1183.10)     5523.80 (694.72)
S    272.20 (49.75)        2701.05 (10.06)       6254.01 (697.12)      5695.35 (1862.59)
We combine the softmax operator with DQN and DDQN, by replacing the max function therein with the softmax function, with all other steps kept the same. The corresponding new algorithms are termed S-DQN and S-DDQN, respectively. Their exploration strategies are set to be ε-greedy, the same as in DQN and DDQN. Our code is provided at https://github.com/zhao-song/Softmax-DQN.

We tested on six Atari games: Q*Bert, Ms. Pacman, Crazy Climber, Breakout, Asterix, and Seaquest. Our code is built on the Theano+Lasagne implementation from https://github.com/spragunr/deep_q_rl/. The training contains 200 epochs in total. The test procedures and all the hyperparameters are set the same as for DQN, with details described in Mnih et al. (2015). The inverse temperature parameter τ was selected based on a grid search over {1, 5, 10}. We also implemented the logarithmic cooling scheme (Mitra et al., 1986) from simulated annealing to gradually increase τ, but did not observe a better policy compared with the constant temperature. The result statistics are obtained by running with five independent random seeds.
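For concreteness, the sketch below shows how a DQN-style target could be modified in the way described above: the max over target-network Q-values is replaced by the softmax-weighted sum of Eq. (3), while action selection remains ε-greedy. This is a minimal NumPy illustration of the idea; the function and variable names are ours, and it is not the paper's Theano+Lasagne implementation.

```python
import numpy as np

def dqn_target(rewards, q_next, terminals, gamma=0.99):
    """Standard DQN target: r + gamma * max_a' Q_target(s', a')."""
    return rewards + gamma * (1.0 - terminals) * q_next.max(axis=1)

def softmax_dqn_target(rewards, q_next, terminals, tau=5.0, gamma=0.99):
    """S-DQN target: the max over target-network Q-values is replaced by
    the softmax-weighted sum sm_tau(Q(s', .))^T Q(s', .) from Eq. (3)."""
    z = tau * (q_next - q_next.max(axis=1, keepdims=True))   # stabilize the exponentials
    w = np.exp(z)
    w /= w.sum(axis=1, keepdims=True)
    soft_v = (w * q_next).sum(axis=1)
    return rewards + gamma * (1.0 - terminals) * soft_v

# Toy batch: 2 transitions, 3 actions; q_next holds target-network Q-values.
q_next = np.array([[1.0, 0.2, -0.5],
                   [0.3, 0.4,  0.1]])
r = np.array([0.0, 1.0])
done = np.array([0.0, 0.0])
print(dqn_target(r, q_next, done))          # uses the max backup
print(softmax_dqn_target(r, q_next, done))  # slightly smaller, tau-dependent backup
```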
Figure 1 shows the mean and one standard deviation of the average test score on the Atari games, as a function of the training epoch. In the top two rows, S-DQN and S-DDQN generally achieve higher maximum scores and faster learning than their max counterparts, illustrating the promise of replacing max with softmax. For the game Asterix, often used as an example of the overestimation issue in DQN (van Hasselt et al., 2016a; Anschel et al., 2017), both S-DQN and S-DDQN have much higher test scores than their max counterparts, which suggests that the softmax operator can mitigate the overestimation bias. Although not as dramatic as for Asterix, the bottom row of Figure 1 shows that S-DQN generally outperforms DDQN with higher test scores across the entire test suite, which suggests the possibility of using the softmax operator instead of double Q-learning (van Hasselt, 2010) to reduce the overestimation bias in the presence of function approximation.

Table 1 shows the test scores of S-DDQN for different values of τ, obtained by averaging the scores from the last 10 epochs. Note that S-DDQN achieves higher test scores than its max counterpart for most values of τ. Table 1 further suggests a trade-off when selecting the value of τ: setting τ too small will move T_soft further away from Bellman optimality, since the max operator (corresponding to τ = ∞) is employed in the standard Bellman equation. On the other hand, using a larger τ will lead to the issues of overestimation and high gradient noise, as discussed in Section 5.

To compare against the mellowmax operator, we combine it with DQN in the same off-policy fashion as the softmax operator (we also tried the mellowmax operator for exploration as in Asadi & Littman (2017), but observed much worse performance). Note that the root-finding approach for mellowmax in Asadi & Littman (2017) needs to compute the optimal inverse temperature parameter for every state, and is thus too computationally expensive to be applied here. Consequently, we tune the inverse temperature parameter for both the softmax and mellowmax operators over {1, 5, 10}, and report the best scores. Table 2 shows that both the softmax and mellowmax operators can achieve higher test scores than the max operator. Furthermore, the scores from softmax are higher than or similar to those of mellowmax on all games except Asterix. The competitive performance of the softmax operator here suggests that it is still preferable in certain domains, despite its non-contraction property.

Table 2. Mean of test scores for different Bellman operators (standard deviation in parentheses). The statistics are averaged over five independent random seeds.

Game          Max                   Softmax               Mellowmax
Q*Bert        8331.72 (1597.67)     11307.10 (1332.80)    11775.93 (1173.51)
Ms. Pacman    2368.79 (219.17)      2856.82 (369.75)      2458.76 (130.34)
C. Climber    90923.40 (11059.39)   106422.27 (4821.40)   99601.47 (19271.53)
Breakout      255.32 (64.69)        345.56 (34.19)        355.94 (25.85)
Asterix       196.91 (135.16)       8868.00 (2167.35)     11203.75 (3818.40)
Seaquest      4090.36 (1455.73)     8066.78 (1646.51)     6476.20 (1952.12)

5. Why Softmax Helps?

One may wonder why the softmax Bellman operator can achieve the surprising performance in Section 4, as the greedy policy from the Bellman equation suggests that the max operator should be optimal. The softmax operator is not used for exploration in this paper, nor is it motivated by regularizing the policy, as in entropy-regularized MDPs (Fox et al., 2016; Schulman et al., 2017; Neu et al., 2017; Nachum et al., 2017; Asadi & Littman, 2017; Lee et al., 2018).

As discussed in earlier work (van Hasselt et al., 2016a; Anschel et al., 2017), the max operator leads to a significant overestimation issue for the Q-function in DQNs, which further causes excessive gradient noise that destabilizes the optimization of the neural networks. Here we aim to provide an analysis of how the softmax Bellman operator can overcome these issues, together with the corresponding evidence for overestimation and gradient-noise reduction in DQNs. Although our analysis focuses on the softmax Bellman operator, this work could potentially give further insight into the practical benefits of entropy regularization as well.

5.1. Overestimation Bias Reduction

Q-learning's overestimation bias, due to the max operator, was first discussed in Thrun & Schwartz (1994). It was later shown in van Hasselt et al. (2016a) and Anschel et al. (2017) that overestimation leads to the poor performance of DQNs in some Atari games. Following the same assumptions as van Hasselt et al. (2016a), we can show that the softmax operator reduces the overestimation bias. Furthermore, we quantify the range of the overestimation reduction by providing both lower and upper bounds.

Theorem 4. Given the same assumptions as van Hasselt et al. (2016a), where (A1) there exists some V*(s) such that the true state-action value function satisfies Q*(s, a) = V*(s) for all actions, and (A2) the estimation error is modeled as Q_t(s, a) = V*(s) + ε_a, then

(I) the overestimation errors from T_soft are smaller than or equal to those of T using the max operator, for any τ ≥ 0;

(II) the overestimation reduction obtained by using T_soft in lieu of T is within

[ δ_b(s) / (m exp[τ δ_b(s)]), (m − 1) max{ 1/(τ + 2), 2Q_max/(1 + exp(τ)) } ];
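The following Monte Carlo sketch illustrates claim (I) of Theorem 4 under assumptions (A1)-(A2): when zero-mean estimation noise is added to a constant true value V*(s), the max-based estimate overestimates V*(s), while the softmax-weighted estimate overestimates less for any finite τ. The particular noise model (i.i.d. Gaussian) and the chosen m and τ are our own illustrative assumptions, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, tau, v_star, n_trials = 10, 5.0, 0.0, 100_000

max_est, soft_est = [], []
for _ in range(n_trials):
    # (A1)-(A2): true Q*(s, a) = V*(s) for every action, plus zero-mean noise eps_a.
    q = v_star + rng.normal(0.0, 0.1, size=m)
    w = np.exp(tau * (q - q.max()))
    w /= w.sum()
    max_est.append(q.max())        # max-based estimate of V*(s)
    soft_est.append(w @ q)         # softmax-weighted estimate of V*(s)

print("max bias:    ", np.mean(max_est) - v_star)
print("softmax bias:", np.mean(soft_est) - v_star)
# The softmax bias is smaller than the max bias for any tau >= 0; it is zero at
# tau = 0 (uniform weights average out the noise) and grows toward the max bias
# as tau increases.
```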