Revisiting the Softmax Bellman Operator: New Benefits and New Perspective

Zhao Song 1* Ronald E. Parr 1 Lawrence Carin 1

Abstract

The impact of the softmax on the value function itself in reinforcement learning (RL) is often viewed as problematic because it leads to sub-optimal value (or Q) functions and interferes with the contraction properties of the Bellman operator. Surprisingly, despite these concerns, and independent of its effect on exploration, the softmax Bellman operator, when combined with Deep Q-learning, leads to Q-functions with superior policies in practice, even outperforming its double Q-learning counterpart. To better understand how and why this occurs, we revisit theoretical properties of the softmax Bellman operator, and prove that (i) it converges to the standard Bellman operator exponentially fast in the inverse temperature parameter, and (ii) the distance of its Q-function from the optimal one can be bounded. These alone do not explain its superior performance, so we also show that the softmax operator can reduce the overestimation error, which may give some insight into why a sub-optimal operator leads to better performance in the presence of value function approximation. A comparison among different Bellman operators is then presented, showing the trade-offs when selecting them.

1. Introduction

The Bellman equation (Bellman, 1957) has been a fundamental tool in reinforcement learning (RL), as it provides a sufficient condition for the optimal policy in dynamic programming. The use of the max function in the Bellman equation further suggests that the optimal policy should be greedy w.r.t. the Q-values.

On the other hand, the trade-off between exploration and exploitation (Thrun, 1992) motivates the use of exploratory and potentially sub-optimal actions during learning, and one commonly-used strategy is to add randomness by replacing the max function with the softmax function, as in Boltzmann exploration (Sutton & Barto, 1998). Furthermore, the softmax function is a differentiable approximation to the max function, and hence can facilitate analysis (Reverdy & Leonard, 2016).

The beneficial properties of the softmax Bellman operator are in contrast to its potentially negative effect on the accuracy of the resulting value or Q-functions. For example, it has been demonstrated that the softmax Bellman operator is not a contraction, for certain temperature parameters (Littman, 1996, Page 205). Given this, one might expect that the convenient properties of the softmax Bellman operator would come at the expense of the accuracy of the resulting value or Q-functions, or the quality of the resulting policies. In this paper, we demonstrate that, in the case of deep Q-learning, this expectation is surprisingly incorrect. We combine the softmax Bellman operator with the deep Q-network (DQN) (Mnih et al., 2015) and double DQN (DDQN) (van Hasselt et al., 2016a) algorithms, by replacing the max function therein with the softmax function, in the target network. We then test the variants on several games in the Arcade Learning Environment (ALE) (Bellemare et al., 2013), a standard large-scale deep RL testbed. The results show that the variants using the softmax Bellman operator can achieve higher test scores, and reduce the Q-value overestimation as well as the gradient noise on most of them. This effect is independent of exploration and is entirely attributable to the change in the Bellman operator.

This surprising result suggests that a deeper understanding of the softmax Bellman operator is warranted. To this end, we prove that, starting from the same initial Q-values, we can upper and lower bound how far the Q-functions computed with the softmax operator can deviate from those computed with the regular Bellman operator. We further show that the softmax Bellman operator converges to the optimal Bellman operator at an exponential rate w.r.t. the inverse temperature parameter. This gives insight into why the negative convergence results may not be as discouraging as they initially seem, but it does not explain the superior performance observed in practice. Motivated by recent work (van Hasselt et al., 2016a; Anschel et al., 2017) targeting bias and instability of the original DQN (Mnih et al., 2015), we further investigate whether the softmax Bellman operator can alleviate these issues. As discussed in van Hasselt et al. (2016a), one possible explanation for the poor performance of the vanilla DQN on some Atari games was the overestimation bias when computing the target network, due to the max operator therein. We prove that, given the same assumptions as van Hasselt et al. (2016a), the softmax Bellman operator can reduce the overestimation bias, for any inverse temperature parameter. We also quantify the overestimation reduction by providing its lower and upper bounds.

Our results are complementary to, and add new motivations for, existing work that explores various ways of softening the Bellman operator. For example, entropy regularizers have been used to smooth policies. The motivations for such approaches include computational convenience, exploration, or robustness (Fox et al., 2016; Haarnoja et al., 2017; Schulman et al., 2017; Neu et al., 2017). With respect to value functions, Asadi & Littman (2017) proposed an alternative mellowmax operator and proved that it is a contraction. Their experimental results suggested that it can improve exploration, but the possibility that the sub-optimal Bellman operator could, independent of exploration, lead to superior policies was not considered. Our results, therefore, provide additional motivation for further study of operators such as mellowmax. Although very similar to softmax, the mellowmax operator needs some extra computation to represent a policy, as noted in Asadi & Littman (2017). Our paper discusses this and other trade-offs among different Bellman operators.

The rest of this paper is organized as follows: We provide the necessary background and notation in Section 2. The softmax Bellman operator is introduced in Section 3, where its convergence properties and performance bound are provided. Despite being sub-optimal on Q-functions, the softmax operator is shown in Section 4 to consistently outperform its max counterpart on several Atari games. Such a surprising result further motivates us to investigate why this happens in Section 5. A thorough comparison among different Bellman operators is presented in Section 6. Section 7 discusses the related work and Section 8 concludes this paper.

¹Duke University. *Work performed as a graduate student at Duke University; now at Baidu Research. Correspondence to: Zhao Song, Ronald E. Parr, Lawrence Carin.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

2. Background and Notation

A Markov decision process (MDP) can be represented as a 5-tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition kernel whose element $P(s'|s,a)$ denotes the transition probability from state $s$ to state $s'$ under action $a$, $R$ is a reward function whose element $R(s,a)$ denotes the expected reward for executing action $a$ in state $s$, and $\gamma \in (0,1)$ is the discount factor. The policy $\pi$ in an MDP can be represented in terms of a probability mass function (PMF), where $\pi(s,a) \in [0,1]$ denotes the probability of selecting action $a$ in state $s$, and $\sum_{a \in \mathcal{A}} \pi(s,a) = 1$.

For a given policy $\pi$, its state-action value function $Q^\pi(s,a)$ is defined as the accumulated, expected, discounted reward when taking action $a$ in state $s$ and following policy $\pi$ afterwards, i.e., $Q^\pi(s,a) = \mathbb{E}_{a_t \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$. For the optimal policy $\pi^*$, its corresponding Q-function satisfies the following Bellman equation:

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a').$$

In DQN (Mnih et al., 2015), the Q-function is parameterized with a neural network as $Q_\theta(s,a)$, which takes the state $s$ as input and outputs the corresponding Q-value in the final fully-connected linear layer, for every action $a$. The training objective for the DQN can be represented as

$$\min_\theta \ \frac{1}{2}\Big(Q_\theta(s,a) - \big[R(s,a) + \gamma \max_{a'} Q_{\theta^-}(s',a')\big]\Big)^2, \qquad (1)$$

where $\theta^-$ corresponds to the frozen weights in the target network, and is updated at fixed intervals. The optimization of Eq. (1) is performed via RMSProp (Tieleman & Hinton, 2012), with mini-batches sampled from a replay buffer.

To reduce the overestimation bias, the double DQN (DDQN) algorithm modified the target that $Q_\theta(s,a)$ aims to fit in Eq. (1) as

$$R(s,a) + \gamma\, Q_{\theta^-}\big(s', \arg\max_a Q_{\theta_t}(s',a)\big).$$

Note that a separate network based on the latest estimate $\theta_t$ is employed for action selection, and the evaluation of this policy is due to the frozen network.

Notation. The softmax function is defined as

$$f_\tau(x) = \frac{[\exp(\tau x_1), \exp(\tau x_2), \ldots, \exp(\tau x_m)]^T}{\sum_{i=1}^m \exp(\tau x_i)},$$

where the superscript $T$ denotes the vector transpose. Subsequently, the softmax-weighted function is represented as $g_x(\tau) = f_\tau(x)^T x$, as a function of $\tau$. Also, we define the vector $Q(s,\cdot) = [Q(s,a_1), Q(s,a_2), \ldots, Q(s,a_m)]^T$. We further set $m$ to be the size of the action set $\mathcal{A}$ in the MDP. Finally, $R_{\min}$ and $R_{\max}$ denote the minimum and the maximum immediate rewards, respectively.
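To make Eq. (1) and the DDQN target concrete, the following minimal NumPy sketch computes both targets for a batch of transitions. This is not the paper's released code; the array and function names (dqn_target, q_target_next, q_online_next, etc.) are ours, and the terminal-masking convention is a common implementation detail assumed here.

```python
import numpy as np

def dqn_target(rewards, q_target_next, gamma=0.99, terminal=None):
    """Standard DQN target: r + gamma * max_a' Q_{theta^-}(s', a')."""
    bootstrap = q_target_next.max(axis=1)
    if terminal is not None:
        bootstrap = bootstrap * (1.0 - terminal)   # no bootstrapping at episode end
    return rewards + gamma * bootstrap

def ddqn_target(rewards, q_target_next, q_online_next, gamma=0.99, terminal=None):
    """Double DQN target: the online network selects the action,
    the frozen target network evaluates it."""
    greedy_actions = q_online_next.argmax(axis=1)
    bootstrap = q_target_next[np.arange(len(rewards)), greedy_actions]
    if terminal is not None:
        bootstrap = bootstrap * (1.0 - terminal)
    return rewards + gamma * bootstrap

# Toy batch: 2 transitions, 3 actions.
rewards = np.array([1.0, 0.0])
q_target_next = np.array([[1.0, 2.0, 0.5], [0.2, 0.1, 0.3]])
q_online_next = np.array([[1.5, 1.0, 0.4], [0.0, 0.4, 0.2]])
print(dqn_target(rewards, q_target_next))
print(ddqn_target(rewards, q_target_next, q_online_next))
```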

3. The Softmax Bellman Operator

We start by providing the standard Bellman operator:

$$\mathcal{T} Q(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q(s',a'). \qquad (2)$$

We propose to use the softmax Bellman operator, defined as

$$\mathcal{T}_{\mathrm{soft}} Q(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \underbrace{\sum_{a'} \frac{\exp[\tau Q(s',a')]}{\sum_{\bar{a}} \exp[\tau Q(s',\bar{a})]}\, Q(s',a')}_{sm_\tau(Q(s',\cdot))}, \qquad (3)$$

where $\tau \ge 0$ denotes the inverse temperature parameter. Note that $\mathcal{T}_{\mathrm{soft}}$ will reduce to $\mathcal{T}$ when $\tau \to \infty$.

The mellowmax operator was introduced in Asadi & Littman (2017) as an alternative to the softmax operator in Eq. (3), defined as

$$mm_\omega\, Q(s,\cdot) = \frac{\log\!\big(\frac{1}{m} \sum_{a'} \exp[\omega Q(s,a')]\big)}{\omega}, \qquad (4)$$

where $\omega > 0$ is a tuning parameter. Similar to the softmax operator $sm_\tau\, Q(s',\cdot)$ in Eq. (3), $mm_\omega$ converges to the mean and max operators when $\omega$ approaches 0 and $\infty$, respectively. Unlike the softmax operator, however, the mellowmax operator in Eq. (4) does not directly provide a probability distribution over actions, and thus additional steps are needed to obtain a corresponding policy (Asadi & Littman, 2017).

In contrast to its typical use to improve exploration (also a purpose for the mellowmax operator in Asadi & Littman (2017)), the softmax Bellman operator in Eq. (3) is combined in this paper with the DQN and DDQN algorithms in an off-policy fashion, where the action is selected according to the same ε-greedy policy as in Mnih et al. (2015).

Even though the softmax Bellman operator was shown in Littman (1996), via a counterexample, not to be a contraction, we are not aware of any published work showing a performance bound for using the softmax Bellman operator in Q-iteration. Furthermore, our surprising results in Section 4 show that test scores of DQNs and DDQNs can be improved by solely replacing the max operator in their target networks with the softmax operator. Such improvements are independent of the exploration strategy and hence motivate an investigation of the theoretical properties of the softmax Bellman operator.

Before presenting our main theoretical results, we first show how to bound the distance between the softmax-weighted and max operators, which is subsequently used in the proofs of the performance bound and the overestimation reduction.

Definition 1. $\hat{\Delta}(s) \triangleq \sup_Q \max_{i,j} |Q(s,a_i) - Q(s,a_j)|$ denotes the largest distance between Q-values w.r.t. actions for state $s$, obtained by taking the supremum over Q-functions² and the maximum over all action pairs.

Lemma 2. By assuming $\hat{\Delta}(s) > 0$,¹ we have, $\forall Q$,

$$\frac{\hat{\Delta}(s)}{m \exp[\tau \hat{\Delta}(s)]} \;\le\; \max_a Q(s,a) - f_\tau\big(Q(s,\cdot)\big)^T Q(s,\cdot) \;\le\; (m-1)\, \max\Big\{\frac{1}{\tau+2},\ \frac{2 Q_{\max}}{1+\exp(\tau)}\Big\},$$

where $Q_{\max}$ represents the maximum Q-value in Q-iteration with $\mathcal{T}_{\mathrm{soft}}$.³

Note that another upper bound for this gap was provided in O'Donoghue et al. (2017). Our proof here uses a different strategy, by considering possible values for the difference between Q-values with different actions. We further derive a lower bound for this gap.

3.1. Performance Bound for T_soft

The optimal state-action value function $Q^*$ is known to be a fixed point of the standard Bellman operator (Williams & Baird III, 1993), i.e., $\mathcal{T} Q^* = Q^*$. Since $\mathcal{T}$ is a contraction with rate $\gamma$, we also know that $\lim_{k \to \infty} \mathcal{T}^k Q_0 = Q^*$, for arbitrary $Q_0$. Given these facts, one may wonder, in the limit, how far iteratively applying $\mathcal{T}_{\mathrm{soft}}$, in lieu of $\mathcal{T}$, to $Q_0$ will be away from $Q^*$, as a function of $\tau$.

Theorem 3. Let $\mathcal{T}^k Q_0$ and $\mathcal{T}_{\mathrm{soft}}^k Q_0$ denote the results of iteratively applying the operators $\mathcal{T}$ and $\mathcal{T}_{\mathrm{soft}}$ to an initial state-action value function $Q_0$ for $k$ times. Then,

(I) $\forall (s,a)$, $\limsup_{k\to\infty} \mathcal{T}_{\mathrm{soft}}^k Q_0(s,a) \le Q^*(s,a)$ and

$$\liminf_{k\to\infty} \mathcal{T}_{\mathrm{soft}}^k Q_0(s,a) \;\ge\; Q^*(s,a) - \frac{\gamma (m-1)}{1-\gamma}\, \max\Big\{\frac{1}{\tau+2},\ \frac{2 Q_{\max}}{1+\exp(\tau)}\Big\}.$$

(II) $\mathcal{T}_{\mathrm{soft}}$ converges to $\mathcal{T}$ at an exponential rate in terms of $\tau$, i.e., the upper bound of $\mathcal{T}^k Q_0 - \mathcal{T}_{\mathrm{soft}}^k Q_0$ decays exponentially fast as a function of $\tau$; the proof of this part does not depend on the bound in part (I).

A noteworthy point about part (I) of Theorem 3 is that it does not contradict the negative convergence results for $\mathcal{T}_{\mathrm{soft}}$, since its proof and result do not need to assume the convergence of $\mathcal{T}_{\mathrm{soft}}$. Furthermore, this bound implies that non-convergence for $\mathcal{T}_{\mathrm{soft}}$ is different from divergence or oscillation with an arbitrary range: Even though $\mathcal{T}_{\mathrm{soft}}$ may not converge in certain scenarios, the Q-values could still be within a reasonable range.⁴ $\mathcal{T}_{\mathrm{soft}}$ also has other benefits over $\mathcal{T}$, as shown in Section 5.

Note that although the bound for mellowmax under the entropy-regularized MDP framework was shown in Lee et al. (2018) to have a better scalar term, $\log(m)$ instead of $(m-1)$, its convergence rate is linear w.r.t. $\tau$. In contrast, our softmax Bellman operator result has an exponential rate, as shown in part (II), which potentially gives insight into why very large values of $\tau$ are not required experimentally in Section 4. The error bound in the sparse Bellman operator (Lee et al., 2018) improves upon $\log(m)$, but is still linear w.r.t. $\tau$. We further empirically illustrate the faster convergence of the softmax Bellman operator in Figure 7 of Section 6.

¹ Note that if $\hat{\Delta}(s) = 0$, then all actions are equivalent in state $s$, and softmax = max.
² The $\sup_Q$ is over all Q-functions that occur during Q-iteration, starting from $Q_0$ until the iteration terminates.
³ The Supplemental Material formally bounds $Q_{\max}$.
⁴ The lower bound can become loose when $\tau$ is extremely small, but this degenerate case has never been used in experiments.
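To make the operators in Eqs. (2)-(4) concrete, here is a minimal NumPy sketch of the three backup values (max, softmax-weighted, and mellowmax) for a single vector of next-state Q-values. The function names and the numerical-stability tricks are ours, not from the paper.

```python
import numpy as np

def max_backup(q_next):
    """Backup value used by the standard Bellman operator, Eq. (2)."""
    return float(q_next.max())

def softmax_backup(q_next, tau):
    """Softmax-weighted backup value sm_tau(Q(s',.)) from Eq. (3)."""
    z = tau * (q_next - q_next.max())          # subtract max for numerical stability
    w = np.exp(z) / np.exp(z).sum()            # Boltzmann weights f_tau(Q(s',.))
    return float(w @ q_next)

def mellowmax_backup(q_next, omega):
    """Mellowmax backup value mm_omega(Q(s',.)) from Eq. (4)."""
    m = len(q_next)
    c = q_next.max()                           # log-sum-exp trick for stability
    return float(c + np.log(np.exp(omega * (q_next - c)).sum() / m) / omega)

q_next = np.array([1.0, 0.5, -0.2])
for tau in (1.0, 5.0, 50.0):
    print(tau, softmax_backup(q_next, tau), mellowmax_backup(q_next, tau), max_backup(q_next))
```

As the parameter grows, both the softmax and mellowmax values approach the max backup, matching the limiting behavior discussed above; for small parameters they approach the mean of the Q-values.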

Figure 1. Performance of (top row) S-DQN vs. DQN, (middle row) S-DDQN vs. DDQN, and (bottom row) S-DQN vs. DDQN on the Atari games. The x-axis and y-axis correspond to the training epoch and test score, respectively. All curves are averaged over five independent random seeds.

4. Main Experiments

Our theoretical results apply to the case of a tabular value function representation and known next-state distributions, and they bound suboptimality rather than suggesting superiority of softmax. The goal of our main experiments is to show that, despite being sub-optimal w.r.t. the Q-values, the softmax Bellman operator is still useful in practice by improving the DQN and DDQN algorithms. Such an improvement is entirely attributable to changing the max operator to the softmax operator in the target network.

Table 1. Mean of test scores for different values of τ in S-DDQN (standard deviation in parentheses). Each game is denoted with its initial. Note that S-DDQN reduces to DDQN when τ = ∞.

      τ = 1                τ = 5                τ = 10               τ = ∞
Q     12068.66 (1085.65)   11049.28 (1565.57)   11191.31 (1336.35)   10577.76 (1508.27)
M     2492.40 (183.71)     2566.44 (227.24)     2546.18 (259.82)     2293.73 (160.50)
B     313.08 (20.13)       350.88 (35.58)       303.71 (65.59)       284.64 (60.83)
C     107405.11 (4617.90)  111111.07 (5047.19)  104049.46 (6686.84)  96373.08 (9244.27)
A     3476.91 (460.27)     10266.12 (2682.00)   6588.13 (1183.10)    5523.80 (694.72)
S     272.20 (49.75)       2701.05 (10.06)      6254.01 (697.12)     5695.35 (1862.59)
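As a concrete illustration of the only change involved, the S-DQN target simply swaps the max in the DQN target of Eq. (1) for the softmax-weighted value sm_τ from Eq. (3). This is a sketch under the same batched naming conventions as the DQN target sketch in Section 2, not the released implementation.

```python
import numpy as np

def sdqn_target(rewards, q_target_next, tau=5.0, gamma=0.99, terminal=None):
    """S-DQN target: r + gamma * sm_tau(Q_{theta^-}(s', .)),
    i.e., the DQN target with max replaced by the softmax-weighted value."""
    z = tau * (q_target_next - q_target_next.max(axis=1, keepdims=True))
    w = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   # Boltzmann weights per transition
    bootstrap = (w * q_target_next).sum(axis=1)            # softmax-weighted backup value
    if terminal is not None:
        bootstrap = bootstrap * (1.0 - terminal)
    return rewards + gamma * bootstrap
```

Exploration (ε-greedy), the replay buffer, and the optimizer are left untouched, which is what allows the gains in Figure 1 to be attributed to the operator alone.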

We combine the softmax operator with DQN and DDQN, by replacing the max function therein with the softmax function, with all other steps being the same. The corresponding new algorithms are termed S-DQN and S-DDQN, respectively. Their exploration strategies are set to be ε-greedy, the same as DQN and DDQN. Our code is provided at https://github.com/zhao-song/Softmax-DQN.

We tested on six Atari games: Q*Bert, Ms. Pacman, Crazy Climber, Breakout, Asterix, and Seaquest. Our code is built on the Theano+Lasagne implementation from https://github.com/spragunr/deep_q_rl/. The training contains 200 epochs in total. The test procedures and all the hyperparameters are set the same as DQN, with details described in Mnih et al. (2015). The inverse temperature parameter τ was selected based on a grid search over {1, 5, 10}. We also implemented the logarithmic cooling scheme (Mitra et al., 1986) from simulated annealing to gradually increase τ, but did not observe a better policy compared with the constant temperature. The result statistics are obtained by running with five independent random seeds.

Figure 1 shows the mean and one standard deviation of the average test score on the Atari games, as a function of the training epoch: In the top two rows, S-DQN and S-DDQN generally achieve higher maximum scores and faster learning than their max counterparts, illustrating the promise of replacing max with softmax. For the game Asterix, often used as an example of the overestimation issue in DQN (van Hasselt et al., 2016a; Anschel et al., 2017), both S-DQN and S-DDQN have much higher test scores than their max counterparts, which suggests that the softmax operator can mitigate the overestimation bias. Although not as dramatic as for Asterix, the bottom row of Figure 1 shows that S-DQN generally outperforms DDQN with higher test scores for the entire test suite, which suggests the possibility of using the softmax operator instead of double Q-learning (van Hasselt, 2010) to reduce the overestimation bias, in the presence of function approximation.

Table 1 shows the test scores of S-DDQN with different values of τ, obtained by averaging the scores from the last 10 epochs. Note that S-DDQN achieves higher test scores than its max counterpart for most values of τ. Table 1 further suggests a trade-off when selecting the value of τ: Setting τ too small will move T_soft further away from Bellman optimality, since the max operator (corresponding to τ = ∞) is employed in the standard Bellman equation. On the other hand, using a larger τ will lead to the issues of overestimation and high gradient noise, to be discussed in Section 5.

Table 2. Mean of test scores for different Bellman operators (standard deviation in parentheses). The statistics are averaged over five independent random seeds.

              Max                   Softmax               Mellowmax
Q*Bert        8331.72 (1597.67)     11307.10 (1332.80)    11775.93 (1173.51)
Ms. Pacman    2368.79 (219.17)      2856.82 (369.75)      2458.76 (130.34)
C. Climber    90923.40 (11059.39)   106422.27 (4821.40)   99601.47 (19271.53)
Breakout      255.32 (64.69)        345.56 (34.19)        355.94 (25.85)
Asterix       196.91 (135.16)       8868.00 (2167.35)     11203.75 (3818.40)
Seaquest      4090.36 (1455.73)     8066.78 (1646.51)     6476.20 (1952.12)

To compare against the mellowmax operator, we combine it with DQN in the same off-policy fashion as the softmax operator.⁵ Note that the root-finding approach for mellowmax in Asadi & Littman (2017) needs to compute the optimal inverse temperature parameter for every state, and thus is too computationally expensive to be applied here. Consequently, we tune the inverse temperature parameter for both the softmax and mellowmax operators over {1, 5, 10}, and report the best scores. Table 2 shows that both the softmax and mellowmax operators can achieve higher test scores than the max operator. Furthermore, the scores from softmax are higher than or similar to those of mellowmax on all games except Asterix. The competitive performance of the softmax operator here suggests that it is still preferable in certain domains, despite its non-contraction property.

⁵ We tried the mellowmax operator for exploration as in Asadi & Littman (2017), but observed much worse performance.

5. Why Softmax Helps?

One may wonder why the softmax Bellman operator can achieve the surprising performance in Section 4, as the greedy policy from the Bellman equation suggests that the max operator should be optimal. The softmax operator is not used for exploration in this paper, nor is it motivated by regularizing the policy, as in entropy-regularized MDPs (Fox et al., 2016; Schulman et al., 2017; Neu et al., 2017; Nachum et al., 2017; Asadi & Littman, 2017; Lee et al., 2018).

As discussed in earlier work (van Hasselt et al., 2016a; Anschel et al., 2017), the max operator leads to a significant overestimation issue for the Q-function in DQNs, which further causes excessive gradient noise that destabilizes the optimization of the neural networks. Here we aim to provide an analysis of how the softmax Bellman operator can overcome these issues, together with the corresponding evidence for overestimation and gradient-noise reduction in DQNs. Although our analysis is focused on the softmax Bellman operator, this work could potentially give further insight into practical benefits of entropy regularization as well.

5.1. Overestimation Bias Reduction

Q-learning's overestimation bias, due to the max operator, was first discussed in Thrun & Schwartz (1994). It was later shown in van Hasselt et al. (2016a) and Anschel et al. (2017) that overestimation leads to the poor performance of DQNs on some Atari games. Following the same assumptions as van Hasselt et al. (2016a), we can show that the softmax operator reduces the overestimation bias. Furthermore, we quantify the range of the overestimation reduction by providing both lower and upper bounds.

Theorem 4. Given the same assumptions as van Hasselt et al. (2016a), where (A1) there exists some $V^*(s)$ such that the true state-action value function satisfies $Q^*(s,a) = V^*(s)$ for different actions, and (A2) the estimation error is modeled as $Q_t(s,a) = V^*(s) + \epsilon_a$, then

(I) the overestimation errors from $\mathcal{T}_{\mathrm{soft}}$ are smaller than or equal to those of using the max operator $\mathcal{T}$, for any $\tau \ge 0$;

(II) the overestimation reduction by using $\mathcal{T}_{\mathrm{soft}}$ in lieu of $\mathcal{T}$ is within $\Big[\frac{\hat{\Delta}(s)}{m \exp[\tau \hat{\Delta}(s)]},\ (m-1)\max\big\{\frac{1}{\tau+2}, \frac{2 Q_{\max}}{1+\exp(\tau)}\big\}\Big]$;

(III) the overestimation error for $\mathcal{T}_{\mathrm{soft}}$ monotonically increases w.r.t. $\tau \in [0, \infty)$.

An observation about Theorem 4 is that, for any positive value of τ, there will still be some potential for overestimation bias, because noise can also influence the softmax operator. Depending upon the amount of noise, it is possible that, unlike double DQN, the reduction caused by softmax could exceed the bias introduced by max. This can be seen in our experimental results below, which show that it is possible to have negative error (overcompensation) from the use of softmax. However, when combined with double Q-learning, this effect becomes very small and decreases with the number of actions.

To elucidate Theorem 4, we simulate standard normal variables ε_a with 100 independent trials for each action a, using the same setup as van Hasselt et al. (2016a). Figure 2 shows the mean and one standard deviation of the overestimation bias, for different values of τ in the softmax operator. Both the single and double implementations of the softmax operator achieve smaller overestimation errors than their max counterparts, thus validating our theoretical results. Note that the gap becomes smaller when τ increases, which is intuitive since T_soft → T as τ → ∞, and is also consistent with the monotonicity result in Theorem 4.

Figure 2. The mean and one standard deviation of the overestimation error for different values of τ.

We further check the estimated Q-values on the Atari games,⁶ by plotting the corresponding mean and one standard deviation in Figure 3. Both S-DQN and S-DDQN achieve smaller values than DQN, which verifies that the softmax operator can reduce the overestimation bias. This reduction can partly explain the higher test scores for S-DQN and S-DDQN in Figure 1. Moreover, we plot the estimated Q-values of S-DDQN for different values of τ in Figure 4, and the result is consistent with Theorem 4 in that increasing τ leads to larger estimated Q-values and, thus, more risk of overestimation. It also shows that picking a value of τ that is very small may introduce a risk of underestimation that can hurt performance, as shown for Asterix with τ = 1. (See also Table 1.)

Figure 3. Mean and one standard deviation of the estimated Q-values on the Atari games, for different methods.

Figure 4. Mean and one standard deviation of the estimated Q-values on the Atari games, for different values of τ in S-DDQN.

⁶ Due to the space limit, we plot fewer games from here on, and provide the full plots in the Supplemental Material.
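The following self-contained simulation mirrors the setup described above (our own sketch, not the authors' code): for each trial we draw i.i.d. standard normal errors ε_a around a common true value V*(s) = 0, and compare the bias of the max estimate with that of the softmax-weighted estimate for several values of τ.

```python
import numpy as np

def softmax_value(q, tau):
    """Softmax-weighted estimate of max_a Q(s,a)."""
    z = tau * (q - q.max())
    w = np.exp(z) / np.exp(z).sum()
    return float(w @ q)

rng = np.random.default_rng(0)
n_trials, n_actions, v_true = 100, 10, 0.0     # Q*(s,a) = V*(s) for all a (assumption A1)

for tau in (1.0, 5.0, 10.0):
    max_bias, soft_bias = [], []
    for _ in range(n_trials):
        q_hat = v_true + rng.standard_normal(n_actions)   # Q_t(s,a) = V*(s) + eps_a (A2)
        max_bias.append(q_hat.max() - v_true)             # overestimation of the max operator
        soft_bias.append(softmax_value(q_hat, tau) - v_true)
    print(f"tau={tau}: max bias {np.mean(max_bias):.3f}, softmax bias {np.mean(soft_bias):.3f}")
```

Consistent with Theorem 4, the softmax bias is never larger than the max bias and grows toward it as τ increases.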

5.2. Gradient Noise Reduction

Mnih et al. (2015) observed that large gradients can be detrimental to the optimization of DQNs. This was noted in the context of large reward values, but it is possible that overestimation bias can also introduce variance in the gradient and have a detrimental effect. To see this, we first notice that the gradient in the DQN can be represented as

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{s,a,r,s'}\Big[\big(Q_\theta(s,a) - Q^{\mathcal{T}}_{\theta^-}(s,a)\big)\, \nabla_\theta Q_\theta(s,a)\Big]. \qquad (5)$$

When the Q-value is overestimated, the magnitude of $\nabla_\theta Q_\theta(s,a)$ in Eq. (5) could become excessively large, and so does the gradient, which may further lead to instability during optimization. This is supported by Figure 3, which shows higher variance in Q-values for DQN vs. the others. Since we show in Section 5.1 that using softmax in lieu of max in the Bellman operator can reduce the overestimation bias, the gradient estimate in Eq. (5) can be more stable with softmax. Note that stabilizing the gradient during DQN training was also shown in van Hasselt et al. (2016b), but in the context of adaptive target normalization.

To evaluate the gradient noise on the Atari games, we report the ℓ2 norm of the gradient in the final fully-connected linear layer, averaged over 50 independent inputs. As shown in Figure 5, S-DQN and S-DDQN achieve smaller gradient norm and variance than their counterparts, which implies that the softmax operator can facilitate variance reduction in the optimization of the neural networks. For the game Asterix, we also observe that the overestimation of the Q-value in DQN eventually causes the gradient to explode, while its competitors avoid this issue by achieving reasonable Q-value estimates. Finally, Figure 6 shows that increasing τ generally leads to higher gradient variance, which is also consistent with our analysis in Theorem 4 that the overestimation monotonically increases w.r.t. τ.

Figure 5. Mean and one standard deviation of the gradient norm on the Atari games, for different methods.

Figure 6. Mean and one standard deviation of the gradient norm on the Atari games, for different values of τ in S-DDQN.

6. A Comparison of Bellman Operators

So far we have covered three different types of Bellman operators, where the softmax and mellowmax variants employ the softmax and the log-sum-exp functions, respectively, in lieu of the max function in the standard Bellman operator. A thorough comparison among these Bellman operators will not only exhibit their differences and connections, but also reveal the trade-offs when selecting them.

Table 3. A comparison of different Bellman operators (B.O.: Bellman optimality; O.R.: overestimation reduction; P.R.: policy representation; D.Q.: double Q-learning).

             B.O.   Tuning   O.R.   P.R.   D.Q.
Max          Yes    No       -      Yes    Yes
Mellowmax    No     Yes      Yes    No     No
Softmax      No     Yes      Yes    Yes    Yes

Table 3 shows the comparison along different criteria: The Bellman equation implies that the optimal greedy policy corresponds to the max operator only. In contrast to the mellowmax and softmax operators, max also does not need to tune an inverse temperature parameter. On the other hand, the overestimation bias rooted in the max operator can be alleviated by either the softmax or the mellowmax operator. Furthermore, as noted in Asadi & Littman (2017), the mellowmax operator itself cannot directly represent a policy, and needs to be transformed into a softmax policy, where numerical methods are necessary to determine the corresponding state-dependent temperature parameters. The lack of an explicit policy representation also prevents the mellowmax operator from being directly applied in double Q-learning.

For mellowmax and softmax, it is interesting to further investigate their relationship, especially given their comparable performance in the presence of function approximation, as shown in Table 2. First, we notice the following equivalence between these two operators, in terms of the Bellman updates: For the softmax-weighted Q-function, i.e., $g_{Q(s',\cdot)}(\tau) = \sum_{a'} \frac{\exp[\tau Q(s',a')]}{\sum_{\bar{a}} \exp[\tau Q(s',\bar{a})]} Q(s',a')$, there exists an $\omega$ such that the mellowmax-weighted Q-function achieves the same value. This can be verified by the facts that both the softmax and mellowmax operators (i) converge to the mean and max operators when the inverse temperature parameter approaches 0 and ∞, respectively; and (ii) are continuous in τ and ω, with [0, ∞) as the support.⁷

Furthermore, we compare via simulation the approximation error to the max function and the overestimation error, for mellowmax and softmax. The setup is as follows: The inverse temperature parameter τ is chosen from a linear grid from 0.01 to 100. We generate standard normal random variables, and compute the weighted Q-functions for different numbers of actions. The approximation error is then measured in terms of the difference between the max function and the corresponding softmax and mellowmax functions. The overestimation error is measured the same way as in Figure 2. The result statistics are reported by averaging over 100 independent trials. Figure 7 shows the trade-off between converging to the Bellman optimality and overestimation reduction in terms of the inverse temperature parameter: the softmax operator approaches the max operator at a faster speed, while the mellowmax operator can further reduce the overestimation error.

Figure 7. (Left) Approximation error to the max function; (Right) Overestimation error, as a function of τ (softmax) or ω (mellowmax).

⁷ The operators remain fundamentally different since this equivalence would hold only for a particular set of Q-values. The conversion between softmax and mellowmax parameters would need to be done on a state-by-state basis, and would change as the Q-values are updated.
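A minimal version of this simulation, under our own assumptions about the unstated details (standard normal Q-values around a true value of zero, a coarse parameter grid, shared parameter values for both operators), might look as follows; it reproduces the qualitative trade-off shown in Figure 7 rather than the exact curves.

```python
import numpy as np

def softmax_value(q, tau):
    z = tau * (q - q.max())
    w = np.exp(z) / np.exp(z).sum()
    return float(w @ q)

def mellowmax_value(q, omega):
    c = q.max()
    return float(c + np.log(np.exp(omega * (q - c)).mean()) / omega)

rng = np.random.default_rng(0)
n_trials, n_actions = 100, 10
params = np.linspace(0.01, 100, 5)              # coarse version of the linear grid

for p in params:
    approx_sm, approx_mm, over_sm, over_mm = [], [], [], []
    for _ in range(n_trials):
        q = rng.standard_normal(n_actions)      # noisy Q-values around a true value of 0
        approx_sm.append(q.max() - softmax_value(q, p))     # distance to the max backup
        approx_mm.append(q.max() - mellowmax_value(q, p))
        over_sm.append(softmax_value(q, p))                 # overestimation vs. true value 0
        over_mm.append(mellowmax_value(q, p))
    print(f"param={p:7.2f}  approx err (sm/mm): {np.mean(approx_sm):.3f}/{np.mean(approx_mm):.3f}"
          f"  overest (sm/mm): {np.mean(over_sm):.3f}/{np.mean(over_mm):.3f}")
```

As the parameter grows, the softmax approximation error shrinks faster than the mellowmax one, while mellowmax keeps a smaller overestimation error, which is the trade-off summarized in Figure 7.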

7. Related Work

Applying the softmax function in Bellman updates has long been viewed as problematic, as Littman (1996) showed, via a counterexample, that the corresponding Boltzmann operator could be expansive in a specific MDP. The primary use of the softmax function in RL has been focused on improving exploration performance (Sutton & Barto, 1998). Our work here instead aims to bring evidence and a new perspective that the softmax Bellman operator can still be beneficial beyond exploration.

The mellowmax operator (Asadi & Littman, 2017) was proposed based on the log-sum-exp function, which can be proven to be a contraction and has received recent attention due to its convenient mathematical properties. The use of the log-sum-exp function in the backup operator dates back to Todorov (2007), where the control cost was regularized by a Kullback-Leibler (KL) divergence on the transition probabilities. Fox et al. (2016) proposed a G-learning scheme with soft updates to reduce the bias in Q-value estimation, by regularizing the cost with a KL divergence on policies. Asadi & Littman (2017) further applied the log-sum-exp function to on-policy updates, and showed that the state-dependent inverse temperature parameter can be numerically computed. More recently, the log-sum-exp function has also been used in the Boltzmann backup operator for entropy-regularized RL (Schulman et al., 2017; Haarnoja et al., 2017; Neu et al., 2017; Nachum et al., 2017). One interesting observation is that the optimal policies in these works admit a form of softmax, and thus can be beneficial for exploration. In Lee et al. (2018), a sparsemax operator based on Tsallis entropy was proposed to improve the error bound for mellowmax. Note that these operators were not shown to explicitly address the instability issues in the DQN, and their corresponding policies were used to improve exploration performance, which is different from our focus here.

Among variants of the DQN in Mnih et al. (2015), DDQN (van Hasselt et al., 2016a) first identified the overestimation bias issue caused by the max operator, and mitigated it via the double Q-learning algorithm (van Hasselt, 2010). Anschel et al. (2017) later demonstrated that averaging over previous Q-value estimates in the learning target can also reduce the bias, although their analysis of the bias reduction is restricted to a small MDP with a special structure. Furthermore, an adaptive normalization scheme was developed in van Hasselt et al. (2016b) and was demonstrated to reduce the gradient norm. The softmax function was also employed in the categorical DQN (Bellemare et al., 2017), but with the different purpose of generating distributions in its distributional Bellman operator. Finally, we note that other variants (Wang et al., 2016; Mnih et al., 2016; Schaul et al., 2016; He et al., 2017; Hessel et al., 2018; Dabney et al., 2018) have also been proposed to improve the vanilla DQN, though they were not explicitly designed to tackle the issues of overestimation error and gradient noise.

8. Conclusion and Future Work

We demonstrate surprising benefits of the softmax Bellman operator when combined with DQNs, which further suggests that the softmax operator can be used as an alternative to double Q-learning for reducing the overestimation bias. Our theoretical results provide new insight into the convergence properties of the softmax Bellman operator, and quantify how it reduces the overestimation bias. To gain a deeper understanding of the relationship among different Bellman operators, we further compare them along different criteria, and show the trade-offs when choosing among them in practice.

An interesting direction for future work is to provide more theoretical analysis of the performance trade-off when selecting the inverse temperature parameter in T_soft, and to subsequently design an efficient cooling scheme.

Acknowledgements

We thank the anonymous reviewers for their helpful comments and suggestions. This work was partially supported by NSF RI grant 1815300. Opinions, findings, conclusions or recommendations herein are those of the authors and not necessarily those of the NSF.

References

Anschel, O., Baram, N., and Shimkin, N. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pp. 176–185, 2017.

Asadi, K. and Littman, M. L. An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pp. 243–252, 2017.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.

Bellman, R. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 202–211. AUAI Press, 2016.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361, 2017.

He, F. S., Liu, Y., Schwing, A. G., and Peng, J. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. In International Conference on Learning Representations, 2017.

Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Lee, K., Choi, S., and Oh, S. Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1466–1473, July 2018.

Littman, M. L. Algorithms for Sequential Decision Making. PhD thesis, Department of Computer Science, Brown University, February 1996.

Mitra, D., Romeo, F., and Sangiovanni-Vincentelli, A. Convergence and finite-time behavior of simulated annealing. Advances in Applied Probability, 18(3):747–771, 1986.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2772–2782, 2017.

Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. Combining policy gradient and Q-learning. In International Conference on Learning Representations, 2017.

Reverdy, P. and Leonard, N. E. Parameter estimation in softmax decision-making models with linear objective functions. IEEE Transactions on Automation Science and Engineering, 13(1):54–67, 2016.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In International Conference on Learning Representations, 2016.

Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Mozer, M. C., Smolensky, P., Touretzky, D. S., Elman, J. L., and Weigend, A. S. (eds.), Proceedings of the 1993 Connectionist Models Summer School. Lawrence Erlbaum, Hillsdale, NJ, 1994.

Thrun, S. B. The role of exploration in learning control. In White, D. A. and Sofge, D. A. (eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pp. 527–559. Van Nostrand Reinhold, New York, NY, 1992.

Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 1369–1376, 2007.

van Hasselt, H. Double Q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621, 2010.

van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2094–2100. AAAI Press, 2016a.

van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., and Silver, D. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295, 2016b.

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and De Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1995–2003, 2016.

Williams, R. J. and Baird III, L. C. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Northeastern University, 1993.