Revisiting the Softmax Bellman Operator: New Benefits and New Perspective
Zhao Song¹* Ronald E. Parr¹ Lawrence Carin¹
Abstract

The impact of softmax on the value function in reinforcement learning (RL) is often viewed as problematic, because it leads to sub-optimal value (or Q) functions and interferes with the contraction properties of the Bellman operator. Surprisingly, despite these concerns, and independent of its effect on exploration, the softmax Bellman operator, when combined with deep Q-learning, leads to Q-functions with superior policies in practice, even outperforming its double Q-learning counterpart. To better understand how and why this occurs, we revisit theoretical properties of the softmax Bellman operator, and prove that (i) it converges to the standard Bellman operator exponentially fast in the inverse temperature parameter, and (ii) the distance of its Q-function from the optimal one can be bounded. These alone do not explain its superior performance, so we also show that the softmax operator can reduce the overestimation error, which may give some insight into why a sub-optimal operator leads to better performance in the presence of value function approximation. A comparison among different Bellman operators is then presented, showing the trade-offs when selecting them.

1. Introduction

The Bellman equation (Bellman, 1957) has been a fundamental tool in reinforcement learning (RL), as it provides a sufficient condition for the optimal policy in dynamic programming. The use of the max function in the Bellman equation further suggests that the optimal policy should be greedy w.r.t. the Q-values. On the other hand, the trade-off between exploration and exploitation (Thrun, 1992) motivates the use of exploratory and potentially sub-optimal actions during learning, and one commonly used strategy is to add randomness by replacing the max function with the softmax function, as in Boltzmann exploration (Sutton & Barto, 1998). Furthermore, the softmax function is a differentiable approximation to the max function, and hence can facilitate analysis (Reverdy & Leonard, 2016).

The beneficial properties of the softmax Bellman operator are in contrast to its potentially negative effect on the accuracy of the resulting value or Q-functions. For example, it has been demonstrated that the softmax Bellman operator is not a contraction for certain temperature parameters (Littman, 1996, Page 205). Given this, one might expect that the convenient properties of the softmax Bellman operator would come at the expense of the accuracy of the resulting value or Q-functions, or the quality of the resulting policies. In this paper, we demonstrate that, in the case of deep Q-learning, this expectation is surprisingly incorrect.

We combine the softmax Bellman operator with the deep Q-network (DQN) (Mnih et al., 2015) and double DQN (DDQN) (van Hasselt et al., 2016a) algorithms, by replacing the max function therein with the softmax function, in the target network. We then test the variants on several games in the Arcade Learning Environment (ALE) (Bellemare et al., 2013), a standard large-scale deep RL testbed. The results show that the variants using the softmax Bellman operator can achieve higher test scores, and reduce the Q-value overestimation as well as the gradient noise on most of them. This effect is independent of exploration and is entirely attributable to the change in the Bellman operator. This surprising result suggests that a deeper understanding of the softmax Bellman operator is warranted. To this end, we prove that, starting from the same initial Q-values, we can upper and lower bound how far the Q-functions computed with the softmax operator can deviate from those computed with the regular Bellman operator. We further show that the softmax Bellman operator converges to the optimal Bellman operator at an exponential rate w.r.t. the inverse temperature parameter.

¹Duke University. *Work performed as a graduate student at Duke University; now at Baidu Research. Correspondence to: Zhao Song.
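To make the role of the inverse temperature concrete, the following sketch (our own illustration with hypothetical Q-values, not code from the paper) shows how a softmax-weighted backup interpolates between the mean and the max of the Q-values as τ grows, which is the sense in which softmax is a differentiable approximation to max:

```python
import numpy as np

def softmax_backup(q, tau):
    """Softmax-weighted value of a Q-vector.

    tau is the inverse temperature: tau = 0 gives the mean of q,
    and tau -> infinity recovers max(q)."""
    w = np.exp(tau * (q - q.max()))   # shift by max for numerical stability
    w /= w.sum()                      # softmax weights over actions
    return float(w @ q)               # weighted average of Q-values

q = np.array([1.0, 2.0, 3.5])         # hypothetical Q-values for one state
print(softmax_backup(q, 0.0))         # ~ mean(q) ≈ 2.1667
print(softmax_backup(q, 100.0))       # ~ max(q) = 3.5
```

As τ increases, the weight mass concentrates on the maximizing action, so the backup approaches the greedy (max) backup.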
The rest of this paper is organized as follows: We provide the necessary background and notation in Section 2. The softmax Bellman operator is introduced in Section 3, where its convergence properties and performance bound are provided. Despite being sub-optimal on Q-functions, the softmax operator is shown in Section 4 to consistently outperform its max counterpart on several Atari games. This surprising result further motivates us to investigate why this happens in Section 5. A thorough comparison among different Bellman operators is presented in Section 6. Section 7 discusses the related work and Section 8 concludes this paper.

2. Background and Notation

A Markov decision process (MDP) can be represented as a 5-tuple ⟨S, A, P, R, γ⟩, where S is the state space, A is the action space, P is the transition kernel whose element P(s′|s, a) denotes the transition probability from state s to state s′ under action a, R is a reward function whose element R(s, a) denotes the expected reward for executing action a in state s, and γ ∈ (0, 1) is the discount factor. The policy π in an MDP can be represented in terms of

For DDQN (van Hasselt et al., 2016a), the target value takes the form

    R(s, a) + γ Q_θ(s′, arg max_a Q_θt(s′, a)).

Note that a separate network based on the latest estimate θt is employed for action selection, and the evaluation of this policy is due to the frozen network.

Notation. The softmax function is defined as

    f_τ(x) = [exp(τx₁), exp(τx₂), ..., exp(τx_m)]ᵀ / Σ_{i=1}^m exp(τx_i),

where the superscript T denotes the vector transpose. Subsequently, the softmax-weighted function is represented as g_x(τ) = f_τ(x)ᵀ x, as a function of τ. Also, we define the vector Q(s, ·) = [Q(s, a₁), Q(s, a₂), ..., Q(s, a_m)]ᵀ. We further set m to be the size of the action set A in the MDP. Finally, R_min and R_max denote the minimum and the maximum immediate rewards, respectively.

3. The Softmax Bellman Operator

We start by providing the standard Bellman operator:

    (T Q)(s, a) = R(s, a) + γ Σ_{s′} P(s′|s, a) max_{a′} Q(s′, a′).    (2)
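As an illustration of operator (2) next to a softmax variant built from the softmax-weighted function g above (a toy MDP of our own construction, not an experiment from the paper), the sketch below applies both operators and checks that the gap between them shrinks as the inverse temperature τ increases:

```python
import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
gamma = 0.9
R = np.array([[0.0, 1.0],
              [2.0, 0.5]])                    # R[s, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],      # P[s, a, s'] transition kernel
              [[0.5, 0.5], [0.3, 0.7]]])

def bellman(Q):
    """Standard operator (2): (T Q)(s,a) = R + gamma * sum_s' P * max_a' Q(s',a')."""
    return R + gamma * P @ Q.max(axis=1)

def soft_bellman(Q, tau):
    """Variant with max replaced by the softmax-weighted backup g_{Q(s,.)}(tau)."""
    w = np.exp(tau * (Q - Q.max(axis=1, keepdims=True)))  # stable softmax weights
    w /= w.sum(axis=1, keepdims=True)
    v = (w * Q).sum(axis=1)                  # softmax-weighted value per state
    return R + gamma * P @ v

Q = np.zeros((2, 2))
for _ in range(200):                         # iterate to (near) the fixed point of T
    Q = bellman(Q)

gaps = [np.abs(soft_bellman(Q, tau) - bellman(Q)).max() for tau in (1.0, 10.0)]
print(gaps)                                  # sup-norm gap shrinks as tau grows
```

Since the softmax-weighted backup never exceeds the max backup, the softmax operator under-estimates relative to (2), and the gap decays as τ increases, consistent with the exponential convergence discussed in the introduction.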