Revisiting the Softmax Bellman Operator: New Benefits and New Perspective

Zhao Song, Ronald Parr, and Lawrence Carin

June 11, 2019

Bellman Operators

• The Bellman equation: $Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$

- The optimal policy is greedy w.r.t. the action values $Q^*(s, \cdot)$

• The Bellman operator: $T Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')$

- Widely used in reinforcement learning, e.g., deep Q-networks [Mnih et al., 2015, Nature]
- A contraction (a minimal tabular backup sketch follows below)
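To make the backup concrete, here is a minimal tabular sketch in numpy (our illustration, not code from the talk); the array names Q, R, P and the shapes (S states, A actions) are assumptions:

```python
# A minimal sketch of one application of the Bellman operator T on a tabular MDP.
# Assumed inputs: Q and R with shape (S, A), transition tensor P with shape (S, A, S),
# and a discount factor gamma in [0, 1).
import numpy as np

def bellman_backup(Q, R, P, gamma):
    """(T Q)(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q(s', a')."""
    v_next = Q.max(axis=1)                           # max_a' Q(s', a'), shape (S,)
    return R + gamma * np.einsum("sat,t->sa", P, v_next)
```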

Bellman Operators (Cont.)

• The mellowmax operator [Asadi and Littman, 2017, ICML]

$\max_{a'} Q(s', a') \;\longrightarrow\; \frac{1}{\omega} \log\Big(\frac{1}{m} \sum_{a'} \exp[\omega\, Q(s', a')]\Big)$

- The log-sum-exp function has been used extensively [Todorov, 2007, Fox et al., 2016, Schulman et al., 2017, Haarnoja et al., 2017, Neu et al., 2017, Nachum et al., 2017]
- A contraction (a small numerical sketch follows below)
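For reference, a minimal sketch of the mellowmax replacement for the max, computed with a stabilized log-sum-exp (our illustration; the function name and array layout are assumptions):

```python
import numpy as np

def mellowmax(q_values, omega):
    """(1/omega) * log( (1/m) * sum_a' exp(omega * Q(s', a')) ); actions on the last axis."""
    m = q_values.shape[-1]
    z = omega * q_values
    z_max = z.max(axis=-1, keepdims=True)                        # stabilize the exponentials
    lse = z_max[..., 0] + np.log(np.exp(z - z_max).sum(axis=-1))
    return (lse - np.log(m)) / omega                             # subtract log m, rescale by 1/omega
```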

Bellman Operators (Cont.)

• The softmax operator

$\max_{a'} Q(s', a') \;\longrightarrow\; \sum_{a'} \frac{\exp[\tau Q(s', a')]}{\sum_{\bar{a}} \exp[\tau Q(s', \bar{a})]}\, Q(s', a')$

- Can be a non-contraction [Littman, 1996]
- Mainly used for Boltzmann exploration [Sutton and Barto, 1998]; a sketch of this softmax-weighted backup follows below

Question: Is softmax really as bad as sour milk (credit to Ron Parr) when updating the value function?
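For comparison with mellowmax, a minimal sketch of the softmax-weighted backup (our illustration; names and shapes are assumptions). At τ = 0 it reduces to the mean over actions, and as τ → ∞ it approaches the max:

```python
import numpy as np

def softmax_backup(q_values, tau):
    """sum_a' softmax(tau * Q(s', .))_a' * Q(s', a'); actions on the last axis."""
    z = tau * q_values
    z = z - z.max(axis=-1, keepdims=True)            # numerical stability
    w = np.exp(z)
    w = w / w.sum(axis=-1, keepdims=True)            # Boltzmann weights over actions
    return (w * q_values).sum(axis=-1)               # expected Q under those weights
```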

Experimental setup

• S-D(D)QN: replace the max function in the target network of D(D)QN with the softmax function (a hedged sketch of the target computation follows below)
• All other steps are the same as in [Mnih et al., 2015, Nature]
• Code available at https://github.com/zhao-song/Softmax-DQN, built upon Nathan Sprague’s implementation
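The change to the learning target is essentially a one-liner. Below is a hedged numpy sketch of the idea, not the repository's actual code; the names rewards, q_next (target-network Q-values for the next states), terminals, and tau are assumptions:

```python
import numpy as np

def dqn_target(rewards, q_next, terminals, gamma):
    """DQN target: r + gamma * max_a' Q_target(s', a') on non-terminal transitions."""
    return rewards + gamma * (1.0 - terminals) * q_next.max(axis=1)

def s_dqn_target(rewards, q_next, terminals, gamma, tau):
    """S-DQN target: the max above is replaced by the softmax-weighted backup."""
    z = tau * q_next
    z = z - z.max(axis=1, keepdims=True)             # numerical stability
    w = np.exp(z)
    w = w / w.sum(axis=1, keepdims=True)             # Boltzmann weights per transition
    return rewards + gamma * (1.0 - terminals) * (w * q_next).sum(axis=1)
```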

Results on Atari games

[Figure: training curves on six Atari games (Q*Bert, Ms. Pacman, Breakout, Crazy Climber, Asterix, Seaquest) over 200 epochs, comparing S-DQN vs. DQN (top), S-DDQN vs. DDQN (middle), and S-DQN vs. DDQN (bottom).]

Revisiting the softmax Bellman operator is warranted!

Main theorem

Definition: $\delta_b(s) \triangleq \sup_{Q} \max_{i,j} |Q(s, a_i) - Q(s, a_j)|$

Assumption: $\delta_b(s) > 0$

Performance bound: assuming $\delta_b(s) > 0$, then for all $(s, a)$:

• $\limsup_{k \to \infty} T_{\rm soft}^k Q_0(s, a) \le Q^*(s, a)$, and

• $\liminf_{k \to \infty} T_{\rm soft}^k Q_0(s, a) \ge Q^*(s, a) - \frac{\gamma (m - 1)}{1 - \gamma} \max\Big\{\frac{1}{\tau + 2}, \frac{2 Q_{\max}}{1 + \exp(\tau)}\Big\}$

Convergence rate: $T_{\rm soft}$ converges to $T$ at an exponential rate in terms of τ, i.e., the upper bound on the difference $T^k Q_0 - T_{\rm soft}^k Q_0$ decays exponentially fast as a function of τ, the inverse temperature parameter. A small numerical illustration follows.
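A quick numerical illustration of this behavior (our example, not the paper's experiment): the gap between the max backup and the softmax backup on an arbitrary action-value vector shrinks rapidly as τ grows.

```python
import numpy as np

q = np.array([1.0, 0.5, -0.2, 0.3])             # an arbitrary vector of action values
for tau in [0.0, 1.0, 5.0, 20.0]:
    z = tau * q - (tau * q).max()
    w = np.exp(z) / np.exp(z).sum()              # Boltzmann weights
    gap = q.max() - (w * q).sum()                # max backup minus softmax backup
    print(f"tau = {tau:5.1f}   gap = {gap:.6f}")
```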

Overestimation bias reduction

Assumptions: Same as DDQN [van Hasselt et al., 2016]

Smaller error: the overestimation errors from $T_{\rm soft}$ are smaller than or equal to those from $T$ with the max operator, for any τ ≥ 0;

Reduction bound: the overestimation reduction from using $T_{\rm soft}$ in lieu of $T$ is within $\Big[\frac{\delta_b(s)}{m \exp[\tau\, \delta_b(s)]},\; (m - 1) \max\Big\{\frac{1}{\tau + 2}, \frac{2 Q_{\max}}{1 + \exp(\tau)}\Big\}\Big]$;

Monotonicity: the overestimation error for $T_{\rm soft}$ increases monotonically w.r.t. τ ∈ [0, ∞).
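These three statements can be checked with a small Monte Carlo sketch (our illustration in the usual double Q-learning spirit, not the paper's simulated example): all true action values are equal, the estimates carry zero-mean noise, and the estimated bias of the softmax backup stays at or below that of the max while growing with τ.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 10, 100_000
noise = rng.uniform(-1.0, 1.0, size=(trials, m))        # noisy Q estimates; true values are all 0

def softmax_backup(q, tau):
    z = tau * q - (tau * q).max(axis=1, keepdims=True)
    w = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return (w * q).sum(axis=1)

print("max operator bias:", noise.max(axis=1).mean())   # positive overestimation
for tau in [0.0, 1.0, 5.0, 20.0]:
    bias = softmax_backup(noise, tau).mean()            # smaller, and increasing with tau
    print(f"tau = {tau:5.1f}   softmax bias = {bias:.4f}")
```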

Simulated example

[Figure 1: the mean and one standard deviation of the overestimation error for different values of τ.]

Effects on DQNs

Q-value and gradient norm for different methods

[Figure: value estimates per training epoch (top row) and gradient norms per number of updates (bottom row, up to 1.2e7 updates) for DQN, S-DQN, DDQN, and S-DDQN on Q*Bert, Ms. Pacman, Breakout, Crazy Climber, Asterix, and Seaquest.]

Comparison among different Bellman operators

Table 1: A comparison of different Bellman operators (B.O. Bellman optimality; O.R. overestimation reduction; P.R. policy representation; D.Q. double Q-learning).

Operator     B.O.   Tuning   O.R.   P.R.   D.Q.
Max          Yes    No       -      Yes    Yes
Mellowmax    No     Yes      Yes    No     No
Softmax      No     Yes      Yes    Yes    Yes

Welcome to our poster tonight, #40

Thank You

References

Asadi, K. and Littman, M. L. (2017). An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pages 243–252.

Fox, R., Pakman, A., and Tishby, N. (2016). Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 202–211. AUAI Press.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361.

Littman, M. L. (1996). Algorithms for Sequential Decision Making. PhD thesis, Department of Computer Science, Brown University.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2772–2782.

Neu, G., Jonsson, A., and Gómez, V. (2017). A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.

Schulman, J., Chen, X., and Abbeel, P. (2017). Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. The MIT Press.

Todorov, E. (2007). Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pages 1369–1376.

van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2094–2100. AAAI Press.