DeepMellow: Removing the Need for a Target Network in Deep Q-Learning
Seungchan Kim, Kavosh Asadi, Michael Littman and George Konidaris
Brown University, Department of Computer Science
{seungchan kim, [email protected]}, {mlittman, [email protected]}

Abstract

Deep Q-Network (DQN) is an algorithm that achieves human-level performance in complex domains like Atari games. One of the important elements of DQN is its use of a target network, which is necessary to stabilize learning. We argue that using a target network is incompatible with online reinforcement learning, and that it is possible to achieve faster and more stable learning without a target network when we use Mellowmax, an alternative softmax operator. We derive novel properties of Mellowmax, and empirically show that the combination of DQN and Mellowmax, but without a target network, outperforms DQN with a target network.

1 Introduction

Reinforcement learning (RL) is a general framework for studying agents that make sequential decisions in an environment to maximize long-term utility. Recent breakthroughs have shown that reinforcement learning algorithms can train agents in high-dimensional, complex domains when combined with deep neural networks. Deep Q-Network (or simply DQN) [Mnih et al., 2015] was the first algorithm to successfully instantiate this combination; it yields human-level performance in high-dimensional, large-scale domains like Atari video games [Bellemare et al., 2013].

An important component of DQN is its use of a target network, which was introduced to stabilize learning. In Q-learning, the agent updates the value of executing an action in the current state using the values of executing actions in a successive state. This procedure often results in instability because the values change simultaneously on both sides of the update equation. A target network is a copy of the estimated value function that is held fixed to serve as a stable target for some number of steps.

However, a key shortcoming of target networks is that they move us farther from online reinforcement learning, a desired property in the reinforcement learning community [Sutton and Barto, 1998; van Seijen and Sutton, 2014]. The presence of a target network can also slow down learning due to delayed value-function updates.

We propose an approach that reduces the need for a target network in DQN while still ensuring stable learning and good performance in high-dimensional domains. Our approach uses a recently proposed alternative softmax operator, Mellowmax [Asadi and Littman, 2017]. This operator has been shown to ensure convergence in learning and planning, has an entropy-regularization interpretation [Nachum et al., 2017; Fox et al., 2016; Neu et al., 2017], and facilitates convergent off-policy learning even with non-linear function approximation [Dai et al., 2018]. We additionally show that Mellowmax is convex regardless of the value of its temperature parameter, is monotonically non-decreasing with respect to that parameter, and alleviates the well-known overestimation problem associated with the max operator [van Hasselt, 2010]. These properties suggest that the use of Mellowmax in DQN mitigates the main cause of instability in online reinforcement learning with deep neural networks.

We test the performance of DeepMellow, the combination of the Mellowmax operator and DQN, in two control domains (Acrobot and Lunar Lander) and two Atari domains (Breakout and Seaquest). Our empirical results show that DeepMellow achieves more stability than a version of DQN with no target network. We also show that DeepMellow, which has no target network, learns faster than the original DQN with a target network in these domains.

2 Background

RL problems are typically formulated as Markov Decision Processes (MDPs) [Bellman, 1957], described by the tuple (S, A, T, R, γ). Here, S denotes the set of states and A denotes the set of actions. The function T : S × A × S → [0, 1] specifies the transition dynamics of the process; specifically, the probability of moving to a state s′ after taking action a in state s is denoted by T(s, a, s′). Similarly, R : S × A × S → ℝ is the reward function, with R(s, a, s′) being the expected reward upon moving to state s′ when taking action a in state s. Finally, the discount rate γ ∈ [0, 1) determines the importance of immediate rewards relative to rewards received in the long term.

RL agents typically seek a policy π : S → Pr(A) that achieves high long-term reward. More formally, the long-term utility (or value) of taking an action in a state is defined as:

\[ Q^{\pi}(s,a) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s,\ a_0 = a,\ \pi\Big]. \]

We can define the optimal Q value, meaning the value of a state–action pair under the best policy π*, as follows:

\[ Q^{*}(s,a) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s,\ a_0 = a,\ \pi^{*}\Big]. \]

A fundamental result in MDPs is that Q* can be written recursively:

\[ Q^{*}(s,a) = \sum_{s' \in S} T(s,a,s')\big[R(s,a,s') + \gamma \max_{a'} Q^{*}(s',a')\big], \]

in what are known as the Bellman equations [Bellman, 1957]. A generalization [Littman and Szepesvári, 1996] replaces the max operator with a generic operator ⨂ as follows:

\[ Q(s,a) = \sum_{s' \in S} T(s,a,s')\big[R(s,a,s') + \gamma \bigotimes_{a'} Q(s',a')\big]. \]

The operator ⨂ can be thought of as an action selector, which summarizes values over actions; different choices of the operator can lead to different solutions for Q.
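To make the role of the generic operator concrete, the sketch below performs one Bellman backup sweep with a pluggable action summarizer. It is a minimal sketch under assumed dense-array representations of T and R; the helper name `bellman_backup` is illustrative and not taken from the paper.

```python
import numpy as np

def bellman_backup(Q, T, R, gamma, op=np.max):
    """One sweep of Q(s,a) = sum_s' T[s,a,s'] * (R[s,a,s'] + gamma * op(Q[s'])).

    Q: (n_states, n_actions) current value estimates.
    T: (n_states, n_actions, n_states) transition probabilities.
    R: (n_states, n_actions, n_states) expected rewards.
    op: action summarizer; np.max recovers the standard Bellman equation,
        while a softmax-style operator such as Mellowmax is another choice.
    """
    n_states, n_actions = Q.shape
    Q_new = np.zeros_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            Q_new[s, a] = sum(
                T[s, a, sp] * (R[s, a, sp] + gamma * op(Q[sp]))
                for sp in range(n_states)
            )
    return Q_new
```

Swapping `op` is exactly the generalization described above: the backup structure stays fixed while the action summarizer changes.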
2.1 Deep Q-Network

In its basic form, given a sample ⟨s, a, r, s′⟩, online tabular Q-learning seeks to improve its approximate solution to the fixed point of the Bellman equation by performing the following simple update:

\[ Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big]. \]

In domains with large state spaces, rather than learning a value for each individual state–action pair, algorithms must represent the Q-function using a parameterized function approximator. DQN, for example, uses a deep convolutional neural network [LeCun et al., 2015] parameterized by weights θ. In this case, the value-function estimate is updated as follows:

\[ \theta \leftarrow \theta + \alpha\big[r + \gamma \max_{a'} Q(s',a';\theta) - Q(s,a;\theta)\big]\, \nabla_{\theta} Q(s,a;\theta). \]

DQN employs two additional techniques, namely experience replay and the use of a separate target network, to stabilize learning and improve performance [Mnih et al., 2015]. Experience replay stores samples ⟨s, a, r, s′⟩ in a buffer, randomly samples a minibatch, and performs the above update over that minibatch. This approach helps reduce the correlation between consecutive samples, which can otherwise negatively affect gradient-based methods.

The second modification, the target network, maintains a separate weight vector θ⁻ to create a temporal gap between the target action-value function and the action-value function that is updated continually. The update is changed as follows:

\[ \theta \leftarrow \theta + \alpha\big[r + \gamma \max_{a'} Q_T(s',a';\theta^{-}) - Q(s,a;\theta)\big]\, \nabla_{\theta} Q(s,a;\theta), \]

where the separate weight vector θ⁻ is synchronized with θ after some period of time chosen as a hyper-parameter.
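As a concrete illustration of the experience-replay mechanism described above, here is a minimal buffer sketch; the class name and the `done` flag are assumptions made for illustration rather than details taken from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of <s, a, r, s'> transitions, sampled uniformly."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest samples are evicted first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform random minibatch, which breaks the correlation
        # between consecutive samples
        return random.sample(self.buffer, batch_size)
```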
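The target-network update itself can be sketched in the simpler setting of linear function approximation, which keeps the semi-gradient explicit. The feature map `phi` and the periodic synchronization shown in the comment are illustrative assumptions; DQN itself uses a convolutional network and minibatch gradients.

```python
import numpy as np

def target_network_update(theta, theta_target, phi, s, a, r, s_next, alpha, gamma):
    """Semi-gradient step
    theta <- theta + alpha * [r + gamma * max_a' Q(s',a'; theta^-) - Q(s,a; theta)]
                     * grad_theta Q(s,a; theta),
    with a linear Q(s,a; w) = w[a] . phi(s) and w of shape (n_actions, n_features)."""
    q_sa = theta[a] @ phi(s)
    target = r + gamma * np.max(theta_target @ phi(s_next))  # bootstrap from frozen theta^-
    theta[a] += alpha * (target - q_sa) * phi(s)              # gradient of a linear Q is phi(s)
    return theta

# Every fixed number of steps (a hyper-parameter), the frozen copy is refreshed:
#     theta_target = theta.copy()
```

Removing the target network, as DeepMellow does, amounts to bootstrapping from `theta` itself rather than from the frozen copy `theta_target`.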
2.2 Motivation: Removing the Target Network

We aim to remove the target network for the following reasons. First, the target network in DQN violates the principles of online reinforcement learning and hinders fast learning. Online learning enables real-time learning with streams of incoming data, continually updating the value functions without delays [Sutton and Barto, 1998; Rummery and Niranjan, 1994]. (Note that both experience replay and a target network are deviations from online learning, but our focus here is solely on the target network.) By eliminating the target network, we can remove the delays in the update of value functions and thus support faster learning, as demonstrated in our experiments.

Second, having a separate target network doubles the memory required to store neural-network weights, so removing it contributes to a better allocation of memory resources. Lastly, we aim to develop simpler learning algorithms, since they are easier to implement in practice and to understand in theory. A target network is an extra complication added to Q-learning to make it work; removing that complication results in a simpler, and therefore better, algorithm.

3 New Properties of the Mellowmax Operator

Softmax operators have been found useful across many disciplines of science, including optimization [Boyd and Vandenberghe, 2004], electrical engineering [Safak, 1993], game theory [Gao and Pavel, 2017], and experimental psychology [Stahl II and Wilson, 1994].

In reinforcement learning, softmax has been used in the context of action selection to trade off exploration (trying new actions) and exploitation (trying good actions), owing to its cheap computational complexity relative to more principled approaches like optimism under uncertainty [Brafman and Tennenholtz, 2002; Strehl et al., 2006] or Bayes-optimal decision making [Dearden et al., 1998]. We use softmax in the context of value-function optimization, where we focus on the following softmax operator:

\[ \mathrm{mm}_{\omega}(\mathbf{x}) := \frac{\log\big(\frac{1}{n}\sum_{i=1}^{n}\exp(\omega x_i)\big)}{\omega}. \]

This recently introduced operator, called Mellowmax [Asadi and Littman, 2017], can be thought of as a smooth approximation of the max operator. Mellowmax can also be incorporated into the Bellman equation as follows:

\[ Q(s,a) = \sum_{s' \in S} T(s,a,s')\big[R(s,a,s') + \gamma\, \mathrm{mm}_{\omega}\, Q(s', \cdot)\big]. \]

In contrast to the more familiar Boltzmann softmax, it yields convergent behavior due to its non-expansion property. Recent research has also found an appealing entropy-regularization characterization of this operator [Nachum et al., 2017; Fox et al., 2016; Neu et al., 2017].

Mellowmax has several interesting properties. In addition to being a non-expansion, the parameter ω offers an
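For reference, the mm_ω definition above can be computed with a numerically stable log-sum-exp. This is a minimal sketch; the function name and the use of SciPy are assumptions for illustration, not part of the paper.

```python
import numpy as np
from scipy.special import logsumexp

def mellowmax(x, omega):
    """mm_omega(x) = log((1/n) * sum_i exp(omega * x_i)) / omega,
    computed as (logsumexp(omega * x) - log(n)) / omega for numerical stability."""
    x = np.asarray(x, dtype=np.float64)
    return (logsumexp(omega * x) - np.log(x.size)) / omega

# Behavior over a vector of action values:
# mellowmax([1.0, 2.0, 3.0], omega=100.0) is close to 3.0 (approaches the max),
# mellowmax([1.0, 2.0, 3.0], omega=0.01)  is close to 2.0 (approaches the mean).
```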