Distributed Deep Q-Learning
Kevin Chavez, Hao Yi Ong, and Augustus Hong

K. Chavez, H. Y. Ong, and A. Hong are with the Departments of Electrical Engineering, Mechanical Engineering, and Computer Science, respectively, at Stanford University, Stanford, CA 94305, USA. {kjchavez, haoyi, auhong}@stanford.edu

Abstract— We propose a distributed deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is based on the deep Q-network, a convolutional neural network trained with a variant of Q-learning. Its input is raw pixels and its output is a value function estimating future rewards from taking an action given a system state. To distribute the deep Q-network training, we adapt the DistBelief software framework to the context of efficiently training reinforcement learning agents. As a result, the method is completely asynchronous and scales well with the number of machines. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to achieve reasonable success on a simple game with minimal parameter tuning.

I. INTRODUCTION

Reinforcement learning (RL) agents face a tremendous challenge in optimizing their control of a system approaching real-world complexity: they must derive efficient representations of the environment from high-dimensional sensory inputs and use these to generalize past experience to new situations. While past work in RL has shown that agents with good hand-crafted features are able to learn good control policies, their applicability has been limited to domains where such features have been discovered, or to domains with fully observed, low-dimensional state spaces [1]–[3].

We consider the problem of efficiently scaling a deep learning algorithm to control a complicated system with high-dimensional sensory inputs. The basis of our algorithm is an RL agent called a deep Q-network (DQN) [4], [5] that combines RL with a class of artificial neural networks known as deep neural networks [6]. DQN uses an architecture called the deep convolutional network, which utilizes hierarchical layers of tiled convolutional filters to exploit the local spatial correlations present in images. As a result, this architecture is robust to natural transformations such as changes of viewpoint and scale [7].

In practice, increasing the scale of deep learning with respect to the number of training examples or the number of model parameters can drastically improve the performance of deep neural networks [8], [9]. To train a deep network with many parameters on multiple machines efficiently, we adapt a software framework called DistBelief to the context of training RL agents [10]. Our new framework supports data parallelism, thereby allowing us to potentially utilize computing clusters with thousands of machines for large-scale distributed training, as shown in [10] in the context of unsupervised image classification. To achieve model parallelism, we use Caffe, a deep learning framework developed for image recognition that distributes training across multiple processor cores [11].

The contributions of this paper are twofold. First, we develop and implement a software framework, adapted from DistBelief, that supports model and data parallelism for DQN. Second, we demonstrate and analyze the performance of our distributed RL agent.

The rest of this paper is organized as follows. Section II introduces the background on the class of machine learning problem our algorithm solves. This is followed by Section III and Section IV, which detail the serial DQN and our approach to distributing the training. Section V discusses our experiments on a classic video game, and concluding remarks and future work are given in Section VI.

II. BACKGROUND

We begin with a brief review of MDPs and reinforcement learning (RL).

A. Markov decision process

In an MDP, an agent chooses action $a_t$ at time $t$ after observing state $s_t$. The agent then receives reward $r_t$, and the state evolves probabilistically based on the current state-action pair. The explicit assumption that the next state depends only on the current state-action pair is referred to as the Markov assumption. An MDP can be defined by the tuple $(S, A, T, R)$, where $S$ and $A$ are the sets of all possible states and actions, respectively, $T$ is a probabilistic transition function, and $R$ is a reward function. $T$ gives the probability of transitioning into state $s'$ from taking action $a$ at the current state $s$, and is often denoted $T(s, a, s')$. $R$ gives a scalar value indicating the immediate reward received for taking action $a$ at the current state $s$ and is denoted $R(s, a)$.

To solve an MDP, we compute a policy $\pi^*$ that, if followed, maximizes the expected sum of immediate rewards from any given state. The optimal policy is related to the optimal state-action value function $Q^*(s, a)$, which is the expected value when starting in state $s$, taking action $a$, and then following actions dictated by $\pi^*$. Mathematically, it obeys the Bellman recursion

$$Q^*(s, a) = R(s, a) + \sum_{s' \in S} T(s, a, s') \max_{a' \in A} Q^*(s', a').$$

The state-action value function can be computed using a dynamic programming algorithm called value iteration. To obtain the optimal policy for state $s$, we compute

$$\pi^*(s) = \operatorname*{argmax}_{a \in A} Q^*(s, a).$$
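To make the dynamic programming step concrete, the following is a minimal tabular value-iteration sketch of the Bellman recursion and greedy policy above, assuming small finite $S$ and $A$ with known arrays `T` and `R`; the discount factor `gamma` and the stopping tolerance are illustrative additions for this sketch (the recursion above is stated without them), not details from the paper.

```python
import numpy as np

def value_iteration(T, R, gamma=0.99, tol=1e-6):
    """Tabular value iteration for Q*(s, a).

    T: array of shape (S, A, S) with T[s, a, s'] = transition probability.
    R: array of shape (S, A) with the immediate reward R(s, a).
    gamma, tol: illustrative discount factor and stopping tolerance.
    Returns the Q* table and the greedy policy pi*(s) = argmax_a Q*(s, a).
    """
    num_states, num_actions, _ = T.shape
    Q = np.zeros((num_states, num_actions))
    while True:
        # Bellman backup: R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q(s',a')
        Q_new = R + gamma * (T @ Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new, Q_new.argmax(axis=1)
        Q = Q_new
```

This direct computation is only possible because $T$ and $R$ are known and the tables fit in memory; the reinforcement learning setting below removes exactly those assumptions.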
B. Reinforcement learning

The problem reinforcement learning seeks to solve differs from the standard MDP in that the state space and the transition and reward functions are unknown to the agent. The goal of the agent is thus to both build an internal representation of the world and select actions that maximize cumulative future reward. To do this, the agent interacts with an environment through a sequence of observations, actions, and rewards and learns from past experience.

In our algorithm, the deep Q-network builds its internal representation of its environment by explicitly approximating the state-action value function $Q^*$ via a deep neural network. Here, the basic idea is to estimate

$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right],$$

where $\pi$ maps states to actions (or distributions over actions), with the additional knowledge that the optimal value function obeys the Bellman equation

$$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right],$$

where $\mathcal{E}$ is the MDP environment and $\gamma$ is the discount factor.

III. APPROACH

This section presents the general approach adapted from the serial deep Q-learning in [4], [5] to our purpose. In particular, we discuss the neural network architecture, the iterative training algorithm, and a mechanism that improves training convergence stability.

A. Preprocessing and network architecture

Working directly with raw video game frames can be computationally demanding. Our algorithm applies a basic preprocessing step aimed at reducing the input dimensionality. Here, the raw frames are gray-scaled from their RGB representation and down-sampled to a fixed size for input to the neural network. For this paper, the function $\phi$ applies this preprocessing to the last four frames of a sequence and stacks them to produce the input to the state-action value function $Q$.

We use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network; i.e., the preprocessed four-frame sequence. The outputs correspond to the predicted Q-values of the individual actions for the input state. The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.

The hidden layers apply the rectifier nonlinearity

$$f(x) = \max(0, x),$$

which was empirically observed to model real/integer-valued inputs well [12], [13], as is required in our case. The remaining layers are fully-connected linear layers with a single output for each valid action. The number of valid actions varies with the game application. The neural network is implemented on Caffe [11], a versatile deep learning framework that allows us to define the network architecture and training parameters freely. And because Caffe is designed to take advantage of all available computing resources on a machine, we can easily achieve model parallelism using the software.
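As an illustration of the preprocessing described above, the sketch below gray-scales an RGB frame, down-samples it, and stacks the last four processed frames into a single input for $Q$. The 84x84 target size, the nearest-neighbor resizing, the [0, 1] scaling, and the frame padding at the start of an episode are assumptions made for this sketch; they are not specified in this excerpt.

```python
import numpy as np
from collections import deque

def preprocess_frame(frame, out_size=(84, 84)):
    """Gray-scale an RGB frame and down-sample it to a fixed size.

    frame: uint8 array of shape (H, W, 3).
    out_size: assumed target (height, width); nearest-neighbor resizing
    keeps this sketch dependency-free.
    """
    gray = frame.astype(np.float32) @ np.array([0.299, 0.587, 0.114], np.float32)
    h, w = gray.shape
    rows = np.arange(out_size[0]) * h // out_size[0]
    cols = np.arange(out_size[1]) * w // out_size[1]
    return gray[np.ix_(rows, cols)] / 255.0  # scale to [0, 1] (assumption)

class FrameStack:
    """Stacks the last four preprocessed frames into one network input."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, frame):
        processed = preprocess_frame(frame)
        if not self.frames:  # pad with copies at the start of an episode
            self.frames.extend([processed] * self.frames.maxlen)
        self.frames.append(processed)
        return np.stack(self.frames, axis=0)  # shape (4, 84, 84), input to Q
```

The stacked array plays the role of the state representation fed to the separate-output-per-action network described above.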
B. Q-learning

We parameterize the approximate value function $Q(s, a; \theta)$ using the deep convolutional network described above, where $\theta$ are the parameters of the Q-network. These parameters are iteratively updated by the minimizers of the loss function

$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\big[(y_i - Q(s, a; \theta_i))^2\big], \qquad (1)$$

with iteration number $i$, target $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \right]$, and "behavior distribution" (exploration policy) $\rho(s, a)$. The optimizers of the Q-network loss function can be computed by gradient descent

$$Q(s, a; \theta) := Q(s, a; \theta) + \alpha \nabla_{\theta} Q(s, a; \theta),$$

with learning rate $\alpha$.

For computational expedience, the parameters are updated after every time step; i.e., with every new experience. Our algorithm also avoids computing full expectations, and we train on single samples from $\rho$ and $\mathcal{E}$. This results in the Q-learning update

$$Q(s, a) := Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right).$$

The procedure is an off-policy training method [14] that learns the policy $a = \operatorname*{argmax}_a Q(s, a; \theta)$ while using an exploration policy, or behavior distribution, selected by an $\epsilon$-greedy strategy.

The target network parameters used to compute $y$ in Eq. (1) are only updated with the Q-network parameters every $C$ steps and are held fixed between individual updates. These staggered updates stabilize the learning process compared to the standard Q-learning process, where an update that increases $Q(s_t, a_t)$ often also increases $Q(s_{t+1}, a)$ for all $a$ and hence also increases the target $y$.
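The components described in this subsection (the $\epsilon$-greedy behavior policy, the single-sample target computed with the lagged parameters, and the $C$-step target synchronization) can be combined into one schematic training step. The sketch below substitutes a toy linear Q-function for the convolutional network and uses the standard single-sample gradient step on the loss in Eq. (1); every name and hyperparameter here is illustrative rather than taken from the paper's Caffe implementation.

```python
import numpy as np

class LinearQ:
    """Toy linear stand-in for the Q-network: Q(s, a; theta) = theta[a] . s."""

    def __init__(self, state_dim, num_actions):
        self.theta = np.zeros((num_actions, state_dim))

    def q_values(self, s):
        return self.theta @ s  # one predicted Q-value per action

def epsilon_greedy(q_net, s, epsilon, num_actions, rng):
    """Behavior policy: random action w.p. epsilon, else argmax_a Q(s, a; theta)."""
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax(q_net.q_values(s)))

def q_learning_step(q_net, target_net, s, a, r, s_next, terminal,
                    alpha=1e-3, gamma=0.99):
    """Single-sample gradient step on Eq. (1), with the target from the lagged network."""
    # Target y = r + gamma * max_a' Q(s', a'; theta_old); terminal handling is an assumption.
    y = r if terminal else r + gamma * np.max(target_net.q_values(s_next))
    td_error = y - q_net.q_values(s)[a]
    q_net.theta[a] += alpha * td_error * s  # grad_theta Q(s, a; theta) = s for a linear Q
    return td_error

def sync_target(q_net, target_net):
    """Copy the Q-network parameters into the target network (done every C steps)."""
    target_net.theta = q_net.theta.copy()
```

A surrounding loop would interleave epsilon_greedy action selection, a q_learning_step update after every time step, and a call to sync_target every $C$ steps, mirroring the serial procedure described above.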