DAQN: Deep Auto-encoder and Q-Network

Daiki Kimura
IBM Research AI
Email: [email protected]

Abstract—Deep learning methods usually require a large number of training images and executed actions to obtain sufficient results. When such a method is extended to a real task in the real environment with an actual robot, even more training images are required because of the complexity and noise of the input images, and executing many actions on the real robot also becomes a serious problem. Therefore, we propose an extended deep reinforcement learning method that applies a generative model to initialize the network in order to reduce the number of training trials. In this paper, we use the deep q-network method as the deep reinforcement learning method and a deep auto-encoder as the generative model. We conducted experiments on three different tasks: a cart-pole game, an Atari game, and a real game with an actual robot. The proposed method trained more efficiently than the previous method on all tasks, and in particular 2.5 times faster on the task with real-environment images.

Fig. 1. Deep Auto-encoder and Q-Network.

I. INTRODUCTION

Deep reinforcement learning methods, such as the deep q-network [1], [2], deep deterministic policy gradients [3], and asynchronous advantage actor-critic [4], are suitable for implementing an interactive robot (agent). Such a method chooses an optimal action from a state input. These methods are often evaluated in a simulated environment, where it is easy to obtain inputs, perform actions, and receive rewards. However, giving a reward to a real robot for each action requires human effort, and performing training trials on an actual robot takes time and risks damaging the robot and its environment. Thus, reducing the number of rewards given and actions taken is necessary. Moreover, when we consider deployment in a real environment, the state inputs have a much wider diversity. For example, for a state such as "in front of a human", there are almost infinite variations of the visual representation of the human. Also, the input from an actual sensor contains complex background noise.
Hence, a larger amount of training is essentially required than in a simulated environment.

On the other hand, generative models have recently become popular for pre-training neural networks. The auto-encoder [5], [6] is one of the well-known generative models. Some studies, such as [7], [8], report that the auto-encoder reduces the number of training steps for classification tasks. We therefore assume that pre-training also helps to reduce the number of reinforcement learning steps. Moreover, pre-training only requires inputs; class labels are unnecessary. Hence, the pre-training data does not need the rewards required for reinforcement learning, which means we can obtain the data from a random-policy agent or from the environment around the robot without taking any action.

Therefore, we propose an extended deep reinforcement learning method that applies a generative model to initialize the network in order to reduce the number of training trials. In this paper, we use the deep q-network method [1], [2] as the deep reinforcement learning method and a deep auto-encoder [6] as the generative model, and we name the proposed method "deep auto-encoder and q-network" (daqn). The overview is shown in Figure 1. The method first trains a network with the generative model, using data from a random-policy agent or inputs collected without actions. Then, it trains policies by reinforcement learning starting from the pre-trained network. The expected advantage is a decrease in the number of training trials in the reinforcement learning phase. It is true that the proposed method additionally requires training data for pre-training. However, the cost of obtaining this data is much lower than that of the data for reinforcement learning, because rewards and optimal actions are not required for the pre-training data.

In this paper, we evaluate the method in three different environments, including a physical interaction with a real robot. Note that we focus on discrete actions for a clear discussion of the contributions with a simple architecture; we assume the proposed method can easily be applied to continuous control with reinforcement learning methods such as [3], [4].

The contributions of this paper are to clarify the benefit of introducing a generative model into deep reinforcement learning (specifically, an auto-encoder into the deep q-network), the conditions for its effectiveness, and the requirements on the pre-training data. The most similar work is [9], which copied an auto-encoder network to a deep q-network in the Atari environment; however, they concluded that the pre-trained results show lower performance. The main reasons are that only the first layer was copied and that some training parameters were not proper. We copy all layers and change the parameters, and we conduct an experiment in a real environment that is more complex than Atari, where a larger benefit from pre-training is expected.

II. DEEP AUTO-ENCODER AND Q-NETWORK

The proposed method has the following steps: (1) train a network with the deep auto-encoder, (2) delete the decoder layers and add a fully-connected layer on top of the encoder layers, and (3) train policies with the deep q-network. A code sketch of this pipeline appears at the end of this section.

The method first trains on the inputs with a deep auto-encoder to pre-train the network. The auto-encoder has encoder and decoder components, which can be defined as transitions φ and ψ. Here, we define X = R^n and Y = R^m, and let φ* and ψ* be trained by reconstructing the auto-encoder's own inputs:

    \phi : \mathcal{X} \to \mathcal{Y}, \qquad \psi : \mathcal{Y} \to \mathcal{X}, \qquad x \in \mathcal{X}    (1)

    \phi^*, \psi^* = \arg\min_{\phi,\psi} \lVert x - (\psi \circ \phi)\, x \rVert^2    (2)

When the input is a simple vector, the proposed method uses a one-dimensional auto-encoder [5]; when the input is an image, it uses a convolutional auto-encoder [6]. Importantly, the training data for this step is obtained through a random policy of the agent, or from data captured from the environment without any action of the robot.

Next, the method removes the decoder layers of the auto-encoder network and adds a fully-connected layer on top of the encoder layers for the discrete actions. Note that the weights of this added layer are initialized with random values.

Then, the method trains the policy with the deep q-network [1], [2], which is initialized with the pre-trained network parameters from the previous steps. The deep q-network is based on the Q-learning algorithm [10]. The algorithm has an "action-value function" that calculates the quality of a combination of state S and action A, Q : S × A → R. The update function of Q is

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \big),    (3)

where the right-hand Q(s_t, a_t) is the current value, α is a learning ratio, r_{t+1} is a reward, γ is a discount factor, and max_a Q(s_{t+1}, a) is the maximum estimated action-value for state s_{t+1}. The loss function of the deep q-network is therefore

    L(\theta) = \mathbb{E}_{(s_t, a_t, r_{t+1}, s_{t+1}) \sim D} \big[ (y - Q(s_t, a_t; \theta))^2 \big],    (4)

    y = \begin{cases} r_{t+1} & \text{terminal} \\ r_{t+1} + \gamma \max_a Q(s_{t+1}, a; \theta^-) & \text{non-terminal,} \end{cases}    (5)

where θ denotes the parameters of the deep q-network, D is an experience replay memory [11], Q(s_t, a_t; θ) is calculated by the deep network, and γ is a discount factor. θ⁻ denotes weights that are updated only at fixed intervals; this technique was also used in the original deep q-network method [1]. A minimal code sketch of steps (1)-(2) and of the targets in Eqs. (3)-(5) is given below.

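The following is a minimal Keras sketch of steps (1)-(2) for a vector input: an auto-encoder is fit on reward-free states, the decoder is discarded, and a randomly initialized fully-connected Q head is added on top of the trained encoder. The layer sizes, number of epochs, and placeholder data are illustrative assumptions, not the settings of the paper.

    # Sketch of DAQN steps (1)-(2): pre-train an auto-encoder on reward-free
    # inputs, then keep only the encoder and add a fully-connected Q head.
    import numpy as np
    from keras.models import Model
    from keras.layers import Input, Dense

    state_dim, code_dim, num_actions = 4, 16, 2        # e.g. a cart-pole-like task

    # step (1): auto-encoder  phi: X -> Y,  psi: Y -> X   (Eqs. 1-2)
    x_in = Input(shape=(state_dim,))
    h = Dense(32, activation='relu')(x_in)             # encoder (phi)
    code = Dense(code_dim, activation='relu')(h)
    h_dec = Dense(32, activation='relu')(code)         # decoder (psi)
    x_rec = Dense(state_dim, activation='linear')(h_dec)

    autoencoder = Model(x_in, x_rec)
    autoencoder.compile(optimizer='adam', loss='mse')  # minimizes ||x - psi(phi(x))||^2

    # states collected by a random-policy agent; no rewards or action labels needed
    random_states = np.random.randn(10000, state_dim).astype('float32')  # placeholder data
    autoencoder.fit(random_states, random_states, epochs=10, batch_size=32, verbose=0)

    # step (2): drop the decoder and add a randomly initialized Q head
    encoder = Model(x_in, code)                        # shares the trained encoder weights
    q_values = Dense(num_actions, activation='linear')(encoder.output)
    q_network = Model(encoder.input, q_values)         # handed to the dqn phase (step 3)

The targets and loss of Eqs. (3)-(5) can be written compactly as below. This is a hedged NumPy sketch of the standard computation; the function names and the toy batch are illustrative and not taken from the paper.

    # TD targets of Eq. (5): y = r for terminal transitions, otherwise
    # y = r + gamma * max_a Q(s', a; theta^-); the loss of Eq. (4) is the mean
    # squared error between y and Q(s, a; theta) for the taken actions.
    import numpy as np

    def td_targets(rewards, q_next_target, terminal, gamma=0.99):
        # q_next_target: (N, num_actions) predicted by the frozen network theta^-
        return rewards + gamma * np.max(q_next_target, axis=1) * (~terminal)

    def dqn_loss(q_current, actions, targets):
        q_taken = q_current[np.arange(len(actions)), actions]
        return np.mean((targets - q_taken) ** 2)

    # tiny example with a batch of three transitions
    q_next = np.array([[1.0, 2.0], [0.5, 0.1], [3.0, 0.0]])
    y = td_targets(np.array([0.0, 1.0, -1.0]), q_next,
                   np.array([False, False, True]))
    print(y)        # [1.98, 1.495, -1.0] with gamma = 0.99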

III. EXPERIMENT

We conduct three different types of games in this study: a cart-pole game [12], an Atari game implemented in the arcade learning environment [13], and an interactive game on an actual robot in the real environment. The cart-pole is a simple game and therefore serves as a base experiment. The Atari game, which is also used in the original deep q-network study and some related works, has image inputs and a complex game rule; however, the images are taken from a simulated environment, specifically from the game screen. The third environment is a real game. In this paper, we choose a "rock-paper-scissors" game, one of the various human hand-games with discrete actions, and adapt it to an interactive game between a real robot and a human. The input of this game is significantly more complex than in the other experiments due to the real environment. We used the OpenAI Gym framework [14] for simulating the cart-pole game and the Atari game, and Keras [15] and Keras-RL [16] for conducting the experiments.

A. Cart-pole game

1) Environment: The cart-pole game [12] is a game in which a pole is attached by a joint to a cart that moves along a frictionless track. The agent controls the cart to prevent the pole from falling over. The pole starts in the upright position, and the game ends when the pole is more than 15 degrees from vertical. The agent obtains four-dimensional values from a sensor and chooses an optimal action from two discrete actions: moving right or left. To conduct the experiment, we first design a simple network that has three hidden layers and a final layer for the deep q-network (dqn). Figure 2 shows the details of the whole network of the proposed method. The activation function of every fully-connected layer is a ReLU function [17]. The proposed method first pre-trains on data from a random agent with the auto-encoder algorithm; this is the blue part in Figure 2. Next, the method adds a fully-connected layer on top of the encoder layers. Then, it starts to train policies with the dqn algorithm; this is the orange part in Figure 2. We compare the training efficiency with the original dqn method, which does not have pre-training. A minimal Keras-RL sketch of this setup is given below.

Fig. 2. Network for the cart-pole game: the blue part is the auto-encoder component, the orange part is the deep q-network, and "FC" means a fully-connected layer.

Fig. 3. Rewards for the cart-pole game: the maximum reward is 300. Note that the optimizer setting in the legend is only for the auto-encoder (ae); the optimizer of the dqn component is the same as in the dqn method.

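As a concrete illustration, the cart-pole experiment can be reproduced roughly with OpenAI Gym [14] and Keras-RL [16] as below. This is a hedged sketch: the three hidden-layer sizes are illustrative and do not necessarily match Figure 2, and in the daqn case the dense layers would first be fit as an auto-encoder (as in the earlier sketch) before being passed to the agent.

    import gym
    from keras.models import Sequential
    from keras.layers import Dense, Flatten
    from keras.optimizers import Adam
    from rl.agents.dqn import DQNAgent
    from rl.policy import BoltzmannQPolicy
    from rl.memory import SequentialMemory

    env = gym.make('CartPole-v0')
    nb_actions = env.action_space.n                    # two discrete actions

    model = Sequential()                               # three hidden layers + Q layer
    model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(nb_actions, activation='linear'))

    dqn = DQNAgent(model=model, nb_actions=nb_actions,
                   memory=SequentialMemory(limit=50000, window_length=1),
                   nb_steps_warmup=10, target_model_update=1e-2,
                   policy=BoltzmannQPolicy())          # Boltzmann exploration
    dqn.compile(Adam(lr=1e-3), metrics=['mae'])
    dqn.fit(env, nb_steps=3000, visualize=False, verbose=1)
    dqn.test(env, nb_episodes=100, visualize=False)
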
Note that a "training step" in the evaluation counts only the training of the dqn component; it excludes the auto-encoder training. Because the pre-training data is acquired from a random policy, obtaining it does not require much cost. The pre-training data consists of the inputs from 10 thousand (10k) steps of a random-policy agent. The loss function of the auto-encoder is the mean square error, the optimizer is adaptive moment estimation (adam) [18], and the exploration policy is Boltzmann exploration. All parameters and settings of the dqn component in daqn are the same as those in the original dqn method. We also train the approach of [9], which copies only the first layer of the pre-trained auto-encoder to the dqn; its remaining training parameters are the same as for daqn and therefore differ from [9].

2) Results: Figure 3 shows the reward curves for the test phase with different numbers of training steps. Results were computed from running 100 episodes over 10 different training trials; each line is the average of these results. The average rewards after 3k training steps were 176.8, 189.6, and 179.4 for dqn, the proposed method (daqn), and the first-layer-only variant [9], respectively. These results show that pre-training by the auto-encoder was effective for improving the reward.

B. Atari game

1) Environment: We choose "Breakout" for the evaluation because it is one of the games that the dqn study [1] trained successfully. The previous studies [1], [2] proposed a trainable deep convolutional neural network; we use the same network for the proposed method, but change the initial image size from 84 × 84 to 88 × 88 due to adding the auto-encoder component. All other parameters and settings are the same as those in the previous study. Figure 4 shows the proposed network. The pre-training data are images captured by a random-policy agent for 100k steps, and we train the auto-encoder component by stochastic gradient descent with a fixed learning rate (0.01) for 25 epochs. We also train the first-layer-copy method of [9]; its training parameters are the same as for the proposed method and therefore differ from [9]. A sketch of such a convolutional auto-encoder is given below.

Fig. 4. Proposed network for the Atari game: "Conv" means a convolutional layer.

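A convolutional auto-encoder for the 88 × 88 Atari frames could look roughly like the sketch below, with an encoder that mirrors the dqn convolution stack. The filter sizes follow the original dqn architecture, but the "same" padding, the 4-frame stacking, and the transposed-convolution decoder are illustrative assumptions made so that the reconstruction matches the input size; random_policy_frames is an assumed array of frames captured by a random-policy agent.

    from keras.models import Model
    from keras.layers import Input, Conv2D, Conv2DTranspose

    frames_in = Input(shape=(88, 88, 4))               # stacked grayscale frames
    x = Conv2D(32, (8, 8), strides=4, activation='relu', padding='same')(frames_in)
    x = Conv2D(64, (4, 4), strides=2, activation='relu', padding='same')(x)
    code = Conv2D(64, (3, 3), strides=1, activation='relu', padding='same')(x)

    y = Conv2DTranspose(64, (3, 3), strides=1, activation='relu', padding='same')(code)
    y = Conv2DTranspose(32, (4, 4), strides=2, activation='relu', padding='same')(y)
    frames_out = Conv2DTranspose(4, (8, 8), strides=4, activation='sigmoid',
                                 padding='same')(y)

    conv_ae = Model(frames_in, frames_out)
    conv_ae.compile(optimizer='sgd', loss='mse')       # Keras' default SGD rate is 0.01
    # conv_ae.fit(random_policy_frames, random_policy_frames, epochs=25, batch_size=32)
    # The decoder is then discarded and Flatten/Dense layers plus a Q head are
    # added on top of `code`, exactly as in the cart-pole sketch.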

2) Results: Figure 5 shows the reward curves with different numbers of training steps. The average rewards were computed from running 20 episodes. The average rewards after 1.25 million training steps were 32.7, 33.6, and 33.0 for dqn, daqn, and [9], respectively. The proposed method has the best result; however, it is similar to the dqn result.

Fig. 5. Average rewards for Breakout.

C. Real-world game

In this section, we conduct an experiment to evaluate the proposed method in the real environment. However, there is no suitable game task with a real environment in previous work. Hence, we choose an interactive game based on human hand-games with discrete actions: a rock-paper-scissors game.

1) Game setting: The optimal action for reinforcement learning is to beat the human opponent's action given an image of the opponent's hand. The possible discrete actions are to make a hand posture that represents "rock", "paper", or "scissors".

2) Pre-experiment: First, we must find a deep network that can classify these hand images. We take 50k images of four types: a "rock" hand, a "paper" hand, a "scissors" hand, and the background. During the shooting we vary the person, the clothes worn, the right or left hand, the height of the hand, the background of the environment, and the lighting condition. Figure 6 shows examples of the taken images; they include not only "completed" gestures, such as the "paper" hand in the upper right of the figure, but also hands captured while the gesture is changing. Images were initially cropped to a 120 × 120 color image. We prepare five networks (i-v) that classify the images into the four classes; the upper part of Table I describes the networks, which differ in their convolution, max-pooling, fully-connected, dropout, and softmax configurations. Each convolutional and fully-connected layer has a ReLU activation function. Training (90%) and test (10%) images were separated. Figure 7 shows the accuracy curves from five different trials, and the final row of Table I shows the average accuracy after 80 epochs. According to these results, network iii gave the highest accuracy, so we use the network iii structure after this (an illustrative classifier sketch is given after Figure 6).

TABLE I
NETWORKS AND ACCURACY FOR CLASSIFICATION TASK

    Network                          i       ii      iii     iv      v
    Average accuracy (80 epochs)     95.7%   97.2%   97.8%   97.6%   97.1%

In order to clarify the requirements on the pre-training data, we prepare the following three types of pre-training data: all images taken in the pre-experiment (the three hand types and the background), only the background images, and handwritten digit (mnist) [19] images. If the proposed method can train from only the background images, the cost of taking images is lower than taking hand images, because the robot can capture images of the environment around it and human hand images are not necessary. And, of course, if we can train from mnist, it is the lowest cost to prepare.

Fig. 6. Sample images for rock-paper-scissors game
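
For the pre-experiment, any of the candidate networks i-v could be expressed along the following lines. The exact layer configurations of Table I are not reproduced here; this is an illustrative convolution / max-pooling / fully-connected / softmax classifier with ReLU activations for the four classes, and train_images / train_labels are assumed placeholders for the 90% training split.

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

    clf = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(120, 120, 3)),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.5),
        Dense(4, activation='softmax'),                # rock / paper / scissors / background
    ])
    clf.compile(optimizer='adam', loss='categorical_crossentropy',
                metrics=['accuracy'])
    # clf.fit(train_images, train_labels, validation_split=0.1, epochs=80)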


Fig. 7. Accuracy curve of classification.

3) Environment: Figure 8 shows the proposed network, which uses the network iii structure for the encoder part. We train the auto-encoder component for 30 epochs with an adam optimizer and random image shifting, rotation, and flipping. After the auto-encoder training, the dqn component trains the policies. In this phase we use only the hand-type images from the images taken in the previous experiment; background images are excluded. The reward is +1, 0, or -1 when the agent wins, draws, or loses, respectively. The settings of the dqn method and of the dqn component in daqn are as follows: the learning ratio is 0.001 with sgd, the size of the replay memory is 1000, the batch size is 32, the exploration policy is ε-greedy, and the loss is a mean square error function. We also compare with the [9] method; its training parameters are the same as for daqn (these settings are collected in the code sketch following Table II).

Fig. 8. Proposed network for the real game: "up" means up-sampling.

4) Results: Figure 9 shows the winning ratio curves on test images with different numbers of training steps. Each statistic was computed from 4k test trials and 5 different training trials. The average winning ratios after 200k training steps are 93.2%, 94.6%, 94.9%, 91.4%, and 94.4% for dqn, daqn pre-trained with hand-type and background images, daqn pre-trained with background images only, daqn pre-trained with mnist images, and the [9] method pre-trained with background images, respectively. According to these results, the proposed method trains faster than the dqn method; for example, it is around 2.5 times faster to reach a 90% winning ratio. The background images alone are enough to pre-train the network. Although the mnist images initially help, the final winning ratio is lower than that of the dqn method, which means this pre-training data misled the feature extraction in the reinforcement learning. Therefore, the requirement on the pre-training data is that it must come from a domain-related environment; however, backgrounds of that environment are acceptable. The processing time for 300k training steps, including evaluation, was around 16.7 hours with an Nvidia K40, and one test step took around 5.01 milliseconds with an Nvidia GTX Titan.

Fig. 9. Winning ratio curve for the rock-paper-scissors game.

Moreover, we implemented this game on an actual robot with the daqn pre-trained on background images. We attach a video that shows the robot's reaction at different numbers of training trials. In the video, the robot wins only once at first; it gradually wins more, and it finally beats all types of opponent hands.

TABLE II
CONCLUSION OF THESE EXPERIMENTS

    Task        Input complexity   Game complexity   DAQN vs. DQN
    Cart-pole   Simple             Simple            DAQN > DQN
    Atari       Complex            Complex           DAQN ≃ DQN
    Real game   Highly complex     Simple            DAQN ≫ DQN

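The real-game training settings described above could be wired up roughly as follows. This is a hedged sketch: build_realgame_ae, build_q_head, background_images, and rock_paper_scissors_env are assumed helpers and data, not code from the paper, and the keras-rl agent configuration is only meant to mirror the listed hyper-parameters.

    from keras.preprocessing.image import ImageDataGenerator
    from keras.optimizers import SGD
    from rl.agents.dqn import DQNAgent
    from rl.policy import EpsGreedyQPolicy
    from rl.memory import SequentialMemory

    # auto-encoder phase: adam, 30 epochs, random shift / rotation / flip
    augmenter = ImageDataGenerator(width_shift_range=0.1, height_shift_range=0.1,
                                   rotation_range=15, horizontal_flip=True)
    autoencoder, encoder = build_realgame_ae()         # assumed helper (conv AE for 120x120x3)
    autoencoder.compile(optimizer='adam', loss='mse')
    autoencoder.fit_generator(
        augmenter.flow(background_images, background_images, batch_size=32),
        steps_per_epoch=len(background_images) // 32, epochs=30)

    # dqn phase: lr 0.001 with sgd, replay memory 1000, batch size 32, eps-greedy
    q_network = build_q_head(encoder, nb_actions=3)    # assumed helper: encoder + Dense(3)
    agent = DQNAgent(model=q_network, nb_actions=3,
                     memory=SequentialMemory(limit=1000, window_length=1),
                     policy=EpsGreedyQPolicy(), batch_size=32)
    agent.compile(SGD(lr=0.001), metrics=['mae'])
    # agent.fit(rock_paper_scissors_env, nb_steps=200000)   # assumed gym-style environment
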
IV. DISCUSSIONS

We applied the proposed method to different games: a simple game, a complex game in a simulator, and a real game. Table II summarizes these experiments. The proposed method has a pre-training component for the input; therefore, the complexity of the input directly affects the advantage of the proposed method. Hence, the most effective case is the game in the real environment. On the other hand, although the proposed method has a better result in the Atari game, the contribution there is not high. We guess the main reason is the complexity of the game: the pre-training data is based on a random policy, so the game images do not change very much because of the difficulty of the game, and the pre-trained network parameters of daqn are therefore easily acquired during the initial steps of the dqn anyway. Hence, the complexity of the game is also important for this pre-training.

Although it is reduced, the number of training trials of the proposed method is still large, and it is still difficult to implement a system that performs "live training" on a physical robot. The reason for this long training is the difficulty of training on inputs from the real environment. If we can use an improved deep reinforcement learning method, we hope it will converge faster; however, the processing time of the test phase is also important for a real robot.

V. RELATED WORK

The deep q-network method [1], [2] successfully applies a deep convolutional neural network [20] to reinforcement learning [21] for training a policy for computer games. The convolutional neural network was originally inspired by the neocognitron [22] and has become a well-known approach for extracting high-level features from a raw image; it has led to significant breakthroughs in image processing studies [23], [24], [25].

Reinforcement learning [21] is an approach for training action policies that maximize future cumulative rewards. There are several algorithms, such as Q-learning [10] and TD-gammon [26].

TD-gammon applied a multi-layer network to calculate approximated values and achieved human-level play of the game.

The deep q-network (dqn) [1], [2] has a deep network that calculates the q-value of Q-learning [10] from a raw input. Neural fitted q-learning (nfq) [27] has a similar mechanism; however, nfq uses a batch update with a high computational cost per iteration, whereas dqn uses stochastic gradient updates with a low constant cost per iteration [2]. Moreover, dqn uses the latest convolutional neural network.

The deep auto-encoder was proposed for dimensionality reduction [5]. However, it has also become widely used for learning a generative model of the data [7], [8]. The original deep auto-encoder [5] has a one-dimensional structure, with which it is difficult to preserve the spatial locality information of images. Therefore, the convolutional auto-encoder was proposed [6], and several studies have applied it to various problems [28], [29], [30].

A study combining a spatial auto-encoder and reinforcement learning for a real robot was proposed [31]; they used the spatial auto-encoder to understand the camera input of the robot. However, they used it to describe the environment, such as the positions of objects, not to pre-train the network. Another reinforcement learning study with an auto-encoder was also proposed [32]; they introduced an additional reward, similar to "curiosity" [33], computed by the auto-encoder. However, they used the auto-encoder to reduce the dimension of the state input, which is the same application as the original deep auto-encoder [5].

A study similar to the proposed method is the deep auto-encoder neural networks in reinforcement learning [34]. That method first trains features by an auto-encoder and then trains policies by a batch-mode reinforcement learning algorithm. The architecture is similar to ours; however, the motivation is different: they used the auto-encoder to reduce the dimension of the input for reinforcement learning, whereas we use the auto-encoder to reduce the number of training trials in the reinforcement learning phase. Hence, we discussed training efficiency in this paper. Furthermore, we used the latest convolutional auto-encoder and the latest deep reinforcement learning method.

A pre-training method with neural fitted q-learning was also proposed [35]. They tried to pre-train from a completely random policy and from a "hint-to-goal heuristic" policy, and reported that pre-training without the "hint" seems to do nothing. We believe the reason is the input complexity; they only tried the Mountain Car and Puddle World tasks, which are simple. In this paper, we discuss more complex environments.

The most similar study is [9]. They copied network parameters pre-trained by an auto-encoder to a deep q-network and evaluated on Atari games. However, they concluded that "the results generally show lower performance for cases with pre-training". We think this has three reasons: the type of pre-training data, the structure of the copied network, and the hyper-parameters for training. First, they mainly focused on transferring networks trained on multiple games or on imagenet, which they called "SSAE" and "INAE"; these results are expected to become bad, as with the mnist result in this paper. Second, they only copied the pre-trained parameters of the first layer, which is not sufficient to help the dqn; they did not try to copy all layers of the auto-encoder to the dqn in the "GSAE" case. Third, when we tuned some hyper-parameters for training, pre-training helped slightly even with their method on the Atari game. The big difference from their paper is that they only discussed Atari, whereas we conducted an experiment on the real game, where a larger advantage of pre-training is expected.

A robot that beats a human at rock-paper-scissors was proposed [36]; they constructed a new active sensing system that can actively track and recognize the human hand by using a high-speed vision system. However, they did not use reinforcement learning, so it could not train the optimal action. They also used a heuristic algorithm to understand the human hand, which is not robust in a noisy environment.

VI. CONCLUSION

We proposed an extended deep reinforcement learning method that applies a generative model to reduce the number of training steps. We evaluated it on three different games: a basic cart-pole game, a well-known Atari game, and a "rock-paper-scissors" game in the real environment. Our method trained efficiently in all conditions, and it works especially well when the input image has high complexity. For example, it trained 2.5 times faster than the original deep q-network method, and it could train from background images, which can be easily taken. We expect that introducing a generative model will be required for deep reinforcement learning on actual robots in real environments.

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb 2015.
[2] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang, "Deep learning for real-time atari game play using offline monte-carlo tree search planning," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 3338–3346.
[3] T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," in The International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2016. [Online]. Available: http://arxiv.org/abs/1509.02971
[4] V. Mnih, A. Puigdomenech Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," International Conference on Machine Learning, 2016.
[5] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[6] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, "Stacked convolutional auto-encoders for hierarchical feature extraction," in Proceedings of the 21st International Conference on Artificial Neural Networks - Volume Part I, ser. ICANN'11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 52–59.
[7] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?" J. Mach. Learn. Res., vol. 11, pp. 625–660, Mar. 2010. [Online]. Available: http://dl.acm.org/citation.cfm?id=1756006.1756025
[8] Q. V. Le, "Building high-level features using large scale unsupervised learning," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 8595–8598.
[9] T. Sandven, "Visual pretraining for deep q-learning," Master's thesis, NTNU, 2016.
[10] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[11] L.-J. Lin, "Reinforcement learning for robots using neural networks," Technical report, DTIC Document, 1993.
[12] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13, no. 5, pp. 834–846, Sept 1983.
[13] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, "The arcade learning environment: An evaluation platform for general agents," J. Artif. Int. Res., vol. 47, no. 1, pp. 253–279, May 2013. [Online]. Available: http://dl.acm.org/citation.cfm?id=2566972.2566979
[14] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," 2016. [Online]. Available: http://gym.openai.com/
[15] F. Chollet, "Keras," https://github.com/fchollet/keras, 2015.
[16] M. Plappert, "keras-rl," https://github.com/matthiasplappert/keras-rl, 2016.
[17] V. Nair and G. E. Hinton, "Rectified linear units improve restricted boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), J. Fürnkranz and T. Joachims, Eds. Omnipress, 2010, pp. 807–814. [Online]. Available: http://www.icml2010.org/papers/432.pdf
[18] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in The International Conference on Learning Representations (ICLR), vol. abs/1412.6980, USA, 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.
[20] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[21] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998, vol. 1.
[22] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 1–9.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[26] G. Tesauro, "Temporal difference learning and td-gammon," Commun. ACM, vol. 38, no. 3, pp. 58–68, Mar. 1995.
[27] M. Riedmiller, "Neural fitted q iteration – first experiences with a data efficient neural reinforcement learning method," in Proceedings of the 16th European Conference on Machine Learning. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 317–328.
[28] X. Mao, C. Shen, and Y. Yang, "Image restoration using very deep fully convolutional encoder-decoder networks with symmetric skip connections," in Advances in Neural Information Processing Systems (NIPS'16), 2016.
[29] O. K. Oyedotun and K. Dimililer, "Pattern recognition: invariance learning in convolutional auto encoder network," International Journal of Image, Graphics and Signal Processing, vol. 8, no. 3, p. 19, 2016.
[30] V. Turchenko and A. Luczak, "Creation of a deep convolutional auto-encoder in Caffe," CoRR, vol. abs/1512.01596, 2015. [Online]. Available: http://arxiv.org/abs/1512.01596
[31] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, "Deep spatial autoencoders for visuomotor learning," in International Conference on Robotics and Automation (ICRA), 2016.
[32] B. C. Stadie, S. Levine, and P. Abbeel, "Incentivizing exploration in reinforcement learning with deep predictive models," CoRR, vol. abs/1507.00814, 2015. [Online]. Available: http://arxiv.org/abs/1507.00814
[33] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, "Curiosity-driven exploration by self-supervised prediction," in International Conference on Machine Learning (ICML), 2017.
[34] S. Lange and M. Riedmiller, "Deep auto-encoder neural networks in reinforcement learning," in The 2010 International Joint Conference on Neural Networks (IJCNN), July 2010, pp. 1–8.
[35] F. Abtahi and I. Fasel, "Deep belief nets as function approximators for reinforcement learning," in Proceedings of the 15th AAAI Conference on Lifelong Learning, ser. AAAIWS'11-15. AAAI Press, 2011, pp. 2–7. [Online]. Available: http://dl.acm.org/citation.cfm?id=2908756.2908757
[36] K. Ito, T. Sueishi, Y. Yamakawa, and M. Ishikawa, "Tracking and recognition of a human hand in dynamic motion for janken (rock-paper-scissors) robot," in 2016 IEEE International Conference on Automation Science and Engineering (CASE), Aug 2016, pp. 891–896.