Game Playing with Deep Q-Learning Using OpenAI Gym
Robert Chuchro ([email protected])
Deepak Gupta ([email protected])

Abstract

Historically, designing game players requires domain-specific knowledge of the particular game to be integrated into the model for the game playing program. This leads to a program that can only learn to play a single particular game successfully. In this project, we explore the use of general AI techniques, namely reinforcement learning and neural networks, in order to architect a model which can be trained on more than one game. Our goal is to progress from a simple convolutional neural network with a couple of fully connected layers to experimenting with more complex additions, such as deeper layers or recurrent neural networks.

1. Introduction

Game playing has recently emerged as a popular playground for exploring the application of artificial intelligence theory. With the addition of convolutional neural networks, the performance of game playing programs has seen improvement that can go beyond the ability of even expert-level human players. This demonstrates the powerful learning capabilities of computers in a variety of challenging environments. Designing a model which is agnostic to its environment allows us to investigate a core problem of artificial intelligence: the concept of general intelligence.

The benefit of interfacing with OpenAI Gym is that it is an actively developed interface which keeps adding more environments and features useful for training. One of the core challenges with computer vision is obtaining enough data to properly train a neural network, and OpenAI Gym provides a clean interface with dozens of different environments.

1.1. Learning Environment

In this project, we will be exploring reinforcement learning on a variety of OpenAI Gym environments (G. Brockman et al., 2016). OpenAI Gym is an interface which provides various environments that simulate reinforcement learning problems. Specifically, each environment has an observation state space, an action space used to interact with the environment and transition between states, and a reward associated with performing a particular action in a given state. This information is fundamental to any reinforcement learning problem.

The input to our model will be a sequence of pixel images as arrays (Width x Height x 3) generated by a particular OpenAI Gym environment. We then use a Deep Q-Network to output an action from the action space of the game. The output that the model learns is an action from the environment's action space that maximizes future reward from a given state. In this paper, we explore using a neural network with multiple convolutional layers as our model.

2. Related Work

The best known success story of classical reinforcement learning is TD-gammon, a backgammon playing program which learned entirely by reinforcement learning [6]. TD-gammon used a model-free reinforcement learning algorithm similar to Q-learning. Attempts to build on the success of TD-gammon, such as applying the same techniques to Go and checkers, were less successful; it was later concluded that the stochasticity of the dice rolls helps with exploring the state space and makes the value function particularly smooth [7].

The core of the problem is to apply nonlinear function approximation, such as a neural network, to the reinforcement learning problem. Traditionally, reinforcement learning problems tend to be unstable or to diverge when a nonlinear function is used to represent the action value (Q value) [1]. Learning to control agents directly from high-dimensional sensory inputs such as vision is one of the long-standing challenges of reinforcement learning [2]. Most of the recent algorithms and works that use deep learning models to approximate the Q function for reinforcement learning have been made possible by recent breakthroughs in computer vision [3, 4, 5]. The first deep learning model to tackle the reinforcement learning problem was introduced by DeepMind Technologies [2], which successfully learned control policies directly from image input using reinforcement learning. They used a convolutional neural network to train on seven Atari 2600 games from the Arcade Learning Environment and were able to achieve state-of-the-art results in six of the seven games tested, with no adjustment of the architecture, learning algorithm, or hyperparameters.
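For reference, the objective that such a deep Q-network is trained to minimize can be written in the standard form from the DQN literature (included here as a reminder, not as an equation reproduced from this paper):

\[
L(\theta) = \mathbb{E}_{(s, a, r, s')}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^{2}\Big],
\]

where \(\theta\) are the parameters of the online network, \(\gamma\) is the discount factor, and \(\theta^{-}\) denotes the parameters of a separate, periodically updated target network (in the simplest variant, \(\theta^{-} = \theta\)). The Double DQN work discussed next modifies the \(\max\) in this target by selecting the action with the online network and evaluating it with the target network, i.e. replacing \(\max_{a'} Q(s', a'; \theta^{-})\) with \(Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-})\).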
Following the success of DQN, the Google DeepMind team experimented with applying a deep convolutional network to the Double Q-learning algorithm [8, 9], the motivation being that standard deep Q-learning algorithms were known to overestimate values under certain conditions. If overestimations are not uniform, they can negatively impact the quality of the resulting policy [10]. Double DQN builds on the idea of Double Q-learning, which aims at reducing overestimations by decomposing the max operation in the target into action selection and action evaluation [8]. The Google DeepMind team concluded that DDQN clearly improves over DQN. Noteworthy examples include Road Runner (from 233% to 617%), Asterix (from 70% to 180%), Zaxxon (from 54% to 111%), and Double Dunk (from 17% to 397%). DDQN was also estimated to show lower variance in results than the corresponding DQN model.

There has also been experimentation with different network architectures within the convolutional network domain, such as Dueling Networks, which contain two separate estimators: one for the state value function and one for the state-dependent action advantage function. This helps in generalizing learning across actions without imposing any change to the underlying reinforcement learning algorithm [11]. The Google DeepMind team experimented with applying dueling networks on DDQN both with prioritized experience replay [12] and with uniform networks without any prioritized experience replay buffer. They evaluated the dueling network architecture on 57 Atari games and found that the prioritized dueling agent performs significantly better than both the prioritized baseline agent and the dueling agent alone. More recent work by Google DeepMind involves an asynchronous variant of the actor-critic model which surpasses the current state of the art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU [13]. The asynchronous actor-critic variant draws inspiration from the General Reinforcement Learning Architecture (Gorila) and from earlier work which studied convergence properties of Q-learning in the asynchronous optimization setting [15].

3. Data Format

3.1. Interacting with the Environment

All of the data used for training our model comes from interacting with the OpenAI Gym interface (CITE). Interacting with the Gym interface has three main steps: registering the desired game with Gym, resetting the environment to get the initial state, and then applying a step on the environment to generate a successor state.

The input required to step the environment is an action value. More specifically, it is an integer ranging over [0, num_actions). The information that we receive back is the successor state, represented as an image array of shape (Width x Height x 3), where the third dimension holds the RGB values at each pixel. In addition, we receive a reward value for the transition as well as an indicator that notifies us if the particular episode has terminated.

In the case of Flappy Bird, the observation space is a matrix of shape (288x512x3) representing the pixel values of the image of a particular frame in the game. Upon passing between a set of pipes, a reward of 1 is received. If Flappy Bird collides with a pipe, the reward is -5 and the episode ends. A reward of 0 is given in all other states. The action space that we learn our policy on consists of just two actions, 0 or 1. A sample image from a frame generated by the OpenAI Gym environment is shown in figure [1].

Figure 1. Sample frame of Flappy Bird. An action of 1 causes the bird to go up, while 0 allows gravity to drag it down.
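As a concrete illustration of the three interaction steps described above, the following is a minimal sketch using the classic Gym API of this era (reset returning the observation, step returning a 4-tuple). The environment id "FlappyBird-v0" and the gym_ple import are assumptions for illustration; the paper does not state which package provides its Flappy Bird environment.

import gym
import gym_ple  # assumed: registers the PLE games (e.g. Flappy Bird) with Gym

# Step 1: register/instantiate the desired game with Gym.
env = gym.make("FlappyBird-v0")  # assumed environment id

# Step 2: reset the environment to obtain the initial state (an RGB frame).
observation = env.reset()

done = False
total_reward = 0.0
while not done:
    # An action is an integer in [0, num_actions); for Flappy Bird, 0 or 1.
    action = env.action_space.sample()  # placeholder for the learned policy

    # Step 3: step the environment. Gym returns the successor frame
    # (Width x Height x 3 RGB array), the transition reward, a terminal
    # indicator, and an auxiliary info dict.
    observation, reward, done, info = env.step(action)
    total_reward += reward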
Input Preprocessing ter than both the prioritized baseline agent and the dueling Once we receive the image from the Gym environment, agent alone Recent works by Google DeepMind involves we apply a couple steps to format the image to finally use an asynchronous variant of action critic model which surpasses input to the model. First, we convert the image to gray scale, current state-of-the-art on the Atari domain while training which should reduce discrepancies between episodes with for half the time on a single multi-core CPU instead of a different background settings (night/day, pipe/bird color). It GPU [13].Asynchronous variant of actor critic model finds also reduces the size of the third dimension from 3 for RGB inspiration from General Reinforcement Learning Architec- to 1. Next, we De-noise the image using adaptive thresh- ture (Gorila) and some earlier work which studied conver- olding. In Flappy Bird, the background contains some un- gence properties of Q learning in the asynchronous opti- necessary pixels that can create some noise for the neural mization setting [15]. network, such as stars in the night background. Applying adaptive Gaussian thresholding (CITE) incorporates each 3. Data Format pixel’s neighboring values to determine whether its value is 3.1. Interacting with Environment noise or not. This same algorithm generalizes well format other games, as shown in figure [2]. All of the data used for training our model comes from Afterwards, we normalize values to be between 0 and interacting with the OpenAI Gym Interface (CITE). Inter- 1 and reduce the image to 80x80 pixels. This allows for acting with the Gym interface has three main steps: register- consistent input sizes across games as well as decreasing ing the desired game with Gym, resetting the environment dimensionality of each layer of our network, allowing for to get the initial state, then applying a step on the environ- faster passes and fewer parameters per layer. Finally, we ment to generate a successor state. stack last 4 frames as input to network (CITE) to form the The input which is required to step in the environment 80x80x4 input to the network. For the reward values, we is an action value. More specifically, it is an integer rang- clip all incoming rewards such that positive rewards are all ing from [0; numactions). The information that we receive 1, negative rewards are -1, and 0 rewards remain 0. This 2 Network Architecture Type Classes / Filters Filter Size Stride Activation Conv-1 32 8x8 4 ReLU Conv-2 64 4x4 2 ReLU Conv-3 64 3x3 1 ReLU Fully Connected-1 512 ReLU Fully Connected-2 # of actions Linear Figure 3. Details of layers of our current network, based off the Figure 2. Image preprocessing for Flappy Bird (top) and Pixel DeepMind DQN model (V.