Playing FPS Games with Deep Reinforcement Learning

Guillaume Lample,∗ Devendra Singh Chaplot∗
{glample,chaplot}@cs.cmu.edu
School of Computer Science, Carnegie Mellon University

Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

∗ The authors contributed equally to this work.
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Advances in deep reinforcement learning have allowed autonomous agents to perform well on Atari games, often outperforming humans, using only raw pixels to make their decisions. However, most of these games take place in 2D environments that are fully observable to the agent. In this paper, we present the first architecture to tackle 3D environments in first-person shooter games, which involve partially observable states. Typically, deep reinforcement learning methods only utilize visual input for training. We present a method to augment these models to exploit game feature information, such as the presence of enemies or items, during the training phase. Our model is trained to simultaneously learn these features while minimizing a Q-learning objective, which is shown to dramatically improve the training speed and performance of our agent. Our architecture is also modularized to allow different models to be independently trained for different phases of the game. We show that the proposed architecture substantially outperforms built-in AI agents of the game as well as average humans in deathmatch scenarios.

[Figure 1: A screenshot of Doom.]

1 Introduction

Deep reinforcement learning has proved to be very successful in mastering human-level control policies in a wide variety of tasks such as object recognition with visual attention (Ba, Mnih, and Kavukcuoglu 2014), high-dimensional robot control (Levine et al. 2016), and solving physics-based control problems (Heess et al. 2015). In particular, Deep Q-Networks (DQN) have been shown to be effective at playing Atari 2600 games (Mnih et al. 2013) and, more recently, at defeating world-class Go players (Silver et al. 2016).

However, all of the above applications share a limitation: they assume full knowledge of the current state of the environment, which is usually not true in real-world scenarios. With partially observable states, the learning agent needs to remember previous states in order to select optimal actions. Recently, there have been attempts to handle partially observable states in deep reinforcement learning by introducing recurrency in Deep Q-Networks. For example, Hausknecht and Stone (2015) use a deep recurrent neural network, in particular a Long Short-Term Memory (LSTM) network, to learn the Q-function for playing Atari 2600 games. Foerster et al. (2016) consider a multi-agent scenario where deep distributed recurrent neural networks communicate between different agents in order to solve riddles. Recurrent neural networks are effective in scenarios with partially observable states because of their ability to remember information for an arbitrarily long amount of time.

Previous methods have usually been applied to 2D environments that hardly resemble the real world. In this paper, we tackle the task of playing a First-Person Shooter (FPS) game in a 3D environment. This task is much more challenging than playing most Atari games, as it involves a wide variety of skills, such as navigating through a map, collecting items, and recognizing and fighting enemies. Furthermore, states are partially observable, and the agent navigates the 3D environment from a first-person perspective, which makes the task more suitable for real-world robotics applications.

In this paper, we present an AI agent for playing deathmatches¹ in FPS games using only the pixels on the screen. Our agent divides the problem into two phases: navigation (exploring the map to collect items and find enemies) and action (fighting enemies when they are observed), and uses separate networks for each phase of the game. Furthermore, the agent infers high-level game information, such as the presence of enemies on the screen, to decide its current phase and to improve its performance. We also introduce a method for co-training a DQN with game features, which turned out to be critical in guiding the convolutional layers of the network to detect enemies. We show that co-training significantly improves the training speed and performance of the model.

We evaluate our model on two different tasks adapted from the Visual Doom AI Competition (ViZDoom)² using the API developed by Kempka et al. (2016); Figure 1 shows a screenshot of Doom. The API gives direct access to the Doom game engine and allows us to synchronously send commands to the game agent and receive inputs describing the current state of the game. We show that the proposed architecture substantially outperforms built-in AI agents of the game as well as humans in deathmatch scenarios, and we demonstrate the importance of each component of our architecture.

¹ A deathmatch is a scenario in FPS games where the objective is to maximize the number of kills by a player/agent.
² ViZDoom Competition at the IEEE Computational Intelligence and Games (CIG) Conference, 2016 (http://vizdoom.cs.put.edu.pl/competition-cig-2016).

2 Background

Below we give a brief summary of the DQN and DRQN models.

2.1 Deep Q-Networks

Reinforcement learning deals with learning a policy for an agent interacting in an unknown environment. At each step, the agent observes the current state $s_t$ of the environment, decides on an action $a_t$ according to a policy $\pi$, and observes a reward signal $r_t$. The goal of the agent is to find a policy that maximizes the expected sum of discounted rewards $R_t$:

$$R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$$

where $T$ is the time at which the game terminates, and $\gamma \in [0, 1]$ is a discount factor that determines the importance of future rewards. The Q-function of a given policy $\pi$ is defined as the expected return from executing an action $a$ in a state $s$:

$$Q^{\pi}(s, a) = \mathbb{E}\left[R_t \mid s_t = s, a_t = a\right]$$

It is common to use a function approximator to estimate the action-value function $Q$. In particular, DQN uses a neural network parametrized by $\theta$, and the idea is to obtain an estimate of the Q-function of the current policy that is close to the optimal Q-function $Q^*$, defined as the highest return we can expect to achieve by following any strategy:

$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[R_t \mid s_t = s, a_t = a\right] = \max_{\pi} Q^{\pi}(s, a)$$

In other words, the goal is to find $\theta$ such that $Q_{\theta}(s, a) \approx Q^*(s, a)$. The optimal Q-function satisfies the Bellman optimality equation:

$$Q^*(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$$

If $Q_{\theta} \approx Q^*$, it is natural to expect $Q_{\theta}$ to come close to satisfying the Bellman equation as well. This leads to the following loss function:

$$L_t(\theta_t) = \mathbb{E}_{s,a,r,s'}\left[\left(y_t - Q_{\theta_t}(s, a)\right)^2\right]$$

where $t$ is the current time step, and $y_t = r + \gamma \max_{a'} Q_{\theta_t}(s', a')$. The value of $y_t$ is held fixed, which leads to the following gradient:

$$\nabla_{\theta_t} L_t(\theta_t) = \mathbb{E}_{s,a,r,s'}\left[\left(y_t - Q_{\theta_t}(s, a)\right) \nabla_{\theta_t} Q_{\theta_t}(s, a)\right]$$

Instead of using an accurate estimate of the above gradient, we compute it using the following approximation:

$$\nabla_{\theta_t} L_t(\theta_t) \approx \left(y_t - Q_{\theta_t}(s, a)\right) \nabla_{\theta_t} Q_{\theta_t}(s, a)$$

Although this is a very rough approximation, these updates have been shown to be stable and to perform well in practice.
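For illustration, this update can be written directly in an autodiff framework. The following is a minimal sketch in PyTorch, not the paper's implementation: the Q-network `q_net`, the batch layout, and the discount value are illustrative assumptions.

```python
import torch

def dqn_loss(q_net, batch, gamma=0.99):
    """Squared TD error (y_t - Q_theta(s, a))^2 averaged over a sampled batch."""
    # batch: states, actions (int64), rewards, next states, terminal flags (float 0/1)
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_theta(s, a) for the taken actions
    with torch.no_grad():                                  # y_t is treated as a fixed target
        y = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    return ((y - q_sa) ** 2).mean()
```

Backpropagating through this loss while keeping $y_t$ fixed (the `no_grad` block) reproduces the approximate gradient above, up to a constant factor.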
Instead of performing the Q-learning updates in an online fashion, it is popular to use experience replay (Lin 1993) to break the correlation between successive samples. At each time step, agent experiences $(s_t, a_t, r_t, s_{t+1})$ are stored in a replay memory, and the Q-learning updates are done on batches of experiences randomly sampled from the memory.

At every training step, the next action is generated using an $\epsilon$-greedy strategy: with probability $\epsilon$ the next action is selected randomly, and with probability $1 - \epsilon$ according to the network's best action. In practice, it is common to start with $\epsilon = 1$ and to progressively decay $\epsilon$.
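A minimal sketch of these two ingredients is shown below, assuming a plain Python training loop; the buffer capacity and the linear decay schedule are illustrative choices, not the values used in the paper.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of transitions (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest experiences are dropped first

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between successive experiences.
        return random.sample(list(self.buffer), batch_size)

def epsilon_greedy(q_values, step, eps_start=1.0, eps_end=0.1, decay_steps=1_000_000):
    """Pick a random action with probability eps, otherwise the network's best action."""
    eps = max(eps_end, eps_start - step * (eps_start - eps_end) / decay_steps)
    if random.random() < eps:
        return random.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])     # exploit
```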
2.2 Deep Recurrent Q-Networks

The above model assumes that at each step the agent receives a full observation $s_t$ of the environment. As opposed to games like Go, Atari games actually rarely return a full observation, since they still contain hidden variables, but the current screen buffer is usually enough to infer a very good sequence of actions. In partially observable environments, however, the agent only receives an observation $o_t$ of the environment, which is usually not enough to infer the full state of the system. An FPS game like DOOM, where the agent's field of view is limited to 90° centered around its position, obviously falls into this category.

To deal with such environments, Hausknecht and Stone (2015) introduced Deep Recurrent Q-Networks (DRQN), which estimate $Q(o_t, h_{t-1}, a_t)$ rather than $Q(s_t, a_t)$, where $h_t$ is an extra input returned by the network at the previous step that represents the hidden state of the agent. A recurrent neural network such as an LSTM (Hochreiter and Schmidhuber 1997) can be implemented on top of the normal DQN model to do this. In that case, $h_t = \text{LSTM}(h_{t-1}, o_t)$, and we estimate $Q(h_t, a_t)$. Our model is built on top of the DRQN architecture.
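The recurrence can be added on top of a standard DQN with little code. Below is a minimal DRQN sketch in PyTorch, assuming RGB frames and illustrative layer sizes (not the architecture used in this paper): the convolutional features of each observation $o_t$ feed an LSTM whose hidden state $h_t$ summarizes the past, and Q-values are read out from $h_t$.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, num_actions, hidden_size=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),    # fixed-size features regardless of frame resolution
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=64 * 4 * 4, hidden_size=hidden_size, batch_first=True)
        self.q_head = nn.Linear(hidden_size, num_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, channels, height, width)
        b, t = obs_seq.shape[:2]
        feats = self.conv(obs_seq.flatten(0, 1)).view(b, t, -1)   # per-frame features
        out, hidden = self.lstm(feats, hidden)                    # h_t = LSTM(h_{t-1}, o_t)
        return self.q_head(out), hidden                           # Q(h_t, a) at every step
```

At play time the hidden state returned by `forward` is carried over from one frame to the next, so information about previously seen enemies or items persists across steps.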
