Combo-Action: Training Agent for FPS Game with Auxiliary Tasks
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Shiyu Huang, Hang Su, Jun Zhu,* Ting Chen*
Dept. of Comp. Sci. & Tech., BNRist Center, State Key Lab for Intell. Tech. & Sys., Institute for AI, THBI Lab, Tsinghua University, Beijing, 100084, China
*J. Zhu and T. Chen are corresponding authors.

Abstract

Deep reinforcement learning (DRL) has surpassed human performance on Atari games, using raw pixels and rewards to learn everything. However, first-person-shooter (FPS) games in 3D environments contain higher-level human concepts (enemy, weapon, spatial structure, etc.) and a large action space. In this paper, we explore a novel method which can plan on temporally-extended action sequences, which we refer to as Combo-Actions, to compress the action space. We further train a deep recurrent Q-learning network (DRQN) model as a high-level controller, called the supervisory network, to manage the Combo-Actions. Our method can be boosted with auxiliary tasks (enemy detection and depth prediction), which enable the agent to extract high-level concepts in FPS games. Extensive experiments show that our method is efficient in the training process and outperforms previous state-of-the-art approaches by a large margin. Ablation studies also indicate that our method boosts the performance of the FPS agent in a reasonable way.

Introduction

Deep reinforcement learning (DRL) has shown great success in many games, including the computer Go game (Silver et al. 2016), Atari games (Mnih et al. 2013), etc. Besides the 2D games (e.g., Go and Atari), applying DRL to first-person-shooter (FPS) games in an adversarial 3D environment (Kempka et al. 2016; Lample and Chaplot 2016) has attracted attention; in such games a player fights against other computer agents or human players. Compared with the 2D games, FPS games pose a multitude of challenges, since the additional spatial dimension not only introduces notions of partial observability and occlusion, but also causes complications due to viewpoint variance and more unpredictable enemy actions. Moreover, this task also involves a wide variety of actions and skills, such as navigating through a map, collecting items, fighting enemies, etc.

ViZDoom (Kempka et al. 2016) is an RL research platform which allows researchers to develop agents that play the Doom game from the screen buffer and game variables. Since the release of ViZDoom, much effort has been devoted to developing AI bots that learn a strategy via data-driven methods, e.g., Arnold (Lample and Chaplot 2016), F1 (Wu and Tian 2017) and IntelAct (Dosovitskiy and Koltun 2016).

Figure 1: An illustration of the Combo-Action in ViZDoom. We define four Combo-Actions (Forward, Turn, Rotate, Shoot) in this paper. Each Combo-Action is a temporal abstraction of the primitive actions. A depth prediction model and an enemy detection model are used for the corresponding Combo-Actions. Instead of using the whole depth map, we divide it into a 3 × 6 grid and calculate the average depth value for each cell. We further train a DRQN model, called the supervisory network, as a high-level controller for the Combo-Actions.
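To make the depth abstraction in the Figure 1 caption concrete, the following minimal sketch averages a depth buffer over a 3 × 6 grid. The function name and the even-divisibility crop are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

def depth_grid_features(depth_map: np.ndarray, rows: int = 3, cols: int = 6) -> np.ndarray:
    """Average the depth buffer over a rows x cols grid (3 x 6 as in Figure 1).

    depth_map: 2D array (H, W) taken from the game engine's depth buffer.
    Returns a (rows, cols) array holding the mean depth of each cell.
    """
    h, w = depth_map.shape
    # Crop so the map divides evenly into the grid (a simplifying assumption).
    h_crop, w_crop = h - h % rows, w - w % cols
    cropped = depth_map[:h_crop, :w_crop]
    # Reshape into (rows, cell_h, cols, cell_w) blocks and average each block.
    cells = cropped.reshape(rows, h_crop // rows, cols, w_crop // cols)
    return cells.mean(axis=(1, 3))
```

Such an 18-dimensional summary keeps the coarse spatial layout of the scene while discarding pixel-level detail that the high-level controller does not need.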
Challenges

Even though much work has been done recently, many problems remain in building agents for FPS games.

Large Action Space: In general, there are numerous primitive actions in FPS games for an agent to interact with the environment, which can be categorized into on-off actions and delta actions. On-off actions take only binary states, while delta actions take continuous values. Moreover, the number of action combinations increases exponentially with the number of time steps. This makes the action space very large and results in an inefficient training process. Previous work (Lample and Chaplot 2016; Wu and Tian 2017; Dosovitskiy and Koltun 2016) selectively abandons many actions to compress the action space, which results in performance loss; e.g., their agents cannot change the visual angle vertically, making them unable to shoot enemies on a different horizontal plane.

Lack of Prior Knowledge: Humans can learn throughout their lives and can utilize prior knowledge to complete new tasks quickly. However, reinforcement learning algorithms often learn a new task from scratch, which makes them require far more experience than humans during training. Although a large amount of research seeks to improve the sample efficiency of reinforcement learning algorithms, there are few studies on incorporating prior knowledge into reinforcement learning. In FPS games, for example, it is vital to recognize some basic concepts (enemy, weapon, spatial structure, etc.), but it is hard to extract such information with a single end-to-end RL model.

Disharmonious Actions: Disharmonious actions often occur in previously trained agents, i.e., actions are not meaningful from step to step. For example, sometimes the agent will turn left and right repeatedly and remain where it is. Previous work (Wu and Tian 2017) only tried to relieve this problem by manually detecting the situation at test time and could do nothing about it in the RL model itself.

Our Proposal

To address the aforementioned issues, we develop a novel method that can plan on temporally-extended action sequences, which we refer to as Combo-Actions. We train a deep recurrent Q-learning network (DRQN) as a supervisory network to manage the Combo-Actions. Our method enables the reinforcement learning algorithm to be boosted with auxiliary tasks and prior knowledge.

Combo-Action: We propose a kind of micro-action, called Combo-Action, in this paper. A Combo-Action is built on a series of primitive actions that can complete a specific sub-task. These action combinations are adopted in RL training, which compresses the action space sharply and allows us to obtain the optimal value function within practical time and memory limits. This method also guides the agent toward better exploration during training.

Auxiliary Tasks: Previous methods prefer to use a single end-to-end neural network to play the FPS game. However, it is hard for a single model to handle such a complex task. Our method handles two extra sub-tasks simultaneously, i.e., an enemy detection task and a depth prediction task. This decoupling makes the debugging process more intuitive. Moreover, the auxiliary networks extract high-level concepts from the observation, which provides useful information for executing the Combo-Actions.

Our method can alleviate the disharmonious-action problem by defining reasonable Combo-Actions. The prior knowledge embedded in Combo-Actions leads to more reasonable primitive actions. Interestingly, experiments show that even randomly choosing Combo-Actions can yield decent performance.

Supervisory Network: To manage the switch between different Combo-Actions, a high-level controller is needed. In this paper, we use an LSTM-based (Hochreiter and Schmidhuber 1997) recurrent neural network for the Q-learning model. Our supervisory network works harmoniously with the auxiliary networks at test time.

Contributions: The contributions of our work are as follows: (1) Our method compresses the original action space sharply, which improves training efficiency and exploration ability. (2) Our method fuses prior knowledge and basic concepts into RL, which reduces the training difficulty and boosts the performance of the trained agent. (3) Our method alleviates the disharmonious-action problem by defining reasonable Combo-Actions for the FPS game.
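As a sketch of how Combo-Actions could act as temporal abstractions under a supervisory policy, the snippet below expands each Combo-Action into a sequence of primitive actions chosen by a high-level controller. The primitive action names, step counts, and the `env` interface are hypothetical and only illustrate the structure, not the paper's actual implementation.

```python
from typing import Callable

# Each Combo-Action expands into a short sequence of primitive game actions.
# The primitive action strings and durations below are illustrative only.
COMBO_ACTIONS = {
    "Forward": lambda obs: ["MOVE_FORWARD"] * 8,
    "Turn":    lambda obs: ["TURN_LEFT"] * 4 if obs.get("left_clearer", True) else ["TURN_RIGHT"] * 4,
    "Rotate":  lambda obs: ["TURN_RIGHT"] * 12,   # scan the surroundings for enemies
    "Shoot":   lambda obs: ["ATTACK"],            # fire at a detected enemy
}

def run_episode(env, supervisory_policy: Callable[[dict], str], max_steps: int = 500) -> float:
    """Let the high-level policy pick Combo-Actions; each emits primitive actions."""
    obs, total_reward, steps = env.reset(), 0.0, 0
    while steps < max_steps:
        combo_name = supervisory_policy(obs)              # e.g. argmax of DRQN Q-values
        for primitive in COMBO_ACTIONS[combo_name](obs):  # temporal abstraction
            obs, reward, done = env.step(primitive)
            total_reward += reward
            steps += 1
            if done or steps >= max_steps:
                return total_reward
    return total_reward
```

The supervisory policy therefore only decides among four options per decision point, which is how the action space is compressed relative to choosing primitive actions at every frame.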
Background

In this section, we briefly review deep Q-learning and the deep recurrent Q-learning network. We also present some work related to our method and the efforts made in the FPS game AI research field.

Deep Q-learning

Deep Q-learning learns a policy by interacting with the environment. At each step, the agent obtains the current state $s_t$ of the environment, outputs an action according to its policy, and receives a reward $r_t$. The goal of the Q-learning algorithm is to maximize the expected sum of discounted rewards $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $T$ is the termination time and $\gamma \in [0, 1]$ is a discount factor. The action-value function, called the Q-function, takes two inputs, a state $s$ and an action $a$, and returns the expected future reward: $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$. In deep Q-learning (Sutton, Barto, and others 1998), a neural network parameterized by $\theta$ is used as an estimate of the optimal Q-function. To optimize the Q-function, the temporal-difference error is taken as the loss function:

$$L(\theta) = \mathbb{E}_{s,a,r,s'}\left[ \left( Q_{\text{target}} - Q_\theta(s, a) \right)^2 \right], \qquad (1)$$

where $Q_{\text{target}} = r + \gamma \max_{a'} Q_\theta(s', a')$.

Deep Recurrent Q-learning Network (DRQN): Typically, the task for reinforcement learning should be Markovian. However, the observation (e.g., the partial field of view in a 3D FPS game) is not Markovian, so the problem is a partially observable Markov decision process (POMDP). To alleviate this problem, a memory module is often required to store the history information. (Hausknecht and Stone 2015) introduced the Deep Recurrent Q-Network (DRQN), which applies a recurrent neural network to DRL; an LSTM (Hochreiter and Schmidhuber 1997) is often used on top of the normal DQN model. In our project, we use DRQN as our basic reinforcement learning model. We present some related work in the following sub-sections.
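A minimal PyTorch sketch of a DRQN trained with the temporal-difference loss of Eq. (1) might look as follows, assuming sequences of RGB frames as input. The network sizes, the sequence-batch layout, and the replay-batch field names are our assumptions rather than the paper's reported configuration.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Minimal DRQN: CNN features -> LSTM memory -> Q-values over Combo-Actions."""
    def __init__(self, n_actions: int, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, frames, hidden_state=None):
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, hidden_state = self.lstm(feats, hidden_state)
        return self.q_head(out), hidden_state   # (batch, time, n_actions)

def td_loss(q_net: DRQN, batch: dict, gamma: float = 0.99) -> torch.Tensor:
    """TD loss of Eq. (1), with an episode-termination mask added.

    batch: {"obs", "next_obs": (B,T,3,H,W) floats; "actions": (B,T) long;
            "rewards", "dones": (B,T) floats} -- an assumed replay layout.
    """
    q, _ = q_net(batch["obs"])                                         # (B, T, A)
    q_sa = q.gather(2, batch["actions"].unsqueeze(-1)).squeeze(-1)     # Q_theta(s, a)
    with torch.no_grad():
        q_next, _ = q_net(batch["next_obs"])
        q_target = batch["rewards"] + gamma * (1.0 - batch["dones"].float()) * q_next.max(dim=2).values
    return nn.functional.mse_loss(q_sa, q_target)
```

At test time the controller would keep the LSTM hidden state across decisions and pick the Combo-Action with the highest Q-value at each decision point.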