
A Survey of Deep Reinforcement Learning in Video Games Kun Shao, Zhentao Tang, Yuanheng Zhu, Member, IEEE, Nannan Li, and Dongbin Zhao, Fellow, IEEE

Abstract—Deep reinforcement learning (DRL) has made great achievements since it was proposed. Generally, DRL agents receive high-dimensional inputs at each step, and make actions according to deep-neural-network-based policies. This learning mechanism updates the policy to maximize the return with an end-to-end method. In this paper, we survey the progress of DRL methods, including value-based, policy gradient, and model-based algorithms, and compare their main techniques and properties. Besides, DRL plays an important role in game artificial intelligence (AI). We also review the achievements of DRL in various video games, including classical Arcade games, first-person perspective games and multi-agent real-time strategy games, from 2D to 3D, and from single-agent to multi-agent. A large number of video game AIs with DRL have achieved super-human performance, while there are still some challenges in this domain. Therefore, we also discuss some key points when applying DRL methods to this field, including exploration-exploitation, sample efficiency, generalization and transfer, multi-agent learning, imperfect information, and delayed sparse rewards, as well as some research directions.

Index Terms—reinforcement learning, deep learning, deep reinforcement learning, game AI, video games.

I. INTRODUCTION

ARTIFICIAL intelligence (AI) in video games is a long-standing research area. It studies how to use AI technologies to achieve human-level performance when playing games. More generally, it studies the complex interactions between agents and game environments. Various games provide interesting and complex problems for agents to solve, making video games perfect environments for AI research. These virtual environments are safe and controllable. In addition, these game environments provide an infinite supply of useful data for machine learning algorithms, and they are much faster than real-time. These characteristics make games a unique and favorite domain for AI research. On the other side, AI has been helping games to become better in the way we play, understand and design them [1].

Broadly speaking, game AI involves perception and decision-making in game environments. With these components, there are some crucial challenges and proposed solutions. The first challenge is that the state space of the game is very large, especially in strategic games. With the rise of representation learning, deep neural networks have successfully modeled such large-scale state spaces. The second challenge is that learning proper policies to make decisions in a dynamic unknown environment is difficult. For this problem, data-driven methods, such as supervised learning and reinforcement learning (RL), are feasible solutions. The third challenge is that the vast majority of game AI is developed for a specified virtual environment. How to transfer the AI's ability among different games is a core challenge, and a more general learning system is also necessary.

For a long time, reinforcement learning has been widely used to tackle these challenges in game AI. In the last few years, deep learning (DL) has achieved remarkable performance in computer vision and natural language processing [2]. The combination, deep reinforcement learning (DRL), teaches agents to make decisions in high-dimensional state spaces in an end-to-end framework, and dramatically improves the generalization and scalability of traditional RL algorithms. Especially, DRL has made great progress in video games, including Atari, ViZDoom, StarCraft, Dota2, and so on. There are some related works that introduce these achievements. Zhao et al. [3] and Tang et al. [4] survey the development of DRL research, and focus on AlphaGo and AlphaGo Zero. Justesen et al. [5] review DL-based methods in video game play, including supervised learning, unsupervised learning, reinforcement learning, evolutionary approaches, and some hybrid approaches. Arulkumaran et al. [6] make a brief introduction to DRL, covering central algorithms and presenting a range of visual RL domains. Li [7] gives an overview of recent achievements of DRL, and discusses core elements, important mechanisms, and various applications. In this paper, we focus on DRL-based game AI, from 2D to 3D, and from single-agent to multi-agent. The main contributions include comprehensive and detailed comparisons of various DRL methods, their techniques and properties, and their impressive and diverse performances in these video games.

The organization of the remaining paper is arranged as follows. In Section II, we introduce the background of DL and RL. In Section III, we focus on recent DRL methods, including value-based, policy gradient, and model-based DRL methods. After that, in Section IV, we make a brief introduction of research platforms and competitions, and present performances of DRL methods in classical single-agent Arcade games, first-person perspective games, and multi-agent real-time strategy games. In Section V, we discuss some key points and research directions in this field. In the end, we draw a conclusion of this survey.

K. Shao, Z. Tang, Y. Zhu, N. Li, and D. Zhao are with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. They are also with the University of Chinese Academy of Sciences, Beijing, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).
This work is supported by National Natural Science Foundation of China (NSFC) under Grants No.61573353, No.61603382, No.6180337, and No.61533017.

Fig. 1. The framework diagram of typical DRL for video games. The deep learning model takes input from the video game API and extracts meaningful features automatically. DRL agents produce actions based on these features, and the environment transitions to the next state.
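To make the interaction cycle in Fig. 1 concrete, the following is a minimal sketch in Python. The environment interface and the random placeholder agent are illustrative assumptions (any Gym-style environment and any learned policy could be substituted); it only shows the states-to-actions-to-rewards loop, not a full DRL algorithm.

import random

class RandomAgent:
    """Placeholder for a DRL agent: maps observations to actions."""
    def __init__(self, num_actions):
        self.num_actions = num_actions

    def act(self, observation):
        # A real agent would extract features with a CNN/LSTM and
        # evaluate a learned policy here.
        return random.randrange(self.num_actions)

def run_episode(env, agent):
    """One episode of the states -> actions -> rewards cycle of Fig. 1."""
    observation = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(observation)                # policy output
        observation, reward, done = env.step(action)   # environment transition
        total_reward += reward
    return total_reward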

II. BACKGROUND

Generally speaking, training an agent to make decisions with high-dimensional inputs is difficult. With the development of deep learning, researchers take deep neural networks as function approximators, and use plenty of samples to optimize policies successfully. The framework diagram of typical DRL for video games is depicted in Fig. 1.

A. Deep learning

Deep learning comes from artificial neural networks, and is used to learn data representations. It is inspired by the theory of brain development, and can be applied in supervised learning, unsupervised learning and semi-supervised learning. Although the term deep learning was introduced in 1986 [8], deep learning went through a winter time because of a lack of data and incapable computation hardware. However, with more and more large-scale datasets being released, and capable hardware becoming available, a big revolution happened in DL [9].

The convolutional neural network (CNN) [10] is a class of deep neural networks, which is widely applied to computer vision. CNN is inspired by biological processes, and is shift invariant based on a shared-weights architecture. The recurrent neural network (RNN) is another kind of deep neural network, used especially for natural language processing. As a special kind of RNN, Long Short-Term Memory (LSTM) [11] is capable of learning long-term dependencies. Deep learning architectures have been applied to many fields, and have achieved significant successes, such as speech recognition, image classification and segmentation, semantic comprehension, and machine translation [2]. DL-based methods with efficient parallel distributed computing resources can break the limits of traditional machine learning methods. This inspires scientists and researchers to achieve more and more state-of-the-art performance in their respective fields.

B. Reinforcement learning

Reinforcement learning is a kind of machine learning method where agents learn the optimal policy by trial and error [12]. By interacting with the environment, RL can be successfully applied to sequential decision-making tasks. Considering a discounted episodic Markov decision process (MDP) (S, A, γ, P, r), the agent chooses an action a_t according to the policy π(a_t|s_t) at state s_t. The environment receives the action, produces a reward r_{t+1} and transfers to the next state s_{t+1} according to the transition probability P(s_{t+1}|s_t, a_t). This transition probability is unknown in the RL domain. The process continues until the agent reaches a terminal state or a maximum time step. The objective is to maximize the expected discounted cumulative rewards

E_\pi[R_t] = E_\pi\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right],   (1)

where γ ∈ (0, 1] is the discount factor.

Reinforcement learning can be divided into off-policy and on-policy methods. Off-policy RL algorithms mean that the behavior policy used for selecting actions is different from the learning policy. On the contrary, the behavior policy is the same as the learning policy in on-policy RL algorithms. Besides, reinforcement learning can also be divided into value-based and policy-based methods. In value-based RL, agents update the value function to learn a suitable policy, while policy-based RL agents learn the policy directly.

Q-learning is a typical off-policy value-based method. The update rule of Q-learning is

\delta_t = r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t),   (2a)
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \delta_t.   (2b)

δ_t is the temporal difference (TD) error, and α is the learning rate.
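A minimal tabular sketch of the Q-learning update in Eqs. (2a)-(2b) is given below. The five-state chain environment and the hyperparameter values are illustrative assumptions introduced only to make the sketch self-contained and runnable.

import random
from collections import defaultdict

N_STATES, ACTIONS = 5, [0, 1]          # toy chain MDP; action 0: left, 1: right
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1

def step(state, action):
    """Reward 1 only when reaching the right end of the chain."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

Q = defaultdict(float)                  # Q[(state, action)]

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy behavior policy (off-policy w.r.t. the greedy target policy)
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r, done = step(s, a)
        # Eq. (2a): TD error with the greedy bootstrap target (zero at terminal states)
        bootstrap = 0.0 if done else GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
        delta = r + bootstrap - Q[(s, a)]
        # Eq. (2b): incremental update with learning rate alpha
        Q[(s, a)] += ALPHA * delta
        s = s_next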


Fig. 2. The network architectures of typical DRL methods, with increased complexity and performance. (a): DQN network; (b): Dueling DQN network; (c): DRQN network; (d): Actor-critic network; (e): Reactor network.
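As a concrete illustration of the value-based architectures in Fig. 2, the sketch below builds a dueling Q-network head in the style of Fig. 2(b) and Eq. (7). PyTorch, the layer sizes, and the four-frame input are assumptions for illustration; the mean-advantage subtraction follows the dueling DQN paper and is not spelled out in Eq. (7) itself.

import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, num_actions, feature_dim=512):
        super().__init__()
        self.torso = nn.Sequential(                       # shared CNN features
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(feature_dim), nn.ReLU(),
        )
        self.value = nn.Linear(feature_dim, 1)                 # V(s; theta, beta)
        self.advantage = nn.Linear(feature_dim, num_actions)   # A(s, a; theta, alpha)

    def forward(self, x):
        h = self.torso(x)
        v, a = self.value(h), self.advantage(h)
        # Combine the two streams; subtracting the mean advantage keeps the
        # V/A decomposition identifiable (as in the dueling DQN paper).
        return v + a - a.mean(dim=1, keepdim=True)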

Policy gradient [13] parameterizes the policy and updates its parameters θ. In its general form, the objective function of policy gradient is defined as

J(\theta) = E_\pi\left[\sum_{t=0}^{\infty} \log \pi_\theta(a_t|s_t) R\right].   (3)

R is the total accumulated return.

Actor-critic [12] reinforcement learning improves the policy gradient with a value-based critic

J(\theta) = E_\pi\left[\sum_{t=0}^{\infty} \Psi_t \log \pi_\theta(a_t|s_t)\right].   (4)

Ψ_t is the critic, which can be the state-action value function Q^\pi(s_t, a_t), the advantage function A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t), or the TD error r_t + V^\pi(s_{t+1}) - V^\pi(s_t).

III. DEEP REINFORCEMENT LEARNING

DRL makes a combination of DL and RL, achieving rapid developments since it was proposed. This section introduces various DRL methods, including value-based methods, policy gradient methods, and model-based methods.

A. Value-based DRL methods

Deep Q-network (DQN) [14] is the most famous DRL model, which learns policies directly from high-dimensional inputs. It receives raw pixels, and outputs a value function to estimate future rewards, as shown in Fig. 2(a). DQN uses the experience replay method to break the sample correlation, and stabilizes the learning process with a target Q-network. The loss function at iteration i is

L_i(\theta_i) = E_{(s,a,r,s') \sim U(D)}\left[(y_i^{DQN} - Q(s, a; \theta_i))^2\right],   (5)

with

y_i^{DQN} = r + \gamma \max_{a'} Q(s', a'; \theta_i^-).   (6)

DQN bridges the gap between high-dimensional visual inputs and actions. After that, researchers have improved DQN in different aspects. Double DQN [15] introduces double Q-learning to reduce observed overestimations, and it leads to much better performance. Prioritized experience replay [16] helps prioritize experience to replay important transitions more frequently. The sample probability of transition i is P(i) = p_i^\alpha / \sum_k p_k^\alpha, where p_i is the priority of transition i. Dueling DQN [17] uses the dueling neural network architecture for model-free DRL. It includes two separate estimators: one for the state value function V(s; θ, β) and the other for the advantage function A(s, a; θ, α), as shown in Fig. 2(b).

Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha).   (7)

Pop-Art [18] is proposed to adapt to different and non-stationary target magnitudes; it successfully replaces the clipping of rewards as done in DQN to handle various magnitudes of targets. Fast reward propagation [19] is a novel training algorithm for reinforcement learning, which combines the strength of DQN, and exploits longer state-transitions in experience replays by tightening the optimization via constraints. This novel technique makes DRL more practical by drastically reducing training time. Gorila [20] is the first massively distributed architecture for DRL. This architecture uses four main components: parallel actors; parallel learners; a distributed neural network to represent the value function or behavior policy; and a distributed store of experience. To address the limited memory and imperfect game information at each decision point, Deep Recurrent Q-Network (DRQN) [21] replaces the first fully-connected layer with a recurrent neural network in DQN, as shown in Fig. 2(c).

Generally, DQN learns rich domain representations and approximates the value function with deep neural networks, while batch RL algorithms with linear representations are more stable and require less hyperparameter tuning. The Least Squares DQN (LS-DQN) [22] combines DQN's rich feature representations with the stability of a linear least squares method. In order to reduce approximation error variance in DQN's target values, averaged-DQN [23] averages previously learned Q-value estimates, leading to more stable training and improved performance. Deep Q-learning from Demonstrations (DQfD) [24] combines DQN with human demonstrations, which improves the sample efficiency greatly. DQV [25] uses TD learning to train a Value neural network, and uses this network to train a second Quality-value network to estimate state-action values. DQV learns significantly faster and better than double-DQN. Researchers have proposed several improvements to DQN. However, it is unclear which of these are complementary and how much can be combined. Rainbow [26] combines the main extensions to DQN, and gives each component's contribution to overall performance. RUDDER [27] is a novel reinforcement learning approach for finite MDPs with delayed rewards, which is also a return decomposition method. RUDDER is exponentially faster on tasks with different lengths of reward delays. Ape-X DQfD [28] uses a new transformed Bellman operator to process rewards of varying densities and scales, and applies human demonstrations to ease the exploration problem and guide agents towards rewarding states. Additionally, it proposes an auxiliary temporal consistency loss to train stably, extending the effective planning horizon by an order of magnitude. Soft DQN [29] is an entropy-regularized version of Q-learning, with better robustness and generalization.

Distributional DRL learns the value distribution, in contrast to common RL that models the expectation of return, or value. C51 [30] focuses on the distribution of value, and designs a distributional DQN algorithm to learn approximate value distributions. QR-DQN [31] closes a number of gaps between theoretical and algorithmic results, performing distributional reinforcement learning with Quantile regression, in which the distribution over returns is modeled explicitly instead of only estimating the mean. Implicit Quantile Networks (IQN) [32] is a flexible, applicable, and state-of-the-art distributional DQN. IQN approximates the full Quantile function for the return distribution with Quantile regression, and provides a fully integrated distributional RL agent without prior assumptions on the parameterization of the return distribution. Furthermore, IQN allows expanding the class of control policies to a wide range of risk-sensitive policies connected to distortion risk measures.

B. Policy gradient DRL methods

Policy gradient DRL optimizes the parameterized policy directly. The actor-critic architecture computes the policy gradient using a value-based critic function to estimate expected future reward, as shown in Fig. 2(d). Asynchronous DRL is an efficient framework for DRL that uses asynchronous gradient descent to optimize the policy [33]. Asynchronous advantage actor-critic (A3C) trains several agents on multiple environments, showing a stabilizing effect on training. The objective function of the actor is demonstrated as

J(\theta) = E_\pi\left[\sum_{t=0}^{\infty} A_{\theta,\theta_v}(s_t, a_t) \log \pi_\theta(a_t|s_t) + \beta H_\theta(\pi(s_t))\right],   (8)

where H_\theta(\pi(s_t)) is an entropy term used to encourage exploration.

GA3C [34] is a hybrid CPU/GPU version of A3C, which achieves a significant speedup compared to the original CPU implementation. UNsupervised REinforcement and Auxiliary Learning (UNREAL) [35] learns separate policies for maximizing many other pseudo-reward functions simultaneously, including value function replay, reward prediction, and pixel control. This agent drastically improves both data efficiency and robustness to hyperparameter settings. PAAC [36] is a novel framework for efficient parallelization of DRL, where multiple actors learn the policy on a single machine. Policy gradient methods are efficient techniques for policy improvement, but they are usually on-policy and unable to take advantage of off-policy data. A new method referred to as PGQ [37] combines policy gradient with Q-learning. PGQ establishes an equivalency between regularized policy gradient techniques and advantage function learning algorithms. Retrace(λ) [38] takes the best of importance sampling, off-policy Q(λ), and tree-backup(λ), resulting in low variance, safety, and efficiency. It makes a combination of the dueling DRQN architecture and the actor-critic architecture, as shown in Fig. 2(e). Reactor [39] is a sample-efficient and numerically efficient reinforcement learning agent based on a multi-step return off-policy actor-critic architecture. The network outputs a target policy, an action-value Q-function, and an estimated behavioral policy. The critic is trained with the off-policy multi-step Retrace method, and the actor is trained by a β-leave-one-out policy gradient. Importance-Weighted Actor Learner Architecture (IMPALA) [40] is a new distributed DRL architecture, which can scale to thousands of machines. IMPALA uses a single reinforcement learning agent with a single set of parameters to solve a large set of tasks. This method achieves stable learning by combining decoupled acting and learning with a novel V-trace off-policy correction method, which is critical for achieving learning stability.

1) Trust region method: Trust Region Policy Optimization (TRPO) [41] is proposed for optimizing control policies, with guaranteed monotonic improvement. TRPO computes an ascent direction to improve on policy gradient, which can ensure a small change in the policy distribution. The constrained optimization problem of TRPO in each epoch is

\text{maximize}_\theta \; E_{s \sim \rho_{\theta'}, a \sim \pi_{\theta'}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta'}(a|s)} A_{\theta'}(s, a)\right],   (9a)
\text{s.t.} \; E_{s \sim \rho_{\theta'}}\left[D_{KL}(\pi_{\theta'}(\cdot|s) \,\|\, \pi_\theta(\cdot|s))\right] \leq \delta_{KL}.   (9b)

This algorithm is effective for optimizing large nonlinear policies. Proximal policy optimization (PPO) [42] samples data by interaction with the environment, and optimizes the objective function with stochastic gradient ascent

r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)},   (10a)
L(\theta) = \hat{E}_t\left[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)\right].   (10b)

r_t(θ) denotes the probability ratio. This objective function clips the probability ratio to modify the surrogate objective. PPO has some benefits over TRPO, and is much simpler to implement, with better sample complexity. Actor-critic with experience replay (ACER) [43] introduces several innovations, including a stochastic dueling network, truncated importance sampling, and a new trust region method, which is stable and sample efficient. Actor-critic using Kronecker-Factored Trust Region (ACKTR) [44] is based on natural policy gradient, and uses Kronecker-factored approximate curvature (K-FAC) with trust region to optimize the actor and the critic. ACKTR is sample efficient compared with other actor-critic methods.

2) Deterministic policy: Apart from stochastic policies, deep deterministic policy gradient (DDPG) [45] is a kind of deterministic policy gradient method which adapts the success of DQN to continuous control. The update rule of DDPG is

Q(s_t, a_t) = r(s_t, a_t) + \gamma Q(s_{t+1}, \pi_\theta(s_{t+1})).   (11)

DDPG is an actor-critic, off-policy algorithm, and is able to learn reasonable policies on various tasks. Distributed Distributional DDPG (D4PG) [46] is a distributional update to DDPG, combined with the use of multiple distributed workers all writing into the same replay table. This method has much better performance on a number of difficult continuous control problems.

3) Entropy-regularized policy gradient: Soft Actor-Critic (SAC) [52] is an off-policy policy gradient method, which establishes a bridge between DDPG and stochastic policy optimization. SAC incorporates the clipped double-Q trick, and the objective function of maximum entropy DRL is

J(\pi) = \sum_{t=0}^{T} E_{(s_t,a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha H(\pi(\cdot|s_t))\right].   (12)

SAC uses an entropy regularization in its objective function. It trains the policy to maximize a trade-off between entropy and expected return. The entropy is a measure of randomness in the policy. This mechanism is similar to the trade-off between exploration and exploitation. Increasing entropy can encourage more exploration and accelerate the learning process. Moreover, it can also prevent the learning policy from converging to a poor local optimum.

C. Model-based DRL methods

Combining model-free reinforcement learning with on-line planning is a promising approach to solve the sample efficiency problem. TreeQN [47] is proposed to address these challenges. It is a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in DRL with discrete actions. TreeQN dynamically constructs a tree by recursively applying a transition model in a learned abstract state space and then aggregating predicted rewards and state-values using a tree backup to estimate Q-values. ATreeC is an actor-critic variant that augments TreeQN with a softmax layer to form a stochastic policy network. Both approaches are trained end-to-end, such that the learned model is optimized for its actual use in the planner. TreeQN and ATreeC outperform n-step DQN and value prediction networks on multiple Atari games. Vezhnevets et al. [48] present the STRategic Attentive Writer (STRAW) neural network architecture to build implicit plans. STRAW purely interacts with an environment, and is an end-to-end method. The STRAW model can learn temporally abstracted high-level macro-actions, which enables both economic computation and structured exploration. STRAW employs temporally extended planning strategies and achieves strong improvements on Atari games. The world model [49] uses an unsupervised manner to train a generative recurrent neural network, which can model RL environments through compressed spatiotemporal representations. It feeds extracted features into simple and compact policies, achieving impressive results in several environments. Value propagation (VProp) [50] is based on value iteration, and is an efficient differentiable planning module. It can successfully be trained to learn to plan using reinforcement learning. As a general framework of AlphaZero, MuZero [54] combines MCTS with a learned model, and predicts the reward, the action-selection policy, and the value function for planning. It extends model-based RL to a range of logically complex and visually complex domains, and achieves superhuman performance.

A general review of various DRL methods from 2017 to 2019 is presented in Table I.

IV. DRL IN VIDEO GAMES

Playing video games like human experts is challenging for computers. With the development of DRL, agents are able to play various games end-to-end. Here we focus on game research platforms and competitions, and impressive progress in various video games, from 2D to 3D, and from single-agent to multi-agent, as shown in Fig. 3.

A. Game research platforms

Platforms and competitions make great contributions to the development of game AI, and help to evaluate agents' intelligence, as presented in Table II. Most platforms can be described by two major categories: general platforms and specific platforms.

General Platforms: Arcade Learning Environment (ALE) [55] is the pioneer evaluation platform for DRL algorithms, which provides an interface to plenty of Atari 2600 games. ALE presents both game images and signals, such as player scores, which makes it a suitable testbed. To promote the progress of DRL research, OpenAI integrates a collection of reinforcement learning tasks into a platform called Gym [56], which mainly contains Algorithmic, Atari, Classical Control, Board games, and 2D and 3D robots. After that, OpenAI Universe [57] is a platform for measuring and training agents' general intelligence across a large supply of games. Gym Retro [58] is a wrapper for video game emulators with a unified interface as Gym, and makes Gym easy to be extended with a large collection of video games, not only Atari but also NEC, Nintendo, and Sega, for RL research. The OpenAI Retro contest aims at exploring the development of DRL that can generalize from previous experience. Based on the Sonic the Hedgehog(TM) video game, OpenAI presents a new DRL benchmark [59]. This benchmark can help to measure the performance of few-shot learning and transfer learning in reinforcement learning. General Video Game Playing [60] is intended to design an agent to play multiple video games without human intervention. The General Video Game AI (GVGAI) [61] competition is proposed to provide an easy-to-use and open-source platform for evaluating AI methods, including DRL. DeepMind Lab [62] is a first-person perspective learning environment, and provides multiple complicated tasks in partially observed, large-scale, and visually diverse worlds.
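Most of these platforms expose a similar step-based interface. As a concrete example, the sketch below interacts with an ALE game through the classic OpenAI Gym API (the pre-0.26 reset/step signature and the "Pong-v0" environment id are assumptions; newer gym/gymnasium releases changed both).

import gym

env = gym.make("Pong-v0")          # one of the Atari 2600 games exposed by ALE
observation = env.reset()          # raw RGB frame (210 x 160 x 3)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()                 # random policy as a placeholder
    observation, reward, done, info = env.step(action)  # classic 4-tuple return
    total_reward += reward

env.close()
print("episode return:", total_reward)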

TABLE I: A general review of recent DRL methods from 2017 to 2019.

DRL Algorithms | Main Techniques | Networks | Category
DQN [14] | experience replay, target Q-network | CNN | value-based, off-policy
Double DQN [15] | double Q-learning | CNN | value-based, off-policy
Dueling DQN [17] | dueling neural network architecture | CNN | value-based, off-policy
Prioritized DQN [16] | prioritized experience replay | CNN | value-based, off-policy
Bootstrapped DQN [51] | combine deep exploration with DNNs | CNN | value-based, off-policy
Gorila [20] | massively distributed architecture | CNN | value-based, off-policy
LS-DQN [22] | combine least-squares updates in DRL | CNN | value-based, off-policy
Averaged-DQN [23] | averaging learned Q-values estimates | CNN | value-based, off-policy
DQfD [24] | learn from the demonstration data | CNN | value-based, off-policy
DQN with Pop-Art [18] | adaptive normalization with Pop-Art | CNN | value-based, off-policy
Soft DQN [29] | KL penalty and entropy bonus | CNN | value-based, off-policy
DQV [25] | training a Quality-value network | CNN | value-based, off-policy
Rainbow [26] | integrate six extensions to DQN | CNN | value-based, off-policy
RUDDER [27] | return decomposition | CNN-LSTM | value-based, off-policy
Ape-X DQfD [28] | transformed Bellman operator, temporal consistency loss | CNN | value-based, off-policy
C51 [30] | distributional Bellman optimality | CNN | value-based, off-policy
QR-DQN [31] | distributional RL with Quantile regression | CNN | value-based, off-policy
IQN [32] | an implicit representation of the return distribution | CNN | value-based, off-policy
A3C [33] | asynchronous gradient descent | CNN-LSTM | policy gradient, on-policy
GA3C [34] | hybrid CPU/GPU version | CNN-LSTM | policy gradient, on-policy
PPO [42] | clipped surrogate objective, adaptive KL penalty coefficient | CNN-LSTM | policy gradient, on-policy
ACER [43] | experience replay, truncated importance sampling | CNN-LSTM | policy gradient, off-policy
ACKTR [44] | K-FAC with trust region | CNN-LSTM | policy gradient, on-policy
Soft Actor-Critic [52] | entropy regularization | CNN | policy gradient, off-policy
UNREAL [35] | unsupervised auxiliary tasks | CNN-LSTM | policy gradient, on-policy
Reactor [39] | Retrace(λ), β-leave-one-out policy gradient estimate | CNN-LSTM | policy gradient, off-policy
PAAC [36] | parallel framework for A3C | CNN | policy gradient, on-policy
DDPG [45] | DQN with deterministic policy gradient | CNN-LSTM | policy gradient, off-policy
TRPO [41] | incorporate a KL divergence constraint | CNN-LSTM | policy gradient, on-policy
D4PG [46] | distributed distributional DDPG | CNN | policy gradient, on-policy
PGQ [37] | combine policy gradient and Q-learning | CNN | policy gradient, off-policy
IMPALA [40] | importance-weighted actor learner architecture | CNN-LSTM | policy gradient, on-policy
FiGAR-A3C [53] | fine grained action repetition | CNN-LSTM | policy gradient, on-policy
TreeQN/ATreeC [47] | on-line planning, tree-structured model | CNN | model-based, on-policy
STRAW [48] | macro-actions, planning strategies | CNN | model-based, on-policy
World model [49] | mixture density network, variational autoencoder | CNN-LSTM | model-based, on-policy
MuZero [54] | representation function, dynamics function, and prediction function | CNN | model-based, off-policy
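The value-based, off-policy entries of Table I all share the target-network bootstrapping of Eqs. (5)-(6). Below is a minimal sketch of that loss, assuming PyTorch and that q_net and target_net are two copies of the same Q-network; the batch format is an illustrative assumption.

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch               # tensors sampled from replay memory D
    # Q(s, a; theta) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Eq. (6): bootstrap target from the frozen target network, zero at terminal states
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    # Eq. (5): squared TD error between the online estimate and the target
    return F.mse_loss(q_sa, y)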

Unity ML-Agents Toolkit [63] is a new toolkit for creating and interacting with simulation environments. This platform has sensory, physical, cognitive, and social complexity, and enables fast and distributed simulation, and flexible control.

Specific Platforms: Malmo [64] is a research platform for AI experiments, which is built on top of Minecraft. It is a first-person 3D environment, and can be used for multi-agent research in the Microsoft Malmo collaborative AI challenge 2017 and the multi-agent RL in Malmo competition 2018. TORCS [65] is a racing car simulator which has both low-level and visual features for the self-driving car with DRL. ViZDoom [66] is a first-person shooter game platform, and encourages DRL agents to utilize the visual information to perform navigation and shooting tasks in a semi-realistic 3D world. The ViZDoom AI competition has attracted plenty of researchers to develop their DRL-based agents since 2016.

TABLE II: A list of game AI competitions suitable for DRL research.
Competition Name | Time
ViZDoom AI competition | 2016, 2017, 2018
StarCraft AI competitions (AIIDE, CIG, SSCAIT) | 2010 — 2019
microRTS competition | 2017, 2018, 2019
The GVGAI competition – learning track | 2017, 2018
Microsoft Malmo collaborative AI challenge | 2017
The multi-agent RL in Malmo competition | 2018
The OpenAI Retro contest | 2018
NeurIPS Pommerman competition | 2018
Unity Obstacle Tower Challenge | 2019
NeurIPS MineRL competition | 2019


Fig. 3. The diagram of various video game AIs, from 2D to 3D, and from single-agent to multi-agent.

As far as we know, real-time strategy (RTS) games are very challenging for reinforcement learning methods. Facebook proposes TorchCraft for StarCraft I [67], and DeepMind releases the StarCraft II learning environment [68]. They expect researchers to propose powerful DRL agents to achieve high-level performance in RTS games and annual StarCraft AI competitions. CoinRun [69] provides a metric for an agent's ability to transfer its experience to novel situations. This new training environment strikes a desirable balance in complexity: the environment is much simpler than traditional platform games, but it still poses a worthy generalization challenge for DRL algorithms. Google Research Football is a new environment based on the open-source game Gameplay Football for DRL research.

TABLE III: Mean and median scores across 57 Atari games of typical DRL methods, measured as percentages of human baseline.
Methods | Mean | Median | Year
DQN [14] | 228% | 79% | 2015
C51 [30] | 701% | 178% | 2017
UNREAL [35] | 880% | 250% | 2017
QR-DQN [31] | 915% | 211% | 2017
IQN [32] | 1019% | 218% | 2018
Rainbow [26] | 1189% | 230% | 2018
Ape-X DQN [71] | 1695% | 434% | 2018
Ape-X DQfD* [28] | 2346% | 702% | 2018
Note: * means this method is measured across 42 Atari games.
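The per-game scores behind Table III are usually reported as human-normalized percentages, so that 100% corresponds to human-level play and 0% to random play. A minimal sketch of that normalization is below; the three example scores are illustrative placeholders, not values from the table.

def human_normalized_score(agent_score, random_score, human_score):
    """100% means human-level play on a game, 0% means random play."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# illustrative usage with placeholder scores
print(human_normalized_score(agent_score=8000.0, random_score=150.0, human_score=7000.0))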

B. Atari games

ALE is an evaluation platform that aims at building agents with general intelligence across hundreds of Atari 2600 games. As the most popular testbed for DRL research, a large number of DRL methods have achieved outstanding performance on it consecutively. Machado et al. [70] take a review of the ALE in the DRL research community, and propose diverse evaluation methodologies and some key concerns. In this section, we introduce the main achievements in the ALE domain, including the extremely difficult Montezuma's Revenge.

As the milestone in this domain, DQN is able to surpass the performances of previous algorithms, and achieves human-level performance across 49 games [14]. Averaged-DQN examines the source of value function estimation errors, and demonstrates significantly improved stability and performance on the ALE benchmark [23]. UNREAL significantly outperforms the previous best performance on Atari, averaging 880% expert human performance [35]. PAAC achieves sufficiently good performance on ALE after a few hours of training [36]. DQfD has better initial performance than DQN on most Atari games, and receives more average rewards than DQN on 27 of 42 games. In addition, DQfD learns faster than DQN even when given poor demonstration data [24]. Noisy DQN replaces the conventional exploration heuristics with NoisyNet, and yields substantially higher scores in the ALE domain. As a distributional DRL method, C51 obtains a new series of impressive results, and demonstrates the importance of the value distribution in approximated RL [30]. Rainbow provides improvements in terms of sample efficiency and final performance, and the authors also show the contribution of each component to overall performance [26]. The QR-DQN algorithm significantly outperforms recent improvements on DQN, including the related C51 [30]. IQN shows substantial gains on the Atari benchmark over QR-DQN, and even halves the distance between QR-DQN and Rainbow [32]. Ape-X DQN substantially improves the performance on the ALE, achieving better final scores in less wall-clock training time [71]. When tested on a set of 42 Atari games, the Ape-X DQfD algorithm exceeds the performance of an average human on 40 games using a common set of hyperparameters. Mean and median scores across multiple Atari games of typical DRL methods that achieve state-of-the-art performance consecutively are presented in Table III.
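The hardest of these games, discussed next, are dominated by long horizons and sparse rewards. One family of remedies mentioned below and again in Section V-A is count-based exploration, which adds an intrinsic bonus to the sparse extrinsic reward. A minimal sketch is given here, assuming a user-supplied discretizing hash of the observation (the bonus scale is an illustrative assumption).

import math
from collections import defaultdict

class CountBonus:
    """Intrinsic bonus that decays as a state (or its hash bucket) is revisited."""
    def __init__(self, beta=0.01):
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, observation_hash):
        self.counts[observation_hash] += 1
        return self.beta / math.sqrt(self.counts[observation_hash])

# usage sketch: shaped_reward = extrinsic_reward + bonus(hash_fn(observation))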

Players must navigate through a number of different rooms, graphics than Atari games, but also requires agents to learn the avoid obstacles and traps, climb ladders up and down, and then dynamic of the car. FIGAR-DDPG can successfully complete pick up the key to open new rooms. It requires a long sequence the race task and finish 20 laps of the circuit, with a 10× total of actions before reaching the goal and receiving a reward, reward against that obtained by DDPG, and much smoother and is difficult to explore an optimal policy to tackle tasks. policies [53]. Normalized Actor-Critic (NAC) normalizes the Efficient exploration is considered as a crucial factor to learn Q-function effectively, and learns an initial policy network in a delayed feedback environment. Then, Ostrovski et al. [72] from demonstration and refine the policy in a real environment provide an improved version of count-based exploration with [79]. NAC is robust to suboptimal demonstration data, learns PixelCNN as a supplement for pseudo-count, also reveals the robustly and outperforms existing baselines when evaluated importance of Monte Carlo return for effective exploration. on TORCS. Mazumder et al. [80] incorporate state-action In addition to improve the exploration efficiency, learning permissibility (SAP) and DDPG, and applies it to tackle the from human data is also a proper method to reach better lane keeping problem in TORCS. The proposed method can performance in this problem. Le et al. [73] leverage imitation speedup DRL training remarkably for this task. In [81], a learning from expert interaction and hierarchical reinforcement two-stage approach is proposed for the vision-based vehicle learning at different levels. This method learns obviously lateral control problem which includes an multi-task learning faster than original hierarchical reinforcement learning, and perception stage and an RL control stage. By exerting the also significantly more efficiently than conventional imitation correlation between multiple learning task, the perception learning. Other than gameplay, the demonstration is also a module can robustly extract track features. Additionally, the valuable kind of sample for agent to learn. DQfD utilizes RL agent learns by maximizing the geometry-based reward a small set of demonstration data to speed up the learning and performs better than the LQR and MPC controllers. Zhu process [24]. It combines prioritized replay mechanism with et al. [82] use DRL to train a CNN to perceive driving data temporal difference updates and supervised classification, and from images of first-person view, and learns a controller to get finally achieves a better and impressive result. Further, Aytar driving commands, showing a promising performance. et al. [74] only use YouTube video as a demonstration sample 3) Minecraft: Minecraft is a sandbox construction game, and invests a transformed Bellman operator for learning from where players can build creative creations, structures, and human demonstrations. Interestingly, these two works both artwork across various game modes. Recently, it becomes a claim being the first to solve the entire first level of Mon- popular platform for game AI research, with 3D infinitely tezuma’s Revenge. Go-explore [75] makes further progress, varied data. Project Malmo is an experimentation platform [83] and achieves scores over 400,000 on average. Go-Explore sep- that builts on the Minecraft for AI research. It supports a large arates learning into exploration and robustification. 
It reliably number of scenarios, including navigation, problem solving solves the whole game, and generalizes well. tasks, and survival to collaboration. Xiong et al. [84] propose a novel Q-learning approach with state-action abstraction and warm start using human reasoning to learn effective policies in C. First-person perspective games the Microsoft Malmo collaborative AI challenge. The ability to Different from Atari games, agents in first-person perspec- transfer knowledge from source task to target task in Minecraft tive video games can only receive observations from their own is one of the major challenges. Tessler et al. [85] provides a perspectives, resulting from imperfect information inputs. In DRL agent which can transfer knowledge by learning reusable RL domain, this is a POMDP problem which requires efficient skills, and then incorporated into hierarchical DRL network exploration and memory. (H-DRLN). H-DRLN exhibits superior performance and low 1) ViZDoom: First-person shooter (FPS) games play an learning sample complexity compared to regular DQN in important role in game AI research. Doom is a classical Minecraft, and the potential to transfer knowledge between FPS game, and ViZDoom is presented as a novel testbed related Minecraft tasks without any additional learning. To for DRL [66]. Agents learn from visual inputs, and interact solve the partial or non-Markovian observations problems, with the ViZDoom environment in a first-person perspective. Jin et al. [86] propose a new DRL algorithm based on Wu et al. [76] propose a method that combines A3C and counterfactual regret minimization that iteratively updates an curriculum learning. The agent learns to navigate and attack approximation to a cumulative clipped advantage function. On via playing against built-in agents progressively. Parisotto et the challenging Minecraft first-person navigation benchmarks, al. [77] develop Neural Map, which is a memory system this algorithm can substantially outperform strong baseline with an adaptable write operator. Neural Map uses a spatially methods. structured 2D memory image to store the environment’s in- 4) DeepMind lab: DeepMind lab is a 3D first-person formation. This method surpasses other DRL memories on game platform extended from OpenArena, which is based several challenging ViZDoom maze tasks and shows a capable on Quake3. Comparable to other first-person game platforms, generalization ability. Shao et al. [78] show that ACKTR can DeepMind lab has considerably richer visuals and more re- successfully teach agents to battle in ViZDoom environment, alistic physics, making it a significantly complex platform. and significantly outperform A2C agents by a significant On a challenging suite of DeepMind lab tasks, the UNREAL margin. agent leads to a mean speedup in learning of 10× over A3C 2) TORCS: TORCS is a racing game where actions are ac- and averaging 87% expert human performance. As learning celeration, braking and steering. This game has more realistic agents become more powerful, continual learning has made 9 quick progress recently. To test continual learning capabilities, transfer learning to this method. This improves the sample Mankowitz et al. [87] consider an implicit sequence of tasks efficiency, and outperforms GMEZO and BiCNet in large-scale with sparse rewards in DeepMind lab. The novel agent archi- scenarios. Kong et al. 
[95] bases on master-slave architecture, tecture called Unicorn, demonstrates strong continual learning and proposes master-slave multi-agent reinforcement learning and outperforms several baseline agents on the proposed (MS-MARL). MS-MARL includes composed action represen- domain. Schmitt et al. [88] present a method which uses tation, independent reasoning, and learnable communication. teacher agents to kickstart the training of a new student agent. This method has better performance than other methods in On a multi-task and challenging DMLab-30 suite, kickstarted micromanagement tasks. Rashid et al. [96] focus on sev- training improves new agents’ sample efficiency to a great eral challenging StarCraft II micromanagement tasks, and extend, and surpasses the final performance by 42%. Jaderberg use centralized training and decentralized execution to learn et al. [89] focus on Quake III Arena Capture the Flag, which is cooperative behaviors. This eventually outperforms state-of- a popular 3D first-person multiplayer video game, and demon- the-art multi-agent deep reinforcement learning methods. strates that DRL agents can achieve human-level performance Researchers also use DRL methods to optimize the build with only pixels and game points as input. The agent uses order in StarCraft. Tang et al. [97] put forward neural network population based training to optimize the policy. This method fitted Q-learning (NNFQ) and convolutional neural network trains a large number of agents concurrently from thousands fitted Q-learning (CNNFQ) to build units in simple StarCraft of parallel matches, where agents plays cooperatively in teams maps. These models are able to find effective production and against each other on randomly generated environments. sequences, and eventually defeat enemies. In [68], researchers In an evaluation, the trained agents exceed the winrate of self- present baseline results of several main DRL agents in the play baseline and high-level human players both as teammates StarCraft II domain. The fully convolutional advantage actor- and opponents, and are proved far stronger than existing DRL critic (FullyConv-A2C) agents achieve a beginner-level in agents. StarCraft II mini-games. Zambaldi et al. [98] introduce the relational DRL to StarCraft, which iteratively reasons about the relations between entities with self-attention, and uses D. Real-time strategy games it to guide a model-free RL policy. This method improves Real-time strategy games are very popular among players, sample efficiency, generalization ability, and interpretability of and have become popular platforms for AI research. conventional DRL approaches. Relational DRL agent achieves 1) StarCraft: In StarCraft, players need to perform actions impressive performance on SC2LE mini-games. Sun et al. [99] according to real-time game states, and defeat the enemies. develop the DRL based agent TStarBot, which uses flat action Generally speaking, designing an AI bot have many chal- structure. This agent defeats the built-in AI agents from level lenges, including multi-agent collaboration, spatial and tem- 1 to level 10 in a full game firstly. Lee et al. [100] focus poral reasoning, adversarial planning, and opponent model- on StarCraft II AI, and present a novel modular architecture, ing. Currently, most bots are based on human experiences which splits responsibilities between multiple modules. Each and replays, with limited flexibility and intelligence. 
DRL is module controls one aspect of the game, and two modules are proved to be a promising direction for StarCraft AI, especially trained with self-play DRL methods. This method defeats the in micromanagement, build order, mini-games and full-games built-in bot in ”Harder” level. Pang et al. [101] investigate [90]. a two-level hierarchical RL approach for StarCraft II. The Recently, micromanagement is widely studied as the first macro-action is automatically extracted from expert’s data, and step to solve StarCraft AI. Usunier et al. [91] introduce the the other is a flexible and scaleable hierarchical architecture. greedy MDP with episodic zero-order optimization (GMEZO) More recently, DeepMind proposes AlphaStar, and defeats algorithm to tackle micromanagement scenarios, which per- professional players for the first time. forms better than DQN and policy gradient. BiCNet [92] is 2) MOBA and Dota2: MOBA (Multiplayer Online Battle a multi-agent deep reinforcement learning method to play Arena) is originated from RTS games, which has two teams, StarCraft combat games. It bases on actor-critic reinforcement and each team consists of five players. To beat the opponent, learning, and uses bi-directional neural networks to learn five players in a team must cooperate together, kill enemies, collaboration. BiCNet successfully learns some cooperative upgrade heros, and eventually destroy the opponent base. Since strategies, and is adaptable to various tasks, showing better MOBA research is still in a primary stage, there are fewer performances than GMEZO. In aforementioned works, re- works than conventional RTS games. Most works on MOBA searchers mainly develops centralized methods to play mi- concentrate on dataset analysis and case study. However, due cromanagement. Foerster et al. [93] focus on decentralized to a series of breakthroughs that DRL achieves in game AI, control for micromanagement, and propose a multi-agent researchers start to pay more attention to MOBA recently. actor-critic method. To stabilize experience replay and solve King of Glory (a simplified mobile version of Dota) is the most nonstationarity, they use fingerprints and importance sampling, popular mobile-end MOBA game in China. Jiang et al. [102] which can improve the final performance. Shao et al. [94] apply Monte-Carlo Tree Search and deep neural networks to follow decentralized micromanagement task, and propose pa- this game. The experimental results indicate that MCTS-based rameter sharing multi-agent gradient descent SARSA(λ) (PS- DRL method is efficient and can be used in 1v1 MOBA MAGDS) method. To resue the knowledge between various scenario. Most impressive works on MOBA are proposed by micromanagement scenarios, they also combine curriculum OpenAI. Their results prove that DRL method with self-play 10 can not only be successful in a 1v1 and 2v2 Dota2 scenarios for seek out high reward experiences in a complex sample [103], but also in 5v5 [104] [105]. The model architecture is space, which limits their applicability to many scenarios. In simple, using a LSTM layer as the core component of neural order to reduce the exploration dimension of environment and network. Under the support of massively distributed cloud ease the expenditure of time on interaction, some solutions computing and PPO optimization algorithm, OpenAI Five can can be used for improving data efficiency, such as hierarchy master the critical abilities of team fighting, searching forest, and demonstration. 
focusing, chasing, and diversion for team victory, and defeat Hierarchical reinforcement learning (HRL) allows agents to human champion OG with 2:0. Their works truly open a new decompose the task into several simple subtasks, which can door to MOBA research with DRL method. speed up training and improve sample efficiency. Temporal abstraction is key to scaling up learning, while creating V. CHALLENGESIN GAMESWITH DRL such abstractions autonomously has remained challenging. The option-critic architecture has the ability to learn the internal Since DRL has achieved large progress in some video policies and the options’ termination conditions, without any games, it is considered as one of most promising ways to additional rewards or subgoals [109]. FeUdal Networks (FuNs) realize the artificial general intelligence. However, there are include a Manager module and a Worker module [110]. The still some challenges should be conquered towards goal. In this Manager sets abstract goals at high-level. The Worker receives secition, we discuss some crucial challenges for DRL in video these goals, and generates actions in the environment. FuN games, such as tradeoff between exploration and exploitation, dramatically outperforms baseline agents on tasks that involve low sample efficiency, dilemma in generalization and overfit- long-term credit assignment or memorization. Representation ing, multi-agent learning, incomplete information and delayed learning methods can also be used to guide the option discov- sparse rewards. Though there are some proposed approaches ery process in HRL domain [111]. have been tried to solve these problems, as presented in Fig. Demonstration is a proper technique to improve sample 4, there are still some limitations should be broken. efficiency. Current approaches that learn from demonstration use supervised learning on expert data and use reinforcement A. Exploration-exploitation learning to improve the performance. This method is difficult Exploration can help to obtain more diversity samples, to jointly optimize divergent losses, and is very sensitive to while exploitation is the way to learn the high reward policy noisy demonstrations. Leveraging data from previous control with valuable samples. The trade-off between exploration and of the system can greatly accelerate the learning process even exploitation remains a major challenge for RL. Common meth- with small amounts of demonstration data [24]. Goals defined ods for exploration require a large amount of data, and can not with human preferences can effectively solve complicated RL tackle temporally-extended exploration. Most model-free RL tasks without the reward function, while greatly reducing the algorithms are not computationally tractable in complicated cost of human oversight [112]. environments. Parametric noise can help exploration to a large extend in C. Generalization and Transfer the training process [106] [107]. Besides, randomized value functions become an effective approach for efficient explo- The ability to transfer knowledge across multiple environ- ration. Combining exploration with deep neural networks can ments is considered as a critical aspect of intelligent agents. help to learn much faster, which greatly improves the learning With the purpose of promoting the performance of generaliza- speed and final performance in most games [51]. tion in multiple environments, multi-task learning and policy A simple generalization of popular count-based ap- distillation have been focus on these situations. 
proach can reach satisfactory performance on various high- Multi-task learning with shared neural network parameters dimensional DRL benchmarks [108]. This method maps states can solve the generalization problem, and efficiency can be to hash codes, and counts their occurrences via a hash table. improved through transfer across related tasks. Hybrid reward Then, according to the classic count-based method, we can use architecture takes a decomposed reward function as input and these counts to compute a reward bonus. On many challenging learns a separate value function for each component [113]. tasks, these simple hash functions can achieve impressive The whole value function is much smoother, which can be performance. This exploration strategy provides a simple and easily approximated with a low-dimensional representation, powerful baseline to solve MDPs requiring considerable ex- and learns more effectively. IMPALA shows the effectiveness ploration. for multi-task reinforcement learning, using less data and ex- hibiting positive transfer between tasks [40] . PopArt-IMPALA combines PopArt’s adaptive normalization with IMPALA, and B. Sample efficiency allows a more efficient use of parallel data generation, showing DRL algorithms usually take millions of samples to achieve impressive performance on multi-task domain [114]. human-level performance. While humans can quickly master To successfully learn complex tasks with DRL, we usually highly rewarding actions of an environment. Most model- need large task-specific networks and extensive training to free DRL algorithms are data inefficient, especially for a achieve good performance. Distral shares a distilled policy environment with high dimension and large explore space. which can learn common knowledge across multiple tasks They have to interact with environment in a large time cost [115]. Each worker is trained to solve individual task and to be 11 close to the shared policy, while the shared policy is trained by In many scenarios, researchers use curiosity as an intrinsic distillation. This approach shows efficient transfer on complex reward to encourage agents to explore environment and learn tasks, with more robust and more stable performance. Mix & useful skills. Curiosity can be formulated as the error that Match is a training framework that is designed to encourage the agent predicts its own actions’ consequence in a visual effective and rapid learning in DRL agents [116]. It allows to space [122]. This can scale to high-dimensional continuous automatically form a curriculum over agent, and progressively state spaces. Moreover, it leaves out the aspects of environment trains more complex agents from simpler agents. that cannot affect agents. Curiosity search for DRL encourages intra-life exploration by rewarding agents for visiting as many D. Multi-agent learning different states as possible within each episode [123]. Multi-agent learning is very important in video games, such as StarCraft. In a cooperative multi-agent setting, curse- VI.CONCLUSIONANDDISCUSSION of-dimensionality, communication, and credit assignment are Game AI with deep reinforcement learning is a challenging major challenges. and promising direction. Recent progress in this domain has Team learning uses a single learner to learn joint solutions promote the development of artificial intelligence research. In in multi-agent system, while concurrent learning uses multiple this paper, we review the achievements of deep reinforcement learners for each agent. 
Recently, the centralised training learning in video games. Different DRL methods and their of decentralised policies is becoming a standard paradigm successful applications are introduced. These DRL agents for multi-agent training. Multi-agent DDPG considers other achieve human-level or super-human performances in various agents’ action policy and can successfully learn complex games, from 2D perfect information to 3D imperfect infor- multi-agent coordination behavior [117]. Counterfactual multi- mation, and from single-agent to multi-agent. In addition to agent policy gradients uses a centralized critic to estimate these achievements, there are still some major problems when the action-value function and decentralized actors to opti- applying DRL methods to this field, especially in 3D imperfect mize each agents’ policies, with a counterfactual advantage information multi-agent video game. A high-level game AI function to address the multi-agent credit assignment problem requires to explore more efficient and robust DRL techniques, [118] . In addition, communication protocols is important and needs novel frameworks to be implemented in complex to share information to solve multi-agent tasks. Reinforced environment. These challenges have not been fully investigated Inter-Agent Learning (RIAL) and Differentiable Inter-Agent and could be opened for further study in the future. Learning (DIAL) use deep reinforcement learning to learn end- to-end communication protocols in complex environments. Analogously, CommNet is able to learn continuous communi- ACKNOWLEDGMENT cation between multiple agents. The authors would like to thank Qichao Zhang, Dong Li and Weifan Li for the helpful comments and discussions about this E. Imperfect information work. In partially observable and first-perspective games, DRL agents need to tackle imperfect information to learn a suitable REFERENCES policy. Making decisions in these environments is challenging [1] N. Y. Georgios and T. Julian, Artificial Intelligence and Games. New for DRL agents. York: Springer, 2018. A critical component of enabling effective learning in these [2] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. environment is the use of memory. DRL agents have used [3] D. Zhao, K. Shao, Y. Zhu, D. Li, Y. Chen, H. Wang, D. Liu, T. Zhou, some simple memory architectures, such as several past frames and C. Wang, “Review of deep reinforcement learning and discussions or an LSTM layer. But these architectures are limited to only on the development of computer Go,” Control Theory and Applications, vol. 33, no. 6, pp. 701–717, 2016. remember transitory information. Model-free episode control [4] Z. Tang, K. Shao, D. Zhao, and Y. Zhu, “Recent progress of deep learns difficult sequential decision-making tasks much faster, reinforcement learning: from AlphaGo to AlphaGo Zero,” Control and achieves a higher overall reward [119]. Differentiable Theory and Applications, vol. 34, no. 12, pp. 1529–1546, 2017. [5] J. Niels, B. Philip, T. Julian, and R. Sebastian, “Deep learning for video neural computer uses a neural network to read from and game playing,” CoRR, vol. abs/1708.07902, 2017. write to an external memory matrix [120]. This method can [6] A. Kailash, P. D. Marc, B. Miles, and A. B. Anil, “Deep reinforcement solve complex, structured tasks which can not access to neural learning: A brief survey,” IEEE Signal Processing Magazine, vol. 34, pp. 26–38, 2017. networks without external read and write memory. 
E. Imperfect information

In partially observable and first-person perspective games, DRL agents need to tackle imperfect information to learn a suitable policy, and making decisions in these environments is challenging. A critical component of effective learning in such environments is the use of memory. DRL agents have used simple memory architectures, such as stacking several past frames or adding an LSTM layer, but these architectures can only retain transitory information. Model-free episodic control learns difficult sequential decision-making tasks much faster and achieves a higher overall reward [119]. The differentiable neural computer uses a neural network to read from and write to an external memory matrix [120]; it can solve complex, structured tasks that are inaccessible to neural networks without such external read and write memory. Neural episodic control inserts recent state representations, paired with their corresponding value estimates, into an appropriate neural dictionary, and learns significantly faster than other baseline agents [121].
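As a concrete illustration of the episodic memories discussed above, the following is a minimal sketch in the spirit of model-free episodic control [119] and neural episodic control [121]. The embedding size, the number of neighbours, and the max-return update on exact revisits are simplifying assumptions, not the published algorithms; in particular, the learned embedding network of [121] is replaced here by raw feature vectors.

```python
# Minimal sketch of an episodic memory in the spirit of [119] and [121].
# Simplifying assumptions: unbounded table, raw feature vectors as keys,
# k-nearest-neighbour value estimate, max-return update on exact matches.
import numpy as np

class EpisodicMemory:
    def __init__(self, k=5):
        self.keys, self.returns, self.k = [], [], k

    def write(self, embedding, episodic_return):
        """Store a state embedding with the return observed from that state.
        On an exact revisit, keep the best return seen so far."""
        for i, key in enumerate(self.keys):
            if np.array_equal(key, embedding):
                self.returns[i] = max(self.returns[i], episodic_return)
                return
        self.keys.append(np.asarray(embedding, dtype=np.float32))
        self.returns.append(float(episodic_return))

    def estimate(self, embedding):
        """Value estimate: average return of the k nearest stored embeddings."""
        if not self.keys:
            return 0.0
        dists = np.linalg.norm(np.stack(self.keys) - embedding, axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean([self.returns[i] for i in nearest]))

# Usage: write (embedding, Monte Carlo return) pairs at the end of each
# episode, then query the memory when estimating values for action selection.
memory = EpisodicMemory(k=3)
rng = np.random.default_rng(0)
for _ in range(100):
    memory.write(rng.normal(size=16), rng.uniform(0.0, 1.0))
print(memory.estimate(rng.normal(size=16)))
```

In practice the table would be bounded, and the keys would be produced by a slowly changing encoder so that similar game states map to nearby embeddings.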

F. Delayed sparse rewards

The sparse and delayed reward is very common in many games, and it is also one of the reasons for reduced sample efficiency in reinforcement learning.

CONCLUSION

In this paper, we review deep reinforcement learning in video games. Different DRL methods and their successful applications are introduced. These DRL agents achieve human-level or super-human performance in various games, from 2D perfect information to 3D imperfect information, and from single-agent to multi-agent. In addition to these achievements, there are still some major problems when applying DRL methods to this field, especially in 3D imperfect-information multi-agent video games. A high-level game AI requires more efficient and robust DRL techniques, and needs novel frameworks that can be implemented in complex environments. These challenges have not been fully investigated and remain open for further study.

ACKNOWLEDGMENT

The authors would like to thank Qichao Zhang, Dong Li and Weifan Li for the helpful comments and discussions about this work.

REFERENCES

[1] N. Y. Georgios and T. Julian, Artificial Intelligence and Games. New York: Springer, 2018.
[2] Y. Lecun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[3] D. Zhao, K. Shao, Y. Zhu, D. Li, Y. Chen, H. Wang, D. Liu, T. Zhou, and C. Wang, "Review of deep reinforcement learning and discussions on the development of computer Go," Control Theory and Applications, vol. 33, no. 6, pp. 701–717, 2016.
[4] Z. Tang, K. Shao, D. Zhao, and Y. Zhu, "Recent progress of deep reinforcement learning: from AlphaGo to AlphaGo Zero," Control Theory and Applications, vol. 34, no. 12, pp. 1529–1546, 2017.
[5] J. Niels, B. Philip, T. Julian, and R. Sebastian, "Deep learning for video game playing," CoRR, vol. abs/1708.07902, 2017.
[6] A. Kailash, P. D. Marc, B. Miles, and A. B. Anil, "Deep reinforcement learning: A brief survey," IEEE Signal Processing Magazine, vol. 34, pp. 26–38, 2017.
[7] L. Yuxi, "Deep reinforcement learning: An overview," CoRR, vol. abs/1701.07274, 2017.
[8] R. Dechter, "Learning while searching in constraint-satisfaction-problems," pp. 178–183, 1986.
[9] J. Schmidhuber, "Deep learning in neural networks," Neural Networks, vol. 61, pp. 85–117, 2015.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.
[11] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[13] J. W. Ronald, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229–256, 1992.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[15] v. H. Hado, G. Arthur, and S. David, "Deep reinforcement learning with double Q-learning," in AAAI Conference on Artificial Intelligence, 2016.
[16] S. Tom, Q. John, A. Ioannis, and S. David, "Prioritized experience replay," in International Conference on Learning Representations, 2016.
[17] W. Ziyu, S. Tom, H. Matteo, v. H. Hado, L. Marc, and d. F. Nando, "Dueling network architectures for deep reinforcement learning," in International Conference on Machine Learning, 2016.
[18] P. v. H. Hado, G. Arthur, H. Matteo, M. Volodymyr, and S. David, "Learning values across many orders of magnitude," in Advances in Neural Information Processing Systems, 2016.
[19] S. H. Frank, L. Yang, G. S. Alexander, and P. Jian, "Learning to play in a day: faster deep reinforcement learning by optimality tightening," in International Conference on Learning Representations, 2017.
[20] "Massively parallel methods for deep reinforcement learning," in International Conference on Machine Learning Workshop on Deep Learning, 2015.
[21] J. H. Matthew and S. Peter, "Deep recurrent Q-learning for partially observable MDPs," CoRR, vol. abs/1507.06527, 2015.
[22] L. Nir, Z. Tom, J. M. Daniel, T. Aviv, and M. Shie, "Shallow updates for deep reinforcement learning," in Advances in Neural Information Processing Systems, 2017.
[23] A. Oron, B. Nir, and S. Nahum, "Averaged-DQN: variance reduction and stabilization for deep reinforcement learning," in International Conference on Machine Learning, 2017.
[24] "Deep Q-learning from demonstrations," in AAAI Conference on Artificial Intelligence, 2018.
[25] M. Sabatelli, G. Louppe, P. Geurts, and M. Wiering, "Deep quality-value (DQV) learning," CoRR, vol. abs/1810.00368, 2018.
[26] "Rainbow: combining improvements in deep reinforcement learning," in AAAI Conference on Artificial Intelligence, 2018.
[27] A. A.-M. Jose, G. Michael, W. Michael, U. Thomas, and H. Sepp, "RUDDER: return decomposition for delayed rewards," CoRR, vol. abs/1806.07857, 2018.
[28] "Observe and look further: achieving consistent performance on Atari," CoRR, vol. abs/1805.11593, 2018.
[29] S. John, A. Pieter, and C. Xi, "Equivalence between policy gradients and soft Q-learning," CoRR, vol. abs/1704.06440, 2017.
[30] G. B. Marc, D. Will, and M. Rémi, "A distributional perspective on reinforcement learning," in International Conference on Machine Learning, 2017.
[31] D. Will, R. Mark, G. B. Marc, and M. Rémi, "Distributional reinforcement learning with quantile regression," in AAAI Conference on Artificial Intelligence, 2018.
[32] D. Will, O. Georg, S. David, and M. Rémi, "Implicit quantile networks for distributional reinforcement learning," CoRR, vol. abs/1806.06923, 2018.
[33] "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016.
[34] B. Mohammad, F. Iuri, T. Stephen, C. Jason, and K. Jan, "Reinforcement learning through asynchronous advantage actor-critic on a GPU," in International Conference on Learning Representations, 2017.
[35] J. Max, M. Volodymyr, C. Wojciech, S. Tom, Z. L. Joel, S. David, and K. Koray, "Reinforcement learning with unsupervised auxiliary tasks," in International Conference on Learning Representations, 2017.
[36] "Efficient parallel methods for deep reinforcement learning," CoRR, vol. abs/1705.04862, 2017.
[37] O. Brendan, M. Rémi, K. Koray, and M. Volodymyr, "PGQ: combining policy gradient and Q-learning," in International Conference on Learning Representations, 2017.
[38] M. Rémi, S. Tom, H. Anna, and G. B. Marc, "Safe and efficient off-policy reinforcement learning," in Advances in Neural Information Processing Systems, 2016.
[39] G. Audrunas, G. A. Mohammad, G. B. Marc, and R. Munos, "The Reactor: a sample-efficient actor-critic architecture," CoRR, vol. abs/1704.04651, 2017.
[40] "IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures," CoRR, vol. abs/1802.01561, 2018.
[41] S. John, L. Sergey, A. Pieter, I. J. Michael, and M. Philipp, "Trust region policy optimization," in International Conference on Machine Learning, 2015.
[42] S. John, W. Filip, D. Prafulla, R. Alec, and K. Oleg, "Proximal policy optimization algorithms," CoRR, vol. abs/1707.06347, 2017.
[43] W. Ziyu, B. Victor, H. Nicolas, M. Volodymyr, M. Rémi, K. Koray, and d. F. Nando, "Sample efficient actor-critic with experience replay," in International Conference on Learning Representations, 2017.
[44] W. Yuhuai, M. Elman, L. Shun, B. G. Roger, and B. Jimmy, "Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation," in Advances in Neural Information Processing Systems, 2017.
[45] P. L. Timothy, J. H. Jonathan, P. Alexander, H. Nicolas, E. Tom, T. Yuval, S. David, and W. Daan, "Continuous control with deep reinforcement learning," CoRR, vol. abs/1509.02971, 2015.
[46] "Distributed distributional deterministic policy gradients," CoRR, vol. abs/1804.08617, 2018.
[47] F. Gregory, R. Tim, I. Maximilian, and W. Shimon, "TreeQN and ATreeC: differentiable tree planning for deep reinforcement learning," in International Conference on Learning Representations, 2018.
[48] V. Alexander, M. Volodymyr, O. Simon, G. Alex, V. Oriol, A. John, and K. Koray, "Strategic attentive writer for learning macro-actions," in Advances in Neural Information Processing Systems, 2016.
[49] D. Ha and J. Schmidhuber, "Recurrent world models facilitate policy evolution," in Neural Information Processing Systems, 2018.
[50] N. Nantas, S. Gabriel, L. Zeming, K. Pushmeet, H. S. T. Philip, and U. Nicolas, "Value propagation networks," CoRR, vol. abs/1805.11199, 2018.
[51] O. Ian, B. Charles, P. Alexander, and V. R. Benjamin, "Deep exploration via bootstrapped DQN," in Advances in Neural Information Processing Systems, 2016.
[52] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in International Conference on Machine Learning, pp. 1856–1865, 2018.
[53] S. Sahil, S. L. Aravind, and R. Balaraman, "Learning to repeat: fine grained action repetition for deep reinforcement learning," in International Conference on Learning Representations, 2017.
[54] S. Julian, A. Ioannis, and H. Thomas, "Mastering Atari, Go, chess and shogi by planning with a learned model," CoRR, vol. abs/1911.08265, 2019.
[55] G. B. Marc, N. Yavar, V. Joel, and H. B. Michael, "The Arcade learning environment: An evaluation platform for general agents," Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, 2013.
[56] B. Greg, C. Vicki, P. Ludwig, S. Jonas, S. John, T. Jie, and Z. Wojciech, "OpenAI Gym," CoRR, vol. abs/1606.01540, 2016.
[57] "OpenAI Universe github," https://github.com/openai/universe, 2016.
[58] "OpenAI Retro github," https://github.com/openai/retro, 2018.
[59] N. Alex, P. Vicki, H. Christopher, K. Oleg, and S. John, "Gotta learn fast: a new benchmark for generalization in RL," CoRR, vol. abs/1804.03720, 2018.
[60] "General video game AI: a multi-track framework for evaluating agents, games and content generation algorithms," CoRR, vol. abs/1802.10363, 2018.
[61] R. T. Ruben, B. Philip, T. Julian, L. Jialin, and P.-L. Diego, "Deep reinforcement learning for general video game AI," CoRR, vol. abs/1806.02448, 2018.
[62] "DeepMind Lab," CoRR, vol. abs/1612.03801, 2016.
[63] J. Arthur, B. Vincent-Pierre, V. Esh, G. Yuan, H. Hunter, M. Marwan, and L. Danny, "Unity: a general platform for intelligent agents," CoRR, vol. abs/1809.02627, 2018.
[64] "OpenAI Malmo github," https://github.com/Microsoft/malmo, 2017.
[65] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner, "TORCS, the open racing car simulator," Software available at http://torcs.sourceforge.net, vol. 4, p. 6, 2000.
[66] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski, "ViZDoom: a Doom-based AI research platform for visual reinforcement learning," in IEEE Conference on Computational Intelligence and Games, 2017, pp. 1–8.
[67] S. Gabriel, N. Nantas, A. Alex, C. Soumith, L. Timothée, L. Zeming, R. Florian, and U. Nicolas, "TorchCraft: a library for machine learning research on real-time strategy games," CoRR, vol. abs/1611.00625, 2016.
[68] "StarCraft II: a new challenge for reinforcement learning," CoRR, vol. abs/1708.04782, 2017.
[69] C. Karl, K. Oleg, H. Chris, K. Taehoon, and S. John, "Quantifying generalization in reinforcement learning," CoRR, vol. abs/1812.02341, 2018.

[70] C. M. Marlos, G. B. Marc, T. Erik, V. Joel, J. H. Matthew, and B. Michael, "Revisiting the Arcade learning environment: evaluation protocols and open problems for general agents," Journal of Artificial Intelligence Research, vol. 61, pp. 523–562, 2018.
[71] H. Dan, Q. John, B. David, B.-M. Gabriel, H. Matteo, v. H. Hado, and S. David, "Distributed prioritized experience replay," in International Conference on Learning Representations, 2018.
[72] O. Georg, G. B. Marc, O. Aaron, and M. Rémi, "Count-based exploration with neural density models," in International Conference on Machine Learning, 2017.
[73] M. L. Hoang, J. Nan, A. Alekh, D. Miroslav, Y. Yisong, and D. Hal, "Hierarchical imitation and reinforcement learning," in International Conference on Machine Learning, 2018.
[74] Y. Aytar, T. Pfaff, D. Budden, T. L. Paine, Z. Wang, and N. D. Freitas, "Playing hard exploration games by watching YouTube," 2018.
[75] "Montezuma's revenge solved by Go-Explore, a new algorithm for hard-exploration problems," https://eng.uber.com/go-explore/, 2018.
[76] Y. Wu and Y. Tian, "Training agent for first-person shooter game with actor-critic curriculum learning," in International Conference on Learning Representations, 2017.
[77] P. Emilio and S. Ruslan, "Neural map: structured memory for deep reinforcement learning," CoRR, vol. abs/1702.08360, 2017.
[78] K. Shao, D. Zhao, N. Li, and Y. Zhu, "Learning battles in ViZDoom via deep reinforcement learning," in IEEE Conference on Computational Intelligence and Games, 2018.
[79] G. Yang, X. Huazhe, L. Ji, Y. Fisher, L. Sergey, and D. Trevor, "Reinforcement learning from imperfect demonstrations," CoRR, vol. abs/1802.05313, 2018.
[80] M. Sahisnu, L. Bing, W. Shuai, Z. Yingxuan, L. Lifeng, and L. Jian, "Action permissibility in deep reinforcement learning and application to autonomous driving," in ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2018.
[81] D. Li, D. Zhao, Q. Zhang, and Y. Chen, "Reinforcement learning and deep learning based lateral control for autonomous driving," IEEE Computational Intelligence Magazine, 2018.
[82] Y. Zhu and D. Zhao, "Driving control with deep and reinforcement learning in the open racing car simulator," in International Conference on Neural Information Processing, 2018.
[83] J. Matthew, H. Katja, H. Tim, and B. David, "The Malmo platform for artificial intelligence experimentation," in International Joint Conferences on Artificial Intelligence, 2016.
[84] X. Yanhai, C. Haipeng, Z. Mengchen, and A. Bo, "HogRider: champion agent of Microsoft Malmo collaborative AI challenge," in AAAI Conference on Artificial Intelligence, 2018.
[85] T. Chen, G. Shahar, Z. Tom, J. M. Daniel, and M. Shie, "A deep hierarchical approach to lifelong learning in Minecraft," in AAAI Conference on Artificial Intelligence, 2017.
[86] H. J. Peter, L. Sergey, and K. Kurt, "Regret minimization for partially observable deep reinforcement learning," in International Conference on Learning Representations, 2018.
[87] "Unicorn: continual learning with a universal, off-policy agent," CoRR, vol. abs/1802.08294, 2018.
[88] "Kickstarting deep reinforcement learning," CoRR, vol. abs/1803.03835, 2018.
[89] "Human-level performance in first-person multiplayer games with population-based deep reinforcement learning," CoRR, vol. abs/1807.01281, 2018.
[90] Z. Tang, K. Shao, Y. Zhu, D. Li, D. Zhao, and T. Huang, "A review of computational intelligence for StarCraft AI," in IEEE Symposium Series on Computational Intelligence (SSCI), 2018.
[91] N. Usunier, G. Synnaeve, Z. Lin, and S. Chintala, "Episodic exploration for deep deterministic policies: an application to StarCraft micromanagement tasks," in International Conference on Learning Representations, 2017.
[92] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang, "Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games," 2017.
[93] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, and S. Whiteson, "Stabilising experience replay for deep multi-agent reinforcement learning," in International Conference on Machine Learning, 2017.
[94] K. Shao, Y. Zhu, and D. Zhao, "StarCraft micromanagement with reinforcement learning and curriculum transfer learning," IEEE Transactions on Emerging Topics in Computational Intelligence, DOI: 10.1109/TETCI.2018.2823329, 2018.
[95] K. Xiangyu, X. Bo, L. Fangchen, and W. Yizhou, "Revisiting the master-slave architecture in multi-agent deep reinforcement learning," CoRR, vol. abs/1712.07305, 2017.
[96] R. Tabish, S. Mikayel, S. d. W. Christian, F. Gregory, N. F. Jakob, and W. Shimon, "QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning," CoRR, vol. abs/1803.11485, 2018.
[97] T. Zhentao, Z. Dongbin, Z. Yuanheng, and G. Ping, "Reinforcement learning for build-order production in StarCraft II," in International Conference on Information Science and Technology, 2018.
[98] V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart et al., "Relational deep reinforcement learning," CoRR, vol. abs/1806.01830, 2018.
[99] P. Sun, X. Sun, L. Han, J. Xiong, Q. Wang, B. Li, Y. Zheng, J. Liu, Y. Liu, H. Liu, and T. Zhang, "TStarBots: defeating the cheating level builtin AI in StarCraft II in the full game," CoRR, vol. abs/1809.07193, 2018.
[100] L. Dennis, T. Haoran, O. Z. Jeffrey, X. Huazhe, D. Trevor, and A. Pieter, "Modular architecture for StarCraft II with deep reinforcement learning," CoRR, vol. abs/1811.03555, 2018.
[101] P. Zhen-Jia, L. Ruo-Ze, M. Zhou-Yu, Z. Yi, Y. Yang, and L. Tong, "On reinforcement learning for full-length game of StarCraft," CoRR, vol. abs/1809.09095, 2018.
[102] R. J. Daniel, E. Emmanuel, and L. Hao, "Feedback-based tree search for reinforcement learning," in International Conference on Machine Learning, 2018.
[103] "OpenAI Dota 1v1," https://blog.openai.com/dota-2/, 2017.
[104] "OpenAI Dota Five," https://blog.openai.com/openai-five/, 2018.
[105] B. Christopher, B. Greg, and C. Brooke, "Dota 2 with large scale deep reinforcement learning," CoRR, vol. abs/1912.06680, 2019.
[106] "Noisy networks for exploration," CoRR, vol. abs/1706.10295, 2017.
[107] "Parameter space noise for exploration," CoRR, vol. abs/1706.01905, 2017.
[108] T. Haoran, H. Rein, F. Davis, S. Adam, C. Xi, D. Yan, S. John, D. T. Filip, and A. Pieter, "Exploration: a study of count-based exploration for deep reinforcement learning," in Advances in Neural Information Processing Systems, 2017.
[109] B. Pierre-Luc, H. Jean, and P. Doina, "The option-critic architecture," in AAAI Conference on Artificial Intelligence, 2017.
[110] S. V. Alexander, O. Simon, S. Tom, H. Nicolas, J. Max, S. David, and K. Koray, "FeUdal networks for hierarchical reinforcement learning," in International Conference on Machine Learning, 2017.
[111] C. M. Marlos, R. Clemens, G. Xiaoxiao, L. Miao, T. Gerald, and C. Murray, "Eigenoption discovery through the deep successor representation," CoRR, vol. abs/1710.11089, 2017.
[112] F. C. Paul, L. Jan, B. B. Tom, M. Miljan, L. Shane, and A. Dario, "Deep reinforcement learning from human preferences," in Advances in Neural Information Processing Systems, 2017.
[113] v. S. Harm, F. Mehdi, L. Romain, R. Joshua, B. Tavian, and T. Jeffrey, "Hybrid reward architecture for reinforcement learning," in Advances in Neural Information Processing Systems, 2017.
[114] H. Matteo, S. Hubert, E. Lasse, C. Wojciech, S. Simon, and v. H. Hado, "Multi-task deep reinforcement learning with PopArt," CoRR, vol. abs/1809.04474, 2018.
[115] "Distral: robust multitask reinforcement learning," in Advances in Neural Information Processing Systems, 2017.
[116] "Mix & Match - agent curricula for reinforcement learning," CoRR, vol. abs/1806.01780, 2018.
[117] L. Ryan, W. Yi, T. Aviv, H. Jean, A. Pieter, and M. Igor, "Multi-agent actor-critic for mixed cooperative-competitive environments," 2017, pp. 6382–6393.
[118] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, "Counterfactual multi-agent policy gradients," in AAAI Conference on Artificial Intelligence, 2018.
[119] B. Charles, U. Benigno, P. Alexander, L. Yazhe, R. Avraham, Z. L. Joel, W. R. Jack, W. Daan, and H. Demis, "Model-free episodic control," CoRR, vol. abs/1606.04460, 2016.
[120] "Hybrid computing using a neural network with dynamic external memory," Nature, vol. 538, pp. 471–476, 2016.
[121] "Neural episodic control," in International Conference on Machine Learning, 2017.
[122] P. Deepak, A. Pulkit, A. E. Alexei, and D. Trevor, "Curiosity-driven exploration by self-supervised prediction," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 488–489.
[123] S. Christopher and C. Jeff, "Deep curiosity search: intra-life exploration improves performance on challenging deep reinforcement learning problems," CoRR, vol. abs/1806.00553, 2018.