Benchmarking End-to-End Behavioural Cloning on Video Games
Anssi Kanervisto*, Joonas Pussinen*, Ville Hautamäki
School of Computing, University of Eastern Finland, Joensuu, Finland
anssk@uef.fi, [email protected], villeh@uef.fi

*Equal contribution, alphabetical ordering. This research was partially funded by the Academy of Finland (grant #313970). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. We thank the reviewers for extensive comments used to improve the final version of this paper.

Abstract—Behavioural cloning, where a computer is taught to perform a task based on demonstrations, has been successfully applied to various video games and robotics tasks, with and without reinforcement learning. This also includes end-to-end approaches, where a computer plays a video game like humans do: by looking at the image displayed on the screen and sending keystrokes to the game. As a general approach to playing video games, this has many inviting properties: no need for specialized modifications to the game, no lengthy training sessions and the ability to re-use the same tools across different games. However, related work includes game-specific engineering to achieve its results. We take a step towards a general approach and study the general applicability of behavioural cloning on twelve video games, including six modern video games (published after 2010), by using human demonstrations as training data. Our results show that these agents cannot match humans in raw performance but do learn basic dynamics and rules. We also demonstrate how the quality of the data matters, and how recording data from humans is subject to a state-action mismatch due to human reflexes.

Index Terms—video game, behavioral cloning, imitation learning, reinforcement learning, learning environment, neural networks

I. INTRODUCTION

Reinforcement learning (RL) [1] has been successfully applied to create super-human players in multiple video games, including classic Atari 2600 games [2], as well as more modern shooters [3], MOBAs [4], [5] and real-time strategy games [6].
Even more so, all of the before-mentioned accomplishments use "end-to-end" systems, where input features are not pre-processed by crafting specific features, but instead rely on raw information like image pixels. However, RL is not without its limitations: it requires an environment in which to play the game. Whether this is achieved by modifying an existing game (like Starcraft II [7]) or by using game engines to create environments from the ground up (like Unity ML-Agents [8]), it still requires considerable engineering. Even worse, after the environment is created, training the agents may take thousands of years of in-game time [4], [6].

An alternative approach is imitation learning, in which agents learn to replicate demonstrators' actions. Behavioural cloning (BC) [9] is the simplest form of this: given an observation and an associated action from a demonstrator, predict this action based on the observation (i.e. a classification task). This has been used to kick-start RL agents [6], [10], but has also been applied on its own in e.g. autonomous driving [9], [11], [12], and Vinyals et al. [6] show that Starcraft II can be played at a proficient human level with behavioural cloning alone. This begs the question: how well can behavioural cloning play video games in general? Can we reach the level of a human player? How much data do we need? Do we need data from multiple players?

If we can create performant, end-to-end agents with BC and human gameplay alone, it would skip many of the hurdles experienced with RL: we do not need to create an environment for agents to play in, nor do we need to spend large amounts of compute resources on training. We only need the video game, a tool to record the gameplay, and players for the game. If BC can manage with just an hour or two of gameplay demonstration, a single person could record the demonstration data. If the recording tool captures the same output and input a human player would have (i.e. the image of the screen and keyboard/mouse, end-to-end), this would require no game-specific coding and could be applied to any game. Even if BC does not reach human-level performance, it could still be used as a starting point for other learning methods, or as a support for diversifying the agent's behaviour [6].

Video games have been in active use as benchmarks in research using BC [10], [13]–[15], and as milestones to beat in AI research [2], [5], [6]. The other way around, "BC for video games" has seen works like human-like bots in first-person shooter (FPS) games using hand-crafted features and imitation learning [16]–[18], and end-to-end FPS bots with RL and BC [19]. Our setting and motivation resemble those of [20], where the authors employ end-to-end imitation learning to successfully play two Nintendo 64 games. However, these studies have been limited to only a few games at a time, making it hard to tell how well BC performs at playing video games in general. Apart from [15], related work does not study how data should be chosen for behavioural cloning. In addition, Zhang et al. [14] bring up an important point on how human delay can adversarially affect the quality of the dataset, but did not include experimental results on this.

In this work, we aim to answer these questions and to assess the general applicability of end-to-end behavioural cloning for video game playing. We use data from human demonstrators to train a deep network to predict their actions, given the same observations the human players saw (the screen image). We run empirical experiments to study how well BC agents play Atari 2600 games, Doom (1993) and various modern video games. Along with the raw performance, we study the effect of the quality and quantity of the training data, and the effect of the delay of human reflexes on data quality. Alongside the results, we present ViControl ("Visual Control"), a multi-platform tool to record and play an arbitrary game, which we use to do behavioural cloning on the modern games. ViControl is available at https://github.com/joonaspu/ViControl.

Fig. 1. Games tested with behavioural cloning. Images represent what the BC agent would see. From left to right: Ms. Pac-Man, Video Pinball, Q*bert, Montezuma's Revenge, Space Invaders, Deathmatch (Doom), HGS (Doom), Downwell, Crypt of The NecroDancer, Super Hexagon, Boson X, Binding of Isaac: Rebirth and BeamNG.drive.

II. END-TO-END BEHAVIOURAL CLONING FOR VIDEO GAMES

A. Behavioural cloning

We wish to train a computer to play a game based on given demonstrations of humans playing it. We model the environment as a truncated version of a Markov Decision Process (MDP) [1], where playing the game consists of observations $s \in \mathcal{S}$ and associated actions $a \in \mathcal{A}$. We do not include a notion of time, a reward signal, nor terminal/initial states. We opt for this for simplicity, and as most of the games used in this work do not require memory to master. The task of behavioural cloning is simple: given a dataset of human gameplay $\mathcal{D}$ containing tuples $(s, a)$, learn the conditional distribution $p(a \mid s)$, i.e. the probability of human players picking action $a$ in state $s$. After learning this distribution, we can use it to play the game by sampling an action $a \sim p(a \mid s)$ for a given state $s$ (the agent).

We take an end-to-end approach, where states are the pixels of an RGB image $s \in \mathbb{R}^{H \times W \times 3}$, $H, W \in \mathbb{N}$, and actions $a$ are a vector of one or more discrete variables $a_i \in \{0, 1, \ldots, d_i\}$, where $i \in \mathbb{N}$ indexes the discrete variables and $d_i$ is the number of options per discrete variable. In Atari environments [23], the action contains one discrete variable with 18 options, covering all possible choices a human player could make (a multi-class classification task). With a PC game using a keyboard with, say, four buttons available, the actions consist of four discrete variables, each with two options: down or up (a multi-label classification task).

To model the conditional distribution $p(a \mid s)$, we use deep neural networks. They support the different action spaces we could have and are known to excel in image classification tasks [24]. We treat the discrete action variables $i$ as independent from each other, and the network is trained to minimize the cross-entropy between predictions and labels in the dataset.
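To make this formulation concrete, the following is a minimal PyTorch sketch of such a setup. It is not the paper's implementation: the convolutional architecture, the 128×128 input size, the four-button action space, and the optimizer settings are all illustrative assumptions. It shows the three ingredients described above: one head of logits per discrete action variable, a training step minimizing the summed per-variable cross-entropy, and an agent that acts by sampling $a \sim p(a \mid s)$.

```python
# Minimal behavioural-cloning sketch (illustrative, not the paper's exact model).
# Assumptions: four keyboard buttons, each a binary (down/up) variable, and
# observations already resized to 3x128x128 RGB tensors.
import torch
import torch.nn as nn

N_BUTTONS = 4           # number of discrete action variables (multi-label case)
OPTIONS_PER_BUTTON = 2  # down or up

class BCPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = self.features(torch.zeros(1, 3, 128, 128)).shape[1]
        # One head of logits per discrete action variable.
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, OPTIONS_PER_BUTTON) for _ in range(N_BUTTONS)]
        )

    def forward(self, obs):
        h = self.features(obs)
        return [head(h) for head in self.heads]  # list of (B, d_i) logits

policy = BCPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

def train_step(obs, actions):
    """obs: (B, 3, 128, 128) float tensor; actions: (B, N_BUTTONS) long labels."""
    logits = policy(obs)
    # Variables are treated as independent, so the total loss is the sum of
    # per-variable cross-entropies between predictions and demonstrator labels.
    loss = sum(ce(l, actions[:, i]) for i, l in enumerate(logits))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def act(obs):
    """Play by sampling a ~ p(a|s) independently for each discrete variable."""
    with torch.no_grad():
        logits = policy(obs)
    return torch.stack(
        [torch.distributions.Categorical(logits=l).sample() for l in logits],
        dim=1,
    )  # (B, N_BUTTONS) sampled actions
```

For the Atari case described above, the same sketch reduces to a single head with 18 options and an ordinary multi-class cross-entropy.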
B. Challenges of general end-to-end control of video games

Compared to Atari 2600 games and Doom (1993), modern video games (published after 2010) can take advantage of more computing power and tend to be more complex when it comes to visual aesthetics and dynamics. We also do not assume to have control over the game program's flow, so the game will run at a fixed rate, as humans would experience it. Altogether, these raise specific challenges for generalized end-to-end control, where we wish to avoid per-game engineering.

a) High resolution: Modern games commonly run at "high definition" resolutions, with the most common monitor resolution among players being 1920×1080. However, RL and BC agents resize images to small resolutions for computational efficiency, usually capped at around 200 to 300 pixels per axis.
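As a rough illustration of the downscaling step this implies, the sketch below resizes a captured full-HD frame to network input size. The 128×128 target and the use of Pillow are assumptions for illustration; the text above only states that resolutions are usually capped at around 200 to 300 pixels per axis.

```python
# Downscaling a 1920x1080 frame to network input size (illustrative values).
import numpy as np
from PIL import Image

TARGET_SIZE = (128, 128)  # assumed target; the paper only gives the ~200-300 px cap

def preprocess(frame: np.ndarray) -> np.ndarray:
    """frame: (1080, 1920, 3) uint8 RGB screen capture -> small float32 image."""
    small = Image.fromarray(frame).resize(TARGET_SIZE, Image.BILINEAR)
    return np.asarray(small, dtype=np.float32) / 255.0  # scale pixels to [0, 1]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # stand-in for a real capture
obs = preprocess(frame)  # (128, 128, 3); transpose to CHW before the network
```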