Neuroevolution for General Video Game Playing
Spyridon Samothrakis, Diego Perez-Liebana, Simon M. Lucas and Maria Fasli
School of Computer Science and Electronic Engineering
University of Essex, Colchester CO4 3SQ, UK
[email protected], [email protected], [email protected], [email protected]

Abstract—General Video Game Playing (GVGP) allows for the fair evaluation of algorithms and agents as it minimizes the ability of an agent to exploit a priori knowledge in the form of game-specific heuristics. In this paper we compare four possible combinations of evolutionary learning, using Separable Natural Evolution Strategies as our evolutionary algorithm of choice: linear function approximation with Softmax search and ε-greedy policies, and neural networks with the same policies. The algorithms explored in this research play each of the games during a sequence of 1000 matches, where the score obtained is used as a measurement of performance. We show that learning is achieved in 8 out of the 10 games employed in this research, without introducing any domain-specific knowledge, leading the algorithms to maximize the average score as the number of games played increases.

I. INTRODUCTION

Learning how to act in unknown environments is often termed the "Reinforcement Learning" (RL) problem [1]. The problem encapsulates the core of artificial intelligence and has been studied widely under different contexts. Broadly speaking, an agent tries to maximize some long-term notion of utility or reward by selecting appropriate actions at each state. Classic RL assumes that the state can somehow be identified by the agent in a unique fashion, but it is often the case that function approximators are used to help with state spaces that cannot be enumerated.

A possible way of attacking the problem in the general sense is Neuroevolution (for an example in games see [2]). Neuroevolution adapts the weights of a local or global function approximator (in the form of a neural network) in order to maximize reward. The appeal of Neuroevolution over traditional gradient-based Reinforcement Learning methods is three-fold. First, the gradient in RL problems might be unstable and/or hard to approximate, thus requiring extensive hyper-parameter tuning in order to get any performance. Secondly, the tournament-ranking schemes used by evolutionary approaches are robust to outliers, as they are effectively calculations of medians rather than means. Thirdly, classic RL algorithms (at least their critic-only versions [3]) can be greatly impacted by the lack of a perfect Markov state. Sensor aliasing is a known and common problem in RL under function approximation.

On one hand, the lack of direct gradient information limits the possible number of parameters of the function approximator to be evolved. If provided with the correct setup and hyperparameters, RL algorithms might be able to perform at superhuman level in a number of hard problems. In practical terms, research into RL has often been coupled with the use of strong heuristics. General Game Playing allows for completely arbitrary games, making the easy application of temporal-based methods non-trivial. On the other hand, modern evolutionary algorithms require minimal fine-tuning and can help greatly with providing a quick baseline. In this paper we perform experiments using evolutionary algorithms to learn actors in the following setup.

Among evolutionary algorithms, for problems with continuous state variables, a simple yet robust algorithm is Separable Natural Evolution Strategies (S-NES) [4]. Coupling the algorithm with different policy schemes and different approximators yields the following experimental methods (a minimal sketch of the two policies is given after this list):

• S-NES using an ε-greedy policy and a linear function approximator.
• S-NES using an ε-greedy policy and a neural network.
• S-NES using a softmax policy and a linear function approximator.
• S-NES using a softmax policy and a neural network.

The motivation behind this is to check whether there is an exploration method which works better in a general sense, without misguiding the evolution. We will see the specifics of this setup in later sections. In order to make this research easy to replicate and interesting in its own right, the environmental substrate is provided by the new track of the General Video Game Playing AI Competition¹. The players evaluated in this research are tested in this setup.

¹ www.gvgai.net
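To make the two exploration policies concrete, the sketch below shows one way they could sit on top of either approximator. It is only an illustration under our own assumptions: the state is taken to be already encoded as a feature vector phi, theta is the flat weight vector that the evolutionary algorithm would adjust, and all names (linear_action_values, mlp_action_values, epsilon_greedy, softmax_policy) are hypothetical rather than part of the GVG-AI framework or the exact architectures used in this paper.

```python
import numpy as np

def linear_action_values(theta, phi, n_actions):
    # Linear approximator: one weight row per action; theta is the flat
    # parameter vector that the evolutionary algorithm would adjust.
    W = theta.reshape(n_actions, phi.size)
    return W @ phi

def mlp_action_values(theta, phi, n_hidden, n_actions):
    # Single-hidden-layer network sharing the same flat parameterisation.
    n_in = phi.size
    split = n_in * n_hidden
    W1 = theta[:split].reshape(n_hidden, n_in)
    W2 = theta[split:split + n_hidden * n_actions].reshape(n_actions, n_hidden)
    return W2 @ np.tanh(W1 @ phi)

def epsilon_greedy(values, epsilon, rng):
    # With probability epsilon act randomly, otherwise take the greedy action.
    if rng.random() < epsilon:
        return int(rng.integers(len(values)))
    return int(np.argmax(values))

def softmax_policy(values, temperature, rng):
    # Sample an action with probability proportional to exp(value / T).
    z = (values - values.max()) / temperature   # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(values), p=p))

# Tiny usage example with made-up sizes: 10 features, 4 actions.
rng = np.random.default_rng(0)
phi = rng.standard_normal(10)
theta = rng.standard_normal(10 * 4)
action = epsilon_greedy(linear_action_values(theta, phi, 4), 0.1, rng)
```

Under this view, switching between the four configurations only changes which value function and which action-selection rule are plugged in; the evolutionary algorithm always sees a flat parameter vector and a scalar fitness.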
The paper is structured as follows: Section II gives a brief review of the current state of the art in General Game and Video Game Playing. Section III analyses the framework used in this paper for evaluating games. We describe the algorithms used in Section IV, while in Section V we present the experimental results. We conclude with a discussion in Section VI.

II. RELATED RESEARCH

While there has been considerable research on the topic of both Neuroevolution and General Game Playing (GGP), GGP [5] was initially mostly concerned with approaches that included access to a forward model. In this scenario, agents (or players) have the ability to predict the future to a certain extent and plan their actions, typically in domains that are more closely aligned to board games and are of a deterministic nature.

Recently, a competition termed "The General Video Game Playing Competition" [6] has focused on similar problems, but this time in the real-time game domain. The competition provided the competitors with internal models they could sample from (but not access to specific future event probabilities), while limiting the amount of "thinking" time provided for an agent to 40ms (imposing an almost-real-time constraint). In both contexts, the algorithms that dominated were sample-based search methods [7], [6], also known as Monte Carlo Tree Search (MCTS). In the case of GVG-AI [6], which allows for stochastic environments, Open Loop MCTS methods seem to perform best. A recent paper studies the use of open loop methods, employing both tree search and evolutionary algorithms, in this domain [8].

This competition and framework have brought more research to this area in the months since the end of the contest. For instance, B. Ross, in his Master's thesis [9], explores enhancements to the short-sightedness of MCTS in this domain by adding a long-term, goal-based approach. Also recently, T. Nielsen et al. [10] employed different game playing algorithms to identify well- and badly-designed games, under the assumption that good games should differentiate well the performance of these different agents.

There has also been considerable effort in learning how to play games in a generic fashion outside competitions, using the Atari framework [11], which provides a multitude of games for testing. The games were tackled recently [12] using a combination of neural networks and Q-learning with experience replay. The method was widely seen as a success (though it could still not outperform humans in all games). The methods for these games operate on a pixel level and require a significant amount of training, but the results achieved when successful cover the full AI loop (from vision to actions).

This latest work (by Mnih et al.) is of special interest to this research, as it tackles the problem in a similar way to how it is done here. The agents that play those games have no access to a forward model that would allow them to anticipate the state that will be found after applying an action. Learning is performed by maximizing the score obtained in the games after playing them during a number of consecutive trials.

Neuroevolution has been used for real-time video games [2], General Game Playing (see [13]) and playing Atari games [14], [15], all with varying degrees of success.

VGDL defines an ontology that allows the generation of real-time games and levels with simple text files.

The GVG-AI framework is a Java port of py-vgdl, re-engineered for the GVG-AI Competition by Perez et al. [6], which exposes an interface that allows the implementation of controllers that must determine the actions of the player (also referred to as the avatar). The VGDL definition of the game is not provided, so it is the responsibility of the agent to determine the nature of the game and the actions needed to achieve victory.

The original version of this framework provided a forward model that allowed the agent to simulate actions on the current state to devise the consequences of the available actions. In this case, the agents counted on 40ms of CPU time to decide an action on each game cycle. This version of the framework was employed by the organizers of the competition to run what is now called the "planning track" of the contest, which took place during the Computational Intelligence in Games (CIG) conference in Dortmund (Germany) in 2014. A complete description of this competition, framework, rules, entries and results has been published in [6].

2016 will feature a new track for this competition, named the "learning track". In this track, controllers do not have access to a forward model, so they cannot foresee the states reached after applying any of the available actions of the game. In contrast, they will be offered the possibility of playing the game multiple times, with the aim of allowing the agents to learn how to play each given game. Note that the competition is not limited by some of the drawbacks related to using Atari as an approach (e.g., no true stochasticity [19]), that it allows for games that can be customised easily, and that it also allows for great flexibility in the choice of interface between the game and the AI agent, which could be vision based (as with ALE) or game-object based (as with the current experiments). Thus, we think, it provides different challenges than Atari General Game Playing.
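Since the learning track offers no forward model, the only training signal available to an evolutionary learner is the score obtained by actually playing the game over repeated matches. The sketch below illustrates how such an evaluation loop might be wired into a separable evolution-strategy optimiser. It is a simplified sketch under our own assumptions: play_one_game is a hypothetical stand-in for running a single match with a fixed weight vector, and the update shown is a bare-bones separable strategy in the spirit of, but not identical to, the S-NES formulation of [4].

```python
import numpy as np

def play_one_game(theta):
    # Hypothetical stand-in: run one match with fixed weights theta and
    # return the final game score (however the framework reports it).
    raise NotImplementedError

def fitness(theta, n_games=5):
    # Average score over several matches: the only feedback the learner gets.
    return float(np.mean([play_one_game(theta) for _ in range(n_games)]))

def evolve_weights(dim, generations=200, pop=20, eta_mu=1.0, eta_sigma=0.1,
                   seed=0):
    # Diagonal-Gaussian search distribution over the flat weight vector,
    # updated from rank-based utilities (in the spirit of separable NES).
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    base = np.arange(pop, dtype=float)
    utils = base / base.sum() - 1.0 / pop        # zero-sum; best gets largest
    for _ in range(generations):
        eps = rng.standard_normal((pop, dim))
        candidates = mu + sigma * eps
        scores = np.array([fitness(c) for c in candidates])
        u = np.empty(pop)
        u[np.argsort(scores)] = utils            # worst score -> smallest utility
        mu = mu + eta_mu * sigma * (u @ eps)
        sigma = sigma * np.exp(0.5 * eta_sigma * (u @ (eps ** 2 - 1.0)))
    return mu
```

A loop along these lines requires no forward model: only the scores returned after each match drive the search, matching the learning-track setting described above.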
Information is given regarding the state of the game to the