Non-Linear Monte-Carlo Search in Civilization II S.R.K
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Non-Linear Monte-Carlo Search in Civilization II S.R.K. Branavan David Silver * Regina Barzilay Computer Science and Artificial Intelligence Laboratory * Department of Computer Science Massachusetts Institute of Technology University College London {branavan, regina}@csail.mit.edu d.silver@cs.ucl.ac.uk Abstract This paper presents a new Monte-Carlo search al- gorithm for very large sequential decision-making problems. We apply non-linear regression within ! Monte-Carlo search, online, to estimate a state- action value function from the outcomes of ran- dom roll-outs. This value function generalizes be- tween related states and actions, and can therefore Figure 1: An extract from the game manual of Civilization II. provide more accurate evaluations after fewer roll- outs. A further significant advantage of this ap- A common approach to value generalization is value func- proach is its ability to automatically extract and tion approximation (VFA), in which the values of all states leverage domain knowledge from external sources and actions are approximated using a smaller number of pa- such as game manuals. We apply our algorithm rameters. Monte-Carlo search can use VFA to estimate state to the game of Civilization II, a challenging multi- and action values from the roll-out scores. For instance, lin- agent strategy game with an enormous state space ear Monte-Carlo search [Silver et al., 2008] uses linear VFA: and around 1021 joint actions. We approximate the it approximates the value function by a linear combination of value function by a neural network, augmented by state features and weights. linguistic knowledge that is extracted automatically In this paper, we investigate the use of Non-linear VFA from the official game manual. We show that this within Monte-Carlo search to provide a richer and more com- non-linear value function is significantly more ef- pact representation of value. This value approximation re- ficient than a linear value function, which is itself sults in better generalization between related game states and more efficient than Monte-Carlo tree search. Our actions, thus providing accurate evaluations after fewer roll- non-linear Monte-Carlo search wins over 78% of outs. We apply non-linear VFA locally in time,tofitanap- games against the built-in AI of Civilization II. 1 proximate value function to the scores from the current set of roll-outs. These roll-outs are all generated from the current state, i.e., they are samples from the sub game starting from 1 Introduction now. The key features of our method are: • Fitting Non-Linear VFA to local game context Pre- Monte-Carlo search is a simulation-based search paradigm vious applications of non-linear VFA [Tesauro, 1994] that has been successfully applied to complex games such as estimated a global approximation of the value function. Go, Poker, Scrabble, multi-player card games, and real-time While effective for games with relatively small state [ strategy games, among others Gelly et al., 2006; Tesauro spaces and smooth value functions, this approach has so and Galperin, 1996; Billings et al., 1999; Sheppard, 2002; far not been successful in larger, more complex games, ] Schafer,¨ 2008; Sturtevant, 2008; Balla and Fern, 2009 .In such as Go or Civilization II. Non-linear VFA, applied this framework, the values of game states and actions are globally, does not provide a sufficiently accurate repre- estimated by the mean outcome of roll-outs, i.e., simulated sentation in these domains. In contrast, the local value games. These values are used to guide the selection of the function which focuses solely on the dynamics of the best action. However, in complex games such as Civiliza- current subgame, which may be quite specialized, can be tion II, the extremely large action space makes simple Monte- considerably easier to approximate than the full global Carlo search impractical, as prohibitively many roll-outs are value function which models the entire game. needed to determine the best action. To achieve good perfor- mance in such domains, it is crucial to generalize from the • Leveraging domain knowledge automatically ex- outcome of roll-outs to the values of other states and actions. tracted from text Game manuals and other textual re- sources provide valuable sources of information for se- 1The code and data for this work are available from lecting a game strategy. An example of such advice is http://groups.csail.mit.edu/rbg/code/civ/ shown in Figure 1. For instance, the system may learn 2404 that the action irrigate-land should be selected if the Firstly, due to its greater representational power, it general- words “improve terrain” are present in text. However, izes better from small numbers of roll-outs. This is partic- the challenge in using such information is in identify- ularly important in complex games, such as Civilization II, ing which segments of text are relevant for the current in which simulation is computationally intensive, severely game context; for example, the system needs to identify limiting the number of roll-outs. Secondly, non-linear VFA the second sentence in Figure 1 as advice pertinent to allows for multi-layered representations, such as neural net- the irrigation action. These decisions can be effectively works, which can automatically learn features of the game modeled via hidden variables in a non-linear VFA. state. Furthermore, we show that a multi-layered, non-linear We model our non-linear VFA using a four-layer neural VFA can automatically incorporate domain knowledge from network, and estimate its parameters via reinforcement learn- natural language documents, so as to produce better features ing within the Monte-Carlo search framework. The first layer of the game state, and therefore a better value function and of this network represents features of the game state, action, better overall performance. and the game manual. The second layer represents a linguis- A number of other approaches have previously been ap- tic model of the game manual, which encodes sentence rele- plied to turn-based strategy games [Amato and Shani, 2010; vance and the predicate structure of sentences. The third layer Wender and Watson, 2008; Bergsma and Spronck, 2008; corresponds to a set of fixed features which perform simple, Madeira et al., 2006; Balla and Fern, 2009]. However, given predefined transformations on the outputs of the first two lay- the enormous state and action space in most strategy games, ers. The fourth and final layer represents the value function. much of this previous work has focused on smaller, simplified During Monte-Carlo search, the roll-out scores are used to or limited components of the game. For example, Amato and train the entire value function online, including the linguis- Shani [2010] use reinforcement learning to dynamically se- tic model. The weights of the neural network are updated by lect between two hard-coded strategies in the game Civiliza- error backpropagation, so as to minimize the mean squared tion IV. Wender and Watson [2008] focus on city building in error between the value function (the network output) and the Civilization IV. Madeira et al. [2006] address only the com- roll-out scores. bat aspect of strategy games, learning high-level strategies, We apply our technique to Civilization II, a notoriously and using hard-coded algorithms for low-level control. Balla large and challenging game.2 It is a cooperative, multi-agent and Fern [2009] apply Monte-Carlo tree search with some decision-making problem with around 1021 joint actions and success, but focus only on a tactical assault planning task. In an enormous state space of over 10188 states.3 To make the contrast to such previous work, our algorithm learns to con- game tractable, we assume that actions of agents are indepen- trol all aspects of the game Civilization II, and significantly dent of each other given the game state. We use the official outperforms the hand-engineered, built-in AI player. game manual as a source of domain knowledge in natural lan- guage form, and demonstrate a significant boost to the perfor- 3 Background mance of our non-linear Monte-Carlo search. In full-length games, our method wins over 78% of games against the built- Our task is to play and win a strategy game against a given in, hand-crafted AI of Civilization II. opponent. Formally, we represent the game as a Markov De- cision Process (MDP) S, A, T, R, where S is the set of pos- 2 Related Work sible states and A is the space of legal actions. Each state s ∈ S represents a complete configuration of the game world, Monte-Carlo search has previously been combined with including attributes such as available resources and the loca- global non-linear VFA. The original TD-Gammon [Tesauro, tions of cities and units. Each action a ∈ A represents a joint 1994] used a multilayer neural network that was trained from assignment of actions to each city and each unit. At each step games of self-play to approximate the value function. Tesauro of the game, the agent executes an action a in state s, which and Galperin [1996] subsequently used TD-Gammon’s neural changes the state to s according to the state transition distri- network to score the roll-outs in a simple Monte-Carlo search bution T (s|s, a). This distribution incorporates both the op- algorithm; however, the neural network was not adjusted on- ponent’s action selection policy, and the game rules. It is not line from these roll-outs. In contrast, our algorithm re-learns known explicitly; however, state transitions can be sampled its value function, online, from each new set of roll-outs – just by invoking the game code as a black-box simulator.