
Learning to win by reading manuals in a Monte-Carlo framework


Citation Branavan, S.R.K., David Silver, and Regina Barzilay. "Learning to win by reading manuals in a Monte-Carlo framework." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, ACL HLT '11, Portland, Oregon, June 19-24, 2011.

As Published http://dl.acm.org/citation.cfm?id=2002507

Publisher Association for Computing Machinery

Version Author's final manuscript

Citable link http://hdl.handle.net/1721.1/73115

Terms of Use Creative Commons Attribution-Noncommercial-Share Alike 3.0

Detailed Terms http://creativecommons.org/licenses/by-nc-sa/3.0/

Learning to Win by Reading Manuals in a Monte-Carlo Framework

S.R.K. Branavan   David Silver*   Regina Barzilay

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
*Department of Computer Science, University College London

{branavan, regina}@csail.mit.edu   [email protected]

Abstract

This paper presents a novel approach for leveraging automatically extracted textual knowledge to improve the performance of control applications such as games. Our ultimate goal is to enrich a stochastic player with high-level guidance expressed in text. Our model jointly learns to identify text that is relevant to a given game state in addition to learning game strategies guided by the selected text. Our method operates in the Monte-Carlo search framework, and learns both text analysis and game strategies based only on environment feedback. We apply our approach to the complex strategy game Civilization II using the official game manual as the text guide. Our results show that a linguistically-informed game-playing agent significantly outperforms its language-unaware counterpart, yielding a 27% absolute improvement and winning over 78% of games when playing against the built-in AI of Civilization II.[1]

1 Introduction

In this paper, we study the task of grounding linguistic analysis in control applications such as computer games. In these applications, an agent attempts to optimize a utility function (e.g., game score) by learning to select situation-appropriate actions. In complex domains, finding a winning strategy is challenging even for humans. Therefore, human players typically rely on manuals and guides that describe promising tactics and provide general advice about the underlying task. Surprisingly, such textual information has never been utilized in control algorithms despite its potential to greatly improve performance.

[Figure 1: An excerpt from the user manual of the game Civilization II: "The natural resources available where a population settles affects its ability to produce food and goods. Build your city on a plains or grassland square with a river running through it if possible."]

Consider for instance the text shown in Figure 1. This is an excerpt from the user manual of the game Civilization II.[2] This text describes game locations where the action "build-city" can be effectively applied. A stochastic player that does not have access to this text would have to gain this knowledge the hard way: it would repeatedly attempt this action in a myriad of states, thereby learning the characterization of promising state-action pairs based on the observed game outcomes. In games with large state spaces, long planning horizons, and high branching factors, this approach can be prohibitively slow and ineffective. An algorithm with access to the text, however, could learn correlations between words in the text and game attributes – e.g., the word "river" and places with rivers in the game – thus leveraging strategies described in text to better select actions.

The key technical challenge in leveraging textual knowledge is to automatically extract relevant information from text and incorporate it effectively into a control algorithm. Approaching this task in a supervised framework, as is common in traditional information extraction, is inherently difficult. Since the game's state space is extremely large, and the states that will be encountered during game play cannot be known a priori, it is impractical to manually annotate the information that would be relevant to those states. Instead, we propose to learn text analysis based on a feedback signal inherent to the control application, such as game score.

[1] The code, data and complete experimental setup for this work are available at http://groups.csail.mit.edu/rbg/code/civ.
[2] http://en.wikipedia.org/wiki/Civilization_II

Our general setup consists of a game in a stochastic environment, where the goal of the player is to maximize a given utility function R(s) at state s. We follow a common formulation that has been the basis of several successful applications of machine learning to games. The player's behavior is determined by an action-value function Q(s, a) that assesses the goodness of an action a in a given state s based on the features of s and a. This function is learned based solely on the utility R(s) collected via simulated game-play in a Monte-Carlo framework.

An obvious way to enrich the model with textual information is to augment the action-value function with word features in addition to state and action features. However, adding all the words in the document is unlikely to help since only a small fraction of the text is relevant for a given state. Moreover, even when the relevant sentence is known, the mapping between raw text and the action-state representation may not be apparent. This representation gap can be bridged by inducing a predicate structure on the sentence – e.g., by identifying words that describe actions, and those that describe state attributes.

In this paper, we propose a method for learning an action-value function augmented with linguistic features, while simultaneously modeling sentence relevance and predicate structure. We employ a multi-layer neural network where the hidden layers represent sentence relevance and predicate parsing decisions. Despite the added complexity, all the parameters of this non-linear model can be effectively learned via Monte-Carlo simulations.

We test our method on the strategy game Civilization II, a notoriously challenging game with an immense action space.[3] As a source of knowledge for guiding our model, we use the official game manual. As a baseline, we employ a similar Monte-Carlo search based player which does not have access to textual information. We demonstrate that the linguistically-informed player significantly outperforms the baseline in terms of number of games won. Moreover, we show that modeling the deeper linguistic structure of sentences further improves performance. In full-length games, our algorithm yields a 27% improvement over a language unaware baseline, and wins over 78% of games against the built-in, hand-crafted AI of Civilization II.[4]

[3] Civilization II was #3 in IGN's 2007 list of top video games of all time (http://top100.ign.com/2007/ign top game 3.html).

[4] In this paper, we focus primarily on the linguistic aspects of our task and algorithm. For a discussion and evaluation of the non-linguistic aspects please see Branavan et al. (2011).

2 Related Work

Our work fits into the broad area of grounded language acquisition where the goal is to learn linguistic analysis from a situated context (Oates, 2001; Siskind, 2001; Yu and Ballard, 2004; Fleischman and Roy, 2005; Mooney, 2008a; Mooney, 2008b; Branavan et al., 2009; Vogel and Jurafsky, 2010). Within this line of work, we are most closely related to reinforcement learning approaches that learn language by proactively interacting with an external environment (Branavan et al., 2009; Branavan et al., 2010; Vogel and Jurafsky, 2010). Like the above models, we use environment feedback (in the form of a utility function) as the main source of supervision. The key difference, however, is in the language interpretation task itself. Previous work has focused on the interpretation of instruction text where input documents specify a set of actions to be executed in the environment. In contrast, game manuals provide high-level advice but do not directly describe the correct actions for every potential game state. Moreover, these documents are long, and use rich vocabularies with complex grammatical constructions. We do not aim to perform a comprehensive interpretation of such documents. Rather, our focus is on language analysis that is sufficiently detailed to help the underlying control task.

The area of language analysis situated in a game domain has been studied in the past (Eisenstein et al., 2009). Their method, however, is different both in terms of the target interpretation task, and the supervision signal it learns from. They aim to learn the rules of a given game, such as which moves are valid, given documents describing the rules. Our goal is more open ended, in that we aim to learn winning game strategies. Furthermore, Eisenstein et al. (2009) rely on a different source of supervision – game traces collected a priori. For complex games, like the one considered in this paper, collecting such game traces is prohibitively expensive. Therefore our approach learns by actively playing the game.

3 Monte-Carlo Framework for Computer Games

Our method operates within the Monte-Carlo search framework (Tesauro and Galperin, 1996), which has been successfully applied to complex computer games such as Go, Poker, Scrabble, multi-player card games, and real-time strategy games, among others (Gelly et al., 2006; Tesauro and Galperin, 1996; Billings et al., 1999; Sheppard, 2002; Schäfer, 2008; Sturtevant, 2008; Balla and Fern, 2009). Since Monte-Carlo search forms the foundation of our approach, we briefly describe it in this section.

Game Representation  The game is defined by a large Markov Decision Process ⟨S, A, T, R⟩. Here S is the set of possible states, A is the space of legal actions, and T(s' | s, a) is a stochastic state transition function where s, s' ∈ S and a ∈ A. Specifically, a state encodes attributes of the game world, such as available resources and city locations. At each step of the game, a player executes an action a which causes the current state s to change to a new state s' according to the transition function T(s' | s, a). While this function is not known a priori, the program encoding the game can be viewed as a black box from which transitions can be sampled. Finally, a given utility function R(s) ∈ ℝ captures the likelihood of winning the game from state s (e.g., an intermediate game score).

  procedure PlayGame()
    Initialize game state to fixed starting state: s_1 ← s_0
    for t = 1 ... T do
      Run N simulated games:
        for i = 1 ... N do
          (a_i, r_i) ← SimulateGame(s_t)
        end
      Compute average observed utility for each action:
        a_t ← arg max_a (1 / N_a) Σ_{i : a_i = a} r_i
      Execute selected action in game: s_{t+1} ← T(s' | s_t, a_t)
    end

  procedure SimulateGame(s_t)
    for u = t ... τ do
      Compute Q-function approximation: Q(s, a) = w · f(s, a)
      Sample action from the action-value function in ε-greedy fashion:
        a_u ∼ uniform(a ∈ A) with probability ε; arg max_a Q(s, a) otherwise
      Execute selected action in game: s_{u+1} ← T(s' | s_u, a_u)
      if game is won or lost, break
    end
    Update parameters w of Q(s, a)
    Return action and observed utility: return a_t, R(s_τ)

Algorithm 1: The general Monte-Carlo algorithm.
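To make the control flow of Algorithm 1 concrete, the sketch below mirrors its two procedures in Python under simplifying assumptions: `env` stands in for the game engine's black-box transition and utility functions, and `qfunc` for the feature-based action-value function described below. All names are illustrative, not part of the released code.

```python
import random
from collections import defaultdict

NUM_ROLLOUTS = 500    # simulated games per real game step
ROLLOUT_DEPTH = 20    # truncated roll-out length (see footnote 5)
EPSILON = 0.1         # exploration rate of the roll-out policy

def play_game(env, qfunc, num_steps):
    """Outer loop of Algorithm 1: at every real step, run roll-outs from the
    current state and execute the action with the best average utility."""
    state = env.initial_state()
    for _ in range(num_steps):
        totals, counts = defaultdict(float), defaultdict(int)
        for _ in range(NUM_ROLLOUTS):
            action, utility = simulate_game(env, qfunc, state)
            totals[action] += utility
            counts[action] += 1
        best = max(counts, key=lambda a: totals[a] / counts[a])
        state = env.sample_transition(state, best)   # T(s' | s_t, a_t)
    return state

def simulate_game(env, qfunc, state):
    """One roll-out: follow the epsilon-greedy policy for a fixed number of
    steps, update Q from the observed utility, and return the first action."""
    first_action, visited = None, []
    for _ in range(ROLLOUT_DEPTH):
        actions = env.legal_actions(state)
        if random.random() < EPSILON:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: qfunc.value(state, a))
        if first_action is None:
            first_action = action
        visited.append((state, action))
        state = env.sample_transition(state, action)
        if env.is_terminal(state):
            break
    utility = env.utility(state)          # R(s_tau), e.g. the game-score ratio
    for s, a in visited:                  # update parameters of Q(s, a)
        qfunc.update(s, a, utility)
    return first_action, utility
```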
Monte-Carlo Search Algorithm  The goal of the Monte-Carlo search algorithm is to dynamically select the best action for the current state s_t. This selection is based on the results of multiple roll-outs which measure the outcome of a sequence of actions in a simulated game – e.g., simulations played against the game's built-in AI. Specifically, starting at state s_t, the algorithm repeatedly selects and executes actions, sampling state transitions from T. On game completion at time τ, we measure the final utility R(s_τ).[5] The actual game action is then selected as the one corresponding to the roll-out with the best final utility. See Algorithm 1 for details.

The success of Monte-Carlo search is based on its ability to make a fast, local estimate of the action quality at each step of the roll-outs. States and actions are evaluated by an action-value function Q(s, a), which is an estimate of the expected outcome of action a in state s. This action-value function is used to guide action selection during the roll-outs. While actions are usually selected to maximize the action-value function, sometimes other actions are also randomly explored in case they are more valuable than predicted by the current estimate of Q(s, a). As the accuracy of Q(s, a) improves, the quality of action selection improves and vice versa, in a cycle of continual improvement (Sutton and Barto, 1998).

[5] In general, roll-outs are run till game completion. However, if simulations are expensive as is the case in our domain, roll-outs can be truncated after a fixed number of steps.

In many games, it is sufficient to maintain a distinct action-value for each unique state and action in a large search tree. However, when the branching factor is large it is usually beneficial to approximate the action-value function, so that the value of many related states and actions can be learned from a reasonably small number of simulations (Silver, 2009). One successful approach is to model the action-value function as a linear combination of state and action attributes (Silver et al., 2008):

  Q(s, a) = w · f(s, a).

Here f(s, a) ∈ ℝ^n is a real-valued feature function, and w is a weight vector. We take a similar approach here, except that our feature function includes latent structure which models language.

The parameters w of Q(s, a) are learned based on feedback from the roll-out simulations. Specifically, the parameters are updated by stochastic gradient descent by comparing the current predicted Q(s, a) against the observed utility at the end of each roll-out. We provide details on parameter estimation in the context of our model in Section 4.2.

The roll-outs themselves are fully guided by the action-value function. At every step of the simulation, actions are selected by an ε-greedy strategy: with probability ε an action is selected uniformly at random; otherwise the action is selected greedily to maximize the current action-value function, arg max_a Q(s, a).
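A linear instantiation of this action-value function, written as a sketch with sparse feature dictionaries, could look as follows; `extract_features` is a hypothetical feature extractor, and the class could serve as the `qfunc` used in the roll-out sketch above.

```python
import random

class LinearQ:
    """Q(s, a) = w . f(s, a), trained by stochastic gradient descent on the
    squared error between the prediction and the observed roll-out utility."""

    def __init__(self, extract_features, learning_rate=0.001):
        self.extract_features = extract_features   # hypothetical f(s, a)
        self.weights = {}                           # sparse weight vector w
        self.alpha = learning_rate

    def value(self, state, action):
        feats = self.extract_features(state, action)
        return sum(self.weights.get(k, 0.0) * v for k, v in feats.items())

    def update(self, state, action, final_utility):
        # Move Q(s, a) toward R(s_tau): the gradient of the squared error
        # 0.5 * [R(s_tau) - Q(s, a)]^2 with respect to w is -(R - Q) f(s, a).
        feats = self.extract_features(state, action)
        error = final_utility - self.value(state, action)
        for k, v in feats.items():
            self.weights[k] = self.weights.get(k, 0.0) + self.alpha * error * v

def epsilon_greedy(qfunc, state, actions, epsilon=0.1):
    """With probability epsilon choose uniformly at random,
    otherwise maximize the current action-value estimate."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: qfunc.value(state, a))
```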
4 Adding Linguistic Knowledge to the Monte-Carlo Framework

In this section we describe how we inform the simulation-based player with information automatically extracted from text – in terms of both model structure and parameter estimation.

4.1 Model Structure

To inform action selection with the advice provided in game manuals, we modify the action-value function Q(s, a) to take into account words of the document in addition to state and action information. Conditioning Q(s, a) on all the words in the document is unlikely to be effective since only a small fraction of the document provides guidance relevant to the current state, while the remainder of the text is likely to be irrelevant. Since this information is not known a priori, we model the decision about a sentence's relevance to the current state as a hidden variable. Moreover, to fully utilize the information presented in a sentence, the model identifies the words that describe actions and those that describe state attributes, discriminating them from the rest of the sentence. As with the relevance decision, we model this labeling using hidden variables.

[Figure 2: The structure of our model. Each rectangle represents a collection of units in a layer (the input layer, the hidden layers encoding sentence relevance and predicate labeling, the deterministic feature layer, and the output layer), and the shaded trapezoids show the connections between layers. A fixed, real-valued feature function x(s, a, d) transforms the game state s, action a, and strategy document d into the input vector x. The first hidden layer contains two disjoint sets of units y and z corresponding to linguistic analyses of the strategy document. These are softmax layers, where only one unit is active at any time. The units of the second hidden layer f(s, a, d, y_i, z_i) are a set of fixed real-valued feature functions on s, a, d and the active units y_i and z_i of y and z respectively.]

As shown in Figure 2, our model is a four layer neural network. The input layer x represents the current state s, candidate action a, and document d. The second layer consists of two disjoint sets of units y and z which encode the sentence-relevance and predicate-labeling decisions respectively. Each of these sets of units operates as a stochastic 1-of-n softmax selection layer (Bridle, 1990) where only a single unit is activated. The activation function for units in this layer is the standard softmax function:

  p(y_i = 1 | x) = e^{u_i · x} / Σ_k e^{u_k · x},

where y_i is the i-th hidden unit of y, and u_i is the weight vector corresponding to y_i. Given this activation function, the second layer effectively models sentence relevance and predicate labeling decisions via log-linear distributions, the details of which are described below.

The third feature layer f of the neural network is deterministically computed given the active units y_i and z_j of the softmax layers, and the values of the input layer. Each unit in this layer corresponds to a fixed feature function f_k(s_t, a_t, d, y_i, z_j) ∈ ℝ. Finally the output layer encodes the action-value function Q(s, a, d), which now also depends on the document d, as a weighted linear combination of the units of the feature layer:

  Q(s_t, a_t, d) = w · f,

where w is the weight vector.
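The following sketch shows one stochastic forward pass through such a network, with sparse dictionaries for the input and feature layers; the feature-layer builder and the representation of candidate sentences and labelings as flat lists of units are illustrative simplifications, not the paper's implementation.

```python
import math
import random

def dot(weights, features):
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

def softmax_sample(scores):
    """Sample the single active unit of a stochastic 1-of-n softmax layer
    given pre-activations u_i . x; returns (index, probability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    r, acc = random.random() * total, 0.0
    for i, e in enumerate(exps):
        acc += e
        if r <= acc:
            return i, e / total
    return len(exps) - 1, exps[-1] / total

def forward_pass(x, sentence_weights, labeling_weights, w, build_feature_layer):
    """One forward pass through the four-layer network.

    x                   -- sparse input features computed from (s, a, d)
    sentence_weights    -- weight vector u_i for each sentence-relevance unit y_i
    labeling_weights    -- weight vector for each predicate-labeling unit z_j
    w                   -- output-layer weight vector
    build_feature_layer -- hypothetical builder for f(s, a, d, y_i, z_j)
    """
    y_scores = [dot(u, x) for u in sentence_weights]
    y_idx, p_y = softmax_sample(y_scores)        # sentence-relevance decision
    z_scores = [dot(v, x) for v in labeling_weights]
    z_idx, p_z = softmax_sample(z_scores)        # predicate-labeling decision
    f = build_feature_layer(x, y_idx, z_idx)     # deterministic feature layer
    q = dot(w, f)                                # Q(s, a, d) = w . f
    return q, (y_idx, p_y), (z_idx, p_z), f
```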
Modeling Sentence Relevance  Given a strategy document d, we wish to identify a sentence y_i that is most relevant to the current game state s_t and action a_t. This relevance decision is modeled as a log-linear distribution over sentences as follows:

  p(y_i | s_t, a_t, d) ∝ e^{u · φ(y_i, s_t, a_t, d)}.

Here φ(y_i, s_t, a_t, d) ∈ ℝ^n is a feature function, and u are the parameters we need to estimate.

Modeling Predicate Structure  Our goal here is to label the words of a sentence as either action-description, state-description or background. Since these word label assignments are likely to be mutually dependent, we model predicate labeling as a sequence prediction task. These dependencies do not necessarily follow the order of words in a sentence, and are best expressed in terms of a syntactic tree. For example, words corresponding to state-description tend to be descendants of action-description words. Therefore, we label words in dependency order – i.e., starting at the root of a given dependency tree, and proceeding to the leaves. This allows a word's label decision to condition on the label of the corresponding dependency tree parent. Given sentence y_i and its dependency parse q_i, we model the distribution over predicate labels e_i as:

  p(e_i | y_i, q_i) = Π_j p(e_j | j, e_{1:j−1}, y_i, q_i),
  p(e_j | j, e_{1:j−1}, y_i, q_i) ∝ e^{v · ψ(e_j, j, e_{1:j−1}, y_i, q_i)}.

Here e_j is the predicate label of the j-th word being labeled, and e_{1:j−1} is the partial predicate labeling constructed so far for sentence y_i.

In the second layer of the neural network, the units z represent a predicate labeling e_i of every sentence y_i ∈ d. However, our intention is to incorporate, into action-value function Q, information from only the most relevant sentence. Thus, in practice, we only perform a predicate labeling of the sentence selected by the relevance component of the model. Given the sentence selected as relevant and its predicate labeling, the output layer of the network can now explicitly learn the correlations between textual information, and game states and actions – for example, between the word "grassland" in Figure 1, and the action of building a city. This allows our method to leverage the automatically extracted textual information to improve game play.
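As a rough illustration of labeling in dependency order, the sketch below greedily assigns each word the highest-scoring label given its parent's label; the feature extractor `psi_features` and the greedy (rather than sampled) decision are assumptions made for brevity.

```python
LABELS = ["action-description", "state-description", "background"]

def tree_depth(j, parents):
    """Depth of word j in the dependency tree (root has parent index -1)."""
    depth = 0
    while parents[j] >= 0:
        j = parents[j]
        depth += 1
    return depth

def label_in_dependency_order(words, parents, v_weights, psi_features):
    """Greedy predicate labeling: parents are labeled before their children,
    so each word's decision can condition on its parent's label.

    words        -- tokens of the sentence selected as relevant
    parents      -- parents[j] is the dependency parent of word j, -1 for root
    v_weights    -- dict mapping (label, feature) -> weight (the v parameters)
    psi_features -- hypothetical extractor psi(label, j, parent_label, words)
    """
    labels = [None] * len(words)
    for j in sorted(range(len(words)), key=lambda k: tree_depth(k, parents)):
        parent_label = labels[parents[j]] if parents[j] >= 0 else None
        def score(label):
            feats = psi_features(label, j, parent_label, words)
            return sum(v_weights.get((label, k), 0.0) * val
                       for k, val in feats.items())
        # p(e_j | ...) is log-linear in v . psi; take the most probable label.
        labels[j] = max(LABELS, key=score)
    return labels
```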

4.2 Parameter Estimation

Learning in our method is performed in an online fashion: at each game state s_t, the algorithm performs a simulated game roll-out, observes the outcome of the game, and updates the parameters u, v and w of the action-value function Q(s_t, a_t, d). These three steps are repeated a fixed number of times at each actual game state. The information from these roll-outs is used to select the actual game action. The algorithm re-learns Q(s_t, a_t, d) for every new game state s_t. This specializes the action-value function to the subgame starting from s_t.

Since our model is a non-linear approximation of the underlying action-value function of the game, we learn model parameters by applying non-linear regression to the observed final utilities from the simulated roll-outs. Specifically, we adjust the parameters by stochastic gradient descent, to minimize the mean-squared error between the action-value Q(s, a) and the final utility R(s_τ) for each observed game state s and action a. The resulting update to model parameters θ is of the form:

  Δθ = −(α/2) ∇_θ [R(s_τ) − Q(s, a)]²
     = α [R(s_τ) − Q(s, a)] ∇_θ Q(s, a; θ),

where α is a learning rate parameter. This minimization is performed via standard error backpropagation (Bryson and Ho, 1969; Rumelhart et al., 1986), which results in the following online updates for the output layer parameters w:

  w ← w + α_w [Q − R(s_τ)] f(s, a, d, y_i, z_j),

where α_w is the learning rate, and Q = Q(s, a, d). The corresponding updates for the sentence relevance and predicate labeling parameters u and v are:

  u_i ← u_i + α_u [Q − R(s_τ)] Q x [1 − p(y_i | ·)],
  v_i ← v_i + α_v [Q − R(s_τ)] Q x [1 − p(z_i | ·)].
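A sketch of these online updates, using sparse dict vectors and taking the error term as R(s_τ) − Q so that each step follows the gradient expression for Δθ above; the parameter container and the names are illustrative only.

```python
def online_update(params, x, f, q, final_utility, p_y, p_z,
                  alpha_w=0.001, alpha_u=0.001, alpha_v=0.001):
    """Apply the online updates for w and for the weight vectors u_i, v_i of
    the active sentence-relevance and predicate-labeling units.

    params        -- {'w': {...}, 'u_active': {...}, 'v_active': {...}}
    x             -- sparse input features (the vector x above)
    f             -- sparse feature-layer values f(s, a, d, y_i, z_j)
    q             -- Q(s, a, d) computed in the forward pass
    final_utility -- R(s_tau) observed at the end of the roll-out
    p_y, p_z      -- probabilities of the sampled relevance / labeling units
    """
    error = final_utility - q                       # R(s_tau) - Q

    # Output layer: step w along error * f so that Q moves toward R(s_tau).
    for k, v in f.items():
        params['w'][k] = params['w'].get(k, 0.0) + alpha_w * error * v

    # Hidden layers: the step for the active unit's weights carries the
    # extra factor Q * [1 - p(unit)] coming from the softmax activation.
    scale_u = alpha_u * error * q * (1.0 - p_y)
    scale_v = alpha_v * error * q * (1.0 - p_z)
    for k, v in x.items():
        params['u_active'][k] = params['u_active'].get(k, 0.0) + scale_u * v
        params['v_active'][k] = params['v_active'].get(k, 0.0) + scale_v * v
```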

5 Applying the Model

We apply our model to playing the turn-based strategy game, Civilization II. We use the official manual of the game[6] as the source of textual strategy advice for the language aware algorithms.

[6] www.civfanatics.com/content/civ2/reference/Civ2manual.zip

Civilization II is a multi-player game set on a grid-based map of the world. Each grid location represents a tile of either land or sea, and has various resources and terrain attributes. For example, land tiles can have hills with rivers running through them. In addition to multiple cities, each player controls various units – e.g., settlers and explorers. Games are won by gaining control of the entire world map. In our experiments, we consider a two-player game of Civilization II on a grid of 1000 squares, where we play against the built-in AI player.

Game States and Actions  We define the game state of Civilization II to be the map of the world, the attributes of each map tile, and the attributes of each player's cities and units. Some examples of the attributes of states and actions are shown in Figure 3. The space of possible actions for a given city or unit is known given the current game state. The actions of a player's cities and units combine to form the action space of that player. In our experiments, on average a player controls approximately 18 units, and each unit can take one of 15 actions. This results in a very large action space for the game – i.e., 10^21. To effectively deal with this large action space, we assume that given the state, the actions of a single unit are independent of the actions of all other units of the same player.

Utility Function  The Monte-Carlo algorithm uses the utility function to evaluate the outcomes of simulated game roll-outs. In the typical application of the algorithm, the final game outcome is used as the utility function (Tesauro and Galperin, 1996). Given the complexity of Civilization II, running simulation roll-outs until game completion is impractical. The game, however, provides each player with a game score, which is a noisy indication of how well they are currently playing. Since we are playing a two-player game, we use the ratio of the game score of the two players as our utility function.

Features  The sentence relevance features φ and the action-value function features f consider the attributes of the game state and action, and the words of the sentence. Some of these features compute text overlap between the words of the sentence, and text labels present in the game. The feature function ψ used for predicate labeling on the other hand operates only on a given sentence and its dependency parse. It computes features which are the Cartesian product of the candidate predicate label with word attributes such as type, part-of-speech tag, and dependency parse information. Overall, f, φ and ψ compute approximately 306,800, 158,500, and 7,900 features respectively. Figure 3 shows some examples of these features.

[Figure 3: Example attributes of the game (box above), and features computed using the game manual and these attributes (box below).

  Map tile attributes: terrain type (e.g., grassland, mountain), tile resources (e.g., wheat, coal, wildlife).
  City attributes: city population, amount of food produced.
  Unit attributes: unit type (e.g., worker, explorer, archer), whether the unit is in a city.

  Example features:
  1 if action=build-city ∧ tile-has-river=true ∧ action-words={build, city} ∧ state-words={river, hill}; 0 otherwise.
  1 if action=build-city ∧ tile-has-river=true ∧ words={build, city, river}; 0 otherwise.
  1 if label=action ∧ word-type='build' ∧ parent-label=action; 0 otherwise.]
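To illustrate the flavor of the indicator features in Figure 3, here is a sketch of a text-overlap feature generator; the attribute names and the exact conjunctions are hypothetical examples in the spirit of the figure, not the feature set used in the experiments.

```python
def text_overlap_features(action, state_attrs, sentence_words,
                          action_words, state_words):
    """Sparse indicator features conjoining game attributes with words of the
    sentence selected as relevant (cf. Figure 3).

    action         -- e.g. "build-city"
    state_attrs    -- e.g. {"tile-has-river": True, "terrain": "grassland"}
    sentence_words -- words of the selected sentence
    action_words   -- words the model labeled as action-description
    state_words    -- words the model labeled as state-description
    """
    feats = {}
    words = {w.lower() for w in sentence_words}
    # Action conjoined with each sentence word (relevance-style feature).
    for w in words:
        feats[("action", action, "word", w)] = 1.0
    # Action conjoined with a state attribute and the predicate-labeled words
    # (action-value-style feature, as in the first example in Figure 3).
    for attr, value in state_attrs.items():
        if value:
            feats[("action", action, attr,
                   "action-words", frozenset(action_words),
                   "state-words", frozenset(state_words))] = 1.0
    return feats

# Example with hypothetical values:
# text_overlap_features("build-city", {"tile-has-river": True},
#                       ["build", "city", "river"], ["build", "city"], ["river"])
```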
6 Experimental Setup

Datasets  We use the official game manual for Civilization II as our strategy guide. This manual uses a large vocabulary of 3638 words, and is composed of 2083 sentences, each on average 16.9 words long.

Experimental Framework  To apply our method to the Civilization II game, we use the game's open source implementation.[7] We instrument the game to allow our method to programmatically measure the current state of the game and to execute game actions. The Stanford parser (de Marneffe et al., 2006) was used to generate the dependency parse information for sentences in the game manual.

[7] http://freeciv.wikia.com. Game version 2.2.

Across all experiments, we start the game at the same initial state and run it for 100 steps. At each step, we perform 500 Monte-Carlo roll-outs. Each roll-out is run for 20 simulated game steps before halting the simulation and evaluating the outcome. For our method, and for each of the baselines, we run 200 independent games in the above manner, with evaluations averaged across the 200 runs. We use the same experimental settings across all methods, and all model parameters are initialized to zero.

The test environment consisted of typical PCs with single Intel Core i7 CPUs (4 hyper-threaded cores each), with the algorithms executing 8 simulation roll-outs in parallel. In this setup, a single game of 100 steps runs in approximately 1.5 hours.

Evaluation Metrics  We wish to evaluate two aspects of our method: how well it leverages textual information to improve game play, and the accuracy of the linguistic analysis it produces. We evaluate the first aspect by comparing our method against various baselines in terms of the percentage of games won against the built-in AI of Freeciv. This AI is a fixed algorithm designed using extensive knowledge of the game, with the intention of challenging human players. As such, it provides a good open-reference baseline. Since full games can last for multiple days, we compute the percentage of games won within the first 100 game steps as our primary evaluation. To confirm that performance under this evaluation is meaningful, we also compute the percentage of full games won over 50 independent runs, where each game is run to completion.

7 Results

  Method               % Win    % Loss   Std. Err.
  Random                 0       100        —
  Built-in AI            0         0        —
  Game only             17.3       5.3     ± 2.7
  Sentence relevance    46.7       2.8     ± 3.5
  Full model            53.7       5.9     ± 3.5
  Random text           40.3       4.3     ± 3.4
  Latent variable       26.1       3.7     ± 3.1

Table 1: Win rate of our method and several baselines within the first 100 game steps, while playing against the built-in game AI. Games that are neither won nor lost are still ongoing. Our model's win rate is statistically significant against all baselines except sentence relevance. All results are averaged across 200 independent game runs. The standard errors shown are for percentage wins.

  Method               % Wins   Standard Error
  Game only             45.7       ± 7.0
  Latent variable       62.2       ± 6.9
  Full model            78.8       ± 5.8

Table 2: Win rate of our method and two baselines on 50 full length games played against the built-in AI.

Game performance  As shown in Table 1, our language aware Monte-Carlo algorithm substantially outperforms several baselines – on average winning 53.7% of all games within the first 100 steps. The dismal performance, on the other hand, of both the random baseline and the game's own built-in AI (playing against itself) is an indicator of the difficulty of the task. This evaluation is an underestimate since it assumes that any game not won within the first 100 steps is a loss. As shown in Table 2, our method wins over 78% of full length games.

To characterize the contribution of the language components to our model's performance, we compare our method against two ablative baselines. The first of these, game-only, does not take advantage of any textual information. It attempts to model the action value function Q(s, a) only in terms of the attributes of the game state and action.
The performance of this baseline – a win rate of 17.3% – effectively confirms the benefit of automatically extracted textual information in the context of our task. The second ablative baseline, sentence-relevance, is identical to our model, but lacks the predicate labeling component. This method wins 46.7% of games, showing that while identifying the text relevant to the current game state is essential, a deeper structural analysis of the extracted text provides substantial benefits.

One possible explanation for the improved performance of our method is that the non-linear approximation simply models game characteristics better, rather than modeling textual information. We directly test this possibility with two additional baselines. The first, random-text, is identical to our full model, but is given a document containing random text. We generate this text by randomly permuting the word locations of the actual game manual, thereby maintaining the document's overall statistical properties. The second baseline, latent variable, extends the linear action-value function Q(s, a) of the game-only baseline with a set of latent variables – i.e., it is a four layer neural network, where the second layer's units are activated only based on game information. As shown in Table 1, both of these baselines significantly underperform with respect to our model, confirming the benefit of automatically extracted textual information in the context of this task.

[Figure 4: Examples of our method's sentence relevance and predicate labeling decisions. The box above shows two sentences (identified by check marks) which were predicted as relevant, and two which were not. The box below shows the predicted predicate structure of three sentences, with "S" indicating state description, "A" action description, and background words unmarked. Mistakes are identified with crosses.]

[Figure 5: Accuracy of our method's sentence relevance predictions, averaged over 100 independent runs.]

Sentence Relevance  Figure 4 shows examples of the sentence relevance decisions produced by our method. To evaluate the accuracy of these decisions, we ideally require a ground-truth relevance annotation of the game's user manual. This, however, is impractical since the relevance decision is dependent on the game context, and is hence specific to each time step of each game instance. Therefore, for the purposes of this evaluation, we modify the game manual by adding to it sentences randomly selected from the Wall Street Journal corpus (Marcus et al., 1993) – sentences that are highly unlikely to be relevant to game play. We then evaluate the accuracy with which sentences from the original manual are picked as relevant.

In this evaluation, our method achieves an average accuracy of 71.8%. Given that our model only has to differentiate between the game manual text and the Wall Street Journal, this number may seem disappointing. Furthermore, as can be seen from Figure 5, the sentence relevance accuracy varies widely as the game progresses, with a high average of 94.2% during the initial 25 game steps.

In reality, this pattern of high initial accuracy followed by a lower average is not entirely surprising: the official game manual for Civilization II is written for first time players. As such, it focuses on the initial portion of the game, providing little strategy advice relevant to subsequent game play.[8] If this is the reason for the observed sentence relevance trend, we would also expect the final layer of the neural network to emphasize game features over text features after the first 25 steps of the game. This is indeed the case, as can be seen from Figure 6.

[8] This is reminiscent of opening books for games like Chess or Go, which aim to guide the player to a playable middle game.

To further test this hypothesis, we perform an experiment where the first 50 steps of the game are played using our full model, and the subsequent 50 steps are played without using any textual information. This hybrid method performs as well as our full model, achieving a 53.3% win rate, confirming that textual information is most useful during the initial phase of the game. This shows that our method is able to accurately identify relevant sentences when the information they contain is most pertinent to game play.

  Method                   S/A/B    S/A
  Random labeling          33.3%   50.0%
  Model, first 100 steps   45.1%   78.9%
  Model, first 25 steps    48.0%   92.7%

Table 3: Predicate labeling accuracy of our method and a random baseline. Column "S/A/B" shows performance on the three-way labeling of words as state, action or background, while column "S/A" shows accuracy on the task of differentiating between state and action words.

[Figure 6: Difference between the norms of the text features and game features of the output layer of the neural network. Beyond the initial 25 steps of the game, our method relies increasingly on game features.]

Predicate Labeling  Figure 4 shows examples of the predicate structure output of our model. We evaluate the accuracy of this labeling by comparing it against a gold-standard annotation of the game manual. Table 3 shows the performance of our method in terms of how accurately it labels words as state, action or background, and also how accurately it differentiates between state and action words. In addition to showing a performance improvement over the random baseline, these results display two clear trends: first, under both evaluations, labeling accuracy is higher during the initial stages of the game. This is to be expected since the model relies heavily on textual features only during the beginning of the game (see Figure 6). Second, the model clearly performs better in differentiating between state and action words, rather than in the three-way labeling.

To verify the usefulness of our method's predicate labeling, we perform a final set of experiments where predicate labels are selected uniformly at random within our full model. This random labeling results in a win rate of 44% – a performance similar to the sentence relevance model which uses no predicate information. This confirms that our method is able to identify a predicate structure which, while noisy, provides information relevant to game play. Figure 7 shows examples of how this textual information is grounded in the game, by way of the associations learned between words and game attributes in the final layer of the full model.

[Figure 7: Examples of word to game attribute associations that are learned via the feature weights of our model – e.g., the game attribute "state: grassland" with the words "city" and "build", the action "settlers_build_city" with the word "city", and the action "set_research" with the word "discovery".]

8 Conclusions

In this paper we presented a novel approach for improving the performance of control applications by automatically leveraging high-level guidance expressed in text documents. Our model, which operates in the Monte-Carlo framework, jointly learns to identify text relevant to a given game state in addition to learning game strategies guided by the selected text. We show that this approach substantially outperforms language-unaware alternatives while learning only from environment feedback.

Acknowledgments

The authors acknowledge the support of the NSF (CAREER grant IIS-0448168, grant IIS-0835652), the DARPA Machine Reading Program (FA8750-09-C-0172) and the Microsoft Research New Faculty Fellowship. Thanks to Michael Collins, Tommi Jaakkola, Leslie Kaelbling, Nate Kushman, Sasha Rush, Luke Zettlemoyer, the MIT NLP group, and the ACL reviewers for their suggestions and comments. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect the views of the funding organizations.

References
R. Balla and A. Fern. 2009. UCT for tactical assault planning in real-time strategy games. In 21st International Joint Conference on Artificial Intelligence.

Darse Billings, Lourdes Peña Castillo, Jonathan Schaeffer, and Duane Szafron. 1999. Using probabilistic knowledge and simulation to play poker. In 16th National Conference on Artificial Intelligence, pages 697–703.

S.R.K. Branavan, Harr Chen, Luke Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of ACL, pages 82–90.

S.R.K. Branavan, Luke Zettlemoyer, and Regina Barzilay. 2010. Reading between the lines: Learning to map high-level instructions to commands. In Proceedings of ACL, pages 1268–1277.

S.R.K. Branavan, David Silver, and Regina Barzilay. 2011. Non-linear Monte-Carlo search in Civilization II. In Proceedings of IJCAI.

John S. Bridle. 1990. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in NIPS, pages 211–217.

Arthur E. Bryson and Yu-Chi Ho. 1969. Applied Optimal Control: Optimization, Estimation, and Control. Blaisdell Publishing Company.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC 2006.

Jacob Eisenstein, James Clarke, Dan Goldwasser, and Dan Roth. 2009. Reading to learn: Constructing features from semantic abstracts. In Proceedings of EMNLP, pages 958–967.

Michael Fleischman and Deb Roy. 2005. Intentional context in situated natural language learning. In Proceedings of CoNLL, pages 104–111.

S. Gelly, Y. Wang, R. Munos, and O. Teytaud. 2006. Modification of UCT with patterns in Monte-Carlo Go. Technical Report 6062, INRIA.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Raymond J. Mooney. 2008a. Learning language from its perceptual context. In Proceedings of ECML/PKDD.

Raymond J. Mooney. 2008b. Learning to connect language and perception. In Proceedings of AAAI, pages 1598–1601.

James Timothy Oates. 2001. Grounding Knowledge in Sensors: Unsupervised Learning for Language and Planning. Ph.D. thesis, University of Massachusetts Amherst.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning representations by back-propagating errors. Nature, 323:533–536.

J. Schäfer. 2008. The UCT algorithm applied to games with imperfect information. Diploma thesis, Otto-von-Guericke-Universität Magdeburg.

B. Sheppard. 2002. World-championship-caliber Scrabble. Artificial Intelligence, 134(1-2):241–275.

D. Silver, R. Sutton, and M. Müller. 2008. Sample-based learning and search with permanent and transient memories. In 25th International Conference on Machine Learning, pages 968–975.

D. Silver. 2009. Reinforcement Learning and Simulation-Based Search in the Game of Go. Ph.D. thesis, University of Alberta.

Jeffrey Mark Siskind. 2001. Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. Journal of Artificial Intelligence Research, 15:31–90.

N. Sturtevant. 2008. An analysis of UCT in multi-player games. In 6th International Conference on Computers and Games, pages 37–49.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. The MIT Press.

G. Tesauro and G. Galperin. 1996. On-line policy improvement using Monte-Carlo search. In Advances in Neural Information Processing 9, pages 1068–1074.

Adam Vogel and Daniel Jurafsky. 2010. Learning to follow navigational directions. In Proceedings of ACL, pages 806–814.

Chen Yu and Dana H. Ballard. 2004. On the integration of grounding language and learning objects. In Proceedings of AAAI, pages 488–493.