Language Understanding for Text-Based Games Using Deep Reinforcement Learning

Karthik Narasimhan* (CSAIL, MIT), Tejas Kulkarni* (CSAIL, BCS, MIT), Regina Barzilay (CSAIL, MIT)
* Both authors contributed equally to this work.

Abstract

In this paper, we consider the task of learning control policies for text-based games. In these games, all interactions in the virtual world are through text and the underlying state is not observed. The resulting language barrier makes such environments challenging for automatic game players. We employ a deep reinforcement learning framework to jointly learn state representations and action policies using game rewards as feedback. This framework enables us to map text descriptions into vector representations that capture the semantics of the game states. We evaluate our approach on two game worlds, comparing against baselines using bag-of-words and bag-of-bigrams for state representations. Our algorithm outperforms the baselines on both worlds, demonstrating the importance of learning expressive representations. (Code is available at http://people.csail.mit.edu/karthikn/mud-play.)

State 1: The old bridge
You are standing very close to the bridge's eastern foundation. If you go east you will be back on solid ground ... The bridge sways in the wind.

Command: Go east

State 2: Ruined gatehouse
The old gatehouse is near collapse. Part of its northern wall has already fallen down ... East of the gatehouse leads out to a small open area surrounded by the remains of the castle. There is also a standing archway offering passage to a path along the old southern inner wall.
Exits: Standing archway, castle corner, Bridge over the abyss

Figure 1: Sample gameplay from a Fantasy World. The player, with the quest of finding a secret tomb, is currently located on an old bridge. She then chooses an action to go east that brings her to a ruined gatehouse (State 2).

1 Introduction

In this paper, we address the task of learning control policies for text-based strategy games. These games, predecessors to modern graphical ones, still enjoy a large following worldwide (see http://mudstats.com/). They often involve complex worlds with rich interactions and elaborate textual descriptions of the underlying states (see Figure 1). Players read descriptions of the current world state and respond with natural language commands to take actions. Since the underlying state is not directly observable, the player has to understand the text in order to act, making it challenging for existing AI programs to play these games (DePristo and Zubek, 2001).

In designing an autonomous game player, we have considerable latitude when selecting an adequate state representation to use. The simplest method is to use a bag-of-words representation derived from the text description. However, this scheme disregards the ordering of words and the finer nuances of meaning that evolve from composing words into sentences and paragraphs. For instance, in State 2 in Figure 1, the agent has to understand that going east will lead it to the castle whereas moving south will take it to the standing archway. An alternative approach is to convert text descriptions to pre-specified representations using annotated training data, as commonly done in language grounding tasks (Matuszek et al., 2013; Kushman et al., 2014).
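To make the bag-of-words limitation above concrete, here is a minimal sketch (not from the paper; the two descriptions are invented for illustration) showing that two state descriptions which send the player in opposite directions collapse to identical bag-of-words features:

```python
# Two invented room descriptions with opposite meanings: in the first,
# the castle lies east and the archway south; in the second, the reverse.
from collections import Counter

desc_a = "east of the gatehouse leads to the castle and south leads to the archway"
desc_b = "south of the gatehouse leads to the castle and east leads to the archway"

# A bag-of-words representation keeps only word counts, so both
# descriptions map to exactly the same feature vector.
bow_a, bow_b = Counter(desc_a.split()), Counter(desc_b.split())
print(bow_a == bow_b)  # True: the representation cannot tell them apart
```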
In contrast, our goal is to learn useful representations in conjunction with control policies. We adopt a reinforcement learning framework and formulate game sequences as Markov Decision Processes. An agent playing the game aims to maximize rewards that it obtains from the game engine upon the occurrence of certain events. The agent learns a policy in the form of an action-value function Q(s, a) which denotes the long-term merit of an action a in state s.

The action-value function is parametrized using a deep recurrent neural network, trained using the game feedback. The network contains two modules. The first one converts textual descriptions into vector representations that act as proxies for states. This component is implemented using Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997). The second module of the network scores the actions given the vector representation computed by the first.

We evaluate our model using two Multi-User Dungeon (MUD) games (Curtis, 1992; Amir and Doyle, 2002). The first game is designed to provide a controlled setup for the task, while the second is a publicly available one and contains human-generated text descriptions with significant language variability. We compare our algorithm against baselines of a random player and models that use bag-of-words or bag-of-bigrams representations for a state. We demonstrate that our model, LSTM-DQN, significantly outperforms the baselines in terms of number of completed quests and accumulated rewards. For instance, on a fantasy MUD game, our model learns to complete 96% of the quests, while the bag-of-words model and a random baseline solve only 82% and 5% of the quests, respectively. Moreover, we show that the acquired representation can be reused across games, speeding up learning and leading to faster convergence of Q-values.

2 Related Work

Learning control policies from text is gaining increasing interest in the NLP community. Example applications include interpreting help documentation for software (Branavan et al., 2010), navigating with directions (Vogel and Jurafsky, 2010; Kollar et al., 2010; Artzi and Zettlemoyer, 2013; Matuszek et al., 2013; Andreas and Klein, 2015) and playing computer games (Eisenstein et al., 2009; Branavan et al., 2011a).

Games provide a rich domain for grounded language analysis. Prior work has assumed perfect knowledge of the underlying state of the game to learn policies. Gorniak and Roy (2005) developed a game character that can be controlled by spoken instructions adaptable to the game situation. The grounding of commands to actions is learned from a transcript manually annotated with actions and state attributes. Eisenstein et al. (2009) learn game rules by analyzing a collection of game-related documents and precompiled traces of the game. In contrast to the above work, our model combines text interpretation and strategy learning in a single framework. As a result, textual analysis is guided by the received control feedback, and the learned strategy directly builds on the text interpretation.

Our work closely relates to an automatic game player that utilizes text manuals to learn strategies for Civilization (Branavan et al., 2011a). Similar to our approach, text analysis and control strategies are learned jointly using feedback provided by the game simulation. In their setup, states are fully observable, and the model learns a strategy by combining state/action features and features extracted from text. However, in our application, the state representation is not provided, but has to be inferred from a textual description. Therefore, it is not sufficient to extract features from text to supplement a simulation-based player.

Another related line of work consists of automatic video game players that infer state representations directly from raw pixels (Koutník et al., 2013; Mnih et al., 2015). For instance, Mnih et al. (2015) learn control strategies using convolutional neural networks, trained with a variant of Q-learning (Watkins and Dayan, 1992). While both approaches use deep reinforcement learning for training, our work has important differences. In order to handle the sequential nature of text, we use Long Short-Term Memory networks to automatically learn useful representations for arbitrary text descriptions. Additionally, we show that decomposing the network into a representation layer and an action selector is useful for transferring the learnt representations to new game scenarios.
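As a rough illustration of this two-module decomposition (a representation generator followed by an action scorer, as in Figure 2 of Section 3), the sketch below assembles an LSTM-based representation layer and a two-headed scorer. The use of PyTorch, the embedding layer, and all layer sizes are illustrative assumptions, not the authors' implementation:

```python
# A sketch of the two-module decomposition: a representation generator
# (phi_R) that turns a textual state description into a vector v_s, and
# an action scorer (phi_A) that turns v_s into scores Q(s, a) and Q(s, o).
import torch
import torch.nn as nn


class RepresentationGenerator(nn.Module):
    """phi_R: word ids -> state vector v_s (LSTM outputs, mean-pooled)."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, word_ids):              # word_ids: (batch, seq_len)
        outputs, _ = self.lstm(self.embed(word_ids))
        return outputs.mean(dim=1)            # mean pooling over time -> v_s


class ActionScorer(nn.Module):
    """phi_A: v_s -> scores for all actions and all argument objects."""

    def __init__(self, hidden_dim, num_actions, num_objects):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.action_head = nn.Linear(hidden_dim, num_actions)   # Q(s, a)
        self.object_head = nn.Linear(hidden_dim, num_objects)   # Q(s, o)

    def forward(self, v_s):
        h = self.shared(v_s)
        return self.action_head(h), self.object_head(h)


class LSTMDQN(nn.Module):
    """End-to-end scorer: text description -> (action scores, object scores)."""

    def __init__(self, vocab_size, num_actions, num_objects, hidden_dim=100):
        super().__init__()
        self.phi_r = RepresentationGenerator(vocab_size, hidden_dim, hidden_dim)
        self.phi_a = ActionScorer(hidden_dim, num_actions, num_objects)

    def forward(self, word_ids):
        return self.phi_a(self.phi_r(word_ids))
```

Keeping φ_R and φ_A as separate modules mirrors the transfer setting mentioned above: the representation layer can be carried over to a new game while the action scorer is reinitialized.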
3 Background

Game Representation. We represent a game by the tuple ⟨H, A, T, R, Ψ⟩, where H is the set of all possible game states, A = {(a, o)} is the set of all commands (action-object pairs), T(h′ | h, a, o) is the stochastic transition function between states, and R(h, a, o) is the reward function. The game state h ∈ H is hidden from the player, who only receives a varying textual description, produced by a stochastic function Ψ : H → S. Specifically, the underlying state h in the game engine keeps track of attributes such as the player's location, her health points, time of day, etc. The function Ψ (also part of the game framework) then converts this state into a textual description of the location the player is at or a message indicating low health. We do not assume access to either H or Ψ for our agent during both training and testing phases of our experiments. We denote the space of all possible text descriptions s to be S. Rewards are generated using R and are only given to the player upon completion of in-game quests.

[Figure 2 diagram: input words w1, w2, ..., wn → LSTM → mean pooling → state vector vs → Linear → ReLU → two linear heads producing Q(s, a) and Q(s, o).]

Figure 2: Architecture of LSTM-DQN: The Representation Generator (φ_R) (bottom) takes as input a stream of words observed in state s and produces a vector representation vs, which is fed into the action scorer (φ_A) (top) to produce scores for all actions and argument objects.

Q-Learning. Reinforcement Learning is a commonly used framework for learning control policies in game environments (Silver et al., 2007; Amato and Shani, 2010; Branavan et al., 2011b; Szita, 2012). In Q-Learning (Watkins and Dayan, 1992), the agent learns an action-value function Q(s, a) that estimates the expected long-term reward of taking action a in state s. In practice, however, the state space is far too large to maintain Q-values explicitly for all possible state-action pairs. One solution to this problem is to approximate Q(s, a) using a parametrized function Q(s, a; θ), which can generalize over states and actions by considering higher-level attributes.
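To sketch how such a parametrized function can be fit from game rewards, the snippet below performs one temporal-difference update for a single transition, reusing the hypothetical LSTMDQN module from the previous sketch. The discount factor, the averaging of action and object scores into one command value, and the absence of experience replay are simplifying assumptions, not details stated in this section:

```python
# One temporal-difference update for a single transition (s, (a, o), r, s'),
# where s_ids / next_s_ids are word-id tensors of shape (1, seq_len).
import torch


def q_update(model, optimizer, s_ids, action, obj, reward, next_s_ids, gamma=0.5):
    # Bellman-style target: r + gamma * (best achievable score in s'),
    # computed without tracking gradients through the target computation.
    with torch.no_grad():
        next_q_a, next_q_o = model(next_s_ids)
        target = reward + gamma * (next_q_a.max() + next_q_o.max()) / 2.0

    # Current estimate for the chosen command (a, o) in state s; averaging
    # the action and object heads is an illustrative assumption.
    q_a, q_o = model(s_ids)
    q_sa = (q_a[0, action] + q_o[0, obj]) / 2.0

    loss = (target - q_sa) ** 2        # squared temporal-difference error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full agent, such updates would typically sit inside an ε-greedy game-playing loop, with an optimizer such as torch.optim.SGD or Adam over model.parameters().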
