
Language Understanding for Text-based Games using Deep Reinforcement Learning

Karthik Narasimhan∗ (CSAIL, MIT)    Tejas Kulkarni∗ (CSAIL, BCS, MIT)    Regina Barzilay (CSAIL, MIT)
[email protected]  [email protected]  [email protected]

Abstract

In this paper, we consider the task of learning control policies for text-based games. In these games, all interactions in the virtual world are through text and the underlying state is not observed. The resulting language barrier makes such environments challenging for automatic game players. We employ a deep reinforcement learning framework to jointly learn state representations and action policies, using game rewards as feedback. This framework enables us to map text descriptions into vector representations that capture the semantics of the game states. We evaluate our approach on two game worlds, comparing against baselines using bag-of-words and bag-of-bigrams for state representations. Our algorithm outperforms the baselines on both worlds, demonstrating the importance of learning expressive representations.1

State 1: The old bridge
You are standing very close to the bridge's eastern foundation. If you go east you will be back on solid ground ... The bridge sways in the wind.

Command: Go east

State 2: Ruined gatehouse
The old gatehouse is near collapse. Part of its northern wall has already fallen down ... East of the gatehouse leads out to a small open area surrounded by the remains of the castle. There is also a standing archway offering passage to a path along the old southern inner wall.
Exits: Standing archway, castle corner, Bridge over the abyss

Figure 1: Sample gameplay from a Fantasy World. The player, with the goal of finding a secret tomb, is currently located on an old bridge. She then chooses an action to go east that brings her to a ruined gatehouse (State 2).

1 Introduction

In this paper, we address the task of learning control policies for text-based strategy games. These games, predecessors to modern graphical ones, still enjoy a large following worldwide.2 They often involve complex worlds with rich interactions and elaborate textual descriptions of the underlying states (see Figure 1). Players read descriptions of the current world state and respond with natural language commands to take actions. Since the underlying state is not directly observable, the player has to understand the text in order to act, making it challenging for existing AI programs to play these games (DePristo and Zubek, 2001).

In designing an autonomous game player, we have considerable latitude when selecting an adequate state representation to use. The simplest method is to use a bag-of-words representation derived from the text description. However, this scheme disregards the ordering of words and the finer nuances of meaning that evolve from composing words into sentences and paragraphs. For instance, in State 2 in Figure 1, the agent has to understand that going east will lead it to the castle whereas moving south will take it to the standing archway. An alternative approach is to convert text descriptions to pre-specified representations using annotated training data, as is commonly done in language grounding tasks (Matuszek et al., 2013; Kushman et al., 2014).

In contrast, our goal is to learn useful representations in conjunction with control policies. We adopt a reinforcement learning framework and formulate game sequences as Markov Decision Processes. An agent playing the game aims to maximize rewards that it obtains from the game engine upon the occurrence of certain events. The agent learns a policy in the form of an action-value function Q(s, a) which denotes the long-term merit of an action a in state s.

The action-value function is parametrized using a deep recurrent neural network, trained using the game feedback. The network contains two modules. The first one converts textual descriptions into vector representations that act as proxies for states. This component is implemented using Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997). The second module of the network scores the actions given the vector representation computed by the first.

We evaluate our model using two Multi-User Dungeon (MUD) games (Curtis, 1992; Amir and Doyle, 2002). The first game is designed to provide a controlled setup for the task, while the second is publicly available and contains human-generated text descriptions with significant language variability. We compare our algorithm against baselines of a random player and models that use bag-of-words or bag-of-bigrams representations for a state. We demonstrate that our model, LSTM-DQN, significantly outperforms the baselines in terms of the number of completed quests and accumulated rewards. For instance, on a fantasy MUD game, our model learns to complete 96% of the quests, while the bag-of-words model and a random baseline solve only 82% and 5% of the quests, respectively. Moreover, we show that the acquired representation can be reused across games, speeding up learning and leading to faster convergence of Q-values.

∗ Both authors contributed equally to this work.
1 Code is available at http://people.csail.mit.edu/karthikn/mud-play.
2 http://mudstats.com/

2 Related Work

Learning control policies from text is gaining increasing interest in the NLP community. Example applications include interpreting help documentation for software (Branavan et al., 2010), navigating with directions (Vogel and Jurafsky, 2010; Kollar et al., 2010; Artzi and Zettlemoyer, 2013; Matuszek et al., 2013; Andreas and Klein, 2015) and playing computer games (Eisenstein et al., 2009; Branavan et al., 2011a).

Games provide a rich domain for grounded language analysis. Prior work has assumed perfect knowledge of the underlying state of the game to learn policies. Gorniak and Roy (2005) developed a game character that can be controlled by spoken instructions adaptable to the game situation. The grounding of commands to actions is learned from a transcript manually annotated with actions and state attributes. Eisenstein et al. (2009) learn game rules by analyzing a collection of game-related documents and precompiled traces of the game. In contrast to the above work, our model combines text interpretation and strategy learning in a single framework. As a result, textual analysis is guided by the received control feedback, and the learned strategy directly builds on the text interpretation.

Our work closely relates to an automatic game player that utilizes text manuals to learn strategies for Civilization (Branavan et al., 2011a). Similar to our approach, text analysis and control strategies are learned jointly using feedback provided by the game simulation. In their setup, states are fully observable, and the model learns a strategy by combining state/action features and features extracted from text. In our application, however, the state representation is not provided and has to be inferred from a textual description. Therefore, it is not sufficient to extract features from text merely to supplement a simulation-based player.

Another related line of work consists of automatic game players that infer state representations directly from raw pixels (Koutník et al., 2013; Mnih et al., 2015). For instance, Mnih et al. (2015) learn control strategies using convolutional neural networks, trained with a variant of Q-learning (Watkins and Dayan, 1992). While both approaches use deep reinforcement learning for training, our work has important differences. In order to handle the sequential nature of text, we use Long Short-Term Memory networks to automatically learn useful representations for arbitrary text descriptions. Additionally, we show that decomposing the network into a representation layer and an action selector is useful for transferring the learnt representations to new game scenarios.

3 Background

Game Representation: We represent a game by the tuple ⟨H, A, T, R, Ψ⟩, where H is the set of all possible game states, A = {(a, o)} is the set of all commands (action-object pairs), T(h' | h, a, o) is the stochastic transition function between states and R(h, a, o) is the reward function. The game state h ∈ H is hidden from the player, who only receives a varying textual description, produced by a stochastic function Ψ : H → S. Specifically, the underlying state h in the game engine keeps track of attributes such as the player's location, her health points, time of day, etc. The function Ψ (also part of the game framework) then converts this state into a textual description of the location the player is at, or a message indicating low health. We do not assume access to either H or Ψ for our agent during either the training or the testing phase of our experiments. We denote the space of all possible text descriptions s by S. Rewards are generated using R and are only given to the player upon completion of in-game quests.

Figure 2: Architecture of LSTM-DQN: The Representation Generator (φR) (bottom) takes as input a stream of words observed in state s and produces a vector representation vs, which is fed into the action scorer (φA) (top) to produce scores for all actions and argument objects. [The figure shows the flow w1, ..., wn → LSTM layers → Mean Pooling → vs → Linear → ReLU → Linear → Q(s, a) and Q(s, o).]

Q-Learning: Reinforcement Learning is a commonly used framework for learning control policies in game environments (Silver et al., 2007; Amato and Shani, 2010; Branavan et al., 2011b; Szita, 2012). The game environment can be formulated as a sequence of state transitions (s, a, r, s') of a Markov Decision Process (MDP). The agent takes an action a in state s by consulting a state-action value function Q(s, a), which is a measure of the action's expected long-term reward. Q-Learning (Watkins and Dayan, 1992) is a model-free technique used to learn an optimal Q(s, a) for the agent. Starting from a random Q-function, the agent continuously updates its Q-values by playing the game and obtaining rewards. The iterative updates are derived from the Bellman equation (Sutton and Barto, 1998):

Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]    (1)

where γ is a discount factor for future rewards and the expectation is over all game transitions that involved the agent taking action a in state s.

Using these evolving Q-values, the agent chooses the action with the highest Q(s, a) to maximize its expected future rewards. In practice, the trade-off between exploration and exploitation can be achieved by following an ε-greedy policy (Sutton and Barto, 1998), where the agent performs a random action with probability ε.

Deep Q-Network: In large games, it is often impractical to maintain the Q-value for all possible state-action pairs. One solution to this problem is to approximate Q(s, a) using a parametrized function Q(s, a; θ), which can generalize over states and actions by considering higher-level attributes (Sutton and Barto, 1998; Branavan et al., 2011a). However, creating a good parametrization requires knowledge of the state and action spaces. One way to bypass this feature engineering is to use a Deep Q-Network (DQN) (Mnih et al., 2015). The DQN approximates the Q-value function with a deep neural network that predicts Q(s, a) for all possible actions a simultaneously, given the current state s. The non-linear layers of the DQN also enable it to learn better value functions than linear approximators.
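To make the background concrete, the following is a minimal tabular sketch of the sample-based Bellman update in Eq. (1) together with an ε-greedy action choice. It is purely illustrative and not the paper's method (the environment interface and the step size alpha are our assumptions); the model in Section 4 replaces the table with a neural approximator Q(s, a; θ).

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore; otherwise pick the highest-valued action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_step(Q, s, a, r, s_next, actions, gamma=0.5, alpha=0.1):
    """One sample-based version of the update in Eq. (1); alpha is an assumed step size."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # move Q(s, a) toward the bootstrapped target

# Q is a lookup table here; the DQN described next replaces it with Q(s, a; θ).
Q = defaultdict(float)
```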

4 Learning Representations and Control Policies

In this section, we describe our model (DQN) and its use in learning good Q-value approximations for games with stochastic textual descriptions. We divide our model into two parts. The first module is a representation generator that converts the textual description of the current state into a vector. This vector is then input into the second module, an action scorer. Figure 2 shows the overall architecture of our model. We learn the parameters of both the representation generator and the action scorer jointly, using the in-game reward feedback.

Representation Generator (φR): The representation generator reads the raw text displayed to the agent and converts it to a vector representation vs. A bag-of-words (BOW) representation is not sufficient to capture higher-order structures of sentences and paragraphs. The need for a better semantic representation of the text is evident from the average performance of this representation in playing MUD-games (as we show in Section 6).

In order to assimilate better representations, we utilize a Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997) as a representation generator. LSTMs are recurrent neural networks with the ability to connect and recognize long-range patterns between words in text. They are more robust than BOW to small variations in word usage and are able to capture the underlying semantics of sentences to some extent. In recent work, LSTMs have been used successfully in NLP tasks such as machine translation (Sutskever et al., 2014) and sentiment analysis (Tai et al., 2015) to compose vector representations of sentences from word-level embeddings (Mikolov et al., 2013; Pennington et al., 2014). In our setup, the LSTM network takes in word embeddings w_k from the words in a description s and produces output vectors x_k at each step. To get the final state representation vs, we add a mean pooling layer which computes the element-wise mean over the output vectors x_k:3

v_s = (1/n) Σ_{k=1}^{n} x_k    (2)

3 We also experimented with using just the output vector of the LSTM after processing the last word. Empirically, we find that mean pooling leads to faster learning, so we use it in all our experiments.

Action Scorer (φA): The action scorer module produces scores for the set of possible actions, given the current state representation. We use a multi-layered neural network for this purpose (see Figure 2). The input to this module is the vector from the representation generator, vs = φR(s), and the outputs are scores for actions a ∈ A. Scores for all actions are predicted simultaneously, which is computationally more efficient than scoring each state-action pair separately. Thus, by combining the representation generator and the action scorer, we obtain the approximation for the Q-function as Q(s, a) ≈ φA(φR(s))[a].

An additional complexity in playing MUD-games is that the actions taken by the player are multi-word natural language commands such as eat apple or go east. Due to computational constraints, in this work we limit ourselves to commands consisting of one action (e.g. eat) and one argument object (e.g. apple). This assumption holds for the majority of the commands in our worlds, with the exception of one class of commands that requires two arguments (e.g. move red-root right, move blue-root up). We consider all possible actions and objects available in the game and predict both for each state using the same network (Figure 2). We take the Q-value of the entire command (a, o) to be the average of the Q-values of the action a and the object o. For the rest of this section, we only show equations for Q(s, a), but similar ones hold for Q(s, o).
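As an illustration of how φR and φA fit together, here is a minimal sketch in PyTorch. This is our own restatement, not the authors' released implementation (footnote 1), which may differ; class and parameter names such as LSTMDQN and hidden_dim are ours, and hidden_dim=64 is an assumption. The embedding size (20) and the action/object counts (6 and 37, see footnote 7) follow the paper.

```python
import torch
import torch.nn as nn

class LSTMDQN(nn.Module):
    """Sketch of the two-module architecture: representation generator + action scorer."""
    def __init__(self, vocab_size, embed_dim=20, hidden_dim=64, n_actions=6, n_objects=37):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # word embeddings w_k
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.linear1 = nn.Linear(hidden_dim, hidden_dim)      # action scorer phi_A
        self.action_head = nn.Linear(hidden_dim, n_actions)   # Q(s, a) for all actions
        self.object_head = nn.Linear(hidden_dim, n_objects)   # Q(s, o) for all objects

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) integer tokens of the state description
        x, _ = self.lstm(self.embed(word_ids))   # per-word outputs x_k
        v_s = x.mean(dim=1)                      # mean pooling over words, Eq. (2)
        h = torch.relu(self.linear1(v_s))
        return self.action_head(h), self.object_head(h)
```

The Q-value of a full command (a, o) would then be taken as the average of the corresponding action and object scores, as described above.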
Parameter Learning: We learn the parameters θR of the representation generator and θA of the action scorer using stochastic gradient descent with RMSprop (Tieleman and Hinton, 2012). The complete training procedure is shown in Algorithm 1. In each iteration i, we update the parameters to reduce the discrepancy between the predicted value of the current state Q(s_t, a_t; θ_i) (where θ_i = [θR; θA]_i) and the expected Q-value given the reward r_t and the value of the next state, max_a Q(s_{t+1}, a; θ_{i−1}).

We keep track of the agent's previous experiences in a memory D.4 Instead of performing updates to the Q-value using transitions from the current episode, we sample a random transition (ŝ, â, s', r) from D. Updating the parameters in this way avoids issues due to strong correlation when using transitions of the same episode (Mnih et al., 2015). Using the sampled transition and (1), we obtain the following loss function to minimize:

L_i(θ_i) = E_{ŝ,â}[ (y_i − Q(ŝ, â; θ_i))^2 ]    (3)

where y_i = E_{ŝ,â}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) | ŝ, â ] is the target Q-value with parameters θ_{i−1} fixed from the previous iteration.

The updates on the parameters θ can be performed using the following gradient of L_i(θ_i):

∇_{θ_i} L_i(θ_i) = E_{ŝ,â}[ 2 (y_i − Q(ŝ, â; θ_i)) ∇_{θ_i} Q(ŝ, â; θ_i) ]

For each epoch of training, the agent plays several episodes of the game, which is restarted after every terminal state.

4 The memory is limited and rewritten in a first-in-first-out (FIFO) fashion.
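A minimal sketch of one such parameter update, assuming the LSTMDQN module sketched above. For simplicity the target here reuses the current parameters under torch.no_grad(), whereas the loss in Eq. (3) fixes θ_{i−1} from the previous iteration; the batch layout and function name are ours.

```python
import torch

def dqn_update(model, optimizer, batch, gamma=0.5):
    """One gradient step on the squared TD error of Eq. (3) for a sampled mini-batch."""
    states, actions, rewards, next_states, terminal = batch  # tensors; terminal is 0/1 floats
    q_actions, _ = model(states)                              # Q(s, a; θ) for all actions
    q_sa = q_actions.gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():                                     # target treated as a constant
        next_q, _ = model(next_states)
        target = rewards + gamma * next_q.max(dim=1).values * (1 - terminal)

    loss = ((target - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In use, optimizer would be torch.optim.RMSprop(model.parameters(), lr=0.0005), matching the learning rate reported in Section 5; the object scores Q(s, o) would receive an analogous loss term.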

Algorithm 1 Training Procedure for DQN with prioritized sampling
1: Initialize experience memory D
2: Initialize parameters of representation generator (φR) and action scorer (φA) randomly
3: for episode = 1, M do
4:     Initialize game and get start state description s_1
5:     for t = 1, T do
6:         Convert s_t (text) to representation v_{s_t} using φR
7:         if random() < ε then
8:             Select a random action a_t
9:         else
10:            Compute Q(s_t, a) for all actions using φA(v_{s_t})
11:            Select a_t = argmax_a Q(s_t, a)
12:        Execute action a_t and observe reward r_t and new state s_{t+1}
13:        Set priority p_t = 1 if r_t > 0, else p_t = 0
14:        Store transition (s_t, a_t, r_t, s_{t+1}, p_t) in D
15:        Sample a random mini-batch of transitions (s_j, a_j, r_j, s_{j+1}, p_j) from D, with fraction ρ having p_j = 1
16:        Set y_j = r_j if s_{j+1} is terminal, else y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ)
17:        Perform gradient descent step on the loss L(θ) = (y_j − Q(s_j, a_j; θ))^2
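The prioritized sampling of line 15 can be realized with two bounded FIFO pools, one for positive-reward transitions and one for the rest. The sketch below is our own illustration under that assumption; class and parameter names are ours, the capacity of the priority pool is an arbitrary choice, and the memory size, ρ and batch size follow the settings in Section 5.

```python
import random
from collections import deque

class PrioritizedReplay:
    """Two FIFO pools; sample a fraction rho of each mini-batch from the positive-reward pool."""
    def __init__(self, capacity=100000, rho=0.25):
        self.high = deque(maxlen=capacity // 4)   # transitions with priority p_t = 1 (size is our choice)
        self.low = deque(maxlen=capacity)         # all other transitions (FIFO, as in footnote 4)
        self.rho = rho

    def store(self, transition, reward):
        (self.high if reward > 0 else self.low).append(transition)

    def sample(self, batch_size=64):
        n_high = min(int(self.rho * batch_size), len(self.high))
        batch = random.sample(list(self.high), n_high)
        batch += random.sample(list(self.low), min(batch_size - n_high, len(self.low)))
        return batch
```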

Mini-batch Sampling: In practice, online updates to the parameters θ are performed over a mini-batch of state transitions, instead of a single transition. This increases the number of experiences used per step and is also more efficient due to optimized matrix operations.

The simplest method to create these mini-batches from the experience memory D is to sample uniformly at random. However, certain experiences are more valuable than others for the agent to learn from. For instance, rare transitions that provide positive rewards can be used more often to learn optimal Q-values faster. In our experiments, we consider such positive-reward transitions to have higher priority and keep track of them in D. We use prioritized sampling (inspired by Moore and Atkeson (1993)) to sample a fraction ρ of transitions from the higher-priority pool and a fraction 1 − ρ from the rest.

5 Experimental Setup

Game Environment: For our game environment, we modify Evennia,5 an open-source library for building online textual MUD games. Evennia is a Python-based framework that allows one to easily create new games by writing a batch file describing the environment with details of rooms, objects and actions. The game engine keeps track of the game state internally, presenting textual descriptions to the player and receiving text commands from the player. We conduct experiments on two worlds: a smaller Home world we created ourselves, and a larger, more complex Fantasy world created by Evennia's developers. The motivation behind the Home world is to abstract away high-level planning and focus on the language understanding requirements of the game.

Table 1: Various statistics of the two game worlds

Stats                                  Home World     Fantasy World
Vocabulary size                        84             1340
Avg. words / description               10.5           65.21
Max descriptions / room                3              100
# diff. quest descriptions             12             -
State transitions                      Deterministic  Stochastic
# states (underlying)                  16             ≥ 56
Branching factor (# commands / state)  40             222

Table 1 provides statistics of the game worlds. We observe that the Fantasy world is moderately sized, with a vocabulary of 1340 words and up to 100 different descriptions for a room. These descriptions were created manually by the game developers. These diverse, engaging descriptions are designed to make the game interesting and exciting for human players. Several rooms have many alternative descriptions, invoked randomly on each visit by the player.

Comparatively, the Home world is smaller: it has a very restricted vocabulary of 84 words and the room descriptions are relatively structured. However, both the room descriptions (which are also varied and randomly provided to the agent) and the quest descriptions were adversarially created with negation and conjunction of facts to force an agent to actually understand the state in order to play well. Therefore, this domain provides an interesting challenge for language understanding.

In both worlds, the agent receives a positive reward on completing a quest, and negative rewards for getting into bad situations like falling off a bridge or losing a battle. We also add small deterministic negative rewards for each non-terminating step. This incentivizes the agent to learn policies that solve quests in fewer steps. The supplementary material has details on the reward structure.

Home World: We created the Home world to mimic the environment of a typical house.6 The world consists of four rooms - a living room, a bedroom, a kitchen and a garden - with connecting pathways. Every room is reachable from every other room. Each room contains a representative object that the agent can interact with. For instance, the kitchen has an apple that the player can eat. Transitions between the rooms are deterministic. At the start of each game episode, the player is placed in a random room and provided with a randomly selected quest. The text provided to the player contains both the description of her current state and that of the quest. Thus, the player can begin in one of 16 different states (4 rooms × 4 quests), which adds to the world's complexity.

An example of a quest given to the player in text is Not you are sleepy now but you are hungry now. To complete this quest and obtain a reward, the player has to navigate through the house to reach the kitchen and eat the apple (i.e., type in the command eat apple). More importantly, the player should interpret that the quest does not require her to take a nap in the bedroom. We created such misguiding quests to make it hard for agents to succeed without an adequate level of language understanding.

Fantasy World: The Fantasy world is considerably more complex and involves quests such as navigating through a broken bridge or finding the secret tomb of an ancient hero. This game also has stochastic transitions in addition to the varying state descriptions provided to the player. For instance, there is a possibility of the player falling from the bridge if she lingers too long on it.

Due to the large command space in this game,7 we make use of cues provided by the game itself to narrow down the set of possible objects to consider in each state. For instance, in the MUD example in Figure 1, the game provides a list of possible exits. If the game does not provide such clues for the current state, we consider all objects in the game.

Evaluation: We use two metrics for measuring an agent's performance: (1) the cumulative reward obtained per episode, averaged over the episodes, and (2) the fraction of quests completed by the agent. The evaluation procedure is as follows. In each epoch, we first train the agent on M episodes of T steps each. At the end of this training, we have a testing phase of running M episodes of the game for T steps. We use M = 50, T = 20 for the Home world and M = 20, T = 250 for the Fantasy world. For all evaluation episodes, we run the agent following an ε-greedy policy with ε = 0.05, which makes the agent choose the best action according to its Q-values 95% of the time. We report the agent's performance at each epoch.

Baselines: We compare our LSTM-DQN model with three baselines. The first is a Random agent that chooses both actions and objects uniformly at random from all available choices.8 The other two are BOW-DQN and BI-DQN, which use a bag-of-words and a bag-of-bigrams representation of the text, respectively, as input to the DQN action scorer. These baselines serve to illustrate the importance of having a good representation layer for the task.

Settings: For our DQN models, we used D = 100000 and γ = 0.5. We use a learning rate of 0.0005 for RMSprop. We anneal the ε for ε-greedy from 1 to 0.2 over 100000 transitions. A mini-batch gradient update is performed every 4 steps of the gameplay. We roll out the LSTM (over words) for a maximum of 30 steps on the Home world and for 100 steps on the Fantasy world. For the prioritized sampling, we used ρ = 0.25 for both worlds. We employed a mini-batch size of 64 and a word embedding size of d = 20 in all experiments.

5 http://www.evennia.com/
6 An illustration is provided in the supplementary material.
7 We consider 222 possible command combinations of 6 actions and 37 object arguments.
8 In the case of the Fantasy world, the object choices are narrowed down using game clues as described earlier.
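The exploration schedule and the main hyperparameters above can be collected into a short sketch. This is our own restatement of the reported settings; the linear form of the annealing is an assumption, since the paper only states the start value, end value and horizon.

```python
# Reported hyperparameters (Section 5, Settings); dictionary keys are our own names.
SETTINGS = {
    "memory_size": 100000,     # D
    "gamma": 0.5,              # discount factor
    "learning_rate": 0.0005,   # RMSprop
    "batch_size": 64,
    "embedding_dim": 20,       # word embedding size d
    "rho": 0.25,               # prioritized sampling fraction
    "update_every": 4,         # gradient update every 4 game steps
}

def epsilon_at(step, start=1.0, end=0.2, horizon=100000):
    """Anneal epsilon from 1 to 0.2 over 100000 transitions (linear schedule assumed)."""
    frac = min(step / horizon, 1.0)
    return start + frac * (end - start)
```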

[Figure 3 plots: top row shows Reward (Home), Quest completion (Home) and Transfer Learning (Home); bottom row shows Reward (Fantasy), Quest completion (Fantasy) and Prioritized Sampling (Home). The x-axis is Epochs and the y-axes are Reward or Quest Completion, with curves for LSTM-DQN, BI-DQN, BOW-DQN and Random, plus Transfer vs. No Transfer and Prioritized vs. Uniform in the right-hand panels.]

Figure 3: Left: Graphs showing the evolution of average reward and quest completion rate for BOW-DQN, LSTM-DQN and a Random baseline on the Home world (top) and the Fantasy world (bottom). Note that the reward is shown in log scale for the Fantasy world. Right: Graphs showing the effects of transfer learning and prioritized sampling on the Home world.

6 Results

Home World: Figure 3 illustrates the performance of LSTM-DQN compared to the baselines. We can observe that the Random baseline performs quite poorly, completing only around 10% of quests on average9 and obtaining a low reward of around −1.58. The BOW-DQN model performs significantly better and is able to complete around 46% of the quests, with an average reward of 0.20. The improvement in reward is due to both a greater quest success rate and a lower rate of issuing invalid commands (e.g. eat apple would be invalid in the bedroom since there is no apple there). We notice that both the reward and quest completion graphs of this model are volatile. This is because the model fails to pick out differences between quests like Not you are hungry now but you are sleepy now and Not you are sleepy now but you are hungry now. The BI-DQN model suffers from the same issue, although it performs slightly better than BOW-DQN by completing 48% of quests. In contrast, the LSTM-DQN model does not suffer from this issue and is able to complete 100% of the quests after around 50 epochs of training, achieving close to the optimal reward possible.10 This demonstrates that having an expressive representation for text is crucial to understanding the game states and choosing intelligent actions.

9 Averaged over the last 10 epochs.
10 Note that since each step incurs a penalty of −0.01, the best reward (on average) a player can get is around 0.98.

In addition, we also investigated the impact of using a deep neural network for modeling the action scorer φA. Figure 4 illustrates the performance of the BOW-DQN and BI-DQN models along with their simpler versions BOW-LIN and BI-LIN, which use a single linear layer for φA. The DQN models clearly achieve better performance than their linear counterparts, which indicates that they model the control policy better.

Figure 4: Quest completion rates of DQN vs. Linear models on the Home world.

Fantasy World: We evaluate all the models on the Fantasy world in the same manner as before and report reward, quest completion rates and Q-values. The quest we evaluate on involves crossing the broken bridge (which takes a minimum of five steps), with the possibility of falling off at random (a 5% chance) when the player is on the bridge. The game has an additional quest of reaching a secret tomb. However, this is a complex quest that requires the player to memorize game events and perform high-level planning, which are beyond the scope of this current work. Therefore, we focus only on the first quest.

From Figure 3 (bottom), we can see that the Random baseline does poorly in terms of both average per-episode reward11 and quest completion rates. BOW-DQN converges to a much higher average reward of −12.68 and achieves around 82% quest completion. Again, BOW-DQN is often confused by the varying (10 different) descriptions of the portions of the bridge, which is reflected in its erratic performance on the quest. BI-DQN performs very well on quest completion by finishing 97% of quests. However, this model tends to find sub-optimal solutions and gets an average reward of −26.68, even worse than BOW-DQN. One reason for this is the negative rewards the agent obtains after falling off the bridge. The LSTM-DQN model again performs best, achieving an average reward of −11.33 and completing 96% of quests on average. Though this world does not contain descriptions adversarial to BOW-DQN or BI-DQN, the LSTM-DQN obtains a higher average reward by completing the quest in fewer steps and showing more resilience to variations in the state descriptions.

11 Note that the rewards graph is in log scale.

Transfer Learning: We would like the representations learnt by φR to be generic enough to be transferable to new game worlds. To test this, we created a second Home world with the same rooms but a completely different map, changing the locations of the rooms and the pathways between them. The main differentiating factor of this world from the original Home world lies in the high-level planning required to complete quests. We initialized the LSTM part of an LSTM-DQN agent with the parameters θR learnt from the original Home world and trained it on the new world.12 Figure 3 (top right) demonstrates that the agent with transferred parameters is able to learn quicker than an agent starting from scratch with randomly initialized parameters (No Transfer), reaching the optimal policy almost 20 epochs earlier. This indicates that these simulated worlds can be used to learn good representations for language that transfer across worlds.

12 The parameters for the Action Scorer (θA) are initialized randomly.

Prioritized Sampling: We also investigate the effects of different mini-batch sampling procedures on the parameter learning. From Figure 3 (bottom right), we observe that using prioritized sampling significantly speeds up learning, with the agent achieving the optimal policy around 50 epochs faster than with uniform sampling. This shows promise for further research into different schemes of assigning priority to transitions.
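Returning to the transfer-learning experiment above, a minimal sketch of the initialization is given below, assuming the LSTMDQN class from the earlier sketch (function and variable names are ours). Only the representation generator θR is copied; the action scorer keeps its random initialization, as in footnote 12. This relies on both Home worlds sharing the same vocabulary, so the embedding shapes match.

```python
def transfer_representation(source_model, target_model):
    """Copy phi_R (embeddings + LSTM) from a trained agent; leave phi_A randomly initialized."""
    target_model.embed.load_state_dict(source_model.embed.state_dict())
    target_model.lstm.load_state_dict(source_model.lstm.state_dict())
    # linear1, action_head and object_head keep their random initialization (theta_A).
    return target_model

# Example usage: new_agent = transfer_representation(trained_agent, LSTMDQN(vocab_size=84))
```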
Representation Analysis: We analyzed the representations learnt by the LSTM-DQN model on the Home world. Figure 5 shows a visualization of the learnt word embeddings, reduced to two dimensions using t-SNE (Van der Maaten and Hinton, 2008). All the vectors were initialized randomly before training. We can see that semantically similar words appear close together to form coherent subspaces. In fact, we observe four different subspaces, each for one type of room along with its corresponding object(s) and quest words. For instance, food items like pizza and rooms like kitchen are very close to the word hungry, which appears in a quest description. This shows that the agent learns to form meaningful associations between the semantics of the quest and the environment. Table 2 shows some examples of descriptions from the Fantasy world and their nearest neighbors using cosine similarity between their corresponding vector representations produced by LSTM-DQN. The model is able to correlate descriptions of the same (or similar) underlying states and project them onto nearby points in the representation subspace.

Figure 5: t-SNE visualization of the word embeddings (except stopwords) after training on the Home world. The embedding values are initialized randomly. [The plot shows four clusters, labeled "Kitchen", "Bedroom", "Living room" and "Garden".]

Table 2: Sample descriptions from the Fantasy world and their nearest neighbors (NN) according to their vector representations from the LSTM representation generator. The NNs are often descriptions of the same or similar (nearby) states in the game.

Description: You are halfways out on the unstable bridge. From the castle you hear a distant howling sound, like that of a large dog or other beast.
Nearest neighbor: The bridge slopes precariously where it extends westwards towards the lowest point - the center point of the hang bridge. You clasp the ropes firmly as the bridge sways and creaks under you.

Description: The ruins opens up to the sky in a small open area, lined by columns. ... To the west is the gatehouse and entrance to the castle, whereas southwards the columns make way for a wide open courtyard.
Nearest neighbor: The old gatehouse is near collapse. ... East the gatehouse leads out to a small open area surrounded by the remains of the castle. There is also a standing archway offering passage to a path along the old southern inner wall.
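As an illustration of the nearest-neighbor analysis, the following sketch (our own, assuming a matrix of description vectors v_s produced by φR, stacked row-wise) retrieves the most similar stored description by cosine similarity:

```python
import numpy as np

def nearest_neighbor(query_vec, all_vecs, descriptions):
    """Return the description whose representation has the highest cosine similarity to query_vec."""
    normed = all_vecs / np.linalg.norm(all_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = normed @ q                    # cosine similarity to every stored description
    best = int(np.argsort(-sims)[1])     # skip index 0, which is the query itself when it is stored
    return descriptions[best], float(sims[best])
```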

7 Conclusions

We address the task of end-to-end learning of control policies for text-based games. In these games, all interactions in the virtual world are through text and the underlying state is not observed. The resulting language variability makes such environments challenging for automatic game players. We employ a deep reinforcement learning framework to jointly learn state representations and action policies using game rewards as feedback. This framework enables us to map text descriptions into vector representations that capture the semantics of the game states. Our experiments demonstrate the importance of learning good representations of text in order to play these games well. Future directions include tackling high-level planning and strategy learning to improve the performance of intelligent agents.

Acknowledgements

We are grateful to the developers of Evennia, the game framework upon which this work is based. We also thank Nate Kushman, Clement Gehring, Gustavo Goretkin, members of MIT's NLP group and the anonymous EMNLP reviewers for insightful comments and feedback. T. Kulkarni was graciously supported by the Leventhal Fellowship. We would also like to acknowledge MIT's Center for Brains, Minds and Machines (CBMM) for support.

References

[Amato and Shani 2010] Christopher Amato and Guy Shani. 2010. High-level reinforcement learning in strategy games. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1, pages 75–82. International Foundation for Autonomous Agents and Multiagent Systems.

[Amir and Doyle 2002] Eyal Amir and Patrick Doyle. 2002. Adventure games: A challenge for cognitive robotics. In Proc. Int. Cognitive Robotics Workshop, pages 148–155.

[Andreas and Klein 2015] Jacob Andreas and Dan Klein. 2015. Alignment-based compositional semantics for instruction following. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

[Artzi and Zettlemoyer 2013] Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1(1):49–62.

[Branavan et al. 2010] S.R.K. Branavan, Luke S. Zettlemoyer, and Regina Barzilay. 2010. Reading between the lines: Learning to map high-level instructions to commands. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1268–1277. Association for Computational Linguistics.

[Branavan et al. 2011a] S.R.K. Branavan, David Silver, and Regina Barzilay. 2011a. Learning to win by reading manuals in a monte-carlo framework. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 268–277. Association for Computational Linguistics.

[Branavan et al. 2011b] S.R.K. Branavan, David Silver, and Regina Barzilay. 2011b. Non-linear monte-carlo search in Civilization II. AAAI Press / International Joint Conferences on Artificial Intelligence.

[Curtis 1992] Pavel Curtis. 1992. Mudding: Social phenomena in text-based virtual realities. High Noon on the Electronic Frontier: Conceptual Issues in Cyberspace, pages 347–374.

[DePristo and Zubek 2001] Mark A. DePristo and Robert Zubek. 2001. being-in-the-world. In Proceedings of the 2001 AAAI Spring Symposium on Artificial Intelligence and Interactive Entertainment, pages 31–34.

[Eisenstein et al. 2009] Jacob Eisenstein, James Clarke, Dan Goldwasser, and Dan Roth. 2009. Reading to learn: Constructing features from semantic abstracts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 958–967, Singapore, August. Association for Computational Linguistics.

[Gorniak and Roy 2005] Peter Gorniak and Deb Roy. 2005. Speaking with your sidekick: Understanding situated speech in computer role playing games. In R. Michael Young and John E. Laird, editors, Proceedings of the First Artificial Intelligence and Interactive Digital Entertainment Conference, June 1-5, 2005, Marina del Rey, California, USA, pages 57–62. AAAI Press.

[Hochreiter and Schmidhuber 1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Kollar et al. 2010] Thomas Kollar, Stefanie Tellex, Deb Roy, and Nicholas Roy. 2010. Toward understanding natural language directions. In Human-Robot Interaction (HRI), 2010 5th ACM/IEEE International Conference on, pages 259–266. IEEE.

[Koutník et al. 2013] Jan Koutník, Giuseppe Cuccu, Jürgen Schmidhuber, and Faustino Gomez. 2013. Evolving large-scale neural networks for vision-based reinforcement learning. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pages 1061–1068. ACM.

[Kushman et al. 2014] Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. ACL (1), pages 271–281.

[Matuszek et al. 2013] Cynthia Matuszek, Evan Herbst, Luke Zettlemoyer, and Dieter Fox. 2013. Learning to parse natural language commands to a robot control system. In Experimental Robotics, pages 403–415. Springer.

[Mikolov et al. 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[Mnih et al. 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

[Moore and Atkeson 1993] Andrew W. Moore and Christopher G. Atkeson. 1993. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1):103–130.

[Pennington et al. 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 12.

[Silver et al. 2007] David Silver, Richard S. Sutton, and Martin Müller. 2007. Reinforcement learning of local shape in the game of go. In IJCAI, volume 7, pages 1053–1058.

[Sutskever et al. 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

[Sutton and Barto 1998] Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning. MIT Press.

[Szita 2012] István Szita. 2012. Reinforcement learning in games. In Reinforcement Learning, pages 539–577. Springer.

[Tai et al. 2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China, July. Association for Computational Linguistics.

[Tieleman and Hinton 2012] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4.

[Van der Maaten and Hinton 2008] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

[Vogel and Jurafsky 2010] Adam Vogel and Dan Jurafsky. 2010. Learning to follow navigational directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 806–814. Association for Computational Linguistics.

[Watkins and Dayan 1992] Christopher J.C.H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning, 8(3-4):279–292.