Interactive Fiction Game Playing as Multi-Paragraph Reading Comprehension with Reinforcement Learning

Xiaoxiao Guo* (IBM Research), Mo Yu* (IBM Research), Yupeng Gao (IBM Research)

Chuang Gan (MIT-IBM Watson AI Lab), Murray Campbell (IBM Research), Shiyu Chang (MIT-IBM Watson AI Lab)

Abstract

Interactive Fiction (IF) games with real human-written natural language texts provide a new natural evaluation for language understanding techniques. In contrast to previous text games with mostly synthetic texts, IF games pose language understanding challenges on the human-written textual descriptions of diverse and sophisticated game worlds, and language generation challenges on the action command generation from a less restricted combinatorial space. We take a novel perspective of IF game solving and re-formulate it as Multi-Passage Reading Comprehension (MPRC) tasks. Our approaches utilize the context-query attention mechanisms and the structured prediction in MPRC to efficiently generate and evaluate action outputs, and apply an object-centric historical observation retrieval strategy to mitigate the partial observability of the textual observations. Extensive experiments on the recent IF benchmark (Jericho) demonstrate clear advantages of our approaches, achieving high winning rates and low data requirements compared to all previous approaches.¹

Figure 1: Sample gameplay of the classic dungeon game Zork1. The objective is to solve various puzzles and collect the 19 treasures to install in the trophy case. The player receives textual observations describing the current game state and additional reward scalars encoding the game designers' objective of game progress. The player sends textual action commands to control the protagonist.

1 Introduction

Interactive systems capable of understanding natural language and responding in the form of natural language text have high potential in various applications. In pursuit of building and evaluating such systems, we study learning agents for Interactive Fiction (IF) games. IF games are world-simulating software in which players use text commands to control the protagonist and influence the world, as illustrated in Figure 1. IF gameplay agents need to simultaneously understand the game's information from a text display (observation) and generate natural language commands (actions) via a text input interface. Without being given an explicit game strategy, the agents need to identify behaviors that maximize objective-encoded cumulative rewards.

IF games composed of human-written texts (distinct from previous text games with synthetic texts) create superb new opportunities for studying and evaluating natural language understanding (NLU) techniques due to their unique characteristics. (1) Game designers elaborately craft the literariness of the texts to attract players when creating IF games. The resulting texts in IF games are more linguistically diverse and sophisticated than the template-generated ones in synthetic text games.

* Primary authors.
¹ Source code is available at: https://github.com/XiaoxiaoGuo/rcdqn


(a) Multi-Passage Retrieval for Partial Observability (b) Multi-Passage RC for Action Value Learning

Figure 2: Overview of our approach to solving IF games as Multi-Paragraph Reading Comprehension (MPRC) tasks.

(2) The language contexts of IF games are more versatile because various designers contribute to enormous domains and genres, such as adventure, fantasy, horror, and sci-fi. (3) The text commands to control characters are less restricted, with a space over six orders of magnitude larger than in previous text games. The recently introduced Jericho benchmark provides a collection of such IF games (Hausknecht et al., 2019a).

The complexity of IF games demands more sophisticated NLU techniques than those used in synthetic text games. Moreover, the task of designing IF game-play agents, intersecting NLU and reinforcement learning (RL), poses several unique challenges on the NLU techniques. The first challenge is the difficulty of exploration in the huge natural language action space. To make RL agents learn efficiently without prohibitively exhaustive trials, the action estimation must generalize learned knowledge from tried actions to others. To this end, previous approaches, starting with a single embedding vector of the observation, either predict the elements of actions independently (Narasimhan et al., 2015; Hausknecht et al., 2019a), or embed each valid action as another vector and predict the action value based on vector-space similarities (He et al., 2016). These methods do not consider the compositionality or role-differences of the action elements, or the interactions among them and the observation. Therefore, their modeling of the action values is less accurate and less data-efficient.

The second challenge is partial observability. At each game-playing step, the agent receives a textual observation describing the locations, objects, and characters of the game world. But the latest observation is often not a sufficient summary of the interaction history and may not provide enough information to determine the long-term effects of actions. Previous approaches address this problem by building a representation over past observations (e.g., building a graph of objects, positions, and spatial relations) (Ammanabrolu and Riedl, 2019; Ammanabrolu and Hausknecht, 2020). These methods treat the historical observations equally and summarize the information into a single vector without focusing on the important contexts related to action prediction for the current observation. Therefore, their usage of history also brings noise, and the improvement is not always significant.

We propose a novel formulation of IF game playing as Multi-Passage Reading Comprehension (MPRC) and harness MPRC techniques to solve the huge action space and partial observability challenges. The graphical illustration is shown in Figure 2. First, the action value prediction (i.e., predicting the long-term rewards of selecting an action) is essentially generating and scoring a compositional action structure by finding supporting evidence from the observation. We build on the fact that each action is an instantiation of a template, i.e., a verb phrase with a few placeholders for the object arguments it takes (Figure 2b). The action generation process can then be viewed as extracting objects for a template's placeholders from the textual observation, based on the interaction between the template verb phrase and the relevant context of the objects in the observation. Our approach addresses the structured prediction and interaction problems with the idea of the context-question attention mechanism in RC models. Specifically, we treat the observation as a passage and each template verb phrase as a question. The filling of object placeholders in the template thus becomes an extractive QA problem that selects objects from the observation given the template.

Simultaneously, each action (i.e., a template with all placeholders replaced) gets its evaluation value predicted by the RC model. Our formulation and approach better capture the fine-grained interactions between observation texts and structural actions, in contrast to previous approaches that represent the observation as a single vector and ignore the fine-grained dependency among action elements.

Second, alleviating partial observability is essentially enhancing the current observation with potentially relevant history and predicting actions over the enhanced observation. Our approach retrieves potentially relevant historical observations with an object-centric approach (Figure 2a), so that the retrieved ones are more likely to be connected to the current observation, as they describe at least one shared interactable object. Our attention mechanisms are then applied across the retrieved multiple observation texts to focus on informative contexts for action value prediction.

We evaluated our approach on the suite of Jericho IF games, compared to all previous approaches. Our approaches achieved or outperformed the state-of-the-art performance on 25 out of 33 games, trained with less than one-tenth of the game interaction data used by prior art. We also provide ablation studies on our models and retrieval strategies.

2 Related Work

IF Game Agents. Previous work mainly studies text understanding and generation in parser-based or rule-based text game tasks, such as the TextWorld platform (Côté et al., 2018) or custom domains (Narasimhan et al., 2015; He et al., 2016; Adhikari et al., 2020). The recent platform Jericho (Hausknecht et al., 2019a) supports over thirty human-written IF games. Earlier successes in real IF games mainly rely on heuristics without learning. NAIL (Hausknecht et al., 2019b) is the state-of-the-art among these "no-learning" agents, employing a series of reliable heuristics for exploring the game, interacting with objects, and building an internal representation of the game world. With the development of learning environments like Jericho, RL-based agents have started to achieve dominating performance.

A critical challenge for learning-based agents is how to handle the combinatorial action space in IF games. LSTM-DQN (Narasimhan et al., 2015) was proposed to generate verb-object actions with pre-defined sets of possible verbs and objects, but treats the selection and learning of verbs and objects independently. Template-DQN (Hausknecht et al., 2019a) extended LSTM-DQN for template-based action generation, introducing one additional but still independent prediction output for the second object in the template. The Deep Reinforcement Relevance Network (DRRN) (He et al., 2016) was introduced for choice-based games. Given a set of valid actions at every game state, DRRN projects each action into a hidden space that matches the current state representation vector for action selection. The Action-Elimination Deep Q-Network (AE-DQN) (Zahavy et al., 2018) learns to predict invalid actions in text-adventure games. It eliminates invalid actions for efficient policy learning by utilizing expert demonstration data.

Other techniques focus on addressing the partial observability in text games. Knowledge Graph DQN (KG-DQN) (Ammanabrolu and Riedl, 2019) was proposed to deal with synthetic games. The method constructs and represents the game states as knowledge graphs with objects as nodes, and uses a pre-trained general-purpose OpenIE tool and human-written rules to extract relations between objects. KG-DQN handles the action representation following DRRN. KG-A2C (Ammanabrolu and Hausknecht, 2020) later extends the work to IF games by adding information extraction heuristics to fit the complexity of the object relations in IF games and utilizing a GRU-based action generator to handle the action space.

Reading Comprehension Models for Question Answering. Given a question, reading comprehension (RC) aims to find the answer to the question based on a paragraph that may contain supporting evidence. One of the standard RC settings is extractive QA (Rajpurkar et al., 2016; Joshi et al., 2017; Kwiatkowski et al., 2019), which extracts a span from the paragraph as an answer. Our formulation of IF game playing resembles this setting. Many neural reader models have been designed for RC. Specifically, for the extractive QA task, the reader models usually build question-aware passage representations via attention mechanisms (Seo et al., 2016; Yu et al., 2018), and employ a pointer network to predict the start and end positions of the answer span (Wang and Jiang, 2016). Powerful pre-trained language models (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019) have been recently applied to enhance the encoding and attention mechanisms of the aforementioned reader models.

They give a performance boost but are more resource-demanding and do not suit the IF game playing task very well.

Reading Comprehension over Multiple Paragraphs. Multi-paragraph reading comprehension (MPRC) deals with the more general task of answering a question from multiple related paragraphs, where each paragraph may not necessarily support the correct answer. Our formulation becomes an MPRC setting when we enhance the state representation with historical observations and predict actions from multiple observation paragraphs.

A fundamental research problem in MPRC, which is also critical to our formulation, is to select relevant paragraphs from all the input paragraphs for the reader to focus on. Previous approaches mainly apply traditional IR approaches like BM25 (Chen et al., 2017; Joshi et al., 2017), or neural ranking models trained with distant supervision (Wang et al., 2018; Min et al., 2019a), for paragraph selection. Our formulation also relates to the work on evidence aggregation in MPRC (Wang et al., 2017; Lin et al., 2018), which aims to infer the answers based on the aggregation of evidence pieces from multiple paragraphs. Finally, some recent works propose entity-centric paragraph retrieval approaches (Ding et al., 2019; Godbole et al., 2019; Min et al., 2019b; Asai et al., 2019), where paragraphs are connected if they share the same named entities. The paragraph retrieval then becomes a traversal over such graphs via entity links. These entity-centric paragraph retrieval approaches share a similar high-level idea with our object-based history retrieval approach. The techniques above have been applied to deal with evidence from Wikipedia, news collections, and, recently, books (Mou et al., 2020). We are the first to extend these ideas to IF games.

Figure 3: Our RC-based action prediction model architecture. The template text is a verb phrase with placeholders for objects, such as [pick up OBJ] and [break OBJ with OBJ].

3 Multi-Paragraph RC for IF Games

3.1 Problem Formulation

Each IF game can be defined as a Partially Observable Markov Decision Process (POMDP), namely a 7-tuple $\langle S, A, T, O, \Omega, R, \gamma \rangle$, representing the hidden game state set, the action set, the state transition function, the set of textual observations composed from vocabulary words, the textual observation function, the reward function, and the discount factor, respectively. The game playing agent interacts with the game engine in multiple turns until the game is over or the maximum number of steps is reached. At the $t$-th turn, the agent receives a textual observation describing the current game state, $o_t \in O$, and sends a textual action command $a_t \in A$ back. The agent receives an additional reward scalar $r_t$ which encodes the game designers' objective of game progress. Thus the task of game playing can be formulated as generating a textual action command per step so as to maximize the expected cumulative discounted rewards $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. Value-based RL approaches learn to approximate an observation-action value function $Q(o_t, a_t; \theta)$ which measures the expected cumulative rewards of taking action $a_t$ when observing $o_t$. The agent selects actions based on the action value prediction of $Q(o, a; \theta)$.

Template Action Space. The template action space considers actions satisfying the decomposition $\langle verb, arg_0, arg_1 \rangle$. $verb$ is an interchangeable verb phrase template with placeholders for objects, and $arg_0$ and $arg_1$ are optional objects. For example, the action commands [east], [pick up eggs] and [break window with stone] can be represented as the template actions $\langle$east, none, none$\rangle$, $\langle$pick up OBJ, eggs, none$\rangle$ and $\langle$break OBJ with OBJ, window, stone$\rangle$. We reuse the template library and object list from Jericho. The verb phrases usually consist of several vocabulary words, and each object is usually a single word.
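To make the template-action decomposition above concrete, here is a minimal Python sketch (illustrative only, not the paper's released code; the `TemplateAction` and `instantiate` names are hypothetical) of how a ⟨verb, arg0, arg1⟩ tuple maps back to a game command:

```python
from typing import NamedTuple, Optional

class TemplateAction(NamedTuple):
    """A template action <verb, arg0, arg1>; unused argument slots stay None ("none")."""
    verb: str                  # verb phrase with OBJ placeholders, e.g. "break OBJ with OBJ"
    arg0: Optional[str] = None
    arg1: Optional[str] = None

def instantiate(action: TemplateAction) -> str:
    """Fill the OBJ placeholders of the verb phrase with the selected objects, in order."""
    command = action.verb
    for obj in (action.arg0, action.arg1):
        if obj is not None:
            command = command.replace("OBJ", obj, 1)
    return command

# The three examples from Section 3.1.
for action in [
    TemplateAction("east"),
    TemplateAction("pick up OBJ", "eggs"),
    TemplateAction("break OBJ with OBJ", "window", "stone"),
]:
    print(instantiate(action))
# -> east
# -> pick up eggs
# -> break window with stone
```

Representing actions this way lets the model encode a template verb phrase once and reuse that computation across every object instantiation, which is the computation-sharing point made in Section 3.2.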

3.2 RC Model for Template Actions

We parameterize the observation-action value function $Q(o, a{=}\langle verb, arg_0, arg_1 \rangle; \theta)$ by utilizing the decomposition of the template actions and the context-query contextualized representation in RC. Our model treats the observation $o$ as a context in RC and the $verb = (v_1, v_2, \ldots, v_k)$ component of the template actions as a query. A verb-aware observation representation is then derived via an RC reader model with Bidirectional Attention Flow (BiDAF) (Seo et al., 2016) and self-attention. The observation representations corresponding to the $arg_0$ and $arg_1$ words are pooled and projected to a scalar value estimate for $Q(o, a{=}\langle verb, arg_0, arg_1 \rangle; \theta)$. A high-level architecture of our model is illustrated in Figure 3.

Observation and verb Representation. We tokenize the observation and the verb phrase into words, then embed these words using pre-trained GloVe embeddings (Pennington et al., 2014). A shared encoder block that consists of LayerNorm (Ba et al., 2016) and a Bidirectional GRU (Cho et al., 2014) processes the observation and verb word embeddings to obtain the separate observation and verb representations.

Observation-verb Interaction Layers. Given the separate observation and verb representations, we apply two attention mechanisms to compute a verb-contextualized observation representation. We first apply BiDAF with the observation as the context input and the verb as the query input. Specifically, we denote the processed embeddings for observation word $i$ and template word $j$ as $o_i$ and $t_j$. The attention between the two words is then $a_{ij} = \mathbf{w}_1 \cdot o_i + \mathbf{w}_2 \cdot t_j + \mathbf{w}_3 \cdot (o_i \otimes t_j)$, where $\mathbf{w}_1$, $\mathbf{w}_2$, $\mathbf{w}_3$ are learnable vectors and $\otimes$ is the element-wise product. We then compute the "verb2observation" attention vector for the $i$-th observation word as $c_i = \sum_j p_{ij} t_j$ with $p_{ij} = \exp(a_{ij}) / \sum_j \exp(a_{ij})$. Similarly, we compute the "observation2verb" attention vector as $q = \sum_i p_i o_i$ with $p_i = \exp(\max_j a_{ij}) / \sum_i \exp(\max_j a_{ij})$. We concatenate and project the output vectors as $\mathbf{w}_4 \cdot [o_i, c_i, o_i \otimes c_i, q \otimes c_i]$, followed by a linear layer with leaky ReLU activation units (Maas et al., 2013). The output vectors are processed by an encoder block. We then apply a residual self-attention on the outputs of the encoder block. The self-attention is the same as BiDAF, but only between the observation and itself.

Observation-Action Value Prediction. We generate an action by replacing the placeholders ($arg_0$ and $arg_1$) in a template with objects appearing in the observation. The observation-action value $Q(o, a{=}\langle verb, arg_0{=}obj_m, arg_1{=}obj_n \rangle; \theta)$ is obtained by processing each object's corresponding verb-contextualized observation representation. Specifically, we get the indices of an $obj$ in the observation texts, $I(obj, o)$.² When the object is a noun phrase, we take the index of its headword. Because the same object has different meanings when it replaces different placeholders, we apply two GRU-based embedding functions for the two placeholders, to get the object's verb-placeholder dependent embeddings. We derive a single vector representation $h_{arg_0=obj_m}$ for the case that the placeholder $arg_0$ is replaced by $obj_m$ by mean-pooling over the verb-placeholder dependent embeddings indexed by $I(obj_m, o)$ for the corresponding placeholder $arg_0$. We apply a linear transformation on the concatenated embeddings of the two placeholders to obtain the observation-action value $Q(o, a) = \mathbf{w}_5 \cdot [h_{arg_0=obj_m}, h_{arg_1=obj_n}]$ for $a = \langle verb, arg_0{=}obj_m, arg_1{=}obj_n \rangle$. Our formulation avoids the repeated computation overhead among different actions that share a template verb phrase.

² Some templates may take zero or one object. We denote the unrequired objects as none so that all templates take two objects. The index of the none object is set to the index of the special split token of the observation contents.
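The following NumPy sketch walks through the BiDAF-style observation-verb attention and the object-based value readout described above, with toy dimensions and random weights standing in for the trained GRU encoders and parameters (all names and sizes here are illustrative, not the released implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # toy hidden size (the paper uses 128)
n_obs, n_verb = 6, 3                   # observation length and verb-phrase length

O = rng.normal(size=(n_obs, d))        # encoded observation words o_i
T = rng.normal(size=(n_verb, d))       # encoded verb/template words t_j
w1, w2, w3 = rng.normal(size=(3, d))   # learnable vectors of the trilinear attention
W4 = rng.normal(size=(4 * d, d))       # projection after concatenation
w5 = rng.normal(size=2 * d)            # value head over [h_arg0, h_arg1]

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# a_ij = w1.o_i + w2.t_j + w3.(o_i * t_j)
A = O @ w1[:, None] + (T @ w2)[None, :] + (O * w3) @ T.T   # shape (n_obs, n_verb)

C = softmax(A, axis=1) @ T              # verb2observation vectors c_i
q = softmax(A.max(axis=1), axis=0) @ O  # observation2verb vector q

# fused verb-contextualized observation representation with a leaky ReLU
G = np.concatenate([O, C, O * C, q[None, :] * C], axis=1) @ W4
G = np.where(G > 0, G, 0.01 * G)

# Value readout: mean-pool the representation at an object's word indices for each
# placeholder and score the concatenation. (The paper additionally passes each
# placeholder through its own GRU embedding function, omitted here for brevity.)
def object_vector(word_indices):
    return G[word_indices].mean(axis=0)

h_arg0 = object_vector([2])             # e.g. token position of "window"
h_arg1 = object_vector([5])             # e.g. token position of "stone"
q_value = float(w5 @ np.concatenate([h_arg0, h_arg1]))
print(q_value)
```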

Agents | Action strategy | State strategy | Interaction data
TDQN | Independent selection of template and the two objects | None | 1M
DRRN | Action as a word sequence without distinguishing the roles of verbs and objects | None | 1M
KG-A2C | Recurrent neural decoder that selects the template and objects in a fixed order | Object graph from historical observations based on OpenIE and human-written rules | 1.6M
Ours | Observation-template representation for object-centric value prediction | Object-based history observation retrieval | 0.1M

Table 1: Summary of the main technical differences between our agent and the baselines. All agents use DQN to update the model parameters, except KG-A2C, which uses A2C. All agents use the same handicaps.

3.3 Multi-Paragraph Retrieval Method for Partial Observability

The observation at the current step sometimes does not have full textual evidence to support action selection and value estimation, due to the inherent partial observability of IF games. For example, when repeatedly attacking a troll with a sword, the player needs to know the effect or feedback of the last attack to determine whether an extra attack is necessary. It is thus important for an agent to efficiently utilize historical observations to better support action value prediction. In our RC-based action prediction model, the utilization of historical observations can be formulated as selecting evidential observation paragraphs in history and predicting the action values from multiple selected observations, namely a Multiple-Paragraph Reading Comprehension (MPRC) problem. We propose to retrieve past observations with an object-centric approach.

Past Observation Retrieval. Multiple past observations may share objects with the current observation, and it is computationally expensive and unnecessary to retrieve all of such observations. The utility of past observations associated with each object is often time-sensitive, in that new observations may entirely or partially invalidate old observations. We thus propose a time-sensitive strategy for retrieving past observations. Specifically, given the detected objects from the current observation, we retrieve the most recent K observations with at least one shared object. The K retrieved observations are sorted by time steps and concatenated to the current observation. The observations from different time steps are separated by a special token. Our RC-based action prediction model treats the concatenated observations as the observation inputs, and no other parts are changed. We use the notation $o_t$ to represent the current observation and the extended current observation interchangeably.

3.4 Training Loss

We apply the Deep Q-Network (DQN) (Mnih et al., 2015) to update the parameters $\theta$ of our RC-based action prediction model. The loss function is:

$$\mathcal{L}(\theta) = \mathbb{E}_{(o_t, a_t, r_t, o_{t+1}) \sim \rho(\mathcal{D})} \Big[ \big\| Q(o_t, a_t; \theta) - \big( r_t + \gamma \max_b Q(o_{t+1}, b; \theta) \big) \big\| \Big]$$

where $\mathcal{D}$ is the experience replay consisting of recent gameplay transition records and $\rho$ is a distribution over the transitions defined by a sampling strategy.

Prioritized Trajectories. The distribution $\rho$ has a decent impact on DQN performance. Previous work samples transition tuples with immediate positive rewards more frequently to speed up learning (Narasimhan et al., 2015; Hausknecht et al., 2019a). We observe that this heuristic is often insufficient. Some transitions with zero immediate rewards, or even negative rewards, are also indispensable in recovering well-performed trajectories. We thus extend the strategy from the transition level to the trajectory level. We prioritize transitions from trajectories that outperform the exponential moving average score of recent trajectories.

4 Experiments

We evaluate our proposed methods on the suite of Jericho supported games. We compare to all previous baselines, which include recent methods addressing the huge action space and partial observability challenges.

4.1 Setup

Jericho Handicaps and Configuration. The handicaps used by our methods are the same as for the other baselines. First, we use the Jericho API to check if an action is valid with game-specific templates. Second, we augment the observation with the textual feedback returned by the commands [inventory] and [look]. Previous work also included the last action or game score as additional inputs. Our model discards these two types of inputs, as we did not observe a significant difference for our model. The maximum game step number is set to 100, following the baselines.

Implementation Details. We apply spaCy³ to tokenize the observations and detect the objects in the observations. We use the 100-dimensional GloVe embeddings as fixed word embeddings. Out-of-vocabulary words are mapped to a randomly initialized embedding. The dimension of the Bi-GRU hidden states is 128. We set the observation representation dimension to 128 throughout the model. The history retrieval window K is 2. For the DQN configuration, we use the ε-greedy strategy for exploration, annealing ε from 1.0 to 0.05. The discount factor γ is 0.98. We use Adam to update the weights with a $10^{-4}$ learning rate. Other parameters are set to their default values. More details for the Reproducibility Checklist are in Appendix A.

³ https://spacy.io
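A minimal sketch of the time-sensitive, object-centric retrieval from Section 3.3: keep the most recent K past observations that share at least one detected object with the current one, sort them by time, and join them with a separator token (the `extract_objects` helper and the separator string are illustrative; the paper detects objects with spaCy):

```python
from typing import Callable, List, Set

SEP = "<obs_sep>"  # illustrative separator between observations from different time steps

def retrieve_history(current_obs: str,
                     past_obs: List[str],
                     extract_objects: Callable[[str], Set[str]],
                     k: int = 2) -> str:
    """Extend the current observation with up to k most recent past observations
    that mention at least one object also detected in the current observation."""
    current_objects = extract_objects(current_obs)
    selected = []
    # scan the history from most recent to oldest
    for step in range(len(past_obs) - 1, -1, -1):
        if current_objects & extract_objects(past_obs[step]):
            selected.append(step)
            if len(selected) == k:
                break
    # re-sort the retrieved observations into time order and concatenate
    parts = [past_obs[step] for step in sorted(selected)] + [current_obs]
    return f" {SEP} ".join(parts)

# Toy usage with a naive keyword-based object detector.
def naive_objects(text: str) -> Set[str]:
    vocab = {"window", "stone", "eggs", "troll", "sword"}
    return {w.strip(".,").lower() for w in text.split()} & vocab

history = ["You see a small window.", "A troll blocks the way.", "You pick up the stone."]
print(retrieve_history("The window is closed. A stone lies nearby.", history, naive_objects))
# -> You see a small window. <obs_sep> You pick up the stone. <obs_sep> The window is closed. A stone lies nearby.
```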

7760 Human Baselines Ours Game Max Walkthrough-100 TDQN DRRN KG-A2C MPRC-DQN RC-DQN 905 11 00 0 0 0 acorncourt 30 30 1.6 10 0.3 10.0 10.0 advent 350 113 36 36 36 63.9 36 adventureland 100 42 0 20.6 0 24.2 21.7 afflicted 75 75 1.3 2.6 – 8.0 8.0 anchor 100 11 00 0 0 0 awaken 50 50 00 0 0 0 balances 51 30 4.8 10 10 10 10 deephome 300 83 11 1 1 1 detective 360 350 169 197.8 207.9 317.7 291.3 dragon 25 25 -5.3 -3.5 0 0.04 4.84 enchanter 400 125 8.6 20 12.1 20.0 20.0 gold 100 30 4.1 0– 0 0 inhumane 90 70 0.7 0 3 00 jewel 90 24 0 1.6 1.8 4.46 2.0 karn 170 40 0.7 2.1 0 10.0 10.0 library 30 30 6.3 17 14.3 17.7 18.1 ludicorp 150 37 6 13.8 17.8 19.7 17.0 moonlit 11 00 0 0 0 omniquest 50 50 16.8 10 3 10.0 10.0 pentari 70 60 17.4 27.2 50.7 44.4 43.8 reverb 50 50 0.3 8.2 – 2.0 2.0 snacktime 50 50 9.7 00 0 0 sorcerer 400 150 5 20.8 5.8 38.6 38.3 spellbrkr 600 160 18.7 37.8 21.3 25 25 spirit 250 8 0.6 0.8 1.3 3.8 5.2 temple 35 20 7.9 7.4 7.6 8.0 8.0 tryst205 350 50 0 9.6 – 10.0 10.0 yomomma 35 34 0 0.4 – 1.0 1.0 zenon 20 20 0 0 3.9 00 zork1 350 102 9.9 32.6 34 38.3 38.8 zork3 73a 0 0.5 0.1 3.63 2.83 ztuu 100b 100 4.9 21.6 9.2 85.4 79.1 Winning 24%/8 30%/10 27%/9 64%/21 52%/17 percentage / counts 76%/25

Table 2: Average game scores on the Jericho benchmark games. The best performing agent score per game is in bold. The Winning percentage / counts row computes the percentage / counts of games on which the corresponding agent is best. The scores of the baselines are from their papers. The missing scores are represented as "–", for games that KG-A2C skipped. We also added the 100-step results from a human-written game-playing walkthrough, as a reference for human-level scores. We denote the difficulty levels of the games defined in the original Jericho paper with colors in their names – possible (i.e., easy or normal) games in green, difficult games in tan, and extreme games in red. Best seen in color.
a The Zork3 walkthrough does not maximize the score in the first 100 steps but explores more.
b Our agent discovers some unbounded reward loops in the game Ztuu.

Baselines. We compare with all the public results on the Jericho suite, namely TDQN (Hausknecht et al., 2019a), DRRN (He et al., 2016), and KG-A2C (Ammanabrolu and Hausknecht, 2020). As discussed, our approaches differ from them mainly in the strategies for handling the large action space and partial observability of IF games. We summarize these main technical differences in Table 1. In summary, all previous agents predict actions conditioned on a single vector representation of the whole observation texts. Thus they do not exploit the fine-grained interplay among the template components and the observations. Our approach addresses this problem by formulating action prediction as an RC task, better utilizing the rich textual observations with deeper language understanding.

Training Sample Efficiency. We update our models 100,000 times. Our agents interact with the environment one step per update, resulting in a total of 0.1M environment interactions. Compared to the other agents, such as KG-A2C (1.6M), TDQN (1M), and DRRN (1M), our environment interaction data is significantly smaller.
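For concreteness, the update of Section 3.4 with the exploration settings of Section 4.1 boils down to a standard temporal-difference objective; the sketch below uses a squared TD error and a linear ε schedule as simple stand-ins (the paper writes the loss as a norm of the TD error and only states the 1.0 to 0.05 annealing range, so both choices here are assumptions):

```python
import numpy as np

GAMMA = 0.98  # discount factor from Section 4.1

def td_loss(q_sa, rewards, q_next_valid, done):
    """Squared TD error between Q(o_t, a_t) and r_t + gamma * max_b Q(o_{t+1}, b).
    q_next_valid holds, per transition, the Q values of the *valid* template
    actions at o_{t+1}, so the max runs only over valid actions."""
    targets = np.array([
        r + (0.0 if d else GAMMA * q_next.max())
        for r, q_next, d in zip(rewards, q_next_valid, done)
    ])
    return float(np.mean((np.asarray(q_sa) - targets) ** 2))

def epsilon(step, total_steps=100_000, start=1.0, end=0.05):
    """Exploration rate annealed from 1.0 to 0.05 over training (linear schedule assumed)."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

# toy batch of two transitions
q_sa = [0.2, 0.5]
rewards = [0.0, 1.0]
q_next_valid = [np.array([0.1, 0.4, -0.2]), np.array([0.3, 0.2])]
done = [False, True]
print(td_loss(q_sa, rewards, q_next_valid, done), epsilon(50_000))
```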

7761 Game Template Action Avg. Steps Dialog Darkness Nonstandard Inventory Space ( 106) Per Reward Actions Limit ⇥ advent 107 7 XXX detective 19 2 karn 63 17 XX ludicorp 45 4 XX pentari 32 5 X spirit 195 21 XX X X zork3 67 39 XX X

Table 3: Difficulty levels and characteristics of games on which our approach achieves the most considerable improvement. Dialog indicates that it is necessary to speak with another character. Darkness indicates that accessing some dark areas requires a light source. Nonstandard Actions refers to actions with words not in an English dictionary. Inventory Limit restricts the number of items carried by the player. Please refer to (Hausknecht et al., 2019a) for more comprehensive definitions.

4.2 Overall Performance

We summarize the performance of our Multi-Paragraph Reading Comprehension DQN (MPRC-DQN) agent and the baselines in Table 2. Of the 33 IF games, our MPRC-DQN achieved or improved the state-of-the-art performance on 21 games (i.e., a winning rate of 64%). The best performing baseline (DRRN) achieved the state-of-the-art performance on only ten games, corresponding to a winning rate of 30%, less than half of ours. Note that all the methods achieved the same initial scores on five games, namely 905, anchor, awaken, deephome, and moonlit. Apart from these five games, our MPRC-DQN achieved more than three times as many wins. Our MPRC-DQN achieved significant improvement on some games, such as adventureland, afflicted, detective, etc. Appendix C shows some game playing trajectories.

We include the performance of an RC-DQN agent, which implements our RC-based action prediction model but only takes the current observations as inputs. It also outperformed the baselines by a large margin. After we consider the RC-DQN agent, our MPRC-DQN still has the highest winning percentage, indicating that our RC-based action prediction model has a significant impact on the performance improvement of our MPRC-DQN, and that the improvement from the multi-passage retrieval is also unneglectable. Moreover, compared to RC-DQN, our MPRC-DQN has another advantage of faster convergence. The learning curves of our MPRC-DQN and RC-DQN agents on various games are in Appendix B.

Finally, our approaches, overall, achieve the new state-of-the-art on 25 games (i.e., a winning rate of 76%), giving a significant advance in the field of IF game playing.

Competitors | Win | Draw | Lose
MPRC-DQN v.s. TDQN | 23 | 6 | 4
MPRC-DQN v.s. DRRN | 18 | 13 | 2
MPRC-DQN v.s. KG-A2C | 18 | 7 | 3

Table 4: Pairwise comparison between our MPRC-DQN and each baseline.

Pairwise Competition. To better understand the performance difference between our approach and each of the baselines, we adopt a direct one-to-one comparison metric based on the results from Table 2. Our approach has a high winning rate when competing with any of the baselines, as summarized in Table 4. All the baselines have only a rare chance of beating us on any game. DRRN gives a higher chance of draw games when competing with ours.

Human-Machine Gap. We additionally compare IF gameplay agents to human players to better understand the significance of the improvement and the potential improvement upper bound. We measure each agent's game progress as the macro-average of the normalized agent-to-human game score ratios, capped at 100%. The progress of our MPRC-DQN is 28.5%, while that of the best performing baseline DRRN is 17.8%, showing that our agent's improvement is significant even in the realm of human players. Nevertheless, there is a vast gap between the learning agents and human players. The gap indicates IF games can be a good benchmark for the development of natural language understanding techniques.

Difficulty Levels of Games. Jericho categorizes the supported games into three difficulty levels, namely possible games, difficult games, and extreme games, based on the characteristics of the game dynamics, such as the action space size, the length of the game, and the average number of steps to receive a non-zero reward. Our approach improves over prior art on seven of the sixteen possible games, seven of the eleven difficult games, and three of the six extreme games in Table 2. This shows that the strategies of our method are generally beneficial for all difficulty levels of game dynamics. Table 3 summarizes the characteristics of the seven games on which our method improves the most, i.e., by more than 15% of the game progress in the first 100 steps.⁴ First, these mostly improved games have medium action space sizes, which is an advantageous setting for our methods, where modeling the template-object-observation interactions is effective. Second, our approach improves most on games with a reasonably high degree of reward sparsity, such as karn, spirit, and zork3, indicating that our RC-based value function formulation helps in optimization and mitigates the reward sparsity. Finally, we remark that these game difficulty levels are not directly categorized based on natural language-related characteristics, such as text comprehension and puzzle-solving difficulties. Future studies on additional game categories based on those natural language-related characteristics would shed light on related improvements.

⁴ We ignore ztuu due to the infinite reward loops.

Figure 4: Learning curves for ablative studies. (left) Model ablative studies on the game Detective. (middle) Model ablative studies on Zork1. (right) Retrieval strategy study on Zork1. Best seen in color.

4.3 Ablative Studies

RC-model Design. The overall results show that our RC-model plays a critical role in the performance improvement. We compare our RC-model to some alternative models as ablative studies. We consider three alternatives, namely (1) our RC-model without the self-attention component (w/o self-att), (2) without the argument-specific embedding (w/o arg-emb), and (3) our RC-model with a Transformer-based block encoder (RC-Trans) following QANet (Yu et al., 2018). Detailed architecture is in Appendix A.

The learning curves for the different RC-models are in Figure 4 (left/middle). The RC-models without either self-attention or argument-specific embedding degenerate, and the argument-specific embedding has a greater impact. The Transformer-based encoder block sometimes learns faster than Bi-GRU at the early learning stage. It achieved a comparable final performance, even with much greater computational resource requirements.

Retrieval Strategy. We compare history retrieval strategies with different history sizes (K) and pure recency-based strategies (i.e., taking the latest K observations as history, denoted as w/o rec). The learning curves of the different strategies are in Figure 4 (right). In general, the impact of the history window size is highly game-dependent, but the pure recency-based ones do not differ significantly from RC-DQN at the beginning of learning. The issues of the pure recency-based strategy are: (1) limited additional information about objects provided by successive observations; and (2) higher variance of retrieved observations due to policy changes.

5 Conclusion

We formulate general IF game playing as MPRC tasks, enabling an MPRC-style solution to efficiently address the key IF game challenges of the huge combinatorial action space and partial observability in a unified framework. Our approaches achieved significant improvement over the previous state-of-the-art on both game scores and training data efficiency. Our formulation also bridges broader NLU/RC techniques to address other critical challenges in IF games for future work, e.g., common-sense reasoning, novelty-driven exploration, and multi-hop inference.

Acknowledgments

We would like to thank Matthew Hausknecht for helpful discussions on the Jericho environments.

References

Ashutosh Adhikari, Xingdi Yuan, Marc-Alexandre Côté, Mikuláš Zelinka, Marc-Antoine Rondeau, Romain Laroche, Pascal Poupart, Jian Tang, Adam Trischler, and William L Hamilton. 2020. Learning dynamic knowledge graphs to generalize on text-based games. arXiv preprint arXiv:2002.09127.

Prithviraj Ammanabrolu and Matthew Hausknecht. 2020. Graph constrained reinforcement learning for natural language action spaces. arXiv, pages arXiv–2001.

Prithviraj Ammanabrolu and Mark Riedl. 2019. Playing text-adventure games with graph-based deep reinforcement learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3557–3565.

Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2019. Learning to retrieve reasoning paths over wikipedia graph for question answering. arXiv preprint arXiv:1911.10470.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879.

Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, et al. 2018. Textworld: A learning environment for text-based games. In Workshop on Computer Games, pages 41–75. Springer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019. Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of ACL 2019.

Ameya Godbole, Dilip Kavarthapu, Rajarshi Das, Zhiyu Gong, Abhishek Singhal, Hamed Zamani, Mo Yu, Tian Gao, Xiaoxiao Guo, Manzil Zaheer, et al. 2019. Multi-step entity-centric information retrieval for multi-hop question answering. arXiv preprint arXiv:1909.07598.

Matthew Hausknecht, Prithviraj Ammanabrolu, Marc-Alexandre Côté, and Xingdi Yuan. 2019a. Interactive fiction games: A colossal adventure. arXiv preprint arXiv:1909.05398.

Matthew Hausknecht, Ricky Loynd, Greg Yang, Adith Swaminathan, and Jason D Williams. 2019b. Nail: A general interactive fiction agent. arXiv preprint arXiv:1902.04259.

Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. 2016. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1621–1630.

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. 2018. Denoising distantly supervised open-domain question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1736–1745.

Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3.

Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. A discrete hard em approach for weakly supervised question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2844–2857.

Sewon Min, Danqi Chen, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019b. Knowledge guided text retrieval and reading for open domain question answering. arXiv preprint arXiv:1911.03868.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Xiangyang Mou, Mo Yu, Bingsheng Yao, Chenghao Yang, Xiaoxiao Guo, Saloni Potdar, and Hui Su. 2020. Frustratingly hard evidence retrieval for qa over books. In Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events, pages 108–113.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.

Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J Mankowitz, and Shie Mannor. 2018. Learn what not to learn: Action elimination with deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3562–3573.

Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1–11.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension.

Shuohang Wang and Jing Jiang. 2016. Machine comprehension using match-lstm and answer pointer. arXiv preprint arXiv:1608.07905.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018. R^3: Reinforced ranker-reader for open-domain question answering. In Thirty-Second AAAI Conference on Artificial Intelligence.

Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. 2017. Evidence aggregation for answer re-ranking in open-domain question answering. arXiv preprint arXiv:1711.05116.
