
The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Interactive Fiction Games: A Colossal Adventure

Matthew Hausknecht (Microsoft Research AI), Prithviraj Ammanabrolu (Georgia Institute of Technology), Marc-Alexandre Côté (Microsoft Research Montréal), Xingdi Yuan (Microsoft Research Montréal)

Abstract

A hallmark of human intelligence is the ability to understand and communicate with language. Interactive Fiction games are fully text-based simulation environments where a player issues text commands to effect change in the environment and progress through the story. We argue that IF games are an excellent testbed for studying language-based autonomous agents. In particular, IF games combine challenges of combinatorial action spaces, language understanding, and commonsense reasoning. To facilitate rapid development of language-based agents, we introduce Jericho, a learning environment for man-made IF games, and conduct a comprehensive study of text-agents across a rich set of games, highlighting directions in which agents can improve.

Observation: This bedroom is extremely spare, with dirty laundry scattered haphazardly all over the floor. Cleaner clothing can be found in the dresser. A bathroom lies to the south, while a door to the east leads to the living room. On the end table are a telephone, a wallet and some keys.
The phone rings.

Action: Answer phone

Observation: You pick up the phone. "Hadley!" a shrill voice cries. "Hadley, haven't you even left yet?? You knew that our presentation was at nine o'clock sharp! First the thing with the printers, now this - there won't even be enough left of you for Bowman to fire once he's done with you. Now get the hell down here!!"

Action: Examine me

Observation: You're covered with mud and dried sweat. It was an exhausting night - no wonder you overslept! Even making it to the bed before conking out was a heroic accomplishment.

Figure 1: Excerpt from Adam Cadre's 9:05.

1 Introduction

Interactive fiction (IF) games are software environments in which players observe textual descriptions of the simulated world, issue text actions, and receive score progress through the story.
As illustrated by Figure 1, humans bring competencies in natural language understanding, commonsense reasoning, and deduction to bear in order to infer the context and objectives of a game. We believe IF environments will provide a good testbed for studying the artificial development of these capabilities. Beyond games, real-world applications such as voice-activated personal assistants also exist at the intersection of natural language understanding, natural language generation, and sequential decision making.

From a machine learning perspective, Interactive Fiction games exist at the intersection of natural language processing and sequential decision making. Like many NLP tasks, they require natural language understanding, but unlike most NLP tasks, IF games are sequential decision making problems in which actions change the subsequent world states of the game, and choices made early in a game may have long-term effects on the eventual endings. Reinforcement Learning (Sutton and Barto 1998) studies sequential decision making problems and has shown promise in vision-based (Jaderberg et al. 2016) and control-based (OpenAI et al. 2018) environments, but has less commonly been applied in the context of language-based tasks.

Why study environments that feature ungrounded natural language instead of multimodal environments that provide visual grounding for language? We believe that ungrounded natural language is inescapable in certain real-world tasks such as voice-activated personal assistants.

The contributions of this paper are as follows: First, we introduce Jericho, a learning environment for human-made IF games. Second, we introduce a template-based action space that we argue is appropriate for language generation.
Third, we conduct an empirical evaluation of learning agents across a set of human-made games.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

2 Research Challenges

From the perspective of reinforcement learning, IF games can be modeled as partially observable Markov decision processes (POMDPs) defined by (S, T, A, O, R). Observations o ∈ O correspond to the game's text responses, while latent states s ∈ S correspond to player and item locations, inventory contents, monsters, etc. Text-based actions a ∈ A change the game state according to a latent transition function T(s′|s, a), and the agent receives rewards r from an unknown reward function R(s, a). To succeed in these environments, agents must generate natural language actions, reason about entities and affordances, and represent their knowledge about the game world. We present these challenges in greater detail below.
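The POMDP loop described above can be made concrete with a toy environment. This is a minimal sketch for illustration only - ToyTextEnv and its one-room world are invented here and are not part of Jericho:

```python
# Minimal sketch of the POMDP interaction loop: a latent state s,
# text observations o, text actions a, and scalar rewards r.
# ToyTextEnv is a hypothetical toy, not Jericho's API.

class ToyTextEnv:
    def reset(self):
        self.state = {"location": "bedroom", "phone_answered": False}  # latent s
        return "This bedroom is extremely spare. The phone rings."     # observation o

    def step(self, action):
        reward, done = 0, False
        if action == "answer phone" and not self.state["phone_answered"]:
            self.state["phone_answered"] = True                # transition T(s'|s, a)
            obs, reward = "A shrill voice cries at you.", 1    # reward R(s, a)
        else:
            obs = "Nothing happens."                           # unrecognized or invalid action
        return obs, reward, done

env = ToyTextEnv()
obs = env.reset()
obs, reward, done = env.step("answer phone")
```

The agent only ever sees the observation strings; the underlying state dictionary stays hidden, which is what makes the process partially observable.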
Combinatorial Action Space. Reinforcement learning has studied agents that operate in discrete or continuous action space environments. However, IF games require the agent to operate in the combinatorial action space of natural language. Combinatorial spaces pose extremely difficult exploration problems for existing agents. For example, an agent generating a four-word sentence from a modest vocabulary of size 700 is effectively exploring a space of 700^4 ≈ 240 billion possible actions. Further complicating this challenge is the fact that natural language commands are interpreted by the game's parser, which recognizes a game-specific subset of possible commands. For example, out of the 240 billion possible actions there may be only ten valid actions at each step - actions that are both recognized by the game's parser and generate a change in world state.

As discussed in Section 4.1, we propose the use of a template-based action space in which the agent first chooses a template of the form put OBJ in OBJ and then fills in the blanks using words from the vocabulary. A typical game may have around 200 templates, each with up to two blanks, yielding an action space of 200 × 700^2 = 98 million actions - three orders of magnitude smaller than the naive space but six orders of magnitude greater than most discrete RL environments.
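The action-space arithmetic above is easy to verify, and filling a template's blanks is a few lines of code. A sketch using the paper's illustrative counts (the three-word vocabulary is invented for the demo):

```python
from itertools import product

# Back-of-the-envelope action-space sizes quoted in the text.
vocab_size = 700                              # modest vocabulary
num_templates = 200                           # templates with up to two blanks

naive = vocab_size ** 4                       # four-word sentences: 700^4
templated = num_templates * vocab_size ** 2   # template + two blanks

assert naive == 240_100_000_000               # ~240 billion
assert templated == 98_000_000                # ~98 million

# Grounding a template by filling its OBJ blanks from a tiny vocabulary.
def fill(template, vocab):
    for words in product(vocab, repeat=template.count("OBJ")):
        action = template
        for w in words:
            action = action.replace("OBJ", w, 1)
        yield action

actions = list(fill("put OBJ in OBJ", ["sword", "chest", "key"]))
# 3^2 = 9 grounded actions, including "put sword in chest"
```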
Commonsense Reasoning. Due to the lack of graphics, IF games rely on the player's commonsense knowledge as a prior for how to interact with the game world. For example, a human player encountering a locked chest intuitively understands that the chest needs to be unlocked with some type of key, and that once unlocked, the chest can be opened and will probably contain useful items. They may make a mental note to return to the chest if a key is found in the future. They may even mark the location of the chest on a map to easily find their way back.

These inferences are possible for humans who have years of embodied experience interacting with chests, cabinets, safes, and all variety of objects. Artificial agents lack the commonsense knowledge gained through years of grounded language acquisition and have no reason to prefer opening a chest to eating it. Commonsense also extends to behavior, such as knowing to flee from a monster, and to spatial reasoning, such as stacking garbage cans against the wall of a building to access a second-floor window.

Knowledge Representation. IF games span many distinct locations, each with unique descriptions, objects, and characters. Players move between locations by issuing navigational commands like go west. Due to the large number of locations in many games, humans often create maps to navigate efficiently and avoid getting lost. This gives rise to the Textual-SLAM problem, a textual variant of the Simultaneous Localization and Mapping (SLAM) (Thrun, Burgard, and Fox 2005) problem of constructing a map while navigating a new environment. In particular, because connectivity between locations is not necessarily Euclidean, agents need to be able to detect when a navigational action has succeeded or failed, and whether the location reached was previously seen or new. Beyond location connectivity, it is also helpful to keep track of the objects present at each location, with the understanding that objects can be nested inside other objects, such as food in a refrigerator or a sword in a chest.
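The Textual-SLAM bookkeeping described above can be sketched as a small graph an agent might maintain. Identifying rooms by their description text is a simplifying assumption for the demo (real games can describe the same room differently across visits):

```python
# Toy map-building for Textual-SLAM: record (room, direction) -> room edges
# while navigating, and use the room description to decide whether a move
# reached a new room, a previously seen room, or failed outright.

class TextMap:
    def __init__(self):
        self.edges = {}   # (room id, direction) -> room id
        self.seen = {}    # description text -> room id

    def observe(self, desc):
        """Return a room id for a description, minting one if it is new."""
        if desc not in self.seen:
            self.seen[desc] = f"loc{len(self.seen)}"
        return self.seen[desc]

    def record_move(self, here, direction, new_desc):
        there = self.observe(new_desc)
        if there != here:                       # navigation succeeded
            self.edges[(here, direction)] = there
        return there, there != here

m = TextMap()
bedroom = m.observe("Spare bedroom.")
living, moved = m.record_move(bedroom, "east", "Living room.")   # new room
same, moved2 = m.record_move(living, "north", "Living room.")    # failed move
```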
3 Related Work

In contrast to the parser-based games studied in this paper, choice-based games provide a list of possible actions at each step, so learning agents must only choose between the candidates. The DRRN algorithm for choice-based games (He et al. 2016; Zelinka 2018) estimates Q-values for a particular action from a particular state. This network is evaluated once for each possible action, and the action with the maximum Q-value is selected. While this approach is effective for choice-based games, which have only a handful of candidate actions at each step, it is difficult to scale to parser-based games, where the action space is vastly larger.

For parser-based games, such as the ones examined in this paper, several approaches have been investigated. LSTM-DQN (Narasimhan, Kulkarni, and Barzilay 2015) considers verb-noun actions up to two words in length. Separate Q-value estimates are produced for each possible verb and object, and the action consists of pairing the maximally valued verb with the maximally valued object. LSTM-DQN was demonstrated to work on two small-scale domains, but human-made games, such as those studied in this paper, represent a significant increase in both complexity and vocabulary.

Another approach to affordance extraction (Fulda et al. 2017) identified a vector in word2vec (Mikolov et al. 2013) space that encodes affordant behavior. When applied to the noun sword, this vector produces affordant verbs such as vanquish, impale, duel, and battle. The authors use this method to prioritize verbs for a Q-learning agent to pair with in-game objects.

An alternative strategy has been to reduce the combinatorial…
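The DRRN's evaluate-every-candidate selection rule discussed above is straightforward to sketch. Here q_value is a hypothetical word-overlap stand-in for the learned Q-network (the real DRRN scores state and action text with neural encoders):

```python
# DRRN-style action selection for choice-based games: score each candidate
# action against the current state text and pick the argmax.

def q_value(state, action):
    # Stand-in scorer: count action words that also appear in the state.
    state_words = set(state.lower().split())
    return sum(w in state_words for w in action.lower().split())

def drrn_act(state, candidates):
    return max(candidates, key=lambda a: q_value(state, a))

state = "A locked chest sits in the corner of the dusty cellar."
candidates = ["go north", "eat sword", "open chest"]
best = drrn_act(state, candidates)
```

The key property is that the scorer runs once per candidate, which is cheap for a handful of choices but, as the text notes, does not scale to hundreds of millions of parser-based actions.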