Reading and Acting while Blindfolded: The Need for Semantics in Text Game Agents

Shunyu Yao†∗  Karthik Narasimhan†  Matthew Hausknecht‡
†Princeton University  ‡Microsoft Research
{shunyuy, karthikn}@princeton.edu  [email protected]
∗Work partly done during internship at Microsoft Research. Project page: https://blindfolded.cs.princeton.edu

Abstract

Text-based games simulate worlds and interact with players using natural language. Recent work has used them as a testbed for autonomous language-understanding agents, with the motivation being that understanding the meanings of words, or semantics, is a key component of how humans understand, reason, and act in these worlds. However, it remains unclear to what extent artificial agents utilize semantic understanding of the text. To this end, we perform experiments to systematically reduce the amount of semantic information available to a learning agent. Surprisingly, we find that an agent is capable of achieving high scores even in the complete absence of language semantics, indicating that the currently popular experimental setup and models may be poorly designed to understand and leverage game texts. To remedy this deficiency, we propose an inverse dynamics decoder to regularize the representation space and encourage exploration, which shows improved performance on several games including ZORK I. We discuss the implications of our findings for designing future agents with stronger semantic understanding.

1 Introduction

Text adventure games such as ZORK I (Figure 1(a)) have been a testbed for developing autonomous agents that operate using natural language. Since interactions in these games (input observations, action commands) are through text, the ability to understand and use language is deemed necessary and critical to progress through such games. Previous work has deployed a spectrum of methods for language processing in this domain, including word vectors (Fulda et al., 2017), recurrent neural networks (Narasimhan et al., 2015; Hausknecht et al., 2020), pre-trained language models (Yao et al., 2020), open-domain question answering systems (Ammanabrolu et al., 2020), knowledge graphs (Ammanabrolu and Hausknecht, 2020; Ammanabrolu et al., 2020; Adhikari et al., 2020), and reading comprehension systems (Guo et al., 2020).

Meanwhile, most of these models operate under the reinforcement learning (RL) framework, where the agent explores the same environment in repeated episodes, learning a value function or policy to maximize the game score. From this perspective, text games are just special instances of a partially observable Markov decision process (POMDP) $(S, T, A, O, R, \gamma)$, where players issue text actions $a \in A$, receive text observations $o \in O$ and scalar rewards $r = R(s, a)$, and the underlying game state $s \in S$ is updated by the transition $s' = T(s, a)$.

However, what distinguishes these games from other POMDPs is the fact that the actions and observations are in language space $L$. Therefore, a certain level of decipherable semantics is attached to text observations $o \in O \subset L$ and actions $a \in A \subset L$. Ideally, these texts not only serve as observation or action identifiers, but also provide clues about the latent transition function $T$ and reward function $R$. For example, issuing the action "jump" based on an observation "on the cliff" would likely yield a subsequent observation such as "you are killed" along with a negative reward. Human players often rely on their understanding of language semantics to inform their choices, even on games they have never played before, while replacing texts with non-semantic identifiers such as their corresponding hashes (Figure 1(c)) would likely render games unplayable for people.
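To make this POMDP view concrete, the following is a minimal sketch of the interaction loop with an interactive fiction game. It assumes the Jericho framework's FrotzEnv interface (reset, step, get_valid_actions) and an illustrative path to a ZORK I game file; the random policy simply stands in for a learned agent.

```python
import random

from jericho import FrotzEnv  # interactive fiction framework of Hausknecht et al. (2020)

# The game file path is an illustrative assumption; Jericho loads Z-machine ROMs from disk.
env = FrotzEnv("roms/zork1.z5")

obs, info = env.reset()          # text observation o plus side information (score, moves, ...)
done, episode_return = False, 0

while not done:
    # Valid action handicap: the engine enumerates a reduced action space at the current state.
    valid_actions = env.get_valid_actions()
    # A random policy stands in for a learned agent such as DRRN.
    action = random.choice(valid_actions)
    # The environment applies the latent transition s' = T(s, a) and returns o' and r = R(s, a).
    obs, reward, done, info = env.step(action)
    episode_return += reward

print("episode return:", episode_return)
```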
However, would this type of transformation affect current RL agents for such games? In this paper, we ask the following question: to what extent do current reinforcement learning agents leverage semantics in text-based games?

To shed light on this question, we investigate the Deep Reinforcement Relevance Network (DRRN) (He et al., 2016), a top-performing RL model that uses gated recurrent units (GRUs) (Cho et al., 2014) to encode texts. We conduct three experiments on a set of interactive fiction games from the Jericho benchmark (Hausknecht et al., 2020) to probe the effect of different semantic representations on the functioning of DRRN. These include (1) using just a location phrase as the input observation (Figure 1(b)), (2) hashing text observations and actions (Figure 1(c)), and (3) regularizing vector representations using an auxiliary inverse dynamics loss. While reducing observations to location phrases leads to decreased scores and enforcing inverse dynamics decoding leads to increased scores on some games, hashing texts to break semantics surprisingly matches or even outperforms the baseline DRRN on almost all games considered. This implies current RL agents for text-based games might not be sufficiently leveraging the semantic structure of game texts to learn good policies, and points to the need for developing better experimental setups and agents that have a finer grasp of natural language.

Figure 1: (a) Sample original gameplay from ZORK I. (b)(c) Our proposed semantic ablations: (b) MIN-OB reduces observations to only the current location name, and (c) HASH replaces observation and action texts by their string hash values.

2 Models

DRRN Baseline

Our baseline RL agent, DRRN (He et al., 2016), learns a Q-network $Q_\phi(o, a)$ parametrized by $\phi$. The model encodes the observation $o$ and each action candidate $a$ using two separate GRU encoders $f_o$ and $f_a$, and then aggregates the representations to derive the Q-value through an MLP decoder $g$:

$Q_\phi(o, a) = g(\mathrm{concat}(f_o(o), f_a(a)))$   (1)

For learning $\phi$, tuples $(o, a, r, o')$ of observation, action, reward, and next observation are sampled from an experience replay buffer, and the following temporal difference (TD) loss is minimized:

$L_{\mathrm{TD}}(\phi) = \big(r + \gamma \max_{a' \in A} Q_\phi(o', a') - Q_\phi(o, a)\big)^2$   (2)

During gameplay, a softmax exploration policy is used to sample an action:

$\pi_\phi(a \mid o) = \dfrac{\exp(Q_\phi(o, a))}{\sum_{a' \in A} \exp(Q_\phi(o, a'))}$   (3)

Note that when the action space $A$ is large, (2) and (3) become intractable. A valid action handicap (Hausknecht et al., 2020) or a language model (Yao et al., 2020) can be used to generate a reduced action space for efficient exploration. For all the modifications below, we use the DRRN with the valid action handicap as our base model.
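The following is a minimal PyTorch sketch of the DRRN components in Equations (1), (2), and (3). The tokenization, embedding and hidden sizes, and the single-transition loss are illustrative placeholders rather than the exact configuration used in our experiments; in particular, the maximum in the TD target is taken over the valid actions at the next observation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DRRN(nn.Module):
    """Q-network Q_phi(o, a): two GRU encoders f_o, f_a and an MLP decoder g, as in Eq. (1)."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.f_o = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # observation encoder
        self.f_a = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # action encoder
        self.g = nn.Sequential(                                     # MLP decoder
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

    def q_values(self, obs_ids, act_ids):
        """obs_ids: (1, T_o) token ids; act_ids: (N, T_a) token ids for N candidate actions."""
        _, h_o = self.f_o(self.embed(obs_ids))            # (1, 1, hidden_dim)
        _, h_a = self.f_a(self.embed(act_ids))            # (1, N, hidden_dim)
        h_o = h_o.squeeze(0).expand(act_ids.size(0), -1)  # broadcast observation to all actions
        h_a = h_a.squeeze(0)
        return self.g(torch.cat([h_o, h_a], dim=-1)).squeeze(-1)  # (N,) Q-values


def td_loss(model, obs, acts, act_idx, reward, next_obs, next_acts, gamma=0.9):
    """Single-transition TD loss of Eq. (2), with the target max over the next valid actions."""
    q = model.q_values(obs, acts)[act_idx]
    with torch.no_grad():
        target = reward + gamma * model.q_values(next_obs, next_acts).max()
    return (target - q) ** 2


def sample_action(model, obs, acts):
    """Sample a candidate-action index from the softmax exploration policy of Eq. (3)."""
    probs = F.softmax(model.q_values(obs, acts), dim=0)
    return torch.multinomial(probs, 1).item()
```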
Reducing Semantics via Minimizing Observation (MIN-OB)

Unlike other RL domains such as video games or robotics control, at each step of a text game the (valid) action space is constantly changing, and it reveals useful information about the current state. For example, knowing that "unlock box" is valid leaks the existence of a locked box. Also, action semantics sometimes indicate an action's value even unconditionally on the state, e.g. "pick gold" usually seems good. Given these observations, we minimize the observation to only a location phrase $o \mapsto \mathrm{loc}(o)$ (Figure 1(b)) to isolate the action semantics: $Q_\phi^{\mathrm{loc}}(o, a) = g(f_o(\mathrm{loc}(o)), f_a(a))$.

Breaking Semantics via Hashing (HASH)

The GRU encoders $f_o$ and $f_a$ in the Q-network (1) generally ensure that similar texts (e.g. differing by a single word) are given similar representations, and therefore similar values. To study whether such semantics continuity is useful, we break it by hashing observation and action texts. Specifically,

| Game      | DRRN       | MIN-OB     | HASH       | INV-DY     | Max |
|-----------|------------|------------|------------|------------|-----|
| balances  | 10 / 10    | 10 / 10    | 10 / 10    | 10 / 10    | 51  |
| deephome  | 57 / 66    | 8.5 / 27   | 58 / 67    | 57.6 / 67  | 300 |
| detective | 290 / 337  | 86.3 / 350 | 290 / 317  | 290 / 323  | 360 |
| dragon    | -5.0 / 6   | -5.4 / 3   | -5.0 / 7   | -2.7 / 8   | 25  |
| enchanter | 20 / 20    | 20 / 40    | 20 / 30    | 20 / 30    | 400 |
| inhumane  | 21.1 / 45  | 12.4 / 40  | 21.9 / 45  | 19.6 / 45  | 90  |
| library   | 15.7 / 21  | 12.8 / 21  | 17 / 21    | 16.2 / 21  | 30  |
| ludicorp  | 12.7 / 23  | 11.6 / 21  | 14.8 / 23  | 13.5 / 23  | 150 |
| omniquest | 4.9 / 5    | 4.9 / 5    | 4.9 / 5    | 5.3 / 5    | 50  |
| pentari   | 26.5 / 45  | 21.7 / 45  | 51.9 / 60  | 37.2 / 50  | 70  |
| zork1     | 39.4 / 53  | 29 / 46    | 35.5 / 50  | 43.1 / 87  | 350 |
| zork3     | 0.4 / 4.5  | 0.0 / 4    | 0.4 / 4    | 0.4 / 4    | 7   |
| Avg. Norm | .21 / .38  | .12 / .35  | .25 / .39  | .23 / .40  |     |

Table 1: Final/maximum score of different models.

Figure 2: Transfer results from ZORK I (ZORK I to ZORK III transfer; score, maximum 7, over 10,000 training steps for DRRN (base), DRRN (hash), and DRRN (invdy)).

And during experience replay, these two losses are
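As a concrete illustration of the HASH ablation introduced above, the sketch below maps each observation or action string to a short identifier derived from its hash before encoding, so that lexically similar texts no longer receive similar representations (cf. Figure 1(c)). The particular hash function and identifier format are illustrative assumptions, not necessarily those used in our implementation.

```python
import hashlib


def hash_text(text: str, num_hex_digits: int = 4) -> str:
    """Replace a text with a deterministic, non-semantic identifier (cf. Figure 1(c)).

    MD5 truncated to a few hex digits is an illustrative choice; any deterministic hash
    that destroys lexical similarity between related texts would serve the same purpose.
    """
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return "0x" + digest[:num_hex_digits]


# Observations that differ by a single word map to unrelated identifiers,
# breaking the representation continuity provided by the GRU encoders.
obs_a = "You are in the living room. There is a doorway to the east."
obs_b = "You are in the living room. There is a doorway to the west."
print(hash_text(obs_a), hash_text(obs_b))  # two unrelated hex tokens

# In the HASH agent, hash_text(o) and hash_text(a) are fed to the same
# encoders in place of the original observation and action texts.
```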