Bootstrapped Q-learning with Context Relevant Observation Pruning to Generalize in Text-based Games

Subhajit Chaudhury, Daiki Kimura, Kartik Talamadupula, Michiaki Tatsubori, Asim Munawar, Ryuki Tachibana
IBM Research

Abstract

We show that Reinforcement Learning (RL) methods for solving Text-Based Games (TBGs) often fail to generalize on unseen games, especially in small data regimes. To address this issue, we propose Context Relevant Episodic State Truncation (CREST) for irrelevant token removal in observation text for improved generalization. Our method first trains a base model using Q-learning, which typically overfits the training games. The base model's action token distribution is used to perform observation pruning that removes irrelevant tokens. A second bootstrapped model is then retrained on the pruned observation text. Our bootstrapped agent shows improved generalization in solving unseen TextWorld games, using 10x-20x fewer training games than previous state-of-the-art (SOTA) methods, while also requiring fewer training episodes.[1]

Goal: Who's got a virtual machine and is about to play through a fast paced round of textworld? You do! Retrieve the coin in the balmy kitchen.
Observation: You've entered a studio. You try to gain information on your surroundings by using a technique you call "looking." You need an unguarded exit? You should try going east. You need an unguarded exit? You should try going south. You don't like doors? Why not try going west, that entranceway is unblocked.
Bootstrapped Policy Action: go south

Figure 1: Our method retains context-relevant tokens from the observation text (shown in green) while pruning irrelevant tokens (shown in red). A second policy network re-trained on the pruned observations generalizes better by avoiding overfitting to unwanted tokens.

1 Introduction

Reinforcement Learning (RL) methods are increasingly being used for solving sequential decision-making problems from natural language inputs, such as text-based games (Narasimhan et al., 2015; He et al., 2016; Yuan et al., 2018; Zahavy et al., 2018), chat-bots (Serban et al., 2017), and personal conversation assistants (Dhingra et al., 2017; Li et al., 2017; Wu et al., 2016). In this work, we focus on Text-Based Games (TBGs), which require solving goals like "Obtain coin from the kitchen" based on a natural language description of the agent's observation of the environment. To interact with the environment, the agent issues text-based action commands ("go west"), upon which it receives a reward signal used for training the RL agent. TBGs serve as testbeds for interactive real-world tasks such as virtual-navigation agents on cellular phones at a shopping mall, with user rating as the reward.

Traditional text-based RL methods focus on the problems of partial observability and large action spaces. However, the topic of generalization to unseen TBGs is less explored in the literature. We show that previous RL methods for TBGs often generalize poorly to unseen test games. We hypothesize that such overfitting is caused by the presence of irrelevant tokens in the observation text, which can lead to action memorization. To alleviate this problem, we propose CREST, which first trains an overfitted base model on the original observation text of the training games using Q-learning. Subsequently, we apply observation pruning to each training game, such that observation tokens that are not semantically related to the base policy's action tokens are removed. Finally, we re-train a bootstrapped policy on the pruned observation text using Q-learning, which improves generalization by removing irrelevant tokens. Figure 1 shows an illustrative example of our method. Experimental results on TextWorld games (Côté et al., 2018) show that our proposed method generalizes to unseen games using almost 10x-20x fewer training games than SOTA methods, and features significantly faster learning.

[1] Our code is available at: www.github.com/IBM/context-relevant-pruning-textrl
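The CREST pipeline described above (train a base policy with Q-learning, prune observation tokens that are not semantically related to its action tokens, then retrain a bootstrapped policy on the pruned text) can be summarized with a short sketch. The snippet below is a minimal illustration, not the released implementation: the helper names (`token_relevance`, `prune_observation`), the threshold value, and the max-over-action-tokens aggregation are assumptions; the paper scores relevance with ConceptNet/word2vec/GloVe embeddings against the base policy's action token distribution and applies a threshold.

```python
import numpy as np

def token_relevance(obs_tokens, action_tokens, embed, threshold=0.4):
    """Keep an observation token if its embedding is similar enough to any
    action token produced by the (overfitted) base policy.
    `embed` maps a token to a unit-norm vector (e.g., ConceptNet/GloVe);
    `threshold` is a hypothetical cutoff, tuned on validation games."""
    keep = []
    for tok in obs_tokens:
        scores = [float(np.dot(embed(tok), embed(a))) for a in action_tokens]
        keep.append(max(scores) >= threshold)
    return keep

def prune_observation(obs_tokens, action_tokens, embed, threshold=0.4):
    """Return the pruned observation used to retrain the bootstrapped policy."""
    mask = token_relevance(obs_tokens, action_tokens, embed, threshold)
    return [tok for tok, m in zip(obs_tokens, mask) if m]

# Example in the spirit of Figure 2: with base-policy action tokens {"go", "east"},
# obs = "typical kind of place there is an exit to the east".split()
# prune_observation(obs, ["go", "east"], embed)  ->  roughly ["exit", "east"]
```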
Figure 2: (a) Overview of the Context Relevant Episodic State Truncation (CREST) module, which compares the base model's episodic action token distribution (e.g., "go", "west", ..., "coin") with the observation tokens using ConceptNet-based similarity scores, thresholds the resulting Token Relevance Distribution to prune the observation text (e.g., "typical kind of place there is an exit to the east" becomes "exit east"), and trains a bootstrapped policy on the pruned text using Q-learning. Our method shows better generalization from 10x-20x fewer training games and faster learning with fewer episodes on (b) "easy" and (c) "medium" validation games.

2 Related Work

LSTM-DQN (Narasimhan et al., 2015) is the first work on text-based RL combining natural language representation learning and deep Q-learning. LSTM-DRQN (Yuan et al., 2018) is the state-of-the-art on TextWorld CoinCollector games, and addresses the issue of partial observability by using memory units in the action scorer. Fulda et al. (2017) proposed a method for affordance extraction via word embeddings trained on a Wikipedia corpus. AE-DQN (Action-Elimination DQN), a combination of a deep RL algorithm with an action-eliminating network for sub-optimal actions, was proposed by Zahavy et al. (2018). Recent methods (Adolphs and Hofmann, 2019; Ammanabrolu and Riedl, 2018; Ammanabrolu and Hausknecht, 2020; Yin and May, 2019; Adhikari et al., 2020) use various heuristics to learn better state representations for efficiently solving complex TBGs.

3 Our Method

3.1 Base model

We consider the standard sequential decision-making setting: a finite-horizon Partially Observable Markov Decision Process (POMDP), represented as $(s, a, r, s')$, where $s$ is the current state, $s'$ the next state, $a$ the current action, and $r(s, a)$ the reward function. The agent receives a state description $s_t$ that is a combination of text describing the agent's observation and the goal statement. The action is a combination of a verb and an object, such as "go north", "take coin", etc.

The overall model has two modules: a representation generator and an action scorer, as shown in Figure 2. The observation tokens are fed to the embedding layer, which produces a sequence of vectors $x^t = \{x^t_1, x^t_2, \ldots, x^t_{N_t}\}$, where $N_t$ is the number of tokens in the observation text for time-step $t$. We obtain hidden representations of the input embedding vectors using an LSTM model as $h^t_i = f(x^t_i, h^t_{i-1})$. We compute a context vector (Bahdanau et al., 2014) using attention on the $j$-th input token as

$$e^t_j = v^\top \tanh(W_h h^t_j + b_{attn}) \quad (1)$$
$$\alpha^t_j = \mathrm{softmax}(e^t_j) \quad (2)$$

where $W_h$, $v$, and $b_{attn}$ are learnable parameters. The context vector at time-step $t$ is computed as the weighted sum of the hidden vectors, $c^t = \sum_{j=1}^{N_t} \alpha^t_j h^t_j$. The context vector is fed into the action scorer, where two multi-layer perceptrons (MLPs), $Q(s, v)$ and $Q(s, o)$, produce the Q-values over available verbs and objects from a shared MLP's output. The original works of Narasimhan et al. (2015) and Yuan et al. (2018) do not use the attention layer. LSTM-DRQN replaces the shared MLP with an LSTM layer so that the action scorer retains memory across time-steps (cf. Section 2).
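For concreteness, here is a compact sketch of the base representation generator and action scorer described above: an embedding layer, an LSTM encoder, attention as in Eqs. (1)-(2), and two Q-value heads over verbs and objects. This is an illustrative PyTorch re-implementation under assumed hyper-parameters (embedding and hidden sizes, vocabulary handles), not the authors' released code; the recurrent action scorer of LSTM-DRQN is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BaseQNetwork(nn.Module):
    """Representation generator + action scorer (illustrative sketch).
    Layer sizes and vocabulary handles are assumptions, not the paper's exact values."""
    def __init__(self, vocab_size, n_verbs, n_objects, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        # Attention parameters: e_j = v^T tanh(W_h h_j + b_attn)   (Eq. 1)
        self.W_h = nn.Linear(hid_dim, hid_dim)          # bias plays the role of b_attn
        self.v = nn.Linear(hid_dim, 1, bias=False)
        # Shared MLP followed by two Q-value heads: Q(s, v) and Q(s, o)
        self.shared = nn.Linear(hid_dim, hid_dim)
        self.q_verb = nn.Linear(hid_dim, n_verbs)
        self.q_object = nn.Linear(hid_dim, n_objects)

    def forward(self, obs_tokens):                       # obs_tokens: (batch, N_t) token ids
        x = self.embed(obs_tokens)                       # (batch, N_t, emb_dim)
        h, _ = self.encoder(x)                           # (batch, N_t, hid_dim)
        e = self.v(torch.tanh(self.W_h(h))).squeeze(-1)  # attention scores, Eq. 1
        alpha = F.softmax(e, dim=-1)                     # attention weights, Eq. 2
        c = (alpha.unsqueeze(-1) * h).sum(dim=1)         # context vector c^t
        z = torch.relu(self.shared(c))
        return self.q_verb(z), self.q_object(z)          # Q(s, v), Q(s, o)
```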
Table 1: Average success rate of various methods on 20 unseen test games. Experiments were repeated on three random seeds. Our method, trained on almost 20x fewer training games, achieves a similar success rate to state-of-the-art methods.

                               Easy                  Medium                Hard
Methods                        N25   N50   N500      N50   N100  N500      N50   N100
LSTM-DQN (no att)              0.0   0.03  0.33      0.0   0.0   0.0       0.0   0.0
LSTM-DRQN (no att)             0.17  0.53  0.87      0.02  0.0   0.25      0.0   0.0
LSTM-DQN (+attn)               0.0   0.03  0.58      0.0   0.0   0.0       0.0   0.0
LSTM-DRQN (+attn)              0.32  0.47  0.87      0.02  0.06  0.82      0.02  0.08
LSTM-DRQN (+attn+dropout)      0.58  0.80  1.0       0.02  0.37  0.85      0.0   0.33
Ours (ConceptNet+no att)       0.47  0.5   0.98      0.75  0.67  0.97      0.62  0.92
Ours (Word2vec+att)            0.67  0.82  1.0       0.57  0.92  0.95      0.77  0.92
Ours (Glove+att)               0.70  0.97  1.0       0.67  0.72  0.90      0.1   0.63
Ours (ConceptNet+att)          0.82  0.93  1.0       0.67  0.95  0.97      0.93  0.88

Example TextWorld observations (figure):
Observation: "You find yourself in a launderette. An usual kind of place. The room seems oddly familiar, as though it were only superficially different from the other rooms in the building. There is an exit ..."
Observation: "You've entered a cookhouse. You begin to take stock of what's in the room. You need an unguarded exit? You should try going north."
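Table 1 reports success rates averaged over 20 unseen test games and three random seeds. As an illustration of how such numbers can be aggregated (the exact evaluation script is not shown in this excerpt), the helper below assumes a hypothetical `run_episode(policy, game, seed)` interface that returns whether the agent reached the goal within the episode budget.

```python
from statistics import mean

def average_success_rate(policy, test_games, seeds, run_episode):
    """Average success rate over unseen test games and random seeds.
    `run_episode(policy, game, seed) -> bool` is an assumed interface."""
    per_seed_rates = []
    for seed in seeds:
        wins = [run_episode(policy, game, seed) for game in test_games]
        per_seed_rates.append(sum(wins) / len(test_games))
    return mean(per_seed_rates)
```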
