Multitasking Inhibits Semantic Drift
Athul Paul Jacob (MIT CSAIL)    Mike Lewis (Facebook AI Research)    Jacob Andreas (MIT CSAIL)
[email protected]    [email protected]    [email protected]

arXiv:2104.07219v1 [cs.CL] 15 Apr 2021

Abstract

When intelligent agents communicate to accomplish shared goals, how do these goals shape the agents' language? We study the dynamics of learning in latent language policies (LLPs), in which instructor agents generate natural-language subgoal descriptions and executor agents map these descriptions to low-level actions. LLPs can solve challenging long-horizon reinforcement learning problems and provide a rich model for studying task-oriented language use. But previous work has found that LLP training is prone to semantic drift (use of messages in ways inconsistent with their original natural language meanings). Here, we demonstrate theoretically and empirically that multitask training is an effective counter to this problem: we prove that multitask training eliminates semantic drift in a well-studied family of signaling games, and show that multitask training of neural LLPs in a complex strategy game reduces drift while improving sample efficiency.

[Figure 1 diagram: instructors for Task 1 (build dragons to win) and Task 2 (build spearmen to win) send commands such as "build a dragon" and "attack with spearman" to a single-task executor (standard training) and a multitask executor (multitask training); see caption.]

Figure 1: In latent language policies, instructor agents (a) send natural-language commands (b) to executor agents (c), which execute them in an interactive environment (d). Jointly trained instructor–executor pairs learn to use messages in ways inconsistent with their natural language meanings (top shows a real message–action pair from a model described in Section 4). We show that multitask training with a population of task-specific instructors stabilizes message semantics and in some cases improves model performance.

1 Introduction

A major goal in the study of artificial and natural intelligence is to understand how language can scaffold more general problem-solving skills (e.g. Spelke, 2017), and how these skills in turn shape language itself (e.g. Gibson et al., 2017). In NLP and machine learning, latent language policies (LLPs; Andreas et al., 2018) provide a standard framework for studying these questions. An LLP consists of instructor and executor subpolicies: the instructor generates natural language messages (e.g. high-level commands or subgoals), and the executor maps these messages to sequences of low-level actions (Fig. 1). LLPs have been used to construct interactive agents capable of complex reasoning (e.g. programming by demonstration) and planning over long horizons (e.g. in strategy games; Hu et al., 2019). They promise an effective and interpretable interface between planning and control.

However, they present a number of challenges for training. As LLPs employ a human-specified space of high-level commands, they must be initialized with human supervision, typically obtained by pretraining the executor. On its own, this training paradigm restricts the quality of the learned executor policy to that exhibited in (possibly suboptimal) human supervision. For tasks like the real-time strategy game depicted in Fig. 1, we would like to study LLPs trained via reinforcement learning (RL), jointly learning from a downstream reward signal, and optimizing both instructors and executors for task success rather than fidelity to human teachers.
Training LLPs via RL has proven difficult. Past work has identified two main challenges: primarily, the LLP-specific problem of semantic drift, in which agents come to deploy messages in ways inconsistent with their original (natural language) meanings (Lewis et al., 2017; Lee et al., 2019); secondarily, the general problem of sample inefficiency in RL algorithms (Kakade et al., 2003; Brunskill and Li, 2013). Model-free deep RL is particularly notorious for requiring enormous amounts of interaction with the environment (Munos et al., 2016; Mnih et al., 2013b). For LLPs to meet their promise as flexible, controllable, and understandable tools for deep learning, better approaches are needed to limit semantic drift and perhaps improve sample efficiency.

While semantic change is a constant and well-documented feature of human languages (McMahon and April, 1994), (human) word meanings are on the whole remarkably stable relative to the rate of change in the tasks for which words are deployed (Karjus et al., 2020). In particular, disappearance of lexical items is mitigated by increased population size (Bromham et al., 2015) and increased frequency of use (Pagel et al., 2007). Drawing on these facts about stabilizing factors in human language, we hypothesize that training of machine learning models with latent language variables can be made more robust by incorporating a population of instructors with diverse communicative needs that exercise different parts of the lexicon.

We describe a multitask LLP training scheme in which task-specific instructors communicate with a shared executor. We show that complex long-horizon LLPs can be effectively tuned via joint reinforcement learning of instructors and executors using multitask training:

• Section 3 presents a formal analysis of LLP training as an iterated Lewis signaling game (Lewis, 1969). By modeling learning in this game as a dynamical system, we completely characterize a class of simple policies that are subject to semantic drift. We show that a particular multitask training scheme eliminates the set of initializations that undergo semantic drift.

• Section 4 evaluates the empirical effectiveness of multitask learning in a real-time strategy game featuring rich language, complex dynamics, and LLPs implemented with deep neural networks. Again, we show that multitask training reduces semantic drift (and improves sample efficiency) of LLPs in multiple game variants.

Together, these results show that diverse goals and communicative needs can facilitate (and specifically stabilize) learning of communication strategies.

2 Background and Related Work

Deep reinforcement learning (DRL) has recently made impressive progress on many challenging domains such as games (Mnih et al., 2013a; Silver et al., 2016), locomotion (Schulman et al., 2015) and dexterous manipulation tasks (Gu et al., 2016; Rajeswaran et al., 2017). However, even state-of-the-art approaches to reinforcement learning struggle with tasks involving complex goals, sparse rewards, and long time horizons. A variety of models and algorithms for hierarchical reinforcement learning have been proposed to address this challenge (Dayan and Hinton, 1993; Dietterich, 2000; Richard et al., 1999; Bacon et al., 2017) via supervised or unsupervised training of a fixed, discrete set of sub-policies. Language can express arbitrary goals, and has compositional structure that allows generalization across commands. Building on this intuition, several recent papers have explored hierarchical RL in which natural language is used to parameterize the space of high-level actions (Oh et al., 2017; Andreas et al., 2017; Shu et al., 2018; Jiang et al., 2019; Hu et al., 2019). While there are minor implementation differences among these approaches, we will refer to them collectively as latent language policies (LLPs). Like other hierarchical agents, an LLP consists of a pair of subpolicies: an instructor I(m | o) and an executor E(a | m, o). An LLP takes actions by first sampling a string-valued message m ∼ I from the instructor, and then an action a ∼ E from the executor. For these messages to correspond to natural language, rather than arbitrary strings, policies need some source of information about what human language users think they mean. This is typically accomplished by pretraining executors via human demonstrations or reinforcement learning; here we focus on the ingredients of effective joint RL of instructors and executors.
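As a concrete illustration of this two-level sampling, the sketch below instantiates I(m | o) and E(a | m, o) as toy tabular policies (in the experiments of Section 4 both are deep neural networks). All observation, message, and action names here are hypothetical and chosen for illustration; this is not the paper's implementation:

```python
import random

# Toy instructor I(m | o): observation -> distribution over messages.
# The observations, messages, and probabilities are illustrative.
INSTRUCTOR = {
    "enemy_builds_dragons":  {"build a spearman": 0.9, "build a dragon": 0.1},
    "enemy_builds_spearmen": {"build a dragon": 0.8, "build a spearman": 0.2},
}

# Toy executor E(a | m, o): (message, observation) -> distribution over
# low-level actions. Pretraining this map on human demonstrations is
# what anchors each message to its natural-language meaning.
EXECUTOR = {
    ("build a spearman", "enemy_builds_dragons"):  {"train_spearman": 1.0},
    ("build a spearman", "enemy_builds_spearmen"): {"train_spearman": 1.0},
    ("build a dragon",   "enemy_builds_dragons"):  {"train_dragon": 1.0},
    ("build a dragon",   "enemy_builds_spearmen"): {"train_dragon": 1.0},
}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, weights = zip(*dist.items())
    return random.choices(outcomes, weights=weights)[0]

def llp_act(observation):
    """One LLP step: sample m ~ I(m | o), then a ~ E(a | m, o)."""
    message = sample(INSTRUCTOR[observation])
    action = sample(EXECUTOR[(message, observation)])
    return message, action

print(llp_act("enemy_builds_dragons"))
```

Joint RL would adjust both tables toward whatever maximizes reward, and nothing in the reward itself ties the string "build a spearman" to spearmen; this is exactly the opening through which semantic drift enters.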
Reinforcement learning has been widely used to improve supervised language generation policies, particularly for dialogue (Li et al., 2016; Lewis et al., 2017), translation (Ranzato et al., 2015; Wu et al., 2016) and summarization (Stiennon et al., 2020). Here, we instead focus on models where language is a latent variable as part of a hierarchical policy for a non-linguistic task.

[Figure 2 diagram: instructor I(m | o) and executor E(a | m); messages m1 = "push red", m2 = "push blue"; reward tables R and R′ over observations {o1, o2} and actions {a1, a2}.]

Figure 2: A signaling game with two possible observations, two possible messages, and two possible actions. The instructor observes either a triangle or a square, then sends a message to an executor, who pushes either the red or the blue button. The players' reward depends on the observation and the action but not the message. Two possible reward functions, R and R′, are shown at right.

As noted in Section 1, an observed shortcoming of reinforcement learning in all these settings is its susceptibility to semantic drift. In the literature on human language change (Blank, 1999), semantic drift refers to a variety of phenomena, including specific terms becoming more general, general terms becoming specific, and parts coming to refer to wholes. In machine learning, it refers broadly to the use of messages inconsistent with their natural language meanings in language-generation policies (Lazaridou et al., 2020). Lee et al. (2019) mitigate semantic drift in pivot-based machine translation by using visual grounding. Here, we show that multitask training improves the faithfulness and adaptability of learned language understanding models, even when optimizing for a downstream reward.

3 Multitask Communication in Theory: Lewis Signaling Games

We begin our analysis with the simple signaling game depicted in Fig. 2. In this game, one agent receives an observation, then sends a message to another agent, which then performs an action. Signaling games like this one are widely studied in NLP as models of reference resolution and language generation (Frank and Goodman, 2012).
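To make the object of analysis concrete, here is a minimal sketch of this game with stochastic tabular players, reading the reward tables off Fig. 2 (R rewards a1 under o1 and a2 under o2; R′ is the reverse). The multitask evaluation at the end, in which two task-specific instructors share one executor, is an illustrative assumption rather than the paper's exact training scheme:

```python
import numpy as np

# Two observations, two messages, two actions (Fig. 2).
# R[o, a] rewards a1 under o1 and a2 under o2; R' is the reverse.
R       = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
R_prime = 1.0 - R

# Stochastic tabular policies (rows sum to 1).
# instructor[o, m] = I(m | o); executor[m, a] = E(a | m) -- in this
# game the executor sees only the message, not the observation.
instructor = np.array([[0.9, 0.1],    # o1 -> mostly "push red"
                       [0.1, 0.9]])   # o2 -> mostly "push blue"
executor   = np.array([[0.9, 0.1],    # "push red"  -> mostly a1
                       [0.1, 0.9]])   # "push blue" -> mostly a2

def expected_reward(instr, execu, reward, obs_prior=(0.5, 0.5)):
    """E[reward] when o ~ obs_prior, m ~ I(m | o), a ~ E(a | m)."""
    joint = instr @ execu  # joint[o, a] = sum_m I(m | o) * E(a | m)
    return float(np.dot(obs_prior, (joint * reward).sum(axis=1)))

# Single-task objective under R.
print(expected_reward(instructor, executor, R))

# Multitask objective: a second, task-specific instructor (here, one
# that swaps the two messages) is paired with the SAME executor and
# scored against R'. Re-purposing a message now hurts one of the tasks.
inst_for_R_prime = instructor[::-1]
multitask = 0.5 * (expected_reward(instructor, executor, R)
                   + expected_reward(inst_for_R_prime, executor, R_prime))
print(multitask)
```

Because both tasks route through the same executor, any drift in how the executor interprets a message lowers the expected reward of at least one task; this is the intuition behind the formal analysis in this section.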