Arxiv:1906.10187V3 [Cs.AI] 19 Nov 2019 Agent Who Knows the Task and Serves As a Human Surrogate

Learning to Interactively Learn and Assist

Mark Woodward∗, Chelsea Finn, Karol Hausman Google Brain, Mountain View {markwoodward,chelseaf,karolhausman}@google.com

Abstract The effort required by the human user is affected by many aspects of the learning problem, including restrictions on When deploying autonomous agents in the real world, we when the agent is allowed to act and restrictions on the be- need effective ways of communicating objectives to them. Traditional skill learning has revolved around reinforcement havior space of either human or agent, such as limiting the and imitation learning, each with rigid constraints on the for- user feedback to rewards or demonstrations. We consider a mat of information exchanged between the human and the setting where both the user and the agent are allowed to act agent. While scalar rewards carry little information, demon- throughout learning, which we refer to as interactive learn- strations require significant effort to provide and may carry ing. Unlike collecting a set of demonstrations before train- more information than is necessary. Furthermore, rewards ing, interactive learning allows the user to selectively act and demonstrations are often defined and collected before only when it deems the information is necessary and use- training begins, when the human is most uncertain about what ful, reducing the user’s effort. Examples of such interactions information would help the agent. In contrast, when humans include allowing user interventions or agent requests, for communicate objectives with each other, they make use of demonstrations (Kelly et al. 2018), rewards (Warnell et al. a large vocabulary of informative behaviors, including non- verbal communication, and often communicate throughout 2018; Arumugam et al. 2019), or preferences (Christiano et learning, responding to observed behavior. In this way, hu- al. 2017). While these methods allow the user to provide mans communicate intent with minimal effort. In this pa- feedback throughout learning, the communication interface per, we propose such interactive learning as an alternative is restricted to structured forms of supervision, which may to reward or demonstration-driven learning. To accomplish be inefficient for a given situation. For example, in a dish- this, we introduce a multi-agent training framework that en- washer unloading task, given the history of learning, it may ables an agent to learn from another agent who knows the be sufficient to point at the correct drawer rather than pro- current task. Through a series of experiments, we demon- vide a full demonstration. strate the emergence of a variety of interactive learning be- To this end, we propose to allow the agent and the user haviors, including information-sharing, information-seeking, and question-answering. Most importantly, we find that our to exchange information through an unstructured interface. approach produces an agent that is capable of learning inter- To do so, the agent and the user need a common prior un- actively from a human user, without a set of explicit demon- derstanding of the meaning of different unstructured inter- strations or a reward function, and achieving significantly bet- actions, along with the context of the space of tasks that the ter performance cooperatively with a human than a human user cares about. Indeed, when humans communicate tasks performing the task alone. to each other, they come in with rich prior knowledge and common sense about what the other person may want and 1 Introduction how they may communicate that, enabling them to communicate concepts effectively and efficiently (Peloquin, Good- Many tasks that we would like our agents to perform, such as man, and Frank 2019). unloading a dishwasher, straightening a room, or restocking In this paper, we propose to allow the agent to acquire shelves are inherently user-specific, requiring information this prior knowledge through joint pre-training with another from the user in order to fully learn all the intricacies of the arXiv:1906.10187v3 [cs.AI] 19 Nov 2019 agent who knows the task and serves as a human surrogate. task. The traditional paradigm for agents to learn such tasks The agents are jointly trained on a variety of tasks, where is through rewards and demonstrations. However, iterative actions and observations are restricted to the physical envi- reward engineering with untrained human users is imprac- ronment. Since the first agent is available to assist, but only tical in real-world settings, while demonstrations are often the second agent is aware of the task, interactive learning be- burdensome to provide. In contrast, humans learn from a va- haviors should emerge to accomplish the task efficiently. We riety of interactive communicative behaviors, including non- hypothesize that, by restricting the action and observation verbal gestures and partial demonstrations, each with their spaces to the physical environment, the emerged behaviors own information capacity and effort. Can we enable agents can transfer to learning from a human user. An added bene- to learn tasks from humans through such unstructured inter- fit of our framework is that, by training on a variety of tasks action, requiring minimal effort from the human user? from the target task domain, much of the non-user specific ∗Work done as a part of the Goolge AI Residency task prior knowledge is pre-trained into the agent, further Copyright c 2020, Association for the Advancement of Artificial reducing the effort required by the user. Intelligence (www.aaai.org). All rights reserved. We evaluate various aspects of agents trained with our (a) After 100 gradient steps (b) After 1k gradient steps (c) After 40k gradient steps (d) With human principal

Figure 1: Episode traces after 100, 1k, and 40k pre-training steps for the cooperative fruit collection domain of Experiment 4. The principal agent “P” (pink) is told the fruit to be collected, lemons or plums, in its observations. Within an episode, the assistant agent “A” (blue) must infer the fruit to be collected from observations of the principal. Each agent observes an overhead image of itself and its nearby surroundings. By the end of training (c) the assistant is inferring the correct fruit and the agents are coordinating. This inference and coordination transfers to human principals (d). An interactive game and videos for all experiments are available at: https://interactive-learning.github.io framework on several simulated object gathering task do- tories (Christiano et al. 2017; Mindermann et al. 2018; mains, including a domain with pixel observations, shown Ibarz et al. 2018). Here again, the interaction is restricted in Figure 1. We show that our trained agents exhibit emer- to a single modality. gent information-gathering behaviors in general and explicit Our work builds upon the idea of meta-learning, or question-asking behavior where appropriate. Further, we learning-to-learn (Schmidhuber 1987; Bengio, Bengio, and conduct a user study with trained agents, where the users Cloutier 1991; Thrun and Pratt 2012). Meta-learning for score significantly higher with the agent than without the control has been considered in the context of reinforcement agent, which demonstrates that our approach can produce learning (Duan et al. 2016; Wang et al. 2016; Finn, Abbeel, agents that can learn from and assist human users. and Levine 2017) and imitation learning (Duan et al. 2017; The key contribution of our work is a training framework Yu et al. 2018b). Our problem setting differs from these, as that allows agents to quickly learn new tasks from humans the agent is learning by observing and interacting with an- through unstructured interactions, without an explicitly- other agent, as opposed to using reinforcement or imitation provided reward function or demonstrations. Critically, our learning. In particular, our method builds upon recurrence- experiments demonstrate that agents trained with our frame- based meta-learning approaches (Santoro et al. 2016; Duan work generalize to learning test tasks from human users, et al. 2016; Wang et al. 2016) in the context of the multi- demonstrating interactive learning with a human in the loop. agent task setting. In addition, we introduce a novel multi-agent model archi- When a broader range of interactive behaviors is desired, tecture for cooperative multi-agent training that exhibits im- prior works have introduced a multi-agent learning compo- proved training characteristics. Finally, our experiments on nent (Potter and Jong 1994; Palmer et al. 2018). The fol- a series of object-gathering task domains illustrate a variety lowing methods are closely related to ours in that, during of emergent interactive learning behaviors and demonstrate training, they also maximize a joint reward function between that our method can scale to raw pixel observations. the agents and emerge cooperative behavior (Gupta, Egorov, and Kochenderfer 2017; Foerster et al. 2018; Foerster et al. 2 Related Work 2016; Lazaridou, Peysakhovich, and Baroni 2016; Andreas, The traditional means of passing task information to an Dragan, and Klein 2017). Multiple works (Gupta, Egorov, agent include specifying a reward function (Barto and Sut- and Kochenderfer 2017; Foerster et al. 2018) emerge co- ton 1998) that can be hand-crafted for the task (Singh, operative behavior but in task domains that do not require Lewis, and Barto 2009; Levine et al. 2016; Chebotar et al. knowledge transfer between the agents, while others (Foer- 2017) and providing demonstrations (Schaal 1999; Abbeel ster et al. 2016; Lazaridou, Peysakhovich, and Baroni 2016; and Ng 2004) before the agent starts training. More re- Lowe et al. 2017; Andreas, Dragan, and Klein 2017; Mor- cent works explore the concept of the human supervision datch and Abbeel 2018) all emerge communication over a being provided throughout training by either providing re- communication channel. Such communication is known to wards during training (Isbell et al. 2001; Thomaz et al. 2005; be difficult to interpret (Lazaridou, Peysakhovich, and Ba- Warnell et al. 2018; Perez-Dattari et al. 2018) or demon- roni 2016), without post-inspection (Mordatch and Abbeel strations during training; either continuously (Ross, Gordon, 2018) or a method for translation (Andreas, Dragan, and and Bagnell 2011b; Kelly et al. 2018) or at the agent’s dis- Klein 2017). Critically, none of these prior works conduct cretion (Ross, Gordon, and Bagnell 2011a; Borsa et al. 2017; user experiments to evaluate transfer to humans. Xu et al. 2018; Hester et al. 2018; James, Bloesch, and Davi- Mordatch and Abbeel (2018) experiment with tasks sim- son 2018; Yu et al. 2018a; Krening 2018; Brown, Cui, and ilar to ours, in which information must be communicated Niekum 2018). In all of these cases, however, the reward and between the agents, and communication is restricted to the demonstrations are the sole means of interaction. physical environment. This work demonstrates the emer- Another recent line of research involves the human ex- gence of pointing, demonstrations, and pushing behavior. pressing their preference between agent generated trajec- Unlike this prior approach, however, our algorithm does not partially observable Markov game as described in Section 3, with N = 2. To enable the agent to solve such tasks, we train the agent, whom we call the “assistant” (superscript A), jointly with another agent, whom we call the “principal” (superscript P ) on a variety of tasks. Critically, the principal’s observation function informs it of the task.1 The principal agent acts as a human surrogate which allows us to replace it with a human once the training is finished. By informing the principal of the current task and withholding rewards and gradient updates until the end of each task, the (a) MAIDRQN (b) MADDRQN agents are encouraged to emerge interactive learning behaviors in order to inform the assistant of the task and allow Figure 2: Information flow for the two models used in our experi- them to contribute to the joint reward. We limit actions and ments; red paths are only needed during training. The MADDRQN observations to the physical environment, with the hope of model (b) uses a centralized value-function with per-agent advan- emerging human-compatible behaviors. tage functions. The centralized value function is only used during training. Superscripts A and P refer to the assistant and principal In order to train the agents, we consider two different agents respectively. The MAIDRQN model (a) is used in experi- models. We first introduce a simple model that we find ments 1-3 and the MADDRQN model (b) is used in experiment 4 works well in tabular environments. Then, in order to scale where it exhibits superior training characteristics for learning from our approach to pixel observations, we introduce a modi- pixels. fication to the first model that we found was important in increasing the stability of learning. Multi-Agent Independent DRQN (MAIDRQN): require a differentiable environment. We also demonstrate The first model uses two deep recurrent Q-networks our method with pixel observations and conduct a user ex- (DRQN) (Hausknecht and Stone 2015) that are each trained i i i periment to evaluate transfer to humans. with Q-learning (Watkins 1989). Let Qθi (ot, at, ht) be the Laird et al. (2017) describe desiderata for interac- action-value function for agent i, which maps from the i tive learning systems. Our method primarily addresses the current action, observation, and history, ht, to the expected desiderata of efficient interaction and accessible interaction. discounted sum of future rewards. The MAIDRQN method optimizes the following loss:2 3 Preliminaries MAIDRQN 1 X i i i 2 L := [y − Q i (o , a , h )] (2) In this section, we review the cooperative partially observ- N t θ t t t able Markov game (Littman 1994), which serves as the foun- i,t i i i dation for tasks in Section 4. A cooperative partially observ- yt := rt + γ max Qθi (ot+1, at+1, ht+1) i ai able Markov game is defined by the tuple h S, {A }, T , R, t+1 {Ωi} {Oi} γ H i i ∈ {1..N} , , , , where indexes the agent The networks are trained simultaneously. The model archi- N S Ai Ωi among agents, , , and are state, action, and ob- tecture is a recurrent neural network, depicted in Figure 2a; T : S × {Ai} × S0 → servation spaces, R is the transi- see Section A.1 for details. We use this model for experi- R : S × {Ai} → tion function, R is the reward function, ments 1-3. Oi : S × Ωi → γ R are the observation functions, is the Multi-Agent Dueling DRQN (MADDRQN): With in- H discount factor, and is the horizon. dependent Q-Learning, as in MAIDRQN, the other agent’s T R Oi The functions , , and are not accessible to the changing behavior and unknown actions make it difficult to t {ai} ∈ agents. At time , the environment accepts actions t estimate the Bellman target y in Equation 2, which leads {Ai} s ∼ T (s , {ai}) t , samples t+1 t t , and returns reward to instability in training. This model addresses part of the r ∼ R(s , {ai}) {oi } ∼ {Oi(s )} t t t and observations t+1 t+1 . instability that is caused by unknown actions. The objective of the game is to choose actions to maximize If Q∗(o, a, h) is the optimal action-value func- the expected discounted sum of future rewards: tion, then the optimal value function is V ∗(o, h) = ∗ H argmaxa Q (o, a, h), and the optimal advantage function is X t−t0 ∗ ∗ ∗ argmax γ rt . (1) defined as A (o, a, h) = Q (o, a, h) − V (o, h) (Wang et Ei {ai |oi } s,o ,r al. 2015). The advantage function captures how inferior an t0 t0 t=t0 action is to the optimal action in terms of the expected sum Note that, while the action and observation spaces vary for of discounted future rewards. This allows us to express Q the agents, they share a common reward which leads to a cooperative task. 1Our tasks are similar to tasks in Hadfield-Menell et al. (2016), but with partially observable state and without access to the other 4 The LILA Training Framework agent’s actions, which should better generalize to learning from humans in natural environment. We now describe our training framework for producing an 2In our experiments we do not use a lagged “target” Q- assisting agent that can learn a task interactively from a hu- network (Mnih et al. 2013), but we do stop gradients through the man user. We define a task to be an instance of a cooperative Q-network in yt. Table 1: Experimental configurations for our 4 experiments. Experiment 1 has two sub experiments, 1A and 1B. In 1B, the agents incur a penalty whenever the principals moves. The Observation Window column lists the radius of cells visible to each agent.

PRINCIPAL GRID NUM. OBSERVATION EXP.MODEL MOTION OBSERVATIONS SHAPE OBJECTS WINDOW PENALTY 1A MAIDRQN 0.0 5X5 10 BINARY VECTORS FULL 1B MAIDRQN -0.4 5X5 10 BINARY VECTORS FULL 2 MAIDRQN 0.0 5X5 10 BINARY VECTORS 1-CELL 3 MAIDRQN -0.1 3X1 “L” 1 BINARY VECTORS 1-CELL 4 MADDRQN 0.0 5X5 10 64 × 64 × 3 PIXELS ∼2-CELLS

in a new form, Q∗(o, a, h) = V ∗(o, h) + A∗(o, a, h). the immediate reward due to the other agent, simplifying the We note that the value function is not needed optimization. ∗ when selecting actions: argmaxa Q (o, a, h) = We refer to this model as a multi-agent dueling deep re- ∗ ∗ ∗ argmaxa(V (o, h) + A (o, a, h)) = argmaxa A (o, a, h). current Q-network (MADDRQN), in reference to the single We leverage this idea by making the following approxi- agent dueling network of Wang et al. (2015). The MAD- mation to an optimal, centralized action-value function for DRQN model, which adds a fully connected network for multiple agents: the shared value function, is depicted in Figure 2b; See Sec- ∗ i i i ∗ i i ∗ i i i tion A.1 for details. The MADDRQN model is used in ex- Q ({o , a , h }) = V ({o , h }) + A ({o , a , h }) (3) periment 4. X ≈ V ∗({oi, hi}) + Ai∗(oi, ai, hi), Training Procedure: We use a standard episodic train- i ing procedure, with the task changing on each episode. The training procedure for the MADDRQN and MAIDRQN where Ai∗(oi, ai, hi) is an advantage function for agent i models differ only in the loss function. Here, we describe and V ∗({oi, hi}) is a joint value function.3 the training procedure with reference to the MADDRQN The training loss for this model is: model. We assume access to a subset of tasks, DT rain, from a task domain, D = {..., Tj, ...}. First, we initialize the pa- MADDRQN X i i i 2 L := [yt − Q{θi},φ({ot, at, ht})] (4) rameters θP , θA, and φ. Then, the following procedure is t repeated until convergence. A batch of tasks are uniformly i i i T rain yt := rt + γ max Q{θi},φ({ot+1, at+1, ht+1}) sampled from D . For each task Tb in the batch, a tra- {ai } P A P A P A P A t+1 jectory, τb = (o0 , o0 , a0 , a0 , r0..., oH , oH , aH , aH , rH ), is collected by playing out an episode in an environment con- where figured to Tb, with actions chosen -greedy according to AθP i i i i i X i i i Q{θi},φ({ot, at, ht}) := Vφ({ot, ht}) + Aθi (ot, at, ht). and AθA . The hidden states for the recurrent LSTM cells are i reset to 0 at the start of each episode. The loss for each tra- (5) jectory is calculated using Equations 4 and 5. Finally, a gra- Once trained, each agent selects their actions according to dient step is taken with respect to θP , θA, and φ on the sum their advantage function Aθi , of the episode losses. i i i at = argmax Aθi (ot, a, ht), (6) a 5 Experimental Results We designed a series of experiments in order to study how Q i as opposed to the Q-functions θ in the case of MAID- different interactive learning behaviors may emerge, to test DRQN. whether our method can scale to pixel observations, and to In the loss for the MAIDRQN model, Equation 2, there evaluate the ability for the agents to transfer to a setting with is a squared error term for each Qθi which depends on the a human user. joint reward r. This means that, in addition to estimating We conducted four experiments on grid-world environ- the immediate reward due their own actions, each Qθi must ments, where the goal was to cooperatively collect all ob- estimate the immediate reward due to the actions of the other jects from one of two object classes. Two agents, the prin- agent, without access to their actions or observations. By cipal and the assistant, act simultaneously and may move using a joint action value function and decomposing it into i in one of the four cardinal directions or may choose not to advantage functions and a value function, each A can ignore move, giving five possible actions per agent. 3 The approximation is due to the substitution of Within an experiment, tasks vary by the placement of ob- P i∗ i i i ∗ i i i jects, and by the class of objects to be collected, which we i A (o , a , h ) for A ({o , a , h }) in Equation 3, which implies that the agents’ current actions have independent effects on call the “target class”. The target class is supplied to the prin- expected future rewards, and is not true in general. Nevertheless, it cipal as a two dimensional, one-hot vector. If either agent is a useful approximation. enters a cell containing an object, the object disappears and Table 2: Results for Experiments 1A and 1B. Experiment 1B includes a motion penalty for the principal’s motion. In both experiments, MAIDRQN outperforms the principal acting alone, demonstrating that the assistant learns from and assists the principal. All performance increases are significant (confidence > 99%), except for FeedFwd-A and Solo-P in Experiment 1A, which are statistically equivalent.

EXPERIMENT 1A EXPERIMENT 1B METHOD JOINT REWARD REWARD JOINT REWARD REWARD NAME REWARD DUETO P DUETO A REWARD DUETO P DUETO A ORACLE-A 4.9 ± 0.2 2.5 ± 0.1 2.4 ± 0.1 4.0 ± 0.1 0.0 ± 0.0 4.0 ± 0.1 MAIDRQN 4.6 ± 0.2 3.3 ± 0.2 1.3 ± 0.3 3.6 ± 0.1 0.4 ± 0.1 3.2 ± 0.1 FEEDFWD-A 4.1 ± 0.1 4.1 ± 0.1 0.0 ± 0.04 2.0 ± 0.4 0.7 ± 0.3 1.3 ± 0.6 SOLO-P 4.0 ± 0.1 4.0 ± 0.1 N/A 1.2 ± 0.1 1.2 ± 0.1 N/A

4.0

3.5

3.0

2.5 reward 2.0 reward due to assistant 1.5 reward due to principal

Episode Reward 1.0

0.5 (a) The assistant learns from a (b) The assistant learns from a single principal movement. lack of principal movement. 0.0 0 50000 100000 150000 Episode Batch Figure 4: Episode traces of trained agents on test tasks from Exper- iment 1B. Both agents begin in the center square and cooperatively collect as many instances of the target shape as possible. The target Figure 3: Training curve for Experiment 1B. Error bars are 1 stan- shape is shown in green. The principal agent P observes the target dard deviation. At the end of training, nearly all of the joint reward in shape, but the assistant agent A does not and must learn from the prin- an episode is due to the assistant’s actions, indicating that the trained cipal’s movement or lack of movement. The assistant rapidly learns assistant can learn the task and then complete it independently. the target shape from the principal and collects all instances. both agents receive a reward: +1 for objects of the target that of a principal trained to act alone (Solo-P), and ap- class and −1 otherwise. Each episode consisted of a single proaches the optimal setting where the assistant also ob- task and lasted for 10 time-steps. Table 1 gives the setup for serves the target class (Oracle-A). Further, we see that the re- each experiment. ward due to the assistant is positive, and even exceeds the re- We collected 10 training runs per experiment, and we ward due to the principal when the motion penalty is present report the aggregated performance of the 10 trained agent (Experiment 1B). This demonstrates that the assistant learns pairs on 100 test tasks not seen during training. The training the task from the principal and assists the principal. Our ap- batch size was 100 episodes and the models were trained for proach also outperforms an ablation in which the assistant’s 150,000 gradient steps (Experiments 1-3) or 40,000 gradient LSTM is replaced with a feed forward network (FeedFwd- steps (Experiment 4). Videos for all experiments, as well as A), highlighting the importance of memory. an interactive game, are available on the paper website.5 Experiment 1 A&B – Learning and Assisting: In this Experiment 2 – Active Information Gathering: In this experiment we explore if the assistant can be trained to learn experiment we explore if, in the presence of additional par- and assist the principal. Table 2 shows the experimental re- tial observability, the assistant will take actions to actively sults without and with a penalty for motion of the principal seek out information. This experiment restricts the view of (Experiments 1A and 1B respectively). Figures 3 and 4 show each agent to a 1-cell window and only places objects around the learning curve and trajectory traces for trained agents in the exterior of the grid, requiring the assistant to move with Experiment 1B. the principal and observe its behavior, see Figure 5. Figure 6 The joint reward of our approach (MAIDRQN) exceeds shows trajectory traces for two test tasks. The average joint reward, reward due to the principal, and reward due to the 4The FeedForward assistant moves 80% of the time, but it never assistant are 4.7 ± 0.2, 2.8 ± 0.2, and 1.9 ± 0.1 respectively. collects an object. This shows that our training framework can produce infor- 5https://interactive-learning.github.io mation seeking behaviors. (a) Principal View (b) Assistant View (a) 2 step info. seek (b) 3 step info. seek

Figure 5: Visualization of the 1-cell observation window used in ex- Figure 6: Episode traces of trained agents on test tasks from Experi- periments 2 and 3. Cell contents outside of an agent’s window are ment 2. Both agents begin in the center square and cooperatively col- hidden from that agent. lect as many instances of the target shape as possible. The target shape is shown in green. The principal agent P observes the target shape, but the assistant agent A does not and must learn from the principal’s movement With restricted observations, the assistant moves with the principal until it observes a disambiguating action, and then proceeds to collect the target shape on its own.

Experiment 3 – Interactive Questioning and Answer- Table 3: Results for Experiment 4. Trained assistants learned ing: In this experiment we explore if there is a setting where from human principals and significantly increased their scores (Hu- explicit questioning and answering can emerge. On 50% of man&Agent) over the humans acting alone (Human), demonstrating the potential for our training framework to produce agents that the tasks, the assistant is allowed to observe the target class. can learn from and assist humans.7 This adds uncertainty for the principal, and discourages it from proactively informing the assistant. Figure 7 shows the first several states of tasks in which the assistant does not JOINT REWARD REWARD PLAYERS observe the target class.6 REWARD DUETO P DUETO A The emerged behavior is for the assistant to move into AGENT&AGENT 4.6 ± 0.2 2.6 ± 0.2 2.0 ± 0.2 the visual field of the principal, effectively asking the ques- HUMAN&AGENT 4.2 ± 0.4 2.9 ± 0.3 1.3 ± 0.5 tion, then the principal moves until it sees the object, and fi- AGENT 3.9 ± 0.1 3.9 ± 0.1 N/A nally answers the question by moving one step closer only if HUMAN 3.8 ± 0.3 3.8 ± 0.3 N/A the object should be collected. The average joint reward, reward due to the principal, and reward due to the assistant are 0.4±0.1, −0.1±0.1, and 0.5±0.1 respectively. This demonstrates that our framework can emerge question-answering, RQN. The failure rate was 64% vs 75% for each method interactive behaviors. respectively, and the mean failure time was 5.6 hours vs 9.7 hours (confidence > 99%), which saved training time and Experiment 4 – Learning from and Assisting a Human was a practical benefit. Principal with Pixel Observations: In this final experiment Table 3 shows the experimental results. The participants we explore if our training framework can extend to pixel ob- scored significantly higher with the assistant than without servations and whether the trained assistant can learn from a (confidence > 99%). This demonstrates that our framework human principal. Figure 8 shows examples of the pixel ob- can produce agents that can learn from humans. servations. Ten participants, who were not familiar with this While inclusion of the assistant increases the human’s research, were paired with the 10 trained assistants, and each score, it is still less than the score when the assistant acts played 20 games with the assistant and 20 games without the with the principal agent with which it was trained. What is assistant. Participants were randomly assigned which setting the cause of this gap? To answer this question, we identified to play first. Figure 1 shows trajectory traces on test tasks at that 12% of the time the assistant incorrectly infers which several points during training and with a human principal object to collect (episodes where the assistant always col- after training. lects the wrong object). If we exclude these episodes, we Unlike the previous experiments, stability was a challenge obtain the performance when the assistant has correctly in- in this problem setting; most training runs of MAIDRQN ferred the task, but must still coordinate with the human became unstable and dropped below 0.1 joint reward before principal. Humans with “correct” assistants achieve rewards the end of training. Hence, we chose to use the MADDRQN (4.7 ± −0.3, 2.8 ± 0.3, 1.9 ± 0.2) that are statistically model because we found it to be more stable than MAID- 7Significance is based on a t-test of the participants’ change in 6The test and training sets are the same in Experiment 3, since score, which is more significant than the table’s standard deviations there are only 8 possible tasks would suggest (confidence > 99%). (a) The square should be collected (green), but the assistant does not observe this (grey under green).

(b) The circle should not be collected (red), but the assistant does not observe this (grey under red)

Figure 7: Episode roll-outs for trained agents from Experiment 3. When the assistant is uncertain of an object, it requests information from the principal by moving into its visual ﬁeld and observing the response.

bility over MAIDRQN, which was a practical benefit in the experiments. The experiments demonstrate that, depending on the environment, LILA emerges behaviors such as demonstrations, partial demonstrations, information seeking, and question answering. Experiment 4 demonstrated that LILA scales to environments with pixel observations, and, crucially, that LILA is able to produce agents that can learn from and assist humans. A possible future extension involves training with popu- (a) Principal (b) Assistant lations of agents. In our experiments, the agents sometimes Figure 8: Example observations for experiment 4. The principal’s emerged overly co-adapted behaviors. For example, in Ex- observation also includes a 2 dimensional one-hot vector indicating periment 2, the agents tend to always move in the same the fruit to collect, plums in this case. These are the 7th observa- direction in the first time step, but the direction varies by tions from the human-agent trajectory in Figure 1d. the training run. This makes agents paired across runs less compatible, and less likely to generalize to human principals. We believe that training an assistant across popu- equivalent to Agent&Agent and statistically superior to Hu- lations of agents will reduce such co-adapted behaviors. man&Agent in Table 3. This means that the assistants coor- Finally, LILA’s emergence of behaviors means that the dinate equivalently with human principals and artificial prin- trained assistant can only learn from behaviors that emerged cipals, but can experience problems inferring the task from during training. Further research should seek to minimize human principals, resulting in the observed drop in score. these limitations, perhaps through advances in online meta- This is an example of co-adaptation to communicating with learning (Finn et al. 2019). the principal agent during training. The next section suggests Acknowledgments The authors are especially grateful to an approach to address such co-adaptation. Alonso Martinez for designing and iterating on the Unity environment used in Experiment 4. 6 Summary and Future Work We introduced the LILA training framework, which trains an References assistant to learn interactively from a knowledgeable prin- [Abbeel and Ng 2004] Abbeel, P., and Ng, A. Y. 2004. Ap- cipal through only physical actions and observations in the prenticeship learning via inverse reinforcement learning. In environment. LILA produces the assistant by jointly training Proceedings of the twenty-first international conference on it with a principal, who is made aware of the task through its Machine learning, 1. ACM. observations, on a variety of tasks, and restricting the observation and action spaces to the physical environment. We [Andreas, Dragan, and Klein 2017] Andreas, J.; Dragan, A.; further introduced the MADDRQN algorithm, in which the and Klein, D. 2017. Translating neuralese. arXiv preprint agents have individual advantage functions but share a value arXiv:1704.06960. function during training. MADDRQN showed improved sta- [Arumugam et al. 2019] Arumugam, D.; Lee, J. K.; Saskin, S.; and Littman, M. L. 2019. Deep reinforcement learn- reinforcement learning. In Proc. of the Conference on Neu- ing from policy-dependent human feedback. arXiv preprint ral Information Processing Systems (NIPS). arXiv:1902.04257. [Hausknecht and Stone 2015] Hausknecht, M., and Stone, P. [Barto and Sutton 1998] Barto, A., and Sutton, R. S. 1998. 2015. Deep recurrent q-learning for partially observable Reinforcement Learning: An Introduction. MIT Press. mdps. In AAAI Fall Symposium Series. [Bengio, Bengio, and Cloutier 1991] Bengio, Y.; Bengio, S.; [Hester et al. 2018] Hester, T.; Vecerik, M.; Pietquin, O.; and Cloutier, J. 1991. Learning a synaptic learning rule. In Lanctot, M.; Schaul, T.; Piot, B.; Horgan, D.; Quan, J.; Proc. of the International Joint Conference on Neural Net- Sendonaris, A.; Dulac-Arnold, G.; Osband, I.; Agapiou, J.; works (IJCNN). Leibo, J. Z.; and Gruslys, A. 2018. Deep q-learning from [Borsa et al. 2017] Borsa, D.; Piot, B.; Munos, R.; and demonstrations. In AAAI. Pietquin, O. 2017. Observational learning by reinforcement [Hochreiter and Schmidhuber 1997] Hochreiter, S., and learning. Schmidhuber, J. 1997. Long short-term memory. Neural [Brown, Cui, and Niekum 2018] Brown, D. S.; Cui, Y.; and Computation 9(8):1735–1780. Niekum, S. 2018. Risk-aware active inverse reinforcement [Ibarz et al. 2018] Ibarz, B.; Leike, J.; Pohlen, T.; Irving, G.; learning. In Proc. of Conference on Robot Learning (CoRL). Legg, S.; and Amodei, D. 2018. Reward learning from hu- [Chebotar et al. 2017] Chebotar, Y.; Hausman, K.; Zhang, man preferences and demonstrations in atari. In Proc. of M.; Sukhatme, G.; Schaal, S.; and Levine, S. 2017. Com- the Conference on Neural Information Processing Systems bining model-based and model-free updates for trajectory- (NIPS). centric reinforcement learning. In International Conference [Isbell et al. 2001] Isbell, C. L.; Shelton, C. R.; Kearns, M.; on Machine Learning (ICML). Singh, S. P.; and Stone, P. 2001. Cobot: A social reinforce- [Christiano et al. 2017] Christiano, P.; Leike, J.; Brown, ment learning agent. In Proc. of the Conference on Neural T. B.; Martic, M.; Legg, S.; and Amodei, D. 2017. Deep Information Processing Systems (NIPS). reinforcement learning from human preferences. In Proc. of [James, Bloesch, and Davison 2018] James, S.; Bloesch, M.; the Conference on Neural Information Processing Systems and Davison, A. J. 2018. Task-embedded control networks (NIPS). for few-shot imitation learning. In Proc. of the Conference [Duan et al. 2016] Duan, Y.; Schulman, J.; Chen, X.; on Robot Learning (CoRL). Bartlett, P. L.; Sutskever, I.; and Abbeel, P. 2016. Rl2: Fast [Kelly et al. 2018] Kelly, M.; Sidrane, C.; Driggs-Campbell, reinforcement learning via slow reinforcement learning. K.; and Kochenderfer, M. J. 2018. Hg-dagger: Interactive [Duan et al. 2017] Duan, Y.; Andrychowicz, M.; Stadie, B.; imitation learning with human experts. Ho, O. J.; Schneider, J.; Sutskever, I.; Abbeel, P.; and Zaremba, W. 2017. One-shot imitation learning. In Ad- [Kingma and Ba 2015] Kingma, D. P., and Ba, J. 2015. vances in neural information processing systems, 1087– Adam: A method for stochastic optimization. In Proc. of 1098. the International Conference for Learning Representations (ICLR). [Finn, Abbeel, and Levine 2017] Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast [Krening 2018] Krening, S. 2018. Newtonian action ad- adaptation of deep networks. In Proc. of the International vice: Integrating human verbal instruction with reinforce- Conference on Machine Learning (ICML). ment learning. [Finn et al. 2019] Finn, C.; Rajeswaran, A.; Kakade, S.; and [Laird et al. 2017] Laird, J. E.; Gluck, K.; Anderson, J.; Levine, S. 2019. Online meta-learning. arXiv preprint Forbus, K. D.; Jenkins, O. C.; Lebiere, C.; Salvucci, D.; arXiv:1902.08438. Scheutz, M.; Thomaz, A.; Trafton, G.; et al. 2017. Inter- active task learning. IEEE Intelligent Systems 32(4):6–21. [Foerster et al. 2016] Foerster, J. N.; Assael, Y. M.; de Fre- itas, N.; and Whiteson, S. 2016. Learning to communicate [Lazaridou, Peysakhovich, and Baroni 2016] Lazaridou, A.; with deep multi-agent reinforcement learning. In Proc. of Peysakhovich, A.; and Baroni, M. 2016. Multi-agent co- the Conference on Neural Information Processing Systems operation and the emergence of (natural) language. arXiv (NIPS). preprint arXiv:1612.07182. [Foerster et al. 2018] Foerster, J. N.; Farquhar, G.; Afouras, [LeCun, Bengio, and others 1995] LeCun, Y.; Bengio, Y.; T.; Nardelli, N.; and Whiteson, S. 2018. Counterfactual et al. 1995. Convolutional networks for images, speech, multi-agent policy gradients. In Thirty-Second AAAI Con- and time series. The handbook of brain theory and neural ference on Artificial Intelligence. networks 3361(10):1995. [Gupta, Egorov, and Kochenderfer 2017] Gupta, J. K.; [Levine et al. 2016] Levine, S.; Finn, C.; Darrell, T.; and Egorov, M.; and Kochenderfer, M. 2017. Cooperative Abbeel, P. 2016. End-to-end training of deep visuomo- multi-agent control using deep reinforcement learning. tor policies. The Journal of Machine Learning Research In Proc. of Autonomous Agents and Multiagent Systems 17(1):1334–1373. (AAMAS). [Littman 1994] Littman, M. L. 1994. Markov games as a [Hadfield-Menell et al. 2016] Hadfield-Menell, D.; Dragan, framework for multi-agent reinforcement learning. In Ma- A.; Abbeel, P.; and Russell, S. 2016. Cooperative inverse chine learning proceedings 1994. Elsevier. 157–163. [Lowe et al. 2017] Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; ceedings of the annual conference of the cognitive science Abbeel, O. P.; and Mordatch, I. 2017. Multi-agent actor- society, 2601–2606. critic for mixed cooperative-competitive environments. In [Thomaz et al. 2005] Thomaz, A. L.; Hoffman, G.; ; and Advances in Neural Information Processing Systems, 6379– Breazeal, C. 2005. Real-time interactive reinforcement 6390. learning for robots. In AAAI. [Mindermann et al. 2018] Mindermann, S.; Shah, R.; [Thrun and Pratt 2012] Thrun, S., and Pratt, L. 2012. Learn- Gleave, A.; and Hadfield-Menell, D. 2018. Active inverse ing to learn. Springer Science & Business Media. reward design. In Proc. of the ICML/IJCAI/AAMAS [Wang et al. 2015] Wang, Z.; Schaul, T.; Hessel, M.; Workshop on Goals for Reinforcement Learning. Van Hasselt, H.; Lanctot, M.; and De Freitas, N. 2015. Du- [Mnih et al. 2013] Mnih, V.; Kavukcuoglu, K.; Silver, D.; eling network architectures for deep reinforcement learning. Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, arXiv preprint arXiv:1511.06581. M. A. 2013. Playing atari with deep reinforcement learning. [Wang et al. 2016] Wang, J. X.; Kurth-Nelson, Z.; Tirumala, In Proc. of the Conference on Neural Information Process- D.; Soyer, H.; Leibo, J. Z.; Munos, R.; Blundell, C.; Ku- ing Systems (NIPS), Workshop on Deep Learning. maran, D.; and Botvinick, M. 2016. Learning to reinforce- [Mordatch and Abbeel 2018] Mordatch, I., and Abbeel, P. ment learn. arXiv preprint arXiv:1611.05763. 2018. Emergence of grounded compositional language in [Warnell et al. 2018] Warnell, G.; Waytowich, N.; Lawhern, multi-agent populations. In Proc. of the AAAI Conference V.; and Stone, P. 2018. Deep tamer: Interactive agent shap- on Artificial Intelligence (AAAI). ing in high-dimensional state spaces. In Thirty-Second AAAI [Palmer et al. 2018] Palmer, G.; Tuyls, K.; Bloembergen, D.; Conference on Artificial Intelligence. and Savani, R. 2018. Lenient multi-agent deep reinforce- [Watkins 1989] Watkins, C. J. C. H. 1989. Learning from ment learning. In Proc. of Autonomous Agents and Multia- delayed rewards. Ph.D. Dissertation, King’s College, Cam- gent Systems (AAMAS). bridge. [Peloquin, Goodman, and Frank 2019] Peloquin, B. N.; [Xu et al. 2018] Xu, K.; Ratner, E.; Dragan, A.; Levine, S.; Goodman, N. D.; and Frank, M. C. 2019. The interactions and Finn, C. 2018. Learning a prior over intent via meta- of rational, pragmatic agents lead to efficient language inverse reinforcement learning. structure and use. PsyArXiv. [Perez-Dattari et al. 2018] Perez-Dattari, R.; Celemin, C.; [Yu et al. 2018a] Yu, T.; Finn, C.; Xie, A.; Dasari, S.; Zhang, del Solar, J. R.; and Kober, J. 2018. Interactive learning T.; Abbeel, P.; and Levine, S. 2018a. One-shot imitation with corrective feedback for policies based on deep neural from observing humans via domain-adaptive meta-learning. Proc. of Robotics: Science and Systems (RSS) networks. In Proc. of the International Symposium on Ex- In . perimental Robotics (ISER). [Yu et al. 2018b] Yu, T.; Finn, C.; Xie, A.; Dasari, S.; Zhang, [Potter and Jong 1994] Potter, M. A., and Jong, K. A. D. T.; Abbeel, P.; and Levine, S. 2018b. One-shot imitation 1994. A cooperative coevolutionary approach to function from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557 optimization. The Third Parallel Problem Solving From Na- . ture 249–257. [Ross, Gordon, and Bagnell 2011a] Ross, S.; Gordon, G.; and Bagnell, D. 2011a. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, 627–635. [Ross, Gordon, and Bagnell 2011b] Ross, S.; Gordon, G. J.; and Bagnell, J. A. 2011b. A reduction of imitation learning and structured prediction to no-regret online learning. In Proc. of the International Conference on Artificial Intelli- gence and Statistics (AISTATS). [Santoro et al. 2016] Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; and Lillicrap, T. P. 2016. One-shot learning with memory-augmented neural networks. In Proc. of the International Conference on Machine Learning (ICML). [Schaal 1999] Schaal, S. 1999. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 233–242. [Schmidhuber 1987] Schmidhuber, J. 1987. Evolutionary Principles in Self-Referential Learning. Ph.D. Dissertation, Institut f. Informatik, Tech. Univ. Munich. [Singh, Lewis, and Barto 2009] Singh, S.; Lewis, R. L.; and Barto, A. G. 2009. Where do rewards come from. In Pro- A Appendix agents may occupy the same cell in the shape environment A.1 Model Architectures of Experiments 1-3, but collide with each other in the fruit environment of Experiment 4; if, in the fruit environment, Multi-Agent Independent DRQN (MAIDRQN) The the agents attempt to move onto the same cell, the princi- MAIDRQN model makes use of independent DRQN mod- pal gets to move to the cell and the assistant remains at its els, see figure 2a. There are independent parallel paths current location. P A for each agent, giving Q and Q . Each path consists In experiments using the shape environment, the agents of a convolutional neural network (CNN) (LeCun, Bengio, observe the world as a grid of binary vectors, one binary and others 1995), followed by a long short-term memory vector for each cell. Some fields in the binary vector are ze- (LSTM) (Hochreiter and Schmidhuber 1997), followed by roed out depending on the agent and the experiment. Each a fully-connected layer. The CNNs consist of a 2-layer net- binary vector contains: 1 bit indicating if the cell is visible work with 10, 3x3, stride 1 filters per layer and rectified to the agent, 1 bit indicating the presence of the principal, 1 linear unit (ReLU) activations. The LSTMs have 50 hidden bit indicating the presence of the assistant, 1 bit indicating units. The fully-connected network layer outputs are 5 di- the presence of an object, 2 bits indicating the class of the mensional, representing Q values for the 5 actions in the object in one-hot form, 2 bits indicating whether the object environment. should be collected in one-hot form. All bits are zeroed if the Multi-Agent Dueling DRQN (MADDRQN) Up to the fi- cell is not visible to the agent, and all object bits are zeroed nal layer, the MADDRQN model is similarly structured to if no object is present in the cell. the MAIDRQN model, with the following modifications. In Experiment 1, the bits indicating whether the object The CNN filter weights are shared between the agents. The should be collected are set to zero in the assistant’s binary first CNN layer has 16, 8x8 filters with stride 4, and the sec- vectors. ond layer has 32, 8x8 filters with stride 2, both with ReLU In Experiment 2, the bits indicating whether the object activations. A 256-unit fully-connected layer with ReLU ac- should be collected are set to zero in the assistant’s binary tivations is inserted between the CNN and the LSTM, also vectors. Further, for both agents, if a cell is more than 1 cell with weights shared between the agents. Information about away from the agent, then all bits in the binary vector for the current task is represented as a 1-hot vector and con- that cell are set to zero. catenated to the output of the fully-connected layer before In Experiment 3, on half of the tasks, the bits in indicating feeding it to the principal’s LSTM. The outputs for each whether the object should be collected are set to zero in the LSTM are fed to separate fully-connected layers with 5 assistant’s binary vectors. Further, for both agents, if a cell outputs each, and the softmax across the 5 outputs is sub- is more than 1 cell away from the agent, then all bits in the tracted from each output, producing an advantage functions, binary vector for that cell are set to zero. Ai, for each agent. In a separate path, the outputs from the In Experiment 4, using the fruit environment, the agents two LSTMs are concatenated together and fed to a fully- observe the world as a 64x64x3 color image centered on the connected layer with a single output, representing the shared respective agent, see Figure 8. The camera view includes the joint value function V . The MADDRQN model is depicted entire world when the agent is at the center of the grid, but in Figure 2b becomes partially observable as the agent moves away from the center. The principal agent receives the class to collect, Multi-Agent Independent DRQN (MAIDRQN) (Experi- in one-hot form, as an additional observation. The class of ment 4) In Experiment 4, we conducted a comparison be- objects to collect is concatenated to the principal’s LSTM tween MADDRQN and MAIDRQN. For a fair comparison, input. In Experiment 4, the assistant never observes the class the MAIDRQN model in Experiment 4 is the MADDRQN of objects to collect. model without the value function, and without the softmax subtracted from the fully-connected output layers, giving A.3 Training Algorithms P A Q and Q . All other aspects, including the fully-connected Algorithm 1 describes the training procedure for train- layer after the CNN layers and the sharing of weights are ing with the Multi-Agent Independent DRQN (MAIDRQN) identical to MADDRQN. model, used in experiments 1-3. Algorithm 2 describes the training procedure for training with our Multi-Agent Duel- A.2 Experimental Details ing DRQN (MADDRQN) model, used in experiment 4. All experiments use -greedy exploration during training ( = 0.05), and argmax when evaluating. The Adam opti- A.4 Experiment 3 - Rollouts mizer with default parameters was used to train the mod- Figure A.4 shows rollouts for emerged behaviors in 4 of the els (Kingma and Ba 2015). 8 tasks. All experiments operate on a grid. On every step, each In the case where the assistant observes object goodness agent can take one of five actions: move left, move right, and the object is bad, Figure 9c, the assistant correctly does move up, move down, or stay in place. Actions are taken in not collect the bad object. However, it does move into the parallel, and the subsequent observations reflect the effect of principal’s visual field later in the episode. A late appear- both agents’ actions. If either agent moves onto a cell con- ance in the principal’s visual field must not be a well-formed taining an object, then the object is removed and considered question since the principal does not move to answer. The “collected”, with resulting joint reward defined by R. The emergence of time-step-dependent behaviors such as this (a) The assistant immediately consumes the good object when it observes object goodness

(b) The assistant asks the principal whether the object is good when it does not observe object goodness.

(d) The assistant asks the principal whether the object is good when it does not observe object goodness.

Figure 9: The assistant has learned how and when to ask questions to the principal. The assistant randomly observes object goodness dependent on the episode. When asking a question, as shown in cases (b) and (d), the assistant moves into the visual field of the principal. Next, the principal moves to the right to observe the object and moves one step closer if the object is good or does not move if the object is bad. The assistant understands this answer and proceed correctly. Algorithm 1 Training Procedure using the MAIDRQN model Algorithm 2 Training Procedure using the MADDRQN model Input: DT rain: a set of training tasks Input: DT rain: a set of training tasks Input: Two recurrent artificial neural networks that output Input: A recurrent artificial neural network that outputs ad- action-value functions QθP and QθA vantage functions AθP and AθA , and value function Vφ Input: An environment interface Env with methods init() Input: An environment interface Env with methods init() and step() and step() Output: θP and θA Output: θP , θA, and φ 1: Initialize θP and θA 1: Initialize θP , θA, and φ 2: while not done do 2: while not done do 3: Uniformly sample a batch of tasks from DT rain, let 3: Uniformly sample a batch of tasks from DT rain, let Tb be a task in the batch. Tb be a task in the batch 4: for all Tb do 4: for all Tb do 5: Reset the memory for the recurrent neural net- 5: Reset the memory for the recurrent neural networks to 0 works to 0 P A P A 6: o , o = Env.init(Tb) 6: o , o = Env.init(Tb) P A P A 7: τb = [o , o ] 7: τb = [o , o ] 8: repeat 8: repeat 9: Choose actions aP and aA using -greedy explo- 9: Choose actions aP and aA using -greedy explo- P A P A ration on QθP (o ) and QΘA (o ) ration on AθP (o ) and AθA (o ) 10: r, oP , oA = Env.step(aP , aA) 10: r, oP , oA = Env.step(aP , aA) P A P A P A P A 11: τb.append(a , a , r, o , o ) 11: τb.append(a , a , r, o , o ) 12: until end of episode 12: until end of episode MAIDRQN i i MADDRQN 13: Calculate Lb using Equation 2 with ot, at, 13: Calculate Lb using Equations 4 and 5 with i i and rt from τb ot, at, and rt from τb 14: end for 14: end for 15: P LMAIDRQN θP P MADDRQN P A Take a gradient step on b b w.r.t. and 15: Take a gradient step on b Lb w.r.t. θ , θ , θA and φ 16: end while 16: end while might be avoided by starting agents at different random steps You and the assistant cannot occupy the same location. Your in the episode. action is replaced by in this case. A.5 Experiment 4 - Human Instructions The following instructions were provided to the human in the Experiment 4. Your goal in each episode is to harvest as many lemons or plums as you can. You will be shown which fruit to harvest, lemons or plums, in a separate window. You will receive +1 for each correct fruit harvested and -1 for each incorrect fruit harvested. An episode lasts for 10 steps, i.e. 10 action choices. You will play 20 episodes by yourself, controlling the principal agent, P. You will also play 20 episodes with an assistant agent, A. The assistant is a trained neural network. The assistant does not know which fruit to harvest at the start of the episode, it must learn the correct fruit from you. When harvesting with the assistant, you will receive +1 and -1 for each correct and incorrect fruit harvested, regardless of which agent harvested the fruit. You will begin with 10 practice episodes for each agent setting (with the assistant and without). Controls: • : starts an episode, the target fruit will be dis- played • , “a”, “d”, “w”, “s”: don’t move, move left, right, up, or down respectively