Machine Theory of Mind
Neil C. Rabinowitz¹, Frank Perbet¹, H. Francis Song¹, Chiyuan Zhang², Matthew Botvinick¹

¹DeepMind, ²Google Brain. Correspondence to: Neil Rabinowitz <[email protected]>. An extended version is available at arxiv.org/abs/1802.07740. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

Theory of mind (ToM) broadly refers to humans' ability to represent the mental states of others, including their desires, beliefs, and intentions. We design a Theory of Mind neural network – a ToMnet – which uses meta-learning to build such models of the agents it encounters. The ToMnet learns a strong prior model for agents' future behaviour, and, using only a small number of behavioural observations, can bootstrap to richer predictions about agents' characteristics and mental states. We apply the ToMnet to agents behaving in simple gridworld environments, showing that it learns to model random, algorithmic, and deep RL agents from varied populations, and that it passes classic ToM tasks such as the "Sally-Anne" test of recognising that others can hold false beliefs about the world.

1. Introduction

What does it actually mean to "understand" another agent? As humans, we face this challenge every day, as we engage with other humans whose latent characteristics, latent states, and computational processes are almost entirely inaccessible. Yet we can make predictions about strangers' future behaviour, and infer what information they have; we can plan our interactions with others, and establish efficient and effective communication.

A salient feature of these "understandings" of other agents is that they make little to no reference to the agents' true underlying structure. A prominent argument from cognitive psychology is that our social reasoning instead relies on high-level models of other agents (Gopnik & Wellman, 1992). These models engage abstractions which do not describe the detailed physical mechanisms underlying observed behaviour; instead, we represent the mental states of others, such as their desires and beliefs. This ability is typically described as our Theory of Mind (Premack & Woodruff, 1978). While we may also leverage our own minds to simulate others' (e.g. Gordon, 1986; Gallese & Goldman, 1998), our ultimate human understanding of other agents is not measured by a correspondence between our models and the mechanistic ground truth, but instead by how much they enable prediction and planning.

In this paper, we take inspiration from human Theory of Mind, and seek to build a system which learns to model other agents. We describe this as a Machine Theory of Mind. Our goal is not to assert a generative model of agents' behaviour and an algorithm to invert it. Rather, we focus on the problem of how an observer could learn autonomously how to model other agents using limited data (Botvinick et al., 2017). This distinguishes our work from previous literature, which has relied on hand-crafted models of agents as noisy-rational planners – e.g. using inverse RL (Ng et al., 2000; Abbeel & Ng, 2004), Bayesian Theory of Mind (Baker et al., 2011; Jara-Ettinger et al., 2016) or game theory (Yoshida et al., 2008; Camerer, 2010). Likewise, other work, such as Foerster et al. (2017) and Everett (2017), expects other agents to conform to known, strong parametric models, while concurrent work by Raileanu et al. (2018) assumes that other agents are functionally identical to oneself. In contrast, we make very weak assumptions about the generative model driving others' behaviour. The approach we pursue here learns the agent models, and how to do inference on them, from scratch, via meta-learning.

Building a rich, flexible, and performant Machine Theory of Mind may well be a grand challenge for AI. We are not trying to solve all of this here. A main message of this paper is that many of the initial challenges of building a ToM can be cast as simple learning problems when they are formulated in the right way.

There are many potential applications for this work. Learning rich models of others will improve decision-making in complex multi-agent tasks, especially where model-based planning and imagination are required (Hassabis et al., 2013; Hula et al., 2015; Oliehoek & Amato, 2016). Our work thus ties in to a rich history of opponent modelling (Brown, 1951; Albrecht & Stone, 2017); within this context, we show how meta-learning can provide the ability to build flexible and sample-efficient models of others on the fly. Such models will also be important for value alignment (Hadfield-Menell et al., 2016), flexible cooperation (Nowak, 2006; Kleiman-Weiner et al., 2016), and mediating human understanding of artificial agents.

We consider the challenge of building a Theory of Mind as essentially a meta-learning problem (Schmidhuber et al., 1996; Thrun & Pratt, 1998; Hochreiter et al., 2001; Vilalta & Drissi, 2002). At test time, we want to be able to encounter a novel agent whom we have never met before, and already have a strong and rich prior about how they are going to behave. Moreover, as we see this agent act in the world, we wish to be able to form a posterior about their latent characteristics and mental states that will enable us to improve our predictions about their future behaviour.

We formulate a task for an observer, who, in each episode, gets access to a set of behavioural traces of a novel agent, and must make predictions about the agent's future behaviour. Over training, the observer should get better at rapidly forming predictions about new agents from limited data. This "learning to learn" about new agents is what we mean by meta-learning. Through this process, the observer should also acquire an effective prior over the agents' behaviour that implicitly captures the commonalities between agents within the training population.

We introduce two concepts to describe components of this observer network: its general theory of mind – the learned weights of the network, which encapsulate predictions about the common behaviour of all agents in the training set – and its agent-specific theory of mind – the "agent embedding" formed from observations about a single agent at test time, which encapsulates what makes this agent's character and mental state distinct from others'. These correspond to a prior and posterior over agent behaviour.
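To make the prior/posterior picture concrete, here is a minimal, hypothetical sketch (in PyTorch) of an observer in this spirit: the learned weights play the role of the general theory of mind, while an embedding pooled from an agent's past behaviour plays the role of the agent-specific theory of mind. The names (ObserverNet, e_agent) and the architecture are illustrative only; they are not the ToMnet architecture described later in the paper.

```python
# Illustrative sketch only: a generic observer whose weights encode a prior
# over agent behaviour and whose agent embedding encodes a posterior summary.
import torch
import torch.nn as nn


class ObserverNet(nn.Module):
    def __init__(self, obs_dim, n_actions, embed_dim=8):
        super().__init__()
        # Encodes one observed (state, action) step from a past episode.
        self.step_encoder = nn.Linear(obs_dim + n_actions, embed_dim)
        # Maps [current state, agent embedding] to next-action logits.
        self.policy_head = nn.Linear(obs_dim + embed_dim, n_actions)

    def agent_embedding(self, past_steps):
        # past_steps: (N_steps, obs_dim + n_actions) gathered from this
        # agent's past episodes. Pooling over steps gives a fixed-size
        # "agent embedding"; with no observations we fall back to zeros.
        if past_steps.shape[0] == 0:
            return torch.zeros(self.step_encoder.out_features)
        return torch.relu(self.step_encoder(past_steps)).mean(dim=0)

    def forward(self, current_obs, past_steps):
        e_agent = self.agent_embedding(past_steps)
        return self.policy_head(torch.cat([current_obs, e_agent], dim=-1))


# Usage: predict a novel agent's next action from a handful of observed steps.
net = ObserverNet(obs_dim=121, n_actions=5)       # 11x11 grid, 5 actions
past = torch.randn(20, 121 + 5)                   # 20 observed (state, action) pairs
logits = net(torch.randn(121), past)              # unnormalised next-action scores
```

With no past observations the embedding defaults to zeros, so the prediction falls back on the population-level prior stored in the weights; as observed steps accumulate, the embedding specialises the prediction to this particular agent.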
This paper is structured as a sequence of experiments of increasing complexity on this Machine Theory of Mind network, which we call a ToMnet. These experiments showcase the idea of the ToMnet, exhibit its capabilities, and demonstrate its capacity to learn rich models of other agents incorporating canonical features of humans' Theory of Mind, such as the recognition of false beliefs.

Several experiments in this paper are inspired by the seminal work on Bayesian Theory of Mind (Baker et al., 2011; 2017). We do not try to directly replicate the experiments in these studies, which seek to explain human judgements in computational terms. Our emphasis is on machine learning, scalability, and autonomy.

2. Model

2.1. The tasks

We assume we have a family of partially observable Markov decision processes (POMDPs) $\mathcal{M} = \bigcup_j \mathcal{M}_j$. Unlike the standard formalism, we associate the reward functions, discount factors, and conditional observation functions with the agents rather than with the POMDPs. For example, a POMDP could be a gridworld with a particular arrangement of walls and objects; different agents, when placed in the same POMDP, might receive different rewards for reaching these objects, and be able to see different amounts of their local surroundings. We only consider single-agent POMDPs here; the multi-agent extension is simple. When agents have full observability, we use the terms MDP and POMDP interchangeably.

Separately, we assume we have a family of agents $\mathcal{A} = \bigcup_i \mathcal{A}_i$, with their own observation spaces, observation functions, reward functions, and discount factors. Their policies might be stochastic (Section 3.1), algorithmic (Section 3.2), or learned (Sections 3.3–3.5). We do not assume that the agents' policies are optimal for their respective tasks. The agents may be stateful, though we assume their hidden states do not carry over between episodes.

The POMDPs we consider here are 11 × 11 gridworlds with a common action space (up/down/left/right/stay), deterministic dynamics, and a set of consumable objects (see Appendix C). We experimented with these POMDPs due to their simplicity and ease of control; our constructions should generalise to richer domains too. We parametrically generate individual $\mathcal{M}_j$ by randomly sampling wall, object, and initial agent locations.
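As a concrete illustration of this parametric generation, the sketch below builds one such 11 × 11 gridworld by sampling wall, object, and initial agent locations uniformly at random. The specific counts of interior walls and objects here are placeholders, not the paper's (the paper's construction is specified in its Appendix C).

```python
# Minimal sketch of sampling one gridworld layout M_j; wall/object counts are
# illustrative assumptions, not the paper's actual parameters.
import numpy as np

ACTIONS = ["up", "down", "left", "right", "stay"]  # common action space


def generate_gridworld(size=11, n_walls=6, n_objects=4, rng=None):
    """Sample one layout: border + interior walls, consumable objects, agent start."""
    rng = np.random.default_rng() if rng is None else rng
    grid = np.zeros((size, size), dtype=int)                  # 0 = empty, 1 = wall
    grid[0, :] = grid[-1, :] = grid[:, 0] = grid[:, -1] = 1   # border walls
    occupied = set()

    def sample_free_cell():
        # Rejection-sample an interior cell that is neither a wall nor already taken.
        while True:
            r, c = (int(x) for x in rng.integers(1, size - 1, size=2))
            if grid[r, c] == 0 and (r, c) not in occupied:
                occupied.add((r, c))
                return r, c

    for _ in range(n_walls):                                  # random interior walls
        grid[sample_free_cell()] = 1
    objects = [sample_free_cell() for _ in range(n_objects)]  # consumable objects
    agent_start = sample_free_cell()                          # initial agent location
    return grid, objects, agent_start


grid, objects, start = generate_gridworld()
```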
In turn, we consider an observer who makes potentially partial and/or noisy observations of agents' trajectories. Thus, if agent $A_i$ follows its policy $\pi_i$ on POMDP $\mathcal{M}_j$ and produces trajectory $\tau_{ij} = \{(s_t, a_t)\}_{t=0}^{T}$, the observer would see $\tau_{ij}^{(obs)} = \{(x_t^{(obs)}, a_t^{(obs)})\}_{t=0}^{T}$. For all our experiments, we assume that the observer has unrestricted access to the MDP state and agents' actions, but not to agents' parameters, reward functions, policies, or identifiers.

ToMnet training involves a series of encounters with individual agents. The ToMnet observer sees a set of full or partial "past episodes", wherein a single, unlabelled agent, $A_i$, produces trajectories, $\{\tau_{ij}\}_{j=1}^{N_{past}}$, as it executes its policy within the respective POMDPs, $\mathcal{M}_j$. $N_{past}$ might vary, and may even be zero. The task for the observer is to predict the agent's behaviour (e.g. atomic actions) and poten-