The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Learning to Communicate Implicitly by Actions

Zheng Tian,¹ Shihao Zou,¹ Ian Davies,¹ Tim Warr,¹ Lisheng Wu,¹ Haitham Bou Ammar,¹,² Jun Wang¹
¹University College London, ²Huawei R&D UK
{zheng.tian.11, shihao.zou.17, ian.davies.12, tim.warr.17, lisheng.wu.17}@ucl.ac.uk, [email protected], [email protected]

Abstract

In situations where explicit communication is limited, human collaborators act by learning to: (i) infer the meaning behind their partner's actions, and (ii) convey private information about the state to their partner implicitly through actions. The first component of this learning process has been well-studied in multi-agent systems, whereas the second, which is equally crucial for successful collaboration, has not. To mimic both components mentioned above, thereby completing the learning process, we introduce a novel algorithm: Policy Belief Learning (PBL). PBL uses a belief module to model the other agent's private information and a policy module to form a distribution over actions informed by the belief module. Furthermore, to encourage communication by actions, we propose a novel auxiliary reward which incentivizes one agent to help its partner to make correct inferences about its private information. The auxiliary reward for communication is integrated into the learning of the policy module. We evaluate our approach on a set of environments including a matrix game, a particle environment and the non-competitive bidding problem from contract bridge. We show empirically that this auxiliary reward is effective and easy to generalize. These results demonstrate that our PBL algorithm can produce strong pairs of agents in collaborative games where explicit communication is disabled.

Introduction

In collaborative multi-agent systems, communication is essential for agents to learn to behave as a collective rather than as a collection of individuals. This is particularly important in the imperfect information setting, where private information becomes crucial to success. In such cases, efficient communication protocols between agents are needed for private information exchange, coordinated joint-action exploration, and true world-state inference.

In typical multi-agent reinforcement learning (MARL) settings, designers incorporate explicit communication channels hoping to conceptually resemble language or verbal communication, which are known to be important for human interaction (Baker et al. 1999). Though they can be used for facilitating collaboration in MARL, explicit communication channels come at additional computational and memory costs, making them difficult to deploy in decentralized control (Roth, Simmons, and Veloso 2006).

Environments where explicit communication is difficult or prohibited are common. These settings can be synthetic, such as those in games, e.g., bridge and Hanabi, but also frequently appear in real-world tasks such as autonomous driving and autonomous fleet control. In these situations, humans rely upon implicit communication as a means of information exchange (Rasouli, Kotseruba, and Tsotsos 2017) and are effective in learning to infer the implicit meaning behind others' actions (Heider and Simmel 1944). The ability to perform such inference requires the attribution of a mental state and reasoning mechanism to others. This ability is known as theory of mind (Premack and Woodruff 1978). In this work, we develop agents that benefit from considering others' perspectives and thereby explore the further development of machine theory of mind (Rabinowitz et al. 2018).

Previous works have considered ways in which an agent can, by observing an opponent's behavior, build a model of the opponent's characteristics, objectives or hidden information, either implicitly (He et al. 2016; Bard et al. 2013) or explicitly (Raileanu et al. 2018; Li and Miikkulainen 2018). Whilst these works are of great value, they overlook the fact that an agent should also consider that it is being modeled and adapt its behavior accordingly, thereby demonstrating a theory of mind. For instance, in collaborative tasks, a decision-maker could choose to take actions which are informative to its teammates, whereas, in competitive situations, agents may act to conceal private information to prevent their opponents from modeling them effectively.

In this paper, we propose a generic framework, titled policy belief learning (PBL), for learning to cooperate in imperfect information multi-agent games. Our work combines opponent modeling with a policy that considers that it is being modeled. PBL consists of a belief module, which models other agents' private information by considering their previous actions, and a policy module, which combines the agent's current observation with its beliefs to return a distribution over actions. We also propose a novel auxiliary reward for encouraging communication by actions, which is integrated into PBL. Our experiments show that agents trained using PBL can learn collaborative behaviors more effectively than a number of meaningful baselines without requiring any explicit communication. We conduct a complete ablation study to analyze the effectiveness of different components within PBL in our bridge experiment.

Related Work

Our work is closely related to (Albrecht and Stone 2017; Lowe et al. 2017; Mealing and Shapiro 2017; Raileanu et al. 2018), where agents build models to estimate other agents' hidden information. Contrastingly, our work enhances a "flat" opponent model with recursive reasoning. "Flat" opponent models estimate only the hidden information of opponents. Recursive reasoning requires making decisions based on the mental states of others as well as the state of the environment. In contrast to works such as I-POMDP (Gmytrasiewicz and Doshi 2005) and PR2 (Wen et al. 2019), where the nested belief is embedded into the training agent's opponent model, we incorporate the level-1 nested belief "I believe that you believe" into our policy via a novel auxiliary reward.

Recently, there has been a surge of interest in using reinforcement learning (RL) approaches to learn communication protocols (Foerster et al. 2016; Lazaridou, Peysakhovich, and Baroni 2016; Mordatch and Abbeel 2017; Sukhbaatar, Szlam, and Fergus 2016). Most of these works enable agents to communicate via an explicit channel. Among these works, Mordatch and Abbeel (2017) also observe the emergence of non-verbal communication in collaborative environments without an explicit communication channel, where agents are exclusively either a sender or a receiver. Similar research is also conducted in (de Weerd, Verbrugge, and Verheij 2015). In our setting, we do not restrict agents to be exclusively a sender or a receiver of communications; agents can communicate mutually by actions. Knepper et al. (2017) propose a framework for implicit communication in a cooperative setting and show that various problems can be mapped into this framework. Although our work is conceptually close to (Knepper et al. 2017), we go further and present a practical algorithm for training agents. The recent work of (Foerster et al. 2018) solves an imperfect information problem as considered here from a different angle. We approach the problem by encouraging agents to exchange critical information through their actions, whereas Foerster et al. train a public player to choose an optimal deterministic policy for players in a game based on publicly observable information. Implicit communication is also considered in the human-robot interaction community. In Cooperative Inverse RL (CIRL), where robotic agents try to infer a human's private reward function from their actions (Hadfield-Menell et al. 2016), optimal solutions need to produce behavior that conveys information.

Dragan, Lee, and Srinivasa (2013) consider how to train agents to exhibit legible behavior (i.e. behavior from which it is easy to infer the intention). Their approach is dependent on a hand-crafted cost function to attain informative behavior. Mutual information has been used as a means to promote coordination without the need for a human engineered cost function. Strouse et al. (2018) use a mutual information objective to encourage an agent to reveal or hide its intention. In a related work, Jaques et al. (2019) utilize a mutual information objective to imbue agents with social influence. While the objective of maximal mutual information in actions can yield highly effective collaborating agents, a mutual information objective in itself is insufficient to necessitate the development of implicit communication by actions. Eccles et al. (2019) introduce a reciprocity reward as an alternative approach to solve social dilemmas.

A distinguishing feature of our work in relation to previous works in multi-agent communication is that we do not have a predefined explicit communication protocol or learn to communicate through an explicit channel. Information exchange can only happen via actions. In contrast to previous works focusing on unilaterally making actions informative, we focus on bilateral communication by actions where information transmission is directed to a specific party with potentially limited reasoning ability. Our agents learn to communicate through iterated policy and belief updates such that the resulting communication mechanism and belief models are interdependent. The development of a communication mechanism therefore requires either direct access to the mental state of other agents (via centralized training) or the ability to mentalize, commonly known as theory of mind. We investigate our proposed algorithm in both settings.

Problem Definition

We consider a set of agents, denoted by N, interacting with an unknown environment by executing actions from a joint set A = {A^1, ..., A^N}, with A^i denoting the action space of agent i, and N the total number of agents. To enable models that approximate real-world scenarios, we assume private and public information states. Private information states, jointly (across agents) denoted by X = {X^1, ..., X^N}, are a set of hidden information states where X^i is only observable by agent i, while public states O are observed by all agents. We assume that hidden information states at each time step are sampled from an unknown distribution P_x : X → [0, 1], while public states evolve from an initial distribution P_o : O → [0, 1] according to a stochastic transition model T : O × X × A^1 × ... × A^N × O → [0, 1]. Having transitioned to a successor state according to T, agents receive rewards from R : S × A^1 × ... × A^N → R, where we have used S = O × X to denote joint state descriptions that incorporate both public and private information. Finally, rewards are discounted over time by a factor γ ∈ (0, 1]. With this notation, our problem can be described succinctly by the tuple ⟨N, A, O, X, T, R, P_x, P_o, γ⟩, which we refer to as an imperfect information Markov decision process (I2MDP).¹ In this work, we simplify the problem by assuming that hidden information states X are temporally static and are given at the beginning of the game.

¹We also note that our problem can be formalized as a decentralized partially observable Markov decision process (Dec-POMDP) (Bernstein, Zilberstein, and Immerman 2013).
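
To make the I2MDP tuple concrete, the following minimal sketch (ours, not the paper's) shows one way the pieces ⟨N, A, O, X, T, R, P_x, P_o, γ⟩ could be arranged as an environment interface, assuming temporally static private states sampled once per episode; the class and method names are illustrative placeholders rather than anything defined in the paper.

import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class I2MDPConfig:
    # Illustrative container for the tuple <N, A, O, X, T, R, Px, Po, gamma>.
    num_agents: int            # N
    action_spaces: List[list]  # A^i: legal actions of each agent
    hidden_spaces: List[list]  # X^i: possible private states of each agent
    gamma: float = 0.99        # discount factor in (0, 1]

class StaticHiddenInfoGame:
    """Sketch of an I2MDP with temporally static private information:
    x^i is drawn once at reset and only the public state o evolves."""

    def __init__(self, cfg: I2MDPConfig):
        self.cfg = cfg
        self.x = None
        self.o = None

    def reset(self):
        # x ~ Px (here simply uniform over the listed private states), o ~ Po.
        self.x = [random.choice(space) for space in self.cfg.hidden_spaces]
        self.o = self.sample_initial_public_state()
        return self.x, self.o

    def step(self, joint_action: Tuple):
        # o' ~ T(. | o, x, a^1, ..., a^N) and a shared reward R(s, a^1, ..., a^N).
        self.o = self.transition(self.o, self.x, joint_action)
        return self.o, self.reward(self.o, self.x, joint_action)

    # Environment-specific pieces left abstract in this sketch:
    def sample_initial_public_state(self):
        raise NotImplementedError

    def transition(self, o, x, joint_action):
        raise NotImplementedError

    def reward(self, o, x, joint_action):
        raise NotImplementedError
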

We interpret the joint policy from the perspective of agent i such that π = (π^i(a^i|s), π^{-i}(a^{-i}|s)), where π^{-i}(a^{-i}|s) is a compact representation of the joint policy of all agents excluding agent i. In the collaborative setting, each agent is presumed to pursue the shared maximal cumulative reward expressed as

\max_{\pi} \; \eta^i(\pi) = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^t R(s_t, a^i_t, a^{-i}_t)\right],   (1)

where s_t = [x^i_t, x^{-i}_t, o_t] is the current full information state, (a^i_t, a^{-i}_t) are the joint actions taken by agent i and all other agents respectively at time t, and γ is a discount factor.

Policy Belief Learning

Applying naive single agent reinforcement learning (SARL) algorithms to our problem will lead to poor performance. One reason for this is the partial observability of the environment. To succeed in a partially observable environment, an agent is often required to maintain a belief state. Recall that, in our setting, the environment state is formed from the union of the private information of all agents and the publicly observable information, s_t = [x^i_t, x^{-i}_t, o_t]. We therefore learn a belief module Φ^i(x^{-i}_t) to model other agents' private information x^{-i}_t, which is the only hidden information from the perspective of agent i in our setting. We assume that an agent can model x^{-i}_t given the history of public information and actions executed by other agents, h^i_t = {o_{1:t-1}, a^{-i}_{1:t-1}}. We use a neural network (NN) to parameterize the belief module, which takes in the history of public information and produces a belief state b^i_t = Φ^i(x^{-i}|h^i_t). The belief state together with information observable by agent i forms a sufficient statistic, ŝ^i_t = [x^i_t, b^i_t, o_t], which contains all the information necessary for the agent to act optimally (Åström 1965). We use a separate NN to parameterize agent i's policy π^i(a^i_t|ŝ^i_t), which takes in the estimated environment state ŝ^i_t and outputs a distribution over actions. As we assume hidden information is temporally static, we drop its time subscript in the rest of the paper.

The presence of multiple learning agents interacting with the environment renders the environment non-stationary. This further limits the success of SARL algorithms, which are generally designed for environments with stationary dynamics. To solve this, we adopt centralized training and decentralized execution, where during training all agents are recognized as one central representative agent differing only by their observations. Under this approach, one can imagine belief models Φ^i(x^{-i}|h^i_t) and Φ^{-i}(x^i|h^{-i}_t) sharing parameters φ. The input data, however, varies across agents due to the dependency on both h^i_t and h^{-i}_t. In a similar fashion, we let policies share the parameters θ. Consequently, one may think of updating θ and φ using one joint data set aggregated across agents. Without loss of generality, in the remainder of this section, we discuss the learning procedure from the point of view of a single agent, agent i.

We first present the learning procedure of our belief module. At iteration k, we use the current policy π_{[k]}(a^i|ŝ) to generate a data set of size M, Ω_{[k]} = {(x^{-i}_j, h^i_j)}_{j=1}^{M}, using self-play, and learn a new belief module by minimizing:

\phi_{[k]} := \operatorname*{argmin}_{\phi} \; \mathbb{E}_{(x^{-i}_j, h^i_j) \sim \Omega_{[k-1]}}\left[\mathrm{KL}\big(x^{-i}_j \,\|\, b^i_j(h^i_j; \phi)\big)\right],   (2)

where KL(·||·) is the Kullback–Leibler (KL) divergence and we use a one-hot vector to encode the ground truth, x^{-i}_j, when we calculate the relevant KL-divergence.

With the updated belief module Φ_{[k]}, we learn a new policy for the next iteration, π_{[k+1]}, via a policy gradient algorithm. Sharing information in multi-agent cooperative games through communication reduces intractability by enabling coordinated behavior. Rather than implementing expensive protocols (Heider and Simmel 1944), we encourage agents to implicitly communicate through actions by introducing a novel auxiliary reward. To do so, notice that in the centralized setting agent i has the ability to consult its opponent's belief model Φ^{-i}(x^i|h^{-i}_t), thereby exploiting the fact that other agents hold beliefs over its private information x^i. In fact, comparing b^{-i}_t to the ground-truth x^i enables agent i to learn which actions bring these two quantities closer together and thereby learn informative behavior. This can be achieved through an auxiliary reward signal devised to encourage informative action communication:

r^i_{c,t} = \mathrm{KL}\big(x^i \,\|\, b^{-i,*}\big) - \mathrm{KL}\big(x^i \,\|\, b^{-i}_{t+1}\big),   (3)

where b^{-i,*} = Φ^{-i}_{[k]}(x^i|h^{-i}_{t,*}) is agent -i's best belief (so far) about agent i's private information:

b^{-i,*} = \operatorname*{argmin}_{b^{-i}_u} \mathrm{KL}\big(x^i \,\|\, b^{-i}_u\big) \quad \forall\, u \le t.

In other words, r^i_{c,t} encourages communication as it is proportional to the improvement in the opponent's belief (for a fixed belief model Φ^{-i}_{[k]}(x^i|h^{-i}_{t+1})), measured by its proximity to the ground-truth, resulting from the opponent observing agent i's action a^i_t. Hence, during the policy learning step of PBL, we apply a policy gradient algorithm with a shaped reward of the form:²

r = r_e + \alpha r_c,   (4)

where r_e is the reward from the environment, r_c is the communication reward and α ≥ 0 balances the communication and environment rewards.

²Please note, we omit the agent index i in the reward equation, as we shape rewards similarly for all agents.

Initially, in the absence of a belief module, we pre-train a policy π_{[0]} naively by ignoring the existence of other agents in the environment. As an agent's reasoning ability may be limited, we may then iterate between belief and policy learning multiple times until either the allocated computational resources are exhausted or the policy and belief modules converge. We summarize the main steps of PBL in Algorithm 1. Note that, although information can be leaked during training, as training is centralized, distributed test-phase execution ensures that private variables remain hidden during execution.
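
To illustrate the reward shaping in Equations 3 and 4, the short sketch below (ours, not from the paper) computes KL(x || b) for a one-hot ground truth and forms the shaped reward; the value α = 0.1 and the function names are arbitrary choices for the example.

import numpy as np

def kl_onehot(true_index: int, belief: np.ndarray, eps: float = 1e-12) -> float:
    # KL(x || b) when x is a one-hot ground truth reduces to -log b[true_index].
    return float(-np.log(belief[true_index] + eps))

def communication_reward(true_index: int,
                         best_belief_so_far: np.ndarray,
                         belief_after_action: np.ndarray) -> float:
    # Equation 3: improvement of the opponent's belief about our private
    # information after it observes our action, relative to its best belief so far.
    return (kl_onehot(true_index, best_belief_so_far)
            - kl_onehot(true_index, belief_after_action))

def shaped_reward(env_reward: float, comm_reward: float, alpha: float = 0.1) -> float:
    # Equation 4: r = r_e + alpha * r_c.
    return env_reward + alpha * comm_reward

# Toy usage: the private information takes one of 3 values; the ground truth is index 2.
best_so_far = np.array([0.4, 0.4, 0.2])    # opponent's best belief before our action
after_action = np.array([0.2, 0.2, 0.6])   # opponent's belief after observing our action
r_c = communication_reward(2, best_so_far, after_action)  # positive: the action was informative
r = shaped_reward(env_reward=0.0, comm_reward=r_c, alpha=0.1)
print(round(r_c, 3), round(r, 3))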

[Figure 1 graphic (payoff matrix and learning curves): Player 1 acts first, Player 2 acts second; entries are indexed by the players' cards and actions A, B, C; the numeric payoffs are not recoverable from the extracted text.]

(a) Payoff for the matrix game (b) Learning curves of PBL and baselines over 100 runs

Figure 1: Matrix game experiment and results.

Algorithm 1 Per-Agent Policy Belief Learning (PBL)
1: Initialize: Randomly initialize policy π_0 and belief Φ_0
2: Pre-train π_0
3: for k = 0 to max iterations do
4:     Sample episodes for belief training using self-play, forming the data set Ω_{[k]}
5:     Update the belief network using data from Ω_{[k]}, solving Equation 2
6:     Given updated beliefs Φ_{[k+1]}(·), update the policy π(·) (policy gradients with rewards from Equation 4)
7: end for
8: Output: Final policy and belief model

Machine Theory of Mind

In PBL, we adopt a centralized training and decentralized execution scheme where agents share the same belief and policy models. In reality, however, it is unlikely that two people will have exactly the same reasoning process. In contrast to requiring everyone to have the same reasoning process, a person's success in navigating social dynamics relies on their ability to attribute mental states to others. This attribution of mental states to others is known as theory of mind (Premack and Woodruff 1978). Theory of mind is fundamental to human social interaction, which requires the recognition of other sensory perspectives, the understanding of other mental states, and the recognition of complex non-verbal signals of emotional state (Lemaignan and Dillenbourg 2015). In collaboration problems without an explicit communication channel, humans can effectively establish an understanding of each other's mental state and subsequently select appropriate actions. For example, a teacher will reiterate a difficult concept to students if she infers from the students' facial expressions that they have not understood. The effort of one agent to model the mental state of another is characterized as Mutual Modeling (Dillenbourg 1999).

In our work, we also investigate whether the proposed communication reward can be generalized to a distributed setting which resembles a human application of theory of mind. Under this setting, we train a separate belief model for each agent so that Φ^i(x^{-i}|h^i_t) and Φ^{-i}(x^i|h^{-i}_t) do not share parameters (φ^i ≠ φ^{-i}). Without centralization, an agent can only measure how informative its action is to others with its own belief model. Assuming agents can perfectly recall their past actions and observations, agent i computes its communication reward as:³

r^i_{c,t} = \mathrm{KL}\big(x^i \,\|\, \tilde{b}^{i,*}_t\big) - \mathrm{KL}\big(x^i \,\|\, \tilde{b}^i_{t+1}\big),

where \tilde{b}^{i,*}_t = Φ^i(x^i|h^{-i}_{t,*}) and \tilde{b}^i_{t+1} = Φ^i(x^i|h^{-i}_{t+1}). In this way, an agent essentially establishes a mental state of others with its own belief model and acts upon it. We humbly believe this could be a step towards machine theory of mind, where algorithmic agents learn to attribute mental states to others and adjust their behavior accordingly.

³Note the difference of super/sub-scripts of the belief model and its parameters when compared to Equation 3.

The ability to mentalize relieves the restriction of collaborators having the same reasoning process. However, the success of collaboration still relies on the correctness of one's belief about the mental states of others. For instance, correctly inferring other drivers' mental states and conventions can reduce the likelihood of traffic accidents. Therefore, road safety education is important as it reduces variability among drivers' reasoning processes. In our work, this alignment amounts to the similarity between two agents' trained belief models, which is affected by training data, initialization of weights, training algorithms and so on. We leave investigation of the robustness of collaboration to variability in collaborators' belief models to future work.

Experiments & Results

We test our algorithms in three experiments. In the first, we validate the correctness of the PBL framework, which integrates our communication reward with iterative belief and policy module training, in a simple matrix game. In this relatively simple experiment, PBL achieves near optimal performance. Equipped with this knowledge, we further apply PBL to the non-competitive bridge bidding problem to verify its scalability to more complex problems. Lastly, we investigate the efficacy of the proposed communication reward in a distributed training setting.
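
As a concrete, simplified illustration of the belief-module update used in step 5 of Algorithm 1 (Equation 2), and trained separately per agent in the distributed setting above, the sketch below fits a categorical belief to one-hot ground truths by minimizing cross-entropy, which coincides with the KL objective for one-hot targets. It uses a linear-softmax model and synthetic data purely for illustration; the paper's belief module is a neural network trained on self-play histories.

import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def train_belief(H, X, num_classes, lr=0.5, epochs=200):
    # Fit b(h; phi) to one-hot targets x by minimizing average cross-entropy,
    # which equals KL(x || b) up to a constant when x is one-hot (cf. Equation 2).
    # H: (M, d) encoded public histories h_j; X: (M,) ground-truth indices x_j.
    M, d = H.shape
    W = np.zeros((d, num_classes))      # phi: a linear-softmax stand-in for the paper's NN
    Y = np.eye(num_classes)[X]          # one-hot encoding of the ground truth
    for _ in range(epochs):
        B = softmax(H @ W)              # beliefs b_j = Phi(x | h_j; phi)
        W -= lr * H.T @ (B - Y) / M     # gradient step on the cross-entropy loss
    return W

# Toy usage with synthetic data: 3 hidden states, 4-dimensional history features.
M, num_hidden, d = 512, 3, 4
X = rng.integers(0, num_hidden, size=M)
H = np.eye(num_hidden)[X] @ rng.normal(size=(num_hidden, d)) + 0.3 * rng.normal(size=(M, d))
W = train_belief(H, X, num_classes=num_hidden)
beliefs = softmax(H @ W)
print("mean probability on the true hidden state:", round(float(beliefs[np.arange(M), X].mean()), 3))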

[Figure 2 graphic: panel (a) learning curves; panel (b) bar chart with bars labeled PBL, PQL-4, PQL-5, PQL-4 (No-Rules), PQL-5 (No-Rules), PQL-4 (Single-Network) and PQL-5 (Single-Network); the extracted values (0.079, 0.077, 0.069, 0.062, 0.025, -0.025, -0.054) cannot be reliably matched to individual bars.]

(a) Learning curves for non-competitive bridge bidding (b) Comparison of PBL and PQL

Figure 2: a) Learning curves for non-competitive bridge bidding with a warm start from a model trained to predict the score distribution (average reward at warm start: 0.038). Details of warm start provided in our supplementary material. b) Bar graph comparing PBL to variants of PQL, with the full version of PQL results as reported in (Yeh and Lin 2016).

Matrix Game

We test our PBL algorithm on a matrix card game where an implicit communication strategy is required to achieve the [...]

[...] decreasing and must increase if the newly proposed suit precedes or equals the currently bid suit in the ordering ♣ < ♦ < ♥ < ♠ [...]

[Figure 3 graphic: per-suit HCP belief bars over time, annotated "Pass belief update", "Club belief update", "Heart belief update"; along the time axis Player 2 makes the Pass bid, Player 1 the One Clubs bid, Player 2 the One Hearts bid, and Player 1 the Four Hearts bid; bar values are not reliably recoverable from the extracted text.]

Figure 3: An example of a belief update trace showing how PBL agents use actions for effective communication. Upon observing PASS from East, West decreases its HCP belief in all suits. When West bids 1♣, East improves its belief in clubs. Next, East bids 1♥. West recalculates its belief from the last time step and increases its HCP belief in hearts.

[...] multiple iterations between policy and belief training are beneficial, we compare our model to a baseline policy trained with the same number of weight updates as our model but no further PBL iterations after training a belief network Φ_0 at PBL iteration k = 0.

4. Penetrative Q-Learning (PQL): PQL as proposed by Yeh and Lin (2016), the first bidding policy for non-competitive bridge bidding without human domain knowledge.

Figure 2a shows the average learning curves of our model and three baselines for our ablation study. We obtain these curves by testing trained algorithms periodically on a pre-generated test data set which contains 30,000 games. Each point on the curve is an average score computed by Duplicate bridge scoring rules (League 2017) over 30,000 games and 6 training runs. As can be seen, IP and NCR both initially learn faster than our model. This is reasonable as PBL spends more time learning a communication protocol at first. However, IP converges to a local optimum very quickly and is surpassed by PBL after approximately 400 learning iterations. NCR learns a better bidding strategy than IP with a belief module. However, NCR learns more slowly than PBL in the later stage of training because it has no guidance on how to convey information to its partner. PBL outperforming NPBI demonstrates the importance of iterative training between policy and belief modules.

Restrictions of PQL: PQL (Yeh and Lin 2016) is the first algorithm trained to bid in bridge without human engineered features. However, its strong bidding performance relies on heavy adaptation and heuristics for non-competitive bridge bidding. First, PQL requires a predefined maximum number of allowed bids in each deal, while using different bidding networks at different times. Our results show that it fails when we train a single NN for the whole game, which can be seen as a minimum requirement for most DRL algorithms. Second, PQL relies on a rule-based function for selecting the best contracts at test time. In fact, removing this second heuristic significantly reduces PQL's performance, as reported in Figure 2b. In addition, without pre-processing the training data as in (Yeh and Lin 2016), we could not reproduce the original results. To achieve state-of-the-art performance, we could use these (or other) heuristics for our bidding algorithm. However, this deviates from the focus of our work, which is to demonstrate that PBL is a general framework for learning to communicate by actions.

Belief Update Visualization: To understand how agents update their beliefs after observing a new bid, we visualize the belief update process (Figure 3). An agent's belief about its opponent's hand is represented as a 52-dimensional vector with real values, which is not amenable to human interpretation. Therefore, we use high card points (HCPs) to summarize each agent's belief. For each suit, each card is given a point score according to the mapping: A=4, K=3, Q=2, J=1, else=0. Note that while an agent's beliefs are updated based on the entire history of its opponent's bids, the difference between that agent's belief from one round to the next is predominantly driven by the most recent bid of its opponent, as shown in Figure 3.

Learned Bidding Convention: Whilst our model's bidding decisions are based entirely on raw card data, we can use high card points as a simple way to observe and summarize the decisions which are being made. For example, we observe that our policy opens the bid with 1♠ if it has an HCP of 4.5 or higher in spades but lower HCPs in all other suits. We run the model on the unseen test set of 30,000 deals and summarize the learned bidding convention in the supplementary material.

Imperfect Recall of History: The length of the action history players can recall affects the accuracy of the belief models. The extent of the impact depends on the nature of the game. In bridge, the order of bidding encodes important information. We ran an ablation study where players can only recall the most recent bid. In this setting, players do worse (average score 0.065) than players with perfect recall. We conjecture that this is because players can extract less information and therefore the accuracy of belief models drops.
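
To make the HCP summary above concrete, here is a small sketch that converts a 52-dimensional hand or belief vector into per-suit high card points using the A=4, K=3, Q=2, J=1 mapping; the card ordering within the 52-dimensional vector is our own assumption and is not specified in the paper.

import numpy as np

# Assumed card layout: 4 suits x 13 ranks (2, 3, ..., 9, T, J, Q, K, A), flattened to 52 entries.
SUITS = ["clubs", "diamonds", "hearts", "spades"]
RANK_POINTS = np.array([0] * 9 + [1, 2, 3, 4], dtype=float)  # J=1, Q=2, K=3, A=4, else 0

def hcp_summary(hand_or_belief: np.ndarray) -> dict:
    # Summarize a 52-dim hand indicator or belief vector as (expected) HCP per suit.
    cards = np.asarray(hand_or_belief, dtype=float).reshape(4, 13)
    return {suit: float(cards[s] @ RANK_POINTS) for s, suit in enumerate(SUITS)}

# Toy usage: a hand holding the ace and king of spades and the queen of hearts.
hand = np.zeros(52)
hand[3 * 13 + 12] = 1.0  # ace of spades
hand[3 * 13 + 11] = 1.0  # king of spades
hand[2 * 13 + 10] = 1.0  # queen of hearts
print(hcp_summary(hand))  # {'clubs': 0.0, 'diamonds': 0.0, 'hearts': 2.0, 'spades': 7.0}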

(a) Learning curves for Silent Guide (b) CR (c) NCR

Figure 4: a) Learning curves for Silent Guide. Guide agent trained with communication reward (CR) significantly outperforms the one trained with no communication reward (NCR). b) A trajectory of Listener (gray circle) and Guide (blue circle) with CR. Landmarks are positioned randomly and the Goal landmark (blue square) is randomly chosen at the start of each episode. c) A trajectory of Listener and Guide with NCR. Trajectories are presented with agents becoming progressively darker over time.

Silent Guide

We modify a multi-agent particle environment (Lowe et al. 2017) to test the effectiveness of our novel auxiliary reward in a distributed setting. This environment also allows us to explore the potential for implicit communication to arise through machine theory of mind. In the environment there are two agents and three landmarks. We name the agents Guide and Listener respectively. Guide can observe Listener's goal landmark, which is distinguished by its color. Listener does not observe its goal. However, Listener is able to infer the meaning behind Guide's actions. The two agents receive the same reward, which is the negative distance between Listener and its goal. Therefore, to maximize the cumulative reward, Guide needs to tell Listener the goal landmark color. However, as the "Silent Guide" name suggests, Guide has no explicit communication channel and can only communicate to Listener through its actions.

In the distributed setting, we train separate belief modules for Guide and Listener respectively. The two belief modules are both trained to predict a naive agent's goal landmark color given its history within the current episode, but using different data sets. We train both Guide and Listener policies from scratch. Listener's policy takes Listener's velocity, relative distance to the three landmarks and the prediction of the belief module as input. It is trained to maximize the environment reward it receives. Guide's policy takes its velocity, relative distance to landmarks and Listener's goal as input. To encourage communication by actions, we train Guide's policy with the auxiliary reward proposed in our work. We compare our method against a naive Guide policy which is trained without the communication reward. The results are shown in Figure 4. Guide, when trained with the communication reward (CR), learns to inform Listener of its goal by approaching the goal it observes. Listener learns to follow. However, in the NCR setting, Listener learns to ignore Guide's uninformative actions and moves to the center of the three landmarks. While Guide and Listener are equipped with belief models trained from different data sets, Guide manages to use its own belief model to establish the mental state of Listener and learns to communicate through actions judged by this constructed mental state of Listener. We also observe that a trained Guide agent can work with a naive RL listener (best reward -0.252) which has no belief model but can observe the PBL Guide agent's action. The success of Guide with CR shows the potential for machine theory of mind. We obtain the learning curves by repeating the training process five times and taking the shared average environment reward.

Conclusions & Future Work

In this paper, we focus on implicit communication through actions. This distinguishes our work from previous works, which either focus on explicit communication or unilateral communication. We propose an algorithm combining agent modeling and communication for collaborative imperfect information games. Our PBL algorithm iterates between training a policy and a belief module. We propose a novel auxiliary reward for encouraging implicit communication between agents, which effectively measures how much closer the opponent's belief about a player's private information becomes after observing the player's action. We empirically demonstrate that our methods can achieve near optimal performance in a matrix problem and scale to complex problems such as contract bridge bidding. We conduct an initial investigation of the further development of machine theory of mind. Specifically, we enable an agent to use its own belief model to attribute mental states to others and act accordingly. We test this framework and achieve some initial success in a multi-agent particle environment under distributed training. There are many interesting avenues for future work, such as exploration of the robustness of collaboration to differences in agents' belief models.

References

Albrecht, S. V., and Stone, P. 2017. Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems. arXiv e-prints.

Åström, K. J. 1965. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications 10:174–205.

Baker, M.; Hansen, T.; Joiner, R.; and Traum, D. 1999. The role of grounding in collaborative learning tasks. Collaborative Learning: Cognitive and Computational Approaches.

Bard, N.; Johanson, M.; Burch, N.; and Bowling, M. 2013. Online implicit agent modelling. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS '13, 255–262.

Bernstein, D. S.; Zilberstein, S.; and Immerman, N. 2013. The Complexity of Decentralized Control of Markov Decision Processes. arXiv e-prints arXiv:1301.3836.

de Weerd, H.; Verbrugge, R.; and Verheij, B. 2015. Higher-order theory of mind in the tacit communication game. Biologically Inspired Cognitive Architectures.

Dillenbourg, P. 1999. What do you mean by collaborative learning? In Dillenbourg, P., ed., Collaborative-learning: Cognitive and Computational Approaches. Oxford: Elsevier.

Dragan, A. D.; Lee, K. C.; and Srinivasa, S. S. 2013. Legibility and predictability of robot motion. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction, 301–308. IEEE Press.

Eccles, T.; Hughes, E.; Kramár, J.; Wheelwright, S.; and Leibo, J. Z. 2019. Learning reciprocity in complex sequential social dilemmas. arXiv preprint arXiv:1903.08082.

Foerster, J. N.; Assael, Y. M.; Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. CoRR abs/1605.06676.

Foerster, J. N.; Song, F.; Hughes, E.; Burch, N.; Dunning, I.; Whiteson, S.; Botvinick, M.; and Bowling, M. 2018. Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning. arXiv e-prints arXiv:1811.01458.

Gmytrasiewicz, P. J., and Doshi, P. 2005. A framework for sequential planning in multi-agent settings. J. Artif. Int. Res.

Hadfield-Menell, D.; Russell, S. J.; Abbeel, P.; and Dragan, A. 2016. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, 3909–3917.

Haglund, B. 2010. Search algorithms for a bridge double dummy solver.

He, H.; Boyd-Graber, J.; Kwok, K.; and Daumé III, H. 2016. Opponent modeling in deep reinforcement learning. In ICML '16.

Heider, F., and Simmel, M. 1944. An experimental study of apparent behavior. The American Journal of Psychology.

Jaques, N.; Lazaridou, A.; Hughes, E.; Gulcehre, C.; Ortega, P.; Strouse, D.; Leibo, J. Z.; and De Freitas, N. 2019. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, 3040–3049.

Knepper, R. A.; Mavrogiannis, C. I.; Proft, J.; and Liang, C. 2017. Implicit communication in a joint action. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, HRI '17. ACM.

Lazaridou, A.; Peysakhovich, A.; and Baroni, M. 2016. Multi-agent cooperation and the emergence of (natural) language. CoRR abs/1612.07182.

League, A. C. B. 2017.

Lemaignan, S., and Dillenbourg, P. 2015. Mutual modelling in robotics: Inspirations for the next steps. 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI) 303–310.

Li, X., and Miikkulainen, R. 2018. Dynamic adaptation and opponent exploitation in computer poker. In Workshops at the 32nd AAAI Conference on Artificial Intelligence.

Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. Neural Information Processing Systems (NIPS).

Mealing, R., and Shapiro, J. L. 2017. Opponent modeling by expectation–maximization and sequence prediction in simplified poker. IEEE Transactions on Computational Intelligence and AI in Games 9(1):11–24.

Mordatch, I., and Abbeel, P. 2017. Emergence of grounded compositional language in multi-agent populations. CoRR abs/1703.04908.

Premack, D., and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences 1.

Rabinowitz, N. C.; Perbet, F.; Song, H. F.; Zhang, C.; Eslami, S. M. A.; and Botvinick, M. 2018. Machine theory of mind. CoRR abs/1802.07740.

Raileanu, R.; Denton, E.; Szlam, A.; and Fergus, R. 2018. Modeling others using oneself in multi-agent reinforcement learning. CoRR abs/1802.09640.

Rasouli, A.; Kotseruba, I.; and Tsotsos, J. 2017. Agreeing to cross: How drivers and pedestrians communicate. In Intelligent Vehicles Symposium (IV), 2017 IEEE. IEEE.

Roth, M.; Simmons, R.; and Veloso, M. 2006. What to communicate? Execution-time decision in multi-agent POMDPs. In DARS '06. Springer. 177–186.

Strouse, D. J.; Kleiman-Weiner, M.; Tenenbaum, J.; Botvinick, M.; and Schwab, D. 2018. Learning to share and hide intentions using information regularization. NIPS '18.

Sukhbaatar, S.; Szlam, A.; and Fergus, R. 2016. Learning multiagent communication with backpropagation. CoRR.

Wen, Y.; Yang, Y.; Luo, R.; Wang, J.; and Pan, W. 2019. Probabilistic recursive reasoning for multi-agent reinforcement learning. In ICLR.

Yeh, C., and Lin, H. 2016. Automatic bridge bidding using deep reinforcement learning. CoRR abs/1607.03290.
