Learning to Communicate Implicitly by Actions
The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Zheng Tian,1 Shihao Zou,1 Ian Davies,1 Tim Warr,1 Lisheng Wu,1 Haitham Bou Ammar,1,2 Jun Wang1
1University College London, 2Huawei R&D UK
{zheng.tian.11, shihao.zou.17, ian.davies.12, tim.warr.17, lisheng.wu.17}@ucl.ac.uk, [email protected], [email protected]

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In situations where explicit communication is limited, human collaborators act by learning to: (i) infer meaning behind their partner's actions, and (ii) convey private information about the state to their partner implicitly through actions. The first component of this learning process has been well-studied in multi-agent systems, whereas the second — which is equally crucial for successful collaboration — has not. To mimic both components mentioned above, thereby completing the learning process, we introduce a novel algorithm: Policy Belief Learning (PBL). PBL uses a belief module to model the other agent's private information and a policy module to form a distribution over actions informed by the belief module. Furthermore, to encourage communication by actions, we propose a novel auxiliary reward which incentivizes one agent to help its partner to make correct inferences about its private information. The auxiliary reward for communication is integrated into the learning of the policy module. We evaluate our approach on a set of environments including a matrix game, a particle environment and the non-competitive bidding problem from contract bridge. We show empirically that this auxiliary reward is effective and easy to generalize. These results demonstrate that our PBL algorithm can produce strong pairs of agents in collaborative games where explicit communication is disabled.

Introduction

In collaborative multi-agent systems, communication is essential for agents to learn to behave as a collective rather than a collection of individuals. This is particularly important in the imperfect-information setting, where private information becomes crucial to success. In such cases, efficient communication protocols between agents are needed for private information exchange, coordinated joint-action exploration, and true world-state inference.

In typical multi-agent reinforcement learning (MARL) settings, designers incorporate explicit communication channels hoping to conceptually resemble language or verbal communication, which are known to be important for human interaction (Baker et al. 1999). Though they can be used for facilitating collaboration in MARL, explicit communication channels come at additional computational and memory costs, making them difficult to deploy in decentralized control (Roth, Simmons, and Veloso 2006).

Environments where explicit communication is difficult or prohibited are common. These settings can be synthetic, such as those in games, e.g., bridge and Hanabi, but they also frequently appear in real-world tasks such as autonomous driving and autonomous fleet control. In these situations, humans rely upon implicit communication as a means of information exchange (Rasouli, Kotseruba, and Tsotsos 2017) and are effective in learning to infer the implicit meaning behind others' actions (Heider and Simmel 1944). The ability to perform such inference requires the attribution of a mental state and reasoning mechanism to others. This ability is known as theory of mind (Premack and Woodruff 1978). In this work, we develop agents that benefit from considering others' perspectives and thereby explore the further development of machine theory of mind (Rabinowitz et al. 2018).

Previous works have considered ways in which an agent can, by observing an opponent's behavior, build a model of opponents' characteristics, objectives or hidden information, either implicitly (He et al. 2016; Bard et al. 2013) or explicitly (Raileanu et al. 2018; Li and Miikkulainen 2018). Whilst these works are of great value, they overlook the fact that an agent should also consider that it is being modeled and adapt its behavior accordingly, thereby demonstrating a theory of mind. For instance, in collaborative tasks, a decision-maker could choose to take actions which are informative to its teammates, whereas, in competitive situations, agents may act to conceal private information to prevent their opponents from modeling them effectively.

In this paper, we propose a generic framework, titled policy belief learning (PBL), for learning to cooperate in imperfect-information multi-agent games. Our work combines opponent modeling with a policy that considers that it is being modeled. PBL consists of a belief module, which models other agents' private information by considering their previous actions, and a policy module, which combines the agent's current observation with its beliefs to return a distribution over actions. We also propose a novel auxiliary reward for encouraging communication by actions, which is integrated into PBL. Our experiments show that agents trained using PBL can learn collaborative behaviors more effectively than a number of meaningful baselines, without requiring any explicit communication. We conduct a complete ablation study to analyze the effectiveness of different components within PBL in our bridge experiment.

Related Work

Our work is closely related to (Albrecht and Stone 2017; Lowe et al. 2017; Mealing and Shapiro 2017; Raileanu et al. 2018), where agents build models to estimate other agents' hidden information. Contrastingly, our work enhances a "flat" opponent model with recursive reasoning. "Flat" opponent models estimate only the hidden information of opponents. Recursive reasoning requires making decisions based on the mental states of others as well as the state of the environment. In contrast to works such as I-POMDP (Gmytrasiewicz and Doshi 2005) and PR2 (Wen et al. 2019), where the nested belief is embedded into the training agent's opponent model, we incorporate the level-1 nested belief "I believe that you believe" into our policy via a novel auxiliary reward.

Recently, there has been a surge of interest in using reinforcement learning (RL) approaches to learn communication protocols (Foerster et al. 2016; Lazaridou, Peysakhovich, and Baroni 2016; Mordatch and Abbeel 2017; Sukhbaatar, Szlam, and Fergus 2016). Most of these works enable agents to communicate via an explicit channel. Among these works, Mordatch and Abbeel (2017) also observe the emergence of non-verbal communication in collaborative environments without an explicit communication channel, where agents are exclusively either a sender or a receiver. Similar research is also conducted in (de Weerd, Verbrugge, and Verheij 2015). In our setting, we do not restrict agents to be exclusively a sender or a receiver of communications – agents can communicate mutually by actions. Knepper et al. (2017) propose a framework for implicit communication in a cooperative setting and show that various problems can be mapped into this framework. Although our work is conceptually close to (Knepper et al. 2017), we go further and present a practical algorithm for training agents. The recent work of (Foerster et al. 2018) solves an imperfect-information problem as considered here from a different angle. We approach the problem by encouraging agents to exchange critical information through their actions, whereas Foerster et al. train a public player to choose an optimal deterministic […] function. Strouse et al. (2018) use a mutual information objective to encourage an agent to reveal or hide its intention. In a related work, Jaques et al. (2019) utilize a mutual information objective to imbue agents with social influence. While the objective of maximal mutual information in actions can yield highly effective collaborating agents, a mutual information objective in itself is insufficient to necessitate the development of implicit communication by actions. Eccles et al. (2019) introduce a reciprocity reward as an alternative approach to solve social dilemmas.

A distinguishing feature of our work in relation to previous works in multi-agent communication is that we do not have a predefined explicit communication protocol or learn to communicate through an explicit channel. Information exchange can only happen via actions. In contrast to previous works focusing on unilaterally making actions informative, we focus on bilateral communication by actions, where information transmission is directed to a specific party with potentially limited reasoning ability. Our agents learn to communicate through iterated policy and belief updates such that the resulting communication mechanism and belief models are interdependent. The development of a communication mechanism therefore requires either direct access to the mental state of other agents (via centralized training) or the ability to mentalize, commonly known as theory of mind. We investigate our proposed algorithm in both settings.

Problem Definition

We consider a set of agents, denoted by N, interacting with an unknown environment by executing actions from a joint set A = {A1, ..., AN}, with Ai denoting the action space of agent i, and N the total number of agents. To enable models that approximate real-world scenarios, we assume private and public information states. Private information states, jointly (across agents) denoted by X = {X1, ..., XN}, are a set of hidden information states, where Xi is only observable by agent i, while public states O are observed by all agents. We assume that hidden information states at each time step are sampled from an unknown distribution Px : X → [0, 1], while public states evolve from an initial distribution Po : O → [0, 1] according to a stochastic transition model T : O × X × A1 × ··· × AN × O → [0, 1]. Having transitioned to a successor state according to T, agents receive rewards from R : S × A1 × ··· × AN → R, where we have used S = O × X to denote joint state descriptions that incorporate both public and private information.
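To make the formalism above concrete, the following is a minimal sketch of such an imperfect-information environment: each agent i privately draws Xi from Px, all agents share a public state drawn from Po, and the environment evolves via a transition model T and reward function R defined over S = O × X. All names here are illustrative, and the particular T and R are toy stand-ins, not the paper's environments; the reward is simply chosen so that success depends on the hidden states, which is what makes implicit communication necessary.

```python
import random

class ImperfectInfoGame:
    """Toy instance of the (O, X, A, T, R) formalism described above."""

    def __init__(self, n_agents, private_states, public_states, seed=0):
        self.n_agents = n_agents
        self.private_states = private_states  # support of Px (per agent)
        self.public_states = public_states    # support of Po, i.e. O
        self.rng = random.Random(seed)

    def reset(self):
        # Hidden states are sampled per agent from Px (uniform here);
        # each x_i is revealed only to agent i.
        self.x = [self.rng.choice(self.private_states)
                  for _ in range(self.n_agents)]
        # The public state o ~ Po is observed by every agent.
        self.o = self.rng.choice(self.public_states)
        return self.o

    def observation(self, i):
        # Agent i sees the public state plus only its own private state.
        return (self.o, self.x[i])

    def step(self, joint_action):
        # Stand-in for T : O x X x A1 x ... x AN x O -> [0, 1]
        # (deterministic here for simplicity).
        self.o = (self.o + sum(joint_action)) % len(self.public_states)
        # Stand-in for R : S x A1 x ... x AN -> R with S = O x X:
        # all agents are rewarded only when the joint action matches the
        # sum of the hidden states, so doing well requires inferring the
        # partner's private information from its actions.
        reward = 1.0 if sum(joint_action) == sum(self.x) else 0.0
        return self.o, reward

game = ImperfectInfoGame(n_agents=2, private_states=[0, 1],
                         public_states=[0, 1, 2])
game.reset()
obs_0 = game.observation(0)           # agent 0 never sees x_1 directly
o, r = game.step([game.x[0], game.x[1]])  # fully informed joint action
```

In the final line the agents "cheat" by acting on both hidden states, which guarantees the reward of 1.0; under the paper's actual setting each agent would instead have to act on a belief over its partner's Xi inferred from past actions.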