Learning to Communicate Implicitly by Actions

Zheng Tian†, Shihao Zou†, Ian Davies†, Tim Warr†, Lisheng Wu†, Haitham Bou Ammar†‡, Jun Wang†
† University College London   ‡ Huawei R&D UK
{zheng.tian.11, shihao.zou.17, ian.davies.12, tim.warr.17, lisheng.wu.17}@ucl.ac.uk, [email protected], [email protected]

Abstract

In situations where explicit communication is limited, human collaborators act by learning to: (i) infer meaning behind their partner's actions, and (ii) convey private information about the state to their partner implicitly through actions. The first component of this learning process has been well-studied in multi-agent systems, whereas the second — which is equally crucial for successful collaboration — has not. To mimic both components mentioned above, thereby completing the learning process, we introduce a novel algorithm: Policy Belief Learning (PBL). PBL uses a belief module to model the other agent's private information and a policy module to form a distribution over actions informed by the belief module. Furthermore, to encourage communication by actions, we propose a novel auxiliary reward which incentivizes one agent to help its partner to make correct inferences about its private information. The auxiliary reward for communication is integrated into the learning of the policy module. We evaluate our approach on a set of environments including a matrix game, a particle environment and the non-competitive bidding problem from contract bridge. We show empirically that this auxiliary reward is effective and easy to generalize. These results demonstrate that our PBL algorithm can produce strong pairs of agents in collaborative games where explicit communication is disabled.

Introduction

In collaborative multi-agent systems, communication is essential for agents to learn to behave as a collective rather than a collection of individuals. This is particularly important in the imperfect-information setting, where private information becomes crucial to success. In such cases, efficient communication protocols between agents are needed for private information exchange, coordinated joint-action exploration, and true world-state inference.

In typical multi-agent reinforcement learning (MARL) settings, designers incorporate explicit communication channels hoping to conceptually resemble language or verbal communication, which are known to be important for human interaction (Baker et al. 1999). Though they can be used for facilitating collaboration in MARL, explicit communication channels come at additional computational and memory costs, making them difficult to deploy in decentralized control (Roth, Simmons, and Veloso 2006).

Environments where explicit communication is difficult or prohibited are common. These settings can be synthetic, such as those in games, e.g., bridge and Hanabi, but also frequently appear in real-world tasks such as autonomous driving and autonomous fleet control. In these situations, humans rely upon implicit communication as a means of information exchange (Rasouli, Kotseruba, and Tsotsos 2017) and are effective in learning to infer the implicit meaning behind others' actions (Heider and Simmel 1944). The ability to perform such inference requires the attribution of a mental state and reasoning mechanism to others. This ability is known as theory of mind (Premack and Woodruff 1978). In this work, we develop agents that benefit from considering others' perspectives and thereby explore the further development of machine theory of mind (Rabinowitz et al. 2018).

Previous works have considered ways in which an agent can, by observing an opponent's behavior, build a model of opponents' characteristics, objectives or hidden information, either implicitly (He et al. 2016; Bard et al. 2013) or explicitly (Raileanu et al. 2018; Li and Miikkulainen 2018). Whilst these works are of great value, they overlook the fact that an agent should also consider that it is being modeled and adapt its behavior accordingly, thereby demonstrating a theory of mind. For instance, in collaborative tasks, a decision-maker could choose to take actions which are informative to its teammates, whereas, in competitive situations, agents may act to conceal private information to prevent their opponents from modeling them effectively.

In this paper, we propose a generic framework, titled policy belief learning (PBL), for learning to cooperate in imperfect-information multi-agent games. Our work combines opponent modeling with a policy that considers that it is being modeled. PBL consists of a belief module, which models other agents' private information by considering their previous actions, and a policy module, which combines the agent's current observation with its beliefs to return a distribution over actions. We also propose a novel auxiliary reward for encouraging communication by actions, which is integrated into PBL. Our experiments show that agents trained using PBL can learn collaborative behaviors more effectively than a number of meaningful baselines without requiring any explicit communication. We conduct a complete ablation study to analyze the effectiveness of different components within PBL in our bridge experiment.

Related Work

Our work is closely related to (Albrecht and Stone 2017; Lowe et al. 2017; Mealing and Shapiro 2017; Raileanu et al. 2018), where agents build models to estimate other agents' hidden information. Contrastingly, our work enhances a "flat" opponent model with recursive reasoning. "Flat" opponent models estimate only the hidden information of opponents. Recursive reasoning requires making decisions based on the mental states of others as well as the state of the environment. In contrast to works such as I-POMDP (Gmytrasiewicz and Doshi 2005) and PR2 (Wen et al. 2019), where the nested belief is embedded into the training agent's opponent model, we incorporate the level-1 nested belief "I believe that you believe" into our policy by a novel auxiliary reward.

Recently, there has been a surge of interest in using reinforcement learning (RL) approaches to learn communication protocols (Foerster et al. 2016; Lazaridou, Peysakhovich, and Baroni 2016; Mordatch and Abbeel 2017; Sukhbaatar, Szlam, and Fergus 2016). Most of these works enable agents to communicate via an explicit channel. Among these works, Mordatch and Abbeel (2017) also observe the emergence of non-verbal communication in collaborative environments without an explicit communication channel, where agents are exclusively either a sender or a receiver. Similar research is also conducted in (de Weerd, Verbrugge, and Verheij 2015). In our setting, we do not restrict agents to be exclusively a sender or a receiver of communications – agents can communicate mutually by actions. Knepper et al. (2017) propose a framework for implicit communication in a cooperative setting and show that various problems can be mapped into this framework. Although our work is conceptually close to (Knepper et al. 2017), we go further and present a practical algorithm for training agents. The recent work of (Foerster et al. 2018) solves an imperfect-information problem as considered here from a different angle. We approach the problem by encouraging agents to exchange critical information through their actions, whereas Foerster et al. train a public player to choose an optimal deterministic policy for players in a game based on publicly observable information. In Cooperative Inverse RL (CIRL), where robotic agents try to infer a human's private reward function from their actions (Hadfield-Menell et al. 2016), optimal solutions need to produce behavior that conveys information. Dragan, Lee, and Srinivasa (2013) consider how to train agents to exhibit legible behavior (i.e. behavior from which it is easy to infer the intention). Their approach is dependent on a hand-crafted cost function to attain informative behavior. Mutual information has been used as a means to promote coordination without the need for a human-engineered cost function. Strouse et al. (2018) use a mutual information objective to encourage an agent to reveal or hide its intention. In a related work, Jaques et al. (2019) utilize a mutual information objective to imbue agents with social influence. While the objective of maximal mutual information in actions can yield highly effective collaborating agents, a mutual information objective in itself is insufficient to necessitate the development of implicit communication by actions. Eccles et al. (2019) introduce a reciprocity reward as an alternative approach to solve social dilemmas.

A distinguishing feature of our work in relation to previous works in multi-agent communication is that we do not have a predefined explicit communication protocol or learn to communicate through an explicit channel. Information exchange can only happen via actions. In contrast to previous works focusing on unilaterally making actions informative, we focus on bilateral communication by actions, where information transmission is directed to a specific party with potentially limited reasoning ability. Our agents learn to communicate through iterated policy and belief updates, such that the resulting communication mechanism and belief models are interdependent. The development of a communication mechanism therefore requires either direct access to the mental state of other agents (via centralized training) or the ability to mentalize, commonly known as theory of mind. We investigate our proposed algorithm in both settings.

Problem Definition

We consider a set of agents, denoted by N, interacting with an unknown environment by executing actions from a joint set A = {A^1, ..., A^N}, with A^i denoting the action space of agent i, and N the total number of agents. To enable models that approximate real-world scenarios, we assume private and public information states. Private information states, jointly (across agents) denoted by X = {X^1, ..., X^N}, are a set of hidden information states where X^i is only observable by agent i, while public states O are observed by all agents. We assume that hidden information states at each time step are sampled from an unknown distribution P_x : X -> [0, 1], while public states evolve from an initial distribution P_o : O -> [0, 1] according to a stochastic transition model T : O x X x A^1 x ... x A^N x O -> [0, 1]. Having transitioned to a successor state according to T, agents receive rewards from R : S x A^1 x ... x A^N -> R, where we have used S = O x X to denote joint state descriptions that incorporate both public and private information. Finally, rewards are discounted over time by a factor γ in (0, 1]. With this notation, our problem can be described succinctly by the tuple <N, A, O, X, T, R, P_x, P_o, γ>, which we refer to as an imperfect-information Markov decision process (I2MDP)¹. In this work, we simplify the problem by assuming that hidden information states X are temporally static and are given at the beginning of the game.

¹ We also note that our problem can be formalized as a decentralized partially observable Markov decision process (Dec-POMDP) (Bernstein, Zilberstein, and Immerman 2013).
We interpret the joint policy from the perspective of agent i such that π = (π^i(a^i | s), π^{-i}(a^{-i} | s)), where π^{-i}(a^{-i} | s) is a compact representation of the joint policy of all agents excluding agent i. In the collaborative setting, each agent is presumed to pursue the shared maximal cumulative reward expressed as

    max_π  η^i(π) = E[ Σ_{t=1}^∞ γ^t R(s_t, a^i_t, a^{-i}_t) ],    (1)

where s_t = [x^i_t, x^{-i}_t, o_t] is the current full information state, (a^i_t, a^{-i}_t) are the joint actions taken by agent i and all other agents respectively at time t, and γ is a discount factor.

Policy Belief Learning

Applying naive single-agent reinforcement learning (SARL) algorithms to our problem will lead to poor performance. One reason for this is the partial observability of the environment. To succeed in a partially observable environment, an agent is often required to maintain a belief state. Recall that, in our setting, the environment state is formed from the union of the private information of all agents and the publicly observable information, s_t = [x^i_t, x^{-i}_t, o_t]. We therefore learn a belief module Φ^i(x^{-i}_t) to model other agents' private information x^{-i}_t, which is the only hidden information from the perspective of agent i in our setting. We assume that an agent can model x^{-i}_t given the history of public information and actions executed by other agents, h^i_t = {o_{1:t-1}, a^{-i}_{1:t-1}}. We use a NN to parameterize the belief module, which takes in the history of public information and produces a belief state b^i_t = Φ^i(x^{-i}_t | h^i_t). The belief state together with information observable by agent i forms a sufficient statistic, ŝ^i_t = [x^i_t, b^i_t, o_t], which contains all the information necessary for the agent to act optimally (Åström 1965). We use a separate NN to parameterize agent i's policy π^i(a^i_t | ŝ^i_t), which takes in the estimated environment state ŝ^i_t and outputs a distribution over actions. As we assume hidden information is temporally static, we drop the time subscript for it in the rest of the paper.
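As a concrete illustration of this parameterization, the following is a minimal sketch of a belief module Φ^i mapping a public history to a belief vector b^i_t. The GRU encoder, layer sizes and all names are our assumptions for illustration only, not the architecture used in the paper (see Appendix B.2 for that).

```python
import torch
import torch.nn as nn

class BeliefModule(nn.Module):
    """Sketch of Phi^i: maps the public history h^i_t to a belief b^i_t over
    the other agent's private information (one probability per dimension)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int, private_dim: int):
        super().__init__()
        # Each history element concatenates a public observation and the
        # other agent's previous action (e.g. one-hot encoded).
        self.encoder = nn.GRU(obs_dim + act_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, private_dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, t-1, obs_dim + act_dim)
        _, h_n = self.encoder(history)
        # Belief b^i_t in [0, 1]^private_dim, one Bernoulli probability per dimension.
        return torch.sigmoid(self.head(h_n[-1]))
```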
The presence of multiple learning agents interacting with the environment renders the environment non-stationary. This further limits the success of SARL algorithms, which are generally designed for environments with stationary dynamics. To solve this, we adopt centralized training and decentralized execution, where during training all agents are recognized as one central representative agent differing only by their observations. Under this approach, one can imagine belief models Φ^i(x^{-i} | h^i_t) and Φ^{-i}(x^i | h^{-i}_t) sharing parameters φ. The input data, however, varies across agents due to the dependency on both h^i_t and h^{-i}_t. In a similar fashion, we let policies share the parameters θ. Consequently, one may think of updating θ and φ using one joint data set aggregated across agents. Without loss of generality, in the remainder of this section, we discuss the learning procedure from the point of view of a single agent, agent i.

We first present the learning procedure of our belief module. At iteration k, we use the current policy π_[k](a | ŝ) to generate a data set of size M, Ω_[k] = {(x^{-i}_j, h^i_j)}_{j=1}^M, using self-play and learn a new belief module by minimizing:

    φ_[k] := argmin_φ  E_{(x^{-i}, h^i) ~ Ω_[k-1]} [ KL( x^{-i}_j || b^i_j(h^i_j; φ) ) ],    (2)

where KL(·||·) is the Kullback–Leibler (KL) divergence and we use a one-hot vector to encode the ground truth, x^{-i}_j, when we calculate the relevant KL-divergence.

With the updated belief module Φ_[k], we learn a new policy for the next iteration, π_[k+1], via a policy gradient algorithm. Sharing information in multi-agent cooperative games through communication reduces intractability by enabling coordinated behavior. Rather than implementing expensive protocols (Heider and Simmel 1944), we encourage agents to implicitly communicate through actions by introducing a novel auxiliary reward. To do so, notice that in the centralized setting agent i has the ability to consult its opponent's belief model Φ^{-i}(x^i | h^{-i}_t), thereby exploiting the fact that other agents hold beliefs over its private information x^i. In fact, comparing b^{-i}_t to the ground-truth x^i enables agent i to learn which actions bring these two quantities closer together and thereby learn informative behavior. This can be achieved through an auxiliary reward signal devised to encourage informative action communication:

    r^i_{c,t} = KL( x^i || b^{-i,*} ) − KL( x^i || b^{-i}_{t+1} ),    (3)

where b^{-i,*} = Φ^{-i}_[k](x^i | h^{-i}_{t,*}) is agent −i's best belief (so far) about agent i's private information:

    b^{-i,*} = argmin_{b^{-i}_u} KL( x^i || b^{-i}_u ),  for all u ≤ t.

In other words, r^i_{c,t} encourages communication as it is proportional to the improvement in the opponent's belief (for a fixed belief model Φ^{-i}_[k](x^i | h^{-i}_{t+1})), measured by its proximity to the ground-truth, resulting from the opponent observing agent i's action a^i_t. Hence, during the policy learning step of PBL, we apply a policy gradient algorithm with a shaped reward of the form²:

    r = r_e + α r_c,    (4)

where r_e is the reward from the environment, r_c is the communication reward and α ≥ 0 balances the communication and environment rewards.

² Please note, we omit the agent index i in the reward equation, as we shape rewards similarly for all agents.

Algorithm 1 Per-Agent Policy Belief Learning (PBL)
1: Initialize: Randomly initialize policy π_0 and belief Φ_0
2: Pre-train π_0
3: for k = 0 to max iterations do
4:   Sample episodes for belief training using self-play, forming the data set Ω_[k]
5:   Update the belief network using data from Ω_[k] by solving Equation 2
6:   Given the updated beliefs Φ_[k+1](·), update the policy π(·) (policy gradients with rewards from Equation 4)
7: end for
8: Output: Final policy and belief model

Initially, in the absence of a belief module, we pre-train a policy π_[0] naively by ignoring the existence of other agents in the environment. As an agent's reasoning ability may be limited, we may then iterate between belief and policy learning multiple times until either the allocated computational resources are exhausted or the policy and belief modules converge. We summarize the main steps of PBL in Algorithm 1. Note that, although information can be leaked during training, as training is centralized, distributed test-phase execution ensures that private variables remain hidden during execution.
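As a concrete illustration of Equations 3 and 4, the sketch below computes the communication reward from stored beliefs, under the assumption (used in our bridge experiments) that each dimension of the belief and of the binary ground truth is treated as an independent Bernoulli distribution. All function and variable names are ours, not the paper's code.

```python
import numpy as np

def bernoulli_kl(x: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """KL(x || b) with x a binary ground-truth vector and b a belief in (0, 1),
    treating every dimension as an independent Bernoulli distribution."""
    b = np.clip(b, eps, 1.0 - eps)
    x = np.clip(x, eps, 1.0 - eps)
    return float(np.sum(x * np.log(x / b) + (1.0 - x) * np.log((1.0 - x) / (1.0 - b))))

def communication_reward(x_i: np.ndarray,
                         best_belief_so_far: np.ndarray,
                         belief_after_action: np.ndarray) -> float:
    """Equation 3: positive when agent i's action moves the opponent's belief
    about x^i closer to the truth than its best value so far."""
    return bernoulli_kl(x_i, best_belief_so_far) - bernoulli_kl(x_i, belief_after_action)

def shaped_reward(r_env: float, r_comm: float, alpha: float) -> float:
    """Equation 4: r = r_e + alpha * r_c."""
    return r_env + alpha * r_comm
```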

Figure 1: Matrix game experiment and results. (a) Payoff for the matrix game (Player 1 acts first, Player 2 acts second; rows: Player 1's card and action; columns: Player 2's action and card). (b) Learning curves of PBL and baselines (BAD, BAD-CF, PG) over 100 runs.

(a) Payoff table:
                              Player 2: A        Player 2: B        Player 2: C
                              Card 1   Card 2    Card 1   Card 2    Card 1   Card 2
    Player 1 Card 1, act A      10        0         0        0         0       10
    Player 1 Card 1, act B       4        8         4        4         8        4
    Player 1 Card 1, act C      10        0         0        0         0       10
    Player 1 Card 2, act A       0        0        10       10         0        0
    Player 1 Card 2, act B       4        8         4        4         8        4
    Player 1 Card 2, act C       0        0         0       10         0        0
Machine Theory of Mind

In PBL, we adopt a centralized training and decentralized execution scheme where agents share the same belief and policy models. In reality, however, it is unlikely that two people will have exactly the same reasoning process. In contrast to requiring everyone to have the same reasoning process, a person's success in navigating social dynamics relies on their ability to attribute mental states to others. This attribution of mental states to others is known as theory of mind (Premack and Woodruff 1978). Theory of mind is fundamental to human social interaction, which requires the recognition of other sensory perspectives, the understanding of other mental states, and the recognition of complex non-verbal signals of emotional state (Lemaignan and Dillenbourg 2015). In collaboration problems without an explicit communication channel, humans can effectively establish an understanding of each other's mental state and subsequently select appropriate actions. For example, a teacher will reiterate a difficult concept to students if she infers from the students' facial expressions that they have not understood. The effort of one agent to model the mental state of another is characterized as Mutual Modeling (Dillenbourg 1999).

In our work, we also investigate whether the proposed communication reward can be generalized to a distributed setting which resembles a human application of theory of mind. Under this setting, we train a separate belief model for each agent so that Φ^i(x^{-i} | h^i_t) and Φ^{-i}(x^i | h^{-i}_t) do not share parameters (φ^i ≠ φ^{-i}). Without centralization, an agent can only measure how informative its action is to others with its own belief model. Assuming agents can perfectly recall their past actions and observations, agent i computes its communication reward as³:

    r^i_{c,t} = KL( x^i || b̃^{i,*}_t ) − KL( x^i || b̃^i_{t+1} ),

where b̃^{i,*}_t = Φ^i(x^i | h^{-i}_{t,*}) and b̃^i_{t+1} = Φ^i(x^i | h^{-i}_{t+1}). In this way, an agent essentially establishes a mental state of others with its own belief model and acts upon it. We humbly believe this could be a step towards machine theory of mind, where algorithmic agents learn to attribute mental states to others and adjust their behavior accordingly.

³ Note the difference of super/sub-scripts of the belief model and its parameters when compared to Equation 3.

The ability to mentalize relieves the restriction of collaborators having the same reasoning process. However, the success of collaboration still relies on the correctness of one's belief about the mental states of others. For instance, correctly inferring other drivers' mental states and conventions can reduce the likelihood of traffic accidents. Therefore road safety education is important, as it reduces variability among drivers' reasoning processes. In our work, this alignment amounts to the similarity between two agents' trained belief models, which is affected by training data, initialization of weights, training algorithms and so on. We leave investigation of the robustness of collaboration to variability in collaborators' belief models to future work.
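Relative to Equation 3, the only change in the distributed setting is which belief model is consulted: agent i applies its own belief model Φ^i to the history its partner observes. A minimal sketch, reusing bernoulli_kl from the sketch above; belief_model and the history arguments are hypothetical names of ours.

```python
def distributed_communication_reward(belief_model, x_i, partner_history_best, partner_history_next):
    # Agent i scores its own action with its OWN belief model Phi^i, applied to
    # the history observable by its partner, as a stand-in for the partner's belief.
    b_best = belief_model(partner_history_best)   # corresponds to b~^{i,*}_t
    b_next = belief_model(partner_history_next)   # corresponds to b~^i_{t+1}
    return bernoulli_kl(x_i, b_best) - bernoulli_kl(x_i, b_next)
```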
Experiments & Results

We test our algorithms in three experiments. In the first, we validate the correctness of the PBL framework, which integrates our communication reward with iterative belief and policy module training, in a simple matrix game. In this relatively simple experiment, PBL achieves near optimal performance. Equipped with this knowledge, we further apply PBL to the non-competitive bridge bidding problem to verify its scalability to more complex problems. Lastly, we investigate the efficacy of the proposed communication reward in a distributed training setting.

Matrix Game

We test our PBL algorithm on a matrix card game where an implicit communication strategy is required to achieve the global optimum. This game was first proposed in (Foerster et al. 2018). There are two players and each player receives a card drawn from {card 1, card 2} independently at the beginning of the game. Player 1 acts first and Player 2 responds after observing Player 1's action. Neither player can see the other's hand. By the design of the payoff table (shown in Figure 1a), Player 1 has to use actions C and A to signify that it holds Cards 1 and 2 respectively, so that Player 2 can choose its actions optimally with the given information. We compare PBL with algorithms proposed in (Foerster et al. 2018) and vanilla policy gradient. As can be seen from Figure 1b, PBL performs similarly to BAD and BAD-CF on this simple game and outperforms vanilla policy gradient significantly. This demonstrates a proof of principle for PBL in a multi-agent imperfect-information coordination game.

Figure 2: (a) Learning curves for non-competitive bridge bidding with a warm start from a model trained to predict the score distribution (average reward at warm start: 0.038). Details of the warm start are provided in Appendix B.3. (b) Bar graph comparing PBL to variants of PQL, with the full version of PQL results as reported in (Yeh and Lin 2016).

Contract Bridge Case-Study

Non-competitive contract bridge bidding is an imperfect-information game that requires information exchange between agents to agree high-quality contracts. Hence, such a game serves as an ideal test-bed for PBL. In bridge, two teams of two (North-South vs East-West) are situated in opposing positions and play a trick-taking game using a standard 52-card deck. Following a deal, the bidding and playing phases can be effectively separated. During the bidding phase, players sequentially bid for a contract until a final contract is reached. A PASS bid retains previously proposed contracts, and a contract is considered final if it is followed by three consecutive PASS bids. A non-PASS bid proposes a new contract of the form <integer, suit>, where integer takes integer values between one and seven, and suit belongs to {♣, ♦, ♥, ♠, NT}. The number of tricks needed to achieve a contract is 6 + integer, and an NT suit corresponds to bidding to win tricks without trumps. A contract-declaring team achieves points if it fulfills the contract, and if not, the points for the contract go to the opposing team. Bidding must be non-decreasing, meaning integer is non-decreasing and must increase if the newly proposed suit precedes or equals the currently bid suit in the ordering ♣ < ♦ < ♥ < ♠ < NT.

In this work, we focus on non-competitive bidding in bridge, where we consider North (N) and South (S) bidding in the game, while East (E) and West (W) always bid PASS. Hence, the declaring team never changes. Thus, each deal can be viewed as an independent episode of the game. The private information of player i in {N, S}, x^i, is its hand. x^i is a 52-dimensional binary vector encoding player i's 13 cards. An agent's observation at time step t consists of its hand and the bidding history: o^i_t = {x^i_t, h^i_t}. In each episode, Players N and S are dealt hands x^N, x^S respectively. Their hands, together, describe the full state of the environment s = {x^N, x^S}, which is not fully observed by either of the two players. Since rolling out via self-play for every contract is computationally expensive, we resort to double dummy analysis (DDA) (Haglund 2010) for score estimation. Interested readers are referred to (Haglund 2010) and the appendix for further details. In our work, we use standard Duplicate rules (League 2017) to score games and normalize scores by dividing them by the maximum absolute score.

Benchmarking & Ablation Studies: PBL introduces several building blocks, each affecting performance in its own right. We conduct an ablation study to better understand the importance of these elements and compare against a state-of-the-art method in PQL (Yeh and Lin 2016). We introduce the following baselines:

1. Independent Player (IP): A player bids independently without consideration of the existence of the other player.

2. No communication reward (NCR): One important question to ask is how beneficial the additional communication auxiliary reward r_c is in terms of learning a good bidding strategy. To answer this question, we implement a baseline using the same architecture and training schedule as PBL but setting the communication reward weighting to zero, α = 0.

3. No PBL style iteration (NPBI): To demonstrate that multiple iterations between policy and belief training are beneficial, we compare our model to a baseline policy trained with the same number of weight updates as our model but no further PBL iterations after training a belief network Φ_0 at PBL iteration k = 0.

4. Penetrative Q-Learning (PQL): PQL was proposed by Yeh and Lin (2016) as the first bidding policy for non-competitive bridge bidding without human domain knowledge.


Figure 3: An example of a belief update trace showing how PBL agents use actions for effective communication. Upon observing PASS from East, West decreases its HCP belief in all suits. When West bids <1, C>, East improves its belief in clubs. Next, East bids <1, H>. West recalculates its belief from the last time step and increases its HCP belief in hearts. (Bid sequence along the time axis: Player Two bids PASS; Player One bids One Club; Player Two bids One Heart; Player One bids Four Hearts.)

Figure 2a shows the average learning curves of our model and three baselines for our ablation study. We obtain these curves by testing trained algorithms periodically on a pre-generated test data set which contains 30,000 games. Each point on the curve is an average score computed by Duplicate bridge scoring rules (League 2017) over 30,000 games and 6 training runs. As can be seen, IP and NCR both initially learn faster than our model. This is reasonable, as PBL spends more time learning a communication protocol at first. However, IP converges to a local optimum very quickly and is surpassed by PBL after approximately 400 learning iterations. NCR learns a better bidding strategy than IP with a belief module. However, NCR learns more slowly than PBL in the later stage of training because it has no guidance on how to convey information to its partner. PBL outperforming NPBI demonstrates the importance of iterative training between policy and belief modules.

Restrictions of PQL: PQL (Yeh and Lin 2016) is the first algorithm trained to bid in bridge without human-engineered features. However, its strong bidding performance relies on heavy adaptation and heuristics for non-competitive bridge bidding. First, PQL requires a predefined maximum number of allowed bids in each deal, while using different bidding networks at different times. Our results show that it will fail when we train a single NN for the whole game, which can be seen as a minimum requirement for most DRL algorithms. Second, PQL relies on a rule-based function for selecting the best contracts at test time. In fact, removing this second heuristic significantly reduces PQL's performance, as reported in Figure 2b. In addition, without pre-processing the training data as in (Yeh and Lin 2016), we could not reproduce the original results. To achieve state-of-the-art performance, we could use these (or other) heuristics for our bidding algorithm. However, this deviates from the focus of our work, which is to demonstrate that PBL is a general framework for learning to communicate by actions.

Belief Update Visualization: To understand how agents update their beliefs after observing a new bid, we visualize the belief update process (Figure 3). An agent's belief about its opponent's hand is represented as a 52-dimensional vector with real values, which is not amenable to human interpretation. Therefore, we use high card points (HCPs) to summarize each agent's belief. For each suit, each card is given a point score according to the mapping: A=4, K=3, Q=2, J=1, else=0. Note that while an agent's belief is updated based on the entire history of its opponent's bids, the difference between that agent's belief from one round to the next is predominantly driven by the most recent bid of its opponent, as shown in Figure 3.

Learned Bidding Convention: Whilst our model's bidding decisions are based entirely on raw card data, we can use high card points as a simple way to observe and summarize the decisions which are being made. For example, we observe that our policy opens the bid with 1♠ if its HCPs in spades are 4.5 or higher but its HCPs in every other suit are lower. We run the model on the unseen test set of 30,000 deals and summarize the learned bidding convention in Appendix C.1.

Imperfect Recall of History: The length of the action history players can recall affects the accuracy of the belief models. The extent of the impact depends on the nature of the game. In bridge, the order of bidding encodes important information. We ran an ablation study where players can only recall the most recent bid. In this setting, players do worse (average score 0.065) than players with perfect recall. We conjecture that this is because players can extract less information and therefore the accuracy of belief models drops.


Figure 4: a) Learning curves for Silent Guide. Guide agent trained with communication reward (CR) significantly outperforms the one trained with no communication reward (NCR). b) A trajectory of Listener (gray circle) and Guide (blue circle) with CR. Landmarks are positioned randomly and the Goal landmark (blue square) is randomly chosen at the start of each episode. c) A trajectory of Listener and Guide with NCR. Trajectories are presented with agents becoming progressively darker over time.

Silent Guide

We modify a multi-agent particle environment (Lowe et al. 2017) to test the effectiveness of our novel auxiliary reward in a distributed setting. This environment also allows us to explore the potential for implicit communication to arise through machine theory of mind. In the environment there are two agents and three landmarks. We name the agents Guide and Listener respectively. Guide can observe Listener's goal landmark, which is distinguished by its color. Listener does not observe its goal. However, Listener is able to infer the meaning behind Guide's actions. The two agents receive the same reward, which is the negative distance between Listener and its goal. Therefore, to maximize the cumulative reward, Guide needs to tell Listener the goal landmark color. However, as the "Silent Guide" name suggests, Guide has no explicit communication channel and can only communicate to Listener through its actions.

In the distributed setting, we train separate belief modules for Guide and Listener respectively. The two belief modules are both trained to predict a naive agent's goal landmark color given its history within the current episode, but using different data sets. We train both Guide and Listener policies from scratch. Listener's policy takes Listener's velocity, relative distance to the three landmarks and the prediction of the belief module as input. It is trained to maximize the environment reward it receives. Guide's policy takes its velocity, relative distance to the landmarks and Listener's goal as input. To encourage communication by actions, we train Guide's policy with the auxiliary reward proposed in our work. We compare our method against a naive Guide policy which is trained without the communication reward. The results are shown in Figure 4. Guide, when trained with the communication reward (CR), learns to inform Listener of its goal by approaching the goal it observes. Listener learns to follow. However, in the NCR setting, Listener learns to ignore Guide's uninformative actions and moves to the center of the three landmarks. While Guide and Listener are equipped with belief models trained from different data sets, Guide manages to use its own belief model to establish the mental state of Listener and learns to communicate through actions judged by this constructed mental state of Listener. We also observe that a trained Guide agent can work with a naive RL Listener (best reward -0.252) which has no belief model but can observe the PBL Guide agent's action. The success of Guide with CR shows the potential for machine theory of mind. We obtain the learning curves by repeating the training process five times and taking the shared average environment reward.

Conclusions & Future Work

In this paper, we focus on implicit communication through actions. This draws a distinction between our work and previous works, which either focus on explicit communication or unilateral communication. We propose an algorithm combining agent modeling and communication for collaborative imperfect-information games. Our PBL algorithm iterates between training a policy and a belief module. We propose a novel auxiliary reward for encouraging implicit communication between agents, which effectively measures how much closer the opponent's belief about a player's private information becomes after observing the player's action. We empirically demonstrate that our methods can achieve near optimal performance in a matrix problem and scale to complex problems such as contract bridge bidding. We conduct an initial investigation of the further development of machine theory of mind. Specifically, we enable an agent to use its own belief model to attribute mental states to others and act accordingly. We test this framework and achieve some initial success in a multi-agent particle environment under distributed training. There are many interesting avenues for future work, such as exploration of the robustness of collaboration to differences in agents' belief models.

References

Albrecht, S. V., and Stone, P. 2017. Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems. arXiv e-prints.

Åström, K. J. 1965. Optimal control of Markov Processes with incomplete state information. Journal of Mathematical Analysis and Applications 10:174–205.

Baker, M.; Hansen, T.; Joiner, R.; and Traum, D. 1999. The role of grounding in collaborative learning tasks. Collaborative learning: Cognitive and computational approaches.

Bard, N.; Johanson, M.; Burch, N.; and Bowling, M. 2013. Online implicit agent modelling. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS '13, 255–262.

Bernstein, D. S.; Zilberstein, S.; and Immerman, N. 2013. The Complexity of Decentralized Control of Markov Decision Processes. arXiv e-prints arXiv:1301.3836.

de Weerd, H.; Verbrugge, R.; and Verheij, B. 2015. Higher-order theory of mind in the tacit communication game. Biologically Inspired Cognitive Architectures.

Dillenbourg, P. 1999. What do you mean by collaborative learning? In Dillenbourg, P., ed., Collaborative-learning: Cognitive and Computational Approaches. Oxford: Elsevier.

Dragan, A. D.; Lee, K. C.; and Srinivasa, S. S. 2013. Legibility and predictability of robot motion. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction, 301–308. IEEE Press.

Eccles, T.; Hughes, E.; Kramár, J.; Wheelwright, S.; and Leibo, J. Z. 2019. Learning reciprocity in complex sequential social dilemmas. arXiv preprint arXiv:1903.08082.

Foerster, J. N.; Assael, Y. M.; de Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. CoRR abs/1605.06676.

Foerster, J. N.; Song, F.; Hughes, E.; Burch, N.; Dunning, I.; Whiteson, S.; Botvinick, M.; and Bowling, M. 2018. Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning. arXiv e-prints arXiv:1811.01458.

Gmytrasiewicz, P. J., and Doshi, P. 2005. A framework for sequential planning in multi-agent settings. J. Artif. Int. Res.

Hadfield-Menell, D.; Russell, S. J.; Abbeel, P.; and Dragan, A. 2016. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, 3909–3917.

Haglund, B. 2010. Search algorithms for a bridge double dummy solver.

He, H.; Boyd-Graber, J.; Kwok, K.; and Daumé III, H. 2016. Opponent modeling in deep reinforcement learning. In ICML '16.

Heider, F., and Simmel, M. 1944. An experimental study of apparent behavior. The American Journal of Psychology.

Jaques, N.; Lazaridou, A.; Hughes, E.; Gulcehre, C.; Ortega, P.; Strouse, D.; Leibo, J. Z.; and De Freitas, N. 2019. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, 3040–3049.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Knepper, R. A.; Mavrogiannis, C. I.; Proft, J.; and Liang, C. 2017. Implicit communication in a joint action. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, HRI '17. ACM.

Lazaridou, A.; Peysakhovich, A.; and Baroni, M. 2016. Multi-agent cooperation and the emergence of (natural) language. CoRR abs/1612.07182.

League, A. C. B. 2017. Laws of Duplicate Bridge.

Lemaignan, S., and Dillenbourg, P. 2015. Mutual modelling in robotics: Inspirations for the next steps. 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 303–310.

Li, X., and Miikkulainen, R. 2018. Dynamic adaptation and opponent exploitation in computer poker. In Workshops at the 32nd AAAI Conference on Artificial Intelligence.

Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. Neural Information Processing Systems (NIPS).

Mealing, R., and Shapiro, J. L. 2017. Opponent modeling by expectation-maximization and sequence prediction in simplified poker. IEEE Transactions on Computational Intelligence and AI in Games 9(1):11–24.

Mordatch, I., and Abbeel, P. 2017. Emergence of grounded compositional language in multi-agent populations. CoRR abs/1703.04908.

Premack, D., and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences 1.

Rabinowitz, N. C.; Perbet, F.; Song, H. F.; Zhang, C.; Eslami, S. M. A.; and Botvinick, M. 2018. Machine theory of mind. CoRR abs/1802.07740.

Raileanu, R.; Denton, E.; Szlam, A.; and Fergus, R. 2018. Modeling others using oneself in multi-agent reinforcement learning. CoRR abs/1802.09640.

Rasouli, A.; Kotseruba, I.; and Tsotsos, J. 2017. Agreeing to cross: How drivers and pedestrians communicate. In Intelligent Vehicles Symposium (IV), 2017 IEEE. IEEE.

Roth, M.; Simmons, R.; and Veloso, M. 2006. What to communicate? Execution-time decision in multi-agent POMDPs. In DARS '06, 177–186. Springer.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Strouse, D. J.; Kleiman-Weiner, M.; Tenenbaum, J.; Botvinick, M.; and Schwab, D. 2018. Learning to share and hide intentions using information regularization. NIPS '18.

Sukhbaatar, S.; Szlam, A.; and Fergus, R. 2016. Learning multiagent communication with backpropagation. CoRR.

Wen, Y.; Yang, Y.; Luo, R.; Wang, J.; and Pan, W. 2019. Probabilistic recursive reasoning for multi-agent reinforcement learning. In ICLR.

Yeh, C., and Lin, H. 2016. Automatic bridge bidding using deep reinforcement learning. CoRR abs/1607.03290.
A Bridge

A.1 Playing Phase in Bridge

After the final contract is decided, the player from the declaring side who first bid the trump suit named in the final contract becomes Declarer. Declarer's partner is Dummy. The player to the left of Declarer becomes the first leading player. Dummy then lays his cards face up on the table and play proceeds clockwise. On each trick, the leading player plays one card from their hand, and the other players must follow with a card of the same suit as the leading player if possible; otherwise, they can play a card from another suit. The trump suit is superior to all other suits and, within a suit, a higher-rank card is superior to a lower-rank one. A trick is won by the player who plays the card with the highest priority. The winner of the trick becomes the leading player for the next trick.

A.2 Double Dummy Analysis (DDA)

Double Dummy Analysis assumes that, for a particular deal, one player's hand is fully observed by other players and players always play cards to their best advantage. However, given a set s = {x^N, x^S}, the distribution of remaining cards for the two non-bidding players East and West is still unknown. To reduce the variance of the estimate, we repeatedly sample a deal U times by allocating the remaining cards randomly to East and West and then estimate r_e(s) by taking the average of their DDA scores,

    r_e(s) = (1/U) Σ_{u=1}^{U} r_e(x^N, x^S, x^E_u, x^W_u),    (5)

where x^E_u, x^W_u are the hands for East and West from the u-th sampling respectively. For a specific contract a_t, the corresponding score is given by r_e(a_t | s) = r_e(x^N, x^S, a_t). In our work, we set U = 20.
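A sketch of the estimate in Equation 5 follows. The double-dummy scorer itself is external (Haglund 2010); dda_score here is a hypothetical callable standing in for it, and all other names are ours.

```python
import random

def estimate_scores(north, south, all_cards, contracts, dda_score, U=20):
    """Monte Carlo estimate of r_e(a | s): average double-dummy scores over U
    random completions of the East/West hands (Equation 5)."""
    remaining = [c for c in all_cards if c not in set(north) | set(south)]
    totals = {a: 0.0 for a in contracts}
    for _ in range(U):
        # Deal the 26 remaining cards randomly to East and West.
        shuffled = random.sample(remaining, len(remaining))
        east, west = shuffled[:13], shuffled[13:]
        for a in contracts:
            totals[a] += dda_score((north, east, south, west), a)
    return {a: total / U for a, total in totals.items()}
```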

A.3 Bridge Scoring

Algorithm 2 Bridge Duplicate Scoring
Input: tricks made, bid level, trump suit
  T ← trump suit
  δ ← tricks made − (bid level + 6)
  score ← 0
  if δ ≥ 0 then
    score ← score + bid level ∗ scale_T + bias_T      {Contract tricks}
    if score ≥ 100 then
      score ← score + 300      {Game bonus}
    else
      score ← score + 50       {Partscore}
    end if
    if δ = 6 then
      score ← score + 500      {Slam bonus}
    else if δ = 7 then
      score ← score + 1000     {Grand slam bonus}
    end if
    if δ > 0 then
      score ← score + δ ∗ scale_T      {Over-tricks}
    end if
  else
    score ← score − bid level ∗ 50     {Under-tricks}
  end if

Algorithm 2 shows how we score a game under Duplicate Bridge scoring rules. We obtain the average number of tricks made using Double Dummy Analysis (Haglund 2010) given the hands of players North and South {x^N, x^S}, the declarer and the trump suit. The score function above has a scale and a bias for each trump suit. The scale is 20 for ♣ and ♦ and 30 for all others. The bias is zero for all trumps, except NT which has a bias of 10.

Note that Double is only a valid bid in response to a contract proposed by one's opponents. Also, a Redouble bid must be preceded by a Double bid. In the non-competitive game, opponents do not propose contracts, so these options are naturally not included.
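For reference, a direct Python transcription of Algorithm 2 with the scale and bias values stated above (this mirrors the simplified scoring used here, not the full duplicate-bridge law; the suit keys are our own encoding):

```python
SCALE = {'C': 20, 'D': 20, 'H': 30, 'S': 30, 'NT': 30}
BIAS = {'C': 0, 'D': 0, 'H': 0, 'S': 0, 'NT': 10}

def duplicate_score(tricks_made: int, bid_level: int, trump: str) -> int:
    """Score a contract as in Algorithm 2."""
    delta = tricks_made - (bid_level + 6)
    score = 0
    if delta >= 0:
        score += bid_level * SCALE[trump] + BIAS[trump]   # contract tricks
        score += 300 if score >= 100 else 50              # game bonus / partscore
        if delta == 6:                                    # slam bonuses as written
            score += 500
        elif delta == 7:
            score += 1000
        if delta > 0:
            score += delta * SCALE[trump]                 # over-tricks
    else:
        score -= bid_level * 50                           # under-tricks
    return score
```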

p p bS xN

S S N N a FCp η + E + η FCp a xS bN

b N h FCb

South

Figure 5: The architecture of our policy network and belief network in bridge. E represents the environment and FC stands for fully connected layers. For player i, we add their private information x^i and belief b^i together to form η^i as the input to its policy network π. Player i's action a^i becomes part of player j's history h^j, which is the input to player j's belief network Φ.

A.4 Double Pass Analysis

At the beginning of bidding, when both players bid PASS (a Double Pass), all players' hands are re-dealt and a new episode starts. If we ignore these episodes in training, a naive strategy emerges where a player always bids PASS unless it is highly confident about its hand and therefore bids at level 1, whose risk is at the minimum. In this work, we are interested in solving problems where private information needs to be inferred from observed actions for better performance in a game. Therefore, this strategy is less meaningful, and the opportunity cost of bidding PASS could be high when a player could have won the game with a high reward. To reflect this opportunity cost in training, we set the reward for a Double Pass as the negative of the maximum attainable reward given the players' hands: r_dp(s) = −max(r_e(s)). Therefore, a player will be penalized heavily for bidding PASS if it could have obtained a high reward otherwise, and rewarded slightly if it could never win in the current episode. It is worthy of note that an initial PASS bid can convey information; however, if it is followed by a second PASS bid the game ends and hence no further information is imparted from a second PASS. We note that, by Duplicate Bridge Laws 77 and 22 of (League 2017), if the game opens with four PASS bids all players score zero. In our setting, however, East and West bid PASS regardless of their hands, which would give North and South an extra advantage. Therefore, we use r_dp to avoid results where North and South only bid when they have particularly strong hands; we discourage fully risk-averse behavior.

B Experiment Details in Bridge

B.1 Offline Environment

Generating a new pair of hands for North and South and pre-calculating scores for every possible contract at the start of a new episode during policy training is time-inefficient. Instead, we pre-generate 1.5 million hands and score them in advance. We then sample new episodes from this data set when we train a policy. We also generate a separate test data set containing 30,000 hands for testing our models.

B.2 Model Architecture

We parameterize the policy and belief modules with two neural networks π_{θ_p} and Φ_{θ_b} respectively. Fig. 5 shows our model for the bridge experiments. The weights of the belief and policy modules are shared between players. The input to a player's policy π_{θ_p} is the sum of its private information x and belief b. The last layer of the belief module Φ has a Sigmoid non-linear activation function, so that the belief vector b has elements in the interval [0, 1]. By adding the belief b to x, we can reuse the parameters associated with x for b and avoid training extra parameters.

B.3 Policy Pre-training

In the first PBL iteration (k = 0), to have a good initial policy and avoid one with low entropy, we train our policy to predict a distribution formed by applying a Softmax with temperature τ to r_e(s). The loss for the pre-trained policy given s is:

    L_{k_0} = KL( π(a | x) || Softmax(r_e(s), τ) ),

where KL is the KL-divergence. To have a fair comparison with other benchmarks, all our benchmarks are initialized with this pre-trained policy. Supervising a bidding policy to predict pre-calculated scores for all actions only provides it with a basic understanding of its hand.

B.4 Policy Training

We utilize Proximal Policy Optimization (PPO) (Schulman et al. 2017) for policy training. Optimization is performed with the Adam optimizer (Kingma and Ba 2014) with β_1 = 0.9, β_2 = 0.999, ε = 10^{-8}. The initial learning rate is 10^{-4}, which we decay exponentially with a rate of decay of 0.95 and a decay step of 50. The distance measure used in the communication reward r_c is cross entropy, and we treat each dimension of the belief and ground truth as an independent Bernoulli distribution. We train PBL with 8 PBL iterations; we apply policy gradients 200 times per iteration and we sample 5000 episodes each time. Each mini-batch is then a sub-sample of 2048 episodes. The initial communication weight α is 5, and we decay it gradually. We train all other baselines with the same PPO hyperparameters.
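The pre-training objective in B.3 above can be written compactly; the following is a minimal sketch. The use of PyTorch primitives and the application of the temperature as r_e(s)/τ are our assumptions; only the form of the loss follows the text.

```python
import torch
import torch.nn.functional as F

def pretrain_loss(policy_logits: torch.Tensor, scores: torch.Tensor, tau: float) -> torch.Tensor:
    """KL( pi(a|x) || Softmax(r_e(s), tau) ): pushes the initial policy towards
    bids whose pre-computed double-dummy scores are high."""
    log_pi = F.log_softmax(policy_logits, dim=-1)
    target = F.softmax(scores / tau, dim=-1)
    pi = log_pi.exp()
    # KL(pi || target), summed over actions and averaged over the batch.
    return torch.sum(pi * (log_pi - torch.log(target + 1e-8)), dim=-1).mean()
```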

B.5 Learning A Belief Network

When a player tries to model its partner's hand based on the observed bidding history, we assume it can omit the restriction that its partner can only hold 13 cards. Therefore, we treat the prediction of the partner's hand given the observed bidding history as a 52-label classification problem, where each label represents one corresponding card being in the partner's hand. In other words, we treat each card from a 52-card deck being in the partner's hand as an independent Bernoulli distribution, and we train a belief network by maximizing the joint likelihood of these 52 Bernoulli distributions given a bidding history h. This gives the loss for the belief network as:

    L_Φ = − Σ_{c=1}^{52} [ x^{-i}_c log(b^i_c) + (1 − x^{-i}_c) log(1 − b^i_c) ],

where x^{-i}_c and b^i_c are the c-th elements of the encoding vector of the partner's hand x^{-i} and of the agent's belief b^i respectively. The reasoning behind this assumption is that we consider it more important to have a more accurate prediction over an invalid distribution than a less accurate one over a valid distribution, as the belief itself is already an approximation.

For each iteration of belief training, we generate 300,000 data episodes with the current policy to train the belief network. Optimization is performed with the Adam optimizer with β_1 = 0.9, β_2 = 0.999, ε = 10^{-8} and a decay rate of 0.95. The initial learning rate is 10^{-3}. The batch size is 1024. We split the data set such that 90% of it is used for training and 10% is used for an early-stopping check to prevent overfitting.
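The loss above is a summed binary cross-entropy over 52 independent Bernoulli variables. A minimal sketch, assuming b is the sigmoid output of the belief network and x the partner's 52-dimensional binary hand encoding (names are ours):

```python
import torch
import torch.nn.functional as F

def belief_loss(b: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """L_Phi = -sum_c [ x_c * log b_c + (1 - x_c) * log(1 - b_c) ], averaged over the batch.
    b: (batch, 52) predicted probabilities; x: (batch, 52) binary hand encoding."""
    return F.binary_cross_entropy(b, x, reduction='none').sum(dim=-1).mean()
```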

C Learned Opponent Model and Bidding Convention

C.1 High Card Points Tables

Table 1: Opening bid – own aHCPs.

    Bid    ♣    ♦    ♥    ♠    Total
    PASS   1.4  1.3  1.4  1.4   5.4
    1♠     2.3  2.1  2.3  4.5  11.2
    1♥     2.3  2.2  4.7  1.8  11.0
    1♦     2.2  4.4  2.1  2.0  10.7
    1♣     4.8  1.9  2.1  2.1  10.9
    3NT    4.4  4.4  4.9  4.6  18.3
    4♦     3.0  6.5  3.1  3.4  16.1
    4♣     6.6  5.3  2.3  2.3  16.5

Table 2: Belief HCPs after observing the opening bid.

    Bid    ♣    ♦    ♥    ♠    Total
    PASS   1.4  1.3  1.3  1.3   5.3
    1♠     2.3  2.1  2.2  4.6  11.1
    1♥     2.2  2.2  4.7  1.9  11.0
    1♦     2.3  4.5  2.0  2.0  10.8
    1♣     4.8  2.0  2.2  2.2  11.1
    3NT    4.4  4.6  4.8  4.7  18.5
    4♦     2.8  6.8  3.3  3.1  16.0
    4♣     6.2  4.0  3.0  2.8  16.0

Table 3: Responding bid – own + belief aHCPs.

    Bid    ♣    ♦    ♥    ♠    Total
    PASS   4.4  4.5  4.4  4.5  17.8
    1♠     4.4  4.3  4.5  7.1  20.3
    1♥     4.6  4.4  7.0  4.9  20.9
    1♦     4.3  6.7  4.8  4.9  20.7
    1♣     7.2  4.3  4.7  4.9  21.1
    2♠     4.8  4.7  5.2  8.5  23.2
    2♥     4.9  5.1  8.6  5.1  23.7
    3NT    6.6  6.6  7.1  7.2  27.6
    4♦     5.2  8.5  5.5  5.6  24.9
    4♣     8.4  5.6  5.4  5.4  24.8
    6♥     5.7  6.2  9.9  6.9  28.7
    6♦     6.4  9.2  7.3  7.1  30.0
    6♣    10.8  4.0 10.0  6.5  31.3
    6NT    7.5  7.0  8.7  9.0  32.2

Table 1 shows the average HCPs (aHCPs) present in a hand for each of the opening bidding decisions made by North. Once an opening bid is observed, South updates their belief; Table 2 shows the effect which each opening bid has on South's belief. We show in Table 3 the responding bidding decisions made by South; the aHCP values in Table 3 are the sum of the HCPs in South's hand and South's belief over the HCPs in North's hand. For each row whose bid has a specified trump suit, the maximum value in that row falls in the column of the bid suit.

C.3 Interesting Bidding Examples

We note that, in general, in the initial stages of bidding the suit of a bid is used to show strength in that suit, with no-trump bids used to show strength across all suits and PASS bids used to show a weak hand. In later stages of bidding, players either reiterate their strength with larger trick numbers in their bids or show strength in other suits. In the final rounds of bidding, agents show their overall combined view of the pairing's hands to propose bids that reflect all 26 of the pair's cards. In the following examples, we describe how this bidding process manifests in certain cases. Note that in each case where a bid specifying the trump suit is made, the bidding pair have the majority of the cards in the trump suit in their collective hands. This is often essential for success in attaining suited bids.

Example 1
NORTH: ♠ 6 10   ♥ 2 4 6 A   ♦ 2   ♣ 3 5 J Q K A
SOUTH: ♠ 8 K A   ♥ 8   ♦ 3 4 7 9 J A   ♣ 2 6 7
BIDS: 1♣ 1♦ 2♣ 2♦ 3NT PASS
REWARD: 0.283

North has a very strong hand in clubs, holding all of the picture cards and the ace. North also holds the ace of hearts. South has strength in diamonds and spades. The bidding pattern shows that North initially signals their strength in clubs and South their strength in diamonds. Both parties then reiterate their strengths in their next turns and a contract of 3NT is agreed.

Example 2
NORTH: ♠ 8 Q K A   ♥ 8 Q   ♦ 2 3 4 9 J Q A
SOUTH: ♠ 3 4   ♥ 3 7 9 10 J K   ♦ 5 6 K   ♣ 4 A
BIDS: 1♦ 1♥ 3NT 4♥ 6♥ PASS
REWARD: 0.301

North opens the bidding with diamonds to show their strength in the suit. South then responds with hearts to show their strongest suit. North then bids no trumps to signal an inferred collective strength in multiple suits. South has little strength outside of hearts and continues to bid hearts, and ultimately a contract of 6♥ is reached as bid by North, who only has two hearts. This is a bid to win all but one trick with hearts as trumps, which reflects the fact that between them the pair have from 7 to King of hearts. The pairing bid hearts despite their collective strength also in diamonds, since hearts commands a higher per-trick score if the contract is met.

Example 3
NORTH: ♠ 2 4 7 J A   ♥ K   ♦ K   ♣ 3 4 5 6 8 A
SOUTH: ♠ 6 9 K   ♥ 7   ♦ 3 5 6 10 Q A   ♣ 10 J K
BIDS: 1♣ 1♦ 1♠ 2♣ 3NT 4♠ PASS
REWARD: 0.305

Between them the pairing have a strong hand in diamonds, clubs and spades with a lack of hearts (15 + 13 = 28 HCP Total). Bidding opens with 1♣ as North aims to show their strength in clubs. South then conveys their strength in diamonds with a bid of 1♦. Bidding continues as North demonstrates some strength in spades with a bid of 1♠. Later, a bid by North of 3NT demonstrates that North believes the pair have strength across the suits, as reflected by the bids from all suits except hearts. Bidding ends, however, with a bid of 4♠ from South, which is then accepted through a PASS bid from North. South makes the bid of 4♠ despite only having 3 spades in their hand. The earlier bid of 1♠ from North informs South that North has some strength in spades and hence South believes that the pair may be able to attain 4♠.

Example 4
NORTH: ♠ 2   ♥ 7 9 10 K   ♦ 2 4 9 J A   ♣ 10 J A
SOUTH: ♠ 9 K   ♥ 2 4 J A   ♦ 3 6 8 Q K   ♣ Q K
BIDS: 1♦ 3NT 4♥ 6♦ 6NT PASS
REWARD: 0.651

North has strength in all suits excluding spades (13 HCP Total) and South has strength evenly spread across all suits (18 HCP). Their shared strongest suit is diamonds, where they hold the highest 4 cards between them. The bid of 6♦ is raised to 6NT as North forms a belief that the pair share strength across multiple suits (31 HCP Total for both hands).

Example 5
NORTH: ♠ 3 9 J Q   ♥ 6 10 Q   ♦ 2 3 7 A   ♣ 2 5
SOUTH: ♠ 7 10   ♥ 4 J A   ♦ 4 8 Q K   ♣ 4 6 K A
BIDS: PASS 1♣ 1♦ 2♦ 3NT PASS
REWARD: 0.211

North opens the bidding with PASS, signaling that they have a relatively weak hand (8 HCP Total). South then responds with 1♣, signaling their strength in clubs. Bidding then progresses as North bids 1♦, signaling that diamonds is the stronger of their suits. South also has strength in diamonds, and ultimately the pairing is able to reason about their collective hands and bid 3NT; a bid that would not be reached without forming beliefs, due to the relative weakness of North's hand.

Example 6
NORTH: ♠ 6 7 10 J   ♥ 5 6 7 K   ♦ 8 10   ♣ 7 Q K
SOUTH: ♠ 9 K   ♥ A   ♦ 3 4 9 Q A   ♣ 2 3 4 8 A
BIDS: PASS 1♣ 2♣ 2♦ 3NT PASS
REWARD: 0.258

Between them, North and South have evenly spread strength across the suits (9 + 17 = 26 HCP Total). Similarly to Example 4, we see that the agents are able to reason, using their beliefs, about the collective strength of both of their hands. Here the pair reach a bid of 3NT, while in Example 4 the pairing shared 31 HCP and reached a bid of 6NT.

Example 7
NORTH: ♥ 8 Q K   ♦ 4 5 6 7 8 9   ♣ 2 3 7 K
SOUTH: ♠ 2 4 6 10 J   ♥ 3 9 A   ♣ 9 10 J Q A
BIDS: PASS 1♣ 1♦ 1♠ 2♣ PASS
REWARD: 0.104

Both players lack any cards in one of the four suits. North opens the bidding with PASS to show the relative weakness of their hand (8 HCP Total). Bidding remains low in different suits, reflecting the similar level of strength across the suits, while low bids show no significant strength overall. Ultimately a bid of 2♣ is reached, reflecting strength in ♣, as between them the pair hold from 9 to Ace of clubs.