
The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Communication Learning via Backpropagation in Discrete Channels with Unknown Noise

Benjamin Freed (Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213)
Guillaume Sartoretti (National University of Singapore, 21 Lower Kent Ridge Rd, Singapore 119077)
Jiaheng Hu (Columbia University, 116th St and Broadway, New York, NY 10027)
Howie Choset (Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213)

Abstract

This work focuses on multi-agent reinforcement learning (RL) with inter-agent communication, in which communication is differentiable and optimized through backpropagation. Such differentiable approaches tend to converge more quickly to higher-quality policies compared to techniques that treat communication as actions in a traditional RL framework. However, modern communication networks (e.g., Wi-Fi or Bluetooth) rely on discrete communication channels, for which existing differentiable approaches that consider real-valued messages cannot be directly applied, or require biased gradient estimators. Some works have overcome this problem by treating the message space as an extension of the action space, and use standard RL to optimize message selection, but these methods tend to converge slower and to inferior policies. In this paper, we propose a stochastic message encoding/decoding procedure that makes a discrete communication channel mathematically equivalent to an analog channel with additive noise, through which gradients can be backpropagated. Additionally, we introduce an encryption step for use in noisy channels that forces channel noise to be message-independent, allowing us to compute unbiased gradient estimates even in the presence of unknown channel noise. To the best of our knowledge, this work presents the first differentiable communication learning approach that can compute unbiased gradients through channels with unknown noise. We demonstrate the effectiveness of our approach in two example multi-robot tasks: a path-finding and a collaborative search problem. There, we show that our approach achieves learning speed and performance similar to differentiable communication learning with real-valued messages (i.e., unlimited communication bandwidth), while naturally handling more realistic real-world communication constraints.

Content Areas: Multi-Agent Communication, Reinforcement Learning.

1 Introduction

Reinforcement learning has recently been successfully applied to complex multi-agent control problems such as DOTA, Starcraft II, and virtual capture the flag (Jaderberg et al. 2019; OpenAI 2018). Many multi-agent reinforcement-learning (MARL) approaches seek to learn decentralized policies, meaning each agent selects actions independently from all other agents, conditioned only on information available to that particular agent (Bernstein et al. 2002; Gupta, Egorov, and Kochenderfer 2017; Sartoretti et al. 2019; Foerster et al. 2017). Such decentralized approaches offer major scalability and parallelizability advantages over centralized planning approaches. The computational complexity of decentralized approaches scales linearly with the team size (as opposed to exponentially, as for typical centralized planners), and learning for each agent can occur completely in parallel (Busoniu, Babuška, and De Schutter 2010). However, decentralized approaches pay for these benefits with potentially sub-optimal policies or lower degrees of agent coordination. Particularly in partially-observable environments, a decentralized approach in which agents make decisions based solely on their own limited observations cannot make as informed decisions as a centralized planner, which conditions all actions on all available information.

We consider MARL problems in which agents have the ability to communicate with each other. This communication ability offers agents a way to selectively exchange information, potentially allowing them to make more informed decisions, while still achieving the same scalability and parallelizability advantages of a purely decentralized approach. However, the addition of communication abilities also increases the difficulty of the learning problem, as agents now have to make decisions not only about what actions to select, but also what information to send, how to encode it, and how to interpret messages received from other agents. This paper focuses on multi-agent RL with communication strategies that are readily applicable to real-world robotics problems.

In this paper, we present a novel approach to differentiable communication learning that utilizes a randomized message encoding scheme to make discrete (and therefore non-differentiable) communication channels behave mathematically like a differentiable, analog communication channel. We can use this technique to obtain unbiased, low-variance gradient estimates through a discrete communication channel. Additionally, we show how our approach can be generalized to communication channels with arbitrary unknown noise, which existing approaches to differentiable communication learning have not been able to deal with.

2 Background

2.1 Multi-agent Reinforcement Learning with Communication

Many past approaches to multi-agent reinforcement learning with inter-agent communication fall into one of two general categories: 1) approaches in which communication is treated as a differentiable process, allowing communication behavior to be optimized via backpropagation (differentiable approaches) (Foerster et al. 2016; Sukhbaatar, Szlam, and Fergus 2016; Mordatch and Abbeel 2017; Paulos et al. 2019), and 2) approaches in which messages are treated as an extension to the action space, and communication behavior is optimized via standard reinforcement learning (reinforced communication learning, RCL) (Foerster et al. 2016; Lowe et al. 2017).

RCL approaches tend to be more general than differentiable approaches (Lowe et al. 2017). This is because the communication channel, and all downstream processing of messages selected by agents, is considered to be part of the environment, and is therefore treated as a black box. No assumptions are made about what influence a particular message may have on other agents or future state transitions. RCL approaches therefore naturally handle unknown channel noise (Lowe et al. 2017). Additionally, RCL naturally handles discrete communication channels: because RCL does not require backpropagation through the communication channel, message selection can be non-differentiable. The downside of RCL's generality is that it typically requires significantly more learning updates to converge to a satisfactory policy, compared to differentiable approaches (Foerster et al. 2016). In RCL, agents are not explicitly provided with knowledge of how their message selection impacts the behavioral policy of other agents, and must instead deduce this influence through repeated trial and error. On the other hand, differentiable approaches allow one to explicitly compute the derivative of recipient agents' behavioral policy or action-value function with respect to the sending agent's communication policy parameters. This allows differentiable approaches to converge to better policies after fewer learning updates compared to RCL.

One additional advantage of differentiable approaches is that, because the objective function used to optimize communication in general does not directly depend on communication behavior, it is possible to train communication behavior via imitation learning, with no explicit expert demonstrations of communication. This is useful when one can generate expert action demonstrations but not expert communication demonstrations, e.g., when a communication-unaware centralized expert is present (Paulos et al. 2019).

Several differentiable approaches in the past have allowed agents to exchange real-valued messages (Foerster et al. 2017; Sukhbaatar, Szlam, and Fergus 2016). Differentiable approaches that consider real-valued communication signals cannot naturally handle discrete communication channels, as a discrete channel is non-differentiable. Some recent approaches have circumvented this problem by using biased gradient estimators (Foerster et al. 2016; Mordatch and Abbeel 2017; Lowe et al. 2017); however, biased estimators typically require additional tuning parameters or more complex training techniques such as annealing (Foerster et al. 2016; Mordatch and Abbeel 2017; Jang, Gu, and Poole 2016). Moreover, biased gradient estimators lack the convergence guarantees of unbiased ones. Additionally, to the best of our knowledge, existing differentiable approaches cannot cope with unknown channel noise, because the channel then represents an unknown stochastic function whose derivatives are consequently unknown. The fact that differentiable approaches are restricted to channels with no/known noise limits their applicability to real-world robotic systems.

In the more general context of learning-based coding, several recent works have learned a source- or channel-coding method (Kim et al. 2018; Farsad, Rao, and Goldsmith 2018), because a learned coding scheme can in some cases be more optimal or more computationally efficient than a hand-crafted one. Typically in these prior works, a coding scheme is learned that minimizes a reconstruction metric that penalizes loss of information. There is a connection between these works and ours, in that one can think of our agents as learning a source-coding scheme that optimizes the reward signal, which implicitly penalizes loss of useful information, because agents must compress information available to them into a fixed-length message in a way that maximizes reward. We believe that this emphasis on useful information allows our agents to be even more selective with what information they choose to send than if a generic supervised reconstruction loss was used.

3 Theory

In this section we explain our approach to differentiable communication learning with a discrete communication channel. We first focus on the case in which channel noise is not present, and then expand our approach to the noisy channel case. Throughout this paper, we assume centralized training, meaning that all training data (agent observations and ground-truth communication signals) can be thought of as being sent to a centralized server where learning updates are computed for each agent, as is done in many multi-agent RL approaches, such as (Kraemer and Banerjee 2016; Foerster et al. 2017; 2016). We feel that this is a reasonable assumption, because training often takes place in a setting in which additional state information is available beyond what will be available to agents at execution time. However, we also assume decentralized execution, meaning that at execution time (i.e., after training has finished/converged), agents do not need to transfer data among themselves or with a centralized server, other than via the communication channels available to them within the environment.

3.1 Communication in a Discrete, Noise-Free Channel

Consider a multi-agent reinforcement learning problem with K agents. Assume that, at a given timestep, a particular agent (Alice) is able to communicate to another agent (Bob) via a noise-free discrete communication channel with capacity C (in bits). For now, we neglect details associated with how agent policies are trained, and assume some sort of gradient-based RL approach is used in which a loss function is computed from data gathered during a set of agent experiences, which is then differentiated with respect to agents' policy parameters to compute a gradient update. This assumption is sufficiently general to encapsulate most prominent RL algorithms, such as Q-learning and variants of policy gradient. Additionally, we do not specify what factors determine which agents can communicate, as this is problem-specific. Here we are primarily concerned with how gradients are backpropagated through the channel.

Because the communication channel is discrete, any message m sent through the channel from Alice to Bob is taken from a finite set of possible messages, i.e., m ∈ M, where M = {m_1, ..., m_{|M|}} and |M| ≤ 2^C. If we wish to perform differentiable communication learning in such an environment, we need to compute the derivative of the input of Bob with respect to the output of Alice. However, the derivative of the channel output with respect to the input is not well-defined, making typical backpropagation impossible.

Existing approaches use a biased gradient estimator to circumvent this problem, such as the Gumbel-Softmax estimator. We propose an alternative approach, described below, in which Alice generates a real-valued communication signal z, which is encoded into a discrete message by a stochastic quantization procedure (randomized encoder). The resulting discrete message m is sent through the communication channel and is received by Bob, who then uses this discrete message to compute an approximation ẑ of the original real-valued signal generated by Alice, using a stochastic dequantization procedure (randomized decoder) (fig. 1). We show that the encoder/channel/decoder system is mathematically equivalent to an analog communication channel with additive noise, allowing the derivative of Bob's reconstruction with respect to Alice's communication signal to be computed. Once again, because we assume centralized training, Bob's learning gradient is available to Alice for the computation of her learning update.

Figure 1: Randomized encoder/channel/decoder: The real-valued output z from agent Alice is converted into a discrete message m by the randomized encoder, which additionally takes random noise variable ε. The message m is then sent through the channel to agent Bob, who decodes it into ẑ (a reconstruction of z) using the randomized decoder, which also takes ε as an input. Under the simple reparameterization ẑ = z + e, ẑ becomes a continuous, differentiable function of z, allowing gradients to be naturally backpropagated through the discrete communication channel.

Randomized Encoder. The purpose of the randomized encoder is to convert a real-valued output from Alice into a discrete message that can be sent through the channel to Bob. One simple approach would be to quantize z directly; however, the derivative of this operation then vanishes everywhere except at the boundaries of the quantization intervals, where it is undefined, precluding the use of this simple idea in a differentiable communication setting.

Instead, we propose the following non-deterministic quantization method: for simplicity, assume z is scalar and z ∈ [0, 1]; however, the encoding process can be easily generalized to vectors of arbitrary length. To encode z, we first perturb it with noise ε, which is sampled independently at each timestep from a uniform distribution centered at 0 with width δ = 1/|M|, i.e., z̄ = z + ε, where ε ∼ U(·; −δ/2, +δ/2). z̄ now ranges from −δ/2 to 1 + δ/2. z̄ is then quantized into one of |M| + 1 quantization intervals of width δ, spaced uniformly along the range of z̄. The index of the quantization interval is taken to be the discrete message, with the special case that the last quantization interval is identified with the first, so that there are exactly |M| possible values for m (fig. 2). In other words, the discrete message to be sent through the channel is given by:

m = \left\lfloor \frac{(z + \epsilon + \delta/2) \bmod 1}{\delta} \right\rfloor .

Note that here m ∈ {0, ..., |M| − 1}; however, this integer representation can be converted to binary for use in a digital communication channel.

Figure 2: Randomized encoder and decoder: The real-valued output z from agent Alice is first perturbed by random noise ε, which is sampled from a zero-centered uniform distribution with width equal to the quantization interval, to form z̄. z̄ is then quantized into one of the |M| quantization intervals, the index of which is taken to be the discrete message m. The message m is finally sent through the channel to agent Bob, who uses it to compute a reconstruction of z by subtracting ε from the center of the quantization interval corresponding to m.

Randomized Decoder. The goal of the randomized decoder is to reconstruct z from m, denoted ẑ, such that ẑ can be represented as a continuous, differentiable function of z and a noise source that is independent of z. Upon receiving m, Bob computes the reconstruction as ẑ = C(m) − ε, where C(m) denotes the center of the quantization interval corresponding to m, C(m) = δ(m + 1/2) (fig. 2). One special case exists in which the decoding process yields ẑ < −δ/2, in which case ẑ = 1 + C(m) − ε. In short, the reconstruction is given by:

\hat{z} = \begin{cases} \delta\left(m + \tfrac{1}{2}\right) - \epsilon, & \text{if } \delta\left(m + \tfrac{1}{2}\right) - \epsilon > -\tfrac{\delta}{2} \\ 1 + \delta\left(m + \tfrac{1}{2}\right) - \epsilon, & \text{otherwise} \end{cases} \quad (1)

We prove in the supplementary material that, using the particular encoding/decoding procedure detailed above, ẑ can be reparameterized according to ẑ = z + e, where e ∼ U(·; −δ/2, +δ/2). Under this reparameterization, the partial derivative of ẑ with respect to z is precisely 1. The encoder/channel/decoder system therefore becomes mathematically equivalent to an analogous situation in which we send the real-valued z directly through an analog communication channel that introduces additive noise to the signal, but is nonetheless differentiable.

Note that Bob requires knowledge of the value of ε that Alice used to encode m, which we do not send through the channel. This is not a problem, because ε does not depend upon z, and we can therefore assume the agents agree ahead of time upon a sequence of ε values to use for each timestep. Practically, this would most likely amount to all agents using the same pseudorandom number generator with the same seed. To generalize the complete encoding/decoding procedure to cases in which z is a vector rather than a scalar, one can apply the above procedure element-wise, with a distinct value of ε for each component of z.

3.2 Communication in a Discrete Channel with Unknown Noise

Here we present a method for generalizing our approach to channels with unknown noise. We define a noisy channel to be one in which the output of the channel, m̂, is not necessarily the same as the input m, but instead is distributed according to some unknown distribution that we assume can depend on the input to the channel and the state, i.e., m̂ ∼ P(· | m, S). We assume this distribution is completely unknown to us. Consequently, the message reconstruction error will no longer be distributed according to a known distribution, and can depend on z and S in unknown ways.

To make the difficulty introduced by unknown channel noise more clear, we can express our message reconstruction in the following form:

\hat{z} = h(z, \alpha(z, S, \beta)),

where h is a known, deterministic function, α is an unknown function representing channel noise, and β is some independent noise source that does not depend on z or S. Differentiating ẑ w.r.t. z, we see that the derivative contains an unknown term, namely the derivative of α w.r.t. z:

\frac{\partial \hat{z}}{\partial z} = \frac{\partial h(z, \alpha)}{\partial z} + \frac{\partial h(z, \alpha)}{\partial \alpha} \frac{\partial \alpha(z, S, \beta)}{\partial z}.

The only situation in which we could backpropagate through the channel without knowing the exact form of α is if we know α does not depend on z, which is not generally the case for the procedure we have presented so far. However, recall that what is actually sent through the communication channel is m. Because we have a choice in how z is encoded into m, we have some influence over the effect of channel noise on z. In fact, we can use this capability to render the reconstruction error completely independent of z, so that ∂ẑ/∂z is known.

Consider the following procedure: z is encoded into m in a manner similar to that detailed in 3.1. After this, a one-to-one mapping is randomly generated from each message to every other message in M, described by the permutation sequence Q. m is then mapped according to Q to m̄, which is sent through the channel. Here we will refer to this step as the "encryption" step, as it is effectively equivalent to a simple form of encryption, which can be thought of as concealing the value of m from the environment, thus limiting the ways in which the decrypted message distribution can vary with m. The recipient agent receives m̄̂, which it then decrypts by mapping it according to Q⁻¹ to an estimate of m, which we denote m̂. This is then converted to ẑ via a randomized decoder similar to that described in 3.1 (fig. 3).

Figure 3: Encryption technique for channels with unknown noise. After discrete message m is generated, it is encrypted according to permutation sequence Q into m̄. m̄ is then sent through the communication channel, which outputs m̄̂, which is not necessarily the same as m̄. m̄̂ is then decrypted (mapped by Q⁻¹) to form m̂, which can be thought of as an estimate of the original message m. With these encryption/decryption steps, the apparent channel noise introduced in the mapping from m to m̂ takes on a known form that allows us to compute unbiased gradients through the channel.

We make some slight modifications to our randomized encoder for use in noisy channels. We now take the communication output z to be a vector on the unit circle. This representation allows z to maintain exactly 1 bounded degree of freedom, while allowing us to eliminate the discontinuity associated with the modulo used in the noise-free technique described in 3.1, because a vector can vary smoothly even when its angle jumps from −π to +π. This modification will be crucial for representing the communication channel as a differentiable function in the presence of channel noise. We again perturb z by random noise ε ∼ U(·; −δ/2, +δ/2). However, to maintain the unit-circle constraint, ε is taken to be an angular perturbation, which we apply to z by multiplication with the 2-dimensional rotation matrix corresponding to ε, denoted R(ε), to form z̄ = R(ε)z. z̄ is then quantized into one of |M| quantization intervals, each one corresponding to a particular angular range on the unit circle, to form discrete message m. Because the range of angles of z̄ is now [−π, +π), the quantization intervals have width δ = 2π/|M| (fig. 4). In other words, m is given by:

m = \left\lfloor \frac{\operatorname{arctan2}(\bar{z}_2, \bar{z}_1) + \delta/2}{\delta} \right\rfloor ,

where z̄_1 and z̄_2 are the first and second elements of z̄, respectively, and z̄ = R(ε)z.

Figure 4: Randomized encoder for channels with unknown noise: the real-valued vector output z from agent Alice is constrained to lie on the unit circle. It is then rotated by random noise ε, which is sampled from a zero-centered uniform distribution with width equal to the quantization interval, to form z̄. z̄ is then quantized into one of |M| quantization intervals, each corresponding to a particular angular range, the index of which is taken to be the discrete message m. In the case depicted above, m = 2.

Following the quantization, m is then encrypted into m̄ by Q, which is then sent through the communication channel. Upon receiving the output of the communication channel, m̄̂, Bob then decrypts m̄̂ into m̂ (maps it according to Q⁻¹), and estimates z according to:

\hat{z} = R(-\epsilon)\, C(\hat{m}),

where C(m) is the vector corresponding to the center of the angular range represented by m.

In the presented communication procedure, there are two sources of error that are introduced into Bob's reconstruction of Alice's real-valued communication output. The first is due to the fact that we have a limit on the information transfer rate through the channel, placing a fundamental limit on how precisely we can reconstruct z, even in a noise-free channel. This form of noise is present whenever the communication channel does not corrupt the discrete message, as described in sec. 3.1, i.e., when m̄̂ = m̄. In this non-corrupting case, we claim (and show in the supplementary material) that the angular error distribution of ẑ is uniform, independent of z, and zero-centered; in particular, ẑ = R(e)z, where e | (m̄̂ = m̄) ∼ U(−δ/2, +δ/2). The second source of error is due to channel noise, and comes into effect whenever the discrete message is corrupted, i.e., m̄̂ ≠ m̄. In the supplementary material, we show that in this case the angular error distribution is still zero-centered, but now has uniform density everywhere except within the interval [−δ/2, +δ/2]. ẑ can therefore be represented in the following way:

\hat{z} = R(e)\, z,

where e is distributed according to a mixture of the two distributions described above. In particular, we show in the supplementary material that e is distributed according to:

P_{e|z,S}(e \mid z, S) = \begin{cases} \frac{1}{\delta}\left(1 - P_e(S)\right), & -\frac{\delta}{2} \le e < +\frac{\delta}{2} \\ \frac{1}{2\pi - \delta}\, P_e(S), & -\pi \le e < -\frac{\delta}{2} \\ \frac{1}{2\pi - \delta}\, P_e(S), & +\frac{\delta}{2} \le e < \pi \end{cases} \quad (2)

where P_e(S) = \frac{1}{|M|} \sum_{m_{in} \in M} P(m_{out} \ne m_{in} \mid S, m_{in}) is the state-dependent average channel error rate, with m_{in} and m_{out} representing the channel input and output, respectively. Since e is independent of z with the introduction of our encryption step, it is possible to compute the Jacobian representing the partial derivatives of ẑ with respect to z, regardless of the form of channel noise, given by:

\frac{\partial \hat{z}}{\partial z} = R(e).

Similar to the noise-free case, to generalize the entire procedure to multi-element messages (i.e., z is a collection of unit-length vectors rather than a single vector), one can simply apply the above procedure to each element of z independently, using unique values of ε and Q for each element.
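The sketch below illustrates one way the unit-circle encoder, the permutation "encryption" step, and the decoder could fit together; it is not the authors' released implementation. The permutation perm plays the role of Q, the angular indexing convention is ours, and the corruption applied to m̄ is only a toy stand-in for a real noisy channel.

```python
import numpy as np

def encrypt_encode(z, eps, perm, n_msgs):
    """Unit-circle randomized encoder followed by the permutation 'encryption' step.
    z is a 2-vector on the unit circle, eps an angular dither in (-delta/2, +delta/2],
    and perm a random permutation of {0, ..., n_msgs-1} shared by both agents."""
    delta = 2 * np.pi / n_msgs
    c, s = np.cos(eps), np.sin(eps)
    z_bar = np.array([c * z[0] - s * z[1], s * z[0] + c * z[1]])   # z_bar = R(eps) z
    angle = np.arctan2(z_bar[1], z_bar[0])                          # in [-pi, +pi]
    m = int(np.floor((angle + np.pi) / delta)) % n_msgs             # angular interval index
    return int(perm[m])                                             # encrypted message m_bar

def decrypt_decode(m_bar_hat, eps, perm, n_msgs):
    """Invert the permutation, then rotate the interval centre back by -eps."""
    delta = 2 * np.pi / n_msgs
    inv = np.argsort(perm)                                           # perm^{-1}
    m_hat = int(inv[m_bar_hat])
    centre_angle = -np.pi + delta * (m_hat + 0.5)                    # angle of C(m_hat)
    c_vec = np.array([np.cos(centre_angle), np.sin(centre_angle)])
    c, s = np.cos(-eps), np.sin(-eps)
    return np.array([c * c_vec[0] - s * c_vec[1], s * c_vec[0] + c * c_vec[1]])  # R(-eps) C(m_hat)

rng = np.random.default_rng(1)
n_msgs = 16
delta = 2 * np.pi / n_msgs
perm = rng.permutation(n_msgs)                   # fresh Q each timestep, drawn from a shared seed
theta = rng.uniform(-np.pi, np.pi)
z = np.array([np.cos(theta), np.sin(theta)])
eps = rng.uniform(-delta / 2, delta / 2)
m_bar = encrypt_encode(z, eps, perm, n_msgs)
m_bar_hat = m_bar if rng.random() > 0.1 else int(rng.integers(n_msgs))  # toy channel corruption
z_hat = decrypt_decode(m_bar_hat, eps, perm, n_msgs)
print(z, z_hat)
```

Because Q is redrawn randomly from a generator whose seed both agents share, a corrupted channel output decrypts to a message that is equally likely to land in any interval other than the true one, which is what makes the error distribution in Eq. (2) independent of z.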
4 Experiments

We compare the performance of our approach to a differentiable communication learning approach with real-valued messages, and to a reinforced communication learning approach, on two simple but illustrative example tasks, described below, with a noise-free communication channel. Additionally, we compare the performance of our encryption-based, channel-noise-tolerant approach to the RCL approach on the same two example tasks, but introducing message-dependent channel noise. Finally, we test the robustness of our approach to the coarseness of discretization and the message length. These experiments additionally allow us to assess the efficiency with which agents utilize the available communication channel.

4.1 Implementation Details

The actor-critic algorithm is used as the reinforcement learning algorithm for all experiments, with separate actor and critic networks. Both actor and critic networks for all tasks are composed of a convolutional stack followed by two fully-connected layers. The policy networks used in the search task also use a simple single-layer recurrent neural network to help them remember areas they have previously explored and therefore do not need to revisit. Because the critic network is used only during training, we choose to give it full state observability in all experiments. During training, the policy networks are given only information they will have during execution time. For a given task, the same architecture is used for both the differentiable and RCL approaches we test, except for differences to the message inputs and outputs necessitated by the differences between the two approaches. In both tasks, we consider 2 agents that are each able to send the other 40 bits of information at each timestep. In both the differentiable and RCL approaches we test, policy parameters are shared across agents, and each agent's action policy is represented as a categorical distribution over discrete actions. In our RCL approach, messages between agents are strings of 40 bits, and the communication policy of each agent (from which a message is sampled and sent to the other agent at each timestep) is represented as a collection of independent Bernoulli distributions representing the probability of each corresponding message bit being a 1. In our differentiable approach, agents output a real-valued communication signal with 10 elements at each timestep. Each element of this communication signal is then discretized independently into one of 16 discrete message elements, meaning each element can be thought of as a 4-bit word (a nibble), allowing for a total of 16^{10} = 2^{40} different possible messages, equivalent to 40 bits.
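As an illustration of that message format, the short sketch below packs ten 16-level message elements into a 40-bit string and back. The bit ordering is our own choice, since the paper only specifies that each element behaves as a 4-bit word.

```python
def elements_to_bits(elements, bits_per_element=4):
    """Pack small integers (ten 4-bit 'nibbles') into a flat bit list, LSB first."""
    return [(e >> b) & 1 for e in elements for b in range(bits_per_element)]

def bits_to_elements(bits, bits_per_element=4):
    """Inverse of elements_to_bits."""
    chunks = [bits[i:i + bits_per_element] for i in range(0, len(bits), bits_per_element)]
    return [sum(bit << b for b, bit in enumerate(chunk)) for chunk in chunks]

msg = [3, 15, 0, 7, 9, 12, 1, 5, 8, 2]      # ten 4-bit message elements
bits = elements_to_bits(msg)                 # 40 bits on the wire
assert len(bits) == 40 and bits_to_elements(bits) == msg
```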

4.2 Reinforced Communication Learning

In our experiments involving RCL, we allow agents to exchange messages of 40 independently-selected bits at each timestep (i.e., we represent the communication policy of each agent as a set of independent bitwise probabilities, each representing the probability that the corresponding message bit is a 1).

4.3 Task 1: Hidden-Goal Path-Finding

The first example task is a path-finding problem in which each agent must navigate to a randomly-assigned goal cell in a 10x10 grid, and at each timestep is able to move up, down, left, right, or remain stationary. Agents do not observe their own goals or the other agent's location, but instead observe only their own location and the other agent's goal; thus, communication is necessary for agents to reliably find their goals. In this task, agents' policy networks are feedforward neural networks with no persistent state, meaning they cannot remember past messages or observations, and must therefore communicate at every timestep. In the noise-free channel case, we compare a differentiable learning approach in which agents are able to exchange real-valued messages (in the form of 10-element vectors in which each element is bounded between 0 and 1), a reinforced communication learning approach in which messages are bit-strings, and our proposed differentiable learning approach (described in Sect. 3.1) in which messages are composed of 10 discrete elements, each taking one of 16 possible discrete values. In this case, messages sent by each agent are guaranteed to be received by the other agent at the next timestep, completely unaltered. In the noisy channel case, we corrupt messages with an asymmetric bit-flip noise as described in Sect. 4.5. We compare the reinforced communication learning approach with the noise-tolerant variant of our proposed differentiable approach (described in Sect. 3.2), where messages are converted to a 40-bit binary representation so the same channel noise can be added.

We utilize a shared reward structure, in which both agents receive the same total reward, computed as a sum of both agents' individual rewards. At each timestep, agents are rewarded proportionately to how much closer to their goals they become during that timestep, with an additional penalty for every timestep the agents have not reached their goal, and a penalty for each timestep agents choose not to move while not at their goal location, to encourage exploration, similar to (Sartoretti et al. 2018a; 2018b).

Each agent's policy network observes its own location and the other agent's goal in the form of two one-hot 10x10 matrices, which are fed into the convolutional stack of the agent's policy network. Additionally, each agent observes the message generated by the other agent at the last timestep, which is fed into the policy network as a separate input and processed by a fully-connected layer before being combined with the output of the convolutional stack. Each agent's critic network observes its location and goal, and the other agent's location and goal, again as one-hot matrices.

4.4 Task 2: Coordinated Multi-Agent Search

The second example task is a two-agent search problem in which each agent is again assigned a goal location in a 10x10 grid. However, in this task, either agent can observe either goal, but only when it is in one of the grid cells immediately adjacent to the goal, or on the goal itself. This limited observability is meant to simulate a sensor with a finite field of view. In this task, to most effectively solve the problem, agents should exchange information related to their observations, in effect broadening each one's sensor footprint. Again, in the noise-free case, we compare differentiable communication learning with real-valued messages, reinforced communication learning with binary messages, and our proposed differentiable approach with discrete messages; in the case with channel noise, we compare reinforced communication learning with the noise-tolerant variant of our approach described in 3.2. We additionally compare these approaches to a version of the task with no communication among agents, to test the hypothesis that communication improves performance.

The reward function for this task is the same as the reward function used for the hidden-goal path-finding task. The observations to both the actor network and critic network are also largely the same, with the exception that each agent's actor network now receives as input 4 matrices, representing its location, its goal (if visible), the other agent's location, and the other agent's goal (if visible).

4.5 Channel Noise Model

The channel noise model we use for both tasks is an asymmetric bit-flip error model. Bits are flipped with unequal probability depending on their value, and the probability of a particular message bit being flipped is taken to be independent of all other bits. More specifically, in the hidden-goal path-finding task, the probability of the i-th message bit being flipped is given by P(flip_i | m̄_i = 0) = 0.1 and P(flip_i | m̄_i = 1) = 0.05. In the coordinated multi-agent search task, the bit-flip probability is given by P(flip_i | m̄_i = 0) = 0.02 and P(flip_i | m̄_i = 1) = 0.01.

The noise was chosen to be asymmetric so that it would be message-dependent, and therefore highlight the capability of our approach to deal with channel noise that depends on the message in arbitrary ways.
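A minimal sketch of this noise model in NumPy is shown below; the exact implementation used in the experiments is not specified in the paper, so the function name and vectorization are our own.

```python
import numpy as np

def bit_flip_channel(bits, p_flip_given_0, p_flip_given_1, rng):
    """Asymmetric, independent bit-flip channel as described in Sect. 4.5:
    a 0-bit is flipped with probability p_flip_given_0, a 1-bit with p_flip_given_1."""
    bits = np.asarray(bits)
    p = np.where(bits == 0, p_flip_given_0, p_flip_given_1)
    flips = rng.random(bits.shape) < p
    return np.bitwise_xor(bits, flips.astype(bits.dtype))

rng = np.random.default_rng(0)
sent = rng.integers(0, 2, size=40)
# Path-finding task rates from the paper: P(flip|0) = 0.1, P(flip|1) = 0.05.
received = bit_flip_channel(sent, 0.1, 0.05, rng)
print((sent != received).sum(), "bits corrupted")
```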

4.6 Effect of Message Length and Quantization Fineness

In this experiment, we compare the performance of our discrete differentiable approach on the robot navigation problem with various numbers of quantization levels and message lengths. Agents have no memory and no knowledge of the location of the other agent, and must therefore encode the other agent's goal location in the message, requiring log_2(100) = 6.6439 bits of information. We first test 10-element messages with real-valued messages, and with 16, 8, 4, and 2 quantization levels per message element. In all these cases, agents can transmit enough information to solve the task optimally. We then compare 5- and 2-element messages, each with 2 quantization levels per element, in which case not enough information is transmitted to optimally solve the task (i.e., fewer than 6.6439 bits). These systematic experiments allow us to assess how efficiently agents make use of their available message bits.

4.7 Results

In both example tasks, with and without channel noise, we found that when either method was able to solve the task, our proposed discrete differentiable communication learning approach drastically outperformed RCL approaches (Fig. 5, 6).

Figure 5: Results for the Hidden-Goal Path-Finding Task (mean episode length and mean episode reward vs. number of training episodes).

In the noise-free hidden-goal path-finding task, our proposed technique was able to solve the task nearly as rapidly as a real-valued differentiable approach that places no limit whatsoever on the information transfer rate through the channel. Unsurprisingly, in the path-finding task with channel noise, the performance of our approach was somewhat degraded, requiring more time to converge than in the noise-free case, and attaining worse final performance. However, our approach represents a significant improvement over RCL, the only other approach we are aware of that is applicable in a noisy channel setting. In fact, RCL with noise completely failed to solve the task, likely due to the additional variance being introduced into its policy gradient estimates (Fig. 5).

In the coordinated multi-agent search task with no channel noise, our approach outperforms the communication-free variant we tested, indicating that agents are using their communication abilities to their benefit (Fig. 6). There again, we find that our approach outperforms RCL, which largely fails to solve the task. The fact that RCL fails to solve the task is not hugely unexpected, as (Foerster et al. 2016) observed similar behavior in some experiments with their reinforced inter-agent learning technique (a form of RCL). Interestingly, the differentiable approach that used real-valued messages seemed to escape a local minimum that our discrete differentiable approach became trapped in. It is possible that the message discretization we used (16 possible values for each message element) was too coarse, and therefore introduced too much noise. Perhaps fewer message elements with a finer discretization would have allowed our approach to perform better on this task while keeping the required information transfer rates the same. We will investigate this trade-off in future work.

In the multi-agent search task with channel noise, both RCL and our proposed differentiable approach struggle to make progress, and learning appears to be unstable (Fig. 6). It is possible that additional tuning of the learning algorithms could result in better performance on this task, which we will determine in future work.

Figure 6: Results for the Coordinated Search Task (mean episode length and mean episode reward vs. number of training episodes).

In the experiment that assesses the effect of message length and quantization fineness, our approach proved robust to large variations in these parameters. Final performance in the case of 10-element messages with 16, 8, and 4 quantization levels per message element is comparable to that of real-valued messages (Fig. 7). In the case of 10- and 5-element messages with 2 quantization levels, final performance is slightly worse; however, the fact that performance did not deteriorate until the available channel capacity was very close to the theoretical minimum required to solve the task optimally (6.6439 bits) indicates that agents were able to make efficient use of the channel available to them. In all cases, training was stable despite the noise introduced through the randomized quantization procedure, and learning efficiency remained high. Performance in the case of 2-element messages with 2 quantization levels was far inferior to the other cases tested, as one would expect given that not enough information is exchanged to solve the task optimally. The code for these experiments is available at http://bit.ly/37r7q7y.

Figure 7: Varying message length and quantization intervals (mean episode length and mean episode reward vs. number of training episodes).

5 Conclusions

In this paper, we presented novel approaches that allow differentiable communication learning to be effectively applied to realistic settings where such approaches were not previously applicable. Specifically, we presented a novel stochastic encoding/decoding procedure that can be used in conjunction with multi-agent reinforcement learning to allow unbiased gradient estimates through a discrete channel. This form of communication learning is made possible by the fact that, with our proposed technique, the encoder/discrete channel/decoder system becomes mathematically equivalent to an analog channel with additive noise, through which derivatives can easily be backpropagated. We additionally extended our approach to allow gradient backpropagation through a channel with unknown noise. To the best of our knowledge, our differentiable communication learning approach is the first one that can handle such unknown noise. We enabled this noise-tolerance by introducing an encryption step immediately before the message is sent through the discrete channel, which forces the apparent channel noise to be independent of the communication signal being sent, meaning it does not contribute to the gradient of the channel and can therefore be ignored in gradient calculations.


We evaluated the effectiveness of our proposed techniques in two example tasks, with and without channel noise, compared to a reinforced communication learning approach and a differentiable approach using real-valued messages (where applicable). We found that our approach outperforms the RCL approaches we tested in all cases in which either algorithm can solve the task. Additionally, we found that in one of the two example tasks, our differentiable approach, which uses a very coarse message discretization and a fairly tight limit on channel capacity, performed nearly as well as the real-valued differentiable approach we tested, which assumes no limit on channel capacity.

Finally, we found that our approach is robust to relatively large variations in message length and quantization fineness, and is capable of making very efficient use of the available channel capacity (be it under sufficient or insufficient bandwidth). Our experimental results, along with the theoretical justification of our technique, indicate that our approach offers significant advantages over existing approaches to MARL with communication, especially in cases with realistic discrete and noisy communication channels.

Ongoing work focuses on four natural extensions of our presented approaches. First, we are applying our approach to more complex and realistic problems, such as tasks involving many agents in environments with obstacles and other impediments. We are also investigating the use of alternative learning modalities, such as imitation learning, in our approach. Because agents learn to exchange messages that optimize some performance metric that depends on messages only indirectly through the agent policies, expert demonstrations can be used to train policies without explicit communication demonstrations. It is therefore likely possible that our approach could learn emergent communication behavior solely through imitation of a centralized expert that does not rely on communication, similar to (Paulos et al. 2019).

Second, we are currently investigating how more established encryption techniques, such as RSA (Rivest, Shamir, and Adleman 1978), could be applied to our approach. For instance, while the type of encryption we use is information-theoretically secure, and therefore the environment is guaranteed to have no knowledge of the message we are communicating, modern encryption techniques rely instead on requiring third parties to do an enormous amount of computation to decrypt the message. It may be sufficient to use such non-information-theoretically-secure but more efficient encryption techniques to make channel noise nearly perfectly independent of the communication signal.

Thirdly, ongoing work explores how error-correcting codes could be utilized to decrease the amount of noise introduced into agents' communication inputs. Some error-correction techniques, such as a cyclic redundancy check, allow error detection in message transmission. Because a large source of noise in our technique results from cases in which the channel outputs an erroneous message, being able to detect such cases with some probability, and neglect the erroneous messages, could lead to an improvement in performance in situations with channel noise.

Finally, another possible extension of this work is adaptation to a fully decentralized training and execution framework. If two-way communication is available among agents, it is possible that the same channels used for communication from sender to receiver could also be used to communicate backpropagated gradients from receiver to sender. In this case, gradients could be encoded into a discrete message using the same randomized quantization procedure described here, and decoded into an unbiased backpropagated gradient estimate. This adaptation would allow agents to be trained in the same conditions as at execution time.

References

Bernstein, D. S.; Givan, R.; Immerman, N.; and Zilberstein, S. 2002. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27(4):819–840.

Busoniu, L.; Babuška, R.; and De Schutter, B. 2010. Multi-agent reinforcement learning: An overview. Innovations in Multi-Agent Systems and Applications-1 310:183–221.

Farsad, N.; Rao, M.; and Goldsmith, A. 2018. Deep learning for joint source-channel coding of text. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2326–2330. IEEE.

Foerster, J. N.; Assael, Y. M.; de Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. CoRR abs/1605.06676.

Foerster, J. N.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. 2017. Counterfactual multi-agent policy gradients. CoRR abs/1705.08926.

Gupta, J. K.; Egorov, M.; and Kochenderfer, M. J. 2017. Cooperative multi-agent control using deep reinforcement learning. In AAMAS Workshops.

Jaderberg, M.; Czarnecki, W. M.; Dunning, I.; Marris, L.; Lever, G.; Castañeda, A. G.; Beattie, C.; Rabinowitz, N. C.; Morcos, A. S.; Ruderman, A.; et al. 2019. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 364(6443):859–865.

Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.

Kim, H.; Jiang, Y.; Rana, R.; Kannan, S.; Oh, S.; and Viswanath, P. 2018. Communication algorithms via deep learning. arXiv preprint arXiv:1805.09317.

Kraemer, L., and Banerjee, B. 2016. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190:82–94.

Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. CoRR abs/1706.02275.

Mordatch, I., and Abbeel, P. 2017. Emergence of grounded compositional language in multi-agent populations. CoRR abs/1703.04908.

OpenAI. 2018. OpenAI Five.

Paulos, J.; Chen, S. W.; Shishika, D.; and Kumar, V. S. A. 2019. Decentralization of multiagent policies by learning what to communicate. In 2019 International Conference on Robotics and Automation (ICRA), 7990–7996.

Rivest, R. L.; Shamir, A.; and Adleman, L. 1978. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM 21(2):120–126.

Sartoretti, G.; Kerr, J.; Shi, Y.; Wagner, G.; Kumar, T. K. S.; Koenig, S.; and Choset, H. 2018a. PRIMAL: Pathfinding via reinforcement and imitation multi-agent learning. CoRR abs/1809.03531.

Sartoretti, G.; Wu, Y.; Paivine, W.; Kumar, T. K. S.; Koenig, S.; and Choset, H. 2018b. Distributed reinforcement learning for multi-robot decentralized collective construction. In DARS 2018 - International Symposium on Distributed Autonomous Robotic Systems, 35–49.

Sartoretti, G.; Paivine, W.; Shi, Y.; Wu, Y.; and Choset, H. 2019. Distributed learning of decentralized control policies for articulated mobile robots. CoRR abs/1901.08537.

Sukhbaatar, S.; Szlam, A.; and Fergus, R. 2016. Learning multiagent communication with backpropagation. CoRR abs/1605.07736.
