
The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Communication Learning via Backpropagation in Discrete Channels with Unknown Noise

Benjamin Freed (Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213)
Guillaume Sartoretti (National University of Singapore, 21 Lower Kent Ridge Rd, Singapore 119077)
Jiaheng Hu (Columbia University, 116th St and Broadway, New York, NY 10027)
Howie Choset (Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213)

Abstract

This work focuses on multi-agent reinforcement learning (RL) with inter-agent communication, in which communication is differentiable and optimized through backpropagation. Such differentiable approaches tend to converge more quickly to higher-quality policies compared to techniques that treat communication as actions in a traditional RL framework. However, modern communication networks (e.g., Wi-Fi or Bluetooth) rely on discrete communication channels, for which existing differentiable approaches that consider real-valued messages cannot be directly applied, or require biased gradient estimators. Some works have overcome this problem by treating the message space as an extension of the action space, and use standard RL to optimize message selection, but these methods tend to converge slower and to inferior policies. In this paper, we propose a stochastic message encoding/decoding procedure that makes a discrete communication channel mathematically equivalent to an analog channel with additive noise, through which gradients can be backpropagated. Additionally, we introduce an encryption step for use in noisy channels that forces channel noise to be message-independent, allowing us to compute unbiased gradient estimates even in the presence of unknown channel noise. To the best of our knowledge, this work presents the first differentiable communication learning approach that can compute unbiased gradients through channels with unknown noise. We demonstrate the effectiveness of our approach in two example multi-robot tasks: a path-finding and a collaborative search problem. There, we show that our approach achieves learning speed and performance similar to differentiable communication learning with real-valued messages (i.e., unlimited communication bandwidth), while naturally handling more realistic real-world communication constraints.

Content Areas: Multi-Agent Communication, Reinforcement Learning.

1 Introduction

Reinforcement learning has recently been successfully applied to complex multi-agent control problems such as DOTA, Starcraft II, and virtual capture the flag (Jaderberg et al. 2019; OpenAI 2018). Many multi-agent reinforcement-learning (MARL) approaches seek to learn decentralized policies, meaning each agent selects actions independently from all other agents, conditioned only on information available to that particular agent (Bernstein et al. 2002; Gupta, Egorov, and Kochenderfer 2017; Sartoretti et al. 2019; Foerster et al. 2017). Such decentralized approaches offer major scalability and parallelizability advantages over centralized planning approaches. The computational complexity of decentralized approaches scales linearly with the team size (as opposed to exponentially, as for typical centralized planners), and learning for each agent can occur completely in parallel (Busoniu, Babuška, and De Schutter 2010). However, decentralized approaches pay for these benefits with potentially sub-optimal policies or lower degrees of agent coordination. Particularly in partially-observable environments, a decentralized approach in which agents make decisions based solely on their own limited observations cannot make as informed decisions as a centralized planner, which conditions all actions on all available information.

We consider MARL problems in which agents have the ability to communicate with each other. This communication ability offers agents a way to selectively exchange information, potentially allowing them to make more informed decisions, while still achieving the same scalability and parallelizability advantages of a purely decentralized approach. However, the addition of communication abilities also increases the difficulty of the learning problem, as agents now have to make decisions not only about what actions to select, but also what information to send, how to encode it, and how to interpret messages received from other agents. This paper focuses on multi-agent RL with communication strategies that are readily applicable to real-world robotics problems.

In this paper, we present a novel approach to differentiable communication learning that utilizes a randomized message encoding scheme to make discrete (and therefore non-differentiable) communication channels behave mathematically like a differentiable, analog communication channel. We can use this technique to obtain unbiased, low-variance gradient estimates through a discrete communication channel. Additionally, we show how our approach can be generalized to communication channels with arbitrary unknown noise, which existing approaches to differentiable communication learning have not been able to deal with.

2 Background

2.1 Multi-agent Reinforcement Learning with Communication

Many past approaches to multi-agent reinforcement learning with inter-agent communication fall into one of two general categories: 1) approaches in which communication is treated as a differentiable process, allowing communication behavior to be optimized via backpropagation (differentiable approaches) (Foerster et al. 2016; Sukhbaatar, Szlam, and Fergus 2016; Mordatch and Abbeel 2017; Paulos et al. 2019), and 2) approaches in which messages are treated as an extension to the action space, and communication behavior is optimized via standard reinforcement learning (reinforced communication learning, RCL) (Foerster et al. 2016; Lowe et al. 2017).

RCL approaches tend to be more general than differentiable approaches (Lowe et al. 2017). This is because the communication channel, and all downstream processing of messages selected by agents, is considered to be part of the environment, and is therefore treated as a black box. No assumptions are made about what influence a particular message may have on other agents or future state transitions. RCL approaches therefore naturally handle unknown channel noise (Lowe et al. 2017). Additionally, RCL naturally handles discrete communication channels: because RCL does not require backpropagation through the communication channel, message selection can be non-differentiable. The downside of RCL's generality is that it typically requires significantly more learning updates to converge to a satisfactory policy, compared to differentiable approaches (Foerster et al. 2016). In RCL, agents are not explicitly provided with knowledge of how their message selection impacts the behavioral policy of other agents, and must instead deduce this influence through repeated trial and error. On the other hand, differentiable approaches allow one to explicitly compute the derivative of recipient agents' behavioral policy or action-value function with respect to the sending agent's communication policy parameters. This allows differentiable approaches to converge to better policies after fewer learning updates compared to RCL.

One additional advantage of differentiable approaches is that, because the objective function used to optimize communication in general does not directly depend on communication behavior, it is possible to train communication behavior via imitation learning, with no explicit expert demonstrations of communication. This is useful when one can generate expert action demonstrations but not expert communication demonstrations, e.g., when a communication-unaware centralized expert is present (Paulos et al. 2019).

Several differentiable approaches in the past have allowed agents to exchange real-valued messages (Foerster et al. 2017; Sukhbaatar, Szlam, and Fergus 2016). Differentiable approaches that consider real-valued communication signals cannot naturally handle discrete communication channels, as a discrete channel is non-differentiable. Some recent approaches have circumvented this problem by using biased gradient estimators (Foerster et al. 2016; Mordatch and Abbeel 2017; Lowe et al. 2017); however, biased estimators typically require additional tuning parameters or more complex training techniques such as annealing (Foerster et al. 2016; Mordatch and Abbeel 2017; Jang, Gu, and Poole 2016). Moreover, biased gradient estimators lack the convergence guarantees of unbiased ones. Additionally, to the best of our knowledge, existing differentiable approaches cannot cope with unknown channel noise, because the channel then represents an unknown stochastic function whose derivatives are consequently unknown. The fact that differentiable approaches are restricted to channels with no/known noise limits their applicability to real-world robotic systems.

In the more general context of learning-based coding, several recent works have learned a source- or channel-coding method (Kim et al. 2018; Farsad, Rao, and Goldsmith 2018), because a learned coding scheme can in some cases be more optimal or more computationally efficient than a hand-crafted one. Typically in these prior works, a coding scheme is learned that minimizes a reconstruction metric that penalizes loss of information. There is a connection between these works and ours, in that one can think of our agents as learning a source-coding scheme that optimizes the reward signal, which implicitly penalizes loss of useful information, because agents must compress information available to them into a fixed-length message in a way that maximizes reward. We believe that this emphasis on useful information allows our agents to be even more selective with what information they choose to send than if a generic supervised reconstruction loss was used.

3 Theory

In this section we explain our approach to differentiable communication learning with a discrete communication channel. We first focus on the case in which channel noise is not present, and then expand our approach to the noisy channel case. Throughout this paper, we assume centralized training, meaning that all training data (agent observations and ground-truth communication signals) can be thought of as being sent to a centralized server where learning updates are computed for each agent, as is done in many multi-agent RL approaches, such as (Kraemer and Banerjee 2016; Foerster et al. 2017; 2016). We feel that this is a reasonable assumption, because training often takes place in a setting in which additional state information is available beyond what will be available to agents at execution time. However, we also assume decentralized execution, meaning that at execution time (i.e., after training has finished/converged), agents do not need to transfer data among themselves or with a centralized server, other than via the communication channels available to them within the environment.

3.1 Communication in a Discrete, Noise-Free Channel

Consider a multi-agent reinforcement learning problem with K agents. Assume that, at a given timestep, a particular agent (Alice) is able to communicate to another agent (Bob) via a noise-free discrete communication channel with capacity C (in bits). For now, we neglect details associated with how agent policies are trained, and assume some sort of gradient-based RL approach is used in which a loss function is computed from data gathered during a set of agent experiences, which is then differentiated with respect to agents' policy parameters to compute a gradient update. This assumption is sufficiently general to encapsulate most prominent RL algorithms, such as Q-learning and variants of policy gradient. Additionally, we do not specify what factors determine which agents can communicate, as this is problem-specific. Here we are primarily concerned with how gradients are backpropagated through the channel.

Because the communication channel is discrete, any message m sent through the channel from Alice to Bob is taken from a finite set of possible messages, i.e., m ∈ M, where M = {m_1, ..., m_{|M|}} and |M| ≤ 2^C. If we wish to perform differentiable communication learning in such an environment, we need to compute the derivative of the input of Bob with respect to the output of Alice. However, the derivative of the channel output with respect to the input is not well-defined, making typical backpropagation impossible.

Existing approaches use a biased gradient estimator to circumvent this problem, such as the Gumbel-Softmax estimator. We propose an alternative approach, described below, in which Alice generates a real-valued communication signal z, which is encoded into a discrete message by a stochastic quantization procedure (randomized encoder). The resulting discrete message m is sent through the communication channel and is received by Bob, who then uses this discrete message to compute an approximation ẑ of the original real-valued signal generated by Alice, using a stochastic dequantization procedure (randomized decoder) (fig. 1). We show that the encoder/channel/decoder system is mathematically equivalent to an analog communication channel with additive noise, allowing the derivative of Bob's reconstruction with respect to Alice's communication signal to be computed. Once again, because we assume centralized training, Bob's learning gradient is available to Alice for the computation of her learning update.

Figure 1: Randomized encoder/channel/decoder: The real-valued output z from agent Alice is converted into a discrete message m by the randomized encoder, which additionally takes random noise variable ε. The message m is then sent through the channel to agent Bob, who decodes it into ẑ (a reconstruction of z) using the randomized decoder, which also takes ε as an input. Under the simple reparameterization ẑ = z + e, ẑ becomes a continuous, differentiable function of z, allowing gradients to be naturally backpropagated through the discrete communication channel.

Randomized Encoder. The purpose of the randomized encoder is to convert a real-valued output from Alice into a discrete message that can be sent through the channel to Bob. One simple approach would be to quantize z directly; however, the derivative of this operation then vanishes everywhere except at the boundaries of the quantization intervals, where it is undefined, precluding the use of this simple idea in a differentiable communication setting.

Instead, we propose the following non-deterministic quantization method: for simplicity, assume z is scalar and z ∈ [0, 1]; however, the encoding process can be easily generalized to vectors of arbitrary length. To encode z, we first perturb it with noise ε, which is sampled independently at each timestep from a uniform distribution centered at 0 with width δ = 1/|M|, i.e., z̄ = z + ε, where ε ∼ U(·; −δ/2, +δ/2). z̄ now ranges from −δ/2 to 1 + δ/2. z̄ is then quantized into one of |M| + 1 quantization intervals of width δ, spaced uniformly along the range of z̄. The index of the quantization interval is taken to be the discrete message, with the special case that the last quantization interval is identified with the first, so that there are exactly |M| possible values for m (fig. 2). In other words, the discrete message to be sent through the channel is given by:

m = \left\lfloor \frac{(z + \epsilon + \delta/2) \bmod 1}{\delta} \right\rfloor .

Note that here m ∈ {0, ..., |M| − 1}; however, this integer representation can be converted to binary for use in a digital communication channel.

Figure 2: Randomized encoder and decoder: The real-valued output z from agent Alice is first perturbed by random noise ε, which is sampled from a zero-centered uniform distribution with width equal to the quantization interval, to form z̄. z̄ is then quantized into one of the |M| quantization intervals, the index of which is taken to be the discrete message m. The message m is finally sent through the channel to agent Bob, who uses it to compute a reconstruction of z by subtracting ε from the center of the quantization interval corresponding to m.

Randomized Decoder. The goal of the randomized decoder is to reconstruct z from m, denoted ẑ, such that ẑ can be represented as a continuous, differentiable function of z and a noise source that is independent of z. Upon receiving m, Bob computes the reconstruction as ẑ = C(m) − ε, where C(m) denotes the center of the quantization interval corresponding to m, C(m) = δ(m + 1/2) (fig. 2). One special case exists in which the decoding process yields ẑ < −δ/2, in which case ẑ = 1 + C(m) − ε. In short, the reconstruction is given by:

\hat{z} = \begin{cases} \delta\left(m + \tfrac{1}{2}\right) - \epsilon, & \text{if } \delta\left(m + \tfrac{1}{2}\right) - \epsilon > -\tfrac{\delta}{2} \\ 1 + \delta\left(m + \tfrac{1}{2}\right) - \epsilon, & \text{otherwise} \end{cases} \quad (1)

We prove in the supplementary material that, using the particular encoding/decoding procedure detailed above, ẑ can be reparameterized according to ẑ = z + e, where e ∼ U(·; −δ/2, +δ/2). Under this reparameterization, the partial derivative of ẑ with respect to z is precisely 1. The encoder/channel/decoder system therefore becomes mathematically equivalent to an analogous situation in which we send the real-valued z directly through an analog communication channel that introduces additive noise to the signal, but is nonetheless differentiable.

Note that Bob requires knowledge of the value of ε that Alice used to encode m, which we do not send through the channel. This is not a problem, because ε does not depend upon z, and we can therefore assume the agents agree ahead of time upon a sequence of ε values to use for each timestep. Practically, this would most likely amount to all agents using the same pseudorandom number generator with the same seed. To generalize the complete encoding/decoding procedure to cases in which z is a vector rather than a scalar, one can apply the above procedure element-wise, with a distinct value of ε for each component of z.

3.2 Communication in a Discrete Channel with Unknown Noise

Here we present a method for generalizing our approach to channels with unknown noise. We define a noisy channel to be one in which the output of the channel, m̂, is not necessarily the same as the input m, but instead is distributed according to some unknown distribution that we assume can depend on the input to the channel and the state, i.e., m̂ ∼ P(· | m, S). We assume this distribution is completely unknown to us. Consequently, the message reconstruction error will no longer be distributed according to a known distribution, and can depend on z and S in unknown ways.

To make the difficulty introduced by unknown channel noise more clear, we can express our message reconstruction in the following form:

\hat{z} = h(z, \alpha(z, S, \beta)),

where h is a known, deterministic function, α is an unknown function representing channel noise, and β is some independent noise source that does not depend on z or S. Differentiating ẑ w.r.t. z, we see that the derivative contains an unknown term, namely the derivative of α w.r.t. z:

\frac{\partial \hat{z}}{\partial z} = \frac{\partial h(z, \alpha)}{\partial z} + \frac{\partial h(z, \alpha)}{\partial \alpha} \frac{\partial \alpha(z, S, \beta)}{\partial z}.

The only situation in which we could backpropagate through the channel without knowing the exact form of α is if we know α does not depend on z, which is not generally the case for the procedure we have presented so far. However, recall that what is actually sent through the communication channel is m. Because we have a choice in how z is encoded into m, we have some influence over the effect of channel noise on z. In fact, we can use this capability to render the reconstruction error completely independent of z, so that ∂ẑ/∂z is known.

Consider the following procedure: z is encoded into m in a manner similar to that detailed in 3.1. After this, a one-to-one mapping is randomly generated from each message to every other message in M, described by the permutation sequence Q. m is then mapped according to Q to m̄, which is sent through the channel. Here we will refer to this step as the "encryption" step, as it is effectively equivalent to a simple form of encryption, which can be thought of as concealing the value of m from the environment, thus limiting the ways in which the decrypted message distribution can vary with m. The recipient agent receives m̄̂, which it then decrypts by mapping it according to Q⁻¹ to an estimate of m, which we denote m̂. This is then converted to ẑ via a randomized decoder similar to that described in 3.1 (fig. 3).

Figure 3: Encryption technique for channels with unknown noise. After discrete message m is generated, it is encrypted according to permutation sequence Q into m̄. m̄ is then sent through the communication channel, which outputs m̄̂, which is not necessarily the same as m̄. m̄̂ is then decrypted (mapped by Q⁻¹) to form m̂, which can be thought of as an estimate of the original message m. With these encryption/decryption steps, the apparent channel noise introduced in the mapping from m to m̂ takes on a known form that allows us to compute unbiased gradients through the channel.

We make some slight modifications to our randomized encoder for use in noisy channels. We now take the communication output z to be a vector on the unit circle. This representation allows z to maintain exactly 1 bounded degree of freedom, while allowing us to eliminate the discontinuity associated with the modulo used in the noise-free technique described in 3.1, because a vector can vary smoothly even when its angle jumps from −π to +π. This modification will be crucial for representing the communication channel as a differentiable function in the presence of channel noise. We again perturb z by random noise ε ∼ U(·; −δ/2, +δ/2). However, to maintain the unit-circle constraint, ε is taken to be an angular perturbation, which we apply to z by multiplication with the 2-dimensional rotation matrix corresponding to ε, denoted R(ε), to form z̄ = R(ε)z. z̄ is then quantized into one of |M| quantization intervals, each one corresponding to a particular angular range on the unit circle, to form discrete message m. Because the range of angles of z̄ is now [−π, +π), the quantization intervals have width δ = 2π/|M| (fig. 4). In other words, m is given by:

m = \left\lfloor \frac{\operatorname{arctan2}(\bar{z}_2, \bar{z}_1) + \delta/2}{\delta} \right\rfloor ,

where z̄_1 and z̄_2 are the first and second elements of z̄, respectively, and z̄ = R(ε)z.

Figure 4: Randomized encoder for channels with unknown noise: the real-valued vector output z from agent Alice is constrained to lie on the unit circle. It is then rotated by random noise ε, which is sampled from a zero-centered uniform distribution with width equal to the quantization interval, to form z̄. z̄ is then quantized into one of |M| quantization intervals, each corresponding to a particular angular range, the index of which is taken to be the discrete message m. In the case depicted above, m = 2.

Following the quantization, m is then encrypted into m̄ by Q, which is then sent through the communication channel. Upon receiving the output of the communication channel, m̄̂, Bob then decrypts m̄̂ into m̂ (maps it according to Q⁻¹), and estimates z according to:

\hat{z} = R(-\epsilon)\, C(\hat{m}),

where C(m) is the vector corresponding to the center of the angular range represented by m.

In the presented communication procedure, there are two sources of error that are introduced into Bob's reconstruction of Alice's real-valued communication output. The first is due to the fact that we have a limit on the information transfer rate through the channel, placing a fundamental limit on how precisely we can reconstruct z, even in a noise-free channel. This form of noise is present whenever the communication channel does not corrupt the discrete message, as described in sec. 3.1, i.e., when m̄̂ = m̄. In this non-corrupting case, we claim (and show in the supplementary material) that the angular error distribution of ẑ is uniform, independent of z, and zero-centered; in particular, ẑ = R(e)z, where e | (m̄̂ = m̄) ∼ U(−δ/2, +δ/2). The second source of error is due to channel noise, and comes into effect whenever the discrete message is corrupted, i.e., m̄̂ ≠ m̄. In the supplementary material, we show that in this case the angular error distribution is still zero-centered, but now has uniform density everywhere except within the interval [−δ/2, +δ/2]. ẑ can therefore be represented in the following way:

\hat{z} = R(e)\, z,

where e is distributed according to a mixture of the two distributions described above. In particular, we show in the supplementary material that e is distributed according to:

P_{e|z,S}(e \mid z, S) = \begin{cases} \frac{1}{\delta}\left(1 - P_e(S)\right), & -\frac{\delta}{2} \le e < +\frac{\delta}{2} \\ \frac{1}{2\pi - \delta}\, P_e(S), & -\pi \le e < -\frac{\delta}{2} \\ \frac{1}{2\pi - \delta}\, P_e(S), & +\frac{\delta}{2} \le e < \pi \end{cases} \quad (2)

where P_e(S) = \frac{1}{|M|} \sum_{m_{in} \in M} P(m_{out} \ne m_{in} \mid S, m_{in}) is the state-dependent average channel error rate, with m_{in} and m_{out} representing the channel input and output, respectively. Since e is independent of z with the introduction of our encryption step, it is possible to compute the Jacobian representing the partial derivatives of ẑ with respect to z, regardless of the form of channel noise, given by:

\frac{\partial \hat{z}}{\partial z} = R(e).

Similar to the noise-free case, to generalize the entire procedure to multi-element messages (i.e., z is a collection of unit-length vectors rather than a single vector), one can simply apply the above procedure to each element of z independently, using unique values of ε and Q for each element.
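The sketch below illustrates one way the unit-circle encoder, the permutation "encryption" step, and the decoder could fit together; it is not the authors' released implementation. The permutation perm plays the role of Q, the angular indexing convention is ours, and the corruption applied to m̄ is only a toy stand-in for a real noisy channel.

```python
import numpy as np

def encrypt_encode(z, eps, perm, n_msgs):
    """Unit-circle randomized encoder followed by the permutation 'encryption' step.
    z is a 2-vector on the unit circle, eps an angular dither in (-delta/2, +delta/2],
    and perm a random permutation of {0, ..., n_msgs-1} shared by both agents."""
    delta = 2 * np.pi / n_msgs
    c, s = np.cos(eps), np.sin(eps)
    z_bar = np.array([c * z[0] - s * z[1], s * z[0] + c * z[1]])   # z_bar = R(eps) z
    angle = np.arctan2(z_bar[1], z_bar[0])                          # in [-pi, +pi]
    m = int(np.floor((angle + np.pi) / delta)) % n_msgs             # angular interval index
    return int(perm[m])                                             # encrypted message m_bar

def decrypt_decode(m_bar_hat, eps, perm, n_msgs):
    """Invert the permutation, then rotate the interval centre back by -eps."""
    delta = 2 * np.pi / n_msgs
    inv = np.argsort(perm)                                           # perm^{-1}
    m_hat = int(inv[m_bar_hat])
    centre_angle = -np.pi + delta * (m_hat + 0.5)                    # angle of C(m_hat)
    c_vec = np.array([np.cos(centre_angle), np.sin(centre_angle)])
    c, s = np.cos(-eps), np.sin(-eps)
    return np.array([c * c_vec[0] - s * c_vec[1], s * c_vec[0] + c * c_vec[1]])  # R(-eps) C(m_hat)

rng = np.random.default_rng(1)
n_msgs = 16
delta = 2 * np.pi / n_msgs
perm = rng.permutation(n_msgs)                   # fresh Q each timestep, drawn from a shared seed
theta = rng.uniform(-np.pi, np.pi)
z = np.array([np.cos(theta), np.sin(theta)])
eps = rng.uniform(-delta / 2, delta / 2)
m_bar = encrypt_encode(z, eps, perm, n_msgs)
m_bar_hat = m_bar if rng.random() > 0.1 else int(rng.integers(n_msgs))  # toy channel corruption
z_hat = decrypt_decode(m_bar_hat, eps, perm, n_msgs)
print(z, z_hat)
```

Because Q is redrawn randomly from a generator whose seed both agents share, a corrupted channel output decrypts to a message that is equally likely to land in any interval other than the true one, which is what makes the error distribution in Eq. (2) independent of z.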
4 Experiments

We compare the performance of our approach to a differentiable communication learning approach with real-valued messages, and to a reinforced communication learning approach, on two simple but illustrative example tasks, described below, with a noise-free communication channel. Additionally, we compare the performance of our encryption-based, channel-noise-tolerant approach to the RCL approach on the same two example tasks, but introducing message-dependent channel noise. Finally, we test the robustness of our approach to the coarseness of discretization and the message length. These experiments additionally allow us to assess the efficiency with which agents utilize the available communication channel.

4.1 Implementation Details

The actor-critic algorithm is used as the reinforcement learning algorithm for all experiments, with separate actor and critic networks. Both actor and critic networks for all tasks are composed of a convolutional stack followed by two fully-connected layers. The policy networks used in the search task also use a simple single-layer recurrent neural network to help them remember areas they have previously explored and therefore do not need to revisit. Because the critic network is used only during training, we choose to give it full state observability in all experiments. During training, the policy networks are given only information they will have during execution time. For a given task, the same architecture is used for both the differentiable and RCL approaches we test, except for differences to the message inputs and outputs necessitated by the differences between the two approaches. In both tasks, we consider 2 agents that are each able to send the other 40 bits of information at each timestep. In both the differentiable and RCL approaches we test, policy parameters are shared across agents, and each agent's action policy is represented as a categorical distribution over discrete actions. In our RCL approach, messages between agents are strings of 40 bits, and the communication policy of each agent (from which a message is sampled and sent to the other agent at each timestep) is represented as a collection of independent Bernoulli distributions representing the probability of each corresponding message bit being a 1. In our differentiable approach, agents output a real-valued communication signal with 10 elements at each timestep. Each element of this communication signal is then discretized independently into one of 16 discrete message elements, meaning each element can be thought of as a 4-bit word (a nibble), allowing for a total of 16^{10} = 2^{40} different possible messages, equivalent to 40 bits.
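As an illustration of that message format, the short sketch below packs ten 16-level message elements into a 40-bit string and back. The bit ordering is our own choice, since the paper only specifies that each element behaves as a 4-bit word.

```python
def elements_to_bits(elements, bits_per_element=4):
    """Pack small integers (ten 4-bit 'nibbles') into a flat bit list, LSB first."""
    return [(e >> b) & 1 for e in elements for b in range(bits_per_element)]

def bits_to_elements(bits, bits_per_element=4):
    """Inverse of elements_to_bits."""
    chunks = [bits[i:i + bits_per_element] for i in range(0, len(bits), bits_per_element)]
    return [sum(bit << b for b, bit in enumerate(chunk)) for chunk in chunks]

msg = [3, 15, 0, 7, 9, 12, 1, 5, 8, 2]      # ten 4-bit message elements
bits = elements_to_bits(msg)                 # 40 bits on the wire
assert len(bits) == 40 and bits_to_elements(bits) == msg
```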

4.2 Reinforced Communication Learning

In our experiments involving RCL, we allow agents to exchange messages of 40 independently-selected bits at each timestep (i.e., we represent the communication policy of each agent as a set of independent bitwise probabilities, each representing the probability that the corresponding message bit is a 1).

4.3 Task 1: Hidden-Goal Path-Finding

The first example task is a path-finding problem in which each agent must navigate to a randomly-assigned goal cell in a 10x10 grid, and at each timestep is able to move up, down, left, right, or remain stationary. Agents do not observe their own goals or the other agent's location, but instead observe only their own location and the other agent's goal; thus, communication is necessary for agents to reliably find their goals. In this task, agents' policy networks are feedforward neural networks with no persistent state, meaning they cannot remember past messages or observations, and must therefore communicate at every timestep. In the noise-free channel case, we compare a differentiable learning approach in which agents are able to exchange real-valued messages (in the form of 10-element vectors in which each element is bounded between 0 and 1), a reinforced communication learning approach in which messages are bit-strings, and our proposed differentiable learning approach (described in Sect. 3.1) in which messages are composed of 10 discrete elements, each taking one of 16 possible discrete values. In this case, messages sent by each agent are guaranteed to be received by the other agent at the next timestep, completely unaltered. In the noisy channel case, we corrupt messages with an asymmetric bit-flip noise as described in Sect. 4.5. We compare the reinforced communication learning approach with the noise-tolerant variant of our proposed differentiable approach (described in Sect. 3.2), where messages are converted to a 40-bit binary representation so the same channel noise can be added.

We utilize a shared reward structure, in which both agents receive the same total reward, computed as a sum of both agents' individual rewards. At each timestep, agents are rewarded proportionately to how much closer to their goals they become during that timestep, with an additional penalty for every timestep the agents have not reached their goal, and a penalty for each timestep agents choose not to move while not at their goal location, to encourage exploration, similar to (Sartoretti et al. 2018a; 2018b).

Each agent's policy network observes its own location and the other agent's goal in the form of two one-hot 10x10 matrices, which are fed into the convolutional stack of the agent's policy network. Additionally, each agent observes the message generated by the other agent at the last timestep, which is fed into the policy network as a separate input and processed by a fully-connected layer before being combined with the output of the convolutional stack. Each agent's critic network observes its location and goal, and the other agent's location and goal, again as one-hot matrices.

4.4 Task 2: Coordinated Multi-Agent Search

The second example task is a two-agent search problem in which each agent is again assigned a goal location in a 10x10 grid. However, in this task, either agent can observe either goal, but only when it is in one of the grid cells immediately adjacent to the goal, or on the goal itself. This limited observability is meant to simulate a sensor with a finite field of view. In this task, to most effectively solve the problem, agents should exchange information related to their observations, in effect broadening each one's sensor footprint. Again, in the noise-free case, we compare differentiable communication learning with real-valued messages, reinforced communication learning with binary messages, and our proposed differentiable approach with discrete messages; in the case with channel noise, we compare reinforced communication learning with the noise-tolerant variant of our approach described in 3.2. We additionally compare these approaches to a version of the task with no communication among agents, to test the hypothesis that communication improves performance.

The reward function for this task is the same as the reward function used for the hidden-goal path-finding task. The observations to both the actor network and critic network are also largely the same, with the exception that each agent's actor network now receives as input 4 matrices, representing its location, its goal (if visible), the other agent's location, and the other agent's goal (if visible).

4.5 Channel Noise Model

The channel noise model we use for both tasks is an asymmetric bit-flip error model. Bits are flipped with unequal probability depending on their value, and the probability of a particular message bit being flipped is taken to be independent of all other bits. More specifically, in the hidden-goal path-finding task, the probability of the i-th message bit being flipped is given by P(flip_i | m̄_i = 0) = 0.1 and P(flip_i | m̄_i = 1) = 0.05. In the coordinated multi-agent search task, the bit-flip probability is given by P(flip_i | m̄_i = 0) = 0.02 and P(flip_i | m̄_i = 1) = 0.01.

The noise was chosen to be asymmetric so that it would be message-dependent, and therefore highlight the capability of our approach to deal with channel noise that depends on the message in arbitrary ways.
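A minimal sketch of this noise model in NumPy is shown below; the exact implementation used in the experiments is not specified in the paper, so the function name and vectorization are our own.

```python
import numpy as np

def bit_flip_channel(bits, p_flip_given_0, p_flip_given_1, rng):
    """Asymmetric, independent bit-flip channel as described in Sect. 4.5:
    a 0-bit is flipped with probability p_flip_given_0, a 1-bit with p_flip_given_1."""
    bits = np.asarray(bits)
    p = np.where(bits == 0, p_flip_given_0, p_flip_given_1)
    flips = rng.random(bits.shape) < p
    return np.bitwise_xor(bits, flips.astype(bits.dtype))

rng = np.random.default_rng(0)
sent = rng.integers(0, 2, size=40)
# Path-finding task rates from the paper: P(flip|0) = 0.1, P(flip|1) = 0.05.
received = bit_flip_channel(sent, 0.1, 0.05, rng)
print((sent != received).sum(), "bits corrupted")
```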

4.6 Effect of Message Length and Quantization Fineness

In this experiment, we compare the performance of our discrete differentiable approach on the robot navigation problem with various numbers of quantization levels and message lengths. Agents have no memory and no knowledge of the location of the other agent, and must therefore encode the other agent's goal location in the message, requiring log_2(100) = 6.6439 bits of information. We first test 10-element messages with real-valued messages, and with 16, 8, 4, and 2 quantization levels per message element. In all these cases, agents can transmit enough information to solve the task optimally. We then compare 5- and 2-element messages, each with 2 quantization levels per element, in which case not enough information is transmitted to optimally solve the task (i.e., fewer than 6.6439 bits). These systematic experiments allow us to assess how efficiently agents make use of their available message bits.

4.7 Results

In both example tasks, with and without channel noise, we found that when either method was able to solve the task, our proposed discrete differentiable communication learning approach drastically outperformed RCL approaches (Fig. 5, 6).

Figure 5: Results for the Hidden-Goal Path-Finding Task (mean episode length and mean episode reward vs. number of training episodes).

In the noise-free hidden-goal path-finding task, our proposed technique was able to solve the task nearly as rapidly as a real-valued differentiable approach that places no limit whatsoever on the information transfer rate through the channel. Unsurprisingly, in the path-finding task with channel noise, the performance of our approach was somewhat degraded, requiring more time to converge than in the noise-free case, and attaining worse final performance. However, our approach represents a significant improvement over RCL, the only other approach we are aware of that is applicable in a noisy channel setting. In fact, RCL with noise completely failed to solve the task, likely due to the additional variance being introduced into its policy gradient estimates (Fig. 5).

In the coordinated multi-agent search task with no channel noise, our approach outperforms the communication-free variant we tested, indicating that agents are using their communication abilities to their benefit (Fig. 6). There again, we find that our approach outperforms RCL, which largely fails to solve the task. The fact that RCL fails to solve the task is not hugely unexpected, as (Foerster et al. 2016) observed similar behavior in some experiments with their reinforced inter-agent learning technique (a form of RCL). Interestingly, the differentiable approach that used real-valued messages seemed to escape a local minimum that our discrete differentiable approach became trapped in. It is possible that the message discretization we used (16 possible values for each message element) was too coarse, and therefore introduced too much noise. Perhaps fewer message elements with a finer discretization would have allowed our approach to perform better on this task while keeping the required information transfer rates the same. We will investigate this trade-off in future work.

In the multi-agent search task with channel noise, both RCL and our proposed differentiable approach struggle to make progress, and learning appears to be unstable (Fig. 6). It is possible that additional tuning of the learning algorithms could result in better performance on this task, which we will determine in future work.

Figure 6: Results for the Coordinated Search Task (mean episode length and mean episode reward vs. number of training episodes).

In the experiment that assesses the effect of message length and quantization fineness, our approach proved robust to large variations in these parameters. Final performance in the case of 10-element messages with 16, 8, and 4 quantization levels per message element is comparable to that of real-valued messages (Fig. 7). In the case of 10- and 5-element messages with 2 quantization levels, final performance is slightly worse; however, the fact that performance did not deteriorate until the available channel capacity was very close to the theoretical minimum required to solve the task optimally (6.6439 bits) indicates that agents were able to make efficient use of the channel available to them. In all cases, training was stable despite the noise introduced through the randomized quantization procedure, and learning efficiency remained high. Performance in the case of 2-element messages with 2 quantization levels was far inferior to the other cases tested, as one would expect given that not enough information is exchanged to solve the task optimally. The code for these experiments is available at http://bit.ly/37r7q7y.

Figure 7: Varying message length and quantization intervals (mean episode length and mean episode reward vs. number of training episodes).

5 Conclusions

In this paper, we presented novel approaches that allow differentiable communication learning to be effectively applied to realistic settings where such approaches were not previously applicable. Specifically, we presented a novel stochastic encoding/decoding procedure that can be used in conjunction with multi-agent reinforcement learning to allow unbiased gradient estimates through a discrete channel. This form of communication learning is made possible by the fact that, with our proposed technique, the encoder/discrete channel/decoder system becomes mathematically equivalent to an analog channel with additive noise, through which derivatives can easily be backpropagated. We additionally extended our approach to allow gradient backpropagation through a channel with unknown noise. To the best of our knowledge, our differentiable communication learning approach is the first one that can handle such unknown noise. We enabled this noise-tolerance by introducing an encryption step immediately before the message is sent through the discrete channel, which forces the apparent channel noise to be independent of the communication signal being sent, meaning it does not contribute to the gradient of the channel and can therefore be ignored in gradient calculations.


We evaluated the effectiveness of our proposed techniques in two example tasks, with and without channel noise, compared to a reinforced communication learning approach and a differentiable approach using real-valued messages (where applicable). We found that our approach outperforms the RCL approaches we tested in all cases in which either algorithm can solve the task. Additionally, we found that in one of the two example tasks, our differentiable approach, which uses a very coarse message discretization and a fairly tight limit on channel capacity, performed nearly as well as the real-valued differentiable approach we tested, which assumes no limit on channel capacity.

Finally, we found that our approach is robust to relatively large variations in message length and quantization fineness, and is capable of making very efficient use of the available channel capacity (be it under sufficient or insufficient bandwidth). Our experimental results, along with the theoretical justification of our technique, indicate that our approach offers significant advantages over existing approaches to MARL with communication, especially in cases with realistic discrete and noisy communication channels.

Ongoing work focuses on four natural extensions of our presented approaches. First, we are applying our approach to more complex and realistic problems, such as tasks involving many agents in environments with obstacles and other impediments. We are also investigating the use of alternative learning modalities, such as imitation learning, in our approach. Because agents learn to exchange messages that optimize some performance metric that depends on messages only indirectly through the agent policies, expert demonstrations can be used to train policies without explicit communication demonstrations. It is therefore likely possible that our approach could learn emergent communication behavior solely through imitation of a centralized expert that does not rely on communication, similar to (Paulos et al. 2019).

Second, we are currently investigating how more established encryption techniques, such as RSA (Rivest, Shamir, and Adleman 1978), could be applied to our approach. For instance, while the type of encryption we use is information-theoretically secure, and therefore the environment is guaranteed to have no knowledge of the message we are communicating, modern encryption techniques rely instead on requiring third parties to do an enormous amount of computation to decrypt the message. It may be sufficient to use such non-information-theoretically-secure but more efficient encryption techniques to make channel noise nearly perfectly independent of the communication signal.

Thirdly, ongoing work explores how error-correcting codes could be utilized to decrease the amount of noise introduced into agents' communication inputs. Some error-correction techniques, such as a cyclic redundancy check, allow error detection in message transmission. Because a large source of noise in our technique results from cases in which the channel outputs an erroneous message, being able to detect such cases with some probability, and neglect the erroneous messages, could lead to an improvement in performance in situations with channel noise.

Finally, another possible extension of this work is adaptation to a fully decentralized training and execution framework. If two-way communication is available among agents, it is possible that the same channels used for communication from sender to receiver could also be used to communicate backpropagated gradients from receiver to sender. In this case, gradients could be encoded into a discrete message using the same randomized quantization procedure described here, and decoded into an unbiased backpropagated gradient estimate. This adaptation would allow agents to be trained in the same conditions as at execution time.

References

Bernstein, D. S.; Givan, R.; Immerman, N.; and Zilberstein, S. 2002. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27(4):819–840.

Busoniu, L.; Babuška, R.; and De Schutter, B. 2010. Multi-agent reinforcement learning: An overview. Innovations in Multi-Agent Systems and Applications-1 310:183–221.

Farsad, N.; Rao, M.; and Goldsmith, A. 2018. Deep learning for joint source-channel coding of text. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2326–2330. IEEE.

Foerster, J. N.; Assael, Y. M.; de Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. CoRR abs/1605.06676.

Foerster, J. N.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. 2017. Counterfactual multi-agent policy gradients. CoRR abs/1705.08926.

Gupta, J. K.; Egorov, M.; and Kochenderfer, M. J. 2017. Cooperative multi-agent control using deep reinforcement learning. In AAMAS Workshops.

Jaderberg, M.; Czarnecki, W. M.; Dunning, I.; Marris, L.; Lever, G.; Castañeda, A. G.; Beattie, C.; Rabinowitz, N. C.; Morcos, A. S.; Ruderman, A.; et al. 2019. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 364(6443):859–865.

Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.

Kim, H.; Jiang, Y.; Rana, R.; Kannan, S.; Oh, S.; and Viswanath, P. 2018. Communication algorithms via deep learning. arXiv preprint arXiv:1805.09317.

Kraemer, L., and Banerjee, B. 2016. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190:82–94.

Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. CoRR abs/1706.02275.

Mordatch, I., and Abbeel, P. 2017. Emergence of grounded compositional language in multi-agent populations. CoRR abs/1703.04908.

OpenAI. 2018. OpenAI Five.

Paulos, J.; Chen, S. W.; Shishika, D.; and Kumar, V. S. A. 2019. Decentralization of multiagent policies by learning what to communicate. In 2019 International Conference on Robotics and Automation (ICRA), 7990–7996.

Rivest, R. L.; Shamir, A.; and Adleman, L. 1978. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM 21(2):120–126.

Sartoretti, G.; Kerr, J.; Shi, Y.; Wagner, G.; Kumar, T. K. S.; Koenig, S.; and Choset, H. 2018a. PRIMAL: Pathfinding via reinforcement and imitation multi-agent learning. CoRR abs/1809.03531.

Sartoretti, G.; Wu, Y.; Paivine, W.; Kumar, T. K. S.; Koenig, S.; and Choset, H. 2018b. Distributed reinforcement learning for multi-robot decentralized collective construction. In DARS 2018 - International Symposium on Distributed Autonomous Robotic Systems, 35–49.

Sartoretti, G.; Paivine, W.; Shi, Y.; Wu, Y.; and Choset, H. 2019. Distributed learning of decentralized control policies for articulated mobile robots. CoRR abs/1901.08537.

Sukhbaatar, S.; Szlam, A.; and Fergus, R. 2016. Learning multiagent communication with backpropagation. CoRR abs/1605.07736.
