Variational Autoencoders for Opponent Modeling in Multi-Agent Systems

Georgios Papoudakis
School of Informatics, The University of Edinburgh
Edinburgh Centre for Robotics
[email protected]

Stefano V. Albrecht
School of Informatics, The University of Edinburgh
Edinburgh Centre for Robotics
[email protected]

Abstract

Multi-agent systems exhibit complex behaviors that emanate from the interactions of multiple agents in a shared environment. In this work, we are interested in controlling one agent in a multi-agent system and successfully learning to interact with the other agents, which have fixed policies. Modeling the behavior of other agents (opponents) is essential in understanding the interactions of the agents in the system. By taking advantage of recent advances in unsupervised learning, we propose modeling opponents using variational autoencoders. Additionally, many existing methods in the literature assume that the opponent models have access to the opponent's observations and actions during both training and execution. To eliminate this assumption, we propose a modification that attempts to identify the underlying opponent model using only local information of our agent, such as its observations, actions, and rewards. The experiments indicate that our opponent modeling methods achieve equal or greater episodic returns in reinforcement learning tasks against another modeling method.

1 Introduction

In recent years, several promising works [Mnih et al., 2015, Schulman et al., 2015a, Mnih et al., 2016] have arisen in deep reinforcement learning (RL), leading to fruitful results in single-agent scenarios. In this work, we are interested in using single-agent RL in multi-agent systems, where we control one agent and the other agents (opponents) in the environment have fixed policies. The agent should be able to successfully interact with a diverse set of opponents as well as generalize to new, unseen opponents. One effective way to address this problem is opponent modeling.
The opponent models output specific characteristics of the opponents based on their trajectories. By successfully modeling the opponents, the agent can reason about opponents' behaviors and goals and adjust its policy to achieve optimal outcomes. There is a rich literature on modeling opponents in multi-agent systems [Albrecht and Stone, 2018]. Several recent works have proposed learning opponent models using neural network architectures [He et al., 2016, Raileanu et al., 2018, Grover et al., 2018a, Rabinowitz et al., 2018]. In this work, we focus on learning opponent models using Variational Autoencoders (VAEs) [Kingma and Welling, 2014]. VAEs are generative models that are commonly used for learning representations of data, and various works use them in RL for learning representations of the environment [Igl et al., 2018, Ha and Schmidhuber, 2018, Zintgraf et al., 2019]. We first propose a VAE for learning opponent representations in multi-agent systems based on the opponent trajectories.

A shortcoming of this approach, and of most opponent modeling methods as will be discussed in Section 2, is that they require access to the opponent's information, such as observations and actions, during training as well as execution. This assumption is too limiting in the majority of scenarios. For example, consider Poker, in which agents do not have access to the opponent's observations. Nevertheless, during Poker, humans can reason about the opponent's behaviors and goals using only their local observations. For example, an increase in the table's pot could mean that the opponent either holds strong cards or is bluffing. Based on the idea that an agent can reason about opponent models using its own observations, actions, and rewards in a recurrent fashion, we propose a second VAE-based architecture. The encoder of this VAE learns to represent opponent models conditioned only on local information, removing the requirement to access the opponents' information during execution.

We evaluate our proposed methodology using a toy example and the commonly used Multi-agent Particle Environment [Mordatch and Abbeel, 2017]. We evaluate the episodic returns that RL algorithms can achieve. The experiments indicate that opponent modeling without the opponents' information can achieve comparable or higher average returns, using RL, than models that access the opponent's information.

2 Related Work

Learning Opponent Models. In this work, we are interested in opponent modeling methods that use neural networks to learn representations of the opponents. He et al. [2016] proposed an opponent modeling method that learns a modeling network to reconstruct the opponent's actions given the opponent's observations. Raileanu et al. [2018] developed an algorithm for learning to infer opponents' goals using the policy of the controlled agent. Hernandez-Leal et al. [2019] used auxiliary tasks for modeling opponents in multi-agent reinforcement learning. Grover et al. [2018a] proposed an encoder-decoder method for modeling the opponent's policy. The encoder learns a point-based representation of different opponents' trajectories, and the decoder learns to reconstruct the opponent's policy given samples from the embedding space. Additionally, Grover et al. [2018a] introduce an objective to separate the embeddings of different agents into different clusters:

d(z_+, z_-, z) = 1 / (1 + e^{|z - z_-|_2 - |z - z_+|_2})^2    (1)

where z_+ and z are embeddings of the same agent from two different episodes and the embedding z_- is generated from the episode of a different agent. Zheng et al. [2018] use an opponent modeling method for better opponent identification and multi-agent reinforcement learning. Rabinowitz et al. [2018] proposed the Theory of Mind Network (TomNet), which learns embedding-based representations of opponents for meta-learning. Tacchetti et al. [2018] proposed Relational Forward Models to model opponents using graph neural networks. A common assumption among these methods, which our work aims to eliminate, is that access to opponent trajectories is available during execution.
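To make equation 1 concrete, the snippet below is a minimal PyTorch sketch of the discrimination objective; the batched tensor shapes and the mean reduction are our own illustrative assumptions rather than the exact implementation of Grover et al. [2018a].

```python
import torch

def discrimination_objective(z_pos: torch.Tensor,
                             z_neg: torch.Tensor,
                             z: torch.Tensor) -> torch.Tensor:
    """Equation 1: d(z+, z-, z) = 1 / (1 + exp(|z - z-|_2 - |z - z+|_2))^2.

    z and z_pos are embeddings of the same agent from two different episodes,
    while z_neg comes from an episode of a different agent.
    All tensors are assumed to have shape (batch, embedding_dim).
    """
    dist_neg = torch.norm(z - z_neg, p=2, dim=-1)  # |z - z-|_2
    dist_pos = torch.norm(z - z_pos, p=2, dim=-1)  # |z - z+|_2
    d = 1.0 / (1.0 + torch.exp(dist_neg - dist_pos)) ** 2
    return d.mean()
```

Minimizing d drives |z - z_+|_2 below |z - z_-|_2, i.e., embeddings of the same agent end up closer together than embeddings of different agents.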
Representation Learning in Reinforcement Learning. Another related topic that has received significant attention is representation learning in RL. Using unsupervised learning techniques to learn low-dimensional representations of the MDP has led to significant improvements in RL. Ha and Schmidhuber [2018] proposed a VAE-based and a forward model to learn state representations of the environment. Hausman et al. [2018] learned task embeddings and interpolated them to solve harder tasks. Igl et al. [2018] used a VAE for learning representations in partially-observable environments. Gupta et al. [2018] proposed a model which learns Gaussian embeddings to represent different tasks during meta-training and manages to quickly adapt to new tasks during meta-testing. The work of Zintgraf et al. [2019] is closely related; Zintgraf et al. proposed a recurrent VAE model, which receives as input the observation, action, and reward of the agent, and learns a variational distribution of tasks. Rakelly et al. [2019] used representations from an encoder for off-policy meta-RL. Note that all these works have been applied for learning representations of tasks or properties of the environments. In contrast, our approach is focused on learning representations of opponents.

3 Background

3.1 Reinforcement Learning

Markov Decision Processes (MDPs) are commonly used to model decision making problems. An MDP consists of the set of states S, the set of actions A, the transition function P(s'|s, a), which is the probability of the next state s' given the current state s and the action a, and the reward function r(s', a, s), which returns a scalar value conditioned on two consecutive states and the intermediate action. A policy function is used to choose an action given a state; the policy can be stochastic, a ∼ π(a|s), or deterministic, a = µ(s). Given a policy π, the state value function is defined as V(s_t) = E_π[Σ_{i=t}^{H} γ^{i−t} r_i | s = s_t] and the state-action value (Q-value) as Q(s_t, a_t) = E_π[Σ_{i=t}^{H} γ^{i−t} r_i | s = s_t, a = a_t], where 0 ≤ γ ≤ 1 is the discount factor and H is the finite horizon of the episode. The goal of RL is to compute the policy that maximizes the state value function V when the transition and reward functions are unknown.

There is a large number of RL algorithms; however, in this work, we focus on two actor-critic algorithms: the synchronous Advantage Actor-Critic (A2C) [Mnih et al., 2016, Dhariwal et al., 2017] and the Deep Deterministic Policy Gradient (DDPG) [Silver et al., 2014, Lillicrap et al., 2015]. DDPG is an off-policy algorithm, using an experience replay buffer to break the correlation between consecutive samples and target networks to stabilize training [Mnih et al., 2015]. Given an actor network with parameters θ and a critic network with parameters φ, the gradient updates are performed using the following update rules:

min_φ (1/2) E_B[(r + γ Q_{target,φ'}(s', µ_{target,θ'}(s')) − Q_φ(s, a))^2]
min_θ −E_B[Q_φ(s, µ_θ(s))]

On the other hand, A2C is an on-policy actor-critic algorithm, using parallel environments to break the correlation between consecutive samples. The actor-critic parameters are optimized by:

min_{θ,φ} E_B[−Â log π_θ(a|s) + (1/2)(r + γ V_φ(s') − V_φ(s))^2]

where the advantage term Â can be computed using Generalized Advantage Estimation (GAE) [Schulman et al., 2015b].
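As a concrete reference for the two update rules above, the sketch below writes out the DDPG and A2C losses in PyTorch. The batch format, the network interfaces, and the termination mask are illustrative assumptions and not the exact training code used for the experiments.

```python
import torch
import torch.nn.functional as F

def ddpg_losses(batch, actor, critic, target_actor, target_critic, gamma=0.99):
    """Critic and actor losses for DDPG. `batch` holds tensors s, a, r, s_next, done."""
    with torch.no_grad():
        # TD target: r + gamma * Q_target(s', mu_target(s')), masked at episode end.
        q_next = target_critic(batch["s_next"], target_actor(batch["s_next"]))
        td_target = batch["r"] + gamma * (1.0 - batch["done"]) * q_next
    critic_loss = 0.5 * F.mse_loss(critic(batch["s"], batch["a"]), td_target)
    # Deterministic policy gradient: maximize Q(s, mu(s)).
    actor_loss = -critic(batch["s"], actor(batch["s"])).mean()
    return critic_loss, actor_loss

def a2c_loss(log_probs, values, td_targets, advantages, value_coef=0.5):
    """Joint A2C loss: policy gradient weighted by advantages (e.g. from GAE)
    plus a squared TD error for the critic."""
    policy_loss = -(advantages.detach() * log_probs).mean()
    value_loss = 0.5 * F.mse_loss(values, td_targets.detach())
    return policy_loss + value_coef * value_loss
```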
3.2 Variational Autoencoders

Consider samples from a dataset x ∈ X which are generated from some hidden (latent) random variable z based on a generative distribution p_u(x|z) with unknown parameters u, and a prior distribution on the latent variables, which we assume is a Gaussian with zero mean and unit variance, p(z) = N(z; 0, I). We are interested in approximating the true posterior p(z|x) with a variational parametric distribution q_w(z|x) = N(z; µ, Σ) with parameters w. Kingma and Welling [2014] proposed the Variational Autoencoder (VAE) to learn this distribution. Starting from the Kullback-Leibler (KL) divergence from the approximate to the true posterior, D_KL(q_w(z|x) || p(z|x)), the lower bound on the evidence log p(x) is derived as:

log p(x) ≥ E_{z∼q_w(z|x)}[log p_u(x|z)] − D_KL(q_w(z|x) || p(z))

The architecture consists of an encoder, which receives a sample x and generates the Gaussian variational distribution q_w(z|x), and a decoder, which receives a sample from the Gaussian variational distribution and reconstructs the initial input x. The architecture is trained using the reparameterization trick [Kingma and Welling, 2014]. Higgins et al. [2017] proposed the β-VAE, where a parameter β ≥ 0 is used to control the trade-off between the reconstruction loss and the KL divergence:

L(x; w, u) = E_{z∼q_w(z|x)}[log p_u(x|z)] − β D_KL(q_w(z|x) || p(z))
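A minimal PyTorch sketch of this β-VAE objective with a diagonal Gaussian encoder and the reparameterization trick is shown below; the layer sizes and the squared-error reconstruction term (a Gaussian likelihood up to a constant) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BetaVAE(nn.Module):
    """Minimal beta-VAE: Gaussian encoder q_w(z|x), prior N(0, I), beta-weighted KL."""

    def __init__(self, x_dim, z_dim, hidden=64, beta=1.0):
        super().__init__()
        self.beta = beta
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.log_var = nn.Linear(hidden, z_dim)
        self.decoder = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))

    def loss(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        x_rec = self.decoder(z)
        # Negative reconstruction term (squared error) ...
        rec = ((x_rec - x) ** 2).sum(dim=-1)
        # ... plus the analytic KL(q_w(z|x) || N(0, I)), weighted by beta.
        kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1.0).sum(dim=-1)
        return (rec + self.beta * kl).mean()
```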
4 Approach

4.1 Problem Formulation

We consider a modified Markov Game [Littman, 1994], which consists of N agents I = {1, 2, ..., N}, a set of states S, the joint action space A = A_1 × A_{-1}, the transition function P : S × A × S → R and the reward function r : S × A × S → R^N. We consider partially observable settings, where each agent i has access only to its local observation o_i and reward r_i. Additionally, two sets of pretrained opponents are provided, T = {I_{-1,m}}_{m=1}^{M_T} and G = {I_{-1,m}}_{m=1}^{M_G}, which are responsible for providing the joint action A_{-1}. Note that by opponent we refer to I_{-1,m}, which consists of one or more agents, independently of the type of the interactions. Additionally, we assume that the sets T and G are disjoint. At the beginning of each episode, we sample a pretrained opponent from the set T during training or from G during testing. Our goal is to train agent 1 using RL to maximize the average return against opponents from the training set T, and to generalize to opponents sampled from the test set G:

arg max_θ E_π[E_T[Σ_t γ^t r_t]]    (2)

Note that when we refer to agent 1 we drop the subscript.

4.2 Variational Autoencoder with Access to Opponent's Information

We assume a number of K provided episode trajectories for each pretrained opponent j ∈ T, E^{(j)} = {τ_{-1}^{(j,k)}}_{k=0}^{K-1}, where τ_{-1}^{(j,k)} = {o_{-1,t}, a_{-1,t}}_{t=0}^{H}, and o_{-1,t}, a_{-1,t} are the observations and actions of the opponent at time step t in the trajectory. These trajectories are generated from the opponents in set T, which are represented in the latent space by the variable z and for which we assume there exists an unknown model p_u(τ_{-1}|z). Our goal is to approximate the unknown posterior p(z|τ_{-1}) using a variational Gaussian distribution N(µ, Σ) with parameters w. We consider using a β-VAE for the sequential task:

L(τ_{-1}; w, u) = E_{z∼q_w(z|τ_{-1})}[log p_u(τ_{-1}|z)] − β D_KL(q_w(z|τ_{-1}) || p(z))

We can further subtract the discrimination objective (equation 1) that was proposed by Grover et al. [2018a]. Since the discrimination objective is always non-negative, we derive and optimize a lower bound as:

L(τ_{-1}; w, u) ≥ E_{z∼q_w(z|τ_{-1})}[log p_u(τ_{-1}|z)] − β D_KL(q_w(z|τ_{-1}) || p(z)) − λ d(E(z_+), E(z_-), E(z))

The discrimination objective receives as input the means of the variational Gaussian distributions produced from three different trajectories. Despite the less tight lower bound, the discrimination objective will separate the opponents in the embedding space, which could potentially lead to higher episodic returns. At each time step t, the recurrent encoder network generates a latent sample z_t, which is conditioned on the opponent's trajectory τ_{-1,:t} up to time step t. The KL divergence can be written as:

D_KL(q_w(z|τ_{-1}) || p(z)) = Σ_{t=1}^{H} D_KL(q_w(z_t|τ_{-1,:t}) || p(z_t))

The lower bound consists of the reconstruction loss of the trajectory, which involves the observations and actions of the opponent. The opponent's action at each time step depends on its observation and the opponent's policy, which is represented by the latent variable z. We use a decoder that consists of fully-connected layers; however, a recurrent network can be used if we instead assume that the opponent decides its actions based on the history of its observations. Additionally, the observation at each time step depends only on the dynamics of the environment and the actions of the agents, and not on the identity of the opponent. Therefore, the reconstruction loss factorizes as:

log p_u(τ_{-1}|z) = Σ_{t=1}^{H} log p_u(a_{-1,t}|o_{-1,t}, z_t) p_u(o_{-1,t}|s_{t-1}, a_{t-1}, a_{-1,t-1})
                 ∝ Σ_{t=1}^{H} log p_u(a_{-1,t}|o_{-1,t}, z_t)    (3)

From the equation above, we observe that the loss is the reconstruction of the opponent's policy given the current observation and a sample from the latent variable. Figure 1 illustrates the diagram of the VAE.

Figure 1: Diagram of the proposed VAE architecture.
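The sketch below is a minimal PyTorch rendering of this architecture: a recurrent encoder over the opponent's (observation, action) trajectory and a fully-connected decoder that reconstructs the opponent's action from (o_{-1,t}, z_t), as in equation 3. Layer sizes, the one-hot action encoding, and the reduction over batch and time are illustrative assumptions; the discrimination term of equation 1 can be added on the returned means (e.g. with the discrimination_objective sketch above), weighted by λ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpponentTrajectoryVAE(nn.Module):
    """Recurrent encoder q_w(z_t | tau_{-1,:t}) and policy-reconstruction decoder."""

    def __init__(self, obs_dim, n_actions, z_dim=8, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim + n_actions, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, z_dim)
        self.log_var = nn.Linear(hidden, z_dim)
        self.decoder = nn.Sequential(nn.Linear(obs_dim + z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_actions))

    def loss(self, opp_obs, opp_act_onehot, beta=1.0):
        # Encode every trajectory prefix tau_{-1,:t}; inputs are (B, T, .) tensors.
        h, _ = self.gru(torch.cat([opp_obs, opp_act_onehot], dim=-1))
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        # Reconstruction of the opponent's policy (equation 3): -log p_u(a | o, z).
        logits = self.decoder(torch.cat([opp_obs, z], dim=-1))
        rec = F.cross_entropy(logits.flatten(0, 1),
                              opp_act_onehot.argmax(dim=-1).flatten(0, 1),
                              reduction='sum')
        # Per-step KL against the N(0, I) prior, summed over the horizon.
        kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1.0).sum()
        return rec + beta * kl, mu
```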

Figure 3: Diagram of the proposed VAE architecture using local information.

Figure 2: Payoff matrix (a), embedding visualization (b, for Opponent One and Opponent Two) and episodic return (c, for SMA2C and A2C) during training in the toy example. The payoff matrix of panel (a) is:

       C         D
C   (2, 2)    (-1, 3)
D   (3, -1)    (0, 0)

4.3 Variational Autoencoder without Access to Opponent's Information

In Sections 1 and 2, it was noted that most agent modeling methods assume that access to the opponent's observations and actions is available both during training and execution. To eliminate this assumption, we propose a VAE that uses a parametric variational distribution which is conditioned on the observation-action-reward triplet of the agent under our control and a variable d indicating whether the episode has terminated: q_w(z|τ = (o, a, r, d)). Specifically, our goal is to approximate the true posterior, which is conditioned on the opponent's information, with a variational distribution that depends only on local information. The use of this local information in a recurrent fashion has been successfully applied in meta-RL settings [Wang et al., 2016, Duan et al., 2016]. We start by computing the KL divergence between the two distributions:

D_KL(q_w(z|τ) || p(z|τ_{-1})) = E_{z∼q_w(z|τ)}[log q_w(z|τ) − log p(z|τ_{-1})]

By following the works of Kingma and Welling [2014] and Higgins et al. [2017] and using Jensen's inequality, the VAE objective can be written as:

L(τ, τ_{-1}; w, u) = E_{z∼q_w(z|τ)}[log p_u(τ_{-1}|z)] − β D_KL(q_w(z|τ) || p(z))    (4)

The reconstruction loss factorizes exactly as in equation 3. From equation 4, it can be seen that the variational distribution depends only on locally available information. Since during execution only the encoder is required to generate the opponent's model, this approach removes the assumption that access to the opponent's observations and actions is available during execution. Figure 3 presents the diagram of the VAE.

4.4 Reinforcement Learning Training

We use the latent variable z, augmented with the agent's observation, to condition the policy of our agent, which is optimized using RL. Consider the augmented observation space O' = O × Z, where O is the original observation space of our agent in the Markov game and Z is the representation space of the opponent models. The advantage of learning the policy on O' compared to O is that the policy can adapt to different z ∈ Z.

After training the variational autoencoder described in Section 4.2, we use it to train our agent against the opponents in the set T. We use the DDPG algorithm [Lillicrap et al., 2015] for this task. At the beginning of each episode, we sample an opponent from the set T. The agent's input is the local observation and a sample from the variational distribution. We refer to this method as OMDDPG (Opponent Modeling DDPG).

We optimize the second proposed VAE method jointly with the policy of the controlled agent. We use the A2C algorithm, similarly to the meta-learning algorithm RL2 [Wang et al., 2016, Duan et al., 2016]. In the rest of this paper, we refer to this method as SMA2C (Self Modeling A2C). The actor's and the critic's input is the local observation and a sample from the latent space. We back-propagate the gradients from both the actor and the critic loss to the parameters of the encoder. Therefore, the encoder's parameters are shaped to maximize both the VAE objective and the discounted sum of rewards.
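A minimal PyTorch sketch of these two components is given below: a recurrent encoder over the controlled agent's own (observation, previous action, reward, termination) sequence, and an actor-critic that consumes the augmented observation (o, z) ∈ O'. Layer sizes, the concatenation order, and the interface are illustrative assumptions; in SMA2C the gradients of both the actor and the critic losses are also back-propagated into this encoder.

```python
import torch
import torch.nn as nn

class LocalInfoEncoder(nn.Module):
    """Sketch of q_w(z | tau = (o, a, r, d)) using only the controlled agent's data."""

    def __init__(self, obs_dim, act_dim, z_dim=8, hidden=64):
        super().__init__()
        # Per step: own observation, previous action, reward and done flag, each (B, T, .).
        self.gru = nn.GRU(obs_dim + act_dim + 2, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, z_dim)
        self.log_var = nn.Linear(hidden, z_dim)

    def forward(self, obs, prev_act, rew, done, h0=None):
        h, hn = self.gru(torch.cat([obs, prev_act, rew, done], dim=-1), h0)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return z, mu, log_var, hn

class AugmentedActorCritic(nn.Module):
    """Policy and value heads conditioned on the augmented observation (o, z)."""

    def __init__(self, obs_dim, z_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim + z_dim, hidden), nn.Tanh())
        self.policy_logits = nn.Linear(hidden, n_actions)
        self.value = nn.Linear(hidden, 1)

    def forward(self, obs, z):
        h = self.body(torch.cat([obs, z], dim=-1))
        return self.policy_logits(h), self.value(h)
```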
5 Experiments

5.1 Toy Example

We first provide a toy example to illustrate SMA2C. We consider the classic repeated game of the prisoner's dilemma with a constant episode length of 25 time steps. We control one agent, and the opponent uses one of two possible policies. The first policy always defects, while the second opponent policy follows a tit-for-tat policy. At the beginning of the episode, one of the two opponent policies is randomly selected. We train SMA2C against the two possible opponents. The agent that we control has to identify the correct opponent, and the optimal policy it can achieve is to defect against opponent one and cooperate with opponent two. Figure 2 shows the payoff matrix, the embedding space at the last time step of the episode, and the episodic return that SMA2C and A2C achieve during training. Note that, based on the payoff matrix, the optimal average episodic return that can be achieved is 24.5.
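To make the 24.5 figure concrete, here is a small, self-contained simulation of the toy example under one plausible identification strategy (our own illustration, not the learned SMA2C policy): cooperate on the first step to probe the opponent, then defect forever against the always-defect opponent and cooperate forever against tit-for-tat. Averaging the two episodic returns reproduces 24.5.

```python
# Payoff matrix from Figure 2a, indexed as PAYOFF[(my_action, opp_action)] -> my reward.
PAYOFF = {('C', 'C'): 2, ('C', 'D'): -1, ('D', 'C'): 3, ('D', 'D'): 0}

def opponent(kind, my_history):
    """Opponent one always defects; opponent two plays tit-for-tat."""
    if kind == 'always_defect':
        return 'D'
    return 'C' if not my_history else my_history[-1]

def probe_then_best_respond(opp_history):
    """Cooperate once to identify the opponent, then best-respond forever."""
    if not opp_history:
        return 'C'
    return 'D' if opp_history[0] == 'D' else 'C'

def episode(kind, steps=25):
    my_hist, opp_hist, total = [], [], 0
    for _ in range(steps):
        a, b = probe_then_best_respond(opp_hist), opponent(kind, my_hist)
        total += PAYOFF[(a, b)]
        my_hist.append(a)
        opp_hist.append(b)
    return total

returns = [episode('always_defect'), episode('tit_for_tat')]
print(returns, sum(returns) / 2)  # [-1, 50] -> average 24.5
```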

0 5 52 10 10 10 54 20 20 15 56 30 20 30 58 40 25 Reward 40 50 60

30 60 62 50 35 70 64 40 60 80 4 5 6 4 5 6 7 4 5 6 7 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 10 10 10 10 10 10 10 10 10 10 10 1e6

Speaker-Listener Double Speaker-Listener Predator-Prey Spread 0 0 10 50

0 5 52 10 10 10 54 20 20 15 56 30 20 30 58 40 25

Reward 40 50 60 30 60 62 50 35 70 64 40 60 80 4 5 6 4 5 6 7 4 5 6 7 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 10 10 10 10 10 10 10 10 10 10 10 Time steps 1e6 Time steps Time steps Time steps

SMA2C OMDDPG Grover et al. [2018a]

Figure 4: Episodic return during training against the opponents from T (top row) and G (bottom row). Four environments are evaluated; speaker-listener (first column), double speaker-listener (second column), predator-prey (third column) and spread (fourth column). For the double speaker-listener, predator-prey and spread environments the x-axis is in logarithmic scale.

5.2 Experimental Framework

To evaluate the proposed methods in more complex environments, we used the Multi-agent Particle Environment (MPE) [Mordatch and Abbeel, 2017], which provides several different multi-agent environments. The environments have continuous observation spaces, discrete action spaces, and fixed-length episodes of 25 time steps. Four environments are used for evaluating the proposed methodology: speaker-listener, double speaker-listener, predator-prey, and spread. During the experiments, we evaluated the two proposed algorithms, OMDDPG and SMA2C, and the modeling method of Grover et al. [2018a] combined with DDPG [Lillicrap et al., 2015].

In all the environments, we pretrain ten different opponents, where five are used for training and five for testing. In the speaker-listener environment, we control the listener, and we create ten different speakers using different communication messages for different colors. In the double speaker-listener, which consists of two agents that have to be both listener and speaker simultaneously, we control the first agent. We create a diverse set of opponents that have different communication messages, similar to speaker-listener, while they learn to navigate using the MADDPG algorithm [Lowe et al., 2017] with different initial random seeds. In the predator-prey environment, we control the prey and pretrain the three other agents in the environment using MADDPG with different initial parameters. Similarly, in spread, we control one of the agents, while the opponents are pretrained using MADDPG.

We use agent generalization graphs [Grover et al., 2018b] to evaluate the generalization of the proposed methods. We evaluate two types of generalization. First, we evaluate the episodic returns against the opponents that are used for training, T, which Grover et al. [2018b] call "weak generalization". Secondly, we evaluate against unknown opponents from the set G, which is called "strong generalization".

5.3 Reinforcement Learning Performance

We evaluate the proposed opponent modeling methods in RL settings. In Figure 4, the episodic returns for the three methods in all four environments are presented. Every line corresponds to the average return over five runs with different initial seeds, and the shaded area represents the 95% confidence interval. We evaluate the models every 1000 episodes for 100 episodes. During the evaluation, we sample an embedding from the variational distribution at each time step, and the agent follows the greedy policy. The hyperparameters for all the experiments in Figure 4 were optimized on weak generalization scenarios, against opponents from the set T.

OMDDPG is an upper baseline for SMA2C, achieving higher returns in all environments during weak generalization. However, OMDDPG, as well as the method by Grover et al. [2018a], tends to overfit and perform poorly during strong generalization in the speaker-listener and double speaker-listener environments. SMA2C achieves higher returns than the method proposed by Grover et al. [2018a] in more than half of the scenarios.

5.4 Evaluation of the Representations

To evaluate the representations created by our models, we estimate the Mutual Information (MI) between the variational distribution (q(z|τ) or q(z|τ_{-1})) and the prior on the opponents' identities, which is uniform. This is a common way to estimate the quality of a representation [Chen et al., 2018, Hjelm et al., 2019]. To estimate the MI, we use Mutual Information Neural Estimation (MINE) [Belghazi et al., 2018]. Note that the upper bound of the MI, the entropy of the uniform distribution, is 1.61 in our experiments. We gather 200 trajectories against each opponent in T, where 80% of them are used for training and the rest for testing.

Table 1: MI estimates using MINE in the double speaker-listener and predator-prey environments, for the embeddings at the 15th, 20th and 25th time step of the trajectory.

                         Double speaker-listener      Predator-prey
Algorithm \ Time step      15     20     25           15     20     25
OMDDPG                    0.86   0.81   0.86         0.87   0.86   0.85
SMA2C                     0.73   0.72   0.70         1.27   1.04   1.39
Grover et al. [2018a]     1.16   1.21   1.20         1.29   1.38   1.41

From Table 1, we observe that the method of Grover et al. [2018a] achieves significantly higher values of MI. We believe that the main reason behind this is the discrimination objective, which implicitly increases the MI. This is apparent in the MI values of OMDDPG as well. SMA2C manages to create opponent representations, based only on the local information of our agent, that have a statistical dependency on the opponent identities. Additionally, based on Figure 4, we observe that the value of MI does not directly correlate with the episodic returns in RL tasks.
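For completeness, the sketch below shows a minimal MINE statistics network estimating the MI between embeddings and one-hot opponent identities via the Donsker-Varadhan bound; the architecture, the batch handling, and the training procedure (maximizing this bound by gradient ascent) are illustrative assumptions rather than the exact estimator configuration used here.

```python
import math
import torch
import torch.nn as nn

class MINE(nn.Module):
    """Donsker-Varadhan lower bound: I(Z; Y) >= E_joint[T] - log E_marginal[exp(T)]."""

    def __init__(self, z_dim, n_opponents, hidden=64):
        super().__init__()
        self.stat_net = nn.Sequential(nn.Linear(z_dim + n_opponents, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))

    def lower_bound(self, z, opp_onehot):
        # Joint samples: each embedding paired with the identity that produced it.
        joint = self.stat_net(torch.cat([z, opp_onehot], dim=-1))
        # Marginal samples: identities shuffled across the batch.
        shuffled = opp_onehot[torch.randperm(opp_onehot.size(0))]
        marginal = self.stat_net(torch.cat([z, shuffled], dim=-1))
        log_mean_exp = torch.logsumexp(marginal, dim=0) - math.log(marginal.size(0))
        return joint.mean() - log_mean_exp.squeeze()
```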

Figure 5: Ablation on the episodic returns for different inputs in the VAE of SMA2C for weak (top row) and strong (bottom row) generalization in all four environments. Compared variants: SMA2C (full input), SMA2C with observation-action input, and SMA2C with observation-only input.

Figure 6: Ablation study on the importance of the discrimination objective. Episodic returns in all four environments for weak (top row) and strong (bottom row) generalization. Compared variants: OMDDPG and OMDDPG without the discrimination objective.

5.5 Ablation Study on SMA2C Inputs

We perform an ablation study to assess the input requirements of SMA2C. Our proposed method utilizes the observation, action, reward, and termination sequence of our agent to generate the opponent's model. We use different combinations of inputs in the encoder and compare the episodic returns. In Figure 5, the average episodic return is presented for three different cases: SMA2C, SMA2C with only observations and actions, and SMA2C with only observations. From Figure 5, we observe that during weak generalization the performance of SMA2C without the reward and the termination variable is comparable to SMA2C in all environments. However, during strong generalization the performance of SMA2C without the reward and the termination variable drops in the speaker-listener and double speaker-listener environments. SMA2C using only the observations for modeling has similar performance to SMA2C using both the observations and actions of our agent in all environments except predator-prey.

5.6 Ablation Study on Discrimination Objective

Another element of this work is the utilization of the discrimination objective proposed by Grover et al. [2018a] in the VAE loss. To better understand how opponent separation in the embedding space relates to RL performance, Figure 6 shows the episodic return during training for OMDDPG with and without the discrimination objective. From Figure 6, we observe that the discrimination objective has a significant impact on the episodic returns, increasing the performance of OMDDPG in the speaker-listener and double speaker-listener environments for both weak and strong generalization.

6 Conclusion

To conclude this work, we proposed two methods for opponent modeling in multi-agent systems using variational autoencoders. First, we proposed OMDDPG, a VAE-based method that uses the common assumption that access to the opponents' information is available during execution. The goal of this work is to motivate opponent modeling without access to the opponent's information. The core contribution of this work is SMA2C, which learns representations without requiring access to the opponent's information during execution. To the best of our knowledge, this is the first study showing that effective opponent modeling can be achieved without requiring access to opponent observations. In the future, we would like to research how these models can be used against non-stationary opponents. In particular, we plan on investigating two scenarios. The first is multi-agent deep RL, where the different agents are learning concurrently, leading to non-stationarity in the environment [Papoudakis et al., 2019], which prevents the agents from learning optimal policies. Secondly, we would like to explore whether the proposed models can deal with opponents that try to deceive and exploit the controlled agent [Ganzfried and Sandholm, 2011, 2015].
References

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015a.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

Stefano V Albrecht and Peter Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 2018.

He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, 2016.

Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. Modeling others using oneself in multi-agent reinforcement learning. arXiv preprint arXiv:1802.09640, 2018.

Aditya Grover, Maruan Al-Shedivat, Jayesh K Gupta, Yura Burda, and Harrison Edwards. Learning policy representations in multiagent systems. In International Conference on Machine Learning, 2018a.

Neil C Rabinowitz, Frank Perbet, H Francis Song, Chiyuan Zhang, SM Eslami, and Matthew Botvinick. Machine theory of mind. In International Conference on Machine Learning, 2018.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Maximilian Igl, Luisa Zintgraf, Tuan Anh Le, Frank Wood, and Shimon Whiteson. Deep variational reinforcement learning for POMDPs. In International Conference on Machine Learning, 2018.

David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.

Luisa Zintgraf, Maximilian Igl, Kyriacos Shiarlis, Anuj Mahajan, Katja Hofmann, and Shimon Whiteson. Variational task embeddings for fast adaptation in deep reinforcement learning. In International Conference on Learning Representations Workshop on Structure & Priors in Reinforcement Learning, 2019.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 1994.

Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.

Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. Agent modeling as auxiliary task for deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 2019.

Yan Zheng, Zhaopeng Meng, Jianye Hao, Zongzhang Zhang, Tianpei Yang, and Changjie Fan. A deep Bayesian policy reuse approach against non-stationary agents. In Advances in Neural Information Processing Systems, pages 954–964, 2018.

Andrea Tacchetti, H Francis Song, Pedro AM Mediano, Vinicius Zambaldi, Neil C Rabinowitz, Thore Graepel, Matthew Botvinick, and Peter W Battaglia. Relational forward models for multi-agent learning. arXiv preprint arXiv:1809.11044, 2018.

Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018.

Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems, 2018.

Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, 2019.

Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, 2017.

Aditya Grover, Maruan Al-Shedivat, Jayesh K Gupta, Yuri Burda, and Harrison Edwards. Evaluating generalization in multiagent systems using agent-interaction graphs. In International Conference on Autonomous Agents and Multiagent Systems, 2018b.

Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, 2018.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019.

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, 2014.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.

Georgios Papoudakis, Filippos Christianos, Arrasy Rahman, and Stefano V Albrecht. Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv preprint arXiv:1906.04737, 2019.

Sam Ganzfried and Tuomas Sandholm. Game theory-based opponent modeling in large imperfect-information games. In International Conference on Autonomous Agents and Multiagent Systems, 2011.

Sam Ganzfried and Tuomas Sandholm. Safe opponent exploitation. ACM Transactions on Economics and Computation (TEAC), 2015.