
Variational Autoencoders for Opponent Modeling in Multi-Agent Systems

Georgios Papoudakis
School of Informatics
The University of Edinburgh
Edinburgh Centre for Robotics
[email protected]

Stefano V. Albrecht
School of Informatics
The University of Edinburgh
Edinburgh Centre for Robotics
[email protected]

Abstract

Multi-agent systems exhibit complex behaviors that emanate from the interactions of multiple agents in a shared environment. In this work, we are interested in controlling one agent in a multi-agent system and successfully learning to interact with the other agents, which have fixed policies. Modeling the behavior of other agents (opponents) is essential for understanding the interactions of the agents in the system. By taking advantage of recent advances in unsupervised learning, we propose modeling opponents using variational autoencoders. Additionally, many existing methods in the literature assume that the opponent models have access to the opponent's observations and actions during both training and execution. To eliminate this assumption, we propose a modification that attempts to identify the underlying opponent model using only local information of our agent, such as its observations, actions, and rewards. The experiments indicate that our opponent modeling methods achieve equal or greater episodic returns in reinforcement learning tasks compared to another modeling method.

1 Introduction

In recent years, several promising works [Mnih et al., 2015, Schulman et al., 2015a, Mnih et al., 2016] have arisen in deep reinforcement learning (RL), leading to fruitful results in single-agent scenarios. In this work, we are interested in using single-agent RL in multi-agent systems, where we control one agent and the other agents (opponents) in the environment have fixed policies. The agent should be able to successfully interact with a diverse set of opponents as well as generalize to new, unseen opponents. One effective way to address this problem is opponent modeling. Opponent models output specific characteristics of the opponents based on their trajectories. By successfully modeling the opponents, the agent can reason about opponents' behaviors and goals and adjust its policy to achieve optimal outcomes. There is a rich literature on modeling opponents in multi-agent systems [Albrecht and Stone, 2018]. Several recent works have proposed learning opponent models using deep learning architectures [He et al., 2016, Raileanu et al., 2018, Grover et al., 2018a, Rabinowitz et al., 2018]. In this work, we focus on learning opponent models using Variational Autoencoders (VAEs) [Kingma and Welling, 2014]. VAEs are generative models that are commonly used for learning representations of data, and various works use them in RL for learning representations of the environment [Igl et al., 2018, Ha and Schmidhuber, 2018, Zintgraf et al., 2019]. We first propose a VAE for learning opponent representations in multi-agent systems based on the opponent trajectories.

A shortcoming of this approach and most opponent modeling methods, as will be discussed in Section 2, is that they require access to the opponent's information, such as observations and actions, during training as well as execution. This assumption is too limiting in the majority of scenarios. For example, consider Poker, in which agents do not have access to the opponent's observations. Nevertheless, during Poker, humans can reason about the opponent's behaviors and goals using only their local observations. For example, an increase in the table's pot could mean that the opponent either holds strong cards or is bluffing. Based on the idea that an agent can reason about opponent models using its own observations, actions, and rewards in a recurrent fashion, we propose a second VAE-based architecture. The encoder of this VAE learns to represent opponent models conditioned only on local information, removing the requirement to access the opponents' information during execution.

We evaluate our proposed methodology using a toy example and the commonly used Multi-agent Particle Environment [Mordatch and Abbeel, 2017]. We evaluate the episodic returns that RL algorithms can achieve with each opponent model. The experiments indicate that opponent modeling without access to the opponents' information can achieve comparable or higher average returns, using RL, than models that access the opponent's information.

2 Related Work

Learning Opponent Models. In this work, we are interested in opponent modeling methods that use neural networks to learn representations of the opponents. He et al. [2016] proposed an opponent modeling method that learns a modeling network to reconstruct the opponent's actions given the opponent's observations. Raileanu et al. [2018] developed an algorithm for learning to infer opponents' goals using the policy of the controlled agent. Hernandez-Leal et al. [2019] used auxiliary tasks for modeling opponents in multi-agent reinforcement learning. Grover et al. [2018a] proposed an encoder-decoder method for modeling the opponent's policy. The encoder learns a point-based representation of different opponents' trajectories, and the decoder learns to reconstruct the opponent's policy given samples from the embedding space. Additionally, Grover et al. [2018a] introduce an objective to separate the embeddings of different agents into different clusters:

    d(z_+, z_-, z) = 1 / (1 + e^{‖z − z_-‖_2 − ‖z − z_+‖_2})^2        (1)

where z_+ and z are embeddings of the same agent from two different episodes and the embedding z_- is generated from the episode of a different agent. Zheng et al. [2018] use an opponent modeling method for better opponent identification and multi-agent reinforcement learning. Rabinowitz et al. [2018] proposed the Theory of Mind network (ToMnet), which learns embedding-based representations of opponents for meta-learning. Tacchetti et al. [2018] proposed Relational Forward Models to model opponents using graph neural networks. A common assumption among these methods, which our work aims to eliminate, is that access to opponents' trajectories is available during execution.
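To make the embedding-separation objective in Eq. (1) concrete, the following sketch computes d(z_+, z_-, z) for batches of embedding vectors. It is an illustrative PyTorch implementation under our reading of Eq. (1), not the code of Grover et al. [2018a]; the function name and tensor shapes are assumptions made for this example.

    import torch

    def embedding_separation(z, z_pos, z_neg):
        # z and z_pos are embeddings of the same agent from two different
        # episodes; z_neg comes from the episode of a different agent.
        # All tensors are assumed to have shape (batch_size, embedding_dim).
        dist_neg = torch.norm(z - z_neg, p=2, dim=-1)   # ||z - z_-||_2
        dist_pos = torch.norm(z - z_pos, p=2, dim=-1)   # ||z - z_+||_2
        # Eq. (1): the term approaches 0 when the same-agent embedding is
        # closer to z than the other-agent embedding, and 1 otherwise.
        return 1.0 / (1.0 + torch.exp(dist_neg - dist_pos)) ** 2

Under this reading, adding the term to a loss that is minimized encourages embeddings of the same agent to cluster together and embeddings of different agents to separate.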
Representation Learning in Reinforcement Learning. Another related topic that has received significant attention is representation learning in RL. Using unsupervised learning techniques to learn low-dimensional representations of the MDP has led to significant improvements in RL. Ha and Schmidhuber [2018] proposed a VAE-based model and a forward model to learn state representations of the environment. Hausman et al. [2018] learned task embeddings and interpolated between them to solve harder tasks. Igl et al. [2018] used a VAE for learning representations in partially observable environments. Gupta et al. [2018] proposed a model which learns Gaussian embeddings to represent different tasks during meta-training and manages to quickly adapt to new tasks during meta-testing. The work of Zintgraf et al. [2019] is closely related to ours; they proposed a recurrent VAE model, which receives as input the observation, action, and reward of the agent and learns a variational distribution of tasks. Rakelly et al. [2019] used representations from an encoder for off-policy meta-RL. Note that all these works have been applied to learning representations of tasks or properties of the environment. In contrast, our approach is focused on learning representations of opponents.

3 Background

3.1 Reinforcement Learning

Markov Decision Processes (MDPs) are commonly used to model decision-making problems. An MDP consists of a set of states, a set of actions, a transition function, and a reward function. At each time step, the agent observes the state s and selects an action according to its policy, which can be stochastic, a ∼ π(a|s), or deterministic, a = µ(s). Given a policy π, the state value function is defined as V(s_t) = E_π[Σ_{i=t}^{H} γ^{i−t} r_i | s = s_t] and the state-action value (Q-value) as Q(s_t, a_t) = E_π[Σ_{i=t}^{H} γ^{i−t} r_i | s = s_t, a = a_t], where 0 ≤ γ ≤ 1 is the discount factor and H is the finite horizon of the episode. The goal of RL is to compute the policy that maximizes the state value function V when the transition and reward functions are unknown.

There is a large number of RL algorithms; however, in this work, we focus on two actor-critic algorithms: the synchronous Advantage Actor-Critic (A2C) [Mnih et al., 2016, Dhariwal et al., 2017] and the Deep Deterministic Policy Gradient (DDPG) [Silver et al., 2014, Lillicrap et al., 2015]. DDPG is an off-policy algorithm, using an experience replay buffer to break the correlation between consecutive samples and target networks to stabilize training [Mnih et al., 2015]. Given an actor network with parameters θ and a critic network with parameters φ, the gradient updates are performed using the following update rules:

    min_φ  (1/2) E_B[(r + γ Q_{target,φ'}(s', µ_{target,θ'}(s')) − Q_φ(s, a))^2]
    min_θ  −E_B[Q_φ(s, µ_θ(s))]

On the other hand, A2C is an on-policy actor-critic algorithm, using parallel environments to break the correlation between consecutive samples. The actor-critic parameters are optimized by:

    min_{θ,φ}  E_B[−Â log π_θ(a|s) + (1/2)(r + γ V_φ(s') − V_φ(s))^2]

where the advantage term Â can be computed using Generalized Advantage Estimation (GAE) [Schulman et al., 2015b].
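As an illustration of the update rules above, the sketch below computes the DDPG and A2C losses for a batch B of transitions in PyTorch. The network interfaces, tensor names, and the way the advantage Â is obtained are assumptions made for this example; it is not the training code used in this work.

    import torch
    import torch.nn.functional as F

    def ddpg_losses(batch, actor, critic, target_actor, target_critic, gamma=0.99):
        # Critic loss: (1/2) E_B[(r + gamma * Q_target(s', mu_target(s')) - Q(s, a))^2]
        s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
        with torch.no_grad():
            target_q = r + gamma * target_critic(s_next, target_actor(s_next))
        critic_loss = 0.5 * F.mse_loss(critic(s, a), target_q)
        # Actor loss: -E_B[Q(s, mu(s))], minimized with respect to the actor parameters.
        actor_loss = -critic(s, actor(s)).mean()
        return critic_loss, actor_loss

    def a2c_loss(batch, policy, value_fn, advantage, gamma=0.99):
        # Joint loss E_B[-A_hat * log pi(a|s) + (1/2)(r + gamma * V(s') - V(s))^2];
        # `policy(s)` is assumed to return a torch.distributions object and
        # `advantage` is the precomputed A_hat, e.g. obtained from GAE.
        s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
        log_prob = policy(s).log_prob(a)
        # The bootstrapped value target is treated as a constant.
        td_error = r + gamma * value_fn(s_next).detach() - value_fn(s)
        return (-advantage * log_prob + 0.5 * td_error ** 2).mean()

Minimizing the combined A2C loss with a single optimizer updates the policy through the log-probability term and the value function through the squared temporal-difference term.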
3.2 Variational Autoencoders

Consider samples from a dataset x ∈ X which are generated from some hidden (latent) random variable z based on a generative distribution p_u(x|z) with unknown parameters u, and a prior distribution on the latent variables, which we assume is a Gaussian with zero mean and unit variance, p(z) = N(z; 0, I). We are interested in approximating the true posterior p(z|x) with a variational parametric distribution q_w(z|x) = N(z; µ, Σ; w). Kingma and Welling [2014] proposed the Variational Autoencoder (VAE) to learn this distribution. Starting from the Kullback-Leibler (KL) divergence from the approximate to the true posterior, D_KL(q_w(z|x) ‖ p(z|x)), the lower bound on the evidence log p(x) is derived as:

    log p(x) ≥ E_{z∼q_w(z|x)}[log p_u(x|z)] − D_KL(q_w(z|x) ‖ p(z))
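To make the bound concrete, the following sketch computes the negative of this lower bound for a diagonal-Gaussian encoder q_w(z|x) and a decoder p_u(x|z), using the reparameterization trick of Kingma and Welling [2014]. The network sizes and the Bernoulli reconstruction term are assumptions made for this example; it is not the opponent-modeling architecture proposed in this work.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAE(nn.Module):
        # Minimal VAE: q_w(z|x) = N(z; mu(x), diag(sigma(x)^2)), prior p(z) = N(0, I).
        def __init__(self, x_dim, z_dim, hidden_dim=128):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(x_dim, hidden_dim), nn.ReLU())
            self.mu = nn.Linear(hidden_dim, z_dim)
            self.log_var = nn.Linear(hidden_dim, z_dim)
            self.decoder = nn.Sequential(nn.Linear(z_dim, hidden_dim), nn.ReLU(),
                                         nn.Linear(hidden_dim, x_dim))

        def negative_elbo(self, x):
            h = self.encoder(x)
            mu, log_var = self.mu(h), self.log_var(h)
            # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
            z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
            # Reconstruction term -E_q[log p_u(x|z)], here for binary-valued x.
            recon = F.binary_cross_entropy_with_logits(self.decoder(z), x,
                                                       reduction="sum")
            # Closed-form KL(q_w(z|x) || p(z)) for a diagonal Gaussian posterior.
            kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
            # Minimizing this quantity maximizes the lower bound on log p(x).
            return recon + kl

In practice, the VAE is trained by minimizing negative_elbo averaged over minibatches, which jointly optimizes the encoder parameters w and the decoder parameters u.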