Learning Deep State Representations with Convolutional Autoencoders Gabriel Barth-Maron Supervised by Stefanie Tellex

Total Pages: 16

File Type: PDF, Size: 1020 KB

Learning Deep State Representations With Convolutional Autoencoders

Gabriel Barth-Maron
Supervised by Stefanie Tellex
Department of Computer Science, Brown University

Abstract

Advances in artificial intelligence algorithms and techniques are quickly allowing us to create artificial agents that interact with the real world. However, these agents need to maintain a carefully constructed abstract representation of the world around them [9]. Recent research in deep reinforcement learning attempts to overcome this challenge. Mnih et al. [24] at DeepMind and Levine et al. [18] demonstrate successful methods of learning deep end-to-end policies from high-dimensional input. In addition, Böhmer et al. [1] and Mattner et al. [22] extract deep state representations that can be used with traditional value function approximation algorithms to learn policies. We present a model that discovers low-dimensional deep state representations in a similar fashion to the deep fitted Q algorithm [1]. A plethora of function approximation techniques can be used in the lower dimension space to obtain the Q-function. To test our algorithms, we run several experiments on 80 × 20 images taken from a 10 × 2 grid world and show that convolutional autoencoders can be trained to obtain deep state representations that are almost as good as knowing the ground-truth state.

Figure 1: An 80 × 20 gray scale image of the 10 × 2 state. The agent is at location (3, 0) and the goal is at location (9, 1).

1 Introduction

Reinforcement learning provides an excellent framework for planning and learning in non-stochastic domains. Since inception it has been used to accomplish a wide variety of tasks, from robotics [10, 5, 15] to sequential decision-making games [32, 11], and dialogue systems [27, 34].

However, many reinforcement learning algorithms have a run-time that is polynomial in the number of states and actions. To learn in large domains, researchers have had to carefully craft features of their state space so that they are general enough to represent the original problem, but small enough to be computationally tractable.

Feature engineering becomes a major hindrance as we create learning agents for more complex state spaces. Additionally, it requires expert knowledge and does not generalize well across different domains. Several areas of research attempt to deal with the challenge of exponentially large state spaces, such as Monte Carlo Tree Search [2], hierarchical planning [4, 30, 7], and value function approximation [29].

Here we take an alternative approach with a focus on planning with sensor input. Visual information is an easily accessible, rich source of information; however, uncovering structured information is a difficult and well studied problem in computer vision. Many vision problems have been solved through the use of carefully crafted features such as scale invariant feature transformations [20] and histogram of gradients [3]. Recent advances in deep learning have made it possible to automatically extract high-level features from raw visual data, leading to breakthroughs in several areas of computer vision [14, 26, 23].

In our model we use neural networks as an unsupervised technique to learn an abstract feature representation of the raw visual input. Similar to hierarchical techniques, these neural networks allow us to plan in the (significantly simplified) abstract state space. This model is similar to the algorithm designed by DeepMind that plays Atari 2600 games from visual input [24]. However, their algorithm performs end-to-end learning (which directly produces a policy), whereas ours learns a deep state representation that can be used by a variety of reinforcement learning algorithms. In addition, the DeepMind algorithm does not allow for model-based alternatives, as we believe ours does. Böhmer et al. [1], Mattner et al. [22] have created a deep fitted Q (DFQ) algorithm that is very similar to what we propose; however, our use of convolutional autoencoders takes advantage of image structure and produces better state representations.

We used 80 × 20 pixel gray scale images taken from a 10 × 2 grid world; an example state may be seen in Figure 1. Because the 10 × 2 grid world can be characterized by only two numbers – the agent's x and y coordinates – one of our goals is to attempt to compress these images to a two dimensional output.

In Section 2 we give a brief overview of reinforcement learning and deep learning. Section 6 reviews state of the art techniques that combine reinforcement learning and deep learning. Then in Section 3 we introduce our models, and show their empirical performance in Sections 4 and 5.
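For concreteness, the following is a small sketch (not code from the paper) of how a 10 × 2 grid-world state such as the one in Figure 1 could be rendered as an 80 × 20 gray scale image, with each cell drawn as an 8 × 10 pixel block; the specific intensity values used for the agent and goal are assumptions made for illustration.

```python
import numpy as np

def render_state(agent_xy, goal_xy, width=10, height=2, cell_w=8, cell_h=10):
    """Render a 10x2 grid-world state as an 80x20 gray scale image
    (stored as a 20x80 array, height x width).  Intensities are
    illustrative: background 0.0, goal 0.5, agent 1.0."""
    img = np.zeros((height * cell_h, width * cell_w), dtype=np.float32)
    for (x, y), value in ((goal_xy, 0.5), (agent_xy, 1.0)):
        img[y * cell_h:(y + 1) * cell_h, x * cell_w:(x + 1) * cell_w] = value
    return img

# The state shown in Figure 1: agent at (3, 0), goal at (9, 1).
frame = render_state(agent_xy=(3, 0), goal_xy=(9, 1))   # shape (20, 80)
```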
2 Background

This section should serve as a self-contained introduction to reinforcement learning and deep learning for those who are not already familiar with the fields.

2.1 Reinforcement Learning

Reinforcement learning problems are typically modelled as a Markov Decision Process (MDP). A MDP is a five-tuple ⟨S, A, T, R, γ⟩, where S is a state space; A is the agent's set of actions; T denotes T(s′ | s, a), the transition probability of an agent applying action a ∈ A in state s ∈ S and arriving in s′ ∈ S; R(s, a, s′) denotes the reward received by the agent for applying action a in state s and transitioning to state s′; and γ ∈ [0, 1] is a discount factor that defines how much the agent prefers immediate rewards over future rewards (the agent prefers to maximize immediate rewards as γ decreases). MDPs may also include terminal states that cause all action to cease once reached.

Reinforcement learning involves estimating a value function from experience, simulation, or search [28, 33]. Typically the value function is parametrized by the state space – there exists one unique entry per state. However, in continuous state spaces (or, as we will later see, in large discrete state spaces) it is desirable to find an alternate parametrization of the value function. The most common technique for doing so is linear value function approximation, where the value function is represented as a weighted linear sum of a set of features [12]. These features are also known as basis functions, some common examples being Radial Basis Functions, CMACs, and the Fourier Basis Function.

One particular algorithm for learning a linear value function approximation is Gradient Descent SARSA(λ) [25]. This algorithm combines Q-learning with Temporal Difference learning (TD-learning) to learn the Q-function (we use Q_t as shorthand for Q(s_t, a_t)). Lin [19] derives an update equation for a Q-learning algorithm that uses a neural network basis function (it is also applicable to any other basis function) with weights w:

∆w_t = η ( r_t + γ max_{a ∈ A} Q_{t+1}(a) − Q_t ) ∂Q_t/∂w_t    (1)

The Gradient Descent SARSA(λ) update scheme is similar, with two notable exceptions. First, in order to update previous states, ∆w_t is multiplied by a weighted sum of previous gradients. Second, the max operator is dropped in favor of using the Q_{t+1} associated with the action that was selected, which allows for a better trade-off between exploration and exploitation – as the algorithm converges it will start behaving as if it were always selecting the action that maximizes the Q-function.
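To make the update above concrete, here is a minimal sketch of Gradient Descent SARSA(λ) with a linear Q-function over a feature vector φ(s). For a linear approximator the gradient ∂Q_t/∂w_t is simply φ(s_t) for the chosen action, the max in Equation (1) is replaced by the Q-value of the action actually selected, and an eligibility trace holds the weighted sum of previous gradients. The environment interface (reset() returning a state, step(a) returning (next_state, reward, done)) and the feature map `phi` are assumptions for illustration, not code from the paper.

```python
import numpy as np

def sarsa_lambda(env, phi, n_actions, n_features,
                 episodes=200, eta=0.1, gamma=0.95, lam=0.9, eps=0.1):
    """Gradient Descent SARSA(lambda) with a linear Q-function
    Q(s, a) = w[a] . phi(s).  `phi` maps a state to a feature vector
    (for example, an autoencoder's bottleneck activations)."""
    w = np.zeros((n_actions, n_features))

    def q(feat):                         # Q-values of all actions for these features
        return w @ feat

    def epsilon_greedy(feat):
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax(q(feat)))

    for _ in range(episodes):
        s = env.reset()
        f = phi(s)
        a = epsilon_greedy(f)
        z = np.zeros_like(w)             # eligibility traces: decayed past gradients
        done = False
        while not done:
            s_next, r, done = env.step(a)
            f_next = phi(s_next)
            a_next = epsilon_greedy(f_next)
            # SARSA uses the Q-value of the action actually selected,
            # not max_a Q as in the Q-learning update of Equation (1).
            target = r if done else r + gamma * q(f_next)[a_next]
            delta = target - q(f)[a]
            z *= gamma * lam             # decay all traces
            z[a] += f                    # gradient of Q(s, a) w.r.t. w[a] is phi(s)
            w += eta * delta * z         # update all previously visited state-actions
            f, a = f_next, a_next
    return w
```

Setting λ = 0 recovers one-step SARSA; the learned autoencoder features described later would be passed in as `phi`.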
2.2 Deep Learning

An autoencoder is a fully-connected neural network that attempts to learn the identity function. Additionally, the network contains a single hidden layer that has a number of nodes significantly less than the input. During training the autoencoder attempts to find a good compression of the input data. In addition, autoencoders can be stacked – the output of one autoencoder's hidden layer serves as the input of another – to form deep architectures. Autoencoders and stacked autoencoders have been shown to be very useful in performing unsupervised dimensionality reduction [8].

Convolutional neural networks (CNNs) use convolution to take advantage of the locality of image features. In addition, since these networks share the kernel's weights for each layer, they are much sparser than their fully-connected counterparts. CNNs have been used to achieve state of the art performance in image classification [14], face verification [31], and object detection [17].

3 Architectures

Figure 2: Autoencoder architectures. (a) Autoencoder AE-10 with 10 hidden nodes. (b) Autoencoder AE-20 with 20 hidden nodes. (c) Stacked autoencoder SAE with 2 final hidden nodes.

We used autoencoders to learn abstract features for images similar to the one in Figure 1 in an unsupervised manner. To train these networks we used backpropagation on an image set that captures the entirety of the state space. We combined different numbers of layers and hidden nodes, and have reported the results for some of the final models in Section 5. We also used convolutional autoencoders (CAEs) to take advantage of the structure and locality that is found in naturally occurring images.

The output of the middle layer of the (convolutional) autoencoders was used as a basis function, which served as the features for linear value function approximation. […] a more difficult optimization problem for the value function approximation algorithm.

4 Experiments

All of our experiments used 80 × 20 images taken from a 10 × 2 grid world as seen in Figure 1. The autoencoders AE-10, AE-20, and SAE, along with the convolutional autoencoders CAE and SCAE-AGENT, were trained on all 20 possible images while the goal was at location (9, 1). The convolutional autoencoders SCAE-8 and SCAE-4 were both trained on all 400 possible images by moving both the agent and goal. The larger training data set was used to make the kernels goal-location invariant. The middle layer of each of these neural networks was then used as a feature basis for Gradient Descent SARSA(λ).
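As an illustration of the architectures described in Sections 3 and 4, here is a minimal PyTorch sketch of a convolutional autoencoder for the 80 × 20 gray scale frames (stored as 20 × 80 arrays, height × width) with a two-dimensional bottleneck whose activations could serve as the state features for linear value function approximation. The layer sizes, training loop, and stand-in data are assumptions; this is not the authors' exact CAE.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Convolutional autoencoder for 1x20x80 grid-world frames with a
    2-dimensional bottleneck.  Layer sizes are illustrative only."""
    def __init__(self, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),   # -> 8 x 10 x 40
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # -> 16 x 5 x 20
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 5 * 20, latent_dim),                    # bottleneck features
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16 * 5 * 20),
            nn.ReLU(),
            nn.Unflatten(1, (16, 5, 20)),
            nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1),  # -> 8 x 10 x 40
            nn.ReLU(),
            nn.ConvTranspose2d(8, 1, kernel_size=4, stride=2, padding=1),   # -> 1 x 20 x 80
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Training on the full set of grid-world images (random stand-in data here).
images = torch.rand(20, 1, 20, 80)          # all 20 states of the 10x2 grid world
model = ConvAutoencoder(latent_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(500):
    recon, _ = model(images)
    loss = nn.functional.mse_loss(recon, images)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The 2-D bottleneck activations are the learned state representation.
with torch.no_grad():
    features = model.encoder(images)        # shape (20, 2)
```

The `features` tensor plays the role of φ(s) in the SARSA(λ) sketch shown earlier.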
Recommended publications
  • Training Autoencoders by Alternating Minimization
    Under review as a conference paper at ICLR 2018. TRAINING AUTOENCODERS BY ALTERNATING MINIMIZATION. Anonymous authors, paper under double-blind review. ABSTRACT: We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimization problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising and competitive to traditional backpropagation techniques, both in terms of quality of solution, as well as training speed. 1 INTRODUCTION: For much of the recent march of deep learning, gradient-based backpropagation methods, e.g. Stochastic Gradient Descent (SGD) and its variants, have been the mainstay of practitioners. The use of these methods, especially on vast amounts of data, has led to unprecedented progress in several areas of artificial intelligence. On one hand, the intense focus on these techniques has led to an intimate understanding of hardware requirements and code optimizations needed to execute these routines on large datasets in a scalable manner. Today, myriad off-the-shelf and highly optimized packages exist that can churn reasonably large datasets on GPU architectures with relatively mild human involvement and little bootstrap effort.
    [Show full text]
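A minimal sketch of the alternating-minimization idea described in the entry above: the decoder is fit while the encoder is frozen, then the encoder is fit through the fixed decoder, and the two phases alternate. Plain gradient steps stand in for DANTE's quasi-convex solvers, and the layer sizes and data are placeholders.

```python
import torch
import torch.nn as nn

# Toy single-hidden-layer autoencoder split into its two halves.
encoder = nn.Sequential(nn.Linear(784, 64), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
enc_opt = torch.optim.SGD(encoder.parameters(), lr=0.1)
dec_opt = torch.optim.SGD(decoder.parameters(), lr=0.1)

data = torch.rand(256, 784)                 # stand-in training batch
loss_fn = nn.MSELoss()

for outer in range(50):
    # Phase 1: freeze the encoder, fit the decoder to the current codes.
    with torch.no_grad():
        codes = encoder(data)
    for _ in range(10):
        dec_opt.zero_grad()
        loss = loss_fn(decoder(codes), data)
        loss.backward()
        dec_opt.step()

    # Phase 2: freeze the decoder, fit the encoder through the fixed decoder.
    for p in decoder.parameters():
        p.requires_grad_(False)
    for _ in range(10):
        enc_opt.zero_grad()
        loss = loss_fn(decoder(encoder(data)), data)
        loss.backward()
        enc_opt.step()
    for p in decoder.parameters():
        p.requires_grad_(True)
```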
  • Turbo Autoencoder: Deep Learning Based Channel Codes for Point-To-Point Communication Channels
    Turbo Autoencoder: Deep learning based channel codes for point-to-point communication channels. Hyeji Kim (Samsung AI Center Cambridge, Cambridge, United Kingdom, [email protected]); Himanshu Asnani (School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, India, [email protected]); Yihan Jiang (ECE Department, University of Washington, Seattle, United States, [email protected]); Sewoong Oh (Allen School of Computer Science & Engineering, University of Washington, Seattle, United States, [email protected]); Pramod Viswanath (ECE Department, University of Illinois at Urbana Champaign, Illinois, United States, [email protected]); Sreeram Kannan (ECE Department, University of Washington, Seattle, United States, [email protected]). Abstract: Designing codes that combat the noise in a communication medium has remained a significant area of research in information theory as well as wireless communications. Asymptotically optimal channel codes have been developed by mathematicians for communicating under canonical models after over 60 years of research. On the other hand, in many non-canonical channel settings, optimal codes do not exist and the codes designed for canonical models are adapted via heuristics to these channels and are thus not guaranteed to be optimal. In this work, we make significant progress on this problem by designing a fully end-to-end jointly trained neural encoder and decoder, namely, Turbo Autoencoder (TurboAE), with the following contributions: (a) under moderate block lengths, TurboAE approaches state-of-the-art performance under canonical channels; (b) moreover, TurboAE outperforms the state-of-the-art codes under non-canonical settings in terms of reliability. TurboAE shows that the development of channel coding design can be automated via deep learning, with near-optimal performance.
    [Show full text]
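A minimal sketch of the general channel-autoencoder idea behind the entry above: a neural encoder maps message bits to channel symbols, an additive white Gaussian noise channel corrupts them, and a neural decoder is trained jointly with the encoder to recover the bits. The block length, rate, noise level, and layer sizes are illustrative, and this is not TurboAE's interleaved convolutional architecture.

```python
import torch
import torch.nn as nn

K, N = 4, 8                                    # K message bits -> N channel uses
enc = nn.Sequential(nn.Linear(K, 64), nn.ReLU(), nn.Linear(64, N))
dec = nn.Sequential(nn.Linear(N, 64), nn.ReLU(), nn.Linear(64, K))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
sigma = 0.5                                    # assumed AWGN noise level

for step in range(2000):
    bits = torch.randint(0, 2, (256, K)).float()      # random message bits
    x = enc(bits)
    x = x / x.pow(2).mean().sqrt()                    # average power normalization
    y = x + sigma * torch.randn_like(x)               # AWGN channel
    loss = bce(dec(y), bits)                          # recover the transmitted bits
    opt.zero_grad()
    loss.backward()
    opt.step()

# Bit error rate on fresh messages.
with torch.no_grad():
    bits = torch.randint(0, 2, (10000, K)).float()
    x = enc(bits); x = x / x.pow(2).mean().sqrt()
    y = x + sigma * torch.randn_like(x)
    ber = ((dec(y) > 0).float() != bits).float().mean()
```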
  • Double Backpropagation for Training Autoencoders Against Adversarial Attack
    Double Backpropagation for Training Autoencoders against Adversarial Attack. Chengjin Sun, Sizhe Chen, and Xiaolin Huang, Senior Member, IEEE. Abstract—Deep learning, as widely known, is vulnerable to adversarial samples. This paper focuses on the adversarial attack on autoencoders. Safety of the autoencoders (AEs) is important because they are widely used as a compression scheme for data storage and transmission; however, the current autoencoders are easily attacked, i.e., one can slightly modify an input but has totally different codes. The vulnerability is rooted in the sensitivity of the autoencoders, and to enhance the robustness, we propose to adopt double backpropagation (DBP) to secure autoencoders such as VAE and DRAW. We restrict the gradient from the reconstruction image to the original one so that the autoencoder is not sensitive to trivial perturbation produced by the adversarial attack. After smoothing the gradient by DBP, we further smooth the label by Gaussian Mixture Model (GMM), aiming for accurate and robust classification. We demonstrate in MNIST, CelebA, SVHN that our method leads to a robust autoencoder resistant to attack and a robust classifier able for image transition and immune to adversarial attack if combined with GMM. Index Terms—double backpropagation, autoencoder, network robustness, GMM. 1 INTRODUCTION: In the past few years, deep neural networks have been greatly developed and successfully used in a vast of fields, such as pattern recognition, intelligent robots, automatic control, medicine [1]. Despite the great success, researchers have found the vulnerability of deep neural networks to […] feature [9], [10], [11], [12], [13], or network structure [3], [14], [15]. Adversarial attack and its defense are revolving around a small ∆x and a big resulting difference between f(x + ∆x) and f(x).
    [Show full text]
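A minimal sketch of the double-backpropagation idea from the entry above: the gradient of the reconstruction loss with respect to the input is penalized so that small perturbations of the input cannot produce large changes in the reconstruction. The plain autoencoder, penalty weight, and stand-in data are assumptions; the paper applies the idea to VAE and DRAW and adds a GMM classifier, which are omitted here.

```python
import torch
import torch.nn as nn

autoencoder = nn.Sequential(                # placeholder AE; the paper secures VAE/DRAW
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
lam = 0.1                                   # assumed weight of the gradient penalty

for step in range(1000):
    x0 = torch.rand(64, 784)                # stand-in batch (clean input = target)
    x = x0.clone().requires_grad_(True)
    recon_loss = nn.functional.mse_loss(autoencoder(x), x0)

    # "Double backpropagation": penalize d(recon_loss)/d(input) so the
    # reconstruction is insensitive to small perturbations of x.
    (grad_x,) = torch.autograd.grad(recon_loss, x, create_graph=True)
    penalty = grad_x.pow(2).sum(dim=1).mean()

    loss = recon_loss + lam * penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
```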
  • Survey on Reinforcement Learning for Language Processing
    Survey on reinforcement learning for language processing. Víctor Uc-Cetina (1), Nicolás Navarro-Guerrero (2), Anabel Martin-Gonzalez (1), Cornelius Weber (3), Stefan Wermter (3). (1) Universidad Autónoma de Yucatán – {uccetina, [email protected]; (2) Aarhus University – [email protected]; (3) Universität Hamburg – {weber, [email protected]. February 2021 (arXiv:2104.05565v1 [cs.CL], 12 Apr 2021). Abstract: In recent years some researchers have explored the use of reinforcement learning (RL) algorithms as key components in the solution of various natural language processing tasks. For instance, some of these algorithms leveraging deep neural learning have found their way into conversational systems. This paper reviews the state of the art of RL methods for their possible use for different problems of natural language processing, focusing primarily on conversational systems, mainly due to their growing relevance. We provide detailed descriptions of the problems as well as discussions of why RL is well-suited to solve them. Also, we analyze the advantages and limitations of these methods. Finally, we elaborate on promising research directions in natural language processing that might benefit from reinforcement learning. Keywords: reinforcement learning, natural language processing, conversational systems, parsing, translation, text generation. 1 Introduction: Machine learning algorithms have been very successful to solve problems in the natural language processing (NLP) domain for many years, especially supervised and unsupervised methods. However, this is not the case with reinforcement learning (RL), which is somewhat surprising since in other domains, reinforcement learning methods have experienced an increased level of success with some impressive results, for instance in board games such as AlphaGo Zero [106].
    [Show full text]
  • Artificial Intelligence Applied to Electromechanical Monitoring, A Performance Analysis
    ARTIFICIAL INTELLIGENCE APPLIED TO ELECTROMECHANICAL MONITORING, A PERFORMANCE ANALYSIS. Erasmus project. Authors: Staš Osterc. Mentor: Dr. Miguel Delgado Prieto, Dr. Francisco Arellano Espitia. 1/5/2020. ANNEX VI – DECLARATION OF HONOUR: I declare that the work in this Master Thesis / Degree Thesis (choose one) is completely my own work, no part of this Master Thesis / Degree Thesis (choose one) is taken from other people's work without giving them credit, all references have been clearly cited, and I'm authorised to make use of the company's / research group's (choose one) related information I'm providing in this document (select when it applies). I understand that an infringement of this declaration leaves me subject to the foreseen disciplinary actions by the Universitat Politècnica de Catalunya – BarcelonaTECH. Student Name, Signature, Date. Title of the Thesis: ________. Contents: Introduction; Abstract; Aim.
    [Show full text]
  • Unsupervised Speech Representation Learning Using WaveNet Autoencoders
    Unsupervised speech representation learning using WaveNet autoencoders. Jan Chorowski, Ron J. Weiss, Samy Bengio, Aäron van den Oord. Abstract—We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the learned representation is tuned to contain only phonetic content, we resort to using a high capacity WaveNet decoder to infer information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately reconstruct […] speaker gender and identity, from phonetic content, properties which are consistent with internal representations learned by speech recognizers [13], [14]. Such representations are desired in several tasks, such as low resource automatic speech recognition (ASR), where only a small amount of labeled training data is available. In such scenario, limited amounts of data may be sufficient to learn an acoustic model on the representation discovered without supervision, but insufficient to learn the acoustic model and a data representation in a fully supervised manner [15], [16]. We focus on representations learned with autoencoders applied to raw waveforms and spectrogram features and investigate the quality of learned representations on LibriSpeech [17].
    [Show full text]
  • A Deep Reinforcement Learning Neural Network Folding Proteins
    DeepFoldit - A Deep Reinforcement Learning Neural Network Folding Proteins Dimitra Panou1, Martin Reczko2 1University of Athens, Department of Informatics and Telecommunications 2Biomedical Sciences Research Center “Alexander Fleming” ABSTRACT Despite considerable progress, ab initio protein structure prediction remains suboptimal. A crowdsourcing approach is the online puzzle video game Foldit [1], that provided several useful results that matched or even outperformed algorithmically computed solutions [2]. Using Foldit, the WeFold [3] crowd had several successful participations in the Critical Assessment of Techniques for Protein Structure Prediction. Based on the recent Foldit standalone version [4], we trained a deep reinforcement neural network called DeepFoldit to improve the score assigned to an unfolded protein, using the Q-learning method [5] with experience replay. This paper is focused on model improvement through hyperparameter tuning. We examined various implementations by examining different model architectures and changing hyperparameter values to improve the accuracy of the model. The new model’s hyper-parameters also improved its ability to generalize. Initial results, from the latest implementation, show that given a set of small unfolded training proteins, DeepFoldit learns action sequences that improve the score both on the training set and on novel test proteins. Our approach combines the intuitive user interface of Foldit with the efficiency of deep reinforcement learning. KEYWORDS: ab initio protein structure prediction, Reinforcement Learning, Deep Learning, Convolution Neural Networks, Q-learning 1. ALGORITHMIC BACKGROUND Machine learning (ML) is the study of algorithms and statistical models used by computer systems to accomplish a given task without using explicit guidelines, relying on inferences derived from patterns. ML is a field of artificial intelligence.
    [Show full text]
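A minimal sketch of Q-learning with experience replay as referenced in the entry above, in a generic tabular form rather than DeepFoldit's deep network or the Foldit environment; the environment interface (reset() returning a state, step(a) returning (next_state, reward, done)) is an assumption.

```python
import random
from collections import deque, defaultdict

class ReplayBuffer:
    """Fixed-size buffer of past transitions, sampled uniformly for updates."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)
    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))
    def sample(self, batch_size):
        return random.sample(self.buf, min(batch_size, len(self.buf)))

def q_learning_with_replay(env, actions, episodes=500,
                           alpha=0.1, gamma=0.99, eps=0.1, batch_size=32):
    """Tabular Q-learning whose updates are drawn from a replay buffer,
    the same experience-replay idea used (with a deep network) above."""
    Q = defaultdict(float)                      # (state, action) -> value
    buffer = ReplayBuffer()
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            buffer.push(s, a, r, s_next, done)
            s = s_next
            # Learn from a random minibatch of stored transitions.
            for (bs, ba, br, bs2, bdone) in buffer.sample(batch_size):
                target = br if bdone else br + gamma * max(Q[(bs2, a_)] for a_ in actions)
                Q[(bs, ba)] += alpha * (target - Q[(bs, ba)])
    return Q
```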
  • An Introduction to Deep Reinforcement Learning
    An Introduction to Deep Reinforcement Learning. Ehsan Abbasnejad. Remember: supervised learning — we have a set of sample observations with labels and learn to predict the labels given a new sample (e.g., learn the function that associates a picture of a dog/cat with the label "dog" or "cat"). We need thousands of samples, and samples have to be provided by experts. There are applications where: we can't provide expert samples; expert examples are not what we mimic; there is an interaction with the world. Deep reinforcement learning (e.g., AlphaGo). Scenario of reinforcement learning: the agent observes the state of the environment, takes an action that changes the environment, and receives a reward; the agent learns to take actions maximizing expected reward. Machine learning, actor/policy ≈ looking for a function: Action = π(Observation), with the observation as the function's input, the action as its output, and the reward used to pick the best function. Reinforcement learning in a nutshell: RL is a general-purpose framework for decision-making — RL is for an agent with the capacity to act; each action influences the agent's future state; success is measured by a scalar reward signal. Goal: select actions to maximise future reward. Deep learning in a nutshell: DL is a general-purpose framework for representation learning — given an objective, it learns the representation that is required to achieve the objective, directly from raw inputs.
    [Show full text]
  • Modelling Working Memory Using Deep Recurrent Reinforcement Learning
    Modelling Working Memory using Deep Recurrent Reinforcement Learning. Pravish Sainath (1,2,3), Pierre Bellec (1,3), Guillaume Lajoie (2,3). (1) Centre de Recherche, Institut Universitaire de Gériatrie de Montréal; (2) Mila – Quebec AI Institute; (3) Université de Montréal. Abstract: In cognitive systems, the role of a working memory is crucial for visual reasoning and decision making. Tremendous progress has been made in understanding the mechanisms of the human/animal working memory, as well as in formulating different frameworks of artificial neural networks. In the case of humans, the visual working memory (VWM) task [1] is a standard one in which the subjects are presented with a sequence of images, each of which needs to be identified as to whether it was already seen or not. Our work is a study of multiple ways to learn a working memory model using recurrent neural networks that learn to remember input images across timesteps in supervised and reinforcement learning settings. The reinforcement learning setting is inspired by the popular view in Neuroscience that the working memory in the prefrontal cortex is modulated by a dopaminergic mechanism. We consider the VWM task as an environment that rewards the agent when it remembers past information and penalizes it for forgetting. We quantitatively estimate the performance of these models on sequences of images from a standard image dataset (CIFAR-100 [2]) and their ability to remember and recall. Based on our analysis, we establish that a gated recurrent neural network model with long short-term memory units trained using reinforcement learning is powerful and more efficient in temporally consolidating the input spatial information.
    [Show full text]
  • Reinforcement Learning: a Survey
    Journal of Artificial Intelligence Research. Reinforcement Learning: A Survey. Leslie Pack Kaelbling (lpk@cs.brown.edu), Michael L. Littman (mlittman@cs.brown.edu), Computer Science Department, Brown University, Providence, RI, USA; Andrew W. Moore (awm@cs.cmu.edu), Smith Hall, Carnegie Mellon University, Forbes Avenue, Pittsburgh, PA, USA. Abstract: This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement". The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning. 1. Introduction: Reinforcement learning dates back to the early days of cybernetics and work in statistics, psychology, neuroscience, and computer science. In the last five to ten years […]
    [Show full text]
  • An Energy Efficient EdgeAI Autoencoder Accelerator for Reinforcement Learning
    Received 26 July 2020; revised 29 October 2020; accepted 28 November 2020. Date of current version 26 January 2021. Digital Object Identifier 10.1109/OJCAS.2020.3043737 An Energy Efficient EdgeAI Autoencoder Accelerator for Reinforcement Learning NITHEESH KUMAR MANJUNATH1 (Member, IEEE), AIDIN SHIRI1, MORTEZA HOSSEINI1 (Graduate Student Member, IEEE), BHARAT PRAKASH1, NICHOLAS R. WAYTOWICH2, AND TINOOSH MOHSENIN1 1Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore, MD 21250, USA 2U.S. Army Research Laboratory, Aberdeen, MD, USA This article was recommended by Associate Editor Y. Li. CORRESPONDING AUTHOR: N. K. MANJUNATH (e-mail: [email protected]) This work was supported by the U.S. Army Research Laboratory through Cooperative Agreement under Grant W911NF-10-2-0022. ABSTRACT In EdgeAI embedded devices that exploit reinforcement learning (RL), it is essential to reduce the number of actions taken by the agent in the real world and minimize the compute-intensive policies learning process. Convolutional autoencoders (AEs) has demonstrated great improvement for speeding up the policy learning time when attached to the RL agent, by compressing the high dimensional input data into a small latent representation for feeding the RL agent. Despite reducing the policy learning time, AE adds a significant computational and memory complexity to the model which contributes to the increase in the total computation and the model size. In this article, we propose a model for speeding up the policy learning process of RL agent with the use of AE neural networks, which engages binary and ternary precision to address the high complexity overhead without deteriorating the policy that an RL agent learns.
    [Show full text]
  • Sparse Cooperative Q-Learning
    Sparse Cooperative Q-learning. Jelle R. Kok ([email protected]), Nikos Vlassis ([email protected]), Informatics Institute, Faculty of Science, University of Amsterdam, The Netherlands. Abstract: Learning in multiagent systems suffers from the fact that both the state and the action space scale exponentially with the number of agents. In this paper we are interested in using Q-learning to learn the coordinated actions of a group of cooperative agents, using a sparse representation of the joint state-action space of the agents. We first examine a compact representation in which the agents need to explicitly coordinate their actions only in a predefined set of states. Next, we use a coordination-graph approach in which we represent the Q-values by value rules that specify the coordination dependencies of the agents at particular states. […] in uncertain environments. In principle, it is possible to treat a multiagent system as a 'big' single agent and learn the optimal joint policy using standard single-agent reinforcement learning techniques. However, both the state and action space scale exponentially with the number of agents, rendering this approach infeasible for most problems. Alternatively, we can let each agent learn its policy independently of the other agents, but then the transition model depends on the policy of the other learning agents, which may result in oscillatory behavior. On the other hand, in many problems the agents only need to coordinate their actions in few states (e.g., two cleaning robots that want to clean the same room), while in the rest of the states the agents can act independently.
    [Show full text]