Learning Deep State Representations with Convolutional Autoencoders Gabriel Barth-Maron Supervised by Stefanie Tellex

Total Pages: 16

File Type: PDF, Size: 1020 KB

Learning Deep State Representations With Convolutional Autoencoders

Gabriel Barth-Maron
Supervised by Stefanie Tellex
Department of Computer Science, Brown University

Abstract

Advances in artificial intelligence algorithms and techniques are quickly allowing us to create artificial agents that interact with the real world. However, these agents need to maintain a carefully constructed abstract representation of the world around them [9]. Recent research in deep reinforcement learning attempts to overcome this challenge. Mnih et al. [24] at DeepMind and Levine et al. [18] demonstrate successful methods of learning deep end-to-end policies from high-dimensional input. In addition, Böhmer et al. [1] and Mattner et al. [22] extract deep state representations that can be used with traditional value function approximation algorithms to learn policies. We present a model that discovers low-dimensional deep state representations in a similar fashion to the deep fitted Q algorithm [1]. A plethora of function approximation techniques can be used in the lower dimension space to obtain the Q-function. To test our algorithms, we run several experiments on 80 × 20 images taken from a 10 × 2 grid world and show that convolutional autoencoders can be trained to obtain deep state representations that are almost as good as knowing the ground-truth state.

Figure 1: An 80 × 20 gray scale image of the 10 × 2 state. The agent is at location (3, 0) and the goal is at location (9, 1).

1 Introduction

Reinforcement learning provides an excellent framework for planning and learning in non-stochastic domains. Since inception it has been used to accomplish a wide variety of tasks, from robotics [10, 5, 15] to sequential decision-making games [32, 11], and dialogue systems [27, 34].

However, many reinforcement learning algorithms have a run-time that is polynomial in the number of states and actions. To learn in large domains, researchers have had to carefully craft features of their state space so that they are general enough to represent the original problem, but small enough to be computationally tractable.

Feature engineering becomes a major hindrance as we create learning agents for more complex state spaces. Additionally, it requires expert knowledge and does not generalize well across different domains. Several areas of research attempt to deal with the challenge of exponentially large state spaces, such as Monte Carlo Tree Search [2], hierarchical planning [4, 30, 7], and value function approximation [29].

Here we take an alternative approach with a focus on planning with sensor input. Visual information is an easily accessible, rich source of information; however, uncovering structured information is a difficult and well studied problem in computer vision. Many vision problems have been solved through the use of carefully crafted features such as scale invariant feature transformations [20] and histogram of gradients [3]. Recent advances in deep learning have made it possible to automatically extract high-level features from raw visual data, leading to breakthroughs in several areas of computer vision [14, 26, 23].

In our model we use neural networks as an unsupervised technique to learn an abstract feature representation of the raw visual input. Similar to hierarchical techniques, these neural networks allow us to plan in the (significantly simplified) abstract state space. This model is similar to the algorithm designed by DeepMind that plays Atari 2600 games from visual input [24]. However, their algorithm performs end-to-end learning (which directly produces a policy), whereas ours learns a deep state representation that can be used by a variety of reinforcement learning algorithms. In addition, the DeepMind algorithm does not allow for model-based alternatives, as we believe ours does. Böhmer et al. [1], Mattner et al. [22] have created a deep fitted Q (DFQ) algorithm that is very similar to what we propose; however, our use of convolutional autoencoders takes advantage of image structure and produces better state representations.

We used 80 × 20 pixel gray scale images taken from a 10 × 2 grid world; an example state may be seen in Figure 1. Because the 10 × 2 grid world can be characterized by only two numbers – the agent's x and y coordinates – one of our goals is to attempt to compress these images to a two dimensional output.

In Section 2 we give a brief overview of reinforcement learning and deep learning. Section 6 reviews state of the art techniques that combine reinforcement learning and deep learning. Then in Section 3 we introduce our models, and show their empirical performance in Sections 4 and 5.
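For concreteness, the following is a small sketch (not code from the paper) of how a 10 × 2 grid-world state such as the one in Figure 1 could be rendered as an 80 × 20 gray scale image, with each cell drawn as an 8 × 10 pixel block; the specific intensity values used for the agent and goal are assumptions made for illustration.

```python
import numpy as np

def render_state(agent_xy, goal_xy, width=10, height=2, cell_w=8, cell_h=10):
    """Render a 10x2 grid-world state as an 80x20 gray scale image
    (stored as a 20x80 array, height x width).  Intensities are
    illustrative: background 0.0, goal 0.5, agent 1.0."""
    img = np.zeros((height * cell_h, width * cell_w), dtype=np.float32)
    for (x, y), value in ((goal_xy, 0.5), (agent_xy, 1.0)):
        img[y * cell_h:(y + 1) * cell_h, x * cell_w:(x + 1) * cell_w] = value
    return img

# The state shown in Figure 1: agent at (3, 0), goal at (9, 1).
frame = render_state(agent_xy=(3, 0), goal_xy=(9, 1))   # shape (20, 80)
```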
2 Background

This section should serve as a self-contained introduction to reinforcement learning and deep learning for those who are not already familiar with the fields.

2.1 Reinforcement Learning

Reinforcement learning problems are typically modelled as a Markov Decision Process (MDP). A MDP is a five-tuple ⟨S, A, T, R, γ⟩, where S is a state space; A is the agent's set of actions; T denotes T(s′ | s, a), the transition probability of an agent applying action a ∈ A in state s ∈ S and arriving in s′ ∈ S; R(s, a, s′) denotes the reward received by the agent for applying action a in state s and transitioning to state s′; and γ ∈ [0, 1] is a discount factor that defines how much the agent prefers immediate rewards over future rewards (the agent prefers to maximize immediate rewards as γ decreases). MDPs may also include terminal states that cause all action to cease once reached.

Reinforcement learning involves estimating a value function from experience, simulation, or search [28, 33]. Typically the value function is parametrized by the state space – there exists one unique entry per state. However, in continuous state spaces (or, as we will later see, in large discrete state spaces) it is desirable to find an alternate parametrization of the value function. The most common technique for doing so is linear value function approximation, where the value function is represented as a weighted linear sum of a set of features [12]. These features are also known as basis functions, some common examples being Radial Basis Functions, CMACs, and the Fourier Basis Function.

One particular algorithm for learning a linear value function approximation is Gradient Descent SARSA(λ) [25]. This algorithm combines Q-learning with Temporal Difference learning (TD-learning) to learn the Q-function (we use Q_t as shorthand for Q(s_t, a_t)). Lin [19] derives an update equation for a Q-learning algorithm that uses a neural network basis function (it is also applicable to any other basis function) with weights w:

∆w_t = η ( r_t + γ max_{a ∈ A} Q_{t+1}(a) − Q_t ) ∂Q_t/∂w_t    (1)

The Gradient Descent SARSA(λ) update scheme is similar, with two notable exceptions. First, in order to update previous states, ∆w_t is multiplied by a weighted sum of previous gradients. Second, the max operator is dropped in favor of using the Q_{t+1} associated with the action that was selected, which allows for a better trade-off between exploration and exploitation – as the algorithm converges it will start behaving as if it were always selecting the action that maximizes the Q-function.
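To make the update above concrete, here is a minimal sketch of Gradient Descent SARSA(λ) with a linear Q-function over a feature vector φ(s). For a linear approximator the gradient ∂Q_t/∂w_t is simply φ(s_t) for the chosen action, the max in Equation (1) is replaced by the Q-value of the action actually selected, and an eligibility trace holds the weighted sum of previous gradients. The environment interface (reset() returning a state, step(a) returning (next_state, reward, done)) and the feature map `phi` are assumptions for illustration, not code from the paper.

```python
import numpy as np

def sarsa_lambda(env, phi, n_actions, n_features,
                 episodes=200, eta=0.1, gamma=0.95, lam=0.9, eps=0.1):
    """Gradient Descent SARSA(lambda) with a linear Q-function
    Q(s, a) = w[a] . phi(s).  `phi` maps a state to a feature vector
    (for example, an autoencoder's bottleneck activations)."""
    w = np.zeros((n_actions, n_features))

    def q(feat):                         # Q-values of all actions for these features
        return w @ feat

    def epsilon_greedy(feat):
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax(q(feat)))

    for _ in range(episodes):
        s = env.reset()
        f = phi(s)
        a = epsilon_greedy(f)
        z = np.zeros_like(w)             # eligibility traces: decayed past gradients
        done = False
        while not done:
            s_next, r, done = env.step(a)
            f_next = phi(s_next)
            a_next = epsilon_greedy(f_next)
            # SARSA uses the Q-value of the action actually selected,
            # not max_a Q as in the Q-learning update of Equation (1).
            target = r if done else r + gamma * q(f_next)[a_next]
            delta = target - q(f)[a]
            z *= gamma * lam             # decay all traces
            z[a] += f                    # gradient of Q(s, a) w.r.t. w[a] is phi(s)
            w += eta * delta * z         # update all previously visited state-actions
            f, a = f_next, a_next
    return w
```

Setting λ = 0 recovers one-step SARSA; the learned autoencoder features described later would be passed in as `phi`.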
2.2 Deep Learning

An autoencoder is a fully-connected neural network that attempts to learn the identity function. Additionally, the network contains a single hidden layer that has a number of nodes significantly less than the input. During training the autoencoder attempts to find a good compression of the input data. In addition, autoencoders can be stacked – the output of one autoencoder's hidden layer serves as the input of another – to form deep architectures. Autoencoders and stacked autoencoders have been shown to be very useful in performing unsupervised dimensionality reduction [8].

Convolutional neural networks (CNNs) use convolution to take advantage of the locality of image features. In addition, since these networks share the kernel's weights for each layer, they are much sparser than their fully-connected counterparts. CNNs have been used to achieve state of the art performance in image classification [14], face verification [31], and object detection [17].

3 Architectures

Figure 2: Autoencoder architectures. (a) Autoencoder AE-10 with 10 hidden nodes. (b) Autoencoder AE-20 with 20 hidden nodes. (c) Stacked autoencoder SAE with 2 final hidden nodes.

We used autoencoders to learn abstract features for images similar to the one in Figure 1 in an unsupervised manner. To train these networks we used backpropagation on an image set that captures the entirety of the state space. We combined different numbers of layers and hidden nodes, and have reported the results for some of the final models in Section 5. We also used convolutional autoencoders (CAEs) to take advantage of the structure and locality that is found in naturally occurring images.

The output of the middle layer of the (convolutional) autoencoders was used as a basis function, which served as the features for linear value function approximation. […] a more difficult optimization problem for the value function approximation algorithm.

4 Experiments

All of our experiments used 80 × 20 images taken from a 10 × 2 grid world as seen in Figure 1. The autoencoders AE-10, AE-20, and SAE, along with the convolutional autoencoders CAE and SCAE-AGENT, were trained on all 20 possible images while the goal was at location (9, 1). The convolutional autoencoders SCAE-8 and SCAE-4 were both trained on all 400 possible images by moving both the agent and goal. The larger training data set was used to make the kernels goal-location invariant. The middle layer of each of these neural networks was then used as a feature basis for Gradient Descent SARSA(λ).
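As an illustration of the architectures described in Sections 3 and 4, here is a minimal PyTorch sketch of a convolutional autoencoder for the 80 × 20 gray scale frames (stored as 20 × 80 arrays, height × width) with a two-dimensional bottleneck whose activations could serve as the state features for linear value function approximation. The layer sizes, training loop, and stand-in data are assumptions; this is not the authors' exact CAE.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Convolutional autoencoder for 1x20x80 grid-world frames with a
    2-dimensional bottleneck.  Layer sizes are illustrative only."""
    def __init__(self, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),   # -> 8 x 10 x 40
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # -> 16 x 5 x 20
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 5 * 20, latent_dim),                    # bottleneck features
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16 * 5 * 20),
            nn.ReLU(),
            nn.Unflatten(1, (16, 5, 20)),
            nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1),  # -> 8 x 10 x 40
            nn.ReLU(),
            nn.ConvTranspose2d(8, 1, kernel_size=4, stride=2, padding=1),   # -> 1 x 20 x 80
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Training on the full set of grid-world images (random stand-in data here).
images = torch.rand(20, 1, 20, 80)          # all 20 states of the 10x2 grid world
model = ConvAutoencoder(latent_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(500):
    recon, _ = model(images)
    loss = nn.functional.mse_loss(recon, images)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The 2-D bottleneck activations are the learned state representation.
with torch.no_grad():
    features = model.encoder(images)        # shape (20, 2)
```

The `features` tensor plays the role of φ(s) in the SARSA(λ) sketch shown earlier.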
Recommended publications
  • Training Autoencoders by Alternating Minimization
    Under review as a conference paper at ICLR 2018. TRAINING AUTOENCODERS BY ALTERNATING MINIMIZATION. Anonymous authors, paper under double-blind review. ABSTRACT: We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimization problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising and competitive to traditional backpropagation techniques, both in terms of quality of solution, as well as training speed. 1 INTRODUCTION: For much of the recent march of deep learning, gradient-based backpropagation methods, e.g. Stochastic Gradient Descent (SGD) and its variants, have been the mainstay of practitioners. The use of these methods, especially on vast amounts of data, has led to unprecedented progress in several areas of artificial intelligence. On one hand, the intense focus on these techniques has led to an intimate understanding of hardware requirements and code optimizations needed to execute these routines on large datasets in a scalable manner. Today, myriad off-the-shelf and highly optimized packages exist that can churn reasonably large datasets on GPU architectures with relatively mild human involvement and little bootstrap effort.
    [Show full text]
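A minimal sketch of the alternating-minimization idea described in the entry above: the decoder is fit while the encoder is frozen, then the encoder is fit through the fixed decoder, and the two phases alternate. Plain gradient steps stand in for DANTE's quasi-convex solvers, and the layer sizes and data are placeholders.

```python
import torch
import torch.nn as nn

# Toy single-hidden-layer autoencoder split into its two halves.
encoder = nn.Sequential(nn.Linear(784, 64), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
enc_opt = torch.optim.SGD(encoder.parameters(), lr=0.1)
dec_opt = torch.optim.SGD(decoder.parameters(), lr=0.1)

data = torch.rand(256, 784)                 # stand-in training batch
loss_fn = nn.MSELoss()

for outer in range(50):
    # Phase 1: freeze the encoder, fit the decoder to the current codes.
    with torch.no_grad():
        codes = encoder(data)
    for _ in range(10):
        dec_opt.zero_grad()
        loss = loss_fn(decoder(codes), data)
        loss.backward()
        dec_opt.step()

    # Phase 2: freeze the decoder, fit the encoder through the fixed decoder.
    for p in decoder.parameters():
        p.requires_grad_(False)
    for _ in range(10):
        enc_opt.zero_grad()
        loss = loss_fn(decoder(encoder(data)), data)
        loss.backward()
        enc_opt.step()
    for p in decoder.parameters():
        p.requires_grad_(True)
```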
  • Turbo Autoencoder: Deep Learning Based Channel Codes for Point-To-Point Communication Channels
    Turbo Autoencoder: Deep learning based channel codes for point-to-point communication channels. Hyeji Kim (Samsung AI Center Cambridge, Cambridge, United Kingdom, [email protected]); Himanshu Asnani (School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, India, [email protected]); Yihan Jiang (ECE Department, University of Washington, Seattle, United States, [email protected]); Sewoong Oh (Allen School of Computer Science & Engineering, University of Washington, Seattle, United States, [email protected]); Pramod Viswanath (ECE Department, University of Illinois at Urbana Champaign, Illinois, United States, [email protected]); Sreeram Kannan (ECE Department, University of Washington, Seattle, United States, [email protected]). Abstract: Designing codes that combat the noise in a communication medium has remained a significant area of research in information theory as well as wireless communications. Asymptotically optimal channel codes have been developed by mathematicians for communicating under canonical models after over 60 years of research. On the other hand, in many non-canonical channel settings, optimal codes do not exist and the codes designed for canonical models are adapted via heuristics to these channels and are thus not guaranteed to be optimal. In this work, we make significant progress on this problem by designing a fully end-to-end jointly trained neural encoder and decoder, namely, Turbo Autoencoder (TurboAE), with the following contributions: (a) under moderate block lengths, TurboAE approaches state-of-the-art performance under canonical channels; (b) moreover, TurboAE outperforms the state-of-the-art codes under non-canonical settings in terms of reliability. TurboAE shows that the development of channel coding design can be automated via deep learning, with near-optimal performance.
    [Show full text]
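A minimal sketch of the general channel-autoencoder idea behind the entry above: a neural encoder maps message bits to channel symbols, an additive white Gaussian noise channel corrupts them, and a neural decoder is trained jointly with the encoder to recover the bits. The block length, rate, noise level, and layer sizes are illustrative, and this is not TurboAE's interleaved convolutional architecture.

```python
import torch
import torch.nn as nn

K, N = 4, 8                                    # K message bits -> N channel uses
enc = nn.Sequential(nn.Linear(K, 64), nn.ReLU(), nn.Linear(64, N))
dec = nn.Sequential(nn.Linear(N, 64), nn.ReLU(), nn.Linear(64, K))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
sigma = 0.5                                    # assumed AWGN noise level

for step in range(2000):
    bits = torch.randint(0, 2, (256, K)).float()      # random message bits
    x = enc(bits)
    x = x / x.pow(2).mean().sqrt()                    # average power normalization
    y = x + sigma * torch.randn_like(x)               # AWGN channel
    loss = bce(dec(y), bits)                          # recover the transmitted bits
    opt.zero_grad()
    loss.backward()
    opt.step()

# Bit error rate on fresh messages.
with torch.no_grad():
    bits = torch.randint(0, 2, (10000, K)).float()
    x = enc(bits); x = x / x.pow(2).mean().sqrt()
    y = x + sigma * torch.randn_like(x)
    ber = ((dec(y) > 0).float() != bits).float().mean()
```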
  • Double Backpropagation for Training Autoencoders Against Adversarial Attack
    Double Backpropagation for Training Autoencoders against Adversarial Attack. Chengjin Sun, Sizhe Chen, and Xiaolin Huang, Senior Member, IEEE. Abstract—Deep learning, as widely known, is vulnerable to adversarial samples. This paper focuses on the adversarial attack on autoencoders. Safety of the autoencoders (AEs) is important because they are widely used as a compression scheme for data storage and transmission; however, the current autoencoders are easily attacked, i.e., one can slightly modify an input but has totally different codes. The vulnerability is rooted in the sensitivity of the autoencoders, and to enhance the robustness, we propose to adopt double backpropagation (DBP) to secure autoencoders such as VAE and DRAW. We restrict the gradient from the reconstruction image to the original one so that the autoencoder is not sensitive to trivial perturbation produced by the adversarial attack. After smoothing the gradient by DBP, we further smooth the label by Gaussian Mixture Model (GMM), aiming for accurate and robust classification. We demonstrate in MNIST, CelebA, SVHN that our method leads to a robust autoencoder resistant to attack and a robust classifier able for image transition and immune to adversarial attack if combined with GMM. Index Terms—double backpropagation, autoencoder, network robustness, GMM. 1 INTRODUCTION: In the past few years, deep neural networks have been greatly developed and successfully used in a vast of fields, such as pattern recognition, intelligent robots, automatic control, medicine [1]. Despite the great success, researchers have found the vulnerability of deep neural networks to […] feature [9], [10], [11], [12], [13], or network structure [3], [14], [15]. Adversarial attack and its defense are revolving around a small ∆x and a big resulting difference between f(x + ∆x) and f(x).
    [Show full text]
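A minimal sketch of the double-backpropagation idea from the entry above: the gradient of the reconstruction loss with respect to the input is penalized so that small perturbations of the input cannot produce large changes in the reconstruction. The plain autoencoder, penalty weight, and stand-in data are assumptions; the paper applies the idea to VAE and DRAW and adds a GMM classifier, which are omitted here.

```python
import torch
import torch.nn as nn

autoencoder = nn.Sequential(                # placeholder AE; the paper secures VAE/DRAW
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
lam = 0.1                                   # assumed weight of the gradient penalty

for step in range(1000):
    x0 = torch.rand(64, 784)                # stand-in batch (clean input = target)
    x = x0.clone().requires_grad_(True)
    recon_loss = nn.functional.mse_loss(autoencoder(x), x0)

    # "Double backpropagation": penalize d(recon_loss)/d(input) so the
    # reconstruction is insensitive to small perturbations of x.
    (grad_x,) = torch.autograd.grad(recon_loss, x, create_graph=True)
    penalty = grad_x.pow(2).sum(dim=1).mean()

    loss = recon_loss + lam * penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
```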
  • Survey on Reinforcement Learning for Language Processing
    Survey on reinforcement learning for language processing. Víctor Uc-Cetina (1), Nicolás Navarro-Guerrero (2), Anabel Martin-Gonzalez (1), Cornelius Weber (3), Stefan Wermter (3). (1) Universidad Autónoma de Yucatán – {uccetina, [email protected]; (2) Aarhus University – [email protected]; (3) Universität Hamburg – {weber, [email protected]. February 2021 (arXiv:2104.05565v1 [cs.CL], 12 Apr 2021). Abstract: In recent years some researchers have explored the use of reinforcement learning (RL) algorithms as key components in the solution of various natural language processing tasks. For instance, some of these algorithms leveraging deep neural learning have found their way into conversational systems. This paper reviews the state of the art of RL methods for their possible use for different problems of natural language processing, focusing primarily on conversational systems, mainly due to their growing relevance. We provide detailed descriptions of the problems as well as discussions of why RL is well-suited to solve them. Also, we analyze the advantages and limitations of these methods. Finally, we elaborate on promising research directions in natural language processing that might benefit from reinforcement learning. Keywords: reinforcement learning, natural language processing, conversational systems, parsing, translation, text generation. 1 Introduction: Machine learning algorithms have been very successful to solve problems in the natural language processing (NLP) domain for many years, especially supervised and unsupervised methods. However, this is not the case with reinforcement learning (RL), which is somewhat surprising since in other domains, reinforcement learning methods have experienced an increased level of success with some impressive results, for instance in board games such as AlphaGo Zero [106].
    [Show full text]
  • Artificial Intelligence Applied to Electromechanical Monitoring, A Performance Analysis
    ARTIFICIAL INTELLIGENCE APPLIED TO ELECTROMECHANICAL MONITORING, A PERFORMANCE ANALYSIS. Erasmus project. Authors: Staš Osterc. Mentor: Dr. Miguel Delgado Prieto, Dr. Francisco Arellano Espitia. 1/5/2020. ANNEX VI – DECLARATION OF HONOUR: I declare that the work in this Master Thesis / Degree Thesis (choose one) is completely my own work, no part of this Master Thesis / Degree Thesis (choose one) is taken from other people's work without giving them credit, all references have been clearly cited, and I'm authorised to make use of the company's / research group's (choose one) related information I'm providing in this document (select when it applies). I understand that an infringement of this declaration leaves me subject to the foreseen disciplinary actions by the Universitat Politècnica de Catalunya – BarcelonaTECH. Student Name, Signature, Date. Title of the Thesis: ________. Contents: Introduction; Abstract; Aim.
    [Show full text]
  • Unsupervised Speech Representation Learning Using WaveNet Autoencoders
    Unsupervised speech representation learning using WaveNet autoencoders. Jan Chorowski, Ron J. Weiss, Samy Bengio, Aäron van den Oord. Abstract—We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the learned representation is tuned to contain only phonetic content, we resort to using a high capacity WaveNet decoder to infer information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately reconstruct […] speaker gender and identity, from phonetic content, properties which are consistent with internal representations learned by speech recognizers [13], [14]. Such representations are desired in several tasks, such as low resource automatic speech recognition (ASR), where only a small amount of labeled training data is available. In such scenario, limited amounts of data may be sufficient to learn an acoustic model on the representation discovered without supervision, but insufficient to learn the acoustic model and a data representation in a fully supervised manner [15], [16]. We focus on representations learned with autoencoders applied to raw waveforms and spectrogram features and investigate the quality of learned representations on LibriSpeech [17].
    [Show full text]
  • A Deep Reinforcement Learning Neural Network Folding Proteins
    DeepFoldit - A Deep Reinforcement Learning Neural Network Folding Proteins Dimitra Panou1, Martin Reczko2 1University of Athens, Department of Informatics and Telecommunications 2Biomedical Sciences Research Center “Alexander Fleming” ABSTRACT Despite considerable progress, ab initio protein structure prediction remains suboptimal. A crowdsourcing approach is the online puzzle video game Foldit [1], that provided several useful results that matched or even outperformed algorithmically computed solutions [2]. Using Foldit, the WeFold [3] crowd had several successful participations in the Critical Assessment of Techniques for Protein Structure Prediction. Based on the recent Foldit standalone version [4], we trained a deep reinforcement neural network called DeepFoldit to improve the score assigned to an unfolded protein, using the Q-learning method [5] with experience replay. This paper is focused on model improvement through hyperparameter tuning. We examined various implementations by examining different model architectures and changing hyperparameter values to improve the accuracy of the model. The new model’s hyper-parameters also improved its ability to generalize. Initial results, from the latest implementation, show that given a set of small unfolded training proteins, DeepFoldit learns action sequences that improve the score both on the training set and on novel test proteins. Our approach combines the intuitive user interface of Foldit with the efficiency of deep reinforcement learning. KEYWORDS: ab initio protein structure prediction, Reinforcement Learning, Deep Learning, Convolution Neural Networks, Q-learning 1. ALGORITHMIC BACKGROUND Machine learning (ML) is the study of algorithms and statistical models used by computer systems to accomplish a given task without using explicit guidelines, relying on inferences derived from patterns. ML is a field of artificial intelligence.
    [Show full text]
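A minimal sketch of Q-learning with experience replay as referenced in the entry above, in a generic tabular form rather than DeepFoldit's deep network or the Foldit environment; the environment interface (reset() returning a state, step(a) returning (next_state, reward, done)) is an assumption.

```python
import random
from collections import deque, defaultdict

class ReplayBuffer:
    """Fixed-size buffer of past transitions, sampled uniformly for updates."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)
    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))
    def sample(self, batch_size):
        return random.sample(self.buf, min(batch_size, len(self.buf)))

def q_learning_with_replay(env, actions, episodes=500,
                           alpha=0.1, gamma=0.99, eps=0.1, batch_size=32):
    """Tabular Q-learning whose updates are drawn from a replay buffer,
    the same experience-replay idea used (with a deep network) above."""
    Q = defaultdict(float)                      # (state, action) -> value
    buffer = ReplayBuffer()
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            buffer.push(s, a, r, s_next, done)
            s = s_next
            # Learn from a random minibatch of stored transitions.
            for (bs, ba, br, bs2, bdone) in buffer.sample(batch_size):
                target = br if bdone else br + gamma * max(Q[(bs2, a_)] for a_ in actions)
                Q[(bs, ba)] += alpha * (target - Q[(bs, ba)])
    return Q
```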
  • An Introduction to Deep Reinforcement Learning
    An Introduction to Deep Reinforcement Learning. Ehsan Abbasnejad. Remember: supervised learning — we have a set of sample observations with labels and learn to predict the labels given a new sample (e.g., learn the function that associates a picture of a dog/cat with the label "dog" or "cat"). We need thousands of samples, and samples have to be provided by experts. There are applications where: we can't provide expert samples; expert examples are not what we mimic; there is an interaction with the world. Deep reinforcement learning (e.g., AlphaGo). Scenario of reinforcement learning: the agent observes the state of the environment, takes an action that changes the environment, and receives a reward; the agent learns to take actions maximizing expected reward. Machine learning, actor/policy ≈ looking for a function: Action = π(Observation), with the observation as the function's input, the action as its output, and the reward used to pick the best function. Reinforcement learning in a nutshell: RL is a general-purpose framework for decision-making — RL is for an agent with the capacity to act; each action influences the agent's future state; success is measured by a scalar reward signal. Goal: select actions to maximise future reward. Deep learning in a nutshell: DL is a general-purpose framework for representation learning — given an objective, it learns the representation that is required to achieve the objective, directly from raw inputs.
    [Show full text]
  • Modelling Working Memory Using Deep Recurrent Reinforcement Learning
    Modelling Working Memory using Deep Recurrent Reinforcement Learning. Pravish Sainath (1,2,3), Pierre Bellec (1,3), Guillaume Lajoie (2,3). (1) Centre de Recherche, Institut Universitaire de Gériatrie de Montréal; (2) Mila – Quebec AI Institute; (3) Université de Montréal. Abstract: In cognitive systems, the role of a working memory is crucial for visual reasoning and decision making. Tremendous progress has been made in understanding the mechanisms of the human/animal working memory, as well as in formulating different frameworks of artificial neural networks. In the case of humans, the visual working memory (VWM) task [1] is a standard one in which the subjects are presented with a sequence of images, each of which needs to be identified as to whether it was already seen or not. Our work is a study of multiple ways to learn a working memory model using recurrent neural networks that learn to remember input images across timesteps in supervised and reinforcement learning settings. The reinforcement learning setting is inspired by the popular view in Neuroscience that the working memory in the prefrontal cortex is modulated by a dopaminergic mechanism. We consider the VWM task as an environment that rewards the agent when it remembers past information and penalizes it for forgetting. We quantitatively estimate the performance of these models on sequences of images from a standard image dataset (CIFAR-100 [2]) and their ability to remember and recall. Based on our analysis, we establish that a gated recurrent neural network model with long short-term memory units trained using reinforcement learning is powerful and more efficient in temporally consolidating the input spatial information.
    [Show full text]
  • Reinforcement Learning: a Survey
    Journal of Artificial Intelligence Research. Reinforcement Learning: A Survey. Leslie Pack Kaelbling (lpk@cs.brown.edu), Michael L. Littman (mlittman@cs.brown.edu), Computer Science Department, Brown University, Providence, RI, USA; Andrew W. Moore (awm@cs.cmu.edu), Smith Hall, Carnegie Mellon University, Forbes Avenue, Pittsburgh, PA, USA. Abstract: This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement". The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning. 1. Introduction: Reinforcement learning dates back to the early days of cybernetics and work in statistics, psychology, neuroscience, and computer science. In the last five to ten years […]
    [Show full text]
  • An Energy Efficient EdgeAI Autoencoder Accelerator for Reinforcement Learning
    Received 26 July 2020; revised 29 October 2020; accepted 28 November 2020. Date of current version 26 January 2021. Digital Object Identifier 10.1109/OJCAS.2020.3043737 An Energy Efficient EdgeAI Autoencoder Accelerator for Reinforcement Learning NITHEESH KUMAR MANJUNATH1 (Member, IEEE), AIDIN SHIRI1, MORTEZA HOSSEINI1 (Graduate Student Member, IEEE), BHARAT PRAKASH1, NICHOLAS R. WAYTOWICH2, AND TINOOSH MOHSENIN1 1Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore, MD 21250, USA 2U.S. Army Research Laboratory, Aberdeen, MD, USA This article was recommended by Associate Editor Y. Li. CORRESPONDING AUTHOR: N. K. MANJUNATH (e-mail: [email protected]) This work was supported by the U.S. Army Research Laboratory through Cooperative Agreement under Grant W911NF-10-2-0022. ABSTRACT In EdgeAI embedded devices that exploit reinforcement learning (RL), it is essential to reduce the number of actions taken by the agent in the real world and minimize the compute-intensive policies learning process. Convolutional autoencoders (AEs) has demonstrated great improvement for speeding up the policy learning time when attached to the RL agent, by compressing the high dimensional input data into a small latent representation for feeding the RL agent. Despite reducing the policy learning time, AE adds a significant computational and memory complexity to the model which contributes to the increase in the total computation and the model size. In this article, we propose a model for speeding up the policy learning process of RL agent with the use of AE neural networks, which engages binary and ternary precision to address the high complexity overhead without deteriorating the policy that an RL agent learns.
    [Show full text]
  • Sparse Cooperative Q-Learning
    Sparse Cooperative Q-learning. Jelle R. Kok ([email protected]), Nikos Vlassis ([email protected]), Informatics Institute, Faculty of Science, University of Amsterdam, The Netherlands. Abstract: Learning in multiagent systems suffers from the fact that both the state and the action space scale exponentially with the number of agents. In this paper we are interested in using Q-learning to learn the coordinated actions of a group of cooperative agents, using a sparse representation of the joint state-action space of the agents. We first examine a compact representation in which the agents need to explicitly coordinate their actions only in a predefined set of states. Next, we use a coordination-graph approach in which we represent the Q-values by value rules that specify the coordination dependencies of the agents at particular states. […] in uncertain environments. In principle, it is possible to treat a multiagent system as a 'big' single agent and learn the optimal joint policy using standard single-agent reinforcement learning techniques. However, both the state and action space scale exponentially with the number of agents, rendering this approach infeasible for most problems. Alternatively, we can let each agent learn its policy independently of the other agents, but then the transition model depends on the policy of the other learning agents, which may result in oscillatory behavior. On the other hand, in many problems the agents only need to coordinate their actions in few states (e.g., two cleaning robots that want to clean the same room), while in the rest of the states the agents can act independently.
    [Show full text]