Learning Deep State Representations With Convolutional Autoencoders

Gabriel Barth-Maron
Supervised by Stefanie Tellex
Department of Computer Science, Brown University

Abstract

Advances in artificial intelligence algorithms and techniques are quickly allowing us to create artificial agents that interact with the real world. However, these agents need to maintain a carefully constructed abstract representation of the world around them [9]. Recent research in deep reinforcement learning attempts to overcome this challenge. Mnih et al. [24] at DeepMind and Levine et al. [18] demonstrate successful methods of learning deep end-to-end policies from high-dimensional input. In addition, Böhmer et al. [1] and Mattner et al. [22] extract deep state representations that can be used with traditional value function approximation algorithms to learn policies. We present a model that discovers low-dimensional deep state representations in a similar fashion to the deep fitted Q algorithm [1]. A plethora of function approximation techniques can be used in the lower-dimensional space to obtain the Q-function. To test our algorithms, we run several experiments on 80 × 20 images taken from a 10 × 2 grid world and show that convolutional autoencoders can be trained to obtain deep state representations that are almost as good as knowing the ground-truth state.

Figure 1: An 80 × 20 gray scale image of the 10 × 2 grid world state. The agent is at location (3, 0) and the goal is at location (9, 1).

1 Introduction

Reinforcement learning provides an excellent framework for planning and learning in stochastic domains. Since its inception it has been used to accomplish a wide variety of tasks, from robotics [10, 5, 15] to sequential decision-making games [32, 11] and dialogue systems [27, 34].

However, many reinforcement learning algorithms have a run-time that is polynomial in the number of states and actions. To learn in large domains, researchers have had to carefully craft features of their state space so that they are general enough to represent the original problem, but small enough to be computationally tractable. Feature engineering becomes a major hindrance as we create learning agents for more complex state spaces. Additionally, it requires expert knowledge and does not generalize well across different domains. Several areas of research attempt to deal with the challenge of exponentially large state spaces, such as Monte Carlo Tree Search [2], hierarchical planning [4, 30, 7], and value function approximation [29].

Here we take an alternative approach with a focus on planning with sensor input. Visual information is an easily accessible, rich source of information; however, uncovering structured information from it is a difficult and well-studied problem in computer vision. Many vision problems have been solved through the use of carefully crafted features such as scale-invariant feature transforms [20] and histograms of oriented gradients [3]. Recent advances in deep learning have made it possible to automatically extract high-level features from raw visual data, leading to breakthroughs in several areas of computer vision [14, 26, 23].

In our model we use neural networks as an unsupervised technique to learn an abstract feature representation of the raw visual input. Similar to hierarchical techniques, these neural networks allow us to plan in the (significantly simplified) abstract state space. This model is similar to the algorithm designed by DeepMind that plays Atari 2600 games from visual input [24]. However, their algorithm performs end-to-end learning (which directly produces a policy), whereas ours learns a deep state representation that can be used by a variety of reinforcement learning algorithms. In addition, the DeepMind algorithm does not allow for model-based alternatives, as we believe ours does. Böhmer et al. [1] and Mattner et al. [22] have created a deep fitted Q (DFQ) algorithm that is very similar to what we propose; however, our use of convolutional autoencoders takes advantage of image structure and produces better state representations.

We used 80 × 20 pixel gray scale images taken from a 10 × 2 grid world; an example state may be seen in Figure 1. Because the 10 × 2 grid world can be characterized by only two numbers, the agent's x and y coordinates, one of our goals is to attempt to compress these images to a two-dimensional output.

In Section 2 we give a brief overview of reinforcement learning and deep learning. Section 6 reviews state-of-the-art techniques that combine reinforcement learning and deep learning. Then in Section 3 we introduce our models, and show their empirical performance in Sections 4 and 5.

Figure 2: Autoencoder architectures. (a) Autoencoder AE-10 with 10 hidden nodes. (b) Autoencoder AE-20 with 20 hidden nodes. (c) Stacked autoencoder SAE with 2 final hidden nodes.

2 Background

This section should serve as a self-contained introduction to reinforcement learning and deep learning for those who are not already familiar with the fields.

2.1 Reinforcement Learning

Reinforcement learning problems are typically modelled as a Markov Decision Process (MDP). An MDP is a five-tuple ⟨S, A, T, R, γ⟩, where S is a state space; A is the agent's set of actions; T denotes T(s′ | s, a), the transition probability of an agent applying action a ∈ A in state s ∈ S and arriving in s′ ∈ S; R(s, a, s′) denotes the reward received by the agent for applying action a in state s and transitioning to state s′; and γ ∈ [0, 1] is a discount factor that defines how much the agent prefers immediate rewards over future rewards (the agent prefers to maximize immediate rewards as γ decreases). MDPs may also include terminal states that cause all action to cease once reached.

Reinforcement learning involves estimating a value function from experience, simulation, or search [28, 33]. Typically the value function is parametrized by the state space: there exists one unique entry per state. However, in continuous state spaces (or, as we will later see, in large discrete state spaces) it is desirable to find an alternate parametrization of the value function. The most common technique for doing so is linear value function approximation, where the value function is represented as a weighted linear sum of a set of features [12]. These features are also known as basis functions, some common examples being radial basis functions, CMACs, and the Fourier basis.

One particular algorithm for learning a linear value function approximation is Gradient Descent SARSA(λ) [25]. This algorithm combines Q-learning with Temporal Difference learning (TD-learning) to learn the Q-function¹. Lin [19] derives an update equation for a Q-learning algorithm that uses a neural network basis function (it is also applicable to any other basis function) with weights w:

    ∆w_t = η ( r_t + γ max_{a ∈ A} Q_{t+1}(a) − Q_t ) ∂Q_t/∂w_t        (1)

The Gradient Descent SARSA(λ) update scheme is similar, with two notable exceptions. First, in order to update previous states, ∆w_t is multiplied by a weighted sum of previous gradients. Second, the max operator is dropped in favor of using the Q_{t+1} associated with the action that was selected, which allows for a better trade-off between exploration and exploitation: as the algorithm converges it will start behaving as if it were always selecting the action that maximizes the Q-function.

¹ We use Q_t as shorthand for Q(s_t, a_t).
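To make Equation (1) concrete, the following is a minimal sketch (ours, not the paper's code) of the update for a linear Q-function Q(s, a) = w_a · φ(s); the feature map φ, the number of actions, and the step sizes are assumptions chosen for illustration. With a linear basis, the gradient ∂Q_t/∂w_t is simply φ(s_t).

```python
import numpy as np

def q_update(w, phi, s, a, r, s_next, eta=0.1, gamma=0.95):
    """One application of Equation (1) for a linear Q-function
    Q(s, a) = w[a] . phi(s), so dQ_t/dw_t reduces to phi(s_t).
    Illustrative sketch only, not the authors' implementation."""
    features = phi(s)                                          # basis features for s_t
    q_sa = w[a] @ features                                     # Q_t = Q(s_t, a_t)
    q_next = max(w[b] @ phi(s_next) for b in range(len(w)))    # max_{a in A} Q_{t+1}(a)
    td_error = r + gamma * q_next - q_sa                       # bracketed term in Equation (1)
    w[a] = w[a] + eta * td_error * features                    # Delta w_t applied to w
    return w

# Hypothetical usage: one-hot features over the 20 grid cells, 4 actions.
phi = lambda s: np.eye(20)[s]
w = np.zeros((4, 20))
w = q_update(w, phi, s=3, a=1, r=-1.0, s_next=4)
```

Gradient Descent SARSA(λ) would differ exactly as described above: the max over actions is replaced by the Q-value of the action actually taken, and the update is weighted by an eligibility trace over previous gradients.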
2.2 Deep Learning

An autoencoder is a fully-connected neural network that attempts to learn the identity function. Additionally, the network contains a single hidden layer with significantly fewer nodes than the input layer. During training the autoencoder therefore attempts to find a good compression of the input data. In addition, autoencoders can be stacked (the output of one autoencoder's hidden layer becomes the input of another) to form deep architectures. Autoencoders and stacked autoencoders have been shown to be very useful in performing unsupervised dimensionality reduction [8].

Convolutional neural networks (CNNs) use convolution to take advantage of the locality of image features. In addition, since these networks share the kernel's weights across each layer, they are much sparser than their fully-connected counterparts. CNNs have been used to achieve state-of-the-art performance in image classification [14], face verification [31], and object detection [17].
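As an illustration of the bottleneck idea described above, a fully-connected autoencoder for the 80 × 20 images in Figure 1 could be written as follows. This is a generic sketch in PyTorch under our own assumptions (layer sizes, activations, optimizer, and training loop are illustrative), not the architectures reported in Section 3.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Fully-connected autoencoder: 1600 input pixels -> small hidden code -> 1600-pixel reconstruction."""
    def __init__(self, n_hidden=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(80 * 20, n_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, 80 * 20), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)              # low-dimensional state representation
        return self.decoder(code), code

# Train the network to reproduce its input (the identity function).
model = Autoencoder(n_hidden=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
images = torch.rand(20, 80 * 20)            # placeholder batch; the paper uses grid-world renderings

for _ in range(500):
    reconstruction, code = model(images)
    loss = loss_fn(reconstruction, images)  # reconstruction error drives the compression
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Stacking amounts to training a second autoencoder whose input is `code`, and a convolutional variant would replace the `nn.Linear` encoder with `nn.Conv2d` layers to exploit image locality.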
3 Architectures

We used autoencoders to learn abstract features for images similar to the one in Figure 1 in an unsupervised manner. To train these networks we used backpropagation on an image set that captures the entirety of the state space. We combined different numbers of layers and hidden nodes, and have reported the results for some of the final models in Section 5. We also used convolutional autoencoders (CAEs) to take advantage of the structure and locality that is found in naturally occurring images.

The output of the middle layer of the (convolutional) autoencoders was used as a basis function, which served as the features for linear value function approximation. [...] a more difficult optimization problem for the value function approximation algorithm.

4 Experiments

All of our experiments used 80 × 20 images taken from a 10 × 2 grid world, as seen in Figure 1. The autoencoders AE-10, AE-20, and SAE, along with the convolutional autoencoders CAE and SCAE-AGENT, were trained on all 20 possible images while the goal was at location (9, 1). The convolutional autoencoders SCAE-8 and SCAE-4 were both trained on all 400 possible images, obtained by moving both the agent and the goal. The larger training data set was used to make the kernels goal-location invariant.

The middle layer of each of these neural networks was then used as a feature basis for Gradient Descent SARSA(λ).
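The pipeline in Sections 3 and 4, reconstructed as a sketch with hypothetical helper names (not the authors' code): encode each rendered state with the trained autoencoder and hand the middle-layer activations to the linear learner as its basis features.

```python
import torch

def deep_state_features(model, image):
    """Use the trained autoencoder's middle layer as the basis function phi(s).
    `model` is an autoencoder like the earlier sketch; `image` is an 80 x 20
    gray scale rendering of the state. Illustrative reconstruction only."""
    with torch.no_grad():
        x = torch.as_tensor(image, dtype=torch.float32).reshape(1, -1)
        _, code = model(x)                  # middle-layer activations
    return code.squeeze(0).numpy()          # feature vector for the linear learner
```

In the earlier update sketch, this function would simply replace the hand-coded feature map phi, so that Gradient Descent SARSA(λ) operates on the learned deep state representation rather than the ground-truth coordinates.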