Deep Auto-Encoder Neural Networks in Reinforcement Learning

Deep Auto-Encoder Neural Networks in Reinforcement Learning Sascha Lange and Martin Riedmiller Abstract— This paper discusses the effectiveness of deep auto- New methods for unsupervised training of very deep encoder neural networks in visual reinforcement learning (RL) architectures with up to millions of weights have opened up tasks. We propose a framework for combining deep auto- completely new application areas for neural networks [4], encoder neural networks (for learning compact feature spaces) with recently-proposed batch-mode RL algorithms (for learning [5], [6]. We now propose another application area, reporting policies). An emphasis is put on the data-efficiency of this on first results of applying Deep Learning (DL) to visual combination and on studying the properties of the feature navigation tasks in RL. Whereas most experiments conducted spaces automatically constructed by the deep auto-encoders. so far have concentrated on distinguishing different objects, These feature spaces are empirically shown to adequately more or less perfectly centred in small image patches, in resemble existing similarities between observations and allow to learn useful policies. We propose several methods for improving the task studied here, the position of one object of interest the topology of the feature spaces making use of task-dependent wandering around an image has to be extracted and encoded information in order to further facilitate the policy-learning. in a very low-dimensional feature vector that is suitable for Finally, we present first results on successfully learning good the later application of reinforcement learning. control policies using synthesized and real images. We will mainly concentrate on two open topics. The first is about how to integrate unsupervised training of I. INTRODUCTION deep auto-encoders into RL in a data-efficient way, without Recently, there have been reported several impressive suc- introducing much additional overhead. In this respect, a new cesses of appyling reinforcment learning to real-world sys- framework for integrating the deep learning approach into tems [1], [2], [3]. But present algorithms are still limited to recently proposed memory-based batch-RL methods [7] will solving tasks with state spaces of rather low dimensionality1. be discussed in section III. We will show the auto-encoders in Learning policies directly on visual input—e.g. raw images this framework to produce good reconstructions of the input as captured by a camera—is still far from being possible. images in a simple navigation task after passing the high- Usually, when dealing with visual sensory input, the original dimensional data through a bottle neck of only two neurons learning task is split into two separate processing stages in their innermost hidden layer. (see fig. 1). The first is for extracting and condensing the Nevertheless, the question remains whether or not the relevant information into a low-dimensional representation encoding implemented by these two inner-most neurons is using methods from image processing. The second stage is useful only in the original task, that is reconstructing the for learning a policy on this particular encoding. input, or can also be used for learning a policy. Whereas properties of deep neural networks have been thoroughly studied in object classification tasks, their applicability to classical solution: image processing Low-dimensional unsupervised learning a useful preprocessing layer in visual Reinforcement here: Feature Space Learning Action reinforcement learning tasks remains rather unclear. The unsupervised training of deep answer to this question—the second main topic—mainly autoencoders Sensing Policy depends on whether the feature space allows for abstracting Visiomotoric Learning from particular images and for generalizing what has been learned so far, to newly seen similar observations. In section Fig. 1. Classic decomposition of the visual reinforcement learning task. V, we will name four evaluation criteria and do a thorough examination of the feature space in this respect and finally In oder to increase the autonomy of a learning system, give a positive answer to this question. Moreover, we will letting it adapt to the environment and find suitable repre- present some ideas on how to further optimize the topology sentations by itself, it will be necessary to eliminate the need in the feature space using task specific information. Finally, for manual engineering in the first stage. This is exactly we present first successes of learning control policies directly the setting where we see a big opportunity for integrating on synthesized images and—for the very first time—using a recently proposed deep auto-encoders replacing hand-crafted real, noisy image formation process in section VI. preprocessing and more classical learning in the first stage. II. RELATED WORK Sascha Lange and Martin Riedmiller are with the Department of [8] was the first attempt of applying model-based batch- Computer Science, Technical Faculty, Albert-Ludwigs University of RL directly to (synthesized) images. Ernst did a similar Freiburg, D-79194 Freiburg, Germany (phone: +49 761 203 8006; email: experiment using model-free batch-RL algorithms [9]. The fslange,[email protected]). 1over-generalizing: less than 10 intrinsic dimensions for value-function- interesting work of [10] can be seen at the verge of fully based methods and less than 100 for policy gradient methods integrating the learning of the image processing into RL. Nevertheless, the extraction of the descriptive local fea- encoding of the images in a low-dimensional feature space. tures was still implemented by hand, learning just the task- We advocate a combination with recently proposed model- dependent selection of the most discriminative features. All free batch-RL algorithms, such as Fitted Q-Iteration (FQI) thre [8], [9], [10] lacked realistic images, ignored noise and [13], LSPI [14] and NFQ [15], because these methods have just learned to memorize a finite set of observations, not been successful on real-world continuous problems and, as testing for generalization at all. these sample-based batch-algorithms [7] already store and Instead of using Restricted Boltzman Machines during reuse state transitions (st; at; rr+1; st+1), the training of the the layer-wise pretraining of the deep auto-encoders [4] auto-encoders integrates very well (see fig. 3) into the batch- our own implementation completely relies on regular multi- RL-framework with episodic exploration as presented in [3]. layer perceptrons, as proposed in chapter 9 of [11]. Previous publications have concentrated on applying deep learning Sample Experience, to classical image recognition tasks like face and letter Interacting Transition Data in Unsupervised Training of Observation Space Deep Auto-encoder recognition [4], [5], [11]. The RL-tasks studied here also with the Environment add the complexity of tracking moving objects and encoding outer loop Deep Encoder their positions adequately in very low-dimensional feature Changed vectors. Policy III. DEEP FITTED Q-ITERATIONS Prepare Pattern Set inner loop In this section, we will present the new deep fitted q- iteration algorithm (DFQ) that integrates unsupervised train- Appoximated Transition Data in ing of deep auto-encoders into memory-based batch-RL. Value Feature Space Function A. General Framework Batch-Mode In the general reinforcement learning setting [12], an Supervised Learning agent interacts with an environment in discrete time steps t, observing some state s 2 S and some reward signal r to then respond with an action a 2 A. We’re interested in Fig. 3. Graphical sketch of he proposed framework for deep batch-RL with episodic exploration. tasks that can be modelled as markov decision processes [12] with continous state spaces and finite action sets. The task In the outer loop, the agent uses the present approximation is to learn a strategy π : S 7! A maximizing the expectation of the Q-function [12] to derive a policy—e.g. by -greedy P1 t of the sum Rt = k=0 γ rt+k+1 of future rewards rt with exploration—for collecting further experience. In the inner discount factor γ 2 [0; 1]. In the visual RL tasks considered loop, the agent uses the present encoder to translate all here, the present state of the system is not directly observable collected observations to the feature space and then applies by the agent. Instead, the agent receives a high-dimensional, some batch-RL algorithm to improve an approximation of continuous observation o 2 O (image) in each time step. the Q-function. From time to time, the agent may retrain a new auto-encoder. The details of the processing steps will be discussed in the following subsections. observation deep B. Training Deep Auto-Encoders with RProp semantics unknown encoder feature vector semantics unknown Training the weights of deep auto-encoder neural networks reward image function to encode image data has been thoroughly treated in the formation approximator literature [4], [11]. In our implementation, we use several agent shallow auto-encoders for the layer-wise pre-training of the state environment q-values semantics known deep network, starting with the first hidden layer, always training on reconstructing the output of the previous layer. system action dynamics selection After this pre-training, the whole network is unfolded and fine-tuned for several epochs by training on reconstructing action the the inputs. Differing from other implementations, we make no use of RBMs but use multi-layer perceptrons (MLP) and standard gradient descent on units with sigmoidal Fig. 2. Extended agent-environment loop in visual RL tasks. In the deep- RL framework proposed here, a deep-encoder network is used to transfer activations in both phases, as proposed in chapter 9 of [11]. the high-dimensional observations into a lower-dimensional feature vector Weights are updated using the RProp learning rule [16]. As which can be handled by available RL-algorithms. RProp only considers the direction of the gradient and not its length, this update-rule is not as vulnerable to vanishing We will insert the training of deep auto-encoders on the gradients as standard back-propagation.

Deep Auto-Encoder Neural Networks in Reinforcement Learning

Survey on Reinforcement Learning for Language Processing

A Deep Reinforcement Learning Neural Network Folding Proteins

An Introduction to Deep Reinforcement Learning

Modelling Working Memory Using Deep Recurrent Reinforcement Learning

Reinforcement Learning: a Survey

An Energy Efficient Edgeai Autoencoder Accelerator For

Sparse Cooperative Q-Learning

4 Perceptron Learning

Optimal Path Routing Using Reinforcement Learning

18 Reinforcement Learning

Deep Learning and Reinforcement Learning

Deep Reinforcement Learning for Imbalanced Classification