Faculteit Wetenschappen en Bio-ingenieurswetenschappen Vakgroep Computerwetenschappen

Unsupervised Feature Extraction for Reinforcement Learning

Proefschrift ingediend met het oog op het behalen van de graad van Master of Science in de Ingenieurswetenschappen: Computerwetenschappen Yoni Pervolarakis

Promotor: Prof. Dr. Peter Vrancx, Prof. Dr. Ann Nowé

Juni 2016

Faculty of Science and Bio-Engineering Sciences Department of Computer Science

Unsupervised Feature Extraction for Reinforcement Learning

Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in de Ingenieurswetenschappen: Computerwetenschappen Yoni Pervolarakis

Promotor: Prof. Dr. Peter Vrancx, Prof. Dr. Ann Nowé

June 2016

Abstract

When using high dimensional features, chances are that most of the features are not important for a specific problem. Different possibilities exist to eliminate those features and potentially find better ones: for example, feature extraction, which transforms the original input features into a new, lower dimensional feature set, or feature selection, where only the features that are more important than others are kept. This can be done in a supervised or unsupervised manner. In this thesis, we investigate whether autoencoders can be used as an unsupervised feature extraction method on data that is not necessarily interpretable. These new features are then tested in a Reinforcement Learning environment. The data is represented as RAM states and is blackbox, since we cannot interpret it. The autoencoders receive a high dimensional feature set and transform it into a lower dimension; these new features are given to an agent, who makes use of them and tries to learn from them. The results are compared to a manual feature selection method and to using no feature selection at all.

Acknowledgements

First and foremost I would like to thank Prof. Dr. Peter Vrancx for helping me find a subject I am passionate about, for taking the time for weekly updates and for all his suggestions and numerous conversations on how this subject could be tackled. Secondly, I would also like to thank Prof. Dr. Ann Nowé for piquing my interest in the Artificial Intelligence master when taking her course in my first year at the Vrije Universiteit Brussel. And finally I would also like to thank my mother for supporting me in pursuing my studies at university level and my girlfriend for her endless support.

Contents

1 Introduction
  1.1 Research Question

2 Machine Learning
  2.1 Supervised learning
    2.1.1 Classification
    2.1.2 Regression
  2.2 Unsupervised learning
  2.3 Underfitting and overfitting
  2.4 Bias - Variance
  2.5 Ensemble methods
    2.5.1 Bagging
    2.5.2 Boosting
  2.6 Curse of dimensionality
  2.7 Evaluating models
    2.7.1 Cross validation

3 Artificial Neural Networks
  3.1 Perceptrons
  3.2 Training perceptrons
  3.3 Multilayer perceptron
  3.4 Activation functions
    3.4.1 Sigmoid
    3.4.2 Hyperbolic tangent
    3.4.3 Rectified Linear Unit
    3.4.4 Which is better?
  3.5 Tips and tricks

  3.6 Backpropagation
  3.7 Autoencoders
  3.8 Conclusion

4 Reinforcement Learning
  4.1 The setting
  4.2 Rewards
  4.3 Markov Decision Process
  4.4 Value functions
  4.5 Action Selection
  4.6 Incrementing Q-values
  4.7 Monte Carlo & Dynamic Programming
  4.8 Temporal Difference
    4.8.1 Q-Learning
    4.8.2 SARSA
  4.9 Eligibility traces
  4.10 Function approximation

5 Experiments and results
  5.1 ALE
  5.2 Space Invaders
  5.3 Reconstruction
  5.4 Flow of experiments
  5.5 Manual features and basic RAM
  5.6 Difference between bits and bytes
  5.7 Comparing different activation functions
  5.8 Initializing Q-values
  5.9 Pretraining and extracting other layers
  5.10 Combination of RAM and layer
  5.11 Visualizing high dimensional data

6 Conclusions
  6.1 Future work

Appendices

A Extended graphs and tables

Bibliography

List of Figures

1 Architecture of data processing

2 Example of a decision tree
3 Classification
4 Regression
5 Data of two features
6 k-means clustering
7 Unsupervised learning: reduction of dimensions
  7a MNIST example of the number 2
  7b MNIST reduction of dimensions
8 Difference between under- and overfitting
9 Dartboard analogy from (Sammut & Webb, 2011)
10 Bias-Variance trade-off
11 Random Forest
12 Searching in different dimensions
  12a 1D space
  12b 2D space
  12c 3D space

13 Example of a perceptron
14 Bitwise operations
  14a AND operator
  14b OR operator
  14c XOR operator
15 XOR with decision boundaries learnt by an MLP
16 Multilayer perceptron
17 Other activation functions: linear and step function
18 Sigmoid activation function

19 Hyperbolic tangent activation function
20 ReLU activation function
21 Example of an autoencoder

22 A Skinner's Box from (Skinner, 1938)
23 Agent Environment setting
24 Another view of the agent environment setting
25 Mountain car; image from (RL-Library, n.d.)
26 Pole Balancing; image from (Anji, n.d.)
27 Maze world
28 Eligibility trace; image from (Sutton & Barto, 1998)
29 Replacing traces; image from (Sutton & Barto, 1998)
30 Coarse coding; image from (Sutton & Barto, 1998)

31 The difference between RAM and Frames
  31a RAM
  31b Frames
32 Space Invaders screen
33 MSE of autoencoder with 128 bits input
34 MSE of autoencoder from 1024 bits input
35 Difference RAM and RAM with AND
36 Autoencoders on 128 bytes
37 Autoencoders on 1024 bytes
38 Q = −1
39 Q = 1
40 Extraction of a layer other than the bottleneck
41 Pretraining with extraction of layer 512
42 Pretraining with extraction to a hidden layer of 4 nodes
43 Pretraining with extraction of layer 512 with dropout
44 Pretraining with extraction of layer 512 with dropout
45 Combining the original layer with the encoded version
46 t-SNE

47 Linear activation function on an autoencoder
48 Sigmoid activation function on an autoencoder
49 ReLU activation function on an autoencoder
50 Pretraining with extraction of layer 512
51 Combining the original layer with the encoded version

List of Tables

1 Classification of animals
2 Predicting the price of a house

3 V*(s)
4 π*(s)
5 Gridworld Example

6 Comparing different activation functions
7 P-values of the Mann-Whitney U test
8 The difference in setting different Q-values

9 Training to a specific layer and extracting a chosen layer

List of Algorithms

1 Q-Learning
2 SARSA
3 SARSA(λ)
4 Q-Learning(λ)

Chapter 1 Introduction

Artificial Intelligence is a field of computer science which studies a wide range of topics such as Machine Learning, Reinforcement Learning and a newly rising topic, Deep Learning. Artificial Intelligence is now more a part of daily life than it was two decades ago. Take for example a robotic vacuum cleaner, where the robot knows when to clean the house, knows exactly when it must return to the charging station for a full battery and even picks up where it left off after recharging. More than ten years ago vacuum cleaner robots were not seen as AI, because the robot would simply do random walks; if a robot does a random walk in a house long enough, the whole house will eventually be cleaned. With new algorithms available, the robot can map the house to vacuum efficiently and detect how to make a detour if an object is suddenly in the way. The only way to gather all this data is to perceive all possible features. Another example are the new smart thermostats like the Nest thermostat developed by Google or the ATAG ONE thermostat. These smart thermostats know when the house is empty and when the owners go to work and come back. By learning the behaviour of the owners, the thermostat automatically adapts so that the heating is turned up just before the owners come home and turned down after the owners go to work or go to sleep; this can ultimately have a great impact on energy consumption. All these new domestic devices make daily life easier and seem natural. Under the hood there is often a complicated AI that uses many features: measurements of sensory inputs like velocity, battery usage, an IR detector or a thermometer. These features can be very specific and comprehensive and can in general consist of thousands or millions of different inputs. Not all of them

are equally important depending on the task that must be completed.

Going back to the example of the robotic vacuum cleaner, features like the texture of the floor will have an impact on the duration of the task: vacuuming a carpet is harder than vacuuming a concrete floor. Features like the temperature outside will have little to no impact. It is therefore of the utmost importance to select only the features that matter to the task at hand. For simple tasks with few features manual feature selection is feasible, but when millions of features are in play it is not. DNA microarrays, for example, store an enormous amount of features. Manually selecting which features are important for some task is a horrendous job, not only because the person who selects these features needs knowledge about the task and the features, but also because features that seem unimportant in isolation can have a strong influence on the result when combined. Feature extraction is one of the key tasks in Machine Learning. Many problems may arise when using many features, such as the curse of dimensionality, overfitting and a longer training time with a much larger chance of getting stuck in local minima. When using too many features there is also a possibility that many features are redundant and do not contribute anything to, for example, a classification. Many feature selection or extraction methods rely on a supervised approach. There are different feature selection methods such as entropy, which is sometimes used in Decision Trees, correlation techniques to find which features are highly correlated and thus useful for a certain task, or dimension reduction techniques like PCA, which is a linear transformation of the data. All of these techniques are linear or need to be set up in a supervised manner.

One technique of supervised feature extraction is template matching, where the similarities, or equivalently the dissimilarities, between the input data and the labelled data are measured and used for classification. This method is often used in Optical Character Recognition (OCR) software (Trier, Jain, & Taxt, 1996). Researchers have also combined ImageNet, an online public database with more than 14 million images, all manually labelled into roughly 21,000 categories, with deep learning for classification. By using deep convolutional neural networks researchers were able to classify those images; the deep networks used consisted of more than 650,000 neurons with 60 million parameters (Krizhevsky, Sutskever, & Hinton, 2012). These neural networks perform supervised feature extraction, because the layers can learn abstractions of raw inputs, for example from pixels to edges to objects. Neural networks have also been used in regression (Geoffrey E. Hinton & Salakhutdinov, 2008), where unlabelled

data is used to learn a good covariance kernel. Autoencoders (Section 3.7) can be used to find features by reducing the dimensionality and extracting those compressed features (Geoffrey E Hinton & Salakhutdinov, 2006; Ng, 2011).

Solving problems is what keeps AI interesting. Researchers in the field of AI have a particular interest in solving games, because games represent problems that provide a challenging search space while still having a clear set of rules, and the AI's performance can be compared directly to human performance. The program Chinook (Schaeffer et al., 1992) was one of the first AI systems to beat expert checkers champions, by using heuristics and search trees. The engine Deep Blue is a chess AI that uses databases with game data and massively parallel search (Campbell, Hoane, & Hsu, 2002), and it has defeated the reigning world chess champion. Another example of an AI is TD-Gammon, which has won against some of the best backgammon players by using neural networks and TD(λ). This was achieved by repeatedly playing against itself (Tesauro, 1994) and by doing so training itself.

One example that shows how popular Artificial Intelligence, and Deep Learning in particular, has become is Go. Google DeepMind has succeeded in defeating one of the world's top Go players (Silver et al., 2016). Go is a board game with relatively simple rules: players take turns placing white or black stones on the board. Nevertheless, Go is one of the hardest games for an AI to learn, because there are more possible board configurations than there are atoms in the observable universe. Traditional AI algorithms build trees of all possible moves and positions and look where the agent has the best chance of winning before selecting a move. Because of the enormous number of choices and options in Go, this is simply not feasible. Google DeepMind trained neural networks on recorded games and moves from top Go players and tried to predict their moves; afterwards they used Reinforcement Learning to let these neural networks play against themselves and learn new moves. Finally they used Monte Carlo tree search to estimate the value of a state instead of browsing through the whole tree.

More recently, Google DeepMind has created DQN, which combines deep neural networks with reinforcement learning and experience replay (Mnih et al., 2015), and has succeeded in beating human players on several games. Deep Learning is not only popular for classification and regression tasks but also in the field of Natural Language Processing, where a deep

network can return tags, semantic roles and even semantic similarity given a sentence (Collobert & Weston, 2008). In this thesis we will consider the problem of applying machine learning methods to computer games, by using autoencoders as a feature extraction method.

1.1 Research Question

In this thesis we will develop automatic feature extraction methods that can be used in combination with Reinforcement Learning. This is an important problem, as the performance of an RL agent strongly depends on the representation used for learning. Selecting good features is challenging, as it requires knowledge of the problem domain and the task to be solved. This thesis will investigate the use of unsupervised learning methods to replace manual feature selection. A current example is the blackbox challenge 1, where a contestant receives a dataset that we, as humans, do not understand. Every time step the agent perceives a new state and a variety of actions it can take. These can be stochastic, and delayed rewards after taking an action are possible. This challenge was designed in such a way that contestants do not know how to interpret the data, so they cannot perform manual feature selection. The data is essentially a blackbox.

We will consider the problem of learning to play Atari games using the RAM game state as input. As humans we cannot interpret the RAM state, so the step of manual feature selection will be skipped and instead an unsupervised feature extraction via autoencoders will be performed. Figure 1 shows the usual pipeline when dealing with too many dimensions. The idea is to replace the middle box, manual feature selection, with an unsupervised feature extraction method. These autoencoders will be trained with different settings and different levels of dimension reduction. The new features will then be used by an RL agent with SARSA(λ), which will play Space Invaders on an Atari 2600 emulator. By using the game we can see how well these new features perform in comparison with manual feature selection.

1 http://blackboxchallenge.com

Figure 1: The current, infeasible setting when dealing with too many dimensions

This thesis will first cover the background of Machine Learning (Chapter 2), Artificial Neural Networks (Chapter 3) and Reinforcement Learning (Chapter 4). This is followed by the experiments (Chapter 5), and the last chapter contains the final conclusion together with some possibilities for future work (Chapter 6).

Chapter 2 Machine Learning

The term Machine Learning is a broad term that covers many subfields. Giving a precise definition is difficult and many different definitions exist. In this thesis the definition of Tom Mitchell is adopted. He describes Machine Learning as:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. (Mitchell, 1997)

Applying this definition to this thesis gives a better understanding. This thesis investigates the unsupervised training of autoencoders and, by doing so, unsupervised feature extraction. These new features are then used to train an agent with Reinforcement Learning. The research question can be divided into two parts.

• The unsupervised autoencoder training (Section 3.7), where the task T is to learn a form of compression and by doing so perform feature extraction. The experience E consists of the RAM or frame states received from our gameplay, and the performance P is the Mean Square Error (MSE), Section 2.1.2, which determines how good the reconstruction of the RAM or frame states is and thus how good an autoencoder is.

• Reinforcement Learning (Chapter 4) of Atari games, where the task T is to learn to play a game and achieve as high a score as possible. The experience E consists of the interactions with the game and the results that come from them. The performance P is measured by the score itself and the total reward.

Machine Learning is highly relevant for many current problems. For example, cancer can be detected and classified (Cruz & Wishart, 2006), cars can drive themselves (Google, n.d.), speech can be recognized as in Siri from Apple, and so on.

There are three major kinds of learning for some task T: supervised learning, unsupervised learning and sequential decision making.

2.1 Supervised learning

Supervised learning is the task of receiving some input data X and output data, or labelled data, y and creating a function y = f(x) that maps the input values to output values. There are different kinds of supervised learning: classification and regression. Classification assigns the features to a small discrete number of groups, for example the breed of an animal. In regression problems, by contrast, the number of possible outputs can be very large or even continuous.

Supervised learning searches for a function h(x), also known as the hypothesis, that given the data x returns an estimated output value. For example a linear hypothesis:

$h(\vec{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n$

The linear hypothesis has some parameters θ that can be optimized through learning. The linear hypothesis returns a value $h(\vec{x})$ that can be compared to our labelled data, y. By using different techniques, which will be explained later on, the θ-values can be tweaked so that $h(\vec{x})$ becomes equal to y.

Below we will discuss two classes of supervised learning problems: classification and regression.

2.1.1 Classification

A classification problem is a problem where the data is classified or labelled into different classes. Take for example some input data consisting of features about animals (Table 1): the number of feet, the colour and whether the animal has wings or not. The classification is then the breed of the animal; in this case a dog, duck or spider.

       Feet   Color   Wings   y
x1     4      Brown   No      Dog
x2     2      White   Yes     Duck
x3     8      Black   No      Spider
...    ...    ...     ...     ...
x100   8      Brown   No      ?

Table 1: Classification of animals

The classifier will try to determine a decision boundary between the dogs, ducks and spiders. Take the last example in the previous table, where an animal has 8 legs, a brown colour and no wings. Since there is no label, the classifier must determine what animal x100 is. For a human it is clear that, if there are only three possible animals, the unknown animal must be a spider, since the only animal with 8 feet is a spider. But a classifier cannot determine this so easily.

Figure 2: Example of a decision tree

One example of a supervised learning method is the decision tree. Decision trees are trees with different nodes. Each node asks a question, and this question leads to another question or to a leaf. A leaf will represent

the classification of an example. Everything depends on which question is asked first; the most informative feature has the most potential to generate a shorter and more precise decision tree. This can be determined by using, for example, entropy and information gain. Figure 2 shows an example of a decision tree for the input data of Table 1. This tree could be shorter by removing the colour question after the wings question with answer yes, because if the animal has wings it is automatically a duck in our example. Different adaptations of decision trees exist to optimize trees, for example by pruning (Quinlan, 1987).
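As an illustration of how entropy and information gain can rank candidate questions, the short sketch below computes both for the animal data of Table 1. This is a minimal example; the thesis does not prescribe any particular implementation, and the dictionary-based data layout is only for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, labels, feature):
    """Reduction in entropy obtained by splitting on one feature."""
    base = entropy(labels)
    remainder = 0.0
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

# The three labelled animals from Table 1.
examples = [{"feet": 4, "wings": "no"},
            {"feet": 2, "wings": "yes"},
            {"feet": 8, "wings": "no"}]
labels = ["dog", "duck", "spider"]

for feature in ("feet", "wings"):
    print(feature, information_gain(examples, labels, feature))
```

Asking about the number of feet separates all three animals at once, so it yields the highest information gain and would be placed at the root of the tree.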

Figure 3 shows a classification problem with two features x1 and x2. The red points belong to Class 1 and the blue points to Class 2. The classifier tries to find a decision boundary in the input space such that all data points, or at least as many as possible, fall on the side of their correct class. In an ideal situation the decision boundary separates the classes exactly, but with real-world data this is highly unlikely, since data is often noisy and/or corrupted. The classifier needs to find a boundary for which the cost of misclassification is the lowest.


Figure 3: Classification of 2 features into 2 classes, separated by a decision boundary

2.1.2 Regression

Regression problems cannot be divided into classes but have some continuous target value. Take for example the prediction of house prices with features like the number of bedrooms, kitchens, gardens and garages (Table

2). Obviously these prices cannot be treated as class labels, and predicting the output for x100 is not that simple.

       Bedroom   Kitchen   Garden   Garage   Bathroom   y
x1     1         1         0        1        1          € 153.314
x2     3         1         1        2        2          € 317.135
x3     6         2         1        3        4          € 683.562
...    ...       ...       ...      ...      ...        ...
x100   2         1         1        0        1          € ?

Table 2: Predicting the price of a house

The question then remains how a regression model predicts values. A linear model tries to fit a line through the data points, which in the example above are the counts of certain room types. This line is also called the regression line. Figure 4 shows an example where the blue dotted points are the input data and the green line represents a regression line. Multiple regression lines are possible, but not all of them are equally good.

A well-known simple linear regression function is $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ with $i = 1, \ldots, n$ for n data entries. The $\epsilon$-value, or disturbance term, represents the noise in the data values.

Supervised learning tries to create a function h(x) by optimizing the parameter values and, by doing so, predicting y as well as possible. To get an idea of how good a model is, a cost function is needed. A commonly used cost function for regression is the Mean Square Error, or MSE:

$\mathrm{MSE} = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x_i) - y(x_i) \right)^2$

The MSE measures the difference between the predicted output h(x) and the true value y. This difference is squared so that the sign of the difference does not matter. The additional factor 2 cancels out when differentiating, which will be useful in the neural networks. The lower the error, the better the hypothesis fits the data.
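As a concrete illustration, the MSE above can be computed in a few lines of numpy; the hypothesis and the data values below are placeholders chosen only for this sketch.

```python
import numpy as np

def mse(h, x, y):
    """Mean Square Error with the extra factor 1/2 used in this thesis."""
    predictions = h(x)
    return np.sum((predictions - y) ** 2) / (2 * len(y))

# Example: a linear hypothesis h(x) = theta0 + theta1 * x.
theta0, theta1 = 0.5, 2.0
h = lambda x: theta0 + theta1 * x

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.4, 2.6, 4.4, 6.6])
print(mse(h, x, y))
```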


Figure 4: A regression line between the input values X and the output values y

2.2 Unsupervised learning

Unlike supervised learning, unsupervised learning has no target output y but only input data X (Figure 5). Because there is no target output, it is the job of an unsupervised learning model to find relationships or structure in the input data. These relations can be used to group data or even to reduce dimensions.

One example of finding structure in data and grouping it is k-means clustering (Figure 6), where k clusters are formed. Each cluster has a mean, also called a centroid. First, k random centroids are placed

within these data points. In each iteration every data point is assigned to the closest cluster. When all data points are assigned, each centroid is recalculated and moved. This is iterated until the centroids no longer move.
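The procedure just described can be written down compactly; the sketch below is a minimal numpy version (the initialisation, stopping test and toy data are simplified choices for illustration only).

```python
import numpy as np

def k_means(X, k, iterations=100, seed=0):
    """Simple k-means: assign points to the nearest centroid, then move the centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[assignments == j].mean(axis=0) if np.any(assignments == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignments

# Two artificial blobs of points, clustered into k = 2 groups.
X = np.vstack([np.random.randn(50, 2) + [0, 0],
               np.random.randn(50, 2) + [5, 5]])
centroids, labels = k_means(X, k=2)
```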


Figure 5: Data of two features

Another example of unsupervised learning is dimension reduction with autoencoders (Section 3.7). Autoencoders are a form of artificial neural network with their input equal to their output. By doing so, the autoencoder learns the identity function, and in the internal representations used, the autoencoder learns to compress the data. MNIST is a database of handwritten digits in their raw feature form. Each digit can be converted to a 28x28 image and thus 784 pixels or dimensions (Figure 7a). Autoencoders can be used to go from 784 dimensions to 2 dimensions, thereby performing dimension reduction. Each point shown in Figure 7b is the image of a number like the one in Figure 7a. These points were reduced from 784 to 2 dimensions, and the colours indicate the class the number belongs to. It can be seen that the compression maps the same numbers close to each other.


Figure 6: k-means clustering with four clusters and their centroids

(a) MNIST example of the number 2 (b) MNIST reduction of dimensions

Figure 7: Unsupervised learning: reduction of dimensions

2.3 Underfitting and overfitting

There are different ways to build models and not every model is good. Take for example some data points; these data points stem from some underlying function that we do not know. The data points represent one input feature x and the output y with some random noise, $y = f(x) + \epsilon$. A model

can then be created to predict this underlying function. Different functions fitted by linear regression are plotted in Figure 8. The blue dotted points denote the samples, the green line represents the unknown underlying function and the blue line is the hypothesis h(x) of our model. The first figure shows a model that is underfitting: it cannot represent the underlying function at all. The function is too simple, and the underlying function cannot be represented by a straight line, which in this case is a polynomial of degree 1, i.e. linear regression. The second figure shows a model that has learnt the true underlying function, although without knowing the underlying function it is still a hypothesis; in this case it is a polynomial of degree 4. The last figure shows a model that is overfitting: it tries to model every training point too well and uses a polynomial of degree 15. If this model then tries to predict unseen data it will fail, because the model does not generalize over the dataset but tries to fit it perfectly. Note that neither under- nor overfitting is good.

A good way to test whether a model is good or bad is to divide the data into a training and a test set. Let the model train on the training set and, when it has finished training, let it predict on the test set. Seeing how much the predicted output differs from the output of the test set gives a good indication. A good measure is the Mean Square Error: the smaller the error, the better the model fits the data. There is a difference between the training and the test error. The training error is measured while the model is being trained: the model receives an input value, predicts the output and, if it is wrong, adapts itself. The test error is measured when the model is done training: a new set of data is presented to the model, the model predicts the output and the error quantifies how far off the model is.

Figure 8 presents different models for an unknown underlying function, shown as the green line. The trained model predicts values, represented by the blue line. When a test set is presented to the model, the blue line gives the outcome the model will produce. The samples the model has been trained on are the blue dots. It can be seen that the first model has a high training error as well as a high test error: the model cannot represent the data with a degree-1 polynomial and can certainly not represent a new test set. The next model, with a degree-4 polynomial, has a very low training and test error: it fits the training data, and the test set is predicted fairly well because the function of the model closely matches the true underlying function. The last model, with a degree-15 polynomial, has a low training error; as can be seen, it fits the training samples perfectly, but it has a high test error as it cannot represent the new test data.

[Panel titles, left to right: MSE = 0.37, MSE = 0.04, MSE = 182212904.43; legend: Model, True function, Samples]

Figure 8: Difference between under- and overfitting. From left to right: polynomials of degree 1, 4 and 15. Image adapted from sklearn 1
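The three fits in Figure 8 can be reproduced with ordinary least-squares polynomial fitting. The sketch below uses numpy with an assumed cosine as the hidden function (similar to the sklearn example the figure is adapted from) and shows how the test error grows for the degree-15 fit even as the training error keeps shrinking.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.cos(1.5 * np.pi * x)      # assumed underlying function

x_train = np.sort(rng.uniform(0, 1, 30))
y_train = true_f(x_train) + rng.normal(scale=0.1, size=30)
x_test = np.sort(rng.uniform(0, 1, 30))
y_test = true_f(x_test) + rng.normal(scale=0.1, size=30)

def half_mse(y_hat, y):
    return np.mean((y_hat - y) ** 2) / 2

for degree in (1, 4, 15):
    coeffs = np.polyfit(x_train, y_train, degree)        # least-squares polynomial fit
    train_err = half_mse(np.polyval(coeffs, x_train), y_train)
    test_err = half_mse(np.polyval(coeffs, x_test), y_test)
    print(f"degree {degree:2d}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```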

2.4 Bias - Variance

The question remains how the architect of a model can detect whether the model is under- or overfitting. This can be seen by determining the bias and the variance. First there are expected values, which are the (probability-weighted) average values of a random variable. A random variable associates numeric values with the different outcomes of an experiment; its value can and will change when repeating the experiment. Repeating an experiment to get average results is therefore important. The bias is the difference between the expected value of the predicted outcome and the real target outcome:

$\mathrm{Bias}(\hat{y}) = E(\hat{y}) - y$

The bias indicates how far off a model is from the correct output of the underlying, unknown function.

The variance measures the variability of the model with respect to the expected model.

1 http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html

$\mathrm{Var}(\hat{y}) = E[(\hat{y} - E(\hat{y}))^2]$

The dartboard analogy (Figure 9) gives a more visual idea of what bias and variance mean. Imagine that someone is throwing darts and that the bullseye represents a good model. If, after all the darts have been thrown, they are spread out and thus not close to each other, there is high variance. The bias, on the other hand, is the average distance to the bullseye. In the case of low bias and low variance all darts are close to each other and directly on or close to the bullseye itself.

Figure 9: Dartboard analogy from (Sammut & Webb, 2011)

The mean squared error, MSE, gives a squared measure of how good the model is: $\mathrm{MSE}(\hat{y}) = E[(y - \hat{y})^2]$. The MSE can then be decomposed, which is known as the bias-variance decomposition.

$\mathrm{MSE}(\hat{y}) = (E(\hat{y}) - y)^2 + E[(\hat{y} - E(\hat{y}))^2] + \sigma^2 = \mathrm{Bias}^2 + \mathrm{Var} + \text{irreducible error}$

The last term, the irreducible error, will represent the noise in the data.
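The decomposition can also be checked empirically. The sketch below repeatedly fits a polynomial to fresh noisy samples of an assumed underlying function and estimates the bias and variance of the prediction at a single query point; all function and parameter choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * np.pi * x)   # assumed underlying function
sigma = 0.2                                # noise level
x_query = 0.3                              # point where the prediction is inspected
degree = 1                                 # try 1 (high bias) versus 9 (high variance)

predictions = []
for _ in range(2000):                      # repeat the "experiment" many times
    x = rng.uniform(0, 1, 20)
    y = true_f(x) + rng.normal(scale=sigma, size=20)
    coeffs = np.polyfit(x, y, degree)
    predictions.append(np.polyval(coeffs, x_query))

predictions = np.array(predictions)
bias = predictions.mean() - true_f(x_query)
variance = predictions.var()
print(f"bias^2 = {bias**2:.4f}, variance = {variance:.4f}")
```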

Figure 10: Bias-Variance trade-off

When applying bias and variance to under- and overfitting, it can be seen that underfitting corresponds to a bias that is too high: the model is too simple and cannot learn the underlying function. Overfitting gives a high variance, because the model is too complex and fits the noise instead of the underlying function (Figure 10).

2.5 Ensemble methods

One way to get better performance from a model is to use ensemble methods. These methods combine different models into a whole that is more accurate than a single model.

2.5.1 Bagging

Bootstrap aggregating, also known as bagging, is mostly used for reducing the variance of a model. Bagging belongs to the class of averaging methods, since these methods average their results and by doing so obtain a combined result. Bagging starts by taking random subsets of the training data. By using different subsets and training on them, the models will differ and predict differently. Bagging then accumulates all separate models and combines them into one concluding model (Breiman, 1996). An example of a bagging method is tree bagging, or its extension random forest (Figure 11). It starts

with training B trees, for example decision trees. For each tree the model draws randomly and uniformly with replacement from the pool of training data. After all B trees are trained, the models are ensembled by averaging, $\frac{1}{B} \sum_{b=1}^{B} f_b(x)$, or by voting, where the majority rule counts. This only decreases the variance and does not increase the bias. Random forests additionally select a random feature subset while learning the trees.

Figure 11: Random Forest

2.5.2 Boosting

As with all models, there are strong learners and weak learners. A weak learner is defined as being only slightly better than a random prediction, but still not good enough. The idea is to combine multiple weak learners to create a single strong model. The most popular boosting algorithm is Adaptive Boosting, or AdaBoost (Freund & Schapire, 1997). AdaBoost combines the results of weak learners into a weighted sum or weighted majority vote.
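Both ideas are available off the shelf. The sketch below uses scikit-learn (one possible library choice, not necessarily the one used in this thesis) on synthetic data to compare a single decision tree with a bagged ensemble and an AdaBoost ensemble of such trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging of 100 trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
    "AdaBoost of 100 stumps": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0),
}

for name, model in models.items():
    accuracy = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: test accuracy {accuracy:.3f}")
```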

2.6 Curse of dimensionality

One might think that the more features the data has, the better a learner or model will perform. This is not true. Imagine that a €1 coin is dropped somewhere along a straight line of 100 metres: the coin will easily be found. If a coin is dropped on a surface of 100 x 100 metres, which is 10,000 m2, finding it is still possible but not so easy

anymore. If a coin is dropped in a 3D space of 100 x 100 x 100 metres, which is 1,000,000 m3, it is more difficult than before (Figure 12). This analogy only illustrates the difficulty of finding a coin in a multidimensional space. In machine learning the dimensionality can go up to tens of thousands of dimensions, for example for DNA sequences. The higher the dimension, the sparser the data becomes. One way to reduce dimensions is using feature selection or even feature extraction, like autoencoders.


(a) 1D space (b) 2D space


(c) 3D space

Figure 12: Searching in different dimensions

2.7 Evaluating models

After models are created, they need to be evaluated. Often, if sufficient data is available, 70 or 80% of the data is taken at random and used to train the model; the remaining data is used to predict and see how far off the model is. This method is not flawless: there is a chance that only outliers of the data end up in the test set, which can make the model look bad when it is in fact quite good. Therefore different methods have been invented to get an average estimate of how good a model is.

2.7.1 Cross validation

The first method is cross validation. The goal is to see how effective a model is. There are different cross validation methods, such as k-fold cross validation and the holdout method; these methods are classified as non-exhaustive. The first method, k-fold cross validation, splits the data in k folds or subsets. In each of the k iterations one subset is taken as the test set and the other subsets are used as the training set. Afterwards the results of all k predictions are averaged and the error can be estimated. The holdout method is the same as k-fold cross validation but with k = 2: each data point is assigned, at random, to either the training set or the test set.
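The k-fold procedure is straightforward to write down. The sketch below uses scikit-learn's KFold splitter on synthetic data, with a logistic regression purely as a placeholder model; both choices are illustrative, not prescribed by the thesis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                  # train on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))   # evaluate on the held-out fold

print(f"mean accuracy over 5 folds: {np.mean(scores):.3f}")
```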

Chapter 3 Artificial Neural Networks

Artificial Neural Networks (ANN) are machine learning models inspired by the human brain. The brain consists of approximately 10^11 neurons; a neuron is a cell that transmits information to other neurons. The connection between neurons is called a synapse, and there are approximately 10^14 synapses. Neurons and their synapses can make decisions based on their input; for example, a human can recognize his family members immediately when seeing them. This is exactly why researchers wanted to create artificial neurons with a mathematical model for handling information.

3.1 Perceptrons

One type of Artificial Neural Network is the perceptron (Rosenblatt, 1958), which is a binary classifier (Figure 13).

Figure 13: Example of a perceptron

Perceptrons can only take real-valued inputs and construct one single binary output. The output is calculated as a linear combination of real-valued weights (w) and inputs (x); this results in a value that, depending on a certain threshold, becomes zero or one. This can be written as the following function:

$f(\vec{x}) = \begin{cases} 0 & \text{if } \vec{w} \cdot \vec{x} + b \leq 0 \\ 1 & \text{if } \vec{w} \cdot \vec{x} + b > 0 \end{cases}$

Here $\vec{w} \cdot \vec{x}$ is the dot product of the two vectors; note that x0 is set equal to 1 for this vector notation. The bias influences how easy it is to get a 0 or 1 as output. For example, if the bias is negative, the dot product of $\vec{x}$ and $\vec{w}$ must be greater than the absolute value of the bias to get over the threshold. The bias can thus shift the decision boundary. Note that for perceptrons only a linear decision boundary is possible. Bitwise operations like AND and OR can be implemented by one single perceptron by adapting the weights or the bias. Figures 14a and 14b show how a perceptron can represent the bitwise operations AND and OR. Both axes show the values a bit can take, the colour denotes whether the output is 0 or 1 depending on the operation, and the black line is a decision boundary.

Not all operations can be represented by one perceptron: XOR, for example, is not linearly separable (see Figure 14c) and thus needs more layers of perceptrons.
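To make the threshold function concrete, the sketch below implements a single perceptron in numpy and shows hand-picked weights and biases that realise AND and OR; the specific weight values are just one possible choice.

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron: output 1 if w.x + b > 0, else 0."""
    return int(np.dot(w, x) + b > 0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND: fires only when both bits are 1.
for x in inputs:
    print("AND", x, perceptron(np.array(x), w=np.array([1.0, 1.0]), b=-1.5))

# OR: fires when at least one bit is 1.
for x in inputs:
    print("OR ", x, perceptron(np.array(x), w=np.array([1.0, 1.0]), b=-0.5))
```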

3.2 Training perceptrons

The difficult part of perceptrons is setting the weights in such a way that the perceptron produces the correct output. There are several ways to learn the weights. The first is the perceptron training rule, where all weights are initialized at random. The next step is to iterate over all training examples, and whenever the classification is wrong the weights are updated by the following rule:

$w_j = w_j + \Delta w_j \quad \text{where} \quad \Delta w_j = \eta (t - o) x_j$


(a) AND operator (b) OR operator


(c) XOR operator

Figure 14: Bitwise operations

This is done until all training examples are classified correctly. The rule takes the difference between the target output t and the perceptron's output o, which is then multiplied by a learning rate η and the input xj. It can be seen that whenever the perceptron's output is equal to the correct output, the update is equal to 0 and no weights are changed. It has been proven that the perceptron training rule converges (Minsky & Papert, 1969) if the learning rate is sufficiently small and the data is linearly separable.
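The update rule above is only a few lines of code. The sketch below learns the AND function from the four possible inputs; the learning rate and number of epochs are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])         # targets for the AND function

w = rng.normal(scale=0.1, size=2)  # random small initial weights
b = 0.0
eta = 0.1                          # learning rate

for epoch in range(20):
    for x, target in zip(X, t):
        o = int(np.dot(w, x) + b > 0)    # perceptron output
        w += eta * (target - o) * x      # perceptron training rule
        b += eta * (target - o)          # bias treated as a weight with input 1

print("weights", w, "bias", b)
print("outputs", [int(np.dot(w, x) + b > 0) for x in X])
```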

It is often unknown whether the data is linearly separable. The delta rule, or gradient descent, therefore searches for a good approximation of the outputs even when the data is not linearly separable. The idea is

to minimize the following error:

$E = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

where E is the squared error and D the set of all training examples. Note that the $\frac{1}{2}$ is used to cancel out the exponent when differentiating. The error is always non-negative due to the square. If the error is small, the perceptron's output represents the target output well. To find the minimum of E, the derivative with respect to the weights can be taken.

$\nabla E = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$

The gradient gives the direction of the steepest increase of E. To find the steepest decrease, a negative sign is added. The learning rule then becomes

$\vec{w} = \vec{w} + \Delta \vec{w} \quad \text{where} \quad \Delta \vec{w} = -\eta \nabla E(\vec{w})$

This can be rewritten component-wise as

$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d)(-x_{id})$

$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}$

The learning rate η determines how big the steps taken in the gradient descent search are.

Another variation is called stochastic or incremental gradient descent, where the gradient is calculated for each training example separately instead of summing over all of them:

$\Delta w_i = \eta (t - o) x_i$

Standard (or batch) gradient descent thus goes through all examples before updating the weights, while stochastic gradient descent takes one example and updates the weights based on that example alone. Batch gradient descent

is a very costly algorithm when the number of training samples is large. Stochastic gradient descent improves much faster and typically converges faster, but its final error is usually not as low as that of batch gradient descent.
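The difference between the two update schemes can be seen in a few lines. The sketch below trains a single linear unit on toy data with both batch and stochastic updates; the data, learning rates and epoch counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
t = X @ np.array([2.0, -3.0]) + 0.5 + rng.normal(scale=0.05, size=100)  # noisy linear target

def batch_gd(epochs=200, eta=0.1):
    w, b = np.zeros(2), 0.0
    for _ in range(epochs):
        o = X @ w + b
        w += eta * (t - o) @ X / len(X)     # one update from the averaged gradient
        b += eta * np.mean(t - o)
    return w, b

def stochastic_gd(epochs=200, eta=0.01):
    w, b = np.zeros(2), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):   # one update per (shuffled) example
            o = w @ X[i] + b
            w += eta * (t[i] - o) * X[i]
            b += eta * (t[i] - o)
    return w, b

print("batch     :", batch_gd())
print("stochastic:", stochastic_gd())
```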

3.3 Multilayer perceptron

As explained previously, a single perceptron cannot represent non-linearly separable data like XOR. A multilayer perceptron, or MLP, can represent this by using multiple layers of perceptrons. This results, for example, in two decision boundaries for XOR (Figure 15). The layers of an MLP are fully connected, except for the input layer, and each perceptron has a non-linear activation function (Figure 16).


Figure 15: Example of XOR with two decision boundaries learnt by an MLP

Figure 16: Example of a multilayer perceptron with 4 input nodes, 2 hidden layers of 5 hidden nodes each, and 3 output nodes

3.4 Activation functions

The activation, ϕ in Figure 13, is a function, possibly non-linear, applied after multiplying the inputs with their network weights. For example a linear neuron, which uses a linear activation function, can set the output on or off, meaning the input belongs to class A or B if there are only two classes; it thus activates the node or not. The problem with linear neurons is that stacking multiple layers of linear neurons still yields a linear result. The same goes for a step function, where the output is 0 or 1 depending on a threshold θ. There is thus a need for a unit that, given an input, yields an output that is a non-linear function of that input. An advantage of the activation functions described below is that they are all differentiable, which keeps the computational load manageable when training neural networks. Other basic activations are the linear and step function (Figure 17).


Figure 17: Other activation functions: linear and step function

3.4.1 Sigmoid

The sigmoid unit uses a sigmoid function as its thresholding function (Figure 18). This results in a continuous function of its input:

$\sigma(x) = \frac{1}{1 + e^{-x}}$

This maps the input to an output between 0 and 1. The derivative of the sigmoid function is:

$\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x)) = \frac{1}{1 + e^{-x}} \left( 1 - \frac{1}{1 + e^{-x}} \right)$


Figure 18: Sigmoid activation function

3.4.2 Hyperbolic tangent

The same holds for the hyperbolic tangent, or tanh (Figure 19), which maps the input to an output between -1 and 1.

$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{2x} - 1}{e^{2x} + 1}, \qquad \frac{d}{dx}\tanh(x) = 1 - \tanh(x)^2$


Figure 19: Hyperbolic tangent activation function

3.4.3 Rectified Linear Unit

A more recently introduced activation is the rectified linear unit (Nair & Hinton, 2010), or ReLU (Figure 20). It has the advantage that in a neural network with randomly initialized weights, only about 50% of the hidden neurons are activated, which results in a sparse activation. ReLU is not differentiable at 0, but it can be differentiated at any other point. In recent years ReLU has grown more popular in Deep Learning, because networks with many layers learn much faster with it (Y. LeCun, Bengio, & Hinton, 2015). Networks using ReLU without pre-training can also compete with networks that do use pre-training (Glorot, Bordes, & Bengio, 2011).

$\mathrm{relu}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}, \qquad \frac{d}{dx}\mathrm{relu}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$


Figure 20: ReLU activation function

3.4.4 Which is better?

The question remains which activation function is better. Using a non-linear function is essential when the data is not linearly separable and a non-linear output is wanted. Unfortunately there is no activation function that is best above all others. The hyperbolic tangent is often preferred because the data, if normalized, will be centred around 0; this causes the hyperbolic tangent to converge faster than the sigmoid function (Y. A. LeCun, Bottou, Orr, & Müller, 2012). Over the last few years ReLU has typically been preferred over other activation functions in deep networks, since ReLU does not suffer from the vanishing gradient problem. When learning weights in deep networks with backpropagation, the first layers may learn slowly because of the number of chain rule factors that must be passed through before reaching the input layers. Because so many chain rule factors are multiplied, the derivative can become a very small number, which means the updates can be very slow.
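For reference, the three activation functions and their derivatives from the previous subsections translate directly into code; the sketch below collects them in numpy.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    # Derivative taken as 0 at x = 0, where ReLU is not differentiable.
    return (x > 0).astype(float)

x = np.linspace(-4, 4, 9)
for name, f, df in [("sigmoid", sigmoid, d_sigmoid),
                    ("tanh", tanh, d_tanh),
                    ("relu", relu, d_relu)]:
    print(name, f(x).round(2), df(x).round(2))
```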

3.5 Tips and tricks

As can be seen, there are many parameters that can be set for a neural network; this does not mean that the neural network will converge. There are a few possibilities to speed up the process, although they do not guarantee a good solution. One of those choices is batch or stochastic learning. In batch learning all the training data is passed through the neural network and only then is the gradient computed and the weights updated; this differs from stochastic training, where one update is done after a forward pass on a single (random) input. Stochastic training has the advantage that it is much quicker than batch training and is often known to perform better, although this is not guaranteed.

Another option is to randomize the order of the inputs, so that successive training examples, say $\vec{x}_1$ and $\vec{x}_{101}$, are not as closely related as $\vec{x}_1$ and $\vec{x}_2$ would be. For example, two consecutive RAM states from a game are related, but two random RAM states are probably not as related as two consecutive ones. It is often good practice to train more on examples that return a bigger error than on examples that give a lower error. Another way to boost the process is normalizing the input to mean 0: (Y. A. LeCun et al., 2012) shows that whenever the inputs are, for example, all positive, the weights can only increase or decrease together, which means the update rule will zigzag its way towards the best weights. This makes the algorithm inefficient.

3.6 Backpropagation

Backpropagation is used to train a neural network by optimizing the weights of the network with gradient descent. First, the error E defined previously needs to be redefined, because it was the error of only one unit. This is done by summing the differences between the target and the output of all k output units over the training data d:

$E = \frac{1}{2} \sum_{d \in D} \sum_{k \in \text{outputs}} (t_{kd} - o_{kd})^2$

The difference between backpropagation and the previous gradient descent for one output unit is that the error surface of E contained only one local minimum, while with backpropagation it can have multiple. This means that backpropagation will converge to one of those local minima, but it is not certain that this local minimum is also a global minimum. Despite this, backpropagation still produces good results. The algorithm starts by choosing the number of nodes and outputs and setting small random weights. For each training example the network calculates the output and the error. It then computes the gradient of that error, followed by adapting the weights of the network. This iteration can and probably will be looped many times until the network can calculate the output decently. Different criteria can be set to end the iteration, for example a fixed number of iterations, or looping until the error falls below some threshold. The weights are updated with the following rule:

$w_{ji} = w_{ji} + \Delta w_{ji} \quad \text{where} \quad \Delta w_{ji} = \eta\, \delta_j\, x_{ji}$

This rule is an adapted version of the delta rule seen previously. For output units the new δ is the previous (t − o), the target value minus the output value, but multiplied by the derivative of the activation function ϕ.

$\delta_k = \phi\, (t_k - o_k) \quad \text{where} \quad \phi = \frac{d}{dx}\varphi$

For the inner nodes (assume for now there are only two layers, one output layer and one hidden layer) the δ is defined differently, since no target values are available. The δ of a hidden node h is calculated by summing the δ values of the output units, weighted by the weights of the hidden node:

$\delta_h = \phi \sum_{k \in \text{outputs}} w_{kh}\, \delta_k \quad \text{where} \quad \phi = \frac{d}{dx}\varphi$

This can be extended with more than two layers by using the chain rule.
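To make the update rules concrete, the sketch below trains a tiny two-layer network (one hidden layer of sigmoid units) on XOR with plain backpropagation; the layer sizes, learning rate and epoch count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(scale=1.0, size=(2, 4))   # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1))   # hidden -> output weights
b2 = np.zeros(1)
eta = 0.5

for epoch in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    o = sigmoid(h @ W2 + b2)
    # Backward pass: delta of the output units, then of the hidden units.
    delta_o = (T - o) * o * (1 - o)
    delta_h = (delta_o @ W2.T) * h * (1 - h)
    # Weight updates (batch version of w <- w + eta * delta * x).
    W2 += eta * h.T @ delta_o
    b2 += eta * delta_o.sum(axis=0)
    W1 += eta * X.T @ delta_h
    b1 += eta * delta_h.sum(axis=0)

print(o.round(3).ravel())   # should approach [0, 1, 1, 0]
```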

3.7 Autoencoders

Autoencoders are artificial neural networks with the special property that they do not need target values. This makes autoencoders an unsupervised learning method, because the target values are set equal to the input values, $\vec{y} = \vec{x}$ (Geoffrey E Hinton & Salakhutdinov, 2006; Ng, 2011). This forces the autoencoder to learn the identity function. This may seem trivial, but setting constraints on the network, like limiting the number of layers and nodes (see Figure 21), can create a bottleneck which forces the autoencoder to reduce the input information, thus creating a form of compression. Real-world examples of input data, such as pictures with their many pixels, DNA sequences, and so on, have large input feature vectors. Therefore there is a need for some kind of compression that reduces the number of features.

Deep learning is a technique that can also be used together with autoencoders. A deep network has multiple layers, where each layer learns some abstraction of the input features and in the end creates a complex structure of abstractions (Bengio, 2009; Y. LeCun et al., 2015; Schmidhuber, 2015). The layers of such a deep network can be initialized by first training an autoencoder on the input layers. The weights of the trained autoencoder then typically provide a good starting point for the deep network weights.

There are different kinds of autoencoders. The first variation is the sparse autoencoder, where the hidden layers of the autoencoder can have more hidden nodes than the original input feature vector. Having more hidden nodes leads to computationally heavier calculations. A sparsity parameter enforces that each node is, on average, active only a small fraction of the time. This introduces sparsity and can give interesting results (Ng, 2011). Another variation on sparse autoencoders are k-Sparse autoencoders (Makhzani & Frey, 2013), which keep only the k largest activations and cancel out the rest, meaning they are set to zero. The denoising autoencoder (Vincent, Larochelle, Bengio, & Manzagol, 2008) can be an alternative to sparsity or a bottleneck: it corrupts the input data and the autoencoder is trained to fill in the missing parts and thus reconstruct the original input. This is done by randomly removing features during training.

Figure 21: Example of an autoencoder
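As an illustration of the bottleneck idea, the sketch below builds a small fully connected autoencoder in Keras (an assumed library choice for illustration; the thesis does not prescribe one) that compresses a 128-dimensional input, such as a byte-level RAM state, down to a handful of features. The layer sizes and training data are placeholders.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, bottleneck_dim = 128, 4           # e.g. 128 RAM bytes -> 4 features

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(64, activation="relu")(inputs)
encoded = layers.Dense(bottleneck_dim, activation="relu", name="bottleneck")(encoded)
decoded = layers.Dense(64, activation="relu")(encoded)
decoded = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")   # target equals input

encoder = keras.Model(inputs, autoencoder.get_layer("bottleneck").output)

# Train on random placeholder data scaled to [0, 1]; real RAM states would go here.
X = np.random.rand(1000, input_dim)
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)

features = encoder.predict(X[:3])            # the extracted low-dimensional features
print(features.shape)                        # (3, 4)
```

The separate `encoder` model is what would feed the compressed features to the reinforcement learning agent.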

3.8 Conclusion

This thesis will primarily focus on autoencoders and their capabilities for unsupervised feature extraction. Because autoencoders have the capability of reducing dimensions, it is interesting to investigate how good these extracted features are. Unfortunately there is no direct way to see whether these features are good or what they mean, since they are somewhat blackbox: the features have numeric values that are not easily interpreted, unlike for example a decision tree, which is easily human-readable. By using different activation functions we can also see the impact one function has on the reconstruction of the input data. Because the autoencoders start from relatively high dimensions, 1024 or 128 depending on how the RAM state is interpreted, it is also a good idea to experiment with adding dropout to an autoencoder. Dropout forces the autoencoder to randomly drop nodes with all their connections; by doing so, the autoencoder is forced to learn other connections, and this also prevents overfitting the network (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014). The downside of using dropout is a longer training time.

Chapter 4 Reinforcement Learning

The history of Reinforcement Learning (RL) has its roots in psychology. Edward Thorndike introduced the law of effect, which he defines as:

Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situ- ation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911)

This is one of the key points of Reinforcement Learning: positive interactions are encouraged and negative interactions are discouraged, but not rejected.

Skinner invented the Skinner Box (Skinner, 1938), in which animals have to press a lever when receiving a signal; this can be anything from a light pulse to a sound. When the animal presses the lever at the correct signal it receives a reward, most likely food (Figure 22). It can also receive a negative reward, like an electrical shock, when pressing the lever at the wrong signal. Skinner is known to have performed this kind of test on pigeons and rats (Skinner, 1951, 1948).

Figure 22: A Skinner's Box from (Skinner, 1938)

Many other examples exist where animals are trained, like Pavlov's dogs (Todes, 2002), where dogs were trained to respond to receiving food by producing more saliva. This was trained by ringing a bell before giving the dogs their food.

All this previous research, for example on Pavlov's dogs, forms the basis of how dogs are trained now: a dog gets a biscuit if it performs a command correctly, and if it does something bad it gets scolded. This is exactly what Reinforcement Learning tries to recreate.

Reinforcement Learning is used in many current applications, for example a robot vacuum cleaner that adapts itself to know when to dock and recharge and where to restart, or even adapts its motors depending on the material of the floor to save energy and be more efficient. Reinforcement Learning is also widely used in games: researchers let an AI play backgammon against itself, learning from its own mistakes, which made it a master-level player close to the best backgammon players (Tesauro, 1994).

4.1 The setting

The Reinforcement Learning setting can be summarized in Figure 23 and 24 (Sutton & Barto, 1998). An agent is an entity that can observe the environment and can act upon it and by doing so learn from the interactions.

The environment is where the actions take place and what yields the next state and reward. The agent will eventually learn how to map situations onto different kinds of actions based on what it has learnt. The goal of Reinforcement Learning is to maximize the reward.

Figure 23: Agent Environment setting

Going back to Figure 23, the agent interacts with the environment at each time step t = 0, 1, 2, 3, 4, ... At each time step t the environment produces a state s_t ∈ S, where S contains all possible states. Based upon this state an action a_t ∈ A(s_t) is chosen and taken, where A(s_t) is the set of all possible actions in state s_t. At the next time step t + 1 the environment yields a reward R_{t+1} ∈ ℝ together with a new state S_{t+1}.

Figure 24: Starting from a state, the agent will choose an action. The next time step the agent will receive a reward and comes in a new state. This is done T times
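For concreteness, a minimal sketch of this interaction loop is given below; env and agent are hypothetical objects standing in for a concrete task and learner, so the method names are assumptions, not part of any particular library.

    # Sketch of the agent-environment loop of Figures 23 and 24.
    state = env.reset()                              # initial state s_0
    done = False
    while not done:
        action = agent.select_action(state)          # a_t chosen from A(s_t)
        next_state, reward, done = env.step(action)  # environment yields R_{t+1}, S_{t+1}
        agent.learn(state, action, reward, next_state)
        state = next_state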

Example 4.1.1. One of the best-known examples in Reinforcement Learning is the mountain car (Figure 25). The agent has to drive the car to the top of the mountain, but the car does not have the power to reach the goal position in one go. The agent therefore has to use gravity in order to get to the goal as quickly as possible. It can do this by driving up the hill, letting go, and driving backwards to gain momentum. The state of the mountain car

consists of the one-dimensional position on the track and the velocity of the car. The actions are driving forward, driving backward and doing nothing. The reward is negative at every time step unless the goal is reached, so by maximizing its (negative) total reward the agent learns to reach the goal in as few steps as possible.

Figure 25: Mountain car; image from (RL-Library, n.d.)

Example 4.1.2. Another widely used example in Reinforcement Learning is pole balancing, Figure 26 (Michie & Chambers, 1968; Barto, Sutton, & Anderson, 1983). A pole is mounted on a cart at its centre of mass, which allows the pole to be balanced at an exact point. The cart itself can only move left and right, and the pole can only be moved indirectly through the motion of the cart. The goal is to balance the pole in an upright position, which can be done by moving the cart back and forth. The state consists of the pole's angle and angular velocity. The actions are moving left and right, which applies a force that brings the pole towards a balanced position. The reward can, for example, be +1 for every time step until the cart fails.

Figure 26: Pole Balancing; image from (Anji, n.d.)

4.2 Rewards

The goal of Reinforcement Learning is to obtain the maximum reward over time. The agent receives the reward for an action at time step t + 1, because it can only observe the reward at the time step after the action has been taken. This can be formally written as:

G_t = R_{t+1} + R_{t+2} + \dots + R_T

where G_t is the expected total reward. The agent does not know the exact reward; it can only expect a certain reward. T is the final time step, at which the agent enters an end state. Some environments have a natural notion of an episode that starts and ends, like for example a maze environment (Figure 27). The agent starts on the left and at each time step can move to an adjacent square that is not blocked by a wall; it needs to find its way out of the maze. This is called an episodic task, because in each episode the agent performs an action a_t at every time step t until the episode terminates.

Figure 27: A maze world where the agent starts from the left and needs to find a way to get outside

An episodic task will eventually always reach a final state. When there is no terminal state the task is called a continuing task.¹ This means that the formulation of G_t above no longer holds, because there is no final time step T; G_t

can, however, easily be adapted by letting T go to infinity. An additional refinement of the expected reward is a discount factor, which determines whether the agent is interested in an immediate return or in future rewards.

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots)
    = R_{t+1} + \gamma G_{t+1}
    = \sum_{k=0}^{\infty} \gamma^k R_{t+1+k}

This means that the discount factor γ decides whether the agent seeks long-term, future reward or immediate reward. The discount factor is bounded by 0 ≤ γ ≤ 1. This is interpreted as follows: imagine that γ is equal to 0. Then G_t = R_{t+1}, meaning that the agent only cares about the reward it is about to receive. If γ = 1, rewards in the future are equally important as the immediate reward.

In most cases the reward scheme is not given, meaning the rewards are chosen by the designer of the implementation. In the mountain car example (Example 4.1.1), the reward scheme is always −1 until the car reaches the top of the mountain. This is not always the case; in this thesis the focus lies on the reward scheme given by Space Invaders itself (Section 5.2).

4.3 Markov Decision Process

The Markov Property states that whenever the agent is in a state s, that state contains all the information needed to determine the next state s' and its reward r'. From this information the agent can decide where to go in the future. When the reward and transition probabilities depend only on the current state and action, and not on the previously visited states, the problem has

¹ An example of a continuing task is an ongoing control problem, such as a process that must be regulated indefinitely: there is no terminal state that ends the interaction.

the Markov Property. It can thus be defined as:

P(R_{t+1} = r, S_{t+1} = s' \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t, A_t) = P(R_{t+1} = r, S_{t+1} = s' \mid S_t, A_t)

This states that the probability of the reward r and the next state s', given the entire history, is equal to the probability given only the current state and action, which is exactly what the Markov Property requires.

A Markov Decision Process is a Reinforcement Learning task that has the Markov Property. It consists of:

• Set of states S: S_0, S_1, ..., S_n

• Set of actions A: A_0, A_1, ..., A_n

• Transition function: T(s, a, s') = P(S_{t+1} = s' | S_t = s, A_t = a)

• Reward function: r(s, a, s') = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s']

The transition function T gives the probability of reaching a state s' given the current state and action. The reward function r gives the expected reward given the current state, action and next state.

Applying this to Example 4.1.1, the mountain car: one possible transition is going from the current state, in which the car stands still, via the action accelerate, to a next state in which the car is higher on the mountain. The reward scheme was designed such that only negative rewards are given unless the goal is reached. Since the car is in the start state and the action is accelerate, the expected reward is −1: it is the first time step and the goal has not been reached.

4.4 Value functions

A policy π describes the long-term behaviour of an agent: it specifies which action the agent selects in a state at any given time, with the aim of maximizing the reward. A policy thus maps each state onto a probability of taking each action, π(a|s). The value of being

in state s and thereafter following the policy π is denoted V^π(s) and is called the state-value function for policy π. This can be formally written as:

V^π(s) = E_π[G_t \mid S_t = s] = E_π\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \,\Big|\, S_t = s\Big]

This means that the value is the expected return given the state the agent is currently in. Equivalently, an action-value function can be defined for a policy π, denoted Q^π(s, a). The action-value function returns the expected return when starting from state s, taking action a, and thereafter following the policy π. The action-value function can thus be defined as:

Q^π(s, a) = E_π[G_t \mid S_t = s, A_t = a] = E_π\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \,\Big|\, S_t = s, A_t = a\Big]

Value functions give an indication of whether going into a state is a good or a bad option with regard to the future. These value functions can only be estimated from experience, and the only way to gain experience is to gather as much information as possible by traversing the environment.

The state-value function has a special, recursive relationship between the current state, the action taken and the successor state that follows from that action. The following equation is named the Bellman equation for state values. It looks at a state s and all the successor states s' that can follow from an action a. The same can be applied to state-action values.

V^π(s) = E_π\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \,\Big|\, S_t = s\Big]
       = E_π\Big[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+2+k} \,\Big|\, S_t = s\Big]
       = \sum_a \pi(s, a) \sum_{s'} T(s, a, s')\big[R(s, a, s') + \gamma V^π(s')\big]

Q^π(s, a) = E_π\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+1+k} \,\Big|\, S_t = s, A_t = a\Big]
          = \sum_{s'} T(s, a, s')\big[R(s, a, s') + \gamma V^π(s')\big]

The Bellman equation looks at a starting state and calculates, for every possible action, the successor states with their expected rewards. It then averages over all these possibilities, each weighted by its probability of occurring.

Since multiple policies exist, multiple state-value functions exist as well. To solve the task, the designer needs to find the optimal policy and the optimal state-value function. A policy π is at least as good as a policy π' if and only if

V^π(s) ≥ V^{π'}(s) \quad ∀s ∈ S

A policy is thus better when its state-value or action-value function is greater than or equal to that of every other policy. Among all policies there is at least one policy that is best, the optimal policy π*, with the associated optimal state-value function V* and optimal action-value function Q*. Note that an optimal policy is not necessarily unique, but the optimal value functions are. They are defined as:

V^*(s) = \max_π V^π(s) \quad ∀s ∈ S
Q^*(s, a) = \max_π Q^π(s, a) \quad ∀s ∈ S \text{ and } ∀a ∈ A

Example 4.4.1. Take the gridworld as an example, where an agent is placed on a grid and needs to find the goal, here indicated by a green square. The agent can only move right, left, up and down, and receives +100 when moving onto the goal. The optimal state values are shown in Table 3. The optimal policy π* corresponds to the shortest way to the goal; every optimal action is indicated by an arrow, and multiple arrows in one cell indicate that multiple optimal paths exist, as shown in Table 4. To be sure that it finds the optimal policy, the agent must visit every possible state. In the gridworld this is doable, but when there are millions of states this becomes exhausting, which is why these functions can also be approximated.

4.5 Action Selection

The problem remains which action to select and why to select a certain action. A naive way to select an action is always selecting the action with

Table 3: V*(s)

54   63   72   63   54
63   72   81   72   63
72   81   90   81   72
81   90  100   90   81
90  100    0  100   90

Table 4: π*(s) (the arrows indicating the optimal actions are not reproduced here)

Table 5: Gridworld Example
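A minimal value-iteration sketch for such a deterministic gridworld is given below. The discount factor and exact reward scheme used for Table 3 are not stated in the text, so γ = 0.9 and a +100 reward for the transition into the goal are assumptions; the resulting values follow 100·γ^(d−1) for a cell d steps from the goal and therefore only roughly match the table.

    import numpy as np

    rows, cols, goal, gamma = 5, 5, (4, 2), 0.9
    V = np.zeros((rows, cols))
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    for _ in range(100):                         # sweep until values settle
        for r in range(rows):
            for c in range(cols):
                if (r, c) == goal:
                    continue                     # terminal state keeps value 0
                best = -np.inf
                for dr, dc in moves:
                    nr = min(max(r + dr, 0), rows - 1)   # bump into the border
                    nc = min(max(c + dc, 0), cols - 1)
                    reward = 100 if (nr, nc) == goal else 0
                    best = max(best, reward + gamma * V[nr, nc])
                V[r, c] = best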

the highest Q-value, Q_t(A_t^*) = \max_a Q_t(a). This method always chooses the action that yields the highest estimated reward, which is also called exploitation: it only uses what the agent has already learnt and never explores other options. One of the disadvantages of this greedy method is that the agent will never find another possibility or another path that yields more reward or is perhaps shorter. One way to force the selection method to explore is to initialize the Q-values differently. A method that keeps exploiting while still exploring is the ε-greedy method, which with probability ε selects a random action and otherwise selects the greedy action. Here too there are different ways to tune the action selection: ε can be kept fixed over the episodes, or it can be kept high at the beginning, to force the agent to explore as much as possible, and reduced after a certain time t so that the agent shifts from exploration towards exploitation. A disadvantage of the ε-greedy method is that when it explores it chooses each action with the same probability, so it may pick a very good action but also an extremely bad one. Softmax action selection solves this by selecting actions with probabilities ranked by their estimated Q-values:

P(s, a) = \frac{e^{Q(s,a)/\tau}}{\sum_{i=0}^{n} e^{Q(s,i)/\tau}}

The parameter τ, the temperature, determines how much exploration takes place: the higher τ is, the more randomly the agent plays; the closer τ is to 0, the more greedily it behaves. As with ε, τ can be reduced over time.

Balancing the amount of exploration and exploitation is one of the important elements of learning. There is no need to always exploit the same path, because the first good path found is not necessarily the best path overall. The same goes for exploration: always exploring with random actions will never yield a good result, although an agent that explores extensively does get to know all possible paths. There is no research that declares one action selection method to be the best; both ε-greedy and softmax are widely used in Reinforcement Learning today. In current research ε-greedy is used more often, simply because the ε parameter is easy to understand and set, while the τ parameter requires knowledge of the action values and of the exponential function.
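A minimal sketch of the two selection rules discussed above is shown here; q_values is a hypothetical array of Q(s, a) estimates for one state, and the function names are illustrative only.

    import numpy as np

    def epsilon_greedy(q_values, epsilon):
        # With probability epsilon explore uniformly, otherwise exploit.
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

    def softmax_selection(q_values, tau):
        # High tau -> nearly random play, tau near 0 -> nearly greedy.
        prefs = np.exp(np.array(q_values, dtype=float) / tau)
        probs = prefs / prefs.sum()
        return int(np.random.choice(len(q_values), p=probs))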

4.6 Incrementing Q-values

When using action selection methods, a value is needed for each action, beyond the immediate reward. A simplistic way of obtaining such values is by averaging all rewards received.

Q_t(s, a) = \frac{R_1 + R_2 + R_3 + \dots + R_{K_a}}{K_a}

The rewards are averaged when the action a was selected K_a times before time step t. When the agent just starts, K_a is equal to 0, which makes Q(s, a) undefined; therefore Q-values are always initialized to some number, for example 0. The law of large numbers states that when K_a → ∞, Q(s, a) converges to Q*(s, a). This method is also called the sample-average method. As said, this is a fairly naive way of maintaining these values: to make it work, the computer needs to remember every reward in order to average them, and this only grows the longer the task lasts. The same goes for computational power; each time a new action is taken the entire average has to be recalculated, and with thousands of rewards for a single action in a single state this becomes expensive. One way to avoid this problem is to use incremental updates.

Q_{k+1} = \frac{1}{k} \sum_{i=1}^{k} R_i
        = \frac{1}{k} \Big(R_k + \sum_{i=1}^{k-1} R_i\Big)
        = \frac{1}{k} \big(R_k + (k-1)Q_k + Q_k - Q_k\big)
        = \frac{1}{k} \big(R_k + kQ_k - Q_k\big)
        = Q_k + \frac{1}{k} \big[R_k - Q_k\big]

The computer now only needs to remember the values Q_k and k, which makes the computational load a lot smaller. This incremental update can be generalized by using the following equation:

NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]

The difference between the Target and the OldEstimate can be seen as the error between the estimated value and the target. In Reinforcement Learning the step size is usually denoted α. When α is a constant with 0 < α ≤ 1, recent rewards are weighted more heavily than older rewards, and the resulting estimate is called a (recency-)weighted average.
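The update rule above translates directly into code; the sketch below is illustrative, with hypothetical variable names.

    def update_estimate(old_estimate, target, step_size):
        # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
        return old_estimate + step_size * (target - old_estimate)

    # Sample average:          q = update_estimate(q, reward, 1.0 / k)
    # Recency-weighted average: q = update_estimate(q, reward, alpha)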

4.7 Monte Carlo & Dynamic Programming

Monte Carlo methods used in Reinforcement Learning do not need full knowledge of an environment; they only need experience. They can even learn from simulated experience by sampling the environment, since only sample transitions have to be generated. Monte Carlo methods are based on averaging sample returns, which means the averages can only be calculated once episodes are completed; the tasks are therefore assumed to be finite and episodic. Monte Carlo methods can also be used to mimic policy iteration. The first phase is policy evaluation, where, given a policy π, the goal is to compute Q^π(s, a) or an approximation of it for all state-action pairs. These values can be estimated by averaging the sampled returns; when run long enough, Q will approximate Q^π.

The next phase is policy improvement, where a greedy policy is computed with respect to Q: the new policy returns, for a given state s, the action a that maximizes the state-action values. Monte Carlo methods are harder to use in non-episodic tasks because averaging is only done after an episode has finished. When the data has high variance, convergence is slower because more samples are needed. Monte Carlo is an unbiased method; bootstrapping, on the other hand, which comes from Dynamic Programming, is a biased learner because it updates after every single step and these updates are calculated from estimates. In finite and discrete cases the estimates nevertheless converge to their true values.

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T

vs.

R_t = r_{t+1} + \gamma \hat{V}(S_{t+1})

These equations show the difference between a Monte Carlo method (the first equation), which needs all rewards of an episode, and the bootstrapping method (the second equation), which only uses an estimate of an estimate.

4.8 Temporal Difference

Temporal Difference (TD) learning is a mix between Dynamic Programming and Monte Carlo methods. TD methods learn from experience, by sampling according to some policy π without any knowledge of the environment, and their updates are based on other estimates. TD methods only need the next time step to update, while Monte Carlo methods need the whole episode before updating.

V(S_t) ← V(S_t) + α[G_t − V(S_t)]

vs.

V(S_t) ← V(S_t) + α[R_{t+1} + γV(S_{t+1}) − V(S_t)]

The first method shown is a Monte Carlo method, because it must wait until it has the value G_t, which is only available after a whole episode. The second method is a TD method, because it can update after the next time step. It uses bootstrapping, estimating the Q-values using only the estimate for the next state, which is a useful feature that lowers the computational load.

Before going into algorithms, a distinction must be made between on-policy and off-policy methods. On-policy learning improves the policy the agent is currently following, while off-policy learning learns the value of a policy independently of the actions the agent actually takes.

4.8.1 Q-Learning

An example of off-policy learning is Q-learning (C. J. C. H. Watkins, 1989). The Q-values approximate the optimal action-value function independently of the policy being followed, which makes the method off-policy. Q-learning converges as long as all state-action pairs keep being visited and updated (C. J. Watkins & Dayan, 1992).

Algorithm 1 Q-Learning
1: Initialize all Q(s, a) for s ∈ S, a ∈ A
2: Repeat (for every episode):
3:   Initialize s
4:   Repeat (for each step of episode):
5:     Choose a from s using policy derived from Q (e.g., ε-greedy)
6:     Take action a, observe r, s'
7:     Q(s, a) ← Q(s, a) + α[r + γ max_a Q(s', a) − Q(s, a)]
8:     s ← s'
9:   until s is terminal

The algorithm works as follows. First all Q(s, a) values are initialized arbitrarily. For every episode the agent starts in some state s. For every step of that episode the agent chooses an action a using a policy such as ε-greedy, takes the action, and observes a new state s' and a reward r for going from state s to s'. The Q-value is then updated by the following rule: take the reward the agent received plus the maximum Q-value of the next state (the estimate of the future reward) multiplied by the discount factor, subtract the old Q-value, multiply the result by the learning rate and add it to the old Q-value. The agent then moves to the observed new state s' and the iteration is repeated from there.
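A minimal Python sketch of Algorithm 1 with a tabular Q function stored in a dictionary follows; env is a hypothetical environment with reset() and step() methods, so the interface is an assumption, not part of any specific framework.

    import random
    from collections import defaultdict

    Q = defaultdict(float)              # tabular Q(s, a), initialised to 0

    def q_learning_episode(env, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection derived from Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            # Off-policy target: the best estimated action in the next state.
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next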

4.8.2 SARSA

SARSA, previously named modified Q-learning (Rummery & Niranjan, 1994) and renamed SARSA by (Sutton, 1996), is an on-policy method. The name stands for State Action Reward State Action: the agent is in state s_1, chooses action a_1 and receives reward r; it then ends up in state s_2 after taking action a_1 and chooses its next action, a_2.

Algorithm 2 SARSA
1: Initialize all Q(s, a) for s ∈ S, a ∈ A
2: Repeat (for every episode):
3:   Initialize s
4:   Choose a from s using policy derived from Q (e.g., ε-greedy)
5:   Repeat (for each step of episode):
6:     Take action a, observe r, s'
7:     Choose a' from s' using policy derived from Q (e.g., ε-greedy)
8:     Q(s, a) ← Q(s, a) + α[r + γQ(s', a') − Q(s, a)]
9:     s ← s'; a ← a'
10:  until s is terminal

The SARSA algorithm starts in the same way as Q-learning: all Q-values are initialized arbitrarily. For every episode a start state s is chosen and an action a is immediately derived from a policy such as ε-greedy. The agent then loops until the state s is terminal. It takes the action a and observes the reward r and the new state s'. From this new state it chooses a new action a', again derived from a policy such as ε-greedy. The Q-value is then updated by taking the reward plus the discounted value of the new state-action pair, subtracting the old state-action value, multiplying the result by the learning rate and adding it to the old state-action value. The agent then moves on to the new state s' and the new action a'.
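A minimal sketch of the SARSA update is given below (hypothetical dictionary-based Q, as in the earlier Q-learning sketch); the only difference with Q-learning is that the target uses the action a' that will actually be taken.

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
        # On-policy target: the next action actually chosen by the policy,
        # not the maximum over all actions.
        target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])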

4.9 Eligibility traces

TD methods use the current reward together with an estimated value; Monte Carlo methods use the exact return, but only after the episode has finished. There is also a method in between, where the number of steps (or backups)

is chosen before the estimated value is used: the n-step method (C. J. C. H. Watkins, 1989).

G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})
G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})
G_t^{(3)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V(S_{t+3})
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^n V(S_{t+n})

The first equation is simply bootstrapping. The second equation is called the 2-step method, the third the 3-step method, and so on. Whenever n equals the number of steps remaining in the episode, the return is no longer an estimate but the actual return, which means the method is the Monte Carlo method.

With eligibility traces each state receives an extra variable e, the eligibility trace. When the agent enters a state s, the eligibility trace of that state is incremented; the traces of all other states decay.

e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma\lambda e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}

Figure 28 shows what happens to a single state: every time the state is visited its eligibility trace is incremented, and when the agent does not visit the state the trace automatically decays. This is governed by the decay parameter, denoted λ. As a result, frequently visited states are affected more by learning than other states. When λ = 0 the method reduces to bootstrapping, because only the current trace matters and all other traces are zero. When λ = 1 the method mimics the Monte Carlo methods.

Figure 28: Eligibility trace; image from (Sutton & Barto, 1998)

Eligibility traces can also be applied to SARSA, which is then called SARSA(λ). The idea of the original SARSA remains the same and the method is still on-policy, only now state-action values are updated in proportion to their eligibility trace, using the TD error:

δ = r_{t+1} + γV(S_{t+1}) − V(S_t)

The same error can also be calculated for Q(s, a) values instead of V(s).

Algorithm 3 SARSA(λ)
1: Initialize all Q(s, a) for s ∈ S, a ∈ A and e(s, a) = 0
2: Repeat (for every episode):
3:   Initialize s, a
4:   Repeat (for each step of episode):
5:     Take action a, observe r, s'
6:     Choose a' from s' using policy derived from Q (e.g., ε-greedy)
7:     δ ← r + γQ(s', a') − Q(s, a)
8:     e(s, a) ← e(s, a) + 1
9:     For all s, a:
10:      Q(s, a) ← Q(s, a) + αδe(s, a)
11:      e(s, a) ← γλe(s, a)
12:    s ← s'; a ← a'
13:  until s is terminal

The same can be applied to Q-learning, yielding Q(λ), with a single adaptation: as long as Q-learning follows the greedy action, the experience can be used, but not when a random, non-greedy action is selected. When a non-greedy action is selected, the eligibility traces are reset to zero.

Algorithm 4 Q-Learning(λ)
1: Initialize all Q(s, a) for s ∈ S, a ∈ A and e(s, a) = 0
2: Repeat (for every episode):
3:   Initialize s, a
4:   Repeat (for each step of episode):
5:     Take action a, observe r, s'
6:     Choose a' from s' using policy derived from Q (e.g., ε-greedy)
7:     a* ← argmax_b Q(s', b) (if a' ties for the max, then a* ← a')
8:     δ ← r + γQ(s', a*) − Q(s, a)
9:     e(s, a) ← e(s, a) + 1
10:    For all s, a:
11:      Q(s, a) ← Q(s, a) + αδe(s, a)
12:      If a' = a*:
13:        then e(s, a) ← γλe(s, a)
14:        else e(s, a) ← 0
15:    s ← s'; a ← a'
16:  until s is terminal

Sometimes better performance can be obtained by using replacing traces instead of the standard (accumulating) traces, where

e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \neq s_t \\ 1 & \text{if } s = s_t \end{cases}

Figure 29: Replacing traces; image from (Sutton & Barto, 1998)
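The two trace updates can be contrasted in a minimal sketch; e is a hypothetical dictionary mapping states (or state-action pairs) to their trace values and visited is the state just entered.

    def decay_traces(e, visited, gamma, lam, replacing=False):
        # Every trace decays by gamma * lambda at each step.
        for s in e:
            e[s] *= gamma * lam
        if replacing:
            e[visited] = 1.0                          # replacing trace
        else:
            e[visited] = e.get(visited, 0.0) + 1.0    # accumulating trace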

4.10 Function approximation

Previously it was assumed that all Q-values are stored in a table, in which each Q(s, a) pair has some value. This is a feasible method when the numbers of states and actions are small. If there are millions of state-action pairs this requires a lot of memory, but also a lot of time and data

to compute them accurately. Think for example of the difference between the state space of backgammon, about 10^20 states, and the state space of a robotic helicopter: the helicopter cannot map the whole world into a table and thus has a continuous state space. The solution to this problem is to generalize from previously visited states to the complete set of states, even those not yet visited. This generalization is also called function approximation: samples of the value function are used to construct an approximation of the whole function. From now on these functions are parametrized by a weight vector w;

\hat{v}(s, w) ≈ v_π(s)

\hat{q}(s, a, w) ≈ q_π(s, a)

This new function v̂(s, w) can be computed by a linear combination of features, by a neural network in which w are the weights, or by a decision tree in which w describes the split points and leaf values.

Learning such a function approximation can be done by gradient descent, where w = (w_1, w_2, ..., w_n)^T and the differentiable objective on v̂(s, w) is denoted J(w). At each time step the agent observes a state S_t and its true value under the policy, v_π(S_t). With these values the gradient can be calculated, minimizing the error and moving towards a local minimum. This is done by updating the weights in the direction in which the error decreases;

w_{t+1} = w_t − \frac{1}{2} \alpha \nabla_{w_t} \big[v_π(S_t) − \hat{v}(S_t, w_t)\big]^2
        = w_t + \alpha \big[v_π(S_t) − \hat{v}(S_t, w_t)\big] \nabla_{w_t} \hat{v}(S_t, w_t)

where α is the step size and ∇_{w_t} J(w_t) is the gradient, the vector of partial derivatives defined as;

\nabla_{w_t} J(w_t) = \Big(\frac{\partial J(w_t)}{\partial w_{t,1}}, \frac{\partial J(w_t)}{\partial w_{t,2}}, \dots, \frac{\partial J(w_t)}{\partial w_{t,n}}\Big)^T

The goal is thus to find a local minimum by updating the weights in the direction in which the error is lowest. A value function can also be represented as a linear combination of features and weights w.

52 This can be written as;

\hat{v}(s, w) = w^T x(s) = \sum_{i=1}^{n} w_i x_i(s)

where each state has a feature vector x(s) = (x_1(s), x_2(s), ..., x_n(s))^T with the same number of entries as there are weights. The gradient with respect to w is then;

\nabla_w \hat{v}(s, w) = x(s)

These features can be constructed using different methods. One example of such a method is coarse coding, in which the state lives in a continuous space, in this example a two-dimensional space (Figure 30). Each feature indicates whether the state lies inside a particular circle: the feature is 0 if the state is outside the circle and 1 if it is inside. Features can overlap, because the state can be inside multiple circles at once. Gradient descent updates the weights of all the circles the state lies in, and the approximate value function is affected at every point covered by those circles, with a greater effect where more circles overlap.

Figure 30: Coarse coding; image from (Sutton & Barto, 1998)
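A minimal sketch of the gradient-descent update for such a linear value function is shown below; x is assumed to be a binary coarse-coding feature vector (1 for every circle the state falls in, 0 otherwise), and the variable names are illustrative.

    import numpy as np

    def linear_value(x, w):
        # v_hat(s, w) = w^T x(s)
        return np.dot(w, x)

    def gradient_step(w, x, target, alpha):
        # For a linear approximator the gradient of v_hat w.r.t. w is x(s),
        # so only the weights of the active features are adjusted.
        return w + alpha * (target - linear_value(x, w)) * x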

Chapter 5 Experiments and results

5.1 ALE

The environment this thesis is based on is the Arcade Learning Environment (M. G. Bellemare, Naddaf, Veness, & Bowling, 2013), abbreviated ALE. It allows anyone to write AI agents that can interact with Atari 2600 games. ALE is built on top of Stella 1, an open-source Atari emulator. ALE exposes the Stella emulator to the user, who can gather all sorts of data, such as RAM and frame states, in parallel while the game is playing, and can also send data, such as actions, to the game.

The Atari 2600 console was introduced in 1977. Its hardware is rather simple compared to today's consoles: it has a CPU running at 1.19 MHz and 128 bytes of RAM. Games had a screen of 160 pixels wide and 210 pixels high with a maximum of 128 colors, so the screen contains 33,600 pixels in total. The ALE system allows an agent to observe the current game screen and/or the RAM state of the console. The advantage of frames is that they are human interpretable (Figure 31b), but a single frame provides the agent with only partial information, since it says nothing about the movement of objects. The RAM is not humanly interpretable, but contains more information and even holds the complete state of the game (Figure 31a). The console has a joystick with 18 different possible moves, though not all of them are used in every game. Because the console is, hardware-wise, not powerful, it can easily be emulated. This makes it an excellent testbed for AI agents because of the possibilities offered by the

1http://stella.sourceforge.net

frames, the RAM and the limited set of actions. This is in contrast to current games, which have millions of pixels and multiple gigabytes of RAM state. This does not mean that Atari 2600 games are easy to learn: take for example a game in which only 4 actions are valid; with the game running at 60 frames per second, looking only one second ahead already means searching through 4^60 different simulations.

(a) RAM (b) Frames

Figure 31: The difference between RAM and Frames

5.2 Space Invaders

The game Space Invaders (Figure 32) was chosen as a test bed for combining Reinforcement Learning with autoencoders. Space Invaders is one of the most used games as a test bed for RL agents (Mnih et al., 2013; M. G. Bellemare et al., 2013), and it is known that Reinforcement Learning agents can beat a human-level player at it (Mnih et al., 2015).

Space Invaders was first released in 1978 by Tomohiro Nishikado; since then many different adaptations have appeared. The player controls a space ship and can fire missiles. The goal of the game is to hit all layers of aliens and advance to the next level. The player can hide behind walls to shield himself from the lasers fired by the aliens, and can only move left, move right, shoot or do nothing. When the player misses a shot, he must wait until the laser is off the screen before he can fire the next missile. Once all rows of aliens are cleared the game goes to the next level, where the aliens move more quickly. The Command Alien Ship appears randomly and, when shot, yields more points than the basic alien ships. When the aliens come too close to the shields, the shields disappear, and when the aliens eventually come

too close to the player's ship, the game ends and restarts. The player has a total of 3 lives before the game starts from scratch. The player only receives a reward when hitting an alien spaceship.

Figure 32: Space Invaders screen

5.3 Reconstruction

When using autoencoders for feature extraction and dimensionality reduction, it is essential that they are trained properly and that they can reconstruct the input from their hidden layers. Using the Mean Square Error we can see how far the reconstruction of an autoencoder is from the input values:

MSE(\vec{x}, \vec{y}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2

where \vec{x} is the input, the original RAM state, and \vec{y} is the autoencoder's reconstruction of \vec{x}. The input values are gathered by running an agent with SARSA(λ) and saving all encountered RAM states. The agent plays a total of 3000 episodes; each episode consists of an a priori undetermined number of steps, which is only known once the agent has died three times in the game. At each step the agent receives a RAM state, which is then saved.
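The reconstruction error above is straightforward to compute; the sketch below is illustrative, with x an original RAM state and y the autoencoder's reconstruction of it.

    import numpy as np

    def mse(x, y):
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        return np.mean((x - y) ** 2)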

The blue dots show the results when an autoencoder is trained on an input of 128 RAM bytes. Autoencoders can be trained in two ways, directly and indirectly. The direct way goes from the input dimension of 128 to a

specified number of hidden nodes and back to 128 output nodes; a direct autoencoder thus has only 1 hidden layer. Figure 33 shows the result of going from 128 → number of nodes → 128, where each arrow denotes the connection between two layers. When training autoencoders with different numbers of hidden nodes, the number of hidden nodes is each time divided by 2. As can be seen, the fewer hidden nodes there are (the smallest configuration being 128 → 1 → 128 and the largest 128 → 128 → 128), the higher the MSE. This is a logical outcome: 1 hidden node cannot perform as well as 128 hidden nodes, as too much information is lost when going from a high number of dimensions to a very low number. Still, it can be argued that an error of 0.086 for 1 hidden node in 1 hidden layer is not that high. One way to counteract the loss of going from a large dimension to an immediately much lower one is to add multiple layers. The red dots show the MSE when going to a lower dimension through intermediate layers; for example, going from 128 bytes to 1 node becomes:

128 → 64 → 32 → 16 → 8 → 4 → 2 → 1 → 2 → 4 → 8 → 16 → 32 → 64 → 128

This also means that the training time of an autoencoder with multiple hidden layers is higher than that of a direct autoencoder. But it can be seen that when using multiple hidden layers the autoencoder can achieve a lower MSE than a direct autoencoder. Note that the indirect autoencoder 128 → 64 → 128 is omitted, since it does not use multiple layers.

Figure 33: Mean Square Error of an autoencoder trained from an input layer of 128 bytes to a smaller layer, directly and indirectly

When experimenting with RAM states and autoencoders it is also a good idea to train autoencoders not only on the byte form but also on the bit form; thus instead of using 128 bytes, the autoencoder is trained on an input of 1024 bits (Figure 34). The blue dots show the result of going from 1024 → number of nodes → 1024 and the red dots show the MSE with multiple hidden layers. The same conclusion can be drawn here as for the autoencoders with 128 bytes: the deeper an autoencoder goes, the more information is lost, which can be compensated for by using multiple layers. Comparing the settings with 128 bytes and 1024 bits as input layer, it can be seen that 128 bytes performs better at reconstructing the input after going to a lower dimension and back.

Figure 34: Mean Square Error of an autoencoder trained from an input layer of 1024 bits to a smaller layer, directly and indirectly

5.4 Flow of experiments

All experiments follow the same phases, but with different settings. The first phase is the preparation phase, where SARSA(λ) with the manual features is run for 3000 episodes and all encountered RAM states are captured. The second phase is the preprocessing phase, where the autoencoder is trained. The settings of the autoencoder must be specified: number of layers, number of hidden nodes, and so on. The number of epochs is set to 15; this is how many times all training examples are put through the autoencoder, so one epoch is one training cycle. Further, the batch size, the number of training examples put through before updating the weights, is set to the same number as the input dimension. So if the input dimension is 1024, from the 1024-bit RAM, the batch size is set to 1024. Additionally, the loss function and activation function can be specified. After the autoencoder is trained, the last phase starts: the agent receives a RAM state, which is passed through the trained autoencoder. Depending on the criteria a specified layer is extracted and

used as the feature vector. The agent uses these features and learns with them.

5.5 Manual features and basic RAM

In the papers of (Naddaf, 2010; M. G. Bellemare et al., 2013), manual feature construction is performed by concatenating the original RAM state with the pairwise logical AND of every possible pair of its entries. Figure 35 shows the difference between the two feature sets; it also shows a random baseline where the agent chooses a random action no matter which features are presented. The x-axis denotes the number of episodes played and the y-axis the rewards ALE returns for the chosen actions. As can be seen, the RAM states with the pairwise AND perform better than the basic RAM states. This pairwise AND feature construction is done manually: the designer of the algorithm must implement the pairwise construction, and many experiments have to be run before one can conclude that the pairwise AND performs better than the basic RAM states. The aim of this thesis is to skip this search for good features and let autoencoders handle the feature extraction. From now on, the RAM concatenated with its pairwise AND will be referred to as the manual features, and the standalone RAM as basic RAM.
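A minimal sketch of this manual feature construction is given below, assuming the RAM is provided in its bit form; the function name and exact layout are illustrative only.

    import numpy as np
    from itertools import combinations

    def manual_features(ram_bits):
        # Concatenate the RAM bits with the logical AND of every pair of bits.
        ram_bits = np.asarray(ram_bits, dtype=np.uint8)
        pairwise = [ram_bits[i] & ram_bits[j]
                    for i, j in combinations(range(len(ram_bits)), 2)]
        return np.concatenate([ram_bits, np.array(pairwise, dtype=np.uint8)])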

Figure 35: The difference between RAM combined with the pairwise logical AND and RAM alone

5.6 Difference between bits and bytes

When working with RAM states we can choose how to represent them: as bytes or as bits. Note that the bytes are normalized by dividing them by 255, so that their range lies in [0, 1]. With normalized input values, convergence is usually faster than with non-normalized data (Y. A. LeCun et al., 2012). Figure 36 shows the results when the input values are the normalized bytes with a hidden layer of 128 nodes; this simulates the identity function with the same number of input values. As can be seen, the autoencoder cannot translate the RAM bytes into a good feature vector. There is a wide range of possible reasons why the bytes do not yield a good feature extraction: the batch size may have been too low or too high, a denoising autoencoder might have helped, or different activation functions combined with multiple layers of the same number of hidden nodes. Of course, with enough time and effort spent tuning all the different hyperparameters we would eventually get a better result. This is not the goal of this thesis: we want an autoencoder that is as simple as possible, without tweaking too much, that still finds a good feature vector. Another possible explanation is that the agent simply does not have enough information available in the extracted feature vector, and that valuable information that was previously available in the basic RAM has been lost. The agent still learns better than playing randomly, but is not as good as the manual features.

Figure 36: Autoencoders on 128 bytes

Figure 37 shows the results of an autoencoder whose input is the RAM state represented as bits. The same autoencoder was used as for the bytes, with exactly the same settings. As can be seen, the agent could use all the extra information available, in contrast with the 128-byte autoencoder, and could actually learn from the extracted feature vector. With this confirmation, the rest of this thesis investigates the bit version of the RAM states.

Figure 37: Autoencoders on 1024 bits
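For reference, a minimal sketch of the two input representations compared above is shown here; the function names are illustrative and the bit ordering is whatever numpy's bit unpacking produces.

    import numpy as np

    def ram_as_bytes(ram):
        # 128 values normalised to [0, 1].
        return np.asarray(ram, dtype=float) / 255.0

    def ram_as_bits(ram):
        # The same 128-byte state unpacked into 1024 binary values.
        return np.unpackbits(np.asarray(ram, dtype=np.uint8))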

5.7 Comparing different activation functions

As said previously, choosing the right activation function can help create better results. Table 6 shows autoencoders that use different activation functions; for a more visual representation, see Appendix A, Figures 47, 48 and 49. It lists the average reward over the last 1000 episodes together with the standard deviation. Note that when an activation function is chosen, all layers use that same activation; it is also possible to use different activation functions in different layers, but this was not investigated. Each activation function has been tested with an autoencoder going from the input of 1024 to a chosen bottleneck and back to the original input size. Note that each layer is again each time divided by two, so an autoencoder depicted as 1024 → 256 uses three hidden layers: encoding 1024 → 512 down to the bottleneck of 256 and decoding back through 512 → 1024. As can be seen, a linear activation function performs best when encoding the original 1024-dimensional state to an encoded version of 512. Going deeper with a linear activation function yields, in this case, NaN values, because linear activation functions have no

limit and keep growing. This is in contrast with the sigmoid function, which is bounded between [0, 1], and ReLU, which forces neurons to be approximately 50% active. Note that a linear autoencoder is nearly equivalent to PCA, Principal Component Analysis. PCA is a linear technique that can be used for dimensionality reduction by finding the principal components, the directions in which the data is most spread out and has the largest variance. Linear autoencoders can only return a linear encoding because the activation is linear; therefore we will pursue non-linear activation functions further.

As can be seen, the ReLU activation does not perform well in contrast to the other activation functions. Sigmoid performs well when using a hidden layer with 1024 nodes, as does the linear activation. To confirm this statistically we used the Mann-Whitney U test, which does not assume the data is normally distributed. The first test was between the manual features and the basic features and resulted in a p-value of 9.63357008643e-07. When the p-value is smaller than 0.05, we can assume with 95% certainty that there is a difference between the manual features and the basic features, which is exactly what can be seen in Figure 35.
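A minimal sketch of this significance test is given below, assuming SciPy's implementation; the input arrays are hypothetical per-episode rewards for two settings.

    from scipy.stats import mannwhitneyu

    def differs(rewards_a, rewards_b, threshold=0.05):
        # Two-sided Mann-Whitney U test on two reward samples
        # (e.g. the last 1000 episode rewards of two settings).
        statistic, p_value = mannwhitneyu(rewards_a, rewards_b,
                                          alternative='two-sided')
        return p_value < threshold, p_value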

                 Linear            Sigmoid           ReLU
1024 → 1024      323.43 (± 47.11)  325.01 (± 43.96)  288.85 (± 44.41)
1024 → 512       323.83 (± 45.44)  290.53 (± 38.64)  230.74 (± 39.11)
1024 → 256       NA                250.09 (± 35.35)  267.08 (± 43.84)
1024 → 128       NA                250.9 (± 41.42)   191.86 (± 30.04)
1024 → 64        NA                152.75 (± 23.83)  116.1 (± 25.64)
Manual features  330.87 (± 35.26)
Basic            301.92 (± 36.39)

Table 6: Comparing different activation functions against the number of hidden layers and nodes

To show statistically that there is a difference between the manual features and the encoded feature extraction, we test the manual features against the different activation functions for 1024 → 1024, 1024 → 512 and 1024 → 256,

              Linear            Sigmoid            ReLU
1024 → 1024   0.0322843539108   0.00798380208929   1.6703515625e-06
1024 → 512    0.293818666313    6.30184822139e-08  1.6703515625e-06
1024 → 256    NA                5.73303143758e-07  2.99746184625e-06

Table 7: P-values of the Mann-Whitney U test

see Table 7. As can be seen, almost all p-values are lower than 0.05, which means we can assume with 95% certainty that these settings differ from the manual features. This does not mean that the features are better or worse, only that they are different. The exception is the autoencoder with a linear activation function and 1024 → 512, for which we cannot assume a difference.

5.8 Initializing Q-values

When designing SARSA it is important to initialize the Q-values well: the initialization influences the speed of learning and the efficiency of the algorithm (Koenig & Simmons, 1996). When the agent is placed in a setting, for example the gridworld, it has to find the goal before it can even search for a good policy. One way to do this is to let the agent explore the whole world: while exploring it adapts its Q-values, so it will remember whether going into a state with a certain action is good or not. If we have some prior knowledge we can adapt the initial Q-values via some rule; for example, if we know the goal of the setting, it is easier to set the Q-values higher or lower so as to reduce the exploration. For example:

Q(s, a) = \begin{cases} 0 & \text{if } s ∈ G, a ∈ A \\ q & \text{if } s ∈ S \setminus G, a ∈ A \end{cases}

where Q(s, a) is set to zero when the state is a goal state and to some value q otherwise. This forces the agent to learn from the given Q-values in a more optimistic way; the learning time and exploration will be less than when everything is initialized to the same number.

Unfortunately Space Invaders is a never-ending game, so a different value cannot be assigned to a goal state. Even if the goal state were known, we could not treat it differently from other states, because the features are blackbox and do not mean anything to a human. We can, however, initialize all Q-values to some other number and see how this evolves and whether the agent can learn more optimistically. All previous graphs and tables use Q-values initialized to zero. These experiments were run with the sigmoid activation, so we know our feature values lie in [0, 1]. Averaging the Q-values over the last 500 episodes of our best autoencoder gives an average value of about 0.57, so initializing the Q-values to −1 and 1 should affect the learning. Figure 38 shows the results when the Q-values are initialized to Q(s, a) = −1 and Figure 39 when initialized to Q(s, a) = 1. We immediately see the difference in how quickly the agent learns. Take for example, in Figures 38 and 39, the autoencoder trained from 1024 → 1024, thus learning the identity function. When the Q-values are −1 the agent learns incredibly slowly; it is even so slow that only after 3000 episodes does it reach the same level as random play. At episode 500, the agent whose Q-values were initialized to 1 already obtains 4 times more reward than the one initialized to −1. Generally speaking, the results tend towards the same value as for Q = 0, as long as the experiments run long enough.

Figure 38: Q = −1

Figure 39: Q = 1

Table 8 shows the average reward obtained with the different autoencoders, taken over the last 500 episodes. As can be seen, Q = −1 does not perform well; it takes too much time to learn. Between Q = 0 and Q = 1 there is a close competition: autoencoders trained to 1024 and 512 nodes perform better when the Q-values are initialized to 1, but the deeper autoencoders with multiple hidden layers tend to learn better with the initialization Q = 0. Since we are investigating how deep we can go with deep learning before losing too much information in our unsupervised feature extraction method, we will from now on continue with Q-values initialized to 0.

             Q = −1            Q = 0             Q = 1
1024 → 1024  64.78 (± 17.71)   325.01 (± 43.96)  267.42 (± 37.49)
1024 → 512   216.06 (± 31.57)  290.53 (± 38.64)  296.39 (± 40.5)
1024 → 256   152.03 (± 24.25)  250.09 (± 35.35)  253.82 (± 39.31)
1024 → 128   239.04 (± 35.52)  250.9 (± 41.42)   238.25 (± 37.14)
1024 → 64    158.91 (± 26.29)  152.75 (± 23.83)  150.84 (± 29.8)

Table 8: The difference between different initial Q-value settings

5.9 Pretraining and extracting other layers

In the previous experiments only the bottleneck layer was used as the extracted features. But since we are experimenting with deep learning, and thus with multiple layers, it can also be useful to train towards a very small bottleneck and then extract a different layer than the bottleneck itself. This was also done in previous research (Stadie, Levine, & Abbeel, 2015), where the layer taken was not the bottleneck. Figure 40 shows this visually: the third hidden layer, marked with the red box, is extracted instead of the intended bottleneck.

Figure 40: Example of an autoencoder with another layer extracted than the bottleneck

When going into deep learning it is also a good idea to pretrain the network. Pretraining means that each layer is trained separately and the layers are then stacked together. For example, if we want a pretrained autoencoder from 1024 → 256, we first train an autoencoder from 1024 → 512. All its weights are saved, together with the encoded form of the input layer, so that the new input dimension is 512. The next step is to create an autoencoder from 512 → 256, which is trained on the new, encoded input features. Afterwards a whole new autoencoder is created from the weights saved for each layer. This autoencoder can then be fine-tuned

by training it again as a whole. Note that this is very time-consuming, because multiple autoencoders have to be trained.
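A minimal sketch of this greedy layer-wise pretraining is given below, assuming Keras; the function name, optimizer and training settings are assumptions, and each stage is trained on the encodings produced by the previous one before the saved encoders are stacked and fine-tuned.

    from keras.models import Sequential
    from keras.layers import Dense

    def pretrain_layer(inputs, n_hidden):
        # Train one shallow autoencoder n_in -> n_hidden -> n_in and return
        # an encoder for this layer plus the encoded inputs for the next stage.
        n_in = inputs.shape[1]
        ae = Sequential()
        ae.add(Dense(n_hidden, activation='sigmoid', input_dim=n_in))
        ae.add(Dense(n_in, activation='sigmoid'))
        ae.compile(optimizer='adam', loss='mse')
        ae.fit(inputs, inputs, epochs=15, batch_size=n_in, verbose=0)
        encoder = Sequential()
        encoder.add(Dense(n_hidden, activation='sigmoid', input_dim=n_in))
        encoder.layers[0].set_weights(ae.layers[0].get_weights())
        return encoder, encoder.predict(inputs)

    # X has shape (samples, 1024): pretrain 1024 -> 512 -> 256, then stack the
    # saved encoders and fine-tune the whole network end to end.
    # enc1, H1 = pretrain_layer(X, 512)
    # enc2, H2 = pretrain_layer(H1, 256)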

Figure 41: Pretraining with extraction of layer 512

Figure 41 shows autoencoders trained with pretraining towards a very small layer, where each time the layer with 512 nodes is extracted. Boxplots are shown for the last 1000 episodes, together with boxplots of the basic RAM, the manual features and random play, with their average lines, for comparison. As can be seen, the deeper the autoencoder goes (towards layers of 32, 16, 8 and 4 nodes), the more information is lost, which results in rewards that do not compare well to those of the manual features and basic RAM. Pretraining did, however, help in training the autoencoder 1024 → 64: its performance is better than the basic RAM but still below the manual features. See Figure 50 in Appendix A for the detailed plot.

Figure 42 shows the results of training towards a layer with 4 nodes, which means there are a total of 8 possible layers that can be extracted. When training towards a layer with 4 features and extracting those 4 features, the score is not good; too much information is lost in going from 1024 possible

features to only 4. But when the same autoencoder is used to extract a layer with more hidden nodes than these 4, it yields more information and a higher result. The reason that the 512-node layer does not yield a higher score than simply training to a single layer of 512 nodes is the training error: as mentioned previously, the deeper a network is trained, the more information is lost (Section 5.3).

Figure 42: Pretraining with extraction to a hidden layer of 4 nodes

A more detailed overview of all the autoencoders with all their possible extracted layers is given in Appendix A, Table 9, with their results and standard deviations over the last 1000 episodes.

As suggested by (Srivastava et al., 2014), adding dropout to a deep network can prevent the network from overfitting. Recall that when the network is trained on samples it tries to fit those samples as well as possible; when it can mimic the training samples almost perfectly but cannot mimic the test samples, or new samples coming from our agent, it is overfitting. By adding dropout, and thus randomly dropping nodes and their connections, the network is forced to learn the samples

through different nodes and connections. Figure 43 shows what happens to the performance when adding dropout. With fewer hidden layers, and thus fewer hidden nodes, the rewards gained by the agent become worse than before. But the autoencoders 1024 → 32, 16, 8 perform better than before: those networks were probably overfitting and trying to recreate all training samples exactly, which a dropout of 30% avoids. The boxplots do show that the autoencoder 1024 → 256 : 512 with dropout has a lower reward after 3000 episodes than the same autoencoder without dropout, but Figure 44 shows that its learning curve (the black line) has not converged and is still increasing. This does mean that adding dropout slows down learning, both for the autoencoder and for the agent.

Figure 43: Pretraining with extraction of layer 512 with dropout

Figure 44: Pretraining with extraction of layer 512 with dropout

5.10 Combination of RAM and layer

Combining the raw RAM with the encoded version of the RAM can tell us how much the encoded version contributes. Figure 45 shows the results; for a more detailed plot see Figure 51 in Appendix A. Adding the RAM state gives a boost to a poorer feature extraction. Note that it is important that the RAM state lies in [0, 1], because the sigmoid activation also limits its values to [0, 1]. With a weaker feature extraction the original RAM state takes over and is used instead of the features extracted by the autoencoder. Figure 45 also shows the difference between the boxplots of 1024 → 512 + RAM and 1024 → 512: the combination of autoencoder features and RAM performs slightly better than an agent that uses only the autoencoder features. This means that the autoencoder has not captured all the valuable information that was in the RAM; if it had, the performance would have been the same. It can be argued, though, that the difference is minimal, so the autoencoder has captured most of the valuable information.

Figure 45: Combining the original layer with the encoded version

5.11 Visualizing high dimensional data

It is also possible to visualize our high-dimensional data using a technique called t-SNE, t-Distributed Stochastic Neighbor Embedding (Van der Maaten & Hinton, 2008). This technique maps the high-dimensional data onto a two-dimensional space by looking for states that are very similar. We take our best autoencoder, 1024 → 512 with the sigmoid activation, and save all the encoded states; these are then mapped to a two-dimensional space with t-SNE, resulting in a scatter plot. All points are then colored using:

colors = max(φ · θ)

where φ is the encoding of the RAM state and θ the state-action weights. φ has dimension (samples × nodes), where nodes is the vector of values from the autoencoder encoding, and θ has dimension (nodes × actions), where actions are the possible actions the agent can

take. Taking the dot product gives an array of dimension (samples × actions); taking the maximum of each row then gives a one-dimensional array containing the maximum Q-value for each input state. Figure 46 shows the result for the last 10,000 RAM states, encodings and state-action values. As can be seen, there are clusters of the same color: red, some blue-ish and even some green. This means that there are RAM states that are comparable and are closely matched by the states coming from the autoencoder. This is an indication that the features we use to learn the values are in fact relevant features for the task, even though the values are not used to learn the features.

Figure 46: t-SNE embedding of the last 10,000 encoded RAM states, colored by maximum Q-value

Chapter 6 Conclusions

We have developed a method for unsupervised feature extraction that outperforms the use of raw input features and almost matches the manual feature encoding. Our method is based on autoencoding neural networks that learn a compressed representation of the input data. We have implemented several autoencoder-based approaches and compared them empirically, and a number of conclusions can be drawn from these experiments. The non-linear autoencoder is in this case not better than a linear one. The linear autoencoder can compete with the manual features, but a PCA method would likely have yielded the same results. Investigating different activation functions does pay off, because, as the graphs show, they make a large difference. When finetuning autoencoders that reduce a large input dimension to a very small one through many layers, it is a good idea to add pretraining and dropout; these mechanisms are needed so that the autoencoder does not overfit on the training data. The visualization of the autoencoder indeed shows some clusters, so the autoencoder does find a representation in which the RAM input is well represented by the encoded states together with the SARSA(λ) values.

When using autoencoders as a feature extraction method, different layer sizes, activation functions and even different input representations must be investigated to obtain a wide range of candidates from which to choose the best autoencoder. This research shows that even when working on a black box of data, since RAM is not humanly interpretable, it is possible to obtain better results than with the plain features.

6.1 Future work

This thesis is entirely based on RAM states. Because RAM states are a black box, it is difficult to see or interpret what happens. We know a RAM state contains the entire state of a game: where the agent is, whether the laser has been fired and in what direction. Unfortunately, it is practically impossible to extract these things from the RAM state. This is in contrast with frames. ALE also offers the possibility to receive frames, which consist of pixels with different color values. In Atari 2600 games each color corresponds to a specific item, for example green for the player's ship and orange for the shields. These are useful features that could support better learning, for instance by removing the background and the static colors such as the score and the khaki base. However, when the agent receives pixels it does not observe the entire state of the game. For example, in Figure 32 the agent receives the frame but cannot determine from that single frame where the laser is going: the laser could have been fired by the agent itself a few time steps back, or it could come from the aliens. To overcome this problem, multiple frames can be stacked instead of using a single frame, whereas in this thesis only one RAM state per time step was used.
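A minimal sketch of such frame stacking is given below. The buffer size of four and the 84 × 84 frame shape are arbitrary assumptions for illustration; they are not values used or evaluated in this thesis.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keep the last `k` frames and expose them as a single observation."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the buffer with copies of the first frame of an episode.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, frame):
        # Push the newest frame; the oldest one drops out automatically.
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Stacked view, shape (k, height, width): motion becomes visible
        # as differences between the k slices.
        return np.stack(list(self.frames), axis=0)

# Dummy usage with random 84x84 "frames":
stack = FrameStack(k=4)
obs = stack.reset(np.zeros((84, 84)))
obs = stack.step(np.random.rand(84, 84))   # obs.shape == (4, 84, 84)
```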

Appendices

Appendix A Extended graphs and tables

Figure 47: Autoencoders with multiple hidden layers with a Linear activation function (rewards per episode for 1024 → 512, 1024 → 1024, the manual features and the random agent)

Figure 48: Autoencoders with multiple hidden layers with a Sigmoid activation function (rewards per episode for 1024 → 16/32/64/128/256/512/1024, the manual features and the random agent)

Figure 49: Autoencoders with multiple hidden layers with a ReLU activation function (rewards per episode for 1024 → 64/128/256/512/1024, the manual features and the random agent)

Figure 50: Pretraining with extraction of layer 512 (rewards per episode for networks 1024 → 4 up to 1024 → 256 with layer 512 extracted, the manual features and the random agent)

Figure 51: Combining the original layer with the encoded version (rewards per episode for the 1024 → 64/128/256/512 encodings combined with RAM, the manual features, the random agent and the plain RAM)

            1024 → 256        1024 → 128        1024 → 64         1024 → 32         1024 → 16         1024 → 8          1024 → 4
Layer 512   291.67 (± 33.03)  297.2 (± 35.01)   306.86 (± 38.53)  253.74 (± 32.01)  251.57 (± 30.44)  249.99 (± 32.69)  236.42 (± 29.71)
Layer 256   294.01 (± 37.12)  289.75 (± 35.69)  270.12 (± 34.52)  258.19 (± 30.82)  240.92 (± 33.65)  247.75 (± 34.49)  251.18 (± 31.93)
Layer 128   -                 272.67 (± 34.38)  232.15 (± 30.43)  233.08 (± 30.86)  242.41 (± 31.82)  210.0 (± 32.45)   238.56 (± 29.51)
Layer 64    -                 -                 228.96 (± 27.87)  219.76 (± 26.28)  249.08 (± 35.29)  204.68 (± 28.71)  215.38 (± 28.74)
Layer 32    -                 -                 -                 240.66 (± 30.14)  223.82 (± 28.25)  213.47 (± 28.38)  212.85 (± 27.6)
Layer 16    -                 -                 -                 -                 238.78 (± 32.97)  248.57 (± 31.52)  185.06 (± 27.21)
Layer 8     -                 -                 -                 -                 -                 191.69 (± 31.47)  153.39 (± 23.84)
Layer 4     -                 -                 -                 -                 -                 -                 145.6 (± 29.45)

Table 9: Training to a specific target layer (columns) and extracting a chosen layer (rows)

Chapter 7 Bibliography

Anji. (n.d.). Pole balance. Retrieved April 29, 2016, from http://anji.sourceforge.net/polebalance.htm
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. Systems, Man and Cybernetics, IEEE Transactions on, (5), 834–846.
Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013, June). The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2 (1), 1–127.
Breiman, L. (1996). Bagging predictors. Machine learning, 24 (2), 123–140.
Campbell, M., Hoane, A. J., & Hsu, F.-h. (2002). Deep Blue. Artificial intelligence, 134 (1), 57–83.
Collobert, R. & Weston, J. (2008). A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th international conference on machine learning (pp. 160–167). ACM.
Cruz, J. A. & Wishart, D. S. (2006). Applications of machine learning in cancer prediction and prognosis. Cancer informatics, 2.
Freund, Y. & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55 (1), 119–139.
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In International conference on artificial intelligence and statistics (pp. 315–323).

Google. (n.d.). Google self-driving car project. Retrieved April 29, 2016, from https://www.google.com/selfdrivingcar/reports/
Hinton, G. E. [Geoffrey E] & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313 (5786), 504–507.
Hinton, G. E. [Geoffrey E.] & Salakhutdinov, R. R. (2008). Using deep belief nets to learn covariance kernels for gaussian processes. In J. C. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances in neural information processing systems 20 (pp. 1249–1256). Curran Associates, Inc.
Koenig, S. & Simmons, R. G. (1996). The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning, 22 (1-3), 227–250.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems 25 (pp. 1097–1105). Curran Associates, Inc.
LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K.-R. (2012). Efficient backprop. In Neural networks: tricks of the trade (pp. 9–48). Springer.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521 (7553), 436–444.
RL-Library. (n.d.). Mountain car. Retrieved April 29, 2016, from http://library.rl-community.org/wiki/Mountain Car (Java)
Makhzani, A. & Frey, B. (2013). K-sparse autoencoders. arXiv preprint arXiv:1312.5663.
Michie, D. & Chambers, R. A. (1968). Boxes: an experiment in adaptive control. Machine intelligence, 2 (2), 137–152.
Minsky, M. & Papert, S. (1969). Perceptrons. MIT press.
Mitchell, T. (1997). Machine learning. McGraw-Hill International Editions. McGraw-Hill.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., . . . Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518 (7540), 529–533.
Naddaf, Y. et al. (2010). Game-independent AI agents for playing Atari 2600 console games (Doctoral dissertation, University of Alberta).

Nair, V. & Hinton, G. E. [Geoffrey E]. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10) (pp. 807–814).
Ng, A. (2011). Sparse autoencoder. CS294A Lecture notes, 72, 1–19.
Quinlan, J. R. (1987). Simplifying decision trees. International journal of man-machine studies, 27 (3), 221–234.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65 (6), 386.
Rummery, G. A. & Niranjan, M. (1994). On-line Q-learning using connectionist systems.
Sammut, C. & Webb, G. I. (2011). Encyclopedia of machine learning. Springer Science & Business Media.
Schaeffer, J., Culberson, J., Treloar, N., Knight, B., Lu, P., & Szafron, D. (1992). A world championship caliber checkers program. Artificial Intelligence, 53 (2), 273–289.
Schmidhuber, J. (2015). Deep learning in neural networks: an overview. Neural Networks, 61, 85–117.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., . . . Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529 (7587), 484–489.
Skinner, B. F. (1938). The behavior of organisms: an experimental analysis.
Skinner, B. F. (1948). Superstition in the pigeon. Journal of experimental psychology, 38 (2), 168.
Skinner, B. F. (1951). How to teach animals. Freeman.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15 (1), 1929–1958.
Stadie, B. C., Levine, S., & Abbeel, P. (2015). Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814.
Sutton, R. S. (1996). Generalization in reinforcement learning: successful examples using sparse coarse coding. Advances in neural information processing systems, 1038–1044.
Sutton, R. S. & Barto, A. G. (1998). Reinforcement learning: an introduction. MIT press.
Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural computation, 6 (2), 215–219.

Thorndike, E. L. (1911). Animal intelligence: an experimental study of the associative processes in animals.
Todes, D. P. (2002). Pavlov’s physiology factory: experiment, interpretation, laboratory enterprise. JHU Press.
Trier, Ø. D., Jain, A. K., & Taxt, T. (1996). Feature extraction methods for character recognition - a survey. Pattern Recognition, 29 (4), 641–662.
Van der Maaten, L. & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on machine learning (pp. 1096–1103). ACM.
Watkins, C. J. & Dayan, P. (1992). Q-learning. Machine learning, 8 (3-4), 279–292.
Watkins, C. J. C. H. (1989). Learning from delayed rewards (Doctoral dissertation, University of Cambridge England).
