Ghent University

Master Thesis

Exploration of deep autoencoders on cooking recipes

Author: Lander Bodyn
Promoter: Prof. Dr. Christophe Ley
Tutor: Ir. Michiel Stock
Co-promoter: Prof. Dr. Willem Waegeman

A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in Computational Statistics

Department of Applied Mathematics, Computer Science and Statistics

January 2017

GHENT UNIVERSITY

Abstract

Master of Science in Computational Statistics

Exploration of deep autoencoders on cooking recipes

by Lander Bodyn

Deep autoencoders are a form of deep neural networks that can be used to reduce the dimensionality of datasets. These deep networks can sometimes be very hard to train [1]. The gradient descent algorithm has been explored to train deep autoencoders on a dataset of cooking recipes. Minibatches, momentum and pretraining were added as extensions to improve the gradient descent algorithm. The performance of the deep autoencoders for data reduction to two dimensions was compared to that of singular value decomposition. The best deep autoencoder model obtained a cross entropy loss of 0.048, much lower than the cross entropy loss of 0.066 for singular value decomposition. From the two reduced dimensions, the regions of the recipes were predicted using the KNN and QDA algorithms. For the deep autoencoder models, the best prediction accuracy was 65.4%, outperforming the best prediction accuracy of singular value decomposition, 55.4%. The best prediction accuracy on the raw dataset was 69.8%, suggesting that the deep autoencoders maintain the structure of the regions very well in two dimensions. Using a deep autoencoder with data reduction to 100 dimensions, the prediction accuracy was 72.0%, suggesting deep autoencoders might have some usefulness for representation learning on this dataset.

Such techniques can also be used as recommender systems, using collaborative filtering. Deep autoencoder models were optimized to have the best retrieval rank of an ingredient that was either removed from or added to an existing recipe. De Clercq et al. [2] have built two similar recommender models on the same dataset: a non-negative matrix factorization and a two-step kernel ridge regression model. The deep autoencoder (mean rank = 25.2) outperforms the non-negative matrix factorization (mean rank = 33.0) and comes close in performance to the two-step kernel ridge regression (mean rank = 23.6).

Acknowledgements

I would first like to thank my promoter Prof. Dr. Christophe Ley from the Faculty of Sciences at Ghent University. He instantly accepted my plan to do a thesis in deep learning and helped me by proofreading many parts of the thesis, suggesting which parts I should explain more clearly. While my promoter is not accustomed to the field of deep learning, he helped me find a co-promoter who could guide me in the practical parts of the thesis.

This brought me to my co-promoter Prof. Dr. Willem Waegeman and supervisor Ir. Michiel Stock from the Faculty of Bioengineering at Ghent University. I would like to thank both for coming up with a very interesting thesis subject, proof-reading several parts of the thesis and continually guiding me during the thesis.

I also want to thank my friend Giancarlo Kerg, who inspired me to start my master in Computational Statistics, as a foundation to move towards the field of and deep learning in specific.

I also want to thank the company Yazzoom, where I could do an internship in deep learning during my thesis. Some of the skills I learned at Yazzoom helped me to make progress in my thesis.

Finally, I want to thank my family, who have supported me throughout the whole process. Special thanks to my grandma, who made my lunch and dinner every day and my parents, who endured all the fluctuations in my mood during the writing of my thesis.

Contents

Abstract
Acknowledgements
Contents

1 Introduction
  1.1 Theoretical background
  1.2 Overview of the thesis
  1.3 The cooking recipes dataset

2 Methods
  2.1 From artificial intelligence to deep learning
    2.1.1 Machine learning
    2.1.2 Representation learning
    2.1.3 Deep learning
  2.2 Deep autoencoders
    2.2.1 Network architecture
    2.2.2 Singular value decomposition for dimension reduction
    2.2.3 Deep autoencoders for dimension reduction
    2.2.4 Deep autoencoders for representation learning
    2.2.5 Deep autoencoders for collaborative filtering
  2.3 Training the network with gradient descent
    2.3.1 Local minima
    2.3.2 The vanishing gradient problem
    2.3.3 Initialisation of the network parameters
    2.3.4 Minibatch gradient descent
    2.3.5 Momentum
  2.4 Optimisation of the hyperparameters
    2.4.1 The gradient descent hyperparameters
      2.4.1.1 δ
      2.4.1.2 Batchsize
      2.4.1.3 Inertia α
      2.4.1.4 Initialisation range
    2.4.2 The network architecture
  2.5 Python and the Theano package
    2.5.1 Backward propagation of the gradient
    2.5.2 Other packages

3 Results
  3.1 Training the autoencoders
    3.1.1 Adding extensions to the gradient descent algorithm
    3.1.2 Plateaus
  3.2 Comparing data reduction methods
    3.2.1 Singular value decomposition
    3.2.2 Autoencoders
  3.3 Prediction of the regions
  3.4 Collaborative filtering for recipe creation
    3.4.1 Reconstruction of the removed ingredient
    3.4.2 Elimination of the added ingredient

4 Conclusion and discussion
  4.1 Conclusion
  4.2 Discussion

A Admission for circulating the work

Bibliography

Chapter 1

Introduction

1.1 Theoretical background

An autoencoder is a type of artificial neural network. When a neural network has several hidden layers, the network is called a deep network. The gradient descent algorithm is currently the dominant way of training neural networks. It can however sometimes be difficult to train neural networks using the gradient descent algorithm; this is especially true for deep autoencoders [1].

Autoencoders are designed to reduce the dimensionality of the dataset while minimizing a reconstruction error. They can be seen as a non-linear extension of the linear data reduction method singular value decomposition. Data reduction methods are useful to obtain visualisations of the data in two or three dimensions. Dimensionality reduction also has other applications. For example, the reduced features can be more suitable for a machine learning task than the original features.

Since the increase in availability of datasets of cooking recipes online, machine learning is starting to play a prominent role in tasks such as food preference modelling. Having an algorithm that could combine leftover ingredients to create a good recipe would be a useful application. De Clercq et al. [2] built two such recommender systems on a dataset containing the ingredients of recipes. For the recommender systems, the authors used a non-negative matrix factorization model and a two-step kernel ridge regression model. Deep autoencoders can also be used as a recommender system: in order to reduce the ingredients of the recipes, meaningful features of the recipes will have to be learned. A selection of ingredients can be reconstructed by the autoencoder, after which the selection will resemble the recipes from which the autoencoder has learned its parameters.

1.2 Overview of the thesis

In the thesis, it was explored how deep autoencoders can be optimally trained with the gradient descent algorithm on a dataset of cooking recipes. To speed up the gradient descent algorithm and improve its performance, two extensions were added to the algorithm: minibatches and momentum. Aside from improving the gradient descent algorithm itself, pretraining of the network parameters was implemented as another tool to facilitate convergence to a good solution.

The deep autoencoder models were compared to singular value decomposition (SVD) for the purpose of data reduction. The performance of the models was measured using a reconstruction error. It was also examined how well both methods maintained the structure of the data, by visually checking if recipes with similar regions of origin lay close together on the biplots of the reduced features. The regions of the recipes were then predicted, using the KNN and QDA algorithms on the reduced features. These predictions might even be better than those of prediction models using the original dataset: data reduction algorithms can possibly make the data more suitable for the prediction task. The thesis also explored the use of deep autoencoder models as recommender systems. The same dataset and performance measures as De Clercq et al. were used, to enable comparison with their recommender models [2].

1.3 The cooking recipes dataset

The data for the thesis was obtained from Ahn et al. [3]. Recipes with fewer than three ingredients were removed, as was done for the recommender systems of De Clercq et al. [2], in order to enable comparison with the autoencoder recommender system. The cleaned dataset contains 55001 different recipes using 381 ingredients. Each recipe was represented by a binary vector: ones denote the presence of ingredients, zeros denote the absence of ingredients. As each recipe contains only a small selection of the ingredients, the data matrix is a sparse matrix with a filling degree of 2.16%.
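The binary representation and the filling degree can be illustrated with a minimal sketch. The ingredient names and the three toy recipes below are invented for illustration and are not taken from the dataset; `encode_recipes` and `filling_degree` are hypothetical helper names.

```python
import numpy as np

# Hypothetical toy vocabulary and recipes; the real dataset has 381 ingredients.
ingredients = ["butter", "egg", "flour", "garlic", "onion", "tomato"]
recipes = [
    ["butter", "egg", "flour"],
    ["garlic", "onion", "tomato"],
    ["egg", "flour", "onion", "tomato"],
]


def encode_recipes(recipes, ingredients):
    """Return a binary matrix: one row per recipe, one column per ingredient."""
    index = {name: j for j, name in enumerate(ingredients)}
    matrix = np.zeros((len(recipes), len(ingredients)), dtype=np.int8)
    for i, recipe in enumerate(recipes):
        for name in recipe:
            matrix[i, index[name]] = 1
    return matrix


def filling_degree(matrix):
    """Fraction of entries that are equal to one."""
    return matrix.mean()


X = encode_recipes(recipes, ingredients)
print(X)
print(f"filling degree: {filling_degree(X):.2%}")
```

With 381 ingredients, a filling degree of 2.16% corresponds to roughly 8 ingredients per recipe on average (0.0216 × 381 ≈ 8.2).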

The region of origin of the recipes is included in the dataset. In total there are eleven regions. Recipes of North American origin are the largest category, taking up 73.4% of the recipes. Due to the short history of the North American recipes, most recipes of North American origin are imported versions of recipes from all over the world. Because of this, the North American recipes will be removed for the prediction of the region of origin but not for the training of the autoencoder.

Of the 55001 recipes, 2500 will be set aside for validation and 2500 for testing. The remaining recipes are used for training. The validation set will be used to determine the convergence of the gradient descent algorithm as well as to select the optimal model for the collaborative filtering. The test set will be used to test the performance of the collaborative filtering. For the prediction of the origin, the dimensions of the whole dataset (without the North American recipes) will be reduced with the autoencoder. This dataset will then be split into 70% training data and 30% test data for the supervised machine learning algorithms. There is no problem in using the training and validation data of the autoencoder for the supervised machine learning problem: the autoencoder is an unsupervised algorithm that does not require the values of the regions for training.

Chapter 2

Methods

2.1 From artificial intelligence to deep learning

Artificial intelligence (AI) is the field of making computers intelligent. Since programmable computers were first conceived, people have been wondering whether such machines might become intelligent. In the early days of artificial intelligence, rapid progress was made in logical tasks that can be easily defined with mathematical rules. Since humans are typically not very good at these tasks, it did not take very long before the computer started to outperform humans at those tasks. One such task is the logical board game chess, in which the IBM supercomputer Deep Blue defeated the then world champion Garry Kasparov in 1997.

Ironically, many tasks that seem trivial for humans, like processing visual and auditory information, are very hard for a computer to solve. The real challenge in artificial intelligence turned out to be solving these kinds of intuitive problems. It is only recently that some of these problems have been solved: for example, in March 2016 the AlphaGo program of DeepMind managed to defeat Lee Sedol, the world champion of go [4]. Go is a board game similar to chess, but with a lot more board positions. Being a good player at go requires a lot of spatial and intuitive thinking, a skill that is very hard to write down in logical rules.

There have been many approaches to solving the challenges in artificial intelligence. One of them is the knowledge-based approach. The knowledge-based approach tries to make computers intelligent by hard coding different knowledge rules by hand. An example of this is the Cyc project, whose goal is to enable AI applications to perform human-like reasoning [5]. These types of efforts have not been very fruitful, however. It turned out to be very hard to compose logical rules that capture all of the complexity of human reasoning. In Figure 2.1 the relation between AI and several of its subfields is shown. These subfields will be discussed in the following subsections.

Figure 2.1: A Venn-diagram explaining the relation between artificial intelligence, machine learning, representation learning and deep learning. For each subfield an exclusive example is given [6].

2.1.1 Machine learning

Machine learning was invented as a different approach to artificial intelligence. Instead of trying to hard-code everything, the programmer will define algorithms by which the computer can extract its own logical rules from given data. Nowadays, machine learning is used everywhere and has many applications. For example, a machine learning algorithm can determine whether to recommend cesarean delivery [7]. Within machine learning, there are two main categories of algorithms: supervised and unsupervised algorithms.

With supervised algorithms, the goal is to make a prediction about a desired outcome variable, given some input variables (features). The algorithm will do this by modelling the factors in the data that are responsible for variation in the outcome variable. In order to learn the optimal parameters of the model, the algorithm will need some training data to learn from. The parameters will be learned to give the best predictions of the outcome variables. Since the training dataset is only a sample of the true data generating process, it will contain some random fluctuations. If there is not enough data or if the model is too complex (has a lot of parameters), it is possible that the algorithm will model some of these random fluctuations. This is called overfitting and is not desired: a model of the random fluctuations in the training set will not generalise to new data. In order to obtain a real measure of the performance of the algorithm, the model has to be tested on a separate dataset, called the test data. The difference in performance between the training and test datasets can help to decide on the complexity of the model to prevent overfitting. Supervised machine learning algorithms are often split into two types, depending on the outcome variable. If the desired outcome value of a supervised algorithm is discrete, one speaks of a classification problem. For a continuous outcome variable, one speaks of a regression problem.

Unsupervised algorithms do not have a desired outcome variable. Instead, these types of algorithms try to find structure in the data. Unsupervised algorithms will for example try to find clusters in the data or try to reduce the number of dimensions in the data set.

In all machine learning algorithms, the performance of the algorithms relies heavily on the construction of relevant features. Raw data, like the pixels of a picture, might not contain much correlation with the desired outcome. It is only by designing intelligent features that such algorithms gain a lot of power. However, the creation of these features can be very complex and time consuming for the programmer.

2.1.2 Representation learning

In machine learning, the field of representation learning will not only use algorithms to learn a desired outcome from some hand-crafted features, but will also use algorithms to learn preferable features for the given task. For this purpose algorithms like singular value decomposition and shallow autoencoders can be used. These algorithms will reduce high dimensional data to meaningful features, after which these features can be used for a supervised machine learning task.

The thesis will explore the use of deep autoencoders. While deep autoencoders are a type of unsupervised algorithm that can be used for representation learning, they are also a type of deep learning algorithm. Shallow autoencoders on the other hand can be used for representation learning but are not a part of the deep learning algorithms, as shown in the Venn diagram of Figure 2.1. The difference between shallow and deep algorithms will be explained in the next subsection.

2.1.3 Deep learning

In representation learning, the computer will learn relevant features from the data and use these features to predict the outcome. In deep learning, several layers of features will be stacked on top of each other, in order to create much more complex features, which can then be used to predict the outcome. The term ‘deep’ refers to the depth of the layers of features that are built upon each other. In certain complex artificial intelligence tasks, this approach can be very powerful.

In Figure 2.2 an example is shown of deep learning applied to object recognition in images. The pixels of the image are given as the input layer. On top of the input layer, several hidden layers are built from which eventually the type of object is predicted. These layers are called hidden because they are not given as input or used as output; instead, they will be constructed by the algorithm itself. In the figure one can really see what the computer is trying to learn: in the first hidden layer it will try to recognise relevant low level objects like edges and color gradients. In the second hidden layer it will use those objects to construct shapes like corners and contours. In the next layer those shapes will be used to construct whole object parts. Finally, in the output layer the object identity will be predicted from the object parts.

Another name often given to deep learning is artificial neural networks. This name originates from some of the first implementations of deep learning algorithms in the 1940s. Back then, researchers were using these types of algorithms in neuroscience as computational models to learn how our own brain works. The researchers were in fact trying to simulate on the computer the algorithm that the human brain uses to learn. From an artificial intelligence perspective, it also makes sense to study these models, since we know they can produce intelligence in humans. Nowadays neuroscience has a diminishing influence on the progress in deep learning research. A lot of the terminology of neuroscience models still exists today however, like the word ‘neuron’ for the features in the different layers.

Figure 2.2: Visualisation of a convolutional neural network, each layer building on the features of the previous layer [8].

Although the field of deep learning has been around for a long time, it has only recently become very popular. Progress in computing power, together with large amounts of data, has made it possible for deep learning algorithms to outperform simpler machine learning algorithms on several AI tasks:

• On the MNIST digit image classification problem, deep learning managed to break the supremacy of support vector machines [9][10].

• Microsoft’s 2012 version of their audio and video indexing speech system (MAVIS) based on deep learning managed to reduce the word error rate by about 30% compared to state-of-the-art models based on Gaussian mixtures [11].

• In natural language processing, the SENNA software, which has applications in tasks such as language modelling, semantic role labelling and syntactic parsing, approaches or surpasses the state-of-the-art on these tasks and is simpler and much faster than traditional predictors [12].

2.2 Deep autoencoders

In the thesis deep autoencoders are explored on cooking recipes. Autoencoders are a type of unsupervised machine learning algorithm: instead of trying to predict a certain outcome, autoencoders will try to reconstruct their own inputs. If every hidden layer has at least as many neurons as the input layer, the autoencoder can potentially learn the identity function. Such a reconstruction is not very useful. However, if the network has at least one hidden layer with fewer neurons than the input layer, the use of an autoencoder becomes more interesting. In this case, the network will have to learn a compact description of the data in such a way as to retain as much information as possible, despite the reduced number of dimensions. Figure 2.3 shows a visualisation of the structure of a deep autoencoder network. The hidden layer with the fewest neurons is called the bottleneck hidden layer. The structure of an autoencoder will often be symmetric, with the bottleneck hidden layer in the middle. Although this is not a strict rule, the reason behind this is very intuitive: if it requires a certain amount of complexity (number of layers) to encode the inputs to the bottleneck layer, the decoding back to the output layer would likely require a similar amount of complexity.

Figure 2.3: A visualisation of a deep autoencoder with a central bottleneck hidden layer. The output layer of an autoencoder tries to reconstruct its input layer.

2.2.1 Network architecture

In this subsection the elements of the architecture of an autoencoder network will be discussed. As mentioned before, each layer is constructed from the previous layer. The exact definition of this construction can be found in the equation:

$$y_k = f\left(\sum_i w_{ki} x_i + b_k\right). \tag{2.1}$$

Each neuron $y_k$ of the next layer can be defined as a function $f$ over the weighted sum of the neurons of the previous layer. The weighted sum can also contain a bias term $b_k$. This is equal to saying that the previous layer has an extra neuron $x_0$, with a value identical to one. Note that for each neuron in the next layer, a different set of weights $w_{ki}$ is used, in order for each neuron to learn a different feature. If we have $m$ neurons in a layer and $n$ neurons in the next layer, the number of network parameters between the two layers will thus be given by $mn$, or $(m + 1)n$ when a bias term is added. The function $f$ is called the activation function. Some typical examples of activation functions are:

• The linear function: f(x) = x;

• The sigmoid function: $f(x) = \frac{1}{1 + e^{-x}}$;

• The rectifier function: f(x) = max(x, 0).
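The three activation functions listed above, together with their derivatives, can be written down in a few lines. This is a minimal NumPy sketch; the function names are chosen here for illustration and are not taken from the thesis code.

```python
import numpy as np


def linear(x):
    return x


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def rectifier(x):
    return np.maximum(x, 0.0)


def linear_derivative(x):
    return np.ones_like(x)


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def rectifier_derivative(x):
    # The derivative is undefined at exactly x = 0; the value 0 is used here by convention.
    return (x > 0).astype(float)


x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
for name, f, df in [
    ("linear", linear, linear_derivative),
    ("sigmoid", sigmoid, sigmoid_derivative),
    ("rectifier", rectifier, rectifier_derivative),
]:
    print(name, f(x), df(x))
```

The sigmoid derivative peaks at 0.25 for x = 0 and is already below 0.0001 at x = 10, while the rectifier derivative stays at one for every positive input.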

The activation function of a neural network is often very simple. The complexity is not generated by using very complex functions, but rather by combining several simple functions to build layers of features on top of each other, each layer increasing in complexity. There are however important differences between the activation functions.
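As an illustration of Equation 2.1, the sketch below computes one layer from the previous one and counts the parameters between them. The hidden layer size of 100, the random weights and the random binary input are arbitrary assumptions made for the example; `dense_layer` and `n_parameters` are hypothetical helper names.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dense_layer(x, W, b, f):
    """Equation 2.1 for all neurons k at once: y = f(W x + b)."""
    return f(W @ x + b)


def n_parameters(m, n, bias=True):
    """Number of parameters between a layer of m neurons and a layer of n neurons."""
    return (m + 1) * n if bias else m * n


rng = np.random.default_rng(0)
m, n = 381, 100  # 381 ingredients in; 100 hidden neurons is an arbitrary choice
x = rng.integers(0, 2, size=m).astype(float)  # a random binary stand-in for a recipe
W = rng.normal(scale=0.01, size=(n, m))
b = rng.normal(scale=0.01, size=n)

y = dense_layer(x, W, b, sigmoid)

# The bias written as an extra input neuron x_0 = 1 with its own weight column.
x_aug = np.concatenate(([1.0], x))
W_aug = np.column_stack((b, W))
y_aug = sigmoid(W_aug @ x_aug)

print(y.shape, n_parameters(m, n), np.allclose(y, y_aug))
```

A deep network is obtained by feeding the output of one such layer into the next, each with its own weights and activation function.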

The linear activation function is a special case among the activation functions of neural networks. One of the properties of a linear function is that the composition of two linear functions is again a linear function. In a neural network, this means that the addition of an extra linear layer to another linear layer will be equivalent to just one linear layer. A network with only linear activation functions will thus not be able to learn complex (non-linear) features. The whole concept of deep learning would not work here, so other activation functions will have to be added to the network.
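The collapse of stacked linear layers into a single layer can be checked numerically. The following sketch uses small random weight matrices, chosen here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=4)
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=3)

# Two linear layers applied one after the other.
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))
```

Since W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), the two computations agree for every input, which is why depth only adds expressive power once a non-linear activation function sits between the layers.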

[Figure 2.4 panels: rows Linear, Sigmoid and Rectifier; columns "Function" and "Function derivative"; horizontal axes from −10 to 10.]

Figure 2.4: A plot of different activation functions and their derivatives.

The sigmoid function is another function often used in neural networks. It is a non-linear function and can thus be used to build a deep network. It also has the nice property of being a monotonically increasing function that maps (−∞, ∞) to (0, 1), as can be seen in Figure 2.4. In the data set of the thesis, the inclusion of different ingredients in the recipes is coded as 0/1. We could use this knowledge about the data to build an appropriate architecture for the autoencoder. By using a sigmoid activation function for the output layer, the values of the output will be restricted to the interval [0, 1]. This way, the autoencoder will have a much easier time reconstructing its input values.

The rectifier function is another non-linear activation function and can thus be used to build complex features, just like the sigmoid function. But unlike the sigmoid function, the rectifier function has a property that is very useful to train neural networks: the derivative of the function is non-vanishing for a large region of its domain (all values x > 0). This is not the case for the derivative of the sigmoid function, which is only significantly greater than zero for inputs close to zero. This will be very useful to prevent the vanishing gradient problem, which will be discussed in the next section.

The output of the network as a function of the input layer is then defined as the chaining of the activation functions of the different layers. On the output layer a loss function will be defined. This loss function is used to optimize the network for the task at hand. The loss function J(θ) of a general neural network is given by:

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L\left(g(x^{(i)}; \theta), y^{(i)}\right). \tag{2.2}$$

In this equation, $g$ is the chaining of the activation functions over the different layers. In an autoencoder, the network will try to reconstruct its own inputs: $y^{(i)} = x^{(i)}$. The function $L$ is called the cost function. The cost function determines how the deviations from the target values are penalized. The two cost functions explored in the thesis are:

• The least squares function: $L(\hat{y}, y) = \sum_{j=1}^{p} (\hat{y}_j - y_j)^2$

• The cross entropy function: $L(\hat{y}, y) = -\sum_{j=1}^{p} \left[ y_j \log \hat{y}_j + (1 - y_j) \log(1 - \hat{y}_j) \right]$

The least squares function is one of the most used cost functions in machine learning: many algorithms optimize this cost function. However, when the outcome is restricted to the interval [0, 1] by the use of a sigmoid activation function, it makes much more sense to use the cross entropy. With a least squares cost function, there will not be much difference in cost between an output of 0.01 and 0.0001, while this is a big difference for the arguments of the sigmoid function. The cross entropy penalizes on a logarithmic scale and therefore separates such outputs clearly, giving much more information to learn from when optimizing the network. This will be important for the vanishing gradient problem, which is discussed in Section 2.3.2.
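The difference between the two cost functions can be made concrete with a small numerical sketch. It assumes a single ingredient that is actually present (target 1), writes the cross entropy with the conventional leading minus sign so that lower values are better, and clips the outputs with a small `eps` to avoid taking the logarithm of zero; the clipping is an assumption of this example, not something described in the thesis.

```python
import numpy as np


def least_squares(y_hat, y):
    return np.sum((y_hat - y) ** 2)


def cross_entropy(y_hat, y, eps=1e-12):
    # eps guards against log(0); it is an assumption of this sketch.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))


def loss(Y_hat, Y, cost):
    """Equation 2.2: the average cost over the n observations (rows)."""
    return np.mean([cost(y_hat, y) for y_hat, y in zip(Y_hat, Y)])


y = np.array([1.0])  # the ingredient is present
for p in (0.01, 0.0001):
    y_hat = np.array([p])
    print(p, least_squares(y_hat, y), cross_entropy(y_hat, y))
```

For these two outputs the least squares cost moves only from about 0.98 to about 1.00, whereas the cross entropy doubles from about 4.6 to about 9.2.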

2.2.2 Singular value decomposition for dimension reduction

As stated above, when a bottleneck layer with a low number of neurons is introduced in the autoencoder, the reconstruction will in general not be perfect anymore. The autoencoder can then be used to reduce the dimensions of the data while maintaining as much information as possible. Within machine learning, there is another set of techniques famous for being able to reduce data: principal component analysis (PCA), which is closely related to singular value decomposition (SVD). The SVD of the data matrix $M$, with $n$ rows representing the observations and $p$ columns representing the features, is given by the equation:

$$M = U D V^{T}, \tag{2.3}$$

with $D$ a diagonal matrix with the non-negative singular values on the diagonal, ordered from large to small. The singular values represent the amount of variation of the data in their corresponding direction. The matrices $U$ and $V$ are called the left-singular and right-singular matrices of $M$ respectively. The columns of the matrix $V$ span the space of the decomposed features. If we denote the matrix containing the first $k$ columns of $V$ by $V_k$, we can reduce our dataset by projecting the features onto the subspace spanned by the columns of $V_k$ using the equation:

$$Z_k = M V_k. \tag{2.4}$$

Because the first $k$ columns of $V$ correspond to the largest singular values of the decomposition, the reduced features $Z_k$ will be the features that capture the most variation in the dataset. This variation is the variation observed from the zero point. The data is reconstructed by projecting the reduced features $Z_k$ back onto the original $p$-dimensional feature space using the equation:

$$M_k = Z_k V_k^{T}. \tag{2.5}$$

This reconstruction will be incomplete: the data points will now lie in a $k$-dimensional subspace of the $p$-dimensional feature space.

PCA is defined as the eigenvalue decomposition of the data covariance matrix. PCA will transform the features to (linearly) uncorrelated features. The decomposed features will be ranked in decreasing order of variation. This variation is the variation observed within the data. When the data is centered, all the data will vary around zero. In this case, SVD and PCA will lead to the same principal component directions. Aside from centring, the SVD/PCA algorithms are also very sensitive to the normalisation of the data.
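Equations 2.3 to 2.5 translate almost directly into NumPy. The sketch below is a minimal illustration on a small random binary matrix, which stands in for the recipe data; `svd_reduce` is a hypothetical helper name.

```python
import numpy as np


def svd_reduce(M, k):
    """Reduce M to k dimensions with SVD and reconstruct it again."""
    U, d, Vt = np.linalg.svd(M, full_matrices=False)  # Equation 2.3
    V_k = Vt[:k].T       # first k right-singular vectors, shape (p, k)
    Z_k = M @ V_k        # Equation 2.4: reduced features, shape (n, k)
    M_k = Z_k @ V_k.T    # Equation 2.5: reconstruction, shape (n, p)
    return Z_k, M_k, d


rng = np.random.default_rng(0)
M = (rng.random((50, 8)) < 0.2).astype(float)  # random stand-in for the binary data

Z_2, M_2, d = svd_reduce(M, k=2)
error = np.linalg.norm(M - M_2)  # Frobenius norm of the reconstruction error
print(Z_2.shape, M_2.shape, error)
```

The reconstruction error equals the square root of the sum of the squared discarded singular values, so keeping more dimensions can only lower it.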

2.2.3 Deep autoencoders for dimension reduction

If we choose to use a linear activation function in Equation 2.1, the next layer in our network is defined as a linear transformation of the previous layer. Equations 2.4 and 2.5 of singular value decomposition are also both linear transformations. Furthermore, the projection parameters $V_k$ are optimized to retain the most variation possible. This is equal to minimizing the Frobenius norm of $M - M_k$ for a fixed $k$, with the Frobenius norm given by:

$$\|M\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{p} M_{ij}^2}. \tag{2.6}$$

Apart from the square root and a constant factor, this is exactly the loss defined by Equation 2.2 when using a least squares cost function on the reconstruction error. This shows that an autoencoder with a least squares cost function, linear activation functions, no bias terms and a bottleneck hidden layer with $k$ neurons will, at its optimum, produce the same reconstruction as an SVD with $k$ dimensions!
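This equivalence can be checked numerically without training a network. The sketch below plugs the SVD solution into a linear autoencoder (encoder weights $V_k^T$, decoder weights $V_k$) and verifies that randomly perturbed weights never reach a lower loss. The random binary data, the perturbation size of 0.1 and the helper name `linear_autoencoder_loss` are assumptions made for this example.

```python
import numpy as np


def linear_autoencoder_loss(M, W_enc, W_dec):
    """Equation 2.2 with a least squares cost, linear activations and no bias terms."""
    reconstruction = (M @ W_enc.T) @ W_dec.T
    return np.sum((reconstruction - M) ** 2) / M.shape[0]


rng = np.random.default_rng(0)
M = (rng.random((200, 10)) < 0.2).astype(float)  # random stand-in for the data
n, k = M.shape[0], 3

_, d, Vt = np.linalg.svd(M, full_matrices=False)
V_k = Vt[:k].T
W_enc, W_dec = V_k.T, V_k  # encoder (k, p) and decoder (p, k) taken from the SVD

svd_loss = linear_autoencoder_loss(M, W_enc, W_dec)
frobenius_loss = np.linalg.norm(M - M @ V_k @ V_k.T) ** 2 / n

# Any other pair of weights gives a rank-k reconstruction that is at best as good.
other_losses = [
    linear_autoencoder_loss(
        M,
        W_enc + 0.1 * rng.normal(size=W_enc.shape),
        W_dec + 0.1 * rng.normal(size=W_dec.shape),
    )
    for _ in range(100)
]
print(svd_loss, frobenius_loss, min(other_losses))
```

A linear autoencoder trained with gradient descent would have to converge to this same loss value, although its weights need not equal $V_k$ themselves: any invertible mixing of the bottleneck neurons gives the same reconstruction.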

Singular value decomposition can thus be seen as a special case of a shallow autoencoder. Using autoencoders to perform dimension reduction has the benefit that it can be generalised to multiple non-linear layers and learn deep features. Where singular value decomposition will project the data on linear manifolds, autoencoders will be able to extend this to curved manifolds. This extension will be especially beneficial for complex, non-linear data.

Figure 2.5: Dimension reduction on 20x20 images of digits using 30 dimensional deep autoencoders and principal component analysis [1].

In Figure 2.5 an example of dimension reduction on the MNIST digit data set is shown. The MNIST data set has 70 000 images of digits with dimensions 20 × 20, thus 400 pixels in total. The first row of the figure shows an example of each digit in the dataset. The second and third rows show the reconstruction of each digit using a deep autoencoder and PCA respectively. For both, the data was reduced from 400 pixels to 30 dimensions that will contain the most important features of the digits. From these 30 dimensions, the original 400 pixels will be reconstructed. The PCA does a reasonable job of reconstructing the dataset, but it is not very spectacular. The autoencoder does very well in reconstructing the original dataset. In fact, its reconstructions arguably look better than the original digits! For example, the upper loop of the digit eight has been fixed, and both ends of the digit zero are better connected. In order to preserve as much information as possible while reducing the dimensions of the dataset, the autoencoder will have to learn complex features of the data. An unclosed loop in the digit eight is a rarity in the data set and as a result, the autoencoder did not learn this feature when only having access to 30 dimensions. Instead, it knows the digit looks a lot like the other eights and will reconstruct a more general eight.

Dimensionality reduction has a lot of applications. One of them is data visualisation. When the data is reduced to two dimensions, those dimensions can be visualised on a biplot. Latent semantic analysis (LSA) is a domain that often makes use of such biplots. LSA is a natural language processing technique that analyses the link between words and the documents they originate from. The reasoning behind this is that words that are similar in meaning will be used in similar contexts. A matrix can be constructed using a collection of the documents with the frequencies of their most important words as features. Figure 2.6 shows the biplot of the reduction of such a matrix to two dimensions using SVD (B) and a deep autoencoder (C). After the data reduction, the documents are coloured by the type of their content. Compared to the SVD biplot, the deep autoencoder biplot contains much more structure: documents that are similar to each other are closer together. When the cosine of the angle between two codes was used to measure similarity, the autoencoder clearly outperformed SVD (A).

The autoencoder for LSA can also be used in another way: instead of reducing to two continuous features, one could build an autoencoder that has 32 binary bottleneck neurons. Each document can then be compressed to a sequence of 32 bits. As a result, documents that are very similar in content will have very similar bit sequences. Each document then has a hash given by its bit sequence. Such a hash can be used for fast retrieval of documents with similar content. Using neural networks to hash and retrieve documents, called semantic hashing, is much faster than other hashing algorithms [13].

2.2.4 Deep autoencoders for representation learning

Reducing features of a dataset to more usable features for supervised machine learning is a form of representation learning. Although data reduction will throw away some information about the data, the reduced features might be more usable for the prediction task. Features that are useless for the prediction task might also be reduced, which might prevent overfitting.

Figure 2.6: Reduction of words using singular value decomposition (B) and deep autoencoders (C) and the document retrieval accuracy for these methods (A) [1].

In the thesis, deep autoencoders will be explored to reduce the ingredients of cooking recipes. With these reduced features, supervised machine learning will be performed to predict the region of origin of the recipes. Two commonly used supervised machine learning algorithms will be used for this purpose: k-nearest neighbors and quadratic discriminant analysis.

Figure 2.7: An example of the prediction of three supervised machine learning algorithms: LDA (left), QDA (right) and KNN (below).

k-nearest neighbors (KNN) is one of the simplest supervised machine learning algorithms. For each point in the parameter space, the k nearest neighbors in the training dataset are determined. For classification, the prediction of the outcome variable for a given observation is then given by the majority vote of the outcome variables of the nearest neighbors. To determine the observations nearest to a given point, a distance measure must be defined. Very often this distance measure is simply the Euclidean distance. The algorithm is very sensitive to how the features are scaled, since features that are much larger than others will easily dominate the distance measure. The features produced by dimension reduction with the SVD or autoencoder all have the same dimensions, so scaling the features before use will not be important. The value k of the KNN algorithm will be determined by a 5-fold cross-validation on the validation data.
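The majority vote of KNN can be sketched in a few lines of NumPy. The function name `knn_predict`, the toy points and the labels "A" and "B" are hypothetical and only serve as an illustration; this is not the implementation used in the thesis.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-dimensional codes with two classes.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

pred_a = knn_predict(X_train, y_train, np.array([0.1, 0.1]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([1.0, 0.95]), k=3)
```

Because the distance is computed on the raw feature values, a feature with a much larger range would dominate `dists`, which is the scaling sensitivity mentioned above.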

Linear discriminant analysis (LDA) is another common supervised machine learning algorithm. LDA is closely related to PCA: where PCA is an unsupervised machine learning algorithm that tries to find the directions in the data with the most variation, LDA is a supervised algorithm that will try to find the directions in the data with the most variation in the outcome variable. In these directions of greatest variation, the middle points will be determined. The outcome variable will then be predicted depending on which side of the middle points the data lies. Quadratic discriminant analysis (QDA) is an extension of this method that allows the classes to have different covariances, as will be the case for the regions of the recipes. For this reason, QDA will be used instead of LDA.

2.2.5 Deep autoencoders for collaborative filtering

Data reduction methods can also be used for the purpose of collaborative filtering to build recommender systems. An example of recommender systems can be found in the services Netflix and Amazon, where products are recommended that might be interesting to the specific user. Collaborative filtering algorithms try to solve this problem from the following perspective: using the preferences of a customer, what other products can be recommended to that customer based on other customers with similar preferences? Data reduction methods can here be used to learn what the preferences of the customers look like: in order to have a good reconstruction, meaningful features will have to be extracted from the data, while unimportant features will be thrown away. The observations will be reconstructed to match better with the other observations. An example of this was already seen in the reconstruction of the digits zero and eight in Figure 2.5.

Collaborative filtering can be used on the recipe dataset to create a good recipe from a selection of ingredients that do not necessarily form a good recipe to start with. A deep autoencoder must first be trained on a training dataset containing a lot of recipes. The autoencoder will learn from this dataset how ingredients are combined in the recipes. The trained network can then be used to recommend adaptations to the selection of ingredients. These adaptations can simply be found by putting the selection of ingredients through the network and checking the output. The output of the autoencoder is forced to have values between zero and one by using the sigmoid function. Therefore, if an ingredient does not match well with the rest of the ingredients, this ingredient will have an output close to zero. On the other hand, ingredients that were not present in the selection but would match well will have an output closer to one.

The performance of the autoencoder can be tested on the test dataset that has been set aside for this purpose. Each recipe in the test data will be modified by randomly adding or removing one ingredient. Adding an ingredient is done by changing the value in the recipe from zero to one, while removing an ingredient is done by changing the value from one to zero. The deep autoencoder can then be used to determine which ingredient has been modified by examining the reconstruction of the ingredients. If an ingredient was removed, the ingredients that were not included in the modified recipe were ranked in decreasing order of their reconstruction values. If an ingredient was added, the ingredients that were included in the modified recipe were ranked in increasing order. In both cases, the rank of the changed ingredient will be determined. A low rank implies that the recommender system did well in finding back which ingredient was changed.
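The ranking procedure described above can be sketched as follows. The function names `rank_removed` and `rank_added`, the six-ingredient recipe and the reconstruction values are all hypothetical; the reconstruction vector stands in for the output of a trained autoencoder.

```python
import numpy as np

def rank_removed(modified, reconstruction, removed_idx):
    """Rank (1 = best) of the removed ingredient among the ingredients absent
    from the modified recipe, sorted by decreasing reconstruction value."""
    candidates = np.flatnonzero(modified == 0)
    order = candidates[np.argsort(-reconstruction[candidates])]
    return int(np.flatnonzero(order == removed_idx)[0]) + 1

def rank_added(modified, reconstruction, added_idx):
    """Rank (1 = best) of the added ingredient among the ingredients present
    in the modified recipe, sorted by increasing reconstruction value."""
    candidates = np.flatnonzero(modified == 1)
    order = candidates[np.argsort(reconstruction[candidates])]
    return int(np.flatnonzero(order == added_idx)[0]) + 1

# Hypothetical modified recipe and autoencoder output over six ingredients.
modified = np.array([1, 0, 1, 0, 0, 1])
reconstruction = np.array([0.9, 0.7, 0.8, 0.1, 0.3, 0.2])

r_removed = rank_removed(modified, reconstruction, removed_idx=1)
r_added = rank_added(modified, reconstruction, added_idx=5)
```

In this toy case both ranks are one: ingredient 1 has the highest reconstruction value among the absent ingredients, and ingredient 5 has the lowest value among the present ones.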

From the ranks of the changed ingredients, several performance measures can be extracted. For the ranks of the removed ingredients, the mean rank, median rank and the percentage of recipes with a reconstruction ranking in the top 10 have been used as performance measures. These performance measures are the same measures that have been used in De Clercq et al. [2], to enable comparison with their recommender systems. For the ranks of the added ingredients, the mean rank, median rank and the percentage of ingredients on the first rank are used.

2.3 Training the network with gradient descent

In the previous section, the different elements of the network architecture have been explained. In this section, the optimization of the network parameters will be discussed. Unlike some other machine learning algorithms, it is generally not possible to find an analytical solution for the parameters of a neural network. However, the network can be optimized using the gradient descent algorithm. The gradient descent algorithm starts from a certain initialisation of the parameters. From this starting point, the direction of steepest descent of the loss function will be determined. This direction is given by the opposite of the gradient of the loss function. A small step will be taken in the direction of the steepest descent. If this step is small enough, the loss should be smaller for this new set of parameters. This procedure will be repeated until the algorithm stops improving the loss, around the global minimum if all goes well. The exact equation of how to update the network parameters is given by:

θ ← θ − δ∇θJ(θ). (2.7)

The gradient symbol ∇θ represents the vector containing the derivatives of the loss with respect to the parameters θ. This equation also introduces a new parameter δ, which is called the learning rate. The learning rate determines the size of the steps that will be taken in the direction of steepest descent. The gradient, which determines the direction of steepest descent, is also responsible for the size of the steps. This will also help to converge to a global minimum: if the loss function behaves well enough (has a continuous derivative) around the convergence point, the gradient will diminish around this point, slowing down the gradient descent algorithm. Figure 2.8 shows how the gradient descent algorithm can converge from a certain initial position to the global minimum.
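As an illustration of Equation 2.7, the sketch below runs gradient descent on a simple convex loss with two parameters. The loss function, the starting point and the learning rate are assumptions chosen for the example; they are not taken from the thesis experiments.

```python
import numpy as np

def loss(theta):
    # Toy convex loss J(theta) = theta_1^2 + 10 * theta_2^2 (assumed)
    return theta[0] ** 2 + 10.0 * theta[1] ** 2

def grad(theta):
    # Analytical gradient of the toy loss
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

theta = np.array([3.0, -2.0])   # initial parameters (assumed)
delta = 0.04                    # learning rate (assumed)

losses = [loss(theta)]
for _ in range(200):
    theta = theta - delta * grad(theta)   # update rule of Equation 2.7
    losses.append(loss(theta))
```

The steps shrink automatically as the gradient diminishes near the minimum, so the recorded `losses` decrease monotonically towards zero for this choice of learning rate.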

Figure 2.8: A visualisation of the gradient descent algorithm on a loss function J with two parameters θ1 and θ2 [14].

2.3.1 Local minima

When using the gradient descent algorithm to find the optimal solution of a neural network, there are a number of potential problems that can arise. When the activation functions of the network are non-linear, the loss function will in general be non-convex. This means that there will be several local minima in the loss function. It is very possible for the gradient descent algorithm to converge to one of the local minima instead of the global minimum by starting from a different position, as depicted in Figure 2.9.

In practice, most local minima do not play a big role in the training and application of neural networks. These local minima are only abundant in regions of the parameter space which have a loss close to the loss of the global minimum. For practical applications, it does not matter if the solution is a local or global minimum, as long as the loss is close enough to the global minimum.

Figure 2.9: A different initial position can lead to convergence to a local minimum instead of the global minimum [14].

2.3.2 The vanishing gradient problem

There are however some special local minima that can be detrimental to proper convergence of the gradient descent algorithm. As mentioned before, the output of the network is given by the chaining of the activation functions of the different layers. To find the derivative of a chained function, the chain rule can be used as defined in the equation:

$$\big(f_1(f_2(x))\big)' = f_2'(x)\, f_1'(f_2(x)). \tag{2.8}$$

As can be seen in the equation, the derivatives of the parameters of a certain layer, $\big(f_1(f_2(x))\big)'$, will depend on the values of the derivatives of the parameters of the next layer, $f_1'(f_2(x))$. If the derivatives of all the parameters of a certain layer have very low values, the derivatives of all the preceding layers will also have very low values. As a result, the network parameters will hardly update in those layers and the gradient descent algorithm will not be able to converge to a good solution in a reasonable amount of time.

This situation is especially likely to occur in deep autoencoders [1], for two reasons: a high number of layers and a small number of neurons in the bottleneck layer. A lot of layers can be problematic if the derivatives of the activation functions are smaller than one, as is the case for the sigmoid function for example. Since the derivatives of the parameters of a certain layer will be the product of derivatives of the activation functions of all the next layers, the gradient for the parameters in the first layers can become very small. This problem is known as the vanishing gradient problem [15]. The bottleneck layer of an autoencoder makes the problem even worse. Due to the low number of neurons in this layer, it is much more likely for all the derivatives of the bottleneck neurons to become very small, preventing the gradient descent algorithm from working properly in the layers before the bottleneck layer.
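A small numerical illustration of this effect, considering only the activation derivatives and ignoring the weights (which also enter the product in a real network): the derivative of the sigmoid is at most 0.25, so a product over n sigmoid layers is bounded by 0.25 to the power n. The layer counts used below are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The sigmoid derivative reaches its maximum of 0.25 at z = 0.
max_derivative = sigmoid_prime(0.0)

# Upper bound on the product of activation derivatives along
# a path through n sigmoid layers (weights not included).
bounds = {n: max_derivative ** n for n in (1, 5, 10)}
```

For ten layers the bound is already below one millionth, which shows how quickly the gradient for the first layers can shrink when the weights do not compensate for it.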

A similar situation occurs when all but one of the bottleneck neurons have gradients close to zero. The gradient descent algorithm will still be able to do some learning through that one neuron, but the learning will be limited. It is possible that the learning through this one neuron is not enough to start activating the other neurons during the process.

There are several solutions for these problems. One of them is using pretraining to find good initial values of the parameters, after which the gradient descent algorithm will easily converge to a good value [16]. This pretraining is usually done layer by layer, using restricted Boltzmann machines, an unsupervised deep learning algorithm. In the thesis, pretraining of difficult network architectures was done using the neuron values of trained networks with the same number of layers and neurons, but with activation functions that are easier to train. Although the author has no obvious explanation for why this worked, it was a solution that worked for those network architectures where the normal initialisation procedure (discussed in the next subsection) failed.

Another solution to the problem of plateaus and the vanishing gradient problem is the use of appropriate activation functions. The rectifier function derivatives are either zero or one, while the sigmoid function only has derivatives smaller than one. As a result, the gradients of deep neural networks using rectifier activation functions will not diminish towards zero, unlike those of networks with sigmoid activation functions, preventing the vanishing gradient problem that occurs with many layers of sigmoid functions. The rectifier does have another problem however: it has a zero derivative for all negative input values. For autoencoders, this can make it likely to have a bottleneck layer where all neurons have a gradient of zero. To solve this, some modifications of the rectifier function which have a small but non-zero gradient for the negative input values have been invented [17]. This problem only occurred for the activation function to the bottleneck layer: using a linear or sigmoid activation function for this layer prevented the problem. For the architectures with a rectifier activation function to the bottleneck layer, choosing a proper initialisation size of the parameters helped to make proper convergence much more likely.

2.3.3 Initialisation of the network parameters

One could naively initialise all the parameters to the same value, however this approach would not work: if all the parameters have the same value, their gradients will also have the same value. After every gradient descent step, they will still have the same value. The gradient descent algorithm will therefore not be able to work properly. To solve this problem, the parameters of the network can be initialised with random values.

The size of the range of these random values is important. If the range of the random values is too large or too small, the gradient descent algorithm will have to do a lot of work to adjust the parameters to their appropriate sizes. This will take a lot of time to compute and the algorithm is more likely to get stuck in one of the plateaus discussed in the previous subsection.

2.3.4 Minibatch gradient descent

Although the gradient descent algorithm generally works well to converge to a good solution, it can often take a very long time. One way to improve the algorithm is to use minibatch gradient descent instead of deterministic gradient descent. In deterministic gradient descent, the gradient is calculated over the whole dataset before the network parameters are updated with this gradient. This is often not very efficient: in large datasets, the calculation of the gradient can take a very long time. The computation time grows linearly with n, the number of training samples in the dataset, while the accuracy of the gradient estimate only increases with a factor √n. There is also often a lot of redundancy in the dataset: different data samples give very similar contributions to the gradient.

Minibatch gradient descent will calculate the gradient only over a small batch of training samples, typically 10 to 100 samples, before doing a gradient descent step. For each iteration through the dataset, the order of the samples will be randomized and the dataset will be split into batches. In order to compensate for the reduced accuracy of the gradient, the learning rate will have to be reduced. After the gradient descent step, the gradient will be calculated over the next minibatch, and so on. The gradients calculated over the minibatches are good enough for the gradient descent algorithm to work, but will be calculated much faster, allowing the gradient descent algorithm to converge in a fraction of the time it would take with deterministic gradient descent. In minibatch gradient descent with a very large training dataset, it is possible for the algorithm to converge before the end of the dataset is even reached! In general, several epochs through the dataset will be needed to converge to the best value.
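The minibatch procedure can be sketched on a small least squares problem. The sketch below uses plain NumPy on synthetic data rather than Theano and an autoencoder; the dataset, the batch size of 50 and the learning rate are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch_size, delta = 1000, 50, 0.05   # assumed sizes and learning rate

# Synthetic data: y = 2 * x0 - 3 * x1 + small noise
X = rng.normal(size=(n, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w + 0.01 * rng.normal(size=n)

w = np.zeros(2)
for epoch in range(20):
    order = rng.permutation(n)                    # reshuffle every epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]     # indices of the current minibatch
        residual = X[idx] @ w - y[idx]
        # Gradient of the mean squared error over the minibatch only
        gradient = 2.0 * X[idx].T @ residual / len(idx)
        w = w - delta * gradient                  # one gradient descent step per batch
```

Each epoch performs 20 parameter updates instead of a single one, while every update only touches 50 of the 1000 samples.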

Minibatch gradient descent also has another benefit. The randomness introduced by the minibatches will have a regularising effect, lowering the degree of overfitting, making the network generalise better [18]. This regularising effect is the strongest for batches of size one, but training the network with batches of size one can also take a very long time. In the case the batches have size one, the algorithm is called stochastic gradient descent.

2.3.5 Momentum

Another improvement in the gradient descent algorithm can be made by adding a momentum to the direction of descent. This method of momentum [19] can make the algorithm converge faster and helps to prevent getting stuck in local minima. In Equation 2.7 the changes in the network parameters θ are directly related to the gradient of the loss function. With momentum, the gradient will be used to update a momentum term for each network parameter as follows:

v ← αv − δ∇θJ(θ). (2.9)

This momentum term can be seen as an exponentially decaying moving average of the past gradients. The momentum term will then be used to update the parameters using:

θ ← θ + v. (2.10)

The momentum method introduces a new parameter α, the inertia parameter. This parameter can have values in the range [0, 1[. It is used to determine the fraction of the previous momentum step that remains in the next step. If α is equal to zero, we would have no momentum. In case α would be one, the contributions of the past gradients would just keep adding up in the momentum parameter, possibly leading to a diverging momentum. By using a value for α smaller than one, a decay is added to the momentum. The momentum term will be the largest when all the past gradients are orientated towards the same direction in the parameter space, in which case they will amplify each other. The inertia will determine the upper bound for the size of the momentum in relation to the size of the gradients. This factor can be found by substituting Equation 2.9 repeatedly in Equation 2.10, yielding

$$\delta + \delta\alpha + \delta\alpha^2 + \delta\alpha^3 + \dots = \frac{\delta}{1 - \alpha}. \tag{2.11}$$
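The upper bound of Equation 2.11 can be checked numerically by repeatedly applying Equation 2.9 with a gradient that always points in the same direction. The values of α and δ and the constant gradient of size one are assumptions made for the example.

```python
alpha = 0.5    # inertia (assumed)
delta = 0.1    # learning rate (assumed)
g = 1.0        # constant gradient: all past gradients point the same way

v = 0.0
for _ in range(100):
    v = alpha * v - delta * g     # momentum update of Equation 2.9

limit = delta / (1.0 - alpha)     # upper bound from Equation 2.11
```

After enough steps the size of `v` settles at `limit`, here 0.2, which is 1/(1 − α) = 2 times the plain gradient descent step δ.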

One can imagine the regions of same loss in the parameter space of a neural network to look similar to concentric hyperellipses. Some axes of those hyperellipses will be very short and other axes will be very long. The direction of the gradient in such a configuration can then be almost perpendicular to the direction of the centre of the ellipses. This will make the gradient descent algorithm very inefficient. In Figure 2.10 an example is shown of the gradient descent algorithm without momentum in an elliptical loss function with a long and short axis.

Figure 2.10: Gradient descent without momentum.

Momentum solves this problem: along the axes where the gradient changes direction often (short axes), the momentum will be diminished, while along the axes where the gradient does not change often in direction (long axes), the momentum will grow in size. Figure 2.11 shows the gradient descent algorithm with momentum with α = 0.5.

Figure 2.11: Gradient descent with momentum with an inertia α = 0.5.

2.4 Optimisation of the hyperparameters

The previous section discussed how the gradient descent algorithm can be used to train neural networks. While the gradient descent algorithm generally works very well for this purpose, a lot depends on properly chosen parameters for the gradient descent algorithm. These parameters, different from the actual network parameters, are called hyperparameters. The choice of the network architecture can also be viewed as being part of the hyperparameters: the number of layers and neurons in each layer, the choice of activation functions and loss function and the choice to add a bias term or not. As mentioned in the subsection about machine learning, in order to test the true performance of an algorithm, a separate test set needs to be used. The hyperparameters will be optimized against a measure of the performance. However, the test set cannot be used for this purpose: optimizing the hyperparameters is also a form of training of the algorithm. Tuning the hyperparameters against the test performance could result in overfitting towards this dataset. In such a case, the test set would not give a realistic measure of the performance anymore. To solve this problem, a second dataset will be separated from the training data, on which the hyperparameters will be optimized. This dataset is called the validation dataset.

2.4.1 The gradient descent hyperparameters

The hyperparameters of the gradient descent algorithm were first tuned manually until a reasonable solution was found. After this, an algorithm was used to search for the optimal solution in the region of the solution found by hand.

Originally, grid search was used for this purpose. In grid search, for each parameter a grid of several values is defined. After this, each parameter combination is executed to find the best combination. While this approach worked well initially, it was not feasible in the long term: the training of the more complex autoencoders could take up to an hour. If for example we picked five values for each of the four gradient descent parameters, there would be 5^4 = 625 combinations, which would take around 26 days to try out! Also, the possible values of the parameters in a grid search are fixed to only five values, while the optimal value might lie somewhere in between.

In this situation, a random search for hyperparameter optimisation works much better. A random search will pick the values of parameters randomly in a predefined range of possible values and will try to find the best set of parameters within a certain number of attempts. Random search has the obvious benefit of allowing a continuous range of parameters to try out. But random search has an even bigger benefit: it does much better in finding a good solution in a multidimensional hyperparameter space compared to grid search, within the same amount of time. This is because some hyperparameters will have less effect than others. Grid search will try out several combinations of these less important parameters while keeping the other parameters constant, whereas random search will change all hyperparameters on every draw.

In this thesis, this approach has been adapted a bit further: after each draw has been tried out, the range of the hyperparameters changes to become centered around the best solution so far. The size of the ranges has also been tuned manually to become smaller as the search came closer to the best solution (when the random search slowed down in finding better solutions in the region of the best solution so far). This adapted random search made it possible to find the optimal parameters for each network architecture within a day. In the next subsections, the effect of the parameters on the convergence of the gradient descent algorithm will be explained.
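A rough sketch of such a recentred random search is given below. It is not the procedure used in the thesis: the manual narrowing of the ranges is replaced by an assumed fixed shrink factor, the function name `adaptive_random_search` is hypothetical, and a cheap toy function stands in for the validation loss of a trained network.

```python
import numpy as np

def adaptive_random_search(objective, center, width, n_draws=200, shrink=0.98, seed=0):
    """Random search whose sampling range stays centred on the best point so far.
    The fixed shrink factor is an assumption replacing manual range tuning."""
    rng = np.random.default_rng(seed)
    width = np.asarray(width, dtype=float)
    best_x = np.asarray(center, dtype=float).copy()
    best_val = objective(best_x)
    for _ in range(n_draws):
        # Draw all hyperparameters at once, uniformly around the best point
        candidate = best_x + rng.uniform(-width, width)
        value = objective(candidate)
        if value < best_val:
            best_x, best_val = candidate, value   # recentre on the new best point
        width = width * shrink                    # slowly narrow the ranges
    return best_x, best_val

def toy_validation_loss(h):
    # Stand-in for "train the network and return the validation loss"
    target = np.array([0.3, 0.7])
    return float(np.sum((h - target) ** 2))

start = np.array([0.0, 0.0])
best_h, best_loss = adaptive_random_search(toy_validation_loss, start, width=[0.5, 0.5])
```

In a real application every call to the objective would involve training a network, so the number of draws would be limited by the available computation time.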

2.4.1.1 Learning rate δ

The learning rate is one of the most important parameters of the gradient descent algorithm. If the learning rate is very high, it is possible for the gradient descent algorithm to diverge in the loss function.

Figure 2.12: Divergence of the gradient descent algorithm with too large a learning rate on a parabolic loss function [20].

Figure 2.12 shows an example of such a divergence on a parabolic loss function. Starting from a point on the parabola, taking too large a step in the direction of steepest descent makes the parameter end up on the other side of the parabola. At this new point, the gradient is even larger than at the starting point. Since the size of the gradient descent steps is also dependent on the size of the gradient, the next step will be even larger. The step size can continue to grow this way and lead to divergence of the loss function.
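This behaviour can be reproduced on the parabola J(θ) = θ², for which each gradient descent step multiplies θ by (1 − 2δ). The two learning rates below are assumptions chosen to show one converging and one diverging run.

```python
def run_gradient_descent(delta, theta=1.0, n_steps=20):
    """Gradient descent on the parabola J(theta) = theta**2, with gradient 2 * theta."""
    for _ in range(n_steps):
        theta = theta - delta * 2.0 * theta
    return theta

theta_small = run_gradient_descent(delta=0.1)   # factor 0.8 per step: converges
theta_large = run_gradient_descent(delta=1.1)   # factor -1.2 per step: diverges
```

With δ = 1.1 the parameter jumps to the other side of the parabola on every step and moves further away from the minimum each time, as in Figure 2.12.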

Figure 2.13: The effect of the different learning rates on the convergence of the loss function with the gradient descent algorithm [21].

But even if the loss function does not diverge under the gradient descent algorithm, this does not mean that the parameters are well adjusted. In Figure 2.13 the loss functions during the gradient descent are shown for different learning rates. If the learning rate is too large, but small enough to converge, rapid progress will be made towards the global minimum. However, with such a learning rate it is possible for the loss to flatten out too early. Similar to the diverging case, the algorithm will constantly overshoot the global minimum: the gradient descent will take a step in the right direction but too large in size, making it end up on the other side. In such a case, the loss will never fully converge like it would with a good learning rate. Too small a learning rate is another problem. Convergence with such a learning rate would take a very long time. With a small learning rate it is also more likely for the gradient descent algorithm to get stuck in local minima, saddle points or other flat regions in the loss function.

2.4.1.2 Batch size

Another important parameter is the size of the minibatches for the minibatch gradient descent. As discussed, using batches smaller than the whole dataset can speed up the gradient descent algorithm a lot. But using too small a batch size can also slow down the algorithm: the smaller the batch size, the more random the gradient will be. To compensate for this added randomness, the learning rate will have to be decreased. While smaller batches will reduce the need to calculate the gradient over a lot of data points, more steps will be needed to converge. The optimal batch size is given by the best trade-off between the two.

2.4.1.3 Inertia α

Adding momentum to the gradient descent algorithm is another method by which the algorithm can converge faster and avoid local minima. The size of the momentum is determined by the inertia α, from which the upper bound 1/(1 − α) of the stepsize in relation to the gradient was derived in Equation 2.11. The momentum will increase the stepsize by this factor in the directions that maintain the orientation of their gradients. However, too much momentum will lead to oscillations and instabilities in the gradient descent algorithm. Reducing the learning rate will help to prevent these instabilities. The optimal value for the inertia will be found by the trade-off between boosting the directions that maintain orientation of their gradients and keeping a high enough learning rate in the other directions.

2.4.1.4 Initialisation range

The network parameters need to be initialised at random. One important aspect of this initialisation is the size of the range of these random values. This size needs to match the size of the optimal network parameters in order for the gradient descent algorithm to work well. If the initialisation of the parameters is too small or too large, the gradient descent algorithm will have to work too hard, making it likely to converge to one of the trivial local minima solutions discussed in the previous section.

2.4.2 The network architecture

For each (deep) network architecture, it takes a long time to search for the best gradient descent parameters and train the model with these optimal parameters. The different network architectures have been optimized manually for the different purposes of the thesis: using an algorithmic search would not have been feasible given the time constraints.

2.5 Python and the Theano package

The thesis has been programmed in Python. Python is open source software and has a large online community that extends the language with many user-written packages. This makes it convenient for the individual to easily perform computing tasks such as machine learning. One of the packages in Python that is especially useful for deep learning is the Theano package [22].

Theano gives the user the option to define variables in a symbolic manner, leaving calculations with those variables to its software. It will also compile the code to run faster and give the option to run the code on GPU instead of CPU. GPUs have the possibility of running code in parallel. Calculations such as finding the gradients of each observation of a batch can easily be parallelized, making the algorithm run much faster on GPU. This option has not been explored in the thesis because of a lack of compatible hardware. One of the most important aspects of the Theano package is the efficient calculation of the gradient, the most important computational task when training the neural network.

2.5.1 Backward propagation of the gradient

One of the limiting factors in the early years of neural networks was the very slow calculation of the gradients. Because they were so slow to train, there was not a lot of interest in researching them, halting a lot of the progress in this domain for several years. To give an example of what it requires to calculate the gradients, imagine a network with layers X, Y and Z as the last three layers. On the output layer Z a loss function J(θ) is defined, which has to be optimized with gradient descent. As was shown in Equation 2.8, the gradients of the network can be found with the chain rule.

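The partial derivative of the loss with respect to the weight $w^X_{ij}$ of the activation function going to the neuron $X_j$ can then be written as

$$\frac{\partial J(\theta)}{\partial w^X_{ij}} = \sum_k \sum_l \frac{\partial X_j}{\partial w^X_{ij}} \frac{\partial Y_k}{\partial X_j} \frac{\partial Z_l}{\partial Y_k} \frac{\partial J(\theta)}{\partial Z_l}. \tag{2.12}$$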

For deep neural networks, this equation includes the calculation and summation over many variables. A naive implementation would be to do this calculation for each weight separately, starting from the first layers. However, there is a way to calculate the derivatives in a much faster way. Equation 2.12 for weights in the same layer contains many of the same factors. Also across the different layers, many of the factors are the same. For example, all the weights in the network will need the terms $\partial J(\theta) / \partial Z_l$ to calculate their derivatives. Instead of making these calculations over and over for each weight, one could start from the last layer and calculate the derivatives. Subsequently, the derivatives for the weights of the second last layer could be calculated using previously obtained values and so on, going back to the first layers. This approach of propagating the error backward through the network is called the backward propagation method [23][24]. Not using this backward propagation algorithm could easily slow down the gradient descent algorithm by several orders of magnitude. The Theano package makes the implementation of this very easy: the user just has to define the mathematical relation between the variables, after which the package will calculate the gradient with respect to all parameters using backward propagation.
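To make the reuse of intermediate results concrete, the sketch below writes out backward propagation by hand in NumPy for a small two-layer sigmoid network with a least squares loss, and compares one derivative with a finite difference estimate. The network size, the loss and all names are assumptions made for the illustration; in the thesis this work is left to Theano.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2, x):
    Y = sigmoid(W1 @ x)    # hidden layer
    Z = sigmoid(W2 @ Y)    # output layer
    return Y, Z

def loss(W1, W2, x, t):
    _, Z = forward(W1, W2, x)
    return 0.5 * np.sum((Z - t) ** 2)

def backprop(W1, W2, x, t):
    Y, Z = forward(W1, W2, x)
    delta_Z = (Z - t) * Z * (1.0 - Z)            # error at the output, computed once
    grad_W2 = np.outer(delta_Z, Y)
    delta_Y = (W2.T @ delta_Z) * Y * (1.0 - Y)   # reuses delta_Z for the earlier layer
    grad_W1 = np.outer(delta_Y, x)
    return grad_W1, grad_W2

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)
t = np.array([0.0, 1.0])

grad_W1, grad_W2 = backprop(W1, W2, x, t)

# Finite difference check for a single weight of the first layer.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(W1_plus, W2, x, t) - loss(W1_minus, W2, x, t)) / (2 * eps)
```

The finite difference estimate needs two full forward passes for every single weight, whereas `backprop` obtains all derivatives from one forward and one backward pass.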

2.5.2 Other packages

There are many other packages for deep learning, like Caffe and TensorFlow. Many further packages, like Lasagne and Nolearn, have been built on top of Theano for the purpose of deep learning. These packages take care of the implementation of the network and its gradient descent algorithm. While this is useful for commercial purposes, it does not give the user much incentive to learn precisely how and why these algorithms work. One of the purposes of the thesis was to get familiar with the concepts of the gradient descent algorithm, implemented with two of its most important extensions: minibatches and momentum. For this reason, the gradient descent algorithm of the thesis was coded using only the Theano package.

Chapter 3

Results

3.1 Training the autoencoders

In this thesis, the autoencoder networks are trained using the gradient descent algorithm. The gradient descent algorithm has no inherent endpoint: the convergence of the algorithm has to be decided by hand or by using certain convergence criteria, like a maximum run time. Keeping track of the loss of the network during the gradient descent algorithm can be a useful tool to help with this task. This loss can be checked on a validation set at a fixed interval of steps. The resulting sequence of loss values can then be plotted against the number of steps. Visual inspection of such a loss function plot can give a very good idea of the convergence of the gradient descent algorithm, often much better than predefined convergence criteria. The loss function plots will be used to compare the effect of different adaptations of the gradient descent algorithm in the following subsections.
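As an illustration of this bookkeeping, the sketch below trains a one-variable linear model with plain gradient descent and records the validation loss every `check_every` steps. The model, the data and all parameter values are hypothetical and only serve to show the pattern; the thesis itself tracks the loss of autoencoder networks.

```python
def mean_squared_error(w, b, data):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(train_data, val_data, learning_rate=0.1, n_steps=200, check_every=20):
    """Plain gradient descent that records the validation loss at fixed intervals."""
    w, b = 0.0, 0.0
    history = []  # list of (step, validation loss) pairs, ready to be plotted
    n = len(train_data)
    for step in range(1, n_steps + 1):
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in train_data) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in train_data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        if step % check_every == 0:
            history.append((step, mean_squared_error(w, b, val_data)))
    return w, b, history


# Hypothetical noise-free data following y = 2x + 1.
train_data = [(i / 10, 2 * (i / 10) + 1) for i in range(10)]
val_data = [(i / 10 + 0.05, 2 * (i / 10 + 0.05) + 1) for i in range(10)]

w, b, history = train(train_data, val_data)
for step, val_loss in history:
    print(step, val_loss)
```

Plotting the second column of `history` against the first gives the kind of loss function plot used in the remainder of this section.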

3.1.1 Adding extensions to the gradient descent algorithm

In the first plot of Figure 3.1 the loss function is shown for a simple autoencoder as a toy model. The autoencoder has a linear activation function and one hidden layer containing two neurons. In other words, the autoencoder will learn to perform singular value decomposition with two dimensions. After the initialisation of the network parameters, the values of the bottleneck neurons will all be close to zero. Initially the gradient descent algorithm will not be able to learn much, due to the lack of structure in the network. This can be seen as a plateau at the start of the loss function, equal to the loss obtained when predicting no ingredients for any recipe. Once the gradient descent algorithm has learned some structure of the data, the first principal component will be learned at a fast pace, after which the algorithm will slow down again on a second plateau. At this point, the second principal component has yet to be learned: one of the neuron values is still close to zero, or both neurons have almost the same value for all the observations. In either case, the network is not using its full learning capacity. After escaping this plateau, the gradient descent algorithm will learn the second principal component, converging to the same loss as the singular value decomposition.

Figure 3.1: The loss function of the gradient descent algorithm and some extensions. The run time is shown between brackets. (a) Normal gradient descent (544 sec). (b) Double learning rate (270 sec). (c) Minibatches per 250 (68 sec). (d) Momentum with 0.5 inertia (296 sec).

The time it took to train this simple network was rather long: 554 seconds. In each of the other plots of the figure the gradient descent algorithm has been improved. The second plot has a doubled learning rate and converged in 270 seconds, which is about half of the original time. Increasing the learning rate will generally speed up the algorithm. However, there is an upper bound above which the algorithm might not converge fully or might even start to diverge. Minibatches have been used in the third plot, yielding a convergence time of 68 seconds. Adding minibatches to the gradient descent algorithm will speed up the calculation of the gradient in each step. In the fourth plot momentum was added, yielding a convergence time of 296 seconds. Adding momentum with 0.5 inertia was almost as useful as doubling the learning rate. This is to be expected: doubling the learning rate will increase the step size in all directions, while adding momentum with 0.5 inertia will double the step size in the directions where the gradients do not change orientation, but diminish it in the other directions. Combining all these optimisations and using the best values will enable the autoencoder to be fully trained in a few seconds. This is even faster than the analytical calculation of the singular value decomposition, which takes about 15 seconds.
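The effect of momentum on the step size can be checked numerically. The sketch below assumes the classical momentum update, in which the new step equals the inertia times the previous step minus the learning rate times the gradient; this form and the parameter values are assumptions for illustration. With a constant gradient the step grows to 1/(1 - 0.5) = 2 times the plain gradient descent step, while with a gradient that flips sign at every step it shrinks to 1/(1 + 0.5) = 2/3 of it.

```python
def momentum_steps(gradients, learning_rate=0.1, inertia=0.5):
    """Return the step taken at each iteration under classical momentum."""
    velocity = 0.0
    steps = []
    for gradient in gradients:
        velocity = inertia * velocity - learning_rate * gradient
        steps.append(velocity)
    return steps


LEARNING_RATE = 0.1
constant = momentum_steps([1.0] * 50, LEARNING_RATE)
alternating = momentum_steps([(-1.0) ** t for t in range(50)], LEARNING_RATE)

# Step size relative to plain gradient descent (learning rate times gradient size).
ratio_constant = abs(constant[-1]) / LEARNING_RATE
ratio_alternating = abs(alternating[-1]) / LEARNING_RATE
print(ratio_constant, ratio_alternating)
```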

Figure 3.2: Too much momentum can make the gradient descent algorithm unstable (left). A diverging loss function as a result of a too large learning rate (right).

While the tools discussed above can speed up the algorithm, they can also cause instability when the parameter values are chosen too extreme. In Figure 3.2 two examples of this are shown. The first plot shows the convergence of the gradient descent algorithm using too much momentum (inertia = 0.95). The loss fluctuates up and down many times before settling at the convergence value. This could have ended even worse, with a diverging loss function. An example of this is shown in the second plot, where a too large learning rate was chosen. After some fluctuation, the loss shoots up. The next values of the loss are not shown on the plot because they grew to the upper bound of the numpy floats within the three following steps.
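Both kinds of instability can be reproduced on a toy problem. The sketch below minimises the parabola J(w) = w² with the classical momentum update; the loss function, the update form and the parameter values are assumptions for illustration and are unrelated to the networks of the thesis. A learning rate of 0.4 converges, a learning rate of 1.1 diverges, and an inertia of 0.95 makes the loss fluctuate before it settles.

```python
def descend(learning_rate, inertia=0.0, n_steps=20, w=1.0):
    """Gradient descent with momentum on J(w) = w**2; returns the loss per step."""
    velocity = 0.0
    losses = [w * w]
    for _ in range(n_steps):
        gradient = 2.0 * w
        velocity = inertia * velocity - learning_rate * gradient
        w += velocity
        losses.append(w * w)
    return losses


stable = descend(learning_rate=0.4)
diverging = descend(learning_rate=1.1)
oscillating = descend(learning_rate=0.1, inertia=0.95, n_steps=300)

print("stable, final loss:", stable[-1])
print("diverging, final loss:", diverging[-1])
print("oscillating, loss after 3, 4 and 300 steps:",
      oscillating[3], oscillating[4], oscillating[-1])
```

Without momentum each step multiplies w by (1 - 2 × learning rate), so on this parabola any learning rate above 1 makes the loss grow at every step.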

In Figure 3.3 the loss function for a deep autoencoder is shown. The gradient descent algorithm starts off very fast but slows down rapidly. The loss of the deep autoencoder almost instantly drops below 0.066, which is the loss of a singular value decomposition with the same dimension reduction and cost function. If the gradient descent algorithm is trained for too long, the training loss might start to dip under the validation loss for models with a lot of parameters. This did not happen for the models of this thesis: there was a lot of data and the models did not have enough parameters for overfitting to occur.

Figure 3.3: The loss function of a complex network for the training data and validation data.

3.1.2 Plateaus

In a linear network with a least squares cost function, the loss has no local minima other than the global minimum, so the gradient descent algorithm is always able to escape the learning plateaus given enough small steps. This is not the case for a general loss function. For a non-linear network it is possible for the gradient descent algorithm to get stuck on certain plateaus in the loss function. In Figure 3.4 the bottleneck neurons for several types of plateaus are shown. The gradient descent algorithm can get stuck when all the bottleneck neurons have a value of zero or close to zero, as is the case for the first plot of the figure. In the second plot the gradient descent algorithm has learned meaningful values for the first neuron but not for the second, and is unable to escape the plateau. In the third plot, the values of the three bottleneck neurons are stuck on a two-dimensional subspace. The gradient descent algorithm can get stuck in these plateaus if the hyperparameters are not properly chosen.

Figure 3.4: The bottleneck neurons for a gradient descent algorithm getting stuck on a zero-dimensional (a, left), one-dimensional (b, right) or two-dimensional (c, bottom) subspace of the bottleneck neurons.

3.2 Comparing data reduction methods

3.2.1 Singular value decomposition

In Figure 3.5 biplots of the first two components of singular value decomposition and its variants are shown. While SVD captures the directions of greatest variation around zero, PCA captures the directions of greatest variation within the dataset, which is more useful. When the data is centered around zero, SVD will learn the same principal component directions as PCA. Both SVD and PCA are also very sensitive to the normalisation. Without normalisation the algorithms will try to model the frequencies of the ingredients and the number of ingredients per recipe. By normalising the data, these factors are eliminated and more meaningful features of the data will be learned. To check how well the reduced features maintain the structure of the data, the recipes are colored by their regions. As expected from the discussion above, PCA with normalisation seems to preserve the structure the best: recipes from the same region lie closer together.
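The relation between the two decompositions can be written out explicitly. The symbols below are introduced only for this illustration: $X$ is the recipe-by-ingredient matrix with $n$ rows, $\bar{x}$ is the vector of ingredient means, $\mathbf{1}$ is a column of $n$ ones, and the subscript 2 denotes the part belonging to the two largest singular values. The normalisation step is left out, since it acts on the data before either decomposition.

```latex
\begin{align*}
X &= U \Sigma V^\top
  && \text{(SVD: directions of greatest variation around zero)} \\
X_c &= X - \mathbf{1}\bar{x}^\top = U_c \Sigma_c V_c^\top
  && \text{(PCA: SVD of the centered data)} \\
T &= X_c V_{c,2} = U_{c,2} \Sigma_{c,2}
  && \text{(two-dimensional features shown in the biplots)}
\end{align*}
```

When the ingredient means are zero, $X_c = X$ and the two decompositions coincide.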

Figure 3.5: Biplots of different variations of singular value decomposition. (a) SVD. (b) SVD with normalisation. (c) PCA. (d) PCA with normalisation.

3.2.2 Autoencoders

An autoencoder can learn to perform singular value decomposition when linear activation functions and a least squares cost function are chosen. When such a model is fully trained, the bottleneck neurons of the autoencoder should have the same values as the reduced dimensions of SVD. In Figure 3.6 the two features of SVD are shown as well as the two bottleneck neurons of an autoencoder that has learned to perform SVD. The autoencoder manages to reproduce the results of SVD almost perfectly, aside from some scaling factors.

Figure 3.6: Two dimensional features of SVD (left) versus the two bottleneck neurons of a linear autoencoder with a least squares cost function (right).

Unlike SVD, autoencoders do not need centering and normalisation of the data to work well. This is because the autoencoder can simply add a bias term to the layers of its network. This bias term will model the frequencies of the ingredients and the number of ingredients per recipe. The bottleneck neurons are then free to represent more meaningful features of the data. Centering and normalisation of the data is also not preferred for the autoencoder: the values of the input variables would no longer be restricted to zero or one. The sigmoid activation function with a cross entropy cost function could then not be used for the output neurons. This would make the network harder to train, and the values of the input and output layers would lose their meaning as the presence or absence of an ingredient in the recipe.
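The combination of sigmoid outputs with a cross entropy cost only makes sense when the targets are zero or one, as the following minimal sketch shows. The recipe vector and the output values are hypothetical, and averaging the loss over the ingredients is an assumption about the convention used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(targets, probabilities, eps=1e-12):
    """Mean binary cross entropy over the ingredients of one recipe."""
    total = 0.0
    for t, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


# Hypothetical recipe over five ingredients: 1 = present, 0 = absent.
recipe = [1, 0, 0, 1, 0]

# Sigmoid outputs of a network that reconstructs the recipe well.
confident = [sigmoid(z) for z in (3.0, -3.0, -3.0, 3.0, -3.0)]
# Sigmoid outputs of a network that has learned nothing yet.
uninformed = [sigmoid(0.0)] * len(recipe)

loss_confident = cross_entropy(recipe, confident)
loss_uninformed = cross_entropy(recipe, uninformed)
print(loss_confident, loss_uninformed)
```

Each sigmoid output can be read as the probability that an ingredient is present. After centering, the targets would no longer be zero or one and this reading would be lost.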

Figure 3.7: Biplots of the bottleneck neurons of non-linear autoencoders with one hidden layer (left) and three hidden layers (right).

In a complex dataset with a lot of observations, autoencoders can potentially perform better than SVD by modelling more structure of the data. This is done by adding more layers and using non-linear activation functions. In Figure 3.7 the bottleneck neurons of non-linear autoencoders with one and three hidden layers are shown. The autoencoder with one hidden layer has a cross entropy loss of 0.062, which is better than 0.066, the cross entropy loss of SVD. This model does better than the first three plots in Figure 3.5 in preserving the structure, as a result of the added bias term. The PCA with normalisation still seems to outperform this autoencoder network. Adding an extra hidden layer on both sides of the bottleneck layer gives the autoencoder shown on the right side of the figure. This autoencoder has a cross entropy loss of 0.055, much lower than both the SVD and the autoencoder with one hidden layer. It also seems to maintain the structure of the data better than all the other models.

Figure 3.8: Biplots of the bottleneck neurons of autoencoders with five hidden layers. For the left plot a linear activation function was used to the bottleneck layer, while for the right plot a sigmoid activation function was used.

Adding two more hidden layers improved the performance even further. In Figure 3.8 the bottleneck neurons of autoencoders with five hidden layers are shown. For the left plot a linear activation function was used to the bottleneck layer, which resulted in a cross entropy loss of 0.051. For the right plot a sigmoid function was used, which resulted in a cross entropy loss of 0.048. Both models have a clearly lower loss than the models with fewer layers. Models with rectifier activation functions to the bottleneck layer were also tried out, but their biplots were visually not as satisfying. Adding more layers was also explored, but this did not improve the results by much compared to five hidden layers.

3.3 Prediction of the regions

In Figure 3.9 the features of the PCA model with normalisation are shown for the training and test data, as well as the KNN and QDA predictions.

Figure 3.9: Biplots of the features of PCA with normalisation. (a) Training data. (b) Test data. (c) KNN predictions. (d) QDA predictions.

The recipes of the North American region take up 73.4% of the dataset and include recipes similar to those from all over the world. These recipes would dominate the prediction of the regions of all the recipes, which would not be very useful. For this reason, the North American recipes are removed for the prediction part, but not for the training of the models. On the reduced features of PCA with normalisation, the KNN classifier has a prediction accuracy of 55.4%, while the QDA classifier has a prediction accuracy of 55.2%.
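The prediction step on two reduced features can be sketched as follows. This is a minimal plain Python illustration, not the code of the thesis: the feature values, the region labels "A" and "B" and the choice k = 3 are hypothetical. The North American recipes are dropped before the classifier is fitted and evaluated, as described above.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def drop_region(points, labels, region):
    """Remove all observations that belong to the given region."""
    kept = [(p, r) for p, r in zip(points, labels) if r != region]
    return [p for p, _ in kept], [r for _, r in kept]


# Hypothetical two-dimensional features with their regions.
train_points = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (1.0, 1.0),
                (0.9, 1.1), (1.1, 0.9), (0.5, 0.5), (0.1, 0.9)]
train_labels = ["A", "A", "A", "B", "B", "B", "North American", "North American"]
test_points = [(0.05, 0.05), (1.0, 0.95), (0.6, 0.6)]
test_labels = ["A", "B", "North American"]

train_points, train_labels = drop_region(train_points, train_labels, "North American")
test_points, test_labels = drop_region(test_points, test_labels, "North American")

predictions = [knn_predict(train_points, train_labels, q) for q in test_points]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```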

Figure 3.10: The datasets for region prediction on the two bottleneck neurons of the deep autoencoder model with a linear activation function to the bottleneck layer. (a) Training data. (b) Test data. (c) KNN predictions. (d) QDA predictions.

In Figure 3.10 the features of a deep autoencoder with a linear activation function to the bottleneck layer are shown for the training and test data, as well as the KNN and QDA predictions. The KNN classifier has a prediction accuracy of 65.0%, while the QDA classifier has a prediction accuracy of 57.8%.

In Figure 3.11 the features of a deep autoencoder with a sigmoid activation function to the bottleneck layer are shown for the training and test data, as well as the KNN and QDA predictions. The KNN classifier has a prediction accuracy of 65.4%, while the QDA classifier has a prediction accuracy of 58.2%. Both deep autoencoder models outperformed the prediction accuracies of the PCA model with normalisation.

Figure 3.11: The datasets for region prediction on the two bottleneck neurons of the deep autoencoder model with a sigmoid activation function to the bottleneck layer. (a) Training data. (b) Test data. (c) KNN predictions. (d) QDA predictions.

The performance on the raw dataset was also measured. KNN gave the best performance, with a prediction accuracy of 69.8%. This is close to the KNN prediction accuracies of the deep autoencoder models with two bottleneck neurons (65.0% and 65.4%), and much further from the KNN prediction accuracy of the PCA with normalisation (55.4%). This suggests that deep autoencoders retain much more structure of the data when reducing the dimensions. As a side experiment, some models were tested with a higher number of bottleneck neurons, from which a model with 100 bottleneck neurons was selected as the model with the highest prediction accuracy on the validation dataset. The prediction accuracy of this model on the test data was 72.0%. This suggests that deep autoencoder models can also be useful for representation learning, although the benefit compared to the raw dataset was small.

3.4 Collaborative filtering for recipe creation

Several autoencoder architectures have been explored for collaborative filtering. It was found that deeper models performed better. Using sigmoid activation functions instead of rectifier activation functions also improved the performance. The author also noted something unexpected: models that were not fully trained appeared to perform better than models for which the gradient descent algorithm had fully converged, even though the fully trained models were not overfitted. The fully trained models had the lowest validation loss, and this loss was very close to the training loss. The author does not see any reason why the models that were not fully trained performed better. From all the different models, the one performing best on the validation dataset was used on the test dataset.

As explained in Chapter 2, the recipes are modified by randomly either removing or adding an ingredient as a way to measure the performance. If the autoencoder works well for collaborative filtering, it should be able to find the changed ingredient. For the recipes with a removed ingredient, the ingredients not present in the adapted recipe are ranked by how well they would fit the adapted recipe. For the recipes with an added ingredient, the ingredients of the adapted recipe are ranked by how badly they fit the adapted recipe. In both cases, a low rank means the autoencoder did well in finding the changed ingredient.
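The two ranking procedures can be sketched in a few lines of Python. The function names are hypothetical; the reconstruction values are the ones reported in Tables 3.1 and 3.3, of which only the top five suggestions are available for the first recipe.

```python
def rank_removed(reconstruction, adapted_recipe, removed):
    """Rank of the removed ingredient among the ingredients absent from the
    adapted recipe, sorted from highest to lowest reconstruction value."""
    candidates = [i for i in reconstruction if i not in adapted_recipe]
    candidates.sort(key=lambda i: reconstruction[i], reverse=True)
    return candidates.index(removed) + 1


def rank_added(reconstruction, adapted_recipe, added):
    """Rank of the added ingredient among the ingredients of the adapted
    recipe, sorted from lowest to highest reconstruction value."""
    candidates = sorted(adapted_recipe, key=lambda i: reconstruction[i])
    return candidates.index(added) + 1


# Recipe of Table 3.1, from which vanilla was removed.
adapted = {"cocoa", "cream cheese", "eggs", "milk", "wheat"}
values = {"cream": 0.58, "vanilla": 0.55, "butter": 0.35,
          "yeast": 0.25, "vegetable oil": 0.20}
removed_rank = rank_removed(values, adapted, "vanilla")

# Recipe of Table 3.3, to which mustard was added.
adapted = {"milk", "coffee", "cocoa", "mustard"}
values = {"mustard": 0.03, "cocoa": 0.84, "coffee": 0.94, "milk": 0.96}
added_rank = rank_added(values, adapted, "mustard")

print(removed_rank, added_rank)
```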

3.4.1 Reconstruction of the removed ingredient

In Table 3.1 an example of ingredient retrieval is shown for one of the test recipes. The original recipe contains the following ingredients: cocoa, cream cheese, eggs, milk, wheat and vanilla. This recipe has been modified by removing vanilla. The modified recipe has been put through the network, resulting in reconstruction values for all ingredients. From the ingredients not included in the modified recipe, the five with the highest reconstruction values were selected. The missing ingredient vanilla ranks second with a reconstruction value of 55%. For this recipe, the model did very well in retrieving the missing ingredient. Expecting a perfect retrieval is not reasonable: other ingredients might also combine well with the adapted recipe. The other suggestions in the table would indeed be good combinations with the adapted recipe.

Ingredients          cocoa, cream cheese, eggs, milk, wheat
Suggestions to add   cream   vanilla   butter   yeast   vegetable oil
Reconstruction %     58      55        35       25      20

Table 3.1: The first row contains the ingredients of a recipe from which vanilla was removed. The second row contains the top five suggestions of ingredients to add to the adapted recipe, with their reconstruction values in the last row. The removed ingredient vanilla was ranked second among the suggestions.

In Table 3.2 the performance measures for ingredient retrieval of the autoencoder are shown and compared to the two models of De Clercq et al. [2]. The deep autoencoder has a mean rank of 25.2 and a median rank of 8 for the removed ingredient. This performance is very good: randomly selecting ingredients would result in ranks uniformly distributed between one and the number of ingredients not in the recipe (there are 381 ingredients in total). Compared to the models of De Clercq et al., the autoencoder outperforms the non-negative matrix factorisation model and comes close in performance to the two-step kernel ridge regression model. Deep autoencoders can thus be used as an alternative method for collaborative filtering.

Performance measure   mean rank   median rank   % with rank ≤ 10
Deep autoencoder      25.2        8             54.5
NMF                   33.0        12            48.2
Two-step KRR          23.6        7             59.1

Table 3.2: Comparing the rank of ingredient reconstruction for the deep autoencoder, non-negative matrix factorization (NMF) and two-step kernel ridge regression (Two-step KRR) models.
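The performance measures in the table are simple summaries of the ranks obtained on the test recipes, which the following sketch makes explicit. The list of ranks is hypothetical and does not reproduce the results of the thesis. The random baseline follows from the uniform distribution of the ranks mentioned above; the example of 376 candidates assumes an adapted recipe of five ingredients, as in Table 3.1.

```python
from statistics import mean, median


def summarise(ranks, top=10):
    """Mean rank, median rank and percentage of ranks within the top positions."""
    return {
        "mean_rank": mean(ranks),
        "median_rank": median(ranks),
        "pct_top": 100.0 * sum(r <= top for r in ranks) / len(ranks),
    }


def random_baseline_mean_rank(n_candidates):
    """Expected rank when the candidates are ordered uniformly at random."""
    return (n_candidates + 1) / 2


# Hypothetical ranks of the removed ingredient for eight test recipes.
ranks = [1, 2, 3, 8, 8, 15, 40, 120]
summary = summarise(ranks)

# 381 ingredients in total, five of which are in the adapted recipe.
baseline = random_baseline_mean_rank(381 - 5)
print(summary, baseline)
```

The skewed toy ranks also show why the median rank can be much lower than the mean rank, as is the case for all three models in the table.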

3.4.2 Elimination of the added ingredient

In Table 3.3 an example of elimination of an added ingredient is shown for one of the test recipes. The original recipe contains the following ingredients: milk, coffee and cocoa. This recipe has been modified by adding mustard. The modified recipe has been put through the network, resulting in the reconstruction values of the ingredients shown in the table. The added ingredient mustard ranks lowest with a reconstruction value of 3%. The model did very well in retrieving the added ingredient.

Ingredients        mustard   cocoa   coffee   milk
Reconstruction %   3         84      94       96

Table 3.3: A recipe for cappuccino. Mustard has been randomly added as an extra ingredient and has a very bad reconstruction, unlike the other ingredients. The model predicts mustard as the first ranked ingredient to eliminate from the recipe.

In Table 3.4 the performance measures for ingredient elimination of the autoencoder are shown. The model has a mean rank of 1.5 and a median rank of 1 for the added ingredient, and eliminated the correct ingredient 78.8% of the time. Eliminating an added ingredient is much easier than retrieving a removed ingredient, since the model only has to pick from the ingredients that are in the adapted recipe. Nevertheless, this performance is very good.

Performance measure   mean rank   median rank   % first rank
Deep autoencoder      1.5         1             78.8

Table 3.4: The performance measures of the elimination of a randomly added ingredient.

Chapter 4

Conclusion and discussion

4.1 Conclusion

In the thesis, it was explored how to train deep autoencoder networks on cooking recipes using the gradient descent algorithm. Since it can be very hard to train deep autoencoders [1], several extensions were added to improve the gradient descent algorithm. Adding minibatches and momentum to the gradient descent algorithm helped to speed up the algorithm and improved the performance. Pretraining the network with similar, easier to train networks prevented the algorithm from getting stuck in plateaus with a high loss.

These deep autoencoders were then compared to singular value decomposition for the purpose of data reduction of the ingredients of recipes to two dimensions. To measure the performance, the cross entropy loss for the reconstruction of the ingredients was used. Singular value decomposition had a loss of 0.066, while the best deep autoencoder performed much better with a loss of 0.048.

On the two reduced dimensions of all models, supervised machine learning was used to predict the regions of the recipes. This was done using two algorithms, KNN and QDA, with KNN outperforming QDA on all models. On the SVD model (PCA with normalisation), the KNN algorithm had a prediction accuracy of 55.4%. On the deep autoencoder models, the KNN algorithm had a much better prediction accuracy: 65.0% for the model with a linear activation function to the bottleneck neurons and 65.4% for the model with a sigmoid activation function to the bottleneck neurons. Performing KNN on the raw dataset resulted in a prediction accuracy of 69.8%, suggesting that the two bottleneck neurons of the deep autoencoders maintained the structure of the regions very well. Performing KNN on the reduced features of a deep autoencoder with 100 bottleneck neurons gave a prediction accuracy of 72.0%, suggesting deep autoencoders might have some usefulness for representation learning on the dataset.

Separate deep autoencoder models were trained for collaborative filtering with the purpose of making recommender systems. The performance of these recommender systems was tested by the ranks of the recommendations of the ingredients that were randomly either removed from or added to a recipe. De Clercq et al. [2] have built two similar recommender models on the same dataset. As can be seen in Table 4.1, the deep autoencoder outperforms the non-negative matrix factorization and comes close to the two-step kernel ridge regression.

Performance measure           mean rank   median rank   % first rank   % top 10
Reconstruction DAE            25.2        8             /              54.5
Reconstruction NMF            33.0        12            /              48.2
Reconstruction two-step KRR   23.6        7             /              59.1
Elimination DAE               1.5         1             78.8           /

Table 4.1: The performance measures of reconstruction of a randomly removed ingredient for the deep autoencoder (DAE), non-negative matrix factorization (NMF) and two-step kernel ridge regression (two-step KRR) models. The elimination performance for the added ingredients is also included.

4.2 Discussion

To improve the gradient descent algorithm even further, a variant of an adaptive learning rate could be implemented, rather than using a fixed learning rate. Also, only a limited number of deep autoencoder architectures were explored in the thesis. All the deep autoencoder models had very little overfitting, since the training and validation loss were always very close. It is possible that deep autoencoder architectures with more parameters might fit the data better, although these types of complex models will likely be even harder to train. The different performance measures explored in the thesis could be further improved with those models. This could be explored in further research.

Appendix A

Admission for circulating the work

The author, promoter and co-promoter give permission to consult this master dissertation and to copy it or parts of it for personal use. Every other use falls under the restrictions of the copyright, in particular concerning the obligation to mention explicitly the source when using results of this master dissertation.

Bibliography

[1] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[2] Marlies De Clercq, Michiel Stock, Bernard De Baets, and Willem Waegeman. Data-driven recipe completion using machine learning methods. Trends in Food Science & Technology, 49:1–13, 2016.

[3] Yong-Yeol Ahn, Sebastian E Ahnert, James P Bagrow, and Albert-László Barabási. Flavor network and the principles of food pairing. Scientific reports, 1, 2011.

[4] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[5] Douglas B Lenat and Ramanathan V Guha. Building large knowledge-based systems; representation and inference in the Cyc project. Addison-Wesley Longman Publishing Co., Inc., 1989.

[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[7] Shlomo Mor-Yosef, Arnon Samueloff, Baruch Modan, Daniel Navot, and Joseph G Schenker. Ranking the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstetrics & Gynecology, 75(6):944–947, 1990.

[8] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.


[9] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

[10] Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19:153, 2007.

[11] Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using context-dependent deep neural networks. In Interspeech, pages 437–440, 2011.

[12] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.

[13] Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.

[14] Andrew Ng. Machine learning course on coursera. https://www.coursera.org/learn/machine-learning. Accessed: 2017-01-23.

[15] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.

[16] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.

[18] D Randall Wilson and Tony R Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10):1429–1451, 2003.

[19] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[20] Divergence of the gradient descent algorithm with a too large learning rate on a parabolic loss function. http://www.cs.cornell.edu/courses/cs4780/2015fa/web/lecturenotes/lecturenote07.html. Accessed: 2017-01-23.

[21] The effect of the different learning rates on the convergence of the loss function with the gradient descent algorithm. https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/more_images/learningrates.jpeg. Accessed: 2017-01-23.

[22] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A cpu and gpu math compiler in python. In Proc. 9th Python in Science Conf, pages 1–7, 2010.

[23] Martin Fodslette Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural networks, 6(4):525–533, 1993.

[24] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.