DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016

The effects of Deep Belief Network pre-training of a Multilayered perceptron under varied labeled data conditions

CHRISTOFFER MÖCKELIND

MARCUS LARSSON

KTH School of Computer Science and Communication, Royal Institute of Technology

The effects of Deep Belief Network pre-training of a Multilayered perceptron under varied labeled data conditions

The effects of initializing an MLP with a trained DBN given different limitations on labeled data

Authors: Christoffer Möckelind, Marcus Larsson
Supervisor: Pawel Herman
Examiner: Örjan Ekeberg

May 11, 2016

Abstract

Sometimes finding labeled data for machine learning tasks is difficult. This is a problem for purely supervised models like the Multilayered perceptron (MLP). A Discriminative Deep Belief Network (DDBN) is a semi-supervised model that is able to use both labeled and unlabeled data. This research aimed to move towards a rule of thumb for when it is beneficial to use a DDBN instead of an MLP, given the proportions of labeled and unlabeled data. Several trials with different amounts of labels, from the MNIST and Rectangles-Images datasets, were conducted to compare the two models. It was found that for these datasets, the DDBNs had better accuracy when few labels were available. With 50% or more labels available, the DDBNs and MLPs had comparable accuracies. It is concluded that a rule of thumb of using a DDBN when less than 50% of labels are available for training would be in line with the results. However, more research is needed to draw any general conclusions.

Summary

Labeled data can sometimes be hard to find for machine learning tasks. This is a problem for models based on supervised learning, for example the Multilayered Perceptron (MLP). A Discriminative Deep Belief Network (DDBN) is a semi-supervised model that can use both labeled and unlabeled data. This research aims to approach a rule of thumb for when it is beneficial to use a DDBN instead of an MLP, at different proportions of labeled and unlabeled data. Several trials with different amounts of labeled data, from the MNIST and Rectangles-Images datasets, were conducted to compare the two models. It was found that for these datasets, the DDBNs had better accuracy when only a small amount of labeled data was available. When 50% or more of the data was labeled, the DDBNs and MLPs had comparable accuracy. The conclusion is that a rule of thumb of using a DDBN when less than 50% of the training data is labeled would be in line with the results. However, more research is needed to draw any general conclusions.

Contents

1 Introduction
  1.1 Scope
2 Background
  2.1 Multilayer Perceptron
  2.2 Deep Belief Network - DBN
    2.2.1 Restricted Boltzmann Machine - RBM
    2.2.2 Contrastive divergence (CD)
    2.2.3 Deep Belief Network
    2.2.4 Discriminative Deep Belief Network
  2.3 Related research
3 Method
  3.1 Datasets
  3.2 Measurements
  3.3 Training methods
  3.4 Parameter selection
  3.5 Tools
4 Results
  4.1 MNIST
    4.1.1 First architecture
    4.1.2 Second architecture
    4.1.3 Third architecture
    4.1.4 Architecture comparison
  4.2 Rectangles-Images
    4.2.1 First architecture
    4.2.2 Second architecture
    4.2.3 Third architecture
    4.2.4 Architecture comparison
  4.3 Label variation
5 Discussion
  5.1 Other findings
  5.2 Limitations
6 Conclusion
  6.1 Further research

List of abbreviations

Architecture - The number of layers in the network and the number of neurons in each layer.

NN - Neural network
MLP - Multilayered Perceptron - A neural network usually trained with BP
DBN - Deep Belief Network - A neural network trained with stacked RBMs and CD learning
DDBN - Discriminative DBN - An MLP that is initialized with the weights from a trained DBN
RBM - Restricted Boltzmann Machine
BP - Backpropagation - An algorithm for updating weights in a neural network
GD - Gradient Descent - Training method that computes the gradient of a function and follows that gradient to find a minimum of the function
CD - Contrastive Divergence - Approximate training method based on the log-likelihood gradient

1 Introduction

Today the amount of available data is constantly growing, and sometimes it is hard to associate the data with labels. This is a problem for purely supervised models like the Multilayer perceptron (MLP).

The MLP, an artificial neural network (ANN) model, has been used in many applications since the 1960s[1]. Over the years, other types of networks have been created that outperform MLPs in various applications[1]. However, a version of the MLP was recently (2015) the winning approach in a competition of finding the best taxi route[2], which shows they are still useful. MLPs are trained with back propagation (BP) and gradient descent (GD)[3]. When the number of layers increases, so does the risk of the vanishing gradient problem[1], which makes BP learning harder.

Another ANN model is the Discriminative Deep Belief Network (DDBN)[4]. The difference between the MLP and the DDBN is the initialization of weights. The weights of an MLP are initialized with random values, while a DDBN receives its weights from a trained DBN[4]. It can hence be said that when an MLP is pre-trained by a DBN we obtain a DDBN. The DBN is one among several networks trained with methods that reduce the vanishing gradient problem by training greedily layer-by-layer[1]. The DBN is an unsupervised model[5, 6, 7], which gives it the possibility to train on data without labels. This is useful since it is usually easier to acquire unlabeled, rather than labeled, data. While a DDBN may profit from this unsupervised pre-training[1], it requires additional time. Also, if the network is shallow, an MLP might be just as accurate with less training time, in addition to being a simpler model. This raises the question of when DBN pre-training is useful.

Earlier comparisons of the two models have shown that the MLP can perform better than the DDBN[8], whilst other comparisons show the opposite[9, 10, 11]. In experiments conducted by Larochelle et al., DDBNs outperformed MLPs on both MNIST and other image classification datasets[11]. It is worth noting that neither the DDBN nor the MLP can compete with convolutional ANNs on image classification, and hence this research should not be viewed as an attempt at ranking state-of-the-art image classification models. DDBNs also outperformed MLPs on the Aurora2 dataset in research conducted by Vinyals and Ravuri[9]. Comparing MNIST benchmark results of the two models shows that the MLP has the best performance[8]. In some of the cases above, DDBNs have outperformed MLPs; in others, MLPs seem to be better. Notably, their results are often quite similar.

When labeled training data is scarce and there is an abundance of unlabeled training data, it has been observed that the accuracy of DDBNs is higher than that of MLPs[4]. However, when there is a moderate amount of labeled data it is unclear if the pre-training provides any significant improvement. In the earlier research mentioned above, the authors either did not use the same architecture for the DDBN and MLP, or did not mention what architectures were used. The architecture does have an impact on what functions the network can model, which is why comparing different architectures might not result in a fair comparison. Therefore, this study compares MLPs and DDBNs with the same architectures.

With this study, we aim to cast light on what effect the unsupervised pre-training of a DDBN has compared to an MLP only trained with BP. We investigate at what proportions between labeled and unlabeled data DDBNs are preferred to MLPs.
Furthermore we focus on the question: How does the amount of labeled data affect the accuracy of DDBNs compared to MLPs with the same architecture?

1.1 Scope

The goal of this study was to empirically investigate the accuracy of the two models when they have the same architecture, while varying the amount of available labeled training data. We tested three architectures on each of the datasets, and used architectures that have been successful in earlier research on the same dataset. The BP learning rate was selected using a limited grid search. Other model parameters were selected according to guidelines, or through short empirical tests, and kept static during the trials. We did not focus on tuning the parameters to get state-of-the-art results, since this might imply different architectures for the different networks and amounts of labeled data. The amount of labels available to the networks was varied from a few to all. The investigation was conducted using two benchmark datasets, the MNIST[8] handwritten digit classification dataset and the synthetic Rectangles-Images dataset[12].

2 Background

2.1 Multilayer Perceptron

An MLP is a standard deep learning model, also called a deep feedforward network[13, p. 164]. It is a model for learning non-linear functions. An MLP consists of one input layer, one output layer and one or more hidden layers. Each layer has a number of neurons that process data through the network. Figure 1 illustrates the layers and neurons. If there are no hidden layers, the network reduces to a logistic regression model[14]. With a single hidden layer, an MLP can approximate many non-linear functions, and with more layers it can also learn complex functions[14]. On the other hand, training becomes harder and takes more time with more layers[9].

Figure 1: An illustration of a multi-layered feed-forward network

The output of an MLP is an approximation of the function that the network should learn. Let $f^*$ be the function we want to approximate, and let $f_n$ be the function that the last layer of the MLP outputs. Every layer of the MLP outputs the function:

$f_i = \sigma(W_i f_{i-1} + b_i)$   (1)

where $f_0 = x$ is the input to the network, $i$ indicates the layer, $W_i$ is the weight matrix for that layer and $b_i$ is the bias vector for that layer. $\sigma$ is called the activation function, which is some non-linear function. Commonly used activation functions are for example tanh, sigmoid and softmax[13].

There are many ways of training an MLP[3]. The most commonly used is backpropagation (BP) with gradient descent (GD)[3]. During gradient descent, the k-th update of the weights at layer $i$ is given by

$\Delta W_i(k) = -\alpha \nabla_{W_i} L + \mu \Delta W_i(k-1)$   (2)

where $\alpha$ is the learning rate, $\mu$ is the momentum and $L$ is the loss[13]. Examples of loss functions are cross-entropy and L2. Cross-entropy loss minimizes the difference between the actual and expected distributions[15], while L2 minimizes the Euclidean distance between the result and the expected value[16, 17]. Cross-entropy has been shown to perform better when weights are randomly initialized[18].

A version of GD that has been shown to have good performance for large datasets is Stochastic Gradient Descent (SGD)[19]. SGD is GD, but the gradient is computed from a random portion of the dataset[19]. These portions are called batches. The training set is divided into batches so that all training examples are in exactly one batch, and training is done by iterating through all batches. When all batches have been run, one epoch has passed. The selection of training algorithm can have a significant impact on the results[3].

Training a deep MLP with BP can encounter the vanishing gradient problem[1]. To overcome that problem, there are methods that train the network layer-by-layer with algorithms other than BP[1]. The DBN is one example of a network that uses a layer-by-layer training algorithm[1].
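To make equations (1) and (2) concrete, here is a minimal NumPy sketch (not the toolbox code used in this thesis) of a forward pass through a small MLP and one momentum update; the layer sizes, learning rate and momentum are arbitrary illustration values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary illustration sizes: 784 inputs, one hidden layer of 500 units, 10 outputs.
rng = np.random.default_rng(0)
W = [rng.normal(0.0, 0.01, (500, 784)), rng.normal(0.0, 0.01, (10, 500))]
b = [np.zeros(500), np.zeros(10)]
velocity = [np.zeros_like(w) for w in W]  # stores the previous update, Delta W_i(k-1)

def forward(x):
    """Equation (1): f_i = sigma(W_i f_{i-1} + b_i), with f_0 = x."""
    f = x
    for Wi, bi in zip(W, b):
        f = sigmoid(Wi @ f + bi)
    return f

def sgd_momentum_step(grads, alpha=0.1, mu=0.5):
    """Equation (2): Delta W_i(k) = -alpha * grad_{W_i} L + mu * Delta W_i(k-1)."""
    for i, g in enumerate(grads):
        velocity[i] = -alpha * g + mu * velocity[i]
        W[i] += velocity[i]
```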

2.2 Deep Belief Network - DBN

A DBN is a generative stochastic model for deep learning using several layers of latent variables[20]. The weights gained from this model after training can be used as a starting point for training a multi-layer network[21, p. 25]. The building blocks of a DBN are several Restricted Boltzmann Machines (RBM)[22]. In this section we will first explain how an RBM works and in the end come back to how to build a DBN out of RBMs.

2.2.1 Restricted Boltzmann Machine - RBM

An RBM could be interpreted as a particular instance of a Markov Random Field (MRF), in other words, an undirected stochastic graphical model[21]. The nodes in an RBM are divided into two sets, h = hidden and v = visible[21]. Together, they form a complete bipartite graph[21] (Figure 2). An RBM can be defined by letting it have $m$ visible and $n$ hidden nodes so that $v = (v_1, v_2, \ldots, v_m)$ and $h = (h_1, h_2, \ldots, h_n)$. The joint distribution of this RBM can be expressed as

$p(v, h) = \frac{1}{Z} e^{-E(v,h)}$   (3)

where $Z$ is a normalizing factor. The distribution in equation 3 is called a Gibbs distribution[21]. $E(v, h)$ is the energy function[21] of the form

$E(v, h) = -\sum_{i=1}^{n}\sum_{j=1}^{m} w_{i,j} h_i v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i$   (4)

[21, p. 26], where $w_{i,j}$ are the weights between hidden and visible nodes, and $c_i$ and $b_j$ are the bias terms for the corresponding hidden and visible nodes.

Figure 2: An illustration of an RBM
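As a concrete reading of equations (3) and (4), the sketch below computes the energy and the unnormalized probability of a single (v, h) configuration; the weight shape follows equation (4), and the normalizing factor Z is left out since it is intractable for realistically sized RBMs.

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """Equation (4): E(v, h) = -h^T W v - b^T v - c^T h,
    with W of shape (n_hidden, n_visible), b the visible bias and c the hidden bias."""
    return -(h @ W @ v) - (b @ v) - (c @ h)

def unnormalized_p(v, h, W, b, c):
    """Equation (3) without the normalizing factor Z: exp(-E(v, h))."""
    return np.exp(-rbm_energy(v, h, W, b, c))
```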

Training an RBM can be achieved by using the Maximum Likelihood (ML) method[21, p. 19], i.e. maximizing $p(v \mid \theta)$, where $\theta = (b, c, W)$ concatenated into a vector. However, using ML is not efficient enough[21, p. 27]. Currently, contrastive divergence learning is the standard way of training RBMs[21, p. 27].

2.2.2 Contrastive divergence (CD)

CD learning is an effective algorithm for training an RBM[21, p. 27]. CD learning approximates the log-likelihood gradient given some data and performs gradient ascent on that approximation[21, p. 27]. Even though it is an approximation, it has been shown to work well enough in practice[23, p. 5]. Training starts by initializing the visible layer with a training example vector[23, p. 4]. Because there are no direct connections within the visible or the hidden layer in an RBM, the hidden units can be sampled from the visible through

$p(h \mid v) = \sigma(W v + c)$   (5)

where $\sigma$ is the logistic function $\sigma(x) = 1/(1 + e^{-x})$[23]. The units can either be set to binary values, where $p(h \mid v)$ is the probability for the values in $h$ to be equal to 1, or set to $h = p(h \mid v)$[23, p. 4]. The next step in the training is to sample a new visible vector that reconstructs the data vector by calculating $p(v \mid h)$[23, p. 6]. Calculating these two probabilities is called a Gibbs step[23, p. 4]. Performing $n$ Gibbs steps before updating weights and biases in CD learning is called CD$n$[23]. CD1 works well in practice[24].
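A minimal CD1 update for a binary RBM could look like the sketch below. Our experiments relied on the DeepLearnToolbox implementation; this NumPy version is only meant to illustrate the Gibbs step and the resulting weight update described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1, rng=None):
    """One CD1 step for a binary RBM.
    v0: batch of training vectors, shape (batch, n_visible);
    W: weights of shape (n_hidden, n_visible); b: visible bias; c: hidden bias."""
    rng = np.random.default_rng() if rng is None else rng
    ph0 = sigmoid(v0 @ W.T + c)                       # p(h | v0), equation (5)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample binary hidden states
    pv1 = sigmoid(h0 @ W + b)                         # reconstruction, p(v | h0)
    ph1 = sigmoid(pv1 @ W.T + c)                      # p(h | v1)
    n = v0.shape[0]
    W += lr * (ph0.T @ v0 - ph1.T @ pv1) / n          # data minus reconstruction statistics
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```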

2.2.3 Deep Belief Network

Now that RBMs have been described, we can start to understand a DBN. A DBN can be viewed as stacked RBMs[20]. Training is done layer-by-layer[20]. First the bottom two layers are viewed as an RBM and CD learning is applied. When training the next layer, the hidden layer of the first RBM becomes the visible layer of the next RBM, and that layer is then trained with CD. Figure 3 illustrates the stacking of RBMs to construct a DBN, and a minimal sketch of the procedure follows after the figure.

Figure 3: To the left is one RBM; after training, another RBM can be stacked on top to train more layers, as in the right image.
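In code, the greedy stacking could be sketched as follows; it reuses the hypothetical `sigmoid` and `cd1_update` helpers from the sketch above, the hidden layer sizes are placeholders, and minibatching is omitted for brevity.

```python
import numpy as np

def train_dbn(data, hidden_sizes, epochs=200, lr=0.1, seed=0):
    """Greedy layer-by-layer DBN training: train an RBM on the data with CD1, then use
    its hidden activations as the 'visible' input of the next RBM, and so on."""
    rng = np.random.default_rng(seed)
    rbms, x = [], data
    for n_hidden in hidden_sizes:
        n_visible = x.shape[1]
        W = rng.normal(0.0, 0.01, (n_hidden, n_visible))
        b, c = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            W, b, c = cd1_update(x, W, b, c, lr, rng)  # whole set as one batch, for brevity
        rbms.append((W, b, c))
        x = sigmoid(x @ W.T + c)  # hidden activations feed the next RBM
    return rbms
```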

2.2.4 Discriminative Deep Belief Network

A DDBN is an MLP where the weights are initialized with the weights from a trained DBN. After CD learning, a DDBN can be configured for classification by adding an output layer on top of the last RBM. The network can then be trained with BP[20, 4]. BP works much better if the weights in the hidden layers are initialized with the weights from a CD-learned DBN[25].
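Concretely, turning the trained DBN into a DDBN amounts to copying weights, as in this sketch that continues the hypothetical helpers above; only the added output layer starts from random values before BP fine-tuning.

```python
import numpy as np

def dbn_to_ddbn(rbms, n_outputs, seed=0):
    """Initialize an MLP (i.e. a DDBN) from a trained DBN: reuse each RBM's weight
    matrix W and hidden bias c, then add a randomly initialized output layer on top."""
    rng = np.random.default_rng(seed)
    weights = [W.copy() for (W, b, c) in rbms]
    biases = [c.copy() for (W, b, c) in rbms]
    n_last_hidden = rbms[-1][0].shape[0]
    weights.append(rng.normal(0.0, 0.01, (n_outputs, n_last_hidden)))  # new output layer
    biases.append(np.zeros(n_outputs))
    return weights, biases  # fine-tune with BP exactly like an ordinary MLP
```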

2.3 Related research

Previous research comparing MLPs to DDBNs was conducted by Vinyals and Ravuri[9]. The comparison was conducted using the Aurora2 dataset[9]. A DDBN outperformed the MLPs[9]. Several architectures were tested and the networks with the best accuracy were presented. The architecture of the best MLP and the best DDBN was not the same. Vinyals and Ravuri stated that in their initial tests, they used the same architecture for the DDBN as for the MLP, and that the DDBN yielded no change in performance compared to the MLP[9]. They also speculate that DDBNs may only be useful if the networks are deep[9].

Another paper comparing DDBNs and MLPs examines their accuracy and training time, with and without weight initialization from a DBN, on three music classification tasks[10]. The dataset used contained a lot of unlabeled data, and the authors concluded that this helped convergence speed during the BP fine-tuning of the DDBN. This is similar to conclusions made by Morgan[26, p. 11-12], who argues that training a DDBN will matter more when labeled training data is scarce. However, Dieleman et al.[10] found that the DDBN and the MLP had similar accuracy in all the tasks, even though the DDBN had a faster convergence rate.

Larochelle et al.[11] compared the accuracy of DDBNs and other networks on several image-recognition datasets, MNIST and Rectangles-Images being two of them. For MNIST, they only used 10000 labeled training examples and no unlabeled data. The validation and test sets had sizes of 2000 and 50000 respectively. Two different DDBNs were tested, DBN-3 and DBN-1, where the number indicates the number of hidden layers[11]. An MLP with one hidden layer was also tested, but it did not have the same number of neurons in the hidden layer as the DBN-1[11]. The DDBNs outperformed the MLP in all benchmarks[11].

Zhou et al.[4] compared the accuracy of DDBNs and other classification methods when labeled data was scarce. They motivate their study by arguing that unlabeled data is easier to acquire. They found that when few labels and a moderate amount of unlabeled data were available, DDBNs outperform other methods like the MLP (NN). In their experiments, less than 100 labels were available, which raises the question of how an increased amount of labels would affect their results.

Yu et al.[27] compared the performance of DDBNs when altering the amount of unlabeled data on speech recognition. It was found that as long as there existed sufficient data to initialize the DDBN weights, adding moderate amounts of unlabeled training data for the DBN gave insignificant improvements in prediction accuracy. In conjunction with other results that have shown that DDBNs often have performance similar to MLPs[9, 10], the absence of an increase in accuracy when unlabeled training data is increased indicates that the CD learning of DBNs does not provide a significant advantage when there is enough labeled data. In addition to that, Zhou et al.[4] suggest that DDBNs have their greatest strength when labeled data is scarce.

When comparing results on the MNIST benchmark dataset, the best MLP is a 6-layered network with an error rate of 0.35%[8]. The MLP committee approach[28] is not far away with 0.39%. In the list of MNIST results, there is one approach that uses RBM training and got a 1% error rate[8]. However, it was combined with other methods[8]. MLPs are actually outperforming DDBNs on the MNIST database according to the benchmark[8]. However, they have different architectures, which does not make them comparable in the way we aim for in this paper.

3 Method

As previously stated in section 1.1, the goal of this study was to empirically investigate the accuracy of DDBNs and MLPs when they have the same architecture, while varying the amount of available labeled training data. To achieve this, we conducted six trials with different portions of available labels, for each of three selected architectures, for both MNIST and Rectangles-Images. During each trial, both a DDBN and an MLP were trained with the given architecture. The MLP used BP training with the given labeled portion of the whole dataset. The DDBN first used CD learning on the whole dataset, and later conducted BP training on the portion of the dataset with available labels. When we later refer to scenarios where X% of labeled data was available, we mean that X is the portion of labels available for BP training. In this section we will describe the datasets that were used, how parameters like architecture, learning rate and amounts of labeled data were chosen, and what training methods and tools were used.
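Schematically, one trial can be summarized as below. The four callables are hypothetical stand-ins for the toolbox routines we actually used; only the data flow (CD learning on all training data, BP on the labeled portion only) is the point of this sketch.

```python
def run_trial(train_x, train_y, labeled_idx, test_x, test_y,
              build_mlp, pretrain_dbn, fit_bp, error_rate):
    """One trial of the comparison. build_mlp, pretrain_dbn, fit_bp and error_rate are
    placeholders for model construction, CD pre-training, BP fine-tuning and error
    computation."""
    x_lab, y_lab = train_x[labeled_idx], train_y[labeled_idx]
    # MLP: random weight initialization, BP training on the labeled portion only.
    mlp = fit_bp(build_mlp(), x_lab, y_lab)
    # DDBN: CD learning on the WHOLE training set (labels ignored), then BP
    # fine-tuning on the same labeled portion as the MLP.
    ddbn = fit_bp(pretrain_dbn(train_x), x_lab, y_lab)
    return error_rate(mlp, test_x, test_y), error_rate(ddbn, test_x, test_y)
```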

3.1 Datasets

MNIST is a subset of the NIST dataset that contains handwritten digits[8]. The standard MNIST set consists of 60,000 training images with labels and 10,000 test images with labels. Every image provided is 28x28 pixels. The task associated with the dataset is to detect which number from 0 to 9 is contained in an image.

As stated earlier, the other dataset used was the artificial Rectangles-Images dataset[12]. The dataset consists of 28x28 pixel images of rectangles. During creation of the dataset, the authors describe how the width and height of the rectangles were sampled uniformly. Further, the inside and border of each rectangle was filled with one image, and the background with another. The width and height of the rectangles were limited to be at least 10 pixels, and their difference 5 pixels. The rectangles were also limited to cover 25-75% of the images. The dataset consists of 12000 labeled training examples and 50000 test examples. The task associated with the dataset is to detect whether a given rectangle has a larger width or height.

From the training set in the datasets above, 20% was dedicated to validation, and the remaining 80% was used for actual training during the trials. When a reduced portion of labels was available, the training set was first reduced by the given ratio. After that, the validation and actual training sets were created, in order to resemble real-world scenarios with limited labeled data. The amounts of available labels we ran tests with are displayed in table 1 below. We considered using all these percentages of labels for both datasets, but if we found that running a certain label ratio for one of the datasets would not help us reach our goal of determining when to use the DDBN over the MLP, we did not make a run with that label amount for that dataset. Additionally, the reason for using different amounts of labels at the lower label ratios was comparability with other research.

Table 1: Amounts of available labels (in % of the training set) we ran trials for, for the two datasets

Labels (%)   MNIST   Rectangles-Images
0.1          X
0.2          X
0.5                  X
1                    X
2            X
10                   X
20           X       X
50           X       X
100          X       X
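The order of operations described in section 3.1 (reduce the labeled pool first, then carve out the validation set) can be sketched as follows, assuming the data is held in NumPy arrays; `make_splits` and its arguments are illustrative names rather than the code we ran.

```python
import numpy as np

def make_splits(train_x, train_y, label_fraction, val_fraction=0.2, seed=0):
    """Keep label_fraction of the training set as the labeled pool, then split that pool
    into 80% actual training data and 20% validation data. The full train_x stays
    available as unlabeled data for the CD pre-training of the DDBN."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(train_x))
    labeled = idx[:int(len(idx) * label_fraction)]
    n_val = int(len(labeled) * val_fraction)
    val_idx, tr_idx = labeled[:n_val], labeled[n_val:]
    labeled_train = (train_x[tr_idx], train_y[tr_idx])
    validation = (train_x[val_idx], train_y[val_idx])
    return labeled_train, validation, train_x  # train_x = unlabeled pool
```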

3.2 Measurements

For BP training, the datasets were divided into training, validation and test sets as described above. Every second epoch of the training, the networks classified all examples in the validation set and an error rate was computed. This error rate is called the validation error, and it is the percentage of incorrectly classified images. The error is calculated as follows: let $S$ be the number of examples in the set, in this case the validation set, $P$ the number of correctly classified examples, and $F$ the number of incorrectly classified examples. Then $S = P + F$ and the error $E$ (validation error) is:

$E = \frac{F}{S}$   (6)

After training, the networks were tested against the test set, which they had never seen before. The test error was computed in the same way as the validation error, with equation 6. The accuracy of a network for a specific test is defined as $A = 1 - E$. Therefore, we analyzed the accuracy of the networks by comparing their test errors.
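As a trivial but concrete reading of equation (6) and the accuracy definition, the error computation amounts to:

```python
import numpy as np

def error_rate(predicted_labels, true_labels):
    """Equation (6): E = F / S, the fraction of incorrectly classified examples."""
    incorrect = np.sum(predicted_labels != true_labels)
    return incorrect / len(true_labels)

def accuracy(predicted_labels, true_labels):
    """A = 1 - E."""
    return 1.0 - error_rate(predicted_labels, true_labels)
```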

3.3 Training methods

Our MLPs were trained with BP and stochastic gradient descent (SGD), and the DDBNs were trained with CD1 and then fine-tuned with BP and SGD. The loss function used was cross-entropy, in conjunction with a final softmax layer. We also evaluated L2 loss. Initial tests, however, indicated that this resulted in worse accuracy on both datasets.
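The loss setup used here, a softmax output layer with cross-entropy, can be written out as below; this is a generic NumPy formulation rather than the toolbox's own code.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; the max is subtracted for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, onehot_targets, eps=1e-12):
    """Mean cross-entropy between predicted class probabilities and one-hot targets."""
    return -np.mean(np.sum(onehot_targets * np.log(probs + eps), axis=1))
```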

3.4 Parameter selection

The architecture can in principle be any combination of layer sizes, which makes it practically impossible to determine good architectures with a brute-force approach. Therefore, this study used architectures that have been successful in previous research on the given datasets. In the case of MNIST, the hidden layer architectures we ran trials with were

1. [500 150]
2. [2500 2000 1500 1000 500]
3. [500 500 2000]

The first of these is the best performing MLP on the MNIST benchmark website that only uses BP[8]. The second one was chosen since the best performing MLP on the MNIST benchmark website used this architecture[8]. Finally, the third one was selected because it was used in the DBN tutorials from Hinton, which produced good results[29].

For the Rectangles-Images dataset[12], we used an approach similar to that of Larochelle et al.[11]. To determine the hyper parameters to use for their DBN-3 (a DDBN), they used a grid search technique that they believed would yield reasonable local minima. Their grid search was done within the bounds displayed in table 2.

Table 2: Parameter ranges tested by Larochelle et al.[11]

Hyper parameter            Range
First hidden layer size    [500, 3000]
Second hidden layer size   [500, 4000]
Third hidden layer size    [1000, 6000]
Learning rate CD-1         [0.0001, 0.1]
Learning rate GD           [0.0001, 0.1]

We were not able to conduct as extensive a search of architectures as Larochelle et al. Instead, we tested the following hidden layer sizes within their search space.

1. [500, 500, 1000]
2. [1500, 2000, 3000]
3. [3000, 4000, 6000]

A grid search over BP learning rates was conducted for both the DDBN and the MLP in each trial, for both of the datasets. The accuracies of the networks were later compared, and the ones with the best validation error rates are presented below as our results for the given trial. The learning rates tested in the grid search for the Rectangles-Images dataset were [.0001, .001, .01, .1], all in the range used by Larochelle et al. For MNIST, the tested learning rates were [.001, .002, .003, .01, .02, .1, .2, .3, 1, 2, 3].

During BP learning we used momentum. For MNIST we used the default setting of 0.5 in the deep learning toolbox[17]. A short test with the Rectangles-Images dataset showed that a BP momentum of 0.1 resulted in better accuracies, and it was hence used in our experiments.

The batch size was selected to be small. In Hinton's guide for RBM training, he suggests using a small batch size, but large enough so there is probably at least one example of each label in every batch[23]. Since MNIST has 10 different labels, and to increase the comparability of results between the two datasets, we decided to have each batch consist of 50 examples for both datasets.

In the case of BP training, we used early stopping. If the validation error rate did not drop for 30 consecutive epochs, training was aborted. With MNIST we stopped at 350 epochs if an early stop did not occur earlier, and with Rectangles-Images we stopped after 500 epochs.

The learning rate and momentum to use for DBN learning, as well as the number of epochs to train for, were selected by experimenting. We trained DBNs for different numbers of epochs, with different learning rates, for different architectures, for the two datasets, with a low amount of data to reduce training time. We then trained DDBNs using their weights. The parameters we used for the DBNs were the ones that seemed to generally produce the lowest validation errors for all architectures. The parameters used can be found in table 3. The momentum used for the MNIST DBNs was selected based on recommendations from Hinton's practical guide[23]. In the case of Rectangles-Images, however, experiments indicated that the momentum term did not have any significant impact on the resulting DDBN accuracy, hence a momentum of 0 was used.

Table 3: Parameters used for the DBN learning for the two datasets

Parameter       MNIST   Rectangles-Images
Learning rate   1       0.1
Momentum        0.5     0
Epochs          200     150
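Putting the BP parameter selection together, a simplified sketch of the grid search with early stopping might look like the following. `train_one_epoch` and `validation_error` are hypothetical stand-ins for the toolbox calls, the patience of 30 epochs matches the text above, and for simplicity this sketch checks the validation error every epoch rather than every second epoch.

```python
def grid_search(learning_rates, train_one_epoch, validation_error,
                max_epochs=350, patience=30):
    """Train one network per learning rate, abort a run when the validation error has
    not improved for `patience` consecutive epochs, and keep the overall best network."""
    best_lr, best_err, best_model = None, float("inf"), None
    for lr in learning_rates:
        model, run_best, epochs_since_improvement = None, float("inf"), 0
        for _ in range(max_epochs):
            model = train_one_epoch(model, lr)  # one BP epoch at this learning rate
            err = validation_error(model)
            if err < run_best:
                run_best, epochs_since_improvement = err, 0
            else:
                epochs_since_improvement += 1
                if epochs_since_improvement >= patience:  # early stopping
                    break
        if run_best < best_err:
            best_lr, best_err, best_model = lr, run_best, model
    return best_lr, best_err, best_model
```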

3.5 Tools

To run the tests, we used Matlab with the DeepLearnToolbox library provided by Berg Palm[17]. The library provides, among other models, implementations of an MLP and a DBN where one can set the parameters and then call the train or test methods. The MLP uses BP with SGD and the DBN is implemented with CD1 learning.

4 Results

4.1 MNIST

The results from the three network architectures described in the method section are presented below. Tables 5, 7 and 9 show the lowest error rate achieved on the validation set and the corresponding test error rate. The error rate is the percentage of incorrectly classified data examples. From the tables it is clear that with very low amounts of labeled data on MNIST, the DDBN outperforms the MLP with the same architecture.

4.1.1 First architecture

Table 4 shows the parameters of the first architecture tested. The DDBN and MLP were trained for a maximum of 350 epochs if early stopping did not occur before.

Table 4: Parameters used on the networks using the first architecture on the MNIST dataset. The MLP and DDBN learning rate was varied since a grid search was conducted for the parameter. See section 3.4 for more information.

Architecture                784-500-150-10
BP maximum epochs           350
MLP learning rate           varied
DDBN learning rate          varied
BP momentum                 0.5
CD learning epochs          200
CD learning learning rate   1
CD learning momentum        0.5

Table 5 shows the error rates after training. With low amounts of labeled data, the DDBN outperforms the MLP. With 20% labels the networks' errors become similar, and at 50% labels the DDBN is just slightly better. The reader should keep in mind that the test set is 10000 images, which means that a difference of 0.1% is only 10 images out of 10000. With 100% labels the networks have basically the same accuracy; the MLP has a lower validation error than the DDBN but a higher test error. Nevertheless, they seem to have reached similar minima, which the MLP took less time to reach considering the CD learning pre-training of the DDBN.

Table 5: Best achieved validation error and test error from same epoch on MNIST with networks of architecture 784-500-150-10.

Network           0.1%    0.2%    2%      20%    50%    100%
MLP validation    42.22   27.64   11.60   4.17   2.64   1.81
MLP test          42.81   26.86   10.65   4.00   2.64   1.97
DDBN validation   37.58   20.96   7.49    3.34   2.36   1.90
DDBN test         36.97   19.69   6.44    3.24   2.41   1.91

Figure 4 shows the learning progress of the networks when using 50% of the labels. The DDBN starts with a lower validation error and has a slightly better validation error through all epochs, but already after about 50 epochs both networks seem to have stagnated. Both networks also did early stopping before 220 epochs. The DDBN had 200 epochs of pre-training with CD learning, which means that the MLP reached almost the same accuracy in a significantly shorter total training time.

Figure 4: MNIST training progress with 50% labels and architecture 784-500-150-10

4.1.2 Second architecture

With 5 hidden layers, this architecture was the deepest that we tested. The DDBN had quite good accuracy with few labels available, but when all training data was labeled the accuracies of the networks were similar.

Table 6: Parameters used on the networks using the second architecture on the MNIST dataset. The MLP and DDBN learning rate was varied since a grid search was conducted for the parameter. See section 3.4 for more information.

Architecture                784-2500-2000-1500-1000-500-10
BP maximum epochs           350
MLP learning rate           varied
DDBN learning rate          varied
BP momentum                 0.5
CD learning epochs          200
CD learning learning rate   1
CD learning momentum        0.5

Table 7 shows the error rates after training. With this deep architecture, the difference between the MLP and the DDBN is bigger with small amounts of labels than it was for the previous architecture (see table 5). The MLP validation error has almost caught up to the DDBN validation error when 50% of the training data was labeled, and at 100% labels available, the networks seem to have found the same validation error minima.

Table 7: Best achieved validation error and test error from same epoch on MNIST with networks of architecture 784-2500-2000-1500-1000-500-10.

Network           0.1%    0.2%   2%      20%    50%    100%
MLP validation    48.43   31.18  12.76   5.35   3.18   1.86
MLP test          47.83   29.83  11.53   5.02   3.23   2.08
DDBN validation   13.74   8.98   6.04    3.50   2.36   1.86
DDBN test         13.33   8.36   5.74    3.23   2.40   1.89

4.1.3 Third architecture

This architecture was chosen because it was used in the DBN tutorial by Hinton[29]. Therefore, we assumed that the DDBN would perform well with this architecture. The assumption was correct; the DDBN with this architecture achieved our best test error rate on MNIST from 20% labels and above. However, it was still not as good as the second architecture with less than 20% labels.

Table 8: Parameters used on the networks using the third architecture on the MNIST dataset. The MLP and DDBN learning rate was varied since a grid search was conducted for the parameter. See section 3.4 for more information.

Architecture                784-500-500-2000-10
BP maximum epochs           350
MLP learning rate           varied
DDBN learning rate          varied
BP momentum                 0.5
CD learning epochs          200
CD learning learning rate   1
CD learning momentum        0.5

Table 9 shows the error rates after training. This architecture also illustrates the trend that the MLP and DDBN become more equal with more labels, and that the DDBN outperforms the MLP when few labels are available.

Table 9: Best achieved validation error and test error from same epoch on MNIST with networks of architecture 784-500-500-2000-10

Network           0.1%    0.2%    2%      20%    50%    100%
MLP validation    42.01   29.47   12.18   4.55   2.76   1.86
MLP test          41.62   28.59   11.30   4.47   2.86   1.99
DDBN validation   32.91   19.76   6.76    3.18   2.22   1.61
DDBN test         33.1    19.19   6.05    2.97   2.19   1.58

4.1.4 Architecture comparison

With only 0.1% of the training data labeled, the first architecture had the worst accuracy compared to the other architectures, for both the DDBN and the MLP. As the amount of labels increased, the first architecture got better accuracy than the other architectures. When 100% of the data was labeled, the first architecture had better accuracy than the second architecture for the MLP. In addition to having good accuracy, the first network was also the smallest and therefore had the least training time due to fewer computations.

For this study, the MLP had similar accuracy no matter what architecture was chosen. For each amount of labels, the MLP accuracies in tables 5, 7 and 9 have a low variance. The accuracy of the DDBN, however, differed vastly. The DDBNs had similar accuracy when 100% of the data was labeled, but differed vastly when only 0.1% was labeled. The second architecture (also the deepest) could achieve significantly better accuracy with only a few labels. The DDBN with the second architecture had a test error of only 13.33% with 0.1% of the labels available, whilst the other architectures had more than 30% test error rate. This means that the CD learning had more effect on the bigger architectures in our study for MNIST.

Figure 5 shows the validation error progress for both the MLP and the DDBN under all labeled data conditions tested. It can be observed that the DDBN had a more stable and faster convergence than the MLP in all tests, which was also observed by Dieleman et al.[10]. This could be a matter of tuning the learning rate, but the results shown are the best tuning from our small empirical tests. In some cases, the faster convergence of the DDBN led to early stopping before the MLP. This behavior is desirable since the DDBN has already been pre-trained with CD learning before BP training. Each epoch of BP training takes the same amount of time for the DDBN and the MLP of the same architecture, so fewer epochs of BP training means the CD learning was beneficial.

The architecture had a major impact on the training time. The amount of computation increases with the size of the architecture. The first architecture was the smallest and had the least training time; in addition, with all labels available it achieved accuracy better than or almost as good as the other architectures. We did not measure training time exactly, but roughly estimated, the MLP with the first architecture had a training time of less than 1 hour with 100% labeled data, compared to the second and biggest architecture with a training time of 12 hours.

Figure 5: Summary of BP training progress of all architectures on the MNIST dataset. The plots in the first column represent the first architecture, the second column the second architecture and the third column the third architecture. Each row represents one of the following ratios of available labels, in order from top to bottom: 0.1%, 0.2%, 2%, 20%, 50%, 100%. In each plot the Y-axis represents the validation error rate and the X-axis the epoch at which the error rate was achieved. DDBN validation error is displayed in green and MLP validation error in red.

4.2 Rectangles-Images

On the Rectangles-Images dataset, using below 1% of labeled data was too little for practically any learning to occur for either the DDBN or the MLP, see figure 6. This is not surprising, keeping in mind that 1% of 12000 is only 120 items. Still, it is notable that this amount was enough for the DDBN to learn on MNIST. This indicates the different characteristics of the two datasets. When the amount of labels increased to 10%, we can see that the DDBN outperformed the MLP in all architectures, and that the difference increased as the architectures grew larger, which was the case with MNIST as well. When the amount of labeled data was increased to 50%, the MLP and DDBN results became comparable, even though the DDBN always slightly outperformed the MLP. However, considering the additional time the DDBN spent on CD learning, one should be careful comparing in such a fine-grained manner, since the MLP could have been trained for more BP epochs than the DDBN and would still have trained for less time in total.

Figure 6: Summary of BP training progress of all architectures on the Rectangles-Images dataset. The plots in the first column represent the first architecture, the second column the second architecture and the third column the third architecture. Each row represents one of the following ratios of available labels, in order from top to bottom: 0.5%, 1%, 10%, 20%, 50%, 100%. In each plot the Y-axis represents the validation error rate and the X-axis the epoch at which the error rate was achieved. DDBN validation error is displayed in green and MLP validation error in red.

4.2.1 First architecture

Table 10: Parameters used on the networks using the first architecture on the Rectangle-Images dataset. The MLP and DDBN learning rate was varied since a grid search was conducted for the parameter. See section 3.4 for more information.

Architecture                784-500-500-1000-2
BP maximum epochs           500
MLP learning rate           varied
DDBN learning rate          varied
BP momentum                 0.1
CD learning epochs          150
CD learning learning rate   0.1
CD learning momentum        0

Table 11 shows the error rates after training. Notable from these results is the difference of 2.4% in the test error of the MLP compared to the DDBN when 10% of the labels are available. Keeping in mind the trend displayed in figure 7, it is possible that the MLP would have closed the gap to the DDBN given more training time. When 50% of the labels were available, the difference between the DDBN and the MLP was only 0.02%, which is not significant for this architecture. At 100% the difference in accuracy of the networks slightly increased; however, the accuracies are still comparable.

Figure 7: Rectangles-Images 500-500-1000 10% of labels

Table 11: Best achieved validation error and corresponding test error from same epoch on the Rectangles-Images dataset with networks of architecture 784-500-500-1000-2.

Network           0.5%   1%     10%    20%    50%     100%
MLP validation    46.7   40.0   26.1   26.4   23.1    22.4
MLP test          50.2   49.6   28.7   27.3   24.2    23.7
DDBN validation   46.7   36.0   20.3   23.7   22.9    22.2
DDBN test         50.2   44.2   26.3   24.9   24.22   23.1

4.2.2 Second architecture

Table 12: Parameters used on the networks using the second architecture on the Rectangle-Images dataset. The MLP and DDBN learning rate was varied since a grid search was conducted for the parameter. See section 3.4 for more information.

Architecture                784-1500-2000-3000-2
BP maximum epochs           150
MLP learning rate           varied
DDBN learning rate          varied
BP momentum                 0.1
CD learning epochs          150
CD learning learning rate   0.1
CD learning momentum        0

Table 13 shows the validation and test errors after training. The DDBN with this architecture had the best test error rate, 24.6%, among the architectures when only 10% of the labels were available. It is notable that increasing the amount of labels to 100% only gave a 1.6% test error improvement. This showcases the ability of the CD learning to find good initial weights. At 50%, the difference between the MLP test error and the DDBN test error is 1.3%. Considering the extra training time the DDBN had, these errors are comparable. Another result in table 13 that stands out is that the DDBN test error increased when the amount of labels increased from 10% to 20%. This could be due to a bad learning rate for the given amount of labels.

Table 13: Best achieved validation error and corresponding test error from the same epoch on the Rectangles-Images dataset with networks of architecture 784-1500-2000- 3000-2.

Network           0.5%   1%     10%    20%    50%    100%
MLP validation    46.7   44.0   38.2   26.6   24.1   23.3
MLP test          50.2   49.8   43.6   27.6   24.9   24.4
DDBN validation   40.0   44.0   19.5   23.9   21.0   22.1
DDBN test         50.2   44.2   24.6   25.2   23.6   23.0

4.2.3 Third architecture

Table 14: Parameters used on the networks using the third architecture on the Rectangle-Images dataset. The MLP and DDBN learning rate was varied since a grid search was conducted for the parameter. See section 3.4 for more information.

Architecture                784-3000-4000-6000-2
BP maximum epochs           150
MLP learning rate           varied
DDBN learning rate          varied
BP momentum                 0.1
CD learning epochs          150
CD learning learning rate   0.1
CD learning momentum        0

Table 15 shows the validation and test errors after training. This architecture achieved our best DDBN accuracy with 100% of the labels on the Rectangles-Images dataset. The reader can also see that with this architecture the MLP was not able to conduct any significant learning when 10% of the labels were available, but the DDBN was able to achieve acceptable results.

Table 15: Best achieved validation error and corresponding test error from the same epoch on the Rectangles-Images dataset with networks of architecture 784-3000-4000-6000-2.

Network           0.5%   1%     10%    20%    50%    100%
MLP validation    26.6   40.0   42.7   27.6   22.9   23.1
MLP test          49.7   49.7   42.7   27.9   24.4   24.3
DDBN validation   26.7   32.0   20.7   23.1   22.1   21.9
DDBN test         44.4   44.7   25.3   23.9   23.3   22.5

4.2.4 Architecture comparison

Regarding tables 11, 13 and 15, one can observe that when 10% of the labels were available, there is a significant difference between the test and validation errors of the DDBN across all architectures. We can observe from figure 6 that the DDBN and MLP validation errors started at roughly the same level in all cases with 10% labels. However, the CD learning has been able to initialize the weights in such a manner that the BP training manages to reduce the error rate quickly in the larger architectures. Another result to observe is that the larger the architecture, the harder it seems to be for the MLP to achieve good results when the amount of labels is low, whilst the DDBN seems to be able to find results comparable with the ones from smaller architectures.

4.3 Label variation

The focus in this report was to compare how varied amounts of labeled data affect the accuracy of DDBNs and MLPs. Figures 8 and 9 illustrate the difference in accuracy for each architecture and amount of labels tested. The DDBN always had better accuracy, except for the smallest architecture on the Rectangles-Images dataset. However, this exception was most likely random since neither the DDBN nor the MLP had learned anything at that point. There is a clear trend that when increasing the amount of labels, the MLP closes in on the DDBN error rate. Already at 50% labeled data, the difference between the MLP and the DDBN is almost insignificant in all trials. At 100% labeled data, the difference on MNIST decreased a small amount, whilst the difference on Rectangles-Images increased again. However, the difference could still be insignificant.

Figure 8: The difference between DDBN and MLP test error on MNIST for each architecture and amount of labels. The Y-axis represents MLP test error minus DDBN test error. A positive value means that the DDBN had a lower error rate than the MLP. The X-axis represents the architecture and each color the amount of labels as displayed in the legend.

Figure 9: The difference between DDBN and MLP test error on Rectangles-Images for each architecture and amount of labels. The Y-axis represents MLP test error minus DDBN test error. A positive value means that the DDBN had a lower error rate than the MLP and a negative value means the MLP had a lower error rate than the DDBN. The X-axis represents the architecture and each color the amount of labels as displayed in the legend.

5 Discussion

This study focused on comparing MLPs and DDBNs. Our objective was to investigate how profitable the DBN pre-training of a DDBN is under varied labeled data conditions. We intended to analyze whether there might exist a rule of thumb that could indicate when to use a DDBN instead of an MLP, with respect to different proportions of labeled and unlabeled data. We found that the DDBN's accuracy advantage over the corresponding MLP diminished as the amount of labels increased, and at 50% labeled data, the DDBN and MLP had comparable accuracy.

For benchmarks, two image classification datasets were chosen. It is important to note that the accuracies in our results are not state-of-the-art for these datasets. There are better methods for image classification, such as convolutional NNs. As our research question states, we focused on how varied amounts of labels affect the accuracies of DDBNs compared to MLPs. Part of the question had already been answered by Zhou et al.[4], who conducted a similar comparison of DDBNs and MLPs; however, they only reported tests with up to 100 labels. They stated that when labeled data was that scarce, the DDBNs clearly outperformed the MLPs.

There are several studies that compare accuracies of DDBNs and MLPs where all training data was labeled[9, 11, 8], which show that the accuracies of DDBNs and MLPs are similar. However, it is unclear how the accuracies are affected when the portion of labeled data is between a few percent and 100%. This research sheds light upon the effect of labeled data portions in this range.

Figures 8 and 9 showcase the diminishing advantage we found for using a DDBN. As the amount of labels increases, the difference in accuracy between the DDBN and the MLP decreases. At 20%, the difference between them is starting to get low, and at 50%, the majority of our trials had similar accuracies for both models. This is in line with Yu et al.[27], who increased the amount of unlabeled data so that they had 50% labeled. They stated that adding this additional amount of unlabeled data did not give any significant improvement for the DDBN.

The exact test errors of our trials can be viewed in tables 5, 7, 9, 11, 13 and 15. For MNIST with 50% labels, the difference between the DDBN and the MLP was always less than 0.83 percentage points. That is less than 83 examples, since the test set was 10000 examples. For the Rectangles-Images dataset with 50% labels, the difference between the DDBN and the MLP was always less than 1.3 percentage points, which is less than 650 examples since the test set was 50000 examples. That might sound like a lot, but considering that the test errors were around 20%, which is 10000 out of 50000, 650 is not that big of a difference.

This study has provided a complementary picture of the results from Zhou et al.[4]. We tested a broader range of labeled data in the comparison of the networks, giving the MLP more fair conditions against the DDBN. Our results also verify their conclusion that when the amount of labels is scarce, the DDBN outperforms the MLP. Our results have also shown that there are cases where the DDBN has accuracy comparable to the MLP already when only 50% of the training data is labeled. However, more research is needed to conclude whether it is a general trend that increasing the portion of labeled training data leads to more similar performance between the MLP and the DDBN.

To return to our objective of whether there might be a rule of thumb for when to use a DDBN instead of an MLP, we need to consider a few more aspects than the accuracies. The first would be the training time. Since the DDBN uses CD learning as pre-training, it takes longer to train unless the early stopping in BP occurs a lot earlier than for the MLP. In figures 5 and 6 it can be seen that early stopping occurred earlier for the DDBN than for the MLP only a few times. On the other hand, the DDBN could have stopped earlier in many cases if a more sophisticated early stopping method was used. As stated earlier, the CD learning still takes up a lot of training time. For example, the CD learning for MNIST with the second architecture lasted about 13 hours, and BP training with 100% labels lasted another 12 hours. Therefore, in about half the time, the MLP achieved almost the same accuracy as the DDBN. With 50% labels, the CD learning takes the same amount of time but the BP training takes less. Therefore, it should be considered that the CD learning takes a long time, and if it does not yield a significant improvement it might not be worth it. The architecture also seems to be an important factor when choosing between an MLP and a DDBN.
Figures 8 and 9 illustrate that bigger architectures seem to gain more from the CD learning pre-training than the smaller architectures, especially with sparse labeled data. This needs to be tested more before a general conclusion can be drawn; however, all our tests indicate that there might be a correlation.

5.1 Other findings

Larochelle et al.[11] compared the performance of a 3-layered DDBN with several other models, including a single hidden layer MLP, on the Rectangles-Images dataset, as already mentioned in section 2.3. They reported a test error rate of 33.2% using an MLP and 22.5% using a 3-layered DDBN. Comparing this with our results, we can see that their DDBN performance is the same as our best result. Our methods used parameters and architectures from the same ranges, so this is not an unexpected result. On the other hand, our best MLP had a test error rate of 23.7% and hence outperformed the MLP used by Larochelle et al.[11]. They do not report what hyper parameters were used to train their MLP. In conjunction with the inaccuracy of their MLP, this indicates that the training of the MLP might not have received as much attention as the training of the DDBN in their study. As previously stated, we have seen a trend of DDBNs performing better in our deeper and wider architectures, while the MLP performs better in our smaller architectures. The performance of Larochelle et al.'s MLP, in conjunction with the higher accuracy of our deeper MLP, shows that it might be hard to speculate about what should be considered a shallow network in any given case. We suggest that a good rule of thumb might be to at least try the same architectures and learning rates when doing a comparison between MLPs and DDBNs.

5.2 Limitations

In this study, we used only two datasets, and both were image classification datasets. Therefore, it is hard to draw a general conclusion for the networks. If we were to run tests on more datasets and different kinds of classification problems, the comparison could be more general. However, MNIST is a classification task where data examples from the same class can be very similar, in the sense that the activated pixels are almost the same, and there are only 10 labels. Rectangles-Images is a binary classification task where data examples from the same class can be very different in terms of width, height and position. This can be hard to learn with few labels of each class. Therefore, the datasets chosen can still be considered quite different.

Given our limited computational resources, several reductions in scope, as stated in section 1.1, had to be made. These scope reductions may in the worst case mean that we did not find good parameter combinations for the different models. We selected to do a grid search over the BP learning rate because it was the parameter we found had the highest impact on accuracies during initial experiments, and it was available in both models, hence yielding more comparable results. However, other parameters that we did not thoroughly investigate may also play a decisive role in determining the performance of the two models. Notably, the DDBN has many more parameters that could be grid searched than the MLP, inherent in its DBN pre-training. It is possible that such a search would yield different results from ours.

Another limitation related to limited computational power is that we only ran each combination of parameters once during our grid search. Since we use stochastic training methods, the same hyper parameters may lead to different outcomes in different runs, and an unlucky or an especially lucky run might cause outlying results. On the other hand, since we ran every architecture with several different learning rates, one could hope that an unlucky run with a certain learning rate would be somewhat offset by another run with a different learning rate.

6 Conclusion

Our goal with this study was to investigate if there might exist a rule of thumb for when to use a DDBN instead of an MLP. By focusing on the question of how varied amounts of labeled data affect the accuracies of DDBNs and MLPs with the same architecture, this study has shown that such a rule of thumb might exist. This research was conducted on two image recognition datasets, MNIST and Rectangles-Images. For these datasets, the results indicate only a small gain from using DDBNs when more than 50% of the training data is labeled. Considering the extra training time of the DDBN, it might therefore not be worthwhile to use the DDBN when more than 50% of the training data is labeled. However, research on many more datasets is required to verify this as a good rule of thumb.

6.1 Further research

Both the MNIST and Rectangles-Images datasets have thousands of training examples. In our tests, all of this data was made available as unlabeled data. It could be interesting to see what happens when the total amount of training data is limited. For example, use 25000 of the training examples, make all 25000 available as unlabeled data, and vary the portion of labeled data as we did. We speculate that the performance difference between DDBNs and MLPs might correlate with the proportions between available labeled and unlabeled data.

The deepest architecture we tested had 5 hidden layers. The MLP seemed to be able to handle that quite well when all labels were available. By testing even deeper networks, the MLP might be affected by the vanishing gradient problem, resulting in bad performance, whilst the DDBN might perform even better, based on our observation that deeper DDBNs seem to profit more from the pre-training.

References

[1] J. Schmidhuber, "Deep Learning," Scholarpedia, vol. 10, no. 11, p. 32832, 2015. Revision #152272.

[2] A. de Brébisson, É. Simon, A. Auvolat, P. Vincent, and Y. Bengio, "Artificial neural networks applied to taxi destination prediction," arXiv preprint arXiv:1508.00021, 2015.

[3] N. Coskun and T. Yildirim, "The effects of training algorithms in MLP network on image classification," in Neural Networks, 2003. Proceedings of the International Joint Conference on, vol. 2, pp. 1223–1226, IEEE, 2003.

[4] S. Zhou, Q. Chen, and X. Wang, "Discriminative deep belief networks for image classification," in Proceedings of the International Conference on Image Processing, ICIP 2010, September 26-29, Hong Kong, China, pp. 1561–1564, 2010.

[5] A. Fischer, "Training restricted boltzmann machines," KI, vol. 29, no. 4, pp. 441–444, 2015.

[6] G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[7] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pp. 153–160, 2006.

[8] Y. LeCun, C. Cortes, and C. Burges, "MNIST handwritten digit database." http://yann.lecun.com/exdb/mnist/. Accessed 2016.

[9] O. Vinyals and S. V. Ravuri, "Comparing multilayer perceptron to deep belief network tandem features for robust ASR," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic, pp. 4596–4599, 2011.

[10] S. Dieleman, P. Brakel, and B. Schrauwen, "Audio-based music classification with a pretrained convolutional network," in Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011, Miami, Florida, USA, October 24-28, 2011, pp. 669–674, 2011.

[11] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in Proceedings of the 24th international conference on Machine learning, pp. 473–480, ACM, 2007.

[12] D. Erhan, "Rectangles and rectangles-images data," Jun 2007.

[13] I. Goodfellow, Y. Bengio, and A. Courville, "Deep learning." Book in preparation for MIT Press, 2016.

[14] "Multilayer perceptron — deeplearning 0.1 documentation [internet]." http://deeplearning.net/tutorial/mlp.html. Accessed 2016.

[15] "Multilayer perceptron — deeplearning 0.1 documentation [internet]." http://deeplearning.net/tutorial/logreg.html. Accessed 2016.

[16] "Getting started — deeplearning 0.1 documentation [internet]." http://deeplearning.net/tutorial/gettingstarted.html#l1-l2-regularization. Accessed 2016.

[17] R. B. Palm, "DeepLearnToolbox." https://github.com/rasmusbergpalm/DeepLearnToolbox. Accessed 2016.

[18] P. Golik, P. Doetsch, and H. Ney, "Cross-entropy vs. squared error training: a theoretical and experimental comparison," in INTERSPEECH, pp. 1756–1760, 2013.

[19] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010, pp. 177–186, Springer, 2010.

[20] R. Salakhutdinov and G. E. Hinton, "Semantic hashing," Int. J. Approx. Reasoning, vol. 50, no. 7, pp. 969–978, 2009.

[21] A. Fischer and C. Igel, "An introduction to restricted boltzmann machines," in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pp. 14–36, Springer, 2012.

[22] N. Le Roux and Y. Bengio, "Representational power of restricted boltzmann machines and deep belief networks," Neural Computation, vol. 20, no. 6, pp. 1631–1649, 2008.

[23] G. E. Hinton, "A practical guide to training restricted boltzmann machines," in Neural Networks: Tricks of the Trade - Second Edition, pp. 599–619, 2012.

[24] "Restricted boltzmann machine — deeplearning 0.1 documentation [online]." http://deeplearning.net/tutorial/rbm.html. Accessed 2016.

[25] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.

[26] N. Morgan, "Deep and wide: Multiple layers in automatic speech recognition," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 1, pp. 7–13, 2012.

[27] D. Yu, L. Deng, and G. Dahl, "Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition," in Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.

[28] U. Meier, D. C. Ciresan, L. M. Gambardella, and J. Schmidhuber, "Better digit recognition with a committee of simple neural nets," in 2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, September 18-21, 2011, pp. 1250–1254, 2011.

[29] G. Hinton, "Deep Belief Nets." https://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.pdf, 2007. Accessed 2016.
