The Effects of Deep Belief Network Pre-Training of a Multilayered Perceptron Under Varied Labeled Data Conditions
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016

KTH Royal Institute of Technology
School of Computer Science and Communication

The effects of Deep Belief Network pre-training of a Multilayered perceptron under varied labeled data conditions
Swedish title: Effekterna av att initialisera en MLP med en tränad DBN givet olika begränsningar av märkt data

Authors: Christoffer Möckelind, Marcus Larsson
Supervisor: Pawel Herman
Examiner: Örjan Ekeberg

May 11, 2016

Abstract

Sometimes finding labeled data for machine learning tasks is difficult. This is a problem for purely supervised models like the Multilayered Perceptron (MLP). A Discriminative Deep Belief Network (DDBN) is a semi-supervised model that is able to use both labeled and unlabeled data. This research aimed to move towards a rule of thumb for when it is beneficial to use a DDBN instead of an MLP, given the proportions of labeled and unlabeled data. Several trials with different amounts of labels, from the MNIST and Rectangles-Images datasets, were conducted to compare the two models. It was found that for these datasets, the DDBNs had better accuracy when few labels were available. With 50% or more of the labels available, the DDBNs and MLPs had comparable accuracies. It is concluded that a rule of thumb of using a DDBN when less than 50% of the labels are available for training would be in line with the results. However, more research is needed before any general conclusions can be drawn.

Sammanfattning

Labeled data can sometimes be hard to find for machine learning tasks. This is a problem for models based on supervised learning, such as the Multilayered Perceptron (MLP). A Discriminative Deep Belief Network (DDBN) is a semi-supervised model that can use both labeled and unlabeled data. This research aims to approach a rule of thumb for when it is advantageous to use a DDBN instead of an MLP, given different proportions of labeled and unlabeled data. Several trials with different amounts of labeled data, from the MNIST and Rectangles-Images datasets, were conducted to compare the two models. It was found that for these datasets, the DDBNs had better accuracy when only a small amount of labeled data was available. When 50% or more of the data was labeled, the DDBNs and MLPs had comparable accuracy. The conclusion is that a rule of thumb of using a DDBN when less than 50% of the training data is labeled would be in line with the results. However, more research is needed before any general conclusions can be drawn.

Contents

1 Introduction
  1.1 Scope
2 Background
  2.1 Multilayer Perceptron
  2.2 Deep Belief Network - DBN
    2.2.1 Restricted Boltzmann Machine - RBM
    2.2.2 Contrastive divergence (CD)
    2.2.3 Deep Belief Network
    2.2.4 Discriminative Deep Belief Network
  2.3 Related research
3 Method
  3.1 Datasets
  3.2 Measurements
  3.3 Training methods
  3.4 Parameter selection
  3.5 Tools
4 Result
  4.1 MNIST
    4.1.1 First architecture
    4.1.2 Second architecture
    4.1.3 Third architecture
    4.1.4 Architecture comparison
  4.2 Rectangles-Images
    4.2.1 First architecture
    4.2.2 Second architecture
    4.2.3 Third architecture
    4.2.4 Architecture comparison
  4.3 Label variation
5 Discussion
  5.1 Other findings
  5.2 Limitations
6 Conclusion
  6.1 Further research
List of abbreviations

Architecture - The number of layers in the network and the number of neurons in each layer.
NN - Neural Network
MLP - Multilayered Perceptron - A neural network usually trained with BP.
DBN - Deep Belief Network - A neural network trained with stacked RBMs and CD learning.
DDBN - Discriminative DBN - An MLP that is initialized with the weights of a trained DBN.
RBM - Restricted Boltzmann Machine
BP - Backpropagation - An algorithm for updating the weights of a neural network.
GD - Gradient Descent - A training method that computes the gradient of a function and follows that gradient to find a minimum of the function.
CD - Contrastive Divergence - An approximate method for log-likelihood training.

1 Introduction

Today the amount of available data is constantly growing, and it is sometimes hard to associate the data with labels. This is a problem for purely supervised models like the Multilayer Perceptron (MLP).

The MLP, an artificial neural network (ANN) model, has been used in many applications since the 1960s[1]. Over the years, other types of networks have been created that outperform MLPs in various applications[1]. However, a version of the MLP was recently (2015) the winning approach in a competition for finding the best taxi route[2], which shows that MLPs are still useful. MLPs are trained with backpropagation (BP) and gradient descent (GD)[3]. When the number of layers increases, so does the risk of the vanishing gradient problem[1], which makes BP learning harder.

Another ANN model is the Discriminative Deep Belief Network (DDBN)[4]. The difference between the MLP and the DDBN is the initialization of the weights. The weights of an MLP are initialized with random values, while a DDBN receives its weights from a trained DBN[4]. It can hence be said that when an MLP is pre-trained by a DBN, we obtain a DDBN. The DBN is one of several networks trained with deep learning methods that reduce the vanishing gradient problem by training greedily layer by layer[1]. The DBN is an unsupervised model[5, 6, 7], which gives it the possibility to train on data without labels. This is useful since it is usually easier to acquire unlabeled data than labeled data. While a DDBN may profit from this unsupervised pre-training[1], the pre-training requires additional time. Also, if the network is shallow, an MLP might be just as accurate with less training time, in addition to being a simpler model. This raises the question of when DBN pre-training is useful.

Earlier comparisons of the two models have shown that the MLP can perform better than the DDBN[8], whilst other comparisons show the opposite[9, 10, 11]. In experiments conducted by Larochelle et al., DDBNs outperformed MLPs on both MNIST and other image classification datasets[11]. It is worth noting that neither the DDBN nor the MLP can compete with convolutional ANNs on image classification, and hence this research should not be viewed as an attempt at ranking state-of-the-art image classification models. DDBNs also outperformed MLPs on the Aurora2 dataset in research conducted by Vinyals and Ravuri[9]. Comparing MNIST benchmark results of the two models shows that the MLP has the best performance[8]. In some of the cases above, DDBNs have outperformed MLPs; in others, MLPs seem to be better. Notably, their results are often quite similar. When labeled training data is scarce and there is an abundance of unlabeled training data, it has been observed that the accuracy of DDBNs is higher than that of MLPs[4].
However, when there is a moderate amount of labeled data, it is unclear whether the pre-training provides any significant improvement.

In the earlier research mentioned above, the authors either did not use the same architecture for the DDBN and the MLP, or did not mention which architectures were used. The architecture has an impact on what functions the network can model, which is why comparing different architectures might not result in a fair comparison. Therefore, this study compares MLPs and DDBNs with the same architectures.

With this study, we aim to shed light on what effect the unsupervised pre-training of a DDBN has compared to an MLP trained only with BP. We investigate at what proportions of labeled to unlabeled data DDBNs are preferred to MLPs. Furthermore, we focus on the question: How does the amount of labeled data affect the accuracy of DDBNs compared to MLPs with the same architecture?

1.1 Scope

The goal of this study was to empirically investigate the accuracy of the two models when they have the same architecture, while varying the amount of available labeled training data. We tested three architectures on each of the datasets, using architectures that have been successful in earlier research on the same dataset. The BP learning rate was selected using a limited grid search. Other model parameters were selected according to guidelines, or through short empirical tests, and were kept static during the trials. We did not focus on tuning the parameters to get state-of-the-art results, since this might imply different architectures for the different networks and amounts of labeled data. The amount of labels available to the networks was varied from a few to all. The investigation was conducted using two benchmark datasets, the MNIST[8] handwritten digit classification dataset and the synthetic Rectangles-Images dataset[12].

2 Background

2.1 Multilayer Perceptron

An MLP is a standard deep learning model, also called a deep feedforward network[13, p. 164]. It is a model for learning non-linear functions. An MLP consists of one input layer, one output layer and one or more hidden layers. Each layer has a number of neurons that process the data as it flows through the network. Figure 1 illustrates the layers and neurons. If there are no hidden layers, the network can be a linear regression model[14]. With a single hidden layer, an MLP can approximate many non-linear functions, and with more layers it can also learn complex functions[14]. On the other hand, training becomes harder and takes more time with more layers[9].

Figure 1: An illustration of a multi-layered feed-forward network

The output of an MLP is an approximation of the function that the network should learn.
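To make the layered structure concrete, the following is a minimal sketch of a forward pass through an MLP. It is not code from the thesis: it uses plain NumPy, a hypothetical 784-500-500-10 architecture (MNIST-sized input, 10 classes), and sigmoid hidden units, all chosen only for illustration. The comment on weight initialization marks the point where a DDBN would differ from a randomly initialized MLP.

```python
import numpy as np

def sigmoid(x):
    # Logistic activation, a common choice for MLP hidden units
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, biases):
    """Forward pass: each hidden layer applies an affine map followed by a
    non-linearity; the output layer uses softmax for classification."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = sigmoid(a @ W + b)
    z = a @ weights[-1] + biases[-1]
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical architecture sizes for illustration only.
# A plain MLP starts from small random weights; a DDBN would instead copy
# these weights from a DBN pre-trained layer by layer on unlabeled data,
# before fine-tuning with backpropagation on the labeled data.
rng = np.random.default_rng(0)
sizes = [784, 500, 500, 10]
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.random((1, 784))                      # one dummy flattened input image
print(mlp_forward(x, weights, biases).shape)  # -> (1, 10) class probabilities
```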