
Regularization Methods in Neural Networks

By Jacob Kasche and Fredrik Nordström

Department of Statistics, Uppsala University

Supervisors: Johan Lyhagen and Andreas Östling

2020

Abstract

Overfitting is a common problem in neural networks. This report uses a simple neural network to run simulations relevant to the field of image recognition. Four common regularization methods for dealing with overfitting are evaluated. The methods L1, L2, Early stopping and Dropout are first tested on the MNIST data set and then on the CIFAR-10 data set. All methods are compared to a baseline without regularization at sample sizes ranging from 500 to 50 000 images. The simulations show that all four methods display consistent patterns throughout the study and that Dropout is consistently superior to the other three methods as well as to the baseline.

Table of Contents

1 Introduction
2 History of artificial intelligence, machine learning and neural networks
2.1 Brief introduction to neural networks
3 Data
4 Method
4.1 Base network
4.2 Metrics
4.3 Overfitting and regularizations
4.4 Bias-variance tradeoff
4.5 Simulation method
5 Results
5.1 MNIST
5.2 CIFAR-10
6 Discussion
7 Conclusion
Appendix A

1 Introduction

In recent years neural networks have reached new heights with human-like results in perceptual problems.1 Skills earlier seen as near impossible for machines, such as hearing and seeing, can now be performed to a high degree by algorithms2, providing aid and new solutions in vastly different fields such as autonomous transportation, medical imaging, and language translation.3

Neural networks can be described as “…a means of doing machine learning, in which a computer learns to perform some task by analyzing training examples.”4 A common problem with neural networks is overfitting, which occurs when the network models random noise in the data set. This happens frequently when a network has too many hidden nodes, or when the data set is small or of poor quality. Through the years, different regularization methods have emerged and proved effective at mitigating overfitting in neural networks. Some of the most common are L1, L2, Early stopping and Dropout.5 Previous research has focused on the validity of these specific methods on specific data sets but has not gone further in comparing the methods side by side to understand their behavioral patterns. This report therefore aims to expand the research concerning these four methods. This will be done by testing and evaluating the four regularization methods L1, L2, Early stopping and Dropout, first on the MNIST data set and thereafter on the more complex CIFAR-10 data set. The four regularization methods will then be evaluated by comparing their performances, thus expanding the research field's understanding of the methods' behavioral patterns.

1 Chollet, François, and Joseph J. Allaire, 'Deep Learning with R' (1st edn, Shelter Island, NY, Manning Publications Co, 2018), Section 1.1.6. 2 Chollet et al., 'Deep Learning with R', Section 1.1.6. 3 Synced. https://syncedreview.com/2019/10/31/google-introduces-huge-universal-language-translation-model-103-languages-trained-on-over-25-billion-examples/ Accessed (2020-01-13) 4 Hardesty. http://news.mit.edu/2017/explained-neural-networks-deep-learning-0414 Accessed (2020-01-13) 5 Brownlee. https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/ Accessed (2020-01-13)

The first part of this report gives a short description of the background of neural networks and provides an example of a neural network to give the reader a fundamental understanding of the research field. The second part presents a more detailed description of the two neural networks used in this report as well as the methods involved. The third part presents the results, followed by a discussion of the findings. The research question for this report is: how do the four methods perform at different sample sizes with the MNIST and CIFAR-10 data sets, and what does a comparison between them say about their specific behaviors?

2 History of artificial intelligence, machine learning and neural networks

Artificial intelligence, “the effort to automate intellectual tasks normally performed by humans”, was born in the 1950s.6 The idea seemed very promising at first, but as the tasks became more complex, the problem of creating enough instructions for the algorithms to perform the desired task became apparent. Out of this problem the idea of machine learning arose. Could a machine learn how to perform a task or solve a problem by itself, given enough data? And could the rules it learned then be used on new data?7

Figure 1. A comparison between classical programming and machine learning.8

6 Chollet et al., 'Deep Learning with R', Section 1.1.1. 7 Chollet et al., 'Deep Learning with R', Section 1.1.2. 8 Chollet et al., 'Deep Learning with R', Section 1.1.2.

This brings us to neural networks, a concept in machine learning. The idea of neural networks comes from neurobiology and the human brain. Even though the inspiration for neural networks comes from some of our understanding of how the human brain works, there is no evidence of any resemblance between how the actual model in a neural network works and how the brain works.9

Machine learning can be argued to have been born in the 1950s, but an earlier idea was of great importance for the invention of the field. In 1943 Warren McCulloch and Walter Pitts published A Logical Calculus of the Ideas Immanent in Nervous Activity, the first published mathematical model of a neural network. Fifteen years later, in 1958, Frank Rosenblatt published the idea of a perceptron in the paper The Perceptron: A Probabilistic Model for Information Storage and Organization in The Brain. Shortly thereafter the “Mark 1 Perceptron” machine was built based on the algorithm and used for image recognition. However, it was quickly shown that this single-layer perceptron was only capable of learning linearly separable patterns and hence unable to solve more complex problems with many classes, a point made clear by the book Perceptrons: An Introduction to Computational Geometry by Marvin Minsky and Seymour Papert in 1969. Further advancements were made, but the initial hype had caused unsustainable expectations and the field lost most of its momentum. This led to a lack of funding and overall stagnation in the 1970s.

In the 1980s machine learning started to gain popularity again. One reason was a US-Japan joint conference where Japan announced efforts for further advancement in the field, which quickly led to more funding from the US as well. In the new millennium, with the internet further establishing itself, computational power increasing and big data becoming more available, neural networks have further established their utility and value.10

9 Chollet et al., 'Deep Learning with R', Section 1.1.4. 10 Strachnyi. https://medium.com/analytics-vidhya/brief-history-of-neural-networks-44c2bf72eec Accessed (2019-12-15)

2.1 Brief introduction to neural networks

Figure 2. An example of a neural network.11

The data set is first processed so that the information can be assigned to as many nodes as needed in the input layer to represent the information at hand. With images this often translates to one node for each pixel, with a value for each node corresponding to the color of the pixel. This information is then sent to the hidden layer which, through a set of weights, one for each input node, decides how the previous combination is represented, and which value it takes, in its nodes. This then continues to the output layer nodes where it is transformed into an output in order to make a prediction of what information was provided in the input layer. This makes up the forward feed of the model. The predictions are a direct result of how well the hidden layer managed to transform the information from the input layer. The prediction is then measured and the error is quantified so that the weights can be updated. This happens repeatedly through backpropagation followed by a new forward feed. As the model gets more data to train on, its weights will eventually be updated to a point where the number of inaccurate predictions is

11 DataFlair. https://data-flair.training/blogs/artificial-neural-networks-for-machine-learning/ Accessed (2019-12-15)

minimized. Once trained, the network can be used to make predictions on other, similar data sets in the future.12
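To make the forward feed concrete, the sketch below computes the output of a single hidden node: a weighted sum of the input node values plus a bias, passed through the ReLU activation. The weights, bias and input values are made up for illustration and are not taken from the thesis.

relu <- function(z) max(0, z)          # rectified linear unit

x <- c(0.0, 0.5, 1.0)                  # input node values (e.g. pixel intensities)
w <- c(0.2, -0.4, 0.7)                 # one weight per input node
b <- 0.1                               # bias term

hidden_value <- relu(sum(w * x) + b)   # weighted sum, then activation
hidden_value                           # 0.6

During training, backpropagation adjusts w and b so that the predictions built from many such nodes produce fewer errors.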

3 Data

The data used comes from two different data sets. The first is the MNIST handwritten digits data set, an acronym for Modified National Institute of Standards and Technology. The data set contains images of handwritten digits and is divided into two parts, 60 000 training images and 10 000 test images. The images in MNIST have been standardized in the sense that they are in greyscale and that the digits are centered in the pictures. Every picture is 28x28 pixels. Figure 3 shows a cropped version of how the network initially handles a specific picture, here with the number 5. The second data set is CIFAR-10, a more complex data set with 60 000 photographed pictures of real-life objects. CIFAR-10 is an acronym for Canadian Institute for Advanced Research, with the 10 referring to its 10 different classes. The images in CIFAR-10 are less standardized than the MNIST pictures, which results in greater variation within each class and often weaker correlations between the objects in each class. Every picture in CIFAR-10 is 32x32 pixels and uses the red, green and blue (RGB) color model to create a specific color for each pixel.

Figure 3. A screenshot of an R-matrix from MNIST, representing the number 5.
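As an illustration of how these data sets can be brought into the shape the networks expect, the sketch below loads MNIST and CIFAR-10 with the keras package for R, flattens each image to a vector of pixel values (784 for MNIST, 3072 for CIFAR-10), scales the values to the 0-1 range and one-hot encodes the labels. The thesis does not publish its source code, so the exact preprocessing steps shown here are assumptions.

library(keras)

# MNIST: 28x28 greyscale images -> vectors of 784 values in [0, 1]
mnist   <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(nrow(mnist$train$x), 28 * 28)) / 255
y_train <- to_categorical(mnist$train$y, 10)     # one output node per class

# The test images are reshaped the same way (shown for MNIST)
x_test <- array_reshape(mnist$test$x, c(nrow(mnist$test$x), 28 * 28)) / 255
y_test <- to_categorical(mnist$test$y, 10)

# CIFAR-10: 32x32 RGB images -> vectors of 3072 values in [0, 1]
cifar         <- dataset_cifar10()
x_train_cifar <- array_reshape(cifar$train$x, c(nrow(cifar$train$x), 32 * 32 * 3)) / 255
y_train_cifar <- to_categorical(cifar$train$y, 10)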

12 Nicholson. https://pathmind.com/wiki/neural-network Accessed (2019-12-15)

4 Method

This paper evaluates four different approaches for mitigating overfitting in neural networks. This is first done with a fixed model on the MNIST data set. Later in the report this is expanded to the CIFAR-10 data set, where additional nodes are added to the hidden layers to ensure a valid neural network for the new data set. Below, the exact details of these structures are described; throughout the report they are referred to as the Base Network.

4.1 Base network

The network has four layers in total: one input layer, two hidden layers and one output layer. The number of nodes in the hidden layers is essential when training a neural network. The problem primarily discussed in this paper is overfitting, but an equally troublesome problem is underfitting. When a network underfits it misses out on valuable information. One approach when deciding the number of hidden nodes is therefore to choose a high number and then use regularization methods to deal with the expected overfitting. That way it is easier to know that too much potential information has not been missed. A common suggestion for the number of hidden nodes is 5 to 100, but as discussed above the data's underlying function should be highly regarded when choosing the content of the hidden layers.13

The Base Network for MNIST contains two hidden layers with 64 and 32 nodes respectively, both using the rectified linear unit (ReLU) as activation function. Lastly, the softmax function is used so that the output nodes take values between 0 and 1 that sum to 1, meeting the criteria for a probability distribution. The input layer contains 784 nodes, one for every pixel in each image (28x28x1), each holding the pixel's greyscale value.

The redefined Base Network for CIFAR-10 contains two hidden layers with 128 and 64 nodes respectively. The input layer contains 3072 nodes, one for every pixel and color channel in each image (32x32x3); using the RGB model, each pixel contributes one value per color channel. The rest of the network is kept unaltered.
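A minimal sketch of how the MNIST version of the Base Network could be defined with the keras package for R is shown below; the CIFAR-10 version would use an input shape of 3072 and hidden layers of 128 and 64 nodes. Since the thesis does not publish its source code, the exact layer calls are assumptions.

library(keras)

# Base Network for MNIST: 784 -> 64 (ReLU) -> 32 (ReLU) -> 10 (softmax)
base_network <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(784)) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")   # class probabilities summing to 1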

13 Hastie, Trevor, Robert Tibshirani, and Jerome Friedman, 'The Elements of Statistical Learning: Data Mining, Inference, and Prediction' (2nd edn, New York, Springer, 2009), 358.

In the last layer the output is a set of probabilities which, summed up, must equal 1. These predictions from the network of which category the image belongs to can be compared to the true values, which are 0 for nine categories and 1 for the correct category. This is done for all images and then summed together. How the output probabilities, $f_k(x_i)$, and the true values, $y_{ik}$, are compared and summed is defined by the so-called loss function, $f(y, x, w)$. The loss function used for the Base Network is categorical cross-entropy,

$$f(y, x, w) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i)$$

which is suitable when working with classification problems and therefore is used for the Base Network.14 This loss function is used when the weights of the network are updated. The weights should be changed to give better predictions, which results in a lower loss. To minimize the loss function, the gradient descent method is used for the Base Network. The gradient is the derivative of the loss function. If the weights are updated optimally, the gradient decreases to 0 and a minimum of the loss function has been reached, assuming that a global minimum has been found and not a local one.15 Exactly how the weights are updated using the gradient is decided by the optimizer; for the Base Network RMSprop is used.

4.2 Metrics

As mentioned, both MNIST and CIFAR-10 contain 10 different classes: digits in the former and ordinary objects, such as different kinds of animals and transport vehicles, in the latter. Because of this, the neural networks in this paper only handle classification problems, and a suitable evaluation metric is therefore accuracy. The number of correct predictions is counted in all three data sets: train, validation and test. The accuracy on the test data set is the primary evaluation measure when the different regularization methods are compared to the baseline network without regularization.
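Putting the loss function, optimizer and evaluation metric from the two sections above together, the Base Network could be compiled roughly as follows. This is a sketch assuming the keras package for R; the thesis's exact calls are not published.

base_network %>% compile(
  loss      = "categorical_crossentropy",   # the loss f(y, x, w) defined above
  optimizer = optimizer_rmsprop(),          # gradient-based weight updates via RMSprop
  metrics   = c("accuracy")                 # evaluation metric used in this section
)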

14 Hastie et al., 'The Elements of Statistical Learning: Data Mining, Inference, and Prediction', 353. 15 Chollet et al., 'Deep Learning with R', Section 2.4.3.

4.3 Overfitting and regularizations

As mentioned, overfitting is a common problem in machine learning and therefore of interest. Overfitting can be defined as “The production of an analysis which corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.”16 For a neural network this means that the network is overfitted when it fits the training data to too high a degree and therefore loses its ability to predict on new data. The problem of overfitting in neural networks can be handled in many ways. This report evaluates four of them:

● Early stopping
● L1
● L2
● Dropout

Other ways of dealing with overfitting involve changes in the structure of the model, such as the number of layers and the number of nodes per layer. As mentioned, this report looks at different methods for mitigating overfitting when the model is fixed.

Early stopping was invented in the early days of neural networks. It has a straightforward approach: the training of the network is stopped before it overfits to the training data. Early stopping can be regarded as straightforward in the sense that the basic idea is easy to understand; the training should be stopped when generalization decreases. Exactly when this occurs can be clear in some cases but less so in others. A problem with this method can therefore be to decide an appropriate stopping criterion, one that takes both generalization and training time into account. The goal is to find a criterion which secures a generalized network but stops when the increase in generalization is no longer substantial, so that the training time can be considered effective. A common stopping criterion is to stop training when the

16 Oxford University Press. https://www.lexico.com/en/definition/overfitting Accessed (2020-01-08)

validation loss has not reached a new minimum in the last 5 epochs. Hence, continuous improvements do not trigger Early stopping.17
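In the keras package for R this stopping criterion could be expressed as a callback that monitors the validation loss with a patience of 5 epochs. This is a sketch of one possible configuration, not the thesis's exact code:

# Stop training when the validation loss has not reached a new minimum
# in the last 5 epochs.
stop_early <- callback_early_stopping(monitor = "val_loss", patience = 5)

# Passed to fit() via the callbacks argument, e.g.:
# base_network %>% fit(x_train, y_train, epochs = 100,
#                      validation_split = 0.3, callbacks = list(stop_early))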

Another method for reducing overfitting in a neural network is weight decay, also known as weight regularization. Weight decay penalizes the network for having large weights by adding a cost to the training loss, $f(y, x, w)$. The size of the added cost is set by the hyperparameter lambda, $\lambda$, which is usually small.18 Therefore lambda will be tested at the values 0.01 and 0.005. With this change in the loss function, the network will, to some extent, avoid large weights.19 There are different ways of penalizing the network with weight decay. Two common ones, L1 and L2, are defined below as new loss functions: decay-1, $d_1(y, x, w)$, and decay-2, $d_2(y, x, w)$.

L1

$$d_1(y, x, w) = f(y, x, w) + \lambda \sum_{i=1}^{n} |w_i|$$

L2

$$d_2(y, x, w) = f(y, x, w) + \lambda \sum_{i=1}^{n} w_i^2$$
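In the keras package for R, these penalties can be attached per layer through the kernel_regularizer argument, with lambda set to one of the values tested in the report (0.01 or 0.005). The layer sizes below mirror the MNIST Base Network; as before, this is an assumed sketch rather than the thesis's published code.

lambda <- 0.01   # hyperparameter tested at 0.01 and 0.005 in the report

l2_network <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(784),
              kernel_regularizer = regularizer_l2(l = lambda)) %>%
  layer_dense(units = 32, activation = "relu",
              kernel_regularizer = regularizer_l2(l = lambda)) %>%
  layer_dense(units = 10, activation = "softmax")

# The L1 variant swaps in regularizer_l1(l = lambda) instead.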

In this report, the last method for counteracting an overfitted network is Dropout. As the name hints, this method randomly drops nodes in the network's hidden layers. This is done by forcing the weights of the randomly selected nodes to zero and then scaling up the remaining weights. The scaling factor follows directly from the Dropout rate. For

17 Prechelt L. “Early Stopping — But When?”. In: Montavon G., Orr G.B., Müller KR. (eds) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol 7700. Springer, Berlin, Heidelberg, 2012. 18 Chollet et al., 'Deep Learning with R', Section 4.4.2. 19 Chollet et al., 'Deep Learning with R', Section 4.4.2.

example, if 1/5 of the nodes are dropped, the remaining weights should be multiplied by 5/4. A commonly used Dropout rate is between 20% and 50%. Because of the randomness this method can seem arbitrary, but it is frequently used for overfitted neural networks and is regarded as effective. The idea of the technique is to make the network more uncertain when it updates the weights, so that it only picks up the more significant patterns. The network otherwise risks picking up random noise in the training data, but is now less likely to do so because Dropout has added more noise.20
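Under the same assumptions, a Dropout version of the Base Network could insert a dropout layer after each hidden layer, with a rate in the 20-50% range mentioned above:

dropout_network <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.3) %>%   # randomly zero 30% of the previous layer's outputs during training
  layer_dense(units = 32, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")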

4.4 Bias-variance tradeoff

The interest in mitigating overfitting in neural networks comes from the wish to make better predictions when the network is used on new, never seen data. In all four methods above this is done by affecting the training loss in different ways. L1 and L2 do it directly through a change in the loss function. Early stopping and Dropout can be said to do it indirectly, through different changes in the training process. The training loss is targeted in the fight against overfitting because it is an estimator of what really matters for predictions: the test loss. In basic statistics courses, finding an unbiased estimator is always the first goal, followed by a low variance. This approach gives a correct picture of the population on average, but can often be far off if the variance is high. This uncertainty can be addressed using the bias-variance tradeoff, an approach that relaxes the restriction of an unbiased estimator in exchange for a lower variance. In a neural network this happens when the network is not fully optimized on the training data: a bias is accepted in exchange for a lower variance in the test loss, and the network can then be said to be more generalized. The bias-variance tradeoff must, however, be used carefully. Otherwise the network risks underfitting, which would cause the prediction performance to drop again.21

20 Chollet et al., 'Deep Learning with R', Section 4.4.3. 21 Hastie et al., 'The Elements of Statistical Learning: Data Mining, Inference, and Prediction', 38.

4.5 Simulation method

The results in this paper are mainly from simulations done in the programming language R. Requests for the source code may be directed to the authors of this thesis. To stabilize the results, the simulation for every sample size is performed several times and an average of the evaluation metric is then calculated. For the small sample sizes, up to 1 400 images, 10 replications are done per sample size. For the rest of the sample sizes, 3 replications are performed. It is the random weights, set at the beginning of training, that are the unstable factor. The randomness should have a bigger impact when small samples are used, because the training time is shorter and the network then has less time to correct possibly bad starting weights. This is one reason why more replications are done for the smaller samples. The second reason is the time factor: the bigger sample sizes take longer to train and therefore, given this paper's time plan, more replications were not possible. With more time and unlimited resources, the replications could have been run up to 30 times to secure even more stable results.

Furthermore, the primary evaluation of the study is performed by looking at the network's accuracy on the 10 000 image test data set. The whole test set is used for all sample sizes, which at first may seem strange: setting aside 10 000 images for testing and then training with 1 400. The reason the whole test set is used is to stabilize the evaluation measure, and the test images are not seen as part of the original sample. Instead, each sample consists of training images, 70%, and validation images, 30%. The training and validation sets are randomly selected, but the same random seeds are set for all replications at the same sample size. This randomness may lead to especially bad or good images being chosen, so another random sample could give a different accuracy. This should, however, affect all four regularization methods similarly, and the comparison can therefore still be seen as valid.
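A condensed sketch of how one such replication could look is shown below: fix a seed, draw a sample, split it 70/30 into training and validation data, train the Base Network and record the test accuracy. It assumes the preprocessed x_train, y_train, x_test and y_test objects from the Data section sketch and the keras package for R; the helper name run_replication and the exact fit() arguments are illustrative assumptions, not the thesis's own simulation code.

run_replication <- function(sample_size, seed) {
  set.seed(seed)                                  # fixes the sample/split; weight initialization stays random
  idx   <- sample(nrow(x_train), sample_size)     # draw the sample from the training pool
  split <- floor(0.7 * sample_size)               # 70% training, 30% validation

  model <- keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu", input_shape = c(784)) %>%
    layer_dense(units = 32, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")
  model %>% compile(loss = "categorical_crossentropy",
                    optimizer = optimizer_rmsprop(), metrics = c("accuracy"))

  model %>% fit(x_train[idx[1:split], ], y_train[idx[1:split], ],
                validation_data = list(x_train[idx[-(1:split)], ],
                                       y_train[idx[-(1:split)], ]),
                epochs = 100, verbose = 0)

  scores <- model %>% evaluate(x_test, y_test, verbose = 0)
  scores[["accuracy"]]                            # metric name may be "acc" in older keras versions
}

# e.g. 10 replications at sample size 1 400, averaged as in the report
mean(sapply(1:10, function(r) run_replication(1400, seed = r)))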

In statistics, an area of focus is the sample size and how it affects the result. “Is there enough data for the result to be statistically significant?” is a reasonable question, and the matter of how small a data set can be while still being useful is important. In comparison, machine learning is often used when large amounts of data are available, but how much data is really needed for accurate

predictions? And can machine learning also be used for smaller sample sizes? In the context of these questions, this paper does not hold the sample size constant; the results instead focus on how the different regularization methods perform when the sample size varies. The accuracy in this report is measured in three groups of sample sizes, referred to as the low, mid and high groups. The low group measures at 500, 650, 800, 950, 1 100, 1 250 and 1 400 images. The mid group measures at 2 000, 4 000, 6 000, 8 000 and 10 000. The high group measures at 10 000, 20 000, 30 000, 40 000 and 50 000.

All four regularization methods need an assigned hyperparameter value. There are recommended ranges of values, as previously mentioned in the report. The first tests in this report examined which of the values performed best at the low sample sizes. The values that performed best for MNIST and CIFAR-10 respectively were used for the low sample sizes. Because of this, CIFAR-10 is not shown with the same regularization values as MNIST; the MNIST values were tested as well but performed worse than the ones shown. For the mid and high sample sizes the same regularization values performed best and are therefore used and shown. The tests used to decide all hyperparameter values in the final results can be found in Appendix A.

5 Results

5.1 MNIST

The data is first run through the Base Network with no regularization in order to get a baseline. The result can be seen in Figure 4, where two different sample sizes are used. For both the smaller and the bigger sample the network can be seen to overfit; this is noticeable in the validation loss, which first decreases and then trends upward for the later part of the training. In the MNIST data set the accuracy is relatively high and stable, and the effect of overfitting is therefore rather small and harder to illustrate.

Figure 4. Loss and accuracy for the validation and training sets when sample sizes of 1 400 (left) and 10 000 (right) are used to train the Base Network for 100 epochs.


Figure 5. Two graphs showing training done without regularization (left) and training with Dropout (right). The Base Network is used with a sample size of 400 images from MNIST.

In Figure 5 we can see a comparison between training done without regularization and training done with Dropout added to the hidden layers. The validation loss still has a rising trend in the later part of training with Dropout, although the rise is slower and smaller in total.


Figure 6. Test accuracy for all different sample sizes from MNIST when the Base Network is trained without regularization.

In Figure 6, all sample sizes used in the study are shown at the same time, measured without regularization. Test accuracy can be seen to increase with sample size. The relationship does not appear to be linear; instead the gains from additional images diminish as the sample size grows.


Figure 7. Comparison of the regularization methods with the baseline for the low sample sizes from MNIST.

The results for the best hyperparameter values of the four regularization methods are summarized in Figure 7. The only regularization method that stands out positively from the baseline is Dropout, with roughly 90% accuracy. Early stopping performs just below no regularization, while L1 and L2 perform worse by a greater margin.


Figure 8. Comparison of the regularization methods with the baseline for the mid sample sizes from MNIST. L1 excluded.

This trend continues as the sample size increases. In Figure 8 Dropout has reached an accuracy of 96%, with Early stopping, no regularization and L2 just below. L1 has been excluded from the graph because it performs well below the other four, which would make the results above harder to see; graphs including L1 can be found in Appendix A. Occasionally, methods other than Dropout perform better than the Base Network without regularization; L2, for instance, performs slightly better for a stretch in the middle of this range.


Figure 9. Comparison of the regularization methods with the baseline for the high sample sizes from MNIST. L1 excluded.

As the tests with MNIST reach the final sample sizes, Dropout continues to perform better than no regularization and finishes at an accuracy of 97.5%. Early stopping is on par with no regularization while L2 performs slightly worse. L1 performs much worse; graphs including L1 can be found in Appendix A.

5.2 CIFAR-10

CIFAR-10 will now be used instead of MNIST to provide additional results for further evaluating the four methods.

Figure 10 shows that the test accuracy is consistently below 0.1, meaning the model is no better than random guessing in its current state. Therefore, more nodes are added to the hidden layers to better handle the increased complexity of the new data set.

Figure 10. Comparison of the regularization methods with the baseline, using the originally defined Base Network, for the low sample sizes from CIFAR-10.


Figure 11. Loss and accuracy for the validation and training sets when sample sizes of 1 400 (left) and 10 000 (right) are used to train the redefined Base Network for 100 epochs.

The redefined Base Network is now trained without regularization on the CIFAR-10 data set, and the run with 10 000 sample images is shown in Figure 11. In contrast to the MNIST version, the overfitting can now be seen in the measure of main interest, the accuracy. Figure 11 clearly shows that the difference between training accuracy and validation accuracy increases with the number of epochs, especially for the curves representing a sample size of 10 000. For the lower sample size displayed, 1 400 images, the pattern is less clear; the training and validation accuracy do not separate as much over the 100 epochs.


Figure 12. Comparison of the regularization methods with the baseline for the low sample sizes from CIFAR-10.

Moving on to the comparison of the four regularization methods: in Figure 12, where the small samples are displayed, it is hard to distinguish any positive effect from the tested regularization methods. Dropout sometimes reaches above the baseline, but the trend does not seem stable. As seen in Figure 12, the test accuracy increases with the additional nodes added to the hidden layers in the new model. Dropout and no regularization perform at around 30% accuracy while Early stopping and L2 are closer to 25%. L1, on the other hand, shows no improvement with the additional nodes in the network.


Figure 13. Comparison of the regularization methods with the baseline for the mid sample sizes from CIFAR-10. L1 excluded.

This uncertainty continues up to a sample size of 6 000, as can be seen in Figure 13. Above this point, the Dropout method seems to have a positive impact when training the network. Early stopping also seems to perform better as the sample size grows, even though it mainly stays below the baseline at these sample sizes. In Figure 13, where the sample size has been increased to the 2 000-10 000 range, Dropout and no regularization perform well consistently, with Dropout performing best at the largest sizes, now with roughly 42% accuracy. L2 consistently performs worse than the other methods shown, and Early stopping starts with low accuracy but reaches the same level as no regularization and Dropout in the middle of the range, finishing around 40% test accuracy. L1 has been dropped because it shows no improvement in accuracy, which skews the graph; graphs including L1 can be found in Appendix A.


Figure 14. Comparison of the regularization methods with the baseline for the high sample sizes from CIFAR-10. L1 excluded.

Lastly, in Figure 14, the simulations are run on sample sizes from 10 000 to 50 000. Here Dropout has the best performance of all five simulations, with Early stopping and no regularization closely resembling each other a couple of percentage points below. Dropout finishes the test at an accuracy of 50%. L2 follows a similar pattern but performs worse than the other three methods throughout the simulations. L1 is once again omitted.

6 Discussion

The MNIST data set is simpler than CIFAR-10 in many ways. The way the numbers are centered, and the fact that they can be expected to be of roughly the same size, should result in a higher correlation among the different representations of each number. In CIFAR-10, the pictures of the objects can be expected to have been taken from different distances, causing big differences in size, as well as under different lighting, and objects such as cars, dogs and trucks can have more than one specific color and share color patterns. The problem of overfitting to random noise can therefore be expected to clear up faster and be less severe with MNIST. To account for the expected increase in image complexity in the CIFAR-10 data set, more nodes were added to the hidden layers. This made it possible for the network to account for weaker but valuable correlations which otherwise could have been left out. It should also be added that while a simple neural network can produce a very high accuracy for MNIST, the results for CIFAR-10 are less accurate. This is to be expected, and in order to keep the networks as similar as possible the lower accuracy is acceptable.

At the low sample sizes for both MNIST and CIFAR-10, it seems that the penalty added by L1 and L2 is greater than needed, making the network disregard weak but valuable patterns among the images, which causes a lower accuracy than the baseline. This raises the question of whether those lambda values should initially have been even lower in order to improve the results. Due to the time limitations of this report, the work on deciding the initial values for the regularization methods had to be cut somewhat short. The values used were based on previous research, and some generalization and extrapolation was needed, which may have led to suboptimal values. More research on this would be both valuable and needed.

Since overfitting is not a huge problem for the MNIST data set, the room for improvement ought to be rather small, and it is therefore expected that none of the regularization methods outperform the baseline by much. The good performance of Dropout was therefore interesting to see, as it showed the method's value even on this data set. With the added complexity of the CIFAR-10 data set, the differences between the regularization methods were bigger. Since the overfitting

problem seemed to be bigger with this data set, it makes sense that the regularization methods have a greater possibility for improvement. With CIFAR-10, Dropout continues to perform well and clearly better than no regularization as soon as the training data set reaches roughly 5 000 images. The Dropout technique of forcing the network to create alternative ways of explanation, and therefore be slightly less certain in its predictions, seems to be of value. The performance of Early stopping comes as a bit of a surprise overall. Given the criterion it follows, stopping the model before it overfits too much, it seems reasonable that it should be one of the best performers. But the results here rarely show much difference compared to the baseline, which further indicates that overfitting may at times be a less severe problem. A matter that can be discussed is therefore the need for regularization for these two data sets. It must be considered that building an advanced enough network can in some cases be of greater interest than choosing the right amount of regularization. Still, both are necessary in order to optimize accuracy.

7 Conclusion

At the beginning of this report it was stated that the aim is to expand the research concerning the four regularization methods L1, L2, Early stopping and Dropout. This was done by testing and evaluating the four methods, first on the MNIST data set and thereafter on the more complex CIFAR-10 data set. The four methods were then evaluated by comparing their performances, in an attempt to expand the research field's understanding of the methods' behavioral patterns. The research question was: how do the four methods perform at different sample sizes with the MNIST and CIFAR-10 data sets, and what does a comparison between them say about their specific behaviors?

The report showed that Dropout consistently performs better than both Early stopping and no regularization, which show similar accuracies. L2 consistently performed slightly below the baseline, and L1 performed significantly worse than all other regularization methods. These performance patterns repeat themselves in both data sets and at most sample sizes, with a reservation for the smallest sample sizes, where randomness plays a bigger role. When interpreting the results, it is important to consider that all four regularization methods were run with specific hyperparameter values, which is the main driver of their behavior. All methods can be used with different values, as mentioned previously in their respective sections. One ought therefore to be cautious about extrapolating these findings beyond these two data sets and the values used for the specific regularization methods. More research is needed in order to better understand how different hyperparameter values change the behavior of the specific regularization methods.

References

Brownlee, Jason. How to Avoid Overfitting in Deep Learning Neural Networks. 2019. https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/ Accessed (2020-01-13)

Chollet, François, and Joseph J. Allaire. 'Deep Learning with R' (1st edn, Shelter Island, NY, Manning Publications Co, 2018).

DataFlair. Introduction to Artificial Neural Networks. 2018. https://data-flair.training/blogs/artificial-neural-networks-for-machine-learning/ Accessed (2019-12-15)

Hardesty, Larry. Explained: Neural networks. 2017. http://news.mit.edu/2017/explained-neural-networks-deep-learning-0414 Accessed (2020-01-13)

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 'The Elements of Statistical Learning: Data Mining, Inference, and Prediction' (2nd edn, New York, Springer, 2009).

Nicholson, Chris. A Beginner's Guide to Neural Networks and Deep Learning. 2019. https://pathmind.com/wiki/neural-network Accessed (2019-12-15)

Oxford University Press. Lexico.com, 2019. https://www.lexico.com/en/definition/overfitting Accessed (2020-01-08)

Prechelt L. “Early Stopping — But When?”. In: Montavon G., Orr G.B., Müller KR. (eds) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol 7700. Springer, Berlin, Heidelberg, 2012.

Sarle, Warren. “Stopped Training and Other Remedies for Overfitting.” 1995.

Strachnyi, Kate. Brief History of Neural Networks. 2019. https://medium.com/analytics-vidhya/brief-history-of-neural-networks-44c2bf72eec Accessed (2019-12-15)

Synced. 2019. https://syncedreview.com/2019/10/31/google-introduces-huge-universal-language-translation-model-103-languages-trained-on-over-25-billion-examples/ Accessed (2020-01-13)

Appendix A

MNIST

Figure 15. Comparing different hyperparameter values for the regularization methods with MNIST. Here Early stopping and Dropout are tested.

Figure 16. Comparing different hyperparameter values for the regularization methods with MNIST. Here L1 and L2 are tested.


Figure 17. Comparison of the regularization methods with the baseline for the mid sample sizes from MNIST. L1 included.

Figure 18. Comparison of the regularization methods with the baseline for the high sample sizes from MNIST. L1 included.

CIFAR-10

Figure 19. Comparing different hyperparameter values for the regularization methods with CIFAR-10. Here Early stopping and Dropout are tested.

Figure 20. Comparing different hyperparameter values for the regularization methods with CIFAR-10. Here L1 and L2 are tested.


Figure 21. Comparison of the regularization methods with the baseline for the mid sample sizes from CIFAR-10. L1 included.

Figure 22. Comparison of the regularization methods with the baseline for the high sample sizes from CIFAR-10. L1 included.
