
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS STOCKHOLM, SWEDEN 2019

Stochastic Gradient Descent in Machine Learning

CHRISTIAN L. THUNBERG

NIKLAS MANNERSKOG

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES

Abstract

Some tasks, like recognizing digits and spoken words, are simple for humans to complete yet hard to solve for computer programs. For instance, the human intuition behind recognizing the number eight, "8", is to identify two loops on top of each other, and it turns out this is not easy to represent as an algorithm. With machine learning one can tackle the problem in a new, easier way, where the computer program learns to recognize patterns and draw conclusions from them. In this bachelor thesis a digit recognizing program is implemented, and the parameters of the stochastic gradient descent optimizing algorithm are analyzed with respect to their effect on computation speed and accuracy. These parameters are the learning rate ∆t and the batch size N. The implemented digit recognizing program yielded an accuracy of around 95 % when tested; the time per iteration stayed constant during the training session and increased linearly with batch size. Low learning rates yielded a slower rate of convergence while larger ones yielded faster but more unstable convergence. Larger batch sizes also improved the convergence but at the cost of more computational power.

Keywords: Stochastic Gradient Descent, Machine Learning, Neural Networks, Learning Rate, Batch Size, MNIST

DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS STOCKHOLM, SWEDEN 2019

Stochastic Gradient Descent in Machine Learning

CHRISTIAN L. THUNBERG

NIKLAS MANNERSKOG

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES

Sammanfattning (Abstract in Swedish)

Some problems that are easy for humans to solve, for example recognizing digits and spoken words, are hard to implement in computer programs. For instance, the human intuition for recognizing the digit eight, "8", is to note two loops on top of each other, and this turns out to be hard to represent as an algorithm. With machine learning it is possible to attack the problem in a new, easier way, where the computer program is taught to recognize patterns from which it draws conclusions. In this bachelor thesis a digit recognizing program is implemented and the parameters of stochastic gradient descent are analyzed with respect to their effect on the program's computation speed and accuracy. These parameters are the learning rate ∆t and the batch size N. The implemented digit recognizing program had an accuracy of around 95 % when tested, and the time per iteration was constant during the training of the program while increasing linearly with increased batch size. Low learning rates resulted in slow but steady convergence, whereas larger ones resulted in faster but more unstable convergence. Larger batch sizes improved the convergence but at the cost of longer computation time.

Keywords: Stochastic Gradient Descent, Machine Learning, Neural Networks, Learning Rate, Batch Size, MNIST

Acknowledgements

This bachelor thesis was written as a part of the course SA114X, Degree Project in Engineering Physics, First Cycle, at the Department of Numerical Analysis at the Royal Institute of Technology. We would like to thank our supervisors Anna Persson and Patrick Henning for their support and feedback throughout the entire project.

Christian L. Thunberg, [email protected]
Niklas Mannerskog, [email protected]

Stockholm, 2019

Contents

1 Introduction
  1.1 Purpose
  1.2 Background and target group
  1.3 Delimitation

2 Theory
  2.1 Artificial intelligence
  2.2 Machine learning
  2.3 Artificial neural network

3 Training Neural Networks
  3.1 Standard Gradient Descent
  3.2 Stochastic Gradient Descent
  3.3 Empirical and Expected Values

4 Implementation
  4.1 MNIST Problem and Model
  4.2 Code and measurements

5 Parameter study
  5.1 Time dependency on batch size
    5.1.1 Method
    5.1.2 Result
  5.2 Accuracy and learning rate
    5.2.1 Method
    5.2.2 Result
  5.3 Accuracy and batch size
    5.3.1 Method
    5.3.2 Result
  5.4 Performance
    5.4.1 Method
    5.4.2 Result

6 Analysis and Conclusions

7 Appendix
  7.1 Digit recognizing program

1 Introduction

1.1 Purpose

This bachelor thesis will cover two separate topics:

1. A literature study about the numerical method stochastic gradient descent and other iterative methods used in machine learning to find a local minimum of a function.

2. How to make a digit recognizing computer program by using a neural network which will recognize handwritten digits, and then analyze some parameters' impact on the result and computation speed.

1.2 Background and target group

The interest in machine learning has increased during the past years [1]. It has the potential to shape the future more than any other technical concept in 2019 [2], and the interesting applications available today justify the increased interest, for example:

1. IBM's Deep Blue chess-playing system, which defeated the world champion in chess in 1997 [3].

2. An AI developed by Google's DeepMind, which in February 2019 defeated the best StarCraft II players [4].

3. Alexa, Amazon's virtual assistant, which became good at speech recognition and natural language understanding with the help of machine learning [5].

This bachelor thesis will first explain the important concepts within machine learning. After the concepts are explained, machine learning's "hello world" program will be implemented and explained. After the implementation, some parameters will be analyzed to see how they affect the accuracy of the program. Everyone with basic knowledge in programming and mathematics who is new to machine learning could see this thesis as an introduction to the subject.

1.3 Delimitation

The thesis will only cover supervised learning, which is one of many ways a computer program can be trained, and only a handful of variables will be analyzed. The variables that will be analyzed are: how the batch size, N, affects the computation time and accuracy of the digit recognizing program, and how the learning rate, ∆t, affects the accuracy. With regard to the learning rate, there are methods with adaptive learning rates to train neural networks; in this thesis, however, we only consider constant learning rates and constant batch sizes. Also note that designing a neural network that achieves the maximum possible accuracy is not in the scope of this thesis.

2 Theory

2.1 Artificial intelligence

Artificial intelligence has no unambiguous definition but can be summarized as the science and engineering of making intelligent entities, which includes the most important topic: making intelligent computer programs [6]. An intelligent entity could be a mechanical arm which sorts papers with handwritten digits, or a self-driving car that reacts to people on the road in front of the car.

2.2 Machine learning

Machine learning was defined as the "field of study that gives computers the ability to learn without being explicitly programmed" by a pioneer in the field of machine learning, Arthur Samuel, in 1959.

The field of machine learning can be understood as the scientific study of computational methods that use experience to improve performance or to make better predictions [7]. The entity's software could, for example, learn to recognize handwritten digits by analyzing a large set of handwritten digits, or by being run in a simulator where feedback is given to the software.

2.3 Artificial neural network

An artificial neural network, or from now on "neural network", is a framework for a set of machine learning algorithms inspired by the human brain. The neural network used in this thesis consists of different layers: two that are visible to humans and one or more hidden layers. The layers visible to humans are the first layer, where the data is fed in, and the last layer, which is the neural network's output. Each layer consists of an arbitrary number of neurons, connected to all neurons in the layers in front and behind; no connections between neurons within a layer exist. In figure 1 the above described neural network is visually illustrated.

Figure 1 – Illustration of a neural network with I input neurons, H hidden layers with A, B or C neurons, and O output neurons.

Each neuron contains a value. The value of a neuron is calculated by creating a weighted sum: each neuron value in the previous layer is multiplied by a weight, the products are summed, and a bias is added to the sum. The weighted sum is then mapped to an interval I, chosen by the author, by a function σ : R → I; the mapped value is the neuron's value. The function σ is called the "activation function". The activation function used in this thesis is the sigmoid function, which is a frequently used activation function in machine learning [8]:

σ(x) = 1 / (1 + e^(−x)).    (1)

Let Σ_bh be the weighted sum of N_bh and let n_bh be the value of N_bh in figure 1. The weighted sum with the bias is then calculated through

Σ_bh = ∑_{a=1}^{A} ω_ab · n_a1 + b_h    (2)

where ω_ab is the weight, b_h is the bias and n_a1 is a neuron's value in the previous layer. The index ab means from neuron a in the previous layer to neuron b in the current layer. The weights and biases are real values and can be thought of as a description of how a neuron depends on a neuron in the previous layer. The value of the neuron, n_bh, is then calculated by inserting the weighted sum, Σ_bh, into the activation function. If so, the value of the neuron is

n_bh = σ(Σ_bh).    (3)

In figure 2 the above described neuron is visually illustrated.

Figure 2 – Illustration of how the j'th neuron in layer h in a neural network is calculated. The value of a neuron in the previous layer is labeled n_{a,h−1}, where a ∈ {1, 2, ..., A} and h − 1 is the previous layer's index. From now on a will be used instead of a, h − 1. Each n_a is associated with a weight ω_a ∈ (−∞, ∞) and a bias b_h ∈ (−∞, ∞) which describe how this neuron depends on n_a. The value of the neuron of interest, n_{j,h}, is then calculated by inserting the weighted sum Σ = ∑_{a=1}^{A} ω_a · n_a + b_h into the activation function σ, hence n_{j,h} = σ(Σ). The value of a neuron is then transmitted to all neurons in the next layer.
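To make equations (1)–(3) concrete, the following is a minimal NumPy sketch of how a single neuron's value could be computed; the variable names and example numbers are illustrative choices, not taken from the thesis code.

import numpy as np

def sigmoid(x):
    """Sigmoid activation function, equation (1)."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron_value(prev_values, weights, bias):
    """Weighted sum of the previous layer plus a bias, equation (2),
    mapped through the activation function, equation (3)."""
    weighted_sum = np.dot(weights, prev_values) + bias
    return sigmoid(weighted_sum)

# Example: a neuron connected to three neurons in the previous layer.
prev_values = np.array([0.2, 0.7, 0.1])   # n_a, values in the previous layer
weights = np.array([0.5, -1.0, 2.0])      # one weight ω_a per incoming connection
bias = 0.1                                # b_h
print(neuron_value(prev_values, weights, bias))  # a value in (0, 1)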

In this thesis the inputs and outputs of the model can be interpreted as vectors, X = (X_1, ..., X_I) and α = (α_1, ..., α_O), where the components X_i and α_o are real numbers. The output layer is mapped to a vector T(α), where the components, T_o, describe the probability of an event. The largest component in T, T_max = max(T), is the neural network's conclusion. E.g. if the neural network can distinguish pictures of cats from pictures of dogs, the mapped output layer could be T = [T_1, T_2] = [P(Dog), P(Cat)]. If the output array for a picture is T = [0.52, 0.89], the largest component is T_max = 0.89; hence the neural network will conclude "this is a picture of a cat". Before the neural network can make a good conclusion it needs to be trained, or taught, by "looking" at input data with labels that contain the correct output layer vector

y = (y_1, ..., y_o, ..., y_O) where y_o = 1 if the o'th component is the wanted output, and y_o = 0 otherwise.    (4)

The correct output layer vector y, or the one-hot vector when y is defined as above, is then compared with the neural network's mapped output vector T for an error estimation; let us denote the comparison T_Compared. The problem is now to minimize f(all variables in the neural network) = T_Compared in order to minimize the error. In general this is not easily done and certain methods are used to complete the minimization problem, see section 3 for further reading. E.g. a picture of a cat is given to a neural network in the training phase, the correct output array is y = [0, 1] = [y_1, y_2] and the network's output is T = [T_1, T_2]. The comparison could be the sum of the components' squared differences, hence

T_Compared = ∑_{o=1}^{O} (T_o − y_o)².    (5)

The goal is now to minimize [9]

E[f(all variables in the neural network)].    (6)

To simplify the notation from now on: θ will be used to represent the model's parameters, x is the input, y is the one-hot vector and α will be the neural network function (taking x and θ as inputs). Hence the function to minimize (with respect to θ) can be written as:

E[f(all variables)] = E[f(α(x, θ), y)]. (7)
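As a tiny numerical illustration of the one-hot label in (4) and the squared-difference comparison in (5), consider the cat/dog example above; the NumPy code is only an illustration, not part of the thesis implementation.

import numpy as np

# One-hot label, equation (4): the picture shows a cat.
y = np.array([0.0, 1.0])      # [y_1, y_2] = [P(Dog), P(Cat)] wanted output
T = np.array([0.52, 0.89])    # the network's mapped output

# Squared-difference comparison, equation (5).
T_compared = np.sum((T - y) ** 2)
print(T_compared)             # (0.52 - 0)^2 + (0.89 - 1)^2 = 0.2825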

3 Training Neural Networks

3.1 Standard Gradient Descent

As previously stated, the machine learning problem of training a neural network is reduced to finding θ that solves

min_θ E[f(α(x, θ), y)]    (8)

where we specify a cost function f(α, y) and feed data points (x_i, y_i) from the training data to compute the optimal neural network parameters θ. Here α is the neural network function taking input vectors x and parameters θ. We can solve the problem approximately using the following algorithm:

• Make an initial guess θ_0 for θ
• Create a set of labeled training data {(x_i, y_i)}_{i=1}^{N_training}
• Choose a suitable learning rate ∆t > 0
• for m ∈ {1, 2, ..., M} do
    • Compute θ_{m+1} = θ_m − ∆t · (1/N_training) ∇_θ ∑_{j=1}^{N_training} f(α(x_j, θ_m), y_j)

which is typically referred to as "standard" gradient descent, or GD for short. Naturally, GD actually tries to minimize (1/N_training) ∑_{j=1}^{N_training} f(α(x_j, θ), y_j) [10].
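The following is a minimal sketch of the GD update rule above, applied to a toy least-squares problem rather than a neural network; the data, the cost and all parameter values are made up for illustration.

import numpy as np

# Toy problem: fit a line y = θ·x by minimizing the mean squared cost
# (1/N) ∑_j (θ·x_j − y_j)^2 with full-batch gradient descent.
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=100)
y_train = 3.0 * x_train + 0.1 * rng.normal(size=100)   # noisy "labels"

def gradient(theta, x, y):
    # d/dθ of the mean squared cost, computed over the whole training set.
    return np.mean(2.0 * (theta * x - y) * x)

theta = 0.0           # initial guess θ_0
dt = 0.5              # learning rate ∆t
for m in range(100):  # M iterations
    theta = theta - dt * gradient(theta, x_train, y_train)

print(theta)          # close to 3.0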

3.2 Stochastic Gradient Descent

The most computationally taxing part of the algorithm described above is calculating the gradient, which is done in an exact manner using the entire training data set. However, the gradient can instead be approximated using a smaller sample (mini-batch) of the data set, thus reducing the computation needed. The algorithm then becomes

• Make an initial guess θ_0 for θ
• Create a set of labeled training data {(x_i, y_i)}_{i=1}^{N_training}
• Choose a suitable batch size N ≤ N_training
• Choose a suitable learning rate ∆t > 0
• for m ∈ {1, 2, ..., M} do
    • Choose random indices {j_k}_{k=1}^{N} from {1, 2, ..., N_training}
    • Compute θ_{m+1} = θ_m − ∆t · (1/N) ∇_θ ∑_{k=1}^{N} f(α(x_{j_k}, θ_m), y_{j_k})

which we call stochastic gradient descent (SGD) [10]. If N = N_training we get standard GD instead. Naturally this method will be less exact due to its random nature, but will hopefully be faster in solving (8) by using less computation.

In conclusion, we must specify at least two parameters, the learning rate ∆t and the batch size N, in order to use SGD as an optimizer. In this most simple version we keep ∆t constant, but it could also be varied according to a schedule {∆t_i}_{i=1}^{M} in more advanced versions of the algorithm.
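Below is the same toy least-squares problem as in the GD sketch, now updated with mini-batches; again, the data and parameter values are illustrative and not the thesis's MNIST implementation. With N equal to the training set size each step reduces to standard GD.

import numpy as np

# Mini-batch SGD on the toy problem: fit y = θ·x from noisy data.
rng = np.random.default_rng(1)
x_train = rng.uniform(-1.0, 1.0, size=1000)
y_train = 3.0 * x_train + 0.1 * rng.normal(size=1000)

theta = 0.0           # initial guess θ_0
dt = 0.5              # learning rate ∆t
N = 32                # batch size N ≤ N_training
for m in range(500):  # M iterations
    # Choose a random mini-batch {j_k} from the training indices.
    j = rng.choice(len(x_train), size=N, replace=False)
    xb, yb = x_train[j], y_train[j]
    # Gradient of (1/N) ∑_k (θ·x_{j_k} − y_{j_k})^2 with respect to θ.
    g = np.mean(2.0 * (theta * xb - yb) * xb)
    theta = theta - dt * g

print(theta)          # noisy estimate close to 3.0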

3.3 Empirical and Expected Values

As mentioned in 3.1, the actual function attempted to be minimized in both GD and SGD is (1/N_training) ∑_{j=1}^{N_training} f(α(x_j, θ), y_j), which is based on a training sample from the distribution of the data x. This only yields an approximate solution that is dependent on the training data used, and it is therefore referred to as the empirical loss function. To make an estimation of the expected loss, E[f(α(x, θ), y)], we can feed the network data points not from the training set (i.e. from the test set), and thus not previously seen by the model. As both data sets are sampled from the same distribution, minimizing the empirical loss function is hoped to also minimize the expected loss function. However, if M is too high, training may fit the model to noise in the training data and thus no longer decrease the expected loss. This is called over-fitting.

4 Implementation

4.1 MNIST Problem and Model

To make a numerical analysis of the stochastic gradient descent optimizer we use a standard handwritten digit recognition problem and the MNIST data set. The data set consists of labeled 28×28 (pixel) images, picturing different handwritten digits 0-9. Each label comes in the form of a 1 × 10 one-hot label [11]. The goal is to create a neural network that achieves a high accuracy in identifying which digit is displayed in such a picture.

Figure 3 – An MNIST 4. Figure 4 – An MNIST 5.

Figure 5 – An MNIST 7. Figure 6 – Another MNIST 7.
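As an illustration of the data format described above, the sketch below loads a copy of MNIST through tf.keras and one-hot encodes the labels; the thesis's own code (see the appendix) obtained the data differently, so this is only a convenient stand-in.

import numpy as np
import tensorflow as tf

# Load a copy of the MNIST data set bundled with tf.keras.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape)      # (60000, 28, 28): 60000 grayscale 28x28 images
print(y_train[:3])        # integer labels, e.g. [5 0 4]

# Flatten each 28x28 image to a 784-vector and scale pixel values to [0, 1].
x_train = x_train.reshape(-1, 784).astype(np.float32) / 255.0

# Convert integer labels to 1x10 one-hot vectors.
y_train_onehot = tf.keras.utils.to_categorical(y_train, num_classes=10)
print(y_train_onehot[0])  # e.g. [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] for a "5"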

The basic model used in this report to solve the problem is a three-layer neural network. As each image has 28 × 28 = 784 pixels, we have 784 neurons in the input layer. Hidden layers increase the complexity of the model and allow for a higher accuracy, but also increase the time to train it; thus only one hidden layer with K = 500 neurons and a sigmoid activation function is used. The output layer consists of 10 output neurons (one for each digit). Note that the purpose here is not to achieve the maximal possible accuracy for the problem, but to observe the SGD optimization process. Furthermore, the output of the neural network is normalized by passing it through the softmax function. With z = (z_1, ..., z_n)^T, we define the softmax function S(z) as follows [12]:

S(z) = ( e^{z_1} / ∑_{i=1}^{n} e^{z_i}, ..., e^{z_n} / ∑_{i=1}^{n} e^{z_i} ).    (9)

This forces each component of the output into the interval 0 < S(z)_i < 1, as e^x > 0. It also satisfies ∑_{i=1}^{n} S(z)_i = 1. The cost function used is called the cross-entropy function [13]. In the discrete case the cross-entropy function H(y, z) is defined as follows (y = (y_1, ..., y_n)^T):

H(y, z) = − ∑_{i=1}^{n} y_i log(z_i)    (10)

In this particular problem, y is the label of the image and z is the output from the model through the softmax function. The value of H(y, z) is clearly defined for all outputs of the neural network passed through the softmax function, as these are all greater than 0 by (9). We also have n = 10, and the concrete problem to solve becomes

min_θ E[H(y, S(α(x, θ)))]    (11)

where x is the pixel data from the images, y is the corresponding label and α(x, θ) is the function of the neural network, using the same notation as in (8). This will be solved using the stochastic gradient descent algorithm described in section 3.2, with measurements taken according to the next section.
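A small NumPy sketch of the softmax (9) and cross-entropy (10) computations used in (11) is given below; the max-subtraction in the softmax is a common numerical-stability trick added here, not something stated in the text, and the example numbers are made up.

import numpy as np

def softmax(z):
    """Softmax, equation (9); subtracting max(z) avoids overflow
    without changing the result."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(y, z):
    """Discrete cross-entropy, equation (10); y is a one-hot label and
    z a probability vector, e.g. the softmax output."""
    return -np.sum(y * np.log(z))

# A raw network output (10 values, one per digit) and the label "3".
logits = np.array([0.1, -1.2, 0.3, 2.5, 0.0, -0.7, 0.2, 1.1, -0.3, 0.4])
y = np.zeros(10)
y[3] = 1.0

z = softmax(logits)
print(np.sum(z))            # 1.0, as guaranteed by (9)
print(cross_entropy(y, z))  # small when digit 3 gets high probability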

4.2 Code and measurements

The actual code used to train the model can be found in the appendix. In short, we have used a TensorFlow [14] implementation in Python, inspired by tutorials from [15]. The MNIST data set {(x_i, y_i)}_{i=1}^{N_tot} comes divided into a training set {(x_i, y_i)}_{i=1}^{N_split} and a test set {(x_i, y_i)}_{i=N_split+1}^{N_tot}, which we use for training and verification respectively. For each set of parameters the network was trained for 7500 iterations, recording the following measurements every 25th iteration m:

• Empirical cost, defined as (1/N_split) ∑_{j=1}^{N_split} H(y_j, α(x_j, θ_m))
• Expected cost, defined as (1/(N_tot − N_split)) ∑_{j=N_split+1}^{N_tot} H(y_j, α(x_j, θ_m))
• Empirical accuracy, the ratio of images correctly identified on the training data set
• Expected accuracy, the ratio of images correctly identified on the test data set

This way we obtained data on how the SGD optimization affects the performance on the training and test data sets while solving (11) over the training loop described in section 3.2. Also, as the SGD algorithm is random in nature, we train the network 10 times for each set of parameters in order to obtain mean and variance measurements for the values collected above. In addition to these, we also ran separate training sessions where we collected the time it takes to train the network for different batch sizes. The parameter sets tested were combinations of the following:

• Learning rates used are 0.01, 0.1, 0.5, 1, 2, 5, 10, 20
• Batch sizes used are 1, 2, 4, 8, 16, 32, 64, 128, 256, 512

for a total of 8 × 10 × 10 = 800 training sessions for varying batch sizes and learning rates, in addition to 10 × 30 = 300 training sessions for recording time data.
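For orientation, the following is a rough modern tf.keras sketch of the setup described above (784-500-10 network, sigmoid hidden layer, softmax output, cross-entropy loss, plain SGD with constant learning rate and batch size). The thesis used an older TensorFlow-1-style script (see the appendix), so the names, the epoch-based training call and the parameter values shown here are illustrative choices, not the original code.

import tensorflow as tf

# Data: flatten to 784-vectors, scale to [0, 1], one-hot encode the labels.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

# Model: one hidden layer with K = 500 sigmoid neurons, softmax output.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(500, activation="sigmoid", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1.0),  # constant ∆t = 1
    loss="categorical_crossentropy",                       # cross-entropy after softmax
    metrics=["accuracy"],
)

# batch_size corresponds to N; one epoch is a full pass over the training set.
model.fit(x_train, y_train, batch_size=128, epochs=5,
          validation_data=(x_test, y_test))
print(model.evaluate(x_test, y_test))  # [expected cost, expected accuracy]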

5 Parameter study

5.1 Time dependency on batch size

5.1.1 Method

In order to analyze how the batch size affects the time efficiency, the digit recognizing program was run with K = 500, M = 7500 and ∆t = 1 as fixed parameters while the batch size was varied; the batch sizes (N) used were 1, 2, 4, 8, 16, 32, 64, 128, 256 and 512.

The digit recognizing program was run with each of the previously chosen batch sizes. The start time was saved when the calculations for a new batch size started, and for every 25th iteration the elapsed time (t_now − t_start) was saved. This procedure was repeated 30 times and generated 30 time series for each batch size. The first two plots below show the averages of all 30 time series over the iterations and the corresponding standard deviation for each data point. A linear function was fitted to each time series, where the slope, k, represents the time per iteration for that time series. The third plot below shows the average of the 30 slopes for each batch size, with the standard deviation for each average slope.
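A minimal sketch of this timing procedure is given below, using time.perf_counter and a least-squares line fit with numpy.polyfit; the train_step function is a placeholder for one SGD update and is not part of the original code.

import time
import numpy as np

def train_step():
    """Placeholder for one SGD update of the network."""
    pass

# Record the elapsed time every 25th iteration ...
iterations, elapsed = [], []
t_start = time.perf_counter()
for m in range(1, 7501):
    train_step()
    if m % 25 == 0:
        iterations.append(m)
        elapsed.append(time.perf_counter() - t_start)

# ... and fit a line whose slope k estimates the time per iteration.
k, intercept = np.polyfit(iterations, elapsed, deg=1)
print(k)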

5.1.2 Result

Figure 7 – The mean of the 30 different time series of the elapsed time for the different batch sizes.

Figure 7 shows that the time the digit recognizing program needs to make m steps grows linearly with the number of iterations made. It is also, from this figure, possible to see that the time needed to perform m steps grows faster when the batch size is increased. This means that time per iteration is longer for larger batch sizes.

Figure 8 – The uncertainty in elapsed time for different batch sizes.

Figure 8 shows the uncertainty in time elapsed between the different time series. The general trend is that the uncertainty in time elapsed grows linearly with the number of iterations. This is logical: since there is a difference in time per iteration between the different time series, this difference grows larger as the number of iterations increases. Something surprising was that for N = 16, and for some other N, patterns can be seen in the data, which probably has something to do with the computer the digit recognizing program ran on.

Figure 9 – Time per iteration for different batch sizes.

Figure 9 illustrates how the time per iteration increases seemingly linearly for large batch sizes (N ≥ 8), though this does not seem to hold for the smaller batch sizes. Note that figure 9's x-axis is logarithmic, hence the exponential-looking curve corresponds to a linear relationship. When the error bars are compared with figure 8, one can see that, as expected, small error bars correlate with low slopes. The slope of this line is specific to the computer the training sessions were run on. A supercomputer can make 2 × 10^17 calculations per second [16], while home desktop computers can only make billions (10^9) of calculations per second; hence the supercomputer is expected to perform the same calculations faster than the home desktop computer [17].

5.2 Accuracy and learning rate

5.2.1 Method

We measured empirical and expected losses/accuracies in accordance with the model described in 4.2 for the learning rates ∆t = 0.01, 0.1, 0.5, 1, 2, 5, 10, 20. Keeping the batch size constant at N = 128, we ran 10 training sessions for each learning rate, taking measurements every 25th iteration over a total of 7500 iterations. As the optimization is stochastic in nature, the plots below show the averages over these ten sessions for the iterations measured, in order to better capture the converging behaviour.

5.2.2 Result

Figure 10 – Empirical loss for different learning rates, N = 128.

Figure 10 describes the empirical losses for the different learning rates. Unsurprisingly, the lower ∆t's exhibit slower rates of decline, with the decline speeding up until ∆t = 2, which seems to yield the lowest cost at m = 7500. For ∆t = 20 the effect is reversed, declining similarly to ∆t = 0.01 but with a higher variance, suggesting the algorithm has become unstable. This relationship between higher variance and higher learning rates seems to be prevalent for all tested ∆t. Note that figure 10's y-axis is logarithmic.

9 Figure 11 – Expected loss for different learning rates, N = 128.

In contrast, the expected losses described in figure 11 suggest convergence instead of a constant decline. Again, the lower ∆t's show a slower rate of decline up until ∆t = 10, where ∆t = 20 again exhibits behaviour similar to that of ∆t = 0.01, apart from a higher variance. This higher variance is also present for ∆t = 10, which initially converges more slowly than ∆t = 5 and seems more sporadic, but ultimately converges to a lower value. Note that figure 11's y-axis is logarithmic.

Figure 12 – Empirical accuracies for different learning rates, N = 128.

Figure 12 shows the empirical accuracies for the different learning rates. Clearly noticeable is the fact that the accuracy has not completely converged yet for ∆t = 0.01 which, in accordance with the previous figures, exhibits the slowest growth. The growth increases until ∆t = 5 after which it begins to decline and instead increase in variance. Similarly to previous figures, ∆t = 20 converges at a similar rate to that of ∆t = 0.1 but with higher variance. Most of the other models converge very close to 1, suggesting potential over-fitting at higher iterations.

Figure 13 – Expected accuracies for different learning rates, N = 128.

Qualitatively, the behavior in figure 13 is very similar to that in figure 12, apart from the accuracies converging to lower values for seemingly all learning rates but 0.01, 0.1 and 20. The accuracies also flatten out faster than in the previous figure, indicating that most of the later training iterations for these learning rates are unnecessary.

5.3 Accuracy and batch size

5.3.1 Method

We measured empirical and expected losses/accuracies in accordance with the model described in 4.2 for the batch sizes N = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512. Keeping the learning rate constant at ∆t = 1, we ran 10 training sessions for each batch size, taking measurements every 25th iteration over a total of 7500 iterations. As the optimization is stochastic in nature, the plots below show the averages over these ten sessions for the iterations measured, in order to better capture the converging behaviour.

5.3.2 Result

Figure 14 – Empirical loss for different batch sizes, ∆t = 1.

Figure 14 describes the empirical losses for the different batch sizes. Unsurprisingly, the loss exhibits faster decline rates with increased batch size up to N = 128. For iterations lower than 6000, a batch size of 512 seems to have the fastest rate of decline, but it is then overtaken by the lower batch sizes of 128 and 256, which is an anomaly considering the more general patterns of the plot. Note that figure 14's y-axis is logarithmic.

Figure 15 – Expected loss for different batch sizes, ∆t = 1.

Figure 15 describes the expected losses for the different batch sizes. In contrast to figure 14, which shows that the loss for N = 128, 256 or 512 declines fastest, figure 15 shows that N = 64 declines fastest after around 2000 iterations, which suggests that the model has been over-fitted. Note that figure 15's y-axis is logarithmic.

Figure 16 – Empirical accuracy for different batch sizes, ∆t = 1.

Figure 16 describes the empirical accuracy for the different batch sizes. Clearly noticeable is the fact that the accuracy has converged for all batch sizes. Another fact is that N = 128, 256 and 512 achieve an accuracy of 1, which could mean that the model has been over-fitted.

Figure 17 – Expected accuracy for different batch sizes, ∆t = 1.

Qualitatively, the behavior in figure 17 is very similar to that in figure 16, apart from the accuracies converging to lower values for seemingly all batch sizes but N = 1, 2, 4 and 8. For these lower N's the empirical and expected accuracies seem all but equal. For the remaining batch sizes, the accuracies flatten out faster than in the previous figure, indicating that most of the later training iterations for these batch sizes are unnecessary.

5.4 Performance

5.4.1 Method

Performance of the different parameter sets was computed in two ways. The final expected accuracy Acc_ex was taken as the average of the measurements over the last 500 iterations of the 10 training sessions (20 × 10 = 200 data points). The accuracy at iteration m, Acc_m, was said to have converged to this value if Acc_ex − Acc_m ≤ 0.005. Below we plot these measurements against batch size for the different learning rates.
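A sketch of how these two measures could be computed from the recorded accuracy curves is shown below; the array layout (10 sessions × 300 recorded points for 7500 iterations), the use of the session-averaged curve and the synthetic example data are assumptions for illustration, not the thesis's actual analysis script.

import numpy as np

def final_accuracy_and_convergence(acc, record_every=25, tol=0.005):
    """acc has shape (sessions, recordings): expected accuracy recorded
    every 25th iteration for each of the 10 training sessions."""
    acc_ex = np.mean(acc[:, -20:])               # mean of the last 500 iterations (20 x 10 points)
    mean_curve = np.mean(acc, axis=0)            # session-averaged accuracy curve
    converged = np.nonzero(acc_ex - mean_curve <= tol)[0]
    m_conv = (converged[0] + 1) * record_every   # first recorded iteration within tolerance
    return acc_ex, m_conv

# Example with synthetic accuracy curves (10 sessions, 300 recordings).
rng = np.random.default_rng(0)
curve = 1.0 - 0.5 * np.exp(-np.arange(1, 301) / 40.0)
acc = curve + 0.002 * rng.normal(size=(10, 300))
print(final_accuracy_and_convergence(acc))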

5.4.2 Result

Figure 18 – Final expected accuracy against batch size for different learning rates ∆t.

Figure 18 describes how the final accuracy depends on the parameter sets. Most learning rates, apart from ∆t = 0.01, seem to converge to higher values with an increasing batch size, with lower payoffs from very large and very small batch sizes. Overall, high batch sizes with a learning rate of 10 seem to yield the highest final expected accuracy. In addition, higher learning rates shift the plots rightwards, suggesting a larger batch size is required to achieve high accuracy. Note that figure 18's x-axis is logarithmic.

Figure 19 – Iterations until convergence against batch size for various learning rates ∆t.

Similarly, figure 19 describes the dependency of the convergence rate on the batch size for the different learning rates. The overall behaviour observed is a somewhat concave shape for most ∆t's, suggesting that very small and very large batch sizes yield faster convergence, iteration-wise, compared to medium-sized batches. Note that figure 19's x-axis is logarithmic.

6 Analysis and Conclusions

In this thesis the parameters of stochastic gradient descent have been analyzed with regard to their effect on the neural network optimization, in particular the learning rate ∆t and the batch size N. This was done by examining the empirical and expected costs and accuracies during the training processes, in addition to timing the training sessions.

With regard to time per iteration, the duration of each iteration was constant, with a small deviation between the different time series. The constant nature of the time per iteration remained during the whole training session. When analyzing the time elapsed for different batch sizes, one could conclude that the time per iteration increased linearly with increased batch size. The computer the training sessions were run on determines how the slope changes for different batch sizes. The linearly increasing slope for increased batch size was expected, as the algorithm which solves equation (8) scales linearly with the batch size (N) as described in section 3. Since this was shown empirically, one could also conclude that the TensorFlow gradient computing algorithm scales linearly with the batch size.

Moreover, a low learning rate unsurprisingly yielded a slower rate of convergence. By observing the algorithm in section 3.2, this is explained by the smaller iterative steps taken between each update of the network, which thus also yields higher stability of the optimization. In contrast, the highest learning rates consistently yielded lower final accuracies and decreased stability, which suggests there is an optimal learning rate for each batch size. As mentioned in section 3.2, more sophisticated algorithms make use of varied learning rates, which could combine the quick initial convergence of a high learning rate with the stability and precision of a lower one for higher iterations. It is also worth noting the discrepancy between the empirical and expected loss (figure 10 and figure 11), which increases with larger step sizes for the higher iterations. Thus higher learning rates seem more prone to over-fit the model, by continuing to optimize the empirical training function while having little effect on the expected one after the initial iterations.

Furthermore, the higher and more stable convergence of higher batch sizes illustrated in figure 14 and figure 15 can be explained by the more accurate approximations of the gradient.

Higher batch sizes also yield quicker and higher convergence, suggesting that using a high batch size is preferable. The lower batch sizes also converge quickly, but are unstable and yield lower final converged accuracies. Figure 19 implies that medium-sized batches converge the slowest (iteration-wise). This is possibly due to them being able to approximate the gradient well enough to yield an acceptable final accuracy, but still poorly enough that they cannot compete with the higher batch sizes. As the highest batch sizes clearly do not yield higher performance in a linear manner (figure 18, figure 19), while their computational cost does grow linearly (figure 9), it is doubtful that using the highest batch sizes is preferable to simply iterating the optimization loop more times. As described in the results, while the medium batch sizes may require more iterations to converge, it is not always double the amount; thus using a lower batch size might be preferable in this regard.

Overall, the different batch sizes and learning rates have different properties, making them useful for different situations. Using constant batch sizes and learning rates, as done in this thesis, is clearly not optimal, and these could be varied to increase the performance of the networks. Using a medium batch size with a higher learning rate initially, and then increasing the batch size while decreasing the learning rate later, should be a better solution given the results in the previous section.

Further studies could be made into such adaptive optimization variants of SGD, as a constant learning rate clearly is not optimal. Varying batch sizes during training sessions is also an interesting area. In this thesis, the size of the training set and test set was also held constant and taken directly from the MNIST database. Studies into how the ratio of training and test set sizes affects the expected and empirical losses and accuracies were not explored in this thesis and are therefore something that could be investigated further.

References

[1] Google Trends. (2019) Machine learning — Google Trends. [Online]. Available: https://trends.google.com/trends/explore?date=all&q=Machine%20learning [Accessed: 2019-04-24].
[2] V. Maini and S. Sabri, "Machine learning for humans," Medium, 2017. [Online]. Available: https://medium.com/machine-learning-for-humans [Accessed: 2019-04-24].
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, ch. 1, p. 2. [Online]. Available: http://www.deeplearningbook.org [Accessed: 2019-05-03].
[4] J. Vincent, "DeepMind's AI agents conquer human pros at StarCraft II," The Verge, Jan. 2019. [Online]. Available: https://www.theverge.com/2019/1/24/18196135/google-deepmind-ai-starcraft-2-victory [Accessed: 2019-04-24].
[5] Day One Staff. (2018, Mar.) "How our scientists are making Alexa smarter". [Online]. Available: https://blog.aboutamazon.com/amazon-ai/how-our-scientists-are-making-alexa-smarter [Accessed: 2019-04-24].
[6] J. McCarthy, "What is artificial intelligence?" p. 1, 2007. [Online]. Available: https://web.archive.org/web/20151118212404/http://www-formal.stanford.edu/jmc/whatisai/node1.html [Accessed: 2019-04-28].
[7] M. Esposito, K. Bheemaiah, and T. Tse, "What is machine learning?" The Conversation, 2017. [Online]. Available: http://theconversation.com/what-is-machine-learning-76759 [Accessed: 2019-04-28].
[8] M. A. Nielsen, Neural Networks and Deep Learning. Determination Press, 2015, ch. 1, Sigmoid neurons. [Online]. Available: http://neuralnetworksanddeeplearning.com/chap1.html#sigmoid_neurons [Accessed: 2019-05-02].
[9] M. A. Nielsen, Neural Networks and Deep Learning. Determination Press, 2015, ch. 1. [Online]. Available: http://neuralnetworksanddeeplearning.com/chap1.html [Accessed: 2019-04-28].
[10] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[11] Y. LeCun, C. Cortes, and C. Burges. The MNIST database of handwritten digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/ [Accessed: 2019-04-24].
[12] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, ch. 4, p. 79. [Online]. Available: http://www.deeplearningbook.org [Accessed: 2019-05-03].
[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, ch. 5, p. 130. [Online]. Available: http://www.deeplearningbook.org [Accessed: 2019-05-03].
[14] TensorFlow. [Online]. Available: https://www.tensorflow.org/ [Accessed: 2019-04-27].
[15] H. Kinsley ("Sentdex"). PythonProgramming.net. [Online]. Available: https://pythonprogramming.net [Accessed: 2019-04-27].
[16] J. Bryner, "This supercomputer can calculate in 1 second what would take you 6 billion years," Live Science, 2018. [Online]. Available: https://www.livescience.com/62827-fastest-supercomputer.html [Accessed: 2019-04-29].
[17] J. Strickland, "What is computing power?" HowStuffWorks.com, 2019. [Online]. Available: https://computer.howstuffworks.com/computing-power.htm [Accessed: 2019-04-29].

7 Appendix

7.1 Digit recognizing program

The following code was used to train the digit recognizing program and was, as previously stated, inspired by code examples from: https://pythonprogramming.net/tensorflow-neural-network-session-machine-learning-tutorial/
