
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS STOCKHOLM, SWEDEN 2019

Stochastic Gradient Descent in Machine Learning

CHRISTIAN L. THUNBERG

NIKLAS MANNERSKOG

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES

Abstract

Some tasks, like recognizing digits and spoken words, are simple for humans to complete yet hard to solve for computer programs. For instance, the human intuition behind recognizing the number eight, "8", is to identify two loops on top of each other, and it turns out this is not easy to represent as an algorithm. With machine learning one can tackle the problem in a new, easier way, where the computer program learns to recognize patterns and draw conclusions from them. In this bachelor thesis a digit recognizing program is implemented, and the parameters of the stochastic gradient descent optimizing algorithm are analyzed with respect to their effect on computation speed and accuracy. These parameters are the learning rate ∆t and the batch size N. The implemented digit recognizing program yielded an accuracy of around 95 % when tested; the time per iteration stayed constant during the training session and increased linearly with batch size. Low learning rates yielded a slower rate of convergence while larger ones yielded faster but more unstable convergence. Larger batch sizes also improved the convergence but at the cost of more computational power.

Keywords: Stochastic Gradient Descent, Machine Learning, Neural Networks, Learning Rate, Batch Size, MNIST

DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS STOCKHOLM, SWEDEN 2019

Stochastic Gradient Descent in Machine Learning

CHRISTIAN L. THUNBERG

NIKLAS MANNERSKOG

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES

Sammanfattning (Abstract in Swedish)

Some problems that are easy for humans to solve, for example recognizing digits and spoken words, are hard to implement in computer programs. For instance, the human intuition for recognizing the digit eight, "8", is to note two loops on top of each other, and this turns out to be hard to represent as an algorithm. With machine learning it is possible to attack the problem in a new, easier way, where the computer program is taught to recognize patterns from which it draws conclusions. In this bachelor thesis a digit recognizing program is implemented and the parameters of stochastic gradient descent are analyzed with respect to their effect on the program's computation speed and accuracy. These parameters are the learning rate ∆t and the batch size N. The implemented digit recognizing program had an accuracy of around 95 % when tested, and the time per iteration was constant during the training of the program while increasing linearly with increased batch size. Low learning rates resulted in slow but steady convergence, whereas larger ones resulted in faster but more unstable convergence. Larger batch sizes improved the convergence but at the cost of longer computation time.

Keywords: Stochastic Gradient Descent, Machine Learning, Neural Networks, Learning Rate, Batch Size, MNIST

Acknowledgements

This bachelor thesis was written as a part of the course SA114X, Degree Project in Engineering Physics, First Cycle, at the Department of Numerical Analysis at the Royal Institute of Technology. We would like to thank our supervisors Anna Persson and Patrick Henning for their support and feedback throughout the entire project.

Christian L. Thunberg, [email protected]
Niklas Mannerskog, [email protected]

Stockholm, 2019

Contents

1 Introduction
  1.1 Purpose
  1.2 Background and target group
  1.3 Delimitation

2 Theory
  2.1 Artificial intelligence
  2.2 Machine learning
  2.3 Artificial neural network

3 Training Neural Networks
  3.1 Standard Gradient Descent
  3.2 Stochastic Gradient Descent
  3.3 Empirical and Expected Values

4 Implementation
  4.1 MNIST Problem and Model
  4.2 Code and measurements

5 Parameter study
  5.1 Time dependency on batch size
    5.1.1 Method
    5.1.2 Result
  5.2 Accuracy and learning rate
    5.2.1 Method
    5.2.2 Result
  5.3 Accuracy and batch size
    5.3.1 Method
    5.3.2 Result
  5.4 Performance
    5.4.1 Method
    5.4.2 Result

6 Analysis and Conclusions

7 Appendix
  7.1 Digit recognizing program

1 Introduction

1.1 Purpose

This bachelor thesis will cover two separate topics:

1. A literature study about the numerical method stochastic gradient descent and other iterative methods used in machine learning to find a local minimum of a function.

2. How to make a digit recognizing computer program by using a neural network which will recognize handwritten digits, and then analyze some parameters' impact on the result and computation speed.

1.2 Background and target group

The interest in machine learning has increased during the past years [1]. It has the potential to shape the future more than any other technical concept in 2019 [2], and the interesting applications available today justify the increased interest, for example:

1. IBM's Deep Blue chess-playing system, which defeated the world champion in chess in 1997 [3].

2. An AI developed by Google's DeepMind, which in February 2019 defeated the best StarCraft II players [4].

3. Alexa, Amazon's virtual assistant, which became good at speech recognition and natural language understanding with the help of machine learning [5].

This bachelor thesis will first explain the important concepts within machine learning. After the concepts are explained, machine learning's "hello world" program will be implemented and explained. After the implementation, some parameters will be analyzed to see how they affect the accuracy of the program. Everyone with basic knowledge in programming and mathematics who is new to machine learning could see this thesis as an introduction to the subject.

1.3 Delimitation

The thesis will only cover supervised learning, which is one of many ways a computer program can be trained, and only a handful of variables will be analyzed. The variables that will be analyzed are: how the batch size, N, affects the computation time and accuracy of the digit recognizing program, and how the learning rate, ∆t, affects the accuracy. With regard to the learning rate, there are methods with adaptive learning rates to train neural networks; in this thesis, however, we only consider constant learning rates and constant batch sizes. Also note that designing a neural network that achieves the maximum possible accuracy is not in the scope of this thesis.

2 Theory

2.1 Artificial intelligence

Artificial intelligence has no unambiguous definition but can be summarized as the science and engineering of making intelligent entities, which includes the most important topic: making intelligent computer programs [6]. An intelligent entity could be a mechanical arm which sorts papers with handwritten digits, or a self-driving car that reacts to people on the road in front of the car.

2.2 Machine learning

Machine learning was defined as the "field of study that gives computers the ability to learn without being explicitly programmed" by a pioneer in the field of machine learning, Arthur Samuel, in 1959.

The field of machine learning can be understood as the scientific study of computational methods that use experience to improve performance or to make better predictions [7]. The entity's software could, for example, learn to recognize handwritten digits by analyzing a large set of handwritten digits, or by being run in a simulator where feedback is given to the software.

2.3 Artificial neural network

An artificial neural network, or from now on "neural network", is a framework for a set of machine learning algorithms inspired by the human brain. The neural network used in this thesis consists of different layers: two that are visible to humans and one or more hidden layers. The layers visible to humans are the first layer, where the data is fed in, and the last layer, which is the neural network's output. Each layer consists of an arbitrary number of neurons, connected to all neurons in the layers in front and behind; no connections between neurons within a layer exist. In figure 1 the above described neural network is visually illustrated.

Figure 1 – Illustration of a neural network with I input neurons, H hidden layers with A, B or C neurons, and O output neurons.

Each neuron contains a value. The value of a neuron is calculated by creating a weighted sum: each neuron value in the previous layer is multiplied by a weight, the products are summed, and a bias is added to the sum. The weighted sum is then mapped to an interval I, chosen by the author, by a function σ : R → I; the mapped value is the neuron's value. The function σ is called the "activation function". The activation function used in this thesis is the sigmoid function, which is a frequently used activation function in machine learning [8]:

σ(x) = 1 / (1 + e^(−x)).    (1)

Let Σ_bh be the weighted sum of N_bh and let n_bh be the value of N_bh in figure 1. The weighted sum with the bias is then calculated through

Σ_bh = ∑_{a=1}^{A} ω_ab · n_a1 + b_h    (2)

where ω_ab is the weight, b_h is the bias and n_a1 is a neuron's value in the previous layer. The index ab means from neuron a in the previous layer to neuron b in the current layer. The weights and biases are real values and can be thought of as a description of how a neuron depends on a neuron in the previous layer. The value of the neuron, n_bh, is then calculated by inserting the weighted sum, Σ_bh, into the activation function. If so, the value of the neuron is

n_bh = σ(Σ_bh).    (3)

In figure 2 the above described neuron is visually illustrated.

Figure 2 – Illustration of how the j'th neuron in layer h in a neural network is calculated. The value of a neuron in the previous layer is labeled n_{a,h−1}, where a ∈ {1, 2, ..., A} and h − 1 is the previous layer's index. From now on a will be used instead of a, h − 1. Each n_a is associated with a weight ω_a ∈ (−∞, ∞) and a bias b_h ∈ (−∞, ∞) which describe how this neuron depends on n_a. The value of the neuron of interest, n_{j,h}, is then calculated by inserting the weighted sum Σ = ∑_{a=1}^{A} ω_a · n_a + b_h into the activation function σ, hence n_{j,h} = σ(Σ). The value of a neuron is then transmitted to all neurons in the next layer.
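To make equations (1)–(3) concrete, the following is a minimal NumPy sketch of how a single neuron's value could be computed; the variable names and example numbers are illustrative choices, not taken from the thesis code.

import numpy as np

def sigmoid(x):
    """Sigmoid activation function, equation (1)."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron_value(prev_values, weights, bias):
    """Weighted sum of the previous layer plus a bias, equation (2),
    mapped through the activation function, equation (3)."""
    weighted_sum = np.dot(weights, prev_values) + bias
    return sigmoid(weighted_sum)

# Example: a neuron connected to three neurons in the previous layer.
prev_values = np.array([0.2, 0.7, 0.1])   # n_a, values in the previous layer
weights = np.array([0.5, -1.0, 2.0])      # one weight ω_a per incoming connection
bias = 0.1                                # b_h
print(neuron_value(prev_values, weights, bias))  # a value in (0, 1)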

In this thesis the inputs and outputs of the model can be interpreted as vectors, X = (X_1, ..., X_I) and α = (α_1, ..., α_O), where the components X_i and α_o are real numbers. The output layer is mapped to a vector T(α), where the components, T_o, describe the probability of an event. The largest component in T, T_max = max(T), is the neural network's conclusion. E.g. if the neural network can distinguish pictures of cats from pictures of dogs, the mapped output layer could be T = [T_1, T_2] = [P(Dog), P(Cat)]. If the output array for a picture is T = [0.52, 0.89], the largest component is T_max = 0.89; hence the neural network will conclude "this is a picture of a cat". Before the neural network can make a good conclusion it needs to be trained, or taught, by "looking" at input data with labels that contain the correct output layer vector

y = (y_1, ..., y_o, ..., y_O) where y_o = 1 if the o'th component is the wanted output, and y_o = 0 otherwise.    (4)

The correct output layer vector y, or the one-hot vector when y is defined as above, is then compared with the neural network's mapped output vector T for an error estimation; let us denote the comparison T_Compared. The problem is now to minimize f(all variables in the neural network) = T_Compared in order to minimize the error. In general this is not easily done and certain methods are used to complete the minimization problem, see section 3 for further reading. E.g. a picture of a cat is given to a neural network in the training phase, the correct output array is y = [0, 1] = [y_1, y_2] and the network's output is T = [T_1, T_2]. The comparison could be the sum of the components' squared differences, hence

T_Compared = ∑_{o=1}^{O} (T_o − y_o)².    (5)

The goal is now to minimize [9]

E[f(all variables in the neural network)].    (6)

To simplify the notation from now on: θ will be used to represent the model's parameters, x is the input, y is the one-hot vector and α will be the neural network function (taking x and θ as inputs). Hence the function to minimize (with respect to θ) can be written as:

E[f(all variables)] = E[f(α(x, θ), y)]. (7)
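As a tiny numerical illustration of the one-hot label in (4) and the squared-difference comparison in (5), consider the cat/dog example above; the NumPy code is only an illustration, not part of the thesis implementation.

import numpy as np

# One-hot label, equation (4): the picture shows a cat.
y = np.array([0.0, 1.0])      # [y_1, y_2] = [P(Dog), P(Cat)] wanted output
T = np.array([0.52, 0.89])    # the network's mapped output

# Squared-difference comparison, equation (5).
T_compared = np.sum((T - y) ** 2)
print(T_compared)             # (0.52 - 0)^2 + (0.89 - 1)^2 = 0.2825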

3 Training Neural Networks

3.1 Standard Gradient Descent

As previously stated, the machine learning problem of training a neural network is reduced to finding θ that solves

min_θ E[f(α(x, θ), y)]    (8)

where we specify a cost function f(α, y) and feed data points (x_i, y_i) from the training data to compute the optimal neural network parameters θ. Here α is the neural network function taking input vectors x and parameters θ. We can solve the problem approximately using the following algorithm:

• Make an initial guess θ_0 for θ
• Create a set of labeled training data {(x_i, y_i)}_{i=1}^{N_training}
• Choose a suitable learning rate ∆t > 0
• for m ∈ {1, 2, ..., M} do
    • Compute θ_{m+1} = θ_m − ∆t · (1/N_training) ∇_θ ∑_{j=1}^{N_training} f(α(x_j, θ_m), y_j)

which is typically referred to as "standard" gradient descent, or GD for short. Naturally, GD actually tries to minimize (1/N_training) ∑_{j=1}^{N_training} f(α(x_j, θ), y_j) [10].
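The following is a minimal sketch of the GD update rule above, applied to a toy least-squares problem rather than a neural network; the data, the cost and all parameter values are made up for illustration.

import numpy as np

# Toy problem: fit a line y = θ·x by minimizing the mean squared cost
# (1/N) ∑_j (θ·x_j − y_j)^2 with full-batch gradient descent.
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=100)
y_train = 3.0 * x_train + 0.1 * rng.normal(size=100)   # noisy "labels"

def gradient(theta, x, y):
    # d/dθ of the mean squared cost, computed over the whole training set.
    return np.mean(2.0 * (theta * x - y) * x)

theta = 0.0           # initial guess θ_0
dt = 0.5              # learning rate ∆t
for m in range(100):  # M iterations
    theta = theta - dt * gradient(theta, x_train, y_train)

print(theta)          # close to 3.0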

3.2 Stochastic Gradient Descent

The most computationally taxing part of the algorithm described above is calculating the gradient, which is done in an exact manner using the entire training data set. However, the gradient can instead be approximated using a smaller sample (mini-batch) of the data set, thus reducing the computation needed. The algorithm then becomes

• Make an initial guess θ_0 for θ
• Create a set of labeled training data {(x_i, y_i)}_{i=1}^{N_training}
• Choose a suitable batch size N ≤ N_training
• Choose a suitable learning rate ∆t > 0
• for m ∈ {1, 2, ..., M} do
    • Choose random indices {j_k}_{k=1}^{N} from {1, 2, ..., N_training}
    • Compute θ_{m+1} = θ_m − ∆t · (1/N) ∇_θ ∑_{k=1}^{N} f(α(x_{j_k}, θ_m), y_{j_k})

which we call stochastic gradient descent (SGD) [10]. If N = N_training we get standard GD instead. Naturally this method will be less exact due to its random nature, but will hopefully be faster in solving (8) by using less computation.

In conclusion, we must specify at least two parameters, the learning rate ∆t and the batch size N, in order to use SGD as an optimizer. In this most simple version we keep ∆t constant, but it could also be varied according to a schedule {∆t_i}_{i=1}^{M} in more advanced versions of the algorithm.
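Below is the same toy least-squares problem as in the GD sketch, now updated with mini-batches; again, the data and parameter values are illustrative and not the thesis's MNIST implementation. With N equal to the training set size each step reduces to standard GD.

import numpy as np

# Mini-batch SGD on the toy problem: fit y = θ·x from noisy data.
rng = np.random.default_rng(1)
x_train = rng.uniform(-1.0, 1.0, size=1000)
y_train = 3.0 * x_train + 0.1 * rng.normal(size=1000)

theta = 0.0           # initial guess θ_0
dt = 0.5              # learning rate ∆t
N = 32                # batch size N ≤ N_training
for m in range(500):  # M iterations
    # Choose a random mini-batch {j_k} from the training indices.
    j = rng.choice(len(x_train), size=N, replace=False)
    xb, yb = x_train[j], y_train[j]
    # Gradient of (1/N) ∑_k (θ·x_{j_k} − y_{j_k})^2 with respect to θ.
    g = np.mean(2.0 * (theta * xb - yb) * xb)
    theta = theta - dt * g

print(theta)          # noisy estimate close to 3.0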

3.3 Empirical and Expected Values

As mentioned in 3.1, the actual function attempted to be minimized in both GD and SGD is (1/N_training) ∑_{j=1}^{N_training} f(α(x_j, θ), y_j), which is based on a training sample from the distribution of the data x. This only yields an approximate solution that is dependent on the training data used, and it is therefore referred to as the empirical loss function. To make an estimation of the expected loss, E[f(α(x, θ), y)], we can feed the network data points not from the training set (i.e. from the test set), and thus not previously seen by the model. As both data sets are sampled from the same distribution, minimizing the empirical loss function is hoped to also minimize the expected loss function. However, if M is too high, training may fit the model to noise in the training data and thus no longer decrease the expected loss. This is called over-fitting.

4 Implementation

4.1 MNIST Problem and Model

To make a numerical analysis of the stochastic gradient descent optimizer we use a standard handwritten digit recognition problem and the MNIST data set. The data set consists of labeled 28×28 (pixel) images, picturing different handwritten digits 0-9. Each label comes in the form of a 1 × 10 one-hot label [11]. The goal is to create a neural network that achieves a high accuracy in identifying which digit is displayed in such a picture.

Figure 3 – An MNIST 4. Figure 4 – An MNIST 5.

Figure 5 – An MNIST 7. Figure 6 – Another MNIST 7.
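As an illustration of the data format described above, the sketch below loads a copy of MNIST through tf.keras and one-hot encodes the labels; the thesis's own code (see the appendix) obtained the data differently, so this is only a convenient stand-in.

import numpy as np
import tensorflow as tf

# Load a copy of the MNIST data set bundled with tf.keras.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape)      # (60000, 28, 28): 60000 grayscale 28x28 images
print(y_train[:3])        # integer labels, e.g. [5 0 4]

# Flatten each 28x28 image to a 784-vector and scale pixel values to [0, 1].
x_train = x_train.reshape(-1, 784).astype(np.float32) / 255.0

# Convert integer labels to 1x10 one-hot vectors.
y_train_onehot = tf.keras.utils.to_categorical(y_train, num_classes=10)
print(y_train_onehot[0])  # e.g. [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] for a "5"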

The basic model used in this report to solve the problem is a three-layer neural network. As each image has 28 × 28 = 784 pixels, we have 784 neurons in the input layer. Hidden layers increase the complexity of the model and allow for a higher accuracy, but also increase the time to train it; thus only one hidden layer with K = 500 neurons and a sigmoid activation function is used. The output layer consists of 10 output neurons (one for each digit). Note that the purpose here is not to achieve the maximal possible accuracy for the problem, but to observe the SGD optimization process. Furthermore, the output of the neural network is normalized by passing it through the softmax function. With z = (z_1, ..., z_n)^T, we define the softmax function S(z) as follows [12]:

S(z) = ( e^{z_1} / ∑_{i=1}^{n} e^{z_i}, ..., e^{z_n} / ∑_{i=1}^{n} e^{z_i} ).    (9)

This forces each component of the output into the interval 0 < S(z)_i < 1, as e^x > 0. It also satisfies ∑_{i=1}^{n} S(z)_i = 1. The cost function used is called the cross-entropy function [13]. In the discrete case the cross-entropy function H(y, z) is defined as follows (y = (y_1, ..., y_n)^T):

H(y, z) = − ∑_{i=1}^{n} y_i log(z_i)    (10)

In this particular problem, y is the label of the image and z is the output from the model through the softmax function. The value of H(y, z) is clearly defined for all outputs of the neural network passed through the softmax function, as these are all greater than 0 by (9). We also have n = 10, and the concrete problem to solve becomes

min_θ E[H(y, S(α(x, θ)))]    (11)

where x is the pixel data from the images, y is the corresponding label and α(x, θ) is the function of the neural network, using the same notation as in (8). This will be solved using the stochastic gradient descent algorithm described in section 3.2, with measurements taken according to the next section.
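A small NumPy sketch of the softmax (9) and cross-entropy (10) computations used in (11) is given below; the max-subtraction in the softmax is a common numerical-stability trick added here, not something stated in the text, and the example numbers are made up.

import numpy as np

def softmax(z):
    """Softmax, equation (9); subtracting max(z) avoids overflow
    without changing the result."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(y, z):
    """Discrete cross-entropy, equation (10); y is a one-hot label and
    z a probability vector, e.g. the softmax output."""
    return -np.sum(y * np.log(z))

# A raw network output (10 values, one per digit) and the label "3".
logits = np.array([0.1, -1.2, 0.3, 2.5, 0.0, -0.7, 0.2, 1.1, -0.3, 0.4])
y = np.zeros(10)
y[3] = 1.0

z = softmax(logits)
print(np.sum(z))            # 1.0, as guaranteed by (9)
print(cross_entropy(y, z))  # small when digit 3 gets high probability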

4.2 Code and measurements

The actual code used to train the model can be found in the appendix. In short, we have used a TensorFlow [14] implementation in Python, inspired by tutorials from [15]. The MNIST data set {(x_i, y_i)}_{i=1}^{N_tot} comes divided into a training set {(x_i, y_i)}_{i=1}^{N_split} and a test set {(x_i, y_i)}_{i=N_split+1}^{N_tot}, which we use for training and verification respectively. For each set of parameters the network was trained for 7500 iterations, recording the following measurements every 25th iteration m:

• Empirical cost, defined as (1/N_split) ∑_{j=1}^{N_split} H(y_j, α(x_j, θ_m))
• Expected cost, defined as (1/(N_tot − N_split)) ∑_{j=N_split+1}^{N_tot} H(y_j, α(x_j, θ_m))
• Empirical accuracy, the ratio of images correctly identified on the training data set
• Expected accuracy, the ratio of images correctly identified on the test data set

This way we obtained data on how the SGD optimization affects the performance on the training and test data sets while solving (11) over the training loop described in section 3.2. Also, as the SGD algorithm is random in nature, we train the network 10 times for each set of parameters in order to obtain mean and variance measurements for the values collected above. In addition to these, we also ran separate training sessions where we collected the time it takes to train the network for different batch sizes. The parameter sets tested were combinations of the following:

• Learning rates used are 0.01, 0.1, 0.5, 1, 2, 5, 10, 20
• Batch sizes used are 1, 2, 4, 8, 16, 32, 64, 128, 256, 512

for a total of 8 × 10 × 10 = 800 training sessions for varying batch sizes and learning rates, in addition to 10 × 30 = 300 training sessions for recording time data.
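For orientation, the following is a rough modern tf.keras sketch of the setup described above (784-500-10 network, sigmoid hidden layer, softmax output, cross-entropy loss, plain SGD with constant learning rate and batch size). The thesis used an older TensorFlow-1-style script (see the appendix), so the names, the epoch-based training call and the parameter values shown here are illustrative choices, not the original code.

import tensorflow as tf

# Data: flatten to 784-vectors, scale to [0, 1], one-hot encode the labels.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

# Model: one hidden layer with K = 500 sigmoid neurons, softmax output.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(500, activation="sigmoid", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1.0),  # constant ∆t = 1
    loss="categorical_crossentropy",                       # cross-entropy after softmax
    metrics=["accuracy"],
)

# batch_size corresponds to N; one epoch is a full pass over the training set.
model.fit(x_train, y_train, batch_size=128, epochs=5,
          validation_data=(x_test, y_test))
print(model.evaluate(x_test, y_test))  # [expected cost, expected accuracy]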

5 Parameter study

5.1 Time dependency on batch size

5.1.1 Method

In order to analyze how the batch size affects the time efficiency, the digit recognizing program was run with K = 500, M = 7500 and ∆t = 1 as fixed parameters while the batch size was varied; the batch sizes (N) used were 1, 2, 4, 8, 16, 32, 64, 128, 256 and 512.

The digit recognizing program was run with each of the previously chosen batch sizes. The start time was saved when the calculations for a new batch size started, and for every 25th iteration the elapsed time (t_now − t_start) was saved. This procedure was repeated 30 times and generated 30 time series for each batch size. The first two plots below show the averages of all 30 time series over the iterations and the corresponding standard deviation for each data point. A linear function was fitted to each time series, where the slope, k, represents the time per iteration for that time series. The third plot below shows the average of the 30 slopes for each batch size, with the standard deviation for each average slope.
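A minimal sketch of this timing procedure is given below, using time.perf_counter and a least-squares line fit with numpy.polyfit; the train_step function is a placeholder for one SGD update and is not part of the original code.

import time
import numpy as np

def train_step():
    """Placeholder for one SGD update of the network."""
    pass

# Record the elapsed time every 25th iteration ...
iterations, elapsed = [], []
t_start = time.perf_counter()
for m in range(1, 7501):
    train_step()
    if m % 25 == 0:
        iterations.append(m)
        elapsed.append(time.perf_counter() - t_start)

# ... and fit a line whose slope k estimates the time per iteration.
k, intercept = np.polyfit(iterations, elapsed, deg=1)
print(k)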

5.1.2 Result

Figure 7 – The mean of the 30 different time series of the elapsed time for the different batch sizes.

Figure 7 shows that the time the digit recognizing program needs to make m steps grows linearly with the number of iterations made. It is also, from this figure, possible to see that the time needed to perform m steps grows faster when the batch size is increased. This means that time per iteration is longer for larger batch sizes.

Figure 8 – The uncertainty in elapsed time for different batch sizes.

Figure 8 shows the uncertainty in time elapsed between the different time series. The general trend is that the uncertainty in time elapsed grows linearly with the number of iterations. This is logical: since there is a difference in time per iteration between the different time series, this difference grows larger as the number of iterations increases. Something surprising was that for N = 16, and for some other N, patterns can be seen in the data, which probably has something to do with the computer the digit recognizing program ran on.

Figure 9 – Time per iteration for different batch sizes.

Figure 9 illustrates how the time per iteration increases seemingly linearly for large batch sizes (N ≥ 8), though this does not seem to hold for the smaller batch sizes. Note that figure 9's x-axis is logarithmic, hence the exponential-looking curve corresponds to a linear relationship. When the error bars are compared with figure 8, one can see that, as expected, small error bars correlate with low slopes. The slope of this line is specific to the computer the training sessions were run on. A supercomputer can make 2 × 10^17 calculations per second [16], while home desktop computers can only make billions (10^9) of calculations per second; hence the supercomputer is expected to perform the same calculations faster than the home desktop computer [17].

5.2 Accuracy and learning rate

5.2.1 Method

We measured empirical and expected losses/accuracies in accordance with the model described in 4.2 for the learning rates ∆t = 0.01, 0.1, 0.5, 1, 2, 5, 10, 20. Keeping the batch size constant at N = 128, we ran 10 training sessions for each learning rate, taking measurements every 25th iteration over a total of 7500 iterations. As the optimization is stochastic in nature, the plots below show the averages over these ten sessions for the iterations measured, in order to better capture the converging behaviour.

5.2.2 Result

Figure 10 – Empirical loss for different learning rates, N = 128.

Figure 10 describes the empirical losses for the different learning rates. Unsurprisingly, the lower ∆t's exhibit slower rates of decline, with the decline speeding up until ∆t = 2, which seems to yield the lowest cost at m = 7500. For ∆t = 20 the effect is reversed, declining similarly to ∆t = 0.01 but with a higher variance, suggesting the algorithm has become unstable. This relationship between higher variance and higher learning rates seems to be prevalent for all tested ∆t. Note that figure 10's y-axis is logarithmic.

9 Figure 11 – Expected loss for different learning rates, N = 128.

In contrast, the expected losses described in figure 11 suggest convergence instead of a constant decline. Again, the lower ∆t's show a slower rate of decline up until ∆t = 10, where ∆t = 20 again exhibits behaviour similar to that of ∆t = 0.01, apart from a higher variance. This higher variance is also present for ∆t = 10, which initially converges more slowly than ∆t = 5 and seems more sporadic, but ultimately converges to a lower value. Note that figure 11's y-axis is logarithmic.

Figure 12 – Empirical accuracies for different learning rates, N = 128.

Figure 12 shows the empirical accuracies for the different learning rates. Clearly noticeable is the fact that the accuracy has not completely converged yet for ∆t = 0.01 which, in accordance with the previous figures, exhibits the slowest growth. The growth increases until ∆t = 5 after which it begins to decline and instead increase in variance. Similarly to previous figures, ∆t = 20 converges at a similar rate to that of ∆t = 0.1 but with higher variance. Most of the other models converge very close to 1, suggesting potential over-fitting at higher iterations.

Figure 13 – Expected accuracies for different learning rates, N = 128.

Qualitatively, the behavior in figure 13 is very similar to that in figure 12, apart from the accuracies converging to lower values for seemingly all learning rates but 0.01, 0.1 and 20. The accuracies also flatten out faster than in the previous figure, indicating that most of the later training iterations for these learning rates are unnecessary.

5.3 Accuracy and batch size

5.3.1 Method

We measured empirical and expected losses/accuracies in accordance with the model described in 4.2 for the batch sizes N = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512. Keeping the learning rate constant at ∆t = 1, we ran 10 training sessions for each batch size, taking measurements every 25th iteration over a total of 7500 iterations. As the optimization is stochastic in nature, the plots below show the averages over these ten sessions for the iterations measured, in order to better capture the converging behaviour.

5.3.2 Result

Figure 14 – Empirical loss for different batch sizes, ∆t = 1.

Figure 14 describes the empirical losses for the different batch sizes. Unsurprisingly, the loss exhibits faster decline rates with increased batch size up to N = 128. For iterations lower than 6000, a batch size of 512 seems to have the fastest rate of decline, but it is then overtaken by the lower batch sizes of 128 and 256, which is an anomaly considering the more general patterns of the plot. Note that figure 14's y-axis is logarithmic.

Figure 15 – Expected loss for different batch sizes, ∆t = 1.

Figure 15 describes the expected losses for the different batch sizes. In contrast to figure 14, which shows that the loss for N = 128, 256 or 512 declines fastest, figure 15 shows that N = 64 declines fastest after around 2000 iterations, which suggests that the model has been over-fitted. Note that figure 15's y-axis is logarithmic.

Figure 16 – Empirical accuracy for different batch sizes, ∆t = 1.

Figure 16 describes the empirical accuracy for the different batch sizes. Clearly noticeable is the fact that the accuracy has converged for all batch sizes. Another fact is that N = 128, 256 and 512 achieve an accuracy of 1, which could mean that the model has been over-fitted.

Figure 17 – Expected accuracy for different batch sizes, ∆t = 1.

Qualitatively, the behavior in figure 17 is very similar to that in figure 16, apart from the accuracies converging to lower values for seemingly all batch sizes but N = 1, 2, 4 and 8. For these lower N's the empirical and expected accuracies seem all but equal. For the remaining batch sizes, the accuracies flatten out faster than in the previous figure, indicating that most of the later training iterations for these batch sizes are unnecessary.

5.4 Performance

5.4.1 Method

Performance of the different parameter sets was computed in two ways. The final expected accuracy Acc_ex was taken as the average of the measurements over the last 500 iterations of the 10 training sessions (20 × 10 = 200 data points). The accuracy at iteration m, Acc_m, was said to have converged to this value if Acc_ex − Acc_m ≤ 0.005. Below we plot these measurements against batch size for the different learning rates.
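A sketch of how these two measures could be computed from the recorded accuracy curves is shown below; the array layout (10 sessions × 300 recorded points for 7500 iterations), the use of the session-averaged curve and the synthetic example data are assumptions for illustration, not the thesis's actual analysis script.

import numpy as np

def final_accuracy_and_convergence(acc, record_every=25, tol=0.005):
    """acc has shape (sessions, recordings): expected accuracy recorded
    every 25th iteration for each of the 10 training sessions."""
    acc_ex = np.mean(acc[:, -20:])               # mean of the last 500 iterations (20 x 10 points)
    mean_curve = np.mean(acc, axis=0)            # session-averaged accuracy curve
    converged = np.nonzero(acc_ex - mean_curve <= tol)[0]
    m_conv = (converged[0] + 1) * record_every   # first recorded iteration within tolerance
    return acc_ex, m_conv

# Example with synthetic accuracy curves (10 sessions, 300 recordings).
rng = np.random.default_rng(0)
curve = 1.0 - 0.5 * np.exp(-np.arange(1, 301) / 40.0)
acc = curve + 0.002 * rng.normal(size=(10, 300))
print(final_accuracy_and_convergence(acc))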

5.4.2 Result

Figure 18 – Final expected accuracy against batch size for different learning rates ∆t.

Figure 18 describes how the final accuracy depends on the parameter sets. Most learning rates, apart from ∆t = 0.01, seem to converge to higher values with an increasing batch size, with lower payoffs from very large and very small batch sizes. Overall, high batch sizes with a learning rate of 10 seem to yield the highest final expected accuracy. In addition, higher learning rates shift the plots rightwards, suggesting a larger batch size is required to achieve high accuracy. Note that figure 18's x-axis is logarithmic.

Figure 19 – Iterations until convergence against batch size for various learning rates ∆t.

Similarly, figure 19 describes the dependency of the convergence rate on the batch size for the different learning rates. The overall behaviour observed is a somewhat concave shape for most ∆t's, suggesting that very small and very large batch sizes yield faster convergence, iteration-wise, compared to medium-sized batches. Note that figure 19's x-axis is logarithmic.

6 Analysis and Conclusions

In this thesis the parameters of stochastic gradient descent have been analyzed with regard to their effect on the neural network optimization, in particular the learning rate ∆t and the batch size N. This was done by examining the empirical and expected costs and accuracies during the training processes, in addition to timing the training sessions.

With regard to time per iteration, the duration of each iteration was constant, with a small deviation between the different time series. The constant nature of the time per iteration remained during the whole training session. When analyzing the time elapsed for different batch sizes, one could conclude that the time per iteration increased linearly with increased batch size. The computer the training sessions were run on determines how the slope changes for different batch sizes. The linearly increasing slope for increased batch size was expected, as the algorithm which solves equation (8) scales linearly with the batch size (N) as described in section 3. Since this was shown empirically, one could also conclude that the TensorFlow gradient computing algorithm scales linearly with the batch size.

Moreover, a low learning rate unsurprisingly yielded a slower rate of convergence. By observing the algorithm in section 3.2, this is explained by the smaller iterative steps taken between each update of the network, which thus also yields higher stability of the optimization. In contrast, the highest learning rates consistently yielded lower final accuracies and decreased stability, which suggests there is an optimal learning rate for each batch size. As mentioned in section 3.2, more sophisticated algorithms make use of varied learning rates, which could combine the quick initial convergence of a high learning rate with the stability and precision of a lower one for higher iterations. It is also worth noting the discrepancy between the empirical and expected loss (figure 10 and figure 11), which increases with larger step sizes for the higher iterations. Thus higher learning rates seem more prone to over-fit the model, by continuing to optimize the empirical training function while having little effect on the expected one after the initial iterations.

Furthermore, the higher and more stable convergence of higher batch sizes illustrated in figure 14 and figure 15 can be explained by the more accurate approximations of the gradient.

Higher batch sizes also yield quicker and higher convergence, suggesting that using a high batch size is preferable. The lower batch sizes also converge quickly, but are unstable and yield lower final converged accuracies. Figure 19 implies that medium-sized batches converge the slowest (iteration-wise). This is possibly due to them being able to approximate the gradient well enough to yield an acceptable final accuracy, but still poorly enough that they cannot compete with the higher batch sizes. As the highest batch sizes clearly do not yield higher performance in a linear manner (figure 18, figure 19), while their computational cost does grow linearly (figure 9), it is doubtful that using the highest batch sizes is preferable to simply iterating the optimization loop more times. As described in the results, while the medium batch sizes may require more iterations to converge, it is not always double the amount; thus using a lower batch size might be preferable in this regard.

Overall, the different batch sizes and learning rates have different properties, making them useful for different situations. Using constant batch sizes and learning rates, as done in this thesis, is clearly not optimal, and these could be varied to increase the performance of the networks. Using a medium batch size with a higher learning rate initially, and then increasing the batch size while decreasing the learning rate later, should be a better solution given the results in the previous section.

Further studies could be made into such adaptive optimization variants of SGD, as a constant learning rate clearly is not optimal. Varying batch sizes during training sessions is also an interesting area. In this thesis, the size of the training set and test set was also held constant and taken directly from the MNIST database. Studies into how the ratio of training and test set sizes affects the expected and empirical losses and accuracies were not explored in this thesis and are therefore something that could be investigated further.

References

[1] Google Trends. (2019) Machine learning — Google Trends. [Online]. Available: https://trends.google.com/trends/explore?date=all&q=Machine%20learning [Accessed: 2019-04-24].
[2] V. Maini and S. Sabri, "Machine learning for humans," Medium, 2017. [Online]. Available: https://medium.com/machine-learning-for-humans [Accessed: 2019-04-24].
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, ch. 1, p. 2. [Online]. Available: http://www.deeplearningbook.org [Accessed: 2019-05-03].
[4] J. Vincent, "DeepMind's AI agents conquer human pros at StarCraft II," The Verge, Jan. 2019. [Online]. Available: https://www.theverge.com/2019/1/24/18196135/google-deepmind-ai-starcraft-2-victory [Accessed: 2019-04-24].
[5] Day One Staff. (2018, Mar.) "How our scientists are making Alexa smarter". [Online]. Available: https://blog.aboutamazon.com/amazon-ai/how-our-scientists-are-making-alexa-smarter [Accessed: 2019-04-24].
[6] J. McCarthy, "What is artificial intelligence?" p. 1, 2007. [Online]. Available: https://web.archive.org/web/20151118212404/http://www-formal.stanford.edu/jmc/whatisai/node1.html [Accessed: 2019-04-28].
[7] M. Esposito, K. Bheemaiah, and T. Tse, "What is machine learning?" The Conversation, 2017. [Online]. Available: http://theconversation.com/what-is-machine-learning-76759 [Accessed: 2019-04-28].
[8] M. A. Nielsen, Neural Networks and Deep Learning. Determination Press, 2015, ch. 1, Sigmoid neurons. [Online]. Available: http://neuralnetworksanddeeplearning.com/chap1.html#sigmoid_neurons [Accessed: 2019-05-02].
[9] M. A. Nielsen, Neural Networks and Deep Learning. Determination Press, 2015, ch. 1. [Online]. Available: http://neuralnetworksanddeeplearning.com/chap1.html [Accessed: 2019-04-28].
[10] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[11] Y. LeCun, C. Cortes, and C. Burges. The MNIST database of handwritten digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/ [Accessed: 2019-04-24].
[12] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, ch. 4, p. 79. [Online]. Available: http://www.deeplearningbook.org [Accessed: 2019-05-03].
[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, ch. 5, p. 130. [Online]. Available: http://www.deeplearningbook.org [Accessed: 2019-05-03].
[14] TensorFlow. [Online]. Available: https://www.tensorflow.org/ [Accessed: 2019-04-27].
[15] H. Kinsley ("Sentdex"). PythonProgramming.net. [Online]. Available: https://pythonprogramming.net [Accessed: 2019-04-27].
[16] J. Bryner, "This supercomputer can calculate in 1 second what would take you 6 billion years," Live Science, 2018. [Online]. Available: https://www.livescience.com/62827-fastest-supercomputer.html [Accessed: 2019-04-29].
[17] J. Strickland, "What is computing power?" HowStuffWorks.com, 2019. [Online]. Available: https://computer.howstuffworks.com/computing-power.htm [Accessed: 2019-04-29].

7 Appendix

7.1 Digit recognizing program

The following code was used to train the digit recognizing program and was, as previously stated, inspired by code examples from: https://pythonprogramming.net/tensorflow-neural-network-session-machine-learning-tutorial/
