DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2019

Time efficiency and mistake rates for online learning algorithms
A comparison between Online Gradient Descent and the Second-Order Perceptron algorithm and their performance on two different data sets

Paul Gorgis
Josef Holmgren Faghihi

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Abstract

This dissertation investigates the differences between two online learning algorithms, Online Gradient Descent (OGD) and the Second-Order Perceptron (SOP) algorithm, and how well they perform on different data sets in terms of mistake rate, time cost and number of updates. Studying different online learning algorithms and how they perform in different environments helps in understanding and developing new strategies for handling further online learning tasks. The study includes two data sets, Pima Indians Diabetes and Mushroom, together with the LIBOL library for testing. The results in this dissertation show that Online Gradient Descent performs better overall on the tested data sets. On the first data set, Online Gradient Descent recorded a notably lower mistake rate. On the second data set, although it recorded a slightly higher mistake rate, the algorithm was remarkably more time efficient than Second-Order Perceptron. Future work would include a wider range of tests with more, and different, data sets as well as other related algorithms, leading to better results and higher credibility.

Sammanfattning

This dissertation investigates the difference between two online learning algorithms, Online Gradient Descent and Second-Order Perceptron, and how they perform on different data sets with a focus on the proportion of misclassifications, time efficiency and the number of updates. Studying different online learning algorithms and how they work in different environments helps in understanding and developing new strategies for handling further online learning problems. The study includes two data sets, Pima Indians Diabetes and Mushroom, and uses the LIBOL library for testing. The results of this dissertation show that Online Gradient Descent performs better overall on the tested data sets. On the first data set, Online Gradient Descent showed a considerably lower proportion of misclassifications. On the second data set, OGD showed a slightly higher proportion of misclassifications, but at the same time the algorithm was remarkably more time efficient compared to Second-Order Perceptron. Future studies include broader testing with more, and different, data sets and other related algorithms. This leads to better results and increases credibility.

Authors

Paul Gorgis
Josef Holmgren Faghihi
Information and Communication Technology
KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden

Examiner

Örjan Ekberg
KTH Royal Institute of Technology

Supervisor

Erik Fransén
KTH Royal Institute of Technology

Contents

1 Introduction
  1.1 Research question
  1.2 Scope
2 Background
  2.1 Machine learning
    2.1.1 Supervised Learning
  2.2 Binary classification
  2.3 Online learning
  2.4 Online learning algorithms
    2.4.1 First-Order methods
    2.4.2 Second-Order methods
    2.4.3 Online gradient descent algorithm (OGD)
    2.4.4 Second Order Perceptron algorithm (SOP)
  2.5 LIBOL
  2.6 Related work
3 Method
  3.1 Data sets
    3.1.1 Pima Indians Diabetes
    3.1.2 Mushroom
  3.2 Approach and Evaluation
  3.3 Procedure
4 Result
  4.1 Diabetes data set
  4.2 Mushroom data set
5 Discussion
  5.1 Discussion of results
  5.2 Improvements and future work
6 Conclusion
References

1 Introduction

The ability to emulate human-like behaviour and intelligence based on the surrounding environment is today a widely discussed topic in many different fields, such as medicine and finance. Machine learning is a technology that consists of different computational algorithms designed to solve these problems and is considered part of a new era called big data [3]. A machine learning algorithm uses data as input to solve or achieve a task without being hard-coded to yield a desired output. These algorithms “learn” by adaptation through repetition, so-called training. When training an algorithm, sample inputs, together with expected outputs, are fed to the algorithm. The algorithm makes decisions or predictions and alters itself based on whether the predicted output was correct or not. The result is an algorithm well suited for a particular problem that can generalize to new, previously unseen, data. However, there are some problems concerning machine learning.

One problem with machine learning is what to do as new data becomes available. This can be solved using online learning. Online learning is about updating a machine learning predictor as new data becomes available [13], which is done in sequential order. One example of online learning usage is stock data, where new data is generated in real time. Online learning can also be used for spam mail detection, where senders and content change constantly, and the mail application must therefore decide correctly whether a received mail is spam or not. There are many different online learning techniques and sequential algorithms trying to solve this problem when new data becomes available; however, they all solve it in unique ways, some better than others. The opposite of online machine learning is called batch learning (or offline learning), which is training on the entire data set at once [1].

As data is becoming accessible in very large volumes, and in some cases changes significantly over time, online learning grows more relevant in the study of machine learning. Relying on a model trained on a data set all at once, or storing the data obtained, is in some systems not optimal. A model might need to adapt itself to recently received data or changes in behaviour to produce the best predictions or decisions, such as a user's recommendation feed on a shopping website. Furthermore, as there are different algorithms solving the same problem, there is interest in observing which of the investigated algorithms produces the best result. Therefore, the focused variables, time efficiency and mistake rate, will serve as determining factors to answer the research question. Time efficiency is important when processing huge amounts of data, which is common in online learning models, since it is desirable to process the data and adjust the model rapidly. Moreover, the larger the amount of data an algorithm is trained on, the better it becomes at predicting or classifying the given task, thus reducing the mistake rate. The mistake rate is the percentage of incorrect predictions or decisions the algorithm makes, and it is a crucial factor in many systems, since wrong predictions could potentially cause a lot of harm, for example in medicine, where the task could be to decide whether a patient suffers from a disease or not.

This study covers two online learning algorithms for binary classification. More specifically, this project investigates the difference in performance between two online learning algorithms, the Online Gradient Descent (OGD) algorithm and the Second-Order Perceptron (SOP) algorithm [5], by measuring time efficiency and mistake rate. The tests are done on one small data set called Pima Indians Diabetes [12] and one larger data set called Mushroom [2].

1.1 Research question

Which online learning algorithm, OGD (Online Gradient Descent) or SOP (Second-Order Perceptron), performs better in terms of time efficiency and mistake rate in binary classification on the two data sets Pima Indians Diabetes and Mushroom?

1.2 Scope

This study only investigates two online learning algorithms, OGD and SOP. The reason is to narrow down the result as well as to compare a first-order online learning algorithm to a second-order online learning algorithm, since they use calculations of different complexity for optimization. Also, the study contains only two data sets for testing, one smaller and one bigger, which makes it possible to see whether the results depend on data set size. Lastly, the variables of focus are mistake rate, time cost and number of updates, as they are considered relevant for drawing a conclusion and answering the research question properly.

2 Background

This chapter describes the topics required to understand the purpose of this study: the fundamentals of machine learning, supervised learning, binary classification, online learning, first- and second-order algorithms, the Online Gradient Descent algorithm, the Second-Order Perceptron algorithm and the LIBOL open-source software.

2.1 Machine learning

Machine learning is deemed a subset of the AI field and can be simply expressed as a system that learns from data. More precisely, machine learning uses one or a set of algorithms that iteratively read data (from a large data set) to produce an accurate model, or “machine” [8]. Whenever the algorithm reads an input of data, it makes a prediction of the expected output and, depending on the correct output, adjusts itself after each iteration to become skillful enough to accurately perform the given task. This process is called training, as the algorithm “learns” the given task by repeating the mentioned procedure for each sequence in the data set. The machine learning model is then created as a consequence of the machine learning algorithm having been trained with data [8]. This method of learning mirrors the way humans learn. The algorithm functions similarly to a human brain; it gathers information and processes what to do with the information. The more often the brain repeats or gathers new information, the better it becomes at performing a task, remembering information and expanding its knowledge [9].

Although all machine learning algorithms perform training on sets of data, the training data can be structured in different ways. Therefore, there are different machine learning techniques depending on the desired task the model should execute, such as supervised learning.

2.1.1 Supervised Learning

In supervised learning, each record in the training data consists of several input values and its correct output value [10]. This implies that the machine learning algorithm evaluates one training data sequence at a time, investigates the input values and, based on them, produces a guess for the output value. Since the expected output value is given in the sequence of data currently being evaluated, the algorithm checks whether the guess was correct or not. In either case, the algorithm adjusts itself after learning the correct output value, and keeps doing so as long as there is unevaluated data left; however, it adjusts differently depending on whether the algorithm's guess was precise or not. The supervised method of learning is, for example, used in classification problems, where the classification type can be either binary (two expected outputs, typically yes or no) or multiclass (several expected outputs) [10]. Both OGD and SOP are supervised learning algorithms. The opposite of supervised learning is known as unsupervised learning, which lacks the expected output value, causing the algorithm to independently determine where each sequence of data fits optimally, based on the input data properties and values.

2.2 Binary classification

Binary classification problems can be seen as problems with two possible outcomes. An example is determining whether a patient is sick from some kind of disease: the possible outcomes are either that the patient is sick and carries the disease, or that the patient is not sick and does not carry the disease. There are different binary classification methods, such as decision trees, support vector machines, neural networks, Bayesian networks and many more. Depending on factors such as the structure of the data sets, the dimensionality of the features and the noise in the data, one can choose the best method for a particular classification.

2.3 Online learning

Traditionally, machine learning has often been used in an offline learning environment, so-called batch learning. In batch learning, the learning algorithm trains a model by working through a data set all at once. When the whole data set has been processed, the algorithm deploys a model that can be used for the given task [7]. However, once the model has been deployed, it cannot be changed or updated, which is a drawback in the era of big data, as data changes and arrives in huge streams [7]. Online learning solves the problem by constantly updating the model as data arrives in real time. This is done sequentially. Any time there is input data available to train on, the online learning algorithm processes that data and makes a prediction or decision. Then, depending on whether the prediction was correct or not, the algorithm makes suitable adjustments to the current model in order to improve the model's ability to produce a higher proportion of correct decisions in the future. Lastly, the new model is output as the result of the adjustment made from the last input data. The described procedure of online machine learning is illustrated in figure 2.2, whereas figure 2.1 shows the procedure of batch learning in contrast to online learning.

Figure 2.1: Batch learning procedure

Figure 2.2: Online learning procedure

The difference between the figures is most notable in the output. Figure 2.1 outputs a static model, whereas in figure 2.2 the model is updated for each sequence of incoming input data. The online learning procedure is more beneficial with respect to time, due to its ability to constantly adapt the model when data changes, which is not possible in a batch learning procedure. For example, when a user browses a shopping website to purchase a shirt, recommendations of shirts are shown when the user revisits the website. After a time period, however, the same user might use the same page to browse jackets. Therefore, the recommendation feed must adapt to the user by showing jackets, yet possibly also show shirts again, since they were a product of interest further back in time. Moreover, data might be too large to fit in memory in some systems, since data can arrive in high volumes and at high rates [7]. The batch learning procedure needs to store the data to be able to train a model. This is not the case with the online learning procedure: since the training is done sequentially, the data is no longer needed after each sequence of input data has been processed, meaning that online learning is space efficient [4].

2.4 Online learning algorithms

2.4.1 First-Order methods

Algorithms classified as first-order learning algorithms use first-order gradient information when optimizing the objective function. Online Gradient Descent is a popular first-order method used for convex optimization [7].

2.4.2 Second-Order methods

Unlike the first-order algorithms, the second-order algorithms use both first-order and second-order gradient information for optimization [7]. This is primarily done to accelerate the convergence of the optimization, causing the algorithm to learn faster and thus make more accurate predictions. In exchange, the complexity of the computations rises, increasing the time needed to complete them compared to first-order algorithms.

2.4.3 Online gradient descent algorithm (OGD)

Figure 2.3: Pseudocode of the OGD algorithm [6]

One of the online learning algorithms investigated in this study is the Online Gradient Descent algorithm (OGD). It is classified as a first-order algorithm [6] and is applied in supervised learning environments.

In each iteration, the algorithm receives an incoming instance and tries to predict the correct class label based on the values provided in the instance. When the prediction is made, the true class label is revealed, and the algorithm calculates the suffered loss as shown in figure 2.3 on line 7. Based on the calculated loss, the algorithm updates itself as shown on lines 8-9 to improve future predictions.
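To make the update step concrete, the following is a minimal MATLAB sketch of an OGD-style learner, assuming a hinge loss and a step size eta/sqrt(t); the function name and these exact loss and step-size choices are illustrative assumptions, not LIBOL's exact implementation.

function [mistakes, updates] = ogd_sketch(X, Y, eta)
% Minimal sketch of Online Gradient Descent for binary classification,
% assuming hinge loss and an eta/sqrt(t) step size (illustrative choices).
% X: n-by-d matrix of instances; Y: n-by-1 vector of labels in {-1,+1}.
w = zeros(size(X, 2), 1);           % weight vector, updated online
mistakes = 0; updates = 0;
for t = 1:size(X, 1)
    x = X(t, :)';                   % receive incoming instance x_t
    f = w' * x;                     % prediction value
    y_hat = sign(f);
    if y_hat == 0, y_hat = 1; end   % break ties toward +1
    if y_hat ~= Y(t)                % true label revealed: count mistake
        mistakes = mistakes + 1;
    end
    if max(0, 1 - Y(t) * f) > 0     % hinge loss suffered on this round
        w = w + (eta / sqrt(t)) * Y(t) * x;   % gradient step on the loss
        updates = updates + 1;
    end
end
end

Counting updates only on rounds with non-zero loss mirrors how the number-of-updates factor is described later in this study: the model can be updated even on rounds where the prediction itself was correct.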

2.4.4 Second Order Perceptron algorithm (SOP)

Figure 2.4: Pseudocode of the SOP algorithm [6]

The other online learning algorithm investigated in this study is the Second-Order Perceptron algorithm (SOP). It is classified as a second-order algorithm [6] and is applied in supervised learning environments. The algorithm is based on, and intended as an improvement of, the classical Perceptron algorithm from the 1960s, which is classed as a first-order algorithm [7].

In each iteration, the algorithm receives an incoming instance and tries to predict the correct class label based on the values provided in the instance. When the prediction is made, the true class label is revealed, and the algorithm calculates the suffered loss as shown in figure 2.4 on line 7. Based on the calculated loss, the algorithm updates itself as shown on lines 8-10 to improve future predictions.
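For concreteness, the following is a minimal MATLAB sketch of the standard mistake-driven Second-Order Perceptron with regularization parameter a > 0; the function name is an assumption, and details may differ from LIBOL's implementation.

function mistakes = sop_sketch(X, Y, a)
% Minimal sketch of the Second-Order Perceptron (mistake-driven).
% X: n-by-d matrix of instances; Y: n-by-1 labels in {-1,+1}; a > 0.
d = size(X, 2);
M = a * eye(d);                     % a*I plus outer products of stored mistakes
v = zeros(d, 1);                    % running sum of y_t * x_t over mistake trials
mistakes = 0;
for t = 1:size(X, 1)
    x = X(t, :)';
    A = M + x * x';                 % tentatively include the current instance
    w = A \ v;                      % w = (a*I + S*S')^{-1} * v
    y_hat = sign(w' * x);
    if y_hat == 0, y_hat = 1; end   % break ties toward +1
    if y_hat ~= Y(t)                % update only when a mistake is made
        v = v + Y(t) * x;
        M = A;                      % keep x_t in the correlation matrix
        mistakes = mistakes + 1;
    end
end
end

The linear solve (effectively a matrix inversion) performed in every round is what makes SOP computationally heavier than OGD, which is consistent with the time costs reported in chapter 4.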

2.5 LIBOL

LIBOL is an open-source library made specifically for solving online learning tasks. It includes 16 different algorithms for binary classification and 13 algorithms for multiclass classification, covering both first- and second-order algorithms; OGD (Online Gradient Descent) and SOP (Second-Order Perceptron) are two of those included [5]. The LIBOL package is used in combination with MATLAB; however, C and C++ implementations of the basic functions do exist. A run of the program on a specific data set shows the results of three factors for each binary classification algorithm: the mistake rate, the number of updates and the time cost [5]. The LIBOL package also makes side-by-side comparison easy with graphs and tables.

2.6 Related work

Related work regarding online learning algorithms has been done in a study by researchers in Singapore [5]. The authors of the study are the creators of the LIBOL open-source library, and their main purpose with the study is to demonstrate the usage and functionality of the LIBOL library. The study shows a comparison between a variety of binary classification algorithms on two different data sets. Focusing only on the algorithms investigated here, their study shows that OGD performed better than SOP despite changes in features and number of records in the data sets. One reason for the algorithms chosen in this study is to investigate whether there is a case where SOP could be preferred over OGD, since an algorithm that is worse than another in all performance aspects, while being computationally more complicated, cannot be considered particularly useful.

3 Method

This chapter presents the data sets and methods used and motivates their choice. It also explains the use of the LIBOL software.

3.1 Data sets

In this study, two different data sets are used for comparison: one smaller data set, Pima Indians Diabetes, containing 768 records, and one bigger data set, Mushroom, containing 8124 records. The purpose of using two data sets that differ in size and number of features is to determine whether these attributes affect the performance of the algorithms. The data sets are presented in table 3.1 together with their characteristics.

Data set                Features   Size
Pima Indians Diabetes   8          768
Mushroom                22         8124

Table 3.1: The data sets used in this study, together with their size and number of features

3.1.1 Pima Indians Diabetes

The first data set is the Pima Indians Diabetes data set from the National Institute of Diabetes and Digestive and Kidney Diseases [12]. It contains information about females of Pima Indian heritage, at least 21 years old, with and without diabetes. The data set has one target variable (Outcome) and multiple predictor variables, such as the patient's BMI, age and insulin level. Altogether, this data set is considered small, with 768 records in total and 8 features.

3.1.2 Mushroom

The records in the Mushroom data set are drawn from the book The Audubon Society Field Guide to North American Mushrooms (1981) [2]. It includes samples of 23 different species of mushrooms, classified as edible, poisonous or unknown/not recommended. For this data set, however, the classes have been reduced to only edible and poisonous, with unknown/not recommended counted as poisonous, so that the data set can be used for binary classification. The data set contains 8124 records in total and 22 features, making it the larger of the two data sets used in this study.

3.2 Approach and Evaluation

Testing of the two algorithms on the data sets is executed using the LIBOL library together with MATLAB. With the combination of both, it is possible to run the experiment with the selected algorithms and obtain graphs, in the form of line charts, and numeric tables, which are useful for showcasing and discussing the results of the experiment.

The important factors to analyze from the graphs and tables are the mistake rate and the time efficiency of each algorithm. The mistake rate is crucial for any machine learning algorithm, as the goal is to minimize the rate of predicting the wrong outcome for any input data. The mistake rate is calculated by dividing the total number of wrong predictions by the total number of guesses. An example is the medical field, where a model trained with machine learning determines whether a patient suffers from diabetes or not. If the model does not determine the correct answer and the doctor relies on the answer of the model, the consequences could be harmful, such as wrong treatment and reduction of the patient's well-being. Therefore, the mistake rate is considered the most important factor in the evaluation of the results. The time efficiency is also considered, as the run time of each algorithm on a data set matters when huge amounts of data are concerned; for maximum time efficiency, the computational time of an algorithm should be as low as possible. The number of updates is also a factor reported in the tables and graphs. It is not considered as important as the other two, yet it is taken into consideration when analyzing the first two factors, as the number of updates an algorithm performs to improve itself could affect the mistake rate and/or the computational time of the algorithm.
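As a concrete, hypothetical instance of the mistake-rate calculation above: an algorithm that predicts 198 of the 768 diabetes records incorrectly has a mistake rate of 198/768 ≈ 0.258, or about 25.8%.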

3.3 Procedure

First, the LIBOL package was acquired from its GitHub page [11] and its content was transferred to an appropriate folder so it could be used from MATLAB. Then the search began for data sets of different size and number of features that suited binary classification with a supervised learning structure. The data of both sets were structured in a format called LIBSVM, which can be handled by LIBOL. When the data sets had been obtained, MATLAB was launched with the LIBOL folder as the working directory. The experiment used two functions from the LIBOL package to obtain the data shown in the results section: demo and run_experiment. Demo was used to acquire the tables shown in tables 4.1 and 4.2 and calculates the average result of the three factors after 20 runs. The parameters of the function are which algorithm to use, which data set to use and the classification type, in this case binary classification. Run_experiment is applied to all algorithms of the specified classification type, which was also binary classification here. The output was three graphs, one for each of the factors mistake rate, number of updates and time cost. The parameters were classification type and data set. When the graphs were obtained, data points of algorithms other than OGD and SOP were filtered out with MATLAB, to showcase only those relevant for the study, and added to the results section of this study.
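For illustration, a MATLAB session along the lines described above could look as follows; the argument order and the data set name are assumptions inferred from the description, not verified against the LIBOL documentation.

% Data files are in LIBSVM format, i.e. "<label> <index>:<value> ...",
% e.g. "+1 1:6 2:148 3:72" for one (hypothetical) diabetes record.
demo('bc', 'OGD', 'pima-indians-diabetes');      % average of 20 runs for one algorithm
demo('bc', 'SOP', 'pima-indians-diabetes');
run_experiment('bc', 'pima-indians-diabetes');   % graphs for all binary classifiers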

4 Result

This section presents the tables and graphs obtained from the experiment performed with the LIBOL package in MATLAB. Two evaluations were performed on each data set: one to generate the table with each algorithm's mean value after 20 runs for the factors mistake rate, number of updates and time cost, and one to plot line charts of the three factors for each algorithm, where the x-axis of each graph reflects how many samples, or records, of the data set have been processed during the run.

4.1 Diabetes data set

Algorithm   Mistake rate      Number of updates   CPU time (seconds)
OGD         0.2574 ± 0.0070   461.90 ± 9.02       0.0131 ± 0.0002
SOP         0.3103 ± 0.0144   238.30 ± 11.06      0.0131 ± 0.0014

Table 4.1: Average result of 20 runs on the Diabetes data set

Table 4.1 displays the average result for each algorithm, run twenty times on the diabetes data set. The OGD algorithm obtained a mistake rate of roughly 25.7%, which is 5.3 percentage points, or 17%, lower than that of the SOP algorithm. Updates of the algorithm (model) were, however, 94% more frequent with OGD than with SOP. The run time over the data set was nearly identical for the two algorithms.
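For reference, these percentages follow directly from table 4.1: (0.3103 - 0.2574)/0.3103 ≈ 17%, and 461.90/238.30 ≈ 1.94, i.e. about 94% more updates.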

Figure 4.1: Mistake rate for each algorithm on the Diabetes data set

On the diabetes data set, figure 4.1 shows that the mistake rate is lower for the OGD algorithm than for the SOP algorithm throughout the run of the experiment. The two algorithms' mistake rates are similar after the first 50 samples. However, as the number of samples increases, OGD's mistake rate decreases faster than SOP's, and they differ by about 5 percentage points at the end. Additionally, the OGD algorithm decreases its mistake rate monotonically throughout the run, whereas the SOP algorithm alternates between decreasing and increasing its mistake rate after about 200 samples.

Figure 4.2: Number of updates performed by each algorithm on the Diabetes data set

During the experiment on the diabetes data set, the OGD algorithm performs more updates on itself than the SOP algorithm as the number of samples increases. The lines of both algorithms in figure 4.2 appear to be linear.

Figure 4.3: Time spent by each algorithm to process through the Diabetes data set

In terms of time cost, the SOP algorithm requires slightly less time to perform the operations on the diabetes data set than the OGD algorithm. The lines of both algorithms in figure 4.3 appear to be linear.

4.2 Mushroom data set

Algorithm   Mistake rate      Number of updates   CPU time (seconds)
OGD         0.0123 ± 0.0016   328.25 ± 20.08      0.1405 ± 0.0010
SOP         0.0106 ± 0.0008   86.25 ± 6.17        0.3779 ± 0.0024

Table 4.2: Average result of 20 runs on the Mushroom data set

Table 4.2 displays the average result for each algorithm, run twenty times on the mushroom data set. The SOP algorithm obtained a mistake rate of roughly 1.06%, which is 14% lower than that of the OGD algorithm. Additionally, it performed 74% fewer updates to the algorithm (model). On the other hand, OGD required 63% less time to run through the samples than SOP.

Figure 4.4: Mistake rate for each algorithm on the Mushroom data set

On the mushroom data set, figure 4.4 shows that the mistake rate is initially lower for the OGD algorithm than for the SOP algorithm. After about 3000 samples have been processed, SOP's mistake rate becomes lower than OGD's and remains lower until the end of the experiment. Both algorithms decrease their mistake rate continuously throughout the run.

Figure 4.5: Number of updates performed by each algorithm on the Mushroom data set

During the experiment on the mushroom data set, the OGD algorithm performs close to four times as many updates on itself as the SOP algorithm as the number of samples increases. Unlike figure 4.2, the corresponding graph for the diabetes data set, figure 4.5 displays curved lines for the algorithms, showing a decrease in the rate of updates performed by both algorithms as the number of samples increases.

Figure 4.6: Time spent by each algorithm to process through the Mushroom data set

In terms of time cost, the OGD algorithm requires less time to perform the operations on the mushroom data set than the SOP algorithm. The lines of both algorithms in figure 4.6 appear to be linear.

5 Discussion

In this chapter, the results are discussed, followed by improvements and future work.

5.1 Discussion of results

As seen in the results for the Pima Indians Diabetes data set, Online Gradient Descent had a mistake rate of roughly 25.7%, compared to Second-Order Perceptron with a mistake rate of roughly 31%. Furthermore, OGD begins at a lower mistake rate and keeps improving further into the data samples, whereas SOP improves over the first 200 samples but then fluctuates between reducing and increasing its mistake rate, ending at a similar rate when all 768 records have been processed. Also, although the number of updates was higher for OGD, with a total of 461 updates compared to SOP's 238, the computational times are very similar, differing by only 0.0012 seconds in standard deviation. Therefore, for the Pima Indians Diabetes data set, OGD was the best performing algorithm, since its mistake rate was lower throughout the whole experiment and, after the whole data set had been processed, 17% lower than SOP's, while the algorithm was equally time efficient.

Regarding the second data set, Mushroom, the result looks similar for the first 800 samples compared to the Pima Indians Diabetes data set, with OGD maintaining a lower mistake rate. However, at around 3000 samples, the mistake rate for SOP improves and becomes better than OGD's. The overall mistake rate was 1.23% for OGD and 1.06% for SOP. SOP starts at a notably higher mistake rate but improves more rapidly further into the data set, thus having a steeper learning curve than OGD, as seen in figure 4.4, and thereby demonstrating the benefit of using a second-order algorithm. The number of updates was higher for OGD also for this data set, with a total of 328 updates; SOP had a total of 86 updates. This shows that SOP becomes better over time despite performing far fewer updates than OGD. The curves for the number of updates in figure 4.5 show that OGD continues to grow while SOP has a flatter development, meaning that it is closer to stabilizing and thus reaching optimal efficiency. However, the time cost was lower for OGD, with 0.1405 seconds of computational time compared to SOP's 0.3779 seconds, which means SOP was approximately 2.7 times slower. Thus SOP performed better regarding the mistake rate and the number of updates but was also notably slower. This also demonstrates the drawback of second-order algorithms: they are computationally complex. The small difference in mistake rate between the algorithms is minuscule compared to the difference in time cost between the two, meaning that OGD would be considered the preferable algorithm in this case as well, especially when the data flow is huge, due to being remarkably more time efficient than SOP while still maintaining a similar mistake rate.

Comparing this study to the related study referenced in section 2.6, both similarities and differences can be found. In section 2.6 it is mentioned that OGD performed better than SOP, both in terms of mistake rate and time cost, on both data sets used in the related study. The similarity is that OGD was considered to perform better in both studies, which strengthens the claim that OGD is a more effective online learning algorithm than SOP. The studies were also similar in how the factors differed: the smaller data sets in both studies show a large difference in mistake rate, and the larger sets show notably different time efficiency, in OGD's favour. As for differences between the studies, the most notable is that SOP achieved a lower mistake rate than OGD on the mushroom data set, which was not the case for the other set in this study nor for the data sets in the related study. This means there are cases or scenarios where SOP can achieve better performance than OGD, so the algorithm should not be completely disregarded. Also, on the diabetes data set the time efficiency was roughly the same for the two algorithms, whereas on the remaining three data sets OGD was clearly more time efficient. This means SOP can be rapid as well under the right circumstances, possibly when the number of features is low, since the diabetes data set had only 8 features compared to the other data sets, where the minimum number of features was 22.

An interesting point of discussion is the significant difference in the algorithms' mistake rates between the two data sets. On the Pima Indians Diabetes data set, the mistake rate ranges between 25% and 34% throughout the run according to figure 4.1, whereas on the Mushroom data set the range is 1-10% according to figure 4.4. Even if the latter data set is reduced to its first 800-1000 sequences of data, to correspond to the diabetes data set's 768 records, the range would be approximately 3.5-10%. Regardless, there is still a large gap between the two data sets. The reason for this gap could depend on two factors. The first is the number of features in each data set: Diabetes has 8 features whereas Mushroom has 22, as seen in table 3.1. A bigger set of features could help a machine learning algorithm decide the correct output of a data sequence, since more information is provided to the algorithm and, as a consequence, more confident decisions can be made, thereby lowering the mistake rate. The second factor is the balance of expected outputs in the data sets. In the Mushroom data set, the distribution between a mushroom being edible and being poisonous is nearly equal. In the Diabetes data set, however, the distribution of the 768 records is 500/268, meaning 268 patients have been diagnosed with diabetes and 500 have not. This uneven distribution could mean that the algorithms are trained mostly on data from the majority class, so when an incoming data sequence belongs to the minority class, they are more likely to guess incorrectly, being more used to the majority-class data. With the Mushroom data set, the even distribution lets the algorithms determine more surely whether a mushroom is edible or poisonous.

5.2 Improvements and future work

With more time and data, more accurate results and conclusions could be drawn regarding whether Online Gradient Descent or Second-Order Perceptron performs better on small or big data. Future work could include a wider range of tests with more data as well as other related algorithms, leading to better results and higher credibility. Furthermore, a deeper understanding of how the algorithms work at code level would improve the result and make it easier to draw a solid conclusion.

The data sets chosen should also be taken into consideration. The Mushroom data set and the Pima Indians Diabetes data set are quite different; for example, they do not share the same features. Although differently sized data sets were a deliberate choice, the number of features and other differences between the data sets were not considered. The diabetes data set not only had a lower number of features; it was also unbalanced, with 500 of its 768 records sharing the same outcome. Future work could include testing on more similar data with, for instance, the same number of features, or on more balanced data. Furthermore, one could test the algorithms again on edited data sets and look for other differences. This would help validate the result even more.

Another improvement for future work would be testing on more than two data sets of varying sizes. It is also possible to observe more variables than just time cost, number of updates and mistake rate.

6 Conclusion

This study investigated the performance of the Online Gradient Descent algorithm and the Second-Order Perceptron algorithm on two data sets: Pima Indians Diabetes and Mushroom. The results show that the OGD algorithm performed better on both data sets and would therefore be the better alternative. On the first data set, OGD maintained a notably lower mistake rate while being equally time efficient as the SOP algorithm. On the second data set, the mistake rates were close to similar for both algorithms, despite SOP making far fewer updates; however, OGD was significantly more time efficient and is therefore considered the better of the two algorithms. Future studies in the area could be improved by including a wider range of data sets, both with similar and different properties, and by measuring other factors than those used in this study.

References

[1] David Ziganto. An Introduction To Online Machine Learning. 2017. URL: https://dziganto.github.io/data%20science/online%20learning/python/scikit-learn/An-Introduction-To-Online-Machine-Learning/.

[2] Dua, Dheeru and Graff, Casey. UCI Machine Learning Repository. 2017. URL: https://archive.ics.uci.edu/ml/datasets/mushroom.

[3] El Naqa, Issam and Murphy, Martin J. “What is machine learning?” In: Machine Learning in Radiation Oncology. Springer, 2015, pp. 3–11.

[4] Felipe Almeida. Online Machine Learning: introduction and examples. 2016. URL: https://www.slideshare.net/queirozfcom/online-machine-learning-introduction-and-examples.

[5] Hoi, Steven C. H., Wang, Jialei, and Zhao, Peilin. “LIBOL: A Library for Online Learning Algorithms”. In: J. Mach. Learn. Res. 15.1 (Jan. 2014), pp. 495–499. ISSN: 1532-4435. URL: http://dl.acm.org/citation.cfm?id=2627435.2627450.

[6] Hoi, Steven C. H. et al. “LIBOL: A Library for Online Learning Algorithms”. Nanyang Technological University, 2012.

[7] Hoi, Steven C. H. et al. “Online Learning: A Comprehensive Survey”. In: CoRR abs/1802.02871 (2018). arXiv: 1802.02871. URL: http://arxiv.org/abs/1802.02871.

[8] J. Hurwitz and D. Kirsch. Machine Learning, IBM Limited Edition. John Wiley & Sons, 2018. URL: https://mscdss.ds.unipi.gr/wp-content/uploads/2018/02/Untitled-attachment-00056-2-1.pdf.

[9] Peter Jeffcock. What's the Difference Between AI, Machine Learning, and Deep Learning? 2018. URL: https://blogs.oracle.com/bigdata/difference-ai-machine-learning-deep-learning.

[10] Simeone, Osvaldo. “A Very Brief Introduction to Machine Learning With Applications to Communication Systems”. In: CoRR abs/1808.02342 (2018). arXiv: 1808.02342. URL: http://arxiv.org/abs/1808.02342.

[11] Steven C.H. Hoi, Jialei Wang, and Peilin Zhao. LIBOL. 2013. URL: https://github.com/LIBOL/LIBOL.

[12] Unknown. Pima Indians Diabetes. 2018. URL: https://www.kaggle.com/uciml/pima-indians-diabetes-database.

[13] Wikipedia contributors. Online machine learning — Wikipedia, The Free Encyclopedia. 2019. [Online; accessed 15-May-2019]. URL: https://en.wikipedia.org/w/index.php?title=Online_machine_learning&oldid=892983344.

TRITA-EECS-EX-2019:369
