DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2019

Time efficiency and mistake rates for online learning algorithms
A comparison between Online Gradient Descent and the Second-Order Perceptron algorithm and their performance on two different data sets

Paul Gorgis
Josef Holmgren Faghihi

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Abstract

This dissertation investigates the differences between two online learning algorithms, Online Gradient Descent (OGD) and the Second-Order Perceptron (SOP) algorithm, and how well they perform on different data sets in terms of mistake rate, time cost and number of updates. Studying different online learning algorithms and how they perform in different environments helps in understanding and developing new strategies for handling further online learning tasks. The study includes two data sets, Pima Indians Diabetes and Mushroom, together with the LIBOL library for testing. The results in this dissertation show that Online Gradient Descent performs better overall on the tested data sets. On the first data set, Online Gradient Descent recorded a notably lower mistake rate. On the second data set, although it recorded a slightly higher mistake rate, the algorithm was remarkably more time efficient than Second-Order Perceptron. Future work would include a wider range of tests with more, and different, data sets as well as other related algorithms, leading to better results and higher credibility.

Sammanfattning

This dissertation investigates the difference between two online learning algorithms, Online Gradient Descent and Second-Order Perceptron, and how they perform on different data sets with a focus on the proportion of misclassifications, time efficiency and the number of updates. Studying different online learning algorithms and how they work in different environments helps in understanding and developing new strategies for handling further online learning problems. The study includes two data sets, Pima Indians Diabetes and Mushroom, and uses the LIBOL library for testing. The results of this dissertation show that Online Gradient Descent performs better overall on the tested data sets. On the first data set, Online Gradient Descent showed a considerably lower proportion of misclassifications. On the second data set, OGD showed a slightly higher proportion of misclassifications, but at the same time the algorithm was remarkably more time efficient compared to Second-Order Perceptron. Future studies include broader testing with more, and different, data sets and other related algorithms. This leads to better results and increases credibility.

Authors

Paul Gorgis
Josef Holmgren Faghihi
Information and Communication Technology
KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden

Examiner

Örjan Ekberg
KTH Royal Institute of Technology

Supervisor

Erik Fransén
KTH Royal Institute of Technology

Contents

1 Introduction
  1.1 Research question
  1.2 Scope
2 Background
  2.1 Machine learning
    2.1.1 Supervised Learning
  2.2 Binary classification
  2.3 Online learning
  2.4 Online learning algorithms
    2.4.1 First-Order methods
    2.4.2 Second-Order methods
    2.4.3 Online gradient descent algorithm (OGD)
    2.4.4 Second Order Perceptron algorithm (SOP)
  2.5 LIBOL
  2.6 Related work
3 Method
  3.1 Data sets
    3.1.1 Pima Indians Diabetes
    3.1.2 Mushroom
  3.2 Approach and Evaluation
  3.3 Procedure
4 Result
  4.1 Diabetes data set
  4.2 Mushroom data set
5 Discussion
  5.1 Discussion of results
  5.2 Improvements and future work
6 Conclusion
References

1 Introduction

The ability to emulate human-like behaviour and intelligence based on the surrounding environment is today a widely discussed topic in many different fields, such as medicine and finance. Machine learning is a technology that consists of different computational algorithms designed to solve these problems and is considered part of a new era called big data [3]. A machine learning algorithm uses data as input to solve or achieve a task without being hard-coded to yield a desired output. These algorithms “learn” by adaptation through repetition, so-called training. When training an algorithm, sample inputs, together with expected outputs, are fed to the algorithm. The algorithm makes decisions or predictions and alters itself based on whether the predicted output was correct or not. The result is an algorithm well suited for a particular problem that can generalize to new, previously unseen, data. However, there are some problems concerning machine learning.

One problem with machine learning is what to do as new data becomes available. This can be solved using online learning. Online learning is about updating a machine learning predictor as new data becomes available [13], which is done in sequential order. One example of online learning usage is stock data, where new data is generated in real time. Online learning can also be used for spam mail detection, where senders and content change constantly, and the mail application must therefore decide correctly whether a received mail is spam or not. There are many different online learning techniques and sequential algorithms trying to solve this problem when new data becomes available; however, they all solve it in unique ways, some better than others. The opposite of online machine learning is called batch learning (or offline learning), which is training on the entire data set at once [1].

As data is becoming accessible in very large volumes, and in some cases changes significantly over time, online learning grows more relevant in the study of machine learning. Relying on a model trained on a data set all at once, or storing the data obtained, is in some systems not optimal. A model might need to adapt itself to recently received data or changes in behaviour to produce the best predictions or decisions, such as a user's recommendation feed on a shopping website. Furthermore, as there are different algorithms solving the same problem, there is interest in observing which of the investigated algorithms produces the best result. Therefore, the focused variables, time efficiency and mistake rate, will serve as determining factors to answer the research question. Time efficiency is important when processing huge amounts of data, which is common in online learning models, since it is desirable to process the data and adjust the model rapidly. Moreover, the larger the amount of data an algorithm is trained on, the better it becomes at predicting or classifying the given task, thus reducing the mistake rate. The mistake rate is the percentage of incorrect predictions or decisions the algorithm makes, and it is a crucial factor in many systems, since wrong predictions could potentially cause a lot of harm, for example in medicine, where the task could be to decide whether a patient suffers from a disease or not.

This study covers two online learning algorithms for binary classification. More specifically, this project investigates the difference in performance between two online learning algorithms, the Online Gradient Descent (OGD) algorithm and the Second-Order Perceptron (SOP) algorithm [5], by measuring time efficiency and mistake rate. The tests are done on one small data set called Pima Indians Diabetes [12] and one larger data set called Mushroom [2].

1.1 Research question

Which online learning algorithm, OGD (Online Gradient Descent) or SOP (Second-Order Perceptron), performs better in terms of time efficiency and mistake rate in binary classification on the two data sets Pima Indians Diabetes and Mushroom?

1.2 Scope

This study only investigates two online learning algorithms, OGD and SOP. The reason is to narrow down the result as well as to compare a first-order online learning algorithm to a second-order online learning algorithm, since they use calculations of different complexity for optimization. Also, the study contains only two data sets for testing, one smaller and one bigger, which makes it possible to see whether the results depend on data set size. Lastly, the variables of focus are mistake rate, time cost and number of updates, as they are considered relevant for drawing a conclusion and answering the research question properly.

2 Background

This chapter describes the topics required to understand the purpose of this study: the fundamentals of machine learning, supervised learning, binary classification, online learning, first- and second-order algorithms, the Online Gradient Descent algorithm, the Second-Order Perceptron algorithm and the LIBOL open-source software.

2.1 Machine learning

Machine learning is deemed a subset of the AI field and can be simply expressed as a system that learns from data. More precisely, machine learning uses one or a set of algorithms that iteratively read data (from a large data set) to produce an accurate model, or “machine” [8]. Whenever the algorithm reads an input of data, it makes a prediction of the expected output and, depending on the correct output, adjusts itself after each iteration to become skillful enough to accurately perform the given task. This process is called training, as the algorithm “learns” the given task by repeating the mentioned procedure for each sequence in the data set. The machine learning model is then created as a consequence of the machine learning algorithm having been trained with data [8]. This method of learning mirrors the way humans learn. The algorithm functions similarly to a human brain; it gathers information and processes what to do with the information. The more often the brain repeats or gathers new information, the better it becomes at performing a task, remembering information and expanding its knowledge [9].

Although all machine learning algorithms perform training on sets of data, the training data can be structured in different ways. Therefore, there are different machine learning techniques depending on the desired task the model should execute, such as supervised learning.

2.1.1 Supervised Learning

In supervised learning, each record in the training data consists of several input values and its correct output value [10]. This implies that the machine learning algorithm evaluates one training data sequence at a time, investigates the input values and, based on them, produces a guess for the output value. Since the expected output value is given in the sequence of data currently being evaluated, the algorithm checks whether the guess was correct or not. In either case, the algorithm adjusts itself after learning the correct output value, and keeps doing so as long as there is unevaluated data left; however, it adjusts differently depending on whether the algorithm's guess was precise or not. The supervised method of learning is, for example, used in classification problems, where the classification type can be either binary (two expected outputs, typically yes or no) or multiclass (several expected outputs) [10]. Both OGD and SOP are supervised learning algorithms. The opposite of supervised learning is known as unsupervised learning, which lacks the expected output value, causing the algorithm to independently determine where each sequence of data fits optimally, based on the input data properties and values.

2.2 Binary classification

Binary classification problems can be seen as problems with two possible outcomes. An example is determining whether a patient is sick from some kind of disease: the possible outcomes are either that the patient is sick and carries the disease, or that the patient is not sick and does not carry the disease. There are different binary classification methods, such as decision trees, support vector machines, neural networks, Bayesian networks and many more. Depending on factors such as the structure of the data sets, the dimensionality of the features and the noise in the data, one can choose the best method for a particular classification.

2.3 Online learning

Traditionally, machine learning has often been used in an offline learning environment, so-called batch learning. In batch learning, the learning algorithm trains a model by working through a data set all at once. When the whole data set has been processed, the algorithm deploys a model that can be used for the given task [7]. However, once the model has been deployed, it cannot be changed or updated, which is a drawback in the era of big data, as data changes and arrives in huge streams [7]. Online learning solves the problem by constantly updating the model as data arrives in real time. This is done sequentially. Any time there is input data available to train on, the online learning algorithm processes that data and makes a prediction or decision. Then, depending on whether the prediction was correct or not, the algorithm makes suitable adjustments to the current model in order to improve the model's ability to produce a higher proportion of correct decisions in the future. Lastly, the new model is output as the result of the adjustment made from the last input data. The described procedure of online machine learning is illustrated in figure 2.2, whereas figure 2.1 shows the procedure of batch learning in contrast to online learning.

Figure 2.1: Batch learning procedure

Figure 2.2: Online learning procedure

The difference between the figures is most notable in the output. Figure 2.1 outputs a static model, whereas in figure 2.2 the model is updated for each sequence of incoming input data. The online learning procedure is more beneficial with respect to time, due to its ability to constantly adapt the model when data changes, which is not possible in a batch learning procedure. For example, when a user browses a shopping website to purchase a shirt, recommendations of shirts are shown when the user revisits the website. After a time period, however, the same user might use the same page to browse jackets. Therefore, the recommendation feed must adapt to the user by showing jackets, yet possibly also show shirts again, since they were a product of interest further back in time. Moreover, data might be too large to fit in memory in some systems, since data can arrive in high volumes and at high rates [7]. The batch learning procedure needs to store the data to be able to train a model. This is not the case with the online learning procedure: since the training is done sequentially, the data is no longer needed after each sequence of input data has been processed, meaning that online learning is space efficient [4].

2.4 Online learning algorithms

2.4.1 First-Order methods

Algorithms classified as first-order learning algorithms use first-order gradient information when optimizing the objective function. Online Gradient Descent is a popular first-order method used for convex optimization [7].

2.4.2 Second-Order methods

Unlike the first-order algorithms, the second-order algorithms use both first-order and second-order gradient information for optimization [7]. This is primarily done to accelerate the convergence of the optimization, causing the algorithm to learn faster and thus make more accurate predictions. In exchange, the complexity of the computations rises, increasing the time needed to complete them compared to first-order algorithms.

2.4.3 Online gradient descent algorithm (OGD)

Figure 2.3: Pseudocode of the OGD algorithm [6]

One of the online learning algorithms investigated in this study is the Online Gradient Descent algorithm (OGD). It is classified as a first-order algorithm [6] and is applied in supervised learning environments.

In each iteration, the algorithm receives an incoming instance and tries to predict the correct class label based on the values provided in the instance. When the prediction is made, the true class label is revealed, and the algorithm calculates the suffered loss as shown in figure 2.3 on line 7. Based on the calculated loss, the algorithm updates itself as shown on lines 8-9 to improve future predictions.
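To make the update step concrete, the following is a minimal MATLAB sketch of an OGD-style learner, assuming a hinge loss and a step size eta/sqrt(t); the function name and these exact loss and step-size choices are illustrative assumptions, not LIBOL's exact implementation.

function [mistakes, updates] = ogd_sketch(X, Y, eta)
% Minimal sketch of Online Gradient Descent for binary classification,
% assuming hinge loss and an eta/sqrt(t) step size (illustrative choices).
% X: n-by-d matrix of instances; Y: n-by-1 vector of labels in {-1,+1}.
w = zeros(size(X, 2), 1);           % weight vector, updated online
mistakes = 0; updates = 0;
for t = 1:size(X, 1)
    x = X(t, :)';                   % receive incoming instance x_t
    f = w' * x;                     % prediction value
    y_hat = sign(f);
    if y_hat == 0, y_hat = 1; end   % break ties toward +1
    if y_hat ~= Y(t)                % true label revealed: count mistake
        mistakes = mistakes + 1;
    end
    if max(0, 1 - Y(t) * f) > 0     % hinge loss suffered on this round
        w = w + (eta / sqrt(t)) * Y(t) * x;   % gradient step on the loss
        updates = updates + 1;
    end
end
end

Counting updates only on rounds with non-zero loss mirrors how the number-of-updates factor is described later in this study: the model can be updated even on rounds where the prediction itself was correct.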

2.4.4 Second Order Perceptron algorithm (SOP)

Figure 2.4: Pseudocode of the SOP algorithm [6]

The other online learning algorithm investigated in this study is the Second-Order Perceptron algorithm (SOP). It is classified as a second-order algorithm [6] and is applied in supervised learning environments. The algorithm is based on, and intended as an improvement of, the classical Perceptron algorithm from the 1960s, which is classed as a first-order algorithm [7].

In each iteration, the algorithm receives an incoming instance and tries to predict the correct class label based on the values provided in the instance. When the prediction is made, the true class label is revealed, and the algorithm calculates the suffered loss as shown in figure 2.4 on line 7. Based on the calculated loss, the algorithm updates itself as shown on lines 8-10 to improve future predictions.
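For concreteness, the following is a minimal MATLAB sketch of the standard mistake-driven Second-Order Perceptron with regularization parameter a > 0; the function name is an assumption, and details may differ from LIBOL's implementation.

function mistakes = sop_sketch(X, Y, a)
% Minimal sketch of the Second-Order Perceptron (mistake-driven).
% X: n-by-d matrix of instances; Y: n-by-1 labels in {-1,+1}; a > 0.
d = size(X, 2);
M = a * eye(d);                     % a*I plus outer products of stored mistakes
v = zeros(d, 1);                    % running sum of y_t * x_t over mistake trials
mistakes = 0;
for t = 1:size(X, 1)
    x = X(t, :)';
    A = M + x * x';                 % tentatively include the current instance
    w = A \ v;                      % w = (a*I + S*S')^{-1} * v
    y_hat = sign(w' * x);
    if y_hat == 0, y_hat = 1; end   % break ties toward +1
    if y_hat ~= Y(t)                % update only when a mistake is made
        v = v + Y(t) * x;
        M = A;                      % keep x_t in the correlation matrix
        mistakes = mistakes + 1;
    end
end
end

The linear solve (effectively a matrix inversion) performed in every round is what makes SOP computationally heavier than OGD, which is consistent with the time costs reported in chapter 4.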

2.5 LIBOL

LIBOL is an open-source library made specifically for solving online learning tasks. It includes 16 different algorithms for binary classification and 13 algorithms for multiclass classification, covering both first- and second-order algorithms; OGD (Online Gradient Descent) and SOP (Second-Order Perceptron) are two of those included [5]. The LIBOL package is used in combination with MATLAB; however, C and C++ implementations of the basic functions do exist. A run of the program on a specific data set shows the results of three factors for each binary classification algorithm: the mistake rate, the number of updates and the time cost [5]. The LIBOL package also makes side-by-side comparison easy with graphs and tables.

2.6 Related work

Related work regarding online learning algorithms has been done in a study by researchers in Singapore [5]. The authors of the study are the creators of the LIBOL open-source library, and their main purpose with the study is to demonstrate the usage and functionality of the LIBOL library. The study shows a comparison between a variety of binary classification algorithms on two different data sets. Focusing only on the algorithms investigated here, their study shows that OGD performed better than SOP despite changes in features and number of records in the data sets. One reason for the algorithms chosen in this study is to investigate whether there is a case where SOP could be preferred over OGD, since an algorithm that is worse than another in all performance aspects, while being computationally more complicated, cannot be considered particularly useful.

3 Method

This chapter presents the data sets and methods used and motivates their choice. It also explains the use of the LIBOL software.

3.1 Data sets

In this study, two different data sets are used for comparison: one smaller data set, Pima Indians Diabetes, containing 768 records, and one bigger data set, Mushroom, containing 8124 records. The purpose of using two data sets that differ in size and number of features is to determine whether these attributes affect the performance of the algorithms. The data sets are presented in table 3.1 together with their characteristics.

Data set                Features   Size
Pima Indians Diabetes   8          768
Mushroom                22         8124

Table 3.1: The data sets used in this study, together with their size and number of features

3.1.1 Pima Indians Diabetes

The first data set is the Pima Indians Diabetes data set from the National Institute of Diabetes and Digestive and Kidney Diseases [12]. It contains information about females of Pima Indian heritage, at least 21 years old, with and without diabetes. The data set has one target variable (Outcome) and multiple predictor variables, such as the patient's BMI, age and insulin level. Altogether, this data set is considered small, with 768 records in total and 8 features.

3.1.2 Mushroom

The records in the Mushroom data set are drawn from the book The Audubon Society Field Guide to North American Mushrooms (1981) [2]. It includes samples of 23 different species of mushrooms, classified as edible, poisonous or unknown/not recommended. For this data set, however, the classes have been reduced to only edible and poisonous, with unknown/not recommended counted as poisonous, so that the data set can be used for binary classification. The data set contains 8124 records in total and 22 features, making it the larger of the two data sets used in this study.

3.2 Approach and Evaluation

Testing of the two algorithms on the data sets is executed using the LIBOL library together with MATLAB. With the combination of both, it is possible to run the experiment with the selected algorithms and obtain graphs, in the form of line charts, and numeric tables, which are useful for showcasing and discussing the results of the experiment.

The important factors to analyze from the graphs and tables are the mistake rate and the time efficiency of each algorithm. The mistake rate is crucial for any machine learning algorithm, as the goal is to minimize the rate of predicting the wrong outcome for any input data. The mistake rate is calculated by dividing the total number of wrong predictions by the total number of guesses. An example is the medical field, where a model trained with machine learning determines whether a patient suffers from diabetes or not. If the model does not determine the correct answer and the doctor relies on the answer of the model, the consequences could be harmful, such as wrong treatment and reduction of the patient's well-being. Therefore, the mistake rate is considered the most important factor in the evaluation of the results. The time efficiency is also considered, as the run time of each algorithm on a data set matters when huge amounts of data are concerned; for maximum time efficiency, the computational time of an algorithm should be as low as possible. The number of updates is also a factor reported in the tables and graphs. It is not considered as important as the other two, yet it is taken into consideration when analyzing the first two factors, as the number of updates an algorithm performs to improve itself could affect the mistake rate and/or the computational time of the algorithm.
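As a concrete, hypothetical instance of the mistake-rate calculation above: an algorithm that predicts 198 of the 768 diabetes records incorrectly has a mistake rate of 198/768 ≈ 0.258, or about 25.8%.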

3.3 Procedure

First, the LIBOL package was acquired from its GitHub page [11] and its content was transferred to an appropriate folder so it could be used from MATLAB. Then the search began for data sets of different size and number of features that suited binary classification with a supervised learning structure. The data of both sets were structured in a format called LIBSVM, which can be handled by LIBOL. When the data sets had been obtained, MATLAB was launched with the LIBOL folder as the working directory. The experiment used two functions from the LIBOL package to obtain the data shown in the results section: demo and run_experiment. Demo was used to acquire the tables shown in tables 4.1 and 4.2 and calculates the average result of the three factors after 20 runs. The parameters of the function are which algorithm to use, which data set to use and the classification type, in this case binary classification. Run_experiment is applied to all algorithms of the specified classification type, which was also binary classification here. The output was three graphs, one for each of the factors mistake rate, number of updates and time cost. The parameters were classification type and data set. When the graphs were obtained, data points of algorithms other than OGD and SOP were filtered out with MATLAB, to showcase only those relevant for the study, and added to the results section of this study.
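For illustration, a MATLAB session along the lines described above could look as follows; the argument order and the data set name are assumptions inferred from the description, not verified against the LIBOL documentation.

% Data files are in LIBSVM format, i.e. "<label> <index>:<value> ...",
% e.g. "+1 1:6 2:148 3:72" for one (hypothetical) diabetes record.
demo('bc', 'OGD', 'pima-indians-diabetes');      % average of 20 runs for one algorithm
demo('bc', 'SOP', 'pima-indians-diabetes');
run_experiment('bc', 'pima-indians-diabetes');   % graphs for all binary classifiers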

4 Result

This section presents the tables and graphs obtained from the experiment performed with the LIBOL package in MATLAB. Two evaluations were performed on each data set: one to generate the table with each algorithm's mean value after 20 runs for the factors mistake rate, number of updates and time cost, and one to plot line charts of the three factors for each algorithm, where the x-axis of each graph reflects how many samples, or records, of the data set have been processed during the run.

4.1 Diabetes data set

Algorithm   Mistake rate      Number of updates   CPU time (seconds)
OGD         0.2574 ± 0.0070   461.90 ± 9.02       0.0131 ± 0.0002
SOP         0.3103 ± 0.0144   238.30 ± 11.06      0.0131 ± 0.0014

Table 4.1: Average result of 20 runs on the Diabetes data set

Table 4.1 displays the average result for each algorithm, run twenty times on the diabetes data set. The OGD algorithm obtained a mistake rate of roughly 25.7%, which is 5.3 percentage points, or 17%, lower than that of the SOP algorithm. Updates of the algorithm (model) were, however, 94% more frequent with OGD than with SOP. The run time over the data set was nearly identical for the two algorithms.
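For reference, these percentages follow directly from table 4.1: (0.3103 - 0.2574)/0.3103 ≈ 17%, and 461.90/238.30 ≈ 1.94, i.e. about 94% more updates.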

Figure 4.1: Mistake rate for each algorithm on the Diabetes data set

On the diabetes data set, figure 4.1 shows that the mistake rate is lower for the OGD algorithm than for the SOP algorithm throughout the run of the experiment. The two algorithms' mistake rates are similar after the first 50 samples. However, as the number of samples increases, OGD's mistake rate decreases faster than SOP's, and they differ by about 5 percentage points at the end. Additionally, the OGD algorithm decreases its mistake rate monotonically throughout the run, whereas the SOP algorithm alternates between decreasing and increasing its mistake rate after about 200 samples.

Figure 4.2: Number of updates performed by each algorithm on the Diabetes data set

During the experiment on the diabetes data set, the OGD algorithm performs more updates on itself than the SOP algorithm as the number of samples increases. The lines of both algorithms in figure 4.2 appear to be linear.

Figure 4.3: Time spent by each algorithm to process through the Diabetes data set

In terms of time cost, the SOP algorithm requires slightly less time to perform the operations on the diabetes data set than the OGD algorithm. The lines of both algorithms in figure 4.3 appear to be linear.

4.2 Mushroom data set

Algorithm   Mistake rate      Number of updates   CPU time (seconds)
OGD         0.0123 ± 0.0016   328.25 ± 20.08      0.1405 ± 0.0010
SOP         0.0106 ± 0.0008   86.25 ± 6.17        0.3779 ± 0.0024

Table 4.2: Average result of 20 runs on the Mushroom data set

Table 4.2 displays the average result for each algorithm, run twenty times on the mushroom data set. The SOP algorithm obtained a mistake rate of roughly 1.06%, which is 14% lower than that of the OGD algorithm. Additionally, it performed 74% fewer updates to the algorithm (model). On the other hand, OGD required 63% less time to run through the samples than SOP.

Figure 4.4: Mistake rate for each algorithm on the Mushroom data set

On the mushroom data set, figure 4.4 shows that the mistake rate is initially lower for the OGD algorithm than for the SOP algorithm. After about 3000 samples have been processed, SOP's mistake rate becomes lower than OGD's and remains lower until the end of the experiment. Both algorithms decrease their mistake rate continuously throughout the run.

Figure 4.5: Number of updates performed by each algorithm on the Mushroom data set

During the experiment on the mushroom data set, the OGD algorithm performs close to four times as many updates on itself as the SOP algorithm as the number of samples increases. Unlike figure 4.2, the corresponding graph for the diabetes data set, figure 4.5 displays curved lines for the algorithms, showing a decrease in the rate of updates performed by both algorithms as the number of samples increases.

Figure 4.6: Time spent by each algorithm to process through the Mushroom data set

In terms of time cost, the OGD algorithm requires less time to perform the operations on the mushroom data set than the SOP algorithm. The lines of both algorithms in figure 4.6 appear to be linear.

5 Discussion

In this chapter, the results are discussed, followed by improvements and future work.

5.1 Discussion of results

As seen in the results for the Pima Indians Diabetes data set, Online Gradient Descent had a mistake rate of roughly 25.7%, compared to Second-Order Perceptron with a mistake rate of roughly 31%. Furthermore, OGD begins at a lower mistake rate and keeps improving further into the data samples, whereas SOP improves over the first 200 samples but then fluctuates between reducing and increasing its mistake rate, ending at a similar rate when all 768 records have been processed. Also, although the number of updates was higher for OGD, with a total of 461 updates compared to SOP's 238, the computational times are very similar, differing by only 0.0012 seconds in standard deviation. Therefore, for the Pima Indians Diabetes data set, OGD was the best performing algorithm, since its mistake rate was lower throughout the whole experiment and, after the whole data set had been processed, 17% lower than SOP's, while the algorithm was equally time efficient.

Regarding the second data set, Mushroom, the result looks similar for the first 800 samples compared to the Pima Indians Diabetes data set, with OGD maintaining a lower mistake rate. However, at around 3000 samples, the mistake rate for SOP improves and becomes better than OGD's. The overall mistake rate was 1.23% for OGD and 1.06% for SOP. SOP starts at a notably higher mistake rate but improves more rapidly further into the data set, thus having a steeper learning curve than OGD, as seen in figure 4.4, and thereby demonstrating the benefit of using a second-order algorithm. The number of updates was higher for OGD also for this data set, with a total of 328 updates; SOP had a total of 86 updates. This shows that SOP becomes better over time despite performing far fewer updates than OGD. The curves for the number of updates in figure 4.5 show that OGD continues to grow while SOP has a flatter development, meaning that it is closer to stabilizing and thus reaching optimal efficiency. However, the time cost was lower for OGD, with 0.1405 seconds of computational time compared to SOP's 0.3779 seconds, which means SOP was approximately 2.7 times slower. Thus SOP performed better regarding the mistake rate and the number of updates but was also notably slower. This also demonstrates the drawback of second-order algorithms: they are computationally complex. The small difference in mistake rate between the algorithms is minuscule compared to the difference in time cost between the two, meaning that OGD would be considered the preferable algorithm in this case as well, especially when the data flow is huge, due to being remarkably more time efficient than SOP while still maintaining a similar mistake rate.

Comparing this study to the related study referenced in section 2.6, both similarities and differences can be found. In section 2.6 it is mentioned that OGD performed better than SOP, both in terms of mistake rate and time cost, on both data sets used in the related study. The similarity is that OGD was considered to perform better in both studies, which strengthens the claim that OGD is a more effective online learning algorithm than SOP. The studies were also similar in how the factors differed: the smaller data sets in both studies show a large difference in mistake rate, and the larger sets show notably different time efficiency, in OGD's favour. As for differences between the studies, the most notable is that SOP achieved a lower mistake rate than OGD on the mushroom data set, which was not the case for the other set in this study nor for the data sets in the related study. This means there are cases or scenarios where SOP can achieve better performance than OGD, so the algorithm should not be completely disregarded. Also, on the diabetes data set the time efficiency was roughly the same for the two algorithms, whereas on the remaining three data sets OGD was clearly more time efficient. This means SOP can be rapid as well under the right circumstances, possibly when the number of features is low, since the diabetes data set had only 8 features compared to the other data sets, where the minimum number of features was 22.

An interesting point of discussion is the significant difference in the algorithms' mistake rates between the two data sets. On the Pima Indians Diabetes data set, the mistake rate ranges between 25% and 34% throughout the run according to figure 4.1, whereas on the Mushroom data set the range is 1-10% according to figure 4.4. Even if the latter data set is reduced to its first 800-1000 sequences of data, to correspond to the diabetes data set's 768 records, the range would be approximately 3.5-10%. Regardless, there is still a large gap between the two data sets. The reason for this gap could depend on two factors. The first is the number of features in each data set: Diabetes has 8 features whereas Mushroom has 22, as seen in table 3.1. A bigger set of features could help a machine learning algorithm decide the correct output of a data sequence, since more information is provided to the algorithm and, as a consequence, more confident decisions can be made, thereby lowering the mistake rate. The second factor is the balance of expected outputs in the data sets. In the Mushroom data set, the distribution between a mushroom being edible and being poisonous is nearly equal. In the Diabetes data set, however, the distribution of the 768 records is 500/268, meaning 268 patients have been diagnosed with diabetes and 500 have not. This uneven distribution could mean that the algorithms are trained mostly on data from the majority class, so when an incoming data sequence belongs to the minority class, they are more likely to guess incorrectly, being more used to the majority-class data. With the Mushroom data set, the even distribution lets the algorithms determine more surely whether a mushroom is edible or poisonous.

5.2 Improvements and future work

With more time and data, more accurate results and conclusions could be drawn regarding whether Online Gradient Descent or Second-Order Perceptron performs better on small or big data. Future work could include a wider range of tests with more data as well as other related algorithms, leading to better results and higher credibility. Furthermore, a deeper understanding of how the algorithms work at code level would improve the result and make it easier to draw a solid conclusion.

The data sets chosen should also be taken into consideration. The Mushroom data set and the Pima Indians Diabetes data set are quite different; for example, they do not share the same features. Although differently sized data sets were a deliberate choice, the number of features and other differences between the data sets were not considered. The diabetes data set not only had a lower number of features; it was also unbalanced, with 500 of its 768 records sharing the same outcome. Future work could include testing on more similar data with, for instance, the same number of features, or on more balanced data. Furthermore, one could test the algorithms again on edited data sets and look for other differences. This would help validate the result even more.

Another improvement for future work would be testing on more than two data sets of varying sizes. It is also possible to observe more variables than just time cost, number of updates and mistake rate.

6 Conclusion

This study investigated the performance of the Online Gradient Descent algorithm and the Second-Order Perceptron algorithm on two data sets: Pima Indians Diabetes and Mushroom. The results show that the OGD algorithm performed better on both data sets and would therefore be the better alternative. On the first data set, OGD maintained a notably lower mistake rate while being equally time efficient as the SOP algorithm. On the second data set, the mistake rates were close to similar for both algorithms, despite SOP making far fewer updates; however, OGD was significantly more time efficient and is therefore considered the better of the two algorithms. Future studies in the area could be improved by including a wider range of data sets, both with similar and different properties, and by measuring other factors than those used in this study.

References

[1] David Ziganto. An Introduction To Online Machine Learning. 2017. URL: https://dziganto.github.io/data%20science/online%20learning/python/scikit-learn/An-Introduction-To-Online-Machine-Learning/.

[2] Dua, Dheeru and Graff, Casey. UCI Machine Learning Repository. 2017. URL: https://archive.ics.uci.edu/ml/datasets/mushroom.

[3] El Naqa, Issam and Murphy, Martin J. “What is machine learning?” In: Machine Learning in Radiation Oncology. Springer, 2015, pp. 3–11.

[4] Felipe Almeida. Online Machine Learning: introduction and examples. 2016. URL: https://www.slideshare.net/queirozfcom/online-machine-learning-introduction-and-examples.

[5] Hoi, Steven C. H., Wang, Jialei, and Zhao, Peilin. “LIBOL: A Library for Online Learning Algorithms”. In: J. Mach. Learn. Res. 15.1 (Jan. 2014), pp. 495–499. ISSN: 1532-4435. URL: http://dl.acm.org/citation.cfm?id=2627435.2627450.

[6] Hoi, Steven C. H. et al. “LIBOL: A Library for Online Learning Algorithms”. Nanyang Technological University, 2012.

[7] Hoi, Steven C. H. et al. “Online Learning: A Comprehensive Survey”. In: CoRR abs/1802.02871 (2018). arXiv: 1802.02871. URL: http://arxiv.org/abs/1802.02871.

[8] J. Hurwitz and D. Kirsch. Machine Learning, IBM Limited Edition. John Wiley & Sons, 2018. URL: https://mscdss.ds.unipi.gr/wp-content/uploads/2018/02/Untitled-attachment-00056-2-1.pdf.

[9] Peter Jeffcock. What's the Difference Between AI, Machine Learning, and Deep Learning? 2018. URL: https://blogs.oracle.com/bigdata/difference-ai-machine-learning-deep-learning.

[10] Simeone, Osvaldo. “A Very Brief Introduction to Machine Learning With Applications to Communication Systems”. In: CoRR abs/1808.02342 (2018). arXiv: 1808.02342. URL: http://arxiv.org/abs/1808.02342.

[11] Steven C.H. Hoi, Jialei Wang, and Peilin Zhao. LIBOL. 2013. URL: https://github.com/LIBOL/LIBOL.

[12] Unknown. Pima Indians Diabetes. 2018. URL: https://www.kaggle.com/uciml/pima-indians-diabetes-database.

[13] Wikipedia contributors. Online machine learning — Wikipedia, The Free Encyclopedia. 2019. [Online; accessed 15-May-2019]. URL: https://en.wikipedia.org/w/index.php?title=Online_machine_learning&oldid=892983344.

TRITA-EECS-EX-2019:369
