Evaluation of Adaptive Random Forest Algorithm for Classification of Evolving Data Stream

DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

AYHAM ALKAZAZ
MARWA SAADO KHAROUKI

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Degree Project in Computer Science
Date: August 16, 2020
Supervisor: Erik Fransén
Examiner: Pawel Herman
Swedish title: Evaluering av Adaptive random forest algoritm för klassificering av utvecklande dataström

Abstract

In the era of big data, online machine learning algorithms have gained more and more traction from both academia and industry. In many scenarios, decisions and predictions have to be made in near real time as data is observed from continuously evolving data streams. Offline learning algorithms fall short in several ways when it comes to handling such problems. Apart from the costs and difficulties of storing these data streams in storage clusters, and the computational cost of retraining the models each time new data is observed in order to keep them up to date, these methods also lack built-in mechanisms for handling seasonality and non-stationary data streams. In such streams, the data distribution might change over time in what is called concept drift.

Adaptive random forests are well studied and effective for online learning and non-stationary data streams. By using bagging and drift detection mechanisms, adaptive random forests aim to improve the accuracy and performance of traditional random forests for online learning. In this study, we analyze the predictive classification accuracy of adaptive random forests when used in conjunction with different data streams and concept drifts. The data streams used to evaluate the accuracy are SEA and Agrawal.
Each data stream is tested in three different concept drift configurations: gradual, sudden, and recurring. The results obtained from the benchmarks show that adaptive random forests achieve better accuracy on SEA than on Agrawal, which could be explained by the dimensions and structure of the input attributes. Adaptive random forests showed no clear difference in accuracy between gradual and sudden concept drifts. However, recurring concept drifts had lower accuracy in the benchmarks than both the sudden and the gradual counterparts. This could be a result of the higher frequency of concept drifts within the same time period (number of observed samples).

Sammanfattning (Swedish summary)

In the era of big data, online machine learning algorithms have gained more and more traction from both academia and industry. In several scenarios, decisions and predictions must be made in near real time as data is observed from continuously evolving data streams. Offline learning algorithms fall short in various ways when handling such problems. Apart from the costs and difficulties of storing these data streams in a storage cluster, and the computational difficulties of retraining the model each time new data is observed in order to keep it up to date, these methods also lack built-in mechanisms for handling seasonal and non-stationary data streams. In such streams, the data distribution may change over time in what is called concept drift. Adaptive random forests are well-studied and effective models for online learning and for handling non-stationary data streams. By using bagging and mechanisms for detecting concept drift, adaptive random forests aim to improve the accuracy and performance of traditional random forests for online learning.

In this study, we analyze the predictive classification accuracy of adaptive random forests when used in conjunction with different data streams and concept drifts. The data streams used to evaluate performance are SEA and Agrawal. Each data stream is tested in three different concept drift configurations: gradual, sudden, and recurring. The results obtained from the experiments show that adaptive random forests have better accuracy on SEA than on Agrawal, which could be explained by the number of dimensions and the structure of the input attributes. Adaptive random forests showed no clear difference in accuracy between gradual and sudden concept drifts. However, recurring concept drifts had lower accuracy in the benchmarks than both the sudden and the gradual counterparts. This could be a result of the higher frequency of concept drifts within the same time period (number of observed samples).

Contents

Acronyms
1 Introduction
  1.1 Problem statement
  1.2 Scope
  1.3 Thesis outline
2 Background
  2.1 Offline and Online learning
  2.2 Data stream classification
  2.3 Ensemble methods
  2.4 Concept drifts
  2.5 Algorithms
    2.5.1 Decision Tree
    2.5.2 Hoeffding Tree
    2.5.3 Random forest
    2.5.4 Adaptive random forest
  2.6 Related Work
    2.6.1 Ensemble methods for data streams
    2.6.2 Ensemble methods with drift detectors
    2.6.3 Dynamic weighted majority
    2.6.4 Dynamic streaming random forest
3 Method
  3.1 Software frameworks
  3.2 Data streams
    3.2.1 SEA
    3.2.2 Agrawal
    3.2.3 Concept drift stream
  3.3 Experimental settings
  3.4 Training and Benchmarking
4 Results
  4.1 Gradual concept drift
  4.2 Sudden concept drift
  4.3 Recurring concept drift
  4.4 Statistical summary
5 Discussion
    5.0.1 Limitations
    5.0.2 Ethics and sustainability
    5.0.3 Future research
6 Conclusion
7 Appendix 1 - Plots
  7.1 SEA data stream
  7.2 AGRAWAL data stream
  7.3 Summary
8 Appendix 2 - Benchmark data
  8.1 All
  8.2 Summary

Acronyms

ADWIN  ADaptive WINdowing
ARF    Adaptive Random Forest
DSRF   Dynamic Streaming Random Forest
DT     Decision Tree
DWM    Dynamic Weighted Majority
HT     Hoeffding Tree
RF     Random Forest

Chapter 1 Introduction

In some real-world applications of machine learning, data arrives in continuous streams. In such situations, the entire training data set is not available at the time of designing the model. The underlying data distribution of these streams might also change over time in response to various events. There is, therefore, a need for machine learning models that can learn on the fly from continuously evolving data streams and adapt to changes in the underlying probability distribution of the observed data. This means that the model needs to learn and adapt as new observations become available. In these situations, online machine learning approaches are required.

Many learning algorithms use one model to form a final prediction. However, it is also possible to combine several models in some manner to make predictions for new examples. A method that combines multiple base models is known as an ensemble method. There are many well-known ensemble methods, such as bagging [1] and random forests [2], that aim to improve generalization performance and model accuracy. The adaptive random forests (ARF) algorithm is an adaptation of batch-based random forests that aims to handle evolving data streams [3].

In this work we study the behaviour of ARF in the challenging context of evolving data streams. More specifically, we study ARF's accuracy when used to classify observations from evolving data streams with different kinds of concept drifts, where the data distribution changes over time in different patterns: gradual, sudden, or recurring.
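The three drift patterns named above can be illustrated with a small sketch. It assumes, purely for illustration, that the transition from an old concept to a new one follows a sigmoid in the sample index. A narrow transition width then approximates a sudden drift and a wide one a gradual drift. A recurring drift is modelled as two concepts that alternate at a fixed period. The function names and the parameters `position`, `width`, and `period` are hypothetical and are not taken from the thesis or from any library.

```python
import math
import random


def new_concept_probability(t, position, width):
    """Probability that sample t is drawn from the new concept.

    Assumption for illustration: a sigmoid centred on `position`.
    A small `width` approximates a sudden drift, a large one a gradual drift.
    """
    x = -4.0 * (t - position) / width
    # Clamp the exponent so math.exp cannot overflow for very small widths.
    x = max(min(x, 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(x))


def drifting_concepts(n, position, width, seed=0):
    """Return n concept ids (0 = old concept, 1 = new concept)."""
    rng = random.Random(seed)
    return [
        1 if rng.random() < new_concept_probability(t, position, width) else 0
        for t in range(n)
    ]


def recurring_concept(t, period):
    """Concept id for a recurring drift that alternates every `period` samples."""
    return (t // period) % 2
```

With `width=1` almost every sample before `position` comes from the old concept and almost every sample after it from the new one, whereas a large `width` mixes the two concepts over a long transition window.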
1.1 Problem statement

Different online machine learning algorithms have different performance and accuracy characteristics when it comes to classifying non-stationary data streams. This study examines the behaviour of adaptive random forests when used to classify data streams with various forms of concept drift by comparing the changes in predictive accuracy between different data streams and concept drifts. The study aims to answer the following question: how do adaptive random forests perform when they are used to classify non-stationary data streams with sudden, gradual, and recurring concept drifts? Two synthetic data stream generators are used to evaluate the performance of the studied models: SEA and Agrawal.

1.2 Scope

In this study, we investigate the adaptive random forests (ARF) algorithm [3], an online ensemble method which deals with concept drifts using dynamic update methods and batch learning. There is an infinite number of possible data streams, concept drift setups, and model parameter permutations. The study therefore focuses on three types of concept drift: gradual, sudden, and recurring.

All data streams and algorithms used in this paper are available in the scikit-multiflow framework [4]. The drift adaptation strategy that we choose is ADaptive WINdowing (ADWIN) [5], with a threshold of 0.01 for warning detection and a threshold of 0.001 for drift detection. We also use 10 Hoeffding tree classifiers for the ensemble model. Other model parameters are set to the predetermined defaults chosen by the framework designers.

The performance evaluation method is prequential evaluation [6] (the interleaved test-then-train method), and the evaluation measure used to assess the performance is the classification accuracy. The algorithm is evaluated in the immediate setting, where labels are presented to the learner before the next instance arrives. Situations where labels arrive with delay are outside the scope of this work.
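The interleaved test-then-train procedure in the immediate setting can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark code used in the study: `MajorityClassLearner` is a hypothetical stand-in for ARF, `prequential_accuracy` is a hypothetical helper name, and scikit-multiflow is deliberately not used so that the example stays self-contained.

```python
from collections import Counter


class MajorityClassLearner:
    """Hypothetical stand-in for ARF: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # No prediction is possible before the first label has been observed.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test on each instance first, then train on it (immediate labels)."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:  # test step
            correct += 1
        total += 1
        learner.learn_one(x, y)      # train step, label available immediately
    return correct / total if total else 0.0
```

Because every instance is used for testing before it is used for training, the accuracy is always measured on data the model has not yet seen, and no separate hold-out set is needed.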
The scope of this thesis is also limited to two data streams, SEA and Agrawal, with one test configuration for each type of concept drift.

1.3 Thesis outline

This report is divided into six chapters. The first chapter introduces the subject and states the problem addressed by the study. The second chapter presents background information related to online machine learning and the ARF algorithm.
