Evaluation of Adaptive Random Forest Algorithm for Classification of Evolving Data Stream
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

AYHAM ALKAZAZ & MARWA SAADO KHAROUKI
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Degree Project in Computer Science
Date: August 16, 2020
Supervisor: Erik Fransén
Examiner: Pawel Herman
Swedish title: Evaluering av Adaptive random forest algoritm för klassificering av utvecklande dataström

Abstract

In the era of big data, online machine learning algorithms have gained increasing traction in both academia and industry. In many scenarios, decisions and predictions have to be made in near real time as data is observed from continuously evolving data streams. Offline learning algorithms fall short in several ways when handling such problems. Apart from the cost and difficulty of storing these data streams in storage clusters, and the computational difficulty of retraining the models each time new data is observed in order to keep them up to date, these methods also lack built-in mechanisms for handling seasonality and non-stationary data streams. In such streams, the data distribution might change over time in what is called concept drift. Adaptive random forests are well studied and effective for online learning and non-stationary data streams. By using bagging and drift detection mechanisms, adaptive random forests aim to improve the accuracy and performance of traditional random forests for online learning. In this study, we analyze the predictive classification accuracy of adaptive random forests when used in conjunction with different data streams and concept drifts. The data streams used to evaluate the accuracy are SEA and Agrawal.
Each data stream is tested in three different concept drift configurations: gradual, sudden, and recurring. The results obtained from the performed benchmarks show that adaptive random forests have better accuracy when handling SEA than Agrawal, which could be explained by the dimensions and structure of the input attributes. Adaptive random forests showed no clear difference in accuracy between gradual and sudden concept drifts. However, recurring concept drifts had lower accuracy in the benchmarks than both their sudden and gradual counterparts. This could be a result of the higher frequency of concept drifts within the same time period (number of observed samples).

Summary (Sammanfattning)

In the era of big data, online machine learning algorithms have gained more and more traction in both academia and industry. In many scenarios, decisions and predictions must be made in near real time as data is observed from continuously evolving data streams. Offline learning algorithms fall short in various ways when handling such problems. Besides the cost and difficulty of storing these data streams in a storage cluster, and the computational difficulty of retraining the model each time new data is observed to keep it up to date, these methods also lack built-in mechanisms for handling seasonal and non-stationary data streams. In such streams, the data distribution can change over time in what is called concept drift. Adaptive random forests are well-studied and effective models for online learning and for handling non-stationary data streams. By using mechanisms for detecting concept drift together with bagging, adaptive random forests aim to improve the accuracy and performance of traditional random forests for online learning. In this study, we analyze the predictive classification accuracy of adaptive random forests when used with different data streams and concept drifts. The data streams used to evaluate performance are SEA and Agrawal. Each data stream is tested in three concept drift configurations: gradual, sudden, and recurring. The results of the experiments show that adaptive random forests have better accuracy on SEA than on Agrawal, which may be explained by the number of dimensions and the structure of the input attributes. Adaptive random forests showed no clear difference in accuracy between gradual and sudden concept drifts. However, recurring concept drifts had lower accuracy in the benchmarks than both their sudden and gradual counterparts. This may be a result of the higher frequency of concept drifts within the same time period (number of observed samples).

Contents

Acronyms
1 Introduction
    1.1 Problem statement
    1.2 Scope
    1.3 Thesis outline
2 Background
    2.1 Offline and Online learning
    2.2 Data stream classification
    2.3 Ensemble methods
    2.4 Concept drifts
    2.5 Algorithms
        2.5.1 Decision Tree
        2.5.2 Hoeffding Tree
        2.5.3 Random forest
        2.5.4 Adaptive random forest
    2.6 Related Work
        2.6.1 Ensemble methods for data streams
        2.6.2 Ensemble methods with drift detectors
        2.6.3 Dynamic weighted majority
        2.6.4 Dynamic streaming random forest
3 Method
    3.1 Software frameworks
    3.2 Data streams
        3.2.1 SEA
        3.2.2 Agrawal
        3.2.3 Concept drift stream
    3.3 Experimental settings
    3.4 Training and Benchmarking
4 Results
    4.1 Gradual concept drift
    4.2 Sudden concept drift
    4.3 Recurring concept drift
    4.4 Statistical summary
5 Discussion
    5.0.1 Limitations
    5.0.2 Ethics and sustainability
    5.0.3 Future research
6 Conclusion
7 Appendix 1 - Plots
    7.1 SEA data stream
    7.2 AGRAWAL data stream
    7.3 Summary
8 Appendix 2 - Benchmark data
    8.1 All
    8.2 Summary

Acronyms

ADWIN   ADaptive WINdowing
ARF     Adaptive Random Forest
DSRF    Dynamic Streaming Random Forest
DT      Decision Tree
DWM     Dynamic Weighted Majority
HT      Hoeffding Tree
RF      Random Forest

Chapter 1 Introduction

In some real-world applications of machine learning, data arrives in continuous streams. In such situations, the entire training data set is not available at the time of designing the model. The underlying data distribution of these streams might also change over time in response to various events. There is, therefore, a need for machine learning models that can learn on the fly from continuously evolving data streams and adapt to changes in the underlying probability distribution of the observed data. This means that the model needs to learn and adapt as new observations become available. In these situations, online machine learning approaches are required.

Many learning algorithms use one model to form a final prediction. However, it is also possible to combine several models in some manner to make predictions for new examples. A method that combines multiple base models is known as an ensemble method. There are many well-known ensemble methods, such as bagging [1] and random forests [2], that aim to improve generalization performance and model accuracy. The adaptive random forests (ARF) algorithm is an adaptation of batch-based random forests that aims to handle evolving data streams [3].

In this work we aim to study the behaviour of ARF in the challenging context of evolving data streams. More specifically, we aim to study ARF's accuracy when used to classify observations from evolving data streams with different kinds of concept drift, where the data distribution changes over time in different patterns: gradual, sudden, or recurring.
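To make the three drift patterns concrete, the following sketch models a stream that moves between two concepts. It is an illustration only: the sigmoid transition, the function names, and the example positions and widths are assumptions chosen for this example, not the configuration used in the study.

```python
import math


def new_concept_probability(t, position, width):
    """Probability that sample t is drawn from the new concept.

    Assumption: a sigmoid centred at `position`; `width` controls how many
    samples the transition takes. width=1 approximates a sudden drift,
    a large width gives a gradual drift.
    """
    x = -4.0 * (t - position) / width
    # Clamp the exponent to avoid overflow for samples far from the drift.
    x = max(min(x, 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(x))


def recurring_concept(t, period):
    """Index (0 or 1) of the active concept when two concepts alternate
    every `period` samples (a simple form of recurring drift)."""
    return (t // period) % 2


# Sudden drift at sample 5000: the switch is essentially instantaneous.
print(new_concept_probability(4990, position=5000, width=1))
print(new_concept_probability(5010, position=5000, width=1))

# Gradual drift around sample 5000: both concepts are mixed for a while.
print(new_concept_probability(4750, position=5000, width=1000))

# Recurring drift: the first concept returns after every second period.
print([recurring_concept(t, period=1000) for t in (0, 1000, 2000, 3000)])
```

With a narrow width the stream jumps from one concept to the other, while a wide width means that samples from both concepts are interleaved during the transition. In the recurring case, an earlier concept reappears later in the stream.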
1.1 Problem statement

Different online machine learning algorithms have different performance and accuracy characteristics when classifying non-stationary data streams. This study examines the behaviour of adaptive random forests when used to classify data streams with various forms of concept drift, by comparing the changes in predictive accuracy between different data streams and concept drifts. The study aims to answer the following question: how do adaptive random forests perform when they are used to classify non-stationary data streams with sudden, gradual, and recurring concept drifts? Two synthetic data stream generators, SEA and Agrawal, are used to evaluate the performance of the studied models.

1.2 Scope

In this study, we investigate the adaptive random forests (ARF) algorithm [3], an online ensemble method which deals with concept drifts using dynamic update methods and batch learning. There is an infinite number of possible data streams, concept drift setups, and model parameter permutations. The study therefore focuses on three types of concept drift: gradual, sudden, and recurring.

All data streams and algorithms used in this paper are available in the scikit-multiflow framework [4]. The drift adaptation strategy that we choose is ADaptive WINdowing (ADWIN) [5], with a threshold of 0.01 for warning detection and a threshold of 0.001 for drift detection. We also use 10 Hoeffding tree classifiers in the ensemble model. Other model parameters are set to the defaults chosen by the framework designers.

The performance evaluation method is prequential evaluation [6] (the interleaved test-then-train method), and the evaluation measure used to assess performance is classification accuracy. The algorithm is evaluated in the immediate setting, where labels are presented to the learner before the next instance arrives. Situations where labels arrive with delay are outside the scope of this work.
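The interleaved test-then-train procedure can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: `MajorityClassLearner` and `toy_stream` are hypothetical stand-ins invented for this example, not the ARF model or the SEA and Agrawal generators from scikit-multiflow.

```python
import random


class MajorityClassLearner:
    """Hypothetical stand-in learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return 0  # arbitrary default before any label has been seen
        return max(self.counts, key=self.counts.get)

    def partial_fit(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential_accuracy(stream, learner):
    """Interleaved test-then-train: each sample is first used for testing
    and only then for training, so every prediction is made on unseen data."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:  # test first
            correct += 1
        learner.partial_fit(x, y)    # then train (label is available immediately)
        total += 1
    return correct / total if total else 0.0


def toy_stream(n, drift_at, seed=0):
    """Toy stream with a sudden drift: the label is x > 0.5 before
    `drift_at` and x <= 0.5 afterwards."""
    rng = random.Random(seed)
    for t in range(n):
        x = rng.random()
        y = int(x > 0.5) if t < drift_at else int(x <= 0.5)
        yield x, y


accuracy = prequential_accuracy(toy_stream(1000, drift_at=500), MajorityClassLearner())
print(f"prequential accuracy: {accuracy:.3f}")
```

Because the learner is tested on each sample before it is trained on that sample, no separate hold-out set is needed, and the running accuracy reflects how well the model tracks the stream over time.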
The scope of this thesis is also limited to two data streams, SEA and Agrawal, with one test configuration for each type of concept drift.

1.3 Thesis outline

This report is divided into six chapters. The first chapter introduces the subject and the problem statement of the study. The second chapter presents background information related to online machine learning and the ARF algorithm.