Handling Concept Drift: Importance, Challenges & Solutions

What CD is About :: Types of Drifts & Approaches :: Techniques in Detail :: Evaluation :: MOA Demo :: Types of Applications :: Outlook

A. Bifet, J. Gama, M. Pechenizkiy, I. Žliobaitė
PAKDD-2011 Tutorial, May 27, Shenzhen, China
http://www.cs.waikato.ac.nz/~abifet/PAKDD2011/

Motivation for the Tutorial
• in real-world data
  – more and more data is organized as data streams
  – data comes from a complex environment and evolves over time
  – concept drift = the underlying distribution of the data is changing
• attention to handling concept drift is increasing, since predictions become less accurate over time
• to prevent that, learning models need to adapt to changes quickly and accurately

Objectives of the Tutorial
• many approaches to handling drift have been proposed across a wide range of applications
• the tutorial aims to provide an integrated view of adaptive learning methods and application needs

Outline
Block 1: PROBLEM
Block 2: TECHNIQUES
Block 3: EVALUATION
Block 4: APPLICATIONS
OUTLOOK

Presenters & Schedule
14:05 – 14:50  Indrė Žliobaitė     Block 1: what concept drift is, why it needs to be handled, and how
14:50 – 15:40  João Gama           Block 2: approaches for adaptive machine learning to handle concept drift
16:00 – 16:45  Albert Bifet        Block 3: training/evaluation of adaptive learning approaches and a software demo
16:45 – 17:30  Mykola Pechenizkiy  Block 4: challenges for handling concept drift in different applications

Block 1: introduction

The goals:
• to introduce the problem of concept drift in supervised learning
• to overview and categorize the main principles of how concept drift can be (and is) handled

Example: production
[Figure: many process parameters are reduced and transformed into t1 and t2, which are used to predict quality q and to monitor the process. Axis labels: t[1], t[2]; Target, Konz. (INA&INAL), Num. Plots produced with SIMCA-P+ 11.]
After 2-3 months the predictions get "out of tune".

Static model
• the model does not change
• the process changes
  – the supplier of raw material changes
  – a sensor breaks, "wears off", or is replaced
  – a new operating crew
  – new regulations to save electricity
  – new production procedures
source: Evonik Industries
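
The static-model problem above can be sketched numerically. The following is a minimal illustration, not the Evonik model: it assumes a made-up process in which quality depends linearly on a single parameter, and a drift that shifts the offset by 3 units (standing in for, say, a change of raw-material supplier). The function names and all constants are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form, one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and true values."""
    a, b = model
    return sum(abs(a * x + b - y) for x, y in zip(xs, ys)) / len(xs)

def make_batch(rng, n, slope, offset):
    """Hypothetical process: quality = slope * parameter + offset + small noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [slope * x + offset + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_batch(rng, 200, slope=2.0, offset=70.0)
model = fit_line(train_x, train_y)   # fitted once, never updated

# Same process as during training.
same_x, same_y = make_batch(rng, 200, slope=2.0, offset=70.0)
# Assumed drift: the offset of the process shifts by 3 units.
new_x, new_y = make_batch(rng, 200, slope=2.0, offset=73.0)

error_before = mean_abs_error(model, same_x, same_y)
error_after = mean_abs_error(model, new_x, new_y)
print(f"MAE before the process change: {error_before:.2f}")
print(f"MAE after the process change:  {error_after:.2f}")
```

The model itself is unchanged between the two evaluations; only the process that generates the data differs, which is enough to make the predictions systematically wrong.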

Desired Properties of a System To Handle Concept Drift
• adapt to concept drift as soon as possible
• distinguish noise from changes
  – robust to noise, but adaptive to changes
• recognize and react to reoccurring contexts
• adapt with limited resources
  – time and memory

Setting: adaptive learning

Supervised Learning
1. Training: learn a mapping function y = L(X) from historical data X with labels y
2. Application: apply L to unseen testing data X' to predict the unknown labels, y' = L(X')
In the diagram, the historical data and the testing data are drawn from the same population.

Learning with concept drift
Training: y = L(X) on historical data X with labels y
Application: y' = L??(X') on testing data X'
Open question: is the training population still the same as the testing population (= ??)
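
One way to picture the "distinguish noise from changes" property together with the "= ??" question is to keep applying the learned L and watch its recent error rate. The sketch below is an illustration under stated assumptions, not a detector from the tutorial: the stream, the learned rule, the window length of 200 and the alarm margin of 0.10 are all invented for the example.

```python
import random
from collections import deque

NOISE_RATE = 0.05    # assumed probability that a label is flipped at random
DRIFT_AT = 1000      # assumed time step at which the concept changes

def learned_rule(x):
    """L, as learned from historical data: predict 1 when x > 0.5."""
    return 1 if x > 0.5 else 0

def true_label(x, t, rng):
    """Hypothetical stream: the true threshold moves from 0.5 to 0.8 at DRIFT_AT."""
    threshold = 0.5 if t < DRIFT_AT else 0.8
    y = 1 if x > threshold else 0
    # Label noise: flip the label with probability NOISE_RATE.
    return 1 - y if rng.random() < NOISE_RATE else y

rng = random.Random(1)
recent_errors = deque(maxlen=200)    # 0/1 errors of the latest 200 predictions
alarm_level = NOISE_RATE + 0.10      # illustrative margin above the noise level
alarm_time = None

for t in range(2000):
    x = rng.random()
    recent_errors.append(int(learned_rule(x) != true_label(x, t, rng)))
    window_full = len(recent_errors) == recent_errors.maxlen
    if alarm_time is None and window_full:
        if sum(recent_errors) / len(recent_errors) > alarm_level:
            alarm_time = t   # first time the recent error rate exceeds the margin

print("first alarm at t =", alarm_time)
```

Averaging over a window is what makes the check robust to individual noisy labels; the price is a delay between the change and the alarm, so the window length trades robustness against speed of adaptation.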

Online setting
• data arrives in a stream over time

Adaptive Learning
• as time passes, the model updates or retrains itself if needed
• predictions are made with the updated model

How to Adapt
• select the right training data and retrain the model, and/or
• adjust the model incrementally
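
The two options under "How to Adapt" can be compared on a toy stream. This is a minimal sketch under assumptions of my own, not the tutorial's method: the "model" is just a predicted mean, the stream jumps from 10 to 20 at step 1000, retraining uses a sliding window of the last 50 values, and the incremental update uses a fixed step of 0.05. Each prediction is made before the true value is seen (test first, then train).

```python
import random
from collections import deque

rng = random.Random(2)
# Hypothetical stream: the target mean jumps from 10 to 20 at t = 1000.
values = [(10.0 if t < 1000 else 20.0) + rng.gauss(0.0, 1.0) for t in range(2000)]

warmup = values[:100]
static_model = sum(warmup) / len(warmup)   # trained once, never changed
window = deque(warmup[-50:], maxlen=50)    # option 1: keep only recent training data
incremental_model = static_model           # option 2: small adjustment per example
step = 0.05                                # illustrative step size

post_drift_error = {"static": 0.0, "window": 0.0, "incremental": 0.0}
for t in range(100, 2000):
    y = values[t]
    window_model = sum(window) / len(window)   # "retrain" on the current window
    if t >= 1000:                              # score only predictions made after the drift
        post_drift_error["static"] += abs(static_model - y)
        post_drift_error["window"] += abs(window_model - y)
        post_drift_error["incremental"] += abs(incremental_model - y)
    # The true value becomes available only after the prediction was made.
    window.append(y)
    incremental_model += step * (y - incremental_model)

mean_error = {name: total / 1000 for name, total in post_drift_error.items()}
print(mean_error)
```

Both adaptive variants recover after a short transition, while the static model stays wrong by roughly the size of the jump. The window length and the step size play the same role: smaller windows or larger steps adapt faster but are noisier.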