Mining Data Streams with Concept Drift
Total Page:16
File Type:pdf, Size:1020Kb
Poznan University of Technology Faculty of Computing Science and Management Institute of Computing Science Master’s thesis MINING DATA STREAMS WITH CONCEPT DRIFT Dariusz Brzeziński Supervisor Jerzy Stefanowski, PhD Dr Habil. Poznań, 2010 Tutaj przychodzi karta pracy dyplomowej; oryginał wstawiamy do wersji dla archiwum PP, w pozostałych kopiach wstawiamy ksero. Contents 1 Introduction 1 1.1 Motivation . .1 1.2 Thesis structure . .2 1.3 Acknowledgments . .2 2 Mining data streams3 2.1 Data streams . .3 2.2 Concept drift . .5 2.3 Data stream mining . .6 2.3.1 Online learning . .7 2.3.2 Forgetting mechanisms . .7 2.3.3 Taxonomy of methods . .8 2.4 Applications . .9 2.4.1 Monitoring systems . .9 2.4.2 Personal assistance . 10 2.4.3 Decision support . 10 2.4.4 Artificial intelligence . 11 3 Single classifier approaches 13 3.1 Traditional learners . 13 3.2 Windowing techniques . 14 3.2.1 Weighted windows . 15 3.2.2 FISH . 15 3.2.3 ADWIN . 17 3.3 Drift detectors . 18 3.3.1 DDM . 18 3.3.2 EDDM . 20 3.4 Hoeffding trees . 20 4 Ensemble approaches 23 4.1 Ensemble strategies for changing environments . 23 4.2 Streaming Ensemble Algorithm . 25 4.3 Accuracy Weighted Ensemble . 26 4.4 Hoeffding option trees and ASHT Bagging . 28 4.5 Accuracy Diversified Ensemble . 29 5 MOA framework 33 5.1 Stream generation and management . 33 I II Contents 5.2 Classification methods . 34 5.3 Evaluation procedures . 35 5.3.1 Holdout . 35 5.3.2 Interleaved Test-Then-Train . 35 5.3.3 Data Chunks . 36 6 Experimental evaluation 37 6.1 Algorithms . 37 6.2 Data sets . 38 6.3 Experimental environment . 39 6.4 Results . 39 6.4.1 Time analysis . 39 6.4.2 Memory usage . 41 6.4.3 Classification accuracy . 42 6.5 Remarks . 45 7 Conclusions 47 A Implementation details 49 A.1 MOA Installation . 49 A.2 Attribute filtering . 50 A.3 Data chunk evaluation . 50 A.4 Accuracy Weighted Ensemble . 51 A.5 Accuracy Diversified Ensemble . 51 B Additional figures 53 Bibliography 71 Streszczenie 77 Chapter 1 Introduction 1.1 Motivation In today’s information society, computer users are used to gathering and sharing data anytime and anywhere. This concerns applications such as social networks, banking, telecommunication, health care, research, and entertainment, among others. As a result, a huge amount of data related to all human activity is gathered for storage and processing purposes. These data sets may contain interesting and useful knowledge represented by hidden patterns, but due to the volume of the gathered data it is impossible to manually extract that knowledge. That is why data mining and knowledge discovery methods have been proposed to automatically acquire interesting, non-trivial, previously unknown and ultimately understandable patterns from very large data sets [26, 14]. Typical data mining tasks include association mining, classification, and clustering, which all have been perfected for over two decades. A recent report [35] estimated that the digital universe in 2007 was 281 billion gigabytes large and it is forecast that it will reach 5 times that size until 2011. The same report states that by 2011 half of the produced data will not have a permanent home. This is partially due to a new class of emerging applications - applications in which data is generated at very high rates in the form of transient data streams. Data streams can be viewed as a sequence of relational tuples (e.g., call records, web page visits, sensor readings) that arrive continuously at time-varying, possibly unbound streams. Due to their speed and size it is impossible to store them permanently [45]. Data stream application domains include network monitoring, security, telecommunication data management, web applications, and sensor networks. The introduction of this new class of applica- tions has opened an interesting line of research problems including novel approaches to knowledge discovery called data stream mining. Current research in data mining is mainly devoted to static environments, where patterns hidden in data are fixed and each data tuple can be accessed more than once. The most popular data mining task is classification, defined as generalizing a known structure to apply it to new data [26]. Traditional classification techniques give great results in static environments however, they fail to successfully process data streams because of two factors: their overwhelming volume and their distinctive feature - concept drift. Concept drift is a term used to describe changes in the learned structure that occur over time. These changes mainly involve substitutions of one classification task with another, but also include steady trends and minor fluctuations of the underlying probability distributions [54]. For most traditional classifiers the occurrence of concept drift leads to a drastic drop in classification accuracy. That is why recently, new classification algorithms dedicated to data streams have been proposed. 1 2 Introduction The recognition of concept drift in data streams has led to sliding-window approaches that model a forgetting process, which allows to limit the number of processed data and to react to changes. Different approaches to mining data streams with concept drift include instance selection methods, drift detection, ensemble classifiers, option trees and using Hoeffding boundaries to estimate classifier performance. Recently, a framework called Massive Online Analysis (MOA) for implementing algorithms and running experiments on evolving data streams has been developed [12, 11]. It includes a collection of offline and online data stream mining methods as well as tools for their evaluation. MOA is a new environment that can facilitate and consequently accelerate the development of new time-evolving stream classifiers. The aim of this thesis is to review and compare single classifier and ensemble approaches to data stream mining. We test time and memory costs, as well as classification accuracy, of representative algorithms from both approaches. The experimental comparison of one of the algorithms, called Accuracy Weighted Ensemble, with other selected classifiers has, to our knowledge, not been pre- viously done. Additionally, we propose and evaluate a new algorithm called Accuracy Diversified Ensemble, which selects, weights, and updates ensemble members according to the current stream distribution. For our experiments we use the Massive Online Analysis environment and extend it by attribute filtering and data chunk evaluation procedures. We also verify the framework’s capability to become the first commonly used software environment for research on learning from evolving data streams. 1.2 Thesis structure The structure of the thesis is as follows. Chapter2 presents the basics of data stream mining. In particular, definitions of data streams, concept drift as well as types of stream learners and their applications are shown. Chapter3 gives a deeper insight into single classifier approaches to data stream mining, presenting windowing techniques and Hoeffding trees. Ensemble approaches to classification in data streams, including the Streaming Ensemble Algorithm, Hoeffding Option Tree, and our Accuracy Diversified Ensemble, are presented in Chapter4. The Massive Online Analysis framework, which was used for evaluation purposes in this thesis, is presented in Chapter5. Chapter6 describes experimental results and compares single classifier and ensemble algorithms for mining concept-drifting data streams. Finally, Chapter7 concludes the thesis with a discussion on the completed work and possible lines of further investigations. 1.3 Acknowledgments The author would like to thank all the people who contributed to this study. He is grateful to his supervisor, prof. Jerzy Stefanowski, for his inspiration, motivation, and the care with which he reviewed this work. The author is also greatly indebted to many other teachers of the Institute of Computing Science of Poznań University of Technology who got him interested in machine learning and data mining. Finally, the author wishes to thank his family: his parents, for their love, unconditional support and encouragement to pursue his interests, and his sister, for sharing her experience of dissertation writing and giving invaluable advice. Chapter 2 Mining data streams Before describing and evaluating different approaches to mining streams with concept drift, we present the basics of data streams. First, in Section 2.1 we focus on the main characteristics of the data stream model and how it differs from traditional data. Next, in Section 2.2, we define and categorize concept drift and its causes. Section 2.3 discusses the differences between data stream mining and classic data mining. It also presents a taxonomy of adaptive classification techniques. Finally, Section 2.4 describes the main applications of data stream mining techniques. As this thesis concentrates on classification techniques, we will use the term data stream learning as a synonym for data stream mining. 2.1 Data streams A data stream is an ordered sequence of instances that arrive at a rate that does not permit to permanently store them in memory. Data streams are potentially unbounded in size making them impossible to process by most data mining approaches. The main characteristics of the data stream model imply the following constraints [5]: 1. It is impossible to store all the data from the data stream. Only small summaries of data streams can be computed and stored, and the rest of the information is thrown away. 2. The arrival speed of data stream tuples forces each particular element to be processed essen- tially in real time, and then discarded. 3. The distribution generating the items can change over time. Thus, data from the past may become irrelevant or even harmful for the current summary. Constraint 1 limits the amount of memory that algorithms operating on data streams can use, while constraint 2 limits the time in which an item can be processed. The first two constraints led to the development of data stream summarization techniques.