DATA STREAM MINING: A Practical Approach
Albert Bifet and Richard Kirkby
August 2009

Contents

Part I  Introduction and Preliminaries

1 Preliminaries
  1.1 MOA Stream Mining
  1.2 Assumptions
  1.3 Requirements
  1.4 Mining Strategies
  1.5 Change Detection Strategies

2 MOA Experimental Setting
  2.1 Previous Evaluation Practices
    2.1.1 Batch Setting
    2.1.2 Data Stream Setting
  2.2 Evaluation Procedures for Data Streams
    2.2.1 Holdout
    2.2.2 Interleaved Test-Then-Train or Prequential
    2.2.3 Comparison
  2.3 Testing Framework
  2.4 Environments
    2.4.1 Sensor Network
    2.4.2 Handheld Computer
    2.4.3 Server
  2.5 Data Sources
    2.5.1 Random Tree Generator
    2.5.2 Random RBF Generator
    2.5.3 LED Generator
    2.5.4 Waveform Generator
    2.5.5 Function Generator
  2.6 Generation Speed and Data Size
  2.7 Evolving Stream Experimental Setting
    2.7.1 Concept Drift Framework
    2.7.2 Datasets for concept drift

Part II  Stationary Data Stream Learning

3 Hoeffding Trees
  3.1 The Hoeffding Bound for Tree Induction
  3.2 The Basic Algorithm
    3.2.1 Split Confidence
    3.2.2 Sufficient Statistics
    3.2.3 Grace Period
    3.2.4 Pre-pruning
    3.2.5 Tie-breaking
    3.2.6 Skewed Split Prevention
  3.3 Memory Management
    3.3.1 Poor Attribute Removal
  3.4 MOA Java Implementation Details
    3.4.1 Fast Size Estimates

4 Numeric Attributes
  4.1 Batch Setting Approaches
    4.1.1 Equal Width
    4.1.2 Equal Frequency
    4.1.3 k-means Clustering
    4.1.4 Fayyad and Irani
    4.1.5 C4.5
  4.2 Data Stream Approaches
    4.2.1 VFML Implementation
    4.2.2 Exhaustive Binary Tree
    4.2.3 Quantile Summaries
    4.2.4 Gaussian Approximation
    4.2.5 Numerical Interval Pruning

5 Prediction Strategies
  5.1 Majority Class
  5.2 Naive Bayes Leaves
  5.3 Adaptive Hybrid

6 Hoeffding Tree Ensembles
  6.1 Batch Setting
    6.1.1 Bagging
    6.1.2 Boosting
    6.1.3 Option Trees
  6.2 Data Stream Setting
    6.2.1 Bagging
    6.2.2 Boosting
    6.2.3 Option Trees
  6.3 Realistic Ensemble Sizes

Part III  Evolving Data Stream Learning

7 Evolving data streams
  7.1 Algorithms for mining with change
    7.1.1 OLIN: Last
    7.1.2 CVFDT: Domingos
    7.1.3 UFFT: Gama
  7.2 A Methodology for Adaptive Stream Mining
    7.2.1 Time Change Detectors and Predictors: A General Framework
    7.2.2 Window Management Models
  7.3 Optimal Change Detector and Predictor
8 Adaptive Sliding Windows
  8.1 Introduction
  8.2 Maintaining Updated Windows of Varying Length
    8.2.1 Setting
    8.2.2 First algorithm: ADWIN0
    8.2.3 ADWIN0 for Poisson processes
    8.2.4 Improving time and memory requirements
  8.3 K-ADWIN = ADWIN + Kalman Filtering

9 Adaptive Hoeffding Trees
  9.1 Introduction
  9.2 Decision Trees on Sliding Windows
    9.2.1 HWT-ADWIN: Hoeffding Window Tree using ADWIN
    9.2.2 CVFDT
  9.3 Hoeffding Adaptive Trees
    9.3.1 Example of Performance Guarantee
    9.3.2 Memory Complexity Analysis

10 Adaptive Ensemble Methods
  10.1 New method of Bagging using trees of different size
  10.2 New method of Bagging using ADWIN
  10.3 Adaptive Hoeffding Option Trees
  10.4 Method performance

Bibliography

Introduction

Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA includes a collection of offline and online methods as well as tools for evaluation. In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Naïve Bayes classifiers at the leaves.

MOA is related to WEKA, the Waikato Environment for Knowledge Analysis, which is an award-winning open-source workbench containing implementations of a wide range of batch machine learning methods. WEKA is also written in Java. The main benefits of Java are portability, since applications can be run on any platform with an appropriate Java virtual machine, and its strong, well-developed support libraries.
Use of the language is widespread, and features such as automatic garbage collection help to reduce programmer burden and error.

This text explains the theoretical and practical foundations of the methods and streams available in MOA. The moa and the weka are both birds native to New Zealand. The weka is a cheeky bird of similar size to a chicken. The moa was a large ostrich-like bird, an order of magnitude larger than a weka, that was hunted to extinction.

Part I  Introduction and Preliminaries

1 Preliminaries

In today's information society, the extraction of knowledge is becoming a very important task for many people. We live in an age of knowledge revolution. Peter Drucker [42], an influential management expert, writes: "From now on, the key is knowledge. The world is not becoming labor intensive, not material intensive, not energy intensive, but knowledge intensive". This knowledge revolution is based on an economic change from adding value by producing things, which is ultimately limited, to adding value by creating and using knowledge, which can grow indefinitely.

The digital universe in 2007 was estimated in [64] to be 281 exabytes, or 281 billion gigabytes, and by 2011 the digital universe will be 10 times the size it was 5 years before. The amount of information created or captured exceeded the available storage for the first time in 2007.

To deal with this huge amount of data in a responsible way, green computing is becoming a necessity. Green computing is the study and practice of using computing resources efficiently. A main approach to green computing is based on algorithmic efficiency: the amount of computer resources required for any given computing function depends on the efficiency of the algorithms used. As the cost of hardware has declined relative to the cost of energy, the energy efficiency and environmental impact of computing systems and programs are receiving increased attention.
1.1 MOA Stream Mining

A largely untested hypothesis of modern society is that it is important to record data, as it may contain valuable information. This occurs in almost all facets of life, from supermarket checkouts to the movements of cows in a paddock. To support the hypothesis, engineers and scientists have produced a raft of ingenious schemes and devices, from loyalty programs to RFID tags. Little thought, however, has gone into how this quantity of data might be analyzed.

Machine learning, the field for finding ways to automatically extract information from data, was once considered the solution to this problem. Historically it has concentrated on learning from small numbers of examples, because only limited amounts of data were available when the field emerged. Some very sophisticated algorithms have resulted from this research, able to learn highly accurate models from limited training examples. It is commonly assumed that the entire set of training data can be stored in working memory.

More recently, the need to process larger amounts of data has motivated the field of data mining. Ways are investigated to reduce the computation time and memory needed to process large but static data sets. If the data cannot fit into memory, it may be necessary to sample a smaller training set. Alternatively, algorithms may resort to temporary external storage, or only process subsets of the data at a time. Commonly the goal is to create a learning process that is linear in the number of examples. The essential learning procedure is treated like a scaled-up version of classic machine learning, where learning is considered a single, possibly expensive, operation: a set of training examples is processed to output a final static model. The data
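The single-pass, bounded-memory contract that stream learners must obey can be illustrated with a minimal sketch. This is not MOA code: the MajorityClassLearner class and the skewed synthetic binary stream are invented for the example. It shows an interleaved test-then-train loop, in which each example is first used to test the current model and then to update it, so no separate holdout set is ever stored and memory stays constant no matter how long the stream runs.

```java
import java.util.Random;

// Illustrative sketch (not MOA code): a single-pass, constant-memory
// learner processing a stream one example at a time. The learner here
// merely counts class frequencies; real stream learners such as
// Hoeffding Trees keep richer sufficient statistics but obey the same
// one-pass, bounded-memory contract.
public class StreamSketch {

    // A trivially simple incremental learner: predicts the class seen most often.
    static class MajorityClassLearner {
        private final long[] classCounts;

        MajorityClassLearner(int numClasses) {
            classCounts = new long[numClasses];
        }

        int predict() {
            int best = 0;
            for (int c = 1; c < classCounts.length; c++) {
                if (classCounts[c] > classCounts[best]) best = c;
            }
            return best;
        }

        void train(int trueClass) {
            classCounts[trueClass]++;
        }
    }

    // Interleaved test-then-train: each example is used first for testing,
    // then for training, so no holdout set is stored and memory is constant.
    static double run(long totalExamples, long seed) {
        Random rnd = new Random(seed);
        MajorityClassLearner learner = new MajorityClassLearner(2);
        long correct = 0;
        for (long i = 0; i < totalExamples; i++) {
            int trueClass = rnd.nextDouble() < 0.7 ? 1 : 0; // skewed synthetic stream
            if (learner.predict() == trueClass) correct++;
            learner.train(trueClass);
        }
        return (double) correct / totalExamples;
    }

    public static void main(String[] args) {
        System.out.printf("accuracy after 1,000,000 examples: %.3f%n",
                run(1_000_000, 42));
    }
}
```

The loop touches each example exactly once and retains only two counters, so processing is linear in the number of examples; the accuracy converges to roughly 0.7, the base rate of the majority class in this synthetic stream.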