Adaptive Learning and Mining for Data Streams and Frequent Patterns
Doctoral Thesis presented to the Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, by Albert Bifet. April 2009. Revised version with minor revisions.

Advisors: Ricard Gavaldà and José L. Balcázar

Abstract

This thesis is devoted to the design of data mining algorithms for evolving data streams and for the extraction of closed frequent trees. First, we deal with each of these tasks separately, and then we deal with them together, developing classification methods for data streams containing items that are trees.

In the data stream model, data arrive at high speed, and the algorithms that must process them have very strict constraints of space and time. In the first part of this thesis we propose and illustrate a framework for developing algorithms that can adaptively learn from data streams that change over time. Our methods are based on using change detectors and estimator modules at the right places. We propose an adaptive sliding window algorithm, ADWIN, for detecting change and keeping updated statistics from a data stream, and use it as a black box in place of counters or accumulators in algorithms initially not designed for drifting data. Since ADWIN has rigorous performance guarantees, this opens the possibility of extending such guarantees to learning and mining algorithms. We test our methodology with several learning methods, such as Naïve Bayes, clustering, decision trees, and ensemble methods. We build an experimental framework for data stream mining with concept drift, based on the MOA framework, similar to WEKA, so that it will be easy for researchers to run experimental data stream benchmarks.

Trees are connected acyclic graphs and they are studied as link-based structures in many cases. In the second part of this thesis, we describe a rather formal study of trees from the point of view of closure-based mining.
Moreover, we present efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees. We include an analysis of the extraction of association rules of full confidence out of the closed sets of trees, and we have found there an interesting phenomenon: rules whose propositional counterpart is nontrivial are, however, always implicitly true in trees due to the peculiar combinatorics of the structures.

And finally, using these results on evolving data stream mining and closed frequent tree mining, we present high performance algorithms for mining closed unlabeled rooted trees adaptively from data streams that change over time. We introduce a general methodology to identify closed patterns in a data stream, using Galois Lattice Theory. Using this methodology, we then develop an incremental algorithm, a sliding-window based one, and finally one that mines closed trees adaptively from data streams. We use these methods to develop classification methods for tree data streams.

Acknowledgments

I am extremely grateful to my advisors, Ricard Gavaldà and José L. Balcázar. They have been great role models as researchers, mentors, and friends. Ricard provided me with the ideal environment to work, his valuable and enjoyable time, and his wisdom. I admire him deeply for his way of asking questions, and his silent sapience. I learnt from him that less may be more. José L. has always motivated me to go further and further. His enthusiasm, dedication, and impressive depth of knowledge have been a great inspiration to me. He is a man of genius and I learnt from him to focus and spend time on important things.

I would like to thank Antoni Lozano for his support and friendship. Without him, this thesis would not have been possible. Also, I would like to thank Víctor Dalmau, for introducing me to research, and Jorge Castro for showing me the beauty of highly motivating objectives.
I am also greatly indebted to Geoff Holmes and Bernhard Pfahringer for the pleasure of collaborating with them and for encouraging me in the very promising work on MOA stream mining. And to João Gama, for introducing and teaching me new and astonishing aspects of mining data streams.

I thank all my coauthors, Carlos Castillo, Paul Chirita, Ingmar Weber, Manuel Baena, José del Campo, Raúl Fidalgo, Rafael Morales, and Richard Kirkby for their help and collaboration. I want to thank my former officemates at LSI for their support: Marc Comas, Bernat Gel, Carlos Mérida, David Cabanillas, Carlos Arizmendi, Mario Fadda, Ignacio Barrio, Felix Castro, Ivan Olier, and Josep Pujol.

Most of all, I am grateful to my family.

Contents

I Introduction and Preliminaries

1 Introduction
  1.1 Data Mining
  1.2 Data stream mining
  1.3 Frequent tree pattern mining
  1.4 Contributions of this thesis
  1.5 Overview of this thesis
    1.5.1 Publications
  1.6 Support

2 Preliminaries
  2.1 Classification and Clustering
    2.1.1 Naïve Bayes
    2.1.2 Decision Trees
    2.1.3 k-means clustering
  2.2 Change Detection and Value Estimation
    2.2.1 Change Detection
    2.2.2 Estimation
  2.3 Frequent Pattern Mining
  2.4 Mining data streams: state of the art
    2.4.1 Sliding Windows in data streams
    2.4.2 Classification in data streams
    2.4.3 Clustering in data streams
  2.5 Frequent pattern mining: state of the art
    2.5.1 CMTreeMiner
    2.5.2 DRYADEPARENT
    2.5.3 Streaming Pattern Mining

II Evolving Data Stream Learning

3 Mining Evolving Data Streams
  3.1 Introduction
    3.1.1 Theoretical approaches
  3.2 Algorithms for mining with change
    3.2.1 FLORA: Widmer and Kubat
    3.2.2 Support Vector Machines: Klinkenberg
    3.2.3 OLIN: Last
    3.2.4 CVFDT: Domingos
    3.2.5 UFFT: Gama
  3.3 A Methodology for Adaptive Stream Mining
    3.3.1 Time Change Detectors and Predictors: A General Framework
    3.3.2 Window Management Models
  3.4 Optimal Change Detector and Predictor
  3.5 Experimental Setting
    3.5.1 Concept Drift Framework
    3.5.2 Datasets for concept drift
    3.5.3 MOA Experimental Framework

4 Adaptive Sliding Windows
  4.1 Introduction
  4.2 Maintaining Updated Windows of Varying Length
    4.2.1 Setting
    4.2.2 First algorithm: ADWIN0
    4.2.3 ADWIN0 for Poisson processes
    4.2.4 Improving time and memory requirements
  4.3 Experimental Validation of ADWIN
  4.4 Example 1: Incremental Naïve Bayes Predictor
    4.4.1 Experiments on Synthetic Data
    4.4.2 Real-world data experiments
  4.5 Example 2: Incremental k-means Clustering
    4.5.1 Experiments
  4.6 K-ADWIN = ADWIN + Kalman Filtering
    4.6.1 Experimental Validation of K-ADWIN
    4.6.2 Example 1: Naïve Bayes Predictor
    4.6.3 Example 2: k-means Clustering
    4.6.4 K-ADWIN Experimental Validation Conclusions
  4.7 Time and Memory Requirements

5 Decision Trees
  5.1 Introduction
  5.2 Decision Trees on Sliding Windows
    5.2.1 HWT-ADWIN: Hoeffding Window Tree using ADWIN
    5.2.2 CVFDT
  5.3 Hoeffding Adaptive Trees
    5.3.1 Example of performance Guarantee
    5.3.2 Memory Complexity Analysis
  5.4 Experimental evaluation
  5.5 Time and memory

6 Ensemble Methods
  6.1 Bagging and Boosting
  6.2 New method of Bagging using trees of different size
  6.3 New method of Bagging using ADWIN
  6.4 Adaptive Hoeffding Option Trees
  6.5 Comparative Experimental Evaluation

III Closed Frequent Tree Mining

7 Mining Frequent Closed Rooted Trees
  7.1 Introduction
  7.2 Basic Algorithmics and Mathematical Properties
    7.2.1 Number of subtrees
    7.2.2 Finding the intersection of trees recursively
    7.2.3 Finding the intersection by dynamic programming
  7.3 Closure Operator on Trees
    7.3.1 Galois Connection
  7.4 Level Representations
    7.4.1 Subtree Testing in Ordered Trees
  7.5 Mining Frequent Ordered Trees
  7.6 Unordered Subtrees
    7.6.1 Subtree Testing in Unordered Trees
    7.6.2 Mining frequent closed subtrees in the unordered case
    7.6.3 Closure-based mining
  7.7 Induced subtrees and Labeled trees
    7.7.1 Induced subtrees
    7.7.2 Labeled trees
  7.8 Applications
    7.8.1