Anytime Algorithms for Stream Data Mining

Von der Fakultät für Mathematik, Informatik und Naturwissenschaften der RWTH Aachen University zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigte Dissertation

vorgelegt von

Diplom-Informatiker Philipp Kranen

aus Willich, Deutschland

Berichter: Universitätsprofessor Dr. rer. nat. Thomas Seidl
Visiting Professor Michael E. Houle, PhD

Tag der mündlichen Prüfung: 14.09.2011

Diese Dissertation ist auf den Internetseiten der Hochschulbibliothek online verfügbar.

Contents

Abstract / Zusammenfassung

I Introduction

1 The Need for Anytime Algorithms
  1.1 Thesis structure

2 Knowledge Discovery from Data
  2.1 The KDD process and data mining tasks
  2.2 Classification
  2.3 Clustering

3 Stream Data Mining
  3.1 General Tools and Techniques
  3.2 Stream Classification
  3.3 Stream Clustering

II Anytime Stream Classification

4 The Bayes Tree
  4.1 Introduction and Preliminaries
  4.2 Indexing density models
  4.3 Experiments
  4.4 Conclusion

5 The MC-Tree
  5.1 Combining Multiple Classes
  5.2 Experiments
  5.3 Conclusion

6 Bulk Loading the Bayes Tree
  6.1 Bulk loading mixture densities
  6.2 Experiments
  6.3 Conclusion

7 The Classifier Family: Learn from your Relatives
  7.1 Introduction
  7.2 Learning from Relatives
  7.3 Experiments
  7.4 Conclusion

8 Application: Anytime Classification in HealthNet Scenarios
  8.1 Scenario and Prototype
  8.2 Summary

9 Anytime Algorithms on Constant Streams
  9.1 Introduction
  9.2 Novel Approaches for Constant Data Streams
  9.3 Experiments
  9.4 Conclusion

10 Future Work

III Anytime Stream Clustering

11 Self-adaptive Anytime Stream Clustering
  11.1 The ClusTree
  11.2 Analysis and experiments
  11.3 Conclusion

12 Exploiting additional time in the ClusTree
  12.1 Alternative descent strategies
  12.2 Evaluation of descent strategies
  12.3 Conclusion

13 Robust Anytime Stream Clustering
  13.1 The LiarTree
  13.2 Experiments
  13.3 Conclusion

14 Application: Using Modeling for Anytime Outlier Detection
  14.1 Introduction
  14.2 Related work
  14.3 Detecting outliers in streaming data
  14.4 Experiments
  14.5 Conclusion

15 MOA and CMM
  15.1 The MOA Framework
  15.2 Evaluation Measures for Stream Clustering

16 Future Work

IV Summary and Outlook

V Appendices

Bibliography

Statement of Originality

List of Publications

Curriculum Vitae

Abstract

Data is collected and stored everywhere, be it images or audio files on private computers, customer data in traditional or electronic businesses, performance or control data in production sites, web traffic and click streams at internet providers, statistical data at government agencies, sensor measurements in scientific experimentation, surveillance data, etc. There are countless examples, and the amount of data is tremendous. Data mining is the process of finding useful and previously unknown patterns in data. In the examples listed above, data mining can be used for automated recommendation of audio files, business analysis and target marketing, or performance optimization and hazard warnings. While early mining algorithms only considered static data sets, research and practice in data mining must nowadays deal with continuous, possibly infinite streams of data, which are prevalent in most real world applications and scenarios.

Anytime algorithms constitute a special type of algorithm that is well suited to work on data streams. They inherit their name from their ability to provide a result after any amount of processing time. The amount of time available is not known to the algorithm in advance: anytime algorithms quickly compute an initial result and strive to improve it as long as time remains. When interrupted, they deliver the best result obtained until that point in time.

In this thesis, anytime classification is studied in depth for the Bayesian approach. New algorithmic solutions for anytime classification are developed and evaluated in extensive experimentation. The first anytime stream clustering algorithm is proposed, and an application to anytime outlier detection is presented. In addition to the algorithmic contributions, new meta-approaches are described that significantly widen the area of applications for anytime algorithms. The solutions and results of this thesis contribute to the state of the art in anytime algorithms and stream data mining research.


Zusammenfassung

Die rasante Entwicklung der Informationstechnologie hat zur Folge, dass in allen Bereichen der Gesellschaft und des täglichen Lebens große Mengen an Daten erzeugt und gespeichert werden. Beispiele reichen von Multimedia-Daten auf privaten Computern bis hin zu Messdaten in wissenschaftlichen Experimenten. Data Mining beschreibt die Aufgabe, in solchen Daten neue und interessante Muster zu finden. Diese können beispielsweise zur automatischen Empfehlung von Filmen genutzt werden oder helfen, neue Zusammenhänge aufzudecken und Prozesse zu verstehen. Seit Beginn der Data Mining Forschung wächst die Größe der zu verarbeitenden Datensätze. Während Datensätze zunächst als statisch und vollständig gegeben angenommen wurden, generieren viele Anwendungen heute kontinuierliche und teilweise unendliche Datenströme.

Anytime-Algorithmen stellen eine Klasse von Algorithmen dar, welche sich besonders gut zum Einsatz auf Datenströmen eignet. Ihr Name rührt von ihrer Eigenschaft her, zu jeder Zeit ein Ergebnis liefern zu können. Die zur Verfügung stehende Zeit ist dem Algorithmus dabei nicht bekannt: Er berechnet ein initiales Ergebnis und verbessert dieses, solange zusätzliche Rechenzeit vorhanden ist. Wird der Algorithmus unterbrochen, so liefert er das beste Ergebnis zurück, welches bis zu diesem Zeitpunkt erzielt wurde.

In dieser Dissertation werden neue Anytime-Verfahren für die Bayes Klassifikation entwickelt, intensiv untersucht und evaluiert. Der erste Anytime-Algorithmus zum Clustern von Datenströmen wird vorgestellt und eine Anwendung für die Erkennung von Ausreißern wird diskutiert. Neben neuen Algorithmen werden zwei übergeordnete Verfahren entwickelt, die den Anwendungsbereich für Anytime-Algorithmen signifikant erweitern. Die in dieser Dissertation vorgestellten Ansätze und Resultate tragen zum Stand der Forschung im Bereich Anytime-Algorithmen und Data Mining auf Datenströmen bei.


Part I

Introduction


Chapter 1

The Need for Anytime Algorithms

The rapid development of computer and information technology fundamentally changed many processes in science, industry and daily life. Systems for collecting, storing and managing data evolved from primitive file processing systems to sophisticated and powerful database systems. The tremendous amounts of data soon exceeded the human ability of comprehension and thus called for advanced tools for analysis, compression and exploration. Knowledge discovery from databases (KDD), or data mining, emerged as an interdisciplinary research field from the areas of databases, machine learning, statistics, visualization and others. Many mining algorithms were invented that helped reduce data, automatically classify new objects based on historical data, or extract rules and correlations that exhibit novel patterns and useful knowledge. In the early days of data mining research, data sets were often considered static and given as a whole. This assumption allowed mining algorithms to operate on the complete data set and to process objects multiple times during execution. With ever growing amounts of data, the need for efficient processing yielded single-pass algorithms, and general solutions for fast access to large data sets were developed.

A data stream is a continuous, possibly endless sequence of data items that must be processed as they arrive. Many real world applications can be associated with data streams, as large amounts of data must be processed

every day, hour, minute or even second. Examples include business/market data in companies, medical data in hospitals, statistical data in governmental institutions, experimental data in scientific laboratories, click streams in the world wide web, traffic/network data at web hosts or telecommunications companies, financial/stock market data in banking institutions, transaction/customer profile data in e-commerce companies, etc.

Mining on data streams can basically perform the same tasks as mining on static data sets. Summarization or clustering are important tasks in stream mining, since they help to reduce the amount of data and can provide an overview of the data distribution. Other major tasks are stream classification and outlier detection on data streams. Figure 1.1 shows real world examples of data stream applications where such mining tasks are crucial. Sorting items on a conveyor belt is an application for stream classification (cf. top left of Figure 1.1). In the example the bottles correspond to the data items and the classifier must sort out the broken or dirty bottles while usable bottles can pass. Similar applications are continuous quality checks in production sites such as surface inspection in paper, fabric or foil production (cf. top row of Figure 1.1, middle and right).

The first example in the second row of Figure 1.1 shows an application for both classification and outlier detection. The image shows an offshore wind farm and is associated with remote monitoring of machinery in general. In these applications, measurements such as temperature, pressure or frequency spectra are taken at regular intervals and sent to a remote monitoring center, where the data is analyzed. The analysis can facilitate classification of the current status, detection of abnormalities or prediction of future measurements for hazard warnings. The center image shows a buoy of a tsunami warning system; the image to its right illustrates a network of sensors spread on a glacier in Switzerland. In both cases environmental parameters are measured and sent to base stations, where these measurements constitute a stream of data objects. While classification and outlier detection have priority in both applications, generating models and distributions of the incoming data can be of interest for researchers and decision makers.

Figure 1.1: Examples for real data streams.

The bottom part of Figure 1.1 illustrates the HealthNet project that has been conducted at RWTH Aachen University within the UMIC research cluster (cf. Chapter 8 for details). A central part of the HealthNet project is a body sensor network that measures vital functions of patients or elderly people. Those measurements are transferred to a mobile device for a first analysis and then partially forwarded to a central server, depending on the results of the local analysis. Again, the above mentioned tasks of classifying, detecting, predicting and modeling such patient data are useful and important.

Summarizing the review of data streams in real world scenarios, it becomes clear that basically everywhere data is continuously measured, sent and analyzed, we find a potential application for stream data mining.

From the nature of a data stream and the properties of the system on which the mining algorithm is executed, the following restrictions and consequential requirements emerge:

• endless stream – due to the fact that a data stream is possibly endless, the amount of data must be considered as infinite. As a consequence, algorithms are only allowed a single scan over the data. While some objects can be buffered to be processed again later on, nearly all items can be considered at most once. Moreover, the data must be processed in the order of its arrival; random access as in traditional data mining is missing in the streaming context.

• limited time – since data is continuously arriving, the time to process a single object is limited. Spending more time on an item by not processing other items (dropping, sampling) is prohibitive in many applications such as sorting or outlier detection. Therefore the available time is mostly dictated by the actual or average time between two consecutive stream items. To this end the algorithms must provide fast access and cheap functions to process incoming objects online.

• limited memory – the limitation of physical memory and the infinite amount of data naturally yield the need for a compact representation of historical data using effective data structures. Storing all data objects is in most cases practically infeasible.

• evolving data distribution – maintaining an up-to-date model of the data distribution is crucial for many applications and algorithms. However, in a data stream scenario the underlying model of the data distribution often changes over time; concepts may shift or disappear and novel concepts can emerge. To focus on recent data, algorithms must provide ways to update their models and also forget or down-weight older data. Besides maintaining a valid model, the detection of changes in the data distribution is a new task that emerged in stream data mining.

• noisy data – noise in data streams corresponds to improper data tuples that can result from faulty sensor readings, for example. It is important for a mining algorithm to be robust against noise and offer appropriate ways of noise handling.

• varying data rates – the majority of data streams do not exhibit the same data rate over the entire time of their existence. On the contrary, many applications produce data at massively changing rates, for example due to daytime or seasonal changes in transaction data. Even rather stable data streams show varying data rates over time, such as changes in conveyor belt speed in production. The varying data rates imply varying time allowances, which are in many cases unforeseeable.

For some of the above aspects, effective solutions have been proposed that are shared by many approaches. For limited time and limited space, established methods are available that are discussed in Chapter 3, along with common methods of dealing with evolving data distributions. Coping with a single scan and missing random access is the core of any streaming algorithm. The handling of noise, in contrast, is done in very individual ways (or even left out), since there is no clear definition of noise. Varying data rates and their implications are the focus of the remainder of this chapter.

The requirements for stream mining algorithms demand that items must be processed as they arrive. The available time for processing is naturally limited by the time between two consecutive items and may vary greatly in many applications. Referring back to the data stream applications from Figure 1.1, we take sensor networks as one example. Many sensor networks strive to reduce the amount of data that must be sent, either to spare network bandwidth or energy resources, since sending data consumes much energy and sensors are often battery powered. Therefore, measurements are aggregated or only sent if they deviate significantly from the previous measurement. While this can yield unforeseeable inter-arrival times already for single sensors, it often results in data streams of massively changing data rates at base stations or monitoring centers. More moderate changes in stream rates can be expected at production sites and conveyor belt applications. However, frequencies may differ as well on a daily, weekly or monthly basis. Finally, streams of business data or network traffic data exhibit high volatility for daytime or seasonal reasons.

As discussed previously, sampling or dropping of data items is prohibitive for many applications such as sorting or outlier detection. Since in varying data streams the duration of a burst or the number of objects that arrive within a certain time window is generally not bounded, simply buffering objects until the stream slows down is not an option either. If the buffer size is exceeded, objects are lost. Moreover, buffering objects can largely delay the time of their processing, and hence consecutive actions or reactions may not be initiated on time.

To process each item as it arrives, traditional algorithms must base their computational model on the worst case assumption. More precisely, to be able to keep up even with the highest expected stream rate, the processing time for a single item must be strongly related to the corresponding smallest inter-arrival time. While parallel processing can speed up algorithms, it cannot break the dependency on the smallest time allowance. To finish the computation within a specific time budget, an algorithm can be restricted and tailored to the contracted budget. Clearly this yields large amounts of idle time, since the algorithm would finish its computation in the same time even if the available time were larger by orders of magnitude. While this is a drastic restriction for highly volatile streams, even for rather constant streams with smaller seasonal changes, as in production sites, models must be built according to the minimal time allowance and, hence, the idle times add up and can become massive. Optimally, an algorithm should be able to process an object in a very short time and use any additional computation time to improve its outcome as long as permitted by the application. Consequently, idle times would be reduced or even avoided, the overall output of the algorithm would be improved, and the need to tailor algorithms to an expected or minimal time budget becomes obsolete. The idea of being able to provide a result regardless of the amount of available computation time led to the development of anytime algorithms.

Definition 1.1 Anytime algorithm. For a given input an anytime algorithm can provide a first result after a very short initialization time and it uses additional time to improve its result. The algorithm is interruptible after any time and will deliver the best result available at the point of interruption.
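To make the contract of Definition 1.1 concrete, the following minimal sketch shows one way an interruptible computation can be structured; the example estimates π with the Leibniz series, and all names such as AnytimePiEstimator and run_until_interrupted are hypothetical and not part of this thesis.

    # Toy anytime computation: the estimate is available after the first term of
    # the Leibniz series and improves with every additional term, so the
    # computation can be interrupted at any time.
    import time

    class AnytimePiEstimator:
        def __init__(self):
            self.k = 0
            self.partial_sum = 0.0
            self.improve_result()            # very short initialization: first term

        def improve_result(self):            # one small refinement step
            self.partial_sum += (-1.0) ** self.k / (2 * self.k + 1)
            self.k += 1

        def best_result(self):               # may be called at any point of interruption
            return 4.0 * self.partial_sum

    def run_until_interrupted(algorithm, time_allowance_s):
        deadline = time.perf_counter() + time_allowance_s
        while time.perf_counter() < deadline:
            algorithm.improve_result()
        return algorithm.best_result()

    print(run_until_interrupted(AnytimePiEstimator(), time_allowance_s=0.001))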

Anytime algorithms were first discussed in the artificial intelligence community by Thomas Dean and Mark Boddy in [DB88] and have since been an active field of research [Bod91, Zil96, GZ96]. Recent work includes an anytime A* algorithm [LGT03] and anytime algorithms for graph search [LFG+08]. Related concepts include anyspace [RC05, YWKMN09] and anycost algorithms [EM11].

Anytime algorithms differ from online algorithms and stream algorithms. Online algorithms refer to computing with incomplete information; the input is given one piece at a time [BEY98]. In addition to that, stream algorithms consider limited resources explicitly, for example a fixed maximal amount of time or memory. In contrast to anytime algorithms, neither of the two categories requires the algorithm to be interruptible at any time.

Not every application needs an anytime algorithm. In many applications there may always be enough time to compute even the most detailed model. However, it is unquestioned that in many applications the available data largely increases in both size and complexity. This implies on the one hand that models become more complex, too, leading to higher computation times. On the other hand, the increasing amount of data to process yields smaller time allowances per object. Multi-core processors and parallelization are not a one-size-fits-all solution to increasing complexity and decreasing time allowances. Large enterprises, especially web-scale endeavors, employ the newest hardware and large scale server installations. While this may solve the case for some, others cannot afford giant hardware resources, either due to economic reasons or contextual restrictions. Embedded systems, sensor networks and mobile devices, for example, naturally possess only very limited resources.

The anytime principle is applicable to many mining tasks and other algorithms. Classification is used as an example for the remainder of this section.

Figure 1.2: Anytime classification accuracy curve.

An anytime classifier can provide a first classification result for a given object after a short initialization. If it is not interrupted by the application, the algorithm continues processing the object to improve its classification decision; the accuracy of the result increases with the available computation time. Figure 1.2 shows the curve of a typical evaluation result for an anytime classifier (if the algorithm works correctly): very small time allowances typically lead to medium classification accuracy, and with greater time allowances the accuracy increases asymptotically towards its maximum.

The evaluation principle of anytime algorithms is in a sense opposite to evaluating traditional algorithms. Traditional algorithms are often compared in terms of their efficiency, stating how many resources (most often time) they need to achieve a certain goal. This corresponds to the minimum principle, or principle of thrift, known from economics, which strives to reach a certain output with the least possible input. Anytime algorithms rather correspond to the maximum principle, or principle of yield, which strives to yield the highest possible output given a certain input. Evaluating an anytime classifier hence asks the question "how accurate is the classifier for a given amount of time?". Taking the varying time allowances into account, not only the single time-accuracy value pairs are of interest, but also the accuracy increase between two different time allowances as well as the average accuracy over a given distribution of time allowances. Details on these evaluation methods are provided in Chapter 4.

Anytime algorithms are obviously the best choice for data streams with varying data rates. But also on constant data streams, where the inter-arrival interval is always strictly the same, anytime algorithms can be beneficial and clearly outperform traditional budget algorithms. Consider the first application from Figure 1.1, bottles on a conveyor belt in a brewery, for example. At some point the bottles pass a camera or the like, where features are extracted that are used in a classifier to decide whether the bottle must be sorted out or not. If the time between two bottles is known in advance, let's say 50 milliseconds, then a budget algorithm can be designed that delivers a classification result within that budget of 50 milliseconds. If the budget were higher, the resulting accuracy of the classifier would likely be better, because it has more time and can process a more detailed model. However, the budget algorithm spends the same amount of time on each item, i.e. on each bottle. In reality there are bottles for which the decision is very clear even after a very short time, as in the case of a bottle that is broken or clearly unusable. Here an anytime algorithm can be used to not spend any more time on these certain decisions and to take advantage of the gained extra time for other items, where a clear decision requires a more detailed model. This way, the overall performance of an anytime algorithm (the accuracy in this example) exceeds that of a traditional budget algorithm even on a constant data stream. In Chapter 9, different approaches to harness the strength of anytime algorithms on constant streams are proposed along with details and experimental analyses.
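The maximum principle evaluation sketched above (average accuracy over a distribution of time allowances) can be written down in a few lines. The following snippet is an illustration only; classify_anytime and sample_allowance are hypothetical placeholders for an anytime classifier and for the distribution of time allowances, and none of this is code from this thesis.

    # Sketch: average accuracy of an anytime classifier over a distribution of
    # time allowances. All names are hypothetical placeholders.
    import random

    def average_anytime_accuracy(classify_anytime, test_set, sample_allowance, n=1000):
        """classify_anytime(x, budget_s) -> predicted label,
        test_set: list of (x, true_label) pairs,
        sample_allowance() -> one time budget in seconds."""
        correct = 0
        for _ in range(n):
            x, true_label = random.choice(test_set)
            budget = sample_allowance()      # e.g. drawn from observed inter-arrival times
            correct += (classify_anytime(x, budget) == true_label)
        return correct / n

    # usage with an exponential distribution of time allowances (mean 50 ms):
    # acc = average_anytime_accuracy(my_classifier, test_set,
    #                                sample_allowance=lambda: random.expovariate(1 / 0.05))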

The abundance of data streams and applications and the superiority and usefulness of anytime algorithms on both varying and constant streams motivated the research that is presented in this thesis. Novel methods for anytime classification are developed and the first anytime algorithm for clustering on data streams is proposed. The outline of the thesis is provided in the next section.

1.1 Thesis structure

This thesis consists of five parts: I Introduction, II Anytime Stream Classification, III Anytime Stream Clustering, IV Summary and Outlook, and the appendices in Part V.

The introduction contains the motivation, the background, and the main concepts for stream data mining and anytime algorithms. We have seen that data mining algorithms are important in many applications and that streams are ubiquitous in all areas of our daily life. The special requirements of stream mining algorithms were discussed, and anytime algorithms emerged as the most flexible and the best choice for both varying and constant streams. In the remainder of the introduction, the KDD process and general mining tasks are reviewed in Chapter 2, and specific tools and algorithms for stream data mining are discussed in Chapter 3.

In Part II the focus is on anytime classification on data streams. Novel approaches for anytime stream classification are presented in Chapters 4–7, and an application of the proposed anytime classifier is shown in Chapter 8. Two concepts for using anytime algorithms on constant streams are proposed in Chapter 9; future work in the area of anytime stream classification is discussed in Chapter 10.

In Part III anytime clustering on data streams is introduced. Novel approaches for anytime stream clustering are proposed in Chapters 11–13, and an application of the proposed technique for anytime outlier detection is shown in Chapter 14. Chapter 15 deals with the evaluation of stream clustering algorithms, presenting a software framework for stream mining and a novel concept for the evaluation of clusterings on evolving data streams. Future work in the area of anytime stream clustering is discussed in Chapter 16.

Part IV summarizes the thesis and provides a general outlook for future research in the area of anytime algorithms and stream data mining. The appendices in Part V provide the bibliographic references as well as additional information on the author and his contributions in this thesis in the form of a statement of originality, a list of publications, and a curriculum vitae.

Chapter 2

Knowledge Discovery from Data

This chapter contains background information on the KDD process and introductions to basic data mining tasks, algorithms and challenges. Readers familiar with data mining may want to skip this chapter and proceed with Chapter 3, where more specific techniques for stream data mining are discussed.

2.1 The KDD process and data mining tasks

Knowledge discovery from data is an iterative process that contains multiple steps, which are illustrated in Figure 2.1. Before the actual data mining is performed, the desired data must be extracted and validated. This requires that a thorough preprocessing precedes the mining step.

Data cleaning refers to the removal of noise and inconsistencies in the raw data. These can be missing values or values that are outside the allowed range and can occur, among others, due to faulty measurements, human errors during data acquisition or deliberate manipulation. Integration of data refers to the combination of data from multiple sources, which can be different companies, different users or different time spans. Challenges in data integration include differences in naming conventions, missing attributes or a different schema in general. Data selection is the restriction to relevant data; only those tuples and attributes are retrieved that are interesting and necessary for the impending task. Transformation is the last step in the preprocessing, where the data is altered to meet the requirements of the following mining algorithm. Normalization or unification of value ranges can be performed in this step, as well as aggregation, binarization or other transformations.

Figure 2.1: From data to knowledge: Illustration of the KDD process (cleaning and integration, selection and transformation, data mining, evaluation and presentation) as described in [HK01] and others.

The preprocessing is followed by the actual data mining step, where algorithms are applied that extract patterns from the data. These patterns are, for example, groups of data objects described by representatives, models or rules that are found by the algorithm. While representatives can be used for data reduction, models can identify abnormal objects that strongly deviate from the model, and rules can be employed to classify previously unknown data objects. Examples for data mining tasks are provided below.

The evaluation step is as crucial as the mining step and is often considered a part of the data mining algorithm itself. The goal in this step is the identification of the truly interesting patterns; evaluation is similar to a filter or pruning step that reduces the amount of found patterns to those that are most important according to some interestingness measure. Presentation of the found patterns is most often done by visualization and knowledge representation techniques. It supports the user in exploring and interpreting the results and can provide hints for further mining and analysis steps. Finally, the found patterns can be integrated into the data repositories and used in further iterations of the KDD process.

Data mining is a central step in the KDD process. In the following, the most recognized mining tasks are introduced along with brief explanations and examples. Since two tasks, namely classification (Part II) and clustering (Part III), constitute core parts of this thesis, a formal definition is provided for these as well. In the remainder of this thesis, an object or point is referred to as a d-dimensional tuple or vector, where each dimension is called an attribute or a feature of the object. In general, objects can be from an arbitrary metric space; if not mentioned differently, the d-dimensional Euclidean vector space R^d is assumed as the domain for all objects, and the Euclidean distance between two points x and y is denoted as d(x, y).

Classification. Classification describes the process of assigning a previously unknown object to a category or class based on its features. To this end a classifier builds a so-called model from the given training data in the training phase and uses this model to process new objects in the testing phase.

Definition 2.1 Classifier. For an input space Q and a set of class labels L = {l_1, ..., l_|L|}, a classifier trains a set of internal model parameters Θ based on a set of training data T ⊆ Q × L (and possibly external parameters). A classifier C is associated with a function f_C(Θ, q) that assigns an object q ∈ Q to a class label based on its parameters Θ.

One well-known family of classifiers are decision trees [BFOS84, Qui93]; the left part of Figure 2.2 illustrates an example. A node of a decision tree is associated with an attribute, and the outgoing branches correspond to attribute values. In the example, the risk for accidents is determined based on the two attributes "car type" and "age", possibly useful for an insurance company. While the root node branches according to a match in the categorical attribute "car type", the second node applies a threshold on the continuous attribute "age". A thirty year old person driving a sports car would be assigned a high risk according to this decision tree. The tree is the actual model of the classifier that is learned in the training phase. A way to improve the accuracy of a classification method is to build an ensemble of several classifiers. For a decision tree one could, for example, build several trees from different random samples of the training data. The classification decision can then be obtained by a majority vote on the results of all trained trees.


Figure 2.2: Left: a decision tree classifier. Right: a result of the OPTICS clustering algorithm. (Examples taken from lecture slides corresponding to [HK01].)

Prediction and regression are similar to classification. In regression, the output of the algorithm is not from a limited set of class labels, but can be a numerical value. Prediction refers to finding the most probable future value based on a series of previous values or value tuples. Classification is also termed supervised learning, since the class labels of the training set are known in advance.

Clustering. Clustering refers to grouping data objects such that the objects within a group exhibit a high similarity and objects from different groups are dissimilar. Since, in contrast to classification, the labels of the objects are not known, clustering is also called unsupervised learning. A clustering algorithm takes a data set as an input and returns a set of groups called clusters.

Definition 2.2 Clustering. Given a data set O ⊆ Q from an input space Q, a clustering C = {C_1, ..., C_k} is a set of groups C_i called clusters. Each cluster C_i is a non-empty subset of objects from O: C_i ⊆ O with |C_i| ≥ 1 for all i = 1 ... k.

The most popular clustering methods are so-called k-center clustering algorithms, including k-means [Llo57], k-medoids [KR90] and their variants. They take the number of clusters k as an input parameter and try to find k representative centers such that a given objective is minimized. The objective is most often a variant of the sum over all squared distances from the objects to their closest representative. Since finding the optimal solution to this problem is NP-hard, clustering algorithms heuristically search a local optimum as their solution (a minimal code sketch of this heuristic follows after this overview of mining tasks). Besides k-center clustering algorithms, many other approaches have been proposed, such as density-based clustering methods. One example is the OPTICS algorithm [ABKS99]; the right part of Figure 2.2 shows a result of the algorithm. Each valley in the plot corresponds to a cluster in the data set, and a horizontal line corresponds to a threshold value for the minimal cluster density. If A is chosen as a threshold, the two small groups on the left side are returned as two clusters and the remaining points are considered as a third cluster. Choosing B as a threshold results in three clusters as well (dense areas in the right group of points), but the objects on the left and the scattered points on the right are not included in a cluster.

Outlier analysis. Objects that deviate significantly from the rest of the data set are called outliers. There is no clear definition and, consequently, different approaches can find different outliers in the same data set. In the OPTICS algorithm described above, the objects that were not included in any cluster are considered outliers (cf. threshold B), since their region of the data space does not exhibit a sufficient density. Other algorithms determine outliers using distance-based [KNT00], angle-based [KSZ08] or statistical approaches [BL94].

Frequent patterns / Association rules. Market basket analysis is a common application for association rule mining or frequent pattern mining, where sets of items are searched that frequently occur together. In the example they correspond to products that are often bought together. The goal is to find rules of the form "If customer X buys milk and bread he will also buy cheese (with a certain probability)". Technically, an association rule is of the form "head → body (support, confidence)", where body follows as a consequence from head, support is the number of times that the pattern was observed, and confidence is the support of the rule relative to the frequency of head. Efficient methods to compute all association rules for given support and confidence values have been proposed in the literature [AIS93, AS94, HPY00].
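The k-means heuristic referenced above can be sketched as follows; this is an illustration with a simple random initialization, not code from this thesis.

    # Sketch of the k-means heuristic: alternate between assigning each point to
    # its closest center and recomputing each center as the mean of its members,
    # until the assignment no longer changes (a local optimum).
    import random

    def euclidean(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    def kmeans(points, k, max_iter=100):
        centers = random.sample(points, k)        # simple random initialization
        assignment = [None] * len(points)
        for _ in range(max_iter):
            new_assignment = [min(range(k), key=lambda j: euclidean(p, centers[j]))
                              for p in points]
            if new_assignment == assignment:
                break
            assignment = new_assignment
            for j in range(k):
                members = [p for p, a in zip(points, assignment) if a == j]
                if members:
                    centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        return centers, assignment

    # usage on a toy two-dimensional data set
    data = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.1), (4.8, 5.3), (9.0, 0.9), (9.2, 1.1)]
    print(kmeans(data, k=3))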

Concept description (characterization, generalization, discrimination). Characterization is a summarization of the general characteristics or features of a group of objects, such as age, job and income of people that regularly go on a cruise. Generalization is an automated process that groups objects based on their characteristics; an example is attribute oriented induction [CCH91]. Discrimination finds characteristics that separate two groups of objects (contrasting classes), for example those customers that frequently buy product X and those that rarely buy it.

In this thesis, new methods are mainly developed in the areas of classification and clustering. A more detailed review of existing approaches in those areas can be found in Sections 2.2 and 3.2 for classification and Sections 2.3 and 3.3 for clustering, where the corresponding section in Chapter 3 contains the stream-specific methods. As an application of a proposed technique, outlier detection is discussed in Chapter 14; the corresponding related work on outlier detection is discussed in that chapter. For a more detailed general introduction to data mining please refer to [HK06]. In the following section, challenges in data mining are presented that arise due to certain properties of the data or requirements of the application. A broader introduction to those topics can be found in [KHY+09].

2.1.1 Challenges in data mining

Many mining algorithms rely on a notion of similarity between objects or representatives: clustering algorithms group objects based on similarity, a classification algorithm can assign labels based on similarities, etc. Efficient mining algorithms therefore often require methods that allow for fast similarity search in large data sets. Similarity is most often assessed by the use of a distance measure such as the Euclidean distance. To speed up similarity search, a plenitude of methods and data structures have been proposed that facilitate fast access. In hierarchical index structures, directory information is organized in a tree structure. When executing a similarity search query, these methods try to access only very small parts of the data by steering the search using the tree structure. Examples for index structures include the popular B-trees [Com79], R-trees [Gut84, BKSS90], M-trees [CPZ97] and KD-trees [Ben75], as well as specialized methods such as the RI-tree [KPS00] or the TS-tree [AKAS08]. Besides hierarchical index structures, many other approaches for fast similarity search have been proposed, including bitmap indexes [CI98] or hashing techniques [DIIM04].

A major challenge are very high dimensional spaces. In high dimensional spaces distances lose their expressive power, since all distances tend to be similar. This problem is known as the curse of dimensionality, which affects similarity search and hence also mining algorithms. Specialized index structures for high dimensional spaces have been proposed, such as X-trees [BKK96] and TV-trees [LJF94]. The VA-file [WSB98] constitutes a filter-and-refine approach that first computes a set of candidates using a lower dimensional representation of the data and refines the set in the full space in a second step. Other lines of research focus on approximate similarity search in high dimensional spaces [HS05] or try to avoid using distances and determine the association between objects by means of shared-neighbor information [Hou08, HKK+10]. To circumvent the problems resulting from the similarity of distances in high dimensional spaces, subspace clustering algorithms search for solutions in different subsets of the attributes [AGGR98, APW+99, AY00, CFZ99, AKMS08]. Since the number of different subspaces is exponential in the dimensionality of the original space, these algorithms are faced with a huge combinatorial complexity.
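The distance concentration effect described above can be observed directly; the following small experiment (an illustration, not an experiment from this thesis) draws random points from the unit hypercube and reports the ratio between the farthest and the nearest neighbor distance of a query point for growing dimensionality.

    # Sketch: distance concentration in high dimensions. For uniformly random
    # points in the unit hypercube, the ratio of the farthest to the nearest
    # neighbor distance approaches 1 as the dimensionality d grows.
    import random

    def distance_contrast(d, n=500, seed=0):
        rng = random.Random(seed)
        query = [rng.random() for _ in range(d)]
        points = [[rng.random() for _ in range(d)] for _ in range(n)]
        dists = [sum((q - p) ** 2 for q, p in zip(query, point)) ** 0.5
                 for point in points]
        return max(dists) / min(dists)

    for d in (2, 10, 100, 1000):
        print(d, round(distance_contrast(d), 2))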

Time poses different challenges on data mining algorithms, ranging from volatile databases to dynamic and endless data streams. Time aspects, especially requirements and solutions for stream data mining, have been discussed in Chapter 1 and are the main topic throughout the thesis. A different notion of time in data sets and mining algorithms is introduced by time series data. Time series denote sequences of values over time which are available entirely before processing them in an algorithm. Hence, in contrast to dynamic data streams, time series constitute a static type of data. Mining on time series data includes searching for reoccurring or unusual patterns to discover trends or conspicuous behavior [AS95, HDY99, KGI+11].

A further challenge lies in the structure of the data. New technologies and new phenomena yielded complex data types such as multimedia data [Sub98], text data [SM84], graph data as in social networks [New03], gene sequences and molecular structures [BB01] and other domain-specific data types. For multimedia data, even a single image is often represented by a large number of features describing color, texture or salient points in the image [HLZ02]. Specialized distance measures have been proposed that are claimed to reflect the human perception of similarity [SK97, RTG00]. Mining video data has many applications such as video copy detection, but remains a major challenge in data mining research [CZ06, AKS10]. Other complex data types pose their own difficulties but also offer new questions and tasks. From there many new directions in data mining research emerged, such as graph mining or text and semantic mining [WM03, WIZD04, WF94].

Although pattern evaluation has always been a step in the KDD process (cf. Figure 2.1), it has received new attention with the advent of complex methods and data types. One example is the exploration of subspace clustering results [MAK+08], where the number of found clusters can sometimes be larger than the actual size of the data set. This is due to the exponential number of different subspaces in which clusters can be found. More precisely, since generally objects can be in more than one cluster, the found clusters often largely overlap with respect to the contained objects. To this end, measures for redundancy and interestingness have been proposed and included into algorithms either during the cluster search or as a post-processing component [MAG+09a, AKMS08]. Similar issues and solutions apply for other mining tasks such as frequent item set mining [BEX02].

Finally, there is a list of other issues that reveal limitations of mining algorithms and open up new research directions. Examples are privacy and security issues that either restrict the access to data or the validity of results [CM96, VBF+04]. Parallel and distributed data mining algorithms try to meet the requirements of growing data repositories by efficient strategies to share data and computation across large networks of servers and clients [PCY95, AS96, PHBB09]. The popularity of sensor networks, global positioning systems (GPS), cellular phones, other mobile devices and RFID technology yielded vast amounts of moving object data that call for effective methods for analysis and mining [DeC97, PZZ+07, CBB08].

The internet can be seen as a synonym for many of the challenges mentioned above. It represents graph data, text data and heterogeneous data, poses privacy and security issues, is massively distributed and parallel and has obviously a tremendous size.

2.2 Classification

In this section the main concepts and approaches proposed for classification on static data sets are reviewed. Stream classification algorithms are discussed in Section 3.2. Before going into detail on the individual approaches, a more formal description of a classifier based on Definition 2.1 is provided.

Given an input space Q of dimensionality d and a set of labels L = {l_1, ..., l_|L|}, the extended input space of labeled objects is denoted as Q⁺ = Q × L. The power set P(Q) of Q contains all possible subsets of Q; P(Q⁺) is defined analogously for Q⁺. The training set for a classifier can then be written as T ∈ P(Q⁺) and a test object for classification as q ∈ Q. T_l = {o ∈ T | o = (o_1, ..., o_d, l)} denotes the set of objects from T with label l.

Any classifier C can be described by a model M, which is in turn defined as a set of parameters M = {ψ_1, ..., ψ_|M|}. The domains of the parameters are denoted as Dom(ψ), and Dom(M) := Dom(ψ_1) × ... × Dom(ψ_|M|) is the cross product of all parameter domains. A configuration or instantiation of the model M is a set of values Θ ∈ Dom(M) for its model parameters: Θ = (θ_1, ..., θ_|M|) with θ_i ∈ Dom(ψ_i) for all i = 1 ... |M|.

To build a classifier, any approach must first select a model and second choose, or optimize, its parameter values. To this end it can consider the training set T and possibly a set of external parameters Φ = {φ_1, ..., φ_|Φ|}. Similar to the above, the parameter domains are denoted as Dom(φ) and Dom(Φ) := Dom(φ_1) × ... × Dom(φ_|Φ|). A set of external parameter values is called Ξ = {ξ_1, ..., ξ_|Ξ|} with ξ_j ∈ Dom(φ_j) for all j = 1 ... |Φ|. The selected model is

M = η_C(Ξ, T)    (2.1)

where η : Dom(Φ) × P(Q⁺) → 𝕄 is a model selection function that selects a model M from the model space 𝕄 based on the training set and external parameters. The values for the model parameters are chosen in a second step as

Θ = π_C(M, T)    (2.2)

where π : 𝕄 × P(Q⁺) → Dom(𝕄) is a parameter optimization that is based on the selected model and the training set, and Dom(𝕄) = ⋃_{M ∈ 𝕄} Dom(M). Finally, the label of a test object q ∈ Q is determined by the classifier as

l = f_C(Θ, q)    (2.3)

using a decision function f_C : Dom(𝕄) × Q → L that assigns a label based on the model parameters Θ. Thus, without loss of generality, the general task of a classifier C is to determine

l = f_C(π_C(η_C(Ξ, T), T), q).    (2.4)

Finding the set of parameter values Θ = π_C(η_C(Ξ, T), T) is usually referred to as the training of a classifier, also called learning or supervised learning, since the class labels of the training set are known. Applying the decision function on a set of test objects is often termed testing of a classifier. The performance of a classifier is mostly measured in terms of its classification error err or classification accuracy acc = 1 − err, corresponding to the proportion of correctly classified objects. Examples for further analysis and performance measures like the confusion matrix, sensitivity, precision or the ROC curve can be found in [HK06]. To assess the performance of a classifier on some given data set D ∈ P(Q⁺), an m-fold cross validation splits D into m equally large parts (folds) and successively uses one part for testing and the remaining parts for training. A different approach called bootstrapping uses m subsets D_i of size |D_i| = |D| sampled randomly from D with replacement and uses D_i for training and D \ D_i for testing, for i = 1 ... m (both procedures are sketched below).

For the plethora of classification algorithms, different approaches have been suggested to categorize the proposed methods [HK01]. One approach distinguishes lazy learners and eager learners, while the latter category is larger by far. Lazy learners spend little or no computation time on model selection and parameter optimization, but defer the work to the decision phase. Eager learners on the other hand exhibit more time consuming training procedures and are in turn often more efficient in the decision phase. A different categorization separates generative and discriminative methods. The objective in training generative classifiers is to build a model that best describes the classes L as given by T, as if the model was generated by T. This is opposed to discriminative methods, which seek to best separate, or discriminate, the different classes. Neither of these categorizations is followed here, but the terminology is referred to when describing the single methods.

Among the presented classification approaches, Bayesian classification is described in more detail, since it will be heavily used throughout the thesis. Support vector machines (SVM) and decision trees receive more attention than the remaining approaches, since both relate to parts of the methods presented in Part II. For SVMs a simple transformation to Gaussian mixture models is discussed, and decision trees connect to the proposed methods by constituting a hierarchical approach.
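The two evaluation schemes just described can be sketched as follows; train and classify are placeholders for an arbitrary classifier, and the code is an illustration rather than the evaluation setup used in this thesis.

    # Sketch: accuracy estimation by m-fold cross validation and by bootstrapping.
    # `train(data)` returns a model, `classify(model, x)` predicts a label.
    import random

    def cross_validation(data, m, train, classify):
        folds = [data[i::m] for i in range(m)]            # m roughly equal parts
        accuracies = []
        for i in range(m):
            test = folds[i]
            training = [o for j, fold in enumerate(folds) if j != i for o in fold]
            model = train(training)
            correct = sum(classify(model, x) == label for x, label in test)
            accuracies.append(correct / len(test))
        return sum(accuracies) / m

    def bootstrap(data, m, train, classify, seed=0):
        rng = random.Random(seed)
        accuracies = []
        for _ in range(m):
            sample = [rng.choice(data) for _ in data]     # |D_i| = |D|, with replacement
            held_out = [o for o in data if o not in sample]   # approximates D \ D_i
            model = train(sample)
            correct = sum(classify(model, x) == label for x, label in held_out)
            accuracies.append(correct / max(1, len(held_out)))
        return sum(accuracies) / m

    # usage with a trivial majority-class "classifier"
    def train(train_set):
        labels = [l for _, l in train_set]
        return max(set(labels), key=labels.count)

    data = [((i,), "a" if i < 5 else "b") for i in range(10)]
    print(cross_validation(data, 5, train, lambda model, x: model))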

Bayesian classification

Bayesian classification constitutes a statistical approach in which an object q ∈ Q is assigned to a class label l ∈ L based on membership probabilities. It is based on the Bayes theorem and one of the most frequently used methods for classification. The Bayes theorem for two events A and B is

P(A|B) P(B) = P(B|A) P(A),    (2.5)

where P(A) is the prior or a priori probability of A and P(A|B) is the posterior probability of A conditioned on B, also called the conditional probability of A given B. In the context of a classifier, given a set of labels L = {l_1, ..., l_|L|} and an object o ∈ Q, P(l_i) and P(o) are the prior probabilities for the labels and the object, respectively. P(l_i|o) is the posterior probability of label l_i given object o, and P(o|l_i) is called the conditional probability of o given label l_i. Using Equation 2.5, the Bayes classifier assigns to an object o the label l̂ that yields the maximal posterior probability:

l̂ = f_Bayes(Θ, o) = arg max_{l∈L} { P(l|o) } = arg max_{l∈L} { P(o|l) P(l) / P(o) }    (2.6)

The probabilities on the right hand side of Equation 2.6 can be estimated from T as follows. The class priors are estimated as the relative frequency of the labels in T :

P(l) = |{o ∈ T | o = (o_1, ..., o_d, l_o) ∧ l_o = l}| / |T|    (2.7)

The prior for the object is computed as

P(o) = ∑_{l∈L} P(o|l) P(l).    (2.8)

However, as in the computation of l̂ the object o is not varied in the decision set {P(l|o)}, the term P(o) can be left out in Equation 2.6, since it merely normalizes the probabilities but does not affect the decision (simultaneous classification of several objects is discussed in Chapter 9). The way of estimating the class conditional probability P(o|l) defines different kinds of Bayesian classifiers. The approach known as naïve Bayes assumes class conditional independence of all dimensions in Q, which allows calculating the posterior probability for o as

P(o|l) = ∏_{i=1}^d P(o_i|l).    (2.9)

The class conditional probabilities P(o_i|l) for the single dimensions can again be easily computed from T by counting. Estimating P(o_i|l) requires different approaches for categorical and continuous attributes.

For a categorical attribute A_i with possible values a_i1, ..., a_iu, the probability is looked up in so-called conditional probability tables, in which it is precomputed for every possible value a_ij as

P(a_ij|l) = |{o ∈ T | o = (o_1, ..., o_d, l_o) ∧ l_o = l ∧ o_i = a_ij}| / |T_l|    (2.10)
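As an illustration of Equations 2.7, 2.9 and 2.10, the following sketch trains a naïve Bayes classifier on categorical attributes by counting and classifies a new object with the arg-max rule of Equation 2.6 (P(o) is omitted as discussed above, the Laplace correction mentioned below is left out for brevity, and the code is illustrative rather than taken from this thesis).

    # Sketch of a categorical naive Bayes: estimate P(l) and P(o_i | l) by
    # counting (Equations 2.7 and 2.10) and classify with the arg-max rule of
    # Equation 2.6, leaving out the constant factor P(o).
    from collections import Counter, defaultdict

    def train_naive_bayes(training_set):
        """training_set: list of (attribute_tuple, label) pairs."""
        label_counts = Counter(label for _, label in training_set)
        value_counts = defaultdict(Counter)        # (dimension, label) -> value counts
        for attributes, label in training_set:
            for i, value in enumerate(attributes):
                value_counts[(i, label)][value] += 1
        return label_counts, value_counts, len(training_set)

    def classify(model, q):
        label_counts, value_counts, n = model
        def posterior(label):
            p = label_counts[label] / n                                     # P(l)
            for i, value in enumerate(q):
                p *= value_counts[(i, label)][value] / label_counts[label]  # P(o_i | l)
            return p
        return max(label_counts, key=posterior)

    # usage: two categorical attributes (car type, age group)
    data = [(("sports", "young"), "high"), (("sports", "old"), "high"),
            (("family", "young"), "high"), (("truck", "young"), "low"),
            (("truck", "old"), "low"), (("family", "old"), "low")]
    print(classify(train_naive_bayes(data), ("family", "young")))          # -> "high"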

Dropping the assumption of the naïve Bayes allows introducing dependencies between the attributes by defining the probability of attribute A having value a_i conditioned on attribute B having value b_j. For categorical attributes, the dependencies and the resulting probability distributions can be described by Bayesian networks, also termed belief networks or Bayesian belief networks. A Bayesian network B = ⟨G, Θ⟩ is defined by a directed acyclic graph G and a set of parameter values Θ. G = (V, E) contains one vertex for each attribute and one vertex for L. An edge (v_1, v_2) ∈ E signifies that v_2 is conditionally dependent on v_1; there are exactly d edges in the Bayesian network for the naïve Bayes, one from L to each dimension (cf. Figure 2.3 left). For a vertex v ∈ V, let pa(v) denote the parents of v in G. Then the joint probability from Equation 2.9 for a Bayesian network classifier is

P(o|l) = ∏_{i=1}^d P(o_i | l, {o_j | v_j ∈ pa(v_i)}).    (2.11)

Θ in B comprises conditional probability tables that store a probability per attribute v_i for each combination of values from v_i, pa(v_i) and L. Considering the example in the right part of Figure 2.3, let us assume |L| = 4 and |A_i| = 10 for all i = 1 ... d, i.e. each attribute can have ten different values. This yields

|Θ| = 4 + 4·10² + 4·10·5 + 4·10³ = 4604    (2.12)

where the four terms correspond to the class priors P(l) and to the conditional probability tables of A_1, of A_2, ..., A_6, and of A_7, respectively.


Figure 2.3: Examples of Bayesian networks for the naïve Bayes (left) and with additional dependencies (right).

This is opposed to |Θ| = 4 + 4·10·7 = 284 for the naïve Bayes on the same example (cf. Figure 2.3 left). Besides higher space demands, the large number of parameters can frequently lead to zero probabilities, since certain value combinations do not occur in T. These zero probabilities can cancel the effect of all other posterior probabilities; if in the above example six attributes yielded a high probability, a zero probability in the seventh attribute would spoil the output. To avoid this, a Laplace correction can be applied that initially sets each count in the probability tables to one and takes this into consideration during the normalization.

The model selection η (cf. Equation 2.1) for a Bayesian network classifier refers to the determination of the network topology (defining the edges in G). Proposed solutions are, for example, based on maximal correlation coefficients [Edw00, SS05] or use hill climbing methods to iteratively add the most promising edges [KP02]. The parameter optimization π (cf. Equation 2.2) for Bayesian networks as introduced above can simply compute the probabilities in a generative way from T (with or without Laplace correction) or can iteratively perform cross validation on T to find parameter values that lead to the best discrimination of the classes in L. For Bayesian networks with hidden variables (not discussed in this thesis) there is an abundance of literature on network inference; introductions and examples can be found in [RN95] and [Jen96].

For continuous attributes, the probabilities P(o|l) and P(o) in Equation 2.6 are substituted by density estimation functions denoted as p(o|l) and p(o).

Usually the class conditional density p(o_i|l) for a given object in dimension i is computed using a parameterized density function g(o_i, Θ_l), assuming a certain distribution of the attribute values. The parameter values Θ_l for the distributions of label l are derived from T. The most popular example is the Gaussian distribution or normal distribution.

Definition 2.3 The Gaussian distribution N(µ_i, Σ_i) for a random variable X with mean value µ and variance σ² is described by the probability density function

g(x, µ, σ) = 1 / √(2πσ²) · e^(−(x−µ)² / (2σ²))

For a set {X_1, ..., X_d} of d random variables the joint probability distribution is

g(x⃗, µ⃗, Σ) = 1 / ((2π)^(d/2) √|Σ|) · e^(−(1/2) (x⃗−µ⃗) Σ⁻¹ (x⃗−µ⃗)^T)

where x⃗ ∈ R^d is a d-dimensional vector, µ⃗ = (µ_1, ..., µ_d) is the vector containing the mean values of X_1, ..., X_d, Σ is the corresponding covariance matrix, |Σ| its determinant and Σ⁻¹ its inverse.

For convenience, g(x, µ, Σ) is used instead of g(x⃗, µ⃗, Σ) for the d-dimensional case in the following. The joint probability distribution for continuous attributes naturally includes dependencies between all attributes via the covariance matrix Σ. For the naïve Bayes using Gaussian distributions, the covariance matrix constitutes a diagonal matrix Σ = diag(σ_1², ..., σ_d²) and the posterior probability for an object o simplifies to

P(o|l) = g(o, µ, Σ) = 1 / ((2π)^(d/2) ∏_{i=1}^d σ_i) · e^(−(1/2) ∑_{i=1}^d (o_i−µ_i)² / σ_i²)    (2.13)

Using a single Gaussian distribution to describe a set of objects is also called a unimodal model. A more flexible density estimation function is given if several Gaussian distributions are combined.

Definition 2.4 A Gaussian mixture model (GMM) with k components is a weighted combination of k Gaussian distributions with parameters (µ_i, Σ_i), i = 1 ... k, and is defined by the probability density function

g_GMM(x, Θ) = ∑_{i=1}^k w_i · g(x, µ_i, Σ_i)

where Θ = {µ_1, ..., µ_k, Σ_1, ..., Σ_k, w_1, ..., w_k} is the set of all parameter values and w_i is the weight of the i-th Gaussian, subject to ∑_{i=1}^k w_i = 1.
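To make Definitions 2.3 and 2.4 concrete, the following sketch evaluates a Gaussian mixture density at a point, restricted to diagonal covariance matrices as in the naïve Bayes case of Equation 2.13 (illustration only, not code from this thesis).

    # Sketch: evaluate a Gaussian mixture density (Definition 2.4) with diagonal
    # covariance matrices, i.e. each component is a product of 1-d Gaussians.
    import math

    def gaussian_diag(x, mean, var):
        p = 1.0
        for xi, mi, vi in zip(x, mean, var):
            p *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2 * math.pi * vi)
        return p

    def gmm_density(x, weights, means, variances):
        """g_GMM(x, Theta) = sum_i w_i * g(x, mu_i, Sigma_i), weights summing to 1."""
        return sum(w * gaussian_diag(x, m, v)
                   for w, m, v in zip(weights, means, variances))

    # usage: a mixture with two components in two dimensions
    theta = dict(weights=[0.7, 0.3],
                 means=[(0.0, 0.0), (3.0, 3.0)],
                 variances=[(1.0, 1.0), (0.5, 0.5)])
    print(gmm_density((0.5, -0.2), **theta))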

Besides unimodal models and mixtures of Gaussians, kernel density estimators constitute the most detailed model for density estimation based on training data. Kernel density estimation is detailed in Chapter 4, where kernels are employed in a novel anytime classification approach.

The model selection η for a Bayes classifier on continuous attributes can hence be viewed as selecting the type of density function, for example the number and type (naïve or not) of Gaussians. The parameter optimization π can again determine Θ from T in a generative way by using the EM clustering algorithm (cf. Chapter 6) or in a discriminative way by using max margin optimization for example (cf. Chapter 7). A more detailed introduction to Bayesian classification as well as further references can be found in [DHS01].
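To make Equation 2.13 concrete, the following minimal Python sketch (not part of the thesis; all function names and toy parameter values are illustrative) evaluates the diagonal-Gaussian class-conditional density and returns the label with the highest posterior score.

```python
import math

def diag_gaussian_density(o, mu, sigma):
    """Class-conditional density p(o|l) of Eq. 2.13: product of d univariate Gaussians."""
    d = len(o)
    exponent = -0.5 * sum((o[i] - mu[i]) ** 2 / sigma[i] ** 2 for i in range(d))
    norm = (2 * math.pi) ** (d / 2) * math.prod(sigma)
    return math.exp(exponent) / norm

def naive_bayes_classify(o, params, priors):
    """Label maximizing P(l) * p(o|l); params maps each label to (mu, sigma) per dimension."""
    return max(priors, key=lambda l: priors[l] * diag_gaussian_density(o, *params[l]))

# toy example: two classes in two dimensions (values are illustrative)
params = {"A": ([0.0, 0.0], [1.0, 1.0]), "B": ([3.0, 3.0], [1.0, 1.0])}
priors = {"A": 0.5, "B": 0.5}
print(naive_bayes_classify([2.5, 2.8], params, priors))  # prints "B"
```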

Support vector machines

Support vector machines (SVM) constitute discriminative classifiers that seek to find decision boundaries, called separating hyperplanes, which best separate the objects from different labels. A binary SVM considers objects from two different classes denoted as +1 and −1 (l ∈ {+1, −1}). The decision function of a binary SVM is

$$f_{SVM}(\Theta, q) = \text{sign}\left(\sum_{i=1}^{s} l_i \alpha_i K(x_i, q) - b\right), \qquad (2.14)$$

where $\Theta = \{(x_i, \alpha_i) \,|\, i = 1 \ldots s\} \cup \{b, K(\cdot,\cdot)\}$ stores the bias b, the kernel K(·,·), the support vectors $x_i \in T$ and their weights $\alpha_i$. s is the number of support vectors, $l_i \in \{+1, -1\}$ is the label of $x_i$, and examples for kernels K(·,·) are

polynomial kernel of degree h: $K(x_i, x_j) = (x_i \cdot x_j + 1)^h$

radial basis function (RBF) kernel: $K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$

sigmoid kernel: $K(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j - \delta)$
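The decision function of Equation 2.14 can be sketched in a few lines. The snippet below is an illustrative example with an RBF kernel; the support vectors, weights and bias are toy values, not a trained model.

```python
import math

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def svm_decide(q, support_vectors, labels, alphas, b, gamma):
    """Binary SVM decision function of Eq. 2.14 with an RBF kernel."""
    s = sum(l * a * rbf_kernel(x, q, gamma)
            for x, l, a in zip(support_vectors, labels, alphas))
    return 1 if s - b >= 0 else -1

# toy model with two support vectors (all values illustrative)
sv = [[0.0, 0.0], [2.0, 2.0]]
print(svm_decide([1.8, 1.9], sv, labels=[-1, +1], alphas=[0.7, 0.7], b=0.0, gamma=0.5))
```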

The training of an SVM determines the weights αi for all objects oi ∈

T. Those objects with corresponding $\alpha_i > 0$ are used as support vectors and are included in Θ. The actual training involves solving a constrained quadratic optimization problem; introductions to and details of SVMs can be found in [Vap95, Vap98, Bur98, Pla98].

In the case of |L| > 2 multiple binary SVMs are combined. SVM classifiers for |L| > 2 can for example train |L| binary SVMs, each separating the objects from one label against all other objects, or quadratically many binary SVMs, one for each pair of labels. The final decision can then be reached by majority voting or by interpreting the outputs of the single SVMs as probabilities. On a side note, there is a simple transformation between an SVM using RBF kernels and a Bayes classifier using Gaussian mixture models with equal variance σ for each Gaussian and each dimension, which yields the exact same decision boundaries for both classifiers [DHN10]. Using the notation from Equation 2.13 the transformation is

$$x_i = \mu_i \qquad (2.15)$$
$$\alpha_i = P(l_i) \cdot w_i \cdot \frac{1}{(2\pi)^{\frac{d}{2}} \sigma^d} \qquad (2.16)$$
$$\gamma = \frac{1}{2\sigma^2} \qquad (2.17)$$

Transforming an SVM into GMMs, the bias b can be sufficiently well approximated by an additional Gaussian with arbitrary mean, very high variance and weight w proportional to b.

Decision tree classifier

An example of a simple binary decision tree was given in Figure 2.2, where the risk of an accident was determined based on the car type and the age of the driver. Generally, decision trees are not restricted to binary nodes but are allowed higher branching factors. The binary case is introduced below; the general case naturally follows from the description.

Definition 2.5 A decision tree node ν = (Ai, aij, p1, p2) stores a splitting attribute Ai, a split value aij and two pointers p1 and p2 to the left and right subtree. A leaf node ν is associated with a set of training objects T|ν and an inner node with the union of its subtree data sets T|ν = T|ν1 ∪ T|ν2.

For categorical attributes, aij splits T|ν into the two sets T|ν1 = {o | oi = aij} and T|ν2 = {o | oi ≠ aij}, for continuous attributes the two resulting sets are

T|ν1 = {o | oi < aij} and T|ν2 = {o | oi ≥ aij}. For higher branching factors, additional values for the same attribute are stored in ν as well as additional pointers to the corresponding subtrees. The training of a decision tree is often referred to as decision tree induction. Popular algorithms for decision tree induction are ID3 [Qui86], C4.5 [Qui93] and CART [BFOS84], which all follow a greedy top-down approach to partition T recursively into smaller subsets. The algorithms seek to find the best split point, which is the best attribute Ai and attribute value aij, for the current node ν with respect to some objective function. A popular objective, which is also used in ID3, is the information gain

$$IG(\nu) = Ent(\nu) - \sum_{u=1}^{2} \frac{|T_{|\nu_u}|}{|T_{|\nu}|} \, Ent(\nu_u), \qquad (2.18)$$

where Ent(ν) is the entropy of node ν defined as

$$Ent(\nu) = -\sum_{l \in L} \frac{|T_{l|\nu}|}{|T_{|\nu}|} \log\left(\frac{|T_{l|\nu}|}{|T_{|\nu}|}\right). \qquad (2.19)$$

$T_{l|\nu}$ extends the notation of T|ν to the objects with label l. For higher branching factors the sum in Equation 2.18 considers all subtrees. Since this yields a bias towards nodes with a larger fanout, other methods such as C4.5 use the gain ratio as their objective, which overcomes the bias by normalizing the information gain with the potential information for the split [HK06]. The induction terminates if, for example, either all objects in T|ν have the same label or the same attribute values. There is a vast amount of literature on scaling decision tree induction to large data bases or improving decision trees, which is beyond the scope of this thesis. During classification a query object q ∈ Q follows the path from the root of the decision tree to a leaf ν̂ according to its attribute values. The basic version of a decision tree then assigns a label based on majority voting on

T|ν̂. Advanced versions associate distributions with the leaves of the tree and apply for example a Bayes classifier using these distributions [Koh96].
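As a small illustration of Equations 2.18 and 2.19, the following sketch (illustrative, not taken from the referenced systems) computes the entropy of a node and the information gain of a binary split from lists of labels.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a node, Eq. 2.19, computed from the labels of its objects."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, left_labels, right_labels):
    """Information gain of a binary split, Eq. 2.18."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(
        len(part) / n * entropy(part) for part in (left_labels, right_labels))

# toy split of a node whose objects carry the labels 'risk' and 'safe'
parent = ["risk"] * 4 + ["safe"] * 4
print(information_gain(parent, ["risk"] * 3 + ["safe"], ["risk"] + ["safe"] * 3))
```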

Nearest neighbor classification

The nearest neighbor classifier (NN) [Das90, Sei09] is a simple classifier that belongs to the category of lazy learners. The idea is to assign to an object q ∈ Q the label l̂ ∈ L of the closest object o ∈ T with respect to a given distance function dist(·, ·) ∈ Ξ:

$$f_{NN}(\Theta, q) = \hat{l} \;\Leftrightarrow\; \hat{o} = (\hat{o}_1, \ldots, \hat{o}_d, \hat{l}) \,\wedge\, \hat{o} = \underset{o \in T}{\arg\min} \{dist(o, q)\}$$

Here, the training set forms the parameters: Θ = T. Ties can be broken assuming a canonical ordering of the labels. Variants of the nearest neighbor classifier use k > 1 objects, where k ∈ Ξ is an external parameter. The decision based on the k resulting labels can be done using simple majority voting, weighting the frequencies with the class priors or weighting the influence of each object by dist(o, q)⁻¹.
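A minimal k-nearest-neighbor classifier along the lines of this section can be sketched as follows; the Euclidean distance and the toy training set are assumptions made only for the example.

```python
import math
from collections import Counter

def knn_classify(q, training_set, k=1):
    """k-nearest-neighbor classification with Euclidean distance and majority voting.
    training_set is a list of (feature_vector, label) pairs, i.e. Theta = T."""
    neighbors = sorted(training_set, key=lambda o: math.dist(o[0], q))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

T = [([0.0, 0.0], "A"), ([0.5, 0.2], "A"), ([3.0, 3.1], "B"), ([2.8, 3.3], "B")]
print(knn_classify([2.9, 3.0], T, k=3))  # prints "B"
```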

Ensemble classifiers

Ensemble methods combine several classifiers and arrive at a decision based on the combined outputs. As an example, classifiers of different kinds can be trained on T and the label l ∈ L for an object q ∈ Q is determined via majority voting. Bagging (bootstrap aggregation) [Bre96] is a popular ensemble method that trains k classifiers (usually of the same type) on k different bootstrap samples from T. Another popular method is boosting [FS97], which weights the results of the single classifiers with respect to their performance on T. Moreover, in boosting the objects are weighted during training and the k classifiers are trained successively, giving misclassified objects higher weights in subsequent trainings such that more attention is paid to them.

Neural networks

Neural networks are well known and very well researched. For a neural network classifier one must select a network topology and the type of neurons in a first step and learn or optimize the parameters of the network in a second step. A frequently used training method for neural networks is backpropagation [RHW86, HB87, Jac88]. Using RBF-style neurons in a neural network yields the same decision hyperplane as an SVM with RBF kernels [HK06].

Besides the discussed classification algorithms there are a number of additional approaches, including case-based reasoning and genetic algorithms as well as rough set and fuzzy set approaches; please refer to [HK06] for a broader introduction. Each classification paradigm has its strengths and weaknesses. Bayesian classifiers, for example, are said to be less prone to overfitting since the employed probability distribution functions can provide an effective way of generalization. Clearly, there is no single solution that outperforms all other approaches on all domains and settings. Finding the best classifier for an actual application involves tuning and testing of various methods and is usually not addressed in the classification literature.

2.3 Clustering

Clustering aims to find groups of data such that objects exhibit a high similarity within the groups and a low similarity between the groups (cf. Section 2.1). Research on clustering algorithms looks back on a long history. Since clustering is useful for many applications, approaches have been developed in different areas and from different perspectives. In this section the most prominent and frequently used clustering algorithms are reviewed. The nature of clustering algorithms is often more heuristic compared to classification algorithms, and their description therefore sometimes appears less formal. This holds true to a similar extent when it comes to the evaluation of clusterings that were found by an algorithm. A ground truth is hardly ever available; and even if it is, heuristics are needed to assess the quality of a found clustering. A wealth of measures can be found in the literature, some of which rely on a given ground truth clustering for comparison while others don't. The measures employed in Part III are commonly used and will be introduced where needed. The matter of clustering evaluation on evolving data streams is discussed in Chapter 15.

Definition 2.2 defines a clustering C = {C1,...,Ck} as a set of groups

Ci, each of which is a non-empty subset of objects from a given data set O. This general definition allows both overlapping clusters (an object can be contained in more than one cluster) and unclustered objects, which are not contained in any cluster. Many clustering algorithms have been proposed and the approaches restrict these properties differently; some approaches allow only non-overlapping clusters, for example. An exhaustive survey is beyond the scope of this thesis. The following provides an overview of the main concepts and most important approaches as well as some details on those methods that are referred to in Section 3.3 and throughout the thesis.

k-center clustering

Algorithms that perform a k-center clustering on a given data set O start from an initial solution and try to optimize an objective by iterating the following two steps until a stopping criterion is met:

1. assign objects o ∈ O to a cluster w.r.t. the current cluster parameters

2. recompute the cluster parameters

The initial solution can either be chosen at random or given to the algorithm from the outside. Iterating the two steps can be stopped if between two iterations the improvement of an objective is less than a certain amount. The objective can for example be a similarity measure that is calculated between all objects from the same cluster. Another possible stopping criterion is a fixed number of iterations. Two well-known and commonly used approaches are k-Means and k-Medoids.

k-Means. k-Means [Llo57, Mac67] is a partitioning clustering method, where every point is included in exactly one cluster: $\bigcup_{i=1}^{k} C_i = O$ and

$C_i \cap C_j = \emptyset$ for all $i \neq j$. If no initial centers are provided, it chooses k objects randomly from O as the initial centers µ1, . . . , µk. In the assignment step it assigns each object o ∈ O to its closest center µi based on their distance and calculates the new centers as the resulting means in the recomputation step. The objective that k-Means tries to minimize is typically the sum of squared distances

$$SSQ = \sum_{i=1}^{k} \sum_{o \in C_i} d(o, \mu_i)^2.$$

k-Medoids. k-Medoids is a partitioning method that uses objects as representative centers instead of cluster means. This allows on the one hand clustering in domains where no mean is defined (as in categorical data) and can on the other hand reduce the influence of outliers on the cluster center positions. Initialization and reassignment in k-Medoids are done as in k-Means. The objective is typically the total distance

$$TD = \sum_{i=1}^{k} \sum_{o \in C_i} d(o, \mu_i).$$

Recomputing the cluster parameters (determining new representatives) can be done in several ways. The simplest solution is to randomly select a pair of representative and non-representative objects and swap them if the objective improves. The PAM algorithm (Partitioning Around Medoids) [KR90] tests each pair of representative and non-representative objects and swaps the pair that yields the largest improvement. CLARA (Clustering LARge Applications) [KR90] is a more efficient variant that applies PAM on several randomly drawn samples and returns the best found solution. Further solutions perform an interleaved resampling during the search [NH94]. A combination of k-Means and k-Medoids by interleaving iterations of both methods has been proposed in [GH07].

Expectation maximization. In contrast to partitioning methods the EM algorithm (Expectation Maximization) [DLR77] assigns each object to each cluster. More precisely, each object o belongs to each cluster Ci with a certain probability P(Ci|o) > 0. The underlying assumption is that the data is generated by a mixture of probability distributions. The standard EM algorithm assumes Gaussian mixtures (cf. Definition 2.4) and defines each cluster Ci as a Gaussian distribution N(µi, Σi). Initialization can for example be done using a k-Means clustering result, keeping the means µi, computing the covariance matrices Σi from the points in cluster Ci by the standard formula

and setting the prior probabilities P (Ci) to the proportion of points in Ci. In the reassignment step the expected cluster membership is calculated; each object o is assigned to each cluster Ci with probability

$$P(C_i|o) = \frac{P(C_i)\, p(o|C_i)}{p(o)} = \frac{P(C_i)\, g(o, \mu_i, \Sigma_i)}{\sum_{j=1}^{k} P(C_j)\, g(o, \mu_j, \Sigma_j)},$$

where g(o, µi, Σi) is the Gaussian density function according to Definition 2.3. In the recomputation step the likelihood of the distribution is maximized with respect to the data by updating the cluster parameters according to

$$P(C_j) = \frac{1}{n}\sum_{i=1}^{n} P(C_j|o_i)$$
$$\mu_j = \frac{\sum_{i=1}^{n} o_i \cdot P(C_j|o_i)}{\sum_{i=1}^{n} P(C_j|o_i)}$$
$$\Sigma_j = \frac{\sum_{i=1}^{n} P(C_j|o_i) \cdot (o_i - \mu_j)^T (o_i - \mu_j)}{\sum_{i=1}^{n} P(C_j|o_i)}$$

The EM algorithm can easily be transformed into a partitioning method by assigning each object to the most probable cluster after the last iteration. k-Means then corresponds to a special case of the EM algorithm where the covariance matrix Σi is for all clusters fixed to the d-dimensional identity matrix. Due to the assignment of objects to their closest representative, the partitions induced by k-center methods always exhibit a convex shape (Voronoi cells). A drawback of these methods is that the user must specify the number of clusters k.
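The E- and M-steps above translate directly into code. The following sketch is restricted to one-dimensional Gaussian mixtures to keep it short; the random initialization and the fixed number of iterations are simplifying assumptions, not part of the original algorithm description.

```python
import math, random

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_1d(data, k, iterations=50):
    """EM for a one-dimensional Gaussian mixture: alternate the expectation step
    (responsibilities P(C_j|o_i)) and the maximization step (weights, means, variances)."""
    mus = random.sample(data, k)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of cluster j for each point x
        resp = []
        for x in data:
            dens = [weights[j] * gaussian(x, mus[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update the cluster parameters from the responsibilities
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
    return weights, mus, variances

data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
print(em_1d(data, k=2))
```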

Density-based and Distribution-based methods

Density-based clustering methods search groups of data as connected and highly populated (dense) regions. They automatically determine the number of clusters based on density and connectivity and avoid at the same time the restriction to convex partitions.

DBSCAN and OPTICS. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [EKSX96] iteratively grows regions of sufficient density using the notions of core points and density-reachability. A point p is a core point if there are sufficient objects in its neighborhood as specified by two user parameters. Two points p and q are density-reachable if they are connected by a chain of neighboring core points. To find clusters in a given data set the algorithm iteratively searches for a core object that is not yet clustered and expands a new cluster from there on by adding the object and all its density-reachable objects to a new cluster.

To cope with the difficulty of finding a good parameter setting for DBSCAN on real data sets, the OPTICS algorithm has been proposed as a visual analysis method [ABKS99] (cf. Figure 2.2). Any choice for a neighborhood size (the ε value in DBSCAN) corresponds to a horizontal line in the plot that cuts off the peaks and marks the objects above the line as noise. Each resulting valley underneath the line then represents one cluster.

Grid-based clustering. Grid-based clustering algorithms constitute a type of density-based clustering where the data space is divided into rectangular cells by defining split points along each dimension. Points are then included in their corresponding cell, and noise or unpopulated regions of the data space can be removed by a simple density threshold on the resulting cells. Clusters can finally be determined as sets of neighboring cells that are populated or dense. STING [YM97] is a grid-based clustering algorithm that uses multiple resolutions of grids and stores additional statistical information in the grid cells. WaveCluster [SCZ98] uses multiple resolutions and employs a wavelet transformation on the feature space to find dense regions in the transformed space; in [BC00] different resolutions are used to determine clusters based on the fractal dimension.

DENCLUE. The DENCLUE algorithm [HK98] uses density distribution functions for clustering, which are basically kernels centered at the data points; the overall density function is a mixture of all kernels. Clusters are identified using the local maxima of the overall density function; a hill climbing algorithm following the gradient of the density function assigns all points that reach the same local maximum to one cluster. Again, noise can be removed by applying a density threshold for each point.


Figure 2.4: Data set, image from unordered objects and VAT image [BH02]. Lighter pixels correspond to larger distances, the dark squares along the di- agonal correspond to dense regions in the data.

Hierarchical and specialized methods

Hierarchical clustering methods provide clusterings at different granularities. A popular example is the BIRCH algorithm [ZRL96]. It stores cluster representations in a tree structure where each entry summarizes all objects in its corresponding subtree. COBWEB [Fis87] is a conceptual clustering algorithm that uses classification trees for clustering. Since BIRCH and COBWEB are discussed in the main part of the thesis, a more detailed description is provided in Chapter 3. Generally, hierarchical methods either split the data set top down (divisive) or iteratively merge sets of objects bottom up (agglomerative). Examples of hierarchical methods can be found in [DE84]. A simple bottom-up approach merges in each step the two closest clusters together. The distance between two clusters can be defined as the distance between their centers or using the minimum, maximum or average distance between included points. Using the minimum distance in agglomerative clustering is equivalent to building a minimal spanning tree where each point corresponds to one vertex in a graph. The analogous divisive method produces the same cluster hierarchy by starting from a minimal spanning tree and always removing the edge associated with the largest distance. A method that uses the minimal spanning tree construction to "visually assess the cluster tendency" (VAT) was proposed in [BH02]. The output of

VAT for a data set O = {o1, . . . , on} is an n × n image where the intensity of the pixel at position (i, j) corresponds to the distance between the points oi and

oj; the lighter the color, the larger the distance. The order of the objects is the order in which they are processed by the minimal spanning tree algorithm. Figure 2.4 shows an example of a VAT image. Several extensions of VAT have been proposed to deal with large data sets, arbitrarily shaped clusters or automated cluster detection from VAT images [HBH05, HBH06, WNB+10]. A more sophisticated graph-based clustering algorithm has been proposed in [KHK99]. It constructs a k-nearest-neighbor graph where each point corresponds to one vertex and an edge is inserted between two vertices if one of the points lies in the k-neighborhood of the other. The edges are weighted by the similarity between the vertices, which can be computed as the inverted distance, for example. The graph is then first partitioned into a relatively large number of small clusters, such that the edge cut (the total weight of all removed edges) is minimized. In a second step the resulting clusters are merged based on two criteria (interconnectivity and closeness) that exploit the graph structure of the clusters. Further clustering approaches are for example frequent pattern-based clustering [BEX02] or MDL-based clustering [SVvL06]. Specialized solutions have been proposed for constraint-based clustering [TNLH01, THH01], subspace clustering and projective clustering in high dimensional spaces [AGGR98, APW+99, CFZ99, MSE06, AKMS08] or clustering of complex data types like time series [Lia05]. In Section 3.3 proposed methods for clustering on data streams are reviewed.

Chapter 3

Stream Data Mining

The requirements for stream mining algorithms have been discussed in Chapter 1. The previous chapter provided the background on the KDD process and reviewed established methods for classification and clustering, which are not designed for working on data streams. This chapter deals with algorithms for stream mining. In Section 3.1 general tools and techniques are presented that deal with the special requirements in a data stream scenario. After that, existing algorithms for stream classification and stream clustering are reviewed in Sections 3.2 and 3.3, respectively.

3.1 General Tools and Techniques

Two important requirements for stream mining algorithms are a compact representation of historical data and a mechanism that appropriately handles evolving data distributions. Solutions to these are detailed in Sections 3.1.1 and 3.1.2. A research area that is closely related to stream mining is change mining [BHS08], which concentrates on describing the changes and identifying or predicting when a change occurs. Section 3.1.3 provides an example for a change mining approach and outlines how the approach can be combined with the novel methods proposed in Part III.


Before going into detail about the general tools and individual algo- rithms, the following formally defines a stream of data objects.

Definition 3.1 Stream. A stream $S: \mathbb{N}_0 \to T \times Q,\; i \mapsto (t_i, o_i)$ is an endless sequence of objects $o_i \in Q$ from a d-dimensional input space Q and $t_i \geq 0$ is the arrival time of object $o_i$.

3.1.1 Compression and the BIRCH Algorithm

Compression is a fundamental technique in stream mining algorithms. It results in a compact representation of the data, which yields on the one hand space efficiency and on the other hand time efficiency, since algorithms only process the compressed representation. Compression of a set of data objects can be achieved, for example, through clustering algorithms, statistical approaches or advanced methods that use wavelets or Fourier transformations. Clustering can be used for compression by storing only the representatives of a k-center approach and the resulting cluster parameters. Statistical approaches may use parameterized density functions or a mixture of these to describe data distributions in a compressed way. Another frequently used approach is a so called 'binning' of attribute values, where attribute ranges are divided into intervals (of fixed or adaptive length) for which only counts are maintained.

An indispensable requirement for compression methods in a streaming scenario is the ability to incrementally update the maintained summary or representation. This requirement is obvious from the facts that new data is constantly arriving and that the distribution underlying the data may evolve over time. Considering the examples from above, we find that simply maintaining cluster centers, weights and radii does not sufficiently meet the update requirement. While centers and weights can be updated, the radii (the maximal distance from the center to a contained point) cannot be calculated exactly when the centers change (deletions of points are common in data stream scenarios). Moreover, if the model is intended to reflect recent data more strongly, shrinking of radii due to the aging of points cannot be realized by this approach. For binning approaches the updating of counts can be done easily. However, fixed binning approaches face the problem that, in the case of evolving data distributions, the number of populated cells may be exponential in d. Adaptive binning approaches are eventually forced to redefine the value intervals, where they face the problem of redistributing the counts without actual knowledge about the original points.

The BIRCH clustering algorithm [ZRL96] introduces cluster feature vectors as a compact representation that allows for efficient incremental updates and simple computation of distribution parameters corresponding to the summarized data. Moreover, BIRCH enables fast access to the maintained summaries by organizing them in a hierarchical data structure. The BIRCH algorithm is described in the following, including details of the aspects mentioned above.

Definition 3.2 Cluster feature vector. For a set of objects contained in a cluster C the corresponding cluster feature vector CF = (n, LS, SS) stores the number $n = |C| = \sum_{o_i \in C} 1$ of objects contained in C and their linear and quadratic sum per dimension: $LS = (LS_1, \ldots, LS_d)$ with $LS_j = \sum_{o_i \in C} o_{ij}$ for $j = 1 \ldots d$, and $SS = (SS_1, \ldots, SS_d)$ with $SS_j = \sum_{o_i \in C} o_{ij}^2$ for $j = 1 \ldots d$.

n, LS and SS are the distributive measures from which the algebraic measures mean µj and standard deviation σj can be computed per dimension as follows:

$$\mu_j = LS_j / n \qquad (3.1)$$
$$\sigma_j = \sqrt{SS_j/n - (LS_j/n)^2} \qquad (3.2)$$

The latter follows directly from the linearity of expected values. In the BIRCH algorithm only the squared sum over all objects and all dimensions is stored in the cluster feature vector, which allows only the computation of the average standard deviation over all dimensions. A useful property of cluster features is their additivity.

Theorem 3.1 Additivity of cluster feature vectors. For two disjoint clusters

CA and CB with cluster feature vectors CFA = (nA, LSA,SSA) and CFB =

(nB, LSB, SSB) assume that C = CA ∪ CB is the cluster that results from merging CA and CB. Then the resulting cluster feature vector for C is

CFA + CFB = (nA + nB, LSA + LSB,SSA + SSB).

From the above theorem it becomes obvious that cluster features can be easily and efficiently updated by simply adding new points to all components of the CF vector. The BIRCH algorithm stores cluster features in a balanced hierarchical data structure called cluster feature tree. An inner node stores entries of the form [CF, pointer], where the pointer corresponds to a child node and the cluster feature (CF) summarizes all entries in the child node (cf. Theorem 3.1). Leaf nodes only contain CF vectors without child pointers and store an extra two pointers to neighboring leaf nodes for efficient scans. The size of a node (the number of contained entries) is determined by the size of a page that is read from hard disc. The size of the complete CF-tree is determined by the amount of available memory. To create the tree, new points are recursively added to the closest entry. New entries are created if an object reaches a leaf node and its insertion would cause the standard deviation (taken as the radius) of the corresponding CF vector (cluster) to exceed a given threshold τ. If a resulting node contains too many entries, it is split by choosing the farthest pair of entries as seeds and redistributing the remaining entries based on the closest criterion [ZRL96]. Once the CF-tree exceeds its maximal size, a new threshold value τ′ is determined and a smaller tree is built from the current CF-tree using τ′ before the clustering process can be continued with additional points. When the CF-tree is rebuilt, leaf entries of low density can optionally be stored separately, since those may correspond to noise or outliers. For these CF vectors it is regularly checked whether they can be reabsorbed into the CF-tree.

Cluster feature vectors constitute an excellent way of compression for streaming data and are indeed used in many stream clustering approaches (cf. Section 3.3). The BIRCH algorithm is a first step towards stream clustering, since it processes each object only once; it uses a single linear scan. A drawback of BIRCH in a streaming scenario is the periodical rebuilding of the CF-tree and the reinsertion of possible outliers. Moreover, every object is considered equally important regardless of its age; the resulting clustering does not correspond to an up-to-date model in the sense that it reflects recent data more strongly. While BIRCH has been developed as an algorithm for clustering large data bases, in a streaming scenario the issue of evolving data distributions is of high importance. This aspect is detailed in the following section.
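The cluster feature vector of Definition 3.2, its additivity (Theorem 3.1) and the derived measures of Equations 3.1 and 3.2 can be captured in a small class. The sketch below is illustrative only and omits the threshold and tree logic of BIRCH.

```python
class ClusterFeature:
    """Cluster feature vector CF = (n, LS, SS) as in Definition 3.2 (per-dimension sums)."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # linear sum per dimension
        self.ss = [0.0] * dim   # squared sum per dimension

    def insert(self, point):
        """Incrementally absorb one point (constant time per dimension)."""
        self.n += 1
        for j, x in enumerate(point):
            self.ls[j] += x
            self.ss[j] += x * x

    def merge(self, other):
        """Additivity of Theorem 3.1: component-wise addition of two CF vectors."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def mean(self):                              # Eq. 3.1
        return [l / self.n for l in self.ls]

    def std(self):                               # Eq. 3.2
        return [max(s / self.n - (l / self.n) ** 2, 0.0) ** 0.5
                for l, s in zip(self.ls, self.ss)]

cf = ClusterFeature(dim=2)
for p in ([1.0, 2.0], [2.0, 4.0], [3.0, 6.0]):
    cf.insert(p)
print(cf.mean(), cf.std())   # mean [2.0, 4.0] and per-dimension standard deviations
```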

3.1.2 Mechanisms for aging data streams

The goal of the mechanisms presented in this section is to focus on data from a specific time interval. Therein the most recent data is of primary interest and is therefore given priority in all described approaches.

Sliding window

The simplest mechanism is the so called sliding window [GKS01, BBD+02, ZS03]. Two variants of sliding windows determine the set of contained objects either using a fixed size or a threshold for the age of the objects. For the former, objects that newly arrive on the data stream are added to the object set of the window and, if the maximal capacity is reached, the oldest object drops out. If the age of the objects is used to determine their membership to the window, the actual size of the window is not fixed unless the data stream is constant. All objects in the window are assigned an equal weight, usually 1; other objects are assigned a zero weight. As a consequence, all objects in the window are equally important, since they are incorporated into the model of the actual algorithm with equal weights. While the sliding window is simple and easy to use, its weighting scheme does not allow for prioritization of the objects within the window according to their age. Moreover, there is no provision for an analysis of older data that already dropped out of the window.

Damped window

The damped window [AHWY04, CEQZ06] is similar to the sliding window, but introduces individual weights for the objects based on their age. In the literature the employed aging functions determine the weight for an object o

as $w(o) = b^{-\lambda(t_{now} - t_o)}$, where $t_{now}$ is the current time stamp and $t_o$ the arrival time of o. The base b and the decay factor λ determine the shape of the aging function. If b = 2 then 1/λ is the half life of o. Using the above aging function, the total weight of the stream is bounded by

$$W = \sum_{t=0}^{t_{now}} v \cdot 2^{-\lambda t} = \frac{v}{1 - 2^{-\lambda}} \qquad (3.3)$$

where v is the speed of the stream corresponding to the number of points arriving in one unit of time [CEQZ06]. The exponential aging function employed in the damped window gives more influence to recent data when building the model in an actual mining algorithm. The membership of the objects to the window can still be determined either using a fixed size for the window or a threshold for the object weight (based on the age). As in the sliding window mechanism, an analysis of historical data cannot be done using a damped window alone.
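The aging function and the bound of Equation 3.3 are easy to compute; the following sketch (parameter values are illustrative) shows the half-life behaviour for b = 2.

```python
def weight(t_now, t_o, decay_lambda, base=2.0):
    """Age-dependent weight w(o) = base**(-lambda * (t_now - t_o)); with base 2,
    1/lambda is the half life of an object."""
    return base ** (-decay_lambda * (t_now - t_o))

def total_weight_bound(speed, decay_lambda):
    """Bound on the total stream weight from Eq. 3.3 (geometric series)."""
    return speed / (1.0 - 2.0 ** (-decay_lambda))

print(weight(t_now=10, t_o=6, decay_lambda=0.25))      # 0.5: one half life of 4 time units
print(total_weight_bound(speed=100, decay_lambda=0.25))
```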

Landmark window and tilted time frame

To enable the analysis of historical data, aggregates of the streaming data must be computed and maintained. The simplest mechanism is the landmark window [GKS01], which stores aggregates at regular time intervals. To compute a model in an actual mining algorithm, it then uses the entire history from a specific time point called the 'landmark'. In a real data stream application, however, this model would fail since it does not provide any means to delete or reduce the amount of the stored aggregates, which is essential on infinite streams.

The tilted time window [CDH+02] stores aggregates at different granularities depending on the recency of the data; it stores fewer aggregates for older data and provides a more detailed view on recent data. Assume for example that for the most recent 15 minutes an aggregate is stored every minute. The older aggregates are compressed to one per quarter hour for the last hour. Even older data is represented by one aggregate per hour, then one per day, one per month, etc., yielding 15 snapshots for the minutes, 4 for the quarters, 24 for hours, 30 for days and 12 for months. In this example 15 + 4 + 24 + 30 + 12 = 85 aggregates are stored for a whole year, while maintaining one aggregate per minute throughout would yield 60 · 24 · 30 · 12 = 518400 aggregates.

Pyramidal time frame

The pyramidal time frame was introduced in [AHWY03]. It is the most advanced method presented here and can also be used in combination with the novel methods proposed in Part III. For these reasons it is described in more detail. The pyramidal time frame stores snapshots of clusterings as aggregates. The single clusters are represented by cluster features (cf. Definition 3.2) such that the difference between two snapshots can be computed using the additivity of cluster features (cf. Theorem 3.1). The principle of the pyramidal time frame, however, can be used for any type of additive aggregates. Similar to the tilted time window, the pyramidal time frame stores recent aggregates at a high granularity and maintains only few aggregates for older data. The name is due to the employed data structure that resembles a pyramid: it stores the aggregates in levels of decreasing cardinality. The size of the pyramid is controlled by the two parameters α and β, where, roughly said, α controls the height and β the width of the pyramid (an example is provided below). The following steps determine whether and where an aggregate is stored in the data structure:

1. An aggregate that is taken at time $t_{now}$ is stored on the highest level i with $t_{now} \bmod \alpha^i = 0$.

2. If a level contains more than $\alpha^\beta + 1$ aggregates, the oldest is removed.

level 5: 32
level 4: 16, 48
level 3: 8, 24, 40
level 2: 20, 28, 36, 44, 52
level 1: 38, 42, 46, 50, 54
level 0: 47, 49, 51, 53, 55
(α = 2, β = 2, $t_{now}$ = 55; height = 6 = $\lceil \log_\alpha(t_{now}) \rceil$, width = 5 = $\alpha^\beta + 1$)

Figure 3.1: Example for the pyramidal time frame.

Consequently, at time $t_{now}$ the height of the pyramid is at most $\lceil \log_\alpha(t_{now}) \rceil$ and the total number of stored aggregates is at most $(\alpha^\beta + 1) \cdot \lceil \log_\alpha(t_{now}) \rceil$. Figure 3.1 shows an example for the pyramidal time frame. In the example α = 2, β = 2 and a snapshot is taken at every time unit. Hence, 55 snapshots have been taken in the example. The height of the pyramid is $\lceil \log_2(55) \rceil = 6$ and a level stores at most $2^2 + 1 = 5$ snapshots. Level 5 stores snapshots from time stamps which are divisible by 32, level 4 those that are divisible by 16 but not by 32, etc. The maximal number of snapshots per level cuts off the edges of the pyramid; the resulting structure therefore rather resembles a house or a tower (depicted by the gray area). The flexibility that is given to the user by choosing the parameters α and β was shown in [AHWY03] by means of the following example. With α = 2 and β = 1, one snapshot per second over 100 years would yield

$(2^1 + 1) \cdot \lceil \log_2(60 \cdot 60 \cdot 24 \cdot 365 \cdot 100) \rceil = 96$ aggregates. This is on the one hand very efficient, but yields on the other hand only a very coarse representation. Allowing more snapshots per level by setting β = 10 in the same example yields $(2^{10} + 1) \cdot \lceil \log_2(60 \cdot 60 \cdot 24 \cdot 365 \cdot 100) \rceil = 32800$ aggregates.

To analyze the data from a time interval $[t_1, t_2]$ the two aggregates $X_1$ closest to $t_1$ and $X_2$ closest to $t_2$ are searched in the pyramid, and a representation of the data between $t_1$ and $t_2$ is given by $X_2 - X_1$ using the additivity of aggregates. An example for such an analysis is the change mining algorithm described in Section 3.1.3. Any interval $[t_{now} - h, t_{now}]$ that considers the most recent h time steps can at least be approximated with a factor of 2; a snapshot will always be found that lies at most at $t_{now} - 2 \cdot h$, as shown in [AHWY03]. Increasing the parameter β improves the approximation and a snapshot will be found within $(1 + \frac{1}{\alpha^{\beta-1}}) \cdot h$. In the above example of a 100 year old data stream with α = 2 and β = 10, a snapshot will be found within $(1 + \frac{1}{2^{10-1}}) \cdot h = 1.002 \cdot h$ for any h.
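A minimal sketch of the snapshot bookkeeping described above is given below; it is not the CluStream implementation, and the class and parameter names are illustrative. Run on time stamps 1 to 55 with α = 2 and β = 2, it reproduces the level contents of Figure 3.1.

```python
from collections import defaultdict

class PyramidalTimeFrame:
    """Snapshot bookkeeping of the pyramidal time frame: a snapshot taken at time t is
    stored on the highest level i with t mod alpha**i == 0, and each level keeps at
    most alpha**beta + 1 snapshots (the oldest one is dropped)."""

    def __init__(self, alpha=2, beta=2):
        self.alpha, self.beta = alpha, beta
        self.levels = defaultdict(list)          # level -> list of snapshot time stamps

    def insert(self, t, max_level=64):
        level = 0
        while level + 1 <= max_level and t % self.alpha ** (level + 1) == 0:
            level += 1
        self.levels[level].append(t)
        if len(self.levels[level]) > self.alpha ** self.beta + 1:
            self.levels[level].pop(0)            # remove the oldest snapshot on this level

ptf = PyramidalTimeFrame(alpha=2, beta=2)
for t in range(1, 56):
    ptf.insert(t)
print({lvl: ts for lvl, ts in sorted(ptf.levels.items())})
```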

3.1.3 Change Mining

The mechanisms described in the previous section enable a stream mining algorithm to focus on more recent data and to maintain an up to date model. In contrast, research in change mining seeks to identify and describe changes in the data distribution. As one example, the method MONIC (modeling and monitoring cluster transitions) proposed in [SNTS06] is discussed here.

Given two clusterings C1 and C2 from two different time steps t1 < t2, MONIC distinguishes between internal and external transitions of a cluster from t1 to t2. For both groups a set of possible transitions is identified and formally described in [SNTS06]. As a prerequisite the overlap overlap(C1, C2) of a cluster C1 ∈ C1 to a cluster C2 ∈ C2 is defined and the match for C1 in C2 is derived as the cluster in C2 with the highest overlap (if the highest overlap is less than a threshold τ then C1 does not have a match in C2). The identified external transitions are splitting, merging, disappearing, emerging and survival. For the latter the identified internal transitions are size transition (growing, shrinking), compactness transition (becoming more compact or more diffuse), location transition (mean shift, skew) and no change.

The overlap overlap(C1, C2) of a cluster C1 to a cluster C2 in MONIC is based on the intersection C1 ∩ C2. This requires that the identifiers of the actual objects that are summarized in the clusters be accessible in order to apply MONIC to two given clusterings. The clusters maintained by a stream clustering algorithm, however, usually offer only a compact representation such as cluster features (cf. Definition 3.2). The applicability of change detection methods to stream clustering outputs is nonetheless often given through so called micro clustering. Most stream clustering algorithms maintain a detailed micro clustering in an online component and defer the actual clustering to an offline component, which uses the stored micro clusterings. If the single micro clusters are given a unique ID by the online component, these can be utilized to compute overlaps, matches and transitions using MONIC.

3.2 Stream Classification

Research on stream classification seeks to develop algorithms that adhere to the requirements of stream data mining listed in Chapter 1.

Limited resources. Limited time and limited memory can be handled by restricting the classifier's model accordingly; limiting the number of support vectors in an SVM or the number of objects for a nearest neighbor classifier are two examples. In the following, budget algorithm refers to an approach which restricts its model such that the classification decision is reached within a previously known time budget.

Evolving data. Updating the model of a classifier in the presence of changing data distributions is less straightforward. Proposed methods are discussed in Section 3.2.1, some of which constitute general approaches while others are specialized solutions for a specific classification paradigm. To be able to adapt a model, new training data is constantly required. Hence, in the literature it is assumed that the objects in a data stream according to Definition 3.1 are interleaved with labeled objects o ∈ Q+ [WFYH03, KJ00]. In practice such labeled instances may result from supervised phases of the application, where an expert provides the ground truth labels manually.

Varying data rates. Anytime classifiers that adapt to varying data rates and flexibly use the available processing time are a more recent field of research. A review of the existing approaches is provided in Section 3.2.2, finding that their number is significantly smaller than that of conventional stream classification algorithms.

Many stream classification algorithms have been proposed; surveys and further references that go beyond the approaches discussed below can be found in [GZK05, Agg06, GG07].

3.2.1 Learning on evolving data streams

A general approach to building classifiers for evolving data streams has been proposed in [WFYH03]. The basic idea is to maintain a set of m classifiers trained on consecutive chunks of the stream and employ them in a weighted ensemble based on their classification performance. Newly arriving labeled objects are added to the current chunk $T^x$ until a fixed amount is collected (chunk size). A new classifier is trained on $T^x$ and the performance of the existing classifiers is computed using $T^x$ as the test set. Each classifier is given a weight proportional to its accuracy and the m highest weighted classifiers are kept. The weights are used during classification in an ensemble of all m classifiers. The approach is applicable to any classification paradigm, provided that the times for training and classification meet the constraints of the actual stream setting. A sketch of this scheme is given below.

Another approach that processes the data stream in chunks has been proposed in [KJ00]. The paper describes an implementation of the approach using support vector machines, but the idea can be generalized to a certain extent. As its input the approach assumes t chunks $T^0, \ldots, T^{t-1}$ of equal size m. It then trains t classifiers on all objects from the most recent h chunks for h = 1 . . . t. The performance of the trained classifiers is assessed by testing on each object from the latest chunk in a leave-one-out fashion; each classifier must be trained m times. Since this procedure soon becomes prohibitively expensive, the error is estimated with a pessimistic bias, where the true error is overestimated on average. The output of the approach is the classifier which has the smallest estimated error. The error estimation thwarts the generality of the approach, since it either must be defined for other paradigms to be able to use the methodology or the leave-one-out performance of the paradigm must comply with the time constraints of the actual stream.
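The chunk-based weighted ensemble of [WFYH03] described in the first paragraph can be outlined as follows; the helper names and the trivial majority-class base learner are illustrative assumptions, not part of the original approach.

```python
from collections import Counter

def train_majority(chunk):
    """Trivial base learner used only for illustration: always predicts the chunk's majority label."""
    majority = Counter(label for _, label in chunk).most_common(1)[0][0]
    return lambda o: majority

def accuracy(classifier, chunk):
    """Fraction of chunk objects the classifier labels correctly."""
    return sum(classifier(o) == label for o, label in chunk) / len(chunk)

def update_ensemble(classifiers, train, new_chunk, m):
    """Train a new classifier on the newest chunk, weight all classifiers by their
    accuracy on that chunk and keep the m best (weight, classifier) pairs."""
    classifiers = classifiers + [train(new_chunk)]
    weighted = sorted(((accuracy(c, new_chunk), c) for c in classifiers),
                      key=lambda wc: wc[0], reverse=True)
    return weighted[:m]

def classify(weighted_ensemble, o):
    """Weighted majority vote over the ensemble members."""
    votes = Counter()
    for w, c in weighted_ensemble:
        votes[c(o)] += w
    return votes.most_common(1)[0][0]

# two toy chunks of labeled objects (feature vector, label)
chunk1 = [([0.1], "A"), ([0.2], "A"), ([0.9], "B")]
chunk2 = [([0.8], "B"), ([0.7], "B"), ([0.1], "A")]
ensemble = update_ensemble([], train_majority, chunk1, m=2)
ensemble = update_ensemble([c for _, c in ensemble], train_majority, chunk2, m=2)
print(classify(ensemble, [0.5]))
```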

For the nearest neighbor classifier an adaptive version for evolving streams has been proposed in [BMH+05]. A fixed number of weighted representatives is maintained based on the prevailing time and space constraints. For a newly arriving labeled object o the closest representative r̂ is determined according to some distance measure dist(o, r̂). If dist(o, r̂) is larger than a predefined threshold, o is stored as a new representative. Otherwise o is merged into r̂ by moving its position, taking the weight into account. If the labels of o and r̂ are equal, the weight of r̂ is increased by 1, otherwise 1 is subtracted. If the resulting weight is 0 then r̂ is removed from the representatives. The classification decision for an object q is determined based on the k nearest neighbors among the representatives, taking their weight into account.

Decision trees are an extensively researched paradigm in evolving data stream scenarios. In [DH00] a system called VFDT (Very Fast Decision Tree learner) has been proposed, entailing a rich diversity of improvements and modified variants such as [HSD01, JA03, GMR04, BHP+09]. In decision tree induction leaf nodes are recursively replaced by test nodes (inner nodes) associated with a split attribute and one or several split values (cf. Section 2.2 and Definition 2.5). The core idea of VFDT is to consider only a small set of objects per leaf node to decide on the split. The size of the set is chosen such that, with high probability, the chosen attribute is the same that would have been chosen using infinitely many examples. To this end a statistical result known as the Hoeffding bound (or additive Chernoff bound) [Hoe63, MM93] is employed, yielding the name Hoeffding trees for the decision trees used in VFDT and its variants. Hoeffding trees grow top down by incrementally adding new labeled objects to the leaf nodes and eventually splitting them using the best attribute according to a splitting criterion of choice such as the information gain (cf. Equation 2.18). If the available memory is exceeded, the least promising leaves are deactivated in order to make room for new ones. The VFDT system includes further heuristics for solving ties between two attributes, increasing the efficiency by delaying computations of the splitting criterion, reducing the number of attributes, or initializing VFDT to ensure a good early performance.

A common method for Bayesian classification on categorical attributes is the use of Bayesian networks (cf. Section 2.2). Inference of a Bayesian network determines the network topology as well as the parameters for the conditional probability tables. While the latter naturally provide the means to efficiently add new training data or remove outdated training data by adjusting the entries in the conditional probability tables, determining the network topology is a more time consuming process. In [LW96] an anytime learning method for Bayesian networks based on state-space abstraction is discussed. State-space abstraction refers to restricting the number of possible values (categories, intervals) per attribute in order to reduce the size of the parameter set Θ (cf. Equation 2.12). This way the evaluation of a trial topology is accelerated. The restrictions on the state-space are iteratively loosened by refining one category for one attribute at a time until the process is interrupted or all states are elementary.

Both [DH00] and [LW96] declare their methods as anytime algorithms. In either case the term anytime refers to the training of the classifier: VFDT ”is an anytime algorithm in the sense that a ready to use model is avail- able at any time” [DH00] and the latter ”considers an anytime procedure for approximate evaluation [inference] of Bayesian networks” [LW96]. While being able to learn in an anytime manner is certainly worthwhile, the ability to perform anytime classification is of higher value in streaming scenarios. Existing anytime classifiers are discussed in the next section.

3.2.2 Anytime classifiers

Given a test object q ∈ Q, an anytime classifier C can provide a classification decision $f_C(\Theta, q, t)$ at any time $t \geq t_0$ after a very short initialization time $t_0$. A simple and intuitive anytime classifier for nearest neighbor classification has been proposed in [UXKL06]. While the traditional nearest neighbor classifier is a lazy learner that essentially skips the training phase and exhibits high classification times especially on larger data sets*, the anytime

nearest neighbor is quite the opposite. The basic idea is to compute a ranking on T during the training phase and process T in the resulting order during classification until the algorithm is interrupted or T is entirely read. Different ranking heuristics are discussed in [UXKL06]; the recommended heuristic assigns to each object o ∈ T a rank according to its performance in a leave-one-out classification on T. The rank is computed as

$$rank(o) = \sum_{j} \begin{cases} 1 & \text{if } label(o) = label(o_j) \\ -\frac{2}{|L|-1} & \text{otherwise} \end{cases} \qquad (3.4)$$

where $o_j \in T$ is an object having o as its nearest neighbor. Ties are resolved by sorting objects with the same rank according to their distance to their nearest neighbor of the same class. During classification the objects are processed in decreasing rank order and on interruption the label of the currently closest object is returned as the classification result. The training time of the anytime nearest neighbor is rather high; consider 53 minutes for |T| = 11340 on a Pentium 4 with 3.0 GHz as an example from [UXKL06]. Moreover, the employed ranking does not allow a straightforward incremental addition of new training data as desirable in the data stream context.

* Similarity search on large data sets can be sped up to some extent by index structures and other appropriate methods as discussed in Chapter 2.
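The ranking of Equation 3.4 and the resulting anytime classification loop can be sketched as follows; the tie-breaking by distance is omitted and the interruption is simulated by a fixed budget, both simplifying assumptions made only for the example.

```python
import math

def rank_training_set(T, num_labels):
    """Rank every training object by Eq. 3.4: +1 for each object that has it as nearest
    neighbor and shares its label, -2/(|L|-1) for each such object with another label."""
    ranks = [0.0] * len(T)
    for j, (o_j, l_j) in enumerate(T):
        nn = min((i for i in range(len(T)) if i != j),
                 key=lambda i: math.dist(T[i][0], o_j))
        ranks[nn] += 1.0 if T[nn][1] == l_j else -2.0 / (num_labels - 1)
    order = sorted(range(len(T)), key=lambda i: ranks[i], reverse=True)
    return [T[i] for i in order]

def anytime_nn(query, ranked_T, budget):
    """Process the ranked objects until the (simulated) interruption after 'budget' objects
    and return the label of the closest object seen so far."""
    seen = ranked_T[:budget]
    return min(seen, key=lambda obj: math.dist(obj[0], query))[1]

T = [([0.0, 0.0], "A"), ([0.2, 0.1], "A"), ([3.0, 3.0], "B"), ([3.2, 2.9], "B")]
ranked = rank_training_set(T, num_labels=2)
# the result can improve when more processing time (a larger budget) is available
print(anytime_nn([2.5, 2.5], ranked, budget=2), anytime_nn([2.5, 2.5], ranked, budget=4))
```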

For support vector machines several anytime versions have been proposed [DeC02, DeC03, DM03]. The solution proposed in [DM03] relies on two main observations. The first observation is that for an SVM the same classification decision can often be reached by using only a small fraction of the nearest support vectors of a query in the decision function (cf. Equation 2.17). The second observation is that the ordering of the nearest neighbors can be approximate without harming the actual performance. Putting the second observation into practice, the principal components of the support vectors are determined and the nearest neighbors for an object q ∈ Q are computed during classification using only the subspace spanned by the "top few principle component dimensions" [DM03]. The realization of the first observation follows directly from evaluating the support vectors incrementally with respect to their distance to q. Earlier anytime SVM versions iteratively compute tighter bounds on the SVM output using either distance geometry [DeC02] or Cholesky factorization [DeC03]. In [DeC02] the support vectors are processed during classification in a fixed order (similar to [UXKL06]), which is precomputed via a greedy eigenvector analysis.

Anytime classification for the Bayes classifier on categorical attributes has been discussed in [YWKT07]. The system is called AAPE (anytime averaged probabilistic estimator) and consists of the following four steps:

1. Identify single improvement steps.

2. Order single improvement steps.

3. Invoke next single improvement step until time is exhausted.

4. Ensemble improvement steps to output class probability estimates.

Initially a naïve Bayes is evaluated to obtain a first classification result. As the single improvement steps, SPODEs (superparent-one-dependent estimators) are used, which were introduced in [KP02] and frequently investigated in subsequent research [WBW05, ZW06, ZW07, YWC+07, FGMP09]. SPODEs assume that the attributes are independent of each other given the class label and a common attribute: the superparent. The SPODE considering the p-th attribute as superparent yields the posterior probability (cf. Equation 2.6) given by

$$P(l|o, p) = \frac{P(l, o_p)\, P(o|l, o_p)}{P(o)} = \frac{P(l, o_p) \prod_{i=1}^{d} P(o_i|l, o_p)}{P(o)} \qquad (3.5)$$

Since any of the d attributes can be utilized as superparent, a total of d improvement steps are given in AAPE. For the second step, seven heuristics to determine the best order of SPODEs are discussed and evaluated in [YWKT07]. The recommended heuristic is called FSA (forward sequential addition), in which the SPODEs are successively added to an ensemble based on the resulting performance with respect to a leave-one-out evaluation on T. Further heuristics include backward sequential elimination, random ordering and others. After computing the single posterior probabilities in step three until interruption, the

final decision is given by

$$\hat{l} = \underset{l \in L}{\arg\max} \; \frac{P(l)P(o|l) + \sum_{p \in I} P(l, o_p)P(o|l, o_p)}{P(o) \cdot (|I| + 1)} \qquad (3.6)$$

averaging over the probabilities of the naïve Bayes and all evaluated SPODEs subsumed in I. While the probability tables used for the SPODEs can be incrementally updated with new training data, the order of the SPODEs requires more expensive recomputation, which cannot be done online in most cases.
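The anytime control flow of AAPE, evaluating the naive Bayes first and then adding SPODEs until interrupted before averaging as in Equation 3.6, can be outlined as below. The probability estimators are passed in as hypothetical callables and the toy scores at the end are purely illustrative.

```python
import time

def aape_classify(o, labels, naive_scores, spode_scores, spode_order, deadline):
    """Anytime averaged probabilistic estimation: start from the naive Bayes scores,
    add one SPODE per iteration until the deadline, then take the argmax of the sums.
    naive_scores(o) and spode_scores(p, o) are assumed (hypothetical) helpers returning
    the per-label scores P(l)P(o|l) and P(l, o_p)P(o|l, o_p), respectively."""
    totals = dict(naive_scores(o))               # running per-label sums
    for p in spode_order:
        if time.monotonic() >= deadline:         # interruption: use what has been computed
            break
        for l, score in spode_scores(p, o).items():
            totals[l] += score
    # dividing by P(o) * (|I| + 1) as in Eq. 3.6 does not change the argmax, so it is omitted
    return max(labels, key=lambda l: totals[l])

# purely illustrative scores for two labels and three SPODEs
labels = ["A", "B"]
naive = lambda o: {"A": 0.30, "B": 0.25}
spodes = lambda p, o: {"A": 0.10, "B": 0.14}
print(aape_classify(None, labels, naive, spodes, spode_order=[0, 1, 2],
                    deadline=time.monotonic() + 0.001))
```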

3.2.3 Summarizing remarks

As noted in Section 2.2 different domains call for different classifiers. The development of anytime classifiers is about enabling the use of a specific classification paradigm in an anytime setting; it corresponds to designing an algorithm that follows Definition 1.1 for the paradigm at hand. Ideally a classifier supports both incremental learning and anytime classification. As described above, anytime classifiers have been developed for SVMs, nearest neighbor classification and Bayesian classification for categorical attributes. In Part II the focus is on Bayesian anytime classification for continuous attributes. Therein first an anytime classifier is developed that can incrementally learn from new labeled instances arriving on the stream. Consequently the focus is on improving the method by building the classifier offline; to this end the general approach for evolving data streams can be employed that was proposed in [WFYH03] and discussed in Section 3.2.1.

3.3 Stream Clustering

Since clustering is one of the most frequently used data mining techniques, the development of clustering algorithms promptly followed the real world situation of data repositories; single scan algorithms were proposed when large data bases became prevalent, and stream clustering algorithms followed soon after. For most of the clustering paradigms discussed in Section 2.3, a corresponding stream clustering algorithm has been developed based on the static methods, including stream variants of k-Means and DBSCAN, but also grid-based, graph-based and MDL-based approaches.

3.3.1 k-center methods

As in static clustering, k-center clustering approaches are numerous and among the most popular. Two algorithms are presented in greater detail and an overview of further k-center methods from the literature is provided thereafter.

STREAM

One of the first stream clustering algorithms, called STREAM [OMM+02], uses a k-Medoids variant for clustering. Instead of processing each object directly as it arrives, the algorithm periodically collects a certain number of points and processes these as one chunk of data. The outline of the STREAM algorithm is as follows:

1. Perform k-Medoids clustering on the current chunk

2. Store the k resulting centers weighted by the number of assigned points

3. If no memory is left perform k-Medoids on the set obtained in step 2 and replace the set by the resulting centers

The output of STREAM is the k-Medoids result on the set of stored centers. The employed k-Medoids variant is called LSEARCH and is based on the facility location problem, where each cluster center represents a facility. In contrast to the classical k-Medoids, facility location imposes a fixed cost ζ on every cluster (facility) and tries to minimize

$$FC(C) = \zeta \cdot |C| + \sum_{i=1}^{|C|} \sum_{o \in C_i} d(o, \mu_i).$$

Initially the points in the chunk are sorted randomly and a facility is opened at the location of the first point. At each following point a facility is opened with probability d/f, where d is the distance of the current point to the closest existing facility. To find the desired number of k clusters, LSEARCH performs a binary search over the cost ζ. ζ is chosen between $\zeta_{min} = 0$ and $\zeta_{max} = \sum_{o \in O} d(p_0, o)$, where $p_0$ is the first point in the random ordering. The cost is in each iteration set to $\zeta = (\zeta_{min} + \zeta_{max})/2$ and the clustering C is computed. If |C| > k the cost was too small and the lower bound is updated by setting $\zeta_{min} = \zeta$; if |C| < k the cost was too large and the upper bound is updated by setting

$\zeta_{max} = \zeta$. LSEARCH stops if either |C| = k or FC(C) did not improve by more than a given threshold. To speed up the algorithm, a sampling approach is proposed in [OMM+02]. Considering the requirements for streaming algorithms discussed in Chapter 3, the main weaknesses of STREAM are the missing aging of historical points and the missing treatment of noise. However, the algorithm can easily be adapted to lay more focus on recent data by either decreasing the weights of centers from older chunks or removing them in step 3 instead of merging them.

CluStream

CluStream [AHWY03] introduces the notion of an online and an offline component. In the online component a larger number of so called micro-clusters MC are maintained and stored regularly using a pyramidal time frame (cf. Chapter 3). The offline component performs a k-Means clustering on the micro-clusters from a user specified horizon. Micro-clusters in CluStream are represented by a temporal extension of cluster features as follows:

Definition 3.3 An extended cluster feature CFT = (n, LS, SS, lst, sst) stores in addition to a cluster feature CF = (n, LS, SS) according to Definition 3.2 the linear sum lst and quadratic sum sst of the time stamps of all contained objects.

Each micro-cluster is assigned a unique id upon creation. The total number of micro-clusters k′ is a parameter of the algorithm, which is determined by the amount of available main memory according to [AHWY03]. Initially a fixed number (InitNumber) of points is collected and processed by the offline component to obtain k′ micro-clusters. The value of InitNumber is chosen to be as large as permitted by the computational complexity of a k-Means algorithm [AHWY03]. For a newly arriving point p the online component of CluStream performs the following operations:

1. Find the closest micro-cluster MCi according to d(p, µi)

2. If d(p, µi) < α · σi then add p to MCi (α = 2 in [AHWY03])

3. Else create a new micro-cluster containing only p and

(a) delete the least recent micro-cluster, if its relevance stamp is below τ
(b) else merge the two closest micro-clusters

If two micro-clusters are merged, the resulting micro-cluster is assigned an id-list containing the member ids of both. The relevance stamp for the delete check reflects the average time stamp of the last m̂ points. Using ls^t and ss^t it is calculated as the m̂/(2 · n)-th percentile of the time stamp distribution, assuming a normal distribution; the parameters of the distribution are calculated from ls^t and ss^t as in the feature case. τ and m̂ are parameters of the algorithm.

In contrast to STREAM, CluStream incorporates aging of points and micro-clusters through the extended cluster features and the deletion according to recency. Noise points are not directly discarded, since each point is either included in an existing micro-cluster or assigned a new one to allow the formation of new clusters and concepts. However, noise points that stay alone in their micro-cluster will eventually be removed in step 3(b).
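A minimal sketch of the online maintenance step is given below. The class and method names are illustrative only; the relevance stamp is simplified to the average time stamp ls^t/n instead of the percentile described above, and the id-list bookkeeping is omitted.

    import math

    class MicroCluster:
        # Extended cluster feature (n, LS, SS, ls_t, ss_t); illustrative sketch.
        def __init__(self, point, t):
            self.n = 1
            self.ls = list(point)
            self.ss = [x * x for x in point]
            self.lst, self.sst = t, t * t

        def add(self, point, t):
            self.n += 1
            for j, x in enumerate(point):
                self.ls[j] += x
                self.ss[j] += x * x
            self.lst += t
            self.sst += t * t

        def center(self):
            return [x / self.n for x in self.ls]

        def radius(self):
            # square root of the average per-dimension variance as a simple radius
            var = sum(self.ss[j] / self.n - (self.ls[j] / self.n) ** 2
                      for j in range(len(self.ls))) / len(self.ls)
            return math.sqrt(max(var, 0.0))

    def clustream_insert(mcs, point, t, alpha=2.0, tau=10.0):
        closest = min(mcs, key=lambda mc: math.dist(point, mc.center()))
        if math.dist(point, closest.center()) < alpha * closest.radius():
            closest.add(point, t)                          # step 2: absorb the point
            return
        # step 3: new micro-cluster; free space by deleting or merging first
        oldest = min(mcs, key=lambda mc: mc.lst / mc.n)    # simplified relevance stamp
        if oldest.lst / oldest.n < tau:
            mcs.remove(oldest)                             # 3(a): delete least recent
        else:                                              # 3(b): merge the two closest
            a, b = min(((a, b) for a in mcs for b in mcs if a is not b),
                       key=lambda ab: math.dist(ab[0].center(), ab[1].center()))
            for j in range(len(a.ls)):
                a.ls[j] += b.ls[j]
                a.ss[j] += b.ss[j]
            a.n += b.n
            a.lst += b.lst
            a.sst += b.sst
            mcs.remove(b)
        mcs.append(MicroCluster(point, t))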

Other approaches

Besides the two described methods a variety of other k-center approaches has been proposed. E-Stream [URW07] maintains for each micro-cluster, in addition to the cluster feature, a histogram that is used to determine the best dimension for splitting the cluster. MovStream [TTD+08] uses a sliding window, adds a new point to its closest cluster and removes the oldest point from its cluster. OLINDDA [SdLFdCG07] stores a certain number of representatives per cluster such that the total number is constant and periodically performs a k-Means clustering on those objects. New objects that do not fall into an existing cluster are buffered and clustered separately to check for novelty and drift. StreamKM++ [ALM+10] computes a small weighted sample of the data stream using a coreset tree and performs a k-Means variant [AV07] on that sample. The sample size is experimentally shown to yield a good tradeoff between running time and clustering quality when set to 200 · k, where k is the number of resulting clusters. TRAC-STREAM [NR06] uses Chebyshev bounds to determine outliers and the compatibility of clusters. In [Guh09] theoretical bounds are proven for maintaining k clusters of equal radius in a streaming scenario.

3.3.2 Density-based methods

Density-based stream clustering approaches claim to outperform k-center methods by finding clusters of arbitrary shape. In most cases, however, this advantage is only due to the employed offline component of the approaches. This issue is discussed at the end of this chapter in Section 3.3.4.

DenStream

The DenStream algorithm proposed in [CEQZ06] is based on DBSCAN (cf. Section 2.3). Similar to CluStream, micro-clusters are maintained in an online component and the DBSCAN algorithm is applied to these micro-clusters in the offline component upon user request. The online component differs significantly from CluStream: it uses a time-weighted variant of cluster features.

Definition 3.4 Let w(o) = b^{−λ(t_now − t_o)} be a time weighting function, where t_now is the current time and t_o ≤ t_now is the arrival time of an object o. Then the time-weighted cluster feature vector corresponding to a cluster C is CF = (n^{(t)}, LS^{(t)}, SS^{(t)}), with n^{(t)} = Σ_{o_i ∈ C} w(o_i) and, for each dimension j = 1 … d, LS_j^{(t)} = Σ_{o_i ∈ C} w(o_i) · o_{ij} and SS_j^{(t)} = Σ_{o_i ∈ C} (w(o_i) · o_{ij})².

DenStream distinguishes between core micro-clusters CMC with n^{(t)} ≥ τ and outlier micro-clusters OMC with n^{(t)} < τ. n^{(t)} is the time weighted number of points contained in the micro-cluster and τ is a threshold parameter. The offline component is applied only to the core micro-clusters. To insert a new point p at time t the algorithm performs the following steps:

1. Find the closest core micro-cluster CMCi

2. Add p to CMC_i if afterwards σ_i ≤ ε holds

3. Else perform steps 1 and 2 to test the closest outlier micro-clusters

4. If p was added to an outlier micro-cluster OMCj check whether OMCj became a core micro-cluster

5. If p was not absorbed create a new OMCp containing only p

6. Periodically after Tper time steps, if t mod Tper = 0,

(a) delete all core micro-clusters CMC_i with n_i^{(t)} < τ
(b) delete all outlier micro-clusters that did not become a core micro-cluster within the last T_per time steps

Step 4 ensures that an outlier micro-cluster can evolve into a core micro-cluster over time; step 6 deletes noise and outdated micro-clusters. Given the maximal total weight W of the time-weighted stream according to Equation 3.3, the number of core micro-clusters is bounded by W/τ, and it is shown in [CEQZ06] that the number of outlier micro-clusters is also limited. The value

T_per for the periodical checks is determined using the decay factor λ and the density threshold τ as T_per = ⌈(1/λ) · log(τ/(τ−1))⌉.

To initialize the list of maintained core micro-clusters the DenStream algorithm collects an initial number of points (InitNumber) in a temporary set. For a point p in this set, if the number of points in its ε-neighborhood is above τ, a new core micro-cluster is created containing these points and they are removed from the set.

In [CEQZ06] the core micro-clusters of the online component are called potential c-micro-clusters and τ is a product of two parameters. In their experimental evaluation the parameters were set to τ = 2, ε = 16, InitNumber = 1000 and λ = 0.25. For these parameters the value for the periodical checks is T_per = 4.
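A compact sketch of the insertion logic and of the computation of T_per is given below. The micro-cluster interface (center(), radius_if_added(), weight(), add()), the factory new_outlier_mc and the logarithm base (here b = 2, matching the decay function of Definition 3.4 and reproducing T_per = 4 for the parameters above) are assumptions for illustration and not the original implementation.

    import math

    def t_per(lam, tau, b=2.0):
        # T_per = ceil((1/lam) * log_b(tau / (tau - 1))); base b is an assumption
        return math.ceil((1.0 / lam) * math.log(tau / (tau - 1.0), b))

    def denstream_insert(p, t, core_mcs, outlier_mcs, eps, tau, dist, new_outlier_mc):
        # Try the closest core micro-cluster first, then the closest outlier one.
        for candidates in (core_mcs, outlier_mcs):
            if not candidates:
                continue
            closest = min(candidates, key=lambda mc: dist(p, mc.center()))
            if closest.radius_if_added(p, t) <= eps:
                closest.add(p, t)
                if candidates is outlier_mcs and closest.weight(t) >= tau:
                    outlier_mcs.remove(closest)        # promote to core micro-cluster
                    core_mcs.append(closest)
                return
        outlier_mcs.append(new_outlier_mc(p, t))       # p was not absorbed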

DStream

DStream [CT07] uses a grid-based approach and divides each dimension into l partitions, yielding a total number of N = l^d grid cells. It exhibits some similarities with the previously described DenStream algorithm in that it uses a decay function for aging and periodically checks for noise or outdated regions. The online component maintains populated grid cells in a hash list. A grid cell GC is represented by its characteristic vector

(t_update, t_spor, n^{(t)}, label, status)

where status can be sporadic or normal and label is used for cluster membership. t_update and t_spor are the time stamps when GC was last updated or marked as sporadic, respectively. With t_p as the insertion time of point p, the term n^{(t)} = Σ_{p ∈ GC} b^{λ·(t_now − t_p)} is the time weighted sum of all points in GC (its density), similar to Definition 3.4. In DStream λ is set to 1 and the base b is used as a parameter to influence the aging. With this parameter choice the total weight of the stream is bounded by W = 1/(1 − b) (cf. Section 3.1). This property is used in the online component.

DStream distinguishes three kinds of grid cells using the bounded total weight of the stream W = 1/(1 − b) and the total number of grid cells N = l^d as follows:

• GC is a dense grid cell, if n^{(t)} > τ_dense · W/N

• GC is a sparse grid cell, if n^{(t)} < τ_sparse · W/N

• GC is a transitional grid cell, if τ_sparse · W/N < n^{(t)} < τ_dense · W/N

The two requirements for a sparse grid to be marked as sporadic are

S1: n^{(t)} < (τ_sparse/N) · Σ_{i=0}^{t_now − t_update} b^i

S2: t_now ≥ (1 + β) · t_spor

S2 defines how fast a grid cell that has been marked as sporadic before can be marked as sporadic again. The larger β is chosen by the user, the longer it takes for these grid cells to be removed. Moreover, since the comparison is absolute rather than relative, requirement S2 becomes more difficult to fulfill as time proceeds. The main steps of the online component for any new point p arriving at time t are:

1. Determine the grid cell GC that p falls into

2. Add GC to the hash list if it is not already contained

3. Update GC with respect to p and t. If GC changed from sparse to another type, remove possible sporadic mark

4. Periodically after Tper time steps, if t mod Tper = 0,

(a) delete all grid cells from the hash list that have been marked as sporadic and did not receive new points within the last T_per steps

(b) mark sparse grid cells as sporadic if requirements S1 and S2 are met

(c) adjust the clustering

It is shown in [CT07] that S1 and S2 ensure that no transitional or dense grid will be falsely deleted due to the removal of a sporadic grid cell.

Initially, DStream collects incoming points for T_per time steps, inserts them into their corresponding grid cells and defines connected regions of dense or transitional grid cells as one cluster. In step 4(c) the clustering is updated to reflect the density changes of the grid cells, possibly merging newly connected clusters or splitting clusters that became unconnected.

The value T_per for the periodical checks is set such that any change of a grid cell from dense to sparse or vice versa is recognized. This is achieved by

T_per = min{ log_b(τ_sparse/τ_dense), log_b((N − τ_dense)/(N − τ_sparse)) } = log_b( min{ τ_sparse/τ_dense, (N − τ_dense)/(N − τ_sparse) } )

In the experimental evaluation in [CT07] the parameters are set to τ_dense = 3, τ_sparse = 0.8, b = 0.998, β = 0.3, l ranges from 20 to 50 and d ranges from 2 to 40. For these parameter settings the value for T_per ranges from 0 to 2.
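The following sketch summarizes the bookkeeping described above: the decayed density of a grid cell, its classification as dense, sparse or transitional, and the resulting check interval. The function names are illustrative, and rounding T_per down to full time steps is an assumption made here to obtain the integer values reported in [CT07].

    import math

    def decayed_density(n_old, t_last, t_now, b=0.998, lam=1.0):
        # n^(t) decays by b^(lam * (t_now - t_last)) between updates; +1 for the new point
        return n_old * b ** (lam * (t_now - t_last)) + 1.0

    def cell_type(n_t, N, b=0.998, tau_dense=3.0, tau_sparse=0.8):
        W = 1.0 / (1.0 - b)                  # bounded total weight of the stream
        if n_t > tau_dense * W / N:
            return "dense"
        if n_t < tau_sparse * W / N:
            return "sparse"
        return "transitional"

    def t_per(N, b=0.998, tau_dense=3.0, tau_sparse=0.8):
        # smallest number of steps in which a dense cell can become sparse or vice versa,
        # rounded down to full time steps (assumption)
        return math.floor(min(math.log(tau_sparse / tau_dense, b),
                              math.log((N - tau_dense) / (N - tau_sparse), b)))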

Other approaches

In [PL04] a dynamic splitting of grid cells is used to cluster streaming data. To this end, coarse initial grid cells are repeatedly split based on their data distribution using either a mean split in the dimension with the highest variance, or a split along the dimension with the lowest variance that keeps the elements around the mean in one segment. Grid cells are basically represented by their bounding box and a time weighted cluster feature. Splitting and deleting of cells is performed based on two user defined density threshold parameters. As in DenStream and DStream, the algorithm periodically checks all maintained cells to remove noise or outdated regions. The initial cells are always maintained and cannot be deleted. An offline component is not employed. In [LC08] a grid-based stream clustering approach that uses the fractal dimension has been proposed.

3.3.3 Specialized methods

COBWEB [Fis87] is an incremental clustering method that stores only a compact representation of the data and can therefore be used for stream clustering. As a compact representation it uses classification trees, which differ from decision trees (cf. Chapter 2) in that they label nodes rather than branches and store probabilistic descriptors rather than logical ones. The construction of the classification trees is guided using probabilities of attribute values; a variant for continuous attributes using distributions was proposed in [GLF89]. To react to evolving data the algorithm allows merging and splitting of nodes. However, COBWEB, just as the other early method STREAM (cf. Section 3.3.1), does not incorporate aging of historical data.

For high dimensional data streams a projected clustering algorithm called HPStream has been proposed in [AHWY04]. It uses a variant of time weighted cluster features that represents clusters only in the most relevant dimensions. The relevant dimensions are determined globally by keeping the k · l dimensions over all clusters that have the smallest radii, where k is the number of clusters and l is a parameter. Subspace clustering algorithms for data streams have been proposed in [KPM06, PL07, LL08, PL08].

Further stream clustering algorithms use concepts mentioned in Section 2.3 such as MDL-based clustering [vLS08], graph-based clustering [LL09] or model-based clustering using kernel estimators [JZC06]. An extension of OPTICS to data streams has been proposed in [TRA07], and a method that employs an immune system learning model was presented in [NUCG03].

3.3.4 Summarizing remarks

Many stream clustering algorithms have an online component that maintains detailed summaries of the streaming data and an offline component that produces the actual clustering. Other methods can be adapted to act as an online component: using a large k value for STREAM or performing step 4(c) in DStream (adjust clustering) only on user request are examples. The offline components are often exchangeable; from any micro-clustering resulting from an online component a k-Means or DBSCAN clustering can be calculated. However, the online components often exhibit a rather high complexity with respect to the number of maintained micro-clusters due to expensive checks for deleting or merging micro-clusters. This can be problematic on very fast or bursty streams. This is a first motivation for the development of new methods in Part III. Moreover, none of the proposed approaches for stream clustering constitutes an anytime algorithm; none is interruptible or exploits additional time once its model has been processed.

Part II

Anytime Stream Classification


Chapter 4

The Bayes Tree

∗ Classification of streaming data faces three basic challenges: it must deal with huge amounts of data, it must make the best possible use of the varying time between two stream data items (anytime classification) and additional training data must be incrementally learned (anytime learning) for applying the classifier consistently to fast data streams. In this chapter a novel index-based technique is proposed that can handle all three of the above challenges using the established Bayes classifier on effective kernel density estimators. The proposed Bayes tree constitutes a hierarchy of mixture densities that represent kernel estimators at successively coarser levels. The proposed probability density queries adapt the employed mixtures efficiently to the individual object to be classified. Together with novel classification improvement strategies this allows for very effective classification at any point of interruption. Moreover, a novel evaluation method is introduced for anytime classification using Poisson streams, and the performance of the Bayes tree is evaluated empirically.

∗This chapter has been published in the Proceedings of the 12th International Conference on Extending Database Technology (EDBT/ICDT 2009) [SAK+09].


4.1 Introduction and Preliminaries

As discussed in Chapter 3, existing classifiers that focus on anytime learning are budget classifiers [BH90, SK01, WFYH03], i.e. they are not able to perform anytime classification. Anytime classifiers, on the other hand, have so far not been able to profit from novel training data unless they were given a large amount of time to retrain their classifier. For a stream classification application it is important to support both anytime learning and anytime classification to allow consistent use on fast data streams.

The proposed classifier enables anytime Bayes classification using non-parameterized kernel density estimators (cf. Section 4.1.1). Consistent with classifying fast data streams, the technique is able to incrementally learn from data streams at any time. The proposed Bayes tree indexing structure provides aggregated model information about the kernels in all subtrees on all hierarchy levels of the tree. In probability density queries, descending the tree is based on strategies that favor classification accuracy. As different granularities are available and compact models are located at the top levels, the probability density query can be interrupted early on (starting with a unimodal model at root level) and refines the model as long as time permits. The novel anytime classifier does not only improve its result incrementally when more time is granted, but also adapts the model refinement individually to the object to be classified. The design goals of the proposed anytime approach include

• Statistical foundation. The classifier uses the established Bayes classi- fier based on kernel estimators.

• Adaptability. The model is adapted to the individual query object, i.e. the mixture model is refined locally with respect to the object to be classified.

• Anytime classification and anytime learning for applying the Bayes tree consistently in applications on fast data streams.

The contributions on a technical level include

• Novel hierarchical organization of mixture models. The novel index structure provides a hierarchy of mixture density models and allows fast access to very fine-grained and individually adapted models.

• Probability density queries. Novel access strategies where traversal decisions are based on the probability density of the query object with respect to the current mixture model.

• Poisson stream evaluation. A novel evaluation method is introduced for anytime classification on data streams using Poisson processes.

In the following section, first the evaluation of anytime classification is reviewed and Bayes classification using kernel density estimation is described. In Section 4.1.2 hierarchical indexing is discussed in the context of anytime learning.

4.1.1 Anytime classification and kernel density estimation

Anytime classifiers are capable of dealing with the varying time constraints and high data volumes of stream applications. The usefulness of an anytime classifier depends on its quality, which can be determined with respect to accuracy and accuracy increase. Accuracy refers to the proportion of correctly classified objects for a fixed time allowance; accuracy increase is the improvement of accuracy over the amount of available time (cf. Figure 1.2). In stream classification the accuracy of an anytime classifier is given by the average classification accuracy with respect to the inter-arrival rate (speed) of the data stream. The accuracy increase also influences the stream specific anytime classification accuracy, since a steeper increase yields a better accuracy even if only marginally more time is available and thus a higher average accuracy for the same inter-arrival rate. In Section 4.3 a novel stream specific anytime classification evaluation method is introduced, which captures both aspects in a single measure.

Bayesian classification on continuous attributes requires a model for probability density estimation (cf. Chapter 2). A detailed approach to density estimation is given by kernel density estimators, which do not make any assumption about the underlying data distribution (and are thus often termed "model-free" or "non-parameterized" density estimation). Kernel estimators can be seen as influence functions centered at each data object. To smooth the kernel estimator a bandwidth h_i is used. Comparing Definition 2.3 with kernel densities, the bandwidth corresponds to the variance of a Gaussian component.

Definition 4.1 Kernel density estimation. The class conditional probability density for an object q ∈ Q and class l_i based on a training set T ⊆ Q and kernel K with bandwidth h_i is given by:

p_K(q | l_i) = (1 / (|T_{l_i}| · h_i^d)) · Σ_{o_j ∈ T_{l_i}} K( ‖q − o_j‖ / h_i )

Thus, the class conditional probability density for any object q is the weighted sum of kernel influences of all objects o_j of the respective class. In this thesis, Gaussian kernels K_Gauss(x) = (1/(2π)^{d/2}) · e^{−x²/2} are used along with Gaussian mixture models in a consistent model hierarchy to support mixing of models and kernels in the Bayes tree.

In terms of classification accuracy, Bayes classifiers using kernel estimators have been shown to perform well for traditional classification tasks [EW95]. Especially for huge training data sets the estimation error using kernel densities is known to be very low and even asymptotically optimal [Sil86]. In Section 4.2 a novel indexing structure is proposed that enables kernel density estimation on huge data sets (as in stream applications) and incremental learning at any time. Moreover, it is shown how anytime density estimation using kernel density estimators can be realized with the proposed method. There has been some research on anytime density estimation [ZRL99, GM03]. It is however not described how to use these approaches for anytime classification. As we will clearly see in Section 4.3, this is not a trivial task. Neither a naive solution nor either of the straightforward solutions delivers competitive results.
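A minimal sketch of Bayes classification with the Gaussian kernel estimator of Definition 4.1 is given below; the function names and the per-class bandwidth dictionary are illustrative only.

    import math

    def gauss_kernel(x, d):
        # K_Gauss(x) = (2*pi)^(-d/2) * exp(-x^2 / 2)
        return (2.0 * math.pi) ** (-d / 2.0) * math.exp(-0.5 * x * x)

    def kernel_density(q, training_objects, h, d):
        # p_K(q | l_i) for the training objects of one class l_i (Definition 4.1)
        n = len(training_objects)
        s = sum(gauss_kernel(math.dist(q, o) / h, d) for o in training_objects)
        return s / (n * h ** d)

    def bayes_classify(q, classes, priors, h, d):
        # classes maps a label to its training objects; h maps a label to its bandwidth.
        # Returns the label with the highest (unnormalized) posterior p_K(q|l_i)*P(l_i).
        return max(classes,
                   key=lambda l: kernel_density(q, classes[l], h[l], d) * priors[l])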

4.1.2 Anytime learning using hierarchical indexing

Supervised classification is performed in two steps: learning and classification. So far we discussed the quality criteria for anytime classification. However, in stream applications it is often essential to efficiently learn from new labeled objects. As the time to learn from new objects is typically limited, classifiers are required which can be interrupted at any time during learning. The classification model which has been learned up to this point in time is then used for all further classification tasks. A classifier that can learn from new objects at any time is called an anytime learning algorithm in the following (also called incremental learning).

An important quality criterion for anytime learning algorithms is the runtime complexity of updating the learned model. While using lazy learning (e.g. nearest neighbor) allows for updating the decision set in constant time, the lack of a concise model of the data stops this approach from being effective in the anytime classification context. An anytime classifier based on the nearest neighbor approach has been proposed in [UXKL06] (cf. Chapter 3). However, to perform anytime classification, [UXKL06] uses a non-lazy learning method that has a superlinear complexity for new objects. The method proposed in this chapter uses a hierarchy of models for Bayes classification that allows incremental refinement to meet the requirements of anytime classification and anytime learning. The Bayes tree indexes density models for fast access. Using multidimensional indexing techniques enables the Bayes tree to learn from new objects in logarithmic time.

Multidimensional indexing structures like the R-tree [Gut84] have been shown to provide substantial efficiency gains in similarity search in low dimensional spaces. The basic idea is to organize the data efficiently on disk pages such that only relevant parts of the database must be accessed. This is based on the assumption that in similarity search, query processing requires only a small portion of the data. This assumption does not hold in Bayes kernel classification. As the entire model potentially contributes to the class label decision, the entire database must be accessed in order to perform Bayes classification without loss of accuracy (see Section 4.2 for details). Consequently, simply storing objects for kernel estimation in multidimensional indexes does not suffice for anytime classification.

The Gauss-tree [BPS06] builds on the R*-tree structure [BKSS90] to process unimodal probabilistic similarity queries. As its application focus is on similarity search, it allows neither for anytime classification nor for the management of multimodal densities.

4.2 Indexing density models

Anytime algorithms and Bayes classification using kernel density estimation are the preliminaries for the proposed anytime classifier. The overall goal is a classification algorithm that can be interrupted at any given point in time and that produces a meaningful classification result that improves with additional time allowances. The Bayes classifier on kernel densities cannot provide a meaningful class conditional density p(q|li) at an arbitrary (early) point in time, since the entire model may potentially contribute to the class label decision (cf. Section 4.1.2). In the next section an overview is provided of the technique for estimating the class conditional density before discussing the technical details in the following sections. Finally, it is described how the classification decision is made and improved using the novel index structure.

4.2.1 Outline

As discussed before, any anytime classifier must provide an efficient way to improve its results with additional time. Indexing provides means for efficiency in similarity search and retrieval. By grouping similar data on hard disk and providing directory information on disk page entries, only the relevant parts of the data are accessed during query processing. Similarity search queries are usually specified via a query object and a similarity tolerance given by a threshold range ε. Using this ε range, irrelevant parts of the data are pruned based on directory information during query processing. Thus the amount of data that must be accessed is reduced, which in turn greatly reduces runtime. This is illustrated in Figure 4.1 (top left). The query and the ε range allow for straightforward pruning of database objects whose directory entries are at more than ε distance (with respect to a given distance function). This scenario also applies for k nearest neighbor queries [SK98].

Figure 4.1: Pruning as in similarity search is not possible for probability density queries, since all kernels must be accessed to answer a q-probability density query.

In the Bayes tree, the data objects are stored at leaf level as in similarity search applications. As classification requires reading all kernel estimators of the entire model, accuracy would be lost if a subset of all kernel densities was ignored. Consequently, there is no irrelevant data, and hence the pruning as in similarity search is infeasible when dealing with density estimation. As for the accuracy increase, accessing the entire kernel density model is not only inefficient but also not interruptible. Moreover, kernel densities provide only a single model of the data, i.e. no incremental improvement of classification is possible as required for anytime classification. As illustrated in Figure 4.1 (bottom left), the upper levels of an index that supports anytime classification need to provide information that can be used for assigning class labels even before the leaf level containing the kernels is reached. The Bayes tree solves this problem by storing aggregated statistical information in its inner nodes (Figure 4.1, right).


Figure 4.2: Nodes and entries in the Bayes tree (here: above the kernel level). Entries e_s store minimum bounding rectangles (MBRs), child pointers (ptrs), and additional information to compute mean and variance, i.e. the number of objects (n_s) as well as their linear and quadratic sums per dimension. Kernel positions are depicted as dots, the higher level mixture density is represented as density curves.

The proposed approach is therefore a hierarchy of mixture models built on top of kernel densities. It provides a classification decision on interruption and allows for adaptive query-based refinement as long as more time is available.

4.2.2 Structure of the Bayes tree

In this section the Bayes tree structure is described for a single class l_i. Query processing and refinement on multiple classes is discussed in Section 4.2.3. The Bayes tree builds on top of the R-tree and the CF-tree. Each entry is associated with a Gaussian that represents the data in its corresponding subtree. A cluster feature vector according to Definition 3.2 is stored per entry. This way the entries can be updated incrementally (cf. Theorem 3.1) and the parameters µ and σ of the Gaussian can be computed easily.

Definition 4.2 Bayes tree node entry. A subtree Ts of a d-dimensional Bayes tree is associated with the set of objects stored in the leaves of the subtree:

Ts = {o(s,1), . . . , o(s,ns)}. An entry es then stores the following information about the subtree Ts:

• The minimum bounding rectangle enclosing the objects stored in the subtree T_s as MBR_s = ((l_1, u_1), . . . , (l_d, u_d))

• A pointer ptrs to the subtree Ts

• The number ns of objects in Ts

• The vector LSs containing the linear sum of all objects per dimension

• The vector SSs containing the squared sum of all objects per dimension

All objects stored in the leaves of the Bayes tree are d-dimensional kernels. Figure 4.2 illustrates the structure of a Bayes tree node entry. From the information stored in each entry according to Definition 4.2, mean and variance are computed according to Equations 3.1 and 3.2. The Bayes tree extends the R*-tree to store model specific information in the following manner:

Definition 4.3 Bayes tree. A Bayes tree with fanout parameters m and M and leaf node capacity parameters l and L is a balanced multidimensional indexing structure with the following properties:

• Each inner node nodes contains between m and M entries (see Def. 4.2). The root has at least a single entry.

• Each inner node with νs entries has exactly νs child nodes

• Leaf nodes store between l and L observations (d-dimensional kernels).

• A path from the root to any leaf node always has the same length (balanced).

This structure has some very nice and intuitive benefits: since the number of objects, their linear sum as well as their quadratic sum are all distributive measures, they can be computed efficiently bottom up. More precisely, we can use the build procedure of any standard R*-tree to create the hierarchy of mixture densities. Using an index structure from the R-tree family automatically allows both processing of huge amounts of data and incremental learning at any time. Recall that R*-trees, or any other tree from the B-tree or R-tree family, grow in a bottom-up fashion. Whenever there are more entries assigned to a node than allowed by the fanout parameter M, which in turn reflects the page size on hard disk, an overflow occurs. This overflow leads to a node split, i.e. the entries of this overfull node are separated onto two new nodes. Their ancestor node then stores aggregate information on these two nodes. In the R*-tree case, this is simply the MBR of the two nodes. In addition to that, for the Bayes tree the overall number as well as the linear and quadratic sums must be computed and stored in the ancestor node. This way, in a very straightforward manner, coarser mixture density models are created. Consequently, building a Bayes tree is a simple procedure that starts from the original kernel density estimators, which are stored in a leaf node until a split is necessary. This first split leads to the creation of a second level in the tree, with a coarser mixture density model derived through an R*-tree-style split [BKSS90]. Successive splitting of nodes (as more kernel densities are inserted) leads to further growth of the tree, eventually yielding a hierarchy of mixture density models as coarser representations of the kernel densities.
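The per-entry statistics of Definition 4.2 and their incremental maintenance can be sketched as follows. The class name and methods are illustrative only; the Gaussian component uses a diagonal covariance derived from the stored sums (cf. Equations 3.1 and 3.2).

    import math

    class BayesTreeEntry:
        # Aggregated statistics of a subtree: MBR, n, linear and squared sums per dimension.
        def __init__(self, d):
            self.n = 0
            self.ls = [0.0] * d                       # linear sum per dimension
            self.ss = [0.0] * d                       # squared sum per dimension
            self.mbr = [(math.inf, -math.inf)] * d    # minimum bounding rectangle
            self.child = None                         # pointer to the subtree's node

        def insert(self, o):
            # distributive measures allow incremental, bottom-up updates
            self.n += 1
            for j, x in enumerate(o):
                self.ls[j] += x
                self.ss[j] += x * x
                lo, hi = self.mbr[j]
                self.mbr[j] = (min(lo, x), max(hi, x))

        def mean(self):
            return [s / self.n for s in self.ls]

        def variance(self):
            return [self.ss[j] / self.n - (self.ls[j] / self.n) ** 2
                    for j in range(len(self.ls))]

        def gaussian_density(self, q):
            # g(q, mu, sigma) with a diagonal covariance estimated from LS and SS
            mu, var = self.mean(), self.variance()
            dens = 1.0
            for j, x in enumerate(q):
                v = max(var[j], 1e-12)
                dens *= math.exp(-0.5 * (x - mu[j]) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
            return dens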

4.2.3 Query processing

Answering a probability density query requires a complete model as stored at each level of the tree. Besides these full models, local refinement of the model (to adapt flexibly to the query) provides models composed of coarser and finer representations. This is illustrated in Figure 4.3 on page 83. In any model, each component corresponds to an entry that represents its subtree. This entry may be replaced by the entries in its child node, yielding a finer representation of its subtree. This idea leads to query-based refinement in the proposed anytime algorithm. A frontier is a model representation that consists of node entries such that each kernel estimator contributes and such that no kernel estimator is represented redundantly. To formalize the notion of a frontier, we need a clear definition for the enumeration of nodes and entries.

Definition 4.4 Enumeration of nodes and entries. To refer to each entry stored in the Bayes tree we use a label s with the following properties:

• The initial starting point is labeled e∅ with T∅ being the complete set of objects stored in the Bayes tree.

• The child node of an entry e_s has the same label s as its parent entry, i.e. the child of e_s is node_s. T_s denotes the set of objects stored in the respective subtree of e_s.

• The ν_s entries stored in node_s are labeled {e_{s◦1}, . . . , e_{s◦ν_s}}, i.e. the label s of the predecessor concatenated with the entry's number in the node (recall Figure 4.2: node_s has two entries e_{s◦1} and e_{s◦2} with their child nodes node_{s◦1} and node_{s◦2}).

A set E = {e_i} of entries defines a Gaussian mixture model GMM according to Definition 2.3, which can then be used to answer a probability density query.

Definition 4.5 Probability density query pdq. Let E = {e_i} be a set of entries, GMM_E the corresponding Gaussian mixture model and n = Σ_i n_{e_i} the total number of objects represented by E. A probability density query pdq returns the density for an object q with respect to GMM_E by

pdq(q, E) = Σ_{e_s ∈ E} (n_{e_s}/n) · g(q, µ_{e_s}, σ_{e_s})

where µ_{e_s} and σ_{e_s} are calculated according to Equations 3.1 and 3.2.

For a leaf entry a kernel estimator as discussed in Section 4.1 is used, and obviously µ_{e_s} is the object itself. Figure 4.3 b) shows the resulting mixture density for the example frontier from part a). The leftmost Gaussian stems from the entry e_1, which is located at root level. The rightmost Gaussian and the one in the back correspond to entries e_{23} and e_{21} respectively; the remaining ones represent kernel densities at leaf level. Part c) of the image depicts the underlying R*-tree MBRs and the kernels as dots. The bigger blue dot and the vertical line represent the query object from which the above frontier originated. Using the above enumeration, a frontier can be derived from any prefix-closed set of nodes.

Definition 4.6 Prefix-closed subset and frontier. A set of nodes N is prefix-closed iff:

• node∅ ∈ N (the root is always contained in a prefix-closed subset)

• nodes◦i ∈ N ⇒ nodes ∈ N (if a node is contained in N , its parent is also contained in N )

Let E(N) be the set of entries stored in the nodes that are contained in a prefix-closed node set N: E(N) = ⋃_{node ∈ N} {e ∈ node}. The frontier F(N) ⊆ E(N) contains all entries e ∈ E(N) that satisfy

• e_{s◦i} ∈ F(N) ⇒ e_s ∉ F(N): no predecessor of a frontier entry is in F(N) (predecessor-free)

• e_s ∈ F(N) ⇒ ∄ i ∈ ℕ with e_{s◦i} ∈ E(N): a frontier entry has no successors in E(N) (successor-free)

Figure 4.3 a) illustrates both a prefix-closed subset of nodes and the corresponding frontier. The subset, separated by the curved red line, contains the root, node_2 and node_{22}; hence it is prefix-closed. Its corresponding frontier (highlighted in white) contains the entries e_1, e_{21}, e_{23}, e_{221}, e_{222} and e_{223}, where the last three represent kernels. The frontier is both predecessor-free and successor-free.


Figure 4.3: Tree frontier: In a) a frontier in a Bayes tree that corresponds to a model with mixed levels of granularity is depicted: nodes 2 (root level) and 22 (inner level) are refined, yielding 1 (root level), 21 (inner level), 221, 222, 223 (all leaf level), 23 (inner level). The resulting mixture density model is depicted in b), the actual kernel positions and the hierarchy are depicted in c).

Using the above definitions, the following defines the processing of a probability density query on the Bayes tree.

Let T∅ be the set of objects stored in a Bayes tree containing tmax nodes.

Anytime pdq processing for an object q processes a prefix-closed subset N_t in each time step t with N_0 = {node_∅}, i.e. the processing starts with the root node, and |N_{t+1}| = |N_t| + 1, i.e. in each time step one more node is read. The probability density for the query object q at time step t is then calculated using the mixture model corresponding to the frontier F(N_t) of the current prefix-closed subset N_t as in Definition 4.5, that is pdq(q, F(N_t)). This way all objects of the tree are accounted for: Σ_{e_s ∈ F(N_t)} n_{e_s} = |T_∅|.

Definition 4.7 Anytime pdq processing. From time step t to t + 1, N_t becomes N_{t+1} by adding the child node node_s of one frontier entry e_s ∈ F(N_t). If node_s has ν_s entries, then the frontier F(N_t) changes to F(N_{t+1}) by

F(N_{t+1}) = (F(N_t) \ {e_s}) ∪ {e_{s◦1}, . . . , e_{s◦ν_s}}

i.e. e_s is replaced by its children. The probability density for q in time step t + 1 is then calculated by

pdq(q, F(N_{t+1})) = pdq(q, F(N_t)) − (n_s/|T_∅|) · g(q, µ_{e_s}, σ_{e_s}) + Σ_{i=1}^{ν_s} (n_{s◦i}/|T_∅|) · g(q, µ_{e_{s◦i}}, σ_{e_{s◦i}})

Following the above equation, the probability density for q in time step t + 1 is calculated by taking the probability density for q in time step t (first term), subtracting the contribution of the refined entry's Gaussian (second term) and adding the contributions of its children's Gaussians (third term). Hence, the cost for calculating the new probability density for q after reading one additional node is very low due to the information stored for mean and variance. After that the entry has to be sorted into the frontier, which can be done in O(log_2(r · M)) using a priority queue (after r refinements the frontier contains maximally r · M entries).
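A minimal sketch of this refinement loop, using a heap over the refinable frontier entries ordered by each entry's weighted density contribution for q (the probabilistic priority discussed below), is given next. The entry interface (.n, .children, .gaussian_density) is an assumption for illustration, not the original implementation; calling list(anytime_pdq(...)) yields the successive density estimates, and interrupting simply means stopping the iteration early.

    import heapq

    def anytime_pdq(q, root_entries, total_n, time_steps):
        # Anytime probability density query (cf. Definition 4.7), global best first
        # descent; entries expose .n, .gaussian_density(q) and .children
        # (empty at kernel level). Illustrative sketch only.
        density = 0.0
        heap = []                                   # max-heap via negated priority

        def push(entry):
            contrib = (entry.n / total_n) * entry.gaussian_density(q)
            if entry.children:                      # only refinable entries are queued
                heapq.heappush(heap, (-contrib, id(entry), entry))
            return contrib

        for e in root_entries:
            density += push(e)
        yield density                               # coarsest model
        for _ in range(time_steps):
            if not heap:                            # full kernel density reached
                break
            neg_contrib, _, e = heapq.heappop(heap)
            density += neg_contrib                  # subtract the refined entry's Gaussian
            for child in e.children:
                density += push(child)              # add its children's Gaussians
            yield density                           # pdq(q, F(N_t)) after this refinement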

Note that after reading all t_max nodes, pdq(q, F(N_{t_max})) is a full kernel density estimation taking all kernels at leaf level into account.

Each possible frontier represents a possible model for the anytime algorithm. The number of possible models in any Bayes tree depends on the actual status of the tree, i.e. the degree to which it is filled and the height of the tree. For example, there are four possible models for a Bayes tree of height two and fanout two (root, leaves, and two mixed models). For realistic trees, with increasing height and fanout, the number of possible models is exponential in the height and in the fanout. Choosing among all these models is crucial for anytime algorithms, where the available time (typically unknown a priori) should be spent such that the most promising model information is used first.

For tree traversal three basic descent strategies are proposed to answer probability density queries on a Bayes tree. Descent in a breadth first (bft) fashion refines each level completely before descending down to the next level. Alternatively, descending the tree in a depth first (dft) manner refines a single subtree entirely down to the actual kernel estimators before refining the next subtree below the root. The choice of the subtree to refine is made according to a priority measure. The third approach, which is called global best first (glo), orders nodes globally with respect to a priority measure and refines nodes in this ordering.

The priority measure in R-trees, denoted as geometric, gives highest priority to the node with the smallest (Euclidean) distance based on its minimum bounding rectangle. In the Bayes tree, the focus is not on distances, but on density estimation. The proposed probabilistic priority measure awards priority with respect to the actual density value for the current query object based on the statistical information stored in each node. More specifically, at time step t, having read the prefix-closed set of nodes N_t, the probabilistic approach descends at the next time step t + 1 into the subtree T_ŝ belonging to the entry e_ŝ with

e_ŝ = argmax_{e_s ∈ F(N_t)} { g(q, µ_{e_s}, σ_{e_s}) · n_{e_s}/n }

where q is the current query object and g is a Gaussian pdf as in Def. 2.3.

In Section 4.3 all combinations between the three descent strategies and the two priority measures are evaluated, where in the breadth first approach the priority measure determines the order of nodes per level. The described descent strategies work on a single Bayes tree; they refine mixture models for just one class. In the following refinement with respect to several classes L = {l1, . . . , l|L|} is discussed. The time for classification is divided between several Bayes trees, one per class (cf. Section 4.2.2). The question is, having read x nodes so far, i.e. being at time step tx, which of the |L| trees should be granted the right to read the next node in the following time step tx + 1?

Initially, the coarsest model for each class l_i is evaluated, which is the model consisting of only one Gaussian probability density function representing all objects of l_i.

A naive improvement strategy chooses the classes in an arbitrary order l_1, . . . , l_{|L|}. First, the class l_1 is refined according to one of the above strategies. In the next step, l_2 is refined and so on. After |L| time steps all classes have been refined and the Bayes tree of the first class is processed again. As the classes are not sorted in any predefined way, this method is referred to as unsorted.

Intuitively, it is very likely that for an object q the true class l_i shows a high posterior probability independent of the time t_{l_i} spent descending its corresponding tree:

P(l_i | q) = pdq(q, F(N_{t_{l_i}})) · P(l_i) / p(q).

Consequently, a second strategy for improving classification is suggested which only chooses the k classes having the highest a posteriori probability. This technique stores these k classes in a priority queue (queue best k).

Assume the sorted priority queue contains the classes (l_{i_1}, . . . , l_{i_k}). The next node is read in the Bayes tree of class l_{i_1} and the probability model is updated accordingly. If more time is permitted, class l_{i_2} is refined and so on. After k steps all k classes have been improved and the queue is filled again using the new a posteriori probabilities. The influence of the parameter k is evaluated in the experiment section. Setting k = 1, always the model of the most probable class is refined. For k = m (assuming m = |L| in the rest of this chapter) all classes are refined in the order of their probability before the queue is refilled. The approaches are referred to as qb1 and qbm, respectively, and qbk in general.

One observation which is also known from hierarchical indexing is that the root node often contains only a few entries. This free space is used for storing the entries from different classes on the same page if it does not exceed the page size. These new root pages are called the compressed root. The total size of the compressed root varies, since the number of root entries in a single Bayes tree varies as in all index structures. Clearly, the total size also grows linearly in the number of dimensions and the number of classes.
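A sketch of the queue-best-k improvement strategy across the per-class trees is given below. Here refiners is assumed to map each class label to an iterator of successively refined class-conditional densities (for instance one anytime pdq generator per class), priors holds the class priors, and the default choice of k follows the heuristic evaluated later in Section 4.3.2; the wiring is illustrative only.

    import math

    def qbk_classify(refiners, priors, steps, k=None):
        # Queue-best-k anytime classification across |L| per-class refiners.
        m = len(refiners)
        k = k or max(2, int(math.log2(m)))                     # heuristic k = max(2, floor(log2 m))
        densities = {l: next(r) for l, r in refiners.items()}  # unimodal initialization
        spent = 0
        while spent < steps:
            # fill the queue with the k currently most probable classes
            queue = sorted(densities, key=lambda l: densities[l] * priors[l],
                           reverse=True)[:k]
            for l in queue:
                if spent >= steps:
                    break
                densities[l] = next(refiners[l], densities[l])  # read one more node
                spent += 1
        return max(densities, key=lambda l: densities[l] * priors[l])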

4.3 Experiments

The experiments are performed on real world and synthetic data and evaluate the Bayes tree for both anytime classification and anytime learning. Table 4.1 summarizes the data sets and their characteristics. The synthetic data sets were generated using a random Gaussian mixture model as in [HTF02]. A small data set containing 4 dimensions, 2 classes and eleven thousand objects and one large data set containing 5 dimensions, 3 classes and one million objects are tested. Furthermore, real world data of various characteristics is tested, which is taken from different sources such as the UCI KDD archive [HB99].

All experiments were performed using 4 fold cross validation. To evaluate the anytime performance, the accuracy (averaged over the four folds) is reported after each node that is read (corresponding to one page). The first accuracy value corresponds to the classification accuracy of the unimodal Bayes classifier that is used as an initialization before reading the compressed root (cf. Section 4.3.1). Experiments were run using 2KB page sizes (except 4KB for Verbmobil and 8KB for the USPS data set) on Pentium 4 machines with 2.4 GHz and 1 GB main memory. The bandwidth for the kernel estimators is set using a common method according to [Sil86].

name             size     classes  features  ref.
Synthetic (11K)  11000    2        4
Synthetic (1M)   1000000  3        5
Vowel            990      11       10        [HB99]
USPS             8772     10       39        [HTF08]
Pendigits        10992    10       16        [HB99]
Letter           20000    26       16        [HB99]
Gender           189961   2        9         [AS04]
Covtype          581012   7        10        [HB99]
Verbmobil        647585   5        32        [Wah00]

Table 4.1: Data sets used in the experiments.
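The exact bandwidth rule from [Sil86] is not restated here; as one common instantiation, the normal reference rule can be sketched as follows. This is an assumption for illustration (applied per dimension or per class as needed), not necessarily the variant used in the experiments.

    import statistics

    def silverman_bandwidth(values):
        # Normal reference rule for a one-dimensional sample: h = 1.06 * sigma * n^(-1/5)
        n = len(values)
        sigma = statistics.stdev(values)
        return 1.06 * sigma * n ** (-1.0 / 5.0)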

The following first evaluates the descent strategies along with the priority measures introduced in Section 4.2.3. In Section 4.3.2 the various improvement strategies are analyzed to find that none of the three simple approaches leads to competitive results. After that a novel evaluation method is introduced for anytime classifiers on Poisson streams in Section 4.3.3. Finally the anytime learning performance of the Bayes tree is demonstrated for both anytime accuracy and stream accuracy.

4.3.1 Descent strategies

For tree traversal, depth first (dft), breadth first (bft) and global best first (glo) strategies are evaluated. Combined with both the geometric (geom) and the probabilistic (prob) priority measure, six approaches are compared for model refinement in a Bayes tree. Figure 4.4 shows the results for all six approaches on the small and on the large synthetic data set using the unsorted improvement strategy. After 5 pages on the small synthetic data set, the locality assumption of the depth first approach fails to identify the most relevant refinements for both geometric and probabilistic priority. For the large synthetic data set the accuracy even decreases after 17 pages. The processing strategy of the breadth first approach causes the classification accuracy to increase only if a next level of the tree is reached.


Figure 4.4: Comparing descent strategies along with priority measures on synthetic data.

The results suggest that the performance of the different descent strategies is independent of the employed improvement strategy (i.e. unsorted or qbk). As an example for the independence of the improvement strategy, Figure 4.5 shows the performance on the Gender data set using qbm. The glo descent strategy with the probabilistic priority measure shows the overall best anytime classification accuracy for all data sets. Further experiments use the global best first descent and the probabilistic priority measure.

4.3.2 Improvement strategies

First the improvement strategies unsorted, qb1 and qbm are tested on all seven real world data sets from Table 4.1. Besides that, qb3 is used for the first evaluation (setting k to 3, or 2 for Gender, respectively), as it can be seen as corresponding to "some relevant candidates". (A thorough evaluation of various values for k follows later in this section.)

Figure 4.5: Comparing descent strategies along with priority measures on the Gender data set.

The results from Pendigits and Covtype represent the two sorts of outcome that were found when analyzing the results (cf. Figures 4.6 and 4.7). The naive approach, i.e. the unsorted improvement strategy, does not provide a competitive solution (third rank on Pendigits and last rank on Covtype). However, after reading the entire compressed root this approach has read the same information as qbm. This can best be seen in the results for Covtype. After four page accesses the compressed root is read and both strategies deliver the same classification accuracy. From then on, their accuracy is the same after every seven steps, i.e. 11, 18, 25, . . ., since Covtype has seven classes. The graph for Pendigits does not show this behavior in such a clear fashion. This is due to the averaging over the four folds. The compressed root needs between seven and eight pages within each fold. Therefore the accuracy for unsorted and qbm is similar after seven to eight steps, yet not equal. This similarity occurs every ten steps from then on, since Pendigits has ten classes.


Figure 4.6: Comparing improvement strategies on Pendigits.


Figure 4.7: Comparing improvement strategies on Covtype.

Regarding the other improvement strategies, one could expect qb1, i.e. refining only the most probable class, to deliver the best results, because it either strengthens the current decision by increasing the probability for the correct class or corrects the decision by decreasing the probability in case a false class currently has the highest density. On Covtype qb1 outperforms the qbm and unsorted strategies, but on Pendigits it performs far worse than both of them. Opposite success and failure would befall those in favor of qbm. The results on Verbmobil and Gender were similar to those on Pendigits, i.e. qb1 performed far worse than all other strategies. On the other hand, similar to the Covtype results, qb1 performed better than qbm and unsorted on Letter and USPS. On all data sets, all three strategies discussed above were clearly outperformed by the qb3 approach.


Figure 4.8: Variation of k on the Letter data set.


Figure 4.9: The area under the corresponding anytime curves in Figure 4.8.

These results prove that anytime density estimation does not extend in a naive or straightforward way to anytime classification with competitive results. Moreover, the results pose the question how the choice of k influences the anytime classification performance. Figure 4.8 shows the comparison of qb1, qbm and qbk for k ∈ {2, 4, 6, 8, 10} on the Letter data set. As stated above, qb1 performs better than qbm on this data set. The accuracy of the qbm strategy in Figure 4.8 initially increases steeply in the first three to four steps. It is clearly visible that the compressed root needs on average 19 pages for the 26 classes (recall the 4 fold cross validation). After the compressed root is read, the accuracy of the qbm strategy again increases steeply for the next four to five steps. A similar increase can be observed 26 steps later when all classes have been refined once more.


Figure 4.10: Area values for k ∈ {1 . . . m} for Vowel (left) and Covtype (right).

Opposed to these three strong improvements are the rest of the steps, during which the accuracy hardly changes at all. For this data set we derive from Figure 4.8 that for an object to be classified there are on average five classes that are worth considering; the other 21 classes can be excluded. The qb10 strategy needs seven to eight pages to read the root information for the top ten classes from the compressed root. These steps show the same accuracy as the qbm approach, since the ordering according to the priority measure is equal. After that a similar pattern can be observed every ten steps, i.e. after reordering according to the new density values. Continuing with the qb8 variant, the "ramps" (steep improvements) move even closer together and begin earlier. Similar improvement gains are shown by qb6 and qb4. An even smaller k (k = 2 in Figure 4.8) no longer improves the overall anytime accuracy, and qb1 shows an even worse performance than qb10.

For an easier comparison of the qbk performance when varying k, the area under the anytime curve is calculated for each k, yielding a single number for each value of k. Since qbk curves mostly ascend monotonically, a larger area is an indication of a better anytime performance. Figure 4.9 shows the area values corresponding to the anytime curves in Figure 4.8. As in the above analysis, qb4 turns out to perform best according to this measure.

Evaluating qbk for k ∈ {1 . . . m} on all data sets showed the same characteristic results. Figure 4.10 displays the results for Vowel and Covtype. The results show that the maximum value is always between k = 2 and k = 4 on the data sets tested. Moreover, with an increasing number of classes, the k value for the best performance also increased, yet only slightly. An exception is the Gender data set, where k = m = 2 showed the best performance. This is due to the bad performance of qb1; therefore k is set to be at least 2. To develop a heuristic for the value of k, experiments were performed on different data sets having different numbers of classes. It turned out that a good choice for k is logarithmic in the number of classes, i.e. k = ⌊log_2(m)⌋. Using this heuristic met the maximal performance in all evaluations. Yet the minimum value for k is set to 2 as mentioned above. This improvement strategy is used in all further experiments along with the best descent strategy from Section 4.3.1.

4.3.3 Evaluating anytime classification using Poisson streams

So far we determined the best strategy for descending a Bayes tree to refine the density model and evaluated the proposed improvement strategies. Next, a novel evaluation method is introduced for anytime classifiers using Poisson streams. Finally, the anytime learning performance of the Bayes tree is demonstrated on various data streams.

To evaluate anytime classification under variable stream scenarios, the proposed evaluation method recapitulates a stochastic model that is widely used to model random arrivals [DHS01]. A Poisson process describes streams where the inter-arrival times are independently exponentially distributed. Poisson processes are parameterized by an arrival rate parameter λ:

Definition 4.8 Poisson stream. The probability density function for the inter-arrival time of a Poisson process is exponentially distributed with parameter λ: p(t) = λ · e^{−λt}. The expected inter-arrival time of an exponentially distributed random variable with parameter λ is E[t] = 1/λ.

A Poisson process is used to model random stream arrivals for each fold of the cross validation. The method randomly generates exponentially distributed inter-arrival times for different values of λ. If a new object arrives (the time between two objects has passed), the anytime classifier is interrupted and its accuracy is measured. This experiment is repeated using different expected inter-arrival times 1/λ, where a unit corresponds to a page as in all previous experiments. It is assumed that any object arrives at the earliest after the initialization phase of the previous object, i.e. after evaluating the unimodal Bayes.

Figure 4.11: Stream classification accuracy using Poisson processes for data sets Pendigits and USPS (left) and Synthetic 1M, Covtype and Verbmobil (right).

Figure 4.11 shows the results for the stream classification accuracy as described above for different values of λ on Pendigits, USPS, Covtype, Verbmobil and the large synthetic data set. All data sets show an improvement of accuracy as the expected inter-arrival time 1/λ increases and arrival times are distributed following a Poisson process. This is an excellent feature that contrasts anytime classifiers with budget or contract classifiers. Budget classifiers would be forced to use a time budget that is small enough to guarantee the processing of all arriving objects, which would clearly result in worse performance. Moreover, evaluating anytime classifiers using Poisson processes yields a single accuracy value for each λ, making it very easy to compare performance on different underlying streams.
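The evaluation protocol can be summarized in a few lines. The callable classify_anytime, which interrupts the classifier after the given number of pages and returns its current prediction, is an assumption for illustration.

    import random

    def poisson_stream_accuracy(test_set, classify_anytime, lam, seed=0):
        # Average accuracy on a Poisson stream with rate lam: for each test object an
        # exponentially distributed inter-arrival time is drawn (in pages) and the
        # anytime classifier is interrupted after at least the initialization phase.
        rng = random.Random(seed)
        correct = 0
        for x, label in test_set:
            pages = max(1, int(rng.expovariate(lam)))      # time until the next object
            if classify_anytime(x, pages) == label:
                correct += 1
        return correct / len(test_set)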

4.3.4 Incremental learning

Incremental learning in stream classification applications originates from labeled objects that newly arrive in the data stream.


Figure 4.12: Incremental learning on Gender.

If the Bayes tree has started learning/inserting a new object, it always finishes the insertion for that object; no roll back is performed. However, if the average inter-arrival time between two objects is smaller than the time necessary for learning an additional object, the classifier cannot process all labeled objects online. Hence, the resulting classifier is based on a varying amount of training data depending on the stream speed, i.e. λ as in Definition 4.8. To evaluate the incremental learning performance of the Bayes tree, different λ values are tested such that the classifier is only able to process a certain percentage of the training data. The results for 50%, 75% and 100% are reported in the following. First the anytime classification accuracy of the three classifiers is evaluated. To be able to compare the results, all three classifiers classify the same stream of test objects. As above, the results are averages over 4 folds. Figure 4.12 shows the resulting anytime accuracy of the three incrementally learning classifiers on the Gender data set. Even the classifier trained on the fastest stream (train 50%) shows a good performance. On slower streams, the Bayes tree can exploit its incremental learning property and learn more objects before the training phase is over. The resulting anytime accuracy (train 75%) lies constantly above train 50% from page three onwards. The Bayes tree that was trained on the slowest stream (train 100%) performs consistently better than both of the others, again highlighting the benefit of the incremental learning property.


Figure 4.13: Incremental learning on Vowel.


Figure 4.14: Incremental learning on Covtype.

Finally the incrementally trained Bayes trees are evaluated on Poisson streams for various λ. Figures 4.13 and 4.14 show the resulting stream accuracy values for Vowel and Covtype, respectively. Again the same underlying stream data is used during classification to be able to compare the results. The anytime learning performance can be seen within each group of three bars, where immediate comparison of the classifiers is possible through the Poisson stream evaluation. For all of the evaluated λ values and data sets the classifiers trained on slower streams perform better. Moreover, with smaller λ values during classification, the stream accuracy of each individual Bayes tree improves.

The evaluation shows that the Bayes tree efficiently supports incremental learning as well as anytime classification. Moreover, the underlying index structure ensures the ability to handle very large data sets as in the case of streaming data.

4.4 Conclusion

In this chapter a novel index-based classifier called the Bayes tree was proposed. It supports anytime learning and anytime classification and can handle huge amounts of data, which makes it a consistent solution for classification on fast data streams. The Bayes tree constitutes a hierarchy of mixture densities that represent kernel estimators at successively coarser levels. The proposed probability density queries adapt the employed mixtures efficiently to the individual object to be classified. The time complexity of the r-th refinement is in $O(\log_2^2(r \cdot M))$, where M is the maximal fanout. Together with novel classification improvement strategies this allows for very effective classification at any point of interruption. Moreover, the anytime learning performance of the proposed classifier was demonstrated and a novel evaluation method for anytime classification using Poisson streams was investigated. The following three chapters concentrate on improving the Bayes tree by different offline learning methods. The applicability to evolving data streams with changing distributions and new labeled data is given by combining the classifier with general approaches such as [WFYH03], as discussed in Chapter 3.

Chapter 5

The MC-Tree

∗ In this chapter an approach to Bayesian anytime classification called the MC-Tree is presented. Like the Bayes tree, it can provide a very fast first result after evaluating just one Gaussian normal distribution per class at the root level, and it can improve the classification accuracy as long as time permits by refining its current model incrementally. On the finest level a kernel density estimator is evaluated for each object in the training set. In between, the MC-Tree stores a hierarchy of mixture models that allows effective and query adaptive anytime density estimation.

In contrast to the Bayes tree the mixture components contain objects from several (potentially all) classes. A top down approach for the tree construc- tion is proposed and data transformations are used that take both the local- ity of the data and the class distribution into account. The proposed descent strategies, which exploit the entropy information available through the MC- Tree, achieve parallel model refinement for several classes. The experiments confirm the effectiveness of the MC-Tree and show significant improvements over previous approaches in terms of anytime classification accuracy.

∗This chapter has been published in the Proceedings of the 22nd International Confer- ence on Scientific and Statistical Database Management (SSDBM 2010) [KGFS10].


5.1 Combining Multiple Classes

In the following, the structural differences of the MC-Tree and the novel construction approach are first described in Section 5.1.1. The classification process with respect to the multi-class structure is presented in Section 5.1.2, and classification refinement is discussed in Section 5.1.3.

5.1.1 Tree structure and construction

As in the Bayes tree, the general idea of the Multi-Class Tree (MC-Tree) is to store a hierarchy of mixture densities. Figure 5.1 shows an MC-Tree structure with three levels. The root node of the tree consists of two entries and represents the coarsest data model. The gray Gaussians in these entries represent the aggregated statistical information of the points in the subtrees. They are constructed from the Gaussians in the level below. For better illustration these lower-level components are shown once again in the entries of the root node but are actually not stored twice. The Gaussians of the first level are in turn constructed from kernel-based Gaussian components in the leaves. As one can see at the leftmost entry in level 1, the MC-Tree permits objects from different classes (A and B) to be represented in one Gaussian component. This is beneficial if the spatial similarity of the underlying objects is high and thus the model remains compact. Gaussians that comprise objects from several classes are represented by dotted lines in the figure. Furthermore, unbalanced MC-Trees are possible. If an area of the data space needs a more detailed representation, paths can become longer than for other areas. Two types of entries are developed and analyzed for the MC-Tree. The first version separately stores the necessary information for calculating the mean and variance for each class contained in an entry. Thus if O(e, l) is the set of objects from class l in the entry e, we can derive the corresponding Gaussian. Furthermore we can calculate the mean and variance of the overall entry independent of the class information.

Definition 5.1 MC-Tree node entry (type A). Let O(e) be the objects that are represented by an entry e of the MC-Tree and O(e, l) ⊆ O(e) the objects belonging to class l ∈ L. An entry e of type A then stores the following information:


Figure 5.1: Example of an MC-Tree. Gaussian components of an entry may represent entities from various classes.

• A pointer to its subnode Sube

• For each class l ∈ L a cluster feature CFl = (ne,l, LSe,l,SSe,l) representing the objects O(e, l)

The algebraic measures mean and variance can be calculated from the linear and squared sums (cf. Equations 3.1 and 3.2). Accordingly, we can calculate these values for the overall entry. The second approach uses a technique known as variance pooling. Instead of storing the squared sums for each class, only the squared sum over all objects is stored in the entry. This assumes the same variance for all classes within the entry, but uses less space. The linear sum, the number of objects and hence the mean are still maintained for each class on its own.
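As a minimal illustration (field and class names below are assumptions, not the thesis code), the per-dimension mean and variance can be derived from the stored sums as follows:

```java
// Sketch of deriving Gaussian parameters from the cluster-feature sums
// (n, LS, SS) stored in an entry; diagonal covariance is assumed.
class EntryStatistics {
    /** mean_d = LS_d / n */
    static double[] mean(double[] ls, int n) {
        double[] mu = new double[ls.length];
        for (int d = 0; d < ls.length; d++) mu[d] = ls[d] / n;
        return mu;
    }

    /** variance_d = SS_d / n - (LS_d / n)^2 */
    static double[] variance(double[] ls, double[] ss, int n) {
        double[] var = new double[ls.length];
        for (int d = 0; d < ls.length; d++) {
            double mu = ls[d] / n;
            var[d] = ss[d] / n - mu * mu;
        }
        return var;
    }
}
```

For a type-B entry (variance pooling), variance() would be evaluated once with the entry-wide squared sums and object count and shared by all classes, while the per-class means still come from the per-class linear sums and counts.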

Definition 5.2 MC-Tree node entry (type B). An entry e of type B stores the following information:

• A pointer to its subnode Sube (set of entries)

• For each class l ∈ L the number ne,l = |O(e, l)| of objects and the vector LSe,l of their linear sums per dimension

• The squared sum SSe for all objects O(e)

An inner node of the MC-Tree is a set of the entries introduced above. As in Chapter 4, a leaf node of the tree contains a set of kernels. The overall definition of the MC-Tree is:

Definition 5.3 MC-Tree. Let O be a set of objects. An MC-Tree with fanout u is a tree with the following properties:

• each inner node contains maximally u entries (type A or B)

• each leaf node contains maximally u kernels

• the objects O(e) represented by an entry e correspond to the objects of its subnode $Sub_e$, i.e. $O(e) = \bigcup_{e_i \in Sub_e} O(e_i)$

• the entries of a single node N represent disjoint object sets, i.e. $\forall e_i, e_j \in N,\, e_i \neq e_j:\ O(e_i) \cap O(e_j) = \emptyset$

• the root node R represents the whole database, i.e. $\bigcup_{e_i \in R} O(e_i) = O$

Tree construction. With the definition of the tree structure we know how to represent a certain subset of objects. In the next step we must determine which objects should be grouped together to achieve a high classification accuracy based on the MC-Tree. The proposed method uses a top-down approach to divide the whole data set O into smaller subsets Oi. For each subset one entry ei is constructed that represents the objects Oi. All the entries together form an inner node of the MC-Tree. The subsets are recursively divided into smaller subsets, and hence further inner nodes are constructed, until the kernel level is reached. For dividing a set of objects into reasonable subsets the EM clustering algorithm [Lau95] (cf. Chapter 2) is employed. The EM algorithm tries to represent objects with the best possible Gaussians. Because both approaches, the MC-Tree and the EM algorithm, use Gaussians to describe the data, the clustering results of EM are well suited for the index construction. Other methods like the density-based clustering DBSCAN [EKSX96] optimize different criteria to obtain the clusters. Thus, a subsequent description of these clusters by Gaussians in the MC-Tree may result in poor results. The number of clusters used in the EM algorithm determines the fanout of the MC-Tree (explicit fanouts are provided in Section 5.2). Using this clustering method, the MC-Tree possibly mixes several classes in one entry and can achieve compact representations of the data. However, very impure clusters with respect to their class labels could result in low classification accuracy even if the underlying objects show a similar spatial relationship. In an impure cluster we cannot discriminate between the classes and thus a precise class prediction is difficult. Hence it is preferable to find pure clusters with respect to their class labels if they also show good spatial compactness/similarity. The MC-Tree must seek a trade-off between pure entries with respect to the class labels and compact spatial clusters. The EM algorithm does not consider the classes in the clustering process. It simply processes all objects neglecting their labels, and hence pure clusters cannot be expected. To account for this, objects from the same class must be considered as more similar than objects from different classes. Hence, a new distance function is used that combines the spatial similarity of the objects and their class similarity:

$$d_\delta(x, y) = \begin{cases} \delta \cdot d(x - y) & \text{if } class(x) \neq class(y) \\ d(x - y) & \text{else} \end{cases}$$

where d(x − y) is the Euclidean distance between the objects x and y and δ > 1. However, EM cannot work on arbitrary distance functions and requires a vector space. The proposed approach transforms the data space such that the class information is reflected by the spatial similarity and the Euclidean distance between x and y in the new space is approximately the value dδ(x, y). As an advantage, we still have a vector space on which EM can work, and we implicitly use the class information at the same time. The proposed method makes use of multi-dimensional scaling (MDS [dL77]) to transform the data space. MDS is a non-linear transformation technique for representing or visualizing objects in any d-dimensional vector space. Often MDS is used to arrange and display objects with higher-dimensional features on a 2d screen.
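A minimal sketch of the class-aware distance dδ (plain arrays and integer class labels are assumptions made for illustration):

```java
// d_delta(x, y) = delta * d(x, y) if class(x) != class(y), d(x, y) otherwise,
// with d the Euclidean distance and delta > 1.
class ClassAwareDistance {
    static double euclidean(double[] x, double[] y) {
        double sum = 0;
        for (int d = 0; d < x.length; d++) {
            double diff = x[d] - y[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    static double dDelta(double[] x, int classX, double[] y, int classY, double delta) {
        double d = euclidean(x, y);
        return (classX != classY) ? delta * d : d;
    }
}
```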


Figure 5.2: Multidimensional scaling (MDS) for class discrimination. (a) Transformation for better cluster detection; Gaussians in original space. (b) Recursive vs. non-recursive MDS.

Given the original distances between the objects, MDS iteratively tries to find a mapping into the new d-dimensional space such that the distances are represented as well as possible. The method proposed here does not transform the objects to a lower-dimensional space but uses the same dimensionality as in the original space. It only changes the original distances to dδ(x, y) before applying MDS. The initial coordinates used for the MDS algorithm are the original coordinates of the objects. By this, most of the original spatial structure is preserved and the adjustments made by the MDS algorithm according to dδ are incorporated. In Figure 5.2(a) (left) we see an example where two classes (squares/circles) are mixed together in the 1d space. It is not possible for the EM to identify pure clusters, i.e. both clusters contain squares and circles. The clusters are marked in red and blue, respectively. After a transformation with δ > 1 the situation in the middle is obtained, for which EM yields the two marked clusters. Keep in mind that the transformation is only performed for the detection of clusters, i.e. to determine which subsets should be grouped together. The Gaussians of the corresponding entries in the MC-Tree are still calculated in the original space. The resulting clusters/entries are presented in Figure 5.2(a) (right). Both clusters represent only objects from one class. Hence, Gaussians may overlap in the original space if this yields purer clusters. The proposed MDS technique is not only employed once at the beginning of the tree construction, but recursively for each subtree. By applying MDS only to a subset of objects, a better discrimination of the classes in further steps is possible. An example is depicted in Figure 5.2(b), where it is not possible to separate all objects of class 1 from those of class 2 with the first transformation. If MDS is not performed recursively (upper right case) the clusters are still impure. However, if MDS is applied on the remaining smaller subset (lower right case), we can further discriminate the classes. With this recursive application, Gaussians which represent only objects from one class are more likely to occur already in higher levels of the MC-Tree. With the proposed method we can generate clusters, and thus entries, which account for the trade-off between compact spatial representations and pure clusters with respect to the class labels.

5.1.2 Classification

Given the MC-Tree we are able to perform classification of objects with unknown labels based on the mixture of densities. Two steps are distinguished in the classification process, which are performed alternately. First, given a mixture model (a set of entries in the MC-Tree), we must determine the posterior probability of the contained classes w.r.t. the current object. Second, to realize the anytime property, in each step the mixture must be refined to obtain a more fine-grained model. To this end an entry is replaced with the entries in its subtree. In the following the first step is discussed; the second step is detailed in Section 5.1.3.


Figure 5.3: Example of a frontier. The gray entries contribute to the current mixture model for the density estimation and can be refined in further steps.

An example MC-Tree frontier is demonstrated in Figure 5.3. Given the frontier we can calculate the density of an object with respect to one class. Keep in mind that the MC-Tree can store objects from different classes in one entry. Hence, given an entry the algorithm must only use the class-specific information.

Definition 5.4 Probability density query (pdq) in MC-Tree. Given a frontier F = {e₁, . . . , e_r} and a class l, let n_{e,l} be the number of objects from class l in the entry e. The pdq returns the density of an object x with respect to l and F, i.e.

$$pdq(x \mid l, F) = \sum_{e \in F} \frac{n_{e,l}}{|O|} \cdot g(x, \mu_{e,l}, \sigma_{e,l})$$

where µ_{e,l} and σ_{e,l} are calculated based on the stored information within the entries (cf. Def. 5.1 or Def. 5.2).

This is a weighted mixture of densities according to the distribution of the number of objects from the current class. For the leaf entries a kernel estimator is used. The two different types of entries also yield two versions of the pdq.

Let n_{l_i} be the total number of objects from class l_i and let the a priori probability P(l_i) be estimated from the training data as the relative frequency of each class. Then it holds:

$$\begin{aligned} pdq(x \mid l_i, F) &= \sum_{e \in F} \frac{n_{e,l_i}}{|O|} \cdot g(x, \mu_{e,l_i}, \sigma_{e,l_i}) \\ &= \frac{n_{l_i}}{|O|} \cdot \sum_{e \in F} \frac{n_{e,l_i}}{n_{l_i}} \cdot g(x, \mu_{e,l_i}, \sigma_{e,l_i}) \\ &= P(l_i) \cdot p(x \mid l_i) \end{aligned}$$

The term p(x|li) is a valid probability density function because the weights associated to the Gaussians sum up to 1. With the MC-Tree the class-specific weighting P (li) is directly integrated in the pdq. According to Bayes classifi- cation the class label assigned to an object can then be calculated as

$$\arg\max_{l_i \in L} \{P(l_i) \cdot p(x \mid l_i)\} = \arg\max_{l_i \in L} \{pdq(x \mid l_i, F)\}$$
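As a minimal illustration of how a frontier is evaluated (the data structures and names below are assumptions, not the thesis implementation; diagonal Gaussians are assumed for simplicity), the pdq of Definition 5.4 and the Bayes decision above can be sketched as:

```java
// Illustrative frontier entry holding per-class counts and Gaussian parameters.
class FrontierEntry {
    int[] nPerClass;         // n_{e,l}
    double[][] meanPerClass; // mu_{e,l}
    double[][] varPerClass;  // sigma^2_{e,l} (diagonal)
}

class PdqClassifier {
    /** Diagonal Gaussian density g(x, mu, sigma). */
    static double gaussian(double[] x, double[] mu, double[] var) {
        double logDens = 0;
        for (int d = 0; d < x.length; d++) {
            double diff = x[d] - mu[d];
            logDens += -0.5 * (Math.log(2 * Math.PI * var[d]) + diff * diff / var[d]);
        }
        return Math.exp(logDens);
    }

    /** arg max_l pdq(x | l, F) with pdq(x|l,F) = sum_e n_{e,l}/|O| * g(x, mu_{e,l}, sigma_{e,l}). */
    static int classify(double[] x, java.util.List<FrontierEntry> frontier,
                        int numClasses, int totalObjects) {
        double[] pdq = new double[numClasses];
        for (FrontierEntry e : frontier) {
            for (int l = 0; l < numClasses; l++) {
                if (e.nPerClass[l] == 0) continue;
                pdq[l] += (e.nPerClass[l] / (double) totalObjects)
                        * gaussian(x, e.meanPerClass[l], e.varPerClass[l]);
            }
        }
        int best = 0;
        for (int l = 1; l < numClasses; l++) if (pdq[l] > pdq[best]) best = l;
        return best;
    }
}
```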

5.1.3 Refinement

Given a frontier F the algorithm can guess the class of a new object. For any- time processing a chain of such frontiers must be generated, starting with the coarsest model from the root and down to the leaf nodes. In each step one entry e ∈ F is replaced with the entries in the subnode of e. This refinement of the underlying spatial data distribution corresponds to a refinement of the probability density query pdq.

Definition 5.5 Anytime pdq processing in MC-Tree. Given a frontier Fi. Let e ∈ Fi and {e1, . . . , em} the entries of the subnode of e. The MC-tree algorithm refines Fi to Fi+1 by removing e and inserting the child entries:

$$F_{i+1} = (F_i \setminus \{e\}) \cup \{e_1, \ldots, e_m\}$$

The pdq of an object x with respect to a class l and the new frontier F_{i+1} is calculated by:

$$pdq(x \mid l, F_{i+1}) = pdq(x \mid l, F_i) - \frac{n_{e,l}}{|O|} \cdot g(x, \mu_{e,l}, \sigma_{e,l}) + \sum_{i=1}^{m} \frac{n_{e_i,l}}{|O|} \cdot g(x, \mu_{e_i,l}, \sigma_{e_i,l})$$
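Continuing the sketch given for Definition 5.4 (still under the same illustrative assumptions), the incremental update can be written as a method to be added to the PdqClassifier class above:

```java
// Incremental pdq update when an entry e of the frontier is refined into its
// children. Reuses FrontierEntry and gaussian() from the previous sketch;
// to compile, place this method inside the PdqClassifier class.
static void refineEntry(double[] pdq, double[] x, FrontierEntry e,
                        java.util.List<FrontierEntry> children, int totalObjects) {
    for (int l = 0; l < pdq.length; l++) {
        if (e.nPerClass[l] > 0) {                      // subtract the parent's contribution
            pdq[l] -= (e.nPerClass[l] / (double) totalObjects)
                    * gaussian(x, e.meanPerClass[l], e.varPerClass[l]);
        }
        for (FrontierEntry c : children) {             // add the children's contributions
            if (c.nPerClass[l] > 0) {
                pdq[l] += (c.nPerClass[l] / (double) totalObjects)
                        * gaussian(x, c.meanPerClass[l], c.varPerClass[l]);
            }
        }
    }
}
```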

As in the Bayes tree, from each time step i to i + 1 the algorithm must decide which e ∈ F_i should be replaced. This decision is important for the performance in terms of anytime classification. If an entry with low information with respect to the current query is refined, the classification result may change only slightly. The time would be better spent on an improvement with more useful entries. In the following, different strategies for choosing the next entry are presented.

Quality measure. During anytime processing the model is refined individually based on the current object to be classified. The classification is based on densities. A high density increases the probability of a class being selected. Hence a first approach, as presented in Chapter 4, is to refine the entry e ∈ F with the highest density. Doing this in the MC-Tree is not straightforward. Each entry subsumes several classes, and the densities of the classes can vary. The question is how to decide which entry is the best for refinement without favoring single classes. To make a fair selection, the density resulting from all objects is used, disregarding their class labels. The next entry to refine is defined by

$$\arg\max_{e \in F} \left\{ \frac{n_e}{|O|} \cdot g(x, \mu_e, \sigma_e) \right\}$$

with the mean µ_e, variance σ_e and number of objects n_e based on all objects in the entry e. Similar to the tree construction, this first approach only considers the spatial distribution of the data. During construction we considered the trade-off between pure clusters with respect to their class labels and the spatial similarity of the objects. A similar trade-off can also be defined for the refinement method. Consider an entry e with several classes that splits up into completely pure clusters in its subnode, i.e. each sub-entry of e contains only objects from one class. Refining e yields a clear decision in the following steps and hence the accuracy could increase. This entry e should be preferred to an entry whose sub-entries are still impure. Using this intuition requires a measure that assesses the skew of the class label distribution in an entry and the possible gain if it is refined. The MC-Tree algorithm uses the well known information gain that is also used for decision trees (cf. Chapter 2). The information gain for an entry e with sub-entries {e₁, . . . , e_m} is defined as:

$$IG(e) = Ent(e) - \sum_{i=1}^{m} \frac{n_{e_i}}{n_e} \cdot Ent(e_i)$$

where Ent(e) measures the entropy of the class label distribution in e:

$$Ent(e) = - \sum_{l \in L} \frac{n_{e,l}}{n_e} \cdot \log\left(\frac{n_{e,l}}{n_e}\right)$$

The information gain measures the reduction of the entropy when e is re- fined; this corresponds to the reduction of the class label skew. This informa- tion can be calculated before the query processing, because it is independent of the actual query object x. The higher the information gain the better is the refinement of e with re- spect to the class purity. The higher the beforehand defined density measure the better is the refinement of e with respect to the spatial similarity. Both measures are important for the MC-Tree; the trade-off is realized by building a linear combination of these terms for the proposed quality measure.

Definition 5.6 Refinement quality of an entry. Given a query object x and a frontier F, the refinement quality of an entry e ∈ F with respect to x and F is defined as:

$$quality_\alpha(e, x) = \alpha \cdot \frac{IG(e)}{\log |L|} + (1 - \alpha) \cdot \frac{n_e \cdot g(x, \mu_e, \sigma_e)}{\max_{w \in F} \{n_w \cdot g(x, \mu_w, \sigma_w)\}}$$

Both measures are normalized to [0, 1]. The information gain is at most log |L| if L is the set of all possible classes in the database. The user can control the influence of both measures by changing α.
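A minimal sketch of this linear combination (assuming IG(e) and the densities are computed elsewhere; all names are illustrative):

```java
// quality_alpha(e, x) = alpha * IG(e)/log|L| + (1 - alpha) * density / maxFrontierDensity
class RefinementQuality {
    /**
     * alpha              trade-off parameter in [0, 1]
     * infoGain           IG(e), precomputable per entry
     * numClasses         |L|
     * density            n_e * g(x, mu_e, sigma_e) for the entry
     * maxFrontierDensity max over all frontier entries w of n_w * g(x, mu_w, sigma_w)
     */
    static double quality(double alpha, double infoGain, int numClasses,
                          double density, double maxFrontierDensity) {
        double igTerm = infoGain / Math.log(numClasses);     // normalized to [0, 1]
        double densityTerm = density / maxFrontierDensity;   // normalized to [0, 1]
        return alpha * igTerm + (1 - alpha) * densityTerm;
    }
}
```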

Meta strategies. Formally the quality measure defines a ranking of the en- tries in the current frontier. To perform the refinement step the MC-tree algorithm selects the first entry out of this ranking:

$$\arg\max_{e \in F} \{quality_\alpha(e, x)\}$$

Afterwards the ranking is adapted based on the newly inserted entries, and the next best entry can be selected after updating the pdq. This method is referred to as first-best, since in each step the best entry is refined. One possible problem of this approach is that only a small local area around the query object is refined. The algorithm can stick to one path of refined entries, while other entries with a possibly strong influence on the query are not refined. This problem is related to greedy algorithms, which choose at each step the best local solution and can run into local optima not resulting in a globally optimal solution. To avoid this, two further meta strategies are presented for the selection of entries. The k-best method is a direct extension of first-best. Instead of choosing the best entry, this strategy marks the k best entries in the current frontier. While there are marked entries in the frontier, the best out of these is selected to perform the refinement. Other non-marked entries are ignored even if they show better quality values. Only when all marked entries are processed for refinement are all entries in the frontier considered again for marking the k currently best ones based on their quality measures. This method widens the search space because a currently second best entry is refined and can advance to a top choice for the following steps. Setting k = 1 yields the first-best method. The k-best method simply chooses the k best entries based on their quality measures. The method does not consider the classes within the entries, which is an important characteristic of the MC-Tree. Hence for k-best it is possible to select k entries all belonging to one class. This single class is favored in the classification process and all remaining classes are neglected. If the true class of the query object belongs to one of the neglected classes, it is unlikely to reach a correct classification. Therefore the last proposed method tries to consider all classes equally within the refinement steps: in k successive refinement steps it favors k different classes. Similar to k-best, the k best entries are marked with respect to the quality measure and the additional constraint that the strongest represented class of each marked entry is different among the k entries. This method is similar to the qbk heuristic, which yielded the best results in Chapter 4. Formally, for each class l_i that entry is selected which yields the highest quality and for which l_i is the strongest representative:

$$e^*_{l_i} = \arg\max_{e \in F} \left\{ quality_\alpha(e, x) \;\middle|\; l_i = \arg\max_{l \in L} \{n_{e,l}\} \right\}$$

Out of the set $\{e^*_{l_i}\}_{l_i \in L}$ the k best entries are marked for the next refinement steps. The strongest represented class of each entry can be calculated before the actual query processing. This method, called k-class, accounts for different classes in the refinement, and hence the classification decision is more likely to be changed to the correct class if it has been underestimated so far.
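A minimal sketch of the k-class marking step, assuming the quality values and the strongest represented class have already been computed per frontier entry (all names are illustrative, not the original code):

```java
import java.util.*;

// For each class keep the best-quality entry whose strongest represented
// class is that class, then mark the k best of these candidates.
class KClassStrategy {
    /**
     * quality[e]        precomputed quality_alpha(e, x) for frontier entry e
     * strongestClass[e] precomputed strongest represented class of entry e
     * returns the indices of the marked entries, best first
     */
    static List<Integer> markEntries(double[] quality, int[] strongestClass, int k) {
        Map<Integer, Integer> bestPerClass = new HashMap<>(); // class -> entry index
        for (int e = 0; e < quality.length; e++) {
            Integer cur = bestPerClass.get(strongestClass[e]);
            if (cur == null || quality[e] > quality[cur]) bestPerClass.put(strongestClass[e], e);
        }
        List<Integer> candidates = new ArrayList<>(bestPerClass.values());
        candidates.sort((a, b) -> Double.compare(quality[b], quality[a]));
        return candidates.subList(0, Math.min(k, candidates.size()));
    }
}
```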

5.2 Experiments

To evaluate the performance of the MC-Tree, experiments are performed on different real world data sets with varying dimensionality, cardinality and number of classes (cf. Table 4.1). All experiments were run on Windows machines with 3 GHz and 2 GB RAM using Java 6.0. Mostly we will use an implementation-invariant time measure (as used e.g. in [UXKL06]): in the graphs the classification accuracy is reported over the number of Gaussians that have been evaluated. When comparing against anytime nearest neighbor [UXKL06], SVM [WF05] and C4.5 [WF05], the actual time is provided on the x-axis. The goal in this chapter is to improve the accuracy of anytime Bayesian classification rather than showing statistical evidence for the superiority of one or the other classifier in different domains. 4-fold cross validation was performed on the data sets and the accuracy values are the corresponding averages.


Figure 5.4: Left: Varying the influence of entropy during descent via the parameter α. Right: Varying the penalty for MDS construction via the parameter δ.

The tree structures are constructed once using all training data of the current fold. For updates of the trees using additional new training data, one can employ the incremental insertion proposed in Chapter 4 or adapt other split strategies. Since the focus is on the anytime classification performance, these aspects are not studied here. In the next section, first the influence of the parameters on the anytime classification accuracy of the MC-Tree is evaluated, and in Section 5.2.2 the improvement of anytime Bayesian classification over previous results from Chapter 4 is shown.

5.2.1 MC-Tree parameter evaluation

To find a good parameter setting for the MC-Tree, we start by evaluating the influence of the entropy level during descent by varying the parameter α in Figure 5.4 (left). The k-class strategy is used here, since it performed best among all meta strategies in preliminary experiments. Also, zero penalty was used for the MDS (δ = 1) and the classification decision is based on µ_{e,c} and σ_{e,c}. The maximal number of entries in a node, and hence the number of clusters for the EM algorithm, is set to the number of classes in the respective data set. We find a clear winner in the results, indicating that α = 0 yields the best results. This means that the local density of the individual Gaussian components in the frontier is sufficient to best determine the next entry for a refinement. Discounting the importance of entries that yield a high probability density with respect to the query object by promoting other entries due to their higher entropy obviously delays beneficial model refinements and reduces the anytime accuracy performance. Hence, in the following only the entry's probability density is used and α is set to zero. In Figure 5.4 (right) the influence of the penalty for the construction using MDS is shown on the Gender data set. The purple line corresponding to δ = 1 (zero penalty) rises quickly to an accuracy of around 74%, but does not improve afterwards. Similar results were found for most data sets when using the same parameter setting: a steep increase early on and little improvement in later stages. Increasing the penalty through a higher δ yields continuous improvement in terms of anytime accuracy. While the curve for δ = 1 shows better performance for the first Gaussians that are read, the other settings soon improve the accuracy significantly. The tree built with δ = ∞ penalty shows an accuracy that is up to 9% higher for this data set. The superior performance for this setting was confirmed on the other data sets. Moreover, the stagnating behaviour disappeared throughout, as will be shown in the next section. δ = ∞ constitutes a special case, since this transformation can be considered as dividing the database into subsets such that all objects with the same class label are in the same subset. Afterwards the clustering is performed only on the separated classes. Hence we do not need the actual transformation but can directly cluster in the original space on these subsets. This special case is more similar to the Bayes tree, but the resulting hierarchy is still very different due to the top down clustering instead of the balanced R-tree split as in Chapter 4. We will see in the next section that the MC-Tree shows significantly better anytime accuracy. As a consequence of the high penalty and the resulting tree structure, all classification options provided by the different node types (cf. Def. 5.1 and 5.2) yield the same results and are therefore not displayed.


Figure 5.5: Classification accuracy on pendigits (left) and letter (right) for anytime classifiers and static classifier.


Figure 5.6: Classification accuracy on gender (left) and Forest Covertype (right).

5.2.2 Comparison: Improvement of anytime Bayesian clas- sification

Next the MC-Tree is compared to the Bayes tree. For the Bayes tree, the global best descent strategy and the qbk refinement strategy are used as they yielded the best results (cf. Chapter 4). For the MC-Tree the fanout (the maximal number of entries per node) is set to the same value as in the Bayes tree, where it is dictated by the page size. Eventually the results of both approaches will be equal, i.e. if the entire leaf level has been read, the decisions of both classifiers are based on the same model. However, the graphs focus on the interesting part: the anytime performance in the beginning of the classification process. On the right hand side of the figures the corresponding accuracy reached by the SVM and C4.5 implementations from Weka [WF05] is reported. These methods do not constitute anytime approaches.

Figure 5.5 (left) shows the results for the Pendigits data set. The Bayes tree performs only slightly better than the anytime nearest neighbor in the beginning before it falls behind. The MC-Tree shows a steep increase in the very beginning and outperforms the other two anytime classifiers throughout. While the accuracy of the MC-Tree is similar to the nearest neighbor in later stages, it quickly shows a performance gain over the Bayes tree of three to four percent accuracy. Support vector machine (SVM) and decision tree (C4.5), which are considered very fast classifiers, perform well with 97.9% and 96.3% accuracy, respectively. These approaches, however, cannot improve their accuracy once they have computed their result, while the anytime classifiers will use additional time for further computations. Similar results can be seen for the letter data set in Figure 5.5 (right). For this domain C4.5 (86.8%) performs better than the SVM (81.8%). The results of the anytime NN and the Bayes tree are again similar; after a short time the nearest neighbor bypasses the accuracy of the Bayes tree. The MC-Tree outperforms both approaches and reaches accuracy values which are up to 13% higher compared to the Bayes tree, constituting a major performance gain. Moreover, it soon reaches higher accuracy than both SVM and C4.5. On the gender data set (cf. Figure 5.6 (left)) the Bayes tree shows constantly better performance than anytime NN. Once more, the MC-Tree constantly outperforms both approaches, by roughly 6% compared to the Bayes tree and 10% compared to anytime NN. As was the case for the letter data set, the decision tree (83.2%) reaches higher accuracy than the support vector machine (70.7%). This performance is once again met by the MC-Tree after a short time, and additional time can be used for further improvement. More importantly, as was the major goal, the new concepts of the MC-Tree prove to be effective through constantly better performance in comparison with the Bayes tree. In Figure 5.6 (right) the results for the MC-Tree and Bayes tree on the Forest Covertype data set are shown. The results for the anytime nearest neighbor, SVM and C4.5 could not be computed due to memory issues. In this figure the number of Gaussians that have been evaluated until classification is reported. As stated above, the goal in this chapter was to improve the performance of Bayesian anytime classifiers, which is again clearly reached in this experiment. The accuracy curve of an anytime approach is desired to be a non-decreasing function. In Figure 5.6 we observe a slight up and down in the anytime curves for the MC-Tree. Since in both cases k = 2 holds, the two most probable classes are refined in turns. Obviously the classification decision changes for some queries after each node that is evaluated. The amplitude of the oscillation indicates the proportion of queries behaving as just described. Those query points fall into a region of the data space where two classes overlap and where a correct decision is difficult to find. Refining the one class' model by reading an additional node increases its probability in that step, while in the next refinement the other class' model is refined (due to k = 2) and its corresponding probability prevails. However, the slight oscillation of the anytime curves does not diminish the dominance of the MC-Tree in terms of anytime accuracy.

5.3 Conclusion

In this chapter the MC-Tree was proposed, which significantly improves the anytime classification performance over the previous results from Chapter 4. Various strategies for tree construction, descent and classification were investigated. In an experimental evaluation on real world data sets the MC-Tree outperformed previous Bayesian anytime classifiers and improved the accuracy constantly by up to 13%. Starting from the initial idea of combining several classes in a single tree, a novel way of constructing the tree was investigated through top-down clustering using the EM algorithm. While it turned out that separating the classes remains advisable, the EM construction showed very good results. Therefore, further alternative construction methods are investigated in the following chapter. Since we do not combine several classes in a single tree in the remainder of the thesis, we will not use the term MC-Tree in the following and only refer to the different construction as EMTopDown bulk loading.

Chapter 6

Bulk Loading the Bayes Tree

∗ In this chapter several methods are investigated for bulk loading mixture densities in the Bayes tree and the resulting performance is analyzed in ex- perimental evaluation.

6.1 Bulk loading mixture densities

The goal in this chapter is to improve the performance of the Bayes tree. The accuracy of the Bayes tree results is based on the quality of the mixture densities stored in its entries. The iterative insertion performed in Chapter4 does not consider the quality of the resulting Gaussian components. Several bulk loading approaches are developed and evaluated in this chapter that try to overcome this shortcoming and improve the quality of the mixture densities.

6.1.1 EM and statistical approaches

Goldberger. Since the Bayes tree is a statistical approach to classification, statistical methods are investigated to create a smaller mixture model from a given mixture model. Starting bottom up with a mixture model that contains a kernel estimator for each training set item, successively coarser models are created that represent good approximations.

∗This chapter has been published in the Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2010) [KKDS10].

The first statistical approach is based on [GR04] and is called Goldberger in the following. The Goldberger approach assumes two initial mixture models f and g to be given, where f is the finer model with r components and g an approximation with s components, hence r > s. Each component is assigned a weight and is specified by its mean and covariance matrix. To measure the quality of the approximation, [GR04] defines the distance between two mixture densities as follows.

Definition 6.1 Let $f = \sum_{i=1}^{r} \alpha_i f_i$ and $g = \sum_{j=1}^{s} \beta_j g_j$ be two mixture densities containing r and s Gaussian components $f_i$ and $g_j$ with their respective weights $\alpha_i$ and $\beta_j$. The distance between f and g is then defined using the Kullback-Leibler divergence KL [CHOY08] as

$$d(f, g) = \sum_{i=1}^{r} \alpha_i \cdot \min_{j=1}^{s} \{KL(f_i, g_j)\}$$

The optimal model ĝ reducing f to s components is ĝ = arg min_g d(f, g). Since there is no closed form to compute ĝ, a local optimum is computed by iterating the following two steps until the distance d(f, g) no longer decreases. In these steps, π(i): {1 . . . r} → {1 . . . s} is a mapping function that assigns each component in f to a component in g.

• Regroup: update π: $\pi(i) = \arg\min_{j=1}^{s} \{KL(f_i, g_j)\}$

• Refit: for each component $g_j$ recompute the weight $\beta_j$, mean $\mu_j$ and covariance matrix $\Sigma_j$ as follows:
  – $\beta_j \leftarrow \sum_{i,\pi(i)=j} \alpha_i$
  – $\mu_j \leftarrow \frac{1}{\beta_j} \sum_{i,\pi(i)=j} \alpha_i \mu_i$
  – $\Sigma_j \leftarrow \frac{1}{\beta_j} \sum_{i,\pi(i)=j} \alpha_i \left( \Sigma_i + (\mu_i - \mu_j)^2 \right)$
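As an illustration of the regroup step, the following sketch assumes diagonal covariance matrices (an assumption made here for brevity; the formulation above uses full covariance matrices) and maps each fine component to the coarse component with minimal KL divergence:

```java
// KL divergence between two diagonal Gaussians, then pi(i) = argmin_j KL(f_i, g_j).
class GoldbergerRegroup {
    static double klDiagonal(double[] mu0, double[] var0, double[] mu1, double[] var1) {
        double kl = 0;
        for (int d = 0; d < mu0.length; d++) {
            double diff = mu1[d] - mu0[d];
            kl += 0.5 * (var0[d] / var1[d] + diff * diff / var1[d]
                         - 1 + Math.log(var1[d] / var0[d]));
        }
        return kl;
    }

    /** Assigns each fine component f_i to its closest coarse component g_j. */
    static int[] regroup(double[][] muF, double[][] varF, double[][] muG, double[][] varG) {
        int[] pi = new int[muF.length];
        for (int i = 0; i < muF.length; i++) {
            double best = Double.POSITIVE_INFINITY;
            for (int j = 0; j < muG.length; j++) {
                double kl = klDiagonal(muF[i], varF[i], muG[j], varG[j]);
                if (kl < best) { best = kl; pi[i] = j; }
            }
        }
        return pi;
    }
}
```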

A bulk loading technique is devised based on [GR04] as follows. To initialize the mixture g, a first mapping π₀ is computed by assigning 0.75 · M components from f to one component in g according to the z-curve [Sag94] order of their mean values. M is the Bayes tree fanout, which in turn is dictated by the page size. When no more changes occur in step 2, the resulting components g_j are converted to Bayes tree nodes containing the entries f_i with π(i) = j. Since the final π may map more than M components from f to a single component in g, strategies must be found to restrict the fanout to the given boundaries. Reformulating the regroup step into an integer linear program with constraints regarding the resulting fanout is a first option. However, preliminary experiments showed that for realistic problem sizes this approach takes far too long to compute a complete bulk loading. Hence, a post processing is performed after the mapping π is computed, which splits the nodes that contain too many entries. To this end, two representatives are computed by moving the mean along the dimension a with the highest variance σ_a by an ε = σ_a/2 in both directions. A Gaussian is placed over each of the two representatives and the mapping of the entries to the representatives is computed as in the regroup step. If a node contains too few entries it is merged with the node closest to it in terms of the Kullback-Leibler divergence.

Virtual sampling. The second approach, called virtual sampling, uses the work presented in [VL98] and does not rely on the KL divergence. The virtual sampling approach assumes a given mixture model $f = \sum_{i=1..r} \alpha_i \cdot f_i$ containing r components and computes a coarser mixture model $g = \sum_{j=1..s} \alpha_j \cdot g_j$ with s < r components. The components $f_i = g(x, \mu_i, \sigma_i)$ (and $g_j$ for g) constitute multivariate Gaussian normal distributions with their respective weights $\alpha_i$. To derive an algorithm the following model is utilized: the mixture g can be computed using samples R₁ . . . R_r from each component in f with $R = \bigcup_{i=1..r} R_i$ and $|R_i| = \alpha_i \cdot |R|$. Assuming independence of the sample points from different components in f yields the assumption that they can be assigned to different components in g, while samples from the same $f_i$ are likely to be assigned to the same $g_j$. Based on this assumption, hidden variables $z_{ij}$ are introduced that indicate for each component $f_i$ its assignment to the corresponding $g_j$. While the $z_{ij}$ are binary during initialization, they can take values between 0 and 1 during the iterations. The hidden variables are used in a modified Expectation Maximization algorithm to compute the coarser mixture g as follows (superscripts f and g are added for readability to indicate the origin of the components):

• Expectation:
$$z_{ij} \leftarrow \frac{\left[\, g(\mu^f_i, \mu^g_j, \Sigma^g_j)\; e^{-\frac{1}{2}\,\mathrm{trace}\{(\Sigma^g_j)^{-1}\Sigma^f_i\}} \right]^{|R_i|} \cdot \alpha^g_j}{\sum_{k=1}^{s} \left[\, g(\mu^f_i, \mu^g_k, \Sigma^g_k)\; e^{-\frac{1}{2}\,\mathrm{trace}\{(\Sigma^g_k)^{-1}\Sigma^f_i\}} \right]^{|R_i|} \cdot \alpha^g_k}$$

• Maximization:

– $\alpha^g_j \leftarrow \frac{1}{r} \sum_{i=1}^{r} z_{ij}$,  $\mu^g_j \leftarrow \frac{\sum_{i=1}^{r} z_{ij}\, |R_i|\, \mu^f_i}{\sum_{i=1}^{r} z_{ij}\, |R_i|}$

– $\Sigma^g_j \leftarrow \frac{1}{\sum_{i=1}^{r} z_{ij}\, |R_i|} \left[ \sum_{i=1}^{r} z_{ij}\, |R_i|\, \Sigma^f_i + \sum_{i=1}^{r} z_{ij}\, |R_i| \left( \mu^f_i - \mu^g_j \right)^2 \right]$

The above equations are independent of the actual samples R_i and can be computed directly from the mixture components in f, hence virtual sampling. To use the described bottom up method for bulk loading, an initialization is required for the hidden variables z_ij. The initialization of the mixture g is done as in the Goldberger approach described above. After obtaining the final values for z_ij from the virtual sampling algorithm, each f_i is assigned to the g_j with the maximum z_ij over all j. Moreover, the result must comply with the fanout parameters m and M of the Bayes tree. This is achieved through merging and splitting of the resulting components g_j. If a component g_j is assigned fewer than m components f_i, these components from f are assigned to the g_j' with the second highest z_ij'. If more than M components f_i are assigned to one g_j, g_j is duplicated while moving the resulting two means in opposite directions along the dimension with the highest variance. The respective f_i are reassigned to the more probable candidate according to the density at their mean µ_i. After merging and splitting, the corresponding mixture parameters are adapted following the above equations.

EMTopDown. Besides the above mentioned bottom up approaches, the top down approach from Chapter 5 is evaluated, which recursively splits the training set into several clusters. In contrast to the previous approaches, where Gaussian components were merged and mapped, this approach operates solely on the data objects. More precisely, it starts with applying the EM [DLR77] algorithm to the complete training set. The desired number M of resulting clusters is always set to the fanout, which is again given through the page size. If the EM returns fewer than m clusters, the biggest resulting cluster is split again such that the total number of resulting clusters is at most M. In the rare case that the EM returns a single cluster, this cluster is split by picking the two farthest elements and assigning the remaining elements to the closest of the two. Finally, if a resulting cluster contains more than L objects (the capacity of a leaf node), the cluster is recursively split using the procedure described above. Otherwise the items contained in that cluster are stored in a leaf node, its corresponding entry is calculated and returned to build the Bayes tree. The EM approach may result in an unbalanced tree, which differs from the primary Bayes tree idea. However, as we will see in the experimental section, the results show that this is not a drawback but even leads to better anytime classification performance.
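A minimal sketch of the recursive EMTopDown construction, assuming a hypothetical EM clustering helper and placeholder node types (not the actual Bayes tree classes); the special cases for degenerate clusterings described above are omitted:

```java
import java.util.*;

// Recursively split the training data with EM until clusters fit into leaf nodes.
class EmTopDownBulkLoad {
    interface EmClusterer {
        List<List<double[]>> cluster(List<double[]> data, int numClusters);
    }

    static Node build(List<double[]> data, int M, int leafCapacity, EmClusterer em) {
        if (data.size() <= leafCapacity) {
            return Node.leaf(data);                  // store kernels, compute the entry
        }
        List<List<double[]>> clusters = em.cluster(data, M);
        Node inner = Node.inner();
        for (List<double[]> cluster : clusters) {
            inner.addChild(build(cluster, M, leafCapacity, em)); // recurse per cluster
        }
        return inner;
    }

    // Minimal placeholder node type so the sketch compiles.
    static class Node {
        final List<Node> children = new ArrayList<>();
        final List<double[]> kernels = new ArrayList<>();
        static Node leaf(List<double[]> pts) { Node n = new Node(); n.kernels.addAll(pts); return n; }
        static Node inner() { return new Node(); }
        void addChild(Node c) { children.add(c); }
    }
}
```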

6.1.2 Data base driven approaches

Since the Bayes tree extends the R-tree, traditional R-tree bulk loading algorithms are employed for comparison. Two types of space filling curves are used in the experiments, namely the Hilbert curve [Sag94] and the z-curve. The following briefly describes the Hilbert curve approach; the z-curve bulk loading works analogously. The bulk loading according to the Hilbert curve is a bottom up approach where in the first step the Hilbert value for each training set item is calculated. Next the items are ordered according to their Hilbert value and put into leaf nodes w.r.t. the page size. After that the corresponding entry for each resulting node is created, i.e. MBR, cluster features (CF) and the pointer. These steps are repeated using the mean vectors as representatives until all entries fit into one node, the root node. Theory on creating multidimensional Hilbert curves can be found in [AN98]; for implementation guidelines see [Law00]. Finally, the partitioning approach presented in [LEL97] is tested, which is called sort-tile-recursive (STR). The basic idea is to build a hierarchy of rectangles which share approximately the same expansion in each dimension at the same level of the hierarchy. For details please refer to [LEL97].
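For illustration, a z-curve key can be obtained by interleaving the bits of the quantized coordinates; the following sketch is a simplification under stated assumptions (the Hilbert variant and the actual implementation details differ, cf. [Sag94, AN98, Law00]):

```java
// Morton/z-order key: quantize each coordinate to `bits` bits and interleave
// them, most significant bit first. Assumes dimensions * bits <= 64.
class ZCurve {
    static long mortonKey(int[] quantized, int bits) {
        long key = 0;
        for (int b = bits - 1; b >= 0; b--) {
            for (int d = 0; d < quantized.length; d++) {
                key = (key << 1) | ((quantized[d] >> b) & 1L);
            }
        }
        return key;
    }
}
```

Sorting the training items (or, in higher levels, the mean vectors) by this key and filling pages in that order yields the bottom up construction described above.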


Figure 6.1: Anytime classification accuracy and a ranking of all approaches according to the area values for Pendigits (left) and Letter (right).

6.2 Experiments

The three proposed bulk loading techniques Goldberger, virtual sampling and EMTopDown are compared to the existing R-tree bulk loading approaches Hilbert, z-curve and STR and to the previous results from Chapter 4 (called Iterative in the graphs since it performs iterative insertion of objects). The same settings as in Chapter 4 were used: 4-fold cross validation is performed on the same data sets and the classification accuracy is shown after each node, averaged over the four folds. The employed strategies are the global best descent and the qbk improvement strategy, as they showed the best results in Chapter 4. The bulk loading is done per fold, once and offline, and the resulting classifier is then used on the data stream. Since the focus is on anytime classification, the time performance of the bulk loading algorithm is not evaluated, but rather the performance of the resulting classifier in terms of its anytime classification accuracy. The top left part of Figure 6.1 shows the results for the pendigits data set. The Goldberger approach fails to improve the accuracy over the iterative insertion for the first 50 nodes. After that it performs slightly better, but cannot increase the accuracy by more than 1%. Virtual sampling performs worst on this data set. The Hilbert and z-curve bulk loads yield comparable results; their corresponding curves show a steep increase similar to the iterative insertion and show better performance in most cases. After falling behind during the first nodes, STR performs equally well compared to Iterative. The EMTopDown bulk load outperforms all other approaches and improves the accuracy over the iterative insertion constantly by 3% or more on this data set. The performance of the Goldberger bulk load stayed below the iterative insertion in the majority of the experiments. Only on the Letter data set did it improve the accuracy for larger time allowances (cf. Figure 6.1, right). For the first 40 nodes Goldberger and Iterative perform equally well; after that the accuracy of Iterative stays behind that of Goldberger. While the virtual sampling and STR bulk loads show similar performance to Iterative, Hilbert and z-curve (which are again in close proximity to each other) show constantly better accuracy than Iterative. The EMTopDown again constantly yields the best accuracy, up to 13% better than the iterative insertion. To facilitate an easier comparison between the different approaches, Figures 6.1 (bottom) and 6.2 show the normalized area under the anytime curves for Pendigits, Letter, Vowel, USPS and Verbmobil. Throughout the data sets Hilbert and z-curve show nearly the same performance, while z-curve is usually slightly behind Hilbert except for the USPS data set. STR ranges between these two and the iterative insertion; it is never better than the former and never beaten by the latter. Surprisingly, both statistical approaches exhibit the same weakness as STR, i.e. they never outperform the z-curve bulk load (except Goldberger on Letter) and several times show even worse performance than iterative insertion. This is especially interesting since both approaches are initialized using the z-curve, however only with 0.75 · M entries per node (cf. Section 6.1.1). We discuss the reasons for this shortcoming of the statistical approaches at the end of the section, where we analyze the structure of the resulting trees. The EMTopDown bulk load constantly shows the best performance on all data sets despite the unbalanced resulting trees.
Again we defer the analysis to the end of the section.


Figure 6.2: Comparison of all approaches on Vowel, USPS and Verbmobil.

Figure 6.3 shows the results for the gender and covertype data sets. For readability, only the results for Hilbert and EMTopDown are shown. For both data sets k = 2 for the qbk improvement strategy. The graphs for EMTopDown and Hilbert using the global best descent (glo) show an oscillating behavior on both data sets. For comparison, the breadth first traversal (bft) is recapitulated and the results are plotted as well. As was found in Chapter 4, the global best descent performs better than breadth first traversal. However, the graphs for bft do not show the oscillating behavior mentioned above. Since k = 2, there is obviously a certain proportion of objects whose class decision changes in favor of (or against) the tree which is currently refined. More precisely, these objects are likely positioned on the decision boundary between the two most probable classes. In global best descent, refining mixture components close to the objects, and hence close to the decision boundary, affects the corresponding posterior probabilities more heavily than refinement of a farther component as in breadth first traversal. If we assume the oscillation to be a higher frequency added to a smooth underlying anytime curve, the proportion of these borderline objects corresponds to the amplitude.


Figure 6.3: Anytime classification accuracy on Gender and Covertype.

However, on balance, the oscillating behavior does not affect the superiority of the bulk loading over the iterative insertion. To find reasons for the surprising ranking of the individual algorithms, the structure of the resulting trees is analyzed. Since more entries in a node correspond to more detailed information compared to fewer entries, the analysis focuses on the degree to which the nodes were filled in the different approaches. To this end the average fanout was computed per level of the trees from root to leaf. However, the resulting figures did not reveal any correlation to the observed ranking of the approaches. For example, both space filling curve approaches always fill the nodes to nearly 100% (except for the last one per level and the root), but the EMTopDown sometimes produces fewer than M entries for a given node.


Figure 6.4: Variances per level for Letter and Gender.

A correlation can be found between the performance of the algorithms and the variance of the mixture components in the resulting trees. More precisely, the average variance of all entries is calculated per level; Figure 6.4 shows the resulting numbers for Hilbert, Goldberger, EMTopDown and Iterative on the Letter and Gender data sets. The variances are normalized by the variance of the entire data set per class. Level 0 corresponds to the leaves, Level 1 is above the leaves, etc. The trees resulting from different approaches can have different heights, as can be seen in the graphs.

6.3 Conclusion

In this chapter three bulk loading approaches were proposed for hierarchical mixture models to improve Bayesian classification on data streams using the Bayes tree. The approaches were compared to the previously proposed iterative insertion and three known R-tree bulk loading algorithms on a range of real world data sets. Experimental results showed that the EMTopDown bulk load consistently outperformed all other approaches and improved the accuracy by up to 13%. Surprisingly, the two statistical approaches were outperformed by existing R-tree bulk loadings based on space filling curves. Further analysis attributed this shortcoming to a structural property of the resulting

Bayes trees. The results of the analysis were in line with the classification results found in the experiments, confirming the superior performance of the new EMTopDown bulk loading in terms of anytime classification accuracy.
In the next chapter different popular classifiers are analyzed to identify ways to improve the Bayes tree by learning from related approaches. Transferring related concepts to the Bayes tree yields near perfect results on all data sets tested. Moreover, a remedy is found for the oscillating behavior seen in the results of Section 6.2.

Chapter 7

The Classifier Family: Learn from your Relatives

If two classification approaches are considered, one can often find certain similarities, for example in the way that they build their model or how their parameters are optimized. Sometimes it is even possible to parameterize the individual approaches such that they yield the exact same decision boundaries. An example was shown for support vector machines and Bayesian classification using Gaussian mixtures in Chapter 2; other examples include the equivalence between nearest neighbor classification and Bayesian classification using kernel density estimators. This chapter investigates what the Bayes tree can learn from related approaches, for example other classification paradigms, to improve its anytime classification performance. For all processes of the Bayes tree, from construction to decision making, matching concepts are identified and transferred to alter the corresponding process. Extensive experimental results show the benefits of the concept transfer approach and yield outstanding results on all data sets tested.

7.1 Introduction

Combining classifiers to improve their performance has been frequently researched in the literature. The goal of the proposed methods is to build a classifier that has a better performance than the given baseline approaches. While some methods focus on a specific application or the efficiency of classifiers [TLB04, AKKS99], most methods seek to build a general classifier that yields a high accuracy in various domains. One way to categorize the abundance of related work is to distinguish methods that use an ensemble of classifiers without modifying the single approaches from those methods that modify a classifier by transferring concepts from other approaches.

Ensemble methods train a set of classifiers and combine their results, e.g. by majority voting or using individual weights for the single classifiers. Many approaches have been proposed, including boosting [FS96, KSK02, ZWLL09], bagging [Bre96, LK05, LP03] and Random Forests [Bre01]. While some ensemble methods use only classifiers from the same paradigm, others combine the outputs of classifiers from different paradigms and often achieve higher robustness and better accuracy [KHDM98]. In [WKB97] four paradigms were combined (decision trees, neural networks, knn and Bayes). In [LBB02] a genetic algorithm was used to find an appropriate rule for combining decision trees with neural networks. Similar research ideas tried to fuse discriminative with generative approaches, like [DHN10], where the advantages of support vector machines and Gaussian mixture models were combined to create an image classification technique that benefits from both approaches. In [VS09] it was shown that using simple majority voting could negatively affect the overall performance of an ensemble of different paradigms, and the idea of using dynamic weighting in such ensembles was introduced. In contrast to the above mentioned ensemble methods, the approach that is followed in this chapter only uses the concepts from other approaches for improving the Bayes tree, i.e. it does not result in an ensemble of different classifiers. The output is an improved version of the original classifier.

The idea of modifying a given classifier by transferring concepts from other approaches was e.g. used in [PW10], where a method was suggested to optimize the performance of Gaussian mixtures-based Bayesian classifiers using the concept of margin maximization. In [TLB04] the variable selection for SVMs was done using Random Forests and vice versa. The issue of discriminative vs. generative concepts for learning Bayesian network classifiers was discussed in [RH97]. In [GD04] a method for learning Bayesian network classifiers through a two-level optimization using the concept of maximizing the conditional likelihood was proposed. In this chapter different options for combining the Bayes tree with related classifiers are identified and evaluated. Details of the individual concept transfers are provided in Section 7.2; the results are discussed in Section 7.3.

7.2 Learning from Relatives

For the concept transfer the three processes of building a classifier are considered. The employed notation was introduced in Section 2.2:

M = η_C(Ξ, T)        Model selection (Eq. 2.1)

Θ = π_C(M, T)        Parameter optimization (Eq. 2.2)

l = f_C(Θ, q)        Decision (Eq. 2.3)

The general idea of optimizing a classifier C by learning from relatives is to identify promising concepts of related approaches that can be harnessed to improve C. More precisely, given a set of approaches A = {A_1, ..., A_|A|}, the optimization process combines the concepts of an approach A ∈ A with those of C, denoted as C ◊ A, and yields new functions η_{C◊A}, π_{C◊A} and f_{C◊A}. This is different from using A and C in an ensemble, since there only the outputs of f_C and f_A are combined, while the single concepts are left unchanged. The following sections describe the investigated concepts for the three processes of model selection, parameter optimization, and decision. The Bayes tree is referred to as BT in the remainder of this chapter and the following notation is used throughout.

Definition 7.1 The model M = {N_1, ..., N_r} of a Bayes tree is a set of connected nodes. Each node N = {e_1, ..., e_s} stores a set of entries with 2 ≤ s ≤ maxFanout. An entry e = {p_e, n_e, LS_e, SS_e} stores a pointer p_e to the first node N_e of its subtree, the number n_e of objects in the subtree and their linear and quadratic sums per dimension. The root node N_root stores exactly one entry e_o^l for each label l ∈ L that summarizes all objects with label l.

T|e refers to the set of objects stored in the subtree corresponding to entry e.

In the Bayes tree, η_BT creates M from T using the iterative insertion process from R-trees (cf. Chapter 4) or the EMTopDown bulk load (cf. Chapters 5 and 6). Both approaches are used as baselines; they are referred to as R and EM, respectively.
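Since each entry only stores the object count together with the linear and quadratic sums per dimension, the Gaussian associated with an entry can be derived on the fly using the standard cluster-feature algebra. The following minimal Python sketch illustrates this bookkeeping; the class and method names are chosen for illustration only and are not part of the original implementation.

```python
import numpy as np

class Entry:
    """Cluster feature of a Bayes tree entry: object count, linear and quadratic sums."""

    def __init__(self, dim):
        self.n = 0                    # number of objects summarized by this entry
        self.ls = np.zeros(dim)       # linear sum per dimension
        self.ss = np.zeros(dim)       # quadratic sum per dimension

    def insert(self, x):
        """Add an object to the summary (used during insertion or bulk loading)."""
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += x * x

    def gaussian(self):
        """Mean and per-dimension variance of the Gaussian represented by this entry."""
        mu = self.ls / self.n
        var = self.ss / self.n - mu * mu   # diagonal (naive Bayes) covariance
        return mu, var
```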

7.2.1 Tree construction

The first process for building a classifier C is the model selection η_C. The general approach followed in this chapter is to analyze the actions that are taken by the classifier and describe them by characteristic terms. For the Bayes tree, η_BT refers to determining the hierarchy of mixtures, i.e. deciding how to create the tree structure. Hence, any approach A that partitions T can be employed in this stage. Moreover, the Bayes tree always processes the hierarchy from the root, i.e. placing more valuable or important information in the upper levels may improve the performance of the approach. To this end, any approach A that ranks the objects in T with respect to their usefulness can be profitable. Therefore, approaches are sought that match the terms partitioning or ranking.

DT. Although decision trees do not maintain one hierarchy per class, the training of decision trees involves a partitioning of the data space based on the training data. Hence, the approach fits the term partitioning and is a candidate for transferring concepts to the Bayes tree. There are numerous ways to build decision trees; here the partitioning method proposed by Fayyad and Irani in [FI93] is adopted. Therein the information gain is used as the splitting criterion and the termination criterion is based on the minimal description length. The following node structure for a binary decision tree is used:

Definition 7.2 A decision tree node ν = (A_i, a_ij, δ, p_1, p_2) stores a splitting attribute A_i, a split value a_ij, a corresponding value for the information gain δ and two pointers p_1 and p_2 to the left and right subtree. A leaf node ν is associated with a set of training objects T|_ν and an inner node with the union of its subtree data sets T|_ν = T|_ν1 ∪ T|_ν2.

Let T_l ⊂ T be the set of all training objects with label l. Creating the hierarchical mixture models for a class l̂ starts with all objects T_l̂ and the single partition induced by the root node ν_root of the decision tree: part_0 = {T_l̂|_νroot}. Next the set of partitions is expanded until maxFanout non-empty partitions are reached by

part_{k+1} = ( part_k \ {T_l̂|_ν̂} ) ∪ { T_l̂|_ν̃ : ν̂ ≺ ν̃ ∧ T_l̂|_ν̃ ≠ ∅ }        (7.1)

where ν̂ = arg max_{ν'∈part_k} {δ'} is the decision tree node in part_k with the highest information gain and ≺ is the parent relation. A Bayes tree node is generated by creating one entry from each partition, i.e. computing the corresponding cluster feature (n, LS, SS). For each class, this process is repeated recursively until the entire decision tree is read.

ANN. A classification method that creates a ranking of training objects is the anytime nearest neighbor classifier [UXKL06] (cf. Chapter 3). The basic idea is to search for the nearest objects by processing the training objects x ∈ T during classification in a precomputed order as long as time permits. The order is decreasing w.r.t. the rank of an object x, which is in [UXKL06] computed as

rank(x) = Σ_j { 1 if l_x = l_{x_j} ; −2/|L| otherwise }        (7.2)

where x_j is an object that has x as its nearest neighbor based on a leave-one-out evaluation on T. To create a Bayes tree node for label l̂ we pick the maxFanout objects from T_l̂ with the highest rank according to Equation 7.2 and assign all remaining objects from T_l̂ to their closest representative. From the resulting subsets the cluster features are calculated. The steps are repeated recursively until the set fits into a leaf.
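A compact way to read Equation 7.1 is as a greedy refinement: the partition whose decision tree node has the highest information gain is replaced by its non-empty children until maxFanout partitions exist. The following Python sketch illustrates this expansion; the node attributes mirror Definition 7.2, while the function and helper names are illustrative assumptions rather than the original implementation.

```python
def expand_partitions(root, objects_of, max_fanout):
    """Greedily expand decision tree partitions for one class (cf. Equation 7.1).

    root:        decision tree root with attributes gain, left, right (cf. Definition 7.2)
    objects_of:  maps a decision tree node to the class objects in its subtree (T restricted to the node)
    max_fanout:  maximal number of entries per Bayes tree node
    """
    partitions = [root]                                    # part_0: the single root partition
    while len(partitions) < max_fanout:
        # pick the partition whose node has the highest information gain (inner nodes only)
        inner = [nu for nu in partitions if nu.left is not None]
        if not inner:
            break                                          # only leaves remain, cannot expand further
        best = max(inner, key=lambda nu: nu.gain)
        children = [c for c in (best.left, best.right) if objects_of(c)]
        if not children:
            break
        partitions.remove(best)
        partitions.extend(children)                        # replace by its non-empty children
    # one cluster feature (n, LS, SS) would be computed per resulting partition
    return [objects_of(nu) for nu in partitions]
```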

7.2.2 Parameter Optimization

For the second process we search for a set A_II of feasible approaches that can be used to improve the parameter optimization. π_BT in the Bayes tree determines the parameters (µ_e, Σ_e) for the Gaussians corresponding to the entries. The actions necessary to find good parameter values Θ can be described by fitting or adjusting. There are numerous approaches that match these terms. Detailed solutions are provided for margin maximization (π_{BT◊MM}), Bayesian networks (π_{BT◊BN}) and bandwidth estimation (π_{BT◊BW}); further solutions are sketched for combining the Bayes tree with neural networks or SVMs.

MM. An approach for margin maximization that seeks to improve the classification performance of Bayesian classifiers based on Gaussian mixtures has been proposed in [PW10] (referred to as MM). Let Θ be the set of parameter values for the mixture models of all labels l ∈ L and l_x ∈ L the label of object x. Then MM approximates the multi-class margin d_Θ(x) by

d_Θ(x) = p(l_x, x|Θ) / max_{l≠l_x} p(l, x|Θ)  ≈  p(l_x, x|Θ) / [ Σ_{l≠l_x} p(l, x|Θ)^κ ]^{1/κ}        (7.3)

using κ ≥ 1. In the global objective

D(T|Θ) = Π_{x∈T} h( (d_Θ(x))^λ )  with λ > 0        (7.4)

the hinge function h̄(y) = min{2, y} is approximated by a smooth hinge function h(y) that allows to compute the derivative ∂ log D(T|Θ)/∂Θ. The derivatives w.r.t. the single model parameters θ_i ∈ Θ are then used to iteratively adjust the weights, means and variances of the Gaussians. The employed optimization algorithm is an extended Baum-Welch algorithm [GKNN91].
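For intuition, the soft margin of Equation 7.3 can be computed directly from the class-conditional joint densities p(l, x|Θ). The short Python sketch below assumes these densities are already available as a dictionary; it only illustrates the approximation of the max in the denominator by a κ-norm, not the extended Baum-Welch optimization itself.

```python
def soft_margin(densities, true_label, kappa=2.0):
    """Approximate multi-class margin d_Theta(x) of Equation 7.3.

    densities:  dict mapping label l -> p(l, x | Theta) for the current object x
    true_label: the label l_x of x
    kappa:      smoothing exponent (kappa >= 1); larger values approach the hard max
    """
    num = densities[true_label]
    others = [p for l, p in densities.items() if l != true_label]
    denom = sum(p ** kappa for p in others) ** (1.0 / kappa)
    return num / denom

# Example: a margin > 1 means the true class currently wins against its competitors.
d = soft_margin({"A": 0.30, "B": 0.05, "C": 0.02}, true_label="A", kappa=4.0)
```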

To determine π_{BT◊MM} a heuristic is required to extract one mixture for each label l ∈ L from the Bayes tree and optimize these using MM. For a set of entries E ⊂ ∪_{N_i∈M} N_i let

Θ(E) = ∪_{e∈E} {µ_e, Σ_e}        (7.5)

be the corresponding set of parameter values. Initially E_0 = N_root; for each label l ∈ L the unimodal model describing T_l is represented by Θ(E_0). After applying MM to Θ(E_0) the corresponding parameters are updated in the Bayes tree. Subsequent steps descend the Bayes tree and set

E_{i+1} = ∪_{e∈E_i} { N_e  if p_e ≠ null;  {e}  otherwise }        (7.6)

As above, the sets E_i are optimized using MM and the Bayes tree parameters are updated. An additional parameter is the maximal number m of steps taken in Equation 7.6: for m = 2 only the upper two levels of the Bayes tree are optimized.

BN. The second studied approach is Bayesian networks, used to fit the Gaussians (µ_e, Σ_e) in the Bayes tree to the underlying training data T|_e. For each entry e, π_BT computes the variance σ_e,ii per dimension (cf. Equation 3.2) and sets all covariances σ_e,ij to 0, i.e. Σ constitutes a diagonal matrix as in a naive Bayesian approach. The space demand w.r.t. the dimensionality d is O(d) compared to O(d²) for a full covariance matrix. Introducing full covariance matrices affects every single entry in the Bayes tree, i.e. O(d²) covariances must be stored per entry. Moreover, the time complexity for the evaluation of the Gaussian density function (cf. Definition 2.3) increases from O(d) to O(d²). To avoid this quadratic time and space demand, a hill climbing method for Bayesian network learning (as e.g. proposed in [KP02]) is adapted here to find the most important correlations.

The goal is to add important correlations at low time and space complexity. To this end a maximal block size s is fixed and Σ is constrained to have a block structure:

Σ = diag(B_1, ..., B_u)        (7.7)

where B_i ∈ R^{s_i×s_i} are square matrices of block size s_i with s_i ≤ s for all i = 1...u. The resulting space demand is in O(d·s). So is the time complexity, since the exponent in the Gaussian distribution (cf. Definition 2.3) can be decomposed into a sum as follows:

z Σ^{-1} z^T = Σ_{i=1}^{u} z_[L_i..U_i] B_i^{-1} z_[L_i..U_i]^T        (7.8)

where z = (x − µ), L_i = 1 + Σ_{j=1}^{i−1} s_j, U_i = L_i + s_i − 1 and z_[L_i..U_i] selects dimensions L_i to U_i from z.
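The decomposition in Equation 7.8 is what keeps the evaluation cheap: each block contributes an independent quadratic form. A minimal NumPy sketch of this block-wise evaluation is given below; the argument names are illustrative and the sketch assumes that the dimensions have already been permuted into block order and that the block inverses are precomputed.

```python
import numpy as np

def block_mahalanobis(z, block_inverses):
    """Evaluate z * Sigma^{-1} * z^T for a block-diagonal covariance (cf. Equation 7.8).

    z:               difference vector (x - mu), already permuted into block order
    block_inverses:  precomputed inverses B_1^{-1}, ..., B_u^{-1} of the diagonal blocks
    """
    total, lo = 0.0, 0
    for B_inv in block_inverses:
        hi = lo + B_inv.shape[0]           # dimensions [L_i .. U_i] of this block
        z_i = z[lo:hi]
        total += float(z_i @ B_inv @ z_i)  # quadratic form of one block
        lo = hi
    return total
```

With the block size bounded by a small constant s, the per-object evaluation stays linear in d up to that constant, which is the O(d·s) behavior claimed above.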

A Bayesian network B = ⟨G, Θ⟩ is characterized by a directed acyclic graph G and a set of parameter values Θ. G = (V, E) contains one vertex i ∈ V for each dimension, and an edge (i, j) ∈ E induces a dependency between dimensions i and j. An undirected graph G′ can easily be derived from G by removing the orientation. To derive Σ, an edge (i, j) ∈ E′ is transferred to a covariance σ_ij and its symmetric counterpart σ_ji in Σ. To ensure a block structure, a permutation P is applied to Σ by PΣP^{-1} that groups dimensions which are connected in G′ as blocks along the diagonal. P is then also applied to z before computing Equation 7.8. Since P is just a reordering of the dimensions, it can be stored in an array of size d and its application to z is in O(1) per feature.

Creating the block covariance matrix for an entry e starts with an empty Bayes net; initially E′ = ∅ and Σ_e = diag(σ_e,11, ..., σ_e,dd). From the edges that can be added to G′ without violating the maximal block constraint in the resulting block matrix, iteratively the one is selected that maximizes the likelihood of e given T|_e. Since all objects x ∈ T|_e have the same label l, the log likelihood is

LL(e | T|_e) = Σ_{x∈T|_e} log( p(l, x | (µ_e, Σ_e)) )        (7.9)

The iterations stop when either the resulting matrix does not allow for further additions or no additional edge improves Equation 7.9.
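The structure search just described is a plain greedy hill climbing over candidate correlations. The sketch below outlines it in Python under simplifying assumptions: candidate edges are pairs of dimensions, a hypothetical helper log_likelihood scores an entry for a given covariance pattern (standing in for Equation 7.9), and the block constraint is checked on the connected components of the undirected graph G′.

```python
from itertools import combinations

def greedy_block_structure(dims, max_block, log_likelihood):
    """Greedily add covariances (edges) while every connected component stays <= max_block.

    dims:           number of dimensions d
    max_block:      maximal block size s
    log_likelihood: callable(edges) -> score of the entry under the induced covariance
                    (hypothetical helper standing in for Equation 7.9)
    """
    edges = set()
    component = {i: {i} for i in range(dims)}     # connected components of G'
    current = log_likelihood(edges)
    while True:
        best_gain, best_edge = 0.0, None
        for i, j in combinations(range(dims), 2):
            if (i, j) in edges:
                continue
            if component[i] is component[j]:
                merged = len(component[i])        # edge inside an existing block
            else:
                merged = len(component[i]) + len(component[j])
            if merged > max_block:
                continue                          # would violate the block-size constraint
            gain = log_likelihood(edges | {(i, j)}) - current
            if gain > best_gain:
                best_gain, best_edge = gain, (i, j)
        if best_edge is None:
            return edges                          # no admissible edge improves the likelihood
        edges.add(best_edge)
        current += best_gain
        i, j = best_edge
        if component[i] is not component[j]:
            union = component[i] | component[j]
            for k in union:
                component[k] = union
```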

In general, a block covariance matrix is determined for each entry in the Bayes tree. However, on lower levels of the Bayes tree the combination of single components is likely to already capture the main directions of the data distribution. This may render the additional degrees of freedom given by the covariances useless or even harmful, since they can lead to overfitting. In Section 7.3, in addition to s, the influence of restricting the BN optimization to the upper m levels of the Bayes tree is evaluated.

BW. In its leaves the Bayes tree stores d-dimensional Gaussian kernels. The kernel bandwidth h_i, i = 1...d, is a parameter that is chosen per dimension by π_BT and that can be optimized by other methods for bandwidth estimation. A method from [Sil86] uses

h_i = ( 4 / ((d + 2) · |T|) )^{1/(d+4)} · σ̂_i        (7.10)

where σ̂_i is the standard deviation of T in dimension i. A second method [JL95] determines the bandwidth as

h_i = (max_i − min_i) / √|T|        (7.11)

where max_i and min_i are the maximal and minimal values occurring in T in dimension i. Finally, a family of bandwidths

h_i = α · σ̂_i        (7.12)

is tested in Section 7.3, where the standard deviation σ̂_i is scaled by a factor α. In π_{BT◊BW} we refer to the three methods in Equations 7.10, 7.11 and 7.12 as haerdle, langley and fα, respectively.
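The three heuristics are cheap one-pass statistics over the training set and can be compared directly. The following NumPy sketch computes all of them per dimension; the function names follow the labels haerdle, langley and fα used above and are otherwise illustrative.

```python
import numpy as np

def bandwidth_haerdle(T):
    """Equation 7.10: (d+4)-th root rule scaled by the per-dimension standard deviation."""
    n, d = T.shape
    return (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4)) * T.std(axis=0)

def bandwidth_langley(T):
    """Equation 7.11: data range per dimension divided by the square root of the sample size."""
    n, _ = T.shape
    return (T.max(axis=0) - T.min(axis=0)) / np.sqrt(n)

def bandwidth_falpha(T, alpha=0.05):
    """Equation 7.12: per-dimension standard deviation scaled by a constant factor alpha."""
    return alpha * T.std(axis=0)

# Example with a random training set of 1000 objects in 16 dimensions:
T = np.random.default_rng(0).normal(size=(1000, 16))
h1, h2, h3 = bandwidth_haerdle(T), bandwidth_langley(T), bandwidth_falpha(T, alpha=0.05)
```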

SVM and Neural networks. Combining Gaussian mixture models and SVMs has for example been studied in [DHN10], where a mapping from Gaussian mixtures to SVMs (and vice versa) is defined that yields the exact same decision boundaries for both approaches. Using such a mapping, π_{BT◊SVM} can be designed similarly to π_{BT◊MM}, i.e. one mixture model for each label is extracted from the Bayes tree, optimized using SVM and the new parameters are transferred back to the tree. Analogously, back propagation as in neural networks using RBF kernels as neurons can be exploited to design π_{BT◊NNet}. For |L| > 2, it must be ensured that the optimizer returns a single set of parameters per Gaussian. Also the bias resulting from SVM or NNet training must be integrated into BT.

7.2.3 Decision Design

Let F_l(t) denote the frontier of label l ∈ L at time t. Then the decision function for the Bayes tree is

f_BT(Θ, x, t) = arg max_{l∈L} { P(l) · Σ_{e∈F_l(t)} (n_e / n_l) · g(x, µ_e, σ_e) }        (7.13)

To estimate the class conditional density for a class l̂ at time t, the entries that are stored in the current frontier F_l̂(t) are evaluated and summed up according to their weight. Since that set of entries is a subset of all entries in the hierarchy, an appropriate term to describe the action necessary for the decision design is selecting.

NN. The nearest neighbor classifier finds a decision based on the object that is the closest to the query object x, i.e. it selects only the most promising object from T. This concept can be transferred to the Bayes tree in a straightforward way by using the modified decision function

f_{BT◊NN}(Θ, x, t) = arg max_{l∈L} { P(l) · max_{e∈F_l(t)} (n_e / n_l) · g(x, µ_e, σ_e) }        (7.14)

Variants of the nearest neighbor classifier use the k closest objects. The actual label can then be assigned based on a simple majority voting among the neighbors or taking their distance or the prior probability of the labels into account (cf. Chapter 2). In the experiments the standard nearest neighbor concept with k = 1 is tested for f_{BT◊NN}.
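The only difference between Equations 7.13 and 7.14 is whether the weighted component densities of a frontier are summed or only their maximum is kept. The short Python sketch below makes that contrast explicit; frontier entries are assumed to be given as (weight, density) pairs already evaluated for the query object, and the function name is illustrative.

```python
def decide(frontier_per_label, prior, use_max=False):
    """Decision functions of Equations 7.13 (sum) and 7.14 (max).

    frontier_per_label: dict label -> list of (n_e / n_l, g(x, mu_e, sigma_e)) pairs
    prior:              dict label -> P(l)
    use_max:            False reproduces Eq. 7.13, True the NN-style Eq. 7.14
    """
    combine = max if use_max else sum
    scores = {
        label: prior[label] * combine(w * g for w, g in entries)
        for label, entries in frontier_per_label.items()
    }
    return max(scores, key=scores.get)
```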

ENS. A second concept that matches the term selecting is the concept of ensemble classifiers. As discussed in Section 7.1, ensembles are a frequently used approach. The concept of ensembles is simple and can easily be transferred to any classification method. A straightforward way for the Bayes tree would be to build several tree structures, e.g. using different samples of the training data, and combine the individual outcomes to achieve a classification decision. The method we propose here uses a single Bayes tree and builds an ensemble over time.

In the Bayes tree so far only the most recent frontier F_l(t) was used in the decision function. Transferring the ensemble concept to the Bayes tree, all previous frontiers are combined in the modified decision function

f_{BT◊ENS}(Θ, x, t) = arg max_{l∈L} { P(l) · Σ_{s=0}^{t} Σ_{e∈F_l(t−s)} (n_e / n_l) · g(x, µ_e, σ_e) }        (7.15)

The additional computational cost when using the ensemble decision from Equation 7.15 compared to the original decision from Equation 7.13 is only a single operation per class. More precisely, we just have to add the most recent density, which is also computed in the original Bayes tree, to an aggregate of the previous densities. Since the same number of frontiers is summed up for each label, we can skip the normalization without changing the decision and do not have to account for additional operations.
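This incremental bookkeeping is easy to see in code: per class, a running aggregate of frontier densities is kept and one addition is performed per refinement step. The sketch below is a minimal illustration of that idea; the method name frontier_density is an assumed stand-in for the weighted frontier sum of Equation 7.13.

```python
class EnsembleDecision:
    """Ensemble-over-time decision of Equation 7.15 with one addition per class and refinement."""

    def __init__(self, labels, prior):
        self.prior = prior                       # dict label -> P(l)
        self.aggregate = {l: 0.0 for l in labels}

    def refine(self, frontier_density):
        """frontier_density: dict label -> weighted density of the current frontier F_l(t)."""
        for label, density in frontier_density.items():
            self.aggregate[label] += density     # the single extra operation per class

    def decision(self):
        return max(self.aggregate,
                   key=lambda l: self.prior[l] * self.aggregate[l])
```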

7.2.4 Summary

Figure 7.1 summarizes the concept transfer approach as well as the results of the single steps in the Bayes tree. The steps are generally applicable to a classifier C. The first step is to identify the processes of C, which resulted in I – model selection, II – parameter optimization and III – decision design (cf. Eqs. 2.1-2.4). The second step is to analyze the actions necessary in C for the identified processes and characterize them by expressive terms (cf. Figure 7.1). The third step is to assign approaches A_i ∈ A from a set A of known approaches whose concepts match the terms defined in step 2. The approaches discussed for the Bayes tree in the three processes I, II and III are {ANN, DT} ⊂ A_I, {MM, BW, BN} ⊂ A_II and {NN, ENS} ⊂ A_III (cf. Figure 7.1). For the fourth step we first test the effects of the single improvements separately, e.g. BT ◊ ANN or BT ◊ MM (cf. Section 7.3). In further experiments we then seek to find the best combination of several improvements, e.g. BT ◊ ANN ◊ BW ◊ ENS.

Figure 7.1: Summary of the concept transfer approach (the processes I–III with their characteristic actions and matching known concepts, and the four steps of the approach).

7.3 Experiments

In all experiments the results for the first r = 200 refinements are reported. To compare the different approaches the average accuracy avg is used as well as the maximal accuracy max over all refinements. The third employed objective, which penalizes descending or oscillating anytime curves, is the monotonicity

mon = 1 − (1/r) · Σ_{n=1}^{r} ( acc_c(n) − min{acc_c(n), acc(n)} )

where acc_c(n) = max_{1≤n′<n} {acc(n′)} is the highest accuracy observed at the earlier refinements.
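Given a recorded anytime accuracy curve acc(1), ..., acc(r), the three objectives avg, max and mon can be computed in a few lines. The following sketch is a direct transcription of the definitions above; treating acc_c(1) as 0 (no earlier refinement) is an assumption of the sketch.

```python
def anytime_objectives(acc):
    """avg, max and mon for an anytime accuracy curve acc[0..r-1] with acc(n) in [0, 1]."""
    r = len(acc)
    avg_acc = sum(acc) / r
    max_acc = max(acc)
    penalty, best_so_far = 0.0, 0.0                    # acc_c(1) treated as 0
    for a in acc:
        penalty += best_so_far - min(best_so_far, a)   # drop below the previous best
        best_so_far = max(best_so_far, a)
    mon = 1.0 - penalty / r
    return avg_acc, max_acc, mon
```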

name          #obj      d    |L|
page-blocks    5473     10     5
optdigits      5620     64    10
letter        20000     16    26
segment        2310     19     7
kr-vs-kp       3196     36     2
pendigits     10992     16    10
vowel           990     10    11
spambase       4601     57     2
gender       189961      9     2
covtype      581012     10     7

Model selection. The results for the modified model selections are shown in Figure 7.2, listing for both approaches the absolute differences in the objectives compared to the two baselines R and EM. Both ANN and DT show better performance than the R baseline for nearly all data sets and measures. Compared to the EM baseline an exceptional improvement (ANN: 12.5% and DT: 6.9% for avg) is reached by both approaches for the covtype data set, which is the largest tested data set. On the other data sets they are comparable or even slightly worse than the EM baseline.

              ANN-Ranking (BT ◊ ANN)                         Decision Tree (BT ◊ DT)
              diff. to R baseline    diff. to EM baseline    diff. to R baseline    diff. to EM baseline
              avg     max     mon    avg     max     mon     avg     max     mon    avg     max     mon
page-blocks   0.9%    0.5%   -6.4%  -0.5%   -0.5%   -6.7%    1.4%    1.0%   -1.0%   0.0%    0.0%   -1.3%
optdigits     4.0%    3.1%    9.7%  -0.4%   -0.1%   -1.1%    3.6%    2.8%   10.3%  -0.7%   -0.4%   -0.7%
letter       11.2%   11.3%   -0.4%  -0.6%   -0.4%    1.1%    9.2%   10.4%   -0.2%  -2.6%   -1.3%    0.8%
segment       4.7%    2.7%   20.3%   1.5%    1.1%    6.9%    2.3%    2.1%    5.1%  -0.9%    0.5%   -8.3%
kr-vs-kp      9.0%    5.0%    2.7%  -0.5%   -0.1%   -0.2%    9.7%    5.5%    2.7%   0.2%    0.4%   -0.2%
pendigits     5.2%    4.2%   11.4%   0.4%    0.1%    2.1%    4.4%    3.7%    9.3%  -0.4%   -0.3%    0.1%
vowel         0.2%    0.0%    1.1%   0.3%    0.0%   -1.0%    0.0%    0.5%    1.1%   0.2%    0.5%   -0.6%
spambase      1.1%    0.2%   -7.5%   0.0%    0.4%  -11.8%   -0.3%   -0.7%    6.0%  -1.4%   -0.5%    1.7%
gender        5.0%    6.1%    1.1%  -2.4%   -1.8%    1.1%    6.5%    7.8%    0.7%  -0.9%   -0.1%    0.6%
covtype      18.9%   22.4%    3.0%  12.5%   12.0%    4.9%   15.9%   18.9%    1.9%   6.9%    4.9%    3.6%
averages      6.0%    5.5%    3.5%   1.0%    1.1%   -0.5%    5.3%    5.2%    3.6%   0.1%    0.4%   -0.4%

Figure 7.2: Model selection approaches.

Figure 7.3: Effects of single approaches on letter (accuracy vs. number of refinements).

              Bandwidth Estimation (BT ◊ BW)
              diff. to R baseline    diff. to EM baseline
              avg     max     mon    avg     max     mon
page-blocks   1.5%    1.1%    5.2%   0.6%   -0.3%    9.8%
optdigits     0.7%    0.8%    2.4%   0.0%    0.0%    0.0%
letter        7.3%    9.5%    2.9%   5.5%    4.7%    3.7%
segment       3.9%    2.1%   24.8%   6.2%    3.1%   23.5%
kr-vs-kp      0.0%    0.0%    0.0%   1.1%    0.5%   -1.4%
pendigits     2.7%    2.9%   11.4%   1.1%    0.8%    3.0%
vowel         6.0%    4.1%    8.0%   4.9%    3.8%    6.4%
spambase      0.8%   -0.3%    1.2%   0.0%    0.0%    0.0%
gender        2.6%    3.9%    1.2%   3.5%    3.9%    1.3%
covtype       3.4%    5.6%    3.0%  14.3%   12.3%    5.4%
averages      2.9%    3.0%    6.0%   3.7%    2.9%    5.1%

Figure 7.4: Approaches for parameter optimization I (BW ).

Figure 7.3 shows the anytime accuracy results for the single approaches and the R baseline as a comparison. BT* refers to the final combination that is found as the result of the evaluation and which is introduced at the end of the section. BT ◊ ANN and BT ◊ DT are located between the baseline and the final solution, i.e. they yield a significant improvement, but are outperformed by the combination in BT*. Moreover, neither of the model selection approaches can diminish the oscillation of the anytime curve, which is achieved by other concepts as we see in the following.

Parameter optimization. Figures 7.4 and 7.5 show the results for the tested parameter optimization approaches. The shown results for the bandwidth estimation in Figure 7.4 are the best results over the three heuristics from Equations 7.10 to 7.12, where for the fα heuristic the following α values were tested: α ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}. In the 20 results (EM and R on 10 data sets) haerdle and langley were both chosen three times, twice f0.01 was the best choice and the remaining results were achieved by f0.05. By the optimized bandwidth estimation all accuracy values, i.e. avg and max on both R and EM, could be improved with the exception of max for EM on page-blocks and max for R on spambase. The largest improvement is achieved for the monotonicity with the single exception of EM on kr-vs-kp. The anytime plot in Figure 7.3 illustrates the good performance of BT ◊ BW in all three measures.

              Bayesian Network (BT ◊ BN)                     MaxMargin (BT ◊ MM)
              diff. to R baseline    diff. to EM baseline    diff. to R baseline    diff. to EM baseline
              avg     max     mon    avg     max     mon     avg     max     mon    avg     max     mon
page-blocks   0.2%    0.3%  -18.0%   0.1%   -0.2%   -8.9%   -0.9%   -0.5%   -1.1%  -0.5%   -0.4%    4.7%
optdigits     0.0%    0.0%    0.0%   0.0%    0.0%    0.0%   -1.0%   -0.3%    8.3%  -0.6%   -0.3%    0.2%
letter        3.6%    4.0%   -1.7%   1.5%    1.5%   -1.1%   -1.3%    0.1%    0.4%  -1.5%   -0.6%    1.3%
segment       0.0%   -0.3%   10.8%  -0.1%   -0.1%    6.4%   -5.8%   -2.6%   23.9%  -1.3%   -0.7%   20.2%
kr-vs-kp      0.0%    0.0%   -0.5%   0.0%    0.0%   -0.1%   -0.5%   -1.1%   -2.3% -18.7%  -13.1%  -17.4%
pendigits     1.8%    2.0%   -3.4%   0.2%    0.1%   -2.5%    0.0%    0.3%    3.3%  -0.6%   -0.5%    1.1%
vowel         0.2%    0.9%   -5.1%   0.3%    0.1%   -1.9%   -0.8%   -1.1%    4.7%  -1.8%   -1.5%    4.4%
spambase      0.0%    0.0%    0.7%   0.0%    0.0%    0.1%   -7.1%   -5.1%   -0.2%  -6.3%   -3.1%   -1.9%
gender        0.0%    0.0%    0.0%   0.0%    0.0%   -0.2%    1.1%    1.3%    5.9%  -4.7%   -3.3%    0.7%
covtype      -2.4%   -2.4%    2.3%  -0.3%   -0.3%    1.7%      -       -       -      -       -       -
averages      0.3%    0.5%   -1.5%   0.2%    0.1%   -0.7%   -1.8%   -1.0%    4.8%  -4.0%   -2.6%    1.5%

Figure 7.5: Approaches for parameter optimization II (MM and BN).

The performance of transferring the Bayesian network approach to BT is hardly better than either of the two baselines. As above, the results shown in Figure 7.5 are the best among all parameter settings for BT ◊ BN, i.e. over all block sizes s and numbers of levels m (cf. Section 7.2.2). The largest improvements are achieved on the letter data set with s = d and m = 1. The performance gain was less for smaller block sizes (results not shown). As assumed in Section 7.2.2, the additional degrees of freedom gained through the covariances are useless or even harmful on lower levels of the tree, i.e. the displayed results, which are the best among all m and s, all use m = 1 and s = d. Nonetheless, as stated above, the performance gain is only marginal on 9 of 10 data sets.

The margin maximization approach BT ◊ MM does not add covariances but only seeks to improve the weights, means and variances of the Gaussians in the Bayes tree. The results on the 10 tested data sets are shown in Figure 7.5 (the results for covtype could not be obtained with 4 GB RAM). As above, the best results over all parameter settings are shown, where κ ∈ {1..10}, λ ∈ [0, 10] and m ranges from 1 to the maximal tree height. On average BT ◊ MM improves the monotonicity over the baselines, but neither of the accuracy measures. The detailed results show slight improvements over the R baseline on three data sets. This result is surprising at first sight, since the original concept from [PW10] is designed for improving Gaussian mixture models.


Figure 7.6: Results for MM on static mixture models of size 7 and 49.

To exclude the possibility that the poor performance of BT ◊ MM is solely due to the data set characteristics, the implementation of MM is tested on static Gaussian mixture models. Using the same EM clustering that was used for constructing the Bayes tree in the EM baseline, for each data set two mixtures were created, each containing one model per class. In the first mixture each model has 7 components, in the second 49 components. (Multiples of 7 were chosen since they correspond to the chosen fanout of the Bayes tree.) Figure 7.6 shows the resulting absolute gain in classification accuracy of the optimized over the initial mixtures. MM improves the accuracy for at least one mixture on 8 data sets and for both mixtures on 5 of 10 data sets in the experiments. For the vowel data set the improvement is 3% for the 49 components and nearly 8% for the 7 components. However, neither avg nor max are improved by BT ◊ MM on vowel (cf. Fig. 7.5). One reason for this is the fact that π_{BT◊MM} optimizes the mixtures in the Bayes tree level by level, but the decision f_BT uses arbitrary mixtures that can contain components from many different levels of the tree. These components, or rather these mixtures, were never optimized together. Optimizing all possible mixtures is not feasible, since on the one hand the sheer number of possible mixtures makes the optimization computationally infeasible, and on the other hand such an approach does not yield a single set of parameter values per Gaussian.

              Ensemble (BT ◊ ENS)                            Nearest Neighbor (BT ◊ NN)
              diff. to R baseline    diff. to EM baseline    diff. to R baseline    diff. to EM baseline
              avg     max     mon    avg     max     mon     avg     max     mon    avg     max     mon
page-blocks   1.2%    0.4%   10.7%   0.8%   -0.2%    9.7%    0.7%    0.5%    3.8% -46.3%   -0.8%  -47.2%
optdigits    -1.1%   -2.0%   11.4%  -3.2%   -3.5%    1.1%    0.0%    0.0%    0.1%   0.0%    0.0%    0.0%
letter        1.7%    0.7%    3.1%   1.6%    0.0%    3.8%    0.1%    0.2%   -0.3%  -0.2%    0.3%    1.5%
segment       5.9%    1.5%   37.7%   4.2%    0.5%   24.3%    0.0%    0.0%   -0.1%   4.4%    2.4%   18.6%
kr-vs-kp     -1.7%   -3.4%    2.0%  -2.7%   -1.6%    0.7%   -0.1%   -0.1%   -1.1%   0.1%    0.1%   -0.2%
pendigits     2.0%    1.1%   12.5%   0.3%   -0.1%    3.1%    0.1%    0.0%    0.8%   0.6%    0.3%    2.7%
vowel         2.4%    0.7%    8.1%   2.8%    1.1%    4.8%    4.9%    4.0%    7.4%   3.6%    3.9%    5.4%
spambase      2.1%   -0.2%   12.9%  -4.0%   -5.7%   -3.8%    0.0%    0.0%    0.0%  -0.2%   -0.2%   -0.6%
gender        0.4%    1.5%    1.3%   2.3%    1.7%    1.4%   -2.1%   -1.7%   -4.4%  -3.4%   -0.8%  -17.3%
covtype       2.3%    3.4%    3.0%   6.2%    1.8%    5.4%   -0.5%   -0.1%   -3.2% -11.9%   -6.9%  -37.9%
averages      1.5%    0.4%   10.3%   0.8%   -0.6%    5.1%    0.3%    0.3%    0.3%  -5.3%   -0.2%   -7.5%

Figure 7.7: Decision design approaches.

Decision design. For the process of decision design, f_{BT◊ENS} and f_{BT◊NN} were tested; the absolute improvements for the three objectives are shown in Figure 7.7. Combining the ensemble concept with the Bayes tree (cf. Equation 7.15) yields on average a slight increase in the accuracy measures avg and max for the R baseline and is rather neutral on the EM baseline. The monotonicity, however, is drastically increased by BT ◊ ENS, on average by more than 10% compared to the R baseline and more than 5% compared to the EM baseline. This is underlined by the anytime accuracy plot of BT ◊ ENS in Figure 7.3, which shows a smooth and monotonically increasing behaviour. In contrast, the results of BT ◊ NN hardly show any improvement over the R baseline except for vowel. The anytime plot for BT ◊ NN is hardly visible in Figure 7.3, since it coincides with the curve of the R baseline. This is another surprising result. It signifies that taking per label only the one single Gaussian which yields the highest class conditional density for the query object results in almost exactly the same decisions as taking the entire mixture models. As can be seen in Figure 7.7, this strongly holds for 7 of the 10 tested data sets on the R baseline. Compared to the EM baseline the performance of BT ◊ NN is worse on average.


Figure 7.8: Anytime plots for combined approaches.

Combining approaches. Next we study the cumulative performance gain when combining BT with more than one concept. Figure 7.8 shows the anytime accuracy plots on letter for the EM baseline and the combined versions using ENS and/or BN and/or BW. The curve of the EM baseline exhibits a strong oscillation. EM ◊ BN improves the accuracy throughout on this data set, but cannot diminish the oscillation. Near perfect monotonicity is reached when using ENS, either alone (EM ◊ ENS) or combined (EM ◊ BN ◊ ENS). The cumulation of the two positive effects, i.e. increased accuracy and monotonicity, is clearly expressed by the corresponding anytime plots. EM ◊ BW pushes up the accuracy and can also improve the monotonicity. Adding ENS yields again near perfect monotonicity (cf. EM ◊ BW ◊ ENS). Finally, combining the three concepts with EM yields the best results in this setting.

To find the globally best results all combinations are allowed and per data set the best setting is selected with respect to the linear combination of all three measures. This is denoted as free choice; the settings and corresponding results are shown in Figure 7.9 along with the two baselines and the results for BT* (see below). The values shown are the absolute values for the measures. Highlighted cells indicate an improvement over both baselines.

              R baseline             EM baseline
              avg     max     mon    avg     max     mon
page-blocks  94.0%   95.9%   87.0%  95.4%   96.8%   87.3%
optdigits    92.9%   94.6%   87.9%  97.2%   97.8%   98.8%
letter       76.1%   78.8%   96.8%  87.9%   90.5%   96.1%
segment      86.3%   92.0%   59.3%  89.4%   93.6%   72.6%
kr-vs-kp     84.4%   90.1%   96.1%  93.9%   95.2%   99.0%
pendigits    93.1%   94.6%   87.3%  97.9%   98.7%   96.6%
vowel        91.0%   94.4%   90.8%  90.8%   94.4%   92.5%
spambase     87.4%   91.5%   81.4%  88.5%   91.4%   85.7%
gender       73.1%   73.9%   98.4%  80.5%   81.8%   98.5%
covtype      62.4%   63.1%   96.0%  71.4%   77.2%   94.4%
averages     84.1%   86.9%   88.1%  89.3%   91.7%   92.2%

              BT*                    free choice                   free choice settings
              avg     max     mon    avg     max     mon      buildup   bw        ens    opt
page-blocks  96.6%   96.9%   98.2%  96.5%   97.0%   99.1%     ANN       langley   yes    -
optdigits    96.8%   97.0%  100.0%  96.9%   97.2%   99.9%     EM        f0.05     yes    MM
letter       93.5%   95.3%   99.9%  94.5%   95.7%   99.9%     EM        f0.05     yes    BN
segment      94.4%   95.1%   96.4%  95.6%   96.1%   99.7%     DT        langley   yes    -
kr-vs-kp     93.9%   94.6%   99.3%  98.0%   98.6%   99.3%     DT        f0.05     yes    -
pendigits    99.0%   99.4%   99.7%  99.0%   99.4%   99.7%     EM        f0.05     yes    -
vowel        97.4%   98.7%   99.6%  97.8%   98.6%   99.9%     EM        f0.05     yes    MM
spambase     91.9%   92.5%   98.5%  91.9%   93.0%   98.4%     ANN       f0.05     yes    -
gender       84.3%   85.7%   99.9%  84.4%   86.2%   99.9%     DT        f0.05     yes    -
covtype      79.8%   82.7%   99.8%  92.1%   94.9%  100.0%     ANN       f0.05     yes    -
averages     92.8%   93.8%   99.1%  94.7%   95.7%   99.6%

Figure 7.9: Baseline results, results for BT*, and the globally best results and setups. 7.3. Experiments 149

Figure 7.10: Anytime accuracy results for BT* (accuracy vs. number of refinements for all ten data sets).

On all data sets all three measures are improved by the free choice except for avg and max on optdigits, where the actual values are slightly smaller (due to picking the best result according to the linear combination of all three measures). In the settings of free choice, ENS was used on all data sets and f0.05 was used eight times for BW; other parameter optimizations were rarely employed (once BN and twice MM). For the model selection, i.e. the buildup of the Bayes tree, EM, ANN and DT have roughly equal shares. To have one final setting that is used on all data sets, BT* = EM ◊ f0.05 ◊ ENS is chosen. The choice of EM is due to the fact that the improvement of ANN and DT over EM on average is mainly due to the covtype data set (cf. Figure 7.2). As can be seen in Figure 7.9, BT* exhibits improvements over both baselines similar to free choice except for max on kr-vs-kp, where it shows slightly worse performance compared to EM, and for the covtype data set as discussed above.

Overall, improving BT to BT* by transferring three concepts yields very good results on all tested data sets. Figure 7.10 shows the anytime accuracy plots for BT*, which illustrate the great performance w.r.t. all three measures. Figure 7.11 visually summarizes the average results of the single concepts, free choice and BT*, underlining the effectiveness of the concept transfer approach to classifier improvement.


Figure 7.11: Average absolute differences over all 10 data sets for single and combined approaches.

7.4 Conclusion

In this chapter related concepts were transferred to the Bayes tree to improve its anytime classification performance. Experimental evaluation showed the effects of the single improvements and yielded a best combination of several concepts that outperformed the baselines on a multitude of data sets. The improved version BT* of the Bayes tree that resulted from the investigations in this chapter shows near perfect results on all tested data sets (cf. Figure 7.10). BT* constitutes the final version of the proposed Bayesian anytime classifier as a first result of this thesis. In the remainder of Part II an application of the Bayes tree is presented in Chapter 8. In Chapter 9 novel approaches are introduced to harness the strengths of anytime algorithms on constant streams; future work in the area of anytime stream classification is discussed in Chapter 10.

Chapter 8

Application: Anytime Classification in HealthNet Scenarios

∗ Demographic change will pose a major challenge for our society as people live longer due to enhanced living conditions. Especially the increasing number of elderly people who need medical assistance and supervision must be considered. In today's health environments medical supervision is an expensive task as it can only be provided by health professionals in hospitals or other health facilities. A major contribution in this area is the development of remote monitoring systems that provide health assistance and supervision at home. Such health monitoring provides not only economical savings, as it can save expensive health professional costs, but it also increases the living quality of elderly people as they can stay in their familiar environment. One major task for such a remote health monitoring is to provide an emergency detection system. Based on a semi-automated classification of a patient's situation the system should automatically detect an emergency and pose an alert. This alert is then manually verified by a health professional. The applicability of this scenario to emergency detection has been shown as proof of concept in a framework for "Mobile Mining and Information Management in HealthNet Scenarios" [KKK+08].

∗This chapter has been published in Plant C., Böhm C. (eds.): Database Technology for Life Sciences and Medicine, World Scientific Publishing [KMA+10] and presented at the 9th IEEE International Conference on Mobile Data Management (MDM 2008) [KKK+08].


Figure 8.1: Multi-step patient health classification from the body sensor network up to the health professional.

8.1 Scenario and Prototype

Health monitoring tasks pose major challenges for data processing. In general, automated monitoring produces a huge amount of stream data that cannot be manually handled by humans. An automated analysis to assist the health professionals is essential for a scalable system monitoring hundreds of patients. Although automated analysis cannot take the final decision about an emergency, it can dramatically reduce the number of patients that require manual attention by highlighting the most urgent cases. The health professional can then focus on a more detailed analysis of these patients, which leads to an overall scalable semi-automated processing.

For the data analysis components in such an emergency detection system there are special requirements derived from the health surveillance application. Due to the huge amount of streaming data one must consider scalability issues for the automated analysis. While servers in a data farm may analyze terabytes of data, computers in hospitals can only process gigabytes of data. Mobile clients, which collect the data from sensors, have even fewer resources and may process megabytes of data. Even the sensors in a body sensor network can process some kilobytes of data, however only with very restricted algorithms. A requirement in today's heterogeneous architectures is thus to design an adaptive analysis which scales with the available resources in a multi-step approach.

The general aim of such a multi-step classification is to provide an adaptive classifier applicable on various layers of the health surveillance process. Especially for mobile devices with limited resources an efficient pre-classifier is required. Based on the decision of the pre-classification the required level of detail of the sensor measurements is decided. As shown in Figure 8.1 the full sensor information is sent only in emergency cases, while only aggregated information is sent in normal situations. This way adaptive aggregation minimizes the communication cost and the energy consumption of the mobile devices. Every decision is refined on the next layer of the multi-step classification, since the pre-classification is tailored for limited resources and its lower classification accuracy may lead to false positive and false negative decisions.

Figure 8.2: Multi-step anytime classification accuracy (1st step: pre-classification; 2nd step: refined classification).

Each of the classifiers on the different levels must cope with streams of variable size and speed. On the mobile device this is due to the varying amount and frequency of data sent by the patients depending on their condition, as pre-classified locally on the sensor controller. The receiving server must classify all incoming patient data as accurately as possible. With many patients potentially sending aggregated or detailed data, the amount of data and their arrival rate may vary widely. Moreover, the number of patients currently connected to the system may also vary. To treat each patient in a global emergency situation, e.g. an earthquake, classification for each individual observation must be performed quickly and be improved as long as time permits. A solution to these requirements is offered by algorithms that are based on the anytime paradigm.

A further challenging characteristic of medical data is the diversity of the data. Depending on the disease to be detected there are various possible measurements to be sensed. ECG, temperature and accelerometer data have already been included in a proof of concept [KKK+08]. However, there are far more possible non-invasive measures like respiration quotient, pulse oximetry, humidity, body posture, respiration frequency and many more complex medical measures. There can also be derived measures, such as features extracted from the ECG signal [RGI05]. Especially the dependencies between some of these measures are important for an emergency detection, as respiration has a huge impact on ECG measurement artifacts, for example. The system must fulfill this requirement of multivariate classification by handling many different measurements.

Finally, since new patient data is constantly coming in and parts of these data, e.g. from supervised hospital situations, can be used as additional training data, the classifier should be able to incrementally learn from new training data. New training data can be learned while the patient is supervised, and in the next moment the status can be switched back to classification.

In summary, the overall classification task must fulfill three main requirements derived from the health surveillance scenario:

• Incremental learning based on multivariate data

• Multi-step processing for classification in a cascade of devices

The Bayes tree fulfills all of the above requirements. As an anytime classification approach it adapts to different classification times given, for example, by the sensor measurement frequency. Furthermore, it can handle multivariate data provided by multiple different sensors. As the classifier includes an indexing structure for secondary storage, it can handle huge amounts of stream data collected at medical facilities. Parts of this index structure can be used in multi-step approaches in a cascade of classifiers with increasing accuracy. Moreover, it also adapts to evolving data through an incremental learning of the underlying model.

8.1.1 Prototype

The achieved results on benchmark data (cf. Chapters 4 to 7) show the applicability of anytime classification, incremental learning and multi-step classification and suggest its use in future body sensor network scenarios. For a remote health monitoring system, this ensures scalability as huge amounts of information from multiple hospitals can be collected, stored and processed in data farms. Moreover, incorporating existing medical databases through the presented bulk loading approaches improves the resulting anytime classification performance.

The anytime classification approach focuses on a multi-step patient health classification based on multivariate sensor data. As depicted in Figure 8.1 there are multiple steps from the body sensor network, where sensor measurements are collected, up to the health professional. Challenges for the overall system arise on multiple layers, from the extraction of vital signals out of sensor measurements up to the health surveillance system for the health professional. A prototype addressing all of these layers showed the applicability of this concept. The developed prototype provides an overall mobile system with textile sensors as shown in Figure 8.3. The underlying sensor network is based on an MSP430 microcontroller (Texas Instruments, USA) which serves as a master controller for the deployed sensors. In the prototype three types of sensors are integrated: an ECG sensor, a temperature sensor and an accelerometer. The ECG sensor is used to monitor the electrical activity of the heart. The three-axis accelerometer is used to monitor the activity of the user and to detect and identify artefacts due to movement. For example, the ECG sensor is sensitive to displacement of the electrodes. In this case the acceleration data can be correlated to distinguish movement artefacts from an irregular ECG due to disease-related irregular patterns.


Figure 8.3: Application of anytime classification in body sensor networks [KKK+08, MBZ+07, KBP+09, KCB+09].

As textile sensors, the shirt has three integrated electrodes for the monitoring of heart rate signals. The electrodes are made from silver coated polyamide yarn (Shieldex 110/34) which yields both highly conductive surfaces and elastic behavior for ease of use and wearing comfort. The garment must be close-fitting so that a permanent contact between the electrodes and the skin can be guaranteed both for direct and capacitive use [MBZ+07].

The master controller of this body sensor network is connected to the mobile device via a wireless communication channel. On the mobile device a J2ME application processes the sensor data stream and decides on the data to be forwarded to the back-end server. In-depth analysis can then be done on the server and detailed data can be requested automatically or by interaction of a health professional. Thus, in addition to the pre-classification on the mobile device, further decision refinement about emergency situations is included on a central server with a more complex classifier. The integration of multiple classification steps into this process ensures a scalable emergency detection system: first, on a mobile device carried by the patient, which includes data aggregation based on the pre-classification; second, classification on a central server in a health facility which collects data of various patients and performs a more accurate classification to refine the classification results of the mobile devices.
Overall, the system is controlled by a health professional taking the final decisions about the patient's situation.

8.2 Summary

In this chapter major challenges derived from next generation mobile health surveillance were discussed. Semi-automated classification was highlighted for the emergency detection task and it was shown how novel index-based classifiers can build the core for multivariate multi-step classification in health surveillance. By supporting anytime learning and anytime classification the Bayes tree can handle huge amounts of data, which makes it a consistent solution for the described medical scenario. Moreover, as was laid out in this chapter, the Bayes tree fulfills all requirements which are crucial for classifying medical patient data in a scalable health surveillance system.

Future challenges include extending the existing framework and evaluating the Bayes tree classifier based on sensor measurements in a broad health surveillance project. This project will include extensions of textile sensors, body sensors and preprocessing techniques as well as the integration and merging of sensor data in electronic health record systems. Emergency detection on multiple levels will show the benefits of multi-step classification and further enhance the scalability of emergency detection systems.

Chapter 9

Anytime Algorithms on Constant Streams

∗ Anytime algorithms have been proposed for many different applications; anytime algorithms for data mining were discussed in Chapters 3 and 4 to 7. Their strengths are the ability to first provide a result after a very short initialization and second to improve their result with additional time. Therefore, anytime algorithms are mostly used when the available processing time varies, as in the case of varying data streams. In this chapter the use of anytime algorithms on constant data streams is discussed, suggesting anytime algorithms for tasks with constant time allowance. Two approaches are introduced that harness the strengths of anytime algorithms on constant data streams and thereby improve the overall quality of the result with respect to the corresponding budget algorithm. A theoretic model for the expected performance gain is derived and the effectiveness of the proposed approaches is demonstrated using existing anytime algorithms on benchmark data sets.

∗This chapter has been published in the Data Mining and Knowledge Discovery Journal, Special Issue on Best Papers from the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2009) [KS09a, KS09b].


9.1 Introduction

As discussed in Chapter 3, data streams can generally be divided into two groups, the first one being constant data streams, where the time between each two consecutive stream data items is equal (constant), and the second one being varying data streams, where the time between two items varies. Items on a conveyor belt are an intuitive example for a constant stream. They are often evenly arranged before being sorted or before passing an automated quality check etc. We use this for illustration as a running example throughout the chapter (cf. Figure 9.1), keeping the abundance of applications from machine monitoring, sensor networks etc. in mind. Moreover, classification of stream data items is used as an example application. At point t_f features are taken for each item, e.g. through cameras or sensors. At point t_d a classification decision must be made for the passing item to sort it into the correct bin.

In practice and in the literature it was so far not usual to employ an anytime algorithm on constant data streams, since there was no necessity to bother with additional implementation effort if the time budget is constant and known in advance. For any given time amount a budget algorithm can be developed that satisfies this allotted time. The goal which was set and achieved in the research presented here is to improve the quality of the result over that of the respective budget algorithm. In other words, for the abundance of constant data streams and mining applications the proposed methods achieve an improvement of the output quality.

Figure 9.1: Items on a conveyor belt as a constant data stream. Features are extracted at $t_f$; a decision must be made at $t_d$.

The proposed approaches are not anytime algorithms by themselves. They constitute meta approaches that enable applications with constant data streams to benefit from the strengths of anytime algorithms. The basic idea underlying these meta approaches is to spend less time on data items that have a good result early on and to use the acquired extra time on data items with poor results.† Both approaches are based on a quality measure for the current result of the anytime algorithm with respect to the current item. As stated above, stream classification is used as a running example; the confidence of the current classification decision is used as a quality measure. The proposed approaches are described in Section 9.2 and evaluated in Section 9.3. Section 9.4 concludes the chapter.

9.2 Novel Approaches for Constant Data Streams

In this section novel approaches are presented that allow for profiting from anytime algorithms on constant data streams. In Section 9.2.1 formal definitions are provided and the expected improvement is discussed using a theoretic model. Sections 9.2.2 and 9.2.3 introduce the batch and the FiFo approach, respectively.

9.2.1 Definitions and Models

Definition 9.1 Let $S$ be a constant data stream and $algo$ a stream mining algorithm working on $S$. Then $t_f(o)$ is defined as the time when the features of an item $o \in S$ become known to $algo$, and $t_d(o)$ is the time when $algo$ must provide a result for $o$. The arrival time between $o_i$ and $o_{i+1}$ is defined as $t_a$ and is constant for all $i$.

If $t_d - t_f \leq t_a$, a budget algorithm with a time budget of $t_d - t_f$ is the best choice. If $t_d - t_f > t_a$, the proposed approaches can improve the overall quality of the result. The running example assumes six items passing per second and two seconds between the camera (feature extraction at $t_f$) and the decision (classification at $t_d$).

†The meta approaches presented in this chapter are applicable to both constant and varying streams. They are discussed and evaluated on constant streams to highlight the benefits over budget algorithms, which are commonly used in practice and literature.

The proposed approaches can increase the number of correctly classified stream items, i.e. improve the overall classification accuracy. Besides the anytime algorithm they require a confidence measure that assesses the certainty of the current decision. This confidence measure must be provided by the user and is justified by its performance in an actual application.

Definition 9.2 For a classifier C the confidence of its classification decision on an item o states how certain the decision is and ranges from 0 (no confidence) to 1 (certain):

$$\mathrm{conf}_C(o): o \rightarrow [0, 1]$$

A confidence measure is called reliable if the accuracy is monotonically increasing with the confidence: the higher the average confidence, the better the accuracy of a classifier. In this section a linear dependency between confidence and accuracy is assumed at first; the influence of confidence measures is discussed at the end of the section. To build a theoretical model, it is assumed that after initialization the average confidence of an anytime classifier's decision increases over time and equals the confidence of the corresponding budget classifier with the same time allowance. Moreover, for any given time budget the confidences of the individual decisions are assumed to be scattered around this expected value, and the amount of scattering is assumed to decrease over time. Figure 9.2 (right) illustrates these assumptions. The time axis is normalized to $[0, 1]$; at $t = 1$ the anytime classifier has performed all possible improvements. The model assumes an anytime classifier that performs one initial step and $n$ improvement steps that all take the same amount of time (the anytime nearest neighbor classifier [UXKL06] is an example, cf. Chapter 3). Let $\zeta(x, t)$ denote the probability density function that describes the scattering of the confidences, where $t$ is the allowed time budget (Figure 9.2, left, gives an example).

Figure 9.2: Expected confidence (budget confidence $\mu(t)$) and scattering ($\sigma(t)$).

Then we can calculate the probability that the confidence $c(t)$ at time $t$ is larger than some confidence $\hat{c}$ by

$$F(\hat{c}, t) = p(c(t) \geq \hat{c}) = \int_{\hat{c}}^{\infty} \zeta(t, x)\,dx$$

Recall the $n$ improvement steps $t_j$, $j = 1 \ldots n$, of the anytime classifier. If $F(\hat{c}, t)$ is used as a cumulative distribution function, then the probability that we exceed $\hat{c}$ for the first time between $t_{j-1}$ and $t_j$ can be derived as $h(\hat{c}, t_j) = F(\hat{c}, t_j) - F(\hat{c}, t_{j-1})$. (Note that $F(\hat{c}, 0) + \sum_{v=1}^{n} h(\hat{c}, t_v) = 1$.) Now we can determine the expected time that is necessary to reach $\hat{c}$ by

$$t(\hat{c}) = F(\hat{c}, 0) \cdot \frac{1}{n} + \sum_{v=1}^{n} h(\hat{c}, t_v) \cdot \frac{v}{n} \qquad (9.1)$$

As an example we assume the scattering to follow a Gaussian distribution $g(\mu, \sigma, x) = \frac{1}{\sigma\sqrt{2\pi}} \cdot e^{-0.5\left(\frac{x-\mu}{\sigma}\right)^2}$, where the mean is the budget confidence following $\mu(t) = c_0 + (1 - c_0) \cdot (1 - e^{-\alpha \cdot t})$ and the variance (scattering) decreases by $\sigma(t) = \sigma_0 \cdot e^{-\beta \cdot t}$. $F(\hat{c}, t)$ is plotted in Figure 9.3 (left) for $\hat{c} = 0.3$. The gray line corresponds to the probability that we did not yet exceed $\hat{c}$. Figure 9.3 (right) shows the corresponding graph for $t(\hat{c})$ (batch) compared to the inverted budget confidence $\mu^{-1}(\hat{c})$.

Figure 9.3: Left: $F(0.3, t)$ corresponds to the probability that we exceeded a confidence of 0.3 at time $t$. Right: The gray line corresponds to the expected time we need to reach a given confidence when processing multiple objects at the same time.

The inverse of Equation 9.1 tells us which confidence can be achieved using an anytime classifier in an average time. More precisely, to be able to benefit from the above, several items must be processed simultaneously while striving to distribute the available time among them so as to optimize the resulting accuracy. This means fewer improvement steps should be spent on items that have a high confidence early on in order to use the acquired extra time to improve less certain decisions.

So far we assumed a reliable confidence measure, that is, a linear dependency between confidence and accuracy of a classifier. If the confidence is not fully reliable, then the average time needed to reach a certain accuracy will increase. This corresponds to a higher necessary confidence, as indicated by the dashed gray line in Figure 9.3 (right). Instead of elaborating this part for the theoretical model, the actual approaches are described in the next sections and evaluated in Section 9.3 using real algorithms and real confidence measures. Moreover, the confidence distributions resulting from the experiments are analyzed in comparison with the theoretical model in that section.
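To make the model tangible, the following Java sketch evaluates $F(\hat{c}, t)$ and the expected time of Equation 9.1 numerically under the Gaussian assumption above. All parameter values ($c_0$, $\alpha$, $\beta$, $\sigma_0$, the number of steps) and all identifiers are illustrative choices for this sketch and are not part of the original formulation.

// Illustrative numeric evaluation of the model in Section 9.2.1.
// All parameter values (C0, ALPHA, BETA, SIGMA0, N) are arbitrary example choices.
public class AnytimeModelSketch {

    static final double C0 = 0.3, ALPHA = 3.0, BETA = 2.0, SIGMA0 = 0.15;
    static final int N = 20;                      // number of improvement steps

    // budget confidence mu(t) = c0 + (1 - c0) * (1 - e^(-alpha * t))
    static double mu(double t) { return C0 + (1 - C0) * (1 - Math.exp(-ALPHA * t)); }

    // scattering sigma(t) = sigma0 * e^(-beta * t)
    static double sigma(double t) { return SIGMA0 * Math.exp(-BETA * t); }

    // F(cHat, t) = p(c(t) >= cHat): numerically integrates the Gaussian density
    // zeta(t, x) from cHat upwards (simple Riemann sum).
    static double bigF(double cHat, double t) {
        double m = mu(t), s = sigma(t), sum = 0.0, dx = 1e-4;
        for (double x = cHat; x <= m + 8 * s; x += dx) {
            double z = (x - m) / s;
            sum += Math.exp(-0.5 * z * z) / (s * Math.sqrt(2 * Math.PI)) * dx;
        }
        return Math.min(sum, 1.0);
    }

    // Expected normalized time to reach cHat, following Equation 9.1.
    static double expectedTime(double cHat) {
        double t = bigF(cHat, 0.0) * (1.0 / N);
        for (int v = 1; v <= N; v++) {
            double h = bigF(cHat, (double) v / N) - bigF(cHat, (double) (v - 1) / N);
            t += h * ((double) v / N);
        }
        return t;
    }

    public static void main(String[] args) {
        System.out.printf("expected time to reach confidence 0.3: %.3f%n", expectedTime(0.3));
    }
}

Evaluating expectedTime over a range of confidence values yields the theoretical curve $t(\hat{c})$ that underlies Figure 9.3 (right) for this example parameterization.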

9.2.2 Batch Approach

As described in the last section, several items must be processed simultaneously and the available time must be distributed among them in order to improve the overall quality of the result.


Figure 9.4: Illustration of the batch approach. When the classify buffer reaches $t_d$ it is swapped (bottom).

The first approach therefore uses an object buffer and processes the whole buffer until a decision must be made at $t_d$. This approach is referred to as the batch approach; it is illustrated in Figure 9.4.

The number of items between points $t_f$ and $t_d$ is $r = \lfloor (t_d - t_f)/t_a \rfloor$; the capacity of the buffer is set to $\lfloor r/2 \rfloor$. In the running example $r = 12$, hence the buffer holds six items. At time $t_0$ (Figure 9.4, top) the buffer is filled ({6, 7, 8, 9, 10, 11}) and there is still $5 \cdot t_a$ time left to improve the classification on those six items.

At time $t_0 + 5 \cdot t_a$ (Figure 9.4, bottom) the decision for item 6 is required, hence the buffer is cleared and the next six stream data items are added to it. The decisions for items 6 to 11 do not change thereafter, and they are sorted into their corresponding bins when passing $t_d$. The implementation of the batch approach uses two buffers, a collectBuffer to collect the next $\lfloor r/2 \rfloor$ items and a classifyBuffer, which is used to distribute the processing time among the included items according to their confidence. Figure 9.5 provides a pseudocode description of the batch approach. When a new item $s_{new}$ arrives, it is initialized and stored in the collectBuffer (lines 5-7). Initialization is done using the actual anytime algorithm, hence a first result and a first confidence is available for $s_{new}$ when it is stored in the buffer. When $\lfloor r/2 \rfloor$ items have been collected, the buffers are swapped and the collectBuffer is cleared (lines 8-10).

01 // Input: confidence measure conf, constant data stream S
02 collectBuffer = Ø; classifyBuffer = Ø;
03 capacity = floor( (td - tf) / (2 * ta) );
04 while (true) {
05   if (new item snew arrived) {
06     init(snew);
07     collectBuffer.add(snew);
08     if (collectBuffer.size() == capacity) {
09       classifyBuffer = collectBuffer;
10       collectBuffer = Ø;
11     }
12   } else {
13     minConf = ∞;
14     itemToImprove = null;
15     for (Item s : classifyBuffer) {
16       if (conf(s) < minConf) {
17         minConf = conf(s);
18         itemToImprove = s;
19       }
20     }
21     improve(itemToImprove);
22   }
23 }

Figure 9.5: Pseudocode description of the batch approach.

Hence, the classifyBuffer then contains $\lfloor r/2 \rfloor$ initialized items. If no new item has arrived, one item out of the classifyBuffer is improved, i.e. processed further using the anytime algorithm. For this purpose the object with the lowest confidence is chosen according to the given confidence measure conf (lines 12-21). After improving the item, its assigned class label may have changed as well as the confidence of the decision.

For classification on constant data streams using the batch approach, the best strategy would be to always refine first the item in the buffer that requires the least number of improvements until it is classified correctly. Since this number is unknown, no confidence measure can be derived from it. In Section 9.3 details are provided about the actual confidence measures that are used and evaluated in the experiments. In the next section the second approach is described, which incorporates an additional time function for the priority decision.

9.2.3 FiFo Approach

In the batch approach the classifyBuffer's capacity was set to $\lfloor r/2 \rfloor$ to enable a meaningful swapping; the second approach uses a FiFo queue that holds all $r$ elements between $t_f$ and $t_d$. As a consequence each item in the queue has an individual amount of time remaining for further improvement steps. In the batch approach each item contained in the classifyBuffer had the same remaining time due to the swapping of the two buffers. The FiFo approach uses the FiFo queue to store the items and to keep track of the order of arrival. The decision which item in the queue is chosen to be improved is made separately by assigning priorities to the items as explained below.

The individual remaining time can be incorporated in determining the item to be improved next. Still, the confidence of the current result for the respective items is an important factor for the choice. If all items in the queue have the same confidence, then priority should be given to the item with the lowest remaining time. Even if the oldest item has a slightly higher confidence than the rest, we still prefer improving that item, because it is the next to be removed from the queue and we may therefore lose the opportunity of improving it once more. Hence, a weighting function is required that decreases with the remaining time:

$$\omega(t_r): \mathbb{R}^+ \rightarrow \mathbb{R}^+ : t_r \rightarrow [\omega_{min}, 1] \qquad (9.2)$$
with

$$t_1 < t_2 \Rightarrow \omega(t_1) \leq \omega(t_2).$$

In the above equation $t_r$ is the remaining time and $\omega_{min}$ is the minimal weight. Using a weight function as in Equation 9.2, the confidence of items with a small amount of remaining time can be artificially decreased by simply multiplying the confidence $conf(s)$ of each item $s$ with the weight $\omega(t_r(s))$ according to its remaining time:

$$conf(s) \cdot \omega(t_r(s))$$

01 // Input: confidence measure conf, constant data stream S,
02 //        time weighting function weight
03 fifoQueue = Ø;
04 capacity = floor( (td - tf) / ta );
05 while (true) {
06   if (new item snew arrived) {
07     init(snew);
08     fifoQueue.put(snew);
09     if (fifoQueue.size() > capacity)
10       fifoQueue.get(); // remove oldest item from FiFo
11   } else {
12     minConf = ∞;
13     itemToImprove = null;
14     for (Item s : fifoQueue) {
15       tempConf = conf(s) * weight(remainingTime(s));
16       if (tempConf < minConf) {
17         minConf = tempConf;
18         itemToImprove = s;
19       }
20     }
21     improve(itemToImprove);
22   }
23 }

Figure 9.6: Pseudocode description of the FiFo approach.

Two common weighting functions are evaluated in the experiments:

$$\text{linear:} \quad \omega(t_r) = \omega_{min} + (1 - \omega_{min}) \cdot \frac{t_r}{(t_d - t_f)} \qquad (9.3)$$
$$\text{exponential:} \quad \omega(t_r) = \omega_{min} + (1 - \omega_{min}) \cdot \frac{1 - e^{-\lambda \cdot t_r}}{1 - e^{-\lambda \cdot (t_d - t_f)}} \qquad (9.4)$$
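A minimal Java sketch of the two weighting functions above; the parameter names follow Equations 9.3 and 9.4, while the class and method names are illustrative.

// Time weighting functions from Equations 9.3 and 9.4; tr is the remaining
// time of an item, tdMinusTf the total interval td - tf. Names are illustrative.
public final class TimeWeights {

    // linear: omega(tr) = omegaMin + (1 - omegaMin) * tr / (td - tf)
    static double linear(double tr, double tdMinusTf, double omegaMin) {
        return omegaMin + (1 - omegaMin) * tr / tdMinusTf;
    }

    // exponential: omega(tr) = omegaMin + (1 - omegaMin)
    //              * (1 - e^(-lambda*tr)) / (1 - e^(-lambda*(td - tf)))
    static double exponential(double tr, double tdMinusTf, double omegaMin, double lambda) {
        return omegaMin + (1 - omegaMin)
                * (1 - Math.exp(-lambda * tr)) / (1 - Math.exp(-lambda * tdMinusTf));
    }

    // Time-weighted confidence as used in line 15 of Figure 9.6.
    static double weightedConfidence(double conf, double tr, double tdMinusTf, double omegaMin) {
        return conf * linear(tr, tdMinusTf, omegaMin);
    }
}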

Figure 9.6 shows a pseudocode description of the FiFo approach. Newly arrived items are initialized and added to the queue (lines 6-8). If the FiFo capacity is exceeded, the oldest item is removed from the queue (lines 9-10). If no new item arrived, an item to improve is chosen from the queue according to the minimal time-weighted confidence as described above (lines 14-20). As in the batch approach the chosen item is improved (line 21), possibly affecting its assigned class label and confidence.

9.3 Experiments

As introduced above, anytime classification is used as an example application. The employed algorithms and corresponding confidence measures are introduced in the next section. In Section 9.3.2 the effectiveness of the proposed approaches is demonstrated and their performance is compared. Section 9.3.3 provides an investigation of the confidence distributions in comparison with the theoretical model. A short discussion concludes this section.

9.3.1 Algorithms and Confidence Measures

The proposed meta approaches are evaluated using anytime versions of the nearest neighbor classifier, the Bayes classifier and support vector machines (cf. Chapter 3). Simple confidence measures are employed here; more sophisticated measures can be developed but are not the focus of this research. Recent approaches to classification confidence are discussed in Section 9.3.4. For the anytime nearest neighbor (knn for short), with $nn_i^s$ denoting the nearest neighbors of the current object $s$ and $d(s, nn_i^s)$ the Euclidean distance between $s$ and its $i$-th nearest neighbor, the following confidence measure assigns a higher confidence if the involved neighbors are closer to the object:

$$conf_{knn}(s) = e^{-\sum_{i=1}^{k} d(s, nn_i^s)}$$

To determine a confidence for the current decision of the anytime SVM, the difference between the highest output value $h_{j_1}(o)$ and the second highest output value $h_{j_2}(o)$ is computed to express the certainty. $conf_{svm}$ is calculated as
$$conf_{svm}(o) = 1 - e^{-(h_{j_1}(o) - h_{j_2}(o))}.$$

Similar to the SVM, the confidence of the current decision for the Bayes tree is calculated as the difference between the highest probability $P(C_{i_1}|o)$ and the second highest probability $P(C_{i_2}|o)$:

$$conf_{bt}(o) = P(C_{i_1}|o) - P(C_{i_2}|o)$$

These simple choices are justified empirically in the following.
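The following Java sketch illustrates the three confidence measures. The input arrays (neighbor distances, SVM output values, class probabilities) are assumed to be delivered by the respective anytime classifier after its latest improvement step; all names are illustrative.

// Sketch of the three confidence measures from Section 9.3.1.
public final class ConfidenceMeasures {

    // conf_knn(s) = exp( - sum_i d(s, nn_i^s) ): closer neighbors yield higher confidence.
    static double knnConfidence(double[] distancesToNeighbors) {
        double sum = 0;
        for (double d : distancesToNeighbors) sum += d;
        return Math.exp(-sum);
    }

    // conf_svm(o) = 1 - exp( -(h_j1(o) - h_j2(o)) ): difference between the two
    // highest SVM output values.
    static double svmConfidence(double[] svmOutputs) {
        double best = Double.NEGATIVE_INFINITY, second = Double.NEGATIVE_INFINITY;
        for (double h : svmOutputs) {
            if (h > best) { second = best; best = h; }
            else if (h > second) { second = h; }
        }
        return 1 - Math.exp(-(best - second));
    }

    // conf_bt(o) = P(C_i1 | o) - P(C_i2 | o): difference between the two highest
    // class probabilities delivered by the Bayes tree.
    static double bayesTreeConfidence(double[] classProbabilities) {
        double best = Double.NEGATIVE_INFINITY, second = Double.NEGATIVE_INFINITY;
        for (double p : classProbabilities) {
            if (p > best) { second = best; best = p; }
            else if (p > second) { second = p; }
        }
        return best - second;
    }
}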

Figure 9.7: Evaluating the batch approach on different window sizes (WS) using anytime 5nn on the letter (left) and pendigits (right) data set.

9.3.2 Evaluation

As in Section 9.2.1 the time is normalized to $[0, 1]$, where 0 represents the result just after initialization (i.e. zero improvements done) and 1 indicates that all information has been processed (i.e. no more improvements possible). All experiments use a 4-fold cross validation on benchmark data sets from [HB99] to evaluate the classification accuracy. The first experiment evaluates the batch approach, starting off with the anytime nearest neighbor with $k = 5$ (5nn). Figure 9.7 (left) shows the results on the letter data set (20000 objects, 16 dimensions, 26 classes) for various window sizes. In all experiments, the corresponding budget classifier is trained for the respective time allowance such that each item is allowed the same processing time dictated by the arrival rate. The improvement of the budget classifier's accuracy when more time is given (slower stream) shows its effectiveness with respect to the allotted time. For the same time allowance (one group of bars) the accuracy increases with growing window size, which proves the effectiveness of the proposed batch approach in this experiment. Similar results can be seen for 5nn on the pendigits data set (10992 objects, 16 dimensions, 10 classes) in Figure 9.7 (right). The relative accuracy increase due to the batch approach is largest for small window sizes, i.e. the improvement from Budget to WS 4 is higher than the improvement from WS 4 to WS 7 on letter (compare Budget to WS 5 and WS 5 to WS 9 on pendigits, respectively). On pendigits the accuracy for WS 9 is even slightly less than for WS 5 for $t = 0.15$ (i.e. when 15% of the model (training set) has been evaluated).

Figure 9.8: The relative accuracy gain w.r.t. the budget classifier for 5nn on vowel is up to 15%.

Figure 9.9: Performance of the batch approach using 1nn on the pendigits (left) and vowel (right) data sets.

Figure 9.8 shows the relative accuracy gain w.r.t. Budget for 5nn on the vowel data set (990 objects, 10 dimensions, 11 classes). Once more the accuracy increases with growing window sizes. Moreover, the accuracy gain with respect to the budget classifier is higher for faster streams (i.e. less time between objects): the accuracy increases by 15% using WS 10 at $t = 0.05$, while the increase is only 5% for the same window size at $t = 0.27$. Setting $k = 1$ improves the result for the anytime nearest neighbor on pendigits (cf. Figure 9.9, left). Still the batch approach shows its effectiveness, as the accuracy increases with the window size. Figure 9.9 (right) shows the performance of 1nn on vowel for a wide range of time intervals. The improvement from Budget to WS 7 is significant throughout, while the improvement from WS 7 to WS 10 is rather small. This shows that the batch approach is very effective already for small window sizes, which implies that there are fewer objects that need to be 'helped out' by receiving additional processing than confident objects 'willing' to spare processing time. Figure 9.10 shows the results for the SVM and Bayes tree classifiers on vowel.


Figure 9.10: Testing the batch approach using anytime SVM (left) and Bayes tree (right) on vowel.

For all stream speeds both classifiers improve their accuracy with a larger window size. This proves the effectiveness of the batch approach also for different anytime algorithms. For the rest of the evaluations the vowel data set is used unless mentioned otherwise.

To relate to the theoretic assumptions made in Section 9.2.1, the dependency between confidence and accuracy for the three anytime classifiers is analyzed (additional analysis is provided in Section 9.3.3). Figure 9.11 (left) shows the accuracy over confidence plots for SVM, BT and 5nn, respectively. Each point represents the accuracy and the corresponding average confidence for one experiment (one time interval tested using 4-fold cross validation as above). All three classifiers show a monotonic increase of accuracy over the respective confidence. Interestingly, for all tested classifiers and confidence measures the behavior is close to a linear dependency. Similar observations resulted from experiments on other data sets.

Completing the evaluation of the batch approach, the expected time to reach a given confidence is reported in Figure 9.11 (right). These plots directly relate to Figure 9.3 from Section 9.2.1 (cf. page 164). The results show that, using the batch approach, the expected time decreases for all three anytime classifiers and for any given confidence. As above it can be seen that the largest impact appears for small window sizes (e.g. Budget to WS 4 for BT in Figure 9.11, middle). Due to the relationship between accuracy and confidence discussed above, similar plots and results can be obtained for the expected time to reach a given accuracy.


Figure 9.11: Left: Relation between the average confidence and the observed accuracy with the batch approach. Right: The time needed to reach a given confidence with the batch approach. Results are shown for all three classifiers tested on the vowel data set.


Figure 9.12: Top: Evaluating different queue sizes for the FiFo approach using the anytime nearest neighbor with k = 5. Bottom: Evaluating different time weighting functions for queue size 13 using the Bayes tree on vowel.

Next the FiFo approach is evaluated, starting off again with the anytime nearest neighbor for $k = 5$. A linear time weighting function is used with $\omega_{min} = 0.5$ (cf. Section 9.2.3). Figure 9.12 (top) shows the results for four different queue sizes from faster streams ($t_a = 0.08$) to slower streams ($t_a = 0.35$). As in the batch approach the performance of the respective budget classifier is shown for comparison. The FiFo approach proves its effectiveness similar to the batch approach: with larger queue sizes the classification accuracy improves for all stream speeds. Similar results hold for the other classifiers; details are omitted. The influence of the time weighting function is analyzed instead.

Figure 9.12 (bottom) shows the performance of the Bayes tree using the FiFo approach with varying time weighting functions. As a reference the results for the budget classifier are given as well. The other results have been obtained using queue size 13. The employed time weighting functions are linear and the minimal value mtf (minimal time factor) varies from 0 to 1 (cf. $\omega_{min}$ in Section 9.2.3). mtf = 1 corresponds to a constant factor of 1, i.e. as if no time weighting were applied.

All FiFo variants outperform the budget classifier. The linear time weighting function with $\omega_{min}$ = mtf = 0 does not yield the same results as the budget classifier, because there is still more than one improvement possible between two items (except for $t = 0$). Surprisingly the results improve constantly with growing mtf for all stream speeds. More precisely, a constant time weighting function, where time has no impact on the weighting, yields the best results. The same observation resulted for the exponential time weighting function, where $\omega_{min} = 1$ also corresponds to a constant function (cf. Equation 9.4). This indicates that simply using the confidence value is sufficient for distributing the available processing time even if items have individual remaining times. Therefore only the confidence value is used in the following when comparing the two approaches.

Figure 9.13 shows a comparison between the FiFo approach and the batch approach for the anytime SVM and the anytime nearest neighbor, respectively. Both approaches outperform the budget classifier throughout. In the comparison the queue sizes are twice as big as the window sizes, since the batch approach can only process half of the objects between $t_f$ and $t_d$ due to window swapping (cf. Section 9.2.2). For both classifiers the FiFo approach outperforms the batch approach slightly throughout different window sizes and different stream speeds. This seems obvious since the FiFo approach has more "room" (objects) for optimization. Still, the batch approach is only slightly behind, showing that the approach is effective even for very small settings in terms of $t_d - t_f$.


Figure 9.13: Evaluating FiFo versus batch approach using the anytime SVM (left) and 5nn (right) classifier. Both outperform the budget classifier.

9.3.3 Confidence Distributions

One hypothesis in the theoretical model (cf. Section 9.2.1) is that the budget confidence (the average confidence over all classified objects) increases with increasing time allowance. The experimental results presented in the previous section confirmed this assumption for all three classifiers tested (cf. Figure 9.11). In the example of Section 9.2.1 we assumed that for a given time allowance the individual confidences follow a Gaussian distribution about their mean value. The following examines the confidence distributions obtained from the experimental evaluation.

Figure 9.14 shows the confidence distributions for the anytime nearest neighbor on three data sets (letter, pendigits and vowel from top to bottom), once in 3d-view (left) and the same plot seen from above (right). As can be seen on the very left edge of the 3d-plots, initially the confidences roughly follow a normal distribution centered at 0.5 for pendigits and vowel and at 0.7 for letter. While the mean increases over time, the scattering, i.e. the variance, decreases, even though only slightly. An explanation lies in the employed confidence measure: it increases with increasing proximity of the current nearest neighbors, which is naturally the case over time. These plots strongly resemble Figure 9.2 from the theoretical model, confirming the assumptions made.

For the Bayes tree the confidence plots are shown in Figure 9.15 (top and center) for the Forest Covertype (left) and Gender data set (right). The bottom row shows the confidences of the anytime nearest neighbor on these data sets. The y-axis (count) is cut off to enhance visibility of the described aspects. The nearest neighbor plots show a similar behavior as described above. The confidences are higher on these two data sets, since their large size yields denser data distributions and hence smaller distances. The confidence distributions for the Bayes tree do not resemble Gaussian distributions per time allowance. The initial confidences seem to be rather equally distributed at lower confidences, while a large fraction of the objects has a very high confidence early on. With increasing time fewer items exhibit a medium confidence, while the large fraction of high confidences remains and a second accumulation grows at very low confidences. Similar results were obtained on other data sets. The reason lies once again in the employed confidence measure. For the Bayes tree the difference between the highest and the second highest probability is calculated in the exponent of the confidence measure (cf. Section 9.3.1). While for most objects the probability for a single class soon excels those of the remaining classes, a fraction of the objects lies at a boundary between two classes, yielding similar probabilities the more the mixture models are refined over time. Nonetheless, the average confidence as well as the average accuracy of the Bayes tree increase over time.

The theoretical model is defined for arbitrary probability density functions (pdf) describing the confidence distribution. The results show that the actual pdfs for given confidence measures can be largely different. Nonetheless, the proposed meta approaches improved the accuracy in all tested scenarios.

Figure 9.14: Confidence distributions for the anytime nearest neighbor on letter (upper row), pendigits (middle row) and vowel (bottom row). Left: 3d-view, Right: bird's eye view.

Figure 9.15: Confidence distributions on Forest Covertype (left) and Gender (right). The upper two rows show the resulting confidences for the Bayes tree (3d and bird's eye view), the bottom row shows the corresponding confidences of the anytime nearest neighbor (bird's eye view).

9.3.4 Discussion

We have seen that both the batch and the FiFo approach improve the classification accuracy of three prominent classification approaches (Bayes, SVM, nearest neighbor) on constant data streams even for small window or queue sizes. The small size implies a very small overhead in terms of space and time complexity, as can directly be seen from the provided pseudocodes. Moreover, the results presented in Section 9.3.2 prove that the small time overhead does not negatively affect the output quality, but instead improves the quality through the better distribution of the available time.

While simple confidence measures were employed in the experiments, one may develop more sophisticated ones tailored to individual applications or classifiers. There are several approaches to classification confidence in the literature that either focus on specific classifiers such as neural networks [Wan90] or case based reasoning [Che00], or propose solutions for a specific application like spam filtering [DCDZ05] or natural language processing [DCP08]. Along with the proposed approaches for classification on constant data streams this is an interesting area for further research.

9.4 Conclusion

In this chapter two novel approaches have been proposed that harness the strengths of anytime algorithms for constant data streams. The goal was to improve the quality of the result w.r.t. traditional budget approaches, which are used in an abundance of stream mining applications. Using anytime classification as an example application, experimental results have shown for SVM, Bayes and nearest neighbor classifiers that both approaches improve the classification accuracy for slow and fast data streams. The results confirm the general theoretic model and show the effectiveness of the proposed approaches. The simple yet effective idea can be employed for any anytime algorithm along with a quality measure and motivates further research in classification confidence measures and anytime algorithms.

Chapter 10

Future Work

Many aspects of the algorithmic solution to anytime Bayesian classification on continuous attributes have been investigated and evaluated. The final version of the Bayes tree as it results from Chapter 7 shows very good results on various domains. An interesting future research objective is the combination of the Bayes tree with subspace methods. In a subspace Bayes tree the entries would store probability density functions that consider only a subset of the attributes. While a global dimensionality reduction can be achieved by preprocessing the data, for example using principal component analysis (PCA) or linear discriminant analysis (LDA), individual reductions per node can be achieved by applying subspace clustering in a top-down manner for tree construction. Another interesting option for a subspace Bayes tree results from the concept transfer of Bayesian networks as used in Chapter 7: using an empty initialization for both the network structure and the covariance matrix before the hill climbing search can yield individual subspaces per node.

Further research questions regarding the Bayes tree include the investigation of different kernels such as the Epanechnikov kernel, and further construction methods such as those in X-trees or Ball-trees. The properties of created trees can be analyzed using statistical methods such as $\chi^2$ tests. Characteristic values can be developed and evaluated that predict the expected performance based on the measured properties. Finally, to enable a consistent use on data sets with mixed attribute types, the Bayes tree can be combined with existing anytime classifiers for categorical attributes such as SPODEs [YWKT07] (cf. Chapter 3.2).

The benefits of the meta-approaches proposed in Chapter 9 can be investigated for other anytime algorithms and tasks. An example is provided in Chapter 14 for anytime outlier detection. The development of confidence measures for classifiers and other algorithms employed using the proposed meta-approaches is a promising research direction.

Finally, anytime classifiers can be further developed to serve as static classifiers. While this may sound absurd at first, it offers interesting opportunities, since it can yield faster classification. Similar to their application on constant streams, a prerequisite is a confidence measure for the current decision. The classification process could then be terminated based on the current confidence value or the variance (or slope) of the most recent confidences. Further options include a fixed number of improvement steps for the anytime algorithm, e.g. a fixed number of kernels or Gaussians for the Bayes tree. For the Bayes tree another possibility to terminate the refinement individually based on the current object is to determine lower and upper bounds for the densities per subtree and stop the descent if it cannot affect the decision anymore.

Part III

Anytime Stream Clustering


Chapter 11

Self-adaptive Anytime Stream Clustering

∗ Clustering data streams with varying inter-arrival times can profit from anytime algorithms. For clustering, this means that the algorithm is capable of processing even very fast streams, but also uses greater time allowances to refine the clustering model. In this chapter a parameter-free algorithm is proposed that automatically adapts to the speed of the data stream. It makes best use of the time available under the current constraints to provide a clustering of the objects seen up to that point. The proposed approach incorporates the age of the objects to reflect the greater importance of more recent data. For efficient and effective handling, the ClusTree is introduced as a compact and self-adaptive index structure for maintaining stream summaries. The ClusTree makes no a priori assumption on the size of the clustering model, but dynamically self-adapts. It is shown that the algorithm can be combined with existing techniques for aging objects in the stream using decay functions, reporting cluster snapshots at different points in time, and comparing views at different points in time (cf. Chapter 3).

∗This chapter has been published in the Proceedings of the 9th IEEE International Conference on Data Mining (ICDM 2009) [KABS09].


11.1 The ClusTree Algorithm

The proposed self-adaptive anytime stream clustering relies on an index structure for storing and maintaining a compact view of the current clustering. The size of the clustering model automatically adapts to the stream speed, which cannot be achieved by any buffering outside the storage structure. Moreover, to preserve a complete model no data object is dropped; any object from the stream is inserted into the index and possibly merged with aggregates of previously inserted objects. The following describes strategies for dealing with varying time constraints for anytime clustering, i.e. how the process of inserting an object is interruptible at any time.

11.1.1 Micro-clusters and anytime insert

The proposed approach is based on micro-clusters [ZRL96] as compact representations of the data distribution. By maintaining measures for incremental computation of mean and variance of micro-clusters, the infeasible access to all past stream objects is not necessary (cf. Chapter 3). Existing micro-cluster approaches lack support for varying stream inter-arrival times. It is, however, crucial to provide the means for anytime clustering and self-adaptation to stream speed. The proposed solution maintains cluster features (CFs) by extending index structures from the R-tree family [Gut84, BKSS90, SAK+09]. Such hierarchical indexing structures provide the means for efficiently locating the right place to insert any object from the stream into a micro-cluster. The idea is to build a hierarchy of micro-clusters at different levels of granularity. Given enough time, the algorithm descends the hierarchy in the index to reach the leaf entry that contains the micro-cluster that is most similar to the current object. If this micro-cluster is similar enough, it is updated incrementally by the object's values. Otherwise, a new micro-cluster may be formed.

The important observation for anytime clustering of streaming data is that there may not always be enough time to reach the leaf level to insert the object. Therefore novel strategies for anytime inserts are presented. There are several possibilities for handling object arrivals before the current object insert reaches leaf level. The straightforward solution keeps a global queue. This approach is very simple, but it may require an infinite buffer. Also, we may never have the time to empty the queue, resulting in outdated clustering results. To reduce memory consumption, one could maintain a global aggregate, i.e. a single cluster feature instead of the queue. However, aggregating arbitrary objects loses too much information, as they may be diverse.

To maintain the necessary information for clustering, and to ensure that any newly arriving object can be inserted at once, the proposed solution interrupts the insertion process. The object is temporarily stored in a local aggregate from which the insertion can continue at a later time. This yields foreseeable space demands like with a global aggregate, albeit slightly larger ones. For the invested space we obtain a greater accuracy. The great advantage of local aggregates over local queues is that we can easily use the time for regular inserts to take a buffered local aggregate along as a "hitchhiker". Moreover, they can be naturally integrated into the tree structure. We will discuss this in more detail shortly, after describing the structure of the ClusTree hierarchical index for maintaining the micro-cluster information.

11.1.2 The ClusTree data structure

The ClusTree approach consists of a hierarchy of entries that describe the cluster feature properties of their respective subtrees. The structure of an inner entry and a leaf entry is illustrated in the left part of Figure 11.1: Each entry contains a cluster feature of the number of objects n that were aggregated, their dimension-wise linear sum LS, and squared sum SS, as well as a pointer to the respective subtree. The ClusTree integrates local aggregates into the tree structure for temporary objects. More precisely, an inner entry provides an additional buffer b for temporary insertions of local aggregates (CFs). Leaf nodes’ entries do not contain a buffer, since inserts at leaf level are final.

Figure 11.1: Left: inner node and leaf node structure. Right: insertion object, hitchhiker and buffer.

Definition 11.1 ClusTree. A ClusTree with fanout parameters m, M and leaf node capacity parameters l, L is a balanced multi-dimensional indexing structure with the following properties:

• an inner node contains between m and M entries. Leaf nodes contain between l and L entries. The root has at least one entry.

• an entry in an inner node of a ClusTree stores:

– a cluster feature of the objects it summarizes.
– a cluster feature of the objects in the buffer. (May be empty.)
– a pointer to its child node.

• an entry in a leaf of a ClusTree stores a cluster feature of the object(s) it represents.

• a path from the root to any leaf node always has the same length (balanced).

The tree is created and updated like any multidimensional index such as the R-tree, R*-tree, etc. [Gut84, BKSS90, SAK+09]. Unlike the minimum bounding rectangles that they maintain in addition to the objects, the ClusTree stores only CFs. During insertion an object descends into the subtree with the closest mean with respect to the Euclidean distance. Splitting is based on pairwise distances between the entries, where the entries are combined into two groups such that the sum of the intra-group distances is minimal. We will see in Section 11.2 that M = 3 is a good choice, hence there are at most six pairwise distances per node, which yields a fast split operation.
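To make Definition 11.1 concrete, the following is a minimal Java sketch of the node and entry layout; the class and field names are illustrative and do not stem from the original implementation, which was written in C.

// Sketch of the ClusTree node and entry structure (cf. Definition 11.1).
// A cluster feature CF = (n, LS, SS) is maintained per entry; inner entries
// additionally keep a buffer CF for interrupted insertions ("hitchhikers").
class ClusterFeature {
    double n;          // (weighted) number of aggregated objects
    double[] ls, ss;   // dimension-wise linear and squared sums

    ClusterFeature(int dim) { ls = new double[dim]; ss = new double[dim]; }

    // add a single object x with the given weight
    void add(double[] x, double weight) {
        n += weight;
        for (int d = 0; d < x.length; d++) {
            ls[d] += weight * x[d];
            ss[d] += weight * x[d] * x[d];
        }
    }

    // mean of the summarized objects (used for descent decisions)
    double[] mean() {
        double[] m = new double[ls.length];
        for (int d = 0; d < ls.length; d++) m[d] = ls[d] / n;
        return m;
    }
}

class Entry {
    ClusterFeature cf;      // summary of the subtree (or of the leaf entry itself)
    ClusterFeature buffer;  // local aggregate for interrupted insertions (inner entries only)
    Node child;             // null for leaf entries
    long timestamp;         // time of the last update
}

class Node {
    Entry[] entries;        // between m and M entries (l and L at leaf level)
    boolean isLeaf;
}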

The important property that reflects the anytime capability of the ClusTree is the buffer in each entry. It serves as a temporary storage place for aggregates or objects that do not reach leaf level during insertion. Whenever insertion is interrupted, the current CF is simply stored in the buffer of the entry that corresponds to the subtree into which to descend next. At any future time when this subtree is next accessed, the temporary entry in the buffer is taken along as a "hitchhiker". This makes sure that a future descent down the same subtree is used for continuing the insertion process. Whenever the descent destinations of the current insertion CF and the hitchhiker differ, the latter is placed in the corresponding buffer again to wait for the next ride down the tree.

The right part of Figure 11.1 illustrates this process. Assume that the insertion object (drawn blue in the dashed box to the left of the root) belongs to the leaf that is marked by the dashed arrow (at the second leaf). Assume also that the leftmost entry on the second level has a filled buffer (second distribution symbol in the entry), which belongs to a different leaf than the insertion object (indicated by the red solid arrow at the first leaf). The insertion object first descends to the second level, and next descends into the left entry. It picks up the left entry's buffer in its buffer CF for hitchhikers (depicted as the solid box at the right of the insertion object). The insertion object descends to level three, taking the hitchhiker along. Because the hitchhiker and the insertion object belong to different subtrees, the hitchhiker is stored in the buffer of the left entry on the third level (to be taken along further down in the future) and the insertion object descends into the right entry alone to become (part of) a leaf entry.

The buffer concept and the algorithmic idea of taking hitchhikers along are key to the proposed anytime clustering algorithm. They allow the algorithm to be interrupted at any point in time and make best use of future descents down the tree. Moreover, unlike global aggregates, objects are kept separate as long as time permits. When a leaf node is reached and the insertion would cause a split, the algorithm checks whether there is still time left. If there is no time for a split, the closest two entries are merged. For tracking of concept drift, novelty, etc. in the output clustering, leaf node entries contain a unique id. When they are created they are assigned a unique number in increasing order. When entries are merged, this is recorded as a pair of ids in a merging list.

The ClusTree can be initialized to improve the starting structure of the tree. Given an initial set of objects, each is transformed to a new CF. The leaves and internal nodes may be ordered for best structural properties through recursive top-down partitioning along the dimension with the largest variance and such that each partition contains equally many objects. Alternatively, any clustering algorithm, e.g. expectation maximization (EM) [DLR77] or k-means, can be used to group the objects in a top-down or bottom-up fashion to initialize the tree. However, the focus is not on optimizing the initialization phase, and the experiments are performed without it.

11.1.3 Maintaining an up-to-date clustering

In order to maintain an up-to-date view, we would like new objects to be more important than older objects. A common solution is to weigh objects with an exponential time-dependent decay function $\omega(\Delta t) = \beta^{-\lambda \Delta t}$ (cf. Chapter 3). The decay rate $\lambda$ controls how much more one favors new objects compared to old ones. The higher $\lambda$ is, the faster the algorithm "forgets" old data. In the ClusTree $\beta = 2$ is used; for this basis the half-life of objects is $\frac{1}{\lambda}$. To incorporate decay, temporal information must be added to the ClusTree nodes. We ensure that the inner entries of the ClusTree still summarize their subtrees accurately by making the elements of a cluster feature vector dependent on the current time $t$ as in Definition 3.4:

$$n^{(t)} = \sum_{i=1}^{n} \omega(t - ts_i), \qquad LS^{(t)} = \sum_{i=1}^{n} \omega(t - ts_i) \cdot x_i, \qquad SS^{(t)} = \sum_{i=1}^{n} \omega(t - ts_i) \cdot x_i^2$$

$n$ denotes the (unweighted) number of contributing objects and $ts_i$ is the time stamp at which object $x_i$ was added to the CF. The additive properties of cluster features are preserved, as well as the temporal multiplicity [AHWY04]: If no object is added to a $CF^{(t)}$ during the time interval $[t, t + \Delta t]$ then

$$CF^{(t+\Delta t)} = \omega(\Delta t) \cdot CF^{(t)}.$$

Details and a proof of this property can be found in [AHWY04].

Each insertion object $x$ carries the time stamp $ts_x$ of its arrival time. Furthermore, each entry $e_s$ has a time stamp $e_s.ts$ specifying its last update. It is used to compute the time that passed between the last update of an entry and $x.ts$ as the input of the decay function. Upon descending into a node, we update all entries $e_s$ in the node to $x.ts$ by position-wise multiplication with the decay function and resetting the time stamp: $e_s.CF \leftarrow \omega(x.ts - e_s.ts) \cdot e_s.CF$, $e_s.buffer \leftarrow \omega(x.ts - e_s.ts) \cdot e_s.buffer$, $e_s.ts \leftarrow x.ts$. Entries in the same node always have the same time stamp, as we update all entries in the node we descend into.

The following shows that inner entries summarize their subtrees correctly. We derive an invariant that incorporates the time aspect. The cluster feature of a parent entry $e_s$ that was last updated at $t + \Delta t$ equals the sum of the CFs of the entries in its child node, updated from the time $t$ of their last update to the parent's time, plus the parent's buffer.
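A minimal sketch of this decay update, reusing the illustrative classes from the previous sketch; the decay rate is an arbitrary example value.

// Sketch of the time-decayed update applied to all entries of a node before
// descending into it (decay omega(dt) = 2^(-lambda*dt), cf. Section 11.1.3).
// Class and field names follow the earlier sketch, not the original C code.
class Decay {
    static final double LAMBDA = 0.01; // example decay rate

    static double omega(double dt) { return Math.pow(2, -LAMBDA * dt); }

    // e.CF <- omega(x.ts - e.ts) * e.CF, e.buffer <- omega(...) * e.buffer, e.ts <- x.ts
    static void updateToTime(Entry e, long now) {
        double w = omega(now - e.timestamp);
        scale(e.cf, w);
        if (e.buffer != null) scale(e.buffer, w);
        e.timestamp = now;
    }

    private static void scale(ClusterFeature cf, double w) {
        cf.n *= w;
        for (int d = 0; d < cf.ls.length; d++) { cf.ls[d] *= w; cf.ss[d] *= w; }
    }
}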

Lemma 11.1 (ClusTree Invariant) For each inner entry $e_s$ with time stamp $t + \Delta t$ and decay function $\omega(\Delta t) = 2^{-\lambda \Delta t}$ it holds

$$e_s.CF^{(t+\Delta t)} = \sum_{i=1}^{\nu_s} \left( \omega(\Delta t) \cdot e_{s \circ i}.CF^{(t)} \right) + e_s.buffer^{(t+\Delta t)}$$

Proof 11.1 Each inner entry $e_s$ is first created due to a split. Its summary is calculated directly as the sum of the cluster features in its child node entries $e_{s \circ i}$. The child node entries all carry the same time stamp, because we update all entries in a node. The time stamp of the children is the insertion time $x.ts$ of the object $x$ that caused the split. There can only be a change in one of the $e_{s \circ i}$ if there was first a change in $e_s$, because we always start from the root and descend downwards.

Take the case of updating parent entry $e_s$ (with filled buffer) to the new time $t + \Delta t + \Gamma t$, and the addition of object $y$, where $y$ descends into $node_s$ and gives the buffer a ride. Upon descending into $node_s$, all its entries $e_{s \circ i}$ are updated and $y$ is added to the CF of exactly one of the $e_{s \circ i}$. $e_s$ has a buffer, which $y$ takes along. The buffer is also added to exactly one of the child entries' cluster features.

Following the above reasoning, we know that after updating $e_s$ it holds that:

$$e_s.CF^{(t+\Delta t+\Gamma t)} = \sum_{i=1}^{\nu_s} \left( \omega(\Delta t + \Gamma t) \cdot e_{s \circ i}.CF^{(t)} \right) + e_s.buffer^{(t+\Delta t+\Gamma t)}.$$

Because $y$ descends into $node_s$, we update the child entries:

$$= \sum_{i=1}^{\nu_s} e_{s \circ i}.CF^{(t+\Delta t+\Gamma t)} + e_s.buffer^{(t+\Delta t+\Gamma t)}.$$

Now we give $e_s.buffer^{(t+\Delta t+\Gamma t)}$ a ride. Afterwards $e_s.buffer^{(t+\Delta t+\Gamma t)}$ contains zeros, and the values that it held before are added to the cluster feature of one of the child entries. Also adding $y$ on both sides of the equation, once to $e_s.CF$ and once to the CF of one of the child node entries, leaves the invariant unchanged. This is also true for "hitchhiking" objects $o$ temporarily stored in a buffer (replace $y.CF^{(t+\Delta t+\Gamma t)}$ with $y.CF^{(t+\Delta t+\Gamma t)} + o.CF^{(t+\Delta t+\Gamma t)}$), and if $node_s$ is a leaf node. The last case to be checked for violation of the invariant is a split. Let us consider the split of a leaf node $node_{s \circ i}$. Then two summaries in $node_s$ are computed from scratch. One overwrites the existing entry $e_{s \circ i}$ that pointed to the split node. The other one is the start of a new entry. The two new summaries naturally fulfill the invariant. The invariant also holds true for $e_s$, the entry pointing to $node_s$, because only the distribution of the summaries changed on the levels below $e_s$, not the total of the values.

Thus, maintaining a single additional field in each node with the time stamp of its last update, and weighing according to the above scheme, ensures that decay with time is correctly captured. Weighing does not require additional memory; the weighted CFs simply replace the non-weighted cluster features.

Weighing with time provides an interesting way of avoiding splits to save valuable time. If a node is about to be split, the algorithm checks whether the least significant entry can be discarded, because it no longer contributes significantly to the clustering. Assuming that a snapshot of the ClusTree is taken regularly after $t_{snap}$ time, the significance is tested by checking whether the entry $\hat{e}$ with the smallest $n_{\hat{e}}^{(t)}$ satisfies

$$n_{\hat{e}}^{(t)} < \beta^{-\lambda \cdot t_{snap}} \qquad (11.1)$$

If this is the case, $\hat{e}$ is discarded, making room for the entry to be inserted and avoiding a split. The summary statistics of $\hat{e}$ are subtracted from the corresponding path up to the root. Note that according to Equation 11.1 no entry is discarded if a new object has been added to it after the last snapshot has been taken. Moreover, Equation 11.1 guarantees that each entry is stored in at least one snapshot. We discuss in Section 11.1.5 how the ClusTree results can be used to detect clusters of arbitrary shape and time horizon. Moreover, the applicability of recent approaches to concept drift detection is shown.

11.1.4 Speed-up through aggregation

What happens to the index structure when it faces an exceptionally fast data stream? If insertion is interrupted at the top levels most of the time, the root and upper levels of the tree aggregate many objects in their buffers that have little chance of getting a ride down to a leaf. Worse yet, dissimilar objects which belong to different subtrees and leaves become inseparable in a buffer. The quality of the results is bound to deteriorate if we are constantly interrupted on higher levels.

The proposed solution is a speed-up through aggregation before insertion: if we do not insert each object individually, there is more time to descend deeper with an aggregate of objects. Naively, one could add up a certain number $m$ of incoming objects, insert the aggregate, sum up the next $m$ objects, and so on. This is essentially a global aggregate with the problem of merging arbitrary objects, even very dissimilar ones, into the same aggregate. Clearly we need to exercise some control over which objects should, literally, "go together". Ideally, we want to aggregate objects that would end up in the same leaf if we could descend the tree with them. Most probably, the arriving objects are not all similar to each other, but we expect subgroups with inter-similarity, representatives of the clusters we also find in the tree.

The proposed solution is to create several aggregates for dissimilar objects. This makes sure that the objects summarized in the same aggregate are similar. To this end, a maxradius is set for the maximum distance of objects in the aggregate. maxradius does not need to be set by the user; the value is determined from the leaf level as the average variance of the leaves. This way, the aggregates for fast streams do not deteriorate the quality of the tree disproportionally.

For very fast streams, we store interrupted objects in their closest aggregate with respect to the distance to the mean, if this distance is below maxradius. If maxradius is exceeded, we open up a new aggregate. We insert aggregates just as we insert single objects. Whenever the insertion thread is idle, it simply picks the next aggregate (ordered first by the number of objects in the aggregates and then by their age). The number of aggregates is limited by the stream speed, i.e. it cannot exceed the number of distance computations that can be done between two arriving items. In the case of a varying data stream the maximum number of aggregates must be set by the user, constituting the only parameter of the proposed approach. The aggregation is done by a different thread (a second processor core), i.e. the insertion of aggregates works in parallel and is not affected in terms of processing time. If no aggregate violates the max-radius constraint, the fullest aggregate is inserted. If several aggregates are equally full, the oldest of these is inserted. Figure 11.2 summarizes the complete ClusTree algorithm.
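The following sketch illustrates the aggregation buffer described above: interrupted objects join their closest aggregate if the distance to its mean stays below maxradius, otherwise a new aggregate is opened, and the fullest (then oldest) aggregate is inserted next. It reuses the illustrative ClusterFeature class from the earlier sketch; all names are hypothetical.

// Sketch of the speed-up through aggregation (Section 11.1.4).
class AggregationBuffer {
    private final java.util.List<ClusterFeature> aggregates = new java.util.ArrayList<>();
    private final double maxRadius; // derived from the average leaf-level variance

    AggregationBuffer(double maxRadius) { this.maxRadius = maxRadius; }

    // Add an object that could not be inserted directly (very fast stream).
    void offer(double[] x) {
        ClusterFeature closest = null;
        double best = Double.MAX_VALUE;
        for (ClusterFeature a : aggregates) {
            double dist = euclidean(x, a.mean());
            if (dist < best) { best = dist; closest = a; }
        }
        if (closest == null || best > maxRadius) { // open a new aggregate
            closest = new ClusterFeature(x.length);
            aggregates.add(closest);
        }
        closest.add(x, 1.0);
    }

    // Next aggregate to insert: the fullest one; ties broken by age (list order).
    ClusterFeature poll() {
        ClusterFeature fullest = null;
        for (ClusterFeature a : aggregates)
            if (fullest == null || a.n > fullest.n) fullest = a;
        if (fullest != null) aggregates.remove(fullest);
        return fullest;
    }

    private static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) s += (x[d] - y[d]) * (x[d] - y[d]);
        return Math.sqrt(s);
    }
}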

Figure 11.2: Flow chart of the ClusTree algorithm.
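As a complement to the flow chart, the following simplified Java sketch outlines the anytime insertion with hitchhikers. It reuses the illustrative classes from the earlier sketches, omits splitting, merging, leaf-level handling and the aggregation path, and is only a sketch under these assumptions, not the original implementation.

// Simplified sketch of the anytime insertion with hitchhikers (cf. Figure 11.2).
// timeLeft models the interruption caused by the arrival of the next stream object.
class ClusTreeInsertSketch {

    void insert(Node root, ClusterFeature object, long now,
                java.util.function.BooleanSupplier timeLeft) {
        Node node = root;
        ClusterFeature hitchhiker = null;
        while (!node.isLeaf) {
            for (Entry e : node.entries) Decay.updateToTime(e, now); // keep summaries current
            Entry target = closestEntry(node, object);

            if (hitchhiker != null) {                    // route the hitchhiker at this level
                Entry hhTarget = closestEntry(node, hitchhiker);
                addTo(hhTarget.cf, hitchhiker);          // account for it in the summary
                if (hhTarget != target) {                // different subtree: park it again
                    addTo(buffer(hhTarget), hitchhiker);
                    hitchhiker = null;
                }
            }
            addTo(target.cf, object);                    // account for the object in the summary

            if (!timeLeft.getAsBoolean()) {              // interrupted: buffer object (and hitchhiker)
                addTo(buffer(target), object);
                if (hitchhiker != null) addTo(buffer(target), hitchhiker);
                return;
            }
            if (target.buffer != null && target.buffer.n > 0) { // pick up buffered aggregate
                if (hitchhiker == null) hitchhiker = new ClusterFeature(object.ls.length);
                addTo(hitchhiker, target.buffer);
                target.buffer = null;
            }
            node = target.child;
        }
        // leaf level: insert object (and hitchhiker) into the closest leaf entries,
        // splitting or merging depending on the remaining time (omitted here)
    }

    private ClusterFeature buffer(Entry e) {
        if (e.buffer == null) e.buffer = new ClusterFeature(e.cf.ls.length);
        return e.buffer;
    }

    private Entry closestEntry(Node node, ClusterFeature cf) {
        double[] m = cf.mean();
        Entry best = null;
        double bestDist = Double.MAX_VALUE;
        for (Entry e : node.entries) {
            double[] em = e.cf.mean();
            double d = 0;
            for (int i = 0; i < m.length; i++) d += (m[i] - em[i]) * (m[i] - em[i]);
            if (d < bestDist) { bestDist = d; best = e; }
        }
        return best;
    }

    private void addTo(ClusterFeature target, ClusterFeature source) {
        target.n += source.n;
        for (int d = 0; d < target.ls.length; d++) {
            target.ls[d] += source.ls[d];
            target.ss[d] += source.ss[d];
        }
    }
}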

11.1.5 Cluster shapes and cluster transitions

The clustering resulting from the ClusTree is the set of CFs stored at leaf level, i.e. the finest representation maintainable w.r.t. the speed of the data stream. This can be seen as the online component, and it allows for using various offline clustering approaches. Taking the means of the CFs as representatives, we can apply a k-center clustering as in [OMM+02] or density-based clustering as proposed in [CEQZ06] to detect clusters of arbitrary shape. One main advantage of the proposed approach is that it can maintain a far larger number of micro-clusters compared to other approaches [AHWY03, AHWY04, OMM+02, CEQZ06], and hence the offline clustering, e.g. density-based, has a finer input granularity.

Regarding cluster transitions, e.g. concept drift or novelty, many approaches proposed in the literature can directly be applied to the output of the ClusTree. Using the unique ids assigned to every new leaf entry, we are able to track micro-clusters. Pyramidal time frames (cf. Chapter 3) allow the user to view clusterings of arbitrary time horizons. Furthermore, the ids also allow us to apply the transition detection and distinction techniques described in [SNTS06], including outlier, novelty and concept drift detection.
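A small sketch of how the leaf-level cluster features can be handed to such an offline step: it collects the means and (decayed) weights of all leaf entries, which can then serve as input for a k-center or density-based clustering. Buffered aggregates at inner entries are ignored here; names follow the earlier illustrative sketches.

// Collect leaf-level micro-cluster representatives as input for an offline clustering.
class OfflineInput {

    static void collectLeafCFs(Node node, java.util.List<double[]> means,
                               java.util.List<Double> weights) {
        if (node.isLeaf) {
            for (Entry e : node.entries) {
                means.add(e.cf.mean());   // micro-cluster representative
                weights.add(e.cf.n);      // (decayed) weight of the micro-cluster
            }
        } else {
            for (Entry e : node.entries) collectLeafCFs(e.child, means, weights);
        }
    }
}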

11.2 Analysis and experiments

We assess the performance of the ClusTree in the following. First we examine the time and space complexity of building and maintaining a ClusTree in Section 11.2.1. In Section 11.2.2 we evaluate the anytime clustering property of the ClusTree and show the benefits of the speed-up through aggregation. Finally, the adaptive clustering performance is demonstrated in Section 11.2.3 by comparing the obtained results against CluStream [AHWY03] and DenStream [CEQZ06]. The algorithms were implemented in C; all experiments were run on Windows machines with 3 GHz CPUs.

11.2.1 Time and space complexity

The goal for efficient and effective clustering is a high granularity at low processing costs. Therefore we investigate the effect of the fanout and of the number of distance computations required to insert an object from the stream on the granularity, i.e. the number of cluster features (CFs) at leaf level. Figure 11.3(a) shows the results for fanouts from 2 to 12. Depending on the speed of the stream, 12 to 72 distance computations are possible before interruption (multiples of 12 were chosen on the x-axis because 12 is the smallest common multiple of all tested fanouts). As can clearly be seen in all groups of bars, a fanout of three yields the highest granularity independent of the number of distance computations, i.e. independent of the stream speed.

Next we evaluate the space demands with respect to the dimensionality and granularity in Figure 11.3(b). It shows the results for a fanout of 3 (assuming 4 bytes per value). The space demands for the ClusTree are moderate even for high granularities and high dimensionality. For 128.000 CFs at leaf level and 20 dimensions the ClusTree only needs 68 MB of space, while the number of distance computations to reach the leaf level is only 33. Fanout 3, dimensionality 20 and one million CFs at the finest level consume roughly 500 MB, i.e. still main memory, and the number of distance computations is still less than 40. This is opposed to any stream clustering algorithm that maintains one million micro-clusters and checks a new item against each of these. CluStream [AHWY03], for example, stores q micro-clusters and must hence calculate q distances (plus possible delete O(q) and merge O(q²) checks). The ClusTree only needs O(log(q)) many distance calculations and only stores O(q) CFs (cf. Figure 11.3).

Figure 11.3(c) shows the space demand of the ClusTree for 128.000 CFs at leaf level, different dimensionality and different fanout values. While a higher fanout yields lower space demands, the number of distance computations that are necessary to reach the same granularity is significantly higher. Combining the results from Figure 11.3 we conclude that a fanout of 3 is the best choice in terms of time and space complexity.
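The numbers discussed above can be reproduced with a short back-of-the-envelope calculation; the assumption of 2d+1 values per cluster feature and the pure-CF memory estimate are simplifications (the measured 68 MB also includes the tree structure and buffer CFs), while the distance-computation count follows directly from the tree height.

#include <math.h>
#include <stdio.h>

int main(void) {
    int    fanout  = 3;
    long   leaf_cf = 128000;     /* CFs at the finest level */
    int    dim     = 20;
    /* height of a tree with 'fanout' children per node */
    double height  = ceil(log((double)leaf_cf) / log((double)fanout));
    /* one node per level is inspected, 'fanout' distances per node */
    double dists   = fanout * height;
    /* a CF stores n, LS[d], SS[d]  ->  2d+1 values of 4 bytes each */
    double cf_bytes = (2.0 * dim + 1.0) * 4.0;
    /* leaves plus inner entries, approximated by the geometric sum */
    double total_cf = leaf_cf * fanout / (fanout - 1.0);
    printf("height            : %.0f levels\n", height);
    printf("distance comps    : %.0f per insertion\n", dists);   /* 33 for this setting */
    printf("CF payload memory : %.1f MB (tree overhead excluded)\n",
           total_cf * cf_bytes / (1024.0 * 1024.0));
    return 0;
}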

[Figure 11.3 consists of three charts: (a) the number of CFs at leaf level over the number of distance computations (12 to 72) for fanouts 2, 3, 4, 6 and 12; (b) the space demand in bytes over the dimensionality (5 to 50) for fanout 3 and leaf granularities from 1.000 to 1.024.000 CFs, with the callout: dimensionality 20, 128.000 CFs at the finest level, 68 MB space, at most 33 distance computations; (c) the space demand for 128.000 CFs at leaf level over the dimensionality for fanouts 2 to 10, with the callout: at dimensionality 20, fanouts 2/3/10 need 90 MB/68 MB/50 MB and at most 34/33/60 distance computations.]

Figure 11.3: a: Granularity (number of CFs at leaf level) w.r.t. fanout and number of distance computations. b & c: Space consumption w.r.t. granularity, fanout and dimensionality.

Given the fanout of 3, the costs for a single split are low: 4 entries are present during a split, hence 6 pairwise distances are calculated. A new node and one new entry for the parent node are created, and the old node and the old entry pointing to it are updated. In the worst case, the number of splits is equal to the height of the tree. Moreover, once the tree size is adapted to the stream speed and decay invalidates old entries, the number of splits is low.

11.2.2 Anytime clustering and aggregation

To evaluate the clustering quality of the ClusTree we evaluate the average purity of the clusters on the different levels of the tree. To determine the purity, synthetic data is used as well as real world data that contains objects labeled with one of several classes. For a set K of CFs the purity is then calculated as the weighted average purity of all CFs in K:

$$\mathit{purity} = \sum_{k=1}^{|K|} \frac{n_k}{n} \cdot \frac{\max_c(n_{ck})}{n_k} = \frac{1}{n} \sum_{k=1}^{|K|} \max_c(n_{ck})$$

where $n_k$ is the number of objects in the CF $k$, $n_{ck}$ those belonging to class $c$ and $n = \sum_{k \in K} n_k$. The real world data set Forest Covertype is available from [HB99] and contains roughly 580.000 objects from 7 classes and 10 continuous attributes. To investigate the scalability of the ClusTree in terms of dimensionality and the number of clusters, the employed synthetic data sets contain 550.000 objects each (including 5% noise) and a varying number of attributes and classes. The clusters are generated as a hierarchy of Gaussians, where centers lie at a uniformly distributed angle and distance from their parents. To simulate a varying stream, the arrival intervals are generated according to a Poisson process, a stochastic model that is often used to model random arrivals [DHS01] (cf. Chapter 4). For the anytime experiments the generated streams contained an expected number of 90000 points per second, i.e. λ = 1/90000.

Figure 11.4(a) shows the results for Forest Covertype (bottom) and the synthetic data set containing four classes and four dimensions (top). The results shown are the purity values after the complete data set has been processed. The topmost bar (orange) represents the purity value at the root level and each following bar corresponds to the next deeper level. (For the synthetic data the root level bar is not visible, as the axis has been formatted to show the differences on the lower levels.) The resulting ClusTree had 10 levels for the synthetic data and 9 levels for the Forest Covertype data set. The most interesting purity value is that of the leaf level, representing the finest micro-clustering granularity. It is above 99% for the synthetic data and still 88% for Forest Covertype. The purity values on the higher levels of the tree give an indication of the clustering quality for higher stream speeds. Further results on varying stream speed are shown in Section 11.2.3. Except for the leaf level, the purity values on the synthetic data set are above 95% on all levels, showing that the noise objects have been separated very well. The purity decreases more significantly for the Forest Covertype data, but is still above 70% even three levels underneath the root.

Figures 11.4(b) and 11.4(c) show the results regarding scalability using the same anytime stream as before. The number of classes varied from 2 to 8 at four dimensions, and the number of dimensions from 2 to 8 using 4 classes (synthetic data). For 2 and 4 classes the quality is consistently high on all levels; just the root level purity drops at 4 classes, and further (below the shown area) for 8 classes. Increasing the number of classes to 8 shows a higher impact on the root level and also one level below the root. Although the purity decreases on the other levels as well, it is still above 95% on six levels, indicating a good separation of classes and even of noise objects. Comparing the results on different dimensionalities shows that the quality is similar for 4 to 8 dimensions, but lower in the 2-dimensional case. This is due to the fact that the overlapping of the classes is higher if the dimensionality decreases. However, once again the majority of the levels has a purity above 95%.

Finally, the expected number of points per second (stream speed in pps) was varied from 60000 to 150000. Figure 11.5 shows the resulting purity values for the leaf level and the middle level of the ClusTree for Forest Covertype. For the slowest stream the purity on the leaf level reaches 93%. While the purity is still very good (87%) at 120000 pps, it drops below 70% for even faster streams with 150000 pps. For the proposed speed-up through aggregation (cf. Section 11.1), the results for 150000 pps are shown in the left part of Figure 11.5. Thanks to the aggregation the purity on the leaf level is significantly improved.
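A small sketch of the purity measure defined at the beginning of this section; the per-CF class histogram n_ck is assumed to be tracked alongside each cluster feature, which is evaluation-only bookkeeping and not part of the CF itself.

/* Purity of a set K of CFs:
 *   P = (1/n) * sum_k max_c n_ck
 * where class_hist[k][c] = n_ck counts the objects of class c absorbed by CF k. */
double purity(const int *const *class_hist, int num_cf, int num_classes) {
    long total = 0, majority = 0;
    for (int k = 0; k < num_cf; k++) {
        int max_c = 0;
        for (int c = 0; c < num_classes; c++) {
            total += class_hist[k][c];
            if (class_hist[k][c] > max_c) max_c = class_hist[k][c];
        }
        majority += max_c;
    }
    return total > 0 ? (double)majority / (double)total : 1.0;
}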

[Figure 11.4 shows per-level purity bars: (a) for the synthetic data set and for Forest Covertype, (b) for 2, 4 and 8 classes (levels L0 to L8), and (c) for 2, 4 and 8 dimensions (levels L0 to L8).]

Figure 11.4: a: Clustering purity on synthetic data (top) and on Forest Covertype [HB99] (bottom). b: Scalability w.r.t. the number of clusters; c: Scalability w.r.t. the number of dimensions.

11.2.3 Adaptive clustering

To evaluate the adaptive clustering behavior of the ClusTree, constant data streams were simulated with different numbers of points per second using the Forest Covertype data set. The obtained performance is compared against CluStream [AHWY03] and DenStream [CEQZ06]. For all approaches the results correspond to the online component, i.e. we analyze the properties of the resulting micro clusters and do not employ an additional offline component afterwards.

First of all we investigate the number of micro clusters that can be maintained by the individual approaches for different stream speeds. The results are shown in Figure 11.6; exact numbers are listed in Figure 11.7. As indicated, the ClusTree can maintain roughly 430,000 micro clusters at 49,000 pps. With a stream speed of 140,000 pps the ClusTree can still maintain 435 micro clusters. The competing approaches, on the other hand, can only process less than 10,000 pps when maintaining 500 micro clusters. This drastic difference is due to the hierarchical structure of the ClusTree, which yields only a logarithmic number of distance computations. In other words, the number of micro clusters that can be maintained is exponential compared to CluStream or DenStream. This large number is beneficial, since the output of the online component is given to the offline component to compute the final clustering (using a clustering method of choice). A more detailed input to the final clustering enables more accurate results and detection of possible outliers. Moreover, the major advantage here is that the ClusTree automatically self-adapts to the stream speed without parametrization.

[Figure 11.5 compares middle-level and leaf-level purity for stream speeds of 60000, 90000, 120000 and 150000 pps, and for 150000 pps with the aggregation speed-up.]

Figure 11.5: Purity with and without aggregation w.r.t. stream speed for Forest Covertype.


Figure 11.6: Number of micro clusters that can be maintained w.r.t. stream speed.

The question is at what price this benefit comes. Does the quality of the individual micro clusters deteriorate, because new points may not be added to the optimal micro cluster? To answer this question we evaluate the radius and the purity of the resulting micro clusters from all three approaches. Figure 11.8 shows the results for the ClusTree; results for CluStream and DenStream are listed in Figure 11.7.

            # MC     pps      radius (median)   radius (max.)   purity
DenStream   5000     2000     0.21              151.8           0.53
            2000     3700     0.24              195.1           0.55
            1000     5000     3.35              160.9           0.66
             500     7600     14.01              83.7           0.53
CluStream   5000     1500     0.02               37.4           0.70
            2000     1700     0.03              224.4           0.87
            1000     2500     0.33              238.8           0.90
             500     6500     0.58              177.6           0.62
ClusTree    5000     80,000   0.44               13.8           0.72
            2000     94,000   0.51               18.9           0.71
            1000    105,000   0.55               21.8           0.70
             500    120,000   2.25               29.7           0.67

Figure 11.7: Overall results on Forest Covertype.

[Figure 11.8 plots, for the ClusTree and stream speeds up to 150000 pps, the maximum, the 75th percentile and the median of the micro-cluster radii (top) and the purity (bottom).]

Figure 11.8: Radius (top) and purity (bottom) for ClusTree micro clusters w.r.t. stream speed.

Figure 11.8 shows for the radius the maximum as well as the 75th percentile and the median. Since the actual numbers are skewed, a moving average is plotted. Naturally, with increasing stream speed, and hence decreasing number of micro clusters, the radii generally become larger. However, while we see a constant increase in the maximum value, the median and even the 75th percentile stay very low even for 100000 to 150000 pps. While DenStream produces larger micro clusters, CluStream shows a similar performance for the same number of micro clusters. However, to maintain this number of micro clusters CluStream can again only process slow streams, where it is outperformed by the ClusTree. The purity values for CluStream, DenStream and the ClusTree approach underline the above findings (cf. Figure 11.7). DenStream does not exceed an average purity of 70%. CluStream shows a higher purity than the ClusTree for 1000 micro clusters (90% for CluStream vs. 78% for the ClusTree), but again these numbers are not comparable due to the huge difference in terms of points per second. In conclusion, the ClusTree can maintain an equal number of micro clusters on streams that are faster by orders of magnitude, and it can maintain an exponentially larger number of micro clusters at equal stream speed, while providing good results in terms of cluster size (radius) and quality (purity).

11.3 Conclusion

Clustering streaming data is of increasing importance in many applications. In this chapter, a parameter free index-based approach was proposed that self-adapts to varying stream speed and is capable of anytime clustering. The ClusTree maintains the values necessary for computing mean and variance of micro-clusters. By incorporating local aggregates, i.e. temporary buffers for “hitchhikers”, the ClusTree provides a novel solution for easy interruption of the insertion process that can simply be resumed at any later point in time. For very fast streams, aggregates of similar objects allow insertion of groups instead of single objects for even faster processing. In comparison to recent approaches it was shown that the ClusTree can maintain the same number of micro clusters at a stream speed that is faster by orders of magnitude, and that for an equal stream speed the obtained granularity is exponentially finer than that of competing approaches. Moreover, a discussion was provided on the compatibility of the proposed approach with finding clusters of arbitrary shape and modeling cluster transitions and data evolution using recent approaches.

Chapter 12

Exploiting additional time in the ClusTree

∗ In this chapter novel descent strategies are proposed that improve the ClusTree's clustering result on slower streams as long as time permits.

12.1 Alternative descent strategies

Alternative strategies are suggested for inserting objects into micro-clusters. Assuming that the insertion process of an object is not interrupted, it continues down a single path. This path corresponds to always picking the child with the smallest distance between the object and the children reachable from the current node. This descent strategy down the tree can be considered a single-try depth first approach; it proceeds down a path that has been chosen and does not reconsider. More precisely, it does not explore continuing a path that branches further up the tree. The advantage of this approach is that the (unknown) time available to the anytime insertion process is spent on trying to reach a level as far down the tree as possible (cf. Figure 12.1). The further down the tree the object is inserted, the more fine-grained the resolution of the micro-clusters becomes. When the insertion

∗This chapter has been published (together with the contents of the previous chapter) in the Knowledge and Information Systems Journal (KAIS 2011) [KABS11].

process reaches the leaf level and additional time is available, the leaf is split and hence the model size is automatically adapted to the stream speed. For very fast streams, frequent insertions on higher levels are prevented by the aggregation mechanisms (cf. Section 11.1.4). For slower data streams, however, we are very likely to often reach the leaf level and hence the model, i.e., the tree, continues to grow. As stated in the introduction, stream clustering algorithms naturally must cope with limited memory, i.e., there is a maximal model size, either dictated by the available memory or given by the user or the context program. Continuous growth of the model is thus limited by the maximal tree size and hence not ideal if more time is available. Employing the proposed depth first strategy in this case would leave the algorithm idle once the leaf level is reached. And such idle times actually occur. One of the reasons is that the anytime processing is fast, so it reaches the leaf level after few computation steps. In the ClusTree algorithm the number of distance computations is only logarithmic in the number of maintained micro clusters, so time for further model improvement is often available even for larger model sizes. As we will see in Section 12.2, maintaining 400000 micro clusters at a stream speed of 50000 points per second already yields idle times, which can be used for further computations and improvements of the clustering result. In the following, two tasks are considered: 1) finding alternative ways of choosing paths down the tree; 2) defining heuristics to exploit time that is available after reaching the leaf level.

12.1.1 Priority Breadth First Traversal

The first alternative descent strategy is a priority breadth first descent. The single path depth first descent does not perform any kind of backtracking and hence cannot correct any misguided choice that may occur due to overlapping entries on higher levels of the tree. In other words, even if we choose the closest entry on level k, we cannot be sure that the closest entry on level k + 1 is one of its children. To ensure finding the closest entry on each level, all entries per level are evaluated in the first heuristic. While doing this, the entries are sorted by the distance of their corresponding parent to quickly find the closest option. Figure 12.1 b) illustrates the priority breadth first descent: instead of checking a single node on each level as in depth first descent, each level is evaluated entirely to find the closest entry by going through a list of entries sorted by the distance between the parent node and the current insertion object.

While at first the breadth first traversal sounds like many additional computations compared to linearly checking against all maintained micro clusters as in e.g. [AHWY03, CEQZ06], a closer look reveals advantages of this strategy. In the case of a binary tree the number of non-leaf entries is about equal to the number of leaf entries. However, for a higher fanout the number of inner entries is relatively smaller. Taking into account that the entries are sorted according to the distance of their corresponding parent entry, we are likely to find the closest leaf level entry, i.e., the closest maintained micro cluster, within the first entries of the priority queue on the leaf level. Moreover, we are still able to perform anytime clustering, since the buffering strategies described in Section 11.1.2 are always used. The only change is that the entries on the final path are updated at the time of interruption and not as we go down the path. That means that we add the number of objects, the linear sum and the quadratic sum when the object is inserted. Since maintaining 50000 micro clusters with a fanout of 3 yields a tree height of 10, the number of operations (additions) is negligible.

[Figure 12.1 sketches the three strategies: in b) the entries on the current level are sorted by the distance of their parent to the object; in c) a priority queue holds the unrefined entries seen so far, sorted by their distance to the object.]

Figure 12.1: Descent strategies depth first, priority breadth first and best first.

12.1.2 Best First Traversal

As mentioned before, the underlying idea for the alternative descent strategies is based on the observation that by descending the tree depth first, we basically use a greedy approach that is not able to revise the decision for any given subtree. However, it is possible that the aggregate information on upper levels in the tree is misleading, in the sense that we may find at lower levels that an entry that had a short distance to the current object actually separates into micro-clusters that have a comparatively high distance to the object. In such a situation, it may be beneficial for the structure of the obtained micro-clusters not to continue on this path, but instead evaluate the situation at the children of the next-best choice. This means that we need to keep track of which options existed on the current path. This allows us to decide whether one of the branches further up on the current path has a distance to the object that is smaller than what we see for the current entry. If this is the case, we can go back and follow a different path. In query processing on index structures, this approach corresponds to a best first strategy [HS95, SAK+09] (cf. Chapter 4). To implement it, we need to maintain a priority queue while making the descent down the tree. The priority queue contains the entries seen so far that have not been refined yet and their corresponding distance to the insertion object. Given the time to make the next step in descending down the tree, the best first approach always takes the first element from the queue, i.e., the entry which has the smallest distance to the object. The distances from the object to the entries in the corresponding child node are computed and inserted into the priority queue (refinement). This process continues until insertion is interrupted or until all nodes have been visited. As in the priority breadth first descent, we keep the property of anytime clustering and update the path with cluster features (CFs) when we buffer or insert the object on interruption. The best first strategy implies that the decision which node to refine is now based on all the information that the algorithm has at the time of the decision making. The next descent step is always to the entry that has the smallest distance to the object, regardless of whether it yields the deepest path or not. In this sense, best first descent is a global strategy that takes all nodes into account, whereas depth first descent is local in the sense that a choice is only made among the children of the currently visited node. Figure 12.1 c) illustrates this strategy. As we can see, this algorithm maintains a priority queue of the lowest entries on all paths started so far, i.e., unrefined entries sorted by their distance to the object. The path is always continued at the first entry in the queue.
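A compact sketch of the best first descent under several simplifying assumptions: a fixed fanout, a flat array used as the priority queue with linear-scan extraction (a heap would be used in practice), and a refinement budget standing in for the anytime interrupt; on a real interrupt the object would be buffered at the head of the queue rather than returning NULL, and the CFs on the chosen path would be updated.

#include <math.h>
#include <stddef.h>

#define DIM 4
#define FANOUT 3
#define QCAP 256

typedef struct Node Node;
typedef struct { double mean[DIM]; Node *child; } Entry;   /* child == NULL at leaf level */
struct Node { int k; Entry e[FANOUT]; };

static double dist(const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < DIM; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return sqrt(s);
}

/* Best first descent: always refine the unrefined entry closest to x. */
Entry *best_first(Node *root, const double x[DIM], int budget) {
    struct { Entry *e; double d; } q[QCAP];
    int qn = 0;
    for (int i = 0; i < root->k; i++) { q[qn].e = &root->e[i]; q[qn].d = dist(root->e[i].mean, x); qn++; }
    Entry *best_leaf = NULL; double best_leaf_d = 1e300;
    while (qn > 0 && budget-- > 0) {
        int m = 0;                                   /* pop the entry with minimal distance */
        for (int i = 1; i < qn; i++) if (q[i].d < q[m].d) m = i;
        Entry *cur = q[m].e; double cur_d = q[m].d;
        q[m] = q[--qn];
        if (cur->child == NULL) {                    /* leaf entry: remember the closest one */
            if (cur_d < best_leaf_d) { best_leaf = cur; best_leaf_d = cur_d; }
            continue;
        }
        for (int i = 0; i < cur->child->k && qn < QCAP; i++) {   /* refine: push its children */
            q[qn].e = &cur->child->e[i];
            q[qn].d = dist(cur->child->e[i].mean, x);
            qn++;
        }
    }
    return best_leaf;   /* NULL if interrupted before any leaf entry was reached */
}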

12.1.3 Iterative Depth First Descent

In terms of anytime clustering, best first descent tries to optimize the selection of insertion nodes. A possible drawback of this strategy is that, depending on how often the algorithm must go back and continue from upper nodes and on how soon it is interrupted, the algorithm may remain at the upper levels and buffer the object there. Similar drawbacks can be expected from the priority breadth first strategy. In contrast, depth first processing is most often able to reach leaf level. Based on this analysis, an alternative descent strategy is suggested that tries to reach leaf level and, if more time is available, uses this time to validate the decisions that were taken. This can be considered a compromise between the depth first, priority breadth first and best first strategies. The approach is denoted as the iterative depth first descent strategy. The idea is to start with the original depth first descent. Upon reaching leaf level we iteratively evaluate the alternatives for decisions taken at the nodes on the depth first path as long as time permits. This differs from iterative deepening as known in artificial intelligence, where a depth limit is iteratively grown to avoid ending up in infinite paths; in our case all paths are finite. Figure 12.2 illustrates the strategy. The algorithm starts by descending down the tree as in the depth first approach (top left). Assuming that it is not interrupted, it then goes back to the root level and descends into the siblings of the entry chosen during the first iteration down the tree. Following these alternatives down to the leaf level, we eventually obtain two more candidate leaves (the best fanout is 3, cf. Chapter 11) among which we can choose to insert


Figure 12.2: Iterative depth first descent. When the algorithm is interrupted the best leaf seen so far is chosen for insertion.

(top center of Figure 12.2). Among these three options, we now pick the best one, as shown at the top right of Figure 12.2. If there is still time, we repeat this process on the path leading to the current leaf (bottom left of Figure 12.2). We descend down the paths corresponding to the siblings of the node one level below the root on the currently chosen path (bottom center of Figure 12.2). This yields once again three options to choose from (bottom right of Figure 12.2). This process is continued until the algorithm is interrupted or until no more unchecked siblings on the path remain. On interrupt we buffer/insert and update as in the other strategies described above. All in all, using this strategy we will have at most log²(n) comparisons, e.g. for 50000 micro-clusters and a fanout of 3 about 100 comparisons, which is in stark contrast to 50000 comparisons as in e.g. [AHWY03, CEQZ06]. These alternative descent strategies complete the concept of anytime clustering in the sense that we can now use possible idle time to improve the insertion process. Whereas the depth first descent strategy stops once the maximal model size is obtained and a leaf is reached, priority breadth first, best first and iterative depth first descent make use of additional time to check alternative insertion options. In this manner, the anytime clustering accounts for very short time spans per object through aggregation and for very long time spans through further improvement of inserts.

12.2 Evaluation of descent strategies

Due to the logarithmic number of distance computations that are necessary to maintain a certain number of micro-clusters in the ClusTree, in a given time frame we can maintain an exponential number of micro-clusters compared to linear approaches. This fact is inherent to the hierarchical ClusTree approach and was confirmed by the results in the previous chapter. If the model size, i.e., the tree size, is limited either through limited memory or user constraints, the ClusTree algorithm will be idle on slower streams once the maximal model size has been reached. This can also be seen in some of the previous experimental results. For example, in Figure 11.6 (page 203) at the top left: the algorithm maintains approximately the same number of micro-clusters for a large range of stream speeds. It reaches the maximal number of micro-clusters at approximately 50000 pps. This means that for streams below this speed, the algorithm is idle once it has reached leaf level.

In this section the workings and benefits of the proposed alternative descent strategies are evaluated. Depth first, priority breadth first, best first and iterative depth first are tested on the Forest Covertype data set using different tree heights and varying stream speeds. The tested tree heights 7, 9 and 11 correspond to roughly 2000, 20000 and 170000 possible micro-clusters at leaf level. The stream speed was varied from 600 pps to 60000 pps; on the x-axis the time per object (in µs/object) is reported, such that there is more time per insertion from left to right. As in the previous experiments, both the average purity values and the median of the resulting radii are measured. Figure 12.3 summarizes the results.

[Figure 12.3 consists of six charts: purity (left column) and median radius (right column) over the time per object (from 17 µs/object, i.e. ≈60,000 pps, up to ≈600 pps) for the four descent strategies, at tree heights 7 (a, b), 9 (c, d) and 11 (e, f).]

Figure 12.3: Left: purity values for different tree heights and varying stream speeds. Right: the corresponding radii (median).

Throughout the results in Figure 12.3, the primary descent strategy depth first and the novel iterative depth first descent show the same results for the fastest stream speed setting (17 µs/object, corresponding to roughly 60000 pps). This is due to the fact that both strategies start with the same initial solution. While the depth first approach stops once the leaf level is reached, the iterative depth first uses additional time and may find a better micro-cluster in which to insert the current object. For all tested tree heights the iterative depth first slightly improves in both measures for slower streams, i.e., higher purity values and smaller radii are achieved. Since the number of distance computations is in O(log²(n)), the iterative depth first strategy cannot profit from even more time per object. This means that for even slower streams the corresponding graphs show a stagnating behavior early on. The best first and the priority breadth first strategies show nearly equal performance. Since neither of these strategies favors an initial descent to reach the leaf level, they process all (or many) entries on the upper levels of the tree before continuing on the next level. As a consequence, these approaches do not reach the maximal tree size on faster streams. As can be seen in the left part of Figure 12.3, the full tree height is only reached at 600 µs/object for tree height 7 and 1200 µs/object for tree height 9. With this time allowance, the strategies evaluate all possible options and hence no further improvement is reached on even slower streams for the respective maximal model size. However, for all tested tree heights, both the best first and the priority breadth first strategy outperform the two depth first approaches in terms of the average purity. Regarding the achieved radii, the depth first approaches are outperformed on the smaller tree, and their performance is met on the larger trees for slower streams (cf. Figure 12.3 right). The results are summarized in the next section.

12.3 Conclusion

Summarizing the results for the four proposed descent strategies, we see that the simple depth first descent already yields good results, especially for larger tree/model sizes. The best first approach and the priority breadth first approach improved the results consistently on slower streams, which can be attributed to their strategy of testing all possible options (if time permits). The iterative depth first descent strategy constitutes an excellent alternative insertion strategy, since it starts with the same high performance as the depth first strategy, has very low runtime (O(log²(n))) and yet improves the initial solutions in all tested settings. Moreover, it finally reaches results of comparably high quality to all other approaches.

Chapter 13

Robust Anytime Stream Clustering

∗ In this chapter the structure and working of the LiarTree are described. In the previously described ClusTree algorithm (cf. Chapter 11) the following important issues are not addressed:

• Overlapping: the insertion does not try to improve possible overlapping of inner entries (clusters).

• Noise: no noise detection is employed, since every point is treated equally and eventually inserted at leaf level. As a consequence, no distinction between noise and newly emerging clusters is performed.

The name LiarTree is due to the fact that the algorithm adds fake nodes to the hierarchy, which are not a direct result of aggregated streaming objects.

13.1 The LiarTree

The following describes how the issues listed above are tackled and how the drawbacks of the ClusTree are removed. Section 13.1.6 briefly summarizes the LiarTree algorithm and inspects its time complexity.

∗This chapter has been published in the Proceedings of the 23rd International Conference on Scientific and Statistical Database Management (SSDBM 2011), [KRSS11].


13.1.1 Structure and overview

The LiarTree summarizes the clusters on lower levels in the inner entries of the hierarchy to guide the insertion of newly arriving objects. As a structural difference to the ClusTree, every inner node of the LiarTree contains one additional entry which is called the noise buffer.

Definition 13.1 LiarTree. For $m \le k \le M$ a LiarTree node has the structure $node = \{e_1, \ldots, e_k, CF^{(t)}_{nb}\}$, where $e_i = \{ptr, CF^{(t)}, CF^{(t)}_b\}$, $i = 1 \ldots k$, are entries as in the ClusTree and $CF^{(t)}_{nb}$ is a time weighted cluster feature that buffers noise points. The amount of available memory yields a maximal height (size) of the LiarTree.
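A minimal sketch of this node layout in C; the field names, the fixed dimensionality and the flag handling are illustrative choices, not the thesis' data structures.

#define DIM 4
#define M_MAX 3          /* fanout: at most M entries per node */

typedef struct {                 /* time-weighted cluster feature CF^(t)   */
    double n, ls[DIM], ss[DIM];
    double t;                    /* timestamp of the last update (decay)   */
} TCF;

typedef struct LiarNode LiarNode;

typedef struct {
    LiarNode *child;             /* subtree pointer                        */
    TCF cf;                      /* CF^(t) of the subtree                  */
    TCF buffer;                  /* CF_b^(t): hitchhiker / anytime buffer  */
} LiarEntry;

struct LiarNode {
    int       k;                 /* number of used entries, m <= k <= M    */
    LiarEntry e[M_MAX];
    TCF       noise_buffer;      /* CF_nb^(t): the additional noise buffer */
    int       is_liar;           /* liar node / liar root flags            */
};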

The noise buffer consists of a single CF which does not have a subtree underneath itself. Its usage is described in Section 13.1.3. Algorithm 13.1 illustrates the flow of the LiarTree algorithm for an object x that arrives on the stream. The variables store the current node, the hitchhiker (h) and a boolean flag indicating whether we encourage a split in the current subtree (details below). After the initialization (lines 1 to 2) the procedure enters a loop that determines the insertion of x as follows: first the exponential decay is applied to the current node in line 4. If nothing special happens, i.e. if none of the if-statements is true, the closest entry for x is determined (line 8) and the object descends into the corresponding subtree (line 24). As in the ClusTree, the buffer of the current entry is taken along as a hitchhiker (line 23), and a hitchhiker is buffered if it has a different closest entry (lines 9 to 12). Being an anytime algorithm, the insertion stops if no more time is available, buffering x and h in the current entry's buffer (line 21). The issues listed above are solved in the four procedures calcClosestEntry (line 8), liarProc (line 6), noiseProc (line 14) and leafProc (line 17). Details on the three methods to handle noise, novelty (liarProc) and drift (leafProc) are provided in Subsections 13.1.3 to 13.1.5. The following describes how to descend and reduce overlapping of clusters using the procedure calcClosestEntry.

Algorithm 13.1: Process object (x)

 1  currentNode = root; encSplit = false;
 2  h = empty;                      // h is the hitchhiker
 3  while (true) do                 /* terminates at leaf level latest */
 4      update time stamp for currentNode;
 5      if (currentNode is a liar) then
 6          liarProc(currentNode, x); break;
 7      end if
 8      ex = calcClosestEntry(currentNode, x, encSplit);
 9      eh = calcClosestEntry(currentNode, h, encSplit);
10      if (ex ≠ eh) then
11          put hitchhiker into corresponding buffer;
12      end if
13      if (x is marked as noise) then
14          noiseProc(currentNode, x, encSplit); break;
15      end if
16      if (currentNode is a leaf node) then
17          leafProc(currentNode, x, h, encSplit); break;
18      end if
19      add object and hitchhiker to ex;
20      if (time is up) then
21          put x and h into ex's buffer; break;
22      end if
23      add ex's buffer to h;
24      currentNode = ex.child;
25  end while

13.1.2 Descent and overlap reduction

The main task in inserting an object is to determine the next subtree to descend into. This is done by finding the closest entry; Algorithm 13.2 illustrates the single steps. Besides determining the closest entry, the algorithm checks whether the object is classified as noise w.r.t. the current node and sets an encSplit flag if a split is encouraged in the corresponding subtree. The three blocks in the code correspond to these three tasks.

Algorithm 13.2: calcClosestEntry(node, x, encSplit)
    // returns closest entry and marks x as noise

 1  if (node has an irrelevant entry e_irr) then
 2      if (node is a leaf) then
 3          return (e_irr, false, false);
 4      end if
 5      encSplit = true;
 6  end if
 7  calculate noise probability np(x);
 8  if (np(x) ≥ noiseThreshold) then
 9      mark x as noise;
10  end if
11  e_closest = closest entry;
12  if (!(node is a leaf)) then
13      e_1 = e_closest; e_2 = 2nd closest entry;
14      if (e_1 and e_2 overlap) then
15          look ahead: e_i* = closest entry in e_i's child;
16          reorganize: swap e_i* if radii decrease,
17          update the parent cluster features of e_i;
18          e_closest = e_1, if it contains the closest child entry; e_2 otherwise;
19      end if
20  end if
21  return e_closest;

In the first block (lines 1 to 6) we check whether the current node contains an irrelevant entry. This is done as in Chapter 11, i.e. an entry e is irrelevant if it is empty (unused) or if its weight $n_e^{(t)}$ does not exceed one point per snapshot. In contrast to the ClusTree, where such entries are only used to avoid split propagation, we explicitly check for irrelevant entries already during descent to actively encourage a split on lower levels. The reason is that a split below a node that contains an irrelevant entry does not cause an increase of the tree height, but yields a better usage of the available memory by avoiding unused entries. In case of a leaf node we return the irrelevant entry as the one for insertion (line 3); for an inner node we set the encSplit flag (line 5).

In the second block (lines 7 to 10) we calculate the noise probability for the insertion object and mark it as noise if the probability exceeds a given threshold. The noise threshold noiseThreshold constitutes a parameter of the algorithm and is evaluated in Section 13.2.

Definition 13.2 Noise probability. For a node node and an object o, the noise probability of o w.r.t. node is

$$np(o) = \min_{e_i \in node} \left\{ \frac{d(o, \mu_i)}{r_i},\ 1 \right\}$$

where $e_i$ are the entries of $node$, $r_i$ the corresponding radius (standard deviation in case of cluster features) and $d(o, \mu_i)$ the Euclidean distance from the object to the cluster mean $\mu_i$.
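A direct transcription of Definition 13.2, assuming the per-entry means and radii (standard deviations) have already been derived from the cluster features of the node.

#include <math.h>

/* Noise probability of object x w.r.t. a node (Definition 13.2):
 * np(o) = min_i { d(o, mu_i) / r_i , 1 }.
 * means[i] is the mean of entry i, radius[i] its standard deviation. */
double noise_probability(const double *x, const double *const *means,
                         const double *radius, int num_entries, int dim) {
    double np = 1.0;
    for (int i = 0; i < num_entries; i++) {
        double d2 = 0.0;
        for (int d = 0; d < dim; d++)
            d2 += (x[d] - means[i][d]) * (x[d] - means[i][d]);
        double ratio = radius[i] > 0 ? sqrt(d2) / radius[i] : 1.0;
        if (ratio < np) np = ratio;      /* the min is capped at 1 by the initialization */
    }
    return np;
}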

The last block (lines 11 to 20) finally determines the entry for further insertion. If the current node is a leaf node we return the entry that has the smallest distance to the insertion object. For an inner node we perform a local look ahead to avoid overlapping, i.e. we take the second closest entry e2 into account and check whether it overlaps with the closest entry e1 (line 14). Figure 13.1 illustrates an example.


Figure 13.1: Look ahead and reorganization.

If an overlap occurs, we perform a local look ahead and find the closest entries e1* and e2* in the child nodes of the candidates e1 and e2 (line 15, dashed circles in Figure 13.1 left). Next we calculate the radii that e1 and e2 would have if we swapped e1* and e2*. If they decrease, we perform the swap and update the cluster features one level above (Figure 13.1 right). The closest entry that is returned is the one containing the closest child entry, i.e. e1 in the example.

Algorithm 13.3: noiseProc(node, x, encSplit)
    // determines whether a noise buffer has become a cluster

 1  add x to node's noise buffer;
 2  if (encSplit == true) then
 3      n_avg = average weight of node's entries;
 4      ρ_avg = average density of node's entries;
 5      ρ_nb = density of node's noise buffer;
 6      if (gompertz(n_nb^(t), n_avg) · ρ_nb ≥ ρ_avg) then
 7          create a new entry e_new from noise buffer;
 8          create a new empty liar root under e_new;
 9          insert e_new into node;
10      end if
11  end if

The closest entry is calculated both for the insertion object and for the hitchhiker (if any). If the two have different closest entries, the hitchhiker is stored in the buffer CF of its closest entry and the insertion object continues alone (cf. Algorithm 13.1, line 11).

13.1.3 Noise

As one output of Algorithm 13.2 we know whether the current object has been marked as noise with respect to the current node. If so, the noise procedure is called, which is listed in Algorithm 13.3. In this procedure it is regularly checked whether the aggregated noise within the buffer is no longer noise but a novel concept. Therefore, the identified object is first added to the noise buffer of the current node. To check whether a noise buffer has become a cluster, we calculate for the current node the average of its entries' weights $n^{(t)}$, their average density and the density of the noise buffer (lines 3 to 5).

Definition 13.3 Density. The density $\rho_e = n_e^{(t)} / V_e$ of an entry $e$ is calculated as the ratio between its weighted number of points $n_e^{(t)}$ and the volume $V_e$ that it encloses. The volume for $d$ dimensions and a radius $r$ is calculated using the formula for $d$-spheres, i.e. $V_e = C_d \cdot r^d$ with $C_d = \pi^{d/2} / \Gamma(\frac{d}{2} + 1)$, where $\Gamma$ is the gamma function.
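The d-sphere constant C_d from Definition 13.3 can be evaluated with the standard gamma function; a minimal sketch:

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Volume of a d-dimensional hypersphere of radius r:
 *   V = C_d * r^d   with   C_d = pi^(d/2) / Gamma(d/2 + 1). */
double sphere_volume(int d, double r) {
    double c_d = pow(M_PI, d / 2.0) / tgamma(d / 2.0 + 1.0);
    return c_d * pow(r, d);
}

/* Density of an entry (Definition 13.3): weighted point count over enclosed volume. */
double entry_density(double weight, int d, double radius) {
    double v = sphere_volume(d, radius);
    return v > 0 ? weight / v : 0.0;
}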

Having a representative weight and density for both the entries and the noise buffer, we can compare them to decide whether a new cluster has emerged. A cluster that forms on the current level should be comparable to the existing ones in both aspects. Yet, a significantly higher density should also allow the formation of a new cluster, while a larger number of points that are not densely clustered are still considered noise. To realize both criteria, the density of the noise buffer is multiplied by a sigmoid function that takes the weights into account, before it is compared to the average density of the node's entries (cf. line 6). As the sigmoid function the Gompertz function [BGH+97] is used:

$$gompertz(n_{nb}, n_{avg}) = e^{-b \cdot e^{-c \cdot n_{nb}}}$$

The parameters $b$ (offset) and $c$ (slope) are set such that the result is close to zero ($t_0 = 10^{-4}$) if $n_{nb}$ is 2 and close to one ($t_1 = 0.97$) if $n_{nb} = n_{avg}$, by

$$b = \left( \frac{\ln(t_0)}{\ln(t_1)^{\frac{2}{n_{avg}-2}}} \right)^{\frac{1}{1.0 - (2.0/n_{avg})}} \qquad\qquad c = -\frac{1}{n_{avg}} \cdot \ln\!\left( -\frac{\ln(t_1)}{b} \right)$$

Definition 13.4 Noise-to-cluster event. For a node $node = (e_1, \ldots, e_k, CF^{(t)}_{nb})$ with average weight $n_{avg} = \frac{1}{k} \sum n^{(t)}_{e_i}$ and average density $\rho_{avg} = \frac{1}{k} \sum \rho_{e_i}$, the noise buffer $CF^{(t)}_{nb}$ becomes a new entry if

$$gompertz(n^{(t)}_{nb}, n_{avg}) \cdot \rho_{nb} \ge \rho_{avg}$$
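A small sketch of this check. Because the closed-form expressions for b and c are hard to read in this extraction, the sketch derives them directly from the two anchor conditions stated above (gompertz equals t0 at a buffer weight of 2 and t1 at n_avg), which requires n_avg > 2; the check itself follows Definition 13.4.

#include <math.h>

/* Gompertz growth function used to weight the noise-buffer density. */
static double gompertz(double n_nb, double b, double c) {
    return exp(-b * exp(-c * n_nb));
}

/* Noise-to-cluster check (Definition 13.4). b and c are chosen such that
 * gompertz(2) = t0 and gompertz(n_avg) = t1, solving the two conditions
 * directly; assumes n_avg > 2. */
int noise_buffer_becomes_cluster(double n_nb, double rho_nb,
                                 double n_avg, double rho_avg) {
    const double t0 = 1e-4, t1 = 0.97;
    double c = log(log(t0) / log(t1)) / (n_avg - 2.0);
    double b = -log(t1) * exp(c * n_avg);
    return gompertz(n_nb, b, c) * rho_nb >= rho_avg;
}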

The check whether the noise buffer has become a cluster is performed whenever the encourage split flag is set to true. A single inner node on the previous path with an irrelevant entry, i.e. an old or empty one, suffices for the encourage split flag to be true. Moreover, the exponential decay regularly yields outdated clusters. Hence, a noise buffer is likely to be checked. If the noise buffer has been classified as a new cluster, we create a new entry from it and insert this entry into the current node. Next we create a new empty node, which is flagged as liar, and direct the pointer of the new entry to this node (cf. lines 7 to 9 in Algorithm 13.3). Figure 13.2 a-b) illustrates this noise-to-cluster event.


Figure 13.2: The liar concept: a noise buffer can become a new cluster and the subtree below it grows top down, step by step by one node per object.

13.1.4 Novelty

So far new nodes were only created at the leaf level, such that the tree grew bottom up and was always balanced. By allowing noise buffers to transform into new clusters, new entries and, more importantly, new nodes are created within the tree. To avoid an increasingly unbalanced tree through noise-to-cluster events, we treat nodes and subtrees that represent novelty differently. The main idea is to let the subtrees underneath newly emerged clusters (entries) grow top down, step by step with each new object that is inserted into the subtree, until their leaves are on the same height as the regular tree leaves. Leaf nodes that belong to such a subtree are called liar nodes, the root is called liar root. When we end up in a liar node during descent (cf. Algorithm 13.1), we call the liar procedure, which is listed in Algorithm 13.4.

Definition 13.5 Liar node. A liar node is a node that contains no entry. A liar root is an inner node of the liar tree that has only liar nodes as leaves in its corresponding subtree and no other liar root as ancestor.

Figure 13.2 illustrates the liar concept; the image is referred to in the description of the single steps. A liar node is always empty, since it has been created as an empty node underneath the entry e_parent that is pointing to it. Initially the liar root is created by a noise-to-cluster event (cf. Fig. 13.2 b)).

To let the subtree under e_parent grow in a top down manner, additional new entries e_i must be created (cf. the red entries in Figure 13.2). Their cluster features CF_{e_i} must fit the CF summary of e_parent, i.e. their weights, linear sums and quadratic sums must sum up to the same values.

Algorithm 13.4: liarProc (liarNode, x). // refines the model to reflect novel concepts

 1  create three new entries with dim dimensions e_new[];
 2  for (d = 1 to dim) do
 3      e_new[d mod 3].LS[d]       = (e_parent.LS[d])/3 + offset_A[d];
 4      e_new[(d + 1) mod 3].LS[d] = (e_parent.LS[d])/3 + offset_B[d];
 5      e_new[(d + 2) mod 3].LS[d] = (e_parent.LS[d])/3 + offset_C[d];
 6      e_new[d mod 3].SS[d]       = F[d] + (3/e_parent.N) · (e_new[d mod 3].LS[d])²;
 7      e_new[(d + 1) mod 3].SS[d] = F[d] + (3/e_parent.N) · (e_new[(d + 1) mod 3].LS[d])²;
 8      e_new[(d + 2) mod 3].SS[d] = F[d] + (3/e_parent.N) · (e_new[(d + 2) mod 3].LS[d])²;
 9  end for
10  insert x into the closest of the new entries;
11  if (liarNode is a liar root) then
12      insert new entries into liarNode;
13  else
14      remove e_parent in parent node;
15      insert new entries into parent node;
16      split parent node (stop split at liar root);
17  end if
18  if (non-empty liar nodes reach leaf level) then
19      remove all liar flags in corresponding subtree;
20  else
21      create three new empty liar nodes under e_new[];
22  end if

18 if (non-empty liar nodes reach leaf level) then 19 remove all liar flags in corresponding subtree ; 20 else 21 create three new empty liar nodes under enew[] ; 22 end if

We create three new entries and assign each a third of the weight of e_parent. We displace the new means from the parent's mean by adding three different offsets to its mean (a third of its linear sum, cf. lines 3 to 5). The offsets are calculated per dimension under the constraint that the new entries have positive variances.

We set one offset to zero, i.e. $offset_A = 0$. For this special case, the remaining two offsets can be determined using the weight $n_e^t$ and variance $\sigma_e^2[i]$ of $e_{parent}$ per dimension as follows

$$offset_B[i] = \sqrt{\frac{1}{6} \cdot \left(1 - \frac{1}{3}\right)^4 \cdot n_e^t \cdot \sigma_e^2[i]} \qquad\qquad offset_C[i] = -offset_B[i]$$

The zero offset in the first dimension is assigned to the first new entry, in the second dimension to the second entry, and so forth using modulo counting (cf. lines 3 to 8). If we did not do so, the resulting clusters would lie on a line and would not represent the parent cluster well. The squared sums of the three new entries are calculated in lines 6 to 8. The term $F[d]$ can be calculated per dimension as

$$F[d] = \frac{n_e^t}{3} \cdot \frac{\sigma_e[d]^4}{3}$$

Having three new entries that fit the CF summary of e_parent, we insert the object into the closest of these and add the new entries to the corresponding subtree (lines 11 to 17). If the current node is a liar root, we simply insert the entries (cf. Figure 13.2 c)). Otherwise we replace the old parent entry with the three new entries (cf. Figure 13.2 d)). We do so because e_parent is itself an artificially created entry. Since we have new data, i.e. new evidence, that belongs to this entry, we take this opportunity to model this part of the data space in more detail and remove the former, coarser representation. After that, overfull nodes are split (cf. Figure 13.2 d-e)). If an overflow occurs in the liar root, we split it and create a new liar root above, containing two entries that summarize the two nodes resulting from the split (cf. Figure 13.2 e)). The new liar root is then put in the place of the old liar root, whereby the height of the subtree increases by 1, i.e. it grows top down (cf. Figure 13.2 e)).

In the last block we check whether the non-empty leaves of the liar subtree already reach the leaf level. In that case we remove all liar flags in the subtree, such that it becomes a regular part of the tree (cf. line 19 and Figure 13.2 f)). If the subtree does not yet have full height, we create three new empty liar nodes (line 21), one beneath each newly created entry (cf. Figure 13.2 c)).

Algorithm 13.5: leafProc(leafNode, x, h, encSplit)
    // inserts object x and hitchhiker h (if any) into leaf node

 1  if (time is up) then
 2      insert x and h as entries, possibly merging closest pairs on overflow;
 3  else
 4      if (node is full and encSplit == false) then
 5          merge hitchhiker to closest entry;
 6      else
 7          insert hitchhiker as entry;
 8      end if
 9      insert x as entry;
10      if (node is overfull) then
11          split node and propagate split;
12      end if
13  end if

13.1.5 Insertion and Drift

Once the insertion object reaches a regular leaf, it is inserted using the leaf procedure (cf. Algorithm 13.5). If there is no time left, the object and its hitchhiker are inserted such that no overflow, and hence no split, occurs (line 2). Otherwise, the hitchhiker is inserted first and, if a split is encouraged, the insertion of the hitchhiker can also yield an overflowing node. This is in contrast to the ClusTree, where a hitchhiker is merged to the closest entry to delay splits. In the LiarTree splits are explicitly encouraged to make better use of the available memory (cf. Definition 13.1). After inserting the object we check whether an overflow occurred, split the node and propagate the split (lines 9 to 12). Three properties of the LiarTree help to effectively track drifting clusters. The first property is the aging, which is realized through the exponential decay of leaf and inner entries as in the ClusTree. The second property is the fine granularity of the model. Since new objects can be placed in smaller and better fitting recent clusters, older clusters are less likely to be affected by updates, which gradually decreases their weight until they eventually disappear. The third property stems from the novel liar concept, which separates points that at first resemble noise and allows for a transition to new clusters later on. These transitions are more frequent on levels close to the leaves, where cluster movements are captured by this process.

13.1.6 Logarithmic time complexity

The LiarTree algorithm is summarized in the following, and a proof of its worst case time complexity is sketched. Summary: To insert a new object, the closest entry in the current node is calculated. While doing this, a local look ahead is performed to possibly improve the clustering quality by reducing overlap through local reorganization. If an object is classified as noise, it is added to the current node's noise buffer. Noise buffers can become new clusters (entries) if they are comparable to the existing clusters on their level. Subtrees below newly emerged clusters grow top down through the liar concept until their leaves reach the regular leaf level. If the insertion is interrupted, the current object is buffered as in the ClusTree to allow for anytime clustering.

Lemma 13.1 LiarTree time complexity. The clustering model M of a LiarTree consists of the micro clusters stored in its leaf nodes. A LiarTree has by definition a maximal height (cf. Def. 13.1) and hence its model has a maximal size |M| =: m. The time complexity for inserting an object o into a LiarTree of model size m is O(log m).

A sketch of the proof of Lemma 13.1 is provided using Algorithm 13.1.

Proof 13.1 (Sketch.) Let h be the height of the LiarTree; then h is logarithmic in m. The initialization takes constant time. The same holds for adding objects to cluster features (lines 11, 19, 21 and 23) and for the noise procedure noiseProc (line 14). The two methods liarProc (line 6) and leafProc (line 17) basically also have constant complexity except for the split, which can be called at most h times. Hence, these two methods are in O(log m). Since all three of the above methods are called at most once per insertion object and afterwards the loop is left with a break statement (same lines), we are still in O(log m). It remains to prove the complexity of lines 8 and 9 and the termination of the while loop. Since the look ahead is local (one level only), the calcClosestEntry procedure (lines 8 and 9) has constant time complexity. The loop is executed once per level (after each descent), i.e. it only depends on h and is therefore also in O(log m). Hence, the total time complexity of the LiarTree algorithm is logarithmic in the size of the clustering model, i.e. the number of maintained micro clusters at the leaf level.

13.2 Experiments

To evaluate the performance of the LiarTree, different stream scenarios are simulated and the radii of the resulting clusters as well as the recall, precision and F1 measure are evaluated. Synthetic data is generated (details below) such that the ground truth is known for comparison. Precision and recall are calculated using a Monte Carlo approach: for the recall, points are generated inside the ground truth and it is checked whether these are included in the found clustering. For the precision the process is reversed, i.e. points are generated inside the found clustering and it is checked whether they are inside the ground truth. In other words, the recall corresponds to the fraction of the ground truth area that is found by the algorithm, while the precision corresponds to the proportion of the found area that is correct, i.e. without the unnecessary parts.

The synthetic data stream is generated using a radial basis function approach with additional noise. For a given number of clusters k and a given radius r, points are generated uniformly at random within k hyperspheres of radius r. Noise is added uniformly at random in the unit cube. Novelty is simulated by adding new clusters; drift is generated by moving the cluster means along individual vectors with a given drift speed. The drift speed sets the distance that a cluster moves every 1000 points (in total). If a cluster is about to drift out of the unit cube, its corresponding movement vector is reflected such that it stays inside. If not mentioned otherwise, the parameters are set to k = 5, r = 0.05 and drift speed = 0.02 at 20% noise in the four-dimensional unit cube. Single parameters are varied when mentioned, and the average values of the measures are reported per algorithm.
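A sketch of the Monte Carlo evaluation described above, assuming both the ground truth and the found clustering are given as hyperspheres; drawing the source sphere uniformly at random (rather than area-weighted) is a simplification that is exact when all ground-truth spheres share the same radius, as in the setup above.

#include <math.h>
#include <stdlib.h>

#define DIM 4

typedef struct { double center[DIM]; double radius; } Sphere;

static int covered(const double x[DIM], const Sphere *s, int n) {
    for (int i = 0; i < n; i++) {
        double d2 = 0.0;
        for (int d = 0; d < DIM; d++)
            d2 += (x[d] - s[i].center[d]) * (x[d] - s[i].center[d]);
        if (d2 <= s[i].radius * s[i].radius) return 1;
    }
    return 0;
}

/* Draw a uniform point inside sphere s (rejection sampling in its bounding box). */
static void sample_in_sphere(const Sphere *s, double x[DIM]) {
    do {
        for (int d = 0; d < DIM; d++)
            x[d] = s->center[d] + s->radius * (2.0 * rand() / RAND_MAX - 1.0);
    } while (!covered(x, s, 1));
}

/* Monte Carlo recall: fraction of the ground-truth area covered by the found
 * clustering. Precision is obtained by swapping the roles of truth and found. */
double mc_recall(const Sphere *truth, int nt, const Sphere *found, int nf, int samples) {
    int hits = 0;
    double x[DIM];
    for (int i = 0; i < samples; i++) {
        sample_in_sphere(&truth[rand() % nt], x);
        hits += covered(x, found, nf);
    }
    return (double)hits / samples;
}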

[Figure 13.3 shows, for the LiarTree, the space demand over the dimensionality for fanout 3 and different leaf granularities (left; callout: dimensionality 20, 128.000 CFs at the finest level, 75 MB space, at most 51 distance computations) and the number of CFs at leaf level over the number of distance computations for fanouts 2 to 12 (right).]

Figure 13.3: Influence of fanout and granularity on the LiarTree.

The LiarTree is compared to the ClusTree algorithm on varying data streams using a Poisson process (cf. Def. 4.8, p. 94) as in previous chapters. On constant data streams, the LiarTree is compared to DenStream [CEQZ06] and to the CluStream approach proposed in [AHWY03]. Figure 13.3 shows an evaluation of the influence of the fanout on the granularity and the number of distance computations needed to reach the leaf level. Since the LiarTree extends the ClusTree, the results regarding time and space complexity are similar and can partly be transferred from the detailed analysis presented in Chapter 11. Due to the additional noise buffer the LiarTree needs one more distance computation per node, and the additional functionality such as the liar concept is more expensive than the simple buffering in the ClusTree. However, as shown in Section 13.1.6, the additional methods are called at most once per object and therefore the total descent is still logarithmic. As Figure 13.3 shows, a fanout of 3 yields the highest granularity at leaf level for the LiarTree. This is in accordance with the results from

#MC     DenStream (pps)    CluStream (pps)    ClusTree (pps)    LiarTree (pps)
5000    2000               1500               80000             72000
2000    3700               1700               94000             84000
1000    5000               2500               105000            93000
500     7600               6500               120000            105000

Figure 13.4: The maximal points per second that can be processed by the approaches for different model sizes.


Figure 13.5: Robustness of the LiarTree to noise and the noise threshold parameter.

Chapter 11, where it yielded the best trade-off between space demands and computation time. Hence, the fanout of the LiarTree is set to 3, i.e. three entries (plus noise buffer) per inner node of the tree.

Figure 13.4 shows for different model sizes (number of micro clusters #MC) the maximal number of points per second (pps) that can be processed by the individual approaches. For CluStream and DenStream the model size was fixed and the maximal pps were counted. For ClusTree and LiarTree the stream speed was fixed and the resulting maintainable model size was measured. The results are again in line with Chapter 11 and can be explained by the logarithmic time complexity of the tree approaches, i.e. at the same speed they can maintain a model that is larger by orders of magnitude.

To evaluate the noise threshold parameter of the LiarTree (cf. Section 13.1.2), the left part of Figure 13.5 shows the resulting F1 measure for 0% noise and 50% noise over the whole range of the noise threshold; the right part of the figure shows the corresponding values for all noise levels from 0% to 50% and noise thresholds from 0.5 to 1.0. The most important observation from this experiment is that the LiarTree shows good performance over a rather wide range, i.e. for a noise threshold from 0.2 up to 0.7 or 0.8. Towards both ends of the scale, i.e. close to zero or one, the performance drops drastically (except for 0% noise at a noise threshold close to 1.0). The performance drop for very low parameter values results from a decreasing recall, since nearly every point is considered noise in that case. For very high noise thresholds a

loss in precision causes the F1 measure to drop, since new points from drifting clusters are then more likely to be added to existing micro clusters rather than to create a new micro cluster using the liar concept. As a consequence the area covered by the older micro cluster increases and is likely to cover unnecessary parts of the data space. From the above results any choice between 0.2 and 0.8 for the noise threshold can be justified; 0.7 is used in the following. Summarizing Figure 13.5, we can note that the LiarTree is rather robust against the choice of the noise threshold parameter.

Figure 13.6: F1 measure and resulting radii for LiarTree, ClusTree and CluStream for different noise levels.

Figure 13.6 (left) shows the F1 measures of LiarTree, ClusTree and CluStream for noise levels from 0% to 50%. To compare to the CluStream approach, a maximal tree height of 7 was set and CluStream was allowed to maintain 2000 micro clusters. The parameters for the DenStream algorithm are difficult to set and greatly affect the quality of its results, such that it was only used for the performance comparison. Comparing LiarTree and ClusTree one can see that the ClusTree obtains an even slightly better F1 measure for 0% noise, but that it falls significantly behind that of the LiarTree in the presence of noise. CluStream meets the performance of the ClusTree in the presence of noise, but cannot profit from 0% noise and exhibits a significantly worse F1 measure in that case. The reason for the superiority of the LiarTree lies in its explicit handling of noise. As can be seen in the right part of Figure 13.6, the radii of the resulting offline clusters are more compact (compare to the ground truth radius of 0.05) and hence the precision improves, i.e. there is less unnecessarily covered area.

[Figure 13.7, left: F1 measure over the number of ground truth clusters (2 to 10); right: F1 measure over the cluster radius (0.02 to 0.1); curves for LiarTree, ClusTree and CluStream.]

Figure 13.7: Varying the data stream’s number of clusters and their radius.

Next the parameters k and r are varied, i.e. the number of ground truth clusters and their radius. Figure 13.7 shows the results for the three approaches. While the performance of the approaches obviously does not depend on the number of ground truth clusters, the results differ for varying radii (cf. Figure 13.7 right). The F1 measure for CluStream improves with an increasing radius. This is in line with the previous results: the missing noise handling causes the found clusters to be larger and consequently the precision drops due to unnecessarily covered areas, but with an increasing ground truth radius, the area covered by the ground truth increases and the above effect is slightly diminished. While this effect is not equally distinct for the ClusTree, the LiarTree clearly outperforms both approaches on all settings in this experiment.

To test the performance of the approaches in case of newly emerging clusters, data streams were generated where the number of clusters k is increased abruptly. We expected the recall of the approaches to drop significantly shortly after the introduction of novelty. However, the results did not show any salience in either measure, i.e. the transition was always smooth and the absolute values did not differ (cf. Figure 13.7 left). While all approaches seem capable of immediately reacting to newly emerging clusters, the overall performance of the LiarTree was above that of the other approaches regardless of the number of clusters (cf. Figure 13.7 left).

The next experiments evaluate the performance of the three approaches on data streams with varying drift speed. Figure 13.8 shows the resulting values for F1 and radii.

Figure 13.8: Varying the drift speed for LiarTree, ClusTree and CluStream.

As can be seen in the left part, both the LiarTree and the ClusTree are not affected by a higher drift speed, i.e. their F1 measure exhibits a stable value regardless of the speed. However, the LiarTree consistently outperforms the ClusTree, which proves the liar concept to be effective in the presence of drift and, as seen before, in the presence of noise. The main reason for the difference in the F1 measure is the poorer precision of the ClusTree; details are provided below. The CluStream approach can compete with the ClusTree for slow drift speeds in this experiment, but falls significantly behind when the drift accelerates. Its drop in performance results from both decreasing recall and precision, with the latter clearly having the stronger influence.

Before looking at the details of precision and recall, we briefly inspect the resulting radii over varying drift speeds in the right part of Figure 13.8. As before, the radii shown in the plot correspond to the offline component and must be compared to 0.05 for the ground truth. All three approaches show constant values over the various drift speeds, which is due to their property of removing older data to keep track of the more important recent data. The radii resulting from the CluStream approach are two to three times larger than the ground truth. Similar values are obtained by the ClusTree for this setting, i.e. when allowing a comparable number of micro clusters to both approaches. As was detailed in Chapter 11, the ClusTree can maintain far more micro clusters in the same time due to its logarithmic time complexity in contrast to CluStream. The same holds for the LiarTree, since it has the same logarithmic complexity.

[Figure 13.9, left: recall for varying drift speeds (0.02 to 0.1), plotted on a scale from 0.7 to 1; right: precision for varying drift speeds, plotted on a scale from 0 to 0.7; curves for LiarTree, ClusTree and CluStream.]

Figure 13.9: Precision and recall of the approaches for varying drift speeds.

Figure 13.9 details the precision and recall values of the approaches over varying drift speeds. The left part shows that the recall values for CluStream and LiarTree slightly decrease with faster drift speeds (mind the scale compared to the right part). The reason is that both approaches adapt to the drift and delete the oldest micro clusters in the process. The property of the LiarTree to actively encourage splits and create new entries can in some cases yield micro clusters that become outdated early. In contrast, in the ClusTree new points are more likely to be added to existing concepts, which causes slightly increasing radii and therefore a higher recall value. However, this small benefit of the ClusTree is paid for with a significantly worse precision compared to the LiarTree (cf. right part of Figure 13.9). While the ClusTree can maintain its level of precision with increasing drift speed, the CluStream approach suffers a severe loss in precision in the presence of noise and faster drift. Once more the LiarTree clearly outperforms both approaches, showing the effectiveness of its new concepts.

13.3 Conclusion

In this chapter the LiarTree was introduced. It constitutes an anytime stream clustering algorithm, which avoids overlapping through local look ahead and reorganization and incorporates explicit noise handling on all levels of the hierarchy. It allows the transition from local noise buffers to new entries (micro clusters) and grows novel subtrees top down using its liar concept, which makes it robust against noise and changes in the distribution of the underlying stream. Experimental evaluation showed for various data stream scenarios that the LiarTree outperforms competing approaches in the presence of noise and evolving data, proving its novel concepts to be effective.

This chapter concludes the algorithmic contributions to anytime stream clustering. In the following, an application of the ClusTree is introduced in Chapter 14. An open source framework for stream data mining is presented in Chapter 15 along with a novel evaluation measure for clustering on evolving data streams. Future work in the area of stream clustering is discussed in Chapter 16.

Chapter 14

Application: Using Modeling for Anytime Outlier Detection

∗ Besides classification and clustering, outlier detection is another important mining task on data streams. Anytime outlier detection denotes the task of determining within any period of time whether an object is anomalous. The more time is available, the more reliable the decision should be. The solution presented in this chapter is based on the previously presented ClusTree. The effectiveness of the proposed method is shown in experimental evaluation on varying and constant streams.

14.1 Introduction

Outlier detection has been defined by Hawkins as the task of finding “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [Haw80]. Based on this popular definition, different approaches for formalizing outliers and for algorithmically uncovering them have been proposed in the literature. In statistics, the general idea is to analyze the data by assuming that it follows a certain distribution [BL94]. In distance-based outlier detection, outliers

∗This chapter has been presented at the International Workshop on Novel Data Stream Pattern Mining Techniques (StreamKDD 2010) in conjunction with ACM SIGKDD [AKBS10].

are defined as objects which do not have at least a certain number of objects within a given threshold distance [KNT00]. Another research direction is the use of clustering to identify the prevailing groups of data objects, which defines outliers as objects that are not clustered or far from cluster centers [EKSX96, HXD03].

For streaming data, some outlier detection algorithms have been proposed recently [Agg05, AF07, SPP+06, YiTWM04]. These algorithms successfully determine outliers for streams with constant inter-arrival rates. The approaches, however, are not capable of working effectively in an anytime environment. They may not be able to keep up with streams of varying speed, which leads to loss of data or an unpredictable drop in accuracy for bursty streams. Also, if more time is available, these algorithms are not capable of using this time in order to improve the reliability of the result.

In anytime outlier detection the time available to the algorithm is used to refine the outlier detection and to increase the reliability of the result. The approach presented in this chapter employs the ClusTree to determine the outlier score based on the distance between object and cluster. Therein, until interrupted, more fine grained resolutions of the clustering structure are analyzed.

14.2 Related work

For outlier detection, several paradigms have been introduced in the research literature. In supervised outlier detection, a problem similar to unbalanced classification is studied [ZKF05, FZZ09]. However, supervised methods require training data with labeled outliers, which is often not the case in practice. Unsupervised approaches do not assume the availability of training data but aim to identify outliers based on their deviation from the remainder of the data, similar in spirit to the widely known Hawkins definition [Haw80]. Among unsupervised outlier detection, different models for determining deviation exist. In statistical outlier detection, the core concept is to assume that the data follows a certain data distribution, and to identify those objects that do not fit this assumption well [BL94]. Distance-based outlier detection finds objects that show a high distance to most other data objects [KNT00]. Parameters of this approach are the proportion of objects at a high distance and a distance threshold. Clustering-based outlier detection uses clusters as a means to identify the inherent structure of valid data, and determines outliers as objects that are not clustered well [EKSX96, HXD03, MSS11].

While many traditional methods assume that data can be separated into outliers and inliers (valid data), finding such a clear boundary may prove difficult for many data sets. The Local Outlier Factor (LOF) method relaxed this requirement in favor of a scoring function that reflects the degree of deviation of an object [BKNS00]. The result of the outlier detection is then not a set of outliers, but a ranking of objects by their degree of outlierness. This idea has since been incorporated into other outlier detection methods, such as top-n outlier detection [JTH01]. An extension of LOF for very high dimensional spaces has recently been proposed in [dVCH10]. The basic idea is to first find candidate nearest neighbors in a random projection space of lower dimensionality and to refine these in a second step in the original space.

Specialized techniques for different application areas study the harmonization of local outlier factors to probabilities [KKSZ09] or focus on specific data types such as time series [MSV04]. While time series bear some resemblance to streaming data, the important difference is that time series data is assumed to be available entirely at the time of outlier detection (cf. Chapter 2). Moreover, the goal is not to identify outlying objects as in data streams, but outlying patterns that involve several values that are neighboring along the time axis. Thus, approaches for time series outlier detection are not applicable to the problem of finding outliers in streaming data.

Recently, many approaches for outlier detection on data streams have been proposed [Agg05, AF07, SPP+06, YiTWM04, FG08, ZGW08, VGN08, ELN+08, RWZH09, YRW09, ZGWW10, CZSC10, SG10]. However, all of these approaches assume streams of fixed arrival rates and do not meet the requirement of anytime outlier detection that more time leads to better accuracy.

14.3 Detecting outliers in streaming data

Before presenting the proposed AnyOut algorithm, the problem of anytime outlier detection is stated more formally.

14.3.1 The anytime outlier detection problem

In outlier detection, the goal is to identify objects which deviate from the remainder of the data. This is illustrated in a simplified example in Figure 14.1: The majority of data objects (depicted in black) is considered to represent valid patterns in the data, whereas singular objects such as the one at the top left (red colored) are typically assumed to be outliers. As discussed in the previous section, different research approaches for determining deviation have been proposed in the literature, and they arrive at different formalizations of the outlier notion or degree of outlierness. The formalization of the anytime outlier detection problem abstracts from the concrete notions of what constitutes an outlier. Different outlier detection paradigms may be followed in order to solve this problem. As stated above, the solution presented in Subsection 14.3.2 follows a cluster-based approach for solving anytime outlier detection.

As briefly sketched in the related work as well, it may prove difficult to determine clear decision boundaries between outliers and inliers. Typically, objects deviate from the majority of the data objects to a varying degree. Consequently, many outlier detection techniques aim to capture this degree of deviation in a corresponding outlier score.

Figure 14.1: Cluster-based outlier detection: objects deviating from prevailing patterns in the data are considered outliers.


Figure 14.2: Streaming data with varying inter-arrival rates requires flexible anytime algorithms that are capable of processing each object within a time window that is unknown a priori, as the time available is dictated by the arrival of the next object in the stream.

In the following discussion it is assumed that outlier detection is concerned with the computation of an outlier score that expresses the degree of outlierness. This is without loss of generality, as algorithms that separate outliers and inliers in a binary fashion can be considered as special cases that compute just two distinct outlier scores.

As illustrated in Figure 14.2, as more objects arrive within shorter periods of time, decisions in outlier detection need to be taken in less time accordingly. Since the duration of a burst or the number of objects that arrive within a certain time window is generally not bounded, simply buffering objects until the stream slows down is not an option. If the buffer size is exceeded, objects are lost and detection accuracy drops unexpectedly and unpredictably. On the other hand, if the stream speed is slow, an ideal outlier detection algorithm should be capable of making use of this time. The accuracy of outlier detection should improve as more time is available, which leads to the definition of the anytime outlier problem.

Definition 14.1 Anytime outlier detection. Given a data stream of data objects o_i arriving at an a priori unknown inter-arrival rate, the anytime outlier detection problem is to compute an outlier score s(o_i) in the time t_a between the arrival of o_i and its successor o_{i+1}. The larger t_a, the more accurate the outlier score s(o_i) should be.

The evaluation of the accuracy of outlier scores s(o_i) is not straightforward. If synthetic or manually labeled data is available for empirical studies, this constitutes a ground truth for accuracy assessment. In practice, however, such a ground truth is typically not available, and domain experts need to judge the quality of the outlier scores that are assigned to objects in the stream. This issue is not specific to streaming environments, but affects outlier detection research in general.

14.3.2 Outlier detection using a cluster hierarchy

The general concept underlying the proposed anytime outlier detection approach is discussed in the following. As outlined in the problem definition above, there are two main requirements: being capable of computing outlier scores even within very short time intervals, and being able to improve the accuracy or reliability of the computed outlier score.

Clustering methods have been successfully used to identify prevailing patterns of the data that serve as an input to the actual outlier detection. In order to meet the requirements of fast initial response and improvement over time, the AnyOut algorithm employs the ClusTree as a hierarchical clustering method. Clusters at upper levels of the tree hierarchy subsume the more fine grained information at lower levels of the tree. The hierarchy of the tree hence provides a natural organization of the clusters that can be incrementally accessed in order to refine the outlier score of the object in question. Initially, the object is compared only to the root node that describes the data distribution using few clusters. This comparison can be performed efficiently; hence, the first requirement is met. Since more detailed information on the data distribution is available at lower levels of the tree, the reliability of the outlier scores is typically improved. This aspect is empirically studied in the experiments in Section 14.4.

Assessing the degree of outlierness in the AnyOut approach is based on the degree of accordance of the object with the closest cluster feature at the current level of resolution. Whenever a new object o_{i+1} arrives in the stream, this interrupts the descent down the tree with the current object o_i. The object o_i is then compared to the cluster feature to assess the outlier degree.
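A hedged sketch of this interruptible descent is given below: the current object is refined level by level until the next stream object arrives, and the entry reached so far is used for scoring. The TreeNode and TreeEntry types as well as the interruption check are simplified placeholders introduced for illustration, not the actual ClusTree classes, and the insertion of the object along the way is omitted.

// Simplified placeholder types for a hierarchy of cluster features.
class TreeNode {
    TreeEntry[] entries;
    boolean isLeaf;
}

class TreeEntry {
    double[] mean;    // mean of the cluster feature of this entry
    TreeNode child;   // null at the leaf level
}

class AnytimeDescent {
    // Descend towards the leaves until interrupted; the entry reached so far is returned
    // and can then be used to compute the outlier score of o.
    static TreeEntry descend(TreeNode root, double[] o, java.util.function.BooleanSupplier interrupted) {
        TreeNode node = root;
        TreeEntry closest = null;
        do {
            closest = closestEntry(node, o);              // refine the result by one level
            node = node.isLeaf ? null : closest.child;    // continue towards the leaf level
        } while (node != null && !interrupted.getAsBoolean());
        return closest;
    }

    static TreeEntry closestEntry(TreeNode node, double[] o) {
        TreeEntry best = null;
        double bestDist = Double.MAX_VALUE;
        for (TreeEntry e : node.entries) {
            double d = 0;
            for (int i = 0; i < o.length; i++) { double diff = o[i] - e.mean[i]; d += diff * diff; }
            if (d < bestDist) { bestDist = d; best = e; }
        }
        return best;
    }
}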

The mean outlier score computes the degree of outlierness of an object o_i as the extent of deviation of o_i from the mean of the closest cluster feature. Since AnyOut operates directly on the ClusTree, the closest cluster feature is simply the one that is reached through tree traversal until the point of interruption by the next data object in the stream.

Definition 14.2 Mean outlier score. For any data object o_i, the mean outlier score s_m(o_i) is defined as s_m(o_i) := d(o_i, µ_{e_s}), where µ_{e_s} is the mean of the entry e_s in the ClusTree that o_i is inserted into when the next object o_{i+1} of the data stream arrives, and d is the Euclidean distance.

The mean outlier score thus assesses the deviation of the current object from the mean of the data distribution in the cluster feature of the current tree entry. By interpreting the cluster features as parameters of a Gaussian distribution of the data objects in the corresponding subtree, a second outlier score based on the probability density of the object can be defined.

Definition 14.3 Density outlier score. For any data object o_i, the density outlier score s_d(o_i) is defined as s_d(o_i) := g(o_i, µ_{e_s}, σ_{e_s}), where e_s is the entry in the ClusTree that o_i is inserted into when the next object o_{i+1} of the data stream arrives, µ_{e_s} and σ_{e_s} are the mean and standard deviation of e_s, and g is a Gaussian distribution according to Definition 2.3.

Both the mean outlier score and the density outlier score reflect the degree of outlierness of the object at the point of interruption. In both cases, the data distribution of the closest cluster is the basis for the score computation. They differ in the way the cluster feature is interpreted: the mean outlier score only takes the center of mass of the data into account, whereas the density outlier score takes the overall data distribution into account, assuming a Gaussian distribution.
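A minimal sketch of the two scores is given below, assuming that a cluster feature stores the number of objects together with the linear and squared sums per dimension, and that the Gaussian of Definition 14.3 is evaluated with a diagonal covariance; these assumptions and all names are illustrative simplifications, not the actual implementation.

// Cluster feature as (n, linear sum, squared sum) per dimension.
class ClusterFeature {
    double n;
    double[] ls;
    double[] ss;

    double[] mean() {
        double[] mu = new double[ls.length];
        for (int i = 0; i < ls.length; i++) mu[i] = ls[i] / n;
        return mu;
    }

    double[] variance() {
        double[] var = new double[ls.length];
        for (int i = 0; i < ls.length; i++) {
            double mu = ls[i] / n;
            var[i] = Math.max(ss[i] / n - mu * mu, 1e-12);   // guard against zero variance
        }
        return var;
    }
}

class AnyOutScores {
    // mean outlier score (Definition 14.2): Euclidean distance to the entry's mean
    static double meanScore(double[] o, ClusterFeature cf) {
        double[] mu = cf.mean();
        double d2 = 0;
        for (int i = 0; i < o.length; i++) { double diff = o[i] - mu[i]; d2 += diff * diff; }
        return Math.sqrt(d2);
    }

    // density outlier score (Definition 14.3): Gaussian density of the object w.r.t. the entry,
    // here with a diagonal covariance as a simplifying assumption of this sketch
    static double densityScore(double[] o, ClusterFeature cf) {
        double[] mu = cf.mean();
        double[] var = cf.variance();
        double logDensity = 0;
        for (int i = 0; i < o.length; i++) {
            double diff = o[i] - mu[i];
            logDensity += -0.5 * (Math.log(2 * Math.PI * var[i]) + diff * diff / var[i]);
        }
        return Math.exp(logDensity);
    }
}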

14.4 Experiments

Before the performance of the AnyOut algorithm is evaluated, the employed quality measures are described in Section 14.4.1. The quality of the approach is tested using the proposed outlier scores on the individual levels of the ClusTree. Besides the incremental insertion as in the ClusTree algorithm (cf. Chapter 11), the EM top down bulk loading discussed in Chapter 6 is employed for constructing the tree. The results for the anytime performance as well as a comparison between incremental insertion and bulk loading are presented in Section 14.4.3. Section 14.4.4 contains the results achieved by AnyOut on constant streams.

14.4.1 Setup

In the experiments, real world data sets of different characteristics from the UCI machine learning repository [HB99] were used in a four-fold cross validation. One class was left out in the training set to generate outliers: the objects from the left-out class which are contained in the test set are the outliers O. The AnyOut algorithm is evaluated after all objects are processed, yielding a complete ranking of all objects with respect to their assigned outlier score in descending order. pos(o) gives the position of an object o in the ranking; the smaller pos(o), the more likely o is considered an outlier. The ground truth information whether an object o was an outlier (o ∈ O) or not is used to compute the quality measures, starting off with the position score

posScore = Σ_{o ∈ O} pos(o)

The smaller the value of posScore, the better the quality of the ranking, since then the true outliers are at the top of the ranking. The ranking and the ground truth are further used to compute the ROC curve for the results, which plots the true positive rate (TPR) over the false positive rate (FPR) for each prefix of the ranking. From the ROC plots the AUC value (area under the ROC curve [Bra97]) is derived as a standard measure. Finally, two further standard measures for ranking quality are computed, namely the Spearman ranking coefficient [Spe04] and Kendall's Tau [Ken38]. In the plots, generally the average performance values over all classes are reported and detailed results per class are shown for selected settings.
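A small sketch of the posScore computation is given below: the positions of the true outliers in the ranking (sorted by descending outlier score) are summed, so that a smaller value indicates a better ranking. The names are illustrative.

class PosScore {
    // ranking: object ids ordered by descending outlier score (position 1 = most outlying);
    // trueOutliers: the ids of the objects in O
    static long posScore(int[] ranking, java.util.Set<Integer> trueOutliers) {
        long score = 0;
        for (int pos = 0; pos < ranking.length; pos++)
            if (trueOutliers.contains(ranking[pos]))
                score += pos + 1;   // pos(o) is 1-based
        return score;
    }
}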

[Figure 14.3: ROC curves (TPR over FPR) per level; vowel: levels 0, 3 and 7; letter: levels 0, 6 and 13; pendigits: levels 0, 3 and 10.]

Figure 14.3: ROC curves for the vowel, letter and pendigits data sets.

14.4.2 Level analysis

To analyze the individual levels of the resulting tree structures in the AnyOut algorithm, one ranking for each level of the tree is created and the measures introduced in Section 14.4.1 are computed. Figure 14.3 shows the ROC plots for the vowel, pendigits and letter data sets. Each plot contains a ROC curve for the root level, the leaf level and a level in the middle. The tree structures in this experiment were built using incremental insertion and the employed outlier score was the mean outlier score. What is clearly visible from the results on all data sets is that the basic principle of the AnyOut algorithm is effective; the quality of the results improves if more time is available (deeper levels are reachable). Similar observations can be made using the other quality measures. Figure 14.4 shows the results for Spearman's ranking coefficient, AUC and Kendall's Tau. The two plots show the mean outlier score and the density outlier score on the vowel data set using bulk loading to construct the tree. (Incremental insertion and bulk loading are compared in the next section.)


Figure 14.4: Quality Measures for mean and density outlier scores on vowel.

The density outlier score yields slightly worse rankings than the mean outlier score. Moreover, the mean outlier score yields a constant quality improvement, whereas the density outlier score shows a slight reduction in ranking quality on intermediate levels. The mean outlier score showed better results throughout the data sets and is therefore employed in the following.

Figure 14.5 shows the results for the posScore measure over the individual levels on the pendigits data set. Besides the average value, the values for each class that was left out are shown. While some classes are obviously easier to separate, the AnyOut principle shows its effectiveness in all cases; the more time is available to descend the tree structure, the better the resulting ranking reflects the true outliers.


Figure 14.5: posScore per class and the averaged posScore for the pendigits data set.

[Figure 14.6, left: anytime performance on vowel; right: anytime performance on pendigits. Bar groups show AUC and Kendall's Tau for incremental insertion (Incr) and bulk loading (Bulk) at λ = 0.5, 0.3 and 0.1.]

Figure 14.6: Results on anytime streams.

[Figure 14.7, left: anytime performance on letter (AUC and Kendall's Tau for Incr and Bulk at λ = 0.5, 0.3 and 0.1); right: posScore on anytime streams for vowel, pendigits and letter.]

Figure 14.7: Results on anytime streams.

14.4.3 Anytime streams

In the anytime performance evaluation of AnyOut, a unit corresponds to one level of the tree structure. Figures 14.6 and 14.7 show the results for Poisson streams (cf. Def. 4.8, p. 94) using the vowel, pendigits and letter data sets. In these experiments the performance of the incremental insertion is compared to the proposed bulk loading using the EM algorithm. In each group of bars the values for three different λ values are shown, where a smaller value corresponds to a slower stream with more expected time per object. Starting with the vowel data set in Figure 14.6 (left), we see for the AUC measure and for Kendall's Tau that both the incremental insertion and the bulk loading yield better qualities on slower streams. This is in line with the results from the level analysis in Section 14.4.2. Comparing the two tree construction methods, we observe that on the one hand the bulk loading yields better results for any speed on both measures, but on the other hand the advantage is smaller than we expected.

[Figure 14.8: normalized outlier score (from 0 to 1) over the ranked objects (0 to 1000) for the vowel data set, with the ground truth outliers marked.]

Figure 14.8: Outlier Positions.

The results on pendigits (Figure 14.6 right) and letter (Figure 14.7 left) confirm both of the above findings. With a higher expected time per object (smaller λ value) the AnyOut algorithm shows its effectiveness and produces better rankings. The differences between incremental insertion and bulk loading performance are rather small but constant. To complete the analysis on anytime streams, the posScore values for all data sets are shown in Figure 14.7 (right). Summarizing the evaluation, we can conclude that the proposed AnyOut method is effective for anytime outlier detection.

14.4.4 Anytime outlier detection on constant streams

To test the performance of the AnyOut algorithm on streams with constant data rates, a confidence measure for the current outlier score of an object is required (cf. Chapter 9). To this end, the positions of the outliers in the ranking returned by the AnyOut algorithm are examined. Figure 14.8 shows the results for the vowel data set (middle level) using the mean outlier score. Optimally all outliers (green bars) would be at the far left. Obviously the objects with the highest outlier score are false alarms.


Figure 14.9: AnyOut using the window approach (left) and fifo approach (right) on constant data streams.

Similar distributions are observable for the other data sets. Therefore

conf(o) = 1 − e^{−s(o)} is employed as a simple and straightforward confidence measure in the following, where s(o) is the current outlier score of object o. Intuitively, we hence check for the putative outliers (the highest ranked objects) whether they are really outliers by giving them more computation time.
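A heavily hedged sketch of how this confidence value could be used is given below: among a small set of buffered objects, the putative outlier (the object with the highest current score) is selected for further refinement when idle time is available, as stated above. The actual buffer handling via the window and fifo approaches follows Chapter 9 and is not reproduced here; all names are illustrative.

class AnyOutConfidence {
    // conf(o) = 1 - e^{-s(o)} for the current outlier score s(o)
    static double confidence(double outlierScore) {
        return 1.0 - Math.exp(-outlierScore);
    }

    // index of the buffered object with the highest current score, i.e. the putative
    // outlier that is verified next by descending its tree path further
    static int pickForRefinement(double[] bufferedScores) {
        int best = 0;
        for (int i = 1; i < bufferedScores.length; i++)
            if (bufferedScores[i] > bufferedScores[best]) best = i;
        return best;
    }
}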

Figure 14.9 shows the corresponding results for the vowel data set using the window approach and the fifo approach (cf. Chapter 9), respectively, using three different window sizes (WS 2, 4 and 6) and the analogous fifo sizes (FS 4, 8 and 12). As in Section 14.4, four-fold cross validation was used and the quality measures are based on the final ranking of all objects. The results from Figure 14.9 (left) show that the performance of the AnyOut algorithm improves with larger window sizes for slow and fast streams. The results using the fifo approach are even slightly better (cf. Figure 14.9 right), which is attributable to the greater flexibility due to the comparably larger set of objects. This is in line with the results from Chapter 9 for anytime classification on constant streams. The simple confidence measure proposed above already shows the effectiveness of AnyOut on constant streams. This motivates further research as in the case of anytime classification.

14.5 Conclusion

The anytime outlier detection problem for varying and constant data streams was introduced and studied in this chapter. A first algorithm called AnyOut was proposed and its structure and performance were analyzed on both types of streams. The algorithm is based on the methods proposed in the previous chapters and constitutes an application example for the ClusTree. The promising results of AnyOut motivate further research in both anytime clustering and anytime outlier detection. A simple and straightforward confidence measure was employed for performance improvement on constant streams. More sophisticated confidence measures can be investigated; using the variance of previous outlier scores for an object is one example. Also, anytime outlier algorithms can be developed using other paradigms such as density based approaches.

Chapter 15

MOA and CMM

∗ † Working on both classification (cf. Part II) and clustering (cf. Part III) reveals two very strong advantages of the former over the latter when it comes to experimental evaluation: the chance of being provided with a ground truth on real world data sets and, given the ground truth, an undisputed objective evaluation measure (accuracy). The burden of finding data sets for the evaluation of clustering algorithms increases in the stream mining scenario, since a gold standard clustering for changing distributions in real world streaming data is hardly ever available. Therefore, evaluating and, especially, comparing stream clustering algorithms in a fair and meaningful way is a challenging task. The MOA framework presented in Section 15.1 constitutes a step towards easier and more repeatable evaluation and comparison of stream clustering algorithms. The matter of an appropriate evaluation measure for clustering evolving data streams is discussed in Section 15.2, where a short introduction to the cluster mapping measure (CMM) is provided.

∗An article on MOA has been published in the Journal of Machine Learning Research (JMLR) [BHP+10a] and MOA has been presented at different international conferences including ACM SIGKDD 2010 and IEEE ICDM 2010 [KKJ+10a, KKJ+10b, BHP+10b].
†A research paper describing the CMM has been published in the Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2011) [KKJ+11].


15.1 The MOA Framework

In traditional data mining scenarios, evaluation frameworks were introduced to cope with the comparison issue. One of these frameworks is the well-known WEKA Data Mining Software that supports adding new algorithms and evaluation measures in a plug-and-play fashion [HFH+09, MAK+08, MAG+09b]. As data stream mining is a relatively new field, the evaluation practices are not nearly as well researched and established as they are in the traditional batch setting. Massive Online Analysis (MOA) [BHKP10] is a framework for stream mining evaluation that builds on the work in WEKA. MOA contains state-of-the-art algorithms and measures for both stream classification and stream clustering and permits evaluation of data stream mining algorithms on large streams and under explicit memory limits. The main contributions and benefits of the MOA framework are:

• Analysis and comparison both for different approaches (new and state-of-the-art algorithms) and for different (large) streaming settings

• Creation and usage of benchmark settings for comparable and repeatable evaluation of stream mining algorithms

• Open source framework that is easily extensible for data feeds, algorithms and evaluation measures

In the following, first the general architecture of MOA is introduced, before classification and clustering on evolving data streams using MOA are described.

15.1.1 System architecture

A simplified system architecture is illustrated in Figure 15.1. It shows at the same time the work flow of MOA and its extension points, since all aspects follow the same principle. First a data feed is chosen, then a learning algorithm is configured (a stream classification or stream clustering algorithm) and finally an evaluation method is chosen to analyze the desired scenario. The choice of streams, algorithms and especially evaluation methods differs between the classification and clustering parts and is therefore described separately in the following sections. For both tasks, users can extend the framework in all three aspects to add novel data generators, algorithms or evaluation measures. To run experiments using MOA, users can choose between the command line and a graphical user interface.

[Figure 15.1: the MOA work flow from data feed/generator via learning algorithm and evaluation method to the results, with each of the three stages being an extension point.]

Figure 15.1: Architecture, extension points and work flow of MOA.

15.1.2 Classification

For the evaluation of stream classification algorithms MOA contains the following components.

Data stream generators for stream classification. MOA contains the data generators most commonly found in the literature. Streams can be built using generators, reading ARFF files, joining several streams, or filtering streams. They allow for the simulation of a potentially infinite sequence of data. The following generators are available in MOA for stream classification: SEA Concepts Generator [SK01], STAGGER Concepts Generator [SG86], Rotating Hyperplane [HSD01], LED Generator [BFSO84, HB99] and Function Generator [AGI+92].

Considering data streams as data generated from pure distributions, MOA models a concept drift event as a weighted combination of two pure distributions that characterizes the target concepts before and after the drift. Within the framework, it is possible to define the probability that instances of the stream belong to the new concept after the drift. It uses the sigmoid function as an elegant and practical solution [BHPG09, BHP+09].

Classifiers. For stream classification MOA contains a naïve Bayes classifier and the following variants of Hoeffding trees (cf. Chapter 3): Hoeffding Tree [DH00], Hoeffding Option Trees [PHK07], ADWIN [BG07], Bagging using ADWIN [BHP+09] and Adaptive-Size Hoeffding Trees [BHP+09].

Evaluation methods for stream classification. The evaluation procedure of a learning algorithm determines which examples are used for training the algorithm, and which are used to test the model output by the algorithm. When considering what procedure to use in the data stream setting, one of the unique concerns is how to build a picture of accuracy over time. Two main approaches arise:

• Holdout: When traditional batch learning reaches a scale where cross-validation is too time consuming, it is often accepted to instead measure performance on a single holdout set. This is most useful when the division between train and test sets has been predefined, so that results from different studies can be directly compared.

• Interleaved Test-Then-Train or Prequential: Each individual example can be used to test the model before it is used for training, and from this the accuracy can be incrementally updated. When intentionally performed in this order, the model is always being tested on examples it has not seen. This scheme has the advantage that no holdout set is needed for testing, making maximum use of the available data. It also ensures a smooth plot of accuracy over time, as each individual example will become increasingly less significant to the overall average [GSR09].

MOA is easy to use and extend. A simple approach to writing a new classifier is to extend moa.classifiers.AbstractClassifier, which will take care of certain details to ease the task.
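A minimal, framework-independent sketch of the interleaved test-then-train scheme described above is given below; the SimpleClassifier interface is a stand-in introduced for illustration and not the actual MOA API.

// Prequential (interleaved test-then-train) evaluation: every example is first used
// to test the current model and only afterwards to train it.
interface SimpleClassifier {
    int predict(double[] features);
    void train(double[] features, int label);
}

class PrequentialEvaluation {
    // returns the accuracy accumulated over the whole stream;
    // example[labelIndex] is assumed to hold the class label
    static double run(SimpleClassifier learner, Iterable<double[]> stream, int labelIndex) {
        long correct = 0, seen = 0;
        for (double[] example : stream) {
            int label = (int) example[labelIndex];
            double[] features = java.util.Arrays.copyOfRange(example, 0, labelIndex);
            if (learner.predict(features) == label) correct++;   // test first ...
            learner.train(features, label);                      // ... then train
            seen++;
        }
        return seen == 0 ? 0.0 : correct / (double) seen;
    }
}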

[Figure 15.2, left: the clustering work flow (data feed/generator, clustering algorithm, evaluation measure) with the extension points ClusteringStream.java, Clusterer.java and Measure.java and the output options a) open the current result in the WEKA explorer, b) store measures to a .csv file, c) visualize clustering and measures online.]

Figure 15.2: Left: Extension points and work flow of the MOA stream clustering framework. Right: Option dialog for the RBF data generator (by storing and loading settings, benchmark streaming data sets can be shared for repeatability and comparison).

15.1.3 Clustering

The stream clustering component of MOA has the following main features:

• data generators for stream clustering on evolving streams (including events like novelty, merge, etc. (cf. Chapter 3)),

• a set of state-of-the-art stream clustering algorithms,

• evaluation measures for stream clustering,

• visualization tools for analyzing results and comparing different settings.

The left part of Figure 15.2 shows the extension points of MOA stream clustering and illustrates the architecture as well as the usage of the clustering component. First a data feed is chosen and configured, then a stream clustering algorithm and its settings are fixed, then a set of evaluation measures is selected and finally the experiment is run to obtain and analyze the result. The four aspects are detailed in the following subsections.

Data feeds and data generators

For stream clustering a new RBF data generator was added that supports the simulation of cluster evolution events such as the merging or disappearing of clusters. The right part of Figure 15.2 shows a screen shot of the configuration dialog for the RBF data generator with events. Generally, the dimensionality, number and size of clusters can be set, as well as the drift speed, decay horizon (aging), noise rate etc. Events constitute changes in the underlying data model such as growing of clusters, merging of clusters or creation of new clusters [SNTS06]. Using the event frequency and the individual event weights, one can study the behavior and performance of different approaches in various settings. Finally, the settings for the data generators can be stored and loaded, which offers the opportunity of sharing settings and providing benchmark streaming data sets for repeatability and comparison. New data feeds and generators can be added to the MOA framework by implementing the ClusteringStream interface (further description and source code can be found on the MOA website http://moa.cs.waikato.ac.nz/).

Stream clustering algorithms

The MOA framework contains several stream clustering methods including StreamKM++ [ALM+10], CluStream [AHWY03], ClusTree [KABS09], DenStream [CEQZ06], D-Stream [CT07] and CobWeb [Fis87]. As for stream classification, the set of algorithms is extensible through classes that implement the interface Clusterer.java. These are added to the framework via reflection on start-up. The three main methods of this interface are

• void resetLearningImpl(): a method for initializing the clusterer

• void trainOnInstanceImpl(Instance): a method to train on a new instance

• Clustering getClusteringResult(): a method to obtain the current clustering result for evaluation or visualization (a schematic skeleton of such a clusterer is sketched below)
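The following schematic skeleton only illustrates how these three methods interact; the simple nearest-center update is a toy placeholder for an actual stream clustering algorithm, and plain Java types are used instead of MOA's Instance and Clustering classes.

class ToyMicroClusterer {
    private final java.util.List<double[]> centers = new java.util.ArrayList<>();
    private final double mergeThreshold;

    ToyMicroClusterer(double mergeThreshold) { this.mergeThreshold = mergeThreshold; }

    public void resetLearningImpl() {
        centers.clear();                        // discard the model before a new run
    }

    public void trainOnInstanceImpl(double[] x) {
        // toy update: move the nearest center towards x, or open a new micro cluster
        double[] nearest = null;
        double best = Double.MAX_VALUE;
        for (double[] c : centers) {
            double d = dist(c, x);
            if (d < best) { best = d; nearest = c; }
        }
        if (nearest != null && best <= mergeThreshold) {
            for (int i = 0; i < x.length; i++) nearest[i] += 0.1 * (x[i] - nearest[i]);
        } else {
            centers.add(x.clone());
        }
    }

    public java.util.List<double[]> getClusteringResult() {
        return centers;                         // current micro clustering
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }
}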

Internal measures                       External measures
Gamma [BH75]                            Rand statistic [Ran71]
C Index [HL76]                          Jaccard coefficient [FM83]
Point-Biserial [Mil80]                  Folkes and Mallow Index [FM83]
Log Likelihood [Har75]                  Hubert Γ statistics [HA85]
Dunn's Index [Dun74]                    Minkowski score [BSH+07]
Tau [Roh74]                             Purity [ZK04]
Tau A [HL76]                            van Dongen criterion [VD00]
Tau C [HL76]                            V-measure [RH07]
Somer's Gamma [HL76]                    Completeness [RH07]
Ratio of Repetition [HL76]              Homogeneity [RH07]
Modified Ratio of Repetition [HL76]     Variation of information [Mei05]
Adjusted Ratio of Clustering [HL76]     Mutual information [CT06]
Fagan's Index [HL76]                    Class-based entropy [SZ08]
Deviation Index [HL76]                  Cluster-based entropy [ZK04]
Z-Score Index [HL76]                    Precision [vR79]
D Index [HL76]                          Recall [vR79]
Silhouette coefficient [KR90]           F-measure [vR79]

Table 15.1: Internal and external clustering evaluation measures.

Stream clustering evaluation measures

For cluster evaluation, various measures have been developed and proposed over the last decades. A common classification of these measures is the separation into so-called internal measures and external measures. Internal measures only consider cluster properties such as distances between points within one cluster or between two different clusters. External evaluation measures compare a given clustering to a ground truth. Table 15.1 shows a selection of popular measures from the literature. MOA contains an extensible set of both internal and external measures that can be applied to both micro and macro clusterings. Comparative studies of clustering evaluation measures can be found in [Mil80, BSH+07, SZ08, WXC09]; an outlook on the evaluation of stream clustering algorithms is contained in Section 15.2. To extend the available collection with additional or novel evaluation measures, the Measure interface must be implemented. The main methods are:

Figure 15.3: Visualization tab of the clustering MOA graphical user interface.

• void evaluateClustering(Clustering clustering, Clustering trueClustering): uses the implemented measure to evaluate the given clustering w.r.t. the provided ground truth.

• double getLastValue(): a method that outputs the last result of the evaluation measure.

• double getMaxValue(), getMinValue(), getMean(): methods that provide further statistics about the measure's distribution. (A minimal example implementation is sketched below.)
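The following minimal example shows how a measure exposing these methods could be structured, using purity as the evaluated quantity; the flat integer arrays stand in for MOA's Clustering objects and are an assumption of this sketch.

class PurityMeasure {
    private double last = Double.NaN;
    private double max = Double.NEGATIVE_INFINITY;
    private double min = Double.POSITIVE_INFINITY;
    private double sum = 0;
    private int evaluations = 0;

    // clusterIds[i]: id of the found cluster of object i; classIds[i]: ground truth class of object i
    public void evaluateClustering(int[] clusterIds, int[] classIds) {
        java.util.Map<Integer, java.util.Map<Integer, Integer>> counts = new java.util.HashMap<>();
        for (int i = 0; i < clusterIds.length; i++)
            counts.computeIfAbsent(clusterIds[i], k -> new java.util.HashMap<>())
                  .merge(classIds[i], 1, Integer::sum);
        int majoritySum = 0;
        for (java.util.Map<Integer, Integer> perCluster : counts.values())
            majoritySum += java.util.Collections.max(perCluster.values());
        last = majoritySum / (double) clusterIds.length;   // fraction of objects that belong to the
        max = Math.max(max, last);                         // majority class of their cluster
        min = Math.min(min, last);
        sum += last;
        evaluations++;
    }

    public double getLastValue() { return last; }
    public double getMaxValue()  { return max; }
    public double getMinValue()  { return min; }
    public double getMean()      { return evaluations == 0 ? Double.NaN : sum / evaluations; }
}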

Analysis and visualization

After the evaluation process is started, several options for analyzing the outputs are given: a) the stream can be stopped and the current (micro) clustering result can be passed as a data set to the WEKA explorer for further analysis or mining; b) the evaluation measures, which are taken at configurable time intervals, can be stored as a .csv file to obtain graphs and charts offline using a program of choice; c) last but not least, both the clustering results and the corresponding measures can be visualized online within the MOA framework. In the graphical interface, MOA allows the simultaneous configuration and evaluation of two different setups for direct comparison of two different algorithms on the same stream.

The visualization component allows visualizing the stream as well as the clustering results, choosing dimensions for multi-dimensional settings, and comparing experiments with different settings in parallel. Figure 15.3 shows a screen shot of the visualization tab. For this screen shot, two different settings of the CluStream algorithm [AHWY03] were compared on the same stream setting (including merge/split events every 50000 examples) and four measures were chosen for online evaluation (F1, Precision, Recall, and SSQ). The upper part of the GUI offers options to pause and resume the stream, adjust the visualization speed, choose the dimensions for x and y as well as the components to be displayed (points, micro- and macro clustering and ground truth). The lower part of the GUI displays the measured values for both settings as numbers (left) and the currently selected measure as a plot over the arrived examples (right, F1 measure in this example). For the shown setting one can see a clear drop in the performance after the split event at roughly 160000 examples (event details are shown when choosing the corresponding vertical line in the plot). While this holds for both settings, the left configuration (red, CluStream with 100 micro clusters) is constantly outperformed by the right configuration (blue, CluStream with 20 micro clusters). A video containing an online demo of the system can be found at the MOA website along with more screen shots and explanations.

15.1.4 Conclusion

The MOA framework provides a set of data generators, algorithms and evaluation measures for stream data mining. Practitioners can benefit from this by comparing several algorithms in real world scenarios and choosing the best fitting solution. For researchers the framework yields insights into advantages and disadvantages of different approaches and allows the creation of benchmark streaming data sets through stored, shared and repeatable settings for the data feeds. The sources are publicly available and are released under the GNU GPL license. Although the current focus in MOA is on classification and clustering, envisioned extensions to the framework include regression and frequent pattern learning [Bif10].

15.2 Evaluation Measures for Stream Clustering

The experience from implementing and using the stream clustering component of MOA revealed two major disadvantages of existing evaluation measures: First, none can properly handle the peculiarities of evolving data streams such as overlapping clusters due to merging or drifting, or noisy data streams; as a consequence the measures cannot effectively reflect the occurring errors. Second, the vast majority of evaluation measures achieves suboptimal results even when the ground truth clustering itself is evaluated. The Cluster Mapping Measure (CMM) proposed in [KKJ+11] overcomes these shortcomings and enables effective evaluation of clustering results on evolving streams. This section provides a brief overview of the main idea behind CMM.

An evaluation measure for clustering on evolving data streams should take the following scenarios and circumstances into account:

1. Aging / decay Dealing with this property is probably the simplest task, because faults caused by a clustering algorithm can be weighted by the influence (age) of the corresponding points.

2. Missed points Moving clusters yield errors for missed points. These errors should reflect their seriousness, e.g. how close the point is to its actual cluster.

3. Misplaced points Evolution, merging, and splitting of clusters yield overlapping clusters and thereby easily misplaced points. A measure that punishes these misplaced points equally to misplaced points lying outside of any overlapping region does not account for the special circumstances of evolving streams.

4. Noise Including noise in a found cluster is often inevitable in the model of the clustering algorithm and should be accounted for by an effective measure.

Summarizing these properties, three fault cases can be identified that have to be considered in depth, namely missed points, misplaced points, and noise inclusion. The penalty for such errors of stream clustering algorithms should reflect their seriousness and take the age of the points as well as the clustering model into account. CMM is a normalized sum of the penalties for occurring errors that accounts for all aspects mentioned above. Two important prerequisites for its computation are a notion of how well an object fits into a cluster and a mapping from found clusters to ground truth classes. The results of the experimental evaluation in [KKJ+11] show that the cluster mapping measure can precisely reflect various error types in evolving data streams. In comparison with known and widely used internal and external measures, the CMM outperformed the competing approaches on real and synthetic data using both cluster generators and existing stream clustering algorithms. The CMM is included in the open source MOA framework for future experimentation and evaluation of novel approaches.

Chapter 16

Future Work

The ClusTree is the first anytime clustering algorithm for streaming data. As for the Bayes tree in Part II, a subspace variant of the ClusTree is an interesting research topic. The effects of different initializations can be evaluated, for example using a top down approach similar to the bulk loading proposed for the Bayes tree. For descending the tree, alternative priority measures can be explored such as the Kullback-Leibler divergence or the probability density of the object w.r.t. the entry. Moreover, descending with the hitchhiker instead of the insertion object may be beneficial if the former fits the corresponding entry better. Finally, different ways to devise new anytime clustering algorithms can be investigated. Anytime outlier algorithms can be developed using other paradigms such as density based outlier detection. For their application on constant data streams, more sophisticated confidence measures can be investigated; using the variance of previous outlier scores for an object is one example.

The next step within MOA w.r.t. stream clustering is the inclusion of anytime clustering and corresponding evaluation mechanisms. Together with the proposed cluster mapping measure, MOA offers a platform for an extensive evaluation study of stream clustering algorithms on evolving data. Another goal of the MOA project is to establish benchmark settings and collect real world benchmark data with drift, novelty and noise.


Part IV

Summary and Outlook



Part I. The introduction of this thesis provides a strong motivation for anytime algorithms. Many applications are discussed in Chapter 1 and the benefits of anytime algorithms are laid out. Chapters 2 and 3 provide background and related work on the KDD process and stream data mining.

Part II. Chapter 4 introduces the Bayes tree as a new anytime classifier for Bayesian classification on continuous attributes. The Bayes tree constitutes a hierarchy of mixture densities that represent kernel estimators at successively coarser levels. The proposed probability density queries adapt the employed mixtures efficiently to the individual object to be classified. Together with novel classification improvement strategies this allows for very effective classification at any point of interruption (a small illustrative sketch of this refine-and-classify principle is given below).

The Bayes tree uses an incremental insertion procedure and builds separate hierarchies for each class label. A different approach is followed by the MC-tree described in Chapter 5. Starting from the initial idea of combining several classes in a single tree, a novel way of constructing the hierarchical models is investigated through top-down clustering using the EM algorithm. While it turns out that separating the classes remains advisable, the EM construction shows very good results.

In Chapter 6 further alternative construction methods are investigated for hierarchical mixture models in the Bayes tree. Experimental results show that the EMTopDown bulk load consistently outperforms other approaches and improves the accuracy by up to 13%. Surprisingly, two proposed statistical approaches were outperformed by existing R-tree bulk loadings based on space filling curves. Further analysis attributed this shortcoming to a structural property of the resulting Bayes trees. The results of the analysis are in line with the classification results found in the experiments, confirming the superior performance of the EMTopDown bulk loading in terms of anytime classification accuracy.

Chapter 7 concludes the in-depth investigation of the Bayes tree. For construction, parameter optimization and decision design, related concepts are analyzed and transferred to the Bayes tree to improve the corresponding process. A thorough experimental evaluation of the single improvements as well as the combined approaches shows the great potential of the concept transfer method. The improved version of the Bayes tree that results from Chapter 7 shows near perfect results on all tested data sets and constitutes the final version of the proposed Bayesian anytime classifier.

An application of anytime classification using the Bayes tree is presented in Chapter 8. The HealthNet scenario is introduced and the integration of the Bayes tree as well as its benefits are described. A proof of concept is provided by a prototype that has been developed within a UMIC research project at RWTH Aachen University.

Chapter 9 introduces two meta-approaches that harness the strengths of anytime algorithms for streams with constant data rates. The goal was to improve the quality of the result w.r.t. traditional budget approaches, which are used in an abundance of stream mining applications. Using anytime classification as an example application, experimental results show for SVM, Bayes and nearest neighbor classifiers that both approaches improve the classification accuracy for slow and fast data streams. The results confirm the theoretical model that was introduced in that chapter and show the effectiveness of the proposed approaches. The simple yet effective idea can be employed for any anytime algorithm along with a quality measure and motivates further research in classification confidence measures and anytime algorithms.
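The refine-and-classify principle behind the Bayesian anytime classifiers of Part II can be illustrated as follows. This is a simplified sketch under assumptions (one-dimensional per-class Gaussian mixtures, a round-robin refinement order, and a deadline that emulates the external interruption); it is not the Bayes tree implementation, which adapts the refinement to the query object.

```python
import math, time

def gauss(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, components):
    """components: list of (weight, mean, var) tuples of one class (1-d for brevity)."""
    return sum(w * gauss(x, m, v) for w, m, v in components)

def anytime_classify(x, coarse_models, refinements, priors, deadline):
    """Return the best class found before the deadline (emulates interruption).

    coarse_models: dict label -> coarse mixture, available immediately
    refinements:   dict label -> iterator over successively finer mixtures
    priors:        dict label -> class prior
    """
    densities = {c: mixture_density(x, m) for c, m in coarse_models.items()}
    while time.time() < deadline:                    # improve as long as time remains
        for c, finer in refinements.items():
            try:
                densities[c] = mixture_density(x, next(finer))  # one refinement step
            except StopIteration:
                pass                                  # this class is fully refined
            if time.time() >= deadline:
                break
    return max(densities, key=lambda c: priors[c] * densities[c])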

Part III. The first anytime stream clustering algorithm is presented in Chapter 11. The proposed ClusTree algorithm self-adapts to varying stream speed and provides a novel solution for interruption of the insertion process that can be easily resumed at any later point in time. For very fast streams, aggregates of similar objects allow insertion of groups instead of single objects for even faster processing. In comparison to recent approaches it is shown that the ClusTree can maintain the same amount of micro-clusters at a stream speed that is faster by orders of magnitude, and that for an equal stream speed the obtained granularity is exponential w.r.t. competing approaches (a small sketch of such a micro-cluster summary is given below).

In Chapter 12 alternative descent strategies for the ClusTree are proposed that improve the resulting clustering on slower streams as long as time permits. The iterative depth-first descent turns out to be an excellent alternative insertion strategy, since it starts with the same high performance as the original strategy, has very low runtime (O(log²(n))), and yet improves the initial solutions in all tested settings. Moreover, it eventually reaches high quality results comparable to all other tested approaches.

The LiarTree presented in Chapter 13 improves the ClusTree w.r.t. overlapping of inner entries and incorporates explicit noise handling for robust anytime stream clustering. It allows for the transition from local noise buffers to new entries (micro-clusters) and grows novel subtrees top-down using its liar concept. Experimental evaluation shows for various data stream scenarios that the LiarTree outperforms competing approaches in the presence of noise and evolving data, proving its novel concepts to be effective.

Chapter 14 discusses an application of the ClusTree for anytime outlier detection. The performance of the proposed AnyOut algorithm is analyzed on both varying and constant data streams. The promising results of AnyOut motivate further research in both anytime clustering and anytime outlier detection.

Finally, results of ongoing stream clustering research are presented in Chapter 15. The MOA open source framework for stream data mining is introduced, which contains evaluation methods for both stream classification and stream clustering. An extensible set of stream generators, mining algorithms and evaluation measures is contained in MOA and publicly available on the project homepage. In Section 15.2 the cluster mapping measure (CMM) is announced as a novel evaluation measure for clustering on evolving data streams. The CMM is the first measure that takes the special requirements of the streaming scenario into account.
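The micro-clusters maintained by the ClusTree are compact additive summaries of the stream. The following sketch shows such a cluster feature (n, LS, SS) with an exponential decay factor 2^(-λ·Δt), from which per-dimension mean and variance can be derived. It is an illustration under assumptions (class and method names are hypothetical), not the thesis code.

```python
import time

class DecayingCF:
    """Illustrative cluster feature (n, LS, SS) with exponential decay 2^(-lam*dt)."""

    def __init__(self, dim, lam=0.01):
        self.n = 0.0
        self.LS = [0.0] * dim          # linear sum per dimension
        self.SS = [0.0] * dim          # squared sum per dimension
        self.lam = lam
        self.t = time.time()           # time of last update

    def _decay(self, now):
        f = 2.0 ** (-self.lam * (now - self.t))
        self.n *= f
        self.LS = [v * f for v in self.LS]
        self.SS = [v * f for v in self.SS]
        self.t = now

    def insert(self, x, now=None):
        """Age the summary and add a new point x."""
        now = time.time() if now is None else now
        self._decay(now)
        self.n += 1.0
        for i, xi in enumerate(x):
            self.LS[i] += xi
            self.SS[i] += xi * xi

    def mean(self):
        return [v / self.n for v in self.LS]

    def variance(self):
        return [max(s / self.n - (l / self.n) ** 2, 0.0)
                for s, l in zip(self.SS, self.LS)]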

Main contributions. The proposed methods enable for the first time anytime Bayesian classification on continuous attributes and thoroughly study possible heuristics and improvements. The proposed meta-approaches largely widen the application area for anytime algorithms and prove their effectiveness in previously uncommon terrain: constant data streams. The anytime principle is introduced for two additional mining tasks, namely clustering and outlier detection, and the first anytime stream clustering method is presented. Finally, many future research objectives are opened by the results of this thesis. First steps such as the MOA framework and the cluster mapping measure are documented, and further opportunities are discussed in Chapters 10 and 16 and the remainder of this section.

Outlook. Future research objectives specific to stream classification and stream clustering have been discussed in the last chapters of Parts II and III, respectively. Combining the results from both parts, a first option is to incorporate the decay from the ClusTree into the Bayes tree and evaluate the classification performance on changing distributions. Extending the proposed approaches to allow for semi-supervised learning or multi-label classification are interesting open topics.

The MOA framework constitutes work in progress. The framework can be used for the analysis of stream clustering algorithms and experimental comparison of newly developed algorithms; MOA provides a basis for thorough evaluation studies in the area of stream clustering. Many extensions of MOA are conceivable; including algorithms and evaluation methods for anytime classification, and for anytime mining in general, is among the next steps. Further ongoing and open topics w.r.t. the MOA framework are the integration of additional mining tasks such as frequent pattern mining on evolving data streams.

Anytime algorithms can be further investigated for the tasks discussed in this thesis and beyond. Different approaches for clustering or outlier detection under the anytime paradigm can be developed, for example. Anytime algorithms can also be devised for related areas such as search and retrieval or relevance feedback systems.

Stream mining in general provides open research questions for performing more complex tasks on streams. Examples include subspace clustering on evolving data or graph mining on streams. Collecting and creating benchmark data sets for stream mining, similar in spirit to the UCI repository, is an important step towards enabling more repeatable and fairer evaluation and comparison of proposed methods.

Part V

Appendices


Bibliography

[ABKS99] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. Optics: Ordering points to identify the clustering structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 49–60, 1999.

[AF07] Fabrizio Angiulli and Fabio Fassetti. Detecting distance-based outliers in streams of data. In Proceedings of the ACM Conference on Information and Knowledge Management, CIKM, pages 811–820, 2007.

[Agg05] Charu C. Aggarwal. On abnormality detection in spuriously populated data streams. In SIAM Data Mining, 2005.

[Agg06] Charu C. Aggarwal. Data Streams: Models and Algorithms. Springer Verlag, 2006.

[AGGR98] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings ACM SIGMOD International Conference on Management of Data, pages 94–105, 1998.

[AGI+92] Rakesh Agrawal, Sakti P. Ghosh, Tomasz Imielinski, Balakrishna R. Iyer, and Arun N. Swami. An interval classifier for database mining applications. In Proceedings of the International Conference on Very Large Data Bases, pages 560–573, 1992.


[AHWY03] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. A framework for clustering evolving data streams. In Proceedings of the International Conference on Very Large Data Bases, pages 81–92, 2003.

[AHWY04] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. A framework for projected clustering of high dimensional data streams. In Proceedings of the International Conference on Very Large Data Bases, pages 852–863, 2004.

[AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 207–216, 1993.

[AKAS08] Ira Assent, Ralph Krieger, Farzad Afschari, and Thomas Seidl. The ts-tree: Efficient time series search and retrieval. In International Conference on Extending Data Base Technology, pages 252–263, 2008.

[AKBS10] Ira Assent, Philipp Kranen, Corinna Baldauf, and Thomas Seidl. Detecting outliers on arbitrary data streams using anytime approaches. In Proceedings of the International Workshop on Novel Data Stream Pattern Mining Techniques (StreamKDD 2010) in conjunction with 16th ACM SIGKDD, pages 10–16, 2010.

[AKKS99] Mihael Ankerst, Gabi Kastenmüller, Hans-Peter Kriegel, and Thomas Seidl. Nearest neighbor classification in 3d protein databases. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pages 34–43, 1999.

[AKMS08] Ira Assent, Ralph Krieger, Emmanuel Müller, and Thomas Seidl. Edsc: efficient density-based subspace clustering. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 1093–1102, 2008.

[AKS10] Ira Assent, Hardy Kremer, and Thomas Seidl. Speeding up complex video copy detection queries. In International Conference on Database Systems for Advanced Applications, pages 307–321. Springer, 2010.

[ALM+10] Marcel R. Ackermann, Christiane Lammersen, Marcus Märtens, Christoph Raupach, Christian Sohler, and Kamil Swierkot. Streamkm++: A clustering algorithm for data streams. In Proceedings of the Workshop on Algorithm Engineering and Experiments, pages 173–187, 2010.

[AN98] Jochen Alber and Rolf Niedermeier. On multi-dimensional hilbert indexings. In Proceedings of the International Conference on Computing and Combinatorics, pages 329–338, 1998.

[APW+99] Charu C. Aggarwal, Cecilia Magdalena Procopiuc, Joel L. Wolf, Philip S. Yu, and Jong Soo Park. Fast algorithms for projected clustering. In Proceedings ACM SIGMOD International Conference on Management of Data, pages 61–72, 1999.

[AS94] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the International Conference on Very Large Data Bases, pages 487–499, 1994.

[AS95] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Proceedings of the International Conference on Data Engineering, pages 3–14, 1995.

[AS96] Rakesh Agrawal and John C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6):962–969, 1996.

[AS04] D. Andre and P. Stone. Physiological data modeling contest (ICML-2004): http://www.cs.utexas.edu/users/pstone/workshops/2004icml/, 2004.

[AV07] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.

[AY00] Charu C. Aggarwal and Philip S. Yu. Finding generalized projected clusters in high dimensional spaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 70–81, 2000.

[BB01] Pierre Baldi and Søren Brunak. Bioinformatics - the machine learning approach (2. ed.). MIT Press, 2001.

[BBD+02] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the Symposium on Principles of Database Systems, pages 1–16. ACM, 2002.

[BC00] Daniel Barbará and Ping Chen. Using the fractal dimension to cluster datasets. In Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining, pages 260–264, 2000.

[Ben75] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18:509–517, 1975.

[BEX02] Florian Beil, Martin Ester, and Xiaowei Xu. Frequent term-based text clustering. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 436–442, 2002.

[BEY98] Allan Borodin and Ran El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 1998.

[BFOS84] Leo Breiman, Jerome Friedman, R. Olshen, and Charles Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[BFSO84] Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. Classification and Regression Trees. Chapman & Hall, New York, 1984.

[BG07] Albert Bifet and Ricard Gavaldà. Learning from time-changing data with adaptive windowing. In Proceedings of the SIAM International Conference on Data Mining, 2007.

[BGH+97] Newton L. Bowers, Hans U. Gerber, James C. Hickman, Donald A. Jones, and Cecil J. Nesbitt. Actuarial Mathematics. Society of Actuaries, Itasca, IL, 1997.

[BH75] Frank B. Baker and Lawrence J. Hubert. Measuring the power of hierarchical cluster analysis. Journal of the American Statistical Association (ASA), 70(349):31–38, 1975.

[BH90] Jack S. Breese and Eric Horvitz. Ideal reformulation of belief networks. In Proceedings of the Annual Conference on Uncertainty in Artificial Intelligence, pages 129–144, 1990.

[BH02] James C. Bezdek and Richard J. Hathaway. Vat: a tool for visual assessment of (cluster) tendency. In Proceedings of the International Joint Conference on Neural Networks, volume 3, pages 2225–2230, 2002.

[BHKP10] Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. MOA: Massive Online Analysis http://sourceforge.net/projects/moa-datastream/. Journal of Machine Learning Research (JMLR), 2010.

[BHP+09] Albert Bifet, Geoffrey Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà. New ensemble methods for evolving data streams. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 139–148, 2009.

[BHP+10a] Albert Bifet, Geoffrey Holmes, Bernhard Pfahringer, Philipp Kranen, Hardy Kremer, Timm Jansen, and Thomas Seidl. MOA: Massive online analysis, a framework for stream classification and clustering. In Journal of Machine Learning Research (JMLR), volume 11, pages 44–50, 2010.

[BHP+10b] Albert Bifet, Geoffrey Holmes, Bernhard Pfahringer, Philipp Kranen, Hardy Kremer, Timm Jansen, and Thomas Seidl. MOA: Massive online analysis, a framework for stream classification and clustering. In Invited presentation at the International Workshop on Handling Concept Drift in Adaptive Information Systems, in conjunction with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2010.

[BHPG09] Albert Bifet, Geoffrey Holmes, Bernhard Pfahringer, and Ricard Gavaldà. Improving adaptive bagging methods for evolving data streams. In Asian Conference on Machine Learning, pages 23–37, 2009.

[BHS08] Mirko Böttcher, Frank Höppner, and Myra Spiliopoulou. On exploiting the power of time in data mining. SIGKDD Explorations, 10(2):3–11, 2008.

[Bif10] Albert Bifet. Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. Frontiers in Artificial Intelligence and Applications. IOS Press, 2010.

[BKK96] Stefan Berchtold, Daniel A. Keim, and Hans-Peter Kriegel. The x-tree: An index structure for high-dimensional data. In International Conference on Very Large Data Bases, pages 28–39, 1996.

[BKNS00] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. Lof: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104, 2000.

[BKSS90] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, pages 322–331, 1990.

[BL94] Vic Barnett and Toby Lewis. Outliers in Statistical Data. Wiley, 3rd ed., 1994.

[BMH+05] Sanghamitra Bandyopadhyay, Ujjwal Maulik, Lawrence B. Holder, Diane J. Cook, Mohamed Gaber, Shonali Krishnaswamy, and Arkady Zaslavsky. On-board mining of data streams in sensor networks. In Lakhmi Jain and Xindong Wu, editors, Advanced Methods for Knowledge Discovery from Complex Data, Advanced Information and Knowledge Processing, pages 307–335. Springer London, 2005.

[Bod91] Mark S. Boddy. Anytime problem solving using dynamic programming. In AAAI, pages 738–743, 1991.

[BPS06] Christian Böhm, Alexey Pryakhin, and Matthias Schubert. The gauss-tree: Efficient object identification in databases of probabilistic feature vectors. In Proceedings of the International Conference on Data Engineering, page 9, 2006.

[Bra97] Andrew P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.

[Bre96] Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.

[Bre01] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[BSH+07] Marcel Brun, Chao Sima, Jianping Hua, James Lowey, Brent Carroll, Edward Suh, and Edward R. Dougherty. Model-based evaluation of clustering validation measures. Pattern Recognition, 40(3):807–824, 2007.

[Bur98] Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov., 2(2):121–167, 1998.

[CBB08] Leonardo Weiss Ferreira Chaves, Erik Buchmann, and Klemens Böhm. Tagmark: reliable estimations of rfid tags for business processes. In International Conference on Knowledge Discovery and Data Mining, pages 999–1007, 2008.

[CCH91] Yandong Cai, Nick Cercone, and Jiawei Han. Attribute-oriented induction in relational databases. In Knowledge Discovery in Databases, pages 213–228. AAAI/MIT Press, 1991.

[CDH+02] Yixin Chen, Guozhu Dong, Jiawei Han, Benjamin W. Wah, and Jianyong Wang. Multi-dimensional regression analysis of time-series data streams. In International Conference on Very Large Data Bases, pages 323–334, 2002.

[CEQZ06] Feng Cao, Martin Ester, Weining Qian, and Aoying Zhou. Density-based clustering over an evolving data stream with noise. In Proceedings of the SIAM International Conference on Data Mining, 2006.

[CFZ99] Chun Hung Cheng, Ada Wai-Chee Fu, and Yi Zhang. Entropy-based subspace clustering for mining numerical data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 84–93, 1999.

[Che00] William Cheetham. Case-based reasoning with confidence. In Advances in Case-Based Reasoning, (EWCBR), pages 15–25, 2000.

[CHOY08] Jia-Yu Chen, John R. Hershey, Peder A. Olsen, and Emmanuel Yashchin. Accelerated monte carlo for kullback-leibler divergence between gaussian mixture models. In Proceedings of the IEEE International Conference on Acoustics, pages 4553–4556, 2008.

[CI98] Chee-Yong Chan and Yannis E. Ioannidis. Bitmap index design and evaluation. In International Conference on Management of Data, pages 355–366. ACM, 1998.

[CM96] Chris Clifton and Don Marks. Security and privacy implications of data mining. In Proceedings of the SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 15–20, 1996.

[Com79] Douglas Comer. Ubiquitous b-tree. ACM Computing Surveys, 11:121–137, 1979.

[CPZ97] Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method for similarity search in metric spaces. In International Conference on Very Large Data Bases, pages 426–435, 1997.

[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory (2nd Edition). Wiley-Interscience, New York, NY, USA, 2006.

[CT07] Yixin Chen and Li Tu. Density-based clustering for real-time stream data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142, 2007.

[CZ06] Xin Chen and Chengcui Zhang. An interactive semantic video mining and retrieval platform–application in transportation surveillance video for incident detection. In International Conference on Data Mining, pages 129–138, 2006.

[CZSC10] Hui Cao, Yongluan Zhou, Lidan Shou, and Gang Chen. Attribute outlier detection over data streams. In Database Systems for Advanced Applications, pages 216–230, 2010.

[Das90] Belur V. Dasarathy. Nearest neighbor (NN) norms: NN pattern classification techniques. IEEE Computer Society Press, 1990.

[DB88] Thomas Dean and Mark S. Boddy. An analysis of time-dependent planning. In AAAI, pages 49–54, 1988.

[DCDZ05] Sarah Jane Delany, Padraig Cunningham, Dónal Doyle, and Anton Zamolotskikh. Generating estimates of classification confidence for a case-based spam filter. In International Conference on Case-Based Reasoning, pages 177–190, 2005.

[DCP08] Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In Proceedings of the International Conference on Machine Learning, pages 264–271, 2008.

[DE84] William H. E. Day and Herbert Edelsbrunner. Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification, 1:7–24, 1984. 10.1007/BF01890115.

[DeC97] Dennis DeCoste. Mining multivariate time-series sensor data to discover behavior envelopes. In International Conference on Knowledge Discovery and Data Mining, pages 151–154, 1997.

[DeC02] Dennis DeCoste. Anytime interval-valued outputs for kernel machines: Fast support vector machine classification via distance geometry. In Proceedings of the International Conference on Machine Learning, pages 99–106, 2002.

[DeC03] Dennis DeCoste. Anytime query-tuned kernel machines via cholesky factorization. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[DH00] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80, 2000.

[DHN10] Thomas Deselaers, Georg Heigold, and Hermann Ney. Object classification by fusing svms and gaussian mixtures. Pattern Recognition, 43(7):2476–2484, 2010.

[DHS01] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd Edition). Wiley, 2001.

[DIIM04] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, pages 253–262, 2004.

[dL77] Jan de Leeuw. Applications of convex analysis to multidimensional scaling. In Recent Developments in Statistics, pages 133–146. North Holland Publishing Company, 1977.

[DLR77] Arthur P. Dempster, N. M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[DM03] Dennis DeCoste and Dominic Mazzoni. Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. In Proceedings of the International Conference on Machine Learning, pages 115–122, 2003.

[Dun74] J.C. Dunn. Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4:95–104, 1974.

[dVCH10] Timothy de Vries, Sanjay Chawla, and Michael E. Houle. Finding local anomalies in very high dimensional space. In Proceedings of the International Conference on Management of Data, pages 128–137, 2010.

[Edw00] David M. Edwards. Introduction to Graphical Modelling. Springer Verlag, 2000.

[EKSX96] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–231, 1996.

[ELN+08] Manzoor Elahi, Kun Li, Wasif Nisar, Xinjie Lv, and Hongan Wang. Efficient clustering-based outlier detection algorithm for dynamic data stream. In International Conference on Fuzzy Systems and Knowledge Discovery, pages 298–304, 2008.

[EM11] Saher Esmeir and Shaul Markovitch. Anytime learning of anycost classifiers. Machine Learning, 82(3):445–473, 2011.

[EW95] Michael D. Escobar and Mike West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.

[FA10] A. Frank and A. Asuncion. UCI machine learning repository, 2010.

[FG08] Conny Franke and Michael Gertz. Detection and exploration of outlier regions in sensor data streams. In Workshops Proceedings of the IEEE International Conference on Data Mining, pages 375–384, 2008.

[FGMP09] M. Julia Flores, José A. Gámez, Ana M. Martínez, and Jose Miguel Puerta. Gaode and haode: two proposals based on aode to deal with continuous variables. In Proceedings of the International Conference on Machine Learning, pages 40–47, 2009.

[FI93] Usama M. Fayyad and Keki B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the International Joint Conferences on Artificial Intelligence, pages 1022–1029, 1993.

[Fis87] Douglas H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2):139–172, 1987.

[FM83] E. B. Fowlkes and Colin L. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569, 1983.

[FS96] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the International Conference on Machine Learning, pages 148–156, 1996.

[FS97] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119–139, 1997.

[FZZ09] Andrew Foss, Osmar R. Zaïane, and Sandra Zilles. Unsupervised class separation of multivariate data through cumulative variance-based ranking. In The IEEE International Conference on Data Mining, pages 139–148, 2009.

[GD04] Daniel Grossman and Pedro Domingos. Learning bayesian network classifiers by maximizing conditional likelihood. In Proceedings of the International Conference on Machine Learning, 2004.

[GG07] João Gama and Mohamed Gaber. Learning from Data Streams – Processing Techniques in Sensor Networks. Springer Verlag, 2007.

[GH07] Nizar Grira and Michael E. Houle. Best of both: a hybridized centroid-medoid clustering heuristic. In Proceedings of the International Conference on Machine Learning, pages 313–320, 2007.

[GKNN91] P. S. Gopalakrishnan, Dimitri Kanevsky, Arthur Nádas, and David Nahamoo. An inequality for rational functions with applications to some statistical estimation problems. IEEE Transactions on Information Theory, 37(1):107–113, 1991.

[GKS01] Johannes Gehrke, Flip Korn, and Divesh Srivastava. On computing correlated aggregates over continual data streams. In Proceedings of the International Conference on Management of Data, pages 13–24, 2001.

[GLF89] John H. Gennari, Pat Langley, and Douglas H. Fisher. Models of incremental concept formation. Artificial Intelligence, 40(1-3):11–61, 1989.

[GM03] Alexander G. Gray and Andrew W. Moore. Nonparametric density estimation: Toward computational tractability. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[GMR04] Joao˜ Gama, Pedro Medas, and Ricardo Rocha. Forest trees for on-line data. In ACM Symposium on Applied Computing, pages 632–636, 2004.

[GR04] Jacob Goldberger and Sam T. Roweis. Hierarchical clustering of a mixture model. In Advances in Neural Information Processing Systems, 2004.

[GSR09] João Gama, Raquel Sebastião, and Pedro Pereira Rodrigues. Issues in evaluation of stream learning algorithms. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 329–338, 2009.

[Guh09] Sudipto Guha. Tight results for clustering and summarizing data streams. In International Conference on Database Theory, pages 268–275, 2009.

[Gut84] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD Conference, pages 47–57, 1984.

[GZ96] Joshua Grass and Shlomo Zilberstein. Anytime algorithm development tools. SIGART Bulletin, 7(2):20–27, 1996.

[GZK05] Mohamed Medhat Gaber, Arkady B. Zaslavsky, and Shonali Krishnaswamy. Mining data streams: a review. SIGMOD Record, 34(2):18–26, 2005.

[HA85] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

[Har75] John Anthony Hartigan. Clustering Algorithms. John Wiley and Sons, New York, 1975.

[Haw80] Douglas M. Hawkins. Identification of outliers. Chapman and Hall New York, 1980.

[HB87] Stephen J. Hanson and David J. Burr. Minkowski back-propagation: Learning in connectionist models with non-euclidean error signals. In Neural Information Processing Systems. American Institute of Physics, 1987.

[HB99] Seth Hettich and Stephen D. Bay. The UCI KDD archive http://kdd.ics.uci.edu, 1999.

[HBH05] Jacalyn M. Huband, James C. Bezdek, and Richard J. Hathaway. bigvat: Visual assessment of cluster tendency for large data sets. Pattern Recognition, 38(11):1875–1886, 2005.

[HBH06] Richard J. Hathaway, James C. Bezdek, and Jacalyn M. Huband. Scalable visual assessment of cluster tendency for large data sets. Pattern Recognition, 39(7):1315–1324, 2006.

[HDY99] Jiawei Han, Guozhu Dong, and Yiwen Yin. Efficient mining of partial periodic patterns in time series database. In Proceedings of the International Conference on Data Engineering, pages 106–115, 1999.

[HFH+09] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.

[HK98] Alexander Hinneburg and Daniel A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 58–65, 1998.

[HK01] Jiawei Han and Michele Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.

[HK06] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Second Edition. Morgan Kaufmann, 2006.

[HKK+10] Michael E. Houle, Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Can shared-neighbor distances defeat the curse of dimensionality? In International Conference on Scientific and Statistical Database Management, pages 482–500, 2010.

[HL76] Lawrence J. Hubert and Joel R. Levin. A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin, 83(6):1072–1080, 1976.

[HLZ02] Wynne Hsu, Mong-Li Lee, and Ji Zhang. Image mining: Trends and developments. Journal of Intelligent Information Systems, 19(1):7–23, 2002.

[Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

[Hou08] Michael E. Houle. The relevant-set correlation model for data clustering. In Proceedings of the SIAM International Conference on Data Mining, pages 775–786, 2008.

[HPY00] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 1–12, 2000.

[HS95] Gísli R. Hjaltason and Hanan Samet. Ranking in spatial databases. In 4th International Symposium on Advances in Spatial Databases (SSD), pages 83–95, 1995.

[HS05] Michael E. Houle and Jun Sakuma. Fast approximate similarity search in extremely high-dimensional data sets. In Proceedings of the International Conference on Data Engineering, pages 619–630, 2005.

[HSD01] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97–106, 2001.

[HTF02] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Springer, 2002.

[HTF08] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. Datasets for "The Elements of Statistical Learning": http://www-stat.stanford.edu/~tibs/elemstatlearn/, 2008.

[HXD03] Zengyou He, Xiaofei Xu, and Shengchun Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9-10):1641–1650, 2003.

[JA03] Ruoming Jin and Gagan Agrawal. Efficient decision tree construction on streaming data. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 571–576, 2003.

[Jac88] Robert A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307, 1988.

[Jen96] Finn V. Jensen. An Introduction to Bayesian Networks. Springer Verlag, 1996.

[JL95] George H. John and Pat Langley. Estimating continuous distributions in bayesian classifiers. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 338–345, 1995.

[JTH01] Wen Jin, Anthony K. H. Tung, and Jiawei Han. Mining top-n local outliers in large databases. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 293–298, 2001.

[JZC06] Ankur Jain, Zhihua Zhang, and Edward Y. Chang. Adaptive non-linear clustering in data streams. In CIKM, pages 122–131, 2006.

[KABS09] Philipp Kranen, Ira Assent, Corinna Baldauf, and Thomas Seidl. Self-adaptive anytime stream clustering. In Proceedings of the IEEE International Conference on Data Mining, pages 249–258, 2009.

[KABS11] Philipp Kranen, Ira Assent, Corinna Baldauf, and Thomas Seidl. The ClusTree: Indexing micro-clusters for anytime stream mining. Knowledge and Information Systems Journal (KAIS), 29(2):249–272, 2011.

[KBP+09] Saim Kim, L. Beckmann, M. Pistor, L. Cousin, Marian Walter, and Steffen Leonhardt. A versatile body sensor network for health care applications. In Proceedings of the International Conference on Intelligent Sensors, Sensor Networks and Information Processing, pages 175–180, 2009.

[KCB+09] Saim Kim, L. Cousin, L. Beckmann, Marian Walter, and Steffen Leonhardt. A body sensor network base support system for automated bioimpedance spectroscopy measurements. In World Congress on Medical Physics and Biomedical Engineering, 2009.

[Ken38] Maurice G. Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 1938.

[KGFS10] Philipp Kranen, Stephan Günnemann, Sergej Fries, and Thomas Seidl. MC-Tree: Improving bayesian anytime classification. In Proceedings of the International Conference on Scientific and Statistical Database Management, pages 252–269, 2010.

[KGI+11] Hardy Kremer, Stephan Günnemann, Anca Maria Ivanescu, Ira Assent, and Thomas Seidl. Efficient processing of multiple dtw queries in time series databases. In International Conference on Scientific and Statistical Database Management. Springer, 2011.

[KHDM98] Josef Kittler, Mohamad Hatef, Robert P. W. Duin, and Jiri Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:226–239, 1998.

[KHK99] George Karypis, Eui-Hong Han, and Vipin Kumar. Chameleon: Hierarchical clustering using dynamic modeling. IEEE Computer, 32(8):68–75, 1999.

[KHY+09] Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar. Next Generation of Data Mining. CRC Press, 2009.

[KJ00] Ralf Klinkenberg and Thorsten Joachims. Detecting concept drift with support vector machines. In Proceedings of the International Conference on Machine Learning, pages 487–494, 2000.

[KKDS10] Philipp Kranen, Ralph Krieger, Stefan Denker, and Thomas Seidl. Bulk loading hierarchical mixture models for efficient stream classification. In Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Part II, pages 325–334, 2010.

[KKJ+10a] Philipp Kranen, Hardy Kremer, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Benchmarking stream clustering algorithms within the MOA framework. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.

[KKJ+10b] Philipp Kranen, Hardy Kremer, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Clustering performance on evolving data streams: Assessing algorithms and evaluation measures within MOA. In Proceedings of the IEEE International Conference on Data Mining, Workshops, pages 1400–1403, 2010.

[KKJ+11] Hardy Kremer, Philipp Kranen, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. An effective evaluation measure for clustering on evolving data streams. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 868–876. ACM, 2011.

[KKK+08] Philipp Kranen, David Kensche, Saim Kim, Nadine Zimmermann, Emmanuel Müller, Christoph Quix, Xiang Li, Thomas Gries, Thomas Seidl, Matthias Jarke, and Steffen Leonhardt. Mobile mining and information management in healthnet scenarios. In Proceedings of the International Conference on Mobile Data Management, pages 215–216, 2008.

[KKSZ09] Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Loop: local outlier probabilities. In Proceedings of the ACM Conference on Information and Knowledge Management, pages 1649–1652, 2009.

[KMA+10] Philipp Kranen, Emmanuel Müller, Ira Assent, Ralph Krieger, and Thomas Seidl. Incremental learning of medical data for multi-step patient health classification. In Plant C., Böhm C. (eds.): Database Technology for Life Sciences and Medicine, World Scientific Publishing, pages 321–344, 2010.

[KNT00] Edwin M. Knorr, Raymond T. Ng, and Vladimir Tucakov. Distance-based outliers: Algorithms and applications. VLDB Journal, 8(3-4):237–253, 2000.

[Koh96] Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 202–207, 1996.

[KP02] Eamonn J. Keogh and Michael J. Pazzani. Learning the structure of augmented bayesian classifiers. International Journal on Artificial Intelligence Tools, 11(4):587–601, 2002.

[KPM06] Maria Kontaki, Apostolos N. Papadopoulos, and Yannis Manolopoulos. Efficient incremental subspace clustering in data streams. In International Database Engineering and Applications Symposium, pages 53–60, 2006.

[KPS00] Hans-Peter Kriegel, Marco Pötke, and Thomas Seidl. Managing intervals efficiently in object-relational databases. In International Conference on Very Large Data Bases, pages 407–418, 2000.

[KR90] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Interscience, New York, 1990.

[KRSS11] Philipp Kranen, Felix Reidl, Fernando Sánchez Villaamil, and Thomas Seidl. Hierarchical clustering for real-time stream data with noise. In Proceedings of the International Conference on Scientific and Statistical Database Management, pages 405–413. Springer, 2011.

[KS09a] Philipp Kranen and Thomas Seidl. Harnessing the strengths of anytime algorithms for constant data streams. Data Mining and Knowledge Discovery Journal, 19(2):245–260, 2009.

[KS09b] Philipp Kranen and Thomas Seidl. Harnessing the strengths of anytime algorithms for constant data streams. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, page 31, 2009.

[KSK02] Aleksander Kolcz, Xiaomei Sun, and Jugal Kalita. Efficient handling of high-dimensional feature spaces by randomized classifier ensembles. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 307–313, 2002.

[KSZ08] Hans-Peter Kriegel, Matthias Schubert, and Arthur Zimek. Angle-based outlier detection in high-dimensional data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 444–452, 2008.

[Lau95] Steffen L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191–201, 1995.

[Law00] Jonathan K. Lawder. Calculation of mappings between one and n-dimensional values using the hilbert space-filling curves. In Technical Report JL1/00, Birkbeck College, University of London, 2000.

[LBB02] William B. Langdon, Steven J. Barrett, and Bernard F. Buxton. Combining decision trees and neural networks for drug discovery. In Proceedings of the European Conference on Genetic Programming, pages 60–70, 2002.

[LC08] Guopin Lin and Leisong Chen. A grid and fractal dimension-based data stream clustering algorithm. In Proceedings of the International Symposium on Information Science and Engineering, volume 1, pages 66–70. IEEE Computer Society, 2008.

[LEL97] Scott T. Leutenegger, Jeffrey M. Edgington, and Mario A. Lopez. Str: A simple and efficient algorithm for R-tree packing. In Proceedings of the International Conference on Data Engineering, pages 497–506, 1997.

[LFG+08] Maxim Likhachev, Dave Ferguson, Geoffrey J. Gordon, Anthony Stentz, and Sebastian Thrun. Anytime search in dynamic graphs. Artif. Intell., 172(14):1613–1643, 2008.

[LGT03] Maxim Likhachev, Geoffrey J. Gordon, and Sebastian Thrun. Ara*: Anytime a* with provable bounds on sub-optimality. In NIPS, 2003.

[Lia05] T. Warren Liao. Clustering of time series data - a survey. Pattern Recognition, 38(11):1857–1874, 2005.

[LJF94] King-Ip Lin, Hosagrahar V. Jagadish, and Christos Faloutsos. The tv-tree: An index structure for high-dimensional data. VLDB Journal, 3(4):517–542, 1994.

[LK05] Aleksandar Lazarevic and Vipin Kumar. Feature bagging for outlier detection. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 157–166, 2005.

[LL08] Jae Woo Lee and Won Suk Lee. A coarse-grain grid-based subspace clustering method for online multi-dimensional data streams. In Proceedings of the ACM Conference on Information and Knowledge Management, pages 1521–1522, 2008.

[LL09] Sebastian Lühr and Mihai Lazarescu. Incremental clustering of dynamic data streams using connectivity based representative points. Data Knowl. Eng., 68(1):1–27, 2009.

[Llo57] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2):128-137, 1982. Original version: Technical Report, Bell Labs, 1957.

[LP03] Kelvin T. Leung and Douglas Stott Parker. Empirical comparisons of various voting methods in bagging. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 595–600, 2003.

[LW96] Chao-Lin Liu and Michael P. Wellman. On state-space abstraction for anytime evaluation of bayesian networks. SIGART Bulletin, 7(2):50–57, 1996.

[Mac67] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

[MAG+09a] Emmanuel Müller, Ira Assent, Stephan Günnemann, Ralph Krieger, and Thomas Seidl. Relevant subspace clustering: Mining the most interesting non-redundant concepts in high dimensional data. In International Conference on Data Mining, pages 377–386, 2009.

[MAG+09b] Emmanuel Müller, Ira Assent, Stephan Günnemann, Timm Jansen, and Thomas Seidl. OpenSubspace: An open source framework for evaluation and exploration of subspace clustering algorithms in weka. In Proceedings of the Open Source in Data Mining Workshop in conjunction with the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 2–13, 2009.

[MAK+08] Emmanuel Müller, Ira Assent, Ralph Krieger, Timm Jansen, and Thomas Seidl. Morpheus: interactive exploration of subspace clustering. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1089–1092, 2008.

[MBZ+07] Guillermo Medrano, L. Beckmann, Nadine Zimmermann, T. Grundmann, Thomas Gries, and Steffen Leonhardt. Bioimpedance spectroscopy with textile electrodes for a continuous monitoring application. In International Workshop on Wearable and Implantable Body Sensor Networks, volume 13, pages 23–28. Springer Berlin Heidelberg, 2007.

[Mei05] Marina Meila. Comparing clusterings: an axiomatic view. In Proceedings of the International Conference on Machine Learning, pages 577–584, 2005.

[Mil80] Glenn W. Milligan. An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45(3):325–342, 1980.

[MM93] Oded Maron and Andrew W. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Proceedings of the Advances in Neural Information Processing Systems, pages 59–66, 1993.

[MSE06] Gabriela Moise, Jörg Sander, and Martin Ester. P3c: A robust projected clustering algorithm. In Proceedings of the IEEE International Conference on Data Mining, pages 414–425, 2006.

[MSS11] Emmanuel Müller, Matthias Schiffer, and Thomas Seidl. Statistical selection of relevant subspace projections for outlier ranking. In Proceedings of the 27th International Conference on Data Engineering, pages 434–445, 2011.

[MSV04] Sambavi Muthukrishnan, Rahul Shah, and Jeffrey Scott Vitter. Mining deviants in time series data streams. In Proceedings of the International Conference on Scientific and Statistical Database Management, pages 41–50, 2004.

[New03] Mark E. J. Newman. The Structure and Function of Complex Networks. SIAM Review, 45(2):167–256, 2003.

[NH94] Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the International Conference on Very Large Data Bases, pages 144–155, 1994.

[NR06] Olfa Nasraoui and Carlos Rojas. Robust clustering for tracking noisy evolving data streams. In Proceedings of the SIAM International Conference on Data Mining, 2006.

[NUCG03] Olfa Nasraoui, Cesar Cardona Uribe, Carlos Rojas Coronel, and Fabio A. González. Tecno-streams: Tracking evolving clusters in noisy data streams with a scalable immune system learning model. In Proceedings of the IEEE International Conference on Data Mining, pages 235–242, 2003.

[OMM+02] Liadan O’Callaghan, Adam Meyerson, Rajeev Motwani, Nina Mishra, and Sudipto Guha. Streaming-data algorithms for high-quality clustering. In Proceedings of the International Conference on Data Engineering, pages 685–694, 2002.

[PCY95] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. Efficient parallel data mining for association rules. In Proceedings of the International Conference on Information and Knowledge Management, pages 31–36, 1995.

[PHBB09] Biswanath Panda, Joshua Herbach, Sugato Basu, and Roberto J. Bayardo. Planet: Massively parallel learning of tree ensembles with mapreduce. Proceedings of the VLDB Endowment, 2(2):1426–1437, 2009.

[PHK07] Bernhard Pfahringer, Geoffrey Holmes, and Richard Kirkby. New options for hoeffding trees. In Australian Conference on Artificial Intelligence, pages 90–99, 2007.

[PL04] Nam Hun Park and Won Suk Lee. Statistical grid-based clustering over data streams. SIGMOD Record, 33(1):32–37, 2004.

[PL07] Nam Hun Park and Won Suk Lee. Grid-based subspace clustering over data streams. In Proceedings of the ACM Conference on Information and Knowledge Management, pages 801–810, 2007.

[PL08] Nam Hun Park and Won Suk Lee. Memory efficient subspace clustering for online data streams. In International Database Engineering and Applications Symposium, pages 199–208, 2008.

[Pla98] John Platt. Fast training of support vector machines using sequential minimal optimization. In Schoelkopf, Burges, and Smola, editors, Advances in Kernel Methods. MIT Press, 1998.

[PW10] Franz Pernkopf and Michael Wohlmayr. Large margin learning of bayesian classifiers based on gaussian mixture models. In European Conference on Machine Learning and Knowledge Discovery in Databases, pages 50–66, 2010.

[PZZ+07] Rong Pan, Junhui Zhao, Vincent Wenchen Zheng, Jeffrey Junfeng Pan, Dou Shen, Sinno Jialin Pan, and Qiang Yang. Domain-constrained semi-supervised mining of tracking models in sensor networks. In International Conference on Knowledge Discovery and Data Mining, pages 1023–1027, 2007.

[Qui86] John Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

[Qui93] John Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[Ran71] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.

[RC05] Fabio Tozeto Ramos and Fabio Gagliardi Cozman. Anytime anyspace probabilistic inference. International Journal of Approximate Reasoning, 38(1):53–80, 2005.

[RGI05] Jimena Rodríguez, Alfredo Goñi, and Arantza Illarramendi. Real-time classification of ecgs on a pda. Transactions on Information Technology in Biomedicine, 9(1):23–34, 2005.

[RH97] Y. Dan Rubinstein and Trevor Hastie. Discriminative vs informative learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 49–53, 1997.

[RH07] Andrew Rosenberg and Julia Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 410–420, 2007.

[RHW86] David E. Rumelhart, Geoffrey E. Hinton, and Robert J. Williams. Learning internal representations by error propagation. In Parallel distributed processing: explorations in the microstructure of cognition, vol. 1, pages 318–362. MIT Press, 1986.

[RN95] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.

[Roh74] F. James Rohlf. Methods of comparing classifications. Annual Review of Ecology and Systematics, 5:101–113, 1974.

[RTG00] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40:99–121, 2000.

[RWZH09] Jiadong Ren, Qunhui Wu, Jia Zhang, and Changzhen Hu. Efficient outlier detection algorithm for heterogeneous data streams. In International Conference on Fuzzy Systems and Knowledge Discovery, pages 259–264, 2009.

[Sag94] Hans Sagan. Space-Filling Curves. Springer, 1994.

[SAK+09] Thomas Seidl, Ira Assent, Philipp Kranen, Ralph Krieger, and Jennifer Herrmann. Indexing density models for incremental learning and anytime classification on data streams. In Proceedings of the International Conference on Extending Database Technology, pages 311–322, 2009.

[SCZ98] Gholamhosein Sheikholeslami, Surojit Chatterjee, and Aidong Zhang. Wavecluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of International Conference on Very Large Data Bases, pages 428–439, 1998.

[SdLFdCG07] Eduardo J. Spinosa, André Carlos Ponce de Leon Ferreira de Carvalho, and João Gama. Olindda: a cluster-based approach for detecting novelty and concept drift in data streams. In Proceedings of the ACM Symposium on Applied Computing, pages 448–452, 2007.

[Sei09] Thomas Seidl. Nearest neighbor classification. In Liu L., Özsu M. T. (eds.): Encyclopedia of Database Systems, pages 1885–1890. Springer, 2009.

[SG86] Jeffrey C. Schlimmer and Richard H. Granger. Incremental learning from noisy data. Machine Learning, 1(3):317–354, 1986.

[SG10] Md. Shiblee Sadik and Le Gruenwald. Dbod-ds: Distance based outlier detection for data streams. In International Conference on Database and Expert Systems Applications, pages 122–136, 2010.

[Sil86] Bernard W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 1986.

[SK97] Thomas Seidl and Hans-Peter Kriegel. Efficient user-adaptable similarity search in large multimedia databases. In International Conference on Very Large Data Bases, pages 506–515, 1997.

[SK98] Thomas Seidl and Hans-Peter Kriegel. Optimal multi-step k-nearest neighbor search. In Proceedings ACM SIGMOD International Conference on Management of Data, pages 154–165, 1998.

[SK01] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (sea) for large-scale classification. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 377–382, 2001.

[SM84] Gerard Salton and Michael McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1984.

[SNTS06] Myra Spiliopoulou, Irene Ntoutsi, Yannis Theodoridis, and Rene Schult. Monic: modeling and monitoring cluster transitions. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 706–711, 2006.

[Spe04] Charles Spearman. The Proof and Measurement of Association between Two Things. The American Journal of Psychology, 15(1):72–101, 1904.

[SPP+06] Sharmila Subramaniam, Themis Palpanas, Dimitris Papadopoulos, Vana Kalogeraki, and Dimitrios Gunopulos. Online outlier detection in sensor data using non-parametric models. In Proceedings of the International Conference on Very Large Data Bases, pages 187–198, 2006.

[SS05] Juliane Schäfer and Korbinian Strimmer. An empirical bayes approach to inferring large-scale gene association networks. Bioinformatics, 21(6):754–764, 2005.

[Sub98] V. S. Subrahmanian. Principles of Multimedia Database Systems. Morgan Kaufmann, 1998.

[SVvL06] Arno Siebes, Jilles Vreeken, and Matthijs van Leeuwen. Item sets that compress. In Proceedings of the SIAM International Conference on Data Mining, 2006.

[SZ08] Mingzhou (Joe) Song and Lin Zhang. Comparison of clus- ter representations from partial second- to full fourth-order cross moments for data stream clustering. In Proceedings of the IEEE International Conference on Data Mining, pages 560– 569, 2008.

[THH01] Anthony K. H. Tung, Jean Hou, and Jiawei Han. Spatial clustering in the presence of obstacles. In Proceedings of the International Conference on Data Engineering, pages 359–367, 2001.

[TLB04] Young Truong, Xiaodong Lin, and Chris Beecher. Learning a complex metabolomic dataset using random forests and support vector machines. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 835–840, 2004.

[TNLH01] Anthony K. H. Tung, Raymond T. Ng, Laks V. S. Lakshmanan, and Jiawei Han. Constraint-based clustering in large databases. In International Conference on Database Theory, pages 405–419, 2001.

[TRA07] Dimitris K. Tasoulis, Gordon J. Ross, and Niall M. Adams. Visualising the cluster structure of data streams. In International Symposium on Intelligent Data Analysis, pages 81–92, 2007.

[TTD+08] Liang Tang, Chang-jie Tang, Lei Duan, Chuan Li, Yexi Jiang, Chunqiu Zeng, and Jun Zhu. MovStream: An efficient algorithm for monitoring clusters evolving in data streams. In The IEEE International Conference on Granular Computing, pages 582–587, 2008.

[URW07] Komkrit Udommanetanakit, Thanawin Rakthanmanon, and Kitsana Waiyamai. E-stream: Evolution-based technique for stream clustering. In International Conference on Advanced Data Mining and Applications, pages 605–615, 2007.

[UXKL06] Ken Ueno, Xiaopeng Xi, Eamonn J. Keogh, and Dah-Jye Lee. Anytime classification using the nearest neighbor algorithm with applications to stream mining. In Proceedings of the IEEE International Conference on Data Mining, pages 623–632, 2006.

[Vap95] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.

[Vap98] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.

[VBF+04] Vassilios S. Verykios, Elisa Bertino, Igor Nai Fovino, Loredana Parasiliti Provenza, Yücel Saygin, and Yannis Theodoridis. State-of-the-art in privacy preserving data mining. SIGMOD Record, 33(1):50–57, 2004.

[VD00] Stijn Van Dongen. Performance criteria for graph clustering and Markov cluster experiments. Report-Information systems, 1(12):1–36, 2000.

[VGN08] Nguyen Hoang Vu, Vivekanand Gopalkrishnan, and Praneeth Namburi. Online outlier detection based on relative neighbourhood dissimilarity. In Web Information Systems Engineering, pages 50–61, 2008.

[VL98] Nuno Vasconcelos and Andrew Lippman. Learning mixture hierarchies. In International Conference on Advances in Neural Information Processing Systems, pages 606–612, 1998.

[vLS08] Matthijs van Leeuwen and Arno Siebes. StreamKrimp: Detecting change in data streams. In European Conference on Machine Learning and Knowledge Discovery in Databases, pages 672–687, 2008.

[vR79] Cornelis J. van Rijsbergen. Information Retrieval. Butterworth, 1979.

[VS09] Rosa M. Valdovinos and José S. Sánchez. Combining multiple classifiers with dynamic weighted voting. In Proceedings of the International Conference on Hybrid Artificial Intelligence Systems, pages 510–516, 2009.

[Wah00] Wolfgang Wahlster. Verbmobil: Foundations of Speech-to-Speech Translation. Springer, Berlin - Heidelberg - New York, 2000.

[Wan90] Eric A. Wan. Neural network classification: A Bayesian interpretation. IEEE Transactions on Neural Networks, 1(4), 1990.

[WBW05] Geoffrey I. Webb, Janice R. Boughton, and Zhihai Wang. Not so naive Bayes: Aggregating one-dependence estimators. Machine Learning, 58(1):5–24, 2005.

[WF94] Stanley Wasserman and Katherine Faust. Social network analysis: Methods and applications. Cambridge University Press, 1994.

[WF05] Ian H. Witten and Eibe Frank. Data Mining: Practical machine learning tools and techniques, 2nd Edition. Morgan Kaufmann, 2005.

[WFYH03] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, 2003.

[WIZD04] Sholom Weiss, Nitin Indurkhya, Tong Zhang, and Fred Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer, 2004.

[WKB97] Kevin Woods, W. Philip Kegelmeyer Jr., and Kevin Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:405–410, 1997.

[WM03] Takashi Washio and Hiroshi Motoda. State of the art of graph-based data mining. SIGKDD Explorations, 5(1):59–68, 2003.

[WNB+10] Liang Wang, Uyen T. V. Nguyen, James C. Bezdek, Christopher Leckie, and Kotagiri Ramamohanarao. iVAT and aVAT: Enhanced visual analysis for cluster tendency assessment. In Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pages 16–27, 2010.

[WSB98] Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In International Conference on Very Large Data Bases, pages 194–205, 1998.

[WXC09] Junjie Wu, Hui Xiong, and Jian Chen. Adapting the right measures for k-means clustering. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 877–886, 2009.

[YiTWM04] Kenji Yamanishi, Jun-ichi Takeuchi, Graham J. Williams, and Peter Milne. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery, 8(3):275–300, 2004.

[YM97] Wei Wang, Jiong Yang, and Richard R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the International Conference on Very Large Data Bases, pages 186–195, 1997.

[YRW09] Di Yang, Elke A. Rundensteiner, and Matthew O. Ward. Neighbor-based pattern detection for windows over streaming data. In International Conference on Extending Database Technology, pages 529–540, 2009.

[YWC+07] Ying Yang, Geoffrey I. Webb, Jesús Cerquides, Kevin B. Korb, Janice R. Boughton, and Kai Ming Ting. To select or to weigh: A comparative study of linear combination schemes for superparent-one-dependence estimators. IEEE Transactions on Knowledge and Data Engineering, 19(12):1652–1665, 2007.

[YWKMN09] Lexiang Ye, Xiaoyue Wang, Eamonn J. Keogh, and Agenor Mafra-Neto. Autocannibalistic and anyspace indexing algorithms with application to sensor data mining. In Proceedings of the SIAM International Conference on Data Mining, pages 85–96, 2009.

[YWKT07] Ying Yang, Geoffrey I. Webb, Kevin B. Korb, and Kai Ming Ting. Classifying under computational resource constraints: anytime classification using probabilistic estimators. Machine Learning, 69(1):35–53, 2007.

[ZGW08] Ji Zhang, Qigang Gao, and Hai H. Wang. SPOT: A system for detecting projected outliers from high-dimensional data streams. In Proceedings of the International Conference on Data Engineering, pages 1628–1631, 2008.

[ZGWW10] Ji Zhang, Qigang Gao, Hai Wang, and Hua Wang. Detecting anomalies from high-dimensional wireless network data streams: a case study. Soft Computing - A Fusion of Foundations, Methodologies and Applications, pages 1–21, 2010.

[Zil96] Shlomo Zilberstein. Using anytime algorithms in intelligent systems. AI Magazine, 17(3):73–83, 1996.

[ZK04] Ying Zhao and George Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3):311–331, 2004.

[ZKF05] Cui Zhu, Hiroyuki Kitagawa, and Christos Faloutsos. Example-based robust outlier detection in high dimensional datasets. In Proceedings of the IEEE International Conference on Data Mining, pages 829–832, 2005.

[ZRL96] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1996.

[ZRL99] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Fast density estimation using CF-kernel for very large databases. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 312–316, 1999.

[ZS03] Yunyue Zhu and Dennis Shasha. Efficient elastic burst detection in data streams. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 336–345, 2003.

[ZW06] Fei Zheng and Geoffrey I. Webb. Efficient lazy elimination for averaged one-dependence estimators. In Proceedings of the International Conference on Machine Learning, pages 1113–1120, 2006.

[ZW07] Fei Zheng and Geoffrey I. Webb. Finding the right family: Parent and child selection for averaged one-dependence estimators. In Proceedings of the European Conference on Machine Learning, pages 490–501, 2007.

[ZWLL09] Lei Zheng, Shaojun Wang, Yan Liu, and Chi-Hoon Lee. Information theoretic regularization for semi-supervised boosting. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1017–1026, 2009.

Statement of Originality

This thesis would not have been possible without the close collaboration in the group of Professor Seidl. Many of the presented ideas and techniques evolved from numerous fruitful discussions in the group. While the high level of collaboration both within the group and with the students constitutes a key factor of the productive environment I found at Professor Seidl's group, it makes it hard to pinpoint individual contributions. The order of authors on the publications certainly gives a good impression of the contributions in terms of novel ideas. The following provides some more detail on collaborations and support for the individual chapters.

Parts I, IV and V as well as the future work Chapters 10 and 16 constitute new material that I added in order to put the contents of the two main parts into context.

The Bayes tree was initiated by Professor Seidl. The first diploma thesis on the Bayes tree was done by Jennifer Herrmann, advised by Ira Assent and Ralph Krieger. The refinement strategies were evaluated in a later diploma thesis by Stefan Denker, whom I advised together with Ralph Krieger. These contents of Chapter 4 were published in [SAK+09]. The MC-Tree was developed in the diploma thesis of Sergej Fries, whom I advised together with Stephan Günnemann; the corresponding Chapter 5 was published in [KGFS10]. The bulk loading approaches presented in Chapter 6 were published in [KKDS10] and were investigated in the diploma thesis of Stefan Denker. Finally, the concept transfer approaches in Chapter 7 resulted from numerous discussions over the past years. The implementation and evaluation was greatly supported by the students from the lab course of winter semester 2010, which I advised together with Marwan Hassani.

The HealthNet project was conducted within the UMIC research cluster at RWTH Aachen University and constitutes joint work of four institutes: MedIt, ITA, i5 and i9. The contents of Chapter 8 were published in [KKK+08] and [KMA+10].

The meta-approaches presented in Chapter 9 were published in [KS09a] and [KS09b]. Their implementation and evaluation was greatly supported by the students of the lab course of winter semester 2008, which I advised together with Emmanuel Müller. Subsection 9.3.3 and Figures 9.14 and 9.15 were not part of the publication in [KS09a]. The idea to evaluate the resulting confidence distributions in comparison to the theoretical model resulted from discussions with Michael Houle; Michael Wolf helped with the experimental evaluation and the creation of the above-mentioned figures.

Anytime stream clustering was investigated in the diploma thesis of Corinna Baldauf, which I advised together with Ira Assent. The ClusTree presented in Chapter 11 was published in [KABS09]. Evaluating the alternative descent strategies for the ClusTree was greatly supported by Fernando Sanchez Villaamil and Felix Reidl; the corresponding contents of Chapter 12 were published in [KABS11]. Felix and Fernando also worked with me on the LiarTree, which is presented in Chapter 13 and published in [KRSS11].

Preliminary experiments on using the ClusTree for anytime outlier detection were done in the diploma thesis of Corinna Baldauf. Extensive experiments were greatly supported by Stephan Wels; the results of Chapter 14 were partly published in [AKBS10].

The extensions of MOA to stream clustering presented in Chapter 15 were initiated by me and build upon the MOA framework by Albert Bifet, Richard Kirkby, Geoff Holmes and Bernhard Pfahringer of the University of Waikato in New Zealand. Large parts of it were implemented during the diploma thesis of Timm Jansen, whom I advised together with Hardy Kremer. An article on MOA was published in [BHP+10a]; further appearances of MOA include [BHP+10b, KKJ+10a, KKJ+10b]. The cluster mapping measure was investigated in the thesis of Timm Jansen. It is introduced in Chapter 15 and published in [KKJ+11].

As stated above, the order of authors on the publications generally gives a good impression of the contributions in terms of novel ideas. The publications of the author are listed separately in the following for convenience.

List of Publications

Journal publications and book chapters

[KS09a] P. Kranen and T. Seidl. Harnessing the strengths of anytime algorithms for constant data streams. In Data Mining and Knowledge Discovery Journal (DMKD), Special Issue on Best Papers from ECML PKDD, Vol. 19, No. 2, 2009.

[BHP+10a] A. Bifet, G. Holmes, B. Pfahringer, P. Kranen, H. Kremer, T. Jansen, and T. Seidl. MOA: Massive online analysis, a framework for stream classification and clustering. In Journal of Machine Learning Research (JMLR) Vol. 11, 2010.

[KABS11] P. Kranen, I. Assent, C. Baldauf, and T. Seidl. The ClusTree: Indexing micro-clusters for anytime stream mining. In Knowledge and Information Systems Journal (KAIS) Vol. 29, No. 2, 2011.

[WKAS10] M. Wichterich, P. Kranen, I. Assent, and T. Seidl. Efficient EMD-based similarity search in medical image databases. In Plant C., Böhm C. (eds.): Database Technology for Life Sciences and Medicine, World Scientific Publishing, pages 175–201, 2010.

[KMA+10] P. Kranen, E. Müller, I. Assent, R. Krieger, and T. Seidl. Incremental learning of medical data for multi-step patient health classification. In Plant C., Böhm C. (eds.): Database Technology for Life Sciences and Medicine, World Scientific Publishing, pages 321–344, 2010.


Peer reviewed full paper publications

[WAKS08] M. Wichterich, I. Assent, P. Kranen, and T. Seidl. Efficient EMD-Based Similarity Search in Multimedia Databases via Flexible Dimensionality Reduction. ACM International Conference on Management of Data (SIGMOD), 2008.

[SAK+09] T. Seidl, I. Assent, P. Kranen, R. Krieger, and J. Herrmann. Indexing density models for incremental learning and anytime classification on data streams. International Conference on Extending Database Technology (EDBT/ICDT), 2009.

[KS09b] P. Kranen and T. Seidl. Harnessing the strengths of anytime algorithms for constant data streams. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Springer LNCS, 2009.

[KABS09] P. Kranen, I. Assent, C. Baldauf, and T. Seidl. Self-adaptive anytime stream clustering. IEEE International Conference on Data Mining (ICDM), 2009.

[MKN+10] E. Müller, P. Kranen, M. Nett, F. Reidl, and T. Seidl. Air-indexing on error prone communication channels. International Conference on Database Systems for Advanced Applications (DASFAA), Springer LNCS, 2010.

[KGFS10] P. Kranen, S. Günnemann, S. Fries, and T. Seidl. MC-Tree: Improving Bayesian anytime classification. International Conference on Scientific and Statistical Database Management (SSDBM), Springer LNCS, 2010.

[KKJ+11] H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes and B. Pfahringer. An Effective Evaluation Measure for Clustering on Evolving Data Streams. ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2011.

Other peer reviewed publications

[KKK+08] P. Kranen, D. Kensche, S. Kim, N. Zimmermann, E. Müller, C. Quix, X. Li, T. Gries, T. Seidl, M. Jarke, and S. Leonhardt. Mobile mining and information management in HealthNet scenarios. IEEE International Conference on Mobile Data Management (MDM), 2008.

[MKN+08] E. Müller, P. Kranen, M. Nett, F. Reidl, and T. Seidl. A general framework for data dissemination simulation for real world scenarios. ACM SIGMOBILE International Conference on Mobile Computing and Networking (MobiCom), 2008.

[Kra09] P. Kranen. Using index structures for anytime stream mining. PhD Workshop of the International Conference on Very Large Data Bases (VLDB), 2009.

[KKDS10] P. Kranen, R. Krieger, S. Denker, and T. Seidl. Bulk loading hierarchical mixture models for efficient stream classification. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Springer LNAI, 2010.

[KKJ+10a] P. Kranen, H. Kremer, T. Jansen, T. Seidl, A. Bifet, G. Holmes, and B. Pfahringer. Benchmarking stream clustering algorithms within the MOA framework. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2010.

[AKBS10] I. Assent, P. Kranen, C. Baldauf, and T. Seidl. Detecting outliers on arbitrary data streams using anytime approaches. International Workshop on Novel Data Stream Pattern Mining Techniques (StreamKDD) in conjunction with ACM SIGKDD, 2010.

[BHP+10b] A. Bifet, G. Holmes, B. Pfahringer, P. Kranen, H. Kremer, T. Jansen, and T. Seidl. MOA: Massive online analysis, a framework for stream classification and clustering. International Workshop on Handling Concept Drift in Adaptive Information Systems (HaCDAIS) in conjunction with ECML PKDD, 2010.

[KKJ+10b] P. Kranen, H. Kremer, T. Jansen, T. Seidl, A. Bifet, G. Holmes, and B. Pfahringer. Clustering performance on evolving data streams: Assessing algorithms and evaluation measures within MOA. IEEE International Conference on Data Mining (ICDM), 2010.

[KRSS11] P. Kranen, F. Reidl, F. Sanchez Villaamil, and T. Seidl. Hierarchical clustering for real-time stream data with noise. International Conference on Scientific and Statistical Database Management (SSDBM), Springer LNCS, 2011.

Curriculum Vitae

Name: Philipp Kranen
Academic Degree: Diplom-Informatiker
Born on December 3rd, 1979
Born in Willich, Germany
Nationality: German

Education

from 08/2007       RWTH Aachen University, Germany (Computer science: Doctoral studies)
10/2001 – 12/2006  RWTH Aachen University, Germany (Computer science: Diplom Informatik)
09/2004 – 02/2005  UNITECH scholarship at TU Delft, Netherlands
02/2003 – 09/2003  ERASMUS scholarship at UIB, Spain
10/2000 – 09/2001  RWTH Aachen University, Germany (Mathematics studies)
08/1990 – 06/1999  Lise-Meitner-Gymnasium Geldern, Germany (Abitur)

Professional Experience

from 08/2007       Computer Science 9, RWTH Aachen, Germany (Research assistant)
01/2007 – 06/2007  Siemens Corporate Research, Princeton, NJ, USA (Intelligent Vision and Reasoning Department)
03/2005 – 10/2005  Gründerkolleg Aachen, Germany (Development and implementation of web applications)
10/1999 – 08/2000  Gelderland Klinik, Germany (Mandatory civilian service in lieu of military service)