Efficient Clustering of Big Data Streams
Ergebnisse aus der Informatik 4

Marwan Hassani

ISBN 978-3-86359-318-6

Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University for the award of the academic degree of Doctor of Engineering (Doktor der Ingenieurwissenschaften)

submitted by Diplom-Ingenieur Marwan Hassani from Aleppo, Syria

Reviewers:
Univ.-Prof. Dr. rer. nat. Thomas Seidl
Assoc. Prof. Dr. Mohamed M. Gaber
Prof. Dr.-Ing. Stefan Kowalewski

Date of the oral examination: 26.01.2015

This dissertation is available online on the website of the university library.

To my family who supported me with patience and love. To my beautiful daughters Yumna and Amena, without whom this thesis would have been completed one year earlier.

Contents

Abstract / Zusammenfassung

1 Overview
  1.1 Introduction
  1.2 Challenges for Stream Clustering
  1.3 Contribution and Thesis Structure

I Energy-Efficient Aggregation and Clustering of Sensor Streaming Data

2 Energy-Efficient Distributed In-Network Clustering of Sensor Data
  2.1 Motivation
  2.2 Related Work
  2.3 Problems Formulation
  2.4 The EDISKCO Algorithm
  2.5 Experimental Evaluation
  2.6 Conclusion

3 Weighted Outlier-Aware K-Center Clustering of Sensor Data
  3.1 Motivation
  3.2 Preliminaries
  3.3 The SenClu Algorithm
  3.4 Experimental Evaluation
  3.5 Conclusion

4 Energy-Efficient Self-Adaptive Clustering of Sensor Nodes
  4.1 Motivation
  4.2 Related Work
  4.3 Problem Formulation
  4.4 The ECLUN Algorithm
  4.5 Experimental Evaluation
  4.6 Conclusion

II High-Dimensional Density-Based Stream Clustering

5 Projected Density-based Stream Clustering
  5.1 Motivation
  5.2 Related Work
  5.3 Problem Formulation and Definitions
  5.4 The PreDeConStream Algorithm
  5.5 Experimental Evaluation
  5.6 Conclusion

6 Hierarchical Adaptive Density-based Stream Clustering
  6.1 Motivation
  6.2 Related Work
  6.3 Preliminaries and Data Structure
  6.4 The HASTREAM Algorithm
  6.5 Experimental Evaluation
  6.6 Conclusion

III Advanced Anytime Stream Clustering

7 Outlier-Aware Non-Redundant Anytime Stream Clustering
  7.1 Motivation
  7.2 Related work
  7.3 The LiarTree Algorithm
  7.4 Experimental Evaluation
  7.5 Conclusion

8 Anytime Subspace Stream Clustering
  8.1 Motivation
  8.2 Developing SubClusTree
  8.3 The SubClusTree Algorithm
  8.4 Experimental Evaluation
  8.5 Conclusion

IV Framework and Evaluation Measures for Stream Subspace Clustering

9 The Subspace MOA Framework
  9.1 Motivation
  9.2 The Subspace MOA Framework
  9.3 Experimental Evaluation
  9.4 Conclusion

10 Subspace Cluster Mapping Measure (SubCMM)
  10.1 Motivation
  10.2 Related Work
  10.3 SubCMM: Subspace Cluster Mapping Measure
  10.4 Experimental Evaluation
  10.5 Conclusion

V Summary

11 Summary and Future Work
  11.1 Summary
  11.2 Future Work

VI Appendices
  Bibliography
  Statement of Originality
  List of Publications

Abstract

Recent advances in data collecting devices and data storage systems are continuously offering cheaper possibilities for gathering and storing increasingly large volumes of data. Similar improvements in processing power and databases have enabled access to a large variety of complex data. Data mining is the task of extracting useful patterns and previously unknown knowledge from this voluminous, varied data. This thesis focuses on the data mining task of clustering, i.e. grouping objects into clusters such that similar objects are assigned to the same cluster while dissimilar ones are assigned to different clusters.

While traditional clustering algorithms considered only static data, today's applications and research issues in data mining have to deal with continuous, possibly infinite streams of data arriving at high velocity. Web traffic data, click streams, surveillance data, sensor measurements, customer profile data and stock trading are only some examples of these applications, whose number grows daily.

Since the growth of data sizes is accompanied by a similar rise in their dimensionality, clusters cannot be expected to appear completely when all attributes are considered together. Subspace clustering is a general approach that solves this issue by automatically finding the hidden clusters within different subsets of the attributes rather than considering all attributes together.

In this thesis, novel methods for efficient subspace clustering of high-dimensional data streams are presented and thoroughly evaluated. Approaches that efficiently combine the anytime clustering concept with the stream subspace clustering paradigm are studied intensively. Additionally, efficient and adaptive density-based clustering algorithms are presented for high-dimensional data streams. New algorithmic solutions for energy-efficient in-sensor-network aggregation and lightweight clustering are presented for sensor streaming data. A novel open-source assessment framework and evaluation measures are presented for subspace stream clustering. Above all, efficient models of advanced and complex clustering tasks are contributed for data streams for the first time.

Zusammenfassung

Current developments in data acquisition devices and data storage systems continually offer cheaper ways to collect and store large volumes of data. Increasing computing power and more efficient databases make a wide variety of complex data accessible.