Mining and Similarity Search in Temporal Databases
Total Page:16
File Type:pdf, Size:1020Kb
1 ERGEBNISSE AUS DER INFORMATIK Insights from database research, notably in the areas of data mining and similarity search, and advances in storage and microprocessor technology have enabled users to analyze and explore large-scale datasets. Data mining is the task of extracting previously unknown knowledge from data; similarity search encompasses techniques for finding objects similar by content. A prominent kind of data used in these tasks are temporal datasets, which stand out due to their information richness and their many possible applications. This thesis contributes novel, advanced methods for data mining and simi- larity search on temporal databases. Hardy Kremer A major challenge in data mining research is the effectiveness of the approaches, corre- sponding to the quality of extracted patterns. The thesis addresses this challenge for the mining task of temporal clustering. First, a clustering technique is developed that is spe- cifically designed for the requirements of real world time series. Even in difficult settings with various measurement errors and misalignments between time series, it correctly identifies patterns concealed in temporal or dimensional subspaces of the data domain. Mining and Similarity Search Second, new methods for the complex task of mapping clusters between clusterings are contributed, for which two applications are investigated: tracing of evolving clusters in in Temporal Databases spatio-temporal data and the evaluation of clustering results in data stream scenarios. The core of content-based similarity search systems and many data mining tasks are distance functions measuring the similarity between objects. An effective but also com- putationally expensive distance function for time series is based on adaptive warping on the time axis. This thesis introduces novel methods for queries under time warping. These methods exploit previously unused information in filter-and-refine frameworks for substantial runtime improvements. The anticipatory pruning technique utilizes distance information from a given filter step for rapid rejection of candidates in the refinement step, while the multiple query approach exploits shared characteristics between queries for joint pruning of candidates. The presented approaches are experimentally analyzed and evaluated with respect to competing solutions. Overall, the techniques and results of this thesis represent a major Databases in Temporal Mining and Similarity Search advance in the research areas of data mining and similarity search on temporal data. ISBN 978-3-86359-157-1 Hardy Kremer Hardy Mining and Similarity Search in Temporal Databases Von der Fakultat¨ fur¨ Mathematik, Informatik und Naturwissenschaften der RWTH Aachen University zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigte Dissertation vorgelegt von Diplom-Informatiker Hardy Kremer aus Koln¨ Berichter: Universitatsprofessor¨ Dr. rer. nat. Thomas Seidl Privatdozent Dr. rer. nat. Peer Kroger¨ Tag der mundlichen¨ Prufung:¨ 14.10.2013 Diese Dissertation ist auf den Internetseiten der Hochschulbibliothek online verfugbar.¨ Bibliografische Information der Deutschen Nationalbibliothek Ǣ òǣȀȀǤǤǤ ǣ ͳǤǡʹͲͳ͵ ͳȀʹͲͳ͵ Ǧ¡ǡͳͲͲΨ Ǥ ǡ ǡʹͲͳ͵ ò ǤʹͷǡͷʹͲͶ ǣǤǦǤǡǦǣ̷ǦǤ ͻͺǦ͵Ǧͺ͵ͷͻǦͳͷǦͳ ͺʹȋǤ ǡʹͲͳ͵Ȍ Contents Abstract / Zusammenfassung 1 1 Overview 5 1.1 Introduction . ........................... 5 1.2 Contributions and Thesis Structure ................ 9 I Effective Time Series Clustering 15 2 Subspace Clustering: Introduction 17 3 Subspace Mining in Time Series Databases 19 3.1 Motivation ............................... 20 3.2 Comparison with Related Work .................. 23 3.3 Robust, Effective Clustering of Multivariate Time Series .... 26 3.4 Efficient Computation ........................ 36 3.5 Experiments . ........................... 41 3.6 Conclusion ............................... 46 II Effective Cluster Mapping for Temporal Data 47 4 An Evaluation Measure for Clustering on Data Streams 49 4.1 Motivation ............................... 50 4.2 Comparison with Related Work .................. 51 4.3 Cluster Mapping Measure ...................... 53 4.4 Analysis and Refinement ...................... 62 4.5 Experiments . ........................... 66 i ii CONTENTS 4.6 Stream Clustering Evaluation in MOA . ............. 76 4.7 Conclusion . ............................ 78 5 Tracing of Evolving Subspace Clusters 79 5.1 Motivation . ............................ 80 5.2 Comparison with Related Work .................. 83 5.3 Model for Subspace Cluster Tracing . ............. 85 5.4 Experiments . ............................ 100 5.5 Application Scenario ......................... 104 5.6 Conclusion . ............................ 105 III Introduction to Similarity Search 107 6 Distance-Based Similarity Search: Introduction 109 6.1 Distance Functions and Similarity Queries ............ 109 6.2 Efficient Query Processing ..................... 112 7 Subspace Clustering for Efficient Query Processing 117 7.1 Motivation . ............................ 118 7.2 Comparison with Related Work .................. 121 7.3 Subspace-Clustering-Based Indexing . ............. 122 7.4 Experiments . ............................ 133 7.5 Conclusion . ............................ 144 IV Efficient Time Series Similarity Search 145 8 Dynamic Time Warping (DTW) 147 8.1 Introduction to DTW ......................... 147 8.2 DTW Applications .......................... 153 9 Anticipating Distances for Efficient DTW Query Processing 157 9.1 Motivation . ............................ 158 9.2 Comparison with Related Work .................. 159 9.3 Anticipatory DTW .......................... 160 CONTENTS iii 9.4 Experiments . ........................... 176 9.5 Conclusion ............................... 182 10 Simultaneous Processing of Multiple DTW Queries 185 10.1 Motivation ............................... 186 10.2 Comparison with Related Work .................. 187 10.3 Efficient Multiple DTW Query Processing ............ 187 10.4 Experiments . ........................... 203 10.5 Conclusion ............................... 209 V Summary 211 11 Conclusion and Future Work 213 11.1 Conclusion ............................... 213 11.2 Future Work . ........................... 215 VI Appendices I Bibliography III Statement of Originality XXIII List of Publications XXV Abstract Insights from database research, notably in the areas of data mining and similar- ity search, and advances in storage and microprocessor technology have enabled users to analyze and explore large-scale datasets. Data mining is the task of ex- tracting previously unknown knowledge from data; similarity search encompasses techniques for finding objects similar by content. A prominent kind of data used in these tasks are temporal datasets, which stand out due to their information rich- ness and their many possible applications. This thesis contributes novel, advanced methods for data mining and similarity search on temporal databases. A major challenge in data mining research is the effectiveness of the approaches, corresponding to the quality of extracted patterns. The thesis addresses this chal- lenge for the mining task of temporal clustering. First, a clustering technique is developed that is specifically designed for the requirements of real world time se- ries. Even in difficult settings with various measurement errors and misalignments between time series, it correctly identifies patterns concealed in temporal or dimen- sional subspaces of the data domain. Second, new methods for the complex task of mapping clusters between clusterings are contributed, for which two applications are investigated: tracing of evolving clusters in spatio-temporal data and the evalu- ation of clustering results in data stream scenarios. The core of content-based similarity search systems and many data mining tasks are distance functions measuring the similarity between objects. An effective but also computationally expensive distance function for time series is based on adaptive warping on the time axis. This thesis introduces novel methods for queries under time warping. These methods exploit previously unused information in filter-and- refine frameworks for substantial runtime improvements. The anticipatory pruning technique utilizes distance information from a given filter step for rapid rejection of candidates in the refinement step, while the multiple query approach exploits shared characteristics between queries for joint pruning of candidates. The presented approaches are experimentally analyzed and evaluated with re- spect to competing solutions. Overall, the techniques and results of this thesis rep- resent a major advance in the research areas of data mining and similarity search on temporal data. 1 Zusammenfassung Neue Erkenntnisse in der Datenbankforschung, insbesondere in den Bereichen des Data Mining und der Ahnlichkeitssuche,¨ und die fortschreitende Entwicklung von Speichertechnologien und Mikroprozessoren ermoglichen¨ die Analyse und Explo- ration von großen Datenmengen. Wahrend¨ es im Data Mining das Ziel ist, unbe- kanntes Wissen aus Daten zu extrahieren, behandelt die Ahnlichkeitssuche¨ Tech- niken des inhaltsbasierten Objektvergleichs. Im Rahmen dieser Aufgaben erfreuen sich temporale Daten einer wachsenden Beliebtheit. Sie zeichnen sich durch ihren hohen Informationsgehalt und ihre zahlreichen Anwendungsmoglichkeiten¨ aus. In dieser Dissertation werden neue Techniken des Data Mining und der Ahnlichkeits-¨