CLUSTERING OF LARGE TIME-SERIES DATASETS USING

A MULTI-STEP APPROACH

SAEED REZA AGHABOZORGI SAHAF YAZDI

THESIS SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY UNIVERSITY OF MALAYA KUALA LUMPUR

2013

UNIVERSITY OF MALAYA

ORIGINAL LITERARY WORK DECLARATION

Name of Candidate: SAEED REZA AGHABOZORGI SAHAF YAZDI

I.C/Passport No: X95385279

Registration/Matric No: WHA080005

Name of Degree: DOCTOR OF PHILOSOPHY

Title of Project Paper/Research Report/Dissertation/Thesis (“this Work”): CLUSTERING OF LARGE TIME-SERIES DATASETS USING A MULTI-STEP APPROACH

Field of Study:

I do solemnly and sincerely declare that:

(1) I am the sole author/writer of this Work;

(2) This Work is original;

(3) Any use of any work in which copyright exists was done by way of fair dealing and for permitted purposes and any excerpt or extract from, or reference to or reproduction of any copyright work has been disclosed expressly and sufficiently and the title of the Work and its authorship have been acknowledged in this Work;

(4) I do not have any actual knowledge nor do I ought reasonably to know that the making of this work constitutes an infringement of any copyright work;

(5) I hereby assign all and every rights in the copyright to this Work to the University of Malaya (“UM”), who henceforth shall be owner of the copyright in this Work and that any reproduction or use in any form or by any means whatsoever is prohibited without the written consent of UM having been first had and obtained;

(6) I am fully aware that if in the course of making this Work I have infringed any copyright whether intentionally or otherwise, I may be subject to legal action or any other action as may be determined by UM.

Candidate’s Signature Date

Subscribed and solemnly declared before,

Witness’s Signature Date

Name:

Designation:

ABSTRACT

Various data mining approaches are currently being used to analyse data within different domains. Among these approaches, clustering is one of the most widely used; it is typically adopted in order to group data based on their similarities.

The data in various systems, such as finance, healthcare, and business, are stored as time-series. Clustering such complex data can reveal patterns which carry valuable information. Time-series clustering is not only useful as an exploratory technique but also as a subroutine in more complex data mining algorithms. As a result, time-series clustering (as a part of temporal data mining research) has attracted increasing interest for use in various areas such as medicine, biology, finance, economics, and the Web.

Several studies which focus on time-series clustering have been conducted in these areas. Many of them focus on the time complexity of time-series clustering in large datasets and utilize conventional clustering algorithms to address the problem. However, as is the case in many systems, conventional clustering approaches are not practical for time-series data because they are essentially designed for static data rather than time-series data, which leads to poor clustering accuracy. Adequate clustering approaches for time-series are therefore lacking.

In this thesis, the problem of the low quality of existing works is taken into account, and a new multi-step clustering model is proposed. This model facilitates the accurate clustering of time-series datasets and is designed specifically for very large time-series datasets. It overcomes the limitations of conventional clustering algorithms in dealing with time-series data.

In the first step of the model, the data are pre-processed, represented by symbolic aggregate approximation, and grouped approximately by a novel approach. The groups are then refined in the second step by using an accurate clustering method, and a representative is defined for each cluster. Finally, the representatives are merged to construct the ultimate clusters. The model is then extended as an interactive model where the results garnered by the user increase in accuracy over time. In this work, accurate clustering based on shape similarity is performed. It is shown that clustering of time-series does not require calculating the exact distances/similarities between all time-series in a dataset; instead, by using prototypes of similar time-series, accurate clusters can be obtained.

To evaluate its accuracy, the proposed model is tested extensively using published time-series datasets from diverse domains. The model is more accurate than existing works and is also scalable (on large datasets) due to the use of multi-resolution representations of time-series at different levels of clustering. Moreover, it provides a clear understanding of the domains through its ability to generate hierarchical and arbitrary-shape clusters of time-series data.

ACKNOWLEDGEMENT

First, I would like to thank Dr. Teh Ying Wah, with whom I have enjoyed working and who is a great advisor. Thank you for the encouragement, support, and direction you have provided during the past four years. I am very grateful to have you as a counsellor and advisor.

My special thanks go to my friends in the Faculty of Computer Science and Information Technology (FSKTM) at the University of Malaya for the helpful discussions we had about my project. My gratitude also goes to Dr. Keogh for sharing the UCR dataset with me and guiding me on time-series concepts. I also thank Mr. Zia Madani for the provision of the Bank data; I owe a lot to him for his support during my hard days in Malaysia.

I wish to thank my parents for the encouragement and for providing me with so many opportunities to improve in academics. Thank you for the invaluable support you gave me in the course of my life and studies.

Finally, I am deeply grateful to my wife, Mahda, for her ongoing moral support, and acceptance of my long hours away from our family. Thank you so much for your constant encouragement and ridiculous amounts of patience.

Kuala Lumpur, 7th March 2013

Saeed R. Aghabozorgi

TABLE OF CONTENTS

Abstract
Acknowledgement
Table of Contents
List of Figures
List of Tables
List of Abbreviations and Acronyms
1.0 Introduction
1.1 Background: Time-series Clustering
1.2 Motivation
1.3 Problem Statement
1.3.1 Overlooking of Data
1.3.2 Inaccurate Similarity Measure
1.3.3 Inappropriate Algorithms
1.4 Research Questions
1.5 Research Objectives
1.6 Scope of Research
1.7 Chapter Organization
2.0 Background and Literature Review
2.1 Introduction
2.2 Time-series Clustering
2.2.1 Applications of Time-series Clustering
2.2.2 Taxonomy of Time-series Clustering
2.3 Whole Time-series Clustering
2.4 Time-series Representation
2.4.1 Other Representation Methods
2.4.2 Discussion
2.4.3 Brief Review of PAA
2.4.4 Brief Review of SAX
2.5 Similarity/Dissimilarity Measure
2.5.1 Discussion
2.5.2 Euclidean Distance (ED)
2.5.3 Dynamic Time Warping (DTW)
2.6 Cluster Prototypes
2.6.1 Using Medoid as Prototype
2.6.2 Using Averaging Prototype
2.6.3 Using Local Search Prototype
2.6.4 Discussion
2.7 Evaluation Measure
2.7.1 External Index
2.7.2 Internal Index
2.8 Related Works: Time-series Clustering Algorithms
2.8.1 Hierarchical Clustering of Time-series
2.8.2 Partitioning Clustering
2.8.3 Model-based Clustering
2.8.4 Density-based Clustering
2.8.5 Grid-based Clustering
2.8.6 Multi-step Clustering
2.9 Chapter Summary
3.0 Research Methodology
3.1 Introduction
3.2 Approaches to Research
3.2.1 Reviewing Related Works
3.2.2 Problem Formulation
3.2.3 Definition of Research Objectives
3.2.4 Proposed Models
3.2.5 System Design
3.2.6 Analysis of Methods
3.2.7 Evaluation Method
3.3 Chapter Summary
4.0 System Design
4.1 Introduction
4.2 Overview of Proposed Model (MTC)
4.3 Step 1: Pre-clustering (Approximate Clustering)
4.3.1 Activity 1: Pre-processing
4.3.2 Activity 2: Dimensionality Reduction
4.3.3 Activity 3: Distance Calculation (APXDIST)
4.3.4 Activity 4: Pre-clustering (Ek-Modes)
4.4 Step 2: Purifying and Summarization
4.4.1 Activity 1: Calculate Similarity
4.4.2 Activity 2: Purifying of Clusters (PCS)
4.4.3 Activity 3: Summarization (Making Prototypes)
4.5 Step 3: Merging
4.5.1 Activity 1: Distance Calculation
4.5.2 Activity 2: Merging Sub-clusters
4.5.3 Activity 3: Mapping
4.6 MTC Algorithm
4.7 Interactive MTC (IMTC)
4.8 Chapter Summary
5.0 Experimental Results and Analysis
5.1 Introduction
5.2 Datasets
5.2.1 Real-world Dataset
5.2.2 Synthetic Datasets
5.3 Analyses of Methods
5.3.1 Step 1: Pre-clustering (Approximate Clustering)
5.3.2 Step 2: Purifying and Summarization
5.3.3 Step 3: Merging
5.4 Final Results
5.5 Chapter Summary
6.0 Experimental Evaluation
6.1 Introduction
6.2 Accuracy Evaluation
6.2.1 MTC Visualization
6.2.2 MTC Accuracy
6.2.3 Comparing MTC with Partitioning Clustering
6.2.4 Comparing MTC with Hierarchical Clustering
6.2.5 Comparing MTC with Multi-step Models (Rival Models)
6.2.6 Evaluation of MTC on Large Datasets
6.2.7 Evaluation of MTC Using Internal Criteria
6.2.8 Evaluation of IMTC
6.3 Scalability Evaluation
6.3.1 Space Complexity (Space Utilization)
6.3.2 Time Complexity
6.4 Sensitivity Evaluation
6.4.1 Effect of Random Initialization
6.4.2 Effect of SAX Parameters
6.5 Chapter Summary
7.0 Conclusion
7.1 Introduction
7.2 Summary of Results and Findings
7.3 Achievement of the Objectives
7.4 Contributions
7.5 Limitations of the Current Study
7.6 Recommendations and Future Directions
References
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
Appendix H
Appendix I
Appendix J

LIST OF FIGURES

Figure 1.1: The time-series of blood pressure of two patients in two different granularities. Adapted from Lai et al. (2010)
Figure 1.2: Clustering examples of CBF dataset using DTW and ED
Figure 2.1: Time-series clustering taxonomy
Figure 2.2: The time-series clustering approaches
Figure 2.3: An overview of five components of whole time-series clustering
Figure 2.4: A sample of dimensionality reduction of time-series. Left: coefficients of a time-series constructed from ‘Haar’ wavelet decomposition structure. Right: the approximation coefficients of the time-series
Figure 2.5: Hierarchy of different time-series representation approaches
Figure 2.6: Representation of time-series by PAA
Figure 2.7: Symbolic representation of time-series by SAX based on PAA
Figure 2.8: Symbolic representation of a sample raw time-series by SAX
Figure 2.9: Distance measure approaches in the literature
Figure 2.10: The difference between Euclidean and DTW distance (Keogh, 2006). Left: distance calculation by ED. Right: distance calculation by DTW
Figure 2.11: Evaluation measure hierarchy used in the literature
Figure 3.1: Research methodology framework
Figure 3.2: Proposed model for clustering of time-series data (MTC)
Figure 3.3: Steps of the proposed model for clustering of time-series data
Figure 3.4: Proposed IMTC model for interactive clustering of time-series
Figure 3.5: Experimental evaluation of MTC
Figure 4.1: The overall view of the steps of MTC
Figure 4.2: Activities of each step of MTC
Figure 4.3: The workflow of the first step of MTC, where transformed time-series are clustered
Figure 4.4: Raw time-series before normalization
Figure 4.5: Normalized time-series
Figure 4.6: A sample of a time-series represented by SAX
Figure 4.7: MINDIST measure for calculating similarity between time-series symbolized using SAX
Figure 4.8: Distribution of a sample area for a symbol (e.g., ‘e’)
Figure 4.9: For calculation of the distance between the time-series in Figure 4.6 and another time-series, an indicator is defined for each area
Figure 4.10: Pseudocode for pre-clustering by Ek-Modes
Figure 4.11: Activities of the second step of MTC
Figure 4.12: Similarity calculation in different steps of MTC
Figure 4.13: Two-dimensional pre-clusters and sub-clustering
Figure 4.14: Using PCS for purifying of pre-clusters
Figure 4.15: Single-representative of a time-series sub-cluster
Figure 4.16: Multi-representative of a time-series sub-cluster
Figure 4.17: Multi-representative approach for time-series clustering
Figure 4.18: Algorithm for determining representatives in sub-clusters
Figure 4.19: The workflow of MTC clustering for the third step
Figure 4.20: Similarity in time between sub-clusters’ prototypes
Figure 4.21: Intuition for using DTW for calculating similarity in shape between representatives of sub-clusters in the third step of MTC
Figure 4.22: Pseudocode for the k-Medoids algorithm
Figure 4.23: Splitting of a natural cluster due to incorrect spherical clustering
Figure 4.24: An outline of the algorithm for clustering of sub-clusters with multi-representatives
Figure 4.25: Pseudocode of the MTC model
Figure 4.26: An overview of IMTC
Figure 4.27: The activities of the IMTC model
Figure 5.1: Transformation of the raw image shape of a leaf to time-series data (Ratanamahatana & Niennattrakul, 2006)
Figure 5.2: Three sample time-series of six classes (six species) of the Leaf dataset
Figure 5.3: Transformation of the image of a head profile to time-series data (Ratanamahatana & Niennattrakul, 2006)
Figure 5.4: Examples of four different faces
Figure 5.5: Three examples of two classes of the ECG dataset (left: ‘normal’; right: ‘abnormal’)
Figure 5.6: Three sample time-series of two classes of the Gun dataset
Figure 5.7: Four samples of each class of the Cylinder, Bell, and Funnel (CBF) dataset
Figure 5.8: Three sample time-series of each class of CC
Figure 5.9: Overlooking of peaks in a time-series in the SAX transformation process
Figure 5.10: Three different time-series with similar representations fall into one cluster
Figure 5.11: Quality of clustering of time-series represented by SAX against the raw time-series data
Figure 5.12: Average quality of SAX-represented data against raw time-series across all datasets
Figure 5.13: Quality of k-Medoids using SAX against ground truth
Figure 5.14: Quality of hierarchical (average linkage) clustering of time-series represented by SAX against ground truth
Figure 5.15: Quality of hierarchical (single linkage) clustering on time-series represented by SAX against ground truth
Figure 5.16: The tightness of APXDIST to the Euclidean distance
Figure 5.17: The average quality of clustering against the ground truth
Figure 5.18: Average quality of clustering of time-series represented by SAX vs. ground truth (GT) on the UCR dataset (TRAIN set)
Figure 5.19: Quality of different algorithms in the first step (against GT)
Figure 5.20: Reduction-rate in the second step of MTC against the number of clusters for a parameter value of 0.7
Figure 5.21: Purity of the first step and second step for a parameter value of 0.7
Figure 5.22: The reduction-rate and purity of the second step for a parameter value of 0.7
Figure 5.23: The QGR of the second step for different values of the parameter
Figure 5.24: PCS approach reduction-rate against gained purity
Figure 5.25: The proportion of reduction-rate against quality for different values of the parameter
Figure 5.26: Quality of clustering against ED and DTW
Figure 5.27: Using schemes of merging on the Trace dataset
Figure 5.28: A sample of gained accuracy using MTC with the arbitrary shape scheme in the third level
Figure 6.1: k-Modes clustering of time-series represented by SAX6 for the CBF dataset (first step)
Figure 6.2: k-Modes clustering of time-series represented by SAX6 for the Coffee dataset (first step)
Figure 6.3: The clustering process in which MTC is able to find the genuine (pure) sub-clusters in the CBF dataset in the second step
Figure 6.4: The clustering process in which MTC is able to find the genuine sub-clusters in Coffee-train in the second step
Figure 6.5: Clustering of time-series in the CBF dataset using MTC
Figure 6.6: Clustering of time-series in the Coffee dataset using MTC
Figure 6.7: Clustering accuracy of MTC against ground truth using different schemes on various datasets
Figure 6.8: Quality of the MTC approach against standard k-Medoids on raw time-series
Figure 6.9: Quality of MTC (k-Medoids) against ground truth for the UCR dataset
Figure 6.10: Dendrogram of hierarchical clustering of CBF using average linkage
Figure 6.11: Pre-clustering of the time-series related to CBF made by hierarchical clustering
Figure 6.12: Revising pre-clusters: generating the sub-clusters
Figure 6.13: Selecting a time-series as the prototype of a sub-cluster
Figure 6.14: Third step of MTC: merging prototypes
Figure 6.15: The final result of MTC on the CBF dataset
Figure 6.16: Average quality of the MTC approach against hierarchical clustering (average linkage) on dimensionally reduced time-series
Figure 6.17: Average quality of the MTC approach against hierarchical clustering (single linkage) on dimensionally reduced time-series
Figure 6.18: Comparison of 2LTSC and MTC against ground truth for the test dataset
Figure 6.19: Quality of clustering using the graph-based approach against MTC
Figure 6.20: Accuracy of MTC against other algorithms for the CBF dataset
Figure 6.21: Accuracy of MTC against other algorithms for CC
Figure 6.22: Sample clusters of time-series related to transactions of customers
Figure 6.23: Accuracy of MTC against different cardinalities of BTD
Figure 6.24: A sample of clustering of time-series related to KLSE
Figure 6.25: A sample of clustering of companies based on their stock exchange in 2010
Figure 6.26: Handling shifts in the clustering of time-series of the stock exchange of companies in the KLSE dataset
Figure 6.27: Accuracy of MTC against different cardinalities of KLSE
Figure 6.28: Interactive version of Multi-step Time-series Clustering (IMTC) on the CBF dataset
Figure 6.29: Quality of the interactive method (IK-Means) for different resolutions of data
Figure 6.30: Quality of clustering of CBF using MTC (k-Medoids scheme) with random initialization
Figure 6.31: Quality of clustering of CC using MTC (k-Medoids scheme) with random initialization
Figure 6.32: Sensitivity of MTC on the UCR dataset with random initialization (SAX4)
Figure 6.33: Quality of clustering of the UCR dataset using MTC with three different compression-ratios of SAX
Figure 6.34: Quality of clustering of various cardinalities of CBF using MTC with seven different compression-ratios of SAX

LIST OF TABLES

Table 2.1: Samples of objectives of time-series clustering in different domains
Table 2.2: Representation methods for time-series data
Table 2.3: Similarity measure approaches in the literature
Table 2.4: Whole time-series clustering algorithms
Table 4.1: A lookup table used by the APXDIST function (a sample for an alphabet size of 6)
Table 5.1: Number of clusters, number of instances, and the length of the time-series in each dataset
Table 5.2: Parameter values of MTC used to generate the experimental results
Table 5.3: The result of MTC in comparison with conventional algorithms
Table 6.1: The quality of wrong clustering of CBF using hierarchical clustering with average linkage and SAX4

LIST OF ABBREVIATIONS AND ACRONYMS

MTC  Multi-step Time-series Clustering
ED  Euclidean Distance
DTW  Dynamic Time Warping
LCSS  Longest Common Sub-Sequence
SAX  Symbolic Aggregate Approximation
TS  Time-series
MINDIST  Minimum Distance
PAA  Piecewise Aggregate Approximation
APXDIST  Approximate Distance
IMTC  Interactive Multi-step Time-series Clustering
CAST  Cluster Affinity Search Technique
SOM  Self-Organizing Map
FCM  Fuzzy C-Means
DWT  Discrete Wavelet Transformation
PCS  Pure Cluster Search
GT  Ground Truth

1.0 INTRODUCTION

1.1 Background: Time-series Clustering

Data mining is the analysis step of the knowledge discovery in databases (KDD) process, used to discover new patterns from large datasets (Fayyad, Piatetsky-Shapiro, & Smyth, 1996), and has had a profound impact on our society by solving real-life problems (Chakrabarti et al., 2006). Data mining aims to extract useful knowledge and to summarize it into an understandable form so that it can be utilized for further human use (Hand, Mannila, & Smyth, 2001).

Data mining involves different techniques. Clustering is a data mining technique in which similar data are placed into related groups, ideally without advance knowledge of the groups’ definitions. According to Mirkin (2005), clustering is defined as “a discipline devoted to finding and describing cohesive or homogeneous chunks in data, the clusters.” Clusters are formed by grouping objects that have maximum similarity with other objects within the group, and minimum similarity with objects in other groups. Clustering is a useful approach for exploratory data analysis, as it identifies structure(s) in an unlabelled dataset by objectively organizing data into similar groups. It is also used for summary generation and as a pre-processing step, either for other data mining tasks or as part of a complex system.

A sequence composed of a series of nominal symbols from a particular alphabet is usually called a temporal sequence, and a sequence of continuous, real-valued elements is known as a time-series (Antunes & Oliveira, 2001). A time-series is essentially classified as dynamic data because its feature values change as a function of time. That is, the value(s) of each point of a time-series is/are one or more observations made chronologically. Informally, time-series data is a type of temporal data which is naturally high dimensional and large in data size (Keogh & Kasetty, 2003; J. Lin, Vlachos, Keogh, & Gunopulos, 2004; Rani & Sikka, 2012). The time-series is defined formally as:

Definition 1.1: Time-series. A time-series F_i = {f_1, f_2, ..., f_T} is an ordered set of flow vectors which indicate the spatiotemporal characteristics of moving objects at any time t of the total track life T (Morris & Trivedi, 2009). A flow vector or feature vector f_t generally represents location or dynamics in a domain. However, just a spatial location f_t = [x_t] is taken into account in this work, because the majority of time-series clustering applications are univariate time-series (Warrenliao, 2005). It is assumed that D = {F_1, F_2, ..., F_n} is a collection of time-series in a domain, where F_i represents the i-th time-series (i = 1, ..., n) in the domain.
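As a plain illustration of this notation (a hypothetical sketch, not taken from the thesis; the variable names and values are invented), such a collection D of univariate time-series can be held in Python as follows:

    import numpy as np

    # A toy collection D = {F_1, F_2, F_3} of univariate time-series, where each
    # F_i is an ordered set of observations f_1, ..., f_T made chronologically.
    F1 = np.array([1.0, 1.2, 1.1, 1.4, 1.3])
    F2 = np.array([0.9, 1.0, 1.1, 1.0, 0.9])
    F3 = np.array([2.0, 2.2, 2.1, 2.5, 2.4])
    D = [F1, F2, F3]  # n = 3 time-series in the domain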

Time-series data are of interest due to their ubiquity in various areas ranging from science, engineering, business, finance, economics, and healthcare to government (Warrenliao, 2005). Usually, the stored data in these systems are time-series data such as patients’ heartbeat electrocardiograms (ECG), daily temperatures, human DNA sequences, weekly sales totals, prices of mutual funds and stocks, moving objects’ trajectories, and Web usage sequences (refer to Section 2.2 for references). In addition, considering some data, such as images, text, handwriting, and video, as time-series data may be beneficial (Ratanamahatana & Keogh, 2004a; Ratanamahatana & Niennattrakul, 2006; Ratanamahatana, 2005). Each time-series, while consisting of a large number of data points, can also be seen as a single object (R. Kumar & Nagabhushan, 2006).

Clustering such complex objects (time-series data) is particularly advantageous because it leads to the discovery of interesting patterns in time-series datasets. These patterns can be either frequent or rare patterns. Accordingly, several research challenges have arisen, such as developing methods to recognize dynamic changes in time-series, anomaly and intrusion detection, process control, and character recognition (Chiş, Banerjee, & Hassanien, 2009; Faloutsos, Ranganathan, & Manolopoulos, 1994; X. Wang, Smith, & Hyndman, 2006). More applications of time-series data are discussed in Section 2.2.1.

Considering questions such as “Why is clustering of time-series data important?” and “Why would one need to cluster a time-series dataset?”, the potentially overlapping objectives for clustering of time-series data are given as follows:

1. Time-series contain valuable information which can be obtained through pattern discovery. Clustering is a common solution performed to uncover these patterns in time-series datasets.

2. Time-series databases are very large and cannot be handled well by human inspectors. Hence, many users prefer to deal with structured datasets rather than very large raw datasets. As a result, time-series data are represented as a set of groups of similar time-series, either by aggregation of data in non-overlapping clusters or by a taxonomy as a hierarchy of abstract concepts.

3. Time-series clustering is the most-used approach as an exploratory technique, and also as a subroutine in more complex data mining algorithms, such as rule discovery, indexing, classification, and anomaly detection (Chiş et al., 2009).

4. Representing time-series cluster structures as visual images (visualization of time-series data) can help users quickly understand the structure of data, clusters, anomalies, and other regularities in datasets.

The problem of clustering of time-series data is formally defined as follows:

Definition 1.2: Time-series clustering. Given a dataset of n time-series D = {F_1, F_2, ..., F_n}, the process of unsupervised partitioning of D into C = {C_1, C_2, ..., C_k}, in such a way that homogeneous time-series are grouped together based on a certain similarity measure, is called time-series clustering. Then, C_i is called a cluster, where D = ⋃_{i=1..k} C_i and C_i ∩ C_j = ∅ for i ≠ j.
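To make the definition concrete, the following is a minimal sketch, assuming equal-length series and plain Euclidean distance, of such an unsupervised partitioning with k-Means. It is illustrative only (the function name and toy data are invented) and is not the MTC model proposed later in this thesis:

    import numpy as np

    def cluster_time_series(D, k, n_iter=50, seed=0):
        # Partition a set of equal-length time-series into k disjoint clusters
        # (Definition 1.2) using k-Means under Euclidean distance.
        rng = np.random.default_rng(seed)
        X = np.asarray(D, dtype=float)  # n x T matrix of raw time-series
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each series F_i to its nearest cluster centre.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each centre as the mean of its member series.
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return labels

    # Two obvious groups of series; cluster ids may swap between runs.
    D = [[0, 0, 0, 0], [0.1, 0, 0.1, 0], [5, 5, 5, 5], [5.2, 5, 5.1, 5]]
    print(cluster_time_series(D, k=2))  # e.g., [0 0 1 1]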

1.2 Motivation

Data are stored on disks as raw data; therefore, time-series databases are often very large. For example, 1 hour of ECG (electrocardiogram) data needs 1 gigabyte; a typical weblog requires 5 gigabytes per week; the space shuttle database holds 200 gigabytes and updating it requires 2 gigabytes per day (Keogh, 2006). This drastically slows down the clustering process. Additionally, time-series data are often high dimensional (Keogh, Chakrabarti, Pazzani, & Mehrotra, 2001a; J. Lin, Keogh, & Truppel, 2003), which makes handling these data difficult for many clustering algorithms (X. Wang, Smith, Hyndman, & Alahakoon, 2004) and further slows the clustering process (H. Zhang, Ho, Zhang, & Lin, 2006).

Moreover, to form the clusters, similar time-series must be found. This requires time-series similarity matching (a similarity measure), that is, the process of calculating the similarity between whole time-series. This process is also known as “whole sequence matching”, where the whole length of each time-series is considered during distance calculation. However, the process is not simple: first, time-series data are naturally noisy and include outliers and shifts (J. Lin, Vlachos, et al., 2004); second, the lengths of time-series vary, and the distance among them still needs to be calculated. These common issues have made the similarity measure a major challenge for data miners.

Considering all these difficulties in the clustering of time-series, dimensionality reduction is the common solution to increase the performance and speed of the mining process. Dimensionality reduction is a pre-processing action considered to be a fundamental and important process in time-series data mining. It represents the raw time-series in another space by transforming the time-series to a lower dimensional space or by feature extraction. But why is dimensionality reduction so important in the clustering of time-series? First, it reduces memory requirements, as all raw time-series cannot fit in the main memory (Keogh & Pazzani, 2000; J. Lin, Keogh, Lonardi, & Chiu, 2003). Second, distance calculation among raw data is computationally expensive, and dimensionality reduction significantly speeds up clustering (Keogh & Pazzani, 2000; J. Lin, Keogh, Lonardi, et al., 2003). Finally, when measuring the distance between two raw time-series, highly unintuitive results may be garnered because some distance measures are very sensitive to some “distortions” in the data (Ratanamahatana, Keogh, Bagnall, & Lonardi, 2005; Ratanamahatana, 2005). That is, by using raw time-series, one may find clusters of time-series which are similar in noise instead of clusters which are similar in shape. The potential to obtain a different type of cluster is the reason why choosing the appropriate approach for dimensionality reduction (feature extraction) and its ratio is a challenging task (H. Zhang et al., 2006).

In fact, it is a trade-off between speed and quality: all efforts must be made to consider both the quality and the execution time of time-series clustering. Resolving these problems has prompted the researcher to carry out this study to improve time-series clustering.

1.3 Problem Statement

Researchers have shown that clustering using well-known conventional algorithms such as k-Means, SOM (Self-Organizing Map), FCM (Fuzzy C-Means), and hierarchical clustering generally generates clusters with acceptable structural quality and consistency, and is partially efficient in terms of execution time and accuracy, for static data (Jain, Murty, & Flynn, 1999). However, classic machine learning and data mining algorithms do not work well for time-series due to their unique structure (J. Lin, Vlachos, et al., 2004). The high dimensionality, very high feature correlation, and the (typically) large amount of noise that characterize time-series data present a difficult challenge for clustering (Keogh & Kasetty, 2003; J. Lin, Vlachos, et al., 2004). Accordingly, massive research efforts have been made to present methods for time-series clustering. However, focusing on the efficiency and scalability of these algorithms to deal with time-series data has come at the expense of losing the usability and effectiveness of clustering (Ratanamahatana, 2005). Considering this problem, new approaches for time-series clustering have been proposed to improve not only its efficiency but also the clustering accuracy of time-series data (Bagnall & Janacek, 2005; X. Wang et al., 2006). For example, Ratanamahatana and Niennattrakul (2006) and Niennattrakul and Ratanamahatana (2007a) considered the accuracy problem in the partitioning clustering of multimedia time-series. As further evidence, Gullo, Ponti, Tagarelli, Tradigo, and Veltri (2011) remarked that the special characteristics of time-series data have posed challenges to the effective identification of clusters. Hence, the problem of this study is stated briefly as:

“Applying conventional clustering algorithms to accurately and meaningfully identify clusters of time-series data is difficult because time-series data are naturally large and high dimensional, and contain noise and outliers.”

To address this problem, the reasons for the low accuracy in time-series clustering were investigated. Generally speaking, time-series clustering accuracy suffers from overlooking of data, inaccurate distance measures, and inappropriate clustering algorithms. To support the problem statement, scenarios corresponding to each of these reasons are explained in the following subsections:

1.3.1 Overlooking of Data

First, time-series datasets are very large. Hence, the dimensionality of time-series data must be reduced due to limited space and memory. As a result, data are overlooked; that is, important points of the series which appear in the data over time are lost. Therefore, some significant parts of a time-series (such as peaks in a stock price) may not play their role in the final clusters which reflect the system. For example, the daily blood pressure of a patient is a time-series, the time points of which are the average of (or a sample of) the patient’s blood pressure for each day over a specific period. Samples of blood pressure are usually taken two to three times a day, but the time-series is represented on a daily basis to reduce its dimensionality. The daily time-series of two patients can be very similar, which puts them in the same group (Lai, Chung, & Tseng, 2010). However, an in-depth look into the time-series of blood pressures may reveal that the blood pressure of one patient varies more, and more frequently, than that of the other. For example, one may lose some dangerous points of low blood pressure for a patient, as illustrated in Figure 1.1. This clearly shows why accurate clustering of time-series is important.

Figure 1.1: The time-series of blood pressure of two patients in two different granularities. Adapted from Lai et al. (2010)

As a result, given a time-series dataset, a clustering algorithm may generate different results when different time granularities are considered (see Appendix A for more examples). However, for many sensitive systems (e.g., financial or healthcare systems), generating such unstable and inaccurate results across various representations of time-series related to a unique problem is unacceptable (see Section 5.3.1.1.1).

1.3.2 Inaccurate Similarity Measure

Similarity/dissimilarity measurement is a core task of data mining. Similarity/dissimilarity measurement between time-series is not as simple as that for static objects, because the order of points in the time-series must be taken into account. Various distance measures are given in the literature (see Section 2.5). However, no unique superior measure exists, because the choice is highly dependent on the type and characteristics of the time-series and on the representation method used. For example, a financial time-series (stock data) has its own characteristics where salient points are important, but in another time-series, the trend or frequency may be more significant in similarity measurement (Lkhagva, Suzuki, & Kawagoe, 2006). Moreover, the similarity/dissimilarity measure among time-series is highly dependent on the representation method used for dimensionality reduction (due to compatibility purposes). Finally, similarity measures suffer from the shifting, noise, and outliers of time-series in a high-dimensional space (see Section 2.5).

For example, using the Euclidean distance on time-series may generate wrong clusters (Keogh & Ratanamahatana, 2004) because it disregards shifts, while a more accurate distance measure such as Dynamic Time Warping (DTW) may solve the problem by handling the shifts, as depicted in Figure 1.2. However, DTW and most state-of-the-art similarity matching methods are quadratic in time and are impractical in reality, especially on large datasets (Salvador & Chan, 2007). Moreover, DTW may not work well on all datasets, as explained in Section 6.2.3. As a result, providing an accurate and fast distance measure is not a trivial task, and using imprecise measures may result in incorrect clusters. This issue is a major drawback in many sensitive systems.

Figure 1.2: Clustering examples of CBF dataset using DTW and ED
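To illustrate this contrast, the following is a minimal sketch of the classic dynamic-programming formulation of DTW (a standard textbook version, not the exact implementation evaluated later in this thesis); its O(n x m) cost per pair of sequences is precisely why naive all-pairs DTW is considered too slow for large datasets:

    import numpy as np

    def dtw_distance(a, b):
        # Classic O(len(a) * len(b)) dynamic-programming DTW between two
        # 1-D sequences, using squared point-wise costs.
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = (a[i - 1] - b[j - 1]) ** 2
                cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
        return np.sqrt(cost[n, m])

    # Two series with the same shape but a small phase shift: Euclidean
    # distance penalizes the shift, whereas DTW aligns the peaks first.
    a = np.array([0, 0, 1, 2, 1, 0, 0], dtype=float)
    b = np.array([0, 1, 2, 1, 0, 0, 0], dtype=float)
    print(np.linalg.norm(a - b))  # Euclidean distance: 2.0
    print(dtw_distance(a, b))     # DTW distance: 0.0 (perfect alignment)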

1.3.3 Inappropriate Algorithms

Static data is derived from an object whose values do not change, or change insignificantly, over time. Most, if not all, developed clustering algorithms are designed for static data. On the other hand, unlike static data, time-series data change dynamically. Previous studies (as discussed in Sections 2.4 and 2.5) imply that most systems make use of conventional algorithms (originally designed for static data) for time-series clustering. However, because time-series are high dimensional and clustering them is computationally expensive, a representation method or an adaptive distance measure is utilized to customize the conventional algorithms (Warrenliao, 2005). This leads to inaccuracy in the final results due to overlooking of data (by representation) or incorrect similarity matching (by inaccurate distance calculation).

To sum up, in many systems, using conventional clustering algorithms is not practical for time-series data, and thus the absence of an adequate clustering model for time-series is apparent. Although a trade-off exists between the accuracy and speed of clustering of time-series (which corresponds to the representation method and distance measure), researchers argue that an appropriate model can resolve this issue. In this thesis, a novel model for clustering of large time-series datasets is presented to remedy the inaccuracy of results. In relation to large datasets, the first things that come to mind are speed and execution time, but the problem here is NOT to perform clustering quickly, but rather to do so in an accurate and meaningful way for sensitive systems.

1.4 Research Questions

The research questions which are answered through this study are as follows:

Q1. Is there any alternative approach for increasing the accuracy of clustering of symbolised time-series?

Q2. Which clustering algorithm is more suitable for dimensionality-reduced time-series?

Q3. How can the constructed clusters be revised as a post-clustering action to achieve better results?

Q4. Is there any alternative approach to run the proposed model as an interactive clustering approach?

Q5. How much does the proposed model improve the accuracy on different datasets?

1.5 Research Objectives

The main goal of this research is to improve the accuracy of time-series clustering in order to produce more intuitive and meaningful clusters. The main objectives are threefold:

1. To propose and develop a new clustering model that accepts large raw time-series data as input and generates accurate clusters without sacrificing execution time (answering the first three research questions).

2. To extend the proposed model, enabling it to run interactively.

3. To evaluate the capability of the proposed methods in improving the accuracy.

1.6 Scope of Research

To ensure that this research can achieve its set of objectives within the stipulated timeframe, the following scope of the research needs to be defined:

1. The clustering of a set of individual and discrete time-series is performed as “whole time-series clustering” (see Section 2.2.2). The focus of this study is to find the clusters of time-series which are similar in shape (see Section 2.5).

2. The focus is on whole time-series clustering of series with a short or modest length, not on long time-series, because comparing time-series that are too long is usually not very significant (J. Lin, Etter, & DeBarr, 2008). For long time-series clustering, some global measures (e.g., seasonality, periodicity, skewness, chaos) obtained by statistical operations are more important (X. Wang et al., 2006).

3. The focus is on the accuracy and meaningfulness of the clusters, not on speed. Meaningful clusters indicate that the clusters should capture the natural structure of the data (Tan, Steinbach, & Kumar, 2006).

1.7 Chapter Organization

This thesis consists of seven chapters. Chapter 1 presents a brief background of time-series clustering and its challenges. The aim, objectives, and scope of the research are also defined. Moreover, this chapter presents the main causes of inaccuracy in time-series clustering and highlights them by providing some example problems.

Chapter 2 reviews existing studies on time-series clustering. It covers the definition of time-series data, clustering of time-series, cluster prototypes, different distance measures, and time-series representation methods. Then, existing time-series clustering approaches are discussed, and the strengths and weaknesses of existing works are elaborated. This chapter provides a common platform from which to launch further discussions in the next chapters.

Chapter 3 describes the research methodology utilized to achieve the research objectives. It includes a brief overview of the Multi-step Time-series Clustering (MTC) and Interactive Multi-step Time-series Clustering (IMTC) models, as well as their main tasks and the motivation for applying different methods. Moreover, the evaluation plan is explained in this chapter.

Chapter 4 explains the main part of the thesis, which is the presentation and design of the MTC and IMTC models. It focuses on the main subject of this study: accurate clustering of large time-series data. The methods which are adopted or designed to develop the models are discussed in this chapter.

Chapter 5 describes the experimental data used in this thesis. Then, MTC is applied to the existing datasets. Many aspects of this thesis are discussed in this chapter, including the methods designed to achieve the study’s objectives. The processes of pre-clustering, refining the clusters, and defining the representatives of each cluster are analysed here. Most of the research questions are answered in this chapter. Finally, the parameter settings are discussed, and the final results are reported.

Chapter 6 evaluates and discusses the proposed models from different aspects. First, the accuracy of MTC on different datasets is evaluated. Then, the experimental evaluation of MTC is shown in comparison with rival models, and the performance of IMTC is experimentally determined. Finally, the scalability and sensitivity of the proposed model are discussed.

Chapter 7 concludes the research and discusses how the research objectives were met. The contribution of the research outcome is then discussed, followed by suggestions on aspects of time-series clustering that may be examined in further research.

2.0 BACKGROUND AND LITERATURE REVIEW

2.1 Introduction

Clustering is a technique used to put similar data elements into homogeneous groups without advance knowledge of the group definitions (Rai & Singh, 2010). It is a useful approach for identifying structure in an unlabelled dataset. Clusters are formed in such a way that objects have maximum similarity with other objects within the group, and minimum similarity with objects in other groups. A special type of clustering is time-series clustering. Time-series are dynamic objects whose feature values change over time. The data in various systems, such as finance, healthcare, and business, are stored as time-series. As a result, interest in time-series clustering has increased in various areas such as medicine, biology, finance, economics, and the Web. It is obvious that, in order to cluster time-series data, it is necessary to select a suitable representation, distance measure, and clustering algorithm for the data at hand. In this chapter, an overview of existing time-series clustering approaches and their strengths and weaknesses is given. In Section 2.2, the basics of time-series clustering are presented, including the applications of time-series clustering and its taxonomy. Then, whole time-series clustering, which is the scope of this study, is emphasized in Section 2.3; in this section, the published articles related to whole time-series clustering are introduced. The major methods of each component of time-series clustering are then explained: time-series representation in Section 2.4, data similarity/dissimilarity measurement in Section 2.5, and defining prototypes in Section 2.6. Performance evaluation of time-series clustering is performed using different criteria, which are explained in Section 2.7. Finally, in Section 2.8, existing whole time-series clustering algorithms are explained based on different categories of clustering approaches, including hierarchical-based, partitioning-based, model-based, density-based, and grid-based. Moreover, focusing on the accuracy problem, the recent approaches which are very similar to this study are discussed in this section under the multi-step approaches. Subsequently, the common problems in existing works are summarized in Section 2.9.

2.2 Time-series Clustering

With the increasing power of data storage and processing, real-world applications have the chance to store and keep data for a long time. As a result, data in many applications are stored in the form of time-series data, for example, sales data, stock prices (e.g., the value of Google stock), exchange rates in finance, weather data (e.g., annual rainfall), biomedical measurements (e.g., blood pressure and electrocardiogram measurements), biometrics data (image data for facial recognition), and particle tracking in physics. Accordingly, different works are found in a variety of domains such as geology (Harms & Goddard, 2001), bioinformatics (Gusfield, 1997), biology (Bar-Joseph, Gerber, Gifford, Jaakkola, & Simon, 2002), genetics (Das, Lin, Mannila, Renganathan, & Smyth, 1998), human motion analysis (Uehara & Shimada, 2002), space exploration (Honda, Wang, Kikuchi, & Konishi, 2002; Keogh, 1997a; Oates, 1999), handwriting recognition (Vuori & Laaksonen, 2002), multimedia (Niennattrakul & Ratanamahatana, 2007a; Ratanamahatana & Niennattrakul, 2006; Ratanamahatana, 2005), telecommunications (Das, Gunopulos, & Mannila, 1997), and finance (Gavrilov, Anguelov, Indyk, & Motwani, 2000; Ge & Smyth, 2000; R. Lin & Shim, 1995).

This amount of time-series data has provided the opportunity for many researchers in the data mining community to analyse time-series over the last decade. As a result, much research and many projects relevant to analysing time-series have been carried out in various areas for different purposes: subsequence matching, anomaly detection, motif discovery (J. Lin, Keogh, Lonardi, Lankford, & Nystrom, 2004), indexing, clustering, classification (Keogh & Kasetty, 2003), visualization (Haigh, Foslien, & Guralnik, 2004), segmentation (Keogh, Chu, & Hart, 2004), identifying patterns, trend analysis, summarization (J. Lin, Keogh, Lonardi, et al., 2003), and forecasting. Moreover, there are many ongoing research projects aimed at improving the existing techniques (Rakthanmanon, Campana, Batista, Zakaria, & Keogh, 2012; Zakaria, Rotschafer, Mueen, Razak, & Keogh, 2012). There are also many reviews and surveys on time-series analysis and its applications (H. Ding, Trajcevski, Scheuermann, Wang, & Keogh, 2008; Fu, 2010; Hirano & Tsumoto, 2005).

Among all the techniques which have been applied to analyse time-series data, clustering is one of the most frequently used, due to its exploratory nature and its application as a pre-processing step in more complex data mining algorithms (C. Ding, He, Zha, & Simon, 2002; Fayyad, Reina, & Bradley, 1998; Kalpakis, Gada, & Puttagunta, 2001). There are some comprehensive surveys and reviews that focus on comparative aspects of time-series clustering experiments (Antunes & Oliveira, 2001; Kavitha & Punithavalli, 2010; Keogh & Kasetty, 2003; Laxman & Sastry, 2006; Rani & Sikka, 2012; Warrenliao, 2005), which shows a trend of increased activity.

2.2.1 Applications of Time-series Clustering

Clustering of time-series data is mostly utilized for the discovery of interesting patterns in time-series datasets (Das et al., 1998; H. Wang, Wang, Yang, & Yu, 2002). This task itself falls into two categories: the first group is used to find patterns that frequently appear in the dataset (Chiu, Keogh, & Lonardi, 2003; Fu, Chung, Ng, & Luk, 2001); the second group comprises methods to discover patterns which occur in datasets surprisingly (P. K. Chan & Mahoney, 2005; Keogh, Lonardi, & Chiu, 2002; Leng, Lai, Tan, & Xu, 2009; Wei, Kumar, Lolla, & Keogh, 2005). Briefly, finding the clusters of time-series can be advantageous in different domains to answer real-world problems as follows:

1. Anomaly (novelty or discord) detection: methods to discover unusual and unexpected patterns which occur in datasets surprisingly (P. K. Chan & Mahoney, 2005; Keogh et al., 2002; Leng et al., 2009; Wei et al., 2005). For example, in sensor databases, clustering of the time-series produced by the sensor readings of a mobile robot in order to discover events (Polz, Hortnagl, & Prem, 2003).

2. Recognizing dynamic changes in time-series: detection of correlation between time-series (He et al., 2011). For example, in financial databases, it can be used to find companies with similar stock price movements.

3. Prediction and recommendation using clustering: a hybrid technique combining clustering and function approximation per cluster can help users to predict and recommend (Graves D, 2010; Ito, Hiroyasu, Miki, & Yokouchi, 2009; Pavlidis, Plagianakos, Tasoulis, & Vrahatis, 2006; Sfetsos & Siriopoulos, 2004). For example, in scientific databases, it can address problems such as finding the patterns of solar magnetic wind to predict the daily pattern.

4. Pattern discovery: to discover interesting patterns in databases. For example, in a marketing database, the different daily patterns of sales of a specific product in a store can be discovered.

Table 2.1 depicts some applications of time-series clustering in different domains.

Table 2.1: Samples of objectives of time-series clustering in different domains

Reference | Dataset | Objective
Košmelj & Batagelj (1990) | Country's energy consumption | Energy consumption patterns of 23 European countries (commercial consumption)
Van Wijk & Van Selow (1999) | Daily power consumption | Discovering consumer power consumption patterns
Ramoni, Sebastiani, & Cohen (2000) | Robot sensor data | To form prototypical representations of the robot's experiences
Fu et al. (2001) | Stock market data | Discovery of patterns from stock time-series
Golay et al. (1998); Wismüller et al. (2002) | Functional MRI (fMRI) | To detect brain activity
Tran & Wagner (2002) | Speech time-series | Speaker verification
M. Kumar & Patel (2002) | Sales data from several departments of a major retail chain | To find seasonality patterns (retail patterns)
Steinbach, Tan, Kumar, Klooster, & Potter (2003) | Climate time-series | Discovery of climate indices
Möller-Levet, Klawonn, Cho, & Wolkenhauer (2003) | Gene expression | Identification of functionally related genes
Bagnall, Janacek, De la Iglesia, & Zhang (2003) | Time-series representing per capita personal income | Personal income patterns
Shumway (2003) | Earthquake | Analysing potential violations of a Comprehensive Test Ban Treaty (CTBT)
Guan & Jiang (2007) | Financial data | To create an efficient portfolio (a group of stocks owned by a particular person or company)
C. Guo, Jia, & Zhang (2008) | Stock exchange data | Discovery of patterns from stock time-series
Rebbapragada, Protopapas, Brodley, & Alcock (2009) | Astronomical data (star light curves) | Pre-processing for anomaly detection
Gullo et al. (2011) | Mass spectra data | Exploring, identifying, and discriminating pathological cases from MS clinical samples
Kurbalija et al. (2012) | Human behaviour data | Analysis of human behaviour in the psychological domain

18 2.2.2 Taxonomy of Time-series Clustering

In reviewing the literature, one can conclude that most works related to clustering time-series fall into three categories: “whole time-series clustering”, “subsequence clustering”, and “time point clustering”, as depicted in Figure 2.1. The first two categories are mentioned in Keogh and Lin (2005). “Whole time-series clustering” is the clustering of a set of individual time-series with respect to their similarity; here, clustering means applying conventional clustering (usually) to discrete objects, where the objects are time-series. “Subsequence clustering” means clustering a set of subsequences of a time-series that are extracted via a sliding window, that is, clustering of segments from a single long time-series. Additionally, there is another category of clustering, “time point clustering”, seen in some papers (Gionis & Mannila, 2003; Morchen, Ultsch, Mörchen, & Hoos, 2005; Ultsch & Mörchen, 2005). It is the clustering of time points based on a combination of their temporal proximity and the similarity of their corresponding values. This approach is similar to time-series segmentation; however, it differs from segmentation in the sense that not all points need to be assigned to clusters, i.e., some of them are considered as noise.


Figure 2.1: Time-series clustering taxonomy

Essentially, subsequence clustering is performed on a single time-series, and it has been shown to be meaningless (Keogh & Lin, 2005). Time point clustering is also applied to a single time-series and is similar to time-series segmentation; that is, the objective of time point clustering is to find clusters of time points instead of clusters of time-series data. Hence, in this thesis, the emphasis is on whole time-series clustering. In the next section, whole time-series clustering and its challenges are explained.

2.3 Whole Time-series Clustering

A complete review of whole time-series clustering is performed and shown in Table 2.4. In reviewing the literature, various techniques have been recommended for the clustering of whole time-series data. However, most of them take one of the following approaches to cluster time-series data:

1) Customizing the existing conventional clustering algorithms (which work with static data) such that they become compatible with the nature of time-series data. In this approach, the distance measure of the conventional algorithm is usually modified to be compatible with the raw time-series data (Warrenliao, 2005).

2) Converting time-series data into simple objects (static data) as input for conventional clustering algorithms (Warrenliao, 2005).

3) Using multiple resolutions of time-series as input to a multi-step approach. This approach is discussed further in Section 2.8.6.

Besides this common characteristic, there are generally three different ways to cluster time-series, namely shape-based, feature-based and model-based. Figure 2.2 shows a summary of these approaches. In the shape-based approach, the shapes of two time-series are matched as well as possible by a non-linear stretching and contracting of the time axes. This approach has also been labelled a raw-data-based approach because it typically works directly with the raw time-series data. Shape-based algorithms usually employ conventional clustering methods that are compatible with static data, while their distance/similarity measure has been replaced with one appropriate for time-series. In the feature-based approach, the raw time-series are converted into feature vectors of lower dimension, and a conventional clustering algorithm is then applied to the extracted feature vectors. Usually in this approach, an equal-length feature vector is calculated from each time-series, followed by Euclidean distance measurement (Hautamaki, Nykanen, & Franti, 2008). In model-based methods, a raw time-series is transformed into model parameters (a parametric model for each time-series), and then a suitable model distance and a clustering algorithm (usually a conventional one) are chosen and applied to the extracted model parameters (T. W. Liao, 2005).

However, it has been shown that model-based approaches usually have scalability problems (Vlachos, Gunopulos, & Das, 2004), and their performance degrades when the clusters are close to each other (Mitsa, 2009).

Figure 2.2: The time-series clustering approaches

As explained in Section 1.3, “whole time-series clustering” approaches suffer from overlooked data (in the feature-based approach) and the high complexity of distance measures (in shape-based approaches). The approach proposed in this thesis uses a multi-step approach composed of shape-based and feature-based techniques in order to exploit their strengths and overcome their drawbacks. That is, the proposed model uses time-series data in low resolution, which reduces the complexity of clustering. Moreover, data are used in high-resolution mode to generate the prototypes and to avoid overlooking data. In the proposed model, the transformed time-series are used for pre-clustering, and the high-dimensional time-series for revising the results. The motivation for using a multi-step approach is discussed further in Section 3.2.4.

Reviewing existing works in the literature, it can be concluded that time-series clustering essentially has five components: the dimensionality reduction (representation) method, the distance measurement, the clustering algorithm, the prototype definition, and the evaluation. Figure 2.3 shows an overview of these components.

[Figure: the five components of whole time-series clustering and their options. Algorithm: hierarchical, partitioning, model-based, density-based, grid-based. Representation method: data adaptive, non-data adaptive, model-based, data dictated. Prototypes: medoid, averaging, local search. Distance measure: level (shape, structure), similarity (in time, in shape, in change). Evaluation: internal, external.]

Figure 2.3: An overview of the five components of whole time-series clustering

The general process of time-series clustering uses some or all of these components, depending on the problem. Usually, the data is approximated (using a representation method) in such a way that it can fit in memory. Then, a clustering algorithm is applied to the data using a distance measure. In the clustering process, a prototype is usually required for summarization of the time-series. Finally, the clusters are evaluated using validity criteria.

In the following sub-sections, each component is discussed, and several related works and methods are reviewed.

2.4 Time-series Representation

The first component of time-series clustering explained here is dimension reduction.

Applying dimension reduction is a common solution in most whole time-series clustering approaches proposed in the literature (G. Duan, Suzuki, & Kawagoe, 2006; Ghysels, Santa-Clara, & Valkanov, 2006; Keogh, 2005; J. Lin, Keogh, Lonardi, et al., 2003). This section looks at methods of time-series representation (data reduction).

Definition 2.1: Time-series representation. Given a time-series $F = \{f_1, f_2, \dots, f_T\}$, representation is transforming the time-series to a dimensionality-reduced vector $F' = \{f'_1, f'_2, \dots, f'_x\}$, where $x < T$.

Based on Ratanamahatana et al. (2005), “the key to the efficiency and accuracy of the solution is to choose an appropriate data representation method”. High dimensionality and noise are characteristics of most time-series data (Keogh & Kasetty, 2003). Dimensionality reduction methods are usually used in whole time-series clustering in order to address these issues and improve performance. For example, Figure 2.4 shows a time-series related to the balance of transactions of a bank customer, together with its reconstructed coefficients and approximation coefficients, as an example of time-series representation at different levels.


Figure 2.4: A sample of dimensionality reduction of a time-series. Left: coefficients of the time-series constructed from a ‘Haar’ wavelet decomposition structure. Right: the approximation coefficients of the time-series.

Time-series reduction techniques have come a long way and are widely used with large-scale time-series datasets, each with its own features and drawbacks.

Accordingly, much research has been carried out focusing on representation and dimensionality reduction (K. Chan & Fu, 1999; Keogh, Chakrabarti, Pazzani, & Mehrotra, 2001b; Keogh & Pazzani, 1998; J. Lin, Keogh, Wei, & Lonardi, 2007; Popivanov & Miller, 2002; Y. L. Wu, Agrawal, & El Abbadi, 2000; Yi & Faloutsos, 2000). It is worth mentioning one of the most recent comparisons of representation methods: H. Ding et al. (2008) performed a comprehensive comparison of 8 representation methods on 38 datasets. Although they investigated the indexing effectiveness of representation methods, the results are advantageous for clustering purposes as well. They use the tightness of lower bounds to compare representation methods and show that there is very little difference between recent representation methods.

In the taxonomy of representations, there are generally four representation types (Bagnall, Ratanamahatana, Keogh, Lonardi, & Janacek, 2006; J. Lin, Keogh, Lonardi, et al., 2003; Ratanamahatana et al., 2005; Shieh & Keogh, 2009): data adaptive, non-data adaptive, model-based and data dictated representation approaches, as depicted in Figure 2.5.

[Figure: representation methods divided into data adaptive, non-data adaptive, model-based and data dictated approaches.]

Figure 2.5: Hierarchy of different time-series representation approaches

Data adaptive representation methods are performed on all time-series in the dataset and try to minimize the global reconstruction error (Xiaoyue Wang et al., 2012) using arbitrary-length (non-equal) segments. This technique has been applied in different approaches such as Piecewise Polynomial Interpolation (PPI) (Morinaka, Yoshikawa, Amagasa, & Uemura, 2001), Piecewise Polynomial Regression (PPR) (Shatkay & Zdonik, 1996), Piecewise Linear Approximation (PLA), Piecewise Constant Approximation (PCA), Adaptive Piecewise Constant Approximation (APCA) (Keogh et al., 2001b), Singular Value Decomposition (SVD) (Faloutsos et al., 1994; Korn, Jagadish, & Faloutsos, 1997), Natural Language and Symbolic Natural Language (NLG) (Portet et al., 2009), and Symbolic Aggregate ApproXimation (SAX) and iSAX (J. Lin et al., 2007). Data adaptive representations can better approximate each series, but the comparison of several time-series is more difficult. Non-data adaptive approaches are representations suitable for time-series with fixed-size (equal-length) segmentation, where the comparison of representations of several time-series is straightforward. The methods in this group include wavelets (K. Chan & Fu, 1999): Haar, Daubechies, Coiflets, Symlets and the Discrete Wavelet Transform (DWT); spectral Chebyshev Polynomials (Cai & Ng, 2004); the spectral DFT (Faloutsos et al., 1994); Random Mappings (Bingham, 2001); Piecewise Aggregate Approximation (PAA) (Keogh et al., 2001a); and Indexable Piecewise Linear Approximation (IPLA) (Q. Chen, Chen, Lian, & Liu, 2007). Model-based approaches represent a time-series in a stochastic way, for example with Markov models and Hidden Markov Models (HMM) (Minnen, Isbell, Essa, & Starner, 2007; Minnen, Starner, Essa, & Isbell, 2006; Panuccio, Bicego, & Murino, 2002), statistical models, Time-series Bitmaps (N. Kumar, Lolla, Keogh, & Lonardi, 2005), and Auto-Regressive Moving Average (ARMA) models (Corduas & Piccolo, 2008; Kalpakis et al., 2001). In the data adaptive, non-data adaptive and model-based approaches, the user can define the compression ratio based on the application at hand. In contrast, in data dictated approaches, the compression ratio is defined automatically based on the raw time-series, as in the clipped representation (Bagnall et al., 2006; Ratanamahatana et al., 2005).

The following table (Table 2.2) shows the most well-known representation methods in the literature.

Table 2.2: Representation methods for time-series data

- Chebyshev Polynomials (CHEB). Introduced by Cai & Ng (2004). Type: non-data adaptive (spectral, orthonormal).
- Indexable Piecewise Linear Approximation (IPLA). Introduced by Q. Chen et al. (2007). Type: non-data adaptive.
- Discrete Fourier Transform (DFT). Introduced by Agrawal, Faloutsos, & Swami (1993); used by Owsley, Atlas, & Bernard (1997) and Faloutsos et al. (1994). Complexity: O(n log n). Type: non-data adaptive (spectral). Usage: natural signals. Pros: no false dismissals. Cons: does not support time-warped queries.
- Discrete Wavelet Transform (DWT). Introduced by K. Chan & Fu (1999) and Kawagoe & Ueda (2002); used by Agrawal et al. (1993), Vlachos, Lin, & Keogh (2003), Shahabi, Tian, & Zhao (2002) and H. Guo, Liu, Liang, & Gao (2008). Complexity: O(n). Type: non-data adaptive (wavelet). Usage: stationary signals. Pros: better results than DFT. Cons: unstable results; the series length must be a power of two.
- Discrete Cosine Transformation (DCT). Introduced by Korn et al. (1997). Type: non-data adaptive (spectral).
- Singular Value Decomposition (SVD). Introduced by Faloutsos et al. (1994) and Korn et al. (1997). Complexity: very expensive, O(Mn2). Type: data adaptive. Usage: text-processing community. Pros: captures the underlying structure of the data.
- Piecewise Linear Approximation (PLA). Introduced by Keogh & Pazzani (1998). Complexity: O(n log n) (O(n2N) for the “bottom-up” algorithm). Type: data adaptive. Usage: natural signals, biomedical signals. Cons: not (currently) indexable, very expensive.
- Piecewise Aggregate Approximation (PAA). Introduced by Yi & Faloutsos (2000) and Keogh et al. (2001a); used by Ge & Smyth (2000) and Perng, Wang, Zhang, & Parker (2000). Complexity: extremely fast, O(n). Type: non-data adaptive. Supports arbitrary lengths, weighted, elastic and lower-bounding distances.
- Adaptive Piecewise Constant Approximation (APCA). Introduced by Keogh et al. (2001b). Complexity: O(n). Type: data adaptive. Pros: efficient. Cons: very complex implementation.
- Clipped Data. Introduced by Ratanamahatana et al. (2005); used by Bagnall & Janacek (2005) and Bagnall et al. (2006). Type: data dictated. Usage: hardware representation. Pros: ultra-compact representation.
- Symbolic Aggregate ApproXimation (SAX). Introduced by Keogh, Lonardi, & Ratanamahatana (2004); used by J. Lin et al. (2007). Complexity: O(n). Type: data adaptive. Usage: string processing and bioinformatics. Pros: allows lower bounding and numerosity reduction. Cons: discretization and alphabet-size parameters must be chosen.
- Perceptually Important Points (PIP). Introduced by Chung, Fu, & Luk (2001); used by Fink & Pratt (2003), T. C. Fu, Chung, Luk, & Ng (2010), T. C. Fu et al. (2001) and Pratt & Fink (2002). Type: non-data adaptive. Usage: finance.

2.4.1 Other Representation Methods

Different approaches for the representation of time-series data have been proposed in the literature. Although most of them focus on reducing execution time (mostly in the indexing process), some consider the quality of the representation.

Within this group of works, Ratanamahatana et al. (2005) focus on the accuracy of the representation method and suggest a bit-level approximation of time-series. Each time-series is represented by a bit string, where each bit specifies whether the data point’s value is above the mean value of the time-series. This representation can be used to compute an approximate clustering of the time-series. This kind of representation (the clipped representation) has the capability of being compared directly with raw time-series, whereas for the other representations, all time-series in the dataset must be transformed into the same representation for dimensionality reduction. However, clipped series are theoretically and experimentally sufficient only for clustering based on similarity in change (model-based dissimilarity measurement), not for clustering based on shape.
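To make the bit-level idea concrete, the following is a minimal sketch in Python (NumPy assumed as a dependency) of how such a clipped representation can be computed; it is an illustration of the principle, not the exact procedure of the cited work.

```python
import numpy as np

def clipped_representation(series):
    """Bit-level (clipped) representation: each bit indicates whether a
    data point lies above the mean of the series (a minimal sketch)."""
    series = np.asarray(series, dtype=float)
    return (series > series.mean()).astype(np.uint8)

# Example: a short series is reduced to a bit string.
bits = clipped_representation([1.0, 3.5, 2.0, 0.5, 4.2])
print(bits)  # -> [0 1 0 0 1]
```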

In Lkhagva et al. (2006), the authors consider the importance of accuracy in financial systems and propose the Extended Symbolic Aggregate ApproXimation (ESAX), a symbolic representation customized for financial time-series. Later, in another work, Liu & Shao (2009) present a similarity measure based on SAX for financial time-series. They focus on a shortcoming of the SAX representation, namely its failure to consider the dynamic information of trends, and propose a similarity measure, the Composite Distance Function, which joins the advantages of point distance and trend distance.

The Perceptually Important Points (PIPs) proposed by Chung et al. (2001) provided one of the first motivations for a group of works on improving the accuracy of representation methods, especially in financial systems. Later, T. C. Fu, Chung, Luk, & Ng (2007) used the concept of PIPs to propose a new representation in which the important points of a time-series are collected by measuring the distance between the time-series points and their trend; the point with the largest distance is chosen. Then, Bao (2007) suggested a generalized model of financial time-series using the turning points in financial data; turning points are the important points presented by Fu et al. (2007). In another study (Phetking, Sap, & Selamat, 2008), the authors focus on the problem of overlooking important points in financial time-series after transformation into another representation. Consequently, they propose a multiresolution important point retrieval method for financial time-series representation. They use the Most Important Points (MIPs), which comprise the most peak (MP) points and the most dip (MD) points of the time-series. Ultimately, Fu et al. (2010) pose the accuracy problem related to indexing financial time-series and present an indexing approach based on clustering: a time-series indexing framework based on data point importance for dimensionality reduction, in which important points are used to increase the quality of the representation.

2.4.2 Discussion

There are many works related to the representation of time-series, which were categorized and discussed in detail in Section 2.4. However, it is undeniable that the more dimensionality reduction occurs, the more information is lost and the less accurate the representation becomes, although execution time decreases accordingly. Finding a trade-off between accuracy and speed is a controversial and non-trivial task in representation methods. That is, a threshold of dimensionality reduction should be found, which is a subjective issue and highly dependent on the application at hand and the type of time-series in the dataset. Among all these representation methods, each with its strong points and weaknesses, the focus in this thesis is on the Symbolic Aggregate ApproXimation (SAX) representation, because of its strength in representation as described in Section 2.4.4. First, however, a review is given of the Piecewise Aggregate Approximation (PAA) representation, which is the basis of SAX.

2.4.3 Brief Review of PAA

Two different studies (Keogh et al., 2001a; Yi & Faloutsos, 2000) separately approximated time-series using a segmentation approach. They use the mean value of equal-length segments of a time-series as the approximate value of each segment. This technique is called Piecewise Aggregate Approximation (PAA); it is quite fast (Keogh et al., 2001b), can be used for arbitrary-length queries, and is able to handle different distance measures (Ge & Smyth, 2000; Perng et al., 2000). Figure 2.6 shows a time-series with 32 data points which is represented by an 8-dimensional time-series using PAA. If $F = \{f_1, \dots, f_n\}$ is a time-series, then PAA discretizes the time-series to $\bar{F} = \{\bar{f}_1, \dots, \bar{f}_w\}$, where $w < n$. Each element $\bar{f}_i$ of $\bar{F}$ is a real value which is the mean of all data points in the $i$-th segment of $F$:

$\bar{f}_i = \frac{w}{n} \sum_{t = \frac{n}{w}(i-1)+1}^{\frac{n}{w} i} f_t$

Figure 2.6: Representation of a time-series by PAA

PAA is competitive with the Fourier transform, wavelets, and many more sophisticated approaches in terms of quality and speed (K. Chan & Fu, 1999; Keogh et al., 2001a; Yi & Faloutsos, 2000). However, PAA represents the various movement shapes of a time-series poorly in a lower-dimensional space, because of the smoothing effect of its averaging process (Park & Lee, 2010).
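As an illustration of the technique described above, the following is a minimal PAA sketch in Python (NumPy assumed; for simplicity, the series length is assumed divisible by the number of segments):

```python
import numpy as np

def paa(series, w):
    """Piecewise Aggregate Approximation (a minimal sketch): split the
    series into w equal-length segments and keep each segment's mean."""
    series = np.asarray(series, dtype=float)
    return series.reshape(w, -1).mean(axis=1)

# Example: a 32-point series reduced to 8 PAA coefficients.
ts = np.sin(np.linspace(0, 4 * np.pi, 32))
print(paa(ts, 8))
```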

2.4.4 Brief Review of SAX

In this thesis, the Symbolic Aggregate ApproXimation (SAX) transformation is adopted in order to reduce the dimensionality of the time-series data. SAX is a symbolic representation of time-series developed by Keogh et al. in 2003 and has been used by more than 50 groups in different data mining research projects (J. Lin et al., 2007). This method is a two-step process which transforms a time-series into the Piecewise Aggregate Approximation (PAA) representation, as explained in Section 2.4.3, and then maps the coefficients to symbols.

Let $\bar{F} = \{\bar{f}_1, \dots, \bar{f}_w\}$ be the time-series discretized by the PAA transformation. Then $\hat{F} = \{\hat{f}_1, \dots, \hat{f}_w\}$, where each $\hat{f}_i$ is defined by mapping the PAA coefficient $\bar{f}_i$ to one of ‘a’ SAX symbols. Here ‘a’ is the alphabet size (e.g., for the alphabet {a, b, c, d, e, f}, ‘a’ = 6), and the alphabet symbols in SAX are defined by “breakpoints”. Based on Keogh’s definition, “breakpoints” are a sorted list of numbers that determine the area of each symbol in the SAX transformation; they produce equal-size areas under the Gaussian curve, as illustrated in Figure 2.7. This is based on the fact that normalized time-series have a highly Gaussian distribution (J. Lin et al., 2007).


Figure 2.7: Symbolic representation of a time-series by SAX based on PAA

Figure 2.8 depicts a raw time-series and its symbolic representation, obtained from the PAA representation of the raw time-series of a bank customer’s credit card bills over the time span of one year.

Figure 2.8: Symbolic representation of a sample raw time-series by SAX

Now the question is: “Why symbolization?” Basically, a discretization technique that produces symbols with equiprobability is desirable (Apostolico, Bock, & Lonardi, 2003; Lonardi, 2001; Mörchen & Ultsch, 2005). When a time-series is transformed into a symbolic representation, e.g., using SAX, it needs less memory, which is essential in applications dealing with long time-series or huge amounts of time-series data (J. Lin et al., 2007). In such applications, fetching time-series data from disk into memory is a big challenge, and appropriate compression of the data is highly important. Considering that clustering algorithms typically require the recalculation of models directly in main memory in each iteration, a symbolic representation of the data speeds up the execution of the clustering algorithm. As one piece of evidence for this principle, Bagnall and Janacek (2005) use clipped series (a special case of SAX) to improve the running time of clustering algorithms. Moreover, they mention that using a symbolic representation instead of the raw time-series improves the accuracy of clustering in the presence of outliers. To sum up, SAX is as good as other well-known and often-used representation methods, such as the Discrete Wavelet Transform (DWT) and the Discrete Fourier Transform (DFT), while requiring less storage space (J. Lin et al., 2007).

2.5 Similarity/Dissimilarity Measure

This section looks at methods of distance measurement for choosing the best similarity measure among the time-series being compared. The theoretical issue of time-series similarity/dissimilarity search was first posed by Agrawal et al. (1993) and subsequently became a basic theoretical issue in the data mining community.

Time-series clustering relies on the distance measure to a high extent. There are different measures which can be applied to measure the distance between time-series. Some similarity measures are proposed based on a specific time-series representation, for example MINDIST, which is compatible with SAX (J. Lin et al., 2007), while others work regardless of the representation method or are compatible with raw time-series. In traditional clustering, the distance between static objects is based on exact matching, but in time-series clustering, distance is calculated approximately. In particular, in order to compare time-series with irregular intervals and lengths, it is of great significance to adequately determine the similarity of time-series. Different distance measures have been designed for specifying the similarity between time-series. The Hausdorff distance and modified Hausdorff (MODH), HMM-based distance, Dynamic Time Warping (DTW), Euclidean distance, Euclidean distance in a PCA subspace, and Longest Common Sub-Sequence (LCSS) are the most popular distance measurement methods used for time-series data. References on distance measurement methods are given in Table 2.3.

One of the simplest ways of calculating the distance between two time-series is to consider them as univariate time-series and then calculate the distance across all time points.

Definition 2.2: Univariate time-series. A univariate time-series is the simplest form of temporal data: a sequence of real numbers collected regularly in time, where each number represents a value (X. Wang et al., 2004).

Definition 2.3: Time-series distance. Let $F = \{f_1, \dots, f_T\}$ and $G = \{g_1, \dots, g_T\}$ be two time-series of length T. If the distance between the two time-series is defined across all time points, then $dist(F, G)$ is the sum of the distances between the individual points:

$dist(F, G) = \sum_{t=1}^{T} dist(f_t, g_t) \quad (2.1)$

Research on shape-based distance measurement of time-series usually has to contend with several problems: noise, amplitude scaling, offset translation, longitudinal scaling, linear drift, discontinuities and temporal drift, which are common properties of time-series data. These properties have been broadly investigated in the literature (Keogh & Pazzani, 1998).

The choice of a proper distance approach depends on the characteristics of the time-series, the length of the time-series, the representation method, and, of course, to a high extent on the objective of clustering the time-series. This is depicted in Figure 2.9.

[Figure: distance measure approaches characterized by objective (similarity in time, in shape, in change), level (shape level, structure level) and type (shape-based, comparison-based, feature-based, model-based).]

Figure 2.9: Distance measure approaches in the literature

Typically, there are three objectives which respectively require different approaches (Bagnall & Janacek, 2005):

1) Finding similar time-series in time: because this similarity is measured at each time step, correlation-based distances or the Euclidean distance are proper for this objective. However, because this kind of distance measurement is costly on raw time-series, the calculation is performed on transformed time-series, such as Fourier transforms, wavelets or Piecewise Aggregate Approximation (PAA). Keogh and Kasetty (2003) have carried out a comprehensive review of this matter. Clustering of time-series that are correlated (e.g., clustering the time-series of share prices of many companies to find which shares change together and how they are correlated) is categorized as clustering based on similarity in time (Bagnall & Janacek, 2005; Ratanamahatana et al., 2005).

2) Finding similar time-series in shape: the time of occurrence of patterns is not important for finding time-series that are similar in shape. As a result, elastic methods (Agrawal et al., 1993; Aref, Elfeky, & Elmagarmid, 2004) such as Dynamic Time Warping (DTW) (Chu, Keogh, Hart, Pazzani, et al., 2002) are used for dissimilarity calculation. Under this definition, clusters of time-series with similar patterns of change are constructed regardless of time points, for example, clustering the share prices of different companies which share a common pattern in their stock independent of its time of occurrence (Bagnall & Janacek, 2005). Similarity in time is a special case of similarity in shape. Research has revealed that similarity in shape is superior to metrics based on similarity in time (Ratanamahatana & Keogh, 2005).

3) Finding similar time-series in change (structural similarity): in this approach, modeling methods such as Hidden Markov Models (HMM) (Smyth, 1997) or an ARMA process (Kalpakis et al., 2001; Xiong & Yeung, 2002) are usually utilized, and similarity is then measured on the parameters of the model fitted to the time-series. That is, time-series with a similar autocorrelation structure are clustered together, e.g., clustering shares which have a tendency to increase the day after a fall in share price (Bagnall & Janacek, 2005). This approach is proper for long time-series, not for modest-length or short time-series (X. Wang et al., 2006).

Clustering approaches can be classified into two categories based on the length of the time-series: “shape level” and “structure level”. The “shape level” is usually utilized to measure similarity in short-length time-series clustering, such as expression profiles or individual heartbeats, by comparing their local patterns, whereas the “structure level” measures similarity based on global, high-level structure and is used for long time-series data such as an hour’s worth of ECGs or yearly meteorological data (X. Wang et al., 2006). Focusing on shape-based clustering of short time-series, this study uses shape-level similarity. Depending on the objective and the length of the time-series, the proper type of distance measure is determined. Essentially, there are four types of distance measure in the literature (see Table 2.3 for references). Shape-based similarity measures find time-series that are similar in time and shape, such as Euclidean distance, DTW (Sakoe & Chiba, 1971, 1978), LCSS (Banerjee & Ghosh, 2001; Vlachos, Kollios, & Gunopulos, 2002) and MVM (Latecki et al., 2005); this group of methods is proper for short time-series. Compression-based similarity is suitable for short and long time-series, such as CDM (Keogh et al., 2007), autocorrelation, the short time-series distance (Möller-Levet et al., 2003), Pearson’s correlation coefficient and related distances (Rodgers & Nicewander, 1988), the cepstrum (Kalpakis et al., 2001), piecewise normalization (Indyk & Koudas, 2000) and cosine wavelets (Huhtala & Karkkainen, 1999). Feature-based similarity measures, such as statistics and coefficients, are proper for long time-series. Model-based similarity is also proper for long time-series, such as HMM (Smyth, 1997) and ARMA (Kalpakis et al., 2001; Xiong & Yeung, 2002).

A survey of various methods for the efficient retrieval of similar time-series was given by Last and Kandel (2004). Furthermore, in (T. W. Liao, 2005), the author presented the formulas of various measures. Zhang et al. (2006) then performed a complete survey of the aforementioned distance measures and compared them in different applications. In Table 2.3, different measures are compared in terms of their complexity and characteristics.


Table 2.3: Similarity measure approaches in the literature

- Euclidean distance (ED). Defined by Faloutsos et al. (1994); used in (Golay et al., 1998; Keogh & Kasetty, 2003; Keogh, Wei, Xi, Lee, & Vlachos, 2006). Complexity O(n); short time-series; shape-based, similarity in time; not noise robust. A lock-step (one-to-one) measure used in indexing, clustering and classification; sensitive to scaling.
- Dynamic Time Warping (DTW). Defined by Sakoe & Chiba (1971, 1978); used in (Berndt & Clifford, 1994; Chu et al., 2002; Hu, Ray, & Han, 2006; Keogh & Kasetty, 2003; Keogh & Ratanamahatana, 2004; Sankoff & Kruskal, 1983; Yu, Dong, Chen, Jiang, & Zeng, 2007). Complexity O(n2), or O(δn) by restricting the warping path; short time-series; shape-based, similarity in shape. An elastic (one-to-many/one-to-none) measure; deals very well with temporal drift; better accuracy than Euclidean distance (Aach & Church, 2001; Chu et al., 2002; Vlachos et al., 2002; Yi & Faloutsos, 2000), but lower efficiency than Euclidean distance and triangle similarity.
- Longest Common Sub-Sequence (LCSS). Defined by Banerjee & Ghosh (2001) and Vlachos et al. (2002); used in (Aghabozorgi, Saybani, & Wah, 2012). Complexity O(n·δ); short time-series; shape-based, similarity in shape; noise robust.
- Minimal Variance Matching (MVM). Defined by Latecki et al. (2005). Complexity O(mn2); short time-series; shape-based. Automatically skips outliers.
- Short time-series distance (STS). Defined by Möller-Levet, Klawonn, Cho, & Wolkenhauer (2003). Short and long time-series; feature-based. Sensitive to scaling; can capture temporal information regardless of the absolute values.
- Probability-based distance. Defined by Kumar & Patel (2002). Long time-series; compression-based dissimilarity. Able to cluster seasonality patterns.
- KL distance. Defined by Dahlhaus (1996). Long time-series; compression-based dissimilarity.
- J divergence. Defined by Shumway (2003). Short time-series; shape-based, similarity in time.
- Triangle similarity measure. Defined by Zhang, Wu, Yang, Ou, & Lv (2009); used in (X. Zhang, Liu, Du, & Lv, 2011). Short time-series; shape-based, similarity in time. Deals with noise and amplitude scaling very well, and with offset translation and linear drift well in some situations (X. Zhang et al., 2009).
- Cross-correlation based distances. Defined by Golay et al. (1998); used in (Goutte et al., 1999). Short time-series; shape-based, similarity in shape. Offers noise reduction; able to summarize the temporal structure.
- Edit Distance with Real Penalty (ERP). Defined by L. Chen & Ng (2004). Short time-series; shape-based, similarity in shape. Robust to noise, shifts and scaling of data; a constant reference point is used.
- Edit Distance on Real sequence (EDR). Defined by L. Chen, Özsu, & Oria (2005). Short time-series; shape-based, similarity in shape. An elastic (one-to-many/one-to-none) measure; uses a threshold pattern.
- DISSIM. Defined by Frentzos, Gratsias, & Theodoridis (2007). Short time-series; shape-based, similarity in shape. Proper for different sampling rates.
- Sequence Weighted Alignment model (Swale). Defined by Morse & Patel (2007). Short time-series; shape-based, similarity in shape. Similarity score based on both match rewards and mismatch penalties.
- Spatial Assembling Distance (SpADe). Defined by Y. Chen, Nascimento, Ooi, & Tung (2007). Long time-series; model-based, similarity in change. A pattern-based measure.
- Threshold Queries (TQuEST). Defined by Aßfalg et al. (2006); used in (Aßfalg et al., 2008). Long time-series; model-based, similarity in shape. A threshold-based measure which considers the intervals during which the time-series exceeds a certain threshold, rather than the exact time-series values.
- Histogram-based distance. Defined by L. Chen & Özsu (2005); used in (J. Lin & Li, 2009). Short time-series; shape-based, similarity in shape. Uses multi-scale time-series histograms.
- Dictionary-based compression. Defined by Lang, Morse, & Patel (2010). Long time-series; compression-based dissimilarity. Lang et al. (2010) develop a dictionary compression score for similarity measurement, suggested for computing the similarity of long time-series.
- Compression-based dissimilarity measure (CDM). Defined by Keogh et al. (2007). Short and long time-series; compression-based dissimilarity. A parameter-light distance measure based on Kolmogorov complexity theory.
- Autocorrelation. Defined by C. Wang & S. Wang (2000); used in (Keogh & Kasetty, 2003; X. Wang et al., 2004). Short and long time-series; compression-based dissimilarity.
- Pearson’s correlation coefficient and related distances. Defined by Rodgers & Nicewander (1988); used in (Harrison, 2005; Möller-Levet et al., 2003; Tseng & Kao, 2007). Short and long time-series; compression-based dissimilarity. Invariant to the scale and location of the data points.
- Cepstrum. Defined by Kalpakis, Gada, & Puttagunta (2001); used in (Keogh & Kasetty, 2003). Short and long time-series; compression-based dissimilarity. A spectral measure which is the inverse Fourier transform of the short-time logarithmic amplitude spectrum.
- Piecewise normalization. Defined by Indyk & Koudas (2000). Short and long time-series; compression-based dissimilarity. Involves time intervals, or “windows”, of varying size, but it is not clear how to determine these windows.
- Cosine wavelets. Defined by Huhtala & Karkkainen (1999). Short and long time-series; compression-based dissimilarity.
- Piecewise probabilistic distance. Defined by Keogh (1997); used in (Keogh & Kasetty, 2003). Short and long time-series; compression-based dissimilarity.
- Hidden Markov Models (HMM). Defined by Smyth (1997); used in (J. Duan, Wang, Liu, & Xue, 2005; Horenko, 2010; Yin & Yang, 2005). Long time-series; model-based, similarity in change. Able to capture not only the dependencies between variables but also the serial correlation in the measurements.
- ARMA. Defined by Kalpakis et al. (2001) and Xiong & Yeung (2002). Long time-series; model-based, similarity in change.

2.5.1 Discussion

Choosing a distance measure that is adequately accurate is controversial in the time-series clustering domain. Many distance measures have been proposed by researchers, which were compared and discussed in Section 2.5. However, the following conclusions can be drawn from the literature:

1) Investigating the aforementioned similarity/dissimilarity measures, it is implied that the most effective and accurate approaches are those based on dynamic programming (DP), which are very expensive in execution time (the cost of comparing two time-series is quadratic in the length of the time-series) (Salvador & Chan, 2007). Although some constraints are usually imposed on these distance/similarity measures to mitigate the complexity (Itakura, 1975; Sakoe & Chiba, 1978), careful tuning of parameters is needed for them to be efficient and effective. As a result, again, a trade-off between speed and accuracy should be found in the usage of these metrics. From another point of view, it is worthwhile to understand the extent to which a distance measure remains effective on large-scale time-series datasets. This cannot be concluded from the literature, because most of the considered works are based on rather small datasets.

2) In similarity measure research, a variety of challenges pertaining to distance measurement are considered. A big challenge is the incompatibility of the distance metric with the representation method. For example, one of the common approaches applied to time-series analysis is based on the frequency domain (K. Chan & Fu, 1999; Kawagoe & Ueda, 2002), but in this space it is difficult to find the similarity among sequences and to produce value-based differences to be used in clustering.

3) Euclidean distance and DTW are the most common methods for similarity measurement in time-series clustering. Research has shown that, in terms of time-series classification accuracy, the Euclidean distance is surprisingly competitive (Lkhagva et al., 2006); however, DTW also has strengths in similarity measurement which cannot be denied. Hence, the focus in this thesis is on both superior approaches, namely Euclidean distance and Dynamic Time Warping. The motivation for choosing these approaches, more details about these measures, and their strengths and weaknesses are explained in Sections 2.5.2 and 2.5.3.

2.5.2 Euclidean Distance (ED)

Euclidean distance is a one-to-one matching measurement which is used in most works (about 80%) in the literature (F. K. P. Chan, Fu, & Yu, 2003; Faloutsos et al., 1994; Keogh et al., 2001a, 2001b; Keogh & Kasetty, 2003; Keogh, 1997b; Reinert, Schbath, & Waterman, 2000).

Let $F = \{f_1, \dots, f_n\}$ and $G = \{g_1, \dots, g_n\}$ be two time-series of length n. The Euclidean distance between F and G is defined mathematically as:

$dis(F, G) = \sqrt{\sum_{t=1}^{n} (f_t - g_t)^2} \quad (2.2)$

The square root step can be removed, because the square root function is monotonic and returns the same rankings in clustering and classification (Keogh & Kasetty, 2003).
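For illustration, Equation 2.2 amounts to the following minimal sketch in Python (NumPy assumed):

```python
import numpy as np

def euclidean_distance(f, g):
    """Euclidean distance between two equal-length series (Eq. 2.2);
    a minimal sketch."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return np.sqrt(np.sum((f - g) ** 2))

print(euclidean_distance([1, 2, 3], [1, 3, 5]))  # sqrt(0 + 1 + 4) ≈ 2.236
```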

Euclidean distance is simple, fast, and used as a benchmark in many works because it is parameter-free. However, it is not always the best choice of distance function; this depends strongly on the domain of the problem at hand and the characteristics of its time-series, as explained in Section 5.3.3.1. Generally, there are some disadvantages in using Euclidean distance:

1. It requires that the time-series being compared are of exactly the same dimensionality (length).

2. It is very weak and sensitive to small shifts across the time axis (Keogh & Ratanamahatana, 2004; Ratanamahatana & Keogh, 2005). For example, it is not accurate enough for calculating the similarity of sequences that have a similar shape but are slightly out of phase.

2.5.3 Dynamic Time Warping (DTW)

In contrast to Euclidean distance, which proposes a one-to-one matching, Dynamic Time Warping (DTW) is suggested as a one-to-many metric. DTW is a generalization of Euclidean distance and solves the local shift problem (out-of-phase points on the time axis) in the time-series being compared. The local shift problem is a time-scale issue which is characteristic of most time-series and which one-to-one matching algorithms (e.g., Euclidean distance) are not capable of handling (Berndt & Clifford, 1994; Keogh & Ratanamahatana, 2004; Ratanamahatana & Keogh, 2005; Xi, Keogh, Shelton, Wei, & Ratanamahatana, 2006). DTW is an effective approach which “warps” the time axis in order to achieve the best alignment between the data points within the series. In DTW, each point of one sequence is matched to at least one element of the target time-series; that is, no element may be skipped in a sequence.

Figure 2.10: The difference between Euclidean and DTW distance (Keogh, 2006). Left: distance calculation by ED. Right: distance calculation by DTW.

Given two time-series $F = \{f_1, \dots, f_i\}$ and $G = \{g_1, \dots, g_j\}$, an $i \times j$ matrix is defined in which the $(s, k)$ element is the distance $dis(f_s, g_k)$ between the two time points $f_s$ and $g_k$. Among the set of warping paths $W = \{w_1, w_2, \dots, w_p\}$, the path that yields the minimum distance between the two series F and G is of interest:

$DTW(F, G) = \min \sum_{l=1}^{p} w_l \quad (2.3)$
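Dynamic programming is generally used to find this minimal-cost path efficiently. The following is a minimal sketch of the standard dynamic-programming recurrence (without the warping-window constraints discussed below):

```python
import numpy as np

def dtw(f, g):
    """Dynamic-programming computation of the DTW distance (Eq. 2.3);
    a minimal sketch without a warping-window constraint."""
    n, m = len(f), len(g)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for s in range(1, n + 1):
        for k in range(1, m + 1):
            cost = (f[s - 1] - g[k - 1]) ** 2
            # extend the cheapest of the three admissible predecessor cells
            D[s, k] = cost + min(D[s - 1, k], D[s, k - 1], D[s - 1, k - 1])
    return D[n, m]

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0: same shape, locally shifted
```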

In brief, it can be concluded that DTW has the following advantages and disadvantages:

Advantages:

1. Capability of handling local shifts (temporal drift): DTW allows similar shapes to be matched even if they are out of phase on the time axis, which makes it superior to Euclidean distance (Aach & Church, 2001; Chu et al., 2002; Vlachos et al., 2002; Yi & Faloutsos, 2000).

2. DTW can assist with the clustering of different-length time-series if there is no missing data. However, same-length time-series can be generated by normalization of different-length time-series without an impact on accuracy (Ratanamahatana & Keogh, 2005).

3. The error rate of Euclidean distance is an order of magnitude higher than that of DTW (Keogh & Ratanamahatana, 2004). Moreover, a recent study (H. Ding et al., 2008) shows that DTW is more accurate than Euclidean distance.

4. DTW is generally shown to be more robust than ED (L. Chen et al., 2005; Keogh & Ratanamahatana, 2004; Vlachos et al., 2002) and works better than edit-based measures such as LCSS, EDR, and ERP (Mitsa, 2009).

Disadvantages:

1. Sensitivity: DTW is sensitive to outliers, which can distort the distance (because all points have to be matched).

2. Scalability: speed is a big challenge for DTW because it requires quadratic computation (Salvador & Chan, 2007). As a result, many researchers (H. Ding et al., 2008; Keogh & Ratanamahatana, 2004; Kim, Park, & Chu, 2001; Xiaoyue Wang et al., 2012; Yi, Jagadish, & Faloutsos, 1998) have tried to speed it up, usually by proposing efficient lower-bound approximations of the DTW distance to reduce its complexity. Their claim that DTW can be calculated very fast even for large datasets holds under the classification problem (where the search area is pruned using a lower-bound distance of DTW), not under the clustering problem, where the distance between all objects must be calculated. Despite all the progress in speeding up DTW (Salvador & Chan, 2007), it is still expensive (Berndt & Clifford, 1994; Xi et al., 2006), and it is hard to find a better solution (H. Ding et al., 2008; Xi et al., 2006), especially for clustering purposes.

In this study, Euclidean distance is used for finding time-series that are similar in time, and DTW for finding time-series that are similar in shape. In the following section, the prototyping of clusters is explained.

2.6 Cluster Prototypes

Finding the cluster prototype (representative) is an essential subroutine in time-series clustering approaches (Bagnall & Janacek, 2005; Chu et al., 2002; Corradini, 2001; Keogh & Pazzani, 1998; Rabiner & Levinson, 1979; Ratanamahatana, 2005). One approach to addressing the low-quality problem in time-series clustering is to remedy the issue of inaccurate cluster prototypes, especially in partitioning clustering algorithms such as k-Means, k-Medoids, Fuzzy C-Means (FCM), or even Ascendant Hierarchical Clustering, which require a prototype. In these algorithms, the quality of the clusters is highly dependent on the quality of the prototypes. Given the time-series in a cluster C, it is clear that the cluster’s prototype $\bar{R}$ minimizes the distance between all time-series in the cluster and the prototype itself. A time-series that minimizes $E(\bar{R}, C)$ is called a Steiner sequence (Gusfield, 1997):

$E(\bar{R}, C) = \sum_{F_i \in C} dist(\bar{R}, F_i), \quad C = \{F_1, \dots, F_m\} \quad (2.4)$

There are a few methods for calculating prototypes published in the time-series literature; however, most of these publications have not proved the correctness of their methods (Niennattrakul & Ratanamahatana, 2007b). In general, three approaches can be seen for defining prototypes:

1. The medoid sequence of the set

2. The average sequence of the set

3. The local search prototype

2.6.1 Using Medoid as Prototype

In time-series clustering, the most common way to approximate the optimal Steiner sequence is to use the cluster medoid as the prototype (Kaufman & Rousseeuw, 1990). In this approach, the centre of a cluster is defined as the sequence which minimizes the sum of squared distances to the other objects within the cluster. Given the time-series in a cluster, the distances of all time-series pairs within the cluster are calculated using a distance measure such as Euclidean or DTW. Then, the time-series in the cluster with the lowest sum of squared error is defined as the medoid of the cluster (Vuori & Laaksonen, 2002). Moreover, if the distance is a non-elastic measure such as Euclidean, or if the centroid of the cluster can be calculated, the medoid is the time-series nearest to the centroid.
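As an illustration, the medoid described above can be computed with the following minimal sketch (the function names are illustrative; dist can be any of the measures from Section 2.5, e.g. the euclidean_distance or dtw sketches above):

```python
import numpy as np

def medoid(series_list, dist):
    """Return the cluster member that minimizes the sum of squared
    distances to all members of the cluster (a minimal sketch)."""
    costs = [sum(dist(a, b) ** 2 for b in series_list) for a in series_list]
    return series_list[int(np.argmin(costs))]
```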

The cluster medoid is very common among works related to time-series clustering and has been used in many papers (Hautamaki et al., 2008; Kaufman et al., 1990; Liao & Ting, 2006; Liao et al., 2002).

2.6.2 Using Averaging Prototype

If the time-series are of equal length and the distance metric used in the clustering process is a non-elastic metric (e.g., Euclidean distance), then the averaging method is a simple point-wise technique: the prototype is the mean of the time-series at each time point.
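For illustration, this point-wise averaging amounts to the following minimal sketch (NumPy assumed, equal-length series):

```python
import numpy as np

def mean_prototype(series_list):
    """Point-wise averaging prototype for equal-length series under a
    non-elastic metric such as ED (a minimal sketch)."""
    return np.mean(np.asarray(series_list, dtype=float), axis=0)

cluster = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
print(mean_prototype(cluster))  # -> [2. 2. 2.]
```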

However, when there are time-series of different lengths (Niennattrakul & Ratanamahatana, 2007b), or when the similarity between time-series is based on “similarity in shape”, the one-to-one mapping nature of point-wise averaging makes it unable to capture the actual average shape. For example, in cases where Dynamic Time Warping (DTW) or the Longest Common Sub-Sequence (LCSS) is very appropriate (Gupta, Molfese, Tammana, & Simos, 1996), the averaging prototype is avoided because it is not a trivial task. As evidence, one can see many works in the literature (Bagnall & Janacek, 2005; Caiani et al., 1998; Chu et al., 2002; Corradini, 2001; Keogh & Pazzani, 1998; Oates, Firoiu, & Cohen, 2001) which avoid using elastic approaches (e.g., DTW and LCSS) where a prototype is needed, without providing adequate reasons (whether the clustering is based on similarity in time or in shape). In what follows, two averaging methods (using DTW and LCSS) are briefly explained.

Shape averaging using Dynamic Time Warping (DTW): In this approach, one method of defining the prototype of a cluster is the combination of pairs of time-series, hierarchically or sequentially, by shape averaging with Dynamic Time Warping until only one time-series is left (Gupta et al., 1996). The drawback of this method is its dependency on the order in which pairs are chosen, which results in different final prototypes (Niennattrakul & Ratanamahatana, 2007a). Another method is the approach of Abdulla and Chow (2003), who proposed a cross-words reference template (CWRT): at first, the medoid is found as the initial guess, then all sequences are aligned by DTW to the medoid, and the average time-series is computed. The resulting time-series has the same length as the medoid, and the method is invariant to the order of processing sequences (Hautamaki et al., 2008). In another study, the authors present a global averaging method for defining the prototypes (Petitjean, Ketterlin, & Gançarski, 2011). They use an averaging approach where the distance method for clustering or classification is DTW. However, its accuracy depends on the length of the initial average sequence and the values of its coordinates.

Shape averaging using Longest Common Sub-Sequence (LCSS): The longest common subsequence (Bergroth & Hakonen, 2000) generally permits making a summary of a set of sequences. This approach supports elastic distances and time-series of unequal size. Aghabozorgi et al. (2011) and Aghabozorgi, Wah, Amini, and Saybani (2012) propose an approach for time-series clustering that utilizes the averaging method by LCSS as the prototype.

2.6.3 Using Local Search Prototype

In this approach, at first the medoid of the cluster is computed. Then, using the averaging method (Section 2.6.2), the averaged prototype is calculated based on warping paths. Next, new warping paths are calculated to the averaged prototype. Hautamaki et al. (2008) propose a prototype obtained by local search, instead of the medoid, to overcome the poor quality of time-series clustering in Euclidean space. They apply medoid, averaging and local search prototypes with k-Medoids, Random Swap (RS) and Agglomerative Hierarchical clustering (where k-Means is used to fine-tune the output) to evaluate their work. They find that local search provides the best clustering accuracy and the greatest improvement to k-Medoids. However, it is not clear how much improvement it offers in comparison with other works, such as medoid-based methods, which are another frequently used prototype.

2.6.4 Discussion

One of the problems that leads to low cluster accuracy is a poor definition or updating method for prototypes in the time-series clustering process, especially in partitioning approaches. Many clustering algorithms suffer from inaccurate prototypes (Hautamaki et al., 2008; Niennattrakul & Ratanamahatana, 2007b). Moreover, an inaccurate prototype can affect the convergence of clustering algorithms, which results in low quality of the obtained clusters (Niennattrakul & Ratanamahatana, 2007b). Different approaches to defining prototypes were discussed in Section 2.6. In this study, the averaging approach is used to find the prototypes of the sub-clusters (see Section 4.4.3), because the distance metric used there is non-elastic (ED). For the merging step, however, an arbitrary method can be used if it is compatible with elastic measures, such as (Petitjean et al., 2011), for different schemes (see Section 4.5.2); here, the simple medoid is used as the prototype, to be compatible with the elasticity of the DTW distance metric and the k-Medoids algorithm, and also to provide fair conditions for the evaluation of the proposed model against existing approaches.

2.7 Evaluation Measure

As regards time-series clustering algorithms, the evaluation measures employed in the different approaches are discussed in this section. Visualization and scalar measurements are the major techniques for evaluating clustering quality, which is also known as clustering validity in some articles (Hathaway & Bezdek, 2003). The techniques used to evaluate the proposed model in this study are explained in the following sections, as depicted in Figure 2.11.

[Figure: hierarchy of evaluation measures: visualization and scalar accuracy measures; scalar measures divide into external indices (purity, CSM, FM, Rand index, ARI, F-measure, NMI, entropy, Jaccard) and internal indices (SSE).]

Figure 2.11: Evaluation measure hierarchy used in the literature

In scalar accuracy measurements, a single real number is generated to represent the accuracy of different clustering methods. Numerical measures that are applied to judge various aspects of cluster validity are classified into the following two types.

External Index: This index is used to measure the similarity of the formed clusters to externally supplied class labels or ground truth, and it is the most popular clustering evaluation method (Halkidi, Batistakis, & Vazirgiannis, 2001). In the literature, this index is also known as external criterion, external validation, extrinsic method, or supervised method, because the ground truth is available.

Internal Index: This index is used to measure the goodness of a clustering structure without reference to external information. In the literature, this index is also known as internal criterion, internal validation, or intrinsic and unsupervised method.

2.7.1 External Index

External validity indices measure the agreement between two partitions, one of which is usually a known/golden partition (ground truth), e.g., true class labels, while the other comes from the clustering procedure. The ground truth is the ideal clustering, often built using human experts. In this type of evaluation, the ground truth is available, and the index evaluates how well the clustering matches it (Manning, Raghavan, & Schutze, 2008). Complete reviews and comparisons of some popular techniques exist in the literature (Amigó, Gonzalo, Artiles, & Verdejo, 2009; Gan & Ma, 2007; Meila, 2003; Rosenberg & Hirschberg, 2007). However, there is no universally accepted technique for evaluating clustering approaches, though there are many candidates which can be discounted for a variety of reasons.

For external indices, matching of corresponding clusters and information-theoretic criteria are usually used as the underlying approach. Based on these approaches, many indices have been presented in different articles (Amigó et al., 2009; Kremer et al., 2011); however, for readability and space reasons, only the indices that have been used for the evaluation of time-series clustering in the literature are described here. The Rand index, Adjusted Rand index, Purity, Jaccard index, Fowlkes-Mallows index (FM), F-measure, Entropy, Normalized Mutual Information, and Cluster Similarity Measure are used in the experiments of this study (see Chapter 6.0).

Cluster purity: One way of measuring the quality of a clustering solution is cluster purity (Zhao & Karypis, 2004). Purity is a simple and transparent evaluation measure. Considering $G = \{G_1, \dots, G_M\}$ as the ground truth clusters and $C = \{C_1, \dots, C_k\}$ as the clusters made by the clustering algorithm under evaluation, in order to compute the purity of C with respect to G, each cluster is assigned to the class which is most frequent in that cluster, and the accuracy of this assignment is then measured by counting the number of correctly assigned objects and dividing by the number of objects in the cluster. Let there be k clusters (e.g., the k in k-Means) in dataset D, let the size of cluster $C_j$ be $|C_j|$, and let $n_{ij}$ denote the number of items of class $G_i$ assigned to cluster $C_j$. The purity of cluster $C_j$ is given by:

$Purity(C_j) = \frac{1}{|C_j|} \max_i (n_{ij}) \quad (2.5)$

Then, the overall purity of a clustering solution can be expressed as a weighted sum of the individual cluster purities (with N the total number of objects):

$Purity = \sum_{j=1}^{k} \frac{|C_j|}{N} \, Purity(C_j) \quad (2.6)$

Bad clusterings have purity values close to 0, and a perfect clustering has a purity of 1. However, high purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each object gets its own cluster. Thus, one cannot rely on purity alone as the quality measure. Purity has been used for the evaluation of time-series clustering in different studies (Ratanamahatana & Niennattrakul, 2006; X. Wang et al., 2006).
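For illustration, Equations 2.5 and 2.6 amount to the following minimal sketch:

```python
import numpy as np

def purity(labels_true, labels_pred):
    """Overall cluster purity (Eqs. 2.5-2.6): each cluster is assigned its
    most frequent ground-truth class (a minimal sketch)."""
    labels_true, labels_pred = np.asarray(labels_true), np.asarray(labels_pred)
    total = 0
    for c in np.unique(labels_pred):
        _, counts = np.unique(labels_true[labels_pred == c], return_counts=True)
        total += counts.max()  # correctly assigned objects in this cluster
    return total / len(labels_true)

print(purity([0, 0, 1, 1], [1, 1, 1, 0]))  # -> 0.75
```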

Cluster Similarity Measure (CSM): The CSM (T. W. Liao, 2005) is a simple metric used for the validation of clusters in the time-series domain (Kalpakis et al., 2001; J. Lin, Vlachos, et al., 2004; Xiong & Yeung, 2004; H. Zhang et al., 2006). The cluster similarity measure of a clustering is given by:

$CSM(G, C) = \frac{1}{k} \sum_{i=1}^{k} \max_{1 \le j \le k} Sim(G_i, C_j) \quad (2.7)$

where

$Sim(G_i, C_j) = \frac{2\,|G_i \cap C_j|}{|G_i| + |C_j|} \quad (2.8)$

Fowlkes and Mallows index (FM): To calculate FM, the following quantities describing correct clusterings (with respect to the ground truth) and clustering errors are first considered. Let |TP| (True Positives) be the number of pairs that belong to one class in G (ground truth) and are clustered together in C. |TN| (True Negatives) is the number of pairs that neither belong to the same class in G nor are clustered together in C. The clustering errors are |FN| (False Negatives), the number of pairs that belong to one class in G but are not clustered together in C, and |FP| (False Positives), the number of pairs that do not belong to one class in G (dissimilar objects) but are clustered together in C. The FM measure (Fowlkes & Mallows, 1983) is then defined as:

$FM = \frac{|TP|}{\sqrt{(|TP| + |FP|)(|TP| + |FN|)}} \quad (2.9)$

This metric has been used as an index for computing the accuracy of time-series clustering in the multimedia domain (Ratanamahatana et al., 2005; H. Zhang et al., 2006).

Jaccard Score: The Jaccard index (Fowlkes & Mallows, 1983) is one of the metrics that has been used in various studies as an external index (Chiş et al., 2009; Ratanamahatana et al., 2005; H. Zhang et al., 2006). Considering the quantities defined for the FM index, the Jaccard index is defined as:

$Jaccard = \frac{|TP|}{|TP| + |FP| + |FN|} \quad (2.10)$

Rand index (RI): A popular quality measure (Chiş et al., 2009; Ratanamahatana et al., 2005; H. Zhang et al., 2006) for the evaluation of time-series clusters is the Rand index (Rand, 1971; J. Wu, Xiong, & Chen, 2009), which measures the agreement between two partitions, that is, how close the clustering results are to the ground truth. The agreement between C and G can be estimated using:

$RI = \frac{|TP| + |TN|}{|TP| + |TN| + |FP| + |FN|} \quad (2.11)$

Adjusted Rand Index (ARI): RI does not take a constant value (such as zero) for two random clusterings. Hence, in (Hubert & Arabie, 1985), the authors suggest a corrected-for-chance version of the RI which works better than RI and many other indices (G. W. Milligan & Cooper, 1986; Steinley, 2004). This approach has been used successfully in the gene expression domain (Yeung, Fraley, Murua, Raftery, & Ruzzo, 2001; Yeung, Haynor, & Ruzzo, 2001). In this approach, the expected RI of a random labelling, i.e., $E[RI]$, is discounted as:

$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]} \quad (2.12)$
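In practice, ARI need not be implemented by hand; the following minimal usage sketch relies on scikit-learn (an assumed dependency, not one used in this thesis):

```python
from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 1, 1, 2, 2]   # ground truth G
labels_pred = [0, 0, 1, 2, 2, 2]   # clustering result C
# ARI is chance-corrected: 0 is expected for random labellings,
# and 1.0 is reached only for a perfect match.
print(adjusted_rand_score(labels_true, labels_pred))
```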

F-measure: The F-measure (Van Rijsbergen, 1979) is a well-established measure for assessing the quality of any given clustering solution with respect to the ground truth. The F-measure (F ∈ [0, 1]) is defined based on precision and recall.

The precision indicates how many items in the same cluster belong to the same class (ground truth) (Van Rijsbergen, 1979), and is estimated as:

$P = \frac{|TP|}{|TP| + |FP|} \quad (2.13)$

The recall reflects how many objects of the same class (in the ground truth) are assigned to the same cluster (Van Rijsbergen, 1979):

$R = \frac{|TP|}{|TP| + |FN|} \quad (2.14)$

The F-measure is then calculated as the harmonic mean of precision (P) and recall (R):

$F = \frac{2 \cdot P \cdot R}{P + R} \quad (2.15)$

The F-measure compares how closely each cluster matches a set of categories of the ground truth. It has been used in the clustering of time-series data (Chiş et al., 2009; Gullo et al., 2011; Kameda & Yamamura, 2006; Van Rijsbergen, 1986) and in natural language processing for evaluating clustering (Larsen & Aone, 1999).

Normalized Mutual Information (NMI): As mentioned, high purity with a large number of clusters is a drawback of the purity measure. In order to trade off the quality of the clustering against the number of clusters, NMI (Studholme, Hill, & Hawkes, 1999) is utilized as a quality measure. It has since been used in various studies (Fern & Brodley, 2004; Strehl & Ghosh, 2003; H. Zhang et al., 2006). Moreover, NMI can be used to compare clusterings with different numbers of clusters, because the measure is normalized (Manning et al., 2008):

$NMI(C, G) = \frac{\sum_{j=1}^{k} \sum_{i=1}^{M} n_{ij} \log \frac{N \cdot n_{ij}}{|C_j| \cdot |G_i|}}{\sqrt{\left(\sum_{j=1}^{k} |C_j| \log \frac{|C_j|}{N}\right)\left(\sum_{i=1}^{M} |G_i| \log \frac{|G_i|}{N}\right)}} \quad (2.16)$

Entropy: The entropy (S. Lin, Song, & Zhang, 2008; Rohlf, 1974) of a cluster shows how dispersed the classes are within a cluster (this should be low). Entropy is a function of the distribution of classes in the resulting clusters. For each cluster C_i, the class distribution of the data is computed as the probability Pr(G_j | C_i) that an instance in C_i belongs to class G_j. Using this class distribution, the normalized entropy of C_i is computed as:

\[ E(C_i) = -\frac{1}{\log q} \sum_{j=1}^{q} \Pr(G_j \mid C_i) \log \Pr(G_j \mid C_i) \tag{2.17} \]

where Pr(G_j | C_i) = |C_i ∩ G_j| / |C_i| and q is the number of classes. The overall entropy (E ∈ [0, 1]) is defined as the sum of the individual cluster entropies weighted by the size of each cluster:


\[ E = \sum_{i=1}^{k} \frac{|C_i|}{N} \, E(C_i) \tag{2.18} \]

Based on the above measures, a good clustering solution is expected to have both a high F-measure and low entropy. This metric has been used for the evaluation of time-series clustering in the literature (Gullo et al., 2011; Van Rijsbergen, 1986).
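As an illustration, the following sketch computes the overall entropy of Equations 2.17-2.18, normalizing each cluster's entropy by log q (an assumption of this sketch, which therefore requires at least two ground-truth classes):

```python
import math
from collections import Counter

def overall_entropy(labels_true, labels_pred):
    """Size-weighted sum of normalized cluster entropies (Equations 2.17-2.18)."""
    n = len(labels_true)
    q = len(set(labels_true))   # number of ground-truth classes; assumes q >= 2
    total = 0.0
    for cluster in set(labels_pred):
        members = [g for g, c in zip(labels_true, labels_pred) if c == cluster]
        e_i = -sum((v / len(members)) * math.log(v / len(members))
                   for v in Counter(members).values())
        total += (len(members) / n) * e_i / math.log(q)   # normalize, weight by size
    return total
```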

In short, one of the most popular approaches for the quality evaluation of clusters is the use of external indices to determine how good the discovered clusters are (Halkidi et al., 2001), and this approach is also used for the evaluation of the proposed models in this study. However, it is not directly applicable in real-life unsupervised tasks, because the ground truth is not available for all datasets. Therefore, when the ground truth is not available, an internal index is used (see Section 6.2.7, where an internal index is used for evaluation).

2.7.2 Internal Index

Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity (objects within a cluster are similar) and low inter-cluster similarity (objects from different clusters are dissimilar). Internal validation compares solutions based on the goodness of fit between each clustering and the data. Internal validity indices evaluate clustering results by using only features and information inherent in a dataset. They are usually used when the true solutions (ground truth) are unknown. However, such indices can only compare clusterings generated using the same model/metric; otherwise, they make assumptions about the cluster structure.

There are many internal indices, such as the Sum of Squared Error, Silhouette index, Davies-Bouldin index, Calinski-Harabasz index, Dunn index, R-squared index, Hubert-Levin (C-index), Krzanowski-Lai index, Hartigan index, Root-Mean-Square Standard Deviation (RMSSTD) index, Semi-Partial R-squared (SPR) index, Distance between two clusters (CD) index, Weighted inter-intra index, Homogeneity index, and Separation index.

Sum of Squared Error (SSE): SSE is an objective function that describes the coherence of a given cluster; “better” clusters are expected to give lower SSE values (Han & Kamber, 2011). For the evaluation of clusters in terms of accuracy, SSE is the most common measure in different works (J. Lin, Vlachos, et al., 2004; Vlachos et al., 2003). For each time-series, the error is the distance to the nearest cluster. SSE is computed using the following formula:

\[ SSE = \sum_{i=1}^{k} \sum_{x \in C_i} dist(x, c_i)^2 \tag{2.19} \]

where x is a data point in cluster C_i, and c_i is the representative (prototype) of cluster C_i.
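A minimal sketch of Equation 2.19 with squared Euclidean distance follows (illustrative; it assumes labels are integer indices into an array of prototypes):

```python
import numpy as np

def sse(series, labels, prototypes):
    """Equation 2.19: squared Euclidean distance of each series to its cluster prototype."""
    return sum(float(np.sum((x - prototypes[c]) ** 2))
               for x, c in zip(series, labels))
```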

2.8 Related works: Time-series Clustering Algorithms

There are many articles on different approaches to clustering in the literature (Berkhin, 2006; Jain et al., 1999; Rauber, Pampalk, & Paralič, 2000; Xu & Wunsch, 2005). However, research on time-series clustering is quite scarce compared with the works that have focused on static data, though the literature shows a trend of increasing activity.

In this section, the existing works related to the clustering of time-series data are reviewed and discussed. Some of them use raw time-series, while others apply reduction methods before clustering the time-series data. In general, clustering in its conventional form can be broadly classified into five groups (Han & Kamber, 2011): Partitioning, Hierarchical, Grid-based, Model-based and Density-based clustering algorithms. In the following, the application of each group to time-series clustering is discussed in detail.

2.8.1 Hierarchical Clustering of Time-series

Hierarchical clustering (Kaufman et al., 1990) is an approach which builds a hierarchy of clusters using agglomerative or divisive algorithms. An agglomerative algorithm considers each item as a cluster, and then gradually merges the clusters (bottom-up). In contrast, a divisive algorithm starts with all objects as a single cluster and then splits the cluster until clusters with one object are reached (top-down). In general, hierarchical algorithms are weak in terms of quality because they cannot adjust the clusters after splitting a cluster in the divisive method, or after merging in the agglomerative method. As a result, hierarchical clustering algorithms are usually combined with another algorithm as a hybrid clustering approach to remedy this issue.

Moreover, some extended works have been done to improve the performance of hierarchical clustering, such as Chameleon (Karypis, Han, & Kumar, 1999), CURE (Guha, Rastogi, & Shim, 1998) and BIRCH (T. Zhang, Ramakrishnan, & Livny, 1996), where the merge approach is enhanced or the constructed clusters are refined.

In the hierarchical clustering of time-series, too, a nested hierarchy of similar groups is generated based on a pair-wise distance matrix of the time-series (Vlachos et al., 2003). Hierarchical clustering has great visualization power in time-series clustering (Keogh & Pazzani, 1998; Van Wijk & Van Selow, 1999). This characteristic has led to hierarchical clustering being used for time-series to a great extent. For example, Oates, Schmill, and Cohen (2000) use agglomerative clustering to produce clusters of the experiences of an autonomous agent. They use Dynamic Time Warping (DTW) as a dissimilarity measure with a dataset containing 150 trials of real Pioneer data from a variety of experiences. In another study, Hirano and Tsumoto (2005) use average-linkage agglomerative clustering, a type of hierarchical approach, for time-series clustering. Moreover, in many studies, hierarchical clustering is used to evaluate a dimensionality reduction method or distance metric, due to its power in visualization. For example, in one study (J. Lin, Keogh, Lonardi, et al., 2003), the authors present the Symbolic Aggregate Approximation (SAX) representation and use hierarchical clustering to evaluate their work. They show that, using SAX, hierarchical clustering obtains results similar to those obtained with Euclidean distance.

Additionally, in contrast to most algorithms, hierarchical clustering does not require the number of clusters as an initial parameter, which is a well-known and outstanding feature of this algorithm. It is also a strength in time-series clustering, because it is usually hard to define the number of clusters in real-world problems.

Moreover, unlike many algorithms, hierarchical clustering has the ability to cluster time-series of unequal length. It is possible to cluster unequal-length time-series using this algorithm if an appropriate elastic distance measure such as Dynamic Time Warping (DTW) (Sakoe & Chiba, 1971, 1978) or the Longest Common Subsequence (LCSS) (Banerjee & Ghosh, 2001; Vlachos et al., 2002) is used to compute the dissimilarity/similarity of the time-series. In fact, because its process does not require prototypes, this algorithm can accept unequal-length time-series.
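For reference, a minimal dynamic-programming DTW sketch is given below; it accepts series of different lengths, which is what makes such elastic measures usable for unequal-length clustering (illustrative code only, without the warping-window constraints that practical systems usually add):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic O(len(x)*len(y)) DTW; the two series may differ in length."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Best of insertion, deletion, and match from the previous cells.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])
```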

However, hierarchical clustering is essentially not capable of dealing effectively with large time-series datasets (X. Wang et al., 2006) due to its quadratic computational complexity. Accordingly, its poor scalability restricts it to small datasets.

2.8.2 Partitioning Clustering

A partitioning clustering method makes k groups from n unlabelled objects such that each group contains at least one object. One of the most-used partitioning clustering algorithms is k-Means (MacQueen, 1967), where each cluster has a prototype which is the mean value of its objects. The main idea behind k-Means clustering is the minimization of the total distance (typically Euclidean distance) between all objects in a cluster and their cluster center (prototype). The prototype in the k-Means process is defined as the mean vector of the objects in a cluster. However, when it comes to time-series clustering, defining such a prototype is a challenging issue and is not trivial (Niennattrakul & Ratanamahatana, 2007b).
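To make the prototype issue concrete, the following is a plain Euclidean k-Means sketch in which each prototype is simply the point-wise mean of the cluster's (equal-length) time-series; replacing Euclidean distance with an elastic measure such as DTW would also require redefining this averaging step, which is exactly the non-trivial part (illustrative code, assuming data is a 2-D NumPy array of equal-length series):

```python
import numpy as np

def kmeans(data, k, n_iter=50, seed=0):
    """Plain k-Means on equal-length series; prototype = point-wise mean of members."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each series to the nearest prototype under Euclidean distance.
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update each prototype as the mean vector of its cluster.
        for j in range(k):
            if (labels == j).any():
                centers[j] = data[labels == j].mean(axis=0)
    return labels, centers
```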

Another member of the partitioning family is the k-Medoids (PAM) algorithm (Kaufman et al., 1990), where the prototype of each cluster is one of the objects nearest to the centre of the cluster. Moreover, CLARA and CLARANS (Ng & Han, 1994) are improved versions of the k-Medoids algorithm for mining in spatial databases. In both the k-Means and k-Medoids clustering algorithms, the number of clusters, k, has to be pre-assigned, which is not available or feasible to determine for many applications; this makes them impractical for obtaining natural clustering results and is known as one of their drawbacks for static objects (X. Wang et al., 2006) and also for time-series data (Antunes & Oliveira, 2001). It is more crucial for time-series because the datasets are very large and diagnostic checks for determining the number of clusters are not easy. Accordingly, the authors in (Fayyad et al., 1998) investigate the role of choosing correct initial clusters in the quality and execution time of k-Means in time-series clustering.

However, k-Means and k-Medoids are very fast compared to hierarchical clustering (Bradley, Fayyad, & Reina, 1998; MacQueen, 1967), which has made them very suitable for time-series clustering, and they have been used in many works (Bagnall & Janacek, 2005; Beringer & Hullermeier, 2006; C. Guo et al., 2008; Hautamaki et al., 2008; J. Lin, Vlachos, et al., 2004). As a specific case, the authors in (Ratanamahatana & Niennattrakul, 2006) use the robustness of k-Medoids to noise and outliers in order to cluster time-series of multimedia data.

The k-Means and k-Medoids algorithms make clusters in a ‘hard’ or ‘crisp’ manner, that is, an object either is or is not a member of a cluster. On the other hand, the FCM (Fuzzy c-Means) algorithm (Bezdek, 1981; Dunn, 1973) and the Fuzzy c-Medoids algorithm (Krishnapuram, Joshi, Nasraoui, & Yi, 2001) build ‘soft’ clusters. In fuzzy clustering, an object has a degree of membership in each cluster (Dembélé & Kastner, 2003). Fuzzy partitioning algorithms have been used for time-series clustering in some areas. For example, in (Tran & Wagner, 2002), the authors use FCM to cluster time-series for speaker verification. In another work (Alon & Sclaroff, 2003), the authors use a fuzzy variant to cluster similar object motions observed in a video collection. They adopt an EM-based algorithm and a mixture of HMMs to cluster the time-series data; each time-series is then assigned to each cluster to a certain degree. Moreover, using FCM, the authors in (Golay et al., 1998) cluster MRI time-series of brain activities. They use raw univariate time-series of equal length, with Euclidean distance and cross-correlation as distance metrics. They evaluate their work with different numbers of clusters (k) and recommend using a large number of clusters initially. However, it is not explained how they obtain the optimal number of clusters in this work.

Generally, partitioning approaches, whether crisp or fuzzy, need defined prototypes, and their accuracy depends directly on the definition of the prototypes and the updating method. Hence, they are more compatible with finding clusters of time-series that are similar in time (preferably equal-length time-series), because defining a prototype for elastic distance measures (which handle similarity in shape) is not very straightforward, as was discussed in Section 2.6.

2.8.3 Model-based Clustering

Model-based clustering tries to recover the original model from a set of data. This approach assumes a model for each cluster and finds the best fit of the data to that model. In detail, it presumes that there are some centroids chosen at random, to which noise with a normal distribution is added. The model that is recovered from the generated data defines the clusters (Shavlik & Dietterich, 1990). Typically, model-based methods use either statistical approaches, e.g., COBWEB (Fisher, 1987), or Neural Network approaches, e.g., ART (Carpenter & Grossberg, 1987) or the Self-Organizing Map (Kohonen, 1990).

In some works in the time-series clustering area, the authors use Self-Organizing Maps (SOM) for clustering time-series data. As mentioned, SOM is a model-based clustering approach based on neural networks, which resembles the processing that happens in the brain. For example, in (X. Wang et al., 2004), the authors use SOM to cluster time-series features. However, because SOM requires the dimension of the weight vector to be defined, it cannot work well with time-series of unequal length (Warrenliao, 2005).

Additionally, there are a few articles on model-based clustering of time-series data based on polynomial models (Bagnall & Janacek, 2005), Gaussian mixture models (Biernacki, Celeux, & Govaert, 2000), ARIMA (Corduas & Piccolo, 2008), Markov chains (Ramoni et al., 2000) and Hidden Markov Models (Bicego, Murino, & Figueiredo, 2003; Hu et al., 2006). In general, model-based clustering has two drawbacks: first, it needs parameters to be set and is based on user assumptions which may be false and result in inaccurate clusters; second, it has a slow processing time (especially neural networks) on large datasets (Andreopoulos, An, & Wang, 2009).

2.8.4 Density-based Clustering

In density-based clustering, clusters are subspaces of dense objects which are separated by subspaces in which objects have low density. One of the famous algorithms based on the density concept is DBSCAN (Ester, Kriegel, & Sander, 1996), where a cluster is expanded if its neighbours are dense. OPTICS (Ankerst, Breunig, & Kriegel, 1999) is another density-based algorithm, which addresses the issue of detecting meaningful clusters in data of varying density. The model proposed by Chandrakala and Chandra (2008) is one of the rare cases where the authors propose a density-based clustering method in kernel feature space for clustering multivariate time-series data of varying length. Additionally, they present a heuristic method for finding the initial values of the parameters used in their proposed algorithm. However, density-based clustering has not been used broadly for time-series data clustering in the literature, because of its rather high complexity.

2.8.5 Grid-based Clustering

The grid-based methods quantize the space into a finite number of cells that form a grid, and then perform clustering on the grid's cells. STING (W. Wang, Yang, & Muntz, 1997) and Wave Cluster (Sheikholeslami, Chatterjee, & Zhang, 1998) are two typical examples of clustering algorithms based on the grid concept. To the best of our knowledge, there is no work in the literature applying grid-based approaches to the clustering of time-series.

In Table 2.4, a summary of the related works is presented based on the adopted representation method, distance measure, clustering algorithm and definition of prototype (where applicable).

Table 2.4: Whole time-series clustering algorithms

| Article | Representation | Distance measure | Prototype | Clustering algorithm | TS size | Noise robustness | Unequal length | Comments (P: positive, N: negative) | Application |
| Hautamaki et al. (2008) | Raw time-series | DTW | New prototype (medoid) | k-Means, Hierarchical, RS | - | - | Yes | P: only compared with medoid | - |
| Gullo, Ponti, Tagarelli, Tradigo, & Veltri (2011) | DSA | DTW | A summarization approach in (Gullo, Ponti, Tagarelli, & Greco, 2009) | k-Means | - | - | No | - | Mass spectrometry clustering |
| Möller-Levet, Klawonn, Cho, & Wolkenhauer (2003) | Piecewise linear function | STS | Yes | Modified FCM | Short | - | No | - | Biology, DNA microarray |
| X. Zhang et al. (2011) | Raw time-series | Triangle distance | N/A | Hierarchical | - | - | No | - | - |
| Bao (2007); Bao & Yang (2008) | A critical point model (CPM) | - | N/A | Turning points | - | - | No | P: using important points | Financial |
| Fu, Chung, Luk, & Ng (2010) | PIP (perceptually important points) | Vertical distance | Estimated means | k-Means | - | No | Yes | P: incremental; N: only indexing | Financial, time-series query |
| Lin, Vlachos, Keogh, & Gunopulos (2004) | Wavelets | Euclidean distance | Averaging | Partitioning clustering, k-Means and EM | - | No | No | P: incremental | - |
| Vlachos, Lin, & Keogh (2003) | DWT (Discrete Wavelet Transform), Haar wavelet | Euclidean | Not clear | k-Means | - | No | No | P: incremental | - |
| X. Wang, Smith, & Hyndman (2005) | Global characteristics | Euclidean | - | SOM | Long | No | No | N: only focuses on the dimensionality reduction method | - |
| Ratanamahatana, Keogh, Bagnall, & Lonardi (2005) | BLA (clipped time-series representation) | LB_clipped | Not clear | k-Means | Long | No | No | - | - |
| Focardi & others (2005) | Raw time-series | 3 types of distances | - | - | - | No | No | N: using raw time-series | - |
| Lin, Keogh, Wei, & Lonardi (2007) | ESAX | Min-Distance | Not clear | Partitioning, Hierarchical | - | No | No | N: only focuses on the distance measurement | - |
| Abonyi, Feil, Nemeth, & Arva (2005) | PCA | SpCA Factor | N/A | Hierarchical | - | No | No | P: anomaly detection | Multivariate data |
| Z. J. Wang & Willett (2004) | Raw time-series | GLR (generalized likelihood ratio) | N/A | Two-stage approach, segmentation | - | No | No | N: subsequence | - |
| Qian, Dolled-Filhart, Lin, Yu, & Gerstein (2001) | Raw time-series | Ad hoc distance | N/A | Single-linkage | - | No | No | N: using raw time-series | Gene expression |
| Tseng & Kao (2005) | Gene expression | Euclidean distance, Pearson's correlation | N/A | Modified CAST | - | No | No | P: focuses on clustering | - |
| Golay et al. (1998) | Raw time-series | Euclidean & two cross-correlation-based | N/A | FCM | - | Yes | No | - | Brain activity |
| Bagnall & Janacek (2005) | Clipped | Euclidean | Medoid | k-Means, k-Medoids | Short, long | N/A | No | - | - |
| Kakizawa, Shumway, & Taniguchi (1998) | Raw time-series | J divergence | N/A | Agglomerative hierarchical | - | - | No | P: multiple variable support | Earthquake |
| Košmelj & Batagelj (1990) | Raw time-series | Euclidean | N/A | Modified relocation clustering | - | - | No | P: multiple variable support | Commercial energy consumption |
| Kumar & Patel (2002) | Raw time-series | Gaussian models of data errors | N/A | Agglomerative hierarchical | - | - | No | - | Seasonality patterns in retail |
| Liao (2005) | SAX | Euclidean and a symmetric version of Kullback–Leibler | N/A | k-Means and fuzzy c-Means | - | - | Yes | P: multiple variable support | Battle simulations |
| Liao et al. (2002) | Raw time-series | DTW and Kullback–Leibler distance | N/A | k-Medoids-based genetic clustering | - | No | Yes | N: single variable support | Battle simulations |
| Policker & Geva (2000) | Raw time-series | Euclidean | N/A | Fuzzy clustering | - | - | No | N: single variable, using raw time-series | Sleep EEG signals |
| Shumway (2003) | Raw time-series | Kullback–Leibler discrimination information measures | N/A | Agglomerative hierarchical | - | - | No | P: multiple variable support | Earthquakes and mining explosions |
| Van Wijk & Van Selow (1999) | Raw time-series | Root mean square | N/A | Agglomerative hierarchical | - | - | No | N: single variable, using raw time-series | Daily power consumption |
| Wismüller et al. (2002) | Raw time-series | N/A | N/A | Neural network clustering | - | - | No | N: single variable support, using raw time-series | Functional MRI brain activity mapping |
| Liu & Shao (2009) | SAX | Trend statistics distance | N/A | Hierarchical | - | - | No | P: using symbolized time-series; new similarity distance | - |
| Guo, Jia, & Zhang (2008) | Feature-based using ICA | - | Not clear | Modified k-Means | - | No | No | - | Stock time-series |
| Lai, Chung, & Tseng (2010) | SAX, raw time-series | Min-Dist, Euclidean distance | - | Two-level clustering: CAST | - | No (DWT) | Yes | N: based on subsequences; CAST is poor in front of huge data | Gene expression |
| Ratanamahatana & Niennattrakul (2006) | Raw time-series | Dynamic Time Warping | Medoid | k-Means, k-Medoids | - | Yes | No | N: using raw time-series | Multimedia time-series |
| Keogh, Lonardi, et al. (2004) | SAX | Compression-based distance | - | Hierarchical | - | No | No | - | - |

Considering these works, it can be seen that in most models the authors use time-series data as raw data or dimensionality-reduced data, with standard traditional clustering algorithms. Clearly, this type of time-series analysis, which uses a brute-force approach without any optimization, is a proper solution for scientific theories, but not for real-world problems, because it is naturally very slow or inaccurate on large databases. As a result, in many studies the attention of researchers has been drawn to using more customized algorithms for time-series data clustering as the ultimate solution.

In the following section, focusing on the algorithm, specific approaches are discussed, with the emphasis on solutions which address the problem of low-quality time-series clustering caused by the issues mentioned in the clustering process. This section can be considered crucial to this thesis.

2.8.6 Multi-step Clustering

Although there are many studies on improving the quality of representation approaches, distance measures, and prototypes, few articles emphasize enhancing the algorithms themselves and present a new model (usually a hybrid method) for the clustering of time-series data. In the following, the most related works are presented and discussed:

Lai et al. (2010) describe the problem of information being overlooked when dimension reduction is used. They claim that the overlooked information could give a different meaning to time-series clustering results. To solve this issue, they adopt a two-level clustering method, where the whole time-series and the subsequences of the time-series are taken into account in the first and second level, respectively. They use the SAX transformation as the dimension reduction method and CAST as the clustering algorithm in the first level in order to group first-level data. In the second level, to measure distances between time-series, Dynamic Time Warping (DTW) is used for varying-length data, and Euclidean distance for equal-length data. Finally, the second-level data of all the time-series are then grouped by a clustering algorithm.

Discussion:

1) The distance measure used in order to find the first-level result is not clear, although it is of great importance: for example, if the lengths of the time-series are different (which is a possible case), this affects the choice of dimension reduction and distance measurement methods.

2) The authors use the CAST algorithm twice in their proposed approach, once for making the initial clusters and then for splitting each cluster into sub-clusters (although they use it three times in the pseudocode). However, using the CAST algorithm requires determining the threshold of affiliation, which is a very sensitive parameter in this algorithm (Bellaachia, Portnoy, Chen, & Elkahloun, 2002).

3) In this work, more granular time-series are clustered, which is actually based on subsequence clustering. However, the work done by Keogh and Lin (2005) indicates that subsequence clustering is meaningless. The authors of that work define “meaningless” as when the clustering output is independent of the input.

4) Their experimental results are not based on the published datasets in the literature. Therefore, there is no way to compare their method with existing approaches for time-series clustering.

The authors in (X. Zhang et al., 2011) also propose a new multi-level approach for shape-based time-series clustering. In the first step, some candidate time-series are chosen from a constructed one-nearest-neighbour network. In order to make the network of time-series, the authors propose the triangle distance for calculating the similarity between time-series data. Then, hierarchical clustering is performed on the chosen candidate time-series. To handle shifts in the time-series, Dynamic Time Warping (DTW) is utilized in the second step of clustering. Using this approach, the size of the data is reduced by approximately ten per cent.

Discussion:

1) This algorithm needs a nearest-neighbour network in the first level, while the complexity of making the nearest-neighbour network is O(n²), which is very high. As a result, they try to reduce the search area by pre-clustering the data (using k-Means) and limiting the search to within each cluster in order to reduce the cost of creating the network. However, because raw time-series are used in the pre-clustering process to reduce the size of the data, making the network itself is still very costly. As a result, the complexity of the whole clustering is high, which is not applicable to large datasets.

2) The pre-clusters developed in this model may not be accurate, because they are constructed by a non-elastic distance measure on raw time-series, which may be affected by outliers. Additionally, it is not clear how they solve the challenge of making the prototypes in k-Means while the triangle distance is used as the distance measure.

3) The experimental results are based on two synthetic datasets; however, the results should be tested on more datasets (Keogh & Kasetty, 2003), because the characteristics of time-series vary in datasets from different domains.

4) The error rate of choosing the candidates is computed, but the quality of the final clusters has not been measured using any standard and common metrics that would make it comparable with other methods.

In a group of works, an incremental clustering approach is adopted which exploits the multi-resolution characteristic of time-series data to cluster them in multiple steps. Vlachos et al. (2003) developed a method based on standard k-Means and Discrete Wavelet Transform (DWT) decomposition to cluster time-series data. They extended the k-Means algorithm to perform clustering of time-series incrementally at different resolutions of the DWT decomposition. At first, they use the Haar wavelet transformation to decompose all the time-series. After that, they apply k-Means clustering at various resolutions, from a coarse to a finer level. At the end of each level, the extracted centers are reused as the initial centers for the next level of resolution. They double the center coordinates at each level because the length of a time-series doubles at the next level. In this algorithm, more and more detail is used during the clustering process. They computed the clustering error at the end of each level by summing up the number of objects clustered incorrectly divided by the cardinality of the dataset. In a similar work, Lin et al. (2004) generalized this work and presented an anytime version of partitional clustering algorithms (k-Means and EM) for time-series. In this method too, the authors use the multi-resolution property of wavelets in their algorithm. Following these works, Lin et al. in (J. Lin et al., 2005) present a multi-resolution clustering approach based on multi-resolution PAA (MPAA) for the incremental clustering of time-series.

Discussion:

1) In terms of clustering speed these approaches are quite good; however, in all these models it is not clear to what level the process should be continued (the termination point).

2) Additionally, in each iteration, all the time-series (which are at the same resolution) are re-clustered again. Therefore, the noise in some of them can affect the whole process.

3) Moreover, this model is applicable only to partitioning clustering, which implies that it does not work for other types of algorithms, such as arbitrary-shape algorithms or hierarchical algorithms in the case where the user needs the structure of the data (the hierarchy of clusters).

4) Another problem which these models should resolve is working with distance measures such as DTW, which, first, are very costly and cannot be applied to the whole dataset, and second, make defining the prototypes a non-trivial task.

2.9 Chapter Summary

Although different studies have been carried out on time-series clustering, the unique characteristics of time-series cause most conventional clustering algorithms to not work well for time-series. In particular, the high dimensionality, very high feature correlation, and the (typically) large amount of noise that characterize time-series data have been viewed as an interesting research challenge in time-series clustering.

Accordingly, most of the studies in the literature have concentrated on two subroutines of clustering:

1) A vast number of studies have focused on the high-dimensional characteristic of time-series data and tried to present a way of representing time-series in a lower dimension compatible with conventional clustering algorithms.

2) Various efforts have been made to present a distance measure based on raw time-series or the represented data.

The common characteristic of both approaches above is the clustering of the transformed, extracted or raw time-series using conventional clustering algorithms such as k-Means, k-Medoids or hierarchical clustering (as discussed in 2.3). However, most of them suffer from the overlooking of data (caused by dimensionality reduction), inaccurate similarity calculation (due to the high complexity of accurate measures), and a lack of quality in the clustering algorithms (because of their nature, which is suitable for static data).

Actually, considering the literature, it can be concluded that most of the studies focus on improving representation methods and distance measurement methods, and the portion devoted to enhancing clustering approaches is very small (Table 2.2, Table 2.3 and Table 2.4).

Among the few approaches and algorithms which have been proposed for time-series clustering, there are some studies which have adopted explicit or implicit strategies for increasing the quality (considering the scalability) of time-series clustering. However, one can still see the problem of low quality or lack of meaningfulness in the clusters. That is, clusterings are either accurate but constructed expensively, or made inexpensively but inaccurate. Our intention in this study is to develop a flexible and accurate clustering model dedicated to the clustering of large time-series datasets. In the following chapter, the methodology of this study to develop such an effective clustering model is explained.

3.0 RESEARCH METHODOLOGY

3.1 Introduction

This chapter explains the research methodology used in the study. The sub-topics in this chapter include an overview of the proposed models, the motivation for designing a multi-step clustering approach, and a description of the methods used in this study to achieve the research objectives mentioned in Chapter 1. Additionally, the approach used for the evaluation of the model and methods is presented. The last section concludes this chapter with a summary.

3.2 Approaches to Research

The research methodology framework of this thesis is shown in Figure 3.1. Each stage of the methodology for this research is explained in the following sub-sections.

3.2.1 Reviewing Related Works

Based on a review of the literature, the characteristics and features of various time-series clustering approaches were analysed. The analysis of existing approaches gives a wider perspective on the problems in time-series clustering. It was found that four elements are essential in the clustering of time-series, i.e., the distance measure, representation method, prototype definition, and clustering algorithm.

3.2.2 Problem Formulation

The literature review clearly examined the issues in the clustering approaches. It was found that, in spite of the advances in representation methods and distance measures, the quality of clustering approaches is not high. Hence, the reasons for the low quality of different approaches are investigated.


Figure 3.1: Research methodology framework

3.2.3 Definition of Research Objectives

The formulation of the problem provides direction for this research to come up with the following objectives:

1. To propose and develop a clustering model to cluster large raw time-series data accurately. This objective needs the following methods:

a. To develop a distance measure for similarity calculation

b. To develop a clustering approach for approximate clustering of the time-series data transformed to a symbolic representation

c. To develop a method to dynamically split the pre-clusters into purer clusters

2. To extend the proposed model, enabling it to run interactively

3. To evaluate the capability of the proposed models in improving the accuracy of clustering

To achieve these objectives, two models are proposed and evaluated extensively in this study. The following section explains the proposed models briefly.

3.2.4 Proposed Models

To achieve the first objective, a multi-step approach, namely MTC, is proposed. The motivation for using a multi-step approach is to address the issues in the existing approaches. The model is then extended (IMTC) to achieve the second objective.

3.2.4.1 MTC

MTC includes three steps: pre-clustering, purifying and merging. Figure 3.2 shows the overall view of the process in MTC briefly.


Figure 3.2: Proposed model for clustering of time-series data (MTC)

1. Pre-clustering step: At first, z-normalization is used to normalize all time-series data.

Then, the time-series data are used in low-resolution mode. That is, SAX is adopted in order to reduce the dimension of the raw time-series before clustering. Then, a proper similarity measure, the approximated distance (APXDIST), and an extended k-Modes algorithm (Ek-Modes) are designed to provide approximate clusters. This step (the pre-clustering step) reduces the input size for the next step, and can be performed incrementally (see Section 4.3 for more details).

2. Purification and summarization: In this step, the time-series data are used in high-resolution (or higher-resolution) mode. In order to purify the pre-clusters, an affinity search technique (PCS) is designed for splitting the time-series data, which creates sub-clusters. Then, prototypes are generated for the time-series in the prepared sub-clusters (more details are provided in Section 4.4).

3. Merging: In the third step, the algorithm goes bottom-up. The prototypes prepared in the previous step are utilized for merging. That is, a clustering algorithm is used to merge the prototypes, which are much smaller in number than the original dataset. This decreases the number of iterations in this step and leads to very fast convergence, and thus low-cost execution. Moreover, the final clusters can be sent back into the second step (as pre-clusters) to increase the quality incrementally. In this case, MTC performs as an interactive clustering approach (this is explained further in Section 4.7).

In the following, the motivation for the design of a multi-step approach is presented:

1. Motivation for step 1 (pre-clustering step):

 Data mining is constrained by disk I/O, especially for large datasets, because they typically do not fit in main memory, and disk I/O tends to be the bottleneck for any data mining task (Faloutsos et al., 1994). Assume that you have one gigabyte of main memory and want to do k-Means clustering. Then, clustering 1 gigabyte of data may take a few minutes, but clustering 1.1 gigabytes of data takes 20 hours (Keogh, 2007). The generic solution for this problem is to create an approximation of the data (Keogh & Pazzani, 2000; J. Lin, Keogh, Lonardi, et al., 2003) which will fit in main memory yet retains the essential features of interest (discussed at length in 2.4). As a result, the whole dataset can be loaded in main memory and the problem at hand is solved approximately (Bradley et al., 1998; Keogh, 2007).

 Since time-series data are normally embedded in noise, dimensionality reduction results in noise shrinkage, which can improve the mining quality (H. Zhang et al., 2006). Accordingly, initializing the clusters on a low-dimension approximation of the data can improve the quality by preventing the local minimum problem (C. Ding et al., 2002). This is the motivation for using dimensionality-reduced data for the initial clusters in the pre-clustering step (see Section 4.3.2).

 Partitioning clustering of objects is an appropriate choice when clusters are compact, of rather equal size, and well separated (Guha et al., 1998). Using a multi-step approach, the advantage of partitioning clustering in splitting clusters is used in the pre-clustering (this is discussed further in 4.5).

2. Motivation for step 2 (purifying step):

 Because of the overlooking of data in the dimensionality reduction process, one cannot rely on the clustering results provided by dimension reduction approaches (especially for sensitive datasets). As a result, regardless of the adopted techniques, the clustering should be applied to high-resolution data.

 The number of time-series in the dataset (the cardinality of the dataset) is as important as the length of the time-series. An algorithm which can deal with a dataset containing a large number of time-series is desirable. Therefore, reducing the data by defining representative(s) for each group of very similar time-series reduces the complexity of the clustering algorithm.

3. Motivation for step 3 (merging step):

 Similarity in shape is desirable in the clustering of time-series; however, the state-of-the-art methods for the similarity evaluation of time-series are mostly quadratic because these methods use dynamic programming (Salvador & Chan, 2007). The cost of comparing two time-series using this technique is quadratic in the length of the time-series, which makes measuring the similarity between two time-series very expensive. Although many pruning techniques have been devised so that time-series similarity queries do not need to compute the similarity measures between the query and every time-series in the database (Bozkaya, Yazdani, & Özsoyoğlu, 1997; Keogh & Ratanamahatana, 2004; Kim et al., 2001; Sakurai, Yoshikawa, & Faloutsos, 2005; Yi et al., 1998; Y. Zhu & Shasha, 2003), they are not suitable for clustering purposes because the dissimilarity matrix must be fully calculated in clustering. For example, in clustering algorithms such as the well-known Unweighted Pair-Group Method with Arithmetic Mean (UPGMA) (Sneath & Sokal, 1973), all distances must be calculated and no pruning can be done. In such cases, the clustering process would benefit from a fast and accurate similarity measure (Gronau & Moran, 2007). As a result, the quadratic nature of existing methods would make this computation extremely lengthy for time-series. Using prototypes, which are small in number, in the third step of MTC for finding time-series similar in shape addresses this issue. It is the motivation for using prototypes in MTC (see Sections 4.4 and 4.5.1). Moreover, clusters of time-series which are similar in shape are very close to the ground truth and more meaningful (see Section 5.3.3.1).

 Arbitrary-shape clusters can be achieved by a sophisticated merging approach. An algorithm which can make a hierarchy in the last step is very important and intuitive. As a result, arbitrary clustering is supported in the last step of this methodology to provide arbitrary-shape clusters as well.

3.2.4.2 IMTC

To address the second objective of this study, the proposed model (MTC) is extended as an interactive clustering method (IMTC). It is very useful to design a clustering model which provides results interactively (Seo & Shneiderman, 2002). That is, there is a demand for interactive clustering (Grass, 1996; Zilberstein & Russell, 1995). Interactive clustering is a clustering approach which is carried out in repetitive steps and tries to improve the results in each iteration. Meanwhile, the user can interrupt the clustering process and get the best results generated so far. If the results are still not satisfactory, the user lets the clustering process continue (more details in Section 4.7).

3.2.5 System Design

According to the proposed models, the representation methods, the distance measure method and the algorithms are depicted in Figure 3.3.

Figure 3.3: Steps of proposed model for clustering of time-series data

In addition to proposing the MTC model, the following methods are developed in each step, and are also considered contributions of this study (highlighted by stars in Figure 3.3):

1. To develop an accurate method for calculating the distance between time-series represented by a symbolic representation (APXDIST) (see Section 4.3.3)

2. To develop an algorithm for the approximate clustering of dimensionality-reduced data (Ek-Modes) (see Section 4.3.4)

3. To develop a method for purifying the pre-clusters (PCS) (see Section 4.4.2)

As mentioned, the MTC model is extended to be performed interactively by repeating the second and third steps. Figure 3.4 shows the process of the model. The design of these two models (MTC and IMTC) is explained in the next chapter (see Chapter 4.0).

Figure 3.4: Proposed IMTC model for interactive clustering of time-series

3.2.6 Analysis of Methods

After designing the proposed methods in the MTC and IMTC models, all the steps were implemented using MATLAB. Then, the proposed methods in each step of MTC were applied to a variety of datasets to assess how each method improves the accuracy of the model in each step. The designed methods in each step (i.e., APXDIST, Ek-Modes, and PCS) are evaluated separately, compared with competitive methods, and analysed. The details of the experiments and results are discussed in Chapter 5.0.

3.2.7 Evaluation Method

To address the third objective of this study, the models are evaluated experimentally. Keogh & Kasetty (2003) conducted an interesting study of different articles in time-series mining and concluded that the evaluation of time-series mining should follow some disciplines, which are recommended as:

 The validation of algorithms should be performed on various ranges of datasets (unless the algorithm is created only for a specific set). The datasets used should be published and freely available.

 Implementation bias must be avoided by careful design of the experiments.

 If possible, data and algorithms should be freely provided.

 New similarity measure methods should be compared with simple and stable metrics such as Euclidean distance.

This study attempts to follow most of these suggestions in order to perform an extensive evaluation. Firstly, around 20 different datasets from different domains, which have been used in different articles for evaluation, are utilized. These include real-world datasets and synthetic datasets. Secondly, standard algorithms for clustering are used to avoid implementation bias. Moreover, the pseudocode for all the methods is provided separately. Finally, for the proposed methods, the results are compared with standard and simple approaches. However, in general, evaluating the extracted clusters (patterns) is not easy in the absence of data labels (H. Zhang et al., 2006), and it is still an open problem. The definition of clusters depends on the user and the domain, and it is subjective.

For example, the number of clusters, the size of clusters, the definition of outliers, and the definition of the similarity among the time-series in a problem are all concepts which depend on the task at hand and must be declared subjectively. These have made time-series clustering a big challenge in the data mining domain. However, owing to the data being labelled by human judges or by their generator (in synthetic datasets), the results can be evaluated by using some measures. The labels of human judges are not perfect in terms of clustering raw data, but in practice they capture the strengths and shortcomings of the algorithms as the ground truth (see Appendix B). To evaluate MTC, datasets whose labels are known are used from different domains. Figure 3.5 shows the process for evaluating MTC.


Figure 3.5: Experimental evaluation of MTC

The choice of measures in this research is based on the most commonly used measures in time-series clustering in the literature (see Section 2.7). The internal measure is based on a study of thirty measures (G. Milligan, 1981); the external measures were found to be the best choices in some recent studies (Amigó et al., 2009; Brun, Sima, Hua, & Lowey, 2007; S. Lin et al., 2008; J. Wu et al., 2009).

The Rand Index, Adjusted Rand Index, Entropy, Purity, Jaccard, F-measure, FM, CSM, and NMI are used for the evaluation of MTC. All of these clustering evaluation criteria have values ranging from 0 to 1, where 1 corresponds to the case when the ground truth and the discovered clusters are identical (except Entropy, which is inverted and called cEntropy). Thus, here, larger criterion values are preferred. Each of the mentioned evaluation criteria has its own benefit, and there is no consensus in the data mining community on which criterion is better than the others. To avoid biased evaluation, the average of all measures is computed in this thesis and the conclusions are drawn based on the average value. Moreover, to report the results, the average quality over 100 runs is calculated to prevent the bias of random initialization (De Gregorio & Maria Iacus, 2010; Hirano & Tsumoto, 2007; J. Lin, Vlachos, et al., 2004; Petitjean et al., 2011; Ratanamahatana et al., 2005; Vlachos et al., 2003). For different parameter combinations, the average quality is reported as the accuracy of clustering in all experiments in this thesis. Although the focus of this study is on improving accuracy, the scalability and sensitivity of the proposed model are calculated to show its feasibility theoretically.

3.3 Chapter Summary

The methodology adopted for this research was discussed in this chapter. According to the research objectives, a research methodology framework was proposed. As explained, two models are proposed and developed in this study (MTC and IMTC). These models work as multi-step clustering approaches. The motivation for using the multi-step approach was discussed for each step. The details of the proposed model and the techniques used in each step were explained according to the following sequence: pre-clustering, purifying and merging. Additionally, it was explained that for each step new methods had to be designed, namely APXDIST, Ek-Modes and PCS. The methods developed to achieve the objectives were mentioned here; however, the details of each step of the model are explained in the next chapter. Finally, the evaluation plan for the proposed model was explained as the last step of the research methodology framework. As explained, different datasets from various domains are used to evaluate the accuracy of the proposed model.

4.0 SYSTEM DESIGN

4.1 Introduction

In this chapter, the proposed model used for the accurate clustering of time-series data, i.e., Multi-step Time-series Clustering (MTC), is explained and designed in detail. In Section 4.2, a general view of the model, which is based on multi-step clustering, is given. Each step is explained in the subsequent sections. At first, clusters are made in high-level mode, considered as the pre-clustering step, which is explained in Section 4.3. Then, the purification of the clusters is carried out in Section 4.4, which generates sub-clusters represented by prototypes. In the third step, the sub-clusters (prototypes) are merged to form the final clusters in Section 4.5. Additionally, Section 4.7 explains how the whole process can be performed as an interactive algorithm (IMTC).

4.2 Overview of Proposed Model (MTC)

Time-series datasets have different characteristics, e.g., short or long, multivariate or univariate, equal- or various-length time-series. Additionally, they may vary across domains: for example, time-series may have big changes in the starting time points and small changes after a while. Alternatively, they may vary across datasets: for example, one dataset may include time-series which have high frequency, whereas another may have lower frequency but high noise or outliers. In order to cluster time-series data, an appropriate clustering algorithm should be adopted. The proper type of clustering algorithm for time-series data depends on the size and shape of the clusters, the type of time-series, the importance of different points of the time-series, and some other characteristics of the dataset such as its size, noisiness and number of outliers. As a result, considering all these varieties, proposing a general solution may be less effective than more specific approaches. However, all these characteristics are related to some component of time-series clustering (such as the representation method, distance measure or prototype construction) and can be handled by adopting a proper approach for that specific component (considering the domain and the characteristics of its time-series); still, a comprehensive clustering model is needed to overcome the problem discussed in 1.3.

In this study, Multi-step Time-series Clustering (MTC) is presented as a clustering model specifically for large time-series datasets in domains which need accurate clusters (e.g., finance, healthcare). It overcomes the limitations of traditional clustering algorithms discussed in Chapter 2.0.

In this approach, at first, clusters are made in high-level mode (pre-clustering), then accurate sub-clusters are generated (purifying and summarization), and finally, in the third step, the sub-clusters are merged to form the final clusters (merging).

Figure 4.1 provides an overview of the overall approach used by MTC to find the clusters in a dataset.


Figure 4.1: The overall view of steps of MTC

Each step of this methodology includes some activities, which are mentioned in Figure 4.2.

| Steps | Activities | Methods |
| Step 1: Pre-clustering | Activity 1.1: Pre-processing | Z-normalization |
| | Activity 1.2: Reducing the dimension of time-series | SAX |
| | Activity 1.3: Calculating distance | APXDIST |
| | Activity 1.4: Performing the pre-clustering to group data | Ek-Modes |
| Step 2: Purifying | Activity 2.1: Distance calculation on high-resolution time-series | ED |
| | Activity 2.2: Purifying of pre-clusters (generate sub-clusters) | PCS |
| | Activity 2.3: Prototyping (summarization) | Multi prototypes / single prototype |
| Step 3: Merging | Activity 3.1: Distance calculation | DTW |
| | Activity 3.2: Merging | Hierarchical, partitioning, etc. |
| | Activity 3.3: Mapping sub-cluster objects to new clusters | - |

Figure 4.2: Activities of each step of MTC

According to the above steps, activities of the MTC are explained in the following sections:

4.3 Step 1: Pre-clustering (Approximate Clustering)

In order to cluster large time-series datasets, an efficient mechanism to reduce the size of the data is required. Data size reduction can be applied from two aspects: reduction in the number of input objects (cardinality reduction), and reduction of the dimension of the objects (low-resolution time-series). Reduction of the input size can be performed by random sampling, where clustering is performed on a random sample drawn from the original dataset. This approach is effective for some large datasets and has been used in works such as (Guha et al., 1998; Karypis et al., 1999); however, sampling itself is not very straightforward (Vitter, 1985). The second approach is dimension reduction, which is the one used in MTC. In the first step of the proposed algorithm (second activity), MTC focuses on reducing the dimension of the time-series data. The key idea of pre-clustering is to apply clustering to the low-resolution time-series, which can fit in memory, rather than to the original (raw) time-series dataset. The obvious advantage of pre-clustering is that the whole execution time is reduced, because it is run on a much lower dimension of the data instead of the high-dimensional data. Moreover, probable noise existing in the time-series is handled by the time-series reduction (C. Ding et al., 2002).

The objective of this step is to develop a simple partitioning scheme for speeding up the clustering. As a result, a pre-clustering is required to find approximate clusters as fast as possible with moderate quality. However, it should be taken into account that some clusters may be missed through the clustering process on low-resolution objects; this is addressed in the second step. Even though this step (pre-clustering) is a trade-off between accuracy and speed, as shown in the experimental results (see Section 5.3.1), given a moderate resolution it constructs approximately good clusters.

Note that the size of the dataset can be so large that, even with dimensionality reduction, it cannot fit in memory. One solution for this case is to utilize incremental clustering. In an incremental clustering approach, clusters are updated (or expanded) incrementally. In one study (Aghabozorgi et al., 2012), the authors developed an incremental approach for the clustering of time-series in large datasets, which is also applicable in this step. Interested readers are referred to that work (Aghabozorgi et al., 2012).

The workflow of the pre-clustering step is shown in Figure 4.3.

Figure 4.3: The workflow of the first step of MTC, where transformed time-series are clustered.

The execution time of the algorithm in the pre-clustering step depends directly on three factors: the resolution of the time-series, the distance calculation complexity and the clustering algorithm complexity. In Section 6.3.2, these factors are discussed further. In the following sections, the pre-processing, dimensionality reduction method, distance measure and pre-clustering algorithm used in the pre-clustering step are explained in detail.

4.3.1 Activity 1: Pre-processing

Pre-processing is necessary before attempting to match two time-series under Euclidean distance, Dynamic Time Warping or any other distance measure. It is well understood that it is meaningless to compare time-series with different offsets and amplitudes (Keogh & Kasetty, 2003). As a result, the time-series are standardized using the z-score (z-normalization) (Han & Kamber, 2011), which makes time-series invariant to scale and offset. That is, each time-series is normalized in this activity to have a mean of zero and a standard deviation of one before being discretized in the next activity. This transfers the time-series from absolute values to another space which is suitable for comparison. Moreover, it decreases the sensitivity of distance measures to scaling.

Suppose F = {f_1, ..., f_T} is a time-series with T data points. Z-normalization is defined as:

\[ z_t = \frac{f_t - \mu}{sd} \tag{4.1} \]

where

\[ \mu = \frac{1}{T}\sum_{t=1}^{T} f_t \tag{4.2} \]

and

\[ sd = \sqrt{\frac{1}{T}\sum_{t=1}^{T} (f_t - \mu)^2} \tag{4.3} \]

where μ is the arithmetic mean of the data points f_1 through f_T, and sd is the standard deviation of all data points in that time-series. All time-series of each dataset are normalized in this activity. Figure 4.4 and Figure 4.5 show three raw and normalized time-series in a cluster.
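A one-line realization of Equations 4.1-4.3 follows (illustrative Python; the thesis implementation is in MATLAB):

```python
import numpy as np

def z_normalize(f):
    """Equations 4.1-4.3: shift to zero mean and scale to unit standard deviation."""
    f = np.asarray(f, dtype=float)
    return (f - f.mean()) / f.std()   # population standard deviation, as in Equation 4.3
```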


Figure 4.4: Raw time-series before normalization. Figure 4.5: Normalized time-series.

4.3.2 Activity 2: Dimensionality Reduction

One can start with this question: why should dimensionality reduction be applied to the data in the first step? As was discussed in Section 3.2.4.1 as part of the motivation for designing the multi-step approach, dimensionality reduction is important because raw time-series are high-dimensional data, and distance calculation between raw time-series is not very fast. That is, most accurate distance measures are quadratic in the length of the time-series, which is a big challenge for raw time-series. Moreover, dimension reduction can solve problems such as noise in time-series to a great extent. This was discussed widely in the literature review (see Section 2.4).

There are many dimensionality reduction methods suggested in the literature (Section 2.4) which provide lower-resolution data (compatible with the domain). In this thesis, SAX is adopted as the representation method because of its low complexity and relatively good quality. The superiority of SAX was discussed in Section 2.4.4 in detail. However, some questions remain: how much reduction should be applied to the dataset, considering the probability of missing clusters? How dependent is the final clustering result on the quality of the pre-clusters? How much effect do missing clusters have on the final result? These questions are answered by experiment in Section 5.3.1.1.

To represent time-series by the SAX representation, for each time-series, a reduced time-series is initialized as follows. One can consider F as a time-series, where F = {f_1, ..., f_n}. Then, the time-series F is discretized by Piecewise Aggregate Approximation as F̄ = {f̄_1, ..., f̄_w}. In this process, w is the number of PAA segments representing time-series F. Each segment of F̄, i.e., f̄_i, is a real value which is the mean of all data points in the i-th segment of F, defined as:

\[ \bar{f}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w}\,i} f_j, \qquad i \in [1, w] \tag{4.4} \]

Then all time-series data are transformed to F̂ (by mapping the PAA coefficients to 'a' SAX symbols), where F̂ = {f̂_1, ..., f̂_w}. In this process, w is the number of PAA segments representing the time-series, and 'a' is the alphabet size, or the number of symbols (e.g., for the alphabet {a, b, c}, a = 3). To define the alphabet in SAX, the “breakpoints” are used that produce equal-sized areas under the Gaussian curve.

Definition 4.1: Breakpoints, "breakpoints are a sorted list of numbers $B = \beta_1, \ldots, \beta_{a-1}$ such that the area under a N(0,1) Gaussian curve from $\beta_i$ to $\beta_{i+1}$ equals $1/a$ ($\beta_0$ and $\beta_a$ are defined as $-\infty$ and $\infty$, respectively)." (J. Lin, Keogh, Lonardi, et al., 2003).

Symbols in SAX are defined based on the location of the PAA values $\bar{f}_i$ in the regions defined by the breakpoints. That is, using the values of the PAA coefficients and their location in the intervals made by the breakpoints, each segment of $\bar{F}$ is coded as a symbol of $\hat{F}$ using the following equation:

$$\hat{f}_i = \text{alpha}_x \iff \beta_{x-1} \le \bar{f}_i < \beta_x \qquad (4.5)$$

where $\text{alpha}_x$ is the $x$-th character of the alphabet set. For example, Figure 4.6 shows a time-series converted to SAX. In this example, for alphabet size $a = 6$, there are five breakpoints $B = \{-0.97, -0.43, 0, 0.43, 0.97\}$ which divide the area under the Gaussian curve into six equiprobable regions. In the example, with $n = 32$, $w = 4$ and $a = 6$, the time-series is mapped to the four-symbol word shown in Figure 4.6.

Figure 4.6: A sample of a time-series represented by SAX

Then, $\hat{F}$, the dimensionality reduced time-series, is used instead of the raw time-series $F$.
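As an illustration, the following Python sketch implements the PAA and SAX transformation described above, reusing z_normalize from the earlier sketch (the helper names, the numpy dependency, and the restriction to alphabet sizes 4 and 6 are assumptions made for this example):

import numpy as np

# Breakpoints that cut the N(0,1) curve into 'a' equiprobable regions
BREAKPOINTS = {4: [-0.67, 0.0, 0.67],
               6: [-0.97, -0.43, 0.0, 0.43, 0.97]}

def paa(ts, w):
    """Piecewise Aggregate Approximation (Eq. 4.4): the mean of each of w segments."""
    ts = np.asarray(ts, dtype=float)
    return ts.reshape(w, -1).mean(axis=1)        # assumes len(ts) is divisible by w

def sax(ts, w, a):
    """Map the PAA coefficients to symbols using the breakpoints (Eq. 4.5)."""
    betas = BREAKPOINTS[a]
    alphabet = 'abcdefghij'[:a]
    regions = np.searchsorted(betas, paa(ts, w), side='right')
    return ''.join(alphabet[i] for i in regions)

# Example: a 32-point normalized series reduced to a 4-symbol word with a = 6
word = sax(z_normalize(np.sin(np.linspace(0, 6, 32))), w=4, a=6)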

Undeniably, dimensionality reduction has some disadvantages which should be considered in the clustering process. For example, because low-resolution time-series are used, it can lead to the construction of incorrect clusters, empty clusters, or missing certain clusters. However, the experimental results indicate that, first, with a moderate compression ratio for SAX, generally acceptable results are obtained in the first step of MTC; second, increasing the resolution of the time-series does not necessarily improve the accuracy of conventional clustering (see Section 5.1.1.3.1).

4.3.3 Activity 3: Distance Calculation (APXDIST)

In order to make the pre-clusters, an appropriate distance measure compatible with SAX is desirable. In this section, a new method for distance measurement between time-series (APXDIST) is introduced. It is the answer to the first question of this study, i.e., "is there any alternative approach for increasing the accuracy of clustering of symbolized time-series?"

For dimensionality reduced time-series (using SAX), the Euclidean (Lai et al., 2010) or MINDIST (J. Lin et al., 2007) measures are typically used to calculate the similarity between two time-series. In this activity, a new distance method (APXDIST) is designed based on the SAX representation, which is also one of the contributions of this thesis. The method is explained by posing the question: "what is the motivation for using APXDIST?"

Using the SAX representation, a distance metric compatible with SAX is desirable. J. Lin et al. (2007) introduced MINDIST as a compatible distance metric for SAX. The distance between two symbolized time-series $\hat{F}$ and $\hat{Y}$ is calculated by:

$$MINDIST(\hat{F}, \hat{Y}) = \sqrt{\frac{n}{w}}\,\sqrt{\sum_{i=1}^{w} \big(dis(\hat{f}_i, \hat{y}_i)\big)^2} \qquad (4.6)$$

where the dis() function is defined as the minimum distance between symbols of the represented time-series, e.g., dis(a,a) = 0, dis(a,b) = 0, dis(a,c) = $\beta_2 - \beta_1$, and so on. The dis() function is implemented using a table lookup and does not need to be recalculated for each pair of symbols, which is considered its outstanding feature. The dis() function is calculated by:

$$dis(\hat{f}_i, \hat{y}_i) = \begin{cases} 0, & |x - y| \le 1 \\ \beta_{\max(x,y)-1} - \beta_{\min(x,y)}, & \text{otherwise} \end{cases} \qquad (4.7)$$

where $x$ and $y$ are the alphabet indices of $\hat{f}_i$ and $\hat{y}_i$.

However, this distance (MINDIST) was introduced to address the indexing problem in the time-series domain and is not accurate enough for calculating the distance among time-series in the clustering problem. As a result, in this thesis a new approach, APXDIST, is defined; it is depicted in Figure 4.9.

SAX is defined based on PAA (Keogh et al., 2001a; Yi & Faloutsos, 2000) and assumes normality of the resulting aggregated values (see Sections 2.4.4 and 4.3.2 for more details and definitions). Essentially, normalized subsequences have a highly Gaussian distribution according to an empirical test by J. Lin, Keogh, Lonardi, et al. (2003). Given that normalized time-series have a highly Gaussian distribution, a distance between the symbols is introduced based on this Gaussian characteristic, the so-called APXDIST.

In the MINDIST approach, the distance between two SAX representations of time-series requires looking up the distances between each pair of symbols. As mentioned, symbols in SAX are defined based on the location of PAA coefficients in regions or buckets. The distance between the symbols is calculated in relation to the distance between these buckets made by the breakpoints, that is, the height between the regions, as illustrated in Figure 4.7. In effect, to calculate the distance between two symbols in different regions, the distance between the regions must be calculated. For example, for alphabet size a = 6, there are five breakpoints B = {-0.97, -0.43, 0, 0.43, 0.97} which divide the area under the Gaussian curve into six equiprobable regions, as illustrated in Figure 4.6. Equiprobable means that the probability of a segment falling into any of the regions is approximately the same.


Figure 4.7: MINDIST measure for calculating similarity between symbolized time-series using SAX

As mentioned, MINDIST is proper for indexing purposes (because of its lower-bounding feature), but not accurate for clustering because, by its definition, it considers the distance between neighbouring symbols as zero and ignores the maxima and minima points of the time-series. As a result, a more precise distance measure is defined to address this shortcoming of MINDIST. The key idea behind APXDIST is that a more precise distance between the regions can be defined than their minimum height, i.e., the distance between regions can be calculated as the distance between indicators of the regions. This reduces the probability of missing important points of the time-series in the upper or lower regions (e.g., the maximal and minimal points indicated by the symbols 'f' or 'a' in Figure 4.7), and of dissimilar adjacent symbols being treated as identical (e.g., dis(a,b) = 0 in MINDIST; see Appendix C). For each region, an indicator is defined in such a way that the closeness of the PAA coefficients in the region to the indicator is the highest in that region. To define the indicator, the distribution of PAA coefficients in each area is considered. For example, the distribution of a sample area for a symbol (e.g., 'e') is depicted in Figure 4.8.


Figure 4.8: Distribution of a sample area for a symbol (e.g., 'e')

Then, the arithmetic mean of each area is defined as the indicator of the area, as the best estimator of the region:

$$Ind_x = \frac{\beta_{x-1} + \beta_x}{2} \qquad (4.8)$$

where, for the lowest and highest regions, $\beta_0$ is the global minimum and $\beta_a$ is the global maximum. For example, six indicators are defined for a = 6, i.e., Ind = {(Max + 0.97)/2, 0.70, 0.21, -0.21, -0.70, (Min - 0.97)/2}. Here, Max (and Min) indicate the global maximum (and minimum) of the time-series in a dataset. The indicators are used in APXDIST to calculate the distance between two symbols in different regions associated with the alphabetic symbols. For example, the mean line of the area between the cut-lines 0.97 and 0.43, which indicates the alphabet 'e', is 0.70. Figure 4.9 illustrates a visual intuition of the measure: the black line shows the MINDIST between two symbols, while the green line indicates the APXDIST distance between the same symbols. Accordingly, the indicator of the region related to 'a', that is, the region below the cut-line -0.97, is the mean line (Min - 0.97)/2, where Min is the minimum value of the PAA coefficients for a dataset. As a result, the distance between the symbols 'e' and 'a' is calculated as the distance between the two mean lines, that is, 0.70 - (Min - 0.97)/2.


Figure 4.9: For calculating the distance between the time-series in Figure 4.6 and another time-series, an indicator is defined for each area.

This approach for calculating the distance between the regions (symbols) results in a tighter distance measure. As in MINDIST, the distance between the SAX representations of two time-series requires looking up the distances between each pair of symbols. In APXDIST, the distance between a pair of symbols is the approximate distance between their two indicators, which is also read from a lookup table; this makes the calculation very fast because it is pre-calculated. Then, following the same procedure as MINDIST, the symbol distances are squared and summed, the square root is taken, and finally the result is multiplied by the square root of the compression rate ($\frac{n}{w}$). Based on this definition, the APXDIST between each pair of symbolized time-series is defined as follows:

$$APXDIST(\hat{F}, \hat{Y}) = \sqrt{\frac{n}{w}}\,\sqrt{\sum_{i=1}^{w} \big(dis(\hat{f}_i, \hat{y}_i)\big)^2} \qquad (4.9)$$

where the dis() function is calculated by

$$dis(\hat{f}_i, \hat{y}_i) = \begin{cases} 0, & x = y \\ |Ind_x - Ind_y|, & \text{otherwise} \end{cases} \qquad (4.10)$$

where $Ind_x$ is the indicator of the region of the $x$-th symbol, and $dis(\hat{f}_i, \hat{y}_i)$ is computed by a lookup table. For example, for an alphabet of cardinality 6, i.e., a = 6, the lookup table is illustrated in Table 4.1. In this table, the distance between two symbols can be read off by examining the corresponding row and column.

Table 4.1: A lookup table used by the APXDIST function. This table is a sample for an alphabet size of 6 with Min = -1.2 and Max = 1.2.

APXDIST    a       b       c       d       e       f
a          0       0.38    0.87    1.30    1.78    2.16
b          0.38    0       0.49    0.91    1.40    1.78
c          0.87    0.49    0       0.42    0.91    1.30
d          1.30    0.91    0.42    0       0.49    0.87
e          1.78    1.40    0.91    0.49    0       0.38
f          2.16    1.78    1.30    0.87    0.38    0
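The following Python sketch shows how the indicators and the APXDIST lookup table can be built and applied (the helper names and the numpy dependency are illustrative; Min and Max follow the sample values of Table 4.1):

import numpy as np

def build_indicators(betas, g_min, g_max):
    """Indicator of each region = the mean of its two bounds (Eq. 4.8)."""
    bounds = [g_min] + list(betas) + [g_max]     # region edges, lowest to highest
    return [(bounds[i] + bounds[i + 1]) / 2 for i in range(len(bounds) - 1)]

def build_lookup(ind):
    """Pre-calculated symbol-distance table: dis(x, y) = |Ind_x - Ind_y| (Eq. 4.10)."""
    return [[abs(u - v) for v in ind] for u in ind]

def apxdist(s1, s2, n, lookup, alphabet='abcdef'):
    """APXDIST between two SAX words of equal length w (Eq. 4.9)."""
    w = len(s1)
    sq = sum(lookup[alphabet.index(c1)][alphabet.index(c2)] ** 2
             for c1, c2 in zip(s1, s2))
    return np.sqrt(n / w) * np.sqrt(sq)

betas = [-0.97, -0.43, 0.0, 0.43, 0.97]          # breakpoints for a = 6
lookup = build_lookup(build_indicators(betas, g_min=-1.2, g_max=1.2))
d = apxdist('abba', 'acca', n=32, lookup=lookup)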

4.3.4 Activity 4: Pre-clustering (Ek-Modes)

In order to make the pre-clusters, a clustering method compatible with SAX is desirable. Here, a new clustering algorithm (Ek-Modes) is developed for pre-clustering of dimensionality reduced time-series, which is explained in the following.

For the approximate clustering of time-series in MTC, any partitioning approach such as k-Means or k-Medoids, or even a hierarchical approach (if the dataset is not that large) such as an agglomerative or divisive approach, can be used. However, partitioning clustering is preferred for MTC. Why is partitioning clustering the better choice for clustering dimensionality reduced data? The motivation for using partitioning clustering is its simplicity, its high speed (especially on large datasets) (Huang, 1998), and its rather good quality (as shown in Section 5.3.1.3). Moreover, a study (C. Ding et al., 2002) shows that choosing the centroids on a low-dimensional approximation of the data increases the quality of partitioning clustering. Hence, performing partitioning clustering on approximated data leads to better quality while retaining its ability to cluster fast. Furthermore, a comparison shows that partitioning clustering using SAX produces better results than using raw data (C. Ding et al., 2002). Now the question is: which partitioning clustering algorithm is better?

The mode, mean, median and medoid are popular methods of determining the representative of clusters in partitioning clustering. With respect to these prototypes, k-Means, k-Median and k-Medoids have been introduced in the literature. In many works, k-Medoids is used (instead of k-Means) due to its robustness to outliers (Andreopoulos et al., 2009), which are unavoidable in time-series datasets. In k-Medoids, the representative (prototype) of a cluster is the time-series within the cluster which has the maximum similarity to the others. However, in this activity of MTC, for the first time, an extended k-Modes (Ek-Modes) is introduced and used to divide N time-series into k partitions that form higher quality clusters. The following are the reasons why k-Modes is a better choice among the partitioning algorithms:

1) The SAX data is categorical in each segment, and k-Modes works well with categorical data (Huang, 1997, 1998).

2) The efficiency of k-Modes is high, especially for categorical data (Huang, 1998), which suits the time-series in the first step, which are represented as symbolized data.

3) Because the centroid of each cluster is built from the modes (not the mean), it is robust against outlier time-series.

To the best of the author's knowledge, this is the first time that k-Modes is used as a solution for symbolized time-series clustering. The k-Modes algorithm (Huang, 1997, 1998) belongs to the k-Means family of algorithms and has been used for clustering categorical data in different works (Andreopoulos, An, & Wang, 2005; Manganaro, Paratore, Alessi, Coffa, & Cavallaro, 2005).

In Ek-Modes, at first, k cluster centers of the same dimensionality as the time-series represented by SAX are chosen. They are initialized either randomly, or in such a manner that they are not outlier time-series. The iteration starts with the set of initial centers; then, the distances between the remaining symbolized time-series and the chosen centers are calculated. However, the distance measure here differs from conventional k-Modes. That is, instead of the total number of mismatched corresponding attributes of two objects, a distance designed for this categorical data, APXDIST (see Section 4.3.3), is used to create the dissimilarity matrix. Let the inverse of $APXDIST(\hat{F}_i, \hat{F}_j)$ indicate the similarity between two time-series. Then, the objective is to find k partitions of the approximated time-series that are as compact (in terms of similar time-series within the clusters) and as separated (in terms of the distance between the clusters) as possible, i.e., to minimize the squared error. Accordingly, the lowest within-cluster variation is sought as the cost function, calculated by:

$$E = \sum_{i=1}^{k} \sum_{\hat{F}_j \in \hat{C}_i} APXDIST\big(\hat{F}_j, R_i\big)^2 \qquad (4.11)$$

where $\hat{F}_j$ is an approximated time-series of pre-cluster $\hat{C}_i$ and $R_i$ is the prototype (representative) of the $i$-th cluster.

Subsequently, the prototype of each cluster is calculated (updated). Here, a prototype is the mode of each cluster and is considered as the cluster center, representing all dimensionality reduced time-series within that cluster, that is:

$$R_i = \mathrm{Mode}(\hat{C}_i) \qquad (4.12)$$

Assume $\hat{C} = \{\hat{F}_1, \ldots, \hat{F}_m\}$ is a set of time-series within a cluster, where each $\hat{F}_i = \{\hat{f}_{i1}, \ldots, \hat{f}_{iw}\}$. Let the domain of each $\hat{f}_{ij}$ be the alphabet $A$ defined in SAX. Then the prototype of the cluster is the mode of $\hat{C}$, defined as a vector $\hat{R} = \{\hat{r}_1, \ldots, \hat{r}_w\}$, where $\hat{r}_j$ is the most frequently occurring value in the $j$-th segment:

$$\hat{r}_j = \underset{s \in A}{\arg\max}\ \big|\{\hat{F}_i \in \hat{C} : \hat{f}_{ij} = s\}\big|, \quad j \in [1, w] \qquad (4.13)$$

For example, considering three time-series represented by SAX in the cluster $\hat{C}$, $\hat{F}_1$: abbcc, $\hat{F}_2$: acccd, $\hat{F}_3$: bbbcc, the "mode" of the cluster is $\hat{R}$ = abbcc.
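A minimal Python sketch of this position-wise mode computation (the helper name mode_prototype is an illustrative assumption):

from collections import Counter

def mode_prototype(sax_words):
    """Position-wise mode of equal-length SAX words (Eq. 4.12 and 4.13)."""
    # For each segment position, take the most frequent symbol across the cluster
    return ''.join(Counter(col).most_common(1)[0][0] for col in zip(*sax_words))

print(mode_prototype(['abbcc', 'acccd', 'bbbcc']))   # prints 'abbcc'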

Using this technique, the prototype of a cluster of time-series data is constructed. The iterations then continue until convergence, i.e., until all cluster centers become stable. In general, after each iteration, the quality of the clusters and of the modes themselves improves.

Using the modes, a cluster with strong intra-similarity is obtained when the objects are categorical, which results in an efficient clustering of large categorical datasets. The efficiency of Ek-Modes (as a special case of k-Modes) is the same as that of k-Means, because k-Means and k-Modes follow a similar process (Huang, 1998). Moreover, using the mode makes the proposed approach robust to outliers, since the mean can easily be influenced by outliers (Andreopoulos et al., 2009).

The pseudo code for pre-clustering is given in Figure 4.10. The algorithm takes two input parameters. The first, k, determines the desired number of final clusters, the so-called natural clusters in the dataset; as in most other methods, it is assumed that the user has a good guess for k. The second parameter is the dimensionality reduced dataset.

Input: K: number of clusters
       D̂: a time-series dataset represented by SAX
Output: C: approximated clusters (pre-clusters)
Method: Ek-Modes (K, D̂)
1. If K = 0, decide on a value for K.
2. Initialize K cluster centers R_1, ..., R_K of the same dimensionality as the symbolized time-series by randomly selecting data objects.
3. Iteration i = 0.
4. Dis = distance (using APXDIST) between each symbolized time-series F̂_j and the centers.
5. Assign each F̂_j to the cluster C_i with the nearest center R_i.
6. Set the new center of each cluster to its mode: R_i = Mode(C_i).
7. If none of the N objects changes membership, the clustering is complete; otherwise, repeat steps 4 to 7.
8. Return C.

Figure 4.10: Pseudo code for pre-clustering by Ek-Modes
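For concreteness, a compact Python sketch of the Ek-Modes loop, reusing the apxdist and mode_prototype helpers sketched earlier (the random initialization and the fixed iteration cap are simplifying assumptions; the overall structure follows Figure 4.10):

import random

def ek_modes(words, k, n_raw, lookup, max_iter=100):
    """Pre-cluster SAX words using APXDIST distances and mode prototypes."""
    centers = random.sample(words, k)                # step 2: random initial centers
    assign = None
    for _ in range(max_iter):
        # steps 4-5: assign each word to the nearest center under APXDIST
        new_assign = [min(range(k),
                          key=lambda c: apxdist(w, centers[c], n_raw, lookup))
                      for w in words]
        if new_assign == assign:                     # step 7: membership is stable
            break
        assign = new_assign
        # step 6: update each center to the mode of its members
        for c in range(k):
            members = [w for w, a in zip(words, assign) if a == c]
            if members:
                centers[c] = mode_prototype(members)
    return assign, centers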

The superiority of Ek-Modes over other partitioning algorithms is shown experimentally in Section 5.3.1.3.

4.4 Step 2: Purifying and Summarization

The main objectives of this step of MTC are refining the formed clusters and summarization. The first target is to refine (purify) the pre-clusters and improve their quality. The second is to provide prototypes in order to reduce the complexity of the algorithm. Figure 4.11 shows the activities of this step.


Figure 4.11: Activities of the second step of MTC

Purifying: The first activity attempts to increase the quality of each formed cluster (entering from the first step, or from the third step in the interactive model) by replacing its low-resolution time-series with higher-resolution time-series and then splitting the clusters. The concept of the Cluster Affinity Search Technique (CAST) (Ben-Dor, Shamir, & Yakhini, 1999) is used and extended in this thesis to refine the formed clusters. In this activity, a pre-cluster (approximated cluster) entering from the previous step (or from the third step in the IMTC model) is broken down into sub-clusters based on similarity in time. Euclidean distance is used in the second step; although Euclidean distance cannot reveal time-series that are dissimilar in shape (e.g., those with a shift) within a cluster, it at least separates them into a new cluster if they are very far from the others (mis-clustered in the pre-clustering). These separated clusters are handled in the third step, which employs an elastic distance method and merging to place such time-series into their correct clusters. The key point of the third step is the ability to use the DTW distance (to find time-series similar in shape). DTW can find clusters based on similarity in shape, and its elastic nature helps it overcome the shortcomings of one-to-one matching and the lack of shift support in approaches such as Euclidean distance (see Section 2.5.3).

Summarization: The second activity defines representatives for the sub-clusters based on the context in which the clusters will be used. The motivation for defining representatives is the reduction of complexity in the third step. Many clustering algorithms, such as hierarchical clustering, have a complexity higher than O(N²), which makes their use on large datasets impractical. This problem is more crucial for time-series datasets because they are essentially high-dimensional, and thus more challenging. The proposed model (MTC) faces the same issue when merging well-separated clusters (in the third step), because they consist of high-resolution time-series (possibly raw time-series), which results in high computational complexity. To address this issue, and considering that the clusters formed in this step have high similarity in time, their prototypes are used instead of all the time-series in the sub-clusters. By finding and storing one or more representatives for each sub-cluster in the second step, the input size of the data is reduced with almost no loss of quality, and with a guarantee that it fits in main memory. In the following sub-sections, the main activities and methods of this step are described in detail.

4.4.1 Activity 1: Calculate Similarity

This activity is simple and straightforward. For each pre-cluster, its members (time-series represented by SAX) are replaced with the corresponding time-series at a higher resolution. Clearly, higher resolutions will result in higher quality in the final results. However, the raw time-series may not be the best choice of highest resolution for very noisy datasets, because noise affects the quality of purifying the pre-clusters. As a result, for noisy datasets it is advisable to use a lower compression than in the first step rather than the raw data.

In the first step, only the approximate similarity in time was calculated for each pair. Here, a more accurate similarity measure is calculated on the higher-resolution time-series. For this step, Euclidean distance is used as the similarity measure to calculate the distance between the time-series in each pre-cluster. The distance calculated by Euclidean distance is more accurate than APXDIST because it is based on higher-resolution data. Moreover, ED takes all data points of the time-series into account and compares each pair of data points in time; that is, the similarity-in-time of the time-series is computed. Figure 4.12 depicts the intuition behind using ED in the second step.

Figure 4.12: Similarity calculation in different steps of MTC

4.4.2 Activity 2: Purifying of Clusters (PCS)

The pre-clusters constructed in the first step are refined here. The problem is: given a cluster, it should be decomposed into separated sub-clusters that are as pure as possible.

Definition 4.2: Sub-cluster, a sub-cluster is a set of individual time-series that are close to each other (similar in time), created by splitting a pre-cluster, and represented by a single prototype (or more than one prototype).

Here, the purity of a cluster (a pure cluster) is defined as follows:

Definition 4.3: Pure cluster, a cluster is pure if all its members are members of the same natural cluster (ground truth).

This means that the sub-clusters (made from a pre-cluster) should be such that most of their members belong to a single natural cluster (class). Because the pre-clusters are generated approximately, not precisely, they are often mixed with time-series from different classes; these must be recognized and separated by searching within the pre-clusters. As a result, the pre-clusters are broken into so-called pure sub-clusters. Of course, assigning each object to a separate cluster (singleton clusters) would provide the highest purity; the best answer, however, consists of the purest clusters together with the smallest number of clusters.

For illustrative purposes, a simple diagram in 2-dimensional space is used to describe the intuition behind the process of splitting approximated pre-clusters (see Figure 4.13).

Figure 4.13: A 2-dimensional pre-cluster and its sub-clustering

Each time-series in a specific pre-cluster is considered as an object in an n-dimensional space. A similarity measure between objects can then be calculated and stored in an m-by-m similarity matrix Sim, where Sim(i, j) is the distance (similarity) measure between time-series i and time-series j. Then, an algorithm developed based on affinity search, Pure Cluster Search (PCS), is carried out on the similarity matrix to determine whether the cluster members are solidly grouped (see Section 4.4.2.2). For any pre-cluster whose members are not solidly grouped, PCS divides the members into sub-clusters. At that point, for each sub-cluster, a time-series is selected as a prototype for further clustering.

In the next section, the cluster affinity concept is explained. Then, the process of selecting prototypes is given in Section 4.4.3. The experimental results in Section 5.3.2 verify that this process can reduce the size of the data in some datasets by approximately 70 per cent, without greatly reducing purity.

4.4.2.1 Motivation for Using Cluster Affinity Concept

The concept of cluster affinity is borrowed from the Cluster Affinity Search Technique (CAST), which is used to find close objects. It is used in MTC because its output consists of discrete, separated clusters produced without predetermining the number of clusters. In contrast to many algorithms where the number of clusters must be predefined, the mechanism used in the CAST algorithm can find clusters dynamically and deals with outliers effectively (Jiang & Tang, 2004). Hence, the cluster affinity concept of CAST is extended here into a new algorithm, PCS, to split the pre-clusters into purer clusters.

Before explaining the Pure Cluster Search (PCS) method used for splitting the pre-clusters, the CAST algorithm is first explained briefly.

The Cluster Affinity Search Technique (CAST) (Ben-Dor et al., 1999) is based on a graph-theoretic method and relies on the concept of a clique graph. The CAST clustering algorithm performs well in gene expression clustering (Ben-Dor et al., 1999) and has also been used for time-series mining (Lai et al., 2010). In the clique-graph view, each object is considered a vertex in the graph, and the similarity matrix, which is based on the distances between objects, defines the undirected edges of the graph. CAST builds clusters (sub-graphs) such that every object is highly similar to every other object in its sub-graph and dissimilar to every object outside it.

Although the number of clusters produced by CAST does not have to be predetermined, it requires a connectivity threshold defined by an input parameter, the so-called affinity threshold. This parameter is defined by the user and indirectly controls the size and number of clusters by determining the minimum similarity required between an object and a cluster for the object to be assigned to that cluster. The CAST algorithm is sensitive to this parameter, which is considered a shortcoming, because the size and number of the clusters produced by the algorithm are directly affected by it.

4.4.2.2 Pure Cluster Search (PCS)

PCS creates sub-clusters (from pre-clusters) sequentially with a dynamic affinity threshold. In this process, each sub-cluster starts from a single time-series and is gradually completed by adding new time-series to it based on the average similarity (affinity) between the unassigned time-series in the pre-cluster and the current sub-cluster members. Given a specific threshold value, the sub-cluster accepts high-affinity time-series. That is, an affinity threshold, α, is specified to determine what is considered significantly similar; this parameter controls the number and sizes of the produced clusters. After forming a sub-cluster, PCS deletes the low-affinity objects from it. This adding and removing is performed consecutively until no more changes occur in the sub-cluster.

Definition 4.4: Cluster affinity, the affinity of a time-series $F_x$ to a sub-cluster SC is defined as follows:

$$a(F_x) = \frac{1}{|SC|}\sum_{F_y \in SC} Sim(F_x, F_y) \qquad (4.14)$$

where Sim is the similarity between time-series $F_x$ and $F_y$, and $|SC|$ is the number of time-series in the sub-cluster. As mentioned, an affinity threshold, α, is defined to create sub-clusters of high-affinity time-series. Importantly, it is defined dynamically. One of the positive points of a data mining algorithm is a low number of parameters, or preferably being parameter-free: a parameter-free algorithm limits our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and lets the data itself speak to us. PCS is proposed as an algorithm which works without predetermined parameters. In this approach, the affinity threshold is calculated based on the affinity of each time-series in the pre-cluster. The affinity threshold, α, is calculated dynamically from the remaining time-series in the pre-cluster, i.e., the unassigned time-series $PC_U$, before constructing each new sub-cluster $SC_{new}$, as:

$$\alpha = \frac{1}{|PC_U|}\sum_{F_x \in PC_U} M(F_x) \qquad (4.15)$$

where

$$M(F_x) = \frac{1}{N}\sum_{F_y \in PC} Sim(F_x, F_y) \qquad (4.16)$$

is the mean of the similarities of time-series $F_x$ to the other time-series in the pre-cluster ($M$ is initialized once at the start of the algorithm), and $N$ is the number of all time-series in the pre-cluster. The time-series within a pre-cluster do not have a uniform affinity to each other, because pre-clusters are made based on an approximated distance measure (e.g., APXDIST) and low-resolution time-series (e.g., SAX). Equation 4.15 dynamically calculates a threshold based on the within-cluster variation of the time-series before each new sub-cluster is created. This value is used to distinguish the time-series which have not been clustered properly (or outliers) by putting them into new sub-clusters. As a result, the output of this algorithm is a set of refined clusters constructed by breaking down the pre-cluster. Figure 4.14 shows the pseudo code of PCS.

Input: Sim: an n-by-n similarity matrix
Output: C: a collection of sub-clusters
Method: PCS(Sim)
Initializations:
1. SC_new ← Ø /* the sub-cluster under construction */
2. U ← {F_1, ..., F_n} /* elements inside the pre-cluster */
3. M ← the average similarity between time-series in the pre-cluster
4. while (U ≠ Ø) do
5.   α ← (1/|U|) Σ_{F_x ∈ U} M(F_x) /* dynamic affinity threshold (Equation 4.15) */
6.   let F_u be an element with maximal affinity a(F_u) in U
7.   if (a(F_u) ≥ α) then /* F_u is of high affinity */
8.     SC_new ← SC_new ∪ {F_u} /* insert F_u into SC_new */
9.     U ← U \ {F_u} /* remove F_u from U */
10.    for all F_x in U and in SC_new do
11.      a(F_x) = Σ_{F_y ∈ SC_new} Sim(F_x, F_y) / |SC_new| /* update the affinity */
12.    end for
13.  else /* no high-affinity elements outside */
14.    let F_v be an element with minimal affinity a(F_v) in SC_new
15.    if (a(F_v) < α) then /* F_v is of low affinity */
16.      SC_new ← SC_new \ {F_v} /* remove F_v from SC_new */
17.      U ← U ∪ {F_v} /* insert F_v into U */
18.      for all F_x in U and in SC_new do
19.        a(F_x) = Σ_{F_y ∈ SC_new} Sim(F_x, F_y) / |SC_new| /* update the affinity */
20.      end for
21.    else /* SC_new is clean */
22.      C ← C ∪ SC_new /* close the cluster */
23.      SC_new ← Ø /* start a new cluster */
24.      a(·) ← 0 /* reset affinity */
25.    end if
26.  end if
27. end while
28. return the collection of sub-clusters, C

Figure 4.14: Using PCS for purifying pre-clusters
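The following Python sketch mirrors the PCS loop above (the seeding rule, the step budget that guards against oscillation, and the handling of leftovers as singletons are assumptions made so the sketch terminates cleanly):

import numpy as np

def pcs(sim, max_steps=None):
    """Split one pre-cluster into high-affinity sub-clusters (cf. Figure 4.14).

    sim: n-by-n symmetric similarity matrix (higher means more similar).
    Returns a list of sub-clusters, each a list of element indices.
    """
    n = sim.shape[0]
    steps = max_steps or 20 * n
    mean_sim = sim.mean(axis=1)            # M(F_x): mean similarity to the pre-cluster
    unassigned, clusters, current = set(range(n)), [], []
    while unassigned and steps > 0:
        steps -= 1
        alpha = np.mean([mean_sim[x] for x in unassigned])   # dynamic threshold (Eq. 4.15)
        if not current:                                      # seed a new sub-cluster
            seed = max(unassigned, key=lambda x: mean_sim[x])
            current.append(seed)
            unassigned.discard(seed)
            continue
        def aff(x):                                          # affinity to current (Eq. 4.14)
            return float(np.mean([sim[x, y] for y in current]))
        best = max(unassigned, key=aff)
        worst = min(current, key=aff)
        if aff(best) >= alpha:
            current.append(best)                             # add a high-affinity element
            unassigned.discard(best)
        elif aff(worst) < alpha and len(current) > 1:
            current.remove(worst)                            # drop a low-affinity element
            unassigned.add(worst)
        else:
            clusters.append(current)                         # sub-cluster is clean: close it
            current = []
    if current:
        clusters.append(current)
    clusters.extend([x] for x in unassigned)                 # leftovers become singletons
    return clusters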

4.4.3 Activity 3: Summarization (Making Prototypes)

One of the strengths of MTC is its flexibility in making different cluster shapes by choosing an arbitrary algorithm for merging. That is, given the sub-clusters and a similarity measure, many distance-based algorithms such as partitioning, hierarchical, or density-based clustering can be used for merging the sub-clusters in the third step. However, instead of applying the clustering algorithm to the entire data, only the prototype(s) of each sub-cluster participate in the merging process, which reduces the complexity of the process to a large extent. Moreover, prototypes can increase the accuracy to some extent: highly unintuitive results may otherwise be obtained because some distance measures are very sensitive to certain "distortions" in the data, whereas each prototype, being an average of several time-series, decreases the effect of distortions and outliers. Nevertheless, one of the controversial problems is defining the prototypes (centroid or representative) of the sub-clusters. The objective of this activity is finding the shape-based time-series average, which is an essential subroutine in MTC; the quality of the final clusters is highly dependent on the quality of averaging. Many attempts at defining an effective method for creating prototypes of time-series clusters were discussed in the literature review (see Section 2.6). In this study, single-representative and multi-representative solutions are used for prototyping the sub-clusters. The choice between single and multi-representation depends on the clustering scheme used in the third step and on the required precision of the results.

Solution 1) Single-representative: ED is used to find the average shape of the time-series; this is similar to finding centroids in the k-Means algorithm. Because the number of time-series in each sub-cluster is not very large and they are guaranteed to be similar in time, the averaging method is effective for making the prototypes.

Solution 2) Multi-representative: In order to define the multi-representative, the centroid of a sub-cluster is used together with some other time-series of the sub-cluster. Together they form a set of prototypes which effectively represents the whole sub-cluster's members. This approach is used when a clustering algorithm that generates arbitrary-shape clusters is used for merging the sub-clusters.

4.4.3.1 Single-representative

A single representative is defined for each sub-cluster to minimize a known criterion function, the Sum of Squared Error (SSE). Given a set of n time-series in a sub-cluster SC, they are represented by a time-series $R = \{r_1, \ldots, r_d\}$ that minimizes

$$SSE = \sum_{i=1}^{n} \sum_{t=1}^{d} \big(f_{it} - r_t\big)^2 \qquad (4.17)$$

where $F_i = \{f_{i1}, \ldots, f_{id}\}$ is a time-series in SC; the minimizing representative is the point-wise average, $r_t = \frac{1}{n}\sum_{i=1}^{n} f_{it}$. Figure 4.15 shows the representative of a sub-cluster made by the SSE technique. The total distance of all time-series in this sub-cluster to the representative is minimal.

Figure 4.15: Single representative of a time-series sub-cluster

4.4.3.2 Multiple-representative

In order to determine the representatives of the sub-clusters, the concept of representatives in the CURE algorithm (Guha et al., 1998) is employed. For this activity, multiple representatives are used instead of one representative to merge clusters. The multi-representative approach is chosen because it can better represent the shape of a sub-cluster, especially when the sub-clusters are big or the target merging scheme is an algorithm which generates arbitrary-shape clusters. For example, in Figure 4.16, the same sub-cluster as in Figure 4.15 is depicted with two representatives.


Figure 4.16: Multi-representative of a time-series sub-cluster

In this approach, a number of time-series between the centroid and the extreme time-series are chosen as representatives of a cluster. The representatives are well-scattered time-series in the cluster which are shrunk toward the centroid. The defined representatives should capture the shape and extent of the sub-cluster. The main reason for shrinking the representatives is to alleviate the effect of outliers: outliers generally have a greater distance from the centroid of the cluster and a smaller distance to the farthest time-series of the cluster. Shrinking the representatives toward the centroid by a fraction γ decreases the effect of outlier time-series in the merging step, as illustrated in Figure 4.17.

Figure 4.17: Multi-representative approach for time-series clustering

However, the multi-representative method used in this study for time-series clustering differs from the CURE approach in two aspects:

1) The centroid is used as one of the representatives (beside the well-scattered time-series), which helps to find better results.

2) A dynamic number of representatives is used, rather than the constant number of representatives per cluster used by the CURE algorithm.

In a cluster of time-series data, representing the cluster with the lowest number of representatives is obviously ideal. To find the representatives dynamically, given a set of time-series, a threshold is defined for the representatives as a minimum density, Ԑ, such that the distance of each time-series in the cluster from its corresponding representative is less than Ԑ. That is, representatives are added to the sub-cluster until it is guaranteed that every time-series is close enough (within Ԑ) to at least one representative of the sub-cluster. This finds the representatives based on the cluster's shape, not based on a fixed number of representatives. The representatives of the sub-clusters are calculated using the following algorithm (Figure 4.18):

Input: SC: all time-series in the sub-cluster
       Ԑ: minimum density
       γ: shrink factor
Output: R: a set of representatives
Method: Find_representatives (SC, Ԑ, γ)
1. R = {centroid(SC)} /* using Equation 4.17 */
2. if γ > 0 then /* using the multi-representative approach */
3.   max_dis = Max{distance(F_i, R): F_i ∈ SC}
4.   while max_dis > Ԑ
     /* add a representative: let F_far be the farthest object in SC from the set R */
5.     F_far = argmax{distance(F_y, R): F_y ∈ SC}
6.     SC = SC \ {F_far}
7.     F_far = (1 - γ) · F_far + γ · centroid(SC) /* shrink toward the centroid */
8.     R = R ∪ {F_far}
9.     max_dis = Max{distance(F_i, R): F_i ∈ SC}
10.  end while
11. end if
12. return R

Figure 4.18: Algorithm for determining representatives of sub-clusters
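A minimal Python sketch of this representative-selection procedure (Euclidean distance, the shrink formula of line 7, and the nearest-representative convention are assumptions consistent with the reconstruction above):

import numpy as np

def find_representatives(sc, eps, gamma):
    """Select a dynamic number of shrunken representatives for one sub-cluster.

    sc: array of shape (n_series, length); eps: minimum density; gamma: shrink factor.
    """
    sc = np.asarray(sc, dtype=float)
    centroid = sc.mean(axis=0)                   # single representative (Eq. 4.17)
    reps = [centroid]
    remaining = list(range(len(sc)))

    def dist_to_reps(i):                         # distance to the nearest representative
        return min(np.linalg.norm(sc[i] - r) for r in reps)

    if gamma > 0:
        while remaining and max(dist_to_reps(i) for i in remaining) > eps:
            far = max(remaining, key=dist_to_reps)              # farthest series from R
            remaining.remove(far)
            shrunk = (1 - gamma) * sc[far] + gamma * centroid   # shrink toward centroid
            reps.append(shrunk)
    return np.array(reps)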

4.5 Step 3: Merging

The output of the second step is a relatively small number of prototypes (in comparison to the original dataset) acting as representatives of the sub-clusters. Some sub-clusters are represented by only one representative, some by more than one. Now, the prototypes are combined to form the final clusters. Considering each prototype as one time-series, the merging process groups the prototypes. Different approaches (partitioning, hierarchical, etc.) can be adopted for merging the sub-clusters, i.e., for clustering the prototypes. Essentially, the choice of clustering algorithm depends both on the type of desired clusters (arbitrary or spherical) and on the particular purpose and application of the clustering (Warrenliao, 2005). If the cluster sizes are comparable, the clusters are hyper-ellipsoidal (globular or spherical); in this case, k-Medoids or a hierarchical algorithm with complete or average linkage is the best choice for clustering time-series, because they try to minimize the distortion in the data (Chaoji, Al Hasan, Salem, & Zaki, 2008). In contrast, in some datasets (such as image segmentation or spatial data mining) where clusters have different densities (Chaoji et al., 2008), algorithms for making arbitrary-shape clusters are used. Some conventional algorithms which try to find arbitrary clusters in datasets are DBSCAN, CURE, ROCK and CHAMELEON; the reader is referred to (Karypis et al., 1999) for comparisons and references.

Moreover, if the clusters are very close to each other, or are of different densities and sizes, different types of merging algorithms will result in different cluster structures; choosing the proper algorithm is therefore critical. MTC works with prototypes (which are very few in comparison to the whole data); as a result, different types of clustering algorithms can be adopted, and consequently different types of clusters can be made depending on the problem at hand. Figure 4.19 shows an overview of this process.

Figure 4.19: The workflow of the third step of MTC clustering

4.5.1 Activity 1: Distance Calculation

In the first two steps of MTC, time-series similar in time were grouped precisely. However, two time-series which are similar in shape may not be similar in time. Hence, in this activity, the similarity in shape of the time-series is computed. As explained in Section 2.5.2, Euclidean distance requires a one-to-one alignment between the two time-series, which is not desirable when one is looking for shape-based similarity between time-series that have discrepancies on the x-axis, as depicted in Figure 4.20.


Figure 4.20: Similarity in time between sub-clusters' prototypes

In contrast to similarity in time, DTW (Dynamic Time Warping) considers all possible mappings between the data points of two time-series and finds the smallest possible warping distance. As a result, in the third step of MTC, DTW is desirable because it can distinguish time-series which differ from the other time-series in the cluster in terms of similarity in shape. The accuracy of DTW compared with ED is evaluated in Section 5.3.3.1. Although DTW is computationally expensive, it is not applied to the whole dataset; it measures similarity only on a subset of the data (the prototypes), so it runs very fast. Moreover, it can be applied with constraints which decrease its complexity, as explained in Section 4.7. Figure 4.21 depicts the intuition of using DTW in the third step of MTC. As this figure shows, the similarity between the time-series inside the sub-clusters is computed based on similarity in time (within-similarity), while the similarity between the prototypes is calculated based on similarity in shape.


Figure 4.21: Intuition for using DTW to calculate similarity in shape between representatives of sub-clusters in the third step of MTC

Therefore, to calculate the distance between two sub-clusters, the DTW between their prototypes is calculated. Suppose $R_x = \{r_{x1}, \ldots, r_{xn}\}$ is the prototype of sub-cluster $SC_x$, calculated by Equation 4.17. Then, to compute the distance between the prototypes of $SC_x$ and $SC_y$, an $n \times n$ matrix M of the distances of all matching pairs is constructed, where $M(i, j)$ is the distance between $r_{xi}$ and $r_{yj}$, calculated by Equation 2.2. Given $P = \{p_1, \ldots, p_L\}$ as a set of warping paths, where each $p_l = (i_l, j_l)$ is a point that defines a traversal of matrix M, the DTW between two prototypes $R_x$ and $R_y$ is the warping path that minimizes the distance between $R_x$ and $R_y$:

$$DTW(R_x, R_y) = \min_{P} \sqrt{\sum_{l=1}^{L} M(i_l, j_l)} \qquad (4.18)$$

where $(i_1, j_1) = (1, 1)$ and $(i_L, j_L) = (n, n)$, and $i_l \le i_{l+1} \le i_l + 1$ and $j_l \le j_{l+1} \le j_l + 1$ for all $l < L$.
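A minimal dynamic-programming sketch of this computation in Python (the squared-difference local cost and the final square root mirror Equation 4.18; the function name is illustrative):

import numpy as np

def dtw(x, y):
    """DTW distance between two series via dynamic programming (cf. Eq. 4.18)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # extend the cheapest of the three allowed predecessor cells
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))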

4.5.2 Activity 2: Merging Sub-clusters

As mentioned, MTC covers different schemes of clustering. In the following sub-sections, three different approaches for merging sub-clusters are explained, which lead to the construction of various cluster structures.

4.5.2.1 Merging by Partitioning Clustering

For merging the sub-clusters, the mechanism of any partitioning clustering algorithm can be used, such as k-Medoids, k-Means, etc. The k-Means algorithm uses centroids as cluster representatives, while the k-Medoids algorithm takes its representatives from the original data. Unfortunately, the k-Means algorithm, which needs averaging for its prototypes, will sometimes fail to give correct results, especially when Dynamic Time Warping (DTW) is used as the distance measure for averaging the shape of time-series (Niennattrakul & Ratanamahatana, 2007). Therefore, the approach used in this thesis to define the representative for partitioning clustering is the medoid of each cluster. Accordingly, the medoid of a cluster $C_j$ is defined as:

$$medoid(C_j) = \underset{F_x \in C_j}{\arg\min} \sum_{F_y \in C_j} Dis(F_x, F_y) \qquad (4.19)$$

Thus a "medoid" is a time-series that best represents a set of prototypes, while those prototypes themselves represent more time-series (their sub-clusters). Although this is not a linear-time clustering algorithm (the initial construction of the pairwise distance matrix Dis requires $O(n^2)$ time, and the search for a new medoid in each iteration also takes $O(n^2)$ time), the number of objects n is the number of representatives, which is quite small, so it is practical for large datasets. Pseudocode for k-Medoids is given in Figure 4.22. The inputs of this algorithm are Dis, the n×n matrix of DTW distances between all pairs of prototype time-series, the number of medoids desired, K, and the initial medoids.

Input: Dis: the n×n distance matrix
       R: the set of representatives imported from the second step
       K: number of desired clusters
       Medoid: initial medoids
Output: C /* clusters of representatives */
Method: k-Medoids (Dis, K, Medoid, R)
1. while any medoid changes do
2.   for i ∈ {1 ... n} do /* n is the number of prototypes of sub-clusters */
3.     C(i) ← arg min_{j ∈ {1 ... K}} Dis(R_i, Medoid_j) /* find the closest medoid to R_i */
4.   end for
5.   for j ∈ {1 ... K} do
6.     Medoid_j ← arg min_{R_x: C(x)=j} Σ_{C(y)=j} Dis(R_x, R_y) /* most central member (Eq. 4.19) */
7.   end for
8. end while
9. return C

Figure 4.22: Pseudocode for the k-Medoids algorithm
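A compact Python sketch of this medoid-based merging, operating directly on a precomputed DTW distance matrix (the random initialization and the fixed iteration cap are assumptions made for the example):

import numpy as np

def k_medoids(dis, k, max_iter=100, seed=0):
    """Cluster n prototypes given an n-by-n distance matrix (cf. Figure 4.22)."""
    rng = np.random.default_rng(seed)
    medoids = [int(m) for m in rng.choice(len(dis), size=k, replace=False)]
    assign = None
    for _ in range(max_iter):
        assign = np.argmin(dis[:, medoids], axis=1)      # index of the nearest medoid
        new_medoids = []
        for j in range(k):
            members = np.where(assign == j)[0]
            if len(members) == 0:                        # keep an empty cluster's medoid
                new_medoids.append(medoids[j])
                continue
            # the most central member: minimal summed distance to the others (Eq. 4.19)
            within = dis[np.ix_(members, members)].sum(axis=1)
            new_medoids.append(int(members[np.argmin(within)]))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign, medoids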

Partitioning clustering of time-series data is very good for summarization or as a pre-processing phase for more complex data mining tasks. However, when clustering is used as a data-understanding tool, finding hierarchical clusters is more meaningful and intuitive (because of the ability to show the results as a tree of clusters). In the following (Section 4.5.2.2), merging of sub-clusters using hierarchical clustering is explained.

4.5.2.2 Merging by Hierarchical Clustering

As discussed in Section 2.8.1, the main problem of using hierarchical methods for time-series clustering is their inability to scale well: the time complexity of hierarchical algorithms is at least $O(N^2 d)$ (where N is the total number of instances and d is the dimensionality of the time-series), which is non-linear in the number of time-series. Moreover, clustering a large number of time-series with a hierarchical algorithm is also characterized by huge I/O costs. However, in MTC this problem no longer exists, because the number of time-series in the third step is small (equal to the number of representatives), and therefore the advantages of hierarchical clustering come into account, including:

 Versatility: covering different linkage methods enables MTC with the hierarchical scheme to produce differently shaped clusters. The single-link method, for example, maintains good performance on datasets containing non-isotropic clusters, including well separated, chain-like and concentric clusters (see Section 5.3.3.2).

 Multiple partitions: hierarchical methods produce not one partition but multiple nested partitions, which allow different users to choose different partitions according to the desired similarity level. The hierarchical partition is presented using a dendrogram. The dendrogram creates a decomposition of the given data and provides an intuitive view for users, which is very useful especially for large datasets. For example, MTC can produce an intuitive dendrogram of companies (based on their stock), which is very desirable, as depicted in Figure 6.25. It is important to note that in MTC, unlike other methods, not all time-series contribute to constructing the dendrogram; instead, only the prototypes are used.

Considering the advantages of hierarchical clustering, any kind of hierarchical clustering can be used for clustering the prototypes (e.g., single linkage, complete linkage, etc.). These approaches are very simple, and given the representatives, they can easily be applied to make the clusters. However, in the following, to show the scalability of MTC, a hierarchy-based algorithm is proposed for clustering the representatives, which works on non-spherical data. This scheme works with sub-clusters represented by multiple representatives to generate a tree of nested and arbitrary-shape clusters.

4.5.2.3 Merging for Arbitrary Shape Clusters

A good clustering algorithm should identify arbitrarily shaped clusters (Gordon, 1999). In arbitrary-shape clustering, clusters should be separated by distances larger than their intra-cluster distances. Even well-separated centroids in partitioning clustering do not necessarily imply well-separated clusters. Now the question is whether time-series clusters are arbitrary or spherical. The problem of centroid-based clustering also occurs for time-series data, where a time-series in a cluster may be closer to the centroid of another cluster than to that of its natural cluster. This happens especially when there are clusters of different sizes, sparse time-series, or outlier time-series. Since the sub-cluster sizes are usually different, their representatives are also sparse. This is exactly what happens with the 2-dimensional objects seen in Figure 4.23. As the figure shows, in some datasets the sub-clusters (their representatives) of a natural cluster may be very far apart from each other (or not spherical around the centroid), which leads to splitting the cluster. Consequently, partitioning clustering may perform poorly in these situations.

Figure 4.23: Splitting of a natural cluster due to incorrect spherical clustering

Experimental results (see Section 5.3.3) show that some time-series datasets have arbitrary shape due to their diverse nature (a characteristic of spatial objects). Beyond that, time-series datasets include outlier time-series: time-series which do not belong to any cluster and should be discarded during the clustering process. A clustering algorithm should be capable of handling noise and outliers; in general, arbitrary-shape clustering algorithms handle outliers.

Arbitrary-shape clustering algorithms have been proposed in several works. The most successful among them are spectral (Shi, 2000), density-based, e.g., DBSCAN (Ester et al., 1996), and nearest-neighbour-graph-based, e.g., Chameleon (Karypis et al., 1999), approaches. The problem of these algorithms is their poor scalability (Chaoji et al., 2008). As a result, these existing approaches cannot be applied directly to time-series datasets, because such datasets are very large and high-dimensional.

In this activity, it is shown that MTC can make arbitrary-shape clusters using the multi-representative approach. In this approach, the finely clustered time-series (sub-clusters) are merged using agglomerative hierarchical clustering. As mentioned, each sub-cluster can be represented by multiple representatives. The most similar representatives are then merged. Each set of representatives of a sub-cluster is treated as members of a formed cluster throughout the merging process. The merging activity is then performed in such a way that the prototypes of a sub-cluster remain in the same cluster in the final clustering.

There are essentially different approaches to finding the most similar clusters and merging them: object closeness (the same as single, complete, or average linkage in hierarchical clustering), centroid closeness such as WPGMA and WPGMC (Sneath & Sokal, 1973), relative inter-connectivity, or relative closeness such as in CHAMELEON (Karypis et al., 1999). In this algorithm, object closeness is used to find the most similar clusters. Figure 4.24 depicts the pseudocode for merging sub-clusters of time-series using the closeness of each sub-cluster's representatives.

Input: SC = {SC_1, ..., SC_m}: set of sub-clusters, each with a set of representatives
       K: number of desired clusters
Output: SC /* merged sub-clusters */
Method: Cluster (SC, K)
1. for all representatives r_u ∈ SC_i and r_v ∈ SC_j do
2.   if i = j then /* representatives belong to the same sub-cluster */
3.     Dis(r_u, r_v) = 0 /* a distance matrix for representatives */
4.   else
5.     Dis(r_u, r_v) = DTW(r_u, r_v) /* calculate the distance between all representatives and store the results in the distance matrix */
6.   end if
7. end for
8. while (size(SC) > K) do
9.   [i, j] = arg min Dis /* find the closest sub-clusters by finding the closest representatives of mutually exclusive sub-clusters */
10.  SC_i = SC_i ∪ SC_j /* merge cluster j into cluster i */
11.  set Dis(r_u, r_v) = 0 for all representatives r_u, r_v of the merged cluster
12.  /* update the distance matrix by removing the distances among representatives of the merged clusters */
13.  SC = SC \ {SC_j} /* delete SC_j */
14. end while
15. return SC

Figure 4.24: An outline of the clustering algorithm for merging sub-clusters with multiple representatives

As mentioned, the output of the second step of the algorithm is a set of sub-clusters, SC = {SC_1, ..., SC_m}, where the number of sub-clusters is relatively small (in comparison with the original data) and each one is represented by a set of prototypes. During the third step of the algorithm, first a similarity measure for each pair of representatives is computed, and the distances between representatives that belong to the same sub-cluster are initialized to zero (see lines 1-7 in Figure 4.24). This prevents the representatives of the same sub-cluster from being separated into different clusters during the merge process. The similarity between clusters is then used to merge the two clusters with the minimum distance. The similarity among clusters can be defined based on any clustering algorithm that can use a similarity function to merge clusters. For example, the single-link or complete-link method can be applied for the sake of simplicity: the single (complete) linkage algorithm measures the similarity between two clusters as the similarity of the closest (farthest) pair of time-series belonging to different clusters. Merging the clusters means combining their representatives. The algorithm repeats the merging process until all the objects are eventually merged into one cluster (or until the desired number of clusters is obtained, or the distance between the two closest clusters exceeds a certain threshold). As a result, a hierarchy of sub-clusters is generated.

This algorithm finds arbitrary-shape clusters and is robust to outlier time-series. As mentioned, outliers are time-series which are not contained in any cluster and should be discarded during the mining process. After constructing the pre-clusters in the first step and decomposing them in the second step, some clusters with few time-series are observed (beside rather large clusters). These small clusters are potential outliers, or parts of a cluster which were grouped incorrectly in the first step and then detected and separated as new clusters in the second step. Considering the characteristics of outliers, they do not tend to merge with other clusters because of their larger distance from other time-series. Hence, the small clusters which have not merged by the last iterations of merging can be discarded.

4.5.3 Activity 3: Mapping

The result of the merging process is a set of clusters constructed from the prototypes of the sub-clusters. Thus, a mapping activity is carried out to assign the original time-series to their corresponding prototypes: all time-series of each sub-cluster are assigned to the cluster to which the corresponding prototype was assigned.

4.6 MTC Algorithm

In the following, the outline of all the steps of MTC is shown in Figure 4.25.

Input:
D: a time-series dataset /* D = {F_1, ..., F_n} */
WL: SAX window level
WS: SAX word size
K: cluster count
Output: C
Method: MTC (D, WL, WS, K)
/* Step 1: time-series pre-clustering */
1. D = z-norm(D) /* z-normalization of all time-series */
2. for each time-series F_i
3.   F̂_i = SAX_Transform(F_i, WL, WS)
4.   D̂ = D̂ ∪ F̂_i
5. end for
6. Apx_sim[n][n] = APXDIST(D̂) /* build the similarity array */
7. pre-cluster[1 to n] = Ek-Modes(Apx_sim[n][n], K) /* see Figure 4.10 */
/* Step 2: cluster purifying */
8. for i = 1 to K /* K is the group count of pre-clusters */
9.   ts_cnt = the time-series count in pre-cluster i
10.  TS_TMP = D(pre-cluster[1 to n] == i)
11.  sim[ts_cnt][ts_cnt] = 1 - ED(TS_TMP) /* similarity matrix of all time-series in pre-cluster i by ED */
12.  sub-cluster[1 to ts_cnt] = PCS(sim) /* see Figure 4.14 */
13.  rep[i] = Find_representatives(sub-cluster) /* see Figure 4.18 */
14. end for
/* Step 3: merging */
15. C_rep = Cluster(rep, K) /* clustering of representatives, using an arbitrary scheme */
16. C = Replace_mem(C_rep, D) /* mapping process */
17. validate the MTC clustering results
18. return C

Figure 4.25: Pseudocode of the MTC model

4.7 Interactive MTC (IMTC)

MTC runs on datasets as a batch clustering algorithm focusing on the accuracy of the final clusters. However, when a user needs to see a preliminary or intermediate result (or cannot wait until the batch finishes), MTC can be run as an interactive model. That is, the algorithm starts running, generates a preliminary (not accurate) result, and keeps improving it gradually until the user interrupts the process. At the moment of interruption, the best result so far is generated and shown to the user. Figure 4.26 shows the steps of IMTC. As the diagram shows, the output of the third step is passed back to the second step for further revision.


Figure 4.26: An overview of IMTC

Exploiting the multi-step characteristic of MTC, IMTC can be carried out as an interactive clustering algorithm. In this approach, an initial inaccurate result is constructed using low-resolution time-series (in step 2) and a very narrow DTW constraint (in step 3). Then, step 2 and step 3 are performed alternately. However, in step 2, only one cluster is a candidate to be refined (dispersed) in each iteration. Selecting the candidate cluster is based on the within-cluster variance (a criterion defined by the mean and standard deviation of the affinities of the clusters). This process continues until all pre-clusters have been purified at least once. When purifying the candidate cluster, the purification is performed more precisely than in the previous iteration. That is, the measurement improves for the candidate cluster in each iteration because, first, the resolution of the time-series in the candidate cluster is increased by using a lower compression ratio; second, the DTW constraint used in the third step is widened in each iteration. The whole process resembles split-and-merge while the accuracy of the calculations increases. This process continues until, first, all low-resolution time-series have been replaced with high-resolution time-series (or with the original time-series); second, the distances between all pairs are calculated by complete DTW (without constraint); and finally, all pre-clusters have been checked by the purifying process at least once. This results in an improvement of the clustering quality over time. Figure 4.27 shows the design of the overall process.

Figure 4.27: The activities of the IMTC model

As mentioned, the DTW distance measure is used in the third step of IMTC, as in the MTC model. However, in IMTC, DTW is applied with global constraints (Itakura, 1975; Sakoe & Chiba, 1978), which decreases its complexity. As mentioned in Section 2.5.3, DTW finds the warping path that minimizes the total distance between two series X and Y. Global constraints determine the amount of warping allowed in finding the minimum path. The width of the warping window is denoted by the parameter w. The Euclidean distance can be seen as an upper bound of DTW; that is, the Euclidean distance is a special case of DTW with a very narrow constraint, where the warping path in the DTW calculation is exactly the diagonal of the matrix. The constraint can be widened, which determines how far the warping path may stray from the diagonal. The widest constraint (i.e., no constraint) corresponds to complete DTW, i.e., w = d, where d is the length of the time-series. Constraints give DTW the property that the similarity in shape between two time-series can be calculated approximately with low complexity. This property of DTW, i.e., applying global constraints (Itakura, 1975; Sakoe & Chiba, 1978), is used in IMTC to calculate the distance between time-series cheaply at first and then improve it. That is, while MTC is running as an interactive clustering algorithm, DTW with a very narrow constraint (effectively the Euclidean distance) is used first, and then the accuracy of the metric is gradually increased by widening the constraint towards complete DTW, i.e., w = d (DTW at 100%).
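As a concrete illustration of the constrained DTW described above, here is a minimal sketch of DTW with a Sakoe-Chiba band; a small band half-width w keeps the path near the diagonal (Euclidean-like for equal-length series), while w equal to the series length gives complete DTW. This is the standard textbook formulation, not necessarily the exact implementation used in MTC/IMTC.

    import numpy as np

    def dtw(x, y, w):
        # DTW with a Sakoe-Chiba band of half-width w. With equal lengths,
        # w = 0 forces the diagonal path (Euclidean distance); w >= len(x)
        # gives complete, unconstrained DTW.
        n, m = len(x), len(y)
        w = max(w, abs(n - m))            # band must at least cover |n - m|
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(max(1, i - w), min(m, i + w) + 1):
                cost = (x[i - 1] - y[j - 1]) ** 2        # squared local distance
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return np.sqrt(D[n, m])

For example, dtw(x, y, w=int(0.1 * len(x))) corresponds to DTW(10%) in the warping-window notation used in Section 5.3.3.1.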

4.8 Chapter Summary

In this chapter, the multi-step clustering model proposed in the previous chapter was explained and designed in detail. In the pre-clustering step, the dimensionality of the time-series data was reduced by the SAX technique. The SAX representation was used to cluster the low-resolution time-series, which can fit in memory, rather than the original (raw) time-series dataset. Then, a new distance measure compatible with the SAX representation (APXDIST) was developed to construct the pre-clusters. This approach is accurate for clustering because, by definition, it considers the distance between neighbouring symbols as a non-zero value and does not ignore the maxima and minima of the time-series. Subsequently, a new clustering algorithm (Ek-Modes) was proposed for clustering the approximated time-series. The output of this step was the pre-clusters, which were passed to the second step. The pre-clusters were then purified in the second step by PCS (Purify Cluster Search). As mentioned, an affinity threshold was defined to create sub-clusters of high-affinity time-series; importantly, it was defined dynamically. The output of the second step was a set of sub-clusters represented by their prototypes. That is, instead of applying the clustering algorithm to the entire data, only the prototype(s) of each sub-cluster participate in the merging process, which reduces the complexity of the process to a great extent. Subsequently, in the third step, depending on the problem at hand, three clustering schemes were proposed to merge the sub-clusters. Finally, IMTC, the interactive version of the multi-step model, was explained in detail. It was explained that IMTC is advantageous when a user needs to see preliminary or intermediate results.

5.0 EXPERIMENTAL RESULTS AND ANALYSIS

5.1 Introduction

In this chapter, the datasets used for the evaluation of the proposed models are introduced. Then, all the methods and algorithms designed to achieve the objectives of each step are discussed experimentally. The performance of APXDIST and Ek-Modes is shown on different datasets. Then, PCS is applied to a range of datasets to show the purity obtained in the second step of MTC. The observations are analysed to show how the results are generated using MTC and how these methods increase the accuracy of the final clusters. Details of each dataset and the generating procedure of each synthetic dataset are explained in Section 5.2. Each step of the MTC model is analysed in Section 5.3. Then, the final results obtained by applying MTC to the datasets are reported in Section 5.4. The final results are discussed extensively in the next chapter.

5.2 Datasets

The proposed model is evaluated with 21 different datasets of various domains and sizes. Nineteen of these datasets are from the UCR Time-series Data Mining Archive (Keogh & Folias, 2002). The number of time-series in these datasets ranges from 28 to over 6,000, and their exact sizes are indicated in Keogh & Folias (2002). This collection is chosen because it covers various numbers of clusters, different cluster shapes and densities, contains noise, and has been used as a benchmark in many articles in the literature. Each dataset has two sets in the repository, namely TRAIN and TEST. In this study, the TEST set is used because it includes large datasets, and the TRAIN set is used to visualize the results for the sake of simplicity. Among the datasets used in this thesis, those containing 1,000 to 12,000 time-series (each of length 100 to 700 data points) are considered for evaluating MTC on large time-series datasets. These datasets are considered large because, for example, clustering one of the largest datasets in the literature, which includes 9,236 time-series, takes about 127 days using a batch algorithm, which is a severe bottleneck in the clustering of time-series (Q. Zhu, Rakthanmanon, Batista, & Keogh, 2012). However, some small sets are also used for visualization purposes or to show the behaviour of the proposed model (MTC) on small datasets. All 19 of these datasets have class labels and can be used to evaluate MTC using external indices. Moreover, two further datasets are chosen from real-world problems, namely the Stock Exchange of Malaysia (KLSE) and the account balances of customers of a Malaysian bank (BTD), to show the application of the proposed model in different domains. Class labels for these datasets are unavailable, so an internal index is used to evaluate the accuracy of the clusters.

Table 5.1: Number of clusters, number of instances, and length of the time-series in each dataset

No.  Dataset            Clusters  Instances           Length  Class label  Type
1    50words            50        455                 270     Yes          Real-world
2    Adiac              37        391                 176     Yes          Real-world
3    CBF                3         900 (up to 6000)    128     Yes          Synthetic
4    Coffee             2         28                  286     Yes          Real-world
5    ECG200             2         100                 96      Yes          Real-world
6    FaceAll            14        1690                131     Yes          Real-world
7    FaceFour           4         88                  350     Yes          Real-world
8    FISH               7         175                 463     Yes          Real-world
9    Gun_Point          2         150                 150     Yes          Real-world
10   Lighting2          2         61                  637     Yes          Real-world
11   Lighting7          7         73                  319     Yes          Real-world
12   OliveOil           4         30                  570     Yes          Real-world
13   OSULeaf            6         242                 427     Yes          Real-world
14   SwedishLeaf        15        625                 128     Yes          Real-world
15   synthetic_control  6         300 (up to 12000)   60      Yes          Synthetic
16   Trace              4         100                 275     Yes          Real-world
17   Two_Patterns       4         4000                128     Yes          Real-world
18   Wafer              2         6164                152     Yes          Real-world
19   yoga               2         3000                426     Yes          Real-world
20   KLSE               -         870                 242     No           Real-world
21   BTD                -         6802                360     No           Real-world

Some of these datasets are related to real-world problems, and some are synthetic. For the synthetic datasets, up to 10,000 instances are generated for the experiments on large datasets. Table 5.1 lists the datasets used in the experiments, so the results can be compared across different algorithms.

5.2.1 Real-world Dataset

The real-world datasets selected from the UCR repository have been tested for clustering by other researchers and used as benchmarks for comparison (Keogh & Kasetty, 2003). To illustrate their characteristics, some of these datasets are described in the following subsections. The Leaf, Face, and Gun datasets are multimedia data transformed into time-series (Ratanamahatana, 2005). ECG is from the healthcare domain and relates to electrocardiograms of patients.

Leaf Dataset: The Leaf dataset contains leaf images of six different leaf species (classes), where the raw image shapes are transformed into time-series data by image processing techniques, as illustrated in Figure 5.1. This dataset holds 242 time-series of length 427 data points. Classes 1 through 6 contain 34, 29, 33, 53, 36, and 38 instances, respectively. Figure 5.2 shows some examples of the data from each class.

Figure 5.1: Transformation of the raw image shape of a leaf to time-series data (Ratanamahatana & Niennattrakul, 2006)


Figure 5.2: Three sample time-series of the six classes (six species) of the Leaf dataset

Face Dataset: The Face dataset includes different facial expressions of four head profiles (related to four persons). This dataset contains 88 time-series of head profiles. The length of each time-series is originally between 107 and 240 data points, as illustrated in Figure 5.3; all time-series in this dataset are normalized to the same length of 350 data points. Three examples of each class of face profile (related to the four individuals) are depicted in Figure 5.4. This dataset has also been used for multimedia clustering (Ratanamahatana & Niennattrakul, 2006).

Figure 5.3: Transformation of the image of a head profile to time-series data (Ratanamahatana & Niennattrakul, 2006)


Figure 5.4: Examples of four different faces

ECG Dataset: The ECG dataset includes time-series of two different heart-pulse classes, i.e., normal and abnormal. Together, the two classes include 100 instances of length 96 data points, which have been normalized in this experiment. This dataset is also used in Ratanamahatana & Niennattrakul (2006). Figure 5.5 shows some examples of the data.


Figure 5.5: Three examples of the two classes of the ECG dataset (left: 'normal', right: 'abnormal')

Gun Dataset: The Gun dataset consists of two classes of time-series obtained from the video-surveillance domain by plotting the movement of a hand, with and without a gun, in every video frame using image processing techniques (Ratanamahatana & Keogh, 2004b). The dataset includes 150 instances, and the length of each instance is 150 data points. Figure 5.6 shows some examples of the two-class Gun dataset.


Figure 5.6: Three sample time-series of the two classes of the Gun dataset

5.2.2 Synthetic Datasets

In the following subsections, the two synthetic datasets used in the experiments are introduced. These datasets are used to show the scalability of the system on large datasets.

Cylinder-Bell-Funnel (CBF): The well-known 3-class CBF dataset contains 900 instances of length 128 data points. CBF is a synthetic (artificial) dataset with the properties of the temporal domain; it was originally proposed by Saito (2000) and has since been used in many works (J. Lin et al., 2007; H. Zhang et al., 2006; X. Zhang et al., 2011). It is one of the most commonly used datasets for time-series classification and clustering experiments. The dataset includes three types of time-series: cylinder (c), bell (b), and funnel (f). In order to show the scalability of MTC, different cardinalities of CBF are generated and used in this study. The instances are generated using random functions as follows (H. Zhang et al., 2006):

    c(t) = (6 + η) · χ[a,b](t) + ε(t)                                      (5.1)

    b(t) = (6 + η) · χ[a,b](t) · (t − a)/(b − a) + ε(t)  and
    f(t) = (6 + η) · χ[a,b](t) · (b − t)/(b − a) + ε(t)                    (5.2)

where t = 1, ..., 128 and χ[a,b](t) is the characteristic function of the interval [a, b] (1 for a ≤ t ≤ b and 0 otherwise). η is drawn from the standard normal distribution N(0,1) and causes random amplitude variation in the time-series; random noise is inserted into the time-series by ε(t), which is also drawn from N(0,1). The value of a is drawn uniformly from the range [16, 32], and (b − a) is an integer value drawn uniformly from the range [32, 96]. Since b − a can vary from 32 to 96, there is significant temporal variation in the duration of the events. Moreover, the start of the events also varies because a varies from 16 to 32.
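Based on the equations above, a generator for the three CBF classes can be sketched as follows. It follows the standard published definition and is offered only to make the random components (η, ε(t), a, b) explicit; it is not the thesis's generation code.

    import numpy as np

    def cbf(kind, length=128, rng=np.random.default_rng()):
        # Generate one CBF series ('c', 'b', or 'f') per the standard definition.
        t = np.arange(1, length + 1)
        a = rng.uniform(16, 32)
        b = a + rng.integers(32, 97)               # b - a is uniform on [32, 96]
        eta, eps = rng.standard_normal(), rng.standard_normal(length)
        chi = ((t >= a) & (t <= b)).astype(float)  # characteristic function of [a, b]
        if kind == 'c':                            # cylinder: flat plateau
            shape = chi
        elif kind == 'b':                          # bell: increasing ramp
            shape = chi * (t - a) / (b - a)
        else:                                      # funnel: decreasing ramp
            shape = chi * (b - t) / (b - a)
        return (6 + eta) * shape + eps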

Examples of CBF time-series are shown in Figure 5.7.

Figure 5.7: Four samples of each class of the Cylinder, Bell, and Funnel (CBF) dataset

Control Chart (CC): The Control Chart time-series (CC) dataset is a synthetic dataset proposed by Alcock & Manolopoulos (1999) and has been used in much research, including classification (Geurts, 2001) and clustering (J. Lin, Keogh, Lonardi, et al., 2003; J. Lin et al., 2007; X. Zhang et al., 2011). This dataset has six classes and 300 time-series. Figure 5.8 depicts three sample time-series of each class, where each series y(t) is produced by one of the six patterns:

    y(t) = m + r·s                          (normal)
    y(t) = m + r·s + x·sin(2πt/T)           (cyclic)
    y(t) = m + r·s + g·t                    (increasing trend)            (5.3)
    y(t) = m + r·s − g·t                    (decreasing trend)
    y(t) = m + r·s + K(t)·x                 (upward shift)
    y(t) = m + r·s − K(t)·x                 (downward shift)

where m = 30 and s = 2. r, x, and g are uniform random values in the ranges [-3, 3], [7.5, 20], and [0.2, 0.5], respectively; T is the period of the cycle. K(t) equals 1 for t > a (and 0 otherwise), where a is a uniform random value in the range [20, 40].
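A sketch of the corresponding generator follows. The ranges for m, s, r, x, g, and a are those stated above; the cycle period T for the cyclic class is not given in the text, so the range [10, 15] used here is an assumption.

    import numpy as np

    def control_chart(pattern, length=60, rng=np.random.default_rng()):
        # Generate one Control Chart series following the six patterns above.
        t = np.arange(1, length + 1)
        m, s = 30.0, 2.0
        r = rng.uniform(-3, 3, size=length)       # per-point noise term
        x = rng.uniform(7.5, 20)
        g = rng.uniform(0.2, 0.5)
        a = rng.uniform(20, 40)
        K = (t > a).astype(float)                 # step function: 1 for t > a
        base = m + r * s
        if pattern == 'normal':
            return base
        if pattern == 'cyclic':
            T = rng.uniform(10, 15)               # assumed period range
            return base + x * np.sin(2 * np.pi * t / T)
        if pattern == 'increasing':
            return base + g * t
        if pattern == 'decreasing':
            return base - g * t
        if pattern == 'up_shift':
            return base + K * x
        return base - K * x                       # down_shift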


Figure 5.8: Three sample time-series of each class of CC

Moreover, different cardinalities of this dataset are used in IMTC; up to 10,000 instances are created for the experiments.

5.3 Analyses of Methods

Generally, in the clustering of time-series data (especially for large datasets), several factors are important, including the dataset type, the clustering algorithm, the distance measure, and the dimensionality reduction approach (representation method). For example, the time-series in different datasets may be long, short, frequent, noisy, of equal length, etc., and different types of clustering algorithms can be used, such as partitioning, density-based, or graph-based. Additionally, many dimensionality reduction approaches and distance measures have been developed which can be applied to reduce the dimensionality of the data and calculate the similarity of the time-series. Moreover, for each factor there are parameters to be set which can affect the final clustering results, for example, threshold values, the number of iterations, the alphabet size, and the number of segments. As explained in the research methodology (Chapter 3.0), different methods are utilized in each step of MTC. Therefore, in the following subsections, the various algorithms, dimensionality reduction approaches, and distance measures used in each step of MTC are evaluated and discussed by calculating the quality of clustering across different datasets. All methods are analysed step by step.

5.3.1 Step 1: Pre-clustering (Approximate Clustering)

In the first step of MTC, SAX is used as the representation. Then, a new distance measure (APXDIST) is introduced to calculate the distance between dimensionality-reduced time-series, and the Ek-Modes algorithm is used to perform the clustering. In the following subsections, all three approaches are discussed.

5.3.1.1 Representation Method (SAX)

Ratanamahatana (2005) discussed the quality of time-series clustering and stated that the representation method is a significant component in the accuracy of clustering. Hence, it is shown how the dimensionality reduction of time-series affects the quality by comparing the quality of algorithms with and without dimensionality reduction. The following section indicates the problem of overlooking data by showing the low accuracy of systems which use dimensionality-reduced time-series and conventional clustering algorithms. Moreover, the impact of dimensionality reduction on the accuracy of time-series clustering is described by the following experiments.

5.3.1.1.1 Overlooking of Data in Clustering

As mentioned in Section 1.2, once a dataset is very large, the complexity is very high and the data cannot fit in main memory. In this case, sampling or dimensionality reduction is a common solution. That is, researchers reduce the data, sometimes to a very low resolution, in order to overcome this problem. However, the cost of sampling and data reduction is high because data are missed out. In other words, it leads to the overlooking of data and, as a result, decreases the quality of clustering.

As a scenario, consider a dataset consisting of time-series which represent the average daily account balance of a bank's customers. The annual or even monthly clustering of customers based on their transactions reveals similar customers (clusters of the same kind of customers). Now, if two customers have the same monthly or yearly balances, one may put them into the same cluster. However, the values of the transactions in each day, or the variation of the transactions during the day, may be very dissimilar for these two customers. One of the daily transaction patterns may even involve fraud or money laundering, which would not be discovered using approximated time-series data. As a result, overlooking the data may lead to losing valuable information. Here, two examples of missing data in the representation process are shown.

Example 1: Figure 5.9 depicts a special case where most of the peaks in a time-series are lost during SAX representation. Note that SAX is one of the best existing representation methods, as explained in the literature review (Section 2.4.4); nevertheless, this loss is unavoidable.

Figure 5.9: Overlooking of peaks in a time-series in the SAX transformation process

Example 2: Another example relates to the identical representation of three different time-series from a single cluster, depicted in Figure 5.10. It clearly points out the problem of overlooking data in the representation process (while the SAX parameters are the same for all three time-series), which leads to the construction of inaccurate clusters. It clarifies how dimensionality reduction may lead to putting different time-series in the same cluster, or how it may lose many important peaks.

Figure 5.10: Three different time-series with the same SAX representation fall into one cluster

To study the overlooking of data experimentally, the average clustering quality on SAX-represented data versus raw time-series is calculated using different algorithms (k-Medoids and hierarchical clustering with single and average linkage). Figure 5.11 shows how dimensionality reduction affects the quality.


Figure 5.11: Quality of clustering of time-series represented by SAX compared with raw time-series data

The quality of the algorithms shown in Figure 5.11 is based on the assumption that the highest resolution of the time-series generates the best answer. Although this chart does not imply that the quality of clustering using raw time-series is always better than using dimensionality-reduced data, it shows how different the results can be. That is, if the raw data is considered the highest resolution of the data, the quality decreases even when one of the best representation approaches, SAX, is used. As observed in Figure 5.11, the average accuracy of all algorithms is around 70%. Therefore, it can be concluded that the quality of the clusters decreases by around 30% on average when SAX is used as the representation, in comparison with using raw data.

To make this more intuitive, the average quality of clustering on all datasets against the ground truth is depicted in Figure 5.12. In this experiment, MINDIST (J. Lin et al., 2007) with different compression-ratio values of the SAX representation is considered, i.e., SAX4, SAX6, and SAX8 (see Section 5.3.1.1.2 for more details about the compression-ratio parameter).


Figure 5.12: Average quality of clustering with SAX-represented data compared with raw time-series across all datasets

As the above result shows, first, using a dimensionality reduction method usually decreases the quality of clustering (in comparison with raw time-series) for many datasets (e.g., Adiac, Coffee, and SwedishLeaf). Second, using low-resolution time-series does not necessarily decrease the quality for all datasets (e.g., Lighting2 and CBF). Instead, the quality increases in some datasets because shifts, outliers, and noise are handled in the representation process, as has also been mentioned in other studies (Bagnall & Janacek, 2005). Finally, different compression-ratios of SAX generate clusters of different quality for each dataset. In some datasets, the accuracy of the clustering improves as the compression-ratio increases (e.g., CBF), and in others it gets worse (e.g., SwedishLeaf). Therefore, it can be concluded that the result of clustering approximated time-series is not precise (or at least not reliable); it sometimes leads to worse accuracy (which is expected for most representation methods) and occasionally to better quality.

5.3.1.1.2 SAX Parameter Setting

One of the problems in the representation of data using different approaches is determining their input parameters. Generally, removing parameters from a data mining task is often a good thing (Keogh, Lonardi, et al., 2004). As mentioned, in the first step of the proposed model (MTC), SAX is used as the representation method. However, SAX also confronts this issue to some extent because it has some parameters to be set. In the following sections, the effect of its parameters, i.e., word size and alphabet size, on clustering quality is investigated.

SAX attacks the problem of high dimensionality from two different sides. It approximates a time-series in terms of length by introducing segments (word size), and in terms of amplitude by defining the alphabet size. Hence, it needs two parameters: one controlling the granularity (word size, i.e., the number of segments) and another controlling the approximation of the elements (alphabet size). Determining a universally good trade-off for the SAX parameters is not possible because of their dependency on the dataset. Accordingly, Keogh, Lonardi, et al. (2004) suggest an empirical approach to determine the best values for these parameters.

The alphabet size (denoted 'a') is discussed in several studies (J. Lin, Keogh, Lonardi, et al., 2003; J. Lin et al., 2007), where the authors try to find the best distance measure based on its tightness to the Euclidean distance. Their experimental results show that the alphabet size is not very critical (which is a strength of SAX); however, they suggest choosing an alphabet size in the range of 5 to 8.

Word size, or the number of segments, is, as mentioned, the other SAX parameter. It determines the compression ratio of the representation, which gives the user the chance to attain the ideal compression/fidelity trade-off for a particular application. Given a time-series T = {t_1, t_2, ..., t_T}, the compression ratio is calculated by:

    CompressionRatio = T / nSeg                                        (5.4)

where T is the length of the time-series and nSeg indicates the number of segments. In this study, different compression-ratios from 2 to 8 are used. For the sake of simplicity, hereafter the different compression-ratios are denoted SAX2, SAX3, ..., SAX8.
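For reference, a minimal SAX transform is sketched below: z-normalization, PAA with nSeg segments, then quantization against Gaussian breakpoints. This is the standard method of Lin et al., shown only to make the word-size and alphabet-size parameters concrete; it is not the thesis's implementation. For example, sax(x, n_seg=len(x) // 4, alphabet_size=6) corresponds to SAX4 with a = 6 in the notation above.

    import numpy as np
    from statistics import NormalDist

    def sax(series, n_seg, alphabet_size):
        x = np.asarray(series, dtype=float)
        x = (x - x.mean()) / (x.std() or 1.0)              # z-normalize first
        usable = len(x) // n_seg * n_seg
        paa = x[:usable].reshape(n_seg, -1).mean(axis=1)   # word size = n_seg
        # breakpoints split N(0,1) into alphabet_size equiprobable regions
        bps = [NormalDist().inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]
        symbols = np.searchsorted(bps, paa)                # 0 .. alphabet_size - 1
        return ''.join(chr(ord('a') + int(s)) for s in symbols)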

To find the effect of the compression-ratio on clustering accuracy, the quality of clusters obtained from time-series with different compression-ratios is investigated. For this experiment, all the UCR datasets are adopted, and the conventional k-Medoids algorithm is applied to the data at various resolutions, from very low to the highest resolution (raw time-series). The quality of the solutions against the ground truth is shown in Figure 5.13.


Figure 5.13: Quality of k-Medoids using SAX against the ground truth

Surprisingly, better results are seen at higher compression-ratios (fewer segments) for some datasets, but not in general. For the datasets where higher compression leads to higher quality, it can be concluded that there are noise and outliers in the time-series which affect the distance measures and decrease the accuracy; these are mitigated by the representation process. This is the main reason why, for some datasets, lower resolutions yield better quality.

Furthermore, the experiment is repeated for other algorithms to support the validity of the experiment. In the following charts, Figure 5.14 and Figure 5.15, the quality of clustering algorithms on different resolutions of time-series is investigated, and the results are reported against the ground truth.


Figure 5.14: Quality of hierarchical (average linkage) clustering of time-series represented by SAX against the ground truth


Figure 5.15: Quality of hierarchical (single linkage) clustering of time-series represented by SAX against the ground truth

As depicted in Figure 5.14 and Figure 5.15, most of the algorithms show various changes across different compression-ratios. That is, with increasing resolution of the time-series, the accuracy of clustering changes by around 10% for some datasets. This means that the compression-ratio of SAX affects the quality, as expected, and the quality is not very stable across different ratios.

5.3.1.2 Distance Metric (APXDIST)

The distance metric used to calculate the similarity between time-series is a very important and essential part of time-series clustering. As mentioned in the methodology chapter (Chapter 3.0), the distance measure designed for this step is APXDIST (see Section 4.3.3 for details of its design). This distance measure was proposed (in answer to the first research question of this study) in order to improve the accuracy of similarity measurement between approximated time-series. In this section, the APXDIST measure is evaluated.

In the first experiment, the UCR datasets are again used to test the new distance measure, APXDIST, against MINDIST and ED. MINDIST is the measure that is compatible with SAX and was proposed for indexing purposes (J. Lin et al., 2007). ED is the measure used in related works (Lai et al., 2010) for calculating the Euclidean distance between two time-series represented by SAX. To make the comparison, the dataset is first transferred into the discrete (SAX) space. Then, three distance matrices are built, one for each distance measure (APXDIST, MINDIST, and ED). The difference between each of these distance matrices and the distance matrix calculated by the Euclidean distance on the raw time-series is computed as the tightness of each metric to the Euclidean distance (ED). The tightness of a metric, for example APXDIST, to the Euclidean distance is calculated by the following equation.

    Tightness(T_i, T_j) = APXDIST(T̂_i, T̂_j) / ED(T_i, T_j)            (5.5)

where T̂ is the SAX representation of time-series T. The average tightness over all UCR datasets, over the cross product of alphabet sizes [5-8] and compression-ratios [4-8], is shown in Figure 5.16. A larger tightness indicates a more accurate approximation.
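Equation 5.5 can be computed directly once a SAX-space distance is available. The sketch below treats that distance (an APXDIST, MINDIST, or ED-on-SAX implementation) as a supplied function, since their definitions live elsewhere in the thesis.

    import numpy as np

    def tightness(sax_dist, pairs):
        # Average tightness of a SAX-space distance to the Euclidean distance
        # on the raw series (Equation 5.5). `pairs` yields ((x, y), (x_hat, y_hat)),
        # i.e., raw series and their SAX words; `sax_dist` is assumed given.
        ratios = []
        for (x, y), (x_hat, y_hat) in pairs:
            ed = float(np.sqrt(((np.asarray(x) - np.asarray(y)) ** 2).sum()))
            if ed > 0:
                ratios.append(sax_dist(x_hat, y_hat) / ed)
        return float(np.mean(ratios))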


Figure 5.16: The tightness of APXDIST to the Euclidean distance

As the results show, APXDIST is closer to the Euclidean distance (on raw time-series) than the other two approaches for most of the datasets. However, in some cases, the Euclidean distance works better on time-series represented by SAX (e.g., Lighting7). These cases can occur when the time-series in a dataset have high frequency with many peaks, which may weaken the normal-distribution property of the symbols that is the underlying assumption of the SAX parameters.

Moreover, to show the behaviour of the different distance metrics, they are also tested against the ground truth. In this experiment, the SAX representation (SAX6) is used with different clustering algorithms (k-Means, hierarchical with average linkage, hierarchical with single linkage, k-Modes, and k-Medoids). The average quality of clustering against the ground truth is shown in Figure 5.17.


Figure 5.17: The average quality of clustering against the ground truth

Based on the results obtained on the different datasets, it is evident that APXDIST works better than the other approaches for most of the datasets. For example, on the synthetic datasets, MINDIST works better than ED, and APXDIST works well too; in the cases where ED works better, such as on the Coffee dataset, APXDIST still works better than MINDIST. As a result, it can be concluded that, in general, APXDIST is more stable than the others, exploiting the strengths of both ED and MINDIST. Although the effect of these distance measures on the quality of clustering is not very large, it can be important for large (high-cardinality) and sensitive datasets. Moreover, the next section shows how APXDIST can improve the quality when used with an appropriate clustering algorithm.

5.3.1.3 Algorithm (Ek-Modes)

The proposed algorithm for the symbolic representation is Ek-Modes, explained in Section 4.3.4. Here, in the first experiment, the quality of clustering using different algorithms is calculated to investigate their influence on the datasets, which also answers the second research question of this study. The quality of clustering is measured using APXDIST with k-Means, k-Medoids, and hierarchical algorithms. SAX8, SAX6, and SAX4 are used as representation methods, and APXDIST is used as the distance measure. The average of 100 runs is shown in Figure 5.18.
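Before looking at the results, and for orientation only, a minimal standard k-Modes (Huang, 1998) over SAX words is sketched below: prototypes are per-position modal symbols and dissimilarity is the number of mismatching positions. The Ek-Modes variant proposed in the thesis extends this, so the sketch is illustrative, not the proposed algorithm itself.

    import numpy as np

    def k_modes(words, k, iters=20, seed=0):
        data = np.array([list(w) for w in words])      # n x nSeg matrix of symbols
        rng = np.random.default_rng(seed)
        modes = data[rng.choice(len(data), k, replace=False)].copy()
        for _ in range(iters):
            # matching dissimilarity: count of positions where symbols differ
            dist = (data[:, None, :] != modes[None, :, :]).sum(axis=2)
            labels = dist.argmin(axis=1)
            for j in range(k):
                members = data[labels == j]
                if len(members) == 0:
                    continue
                for p in range(data.shape[1]):         # per-position modal symbol
                    vals, counts = np.unique(members[:, p], return_counts=True)
                    modes[j, p] = vals[counts.argmax()]
        return labels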


Figure 5.18: Average quality of clustering of time-series represented by SAX vs. ground truth (GT) on the UCR datasets (TRAIN set)

The results reveal that the Ek-Modes algorithm generally produces relatively better results than the other algorithms, making it a good choice as the algorithm for pre-clustering. In addition, the average quality is depicted graphically by the box plot in Figure 5.19.


Figure 5.19: Quality of different algorithms in the first step (against GT)

As depicted in Figure 5.19, the average quality shows different accuracies for the different algorithms against the ground truth. The experiment shows that the quality of approximate clustering by hierarchical and partitioning algorithms is relatively similar. However, overall, Ek-Modes clustering produces better results when the dimensionality of the time-series is reduced. Moreover, since Ek-Modes is a partitioning-type algorithm, it benefits from high speed (Huang, 1998) in comparison to other algorithms such as hierarchical methods. Hence, the best choice for the approximate clustering (i.e., generating pre-clusters) in the first step of MTC is the Ek-Modes algorithm.

5.3.2 Step 2: Purifying and Summarization

The objective of the second step of MTC is revising the pre-clusters by splitting them into sub-clusters. The output of this step is a set of pure sub-clusters represented by prototypes. This requires a higher resolution of time-series and a more accurate distance measure than in the first step. As explained in the previous chapter, the concern here is generating fewer but purer sub-clusters, that is, more reduction and a lower error rate (see Section 4.4.2.2). In the following subsection, the purity and the amount of data reduction are evaluated.

5.3.2.1 Reduction vs. Purity

As mentioned, the objective of this step is to obtain the purest sub-clusters with the fewest sub-clusters. The purest cluster is one whose members all belong to the same class of the ground truth. To achieve this objective, PCS was designed in Section 4.4.2. The purity of a sub-cluster depends on the value of the affinity threshold, α, in the PCS algorithm (in the case that it is not dynamic). The higher α is, the purer the sub-clusters that are obtained; in contrast, forming inaccurate sub-clusters leads to inaccuracy in the final results. However, the cost of purer sub-clusters is a higher number of sub-clusters. Of course, sub-clusters containing only one time-series are the purest clusters; however, merging a large number of time-series in the last step is expensive. Hence, a trade-off is needed between the purity and the number of sub-clusters. The following experiments verify that the method proposed in Section 4.4.2, i.e., PCS, can reduce the size of the data without greatly reducing its effectiveness.

A reduction-rate parameter is defined to find the best α. As mentioned, in the proposed model (MTC), each pre-cluster is broken down into purer sub-clusters. Given |SC| as the number of sub-clusters, the reduction-rate of the second step is defined as:

    RedRate = 1 − |SC| / n                                             (5.6)

where RedRate is the reduction-rate and |SC|/n is the ratio of the number of sub-clusters to the size of the dataset. Different values of α are used (in the second step of MTC) to show how the reduction-rate changes. For example, for α = 0.7, the result is depicted in Figure 5.20.


Figure 5.20: Reduction-rate in the second step of MTC against the number of clusters for α = 0.7

Figure 5.20 reveals that a very good reduction is obtained for some datasets, e.g., Wafer, but not for all. How much purity is obtained per reduction of the data? Corresponding to the reduction-rate in the second step, the purity of the sub-clusters against the ground truth is calculated. Here, the purity of the sub-clusters is calculated based on the number of items in the same sub-cluster that belong to the same class of the ground truth (Van Rijsbergen, 1979), as discussed in Section 2.7.1. For example, in the case that α = 0.7 is chosen as the affinity threshold for PCS, the purities of the first and second steps are depicted in Figure 5.21.
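The purity computation used here is standard and compact; a minimal sketch, assuming integer or string labels:

    import numpy as np

    def purity(labels, ground_truth):
        # Purity: each cluster votes for its majority class, and the
        # correctly assigned items are counted (Van Rijsbergen, 1979).
        labels, ground_truth = np.asarray(labels), np.asarray(ground_truth)
        correct = 0
        for c in np.unique(labels):
            _, counts = np.unique(ground_truth[labels == c], return_counts=True)
            correct += counts.max()          # majority class of this cluster
        return correct / len(labels)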


Figure 5.21: Purity of the first and second steps for α = 0.7

The results in Figure 5.21 show the purity of the different datasets in the first- and second-step clustering by MTC using α = 0.7. As an example, for the first dataset (50words), the purity of the clusters obtained in the first step is around 50%, but the purity of the sub-clusters in the second step increases to around 90%. However, it should be noted that the number of clusters in the second step is larger than in the first step, around 60% of the total number of time-series in the dataset (see Figure 5.20). In Figure 5.22, the proportion between the reduction-rate and the purity of the sub-clusters in the second step is illustrated for α = 0.7.


Figure 5.22: The reduction-rate and purity of the second step for α = 0.7

Figure 5.22 reveals that for some datasets the effect of the reduction on purity is very high (e.g., the Adiac dataset), while for others it is not more than 10% (e.g., the 50words dataset). Therefore, it can be concluded that raising the purity of the second-step clusters increases the number of sub-clusters, but with different ratios. The results in this chart relate to α = 0.7, which increases the purity of the pre-clusters considerably (in comparison with the first stage) but reduces the number of instances only slightly. In MTC, a new dynamic α was introduced for the PCS algorithm, which can find a good trade-off between reduction-rate and purity using the approximate affinity between time-series in the pre-clusters (see Section 4.4.2.2 for more details). Therefore, the experiment is repeated for several values of α to show the effectiveness of the proposed dynamic α, denoted α=PCS.

To find a good trade-off between the reduction-rate (RedRate) and the purity of the sub-clusters (Purity(SC)), the objective function is defined as the quality-gain-ratio, the product of the two:

    QGR = Purity(SC) × RedRate                                         (5.7)
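A sketch of Equations 5.6 and 5.7 as reconstructed above; note that the product form of QGR is inferred from the stated trade-off, so treat it as an interpretation rather than the thesis's exact formula.

    def reduction_rate(n_subclusters, n_series):
        # Equation 5.6: how much the merging step's input shrank relative to n.
        return 1.0 - n_subclusters / n_series

    def quality_gain_ratio(purity_step2, red_rate):
        # Equation 5.7 (reconstructed): the product of sub-cluster purity
        # and reduction-rate, balancing the two objectives.
        return purity_step2 * red_rate

    # e.g., 300 sub-clusters from 1000 series with purity 0.95:
    # QGR = 0.95 * (1 - 300/1000) = 0.665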

In order to show the effectiveness of the dynamic α in PCS, the experiment is carried out for α ∈ {0.1, 0.3, 0.5, 0.7} and for the dynamic α=PCS across all datasets. The result is shown in Figure 5.23.


Figure 5.23: The QGR of the second step for different values of α

This chart shows that α=PCS outperforms (on average) the static values of α, providing a good trade-off between quality and reduction-rate while remaining dynamic. The dynamic characteristic of PCS is very important because it gives a high QGR for most datasets, whereas a static threshold gives a high QGR only for a few datasets (e.g., one static value may be very good on yoga and Two_Patterns but fail on the other datasets). However, if high quality is desirable for a specific dataset, a higher (user-defined) α can result in more accurate clusters. Figure 5.24 shows the QGR for α=PCS and the proportion between purity and reduction-rate for the different datasets.


Figure 5.24: Reduction-rate of the PCS approach against the gained purity

Figure 5.24 shows the QGR of PCS using the dynamic α. As the chart reveals, a good trade-off between reduction-rate and purity is obtained without parameter setting. That is, in the PCS algorithm, the affinity threshold is adjusted dynamically regardless of the characteristics of the time-series or the size of the data, which is of great importance. To show how different values of α can affect the purity and reduction-rate of the second-step clusters, the experiment is shown specifically for different α on the CBF dataset, as depicted in Figure 5.25.


Figure 5.25: The proportion of reduction-rate against quality for different values of α

As Figure 5.25 shows, α=PCS provides a better QGR in comparison with the other values. That is, using this approach, a good trade-off between quality and reduction-rate is obtained dynamically. All these experiments were carried out on the TRAIN sets of the UCR datasets, which are not very large. However, the effect of PCS is more obvious on large datasets such as the Wafer dataset. Hence, the experiment was repeated on CBF with different cardinalities to show the effect of PCS on large time-series collections. As expected, PCS works very well on large datasets, as depicted in Appendix F.

Therefore, using a post-clustering algorithm, the pre-clusters are revised in such a way that they are broken into purer clusters, which are then represented by prototypes. This answers the third research question of this study. Moreover, comparing the prototypes with the original data, it can be seen that with this approach the data size is reduced dynamically based on the characteristics of the time-series.

5.3.3 Step 3: Merging

As mentioned in Section 4.5, the objective of the third step of MTC is merging the sub-clusters represented by their prototypes (representatives). In the following sections, the distance metric used for calculating the distance between prototypes is evaluated, and then the schemes utilized for merging are discussed.

5.3.3.1 Distance Measurement

Prototypes of sub-clusters are high-resolution time-series which precisely represent the sub-clusters. As explained in Section 4.5.1, DTW is used as an accurate distance measure for the similarity calculation between two prototypes (which are high-resolution time-series). First, the quality of clustering of the different datasets is investigated using ED and DTW, two of the most-used distance measures in the literature (as discussed in Section 2.5.1).


Figure 5.26: Quality of clustering using ED and DTW

Considering the results shown in Figure 5.26, DTW is expectedly a better choice than ED in terms of quality because it handles shifts in calculating the similarity. The experiment shows that this also holds on large datasets (see Appendix D). This improvement in quality comes at the expense of the high complexity of DTW. Although the number of prototypes is small in comparison with the original data, DTW can be used with constraints (warping windows) to mitigate the high complexity of the distance measure, as explained in Section 4.5.1. There is a range of warping windows [ED, DTW(10%), DTW(20%), ..., DTW(90%), DTW(100%)] from ED (as an upper bound of DTW) to DTW(100%) (complete DTW). In a study by Ratanamahatana & Keogh (2005), the authors investigated the percentage of warping window required in the DTW calculation for different datasets. Their results show that it differs and depends on the nature of the dataset. Hence, to achieve maximum accuracy, complete DTW should be calculated for some datasets.

5.3.3.2 Merging Scheme

As explained in the research methodology, in the third step of MTC, an arbitrary clustering scheme (e.g., hierarchical or partitioning) can be used depending on the nature of the dataset. In order to evaluate MTC, k-Medoids, hierarchical with single linkage, hierarchical with average linkage, and the arbitrary-shape clustering approach are implemented as merging schemes for the third step. In Section 4.5, it was shown that, using multiple prototypes and a customized hierarchical algorithm, merging can be performed for arbitrary-shape clusters. In the following figures, two different datasets were chosen to show the effect of arbitrary-shape clustering (see Figure 5.27 and Figure 5.28). In this experiment, MTC is performed using the different schemes for the third step to show their performance.


Figure 5.27: Different merging schemes on the Trace dataset

Figure 5.27 shows that if the clusters are of equal or comparable size (e.g., the Trace dataset), the clusters are globular (spherical) and partitioning clustering is a good choice.


Figure 5.28: A sample of the accuracy gained using MTC with the arbitrary-shape scheme in the third step

However, as Figure 5.28 shows, when the clusters are of different sizes, arbitrary-shape clustering outperforms the other schemes in most cases. The results of running arbitrary-shape clustering on all datasets are reported in Appendix I.

5.4 Final Results

In this section, MTC is applied to all datasets and the results are reported; in the next chapter, they are visualized, compared, and discussed. Before running the clustering algorithm, to provide a fair condition, each instance of the dataset is normalized. That is, each time-series is transformed to have zero mean and a standard deviation of one. Indeed, z-normalization substantially improves the clustering accuracy irrespective of the chosen distance measure. The formula for normalization is given in Section 4.3.1. To report the results of the proposed model (MTC), many iterations are performed with various parameter settings. The parameters used to report the final results are shown in Table 5.2.
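A minimal sketch of the z-normalization just described (the guard for constant series is an assumption not discussed in the text):

    import numpy as np

    def z_normalize(x):
        # Zero mean, unit standard deviation, per the formula in Section 4.3.1.
        x = np.asarray(x, dtype=float)
        sd = x.std()
        return (x - x.mean()) / sd if sd > 0 else x - x.mean()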

Table 5.2: Parameter values of MTC used to generate the experimental results

Step  Parameter                Description                                                       Range
1     SAX compression-ratio    The compression-ratio for the SAX representation method           [4-8]
                               in the first step of MTC
2     Resolution               Resolution of the time-series                                     SAX2, raw time-series
3     DTW warping window (w)   The DTW warping window as a constraint to speed up calculation    75%, 100%
3     Algorithm scheme         The type of clustering algorithm for the third step               Hierarchical (average, single linkage), Partitioning (k-Medoids)
3     Cluster numbers (k)      k is defined based on the number of clusters in the labelled      -
                               dataset (user definition)

In the experimental results reported in this section, the range of values shown in Table 5.2 is used as the parameter values of MTC applied to all datasets. A parameter study is also performed in Section 6.4 to determine the sensitivity of MTC to this set of parameters by varying the SAX parameters.

As mentioned, the measure of quality in this experiment is the average value of the criteria discussed in Section 2.7.1. For different parameter combinations, the average quality is reported in Table 5.3. The quality reported in each experiment is the average value of the external criteria discussed in Section 3.2.7. The results and the accuracy of MTC at each step are shown in detail for some datasets in the next chapter.

Table 5.3: The result of MTC in comparison with conventional algorithms

Dataset            k-Medoids  MTC k-Medoids  Hier (Avg)  MTC Hier (Avg)  Hier (Single)  MTC Hier (Single)
50words            0.499      0.615          0.506       0.578           0.29           0.343
Adiac              0.345      0.498          0.278       0.443           0.258          0.284
CBF                0.44       0.836          0.383       0.623           0.347          0.328
Coffee             0.373      0.628          0.384       0.453           0.392          0.422
ECG200             0.481      0.44           0.523       0.435           0.476          0.451
FaceAll            0.392      0.587          0.291       0.405           0.194          0.218
FaceFour           0.549      0.681          0.576       0.438           0.408          0.392
FISH               0.371      0.486          0.303       0.366           0.231          0.231
Gun_Point          0.318      0.363          0.41        0.42            0.377          0.388
Lighting2          0.388      0.469          0.467       0.498           0.48           0.485
Lighting7          0.42       0.598          0.514       0.49            0.322          0.349
OliveOil           0.608      0.688          0.557       0.687           0.557          0.712
OSULeaf            0.33       0.386          0.271       0.313           0.255          0.261
SwedishLeaf        0.39       0.513          0.264       0.35            0.193          0.21
synthetic_control  0.65       0.865          0.686       0.72            0.551          0.587
Trace              0.517      0.764          0.551       0.745           0.565          0.734
Two_Patterns       0.231      0.723          0.228       0.639           0.254          0.846
wafer              0.491      0.532          0.507       0.51            0.547          0.547
yoga               0.333      0.36           0.353       0.368           0.389          0.39

As Table 5.3 shows, the quality of MTC is superior for most of the datasets using the different clustering schemes. The results (which are discussed further in Section 6.4) show that MTC is not very sensitive to the mentioned parameters and can generate accurate clusters for all of these combinations of distance measures and merging algorithms. Moreover, the maximum accuracy for each scheme of MTC is shown in Appendix G. The results are discussed further in the next chapter by comparison with other algorithms.

5.5 Chapter Summary

In this chapter, the spectrum of datasets used for the experiments was described. The chosen datasets come from various domains and have different sizes. In the next part of the chapter, each step of MTC was analysed. In each step, the methods designed in the previous chapter were evaluated separately using different datasets, which is also useful for other researchers. The strength and scalability of SAX were explained; it was shown that while SAX is very scalable, it causes the overlooking of data in some datasets. Then, APXDIST was evaluated by comparison with other distance measures, and it was justified why APXDIST works better for dimensionality-reduced time-series. For the first time, the k-Modes algorithm was used on time-series represented by a symbolic representation, which leads to more stable results in comparison with other partitioning and hierarchical methods used in this domain. Then, the purifying of the approximated clusters was discussed. It was shown how purifying by PCS can split a cluster into pure sub-clusters, and the purifying process was demonstrated numerically. It was shown that, using the affinity concept, the size of the time-series data can be reduced without greatly violating the accuracy. Finally, the sub-clusters were merged by various algorithms. In different runs, it was shown how a user can choose different schemes, such as partitioning or hierarchical algorithms, to merge the prototypes in the third step. The final results of running all steps of MTC were reported, showing its superiority in comparison with existing methods. This will be discussed further in the next chapter.

6.0 EXPERIMENTS EVALUATION

6.1 Introduction

One of the major obstacles in the data mining domain is the difficulty of evaluating a clustering algorithm without taking its context into account: why does a user cluster his data in the first place, and what does he want to do with the clustering afterwards? The answers to these questions depend largely on the domain of the problem at hand. However, as explained in the research methodology, some common approaches are taken here to evaluate the model. Essentially, the performance of a time-series clustering is evaluated with several parameters. For the proposed clustering model, MTC, the following parameters are used for evaluation:

1. Accuracy evaluation: the quality of the final clusters according to the ground truth

2. Scalability evaluation: the execution time and memory usage of the algorithm

3. Sensitivity evaluation: the important parameters that affect the clustering results

In this chapter, the effectiveness of the proposed model (MTC) is validated with the performed experiments. In addition, the accuracy of the final results of the model (MTC) is evaluated and compared with other widely used approaches. Although the objective of this thesis is enhancing the accuracy of time-series clustering, the scalability and sensitivity of MTC are also evaluated to support the claim of the practicality of the method. It is shown that the proposed approach (MTC) leads to clustering time-series both effectively and efficiently.

6.2 Accuracy Evaluation

One of the challenges in time-series clustering is evaluating the quality of the clustering results, which is not a trivial task, mostly due to its unsupervised nature and the frequent absence of a ground truth (actual clusters). Whether the ground truth exists or not, visualization and scalar measurement methods are the two major techniques for evaluating clustering quality (also known as clustering validity) (Hathaway & Bezdek, 2003). In this study, both methods are used to evaluate the accuracy of MTC in the following sections.

6.2.1 MTC Visualization

In this section, all three steps of MTC are depicted using two sample datasets for more intuition. As mentioned in the methodology chapter, approximate clustering is first applied to the datasets (using APXDIST); the clusters are then revised based on similarity in time in the second step (using ED). Finally, the prototypes are merged in the third step based on similarity in shape, using DTW.

In order to show the results visually, two datasets from the UCR repository (from the TRAIN collection) are used in this experiment: the CBF and Coffee datasets. The results of clustering the CBF and Coffee datasets in the first step of MTC are shown in Figure 6.1 and Figure 6.2. In this experiment, the CBF dataset includes 30 instances; SAX6, APXDIST, and k-Modes are used for the clustering.

Figure 6.1: k-Modes clustering of time-series represented by SAX6 for the CBF dataset (first step)


Figure 6.2: k-Modes clustering of time-series represented by SAX6 for the Coffee dataset (first step)

Figure 6.1 and Figure 6.2 show that the quality of clustering is low. For example, in the CBF case, two time-series of the second cluster have been wrongly combined with the first cluster. The main reason for the low quality in this step is the overlooking of data due to dimensionality reduction, and the ignoring of shifts in the time-series when utilizing a distance measure that finds time-series which are similar in time.

Then, the second step of MTC is applied to the sample datasets. The clustering solutions depicted in Figure 6.3 and Figure 6.4 correspond to the second step of MTC clustering of the CBF and Coffee datasets. For each dataset, the number of clusters shown is greater than the number of genuine clusters (ground truth). The additional clusters are the ones that contain outlier time-series or time-series related to other clusters, and these will be merged in the next step.


Figure 6.3: The clustering process in which MTC is able to find the genuine (pure) sub-clusters in the CBF dataset in the second step

Figure 6.4: The clustering process in which MTC is able to find the genuine sub-clusters in Coffee-train in the second step

Figure 6.3 and Figure 6.4 show that MTC is able to identify the genuine clusters in both datasets. That is, MTC finds the most similar time-series in time in the second step. In the case of CBF (Figure 6.3), MTC finds 13 clusters; some of them correspond to the genuine clusters in the dataset, and some correspond to rare time-series (outliers). In the case of Coffee-train (Figure 6.4), MTC finds seven clusters, each corresponding to a genuine cluster in the dataset. For each cluster, a prototype is made which represents all the time-series in the sub-cluster.

In the following figures, Figure 6.5 and Figure 6.6, the final results for the CBF dataset (containing 30 time-series) and the Coffee dataset (containing 28 time-series) are shown using the proposed model. In these examples, k-Medoids is utilized as the merging scheme in the third step.

Figure 6.5: Clustering of time-series in the CBF dataset using MTC


Figure 6.6: Clustering of time-series in the Coffee dataset using MTC

As these experiments illustrate, MTC finds all the correct clusters. Similarly, MTC is very effective and produces very high-quality clusters for the Coffee dataset. Therefore, these experiments show visually how MTC achieves very accurate clusters. In the next section, the accuracy of the clustering is calculated objectively.

6.2.2 MTC Accuracy

In this experiment, the accuracy of MTC is compared objectively with labelled time-series as the ground truth. As mentioned before, nine clustering evaluation criteria are used for evaluating time-series clustering algorithms: Jaccard, RI, ARI, F-measure, NMI, CSM, FM, Purity, and Entropy. The details of each index were explained in the literature review in Section 2.7. Each of these accuracy criteria has its own strengths and weaknesses in measuring quality, and there is no consensus in the data mining community on which measure is superior. Hence, to avoid biased evaluation, all of these measures (which all range from 0 to 1) are used, and a conclusion is drawn based on their average. In this experiment, 19 datasets from the UCR repository are used to calculate the accuracy of MTC. For the hierarchical merging of prototypes, average linkage and single linkage are used in this study as examples of the hierarchical clustering group. k-Medoids is used to show the results of partitioning clustering; k-Means is not used because of its weakness in finding prototypes when similarity in shape is desirable (see Section 2.8.2 for more details). Moreover, the introduced method for creating arbitrary-shaped clusters is used to merge the clusters with more than one prototype. The accuracy of MTC is shown in Figure 6.7, which illustrates the quality of MTC with different schemes for the merging step. In this experiment, the compression-ratio of SAX in the first step of MTC varies in each run between 4 and 8. The random selection of initial clusters (in Ek-Modes) is addressed by multiple runs of MTC to avoid obtaining the results by random chance (the algorithm was run 100 times), and the results are evaluated against the ground truth.
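As an example of one of the nine external criteria, a minimal sketch of the Rand index (RI) follows; the other indices are computed analogously from the same pair counts or contingency table.

    from itertools import combinations

    def rand_index(labels, ground_truth):
        # RI: the fraction of item pairs on which the clustering and the
        # ground truth agree (same/same or different/different).
        agree, total = 0, 0
        for i, j in combinations(range(len(labels)), 2):
            same_cluster = labels[i] == labels[j]
            same_class = ground_truth[i] == ground_truth[j]
            agree += same_cluster == same_class
            total += 1
        return agree / total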

[Figure: bar chart; y-axis: Accuracy (0–1); series: MTC k-Medoids, MTC Hier-Avg, MTC Hier-Single, MTC Arbitrary merging; x-axis: the 19 UCR datasets (CBF, FISH, yoga, Adiac, Trace, wafer, Coffee, FaceAll, ECG200, OliveOil, OSULeaf, 50words, Lighting2, Lighting7, FaceFour, Gun_Point, SwedishLeaf, Two_Patterns, synthetic_control).]

Figure 6.7: Clustering accuracy of MTC in front of ground truth using different schemes on various datasets

The results show that the accuracy of MTC is relatively high using all merging methods.

However, in view of the classification results in the UCR repository, it may seem that these results are not very significant or accurate. For example, most classification methods achieve up to 95% accuracy on these datasets. Nevertheless, it should be considered that clustering is an unsupervised approach while classification is supervised. Clustering methods therefore often cannot achieve the same accuracy as classification methods. Clustering accuracy is usually lower (much lower) than classification accuracy because there is no training phase with labelled data and because of the random initialization of centroids. Therefore, to show the improvement in cluster accuracy, the obtained results are compared with other partitioning and hierarchical approaches to clustering time-series data.

6.2.3 Comparing MTC with Partitioning Clustering

In this section, the results are compared with different methods of partitioning clustering. First, k-Medoids is chosen, which has been shown to be effective in finding clusters in time-series datasets (see Section 2.8.2). k-Medoids is preferred in order to provide a fair comparison between conventional clustering and MTC.

To begin with, raw time-series are taken as the highest resolution. Although raw time-series cannot be used on large time-series datasets in real-world problems, because of the high complexity involved, and because they usually contain outliers, raw time-series are used here as an idealized input for comparison. To calculate the distance between time-series, ED and DTW are used as distance measures. Figure 6.8 shows the quality of the MTC approach against the quality of k-Medoids on raw time-series. It is the result of 100 runs on the same datasets. A similar comparison based on the purity measure is also reported (see Appendix E) to compare with some rival works.

[Figure: bar chart; y-axis: Accuracy (0–1); series: k-Medoids (Raw-ED), k-Medoids (Raw-DTW), MTC; x-axis: the 19 UCR datasets.]

Figure 6.8: Quality of MTC approach in front of standard k-Medoids on raw time-series

First, using raw time-series and Euclidean distance, i.e., k-Medoids(Raw-ED), MTC could not find better clusters for the Coffee, ECG, and Olive Oil datasets, but it was able to find the right clusters for 13 of the remaining datasets. In this experiment, k-Medoids was applied on raw time-series. Raw time-series have quite good quality in the sense that no information is overlooked or approximated by representation methods. This point explains the cases in which the quality of k-Medoids is better than that of MTC (e.g., the Coffee, ECG, and Olive Oil datasets). These datasets include time-series which are usually either very similar in time or whose dissimilarities lie mostly in amplitude (e.g., the sample time-series of the Coffee and Olive Oil datasets depicted in Appendix J). Moreover, looking at the datasets in Table 5.1, the size of these datasets (Coffee, ECG, and Olive Oil) is relatively small (below 100 instances), which decreases the benefit of using representatives in MTC. That is, each prototype represents only a few time-series, and the information lost in making representations costs more than what is gained by increasing the reduction-rate (see Section 5.3.2.1) and handling shifts in time-series. However, in most of the other cases (in contrast with using DTW or MTC), k-Medoids(Raw-ED) failed to find accurate clusters because of the noise and shifts commonly present in the raw time-series. That is, noise and shifts in the raw time-series directly influence the similarity measure (ED) and so reduce the accuracy of k-Medoids. The situation would be even worse with k-Means, where the effect of the prototypes (centroids) on inaccuracy is more pronounced. Therefore, MTC outperforms k-Medoids(Raw-ED).

Second, as expected, using raw time-series and DTW, i.e., k-Medoids(Raw-DTW), conventional k-Medoids works better than clustering with Euclidean distance, i.e., k-Medoids(Raw-ED), because of the strength of DTW in handling shifts in time-series. However, MTC is still superior to k-Medoids(Raw-DTW) on some datasets. This is because the mechanism of using a prototype of similar time-series instead of the original time-series mitigates the effect of noise and outliers, while DTW is sensitive to outliers (because in DTW, all points have to be matched). Moreover, using raw time-series and DTW (i.e., k-Medoids(Raw-DTW)) is not a fair comparison, because in the real world it is not practically feasible due to its very high complexity. That is, just to calculate the distance matrix (needed for clustering), N(N-1)/2 distance calculations are required, where N is the number of time-series. Additionally, considering the length of the time-series in the dataset as d, the number of instructions executed to calculate a distance measure for a pair of time-series is O(d²). As a result, the complexity of only the distance matrix (not the whole clustering process) equals N(N-1)/2 · d², which is very high. For example, in the Wafer dataset (from the TRAIN set), given N=1000 and d=152, the number of executed instructions is 11,540,448,000, whereas the same process using MTC takes around 177,084,440 instructions (with SAX4 and reduction factor=90%; see Section 6.3 for more details). As a result, MTC is superior to k-Medoids(Raw-DTW).
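As a quick check of these figures, the following lines reproduce the instruction count for the Wafer example (the corresponding MTC count of 177,084,440 comes from the model's internal accounting and is simply quoted from the text):

N, d = 1000, 152                    # Wafer (TRAIN): series count and length
pairs = N * (N - 1) // 2            # 499,500 pairwise distances
print(f"{pairs * d**2:,}")          # O(d^2) per DTW pair -> 11,540,448,000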

Based on the literature review, most studies on clustering of time-series use a representation method to reduce the dimension of the time-series before performing clustering (see Section 2.8). Therefore, in the next experiment, as a fair comparison, the raw time-series are represented by SAX, and MTC is compared against these dimensionally reduced time-series. However, because the accuracy of SAX itself depends on the compression-ratio value, to keep the conditions fair, the raw time-series (for all datasets) are represented using three different compression-ratios ([4, 6, 8]). Then, all datasets are clustered utilizing the k-Medoids algorithm. As a result, for each dataset, the mean of the three accuracies is calculated as the average accuracy of k-Medoids. Likewise, since MTC can also accept different resolutions in the first step, the same resolutions are used for MTC, and the mean of the accuracies is taken as the quality of MTC. The following chart compares the average quality of k-Medoids with the average accuracy of MTC, both measured against the ground truth (see Figure 6.9).
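For readers unfamiliar with the representation, the sketch below is a simplified stand-in for the SAX transform (assuming z-normalised input and a four-symbol alphabet with the standard Gaussian breakpoints); it shows how the compression-ratio maps a series of length d to a word of length r = d/ratio:

import numpy as np

def paa(ts, ratio):
    # Piecewise Aggregate Approximation: length d -> r = d // ratio segment means.
    r = len(ts) // ratio
    return ts[: r * ratio].reshape(r, ratio).mean(axis=1)

def sax(ts, ratio, breakpoints=(-0.6745, 0.0, 0.6745)):
    # Discretize the PAA segments into symbol ids 0..3 (alphabet size 4).
    return np.searchsorted(breakpoints, paa(ts, ratio))

ts = np.random.randn(128)               # an already z-normalised series
for ratio in (4, 6, 8):                 # the three compression-ratios used above
    print(ratio, len(sax(ts, ratio)))   # the word length r shrinks as the ratio grows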

[Figure: bar chart; y-axis: Accuracy (0–1); series: k-Medoids(SAX), MTC; x-axis: the 19 UCR datasets.]

Figure 6.9: Quality of MTC (k-Medoids) in front of ground truth for UCR dataset

As Figure 6.9 shows, and as expected, the accuracy of MTC is much better than that of k-Medoids for all datasets. Moreover, the maximum quality was also calculated for k-Medoids, and the result is reported in Appendix G. This supports the researcher's claim that the MTC model can outperform conventional algorithms that use dimensionality reduction approaches for clustering.

6.2.4 Comparing MTC with Hierarchical Clustering

In this section, the accuracy of MTC is compared with hierarchical clustering algorithms. Hence, a hierarchical scheme is utilized in the third step of MTC. When a hierarchical clustering algorithm is used as the merging scheme in the third stage of MTC, a dendrogram of possible clustering solutions at different levels of granularity is produced. Because merging operates on the prototypes of the sub-clusters, which are few in comparison with the original data, the hierarchy of clusters can easily be shown visually, which is very advantageous. For each dataset, two clustering solutions are tested, i.e., hierarchical clustering with average linkage and with single linkage.

First, conventional hierarchical clustering is tested on the CBF dataset (from the TRAIN set). In this experiment, the time-series are transformed to the SAX4 representation. Figure 6.10 shows the dendrogram generated by hierarchical clustering (average linkage) of the CBF dataset.

Figure 6.10: Dendrogram of hierarchical clustering of CBF using average linkage

As can be seen in this clustering solution (Figure 6.10), hierarchical clustering merges clusters wrongly. It fails to select the correct pair of clusters to merge, and as a result, the red-coloured and green-coloured time-series are merged into the first and second clusters. Table 6.1 depicts the quality of the final clustering result.

Table 6.1: The quality of the wrong clustering of CBF using hierarchical clustering with average linkage and SAX4

                          RI    ARI   Purity  CEntropy  F-measure  Jaccard  FM    NMI   CSM   Average
Hier(Linkage.Avg)(SAX4)   0.55  0.09  0.53    0.2       0.71       0.28     0.45  0.23  0.46  0.39

In contrast, a successful experiment using the MTC methodology is shown in the following. First, the result of the first step of MTC is depicted, where k-Modes and SAX4 would normally be used to generate the pre-clusters. However, to provide exactly the same conditions, the pre-clusters generated by the conventional hierarchical clustering are reused here, as depicted in Figure 6.11. This clustering solution corresponds to the earliest point in the dendrogram at which the hierarchical process failed and merged sub-clusters belonging to two different classes (experiments show that better pre-clusters are made using k-Modes).

Figure 6.11: Pre-clustering of the time-series related to CBF made by hierarchical clustering

Then the pre-clusters are refined in the second step. Figure 6.12 shows the process of making sub-clusters from pre-clusters.


Figure 6.12: Revising pre-clusters: generating the sub-clusters

As Figure 6.12 shows, the pre-clusters are broken into sub-clusters of time-series that are similar in time. This is the result of applying PCS to the pre-clusters, which purifies the clusters (see Section 4.4.2 for more details). As shown, all sub-clusters in this example are quite pure. Consequently, for each sub-cluster a prototype is made which represents the whole sub-cluster, as depicted in Figure 6.13.
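One simple way to realize such a prototype, shown here as an illustrative choice rather than the exact definition used in the model, is the medoid of the sub-cluster, i.e., the member with the smallest total Euclidean distance to all other members:

import numpy as np

def medoid_prototype(subcluster):
    # subcluster: array of shape (members, length); returns the most central member.
    X = np.asarray(subcluster, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise ED
    return X[D.sum(axis=1).argmin()]

Because each sub-cluster is already pure, a single medoid can summarize it with little loss.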


Figure 6.13: Selecting a time-series as the prototype of a sub-cluster

Finally, using the hierarchical scheme, the prototypes are clustered in the third step. The result is clusters of prototypes which are made based on similarity in shape.

Figure 6.14: Third step of MTC: merging prototypes

As Figure 6.14 shows, hierarchical clustering is performed only on the prototypes, which are far fewer than the original data. As a result, the calculation of similarity in shape (which is very costly) is carried out only on a small set of time-series. Then the prototypes are mapped back to their members, as illustrated in Figure 6.15.
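A minimal sketch of this merge-and-map step is given below, using SciPy's average-linkage implementation on a condensed distance matrix of the prototypes; here dist may be any shape-based distance such as DTW, and members_of (mapping each prototype to the series it represents) is a hypothetical bookkeeping structure, not part of the thesis code:

from scipy.cluster.hierarchy import linkage, fcluster

def merge_prototypes(prototypes, members_of, k, dist):
    # Cluster the few prototypes (cheap even with DTW), then map labels to members.
    m = len(prototypes)
    condensed = [dist(prototypes[i], prototypes[j])
                 for i in range(m) for j in range(i + 1, m)]
    labels = fcluster(linkage(condensed, method='average'), t=k, criterion='maxclust')
    return {series: labels[p] for p in range(m) for series in members_of[p]}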

Figure 6.15: The final result of MTC on the CBF dataset

As Figure 6.15 shows, using this approach, high accuracy (even up to 100%) can be obtained on some datasets. To show how effective this approach is, the proposed model is applied to different datasets. The following diagram (Figure 6.16) shows the result of MTC clustering using the hierarchical scheme with average linkage.

[Figure: bar chart; y-axis: Accuracy; series: Hier(Linkage.Avg)(SAX), MTC Avg. Accuracy; x-axis: the 19 UCR datasets.]

Figure 6.16: Average quality of MTC approach in front of running hierarchical clustering (average linkage) on dimensionally reduced time-series

As the results in Figure 6.16 show, the accuracy of MTC outperforms the conventional hierarchical algorithm using average linkage for most of the datasets. Hierarchical clustering using single linkage is also a well-known clustering algorithm that is good at handling non-elliptical shapes; however, it is sensitive to noise and outliers (Tan et al., 2006). Figure 6.17 shows the result of running MTC where hierarchical clustering with single linkage is used.

[Figure: bar chart; y-axis: Accuracy; series: Hier(Linkage.Single)(SAX), MTC Avg. Accuracy; x-axis: the 19 UCR datasets.]

Figure 6.17: Average quality of MTC approach in front of running hierarchical clustering (single linkage) on dimensionally reduced time-series

As expected, MTC also outperforms single linkage in most cases. To sum up, using MTC with hierarchical schemes, the prototypes of the time-series can be clustered to generate a dendrogram, which is very useful for understanding the structure of the data. Moreover, as was shown for the partitioning scheme, in the hierarchical scheme the approximated sub-clusters also have a less destructive effect on the accuracy of the final clusters than the use of approximated time-series does, and they generate more accurate clusters. Therefore, the user does not need to reduce the dimension of the time-series to a great extent, and as a result, data is not overlooked in the clustering process.

6.2.5 Comparing MTC with Multi-step Models (Rival Models)

One of the novel works closest to the proposed model in this study is a two-level approach proposed by Lai et al. (2010), so-called 2LTSC, which was discussed in the literature review (Section 2.8.6). In that work, SAX transformation is used as a dimension reduction method and CAST as the clustering algorithm in the first level. In the second level, Dynamic Time Warping (DTW) is used to calculate distances between time-series of varying length, and Euclidean distance for equal-length data. To compare 2LTSC with MTC, fair conditions are ensured. First, both models are applied to the same datasets from the UCR repository as benchmark datasets. Second, 2LTSC works with the CAST algorithm, where the number of clusters is determined indirectly by a threshold; hence, after running 2LTSC, the number of clusters it generates is also used for MTC. Third, the same parameters for the SAX representation are used to generate the results. Fourth, because 2LTSC is very sensitive to the threshold parameter of the CAST algorithm, the best result of 2LTSC is taken as the quality of that model. Figure 6.18 shows the clustering results for both approaches.

[Figure: bar chart; y-axis: Accuracy; series: 2LTSC, MTC; x-axis: the 19 UCR datasets.]

Figure 6.18: Comparison of 2LTSC and MTC in front of ground truth for test dataset

In addition to the high complexity of CAST and its sensitivity to its parameter, as explained in the literature review (Section 2.8.6), Figure 6.18 shows that the quality of MTC is superior to 2LTSC for most of the datasets. The main reason for this superiority is the more accurate approach used in the first step of MTC in comparison with 2LTSC. As mentioned, in the MTC model, SAX is used with APXDIST, which is superior to the MINDIST used in 2LTSC. As a result, the quality of MTC is increased after revising the pre-clusters in the second level. Moreover, in the third level of MTC, the sub-clusters are merged again, which makes the generated cluster structure more similar to the ground truth.

Comparing the accuracy of MTC in this experiment (Figure 6.18) with the results of MTC using the k-Medoids scheme (Figure 6.9), the accuracy of MTC varies slightly. This is because of the different cluster numbers in this experiment; as mentioned, the number of clusters here is based on the results of 2LTSC.

Another study performed in more than one step is the work presented by X. Zhang et al. (2011), discussed in the literature review (Section 2.8.6). As mentioned, they proposed a new multi-level approach for shape-based time-series clustering in which some candidate time-series are chosen and clustered. Candidates are chosen from a one-nearest-neighbour network based on the triangle distance between time-series. To choose the candidates, they analysed different orders of the neighbour network (graph). Then, hierarchical clustering is performed on the selected candidate time-series using DTW. To compare this approach (so-called graph-based) with the MTC model, the maximum quality of MTC clustering is compared against different orders of the nearest-neighbour network, as shown in Figure 6.19. To provide fair conditions, orders of 2 to 5 are considered for the graph-based approach, which provide a reasonable reduction in the second layer (see Appendix H).

[Figure: bar chart; y-axis: Accuracy; series: MTC Hier, Graph_based_1NN_O2, Graph_based_1NN_O3, Graph_based_1NN_O4, Graph_based_1NN_O5; x-axis: the 19 UCR datasets.]

Figure 6.19: Quality of clustering using graph-based approach in front of MTC

As the results show, in most of the cases (15 cases), MTC is superior to the graph-based algorithm. Additionally, it should be noted that in the graph-based approach, the first-level clustering is performed on raw time-series (which can generate noisy clusters) and requires building a nearest-neighbour graph, which is very costly. Nevertheless, it can be advantageous for datasets where similarity in time is essentially important, such as the Coffee dataset (see Figure 6.16). To summarize, the proposed model (MTC) can beat the rival approaches, even with lower time complexity.

6.2.6 Evaluation of MTC on Large Datasets

In order to further confirm the effectiveness of MTC, some experiments are carried out on large synthetic datasets. For this purpose, CC and CBF datasets with different cardinalities of up to 12,000 records are generated, and the clustering accuracy of MTC is computed for each dataset.

To evaluate the results of the proposed model on large datasets, MTC (with different merging schemes) was tested against conventional k-Medoids and the hierarchical approach (with single and average linkage). To achieve the best results for the rivals, the highest resolution of the time-series, i.e., the raw time-series, with the Euclidean and DTW distances is used for clustering by k-Medoids and the hierarchical approach. For fairness, the average quality of running MTC with two DTW constraint values is taken as the quality of MTC for all cardinalities of the dataset. The average quality of MTC with the different schemes is depicted in Figure 6.20 and Figure 6.21.

[Figure: line chart; y-axis: Accuracy (0–1); series: MTC Avg., k-Medoids, Hier-avg, Hier-single; x-axis: cardinality of dataset (90, 240, 660, 690, 720, 780, 900, 1200, 2400, 3600, 4800, 5400, 6000).]

Figure 6.20: Accuracy of MTC in front of other algorithms for CBF dataset

[Figure: line chart; y-axis: Accuracy (0–1); series: MTC Avg., k-Medoids, Hier-avg, Hier-single; x-axis: cardinality of dataset (600 to 12,000 in steps of 600).]

Figure 6.21: Accuracy of MTC in front of other algorithms for CC

As the results show, the quality of MTC is superior to the other algorithms. However, MTC fluctuates somewhat more on both datasets than the other algorithms (except hierarchical clustering with single linkage). This is the effect of, first, reporting the result as the average accuracy over different SAX parameter values, and second, using prototypes instead of the original time-series, i.e., the error of the second step in purifying the clusters. Nevertheless, for most cardinalities of the dataset, even the minimum quality of MTC is higher than that of the other algorithms. Therefore, it can be concluded that on most datasets MTC generates better results. Hence, for clustering large time-series datasets, there is no need to use very low-dimensional time-series; instead, clustering can be applied to smaller sets of high-dimensional time-series through prototyping. That is, in terms of accuracy, the cost of using representatives is much lower than that of dimension reduction.

6.2.7 Evaluation of MTC Using Internal Criteria

In this section, the accuracy of MTC is evaluated using internal criteria, which are more realistic than external criteria because class labels are not available in most real-world problems. Two datasets from financial systems, for which no ground truth exists, are chosen to evaluate the proposed model. The financial domain is well suited to clustering because it contains many tasks which need pre-processing or pattern recognition, for example, forecasting the stock market, currency exchange rates, and bank bankruptcies; understanding and managing financial risk; trading futures; credit rating; loan management; bank customer profiling; and money laundering analyses as core financial tasks for data mining (Nakhaeizadeh, Steurer, & Bartlmae, 2002). Corresponding to these tasks, clustering and outlier detection are appropriate approaches. For instance, clustering of time-series related to customer transactions can answer the following questions:

1. Can one determine a customer category given a historical record of the customer's transactions?

2. How do account balances move across the various customers?

As an example, a dataset related to a bank in Malaysia, called the Banking Transaction Dataset (BTD), is used for this experiment. A challenging feature of this dataset is that the time-series are very close to each other, and the clusters appear to have different densities. BTD consists of 6802 time-series, and each series is composed of 360 time points, each of which denotes one customer's daily transaction amount (balance or counts) over one year.

As explained, the compression-ratio can vary from 4 to 8; compression-ratio=6 is used here to symbolize the time-series and to be fair in terms of resolution. Then, MTC is applied to cluster the dataset. Unlike the CBF and CC datasets, the series of this dataset are not labelled in advance, so class labels cannot be used to verify the final clustering results. Moreover, considering the unsupervised nature of clustering and the unavailability of the number of clusters, the number of clusters must be determined by the user.

Essentially, the right number of clusters depends on the shape and scale of the distribution in the dataset, as well as the clustering resolution required by the user (Han & Kamber, 2011). For this dataset, a simple rule is used to determine the number of clusters: it is set to about √(n/2), where n is the number of time-series in the dataset. That is, each cluster contains around √(2n) time-series.
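For the BTD dataset, this rule works out as follows:

from math import sqrt

n = 6802                      # number of time-series in BTD
print(round(sqrt(n / 2)))     # about 58 clusters...
print(round(sqrt(2 * n)))     # ...of roughly 117 members each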

In Figure 6.22, some of the generated clusters and their prototypes are illustrated.

Figure 6.22: Sample clusters of time-series related to transactions of customers

The results show different groups of bank customers which are interpretable by banking-system experts. For example, the first cluster relates to customers whose salary is deposited into their account (or transferred by an employer) on the first day of each month. However, it is necessary to use some validity criteria to show the superiority of MTC. To show that the proposed model is more effective than the conventional algorithms used for clustering whole time-series, conventional k-Medoids is adopted for comparison. Different numbers of records (cardinalities) are drawn from the dataset to show how MTC performs in terms of accuracy. As mentioned, clustering is an unsupervised learning technique, and there are no predefined classes against which to compare the results of different clustering algorithms. To evaluate such clusters in terms of accuracy, the most common internal criterion, the Sum of Squared Error (SSE), is used (Han & Kamber, 2011). For each time-series, the error is the distance to the prototype of its nearest cluster. The following formula is used to calculate SSE:

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(x, p_i)^2        (6.1)

where x is a time-series in cluster C_i, p_i is the prototype of cluster C_i, and K is the number of clusters. Here, dist() is the DTW distance, which has been shown to be very useful for making natural clusters.
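The sketch below shows one way to realize Equation 6.1, paired with a plain O(d²) dynamic-programming DTW (a textbook formulation, not necessarily the exact variant used in the experiments):

import numpy as np

def dtw(a, b):
    # Classic DTW distance between two series, O(len(a) * len(b)).
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

def sse(clusters, prototypes):
    # Equation 6.1: squared DTW distance of every member to its cluster prototype.
    return sum(dtw(x, prototypes[i]) ** 2
               for i, cluster in enumerate(clusters) for x in cluster)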

The result illustrated in Figure 6.23 is the average SSE over 10 runs of MTC and k-Medoids with different SAX resolutions (different compression-ratios).

[Figure: line chart; y-axis: SSE; series: k-Medoids SAX 8, k-Medoids SAX 6, k-Medoids SAX 4, k-Medoids Raw, MTC SAX; x-axis: cardinality (100 to 2700).]

Figure 6.23: Accuracy of MTC in front of different cardinalities of BTD

The result in Figure 6.23 reveals that the SSE of MTC is lower than that of conventional clustering using the SAX representation. The results also show that the quality of MTC is high even in comparison with raw (not dimensionality-reduced) time-series. The superiority of MTC is due to its use of similarity in shape, while the conventional clustering algorithms use similarity in time. Although there is no significant difference between the quality of MTC and k-Medoids (Raw) at small cardinalities of the dataset, the results clearly show that as the cardinality increases, the difference in SSE also increases, which means that MTC works well with high-cardinality datasets. In fact, the benefit of using DTW shows its impact at higher cardinality, while the benefit of using raw time-series diminishes as the size of the dataset grows.

Furthermore, the experiment is repeated for another dataset with different characteristics. KLSE (Kuala Lumpur Stock Exchange) is End Of Day (EOD) data related to the stock exchange of Kuala Lumpur, Malaysia. The historical data is provided by www.klseeod.com, which retrieves it from public websites and publishes it freely for educational purposes.

The objective of this task is to cluster the stock-exchange time-series in order to find clusters of companies whose stock prices are similar in shape. The companies come from different markets such as the Main Market, ACE Market, Structured Warrants, ETF, and Bond markets, and belong to different categories, such as Property, Plantation, Mining, etc. Finding groups of similar companies can be used for prediction tasks. For this experiment, a set of time-series related to one year (2010) is chosen. The number of companies (time-series) in this experiment is 870, and each time-series consists of 240 points.

First, the time-series are normalized using Z-normalization to provide fair conditions.
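Z-normalization is the standard zero-mean, unit-variance rescaling, for example:

import numpy as np

def z_normalise(ts):
    # Zero mean and unit variance, so series with different price scales are comparable.
    ts = np.asarray(ts, dtype=float)
    return (ts - ts.mean()) / ts.std()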

Then, because it is not pre-defined in advance, the number of clusters is determined by the √(n/2) rule. Finally, MTC is applied to the whole time-series dataset. Figure 6.24 shows a sample of the constructed clusters of stock-exchange time-series.


Figure 6.24: A sample of clustering of time-series related to KLSE

Different clusters generated by MTC are depicted in Figure 6.24. The clustering results show that, first, MTC can reveal the outliers in the data. For example, the second cluster includes only one time-series, which can be considered an outlier company or missing data. Second, as expected, MTC generates clusters of different companies with similar stock prices. However, as the figure shows, the generated clusters are not very clear and useful, because there are usually shifts in the stock prices.

Nevertheless, as explained, MTC is able to generate a dendrogram of the time-series using the prototypes of the second step, which gives experts insights in different domains. That is, using the dendrogram of prototypes, different patterns of stock prices are revealed, which can also be utilized in classification tasks. Additionally, at lower levels of the dendrogram, similar groups of time-series can serve as a tool for prediction. This capability comes from the strength of MTC in finding time-series that are similar in shape. Figure 6.25 shows a sample of clustering of companies based on the similarity in shape of their stock prices. This kind of dendrogram can be used to predict stock prices based on similar companies in a cluster.

Figure 6.25: A sample of clustering of companies based on their stock exchange data in 2010

Figure 6.26 shows the handling of shifts in the stock-exchange time-series of different companies. For example, two companies (AFFIN and AFG) whose series are very similar in shape (but not in time) are shown in Figure 6.26. The changes in AFG are very similar to those in AFFIN, but with a time shift.


Figure 6.26: Handling shifts in the clustering of stock-exchange time-series of companies in the KLSE dataset

As explained in the literature review, SSE is used as an internal criterion to evaluate the results (see Section 2.7.2). Figure 6.27 shows the quality of clustering of the KLSE dataset.

[Figure: line chart; y-axis: SSE; series: k-Medoids SAX 8, k-Medoids SAX 6, k-Medoids SAX 4, k-Medoids Raw, MTC SAX; x-axis: cardinality (100 to 800).]

Figure 6.27: Accuracy of MTC in front of different cardinalities of KLSE

In Figure 6.27, different cardinalities of the dataset are chosen for clustering by MTC. As the results show, the quality of MTC is competitive with k-Medoids (using raw time-series). However, as expected, the quality trend shows that MTC outperforms it at high cardinalities. That is, MTC is more advantageous on large datasets than on small ones.

6.2.8 Evaluation of IMTC

An interactive version of MTC, so-called IMTC, was introduced in Section 4.7. IMTC alternates the splitting and merging steps of MTC: in each iteration, only one pre-cluster is selected to be revised and dispersed. Therefore, the results of the first iterations may be poor, but they are gradually refined and improve.
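A highly simplified sketch of this loop is shown below; pick_pre_cluster, purify, and merge are hypothetical stand-ins for the corresponding MTC steps, not the actual interfaces of the implementation:

def imtc(pre_clusters, iterations, pick_pre_cluster, purify, merge):
    # One pre-cluster is revised per iteration; the interim result improves gradually.
    prototypes = []
    for _ in range(iterations):
        cluster = pick_pre_cluster(pre_clusters)   # next pre-cluster to disperse
        prototypes += purify(cluster)              # split it into pure sub-clusters
        yield merge(prototypes)                    # interim clustering shown to the user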

This experiment demonstrates the feasibility of running IMTC in its interactive manner. The CBF dataset is utilized to evaluate the accuracy of IMTC. In the first step of IMTC, the SAX representation is utilized with APXDIST as the distance measure, which was shown to be effective. Five sample cardinalities are chosen to show the performance of IMTC on large time-series datasets. The results are depicted in Figure 6.28.

[Figure: line chart; y-axis: Accuracy (0–1.2); series: cardinalities 1200, 2400, 3600, 4000, 6000; x-axis: iterations (1 to 35).]

Figure 6.28: Interactive version of Multi-Step Time-Series Clustering (IMTC) on CBF dataset

As the results show, the user obtains better results after only a few iterations. As expected, there are two notable points about this approach. First, users see reasonable results after only a few iterations; that is, it is guaranteed that the user gets better results after the first iterations of the algorithm. Second, using IMTC, the final result can be even better than MTC, at the expense of time. This is because the clusters are repeatedly split, their resolution is increased, more accurate distances between their objects are calculated, and they are finally merged. This can continue until all low-resolution time-series have been replaced with high-resolution time-series (e.g., up to 35 iterations) and the distance matrix between them has been calculated with a more accurate distance measure (e.g., DTW). However, the question is why it does not reach the maximum accuracy after a while. In fact, maximum accuracy is not reachable even if the user waits forever (or may occur only by chance), because, first, the error rate of the second step hardly reaches zero, owing to the data reduction process in this step. Second, the quality is calculated against the ground truth (natural clusters); that is, even if all the distances among all time-series in the dataset were calculated precisely, the quality of the clusters would be slightly less than 1, because the ground truth is the type of time-series generated synthetically, not labels gained by clustering the data with a perfect distance measure such as DTW.

To compare IMTC with competitive and similar approaches, the IK-Means model (discussed in Section 2.8.6) is implemented and used. IK-Means works interactively: in each iteration, the resolution of the time-series is doubled and the centroids from the previous iteration are reused. Here, k-Means is chosen as the algorithm with PAA as the representation for testing the results. The CBF dataset is used to show the results of IK-Means. Figure 6.29 shows the accuracy of applying IK-Means on the CBF dataset.

[Figure: line chart; y-axis: Accuracy (0–0.8); series: CBF 1200, CBF 2400, CBF 3600, CBF 4000, CBF 6000; x-axis: dimensions of the time-series in each iteration (d=2, 4, 8, 16, 32, 64, 128, Raw).]

Figure 6.29: Quality of interactive method (IK-Means) for different resolutions of data

As Figure 6.29 shows, in this approach (IK-Means), the number of iterations to the termination point is smaller than in IMTC (Figure 6.28). This is because of the dependency of IK-Means on the length (dimension) of the time-series (as discussed in Section 2.8.6). Moreover, because the centroids of k-Means in each iteration (with higher-resolution time-series) are provided by the previous iteration (with lower-resolution time-series), the number of internal iterations in k-Means decreases with each run. For example, running k-Means on the CBF dataset with 2400 objects takes around 12 iterations at the lowest resolution (i.e., dimensions=2) and decreases to around 4 iterations at the highest resolution (i.e., dimensions=128). However, as the results of IK-Means show (Figure 6.29), in comparison with IMTC (Figure 6.28), even at the highest resolutions of the time-series, the maximum quality of IK-Means is less than that of IMTC. That is, the maximum accuracy of clustering using IK-Means is around 70%, whereas it is around 90% using IMTC. This is because the IMTC model finds time-series that are similar in shape, which handles shifts, and because the pre-clustering step in IMTC decreases the effect of outliers (as explained in Section 6.4.2).

6.3 Scalability Evaluation

Scaling and performance are often considered together in data mining. The problem of scalability in the data mining community is not only how to process large datasets, but how to do so within a useful timeframe. Analogous to conventional clustering approaches, MTC simply groups similar time-series; hence, its efficiency depends on the distance measurement, the representation process, the prototype calculation, and the clustering algorithm. In the following subsections, the space complexity and time complexity of MTC are analysed.

6.3.1 Space Complexity (Space Utilization)

Space utilization means the amount of memory required to run the algorithm. In the first step of the model, only the reduced time-series are stored, so the required space depends on the amount of data (the time-series dataset) and the compression-ratio; the storage that MTC needs for the first step is O(Nr + Cr) (the reduced series plus the C cluster centres), where N is the number of time-series, C is the number of clusters (pre-clusters), and r is the length (dimensions) of the time-series after SAX transformation. In the second step, the raw time-series (in the worst case) are used as the highest resolution for refining (purifying). However, because only one cluster of time-series is loaded for refining, and only its prototypes are kept for the next step afterwards, the storage that MTC needs is O((N/C)d + C'd), where N/C is the approximate number of time-series of length d in each cluster, and C' is the number of sub-clusters (prototypes) generated in the second step. The third step of clustering is the merging process of the prototypes and needs only O(C'd) storage. Therefore, because all these steps are performed sequentially and d << N, the overall space requirement is dominated by the first step, i.e., O(Nr).

6.3.2 Time Complexity

Time efficiency is the amount of time required to process the data. It depends on many factors, such as the size of the data, the speed of the machine on which the program runs, the source code, and the compiler. These factors may vary from one machine to another; hence, only the number of times the instructions are executed is counted. The computing time is therefore calculated as follows:

Considering T(N) as the computing time of MTC for an input of size N (the worst-case time complexity of the algorithm), T(N) equals the number of times the instructions are executed. The overall computational complexity of MTC depends on the time required to construct the first-step clusters, to refine the clusters in the second step, and to merge the prototypes in the last step of the clustering algorithm. In the following, the three steps are analysed sequentially.

Step one: This step includes dimension reduction and pre-clustering. Suppose that the number of time-series is N and the size (length) of each time-series is d, which changes into r after the dimensionality reduction process, where r < d. First, all time-series are dimensionality reduced; the complexity of dimensionality reduction using SAX is O(Nd). Then, Ek-Modes clustering is applied to the dimensionality reduced time-series. The time required to make the pre-clusters depends on the dimensionality of the time-series at hand, the distance measure, and the number of iterations for which it runs until convergence. Since dimensionality reduced time-series are used, the complexity of the APXDIST distance measurement is linear in the length of the transformed (dimensionality reduced) time-series, i.e., O(r). Each iteration of Ek-Modes takes O(CN) distance computations to identify the nearest centres and compute the new centres. Hence, the time complexity of Ek-Modes is O(ICNr), where C is the number of clusters, I is the number of iterations taken to converge, r is the dimension of the time-series (the number of points in the dimensionality reduced time-series, r < d), and N is the number of instances. This complexity is modest, that is, the required time is linear in the number of time-series (N), because I is not very large: most changes happen in the first iterations, and the number of iterations to convergence is not very high when running on dimensionality reduced time-series (as discussed in Section 4.3.4).

Step two: Suppose that the number of time-series is N and the number of pre-clusters constructed in the pre-clustering step is C. This value can be taken either equal to k, the number of class labels (defined by the user), or equal to √(N/2) (Han & Kamber, 2011), where C << N. The complexity of PCS in the best case is the same as that of the CAST algorithm, which is a bit more than O((N/C)²) (Ben-Dor et al., 1999), and in the worst case is O((N/C)³) (Bellaachia et al., 2002). Considering the complexity of the Euclidean distance as O(d), the complexity of PCS on one pre-cluster is O((N/C)²d). Since there are C such partitions, the overall complexity becomes O((N²/C)d). After using PCS to partition the pre-clusters (the clusters generated in the first step), C' sub-clusters are generated, where C' = qC and q ≥ 1. Usually q is a small number because it reflects the number of incorrect members or outliers in the clusters.

Step three: Suppose C' is the number of sub-clusters produced in the second step, which reduces the size of the dataset by a reduction-factor f defined based on the reduction-rate (see Section 5.3.2.1 for the definition) as:

f = 1 - reduction\text{-}rate,  so that the number of prototypes is  C' = Nf        (6.2)

Then, if the final clustering scheme is, for example, a hierarchical clustering (e.g., average linkage), its complexity is O(N²) in the number of input time-series N (Hartigan, 1975; Murtagh, 1985), with O(d²) for DTW between each pair of time-series. Although there are some "lower bounds of DTW" and "DTW with constraints" which are relatively cheaper than complete DTW in time complexity, i.e., O(wd) for a warping window of width w, as explained in Section 4.7, in the worst case its complexity is O(d²), where d equals the length of the time-series (Keogh & Ratanamahatana, 2004). Therefore, the total computation of the merging step is O((Nf)²d²).
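A constrained DTW of this kind can be sketched as follows (a Sakoe-Chiba band for equal-length series, shown as an assumption-level illustration):

import numpy as np

def dtw_banded(a, b, w):
    # DTW restricted to a band of width w around the diagonal: O(w*d) instead of O(d^2).
    n = len(a)                                     # assumes len(a) == len(b)
    D = np.full((n + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(n, i + w) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]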

Overall complexity: Substituting Nf (i.e., the number of prototypes) into the equation of the third step, the overall time required for the three steps is

T_{MTC} = O( Nd + ICNr + (N^2/C)d + (Nf)^2 d^2 )        (6.3)

Given the complexity of MTC (using the hierarchical scheme) as T_{MTC}, and the conventional hierarchical algorithm, which is of order T_{Hier} = O(N^2 d^2), the comparison of MTC with the hierarchical algorithm is expressed by the ratio of complexities:

\rho = T_{MTC} / T_{Hier}        (6.4)

Therefore, the ratio of the complexity of MTC to that of the hierarchical algorithm is calculated as:

\rho = ( Nd + ICNr + (N^2/C)d + (Nf)^2 d^2 ) / ( N^2 d^2 )        (6.5)

Because f < 1, then f² < f; cancelling out the common factor N²d², the ratio of the complexity of MTC to the hierarchical algorithm becomes:

\rho \le 1/(Nd) + ICr/(Nd^2) + 1/(Cd) + f        (6.6)

Considering, for example, the dimension reduction r = d/4 (i.e., compression-ratio=4, e.g., SAX4), and C = √(N/2), then

\rho \le 1/(Nd) + I/(4d\sqrt{2N}) + \sqrt{2}/(d\sqrt{N}) + f        (6.7)

Because f << 1 (based on Equation 6.2) and I is small (as explained in Section 4.3.4), ρ << 1; that is, ρ is always less than 1. Therefore, it can be concluded that the complexity of MTC is better than that of the hierarchical algorithm.

6.4 Sensitivity Evaluation

In this subsection, a sensitivity analysis of MTC is performed with respect to its parameters: the initialization of the clustering and the resolution of the time-series in the first step. The synthetic datasets and all datasets in the UCR repository are used again to evaluate the sensitivity of the algorithm under different conditions. In the following, the effect of random initialization in the first step is discussed first.

6.4.1 Effect of Random Initialization

The quality of the MTC clustering model depends in part on the quality of the pre-clusters made in the first step. In the first step, Ek-Modes is used, whose result quality depends on the choice of the initial centres, as in other partitioning algorithms; that is, MTC uses random initialization in the first step. Therefore, since it may fall into a local minimum, the best results are not necessarily generated in the first step (the process of making the pre-clusters), and the clustering results may not be stable. To test the stability of the clustering results, MTC is run on 13 different cardinalities of the CBF and CC datasets. MTC is performed numerous times, and its quality is reported to evaluate the clustering model. Statistical values such as the minimum, maximum, mean, median and standard deviation are used to determine the actual quality of the clustering approach. MTC is carried out 100 times with random initializations. To remove the effect of the SAX parameters, the same compression-ratio of SAX is used throughout this experiment. The results of clustering are depicted in three boxplots in Figure 6.30, Figure 6.31 and Figure 6.32.
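The reporting pattern for such a stability test is sketched below; run_mtc is a hypothetical stand-in for one complete MTC run returning an accuracy score:

import numpy as np

def stability_report(run_mtc, dataset, runs=100):
    # Summary statistics of accuracy over repeated runs with random initialization.
    acc = np.array([run_mtc(dataset) for _ in range(runs)])
    return {"min": acc.min(), "max": acc.max(), "mean": acc.mean(),
            "median": np.median(acc), "std": acc.std()}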

[Figure: boxplot; y-axis: Accuracy (0–1.2); x-axis: cardinality of dataset (90, 240, 660, 690, 720, 780, 900, 1200, 2400, 3600, 4800, 5400, 6000).]

Figure 6.30: Quality of clustering of CBF using MTC (k-Medoids scheme) with random initialization

[Figure: boxplot; y-axis: Accuracy (0–1.2); x-axis: cardinality of dataset (60, 240, 480, 600, 1200, 1800, 2400, 3000, 3600, 4200, 4800, 5400, 6000).]

Figure 6.31: Quality of clustering of CC using MTC (k-Medoids scheme) with random initialization

[Figure: boxplot; y-axis: Accuracy (0–1).]

Figure 6.32: Sensitivity of MTC on UCR dataset with random initialization (SAX4)

The plots clearly suggest that some datasets are sensitive to the random initialization of centroids in the first step of clustering and some are not. However, in most cases the quality is not very sensitive to random initialization. This is mostly because of using lower-dimensional time-series in the first step, which helps avoid the trap of local minima, consistent with the findings of Ding et al. (2002). Moreover, a closer look at the different datasets in the UCR repository (see Figure 6.32) shows that for most datasets, MTC is stable and robust to random initialization, especially on large datasets (datasets with high cardinalities), for example the Wafer and Yoga datasets. Similarly, the graph for CC shows that this variation gradually declines at high cardinalities of the dataset. As a result, the random initialization of the first level may slightly affect the final quality on some datasets, which is unavoidable. From another point of view, clustering algorithms are usually performed many times to get the best results, so this is easily addressed with a few runs. Moreover, generating several structures of the data is not necessarily harmful when clustering is performed for the purpose of determining the structure.

6.4.2 Effect of SAX Parameters

As mentioned, the Ek-Modes algorithm is used with time-series represented by SAX in the first step of the proposed model. Using SAX as the representation requires setting some parameters, one of which is the compression-ratio. The effect of different parameters on the first step of MTC was investigated in Section 5.3.1.1; in the following, the effect of the compression-ratio on MTC as a whole is investigated. To remove the effect of random initialization in this experiment, the maximum accuracy for each compression-ratio is considered. The results for different compression-ratios on the UCR datasets and the CBF dataset are illustrated in Figure 6.33 and Figure 6.34 respectively.

[Figure: bar chart; y-axis: Accuracy (0–1.2); series: MAX MTC SAX 8, MAX MTC SAX 6, MAX MTC SAX 4; x-axis: the 19 UCR datasets.]

Figure 6.33: Quality of clustering of UCR dataset using MTC with three different compression-ratios of SAX

[Figure: line chart; y-axis: Accuracy (0–1.2); series: cardinalities 60, 240, 540, 570, 1200, 2400; x-axis: compression-ratio (SAX 8 down to SAX 2).]

Figure 6.34: Quality of clustering of various cardinalities of CBF using MTC with seven different compression-ratios of SAX

The results of both experiments (Figure 6.33 and Figure 6.34) show that the resolution of the time-series used in the first step of MTC does not have a critical effect on accuracy. Although clustering a specific dataset represented at a specific resolution may produce results that are slightly better or worse than expected, in general the results are very close. For example, Figure 6.20 shows that the average quality on CBF is around 0.8 and is relatively stable as the resolution increases. This demonstrates that although the choice of the SAX parameter affects the final results, it does not have a very significant impact on clustering quality. Surprisingly, quality may even decline at some higher resolutions of the time-series (e.g., SAX2) for some datasets, as seen in Figure 6.34. This is the effect of the presence of noise and outliers at higher resolutions.

6.5 Chapter Summary

In this chapter, the results obtained by applying MTC to different datasets were evaluated visually and objectively. It was shown visually that clustering can be applied to a large time-series dataset to generate a dendrogram of meaningful and accurate clusters using MTC. Different evaluation methods were used to evaluate MTC extensively; in experiments on various datasets, MTC outperforms other conventional clustering algorithms. It was shown that accurate clusters are made using pre-clustering and prototyping. The results were also compared with other multi-step methods to show the model's strength; the proposed method achieves the best clustering results compared with several other popular time-series clustering methods. Moreover, MTC was applied to large time-series datasets and the obtained results were discussed. It was revealed that the user does not need very low-dimensional time-series for clustering large datasets; instead, clustering can be applied to smaller sets of high-dimensional time-series through the prototyping process. That is, in terms of accuracy, the cost of using representatives is much lower than that of dimension reduction. Subsequently, MTC was applied to two datasets for which class labels were not available. The results show that MTC can find clusters of time-series which are similar in shape, especially in large datasets. As an extension of MTC, the interactive version of MTC (IMTC) was applied to some datasets to show the scalability of the algorithm. It was shown that reasonable results are obtained after only a few iterations, and that the quality of the final result can be even better than MTC at the expense of time. The space and time complexity of the proposed model were computed; the computation shows that the execution time of the model is acceptable. Finally, the sensitivity of MTC was evaluated with respect to random initialization and parameter settings.

7.0 CONCLUSION

7.1 Introduction

This study demonstrates the utility of a multi-step clustering (MTC) model for time-series clustering. Instead of clustering dimensionality reduced time-series data (or, naively, raw data) using traditional clustering approaches, MTC clusters the time-series data in a multi-step manner, which results in substantially higher accuracy at a moderate cost in time complexity. This chapter concludes the research on providing a support environment for novice researchers working on time-series clustering. A summary of the findings, the limitations of the research, and recommendations are given in the following subsections.

7.2 Summary of Results and Findings

First, the main reasons for low quality in time-series clustering are stated, based on all the findings discussed in the literature review (Chapter 2) and the evaluation chapter (Chapter 6). The reasons are drawn by answering the following question:

What are the main reasons for low-quality time-series clustering?

1. Use of dimensionality reduction approaches throughout the whole clustering process (although this is done to remedy the high complexity of distance measurement between time-series).

2. Use of insufficiently accurate or inappropriate distance measures to calculate the dissimilarity between time-series (mostly because of the high complexity of accurate metrics).

3. Use of conventional static clustering algorithms that are not suitable for time-series data (mostly because of the high complexity of state-of-the-art algorithms).

The proposed models in this study can improve the quality of clustering while maintaining clustering efficiency, using the principles of pre-clustering and cluster-revising. In Chapter 5, the effect of dimension reduction on clustering was described: dimensionality reduction can lead to data being overlooked and to different results for different compression ratios. The factors that make MTC accurate were shown through a discussion of the different steps of the model. Moreover, using the evaluation methods described in Chapter 6, MTC was shown to work much better than other algorithms. The results were compared with the ground truth, as the natural structure of the data, to show the meaningfulness of the clustering. Finally, MTC was shown to be practical and usable through real-world samples.

The points that make MTC superior to other algorithms are briefly summarized by answering the following question: Why is MTC more accurate than other algorithms?

1. Time-series are clustered as low-resolution data in the first step. Therefore, the

algorithm efficiently finds the pre-clusters, avoids local minima, and is robust to

outlier time-series.

2. Given that a raw time-series or a very high-resolution time-series (in the case of

noisy time-series) is used in the second step of MTC (instead of low-resolution

time-series in the first step), the accuracy of the final results is very high.

3. An accurate distance metric (DTW) can be utilized in this process that handles

all shifts inside the time-series, resulting in very accurate clustering, because of

the multi-step manner of the proposed model (given the high complexity of

DTW, its use is infeasible on large datasets in other methods).

4. Using prototypes prepared in the middle step of MTC, the user has a chance to

choose their final clustering scheme based on the domain and create more

meaningful clusters, i.e., an algorithm capable of making arbitrarily shaped

clusters.

Finally, the researcher is able to deduce the following conclusions:

1. Dimensionality reduction is a common approach for time-series clustering;

however, it is not always a perfect approach (in clustering) to remedy the cost of

210 the high complexity of accurate dissimilarity measurement among time-series.

Instead, a more sophisticated clustering algorithm such as MTC can be utilized

to cluster high-dimensional time-series without overlooking data or losing accuracy.

2. Although all time-series in each sub-cluster are represented by only a few

representative time-series, experiments show that this approximation

(approximated sub-clusters) has less destructive effect on the accuracy of the

final clusters compared with the use of approximate time-series.

3. The proposed model (MTC) is a proper approach for data which are large in size

and have shifts and noise in their time-series.

7.3 Achievement of the Objectives

The following are the objectives of the research:

1. To propose and develop a new clustering model that accepts large raw time-

series data as input, and generates accurate clusters without excessive execution time. In order to fulfill this objective, the following methods need to be

developed.

a. Developing a distance measure for similarity calculation.

b. Developing a clustering approach for approximate clustering of the time-

series data transformed to symbolic representation.

c. Developing a method to dynamically split the pre-clusters into pure

clusters.

2. To extend the proposed model, enabling it to run interactively.

3. To evaluate the capability of the proposed methods in improving the accuracy.

A multi-step clustering approach (MTC) was proposed to accurately cluster time-series, in order to achieve the first objective. The model was designed and implemented, and its improvement in accuracy was shown through an extensive evaluation on various domains.

The approaches used and the outcomes are as follows:

 A pre-clustering method was proposed to rapidly obtain the approximate clusters

(of time-series represented by SAX). The method greatly reduces the search area for accurate distance measurement in the purifying step. A distance measure (APXDIST) was then developed to calculate the distance between time-series represented by SAX, in order to create the pre-clusters. The effectiveness of APXDIST on dimensionality reduced data was evaluated in Section 5.3.1.2.

 An extended k-Modes algorithm (Ek-Modes) was developed to create the pre-

clusters. The results showed that the combination of the APXDIST and Ek-

Modes methods generates more accurate pre-clusters for MTC (and hence

IMTC) than other approaches (Section 5.3.1.3).

 A purifying method (PCS) was proposed to check the pre-clusters and split them into purer clusters based on the ED distance measure. This process provides very similar groups of time-series that can be represented precisely by one or a few prototypes (Section 5.3.2).

The proposed model was extended to achieve the second objective. Considering the desirability of an interactive clustering model, IMTC was proposed to generate better results over time. Exploiting the multi-step property of MTC and the concept of "split and merge", MTC was extended to IMTC. IMTC is very practical for users who wish to see an approximate clustering at any point in the process. The results showed that the accuracy of the clusters improved even further when IMTC was used instead of MTC, but at the expense of time.

To achieve the third objective, an extensive evaluation was performed to demonstrate the superiority of the proposed model. The following aspects were considered to show that the proposed model is not biased towards a particular dataset, data size, parameter setting, or accuracy measure.

• The model was applied to datasets of different cardinalities from various domains (taken from the largest published time-series repository in the world).

• Synthetic datasets of different cardinalities were generated to show how cluster accuracy changes with data size.

• The nine most widely used validity indices in the literature were used to calculate the accuracy of the clustering structure (one such index is sketched after this list).

• Various parameter ranges were used to show the sensitivity of the system.
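For illustration, here is a minimal sketch of one classic external validity index, the Rand index, offered only as an example of scoring a clustering against ground-truth labels, not as the evaluation code used in this study.

    from itertools import combinations

    def rand_index(labels_true, labels_pred):
        """Fraction of object pairs on which the two partitions agree."""
        agree = total = 0
        for (t1, p1), (t2, p2) in combinations(list(zip(labels_true, labels_pred)), 2):
            agree += (t1 == t2) == (p1 == p2)   # pair treated consistently?
            total += 1
        return agree / total

    # Identical partitions up to relabeling score 1.0.
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))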

7.4 Contributions

There are many approaches to the clustering of time-series datasets; however, most focus on speeding up the clustering by reducing the dimensionality of the time-series and applying conventional clustering algorithms, which leads to low quality when dealing with large amounts of time-series data. This common problem of low-quality output was identified. Then, focusing on the components of time-series clustering, a new multi-step clustering model was proposed for the accurate clustering of time-series datasets. Accordingly, the contributions of this thesis are outlined as follows:

• A new multi-step model for time-series clustering (MTC)

• A new similarity measure compatible with SAX (APXDIST)

• Identification and customization of the best clustering algorithm compatible with the symbolic representation of time-series (Ek-Modes)

• A new approach for splitting approximate clusters into pure sub-clusters (PCS)

• A new model for interactive clustering of time-series data (IMTC)

A new clustering model (MTC): The main objective of this study is to find accurate clusters. Accuracy here means very pure and meaningful clusters that explain the structure of the data precisely, presented as a hierarchy that is understandable to human beings. The main contribution of this study is the proposal of a new clustering model to achieve this objective. This model (MTC) can find accurate clusters of large sets of time-series that are similar in shape, and the clusters can be represented visually. To the best of the researcher's knowledge, MTC is the first model designed for large time-series datasets that obtains meaningful clustering results. The method is performed in multiple steps and has the following features:

• Very accurate and interpretable: Using high-resolution time-series and a high-precision distance measure (DTW; a minimal sketch of it follows this list), clusters are constructed accurately. The model presents the structure of the data (time-series similar in time and shape) hierarchically, which is more usable and understandable for users.

• Practical: Using dimensionality-reduced data in the first step of MTC, the data are clustered approximately (this step is necessary to reduce memory requirements, because the raw time-series cannot all fit in main memory). Large datasets are then clustered with fewer distance calculations than conventional algorithms require, by exploiting the approximated clusters.

• Flexible for arbitrarily shaped clusters: In the last step of MTC, prototypes of the pure sub-clusters are used. As a result, the number of time-series in the third step is much smaller than in the original data, enabling MTC to work with all types of clustering algorithms to produce the final clusters. The method is also suitable for different types of clusters (clusters of various densities and non-globular clusters). Consequently, the user can choose the final clustering algorithm, giving them the chance to use the best algorithm for the domain at hand and the shape of the desired clusters.

• Visualization: Given that the user has access to the intermediate results (prototypes), which are very small in comparison with the whole dataset, the results can be shown visually (e.g., as a dendrogram). The user can then select a part of the data (e.g., by removing some outliers) for final clustering, making the clustering results more meaningful and practical.
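For reference, the following is a minimal sketch of the classic dynamic time warping recurrence used as the high-precision measure above; production implementations typically add a warping window (e.g., Sakoe-Chiba) and lower bounding for speed, neither of which is shown here.

    import numpy as np

    def dtw(x, y):
        """Classic O(nm) dynamic time warping distance between two 1-D sequences."""
        n, m = len(x), len(y)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = (x[i - 1] - y[j - 1]) ** 2
                cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                     cost[i, j - 1],       # deletion
                                     cost[i - 1, j - 1])   # match
        return float(np.sqrt(cost[n, m]))

    # Shifted but identically shaped sequences warp to a distance of zero.
    print(dtw(np.array([0.0, 1.0, 2.0]), np.array([0.0, 0.0, 1.0, 2.0])))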

A new distance measure (APXDIST): In the first step of the time-series clustering, a symbolic approximation approach was used to find the approximate clusters. A similarity measure, APXDIST, which measures the dissimilarity among SAX-represented time-series more accurately than existing approaches, was developed to calculate the similarity of the approximated time-series. The proposed distance measure answers the first research question of this study (Q1) and provides researchers with new knowledge on how to find accurate approximate clusters.

A new algorithm for clustering approximated time-series (Ek-Modes): Although k-Modes is not a new algorithm, its compatibility with categorical objects was exploited in this study for the first time to develop a clustering algorithm (Ek-Modes) for the approximated time-series (a minimal k-Modes skeleton is sketched below). The evaluation results indicated higher quality compared with competing approaches, answering the second question of this research (Q2). The significance of this finding lies in its application as a clustering approach for standalone systems that work with symbolic representations of time-series data (e.g., SAX).
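This skeleton is a sketch rather than the thesis implementation: each symbol position is treated as a categorical attribute, assignment minimizes the count of mismatching positions, and each centre is updated to the per-position mode. Ek-Modes' specific extensions (e.g., its coupling with APXDIST and its initialization) are not reproduced.

    import numpy as np

    def k_modes(words, k, n_iter=20, seed=0):
        """Cluster equal-length symbolic words; returns (modes, assignments)."""
        rng = np.random.default_rng(seed)
        words = np.asarray(words)
        modes = words[rng.choice(len(words), size=k, replace=False)].copy()
        for _ in range(n_iter):
            # Assignment step: simple-matching distance (count of mismatches).
            mismatches = (words[:, None, :] != modes[None, :, :]).sum(axis=2)
            assign = mismatches.argmin(axis=1)
            # Update step: per-position mode of each cluster's members.
            for c in range(k):
                members = words[assign == c]
                if len(members):
                    for pos in range(words.shape[1]):
                        vals, counts = np.unique(members[:, pos], return_counts=True)
                        modes[c, pos] = vals[counts.argmax()]
        return modes, assign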

A new post-processing approach for approximated clusters (PCS): A research question was posed in this study (Q3): "How can constructed clusters be revised as a post-clustering action to achieve better results?" A new approach, called PCS, for splitting approximate clusters into pure sub-clusters was proposed. PCS is a post-processing approach that creates sub-clusters (from pre-clusters) sequentially with a dynamic affinity threshold; a sketch of this splitting idea follows. The approach is important from two points of view. First, it can serve as a post-processing method for other models that work with multi-resolution time-series or multi-precision distance measures. Second, it is a dynamic approach that does not require any parameters, which is very desirable in clustering.
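As a heavily hedged illustration of the splitting idea only: the thesis's actual threshold rule is not restated here, so the median distance below is purely an illustrative stand-in for a data-driven, parameter-free threshold.

    import numpy as np

    def split_pre_cluster(series):
        """Peel sub-clusters off a pre-cluster (2-D array, one series per row)."""
        remaining = list(range(len(series)))
        sub_clusters = []
        while remaining:
            seed = remaining[0]
            dists = np.linalg.norm(series[remaining] - series[seed], axis=1)
            threshold = np.median(dists)        # data-driven, not user-supplied
            members = [i for i, d in zip(remaining, dists) if d <= threshold]
            sub_clusters.append(members)        # seed is always included (d = 0)
            remaining = [i for i in remaining if i not in members]
        return sub_clusters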

A new model for interactive clustering of time-series (IMTC): Designing the MTC model as a multi-step approach made it possible to extend it into an interactive clustering approach (Q4). Clustering can be carried out interactively using this model: the user can stop the process, check the results, and either let the process continue (if the results are not yet satisfactory) or terminate it (if the required accuracy has been met). Moreover, the method gives the user the chance to stop the process in the initial iterations and change the clustering parameters if the results are wrong. This feature is very important for data mining approaches, especially on large datasets. A minimal sketch of such an interaction loop follows.
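In this sketch, refine() is a placeholder for one MTC split-and-merge pass, not the thesis code; the generator simply hands control back to the user after every pass.

    def interactive_clustering(pre_clusters, refine, max_passes=10):
        """Yield successively refined clusterings; the caller decides when to stop."""
        clusters = pre_clusters
        for pass_no in range(max_passes):
            clusters = refine(clusters)     # one split-and-merge refinement pass
            yield pass_no, clusters         # hand control back to the user

    # Usage: inspect intermediate prototypes and break once they look right.
    # for pass_no, clusters in interactive_clustering(initial, one_mtc_pass):
    #     if satisfied_with(clusters):
    #         break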

An extensive evaluation: To the best of the researcher's knowledge, there are no published results for these clustering algorithms across such a range of datasets; this study is the first to test this range of datasets using different clustering approaches. Moreover, the effect of a representation method on time-series clustering, compared with clustering raw time-series, was investigated for both partitioning and hierarchical clustering. The evaluation provides a useful range of accuracies against which researchers working on time-series clustering can compare their results (Q5).

7.5 Limitations of the Current Study

Some limitations that need to be considered are as follows:

1) Because of the time constraints in achieving the main objective, this research focused only on methods for time-series of short and moderate length. The same problem (low accuracy) also exists for long time-series, and addressing it is worthwhile further research.

2) The model was not evaluated against different dimensionality reduction methods; rather, the SAX representation was adopted as the best representation method in the literature. More testing and analysis are required to determine the optimal representation method for the model.

3) When MTC is run as interactive clustering, the results are sometimes unstable. This phenomenon is also observed when running MTC on different cardinalities of the CBF and CC datasets. If time-series clustering is used as part of a more complex task (e.g., as a pre-processing step), this instability may cause problems and should be addressed by running the algorithm a few times and choosing the best answer. However, if time-series clustering is used as a tool for human understanding of a dataset, it may be beneficial to look at more than one clustering structure; moreover, an unstable algorithm often discovers interesting structures.

7.6 Recommendations and Future Directions

The researcher strongly believes that the model proposed in this study greatly improves on the accuracy of existing approaches to time-series clustering. It is the researcher's hope that this work will motivate researchers who need more accurate and meaningful clustering in other applications, including image categorization, pattern recognition, and fraud detection. However, a few aspects still require further study. The following recommendations for future work are suggested based on the limitations of this research area.

1. One or more prototypes are needed to represent a cluster of time-series, and an accurate approach for representing a cluster built with an elastic distance measure (e.g., DTW) is still needed. As mentioned in the literature review, certain approaches to the prototyping of time-series clusters are available; however, most of these approaches are quite expensive or inaccurate. As a result, developing an efficient and effective approach to defining a prototype for time-series clusters would be significant.

2. Accuracy evaluation of the clustering structure is still an open problem in the data mining community. Although several indices (internal and external) are available for evaluating clustering accuracy, visualization is usually used as a last resort. Visualizing clusters of static objects is relatively easy; however, visualizing time-series clusters is not a trivial task, especially for large datasets. Therefore, there is a need for accurate evaluation methods for time-series clustering.

3. The accuracy of clustering is tightly governed by the distance measure between time-series. However, accurately measuring the similarity between time-series with existing distance metrics is expensive in terms of execution time. Cheaper approximations of these measurements are therefore needed in time-series clustering.

218 REFERENCES

Aach, J., & Church, G. M. (2001). Aligning gene expression time series with time warping algorithms. Bioinformatics, 17(6), 495.

Abdulla, W., & Chow, D. (2003). Cross-words reference template for DTW-based speech recognition systems. TENCON 2003. Conference on Convergent Technologies for Asia-Pacific Region (Vol. 4, pp. 1576–1579). IEEE.

Abonyi, J., Feil, B., Nemeth, S., & Arva, P. (2005). Principal component analysis based time series segmentation-application to hierarchical clustering for multivariate process data. IEEE international conference on computational cybernetics (pp. 29– 31).

Aghabozorgi, S., Saybani, M. R., & Wah, T. Y. (2012). Incremental Clustering of Time-series by Fuzzy Clustering. Journal of Information Science and Engineering, 28(4), 671–688.

Aghabozorgi, S., Wah, T. Y., Amini, A., & Saybani, M. R. (2011). A new approach to present prototypes in clustering of time series. The 7th International Conference of Data Mining (Vol. 28, pp. 214–220). Las Vegas, USA.

Agrawal, R., Faloutsos, C., & Swami, A. (1993). Efficient similarity search in sequence databases. Foundations of Data Organization and Algorithms, 46, 69–84.

Alcock, R., & Manolopoulos, Y. (1999). Time-series similarity queries employing a feature-based approach. 7th Hellenic conference on informatics, Ioannina, Greece (pp. 1–9).

Alon, J., & Sclaroff, S. (2003). Discovering clusters in motion time-series data. Computer Society Conference on Computer Vision and Pattern Recognition (pp. 375–381). IEEE Comput. Soc.

Amigó, E., Gonzalo, J., Artiles, J., & Verdejo, F. (2009). A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4), 461–486.

Andreopoulos, B., An, A., & Wang, X. (2005). Clustering the internet topology at multiple layers. WSEAS Transactions on Information Science and Applications, 2(10), 1625–1634.

Andreopoulos, B., An, A., & Wang, X. (2009). A roadmap of clustering algorithms: finding a match for a biomedical application. Briefings in Bioinformatics, 10(3), 297–314. doi:10.1093/bib/bbn058

Ankerst, M., Breunig, M., & Kriegel, H. (1999). OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2), 40–60.

Antunes, C., & Oliveira, A. L. (2001). Temporal data mining: An overview. KDD Workshop on Temporal Data Mining (pp. 1–13). Citeseer.

219 Apostolico, A., Bock, M. E., & Lonardi, S. (2003). Monotony of surprise and large- scale quest for unusual words. Journal of Computational Biology, 10(3-4), 283– 311.

Aref, W. G., Elfeky, M. G., & Elmagarmid, A. K. (2004). Incremental, online, and merge mining of partial periodic patterns in time-series databases. Transactions onKnowledge and Data Engineering, 16(3), 332–342.

Aßfalg, J., Kriegel, H. P., Kroger, P., Kunath, P., Pryakhin, A., & Renz, M. (2008). T- Time: threshold-based data mining on time series. 24th International Conference on Data Engineering, 2008. ICDE 2008. IEEE (pp. 1620–1623). Ieee.

Aßfalg, J., Kriegel, H. P., Kröger, P., Kunath, P., Pryakhin, A., & Renz, M. (2006). Similarity search on time series based on threshold queries. Advances in Database Technology-EDBT 2006, 276–294.

Bagnall, A. J., & Janacek, G. (2005). Clustering Time Series with Clipped Data. Machine Learning, 58(2), 151–178.

Bagnall, A. J., Janacek, G., De la Iglesia, B., & Zhang, M. (2003). Clustering time series from mixture polynomial models with discretised data. Proceedings of the second Australasian Data Mining Workshop (pp. 105–120).

Bagnall, A. J., Ratanamahatana, C., Keogh, E., Lonardi, S., & Janacek, G. (2006). A Bit Level Representation for Time Series Data Mining with Shape Based Similarity. Data Mining and Knowledge Discovery, 13(1), 11–40.

Banerjee, A., & Ghosh, J. (2001). Clickstream clustering using weighted longest common subsequences. Proc. of the Workshop on Web Mining, SIAM Conference on Data Mining (pp. 33–40). Citeseer.

Bao, D. (2007). A generalized model for financial time series representation and prediction. Applied Intelligence, 29(1), 1–11.

Bao, D., & Yang, Z. (2008). Intelligent stock trading system by turning point confirming and probabilistic reasoning. Expert Systems with Applications, 34(1), 620–627.

Bar-Joseph, Z., Gerber, G., Gifford, D. K., Jaakkola, T. S., & Simon, I. (2002). A new approach to analyzing gene expression time series data. Proceedings of the sixth annual international conference on Computational biology (pp. 39–48). ACM.

Bellaachia, A., Portnoy, D., Chen, Y., & Elkahloun, A. G. (2002). E-CAST: a data mining algorithm for gene expression data. Workshop on Data Mining in Bioinformatics (pp. 49–54). Citeseer.

Ben-Dor, A., Shamir, R., & Yakhini, Z. (1999). Clustering gene expression patterns. Journal of computational biology, 6(3-4), 281–297.

Bergroth, L., & Hakonen, H. (2000). A survey of longest common subsequence algorithms. Seventh International Symposium on String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings. (pp. 39–48).

220 Beringer, J., & Hullermeier, E. (2006). Online clustering of parallel data streams. Data & Knowledge Engineering, 58(2), 180–204.

Berkhin, P. (2006). A survey of clustering data mining techniques. Grouping multidimensional data, 2(1), 25–71.

Berndt, D., & Clifford, J. (1994). Using dynamic time warping to find patterns in time series. AAAI94 workshop on knowledge discovery in databases (pp. 359–370).

Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers.

Bicego, M., Murino, V., & Figueiredo, M. (2003). Similarity-based clustering of sequences using hidden Markov models. Machine learning and data mining in pattern recognition, 2734(1), 95–104.

Biernacki, C., Celeux, G., & Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7), 719–725.

Bingham, E. (2001). Random projection in dimensionality reduction: applications to image and text data. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 245–250).

Bozkaya, T., Yazdani, N., & Özsoyoğlu, M. (1997). Matching and indexing sequences of different lengths. Proceedings of the sixth international conference on Information and knowledge management (pp. 128–135). ACM.

Bradley, P. S., Fayyad, U., & Reina, C. (1998). Scaling Clustering Algorithms to Large Databases. Knowledge Discovery and Data Mining, 9–15.

Brun, M., Sima, C., Hua, J., & Lowey, J. (2007). Model-based evaluation of clustering validation measures. Pattern Recognition, 40(3), 807–824.

Cai, Y., & Ng, R. (2004). Indexing spatio-temporal trajectories with Chebyshev polynomials. Proceedings of the 2004 ACM SIGMOD international, 599.

Caiani, E., Porta, A., Baselli, G., Turiel, M., Muzzupappa, S., Pieruzzi, F., Crema, C., et al. (1998). Warped-average template technique to track on a cycle-by-cycle basis the cardiac filling phases on left ventricular volume. Computers in Cardiology 1998 (pp. 73–76). IEEE.

Carpenter, G. A., & Grossberg, S. (1987). A massively parallel architecture for a self- organizing neural pattern recognition machine. Computer vision, graphics, and image processing, 37(1), 54–115.

Chakrabarti, S., Ester, M., Fayyad, U., Gehrke, J., Han, J., Morishita, S., Piatetsky- Shapiro, G., et al. (2006). Data mining curriculum. Intensive Working Group of ACM SIGKDD Curriculum Committee, April. Retrieved February 5, 2012, from http://www.acm.org/sigs/sigkdd/curriculum/CURMay06.pdf

221 Chan, F. K. P., Fu, A. W. C., & Yu, C. (2003). Haar wavelets for efficient similarity search of time-series: with and without time warping. IEEE Transactions on Knowledge and Data Engineering, 15(3), 686–705.

Chan, K., & Fu, A. W. (1999). Efficient time series matching by wavelets. 15th International Conference on Data Engineering, 1999. Proceedings (Vol. 15, pp. 126–133). IEEE.

Chan, P. K., & Mahoney, M. V. (2005). Modeling multiple time series for anomaly detection. Fifth IEEE International Conference on Data Mining (pp. 90–97). IEEE.

Chandrakala, S., & Chandra, C. (2008). A density based method for multivariate time series clustering in kernel feature space. IEEE International Joint Conference on Neural Networks IEEE World Congress on Computational Intelligence (2008) (pp. 1885–1890). IEEE.

Chaoji, V., Al Hasan, M., Salem, S., & Zaki, M. J. (2008). Sparcl: Efficient and effective shape-based clustering. Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on (pp. 93–102). IEEE.

Chen, L., & Ng, R. (2004). On the marriage of lp-norms and edit distance. Proceedings of the Thirtieth international conference on Very large data bases-Volume 30 (pp. 792–803). VLDB Endowment.

Chen, L., & Özsu, M. T. (2005). Using multi-scale histograms to answer pattern existence and shape match queries. Time, 2(1), 217–226.

Chen, L., Özsu, M. T., & Oria, V. (2005). Robust and fast similarity search for moving object trajectories. Proceedings of the 2005 ACM SIGMOD international conference on Management of data (pp. 491–502). ACM.

Chen, Q., Chen, L., Lian, X., & Liu, Y. (2007). Indexable PLA for efficient similarity search. Proceedings of the 33rd international conference on Very large data bases (pp. 435–446).

Chen, Y., Nascimento, M. A., Ooi, B. C., & Tung, A. K. H. (2007). Spade: On shape- based pattern detection in streaming time series. Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on (pp. 786–795). IEEE.

Chiş, M., Banerjee, S., & Hassanien, A. E. (2009). Clustering Time Series Data: An Evolutionary Approach. Foundations of Computational IntelligenceVolume 6, 6(1), 193–207.

Chiu, B., Keogh, E., & Lonardi, S. (2003). Probabilistic discovery of time series motifs. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 493–498). ACM.

Chu, S., Keogh, E., Hart, D., Pazzani, M., & others. (2002). Iterative deepening dynamic time warping for time series. Proc 2nd SIAM International Conference on Data Mining (pp. 195–212). Citeseer.

222 Chung, F. L., Fu, T. C., & Luk, R. (2001). Flexible time series pattern matching based on perceptually important points. Joint Conference on Workshop, 1–7.

Corduas, M., & Piccolo, D. (2008). Time series clustering and classification by the autoregressive metric. Computational Statistics & Data Analysis, 52(4), 1860– 1872.

Corradini, A. (2001). Dynamic time warping for off-line recognition of a small gesture vocabulary. IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, 2001. (pp. 82–89).

Dahlhaus, R. (1996). On the Kullback-Leibler information divergence of locally stationary processes. Stochastic processes and their applications, 62(1), 139–168.

Das, G., Gunopulos, D., & Mannila, H. (1997). Finding similar time series. Principles of Data Mining and Knowledge Discovery, 1263(1), 88–100.

Das, G., Lin, K. I., Mannila, H., Renganathan, G., & Smyth, P. (1998). Rule discovery from time series. Knowledge Discovery and Data Mining, 16–22.

De Gregorio, A., & Maria Iacus, S. (2010). Clustering of discretely observed diffusion processes. Computational Statistics & Data Analysis, 54(2), 598–606. doi:10.1016/j.csda.2009.10.005

Dembélé, D., & Kastner, P. (2003). Fuzzy C-means method for clustering microarray data. Bioinformatics, 19(8), 973–980.

Ding, C., He, X., Zha, H., & Simon, H. D. (2002). Adaptive dimension reduction for clustering high dimensional data. International Conference on Data Mining, 2002. ICDM 2002. (pp. 147–154). IEEE.

Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., & Keogh, E. (2008). Querying and mining of time series data: experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment, 1(2), 1542–1552.

Duan, G., Suzuki, Y., & Kawagoe, K. (2006). Grid Representation of Time Series Data for Similarity Search. The institute of electronic, Information, and communication Engineer.

Duan, J., Wang, W., Liu, B., & Xue, Y. (2005). Incorporating with recursive model training in time series clustering. The Fifth International Conference on Computer and Information Technology CIT05 (2005), 105–109.

Dunn, J. C. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Cybernetics and Systems, 3(3), 32–57.

Ester, M., Kriegel, H. P., & Sander, J. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International, 1996(6), 226–231.

223 Faloutsos, C., Ranganathan, M., & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. ACM SIGMOD Record, 23(2), 419–429.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37.

Fayyad, U., Reina, C., & Bradley, P. S. (1998). Initialization of iterative refinement clustering algorithms. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 194–198).

Fern, X. Z., & Brodley, C. E. (2004). Solving cluster ensemble problems by bipartite graph partitioning. Proceedings of the twenty-first international conference on Machine learning (p. 36). ACM.

Fink, E., & Pratt, K. B. (2003). Indexing of compressed time series. Data mining in time series databases (Vol. 57, pp. 43–65). World Scientific Publishing Singapore.

Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine learning, 2(2), 139–172.

Focardi, S. M. M., & others. (2005). Clustering economic and financial time series: Exploring the existence of stable correlation conditions. The Intertek Group (pp. 1–15). Citeseer.

Fowlkes, E., & Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. Journal of the American statistical association, 78(383), 553–569.

Frentzos, E., Gratsias, K., & Theodoridis, Y. (2007). Index-based most similar trajectory search. 23rd International Conference on Data Engineering, 2007. ICDE 2007. IEEE (pp. 816–825). IEEE.

Fu, T. C. (2010). A review on time series data mining. Engineering Applications of Artificial Intelligence, 24(1), 164–181.

Fu, T. C., Chung, F. L., Luk, R., & Ng, C. M. (2010). Financial time series indexing based on low resolution clustering. Proceedings of The 4th IEEE International Conference on Data Mining (ICDM-2004) (pp. 5–14). Citeseer.

Fu, T. C., Chung, F. L., Ng, V., & Luk, R. (2001). Pattern discovery from stock time series using self-organizing maps. Workshop Notes of KDD2001 Workshop on Temporal Data Mining (pp. 26–29). Citeseer.

Fu, T. C., Chung, F., Luk, R., & Ng, C. (2007). Representing financial time series based on data point importance. Engineering Applications of Artificial Intelligence, 21(2), 277–300.

Gan, G., & Ma, C. (2007). Data clustering: theory, algorithms, and applications. ASASIAM Series on Statistics and Applied.

Gavrilov, M., Anguelov, D., Indyk, P., & Motwani, R. (2000). Mining the stock market : which measure is best? Proceedings of the sixth ACM SIGKDD

224 international conference on Knowledge discovery and data mining (pp. 487–496). ACM.

Ge, X., & Smyth, P. (2000). Deformable Markov model templates for time-series pattern matching. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 81–90). ACM.

Geurts, P. (2001). Pattern extraction for time series classification. In Proceedings of Principles of Data Mining and Knowledge Discovery, 5th European Conference, Freiburg, Germany, Sept. 3–5 (pp. 115–127).

Ghysels, E., Santa-Clara, P., & Valkanov, R. (2006). Predicting volatility: getting the most out of return data sampled at different frequencies. Journal of Econometrics, 131(1-2), 59–95.

Gionis, A., & Mannila, H. (2003). Finding recurrent sources in sequences. Proceedings of the seventh annual international conference on Research in computational molecular biology (pp. 123–130).

Golay, X., Kollias, S., Stoll, G., Meier, D., Valavanis, A., & Boesiger, P. (1998). A new correlation-based fuzzy logic clustering algorithm for FMRI. Magnetic Resonance in Medicine, 40(2), 249–260.

Gordon, A. (1999). Classification. 1999. Chapman&Hall, CRC, Boca Raton, FL.

Goutte, C., Toft, P., Rostrup, E., Nielsen, F. Å., Hansen, L. K., & Nielsen. (1999). On Clustering fMRI Time Series* 1. NeuroImage, 9(3), 298–310.

Grass, J. (1996). Anytime algorithm development tools. ACM SIGART Bulletin.

Graves D, P. W. (2010). Proximity fuzzy clustering and its application to time series clustering and prediction. Proceedings of the 2010 10th International Conference on Intelligent Systems Design and Applications ISDA10 (pp. 49–54).

Gronau, I., & Moran, S. (2007). Optimal implementations of UPGMA and other common clustering algorithms. Information Processing Letters, 104(6).

Guan, H., & Jiang, Q. (2007). Cluster financial time series for portfolio. International Conference on Wavelet Analysis and Pattern Recognition (pp. 851 – 856).

Guha, S., Rastogi, R., & Shim, K. (1998). CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Record (Vol. 27, pp. 73–84). ACM.

Gullo, F., Ponti, G., Tagarelli, A., & Greco, S. (2009). A time series representation model for accurate and fast similarity detection. Pattern Recognition, 42(11), 2998–3014. doi:10.1016/j.patcog.2009.03.030

Gullo, F., Ponti, G., Tagarelli, A., Tradigo, G., & Veltri, P. (2011). A Time Series Approach for Clustering Mass Spectrometry Data. Journal of Computational Science, (2010). doi:10.1016/j.jocs.2011.06.008

225 Guo, C., Jia, H., & Zhang, N. (2008). Time Series Clustering Based on ICA for Stock Data Analysis. 2008 4th International Conference on Wireless Communications, Networking and Mobile Computing, 1–4. doi:10.1109/WiCom.2008.2534

Guo, H., Liu, Y., Liang, H., & Gao, X. (2008). An Application on Time Series Clustering Based on Wavelet Decomposition and Denoising. 2008 Fourth International Conference on Natural Computation, 1(4), 419–422. doi:10.1109/ICNC.2008.311

Gupta, L., Molfese, D. L., Tammana, R., & Simos, P. G. (1996). Nonlinear alignment and averaging for estimating the evoked potential. IEEE Transactions on Biomedical Engineering, 43(4), 348–356.

Gusfield, D. (1997). Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press.

Haigh, K., Foslien, W., & Guralnik, V. (2004). Visual Query Language: Finding patterns in and relationships among time series data. Seventh Workshop on Mining Scientific and Engineering Datasets, 324–332.

Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information, 17(2), 107–145.

Han, J., & Kamber, M. (2011). Data mining: concepts and techniques. Morgan Kaufmann, San Francisco, CA.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Harms, S. K., & Goddard, S. (2001). Data mining in a geospatial decision support system for drought risk management. Proceedings of the 1st national conference on digital government (pp. 9–16). Los Angeles, CA, 21–23 May: Citeseer.

Harrison, R. (2005). Novel Hybrid Hierarchical-K-means Clustering Method (H-K- means) for Microarray Analysis. 2005 IEEE Computational Systems Bioinformatics Conference - Workshops (CSBW’05) (pp. 105–108). Ieee.

Hartigan, J. A. (1975). Clustering Algorithms. Books on Demand.

Hathaway, R. J., & Bezdek, J. C. (2003). Visual cluster validity for prototype generator clustering models. Pattern Recognition Letters, 24(9-10), 1563–1569.

Hautamaki, V., Nykanen, P., & Franti, P. (2008). Time-series clustering by approximate prototypes. Pattern Recognition, 2008. ICPR 2008. 19th International Conference on (pp. 1–4). IEEE.

He, W., Feng, G., Wu, Q., He, T., Wan, S., & Chou, J. (2011). A new method for abrupt dynamic change detection of correlated time series. International Journal of Climatology, 32(10), 1604–1614.

Hirano, S., & Tsumoto, S. (2005). Empirical comparison of clustering methods for long time-series databases. Active Mining, 3430, 268–286.

226 Hirano, S., & Tsumoto, S. (2007). Cluster analysis of trajectory data on hospital laboratory examinations. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium, 324–8.

Honda, R., Wang, S., Kikuchi, T., & Konishi, O. (2002). Mining of moving objects from time-series images and its application to satellite weather imagery. Journal of Intelligent Information Systems, 19(1), 79–93.

Horenko, I. (2010). Finite element approach to clustering of multidimensional time series. SIAM Journal on Scientific Computing, 32(1), 62–83.

Hu, J., Ray, B., & Han, L. (2006). An interweaved hmm/dtw approach to robust time series clustering. 18th International Conference on Pattern Recognition, 2006. ICPR 2006. (Vol. 3, pp. 145–148). IEEE.

Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data mining. In Research Issues on Data Mining and Knowledge Discovery, 1– 8.

Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 304(3), 283–304.

Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of classification, 2(1), 193–218.

Huhtala, Y., & Karkkainen, J. (1999). Mining for similarities in aligned time series using wavelets. Data Mining and Knowledge Discovery: Theory, Tools, and Technology (pp. 105–393).

Indyk, P., & Koudas, N. (2000). Identifying representative trends in massive time series data sets using sketches. Proceedings of the 26th International Conference on Very Large Data Bases, 363–372.

Itakura, F. (1975). Minimum prediction residual principle applied to speech recognitionMinimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 23(1), 67–72.

Ito, F., Hiroyasu, T., Miki, M., & Yokouchi, H. (2009). Detection of preference shift timing using time-series clustering. 2009 IEEE International Conference on Fuzzy Systems (pp. 1585–1590). Ieee.

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM computing surveys (CSUR), 31(3), 264–323.

Jiang, D., & Tang, C. (2004). Cluster analysis for gene expression data: A survey. Knowledge and Data Engineering,, 16(11), 1370–1386.

Kakizawa, Y., Shumway, R. H., & Taniguchi, M. (1998). Discrimination and clustering for multivariate time series. Journal of the American Statistical Association, 93(441), 328–340. doi:10.2307/2669629

227 Kalpakis, K., Gada, D., & Puttagunta, V. (2001). Distance measures for effective clustering of ARIMA time-series. Proceedings 2001 IEEE International Conference on Data Mining (pp. 273–280). IEEE. doi:10.1109/ICDM.2001.989529

Kameda, S., & Yamamura, M. (2006). Spider Algorithm for Clustering Time Series. World Scientific and Engineering Academy and Society (WSEAS) (Vol. 2006, pp. 378–383).

Karypis, G., Han, E. H., & Kumar, V. (1999). Chameleon: Hierarchical clustering using dynamic modeling. Computer, 32(8), 68–75.

Kaufman, L., Rousseeuw, P. J., & Corporation, E. (1990). Finding groups in data: an introduction to cluster analysis (Vol. 39). Wiley Online Library.

Kavitha, V., & Punithavalli, M. (2010). Clustering Time Series Data Stream-A Literature Survey. International Journal of Computer Science and Information Security, 8(1), arXiv preprint arXiv:1005.4270.

Kawagoe, K., & Ueda, T. (2002). A similarity search method of time series data with combination of Fourier and wavelet transforms. Proceedings Ninth International Symposium on Temporal Representation and Reasoning (pp. 86–92). IEEE Comput. Soc. doi:10.1109/TIME.2002.1027480

Keogh, E. (1997a). A probabilistic approach to fast pattern matching in time series databases. Proceedings of the 3rd International Conference of Knowledge Discovery and Data Mining (pp. 52–57).

Keogh, E. (1997b). Fast similarity search in the presence of longitudinal scaling in time series databases. International Conference on Tools with Artificial Intelligence (pp. 578–584).

Keogh, E. (2005). Hot sax: Efficiently finding the most unusual time series subsequence. Fifth IEEE International Conference on Data Mining ICDM05 (pp. 226–233).

Keogh, E. (2006). A decade of progress in indexing and mining large time series databases. International Conference on Very Large Data Bases (VLDB) (pp. 1268– 1268).

Keogh, E. (2007). Mining shape and time series databases with symbolic representations. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (p. 1).

Keogh, E., Chakrabarti, K., Pazzani, M., & Mehrotra, S. (2001a). Dimensionality reduction for fast similarity search in large time series databases. Knowledge and Information Systems, 3(3), 263–286.

Keogh, E., Chakrabarti, K., Pazzani, M., & Mehrotra, S. (2001b). Locally adaptive dimensionality reduction for indexing large time series databases. ACM SIGMOD Record, 27(2), 151–162. doi:10.1145/375663.375680

228 Keogh, E., Chu, S., & Hart, D. (2004). Segmenting time series: A survey and novel approach. Data mining in time series databases, 57(1), 1–21.

Keogh, E., & Folias, T. (2002). The UCR time series data mining archive. Riverside CA, http://www. cs. ucr. edu/eamonn/TSDMA.

Keogh, E., & Kasetty, S. (2003). On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining and Knowledge Discovery, 7(4), 349–371.

Keogh, E., & Lin, J. (2005). Clustering of time-series subsequences is meaningless: implications for previous and future research. Knowledge and Information Systems, 8(2), 154–177. doi:10.1109/ICDM.2003.1250910

Keogh, E., Lonardi, S., & Chiu, B. Y. (2002). Finding surprising patterns in a time series database in linear time and space. Proceedings of the eighth ACM SIGKDD (pp. 550–556). ACM.

Keogh, E., Lonardi, S., & Ratanamahatana, C. (2004). Towards parameter-free data mining. Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’04 (p. 215). New York, New York, USA: ACM. doi:10.1145/1014052.1014077

Keogh, E., Lonardi, S., Ratanamahatana, C., Wei, L., Lee, S. H., & Handley, J. (2007). Compression-based data mining of sequential data. Data Mining and Knowledge Discovery, 14(1), 99–129.

Keogh, E., & Pazzani, M. (1998). An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. Proceedings of the 4th International Conference of Knowledge Discovery and Data Mining (pp. 239–241). AAAI Press.

Keogh, E., & Pazzani, M. (2000). A simple dimensionality reduction technique for fast similarity search in large time series databases. LECTURE NOTES IN COMPUTER SCIENCE, 1805(1), 122–133.

Keogh, E., & Ratanamahatana, C. (2004). Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3), 358–386. doi:10.1007/s10115-004- 0154-9

Keogh, E., Wei, L., Xi, X., Lee, S. H., & Vlachos, M. (2006). LB_Keogh supports exact indexing of shapes under rotation invariance with arbitrary representations and distance measures (pp. 882–893). VLDB Endowment.

Kim, S. W., Park, S., & Chu, W. W. (2001). An index-based approach for similarity search supporting time warping in large sequence databases. Proceedings of International Conference on Data Engineering (pp. 607–614).

Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464– 1480.

229 Korn, F., Jagadish, H. V., & Faloutsos, C. (1997). Efficiently supporting ad hoc queries in large datasets of time sequences. ACM SIGMOD Record (Vol. 26, pp. 289–300). ACM.

Košmelj, K., & Batagelj, V. (1990). Cross-sectional approach for clustering time varying data. Journal of Classification, 7(1), 99–109.

Kremer, H., Kranen, P., Jansen, T., Seidl, T., Bifet, A., Holmes, G., & Pfahringer, B. (2011). An effective evaluation measure for clustering on evolving data streams. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 868–876). ACM.

Krishnapuram, R., Joshi, A., Nasraoui, O., & Yi, L. (2001). Low-complexity fuzzy relational clustering algorithms for web mining. Fuzzy Systems, IEEE Transactions on, 9(4), 595–607.

Kumar, M., & Patel, N. R. (2002). Clustering seasonality patterns in the presence of errors. Proceedings of the eighth ACM SIGKDD, 557–563.

Kumar, N., Lolla, N., Keogh, E., & Lonardi, S. (2005). Time-series bitmaps: a practical visualization tool for working with large time series databases. SIAM 2005 Data Mining, 531–535.

Kumar, R., & Nagabhushan, P. (2006). Time Series as a Point-A Novel Approach for Time Series Cluster Visualization. Conference on Data Mining (pp. 24–29).

Kurbalija, V., Nachtwei, J., Bernstorff, C. Von, Von Bernstorff, C., Burkhard, H.-D., Ivanović, M., & Fodor, L. (2012). Time-series mining in a psychological domain. Proceedings of the Fifth Balkan Conference in Informatics (pp. 58–63). New York, NY, USA: ACM. doi:10.1145/2371316.2371328

Lai, C.-P. P., Chung, P.-C. C., & Tseng, V. S. (2010). A novel two-level clustering method for time series data analysis. Expert Systems with Applications, 37(9), 6319–6326. doi:10.1016/j.eswa.2010.02.089

Lang, W., Morse, M., & Patel, J. M. (2010). Dictionary-based compression for long time-series similarity. Knowledge and Data Engineering, IEEE Transactions on, 22(11), 1609–1622. doi:10.1109/TKDE.2009.201

Larsen, B., & Aone, C. (1999). Fast and effective text mining using linear-time document clustering. Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 16–22). ACM.

Last, M., & Kandel, A. (2004). Data mining in time series databases. Genomics and Proteomics. World scientific.

Latecki, L. J., Megalooikonomou, V., Wang, Q., Lakaemper, R., Ratanamahatana, C., & Keogh, E. (2005). Elastic partial matching of time series. Knowledge Discovery in Databases: PKDD 2005, 577–584. doi:10.1109/ICDM.2005.118

Laxman, S., & Sastry, P. S. (2006). A survey of temporal data mining. Sadhana, 31(2), 173–198. doi:10.1007/BF02719780

230 Leng, M., Lai, X., Tan, G., & Xu, X. (2009). Time series representation for anomaly detection. Computer Science and Information Technology, 2009. ICCSIT 2009. 2nd IEEE International Conference on (pp. 628–632). IEEE.

Liao, T. W. (2005). Mining of vector time series by clustering.

Liao, T. W., Bolt, B., Forester, J., Hailman, E., Hansen, C., Kaste, R., & O’May, J. (2002). Understanding and projecting the battle state. 23rd Army Science Conference, Orlando, FL (pp. 2–3).

Liao, T. W., & Ting, C. F. (2006). An adaptive genetic clustering method for exploratory mining of feature vector and time series data. International Journal of Production Research, 44(14), 2731–2748.

Lin, J., Etter, D., & DeBarr, D. (2008). Exact and approximate reverse nearest neighbor search for multimedia data. International Conference on Data Mining (pp. 656– 667).

Lin, J., Keogh, E., Lonardi, S., & Chiu, B. (2003). A symbolic representation of time series, with implications for streaming algorithms. Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery - DMKD ’03, 2. doi:10.1145/882085.882086

Lin, J., Keogh, E., Lonardi, S., Lankford, J., & Nystrom, D. (2004). Visually mining and monitoring massive time series. Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’04, 460. doi:10.1145/1014052.1014104

Lin, J., Keogh, E., & Truppel, W. (2003). Clustering of streaming time series is meaningless. Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery DMKD 03, 56. doi:10.1145/882082.882096

Lin, J., Keogh, E., Wei, L., & Lonardi, S. (2007). Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery, 15(2), 107– 144. doi:10.1007/s10618-007-0064-z

Lin, J., & Li, Y. (2009). Finding structural similarity in time series data using bag-of- patterns representation. Proceedings of the 21st International Conference on Scientific and Statistical Database Management (pp. 461–477). Springer-Verlag.

Lin, J., Vlachos, M., Keogh, E., & Gunopulos, D. (2004). Iterative incremental clustering of time series. Advances in Database Technology-EDBT 2004, 521–522.

Lin, J., Vlachos, M., Keogh, E., Gunopulos, D., Liu, J., Yu, S., & Le, J. (2005). A MPAA-based iterative clustering algorithm augmented by nearest neighbors search for time-series data streams. Advances in Knowledge Discovery and Data Mining, 369–377.

Lin, R., & Shim, H. S. S. K. (1995). Fast similarity search in the presence of noise, scaling, and translation in time-series databases. Proc. of the 21st VLDB Conference, 490–501.

231 Lin, S., Song, M., & Zhang, L. (2008). Comparison of cluster representations from partial second-to full fourth-order cross moments for data stream clustering. Eighth IEEE International Conference on Data Mining, 2008. ICDM ’08 (pp. 560–569).

Liu, W., & Shao, L. (2009). Research of SAX in Distance Measuring for Financial Time Series Data. First International Conference on Information Science and Engineering (pp. 935–937). Ieee. doi:10.1109/ICISE.2009.924

Lkhagva, B., Suzuki, Y. u., & Kawagoe, K. (2006). New time series data representation ESAX for financial applications. Proceedings. 22nd International Conference onData Engineering Workshops, 2006. (p. x115). IEEE. doi:10.1109/ICDEW.2006.99

Lonardi, S. (2001). Global detectors of unusual words: design, implementation, and applications to pattern discovery in biosequences. Purdue University.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium Mathematical Statist. Probability (Vol. 1, pp. 281–297).

Manganaro, V., Paratore, S., Alessi, E., Coffa, S., & Cavallaro, S. (2005). Adding semantics to gene expression profiles: new tools for drug discovery. Current medicinal chemistry, 12(10), 1149–1160.

Manning, C. D., Raghavan, P., & Schutze, H. (2008). Introduction to information retrieval. Online (Vol. 1). Cambridge University Press Cambridge.

Meila, M. (2003). Comparing clusterings by the variation of information. In B. Schölkopf & M. Warmuth (Eds.), Comparing Clusterings by the Variation of Information Learning Theory and Kernel Machines (pp. 173–187). Springer.

Milligan, G. (1981). A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 46(2), 187–199.

Milligan, G. W., & Cooper, M. C. (1986). A study of the comparability of external criteria for hierarchical cluster analysis. A study of the comparability of external criteria for hierarchical cluster analysis, 21(4), 441–458.

Minnen, D., Isbell, C. L., Essa, I., & Starner, T. (2007). Discovering multivariate motifs using subsequence density estimation and greedy mixture learning. PROCEEDINGS OF THE NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, 22(1), 615.

Minnen, D., Starner, T., Essa, M., & Isbell, C. (2006). Discovering characteristic actions from on-body sensor data. 10th IEEE International Symposium on Wearable Computers (pp. 11–18).

Mirkin, B. (2005). Clustering for A Data Recovery Approach. New York. Chapman & Hall/CRC.

232 Mitsa, T. (2009). Temporal data mining. Drug safety : an international journal of medical toxicology and drug experience (Vol. 33). Chapman & Hall/CRC. doi:10.2165/11537630-000000000-00000

Möller-Levet, C., Klawonn, F., Cho, K. H., & Wolkenhauer, O. (2003). Fuzzy clustering of short time-series and unevenly distributed sampling points. Advances in Intelligent Data Analysis V, 330–340.

Mörchen, F., & Ultsch, A. (2005). Optimizing time series discretization for knowledge discovery. Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining - KDD ’05, 660. doi:10.1145/1081870.1081953

Morchen, F., Ultsch, A., Mörchen, F., & Hoos, O. (2005). Extracting interpretable muscle activation patterns with time series knowledge mining. JOURNAL OF KNOWLEDGE BASED, 9(3), 197–208.

Morinaka, Y., Yoshikawa, M., Amagasa, T., & Uemura, S. (2001). The L-index: An indexing structure for efficient subsequence matching in time sequence databases. Proc. 5th PacificAisa Conf. on Knowledge Discovery and Data Mining (pp. 51– 60). Citeseer.

Morris, B., & Trivedi, M. (2009). Learning trajectory patterns by clustering: Experimental studies and comparative evaluation. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 312–319. doi:10.1109/CVPR.2009.5206559

Morse, M. D., & Patel, J. M. (2007). An efficient and accurate method for evaluating time series similarity. Proceedings of the 2007 ACM SIGMOD international conference on Management of data SIGMOD 07 (p. 569). ACM Press. doi:10.1145/1247480.1247544

Murtagh, F. (1985). Multidimensional clustering algorithms. Compstat Lectures, Vienna: Physika Verlag, 1985, 1.

Nakhaeizadeh, G., Steurer, E., & Bartlmae, K. (2002). BANKING AND FINANCE. Handbook of data mining and knowledge discovery (pp. 771–780).

Ng, R. T., & Han, J. (1994). Efficient and effective clustering methods for spatial data mining. Proceedings of the International Conference on Very Large Data Bases (pp. 144–144). INSTITUTE OF ELECTRICAL & ELECTRONICS ENGINEERS (IEEE).

Niennattrakul, V., & Ratanamahatana, C. (2007a). On clustering multimedia time series data using k-means and dynamic time warping. International Conference on Multimedia and Ubiquitous Engineering, 2007. MUE ’07. (pp. 733–738). IEEE Computer Society.

Niennattrakul, V., & Ratanamahatana, C. (2007b). Inaccuracies of shape averaging method using dynamic time warping for time series data. Computational Science– ICCS 2007, 513–520.

233 Oates, T. (1999). Identifying distinctive subsequences in multivariate time series by clustering. Proceedings of the fifth ACM SIGKDD international (pp. 322–326).

Oates, T., Firoiu, L., & Cohen, P. (2001). Using dynamic time warping to bootstrap HMM-based clustering of time series. SEQUENCE LEARNING: PARADIGMS, ALGORITHMS, AND APPLICATIONS, 1(1828), 35–52.

Oates, T., Schmill, M. D., & Cohen, P. R. (2000). A method for clustering the experiences of a mobile robot that accords with human judgments. PROCEEDINGS OF THE NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE (pp. 846–851). Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.

Owsley, L. M. D., Atlas, L. E., & Bernard, G. D. (1997). Self-organizing feature maps and hidden Markov models for machine-tool monitoring. IEEE Transactions on Signal Processing, 45(11), 2787–2798.

Panuccio, A., Bicego, M., & Murino, V. (2002). A Hidden Markov Model-based approach to sequential data clustering. In T. Caelli, A. Amin, R. Duin, R. De, & M. Kamel (Eds.), Structural, Syntactic, and Statistical Pattern Recognition.

Park, S., & Lee, J. (2010). Representation and clustering of time series by means of segmentation based on PIPs detection. 2010 The 2nd International Conference on Computer and Automation Engineering (ICCAE), 17–21. doi:10.1109/ICCAE.2010.5451841

Pavlidis, N., Plagianakos, V. P., Tasoulis, D. K., & Vrahatis, M. N. (2006). Financial forecasting through unsupervised clustering and neural networks. Operational Research, 6(2), 103–127.

Perng, C. S., Wang, H., Zhang, S. R., & Parker, D. S. (2000). Landmarks: a new model for similarity-based pattern querying in time series databases. Data Engineering, 2000. Proceedings. 16th International Conference on (pp. 33–42). IEEE.

Petitjean, F., Ketterlin, A., & Gançarski, P. (2011). A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3), 678–693. doi:10.1016/j.patcog.2010.09.013

Phetking, C., Sap, M. N. M., & Selamat, A. (2008). A multiresolution important point retrieval method for financial time series representation. International Conference on Computer and Communication Engineering (pp. 510–515). IEEE. doi:10.1109/ICCCE.2008.4580656

Policker, S., & Geva, A. B. B. (2000). Nonstationary time series analysis by temporal clustering. Systems, Man, and Cybernetics, Part B:, 30(2), 339–343.

Polz, P. M., Hortnagl, E., & Prem, E. (2003). Processing and Clustering Time Series of Mobile Robot Sensory Data. Technical Report, Österreichisches Forschungsinstitut für Artificial Intelligence, Wien, TR-2003-10, 2003. Time.

234 Popivanov, I., & Miller, R. J. (2002). Similarity search over time-series data using wavelets. ICDE ’02: Proceedings of the 18th International Conference on Data Engineering (pp. 212–224). Citeseer.

Portet, F., Reiter, E., Gatt, A., Hunter, J., Sripada, S., Freer, Y., & Sykes, C. (2009). Automatic generation of textual summaries from neonatal intensive care data. Artificial Intelligence, 173(7), 789–816.

Pratt, K. B., & Fink, E. (2002). Search for patterns in compressed time series. International Journal of Image and Graphics, 2(1), 89–106.

Qian, J., Dolled-Filhart, M., Lin, J., Yu, H., & Gerstein, M. (2001). Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions1. Journal of Molecular Biology, 314(5), 1053–1066. doi:10.1006/jmbi.2001.5219

Rabiner, L., & Levinson, S. (1979). Speaker-independent recognition of isolated words using clustering techniques. IEEE Transactions on Acoustics, Speech and Signal Processing, 27(4), 336–349.

Rai, P., & Singh, S. (2010). A Survey of Clustering Techniques. International Journal of Computer Applications IJCA, 7(12), 1–5.

Rakthanmanon, T., Campana, A. B., Batista, G., Zakaria, J., & Keogh, E. (2012). Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping. Conference on Knowledge Discovery and Data Mining (pp. 262–270).

Ramoni, M., Sebastiani, P., & Cohen, P. (2000). Multivariate clustering by dynamics. PROCEEDINGS OF THE NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE (pp. 633–638). Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336), 846–850.

Rani, S., & Sikka, G. (2012). Recent Techniques of Clustering of Time Series Data: A Survey. International Journal of Computer Applications, 52(15), 1–9. doi:10.5120/8282-1278

Ratanamahatana, C. (2005). Multimedia retrieval using time series representation and relevance feedback. Proceedings of 8th International Conference on Asian Digital Libraries (ICADL2005) (pp. 400–405).

Ratanamahatana, C., & Keogh, E. (2004a). Everything you know about dynamic time warping is wrong. Review Literature And Arts Of The Americas, 35(9), S535–40.

Ratanamahatana, C., & Keogh, E. (2004b). Making time-series classification more accurate using learned constraints. Proceedings of SIAM International Conference on Data Mining (pp. 11–22). Lake Buena Vista, Florida.

Ratanamahatana, C., & Keogh, E. (2005). Three myths about dynamic time warping data mining. International Conference on Data Mining (SDM’05) (pp. 506–510).

235 Ratanamahatana, C., Keogh, E., Bagnall, A. J., & Lonardi, S. (2005). A Novel Bit Level Time Series Representation with Implications for Similarity Search and Clustering. In Proc. 9th Pacific-Asian Int. Conf. on Knowledge Discovery and Data Mining (PAKDD’05} (pp. 771–777). Springer.

Ratanamahatana, C., & Niennattrakul, V. (2006). Clustering Multimedia Data Using Time Series. International Conference on Hybrid Information Technology, 2006. ICHIT ’06. (pp. 372–379).

Rauber, A., Pampalk, E., & Paralič, J. (2000). Empirical evaluation of clustering algorithms. Journal of Information and Organizational Sciences, 24(2), 195–209.

Rebbapragada, U., Protopapas, P., Brodley, C. E., & Alcock, C. (2009). Finding anomalous periodic time series. Machine learning, 74(3), 281–313.

Reinert, G., Schbath, S., & Waterman, M. S. (2000). Probabilistic and statistical properties of words: an overview. Journal of Computational Biology, 7(1-2), 1–46.

Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1), 59–66.

Rohlf, F. (1974). Methods of comparing classifications. Annual Review of Ecology and Systematics, 101–113.

Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 410–420).

Saito, N. (2000). Local feature extraction and its applications using a library of bases. Topics in Analysis and Its Applications: Selected Theses, 269–451.

Sakoe, H., & Chiba, S. (1971). A dynamic programming approach to continuous speech recognition. Proceedings of the Seventh International Congress on Acoustics (Vol. 3, pp. 65–69).

Sakoe, H., & Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1), 43–49.

Sakurai, Y., Yoshikawa, M., & Faloutsos, C. (2005). FTW: fast similarity search under the time warping distance. Proceedings of the twenty-fourth ACM SIGMOD- SIGACT-SIGART symposium on Principles of database systems (Vol. 1, pp. 326– 337). ACM.

Salvador, S., & Chan, P. (2007). Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11(5), 561–580.

Sankoff, D., & Kruskal, J. B. (1983). Time warps, string edits, and macromolecules: the theory and practice of sequence comparison. Reading: Addison-Wesley Publication, 1983, edited by Sankoff, David; Kruskal, Joseph B., 1.

236 Seo, J., & Shneiderman, B. (2002). Interactively exploring hierarchical clustering results. Computer, 35(7), 80–86.

Sfetsos, A., & Siriopoulos, C. (2004). Time series forecasting with a hybrid clustering scheme and pattern recognition. IEEE Transactions on Systems, Man and Cybernetics, 34(3), 399–405.

Shahabi, C., Tian, X., & Zhao, W. (2002). Tsa-tree: A wavelet-based approach to improve the efficiency of multi-level surprise and trend queries on time-series data (pp. 55–68). IEEE.

Shatkay, H., & Zdonik, S. B. (1996). Approximate queries and representations for large data sequences. Data Engineering, 1996. Proceedings of the Twelfth International Conference on (pp. 536–545). IEEE.

Shavlik, J. W., & Dietterich, T. G. (1990). Readings in machine learning. Morgan Kaufmann.

Sheikholeslami, G., Chatterjee, S., & Zhang, A. (1998). WaveCluster: A multi-resolution clustering approach for very large spatial databases. Proceedings of the International Conference on Very Large Data Bases (pp. 428–439).

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.

Shieh, J., & Keogh, E. (2009). iSAX: disk-aware mining and indexing of massive time series datasets. Data Mining and Knowledge Discovery, 19(1), 24–57. doi:10.1007/s10618-009-0125-6

Shumway, R. H. R. (2003). Time-frequency clustering and discriminant analysis. Statistics & Probability Letters, 63(3), 307–314. doi:10.1016/S0167-7152(03)00095-6

Smyth, P. (1997). Clustering sequences with hidden Markov models. Advances in neural information processing systems, (9), 648–654.

Sneath, P., & Sokal, R. (1973). Numerical taxonomy. The principles and practice of numerical classification. San Francisco, CA.: W. H. Freeman.

Steinbach, M., Tan, P. N., Kumar, V., Klooster, S., & Potter, C. (2003). Discovery of climate indices using clustering. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 446–455).

Steinley, D. (2004). Properties of the Hubert-Arabie Adjusted Rand Index. Psychological Methods, 9(3), 386.

Strehl, A., & Ghosh, J. (2003). Cluster ensembles: a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3, 583–617.

Studholme, C., Hill, D. L. G., & Hawkes, D. J. (1999). An overlap invariant entropy measure of 3D medical image alignment. Pattern Recognition, 32(1), 71–86.

Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Cluster analysis: basic concepts and algorithms. Introduction to data mining (pp. 487–567).

Tran, D., & Wagner, M. (2002). Fuzzy c-means clustering-based speaker verification. Advances in Soft Computing—AFSS 2002, (2275), 363–369.

Tseng, V. S., & Kao, C. P. (2005). Efficiently mining gene expression data via a novel parameterless clustering method. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(4), 355–365.

Tseng, V. S., & Kao, C.-P. P. (2007). A Novel Similarity-Based Fuzzy Clustering Algorithm by Integrating PCM and Mountain Method. IEEE Transactions on Fuzzy Systems, 15(6), 1188–1196. doi:10.1109/TFUZZ.2006.890673

Uehara, K., & Shimada, M. (2002). Extraction of primitive motion and discovery of association rules from human motion data. Progress in Discovery Science, 73–88.

Ultsch, A., & Mörchen, F. (2005). ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM.

Van Rijsbergen, C. J. (1979). Information retrieval. London: Butterworths.

Van Rijsbergen, C. J. (1986). A non-classical logic for information retrieval. The computer journal, 29(6), 481–485.

Van Wijk, J. J., & Van Selow, E. R. (1999). Cluster and calendar based visualization of time series data. Proceedings 1999 IEEE Symposium on Information Visualization (InfoVis’99), 4–9. doi:10.1109/INFVIS.1999.801851

Vitter, J. S. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1), 37–57.

Vlachos, M., Gunopulos, D., & Das, G. (2004). Indexing time-series under conditions of noise. In M. Last, A. Kandel, & H. Bunke (Eds.), Data Mining in Time Series Databases, World Scientific, Singapore (p. 67).

Vlachos, M., Kollios, G., & Gunopulos, D. (2002). Discovering similar multidimensional trajectories. Data Engineering, 2002. Proceedings. 18th International Conference on (pp. 673–684). IEEE. doi:10.1109/ICDE.2002.994784

Vlachos, M., Lin, J., & Keogh, E. (2003). A wavelet-based anytime algorithm for k-means clustering of time series. Proc. Workshop on Clustering, 23–30.

Vuori, V., & Laaksonen, J. (2002). A comparison of techniques for automatic clustering of handwritten characters. Proceedings of the 16th International Conference on Pattern Recognition (ICPR 2002) (Vol. 3, pp. 168–171). IEEE.

Wang, C., & Sean Wang, X. (2000). Supporting content-based searches on time series via approximation. Scientific and Statistical Database Management (SSDBM) (pp. 69–81).

Wang, H., Wang, W., Yang, J., & Yu, P. S. (2002). Clustering by pattern similarity in large data sets. Proceedings of the 2002 ACM SIGMOD international conference on Management of data (pp. 394–405). ACM.

Wang, W., Yang, J., & Muntz, R. (1997). STING: A statistical information grid approach to spatial data mining. Proceedings of the International Conference on Very Large Data Bases (pp. 186–195). INSTITUTE OF ELECTRICAL & ELECTRONICS ENGINEERS (IEEE).

Wang, X., Smith, K. A., Hyndman, R., & Alahakoon, D. (2004). A scalable method for time series clustering. Unrefereed research paper.

Wang, X., Smith, K. A., & Hyndman, R. J. (2005). Dimension reduction for clustering time series using global characteristics. Computational Science–ICCS 2005, 792– 795.

Wang, X., Smith, K., & Hyndman, R. (2006). Characteristic-Based Clustering for Time Series Data. Data Mining and Knowledge Discovery, 13(3), 335–364. doi:10.1007/s10618-005-0039-x

Wang, Xiaoyue, Mueen, A., Ding, H., Trajcevski, G., Scheuermann, P., & Keogh, E. (2012). Experimental comparison of representation methods and distance measures for time series data. Data Mining and Knowledge Discovery, Springer Netherlands. doi:10.1007/s10618-012-0250-5

Wang, Z. J., & Willett, P. (2004). Joint segmentation and classification of time series using class-specific features. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 34(2), 1056–1067.

Warren Liao, T. (2005). Clustering of time series data - a survey. Pattern Recognition, 38(11), 1857–1874. doi:10.1016/j.patcog.2005.01.025

Wei, L., Kumar, N., Lolla, V., & Keogh, E. (2005). Assumption-free anomaly detection in time series. Proceedings of the 17th International Conference on Scientific and Statistical Database Management (pp. 237–240).

Wismüller, A., Lange, O., Dersch, D. R., Leinsinger, G. L., Hahn, K., Pütz, B., & Auer, D. (2002). Cluster analysis of biomedical image time-series. International Journal of Computer Vision, 46(2), 103–128.

Wu, J., Xiong, H., & Chen, J. (2009). Adapting the right measures for k-means clustering. Proceedings of the 15th ACM SIGKDD (pp. 877–886).

Wu, Y. L., Agrawal, D., & El Abbadi, A. (2000). A comparison of DFT and DWT based similarity search in time-series databases. Proceedings of the Ninth International Conference on Information and Knowledge Management (CIKM’00) (pp. 488–495). ACM.

Xi, X., Keogh, E., Shelton, C., Wei, L., & Ratanamahatana, C. (2006). Fast time series classification using numerosity reduction. Proceedings of the 23rd international conference on Machine learning (pp. 1033–1040). ACM.

Xiong, Y., & Yeung, D. Y. (2002). Mixtures of ARMA models for model-based time series clustering. Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002) (pp. 717–720).

Xiong, Y., & Yeung, D. Y. (2004). Time series clustering with ARMA mixtures. Pattern Recognition, 37(8), 1675–1689. doi:10.1016/j.patcog.2003.12.018

Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3), 645–678. doi:10.1109/TNN.2005.845141

Yeung, K., Fraley, C., Murua, A., Raftery, A. E., & Ruzzo, W. L. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics, 17(10), 977–987.

Yeung, K., Haynor, D., & Ruzzo, W. (2001). Validating clustering for gene expression data. Bioinformatics, 17(4), 309–318.

Yi, B. K., & Faloutsos, C. (2000). Fast time sequence indexing for arbitrary Lp norms. Proceedings of the 26th International Conference on Very Large Data Bases (VLDB’00) (pp. 385–394).

Yi, B. K., Jagadish, H. V., & Faloutsos, C. (1998). Efficient retrieval of similar time sequences under time warping. Data Engineering, 1998. Proceedings., 14th International Conference on (pp. 201–208). IEEE.

Yin, J., & Yang, Q. (2005). Integrating hidden Markov models and spectral analysis for sensory time series clustering. Fifth IEEE International Conference on Data Mining (ICDM’05) (pp. 8–18). IEEE. doi:10.1109/ICDM.2005.82

Yu, F., Dong, K., Chen, F., Jiang, Y., & Zeng, W. (2007). Clustering Time Series with Granular Dynamic Time Warping Method. 2007 IEEE International Conference on Granular Computing GRC 2007, 393–393. doi:10.1109/GrC.2007.34

Zakaria, J., Rotschafer, S., Mueen, A., Razak, K., & Keogh, E. (2012). Mining Massive Archives of Mice Sounds with Symbolized Representations. SIGKDD (pp. 1–10).

Zhang, H., Ho, T. B., Zhang, Y., & Lin, M. S. (2006). Unsupervised Feature Extraction for Time Series Clustering Using Orthogonal Wavelet Transform. INFORMATICA, 30(3), 305–319.

Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Record (Vol. 25, pp. 103–114). ACM.

Zhang, X., Liu, J., Du, Y., & Lv, T. (2011). A novel clustering method on time series data. Expert Systems with Applications, 38(9), 11891–11900. doi:10.1016/j.eswa.2011.03.081

Zhang, X., Wu, J., Yang, X., Ou, H., & Lv, T. (2009). A novel pattern extraction method for time series classification. Optimization and Engineering, 10(2), 253–271.

Zhang, Z., Huang, K., & Tan, T. (2006). Comparison of similarity measures for trajectory clustering in outdoor surveillance scenes. 18th International Conference on Pattern Recognition, ICPR 2006, 3, 1135–1138.

Zhao, Y., & Karypis, G. (2004). Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3), 311–331.

Zhu, Q., Rakthanmanon, T., Batista, G. E. A. P. A., & Keogh, E. J. (2012). A novel approximation to dynamic time warping allows anytime clustering of massive time series datasets. SIAM International Conference on Data Mining (pp. 999–1010).

Zhu, Y., & Shasha, D. (2003). Warping indexes with envelope transforms for query by humming. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data (pp. 181–192). ACM.

Zilberstein, S., & Russell, S. (1995). Approximate reasoning using anytime algorithms. In S. Natarajan (Ed.), The Springer International Series in Engineering and Computer Science (pp. 43–62). Springer US.

APPENDIX A

In banking systems, for example, customers who commit fraud usually make frequent transactions within a short period. This observation helps fraud-detection experts find potential criminals in banking systems. Frequent transactions can be considered from two points of view. First, transactions of a customer that are more frequent than his earlier or later transactions; this is equivalent to an abnormal observation in a time-series, which can be detected by a monitoring or anomaly-detection process. Second, frequent transactions that indicate rare behaviour of a customer in comparison with similar customers in the same cluster; this is usually detected by clustering the customers. However, because such frequent transactions occupy only a short period in comparison with the whole time-series, they are usually overlooked in the approximation process. The second case is the focus here, i.e., the case where the behaviour of a customer differs from that of the other members of the same cluster. A solution to this kind of problem is to group the users into more accurate clusters; the rare cases can then be detected as outlier time-series (or new clusters). For example, the following figure shows five customers by their transactions. Considering the approximated time-series (brown lines), customer C seems to belong to the cluster {A, B}. However, in terms of frequency, customer C mostly belongs to the cluster {D, E}. In a sensitive system, such as a banking system, it should be detected as an outlier or as a sub-cluster of the main groups (to be decided by experts), because it contains behaviour which is rare relative to the homogeneous customers in the same cluster. Such accurate clusters cannot be achieved by conventional clustering of approximated time-series obtained by dimensionality reduction methods.


Detecting different levels of dissimilarity in a cluster
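To make the second case concrete, the following is a minimal sketch (not the thesis implementation) of flagging within-cluster outlier time-series: a series whose average distance to the other members of its own cluster is unusually high is reported for expert review. The function name, the Euclidean distance on equal-length series, and the two-sigma threshold are illustrative assumptions.

    import numpy as np

    def within_cluster_outliers(series, labels, threshold=2.0):
        """Flag series whose mean distance to the other members of their own
        cluster exceeds the cluster mean by `threshold` standard deviations.
        Minimal sketch: Euclidean distance on equal-length series."""
        series = np.asarray(series, dtype=float)
        labels = np.asarray(labels)
        outliers = []
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            if len(idx) < 3:
                continue  # too few members to judge rarity
            members = series[idx]
            # pairwise Euclidean distances within the cluster
            dists = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2)
            mean_dist = dists.sum(axis=1) / (len(idx) - 1)  # exclude self (distance 0)
            mu, sigma = mean_dist.mean(), mean_dist.std()
            outliers.extend(idx[mean_dist > mu + threshold * sigma].tolist())
        return outliers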

Another example is daily stock data: if one considers daily instead of hourly stock time-series, some important moments, such as extremes or peaks, might be overlooked, which affects the quality of the clusters.

APPENDIX B

On the evaluation of clustering of the UCR datasets:

APPENDIX C

A comment on using a new distance measure (instead of MINDIST) for clustering purposes when the representation method is SAX.

APPENDIX D

The difference between the two most-used distance measures in the quality of time-series clustering is shown in the following plots. The results relate to DTW and ED used in k-Medoids clustering of the CBF and CC datasets with different cardinalities.

[Figure: accuracy of k-Medoids with DTW vs. k-Medoids with ED across different cardinalities of the CBF dataset]

The quality of clustering of CBF using ED and DTW

[Figure: accuracy of k-Medoids with DTW vs. k-Medoids with ED for CC dataset cardinalities from 600 to 12,000]

The quality of clustering of CC using ED and DTW

The charts indicate two things: first, the quality of DTW is higher than that of ED for all cardinalities; second, in contrast to the indexing and classification problems, the quality of clustering using DTW and ED remains quite different even for large datasets.
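For concreteness, the following is a minimal sketch of the two measures compared in these plots. It is the unconstrained O(nm) dynamic programming formulation of DTW, not the thesis implementation (which may apply a warping-window bound), and the function names are illustrative.

    import numpy as np

    def euclidean(x, y):
        # ED: requires equal-length series
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return float(np.linalg.norm(x - y))

    def dtw(x, y):
        # Unconstrained DTW: O(len(x) * len(y)) dynamic programming
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = (x[i - 1] - y[j - 1]) ** 2
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return float(np.sqrt(D[n, m]))

Either function can be supplied to k-Medoids as the pairwise distance; the difference in the resulting cluster quality is what the two charts above report.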

APPENDIX E

To calculate the distance between time-series data, ED and DTW are used as the distance metrics. The following chart shows the maximum purity of the MTC approach against the maximum purity of the k-Medoids approach (using raw time-series). It is the result of 100 runs on the same datasets. The purity measure is used to enable comparison with rival works such as (Ratanamahatana & Niennattrakul, 2006). However, the dataset used by Ratanamahatana and Niennattrakul differs from the existing, published datasets in the UCR repository. As a result, the purity is calculated and reported here using the published datasets.

[Figure: maximum purity of k-Medoids (ED, raw), k-Medoids (DTW, raw), and MTC on 19 UCR datasets (50words, Adiac, CBF, Coffee, ECG200, FaceAll, FaceFour, FISH, Gun_Point, Lighting2, Lighting7, OliveOil, OSULeaf, SwedishLeaf, synthetic_control, Trace, Two_Patterns, wafer, yoga)]
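Since purity is the yardstick in this appendix, a minimal sketch of its computation is given below; the function name is illustrative, and each predicted cluster is simply credited with its majority ground-truth class.

    import numpy as np

    def purity(labels_true, labels_pred):
        """Purity: assign each predicted cluster its majority true class and
        return the fraction of series falling in those majority classes."""
        labels_true = np.asarray(labels_true)
        labels_pred = np.asarray(labels_pred)
        total = 0
        for c in np.unique(labels_pred):
            members = labels_true[labels_pred == c]
            counts = np.unique(members, return_counts=True)[1]
            total += counts.max()
        return total / len(labels_true)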

In (H. Zhang et al., 2006), the mean of the evaluation criteria values obtained from 100 runs of the k-Means algorithm with the extracted features is reported. The results are compared with MTC in the following table; a sketch of the pair-counting indices involved is given after the table.

(Zhang = (H. Zhang et al., 2006))

Dataset   Mean RI          Mean Jaccard     Mean FM          Mean CSM         Mean NMI
          Zhang    MTC     Zhang    MTC     Zhang    MTC     Zhang    MTC     Zhang    MTC
CBF       0.6447   0.844   0.3439   0.6388  0.5138   0.7769  0.5751   0.8577  0.3459   0.676
CC        0.8514   0.9311  0.4428   0.6918  0.6203   0.8136  0.6681   0.8382  0.6952   0.8479
Trace     0.7498   0.8112  0.3672   0.4856  0.5400   0.656   0.5537   0.6648  0.5187   0.6297
Gun       0.4975   0.4972  0.3289   0.3291  0.4949   0.4953  0.5000   0.5146  0.0000   0.0007
ECG       0.4919   0.5436  0.2644   0.4722  0.4314   0.6498  0.4526   0.5454  0.0547   0.032
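RI, Jaccard, and FM above are pair-counting indices. As a hedged sketch (not the evaluation code used in the thesis), they can be computed by enumerating all pairs of series and counting agreements between the ground truth and the predicted partition; this plain version is O(n²) and omits guards against degenerate partitions.

    from itertools import combinations

    def pair_counts(labels_true, labels_pred):
        # a: same true class, same cluster; b: same class, different cluster
        # c: different class, same cluster;  d: different class, different cluster
        a = b = c = d = 0
        for (t1, p1), (t2, p2) in combinations(zip(labels_true, labels_pred), 2):
            same_t, same_p = (t1 == t2), (p1 == p2)
            if same_t and same_p:
                a += 1
            elif same_t:
                b += 1
            elif same_p:
                c += 1
            else:
                d += 1
        return a, b, c, d

    def rand_jaccard_fm(labels_true, labels_pred):
        a, b, c, d = pair_counts(labels_true, labels_pred)
        ri = (a + d) / (a + b + c + d)          # Rand Index
        jaccard = a / (a + b + c)               # Jaccard coefficient
        fm = a / ((a + b) * (a + c)) ** 0.5     # Fowlkes-Mallows index
        return ri, jaccard, fm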

APPENDIX F

PCS dynamically provides a good trade-off between quality and reduction rate. Here, the experiment is performed on CBF with different cardinalities to show the effect of PCS on large time-series. As expected, PCS works very well on large datasets.

[Figure: purity and reduction rate (percent) for CBF cardinalities from 90 to 6,000]

Proportion of purity and reduction rate on different cardinalities of CBF

APPENDIX G

In the experimental results reported in this section, the range of values shown in Table 5.2 is used for the parameter values of MTC applied to all datasets. As mentioned, for the different parameter combinations, the maximum quality is reported in the following table.

The result of MTC in comparison with conventional algorithms (maximum accuracy)

Dataset             k-Medoids  MTC k-Medoids  Hier-Avg  MTC Hier-Avg  Hier-Single  MTC Hier-Single
50words             0.517      0.617          0.514     0.583         0.297        0.349
Adiac               0.363      0.531          0.289     0.52          0.294        0.292
CBF                 0.484      0.841          0.39      0.673         0.347        0.344
Coffee              0.388      0.655          0.398     0.587         0.392        0.43
ECG200              0.512      0.457          0.544     0.448         0.476        0.476
FaceAll             0.407      0.631          0.335     0.559         0.199        0.229
FaceFour            0.573      0.753          0.642     0.541         0.428        0.404
FISH                0.397      0.492          0.313     0.468         0.234        0.236
Gun_Point           0.324      0.367          0.431     0.42          0.4          0.394
Lighting2           0.454      0.507          0.535     0.535         0.515        0.515
Lighting7           0.428      0.604          0.564     0.535         0.354        0.35
OliveOil            0.637      0.698          0.604     0.719         0.604        0.774
OSULeaf             0.332      0.388          0.28      0.356         0.255        0.261
SwedishLeaf         0.411      0.517          0.289     0.424         0.193        0.237
synthetic_control   0.697      0.876          0.717     0.721         0.632        0.587
Trace               0.533      0.77           0.564     0.748         0.565        0.768
Two_Patterns        0.248      0.769          0.234     0.781         0.257        0.986
wafer               0.497      0.553          0.51      0.51          0.621        0.621
yoga                0.337      0.364          0.356     0.379         0.389        0.391

As the results show, the quality of MTC is superior for most of the datasets across the different clustering schemes.

APPENDIX H

To compare the approach proposed in (X. Zhang et al., 2011) (the so-called graph-based approach) with the MTC model, the maximum quality of MTC clustering is calculated against different orders of the nearest-neighbour network. To provide fair conditions, orders 2 to 5 are considered for the graph-based approach, which provide a reasonable reduction in the second layer. To make sure that this assumption holds for all datasets at hand, the test was repeated for 19 datasets. The reduction ratio of the graph-based approach for different orders is depicted in the following chart. As the chart shows, order 1 reduces the data only by around 30% (reduction rate = 0.7), while higher orders of nearest neighbour provide around 70% reduction (reduction rate = 0.3). This is the main reason that order 2 and above are used for comparison purposes.

[Figure: reduction rate of the graph-based approach for orders O=1 to O=5 (k=1, TRIANGLE) across the 19 datasets]

Reduction rate of the graph-based approach for different orders of the nearest-neighbour network
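A minimal sketch of how such a reduction rate could be computed for an order-O nearest-neighbour network is given below. The construction is an assumption on our part (link each series to its O nearest neighbours and let each connected component become one node of the reduced layer); it is not taken from (X. Zhang et al., 2011).

    import numpy as np

    def knn_reduction_rate(dist_matrix, order):
        """Link every series to its `order` nearest neighbours and treat each
        connected component as one node of the reduced (second) layer.
        Returns components / series, matching the convention above
        (lower value = stronger reduction)."""
        n = len(dist_matrix)
        parent = list(range(n))

        def find(i):  # union-find with path halving
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i in range(n):
            linked = 0
            for j in np.argsort(dist_matrix[i]):
                if j == i:
                    continue  # skip self (distance 0)
                parent[find(i)] = find(j)  # union i with its next neighbour
                linked += 1
                if linked == order:
                    break
        roots = {find(i) for i in range(n)}
        return len(roots) / n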

APPENDIX I

The accuracy of merging sub-clusters using the arbitrary shape clustering and multi-prototype approach.

[Figure: clustering quality of k-Medoids vs. SAX-based merging across the 19 UCR datasets (50words, Adiac, CBF, Coffee, ECG200, FaceAll, FaceFour, FISH, Gun_Point, Lighting2, Lighting7, OliveOil, OSULeaf, SwedishLeaf, synthetic_control, Trace, Two_Patterns, wafer, yoga)]

The quality of MTC using arbitrary shape algorithms for merging the prototypes

APPENDIX J

In the following sub-sections, the results of running the MTC model on some datasets are reported in order to show its steps. The datasets used in this experiment are from the UCR repository (TRAIN sets).

Leaf dataset

The Leaf dataset, explained in Section 5.2, is used here for the experiment. All three steps of MTC are applied to this sample dataset to show the results. The following table shows the results (reported by the Matlab implementation of MTC) related to clustering of the Leaf dataset. The report shows the complete process and the parameters for each step pertaining to a sample run of MTC. It includes three steps. In the first step, the quality is usually low because of the approximate clustering. The second step shows the sub-clusters and the average affinity (as a pair of average similarity and standard deviation) of each sub-cluster in the first and second steps. As indicated, the average similarity usually increases in the second step. In the third step, using an arbitrary algorithm (e.g., k-Medoids), the prototypes are clustered, and then the results are evaluated against the ground truth. The accuracy is shown as the average of various indices, as the quality of the final clusters.

The processing of clustering of the Leaf dataset using MTC

Results
Step 1
Algo:k-Modes | DS:200 | k:6 | dis_method:APXDIST | bound:0 | rep:SAX | alphabet_size:8 | compression_ratio:6
--> Number of clusters:6 | quality:0.34459 | error_rate:12.805
Step 2
DS:200 | dis_method:ED | dtw_bound:1 | rep:RAW | alphabet_size:8 | compression_ratio:2
--> Pre-cluster#1 Mems:34 Clus:7 avg_sim_l1:(0.26521-0.044509) avg_sim_DTW:(0.44455-0.04864)
--> Pre-cluster#2 Mems:29 Clus:7 avg_sim_l1:(0.36488-0.09059) avg_sim_DTW:(0.56717-0.089608)
--> Pre-cluster#3 Mems:33 Clus:10 avg_sim_l1:(0.30831-0.059757) avg_sim_DTW:(0.51149-0.082026)
--> Pre-cluster#4 Mems:53 Clus:7 avg_sim_l1:(0.36495-0.091888) avg_sim_DTW:(0.59228-0.079599)
--> Pre-cluster#5 Mems:36 Clus:11 avg_sim_l1:(0.32262-0.035193) avg_sim_DTW:(0.51756-0.08716)
--> Pre-cluster#6 Mems:15 Clus:7 avg_sim_l1:(0.26329-0.068317) avg_sim_DTW:(0.36841-0.079385)
--> Number of clusters:49 | error_rate:0 | reduction:0.755 | correct_rate:1 | purity:1
Making prototype ...
Step 3
DS:49 | K:6 | dis_method:DTW | dtw_bound:1 | rep:RAW | alphabet_size:8 | compression_ratio:2 | clustering:k-medoids
--> quality:0.42777

Quality of MTC clustering on the Leaf dataset

RI    ARI   Purity  CSM   ConEntropy  F-measure  Jaccard  FM    NMI   Quality
0.78  0.24  0.56    0.39  0.38        0.53       0.23     0.37  0.37  0.43

Face dataset

A similar process can be seen for the Face dataset in the following tables.

The processing of clustering of the Face dataset using MTC

Results
Step 1
Algo:k-Modes | DS:24 | k:4 | dis_method:APXDIST | bound:0 | rep:SAX | alphabet_size:8 | compression_ratio:4
--> Number of clusters:4 | quality:0.5233 | error_rate:1.4167
Step 2
DS:24 | dis_method:ED | dtw_bound:1 | rep:RAW | alphabet_size:8 | compression_ratio:2
--> Pre-cluster#1 Mems:6 Clus:3 avg_sim_l1:(0.2359-0.098102) avg_sim_DTW:(0.25594-0.10573)
--> Pre-cluster#2 Mems:4 Clus:1 avg_sim_l1:(0.20996-0.13346) avg_sim_DTW:(0.20323-0.14952)
--> Pre-cluster#3 Mems:5 Clus:3 avg_sim_l1:(0.25888-0.086967) avg_sim_DTW:(0.32599-0.13888)
--> Pre-cluster#4 Mems:9 Clus:4 avg_sim_l1:(0.32286-0.066735) avg_sim_DTW:(0.32413-0.088161)
--> Number of clusters:11 | error_rate:0.15833 | reduction:0.54167 | correct_rate:0.77083 | purity:0.83333
Making prototype ...
Step 3
DS:11 | K:4 | dis_method:DTW | dtw_bound:1 | rep:RAW | alphabet_size:8 | compression_ratio:2 | clustering:hier_avg
--> quality:0.60685

Quality of MTC clustering on the Face dataset

RI    ARI   Purity  CSM   ConEntropy  F-measure  Jaccard  FM    NMI   Quality
0.73  0.4   0.71    0.68  0.56        0.78       0.41     0.59  0.59  0.61

Gun point dataset

The following tables show a run of MTC on the Gun point dataset.

The processing of clustering of the Gun point dataset using MTC

Results
Step 1
Algo:k-Modes | DS:50 | k:2 | dis_method:Euclid | bound:0 | rep:SAX | alphabet_size:8 | compression_ratio:6
--> Number of clusters:2 | quality:0.39046 | error_rate:6.08
Step 2
DS:50 | dis_method:ED | dtw_bound:1 | rep:RAW | alphabet_size:8 | compression_ratio:2
--> Pre-cluster#1 Mems:25 Clus:7 avg_sim_l1:(0.49478-0.10146) avg_sim_DTW:(0.68905-0.12435)
--> Pre-cluster#2 Mems:25 Clus:6 avg_sim_l1:(0.58842-0.10705) avg_sim_DTW:(0.6277-0.13558)
--> Number of clusters:13 | error_rate:0.24 | reduction:0.74 | correct_rate:0.85465 | purity:0.88
Making prototype ...
Step 3
DS:13 | K:2 | dis_method:DTW | dtw_bound:1 | rep:RAW | alphabet_size:8 | compression_ratio:2 | clustering:k-medoids
--> quality:0.48283

Quality of MTC clustering on the Gun point dataset

RI    ARI   Purity  CSM   ConEntropy  F-measure  Jaccard  FM    NMI
0.54  0.09  0.66    0.62  0.12        0.73       0.41     0.59  0.59

Coffee train dataset

Samples of time-series of the Coffee dataset

In the first step of the algorithm, the time-series data are transformed into the SAX representation (SAX(8,6)). Then, the k-Modes algorithm is applied to the data. The result of the first step is depicted in the following chart.

First step of running MTC on the Coffee dataset

In the second step, each cluster is decomposed into purer sub-clusters using the PCS algorithm explained in Section 4.4.2.2.

The second step of MTC carried out on the Coffee dataset

The result of the third step of the algorithm is shown in the following figure. In this step, hierarchical clustering is used to merge the prototypes.


Result of applying MTC (Hierarchical) on the Coffee dataset

Moreover, the result of MTC after applying all the steps is repeated using k-Medoids.

Result of applying MTC (Partitioning) on the Coffee dataset

Olive dataset

Samples of datasets of Coffee and Olive oil


Samples of time-series of the Olive oil dataset

In the same way, MTC is performed on the Olive oil dataset as another sample dataset. The following charts show the results in each step of MTC.

Results related to the first step of running MTC on the Olive oil dataset


Clusters generated in the second step of MTC on the Olive oil dataset

Results of the third step of MTC related to the Olive oil dataset

CBF dataset

The Cylinder-Bell-Funnel (CBF) time-series dataset from the UCR repository is used for this experiment. All three steps of MTC are applied to this sample dataset to show the results visually. The following figure shows the correct answer (ground truth) of the CBF dataset.


The ground truth (labels) related to the CBF dataset (with 30 instances)

In the first step of MTC, k-Modes clustering is applied to the time-series represented by SAX(8,6). The distance measure used for this step is APXDIST. The result of the first step is shown in the following figure. The time-series in the different clusters are drawn using different colours; that is, time-series belonging to the same cluster all have the same colour.

k-Modes clustering of time-series represented using SAX(8,6)
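For readers unfamiliar with the representation, the following is a hedged, minimal sketch of the standard SAX transform, under the assumption that SAX(8,6) denotes an alphabet of size 8 with each symbol summarising 6 points; it is illustrative, not the thesis code, and assumes a non-constant series.

    import numpy as np
    from statistics import NormalDist

    def sax(ts, alphabet_size=8, compression_ratio=6):
        """Classic SAX: z-normalise, reduce with PAA, then map each segment
        mean to a symbol using equiprobable Gaussian breakpoints."""
        ts = np.asarray(ts, dtype=float)
        ts = (ts - ts.mean()) / ts.std()                     # z-normalisation
        n_segments = len(ts) // compression_ratio
        paa = ts[: n_segments * compression_ratio].reshape(
            n_segments, compression_ratio).mean(axis=1)      # PAA segment means
        breakpoints = [NormalDist().inv_cdf(i / alphabet_size)
                       for i in range(1, alphabet_size)]     # 7 cut points for a=8
        symbols = np.searchsorted(breakpoints, paa)          # 0 .. alphabet_size-1
        return "".join(chr(ord("a") + s) for s in symbols)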

The accuracy of this step is shown in the following table.

Quality of k-Modes clustering on the CBF_train dataset represented by SAX(8,6)

RI    Purity  CSM   Entropy  F-measure  Jaccard  FM
0.63  0.63    0.50  0.24     0.65       0.27     0.43

After applying the MTC model to the CBF dataset, different results are obtained depending on the merging algorithm used in the last step. For example, the following diagram depicts the result of the partitioning algorithm using k-Means. Notice that the prototype of each cluster is built in a manner similar to the centroids of simple objects.

Running MTC on CBF with the k-means algorithm as the last step
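The merging algorithm in the last step only needs a distance matrix over the prototypes. As an illustrative sketch (not the thesis implementation), a minimal PAM-style k-Medoids over a precomputed matrix might look as follows; the random initialisation and iteration cap are assumptions.

    import numpy as np

    def k_medoids(dist, k, n_iter=100, seed=0):
        """Minimal PAM-style k-Medoids on a precomputed symmetric (n, n)
        distance matrix. Returns medoid indices and cluster labels."""
        rng = np.random.default_rng(seed)
        n = len(dist)
        medoids = rng.choice(n, size=k, replace=False)
        for _ in range(n_iter):
            labels = np.argmin(dist[:, medoids], axis=1)  # assign to nearest medoid
            new_medoids = medoids.copy()
            for c in range(k):
                members = np.where(labels == c)[0]
                if len(members) == 0:
                    continue
                # pick the member minimising total distance to the others
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(costs)]
            if np.array_equal(new_medoids, medoids):
                break  # converged
            medoids = new_medoids
        labels = np.argmin(dist[:, medoids], axis=1)
        return medoids, labels

The same function can take a DTW- or ED-based matrix, which is how the different third-step variants reported in this appendix differ.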

The experiment was repeated using various merging approaches in the third step of MTC. The average quality over 100 runs is reported. All of the runs' results have better quality than the first step.

Accuracy of MTC against the ground truth across various algorithms in the third step

Algorithm      RI    Purity  CSM   Entropy  F-measure  Jaccard  FM
MTC-hier-com   0.66  0.67    0.67  0.40     0.84       0.42     0.60
MTC-k-means    0.74  0.73    0.72  0.57     0.83       0.45     0.63
MTC-hier-avg   0.74  0.73    0.77  0.59     0.90       0.52     0.70
MTC-k-medoid   0.69  0.67    0.70  0.45     0.78       0.45     0.64
MTC-hier-sing  0.73  0.70    0.77  0.56     0.91       0.53     0.71

Control Chart dataset

As another experiment, MTC is applied to the Control Chart (CC) dataset as well. The following table shows the results (reported by the Matlab implementation of MTC) related to the clustering of CC. The report shows the complete process and the parameters for each step pertaining to a sample run of MTC.

The results related to clustering of CC using MTC

Results:
Step 1
dis_method:SAXAPX | rep:SAX | alphabet_size:8 | compression_ratio:6
--> DS:300 | clusters:6
--> quality:0.74968
Step 2
dis_method:DTW | dtw_bound:0.5 | rep:SAX | alphabet_size:8 | compression_ratio:6
--> cluster:1 Mems:33 Clus:3
--> cluster:2 Mems:55 Clus:4
--> cluster:3 Mems:99 Clus:7
--> cluster:4 Mems:13 Clus:4
--> cluster:5 Mems:73 Clus:5
--> cluster:6 Mems:27 Clus:5
Number of clusters:28
Step 3
dis_method:DTW | dtw_bound:0.5 | rep:SAX | alphabet_size:8 | compression_ratio:2
Making prototype ...
clustering:k-Medoids
--> quality:0.89394

For more intuition, the first step of the clustering is shown in the following chart.


The first step of clustering of CC using MTC

Moreover, the accuracy obtained after running the first step is reported in the following table.

Accuracy of k-Modes clustering on the CC dataset represented by SAX(8,6)

RI    Purity  CSM   Entropy  F-measure  Jaccard  FM
0.88  0.74    0.75  0.75     0.85       0.54     0.71

After running the whole algorithm, the final result is shown below. The result illustrates how the cluster purity is increased. Accordingly, the accuracy calculated at the end of the clustering is reported in the following table.


Running MTC on CC with the k-Medoids algorithm as the last step

RI    Purity  CSM   Entropy  F-measure  Jaccard  FM
0.96  0.94    0.88  0.87     0.94       0.79     0.88
