Role of Similarity Measures in Time Series Analysis
UNIVERSITY OF NOVI SAD
FACULTY OF SCIENCE
DEPARTMENT OF MATHEMATICS AND INFORMATICS

Zoltan Geler

Role of Similarity Measures in Time Series Analysis
- Doctoral Dissertation -

Mentor: Prof. Dr. Mirjana Ivanović

Novi Sad, 2015

Preface

The main subject of this dissertation is a comprehensive examination of the role of similarity measures in time-series analysis, carried out by studying the influence of the Sakoe-Chiba band on classification accuracy.

A time series is a sequence of numerical data points in successive order, usually with uniform intervals between them. Time series are used for storage, analysis and visualization of data collected in many different domains, including science, medicine, economics and others. The increasing demand to work with large amounts of data has led to a growing interest in the different tasks of time-series data mining. In recent years, there has been increasing interest in studying various aspects of time-series classification.

One of the most important issues in time-series classification is a thoughtful and adequate choice of the similarity measure. Similarity measures enable the comparison of time series by describing their similarity (or dissimilarity) with numerical values. In the field of time-series data mining, a great number of different similarity measures have been proposed and used. Many of these measures are implemented using dynamic programming. Unfortunately, the time requirements of this technique can limit its applicability in practice. One way of dealing with this issue is to limit the search area by applying global constraints. However, these restrictions may affect the accuracy of classification, and therefore understanding their impact is of great importance.

The main task of this thesis is to discover and explain the role of these constrained similarity measures in the field of time-series analysis and data mining, with a focus on classification accuracy.
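As a rough illustration of how a global constraint limits the search area of a dynamic-programming similarity measure, the following minimal sketch computes Dynamic Time Warping restricted to a Sakoe-Chiba band. It is written in Python for brevity and is not taken from the FAP library; the function name dtw_sakoe_chiba, the parameter r (band half-width, counted in cells), the squared pointwise cost and the final square root are all assumptions made for this example.

```python
import math


def dtw_sakoe_chiba(a, b, r):
    """DTW distance between sequences a and b inside a Sakoe-Chiba band.

    Hypothetical helper for illustration only (not the FAP API).
    Cell (i, j) of the warping matrix is evaluated only if |i - j| <= r.
    Assumes squared pointwise cost; returns the square root of the total.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    # The band must be at least |n - m| wide, otherwise the final cell
    # (n, m) cannot be reached by any warping path.
    r = max(r, abs(n - m))

    inf = math.inf
    prev = [inf] * (m + 1)  # row i - 1 of the accumulated cost matrix
    prev[0] = 0.0           # empty prefixes align at zero cost

    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        # Only the columns inside the band are visited in this row.
        lo = max(1, i - r)
        hi = min(m, i + r)
        for j in range(lo, hi + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr

    return math.sqrt(prev[m])


if __name__ == "__main__":
    x = [0, 0, 1, 2]
    y = [0, 1, 2, 2]
    print(dtw_sakoe_chiba(x, y, 0))  # band of width 0: diagonal only
    print(dtw_sakoe_chiba(x, y, 1))  # one cell of warping allowed
```

With r = 0 and equal-length series, only the diagonal is visited and the result coincides with the Euclidean distance. Widening the band allows more warping at the price of more evaluated cells, roughly (2r + 1) per row instead of the full row, which is the trade-off between running time and accuracy that the thesis investigates.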
This dissertation is divided into the following seven chapters:

1. Introduction
2. Time Series and Similarity Measures
3. Time-Series Classification
4. Methods, Tools and Data Sets
5. Techniques for Improving Classification Accuracy
6. Framework for Time Series Analysis (FAP)
7. Conclusion

The first chapter gives a concise overview of the related topics together with the objectives of the dissertation. The necessary basic concepts of time-series data mining, including the formal definition of time series and descriptions of the analyzed similarity measures and global constraints, are given in chapter two. Chapter three is devoted to time-series classification. It gives a detailed overview of the nearest neighbor rule and its extensions and discusses several different weighting schemes. Chapter four describes the methods, tools and datasets used in our research. A detailed description of the extensive experiments performed and a discussion of the results are presented in chapter five. All experiments within this thesis were performed using our free Java library for time-series analysis and data mining (Framework for Analysis and Prediction, developed at the Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad). The details of its structure and implementation are given in chapter six. Chapter seven concludes the dissertation and lists some possible directions for future research. The appendices of this dissertation contain detailed results of the experiments described in chapter five.

Finally, I am especially indebted to my supervisor, Mirjana Ivanović, for introducing me to the extremely interesting field of time-series data mining, and for her valuable help, unreserved support and patient guidance throughout my research. This thesis would not have been possible without her and without the encouragement she has given me over the last years.
I would also like to warmly thank all the members of the committee for their patience and valuable suggestions regarding this thesis. My special thanks and gratitude go to Vladimir Kurbalija and Miloš Radovanović for their stimulating suggestions and advice during my research. My special thanks also go to the other colleagues and professors at the Department of Mathematics and Informatics, Faculty of Sciences, and the Department of Media Studies, Faculty of Philosophy, for a pleasant studying and working environment. I would like to thank Eamonn Keogh for collecting and making available the UCR time series datasets, as well as everyone who contributed data to the collection, without whom the experiments would not have been possible. Finally, I thank my parents and brother for their endless support and patience during my studies.

Novi Sad, 2015
Zoltan Geler

Contents

Preface
Contents
1. Introduction
2. Time Series and Similarity Measures
    2.1. Similarity Measures
        2.1.1. Euclidean Distance
        2.1.2. Dynamic Time Warping (DTW)
        2.1.3. Longest Common Subsequence (LCS)
        2.1.4. Edit Distance with Real Penalty (ERP)
        2.1.5. Edit Distance on Real sequence (EDR)
    2.2. Global Constraints of Time-Series Similarity Measures
3. Time-Series Classification
    3.1. The Nearest Neighbor Rule
        3.1.1. Dudani's weighting functions
        3.1.2. Macleod's weighting function
        3.1.3. The Fibonacci weighting function
        3.1.4. The uniform and dual-uniform weighting functions
        3.1.5. Zavrel's weighting function
        3.1.6. The dual distance-weighted function
    3.2. Classifier Evaluation Methods
4. Methods, Tools and Datasets
    4.1. Framework for Analysis and Prediction (FAP)
    4.2. Datasets used in the Experimental Evaluation
    4.3. FAP Validation
    4.4. Distance Matrices
5. Techniques for Improving Classification Accuracy
    5.1. Constraining the Similarity Measures and Applying Weights
    5.2. Performance of Constrained Similarity Measures
    5.3. Improving the Accuracy of the 1NN Classifier Using Constrained Similarity Measures
        5.3.1. Change of the 1NN Graph with Narrowing Constraints
        5.3.2. Change of Classes with Narrowing Constraints
        5.3.3. Impact on Classification Accuracy
    5.4. Improving the Accuracy of the kNN Classifier Using Constrained Similarity Measures
        5.4.1. Preparing the Experiments
        5.4.2. The Unweighted kNN Classifier
        5.4.3. The Weighted kNN Classifier
    5.5. Improving the Accuracy of the kNN Classifier Using Weights
        5.5.1. Euclidean distance