Return of the Lernaean Hydra: Experimental Evaluation of Data Series Approximate Similarity Search

Karima Echihabi (IRDA, Rabat IT Center, ENSIAS, Mohammed V Univ.), Kostas Zoumpatianos (Harvard University), Themis Palpanas (Université de Paris), Houda Benbrahim (IRDA, Rabat IT Center, ENSIAS, Mohammed V Univ.)

ABSTRACT

Data series are a special type of multidimensional data present in numerous domains, where similarity search is a key operation that has been extensively studied in the data series literature. In parallel, the multidimensional community has studied approximate similarity search techniques. We propose a taxonomy of similarity search techniques that reconciles the terminology used in these two domains, we describe modifications to data series indexing techniques enabling them to answer approximate similarity queries with quality guarantees, and we conduct a thorough experimental evaluation to compare approximate similarity search techniques under a unified framework, on synthetic and real datasets in memory and on disk. Although data series differ from generic multidimensional vectors (series usually exhibit correlation between neighboring values), our results show that data series techniques answer approximate queries with strong guarantees and an excellent empirical performance, on data series and vectors alike. These techniques outperform the state-of-the-art approximate techniques for vectors when operating on disk, and remain competitive in memory.

PVLDB Reference Format:
Karima Echihabi, Kostas Zoumpatianos, Themis Palpanas, and Houda Benbrahim. Return of the Lernaean Hydra: Experimental Evaluation of Data Series Approximate Similarity Search. PVLDB, 13(3): 402-419, 2019.
DOI: https://doi.org/10.14778/3368289.3368303

1. INTRODUCTION

Motivation. A data series is a sequence of ordered real values (the order attribute can be angle, mass, time, etc. [118]; when the order is time, the series is called a time series; we use data series, time series, and sequence interchangeably). Data series are ubiquitous, appearing in nearly every domain, including science and engineering, medicine, business, finance, and economics [80, 135, 121, 141, 107, 72, 129, 19, 87, 97, 155, 68]. The increasing presence of IoT technologies is making collections of data series grow to multiple terabytes [117]. These data series collections need to be analyzed in order to extract knowledge and produce value [119]. The process of retrieving similar data series (i.e., similarity search) forms the backbone of most analytical tasks, including outlier detection [35, 28], frequent pattern mining [127], clustering [84, 130, 128, 153], and classification [39]. Thus, to render data analysis algorithms and pipelines scalable, we need to make similarity search more efficient.

Similarity Search. A large number of data series similarity search methods have been studied, supporting exact search [7, 137, 124, 81, 127, 110], approximate search [136, 85, 10, 46, 49], or both [32, 134, 152, 33, 163, 157, 88, 96, 95, 122, 158, 90, 123]. In parallel, the research community has also developed exact [23, 67, 22, 26, 44, 154, 57] and approximate [73] similarity search techniques geared towards generic multidimensional vector data (a comprehensive survey of techniques for multidimensional vectors can be found elsewhere [132]). In the past few years, though, we have been witnessing a renewed interest in the development of approximate methods [74, 156, 142, 18, 103].

This study is the first experimental comparison of the efficiency and accuracy of data series approximate similarity search methods ever conducted. Specifically, we evaluate the accuracy of both data series specific approximate similarity search methods, as well as that of approximate similarity search algorithms that operate on general multidimensional vectors. Moreover, we propose modifications to data series techniques in order to support approximate query answering with theoretical guarantees, following [43].

Our experimental evaluation covers in-memory and out-of-core experiments, the largest publicly available datasets, extensive comparison criteria, and a new set of methods that have never been compared before. We thus differ from other experimental studies, which focused on the efficiency of exact search [54], the accuracy of dimensionality reduction techniques and similarity measures for classification tasks [83, 52, 20], or in-memory high-dimensional methods [93, 17, 112]. In this study, we focus on the problem of approximate whole matching similarity search in collections with a very large number of data series, i.e., similarity search that produces approximate (not exact) results, by calculating distances on the whole (not a sub-) sequence. This problem represents a common use case across many domains [62, 142, 18, 103, 47, 105, 64, 119].

Contributions. Our key contributions are as follows:
1. We present a similarity search taxonomy that classifies methods based on the quality guarantees they provide for the search results, and that unifies the varied nomenclature used in the literature. Following this taxonomy, we include a brief survey of similarity search approaches supporting approximate search, bringing together works from the data series and multidimensional data research communities.
2. We propose a new set of approximate approaches with theoretical guarantees on accuracy and excellent empirical performance, based on modifications to the current data series exact methods.
3. We evaluate all methods under a unified framework to prevent implementation bias. We used the most efficient C/C++ implementations available for all approaches, and developed from scratch in C the ones that were only implemented in other programming languages. Our new implementations are considerably faster than the original ones.
4. We conduct the first comprehensive experimental evaluation of the efficiency and accuracy of data series approximate similarity search approaches, using synthetic and real series and vector datasets from different domains, including the two largest vector datasets publicly available. The results unveil the strengths and weaknesses of each method, and lead to recommendations as to which approach to use.
5. Our results show that the methods derived from the exact data series indexing approaches generally surpass the state-of-the-art techniques for approximate search in vector spaces. This observation had not been made in the past, and it paves the way for exciting new developments in the field of approximate similarity search for data series and multidimensional data at large.
6. We share all source codes, datasets, and queries [6].

2. DEFINITIONS AND TERMINOLOGY

Similarity search represents a common problem in various areas of computer science. In the case of data series, several different flavors have been studied in the literature, oftentimes using overloaded and conflicting terms. We summarize here these variations and provide definitions, thus setting a common terminology.

On Similarity Search Distances. The distance between a query series, SQ, and a candidate series, SC, is denoted by d(SQ, SC). The Euclidean distance is the most widely used, and one of the most effective for large series collections [52]. Some similarity search methods also rely on the lower-bounding distance (distances in the reduced dimensionality space are guaranteed to be smaller than or equal to the distances in the original space) [33, 163, 134, 152, 157, 96, 44, 81] and the upper-bounding distance (distances in the reduced space are larger than the distances in the original space) [152, 81].
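To make the lower-bounding property concrete, the sketch below uses a simple piecewise-mean (PAA-style) summarization as the reduced-dimensionality space; this choice and the function names are illustrative assumptions, not taken from the methods cited above, each of which uses its own summarization.

```c
/* Illustrative sketch (not from the paper): a piecewise-mean summary and a
 * reduced-space distance that lower-bounds the Euclidean distance.
 * Function names are hypothetical. */
#include <math.h>
#include <stddef.h>

/* Reduce a series of length n to `segments` mean values; assumes n is a
 * multiple of `segments`. */
void summarize(const float *series, size_t n, size_t segments, float *summary)
{
    size_t seg_len = n / segments;
    for (size_t s = 0; s < segments; s++) {
        double sum = 0.0;
        for (size_t i = 0; i < seg_len; i++)
            sum += series[s * seg_len + i];
        summary[s] = (float)(sum / seg_len);
    }
}

/* Distance computed on the summaries; it never exceeds the true Euclidean
 * distance between the original series, so it can be used for pruning. */
double lower_bound(const float *sum_q, const float *sum_c,
                   size_t segments, size_t seg_len)
{
    double acc = 0.0;
    for (size_t s = 0; s < segments; s++) {
        double diff = (double)sum_q[s] - (double)sum_c[s];
        acc += diff * diff;
    }
    return sqrt((double)seg_len * acc);
}
```

Because the reduced-space distance never overestimates the true distance, an index can safely discard any candidate whose lower bound already exceeds the best distance found so far.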
On Similarity Search Queries. We assume a data series collection, 𝕊, a query series, SQ, and a distance function d(·, ·). A k-Nearest-Neighbor (k-NN) query identifies the k series in the collection with the smallest distances to the query series, while an r-range query identifies all the series in the collection within range r from the query series.

Definition 1. [54] Given an integer k, a k-NN query retrieves the set of series A = {SC1, ..., SCk} ⊆ 𝕊 such that ∀ SC ∈ A and ∀ SC′ ∉ A, d(SQ, SC) ≤ d(SQ, SC′).

Definition 2. [54] Given a distance r, an r-range query retrieves the set of series A = {SC ∈ 𝕊 | d(SQ, SC) ≤ r}.
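Definitions 1 and 2 can be read directly as a sequential scan over the collection. The sketch below (our illustration; names such as knn_scan and range_scan are hypothetical) answers both query types exactly with the Euclidean distance, which is the brute-force baseline that indexing and approximate methods aim to beat.

```c
/* Exact sequential-scan answering of k-NN and r-range queries
 * (Definitions 1 and 2).  Names are illustrative, not from the paper. */
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

static double euclidean(const float *sq, const float *sc, size_t len)
{
    double acc = 0.0;
    for (size_t i = 0; i < len; i++) {
        double diff = (double)sq[i] - (double)sc[i];
        acc += diff * diff;
    }
    return sqrt(acc);
}

/* k-NN query: writes the indices of the k closest series into result[0..k-1],
 * ordered by increasing distance.  Assumes k <= num_series. */
void knn_scan(const float *collection, size_t num_series, size_t len,
              const float *sq, size_t k, size_t *result)
{
    double *best = malloc(k * sizeof *best);
    for (size_t j = 0; j < k; j++) { best[j] = INFINITY; result[j] = 0; }
    for (size_t i = 0; i < num_series; i++) {
        double d = euclidean(sq, collection + i * len, len);
        for (size_t j = 0; j < k; j++) {          /* insert into sorted top-k */
            if (d < best[j]) {
                for (size_t m = k - 1; m > j; m--) {
                    best[m] = best[m - 1];
                    result[m] = result[m - 1];
                }
                best[j] = d;
                result[j] = i;
                break;
            }
        }
    }
    free(best);
}

/* r-range query: writes the indices of all series within distance r of the
 * query into result and returns how many there are. */
size_t range_scan(const float *collection, size_t num_series, size_t len,
                  const float *sq, double r, size_t *result)
{
    size_t count = 0;
    for (size_t i = 0; i < num_series; i++)
        if (euclidean(sq, collection + i * len, len) <= r)
            result[count++] = i;
    return count;
}
```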
We additionally identify whole matching (WM) queries (similarity between an entire query series and an entire candidate series), and subsequence matching (SM) queries (similarity between an entire query series and all subsequences of a candidate series).

Definition 3. [54] A WM query finds the candidate data series S ∈ 𝕊 that matches SQ, where |S| = |SQ|.

Definition 4. [54] A SM query finds the subsequence S[i : j] of a candidate data series S ∈ 𝕊 that matches SQ, where |S[i : j]| = |SQ| < |S|.

In practice, we have WM queries on large collections of short series [55, 3], SM queries on large collections of short series [1], and SM queries on collections of long series [59]. Note that a SM query can be converted to WM [96, 95].

On Similarity Search Methods. The similarity search algorithms (k-NN or range) that always produce correct and complete answers are called exact. Algorithms that do not satisfy this property are called approximate. An ε-approximate algorithm guarantees that its distance results have a relative error no more than ε, i.e., the approximate distance is at most (1 + ε) times the exact one. A δ-ε-approximate algorithm guarantees that its distance results will have a relative error no more than ε (i.e., the approximate distance is at most (1 + ε) times the exact distance), with a probability of at least δ.
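As noted above, an SM query can be converted to WM [96, 95]: one reading of that conversion is that every sliding window of query length taken from a long series becomes a separate whole-matching candidate. The sketch below follows that reading; the function name and layout are our own illustrative assumptions.

```c
/* Sketch of an SM-to-WM conversion: every sliding window of query length
 * taken from a long series becomes one whole-matching candidate.
 * Names are illustrative. */
#include <stddef.h>
#include <string.h>

/* Copies the n - qlen + 1 overlapping subsequences of `series` (length n)
 * into `out`, laid out back to back; a WM method can then index and search
 * them as ordinary whole series.  Assumes qlen <= n and that `out` has room
 * for (n - qlen + 1) * qlen floats. */
size_t extract_windows(const float *series, size_t n, size_t qlen, float *out)
{
    size_t num_windows = n - qlen + 1;
    for (size_t i = 0; i < num_windows; i++)
        memcpy(out + i * qlen, series + i, qlen * sizeof(float));
    return num_windows;
}
```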
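The ε and δ-ε contracts above can also be checked empirically, which is essentially what an accuracy evaluation does: for each query, compare the approximate distance against the exact one and measure how often the relative error stays within ε. A minimal sketch, with hypothetical names and assuming the exact distances were computed offline:

```c
/* Sketch of an empirical check of the epsilon and delta-epsilon guarantees.
 * d_exact[i] and d_approx[i] hold, for query i, the exact nearest-neighbor
 * distance and the distance returned by an approximate method.
 * Names and evaluation protocol are illustrative. */
#include <stddef.h>

/* Fraction of queries whose approximate distance is within (1 + epsilon)
 * times the exact distance.  An epsilon-approximate method must yield 1.0;
 * a delta-epsilon-approximate method must yield at least delta. */
double fraction_within_epsilon(const double *d_exact, const double *d_approx,
                               size_t num_queries, double epsilon)
{
    size_t ok = 0;
    for (size_t i = 0; i < num_queries; i++)
        if (d_approx[i] <= (1.0 + epsilon) * d_exact[i])
            ok++;
    return (double)ok / (double)num_queries;
}
```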