The Omni-Family of All-Purpose Access Methods: a Simple and Effective Way to Make Similarity Search More Efﬁcient

The VLDB Journal (2007) 16: 483 – 505 DOI 10.1007/s00778-005-0178-0 REGULAR PAPER Caetano Traina Jr · Roberto F. Santos Filho · Agma J. M. Traina · Marcos R. Vieira · Christos Faloutsos The Omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient Received: 2 November 2004 / Accepted: 1 July 2005 / Published online: 29 June 2006 c Springer-Verlag 2006 Abstract Similarity search operations require executing ex- “alike” but not “equal” is known as similarity search.Per- pensive algorithms, and although broadly useful in many forming similarity searches on complex data is a cumber- new applications, they rely on specific structures not yet some process, having its own challenges and problems. supported by commercial DBMS. In this paper we discuss A requirement to query sets of complex data types is a the new Omni-technique, which allows to build a variety function to evaluate the degree of similarity (or the oppo- of dynamic Metric Access Methods based on a number of site, the dissimilarity or distance) between two objects of the selected objects from the dataset, used as global reference dataset. The challenge is to simulate the process performed objects. We call them as the Omni-family of metric access by the human specialist when comparing such data. The methods. This technique enables building similarity search comparison usually requires expensive algorithms, taking a operations on top of existing structures, significantly im- lot of computing time to find the answer. Another problem is proving their performance, regarding the number of disk ac- how to store large sets of complex data. Such data tend to be cess and distance calculations. Additionally, our methods structurally complex and each element composed of thou- scale up well, exhibiting sub-linear behavior with growing sands of bytes, requiring efficient strategies for secondary database size. storage management. Therefore, querying large datasets re- Keywords Similarity search · Metric access methods · quires using efficient indexing structures to reduce the num- Index structures · Multimedia databases ber of comparison operations. Many approaches to accelerate similarity queries in large datasets have been proposed, the majority of them creat- 1 Introduction ing new access methods. In this paper we present a new approach to improve existing access methods, the Omni- The evolution of the Computer Science field brought us technique, leading to the development of a family of new new challenges and the demand to deal with diverse human indexing structures. It aims at improving the performance of knowledge, considerably expanding the gamut of data types similarity queries by reducing both the required number of that must be supported by the computational applications, as comparison operations and the amount of data accessed dur- is the case of the Database Management Systems (DBMS) ing the query execution. We show that our technique is faster for multimedia data. Protein sequences, DNA strings, im- than the existing ones, while it guarantees the accuracy of the ages, audio, video and time series are examples of new and query answer with no false dismissals and no false alarms, complex data types that must be stored and retrieved. Com- as false alarms are cleaned in a post-processing phase. plex data need to be searched and compared through their content, demanding more elaborate processing in the re- trieval operations. Answering queries based on data that are 1.1 Paper scope, contributions and organization C. Traina Jr (B) · R. F. S. Filho · A. J. M. Traina · M. R. Vieira As the evaluation of similarity between two complex objects Department of Computer Science, Drop and Statistics, University of São Paulo at São Carlos, is usually expensive and time consuming, the reduction of Avenida Trabalhador Saocarlense, 400, São Carlos, SP, Brazil the number of comparisons in an important issue. There- E-mail: {caetano, figueira, agma, mrvieira}@icmc.usp.br fore, we regard the cost of distance calculations as the most C. Faloutsos expensive step during the process of answering similarity Department of Computer Science, Carnegie Mellon University, queries. Also, while the information technology commu- 5000 Forbes Ave., Pittsburgh, PA, USA nity urges for techniques to efficiently manage complex E-mail: [email protected] data types, inserting a new access method in commercial 484 C. Traina Jr et al. database management systems (DBMS) is an expensive Table 1 Summary of symbols and definitions task. In commercial DBMS, problems such as concurrent access, robustness and intensive insertions and deletions of Symbol Definition objects in a highly dynamic environment, among others, AS[] Answer set of a similarity query C(si ) Omni-coordinates of object si must also be carefully treated. d() Distance function To target this problem, we propose a family of methods dfg(si ) Distance from an object si to focus fg lying on the metric space model that can be implemented on D2 Correlation fractal dimension of the dataset S top of existing access methods, such as B-Trees and R-Trees, E Embedded dimension of the dataset S F or even on top of sequential scan. The proposed technique fg The gth focus of base F The Omni-foci base uses a number of selected objects from the dataset as global h Number of foci in F reference points. These objects, called Omni-foci, are chosen IOid Object Id used internally by an Omni-member at the beginning of the indexing process and are used to im- k The number of neighbors in a NN query prove the underlying access method, decreasing the amount nb f Farthest neighbor in a k-nearest neighbor list of distance calculations needed to answer similarity queries. N Number of objects in the dataset S rq Radius of a range query We call this new approach the “Omni-technique”, because it sq a Query object (or query center) 1 can be applied on the majority of existing access methods. si , s j Objects of S S Set of objects in domain S Coupling the Omni-technique with another existing method S generates a new one, leading to a whole new class of access Domain of objects methods that we call the “Omni-family”. This paper presents the following contributions: comparison with existing access methods. Finally, Sect. 8 1. It provides a technique to minimize the number of dis- gives the conclusions of this paper. tance calculations required to answer similarity queries, using the Omni-foci from the dataset. Experimental evi- dences show that this technique gives good performance 2 Background even if the query retrieves significantly more than 10% of the dataset; This section presents the basic definitions and fundamental 2. It explains how to define an adequate number of objects concepts supporting this work. Table 1 summarizes the sym- to be used as Omni-foci, with the best tradeoff between bols used through this paper. the (secondary) memory demanded and the number of A core problem related to retrieving complex data com- distance calculations required; paring their content is the definition of a similarity (or dis- 3. It depicts an inexpensive algorithm to choose adequate similarity) function that ought to match the notion of simi- Omni-foci; larity of the human specialist in the data domain. A distance 4. It shows how to combine the Omni-technique with other function should return small values for object pairs that a existing access methods. Three members of the Omni- human would consider close (or similar), and larger values family are presented: the Omni-Sequential, OmniR-Tree as more dissimilar the objects are. At least two different ap- and OmniB-Forest, including improved algorithms for proaches related to the data domain have been modeled and the two most required kinds of similarity queries: the studied in the literature, the vector space model and the met- range and the k-nearest neighbor queries. Algorithms for ric space model: post-optimization of the data structure (after many insertions) are also discussed; 1. Vector Space Model The complex objects are described 5. Finally, it presents experimental results comparing the by their main features, which are extracted by well- performance of the proposed Omni-family members and tailored algorithms designed by specialists from the data the sequential scan, the Slim-Tree, M-Tree, VP-Tree and domain. Such information is placed in the so-called fea- the R-Tree. ture vectors, which are e-dimensional arrays, where e is the number of features extracted from the objects. Each The remainder of this paper is organized as follows. feature vector can be seen as a point pertaining to an Section 2 provides the background information about metric e-dimensional space. The dissimilarity between two ob- distance functions, similarity queries and intrinsic dimen- jects can be measured, for example, by any Minkowski sionality of datasets. Section 3 surveys the state-of-the-art distance function [49], such as Euclidean (L ), Manhat- works about Metric Access Methods (MAMs). Section 4 2 tan (L ), or Infinity (L∞). Color histograms are com- introduces the Omni-technique, including file organization 1 mon examples of feature vectors extracted from images, and query algorithms. Section 5 shows how to combine where the dimensionality of the space is the number of the Omni-technique with three access methods: sequential colors used in the image quantization process. scan, B+-Trees and R-Trees. Section 6 presents experiments 2. Metric Space Model For some domains of objects, ex- showing the behavior of these Omni-family members and a tracting features does not lead to a fixed number of char- 1 A preliminary version of this paper was presented at IEEE-ICDE’ acteristics. This happens, for example, with fingerprints, 2001 [35]. because everyone has a particular number of features, The OMNI-family of all-purpose access methods 485 such as deltas, endings and so on [38].

The Omni-Family of All-Purpose Access Methods: a Simple and Effective Way to Make Similarity Search More Efﬁcient

Lecture 04 Linear Structures Sort

The Fractal Dimension Making Similarity Queries More Efficient

Pivot Selection Method for Optimizing Both Pruning and Balancing in Metric Space Indexes

Peer-To-Peer Similarity Search Based on M-Tree Indexing

Similarity Searching in Peer-To-Peer Environment

Prefix Hash Tree an Indexing Data Structure Over Distributed Hash

Efficient Nearest-Neighbor Computation for GPU-Based Motion

Distributed Computation of the Knn Graph for Large High-Dimensional Point Sets

An Investigation of Practical Approximate Nearest Neighbor Algorithms

Search Trees

Indexing the Fully Evolvement of Spatiotemporal Objects

K-NN Search in Non-Clustered Case Using K-P-Tree