Searching in Metric Spaces
Total Page:16
File Type:pdf, Size:1020Kb
Searching in Metric Spaces EDGAR CHAVEZ´ Escuela de Ciencias F´ısico-Matem´aticas, Universidad Michoacana GONZALO NAVARRO, RICARDO BAEZA-YATES Depto. de Ciencias de la Computaci´on, Universidad de Chile AND JOSE´ LUIS MARROQU´IN Centro de Investigaci´on en Matem´aticas (CIMAT) The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. Many solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reconceived several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. This includes a quantitative definition of the elusive concept of “intrinsic dimensionality.” We also present a unified Categories and Subject Descriptors: F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical algorithms and problems—computations on discrete structures, geometrical problems and computations, sorting and searching; H.2.1 [Database management]: Physical design—access methods; H.3.1 [Information storage and retrieval]: Content analysis and indexing—indexing methods; H.3.2 [Information storage and retrieval]: Information storage—file organization; H.3.3 [Information storage and retrieval]: Information search and retrieval—clustering, search process; I.5.1 [Pattern recognition]: Models—geometric; I.5.3 [Pattern recognition]: Clustering General Terms: Algorithms Additional Key Words and Phrases: Curse of dimensionality, nearest neighbors, similarity searching, vector spaces This project has been partially supported by CYTED VII.13 AMYRI Project (all authors), by CONACyT grant R-28923A (first author) and by Fondecyt grant 1-000929 (second and third authors). Authors’ addresses: E. Chavez,´ Escuela de Ciencias F´ısico-Matematicas,´ Universidad Michoacana, Edificio “B”, Ciudad Universitaria, Morelia, Mich. Mexico´ 58000; email: [email protected]; G. Navarro and R. Baeza-Yates, Depto. de Ciencias de la Computacion,´ Universidad de Chile, Blanco Encalada 2120, Santiago, Chile; email: gnavarro,rbaeza @dcc.uchile.cl; J. L. Marroqu´ın, Centro de In- vestigacion´ en Matematicas´ (CIMAT), Callej{ on´ de Jalisco S/N,} Valenciana, Guanajuato, Gto. Mexico´ 36000; email: [email protected]. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. c 2001 ACM 0360-0300/01/0900-0273 $5.00 ! ACM Computing Surveys, Vol. 33, No. 3, September 2001, pp. 273–321. 274 E. Ch´avez et al. view of all the known proposals to organize metric spaces, so as to be able to understand them under a common framework. Most approaches turn out to be variations on a few different concepts. We organize those works in a taxonomy that allows us to devise new algorithms from combinations of concepts not noticed before because of the lack of communication between different communities. We present experiments validating our results and comparing the existing approaches. We finish with recommendations for practitioners and open questions for future development. 1. INTRODUCTION applications such as data mining re- quire accessing the database by any field, Searching is a fundamental problem in not only those marked as “keys.” Hence, computer science, present in virtually new models for searching in unstructured every computer application. Simple ap- repositories are needed. plications pose simple search problems, The above scenarios require more gen- whereas a more complex application will eral search algorithms and models than require, in general, a more sophisticated those classically used for simple data. form of searching. A unifying concept is that of “similarity The search operation traditionally has searching” or “proximity searching,” that been applied to “structured data,” that is, searching for database elements that is, numerical or alphabetical information are similar or close to a given query that is searched for exactly. That is, a element.1 Similarity is modeled with search query is given and the number or a distance function that satisfies the string that is exactly equal to the search triangle inequality, and the set of ob- query is retrieved. Traditional databases jects is called a metric space. Since the are built around the concept of exact problem has appeared in many diverse searching: the database is divided into areas, solutions have appeared in many records, each record having a fully com- unrelated fields, such as statistics, compu- parable key. Queries to the database re- tational geometry, artificial intelligence, turn all the records whose keys match the databases, computational biology, pattern search key. More sophisticated searches recognition, and data mining, to name a such as range queries on numerical keys or few. Since the current solutions come from prefix searching on alphabetical keys still such diverse fields, it is not surprising that rely on the concept that two keys are or the same solutions have been reinvented are not equal, or that there is a total lin- many times, that obvious combinations ear order on the keys. Even in recent years, of solutions have not been noticed, and when databases have included the ability that no thorough comparisons have been to store new data types such as images, done. More important, there have been the search has still been done on a prede- no attempts to conceptually unify all termined number of keys of numerical or those solutions. alphabetical types. In some applications the metric space With the evolution of information turns out to be of a particular type called and communication technologies, unstruc- “vector space,” where the elements consist tured repositories of information have of k real-valued coordinates. A lot of work emerged. Not only new data types such as has been done on vector spaces by exploit- free text, images, audio, and video have ing their geometric properties, but nor- to be queried, but also it is no longer pos- mally these cannot be extended to general sible to structure the information in keys metric spaces where the only available and records. Such structuring is very dif- ficult (either manually or computation- ally) and restricts beforehand the types of 1 The term “approximate searching” is also used, but queries that can be posed later. Even when it is misleading and we use it here only when refer- a classical structuring is possible, new ring to approximation algorithms. ACM Computing Surveys, Vol. 33, No. 3, September 2001. Searching in Metric Spaces 275 information is the distance among objects. as to be able to categorize and analyze In this general case, moreover, the dis- them under a common framework. We fo- tance is normally quite expensive to com- cus mainly on the number of distance eval- pute, so the general goal is to reduce the uations needed to execute range queries number of distance evaluations. In con- (i.e., with fixed tolerance radius), which trast, the operations in vector spaces tend are the most basic ones. However, we also to be simple and hence the goal is mainly pay some attention to the total CPU time, to reduce I/O. Some important advances time and space cost to build the indexes, have been done for general metric spaces, nearest neighbor queries, dynamic capa- mainly around the concept of building bilities of the indexes, and I/O consid- an index, that is, a data structure to re- erations. There are some features that duce the number of distance evaluations we definitely do not cover in order to at query time. Some recent work [Ciaccia keep our scope reasonably bounded, such et al. 1997; Prabhakar et al. 1998] tries as (1) complex similarity queries involv- to achieve simultaneously the goals of re- ing more than one similarity predicate ducing the number of distance evaluations [Ciaccia et al. 1998b], as few works on and the amount of I/O performed. them exist and they are an elaboration The main goal of this work is to present over the simple similarity queries (a par- a unifying framework to describe and ticular case is polygon searching in vec- analyze all the existing solutions to this tor spaces); (2) subqueries (i.e., searching problem. We show that all the existing in- a small element inside a larger element) dexing algorithms for proximity search- since the solutions are basically the same ing consist of building a set of equiv- after a domain-dependent transformation alence classes, discarding some classes, is done; and (3) inverse queries (i.e., find- and exhaustively searching the rest. Two ing the elements for which q is their main techniques based on equivalence re- closest neighbor) and total queries (e.g., lations, namely, pivoting and compact par- finding all the closest neighbors) since titions, are shown to encompass all the ex- the algorithms are, again, built over the isting methods. As a consequence of the simple ones. analysis we are able to build a taxon- This article is organized as follows. The omy on the existing algorithms for prox- first part (Sections 2 through 5) is a pure imity search, to classify them according survey of the state of the art in search- to their essential features, and to analyze ing metric spaces, with no attempt to pro- their efficiency.