Searching in Metric Spaces

EDGAR CHÁVEZ
Escuela de Ciencias Físico-Matemáticas, Universidad Michoacana

GONZALO NAVARRO, RICARDO BAEZA-YATES
Depto. de Ciencias de la Computación, Universidad de Chile

AND JOSÉ LUIS MARROQUÍN
Centro de Investigación en Matemáticas (CIMAT)

The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. Many solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reconceived several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. This includes a quantitative definition of the elusive concept of "intrinsic dimensionality." We also present a unified view of all the known proposals to organize metric spaces, so as to be able to understand them under a common framework. Most approaches turn out to be variations on a few different concepts. We organize those works in a taxonomy that allows us to devise new algorithms from combinations of concepts not noticed before because of the lack of communication between different communities. We present experiments validating our results and comparing the existing approaches. We finish with recommendations for practitioners and open questions for future development.

Categories and Subject Descriptors: F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical algorithms and problems—computations on discrete structures, geometrical problems and computations, sorting and searching; H.2.1 [Database management]: Physical design—access methods; H.3.1 [Information storage and retrieval]: Content analysis and indexing—indexing methods; H.3.2 [Information storage and retrieval]: Information storage—file organization; H.3.3 [Information storage and retrieval]: Information search and retrieval—clustering, search process; I.5.1 [Pattern recognition]: Models—geometric; I.5.3 [Pattern recognition]: Clustering

General Terms: Algorithms

Additional Key Words and Phrases: Curse of dimensionality, nearest neighbors, similarity searching, vector spaces

This project has been partially supported by CYTED VII.13 AMYRI Project (all authors), by CONACyT grant R-28923A (first author) and by Fondecyt grant 1-000929 (second and third authors).
Authors' addresses: E. Chávez, Escuela de Ciencias Físico-Matemáticas, Universidad Michoacana, Edificio "B", Ciudad Universitaria, Morelia, Mich. México 58000; email: [email protected]; G. Navarro and R. Baeza-Yates, Depto. de Ciencias de la Computación, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile; email: {gnavarro,rbaeza}@dcc.uchile.cl; J. L. Marroquín, Centro de Investigación en Matemáticas (CIMAT), Callejón de Jalisco S/N, Valenciana, Guanajuato, Gto. México 36000; email: [email protected].
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 2001 ACM 0360-0300/01/0900-0273 $5.00

ACM Computing Surveys, Vol. 33, No. 3, September 2001, pp. 273–321.


1. INTRODUCTION

Searching is a fundamental problem in computer science, present in virtually every computer application. Simple applications pose simple search problems, whereas a more complex application will require, in general, a more sophisticated form of searching.

The search operation traditionally has been applied to "structured data," that is, numerical or alphabetical information that is searched for exactly. That is, a search query is given and the number or string that is exactly equal to the search query is retrieved. Traditional databases are built around the concept of exact searching: the database is divided into records, each record having a fully comparable key. Queries to the database return all the records whose keys match the search key. More sophisticated searches such as range queries on numerical keys or prefix searching on alphabetical keys still rely on the concept that two keys are or are not equal, or that there is a total linear order on the keys. Even in recent years, when databases have included the ability to store new data types such as images, the search has still been done on a predetermined number of keys of numerical or alphabetical types.

With the evolution of information and communication technologies, unstructured repositories of information have emerged. Not only new data types such as free text, images, audio, and video have to be queried, but also it is no longer possible to structure the information in keys and records. Such structuring is very difficult (either manually or computationally) and restricts beforehand the types of queries that can be posed later. Even when a classical structuring is possible, new applications such as data mining require accessing the database by any field, not only those marked as "keys." Hence, new models for searching in unstructured repositories are needed.

The above scenarios require more general search algorithms and models than those classically used for simple data. A unifying concept is that of "similarity searching" or "proximity searching," that is, searching for database elements that are similar or close to a given query element.¹ Similarity is modeled with a distance function that satisfies the triangle inequality, and the set of objects is called a metric space. Since the problem has appeared in many diverse areas, solutions have appeared in many unrelated fields, such as statistics, computational geometry, artificial intelligence, databases, computational biology, pattern recognition, and data mining, to name a few. Since the current solutions come from such diverse fields, it is not surprising that the same solutions have been reinvented many times, that obvious combinations of solutions have not been noticed, and that no thorough comparisons have been done. More important, there have been no attempts to conceptually unify all those solutions.

In some applications the metric space turns out to be of a particular type called "vector space," where the elements consist of k real-valued coordinates. A lot of work has been done on vector spaces by exploiting their geometric properties, but normally these cannot be extended to general metric spaces where the only available information is the distance among objects.

¹ The term "approximate searching" is also used, but it is misleading and we use it here only when referring to approximation algorithms.

In this general case, moreover, the distance is normally quite expensive to compute, so the general goal is to reduce the number of distance evaluations. In contrast, the operations in vector spaces tend to be simple and hence the goal is mainly to reduce I/O. Some important advances have been done for general metric spaces, mainly around the concept of building an index, that is, a data structure to reduce the number of distance evaluations at query time. Some recent work [Ciaccia et al. 1997; Prabhakar et al. 1998] attempts to achieve simultaneously the goals of reducing the number of distance evaluations and the amount of I/O performed.

The main goal of this work is to present a unifying framework to describe and analyze all the existing solutions to this problem. We show that all the existing indexing algorithms for proximity searching consist of building a set of equivalence classes, discarding some classes, and exhaustively searching the rest. Two main techniques based on equivalence relations, namely, pivoting and compact partitions, are shown to encompass all the existing methods. As a consequence of the analysis we are able to build a taxonomy on the existing algorithms for proximity search, to classify them according to their essential features, and to analyze their efficiency. We are able to identify essentially similar approaches, to point out combinations of ideas that have not previously been noticed, and to identify the main open problems in this area. We also present quantitative methods to assert the intrinsic difficulty in searching on a given metric space and provide lower bounds on the search problem. This includes a quantitative definition of the previously conceptual notion of "intrinsic dimensionality," which we show to be very appropriate. We present some experimental results that help to validate our assertions.

We remark that we are concerned with the essential features of the search algorithms for general metric spaces. That is, we try to extract the basic features from the wealth of existing solutions, so as to be able to categorize and analyze them under a common framework. We focus mainly on the number of distance evaluations needed to execute range queries (i.e., with fixed tolerance radius), which are the most basic ones. However, we also pay some attention to the total CPU time, time and space cost to build the indexes, nearest neighbor queries, dynamic capabilities of the indexes, and I/O considerations. There are some features that we definitely do not cover in order to keep our scope reasonably bounded, such as (1) complex similarity queries involving more than one similarity predicate [Ciaccia et al. 1998b], as few works on them exist and they are an elaboration over the simple similarity queries (a particular case is polygon searching in vector spaces); (2) subqueries (i.e., searching a small element inside a larger element) since the solutions are basically the same after a domain-dependent transformation is done; and (3) inverse queries (i.e., finding the elements for which q is their closest neighbor) and total queries (e.g., finding all the closest neighbors) since the algorithms are, again, built over the simple ones.

This article is organized as follows. The first part (Sections 2 through 5) is a pure survey of the state of the art in searching metric spaces, with no attempt to provide a new way of thinking about the problem. The second part (Sections 6 through 8) presents our basic results on the difficulty of the search problem and our unifying model that allows understanding the essential features of the problem and its existing solutions. Finally, Section 9 gives our conclusions and points out future research directions.

2. MOTIVATING APPLICATIONS

We now present a sample of applications where the concept of proximity searching appears. Since we have not presented a formal model yet, we do not try to explain the connections between the different applications. We rather delay this discussion to Section 3.


2.1. Query by Content in Structured Databases

In general, the query posed to a database presents a piece of a record of information, and it needs to retrieve the entire record. In the classical approach, the piece presented is fixed (the key). Moreover, it is not allowed to search with an incomplete or an erroneous key. On the other hand, in the more general approach required nowadays the concept of searching with a key is generalized to searching with an arbitrary subset of the record, allowing errors or not.

Possible types of searches are point or key search (all the key information is given), range search (only some fields are given or only a range of values is specified for them), and proximity search (in addition, records "close" to the query are considered interesting). These types of search are of use in data mining (where the interesting parts of the record cannot be predetermined), when the information is not precise, when we are looking for a range of values, when the search key may have errors (e.g., a misspelled word), and so on.

A general solution to the problem of range queries by any record field is the grid file [Nievergelt and Hinterberger 1984]. The domain of the database is seen as a hyperrectangle of k dimensions (one per record field), where each dimension has an ordering according to the domain of the field (numerical or alphabetical). Each record present in the database is considered as a point inside the hyperrectangle. A query specifies a subrectangle (i.e., a range along each dimension), and all the points inside the specified query are retrieved. This does not address the problem of searching on nontraditional data types, nor allowing errors that cannot be recovered with a range query. However, it converts the original search problem to a problem of obtaining, in a given space, all the points "close" to a given query point. Grid files are essentially a disk organization technique to efficiently retrieve range queries in secondary memory.

2.2. Query by Content in Multimedia Objects

New data types such as images, fingerprints, audio, and video (called "multimedia" data types) cannot be meaningfully queried in the classical sense. Not only can they not be ordered, but they cannot even be compared for equality. There is no interest in an application for searching an audio segment exactly equal to a given one. The probability that two different images are pixelwise equal is negligible unless they are digital copies of the same source. In multimedia applications, all the queries ask for objects similar to a given one. Some example applications are face recognition, fingerprint matching, voice recognition, and, in general, multimedia databases [Apers et al. 1997; Yoshitaka and Ichikawa 1999].

Think, for example, of a repository of images. Interesting queries are of the type, "Find an image of a lion with a savanna background." If the repository is tagged, and each tag contains a full description of what is inside the image, then our example query can be solved with a classical scheme. Unfortunately, such a classification cannot be done automatically with the available image processing technology. The state of object recognition in real-world scenes is still too immature to perform such complex tasks. Moreover, we cannot predict all the possible queries that will be posed so as to tag the image for every possible query. An alternative to automatic classification consists of considering the query as an example image, so that the system searches all the images similar to the query. This can be built inside a more complex feedback system where the user approves or rejects the images found, and a new query is submitted with the approved images. It is also possible that the query is just part of an image and the system has to retrieve the whole image.

These approaches are based on the definition of a similarity function among objects. Those functions are provided by an expert, but they pose no assumptions on the type of queries that can be answered. In many cases, the distance is obtained via

a set of k "features" that are extracted from the object (e.g., in an image a useful feature is the average color). Then each object is represented as its k features (i.e., a point in a k-dimensional space), and we are again in a case of range queries on vector spaces.

There is a growing community of scientists deeply involved with the development of such similarity measures [Cascia et al. 1998; Bhanu et al. 1998; Bimbo and Vicario 1998].

2.3. Text Retrieval

Although not considered a multimedia data type, unstructured text retrieval poses similar problems to those of multimedia retrieval. This is because textual documents are in general not structured to easily provide the desired information. Text documents may be searched for strings that are present or not, but in many cases they are searched for semantic concepts of interest. For instance, an ideal scenario would allow searching a text dictionary for a concept such as, "to free from obligation," retrieving the word "redeem." This search problem cannot be properly stated with classical tools.

A large community of researchers has been working on this problem for a long time [Salton and McGill 1983; Frakes and Baeza-Yates 1992; Baeza-Yates and Ribeiro-Neto 1999]. A number of measures of similarity have emerged. The problem is solved basically by retrieving documents similar to a given query. The user can even present a document as a query, so that the system finds similar documents. Some similarity approaches are based on mapping a document to a vector of real values, so that each dimension is a vocabulary word and the relevance of the word to the document (computed using some formula) is the coordinate of the document along that dimension. Similarity functions are then defined in that space. Notice, however, that the dimensionality of the space is very high (thousands of dimensions).

Another problem related to text retrieval is spelling. Since huge text databases with low quality control are emerging (e.g., the Web), and typing, spelling, or OCR (optical character recognition) errors are commonplace in the text and the query, we have cases where documents that contain a misspelled word are no longer retrievable by a correctly written query. Models of similarity among words exist (variants of the "edit distance" [Sankoff and Kruskal 1983]) that capture very well those types of errors. In this case, we give a word and want to retrieve all the words close to it. Another related application is spelling checkers, where we look for close variants of the misspelled word. In particular, OCR can be done using a low-level classifier, so that misspelled words can be corrected using the edit distance to find promising alternatives to replace incorrect words [Chávez and Navarro 2002].

2.4. Computational Biology

DNA and protein sequences are the basic object of study in molecular biology. As they can be modeled as texts, we have the problem of finding a given sequence of characters inside a longer sequence. However, an exact match is unlikely to occur, and computational biologists are more interested in finding parts of a longer sequence that are similar to a given short sequence. The fact that the search is not exact is due to minor differences in the genetic streams that describe beings of the same or closely related species. The measure of similarity used is related to the probability of mutations such as reversals of pieces of the sequences and other rearrangements [Waterman 1995; Sankoff and Kruskal 1983].

Other related problems are to build phylogenetic trees (a tree sketching the evolutionary path of the species), to search patterns for which only some properties are known, and others.

2.5. Pattern Recognition and Function Approximation

A simplified definition of pattern recognition is the construction of a function

approximator. In this formulation of the problem one has a finite sample of the data, and each data sample is labeled as belonging to a certain class. When a fresh data sample is provided, the system is required to label this new sample with one of the known data labels. In other words, the classifier can be thought of as a function defined from the object (data) space to the set of labels. In this sense all the classifiers are considered as function approximators.

If the objects are m-dimensional vectors of real numbers then the natural choices are neural nets and fuzzy function approximators. Another popular universal function approximator, the k-nearest-neighbor classifier, consists of finding the k objects nearest to the unlabeled sample, and assigning to this sample the label having the majority among the k nearest objects. Opposed to neural nets and fuzzy classifiers, the k-nearest-neighbor rule has zero training time, but if no indexing algorithm is used it has linear complexity [Duda and Hart 1973].

Other applications of this universal function approximator are density estimation [Devroye 1987] and reinforcement learning [Sutton and Barto 1998]. In general, any problem where we want to infer a function based on a finite set of samples is a potential application.

2.6. Audio and Video Compression

Audio and video transmission over a narrowband channel is an important problem, for example, in Internet-based audio and video conferencing or in wireless communication. A frame (a static picture in a video, or a fragment of the audio) can be thought of as formed by a number of (possibly overlapped) subframes (16 × 16 subimages in a video, for example). In a very succinct description, the problem can be solved by sending the first frame as-is and for the next frames sending only the subframes having a significant difference from the previously sent subframes. This description encompasses the MPEG standard.

The algorithms use, in fact, a subframe buffer. Each time a frame is about to be sent it is searched (with a tolerance) in the subframe buffer and if it is not found then the entire subframe is added to the buffer. If the subframe is found then only the index of the similar frame found is sent. This implies, naturally, that a fast similarity search algorithm has to be incorporated to the server to maintain a minimum frames-per-second rate.

3. BASIC CONCEPTS

All the applications presented in the previous section share a common framework, which is in essence to find close objects, under some suitable similarity function, among a finite set of elements. In this section we present the formal model comprising all the above cases.

3.1. Metric Spaces

We now introduce the basic notation for the problem of satisfying proximity queries and for the model used to group and analyze the existing algorithms.

The set X denotes the universe of valid objects. A finite subset of it, U, of size n = |U|, is the set of objects where we search. U is called the dictionary, database, or simply our set of objects or elements. The function

    d : X × X → R

denotes a measure of "distance" between objects (i.e., the smaller the distance, the closer or more similar are the objects). Distance functions have the following properties:

    (p1) ∀x, y ∈ X,  d(x, y) ≥ 0          (positiveness),
    (p2) ∀x, y ∈ X,  d(x, y) = d(y, x)    (symmetry),
    (p3) ∀x ∈ X,     d(x, x) = 0          (reflexivity),

and in most cases

    (p4) ∀x, y ∈ X,  x ≠ y ⇒ d(x, y) > 0  (strict positiveness).
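As a concrete illustration (ours, not part of the original survey), the "edit distance" between strings mentioned in Section 2.3 is a distance function in this sense: it satisfies (p1) through (p4), as well as the triangle inequality (p5) introduced next. A minimal sketch:

    # Illustration: the edit distance of Section 2.3 is a metric over the
    # universe X of all strings; it satisfies properties (p1)-(p5).
    def edit_distance(x: str, y: str) -> int:
        """Minimum number of character insertions, deletions, and
        substitutions needed to transform x into y."""
        prev = list(range(len(y) + 1))      # distances from x[:0] to prefixes of y
        for i, cx in enumerate(x, 1):
            curr = [i]                      # distance from x[:i] to the empty string
            for j, cy in enumerate(y, 1):
                curr.append(min(prev[j] + 1,                 # delete cx
                                curr[j - 1] + 1,             # insert cy
                                prev[j - 1] + (cx != cy)))   # substitute (free if equal)
            prev = curr
        return prev[len(y)]

    assert edit_distance("survey", "surgery") == 2   # one substitution, one insertion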

The similarity properties enumerated above only ensure a consistent definition of the function, and cannot be used to save comparisons in a proximity query. If d is indeed a metric, that is, if it satisfies

    (p5) ∀x, y, z ∈ X,  d(x, y) ≤ d(x, z) + d(z, y)  (triangle inequality),

then the pair (X, d) is called a metric space. If the distance does not satisfy the strict positiveness property (p4) then the space is called a pseudometric space. Although for simplicity we do not consider pseudometric spaces in this work, all the presented techniques are easily adapted to them by simply identifying all the objects at distance zero as a single object. This works because, if (p5) holds, one can easily prove that d(x, y) = 0 ⇒ ∀z, d(x, z) = d(y, z).

In some cases we may have a quasi-metric, where the symmetry property (p2) does not hold. For instance, if the objects are corners in a city and the distance corresponds to how much a car must travel to move from one to the other, then the existence of one-way streets makes the distance asymmetric. There exist techniques to derive a new, symmetric, distance function from an asymmetric one, such as d′(x, y) = d(x, y) + d(y, x). However, to be able to bound the search radius of a query when using the symmetric function we need specific knowledge of the domain.

Finally, we can relax the triangle inequality (p5) to d(x, y) ≤ α d(x, z) + β d(z, y) + δ, and after some scaling we can search in this space using the same algorithms designed for metric spaces. If the distance is symmetric we need α = β for consistency.

In the rest of the article we use the term distance in the understanding that we refer to a metric.

3.2. Proximity Queries

There are basically three types of queries of interest in metric spaces.

Range query (q, r)_d. Retrieve all elements that are within distance r to q. This is {u ∈ U / d(q, u) ≤ r}.

Nearest neighbor query NN(q). Retrieve the closest elements to q in U. This is {u ∈ U / ∀v ∈ U, d(q, u) ≤ d(q, v)}. In some cases we are satisfied with one such element (in continuous spaces there is normally just one answer anyway). We can also give a maximum distance r* such that if the closest element is at distance more than r* we do not want any one reported.

k-Nearest neighbor query NN_k(q). Retrieve the k closest elements to q in U. That is, retrieve a set A ⊆ U such that |A| = k and ∀u ∈ A, v ∈ U − A, d(q, u) ≤ d(q, v). Note that in case of ties we are satisfied with any set of k elements satisfying the condition.

The most basic type of query is the range query. The left part of Figure 1 illustrates a query on a set of points which is our running example, using R² as the metric space for clarity.

A range query is therefore a pair (q, r)_d with q an element in X and r a real number indicating the radius (or tolerance) of the query. The set {u ∈ U, d(q, u) ≤ r} is called the outcome of the range query.

We use "NN" as an abbreviation of "nearest neighbor," and give the generic name "NN-query" to the last two types of queries and "NN searching" to the techniques to solve them. As we see later, NN-queries can be systematically built over range queries.

The total time to evaluate a query can be split as

    T = (number of distance evaluations) × (complexity of d()) + extra CPU time + I/O time

and we would like to minimize T. In many applications, however, evaluating d() is so costly that the other components of the cost can be neglected. This is the model we use in this article, and hence the number of distance evaluations performed is the measure of the complexity of the algorithms. We can even allow a linear in n (but reasonable) amount of CPU work and a linear traversal over the database on disk, as long as the number of distance computations is kept low. However, we pay some marginal attention to the so-called extra CPU time. The I/O time can be the dominant factor in some applications and negligible in others, depending on the amount of main memory available and the relative cost to compute the distance function. We cover the little existing work on relating metric spaces and I/O considerations in Section 5.3.2.
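Under this cost model, the trivial baseline is the exhaustive scan, which answers any of the three query types with exactly n evaluations of d. The following sketch (ours; the function names are illustrative, not from the survey) makes this baseline explicit:

    # Exhaustive-search baseline: every query costs exactly |U| = n evaluations of d.
    import heapq

    def range_query(U, q, r, d):
        """(q, r)_d: all u in U with d(q, u) <= r."""
        return [u for u in U if d(q, u) <= r]

    def nn_k_query(U, q, k, d):
        """NN_k(q): a set of k elements of U closest to q (ties broken arbitrarily)."""
        return heapq.nsmallest(k, U, key=lambda u: d(q, u))

Every index of Section 5 tries to answer the same queries with far fewer evaluations of d, discarding elements through the triangle inequality alone.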


Fig. 1. On the left, an example of a range query on a set of points. On the right, the set of points at the same distance to a center point, for different L_s distances.

It is clear that either type of query can be answered by examining the entire dictionary U. In fact if we are not allowed to preprocess the data (i.e., to build an index data structure), then this exhaustive examination is the only way to proceed. An indexing algorithm is an offline procedure to build beforehand a data structure (called an index) designed to save distance computations when answering proximity queries later. This data structure can be expensive to build, but this will be amortized by saving distance evaluations over many queries to the database. The aim is therefore to design efficient indexing algorithms to reduce the number of distance evaluations. All these structures work on the basis of discarding elements using the triangle inequality (the only property that allows saving distance evaluations).

4. THE CASE OF VECTOR SPACES

If the elements of the metric space (X, d) are indeed tuples of real numbers (actually tuples of any field) then the pair is called a finite-dimensional vector space, or vector space for short.

A k-dimensional vector space is a particular metric space where the objects are identified with k real-valued coordinates (x_1, ..., x_k). There are a number of options for the distance function to use, but the most widely used is the family of L_s distances, defined as

    L_s((x_1, ..., x_k), (y_1, ..., y_k)) = (Σ_{i=1..k} |x_i − y_i|^s)^{1/s}.

The right part of Figure 1 illustrates some of these distances. For instance, the L_1 distance accounts for the sum of the differences along the coordinates. It is also called "block" or "Manhattan" distance, since in two dimensions it corresponds to the distance to walk between two points in a city of rectangular blocks. The L_2 distance is better known as "Euclidean" distance, as it corresponds to our notion of spatial distance. The other most used member of the family is L_∞, which corresponds to taking the limit of the L_s formula when s goes to infinity. The result is that the distance between two points is the maximum difference along a coordinate:

    L_∞((x_1, ..., x_k), (y_1, ..., y_k)) = max_{i=1..k} |x_i − y_i|.
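A direct transcription (ours) of the L_s family, including the limit case L_∞; the assertions check the three members just discussed on the points (3, 4) and (0, 0):

    # The L_s family of distances over k-dimensional points, s >= 1.
    def minkowski(x, y, s):
        """L_s distance between points x and y; s = inf gives L_inf."""
        if s == float("inf"):
            return max(abs(xi - yi) for xi, yi in zip(x, y))  # max coordinate gap
        return sum(abs(xi - yi) ** s for xi, yi in zip(x, y)) ** (1.0 / s)

    p, c = (3.0, 4.0), (0.0, 0.0)
    assert minkowski(p, c, 1) == 7.0             # "Manhattan" distance
    assert minkowski(p, c, 2) == 5.0             # "Euclidean" distance
    assert minkowski(p, c, float("inf")) == 4.0  # maximum difference along a coordinate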

Searching with the L_∞ distance corresponds directly to a classical range search query, where the range is the k-dimensional hyperrectangle. This distance plays a special role in this survey.

In many applications the metric space is indeed a vector space; that is, the objects are k-dimensional points and the similarity is interpreted geometrically. A vector space permits more freedom than a general metric space when designing search approaches, since it is possible to use geometric and coordinate information that is unavailable in a general metric space.

In this framework optimal algorithms (on the database size) exist in both the average and the worst case [Bentley et al. 1980] for closest point search. Search structures for vector spaces are called spatial access methods (SAM). Among the most popular are kd-trees [Bentley 1975, 1979], R-trees [Guttman 1984], quadtrees [Samet 1984], and the more recent X-trees [Berchtold et al. 1996]. These techniques make extensive use of coordinate information to group and classify points in the space. For example, kd-trees divide the space along different coordinates and R-trees group points in hyperrectangles. Unfortunately the existing techniques are very sensitive to the vector space dimension. Closest point and range search algorithms have an exponential dependency on the dimension of the space [Chazelle 1994] (this is called the curse of dimensionality).

Vector spaces may suffer from large differences between their representational dimension (k) and their intrinsic dimension (i.e., the real number of dimensions in which the points can be embedded while keeping the distances among them). For example, a plane embedded in a 50-dimensional space has intrinsic dimension 2 and representational dimension 50. This is, in general, the case of real applications, where the data are clustered, and it has led to attempts to measure the intrinsic dimension such as the concept of "fractal dimension" [Faloutsos and Kamel 1994]. Despite the fact that no technique can cope with intrinsic dimension higher than 20, much higher representational dimensions can be handled by dimensionality reduction techniques [Faloutsos and Lin 1995; Cox and Cox 1994; Hair et al. 1995].

Since efficient techniques to cope with vector spaces exist, application designers try to give their problems a vector space structure. However, this is not always easy or feasible at all. For example, experts in image processing try to express the similarity between images as the distance between "vectors" of features extracted from the images, although in many cases better results are obtained by coding specific functions that compare two images, despite their inability to be easily expressed as the distance between two vectors (e.g., crosstalk between features [Faloutsos et al. 1994]). Another example that resists conversion into a vector space is similarity functions between strings, to compare DNA sequences, for instance. A step towards improving this situation is Chávez and Navarro [2002].

For this reason several authors resort to general metric spaces, even knowing that the search problem is much more difficult. Of course it is also possible to treat a vector space as a general metric space, by using only the distances between points. One immediate advantage is that the intrinsic dimension of the space shows up, independent of any representational dimension (this requires extra care in vector spaces). It is interesting to remark that Ciaccia et al. [1997] present preliminary results showing that a metric space data structure (the M-tree) can outperform a well-known vector space data structure (the R*-tree) when applied to a vector space.

Specific techniques for vector spaces are a whole different world which we do not intend to cover in this work (see Samet [1984], White and Jain [1996], Gaede and Günther [1998], and Böhm et al. [2002] for good surveys). However, we discuss in the next section a technique that, instead of treating a vector space as a metric space,

tries to embed a general metric space into a vector space. This concept is central in this survey, although specific knowledge of specific techniques for vector spaces is, as we shortly show, not necessary to understand it.

4.1. Resorting to Vector Spaces

An interesting and natural reduction of the similarity search problem consists of a mapping Φ from the original metric space into a vector space. In this way, each element of the original metric space is represented as a point in the target vector space. The two spaces are related by two distances: the original one d(x, y) and the distance in the projected space D(Φ(x), Φ(y)). If the mapping is contractive (i.e., D(Φ(x), Φ(y)) ≤ d(x, y) for any pair of elements), then one can process range queries in the projected space with the same radius. Since some spurious elements can be captured in the target space, the outcome of the query in the projected space is a candidate list, which is later verified elementwise with the original distance to obtain the actual outcome of the query.

Intuitively, the idea is to "invent" k coordinates and map the points onto a vector space, using some vector space technique as a first filter to the actual answer to a query. One of the main theses of this work is that a large subclass of the existing algorithms can be regarded as relying on some mapping of this kind. A widely used method (explained in detail in Section 6.6) is to select {p_1, ..., p_k} ⊆ U and map each u ∈ U to (R^k, L_∞) using Φ(u) = (d(u, p_1), ..., d(u, p_k)). It can be seen that this mapping is contractive but not proximity preserving.

If, on the other hand, the mapping is proximity preserving (i.e., d(x, y) ≤ d(x, z) ⇒ D(Φ(x), Φ(y)) ≤ D(Φ(x), Φ(z))), then NN-queries can be directly performed in the projected space. Indeed, most current algorithms for NN-queries are based in range queries, and with some care they can be done in the projected space if the mapping is contractive, even if it is not proximity preserving.

This type of mapping is a special case of a general idea in the literature that says that one can find it cheaper to compute distances that lower-bound the real one, and use the cheaper distance to filter out most elements (e.g., for images, the average color is cheaper to compute than the differences in the color histograms). While in general this is domain-dependent, mapping onto a vector space can be done without knowledge of the domain. After the mapping is done and we have identified each data element with a point in the projected space, we can use a general-purpose spatial access method for vector spaces to retrieve the candidate list. The elements found in the projected space must be finally checked using the original distance function.

Therefore, there are two types of distance evaluations: first to obtain the coordinates in the projected space and later to check the final candidates. These are called "internal" and "external" evaluations, respectively, later in this work. Clearly, incrementing internal evaluations improves the quality of the filter and reduces external evaluations, and therefore we seek a balance.

Notice finally that the search itself in the projected space does not use evaluations of the original distance, and hence it is costless under our complexity measure. Therefore, the use of kd-trees, R-trees, or other data structures aims at reducing the extra CPU time, but it makes no difference in the number of evaluations of the d distance.

How well do metric space techniques perform in comparison to vector space methods? It is difficult to give a formal answer because of the different cost models involved. In metric spaces we use the number of distance evaluations as the basic measure of complexity, while vector space techniques may very well use many coordinate manipulations and not a single evaluation of the distance. Under our model, the cost of a method that maps to a vector space to trim the candidate list is measured as the number of distance evaluations to realize the mapping plus the final distances to filter the trimmed candidate list, whereas the work on the artificial coordinates is seen as just extra CPU time.
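The following sketch (ours, with illustrative names) implements the pivot mapping Φ(u) = (d(u, p_1), ..., d(u, p_k)) described above. The filter is sound because the mapping is contractive: by the triangle inequality, |d(q, p) − d(u, p)| ≤ d(q, u) for every pivot p, so L_∞(Φ(q), Φ(u)) ≤ d(q, u) and any u with L_∞(Φ(q), Φ(u)) > r cannot belong to the outcome.

    # Pivot-based filtering: a contractive mapping into (R^k, L_inf).
    def build_pivot_table(U, pivots, d):
        """Offline: k internal distance evaluations per database element.
        Elements are assumed hashable."""
        return {u: tuple(d(u, p) for p in pivots) for u in U}

    def pivot_range_query(U, table, pivots, q, r, d):
        phi_q = tuple(d(q, p) for p in pivots)          # k internal evaluations
        candidates = [u for u in U
                      if max(abs(a - b) for a, b in zip(phi_q, table[u])) <= r]
        return [u for u in candidates if d(q, u) <= r]  # external evaluations

A query thus costs k internal evaluations plus one external evaluation per surviving candidate, which is the balance discussed above.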

A central question related to this reduction is: how well can a metric space be embedded into a vector space? How many coordinates have to be considered so that the original metric space and the target vector spaces are similar enough so that the candidate list given by the vector space is not much larger than the actual outcome of the query in the original space? This is a very difficult question that lies behind this entire article, and we return to it in Section 7.

The issue is better developed in vector spaces. There are different techniques to reduce the dimensionality of a set of points while preserving the original distances as much as possible [Cox and Cox 1994; Hair et al. 1995; Faloutsos and Lin 1995], that is, to find the intrinsic dimension of the data.

5. CURRENT SOLUTIONS FOR METRIC SPACES

In this section we explain the existing indexes to structure metric spaces and how they are used for range and NN searching. Since we have not yet developed the concepts of a unifying perspective, the description is kept at an intuitive level, without any attempt to analyze why some ideas are better or worse. We add a final subsection devoted to more advanced issues such as dynamic capabilities, I/O considerations, and approximate and probabilistic algorithms.

5.1. Range Searching

We divide the presentation in several parts. The first one deals with tree indexes for discrete distance functions, that is, functions that deliver a small set of values. The second part corresponds to tree indexes for continuous distance functions, where the set of alternatives is infinite or very large. Third, we consider other methods that are not tree-based.

Table I summarizes the complexities of the different structures. These are obtained from the source papers, which use different (and incompatible) assumptions and in many cases give just gross analyses or no analysis at all (just heuristic considerations). Therefore, we give the complexities as claimed by the authors of each paper, not as a proven fact. At best, the results are analytical but rely on diverse simplifying assumptions. At worst, the results are based on a few incomplete experiments. Keep also in mind that there are hidden factors depending (in many cases exponentially) on the dimension of the space, and that the query complexity is always on average, as in the worst case we can be forced to compare all the elements. Even in the simple case of orthogonal range searching on vector spaces there exist Ω(n^α) lower bounds for the worst case [Melhorn 1984].

5.1.1. Trees for Discrete Distance Functions. We start by describing tree data structures that apply to distance functions which return a small set of different values. At the end we show how to cope with the general case with these trees.

5.1.1.1. BKT. Probably the first general solution to search in metric spaces was presented in Burkhard and Keller [1973]. They propose a tree (thereafter called the Burkhard–Keller Tree, or BKT), which is suitable for discrete-valued distance functions. It is defined as follows. An arbitrary element p ∈ U is selected as the root of the tree. For each distance i > 0, we define U_i = {u ∈ U, d(u, p) = i} as the set of all the elements at distance i to the root p. Then, for any nonempty U_i, we build a child of p (labeled i), where we recursively build the BKT for U_i. This process can be repeated until there is only one element to process, or until there are no more than b elements (and we store a bucket of size b). All the elements selected as roots of subtrees are called pivots.

When we are given a query q and a distance r, we begin at the root and enter into all children i such that d(p, q) − r ≤ i ≤ d(p, q) + r, and proceed recursively. If we arrive at a leaf (bucket of size one or more) we compare sequentially all its elements. Each time we perform a comparison (against pivots or bucket elements u) where d(q, u) ≤ r, we report the element u. The triangle inequality ensures that we cannot miss an answer. All the subtrees not traversed contain elements u that are at distance d(u, p) = i from some node p, where |d(p, q) − i| > r. By the triangle inequality, d(p, q) ≤ d(p, u) + d(u, q), and therefore d(u, q) ≥ d(p, q) − d(p, u) > r.
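A minimal sketch (ours) of the BKT just described, assuming an integer-valued distance and storing single elements rather than buckets of size b:

    # Burkhard-Keller Tree for a discrete (integer-valued) distance function.
    class BKT:
        def __init__(self, elements, d):
            self.d = d
            self.pivot = elements[0]      # an arbitrary element becomes the root
            self.children = {}            # ring label i -> BKT built over U_i
            rings = {}
            for u in elements[1:]:
                rings.setdefault(d(u, self.pivot), []).append(u)
            for i, ring in rings.items():
                self.children[i] = BKT(ring, d)

        def range_query(self, q, r, out=None):
            out = [] if out is None else out
            dq = self.d(q, self.pivot)            # one distance evaluation per node
            if dq <= r:
                out.append(self.pivot)
            for i, child in self.children.items():
                if dq - r <= i <= dq + r:         # only rings the query ball can meet
                    child.range_query(q, r, out)
            return out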


Table I. Average Complexities of Existing Approaches (a)

  Data       Space                Construction         Claimed Query          Extra CPU
  Structure  Complexity           Complexity           Complexity             Query Time
  ---------  -------------------  -------------------  ---------------------  ----------
  BKT        n ptrs               n log n              n^α                    —
  FQT        n..n log n ptrs      n log n              n^α                    —
  FHQT       n..nh ptrs           nh                   log n (b)              n^α
  FQA        nhb bits             nh                   log n (b)              n^α log n
  VPT        n ptrs               n log n              log n (c)              —
  MVPT       n ptrs               n log n              log n (c)              —
  VPF        n ptrs               n^(2−α)              n^(1−α) log n (c)      —
  BST        n ptrs               n log n              not analyzed           —
  GHT        n ptrs               n log n              not analyzed           —
  GNAT       nm² dsts             nm log_m n           not analyzed           —
  VT         n ptrs               n log n              not analyzed           —
  MT         n ptrs               n(m..m²) log_m n     not analyzed           —
  SAT        n ptrs               n log n / log log n  n^(1−Θ(1/log log n))   —
  AESA       n² dsts              n²                   O(1) (d)               n..n²
  LAESA      kn dsts              kn                   k + O(1) (d)           log n..kn
  LC         n ptrs               n²/m                 not analyzed           n/m

(a) Time complexity considers only n, not other parameters such as dimension. Space complexity mentions the most expensive storage units used ("ptrs" is short for "pointers" and "dsts" for "distances"). α is a number between 0 and 1, different for each structure, whereas the other letters are parameters particular to each structure.
(b) If h = log n.
(c) Only valid when searching with very small radii.
(d) Empirical conclusions without analysis, in the case of LAESA for "large enough" k.

Figure 2 shows an example where the element u_11 has been selected as the root. We have built only the first level of the BKT for simplicity. A query q is also shown, and we have emphasized the branches of the tree that would have to be traversed. In this and all the examples of this section we discretize the distances of our example, so that they return integer values.

The results of Table I for BKTs are extrapolated from those made for fixed queries trees [Baeza-Yates et al. 1994], which can be easily adapted to this case. The only difference is that the space overhead of BKTs is O(n) because there is exactly one element of the set per tree node.

5.1.1.2. FQT. A further development over BKTs is the fixed queries tree or FQT [Baeza-Yates et al. 1994]. This tree is basically a BKT where all the pivots stored in the nodes of the same level are the same (and of course do not necessarily belong to the set stored in the subtree). The actual elements are all stored at the leaves. The advantage of such construction is that some comparisons between the query and the nodes are saved along the backtracking that occurs in the tree. If we visit many nodes of the same level, we do not need to perform more than one comparison because all the pivots in that level are the same. This is at the expense of somewhat taller trees. FQTs are shown experimentally in Baeza-Yates et al. [1994] to perform fewer distance evaluations at query time using a couple of different metric space examples. Under some simplifying assumptions (experimentally validated in the paper) they show that FQTs built over n elements are O(log n) height on average, are built using O(n log n) distance evaluations, and that the average number of distance computations is O(n^α), where 0 < α < 1 is a number that depends on the range of the search and the structure of the space (this analysis is easy to extend to BKTs as well). The space complexity is superlinear since, unlike BKTs, it is not true that a different element is placed at each node of the tree. An upper bound is O(n log n) since the average height is O(log n).


Fig. 2. On the left, the division of the space obtained when u_11 is taken as a pivot. On the right, the first level of a BKT with u_11 as root. We also show a query q and the branches that it has to traverse. We have discretized the distances so they return integer values.

5.1.1.3. FHQT. In Baeza-Yates et al. [1994] and Baeza-Yates [1997], the authors propose a variant called "fixed-height FQT" (or FHQT for short), where all the leaves are at the same depth h, regardless of the bucket size. This makes some leaves deeper than necessary, which makes sense because we may have already performed the comparison between the query and the pivot of an intermediate level, therefore eliminating for free the need to consider the leaf. In Baeza-Yates [1997] and Baeza-Yates and Navarro [1998] it is shown that by using O(log n) pivots, the search takes O(log n) distance evaluations (although the cost depends exponentially on the search radius r). The extra CPU time, that is, the number of nodes traversed, remains however, O(n^α). The space, like FQTs, is somewhere between O(n) and O(nh). In practice the optimal h = O(log n) cannot be achieved because of space limitations.

5.1.1.4. FQA. In Chávez et al. [2001], the fixed queries array (FQA) is presented. The FQA, although not properly a tree, is no more than a compact representation of the FHQT. Imagine that an FHQT of fixed height h is built on the set. If we traverse the leaves of the tree left to right and put the elements in an array, the result is the FQA. For each element of the array we compute h numbers representing the branches to take in the tree to reach the element from the root (i.e., the distances to the h pivots). Each of these h numbers is coded in b bits and they are concatenated in a single (long) number so that the higher levels of the tree are the most significant digits.

As a result the FQA is sorted by the resulting hb-bits number, each subtree of the FHQT corresponds to an interval in the FQA, and each movement in the FHQT is simulated with two binary searches in the FQA (at O(log n) extra CPU cost factor, but no extra distances are computed). There is a similarity between this idea and suffix trees versus suffix arrays [Frakes and Baeza-Yates 1992]. This idea of using fewer bits to represent the distances appeared also in the context of vector spaces [Weber et al. 1998].

Using the same memory, the FQA simulation is able to use many more pivots than the original FHQT, which improves the efficiency. The b bits needed by each pivot can be lowered by merging branches of the FHQT, trying that about the same number of elements lies in each cell of the next level. This allows using even more pivots with the same space usage. For reasons that are made clear later, the FQA is also called FMVPA in this work.
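A sketch (ours, not from the FQA paper) of the coding just described: the h pivot distances of each element are saturated to b bits, packed into one integer with the first (highest) pivot in the most significant position, and the array is kept sorted; the interval of a first-level branch is then located with two binary searches and no distance evaluations. A discrete distance is assumed.

    import bisect

    def fqa_code(u, pivots, d, b):
        """Pack the distances to the h pivots into one h*b-bit integer."""
        code = 0
        for p in pivots:                                     # highest tree level first
            code = (code << b) | min(d(u, p), (1 << b) - 1)  # saturate to b bits
        return code

    def build_fqa(U, pivots, d, b):
        return sorted(((fqa_code(u, pivots, d, b), u) for u in U),
                      key=lambda pair: pair[0])

    def first_level_branch(fqa, i, h, b):
        """Elements whose distance to the first pivot equals i: one FQA interval."""
        shift = (h - 1) * b
        lo = bisect.bisect_left(fqa, ((i << shift),))        # two binary searches,
        hi = bisect.bisect_left(fqa, (((i + 1) << shift),))  # no distance evaluations
        return fqa[lo:hi]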


Fig. 3. Example BKT, FQT, FHQT, and FQA for our set of points. We use b = 2 for the BKT and FQT, and h = 2 for FHQT and FQA.

Figure 3 shows an arbitrary BKT, FQT, FHQT, and FQA built on our set of points. Notice that, while in the BKT there is a different pivot per node, in the others there is a different pivot per level, the same for all the nodes of that level.

5.1.1.5. Hybrid. In Shapiro [1977], the use of more than one element per node of the tree is proposed. Those k elements allow eliminating more elements per level at the cost of doing more distance evaluations. The same effect would be obtained if we had a mixture between BKTs and FQTs, so that for k levels we had fixed keys per level, and then we allowed a different key per node of the level k + 1, continuing the process recursively on each subtree of the level k + 1. The authors conjecture that the pivots should be selected to be outside the clusters.

5.1.1.6. Adapting to Continuous Functions. If we have a continuous distance or if it gives too many different values, it is not possible to have a child of the root for any such value. If we did that, the tree would degenerate into a flat tree of height 2, and the search algorithm would be almost like sequential searching for the BKT and FQT. FHQTs and FQAs do not degenerate in this sense, but they lose their sublinear extra CPU time.

In Baeza-Yates et al. [1994] the authors mention that the structures can be adapted to a continuous distance by assigning a range of distances to each branch of the tree. However, they do not specify how to do this. Some approaches explicitly defined for continuous functions are explained later (VPTs and others), which assign the ranges trying to leave the same number of elements at each class.

5.1.2. Trees for Continuous Distance Functions. We now present the data structures designed for the continuous case. They also can be used for discrete distance functions with virtually no modifications.

5.1.2.1. VPT. The "metric tree" is presented in Uhlmann [1991b] as a tree data structure designed for continuous distance functions. More complete work on the same idea [Yianilos 1993; Chiueh 1994] calls them "vantage-point trees" or VPTs. They build a binary tree recursively, taking any element p as the root and taking the median of the set of all distances, M = median{d(p, u) / u ∈ U}. Those elements u such that d(p, u) ≤ M are inserted into the left subtree, while those such that d(p, u) > M are inserted into the right subtree.
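A sketch (ours) of the VPT as just defined, together with the backtracking range search described in the following paragraphs, which may have to enter both subtrees:

    # Vantage-point tree: median split on the distances to the root pivot.
    from statistics import median

    class VPT:
        def __init__(self, elements, d):
            self.d = d
            self.pivot = elements[0]
            self.M = self.left = self.right = None
            rest = elements[1:]
            if rest:
                dists = [(d(u, self.pivot), u) for u in rest]
                self.M = median(dist for dist, _ in dists)
                inner = [u for dist, u in dists if dist <= self.M]
                outer = [u for dist, u in dists if dist > self.M]
                self.left = VPT(inner, d) if inner else None
                self.right = VPT(outer, d) if outer else None

        def range_query(self, q, r, out=None):
            out = [] if out is None else out
            dq = self.d(q, self.pivot)
            if dq <= r:
                out.append(self.pivot)
            if self.left and dq - r <= self.M:   # query ball reaches the inner part
                self.left.range_query(q, r, out)
            if self.right and dq + r > self.M:   # query ball reaches the outer part
                self.right.range_query(q, r, out)
            return out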


Fig. 4. Example VPT with root u_11. We plot the radius M used for the root. For the first levels we show explicitly the radii used in the tree.

The VPT takes O(n) space and is built in O(n log n) worst case time, since it is balanced. To solve a query in this tree, we measure d = d(q, p). If d − r ≤ M we enter into the left subtree, and if d + r > M we enter into the right subtree (notice that we can enter into both subtrees). We report every element considered that is close enough to the query. See Figure 4.

The query complexity is argued to be O(log n) in Yianilos [1993], but as pointed out, this is true only for very small search radii, too small to be an interesting case. In trees for discrete distance functions, the exact distance between an element in the leaves and any pivot in the path to the root can be inferred. However, here we only know that the distance is larger or smaller than M. Unlike the discrete case, it is possible that we arrive at an element in a leaf which we do not need to compare, but the tree has not enough information to discover that. Some of those exact distances lost can be stored explicitly, as proposed in Yianilos [1993], to prune more elements before checking them. Finally, Yianilos [1993] considers the problem of pivot selection and argues that it is better to take elements far away from the set.

5.1.2.2. MVPT. The VPT can be extended to m-ary trees by using the m − 1 uniform percentiles instead of just the median. This is suggested in Brin [1995] and Bozkaya and Ozsoyoglu [1997]. In the latter, the "multi-vantage-point tree" (MVPT) is presented. They propose the use of many elements in a single node, much as in Shapiro [1977]. It can be seen that the space is O(n), since each internal node needs to store the m percentiles but the leaves do not. The construction time is O(n log n) if we search the m percentiles hierarchically at O(n log m) instead of O(mn) cost. Bozkaya and Ozsoyoglu [1997] show experimentally that the idea of m-ary trees slightly improves over VPTs (and not in all cases), while a larger improvement is obtained by using many pivots per node. The analysis of query time for VPTs can be extrapolated to MVPTs in a straightforward way.

5.1.2.3. VPF. Another generalization of the VPT is given by the VPF (shorthand for excluded middle vantage point forest) [Yianilos 1999]. This algorithm is designed for radii-limited NN search (an NN(q) query with a maximum radius r*), but in fact the technique is perfectly compatible with a range search query. The method consists of excluding, at each level, the elements at intermediate distances to their pivot (this is the most populated part of the set): if r_0 and r_n stand for the closest and farthest elements to the pivot p, the elements u ∈ U such that d(p, r_0) + δ ≤ d(p, u) ≤ d(p, r_n) − δ are excluded from the tree.


Fig. 5. Example of the first level of a BST or GHT and a query q. Either using covering radii (BST) or hyperplanes (GHT), both subtrees have to be considered in this example.

A second tree is built with the excluded "middle part" of the first tree, and so on to obtain a forest. With this idea they eliminate the backtracking when searching with a radius r* ≤ (r_n − r_0 − 2δ)/2, and in return they have to search all the trees of the forest. The VPF, of O(n) size, is built using O(n^(2−ρ)) time and answers queries in O(n^(1−ρ) log n) distance evaluations, where 0 < ρ < 1 depends on r*. Unfortunately, to achieve ρ > 0, r* has to be quite small.

5.1.2.4. BST. In Kalantari and McDonald [1983], the "bisector trees" (BSTs) are proposed. The BST is a binary tree built recursively as follows. At each node, two "centers" c_1 and c_2 are selected. The elements closer to c_1 than to c_2 go into the left subtree and those closer to c_2 into the right subtree. For each of the two centers, its "covering radius" is stored, that is, the maximum distance from the center to any other element in its subtree. At search time, we enter into each subtree if d(q, c_i) − r is not larger than the covering radius of c_i. That is, we can discard a branch if the query ball (i.e., the hypersphere of radius r centered in the query) does not intersect the ball that contains all the elements inside that branch. In Noltemeier et al. [1992], the "monotonous BST" is proposed, where one of the two elements at each node is indeed the parent center. This makes the covering radii decrease as we move downward in the tree. Figure 5 illustrates the first step of the tree construction.

5.1.2.5. GHT. Proposed in Uhlmann [1991b], the "generalized-hyperplane tree" (GHT) is identical in construction to a BST. However, the algorithm uses the hyperplane between c_1 and c_2 as the pruning criterion at search time, instead of the covering radius. At search time we enter into the left subtree if d(q, c_1) − r < d(q, c_2) + r and into the right subtree if d(q, c_2) − r ≤ d(q, c_1) + r. Again, it is possible to enter into both subtrees. In Uhlmann [1991b] it is argued that GHTs could work better than VPTs in high dimensions. The same idea of reusing the parent node is proposed in Bugnion et al. [1993], this time to avoid performing two distance evaluations at each node.

5.1.2.6. GNAT. The GHT is extended in Brin [1995] to an m-ary tree, called GNAT (geometric near-neighbor access tree), keeping the same essential idea. We select, for the first level, m centers c_1, ..., c_m, and define U_i = {u ∈ U, d(c_i, u) < d(c_j, u), ∀j ≠ i}. That is, U_i are the elements closer to c_i than to any other c_j. From the root, m children numbered i = 1..m are built, each one recursively as a GNAT for U_i. Figure 6 shows a simple example of the first level of a GNAT. Notice the relationship between this idea and a Voronoi-like partition of a vector space [Aurenhammer 1991].


Fig. 6. Example of the first level of a GNAT with m = 4.

5.1.2.6. GNAT. The GHT is extended in Brin [1995] to an m-ary tree, called GNAT (geometric near-neighbor access tree), keeping the same essential idea. We select, for the first level, m centers c1, ..., cm, and define Ui = {u ∈ U, d(ci, u) < d(cj, u), ∀j ≠ i}. That is, Ui are the elements closer to ci than to any other cj. From the root, m children numbered i = 1..m are built, each one recursively as a GNAT for Ui. Figure 6 shows a simple example of the first level of a GNAT. Notice the relationship between this idea and a Voronoi-like partition of a vector space [Aurenhammer 1991].

The search algorithm, however, is quite different. At indexing time, the GNAT stores at each node an O(m²) size table rangeij = [min{d(ci, u) : u ∈ Uj}, max{d(ci, u) : u ∈ Uj}], which stores minimum and maximum distances from each center to each class. At search time the query q is compared against some center ci and then it discards any other center cj such that d(q, ci) ± r does not intersect rangeij. All the subtree Uj can be discarded using the triangle inequality. The process is repeated with random centers until none can be discarded. The search then enters recursively in each nondiscarded subtree. In the process, any center close enough to q is reported.

The authors use a gross analysis to show that the tree takes O(nm²) space and is built in close to O(nm log_m n) time. Experimental results show that the GHT is worse than the VPT, which is only beaten with GNATs of arities between 50 and 100. Finally, they mention that the arities of the subtrees could depend on their depth in the tree, but give no clear criteria to do this.
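The pruning based on the range table can be sketched as follows (our code, with hypothetical names; for simplicity the centers are swept in order instead of being sampled at random, as the text describes).

    def gnat_prune(q, r, d, centers, ranges):
        # ranges[i][j] = (lo, hi): minimum and maximum of d(c_i, u) over
        # u in U_j, the O(m^2) table precomputed at indexing time.
        alive = set(range(len(centers)))
        for i in range(len(centers)):
            if i not in alive:
                continue
            dqi = d(q, centers[i])          # one distance evaluation
            for j in list(alive):
                lo, hi = ranges[i][j]
                # Discard U_j when [dqi - r, dqi + r] misses [lo, hi];
                # the triangle inequality makes this safe.
                if dqi + r < lo or dqi - r > hi:
                    alive.discard(j)
        return alive                         # subtrees still to be entered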

5.1.2.7. VT. The "Voronoi tree" (VT) is proposed in Dehne and Noltemeier [1987] as an improvement over BSTs, where this time the tree has two or three elements (and children) per node. When a new tree node has to be created to hold an inserted element, its closest element from the parent node is also inserted in the new node. VTs have the property that the covering radius is reduced as we move downwards in the tree, which provides better packing of elements in subtrees. It is shown in Dehne and Noltemeier [1987] that VTs are superior and better balanced than BSTs. In Noltemeier [1989] they show that balanced VTs can be obtained by insertion procedures similar to those of B-trees, a fact later exploited in M-trees (see next).

5.1.2.8. MT. The M-tree (MT) data structure is presented in Ciaccia et al. [1997], aiming at providing dynamic capabilities and good I/O performance in addition to few distance computations. The structure has some resemblances to a GNAT, since it is a tree where a set of representatives is chosen at each node and the elements closer to each representative² are organized into a subtree rooted by that representative. The search algorithm, however, is closer to BSTs. Each representative stores its covering radius. At query time, the query is compared to all the representatives of the node and the search algorithm enters recursively into all those that cannot be discarded using the covering radius criterion.

The main difference of the MT is the way in which insertions are handled. An element is inserted into the "best" subtree, defined as that causing the subtree covering radius to expand less (zero expansion is the ideal), and in case of ties selecting the closest representative. Finally, the element is added to the leaf node and if the node overflows (i.e., becomes of size m + 1) it is split in two and one node element is promoted upwards, as in a B-tree or an R-tree [Guttman 1984]. Hence the MT is a balanced data structure, much as the VP family. There are many criteria to select the representative and to split the node, the best results being obtained by trying a split that minimizes the maximum of the two covering radii obtained. They show experimentally that the MT is resistant to the dimensionality of the space and that it is competitive against R∗-trees.

² There are many variants but this is reported as the most effective.

Fig. 7. Example of a SAT and the traversal towards a query q, starting at u11.

5.1.2.9. SAT. The SAT algorithm (spatial approximation tree) [Navarro 1999] does not use centers to split the set of candidate objects, but rather relies on "spatial" approximation. An element p is selected as the root of a tree, and it is connected to a set of "neighbors" N, defined as a subset of elements u ∈ U such that u is closer to p than to any other element in N (note that the definition is self-referential). The other elements (not in N ∪ {p}) are assigned to their closest element in N. Each element in N is recursively the root of a new subtree containing the elements assigned to it.

This allows searching for elements with radius zero by simply moving from the root to its "neighbor" (i.e., connected element) which is closest to the query q. If a radius r > 0 is allowed, then we consider that an unknown element q′ ∈ U is searched with tolerance zero, from which we only know that d(q, q′) ≤ r. Hence, we search as before for q′ and consider that any distance measure may have an "error" of at most ±r. Therefore, we may have to enter into many branches of the tree (not only the closest one), since the measuring "error" could permit a different neighbor to be the closest one. That is, if c ∈ N is the closest neighbor of q, we enter into all c′ ∈ N such that d(q, c′) − r ≤ d(q, c) + r. The tree is built in O(n log n / log log n) time, takes O(n) space, and inspects O(n^{1−Θ(1/log log n)}) elements at query time. Covering radii are also used to increase pruning. Figure 7 shows an example and the search path for a query.

5.1.3. Other Techniques

5.1.3.1. AESA. An algorithm that is close to many of the presented ideas but performs surprisingly better by an order of magnitude is that of Vidal [1986] (called AESA, for approximating eliminating search algorithm). The structure is simply a matrix with the n(n − 1)/2 precomputed distances among the elements of U. At search time, they select an element p ∈ U at random and measure rp = d(p, q), eliminating all elements u of U that do not satisfy rp − r ≤ d(u, p) ≤ rp + r. Notice that all the d(u, p) distances are precomputed, so only d(p, q) has been calculated at search time. This process of taking a random pivot among the (not yet eliminated) elements of U and eliminating more elements from U is repeated until the candidate set is empty and the query has been satisfied. Figure 8 shows an example with a first pivot u11.
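The elimination loop of AESA is short enough to sketch in full. The code below is ours (assuming hashable elements); as in the basic formulation above, the next pivot is taken at random from the not yet eliminated set.

    import random

    def aesa_preprocess(U, d):
        # The AESA index: all n(n-1)/2 distances, precomputed.
        return {u: {v: d(u, v) for v in U} for u in U}

    def aesa_range_search(q, r, d, U, dist):
        candidates, result = set(U), []
        while candidates:
            p = random.choice(tuple(candidates))  # pivot from the live set
            candidates.discard(p)
            dpq = d(p, q)                 # the only distance computed here
            if dpq <= r:
                result.append(p)
            # Triangle inequality: survivors satisfy |d(u,p) - d(q,p)| <= r.
            candidates = {u for u in candidates
                          if abs(dist[p][u] - dpq) <= r}
        return result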


Although this idea seems very similar to FQTs, there are some key differences. The first one, only noticeable in continuous spaces, is that there are no predefined "rings" so that all the intersected rings qualify (recall Figure 2). Instead, only the minimal necessary area of the rings qualifies. The second difference is that the second element to compare against q is selected from the qualifying set, instead of from the whole set as in FQTs. Finally, the algorithm uses every distance computation as a pivoting operation, while FQTs must precompute that decision (i.e., bucket size).

The problem with the algorithm [Vidal 1986] is that it needs O(n²) space and construction time. This is unacceptably high for all but very small databases. In this sense the approach is close to Sasha and Wang [1990], although in this latter case they may take fewer distances and bound the unknown ones. AESA is experimentally shown to have O(1) query time.

5.1.3.2. LAESA and Variants. In a newer version of AESA, called LAESA (for linear AESA), Micó et al. [1994] propose using k fixed pivots, so that the space and construction time is O(kn). In this case, the only difference with an FHQT is that fixed rings are not used, but the exact set of elements in the range is retrieved. FHQT uses fixed rings to reduce the extra CPU time, while in this case no such algorithm is given. In LAESA, the elements are simply linearly traversed, and those that cannot be eliminated after considering the k pivots are directly compared against the query.

A way to reduce the extra CPU time is presented later in Micó et al. [1994], which builds a GHT-like structure using the same pivots. The algorithm is argued to be sublinear in CPU time. Alternative search structures to reduce CPU time not losing information on distances are presented in Nene and Nayar [1997] and Chávez et al. [1999], where the distances to each pivot are sorted separately so that the relevant range [d(q, p) − r, d(q, p) + r] can be binary searched.³ Extra pointers are added to be able to trace an element across the different orderings for each pivot (this needs more space, however).

³ Although Nene and Nayar [1997] consider only vector spaces, the same technique can be used here.

Fig. 8. Example of the first iteration of AESA. The points between both rings centered at u11 qualify for the next iteration.

5.1.3.3. Clustering and LCs. Clustering is a very wide area with many applications [Jain and Dubes 1988]. The general goal is to divide a set into subsets of elements close to each other in the same subset. A few approaches to index metric spaces based on clustering exist.

One of them, LC (for "list of clusters") [Chávez and Navarro 2000] chooses an element p and finds the m closest elements in U. This is the cluster of p. The process continues recursively with the remaining elements until a list of n/(m + 1) clusters is obtained. The covering radius cr() of each cluster is stored. At search time the clusters are inspected one by one. If d(q, pi) − r > cr(pi) we do not enter the cluster of pi, otherwise we verify it exhaustively. If d(q, pi) + r < cr(pi) we do not need to consider subsequent clusters. The authors report unbeaten performance on high dimensional spaces, at the cost of O(n²/m) construction time.
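The LC search loop is easy to state precisely; the sketch below is ours, with the clusters given as (center, covering radius, bucket) triples in construction order.

    def lc_range_search(clusters, q, r, d):
        out = []
        for center, cr, bucket in clusters:
            dqc = d(q, center)
            if dqc <= r:
                out.append(center)
            if dqc - r <= cr:             # query ball touches the cluster
                out.extend(u for u in bucket if d(q, u) <= r)
            if dqc + r < cr:              # ball fully contained: stop here
                break
        return out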


5.2. Nearest-Neighbor Queries

We have concentrated on range search queries up to now. This is because, as we show in this section, most of the existing solutions for NN-queries are built systematically over range searching techniques, and indeed can be adapted to any of the data structures presented (despite having been originally designed for specific ones).

5.2.1. Increasing Radius. The simplest NN search algorithm is based on using a range searching algorithm as follows. Search q with fixed radii r = aⁱε (a > 1), starting with i = 0 and increasing it until at least the desired number of elements (1 or k) lies inside the search radius r = aⁱε.

Since the complexity of the range searching normally grows sharply on the search radius, the cost of this method can be very close to the cost of a range searching with the appropriate r (which is not known in advance). The increasing steps can be made smaller (a → 1) to avoid searching with a radius much larger than necessary.
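A minimal version of this NN algorithm, assuming a range_search(q, r) routine built on any of the indexes of Section 5.1 (the function names are ours):

    def nn_increasing_radius(q, d, range_search, k=1, eps=0.01, a=2.0):
        # Try radii r = eps * a^i for i = 0, 1, 2, ... until at least
        # k elements fall inside the query ball.
        r = eps
        while True:
            res = range_search(q, r)
            if len(res) >= k:
                return sorted(res, key=lambda u: d(q, u))[:k]
            r *= a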

5.2.2. Backtracking with Decreasing Radius. A more elaborate technique is as follows. We first explain the search for the closest neighbor. Start the search on any data structure using r∗ = ∞. Each time q is compared to some element p, update the search radius as r∗ ← min(r∗, d(q, p)) and continue the search with this reduced radius. This has been proposed, for example, for BKTs and FQTs [Burkhard and Keller 1973; Baeza-Yates et al. 1994] as well as for vector spaces [Roussopoulos et al. 1995].

As closer and closer elements to q are found, we search with smaller radius and the search becomes cheaper. For this reason it is important to try to find quickly elements that are close to the query (which is unimportant in range queries). The way to achieve this is dependent on the particular data structure. For example, in BKTs and FQTs we can begin at the root and measure i = d(p, q). Now, we consider the edges labeled i, i − 1, i + 1, i − 2, i + 2, and so on, and proceed recursively in the children (other heuristics may work better). Therefore, the exploration ends just after considering the branch i + r∗ (r∗ is reduced along the process). At the end r∗ is the distance to the closest neighbors and we have already seen all of them.

NNk(q) queries are solved as an extension of the above technique, where we keep the k elements seen that are closest to q and set r∗ as the maximum distance between those elements and q (clearly we are not interested in elements farther away than the current kth closest element). Each time a new element is seen whose distance is relevant, it is inserted as one of the k nearest neighbors known up to now (possibly displacing one of the old candidates from the list) and r∗ is updated. In the beginning we start with r∗ = ∞ and keep this value until the first k elements are found.

A variant of this type of query is the limited radius NN searching. Here we start with the maximum expected distance between the query element and its nearest neighbor. This type of query has been the focus of Yianilos [1999, 2000].

5.2.3. Priority Backtracking. The previous technique can be improved by a smarter selection of which elements to consider first. For clarity we consider backtracking in a tree, although the idea is general. Instead of following the normal backtracking order of the range query, modifying at most the order in which the subtrees are traversed, we give much more freedom to the traversal order. The goal is to increase the probability of quickly finding elements close to q and therefore reduce r∗ fast. This technique has been used in vector and metric spaces several times [Uhlmann 1991a; Hjaltason and Samet 1995; Ciaccia et al. 1997].

At each point of the search we have a set of possible subtrees that can be traversed (not necessarily all at the same level). For each subtree, we know a lower bound to the distance between q and any element of the subtree. We first select subtrees with the smallest lower bound. Once a subtree has been selected we compare q against its root, update r∗ and the candidates for output if necessary, and determine which of the children of the considered root deserve traversal. Unlike the normal backtracking, those children are not immediately traversed but added to the set of subtrees that have to be traversed at some moment. Then we select again a subtree from the set.

The best way to implement this search is with a priority queue ordered by the heuristic "goodness," where the subtrees are inserted and removed. We start with an empty queue where we insert the root of the tree. Then we repeat the step of removing the most promising subtree, processing it, and inserting the relevant subtrees until the lower bound of the best subtree exceeds r∗.

If applied to a BKT or a FQT, this method yields the same result as the previous section, but this technique is superior for dealing with continuous distances.
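A sketch of this priority-queue traversal for NN(q) follows (ours, not tied to a particular structure): Node is a generic hierarchical index node, and lower_bound(node, q) is any bound that never exceeds the true distance from q to an element of the subtree.

    import heapq
    from dataclasses import dataclass, field
    from typing import Any, List

    @dataclass
    class Node:
        element: Any                       # routing object stored here
        children: List["Node"] = field(default_factory=list)

    def best_first_nn(root, q, d, lower_bound):
        best, r_star = None, float("inf")
        heap = [(0.0, 0, root)]            # (bound, tie-breaker, node)
        tie = 1
        while heap:
            bound, _, node = heapq.heappop(heap)
            if bound > r_star:             # nothing left can improve r*
                break
            dq = d(q, node.element)
            if dq < r_star:
                best, r_star = node.element, dq
            for child in node.children:
                b = lower_bound(child, q)
                if b <= r_star:            # postponed, not traversed now
                    heapq.heappush(heap, (b, tie, child))
                    tie += 1
        return best, r_star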

5.2.4. Specific NN Algorithms. The techniques described above cover almost all the existing proposals for solving NN-queries. The only exception we are aware of was presented in Clarkson [1999], which is a GNAT-like data structure where the points are inserted into more than one subtree to limit backtracking (hence the space requirement is superlinear).

After selecting the representatives for the root of the tree, each element u is not only inserted into the subtree of its closest representative p, but also in the tree of any other representative p′ such that d(u, p′) ≤ 3d(u, p). At search time, the query q enters not only into its nearest representative p but also into every other representative p′ such that d(q, p′) ≤ 3d(q, p). As shown in Clarkson [1999] this is enough to guarantee that the nearest neighbor will be reached.

By using subsets of size n^{1/2^{k+1}} at depth k in the tree, the search time is polylogarithmic in n and the space requirement is O(n polylog n) if some conditions hold in the metric space.

5.3. Extensions

We cover in this section the work that has been pursued on extensions of the basic problems or in alternative models. None of these are the main focus of our survey.

5.3.1. Dynamic Capabilities. Many of the data structures for metric spaces are designed to be built on a static data set. In many applications this is not reasonable because elements have to be inserted and deleted dynamically. Some data structures tolerate insertions well, but not deletions.

We first consider insertions. Among the structures that we have surveyed, the least dynamic is the SAT, which needs full knowledge of the complete set at index construction time and has difficulty in handling later insertions (some solutions are described in Navarro and Reyes [2001]). The VP family (VPT, MVPT, VPF) has the problem of relying on global statistics (such as the median) to build the tree, so later insertions can be performed but the performance of the structure may deteriorate. Finally, the FQA needs, in principle, insertion in the middle of an array, but this can be handled by using standard techniques. All the other data structures can handle insertions in a reasonable way. There are some structure parameters that may depend on n and thus require periodical structural reorganization, but we disregard this issue here (e.g., adding or removing pivots is generally problematic).

Deletion is a little more complicated. In addition to the above structures, which present the same problems as for insertion, BKTs, GHTs, BSTs, VTs, GNATs, and the VP family cannot tolerate deletion of an internal tree node because it plays an essential role in organizing the subtree. Of course this can be handled as just marking the node as removed and actually keeping it for routing purposes, but the quality of the data structure is affected over time.

Therefore, the only structures that fully support insertions and deletions are the FQ family (FQT, FHQT, FQA, since there are no truly internal nodes), the AESA and LAESA approaches (since they are just vectors of coordinates), the MT (which is designed with dynamic capabilities in mind and whose insertion/deletion algorithms are reminiscent of those of the B-tree), and a variant of GHTs designed to support dynamic operations [Verbarg 1995]. The analysis of this latter structure shows that dynamic insertions can be done in O(log² n) amortized worst case time, and that deletions can be done at similar cost under some restrictions.


5.3.2. I/O Considerations. Most of the research on metric spaces deals with reducing the number of distance evaluations or at most the total CPU time. However, depending on the application, the I/O cost may play an important role. As most of the research on metric spaces has focused on algorithms to discard elements, I/O considerations have been normally left aside. The only exception to the rule is the MT, designed specifically for secondary memory. The tree nodes in the MT are to be stored in a single disk page (indeed, the MT does not fix an arity but rather a node capacity in bytes). Earlier balanced trees exist (such as the VP family), but the purpose of this balancing is to keep the extra CPU costs low. As we show later, unbalanced data structures perform much better in high dimensions, and it is unlikely that the reduced CPU costs may play an important role. The purpose of balancing the MT, on the other hand, is to keep I/O costs low, and depending on the application this may be even more important than the number of distance evaluations.

Other data structures could probably be adapted to perform well in secondary memory, but the authors simply have not considered the problem. For instance, it is not hard to imagine strategies for the tree data structures to pack "compact" subtrees in disk pages, so as to make as good use as possible of a page that is read from disk. When a subtree grows larger than a page it is split into two pages of approximately the same number of nodes. Of course, a B-tree-like scheme such as that of the MT has to be superior in this respect. Finally, array-oriented approaches such as FQA, AESA, and LAESA are likely to read all the disk pages of the index for each query, and hence have bad I/O performance.

5.3.3. Approximate and Probabilistic Algorithms. For the sake of a complete overview we include a brief description of an important branch of similarity searching, where a relaxation on the query precision is allowed to obtain a speedup in the query time complexity. This is reasonable in some applications because the metric space modelization already involves an approximation to the true answer (recall Section 2), and therefore a second approximation at search time may be acceptable. In addition to the query, one specifies a precision parameter ε to control how far away (in some sense) we want the outcome of the query from the correct result. A reasonable behavior for this type of algorithm is to asymptotically approach the correct answer as ε goes to zero, and complementarily to speed up the algorithm, losing precision, as ε moves in the opposite direction.

This alternative to exact similarity searching is called approximate similarity searching, and encompasses approximate and probabilistic algorithms. We do not cover them in depth here but present a few examples. Approximation algorithms for similarity searching are considered in depth in White and Jain [1996].

As a first example, we mention an approximate algorithm for NN search in real vector spaces using any Ls metric by Arya et al. [1994]. They propose a data structure, the BBD-tree, inspired in kd-trees, that can be used to find "(1 + ε) nearest neighbors": instead of finding u such that d(u, q) ≤ d(v, q) ∀v ∈ U, they find an element u∗, an (1 + ε)-nearest neighbor, differing from u by a factor of (1 + ε), that is, u∗ such that d(u∗, q) ≤ (1 + ε) d(v, q) ∀v ∈ U.

The essential idea behind this algorithm is to locate the query q in a cell (each leaf in the tree is associated with a cell in the decomposition). Every point inside the cell is processed to obtain the current nearest neighbor (u). The search stops when no promising cells are encountered, that is, when the radius of any ball centered at q and intersecting a nonempty cell exceeds the radius d(q, p)/(1 + ε). The query time is O(⌈1 + 6k/ε⌉^k k log n).


Fig. 9. The unified model for indexing and querying metric spaces.

A second example is a probabilistic algorithm for vector spaces [Yianilos 2000]. The data structure is like a standard kd-tree, using "aggressive pruning" to improve the performance. The idea is to increase the number of branches pruned at the expense of losing some candidate points in the process. This is done in a controlled way, so the probability of success is always known. In spite of the vector space focus of the algorithm, it could be generalized to metric spaces as well. The data structure is useful for finding only limited radius nearest neighbors, that is, neighbors within a fixed distance to the query.

Finally, an example of a probabilistic NN algorithm for general metric spaces is that of Clarkson [1999]. The original intention is to build a Voronoi-like data structure on a metric space. As this is not possible in general because there is not enough knowledge of the characteristics of the queries that will come later [Navarro 1999], Clarkson [1999] chooses to have a "training set" of queries and to build a data structure able to answer correctly only queries belonging to the training set. The idea is that this is enough to answer correctly an arbitrary query with high probability. Under some probabilistic assumptions on the distribution of the queries, it is shown that the probability of not finding the nearest neighbor is O((log n)²/K), where K can be made arbitrarily large at the expense of O(Kn^ρ) space and O(Kρ log n) expected search time. Here ρ is the logarithm of the ratio between the farthest and the nearest pairs of points in the union of U and the training set.

6. A UNIFYING MODEL

At first sight, the indexes and the search algorithms seem to emerge from a great diversity, and different approaches are analyzed separately, often under different assumptions. Currently, the only realistic way to compare two different algorithms is to apply them to the same data set.

In this section we make a formal introduction to our unifying model. Our intention is to provide a common framework to analyze all the existing approaches to proximity searching. As a result, we are able to capture the similarities of apparently different approaches. We also obtain truly new ways of viewing the problem.

The conclusion of this section can be summarized in Figure 9. All the indexing algorithms partition the set U into subsets. An index is built that allows determining a set of candidate subsets where the elements relevant to the query can appear. At query time, the index is searched to find the relevant subsets (the cost of doing this is called "internal complexity") and those subsets are checked exhaustively (which corresponds to the "external complexity" of the search).

The last two subsections describe the two main approaches to similarity searching in abstract terms.

6.1. Equivalence Relations

The relevance of equivalence classes for this survey comes from the possibility of partitioning a metric space so that a new metric space is derived from the quotient set. Readers familiar with equivalence relations can safely skip this short section.

Given a set X, a partition π(X) = {π1, π2, ...} is a collection of pairwise disjoint subsets whose union is X; that is, ∪πi = X and ∀i ≠ j, πi ∩ πj = ∅.

A relation, denoted by ∼, is a subset of the crossproduct X × X (the set of ordered pairs) of X. Two elements x, y are said to be related, denoted by x ∼ y, if the pair (x, y) is in the subset. A relation ∼ is said to be an equivalence relation if it satisfies, for all x, y, z ∈ X, the properties of reflexivity (x ∼ x), symmetry (x ∼ y ⇔ y ∼ x), and transitivity (x ∼ y ∧ y ∼ z ⇒ x ∼ z).

Every partition π(X) induces an equivalence relation ∼ and, conversely, every equivalence relation induces a partition: two elements are related if they belong to the same partition element. Every element πi of the partition is then called an equivalence class. An equivalence class is often named after one of its representatives (any element of πi can be taken as a representative). An alternative definition of an equivalence class of an element x is the set of all y such that x ∼ y. We denote the equivalence class of x as [x] = {y, x ∼ y}.

Given the set X and an equivalence relation ∼, we obtain the quotient π(X) = X/∼. It indicates the set of equivalence classes (or just classes), obtained when applying the equivalence relation to the set X.

6.2. Indexing and Partitions

The equivalence classes in the quotient set π(X) of a metric space X can be considered themselves as objects in a new metric space, provided we define a distance function in π(X). We introduce a new function D0 : π(X) × π(X) → R now defined in the quotient.

Definition 1. Given a metric space (X, d) and a partition π(X), the extension of d to π(X) is defined as D0([x], [y]) = inf{d(x, y) : x ∈ [x], y ∈ [y]}.

D0 gives the maximum possible values that keep the mapping contractive (i.e., D0([x], [y]) ≤ d(x, y) for any x, y). Unfortunately, D0 does not satisfy the triangle inequality, just (p1) to (p3), and in most cases (p4) (recall Section 3.1). Hence, D0 itself is not suitable for indexing purposes.

However, we can use any metric D that lower bounds D0 (i.e., D([x], [y]) ≤ D0([x], [y])). Since D is a metric, (π(X), D) is a metric space and therefore we can make queries in π(X) in the same way we have done in X. We redefine the outcome of a query in π(X) as ([q], r)D = {u ∈ U, D([u], [q]) ≤ r} (although formally we should retrieve classes, not elements).

Since the mapping is contractive (D([x], [y]) ≤ d(x, y)) we can convert one search problem into another, it is hoped simpler, search problem. For a given query (q, r)d we find out to which equivalence class the query q belongs (i.e., [q]). Then, using the new distance function D the query is transformed into ([q], r)D. As the mapping is contractive, we have (q, r)d ⊆ ([q], r)D. That is, ([q], r)D is indeed a candidate list, so it is enough to perform an exhaustive search on that candidate list (now using the original distance d), to obtain the actual outcome of the query (q, r)d.

Our main thesis is that the above procedure is in fact used in virtually every indexing algorithm (recall Figure 9).

PROPOSITION. All the existing indexing algorithms for proximity searching consist of building an equivalence relation, so that at search time some classes are discarded and the others are exhaustively searched.
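The whole procedure fits in a few lines once the class mapping and the distance D are given. The sketch below is only an illustration (ours): cls(x) returns an identifier of the class of x, and D is any metric on classes satisfying D(cls(x), cls(y)) ≤ d(x, y), which is what rules out false dismissals.

    def quotient_range_search(q, r, d, U, D, cls):
        # Internal part: compute the candidate list ([q], r)_D.
        candidates = [u for u in U if D(cls(q), cls(u)) <= r]
        # External part: verify candidates with the original distance.
        return [u for u in candidates if d(q, u) <= r]

With cls(x) = d(p, x) for a fixed pivot p and D(a, b) = |a − b|, this is exactly Example 1 below.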


As we show shortly, the most important tradeoff when designing the partition is to balance the cost to find ([q], r)D and the cost to verify this candidate list.

Fig. 10. Two points x and y, and their equivalence classes (the shaded rings). D gives the minimal distance among rings, which lower bounds the distance between x and y.

In Figure 10 we can see a schematic example of the idea. We divide the space in several regions (equivalence classes). The objects inside each region become indistinguishable. We can consider them as elements in a new metric space. To find the answer, instead of exhaustively examining the entire dictionary we just examine the classes that contain potentially interesting objects. In other words, if a class can contain an element that should be returned in the outcome of the query, then the class will be examined (see also the rings considered in Figure 2).

We recall that this property is not enough for an arbitrary NN search algorithm to work (since the mapping would have to preserve proximity instead), but most existing algorithms for NN are based on range queries (recall Section 5.2), and these algorithms can be applied as well.

Some examples may help to understand the above definitions, for both the concept of equivalence relation and the obtained distance function.

Example 1. Say that we have an arbitrary reference pivot p ∈ X and the equivalence relation is given by x ∼ y ⇔ d(p, x) = d(p, y). In this case D([x], [y]) = |d(x, p) − d(y, p)| is a safe lower bound for the D0 distance (guaranteed by the triangle inequality). For a query of the form (q, r)d the candidate list ([q], r)D consists of all elements x such that |d(q, p) − d(x, p)| ≤ r. Graphically, this distance represents a ring centered at p containing a ball centered at q with radius r (recall Figures 10 and 8). This is the familiar rule used in many independent algorithms to trim the space.

Example 2. As explained, the similarity search problem was first introduced in vector spaces, and the very first family of algorithms used there was based on a partition operation. These algorithms were called bucketing methods, and consisted of the construction of cells or buckets [Bentley et al. 1980]. Searching for an arbitrary point in R^k is converted into an exhaustive search in a finite set of cells. The procedure uses two steps: they find which cell the query point belongs to and build a set of candidate cells using the query range; then they inspect this set of candidate cells exhaustively to find the actual points inside the query range.⁴ In this case the equivalence classes are the cells, and the tradeoff is that the larger the cells, the cheaper it is to find the appropriate ones, but the more costly is the final exhaustive search.

6.3. Coarsening and Refining a Partition

We start by defining the concepts of refinement and coarsening.

Definition 2. Let ∼1 and ∼2 be two equivalence relations over a set X. We say that ∼1 is a refinement of ∼2, or that ∼2 is a coarsening of ∼1, if for any pair x, y ∈ X such that x ∼1 y it holds x ∼2 y. The same terms can be applied to the corresponding partitions π1(X) and π2(X).

Refinement and coarsening are important concepts for the topic we are discussing. The following lemma shows the effect of coarsening on the effectiveness of the partition for searching purposes.

⁴ The algorithm is in fact a little more sophisticated because they try to find the nearest neighbor of a point. However, the version presented here for range queries is in the same spirit as the original one.


LEMMA 1. If ∼1 is a coarsening of ∼2 then their extended distances D1 and D2 have the property D1([x], [y]) ≤ D2([x], [y]).

PROOF. Let us denote [x]i and [y]i the equivalence classes of x and y under equivalence relation ∼i. Then

D1([x], [y]) = inf{d(x, y) : x ∈ [x]1, y ∈ [y]1} ≤ inf{d(x, y) : x ∈ [x]2, y ∈ [y]2} = D2([x], [y]),

since [x]2 ⊆ [x]1 and [y]2 ⊆ [y]1.

An interesting idea arising from the above lemma is to build a hierarchy of coarsening operations. Using this hierarchy we could proceed downwards from a very coarse level building a candidate list of equivalence classes of the next level. This candidate list will be refined using the next distance function and so on until we reach the bottom level.

6.4. Discriminative Power

As sketched previously, most indexing algorithms rely on building an equivalence relation. The corresponding search algorithms have the following parts.

1. Find the classes that may be relevant for the query.
2. Exhaustively search all the elements of these classes.

The first part involves performing some evaluations of the d distance, as shown in Example 1 above. It may also involve some extra CPU time (which although not the central point in this article, must be kept reasonable). The second part consists of directly comparing the query against the candidate list. The following definition gives a name to both parts of the search cost.

Definition 3. Let A be a search algorithm over (X, d) based on a mapping to (π(X), D), and let (q, r)d be a range query. Then the internal complexity of A is the number of evaluations of d necessary to compute ([q], r)D, and the external complexity is |([q], r)D|.

We recall that |([q], r)D| refers to the number of elements in the original metric space, not the number of classes retrieved.

There is a concept related to the external complexity of a search algorithm, which we define next.

Definition 4. The discriminative power of a search algorithm based on a mapping from (X, d) to (π(X), D), with regard to a query (q, r)d of nonempty outcome, is defined as |(q, r)d| / |([q], r)D|.

Although the definition depends on q and r, we can speak in general terms of the discriminative power by averaging over the qs and rs of interest. The discriminative power serves as an indicator of the performance or fitness of the equivalence relation.

In general, it will be more costly to have more discriminative power. The indexing scheme needs to find a balance between the complexity to find the relevant classes and the discriminative power.

Let us consider Example 1. The internal complexity is one distance evaluation (the distance from q to p), and the external complexity will correspond to the number of elements that lie in the selected ring. We could intersect it with more rings (increasing internal complexity) to reduce the external complexity.

The tradeoff is partially formalized with this lemma.

LEMMA 2. If A1 and A2 are search algorithms based on equivalence relations ∼1 and ∼2, respectively, and ∼1 is a coarsening of ∼2, then A1 has higher external complexity than A2.

PROOF. We have to show that, for any r, ([q], r)D2 ⊆ ([q], r)D1. But this is clear, since D1([x], [y]) ≤ D2([x], [y]) implies ([q], r)D2 = {y ∈ U, D2([q], [y]) ≤ r} ⊆ {y ∈ U, D1([q], [y]) ≤ r} = ([q], r)D1.

Although having more discriminative power normally costs more internal evaluations, one can make better or worse use of the internal complexity. We elaborate more on this next.


Fig. 11. An equivalence relation induced by intersecting rings centered in two pivots, and how a query is transformed.

6.5. Locality of a Partition

The equivalence classes can be thought of as a set of nonintersecting cells in the space, where every element inside a given cell belongs to the same equivalence class. However, the mathematical definition of an equivalence class is not confined to a single cell. We define "locality" as a property of a partition that stands for how much the classes resemble cells.

Definition 5. The nonlocality of a partition π(X) = {π1, π2, ...} with respect to a finite dictionary U is defined as maxi max{d(x, y) : x, y ∈ πi ∩ U}, that is, as the maximum distance between elements lying in the same class.

We say that a partition is "local" or "nonlocal" meaning that it has high or low locality. Figure 11 shows an example of a nonlocal partition (u5 and u12 lie in separate fragments of a single class). It is natural to expect more discriminative power from a local partition than from a nonlocal one. This is because in a nonlocal partition the candidate list tends to contain elements actually far away from the query.

Notice that in Figure 11 the locality would improve sharply if we added a third pivot. In a vector space of k dimensions, it suffices to consider k + 1 pivots in general position⁵ to obtain a highly local partition. In general metric spaces we can also take a sufficient number of pivots so as to obtain highly local partitions.

However, obtaining local partitions may be expensive in internal complexity and not enough to achieve low external complexity, otherwise the bucketing method for vector spaces [Bentley et al. 1980] explained in Example 2 would have excellent performance. Even with such a local partition and assuming uniformly distributed data, a number of empty cells are verified, whose volume grows exponentially with the dimension. We return later to this issue.

6.6. The Pivot Equivalence Relation

A large class of algorithms to build the equivalence relations is based on pivoting. This consists of considering the distances between an element and a number of preselected "pivots" (i.e., elements of U or even X, also called reference points, vantage points, keys, queries, etc. in the literature).

The equivalence relation is defined in terms of the distances of the elements to the pivots, so that two elements are equivalent if they are at the same distance from all the pivots. If we consider one pivot p, then this equivalence relation is

x ∼p y ⇔ d(x, p) = d(y, p).

⁵ That is, not lying on a (k − 1)-hyperplane.


Fig. 12. Mapping from a metric space onto a vector space under the L∞ metric, using two pivots.

The equivalence classes correspond to the intuitive notion of the family of sphere shells with center p. Points falling in the same sphere shell (i.e., at the same distance from p) are equivalent from the viewpoint of p.

The above equivalence relation is easily generalized to k pivots.

Definition 6. The pivot equivalence relation based on elements {p1, ..., pk} (the k pivots) is defined as

x ∼{pi} y ⇔ ∀i, d(x, pi) = d(y, pi).

A graphical representation of the class in the general case corresponds to the intersection of several sphere shells centered at the points pi (recall Figure 11).

The distance d(x, y) cannot be smaller than |d(x, p) − d(y, p)| for any element p, because of the triangle inequality. Hence D([x], [y]) = |d(x, p) − d(y, p)| is a safe lower bound to the D0 function corresponding to the class of sphere shells centered in p. With k pivots, this becomes D([x], [y]) = maxi{|d(x, pi) − d(y, pi)|}. This D distance lower bounds d and hence can be used as our distance in the quotient space.

Alternatively, we can consider the equivalence relation as a projection to the vector space R^k, k being the number of pivots used. The ith coordinate of an element is the distance of the element to the ith pivot. Once this is done, we can identify points in R^k with elements in the original space with the L∞ distance. As we have described in Section 6, the indexing algorithm will consist of finding the set of equivalence classes such that they fall inside the radius of the search when using D in the quotient space. In this particular case for a query of the form (q, r)d we have to find the candidate list as the set ([q], r)D, that is, the set of equivalence classes [y] such that D([q], [y]) ≤ r. In other words, we want the set of objects y such that maxi{|d(q, pi) − d(y, pi)|} ≤ r. This is equivalent to search with the L∞ distance in the vector space R^k where the equivalence classes are projected. Figure 12 illustrates this concept (Figure 11 is also useful).

Yet a third way to see the technique, less formal but perhaps more intuitive, is as follows. To check if an element u ∈ U belongs to the query outcome, we try a number of random pivots pi. If, for any such pi, we have |d(q, pi) − d(u, pi)| > r, then by the triangle inequality we know that d(q, u) > r without the need to actually evaluate d(q, u). At indexing time we precompute the d(u, pi) values and at search time we compute the d(q, pi) values. Only those elements u that cannot be discarded by looking at the pivots are actually checked against q.
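This third view translates directly into code. The sketch below (ours, assuming hashable elements) precomputes the pivot table at indexing time and filters with the L∞ lower bound at query time.

    def build_pivot_index(U, pivots, d):
        # Indexing time: k*n precomputed distances d(u, p_i).
        return {u: tuple(d(u, p) for p in pivots) for u in U}

    def pivot_range_search(q, r, d, U, pivots, index):
        dq = [d(q, p) for p in pivots]     # internal complexity: k
        out = []
        for u in U:
            # D([q],[u]) = max_i |d(q,p_i) - d(u,p_i)| lower bounds d(q,u),
            # so u is discarded for free whenever this maximum exceeds r.
            if max(abs(a - b) for a, b in zip(dq, index[u])) > r:
                continue
            if d(q, u) <= r:               # external complexity
                out.append(u)
        return out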


Fig. 13. A compact partition using four centers and two query balls intersecting some classes. Note that the class of c3 could be excluded from consideration for q1 by using the covering radius criterion but not the hyperplane criterion, whereas the opposite happens to discard c4 for q2.

6.7. The Compact Equivalence Relation

A different type of equivalence relation, used by another large class of algorithms, is defined with respect to the proximity to a set of elements (which we call "centers" to distinguish them from the pivots of the previous section).

Definition 7. The compact equivalence relation based on {c1, ..., cm} (the centers) is

x ∼{ci} y ⇔ closest(x, {ci}) = closest(y, {ci}),

where closest(z, S) = {w ∈ S, ∀w′ ∈ S, d(z, w) ≤ d(z, w′)}. The associated partition is called a compact partition.

That is, we divide the space with one partition element for each ci, and the class of ci is that of the points that have ci as their closest center. Figure 13 shows an example in (R², L2). In particular, note that we can select U as the set of centers, in which case the partition has optimal locality. Even if {ci} is not U, compact partitions have good locality.

In vector spaces the compact partition coincides with the Voronoi partition if the whole set U is the set of centers. Its associated concept, the "Delaunay tesselation," is a graph whose nodes are the elements of U and whose edges connect nodes whose classes share a border. The Delaunay tesselation is the basis of very good algorithms for proximity searching [Aurenhammer 1991; Yao 1990]. For example, an O(log n) time NN algorithm exists in two dimensions. Unfortunately, this algorithm does not generalize efficiently to more than two dimensions. The Delaunay tesselation, which has O(n) edges in two dimensions, can have O(n²) edges in three and more dimensions.

In a general metric space, the D0([x], [y]) distance in the space π(X) of the compact classes is, as before, the smallest distance between points x ∈ [x] and y ∈ [y]. To find [q] we basically need to find the nearest neighbor of q in the set of centers. The outcome of the query ([q], r)D0 is the set of classes intersected by a query ball (see Figure 13).

A problem in general metric spaces is that it is not easy to bound a class so as to determine whether the query ball intersects it. From the many possible criteria, two are the most popular.

6.7.1. Hyperplane Criterion. This is the most basic one and the one that best expresses the idea of compactness. In essence, if c is the center of the class [q] (i.e., the center closest to q), then (1) the query ball of course intersects [c]; (2) the query ball does not intersect [ci] if d(q, c) + r < d(q, ci) − r. Graphically, if the query ball does not intersect the hyperplane dividing its closest neighbor and another center ci, then the ball is totally outside the class of ci.

6.7.2. Covering Radius Criterion. This tries to bound the class [ci] by considering a ball centered at ci that contains all the elements of U that lie in the class.

Definition 8. The covering radius of c for U is cr(c) = max{d(c, u) : u ∈ [c] ∩ U}.

Now it is clear that we can discard ci if d(q, ci) − r > cr(ci).
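Both criteria combine naturally in a single scan of the classes. The following sketch is ours, with each class given as a (center, covering radius, bucket) triple:

    def compact_range_search(classes, q, r, d):
        dq = [d(q, c) for c, _, _ in classes]
        d_closest = min(dq)                # distance to the center of [q]
        out = []
        for (c, cr, bucket), dqc in zip(classes, dq):
            if dqc <= r:
                out.append(c)
            if dqc - r > d_closest + r:    # hyperplane criterion
                continue
            if dqc - r > cr:               # covering radius criterion
                continue
            out.extend(u for u in bucket if d(q, u) <= r)
        return out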


Fig. 14. A low-dimensional (left) and high-dimensional (right) histogram of distances, showing that on high dimensions virtually all the elements become candidates for exhaustive evaluation. Moreover, we should use a larger r in the second plot in order to retrieve some elements.

7. THE CURSE OF DIMENSIONALITY

As explained, one of the major obstacles for the design of efficient search techniques on metric spaces is the existence and ubiquity in real applications of the so-called high-dimensional spaces. Traditional indexing techniques for vector spaces (such as kd-trees) have an exponential dependency on the representational dimension of the space (as the volume of a box or hypercube containing the answers grows exponentially with the dimension).

More recent indexing techniques for vector spaces and those for generic metric spaces can get rid of the representational dimension of the space. This makes a big difference in many applications that handle vector spaces of high representational dimension but low intrinsic dimension (e.g., a plane immersed in a 50-dimensional vector space, or simply clustered data). However, in some cases even the intrinsic dimension is very high and the problem becomes intractable for exact algorithms, and we have to resort to approximate or probabilistic algorithms (Section 5.3.3).

Our aim in this section is (a) to show that the concept of intrinsic dimensionality can be conceived even in a general metric space; (b) to give a quantitative definition of the intrinsic dimensionality; (c) to show analytically the reason for the so-called "curse of dimensionality;" and (d) to discuss the effects of pivot selection techniques. This is partially based on previous work [Chávez and Navarro 2001a].

7.1. Intrinsic Dimensionality

Let us start with a well-known example. Consider a distance such that d(x, x) = 0 and d(x, y) = 1 for all x ≠ y. Under this distance (in fact an equality test), we do not obtain any information from a comparison except that the element considered is or is not our query. It is clear that it is not possible to avoid a sequential search in this case, no matter how smart our indexing technique is.

Let us consider the histogram of distances between points in the metric space X. This can be approximated by using the dictionary U as a random sample of X. This histogram is mentioned in many papers, for example, Brin [1995], Chávez and Marroquín [1997], and Ciaccia et al. [1998a], as a fundamental measure related to the intrinsic dimensionality of the metric space. The idea is that, as the space has higher intrinsic dimension, the mean µ of the histogram grows and its variance σ² is reduced. Our previous example is an extreme case.

Figure 14 gives an intuitive explanation of why the search problem is harder when the histogram is concentrated. If we consider a random query q and an indexing scheme based on random pivots, then the possible distances between q and a pivot p are distributed according to the histogram of the figure.

The elimination rule says that we can discard any point u such that d(p, u) ∉ [d(p, q) − r, d(p, q) + r]. The grayed areas in the figure show the points that we cannot discard. As the histogram is more and more concentrated around its mean, fewer points can be discarded using the information given by d(p, q).

This phenomenon is independent of the nature of the metric space (vectorial or not, in particular) and gives us a way to quantify how hard it is to search on an arbitrary metric space.

Definition 9. The intrinsic dimensionality of a metric space is defined as ρ = µ²/(2σ²), where µ and σ² are the mean and variance of its histogram of distances.

The technical convenience of the exact definition is made clear shortly. The important part is that the intrinsic dimensionality grows with the mean and decreases with the variance of the histogram.

The particular cases of the Ls distances in vector spaces are useful illustrations. As shown in Yianilos [1999], a uniformly distributed k-dimensional vector space under the Ls distance has mean Θ(k^{1/s}) and standard deviation Θ(k^{1/s−1/2}). Therefore its intrinsic dimensionality is Θ(k) (although the constant is not necessarily 1). So the intuitive concept of dimensionality in vector spaces matches our general concept of intrinsic dimensionality.
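Definition 9 is easy to estimate in practice. The sketch below (ours) samples random pairs from the dictionary, which stands in for the space X as suggested above:

    import random

    def estimate_rho(U, d, pairs=10000, seed=0):
        # Monte Carlo estimate of rho = mu^2 / (2 sigma^2) over the
        # histogram of distances, sampled from random pairs of U.
        rng = random.Random(seed)
        dists = [d(*rng.sample(U, 2)) for _ in range(pairs)]
        mu = sum(dists) / len(dists)
        var = sum((x - mu) ** 2 for x in dists) / len(dists)
        return mu * mu / (2 * var)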

7.2. A Lower Bound for Pivoting Algorithms

Our main result in this section relates the intrinsic dimensionality with the difficulty of searching with a given search radius r using a pivoting equivalence relation that chooses the pivots at random. As we show next, the difficulty of the problem is related to r and the intrinsic dimensionality ρ.

We are considering independent identically distributed random variables for the distribution of distances between points. Although not accurate, this simplification is optimistic and hence can be used to lower bound the performance of the indexing algorithms. We come back to this shortly.

Let (q, r)d be a range query over a metric space indexed by means of k random pivots, and let u be an element of U. The probability that u cannot be excluded from direct verification after considering the k pivots is exactly

Pr(|d(q, p1) − d(u, p1)| ≤ r, ..., |d(q, pk) − d(u, pk)| ≤ r).     (a)

Since all the pivots are assumed to be random and their distance distributions independent identically distributed random variables, this expression is the product of probabilities

Pr(|d(q, p1) − d(u, p1)| ≤ r) × ··· × Pr(|d(q, pk) − d(u, pk)| ≤ r)     (b)

which for the same reason can be simplified to

Pr(not discarding u) = Pr(|d(q, p) − d(u, p)| ≤ r)^k

for a random pivot p.

If X and Y are two independent identically distributed random variables with mean µ and variance σ², then the mean of X − Y is 0 and its variance is 2σ². Using Chebyschev's inequality⁶ we have that Pr(|X − Y| > ε) < 2σ²/ε². Therefore,

Pr(|d(q, p) − d(u, p)| ≤ r) ≥ 1 − 2σ²/r²,

where σ² is precisely the variance of the distance distribution in our metric space. The argument that follows is valid for 2σ²/r² < 1, or r > √2 σ (large enough radii); otherwise the lower bound is zero. Then, we have

Pr(not discarding u) ≥ (1 − 2σ²/r²)^k.

⁶ For an arbitrary distribution Z with mean µz and variance σz², Pr(|Z − µz| > ε) < σz²/ε².

We have now that the total search cost is the number of internal distance evaluations (k) plus the external evaluations, whose number is on average n × Pr(not discarding u). Therefore

Cost ≥ k + n (1 − 2σ²/r²)^k

is a lower bound to the average search cost by using random pivots. Optimizing we obtain that the best k is

k∗ = (ln n + ln ln(1/t)) / ln(1/t),

where t = 1 − 2σ²/r². Using this optimal k∗, we obtain an absolute (i.e., independent on k) lower bound for the average cost of any random pivot-based algorithm:

Cost ≥ (ln n + ln ln(1/t) + 1) / ln(1/t) > ln n / ln(1/t) ≥ (r²/(2σ²)) ln n

which shows that the cost depends strongly on σ/r. As r increases t tends to 1 and the scheme requires more and more pivots and it is anyway more and more costly.

A nice way to represent this result is to assume that we are interested in retrieving a fixed fraction of at most f of the elements, in which case r can be written as r = µ − σ/√f by Chebyschev's inequality. In this case the lower bound becomes

(r²/(2σ²)) ln n = ((µ − σ/√f)²/(2σ²)) ln n = ρ (1 − 1/√(2ρf))² ln n = (√ρ − 1/√(2f))² ln n

which is valid for f ≥ 1/(2ρ). We have just proved the following.

THEOREM 1. Any pivot-based algorithm using random pivots has a lower bound (√ρ − 1/√(2f))² ln n in the average number of distance evaluations performed for a random range query retrieving at most a fraction f of the set, where ρ is the intrinsic dimension of the metric space.

This result matches that of Baeza-Yates [1997] and Baeza-Yates and Navarro [1998] on FHQTs, about obtaining Θ(log n) search time using Θ(log n) pivots, but here we are more interested in the "constant" term, which depends on the dimension. The theorem shows clearly that the parameters governing the performance of range-searching algorithms are ρ and f. One can expect that f keeps constant as ρ grows, but it is possible that in some applications the query q is known to be a perturbation of some element of U and therefore we can keep a constant search radius r as the dimension grows. Even in those cases the lower bound may grow if σ shrinks with the dimension, as in the Ls vector spaces with s > 2.

We have considered independent identically distributed random variables for each pivot and the query. This is a reasonable approximation, as we do not expect much difference between the "viewpoints" from the general distribution of distances to the individual distributions (see Section 7.4). The expression given in Equation (7.2b) cannot be obtained without this simplification.

A stronger assumption comes from considering all the variables as independent. This is an optimistic consideration equivalent to assuming that in order to discard each element u of the set we take k new pivots at random. The real algorithm fixes k random pivots and uses them to try to discard all the elements u of the set. The latter alternative can suffer from dependencies from a point u to another, which cannot happen in the former case (e.g., if u is close to the third pivot and u′ is close to u then the distance from u′ to the third pivot carries less information). Since the assumption is optimistic, using it to reduce the joint distribution in Equation (7.2a) to the expression given in Equation (7.2b) keeps the lower bound valid.
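The formulas of this section are straightforward to evaluate numerically; the small sketch below (ours) reproduces the model, returning None when r ≤ √2 σ and the bound is vacuous:

    import math

    def pivot_cost_model(n, sigma, r):
        # t = 1 - 2 sigma^2 / r^2; k* = (ln n + ln ln(1/t)) / ln(1/t);
        # Cost(k) >= k + n t^k.
        t = 1 - 2 * sigma ** 2 / r ** 2
        if t <= 0:
            return None
        li = math.log(1 / t)
        k_star = (math.log(n) + math.log(li)) / li
        return k_star, k_star + n * t ** k_star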


Fig. 15. Internal, external, and overall distance evaluations in eight dimensions, using different numbers of pivots k.

Fig. 16. Overall distance evaluations as the dimension grows for fixed k.


The n = 100,000 elements are generated at random and the pivots are randomly chosen from the set. We average over 1,000 random queries whose radius is set to retrieve 10 elements of the set. We count the number of distance evaluations. Figure 15 shows the existence of an optimum k* = 110, and Figure 16 shows the predicted O(n(1 − 1/Θ(ρ))^k) behavior. We do not have enough memory in our machine to show the predicted growth in k* ≈ Θ(ρ) ln n.

Figure 17 shows the effect in a different way. As the dimension grows, the histogram of the L2 distance moves to the right (μ = Θ(√dim)). Yet the pivot distance D (in the projected space (R^k, L∞)) remains about the same for fixed k. Increasing k from 32 to 512 moves the histogram slightly to the right. This shift is effective in low-dimensional metric spaces, but it is ineffective in high dimensions. The plots of these two histograms can measure how good the pivoting algorithms are. Intuitively, the overlap between the histogram for the pivot distance and the histogram for the original distance is directly proportional to the discriminative power of the pivot mapping. As the overlap increases, the algorithms become more effective.

The particular behavior of D in Figure 17 is due to the fact that D is the maximum of k random variables whose mean is Θ(σ) (i.e., |d(p, q) − d(p, x)|). The fact that D does not grow with the dimension means that, as a lower bound for d, it gets less effective in higher dimensions.
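The mapping behind Figure 17 is easy to reproduce on synthetic data. The sketch below is our illustration, not the original experiment (names and parameters are ours): it generates uniform random vectors, maps them with k random pivots, and compares the original L2 distance against the pivot distance D, which by the triangle inequality can never exceed it.

```python
import math
import random

def l2(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def sample_histograms(dim=16, k=32, n=2000, pairs=5000, seed=42):
    rnd = random.Random(seed)
    pts = [[rnd.random() for _ in range(dim)] for _ in range(n)]
    pivots = rnd.sample(pts, k)
    coords = [[l2(u, p) for p in pivots] for u in pts]  # the pivot mapping
    d_vals, D_vals = [], []
    for _ in range(pairs):
        i, j = rnd.randrange(n), rnd.randrange(n)
        d_vals.append(l2(pts[i], pts[j]))
        # D = max_i |d(x,p_i) - d(y,p_i)|: L_inf in the target space, a lower bound for d
        D_vals.append(max(abs(a - b) for a, b in zip(coords[i], coords[j])))
    return d_vals, D_vals

d_vals, D_vals = sample_histograms()
print("mean d:", sum(d_vals) / len(d_vals))   # grows with the dimension
print("mean D:", sum(D_vals) / len(D_vals))   # stays roughly flat for fixed k
```

Raising dim while keeping k fixed shows the mean of d growing while that of D stagnates, which is exactly the separation of histograms visible in Figure 17.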

Fig. 17. Histograms comparing the L2 distance in different dim-dimensional Euclidean spaces and the pivot distance (MaxDist) obtained using different numbers k of pivots. In the top row dim = 16 and in the bottom row dim = 512. On the left k = 32 and on the right k = 128.


7.3. A Lower Bound for Compact Partitioning Algorithms

We now try to obtain a lower bound to the search cost of algorithms based on the compact partition. The result is surprisingly similar to that of the previous section. Our lower bound considers only the hyperplane criterion, which is the most purely associated with the compact partition. We assume just the same facts about the distribution of distances as in the previous section: all of them are independent, identically distributed random variables.

Let (q, r) be a range query over a metric space indexed by means of m random centers {c_1, ..., c_m}. The m distances d(q, c_i) can be considered as random variables X_1, ..., X_m whose distribution is that of the histogram of the metric space. The distribution of the distance from q to its closest center c is that of Y = min{X_1, ..., X_m}. The hyperplane criterion specifies that a class [c_i] cannot be excluded if d(q, c) + r ≥ d(q, c_i) − r. The probability that this happens is Pr(Y ≥ X_i − 2r). But since Y is the minimum over m variables with the same distribution, the probability is Pr(Z ≥ X − 2r)^m, where X and Z are two random variables distributed according to the histogram. Using Chebyshev's inequality and noticing that if Z < X − 2r then X or Z are at distance at least r from their mean, we can say that

Pr(not discarding [c_i]) = Pr(Z ≥ X − 2r)^m ≥ (1 − σ²/r²)^m.

On average each class has n/m elements, so that the external complexity is n × Pr(not discarding [c_i]). The internal cost to find the intersected classes deserves some discussion. In all the hierarchical schemes that exist, we consider that the real partition is that induced by the leaves of the trees, that is, the most refined ones. We see all the rest of the hierarchy as a mechanism to reduce the internal complexity of finding the small classes (hence the m we use here is not, say, the m of GNATs, but the total number of final classes). It is difficult to determine this internal complexity (an upper bound is m), so we call it I_C(m), knowing that it is between Ω(log m) and O(m). Then a lower bound to the search complexity is

Cost ≥ I_C(m) + n(1 − σ²/r²)^m,

which indeed is very similar to the lower bound on pivot-based algorithms. Optimizing on m yields

m* = (ln n + ln ln(1/t*) − ln I_C′(m*)) / ln(1/t*),

where t* = 1 − σ²/r². Using the optimal m*, the search cost is lower bounded by

Cost = Ω(I_C(log_{1/t*} n)) = Ω(I_C((r²/σ²) ln n)),

which also shows an increase in the cost as the dimensionality grows. As before we can write r = μ − σ/√f. We have just proved the following.

THEOREM 2. Any compact partitioning algorithm based on random centers has a lower bound I_C(2(√ρ − 1/√(2f))² ln n) in the average number of distance evaluations performed for a random range query retrieving a fraction of at most f of the database, where ρ is the intrinsic dimension of the space and I_C() is the internal complexity to find the relevant classes, satisfying Ω(log m) = I_C(m) = O(m).

This result is weaker than Theorem 1 because of our inability to give a good lower bound on I_C, so we cannot ensure more than a logarithmic increase with respect to ρ. However, even assuming I_C(m) = Θ(m) (i.e., exhaustive search of the classes), when the theorem becomes very similar to Theorem 1, there is an important reason that explains why the compact partitioning algorithms can in practice be better than pivot-based ones. We can in this case achieve the optimal number of centers m*, which is impossible in practice for pivot-based algorithms. The reason is that it is much more economical to represent the compact partition using m centers than the pivot partition using k pivots.

Figure 18 shows an experiment on the same dataset, where we have used different m values and a hierarchical compact partitioning based on them. We have used the hyperplane and the covering radius criteria to prune the search. As can be seen, the dependency on the dimension of the space is not so sharp as for pivot-based algorithms, and is closer to a dependency of the form Θ(ρ).

Fig. 18. Overall distance evaluations using a hierarchical compact partitioning with different arities.

The general conclusion is that, even if the lower bounds using pivot-based or compact partitioning algorithms look similar, the first ones need much more space to store the classes resulting from k pivots than the last ones using the same number of partitions.

Hence, the latter can realistically use the optimal number of classes, whereas the former cannot. If pivot-based algorithms are given all the necessary memory, then using the optimal number of pivots they can improve over compact partitioning algorithms, because t is better than t*, but this becomes more and more difficult as the dimension grows.

Figure 19 compares both types of partitioning. As can be seen, the pivoting algorithm improves over the compact partitioning if we give it enough pivots. However, "enough" is a number that increases with the dimension and with the fraction retrieved (i.e., with ρ and f). For ρ and f large enough, the required number of pivots will be unacceptably high in terms of memory requirements.
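The tradeoff can be explored numerically. The sketch below is our own illustration, not the authors' experiment: it assumes the pessimistic internal complexity I_C(m) = m, takes t* = 1 − σ²/r² as derived above, and uses t = 1 − 2σ²/r² for the pivot-based bound, an assumption consistent with the remark that t is better than t* (it requires r² > 2σ²).

```python
def compact_cost(m, n, s2_r2):
    # Lower bound I_C(m) + n (1 - sigma^2/r^2)^m, with I_C(m) = m (the O(m) worst case)
    return m + n * (1.0 - s2_r2) ** m

def pivot_cost(k, n, s2_r2):
    # Analogous pivot-based bound k + n t^k, with t = 1 - 2 sigma^2/r^2 (our assumption)
    return k + n * (1.0 - 2.0 * s2_r2) ** k

n = 100_000
s2_r2 = 0.05   # sigma^2 / r^2; it shrinks as the intrinsic dimension rho grows
k_star = min(range(1, 3000), key=lambda k: pivot_cost(k, n, s2_r2))
m_star = min(range(1, 3000), key=lambda m: compact_cost(m, n, s2_r2))
print("k* =", k_star, " pivot cost ~", round(pivot_cost(k_star, n, s2_r2)))
print("m* =", m_star, " compact cost ~", round(compact_cost(m_star, n, s2_r2)))
```

With these numbers the pivot-based bound wins, but reaching its optimum implies storing k*·n distances, whereas reaching m* only requires representing m* centers: the memory argument made above.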
7.4. Pivot and Center Selection Techniques

In Faragó et al. [1993], they prove formally that if the dimension is constant, then after properly selecting (not at random!) a constant number k of pivots, the exhaustive search costs O(1). This contrasts with our Ω(log n) lower bound. The difference is that they do not take the pivots at random but select a set of pivots that is assumed to have certain selectivity properties. This shows that the way in which the pivots are selected can affect the performance. Unfortunately, their pivot selection technique works for vector spaces only.

Little is known about pivot/center (let us call them collectively, "reference") selection policies, and in practice most methods choose them at random, with a few exceptions. For instance, in Shapiro [1977] it is recommended to select pivots outside the clusters, and Baeza-Yates et al. [1994] suggest using one pivot from each cluster. All authors agree in that the references should be far apart from each other, which is evident since close references will give almost the same information. On the other hand, references selected at random are already far apart in a high-dimensional space.

The histogram of distances gives a formal characterization of good references. Let us start with a definition.

Definition 10. The local histogram of an element u is the distribution of distances from u to every x ∈ X.

A good reference has a flatter histogram, which means that it will discard more elements at query time. The measure ρ = μ²/(2σ²) of intrinsic dimensionality (now defined on the local histogram of u) can be used as a good parameter to evaluate how good a reference is (good references have local histograms with small ρ).

This is related to the difference in viewpoints (histograms) between different references, a subject discussed in depth in Ciaccia et al. [1998a]. The idea is that the local histogram of a reference u may be quite different from the global histogram (especially if u is not selected at random). This is used in Ciaccia et al. [1998a] to show that if the histograms for different references are similar, then they can accurately predict the behavior of instances of their data structure (the MT), a completely different issue.

Note also that good reference selection becomes harder as the intrinsic dimension grows. As the global histogram reduces its variance, it becomes more difficult to find references that deviate significantly from it, as already noted in Ciaccia et al. [1998a].

The histogram characterization explains a well-known phenomenon: to discriminate among the elements in the class of a local partition (or in a cluster), it is a good idea to select a pivot from the same class. This makes it more probable to select an element close to them (the ideal would be a centroid). In this case the distances tend to be smaller and the histogram is not so concentrated in large values. For instance, for LAESA, Micó et al. [1994] do not use the pivots in a fixed order: the next pivot used is the one with minimal L1 distance to the current candidates. On the other hand, outliers can be good at a first approximation, in order to discriminate among clusters, but later they are unable to separate well the elements of the same cluster.
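A minimal sketch (ours) of this selection heuristic: estimate ρ = μ²/(2σ²) on each candidate's local histogram, sampled against the database, and keep the candidates with the flattest histograms (smallest ρ). The function names are illustrative.

```python
import math
import random

def local_rho(u, sample, dist):
    # rho = mu^2 / (2 sigma^2) over the local histogram of u:
    # a flat histogram (large sigma relative to mu) gives a small rho
    ds = [dist(u, x) for x in sample]
    mu = sum(ds) / len(ds)
    var = sum((d - mu) ** 2 for d in ds) / len(ds)
    return mu * mu / (2.0 * max(var, 1e-12))

def pick_references(db, how_many, sample_size, dist, seed=42):
    rnd = random.Random(seed)
    sample = rnd.sample(db, sample_size)
    return sorted(db, key=lambda u: local_rho(u, sample, dist))[:how_many]

dist = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
db = [tuple(random.Random(i).random() for _ in range(8)) for i in range(500)]
refs = pick_references(db, how_many=4, sample_size=100, dist=dist)
```

As the text goes on to warn, scoring references one by one is not the whole story; the set must also be collectively diverse.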


Fig. 19. Distance evaluations for increasing dimension. We compare the compact partitioning algorithm of Figure 18 using 128 centers per level against the filtration using k pivots for k = 4, 16, 64. On top the search radius captures 0.01% of the set, on the bottom 0.1%.

Selecting good individual references, that is, with a flatter histogram, is not enough to ensure a good performance. For example, a reference close to a good one is probably good as well, but using both of them gives almost the same information as using only one. It is necessary to include a variety of independent viewpoints in the selection. If we consider the selection of references as an optimization problem, the goal is to obtain a set of references that are collectively good for indexing purposes. The set of references is good if it approximates the original distance well. This is more evident in the case of pivoting algorithms, where a contractive distance is implicitly used. A practical way to decide whether our set of references is effective is to compare the histograms of the original distance and the pivot distance, as in Figure 17. A good reference set will give a pivot distance histogram with a large intersection with the histogram of the original distance.

7.5. The Effect on the Discriminative Power

Another reason that explains the curse of dimensionality is related to the decrease in the discriminative power of a partition, because of the odd "shapes" of the classes. As explained before, a nonlocal partition suffers from the problem of being unable to discriminate between points that are actually far away from each other, which leads to unnecessarily high external complexity. The solution is to select enough pivots or to use a compact partitioning method that yields a local partition. However, even in a local partition, the shape of the cell is fixed at indexing time, while the shape of the space region defined as interesting for a query q is a ball dynamically determined at query time. The discriminative power can be visualized as the volume of the query ball divided by the total volume of the classes intersected by that ball.

A good example is that of a ball inside a box in a k-dimensional L2 space. The box is the class obtained after using k orthogonal pivots, and the partition obtained is quite local. Yet the volume of the ball is smaller, and the ratio with respect to the volume of the box (i.e., the discriminative power) decreases exponentially with k. This means that we have to examine a volume (and a number of candidate elements) that, with respect to the size of the final result, grows exponentially with the dimension k. This fact is behind the exponential dependency on the dimension in data structures for vector spaces such as the kd-tree or the R-tree.

The same phenomenon occurs in general metric spaces, where the "shapes" of the cells cannot possibly fit an unknown query ball. We can add more pivots to better bound any possible ball, but this increases the internal complexity and becomes more difficult as the intrinsic dimension grows (the smaller the variance, the more pivots are needed to make a difference).

8. A TAXONOMY OF SEARCH ALGORITHMS

In this section we apply our unifying model to organize all the known approaches in a taxonomy. This helps to identify the essential features of all the existing techniques, to find possible combinations of algorithms not noticed up to now, and to detect which are the most promising areas for optimization.

We first concentrate on pivoting algorithms. They differ in their method to select the pivots, in when the selection is made, and in how much information on the comparisons is used. Later, we consider the compact partitioning algorithms. These differ in their methods to select the centers and to bound the equivalence classes.

Figure 20 summarizes our classification of methods attending to their most important features (explained throughout the section).

8.1. Pivot-Based Algorithms

8.1.1. Search Algorithms. Once we have determined the equivalence relation to use (i.e., the k pivots), we preprocess the dictionary by storing, for each element of U, its k coordinates (i.e., its distances to the k pivots). This takes O(kn) preprocessing time and space overhead.
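As a concrete sketch (ours) of this preprocessing step: the index is just the n × k matrix of distances, and mapping a query costs exactly k evaluations of d.

```python
def build_pivot_table(db, pivots, dist):
    # O(kn) distance evaluations and O(kn) space: row i holds the coordinates of db[i]
    return [[dist(u, p) for p in pivots] for u in db]

def map_query(q, pivots, dist):
    # k evaluations of d: the internal complexity of every query
    return [dist(q, p) for p in pivots]
```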


Fig. 20. Taxonomy of the existing algorithms. The methods in italics are combinations that appear naturally as soon as the taxonomy is built.

The "index" can be seen as a table of n rows and k columns storing d(u_i, p_j).

At query time, we first compare the query q against the k pivots, hence obtaining its k coordinates [q] = (y_1, ..., y_k) in the target space, that is, its equivalence class. The cost of this is k evaluations of the distance function d, which corresponds to the internal complexity of the search. We now have to determine, in the target space, which classes may be relevant to the query (i.e., which ones are at distance r or less in the L∞ metric, which corresponds to the D distance). This does not use further evaluations of d, but it may take extra CPU cost. Finally, the elements belonging to the qualifying classes (i.e., those that cannot be discarded after considering the k pivots) are directly compared against q (external complexity).

The simplest search algorithm proceeds rowwise: consider each element of the set (i.e., each row (x_1, ..., x_k) of the table) and see if the triangle inequality allows discarding that row, that is, whether max_{i=1..k} |x_i − y_i| > r. For each row not discarded using this rule, compare the element directly against q. This is equivalent to traversing the quotient space, using D to discard uninteresting classes.

Although this traversal does not perform more evaluations of d than necessary, it is not the best choice. The reasons are made clear later, as we discover the advantages of alternative approaches. First, notice that the amount of CPU work is O(kn) in the worst case. However, as we abandon a row as soon as we find a difference larger than r along a coordinate, the average case is much closer to O(n) for queries of reasonable selectivity.

The first improvement is to process the set columnwise. That is, we compare the query against the first pivot p_1. Now, we consider the first column of the table and discard all the elements that satisfy |x_1 − y_1| > r. Then, we consider the second pivot p_2 and repeat the process only on the elements not discarded up to now. An algorithm implementing this idea is LAESA.
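Both traversals admit a compact sketch (ours; it reuses the table built in the previous snippet). The rowwise filter abandons a row at the first coordinate that fails, while the columnwise (LAESA-like) variant shrinks one candidate set pivot by pivot; survivors are verified with the real distance.

```python
def range_query_rowwise(db, table, qcoords, q, r, dist):
    result = []
    for i, row in enumerate(table):
        # all() short-circuits: the row is abandoned at the first |x_i - y_i| > r
        if all(abs(x - y) <= r for x, y in zip(row, qcoords)):
            if dist(db[i], q) <= r:            # external complexity
                result.append(db[i])
    return result

def range_query_columnwise(db, table, qcoords, q, r, dist):
    cand = range(len(db))
    for j, yj in enumerate(qcoords):           # one pivot (column) at a time
        cand = [i for i in cand if abs(table[i][j] - yj) <= r]
        if not cand:
            break
    return [db[i] for i in cand if dist(db[i], q) <= r]
```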


It is not hard to see that the amount of evaluations of d and the total CPU work remain the same as for the rowwise case. However, we can do better now, since each column can be sorted so that the range of qualifying rows can be binary instead of sequentially searched [Nene and Nayar 1997; Chávez et al. 1999]. This is possible because we are interested, at column i, in the values [y_i − r, y_i + r]. The extra CPU cost gets closer to O(k log n) than to O(n) by using this technique.

This is not the only improvement allowed by a columnwise evaluation that cannot be done rowwise. A very important one is that it is not necessary to consider all the k coordinates (recall that we have to perform one evaluation of d to obtain each new query coordinate y_i). As soon as the remaining set of candidates is small enough, we can stop considering the remaining coordinates and directly verify the candidates using the d distance. This point is difficult to estimate beforehand: despite the (few) existing theoretical results [Faragó et al. 1993; Baeza-Yates 1997; Baeza-Yates and Navarro 1998], one cannot normally understand the application well enough to predict the actual optimal number of pivots k* (i.e., the point where it is better to switch to exhaustive evaluation).

Another improvement that can be done with columnwise evaluation is that the selection of the pivots can be done on the fly instead of beforehand, as we have presented it. That is, once we have selected the first pivot p_1 and discarded all the uninteresting elements, the second pivot p_2 may depend on which was the result of p_1. However, for each potential pivot p we have to store the coordinates of all the elements of the set for p (or at least some, as we show later). That is, we select k potential pivots and precompute the table as before, but we can choose in which order the pivots are used (according to the current state of the search) and where we stop using pivots and compare directly.

An extreme case of this idea is AESA, where k = n (i.e., all the elements are potential pivots), and the new pivot at each iteration is selected among the remaining elements. Despite its practical inapplicability because of its O(n²) preprocessing time and space overhead (i.e., all the distances among the known elements are precomputed), the algorithm performs a surprisingly low number of distance evaluations, much better than when the pivots are fixed. This shows that it is a good idea to select pivots from the current set of candidates (as discussed in the previous sections).

Finally, we notice that instead of a sequential search in the mapped space, we could use an algorithm to search in vector spaces of k dimensions (e.g., kd-trees or R-trees). Depending on their ability to handle larger k values, we could be able to use more pivots without significantly increasing the extra CPU cost. Recall also that, as more pivots are used, the search structures for vector spaces perform worse. This is a very interesting subject that has not been pursued yet, and that accounts for balancing between distance evaluations and CPU time.
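The sorted-column improvement described above can be sketched as follows (ours; stop_at is an illustrative threshold for switching to direct verification). Each column is sorted once at construction; at query time two binary searches delimit the band [y_j − r, y_j + r], and the scan stops early once few candidates remain.

```python
import bisect

def build_sorted_columns(table):
    # For each pivot j, pairs (coordinate, element id) sorted by coordinate
    k = len(table[0])
    return [sorted((row[j], i) for i, row in enumerate(table)) for j in range(k)]

def filter_candidates(cols, qcoords, r, stop_at=50):
    cand = None
    for j, yj in enumerate(qcoords):
        lo = bisect.bisect_left(cols[j], (yj - r, -1))
        hi = bisect.bisect_right(cols[j], (yj + r, float("inf")))
        band = {i for _, i in cols[j][lo:hi]}   # two O(log n) searches per column
        cand = band if cand is None else cand & band
        if len(cand) <= stop_at:                # early stop: verify directly
            break
    return cand if cand is not None else set()
```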


8.1.2. Coarsening the Equivalence Relation. The alternative of not considering all the k pivots if the remaining set of candidates is small is an example of coarsening the equivalence relation. That is, if we do not consider a given pivot p, we are merging all the classes that differ only in that coordinate. In this case we prefer to coarsen the pivot equivalence relation because computing it with more precision is worse than checking it as is.

There are many other ways to coarsen the equivalence relation, and we cover them here. However, in these cases the coarsening is not done for the sake of reducing the number of distance evaluations, but to improve space usage and precomputation time, as O(kn) can be prohibitively expensive for some applications. Another reason is that, via coarsening, we obtain search algorithms that are sublinear in their extra CPU time. We consider in this section range coarsening, bucket coarsening, and scope coarsening. Their ideas are roughly illustrated in Figure 21.

Fig. 21. Different coarsification methods.

It must be clear that all these types of coarsening reduce the discriminative power of the resulting equivalence classes, making it necessary to exhaustively consider more elements than in the uncoarsened versions of the relations. In the example of the previous section this is amortized by the lower cost to obtain the coarsened equivalence relation. Here we reduce the effectiveness of the algorithms via coarsening, for the sake of reduced preprocessing time and space overhead. However, space reduction may have a counterpart in time efficiency. If we use less space, then with the same amount of memory we can have more pivots (i.e., larger k). This can result in an overall improvement. The fair way to compare algorithms is to give them the same amount of memory to use.

8.1.3. Range Coarsening. The auxiliary data structures proposed by most authors for continuous distance functions are aimed at reducing the amount of space needed to store the coordinates of the elements in the mapped space, as well as the time to find the relevant classes. The most popular form is to reduce the precision of d. This is written as

x ∼_{p,{r_i}} y ⇔ ∃i, r_i ≤ d(x, p) < r_{i+1} and r_i ≤ d(y, p) < r_{i+1},

with {r_i} a partition of the interval [0, ∞). That is, we assign the same equivalence class to elements falling in the same range of distances with respect to the same pivot p. This is obviously a coarsening of the previous relation ∼_p, and it can be naturally extended to more than one pivot.

Figure 2 exemplifies a pivoting equivalence relation where range coarsening is applied, for one pivot. Points in the same ring are in the same equivalence class, despite the fact that their exact distances to the pivot may be different.

A number of actual algorithms use one or another form of this coarsening technique. VPTs and MVPTs divide the distances in slices so that the same number of elements lie in each slice (note that the slices are different for each pivot). VPTs use two slices and MVPTs use many. Their goal is to obtain balanced trees. BKTs, FQTs, and FHQTs, on the other hand, propose range coarsening for continuous distance functions but do not specify how to coarsen.
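Both policies reduce to choosing the slice boundaries {r_i}. A sketch (ours): fixed-width boundaries split the observed range evenly, percentile boundaries put the same number of elements in each slice, and the coarsened coordinate is just the index of the slice a distance falls in.

```python
def fixed_width_bounds(dists, arity):
    lo, hi = min(dists), max(dists)
    w = (hi - lo) / arity or 1.0                  # guard against a degenerate range
    return [lo + w * i for i in range(1, arity)]  # arity - 1 inner boundaries

def percentile_bounds(dists, arity):
    s = sorted(dists)
    return [s[len(s) * i // arity] for i in range(1, arity)]

def slice_index(x, bounds):
    # which ring [r_i, r_{i+1}) the distance x falls in
    i = 0
    while i < len(bounds) and x >= bounds[i]:
        i += 1
    return i
```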


In this work we consider that the "natural" extension of BKTs, FQTs, and FHQTs assigns slices of the same width to each branch, and that the tree has the same arity across all its nodes. At each node, the slice width is recomputed so that the node has the desired arity.

Therefore, we can have slices of fixed width (BKT, FQT, FHQT) or determined by percentiles (VPT, MVPT, FQA). We may have a different pivot per node (BKT, VPT, MVPT) or per level (FQT, FHQT, FQA). Among the last, we can define the slices at each node (FQT, FHQT) or for the whole level (FQA). All these choices are drawn in Table II. The empty slots have been filled with new structures that are now defined.

Table II. Different Options for Range Coarsening
                                                       Fixed Percentiles   Fixed Width
Different pivot per node (scope coarsening)            VPT, MVPT           BKT
Different pivot per level, different slice per node    FMVPT               FQT, FHQT
Different pivot per level, different slice per level   FMVPA (FQA)         FHQA
(The new structures created to fill the empty holes are FHQA, FMVPT, and FMVPA.)

8.1.3.1. FHQA. This is similar to an FQA except that the slices are of fixed width. At each level the slice width is recomputed so that a maximum arity is guaranteed. In the FHQT, instead, each node has a different slice width.

8.1.3.2. FMVPT. This is a cross between an MVPT and an FHQT. The range of values is divided using the m − 1 uniform percentiles to balance the tree, as in MVPTs. The tree has a fixed height h, as in FHQTs. At each node the ranges are recomputed according to the elements lying in that subtree. The particular case where m = 2 is called FHVPT.

8.1.3.3. FMVPA. This is just a new name for the FQA, more appropriate for our discussion since it is to MVPTs as FHQAs are to FHQTs: the FMVPA uses variable-width slices to ensure that the subtrees are balanced in size, but the same slices are used for all the nodes of a single level, so that the balance is only levelwise.

The combinations we have just created allow us to explain some important concepts.

8.1.3.4. Amount of Range Coarsening. Let us consider FHQAs and FMVPAs. They are no more than LAESA with different forms of range coarsening. They use k fixed pivots and b bits to represent the coordinates (i.e., the distances from each point to each of the k pivots), so only 2^b different values can be expressed. The two structures differ only in how they coarsen the distances to put them into 2^b ranges. Their total space requirement is then reduced to kbn bits.

However, range coarsening is not just a technique to reduce space: the same space can be used to accommodate more pivots. It is not immediate how convenient it is to coarsen in order to use more pivots, but it is clear that this technique can improve the overall effectiveness of the algorithm.
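The kbn-bit figure is direct to verify in a sketch (ours): once every column is coarsened to at most 2^b slices, each coordinate becomes a b-bit integer, whatever the original floating-point distances were. An FHQA would pass fixed-width boundaries here and an FMVPA percentile ones.

```python
def quantize_table(table, b, bounds_per_column):
    # bounds_per_column[j]: at most 2^b - 1 slice boundaries for pivot j
    qtable = []
    for row in table:
        qrow = []
        for j, x in enumerate(row):
            bounds = bounds_per_column[j]
            assert len(bounds) < 2 ** b
            qrow.append(sum(1 for t in bounds if x >= t))  # slice index in [0, 2^b)
        qtable.append(qrow)
    return qtable  # k coordinates of b bits each per element: kbn bits in total
```

At query time the filtering must then be conservative: a class can be discarded only when the whole range its slice represents lies outside [y_j − r, y_j + r].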


8.1.3.5. Percentiles Versus Fixed Width. Another unclear issue is whether fixed slices are better or worse than percentile splitting. A balanced data structure has obvious advantages because the internal complexity may be reduced. Fixed slices produce unbalanced structures, since the outer rings have many more elements (especially in high dimensions). On the other hand, in high dimensions the outer rings tend to be too narrow if percentile splitting is used (because a small increment in radius gets many new elements inside the ring). If the rings are too narrow, many rings will be frequently included in the radius of interest of the queries (see Figure 22). An alternative idea is shown in Chávez [1999], where the slices are optimized to minimize the number of branches that must be considered. In this case, each class can be an arbitrary set of slices.

Fig. 22. The same query intersects more rings when using uniform percentiles.

8.1.3.6. Trees Versus Arrays. FHQTs and FMVPTs are almost tree versions of FHQAs and FMVPAs, respectively. They are m-ary trees where all the elements belonging to the same coarsened class are stored in the same subtree. Instead of explicitly storing the coordinates, the trees store them implicitly: the elements are at the leaves, and their paths from the root spell out the coarsened coordinate values. This makes the space requirements closer to O(n) in practice, instead of O(bkn) (although the constant is very low for the array versions, which may actually take less space). Moreover, the search for the relevant elements can be organized using the tree: if all the interesting elements have their first coordinate in the ith ring, then we just traverse the ith subtree. This reduces the extra CPU time. If the distance is too fine-grained, however, the root will have nearly n children and the subtrees will have just one child. The structure will be very similar to a table of k coordinates per element, and the search will degenerate into a linear rowwise traversal. Hence, range coarsening is also a tool to reduce the extra CPU time.

We have given the trees the ability to define the slices at each node instead of at each level, as the array versions do. This allows them to adapt better to the data, but the values of the slices used need more space. It is not clear whether it pays to store all these slice values.

Summarizing, range coarsening can be applied using fixed-width or fixed-percentile slices. It can reduce the space necessary to store the coordinates, which can allow the use of more pivots with the same amount of memory. Therefore, it is not just a technique to reduce space: it can improve the search complexity. Range coarsening can also be used to organize treelike search schemes that are sublinear in extra CPU time.

8.1.4. Bucket Coarsening. To reduce space requirements in the above trees, we can avoid building subtrees that have few elements. Instead, all their elements are stored in a bucket. When the search arrives at a bucket, it has to exhaustively consider all the elements.

This is a form of coarsening, since for the elements in the bucket we do not consider the last pivots, and it resembles the previous idea (Section 8.1.1) of not computing the k pivots. However, in this case the decision is taken offline, at index construction time, and this allows reducing space by not storing those coordinates. In the previous case the decision was taken at search time. The crucial difference is that if the decision is taken at search time, we can know exactly the total amount of exhaustive work implied by not taking further coordinates, whereas in an offline decision we can only consider the search along one branch of the tree, and we cannot predict how many branches will be considered at search time.

This idea is used for discrete distance functions in FQTs, which are similar to FHQTs except for the use of buckets. It has also been applied to continuous setups to reduce space requirements further.

8.1.5. Scope Coarsening. The last and least obvious form of coarsening is the one we call "scope coarsening." In fact, the use of this form of coarsening makes it difficult to notice that many algorithms based on trees are in fact pivoting algorithms.


This coarsening is based on storing only some of the coordinates of the elements, instead of all of them. Hence, comparing the query against some pivots helps to discriminate on some subset of the database only. To use this fact to reduce space, we must determine offline which elements will store their distance to which pivots. There are many ways to use this idea, but it has been used only in the following way.

In FHVPTs there is a single pivot per level of the tree, as in FHQTs. The left and right subtrees of VPTs, on the other hand, use different pivots. That is, if we have to consider both the left and right subtrees (because the radius r does not allow us to completely discard one), then comparing the query q against the left pivot will be useful for the left subtree only. There is no information stored about the distances from the left pivot to the elements of the right subtree, and vice versa. Hence, we have to compare q against both pivots. This continues recursively. The same idea is used for BKTs and MVPTs.

Although at first sight it is clear that we reduce space, this is not the direct way in which the idea is used in those schemes. Instead, they combine it with a huge increase in the number of potential pivots. For each subtree, an element belonging to the subtree is selected as the pivot and deleted from the set. If no bucketing is used, the result is a tree where each element is a node somewhere in the tree and hence a potential pivot. The tree takes O(n) space, which shows that we can successfully combine a large number of pivots with scope coarsening to obtain low space requirements (n instead of n² as in AESA).

The possible advantage (apart from guaranteed linear space and slightly reduced space in practice) of these structures over those that store all the coordinates (as FQTs and FHQTs do) is that the pivots are better suited to the searched elements in each subtree, since they are selected from the same subset. This same property makes AESA a good (although impractical) algorithm. Shapiro [1977] and Bozkaya and Ozsoyoglu [1997] propose hybrids (for BKT and VPT, resp.) where a number of fixed pivots are used at each node, and for each resulting class a new set of pivots is selected. Note that, historically, FQTs and FHQTs are an evolution over BKTs.

8.2. Compact Partitioning Algorithms

All the remaining algorithms (GHTs, BSTs, GNATs, VTs, MTs, SATs, LCs) rely on a hierarchical compact partition of the metric space. A first source of differences is in how the centers are selected at each node. GHTs and BSTs take two elements per level. VTs repeat previous centers when creating new nodes. GNATs select m centers far apart. MTs try to minimize covering radii. SATs select a variable number of close neighbors of the parent node. LC chooses at random one center per cluster.

The main difference, however, lies in the search algorithm. While GHTs use purely the hyperplane criterion, BSTs, VTs, MTs, and LCs use only the covering radius criterion. SATs use both criteria to increase pruning. In all these algorithms the query q is compared against all the centers of the current node and the criteria are used to discard subtrees.

GNATs are a little different, as they use none of the above criteria. Instead, they apply an AESA-like search over the m centers considering their "range" values. That is, a center c_i is selected, and if the query ball does not intersect a ring around c_i that contains all the elements of c_j, then c_j and all its class (subtree) can be safely discarded. In other words, GNATs limit the class of each center by intersecting rings around the other centers. This way of limiting the extent of a class is different from both the hyperplane and the covering radius criteria. It is probably more efficient, but it requires storing O(m²) distances at each level.

Table III summarizes the differences. It is clear that there are many possible combinations that have not been tried, but we do not attempt to enumerate all of them.

Table III. Different Options for Limiting Classes
With hyperplanes   GHT, SAT
With balls         BST, VT, MT, SAT, LC
With rings         GNAT
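A sketch (ours) of the ring pruning just described, with the simplification that centers are examined in a fixed order rather than chosen adaptively: range_table[i][j] stores the interval of d(c_i, ·) over the class of c_j, and a class survives only while the query ball intersects every ring tested against it.

```python
def build_range_table(centers, classes, dist):
    # range_table[i][j] = (min, max) of d(c_i, x) over x in the class of c_j
    return [[(min(dist(c, x) for x in cl), max(dist(c, x) for x in cl))
             for cl in classes] for c in centers]

def surviving_classes(q, centers, range_table, r, dist):
    alive = set(range(len(centers)))
    for i in range(len(centers)):
        if i not in alive:
            continue                      # a discarded center is never used to prune
        dqi = dist(q, centers[i])         # one distance evaluation per used center
        for j in list(alive):
            lo, hi = range_table[i][j]
            if dqi + r < lo or dqi - r > hi:   # the query ball misses the ring
                alive.discard(j)
    return alive
```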


The compact partition is an attempt to obtain local classes, more local than those based on pivots. A general technique to do this is to identify clusters of close objects in the set. There exist many clustering algorithms to build equivalence relations. However, most are defined on vector spaces instead of general metric spaces. An exception is Brito et al. [1996], who report very good results. However, it is not clear that good clustering algorithms directly translate into good algorithms for proximity searching. Another clustering algorithm, based on cliques, is presented in Burkhard and Keller [1973], but the results are similar to those of the simpler BKT. This area is largely unexplored, and developments here could be converted into improved search algorithms.

9. CONCLUSIONS

Metric spaces are becoming a popular model for similarity retrieval in many unrelated areas. We have surveyed the algorithms that index metric spaces to answer proximity queries. We have not just enumerated the existing approaches to discuss their good and bad points. We have, in addition, presented a unified framework that allows understanding the existing approaches under a common view. It turns out that most of the existing algorithms are indeed variations on a few common ideas, and by identifying them, previously unnoticed combinations have naturally appeared. We have also analyzed the main factors that affect the efficiency when searching metric spaces. The main conclusions of our work are summarized as follows.

1. The concept of intrinsic dimensionality can be defined on general metric spaces as an abstract and quantifiable measure that affects the search performance.

2. The main factors that affect the efficiency of the search algorithms are the intrinsic dimensionality of the space and the search radius.

3. Equivalence relations are a common ground underlying all the indexing algorithms, and they divide the search cost in terms of internal and external complexity.

4. A large class of search algorithms relies on taking k pivots and mapping the metric space onto R^k using the L∞ distance. Another important class uses compact partitioning.

5. The equivalence relations can be coarsened to save space or to improve the overall efficiency by making better use of the pivots. A hierarchical refinement of classes can improve performance.

6. Although there is an optimal number of pivots to use, this number is too high in terms of space requirements. In practical terms, a pivot-based index can outperform a compact partitioning index if it has enough memory.

7. As this amount of memory becomes infeasible as the dimension grows, compact partitioning algorithms normally outperform pivot-based ones in high dimensions.

8. In high dimensions the search radius needed to retrieve a fixed percentage of the database is very large. This is the reason for the failure to overcome the brute force search with an exact indexing algorithm.

A number of open issues require further attention. The main ones follow.

—For pivot-based algorithms, understand better the effect of pivot selection, devising methods to choose effective pivots. The subject of the appropriate number of pivots and its relation to the intrinsic dimensionality of the space plays a role here. The histogram of distances may be a good tool for pivot selection.

—For compact partitioning algorithms, work more on clustering schemes in order to select good centers. Find ways to reduce construction times (which are in many cases too high).


—Search for good hybrids between compact partitioning and pivoting algorithms. The first ones cope better with high dimensions and the second ones improve as more memory is given to them. After the space is clustered, the intrinsic dimension of the clusters is smaller, so a top-level clustering structure joined with a pivoting scheme for the clusters is an interesting alternative. Those pivots should be selected from the cluster, because the clusters have high locality.

—Take extra CPU complexity into account, which we have barely considered in this work. In some applications the distance is not so expensive that one can disregard any other type of CPU cost. The use of specialized search structures in the mapped space (especially R^k) and the resulting complexity tradeoff deserves more attention.

—Take I/O costs into account, which may very well dominate the search time in some applications. The only existing work on this is the M-tree [Ciaccia et al. 1997].

—Focus on nearest neighbor search. Most current algorithms for this problem are based on range searching, and despite that the existing heuristics seem difficult to improve, truly independent ways to address the problem could exist.

—Consider approximate and probabilistic algorithms, which may give much better results at a cost that, especially for this problem, seems acceptable. A first promising step is Chávez and Navarro [2001b].

REFERENCES

Apers, P., Blanken, H., and Houtsma, M. 1997. Multimedia Databases in Perspective. Springer-Verlag, New York.
Arya, S., Mount, D., Netanyahu, N., Silverman, R., and Wu, A. 1994. An optimal algorithm for approximate nearest neighbor searching in fixed dimension. In Proceedings of the Fifth ACM–SIAM Symposium on Discrete Algorithms (SODA'94), 573–583.
Aurenhammer, F. 1991. Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Comput. Surv. 23, 3.
Baeza-Yates, R. 1997. Searching: an algorithmic tour. In Encyclopedia of Computer Science and Technology, vol. 37, A. Kent and J. Williams, Eds., Marcel Dekker, New York, 331–359.
Baeza-Yates, R. and Navarro, G. 1998. Fast approximate string matching in a dictionary. In Proceedings of the Fifth South American Symposium on String Processing and Information Retrieval (SPIRE'98), IEEE Computer Science Press, Los Alamitos, Calif., 14–22.
Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley, Reading, Mass.
Baeza-Yates, R., Cunto, W., Manber, U., and Wu, S. 1994. Proximity matching using fixed-queries trees. In Proceedings of the Fifth Combinatorial Pattern Matching (CPM'94), Lecture Notes in Computer Science, vol. 807, 198–212.
Bentley, J. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9, 509–517.
Bentley, J. 1979. Multidimensional binary search trees in database applications. IEEE Trans. Softw. Eng. 5, 4, 333–340.
Bentley, J., Weide, B., and Yao, A. 1980. Optimal expected-time algorithms for closest point problems. ACM Trans. Math. Softw. 6, 4, 563–580.
Berchtold, S., Keim, D., and Kriegel, H. 1996. The X-tree: an index structure for high-dimensional data. In Proceedings of the 22nd Conference on Very Large Databases (VLDB'96), 28–39.
Bhanu, B., Peng, J., and Qing, S. 1998. Learning feature relevance and similarity metrics in image databases. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (Santa Barbara, Calif.), IEEE Computer Society, Los Alamitos, Calif., 14–18.
Bimbo, A. D. and Vicario, E. 1998. Using weighted spatial relationships in retrieval by visual contents. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (Santa Barbara, Calif.), IEEE Computer Society, Los Alamitos, Calif., 35–39.
Böhm, C., Berchtold, S., and Keim, A. 2002. Searching in high-dimensional spaces—index structures for improving the performance of multimedia databases. ACM Comput. Surv. To appear.
Bozkaya, T. and Ozsoyoglu, M. 1997. Distance-based indexing for high-dimensional metric spaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD Rec. 26, 2, 357–368.
Brin, S. 1995. Near neighbor search in large metric spaces. In Proceedings of the 21st Conference on Very Large Databases (VLDB'95), 574–584.
Brito, M., Chávez, E., Quiroz, A., and Yukich, J. 1996. Connectivity of the mutual k-nearest neighbor graph in clustering and outlier detection. Stat. Probab. Lett. 35, 33–42.
Bugnion, E., Fhei, S., Roos, T., Widmayer, P., and Widmer, F. 1993. A spatial index for approximate multiple string matching. In Proceedings of the First South American Workshop on String Processing (WSP'93), R. Baeza-Yates and N. Ziviani, Eds., 43–53.


Burkhard, W. and Keller, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4, 230–236.
Cascia, M. L., Sethi, S., and Sclaroff, S. 1998. Combining textual and visual cues for content-based image retrieval on the world wide web. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (Santa Barbara, Calif.), IEEE Computer Society, Los Alamitos, Calif., 24–28.
Chávez, E. 1999. Optimal discretization for pivot based algorithms. Manuscript. ftp://garota.fismat.umich.mx/pub/users/elchavez/minimax.ps.gz.
Chávez, E. and Marroquín, J. 1997. Proximity queries in metric spaces. In Proceedings of the Fourth South American Workshop on String Processing (WSP'97), R. Baeza-Yates, Ed., Carleton University Press, Ottawa, 21–36.
Chávez, E. and Navarro, G. 2000. An effective clustering algorithm to index high dimensional metric spaces. In Proceedings of the Seventh String Processing and Information Retrieval (SPIRE'00), IEEE CS Press, 75–86.
Chávez, E. and Navarro, G. 2001a. Towards measuring the searching complexity of metric spaces. In Proceedings of the Mexican Computing Meeting, vol. II, 969–972.
Chávez, E. and Navarro, G. 2001b. A probabilistic spell for the curse of dimensionality. In Proceedings of the Third Workshop on Algorithm Engineering and Experimentation (ALENEX'01), Lecture Notes in Computer Science, vol. 2153, 147–160.
Chávez, E. and Navarro, G. 2002. A metric index for approximate string matching. In Proceedings of the Fifth Latin American Symposium on Theoretical Informatics (LATIN'02), Lecture Notes in Computer Science. To appear.
Chávez, E., Marroquín, J., and Baeza-Yates, R. 1999. Spaghettis: an array based algorithm for similarity queries in metric spaces. In Proceedings of String Processing and Information Retrieval (SPIRE'99), IEEE Computer Science Press, Los Alamitos, Calif., 38–46.
Chávez, E., Marroquín, J. L., and Navarro, G. 2001. Fixed queries array: a fast and economical data structure for proximity searching. Multimedia Tools and Applications 14, 2, 113–135. Kluwer.
Chazelle, B. 1994. Computational geometry: a retrospective. In Proceedings of the 26th ACM Symposium on the Theory of Computing (STOC'94), 75–94.
Chiueh, T. 1994. Content-based image indexing. In Proceedings of the Twentieth Conference on Very Large Databases (VLDB'94), 582–593.
Ciaccia, P., Patella, M., and Zezula, P. 1997. M-tree: an efficient access method for similarity search in metric spaces. In Proceedings of the 23rd Conference on Very Large Databases (VLDB'97), 426–435.
Ciaccia, P., Patella, M., and Zezula, P. 1998a. A cost model for similarity queries in metric spaces. In Proceedings of the Seventeenth ACM SIGACT–SIGMOD–SIGART Symposium on Principles of Database Systems (PODS'98).
Ciaccia, P., Patella, M., and Zezula, P. 1998b. Processing complex similarity queries with distance-based access methods. In Proceedings of the Sixth International Conference on Extending Database Technology (EDBT'98).
Clarkson, K. 1999. Nearest neighbor queries in metric spaces. Discrete Comput. Geom. 22, 1, 63–93.
Cox, T. and Cox, M. 1994. Multidimensional Scaling. Chapman and Hall, London.
Dehne, F. and Noltemeier, H. 1987. Voronoi trees and clustering problems. Inf. Syst. 12, 2, 171–175.
Devroye, L. 1987. A Course in Density Estimation. Birkhäuser, Boston.
Duda, R. and Hart, P. 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Faloutsos, C. and Kamel, I. 1994. Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension. In Proceedings of the Thirteenth ACM Symposium on Principles of Database Principles (PODS'94), 4–13.
Faloutsos, C. and Lin, K. 1995. Fastmap: a fast algorithm for indexing, data mining and visualization of traditional and multimedia datasets. ACM SIGMOD Rec. 24, 2, 163–174.
Faloutsos, C., Equitz, W., Flickner, M., Niblack, W., Petkovic, D., and Barber, R. 1994. Efficient and effective querying by image content. J. Intell. Inf. Syst. 3, 3/4, 231–262.
Faragó, A., Linder, T., and Lugosi, G. 1993. Fast nearest-neighbor search in dissimilarity spaces. IEEE Trans. Patt. Anal. Mach. Intell. 15, 9, 957–962.
Frakes, W. and Baeza-Yates, R., Eds. 1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, N.J.
Gaede, V. and Günther, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170–231.
Guttman, A. 1984. R-trees: a dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 47–57.
Hair, J., Anderson, R., Tatham, R., and Black, W. 1995. Multivariate Data Analysis with Readings, 4th ed. Prentice-Hall, Englewood Cliffs, N.J.
Hjaltason, G. and Samet, H. 1995. Ranking in spatial databases. In Proceedings of the Fourth International Symposium on Large Spatial Databases, 83–95.
Jain, A. and Dubes, R. 1988. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, N.J.


Kalantari, I. and McDonald, G. 1983. A data structure and an algorithm for the nearest point problem. IEEE Trans. Softw. Eng. 9, 5.
Mehlhorn, K. 1984. Data Structures and Algorithms, Volume III—Multidimensional Searching and Computational Geometry. Springer-Verlag, New York.
Micó, L., Oncina, J., and Carrasco, R. 1996. A fast branch and bound nearest neighbour classifier in metric spaces. Patt. Recogn. Lett. 17, 731–739.
Micó, L., Oncina, J., and Vidal, E. 1994. A new version of the nearest-neighbor approximating and eliminating search (AESA) with linear preprocessing-time and memory requirements. Patt. Recogn. Lett. 15, 9–17.
Navarro, G. 1999. Searching in metric spaces by spatial approximation. In Proceedings of String Processing and Information Retrieval (SPIRE'99), IEEE Computer Science Press, Los Alamitos, Calif., 141–148.
Navarro, G. and Reyes, N. 2001. Dynamic spatial approximation trees. In Proceedings of the Eleventh Conference of the Chilean Computer Science Society (SCCC'01), IEEE CS Press, 213–222.
Nene, S. and Nayar, S. 1997. A simple algorithm for nearest neighbor search in high dimensions. IEEE Trans. Patt. Anal. Mach. Intell. 19, 9, 989–1003.
Nievergelt, J. and Hinterberger, H. 1984. The grid file: an adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1, 38–71.
Noltemeier, H. 1989. Voronoi trees and applications. In Proceedings of the International Workshop on Discrete Algorithms and Complexity (Fukuoka, Japan), 69–74.
Noltemeier, H., Verbarg, K., and Zirkelbach, C. 1992. Monotonous bisector* trees—a tool for efficient partitioning of complex scenes of geometric objects. In Data Structures and Efficient Algorithms, Lecture Notes in Computer Science, vol. 594, Springer-Verlag, New York, 186–203.
Prabhakar, S., Agrawal, D., and Abbadi, A. E. 1998. Efficient disk allocation for fast similarity searching. In Proceedings of ACM SPAA'98 (Puerto Vallarta, Mexico).
Roussopoulos, N., Kelley, S., and Vincent, F. 1995. Nearest neighbor queries. In Proceedings of ACM SIGMOD'95, 71–79.
Salton, G. and McGill, M. 1983. Introduction to Modern Information Retrieval. McGraw-Hill, New York.
Samet, H. 1984. The quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 2, 187–260.
Sankoff, D. and Kruskal, J., Eds. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Mass.
Sasha, D. and Wang, T. 1990. New techniques for best-match retrieval. ACM Trans. Inf. Syst. 8, 2, 140–158.
Shapiro, M. 1977. The choice of reference points in best-match file searching. Commun. ACM 20, 5, 339–343.
Sutton, R. and Barto, A. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Mass.
Uhlmann, J. 1991a. Implementing metric trees to satisfy general proximity/similarity queries. Manuscript.
Uhlmann, J. 1991b. Satisfying general proximity/similarity queries with metric trees. Inf. Proc. Lett. 40, 175–179.
Verbarg, K. 1995. The C-Tree: a dynamically balanced spatial index. Comput. Suppl. 10, 323–340.
Vidal, E. 1986. An algorithm for finding nearest neighbors in (approximately) constant average time. Patt. Recogn. Lett. 4, 145–157.
Waterman, M. 1995. Introduction to Computational Biology. Chapman and Hall, London.
Weber, R., Schek, H.-J., and Blott, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Databases (VLDB'98).
White, D. and Jain, R. 1996. Algorithms and strategies for similarity retrieval. Tech. Rep. VCL-96-101 (July), Visual Computing Laboratory, University of California, La Jolla.
Yao, A. 1990. Computational Geometry, J. Van Leeuwen, Ed. Elsevier Science, New York, 345–380.
Yianilos, P. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth ACM–SIAM Symposium on Discrete Algorithms (SODA'93), 311–321.
Yianilos, P. 1999. Excluded middle vantage point forests for nearest neighbor search. In DIMACS Implementation Challenge, ALENEX'99 (Baltimore, Md).
Yianilos, P. 2000. Locally lifting the curse of dimensionality for nearest neighbor search. In Proceedings of the Eleventh ACM–SIAM Symposium on Discrete Algorithms (SODA'00), 361–370.
Yoshitaka, A. and Ichikawa, T. 1999. A survey on content-based retrieval for multimedia databases. IEEE Trans. Knowl. Data Eng. 11, 1, 81–93.

Received July 1999; revised March 2000; accepted December 2000
