Fast Nearest Neighbor Search in High-Dimensional Space
14th International Conference on Data Engineering (ICDE'98), February 23-27, 1998, Orlando, Florida.

Stefan Berchtold†, Bernhard Ertl, Daniel A. Keim‡, Hans-Peter Kriegel, Thomas Seidl
Institute for Computer Science, University of Munich, Oettingenstr. 67, 80538 Munich, Germany
{berchtol, ertl, keim, kriegel, seidl}@dbs.informatik.uni-muenchen.de
† current address: AT&T Labs Research, [email protected]
‡ current address: University of Halle-Wittenberg, [email protected]

Abstract

Similarity search in multimedia databases requires efficient support of nearest-neighbor search on a large set of high-dimensional points as a basic operation for query processing. As recent theoretical results show, state-of-the-art approaches to nearest-neighbor search are not efficient in higher dimensions. In our new approach, we therefore precompute the result of any nearest-neighbor search, which corresponds to a computation of the Voronoi cell of each data point. In a second step, we store the Voronoi cells in an index structure efficient for high-dimensional data spaces. As a result, nearest-neighbor search corresponds to a simple point query on the index structure. Although our technique is based on a precomputation of the solution space, it is dynamic, i.e., it supports insertions of new data points. An extensive experimental evaluation of our technique demonstrates the high efficiency for uniformly distributed as well as real data. We obtained a significant reduction of the search time compared to nearest-neighbor search in the X-tree (up to a factor of 4).

1 Introduction

An important research issue in the field of multimedia databases is the content-based retrieval of similar multimedia objects such as images, text, and videos [Alt+ 90] [Fal+ 94] [Jag 91] [MG 93] [SBK 92] [SH 94]. However, in contrast to searching data in a relational database, content-based retrieval requires the search for similar objects as a basic functionality of the database system. Most of the approaches addressing similarity search use a so-called feature transformation, which transforms important properties of the multimedia objects into high-dimensional points (feature vectors). Thus, the similarity search corresponds to a search for points in the feature space which are close to a given query point and, therefore, corresponds to a nearest-neighbor search. Up to now, a lot of research has been done in the field of nearest-neighbor search in high-dimensional spaces [Ary 95] [Ber+ 97] [HS 95] [PM 97] [RKV 95].

Most of the existing approaches to the nearest-neighbor problem perform a search on an a priori built index, expanding the neighborhood around the query point until the desired closest point is reached. However, as recent theoretical results [BBKK 97] show, such index-based approaches must access a large portion of the data points in higher dimensions. Therefore, searching an index by expanding the query region is, in general, inefficient in high dimensions. One way out of this dilemma is exploiting parallelism for an efficient nearest-neighbor search, as we did in [Ber+ 97].

In this paper, we suggest a new solution to sequential nearest-neighbor search which is based on precalculating and indexing the solution space instead of indexing the data. The solution space may be characterized by a complete and overlap-free partitioning of the data space into cells, each containing exactly one data point. Each cell consists of all potential query points which have the corresponding data point as a nearest neighbor. The cells therefore correspond to the d-dimensional Voronoi cells [PS 85]. Determining the nearest neighbor of a query point now becomes equivalent to determining the Voronoi cell in which the query point is located. Since the Voronoi cells may be rather complex high-dimensional polyhedra which require too much disk space when stored explicitly, we approximate the cells by minimum bounding (hyper-)rectangles and store them in a multidimensional index structure such as the X-tree [BKK 96]. The nearest-neighbor query now becomes a simple point query which can be processed efficiently using the multidimensional index. In order to obtain a good approximation quality for high-dimensional cells, we additionally introduce a new decomposition technique for high-dimensional spatial objects.

The paper is organized as follows: In section 2, we introduce our new solution to the nearest-neighbor problem, which is based on approximating the solution space. We formally define the solution space as well as the necessary cell approximations and outline an efficient algorithm for determining the high-dimensional cell approximations. In section 3, we then discuss the problems related to indexing the high-dimensional cell approximations and introduce our solution, which is based on a new decomposition of the approximations. In section 4, we present an experimental evaluation of our new approach using uniformly distributed as well as real data. The evaluation unveils significant speed-ups over the R*-tree- and X-tree-based nearest-neighbor search.

2 Approximating the Solution Space

Our new approach to solving the nearest-neighbor problem is based on precalculating the solution space. Precalculating the solution space means determining the Voronoi diagram (cf. figure 1a) of the data points in the database. In the following, we recall the definition of Voronoi cells as provided in [Roo 91].

Definition 1. (Voronoi Cell, Voronoi Diagram)
Let DB be a database of points. For any subset A ⊆ DB of size m := |A|, 1 ≤ m < N, and a given distance function d: ℝ^d × ℝ^d → ℝ₀⁺, the order-m Voronoi cell of A is defined as

  VoronoiCell(A) := { x ∈ ℝ^d | ∀ P_i ∈ A ∀ P_j ∈ DB \ A : d(x, P_i) ≤ d(x, P_j) }.

The order-m Voronoi diagram of DB is defined as

  VoronoiDiagram_m(DB) := { VoronoiCell(A) | A ⊂ DB ∧ |A| = m }.

Note that we are primarily interested in the nearest neighbor of a query point, and that we assume the Voronoi cells to be bounded by the data space (DS). Therefore, in the following we only have to consider bounded Voronoi cells of order 1, which are also called NN-cells (cf. figure 1b).

[Figure 1: Voronoi diagram and NN-diagram. a. Voronoi diagram of order 2 (cf. [PS 85]); b. NN-diagram.]

Definition 2. (NN-Cell, NN-Diagram)
For any point P ∈ DB and a given distance function d: ℝ^d × ℝ^d → ℝ₀⁺, the NN-cell of P is defined as

  NN-Cell(P) := { x ∈ DS | ∀ P' ∈ DB \ {P} : d(x, P) ≤ d(x, P') }.

The NN-diagram of a database of points DB is defined as

  NN-Diagram(DB) := { NN-Cell(P) | P ∈ DB }.

According to Definition 2, the sum of the volumes of all NN-cells is the volume of the data space (cf. figure 1b):

  ∑_{i=1}^{N} Vol(NN-Cell_i) = Vol(DS).

If we are able to efficiently determine the NN-cells, to explicitly store them, and to directly find the NN-cell which contains a given query point, then the costly nearest-neighbor query could be executed by one access to the corresponding NN-cell. In general, however, determining the NN-cells is rather time consuming and requires (at least) Ω(N log N) for d ∈ {1, 2} and Ω(N^⌈d/2⌉) for d ≥ 3 in the worst case [PS 85] for a Euclidean distance function. In addition, the space requirements for the NN-cells (the number of k-faces of the NN-diagram) are O(N^{min(d+1−k, ⌈d/2⌉)}) for 0 ≤ k ≤ d−1 in the worst case [Ede 87], making it impossible to store them explicitly. For a practicable solution, it is therefore necessary to use approximations of the NN-cells, which is a well-known technique that has been successfully used for improving query processing in the context of geographical databases [BKS 93]. In principle, any approximation such as (hyper-)rectangles, rotated (hyper-)rectangles, spheres, ellipsoids, etc. may be employed. In our application, we use an approximation of the NN-cells by minimum bounding (hyper-)rectangles and store them in a multidimensional index structure such as the X-tree [BKK 96]. The nearest-neighbor query can then be executed by a simple and very efficient point query on this index. In the following, we define the approximation of the NN-cells.

Definition 3. (MBR-Approximation of NN-Cells)
The MBR approximation Appr_MBR of a nearest-neighbor cell NNC is the minimum bounding (hyper-)rectangle MBR = (l_1, h_1, ..., l_d, h_d) of NNC, i.e., for i = 1, ..., d:

  l_i = min{ p_i | p ∈ NNC }  and  h_i = max{ p_i | p ∈ NNC }.

Let us now consider examples of the NN-cells and their MBR-approximations for a number of different data distributions. Figures 2a and b show the NN-diagram and the corresponding approximation diagram for two independent uniformly distributed dimensions, figures 2c and d show the two diagrams for a regular multidimensional uniform distribution, and figures 2e and f show the diagrams for a sparse distribution. A uniform distribution is usually generated by using a random number generator to produce the data values for each of the dimensions independently. This generation process produces a data set which - projected onto each of the dimension axes - provides a uniform distribution. It does not mean, however, that the data is distributed uniformly in multidimensional space, i.e., for a partitioning of the data space into cells of equal size, that each
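The overall scheme of section 2 can be illustrated with a small, self-contained sketch. Note that this is not the paper's algorithm: the authors use an efficient exact computation of the cell approximations and an X-tree index, whereas the sketch below naively approximates each NN-cell's MBR by Monte Carlo sampling in the unit data space and replaces the index lookup by a linear scan over the stored MBRs; all names (`nn_cell_mbrs`, `nn_query`) are ours.

```python
import random

def nn_cell_mbrs(points, n_samples=20000, seed=0):
    """Approximate the MBR (Definition 3) of each point's NN-cell by
    Monte Carlo sampling: draw random query points in DS = [0,1]^d,
    assign each to its nearest data point (i.e., to the NN-cell it lies
    in, Definition 2), and take the bounding box of the samples per cell.
    A naive stand-in for the paper's exact cell-approximation algorithm."""
    rng = random.Random(seed)
    d = len(points[0])
    mbrs = [[[1.0] * d, [0.0] * d] for _ in points]  # [lows, highs] per cell
    for _ in range(n_samples):
        x = [rng.random() for _ in range(d)]
        # x lies in the NN-cell of its nearest data point
        i = min(range(len(points)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, points[j])))
        lo, hi = mbrs[i]
        for k in range(d):
            lo[k] = min(lo[k], x[k])
            hi[k] = max(hi[k], x[k])
    return mbrs

def nn_query(q, points, mbrs):
    """Nearest-neighbor search as a point query: collect the cells whose
    MBR contains q (the MBRs overlap, so several may qualify) and refine
    the candidates by their exact distance to q."""
    cand = [i for i, (lo, hi) in enumerate(mbrs)
            if all(l <= qk <= h for qk, l, h in zip(q, lo, hi))]
    if not cand:  # q missed every sampled box: fall back to all cells
        cand = range(len(points))
    return min(cand, key=lambda i: sum((a - b) ** 2 for a, b in zip(q, points[i])))

points = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.9)]
mbrs = nn_cell_mbrs(points)
print(nn_query((0.25, 0.3), points, mbrs))  # index of the nearest data point
```

The refinement step matters because MBRs over-approximate the cells: a query point may fall into several boxes, and only the exact distances to the candidate data points decide the result.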