Fast Algorithm for Nearest Neighbor Search Based on a Lower Bound Tree
In Proceedings of the 8th International Conference on Computer Vision, Vancouver, Canada, Jul. 2001, Volume 1, pages 446–453

Yong-Sheng Chen†‡    Yi-Ping Hung†‡    Chiou-Shann Fuh‡
†Institute of Information Science, Academia Sinica, Taipei, Taiwan
‡Dept. of Computer Science and Information Engineering, National Taiwan University, Taiwan
Email: [email protected]

Abstract

This paper presents a novel algorithm for fast nearest neighbor search. At the preprocessing stage, the proposed algorithm constructs a lower bound tree by agglomeratively clustering the sample points in the database. Calculation of the distance between the query and a sample point can be avoided if the lower bound of the distance is already larger than the minimum distance. The search process can thus be accelerated because the computational cost of the lower bound, which can be calculated by using the internal nodes of the lower bound tree, is less than that of the distance. To reduce the number of lower bounds actually calculated, the winner-update search strategy is used for traversing the tree. Moreover, the query and the sample points can be transformed for further efficiency improvement. Our experiments show that the proposed algorithm can greatly speed up the nearest neighbor search process. When applied to the real database used in Nayar's object recognition system, the proposed algorithm is about one thousand times faster than the exhaustive search.

1. Introduction

Nearest neighbor search is very useful in recognizing objects [12], matching stereo images [15], classifying patterns [9], compressing images [10], and retrieving information in database systems [7]. In general, given a fixed data set P which consists of s sample points in a d-dimensional space, that is, P = {p_i ∈ R^d | i = 1, ..., s}, preprocessing can be performed to construct a particular data structure. For each query point q, the goal of the nearest neighbor search is to find in P the point closest to q. The straightforward way to find the nearest neighbor is to exhaustively compute and compare the distances from the query point to all the sample points. The computational complexity of this exhaustive search is O(s · d). When s, d, or both are large, this process becomes very computation-intensive.

In the past, Bentley [2] proposed the k-dimensional binary search tree to speed up the nearest neighbor search. This method is very efficient when the dimension of the data space is small. However, as reported in [13] and [3], its performance degrades exponentially with increasing dimension. Fukunaga and Narendra [8] constructed a tree structure by repeatedly dividing the set of sample points into subsets. Given a query point, they used a branch-and-bound search strategy to efficiently find the closest point by using the tree structure. Djouadi and Bouktache [5] partitioned the underlying space of the sample points into a set of cells. A few cells located in the vicinity of the query point can be determined by calculating the distances between the query point and the centers of the cells; the nearest neighbor can then be found by searching only in these neighboring cells, instead of the whole space. Nene and Nayar [13] proposed a very fast algorithm to search for the nearest neighbor within a pre-specified distance threshold. Proceeding from the first dimension to the last, their method excludes the sample points whose distance to the query point along the current dimension is larger than the threshold; the nearest neighbor can then be determined by examining the remaining candidates. Lee and Chae [11] proposed another elimination-based method: based on the triangle inequality, they used a number of anchor sample points to eliminate many distance calculations. Instead of finding the exact nearest neighbor, Arya et al. [1] proposed a fast algorithm which can find an approximate nearest neighbor within a factor of (1 + r) of the distance between the query point and its exact nearest neighbor.

In this paper, we present a novel algorithm which can efficiently search for the exact nearest neighbor in Euclidean space. At the preprocessing stage, the proposed algorithm constructs a lower bound tree (LB-tree), in which each leaf node represents a sample point and each internal node represents a mean point in a space of smaller dimension. Given a query point, the lower bound of its distance to each sample point can be calculated by using the mean point of an internal node in the LB-tree. Calculation of the distance between the query and a sample point can be avoided if the lower bound of the distance is already larger than the minimum distance between the query point and its nearest neighbor. Because the computational cost of the lower bound is less than that of the distance, the whole search process can be accelerated.

[Figure 1. An example of the 4-level structure of the point p, where p ∈ R^8: p^0 = [p_1], p^1 = [p_1, p_2], p^2 = [p_1, ..., p_4], and p^3 = [p_1, ..., p_8].]

Furthermore, we adopt the following three techniques to reduce the number of lower bounds actually calculated. The first is the winner-update search strategy [4] for traversing the LB-tree, which reduces the number of nodes examined: starting from the root node of the LB-tree, the node having the minimum lower bound is chosen, and its children join the competition after their lower bounds have been calculated. The second is the agglomerative clustering technique for LB-tree construction. The major advantage of this technique is that it can keep the number of internal nodes as small as possible while keeping the lower bounds as tight as possible.
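The winner-update strategy can be sketched with a priority queue keyed by lower bound. The Python below is an illustration, not the paper's implementation: in place of the LB-tree's projection-based bound it uses a simpler triangle-inequality bound (distance to a group mean minus the group's covering radius, in the spirit of Fukunaga and Narendra [8]), which is also a valid lower bound; all names are hypothetical.

```python
import heapq
import math
from itertools import count

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_groups(points, size=4):
    # Toy one-level "tree": fixed-size groups, each summarized by its mean
    # and covering radius r = max distance from the mean to any member.
    groups = []
    for i in range(0, len(points), size):
        g = points[i:i + size]
        m = [sum(c) / len(g) for c in zip(*g)]
        groups.append((m, max(dist(m, p) for p in g), g))
    return groups

def winner_update_nn(groups, q):
    # Min-heap of (lower bound, tiebreak, leaf?, payload). For a group,
    # dist(q, m) - r underestimates the distance from q to every member
    # (triangle inequality); for a leaf the "bound" is the exact distance.
    tick = count()
    heap = [(max(0.0, dist(q, m) - r), next(tick), False, g)
            for m, r, g in groups]
    heapq.heapify(heap)
    while heap:
        lb, _, is_leaf, item = heapq.heappop(heap)
        if is_leaf:
            # This winner's exact distance is no larger than any remaining
            # lower bound, so it is the nearest neighbor.
            return item, lb
        for p in item:  # children join the competition
            heapq.heappush(heap, (dist(q, p), next(tick), True, p))
```

Only groups whose lower bound is smaller than the eventual winner's distance are ever expanded, which is where the savings come from.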
The last technique we adopt is data transformation. By applying a data transformation to each point, the lower bound of an internal node can be further tightened, thus saving more computation. We use two kinds of data transformation in this work: the wavelet transform and principal component analysis.

Our experiments show that the proposed algorithm can save substantial computation in the nearest neighbor search, in particular when the distance from the query point to its nearest neighbor is relatively small compared with its distances to most other samples.

2. Multilevel structure and LB-tree

This section introduces the LB-tree, the essential data structure used in the proposed nearest neighbor search algorithm. We first describe the multilevel structure of a data point; the multilevel structures of all the sample points can then be used for the LB-tree construction.

2.1. Multilevel structure of each point

For a point p = [p_1, p_2, ..., p_d] in a d-dimensional Euclidean space, R^d, we define its multilevel structure, denoted by {p^0, p^1, ..., p^L}, in the following way. At each level l, p^l comprises the first d_l dimensions of the point p, where the integer d_l satisfies 1 ≤ d_l ≤ d, l = 0, ..., L. That is, p^l = [p_1, p_2, ..., p_{d_l}], which will be referred to as the level-l projection of p.

In this paper, we assume that the dimension of the underlying space, d, is equal to 2^L without loss of generality; zero padding can be used to enlarge the dimension if d is not a power of 2. In the multilevel structure, the dimension at level l is set to d_l = 2^l. In this way, an (L + 1)-level structure of triangle shape can be constructed for the point p. Notice that the level-L projection, p^L, is the same as the point p. Figure 1 illustrates an example of the 4-level structure, {p^0, ..., p^3}, where d = 8.

Given the multilevel structures of two points p and q, we can derive the following inequality property.

Property 1  The Euclidean distance between p and q is larger than or equal to the Euclidean distance between their level-l projections p^l and q^l for each level l. That is,

    ||p − q||_2 ≥ ||p^l − q^l||_2,  l = 0, ..., L.

From Property 1, the distance ||p^l − q^l||_2 between the level-l projections can be considered as a lower bound of the distance ||p − q||_2 between the points p and q. Notice that the computational cost of the distance ||p^l − q^l||_2 is less than that of the distance ||p − q||_2. To be specific, the complexity of calculating the distance between level-l projections rises from O(2^0) to O(2^L) as l varies from 0 to L.

2.2. LB-tree for the data set

To construct an LB-tree, we use the multilevel structures of all the sample points p_i, i = 1, ..., s, in the data set P. Suppose the multilevel structure has L + 1 levels, that is, from level 0 to level L. Then the LB-tree also has L + 1 levels, not counting the dummy root node, which has zero dimension. Each leaf node at level L in the LB-tree contains a level-L projection p_i^L, which is exactly the same as the sample point p_i. The level-l projections p_i^l, i = 1, ..., s, of all the sample points can be hierarchically clustered from level 0 to level L − 1, as illustrated in Figure 2. More discussion of the LB-tree construction will be given in Section 4.

Let ⟨p⟩ denote the node containing the point p in the LB-tree.
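Property 1 holds because appending coordinates can only add nonnegative terms to the squared distance, so the prefix distances grow monotonically toward the full distance. A small self-contained check in Python (the point values are illustrative, not from the paper):

```python
import math

def projection(p, l):
    # Level-l projection: the first 2**l coordinates of p (assuming d = 2**L).
    return p[:2 ** l]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical points in R^8 (so L = 3), mirroring Figure 1.
p = [4.0, 1.0, 3.0, 2.0, 5.0, 0.0, 1.0, 2.0]
q = [1.0, 2.0, 0.0, 2.0, 4.0, 1.0, 3.0, 0.0]

L = 3
bounds = [dist(projection(p, l), projection(q, l)) for l in range(L + 1)]
# Each prefix distance lower-bounds the next, and the level-L "bound" is
# the exact distance ||p - q||_2; computing level l costs only O(2**l).
```

This monotonicity is exactly what lets the search stop evaluating a candidate as soon as some cheap low-level bound already exceeds the best distance found so far.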