Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions
Parikshit Ram, Dongryeol Lee, Hua Ouyang and Alexander G. Gray
Computational Science and Engineering, Georgia Institute of Technology
Atlanta, GA 30332
{p.ram@, dongryel@cc., houyang@, agray@cc.}gatech.edu

Abstract

The long-standing problem of efficient nearest-neighbor (NN) search has ubiquitous applications ranging from astrophysics to MP3 fingerprinting to bioinformatics to movie recommendations. As the dimensionality of the dataset increases, exact NN search becomes computationally prohibitive; (1+ε) distance-approximate NN search can provide large speedups but risks losing the meaning of NN search present in the ranks (ordering) of the distances. This paper presents a simple, practical algorithm allowing the user to, for the first time, directly control the true accuracy of NN search (in terms of ranks) while still achieving large speedups over exact NN search. Experiments on high-dimensional datasets show that our algorithm often achieves faster and more accurate results than the best-known distance-approximate method, with much more stable behavior.

1 Introduction

In this paper, we address the problem of nearest-neighbor (NN) search in large datasets of high dimensionality. NN search is used for classification (the k-NN classifier [1]), categorizing a test point on the basis of the classes in its close neighborhood. Non-parametric density estimation uses NN algorithms when the bandwidth at any point depends on the k-th NN distance (NN kernel density estimation [2]). NN algorithms are present in, and are often the main cost of, most non-linear dimensionality reduction techniques (manifold learning [3, 4]), which obtain the neighborhood of every point and then preserve it during the dimension reduction. NN search also has extensive applications in databases [5] and in computer vision for image search. Further applications abound in machine learning.

Tree data structures such as kd-trees are used for efficient exact NN search but do not scale better than the naïve linear search in sufficiently high dimensions. Distance-approximate NN (DANN) search, introduced to increase the scalability of NN search, approximates the distance to the NN; any neighbor found within that distance is considered "good enough". Numerous techniques exist to achieve this form of approximation, and they are fairly scalable to higher dimensions under certain assumptions.

Although DANN search places bounds on the numerical value of the distance to the NN, distances themselves are not essential in NN search; rather, the order of the distances from the query to the points in the dataset captures the necessary and sufficient information [6, 7]. For example, consider the two-dimensional dataset (1, 1), (2, 2), (3, 3), (4, 4), ... with a query at the origin. Appending non-informative dimensions to each of the reference points produces higher dimensional datasets of the form (1, 1, 1, 1, 1, ...), (2, 2, 1, 1, 1, ...), (3, 3, 1, 1, 1, ...), (4, 4, 1, 1, 1, ...), .... For a fixed distance approximation, raising the dimension increases the number of points for which the distance to the query (i.e. the origin) satisfies the approximation condition. However, the ordering (and hence the ranks) of those distances remains the same.
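As a concrete illustration of this example (our own sketch, not part of the paper; it assumes NumPy, and the function names are ours), the following snippet pads the reference points with non-informative dimensions and counts how many points satisfy a fixed (1+ε) distance-approximation condition, while also recording the ranking of the distances:

```python
import numpy as np

# Illustrative sketch: reproduce the introduction's example. Reference points
# (i, i) are padded with non-informative dimensions fixed at 1; the query is
# the origin of the corresponding dimension.
def padded_dataset(n_points, extra_dims):
    base = np.array([[i, i] for i in range(1, n_points + 1)], dtype=float)
    pad = np.ones((n_points, extra_dims))
    return np.hstack([base, pad])

def approximation_count_and_ranks(data, query, epsilon):
    # Count points whose distance satisfies the (1 + epsilon) distance-
    # approximation condition relative to the true NN distance, and return
    # the ordering (ranks) of the distances.
    dists = np.linalg.norm(data - query, axis=1)
    return np.sum(dists <= (1 + epsilon) * dists.min()), np.argsort(dists)

epsilon = 0.5
for extra_dims in [0, 10, 100, 1000]:
    data = padded_dataset(10, extra_dims)
    query = np.zeros(data.shape[1])
    n_acceptable, ranking = approximation_count_and_ranks(data, query, epsilon)
    print(extra_dims, n_acceptable, ranking[:3])
# The number of "acceptable" neighbors grows as dimensions are added, but the
# ordering of the points (argsort of the distances) never changes.
```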
The proposed framework, rank-approximate nearest-neighbor (RANN) search, approximates the NN in its rank rather than in its distance, thereby making the approximation independent of the distance distribution and dependent only on the ordering of the distances.

This paper is organized as follows: Section 2 describes the existing methods for exact NN and DANN search and the challenges they face in high dimensions. Section 3 introduces the proposed approach and provides a practical algorithm using stratified sampling with a tree data structure to obtain a user-specified level of rank approximation in Euclidean NN search. Section 4 reports the experiments comparing RANN with exact search and DANN. Finally, Section 5 concludes with a discussion of the road ahead.

2 Related Work

The problem of NN search is formalized as follows:

Problem. Given a dataset S ⊂ X of size N in a metric space (X, d) and a query q ∈ X, efficiently find a point p ∈ S such that

    d(p, q) = min_{r ∈ S} d(r, q).    (1)

2.1 Exact Search

The simplest approach, linear search over S to find the NN, is easy to implement, but it requires O(N) computations for a single NN query, making it unscalable for moderately large N. Hashing the dataset into buckets is an efficient technique, but it scales only to very low dimensional X. Hence data structures are used to answer queries efficiently. Binary spatial partitioning trees, like kd-trees [9], ball trees [10] and metric trees [11], utilize the triangle inequality of the distance metric d (commonly the Euclidean metric) to prune away parts of the dataset from the computation and answer queries in expected O(log N) computations [9]. Non-binary cover trees [12] answer queries in theoretically bounded O(log N) time using the same property under certain mild assumptions on the dataset. Finding NNs for O(N) queries would then require at least O(N log N) computations using the trees. The dual-tree algorithm [13] for NN search also builds a tree on the queries instead of going through them linearly, hence amortizing the cost of search over the queries. This algorithm shows orders of magnitude improvement in efficiency and is conjectured to be O(N) for answering O(N) queries using cover trees [12].

2.2 Nearest Neighbors in High Dimensions

The frontier of research in NN methods is high dimensional problems, stemming from common datasets ranging from images and documents to microarray data. But high dimensional data poses an inherent problem for Euclidean NN search, as described in the following theorem:

Theorem 2.1. [8] Let C be a D-dimensional hypersphere with radius a. Let A and B be any two points chosen at random in C, the distributions of A and B being independent and uniform over the interior of C. Let r be the Euclidean distance between A and B (r ∈ [0, 2a]). Then the asymptotic distribution of r is N(a√2, a²/2D).

This implies that in high dimensions, the Euclidean distances between uniformly distributed points lie in a small range of continuous values. This suggests that tree-based algorithms perform no better than linear search, since these data structures would be unable to employ sufficiently tight bounds in high dimensions. This turns out to be true in practice [14, 15, 16], which prompted interest in approximating the NN search problem.
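A small simulation (our own illustration, assuming NumPy; the theorem itself is an asymptotic statement and the helper names are ours) makes this concentration effect concrete by sampling point pairs uniformly from the interior of a D-dimensional ball and comparing the empirical distance statistics to a√2 and a/√(2D):

```python
import numpy as np

# Illustrative sketch: empirically check the distance concentration predicted
# by Theorem 2.1.
rng = np.random.default_rng(0)

def uniform_in_ball(n, dim, radius):
    # Uniform samples from the interior of a dim-dimensional ball:
    # a random direction scaled by radius * U^(1/dim).
    directions = rng.normal(size=(n, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.uniform(size=(n, 1)) ** (1.0 / dim)
    return directions * radii

a, n_pairs = 1.0, 20000
for dim in [2, 10, 100, 1000]:
    r = np.linalg.norm(uniform_in_ball(n_pairs, dim, a)
                       - uniform_in_ball(n_pairs, dim, a), axis=1)
    print(f"D={dim:5d}  mean={r.mean():.3f} (a*sqrt(2)={a*np.sqrt(2):.3f})  "
          f"std={r.std():.3f} (a/sqrt(2D)={a/np.sqrt(2*dim):.3f})")
# As D grows, the pairwise distances concentrate around a*sqrt(2) with a
# standard deviation shrinking like a/sqrt(2D), leaving little room for
# distance-based pruning.
```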
2.3 Distance-Approximate Nearest Neighbors

The problem of NN search is relaxed in the following form to make it more scalable:

Problem. Given a dataset S ⊂ X of size N in some metric space (X, d) and a query q ∈ X, efficiently find any point p′ ∈ S such that

    d(p′, q) ≤ (1 + ε) min_{r ∈ S} d(r, q)    (2)

for a low value of ε ∈ ℝ⁺ with high probability.

This approximation can be achieved with kd-trees, ball trees, and cover trees by modifying the search algorithm to prune more aggressively. This introduces the allowed error while providing some speedup over the exact algorithm [12]. Another approach modifies the tree data structures to bound the error with just one root-to-leaf traversal of the tree, i.e. to eliminate backtracking. Sibling nodes in kd-trees or ball trees are modified to share points near their boundaries, forming spill trees [14]. These obtain significant speedups over the exact methods. The idea of approximately correct (satisfying Eq. 2) NN is further extended to a formulation where the (1 + ε) bound can be exceeded with a low probability δ, thus forming the PAC-NN search algorithms [17]. They provide 1-2 orders of magnitude speedup on moderately large datasets with suitable ε and δ.

These methods are still unable to scale to high dimensions. However, they can be used in combination with the assumption that high dimensional data actually lies on a lower dimensional subspace. There are a number of fast DANN methods that preprocess data with randomized projections to reduce dimensionality. Hybrid spill trees [14] build spill trees on the randomly projected data to obtain significant speedups. Locality sensitive hashing [18, 19] hashes the data into lower dimensional buckets using hash functions which guarantee that "close" points are hashed into the same bucket with high probability and "farther apart" points are hashed into the same bucket with low probability. This method achieves significant improvements in running time over traditional methods on high dimensional data and is shown to be highly scalable.

However, the DANN methods assume that the distances are well behaved and not concentrated in a small range. For example, if all pairwise distances lie within the range (100.0, 101.0), any distance approximation ε ≥ 0.01 will return an arbitrary point for a NN query. The exact tree-based algorithms failed to be efficient precisely because many datasets encountered in practice suffer this concentration of pairwise distances. Using DANN in such a situation loses the ordering information of the pairwise distances, which is essential for NN search [6]. This is too large a loss in accuracy for the increased efficiency.

To address this issue, we propose a model of approximation for NN search which preserves the information present in the ordering of the distances by controlling the error in the ordering itself, irrespective of the dimensionality or the distribution of the pairwise distances in the dataset. We also provide a scalable algorithm to obtain this form of approximation.

3 Rank Approximation

To approximate the NN rank, we formulate and relax NN search in the following way:

Problem.