
On the Difficulty of Nearest Neighbor Search

Junfeng He  [email protected]
Department of Electrical Engineering, Columbia University, New York, NY 10027, USA

Sanjiv Kumar  [email protected]
Google Research, New York, NY 10011, USA

Shih-Fu Chang  [email protected]
Department of Electrical Engineering, Columbia University, New York, NY 10027, USA

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

Abstract

Fast approximate nearest neighbor (NN) search in large databases is becoming popular. Several powerful learning-based formulations have been proposed recently. However, not much attention has been paid to a more fundamental question: how difficult is (approximate) nearest neighbor search in a given data set? And which data properties affect the difficulty of nearest neighbor search, and how? This paper introduces the first concrete measure, called Relative Contrast, that can be used to evaluate the influence of several crucial data characteristics such as dimensionality, sparsity, and database size simultaneously in arbitrary normed metric spaces. Moreover, we present a theoretical analysis to prove how the difficulty measure (relative contrast) determines/affects the complexity of Locality-Sensitive Hashing, a popular approximate NN search method. Relative contrast also provides an explanation for a family of heuristic hashing algorithms with good practical performance based on PCA. Finally, we show that most of the previous works on measuring NN search meaningfulness/difficulty can be derived as special asymptotic cases, for dense vectors, of the proposed measure.

1. Introduction

Finding nearest neighbors is a key step in many machine learning algorithms such as spectral clustering, manifold learning and semi-supervised learning. Rapidly increasing data in many domains such as the Web is posing new challenges on how to efficiently retrieve nearest neighbors of a query from massive databases. Fortunately, in most applications it is sufficient to return approximate nearest neighbors of a query, which allows efficient scalable search.

A large number of approximate Nearest Neighbor (NN) search techniques have been proposed in the last decade, including hashing and tree-based methods, to name a few (Datar et al., 2004; Liu et al., 2004; Weiss et al., 2008). However, the performance of all these techniques depends heavily on the characteristics of the data set. In fact, as a fundamental question, one would like to know how difficult (approximate) NN search is in a given data set, and more broadly, which characteristics of the data set affect this "difficulty" and how. The term "difficulty" here has two different but related meanings. In the context of the NN search problem itself (independent of indexing methods), "difficulty" means "meaningfulness": for a query, how differentiable is its NN point from the other points? In the context of approximate NN search methods such as tree- or hashing-based indexing, "difficulty" means "complexity": what time and space complexity is needed to guarantee finding the NN point (with high probability)? These questions have received little attention in the literature.

In terms of the "meaningfulness" of the NN search problem in a given data set, most existing works have focused on the effect of a single data property, dimensionality, and only in an asymptotic sense, showing that NN search becomes meaningless as the number of dimensions goes to infinity (Beyer et al., 1999; Aggarwal et al., 2001; Francois et al., 2007). First, non-asymptotic analysis, i.e., when the number of dimensions is finite, has not been discussed. Moreover, the effect of other crucial properties has not been studied, for instance the sparsity of the data vectors. Since in many applications high-dimensional vectors tend to be sparse, it is important to study the two data properties, dimensionality and sparsity, together, along with other factors such as database size and distance metric.

In terms of the complexity of approximate NN search methods like Locality-Sensitive Hashing (LSH), some general bounds have been presented (Gionis et al., 1999; Indyk & Motwani, 1998). However, it has not been studied how the complexity of approximate NN search methods is affected by the difficulty of the NN search problem on the data set, and moreover by various data properties such as dimensionality, sparsity, etc.

The main contributions of this paper are:

1. We introduce a new concrete measure, Relative Contrast, for the meaningfulness/difficulty of the nearest neighbor search problem in a given data set (independent of indexing methods). Unlike previous works that only provide asymptotic discussions for one or two data properties, we derive an explicitly computable function to estimate relative contrast in the non-asymptotic case. For the first time, this enables us to analyze how the difficulty of nearest neighbor search is affected by different data properties simultaneously, such as dimensionality, sparsity and database size, along with the norm $p$ of the $L_p$ distance metric, for a given data set. (Sec. 2)

2. We provide a theoretical analysis of how the difficulty measure "relative contrast" determines the complexity of LSH, a popular approximate NN search method. This is the first work to relate the complexity of approximate NN search methods to the difficulty measure of a given data set, allowing us to analyze how the complexity is affected by various data properties simultaneously. For practitioners' benefit, relative contrast also provides insights on how to choose parameters, e.g., the number of hash tables of LSH, and a principled explanation of why PCA-based methods perform well in practice. (Sec. 3)

3. We reveal the relationship between relative contrast and previous studies on measuring NN search difficulty, and show that most existing works can be derived as special asymptotic cases, for dense vectors, of the proposed relative contrast. (Sec. 4)
2. Relative Contrast ($C_r$)

Suppose we are given a data set $X$ containing $n$ $d$-dimensional points, $X = \{x_i,\ i = 1,\dots,n\}$, and a query $q$, where $x_i, q \in R^d$ are i.i.d. samples from an unknown distribution $p(x)$. Further, let $D(\cdot,\cdot)$ be the distance function for the $d$-dimensional data. We focus on $L_p$ distances in this paper: $D(x,q) = \big(\sum_j |x^j - q^j|^p\big)^{1/p}$.

2.1. Definition

Suppose $D^q_{min} = \min_{i=1,\dots,n} D(x_i, q)$ is the distance to the nearest database sample¹, and $D^q_{mean} = E_x[D(x, q)]$ is the expected distance of a random database sample from the query $q$. We define the relative contrast of the data set $X$ for a query $q$ as $C^q_r = D^q_{mean} / D^q_{min}$. It is a very intuitive measure of the separability of the nearest neighbor of $q$ from the rest of the database points. Now, taking expectations with respect to queries, the relative contrast for the data set $X$ is given as

$$C_r = \frac{E_q[D^q_{mean}]}{E_q[D^q_{min}]} = \frac{D_{mean}}{D_{min}} \qquad (1)$$

Intuitively, $C_r$ captures the notion of difficulty of NN search in $X$: the smaller $C_r$, the more difficult the search. If $C_r$ is close to 1, then on average a query $q$ will have almost the same distance to its nearest neighbor as to a random point in $X$, which implies that NN search in database $X$ is not very meaningful.

¹ Without loss of generality, we assume that the query $q$ is distinct from the database samples, i.e., $D^q_{min} \neq 0$.

In the following sections, we derive relative contrast as a function of various important data characteristics.
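Before turning to that analysis, note that Eq. (1) can also be estimated empirically by brute force on a sample of queries. The following is a minimal sketch, not part of the original paper; it assumes NumPy, an in-memory data matrix, and out-of-sample queries so that $D^q_{min} \neq 0$, and the function and variable names are illustrative only. It averages $D^q_{mean}$ and $D^q_{min}$ over the queries exactly as in Eq. (1).

import numpy as np

def relative_contrast(X, Q, p=2.0):
    """Empirical C_r = E_q[D_mean^q] / E_q[D_min^q] for a database X (n x d) and queries Q (m x d)."""
    d_mean, d_min = [], []
    for q in Q:
        # L_p distances from the query to every database point
        dist = np.sum(np.abs(X - q) ** p, axis=1) ** (1.0 / p)
        d_mean.append(dist.mean())   # D_mean^q: expected distance to a random database point
        d_min.append(dist.min())     # D_min^q: distance to the nearest neighbor
    return np.mean(d_mean) / np.mean(d_min)   # Eq. (1)

# Example on synthetic data: values close to 1 indicate a hard (barely meaningful) search problem.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 64))
Q = rng.standard_normal((100, 64))
print(relative_contrast(X, Q, p=2.0))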
2.2. Estimation

Suppose $x^j$ and $q^j$ are the $j$th dimensions of vectors $x$ and $q$. Let us define

$$R_j = E_q\big[|x^j - q^j|^p\big], \qquad R = \sum_{j=1}^{d} R_j. \qquad (2)$$

Both $R_j$ and $R$ are random variables (because $x^j$ is a random variable). Suppose each $R_j$ has finite mean and variance, denoted $\mu_j = E[R_j]$ and $\sigma_j^2 = \mathrm{var}[R_j]$. Then the mean and variance of $R$ are given as

$$\mu = \sum_{j=1}^{d} \mu_j, \qquad \sigma^2 \le \sum_{j=1}^{d} \sigma_j^2.$$

Here, if the dimensions are independent, then $\sigma^2 = \sum_j \sigma_j^2$. Without loss of generality, we can scale the data such that the new mean $\mu'$ is 1. The variance of the scaled data, called the normalized variance, is then

$$\sigma'^2 = \frac{\sigma^2}{\mu^2}. \qquad (3)$$

The normalized variance gives the spread of the distances from the query to random points in the database with the mean distance fixed at 1. If the spread is small, it is harder to separate the nearest neighbor from the rest of the points. Next, we estimate the relative contrast for a given data set as follows.

Theorem 2.1 If $R_j$, $j = 1,\dots,d$ are independent and satisfy Lindeberg's condition, the relative contrast can be approximated as

$$C_r = \frac{D_{mean}}{D_{min}} \approx \frac{1}{\big[\,1 + \phi^{-1}\big(\frac{1}{n} + \phi(-\frac{1}{\sigma'})\big)\,\sigma'\,\big]^{1/p}} \qquad (4)$$

where $\phi$ is the c.d.f. of the standard Gaussian, $n$ is the number of database samples, $\sigma'$ is the normalized standard deviation, and $p$ is the norm of the distance metric.

Moreover, after normalization, $R$ follows a Gaussian distribution with mean 1. So $R_{mean} = 1$, and $D_{mean} \approx R_{mean}^{1/p} = 1$. Thus, the relative contrast can be approximated as

$$C_r = \frac{D_{mean}}{D_{min}} \approx \frac{1}{\big[\,1 + \phi^{-1}\big(\frac{1}{n} + \phi(-\frac{1}{\sigma'})\big)\,\sigma'\,\big]^{1/p}},$$

which completes the proof.

Range of $C_r$: Note that when $n$ is large enough, $\phi(-\frac{1}{\sigma'}) \le \frac{1}{n} + \phi(-\frac{1}{\sigma'}) \le \frac{1}{2}$, so $0 \le 1 + \phi^{-1}\big(\frac{1}{n} + \phi(-\frac{1}{\sigma'})\big)\sigma' \le 1$, and hence $C_r$ is always $\ge 1$. Moreover, when $\sigma' \to 0$, $\phi(-\frac{1}{\sigma'}) \to 0$ and $C_r \to 1$.

Generalization 1: The concept of relative contrast
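The closed-form estimate in Theorem 2.1 is straightforward to evaluate numerically once $\sigma'$ has been computed from the per-dimension moments of Eqs. (2)-(3). Below is a minimal sketch, not part of the paper: it assumes independent dimensions (so that $\sigma^2 = \sum_j \sigma_j^2$), approximates the expectation over queries by a finite query sample, and uses SciPy's standard normal cdf/ppf for $\phi$ and $\phi^{-1}$; the function and variable names are illustrative only.

import numpy as np
from scipy.stats import norm

def normalized_std(X, Q, p=2.0):
    """sigma' from Eqs. (2)-(3): moments of R_j = E_q[|x^j - q^j|^p], estimated from a query sample."""
    R = np.zeros_like(X, dtype=float)      # R[i, j] ~= R_j evaluated at database point x_i
    for q in Q:
        R += np.abs(X - q) ** p
    R /= len(Q)
    mu = R.mean(axis=0).sum()              # mu = sum_j mu_j
    var = R.var(axis=0).sum()              # sigma^2 = sum_j sigma_j^2 (independence assumption)
    return np.sqrt(var) / mu               # sigma' = sigma / mu, i.e. the std after scaling the mean to 1

def relative_contrast_estimate(sigma_prime, n, p=2.0):
    """Eq. (4): C_r ~ 1 / [1 + phi^{-1}(1/n + phi(-1/sigma')) * sigma']^(1/p)."""
    inner = norm.ppf(1.0 / n + norm.cdf(-1.0 / sigma_prime))
    return 1.0 / (1.0 + inner * sigma_prime) ** (1.0 / p)

# Example: plug the data-driven sigma' into Eq. (4) and compare with the brute-force
# value from the earlier sketch; both should be >= 1, and close to 1 when the search
# problem is hard.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 64))
Q = rng.standard_normal((100, 64))
print(relative_contrast_estimate(normalized_std(X, Q, p=2.0), n=len(X), p=2.0))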