On the Difficulty of Nearest Neighbor Search

Junfeng He, [email protected], Department of Electrical Engineering, Columbia University, New York, NY 10027, USA
Sanjiv Kumar, [email protected], Google Research, New York, NY 10011, USA
Shih-Fu Chang, [email protected], Department of Electrical Engineering, Columbia University, New York, NY 10027, USA

Abstract

Fast approximate nearest neighbor (NN) search in large databases is becoming popular. Several powerful learning-based formulations have been proposed recently. However, not much attention has been paid to a more fundamental question: how difficult is (approximate) nearest neighbor search in a given data set? And which data properties affect the difficulty of nearest neighbor search, and how? This paper introduces the first concrete measure, called Relative Contrast, that can be used to evaluate the influence of several crucial data characteristics such as dimensionality, sparsity, and database size simultaneously in arbitrary normed spaces. Moreover, we present a theoretical analysis to prove how the difficulty measure (relative contrast) determines/affects the complexity of Locality Sensitive Hashing, a popular approximate NN search method. Relative contrast also provides an explanation for a family of heuristic hashing algorithms with good practical performance based on PCA. Finally, we show that most of the previous works on measuring NN search meaningfulness/difficulty can be derived as special asymptotic cases, for dense vectors, of the proposed measure.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

1. Introduction

Finding nearest neighbors is a key step in many machine learning algorithms such as spectral clustering, manifold learning and semi-supervised learning. Rapidly increasing data in many domains such as the Web is posing new challenges on how to efficiently retrieve nearest neighbors of a query from massive databases. Fortunately, in most applications, it is sufficient to return approximate nearest neighbors of a query, which allows efficient scalable search.

A large number of approximate Nearest Neighbor (NN) search techniques have been proposed in the last decade, including hashing and tree-based methods, to name a few (Datar et al., 2004; Liu et al., 2004; Weiss et al., 2008). However, the performance of all these techniques depends heavily on the data set characteristics. In fact, as a fundamental question, one would like to know how difficult (approximate) NN search is in a given data set, and more broadly, which characteristics of the dataset affect the "difficulty" and how. The term "difficulty" here has two different but related meanings. In the context of the NN search problem itself (independent of indexing methods), "difficulty" means "meaningfulness": for a query, how differentiable is its NN point compared to other points? In the context of approximate NN search methods such as tree- or hashing-based indexing, "difficulty" means "complexity": what are the time and space complexity needed to guarantee finding the NN point (with high probability)? These questions have not received much attention in the literature.

In terms of the "meaningfulness" of the NN search problem in a given dataset, most existing works have focused on the effect of one data property, dimensionality, and that too in an asymptotic sense, showing that NN search becomes meaningless when the number of dimensions goes to infinity (Beyer et al., 1999; Aggarwal et al., 2001; Francois et al., 2007). First, non-asymptotic analysis has not been discussed, i.e., the case when the number of dimensions is finite. Moreover, the effect of other crucial properties has not been studied, for instance, the sparsity of data vectors. Since in many applications high-dimensional vectors tend to be sparse, it is important to study the two data properties, dimensionality and sparsity, together, along with other factors such as database size and distance metric.

In terms of the complexity of approximate NN search methods like Locality Sensitive Hashing (LSH), some general bounds have been presented (Gionis et al., 1999; Indyk & Motwani, 1998). However, it has not been studied how the complexity of approximate NN search methods is affected by the difficulty of the NN search problem on the dataset, and moreover, by various data properties like dimension, sparsity, etc.

The main contributions of this paper are:

1. We introduce a new concrete measure, Relative Contrast, for the meaningfulness/difficulty of the nearest neighbor search problem in a given data set (independent of indexing methods). Unlike previous works that only provide asymptotic discussions for one or two data properties, we derive an explicitly computable function to estimate relative contrast in the non-asymptotic case. It for the first time enables us to analyze how the difficulty of nearest neighbor search is affected by different data properties simultaneously, such as dimensionality, sparsity, and database size, along with the norm p of the Lp distance metric, for a given data set. (Sec. 2)

2. We provide a theoretical analysis of how the difficulty measure "relative contrast" determines the complexity of LSH, a popular approximate NN search method. This is the first work to relate the complexity of approximate NN search methods to the difficulty measure of a given dataset, allowing us to analyze how the complexity is affected by various data properties simultaneously. For practitioners' benefit, relative contrast also provides insights on how to choose parameters, e.g., the number of hash tables of LSH, and a principled explanation of why PCA-based methods perform well in practice. (Sec. 3)

3. We reveal the relationship between relative contrast and previous studies on measuring NN search difficulty, and show that most existing works can be derived as special asymptotic cases, for dense vectors, of the proposed relative contrast. (Sec. 4)

2. Relative Contrast (Cr)

Suppose we are given a data set X containing n d-dimensional points, X = {x_i, i = 1, ..., n}, and a query q, where x_i, q ∈ R^d are i.i.d. samples from an unknown distribution p(x). Further, let D(·,·) be the distance function for the d-dimensional data. We focus on Lp distances in this paper: D(x, q) = (Σ_j |x^j − q^j|^p)^{1/p}.

2.1. Definition

Suppose D_min^q = min_{i=1,...,n} D(x_i, q) is the distance to the nearest database sample (without loss of generality, we assume that the query q is distinct from the database samples, i.e., D_min^q ≠ 0), and D_mean^q = E_x[D(x, q)] is the expected distance of a random database sample from the query q. We define the relative contrast of the data set X for a query q as Cr^q = D_mean^q / D_min^q. It is a very intuitive measure of separability of the nearest neighbor of q from the rest of the database points. Taking expectations with respect to queries, the relative contrast for the dataset X is given as

    Cr = E_q[D_mean^q] / E_q[D_min^q] = D_mean / D_min.    (1)

Intuitively, Cr captures the notion of difficulty of NN search in X: the smaller the Cr, the more difficult the search. If Cr is close to 1, then on average a query q will have almost the same distance to its nearest neighbor as to a random point in X. This implies that NN search in database X is not very meaningful.

In the following sections, we derive relative contrast as a function of various important data characteristics.
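As an illustration (this sketch and its function name are ours, not part of the paper), the definition in (1) can be estimated directly from data by averaging D_mean^q and D_min^q over a set of held-out sample queries:

```python
import numpy as np

def empirical_relative_contrast(X, Q, p=2):
    """Empirical C_r = E_q[D_mean^q] / E_q[D_min^q] as in Eq. (1).

    X: (n, d) database points; Q: (m, d) query points, assumed distinct
    from the database so that D_min^q > 0; p: norm of the L_p distance.
    """
    d_mean, d_min = [], []
    for q in Q:
        dist = np.linalg.norm(X - q, ord=p, axis=1)  # D(x_i, q) for all i
        d_mean.append(dist.mean())                   # D_mean^q
        d_min.append(dist.min())                     # D_min^q
    return np.mean(d_mean) / np.mean(d_min)

# Example: 10,000 random 100-dim points and 50 held-out queries.
rng = np.random.default_rng(0)
X = rng.uniform(size=(10000, 100))
Q = rng.uniform(size=(50, 100))
print(empirical_relative_contrast(X, Q, p=1))
```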

2.2. Estimation

Suppose x^j and q^j are the jth dimensions of vectors x and q. Let us define

    R_j = E_q[|x^j − q^j|^p],    R = Σ_{j=1}^d R_j.    (2)

Both R_j and R are random variables (because x is a random variable). Suppose each R_j has finite mean and variance, denoted µ_j = E[R_j] and σ_j^2 = var[R_j]. Then, the mean and variance of R are given as

    µ = Σ_{j=1}^d µ_j,    σ^2 ≤ Σ_{j=1}^d σ_j^2.

Here, if the dimensions are independent then σ^2 = Σ_j σ_j^2. Without loss of generality, we can scale the data such that the new mean µ′ is 1. The variance of the scaled data, called the normalized variance, is then

    σ′^2 = σ^2 / µ^2.    (3)
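For intuition, σ′ can also be estimated empirically: average D(x, q)^p over sample queries to approximate R for each database point, then take the ratio of its standard deviation to its mean. A minimal sketch (our own helper, not the authors' code):

```python
import numpy as np

def empirical_normalized_std(X, Q, p=2):
    """Empirical sigma' from Eq. (3): std(R) / mean(R), where
    R(x) = E_q[D(x, q)^p] is approximated by averaging over the
    sample queries in Q."""
    R = np.zeros(len(X))
    for q in Q:
        R += (np.abs(X - q) ** p).sum(axis=1)  # sum_j |x^j - q^j|^p
    R /= len(Q)                                # Monte Carlo estimate of E_q[.]
    return R.std() / R.mean()
```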

The normalized variance gives the spread of the distances from the query to random points in the database with the mean distance fixed at 1. If the spread is small, it is harder to separate the nearest neighbor from the rest of the points. Next, we estimate the relative contrast for a given dataset as follows.

Theorem 2.1 If R_j, j = 1, ..., d, are independent and satisfy Lindeberg's condition, the relative contrast can be approximated as

    Cr = D_mean / D_min ≈ 1 / [1 + φ^{-1}(1/n + φ(−1/σ′)) σ′]^{1/p},    (4)

where φ is the c.d.f. of the standard Gaussian, n is the number of database samples, σ′ is the normalized standard deviation, and p is the distance metric norm. (Lindeberg's condition is a sufficient condition for the central limit theorem to be applicable even when variables are not identically distributed; intuitively, it guarantees that no single R_j dominates R.)

Proof: Since the R_j are independent and satisfy Lindeberg's condition, by the central limit theorem R will be distributed as a Gaussian for large enough d, with mean µ = Σ_j µ_j and variance σ^2 = Σ_j σ_j^2. Normalizing the data by dividing by µ, the new mean is µ′ = 1 and the new variance is σ′^2 as defined in (3). Now, the probability that R ≤ α for any 0 ≤ α ≤ 1 is given as

    P(R ≤ α) ≈ φ((α − 1)/σ′) − φ((0 − 1)/σ′),    (5)

where φ is the c.d.f. of the standard Gaussian, and the second term on the RHS is a correction factor since R is always nonnegative.

Let us denote the number of samples for which R ≤ α as N(α). Clearly, N(α) follows a Binomial distribution with probability of success given in (5):

    P(N(α) = k) = (n choose k) (P(R ≤ α))^k (1 − P(R ≤ α))^{n−k}.

Hence the expected number of database points N̄(α) that satisfy R ≤ α can be computed as

    N̄(α) = E[N(α)] = n P(R ≤ α) = n (φ((α − 1)/σ′) − φ(−1/σ′)).

Recall that D_min is the expected distance to the nearest neighbor and R_min ≈ D_min^p (the approximation becomes exact for the L1 metric; for other norms, e.g., p = 2, bounds on D_min can be further derived). Thus, N̄(D_min^p) ≈ N̄(R_min) = 1. Hence,

    D_min ≈ (N̄^{-1}(1))^{1/p} ≈ [1 + φ^{-1}(1/n + φ(−1/σ′)) σ′]^{1/p}.    (6)

Moreover, after normalization, R follows a Gaussian distribution with mean 1. So R_mean = 1, and D_mean ≈ R_mean^{1/p} = 1. Thus, the relative contrast can be approximated as

    Cr = D_mean / D_min ≈ 1 / [1 + φ^{-1}(1/n + φ(−1/σ′)) σ′]^{1/p},

which completes the proof.

Range of Cr: Note that when n is large enough, φ(−1/σ′) ≤ 1/n + φ(−1/σ′) ≤ 1/2, so 0 ≤ 1 + φ^{-1}(1/n + φ(−1/σ′)) σ′ ≤ 1, and hence Cr is always ≥ 1. Moreover, when σ′ → 0, φ(−1/σ′) → 0 and Cr → 1.

Generalization 1: The concept of relative contrast can be extended easily to the k-nearest neighbor setting by defining Cr^k = D_mean / D_knn, where D_knn is the expected distance to the kth nearest neighbor. Using N̄(D_knn^p) ≈ N̄(R_knn) = k, and following similar arguments as above, one can easily show that

    Cr^k = D_mean / D_knn ≈ 1 / [1 + φ^{-1}(k/n + φ(−1/σ′)) σ′]^{1/p}.    (7)

2.3. Effect of normalized variance σ′ on Cr

From (4), relative contrast is a function of the database size n, the normalized variance σ′^2, and the distance metric norm p. Here, σ′ is a function of data characteristics such as dimensionality and sparsity. Figure 1 shows how Cr changes with σ′ according to (4) when n is varied from 100 to 100M and 0 < σ′ < 0.2 (note that σ′ is usually very small for high-dimensional data, e.g., far smaller than 0.1). It is clear that smaller σ′ leads to smaller relative contrast, i.e., more difficult nearest neighbor search.

In the above plots, p was fixed to 1, but other values yield similar results. An interesting thing to note is that as the database size n increases, relative contrast increases. In other words, nearest neighbor search is more meaningful for a larger database (this should not be confused with computational ease, since computationally search costs more in larger databases). However, this effect is not very pronounced for smaller values of σ′.
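Equation (4), and its k-NN variant (7), are straightforward to evaluate numerically. A minimal sketch, using SciPy's standard normal CDF and inverse CDF for φ and φ^{-1} (the helper name and example values are ours):

```python
from scipy.stats import norm

def predicted_relative_contrast(n, sigma_prime, p=2, k=1):
    """Predicted C_r from Eq. (4); setting k > 1 gives the k-NN form (7)."""
    inner = k / n + norm.cdf(-1.0 / sigma_prime)   # k/n + phi(-1/sigma')
    denom = 1.0 + norm.ppf(inner) * sigma_prime    # 1 + phi^{-1}(.) * sigma'
    return 1.0 / denom ** (1.0 / p)

# Example: n = 1M samples, sigma' = 0.05, L1 distance.
print(predicted_relative_contrast(n=1_000_000, sigma_prime=0.05, p=1))
```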

Figure 1. Change in relative contrast with respect to normalized data variance σ′ as in (4). The database size n varies from 100 to 100M and p = 1. Graph is best viewed with color.

2.4. Data Properties vs σ′

Since we already know the relationship between Cr and σ′, by analyzing how data properties affect σ′ we can find out how data properties affect Cr, i.e., the difficulty of NN search. Though many data properties can be studied, in this work we focus on sparsity (a very important property in many domains involving, say, text, images and videos), together with other properties like data dimension and metric.

Suppose the jth dimensions of vectors x and q are distributed in the same way as a random variable V_j, but each dimension has only probability s_j of having a non-zero value, where 0 < s_j ≤ 1. Denote m_{j,p} as the p-th moment of |V_j|, and m′_{j,p} as the p-th moment of |V_{j1} − V_{j2}|, where V_{j1} and V_{j2} are independently distributed as V_j.

Theorem 2.2 If dimensions are independent,

    σ′^2 = Σ_{j=1}^d (s_j^2 m′_{j,2p} + 2(1 − s_j) s_j m_{j,2p} − µ_j^2) / (Σ_{j=1}^d µ_j)^2,

where µ_j = s_j^2 m′_{j,p} + 2(1 − s_j) s_j m_{j,p}. Moreover, if dimensions are i.i.d.,

    σ′ = (1/d^{1/2}) sqrt( s[(m′_{2p} − 2m_{2p})s + 2m_{2p}] / (s^2 [(m′_p − 2m_p)s + 2m_p]^2) − 1 ).    (8)

Proof: Please see the supplementary material (He, 2012).

For some distributions, m_p and m′_p have a closed-form representation. For example, if every dimension follows the uniform distribution U(0, 1), the pth moments are quite easy in this case: m_p = 1/(p + 1) and m′_p = 2/((p + 1)(p + 2)). However, if m_p and m′_p do not have a closed-form representation, one can always generate samples according to the distribution and estimate m_p and m′_p empirically.
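A sketch of Eq. (8) for i.i.d. dimensions, with the moments estimated by sampling as suggested above (the function name and Monte Carlo details are ours):

```python
import numpy as np

def sigma_prime_iid(d, s, p, sample_V, n_samples=100_000, rng=None):
    """Normalized standard deviation from Eq. (8) for i.i.d. dimensions.

    d: dimensionality, s: sparsity (probability of a non-zero entry),
    p: distance norm, sample_V: function drawing samples of the
    per-dimension value V. The moments m_p, m_2p, m'_p, m'_2p are
    estimated by Monte Carlo.
    """
    rng = np.random.default_rng(rng)
    v1, v2 = sample_V(n_samples, rng), sample_V(n_samples, rng)
    m_p   = np.mean(np.abs(v1) ** p)              # m_p     = E|V|^p
    m_2p  = np.mean(np.abs(v1) ** (2 * p))        # m_{2p}  = E|V|^{2p}
    mp_p  = np.mean(np.abs(v1 - v2) ** p)         # m'_p    = E|V1 - V2|^p
    mp_2p = np.mean(np.abs(v1 - v2) ** (2 * p))   # m'_{2p} = E|V1 - V2|^{2p}
    num = s * ((mp_2p - 2 * m_2p) * s + 2 * m_2p)
    den = s ** 2 * ((mp_p - 2 * m_p) * s + 2 * m_p) ** 2
    return np.sqrt(num / den - 1.0) / np.sqrt(d)

# Example: uniform U(0,1) dimensions, d = 500, sparsity 0.5, L1 distance.
uniform = lambda m, r: r.uniform(size=m)
print(sigma_prime_iid(d=500, s=0.5, p=1, sample_V=uniform))
```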

2.5. Data Properties vs Relative Contrast Cr

We now summarize how different database properties and the distance metric affect relative contrast.

Data Dimensionality (d): From (8), it is easy to see that larger d leads to smaller σ′. Moreover, from (4), smaller σ′ implies smaller relative contrast Cr, making NN search less meaningful. This indicates the well-known phenomenon of distance concentration in high-dimensional spaces. However, when dimensions are not independent, thankfully, the rate at which distances start concentrating slows down.

Data Sparsity (s): From (8), we can see that σ′ = (1/d^{1/2}) sqrt( [(m′_{2p} − 2m_{2p}) + 2m_{2p}/s] / [(m′_p − 2m_p)s + 2m_p]^2 − 1 ). If m′_p − 2m_p ≥ 0, then when s becomes smaller (i.e., data vectors have fewer non-zero elements), σ′ gets larger, and so does the relative contrast. Another interesting case is when p → 0+, i.e., the L0 or zero-one distance. In this case, m_p = m′_p = 1, and from (8), σ′ = (1/d^{1/2}) (1 − s)/sqrt(1 − (1 − s)^2), which increases monotonically as s decreases. For general cases, however, it is not easy to theoretically prove how σ′ changes when s gets smaller, but in experiments we have always found that smaller s leads to larger σ′. In other words, when data vectors become more sparse, NN search becomes easier. That raises another interesting question: what is the effective dimensionality of sparse vectors? One may be tempted to use d·s as the intrinsic dimensionality, but as we will show in the experimental section, this is generally not the case, and relative contrast provides an empirical approach to finding the intrinsic dimensionality of high-dimensional sparse vectors.

Database Size (n): From (4), keeping σ′ fixed, Cr increases monotonically with n. Hence, NN search is more meaningful in larger databases. Actually, when n → ∞, irrespective of σ′, 1 + φ^{-1}(1/n + φ(−1/σ′))σ′ → 0, and Cr → ∞. Thus, when the database size is large enough, one does not need to worry about the meaningfulness of NN search irrespective of the dimensionality. Unfortunately, however, when the dimensionality is high, Cr increases very slowly with n, making the gains not very pronounced in practice. This is the same phenomenon noticed in Fig. 1 for small values of σ′.

Distance Metric Norm (p): Since p appears in both (4) and (8), the analysis of relative contrast with respect to p is not as straightforward. In the special case when data vectors are dense (i.e., s = 1) and each dimension is i.i.d. with the uniform distribution, one can show that smaller p leads to bigger contrast.
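These trends can be reproduced numerically by chaining the two helpers sketched earlier, predicted_relative_contrast and sigma_prime_iid (our illustrative names): larger d lowers the predicted Cr, smaller s raises it, and Cr grows only slowly with n when σ′ is small.

```python
# Illustrative sweep chaining the helpers sketched above
# (uniform U(0,1) per-dimension distribution, n = 100,000 database points).
for d in (50, 200, 1000):
    for s in (1.0, 0.3):
        sp = sigma_prime_iid(d=d, s=s, p=1, sample_V=lambda m, r: r.uniform(size=m))
        cr = predicted_relative_contrast(n=100_000, sigma_prime=sp, p=1)
        print(f"d={d:5d}  s={s:.1f}  sigma'={sp:.4f}  predicted Cr={cr:.3f}")
```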

2.6. Validation of Relative Contrast

To verify the form of relative contrast derived in Sec. 2, we conducted experiments with both synthetic and real-world datasets, which are summarized below.

2.6.1. Synthetic Data

We generated synthetic data by assuming each dimension to be i.i.d. from the uniform distribution U[0, 1]. Figure 2 compares the predicted (theoretical) relative contrast with the empirical one. The solid curves show the predicted contrast computed using (4), where the normalized variance σ′ is estimated using (8). The dotted curves show the empirical contrast, directly computed according to the definition in (1) from the data by averaging the results over one hundred queries. For most of the cases, the predicted and empirical contrasts have similar values.

Figure 2. Experiments with synthetic data on how relative contrast changes with different database characteristics: (a) dimension d, (b) sparsity s, (c) Lp norm, (d) database size n. Graphs are best viewed with color.

Fig. 2 (a) confirms that as dimensionality increases, relative contrast decreases, thus making nearest neighbor search harder. Moreover, except for very small d, the prediction is close to the empirical contrast, verifying the theory. It is not surprising that predictions are not very accurate for small d, since the central limit theorem (CLT) is not applicable in that case. It is interesting to note that (4) also predicts the rate at which contrast changes with d, unlike previous works (Beyer et al., 1999; Aggarwal et al., 2001), which only show that NN search becomes impossible when dimensionality goes to infinity.

Fig. 2 (b) shows how data sparsity affects the contrast for two different choices of d. The main observation is that as s increases (denser vectors), contrast decreases, making nearest neighbor search harder. In other words, the fewer the non-zero dimensions for a fixed d, the easier the search. In fact, the search remains well-behaved even in high-dimensional datasets if the data is sparse. The prediction is quite accurate in comparison to the empirical one, except when s·d is small and hence the CLT does not apply any more. As a note of caution, one should not regard s·d as the intrinsic dimensionality of the data, since a dataset with dense vectors of dimension s·d usually has different contrast than the d-dim s-sparse data set.

The effects of two other characteristics, the Lp distance metric for different p and the database size n, are shown in Figs. 2 (c) and (d), respectively. The effect of these parameters on relative contrast is milder than that of d and s. For large d, the contrast drops quickly and it becomes hard to visualize the effects of p and n, so here we show these plots for smaller values of d. From Fig. 2 (c) it is clear that for norms less than 1 the contrast is the highest (note that we have an approximation for p > 1 in Theorem 2.1, which causes the bias of the predicted Cr for p = 3, 4). This observation matches the conclusion from (Aggarwal et al., 2001) for dense vectors. Fig. 2 (d) shows that as the database size increases, it becomes more meaningful to do nearest neighbor search. But as the dimensionality is increased (from 30 to 60 in the plot), the rate of increase of contrast with n decreases. For very high-dimensional data, the effect of n is very small.
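A compact version of this synthetic comparison can be scripted as follows; the sketch reuses the helper functions defined earlier (our own, not the authors' experimental code) under the same i.i.d. U[0,1] assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_queries, p = 20_000, 100, 1
for d in (50, 200, 1000):
    X = rng.uniform(size=(n, d))                  # i.i.d. U[0,1] database
    Q = rng.uniform(size=(n_queries, d))          # i.i.d. U[0,1] queries
    emp = empirical_relative_contrast(X, Q, p=p)  # definition (1)
    sp = sigma_prime_iid(d=d, s=1.0, p=p, sample_V=lambda m, r: r.uniform(size=m))
    pred = predicted_relative_contrast(n=n, sigma_prime=sp, p=p)  # Eq. (4)
    print(f"d={d:5d}  empirical Cr={emp:.3f}  predicted Cr={pred:.3f}")
```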
2.6.2. Real-world Data

Next, we conducted experiments with four real-world datasets commonly used in applications: sift, gist, color and image. The details of these sets are given in Table 1. The sift and gist sets contain 128-dim and 384-dim vectors, which are mostly dense. On the other hand, both the color and image datasets are very high-dimensional as well as sparse. The color data set contains color histograms of images, while the image data set contains bag-of-words representations of local features in images.

Table 1. Description of the real-world datasets. n - database size, d - dimensionality, s - sparsity (fraction of nonzero dimensions), de - effective dimensionality containing 85% of data variance.

    dataset                  n      d      s      de
    gist                     95000  384    1      71
    sift                     95000  128    0.89   40
    color (histograms)       95000  1382   0.027  22
    image (bag-of-words)     95000  10000  0.024  71

While deriving the form of relative contrast in Sec. 2, we assumed that dimensions were independent. However, this assumption may not be true for real-world data. One way to address this problem would be to assume that the dimensions become independent after embedding the data in an appropriate low-dimensional space. In these experiments, we define the effective dimensionality de as the number of dimensions necessary to preserve 85% of the variance of the data (for large databases, one can use a small subset to estimate the covariance matrix). The effective dimensionality for the different datasets is shown in Table 1.
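The effective dimensionality used here can be computed from the eigenvalues of the (possibly subsampled) covariance matrix; a minimal sketch with our own helper name:

```python
import numpy as np

def effective_dimensionality(X, variance_ratio=0.85):
    """Smallest number of principal components preserving the given
    fraction of the data variance (X may be a random subsample)."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]  # descending
    cum = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cum, variance_ratio) + 1)
```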

Table 2. Experiments with four real-world datasets. Here, the predicted contrast is computed using the effective dimensionality containing 85% of the data variance.

    dataset                       p=1    p=2
    gist  empirical contrast      1.83   1.78
    gist  predicted contrast      1.62   1.87
    sift  empirical contrast      4.78   4.23
    sift  predicted contrast      2.03   3.94
    color empirical contrast      3.19   4.81
    color predicted contrast      2.78   8.10
    image empirical contrast      1.90   1.66
    image predicted contrast      1.62   1.87

Table 2 compares the empirical and predicted relative contrasts for the different datasets. Since our theory is based on the law of large numbers, the prediction is more accurate on the image and gist data sets, as their effective dimensions are large enough. For the color data, de is too small (just 22) and hence the prediction of relative contrast shows more bias for this set.

One interesting outcome of these experiments is that our analysis provides an alternative way of finding the intrinsic dimensionality of the data, which can be further used by various nearest neighbor search methods. The traditional method of finding intrinsic dimensionality using data variance suffers from the assumption of linearity of the low-dimensional space and the arbitrary choice of the threshold on variance. On the other hand, nonlinear methods are computationally prohibitive for large datasets. In the relative contrast based method, for a given dataset, one can sweep over different values of d′, where 0 < d′ < d, and find the one which gives the least discrepancy between the predicted and empirical contrasts averaged over different p. For large datasets, one can use a smaller sample and a few queries to estimate the empirical contrast. Using this procedure, the intrinsic dimensionality for the four datasets turns out to be: sift - 41, gist - 75, color - 41, image - 70. For the two sparse datasets (color and image), it indicates the dimensionality of equivalent low-dimensional dense vectors. It is interesting to note that the intrinsic dimensionality is not equal to d·s for the two sparse datasets, as discussed before. For the image dataset, it is much smaller than d·s, indicating high correlations in the non-zero entries of the data vectors.
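The sweep described above can be written as a simple search over candidate dimensionalities d′. The sketch below is one plausible instantiation, not the authors' exact recipe: it uses the dense i.i.d. form of (8) with a user-supplied per-dimension model sample_V, a coarse grid of d′ values, and the helpers sketched earlier.

```python
import numpy as np

def intrinsic_dimensionality(X, Q, sample_V, ps=(1, 2), d_grid=None):
    """Sweep candidate dimensionalities d' and return the one whose predicted
    contrast (Eqs. (4) and (8), dense i.i.d. assumption) best matches the
    empirical contrast, averaged over the norms in `ps`."""
    n, d = X.shape
    d_grid = d_grid if d_grid is not None else range(5, min(d, 500) + 1, 5)
    emp = {p: empirical_relative_contrast(X, Q, p=p) for p in ps}
    best_d, best_err = None, np.inf
    for d_prime in d_grid:
        err = 0.0
        for p in ps:
            sp = sigma_prime_iid(d=d_prime, s=1.0, p=p, sample_V=sample_V)
            pred = predicted_relative_contrast(n=n, sigma_prime=sp, p=p)
            if not (np.isreal(pred) and pred >= 1.0):  # outside the valid range of (4)
                err = np.inf
                break
            err += abs(pred - emp[p])
        if err < best_err:                             # least total discrepancy wins
            best_d, best_err = d_prime, err
    return best_d
```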
3. Relative Contrast and Hashing

3.1. Relative Contrast and LSH

LSH methods are commonly used in many practical large-scale search systems due to their efficiency and ability to deal with high-dimensional data. In each hash table, every data point x is converted into a code by using a series of k hash functions h_j(x), j = 1, ..., k. Each hash function is designed to satisfy the locality condition, i.e., neighboring points have the same hashed value with high probability and vice versa. A commonly used hash function in LSH is h(x) = ⌊(w^T x + b)/t⌋, where w is a vector with entries sampled from a p-stable distribution, and b is uniformly distributed as U[0, t] (Datar et al., 2004). We now provide the following theorems to show how relative contrast (Cr) affects the complexity of LSH.

Theorem 3.1 LSH can find the exact nearest neighbor with probability 1 − δ by returning O(log(1/δ) · n^{g(Cr)}) candidate points, where g(Cr) is a function monotonically decreasing with Cr.

Proof: Please see the supplementary material.

Corollary 3.2 LSH can find the exact nearest neighbor with probability at least 1 − δ with time complexity O(d · log(1/δ) · n^{g(Cr)} · log n) and space complexity O(log(1/δ) · n^{1+g(Cr)} + nd). The number of hash tables needed is l = O(log(1/δ) · n^{g(Cr)}).

Proof: Please see the supplementary material.

The above theorems imply that when Cr is larger, g(Cr) will be smaller. Thus, among datasets of the same size, to achieve the same recall of the true nearest neighbor, the dataset with higher relative contrast Cr will have better time and space complexity, return fewer candidates for reranking, and need fewer hash tables; in a word, it will be easier for approximate NN search with LSH.

Note that our theory shares some similarity with the results in (Gionis et al., 1999) about the complexity of LSH; however, it has several unique properties. First, our theory is about finding the exact NN (with a probability guarantee), not finding an approximate NN (with a probability guarantee) as in previous works. Moreover, we have related the complexity of LSH to relative contrast Cr, enabling us to analyze how the complexity of LSH is affected by various data properties of the dataset simultaneously. To the best of our knowledge, our work is the first one on this important topic.
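The hash function h(x) = ⌊(w^T x + b)/t⌋ can be written down directly. A minimal sketch for p = 2 (the Gaussian distribution is 2-stable); the class name, bucket layout and the choice of t are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

class PStableLSHTable:
    """One hash table of k p-stable LSH functions for p = 2 (Gaussian entries),
    h_j(x) = floor((w_j^T x + b_j) / t), following Datar et al. (2004)."""

    def __init__(self, d, k=16, t=1.0, rng=None):
        rng = np.random.default_rng(rng)
        self.W = rng.standard_normal((k, d))  # 2-stable (Gaussian) projections
        self.b = rng.uniform(0.0, t, size=k)  # offsets b_j ~ U[0, t]
        self.t = t
        self.buckets = {}

    def key(self, x):
        return tuple(np.floor((self.W @ x + self.b) / self.t).astype(int))

    def index(self, X):
        for i, x in enumerate(X):
            self.buckets.setdefault(self.key(x), []).append(i)

    def candidates(self, q):
        return self.buckets.get(self.key(q), [])  # points to rerank
```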

To verify the effect of relative contrast on LSH, we conducted experiments on three real-world datasets. In Fig. 3, the performance of LSH for the L1 distance (i.e., p = 1) is given on three datasets: sift, gist and color. From Table 2, for p = 1, Cr for the three datasets is in this order: sift (4.78) > color (3.19) > gist (1.83). From Fig. 3 (a), we can see that for several settings of the number of bits and number of tables, the number of returned points needed to get the same nearest neighbor recall for the three sets follows sift < color < gist, as predicted by Theorem 3.1. Moreover, from Fig. 3 (b), the number of hash tables needed to get the same recall follows sift < color < gist, as predicted by Corollary 3.2. We have tried experiments with k = 12, 16, ..., 40 and observed the same trend, but only show results for k = 32 due to the space limit.

Figure 3. Performance of LSH on three datasets: sift, gist, and color. (a) Recall of the nearest neighbor vs the number of returned points; each curve represents a different number of bits, e.g., k = 12, 16, ..., 40, and each marker on a curve represents a different number of hash tables l, e.g., l = 1, 2, ..., 128. (b) Recall of the nearest neighbor for different numbers of hash tables for k = 32. Graphs are best viewed with color.

The above experiments used the typical framework of hash table lookup. Another popular way to retrieve neighbors in code space is via hamming ranking: when using a k-bit code, points that are within hamming distance r of the query are returned as candidates. In Figure 4, we show the recall of the nearest neighbor for two different values of k. Similar to the hash table lookup experiments, the number of returned points needed to get the same recall follows sift < color < gist. This follows the same order as suggested by relative contrast. The interesting thing is that color has much higher dimensionality than gist, but its sparsity helps in achieving better relative contrast and hence better search performance.

Figure 4. Recall vs the number of returned points when using hamming ranking. Number of bits k = 20 for (a) and k = 28 for (b). Graphs are best viewed with color.

3.2. Relative Contrast and PCA hashing

Hashing methods that use PCA as a heuristic often achieve quite good performance in practice (Weiss et al., 2008; Gong & Lazebnik, 2011). In this section, we show that PCA hashing is actually seeking projections that maximize relative contrast in each projection with the L2 distance, under some assumptions. A commonly used hash function in PCA-based hashing methods is

    h(x) = sgn(w^T x + b),    (9)

where w is heuristically picked as a PCA direction, and b is a threshold which is usually chosen as E[w^T x]. Assuming the data to be zero-centered, i.e., E[x] = 0, leads to b = 0. Since q and x are assumed to be i.i.d. samples from some unknown p(x), E[q] = 0 as well.

For a query q, denote x_{q,NN} as q's NN in the database. Denote S_NN = E_q[(q − x_{q,NN})(q − x_{q,NN})^T], and Σ_X = (1/n) Σ_i x_i x_i^T. The following theorem shows that maximizing relative contrast leads us to PCA hashing under some assumptions.

Theorem 3.3 For linear hashing as in (9), to find the projection vector w that maximizes relative contrast, we should find ŵ = arg max_w (w^T Σ_X w)/(w^T S_NN w). If we further assume that the nearest neighbors are isotropic, i.e., S_NN = αI, we get ŵ = arg max_w w^T Σ_X w, i.e., PCA hashing.

Proof: Please see the supplementary material.

If we do not assume the nearest neighbors to be isotropic, we can empirically compute S_NN from a few samples, and then find the projection vectors w in (9) as ŵ = arg max_w (w^T Σ_X w)/(w^T S_NN w), which are the generalized eigenvectors of Σ_X and S_NN. This will often obtain better results than PCA hashing. We provide one example in Figure 5, in which "MRC" represents the method described above, and "PCA", "LSH", "SH" are PCA hashing, Locality Sensitive Hashing, and Spectral Hashing (Weiss et al., 2008), respectively.

Figure 5. Recall of 1-NN for hamming reranking with different hashing methods (MRC, LSH, PCA, SH) on the color data using (a) 80 bits, (b) 100 bits. The relative contrast based method (MRC) can improve upon PCA-based hashing. Graphs are best viewed with color.
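The projections used by MRC can be obtained from the generalized eigenproblem on (Σ_X, S_NN). The sketch below assumes SciPy's symmetric generalized eigensolver and estimates S_NN from a small sample with leave-one-out exact nearest neighbors; these estimation details and the regularization are our own choices, not the authors' implementation:

```python
import numpy as np
from scipy.linalg import eigh

def mrc_projections(X, n_bits, n_nn_samples=1000, rng=None):
    """Projections maximizing w^T Sigma_X w / w^T S_NN w, i.e. the top
    generalized eigenvectors of (Sigma_X, S_NN). If S_NN is proportional
    to the identity this reduces to PCA hashing."""
    rng = np.random.default_rng(rng)
    Xc = X - X.mean(axis=0)                      # zero-center, as assumed in the text
    Sigma_X = Xc.T @ Xc / len(Xc)

    # Estimate S_NN from a few sample points used as pseudo-queries.
    idx = rng.choice(len(Xc), size=min(n_nn_samples, len(Xc)), replace=False)
    diffs = []
    for i in idx:
        dist = np.linalg.norm(Xc - Xc[i], axis=1)
        dist[i] = np.inf                         # exclude the point itself
        diffs.append(Xc[i] - Xc[dist.argmin()])  # q - x_{q,NN}
    D = np.asarray(diffs)
    S_NN = D.T @ D / len(D) + 1e-6 * np.eye(X.shape[1])  # regularize to keep it PD

    eigvals, eigvecs = eigh(Sigma_X, S_NN)       # generalized eigenproblem
    W = eigvecs[:, np.argsort(eigvals)[::-1][:n_bits]]
    return W  # binary codes: sign(X_centered @ W), as in Eq. (9) with b = 0
```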

4. Related Works

4.1. Previous Works

Some of the influential works on analyzing NN search difficulty are (Beyer et al., 1999) and (Francois et al., 2007), whose main results are shown in Theorems 4.1 and 4.2.

Theorem 4.1 (Beyer et al., 1999) Denote D_max^q = max_{i=1,...,n} D(x_i, q) and D_min^q = min_{i=1,...,n} D(x_i, q). If lim_{d→∞} var( D(x_i, q)^p / E[D(x_i, q)^p] ) = 0, then for every ε ≥ 0, lim_{d→∞} P[D_max^q ≤ (1 + ε) D_min^q] = 1.

Theorem 4.2 (Francois et al., 2007) If every dimension of the data is i.i.d., then when d → ∞,

    sqrt(Var(||x_i − q||_p)) / E(||x_i − q||_p) ≈ (1/√d) (1/p) (σ_j/µ_j),

where σ_j^2 = Var(|x_i^j − q^j|^p) and µ_j = E(|x_i^j − q^j|^p) are the variance and mean of each dimension.

4.2. Relations Between Our Analysis and Previous Works

Relation to Beyer's Work. Note that if the distance function D(x_i, q) in Beyer's work is the Lp distance, then var( D(x_i, q)^p / E[D(x_i, q)^p] ) = σ^2/µ^2 = (σ′)^2. When σ′ → 0 (d → ∞), Beyer's work shows that D_max^q ≈ D_min^q, and our theory shows Cr → 1, or equivalently D_mean ≈ D_min. So we reach the same conclusion: when d → ∞, NN search is not very "meaningful", because we cannot differentiate the nearest neighbor from other points. However, Beyer's theory works for the worst case (i.e., it compares the NN point to the worst point with maximum distance), while ours works for the average case.

Relation to Francois's Work. In Theorem 4.2, a measurement called "relative variance", defined as sqrt(Var(||x_i − q||_p)) / E(||x_i − q||_p), is discussed, which is a modification of the condition var( D(x_i, q)^p / E[D(x_i, q)^p] ) in Beyer's work. If sqrt(Var(||x_i − q||_p)) / E(||x_i − q||_p) → 0, NN search becomes meaningless. The following theorem reveals the relationship between relative variance and relative contrast.

Theorem 4.3 In (4), if σ′ → 0 (e.g., d → ∞),

    Cr ≈ 1 / ( 1 + φ^{-1}(1/n) (1/p) (σ_j/µ_j) (1/d^{1/2}) ).

Proof: Please see the supplementary material.

From Theorem 4.3, we see that when σ′ → 0 (e.g., d → ∞), the relative contrast monotonically depends on (1/p)(σ_j/µ_j)(1/d^{1/2}), which equals the "relative variance" in Theorem 4.2.
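The asymptotic step behind Theorem 4.3 can be sketched with a first-order expansion (our own shorthand, not reproduced from the supplementary material): as σ′ → 0, φ(−1/σ′) → 0, so from (4),

```latex
C_r \;\approx\; \Bigl[\,1+\phi^{-1}\!\bigl(\tfrac{1}{n}\bigr)\,\sigma'\,\Bigr]^{-1/p}
    \;\approx\; \frac{1}{\,1+\phi^{-1}\!\bigl(\tfrac{1}{n}\bigr)\,\tfrac{1}{p}\,\sigma'\,},
\qquad \sigma' = \frac{\sigma_j}{\mu_j\sqrt{d}}\ \text{for i.i.d.\ dense dimensions},
```

which recovers the expression in Theorem 4.3.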

To summarize, most of the known analyses can be derived as special asymptotic cases (when σ′ → 0, e.g., d → ∞) of the proposed measure, with the focus on only one or two data properties.

5. Conclusion and Future Work

In this work, we introduced a new measure called relative contrast to describe the difficulty of nearest neighbor search in a data set. The proposed measure can be used to evaluate the influence of several crucial data characteristics such as dimensionality, sparsity, and database size simultaneously in arbitrary normed metric spaces. Furthermore, we showed how relative contrast determines the difficulty of ANN search with LSH and provides guidance for better parameter settings.

In the future, we would like to relax the independence assumption used in the theory of relative contrast, and also study how relative contrast affects the complexity of approximate NN search methods other than LSH. Moreover, we will explore a better but harder definition of Cr = E_q[D_mean^q / D_min^q].

References

Aggarwal, C., Hinneburg, A., and Keim, D. On the surprising behavior of distance metrics in high dimensional space. ICDT, 2001.

Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. When is nearest neighbor meaningful? ICDT, 1999.

Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V.S. Locality-sensitive hashing scheme based on p-stable distributions. In SoCG, 2004.

Francois, D., Wertz, V., and Verleysen, M. The concentration of fractional distances. IEEE Transactions on Knowledge and Data Engineering, 2007.

Gionis, A., Indyk, P., and Motwani, R. Similarity search in high dimensions via hashing. In VLDB, 1999.

Gong, Y. and Lazebnik, S. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, 2011.

He, J., et al. Supplementary material for "On the difficulty of nearest neighbor search", 2012. www.ee.columbia.edu/~jh2700/sup_DNNS.pdf.

Indyk, P. and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, 1998.

Liu, T., Moore, A.W., Gray, A., and Yang, K. An investigation of practical approximate nearest neighbor algorithms. NIPS, 2004.

Weiss, Y., Torralba, A., and Fergus, R. Spectral hashing. NIPS, 2008.