SRS: Solving c-Approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index

Yifang Sun†   Wei Wang†   Jianbin Qin†   Ying Zhang‡   Xuemin Lin†
†University of New South Wales, Australia   ‡University of Technology Sydney, Australia
{yifangs, weiw, jqin, [email protected]   [email protected]

ABSTRACT

Nearest neighbor searches in high-dimensional space have many important applications in domains such as data mining and multimedia databases. The problem is challenging due to the phenomenon called "curse of dimensionality". An alternative solution is to consider algorithms that return a c-approximate nearest neighbor (c-ANN) with guaranteed probabilities. Locality Sensitive Hashing (LSH) is among the most widely adopted methods, and it achieves high efficiency both in theory and practice. However, it is known to require an extremely high amount of space for indexing, hence limiting its scalability.

In this paper, we propose several surprisingly simple methods to answer c-ANN queries with theoretical guarantees, requiring only a single tiny index. Our methods are highly flexible and support a variety of functionalities, such as finding the exact nearest neighbor with any given probability. In the experiments, our methods demonstrate superior performance against the state-of-the-art LSH-based methods, and scale up well to 1 billion high-dimensional points on a single commodity PC.

1. INTRODUCTION

Given a d-dimensional query point q in a Euclidean space, a nearest neighbor (NN) query returns the point o in a dataset D such that the Euclidean distance between q and o is the minimum. It is a fundamental problem and has wide applications in many domain areas, such as computational geometry, data mining, information retrieval, multimedia databases, and data compression.

While efficient solutions exist for NN queries in low-dimensional space, they are challenging in high-dimensional space. This is mainly due to the phenomenon of "curse of dimensionality", where indexing-based methods are outperformed by the brute-force linear scan method [45].

The curse of dimensionality can be largely mitigated by allowing a small amount of error. This is the c-ANN query, which returns a point o′ whose distance to q is no more than c times the distance of the nearest point. Locality sensitive hashing (LSH) [21] is a widely adopted method that answers such a query in sublinear time with constant probability. It also enjoys great success in practice, due to its excellent performance and ease of implementation.

A major limitation of LSH-based methods is that their indexes are typically very large. The original LSH method requires indexing space superlinear in the number of data points [14], and the index typically includes hundreds or thousands of hash tables in practice. Recently, several LSH-based variants, such as LSB-forest [42] and C2LSH [17], have been proposed to build a smaller index without losing the performance guarantees of the original LSH method. However, their index sizes are still superlinear in the number of points; this prevents their usage for large datasets (note that the data size itself is always O(n · d)).

In this paper, we propose a surprisingly simple algorithm to solve the c-ANN problem with (1) rigorous theoretical guarantees, (2) an extremely small index, and (3) superior empirical performance compared to existing methods. Our methods are based on projecting high-dimensional points into an appropriately chosen low-dimensional space via 2-stable random projections. The key observation is that the ratio of the inter-point distance in the projected space (called the projected distance) to that in the original high-dimensional space follows a known distribution, which has a sharp concentration bound. Therefore, given any threshold on the projected distance, and for any point o, we can compute exactly the probability that o's projected distance is within the threshold.
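This closed-form probability follows from a standard property of Gaussian (2-stable) random projections: if the m × d projection matrix has i.i.d. N(0, 1) entries, the squared projected distance divided by the squared original distance follows a chi-squared distribution with m degrees of freedom. The sketch below is our illustration rather than the paper's code; the concrete dimensionalities, the threshold, and the absence of any 1/√m scaling are assumptions made for the example.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
d, m = 256, 6                        # original / projected dimensionality (example values)
A = rng.standard_normal((m, d))      # 2-stable (Gaussian) projection, shared by all points

o = rng.standard_normal(d)           # a data point
q = rng.standard_normal(d)           # a query point
dist = np.linalg.norm(o - q)                 # distance in the original space
proj_dist = np.linalg.norm(A @ (o - q))      # projected distance

# Key observation: proj_dist**2 / dist**2 follows a chi-squared distribution
# with m degrees of freedom, so for any threshold t on the projected distance
# the probability P[proj_dist <= t] can be computed exactly.
t = 2.0 * dist
closed_form = chi2.cdf(t ** 2 / dist ** 2, df=m)

# Empirical check over many independent projections of the same pair of points.
samples = np.array([np.linalg.norm(rng.standard_normal((m, d)) @ (o - q))
                    for _ in range(20000)])
print(f"closed form: {closed_form:.4f}, empirical: {np.mean(samples <= t):.4f}")
```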
This observation gives rise to our basic algorithm, which reduces a d-dimensional c-ANN query to an m-dimensional exact k-NN query equipped with an early-termination test. It can be shown that the resulting algorithm answers a c-ANN query with at least constant probability using γ1 · n I/Os in the worst case, with an index of γ2 · n pages, where γ1 ≪ 1 and γ2 ≪ 1 are two small constants. For example, in a typical setting, we have γ1 = 0.0083 and γ2 = 0.0059 (as will be pointed out in Section 3, existing methods using space linear in the number of data points n typically cannot outperform the linear scan method, whose query complexity is 0.25n in this example). We also derive several variants of the basic algorithm, which offer various new functionalities (as described in Section 5.3.2 and shown in Table 1). Our proposed algorithms can also be extended to support returning the c-approximate k-nearest neighbors (i.e., c-k-ANN queries) with conditional guarantees.
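To make the reduction concrete, below is a minimal in-memory sketch of such a basic algorithm. It is our illustration rather than the paper's implementation: SRS runs in external memory and drives the exact k-NN search in the projected space with an index (the precise early-termination condition is derived later in the paper), while the stopping test here is a hedged rendition of the same idea, and the function and parameter names are ours.

```python
import numpy as np
from scipy.stats import chi2

def basic_cann_query(data, A, q, c, T, p_tau):
    """Sketch: answer a c-ANN query over d-dimensional `data` via an exact
    k-NN search on m-dimensional projections (A is an m x d Gaussian matrix).
    T bounds the number of candidates examined; p_tau is the early-stop threshold."""
    m = A.shape[0]
    proj_dist = np.linalg.norm(data @ A.T - A @ q, axis=1)
    order = np.argsort(proj_dist)          # exact NN order in the projected space

    best_id, best_dist = -1, np.inf
    for i in order[:T]:                    # examine at most T candidates
        d_i = np.linalg.norm(data[i] - q)  # one random I/O per candidate on disk
        if d_i < best_dist:
            best_id, best_dist = i, d_i
        # Early termination (our rendition): if any point closer than
        # best_dist / c would very likely have had a projected distance no
        # larger than the current search radius, the current best is already
        # a c-ANN with probability at least p_tau, so we can stop.
        if chi2.cdf((c * proj_dist[i] / best_dist) ** 2, df=m) > p_tau:
            break
    return best_id, best_dist
```

In the typical setting quoted above, the worst-case number of candidate accesses is γ1 · n ≈ 0.0083n I/Os, well below the 0.25n I/Os of a linear scan.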
The contributions of the paper are summarized below:

• We proposed a novel solution to answer c-ANN queries for high-dimensional points. Our methods use a tiny amount of space for the index, and answer the query with bounded I/O costs and guaranteed success probability. They also have excellent empirical performance and allow the user to fine-tune the tradeoff between query time and result quality. We present the technical details in Sections 4 and 5 and the theoretical analysis in Section 6.

• We obtained several new results with our proposal: (i) Our SRS-12 algorithm is the first c-ANN query processing algorithm with guarantees that uses a linear-sized index and works with any approximation ratio. Previous methods use superlinear index size and/or cannot work with non-integer c or c < 2 (see Table 1 for a comparison). (ii) Our SRS-2 algorithm returns the exact nearest neighbor with any given probability, which is not possible for LSH-based methods since they require c > 1. (iii) There is no approximation ratio guarantee for c-k-ANN query results by existing LSH-based methods for k > 1, while our SRS-12 provides such a guarantee under certain conditions.

• We performed an extensive performance evaluation against the state-of-the-art LSH-based methods in the external memory setting. In particular, we used datasets with 8 million and 1 billion points, which is much larger than those previously reported. Experiments on these large datasets not only demonstrate the need for a linear-sized index, but

Table 1: Comparison of Algorithms

| Algorithm | Success Probability | Index Size | Query Cost | Constraint on Approximation Ratio c | Comment |
|---|---|---|---|---|---|
| LSB-forest [42] | 1/2 − 1/e | O((dn/B)^1.5) | O((dn/B)^0.5 · log_B n) | c ≥ 4 and must be the square of an integer | Index size superlinear in both n and d |
| C2LSH [17] | 1/2 − 1/e | O((n log n)/B) (or O(n/B)) | O((n log n)/B) (or O(n/B)) | c ≥ 4 and must be the square of an integer | β = O(1) (or β = O(1/n)) |
| SRS-12(T, c, p_τ) | 1/2 − 1/e | γ2 · n (worst case), γ2 ≪ 1, i.e., O(n/B) | γ1 · n (worst case), γ1 ≪ 1 | c ≥ 1 | Fastest |
| SRS-1(T, c) | 1/2 − 1/e | (same as above) | (same as above) | c ≥ 1 | Better quality |
| SRS-12+(T, c′, p_τ) | 1/2 − 1/e | (same as above) | (same as above) | c ≥ 1 | Tunable quality |
| SRS-2(c, p) | p | (same as above) | (same as above) | c ≥ 1 | Supports finding the NN |

In this paper, we consider a dataset D which contains n points in a d-dimensional Euclidean space ℝ^d. We are particularly interested in the high-dimensional case where d is a fairly large number (e.g., d ≥ 50). The coordinate value of a point o on the i-th dimension is denoted as o[i]. The Euclidean distance between two points, dist(o1, o2), is defined as √(Σ_{i=1}^{d} (o1[i] − o2[i])²). Throughout the rest of the paper, we are concerned with a point's distance with respect to a query point q, so we use the shorthand notation dist(o) := dist(o, q), and refer to it as the distance of point o.

For simplicity, we assume there is no tie in terms of distances. This makes the following definitions unique and deterministic. The results in this paper still hold without such an assumption by a straightforward extension.

Given a query point q, the nearest neighbor (NN) of q (denoted as o*) is the point in D that has the smallest distance. We can generalize the concept to the i-th nearest neighbor (denoted as o*_i). A k nearest neighbor query, or k-NN query, returns the ordered set of {o*_1, o*_2, ..., o*_k}.

Given an approximation ratio c > 1, a c-approximate nearest neighbor query, or c-ANN query, returns a point o ∈ D such that dist(o) ≤ c · dist(o*); such a point o is also called a c-ANN point. Similarly, a c-approximate k-nearest neighbor query, or c-k-ANN query, returns k points o_i ∈ D (1 ≤ i ≤ k), such that dist(o_i) ≤ c · dist(o*_i).
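These definitions map directly onto code. The small helpers below are our reference sketch (not part of the paper): they compute exact k-NN answers by brute force and check the c-k-ANN condition for a candidate result set; a c-ANN query is simply the special case k = 1.

```python
import numpy as np

def dist(o, q):
    """Euclidean distance dist(o, q) = sqrt(sum_i (o[i] - q[i])^2)."""
    return float(np.linalg.norm(o - q))

def knn(D, q, k):
    """Exact k-NN query: indices of the ordered set {o*_1, ..., o*_k}."""
    return np.argsort(np.linalg.norm(D - q, axis=1))[:k]

def is_c_k_ann(D, q, result_ids, c):
    """Check dist(o_i) <= c * dist(o*_i) for every returned point (1 <= i <= k)."""
    exact_ids = knn(D, q, len(result_ids))
    return all(dist(D[r], q) <= c * dist(D[e], q)
               for r, e in zip(result_ids, exact_ids))
```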