Efficient and Accurate Nearest Neighbor and Closest Pair Search In
Total Page:16
File Type:pdf, Size:1020Kb
Efficient and Accurate Nearest Neighbor and Closest Pair Search in High-Dimensional Space YUFEI TAO Chinese University of Hong Kong KE YI Hong Kong University of Science and Technology CHENG SHENG Chinese University of Hong Kong and PANOS KALNIS King Abdullah University of Science and Technology Nearest Neighbor (NN) search in high-dimensional space is an important problem in many appli- cations. From the database perspective, a good solution needs to have two properties: (i) it can be easily incorporated in a relational database, and (ii) its query cost should increase sublinearly with the dataset size, regardless of the data and query distributions. Locality-Sensitive Hashing (LSH) 20 is a well-known methodology fulfilling both requirements, but its current implementations either incur expensive space and query cost, or abandon its theoretical guarantee on the quality of query results. Motivated by this, we improve LSH by proposing an access method called the Locality-Sensitive B-tree (LSB-tree) to enable fast, accurate, high-dimensional NN search in relational databases. The combination of several LSB-trees forms a LSB-forest that has strong quality guarantees, but improves dramatically the efficiency of the previous LSH implementation having the same guarantees. In practice, the LSB-tree itself is also an effective index which consumes linear space, Y. Tao and C. Sheng were supported by Grants GRF 4161/07, GRF 4173/08, and GRF 4169/09 from HKRGC, and a direct grant (2050395) from CUHK. K. Yi was supported by a Hong Kong Direct Allocation grant (DAG07/08). Author’s addresses: Y. Tao, Department of Computer Science and Engineering, Chinese University of Hong Kong, Sha Tin, Hong Kong; email: [email protected], K. Yi, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong; email: [email protected], C. Sheng, Department of Computer Science and Engineering, Chinese University of Hong Kong, Sha Tin, Hong Kong; email: [email protected], P. Kalnis, Division of Mathematical and Computer Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial- advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. C 2010 ACM 0362-5915/2010/07-ART20 $10.00 DOI 10.1145/1806907.1806912 http://doi.acm.org/10.1145/1806907.1806912 ACM Transactions on Database Systems, Vol. 35, No. 3, Article 20, Publication date: July 2010. 20:2 • Y.Taoetal. supports efficient updates, and provides accurate query results. In our experiments, the LSB-tree was faster than: (i) iDistance (a famous technique for exact NN search) by two orders of magnitude, and (ii) MedRank (a recent approximate method with nontrivial quality guarantees) by one order of magnitude, and meanwhile returned much better results. As a second step, we extend our LSB technique to solve another classic problem, called Closest Pair (CP) search, in high-dimensional space. The long-term challenge for this problem has been to achieve subquadratic running time at very high dimensionalities, which fails most of the existing solutions. We show that, using a LSB-forest, CP search can be accomplished in (worst-case) time significantly lower than the quadratic complexity, yet still ensuring very good quality. In practice, accurate answers can be found using just two LSB-trees, thus giving a substantial reduction in the space and running time. In our experiments, our technique was faster: (i) than distance browsing (a well-known method for solving the problem exactly) by several orders of magnitude, and (ii) than D-shift (an approximate approach with theoretical guarantees in low-dimensional space) by one order of magnitude, and at the same time, outputs better results. Categories and Subject Descriptors: H.2.2 [Database Management]: Physical Design—Access methods; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval General Terms: Theory, Algorithms, Experimentation Additional Key Words and Phrases: Locality-sensitive hashing, nearest neighbor search, closest pair search ACM Reference Format: Tao, Y., Yi, K., Sheng, C., and Kalnis, P. 2010. Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Datab. Syst. 35, 3, Article 20 (July 2010), 46 pages. DOI = 10.1145/1806907.1806912 http://doi.acm.org/10.1145/1806907.1806912 1. INTRODUCTION Nearest Neighbor (NN) search is a classic problem with tremendous impacts on artificial intelligence, pattern recognition, information retrieval, and so on. Let D be a set of points in d-dimensional space. Given a query point q,itsNNis the point o∗ ∈ D closest to q. Formally, there is no other point o ∈ D satisfying o, q < o∗, q, where , denotes the distance of two points. In this article, we consider high-dimensional NN search. Some studies [Beyer et al. 1999] argue that high-dimensional NN queries may not be mean- ingful. On the other hand, there is also evidence [Bennett et al. 1999] that such an argument is based on restrictive assumptions. Intuitively, a meaning- ful query is one where the query point q is much closer to its NN than to most data points. This is true in many applications involving high-dimensional data, as supported by a large body of recent works [Andoni and Indyk 2006; Athit- sos et al. 2008; Ciaccia and Patella 2000; Datar et al. 2004; Fagin et al. 2003; Ferhatosmanoglu et al. 2001; Gionis et al. 1999; Goldstein and Ramakrishnan 2000; Har-Peled 2001; Houle and Sakuma 2005; Indyk and Motwani 1998; Li et al. 2002; Lv et al. 2007; Panigrahy 2006]. Sequential scan trivially solves a NN query by examining the entire dataset D, but its cost grows linearly with the cardinality of D. From the database perspective, a good solution should satisfy two requirements: (i) it can be easily implemented in a relational database, and (ii) its query cost should increase sublinearly with the cardinality for all data and query distributions. Despite the bulk of NN literature (see Section 8), with a single exception to be explained ACM Transactions on Database Systems, Vol. 35, No. 3, Article 20, Publication date: July 2010. Efficient and Accurate Nearest Neighbor and Closest Pair Search • 20:3 shortly, we are not aware of any existing solution that is able to fulfill both requirements at the same time. Specifically,a majority of them (e.g., those based on new indexes [Arya et al. 1998; Goldstein and Ramakrishnan 2000; Har- Peled 2001; Houle and Sakuma 2005; Lin et al. 1994]) demand nonrelational features, and thus cannot be incorporated in a commercial system. There also exist relational solutions (such as iDistance [Jagadish et al. 2005] and MedRank [Fagin et al. 2003]) which are experimentally shown to perform well for some datasets and queries. Their drawback is that they may incur expensive query cost on other datasets. Locality-Sensitive Hashing (LSH) is the only known solution that satisfies both requirements (i) and (ii). It supports c-approximate NN search. Formally, apointo is a c-approximate NN of q if its distance to q is at most c times the distance from q to its exact NN o∗, namely, o, q≤co∗, q, where c ≥ 1is the approximation ratio. It is widely recognized that approximate NNs already fulfill the needs of many applications [Andoni and Indyk 2006; Arya et al. 1998; Athitsos et al. 2008; Datar et al. 2004; Ferhatosmanoglu et al. 2001; Gionis et al. 1999; Har-Peled 2001; Houle and Sakuma 2005; Indyk and Motwani 1998; Krauthgamer and Lee 2004; Li et al. 2002; Lv et al. 2007; Panigrahy 2006]. LSH was originally proposed as a theoretical method [Indyk and Motwani 1998] with attractive asymptotical space and query performance. As elaborated in Section 3, its practical implementation can be either rigorous or ad hoc. Specifically, rigorous-LSH ensures good quality of query results (i.e., small approximation ratio c), but requires expensive space and query cost. Although ad hoc-LSH is more efficient, it abandons quality control, that is, the neighbor it outputs can be arbitrarily bad. In other words, no LSH implementation is able to ensure both quality and efficiency simultaneously, which is a serious problem severely limiting the applicability of LSH. Motivated by this, we propose an access method called Locality-Sensitive B-tree (LSB-tree) that enables fast high-dimensional NN search with excellent quality. The combination of several LSB-trees leads to a structure called the LSB-forest that combines the advantages of both rigorous- and ad hoc-LSH, without sharing their shortcomings. Specifically, the LSB-forest has the fol- lowing features. First, its space consumption is the same as ad hoc-LSH, and significantly lower than rigorous-LSH, typically by a factor over an order of magnitude. Second, it retains the approximation guarantee of rigorous-LSH (recall that ad hoc-LSH has no such guarantee). Third, its query cost is sub- stantially lower than ad hoc-LSH, and as an immediate corollary, sublinear to the dataset size.