Distance-Sensitive Hashing∗
Martin Aumüller (IT University of Copenhagen, [email protected]), Tobias Christiani (IT University of Copenhagen, [email protected]), Rasmus Pagh (IT University of Copenhagen, [email protected]), Francesco Silvestri (University of Padova, [email protected])

ABSTRACT

Locality-sensitive hashing (LSH) is an important tool for managing high-dimensional noisy or uncertain data, for example in connection with data cleaning (similarity join) and noise-robust search (similarity search). However, for a number of problems the LSH framework is not known to yield good solutions, and instead ad hoc solutions have been designed for particular similarity and distance measures. For example, this is true for output-sensitive similarity search/join, and for indexes supporting annulus queries that aim to report a point close to a certain given distance from the query point.

In this paper we initiate the study of distance-sensitive hashing (DSH), a generalization of LSH that seeks a family of hash functions such that the probability of two points having the same hash value is a given function of the distance between them. More precisely, given a distance space (X, dist) and a "collision probability function" (CPF) f : ℝ → [0, 1] we seek a distribution over pairs of functions (h, g) such that for every pair of points x, y ∈ X the collision probability is Pr[h(x) = g(y)] = f(dist(x, y)). Locality-sensitive hashing is the study of how fast a CPF can decrease as the distance grows. For many spaces, f can be made exponentially decreasing even if we restrict attention to the symmetric case where g = h. We show that the asymmetry achieved by having a pair of functions makes it possible to achieve CPFs that are, for example, increasing or unimodal, and show how this leads to principled solutions to problems not addressed by the LSH framework. This includes a novel application to privacy-preserving distance estimation. We believe that the DSH framework will find further applications in high-dimensional data management.

To put the running time bounds of the proposed constructions into perspective, we show lower bounds for the performance of DSH constructions with increasing and decreasing CPFs under angular distance. Essentially, this shows that our constructions are tight up to lower order terms. In particular, we extend existing LSH lower bounds, showing that they also hold in the asymmetric setting.

CCS CONCEPTS

• Theory of computation → Randomness, geometry and discrete structures; Data structures design and analysis; • Information systems → Information retrieval;

KEYWORDS

locality-sensitive hashing; similarity search; annulus query

ACM Reference Format:
Martin Aumüller, Tobias Christiani, Rasmus Pagh, and Francesco Silvestri. 2018. Distance-Sensitive Hashing. In Proceedings of 2018 International Conference on Management of Data (PODS'18). ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/XXXXXX.XXXXXX

1 INTRODUCTION

The growth of data from a variety of sources that need to be managed and analyzed has made it increasingly important to design data management systems with features that make them robust and tolerant towards noisy data. For example: different texts representing the same object (in data reconciliation), slightly different versions of a string (in plagiarism detection), or feature vectors whose similarity reflects the affinity of two objects (in recommender systems). In data management, such tasks are often addressed using the similarity join operator [49].

When data sets are high-dimensional, traditional algorithmic approaches often fail. Fortunately, there are general principles for handling high-dimensional data sets. One of the most successful approaches is the locality-sensitive hashing (LSH) framework by Indyk and Motwani [32], further developed in collaboration with Gionis [27] and Har-Peled [29].
LSH is a powerful framework for approximate nearest neighbor (ANN) search in high dimensions that achieves sublinear query time, and it has found many further applications. However, for a number of problems the LSH framework is not known to yield good solutions, for example output-sensitive similarity search, and indexes supporting annulus queries returning points at distance approximately r from a query point.

∗The research leading to these results has received funding from the European Research Council under the European Union's 7th Framework Programme (FP7/2007-2013) / ERC grant agreement no. 614331. Silvestri has also been supported by project SID2017 of the University of Padova.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
PODS'18, June 10–15, 2018, Houston, TX, USA
© 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-4706-8/18/06...$15.00
https://doi.org/10.1145/XXXXXX.XXXXXX

Motivating example. A classical application of similarity search is in recommender systems: Suppose you have shown interest in a particular item, for example a news article x. The semantic meaning of a piece of text can be represented as a high-dimensional feature vector, for example computed using latent semantic indexing [25]. In order to recommend other news articles we might search the set P of article feature vectors for articles that are "close" to x. But in general it is not clear that it is desirable to recommend the "closest" articles. Indeed, it might be desirable to recommend articles that are on the same topic but are not too aligned with x, and may provide a different perspective [2].

Discussion. Unfortunately, existing LSH techniques do not allow us to search for points that are "close, but not too close". In a nutshell: LSH provides a sequence of hash functions h_1, h_2, ... such that if x and y are close we have h_i(x) = h_i(y) for some i with constant probability, while if x and y are distant we have h_i(x) = h_i(y) only with small probability (typically 1/n, where n upper bounds the number of distant points). As a special case, this paper discusses techniques that allow us to refine the first requirement: If x and y are "too close" we would like collisions to occur only with very small probability.
At first sight this seems impossible because we will, by definition, have a collision when x = y. However, this objection is overcome by switching to an asymmetric setting where we work with pairs of functions (h_i, g_i) and are concerned with collisions of the form h_i(x) = g_i(y).

More generally, we initiate the systematic study of the following question: In the asymmetric setting, what is the class of functions f for which it is possible to achieve Pr[h(x) = g(y)] = f(dist(x, y)), where the probability is over the choice of (h, g) and dist(x, y) is the distance between x and y. We refer to such a function as a collision probability function (CPF). More formally:

Definition 1.1. A distance-sensitive hashing (DSH) scheme for the space (X, dist) is a distribution D over pairs of functions h, g : X → R with collision probability function (CPF) f : ℝ → [0, 1] if for each pair x, y ∈ X and (h, g) ∼ D we have Pr[h(x) = g(y)] = f(dist(x, y)).
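To make the notion of a CPF concrete, here is a minimal empirical sketch (not from the paper, which uses Cross-Polytope LSH and filter families; random-hyperplane SimHash is assumed here purely for illustration) of a symmetric scheme with g = h: for unit vectors at angle θ, a random hyperplane hash collides with probability exactly 1 − θ/π, a decreasing CPF.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_cpf(x, y, trials=200_000):
    """Estimate Pr[h(x) = h(y)] for h(v) = sign(<a, v>), a ~ N(0, I)."""
    A = rng.standard_normal((trials, len(x)))
    return np.mean((A @ x >= 0) == (A @ y >= 0))

# Two unit vectors at angle theta = pi/3.
theta = np.pi / 3
x = np.array([1.0, 0.0])
y = np.array([np.cos(theta), np.sin(theta)])

est = empirical_cpf(x, y)
exact = 1 - theta / np.pi  # SimHash CPF: decreasing in the angle theta
assert abs(est - exact) < 0.01
```

In the symmetric case the CPF is necessarily maximal at distance zero (x = y always collides); the asymmetric pairs (h, g) of Definition 1.1 are what make increasing or unimodal CPFs possible.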
The theory of locality-sensitive hashing is the study of decreasing CPFs whose collision probability is high for neighboring points and low for far-away points.

…applicable in cases where symmetric methods are used, it seems relevant to re-assess whether such constructions can be improved, even in settings where a decreasing CPF is desired.

Though our DSH applications do not lead to quantitative running time improvements compared to existing, published ad hoc solutions, we believe that studying the possibilities and limitations of the DSH framework will help unify approaches to solving "distance sensitive" algorithmic problems. We now proceed with a more detailed description of our results.

1.1.1 Angular distance. We consider monotonically increasing CPFs for angular distance between vectors on the unit sphere S^{d−1}. It will be convenient to express distances in terms of dot products ⟨·, ·⟩ rather than by angles or Euclidean distances, with the understanding that there is a 1-1 correspondence between them.

The constructions rely on the idea of "negating the query point": leveraging state-of-the-art symmetric LSH schemes, we obtain DSH constructions with monotone CPFs by replacing the query point q by −q in a symmetric LSH construction. We initially apply this idea in section 2.1 to Cross-Polytope LSH [8], getting an efficient DSH with a monotonically decreasing CPF. We then show that a more flexible result follows with a variant of ideas used in the filter constructions from [10, 14, 21]. The filter based approach contains a parameter t that can be used for fine tuning the scheme. This parameter is exploited in the data structure solving the annulus query problem (see section 6.2). More specifically, the filter based approach gives the following result:

Theorem 1.2. For every t > 1 there exists a distance-sensitive family D for (S^{d−1}, ⟨·, ·⟩) with a monotonically decreasing CPF f such that for every α ∈ (−1, 1) with |α| < 1 − 1/t we have

    ln(1/f(α)) = ((1 + α)/(1 − α)) · t²/2 + Θ(log t).    (1)
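The "negate the query point" idea can be sketched with plain random-hyperplane SimHash (an assumption for illustration; the paper's constructions use Cross-Polytope LSH and filter families, whose CPF obeys the sharper bound of Theorem 1.2). Setting g(y) = h(−y) turns SimHash's collision probability 1 − θ/π into arccos(α)/π, which is monotonically decreasing in the dot product α = ⟨x, y⟩: the closer two points are, the rarer their collisions.

```python
import numpy as np

rng = np.random.default_rng(1)

def collision_rate(x, y, trials=200_000):
    """Asymmetric pair (h, g) with g(y) = h(-y) for a random hyperplane:
    h(x) = sign(<a, x>), g(y) = sign(<a, -y>). Estimates Pr[h(x) = g(y)]."""
    A = rng.standard_normal((trials, len(x)))
    return np.mean((A @ x >= 0) == (A @ (-y) >= 0))

# alpha = <x, y>; the resulting CPF is arccos(alpha)/pi, decreasing in
# alpha, so near-identical points almost never collide.
for alpha in (0.9, 0.0, -0.9):
    x = np.array([1.0, 0.0])
    y = np.array([alpha, np.sqrt(1 - alpha**2)])
    est = collision_rate(x, y)
    assert abs(est - np.arccos(alpha) / np.pi) < 0.01
```

This sketch only illustrates monotonicity; for the filter-based family of Theorem 1.2 the decay rate of the CPF is additionally tunable through the parameter t via equation (1).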