Set Similarity Search Beyond MinHash∗

Tobias Christiani, IT University of Copenhagen, Copenhagen, Denmark, [email protected]
Rasmus Pagh, IT University of Copenhagen, Copenhagen, Denmark, [email protected]

∗ The research leading to these results has received funding from the European Research Council under the European Union's 7th Framework Programme (FP7/2007-2013) / ERC grant agreement no. 614331.

ABSTRACT

We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|). The (b1, b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x ∈ P with B(q, x) ≥ b1, then we can efficiently return x′ ∈ P with B(q, x′) > b2. We present a simple data structure that solves this problem with space usage O(n^(1+ρ) log n + Σ_{x∈P} |x|) and query time O(|q| n^ρ log n), where n = |P| and ρ = log(1/b1)/log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b2 < b1 < 1. In the case where all sets have the same size our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998), such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).

1 INTRODUCTION

In this paper we consider the approximate set similarity problem or, equivalently, the problem of approximate Hamming near neighbor search in sparse vectors. Data that can be represented as sparse vectors is ubiquitous; a typical example is the representation of text documents as term vectors, where non-zero vector entries correspond to occurrences of words (or shingles). In order to perform identification of near-identical text documents in web-scale collections, Broder et al. [11, 12] designed and implemented MinHash (a.k.a. min-wise hashing), now understood as a locality-sensitive hash function [21]. This allowed approximate answers to similarity queries to be computed much faster than by other methods, and in particular made it possible to cluster the web pages of the AltaVista search engine (for the purpose of eliminating near-duplicate search results). Almost two decades after it was first described, MinHash remains one of the most widely used locality-sensitive hashing methods, as witnessed by thousands of citations of [11, 12].

A similarity measure maps a pair of vectors to a similarity score in [0, 1]. It will often be convenient to interpret a vector x ∈ {0, 1}^d as the set {i | x_i = 1}. With this convention the Jaccard similarity of two vectors can be expressed as J(x, y) = |x ∩ y| / |x ∪ y|. In approximate similarity search we are interested in the problem of searching a data set P ⊆ {0, 1}^d for a vector of similarity at least j1 with a query vector q ∈ {0, 1}^d, but we allow the search algorithm to return a vector of similarity j2 < j1. To simplify the exposition we will assume throughout the introduction that all vectors are t-sparse, i.e., have the same Hamming weight t.
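To make the definitions above concrete, here is a small Python sketch (our illustration, not part of the paper) computing Jaccard and Braun-Blanquet similarity for sets of non-zero indices, together with the query exponent ρ = log(1/b1)/log(1/b2) from the abstract; the function names, example sets, and parameter values are ours and purely for illustration.

    import math

    def jaccard(x, y):
        # J(x, y) = |x ∩ y| / |x ∪ y|
        return len(x & y) / len(x | y)

    def braun_blanquet(x, y):
        # B(x, y) = |x ∩ y| / max(|x|, |y|)
        return len(x & y) / max(len(x), len(y))

    def rho(b1, b2):
        # Query exponent from the abstract: rho = log(1/b1) / log(1/b2).
        return math.log(1 / b1) / math.log(1 / b2)

    # Two 4-sparse vectors over {1, ..., d}, viewed as sets of non-zero indices.
    x = {1, 2, 3, 5}
    y = {2, 3, 5, 8}
    print(jaccard(x, y))         # 3/5 = 0.6
    print(braun_blanquet(x, y))  # 3/4 = 0.75
    print(rho(0.75, 0.25))       # ~0.21, giving query time O(|q| n^0.21 log n)

Note that for two t-sparse vectors the two measures determine each other: if |x ∩ y| = s, then B(x, y) = s/t while J(x, y) = s/(2t − s).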
Recent theoretical advances in data structures for approximate near neighbor search in Hamming space [5] make it possible to beat the asymptotic performance of MinHash-based Jaccard similarity search (using the LSH framework of [21]) in cases where the similarity threshold j2 is not too small. However, numerical computations suggest that MinHash is always better when j2 < 1/45.

In this paper we address the problem: Can similarity search using MinHash be improved in general? We give an affirmative answer in the case where all sets have the same size t by introducing Chosen Path: a simple data-independent search method that strictly improves upon MinHash, and is always better than the data-dependent method of [5] when j2 < 1/9. Similar to data-independent locality-sensitive filtering (LSF) methods [9, 16, 24], our method works by mapping each data (or query) vector to a set of keys that must be stored (or looked up). The name Chosen Path stems from the way the mapping is constructed: as paths in a layered random graph where the vertices at each layer are identified with the set {1, ..., d} of dimensions, and where a vector x is only allowed to choose paths that stick to non-zero components x_i. This is illustrated in Figure 1.

Figure 1: Chosen Path uses a branching process to associate each vector x ∈ {0, 1}^d with a set M_k(x) ⊆ {1, ..., d}^k of paths of length k (in the picture k = 3). The paths associated with x are limited to indices in the set {i | x_i = 1}, represented by an ellipsoid at each level in the illustration. In the example the set sizes are: |M_3(x)| = 4 and |M_3(y)| = |M_3(x′)| = 3. Parameters are chosen such that a query y that is similar to x ∈ P is likely to have a common path in x ∩ y (shown as a bold line), whereas it shares few paths in expectation with vectors such as x′ that are not similar.

1.1 Related Work

High-dimensional approximate similarity search methods can be characterized in terms of their ρ-value, which is the exponent for which queries can be answered in time O(d n^ρ), where n is the size of the set P and d denotes the dimensionality of the space. Here we focus on the "balanced" case where we aim for space O(n^(1+ρ) + dn), but note that there now exist techniques for obtaining other trade-offs between query time and space overhead [4, 16].

Locality-Sensitive Hashing Methods. We begin by describing results for Hamming space, which is a special case of similarity search on the unit sphere (many of the results cited apply to the more general case). In Hamming space the focus has traditionally been on the ρ-value that can be obtained for solutions to the (r, cr)-approximate near neighbor problem: Preprocess a set of points P ⊆ {0, 1}^d such that, given a query point q, if there exists x ∈ P with ||x − q||_1 ≤ r, then return x′ ∈ P with ||x′ − q||_1 < cr. In the literature this problem is often presented as the c-approximate near neighbor problem, where bounds for the ρ-value are stated in terms of c and, in the case of upper bounds, hold for every choice of r, while lower bounds may only hold for specific choices of r.

O'Donnell et al. [30] have shown that the value ρ = 1/c for c-approximate near neighbor search in Hamming space, obtained in the seminal work of Indyk and Motwani [23], is the best possible in terms of c for schemes based on Locality-Sensitive Hashing (LSH). However, the lower bound only applies when the distances of interest, r and cr, are relatively small compared to d, and better upper bounds are known for large distances. Notably, other LSH schemes for angular distance on the unit sphere, such as cross-polytope LSH [2], give lower ρ-values for large distances.

The existence of an LSH family whose collision probability equals a given similarity measure S does not necessarily imply that the LSH is optimal for solving the approximate search problem for the measure S. The problem of finding tight upper and lower bounds on the ρ-value that can be obtained through the LSH framework for data-independent (s1, s2)-approximate similarity search across the entire parameter space (s1, s2) remains open for two of the most common measures of set similarity: Jaccard similarity J(x, y) = |x ∩ y| / |x ∪ y| and cosine similarity C(x, y) = |x ∩ y| / √(|x| |y|).

A random function from the MinHash family H_minhash hashes a set x ⊂ {1, ..., d} to the first element of x in a random permutation of the set {1, ..., d}. For h ∼ H_minhash we have that Pr[h(x) = h(y)] = J(x, y), yielding an LSH solution to the approximate Jaccard similarity search problem.
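The following Python sketch (again ours, purely illustrative, not code from the paper) implements one draw from this family and empirically checks the collision probability; the universe size and example sets are arbitrary assumptions.

    import random

    def minhash_function(d, seed):
        # One h from H_minhash: h(x) is the first element of x under a
        # random permutation of the universe {1, ..., d}.
        rng = random.Random(seed)
        perm = list(range(1, d + 1))
        rng.shuffle(perm)
        rank = {element: position for position, element in enumerate(perm)}
        return lambda x: min(x, key=rank.__getitem__)

    d = 100
    x = set(range(1, 31))   # {1, ..., 30}
    y = set(range(16, 46))  # {16, ..., 45}, so J(x, y) = 15/45 = 1/3
    trials = 10_000
    collisions = 0
    for s in range(trials):
        h = minhash_function(d, seed=s)
        if h(x) == h(y):
            collisions += 1
    print(collisions / trials)  # concentrates around Pr[h(x) = h(y)] = 1/3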
For cosine similarity the SimHash family H_simhash, introduced by Charikar [13], works by sampling a random hyperplane in R^d that passes through the origin and hashing x according to which side of the hyperplane it lies on. For h ∼ H_simhash we have that Pr[h(x) = h(y)] = 1 − arccos(C(x, y))/π, which can be used to derive a solution for cosine similarity, although not the clean solution that we could have hoped for in the style of MinHash for Jaccard similarity.

There exists a number of different data-independent LSH approaches [2, 3, 34] that improve upon the ρ-value of SimHash. Perhaps surprisingly, it turns out that these approaches yield lower ρ-values for the (j1, j2)-approximate Jaccard similarity search problem compared to MinHash for certain combinations of (j1, j2). Unfortunately, while asymptotically superior, these techniques suffer from a non-trivial o_n(1) term in the exponent that only decreases very slowly with n. In comparison, both MinHash and SimHash are simple to describe and have
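As a companion to the MinHash sketch above, here is a sketch in the same style of the hyperplane-based SimHash family described earlier in this subsection (again our illustration, not code from the paper); the universe size and example sets are arbitrary assumptions, and the final lines check the collision probability stated above.

    import math
    import random

    def cosine(x, y):
        # C(x, y) = |x ∩ y| / sqrt(|x| * |y|) for sets of non-zero indices.
        return len(x & y) / math.sqrt(len(x) * len(y))

    def simhash_function(d, seed):
        # One h from H_simhash: sample a random hyperplane through the origin
        # (Gaussian normal vector) and hash x by the side it falls on.
        rng = random.Random(seed)
        normal = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # x is a set of indices in {1, ..., d}, i.e., a 0/1 vector.
        return lambda x: sum(normal[i - 1] for i in x) >= 0

    d = 100
    x = set(range(1, 31))   # {1, ..., 30}
    y = set(range(16, 46))  # {16, ..., 45}, so C(x, y) = 15/30 = 0.5
    trials = 10_000
    collisions = 0
    for s in range(trials):
        h = simhash_function(d, seed=s)
        if h(x) == h(y):
            collisions += 1
    print(collisions / trials)                    # empirical collision rate
    print(1 - math.acos(cosine(x, y)) / math.pi)  # 1 - arccos(0.5)/pi = 2/3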