In Defense of MinHash Over SimHash

Anshumali Shrivastava
Department of Computer Science, Computing and Information Science
Cornell University, Ithaca, NY, USA

Ping Li
Department of Statistics and Biostatistics, Department of Computer Science
Rutgers University, Piscataway, NJ, USA

Abstract

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search.

The collision probability of MinHash is a function of resemblance similarity (R), while the collision probability of SimHash is a function of cosine similarity (S). To provide a common basis for comparison, we evaluate retrieval results in terms of S for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to S, by using a general inequality S² ≤ R ≤ S/(2 − S). Our worst case analysis shows that MinHash significantly outperforms SimHash in the high similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, often R ≥ S/(z − S) holds, where z is only slightly larger than 2 (e.g., z ≤ 2.1). Our restricted worst case analysis, by assuming S/(z − S) ≤ R ≤ S/(2 − S), shows that MinHash indeed significantly outperforms SimHash even in the low similarity region.

We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

1 Introduction

The advent of the Internet has led to the generation of massive and inherently high dimensional data. In many industrial applications, the size of the datasets has long exceeded the memory capacity of a single machine. In web domains, it is not difficult to find datasets with the number of instances and the number of dimensions going into billions [1, 6, 28].

The reality that web data are typically sparse and high dimensional is due to the wide adoption of the "Bag of Words" (BoW) representations for documents and images. In BoW representations, it is known that the word frequency within a document follows a power law. Most of the words occur rarely in a document, and most of the higher order shingles in the document occur only once. It is often the case that just the presence or absence information suffices in practice [7, 14, 17, 23]. Leading search companies routinely use sparse binary representations in their large data systems [6].

Locality sensitive hashing (LSH) [16] is a general framework of indexing technique, devised for efficiently solving the approximate near neighbor search problem [11]. The performance of LSH largely depends on the particular underlying hashing method. Two popular hashing schemes are MinHash [3] and SimHash (sign normal random projections) [8].

MinHash is an LSH for resemblance similarity, which is defined over binary vectors, while SimHash is an LSH for cosine similarity, which works for general real-valued data. With the abundance of binary data over the web, it has become a practically important question: which LSH should be preferred in binary data? This question has not been adequately answered in the existing literature. There were prior attempts to address this problem from various aspects. For example, the paper on Conditional Random Sampling (CRS) [19] showed that random projections can be very inaccurate especially in binary data, for the task of inner product estimation (which is not the same as near neighbor search). A more recent paper [26] empirically demonstrated that b-bit minwise hashing [22] outperformed SimHash and spectral hashing [30].
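The BoW-style sparse binary representations discussed above can be made concrete with a small sketch (ours, not code from the paper): a document is reduced to the set of its word-level w-shingles, keeping only presence/absence, so repeated shingles collapse to a single set element.

```python
# Illustrative sketch (not from the paper): turn a document into the sparse
# binary "set" representation discussed above, using word-level w-shingles.
# Only presence/absence is kept, so repeated shingles collapse to one element.

def shingle_set(text: str, w: int = 3) -> frozenset:
    """Return the set of w-shingles (w consecutive words) of a document."""
    words = text.lower().split()
    return frozenset(" ".join(words[i:i + w]) for i in range(len(words) - w + 1))

doc = "the quick brown fox jumps over the lazy dog"
print(sorted(shingle_set(doc, w=3)))
```

In practice the shingles would be mapped to integer indices in Ω = {1, 2, ..., D}, which is the set representation used throughout the paper.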


Our contribution: Our paper provides an essentially conclusive answer that MinHash should be used for near neighbor search in binary data, both theoretically and empirically. To favor SimHash, our theoretical analysis and experiments evaluate the retrieval results of MinHash in terms of cosine similarity (instead of resemblance). This is possible because we are able to show that MinHash can be proved to be an LSH for cosine similarity, by establishing an inequality which bounds resemblance purely by functions of cosine.

Because we evaluate MinHash (which was designed for resemblance) in terms of cosine, we will first illustrate the close connection between these two similarities.

2 Cosine Versus Resemblance

We focus on binary data, which can be viewed as sets (locations of nonzeros). Consider two sets W1, W2 ⊆ Ω = {1, 2, ..., D}. The cosine similarity (S) is

S = a / √(f1 f2), where (1)

f1 = |W1|, f2 = |W2|, a = |W1 ∩ W2| (2)

The resemblance similarity, denoted by R, is

R = R(W1, W2) = |W1 ∩ W2| / |W1 ∪ W2| = a / (f1 + f2 − a) (3)

Clearly these two similarities are closely related. To better illustrate the connection, we re-write R as

R = [a/√(f1 f2)] / [√(f1/f2) + √(f2/f1) − a/√(f1 f2)] = S / (z − S) (4)

z = z(r) = √r + 1/√r ≥ 2 (5)

r = f2/f1 = f1 f2 / f1² ≤ f1 f2 / a² = 1/S² (6)

There are two degrees of freedom, f2/f1 and a/f2, which makes this form inconvenient for analysis. Fortunately, in Theorem 1, we can bound R purely by functions of S.

Theorem 1

S² ≤ R ≤ S / (2 − S) (7)

Tightness: Without making assumptions on the data, neither the lower bound S² nor the upper bound S/(2 − S) can be improved in the domain of continuous functions.

Data dependent bound: If the data satisfy z ≤ z*, where z is defined in (5), then

S / (z* − S) ≤ R ≤ S / (2 − S) (8)

Proof: See Appendix A.

Figure 1: Upper (in red) and lower (in blue) bounds in Theorem 1, which overlap in the high similarity region.

Figure 1 illustrates that in the high similarity region, the upper and lower bounds essentially overlap. Note that, in order to obtain S ≈ 1, we need f1 ≈ f2 (i.e., z ≈ 2).

While the high similarity region is often of interest, we must also handle data in the low similarity region, because in a realistic dataset the majority of the pairs are usually not similar. Interestingly, we observe that for the six datasets in Table 1, we often have R = S/(z − S) with z only being slightly larger than 2; see Figure 2.

Table 1: Datasets

Dataset    # Query   # Train   # Dim
MNIST       10,000    60,000         784
NEWS20       2,000    18,000   1,355,191
NYTIMES      5,000   100,000     102,660
RCV1         5,000   100,000      47,236
URL          5,000    90,000   3,231,958
WEBSPAM      5,000   100,000  16,609,143

Figure 2: Frequencies of the z values for all six datasets in Table 1, where z is defined in (5). We compute z for every query-train pair of data points.
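The definitions (1)-(3) and the bounds in Theorem 1 are easy to check numerically. A minimal sketch (ours, not from the paper; the universe size and set sizes are arbitrary toy values):

```python
import math
import random

# Quick numeric sanity check (illustrative, not from the paper) of Eqs. (1)-(3)
# and of Theorem 1: S^2 <= R <= S/(2 - S) on random pairs of binary sets.

def cosine(W1, W2):
    """S = a / sqrt(f1 * f2), Eqs. (1)-(2)."""
    return len(W1 & W2) / math.sqrt(len(W1) * len(W2))

def resemblance(W1, W2):
    """R = a / (f1 + f2 - a), Eq. (3)."""
    return len(W1 & W2) / len(W1 | W2)

rng = random.Random(0)
for _ in range(1000):
    W1 = set(rng.sample(range(200), rng.randint(1, 100)))
    W2 = set(rng.sample(range(200), rng.randint(1, 100)))
    S, R = cosine(W1, W2), resemblance(W1, W2)
    assert S**2 - 1e-12 <= R <= S / (2 - S) + 1e-12
print("Theorem 1 bounds hold on 1000 random pairs")
```

Note the equality cases: identical sets give S = R = 1, and a singleton subset (a = f1 = 1) attains the lower bound R = S² exactly.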


For each dataset, we compute both cosine and resemblance for every query-train pair (e.g., 10,000 × 60,000 pairs for the MNIST dataset). For each query point, we rank its similarities to all training points in descending order. We examine the top-1000 locations as in Figure 3. In the left panels, for every top location, we plot the median (among all query points) of the similarities, separately for cosine (dashed) and resemblance (solid), together with the lower and upper bounds of R (dot-dashed). We can see that for NEWS20, NYTIMES, and RCV1, the data are not too similar. Interestingly, for all six datasets, R matches fairly well with the upper bound S/(2 − S). In other words, the lower bound S² can be very conservative even in the low similarity region.

The right panels of Figure 3 present the comparisons of the orderings of similarities in an interesting way. For every query point, we rank the training points in descending order of similarities, separately for cosine and resemblance. This way, for every query point we have two lists of numbers (of the data points). We truncate the lists at top-T and compute the resemblance between the two lists. By varying T from 1 to 1000, we obtain a curve which roughly measures the "similarity" of cosine and resemblance. We present the averaged curve over all query points. Clearly, Figure 3 shows there is a strong correlation between the two measures in all datasets, as one would expect.

Figure 3: Left panels: For each query point, we rank its similarities to all training points in descending order. For every top location, we plot the median (among all query points) of the similarities, separately for cosine (dashed) and resemblance (solid), together with the lower and upper bounds of R (dot-dashed). Right panels: For every query point, we rank the training points in descending order of similarities, separately for cosine and resemblance. We plot the resemblance of the two ranked lists at top-T (T = 1 to 1000).

3 Locality Sensitive Hashing (LSH)

A common formalism for the approximate near neighbor problem is the c-approximate near neighbor, or c-NN.

Definition (c-Approximate Near Neighbor or c-NN): Given a set P of points in a d-dimensional space R^d, and parameters S0 > 0, δ > 0, construct a data structure which, given any query point q, does the following with probability 1 − δ: if there exists an S0-near neighbor of q in P, it reports some cS0-near neighbor of q in P.

The usual notion of an S0-near neighbor is in terms of the distance function. Since we are dealing with similarities, we can equivalently define an S0-near neighbor of point q as a point p with Sim(q, p) ≥ S0, where Sim is the similarity function of interest.

A popular technique for c-NN uses the underlying theory of Locality Sensitive Hashing (LSH) [16]. LSH is a family of functions with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones. In formal terms, consider H a family of hash functions mapping R^D to some set S.

Definition (Locality Sensitive Hashing): A family H is called (S0, cS0, p1, p2)-sensitive if, for any two points x, y ∈ R^d, a hash function h chosen uniformly from H satisfies the following:


• if Sim(x, y) ≥ S0 then Pr_H(h(x) = h(y)) ≥ p1

• if Sim(x, y) ≤ cS0 then Pr_H(h(x) = h(y)) ≤ p2

For approximate near neighbor search, typically p1 > p2 and c < 1 is needed. Since we are defining neighbors in terms of similarity we have c < 1; to get the distance analogy we can use the transformation D(x, y) = 1 − Sim(x, y), with a requirement of c > 1.

The definition of the LSH family H is tightly linked with the similarity function of interest Sim. An LSH allows us to construct data structures that give provably efficient query time algorithms for the c-NN problem.

Fact: Given a family of (S0, cS0, p1, p2)-sensitive hash functions, one can construct a data structure for c-NN with O(n^ρ log n) query time, where ρ = log(1/p1) / log(1/p2).

The quantity ρ < 1 measures the efficiency of a given LSH; the smaller the better. In theory, in the worst case, the number of points scanned by a given LSH to find a c-approximate near neighbor is O(n^ρ) [16], which is dependent on ρ. Thus, given two LSHs for the same c-NN problem, the LSH with a smaller value of ρ will achieve the same approximation guarantee and at the same time will have faster query time. An LSH with a lower value of ρ will report fewer points from the database as the potential near neighbors. These reported points need additional re-ranking to find the true c-approximate near neighbor, which is a costly step. It should be noted that the efficiency of an LSH scheme, the ρ value, depends on many things: in particular, on the similarity threshold S0 and on the value of c, which is the approximation parameter.

3.1 Resemblance Similarity and MinHash

Minwise hashing [4] is the LSH for resemblance similarity. The minwise hashing family applies a random permutation π: Ω → Ω on the given set W, and stores only the minimum value after the permutation mapping. Formally, MinHash is defined as:

h_π^min(W) = min(π(W)). (9)

Given sets W1 and W2, it can be shown by an elementary probability argument that

Pr(h_π^min(W1) = h_π^min(W2)) = |W1 ∩ W2| / |W1 ∪ W2| = R. (10)

It follows from (10) that minwise hashing is an (R0, cR0, R0, cR0)-sensitive family of hash functions when the similarity function of interest is resemblance, i.e., R. It has efficiency ρ = log R0 / log cR0 for approximate resemblance-based search.

3.2 SimHash and Cosine Similarity

SimHash is another popular LSH, for the cosine similarity measure, which originates from the concept of sign random projections (SRP) [8]. Given a vector x, SRP utilizes a random vector w with each component generated from i.i.d. normal, i.e., w_i ∼ N(0, 1), and only stores the sign of the projected data. Formally, SimHash is given by

h_w^sim(x) = sign(w^T x). (11)

It was shown in [12] that the collision under SRP satisfies the following equation:

Pr(h_w^sim(x) = h_w^sim(y)) = 1 − θ/π, where θ = cos⁻¹( x^T y / (‖x‖₂ ‖y‖₂) ). (12)

The term x^T y / (‖x‖₂ ‖y‖₂) is the cosine similarity for data vectors x and y, which becomes S = a/√(f1 f2) when the data are binary. Since 1 − θ/π is monotonic with respect to the cosine similarity S, Eq. (12) implies that SimHash is an

(S0, cS0, 1 − cos⁻¹(S0)/π, 1 − cos⁻¹(cS0)/π)

sensitive hash function, with efficiency ρ = log(1 − cos⁻¹(S0)/π) / log(1 − cos⁻¹(cS0)/π).

4 Theoretical Comparisons

We would like to highlight here that the ρ values for MinHash and SimHash, shown in the previous section, are not directly comparable because they are in the context of different similarity measures. Consequently, it was not clear, before our work, if there is any theoretical way of finding conditions under which MinHash is preferable over SimHash and vice versa. It turns out that the two-sided bounds in Theorem 1 allow us to prove that MinHash is also an LSH for cosine similarity.

4.1 MinHash as an LSH for Cosine Similarity

We fix our gold standard similarity measure to be the cosine similarity, Sim = S. Theorem 1 leads to two simple corollaries:

Corollary 1 If S(x, y) ≥ S0, then we have

Pr(h_π^min(x) = h_π^min(y)) = R(x, y) ≥ S0²

Corollary 2 If S(x, y) ≤ cS0, then we have

Pr(h_π^min(x) = h_π^min(y)) = R(x, y) ≤ cS0 / (2 − cS0)

An immediate consequence of these two corollaries, combined with the definition of LSH, is the following:

Theorem 2 For binary data, MinHash is an (S0, cS0, S0², cS0/(2 − cS0))-sensitive family of hash functions for cosine similarity, with ρ = log S0² / log(cS0/(2 − cS0)).

4.2 1-bit Minwise Hashing

SimHash generates a single bit output (only the signs) whereas MinHash generates an integer value.
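The two collision probabilities, Eq. (10) for MinHash and Eq. (12) for SimHash, can be verified by a short simulation. A minimal sketch (ours, not the authors' code); the universe size D, the particular sets, and the number of repetitions are arbitrary illustrative choices:

```python
import random

# Illustrative simulation (not the authors' code) of the collision probabilities:
# MinHash collides with probability R, Eq. (10); SimHash with 1 - theta/pi, Eq. (12).

def minhash(W, perm):
    """h_pi^min(W) = min(pi(W)), Eq. (9); perm[i] is pi(i)."""
    return min(perm[i] for i in W)

def simhash(W, w):
    """Sign of w^T x for the binary vector x with support W, Eq. (11)."""
    return 1 if sum(w[i] for i in W) >= 0 else -1

D = 200
W1, W2 = set(range(0, 120)), set(range(60, 180))  # a = 60, f1 = f2 = 120
# Resemblance R = 60/180 = 1/3; cosine S = 60/120 = 1/2, so theta = pi/3.

rng = random.Random(0)
n, mh_coll, sh_coll = 5000, 0, 0
for _ in range(n):
    perm = rng.sample(range(D), D)                # a uniformly random permutation
    mh_coll += minhash(W1, perm) == minhash(W2, perm)
    w = [rng.gauss(0.0, 1.0) for _ in range(D)]   # i.i.d. N(0,1) projection vector
    sh_coll += simhash(W1, w) == simhash(W2, w)

print(mh_coll / n)  # should be close to R = 1/3
print(sh_coll / n)  # should be close to 1 - (pi/3)/pi = 2/3
```

Note that for disjoint sets the MinHash values can never collide (the argmin element would have to be shared), matching R = 0 in Eq. (10).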


The recently proposed b-bit minwise hashing [22] provides a simple strategy to generate an informative single bit output from MinHash, by using the parity of the MinHash values:

h_π^{min,1bit}(W1) = 1 if h_π^min(W1) is odd, and 0 otherwise. (13)

For 1-bit MinHash and very sparse data (i.e., f1/D → 0, f2/D → 0), we have the following collision probability:

Pr(h_π^{min,1bit}(W1) = h_π^{min,1bit}(W2)) = (R + 1) / 2. (14)

The analysis presented in the previous sections allows us to theoretically analyze this new scheme. The inequality in Theorem 1 can be modified for (R + 1)/2, and using similar arguments as for MinHash we obtain:

Theorem 3 For binary data, 1-bit MH (minwise hashing) is an (S0, cS0, (S0² + 1)/2, 1/(2 − cS0))-sensitive family of hash functions for cosine similarity, with ρ = log(2/(S0² + 1)) / log(2 − cS0).

4.3 Worst Case Gap Analysis

We will compare the gap (ρ) values of the three hashing methods we have studied:

SimHash: ρ = log(1 − cos⁻¹(S0)/π) / log(1 − cos⁻¹(cS0)/π) (15)

MinHash: ρ = log S0² / log(cS0/(2 − cS0)) (16)

1-bit MH: ρ = log(2/(S0² + 1)) / log(2 − cS0) (17)

This is a worst case analysis. We know the lower bound S² ≤ R is usually very conservative in real data when the similarity level is low. Nevertheless, for the high similarity region, the comparisons of the ρ values indicate that MinHash significantly outperforms SimHash, as shown in Figure 4, at least for S0 ≥ 0.8.

Figure 4: Worst case gap (ρ) analysis, i.e., (15), (16), (17), for the high similarity region (S0 = 0.95, 0.9, 0.8, 0.7); lower is better.

4.4 Restricted Worst Case Gap Analysis

The worst case analysis does not make any assumption on the data. It is obviously too conservative when the data are not too similar. Figure 2 has demonstrated that in real data, we can fairly safely replace the lower bound S² with S/(z − S) for some z which, defined in (5), is very close to 2 (for example, 2.1). If we are willing to make this assumption, then we can go through the same analysis for MinHash as an LSH for cosine and compute the corresponding ρ values:

MinHash: ρ = log(S0/(z − S0)) / log(cS0/(2 − cS0)) (18)

1-bit MH: ρ = log(2(z − S0)/z) / log(2 − cS0) (19)

Note that this is still a worst case analysis (and hence can still be very conservative). Figure 5 presents the ρ values for this restricted worst case gap analysis, for two values of z (2.1 and 2.3) and S0 as small as 0.2. The results confirm that MinHash still significantly outperforms SimHash even in the low similarity region.

Figure 5: Restricted worst case gap (ρ) analysis by assuming the data satisfy S/(z − S) ≤ R ≤ S/(2 − S), where z is defined in (5). The ρ values for MinHash and 1-bit MinHash are expressed in (18) and (19), respectively.
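The gap formulas are straightforward to evaluate numerically. A small sketch (ours, for illustration; the particular S0, c, and z values are arbitrary sample points): at high similarity the worst-case MinHash ρ of Eq. (16) is already below SimHash's Eq. (15), and under the restricted assumption of Eq. (18) MinHash stays below SimHash even at moderate similarity.

```python
import math

# Numeric comparison of the gap values (illustrative): Eqs. (15), (16), (18).
# Smaller rho means fewer retrieved points for the same c-NN guarantee.

def rho_simhash(S0, c):
    """Eq. (15)."""
    return math.log(1 - math.acos(S0) / math.pi) / math.log(1 - math.acos(c * S0) / math.pi)

def rho_minhash(S0, c):
    """Eq. (16): worst case, using the lower bound R >= S^2."""
    return math.log(S0 ** 2) / math.log(c * S0 / (2 - c * S0))

def rho_minhash_restricted(S0, c, z):
    """Eq. (18): restricted worst case, assuming R >= S/(z - S)."""
    return math.log(S0 / (z - S0)) / math.log(c * S0 / (2 - c * S0))

for S0 in (0.9, 0.5):
    for c in (0.5, 0.7):
        print(S0, c,
              round(rho_simhash(S0, c), 3),
              round(rho_minhash(S0, c), 3),
              round(rho_minhash_restricted(S0, c, 2.1), 3))
```

For example, at S0 = 0.9 and c = 0.5 the worst-case MinHash ρ is well below SimHash's, consistent with Figure 4; at S0 = 0.5 the restricted value with z = 2.1 remains below SimHash's, consistent with Figure 5.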


Both Figure 4 and Figure 5 show that 1-bit MinHash can be less competitive when the similarity is not high. This is expected, as analyzed in the original paper of b-bit minwise hashing [20]. The remedy is to use more bits. As shown in Figure 6, once we use b = 8 (or even b = 4) bits, the performance of b-bit minwise hashing is not much different from MinHash.

Figure 6: Restricted worst case gap (ρ) analysis for b-bit minwise hashing, for b = 1, 2, 4, 8.

4.5 Idealized Case Gap Analysis

The restricted worst case analysis can still be very conservative and may not fully explain the stunning performance of MinHash in our experiments on datasets of low similarities. Here, we also provide an analysis based on a fixed z value. That is, we only analyze the gap ρ by assuming R = S/(z − S) for a fixed z. We call this the idealized gap analysis. Not surprisingly, Figure 7 confirms that, with this assumption, MinHash significantly outperforms SimHash even for extremely low similarity. We should keep in mind that this idealized gap analysis can be somewhat optimistic and should only be used as side information.

Figure 7: Idealized case gap (ρ) analysis by assuming R = S/(z − S) for a fixed z (z = 2 and z = 2.5 in the plots).

5 Experiments

We evaluate both MinHash and SimHash in the actual task of retrieving top-k near neighbors. We implemented the standard (K, L)-parameterized LSH [16] algorithms with both MinHash and SimHash. That is, we concatenate K hash functions to form a new hash function for each table, and we generate L such tables (see [2] for more details about the implementation). We used all six binarized datasets with the query and training partitions as shown in Table 1. For each dataset, elements from the training partition were used for constructing hash tables, while the elements of the query partition were used as queries for top-k neighbor search. For every query, we compute the gold standard top-k near neighbors using the cosine similarity as the underlying similarity measure.

In the standard (K, L)-parameterized bucketing scheme, the choice of K and L depends on the similarity thresholds and the hash function under consideration. In the task of top-k near neighbor retrieval, the similarity thresholds vary with the datasets. Hence, the actual choice of ideal K and L is difficult to determine. To ensure that this choice does not affect our evaluations, we implemented all the combinations of K ∈ {1, 2, ..., 30} and L ∈ {1, 2, ..., 200}. These combinations include the reasonable choices for both hash functions and different threshold levels.

For each combination of (K, L) and for both of the hash functions, we computed the mean recall of the top-k gold standard neighbors, along with the average number of points reported per query. We then compute the least number of points needed, by each of the two hash functions, to achieve a given percentage of recall of the gold standard top-k, where the least was computed over the choices of K and L. We are therefore ensuring the best over all the choices of K and L for each hash function independently. This eliminates the effect of K and L, if any, in the evaluations.

The plots of the fraction of points retrieved at different recall levels, for k = 1, 10, 20, 100, are in Figure 8. A good hash function, at a given recall, should retrieve fewer points. MinHash needs to evaluate a significantly smaller fraction of the total data points than SimHash to achieve a given recall. MinHash is consistently better than SimHash, in most cases very significantly, irrespective of the choice of dataset and k. It should be noted that our gold standard measure for computing the top-k neighbors is cosine similarity. This should favor SimHash because it was the only known LSH for cosine similarity. Despite this "disadvantage", MinHash still outperforms SimHash in top near neighbor search with cosine similarity. This nicely confirms our theoretical gap analysis.
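The (K, L) bucketing scheme described above can be sketched in a few lines (ours, not the authors' implementation; the toy data, universe size D, and parameter values are made up for illustration):

```python
import random
from collections import defaultdict

# Minimal sketch of the (K, L) parameterized LSH scheme (illustrative): each of
# the L tables keys points by a tuple of K concatenated MinHash values; a
# query's candidate set is the union of its L buckets.

def minhash_key(W, perms):
    """Concatenate K MinHash values into one bucket key."""
    return tuple(min(p[i] for i in W) for p in perms)

def build_tables(data, D, K, L, seed=0):
    rng = random.Random(seed)
    all_perms = [[rng.sample(range(D), D) for _ in range(K)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for idx, W in enumerate(data):
        for perms, table in zip(all_perms, tables):
            table[minhash_key(W, perms)].append(idx)
    return all_perms, tables

def query(W, all_perms, tables):
    candidates = set()
    for perms, table in zip(all_perms, tables):
        candidates.update(table.get(minhash_key(W, perms), []))
    return candidates

data = [set(range(0, 50)), set(range(5, 55)), set(range(60, 100))]
all_perms, tables = build_tables(data, D=100, K=2, L=10)
print(query(set(range(0, 50)), all_perms, tables))  # item 2 (disjoint) is never a candidate
```

Larger K makes buckets more selective (collision probability R^K per table), while larger L recovers recall; this is the trade-off swept over K ∈ {1, ..., 30} and L ∈ {1, ..., 200} in the experiments.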


Figure 8: Fraction of data points retrieved (y-axis) in order to achieve a specified recall (x-axis), for comparing SimHash with MinHash. Lower is better. We use top-k (cosine similarities) as the gold standard for k = 1, 10, 20, 100. For all six binarized datasets, MinHash significantly outperforms SimHash. For example, to achieve a 90% recall for top-1 on MNIST, MinHash needs to scan, on average, 0.6% of the data points while SimHash has to scan 5%. For fair comparisons, we present the optimum outcomes (i.e., the smallest fraction of data points) separately for MinHash and SimHash, by searching a wide range of parameters (K, L), where K ∈ {1, 2, ..., 30} is the number of hash functions per table and L ∈ {1, 2, ..., 200} is the number of tables.


0.05 0.2 ing for online advertising [25], compressing social net- MNIST (Real): Top 1 MNIST (Real): Top 10 0.04 SimHash SimHash works [9], advertising diversification [13], graph sam- 0.15 MinHash MinHash pling [24], Web graph compression [5], etc. Further- 0.03 0.1 more, the recent development of one permutation hash- 0.02 ing [21, 27] has substantially reduced the preprocessing 0.05 Fraction Retrieved 0.01 Fraction Retrieved costs of MinHash, making the method more practical.

0 0 0.4 0.5 0.6 0.7 0.8 0.9 1 0.4 0.5 0.6 0.7 0.8 0.9 1 In machine learning research literature, however, it ap- Recall Recall pears that SimHash is more popular for approximate 0.2 0.2 MNIST (Real): Top 20 MNIST (Real): Top 100 near neighbor search. We believe part of the reason is 0.15 SimHash 0.15 SimHash that researchers tend to use the cosine similarity, for MinHash MinHash 0.1 0.1 which SimHash can be directly applied.

0.05 0.05 It is usually taken for granted that MinHash and Fraction Retrieved Fraction Retrieved SimHash are theoretically incomparable and the choice 0 0 0.4 0.5 0.6 0.7 0.8 0.9 1 0.4 0.5 0.6 0.7 0.8 0.9 1 between them is decided based on whether the desired Recall Recall notion of similarity is cosine similarity or resemblance. 0.2 1 RCV1 (Real): Top 1 RCV1 (Real): Top 10 This paper has shown that MinHash is provably a bet- 0.8 0.15 SimHash SimHash ter LSH than SimHash even for cosine similarity. Our MinHash MinHash 0.6 analysis provides a first provable way of comparing two 0.1 0.4 LSHs devised for different similarity measures. Theo- 0.05 retical and experimental evidence indicates significant

To conclude this section, we also add a set of experiments using the original (real-valued) data, for MNIST and RCV1. We apply SimHash on the original data and MinHash on the binarized data, and we evaluate the retrieval results based on the cosine similarities of the original data. This set-up places MinHash at a considerable disadvantage compared to SimHash. Nevertheless, we can see from Figure 9 that MinHash still noticeably outperforms SimHash, although the improvements are not as significant compared to the experiments on binarized data (Figure 8).

[Figure 9: Retrieval experiments on the original real-valued data (panels shown: RCV1 (Real), Top 20 and Top 100; curves: SimHash and MinHash; axes: Recall vs. Fraction Retrieved). We apply SimHash on the original data and MinHash on the binarized data, and we evaluate the retrieval results based on the cosine similarity of the original data. MinHash still outperforms SimHash.]

6 Conclusion

Minwise hashing (MinHash), originally designed for detecting duplicate web pages [3, 10, 15], has been widely adopted in the search industry, with numerous applications, for example, large-scale machine learning systems [23, 21], Web spam [29, 18], content matching [25], [...] computational advantage of using MinHash in place of SimHash. Since LSH is a concept studied by a wide variety of researchers and practitioners, we believe that the results shown in this paper will be useful from both a theoretical as well as a practical point of view.

Acknowledgements: Anshumali Shrivastava is a Ph.D. student supported by NSF (DMS0808864, SES1131848, III1249316) and ONR (N00014-13-1-0764). Ping Li is partially supported by AFOSR (FA9550-13-1-0137), ONR (N00014-13-1-0764), and NSF (III1360971, BIGDATA1419210).

A Proof of Theorem 1

The only less obvious step is the proof of tightness.

Tightness of the upper bound S/(2−S): Let a continuous function f(S) be a sharper upper bound, i.e., R ≤ f(S) ≤ S/(2−S). For any rational S = p/q, with p, q ∈ N and p ≤ q, choose f1 = f2 = q and a = p. Note that f1, f2 and a are positive integers. This choice leads to S/(2−S) = R = p/(2q−p). Thus, the upper bound is achievable for all rational S. Hence, it must be the case that f(S) = S/(2−S) = R for all rational values of S. For any real number c ∈ [0, 1], there exists a Cauchy sequence of rational numbers {r1, r2, ..., rn, ...} such that rn ∈ Q and lim_{n→∞} rn = c. Since all the rn's are rational, f(rn) = rn/(2−rn). From the continuity of both f and S/(2−S), we have f(lim_{n→∞} rn) = lim_{n→∞} rn/(2−rn), which implies f(c) = c/(2−c) for all c ∈ [0, 1].

Tightness of the lower bound S²: For a rational p/q with p ≤ q, choosing f2 = a = p and f1 = q gives R = p/q and S = p/√(pq) = √(p/q), i.e., an infinite set of points with R = S². We now use arguments similar to those in the proof of tightness of the upper bound; all we need is the existence of a Cauchy sequence of square roots of rational numbers converging to any real c ∈ [0, 1].
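As a quick numerical sanity check of Theorem 1 and of the two constructions used in this appendix (our own sketch, not code from the paper; the helper name `sims` is ours), the script below samples binary-vector profiles (f1, f2, a) and verifies S² ≤ R ≤ S/(2−S), then checks the two profiles that achieve the bounds:

```python
import math
import random

def sims(f1, f2, a):
    """Resemblance R and cosine S for binary vectors with
    f1 nonzeros, f2 nonzeros, and a nonzeros in common."""
    R = a / (f1 + f2 - a)
    S = a / math.sqrt(f1 * f2)
    return R, S

# The bounds S^2 <= R <= S/(2-S) hold for every feasible profile (f1, f2, a).
random.seed(0)
for _ in range(1000):
    f1 = random.randint(1, 100)
    f2 = random.randint(1, 100)
    a = random.randint(0, min(f1, f2))
    R, S = sims(f1, f2, a)
    assert S * S <= R + 1e-12 and R <= S / (2 - S) + 1e-12

# Upper bound attained with f1 = f2 = q, a = p (so S = p/q, R = p/(2q-p)).
p, q = 3, 7
R, S = sims(q, q, p)
assert abs(R - S / (2 - S)) < 1e-12

# Lower bound attained with f1 = q, f2 = a = p (so S = sqrt(p/q), R = p/q).
R, S = sims(q, p, p)
assert abs(R - S * S) < 1e-12
```

Both bounds are met exactly on integer profiles, which is why, as the proof argues, neither bound can be replaced by a sharper continuous function.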

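For readers who want to see the two collision probabilities compared in this paper side by side, here is a minimal simulation (our own illustration, not code from the paper): a MinHash collision occurs with probability R, while a SimHash (signed random projection) collision occurs with probability 1 − arccos(S)/π.

```python
import math
import random

random.seed(0)
D = 200                                # size of the binary universe
x = set(range(0, 60))                  # nonzero coordinates of vector 1
y = set(range(30, 90))                 # nonzero coordinates of vector 2

a, f1, f2 = len(x & y), len(x), len(y)
R = a / (f1 + f2 - a)                  # resemblance (1/3 for this pair)
S = a / math.sqrt(f1 * f2)             # cosine similarity (1/2 for this pair)

def minhash(v, perm):
    # smallest permuted index among the nonzero coordinates
    return min(perm[i] for i in v)

def simhash(v, w):
    # sign bit of the projection onto a random Gaussian direction
    return sum(w[i] for i in v) >= 0

trials, mh, sh = 10000, 0, 0
for _ in range(trials):
    perm = list(range(D))
    random.shuffle(perm)
    mh += minhash(x, perm) == minhash(y, perm)
    w = [random.gauss(0.0, 1.0) for _ in range(D)]
    sh += simhash(x, w) == simhash(y, w)

print(mh / trials)   # close to R
print(sh / trials)   # close to 1 - acos(S)/pi
```

The empirical collision rates match the closed forms, which is exactly the property that makes each scheme an LSH for its respective similarity.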

References

[1] Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, and John Langford. A reliable effective terascale linear learning system. Technical report, arXiv:1110.4198, 2011.

[2] Alexandr Andoni and Piotr Indyk. E2LSH: Exact Euclidean locality sensitive hashing. Technical report, 2004.

[3] Andrei Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997.

[4] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. In STOC, pages 327–336, Dallas, TX, 1998.

[5] Gregory Buehrer and Kumar Chellapilla. A scalable pattern mining approach to web graph compression with communities. In WSDM, pages 95–106, Stanford, CA, 2008.

[6] Tushar Chandra, Eugene Ie, Kenneth Goldman, Tomas Lloret Llinares, Jim McFadden, Fernando Pereira, Joshua Redstone, Tal Shaked, and Yoram Singer. Sibyl: a system for large scale machine learning. Technical report, 2010.

[7] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5):1055–1064, 1999.

[8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, Montreal, Quebec, Canada, 2002.

[9] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. On compressing social networks. In KDD, pages 219–228, Paris, France, 2009.

[10] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003.

[11] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975.

[12] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115–1145, 1995.

[13] Sreenivas Gollapudi and Aneesh Sharma. An axiomatic approach for result diversification. In WWW, pages 381–390, Madrid, Spain, 2009.

[14] Matthias Hein and Olivier Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pages 136–143, Barbados, 2005.

[15] Monika Rauch Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pages 284–291, 2006.

[16] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.

[17] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, Amsterdam, Netherlands, 2007.

[18] Nitin Jindal and Bing Liu. Opinion spam and analysis. In WSDM, pages 219–230, Palo Alto, California, USA, 2008.

[19] Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, BC, Canada, 2006.

[20] Ping Li and Arnd Christian König. b-bit minwise hashing. In Proceedings of the 19th International Conference on World Wide Web, pages 671–680, Raleigh, NC, 2010.

[21] Ping Li, Art B. Owen, and Cun-Hui Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012.

[22] Ping Li, Anshumali Shrivastava, and Arnd Christian König. b-bit minwise hashing in practice. In Internetware, Changsha, China, 2013.

[23] Ping Li, Anshumali Shrivastava, Joshua Moore, and Arnd Christian König. Hashing algorithms for large-scale learning. In NIPS, Granada, Spain, 2011.

[24] Marc Najork, Sreenivas Gollapudi, and Rina Panigrahy. Less is more: sampling the neighborhood graph makes salsa better and faster. In WSDM, pages 242–251, Barcelona, Spain, 2009.

[25] Sandeep Pandey, Andrei Z. Broder, Flavio Chierichetti, Vanja Josifovski, Ravi Kumar, and Sergei Vassilvitskii. Nearest-neighbor caching for content-match applications. In WWW, pages 441–450, Madrid, Spain, 2009.

[26] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In ECML, Bristol, UK, 2012.

[27] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, 2014.

[28] Simon Tong. Lessons learned developing a practical large scale machine learning system. http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html, 2008.

[29] Tanguy Urvoy, Emmanuel Chauveau, Pascal Filoche, and Thomas Lavergne. Tracking web spam with html style similarities. ACM Trans. Web, 2(1):1–28, 2008.

[30] Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In NIPS, 2008.
