In Defense of MinHash Over SimHash
Anshumali Shrivastava
Department of Computer Science, Computing and Information Science
Cornell University, Ithaca, NY, USA

Ping Li
Department of Statistics and Biostatistics, Department of Computer Science
Rutgers University, Piscataway, NJ, USA

Abstract

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as is common in practice, for example in search.

The collision probability of MinHash is a function of resemblance similarity ($\mathcal{R}$), while the collision probability of SimHash is a function of cosine similarity ($\mathcal{S}$). To provide a common basis for comparison, we evaluate retrieval results in terms of $\mathcal{S}$ for both MinHash and SimHash. This evaluation is valid because we can prove that MinHash is a valid LSH with respect to $\mathcal{S}$, using the general inequality $\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$. Our worst-case analysis shows that MinHash significantly outperforms SimHash in the high similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, $\mathcal{R} \geq \frac{\mathcal{S}}{z-\mathcal{S}}$ often holds with $z$ only slightly larger than 2 (e.g., $z \leq 2.1$). Our restricted worst-case analysis, which assumes $\frac{\mathcal{S}}{z-\mathcal{S}} \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$, shows that MinHash indeed significantly outperforms SimHash even in the low similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.

1 Introduction

The advent of the Internet has led to the generation of massive and inherently high-dimensional data. In many industrial applications, the size of the datasets has long exceeded the memory capacity of a single machine. In web domains, it is not difficult to find datasets with the number of instances and the number of dimensions going into billions [1, 6, 28].

The reality that web data are typically sparse and high-dimensional is due to the wide adoption of the "Bag of Words" (BoW) representation for documents and images. In BoW representations, it is known that the word frequency within a document follows a power law: most words occur rarely in a document, and most of the higher-order shingles in a document occur only once. It is often the case that just the presence or absence information suffices in practice [7, 14, 17, 23]. Leading search companies routinely use sparse binary representations in their large data systems [6].

Locality sensitive hashing (LSH) [16] is a general framework of indexing techniques devised for efficiently solving the approximate near neighbor search problem [11]. The performance of LSH largely depends on the underlying hashing method. Two popular hashing algorithms are MinHash [3] and SimHash (sign normal random projections) [8].

MinHash is an LSH for resemblance similarity, which is defined over binary vectors, while SimHash is an LSH for cosine similarity, which works for general real-valued data. With the abundance of binary data over the web, it has become a practically important question: which LSH should be preferred for binary data? This question has not been adequately answered in the existing literature.
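To make the two hash families concrete, the following is a minimal illustrative sketch (our own, not code from the paper): MinHash hashes a binary vector, viewed as the set of its nonzero coordinates, through random permutations, and two sets collide with probability equal to their resemblance; SimHash takes the sign of a Gaussian random projection, and two vectors collide with a probability that is a monotone function of their cosine similarity. All names, sizes, and seeds below are illustrative choices.

    import numpy as np

    D = 1000  # illustrative vocabulary size (our choice)
    rng = np.random.default_rng(0)

    def minhash_signature(nonzeros, num_hashes=100):
        # One entry per random permutation pi of {0,...,D-1}: h(W) = min over i in W of pi(i).
        sig = np.empty(num_hashes, dtype=np.int64)
        for k in range(num_hashes):
            perm = np.random.RandomState(k).permutation(D)  # same permutation for every input
            sig[k] = min(perm[i] for i in nonzeros)
        return sig

    def simhash_signature(x, num_hashes=100):
        # Signs of projections onto fixed Gaussian directions (sign random projections).
        proj = np.random.RandomState(2014).standard_normal((num_hashes, x.shape[0]))
        return (proj @ x >= 0).astype(np.int8)

    # Two sparse binary vectors, represented both as sets and as dense 0/1 arrays.
    W1 = set(map(int, rng.choice(D, size=120, replace=False)))
    W2 = set(list(W1)[:80]) | set(map(int, rng.choice(D, size=40, replace=False)))
    x1 = np.zeros(D); x1[list(W1)] = 1.0
    x2 = np.zeros(D); x2[list(W2)] = 1.0

    # Fraction of colliding hash values: estimates resemblance R for MinHash,
    # and 1 - theta/pi (with theta = arccos of cosine similarity) for SimHash.
    print("MinHash collision rate:", np.mean(minhash_signature(W1) == minhash_signature(W2)))
    print("SimHash collision rate:", np.mean(simhash_signature(x1) == simhash_signature(x2)))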
There were prior attempts to address this problem from various aspects. For example, the paper on Conditional Random Sampling (CRS) [19] showed that random projections can be very inaccurate, especially on binary data, for the task of inner product estimation (which is not the same as near neighbor search). A more recent paper [26] empirically demonstrated that b-bit minwise hashing [22] outperformed SimHash and spectral hashing [30].

Our contribution: Our paper provides an essentially conclusive answer that MinHash should be used for near neighbor search in binary data, both theoretically and empirically. To favor SimHash, our theoretical analysis and experiments evaluate the retrieval results of MinHash in terms of cosine similarity (instead of resemblance). This is possible because we are able to show that MinHash can be proved to be an LSH for cosine similarity, by establishing an inequality which bounds resemblance purely by functions of cosine.

Figure 1: Upper bound $\frac{\mathcal{S}}{2-\mathcal{S}}$ (in red) and lower bound $\mathcal{S}^2$ (in blue) in Theorem 1, which overlap in the high similarity region.

Because we evaluate MinHash (which was designed for resemblance) in terms of cosine, we will first illustrate the close connection between these two similarities.

2 Cosine Versus Resemblance

We focus on binary data, which can be viewed as sets (locations of nonzeros). Consider two sets $W_1, W_2 \subseteq \Omega = \{1, 2, \ldots, D\}$. The cosine similarity ($\mathcal{S}$) is

$$\mathcal{S} = \frac{a}{\sqrt{f_1 f_2}}, \quad \text{where} \qquad (1)$$
$$f_1 = |W_1|, \quad f_2 = |W_2|, \quad a = |W_1 \cap W_2| \qquad (2)$$

The resemblance similarity, denoted by $\mathcal{R}$, is

$$\mathcal{R} = \mathcal{R}(W_1, W_2) = \frac{|W_1 \cap W_2|}{|W_1 \cup W_2|} = \frac{a}{f_1 + f_2 - a} \qquad (3)$$

Clearly these two similarities are closely related. To better illustrate the connection, we re-write $\mathcal{R}$ as

$$\mathcal{R} = \frac{a/\sqrt{f_1 f_2}}{\sqrt{f_1/f_2} + \sqrt{f_2/f_1} - a/\sqrt{f_1 f_2}} = \frac{\mathcal{S}}{z - \mathcal{S}} \qquad (4)$$

$$z = z(r) = \sqrt{r} + \frac{1}{\sqrt{r}} \geq 2 \qquad (5)$$

$$r = \frac{f_2}{f_1} = \frac{f_1 f_2}{f_1^2} \leq \frac{f_1 f_2}{a^2} = \frac{1}{\mathcal{S}^2} \qquad (6)$$

There are two degrees of freedom, $f_2/f_1$ and $a/f_2$, which makes the analysis inconvenient. Fortunately, in Theorem 1 we can bound $\mathcal{R}$ purely by functions of $\mathcal{S}$.

Theorem 1

$$\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2 - \mathcal{S}} \qquad (7)$$

Tightness: Without making assumptions on the data, neither the lower bound $\mathcal{S}^2$ nor the upper bound $\frac{\mathcal{S}}{2-\mathcal{S}}$ can be improved in the domain of continuous functions.

Data dependent bound: If the data satisfy $z \leq z^*$, where $z$ is defined in (5), then

$$\frac{\mathcal{S}}{z^* - \mathcal{S}} \leq \mathcal{R} \leq \frac{\mathcal{S}}{2 - \mathcal{S}} \qquad (8)$$

Proof: See Appendix A.

Figure 1 illustrates that in the high similarity region, the upper and lower bounds essentially overlap. Note that, in order to obtain $\mathcal{S} \approx 1$, we need $f_1 \approx f_2$ (i.e., $z \approx 2$).

While the high similarity region is often of interest, we must also handle data in the low similarity region, because in a realistic dataset the majority of the pairs are usually not similar. Interestingly, we observe that for the six datasets in Table 1, we often have $\mathcal{R} = \frac{\mathcal{S}}{z-\mathcal{S}}$ with $z$ only slightly larger than 2; see Figure 2.

Table 1: Datasets

Dataset     # Query    # Train     # Dim
MNIST        10,000     60,000            784
NEWS20        2,000     18,000      1,355,191
NYTIMES       5,000    100,000        102,660
RCV1          5,000    100,000         47,236
URL           5,000     90,000      3,231,958
WEBSPAM       5,000    100,000     16,609,143
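As a quick sanity check of identity (4) and the Theorem 1 bounds, the following small computation (our own illustration; the two example sets are arbitrary) evaluates $\mathcal{S}$, $\mathcal{R}$, and $z$ for one pair of sets and verifies $\mathcal{R} = \mathcal{S}/(z-\mathcal{S})$ and $\mathcal{S}^2 \leq \mathcal{R} \leq \mathcal{S}/(2-\mathcal{S})$.

    from math import sqrt

    # Two arbitrary example sets (locations of nonzeros).
    W1 = {1, 2, 3, 4, 5, 6}
    W2 = {4, 5, 6, 7, 8, 9, 10, 11}

    f1, f2, a = len(W1), len(W2), len(W1 & W2)

    S = a / sqrt(f1 * f2)            # cosine similarity, eqs. (1)-(2)
    R = a / (f1 + f2 - a)            # resemblance, eq. (3)
    r = f2 / f1                      # eq. (6)
    z = sqrt(r) + 1 / sqrt(r)        # eq. (5)

    assert abs(R - S / (z - S)) < 1e-12     # identity (4)
    assert S ** 2 <= R <= S / (2 - S)       # Theorem 1, eq. (7)
    print(f"S={S:.4f}, R={R:.4f}, z={z:.4f}, "
          f"bounds=({S**2:.4f}, {S/(2 - S):.4f})")

For this pair, the sizes $f_1$ and $f_2$ are comparable, so $z$ stays close to 2 and $\mathcal{R}$ sits near the upper bound, consistent with the discussion of Figure 1 above.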
Figure 2: Frequencies of the $z$ values for all six datasets in Table 1, where $z$ is defined in (5). We compute $z$ for every query-train pair of data points.

For each dataset, we compute both cosine and resemblance for every query-train pair (e.g., $10000 \times 60000$ pairs for the MNIST dataset). For each query point, we rank its similarities to all training points in descending order. We examine the top-1000 locations as in Figure 3. In the left panels, for every top location, we plot the median (among all query points) of the similarities, separately for cosine (dashed) and resemblance (solid), together with the lower and upper bounds of $\mathcal{R}$ (dot-dashed). We can see that for NEWS20, NYTIMES, and RCV1, the data are not too similar. Interestingly, for all six datasets, $\mathcal{R}$ matches fairly well with the upper bound $\frac{\mathcal{S}}{2-\mathcal{S}}$. In other words, the lower bound $\mathcal{S}^2$ can be very conservative even in the low similarity region.

[Figure 3 (panels in this excerpt): Similarity versus Top Location (left panels) and Resemblance of Rankings versus Top Location (right panels) for MNIST, NEWS20, NYTIMES, ...]

The right panels of Figure 3 present the comparisons of the orderings of similarities in an interesting way.
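The ranking procedure just described can be sketched as follows (our own toy construction on synthetic sparse binary data, not the paper's datasets or code; all sizes and the density are illustrative, and here each similarity measure is ranked by its own ordering).

    import numpy as np

    rng = np.random.default_rng(1)
    n_query, n_train, D = 50, 500, 200        # toy sizes, not the paper's datasets
    Q = (rng.random((n_query, D)) < 0.1).astype(float)   # sparse binary query points
    X = (rng.random((n_train, D)) < 0.1).astype(float)   # sparse binary training points

    a  = Q @ X.T                              # pairwise intersection sizes
    fq = Q.sum(axis=1, keepdims=True)         # f1 for each query
    fx = X.sum(axis=1, keepdims=True).T       # f2 for each training point
    cosine      = a / np.sqrt(fq * fx)        # eq. (1)
    resemblance = a / (fq + fx - a)           # eq. (3)

    top = 100                                 # the paper examines the top-1000 locations
    order_c = np.argsort(-cosine, axis=1)[:, :top]        # per-query ranking by cosine
    order_r = np.argsort(-resemblance, axis=1)[:, :top]   # per-query ranking by resemblance

    # Median (over all queries) similarity at each top location, one curve per measure,
    # analogous to the left panels of Figure 3.
    median_cos = np.median(np.take_along_axis(cosine, order_c, axis=1), axis=0)
    median_res = np.median(np.take_along_axis(resemblance, order_r, axis=1), axis=0)
    print(median_cos[:5])
    print(median_res[:5])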