In Defense of MinHash Over SimHash

Anshumali Shrivastava
Department of Computer Science, Computing and Information Science
Cornell University, Ithaca, NY, USA

Ping Li
Department of Statistics and Biostatistics, Department of Computer Science
Rutgers University, Piscataway, NJ, USA

Abstract

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search.

The collision probability of MinHash is a function of resemblance similarity (R), while the collision probability of SimHash is a function of cosine similarity (S). To provide a common basis for comparison, we evaluate retrieval results in terms of S for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to S, by using a general inequality S² ≤ R ≤ S/(2 − S). Our worst case analysis shows that MinHash significantly outperforms SimHash in the high similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, often R ≥ S/(z − S) holds, where z is only slightly larger than 2 (e.g., z ≤ 2.1). Our restricted worst case analysis, by assuming S/(z − S) ≤ R ≤ S/(2 − S), shows that MinHash indeed significantly outperforms SimHash even in the low similarity region.

We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

1 Introduction

The advent of the Internet has led to the generation of massive and inherently high dimensional data. In many industrial applications, the size of the datasets has long exceeded the memory capacity of a single machine. In web domains, it is not difficult to find datasets with the number of instances and the number of dimensions going into billions [1, 6, 28].

The reality that web data are typically sparse and high dimensional is due to the wide adoption of the "Bag of Words" (BoW) representations for documents and images. In BoW representations, it is known that the word frequency within a document follows a power law. Most of the words occur rarely in a document, and most of the higher order shingles in the document occur only once. It is often the case that just the presence or absence information suffices in practice [7, 14, 17, 23]. Leading search companies routinely use sparse binary representations in their large data systems [6].

Locality sensitive hashing (LSH) [16] is a general framework of indexing technique, devised for efficiently solving the approximate near neighbor search problem [11]. The performance of LSH largely depends on the particular underlying hashing method. Two popular hashing schemes are MinHash [3] and SimHash (sign normal random projections) [8].

MinHash is an LSH for resemblance similarity, which is defined over binary vectors, while SimHash is an LSH for cosine similarity, which works for general real-valued data. With the abundance of binary data over the web, it has become a practically important question: which LSH should be preferred in binary data? This question has not been adequately answered in the existing literature. There were prior attempts to address this problem from various aspects. For example, the paper on Conditional Random Sampling (CRS) [19] showed that random projections can be very inaccurate especially in binary data, for the task of inner product estimation (which is not the same as near neighbor search). A more recent paper [26] empirically demonstrated that b-bit minwise hashing [22] outperformed SimHash and spectral hashing [30].
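The BoW-style sparse binary representations discussed above can be made concrete with a small sketch (ours, not code from the paper): a document is reduced to the set of its word-level w-shingles, keeping only presence/absence, so repeated shingles collapse to a single set element.

```python
# Illustrative sketch (not from the paper): turn a document into the sparse
# binary "set" representation discussed above, using word-level w-shingles.
# Only presence/absence is kept, so repeated shingles collapse to one element.

def shingle_set(text: str, w: int = 3) -> frozenset:
    """Return the set of w-shingles (w consecutive words) of a document."""
    words = text.lower().split()
    return frozenset(" ".join(words[i:i + w]) for i in range(len(words) - w + 1))

doc = "the quick brown fox jumps over the lazy dog"
print(sorted(shingle_set(doc, w=3)))
```

In practice the shingles would be mapped to integer indices in Ω = {1, 2, ..., D}, which is the set representation used throughout the paper.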


Our contribution: Our paper provides an essentially conclusive answer that MinHash should be used for near neighbor search in binary data, both theoretically and empirically. To favor SimHash, our theoretical analysis and experiments evaluate the retrieval results of MinHash in terms of cosine similarity (instead of resemblance). This is possible because we are able to show that MinHash can be proved to be an LSH for cosine similarity, by establishing an inequality which bounds resemblance purely by functions of cosine.

Because we evaluate MinHash (which was designed for resemblance) in terms of cosine, we will first illustrate the close connection between these two similarities.

2 Cosine Versus Resemblance

We focus on binary data, which can be viewed as sets (locations of nonzeros). Consider two sets W1, W2 ⊆ Ω = {1, 2, ..., D}. The cosine similarity (S) is

S = a / √(f1 f2), where (1)

f1 = |W1|, f2 = |W2|, a = |W1 ∩ W2| (2)

The resemblance similarity, denoted by R, is

R = R(W1, W2) = |W1 ∩ W2| / |W1 ∪ W2| = a / (f1 + f2 − a) (3)

Clearly these two similarities are closely related. To better illustrate the connection, we re-write R as

R = [a/√(f1 f2)] / [√(f1/f2) + √(f2/f1) − a/√(f1 f2)] = S / (z − S) (4)

z = z(r) = √r + 1/√r ≥ 2 (5)

r = f2/f1 = f1 f2 / f1² ≤ f1 f2 / a² = 1/S² (6)

There are two degrees of freedom, f2/f1 and a/f2, which makes this form inconvenient for analysis. Fortunately, in Theorem 1, we can bound R purely by functions of S.

Theorem 1

S² ≤ R ≤ S / (2 − S) (7)

Tightness: Without making assumptions on the data, neither the lower bound S² nor the upper bound S/(2 − S) can be improved in the domain of continuous functions.

Data dependent bound: If the data satisfy z ≤ z*, where z is defined in (5), then

S / (z* − S) ≤ R ≤ S / (2 − S) (8)

Proof: See Appendix A.

Figure 1: Upper (in red) and lower (in blue) bounds in Theorem 1, which overlap in the high similarity region.

Figure 1 illustrates that in the high similarity region, the upper and lower bounds essentially overlap. Note that, in order to obtain S ≈ 1, we need f1 ≈ f2 (i.e., z ≈ 2).

While the high similarity region is often of interest, we must also handle data in the low similarity region, because in a realistic dataset the majority of the pairs are usually not similar. Interestingly, we observe that for the six datasets in Table 1, we often have R = S/(z − S) with z only being slightly larger than 2; see Figure 2.

Table 1: Datasets

Dataset    # Query   # Train   # Dim
MNIST       10,000    60,000         784
NEWS20       2,000    18,000   1,355,191
NYTIMES      5,000   100,000     102,660
RCV1         5,000   100,000      47,236
URL          5,000    90,000   3,231,958
WEBSPAM      5,000   100,000  16,609,143

Figure 2: Frequencies of the z values for all six datasets in Table 1, where z is defined in (5). We compute z for every query-train pair of data points.
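The definitions (1)-(3) and the bounds in Theorem 1 are easy to check numerically. A minimal sketch (ours, not from the paper; the universe size and set sizes are arbitrary toy values):

```python
import math
import random

# Quick numeric sanity check (illustrative, not from the paper) of Eqs. (1)-(3)
# and of Theorem 1: S^2 <= R <= S/(2 - S) on random pairs of binary sets.

def cosine(W1, W2):
    """S = a / sqrt(f1 * f2), Eqs. (1)-(2)."""
    return len(W1 & W2) / math.sqrt(len(W1) * len(W2))

def resemblance(W1, W2):
    """R = a / (f1 + f2 - a), Eq. (3)."""
    return len(W1 & W2) / len(W1 | W2)

rng = random.Random(0)
for _ in range(1000):
    W1 = set(rng.sample(range(200), rng.randint(1, 100)))
    W2 = set(rng.sample(range(200), rng.randint(1, 100)))
    S, R = cosine(W1, W2), resemblance(W1, W2)
    assert S**2 - 1e-12 <= R <= S / (2 - S) + 1e-12
print("Theorem 1 bounds hold on 1000 random pairs")
```

Note the equality cases: identical sets give S = R = 1, and a singleton subset (a = f1 = 1) attains the lower bound R = S² exactly.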


For each dataset, we compute both cosine and resemblance for every query-train pair (e.g., 10,000 × 60,000 pairs for the MNIST dataset). For each query point, we rank its similarities to all training points in descending order. We examine the top-1000 locations as in Figure 3. In the left panels, for every top location, we plot the median (among all query points) of the similarities, separately for cosine (dashed) and resemblance (solid), together with the lower and upper bounds of R (dot-dashed). We can see that for NEWS20, NYTIMES, and RCV1, the data are not too similar. Interestingly, for all six datasets, R matches fairly well with the upper bound S/(2 − S). In other words, the lower bound S² can be very conservative even in the low similarity region.

The right panels of Figure 3 present the comparisons of the orderings of similarities in an interesting way. For every query point, we rank the training points in descending order of similarities, separately for cosine and resemblance. This way, for every query point we have two lists of numbers (of the data points). We truncate the lists at top-T and compute the resemblance between the two lists. By varying T from 1 to 1000, we obtain a curve which roughly measures the "similarity" of cosine and resemblance. We present the averaged curve over all query points. Clearly, Figure 3 shows there is a strong correlation between the two measures in all datasets, as one would expect.

Figure 3: Left panels: For each query point, we rank its similarities to all training points in descending order. For every top location, we plot the median (among all query points) of the similarities, separately for cosine (dashed) and resemblance (solid), together with the lower and upper bounds of R (dot-dashed). Right panels: For every query point, we rank the training points in descending order of similarities, separately for cosine and resemblance. We plot the resemblance of the two ranked lists at top-T (T = 1 to 1000).

3 Locality Sensitive Hashing (LSH)

A common formalism for the approximate near neighbor problem is the c-approximate near neighbor, or c-NN.

Definition (c-Approximate Near Neighbor or c-NN): Given a set P of points in a d-dimensional space R^d, and parameters S0 > 0, δ > 0, construct a data structure which, given any query point q, does the following with probability 1 − δ: if there exists an S0-near neighbor of q in P, it reports some cS0-near neighbor of q in P.

The usual notion of an S0-near neighbor is in terms of the distance function. Since we are dealing with similarities, we can equivalently define an S0-near neighbor of point q as a point p with Sim(q, p) ≥ S0, where Sim is the similarity function of interest.

A popular technique for c-NN uses the underlying theory of Locality Sensitive Hashing (LSH) [16]. LSH is a family of functions with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones. In formal terms, consider H a family of hash functions mapping R^D to some set S.

Definition (Locality Sensitive Hashing): A family H is called (S0, cS0, p1, p2)-sensitive if, for any two points x, y ∈ R^d, a hash function h chosen uniformly from H satisfies the following:


• if Sim(x, y) ≥ S0 then Pr_H(h(x) = h(y)) ≥ p1

• if Sim(x, y) ≤ cS0 then Pr_H(h(x) = h(y)) ≤ p2

For approximate near neighbor search, typically p1 > p2 and c < 1 is needed. Since we are defining neighbors in terms of similarity we have c < 1; to get the distance analogy we can use the transformation D(x, y) = 1 − Sim(x, y), with a requirement of c > 1.

The definition of the LSH family H is tightly linked with the similarity function of interest Sim. An LSH allows us to construct data structures that give provably efficient query time algorithms for the c-NN problem.

Fact: Given a family of (S0, cS0, p1, p2)-sensitive hash functions, one can construct a data structure for c-NN with O(n^ρ log n) query time, where ρ = log(1/p1) / log(1/p2).

The quantity ρ < 1 measures the efficiency of a given LSH; the smaller the better. In theory, in the worst case, the number of points scanned by a given LSH to find a c-approximate near neighbor is O(n^ρ) [16], which is dependent on ρ. Thus, given two LSHs for the same c-NN problem, the LSH with a smaller value of ρ will achieve the same approximation guarantee and at the same time will have faster query time. An LSH with a lower value of ρ will report fewer points from the database as the potential near neighbors. These reported points need additional re-ranking to find the true c-approximate near neighbor, which is a costly step. It should be noted that the efficiency of an LSH scheme, the ρ value, depends on many things: in particular, on the similarity threshold S0 and on the value of c, which is the approximation parameter.

3.1 Resemblance Similarity and MinHash

Minwise hashing [4] is the LSH for resemblance similarity. The minwise hashing family applies a random permutation π: Ω → Ω on the given set W, and stores only the minimum value after the permutation mapping. Formally, MinHash is defined as:

h_π^min(W) = min(π(W)). (9)

Given sets W1 and W2, it can be shown by an elementary probability argument that

Pr(h_π^min(W1) = h_π^min(W2)) = |W1 ∩ W2| / |W1 ∪ W2| = R. (10)

It follows from (10) that minwise hashing is an (R0, cR0, R0, cR0)-sensitive family of hash functions when the similarity function of interest is resemblance, i.e., R. It has efficiency ρ = log R0 / log cR0 for approximate resemblance-based search.

3.2 SimHash and Cosine Similarity

SimHash is another popular LSH, for the cosine similarity measure, which originates from the concept of sign random projections (SRP) [8]. Given a vector x, SRP utilizes a random vector w with each component generated from i.i.d. normal, i.e., w_i ∼ N(0, 1), and only stores the sign of the projected data. Formally, SimHash is given by

h_w^sim(x) = sign(w^T x). (11)

It was shown in [12] that the collision under SRP satisfies the following equation:

Pr(h_w^sim(x) = h_w^sim(y)) = 1 − θ/π, where θ = cos⁻¹( x^T y / (‖x‖₂ ‖y‖₂) ). (12)

The term x^T y / (‖x‖₂ ‖y‖₂) is the cosine similarity for data vectors x and y, which becomes S = a/√(f1 f2) when the data are binary. Since 1 − θ/π is monotonic with respect to the cosine similarity S, Eq. (12) implies that SimHash is an

(S0, cS0, 1 − cos⁻¹(S0)/π, 1 − cos⁻¹(cS0)/π)

sensitive hash function, with efficiency ρ = log(1 − cos⁻¹(S0)/π) / log(1 − cos⁻¹(cS0)/π).

4 Theoretical Comparisons

We would like to highlight here that the ρ values for MinHash and SimHash, shown in the previous section, are not directly comparable because they are in the context of different similarity measures. Consequently, it was not clear, before our work, if there is any theoretical way of finding conditions under which MinHash is preferable over SimHash and vice versa. It turns out that the two-sided bounds in Theorem 1 allow us to prove that MinHash is also an LSH for cosine similarity.

4.1 MinHash as an LSH for Cosine Similarity

We fix our gold standard similarity measure to be the cosine similarity, Sim = S. Theorem 1 leads to two simple corollaries:

Corollary 1 If S(x, y) ≥ S0, then we have

Pr(h_π^min(x) = h_π^min(y)) = R(x, y) ≥ S0²

Corollary 2 If S(x, y) ≤ cS0, then we have

Pr(h_π^min(x) = h_π^min(y)) = R(x, y) ≤ cS0 / (2 − cS0)

An immediate consequence of these two corollaries, combined with the definition of LSH, is the following:

Theorem 2 For binary data, MinHash is an (S0, cS0, S0², cS0/(2 − cS0))-sensitive family of hash functions for cosine similarity, with ρ = log S0² / log(cS0/(2 − cS0)).

4.2 1-bit Minwise Hashing

SimHash generates a single bit output (only the signs) whereas MinHash generates an integer value.
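The two collision probabilities, Eq. (10) for MinHash and Eq. (12) for SimHash, can be verified by a short simulation. A minimal sketch (ours, not the authors' code); the universe size D, the particular sets, and the number of repetitions are arbitrary illustrative choices:

```python
import random

# Illustrative simulation (not the authors' code) of the collision probabilities:
# MinHash collides with probability R, Eq. (10); SimHash with 1 - theta/pi, Eq. (12).

def minhash(W, perm):
    """h_pi^min(W) = min(pi(W)), Eq. (9); perm[i] is pi(i)."""
    return min(perm[i] for i in W)

def simhash(W, w):
    """Sign of w^T x for the binary vector x with support W, Eq. (11)."""
    return 1 if sum(w[i] for i in W) >= 0 else -1

D = 200
W1, W2 = set(range(0, 120)), set(range(60, 180))  # a = 60, f1 = f2 = 120
# Resemblance R = 60/180 = 1/3; cosine S = 60/120 = 1/2, so theta = pi/3.

rng = random.Random(0)
n, mh_coll, sh_coll = 5000, 0, 0
for _ in range(n):
    perm = rng.sample(range(D), D)                # a uniformly random permutation
    mh_coll += minhash(W1, perm) == minhash(W2, perm)
    w = [rng.gauss(0.0, 1.0) for _ in range(D)]   # i.i.d. N(0,1) projection vector
    sh_coll += simhash(W1, w) == simhash(W2, w)

print(mh_coll / n)  # should be close to R = 1/3
print(sh_coll / n)  # should be close to 1 - (pi/3)/pi = 2/3
```

Note that for disjoint sets the MinHash values can never collide (the argmin element would have to be shared), matching R = 0 in Eq. (10).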


The recently proposed b-bit minwise hashing [22] provides a simple strategy to generate an informative single bit output from MinHash, by using the parity of the MinHash values:

h_π^{min,1bit}(W1) = 1 if h_π^min(W1) is odd, and 0 otherwise. (13)

For 1-bit MinHash and very sparse data (i.e., f1/D → 0, f2/D → 0), we have the following collision probability:

Pr(h_π^{min,1bit}(W1) = h_π^{min,1bit}(W2)) = (R + 1) / 2. (14)

The analysis presented in the previous sections allows us to theoretically analyze this new scheme. The inequality in Theorem 1 can be modified for (R + 1)/2, and using similar arguments as for MinHash we obtain:

Theorem 3 For binary data, 1-bit MH (minwise hashing) is an (S0, cS0, (S0² + 1)/2, 1/(2 − cS0))-sensitive family of hash functions for cosine similarity, with ρ = log(2/(S0² + 1)) / log(2 − cS0).

4.3 Worst Case Gap Analysis

We will compare the gap (ρ) values of the three hashing methods we have studied:

SimHash: ρ = log(1 − cos⁻¹(S0)/π) / log(1 − cos⁻¹(cS0)/π) (15)

MinHash: ρ = log S0² / log(cS0/(2 − cS0)) (16)

1-bit MH: ρ = log(2/(S0² + 1)) / log(2 − cS0) (17)

This is a worst case analysis. We know the lower bound S² ≤ R is usually very conservative in real data when the similarity level is low. Nevertheless, for the high similarity region, the comparisons of the ρ values indicate that MinHash significantly outperforms SimHash, as shown in Figure 4, at least for S0 ≥ 0.8.

Figure 4: Worst case gap (ρ) analysis, i.e., (15), (16), (17), for the high similarity region (S0 = 0.95, 0.9, 0.8, 0.7); lower is better.

4.4 Restricted Worst Case Gap Analysis

The worst case analysis does not make any assumption on the data. It is obviously too conservative when the data are not too similar. Figure 2 has demonstrated that in real data, we can fairly safely replace the lower bound S² with S/(z − S) for some z which, defined in (5), is very close to 2 (for example, 2.1). If we are willing to make this assumption, then we can go through the same analysis for MinHash as an LSH for cosine and compute the corresponding ρ values:

MinHash: ρ = log(S0/(z − S0)) / log(cS0/(2 − cS0)) (18)

1-bit MH: ρ = log(2(z − S0)/z) / log(2 − cS0) (19)

Note that this is still a worst case analysis (and hence can still be very conservative). Figure 5 presents the ρ values for this restricted worst case gap analysis, for two values of z (2.1 and 2.3) and S0 as small as 0.2. The results confirm that MinHash still significantly outperforms SimHash even in the low similarity region.

Figure 5: Restricted worst case gap (ρ) analysis by assuming the data satisfy S/(z − S) ≤ R ≤ S/(2 − S), where z is defined in (5). The ρ values for MinHash and 1-bit MinHash are expressed in (18) and (19), respectively.
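The gap formulas are straightforward to evaluate numerically. A small sketch (ours, for illustration; the particular S0, c, and z values are arbitrary sample points): at high similarity the worst-case MinHash ρ of Eq. (16) is already below SimHash's Eq. (15), and under the restricted assumption of Eq. (18) MinHash stays below SimHash even at moderate similarity.

```python
import math

# Numeric comparison of the gap values (illustrative): Eqs. (15), (16), (18).
# Smaller rho means fewer retrieved points for the same c-NN guarantee.

def rho_simhash(S0, c):
    """Eq. (15)."""
    return math.log(1 - math.acos(S0) / math.pi) / math.log(1 - math.acos(c * S0) / math.pi)

def rho_minhash(S0, c):
    """Eq. (16): worst case, using the lower bound R >= S^2."""
    return math.log(S0 ** 2) / math.log(c * S0 / (2 - c * S0))

def rho_minhash_restricted(S0, c, z):
    """Eq. (18): restricted worst case, assuming R >= S/(z - S)."""
    return math.log(S0 / (z - S0)) / math.log(c * S0 / (2 - c * S0))

for S0 in (0.9, 0.5):
    for c in (0.5, 0.7):
        print(S0, c,
              round(rho_simhash(S0, c), 3),
              round(rho_minhash(S0, c), 3),
              round(rho_minhash_restricted(S0, c, 2.1), 3))
```

For example, at S0 = 0.9 and c = 0.5 the worst-case MinHash ρ is well below SimHash's, consistent with Figure 4; at S0 = 0.5 the restricted value with z = 2.1 remains below SimHash's, consistent with Figure 5.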


Both Figure 4 and Figure 5 show that 1-bit MinHash can be less competitive when the similarity is not high. This is expected, as analyzed in the original paper of b-bit minwise hashing [20]. The remedy is to use more bits. As shown in Figure 6, once we use b = 8 (or even b = 4) bits, the performance of b-bit minwise hashing is not much different from MinHash.

Figure 6: Restricted worst case gap (ρ) analysis for b-bit minwise hashing, for b = 1, 2, 4, 8.

4.5 Idealized Case Gap Analysis

The restricted worst case analysis can still be very conservative and may not fully explain the stunning performance of MinHash in our experiments on datasets of low similarities. Here, we also provide an analysis based on a fixed z value. That is, we only analyze the gap ρ by assuming R = S/(z − S) for a fixed z. We call this the idealized gap analysis. Not surprisingly, Figure 7 confirms that, with this assumption, MinHash significantly outperforms SimHash even for extremely low similarity. We should keep in mind that this idealized gap analysis can be somewhat optimistic and should only be used as side information.

Figure 7: Idealized case gap (ρ) analysis by assuming R = S/(z − S) for a fixed z (z = 2 and z = 2.5 in the plots).

5 Experiments

We evaluate both MinHash and SimHash in the actual task of retrieving top-k near neighbors. We implemented the standard (K, L)-parameterized LSH [16] algorithms with both MinHash and SimHash. That is, we concatenate K hash functions to form a new hash function for each table, and we generate L such tables (see [2] for more details about the implementation). We used all six binarized datasets with the query and training partitions as shown in Table 1. For each dataset, elements from the training partition were used for constructing hash tables, while the elements of the query partition were used as queries for top-k neighbor search. For every query, we compute the gold standard top-k near neighbors using the cosine similarity as the underlying similarity measure.

In the standard (K, L)-parameterized bucketing scheme, the choice of K and L depends on the similarity thresholds and the hash function under consideration. In the task of top-k near neighbor retrieval, the similarity thresholds vary with the datasets. Hence, the actual choice of ideal K and L is difficult to determine. To ensure that this choice does not affect our evaluations, we implemented all the combinations of K ∈ {1, 2, ..., 30} and L ∈ {1, 2, ..., 200}. These combinations include the reasonable choices for both hash functions and different threshold levels.

For each combination of (K, L) and for both of the hash functions, we computed the mean recall of the top-k gold standard neighbors, along with the average number of points reported per query. We then compute the least number of points needed, by each of the two hash functions, to achieve a given percentage of recall of the gold standard top-k, where the least was computed over the choices of K and L. We are therefore ensuring the best over all the choices of K and L for each hash function independently. This eliminates the effect of K and L, if any, in the evaluations.

The plots of the fraction of points retrieved at different recall levels, for k = 1, 10, 20, 100, are in Figure 8. A good hash function, at a given recall, should retrieve fewer points. MinHash needs to evaluate a significantly smaller fraction of the total data points than SimHash to achieve a given recall. MinHash is consistently better than SimHash, in most cases very significantly, irrespective of the choice of dataset and k. It should be noted that our gold standard measure for computing the top-k neighbors is cosine similarity. This should favor SimHash because it was the only known LSH for cosine similarity. Despite this "disadvantage", MinHash still outperforms SimHash in top near neighbor search with cosine similarity. This nicely confirms our theoretical gap analysis.
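The (K, L) bucketing scheme described above can be sketched in a few lines (ours, not the authors' implementation; the toy data, universe size D, and parameter values are made up for illustration):

```python
import random
from collections import defaultdict

# Minimal sketch of the (K, L) parameterized LSH scheme (illustrative): each of
# the L tables keys points by a tuple of K concatenated MinHash values; a
# query's candidate set is the union of its L buckets.

def minhash_key(W, perms):
    """Concatenate K MinHash values into one bucket key."""
    return tuple(min(p[i] for i in W) for p in perms)

def build_tables(data, D, K, L, seed=0):
    rng = random.Random(seed)
    all_perms = [[rng.sample(range(D), D) for _ in range(K)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for idx, W in enumerate(data):
        for perms, table in zip(all_perms, tables):
            table[minhash_key(W, perms)].append(idx)
    return all_perms, tables

def query(W, all_perms, tables):
    candidates = set()
    for perms, table in zip(all_perms, tables):
        candidates.update(table.get(minhash_key(W, perms), []))
    return candidates

data = [set(range(0, 50)), set(range(5, 55)), set(range(60, 100))]
all_perms, tables = build_tables(data, D=100, K=2, L=10)
print(query(set(range(0, 50)), all_perms, tables))  # item 2 (disjoint) is never a candidate
```

Larger K makes buckets more selective (collision probability R^K per table), while larger L recovers recall; this is the trade-off swept over K ∈ {1, ..., 30} and L ∈ {1, ..., 200} in the experiments.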


Figure 8: Fraction of data points retrieved (y-axis) in order to achieve a specified recall (x-axis), for comparing SimHash with MinHash. Lower is better. We use top-k (cosine similarities) as the gold standard for k = 1, 10, 20, 100. For all six binarized datasets, MinHash significantly outperforms SimHash. For example, to achieve a 90% recall for top-1 on MNIST, MinHash needs to scan, on average, 0.6% of the data points while SimHash has to scan 5%. For fair comparisons, we present the optimum outcomes (i.e., the smallest fraction of data points) separately for MinHash and SimHash, by searching a wide range of parameters (K, L), where K ∈ {1, 2, ..., 30} is the number of hash functions per table and L ∈ {1, 2, ..., 200} is the number of tables.


0.05 0.2 ing for online advertising [25], compressing social net- MNIST (Real): Top 1 MNIST (Real): Top 10 0.04 SimHash SimHash works [9], advertising diversification [13], graph sam- 0.15 MinHash MinHash pling [24], Web graph compression [5], etc. Further- 0.03 0.1 more, the recent development of one permutation hash- 0.02 ing [21, 27] has substantially reduced the preprocessing 0.05 Fraction Retrieved 0.01 Fraction Retrieved costs of MinHash, making the method more practical.

0 0 0.4 0.5 0.6 0.7 0.8 0.9 1 0.4 0.5 0.6 0.7 0.8 0.9 1 In machine learning research literature, however, it ap- Recall Recall pears that SimHash is more popular for approximate 0.2 0.2 MNIST (Real): Top 20 MNIST (Real): Top 100 near neighbor search. We believe part of the reason is 0.15 SimHash 0.15 SimHash that researchers tend to use the cosine similarity, for MinHash MinHash 0.1 0.1 which SimHash can be directly applied.

0.05 0.05 It is usually taken for granted that MinHash and Fraction Retrieved Fraction Retrieved SimHash are theoretically incomparable and the choice 0 0 0.4 0.5 0.6 0.7 0.8 0.9 1 0.4 0.5 0.6 0.7 0.8 0.9 1 between them is decided based on whether the desired Recall Recall notion of similarity is cosine similarity or resemblance. 0.2 1 RCV1 (Real): Top 1 RCV1 (Real): Top 10 This paper has shown that MinHash is provably a bet- 0.8 0.15 SimHash SimHash ter LSH than SimHash even for cosine similarity. Our MinHash MinHash 0.6 analysis provides a first provable way of comparing two 0.1 0.4 LSHs devised for different similarity measures. Theo- 0.05 retical and experimental evidence indicates significant

To conclude this section, we also add a set of experiments using the original (real-valued) data, for MNIST and RCV1. We apply SimHash on the original data and MinHash on the binarized data, and we evaluate the retrieval results based on the cosine similarities of the original data. This set-up places MinHash at a considerable disadvantage compared to SimHash. Nevertheless, we can see from Figure 9 that MinHash still noticeably outperforms SimHash, although the improvements are not as significant compared to the experiments on binarized data (Figure 8).

[Figure 9: Retrieval experiments on the original real-valued data (panels shown: RCV1 (Real), Top 20 and Top 100; curves: SimHash and MinHash; axes: Recall vs. Fraction Retrieved). We apply SimHash on the original data and MinHash on the binarized data, and we evaluate the retrieval results based on the cosine similarity of the original data. MinHash still outperforms SimHash.]

6 Conclusion

Minwise hashing (MinHash), originally designed for detecting duplicate web pages [3, 10, 15], has been widely adopted in the search industry, with numerous applications, for example, large-scale machine learning systems [23, 21], Web spam [29, 18], content matching [25], [...] computational advantage of using MinHash in place of SimHash. Since LSH is a concept studied by a wide variety of researchers and practitioners, we believe that the results shown in this paper will be useful from both a theoretical as well as a practical point of view.

Acknowledgements: Anshumali Shrivastava is a Ph.D. student supported by NSF (DMS0808864, SES1131848, III1249316) and ONR (N00014-13-1-0764). Ping Li is partially supported by AFOSR (FA9550-13-1-0137), ONR (N00014-13-1-0764), and NSF (III1360971, BIGDATA1419210).

A Proof of Theorem 1

The only less obvious step is the proof of tightness.

Tightness of the upper bound S/(2−S): Let a continuous function f(S) be a sharper upper bound, i.e., R ≤ f(S) ≤ S/(2−S). For any rational S = p/q, with p, q ∈ N and p ≤ q, choose f1 = f2 = q and a = p. Note that f1, f2 and a are positive integers. This choice leads to S/(2−S) = R = p/(2q−p). Thus, the upper bound is achievable for all rational S. Hence, it must be the case that f(S) = S/(2−S) = R for all rational values of S. For any real number c ∈ [0, 1], there exists a Cauchy sequence of rational numbers {r1, r2, ..., rn, ...} such that rn ∈ Q and lim_{n→∞} rn = c. Since all the rn's are rational, f(rn) = rn/(2−rn). From the continuity of both f and S/(2−S), we have f(lim_{n→∞} rn) = lim_{n→∞} rn/(2−rn), which implies f(c) = c/(2−c) for all c ∈ [0, 1].

Tightness of the lower bound S²: For a rational p/q with p ≤ q, choosing f2 = a = p and f1 = q gives R = p/q and S = p/√(pq) = √(p/q), i.e., an infinite set of points with R = S². We now use arguments similar to those in the proof of tightness of the upper bound; all we need is the existence of a Cauchy sequence of square roots of rational numbers converging to any real c ∈ [0, 1].
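As a quick numerical sanity check of Theorem 1 and of the two constructions used in this appendix (our own sketch, not code from the paper; the helper name `sims` is ours), the script below samples binary-vector profiles (f1, f2, a) and verifies S² ≤ R ≤ S/(2−S), then checks the two profiles that achieve the bounds:

```python
import math
import random

def sims(f1, f2, a):
    """Resemblance R and cosine S for binary vectors with
    f1 nonzeros, f2 nonzeros, and a nonzeros in common."""
    R = a / (f1 + f2 - a)
    S = a / math.sqrt(f1 * f2)
    return R, S

# The bounds S^2 <= R <= S/(2-S) hold for every feasible profile (f1, f2, a).
random.seed(0)
for _ in range(1000):
    f1 = random.randint(1, 100)
    f2 = random.randint(1, 100)
    a = random.randint(0, min(f1, f2))
    R, S = sims(f1, f2, a)
    assert S * S <= R + 1e-12 and R <= S / (2 - S) + 1e-12

# Upper bound attained with f1 = f2 = q, a = p (so S = p/q, R = p/(2q-p)).
p, q = 3, 7
R, S = sims(q, q, p)
assert abs(R - S / (2 - S)) < 1e-12

# Lower bound attained with f1 = q, f2 = a = p (so S = sqrt(p/q), R = p/q).
R, S = sims(q, p, p)
assert abs(R - S * S) < 1e-12
```

Both bounds are met exactly on integer profiles, which is why, as the proof argues, neither bound can be replaced by a sharper continuous function.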

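For readers who want to see the two collision probabilities compared in this paper side by side, here is a minimal simulation (our own illustration, not code from the paper): a MinHash collision occurs with probability R, while a SimHash (signed random projection) collision occurs with probability 1 − arccos(S)/π.

```python
import math
import random

random.seed(0)
D = 200                                # size of the binary universe
x = set(range(0, 60))                  # nonzero coordinates of vector 1
y = set(range(30, 90))                 # nonzero coordinates of vector 2

a, f1, f2 = len(x & y), len(x), len(y)
R = a / (f1 + f2 - a)                  # resemblance (1/3 for this pair)
S = a / math.sqrt(f1 * f2)             # cosine similarity (1/2 for this pair)

def minhash(v, perm):
    # smallest permuted index among the nonzero coordinates
    return min(perm[i] for i in v)

def simhash(v, w):
    # sign bit of the projection onto a random Gaussian direction
    return sum(w[i] for i in v) >= 0

trials, mh, sh = 10000, 0, 0
for _ in range(trials):
    perm = list(range(D))
    random.shuffle(perm)
    mh += minhash(x, perm) == minhash(y, perm)
    w = [random.gauss(0.0, 1.0) for _ in range(D)]
    sh += simhash(x, w) == simhash(y, w)

print(mh / trials)   # close to R
print(sh / trials)   # close to 1 - acos(S)/pi
```

The empirical collision rates match the closed forms, which is exactly the property that makes each scheme an LSH for its respective similarity.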

References

[1] Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, and John Langford. A reliable effective terascale linear learning system. Technical report, arXiv:1110.4198, 2011.

[2] Alexandr Andoni and Piotr Indyk. E2LSH: Exact Euclidean locality sensitive hashing. Technical report, 2004.

[3] Andrei Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997.

[4] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. In STOC, pages 327–336, Dallas, TX, 1998.

[5] Gregory Buehrer and Kumar Chellapilla. A scalable pattern mining approach to web graph compression with communities. In WSDM, pages 95–106, Stanford, CA, 2008.

[6] Tushar Chandra, Eugene Ie, Kenneth Goldman, Tomas Lloret Llinares, Jim McFadden, Fernando Pereira, Joshua Redstone, Tal Shaked, and Yoram Singer. Sibyl: a system for large scale machine learning. Technical report, 2010.

[7] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5):1055–1064, 1999.

[8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, Montreal, Quebec, Canada, 2002.

[9] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. On compressing social networks. In KDD, pages 219–228, Paris, France, 2009.

[10] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003.

[11] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975.

[12] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115–1145, 1995.

[13] Sreenivas Gollapudi and Aneesh Sharma. An axiomatic approach for result diversification. In WWW, pages 381–390, Madrid, Spain, 2009.

[14] Matthias Hein and Olivier Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pages 136–143, Barbados, 2005.

[15] Monika Rauch Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pages 284–291, 2006.

[16] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.

[17] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, Amsterdam, Netherlands, 2007.

[18] Nitin Jindal and Bing Liu. Opinion spam and analysis. In WSDM, pages 219–230, Palo Alto, California, USA, 2008.

[19] Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, BC, Canada, 2006.

[20] Ping Li and Arnd Christian König. b-bit minwise hashing. In Proceedings of the 19th International Conference on World Wide Web, pages 671–680, Raleigh, NC, 2010.

[21] Ping Li, Art B. Owen, and Cun-Hui Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012.

[22] Ping Li, Anshumali Shrivastava, and Arnd Christian König. b-bit minwise hashing in practice. In Internetware, Changsha, China, 2013.

[23] Ping Li, Anshumali Shrivastava, Joshua Moore, and Arnd Christian König. Hashing algorithms for large-scale learning. In NIPS, Granada, Spain, 2011.

[24] Marc Najork, Sreenivas Gollapudi, and Rina Panigrahy. Less is more: sampling the neighborhood graph makes salsa better and faster. In WSDM, pages 242–251, Barcelona, Spain, 2009.

[25] Sandeep Pandey, Andrei Z. Broder, Flavio Chierichetti, Vanja Josifovski, Ravi Kumar, and Sergei Vassilvitskii. Nearest-neighbor caching for content-match applications. In WWW, pages 441–450, Madrid, Spain, 2009.

[26] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In ECML, Bristol, UK, 2012.

[27] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, 2014.

[28] Simon Tong. Lessons learned developing a practical large scale machine learning system. http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html, 2008.

[29] Tanguy Urvoy, Emmanuel Chauveau, Pascal Filoche, and Thomas Lavergne. Tracking web spam with html style similarities. ACM Trans. Web, 2(1):1–28, 2008.

[30] Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In NIPS, 2008.
