Maximally Consistent Sampling and the Jaccard Index of Probability Distributions

Anonymous Anonymous

Abstract—We introduce simple, efficient algorithms for computing a MinHash of a probability distribution, suitable for both sparse and dense data, with equivalent running times to the state of the art for both cases. The collision probability of these algorithms is a new measure of the similarity of positive vectors which we investigate in detail. We describe the sense in which this collision probability is optimal for any Locality Sensitive Hash based on sampling. We argue that this similarity measure is more useful for probability distributions than the similarity pursued by other algorithms for weighted MinHash, and is the natural generalization of the Jaccard index.

Index Terms—Locality Sensitive Hashing, Retrieval, MinHash, Jaccard index, Jensen-Shannon divergence

I. INTRODUCTION

MinHashing[1] is a popular Locality Sensitive Hashing algorithm for clustering and retrieval on large datasets. Its extreme simplicity and efficiency, as well as its natural pairing with MapReduce and key-value datastores, have made it a basic building block in many domains, particularly document clustering[1][2] and graph clustering[3][4].

Given a finite set U and a uniformly random permutation π, the map X ↦ arg min_{i∈X} π(i) provides a representation of any subset X of U that is stable under small changes to X. If X, Y are both subsets of U, the well-known Jaccard index[5] is defined as

  J(X, Y) = |X ∩ Y| / |X ∪ Y|

Stability of arg min_{i∈X} π(i) with respect to X is quantified by

  Pr[arg min_{i∈X} π(i) = arg min_{i∈Y} π(i)] = J(X, Y)

Practically, this random permutation is generated by applying some hash function to each i with a fixed random seed, hence "MinHashing."

In order to hash objects other than sets, Chum et al.[6] introduced two algorithms for incorporating weights in the computation of MinHashes. The first algorithm associates constant global weights with the set members, suitable for idf weighting. The collision probability that results is ∑_{i∈X∩Y} w_i / ∑_{i∈X∪Y} w_i. The second algorithm computes MinHashes of vectors of positive integers, yielding a collision probability of

  JW(x, y) = ∑_i min(x_i, y_i) / ∑_i max(x_i, y_i)

Subsequent work[7][8][9] has improved the efficiency of the second algorithm and extended it to arbitrary positive weights, while still achieving JW as the collision probability.

JW is one of several generalizations of the Jaccard index to non-negative vectors. It is useful because it is monotonic with respect to the L1 distance between x and y when they are both L1 normalized, but it is unnatural in many ways for probability distributions.

If we convert sets to binary vectors x, y, with x_i, y_i ∈ {0, 1}, then JW(x, y) = J(X, Y). But if we convert these vectors to probability distributions by normalizing them so that x_i ∈ {0, 1/|X|}, y_i ∈ {0, 1/|Y|}, then JW(x, y) ≠ J(X, Y) when |X| ≠ |Y|. The correspondence breaks and it no longer generalizes the Jaccard index. Instead, for |X| > |Y|,

  JW(x, y) = |X ∩ Y| / (|X \ Y| + |X|) < J(X, Y).    (1)

As a consequence, switching a system from an unweighted MinHash to a MinHash based on JW will generally decrease the collision probabilities.

Furthermore, JW is insensitive to important differences between probability distributions. It counts all differences on an element in the same linear scale regardless of the mass the distributions share on that element. For instance, JW((a, b, c, 0), (a, b, 0, c)) = JW((a + c, b), (a, b + c)). This makes it a poor choice when the ultimate goal is to measure similarity using an expression based in information theory or likelihood, where having differing support typically results in the worst possible score.

For a drop-in replacement for the Jaccard index that treats its input as a probability distribution, we'd like it to have the following properties.

1) Scale invariant.
2) Not lower than the Jaccard index when applied to discrete uniform distributions.
3) Sensitive to changes in support, in a similar way to information-based measures.
4) Easily achievable as a collision probability.

JW fails all but the last.

1) It isn't scale invariant: JW(αx, y) ≠ JW(x, y).
2) If the vectors are normalized to make it scale invariant, the values drop below the corresponding Jaccard index (equation 1).
3) It is insensitive to changes in support.
4) Good algorithms exist, but they are non-trivial.
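A short numerical sketch (ours, not part of the paper's algorithms; the helper names jw and jaccard are illustrative) makes both failures concrete:

```python
import numpy as np

def jw(x, y):
    """Weighted Jaccard similarity J_W of two non-negative vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

def jaccard(X, Y):
    """Set Jaccard index J(X, Y)."""
    X, Y = set(X), set(Y)
    return len(X & Y) / len(X | Y)

# Normalizing unequal-sized sets breaks the correspondence with J (equation 1).
X, Y = {0, 1, 2, 3}, {0, 1}
x = np.array([1 / len(X) if i in X else 0.0 for i in range(4)])
y = np.array([1 / len(Y) if i in Y else 0.0 for i in range(4)])
print(jaccard(X, Y), jw(x, y))   # 0.5 versus 1/3

# J_W ignores whether a difference falls on shared or disjoint support.
a, b, c = 0.3, 0.3, 0.2
print(jw([a, b, c, 0], [a, b, 0, c]), jw([a + c, b], [a, b + c]))  # equal
```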

Existing work has thoroughly explored improvements to Chum et al.'s second algorithm, while leaving their first untouched. In this work we instead take their first algorithm as a starting point. We extend it to arbitrary positive vectors (rather than sets with constant global weights) and analyze the result. In doing so, we find that the collision probability is a new generalization of the Jaccard index to positive vectors, which we here call JP.

  JP(x, y) = ∑_{i : x_i, y_i > 0} 1 / ∑_j max(x_j/x_i, y_j/y_i).    (2)

The names used here, JW and JP, are chosen to reflect how each function interprets x and y, and the conditions under which they match the original Jaccard index. JW treats a difference in magnitude the same as any other difference, so it treats vectors as "weighted sets." JP is scale invariant, so any input is treated the same as the corresponding probability distribution.

The primary contribution of this work is to derive and analyze JP, and to show that in many situations where the objects being hashed are probability distributions, JP is a more useful collision probability than JW.

We will describe the sense in which JP is an optimal collision probability for any LSH based on sampling. We will prove that if the collision probability of a sampling algorithm exceeds JP on one pair, it must sacrifice collisions on a pair that has higher JP.

We will motivate JP's utility by showing experimentally that it has a tighter relationship to the Jensen-Shannon divergence than JW, and is more closely centered around the Jaccard index than JW. We will even show empirically that in some circumstances, it is better for retrieving documents that are similar under JW than JW itself (and consequently, sometimes better for retrieving based on L1 distance).

Algorithm 1: P-MinHash. We require only a single exponentially distributed hash per nonzero element.
  input : Vector x, Seed s
  output: Stable sample from x
  foreach i where x_i > 0 do
    e_i ← −log(UniformNonZeroFloat(i, s)) / x_i
  end
  return arg min_i e_i

II. P-MINHASH AND ITS MATCH PROBABILITY

Let h : [n] → (0, 1] be a pseudo-random hash mapping every element 1 ≤ i ≤ n to an independent uniform random value in (0, 1]. Over a non-negative vector x, define

  H(x) := arg min_i (−log h(i)) / x_i

For brevity, we will be using the extended real number system in which 1/∞ := 0. Each term is an exponentially distributed random variable with rate x_i,

  Pr[(−log h(i)) / x_i < y] = 1 − e^{−x_i y},

so it follows that

  Pr[H(x) = i] = x_i / ∑_i x_i.

This well known and beautiful property of exponential random variables derives from the fact that Pr[min(x, y) > α] = Pr[x > α] Pr[y > α].

Theorem II.1. For any two nonnegative vectors x, y ∈ R^n_+,

  Pr[H(x) = H(y)] = JP(x, y).

Proof. Any monotonic transform of (−log h(i)) / x_i will not change the arg min, so multiplying each x_i by a positive α won't either: H(αx) = H(x). Thus for x_i, y_i > 0,

  Pr[i = H(x) = H(y)] = Pr[i = H(x/x_i) = H(y/y_i)].

By definition, H(x/x_i) = H(y/y_i) = i means −log h(i) / (x_i/x_i) ≤ −log h(j) / (x_j/x_i) and −log h(i) / (y_i/y_i) ≤ −log h(j) / (y_j/y_i) for all j ≠ i, or equivalently,

  −log h(i) ≤ min( −log h(j) / (x_j/x_i), −log h(j) / (y_j/y_i) ) = −log h(j) / max(x_j/x_i, y_j/y_i).

Now we desire a new vector z^i such that H(z^i) = i if and only if H(x/x_i) = H(y/y_i) = i. This requires that −log h(j) / (z^i_j / z^i_i) = −log h(j) / max(x_j/x_i, y_j/y_i). Thus for fixed i,

  z^i_j = max(x_j/x_i, y_j/y_i) / ∑_k max(x_k/x_i, y_k/y_i).

Consequently,

  Pr[H(z^i) = i] = 1 / ∑_j max(x_j/x_i, y_j/y_i).

Repeating this process for all i in the intersection yields

  Pr[H(x) = H(y)] = ∑_i 1 / ∑_j max(x_j/x_i, y_j/y_i). ∎

Continuing the notational convention, we will refer to hashing algorithms that achieve JP as their pair collision probability as P-MinHashes, and algorithms that achieve JW as W-MinHashes.
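Algorithm 1 translates directly into a few lines of code. The sketch below is a minimal illustration rather than the paper's implementation: blake2b keyed by (seed, element) stands in for UniformNonZeroFloat, and the collision rate across seeds empirically approaches JP(x, y).

```python
import hashlib
import math

def _uniform_nonzero_float(key, seed):
    """Deterministic stand-in for UniformNonZeroFloat: maps (element, seed)
    to a float in (0, 1]."""
    h = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    u = int.from_bytes(h, "big")
    return (u + 1) / 2.0**64          # in (0, 1]

def p_minhash(x, seed=0):
    """Algorithm 1: return the stable sample (arg min of the exponentials)
    for a sparse non-negative vector x given as {element: weight}."""
    best, best_e = None, math.inf
    for i, xi in x.items():
        if xi <= 0:
            continue
        e = -math.log(_uniform_nonzero_float(i, seed)) / xi   # Exp(rate = xi)
        if e < best_e:
            best, best_e = i, e
    return best

# The collision frequency over many seeds estimates J_P(x, y).
x = {"a": 0.5, "b": 0.4, "c": 0.1}
y = {"a": 0.2, "b": 0.4, "c": 0.4}
hits = sum(p_minhash(x, s) == p_minhash(y, s) for s in range(5000))
print(hits / 5000)   # approaches J_P(x, y)
```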

Fig. 1: The P-MinHash process can be interpreted geometrically as dividing the unit simplex into smaller simplexes proportional to the mass of each term, then selecting the same point in each simplex as its representative. Pictured here are x = (0.5, 0.4, 0.1), y = (0.2, 0.4, 0.4). Colors are assigned to unique values of i, or (i, j) in the case of the joint distributions. The sum of the remaining colored areas in (d) is proportional to JP(x, y). Panels: (a) Pr[H(x) = i], (b) Pr[H(y) = i], (c) Pr[H(x), H(y) = i, j], (d) Pr[H(x) = H(y) = i].

III. INTUITION ABOUT JP

While JP's expression is superficially awkward, we can aid intuition by representing it in other ways. The simplest interpretation is to view it as a variant of JW. We rewrite JW allowing one input to be rescaled before computing each term,

  JW(x, y, α) = ∑_i min(α_i x_i, y_i) / ∑_j max(α_i x_j, y_j)

and choose the vector α to maximize this generalized JW. If α_i x_i > y_i, increasing α_i raises only the denominator. If α_i x_i < y_i, increasing α_i raises the numerator more than the denominator. So the optimal α sets α_i x_i = y_i, and results in JP.

We can derive a more powerful representation by viewing the P-MinHash algorithm itself geometrically. A vector of k + 1 exponential random variables, when normalized to sum to 1, is a uniformly distributed random point on the unit k-simplex. Every point in the unit k-simplex is also a probability distribution over k + 1 elements. Using these two facts we can construct the P-MinHash as a function of the simplex as illustrated in Figure 1.

For a probability distribution x, mark the point on the unit simplex corresponding to x, (x_1, . . . , x_{k+1}), and connect it to each of the corners of the simplex. These edges divide it into k + 1 smaller simplices that fill the unit simplex. Each internal simplex has volume proportional to the coordinate of x opposite to its unique exterior face. As a result, a uniformly chosen point on the unit simplex will land in one of the sub-simplices with probability given by x. P-MinHashing is equivalent to sampling in this fashion, but holding the chosen point constant when sampling from each new distribution. The match probability is then proportional to the sum of the intersections of simplices that share an external face. This representation makes several properties obvious on small examples that we prove generally in the next section.
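A quick numerical check of this rescaling view (our own sketch; jp and jw_generalized are hypothetical helper names): setting α_i = y_i/x_i term by term turns the generalized JW into JP.

```python
import numpy as np

def jp(x, y):
    """J_P via equation (2); the sum runs over the shared support."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    support = np.where((x > 0) & (y > 0))[0]
    return sum(1.0 / np.maximum(x / x[i], y / y[i]).sum() for i in support)

def jw_generalized(x, y, alpha):
    """J_W with the first input rescaled by alpha_i inside each term i."""
    x, y, alpha = (np.asarray(v, float) for v in (x, y, alpha))
    return sum(min(alpha[i] * x[i], y[i]) / np.maximum(alpha[i] * x, y).sum()
               for i in range(len(x)))

x = np.array([0.5, 0.4, 0.1])
y = np.array([0.2, 0.4, 0.4])
print(jw_generalized(x, y, np.ones(3)))   # plain J_W
print(jw_generalized(x, y, y / x))        # optimal alpha_i = y_i / x_i
print(jp(x, y))                           # matches the line above
```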

IV. OPTIMALITY OF JP

When MinHashing is used with a key-value store, high collision probabilities are generally more efficient than low collision probabilities because, as we discuss in section VI, it is much cheaper to lower them than to raise them. For this reason, we are interested in the question of the highest collision probability that can be achieved through sampling. The constraint that the samples follow each distribution forces the collision probability to remain discriminative, but given that constraint, we would like to make it as high as possible to maximize flexibility and efficiency.

Suppose for two distributions, x and y, we want to choose a joint distribution that maximizes Pr[H(x) = H(y)]. If we were concerned with only these two particular distributions in isolation, the upper bound on Pr[H(x) = H(y)] is given by the Total Variation distance, or equivalently 1 − L1(x, y)/2. Meeting this bound requires the probability mass where x exceeds y to be perfectly coupled with the mass where y exceeds x. Both the mass they share and the mass they do not must be perfectly aligned.

Rather than just two, we want to create a joint distribution (or coupling) of all possible distributions on a given space where the collision probability for any pair is as high as possible. It is always possible to increase the collision probability of one pair at the expense of another so long as the chosen pair has not hit the Total Variation limit, so the kind of optimality we are aiming for is Pareto optimality. This requires that no collision probability be able to exceed ours everywhere; any gain on one pair must have a corresponding loss on another. This by itself would not be a very consequential bound for retrieval performance. We really only desire high collision probabilities for items that are similar, and we would happily lower the collision probability of a dissimilar pair to increase it for a similar pair.

However, we are able to prove something stronger by examining the pair whose collisions must be sacrificed. To increase the collision probability for one pair above its JP, you must always sacrifice collisions on a pair with even higher JP. To get better recall on one pair, you must always give up recall on an even better pair.

The Jaccard index itself is optimal on uniform distributions, and the short proof is a model for the general case.

Theorem IV.1. No method of sampling from uniform discrete distributions can have collision probabilities dominating the Jaccard index of their supports. The Jaccard index is Pareto optimal.

Proof. Let Z be the symmetric difference of X and Y, Z = X ∪ Y − X ∩ Y, and let z, x, y be the corresponding discrete uniform distributions. The three sets X ∩ Y, X \ Y, Y \ X are disjoint, so the three collision events, H(x) = H(y), H(z) = H(x), and H(z) = H(y), are also disjoint. As disjoint events, their probabilities must sum to at most 1. When the three probabilities are given by the Jaccard index of the corresponding sets, respectively |X ∩ Y|/|X ∪ Y|, |X \ Y|/|X ∪ Y|, and |Y \ X|/|X ∪ Y|, they already sum to 1, so the Jaccard index is Pareto optimal. ∎

To prove the same claim on JP for all distributions, we need a few tools. We can rearrange JP to separate the two iteration indices within the max:

  JP(x, y) = ∑_i x_i / ∑_j y_j max(x_j/y_j, x_i/y_i)    (3)

This lets us characterize the arg max choices in the denominator as functions of a sorted list of x_i/y_i where x_i/y_i ≥ x_{i+1}/y_{i+1}. If we reindex i according to this sorted list, then we can describe each JP(x, y)_i knowing that i terms choose the left side of the max, and n − i terms choose the right. (This also lets us compute JP in O(n log n) rather than O(n²) time by sorting first and keeping running sums of each choice of the max; a sketch of this appears below.)

To analyze JP we will need to refer to each term in its outer sum via subscript, JP(x, y)_i := 1 / ∑_j max(x_j/x_i, y_j/y_i) = z^i_i. We will also use quantifiers in this subscript to indicate a partial sum, i.e. JP(x, y)_{i>a} := ∑_{i:i>a} JP(x, y)_i.

Lemma IV.2. Useful tools from sorting JP's indices:
1) For at least two values of i, and distributions x, y, JP(x, y)_i = min(x_i, y_i).
2) Let w_1 . . . w_n be distributions with disjoint support. Consider two linear combinations of these distributions with coefficients α and β. Then JP(α · w, β · w) = JP(α, β).

Proof. For a given i, if every max chooses the same side, JP(x, y)_i = min(x_i, y_i). x_i/y_i has both an arg max and an arg min for which this is true. This gives us part 1.

Let x_k/y_k = x_l/y_l. Then

  ∑_{i∈{k,l}} x_i / ∑_j y_j max(x_j/y_j, x_i/y_i) = (x_k + x_l) / ∑_j y_j max(x_j/y_j, x_k/y_k)

  ∑_{j∈{k,l}} y_j max(x_j/y_j, x_i/y_i) = (y_k + y_l) max(x_k/y_k, x_i/y_i)

Thus we can form x′, y′ by merging the mass of x_k, x_l into one element and y_k, y_l into one element and have JP(x, y) = JP(x′, y′). Let w_1 . . . w_n be distributions with disjoint support, and consider JP(α · w, β · w). Repeat the merging process until all elements of each w_i are merged into one. This gives us part 2. ∎

This ordering also lets us work more effectively with the z distributions we constructed in theorem II.1. This lemma contains all the algebra needed for the main proof.
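Before moving on, here is a sketch of the O(n log n) evaluation mentioned above (ours; assuming NumPy and the hypothetical helpers jp_slow and jp_fast). It sorts by x_i/y_i and keeps a running prefix sum of x and suffix sum of y, one for each side of the max in equation (3):

```python
import numpy as np

def jp_slow(x, y):
    """J_P directly from equation (2): O(n^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    idx = np.where((x > 0) & (y > 0))[0]
    return sum(1.0 / np.maximum(x / x[i], y / y[i]).sum() for i in idx)

def jp_fast(x, y):
    """J_P in O(n log n) via equation (3)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    shared = (x > 0) & (y > 0)
    extra_x = x[~shared].sum()            # elements with y_j = 0 always choose x_j
    extra_y = y[~shared].sum()            # elements with x_j = 0 always choose y_j
    xs, ys = x[shared], y[shared]
    order = np.argsort(-(xs / ys))        # descending ratio x_i / y_i
    xs, ys = xs[order], ys[order]
    r = xs / ys
    prefix_x = np.cumsum(xs)              # x_j for ratios >= r_i (left side of max)
    suffix_y = ys.sum() - np.cumsum(ys)   # y_j for ratios <  r_i (right side of max)
    denom = prefix_x + extra_x + r * (suffix_y + extra_y)
    return (xs / denom).sum()

rng = np.random.default_rng(0)
x = rng.dirichlet(np.ones(1000))
y = rng.dirichlet(np.ones(1000))
print(np.isclose(jp_slow(x, y), jp_fast(x, y)))   # True
```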

Fig. 2: A visual proof of theorem IV.4 on 3 element distributions. Using the same distributions as in Figure 1, the green regions of x and y could be shifted to overlap more and improve the collision probability of this pair, but any modification that achieves that would worsen at least one of the collision probabilities between x or y and z^green (both of which have higher collision probability than the (x, y) pair). Panels: (a) Pr[H(x), H(y) = i, j], (b) Pr[H(z^g) = i], (c) Pr[H(x), H(z^g) = i, j], (d) Pr[H(y), H(z^g) = i, j].

Lemma IV.3. For fixed a, let z^a, x, y be probability distributions where z^a_i ∝ max(x_i/x_a, y_i/y_a). Reorder i according to the sorting of (3) such that x_i/y_i ≥ x_{i+1}/y_{i+1}. The following are true:
1) ∀i ≤ a, JP(x, z^a)_i = z^a_i = min(x_i, z^a_i) ≥ z^i_i.
2) ∀i ≥ a, JP(x, z^a)_i = z^i_i.
3) JP(x, y)_i = min(x_i, y_i) ⟹ JP(x, z^a)_i = min(x_i, z^a_i).
4) JP(x, z^a) ≥ JP(x, y) and JP(y, z^a) ≥ JP(x, y).

Proof. For i ≤ a, x_i/y_i ≥ x_a/y_a, hence z^a_i ∝ x_i/x_a. Similarly, for i ≥ a, z^a_i ∝ y_i/y_a. Working first with the lower group of indices,

  ∀i ≤ a, JP(x, z^a)_i = 1 / ∑_j max( x_j/x_i , max(x_j/x_a, y_j/y_a) / (x_i/x_a) )
                       = (x_i/x_a) / ∑_j max(x_j/x_a, y_j/y_a) = z^a_i.

And since (x_i/x_a) / ∑_j max(x_j/x_a, y_j/y_a) ≤ (x_i/x_a) / ∑_j (x_j/x_a) = x_i, we also know that ∀i ≤ a, z^a_i = min(x_i, z^a_i). Furthermore, ∀i ≤ a,

  z^a_i = 1 / ∑_j [ max(x_j/x_a, y_j/y_a) / (x_i/x_a) ] = 1 / ∑_j max( x_j/x_i , y_j x_a/(y_a x_i) ) ≥ 1 / ∑_j max( x_j/x_i , y_j/y_i ) = z^i_i,

which gives us part 1. Now, continuing on to the upper group of indices,

  ∀i ≥ a, JP(x, z^a)_i = 1 / ∑_j max( x_j/x_i , max(x_j/x_a, y_j/y_a) / (y_i/y_a) ).

Since ∀i ≥ a, x_a y_i / y_a ≥ x_i, we can simplify further to conclude part 2: ∀i ≥ a, JP(x, z^a)_i = 1 / ∑_j max(x_j/x_i, y_j/y_i) = z^i_i. By noting that the choices within the max are preserved, we conclude part 3. Finally, having bounded all indices, part 4: ∀i, JP(x, z^a)_i ≥ JP(x, y)_i, so JP(x, z^a) ≥ JP(x, y); the case of JP(y, z^a) is symmetric. ∎

We now have the tools to prove the optimality of JP.

Theorem IV.4. For any sampling method G and any two distributions x, y, if for any i, Pr[G(x) = G(y) = i] > JP(x, y)_i, there exists a z such that Pr[G(x) = G(z)] < JP(x, z) or Pr[G(y) = G(z)] < JP(y, z). This implies that no method of sampling can have collision probabilities that dominate JP.

Proof. Let m be the number of elements i for which JP(x, y)_i < min(x_i, y_i). In the base case where m = 0, JP(x, y) = ∑_i min(x_i, y_i), which cannot be improved. Assume the proposition to be proved is true ∀x, ∀y, and ∀p < m. By IV.2.1 we know that m ≤ n − 2, since at least the two endpoints of the sorted list have reached their upper bound. We proceed by induction on m.

As in theorem II.1, for each a where JP(x, y)_a < min(x_a, y_a), construct a distribution z^a, where z^a_i ∝ max(x_i/x_a, y_i/y_a). Reorder i according to the sorting of (3) such that x_i/y_i ≥ x_{i+1}/y_{i+1}. From lemma IV.3 we know that ∀i ≤ a, JP(x, z^a)_i = z^a_i = min(x_i, z^a_i) and ∀i ≥ a, JP(y, z^a)_i = z^a_i = min(y_i, z^a_i).

Now consider a new sampling method, G. The following events are pairwise disjoint by inspection.

  E^<_a := {G(x) = G(z^a) < a}
  E^=_a := {G(x) = G(y) = a}
  E^>_a := {G(y) = G(z^a) > a}

So their probabilities are constrained: Pr[E^<_a] + Pr[E^=_a] + Pr[E^>_a] ≤ 1. When these probabilities are given by JP they already sum to 1: Pr[E^<_a] + Pr[E^=_a] + Pr[E^>_a] = ∑_{i<a} z^a_i + z^a_a + ∑_{i>a} z^a_i = 1.

From this bound, if Pr[G(x) = G(y) = a] > JP(x, y)_a, at least one of

  Pr[G(x) = G(z^a) < a] < JP(x, z^a)_{i<a}   or   Pr[G(y) = G(z^a) > a] < JP(y, z^a)_{i>a}

must be true. Since the two cases are symmetric, we will assume the first one: Pr[G(x) = G(z^a) < a] < JP(x, z^a)_{i<a}. For G to avoid sacrificing the pair (x, z^a), it would have to make up this deficit with Pr[G(x) = G(z^a) ≥ a] > JP(x, z^a)_{i≥a}. However, JP(x, z^a)_a = JP(y, z^a)_a = z^a_a, so these terms have exhausted z^a_a and cannot be increased. Using IV.3.1 and IV.3.3 we know that this adds at least one additional term that is fully consumed, i.e. the size of {i : JP(x, z^a)_i < min(x_i, z^a_i)} is less than m. So by the induction hypothesis, if Pr[G(x) = G(z^a) = i] > JP(x, z^a)_i for some i > a, then G must sacrifice collisions on yet another pair, and by lemma IV.3.4 every pair involved has JP at least JP(x, y). ∎

What of JW then? Is it another Pareto optimum? It is not; it is dominated by JP.

Theorem IV.5. If JW(x, y) = (1−p)/(1+p), then (1−p)/(1+p) ≤ JP(x, y) ≤ 1−p, and there are distributions that achieve both bounds on JP for any value of JW.

Proof. The lower bound becomes clear by rewriting JW in a similar form to JP:

  JW(x, y) = ∑_i 1 / ∑_j [ max(x_j, y_j) / min(x_i, y_i) ]
           = ∑_i 1 / ∑_j max( x_j/x_i , y_j/y_i , x_j/y_i , y_j/x_i )
           ≤ ∑_i 1 / ∑_j max( x_j/x_i , y_j/y_i ) = JP(x, y).

To achieve this lower bound, we can transform the distributions by moving the "excess" mass to new elements:

  x′_{2i} = y′_{2i} = min(x_i, y_i),   x′_{2i+1} = max(x_i − y_i, 0),   y′_{2i+1} = max(y_i − x_i, 0).

Shifting the mass in this way has no effect on JW, but it decreases JP to equal JW: x′ and y′ can be expressed as linear combinations over the three distributions with disjoint support (the shared mass and the two excesses), so by lemma IV.2.2, JP(x′, y′) = JP((1−p, p, 0), (1−p, 0, p)) = (1−p)/(1+p) = JW(x, y).

The upper bound, 1−p, is the Total Variation limit. It is achieved by pairs that give more probability to the root node than all others; in particular, the tree distributions (i, (1, . . . , n)). Since i and its internal-node sibling form a two element distribution, by lemma IV.2.1 the probability of a collision on i will be min(x_i, y_i) for all x, y, so that JP(x, y) = ∑_i min(x_i, y_i) = 1 − p. ∎

Theorem IV.6. 1 − JP is a proper metric on P(Ω), where P(Ω) is the space of probability distributions over a finite set Ω.

Proof. Symmetry is obvious. Non-degeneracy over P(Ω) follows from 1 / ∑_{j∈Ω} max(x_j/x_i, y_j/y_i) ≤ min(x_i, y_i). The triangle inequality follows from 1 − JP being a collision probability, 1 − JP(x, y) = Pr[H(x) ≠ H(y)]. Therefore,

  Pr[H(x) = H(y)] ≥ Pr[H(x) = H(z) ∧ H(y) = H(z)]
  Pr[H(x) ≠ H(y)] ≤ Pr[H(x) ≠ H(z) ∨ H(y) ≠ H(z)]

for any distribution z. But by the union bound,

  Pr[H(x) ≠ H(z) ∨ H(y) ≠ H(z)] ≤ Pr[H(x) ≠ H(z)] + Pr[H(y) ≠ H(z)]. ∎
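The bounds of Theorem IV.5 and the excess-mass transform from its proof are easy to check numerically. The following sketch is ours, with illustrative helper names, and uses the straightforward O(n²) evaluation of JP:

```python
import numpy as np

def jw(x, y):
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

def jp(x, y):
    idx = np.where((x > 0) & (y > 0))[0]
    return sum(1.0 / np.maximum(x / x[i], y / y[i]).sum() for i in idx)

def excess_split(x, y):
    """Transform from the proof of Theorem IV.5: shared mass stays on even
    slots, excess mass moves to fresh odd slots."""
    n = len(x)
    xp, yp = np.zeros(2 * n), np.zeros(2 * n)
    xp[0::2] = yp[0::2] = np.minimum(x, y)
    xp[1::2] = np.maximum(x - y, 0.0)
    yp[1::2] = np.maximum(y - x, 0.0)
    return xp, yp

rng = np.random.default_rng(0)
x = rng.dirichlet(np.ones(20))
y = rng.dirichlet(np.ones(20))
tv = np.abs(x - y).sum() / 2
print(jw(x, y) <= jp(x, y) <= 1 - tv)     # J_W <= J_P <= 1 - TV (Theorem IV.5)
xp, yp = excess_split(x, y)
print(np.isclose(jw(xp, yp), jw(x, y)),   # J_W unchanged by the transform
      np.isclose(jp(xp, yp), jw(x, y)))   # J_P collapses to J_W
```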

V. HASHING ON DENSE AND CONTINUOUS DATA

The algorithm we've presented so far is suitable for sparse data such as documents or graphs. It is linear in the number of non-zeros, equivalent to Ioffe 2010 [7]. On dense data (such as image feature histograms[8]) there's significant overlap in the supports of each distribution, so rehashing each element for every distribution wastes work. With a shared stream of sorted hashes, we expect the hash we select for each distribution to be biased towards the beginning of the stream, and closer to the beginning when the data is denser. Therefore one might expect that we could improve performance by searching only some prefix of the stream to find our sample.

A* Sampling (Maddison, Tarlow, and Minka 2014[10]) explores this idea thoroughly, and we lean on it heavily in this section. In particular, we use their "Global Bound" algorithm, and essentially just run it with a fixed random seed (algorithm 2). We leave the proof of running time and of correctness as a sampling method to that work, and limit our discussion to the proof of the resulting collision probability. (Their derivation uses Gumbel variables and maxima. We use exponential variables and minima to make the continuity with the rest of our work clear, which is achieved by a simple change of variables.)

Algorithm 2: Dense and Continuous P-MinHash. "Global-Bound" A* Sampling [10] with a fixed seed.
  input : sample space Ω,
          sigma-finite measure µ,
          proposal sigma-finite measure λ,
          finite upper bound B := max(µ(i)/λ(i)),
          shared random seed s
  output: Stable sample from (Ω, F, µ)
  (M, X*, k, e_{−1}) ← (inf, null, 0, 0)
  do
    e_k ← −log(UniformNonZeroFloat(k, s)) + e_{k−1}
    X_k ← Sample(λ(Ω), s)
    M_k ← e_k λ(X_k) / µ(X_k)
    if M_k < M then
      M ← M_k
      X* ← X_k
    end
    Ω ← Ω \ X_k
    k ← k + 1
  while M > e_k / B
  return X*

The key insight of A* Sampling is that when a (possibly infinite) stream of independent exponential random variables is ordered, the exact marginal distribution of each variable can be computed as a function of the rank and the previous variables. If the vector of sorted exponential variables is e with corresponding parameter vector x, then once e_1, . . . , e_{k−1} are all known, the distribution of e_k is a truncated exponential whose rate is the total parameter mass of the elements not yet selected, |x \ ∪_{i=1}^{k−1} x_i|, truncated from below at e_{k−1}. The "statelessness" of exponential variables makes this truncation easy to accomplish: simply generate the desired exponential, and add e_{k−1} to shift it.

To change the parameters of those exponentials and find the new minimum element, only a small prefix of the list must be examined. The running time of finding the new minimum is a function of the difference between the two vectors of parameters, and has equivalent running time to rejection sampling. This gives algorithm 2 equivalent running time to the state of the art for computing a MinHash of dense data, Shrivastava 2016[8]. Because algorithm 2 admits an unbounded list of random variables, it is also applicable to continuous distributions, as the paper[10] describes in detail.

Let's first show that algorithm 2 gives JP(µ, ν) when Ω is finite. Indeed, this construction is simply an alternative way of finding the minimum of −log U_i / µ_i: A* sampling merely reads off the minimum of (−log U_i / λ_i) · (λ_i / µ_i) = −log U_i / µ_i, and similarly for ν.

To prove the general case for infinite Ω, we need to first define what we mean by JP(µ, ν) in that setting. One option is to replace all the summations by integrals in the formula (2). This runs into two difficulties however:
1) A probability space Ω may not be a subset of R^n.
2) Either µ or ν could be singular.
Instead, we define it as a limit over increasingly finer finite partitions of Ω. More formally,

Definition V.1. Assume JP(µ, ν) is defined as before when |Ω| < ∞. We define

  JP(µ, ν) = inf_{F⊢Ω} JP(µ_F, ν_F),    (4)

where F ranges over finite partitions of the space Ω, and µ_F denotes the push-forward of µ with respect to the map π : Ω → F, π(x) = Q ∈ F iff x ∈ Q. (µ_F is simply a coarsified probability measure on the finite space F where it (tautologically) assigns probability µ(Q) to the element Q ∈ F.)

First we verify that the definition above coincides with JP when Ω is finite.

Lemma V.2. Let |Ω′| + 1 = |Ω| = n, µ, ν ∈ P(Ω), and µ′, ν′ ∈ P(Ω′) be obtained by merging the last two elements of Ω into a single element. Then JP(µ, ν) ≤ JP(µ′, ν′), with strict inequality if both µ, ν have nonzero masses on those two elements.

Proof. By considering JP(µ, ν) as the probability that the argmins of two lists of independent exponentials land on the same index 1 ≤ i* ≤ n, and using the fact that the minimum of two independent exponentials is an exponential with the sum of the parameters, we can couple the four argmins arising from µ, ν, µ′, ν′ and conclude by inspection. ∎

The lemma shows that any partition of Ω will lead to a JP that's greater than or equal to the original JP. So the infimum is achieved with the most refined partition, namely Ω itself.
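For finite Ω, Algorithm 2 can be sketched as below. This is our illustrative Python, not the paper's implementation: a seeded PRNG stands in for the hash-driven draws UniformNonZeroFloat and Sample, which preserves the essential property that the proposal stream depends only on λ and the seed.

```python
import math
import random

def dense_p_minhash(mu, lam, seed):
    """Sketch of Algorithm 2 on a finite sample space.
    mu, lam: dicts element -> mass (lam is the shared proposal measure).
    Returns the stable sample from mu. Because the proposal stream (X_k, e_k)
    depends only on lam and seed, distributions sharing both collide with
    probability J_P."""
    B = max(mu[i] / lam[i] for i in mu)        # bound on mu / lam
    rng = random.Random(seed)                  # shared seed -> shared stream
    remaining = sorted(lam)                    # deterministic element order
    weights = [lam[i] for i in remaining]
    e, best, best_m = 0.0, None, math.inf
    while remaining:
        e += rng.expovariate(sum(weights))     # next arrival time under lam
        k = rng.choices(range(len(remaining)), weights=weights)[0]
        x = remaining.pop(k)                   # sample lam without replacement
        weights.pop(k)
        m = e * lam[x] / mu[x] if mu.get(x, 0.0) > 0 else math.inf
        if m < best_m:
            best, best_m = x, m                # running minimum of the mu-exponentials
        if best_m <= e / B:                    # no later element can beat it
            break
    return best

mu = {"a": 0.5, "b": 0.4, "c": 0.1}
nu = {"a": 0.2, "b": 0.4, "c": 0.4}
lam = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
hits = sum(dense_p_minhash(mu, lam, s) == dense_p_minhash(nu, lam, s)
           for s in range(3000))
print(hits / 3000)   # approaches J_P(mu, nu)
```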


Fig. 3: The Jensen-Shannon divergence compared with JP and JW on pairs of normalized unigram term vectors of random web documents. JSD has a much tighter relationship with JP than with JW. We show exact bounds for JW against JSD, the curves (1−x)/(1+x) and d((1−x)/(1+x)), and an approximate bound for JP against JSD, where d(x) = ((1−x)/2) log₂(1−x) + ((1+x)/2) log₂(1+x). The curve that appears to lower bound JSD against JP, d(1−JP), is violated on 10⁻⁷ of the pairs.


Fig. 4: Comparison of Jaccard measures on normalized unigram term vectors of pairs of random web documents. The left graph shows the joint distribution of JP and JW and the bounds we prove. The right graph shows the conditional distributions of JP and JW against the (set) Jaccard index of the terms. We show the distribution of the log of their ratios against the Jaccard index and highlight the median. JP is generally centered around the Jaccard index, while JW is consistently centered below, as predicted by their behavior on uniform distributions.

Finally we show that A* sampling applied to µ and ν simultaneously has a collision probability equal to JP(µ, ν) as defined above.

Theorem V.3. Given two probability measures µ and ν on an arbitrary Polish space Ω, both absolutely continuous with respect to a common third measure λ, it is possible to apply A* sampling with base distribution λ, either in-order or with a hierarchical partition of Ω, to sample from µ and ν simultaneously. Further, the probability of the procedure terminating at the same point p ∈ Ω for both µ and ν is exactly JP(µ, ν).

Proof. The first statement follows from the procedural definition of A* sampling described in [10]. For the second statement, since in-order A* is proven equivalent to hierarchical partition A* in [10], we are free to choose any partition to our convenience. The natural choice is then the partition used in the definition of JP(µ, ν). More precisely, we know there is a finite partition F ⊢ Ω such that JP(µ, ν) ≤ JP(µ_F, ν_F) ≤ JP(µ, ν) + ε, for any ε > 0. On the other hand, for a finite partition like F, the exponential variables attached to the representative of each part Q ∈ F are jointly distributed as exponentials with rate λ(Q) with a common seed. Let U, V be the two coupled A* processes restricted to F. Either one of them does not terminate, or they both terminate and collide conditionally with probability JP(µ_F, ν_F). In other words, letting P(T) be the probability that both terminate at level F, and AC(U, V; F) be the collision probability of U, V restricted to F, then AC(U, V; F) = P(T) · JP(µ_F, ν_F). Thus

  JP(µ, ν) P(T) ≤ AC(U, V; F) ≤ JP(µ_F, ν_F) ≤ JP(µ, ν) + ε.

So AC(U, V; F) is squeezed between JP(µ, ν)P(T) and JP(µ, ν) + ε. Since P(T) → 1 and ε → 0 under a refinement sequence F, we get in the limit

  AC(U, V) := AC(U, V; F_∞) = JP(µ, ν). ∎
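Definition V.1 and Lemma V.2 are also easy to probe numerically on a finite Ω. In the sketch below (ours; coarsify is a hypothetical helper), merging elements of the partition can only increase JP, so the infimum in (4) is attained on the finest partition:

```python
import numpy as np

def jp(x, y):
    idx = np.where((x > 0) & (y > 0))[0]
    return sum(1.0 / np.maximum(x / x[i], y / y[i]).sum() for i in idx)

def coarsify(x, parts):
    """Push a distribution forward onto a partition (a list of index lists)."""
    return np.array([x[q].sum() for q in parts])

rng = np.random.default_rng(2)
mu = rng.dirichlet(np.ones(6))
nu = rng.dirichlet(np.ones(6))
fine = [[0], [1], [2], [3], [4], [5]]
coarse = [[0, 1], [2, 3], [4, 5]]
print(jp(mu, nu) <= jp(coarsify(mu, coarse), coarsify(nu, coarse)))   # Lemma V.2
print(np.isclose(jp(coarsify(mu, fine), coarsify(nu, fine)), jp(mu, nu)))
```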


(a) Performance on retrieving pairs with Jensen-Shannon divergence less than 0.25. JP performs better both by having a tighter relationship with JSD, and by having higher match probabilities overall, making high recall cheaper to achieve. P-MinHash achieves the same performance with 64 hashes that a W-MinHash achieves with 128.

(b) Performance on retrieving pairs with JW greater than 0.5. The fact that P-MinHash slightly outperforms W-MinHashes illustrates the benefit of higher match probabilities. JP is higher than JW for pairs that are similar under JW, which makes high recall cheaper. As the cost increases, JW overtakes JP as the difference between the two scores becomes the larger factor.

Fig. 5: Precision/recall curves illustrating the typical case of retrieval using a key-value store. Each point represents outputting o independent sums of a hashes each, for a collision probability of 1 − (1 − p^a)^o. The cost in storage and CPU is dominated by o, so we connect these points to show the trade-offs possible at similar cost.

VI. UTILITY OF JP ON PAIRS OF WEB DOCUMENTS

To determine whether the difference between JP and JW matters in practice and whether achieving JP as a collision probability is a useful goal, we computed both for a large sample of pairs of unigram term vectors of web documents. From an index of 6.6 billion documents, we selected pairs using a sum of 2 W-MinHashes to perform importance sampling. We computed several similarity scores for 100 million pairs of normalized unigram term vectors, and weighted them by the inverse of their sampling probability to simulate an unbiased sample of all non-zero pairs.

The Jensen-Shannon divergence (JSD) defines the information loss that results from representing two distributions using a model that is an equal mixture of them, and as such is the ideal criterion to form information-preserving clusters of items of equal importance. Like both JW and JP it is bounded, symmetric, and monotonic in a metric distance. Due to these properties, as well as its popularity, we use it here as a basis for comparison.

JP has a much tighter relationship with the Jensen-Shannon divergence than JW, as shown in figure 3. Tight bounds on JSD as a function of JW are given by JW's monotonic relationship with Total Variation, as described by [11]. Let p be the total variation distance, and d(p) = ((1−p)/2) log₂(1−p) + ((1+p)/2) log₂(1+p); then p ≥ JSD(x, y) ≥ d(p). Substituting p = (1 − JW)/(1 + JW) extends these bounds to JW. These same bounds apply to JP as well, but JP has a much tighter relationship with what appears to be a much higher lower bound.
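A small sketch (ours) of these Total Variation bounds, computing JSD in bits and d(p) as defined above on a random pair:

```python
import numpy as np

def jsd_bits(x, y):
    """Jensen-Shannon divergence in bits."""
    m = (x + y) / 2
    def kl(p, q):
        nz = p > 0
        return np.sum(p[nz] * np.log2(p[nz] / q[nz]))
    return 0.5 * kl(x, m) + 0.5 * kl(y, m)

def d(p):
    """d(p) for 0 <= p < 1; the limit at p = 1 is 1."""
    return (1 - p) / 2 * np.log2(1 - p) + (1 + p) / 2 * np.log2(1 + p)

rng = np.random.default_rng(1)
x, y = rng.dirichlet(np.ones(50)), rng.dirichlet(np.ones(50))
tv = np.abs(x - y).sum() / 2
print(d(tv) <= jsd_bits(x, y) <= tv)   # the Total Variation bounds on JSD
```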

We have approached finding this lower bound with large differential equations that we have only solved numerically, but small examples form good approximate bounds. On 2 element distributions, JSD has a direct relationship with JP, and only 1 × 10⁻⁷ of the pairs fall below the resulting curve, d(1 − JP). No pairs in our sample had JSD more than 0.0077 below it. In contrast, JW puts 7 × 10⁻³ of the pairs below this curve, with the farthest point 0.16 below.

We also compare both JP and JW to the Jaccard index of the set of terms, and compute a kernel density estimate of the log of their ratios. JP is generally centered around the Jaccard index, while JW is consistently centered below, as predicted by their behavior on uniform distributions. This makes P-MinHash less disruptive as a drop-in replacement for an unweighted MinHash. Parameters of the system such as the number of hashes or the length of concatenated hashes are likely to continue to function well.

In the typical case of retrieval using a key-value store, performance is characterized by cheap ANDs and expensive ORs.[12] To reduce the collision probability we can sum multiple hashes to form keys, but to raise the collision probability, we must output multiple independent keys. This lets us apply an asymmetric sigmoid to the collision probabilities, 1 − (1 − p^a)^o, with o independent outputs of a summed hashes each. Assuming that the cost of looking up hashes dominates the cost of generating them, ANDs are essentially free, while CPU and storage cost are both linear in the number of ORs. Furthermore, as a increases linearly, o must increase exponentially to keep the inflection point of the sigmoid in the same place. For instance, if the sigmoid passes through (0.5, 0.5), then o ≈ log(2)·2^a. This gives a significant performance advantage to algorithms with higher collision probabilities, and thus to JP over JW. Lowering the probability is much cheaper than raising it.
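As a concrete illustration of this asymmetry (our sketch; collision_prob is a hypothetical helper), holding the sigmoid's midpoint near p = 0.5 while increasing a forces o to grow roughly like log(2)·2^a:

```python
import numpy as np

def collision_prob(p, a, o):
    """Probability that at least one of o keys matches, where each key
    is a concatenation (sum) of a independent MinHashes."""
    return 1 - (1 - p**a) ** o

for a in (1, 2, 4, 8):
    o = int(round(np.log(2) * 2**a))
    print(a, o, collision_prob(0.5, a, o))   # stays approximately 0.5
```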
The effect of this is demonstrated in Figure 5. Unsurprisingly, given the tightness of the joint distribution, P-MinHash achieves better precision and recall retrieving low-JSD documents for a given cost. More surprising is that it also achieves slightly better precision and recall on retrieving high-JW documents when the cost is low, even though this is the task W-MinHashes are designed for. The reason for this can be seen from the upper bound, JP ≤ 2JW/(1 + JW). On items that achieve this bound, the collision probability when summing two hashes, (2x/(1+x))², is similar to 4x² near 0 and similar to x near 1. This in effect gives it the recall of 1 hash with the precision of 2 hashes on this subset of items, and thus a better precision/recall trade-off overall.

VII. CONCLUSION

We've described a new generalization of the Jaccard index, and shown several qualities that motivate it as the natural extension to probability distributions. In particular, we proved that it is optimal on all distributions in the same sense that the Jaccard index is optimal on uniform distributions. We've demonstrated its utility by showing JP's similarity in practice to the Jensen-Shannon divergence, a popular clustering criterion. We've described two MinHashing algorithms that achieve this as their collision probability with equivalent running time to the state of the art on both sparse and dense data.

REFERENCES

[1] A. Z. Broder, "On the resemblance and containment of documents," in Compression and Complexity of Sequences 1997. Proceedings, IEEE, 1997, pp. 21–29.
[2] ——, "Filtering near-duplicate documents," in Proc. FUN 98, Citeseer, 1998.
[3] D. Gibson, R. Kumar, and A. Tomkins, "Discovering large dense subgraphs in massive graphs," in Proceedings of the 31st International Conference on Very Large Data Bases, VLDB Endowment, 2005, pp. 721–732.
[4] S.-J. Lee, S. Kang, H. Kim, and J.-K. Min, "An efficient parallel graph clustering technique using Pregel," in Big Data and Smart Computing (BigComp), 2016 International Conference on, IEEE, 2016, pp. 370–373.
[5] P. Jaccard, "Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines," Bulletin de la Société Vaudoise des Sciences Naturelles 37, 1901, pp. 241–272.
[6] O. Chum, J. Philbin, A. Zisserman, et al., "Near duplicate image detection: min-hash and tf-idf weighting," in BMVC, vol. 810, 2008, pp. 812–815.
[7] S. Ioffe, "Improved consistent sampling, weighted minhash and L1 sketching," in ICDM, 2010.
[8] A. Shrivastava, "Simple and efficient weighted minwise hashing," in Advances in Neural Information Processing Systems, 2016, pp. 1498–1506.
[9] M. Manasse, F. McSherry, and K. Talwar, "Consistent weighted sampling," Tech. Rep., Jun. 2010. [Online]. Available: https://www.microsoft.com/en-us/research/publication/consistent-weighted-sampling/.
[10] C. J. Maddison, D. Tarlow, and T. Minka, "A* sampling," in Advances in Neural Information Processing Systems, 2014, pp. 3086–3094.
[11] I. Sason, "Tight bounds for symmetric divergence measures and a refined bound for lossless source coding," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 701–707, 2015.
[12] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proceedings of the Twentieth Annual Symposium on Computational Geometry, ACM, 2004, pp. 253–262.
