Maximally Consistent Sampling and the Jaccard Index of Probability Distributions
Anonymous, Anonymous

Abstract—We introduce simple, efficient algorithms for computing a MinHash of a probability distribution, suitable for both sparse and dense data, with running times equivalent to the state of the art for both cases. The collision probability of these algorithms is a new measure of the similarity of positive vectors which we investigate in detail. We describe the sense in which this collision probability is optimal for any Locality Sensitive Hash based on sampling. We argue that this similarity measure is more useful for probability distributions than the similarity pursued by other algorithms for weighted MinHash, and is the natural generalization of the Jaccard index.

Index Terms—Locality Sensitive Hashing, Retrieval, MinHash, Jaccard index, Jensen-Shannon divergence

I. INTRODUCTION

MinHashing[1] is a popular Locality Sensitive Hashing algorithm for clustering and retrieval on large datasets. Its extreme simplicity and efficiency, as well as its natural pairing with MapReduce and key-value datastores, have made it a basic building block in many domains, particularly document clustering[1][2] and graph clustering[3][4].

Given a finite set $U$ and a uniformly random permutation $\pi$, the map $X \mapsto \arg\min_{i \in X} \pi(i)$ provides a representation of any subset $X$ of $U$ that is stable under small changes to $X$. If $X$, $Y$ are both subsets of $U$, the well-known Jaccard index[5] is defined as

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}.$$

Stability of $\arg\min_{i \in X} \pi(i)$ with respect to $X$ is quantified by

$$\Pr\left[\arg\min_{i \in X} \pi(i) = \arg\min_{i \in Y} \pi(i)\right] = J(X, Y).$$

Practically, this random permutation is generated by applying some hash function to each $i$ with a fixed random seed, hence "MinHashing."

In order to hash objects other than sets, Chum et al.[6] introduced two algorithms for incorporating weights in the computation of MinHashes. The first algorithm associates constant global weights with the set members, suitable for idf weighting. The collision probability that results is $\sum_{i \in X \cap Y} w_i \,/\, \sum_{i \in X \cup Y} w_i$. The second algorithm computes MinHashes of vectors of positive integers, yielding a collision probability of

$$J_W(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}.$$

Subsequent work[7][8][9] has improved the efficiency of the second algorithm and extended it to arbitrary positive weights, while still achieving $J_W$ as the collision probability.

$J_W$ is one of several generalizations of the Jaccard index to non-negative vectors. It is useful because it is monotonic with respect to the $L_1$ distance between $x$ and $y$ when they are both $L_1$ normalized, but it is unnatural in many ways for probability distributions.

If we convert sets to binary vectors $x, y$, with $x_i, y_i \in \{0, 1\}$, then $J_W(x, y) = J(X, Y)$. But if we convert these vectors to probability distributions by normalizing them so that $x_i \in \{0, 1/|X|\}$ and $y_i \in \{0, 1/|Y|\}$, then $J_W(x, y) \neq J(X, Y)$ when $|X| \neq |Y|$. The correspondence breaks and it no longer generalizes the Jaccard index. Instead, for $|X| > |Y|$,

$$J_W(x, y) = \frac{|X \cap Y|}{|X \setminus Y| + |X|} < J(X, Y). \qquad (1)$$

As a consequence, switching a system from an unweighted MinHash to a MinHash based on $J_W$ will generally decrease the collision probabilities.

Furthermore, $J_W$ is insensitive to important differences between probability distributions. It counts all differences on an element in the same linear scale regardless of the mass the distributions share on that element. For instance, $J_W((a, b, c, 0), (a, b, 0, c)) = J_W((a + c, b), (a, b + c))$. This makes it a poor choice when the ultimate goal is to measure similarity using an expression based in information theory or likelihood, where having differing support typically results in the worst possible score.
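To make these two issues concrete, the following short Python sketch (ours, not from the paper; the helper names jaccard and j_w are illustrative) computes the Jaccard index and $J_W$ and reproduces both effects: normalizing sets of unequal size pushes $J_W$ below $J$ as in equation (1), and $J_W$ scores a disagreement in support the same as a disagreement on shared support.

    def jaccard(X, Y):
        # Jaccard index of two sets.
        return len(X & Y) / len(X | Y)

    def j_w(x, y):
        # J_W: sum of element-wise minima over sum of element-wise maxima.
        return sum(min(a, b) for a, b in zip(x, y)) / \
               sum(max(a, b) for a, b in zip(x, y))

    X, Y = {0, 1, 2, 3}, {0, 1}
    print(jaccard(X, Y))                              # 0.5
    print(j_w([1, 1, 1, 1], [1, 1, 0, 0]))            # 0.5: on indicator vectors J_W matches J
    print(j_w([.25, .25, .25, .25], [.5, .5, 0, 0]))  # 1/3 < 0.5: normalization drops it, equation (1)

    a, b, c = 0.5, 0.3, 0.2
    print(j_w((a, b, c, 0), (a, b, 0, c)))            # 2/3: mass c moved to a disjoint element
    print(j_w((a + c, b), (a, b + c)))                # 2/3: same score for a shift within shared support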
For a drop-in replacement for the Jaccard index that treats its input as a probability distribution, we'd like it to have the following properties.

1) Scale invariant.
2) Not lower than the Jaccard index when applied to discrete uniform distributions.
3) Sensitive to changes in support, in a similar way to information-based measures.
4) Easily achievable as a collision probability.

$J_W$ fails all but the last.

1) It isn't scale invariant: $J_W(\alpha x, y) \neq J_W(x, y)$.
2) If the vectors are normalized to make it scale invariant, the values drop below the corresponding Jaccard index (equation 1).
3) It is insensitive to changes in support.
4) Good algorithms exist, but they are non-trivial.

Existing work has thoroughly explored improvements to Chum et al.'s second algorithm, while leaving their first untouched. In this work we instead take their first algorithm as a starting point. We extend it to arbitrary positive vectors (rather than sets with constant global weights) and analyze the result. In doing so, we find that the collision probability is a new generalization of the Jaccard index to positive vectors, which we here call $J_P$:

$$J_P(x, y) = \sum_{\{i :\, x_i, y_i > 0\}} \frac{1}{\sum_j \max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)}. \qquad (2)$$

The names used here, $J_W$ and $J_P$, are chosen to reflect how each function interprets $x$ and $y$, and the conditions under which they match the original Jaccard index. $J_W$ treats a difference in magnitude the same as any other difference, so it treats vectors as "weighted sets." $J_P$ is scale invariant, so any input is treated the same as a corresponding probability distribution.

The primary contribution of this work is to derive and analyze $J_P$, and to show that in many situations where the objects being hashed are probability distributions, $J_P$ is a more useful collision probability than $J_W$.

We will describe the sense in which $J_P$ is an optimal collision probability for any LSH based on sampling. We will prove that if the collision probability of a sampling algorithm exceeds $J_P$ on one pair, it must sacrifice collisions on a pair that has higher $J_P$.

We will motivate $J_P$'s utility by showing experimentally that it has a tighter relationship to the Jensen-Shannon divergence than $J_W$, and is more closely centered around the Jaccard index than $J_W$. We will even show empirically that in some circumstances, it is better for retrieving documents that are similar under $J_W$ than $J_W$ itself (and consequently, sometimes better for retrieving based on $L_1$ distance).

Algorithm 1: P-MinHash. We require only a single exponentially distributed hash per nonzero element.
    input : Vector x, Seed s
    output: Stable sample from x
    foreach i where x_i > 0 do
        e_i <- -log(UniformNonZeroFloat(i, s)) / x_i
    end
    return arg min_i e_i
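The sketch below is one possible concrete reading of Algorithm 1 and of equation (2) in Python; it is illustrative only, and the helper names (_uniform_nonzero, p_minhash, j_p) and the SHA-256-based hash are our assumptions rather than the paper's reference implementation.

    import hashlib
    import math

    def _uniform_nonzero(i, seed):
        # Hash (element, seed) deterministically to a uniform float in (0, 1].
        digest = hashlib.sha256(f"{seed}:{i}".encode()).digest()
        return (int.from_bytes(digest[:7], "big") + 1) / 2.0**56

    def p_minhash(x, seed):
        # Algorithm 1: one exponential draw of rate x_i per nonzero element;
        # the index of the minimum draw is the stable sample from x.
        best_i, best_e = None, math.inf
        for i, xi in enumerate(x):
            if xi > 0:
                e = -math.log(_uniform_nonzero(i, seed)) / xi
                if e < best_e:
                    best_i, best_e = i, e
        return best_i

    def j_p(x, y):
        # Equation (2): J_P(x, y) for non-negative vectors of equal length.
        return sum(
            1.0 / sum(max(xj / xi, yj / yi) for xj, yj in zip(x, y))
            for xi, yi in zip(x, y)
            if xi > 0 and yi > 0
        )

Theorem II.1 in the next section shows that when two vectors are hashed with the same seed, these samples collide with probability exactly $J_P(x, y)$.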
II. P-MINHASH AND ITS MATCH PROBABILITY

Let $h : [n] \to (0, 1]$ be a pseudo-random hash mapping every element $1 \le i \le n$ to an independent uniform random value in $(0, 1]$. Over a non-negative vector $x$, define

$$H(x) := \arg\min_i \frac{-\log h(i)}{x_i}.$$

For brevity, we will be using the extended real number system in which $1/\infty := 0$. Each term is an exponentially distributed random variable with rate $x_i$,

$$\Pr\left[\frac{-\log h(i)}{x_i} < y\right] = 1 - e^{-x_i y},$$

so it follows that

$$\Pr[H(x) = i] = \frac{x_i}{\sum_i x_i}.$$

This well-known and beautiful property of exponential random variables derives from the fact that $\Pr[\min(x, y) > \alpha] = \Pr[x > \alpha]\Pr[y > \alpha]$.

Theorem II.1. For any two nonnegative vectors $x, y \in \mathbb{R}^n_+$,

$$\Pr[H(x) = H(y)] = J_P(x, y).$$

Proof. Any monotonic transform on $\frac{-\log h(i)}{x_i}$ will not change the arg min, so multiplying each $x_i$ by a positive $\alpha$ won't either: $H(\alpha x) = H(x)$. Thus for $x_i, y_i > 0$,

$$\Pr[i = H(x) = H(y)] = \Pr\left[i = H\!\left(\frac{x}{x_i}\right) = H\!\left(\frac{y}{y_i}\right)\right].$$

By definition, $H(x/x_i) = H(y/y_i) = i$ means $\frac{-\log h(i)}{x_i/x_i} \le \frac{-\log h(j)}{x_j/x_i}$ and $\frac{-\log h(i)}{y_i/y_i} \le \frac{-\log h(j)}{y_j/y_i}$, for $j \neq i$, or equivalently,

$$-\log h(i) \le \min\left(\frac{-\log h(j)}{x_j/x_i}, \frac{-\log h(j)}{y_j/y_i}\right) = \frac{-\log h(j)}{\max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)}.$$

Now we desire a new vector $z^i$ such that $H(z^i) = i$ if and only if $H(x/x_i) = H(y/y_i) = i$. This requires that $\frac{-\log h(j)}{z^i_j / z^i_i} = \frac{-\log h(j)}{\max(x_j/x_i,\, y_j/y_i)}$. Thus for fixed $i$,

$$z^i_j = \frac{\max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)}{\sum_k \max\left(\frac{x_k}{x_i}, \frac{y_k}{y_i}\right)}.$$

Consequently,

$$\Pr\left[H(z^i) = i\right] = \frac{1}{\sum_j \max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)}.$$

Repeating this process for all $i$ in the intersection yields

$$\Pr[H(x) = H(y)] = \sum_i \frac{1}{\sum_j \max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)} = J_P(x, y).$$

Fig. 1: The P-MinHash process can be interpreted geometrically as dividing the unit simplex into smaller simplexes proportional to the mass of each term, then selecting the same point in each simplex as its representative. Pictured here are $x = (0.5, 0.4, 0.1)$, $y = (0.2, 0.4, 0.4)$. Panels: (a) $\Pr[H(x) = i]$; (b) $\Pr[H(y) = i]$; (c) $\Pr[H(x), H(y) = i, j]$; (d) $\Pr[H(x) = H(y) = i]$. Colors are assigned to unique values of $i$, or $(i, j)$ in the case of the joint distributions. The sum of the remaining colored areas in (d) is proportional to $J_P(x, y)$.

… exterior face. As a result, a uniformly chosen point on the unit simplex will land in one of the sub-simplices with probability given by $x$.
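As an informal sanity check on Theorem II.1, one can estimate the collision probability of the sketch above by hashing the two distributions from Fig. 1 under many seeds; the snippet below (again illustrative, reusing the hypothetical p_minhash and j_p helpers defined earlier) compares the empirical collision rate with equation (2).

    x = (0.5, 0.4, 0.1)
    y = (0.2, 0.4, 0.4)

    trials = 100_000
    collisions = sum(p_minhash(x, s) == p_minhash(y, s) for s in range(trials))
    print(collisions / trials)  # Monte Carlo estimate of Pr[H(x) = H(y)]
    print(j_p(x, y))            # equation (2); the two should agree up to sampling noise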