SetSketch: Filling the Gap between MinHash and HyperLogLog
Otmar Ertl
Dynatrace Research
Linz, Austria
[email protected]

ABSTRACT
MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.

PVLDB Reference Format:
Otmar Ertl. SetSketch: Filling the Gap between MinHash and HyperLogLog. PVLDB, 14(11): 2244 - 2257, 2021.
doi:10.14778/3476249.3476276

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/dynatrace-research/set-sketch-paper.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 11 ISSN 2150-8097.
doi:10.14778/3476249.3476276

1 INTRODUCTION
Data sketches [16] that are able to represent sets of arbitrary size using only a small, fixed amount of memory have become important and widely used tools in big data applications. Although individual elements can no longer be accessed after insertion, they are still able to give approximate answers when querying the cardinality or joint quantities, which may include the intersection size, union size, size of set differences, inclusion coefficients, or similarity measures like the Jaccard and the cosine similarity.

Numerous such algorithms with different characteristics have been developed and published over the last two decades. They can be classified based on the following properties [53] and use cases:

Idempotency: Insertions of values that have already been added should not further change the state of the data structure. Even though this seems to be an obvious requirement, it is not satisfied by many sketches for sets [11, 47, 58, 67, 69].

Commutativity: The order of insert operations should not be relevant for the final state. If the processing order cannot be guaranteed, commutativity is needed to get reproducible results. Many data structures are not commutative [13, 32, 65].

Mergeability: In large-scale applications with distributed data sources it is essential that the data structure supports the union set operation. It allows combining data sketches resulting from partial data streams to get an overall result. Ideally, the union operation is idempotent, associative, and commutative to get reproducible results. Some data structures trade mergeability for better space efficiency [11, 13, 65].

Space efficiency: A small memory footprint is the key purpose of data sketches. They generally allow trading space for estimation accuracy, since the variance is typically inversely proportional to the sketch size. Better space efficiencies can sometimes also be achieved at the expense of recording speed, e.g. by compression [21, 36, 62].

Recording speed: Fast update times are crucial for many applications. They vary over many orders of magnitude for different data structures. For example, they may be constant, as for the HyperLogLog (HLL) sketch [30], or proportional to the sketch size, as for minwise hashing (MinHash) [6].

Cardinality estimation: An important use case is the estimation of the number of elements in a set. Sketches have different efficiencies in encoding cardinality information. Apart from the estimation error, the robustness, speed, and simplicity of the estimation algorithm are important for practical use. Many algorithms rely on empirical observations [34] or need to combine different estimators for different cardinality ranges, as is the case for the original estimators of HLL [30] and HyperMinHash [72].

Joint estimation: Knowing the relationship among sets is important for many applications. Estimating the overlap of both sets in addition to their cardinalities eventually allows the computation of joint quantities such as Jaccard similarity, cosine similarity, inclusion coefficients, intersection size, or difference sizes. Data structures store different amounts of joint information. In that regard, for example, MinHash contains more information than HLL. Extracting joint quantities is more complex compared to cardinalities, because it involves the state of two sketches and requires the estimation of three unknowns in the general case, which makes finding an efficient, robust, fast, and easy-to-implement estimation algorithm challenging. If a sketch supports cardinality estimation and is mergeable, the joint estimation problem can be solved using the inclusion-exclusion principle |U ∩ V| = |U| + |V| − |U ∪ V|. However, this naive approach is inefficient as it does not use all information available.

Locality sensitivity: A further use case is similarity search. If the same components of two different sketches are equal with a probability that is a monotonic function of some similarity measure, the sketch can be used for locality-sensitive hashing (LSH) [2, 35, 43, 73], an indexing technique that allows querying similar sets in sublinear time. For example, the components of MinHash have a collision probability that is equal to the Jaccard similarity.

The optimal choice of a sketch for sets depends on the application and is basically a trade-off between estimation accuracy, speed, and memory efficiency. Among the most widely used sketches are MinHash [6] and HLL [30]. Both are idempotent, commutative, and mergeable, which is probably one of the reasons for their popularity. The fast recording speed, simplicity, and memory efficiency have made HLL the state-of-the-art algorithm for cardinality estimation. In contrast, MinHash is significantly slower and less memory-efficient with respect to cardinality estimation, but it is locality-sensitive and allows easy and more accurate estimation of joint quantities.

1.1 Motivation and Contributions
The motivation to find a data structure that supports more accurate joint estimation and locality sensitivity like MinHash, while having a better space efficiency and a faster recording speed that are both closer to those of HLL, has led us to a new data structure called SetSketch. It fills the gap between HLL and MinHash and can be seen as a generalization of both, as it is configurable with a continuous parameter between those two special cases. The possibility to fine-tune between space efficiency and joint estimation accuracy allows better adaptation to given requirements.

We present fast, robust, and simple methods for cardinality and joint estimation that do not rely on any empirically determined constants and that also give consistent errors over the full cardinality range. The cardinality estimator matches that of MinHash or HLL in the corresponding limit cases. Together with the derived corrections for small and large cardinalities, which do not require any empirical calibration, the estimator can also be applied to HLL and generalized HyperLogLog (GHLL) sketches over the entire cardinality range. Since its derivation is much simpler than that in the original paper based on complex analysis [30], this work also contributes to a better understanding of the HLL algorithm.

The presented joint estimation method can also be specialized

1.2 MinHash
MinHash [6] maps a set S to an m-dimensional vector (h_1, ..., h_m) using

    h_i := min_{d ∈ S} h_i(d).

Here the h_i are independent hash functions. The probability that components U_i and V_i of two different MinHash sketches for sets U and V match equals the Jaccard similarity J:

    P(U_i = V_i) = |U ∩ V| / |U ∪ V| = J.

This locality sensitivity makes MinHash very suitable for set comparisons and similarity search, because the Jaccard similarity can be directly and quickly estimated from the fraction of equal components with a root-mean-square error (RMSE) of sqrt(J(1 − J)/m). MinHash also allows the estimation of cardinalities [12, 13] and other joint quantities [15, 19].

Meanwhile, many improvements and variants have been published that either improve the recording speed or the memory efficiency. One permutation hashing [40] reduces the cost for adding a new element from O(m) to O(1). However, there is a high probability of uninitialized components for small sets, leading to large estimation errors. This can be remedied by applying a finalization step called densification [44, 63, 64], which may be expensive for small sets [27] and also prevents further aggregations. Alternatively, fast similarity sketching [17], SuperMinHash [25], or weighted minwise hashing algorithms like BagMinHash [26] and ProbMinHash [27] specialized to unweighted sets can be used instead. Compared to one permutation hashing with densification, they are mergeable, allow further element insertions, and even give more accurate Jaccard similarity estimates for small set sizes.

To shrink the memory footprint, b-bit minwise hashing [39] can be used to reduce all MinHash components from typically 32 or 64 bits to only a few bits in a finalization step. Although this loss of information must be compensated by increasing the number of components m, the memory efficiency can be significantly improved if precise estimates are needed only for high similarities. However, the need for more components increases the computation time, and the sketch cannot be further aggregated or merged after finalization.
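To make the MinHash construction concrete, the following Python sketch builds an m-component MinHash and estimates the Jaccard similarity from the fraction of equal components. This is an illustration, not code from the paper: the helper names (`minhash`, `jaccard_estimate`) are our own, and the m independent hash functions h_i are simulated by salting a single SHA-256 hash with the component index.

```python
import hashlib

def minhash(s, m=128):
    """Map a set s to an m-dimensional vector whose i-th component
    is min over d in s of h_i(d)."""
    def h(i, d):
        # Simulate independent hash functions by salting SHA-256 with
        # the component index i (an illustrative choice).
        return int.from_bytes(
            hashlib.sha256(f"{i}:{d}".encode()).digest()[:8], "big")
    return [min(h(i, d) for d in s) for i in range(m)]

def jaccard_estimate(sketch_u, sketch_v):
    """Estimate J as the fraction of equal components; the RMSE of
    this estimator is sqrt(J * (1 - J) / m)."""
    m = len(sketch_u)
    return sum(a == b for a, b in zip(sketch_u, sketch_v)) / m

U = set(range(0, 300))    # |U| = 300
V = set(range(100, 400))  # |V| = 300, |U & V| = 200, |U | V| = 400
est = jaccard_estimate(minhash(U), minhash(V))  # true J = 200/400 = 0.5
```

With m = 128 components, the expected RMSE at J = 0.5 is sqrt(0.25/128) ≈ 0.044, so the estimate should land close to 0.5.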
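The naive joint estimation route via the inclusion-exclusion principle mentioned in the introduction can be written down directly. In the toy snippet below (our own illustration), exact cardinalities stand in for sketch estimates; with real sketches, each of the three cardinalities would come from an estimator, and their errors accumulate, which is why this approach is inefficient.

```python
def intersection_by_inclusion_exclusion(card_u, card_v, card_union):
    # |U ∩ V| = |U| + |V| - |U ∪ V|
    return card_u + card_v - card_union

U = set(range(0, 1500))
V = set(range(1000, 2500))
# With a mergeable sketch, card(U), card(V), and card(U | V) would each
# be cardinality estimates (the last one from the merged sketch).
n_intersect = intersection_by_inclusion_exclusion(len(U), len(V), len(U | V))
# 1500 + 1500 - 2500 = 500 = |U ∩ V|
```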
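The b-bit finalization step can likewise be sketched. The snippet below (again our own illustration, not the paper's code) keeps only the lowest b bits of each MinHash component and then corrects the raw collision rate for accidental collisions of non-matching components; the correction assumes uniformly distributed hash values and uses the standard first-order approximation that non-matching components collide with probability about 2^-b.

```python
import hashlib

def minhash(s, m=128):
    # Compact MinHash with salted SHA-256 standing in for the
    # independent hash functions (illustrative, self-contained).
    def h(i, d):
        return int.from_bytes(
            hashlib.sha256(f"{i}:{d}".encode()).digest()[:8], "big")
    return [min(h(i, d) for d in s) for i in range(m)]

def finalize_b_bit(sketch, b):
    # Keep only the lowest b bits of every component; after this
    # step the sketch can no longer be merged or extended.
    return [x & ((1 << b) - 1) for x in sketch]

def jaccard_from_b_bit(bu, bv, b):
    # Non-matching components still collide with probability ~2^-b,
    # so the raw collision rate is corrected accordingly.
    m = len(bu)
    rate = sum(x == y for x, y in zip(bu, bv)) / m
    c = 2.0 ** -b
    return (rate - c) / (1.0 - c)

b = 4
U = set(range(0, 300))
V = set(range(100, 400))  # true J = 0.5
est = jaccard_from_b_bit(finalize_b_bit(minhash(U), b),
                         finalize_b_bit(minhash(V), b), b)
```

Note how storage drops from 64 to b = 4 bits per component, at the cost of a higher variance of the corrected estimate, matching the trade-off described above.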