SetSketch: Filling the Gap between MinHash and HyperLogLog

Otmar Ertl, Dynatrace Research, Linz, Austria ([email protected])

ABSTRACT

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.

PVLDB Reference Format:
Otmar Ertl. SetSketch: Filling the Gap between MinHash and HyperLogLog. PVLDB, 14(11): 2244 - 2257, 2021.
doi:10.14778/3476249.3476276

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/dynatrace-research/set-sketch-paper.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 11 ISSN 2150-8097.
doi:10.14778/3476249.3476276

1 INTRODUCTION

Data sketches [16] that are able to represent sets of arbitrary size using only a small, fixed amount of memory have become important and widely used tools in big data applications. Although individual elements can no longer be accessed after insertion, they are still able to give approximate answers when querying the cardinality or joint quantities which may include the intersection size, union size, size of set differences, inclusion coefficients, or similarity measures like the Jaccard and the cosine similarity.

Numerous such algorithms with different characteristics have been developed and published over the last two decades. They can be classified based on following properties [53] and use cases:

Idempotency: Insertions of values that have already been added should not further change the state of the data structure. Even though this seems to be an obvious requirement, it is not satisfied by many sketches for sets [11, 47, 58, 67, 69].

Commutativity: The order of insert operations should not be relevant for the final state. If the processing order cannot be guaranteed, commutativity is needed to get reproducible results. Many data structures are not commutative [13, 32, 65].

Mergeability: In large-scale applications with distributed data sources it is essential that the data structure supports the union set operation. It allows combining data sketches resulting from partial data streams to get an overall result. Ideally, the union operation is idempotent, associative, and commutative to get reproducible results. Some data structures trade mergeability for better space efficiency [11, 13, 65].

Space efficiency: A small memory footprint is the key purpose of data sketches. They generally allow to trade space for estimation accuracy, since variance is typically inversely proportional to the sketch size. Better space efficiencies can sometimes also be achieved at the expense of recording speed, e.g. by compression [21, 36, 62].

Recording speed: Fast update times are crucial for many applications. They vary over many orders of magnitude for different data structures. For example, they may be constant like for the HyperLogLog (HLL) sketch [30] or proportional to the sketch size like for minwise hashing (MinHash) [6].

Cardinality estimation: An important use case is the estimation of the number of elements in a set. Sketches have different efficiencies in encoding cardinality information. Apart from estimation error, robustness, speed, and simplicity of the estimation algorithm are important for practical use. Many algorithms rely on empirical observations [34] or need to combine different estimators for different cardinality ranges, as is the case for the original estimators of HLL [30] and HyperMinHash [72].

Joint estimation: Knowing the relationship among sets is important for many applications. Estimating the overlap of both sets in addition to their cardinalities eventually allows the computation of joint quantities such as Jaccard similarity, cosine similarity, inclusion coefficients, intersection size, or difference sizes. Data structures store different amounts of joint information. In that regard, for example, MinHash contains more information than HLL. Extracting joint quantities is more complex compared to cardinalities, because it involves the state of two sketches and requires the estimation of three unknowns in the general case, which makes finding an efficient, robust, fast, and easy-to-implement estimation algorithm challenging. If a sketch supports cardinality estimation and is mergeable, the joint estimation problem can be solved using the inclusion-exclusion principle |U ∩ V| = |U| + |V| − |U ∪ V|. However, this naive approach is inefficient as it does not use all information available.

Locality sensitivity: A further use case is similarity search. If the same components of two different sketches are equal with a probability that is a monotonic function of some similarity measure, it can be used for locality-sensitive hashing (LSH) [2, 35, 43, 73], an

indexing technique that allows querying similar sets in sublinear time. For example, the components of MinHash have a collision probability that is equal to the Jaccard similarity.

The optimal choice of a sketch for sets depends on the application and is basically a trade-off between estimation accuracy, speed, and memory efficiency. Among the most widely used sketches are MinHash [6] and HLL [30]. Both are idempotent, commutative, and mergeable, which is probably one of the reasons for their popularity. The fast recording speed, simplicity, and memory-efficiency have made HLL the state-of-the-art algorithm for cardinality estimation. In contrast, MinHash is significantly slower and is less memory-efficient with respect to cardinality estimation, but is locality-sensitive and allows easy and more accurate estimation of joint quantities.

1.1 Motivation and Contributions

The motivation to find a data structure that supports more accurate joint estimation and locality sensitivity like MinHash, while having a better space-efficiency and a faster recording speed that are both closer to those of HLL, has led us to a new data structure called SetSketch. It fills the gap between HLL and MinHash and can be seen as a generalization of both as it is configurable with a continuous parameter between those two special cases. The possibility to fine-tune between space-efficiency and joint estimation accuracy allows better adaptation to given requirements.

We present fast, robust, and simple methods for cardinality and joint estimation that do not rely on any empirically determined constants and that also give consistent errors over the full cardinality range. The cardinality estimator matches that of both MinHash or HLL in the corresponding limit cases. Together with the derived corrections for small and large cardinalities, which do not require any empirical calibration, the estimator can also be applied to HLL and generalized HyperLogLog (GHLL) sketches over the entire cardinality range. Since its derivation is much simpler than that in the original paper based on complex analysis [30], this work also contributes to a better understanding of the HLL algorithm.

The presented joint estimation method can be also specialized and used for other data structures. In particular, we derived a new closed-form estimator for MinHash that dominates the state-of-the-art estimator based on the number of identical components. Furthermore, our approach is able to improve joint estimation from HLL, GHLL, and HyperMinHash sketches except for very small sets compared to the corresponding state-of-the-art estimators. Given the popularity of all those sketches, specifically of MinHash and HLL, these side results could therefore have a major impact in practice, as more accurate estimates can be made from existing and persisted data structures. We also found that a performance optimization actually developed for SetSketch can be applied to HLL to speed up recording of large sets.

The presented methods have been empirically verified by extensive simulations. For the sake of reproducibility the corresponding source code including a reference implementation of SetSketch is available at https://github.com/dynatrace-research/set-sketch-paper. An extended version of this paper with additional appendices containing mathematical proofs and more experimental results is also available [28].

1.2 MinHash

MinHash [6] maps a set S to an m-dimensional vector (k_1, ..., k_m) using

  k_i := min_{d ∈ S} h_i(d).

Here h_i are independent hash functions. The probability that components k_iU and k_iV of two different MinHash sketches for sets U and V match equals the Jaccard similarity J:

  P(k_iU = k_iV) = |U ∩ V| / |U ∪ V| = J.

This locality sensitivity makes MinHash very suitable for set comparisons and similarity search, because the Jaccard similarity can be directly and quickly estimated from the fraction of equal components with a root-mean-square error (RMSE) of √(J(1 − J)/m). MinHash also allows the estimation of cardinalities [12, 13] and other joint quantities [15, 19].

Meanwhile, many improvements and variants have been published that either improve the recording speed or the memory efficiency. One permutation hashing [40] reduces the costs for adding a new element from O(m) to O(1). However, there is a high probability of uninitialized components for small sets leading to large estimation errors. This can be remedied by applying a finalization step called densification [44, 63, 64] which may be expensive for small sets [27] and also prevents further aggregations. Alternatively, fast similarity sketching [17], SuperMinHash [25], or weighted minwise hashing algorithms like BagMinHash [26] and ProbMinHash [27] specialized to unweighted sets can be used instead. Compared to one permutation hashing with densification, they are mergeable, allow further element insertions, and even give more accurate Jaccard similarity estimates for small set sizes.

To shrink the memory footprint, b-bit minwise hashing [39] can be used to reduce all MinHash components from typically 32 or 64 bits to only a few bits in a finalization step. Although this loss of information must be compensated by increasing the number of components m, the memory efficiency can be significantly improved if precise estimates are needed only for high similarities. However, the need of more components increases the computation time and the sketch cannot be further aggregated or merged after finalization.

Besides the original application of finding similar websites [33], MinHash is nowadays widely used for similarity search [35], association-rule mining [14], machine learning [41], metagenomics [3, 22, 46, 51], molecular fingerprinting [57], graph embeddings [7], or malware detection [50, 60].

1.3 HyperLogLog

The HyperLogLog (HLL) data structure consists of m integer-valued registers k_1, k_2, ..., k_m similar to MinHash. While MinHash typically uses at least 32 bits per component, HLL only needs 5-bit or 6-bit registers to count up to billions of distinct elements [30, 34]. The state for a given set S is defined by

  k_i := max_{d ∈ S} ⌊1 − log_b h_i(d)⌋ with h_i(d) ∼ Uniform(0, 1)   (1)

where h_i are m independent hash functions. k_i = 0 is used for initialization and the representation of empty sets. The original HLL uses the base b = 2, which makes the logarithm evaluation very cheap. We refer to the generalized HyperLogLog (GHLL) when using any b > 1 [12, 53]. Definition (1) leads to a recording time of
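The MinHash construction from Section 1.2 and its component-matching Jaccard estimator can be sketched in a few lines. This is an illustrative Python sketch, not the paper's reference implementation: the independent hash functions h_i are simulated by seeding Python's PRNG with the component index and the element.

```python
import random

def minhash(s, m=256):
    """m-dimensional MinHash sketch: k_i = min over elements d of h_i(d).

    The independent hash functions h_i are simulated by a PRNG seeded
    with (i, d); real implementations use proper hash functions.
    """
    return [min(random.Random(f"h{i}-{d}").random() for d in s)
            for i in range(m)]

def jaccard_estimate(sketch_u, sketch_v):
    """Fraction of matching components; the RMSE is sqrt(J(1-J)/m)."""
    return sum(x == y for x, y in zip(sketch_u, sketch_v)) / len(sketch_u)
```

For example, for U = {0, ..., 99} and V = {50, ..., 149} the true Jaccard similarity is 1/3, and with m = 256 the estimate concentrates around it with a standard error of about 0.03.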

O(m). Therefore, to have a constant time complexity, HLL implementations usually use stochastic averaging [30] which distributes the incoming data stream over all m registers:

  k_i := max_{d ∈ S : h1(d) = i} ⌊1 − log_b h2(d)⌋,

where h1 and h2 are independent uniform hash functions mapping to {1, 2, ..., m} and the interval (0, 1), respectively. Instead of using two hash functions, implementations typically calculate a single hash value that is split into two parts. The original cardinality estimator [30] was inaccurate for small and large cardinalities. Therefore, a series of improvements have been proposed [34, 59, 70], which finally ended in an estimator that does not require empirical calibration [23, 24] and that is already successfully used by the Redis in-memory data store.

[Figure 1: Probabilities of register update values for GHLL and HyperMinHash with equivalent configurations.]

HLL is not very suitable for joint estimation or similarity search. The reason is that there is no simple estimator for the Jaccard similarity and no closed-form expression for the collision probability like for MinHash. The inclusion-exclusion principle gives significantly worse estimates for joint quantities than MinHash using the same memory footprint [18]. A maximum likelihood (ML) based method [23, 24] for joint estimation was proposed, that performs significantly better, but requires solving a three-dimensional optimization problem, which does not easily translate into a fast and robust algorithm. A computationally less expensive method was proposed in [49] for estimating inclusion coefficients. However, it relies on precomputed lookup tables and does not properly account for stochastic averaging, which can lead to large errors for small sets.

These joint estimation approaches have proven that HLL also encodes some joint information. This might be one reason why HLL's asymptotic space efficiency, when considering cardinality estimation only, is not optimal [53, 54]. Lossless compression [21, 36, 62] is able to improve the memory efficiency at the expense of recording speed and is therefore rarely used in practice. Approaches using lossy compression [69] are not recommended as their insert operations are not idempotent and commutative.

Nevertheless, due to its simplicity, HLL has become the state-of-the-art cardinality estimation algorithm, especially if the data is distributed. This is evidenced by the many databases like Redis, Oracle, Snowflake, Microsoft SQL Server, BigQuery, Vertica, Elasticsearch, Aerospike, or Amazon Redshift that use HLL under the hood to realize approximate distinct count queries. HLL is also used for many applications like metagenomics [1, 22, 46], graph analysis [4, 55, 56], query optimization [31], or fraud detection [10].

1.4 HyperMinHash

HyperMinHash [72] is the first approach to combine some properties of HLL and MinHash in one data structure. It can be seen as a generalization of HLL and one permutation hashing. HyperMinHash supports cardinality and joint estimation by using larger registers and hence more space than HLL. Unfortunately, the theoretically derived Jaccard similarity estimator is relatively complex and too expensive for practical use. Therefore, an approximation based on empirical observations was proposed. However, the original estimation approach is not optimal, as our results presented later have shown that even the naive inclusion-exclusion principle works better in some scenarios.

HyperMinHash with parameter ρ is very similar to GHLL with base b = 2^(2^(−ρ)). The reason is that HyperMinHash approximates the probability distribution of GHLL with probabilities that are just powers of 2 as shown in Figure 1. This has the advantage that hash values can be mapped to corresponding update values using only a few CPU instructions. Like those of HLL and GHLL, the register values of HyperMinHash are not locality-sensitive.

1.5 Further Related Work

A simple, though obviously not very memory-efficient, way to combine the properties of MinHash and HLL is to use them both in parallel [9, 52]. Probably the best alternative to MinHash and HLL which also works for distributed data and which even supports binary set operations is the ThetaSketch [18]. The downsides are a significantly worse memory efficiency compared to HLL in terms of cardinality estimation and that it is not locality-sensitive like MinHash.

2 METHODOLOGY

The basics of SetSketch follow directly from the properties introduced in Section 1. Similar to MinHash and HLL we consider a mapping from a set S to m integer-valued registers k_1, k_2, ..., k_m. Furthermore, we assume that register values are obtained through some hashing procedure and that they are identically distributed. Since we want to estimate the cardinality from the register values, the distribution of k_i should only depend on the cardinality n = |S|:

  P(k_i ≤ k) = F(k; n).

Mergeability requires that the register value k_{i,S∪T} of the union of two sets S and T can be computed directly from the corresponding individual register values k_iS and k_iT using some binary operation k_{i,S∪T} = f(k_iS, k_iT). To enable cardinality estimation, register values need to have some monotonic dependence on the cardinality. Without loss of generality, we consider non-decreasing monotonicity. Therefore, and because |S ∪ T| ≥ |S| and |S| ≥ |S'| ⇒ |S ∪ T| ≥ |S' ∪ T|, we want f to satisfy f(x, y) ≥ y and x ≥ x' ⇒ f(x, y) ≥ f(x', y), respectively. The only binary operation with these properties that is also idempotent and commutative is the maximum function [28]. Therefore, we require k_{i,S∪T} = max(k_iS, k_iT).

For two disjoint sets U and V with cardinalities |U| = n_U and |V| = n_V we have |U ∪ V| = n_U + n_V. In this case k_iU and k_iV are independent and therefore P(k_{i,U∪V} ≤ k) = P(max(k_iU, k_iV) ≤ k) = P(k_iU ≤ k) · P(k_iV ≤ k), which results in a functional equation

  F(k; n_U + n_V) = F(k; n_U) · F(k; n_V).   (2)

Algorithm 1: SetSketch
  Data: S
  Result: k_1, k_2, ..., k_m
  (k_1, k_2, ..., k_m) ← (0, 0, ..., 0)
  k_low ← 0
  c ← 0
  forall d ∈ S do
      initialize pseudorandom number generator with seed d
      for j ← 1 to m do
          x_j ← generate j-th smallest out of m exponentially distributed random
                values with rate a using (7) for SetSketch1 or (8) for SetSketch2
          if x_j > b^(−k_low) then break
          k ← max(0, min(q + 1, ⌊1 − log_b x_j⌋))
          if k ≤ k_low then break
          i ← sample from {1, 2, ..., m} without replacement
          if k > k_i then
              k_i ← k
              c ← c + 1
              if c ≥ m then
                  k_low ← min(k_1, k_2, ..., k_m)
                  c ← 0

To allow estimation with constant relative error over a wide range of cardinalities, the location of the distribution should increase logarithmically with the cardinality n, while its shape should remain essentially the same. For a discrete distribution this can be enforced by the functional equation

  F(k; n) = F(k + 1; bn).   (3)

Multiplying the cardinality with the constant b > 1, which corresponds to an increment on the logarithmic scale, shifts the distribution to the right by 1.

The system of functional equations composed of (2) and (3) has the solution

  P(k_i ≤ k) = F(k; n) = e^(−n a b^(−k))   (4)

with some constant a > 0 [28]. For a set with a single element, in particular, this is

  P(k_i ≤ k | n = 1) = F(k; 1) = e^(−a b^(−k)).   (5)

Therefore, k_i needs to be distributed as k_i ∼ ⌊1 − log_b X⌋ where X ∼ Exp(a) is exponentially distributed with rate a. This directly leads to the definition of our new SetSketch data structure, which sets the state for a given set S as

  k_i := max_{d ∈ S} ⌊1 − log_b h_i(d)⌋ with h_i(d) ∼ Exp(a),   (6)

where h_i(d) are hash functions with exponentially distributed output. This definition is very similar to that of HLL without stochastic averaging (1) where the hash values are distributed uniformly instead of exponentially.

2.1 Ordered Register Value Updates

Updating SetSketch registers based on (6) is not very efficient, because processing a single element requires m hash function evaluations and m is typically in the hundreds or thousands. Therefore, Algorithm 1 uses an alternative method based on ideas from our previous work [25-27] to reduce the average time complexity for adding an element from O(m) to O(1) for sets significantly larger than m.

Since register values are increasing with cardinality, only the smallest hash values h_i(d) will be relevant for the final state according to (6). In particular, if k_low ≤ min(k_1, ..., k_m) denotes some lower bound of the current state, only hash values h_i(d) ≤ b^(−k_low) will be able to alter any register. As more elements are added to the SetSketch, the register values increase and so does their common minimum, allowing k_low to be set to higher and higher values. For large sets, b^(−k_low) will eventually become so small that almost all hash values of further elements will be greater. Therefore, computing the hash values in ascending order would allow to stop the processing of an element after a hash value is greater than b^(−k_low), and thus to achieve an asymptotic time complexity of O(1).

For that, we consider all hash values h_i(d) as random values generated by a pseudorandom number generator that was seeded with element d. This perspective allows us to use any other random process that assigns exponentially distributed random values to each h_i(d). In particular, we can use an appropriate ascending random sequence 0 < x_1 < x_2 < ... < x_m whose values are randomly shuffled and assigned to h_1(d), h_2(d), ..., h_m(d). Shuffling corresponds to sampling without replacement and can be efficiently realized as described in [26, 27] based on the Fisher-Yates algorithm [29]. The random values x_j must be chosen such that h_i(d) are eventually exponentially distributed as required by (6).

One possibility to achieve that is to use exponentially distributed spacings [20]

  x_j ∼ x_{j−1} + (1/(m − j + 1)) Exp(a) with x_0 = 0.   (7)

In this way the final hash values h_i(d) will be statistically independent and the state of SetSketch will look like as if it was generated using m independent hash functions.

Alternatively, the domain of the exponential distribution with rate a can be divided into m intervals [x̃_{j−1}, x̃_j) with x̃_0 = 0 and x̃_m = ∞. Setting x̃_j := (1/a) log(1 + j/(m − j)) ensures that an exponentially distributed random value X ∼ Exp(a) has equal probability to fall into any of these m intervals, P(X ∈ [x̃_{j−1}, x̃_j)) = 1/m [28]. Hence, if the points are sampled in ascending order according to

  x_j ∼ Exp(a; x̃_{j−1}, x̃_j),   (8)

where Exp(a; x̃_{j−1}, x̃_j) denotes the truncated exponential distribution with rate a and domain [x̃_{j−1}, x̃_j), the shuffled assignment will lead to hash values h_i(d) that are exponentially distributed with rate a. However, in contrast to the first approach, the hash values h_i(d) will be statistically dependent, because there is always exactly one point sampled from each interval. This correlation will be less significant for large sets n ≫ m, where it is unlikely that one element is responsible for the values of different registers at the same time.

Dependent on which method is used in conjunction with (6) to calculate the register values, we refer to SetSketch1 for the uncorrelated approach using exponential spacings and to SetSketch2 for the correlated approach using sampling from disjoint intervals.

2.2 Lower Bound Tracking

Different strategies can be used to keep track of a lower bound k_low as needed for an asymptotic constant-time insert operation. Since maintenance of the minimum of all register values with a small worst case complexity would require additional space by using

either a histogram [23] or binary trees [26, 27], we decided to simply update the lower bound regularly by scanning the whole register array as suggested in [61] which takes O(m) time. However, since this approach is inefficient for small bases b, Algorithm 1 counts the number of register modifications c instead of the number of register values greater than the minimum. After every m register updates, when c ≥ m, k_low is updated and set to the current minimum of all register values, and c is reset. By definition, this contributes at most amortized constant time to every register increment and hence does not change the expected time complexity.

2.3 Parameter Configuration

SetSketch has 4 parameters, m, b, a, and q, that need to be set appropriately. If we have m registers, which are able to represent all values from {0, 1, 2, ..., q + 1} with some nonnegative integer q, the memory footprint will be m⌈log_2(q + 2)⌉ bits without special encoding. m determines the accuracy of cardinality estimates, because, as shown later, the RMSE is in the range [1/√m, 1.04/√m] for b ≤ 2. While the choice of the base b has only little influence on cardinality estimation, joint estimation and locality sensitivity are significantly improved as b → 1. The cardinality range, for which accurate estimation is possible, is controlled by the parameters a and q. Since the support of distribution (4) is unbounded, a and q need to be chosen such that register values smaller than 0 or greater than q + 1 are very unlikely and can be ignored for the expected cardinality range. The lower bound of this range is typically 1 as we want SetSketches to be able to represent any set with at least one element.

If we choose a ≥ log(m/ε)/b for some ε ≪ 1, it is guaranteed that negative register values occur only with a maximum probability of ε [28]. Therefore, the error introduced by using zero as initial value as done in Algorithm 1 is negligible for small enough values of ε. In practice, setting a = 20 is a good choice in most cases. Even in the extreme case with b → 1 and m = 2^20, the probability is still less than 0.22 % to encounter at least one negative value. Similarly, setting q ≥ ⌊log_b(a n_max m / ε)⌋ guarantees that register values greater than q + 1 occur only with a maximum probability of ε for all cardinalities up to n_max [28]. A small ε implies a larger value of q and therefore also a larger memory footprint, if the cardinality range is fixed.

To give a concrete example, consider a SetSketch with parameters m = 4096, b = 1.001, a = 20, and q = 2^16 − 2 = 65534. The last parameter ensures that two bytes are sufficient to represent a single register and the whole data structure takes 8 kB. The probability that there is at least one register with negative value is 8.28 × 10^−6 for a set with just a single element. Furthermore, the probability that any register value is greater than q + 1 is 2.93 × 10^−6 for n = 10^18. Therefore, a SetSketch using this configuration is suitable to represent any set with up to 10^18 distinct elements. The expected error of cardinality estimates is approximately 1/√m ≈ 1.56 %.

3 ESTIMATION

In this section we present methods for cardinality and joint estimation. We will assume that register values are statistically independent which is only true for SetSketch1 and only a good approximation for SetSketch2 in the case of large sets with n ≫ m. However, our experiments presented later have shown that the estimators derived under this assumption also perform well for SetSketch2 over the entire cardinality range. The correlation even has a positive effect and reduces the estimation errors for small sets.

We introduce some special functions and corresponding approximations that are used for the derivation of the estimators. The functions ξ_1(x) and ξ_2(x) defined by

  ξ_s(x) := (log(b)/Γ(s)) Σ_{k=−∞}^{∞} b^(s(x−k)) e^(−b^(x−k)) ≈ 1,   (9)

where Γ denotes the gamma function, are both very close to 1 for b ≤ 2 [28]. For b = 2 we have max_x |ξ_1(x) − 1| ≤ 10^−5 and max_x |ξ_2(x) − 1| ≤ 10^−4, which shows that these approximations introduce only a small error. Smaller values of b lead to even smaller errors as both functions converge quickly towards 1 as b → 1.

Another approximation that we will use is

  η_b(x_1, x_2) := Σ_{k=−∞}^{∞} (e^(−b^(x_1−k)) − e^(−b^(x_2−k))) ≈ x_2 − x_1.   (10)

The relative error |(η_b(x_1, x_2) − (x_2 − x_1))/(x_2 − x_1)| is smaller than 10^−5 for b = 2 and approaches zero quickly as b → 1 [28].

3.1 Cardinality Estimation

Instead of using the straightforward way of estimating the cardinality using the ML method based on (4), we derive a closed-form estimator, based on our previous work [23, 24], that is simpler to implement, cheaper to evaluate, and gives almost identical results. The expectation of (b^(−k_i))^s with s ∈ {1, 2} and k_i distributed according to (4) is given by

  E(b^(−s k_i)) = Σ_{k=−∞}^{∞} b^(−sk) (e^(−n a b^(−k)) − e^(−n a b^(−(k−1))))
               = (1 − b^(−s)) Σ_{k=−∞}^{∞} b^(−sk) e^(−n a b^(−k))
               = ((1 − b^(−s)) Γ(s) / ((na)^s log(b))) ξ_s(log_b(na))
               ≈ (1 − b^(−s)) Γ(s) / ((na)^s log(b)).   (11)

Here the asymptotic identity (9) was used as approximation in the final step. We now consider the statistic X̄ := (a log(b)/(1 − 1/b)) (1/m) Σ_{i=1}^{m} b^(−k_i). Using (11) for the cases s = 1 and s = 2 we get for its expectation and its variance [28]

  E(X̄) ≈ 1/n and Var(X̄) ≈ (1/(m n^2)) (((b+1)/(b−1)) log(b) − 1).

The delta method [8] gives for m → ∞

  E(1/X̄) ≈ n and Var(1/X̄) ≈ (n^2/m) (((b+1)/(b−1)) log(b) − 1)

which suggests to use 1/X̄ as estimator for the cardinality n:

  n̂ = m (1 − 1/b) / (a log(b) Σ_{i=1}^{m} b^(−k_i)).   (12)

This estimator can be quickly evaluated, if the powers of b are precalculated and stored in a lookup table for all possible exponents which are known to be from {0, 1, ..., q + 1}.

The corresponding relative standard deviation (RSD) √Var(n̂)/n is given by √(((b+1)/(b−1)) log(b) − 1)/√m. It is minimized for b → 1, where it equals 1/√m, and increases slowly with b. For b = 2 the expected error is still as low as 1.04/√m. However, as already discussed, smaller values of b require larger values of q and hence more space.

Optimal memory efficiency, measured as the product of the variance and the memory footprint, is typically obtained for values of b ranging from 2 to 4. Theoretically, if register values are compressed, a better memory efficiency can be achieved for b → ∞ [53]. Values significantly greater than 2 invalidate the approximation (9) which was used for the derivation of estimator (12). To reduce the estimation error in this regime, random offsets, which correspond to register-specific values of parameter a, would be necessary [42, 53]. This work, however, focuses on b ≤ 2. Although small values of b are inefficient for cardinality estimation, they contain, as we will show, more information about how sets relate to each other, which ultimately justifies slightly larger memory footprints.

3.2 Joint Estimation

The relationship of two sets U and V can be characterized by three quantities. Without loss of generality we choose the cardinalities n_U = |U|, n_V = |V|, and the Jaccard similarity J = |U ∩ V|/|U ∪ V| with the natural constraint J ∈ [0, min(n_U/n_V, n_V/n_U)] for parameterization. Other quantities such as

  |U ∪ V| = (n_U + n_V)/(1 + J),   (union size)
  |U ∩ V| = J (n_U + n_V)/(1 + J),   (intersection size)
  |U \ V| = (n_U − J n_V)/(1 + J),  |V \ U| = (n_V − J n_U)/(1 + J),   (difference sizes)
  |U ∩ V|/√(|U| |V|) = J (n_U + n_V)/(√(n_U n_V) (1 + J)),   (cosine similarity)
  |U ∩ V|/|U| = J (n_U + n_V)/(n_U (1 + J)),  |U ∩ V|/|V| = J (n_U + n_V)/(n_V (1 + J)),   (inclusion coefficients)

can be expressed in terms of only these three variables. If we have two SetSketches representing two different sets, we would like to find estimates for n_U, n_V, and J, which would allow us to estimate any other quantity of interest.

While we can use the cardinality estimator derived in the previous section to get estimates for n_U and n_V, respectively, the estimation of J is more challenging. One possibility is to leverage the mergeability of data sketches and use the inclusion-exclusion principle. Assume n̂_U, n̂_V, and n̂_{U∪V} are cardinality estimates of |U|, |V|, and |U ∪ V|, respectively. Then the inclusion-exclusion principle allows estimating the intersection size as n̂_U + n̂_V − n̂_{U∪V}. This can be used to estimate J as

  Ĵ_in-ex = (n̂_U + n̂_V − n̂_{U∪V}) / n̂_{U∪V}.   (13)

To ensure the natural constraint J ∈ [0, min(n_U/n_V, n_V/n_U)] and the nonnegativity of estimated intersection and difference sizes, it might be necessary to trim Ĵ_in-ex to the range [0, min(n̂_U/n̂_V, n̂_V/n̂_U)].

In the following a new joint estimation approach is derived that dominates the inclusion-exclusion principle as our experiments have shown. Using (4), the joint cumulative distribution function of registers k_iU and k_iV of two SetSketches representing sets U and V is

  P(k_iU ≤ k_U ∧ k_iV ≤ k_V) =
    e^(−(a/(1+J)) ((n_U − J n_V) b^(−k_U) + (1 + J) n_V b^(−k_V)))   for k_U ≥ k_V,
    e^(−(a/(1+J)) ((1 + J) n_U b^(−k_U) + (n_V − J n_U) b^(−k_V)))   for k_U ≤ k_V.

It can be used to calculate the probability that k_iU > k_iV:

  P(k_iU > k_iV) = Σ_{k=−∞}^{∞} P(k_iU = k + 1 ∧ k_iV ≤ k)
  = Σ_{k=−∞}^{∞} (P(k_iU ≤ k + 1 ∧ k_iV ≤ k) − P(k_iU ≤ k ∧ k_iV ≤ k))
  = Σ_{k=−∞}^{∞} (e^(−a ((n_U − J n_V)/(b(1+J)) + n_V) b^(−k)) − e^(−a ((n_U − J n_V)/(1+J) + n_V) b^(−k)))
  = η_b(log_b(a ((n_U − J n_V)/(b(1+J)) + n_V)), log_b(a ((n_U − J n_V)/(1+J) + n_V)))
  ≈ log_b(a ((n_U − J n_V)/(1+J) + n_V)) − log_b(a ((n_U − J n_V)/(b(1+J)) + n_V)) = u_b(α − J β).

Here we used the approximation (10), which is asymptotically exact as b → 1 and introduces a negligible error if b ≤ 2. u_b is defined as u_b(x) := −log_b(1 − x(1 − b^(−1))), and we introduced the relative cardinalities α = n_U/(n_U + n_V) and β = n_V/(n_U + n_V) with α + β = 1. Complemented by P(k_iU < k_iV), which is obtained analogously, and P(k_iU = k_iV) = 1 − P(k_iU > k_iV) − P(k_iU < k_iV), we have

  P(k_iU > k_iV) ≈ u_b(α − J β),  P(k_iU < k_iV) ≈ u_b(β − J α),
  P(k_iU = k_iV) ≈ 1 − u_b(α − J β) − u_b(β − J α).   (14)

The approximation of P(k_iU = k_iV) is also always nonnegative [28].

If the cardinalities n_U and n_V are known, the ML method can be used to estimate J. The corresponding log-likelihood function as function of J using the approximated probabilities is

  log L(J) = m_+ log(u_b(α − J β)) + m_− log(u_b(β − J α)) + m_0 log(1 − u_b(α − J β) − u_b(β − J α)),

where m_+ := |{i : k_iU > k_iV}|, m_− := |{i : k_iU < k_iV}|, and m_0 := |{i : k_iU = k_iV}| are the number of registers in the sketch of U that are greater than, less than, or equal to those in the sketch of V, respectively. Since this log-likelihood function is cheap to evaluate requiring only 5 logarithm evaluations and is strictly concave on the domain J ∈ [0, min(n_U/n_V, n_V/n_U)] = [0, min(α/β, β/α)] at least for all b ≤ e ≈ 2.718 [28], the ML estimate for J can be quickly and robustly found using standard univariate optimization algorithms like Brent's method [5].

Since the ML estimator is asymptotically efficient, the RMSE of the ML estimate is expected to be equal to I^(−1/2)(J) as m → ∞. I(J) denotes the Fisher information with respect to J for known cardinalities n_U and n_V, which can be derived as [28]

  I(J) = (m (1 − b^(−1))^2 / log^2(b)) ( β^2 b^(2 u_b(α − J β)) / u_b(α − J β) + α^2 b^(2 u_b(β − J α)) / u_b(β − J α) + (β b^(u_b(α − J β)) + α b^(u_b(β − J α)))^2 / (1 − u_b(α − J β) − u_b(β − J α)) ).

The asymptotic RMSE can be compared with the Jaccard estimator of MinHash for which the RMSE is √(J(1 − J)/m). The corresponding ratio is shown in Figure 2 for the cases n_U = n_V and n_U = 0.5 n_V, and various values of b. For n_U = n_V the curves approach 1 as b → 1.
If � = 1.001 the diference from 1 is already very small and � , respectively, is given by → and hence the estimation error for � will be almost the same as for � �� � �� ��� �� MinHash with the same number of registers �. Given that 2-byte ( ≤ ∧ ≤ ) = � �� � ,� �� � �� � ,� �� � �� � ,� min �� , �� registers are sufcient for � = 1.001 as discussed in Section 2.3, this ( \ ≤ ) ( \ ≤ ) ( ∩ ≤ ( )) �� �� � � �� �� � � �� �� � � ,� signifcantly improves the space efciency compared to MinHash − � − � ( + ) min � � 1 � ��− 1 � ��− 1 � ��− ( ) = �− + �− + �− + which uses 4- or even 8-byte components. As shown in Figure 2 for
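To make the estimation procedure concrete, the following is a small self-contained sketch of it (our own illustration, not the paper's C++ implementation; all function and variable names are invented, and a simple ternary search stands in for Brent's method). It evaluates the approximated probabilities (14) and maximizes the log-likelihood over the admissible range of J:

```python
import math

def gamma_b(x, b):
    # gamma_b(x) := log_b(1 - x(1 - 1/b)); nonpositive for x in [0, 1]
    return math.log(1.0 - x * (1.0 - 1.0 / b)) / math.log(b)

def register_probs(J, u, v, b):
    # Approximate probabilities (14) that a register of the sketch of U is
    # greater than, less than, or equal to the same register of the sketch of V
    p_gt = -gamma_b(u * (1.0 - J), b)
    p_lt = -gamma_b(v * (1.0 - J), b)
    return p_gt, p_lt, 1.0 - p_gt - p_lt

def estimate_jaccard(c_gt, c_lt, c_eq, n_u, n_v, b):
    # Maximize log L(J) over [0, min(n_u/n_v, n_v/n_u)]; the log-likelihood
    # is strictly concave for b <= e, so a ternary search converges
    u = n_u / (n_u + n_v)
    v = n_v / (n_u + n_v)
    def log_l(J):
        probs = register_probs(J, u, v, b)
        return sum(c * math.log(p)
                   for c, p in zip((c_gt, c_lt, c_eq), probs) if c > 0)
    lo, hi = 0.0, min(n_u / n_v, n_v / n_u)
    for _ in range(200):  # shrinks the search interval by (2/3)^200
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if log_l(m1) < log_l(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)
```

Feeding the expected register comparison counts m·p back into the estimator recovers the configured J, which is a quick self-consistency check of such an implementation.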

Figure 2: Asymptotic RMSE of the new estimation approach with known cardinalities n_U and n_V relative to the RMSE of MinHash √(J(1 − J)/m) with equal number of registers m. (Panels: n_V = n_U and n_V = 0.5 n_U; curves for b = 2, 1.2, 1.001.)

Figure 3: The range of possible collision probabilities of SetSketch registers as a function of J. (Panels for b = 2, 1.2, 1.05, 1.001, 1.)

As shown in Figure 2 for n_V = 0.5 n_U, the error can be even smaller than that of the MinHash Jaccard estimator. The reason is that our estimation approach incorporates n_U and n_V and also whether register values are greater or smaller, while the classic MinHash estimator only considers the number of equal registers.

If the true cardinalities n_U and n_V are not known, we simply replace them by corresponding estimates n̂_U and n̂_V using (12). As a consequence, the RMSE will apparently increase and I^{−1/2}(J) will only represent an asymptotic lower bound as m → ∞. Nevertheless, as supported by our experiments presented later, the estimation error is indistinguishable from the case with known cardinalities if the sets have equal size, thus if n_U = n_V. The reason is that n̂_U and n̂_V are identically distributed in this case, which makes the log-likelihood function symmetric with respect to u and v. Since this also implies u = v = 1/2, and estimates û = n̂_U/(n̂_U + n̂_V) and v̂ = n̂_V/(n̂_U + n̂_V) for u and v satisfy û + v̂ = 1 by definition, they can be written as û = 1/2 + ε and v̂ = 1/2 − ε, respectively, with some estimation error ε. Therefore, and due to the symmetry of the log-likelihood function, the ML estimator is an even function of ε. The estimated Jaccard similarity will therefore differ only by an O(ε²) term from the estimate based on the true cardinalities n_U and n_V. Since ε ≪ 1 for sufficiently large m, this second order term can be ignored in practice.

3.3 Locality Sensitivity

A collision probability that is a monotonic function of some similarity measure is the foundation of LSH [2, 35, 43, 73]. Rewriting of (14) and using u + v = 1 gives for the probability of a register to be equal in two different SetSketches

  P(K_Ai = K_Bi) ≈ log_b(1 + J(b − 1) + uv(1 − J)²(b − 1)²/b).

Using 0 ≤ uv(1 − J)² ≤ (1/4)(1 − J)² [28] leads to

  log_b(1 + J(b − 1)) ≲ P(K_Ai = K_Bi) ≲ log_b(1 + J(b − 1) + (1 − J)²(b − 1)²/(4b)).

These bounds are illustrated in Figure 3 for b ∈ {2, 1.2, 1.001}. They are very tight for large J. Both bounds approach P(K_Ai = K_Bi) = J as b → 1. Estimating P(K_Ai = K_Bi) by c_0/m and resolving for J results in corresponding lower and upper bound estimators

  Ĵ_low := max(0, (2 b^{(c_0/m + 1)/2} − b − 1)/(b − 1)),   Ĵ_up := (b^{c_0/m} − 1)/(b − 1).   (15)

The tight boundaries, especially for large J, make SetSketch an interesting alternative to MinHash for LSH. Like b-bit minwise hashing [39] and odd sketches [47], SetSketches are more memory-efficient than MinHash, but unlike them, SetSketches can be further aggregated. The RMSE of MinHash is √(J(1 − J)/m). As long as the RMSE of SetSketch is not significantly greater than that, the additional error from using SetSketches instead of MinHash is negligible. For comparison, we consider the worst case scenario with |U| = |V| ⇔ u = v = 1/2, which maximizes the collision probability, while using Ĵ_up, which is based on the minimum possible collision probability, for estimation. Figure 4 shows the theoretical results for different values of b and m. Even for large values like b = 2 the RMSE is only increased by less than 20% for all similarities greater than 0.7 or 0.9 for the cases m = 256 and m = 4096, respectively. The difference becomes smaller with decreasing b. The RMSE almost matches that of MinHash for b = 1.001, for which 2-byte registers suffice as mentioned earlier, again showing great potential for space savings compared to MinHash.

Figure 4: The RMSE of Ĵ_up for the case n_U = n_V relative to the RMSE of MinHash √(J(1 − J)/m). (Panels: m = 256 and m = 4096; curves for b = 2, 1.2, 1.08, 1.02, 1.001.)

When searching for nearest neighbors with LSH, a set of candidates is determined first, which is then further filtered. For filtering, the presented more precise joint estimation approach can be used instead of (15) to reduce the false positive rate.

4 COMPARISON

This section compares MinHash and HLL with SetSketch, and in particular shows that they relate to SetSketches with b → 1 and b = 2, respectively.

4.1 MinHash

MinHash can be regarded as an extreme case of SetSketch1 with b → 1, for which the statistic e^{−a b^{−K_i}} approaches a continuous uniform distribution according to (5). Therefore, since the registers of SetSketch1 are also statistically independent, the transformation x'_i = 1 − e^{−a b^{−K_i}} makes SetSketch1 equivalent to MinHash with values x'_i as b → 1.
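The bound estimators (15) from Section 3.3 can be sketched in a few lines (our own illustrative Python with invented names, not the paper's implementation), together with the collision probability from (14) rewritten with u + v = 1:

```python
import math

def collision_probability(J, u, v, b):
    # P(K_Ai = K_Bi) from (14), rewritten using u + v = 1:
    # log_b(1 + J(b-1) + u v (1-J)^2 (b-1)^2 / b)
    return math.log(1.0 + J * (b - 1)
                    + u * v * (1 - J) ** 2 * (b - 1) ** 2 / b) / math.log(b)

def jaccard_bounds(c_eq, m, b):
    # Lower and upper bound estimators (15), obtained by inverting the
    # two bounds on the collision probability at the point c_eq / m
    j_up = (b ** (c_eq / m) - 1) / (b - 1)
    j_low = max(0.0, (2 * b ** ((c_eq / m + 1) / 2) - b - 1) / (b - 1))
    return j_low, j_up
```

For u = v = 1/2 the upper bound on the collision probability is attained, so Ĵ_low becomes exact; for skewed cardinalities the true J falls strictly between the two estimators, and the bracket tightens as b → 1.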

Applying this transformation to (12) for b → 1 and using lim_{b→1} (1 − 1/b)/log(b) = 1 indeed leads to the cardinality estimator of MinHash [12, 13] with a RSD of 1/√m:

  n̂ = m / (Σ_{i=1}^{m} −log(1 − x'_i)).   (16)

The presented joint estimation method can also be applied to MinHash. However, as MinHash uses the minimum instead of the maximum for state updates, we have to redefine c_+ and c_− as c_+ := |{i : x'_Ai < x'_Bi}| and c_− := |{i : x'_Ai > x'_Bi}|. For b → 1, the ML estimate can be explicitly expressed as [28]

  Ĵ = (u²(c_0 + c_−) + v²(c_0 + c_+) − √((u²(c_0 + c_−) − v²(c_0 + c_+))² + 4u²v²c_+c_−)) / (2muv)   (17)

and has an asymptotic RMSE of [28]

  RMSE(Ĵ) = I^{−1/2}(J) = √(J(1 − J)/m) ((1 + J)/(1 − J)) √(1 − J/(uv(1 + J)²)) ≤ √(J(1 − J)/m)   as m → ∞.

This shows that this estimator outperforms the state-of-the-art Jaccard estimator, which has a RMSE of √(J(1 − J)/m). Our experiments showed that this estimator also works better if the cardinalities are unknown and need to be estimated using (16). Therefore, this approach is a much less expensive alternative to the ML approach described in [15], which considers the full likelihood as a function of all 3 parameters and requires solving a three-dimensional optimization problem.

The state of SetSketch2 is logically equivalent to that of SuperMinHash as b → 1. SuperMinHash also uses correlation between components and is able to reduce the variance of the standard estimator for J by up to a factor of 2 for small sets [25].

4.2 HyperLogLog

The generalized HyperLogLog (GHLL) with stochastic averaging, which includes HLL as a special case with b = 2, is similar to SetSketch with a = 1/m. Under the Poisson model [23, 30], which assumes that the cardinality n is not fixed but Poisson distributed with mean λ, the register values will be distributed as P(K_i ≤ k) = e^{−(λ/m) b^{−k}} for k ≥ 0 [28]. Comparison with (4) shows that the register values of GHLL are distributed as those of SetSketch for a = 1/m and n = λ, provided that all register values are nonzero. Therefore, the cardinality estimator (12) can be used to estimate λ in this case. An unbiased estimator for the Poisson parameter λ is also an unbiased estimator for the true cardinality [28]. Since (12) is asymptotically unbiased as m → ∞, it can be used to estimate n. And indeed, (12) corresponds to the cardinality estimator with a RSD of √(3 log(2) − 1)/√m ≈ 1.04/√m presented in [30] for HLL with base b = 2.

For small cardinalities, when many registers are zero, the value distribution will differ significantly from (4) and the estimator will fail. However, it can be fixed by applying the same trick as presented for the case b = 2 in [23]. If register values are limited to the range {0, 1, ..., q + 1}, and m_k := |{i : K_i = k}| is the histogram of register values, the corrected estimator can be written as [28]

  n̂_corr = m(1 − 1/b) / (log(b)(m σ(m_0/m) + (Σ_{k=1}^{q} m_k b^{−k}) + b^{−q} m τ(m_{q+1}/m))).   (18)

The functions σ and τ are defined as the converging series

  σ(x) := x + (b − 1) Σ_{k=1}^{∞} x^{b^k} b^{k−1},
  τ(x) := 1 − x − (b − 1) Σ_{k=0}^{∞} b^{−k−1}(1 − x^{b^{−k}}).

n̂_corr also includes a correction for very large cardinalities, for which register update values greater than q + 1, exceeding the value range of registers, are more likely to occur. The corrected estimator could also be applied to misconfigured SetSketches that do not allow to ignore the probability of registers with values less than 0 or greater than q + 1 as discussed in Section 2.3. To save computation time, the values of m σ(m_0/m) and m τ(m_{q+1}/m) can be tabulated for all possible values of m_0, m_{q+1} ∈ {0, 1, ..., m} for fixed m and b.

Due to the similarity of the register value distribution, the proposed joint estimation method also works well for GHLL, provided that all register values are from {1, 2, ..., q}. Since the estimator relies only on the relative order, it can even be applied as long as there are no registers having value 0 or q + 1 simultaneously in both GHLL sketches. Registers equal to q + 1 can be easily avoided by choosing q sufficiently large. Registers that have never been updated and thus are zero in both sketches are expected for union cardinalities |U ∪ V| smaller than m H_m, where H_m := Σ_{k=1}^{m} 1/k is the m-th harmonic number. This follows directly from the coupon collector's problem [48]. If the registers do not satisfy the prerequisites for applying the new estimation approach, the inclusion-exclusion principle (13) could still be used as fallback.

4.3 HyperMinHash

The similarity with GHLL as discussed in Section 1.4 allows using the proposed joint estimation approach for HyperMinHash as well. If the condition of not too small cardinalities is satisfied, the experiments presented below have shown that the new approach also outperforms the original estimator of HyperMinHash, especially when the cardinalities of the two sets are very different.

5 EXPERIMENTS

For the sake of reproducibility, we used synthetic data for our experiments to verify SetSketch and the proposed estimators. The wide and successful application of MinHash, HLL, and many other probabilistic data structures has proven that the output of high-quality hash functions [66] is indistinguishable from uniform random values in practice. This justifies to simply use sets consisting of random 64-bit integers for our experiments. In this way arbitrarily many random sets of predefined cardinality can be easily generated, which is fundamental for the rigorous verification of the presented estimators. Real-world data sets usually do not include thousands of different sets with the same predefined cardinality.

5.1 Implementation

Both SetSketch variants as well as GHLL have been implemented in C++. The corresponding source code, including scripts to reproduce all the presented results, has been made available at https://github.com/dynatrace-research/set-sketch-paper. We used the Wyrand pseudorandom number generator [71], which is extremely fast, has a state of 64 bits, and passes a series of statistical quality tests [38]. Seeded with the element of a set, it is able to return a sequence of 64-bit pseudorandom integers. The random bits are used very economically. Only if all 64 bits are consumed, the next bunch of 64 bits will be generated. Sampling with replacement was realized as a constant-time operation as described in [27] based on Fisher-Yates shuffling [29].
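Under the same caveat (our own illustrative Python with invented names, not the paper's code), the closed-form MinHash estimate (17) is a direct transcription:

```python
import math

def minhash_jaccard_ml(c_gt, c_lt, c_eq, n_u, n_v):
    # Explicit ML estimate (17) for the MinHash limit b -> 1 with known
    # cardinalities n_u and n_v; u and v are the relative cardinalities
    m = c_gt + c_lt + c_eq
    u = n_u / (n_u + n_v)
    v = n_v / (n_u + n_v)
    a = u * u * (c_eq + c_lt)
    d = v * v * (c_eq + c_gt)
    disc = (a - d) ** 2 + 4 * u * u * v * v * c_gt * c_lt
    return (a + d - math.sqrt(disc)) / (2 * m * u * v)
```

For n_u = n_v and symmetric counts c_gt = c_lt, the expression collapses to the classic estimator c_0/m, which is a useful sanity check.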

The algorithm proposed in [37] was used to sample random integers from intervals. The ziggurat method [45] as implemented in the Boost.Random C++ library [68] was used to obtain exponentially distributed random values. The algorithm described in [27] was applied to efficiently sample from a truncated exponential distribution as needed for SetSketch2. Our implementation precalculates and stores all relevant powers of b^{−1} in a sorted array, which is then used to find register update values as defined by (6) using binary search instead of expensive logarithm evaluations. This also allows to limit the search space to values greater than K_low, which further saves time with increasing cardinality.

5.2 Cardinality Estimation

Our cardinality estimation approach was tested by running 10 000 simulation cycles for different SetSketch and GHLL configurations. In each cycle we generated 10 million 64-bit random integer values and inserted them into the data structure. The use of 64 bits makes collisions very unlikely and allows regarding the number of inserted random values as the cardinality of the recorded set. During each simulation cycle the cardinality was estimated for a predefined set of true cardinality values, which have been chosen to be roughly equally spaced on the log-scale.

We investigated SetSketch1, SetSketch2, and GHLL with stochastic averaging using m = 256 and m = 4096 registers. We considered different bases b = 1.001 and b = 2, for which we used 6-bit (q = 62) and 2-byte registers (q = 2^16 − 2 = 65534), respectively. For both SetSketch variants, we used a = 20 and (12) for cardinality estimation. The corrected estimator (18) was used for GHLL.

Figure 5 shows the relative bias, the relative RMSE, and the kurtosis of the cardinality estimates over the true cardinality. The bias is significantly smaller than the RMSE in all cases. Furthermore, when comparing m = 256 and m = 4096, the bias decreases more quickly than the RMSE as m → ∞. Hence, the bias can be ignored in practice for reasonable values of m.

Figure 5: The relative bias, the relative RMSE, and the kurtosis of n̂ (12) for SetSketch1 and SetSketch2 and of n̂_corr (18) for GHLL. (Panels for m = 256 and m = 4096 with b = 2 and b = 1.001; curves: SetSketch1, SetSketch2, GHLL, expected.)

The independence of register values in SetSketch1 leads to a constant error over the entire cardinality range that matches well with the theoretically derived RSD from Section 3.1. In accordance with the discussion in Section 3.1, the error is only marginally improved when decreasing b from 2 to 1.001. The error curves of SetSketch2 and GHLL are very similar and show a significant improvement for cardinalities smaller than m. This is attributed to the correlation of register values of SetSketch2 and the stochastic averaging of GHLL, respectively, both of which have a similar positive effect. However, when comparing the kurtosis, which is expected to be equal to 3 for normal-like distributions, we observed huge values for GHLL. This means that GHLL is very susceptible to outliers for small cardinalities. This is not surprising, as stochastic averaging only updates a single register per element. For a set with cardinality n = 2, for example, it happens with 1/m probability that both elements update the same register, which makes the state indistinguishable from that of a set with just a single element and obviously results in a large 50% error. We also tested the ML method for cardinality estimation and obtained visually almost identical results [28], which are therefore omitted in Figure 5 and which prove the efficiency of estimators (12) and (18).

5.3 Joint Estimation

To verify the presented joint estimation approach, we generated many pairs of random sets with predefined relationship. Each pair of sets (U, V) was constructed by generating three sets S_1, S_2, and S_3 of 64-bit random numbers with fixed cardinalities n_1, n_2, and n_3, respectively, which are merged according to U = S_1 ∪ S_3 and V = S_2 ∪ S_3. The use of 64-bit random numbers allows ignoring collisions, and all three sets can be considered to be distinct. By construction, we have J = n_3/(n_1 + n_2 + n_3), n_U = n_1 + n_3, and n_V = n_2 + n_3, which guarantees both cardinalities, the Jaccard similarity, and hence also other joint quantities to be the same for all generated pairs (U, V). After recording both sets using the data structure under consideration and applying the proposed joint estimation approach, we finally compared the estimates with the prescribed true values for various joint quantities.

Figure 6 shows the relative RMSE when estimating the Jaccard similarity, cosine similarity, inclusion coefficients, intersection size, and difference sizes using different approaches from SetSketch1 with a = 20 and b ∈ {1.001, 2}. The union cardinality was fixed |U ∪ V| = 10^6 and the Jaccard similarity was selected from J ∈ {0.01, 0.1, 0.5}. Each data point shows the result after evaluating 1000 pairs of randomly generated sets. The charts were obtained by varying the ratio |U \ V|/|V \ U| over [10^{−3}, 10^3] while keeping J and |U ∪ V| fixed. The symmetry allowed us to perform the experiments only for ratios from [1, 10^3].

Figure 6 clearly shows that the proposed joint estimator dominates the inclusion-exclusion principle for all investigated joint quantities. The difference is more significant for small set overlaps like for J = 0.01. Comparing the results for b = 1.001 and b = 2 shows that joint estimation can be significantly improved when using smaller bases b.
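The set-pair construction of Section 5.3 can be sketched as follows (again our own illustration, not the paper's C++ code; names are invented):

```python
import random

def make_pair(n1, n2, n3, seed=42):
    # Construct U = S1 | S3 and V = S2 | S3 from three disjoint sets of
    # random 64-bit integers, so that n_U = n1 + n3, n_V = n2 + n3, and
    # J = n3 / (n1 + n2 + n3) hold exactly by construction
    rng = random.Random(seed)
    elements = set()
    while len(elements) < n1 + n2 + n3:  # draw until all values are distinct
        elements.add(rng.getrandbits(64))
    elements = list(elements)
    s1 = set(elements[:n1])
    s2 = set(elements[n1:n1 + n2])
    s3 = set(elements[n1 + n2:])
    return s1 | s3, s2 | s3
```

Because all drawn 64-bit values are distinct, the prescribed cardinalities and the Jaccard similarity are exact rather than approximate, which is what makes the RMSE measurements of Figures 6 to 9 well defined.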

[Each of Figures 6 to 9 shows, per row, the relative RMSE of five estimated joint quantities (Jaccard similarity |U ∩ V|/|U ∪ V|, cosine similarity |U ∩ V|/√(|U||V|), inclusion coefficient |U ∩ V|/|U|, intersection size |U ∩ V|, and difference size |U \ V|) plotted over the ratio |U \ V|/|V \ U| ∈ [10^{−3}, 10^3] for J ∈ {0.01, 0.1, 0.5}; curves: new, new (cardinalities known), theory (cardinalities known), inclusion-exclusion.]

Figure 6: The relative RMSE of various estimated joint quantities when using SetSketch1 with b ∈ {1.001, 2} and m = 4096 for sets with a fixed union cardinality of |U ∪ V| = 10^6.

Figure 7: The relative RMSE of various estimated joint quantities when using SetSketch2 with b ∈ {1.001, 2} and m = 4096 for sets with a fixed union cardinality of |U ∪ V| = 10^3. (Panel layout and curves as in Figure 6.)

Figure 8: The relative RMSE of various estimated joint quantities when using MinHash with m = 4096 for sets with a fixed union cardinality of |U ∪ V| = 10^6. (Panel layout as in Figure 6; additional curves: original, original (cardinalities known).)

Figure 9: The relative RMSE of various estimated joint quantities when using HyperMinHash with m = 4096 and r = 10, which corresponds to b = 2^{2^{−10}} ≈ 1.000677, for sets with a fixed union cardinality of |U ∪ V| = 10^6. (Panel layout as in Figure 6; additional curves: original, original (cardinalities known).)

Knowing the true values of n_U and n_V further reduces the estimation error, for which we observed perfect agreement with the theoretically derived RMSE. If n_U and n_V are known, any other joint quantity x can be expressed as a function of J. The corresponding ML estimate is therefore given by x̂ = x(Ĵ). Variable transformation of the Fisher information given in Section 3.2 allows to calculate the corresponding asymptotic RMSE of x, which is I^{−1/2}(J)|x'(J)| as m → ∞. As also predicted for the case n_U = n_V, the estimation error of x is the same regardless of whether the true cardinalities are known or not.

When running the same simulations for SetSketch2 and GHLL also with bases b ∈ {1.001, 2}, we got almost identical charts [28], which therefore have been omitted here. The reason why our estimation approach also worked for GHLL is that the union cardinality was fixed at 10^6, which is large enough to not have any registers that are zero in both sketches as discussed in Section 4.2. In particular, this shows that our approach can significantly improve joint estimation from HLL sketches with b = 2, for which the inclusion-exclusion principle is the state-of-the-art approach for joint estimation.

We repeated all simulations with a fixed union cardinality of |U ∪ V| = 10^3. For SetSketch1, we obtained the same results as for |U ∪ V| = 10^6 [28]. However, the errors for SetSketch2 shown in Figure 7 are significantly smaller and also lower than theoretically predicted. The reason for this improvement is, as before for cardinality estimation, the statistical dependence between register values in case of small sets. For b = 1.001, the error is reduced by a factor of up to √2 when estimating the Jaccard similarity using our new approach. This is expected, as SetSketch2 corresponds to SuperMinHash [25] as b → 1, for which the variance is known to be approximately 50% smaller than for MinHash if |U ∪ V| is smaller than m. Our joint estimator failed for GHLL in this case [28], because |U ∪ V| = 10^3 is significantly smaller than m H_m with m = 4096, and hence the condition for its applicability is not satisfied as discussed in Section 4.2.

We also applied the new approach to MinHash and used the explicit estimation formula (17). Figure 8 only shows the results for |U ∪ V| = 10^6, because the results are very similar for |U ∪ V| = 10^3, as expected [28]. The results are also almost indistinguishable from those obtained for SetSketch with b = 1.001 shown in Figure 6.
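As a toy end-to-end illustration of such MinHash comparisons (ours, not the paper's experimental setup; the per-element pseudorandom signature below is a deliberately naive stand-in for a real MinHash implementation), one can build signatures for a pair of sets with known J, count the component-wise comparisons c_+, c_−, c_0, and check the classic estimator c_0/m against the prescribed similarity:

```python
import random

def minhash_signature(s, m, seed=1):
    # Toy MinHash: component i keeps the minimum of a per-element,
    # per-component pseudorandom value in [0, 1)
    sig = [1.0] * m
    for x in s:
        rng = random.Random(x * 0x9E3779B97F4A7C15 + seed)
        for i in range(m):
            sig[i] = min(sig[i], rng.random())
    return sig

def compare_components(sig_a, sig_b):
    # Counts of components that are greater, less, or equal (c_+, c_-, c_0);
    # a component is equal exactly when its minimum stems from a shared element
    c_gt = sum(a > b for a, b in zip(sig_a, sig_b))
    c_lt = sum(a < b for a, b in zip(sig_a, sig_b))
    return c_gt, c_lt, len(sig_a) - c_gt - c_lt
```

Because shared elements produce identical per-component values in both signatures, a component collides with probability J, so c_0/m is the classic estimator that the approaches above improve upon.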

2254 5 10− < = 256,1 = 2 < = 256,1 = 1.001 the runtime diferences compared to reality, where elements also 6 10− need to be loaded from main memory. Furthermore, we excluded 7 10−

time (s) initialization costs, which are comparable for all considered data 8 10− structures and which are a negligible factor for large cardinalities.

4 10− < = 4096,1 = 2 < = 4096,1 = 1.001 For each cardinality considered, we performed 1000 simulation 5 10− runs. In each run, we measured the time needed to generate the 6 10− 7 corresponding number of random elements and to insert them into

Thus, SetSketch is able to give almost the same estimation accuracy using significantly less space, as 2-byte registers are sufficient for b = 1.001. We also analyzed the state-of-the-art MinHash estimator based on the fraction of equal components. For the case that the cardinalities are not known, they have been estimated using (16). Our new estimator has a significantly better overall performance in both cases, with known and with unknown cardinalities. Only for inclusion coefficients and difference sizes did the original method lead to a slightly smaller error, for J = 0.5 and a ratio of difference sizes greater than 1. However, the new estimator clearly dominated if the sets have a small overlap. In contrast to the original estimator, it also dominates the inclusion-exclusion principle in all cases.

Finally, due to its similarity to GHLL, for which our approach was able to improve joint estimation, we also considered HyperMinHash. Figure 9 shows the results for a HyperMinHash with parameter value 10, which corresponds to a base of b = 2^(2^−10) ≈ 1.000677 as discussed in Section 1.4. The original estimator of HyperMinHash led to similar results as the original estimator of MinHash (compare Figure 8). In contrast to the original estimator, which is based on empirically determined constants, our approach is solely based on theory and never performed worse than the inclusion-exclusion principle. Furthermore, since the estimation error was significantly reduced in many cases, our estimator seems to be superior to the original estimation approach. The results for the case that the cardinalities are known show perfect agreement with the theoretical RMSE originally derived for SetSketch1. Therefore, at least for large sets, HyperMinHash seems to encode joint information equally well as SetSketch with a corresponding base. However, the big advantage of SetSketch is that the same estimator can be applied for any cardinalities, while estimation from GHLL or HyperMinHash sketches requires special treatment of small sets. The original HyperMinHash estimator delegates to a second estimator in this case.

5.4 Performance
The runtime behavior of a sketch is crucial for its practicality. Therefore, we measured the recording time for sets with cardinalities up to 10^7. Instead of generating a set first, we simply generated 64-bit random numbers on the fly using the very fast Wyrand generator [71]. As hashing of more complex items is usually more expensive than generating random values, this experimental setup amplifies the share of time spent in the data structure. Afterwards, we calculated the average recording time per element. We measured the performance for SetSketch1, SetSketch2, GHLL with b ∈ {1.001, 2}, MinHash, and HLL, with sketch sizes m ∈ {256, 4096}, on a Dell Precision 5530 notebook with an Intel Core i9-8950HK processor.

Figure 10: The average recording time per element as function of set cardinality.

Figure 10 summarizes the results. The recording speed was roughly independent of the set size for HLL and GHLL due to stochastic averaging. We also implemented variants of both algorithms that use lower bound tracking as described in Section 2.2. If update values are not greater than the current lower bound, they cannot change any register. This avoids many relatively costly random accesses to individual registers. This optimization, which is simpler than other approaches [23, 61], led to a significant performance improvement for HLL and GHLL with b = 2 at large cardinalities. However, this optimization did not speed up recording for GHLL with b = 1.001, because the calculation of register update values is the bottleneck here, as it is more expensive for small b.

For MinHash the recording time was also independent of the cardinality, but many orders of magnitude slower due to its O(m) insert operation, which was also the reason why we only simulated sets up to a size of 10^5. The insert operations of both SetSketch variants have almost identical performance characteristics. As expected, insertions are quite slow for small sets. However, with increasing cardinality, the tracked lower bound will also increase and finally leads to better and better recording speeds. For large sets, SetSketch is several orders of magnitude faster than MinHash, and SetSketch2 even achieves the performance of the non-optimized (without lower bound tracking) versions of HLL and GHLL. We observed a quicker decay for b = 1.001 than for b = 2. The reason is that register values are updated more frequently for smaller b, which allows the lower bound to be raised earlier.

6 CONCLUSION
We have presented a new data structure for sets that combines the properties of MinHash and HLL and allows fine-tuning between estimation accuracy and memory efficiency. The presented estimators for cardinality and joint quantities do not require empirical calibration, give consistent errors over the full cardinality range without having to consider special cases such as small sets, and can be evaluated in a numerically robust fashion. The simple estimation from SetSketches, plus locality sensitivity as a bonus, compensates for the slower recording speed compared to HLL and GHLL for small sets. The developed joint estimator can also be straightforwardly applied to existing MinHash, HLL, GHLL, and HyperMinHash data structures to obtain more accurate results for joint quantities than with the corresponding state-of-the-art estimators in many cases. We expect that our estimation approach will also work for other set similarity measures or joint quantities that have not been covered by our experiments.
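As an illustration, the lower-bound tracking optimization discussed in Section 5.4 can be sketched as follows. This is a minimal, hypothetical simplification in Python: the register layout, the way update values are derived from hashed elements, and all names are placeholders, not the paper's actual implementation.

```python
import random

def record(registers, update_value, state):
    """Record one element's register update with lower bound tracking.

    `registers` is a plain list of m register values; `state["low"]`
    holds the tracked lower bound of all registers. Names are
    illustrative only.
    """
    # Fast path: an update value not greater than the lower bound of
    # all registers cannot change any register, so the relatively
    # costly random register access is skipped entirely.
    if update_value <= state["low"]:
        return
    idx = random.randrange(len(registers))  # stochastic averaging
    if update_value > registers[idx]:
        registers[idx] = update_value
        # Recompute the lower bound; a production implementation
        # would raise it lazily instead of scanning all registers.
        state["low"] = min(registers)
```

With increasing cardinality the tracked lower bound rises, so an ever larger fraction of insertions takes the fast path, which matches the speedups observed at large cardinalities in Figure 10.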

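The correspondence between the HyperMinHash parameter and the equivalent GHLL/SetSketch base quoted in Section 5.3 can be checked numerically. This small Python snippet (variable names are mine, not the paper's) reproduces b = 2^(2^−10) ≈ 1.000677:

```python
# Base b of a GHLL/SetSketch corresponding to a HyperMinHash whose
# fractional part is resolved with parameter value 10 (i.e. 2^10 levels).
r = 10                       # illustrative name for the parameter
base = 2.0 ** (2.0 ** -r)    # b = 2^(2^-r)
print(f"{base:.6f}")         # → 1.000677
```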
REFERENCES
[1] D. N. Baker and B. Langmead. 2019. Dashing: fast and accurate genomic distances with HyperLogLog. Genome Biology 20, 265 (2019).
[2] M. Bawa, T. Condie, and P. Ganesan. 2005. LSH Forest: self-tuning indexes for similarity search. In Proceedings of the 14th International Conference on World Wide Web (WWW). 651–660.
[3] K. Berlin, S. Koren, C.-S. Chin, J. P. Drake, J. M. Landolin, and A. M. Phillippy. 2015. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature Biotechnology 33 (2015), 623–630.
[4] P. Boldi, M. Rosa, and S. Vigna. 2011. HyperANF: Approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th International Conference on World Wide Web (WWW). 625–634.
[5] R. P. Brent. 1973. Algorithms for minimization without derivatives. Prentice-Hall.
[6] A. Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences. 21–29.
[7] F. Béres, D. M. Kelen, R. Pálovics, and A. A. Benczúr. 2019. Node embeddings in dynamic graphs. Applied Network Science 4, 64 (2019).
[8] G. Casella and R. L. Berger. 2002. Statistical Inference (2nd ed.). Duxbury, Pacific Grove, CA.
[9] R. Castro Fernandez, J. Min, D. Nava, and S. Madden. 2019. Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment. In Proceedings of the 35th International Conference on Data Engineering (ICDE). 1190–1201.
[10] Y. Chabchoub, R. Chiky, and B. Dogan. 2014. How can sliding HyperLogLog and EWMA detect port scan attacks in IP traffic? EURASIP Journal on Information Security 2014, 5 (2014).
[11] A. Chen, J. Cao, L. Shepp, and T. Nguyen. 2011. Distinct Counting With a Self-Learning Bitmap. J. Amer. Statist. Assoc. 106, 495 (2011), 879–890.
[12] P. Clifford and I. A. Cosma. 2012. A Statistical Analysis of Probabilistic Counting Algorithms. Scandinavian Journal of Statistics 39, 1 (2012), 1–14.
[13] E. Cohen. 2015. All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis. IEEE Transactions on Knowledge and Data Engineering 27, 9 (2015), 2320–2334.
[14] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. 2001. Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering 13, 1 (2001), 64–78.
[15] R. Cohen, L. Katzir, and A. Yehezkel. 2017. A Minimal Variance Estimator for the Cardinality of Big Data Set Intersection. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 95–103.
[16] G. Cormode. 2017. Data sketching. Communications of the ACM 60, 9 (2017), 48–55.
[17] S. Dahlgaard, M. B. T. Knudsen, and M. Thorup. 2017. Fast Similarity Sketching. In Proceedings of the IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). 663–671.
[18] A. Dasgupta, K. J. Lang, L. Rhodes, and J. Thaler. 2016. A Framework for Estimating Stream Expression Cardinalities. In Proceedings of the 19th International Conference on Database Theory (ICDT). 6:1–6:17.
[19] T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. 2002. Mining database structure; or, how to build a data quality browser. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). 240–251.
[20] L. Devroye. 1986. Uniform and Exponential Spacings. In Non-Uniform Random Variate Generation. Springer, New York, NY, 206–245.
[21] M. Durand. 2004. Combinatoire analytique et algorithmique des ensembles de données. Ph.D. Dissertation. École Polytechnique, Palaiseau, France.
[22] R. A. L. Elworth, Q. Wang, P. K. Kota, C. J. Barberan, B. Coleman, A. Balaji, G. Gupta, R. G. Baraniuk, A. Shrivastava, and T. J. Treangen. 2020. To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics. Nucleic Acids Research 48, 10 (2020), 5217–5234.
[23] O. Ertl. 2017. New cardinality estimation algorithms for HyperLogLog sketches. (2017). arXiv:1702.01284 [cs.DS]
[24] O. Ertl. 2017. New Cardinality Estimation Methods for HyperLogLog Sketches. (2017). arXiv:1706.07290 [cs.DS]
[25] O. Ertl. 2017. SuperMinHash – A New Minwise Hashing Algorithm for Jaccard Similarity Estimation. (2017). arXiv:1706.05698 [cs.DS]
[26] O. Ertl. 2018. BagMinHash – Minwise Hashing Algorithm for Weighted Sets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 1368–1377.
[27] O. Ertl. 2020. ProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity. IEEE Transactions on Knowledge and Data Engineering (2020). https://doi.org/10.1109/TKDE.2020.3021176
[28] O. Ertl. 2021. SetSketch: Filling the Gap between MinHash and HyperLogLog (extended version). (2021). arXiv:2101.00314 [cs.DS]
[29] R. A. Fisher and F. Yates. 1938. Statistical Tables for Biological, Agricultural and Medical Research. Oliver and Boyd Ltd., Edinburgh.
[30] P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier. 2007. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Proceedings of the International Conference on the Analysis of Algorithms (AofA). 127–146.
[31] M. J. Freitag and T. Neumann. 2019. Every Row Counts: Combining Sketches and Sampling for Accurate Group-By Result Estimates. In Proceedings of the 9th Conference on Innovative Data Systems Research (CIDR).
[32] A. Helmi, J. Lumbroso, C. Martínez, and A. Viola. 2012. Data Streams as Random Permutations: the Distinct Element Problem. In Proceedings of the 23rd International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods for the Analysis of Algorithms (AofA).
[33] M. Henzinger. 2006. Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 284–291.
[34] S. Heule, M. Nunkesser, and A. Hall. 2013. HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT). 683–692.
[35] P. Indyk and R. Motwani. 1998. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC). 604–613.
[36] K. J. Lang. 2017. Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm. (2017). arXiv:1708.06839 [cs.DS]
[37] D. Lemire. 2019. Fast Random Integer Generation in an Interval. ACM Transactions on Modeling and Computer Simulation 29, 1 (2019), 3:1–3:12.
[38] D. Lemire. 2020. TestingRNG: Testing Popular Random-Number Generators. Retrieved July 18, 2021 from https://github.com/lemire/testingRNG
[39] P. Li and C. König. 2010. B-Bit Minwise Hashing. In Proceedings of the 19th International Conference on World Wide Web (WWW). 671–680.
[40] P. Li, A. Owen, and C.-H. Zhang. 2012. One Permutation Hashing. In Proceedings of the 25th Conference on Neural Information Processing Systems (NIPS). 3113–3121.
[41] P. Li, A. Shrivastava, J. Moore, and A. C. König. 2011. Hashing Algorithms for Large-Scale Learning. In Proceedings of the 24th Conference on Neural Information Processing Systems (NIPS). 2672–2680.
[42] A. Łukasiewicz and P. Uznański. 2020. Cardinality estimation using Gumbel distribution. (2020). arXiv:2008.07590 [cs.DS]
[43] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. 2007. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB). 950–961.
[44] T. Mai, A. Rao, M. Kapilevich, R. A. Rossi, Y. Abbasi-Yadkori, and R. Sinha. 2019. On Densification for Minwise Hashing. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI). 831–840.
[45] G. Marsaglia and W. W. Tsang. 2000. The Ziggurat Method for Generating Random Variables. Journal of Statistical Software 5, 8 (2000).
[46] G. Marçais, B. Solomon, R. Patro, and C. Kingsford. 2019. Sketching and Sublinear Data Structures in Genomics. Annual Review of Biomedical Data Science 2, 1 (2019), 93–118.
[47] M. Mitzenmacher, R. Pagh, and N. Pham. 2014. Efficient Estimation for High Similarities Using Odd Sketches. In Proceedings of the 23rd International Conference on World Wide Web (WWW). 109–118.
[48] M. Mitzenmacher and E. Upfal. 2005. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press.
[49] A. Nazi, B. Ding, V. Narasayya, and S. Chaudhuri. 2018. Efficient Estimation of Inclusion Coefficient Using Hyperloglog Sketches. In Proceedings of the 44th International Conference on Very Large Data Bases (VLDB). 1097–1109.
[50] N. Nissim, O. Lahav, A. Cohen, Y. Elovici, and L. Rokach. 2019. Volatile memory analysis using the MinHash method for efficient and secured detection of malware in private cloud. Computers & Security 87, 101590 (2019).
[51] B. D. Ondov, T. J. Treangen, P. Melsted, A. B. Mallonee, N. H. Bergman, S. Koren, and A. M. Phillippy. 2016. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology 17, 132 (2016).
[52] A. Pascoe. 2013. Hyperloglog and MinHash – A union for intersections. Technical Report. AdRoll. Retrieved July 18, 2021 from https://tech.nextroll.com/media/hllminhash.pdf
[53] S. Pettie and D. Wang. 2020. Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon. (2020). arXiv:2007.08051 [cs.DS]
[54] S. Pettie, D. Wang, and L. Yin. 2020. Simple and Efficient Cardinality Estimation in Data Streams. (2020). arXiv:2008.08739 [cs.DS]
[55] B. W. Priest. 2020. DegreeSketch: Distributed Cardinality Sketches on Massive Graphs with Applications. (2020). arXiv:2004.04289
[56] B. W. Priest, R. Pearce, and G. Sanders. 2018. Estimating Edge-Local Triangle Count Heavy Hitters in Edge-Linear Time and Almost-Vertex-Linear Space. In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC).
[57] D. Probst and J.-L. Reymond. 2018. A probabilistic molecular fingerprint for big data settings. Journal of Cheminformatics 10, 66 (2018).
[58] Y. Qi, P. Wang, Y. Zhang, Q. Zhai, C. Wang, G. Tian, J. C. S. Lui, and X. Guan. 2020. Streaming Algorithms for Estimating High Set Similarities in LogLog Space. IEEE Transactions on Knowledge and Data Engineering (2020). https://doi.org/10.1109/TKDE.2020.2969423
[59] J. Qin, D. Kim, and Y. Tung. 2016. LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting. (2016). arXiv:1612.02284 [cs.DS]
[60] E. Raff and C. Nicholas. 2017. An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 1007–1015.
[61] P. Reviriego, V. Bruschi, S. Pontarelli, D. Ting, and G. Bianchi. 2020. Fast Updates for Line-Rate HyperLogLog-Based Cardinality Estimation. IEEE Communications Letters 24, 12 (2020), 2737–2741.
[62] B. Scheuermann and M. Mauve. 2007. Near-optimal compression of probabilistic counting sketches for networking applications. In Proceedings of the 4th ACM SIGACT-SIGOPS International Workshop on Foundation of Mobile Computing.
[63] A. Shrivastava. 2017. Optimal Densification for Fast and Accurate Minwise Hashing. In Proceedings of the 34th International Conference on Machine Learning (ICML). 3154–3163.
[64] A. Shrivastava and P. Li. 2014. Improved Densification of One Permutation Hashing. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI). 732–741.
[65] D. Ting. 2014. Streamed Approximate Counting of Distinct Elements: Beating Optimal Batch Methods. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 442–451.
[66] R. Urban. 2020. SMhasher: Hash function quality and speed tests. Retrieved July 18, 2021 from https://github.com/rurban/smhasher
[67] P. Wang, Y. Qi, Y. Zhang, Q. Zhai, C. Wang, J. C. S. Lui, and X. Guan. 2019. A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 25–33.
[68] S. Watanabe and J. Maurer. 2020. Boost.Random C++ Library. Retrieved July 18, 2021 from https://www.boost.org/doc/libs/1_75_0/doc/html/boost_random.html
[69] Q. Xiao, S. Chen, Y. Zhou, and J. Luo. 2020. Estimating Cardinality for Arbitrarily Large Data Stream With Improved Memory Efficiency. IEEE/ACM Transactions on Networking 28, 2 (2020), 433–446.
[70] Y. Zhao, S. Guo, and Y. Yang. 2016. Hermes: An Optimization of HyperLogLog Counting in real-time data processing. In Proceedings of the International Joint Conference on Neural Networks (IJCNN). 1890–1895.
[71] Wang Yi. 2021. Wyhash. Retrieved July 18, 2021 from https://github.com/wangyi-fudan/wyhash/tree/399078afad15f65ec86836e1c3b0a103d178f15b
[72] Y. W. Yu and G. M. Weber. 2020. HyperMinHash: MinHash in LogLog space. IEEE Transactions on Knowledge and Data Engineering (2020). https://doi.org/10.1109/TKDE.2020.2981311
[73] E. Zhu, F. Nargesian, K. Q. Pu, and R. J. Miller. 2016. LSH Ensemble: Internet-Scale Domain Search. In Proceedings of the 42nd International Conference on Very Large Data Bases (VLDB). 1185–1196.
