Quotient Hash Tables - Efficiently Detecting Duplicates in Streaming Data

Rémi Géraud$^{a,c}$, Marius Lombard-Platet$^{*,a,b}$, and David Naccache$^{a,c}$

$^a$ Département d’informatique de l’ENS, École normale supérieure, CNRS, PSL Research University, Paris, France
$^b$ Be-Studys, Geneva, Switzerland
$^c$ Ingenico Labs Advanced Research, Paris, France

arXiv:1901.04358v1 [cs.DS] 14 Jan 2019

Abstract

This article presents the Quotient Hash Table (QHT), a new data structure for duplicate detection in unbounded streams. QHTs stem from a corrected analysis of streaming quotient filters (SQFs), resulting in a 33% reduction in memory usage for equal performance. We provide a new and thorough analysis of both algorithms, with results of interest to other existing constructions. We also introduce an optimised version of our new data structure dubbed Queued QHT with Duplicates (QQHTD). Finally, we discuss the effect of adversarial inputs on hash-based duplicate filters similar to QHT.

1 Introduction and Motivation

We consider in this paper the following problem: given a possibly infinite stream of symbols, detect whether a given symbol appeared somewhere in the stream. Instances of this duplicate detection problem arise naturally in many applications: backup systems [11] or Web caches [10], search engine databases or click counting in web advertisement [16], retrieval algorithms [3], data stream management systems [1], or even cryptographic contexts where nonce reuse is problematic$^1$ [15].

It is generally not possible to store the whole stream in memory, therefore practical solutions to this problem must somehow trade off memory for accuracy. The special case of a single duplicate has a known optimal solution [13, 14], but to the best of the authors' knowledge no such result exists for detecting all duplicates in an unbounded stream.

Several algorithms were proposed to address this specific question, which we discuss below. Of particular interest to our investigation is the Streaming Quotient Filter (SQF), as described by Dutta et al. in [8]. We point out several crucial mistakes in the original analysis of SQF, and in doing so highlight that a more efficient data structure can be constructed along the same lines. We flesh out such a data structure, which we call the Quotient Hash Table (QHT), and provide a thorough analysis of both SQF and QHT. This analysis is our main contribution. QHT itself is not optimal and can be further improved: we describe such an improvement, dubbed Queued QHT with Duplicates (QQHTD), and benchmark our new algorithms against popular alternatives.

2 Preliminaries and related work

2.1 Duplicate Detection

Let $E = (e_1, \dots, e_n, \dots)$ be a (possibly infinite) sequence of elements $e_i \in \Gamma$, and write $U = |\Gamma|$. An element $e_i$ from $E$ is a duplicate if $\exists e_j \in E$, $j < i$, such that $e_j = e_i$. Otherwise, $e_i$ is unseen.$^2$

The duplicate detection problem is the question of finding all duplicates in a stream $E$.$^3$ We recall the following well-known result:

Theorem 2.1. Assume that each $e_i$ is sampled uniformly at random from $\Gamma$. Then perfect detection requires $U$ memory bits.

Proof. A perfect duplicate filter must be able to store all streams $E_i = (e_1, \dots, e_n)$, i.e. must be able to store any subset of $\Gamma$. We denote the set of these subsets by $\mathcal{P}(\Gamma)$. Given that there are $|\mathcal{P}(\Gamma)| = 2^{|\Gamma|} = 2^U$ of them, according to information theory any such filter requires at least $\log_2(|\mathcal{P}(\Gamma)|) = \log_2(2^U) = U$ bits of storage.

Because of this result, perfect duplicate detection is often out of reach when $U$ is big. However, probabilistic solutions are sufficient for many applications. Such algorithms make errors: false positives (claiming a duplicate where there isn’t one) and false negatives (missing a duplicate).

2.2 Filters

Definition 2.1 (Filter). A filter $F$ over the memory $\mathcal{M}$ is a tuple $F = (\mathcal{S}, \mathrm{Detect}, \mathrm{Insert})$, where:

• $\mathcal{S} \in \mathcal{M}$ is the current state
• $\mathrm{Detect} : \Gamma \times \mathcal{M} \to \{\mathrm{DUPLICATE}, \mathrm{UNSEEN}\}$
• $\mathrm{Insert} : \Gamma \times \mathcal{M} \to \mathcal{M}$

Here DUPLICATE corresponds to a guess that the provided element is a duplicate, and UNSEEN to a guess that it is unseen. Insert corresponds to an update of the filter’s memory state after observing a new element. Here $\mathcal{M}$ models the amount of memory (states) available to the algorithm.

In practice, Detect and Insert are often merged into a single algorithm $\mathrm{Stream} \leftarrow \mathrm{Insert} \circ \mathrm{Detect}$.

Definition 2.2 (False positive (resp. negative)). If, for an unseen (resp. duplicate) element $e$, $\mathrm{Detect}(e)$ outputs DUPLICATE (resp. UNSEEN), $e$ is called a false positive (resp. negative).

Definition 2.3 (FPR, FNR). The false positive rate (FPR) of a stream $E$ is the frequency of false positives. The false negative rate (FNR) is similarly defined as the frequency of false negatives.

We are interested in filters whose FNR and FPR can be kept low when $\mathcal{M}$ is bounded.

$^*$ Part of this work was done while the author was visiting Ingenico Labs.
$^1$ A nonce is a one-time use random number.
$^2$ Note that by definition $e_1$ is always unseen.
$^3$ Note that this problem is equivalent to a dynamic formulation of the approximate set membership problem.

2.3 Hash-based filters

Filters rely on hashing to efficiently answer the duplicate detection problem. These filters are often variations on the well-known Bloom filter [2]. A Bloom filter uses an array $T$ of $M$ bits, initially all set to 0, together with $k$ hash functions $h_i$. Insertion of an element $e$ is made by setting all $T[h_i(e)]$ to 1. Detection of an element $f$ takes the AND of all cells $T[h_i(f)]$: if the AND value is 0, then emit UNSEEN, otherwise emit DUPLICATE.

It is easy to see that this approach has an FNR of 0. However, as the stream grows, the FPR gets worse, and in the limit of an infinite stream the FPR is 1. Many variants have been proposed in the literature to compensate for this effect [20], mostly by allowing deletion [5, 6], but doing so increases the FNR.

Alternatively, other filters prefer to store fingerprints: this is the rationale behind several recent constructions [8, 9]. For the data structures most interesting to our question, we refer the reader to the algorithms described in [7, 9, 19, 8, 22].$^4$

Other data structures, such as [4, 12], are not considered in this paper, as they require an unbounded amount of memory as the stream grows.

2.4 Streaming Quotient Filter

One construction which we must describe at length is the Streaming Quotient Filter (SQF) [8]. Given an element $e$, a certain fingerprint $s(e)$ is stored in an array, at row $h(e)$$^5$ and a certain column amongst $k$. This array constitutes the filter’s state.

The filter’s construction uses integers $q$, $k$, $r$, and $r' < r$, and a hash function $h : \{0,1\}^* \to \{0,1\}^{q+r}$. The filter’s state is an array of $2^q$ rows and $k$ columns, each cell holding a $\sigma$-bit element (with $\sigma = r' + \lceil \log_2(r+1) \rceil$) or the special empty symbol $\bot$. The filter’s state is initially $\bot$ in every cell.

Then, [8] describes $\mathrm{Stream}(e)$ as follows:

• Compute $h(e)$. The $q$ most significant bits of $h(e)$ define $h_e^q$, and the $r$ least significant bits define $h_e^r$.
• Let $w_e$ be the Hamming weight of $h_e^r$, and let $h_e^{r'}$ be $r'$ bits deterministically chosen from $h_e^r$.$^6$ Let $s(e) \leftarrow h_e^{r'} \| w_e$, where $\|$ denotes concatenation.
• If $s(e)$ is already stored in the row numbered $h_e^q$, emit DUPLICATE.

1. Cells in the filter’s state are not independent: in particular, in every row, non-empty cells hold different values (by design);
2. Terms in geometric sums of order > 1 cannot be neglected.

Taking these effects into account (see Sections 4.1 and 4.2), and using the same approximation as [8]$^7$, $\binom{2r}{r} \approx \frac{4^r}{\sqrt{\pi r}}$, we get the following asymptotic formulae for the FNR and FPR

$$\mathrm{FPR}_\infty \approx \frac{k}{2^{r'} \sqrt{\pi r}}, \qquad \mathrm{FNR}_\infty \approx 1 - \frac{k}{2^{r'} \sqrt{\pi r}}$$

that disagree with [8]. Most importantly, the error rates do not decrease to 0, contrary to what was claimed. Interestingly, the suggested parameters ($r = 2$, $k = 4$, $r' = r/2 = 1$) indeed achieve an FNR of 0, but also an FPR of 1.$^8$

Furthermore, there is redundancy between the Hamming weight $w_e$ of $h_e^r$ and the reduced remainder $h_e^{r'}$. For instance, if $h_e^{r'}$ contains at least one bit set to 1, we know that $w_e \neq 0$. Intuitively this means SQF is wasteful, and we could expect to avoid collisions in the filter’s state by using a better adjusted encoding: this intuition happens to be correct, as shown in Section 4.4.

Finally, since the state table contains $2^q$ rows with $k$ cells each, the total memory required by an SQF is $M = 2^q k \sigma$ bits.

3.1 Full-size Hashing, Memory Adjustments

Our first observation is that SQF’s fingerprint scheme (a hash and a Hamming weight) can be fruitfully replaced by a single hash function of the same size. Not only does this simplify the theoretical analysis, it also provides a much more efficient use of the available space. We also use more flexible hash functions, which give much more freedom in adjusting the total memory $M$ of the filter. Combining these two effects, we obtain Algorithm 1, which we call Quotient Hash Table (QHT).

Algorithm 1 QHT Setup and Stream
1: function Setup($M, \sigma, k$)    ▷ $M > \sigma k$ and $0 < k \le 2^\sigma$
2:     Let $N \leftarrow \lfloor M/(k \cdot \sigma) \rfloor$
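As a rough illustration of the SQF cell layout recalled in Section 2.4, the following Python sketch computes the row index $h_e^q$, the fingerprint $s(e) = h_e^{r'} \| w_e$, the cell width $\sigma$ and the memory $M = 2^q k \sigma$. It rests on two assumptions that do not come from [8] or from this paper: SHA-256 truncated to $q + r$ bits stands in for $h$, and the $r'$ "deterministically chosen" bits are taken to be the low-order bits of $h_e^r$. All function names are hypothetical.

```python
import hashlib
from typing import Tuple


def sqf_sigma(r: int, r_prime: int) -> int:
    """Cell width in bits: r' remainder bits plus ceil(log2(r + 1)) weight bits."""
    # For r >= 0, r.bit_length() equals ceil(log2(r + 1)), without using floats.
    return r_prime + r.bit_length()


def sqf_memory_bits(q: int, k: int, r: int, r_prime: int) -> int:
    """Total memory M = 2^q * k * sigma, in bits."""
    return (1 << q) * k * sqf_sigma(r, r_prime)


def sqf_fingerprint(element: bytes, q: int, r: int, r_prime: int) -> Tuple[int, int]:
    """Return (row, s) where row is h_e^q and s is the fingerprint s(e)."""
    # Assumption: SHA-256 truncated to q + r bits plays the role of h (needs q + r <= 256).
    digest = int.from_bytes(hashlib.sha256(element).digest(), "big")
    h = digest >> (256 - (q + r))
    row = h >> r                          # q most significant bits of h(e)
    remainder = h & ((1 << r) - 1)        # r least significant bits of h(e)
    weight = bin(remainder).count("1")    # Hamming weight w_e, between 0 and r
    # Assumption: the r' chosen bits are the low-order bits of the remainder.
    reduced = remainder & ((1 << r_prime) - 1)
    s = (reduced << r.bit_length()) | weight  # concatenation h_e^{r'} || w_e
    return row, s


# Parameters quoted in the text: r = 2, k = 4, r' = 1; q = 10 is an arbitrary choice.
print(sqf_sigma(r=2, r_prime=1))                   # 3 bits per cell
print(sqf_memory_bits(q=10, k=4, r=2, r_prime=1))  # 12288 bits
print(sqf_fingerprint(b"example", q=10, r=2, r_prime=1))
```

With this encoding the redundancy noted in the text is directly visible: whenever the reduced remainder part of $s(e)$ is nonzero, the weight part cannot be 0, so some of the $2^\sigma$ possible cell values are never used.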