arXiv:1901.04358v1 [cs.DS] 14 Jan 2019

Quotient Hash Tables - Efficiently Detecting Duplicates in Streaming Data

Rémi Géraud […]
Département d’informatique de l’ENS, École normale supérieure […]

This article presents the Quotient Hash Table (QHT), a new data structure […] streams. QHTs stem from a corrected analysis of streaming quotient filters […] memory usage for equal performance. We provide a new and thorough […] interest to other existing constructions. We also introduce an optimised version of our new data structure […] Finally we discuss the effect of adversarial inputs for hash-[…]

1 Introduction and Motivation

We consider in this paper the following problem: given a possibly infinite stream of symbols, detect whether a given symbol appeared somewhere in the stream. It turns out that instances of this duplicate detection problem […] many applications: backup systems [11] or Web caches [10], search engine or click counting in web advertisement [16], retrieval algorithms [3], data stream management systems [1], or even cryptographic contexts where nonce¹ reuse is problematic [15].

It is generally not possible to store the whole stream in memory, therefore practical solutions to this problem must somehow trade off memory for accuracy. The special case of detecting a single duplicate has a known optimal solution [13, 14], but to the best of the authors' knowledge no such result exists for detecting all duplicates in an unbounded stream.

Several algorithms were proposed to address this specific question, which we discuss below. Of particular interest to our investigation are Streaming Quotient Filter (SQF), as described by Dutta et al. in [8]. We highlight several crucial mistakes in the original analysis of SQF, and point out that a more efficient data structure can be constructed along the same lines. In so doing we provide a thorough analysis of the structure, which we call Quotient Hash Table (QHT), and flesh out the analysis of both SQF and QHT. This is our main contribution. QHT itself is not optimal and can be further improved: we describe such an improvement, dubbed Queued QHT with Duplicates (QQHT), and benchmark our new algorithms against popular alternatives.

2 Preliminaries and related work

2.1 Duplicate Detection

Let $E = (e_1, e_2, \ldots, e_n, \ldots)$ […] elements […] $\Gamma$ […] duplicate […] unseen […] well-known result:

Theorem 2.1. Assume that each $e_i$ is […] random from $\Gamma$, and write […]. Then perfect detection requires […]

∗ Part of this work was done while the author was visiting Ingen[…]
¹ A nonce is a one-time use random number.
² Note that by definition […]
³ Note that this problem is equivalent to a dynamic formulation […] all duplicates in a stream […]

[…] also needed. Insertion of an element $e$ is made by setting all $T[h_i(e)]$ to 1. Detection of an element $f$ is the AND of all cells $T[h_i(f)]$: if the AND value is 0, then emit UNSEEN, otherwise emit DUPLICATE.

It is easy to see that this approach has an FNR of 0. However, as the stream grows, the FPR gets worse, and in the limit of an infinite stream the FPR is 1. Many variants have been proposed in the literature to compensate for this effect [20], mostly by allowing deletion [5, 6], but doing so increases the FNR.

Alternatively, other filters prefer to store fingerprints: this is the rationale behind several recent constructions [8, 9].

For the data structures most interesting to our question, we refer the reader to the algorithms described in [7, 9, 19, 8, 22].⁴

Other data structures, such as [4, 12], are not considered in this paper, as they require an unbounded amount of memory as the stream grows.

2.4 Streaming Quotient Filter

One construction which we must describe at length is the Streaming Quotient Filter (SQF) [8]. Given an element $e$, a certain fingerprint⁵ $s(e)$ is stored in an array, at row $h(e)$ and a certain column amongst $k$. This array constitutes the filter’s state.

The filter’s construction uses integers $q$, $k$, $r$, $r' < r$ and a hash function $h : \{0,1\}^* \to \{0,1\}^{q+r}$. The filter’s state is an array of $2^q$ rows and $k$ columns, each holding a $\sigma$-bit element (with $\sigma = r' + \lceil \log_2(r+1) \rceil$), or the special empty symbol $\bot$. The filter’s state is initially $\bot$ in every cell.

Then, [8] describes Stream($e$) as follows:

• Compute $h(e)$. The $q$ most significant bits of $h(e)$ define $h_e^q$, and the $r$ least significant bits define $h_e^r$.
• Let $w_e$ be the Hamming weight of $h_e^r$, and let $h_e^{r'}$ be $r'$ bits deterministically chosen⁶ from $h_e^r$. Let $s(e) \leftarrow h_e^{r'} \,\|\, w_e$ where $\|$ denotes concatenation.
• If $s(e)$ is already stored in the row numbered $h_e^q$, emit […]

[…] cannot be neglected.

Taking these effects into account (see Sections 4.1 and 4.2), and using the same approximation as [8],⁷ namely $\binom{2r}{r} \approx \frac{4^r}{\sqrt{\pi r}}$, we get the following asymptotic formulae for the FNR and FPR:

\[
\mathrm{FPR}_\infty \approx \frac{k}{2^{r'}\sqrt{\pi r}}, \qquad \mathrm{FNR}_\infty \approx 1 - \frac{k}{2^{r'}\sqrt{\pi r}}
\]

These disagree with [8]; most importantly, the error rates do not decrease to 0, contrary to what was claimed. Interestingly, the suggested parameters ($r = 2$, $k = 4$, $r' = r/2 = 1$) indeed achieve an FNR of 0, but also an FPR of 1.⁸

Furthermore, there is redundancy between the Hamming weight $w_e$ of $h_e^r$ and the reduced remainder $h_e^{r'}$. For instance, if $h_e^{r'}$ contains at least one bit set to 1, we know that $w_e \neq 0$. Intuitively this means SQF is wasteful, and we could expect to avoid collisions in the filter’s state by using a better adjusted encoding: this intuition happens to be correct, as shown in Section 4.4.

Finally, since the state table contains $2^q$ rows, with $k$ cells each, the total memory required by an SQF is $M = 2^q k \sigma$.

3.1 Full-size Hashing, Memory Adjustments

Our first observation is that SQF’s fingerprint scheme (a hash and a Hamming weight) can be fruitfully replaced by a single hash function of the same size. Not only does this simplify the theoretical analysis, it also provides a much more efficient use of the available space. We also use more flexible hash functions, which give much more freedom in adjusting the total memory $M$ of the filter. Combining these two effects, we obtain Algorithm 1, which we call Quotient Hash Table (QHT).

Algorithm 1 QHT Setup and Stream
  function Setup($M$, $\sigma$, $k$)    ⊲ $M > \sigma k$ and 0 […]

⁴ Cuckoo filters [9] require a minimal adaptation for unbounded streams: in the original paper, failure is emitted after some number of relocations; we just discard the failure. Theoretical analysis of this new structure is not addressed in this paper.
⁵ [8] refers to them as “signatures”, but we shun this term to avoid any claim of cryptographic properties.
⁶ E.g., the $r'$ least significant bits of $h_e^r$.
⁷ Even though not mentioned in [8], the approximation (stemming from Catalan numbers) is only asymptotically valid, but computations show that the approximation is correct even for small values of $r$, such as the ones used in practical applications.
⁸ With these values, only 4 distinct fingerprints exist, and they can all be stored in the 4 cells per row. When the filter is full, any duplicate will be reported as such, hence an FNR of 0. But every new element will also be reported as duplicate, hence an FPR of 1.
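The fingerprint construction of Section 2.4 and the count given in footnote 8 can be checked by brute-force enumeration. The following is a minimal sketch rather than the implementation of [8]: the function names are ours, and it assumes (as in footnote 6) that the $r'$ retained bits are the least significant ones.

```python
def sqf_fingerprint(h_r, r_prime):
    """Return (reduced remainder, Hamming weight) for an r-bit remainder h_r."""
    reduced = h_r & ((1 << r_prime) - 1)  # assumption: keep the r' least significant bits
    weight = bin(h_r).count("1")          # Hamming weight of the full r-bit remainder
    return reduced, weight


def distinct_fingerprints(r, r_prime):
    """Enumerate every r-bit remainder and collect the fingerprints it can produce."""
    return {sqf_fingerprint(h_r, r_prime) for h_r in range(1 << r)}


def fingerprint_bits(r, r_prime):
    """sigma = r' + ceil(log2(r + 1)), computed with integer arithmetic."""
    return r_prime + r.bit_length()  # r.bit_length() == ceil(log2(r + 1)) for r >= 0


if __name__ == "__main__":
    r, r_prime = 2, 1  # parameters suggested in [8]
    used = distinct_fingerprints(r, r_prime)
    sigma = fingerprint_bits(r, r_prime)
    print(sorted(used))
    print(len(used), "fingerprints used out of", 2 ** sigma, "encodable values")
```

With $r = 2$ and $r' = 1$ the enumeration yields 4 reachable fingerprints although $\sigma = 3$ bits could encode 8, which illustrates the redundancy between the Hamming weight and the reduced remainder.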

3.2 Empty Cells

SQF and QHT as described above make essential use of the “empty” cells in $T$. The need for this feature is present for all fingerprint-based structures, including Cuckoo filters [9].⁹ However, a low-level implementation cannot rely on the availability of such a special value. Our options are to initialise all cells to 0 and either

1. treat 0 as a fingerprint;
2. if an element has a fingerprint of 0, reassign its fingerprint to 1;
3. while an element has a fingerprint of 0, compute a new fingerprint based on some deterministic scheme.

The first option carries the risk of a high false positive rate, even for small streams: when a new element, whose fingerprint is 0, should be stored in an empty bucket, it is instead dismissed as a duplicate. More specifically, before the filter is completely filled, a new element has probability at least $1/S$ of being a false positive, which leads to a high number of false positives at the beginning of any stream.

The second option is significantly faster than the third¹⁰, which on the other hand has better statistical properties, making the analysis simpler. While the second option was preferred by [9] in their implementation¹¹, we retain the slower, but easier to analyse, option.

[…]

3.4 Comparison with Hash Tables

As the name implies, QHTs are related to hash tables. Indeed, a hash table is a QHT in which the number of rows is equal to 1. In particular, QHTs cannot perform worse than such structures, and bear similarities with so-called compaction techniques [21].

4 Error Rate Analysis

4.1 False Positive Rate

Consider a QHT with $N$ rows, $k$ buckets per row, and $s$-bit fingerprints. For simplicity, further assume that no bucket is empty (which is true after some time), and that the stream is sampled uniformly at random from $\Gamma$.

Theorem 4.1. For a QHT of $N$ rows, $k$ buckets per row and $S$ possible fingerprints, as the number of insertions goes to infinity, the FPR goes to $k/S$. Moreover, the probability that an unseen element inserted after $m$ other elements triggers a false positive is

\[
\mathrm{FP}_m = \frac{k}{S}\left(1 - \left(1 - \frac{1}{Nk}\right)^{m}\right).
\]

Proof. An element $e$ is a false positive if and only if $e$ has not been encountered yet, but Detect($e$) = DUPLICATE. This event is triggered by the presence, in the filter, of another element $e'$ with a hash and fingerprint colliding with those of $e$. $e'$ is called a false duplicate, and if $e'$ is still in the filter when $e$ arrives, we refer to the event as a hard collision.

Our first remark is that the only false duplicate that may create a hard collision with $e$ is the last false duplicate inserted before $e$ arrives: let us assume that $e_1$, $e_2$ are false duplicates, and that $e_1$ arrives before $e_2$. When we insert $e_2$ in the filter, $e_1$ is either still in the filter or has been evicted.

• If $e_1$ has been evicted, then $e_1$ will not cause a hard collision with $e$.
• If $e_1$ is still in the filter, then $e_2$ will be claimed as a duplicate and dismissed.

However, if we look at the table $T$ storing all fingerprints, dismissing $e_2$ is strictly equivalent to replacing $e_1$ by $e_2$. Consequently, every false duplicate is erased by any new false duplicate, and only the last false duplicate can cause a hard collision. As a result, we will only focus on the probability that the last false duplicate (before $e$ arrives) causes a hard collision.

Let us assume that the last false duplicate appears at position $i$ of the stream $E = \{e_1, e_2, \ldots, e_m, e\}$; in other words, $e_i$ is the last false duplicate in the stream before $e$.

Now, $e_i$ has to remain in the filter until $e$ arrives, even though new elements are added. Let us suppose that element $e_j$, $i$ […]

Since we know that $e_i$ is the last false duplicate, we cannot simultaneously have $h(e_j) = h(e_i)$ and $s(e_j) = s(e_i)$. As such, the first two conditions are not independent. Let $P_{hs}$ be the probability that these two conditions are satisfied. Given that $e_i$ is the last false duplicate, there are only $NS - 1$ possibilities for the couple $(h(e_j), s(e_j))$. Moreover, among these $NS - 1$ states, only $S - 1$ verify the first condition, and among these $S - 1$ states, only $S - k$ verify the second condition. Finally, $P_{hs} = \frac{S-k}{NS-1}$. If we assign the last event to the probability $P_{selection}$, we immediately get $P_{selection} = \frac{1}{k}$.

Finally, the probability $P_{\neg evict}$ of $e_i$ not being evicted by $e_j$ is $P_{\neg evict} = 1 - P_{hs} P_{selection}$, that is,

\[
P_{\neg evict} = 1 - \frac{S-k}{k(NS-1)}.
\]

Now, $e_i$ has to avoid eviction by every element before $e$ arrives, i.e. by all elements $e_{i+1}, \ldots, e_m$, which happens with probability $P(hc)_i = (P_{\neg evict})^{m-i}$.

At that point, we know the probability that a hard collision happens when the last false duplicate has been inserted at position $i$.

The probability of any element $e'$ being a false duplicate is $P_{fd} = \frac{1}{N}\cdot\frac{1}{S}$. In the stream $E = (e_1, \ldots, e_m, e)$, the last false duplicate is $e_m$ with probability $P_{fd}$; it is $e_{m-1}$ with probability $(1 - P_{fd}) P_{fd}$; and $e_{m-k}$ with probability $(1 - P_{fd})^k P_{fd}$. The probability that the next element will result in a false positive after $m$ elements are inserted is equal to the probability that the last false duplicate is not evicted. Thus the probability $\mathrm{FP}_m$ that a false positive happens after $m$ elements are inserted is

\[
\begin{aligned}
\mathrm{FP}_m &= \sum_{j=1}^{m} P[e_j \text{ is the last false duplicate}] \times P[e_j \text{ is not evicted until } e \text{ arrives}] \\
&= \sum_{j=1}^{m} (1 - P_{fd})^{m-j} P_{fd} \times P(hc)_j \\
&= \sum_{j=1}^{m} (1 - P_{fd})^{m-j} P_{fd} \left(1 - \frac{S-k}{k(NS-1)}\right)^{m-j} \\
&= P_{fd}\, \frac{1 - \left[(1 - P_{fd})\left(1 - \frac{S-k}{k(NS-1)}\right)\right]^{m}}{1 - (1 - P_{fd})\left(1 - \frac{S-k}{k(NS-1)}\right)} \\
&= \frac{1}{NS}\, \frac{1 - \left[\left(1 - \frac{1}{NS}\right)\left(1 - \frac{S-k}{k(NS-1)}\right)\right]^{m}}{1 - \left(1 - \frac{1}{NS}\right)\left(1 - \frac{S-k}{k(NS-1)}\right)} \\
&= \frac{1 - \left[1 - \frac{S-k}{(NS-1)k} - \frac{1}{NS} + \frac{S-k}{kNS(NS-1)}\right]^{m}}{1 + \frac{NS(S-k)}{(NS-1)k} - \frac{S-k}{k(NS-1)}} \\
&= \frac{1 - \left[1 - \frac{1}{NS} - \frac{S-k}{NSk}\right]^{m}}{1 + \frac{S-k}{k}} \\
&= \frac{k}{S}\left(1 - \left(1 - \frac{1}{Nk}\right)^{m}\right)
\end{aligned}
\]

We now have the probability that a new element $e$, inserted after $m$ insertions, is detected as a false positive.

Assuming the stream is of size $n$, and noting $\mathrm{FPR}_n$ its false positive rate, we get that $\mathrm{FPR}_n$ is equal to the average of all $\mathrm{FP}_m$ of its stream, i.e., $\mathrm{FPR}_n = \frac{1}{n}\sum_{m=1}^{n} \mathrm{FP}_m$. Given that $\lim_{m\to\infty} \mathrm{FP}_m = \frac{k}{S}$ and using Cesàro’s mean properties, we get, as one could have expected, $\mathrm{FPR}_n \xrightarrow[n\to\infty]{} \frac{k}{S}$.

Thanks to the expression of $\mathrm{FP}_m$, we can see that the more rows there are, the slower the FPR reaches its asymptotic (saturated) value.

A similar phenomenon can be observed with the Stable Bloom Filter [7]¹²: adding rows will only decrease the FPR to a certain point. Another structure adapted to streaming data that we found, the block decaying Bloom Filter [19], operates on a sliding window and therefore does not use the same false positive definition as ours¹³.

4.1.1 Application to SQF

One might be tempted to directly apply this result to an SQF. However, as pointed out earlier, in an SQF fingerprints are correlated and therefore not equiprobable¹⁴.

However, for an optimal SQF, fingerprints are equiprobable, so the analysis above holds for optimal SQFs. In the general case, the authors of [8] have approximated the probability of fingerprint collision in an SQF with $\frac{1}{2^{r'}\sqrt{\pi r}}$. Replacing $1/S$ with this probability in our analysis, we get an approximate asymptotic FPR of $\frac{k}{2^{r'}\sqrt{\pi r}}$ for SQFs, as announced in Section 3.

4.2 False Negative Rate

Theorem 4.2. For a QHT of $N$ rows, $k$ buckets per row, and $S$ different fingerprints, assuming $U \gg N$, then as the number of insertions goes to infinity, $\mathrm{FNR}_\infty = 1 - \frac{k}{S}$.

Proof. Assume that $e$ is a duplicate; we denote by $e_i$ the last element in the stream such that $e_i = e$. Following the same reasoning as for the FPR, $e_i$ will trigger a false negative if and only if $e_i$ is removed from the filter before $e$ arrives, and if any false duplicate of $e$, inserted between the removal of $e_i$ and $e$, is deleted before $e$ arrives.

Let us assume that $e_i$ is deleted at time $j$ (this happens with probability $P^{Del}_{i,j}$) by something other than a false duplicate. Using arguments similar to those of the previous section, we have

\[
P^{Del}_{i,j} = \left(1 - \frac{S-k}{k(NS-1)}\right)^{j-i-1} \frac{S-k}{k(NS-1)}.
\]

The probability that all false duplicates, inserted after time $j$, are deleted before $e$ arrives is $1 - \mathrm{FP}_{m-j}$.

Let $a = \frac{S-k}{k(NS-1)}$, $b = \frac{k}{S}$ and $c = 1 - \frac{1}{Nk}$, and denote by $\mathrm{FN}_{i,m}$ the probability, in a stream $E = (e_1, \ldots, e_m, e)$ where the last duplicate of $e$ is inserted at $i$, that $e_i$ is deleted before $e$ arrives and no false positive persists until $e$:

\[
\begin{aligned}
\mathrm{FN}_{i,m} &= \sum_{j=i+1}^{m} P^{Del}_{i,j}\,(1 - \mathrm{FP}_{m-j}) \\
&= \sum_{j=i+1}^{m} (1-a)^{j-i-1}\, a \left(1 - b\left(1 - c^{m-j}\right)\right) \\
&= a(1-b) \sum_{j=i+1}^{m} (1-a)^{j-i-1} + ab \sum_{j=i+1}^{m} (1-a)^{j-i-1} c^{m-j}
\end{aligned}
\]

Given that $1 - a \neq c$, we get:

\[
\mathrm{FN}_{i,m} = (1-b)\left(1 - (1-a)^{m-i}\right) + ab\, \frac{(1-a)^{m-i} - c^{m-i}}{1 - a - c}
\]

The probability $\mathrm{FN}_m$ that the $(m+1)$-th element $e$ of the stream is a false negative is then

\[
\mathrm{FN}_m = \sum_{i=1}^{m} P_{dup,i} \cdot \mathrm{FN}_{i,m},
\]

where $P_{dup,i}$ is the probability that the last duplicate already seen is $e_i$, so $P_{dup,i} = \frac{1}{U}\left(\frac{U-1}{U}\right)^{m-i}$.

For a stream of $n$ elements, the false negative rate $\mathrm{FNR}_n$ is defined as the average error probability: $\mathrm{FNR}_n = \frac{1}{n}\sum_{m=1}^{n} \mathrm{FN}_m$.

⁹ Other constructions do not face this issue, including SBF [7] and b_DBF [19]: in these schemes, 0 always codes for absence.
¹⁰ The third option needs, on average, $\frac{S}{S-1}$ hash computations for each insertion, whereas the second option only needs 1.
¹¹ https://github.com/efficient/cuckoofilter/blob/aac6569cf30f0dfcf39edec1799fc3f8d6f594da/src/cuckoofilter.h
¹² In a Stable Bloom Filter, the FPR is bounded by $\left(1 - \left(\frac{1}{1 + \frac{1}{P(1/K - 1/m)}}\right)^{Max}\right)^{K}$, where $m$ is the number of rows, and $K$, $P$, $Max$ are diverse filter parameters. We clearly see that a higher number of rows will only decrease the FPR down to a certain point, but no further.
¹³ More precisely, their false positive definition is restricted to the sliding window: an element is a false positive if it is not already present in the sliding window, and yet marked as a duplicate.
¹⁴ For example, consider the SQF with the parameters $r = 3$, $r' = 1$. Only the hash $h_1^r = 000$ will lead to the fingerprint 000, whereas both hashes $h_2^r = 001$ and $h_3^r = 010$ will lead to the fingerprint 001.
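The simplification used in the proof of Theorem 4.1 can be double-checked with exact rational arithmetic. The sketch below compares the explicit sum over the position of the last false duplicate with the closed form $\frac{k}{S}\left(1 - (1 - \frac{1}{Nk})^m\right)$. The function names and the parameter values ($N = 8$, $k = 2$, $S = 16$) are illustrative choices of ours, not taken from the paper.

```python
from fractions import Fraction


def fp_sum(m, N, k, S):
    """FP_m as the explicit sum over the position j of the last false duplicate."""
    p_fd = Fraction(1, N * S)                             # P_fd = 1 / (N S)
    p_not_evicted = 1 - Fraction(S - k, k * (N * S - 1))  # P_{not evict}
    return sum(
        (1 - p_fd) ** (m - j) * p_fd * p_not_evicted ** (m - j)
        for j in range(1, m + 1)
    )


def fp_closed(m, N, k, S):
    """Closed form of Theorem 4.1: (k / S) * (1 - (1 - 1 / (N k))^m)."""
    return Fraction(k, S) * (1 - (1 - Fraction(1, N * k)) ** m)


if __name__ == "__main__":
    N, k, S = 8, 2, 16  # illustrative parameters
    for m in (1, 10, 100, 400):
        assert fp_sum(m, N, k, S) == fp_closed(m, N, k, S)
        print(m, float(fp_closed(m, N, k, S)))
    print("limit k/S =", k / S)
```

Because `Fraction` is exact, the equality test is not affected by rounding. The printed values approach $k/S$ as $m$ grows, in line with the asymptotic statement of the theorem.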

We show, in Appendix B, that for $\mathrm{FNR}_\infty = \lim_{n\to\infty} \mathrm{FNR}_n$,

\[
\begin{aligned}
\mathrm{FNR}_\infty = 1 - b &- \frac{1-b}{U - (U-1)(1-a)} \\
&+ \frac{ab}{(1-a-c)\left(U - (U-1)(1-a)\right)} \\
&- \frac{ab}{(1-a-c)\left(U - (U-1)c\right)}
\end{aligned}
\]

A non-obvious consequence of the above expression for the FNR is that when $N$ increases, which corresponds to using more memory, it is possible to achieve an arbitrarily small error rate.¹⁵ At some point however, the memory is so large that every stream element can be stored, and there is no need for probabilistic data structures anymore. In practice we expect memory to be a limited resource, so this situation is unlikely to present itself.

Finally, assuming $U \gg Nk$ (or even $U \gg N$), i.e., that there are more distinct elements in the stream than what the filter is able to store, we get the limit rate of $1 - b$, which is $\mathrm{FNR}_\infty \approx 1 - \frac{k}{S}$.

Note that $\mathrm{FNR}_\infty + \mathrm{FPR}_\infty = 1$.

The value of $\mathrm{FNR}_\infty$ also gives (using the same corrections as for the FPR in Section 4.1.1) the FNR for SQF.

4.3 Error Rate and Filter Saturation

Claim 4.1. The asymptotic relation $\mathrm{FNR}_\infty + \mathrm{FPR}_\infty = 1$ is universal (as long as $U \gg M$) amongst hash-based duplicate detection filters and is the result of the filter’s saturation. Furthermore, when a filter reaches saturation, it behaves similarly, from the error rate point of view, to a filter answering DUPLICATE at random (we will call such filters random filters).

Proof. In order to see this, first note that random filters always verify the relation FNR + FPR = 1. Given that a random filter will return DUPLICATE with a probability of $p$, an unseen element will be classified as DUPLICATE with probability $p$, and a duplicate will be classified as UNSEEN with probability $1 - p$, hence the result.

Now, on infinite streams with infinitely many different elements, filters are saturated with information. Given that a filter can only store at most one element per bit (cf. Section 2.1), a filter of size $M$ can remember at most $M$ elements. However, after $fM$ insertions (with $f \gg 1$), the filter remembers at most a tiny fraction $\frac{1}{f} \ll 1$ of the stream. Having reached its saturated state, the filter has an extremely small probability $\frac{1}{f} \approx 0$ of correctly guessing whether the incoming element is a duplicate or not, given what the filter actually knows about the stream. For this reason, the best strategy of a saturated filter is almost indistinguishable from a random strategy, i.e. randomly outputting DUPLICATE. When the stream grows indefinitely, the filter becomes asymptotically equivalent to a random filter.

Furthermore, in practical cases, we observe that duplicate filters are equivalent to a specific kind of random filter: they answer DUPLICATE with some fixed probability $p$, with $p$ depending on the filter’s nature and its parameters (i.e., $p$ does not change with time).

Interpretation. The interpretation of these results could suggest that streaming filters are useless: they need more memory than a random filter, despite being asymptotically equivalent. However, this is only true because of our hypotheses and definitions: we define a false negative to be a duplicate element claimed as unseen by the filter. Yet after some amount of time, it is often acceptable that the element may be considered as unseen again: for instance a nonce is theoretically unique, but in practice after a reasonable amount of time nonce reuse is not a vulnerability. For this reason, adapting false positive and false negative to sliding windows may be relevant here. Moreover, we assumed that all elements of the stream had the same probability of occurrence. In practice, this hypothesis is not always correct, and as we will see in Section 6, filters operating on real data perform significantly better than random filters, and resist saturation better.

4.4 Comparing QHT to SQF

Let us compare the memory required for a QHT to reach the same error rates as an SQF.

Given that both FPR and FNR depend only on $k$, $N$ and $S$, imposing the equality on these parameters ensures that both filters have exactly the same FPR and FNR.

Note that $S$ is not a user-chosen parameter, but rather a consequence of other parameters.

Theorem 4.3. For exactly the same FPR and FNR, a QHT requires 33% less memory than an SQF.

The proof is made through the following subsections.

4.4.1 Deriving S from Filters Parameters

For a QHT, $S$ is derived from the number of bits $s$ of the fingerprint, with the straightforward relation¹⁶ $S = 2^s$. For an SQF, however, the relation is more complicated.

When $r$ and $r'$ are fixed, for any element $e$, let $h_e^r$ be decomposed as $h_e^r = h_e^{r'} \,\|\, f$, where $h_e^{r'}$ is the $r'$-bit word used in the fingerprint, $f$ being the $r - r'$ remaining bits of $h_e^r$. For $\omega(\cdot)$ the Hamming weight function, we have $s(e) = h_e^{r'} \,\|\, \omega(h_e^r)$. Yet $\omega(h_e^r) = \omega(h_e^{r'}) + \omega(f)$. We know that $\omega(h_e^{r'})$ is entirely dependent on $h_e^{r'}$, which is already used in the fingerprint. Thus, if we fix $h_e^{r'}$, there are only $r - r' + 1$ possible values for $\omega(f)$ and thus for $\omega(h_e^r)$.

Given that $h_e^{r'}$ can have $2^{r'}$ different values, we get that $S_{SQF} = 2^{r'} \cdot (r - r' + 1)$.

4.4.2 Comparing Required Memory

For QHTs, we have the relation $M_{QHT} = N_{QHT}\, k_{QHT}\, s$. For SQF, the formula is rather $M_{SQF} = N_{SQF}\, k_{SQF}\, (r' + \lceil \log_2(r+1) \rceil)$ (because fingerprints occupy $r' + \lceil \log_2(r+1) \rceil$ bits).

Given that $N_{QHT} = N_{SQF}$ and $k_{QHT} = k_{SQF}$, the ratio $\frac{M_{QHT}}{M_{SQF}}$ is:

\[
\frac{M_{QHT}}{M_{SQF}} = \frac{N_{QHT}\, k_{QHT}\, s}{N_{SQF}\, k_{SQF}\, (r' + \lceil \log_2(r+1) \rceil)} = \frac{s}{r' + \lceil \log_2(r+1) \rceil}
\]

Given that $s = \lceil \log_2(S_{QHT}) \rceil = \lceil \log_2(S_{SQF}) \rceil = \lceil \log_2(2^{r'} \cdot (r - r' + 1)) \rceil$, we have

\[
\frac{M_{QHT}}{M_{SQF}} = \frac{\lceil \log_2(2^{r'}(r - r' + 1)) \rceil}{r' + \lceil \log_2(r+1) \rceil} = \frac{r' + \lceil \log_2(r - r' + 1) \rceil}{r' + \lceil \log_2(r+1) \rceil}
\]

¹⁵ This fact is straightforward but requires the substitution of $a$, $b$, $c$ by their full expression, before taking the limit.
¹⁶ If we are on a system without the empty feature (see Section 3.2), then $S = 2^s - 1$. For an SQF, one can just assign the empty value to one of the unassigned fingerprints.
¹⁷ Their optimal choice of parameters also includes setting $k = 4$. However, setting $r = 2$ and $r' = 1$ imposes $S = 4$, so that setting $k = 4$ results in an FPR of 1: after some time the filter systematically responds DUPLICATE. Using e.g. $k = 3$ avoids this.
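The memory ratio of Section 4.4.2 is easy to evaluate for concrete parameters. The sketch below is only an illustration of the two formulas $S_{SQF} = 2^{r'}(r - r' + 1)$ and $M_{QHT}/M_{SQF} = \frac{r' + \lceil \log_2(r - r' + 1) \rceil}{r' + \lceil \log_2(r+1) \rceil}$. The helper names are ours, and ceilings of base-2 logarithms are computed with integer bit lengths to avoid floating-point rounding.

```python
from fractions import Fraction


def ceil_log2(n):
    """ceil(log2(n)) for an integer n >= 1, without floating point."""
    return (n - 1).bit_length()


def s_sqf(r, r_prime):
    """Number of distinct SQF fingerprints: 2^r' * (r - r' + 1)."""
    return 2 ** r_prime * (r - r_prime + 1)


def memory_ratio(r, r_prime):
    """M_QHT / M_SQF for equal N, k and S (same FPR and FNR)."""
    qht_bits = ceil_log2(s_sqf(r, r_prime))  # s = ceil(log2(S_SQF))
    sqf_bits = r_prime + ceil_log2(r + 1)    # sigma = r' + ceil(log2(r + 1))
    return Fraction(qht_bits, sqf_bits)


if __name__ == "__main__":
    for r, r_prime in [(2, 1), (4, 2)]:
        print(r, r_prime, s_sqf(r, r_prime), memory_ratio(r, r_prime))
```

For $(r, r') = (2, 1)$ this evaluates to $2/3$; other parameter pairs give other ratios, so the saving depends on the SQF parameters chosen.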

5 17 Using the recommended settings in [8] (r = 2, r′ = 1) the 5.2 Queuing Buckets for a Better Sliding M ratio becomes QHT = 2 , which concludes the proof. Window MSQF 3 One caveat of the QHT (and QHTD) is the fact that at any insertion, any element of the row is equally likely to be 4.5 Parameter Tuning evicted: if this allows an easy FPR and FNR derivation, it As noted in Section 4.2, no matter the choice of the param- makes it functioning a bit counter intuitive. As a matter of eters we have FNR + FPR = 1: any particular parameters fact, one would expect a filter to first forget about the oldest choice will be a trade off between good FPR and good FNR elements before forgetting about the newest ones. Indeed, performance, at least asymptotically. However, when the this behaviour matches the need of a filter operating on a stream is small enough, one may choose parameters that sliding window, without taking into account oldest elements. will maximally delay saturation. The solution we provide for QHT is to order the buckets of a given cell in a FIFO queue, which means that instead We know that M = Nk log (S), so plugging this into the 2 of selecting a random bucket in the cell for insertion, one FPR formula gives will append the fingerprint to the end of the queue, and pop the first element (so that the size of the queue remains k 1 m FPRm = 1 1 constant). Combined with QHTD improvement, this yields S − − kN     Algorithm 2. k log (S) m = 1 1 2 S − − M Algorithm 2 Queued Quotient Hash Table with Duplicates’     (QQHTD) Stream

k 1: for each element e E do Which means that, for fixed M and (i.e. for a fixed ∈ S 2: result UNSEEN memory amount and a fixed asymptotic FPR), log2(S) must ← 3: Quotient of e: h(e); Fingerprint of e: s(e). be as small as possible in order to keep the FPR low for as 4: for each bucket b in the queue T [h(e)] do long as possible. i 5: if (entry in bi)= s(e) then For instance, assume that we want an asymptotic FPR 6: result DUPLICATE of 25%. The potential values for the couple (k,S) are (1, 4), ← 7: break (2, 8), (3, 16) and so on (leading respectively to s = 2, 3, 4). 8: Pop the first element of the queue T [h(e)] and ap- k = Because of the above relation, we know that setting pend s(e) at the end of same queue 1,s = 2 will yield the best saturation resistance for the FPR. 9: return result In order to test this heuristic, we compared several QHT k of approximately 65,536 bits, with the same ratio S , but a different value for S. We took streams of 100,000 ele- Note that classical queues (i.e. doubly chained lists) are ments (well under saturation value for this amount of mem- not suited for our use, as chains require extra storage bits for ory), from an alphabet of 220 elements. Each element of pointers. Thus, we create an array of k elements, in which the stream was uniformly randomly selected, leading to a we manually move every element at each “pop”. When k is stream with about 4.6% of duplicate elements. We averaged small (typically less than 5, which it usually is), the added the results on 10 runs. overhead is not significant. Finally, note that a QQHTD with one bucket par cell We observe in Table 1 that filters with a small value of is equivalent to its QHT counterpart. However, with more S do indeed perform better than filters with a bigger value buckets per cell, QQHTDs offer a noticeable improvement of S, which concludes the experiment. over QHTs on real data streams (see Appendix A).

Table 1: Error rates of QHTs with the same asymptotic FPR 6 Benchmarks S = 4 S = 8 S = 16 S = 32 S = 64 FPR (%) 22.57 23.25 23.53 23.62 23.50 6.1 Comparison of QHT to Other Filters FNR (%) 35.89 44.24 50.77 54.55 58.73 FPR+FNR 58.45 67.49 74.30 78.17 82.23 We used streams of 150,000,000 elements, on filters of size ranging from 10 kb to 8 Mb. In any case the filters are too small to keep track of the whole stream, and we will see 5 Further improvements: QQHTD that filters do reach saturation. We used 2 artificial streams, for which the elements where randomly generated from an 24 27 5.1 Keeping Track of Duplicates alphabet of 2 and 2 elements respectively, leading to a duplicate rate of about 88% and 38% respectively. We also In Algorithm 1, we do not insert anything if the element used a real dataset of URLs visited by a crawling robot, ex- is detected as a duplicate. However, following [9]’s example, tracted from the April 2018 CommonCrawl’s dump [17] The 18 we can insert it anyway, resulting in a structure we call QHT source code is available from on BitBucket. . with Duplicates, or QHTD. We briefly discuss its properties. The filters, so their asymptotic FPR was as close as pos- As we showed previously, the asymptotic FPR of a QHT sible to the arbitrary value of 25%, are: k is S , which was expected: each cell stores k distinct fin- SQF, 1 bucket per row, r = 2 and r′ = 1 gerprints, the probability that one of them matches the fin- • gerprint of a unique element is logically k . Similarly, in QHT, 1 bucket per row, 3 bits per fingerprint. This S • QHTD each cell stores k fingerprints, not necessarily dis- specific QHT is equivalent to a QQHTD with the same tinct. The probability that at least one of these finger- parameters, so we do not include the latter in the prints is the same than the one of an unseen element is benchmark. k = 1 1 1 FPRQHTD S . 
Given the results of Section 4.3, Cuckoo Filter[9], cells containing 1 element of 3 bits − − k • the asymptotic FNR of QHTD is 1 1 . each  − S 18https://bitbucket.org/team_qht/qht/src/master  19Our benchmarks actually obtained an asymptotic FPR of around 28%, without us being able to find bugs in our implementation.

6 Table 2: Error rate (multiplied by 100) on streams of 150,000,000 elements Stream (duplicate %) Memory (bits) SQF QHT Cuckoo SBF A2 b_DBF QQHTD 8e+06 51.25 43.78 66.48 54.08 58.94 45.22 1e+06 55.18 48.38 68.96 57.17 62.12 45.30 Real (10.3 %) 100,000 58.21 52.22 71.83 59.44 64.88 59.39 10,000 66.45 58.75 78.86 76.29 66.42 99.77 8e+06 86.49 82.76 96.94 97.79 88.06 99.96 1e+06 98.27 97.80 99.60 99.74 98.49 99.96 Artificial (88.82 %) 100,000 99.78 99.79 99.96 99.98 99.84 99.96 10,000 99.97 99.97 100.02 99.99 100.00 99.98 8e+06 96.51 95.37 99.21 99.40 96.86 99.99 1e+06 99.56 99.42 99.91 99.91 99.60 99.99 Artificial (39.79 %) 100,000 99.94 99.95 100.00 99.99 99.96 99.98 10,000 100.00 100.00 100.00 100.00 99.99 100.00

• Stable Bloom Filter (SBF) [7], 2 bits per cell, 2 hash functions, targeted FPR of 0.0219.
• A2 Filter [22], targeted FPR of 0.1 on the sliding window.
• Block-Decaying Bloom Filter (b_DBF) [19], sliding window of 6000 elements.

Results, averaged over 5 runs, are given in Table 2. Note that, for better readability, the error rates have been multiplied by 100 in the table. Furthermore, we recall that the error rate, defined as E = FPR + FNR, is bounded below by 0 and above by 2, with 1 being the error rate of a random filter. A filter can perform worse than random; for instance, a filter which is always wrong has an error rate of 2.

As we can see in Table 2, QHT (or QQHTD) is extremely competitive and resists saturation very well; it also appears to be the most competitive on the real stream. b_DBF is efficient, but reaches saturation very quickly. A more detailed analysis (see Appendix A) shows that even though its FPR is close to 0, its FNR gets close to 1. Moreover, as we see in Section 6.2, it is significantly slower, which can be a bottleneck for critical applications. Furthermore, not only are QHT/QQHTD the most efficient filters on both real and artificial streams, they are also very easily tunable, and any asymptotic FPR is very simply achievable. This is not the case for other filters, such as A2 or b_DBF, which require careful tuning.

6.2 Speed Comparison

We also benchmarked the speed of every filter on real-time detection, on a laptop with an Intel i7. We used the same filters as in the previous subsection, with a memory of 1 Mb, on a stream of 150,000,000 elements. For each filter, we averaged the time needed for Insert ∘ Detect to execute on each element. Results are shown in Table 3. We observe that, save for SQF, QHT is roughly 4 to 9 times as fast as the other filters (about 4.4 times as fast as A2 and 8.9 times as fast as b_DBF). Even SQF is about 50% slower than QHT. Even with its additional features, QQHTD remains faster than SQF (0.330 µs against 0.423 µs per element), because fingerprint derivation is more costly in the latter. In conclusion, we observe that QQHTD are the most suitable filters for high-speed analysis.

7 Adversarial Resistance

Despite the good performance of QQHTD on normal streams, one may not always assume that the stream is "normal": there may be an attacker trying to fool the filter. For instance, if the filter must detect duplicates in order to avoid an attack (nonce requirements), then the question of adversarial resistance is paramount.

We can model an adversarial system in which an attacker knows the output of Stream (i.e., whether the element is classified as DUPLICATE or UNSEEN), but has no knowledge of the internal memory state $\mathcal{M}$. The attacker is thus able to carry out an adaptive attack, by choosing the next element to send to the filter as a function of all previous insertions. In this adversarial game, the attacker can send an arbitrary stream to the filter, and is allowed to get the result of Stream for every element. Then, at her convenience, the attacker goes into the second phase of the game, in which she has two possible actions:

• Send an unseen element that will be a false positive with high probability (false positive attack);
• Send a duplicate element that will be a false negative with high probability (false negative attack).

Theorem 7.1. No filter can resist a false negative attack.

Proof. We craft false negatives in Ω(M) steps (M being the filter's memory size). Recall that no structure can remember more than one element per memory bit. For this reason, the structure can remember at most M different elements. Consequently, if the attacker generates a stream of random unseen elements, then on average each element will stay for M insertions in the filter's memory. More generally, after hM insertions (for some rational h ≥ 1), an element is forgotten with probability at least $P_{\mathrm{forgot}} = 1 - (1 - 1/M)^{hM} \simeq 1 - e^{-h}$. Thus an attacker simply generates Ω(M) unique elements before sending the first element again. M can even be estimated via saturation (see Section 4.3). Given that CPU time is cheaper than memory, the attacker keeps her advantage over any filter of any size.

Theorem 7.2. Assuming the existence of one-way functions, QHT can resist a false positive attack.

Proof. Following [18], we replace all hash functions by one-way hash functions, and apply a (secret) one-way permutation to incoming elements, then store the results in the filter as usual. Because of the permutation, the attacker gains nothing by choosing the elements adaptively, and thus loses her advantage.

In conclusion, QHTs are suited to contexts where a low false positive rate is crucial, such as white-list email filtering. Note however that this is the case for most filters, as long as they rely on hash functions (and so can apply [18]).
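The bound used in the proof of Theorem 7.1 can be evaluated numerically. The sketch below is only an illustration: the function names and the memory size M = 65,536 are assumptions chosen for the example, not values from the adversarial analysis. It compares the exact expression 1 − (1 − 1/M)^{hM} with its approximation 1 − e^{−h}, and estimates how many unique elements an attacker would send to reach a target forgetting probability under that approximation.

```python
import math


def p_forgot_exact(memory_bits: int, h: float) -> float:
    """Evaluate 1 - (1 - 1/M)^(hM), the bound from the proof of Theorem 7.1."""
    return 1.0 - (1.0 - 1.0 / memory_bits) ** (h * memory_bits)


def p_forgot_approx(h: float) -> float:
    """Evaluate the approximation 1 - e^(-h)."""
    return 1.0 - math.exp(-h)


def insertions_needed(memory_bits: int, target: float) -> int:
    """Number of unique insertions hM such that 1 - e^(-h) reaches `target`.

    Hypothetical helper: it simply inverts the approximation for h.
    """
    h = -math.log(1.0 - target)
    return math.ceil(h * memory_bits)


if __name__ == "__main__":
    M = 65_536  # illustrative memory size in bits (assumption)
    for h in (1, 2, 3, 5):
        print(f"h={h}: exact={p_forgot_exact(M, h):.6f} "
              f"approx={p_forgot_approx(h):.6f}")
    print("insertions for a 0.99 forgetting probability:",
          insertions_needed(M, 0.99))
```

The number of insertions grows linearly in M for any fixed target probability, which is the Ω(M) cost stated in the proof.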

Table 3: Average amount of time (in µs) required for one iteration of Stream on each filter with 1 Mb of memory

Filter      SQF     QHT     QQHTD   Cuckoo   SBF     A2      b_DBF
Time (µs)   0.423   0.288   0.330   2.464    1.578   1.280   2.565
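The relative speeds discussed in Section 6.2 can be recomputed directly from Table 3. The short sketch below copies the timings from the table and expresses each one as a multiple of the QHT timing; the variable names are illustrative.

```python
# Per-element timings in microseconds, copied from Table 3.
TIMES_US = {
    "SQF": 0.423,
    "QHT": 0.288,
    "QQHTD": 0.330,
    "Cuckoo": 2.464,
    "SBF": 1.578,
    "A2": 1.280,
    "b_DBF": 2.565,
}


def slowdown_vs(reference: str, times: dict) -> dict:
    """Return each filter's time divided by the reference filter's time."""
    ref = times[reference]
    return {name: t / ref for name, t in times.items()}


RATIOS = slowdown_vs("QHT", TIMES_US)

if __name__ == "__main__":
    for name, ratio in sorted(RATIOS.items(), key=lambda kv: kv[1]):
        print(f"{name:7s} {ratio:5.2f}x the QHT time")
```

A ratio of 1.47 for SQF corresponds to SQF being about 47% slower than QHT per element.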

8 Conclusion

This paper introduces a new duplicate detection filter, QHT, and its variant QQHTD. QHTs make better use of the available space, and as such are more efficient than existing filters. Moreover, QQHTDs are more efficient at detecting duplicates in a real dataset.

We showed that, for an infinite stream with an infinite number of unseen elements, the number of rows is less important than the fingerprint space and the number of buckets per row. Moreover, we proved that all filters, having reached saturation, are no more efficient than random filters, and as such, a benchmark of stream filters should only focus on the pre-saturation state, with small streams.

Even though QHTs are significantly more efficient than other structures in the literature, we do not know whether these filters are optimal: are there other filters with an optimal resilience to saturation? Future work also includes examining the theoretical resistance to saturation of QQHTDs, and a finer examination of the QHT/QQHTD behaviour on a sliding window.

A Tables of error rates for various streams

Table 4 gives the error rates (FPR and FNR) used to derive Table 2.

B Deriving FNR∞

This derivation was removed from Section 4.2 for better readability. We know that

\[
\mathrm{FNR}_n = \frac{1}{n}\sum_{m=1}^{n} \mathrm{FN}_m
= \frac{1}{n}\sum_{m=1}^{n}\sum_{i=1}^{m} P_{\mathrm{dup},i}\cdot \mathrm{FN}_{i,m},
\]

so

\[
\mathrm{FNR}_n = \frac{1}{n}\sum_{m=1}^{n}\sum_{i=1}^{m}
\frac{1}{U}\left(\frac{U-1}{U}\right)^{m-i}
\left[(1-b)\left(1-(1-a)^{m-i}\right) + ab\,\frac{(1-a)^{m-i}-c^{m-i}}{1-a-c}\right].
\]

Expanding,

\[
\mathrm{FNR}_n = \frac{1}{nU}\sum_{m=1}^{n}\Bigg[
(1-b)\sum_{i=1}^{m}\left(\frac{U-1}{U}\right)^{m-i}
-(1-b)\sum_{i=1}^{m}\left(\frac{U-1}{U}\right)^{m-i}(1-a)^{m-i}
+\frac{ab}{1-a-c}\sum_{i=1}^{m}\left(\frac{U-1}{U}\right)^{m-i}(1-a)^{m-i}
-\frac{ab}{1-a-c}\sum_{i=1}^{m}\left(\frac{U-1}{U}\right)^{m-i}c^{m-i}
\Bigg].
\]

Each inner sum is geometric, so

\[
\mathrm{FNR}_n = \frac{1}{nU}\sum_{m=1}^{n}\Bigg[
(1-b)\,\frac{1-(1-1/U)^{m}}{1/U}
-(1-b)\,\frac{1-\left((1-1/U)(1-a)\right)^{m}}{1-(1-a)(U-1)/U}
+\frac{ab}{1-a-c}\,\frac{1-\left((1-1/U)(1-a)\right)^{m}}{1-(1-a)(U-1)/U}
-\frac{ab}{1-a-c}\,\frac{1-\left((1-1/U)c\right)^{m}}{1-c(U-1)/U}
\Bigg].
\]

Given that $\mathrm{FNR}_\infty = \lim_{n\to\infty} \mathrm{FNR}_n$, using Cesàro's mean we get:

\[
\mathrm{FNR}_\infty = 1 - b - \frac{1-b}{U-(U-1)(1-a)}
+ \frac{ab}{(1-a-c)\left(U-(U-1)(1-a)\right)}
- \frac{ab}{(1-a-c)\left(U-(U-1)c\right)}.
\]

This concludes the proof.

C Comparison of QHT and QQHTD

In this appendix, we explore the difference between a QHT and a QQHTD with the same parameters, on the same stream. We took filters of 65,536 bits each, on streams of 100,000 elements each. One stream was drawn from our 'real' dataset (10.32% of duplicates), the other is a uniformly random stream on an alphabet of $2^{20}$ elements (4.62% of duplicates). Results are given in Table 5.

We observe that while QQHTDs offer no advantage on artificial streams, on the real stream their performance relative to the QHT is noticeably better, which empirically validates the optimizations.

References

[1] Babcock, B., Datar, M., and Motwani, R. Load shedding for aggregation queries over data streams. In Proceedings. 20th International Conference on Data Engineering (March 2004), IEEE Computer Society, pp. 350–361.
[2] Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422–426.
[3] Borg, M., Runeson, P., Johansson, J., and Mäntylä, M. V. A replicated study on duplicate detection: Using apache lucene to search among android defects. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (New York, NY, USA, 2014), ESEM '14, ACM, pp. 8:1–8:4.
[4] Chen, H., Liao, L., Jin, H., and Wu, J. The dynamic cuckoo filter. In 2017 IEEE 25th International Conference on Network Protocols (ICNP) (Oct 2017), pp. 1–10.

Table 4: Results of filters for various streams of 150,000,000 elements (FPR/FNR in %)

Stream (duplicate %)   Memory (bits)   SQF           QHT/QQHT      Cuckoo        SBF           A2            b_DBF
Real (10.3 %)          8e+06           24.61/26.64   14.00/29.78   28.23/38.26   25.10/28.98   37.72/21.21   0.00/45.22
                       1e+06           24.95/30.22   14.25/34.14   28.30/40.65   25.23/31.94   38.00/24.11   0.15/45.16
                       100,000         24.99/33.22   14.28/37.94   28.33/43.51   25.42/34.03   38.05/26.83   25.75/33.64
                       10,000          25.00/41.45   14.29/44.46   28.37/50.49   26.00/50.29   38.01/28.41   99.58/0.20
Artificial (88.82 %)   8e+06           21.87/64.62   12.02/70.74   27.80/69.14   25.94/71.85   35.22/52.84   0.00/99.96
                       1e+06           24.60/73.67   14.00/83.80   28.41/71.19   26.43/73.30   37.72/60.77   0.17/99.79
                       100,000         24.94/74.84   14.26/85.53   28.52/71.44   26.49/73.49   38.00/61.83   27.46/72.51
                       10,000          24.98/74.99   14.28/85.69   28.55/71.47   26.50/73.50   38.03/61.97   99.67/0.31
Artificial (39.79 %)   8e+06           24.42/72.09   13.86/81.52   28.39/70.82   26.40/73.00   37.54/59.32   0.00/99.99
                       1e+06           24.92/74.64   14.24/85.18   28.50/71.41   26.47/73.44   38.00/61.60   0.17/99.82
                       100,000         24.99/74.96   14.29/85.66   28.51/71.49   26.49/73.50   38.05/61.91   27.46/72.52
                       10,000          25.00/74.99   14.28/85.72   28.52/71.49   26.49/73.51   38.02/61.97   99.69/0.31
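Since the error rate is defined as E = FPR + FNR, each entry of Table 2 can be recovered by adding the two components reported in Table 4. The sketch below does this for one row (real stream, 8e+06 bits of memory), with values copied from the two tables. The dictionary names are illustrative, and the small tolerance accounts for rounding in the published figures.

```python
# (FPR, FNR) in percent, copied from Table 4: real stream, 8e+06 bits.
TABLE4_REAL_8E6 = {
    "SQF": (24.61, 26.64),
    "QHT": (14.00, 29.78),
    "Cuckoo": (28.23, 38.26),
    "SBF": (25.10, 28.98),
    "A2": (37.72, 21.21),
    "b_DBF": (0.00, 45.22),
}

# Error rate (times 100), copied from the matching row of Table 2.
TABLE2_REAL_8E6 = {
    "SQF": 51.25,
    "QHT": 43.78,
    "Cuckoo": 66.48,
    "SBF": 54.08,
    "A2": 58.94,
    "b_DBF": 45.22,
}


def error_rate(fpr: float, fnr: float) -> float:
    """E = FPR + FNR; with percentages as inputs, the result lies in [0, 200]."""
    return fpr + fnr


ERRORS = {name: error_rate(*pair) for name, pair in TABLE4_REAL_8E6.items()}

if __name__ == "__main__":
    for name, e in ERRORS.items():
        print(f"{name:7s} recomputed={e:6.2f}  Table 2={TABLE2_REAL_8E6[name]:6.2f}")
```

On this row the recomputed values agree with Table 2 up to 0.01, which is consistent with independent rounding of FPR and FNR.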

Table 5: Error rate (times 100) of QHTs and QQHTDs with the same parameters on different streams, depending on their parameters k and S

Stream (duplicate %)   Filter   k = 2, S = 8   k = 4, S = 16   k = 8, S = 32   k = 16, S = 64
Real (10.3 %)          QHT      27.01          28.07           28.95           29.3
                       QQHTD    24.67          24.56           24.42           24.62
Artificial (4.62 %)    QHT      67.39          74.48           79.32           82.10
                       QQHTD    67.91          74.41           79.19           82.26
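As a sanity check on the derivation of FNR∞ in Appendix B, the finite average FNR_n can be compared numerically with the closed form obtained through Cesàro's mean. The sketch below is only illustrative: the parameters a, b and c are defined in Section 4.2, and the values used here (as well as U = 100) are arbitrary assumptions chosen so that 1 − a − c ≠ 0. It evaluates the double sum directly for a small n, evaluates the geometric form for a large n, and compares the latter with the limit.

```python
def fnr_n_direct(n, U, a, b, c):
    """FNR_n as the explicit double sum over m and i (O(n^2), small n only)."""
    r = (U - 1) / U
    total = 0.0
    for m in range(1, n + 1):
        for i in range(1, m + 1):
            k = m - i
            bracket = ((1 - b) * (1 - (1 - a) ** k)
                       + a * b * ((1 - a) ** k - c ** k) / (1 - a - c))
            total += (r ** k) / U * bracket
    return total / n


def fnr_n_geometric(n, U, a, b, c):
    """FNR_n after replacing each inner sum by its geometric closed form."""
    r = (U - 1) / U
    total = 0.0
    for m in range(1, n + 1):
        s1 = (1 - r ** m) / (1 - r)
        s2 = (1 - (r * (1 - a)) ** m) / (1 - r * (1 - a))
        s3 = (1 - (r * c) ** m) / (1 - r * c)
        total += (1 - b) * (s1 - s2) + a * b / (1 - a - c) * (s2 - s3)
    return total / (n * U)


def fnr_inf(U, a, b, c):
    """Closed form of the limit of FNR_n."""
    d1 = U - (U - 1) * (1 - a)
    d2 = U - (U - 1) * c
    return ((1 - b) - (1 - b) / d1
            + a * b / ((1 - a - c) * d1)
            - a * b / ((1 - a - c) * d2))


if __name__ == "__main__":
    U, a, b, c = 100, 0.3, 0.2, 0.5  # arbitrary illustrative values
    print("direct, n=40:      ", fnr_n_direct(40, U, a, b, c))
    print("geometric, n=40:   ", fnr_n_geometric(40, U, a, b, c))
    print("geometric, n=2e5:  ", fnr_n_geometric(200_000, U, a, b, c))
    print("closed-form limit: ", fnr_inf(U, a, b, c))
```

The gap between FNR_n and the limit shrinks roughly like U/n, so n has to be large compared with U before the two values agree closely.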

[5] Cohen, S., and Matias, Y. Spectral bloom filters. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2003), SIGMOD '03, ACM, pp. 241–252.
[6] Cormode, G., and Muthukrishnan, S. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55, 1 (2005), 58–75.
[7] Deng, F., and Rafiei, D. Approximately detecting duplicates for streaming data using stable Bloom filters. In SIGMOD Conference (2006), ACM, pp. 25–36.
[8] Dutta, S., Narang, A., and Bera, S. K. Streaming quotient filter: A near optimal approximate duplicate detection approach for data streams. Proc. VLDB Endow. 6, 8 (June 2013), 589–600.
[9] Fan, B., Andersen, D. G., Kaminsky, M., and Mitzenmacher, M. D. Cuckoo filter: Practically better than bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies (New York, NY, USA, 2014), CoNEXT '14, ACM, pp. 75–88.
[10] Fan, L., Cao, P., Almeida, J., and Broder, A. Z. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw. 8, 3 (June 2000), 281–293.
[11] Fu, M., Feng, D., Hua, Y., He, X., Chen, Z., Xia, W., Zhang, Y., and Tan, Y. Design tradeoffs for data deduplication performance in backup workloads. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2015), FAST'15, USENIX Association, pp. 331–344.
[12] Guo, D., Wu, J., Chen, H., Yuan, Y., and Luo, X. The dynamic bloom filters. IEEE Transactions on Knowledge and Data Engineering 22, 1 (Jan 2010), 120–133.
[13] Jowhari, H., Saglam, M., and Tardos, G. Tight bounds for lp samplers, finding duplicates in streams, and related problems. CoRR abs/1012.4889 (2010), 49–58.
[14] Kapralov, M., Nelson, J., Pachocki, J., Wang, Z., Woodruff, D. P., and Yahyazadeh, M. Optimal lower bounds for universal relation, and for samplers and finding duplicates in streams. In FOCS (2017), IEEE Computer Society, pp. 475–486.
[15] Køien, G. A brief survey of nonces and nonce usage. In Securware 2015 - The Ninth International Conference on Emerging Security Information, Systems and Technologies (2015), SECURWARE '15, IARIA XPS Press, pp. 85–91.
[16] Metwally, A., Agrawal, D., and El Abbadi, A. Duplicate detection in click streams. In Proceedings of the 14th International Conference on World Wide Web (New York, NY, USA, 2005), WWW '05, ACM, pp. 12–21.
[17] Nagel, S. April 2018 crawl archive now available, 2018. http://commoncrawl.org/2018/05/april-2018-crawl-archive-now-available/.
[18] Naor, M., and Yogev, E. Bloom filters in adversarial environments. In Advances in Cryptology – CRYPTO 2015 (Berlin, Heidelberg, 2015), R. Gennaro and M. Robshaw, Eds., Springer Berlin Heidelberg, pp. 565–584.
[19] Shen, H., and Zhang, Y. Improved approximate detection of duplicates for data streams over sliding windows. Journal of Computer Science and Technology 23, 6 (2008), 973–987.
[20] Tarkoma, S., Rothenberg, C. E., and Lagerspetz, E. Theory and practice of bloom filters for distributed systems. IEEE Communications Surveys Tutorials 14, 1 (First 2012), 131–155.
[21] Wolper, P., and Leroy, D. Reliable hashing without collision detection. In Computer Aided Verification (Berlin, Heidelberg, 1993), C. Courcoubetis, Ed., Springer Berlin Heidelberg, pp. 59–70.
[22] Yoon, M. Aging bloom filter with two active buffers for dynamic sets. IEEE Trans. on Knowl. and Data Eng. 22, 1 (Jan. 2010), 134–138.
