Quotient Hash Tables-Efficiently Detecting Duplicates in Streaming

Total Page:16

File Type:pdf, Size:1020Kb

Quotient Hash Tables-Efficiently Detecting Duplicates in Streaming Quotient Hash Tables - Efficiently Detecting Duplicates in Streaming Data Rémi Gérauda,c, Marius Lombard-Platet∗a,b, and David Naccachea,c aDépartement d’informatique de l’ENS, École normale supérieure, CNRS, PSL Research University, Paris, France bBe-Studys, Geneva, Switzerland cIngenico Labs Advanced Research, Paris, France Abstract This article presents the Quotient Hash Table (QHT) a new data structure for duplicate detection in unbounded streams. QHTs stem from a corrected analysis of streaming quotient filters (SQFs), resulting in a 33% reduction in memory usage for equal performance. We provide a new and thorough analysis of both algorithms, with results of interest to other existing constructions. We also introduce an optimised version of our new data structure dubbed Queued QHT with Duplicates (QQHTD). Finally we discuss the effect of adversarial inputs for hash-based duplicate filters similar to QHT. 1 Introduction and Motivation bits. We consider in this paper the following problem: given a pos- Proof. A perfect duplicate filter must be able to store all sibly infinite stream of symbols, detect whether a given sym- streams Ei = (e1,...,en), i.e. must be able to store any bol appeared somewhere in the stream. It turns out that in- subset of Γ, which we denote by (Γ). Given that there are P stances of this duplicate detection problem arise naturally in (Γ) = 2 Γ = 2U of them, according to information theory |P | | | many applications: backup systems [11] or Web caches [10], any such filter requires at least log ( (Γ) ) = log (2U )= U 2 |P | 2 search engine databases or click counting in web advertise- bits of storage. ment [16] , retrieval algorithms [3], data stream management systems [1], or even cryptographic contexts where nonce- reuse is problematic1 [15]. Because of this result, perfect duplicate detection is often It is generally not possible to store the whole stream in out of reach when U is big — however probabilistic solutions memory, therefore practical solutions to this problem must are often sufficient for many applications. Such algorithms somehow trade off memory for accuracy. The special case of make errors: false positives (claiming a duplicate where there a single duplicate has a known optimal solution [13, 14], but isn’t) and false negative (missing a duplicate). to the best of the authors knowledge no such results exists for detecting all duplicates in an unbounded stream. 2.2 Filters Several algorithms were proposed to address this specific question, which we discuss below. Of particular interest to Definition 2.1 (Filter). A filter F over the memory is M our investigation are Streaming Quotient Filter (SQF), as a tuple F = ( , Detect, Insert), where: S described by Dutta et al. in [8]. We point out several cru- is the current state cial mistakes in the original analysis of SQF, and in doing • S∈M so highlight that a more efficient data structure can be con- Detect : Γ DUPLICATE, UNSEEN structed along the same lines. We flesh out such a data • ×M→{ } structure, which we call the Quotient Hash Table (QHT), Insert : Γ and provide a thorough analysis of both SQF and QHT. This • ×M→M analysis is our main contribution. QHT itself is not optimal Here DUPLICATE corresponds to a guess that the pro- and can be further improved: we describe such an improve- vided element is duplicate, and UNSEEN that it is unseen. ment, dubbed Queued QHT with Duplicates (QQHT), and Insert corresponds to an update of the filter’s memory state benchmark our new algorithms against popular alternatives. after observing a new element. Here models the amount M of memory (states) available to the algorithm. In practice, Detect and Insert are often merged into a 2 Preliminaries and related work Stream Insert Detect arXiv:1901.04358v1 [cs.DS] 14 Jan 2019 single algorithm . ← ◦ 2.1 Duplicate Detection Definition 2.2 (False positive (resp. negative)). If, for an unseen (resp. duplicate) element e, Detect(e) outputs DU- E = (e ,...,e ,... ) Let 1 n be a (possibly infinite) sequence of PLICATE (resp. UNSEEN), e is called a false positive (resp. elements ei Γ, and write U = Γ . An element ei from E ∈ | | negative). is a duplicate if ej E,j < i such as ej = ei. Otherwise, 2 ∃ ∈ ei is unseen. Definition 2.3 (FPR, FNR). The false positive rate (FPR) The duplicate detection problem is the question of find- of a stream E is the frequency of false positive. The false ing all duplicates in a stream E.3 We recall the following negative rate is similarly defined as the frequency of false well-known result: negatives. Theorem 2.1. Assume that each ei is sampled uniformly at We are interested in filters that whose FNR and FPR can random from Γ. Then perfect detection requires U memory be kept low when is bounded. M ∗ Part of this work was done while the author was visiting Ingenico Labs 1A nonce is a one-time use random number. 2 Note that by definition e1 is always unseen. 3Note that this problem is equivalent to a dynamic formulation of the approximate set membership problem. 1 2.3 Hash-based filters 1. Cells in the filter’s state are not independent: in par- ticular, in every row, non-empty cells hold different Filters rely on hashing to efficiently answer the duplicate de- values (by design); tection problem. These filters are often a variation over the well-known Bloom filters [2]. A Bloom filter uses an array T of M bits, initially all set to 0. k hash functions hi are 2. Terms in geometric sums of order > 1 cannot be ne- also needed. Insertion of an element e is made by setting all glected. T [hi(e)] to 1. Detection of an element f is the AND of all cells T [hi(f)]: if the AND value is 0, then emit UNSEEN, Taking these effects into account (see Sections 4.1 and 4.2), DUPLICATE 7 2r otherwise emit . and using the same approximation than [8] used , r r ≈ It is easy to see that this approach has a FNR of 0. How- 4 √πr , we get the following asymptotic formulae for the FNR ever, as the stream grows, the FPR gets worse, and in the and FPR limit of an infinite stream the FPR is 1. Many variants have been proposed in the literature to compensate for this effect k k [20], mostly by allowing deletion [5, 6], but doing so increases FPR r′ FNR 1 r′ the FNR. ∞ ≈ 2 √πr ∞ ≈ − 2 √πr Alternatively, other filters prefer to store fingerprints: this is the rationale behind several recent constructions [8, 9]. that disagree with [8] — most importantly, the error rates do For the data structures most interesting to our question, not decrease to 0, contrarily to what was claimed. Interest- we refer the reader to the algorithms described in [7, 9, 19, ingly, the suggested parameters (r = 2, k = 4, r′ = r/2 = 1) 8, 22].4 indeed achieve an FNR of 0 — but also an FPR of 1.8 Other data structures, such as [4, 12] are not considered Further more, there is redundancy between the Hamming ′ in this paper, as they require an unbounded amount of mem- r r weight we of he and the reduced remainder he . For in- ′ ory as the stream grows. r stance, if he contains at least one bit set to 1, we know that we = 0. Intuitively this means SQF is wasteful, and 6 2.4 Streaming Quotient Filter we could expect to avoid collisions in the filter’s state by using a better adjusted encoding: this intuition happens to One construction which we must describe at length is the be correct as shown in Section 4.4. Streaming Quotient Filter (SQFs) [8]. Given an element e, Finally, since the state table contains 2q rows, with k cell 5 a certain fingerprint s(e) is stored in an array, at row h(e) each, the total memory required by an SQF is M = 2qkσ. and a certain column amongst k. This array constitutes the filter’s state. The filter’s construction uses integers q, k, r r′ < r and 3.1 Full-size Hashing, Memory Adjust- a hash function h : 0, 1 0, 1 q+r. The filter’s state { }∗ → { } ments is an array of 2q rows and k columns, each holding a σ-bit element (with σ = r + log (r + 1) ), or the special empty ′ ⌈ 2 ⌉ Our first observation is that SQF’s fingerprint scheme (a symbol . The filter’s state is initially in every cell. ⊥ ⊥ hash and a Hamming weight) can be fruitfully replaced by a Then, [8] describes Stream(e) as follows: single hash function of the same size. Not only does this sim- plify the theoretical analysis, it also provides a much more Compute h(e). The q most significant bits of h(e) de- • q r efficient use of the available space. We also use more flexible fine he, and the r least significant bits define he. hash functions, that give much more flexibility in adjusting ′ the total memory M of the filter. Combining these two ef- r r Let we be the Hamming weight of he, and let he fects, we obtain Algorithm 1 which we call Quotient Hash • 6 r be r′ bits deterministically chosen from he. Let Table (QHT). ′ r s(e) h we where denotes concatenation. ← e k k q Algorithm 1 QHT Setup and Stream If s(e) is already stored in the row numbered he, emit σ • 1: function Setup(M,σ,k) ⊲M >σk and 0 <k 2 DUPLICATE. ≤ 2: Let N M/(k σ) .
Recommended publications
  • A General-Purpose High Accuracy and Space Efficient
    countBF: A General-purpose High Accuracy and Space Efficient Counting Bloom Filter 1st Sabuzima Nayak 2nd Ripon Patgiri, Senior Member, IEEE Dept. Computer Science & Engineering Dept. Computer Science & Engineering National Institute of Technology Silchar National Institute of Technology Silchar Cachar-788010, Assam, India Cachar-788010, Assam, India sabuzima [email protected] [email protected] Abstract—Bloom Filter is a probabilistic data structure for Delete operation is crucial for Bloom Filter. Suppose a the membership query, and it has been intensely experimented database management system integrates Bloom Filter to avoid in various fields to reduce memory consumption and enhance unnecessary disk accesses and enhance its performance sig- a system’s performance. Bloom Filter is classified into two key categories: counting Bloom Filter (CBF), and non-counting nificantly with a tiny amount of memory [18]. Thus, the Bloom Filter. CBF has a higher false positive probability than database management system requires to insert, query, and standard Bloom Filter (SBF), i.e., CBF uses a higher memory delete operation for which conventional Bloom Filter is not footprint than SBF. But CBF can address the issue of the false suitable. Therefore, counting Bloom Filter is used in such negative probability. Notably, SBF is also false negative free, but kind of requirements. Another example is a network router. it cannot support delete operations like CBF. To address these issues, we present a novel counting Bloom Filter based on SBF Older packets are removed from the router databases. There- and 2D Bloom Filter, called countBF. countBF uses a modified fore, the router requires counting Bloom Filter.
    [Show full text]
  • Efficient Data Structures for High Speed Packet Processing
    Efficient data structures for high speed packet processing Paolo Giaccone Notes for the class on \Computer aided simulations and performance evaluation " Politecnico di Torino November 2020 Outline 1 Applications 2 Theoretical background 3 Tables Direct access arrays Hash tables Multiple-choice hash tables Cuckoo hash 4 Set Membership Problem definition Application Fingerprinting Bit String Hashing Bloom filters Cuckoo filters 5 Longest prefix matching Patricia trie Giaccone (Politecnico di Torino) Hash, Cuckoo, Bloom and Patricia Nov. 2020 2 / 93 Applications Section 1 Applications Giaccone (Politecnico di Torino) Hash, Cuckoo, Bloom and Patricia Nov. 2020 3 / 93 Applications Big Data and probabilistic data structures 3 V's of Big Data Volume (amount of data) Velocity (speed at which data is arriving and is processed) Variety (types of data) Main efficiency metrics for data structures space time to write, to update, to read, to delete Probabilistic data structures based on different hashing techniques approximated answers, but reliable estimation of the error typically, low memory, constant query time, high scaling Giaccone (Politecnico di Torino) Hash, Cuckoo, Bloom and Patricia Nov. 2020 4 / 93 Applications Probabilistic data structures Membership answer approximate membership queries e.g., Bloom filter, counting Bloom filter, quotient filter, Cuckoo filter Cardinality estimate the number of unique elements in a dataset. e.g., linear counting, probabilistic counting, LogLog and HyperLogLog Frequency in streaming applications, find the frequency of some element, filter the most frequent elements in the stream, detect the trending elements, etc. e.g., majority algorithm, frequent algorithm, count sketch, count{min sketch Giaccone (Politecnico di Torino) Hash, Cuckoo, Bloom and Patricia Nov.
    [Show full text]
  • When the Levee Breaks: a Practical Guide to Sketching Algorithms for Processing the Flood of Genomic Data Will P
    Rowe Genome Biology (2019) 20:199 https://doi.org/10.1186/s13059-019-1809-x REVIEW Open Access When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data Will P. M. Rowe1,2 Abstract Considerable advances in genomics over the past decade have resulted in vast amounts of data being generated and deposited in global archives. The growth of these archives exceeds our ability to process their content, leading to significant analysis bottlenecks. Sketching algorithms produce small, approximate summaries of data and have shown great utility in tackling this flood of genomic data, while using minimal compute resources. This article reviews the current state of the field, focusing on how the algorithms work and how genomicists can utilize them effectively. References to interactive workbooks for explaining concepts and demonstrating workflows are included at https://github.com/will-rowe/genome-sketching. Introduction made, such as the quantity, quality and distribution of data. To gain biological insight from genomic data, a genomicist In an ideal world, most genomicists would elect to include must design experiments, run bioinformatic software and all available data but this is growing harder to achieve as evaluate the results. This process is repeated, refined or theamountofdatadrivesupruntimesandcosts. augmented as new insight is gained. Typically, given the In response to being unable to analyze all the things,gen- size of genomic data, this process is performed using high omicists are turning to analytics solutions from the wider performance computing (HPC) in the form of local com- data science field in order to process data quickly and effi- pute clusters, high-memory servers or cloud services.
    [Show full text]
  • A Cuckoo Filter Modification Inspired by Bloom Filter
    AUT Journal of Electrical Engineering AUT J. Elec. Eng., 51(2) (2019) 187-200 DOI: 10.22060/eej.2019.16370.5287 A Cuckoo Filter Modification Inspired by Bloom Filter Ha. Sasaniyan Asl1,*, B. Mozaffari Tazehkand2, M.J. Museviniya2 1 MSc,Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran 2 PhD,Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran ABSTRACT: Probabilistic data structures are so popular in membership queries, network applications, Review History: and databases and so on. Bloom Filter and Cuckoo Filter are two popular space efficient models Received: 2019-05-20 that incorporate in set membership checking part of many important protocols. They are compact Revised: 2019-09-17 representations of data that use hash functions to randomize a set of items. Being able to store more Accepted: 2019-09-17 elements while keeping a reasonable false positive probability is a key factor of design. A new algorithm Available Online: 2019-12-01 is proposed to improve some of the performance properties of Cuckoo Filter such as false positive rate and insertion performance and solve some drawbacks of the Cuckoo algorithm such as endless loop. Keywords: Main characteristic of the Bloom Filter is used to improve Cuckoo Filter, so we have a smart Cuckoo Bloom Filter Filter which is modified by Bloom Filter so called as SCFMBF. SCFMBF uses the same table of buckets Cuckoo Filter as Cuckoo Filter but instead of storing constant Fingerprints, it stores Bloom Filters. Bloom Filters can be accumulated in the table’s buckets which leads to higher insertion feasibility.
    [Show full text]
  • Fast and Space-Efficient Maps for Large Data Sets
    Fast and Space-Efficient Maps for Large Data Sets A Dissertation presented by Prashant Pandey to The Graduate School in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Science Stony Brook University December 2018 Stony Brook University The Graduate School Prashant Pandey We, the dissertation committee for the above candidate for the Doctor of Philosophy degree, hereby recommend acceptance of this dissertation Michael A. Bender, Thesis Advisor Professor, Computer Science Rob Johnson, Thesis Advisor Research Assistant Professor, Computer Science Michael Ferdman, Thesis Committee Chair Associate Professor, Computer Science Rob Patro Assistant Professor, Computer Science Guy Blelloch Professor, Computer Science Carnegie Mellon University This dissertation is accepted by the Graduate School Dean of the Graduate School ii Abstract of the Dissertation Fast and Space-Efficient Maps for Large Data Sets by Prashant Pandey Doctor of Philosophy in Computer Science Stony Brook University 2018 Approximate Membership Query (AMQ) data structures, such as the Bloom filter, have found numerous applications in databases, storage systems, networks, computa- tional biology, and other domains. However, many applications must work around limitations in the capabilities or performance of current AMQs, making these appli- cations more complex and less performant. For example, many current AMQs cannot delete or count the number of occurrences of each input item, nor can they store values corresponding to items. They take up large amounts of space, are slow, cannot be resized or merged, or have poor locality of reference and hence perform poorly when stored on SSD or disk. In this dissertation, we show how to use recent advances in the theory of compact and succinct data structures to build a general-purpose AMQ.
    [Show full text]
  • Abstract Algorithms and Data Structures For
    ABSTRACT Title of Dissertation: ALGORITHMS AND DATA STRUCTURES FOR INDEXING, QUERYING, AND ANALYZING LARGE COLLECTIONS OF SEQUENCING DATA IN THE PRESENCE OR ABSENCE OF A REFERENCE Fatemeh Almodaresi Doctor of Philosophy, Dissertation Directed by: Professor Rob Patro Department of Computer Science High-throughput sequencing has helped to transform our study of biological organisms and processes. For example, RNA-seq is one popular sequencing assay that allows measuring dynamic transcriptomes and enables the discovery (via assem- bly) of novel transcripts. Likewise, metagenomic sequencing lets us probe natural environments to profile organismal diversity and to discover new strains and species that may be integral to the environment or process being studied. The vast amount of available sequencing data, and its growth rate over the past decade also brings with it some immense computational challenges. One of these is how do we design memory-efficient structures for indexing and querying this data. This challenge is not limited to only raw sequencing data (i.e. reads) but also to the growing collection of reference sequences (genomes, and genes) that are assembled from this raw data. We have developed new data structures (both reference-based and reference-free) to index raw sequencing data and assembled reference sequences. Specifically, we describe three separate indices, “Pufferfish”, an index over a set of genomes or tran- scriptomes, and “Rainbowfish” and “Mantis” which are both indices for indexing a set of raw sequencing data sets. All of these indices are designed with consideration of support for high query performance and memory efficient construction and query. The Pufferfish data structure is based on constructing a compacted, colored, reference de Bruijn graph (ccdbg), and then indexing this structure in an efficient manner.
    [Show full text]
  • {PDF} Probabilistic Data Structures and Algorithms for Big Data
    PROBABILISTIC DATA STRUCTURES AND ALGORITHMS FOR BIG DATA APPLICATIONS PDF, EPUB, EBOOK Andrii Gakhov | 220 pages | 11 Feb 2019 | BOOKS ON DEMAND | 9783748190486 | English | none Probabilistic Data Structures and Algorithms for Big Data Applications PDF Book Mousavi A. About this product Product Information A technical book about popular space-efficient data structures and fast algorithms that are extremely useful in modern Big Data applications. A simple proof of the restricted isometry property for random matrices. Global Ocean Sampling This service is more advanced with JavaScript available. Edition Location Original text Corrected text 1st, p. Rank We study data structures to estimate quantiles and percentiles in a data stream using only one pass through the data. On the other hand, calculating the exact cardinality requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Sectional minhash for near-duplicate detection. As the size of the reference databases terabytes to petabytes and the number of reads 10s of millions to billions in metagenomic samples increase, it becomes computationally intractable to perform exhaustive comparison of all k -mers in the reads against all k -mers within the reference databases, opening the door for efficient new tools. Emerging Research Trends and New Horizons. Finally, the containment index is converted to a score that approximates sequence similarity. Big Data Analytics. In addition to extensions and variations of fundamental methods, recent advances have developed by combining several core data structures and techniques. About this book The purpose of this book is to introduce probabilistic data structures and algorithms to technology practitioners, including software architects and developers, as well as technology decision makers.
    [Show full text]
  • Concurrent Expandable Amqs on the Basis of Quotient Filters
    Concurrent Expandable AMQs on the Basis of Quotient Filters Tobias Maier Karlsruhe Institute of Technology, Germany [email protected] Peter Sanders Karlsruhe Institute of Technology, Germany [email protected] Robert Williger Karlsruhe Institute of Technology, Germany Abstract A quotient filter is a cache efficient Approximate Membership Query (AMQ) data structure. De- pending on the fill degree of the filter most insertions and queries only need to access one or two consecutive cache lines. This makes quotient filters very fast compared to the more commonly used Bloom filters that incur multiple independent memory accesses depending on the false positive rate. However, concurrent Bloom filters are easy to implement and can be implemented lock-free while concurrent quotient filters are not as simple. Usually concurrent quotient filters work by using an external array of locks – each protecting a region of the table. Accessing this array incurs one additional memory access per operation. We propose a new locking scheme that has no memory overhead. Using this new locking scheme we achieve 1.6× times higher insertion performance and over 2.1× higher query performance than with the common external locking scheme. Another advantage of quotient filters over Bloom filters is that a quotient filter can change its capacity when it is becoming full. We implement this growing technique for our concurrent quotient filters and adapt it in a way that allows unbounded growing while keeping a bounded false positive rate. We call the resulting data structure a fully expandable quotient filter. Its design is similar to scalable Bloom filters, but we exploit some concepts inherent to quotient filters to improve the space efficiency and the query speed.
    [Show full text]
  • Arxiv:1208.0290V1 [Cs.DB] 1 Aug 2012 1
    Don’t Thrash: How to Cache Your Hash on Flash Michael A. Bender∗§ Martin Farach-Colton†§ Rob Johnson∗ Russell Kraner¶ Bradley C. Kuszmaul‡§ Dzejla Medjedovic∗ Pablo Montes∗ Pradeep Shetty∗ Richard P. Spillane∗ Erez Zadok∗ ∗{bender,rob,dmededov,pmontes,pshetty,spillane,ezk}@cs.stonybrook.edu †[email protected][email protected][email protected] ABSTRACT queries for elements that do not exist in the database, in This paper presents new alternatives to the well-known external storage, or on a remote network host. Bloom filter data structure. The Bloom filter, a compact The Bloom filter is the classic example of an approximate data structure supporting set insertion and membership membership query data structure (AMQ). A Bloom filter queries, has found wide application in databases, storage supports insert and lookup operations on a set of keys. For systems, and networks. Because the Bloom filter performs a key in the set, lookup returns “present.” For a key not frequent random reads and writes, it is used almost exclu- in the set, lookup returns “absent” with probability at least 1 ε, where ε is a tunable false-positive rate. There is sively in RAM, limiting the size of the sets it can represent. − This paper first describes the quotient filter, which sup- a tradeoff between ε and the space consumption. Other ports the basic operations of the Bloom filter, achieving AMQs, such as counting Bloom filters, additionally support roughly comparable performance in terms of space and time, deletes [3,11]. For a comprehensive review of Bloom filters, but with better data locality. Operations on the quotient fil- see Broder and Mitzenmacher [4].
    [Show full text]
  • Concurrent Dynamic Quotient Filters: Packing Fingerprints Into Atomics
    Master Thesis Concurrent Dynamic Quotient Filters: Packing Fingerprints into Atomics Robert Williger Date: 29. March 2019 Supervisors: Prof. Dr. Peter Sanders Tobias Maier Institute of Theoretical Informatics, Algorithmics Department of Informatics Karlsruhe Institute of Technology Hiermit versichere ich, dass ich diese Arbeit selbständig verfasst und keine anderen, als die angegebenen Quellen und Hilfsmittel benutzt, die wörtlich oder inhaltlich übernommenen Stellen als solche kenntlich gemacht und die Satzung des Karlsruher Instituts für Technolo- gie zur Sicherung guter wissenschaftlicher Praxis in der jeweils gültigen Fassung beachtet habe. ———————————— Karlsruhe, den 29. März 2019 Abstract Quotient filters are approximate membership data structures that can be used for a variety of applications, from speeding up database accesses to uses in computational biology. We look at several improvements to common quotient filters which are orthogonal to each other and can be combined as needed. We show how to use compact arbitrary length data types to reduce the memory usage by up to seven times and get very close to the theoretical optimum. We also present two different approaches to building concurrent quotient filters that use localized locking or no locking at all. This leads to a good speedup with increasing thread counts and an increase in performance of up to four times compared to traditional locking techniques. Using the compact packing technique with concurrent quotient filters also has the benefit of reducing atomic operations and enabling lock elision optimizations. Additionally, we describe how to use a multilevel quotient filter structure to implement dynamic growing while still allowing a user defined maximum false positive rate and being less than 10% slower compared to non-growing variants in scenarios where growing is not necessary.
    [Show full text]
  • Vector Quotient Filters: Overcoming the Time/Space Trade-Off in Filter Design
    Vector Quotient Filters: Overcoming the Time/Space Trade-Off in Filter Design Prashant Pandey Alex Conway Joe Durie [email protected] [email protected] [email protected] Lawrence Berkeley National Lab VMware Research Rutgers University and University of California Berkeley Michael A. Bender Martin Farach-Colton Rob Johnson [email protected] [email protected] [email protected] Stony Brook University Rutgers University VMware Research ABSTRACT International Conference on Management of Data (SIGMOD ’21), June Today’s filters, such as quotient, cuckoo, and Morton, havea 20–25, 2021, Virtual Event, China. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3448016.3452841 trade-off between space and speed; even when moderately full (e.g., 50%-75% full), their performance degrades nontrivially. The result is that today’s systems designers are forced to choose between speed 1 INTRODUCTION and space usage. Filters, such as Bloom [9], quotient [43], and cuckoo filters [31], In this paper, we present the vector quotient filter (VQF). Locally, maintain compact representations of sets. They tolerate a small the VQF is based on Robin Hood hashing, like the quotient filter, false-positive rate ε: a membership query to a filter for set S returns but uses power-of-two-choices hashing to reduce the variance of present for any x 2S, and returns absent with probability at least runs, and thus offers consistent, high throughput across load factors. 1−ε for any x <S. A filter for a set of size n uses space that depends Power-of-two-choices hashing also makes it more amenable to on ε and n but is much smaller than explicitly storing all items of S.
    [Show full text]
  • A Survey of Sketches in Traffic Measurement: Design, Optimization
    A survey of sketches in traffic measurement: Design, Optimization, Application and Implementation Shangsen Li, Lailong Luo, Deke Guo, Qianzhen Zhang, Pengtao Fu Science and Technology on Information System Engineering Laboratory, National University of Defense Technology, Changsha Hunan 410073, P.R. China. Abstract—Network measurement probes the underlying net- cardinality), the number of packets for a flow (flow size), the work to support upper-level decisions such as network manage- traffic volume contributed by a specific flow (flow volume) ment, network update, network maintenance, network defense and the time information. These statistics together profile and beyond. Due to the massive, speedy, unpredictable features of network flows, sketches are widely implemented in measurement the network and provide essential information for network nodes to approximately record the frequency or estimate the management and protection/defense. cardinality of flows. At their cores, sketches usually maintain one or multiple counter array(s), and rely on hash functions to select TABLE I: Comparison with existing surveys the counter(s) for each flow. Then the space-efficient sketches from the distributed measurement nodes are aggregated to First author # Variants #Measurement tasks Year provide statistics of the undergoing flows. Currently, tremendous Cormode [9] ≤ 15 3 2010 redesigns and optimizations have been proposed to improve the Cormode [10] ≤ 20 3 2011 sketches for better network measurement performance. However, Zhou [11] ≤ 10 4 2014 existing reviews or surveys mainly focus on one particular Yassine [12] — — 2015 aspect of measurement tasks. Researchers and engineers in the Gibbons [13] ≤ 10 1 2016 network measurement community desire an all-in-one survey that Yan [14] — 1 2016 This survey ≥ 90 13 2021 covers the entire processing pipeline of sketch-based network measurement.
    [Show full text]