A General-Purpose Counting Filter: Making Every Bit Count

Prashant Pandey, Michael A. Bender, Rob Johnson, and Rob Patro
Stony Brook University, Stony Brook, NY, USA
{ppandey, bender, rob, rob.patro}@cs.stonybrook.edu

ABSTRACT

Approximate Membership Query (AMQ) data structures, such as the Bloom filter, quotient filter, and cuckoo filter, have found numerous applications in databases, storage systems, networks, computational biology, and other domains. However, many applications must work around limitations in the capabilities or performance of current AMQs, making these applications more complex and less performant. For example, many current AMQs cannot delete or count the number of occurrences of each input item, take up large amounts of space, are slow, cannot be resized or merged, or have poor locality of reference and hence perform poorly when stored on SSD or disk.

This paper proposes a new general-purpose AMQ, the counting quotient filter (CQF). The CQF supports approximate membership testing and counting the occurrences of items in a data set. This general-purpose AMQ is small and fast, has good locality of reference, scales out of RAM to SSD, and supports deletions, counting (even on skewed data sets), resizing, merging, and highly concurrent access. The paper reports on the structure's performance on both manufactured and application-generated data sets.

In our experiments, the CQF performs in-memory inserts and queries up to an order of magnitude faster than the original quotient filter, several times faster than a Bloom filter, and similarly to the cuckoo filter, even though none of these other data structures support counting. On SSD, the CQF outperforms all structures by a factor of at least 2 because the CQF has good data locality.

The CQF achieves these performance gains by restructuring the metadata of the quotient filter to obtain fast lookups at high load factors (i.e., even when the data structure is almost full). As a result, the CQF offers good lookup performance even up to a load factor of 95%. Counting is essentially free in the CQF in the sense that the structure is comparable to, or even more space efficient than, non-counting data structures (e.g., Bloom, quotient, and cuckoo filters).

The paper also shows how to speed up CQF operations by using new x86 bit-manipulation instructions introduced in Intel's Haswell line of processors. The restructured metadata transforms many quotient filter metadata operations into rank-and-select bit-vector operations. Thus, our efficient implementations of rank and select may be useful for other rank-and-select-based data structures.

1. INTRODUCTION

Approximate Membership Query (AMQ) data structures maintain a probabilistic representation of a set or multiset, saving space by allowing queries occasionally to return a false positive. Examples of AMQ data structures include Bloom filters [7], quotient filters [5], cuckoo filters [17], and frequency-estimation data structures [12]. AMQs have become one of the primary go-to data structures in systems builders' toolboxes [9].

AMQs are often the computational workhorse in applications, but today's AMQs are held back by the limited number of operations they support, as well as by their performance. Many systems based on AMQs (usually Bloom filters) use designs that are slower, less space efficient, and significantly more complicated than necessary in order to work around the limited functionality provided by today's AMQ data structures.

For example, many Bloom-filter-based applications work around the Bloom filter's inability to delete items by organizing their processes into epochs; then they throw away all Bloom filters at the end of each epoch. Storage systems, especially log-structured merge (LSM) tree [25] based systems [4, 32] and deduplication systems [14, 15, 34], use AMQs to avoid expensive queries for items that are not present. These storage systems are generally forced to keep the AMQ in RAM (instead of on SSD) to get reasonable performance, which limits their scalability. Many tools that process DNA sequences use Bloom filters to detect erroneous data (erroneous subsequences in the data set) but work around the Bloom filter's inability to count by using a conventional hash table to count the number of occurrences of each subsequence [24, 29]. Moreover, these tools use a cardinality-estimation algorithm to approximate the number of distinct subsequences a priori to work around the Bloom filter's inability to dynamically resize [20].

As these examples show, four important shortcomings of Bloom filters (indeed, most production AMQs) are (1) the inability to delete items, (2) poor scaling out of RAM, (3) the inability to resize dynamically, and (4) the inability to count the number of times each input item occurs, let alone support skewed input distributions (which are so common in DNA sequencing and other applications [24, 33]).

Many Bloom filter variants in the literature [30] try to overcome the drawbacks of the traditional Bloom filter. The counting Bloom filter [18] can count and delete items, but does not support skewed distributions and uses much more space than a regular Bloom filter (see Section 4). The spectral Bloom filter [11] solves the space problems of the counting Bloom filter, but at a high computational cost. The buffered Bloom filter [10] and forest-structured Bloom filter [23] use in-RAM buffering to scale out on flash, but still require several random reads for each query. The scalable Bloom filter [3] solves the problem of resizing the data structure by maintaining a series of Bloom filters, but queries become slower because each Bloom filter must be tested for the presence of the item. See Section 2 for a thorough discussion of Bloom filter variants and other AMQs.

Counting, in particular, has also become important to AMQ structures, primarily as a tool for supporting deletion. The counting Bloom filter [18] was originally introduced to add support for deletion to the Bloom filter, albeit at a high space cost. More recently, the cuckoo [17] and quotient filters [5] support deletion by allowing each item to be duplicated a small number of times. However, many real data sets have a Zipfian distribution [13] and therefore contain many duplicates, so even these data structures can have poor performance or fail entirely on real data sets.

Full-featured high-performance AMQs. More generally, as AMQs have become more widely used, applications have placed more performance and feature demands on them. Applications would benefit from a general-purpose AMQ that is small and fast, has good locality of reference (so it can scale out of RAM to SSD), and supports deletions, counting (even on skewed data sets), resizing, merging, and highly concurrent access. This paper proposes a new AMQ data structure that has all these benefits. The paper reports on the structure's performance on both manufactured and application-generated data sets.

Results

This paper formalizes the notion of a counting filter, an AMQ data structure that counts the number of occurrences of each input item, and describes the counting quotient filter (CQF), a space-efficient, scalable, and fast counting filter that offers good performance on arbitrary input distributions, including highly skewed distributions.

We compare the CQF to the fastest and most popular counting data structure, the counting Bloom filter (CBF) [18], and to three (noncounting) approximate membership data structures, the Bloom [7], quotient [5], and cuckoo [17] filters. We perform comparisons in RAM and on disk, and using data sets generated uniformly at random and according to a Zipfian distribution. Our evaluation results are summarized in Table 1.

The CQF does counting for "free." In other words, the CQF is almost always faster and smaller than other AMQs, even though most AMQs can only indicate whether an item is present, not the number of times the item has been seen. In particular, the CQF is always smaller than the cuckoo filter, the quotient filter, and a count-min sketch configured to match the CQF's error rate. The CQF is smaller than the Bloom filter, counting Bloom filter, and spectral Bloom filter for most practical configurations. The CQF is faster than every other AMQ and counting data structure we evaluated, except the cuckoo filter, which has comparable RAM performance.

The CQF handles skewed inputs efficiently. The CQF is actually smaller and faster on skewed inputs than on uniformly random inputs. On Zipfian inputs, the CQF is about an order of magnitude faster than the counting Bloom filter. We have also benchmarked the CQF on real-world data sets, and gotten similar performance.

The CQF scales well out of RAM. When stored on SSD, each insert or query in a CQF requires O(1) I/Os, so that operations on an SSD-resident CQF are up to an order of magnitude faster than with other data structures.

The CQF supports deletes, merges, and resizing. These features make it easier to design applications around the CQF than around other counting data structures and AMQs.

Contributions

The CQF uses three techniques to deliver good performance with small space:

• The CQF embeds variable-sized counters into the quotient filter in a space-efficient way. Our embedding ensures that the CQF never takes more space than the quotient filter. Resizability and variable-sized counters enable the CQF to use substantially less space for Zipfian and other skewed input distributions.

• The CQF restructures the quotient filter's metadata scheme to speed up lookups when the data structure is nearly full. This enables the CQF to save space by supporting higher load factors than the QF. The original quotient filter performs poorly above 75% occupancy, but the CQF provides good lookup performance up to 95% occupancy. The metadata scheme also saves space by reducing the number of metadata bits per item.

• The CQF can take advantage of new x86 bit-manipulation instructions introduced in Intel's Haswell line of processors to speed up quotient-filter-metadata operations. Our metadata scheme transforms many quotient-filter-metadata operations into rank-and-select bit-vector operations. Thus, our efficient implementations of rank and select may be useful for other rank-and-select-based data structures. Other counting filters, such as the counting Bloom filter and cuckoo filter, do not have such metadata and hence cannot benefit from these optimizations.

2. AMQ AND COUNTING STRUCTURES

2.1 AMQ data structures

Approximate Membership Query (AMQ) data structures provide a compact, lossy representation of a set or multiset. AMQs support INSERT(x) and QUERY(x) operations, and may support other operations, such as delete. Each AMQ is configured with an allowed false-positive rate, δ. A query for x is guaranteed to return true if x has ever been inserted, but the AMQ may also return true with probability δ even if x has never been inserted. Allowing false positives enables the AMQ to save space.

For example, the classic AMQ, the Bloom filter [7], uses about −log2(δ)/ln(2) ≈ −1.44 log2(δ) bits per element. For common values of δ, e.g., in the range of 1/50 to 1/1000, the Bloom filter uses about one to two bytes per element.

The Bloom filter is common in database, storage, and network applications. It is typically used to avoid expensive searches on disk or queries to remote hosts for nonexistent items [9].

The Bloom filter maintains a bit vector A of length m. Every time an item x is inserted, the Bloom filter sets A[hi(x)] = 1 for i = 1, . . . , k, where k is a configurable parameter and h1, . . . , hk are hash functions. A query for x checks whether A[hi(x)] = 1 for all i = 1, . . . , k. Assuming the Bloom filter holds at most n distinct items, the optimal choice for k is (m/n) ln 2, which yields a false-positive rate of 2^{−(m/n) ln 2}. A Bloom filter cannot be resized—it is constructed for a specific false-positive rate δ and set size n.
The Bloom filter has inspired numerous variants [3, 8, 10, 14, 18, 23, 27, 28]. The counting Bloom filter (CBF) [18] replaces each bit in the Bloom filter with a c-bit saturating counter. This enables the CBF to support deletes, but increases the space by a factor of c. The scalable Bloom filter [3] uses multiple Bloom filters to maintain the target false-positive rate δ even when n is unknown.
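As a concrete sketch of the insert and query operations just described, here is a minimal Bloom filter in C. It is our illustration, not the implementation benchmarked in this paper; in particular, the seeded mixing function bf_hash is an arbitrary stand-in for the k hash functions h1, . . . , hk.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t *bits;  /* bit vector A of length m */
    uint64_t m;      /* number of bits */
    int k;           /* number of hash functions */
} bloom_filter;

/* Illustrative seeded hash; any family of k independent hashes works. */
static uint64_t bf_hash(uint64_t x, uint64_t seed) {
    x ^= (seed + 1) * 0x9e3779b97f4a7c15ULL;
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

bloom_filter *bf_new(uint64_t m, int k) {
    bloom_filter *f = malloc(sizeof *f);
    f->m = m; f->k = k;
    f->bits = calloc((m + 63) / 64, sizeof(uint64_t));
    return f;
}

void bf_insert(bloom_filter *f, uint64_t x) {
    for (int i = 0; i < f->k; i++) {
        uint64_t b = bf_hash(x, i) % f->m;     /* set A[h_i(x)] = 1 */
        f->bits[b >> 6] |= 1ULL << (b & 63);
    }
}

/* Returns 1 if x may be present, 0 if x is definitely absent. */
int bf_query(const bloom_filter *f, uint64_t x) {
    for (int i = 0; i < f->k; i++) {
        uint64_t b = bf_hash(x, i) % f->m;
        if (!(f->bits[b >> 6] & (1ULL << (b & 63))))
            return 0;
    }
    return 1;
}
```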

Data Structure                        QF       RSQF     CQF      CF       BF
Uniform random inserts per sec        11.12    12.06    11.19    14.25    2.84
Uniform successful lookups per sec    3.39     17.13    11.16    18.87    2.55
Uniform random lookups per sec        5.71     25.09    25.93    18.84    11.56
Bits per element                      12.631   11.71    11.71    12.631   12.984

(a) In-memory uniform-random performance (in millions of operations per second).

Data Structure                        CQF      CBF
Zipfian random inserts per sec        13.43    0.27
Zipfian successful lookups per sec    19.77    2.15
Uniform random lookups per sec        43.68    1.93
Bits per element                      11.71    337.584

(b) In-memory Zipfian performance (in millions of operations per second).

Data Structure                        RSQF     CQF      CF
Uniform random inserts per sec        69.05    68.30    42.20
Uniform successful lookups per sec    35.42    34.49    12.26
Uniform random lookups per sec        31.32    29.87    11.09
Bits per element                      11.71    11.71    12.631

(c) On-SSD uniform-random performance (in thousands of operations per second).

Table 1: Summary of evaluation results. All the data structures are configured for a 1/512 false-positive rate. We compare the quotient filter (QF) [5], rank-and-select quotient filter (RSQF), counting quotient filter (CQF), cuckoo filter (CF) [17], Bloom filter (BF) [26], and counting Bloom filter (CBF) [31].

The quotient filter [5] does not follow the general Bloom-filter design. It supports insertion, deletion, lookups, resizing, and merging. The quotient filter hashes items to a p-bit fingerprint and uses the upper bits of the fingerprint to select a slot in a table, where it stores the lower bits of the fingerprint. It resolves collisions using a variant of linear probing that maintains three metadata bits per slot. During an insertion, elements are shifted around, similar to insertion sort with gaps [6], so that elements are always stored in order of increasing hash value.

The quotient filter uses slightly more space than a Bloom filter, but much less than a counting Bloom filter, and delivers speed comparable to a Bloom filter. The quotient filter is also much more cache-friendly than the Bloom filter, and so offers much better performance when stored on SSD. One downside of the quotient filter is that the linear probing becomes expensive as the data structure becomes full—performance drops sharply after 60% occupancy. Geil has accelerated the QF by porting it to GPUs [19].

The cuckoo filter [17] is built on the idea of cuckoo hashing. Similar to a quotient filter, the cuckoo filter hashes each item to a p-bit fingerprint, which is divided into two parts, a slot index i and a value f to be stored in the slot. If slot i (called the primary slot) is full, then the cuckoo filter attempts to store f in slot i ⊕ h(f) (the secondary slot), where h is a hash function. If both slots are full, then the cuckoo filter kicks another item out of one of the two slots, moving it to its alternate location. This may cause a cascading sequence of kicks until the data structure converges on a new stable state. The cuckoo filter supports fast lookups, since only two locations must be examined. Inserts become slower as the structure becomes fuller, and in fact inserts may fail if the number of kicks during a single insert exceeds a specified threshold (500 in the authors' reference implementation). Lookups in the cuckoo filter are less cache-friendly than in the quotient filter, since two random locations may need to be inspected. Inserts in the cuckoo filter can have very poor cache performance as the number of kicks grows, since each kick is essentially a random memory access.

2.2 Counting data structures

Counting data structures fall into two classes: counting filters and frequency estimators.

Counting filters support INSERT, QUERY, and DELETE operations, except that a query for an item x returns the number of times that x has been inserted. A counting filter may have an error rate δ. Queries return true counts with probability at least 1 − δ. Whenever a query returns an incorrect count, it must always be greater than the true count.

The counting Bloom filter is an early example of a counting filter. The counting Bloom filter was originally described as using fixed-sized counters, which means that counters could saturate. This could cause the counting Bloom filter to undercount. Once a counter saturated, it could never be decremented by any future delete, and so after many deletes, a counting Bloom filter may no longer meet its error limit of δ. Both these issues can be fixed by rebuilding the entire data structure with larger counters whenever one of the counters saturates.

The d-left Bloom filter [8] offers the same functionality as a counting Bloom filter and uses less space, generally saving a factor of two or more. It uses d-left hashing and gives better data locality. However, it is not resizable, and the false-positive rate depends upon the block size used in building the data structure.

The spectral Bloom filter [11] is another variant of the counting Bloom filter that is designed to support skewed input distributions space-efficiently. The spectral Bloom filter saves space by using variable-sized counters. It offers significant space savings, compared to a plain counting Bloom filter, for skewed input distributions. However, like other Bloom filter variants, the spectral Bloom filter has poor cache-locality and cannot be resized.

The quotient filter also has limited support for counting, since it supports inserting the same fingerprint multiple times. However, inserting the same item more than a handful of times can cause linear probing to become a bottleneck, degrading performance. The cuckoo filter can also support a small number of duplicates of some items. In the authors' reference implementation, each slot can actually hold 4 values, so the system can support up to 8 duplicates, although it is not clear how this will impact the probability of failure during inserts of other items. One could add counting to the cuckoo filter by associating a counter with each fingerprint, but this would increase the space usage.

Frequency-estimation data structures offer weaker guarantees on the accuracy of counts, but can use substantially less space. Frequency-estimation data structures have two parameters that control the error in their counts: an accuracy parameter ε and a confidence parameter δ. The count returned by a frequency-estimation data structure is always at least the true count. After M insertions, the probability that a query returns a count that is more than εM larger than the true count is at most δ. For infrequently occurring items, the error term εM may dominate the actual number of times the item has occurred, so frequency-estimation data structures are most useful for finding the most frequent items.

The count-min sketch (CMS) [12] data structure is the most widely known frequency estimator. It maintains a d × w array A of uniform-sized counters, where d = ⌈ln(1/δ)⌉ and w = ⌈e/ε⌉. To insert an item x, the CMS increments A[i, hi(x)] for i = 1, . . . , d. For a query, it returns min_i A[i, hi(x)]. Deletions can be supported by decrementing the counters. As with a counting Bloom filter, the CMS can support arbitrarily large counts by rebuilding the structure whenever one of the counters saturates.

Note that we can use a CMS to build a counting filter by setting ε = 1/n, where n is an upper bound on the number of insertions to be performed on the sketch. However, as we will see in Section 4.1, this will be less space efficient than the counting quotient filter.
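To make the CMS operations concrete, here is a minimal count-min sketch in C following the description above. The structure mirrors the d × w counter array; the row-seeded hash cms_hash and the helper names are illustrative assumptions, not code from [12].

```c
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

typedef struct {
    uint32_t *counts;  /* d x w array of counters, row-major */
    int d, w;
} cms;

/* Illustrative per-row hash. */
static uint64_t cms_hash(uint64_t x, uint64_t row) {
    x ^= (row + 1) * 0x9e3779b97f4a7c15ULL;
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    return x ^ (x >> 33);
}

cms *cms_new(double eps, double delta) {
    cms *s = malloc(sizeof *s);
    s->d = (int)ceil(log(1.0 / delta));  /* number of rows */
    s->w = (int)ceil(exp(1.0) / eps);    /* counters per row */
    s->counts = calloc((size_t)s->d * s->w, sizeof *s->counts);
    return s;
}

void cms_insert(cms *s, uint64_t x) {
    for (int i = 0; i < s->d; i++)       /* increment A[i, h_i(x)] */
        s->counts[(size_t)i * s->w + cms_hash(x, i) % s->w]++;
}

uint32_t cms_query(const cms *s, uint64_t x) {
    uint32_t m = UINT32_MAX;             /* return min_i A[i, h_i(x)] */
    for (int i = 0; i < s->d; i++) {
        uint32_t c = s->counts[(size_t)i * s->w + cms_hash(x, i) % s->w];
        if (c < m) m = c;
    }
    return m;
}
```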

3. A RANK-AND-SELECT-BASED QUOTIENT FILTER

In this section we explain how to improve upon a quotient filter's metadata representation and algorithms. We explain how to embed variable-sized counters into a quotient filter in Section 4.

The rank-and-select-based quotient filter (RSQF) improves the quotient filter's metadata scheme in three ways:

• It uses 2.125 metadata bits per slot, compared to the 3 metadata bits per slot used by the quotient filter.

• It supports faster lookups at higher load factors than the quotient filter. The quotient filter authors recommend filling the quotient filter to only 75% capacity due to poor performance above that limit. The RSQF performs well up to 95% capacity.

• The RSQF's metadata structure transforms most quotient filter metadata operations into bit vector rank and select operations. We show how to optimize these operations using new x86 bit-manipulation instructions.

The space savings from these optimizations make the RSQF more space efficient than the Bloom filter for false-positive rates less than 1/64 and more space efficient than the cuckoo filter for all false-positive rates. In contrast, the original quotient filter is less space efficient than the cuckoo filter for all false-positive rates, and less space efficient than the Bloom filter for false-positive rates larger than 2^−36.

The performance optimizations make the RSQF several times faster than a Bloom filter and competitive with the cuckoo filter. The original quotient filter was comparable to a Bloom filter in speed and slower than the cuckoo filter.

3.1 Rank-and-select-based metadata scheme

We first describe a simple rank-and-select-based quotient filter that requires only 2 bits of metadata per slot, but is not cache friendly and has O(n) lookups and inserts. We then describe how to solve these problems by organizing the metadata into blocks of 64 slots and adding an extra 8 bits of metadata to each block (increasing the overall metadata to 2.125 bits per slot).

The rank-and-select-based quotient filter (RSQF) implements an approximate-membership-query data structure by storing a compact lossless representation of the multiset h(S), where h : U → {0, . . . , 2^p − 1} is a hash function and S is a multiset of items drawn from a universe U. As in the original quotient filter, the RSQF sets p = log2(n/δ) to get a false-positive rate δ while handling up to n insertions (see the original quotient filter paper for the analysis [5]).

The rank-and-select-based quotient filter divides h(x) into its first q bits, which we call the quotient h0(x), and its remaining r bits, which we call the remainder h1(x). The rank-and-select-based quotient filter maintains an array Q of 2^q r-bit slots, each of which can hold a single remainder. When an element x is inserted, the quotient filter attempts to store the remainder h1(x) in the home slot Q[h0(x)]. If that slot is already in use, then the rank-and-select-based quotient filter uses a variant of linear probing, described below, to find an unused slot and stores h1(x) there.

Throughout this paper, we say that slot i in a quotient filter is occupied if the quotient filter contains an element x such that h0(x) = i. We say that a slot is in use if there is a remainder stored in the slot. Otherwise the slot is unused. Because of the quotient filter's linear-probing scheme, a slot may be in use even if it is not occupied. However, since the quotient filter always tries to put remainders in their home slots (which are necessarily occupied) and only shifts a remainder when it is pushed out by another remainder, occupied slots are always in use.

The RSQF also maintains two metadata bit vectors that enable the quotient filter to determine which slots are currently in use and to determine the home slot of every remainder stored in the filter. Together, these two properties enable the RSQF to enumerate all the hash values that have been inserted into the filter.

The quotient filter makes this possible by maintaining a small amount of metadata and a few invariants:

• The quotient filter maintains an occupieds bit vector of length 2^q. The RSQF Q sets Q.occupieds[b] to 1 if and only if there is an element x ∈ S such that h0(x) = b.

• For all x, y ∈ S, if h0(x) < h0(y), then h1(x) is stored in an earlier slot than h1(y).

• If h1(x) is stored in slot s, then h0(x) ≤ s and there are no unused slots between slot h0(x) and slot s, inclusive.

These invariants imply that remainders of elements with the same quotient are stored in consecutive slots. We call such a sequence of slots a run. After each insert, the quotient filter shifts elements as necessary to maintain the invariants.

We now describe the second piece of metadata in a rank-and-select-based quotient filter.

• The quotient filter maintains a runends bit vector of length 2^q. The RSQF Q sets Q.runends[b] to 1 if and only if slot b contains the last remainder in a run.

Figure 1: A simple rank-and-select-based quotient filter (occupieds and runends bit vectors plus remainders over 2^q slots). The colors are used to group slots that belong to the same run, along with the runends bit that marks the end of that run and the occupieds bit that indicates the home slot for remainders in that run.

As shown in Figure 1, the bits set in the occupieds vector and the runends vector are in a one-to-one correspondence. There is one run for each slot b such that there exists an x such that h0(x) = b, and each such run has an end. Runs are stored in the order of the home slots to which they correspond.

This correspondence enables us to reduce many quotient-filter-metadata operations to bit vector rank and select operations. Given a bit vector B, RANK(B, i) returns the number of 1s in B up to position i, i.e., RANK(B, i) = Σ_{j=0..i} B[j]. Select is essentially the inverse of rank. SELECT(B, i) returns the index of the ith 1 in B.

These operations enable us to find the run corresponding to any quotient h0(x); see Algorithm 1. If Q.occupieds[h0(x)] = 0, then no such run exists. Otherwise, we first use RANK to count the number t of slots b ≤ h0(x) that have their occupieds bit set. This is the number of runs corresponding to slots up to and including slot h0(x). We then use SELECT to find the position of the tth runend bit, which tells us where the run of remainders with quotient h0(x) ends. We walk backwards through the remainders in that run. Since elements are always shifted to the right, we can stop walking backwards if we ever pass slot h0(x) or if we reach another slot that is marked as the end of a run.
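The following C sketch spells out these definitions of RANK and SELECT as straightforward loops (the function names are ours). Section 3.2 shows how the RSQF replaces such loops with single x86 instructions on 64-bit words.

```c
#include <stdint.h>
#include <stddef.h>

/* RANK(B, i): number of 1s in bit vector B[0..i], inclusive. */
size_t bv_rank(const uint64_t *B, size_t i) {
    size_t count = 0;
    for (size_t j = 0; j <= i; j++)
        count += (B[j >> 6] >> (j & 63)) & 1;
    return count;
}

/* SELECT(B, i): index of the ith 1 in B (i >= 1), or nbits if none. */
size_t bv_select(const uint64_t *B, size_t nbits, size_t i) {
    size_t seen = 0;
    for (size_t j = 0; j < nbits; j++) {
        seen += (B[j >> 6] >> (j & 63)) & 1;
        if (seen == i)
            return j;
    }
    return nbits;
}
```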

Algorithm 1 Algorithm for determining whether x may have been inserted into a simple rank-and-select-based quotient filter.

1: function MAY_CONTAIN(Q, x)
2:   b ← h0(x)
3:   if Q.occupieds[b] = 0 then
4:     return 0
5:   t ← RANK(Q.occupieds, b)
6:   ℓ ← SELECT(Q.runends, t)
7:   v ← h1(x)
8:   repeat
9:     if Q.remainders[ℓ] = v then
10:      return 1
11:    ℓ ← ℓ − 1
12:  until ℓ < b or Q.runends[ℓ] = 1
13:  return 0

Algorithm 2 Algorithm for inserting x into a rank-and-select quotient filter.

1: function FIND_FIRST_UNUSED_SLOT(Q, x)
2:   r ← RANK(Q.occupieds, x)
3:   s ← SELECT(Q.runends, r)
4:   while x ≤ s do
5:     x ← s + 1
6:     r ← RANK(Q.occupieds, x)
7:     s ← SELECT(Q.runends, r)
8:   return x

9: function INSERT(Q, x)
10:  r ← RANK(Q.occupieds, h0(x))
11:  s ← SELECT(Q.runends, r)
12:  if h0(x) > s then
13:    Q.remainders[h0(x)] ← h1(x)
14:    Q.runends[h0(x)] ← 1
15:  else
16:    s ← s + 1
17:    n ← FIND_FIRST_UNUSED_SLOT(Q, s)
18:    while n > s do
19:      Q.remainders[n] ← Q.remainders[n − 1]
20:      Q.runends[n] ← Q.runends[n − 1]
21:      n ← n − 1
22:    Q.remainders[s] ← h1(x)
23:    if Q.occupieds[h0(x)] = 1 then
24:      Q.runends[s − 1] ← 0
25:    Q.runends[s] ← 1
26:  Q.occupieds[h0(x)] ← 1
27:  return

Algorithm 2 shows the procedure for inserting an item x. The algorithm uses rank and select to find the end of the run corresponding to quotient h0(x). If slot h0(x) is not in use, then the result of the rank-and-select operation is an index less than h0(x), in which case the algorithm stores h1(x) in slot h0(x). Otherwise the algorithm shifts remainders (and runends bits) to the right to make room for the new item, inserts it, and updates the metadata.

As with the original quotient filter, the false-positive rate of the RSQF is at most 2^−r. The RSQF also supports enumerating all the hashes currently in the filter, and hence can be resized by building a new table with 2^q′ slots, each with a remainder of size p − q′ bits, and then inserting all the hashes from the old filter into the new one. RSQFs can be merged in a similar way.

This simple quotient filter design demonstrates the architecture of the RSQF and requires only two metadata bits per slot, but has two problems. First, rank and select on bit vectors of size n require O(n) time in the worst case. We would like to perform lookups without having to scan the entire data structure. Second, this design is not cache friendly. Each lookup requires accessing the occupieds bit vector, the runends bit vector, and the array of remainders. We prefer to reorganize the data so that most operations access only a small number of nearby memory locations.

Offsets. To compute the position of a runend without scanning the entire occupieds and runends bit vectors, the RSQF maintains an offsets array. The offset Oi of slot i is

Oi = SELECT(Q.runends, RANK(Q.occupieds, i)) − i

or 0 if this value is negative, which occurs whenever slot i is unused. Intuitively, Oi is the distance from slot i to the slot containing the runend corresponding to slot i. Thus, if we know Oi, we can immediately jump to the location of the run corresponding to slot i, and from there we can perform a search, insert, delete, etc.

To save space, the RSQF stores Oi for only every 64th slot, and computes Oj for other slots using the algorithm from Figure 2. To compute Oj from Oi, the RSQF uses RANK to count the number d of occupied slots between slots i and j, and then uses SELECT to find the dth runend after the end of the run corresponding to slot i.

Figure 2: Procedure for computing offset Oj given Oi:
  d = RANK(occupieds[i + 1, . . . , j], j − i − 1)
  t = SELECT(runends[i + Oi + 1, . . . , 2^q − 1], d)
  Oj = i + Oi + t − j

Maintaining the array of offsets is inexpensive. Whenever the RSQF shifts elements left or right (as part of a delete or insert), it updates the stored Oi values. Only Oi values in the range of slots that were involved in the shift need to be updated.

Computing Oj from the nearest stored Oi is efficient because the algorithm needs to examine only the occupieds bit vector between indices i and j and the runends bit vector between indices i + Oi and j + Oj. Since the new quotient filter stores Oi for every 64th slot, the algorithm never needs to look at more than 64 bits of the occupieds bit vector, and it only needs to look at O(q) bits in the runends bit vector based on the following theorem.

THEOREM 1. The length of the longest contiguous sequence of in-use slots in a quotient filter with 2^q slots and load factor α is O(ln(2^q)/(α − ln α − 1)) with high probability.

The theorem (from the original QF paper [5]) bounds the worst case. On average, the RSQF only needs to examine j − i < 64 bits of the runends bit vector because the average number of items associated with a slot is less than 1.

This theorem also shows that the offsets are never more than O(q), so we can store entries in the offsets array using small integers. Our prototype implementation stores offsets as 8-bit unsigned ints. Since it stores one offset for every 64 slots, this increases the metadata overhead to a total of 2.125 bits per slot.

Blocking the RSQF. To make the RSQF cache efficient, we break the occupieds, runends, offsets, and remainders vectors into blocks of 64 entries, which we store together, as shown in Figure 3. We use blocks of 64 entries so these rank and select operations can be transformed into efficient machine-word operations, as described in Section 3.2. Each block holds one offset, 64 consecutive bits from each bit vector, and the corresponding 64 remainders. An operation on slot i loads the corresponding block, consults the offset field, performs a rank computation on the occupieds bits in the block, and then performs a select operation on the runends bits in the block. If the offset is large enough, the select operation may extend into subsequent blocks, but in all cases the accesses are sequential. The remainders corresponding to slot i can then be found by tracing backwards from where the select computation completed.

Figure 3: Layout of a rank-and-select-based-quotient-filter block. The size of each field is specified in bits: offset (8), occupieds (64), runends (64), remainders (64r).
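A C sketch of this block layout (our illustration of Figure 3, with the remainder width r fixed to 8 bits for concreteness): the 8-bit offset, the two 64-bit metadata words, and the 64 packed remainders travel together, so an operation on a slot usually touches a single cache-resident block.

```c
#include <stdint.h>

#define SLOTS_PER_BLOCK 64

/* One RSQF block: metadata plus remainders for 64 consecutive slots.
 * In general the remainders occupy 64*r bits; here r = 8 for simplicity. */
typedef struct {
    uint8_t  offset;          /* stored O_i for the block's first slot    */
    uint64_t occupieds;       /* 64 bits of the occupieds bit vector      */
    uint64_t runends;         /* 64 bits of the runends bit vector        */
    uint8_t  remainders[SLOTS_PER_BLOCK];  /* packed r-bit remainders    */
} rsqf_block;

/* Map a global slot index to its block and in-block position. */
static inline rsqf_block *slot_block(rsqf_block *blocks, uint64_t slot) {
    return &blocks[slot / SLOTS_PER_BLOCK];
}
static inline int slot_pos(uint64_t slot) {
    return (int)(slot % SLOTS_PER_BLOCK);
}
```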

For a false-positive rate of 2^−r, each block will have a size of 64(r + 2) + 8 bits. Thus a block is much smaller than a disk block for typical values of r, so quotient filter operations on an SSD require accessing only a small number of consecutive blocks, and usually just one block.

Table 2: Space usage of several AMQs. Here, δ is the false-positive rate and α is the load factor. The original quotient filter was less space efficient than the cuckoo filter because it only supports α up to 0.75, whereas the cuckoo filter supports α up to 0.95. The RSQF is more efficient than the cuckoo filter because it has less overhead and supports load factors up to 0.95.

Filter          Bits per element
Bloom filter    log2(1/δ) / ln 2
Cuckoo filter   (3 + log2(1/δ)) / α
Original QF     (3 + log2(1/δ)) / α
RSQF            (2.125 + log2(1/δ)) / α

3.2 Fast x86 rank and select

The blocked RSQF needs to perform a RANK operation on a 64-bit portion of the occupieds bitvector and a SELECT operation on a small piece of the runends vector. We now describe how to implement these operations efficiently on 64-bit vectors using the x86 instruction set.

To implement SELECT, we use the PDEP and TZCNT instructions added to the x86 instruction set with the Haswell line of processors. PDEP deposits bits from one operand in locations specified by the bits of the other operand. If x = PDEP(v, m), then the ith bit of x is given by

x_i = v_j if m_i is the jth 1 in m, and x_i = 0 otherwise.

TZCNT returns the number of trailing zeros in its argument. Thus, we can implement select on 64-bit vectors as

SELECT(v, i) = TZCNT(PDEP(2^i, v))

POPCOUNT returns the number of bits set in its argument. We implement RANK(v, i) on 64-bit vectors using the widely-known mask-and-popcount method [21]:

RANK(v, i) = POPCOUNT(v & (2^i − 1))

We evaluate the performance impact of these optimizations in Section 6.
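These two formulas map directly onto compiler intrinsics. A minimal C sketch (the function names are ours; compile with -mbmi2 -mpopcnt on a Haswell-or-later x86-64):

```c
#include <stdint.h>
#include <x86intrin.h>   /* _pdep_u64, _tzcnt_u64, _mm_popcnt_u64 */

/* RANK(v, i) = POPCOUNT(v & (2^i - 1)): number of set bits of v at
 * positions 0..i-1. Valid for 0 <= i <= 63. */
static inline uint64_t rank64(uint64_t v, int i) {
    return (uint64_t)_mm_popcnt_u64(v & ((1ULL << i) - 1));
}

/* SELECT(v, i) = TZCNT(PDEP(2^i, v)): position of the (i+1)th set bit
 * of v, with i counted from 0; returns 64 if v has too few set bits. */
static inline uint64_t select64(uint64_t v, int i) {
    return _tzcnt_u64(_pdep_u64(1ULL << i, v));
}
```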

3.3 Lookup performance

We now explain why the RSQF offers better lookup performance than the original QF at high load factors.

Lookups in any quotient filter involve two steps: finding the start of the target run, and scanning through the run to look for the queried value. In both the original QF and the RSQF, runs have size O(1) on average and O(log n / log log n) with high probability, and the RSQF does nothing to accelerate the process of scanning through the run.

The RSQF does accelerate the process of finding the start of the run. The original QF finds the target run by walking through the slots in the target cluster, one by one. Both the average and worst-case cluster sizes grow as the load factor increases, so processing each slot's metadata bits one at a time can become expensive. The RSQF, however, processes these metadata bits 64 at a time by using our efficient rank-and-select operations.

At high load factors, this can yield tremendous speedups, since it converts 64 bit-operations into single word-operations. At low load factors, the speedup is not so great, since the QF and RSQF are both doing essentially O(1) operations, albeit the QF is doing bit operations and the RSQF is doing word operations.

Section 6 presents experimental results showing the performance impact of this redesign.

3.4 Space analysis

Table 2 gives the space used by several AMQs. Our rank-and-select-based quotient filter uses fewer metadata bits than the original quotient filter, and is faster at higher load factors (see Section 6). The RSQF is more space efficient than the original quotient filter, the cuckoo filter, and, for false-positive rates less than 1/64, the Bloom filter. Even for large false-positive rates, the RSQF never uses more than 1.55 more bits per element than a Bloom filter.

Figure 4 shows the false-positive rate of these data structures as a function of the space usage, assuming that each data structure uses the recommended load factor, i.e., 100% for the BF, 95% for the RSQF and CF, and 75% for the QF. In order to fairly compare the space requirements of data structures with different recommended load factors, we normalize all the data structures' space requirements to bits per element.

Figure 4: −log2(false-positive rate) achieved as a function of the number of bits per element for the RSQF, QF, BF, and CF. The RSQF requires less space than the CF, and less space than the BF for any false-positive rate less than 1/64. (Higher is better.)

3.5 Enumeration, resizing, and merging

Since quotient filters represent a multiset S by losslessly representing the multiset h(S), they support enumerating h(S). Everything in this section applies to the original quotient filter, our rank-and-select-based quotient filter, and, with minor modifications, to the counting quotient filter.

The time to enumerate h(S) is proportional to 2^q, the number of slots in the QF. If the QF is a constant fraction full, enumerating h(S) requires O(n) time, where n is the total number of items in the multiset S. The enumeration algorithm performs a linear scan of the slots in the QF, and hence is I/O efficient for a QF or RSQF stored on disk or SSD.

The QF's enumeration ability makes it possible to resize a filter, similar to resizing any hash table. Given a QF with 2^q slots and r-bit remainders and containing n hashes, we can construct a new, empty filter with 2^q′ ≥ n slots and (q + r − q′)-bit remainders. We then enumerate the hashes in the original QF and insert them into the new QF. As with a hash table, the time required to resize the QF is proportional to the size of the old filter plus the size of the new filter.

Hence, as with a standard hash table, doubling a QF every time it becomes full will have O(1) overhead for each insert.

Enumerability also enables us to merge filters. Given two filters representing h(S1) and h(S2), we can merge them by constructing a new filter large enough to hold h(S1 ∪ S2) and then enumerating the hashes in each input filter and inserting them into the new filter. The total cost of performing the merge is proportional to the size of the output filter, i.e., if the input filters have n1 and n2 elements, the time to merge them is O(n1 + n2).

Merging is particularly efficient for two reasons. First, items can be inserted into the output filter in order of increasing hash value, so inserts will never have to shift any other items around. Second, merging requires only linear scans of the input and output filters, and hence is I/O-efficient when the filters are stored on disk.
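Because enumeration yields hashes in increasing order, merging reduces to merging sorted streams. A minimal sketch under the assumption that enumeration has been materialized as sorted arrays of hashes; the filter type qf and the in-order insert routine qf_insert_sorted are hypothetical stand-ins, not the paper's API:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct qf qf;                     /* hypothetical filter type */
void qf_insert_sorted(qf *out, uint64_t hash);  /* hypothetical: appends a
                                                   hash >= all prior hashes */

/* Merge two sorted streams of enumerated hashes into a new filter.
 * Runs in O(n1 + n2) with purely sequential access to all three filters. */
void qf_merge(const uint64_t *h1, size_t n1,
              const uint64_t *h2, size_t n2, qf *out) {
    size_t i = 0, j = 0;
    while (i < n1 && j < n2)
        qf_insert_sorted(out, h1[i] <= h2[j] ? h1[i++] : h2[j++]);
    while (i < n1) qf_insert_sorted(out, h1[i++]);
    while (j < n2) qf_insert_sorted(out, h2[j++]);
}
```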

4. COUNTING

We now describe how to add counters to the RSQF to create the CQF. Our counter-embedding scheme maintains the data locality of the RSQF, supports variable-sized counters, and ensures that the CQF takes no more space than an RSQF of the same multiset. Thanks to the variable-size counters, the structure is space efficient even for highly skewed distributions, where some elements are observed frequently and others rarely.

Encoding counters. The RSQF counts elements in unary, i.e., if a given remainder occurs k times in a run, then the RSQF just stores k copies of this remainder.

The CQF saves space by repurposing some of the slots to store counters instead of remainders. In the CQF, if a particular element occurs more than once, then the slots immediately following that element's remainder hold an encoding of the number of times that element occurs.

To make this scheme work, however, we need some way to determine whether a slot holds a remainder or part of a counter. The CQF distinguishes counters from remainders as follows. Within a run, the CQF stores the remainders in increasing order. Any time the value stored in a slot deviates from this strictly increasing pattern, that slot must hold part of an encoded counter. Thus, a deviation from the strictly increasing pattern acts as an "escape sequence" indicating that the CQF is using the next few slots to store an encoded counter rather than remainders.

Once the CQF decoder has recognized that a slot holds part of the counter for some remainder x, it needs to determine how many slots are used by that counter. We again use a form of escape sequence: the counter for remainder x is encoded so that no slot holding part of the counter will hold the value x. Thus, the end of the counter is marked by another copy of x.

This scheme also requires that we encode a counter value C into a sequence of slots so that the first slot following the first occurrence of x holds a value less than x. Thus, we simply encode C as described below, and then prepend a 0 to its encoding if it would otherwise violate this requirement.

There remains one wrinkle. For the remainder 0, it is not possible to encode its counter so that the first slot holding the counter has a value less than 0. Instead, we mark a counter for remainder 0 with a special "terminator"—two slots containing consecutive 0s. If a run contains two consecutive 0s, then everything between the first slot and the two consecutive 0s is an encoding for the number of 0 remainders in the run. Otherwise, the number of 0 remainders is recorded through repetition, as in the RSQF.

This last rule means that we cannot have two consecutive 0s anywhere else in the run, including in the encoding of any counters. To ensure this, we never use 0 in the encoding for other counters.

Thus, the counter C for remainder x > 0 is encoded as a sequence of r-bit values, but we cannot use the values 0 or x in the encoding for C. Since we know C ≥ 3, we achieve this by encoding C − 3 as c_{ℓ−1}, . . . , c_0 in base 2^r − 2, where the symbols are 1, 2, . . . , x − 1, x + 1, . . . , 2^r − 1, and prepend a 0 if c_{ℓ−1} ≥ x. Note that this requires r ≥ 2 in order to have the base of the encoding be greater than 1, but this is not a serious constraint, since most applications require r > 2 in order to achieve their target false-positive rate, anyway.

The counter C for remainder x = 0 is encoded using base 2^r − 1, since only 0 is disallowed in the counter encoding. Furthermore, since we know C ≥ 4, we encode C − 4.

Table 3 summarizes the counter encoding used by the CQF.

Table 3: Encodings for C occurrences of remainder x in the CQF.

Count            Encoding                        Rules
C = 1            x                               none
C = 2            x, x                            none
C > 2, x > 0     x, c_{ℓ−1}, . . . , c_0, x      c_{ℓ−1} < x; c_i ≠ x for all i; c_i ≠ 0 for all i < ℓ − 1
C = 3, x = 0     0, 0, 0                         none
C > 3, x = 0     0, c_{ℓ−1}, . . . , c_0, 0, 0   c_i ≠ 0 for all i

As an example, a run consisting of 5 copies of 0, 7 copies of 3, and 9 copies of 8 would be encoded as (0, 2, 0, 0, 3, 0, 6, 3, 8, 7, 8).
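Since the digit arithmetic is easy to get wrong, here is a small C encoder for the rules in Table 3 (our illustration; it assumes at least one digit symbol is emitted even when C − 3 = 0). Running it on the example above reproduces (0, 2, 0, 0), (3, 0, 6, 3), and (8, 7, 8).

```c
#include <stdint.h>
#include <stddef.h>

/* Encode C occurrences of remainder x (0 <= x < 2^r, r >= 2) into out[],
 * following Table 3; returns the number of slots written. */
size_t encode_count(uint64_t x, uint64_t C, int r, uint64_t *out)
{
    uint64_t d[66];          /* counter symbols, least significant first */
    size_t nd = 0, n = 0;

    if (C == 1) { out[n++] = x; return n; }
    if (C == 2) { out[n++] = x; out[n++] = x; return n; }

    if (x > 0) {
        /* Encode C - 3 in base 2^r - 2; symbols skip 0 and x. */
        uint64_t base = (1ULL << r) - 2, v = C - 3;
        do {
            uint64_t s = v % base;
            d[nd++] = (s + 1 < x) ? s + 1 : s + 2;   /* digit -> symbol */
            v /= base;
        } while (v > 0);
        out[n++] = x;
        if (d[nd - 1] >= x)
            out[n++] = 0;    /* keep the first counter slot below x */
        while (nd > 0) out[n++] = d[--nd];
        out[n++] = x;        /* a second copy of x ends the counter */
    } else {
        out[n++] = 0;
        if (C > 3) {         /* C == 3 is just the terminator 0, 0, 0 */
            /* Encode C - 4 in base 2^r - 1; symbols skip only 0. */
            uint64_t base = (1ULL << r) - 1, v = C - 4;
            do { d[nd++] = v % base + 1; v /= base; } while (v > 0);
            while (nd > 0) out[n++] = d[--nd];
        }
        out[n++] = 0;        /* two consecutive 0s terminate the count */
        out[n++] = 0;
    }
    return n;
}
```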

4.1 CQF Space Analysis

For data sets with no or few repeated items, the CQF uses essentially the same space as an RSQF, but never more. When the input distribution is skewed, then the CQF can use substantially less space than an RSQF because the CQF encodes the duplicate items much more efficiently.

The exact size of the CQF after M inserts depends on the distribution of the inserted items, but we can give an upper bound as follows. In the following, n is the total number of items to be inserted, k is the total number of distinct items to be inserted, and M is the number of items inserted so far.

When the data structure has r-bit slots, the encoding of an item that occurs C times consumes at most three slots plus ⌈log2(C)/(r − 1)⌉ ≤ log2(C)/(r − 1) + 1 slots for its counter. Thus, the total number of bits used to encode a remainder and its count is at most 4r + (r/(r − 1)) log2(C) ≤ 4r + 2 log2(C), since r ≥ 2. After inserting k < n distinct items, there are at least k occupied slots, so r is at most p − log2(k). Since p = log2(n/δ), this means that r ≤ log2(n/δ) − log2(k) = log2(n/(kδ)). The total size of all the counters is maximized when all the distinct items occur an equal number of times, i.e., when C = M/k for each item. Putting this together, we get the following space bound:

THEOREM 2. Let Q be a CQF with capacity n and false-positive rate δ. Suppose we initially build Q with s = 1 slot and resize to double s whenever the number of used slots exceeds 0.95s. Then, after performing M inserts consisting of k distinct items, the size of the CQF will be O(k log(nM/(δk^2))) bits. The worst case occurs when each item occurs an equal number of times.

To understand this bound, consider the extremal cases when k = 1 and k = M. When k = 1, the CQF contains M instances of a single item. The space bound reduces to O(log(nM/δ)) = O(log(n/δ) + log M) bits. This is exactly the size of a single hash value plus the size of a counter holding the value M. At the other extreme, when k = M, the space bound simplifies to O(M log(n/(δM))) = O(M(log(n/δ) − log M)), i.e., the CQF has O(M) slots, each of size log(n/δ) − log M, which is exactly the space bound for the RSQF.

Figure 5: Space comparison of CQF, SBF, and CBF as a function of the number of distinct items in the multiset (k). All data structures are built to support up to n = M = 1.6 × 10^7 insertions with a false-positive rate of δ = 2^−9. (The y-axis shows the size of the data structure in bits, on a log scale; worst-case and best-case curves are shown for each structure.)

Figure 5 gives bounds, as a function of the number k of distinct items, on the space usage of the counting quotient filter, counting Bloom filter, and spectral Bloom filter, for n = M = 1.6 × 10^7 and δ = 2^−9. As the graph shows, the worst-case space usage for the CQF is better than the best-case space usage of the other data structures for almost all values of k.
Although it is difficult to see in the graph, the spectral Bloom filter uses slightly less space than the CQF when k is close to M. The counting Bloom filter's space usage is worst, since it stores the most counters and all counters have the same size—large enough to hold the count of the most frequent element in the data set. This is also why the counting Bloom filter's space usage improves slightly as the number of distinct items increases, and hence the count of the most frequent item decreases.

The spectral Bloom filter (SBF) uses space proportional to a plain Bloom filter plus optimally-sized counters for all the elements. As a result, its space usage is largely determined by the Bloom filter and hence is independent of the input distribution. The CQF space usage is best when the input contains many repeated items, since the CQF can be resized to be just large enough to hold those items. Even in its worst case, its space usage is competitive with the best-case space usage of the other counting filters.

Comparison to count-min sketch. Given a maximum false-positive rate δ and an upper bound n on the number of items to be inserted, we can build a CMS-based counting filter by setting the CMS's parameter ε = 1/n. After performing M ≤ n inserts, consisting of k ≤ M distinct items, at least one counter must have value at least M/k. Since CMS uses uniform-sized counters, each counter must be at least 1 + log(M/k) bits. Thus the total space used by the CMS must be at least Ω((1 + log(M/k)) n ln(1/δ)) bits. One can use the geometric-arithmetic mean inequality to show that this is asymptotically never better than (and often worse than) the CQF space usage for all values of δ, k, M, and n.

4.2 Configuring the CQF

When constructing a Bloom filter, the user needs to preset the size of the array, and this size cannot change. In contrast, for a CQF, the only parameter that needs to be preset is the number of bits p output by the hash function h. The CQF can be dynamically resized, and resizing has no effect on the false-positive rate.

As Section 3.1 explains, the user derives p from the error rate δ and the maximum possible number n of items; the user sets p = ⌈log2(n/δ)⌉.

One of the major advantages of the CQF is that its space usage is robust to errors in estimating n. This is important because, in many applications, the user knows δ but not n. Since underestimating n can lead to a higher-than-acceptable false-positive rate, users often use a conservatively high estimate.

The space cost of overestimating n is much lower in the CQF than in the Bloom filter. In the Bloom filter, the space usage is linear in n. Thus, if the user overestimates n by, say, a factor of 2, then the Bloom filter will consume twice as much space as necessary. In the CQF, on the other hand, overestimating n by a factor of 2 causes the user to select a value of p, and hence a remainder size r, that is merely one bit larger than necessary. Since r ≈ log2(1/δ), the relative cost of one extra remainder bit in each slot is small. For example, in typical applications requiring an approximately 1% false-positive rate, r ≈ 7, so each slot contains at least 9 bits, and hence overestimating n by a factor of 2 increases the space usage of the CQF by at most 11%.
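The parameter derivation is a one-liner; this helper is our illustration of the formula p = ⌈log2(n/δ)⌉, not an API from the paper's implementation.

```c
#include <stdint.h>
#include <math.h>

/* Number of hash bits p for capacity n and false-positive rate delta:
 * p = ceil(log2(n / delta)). E.g., n = 2^26 and delta = 1/512 give
 * p = 35, and doubling n raises p (and thus each slot) by just one bit. */
static inline int cqf_hash_bits(uint64_t n, double delta) {
    return (int)ceil(log2((double)n / delta));
}
```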
size of the array, and this size cannot change. In contrast, for a We compare the counting quotient filter and rank-and-select quo- CQF, the only parameter that needs to be preset is the number of tient filter against four other AMQs: a state-of-the-art Bloom fil- bits p output by the hash function h. The CQF can be dynamically ter [26], Bender et al.’s quotient filter [5], Fan et al.’s cuckoo fil- resized, and resizing has no affect on the false-positive rate. ter [16], and Vallentin’s counting Bloom filter [31]. As Section 3.1 explains, the user derives p from from the error We evaluate each data structure on the two fundamental opera- rate δ and the maximum possible number n of items; then the user tions, insertions and queries. We evaluate queries both for items sets p = dlog2(n/δ)e. that are present and for items that are not present.

782 We address the following questions about how AMQs perform We configured the BF and CBF to be as small as possible while in RAM and on SSD: still supporting the target false-positive rate and number of inser- tions to be performed in the experiment. The BF and CBF used the 1. How do the rank-and-select quotient filter (RSQF) and counting optimal number of hash functions for their size and the number of quotient filter (CQF) compare to the Bloom filter (BF), quotient insertions to be performed. filter (QF), and cuckoo filter (CF) when the filters are in RAM? In order to isolate the performance differences between the data 2. How do the RSQF and CQF compare to the CF when the filters structures, we don’t count the time required to generate the random reside on SSD? inputs to the filters. For the on-SSD experiments, the data structures were allocated We do a deep dive into how performance is affected by the data using mmap and the amount of in-memory cache was limited to distribution, metadata organization, and low-level optimizations: 800MBs of RAM, leading to a RAM-size-to-filter-size ratio of 1. How does the CQF compare to the counting Bloom filter (CBF) roughly 1:2. Paging was handled by the OS. The point of the exper- for handling skewed data sets? iments was to evaluate the IO efficiency of the quotient filter and 2. How does our rank-and-select-based metadata scheme help per- cuckoo filter. We omit the Bloom filter from the on-SSD experi- formance? (I.e., how does the RSQF compare to the QF?) We are ments, because Bloom filters are known to have poor cache locality especially interested in evaluating filters with occupancy higher and run particularly slowly on SSDs [5]. than 60%, when the QF performance starts to degrade. We evaluated the performance of the counting filters on two dif- ferent input distributions, uniformly random and Zipfian. We use 3. How much do the new x86 bit-manipulation instructions (PDEP a Zipfian distribution to evaluate the CQF’s performance on real- and TZCNT) introduced in Intel’s Haswell architecture contribute istic data distributions and its ability to handle large numbers of to performance improvements? duplicate elements efficiently. We omit the Cuckoo filters from the 4. How does the CQF’s insert speed scale with multiple threads? Zipfian experiment, because they are not designed to handle dupli- cate elements. 5. How efficient is the average merge throughput when merging We also evaluated the merge performance of the counting quo- multiple counting quotient filters? tient filter. We created K (i.e., 2, 4, and 8) counting quotient filters and filled them to 95% load factor with uniformly random data. We We also evaluate and address the following questions about the then merged these counting quotient filters into a single counting counting quotient filter when used with data sets from real-world quotient filter. While merging multiple counting quotient filters, applications: we add the number of occupied slots in each input counting quo- 1. How does the CQF performs when used with real-world data tient filter and take the next closest power of 2 as the number of sets? We use data sets from k-mer counting (a sub-task of slots to create in the output counting quotient filter. DNA sequencing) and the firehose benchmark, which simulates Application benchmarks. We also benchmarked the insert per- a network-event monitoring task, as our real-world applications. 
formance of the counting quotient filter with data sets from two real-world applications: k-mer counting [29,33] and FireHose [22]. 6.1 Experiment Setup K-mer counting is often the first step in the analysis of DNA We evaluate the performance of the data structures in terms of sequencing data. This helps to identify and weed out erroneous the load factor and capacity. The capacity of the data structure is data. To remove errors, one counts the number of times each k-mer the number of items that can be inserted without causing the data (essentially a k-gram over the alphabet A, C, T, G) occurs [29,33]. structure’s false-positive rate to become too high (which turns out These counts are used to filter out errors (i.e., k-mers that occur to be the number of elements that can be inserted when there are only once) and to detect repetitions in the input DNA sequence no duplicates). We define the load factor to be the ratio of the (i.e., k-mers that occur very frequently). Many of today’s k-mer number of distinct items in the data structure to the capacity of the counters typically use a Bloom filter to remove singletons and a data structure. For most experiments, we report the performance conventional, space-inefficient hash table to count non-singletons. on all operations as a function of the data structures’ load factor, For our experiments, we counted 28-mers, a common value i.e., when the data structure’s load factor is 5%, 10%, 15%, etc. used in actual DNA sequencing tasks. We used SRA accesion In all our experiments, the data structures were configured to SRR072006 [1] for our benchmarks. This data set has a total of have a false-positive rate of 1/512. Experiments with other false- ≈ 330M 28-mers in which there are ≈ 149M distinct 28-mers. positive rates gave similar results. We measured the total time taken to complete the experiment. All the experiments (except the multi-threaded experiments) Firehose [22] is a suite of benchmarks simulating a network- were run on an Intel Skylake CPU (Core(TM) i5-6500 CPU @ event monitoring workload. A Firehose benchmark setup consists 3.20GHz with 2 cores and 6MB L3 cache) with 8 GB of RAM and of a generator that feeds packets via a local UDP connection to a a 480GB Intel SSDSC2BW480A4 Serial ATA III 540 MB/s 2.5” monitor, which is being benchmarked. The monitor must detect SSD. Experiments were single-threaded, unless otherwise men- “anomalous” events as accurately as possible while dropping as tioned. The multi-threaded experiments were run on an Intel Sky- few packets as possible. The anomaly detection task is as follows: lake CPU (Core(TM) i7-6700HQ CPU @ 2.60GHz with 4 cores each packet has an ID and value, which is either “SUSPICIOUS” and 6MB L3 cache) with 32 GB RAM. or “OK”. When the monitor sees a particular ID for the 25th time, it Microbenchmarks. The microbenchmarks measure perfor- must determine whether that ID occurred with value SUSPICIOUS mance on raw inserts and lookups and are performed as follows. more than 20 times, and mark it as anomalous if so. Otherwise, it We insert random elements into an empty data structure until its is marked as non-anomalous. load factor is sufficiently high (e.g., 95%). We record the time re- The Firehose suite includes two generators: the power-law gen- quired to insert every 5% of the items. After inserting each 5% of erator generates items with a Zipfian distribution, the active-set items, we measure the lookup performance for that load factor. 
The FireHose suite includes two generators: the power-law generator generates items with a Zipfian distribution, and the active-set generator generates items with a uniformly random distribution. The power-law generator picks keys from a static range of 100,000 keys, following a power-law distribution. The active-set generator selects keys from a continuously evolving active set of 128,000 keys.
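For intuition about the power-law generator's key distribution, here is a standard inverse-CDF sampler over a static key range. It is our illustration, not FireHose's generator; the exponent ZIPF_S and the use of rand() are placeholder choices.

/* Sketch: sample keys 0..NKEYS-1 with probability proportional to
 * 1/(k+1)^ZIPF_S by binary-searching a precomputed CDF (illustrative
 * only; the FireHose generator's exact scheme may differ). */
#include <math.h>
#include <stdlib.h>

#define NKEYS  100000
#define ZIPF_S 1.5  /* placeholder power-law exponent */

static double cdf[NKEYS];

void zipf_init(void) {
    double sum = 0.0;
    for (int k = 0; k < NKEYS; k++) {
        sum += pow(k + 1, -ZIPF_S);
        cdf[k] = sum;
    }
    for (int k = 0; k < NKEYS; k++)
        cdf[k] /= sum;  /* normalize so that cdf[NKEYS-1] == 1.0 */
}

int zipf_draw(void) {
    double u = (double)rand() / RAND_MAX;
    int lo = 0, hi = NKEYS - 1;
    while (lo < hi) {   /* find the first k with cdf[k] >= u */
        int mid = lo + (hi - lo) / 2;
        if (cdf[mid] < u) lo = mid + 1; else hi = mid;
    }
    return lo;
}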

In the active-set generator, the probability of selecting each key varies with time and roughly follows a bell-shaped curve: in a stream, a key appears occasionally, then appears more frequently, and then dies off. FireHose also includes a reference implementation of a monitor. The reference implementation uses conventional hash tables to count the occurrences of observations.

In our experiments, we inserted data from the above application data sets into the counting quotient filter to measure the raw insertion throughput of the CQF. We performed the experiment by first dumping the data sets to files. The benchmark then read the files and inserted the elements into the CQF. We took 50M items from each data set. The CQF was configured with the next closest power of 2, i.e., ≈ 64M slots.

To evaluate how the counting quotient filter scales with multiple threads, we performed multi-threaded insertion experiments on all our synthetic and application data sets. However, instead of measuring the instantaneous insertion throughput, we measured the overall insertion throughput with increasing numbers of threads, to show how the CQF scales with multiple threads. Table 4 summarizes these data sets.

Data set                 Num distinct items   Max frequency
uniform-random           49927180             3
zipfian                  10186999             2559775
K-mer                    34732290             144203
FireHose (active-set)    17438241             24965994
FireHose (power-law)     85499                16663304

Table 4: Characteristics of the data sets used for the multi-threaded experiments. Each data set contains a total of 50M items.

6.2 In-RAM Performance
Figure 6 shows the in-memory performance of the RSQF, CQF, CF, and BF when inserting ≈ 67 million items.

The RSQF and CQF outperform the Bloom filter on all operations and are roughly comparable to the cuckoo filter. Our QF variants are slightly slower than the cuckoo filter for inserts and for lookups of existing items. They are faster than the CF for lookups of non-existent items at low load factors and slightly slower at high load factors. Overall, the CQF has lower throughput than the RSQF because of the extra overhead of the counter encoding.

6.3 On-SSD Performance
Figure 7 shows the insertion and lookup throughputs of the RSQF, CQF, and CF when inserting 1 billion items. For all three data structures, the size of the on-SSD data was roughly 2× the size of RAM.

The quotient filters significantly outperform the cuckoo filter on all operations because of their better cache locality. The cuckoo filter's insert throughput drops sharply as the data structure starts to fill up. This is because the cuckoo filter performs more kicks as the data structure becomes full, and each kick requires a random I/O. The cuckoo filter's lookup throughput is roughly half that of the quotient filters because the quotient filters need to look at only one location on disk, whereas the cuckoo filter needs to check two locations.

6.4 Performance with skewed data sets
Figure 8 shows the performance of the counting quotient filter and the counting Bloom filter on a data set with a Zipfian distribution, with Zipfian coefficient 1.5 and a universe of 201 million elements. We do not evaluate the cuckoo filter in this setting because it fails after ≈ 200 insertions: the cuckoo filter cannot handle more than 8 duplicates of any item, and Zipfian-distributed data contains many duplicates of the most common items.

The counting quotient filter is 6 to 10× faster than the counting Bloom filter on all operations and, as shown in Table 1, uses 30 times less space.

As explained in Section 4, the counting quotient filter encodes the counters in the slots instead of storing a separate copy for each occurrence of an item. Figure 9c shows the percentage of slots in use in the counting quotient filter during the experiment. Combined with Figure 8, this shows that even when the counting quotient filter is nearly full, i.e., most of its slots are in use, it still offers good performance on skewed data.

6.5 Applications
K-mer counting. Figure 9a shows the instantaneous insert throughput of the counting quotient filter. The throughput is similar to that in the Zipfian experiments, showing that the counting quotient filter performs well on real-world data sets.

FireHose. We benchmarked the instantaneous insertion throughput of the counting quotient filter on the data sets produced by the FireHose generators. In Figure 9a we show the insertion performance of the counting quotient filter for data from the active-set and power-law generators. Because of the large number of repetitions in the data set from the power-law generator, the insertion throughput remains very high throughout. For the active-set data set, the insertion throughput is similar to that in our experiments with uniformly random data.

6.6 Concurrency
Figure 9b shows the average insert throughput of the counting quotient filter with increasing numbers of threads for various data sets. The average insert throughput increases linearly with the number of threads for all data sets. Table 4 shows the prevalence of repetitions in these data sets. The data sets from FireHose (both active-set and power-law) contain many repetitions, and the counting quotient filter scales linearly with multiple threads even on these data sets. The multi-threading scheme explained in Section 5 not only reduces lock contention among multiple insertion threads but also amortizes the cost of acquiring a lock.

6.7 Mergeability
Table 5 shows the average merge throughput during a K-way merge of counting quotient filters with increasing K. The average merge throughput is always greater than the average insert throughput. This is because, during a merge, we insert hashes into the output counting quotient filter in increasing order, thereby avoiding any shifting of remainders. Although the average merge throughput is greater than the average insert throughput, the merge throughput decreases slightly as we increase K. This is because, with bigger K, we spend more time finding the smallest hash in each iteration. For very large values of K, one could use a min-heap to find the smallest hash quickly in each iteration; a sketch of the merge loop follows Table 5.

Number of CQFs (K)   Average merge throughput
2                    12.398565
4                    12.058525
8                    11.359184

Table 5: CQF K-way merge performance. All the CQFs to be merged are created with 16M slots and filled to 95% load factor. The merge throughput is in millions of items merged per second.
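To make the merge pass concrete, the sketch below merges K sorted streams of hashes and sizes the output filter by rounding the total occupancy up to a power of 2, as described in Section 6.1. The cursor type and the emit callback are our hypothetical stand-ins for iterating over the input CQFs and performing the ordered inserts; the real merge walks filter slots directly.

/* Sketch of a K-way merge over per-filter hash streams. Finding the
 * minimum by linear scan costs O(K) per emitted hash, which is why
 * throughput dips as K grows; a min-heap would reduce this to O(log K). */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint64_t *hashes;  /* one input filter's hashes, in sorted order */
    size_t len, pos;
} cursor;

/* Output size: total occupied slots, rounded up to the next power of 2. */
uint64_t output_slots(const cursor *in, int k) {
    uint64_t total = 0, n = 1;
    for (int i = 0; i < k; i++) total += in[i].len;
    while (n < total) n <<= 1;
    return n;
}

void kway_merge(cursor *in, int k, void (*emit)(uint64_t)) {
    for (;;) {
        int m = -1;
        for (int i = 0; i < k; i++)  /* linear scan for the smallest head */
            if (in[i].pos < in[i].len &&
                (m < 0 || in[i].hashes[in[i].pos] < in[m].hashes[in[m].pos]))
                m = i;
        if (m < 0) return;           /* all inputs exhausted */
        emit(in[m].hashes[in[m].pos++]);  /* ordered insert: no shifting */
    }
}

Because the hashes are emitted in increasing order, the output filter's inserts never shift existing remainders, which is what makes merge throughput exceed ordinary insert throughput.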
6.8 Impact of optimizations
Figure 10 shows the performance of two RSQF implementations: one using the fast x86 rank and select implementations described in Section 3, and one using portable C implementations.
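For reference, the sketch below shows the two single-word primitives being compared: RANK via POPCNT, and SELECT via PDEP followed by TZCNT, the technique described in Section 3 (it requires BMI2, i.e., Haswell or later, and compilation with flags such as -mbmi -mbmi2 -mpopcnt). The portable fallback is one plausible C implementation of SELECT; the exact fallback code benchmarked here may differ.

/* rank64(v, i): number of set bits of v in positions 0..i.
 * select64(v, j): position of the j-th (0-indexed) set bit of v. */
#include <stdint.h>
#include <x86intrin.h>

static inline uint64_t rank64(uint64_t v, int i) {
    return (uint64_t)_mm_popcnt_u64(v & (((uint64_t)2 << i) - 1));
}

static inline uint64_t select64(uint64_t v, int j) {
    /* Deposit a lone 1 at the j-th set position of v, then locate it. */
    return _tzcnt_u64(_pdep_u64(1ULL << j, v));
}

/* A plausible portable fallback: clear the lowest set bit j times. */
static inline uint64_t select64_c(uint64_t v, int j) {
    while (j-- > 0)
        v &= v - 1;
    return v ? (uint64_t)__builtin_ctzll(v) : 64;
}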

[Figure 6: three panels plotting million insertions, lookups, and false lookups per second (y-axis) against load factor (x-axis) for the RSQF, CQF, CF, and BF. (a) Inserts. (b) Successful lookups. (c) Uniformly random lookups.]
Figure 6: In-memory performance of the RSQF, CQF, CF, and BF on uniformly random items. The first graph shows the insert performance against changing load factor. The second graph shows the lookup performance for existing items. The third graph shows the lookup performance for uniformly random items. (Higher is better.)

[Figure 7: three panels plotting thousand insertions, lookups, and false lookups per second (y-axis) against load factor (x-axis) for the RSQF, CQF, and CF. (a) Inserts on SSD. (b) Successful lookups on SSD. (c) Uniformly random lookups on SSD.]
Figure 7: On-SSD performance of the RSQF, CQF, and CF on uniformly random inputs. The first graph shows the insert performance against changing load factor. The second graph shows the lookup performance for existing items. The third graph shows the lookup performance for uniformly random items. (Higher is better.)

The optimizations speed up lookups by a factor of 2 to 4, depending on the load factor. The optimizations speed up inserts less than lookups because inserts are bottlenecked by the time required to shift elements around, which does not involve performing rank or select operations.

Figure 10 also shows the insert and lookup performance of the original quotient filter alongside the RSQF. The original quotient filter's lookup throughput drops as it passes 60% load factor because a lookup must examine an entire cluster, and the average cluster size grows quickly as the load factor increases. RSQF lookup performance drops more slowly because a lookup needs to examine only a single run, and the average run size is bounded by a constant at any load factor.

Note that performance for the QF and RSQF on lookups of non-existent items drops for a different reason. Both filters first check Q.occupieds[h0(x)] during a lookup for x; if this bit is 0, they can immediately return false. When looking up elements that are in the filter, this fast path is never taken. When looking up non-existent items, the fast path is taken frequently at low load factors but less frequently at high load factors. As a result, for both filters, lookups of non-existent items start off very fast at low load factors and drop to roughly the same performance as lookups of existing items as the load factor increases, as can be seen in Figures 10b and 10c.

7. CONCLUSION
This paper shows that it is possible to build a counting data structure that offers good performance and saves space, regardless of the input distribution. Our counting quotient filter uses less space than other counting filters and, in many cases, uses less space than non-counting, membership-only data structures.

Our counting quotient filter also offers several features that are important for real applications. It has good data locality, so it can operate efficiently on SSD. Quotient filters can be merged to compute their union, a feature that has found wide use in parallel computing [2]. Mergeability also means the counting quotient filter can be used to build a write-optimized counting cascade filter, similar to the cascade filter in the original quotient filter paper.

Finally, we revealed a connection between the quotient filter's metadata and the RANK and SELECT operations widely used in other compact data structures. We then described an efficient method for implementing SELECT on 64-bit integers in the x86 instruction set. This technique may be of interest to other rank-and-select-based data structures.

8. ACKNOWLEDGMENTS
We gratefully acknowledge support from NSF grants BBSRC-NSF/BIO-1564917, IIS-1247726, IIS-1251137, CNS-1408695, CCF-1439084, and CCF-1617618, and from Sandia National Laboratories.

[Figure 8: three panels plotting million insertions, lookups, and false lookups per second (y-axis) against load factor from 0 to 50% (x-axis) for the CQF and CBF. (a) Inserts. (b) Successful lookups. (c) Uniformly random lookups.]
Figure 8: In-memory performance of the CQF and CBF on data with a Zipfian distribution. We don't include the CF in these benchmarks because the CF fails on a Zipfian input distribution. The load factor does not go to 95% in these experiments because load factor is defined in terms of the number of distinct items inserted in the data structure, which grows very slowly in skewed data sets. (Higher is better.)

[Figure 9: three panels. (a) CQF insert throughput (million insertions per second) against load factor. (b) CQF insert throughput (million insertions per second) against number of threads, 1 to 4. (c) Percentage of slots in use against millions of distinct items inserted.]
Figure 9: In-memory performance of the counting quotient filter with real-world data sets and with multiple threads, and percent slot usage with a skewed distribution. (a) CQF in-memory insert performance on application data sets. (Higher is better.) (b) CQF multi-threaded in-memory insert performance. (Higher is better.) (c) Percent of slots in use in a counting quotient filter vs. the number of distinct items inserted from a Zipfian distribution with C = 1.5 and a universe of 201M; we performed a total of 201M inserts.

[Figure 10: three panels plotting million insertions, lookups, and false lookups per second (y-axis) against load factor (x-axis) for the RSQF (x86), RSQF (C), and QF. (a) Inserts. (b) Successful lookups. (c) Uniformly random lookups.]
Figure 10: In-memory performance of the RSQF implemented with x86 pdep and tzcnt instructions, the RSQF with C implementations of rank and select, and the original QF, all on uniformly random items. The first graph shows the insert performance against changing load factor. The second graph shows the lookup performance for existing items. The third graph shows the lookup performance for uniformly random items. (Higher is better.)

9. REFERENCES
[1] F. vesca genome read dataset. ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA020/SRA020125/SRX030576/SRR072006.fastq.bz2. [Online; accessed 19-February-2016].
[2] P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and K. Yi. Mergeable summaries. ACM Transactions on Database Systems (TODS), 38(4):26, 2013.
[3] P. S. Almeida, C. Baquero, N. Preguiça, and D. Hutchison. Scalable Bloom filters. Journal of Information Processing Letters, 101(6):255–261, 2007.
[4] S. Alsubaiee, A. Behm, V. Borkar, Z. Heilbron, Y.-S. Kim, M. J. Carey, M. Dreseler, and C. Li. Storage management in AsterixDB. Proceedings of the VLDB Endowment, 7(10):841–852, 2014.
[5] M. A. Bender, M. Farach-Colton, R. Johnson, R. Kraner, B. C. Kuszmaul, D. Medjedovic, P. Montes, P. Shetty, R. P. Spillane, and E. Zadok. Don't thrash: How to cache your hash on flash. Proceedings of the VLDB Endowment, 5(11), 2012.
[6] M. A. Bender, M. Farach-Colton, and M. A. Mosteiro. Insertion sort is O(n log n). Theory of Computing Systems, 39(3):391–397, 2006. Special issue on FUN '04.
[7] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
[8] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese. An improved construction for counting Bloom filters. In European Symposium on Algorithms (ESA), pages 684–695. Springer, 2006.
[9] A. Broder and M. Mitzenmacher. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4):485–509, 2004.
[10] M. Canim, G. A. Mihaila, B. Bhattacharjee, C. A. Lang, and K. A. Ross. Buffered Bloom filters on solid state storage. In Proceedings of the International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures (ADMS), pages 1–8, 2010.
[11] S. Cohen and Y. Matias. Spectral Bloom filters. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pages 241–252, 2003.
[12] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
[13] B. Corominas-Murtra and R. V. Solé. Universality of Zipf's law. Physical Review E, 82(1):011102, 2010.
[14] B. Debnath, S. Sengupta, J. Li, D. J. Lilja, and D. H. Du. BloomFlash: Bloom filter on flash-based storage. In Proceedings of the 31st International Conference on Distributed Computing Systems (ICDCS), pages 635–644, 2011.
[15] B. K. Debnath, S. Sengupta, and J. Li. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proceedings of the USENIX Annual Technical Conference (ATC), 2010.
[16] B. Fan. Cuckoo filter source code in C++. https://github.com/efficient/cuckoofilter, 2014. [Online; accessed 19-July-2014].
[17] B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher. Cuckoo filter: Practically better than Bloom. In Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies, pages 75–88, 2014.
[18] L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking (TON), 8(3):281–293, 2000.
[19] A. Geil. Quotient filters: Approximate membership queries on the GPU. http://on-demand.gputechconf.com/gtc/2016/presentation/s6464-afton-geil-quoetient-filters.pdf, 2016.
[20] E. Georganas, A. Buluç, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick. Parallel de Bruijn graph construction and traversal for de novo genome assembly. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 437–448, 2014.
[21] R. González, S. Grabowski, V. Mäkinen, and G. Navarro. Practical implementation of rank and select queries. In Poster Proceedings Volume of the 4th Workshop on Efficient and Experimental Algorithms (WEA), pages 27–38, 2005.
[22] K. Anderson and S. Plimpton. FireHose. http://firehose.sandia.gov/, 2013. [Online; accessed 19-Dec-2015].
[23] G. Lu, B. Debnath, and D. H. Du. A forest-structured Bloom filter with flash memory. In Proceedings of the 27th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–6, 2011.
[24] P. Melsted and J. K. Pritchard. Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinformatics, 12(1):1, 2011.
[25] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.
[26] A. Partow. C++ Bloom filter library. https://code.google.com/p/bloom/. [Online; accessed 19-July-2014].
[27] F. Putze, P. Sanders, and J. Singler. Cache-, hash- and space-efficient Bloom filters. In Proceedings of the International Workshop on Experimental and Efficient Algorithms, pages 108–121, 2007.
[28] Y. Qiao, T. Li, and S. Chen. Fast Bloom filters and their generalization. IEEE Transactions on Parallel and Distributed Systems (TPDS), 25(1):93–103, 2014.
[29] R. S. Roy, D. Bhattacharya, and A. Schliep. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics, 30:1950–1957, 2014.
[30] S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz. Theory and practice of Bloom filters for distributed systems. IEEE Communications Surveys & Tutorials, 14(1):131–155, 2012.
[31] M. Vallentin. Counting Bloom filter source code in C++. https://github.com/mavam/libbf, 2014. [Online; accessed 19-July-2015].
[32] P. Wang, G. Sun, S. Jiang, J. Ouyang, S. Lin, C. Zhang, and J. Cong. An efficient design and implementation of LSM-tree based key-value store on open-channel SSD. In Proceedings of the 9th European Conference on Computer Systems (EuroSys), pages 16:1–16:14, 2014.
[33] Q. Zhang, J. Pell, R. Canino-Koning, A. C. Howe, and C. T. Brown. These are not the k-mers you are looking for: Efficient online k-mer counting using a probabilistic data structure. PLoS One, 9(7):e101271, 2014.
[34] B. Zhu, K. Li, and R. H. Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST), pages 1–14, 2008.
