A General-Purpose Counting Filter: Making Every Bit Count

A General-Purpose Counting Filter: Making Every Bit Count Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY Approximate Membership Query (AMQ) insert(X) AMQ isMember(X) yes/no • An AMQ is a lossy representation of a set. • Operations: inserts and membership queries. • Compact space: • Often taking < 1 byte per item. • Comes at the cost of occasional false positives. Bloom filter [Bloom, 1970] m 0 0 0 0 0 0 • A Bloom filter is a bit-array + k hash functions. (Here k=2.) Insertions in a Bloom filter m 1 0 1 0 0 0 W • A Bloom filter is a bit-array + k hash functions. (Here k=2.) Insertions in a Bloom filter m 1 0 1 1 0 0 W X • A Bloom filter is a bit-array + k hash functions. (Here k=2.) Insertions in a Bloom filter m 1 0 1 1 0 1 W X Y • A Bloom filter is a bit-array + k hash functions. (Here k=2.) Membership query in a Bloom filter query(W) 1 0 1 1 0 1 W X Y • The Bloom filter has a bounded false-positive rate. Membership query in a Bloom filter YES query(W) 1 0 1 1 0 1 W X Y • The Bloom filter has a bounded false-positive rate. Membership query in a Bloom filter YES query(W) query(U) 1 0 1 1 0 1 W X Y • The Bloom filter has a bounded false-positive rate. Membership query in a Bloom filter YES NO query(W) query(U) 1 0 1 1 0 1 W X Y • The Bloom filter has a bounded false-positive rate. Membership query in a Bloom filter YES NO query(W) query(U) query(Z) 1 0 1 1 0 1 W X Y • The Bloom filter has a bounded false-positive rate. Membership query in a Bloom filter YES NO YES false positive query(W) query(U) query(Z) 1 0 1 1 0 1 W X Y • The Bloom filter has a bounded false-positive rate. Bloom filters are ubiquitous Streaming applications Networking Databases Computational biology Storage systems Counting filters: AMQs for multisets insert(X) delete(X) Counting filter getCount(X) count • A counting filter is a lossy representation of a multiset. • Operations: inserts, count, and delete. • Generalizes AMQs • False positives ≈ over-counts. Why is counting important? • Counting filters have numerous applications: • Computational biology, e.g., k-mer counting. • Network anomaly detection. • Natural language processing, e.g., n-gram counting. • Counting enables AMQs to support deletes. Many real data sets have skewed counts 109 108 NumberNumber ofof items:items: ~19.6B~19.6B 107 NumberNumber ofof distinctdistinct items:items: ~1.1B~1.1B 106 105 104 103 102 101 100 Number of items with the frequency the with Number of items 100 101 102 103 104 105 106 107 Frequency Counting filters should handle skewed data sets efficiently. Many real data sets have skewed counts 109 108 NumberNumber ofof items:items: ~19.6B~19.6B 107 NumberNumber ofof distinctdistinct items:items: ~1.1B~1.1B 106 105 104 103 102 101 100 Number of items with the frequency the with Number of items 100 101 102 103 104 105 106 107 Frequency Counting filters should handle skewed data sets efficiently. Counting Bloom filters [Fan et al., 2000] insert(W,4) insert(X,3) insert(Y,1) 4 0 7 4 0 1 • Counters must be large enough to hold count of most frequent item. • Counting Bloom filters are not space-efficient for skewed data sets. Counting Bloom filters [Fan et al., 2000] insert(W,4) insert(X,3) insert(Y,1) RNA-seq dataset Total number of items: 19.6 Billion Number of distinct items: 1.1 Billion 4 Maximum0 frequency:7 4 ~8 Million0 1 Space usage of a CBF: ~38GB • Counters must be large enough to hold count of most frequent item. • Counting Bloom filters are not space-efficient for skewed data sets. This paper: The counting quotient filter (CQF) • A replacement for the (counting) Bloom filter. • Space and computationally efficient. • Uses variable-sized counters to handle skewed data sets efficiently. CQF space ≤ BF space + O(∑} log c(x)) x∈S Asymptotically optimal This paper: The counting quotient filter (CQF) • A replacement RNA-seqfor the (counting) dataset Bloom filter. Total number of items: 19.6 Billion • Space Numberand computationally of distinct items: efficient 1.1 Billion. • Uses variable-sizedMaximum frequency: counters ~8 to Millionhandle skewed data Spacesets efficiently. usage of a CQF: ~2.5GB CQF space ≤ BF space + O(∑} log c(x)) x∈S Asymptotically optimal Other features of the CQF • Smaller than many non-counting AMQs • Bloom, cuckoo [Fan et al., 2014], and quotient [Bender et al., 2012] filters. • Good cache locality • Deletions • Dynamically resizable • Mergeable Contributions • New quotient filter metadata scheme • Smaller and faster than original quotient filter • Efficient variable-length counter encoding method • Zero overhead for counters • Fast implementation of bit-vector select on words • Exploits new x86 bit-manipulation instructions Quotienting: An alternative to Bloom filters • Store fingerprint compactly in a hash table. • Take a fingerprint h(x) for each element x. x h(x) • Only source of false positives: • Two distinct elements x and y, where h(x) = h(y). • If x is stored and y isn’t, query(y) gives a false positive. Storing compact fingerprints Bucket index h(x) b(x) t(x) Tag b(u) q r 0 1 • b(x) = location in the hash table 2 • t(x) = tag stored in the hash table 2q 3 4 t(u) 5 6 Storing compact fingerprints Bucket index h(x) b(x) t(x) Tag b(u) q r 0 b(v) 1 • b(x) = location in the hash table 2 • t(x) = tag stored in the hash table 2q 3 Collisions in the hash table? 4 t(u) t(v) 5 6 Storing compact fingerprints Bucket index h(x) b(x) t(x) Tag b(u) q r 0 b(v) 1 • b(x) = location in the hash table 2 • t(x) = tag stored in the hash table 2q 3 Collisions in the hash table? 4 t(u) Linear probing. t(v) 5 t(v) 6 Storing compact fingerprints Bucket index h(x) b(x) t(x) Tag b(u) q r 0 • The home bucket for b(v) 1 t(u) and t(v) is 4. 2 2q 3 4 t(u) t(v) 5 t(v) Does t(v) belongs to bucket 4 or 5 ? 6 Resolving collisions in the CQF • CQF uses two metadata bits to resolve collisions and identify the home bucket. 1 1 t(u) t(v) t(w) t(x) t(y) • The metadata bits group tags by their home bucket. • Metadata scheme supports efficient inserts/deletes. Resolving collisions in the CQF • CQF uses two metadata bits to resolve collisions and identify the home bucket. insert v 1 1 t(u) t(v) t(v) t(w) t(x) t(y) • The metadata bits group tags by their home bucket. • Metadata scheme supports efficient inserts/deletes. Resolving collisions in the CQF • CQF uses two metadata bits to resolve collisions and identify the home bucket. insert v 1 1 t(u) t(v) t(v) t(w) t(x) t(y) • The metadata bits group tags by their home bucket. The• Metadata metadata scheme bits supports enable efficient us toinserts/deletes. identify the slots holding the contents of each bucket. Counting quotient filter (CQF) Implementation: Abstract Representation q 2 Meta-bits per slot. 2 0 1 2 3 4 5 6 7 h(x) --> h0(x) || h1(x) runends occupieds 2q Counting quotient filter (CQF) Implementation: Abstract Representation q 2 Meta-bits per slot. 2 0 1 2 3 4 5 6 7 h(x) --> h0(x) || h1(x) h(a) runends occupieds 1 1 h1(a) 2q Counting quotient filter (CQF) Implementation: Abstract Representation q 2 Meta-bits per slot. 2 0 1 2 3 4 5 6 7 h(x) --> h0(x) || h1(x) h(a) runends h(b) occupieds 1 1 h1(a) h1(b) 2q Counting quotient filter (CQF) Implementation: Abstract Representation q 2 Meta-bits per slot. 2 0 1 2 3 4 5 6 7 h(x) --> h0(x) || h1(x) h(a) h(d) runends h(b) occupieds 1 1 1 1 h1(a) h1(b) h1(d) 2q Counting quotient filter (CQF) Implementation: Abstract Representation q 2 Meta-bits per slot. 2 0 1 2 3 4 5 6 7 h(x) --> h0(x) || h1(x) h(a) h(d) runends h(b) h(e) occupieds 1 1 1 1 h1(a) h1(b) h1(d) h1(e) 2q Counting quotient filter (CQF) Implementation: Abstract Representation q 2 Meta-bits per slot. 2 0 1 2 3 4 5 6 7 h(x) --> h0(x) || h1(x) h(a) h(d) runends h(b) h(e) occupieds h(c) 1 1 1 1 h1(a) h1(b) h1(c) h1(d) h1(e) 2q Counting quotient filter (CQF) Implementation: Abstract Representation q 2 Meta-bits per slot. 2 0 1 2 3 4 5 6 7 h(x) --> h0(x) || h1(x) h(a) h(d) h(f) runends h(b) h(e) occupieds h(c) 1 1 1 1 1 1 h1(a) h1(b) h1(c) h1(d) h1(e) h1(f) 2q Metadata operations run occupieds 1 1 1 1 0 1 2 3 4 5 6 7 runends Rank(occupieds, 3) = 2 Select(runends, 2) = 5 • Can accelerate metadata operations using x86 bit-manipulation instructions.

Load more