Stacked Filters: Learning to Filter by Structure

Kyle Deeds∗, Brian Hentschel∗, Stratos Idreos
Harvard University
[email protected], [email protected], [email protected]

∗Both authors contributed equally to the paper.

ABSTRACT

We present Stacked Filters, a new probabilistic filter which is fast and robust similar to query-agnostic filters (such as Bloom and Cuckoo filters), and at the same time brings low false positive rates and sizes similar to classifier-based filters (such as Learned Filters). The core idea is that Stacked Filters incorporate workload knowledge about frequently queried non-existing values. Instead of learning, they structurally incorporate that knowledge using hashing and several sequenced filter layers, indexing both data and frequent negatives. Stacked Filters can also gather workload knowledge on-the-fly and adaptively build the filter. We show experimentally that for a given memory budget, Stacked Filters achieve end-to-end query throughput up to 130x better than the best alternative for a workload, either query-agnostic or classifier-based filters, depending on where the data resides (SSD or HDD).

PVLDB Reference Format:
Kyle Deeds, Brian Hentschel, Stratos Idreos. Stacked Filters: Learning to Filter by Structure. PVLDB, 14(4): 600 - 612, 2021. doi:10.14778/3436905.3436919

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://bitbucket.org/HarvardDASlab/stackedfilters.

1 LEARNING FILTERS BY STRUCTURE

Filters are Everywhere. The storage and retrieval of values in a data set is one of the most fundamental operations in computer science. For large data sets, the raw data is usually stored over a slow medium (e.g., on disk) or distributed across the nodes of a network. Because of this, it is critical for performance to limit accesses to the full data set. That is, applications should be able to avoid accessing slow disk or remote nodes when querying for values that are not present in the data. This is the exact utility of approximate membership query (AMQ) structures, also referred to as filters. Filters have tunably small sizes, so that they fit in memory, and provide probabilistic answers to whether a queried value exists in the data set, with no false negatives and a limited number of false positives. For all filters, the probability of returning a false positive, known as the false positive rate (FPR), and the space used are competing goals. Filters are used in a large number of diverse applications such as web indexing [24], web caching [21], prefix matching [18], deduping data [17], DNA classification [42], detecting flooding attacks [22], cryptocurrency transaction verification [23], distributed joins [38, 40], and LSM-tree based key-value stores [14], amongst many others [43].

Robust Query-Agnostic Designs. Query-agnostic filters utilize only the data during construction. For example, a Bloom Filter [5] starts with an array of bits set to 0 and, using multiple hash functions per value, hashes each value to various positions, setting those bits to 1. A query for a value 푥 probes the filter by hashing 푥 using the same hash functions and returns positive if all positions 푥 hashes to are set to 1. When querying a value that exists in the data set, these bits were set to 1 during construction and so there are no false negatives; however, when querying a value not in the data set, a false positive can occur if the value is hashed entirely to positions which were set to 1 during construction.

Other query-agnostic filters such as Cuckoo [20], Quotient [34], and Xor [25] filters work similarly to Bloom Filters. These filters generally work by hashing data values and storing the resulting fingerprints losslessly in a hash table (for instance, using cuckoo hashing for Cuckoo Filters or linear probing for Quotient Filters). If the hash functions are truly random, then every queryable value not in the data set has the same false positive chance. This makes query-agnostic filters robust, as they have the same expected performance across all workloads, and easy to deploy, as they require no workload knowledge. However, at the same time, this limits the performance of query-agnostic filters, as they are required to work for any query distribution. Additionally, the possibility for further improvements in the trade-off between space and false positive rate is limited, as current query-agnostic filters are close to their theoretical lower bound in size [7, 11].
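To make the Bloom Filter mechanics above concrete, here is a minimal Python sketch of our own, for illustration only; it is not the implementation evaluated in this paper and ignores the bit-packing and hashing choices of production filters:

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.bits = [0] * num_bits          # the array of bits, all set to 0
            self.num_hashes = num_hashes

        def _positions(self, value):
            # Derive num_hashes positions from seeded hashes of the value.
            for seed in range(self.num_hashes):
                digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % len(self.bits)

        def insert(self, value):
            for pos in self._positions(value):
                self.bits[pos] = 1

        def query(self, value):
            # Accept only if every position is set: there are no false negatives,
            # but a false positive occurs when all positions happen to be set.
            return all(self.bits[pos] for pos in self._positions(value))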
Learning from Queries for Reduced Filter Size. Classifier-based filters such as Weighted Bloom Filters [8, 45], Ada-BF [13], and Learned Bloom Filters [28, 39] utilize workload knowledge, making it possible to move beyond the theoretical limits of query-agnostic filters. Such filters take as input a sample of past queries and use it to train a classifier that models how likely every possible value is to 1) be queried, and 2) exist in the data set. The classifier is then used in one of two ways.

In the first [28, 39], it acts as a module which accepts values that have a high weighted probability of being in the data. It cannot reject values, as the stochasticity of the classifier might cause false negatives. Thus, a query-agnostic filter is also built using the (few) values in the data set for which the classifier returns a false negative. Queries are first evaluated by the classifier, and if rejected, then probe the query-agnostic filter. In the second [8, 13, 45], the classifier uses the weighted probability of being in the set to control the number of hash functions used by a Bloom filter for each value. For values with a high likelihood of being in the set, few hash functions are used, setting fewer bits in the filter but also checking fewer bits, and therefore providing fewer chances to catch a false positive. For values not likely to be in the set, this is reversed.
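As an illustration of the first design, the query path can be sketched as below; `classifier` and `backup_filter` are hypothetical objects standing in for the trained model and the query-agnostic filter over the classifier's false negatives:

    def learned_filter_query(x, classifier, threshold, backup_filter):
        # Accept outright if the classifier is confident x is a key.
        if classifier.predict_proba(x) >= threshold:
            return True
        # The classifier alone cannot reject: x may be one of the (few)
        # keys it misclassifies, so consult the backup filter built on them.
        return backup_filter.query(x)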
Classifier-based filters can reduce the required memory to achieve a given false positive rate when keys and non-keys have clear semantic differences. URL Blacklisting is such a use case [28], wherein browsers such as Google Chrome maintain a list of dangerous websites and alert users before visiting one. Browsers store a filter containing the dangerous websites at the client. For each web request, the client checks the filter; if the filter rejects the query, the website is safe and can be visited. If the filter accepts the query, an expensive full check against a remote list of dangerous websites is done. Because there are no false negatives, every dangerous website is caught, and because there are only a few false positives, most safe websites do not incur the extra work.

However, classifier-based filters introduce a host of new problems. First, the classifier has to be accurate, which can be hard: for some workloads the data value and its probability of existing in the data set are loosely correlated. For other types of data that appear in practice, such as hashed IDs, there is no correlation. Additionally, even when data patterns exist, data such as textual keys often have complex decision boundaries which require complex models such as neural networks for accurate classification. As a result, the computational expense of the classifier is often orders of magnitude higher than hashing. Finally, the classifier is trained on a specific sample workload, and thus if the workload shifts, we need to go through the expensive process of gathering sample queries and retraining the classifier to maintain good performance.

Stacked Filters: Encapsulating workload information structurally. We introduce a new class of filters, Stacked Filters, which structurally encapsulate knowledge about frequently queried non-existing values. The key intuition is as follows: for non-keys which are queried often, find a structural way to run them through multiple filter checks. Stacked Filters achieve that through several layers of query-agnostic filters which alternate between representing values in the data and frequently queried non-existing values. All frequently queried non-existing values need to pass multiple membership checks to be false positives, and so they incur exponentially smaller false positive rates.
At the same time, each additional layer is exponentially smaller in size, and thus the total size of a Stacked Filter is comparable to the first filter in its stack. A similar pattern holds for computational costs. Both size and computational costs rise like a geometric series with the number of layers, and thus have values close to those of a single filter, while an entire set of non-existing values has its FPR decrease arbitrarily close to zero. The overall result is that for workloads with any frequently queried non-existing values, Stacked Filters provide a superior tradeoff between false positive rate, filter size, and computation than either classifier-based filters or query-agnostic filters.

While the idea of checking frequently queried non-existing values multiple times is intuitive, it comes with significant challenges. How many layers are needed for good filter performance? How should the memory budget be spread across the layers? How much workload knowledge is enough for good filter performance? Can we build Stacked Filters without any workload knowledge?

Contributions. The contributions of this paper are as follows:
• Formalization: We introduce a new way to design workload-aware filters as multi-layer filter structures which index both positives and frequent negatives.
• Generalization: We show that Stacked Filters work for all query-agnostic filters including Bloom, Cuckoo, and Quotient Filters.
• Better trade-off of FPR and size: We derive the metric equations of Stacked Filters for size, computation, and false positive rate. Using these equations, we provide theoretical results showing Stacked Filters are strictly better in terms of FPR vs. size than query-agnostic filters on the majority of workloads, and quantify the expected benefit.
• Optimization: We show that the optimization problem of tuning the number of layers and layer sizes is non-convex. Still, we provide 휖-approximation algorithms, running in the order of milliseconds, which automatically tune the number of layers and the individual sizes of each layer so that performance is arbitrarily close to optimal.
• Adaptivity: We show that the benefits of Stacked Filters can be extended to Stacked Filters built adaptively. Here, Stacked Filters start with a rough knowledge of how skewed a workload is, but not which values are frequently queried, and build their structure incrementally during normal query execution.
• Experiments: Using URL blacklisting, a networking benchmark, and synthetic experiments, we show that Stacked Filters 1) provide improvements in FPR of up to 100× over the best alternative query-agnostic or classifier-based filter for the same memory budget, while retaining good robustness properties, 2) provide a superior tradeoff between false positive rate, size, and computation than all other filters, resulting in up to 130× better end-to-end query throughput than the best alternative, and 3) for scenarios where learning is not easy, Stacked Filters can still utilize workload knowledge and offer throughput up to 1000× better than classifier-based filters.

2 NOTATION AND METRICS

We first introduce notation used throughout the paper and metrics that are critical for describing the behavior of filters. Table 1 lists the key variables and metrics.

Notation. Let 푈 be the universe of possible data values, such as the domain of strings or integers. Let 푃 be a data set, which we will refer to as the positive set, and let 푁 = 푈 − 푃 be the set of negative values. From now on we will refer to data values as well as to queried values as elements, which is the traditional terminology in the filters literature. We will denote filter structures by 퐹, which we treat as a function from 푈 → {0, 1}, and we say that 푥 is accepted by 퐹 if 퐹(푥) = 1 and that 푥 is rejected by 퐹 if 퐹(푥) = 0.

As a filter, we have 퐹(푥) = 1 ∀푥 ∈ 푃 and we are interested in minimizing the number of false positives, which are the events 퐹(푥) = 1, 푥 ∉ 푃. The filter 퐹 is itself random; different instantiations of 퐹 produce different data structures, either because the hash functions used have randomly chosen parameters or because the machine learning model used in classifier-based filters is stochastic.

Expected False Positive Bound (EFPB). A traditional guarantee for a filter 퐹 is to bound 퐸_퐹[P(퐹(푥) = 1 | 푥 ∉ 푃)] for any 푥 chosen independently of the creation of 퐹. We call this bound the expected false positive bound.

Expected False Positive Rate (EFPR). Given a distribution 퐷 over 푈 which captures the query probabilities for elements in 푈, the expected false positive rate is 퐸_{푥∼퐷}[퐸_{퐹|퐷}[P(퐹(푥) = 1 | 푥 ∉ 푃)]].
Table 1: Notation used throughout the paper

Notation | Definition
푃 | Set of all positive elements
푁 | Set of all negative elements
푁푓 | Negatives used to construct a Stacked Filter
푁푖 | The complement of 푁푓, i.e. 푁 \ 푁푓
푠 | Size of a filter in bits/element
푐푖,푆 | Cost of inserting an element from set 푆
푐푞,푆 | Cost of querying a filter for an element in 푆

Metric | Definition
EFPR | Expected false positive rate of a filter structure given a specific query distribution
EFPB | Upper bound on the EFPR of a filter structure for queries chosen independently of the filter

Optimization via Expected False Positive Rate. Query throughput most directly relies on the EFPR, and since the EFPR depends on the query distribution 퐷, the goal of workload-aware filters is to capture and utilize 퐷 to improve the EFPR. Namely, for 푥 ∈ 푁 with a higher chance of being queried, workload-aware filters should lower the probability that 푥 is a false positive.

Robustness via Bounding False Positive Probability. Optimizing the EFPR helps system throughput but brings concerns about workload shift. Query-agnostic filters can act as a safeguard against such a shift. To see this, note that 퐹 and 퐷 are independent by assumption and so for a query-agnostic filter with FPR 휖,

$$E_{x\sim D}\big[E_{F|D}[\mathbb{P}(F(x)=1 \mid x \notin P)]\big] \;\le\; E_{x\sim D}\big[E_{F|D}[\epsilon]\big] \;=\; \epsilon$$

regardless of what 퐷 is. Thus, filters which provide an expected false positive bound provide an upper bound on the expected false positive rate for any workload 퐷 chosen independently of the filter.

Memory - False Positive Tradeoff. For all filter structures, the EFPR and EFPB can be made arbitrarily close to 0 with enough memory, and there exists a tradeoff between the memory required and the false positive rate provided. Thus for purposes of comparison, we always report EFPR and EFPB with respect to a space budget. Space budgets in practice tend to be between 6 and 14 bits per element, and are significantly smaller than the elements they represent (which can be anywhere from 4 bytes to several megabytes).

For query-agnostic filters, the EFPR and the EFPB are equal, and the common value is just called the false positive rate. Additionally, for all query-agnostic filters, the false positive rate 훼 and the size in bits per element 푠 are 1-1 functions of each other. When going from one to the other, we denote the quantities by 푠(훼) and 훼(푠), which denote the size for a given FPR and the FPR for a given size respectively.
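As one concrete instance of this 1-1 mapping, the well-known approximation for a standard Bloom filter can be coded as follows (a sketch; other filters have their own 푠(훼)):

    import math

    def bloom_size_for_fpr(alpha):
        # s(alpha): bits per element for a Bloom filter at false positive rate alpha.
        return math.log2(1 / alpha) / math.log(2)

    def bloom_fpr_for_size(s):
        # alpha(s): the inverse mapping.
        return 2 ** (-s * math.log(2))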
Computational Performance. Filter structures desire computational performance much faster than the cost to access the data they protect. We denote the cost to insert into a filter an element of set 푆 by 푐푖,푆. We also denote the cost to query a filter for an element of set 푆 by 푐푞,푆. If no set is denoted, then 푆 = 푈.

3 STACKED FILTERS

The traditional view of filters is that they are built on a set 푆, and return no false negatives for 푆. An alternative view of a filter is that it returns that an element is certainly in 푆̄ (the complement of 푆), or that an element's set membership is unknown. In Stacked Filters, we use this way of thinking about filters, with 푆 for different layers of the stack alternating between subsets of 푃 and subsets of 푁, to iteratively prune the set of elements in 푈 whose set membership is undecided.

Stacked Filters by Example. We start with an example of a 3-layer Stacked Filter using Figure 1. The filter is given the data set 푃 and a set of frequently queried negatives 푁푓. The first filter in the stack, 퐿1, is constructed using 푃 similarly to a traditional filter, except with fewer bits per element so as to reserve space for subsequent layers. Conceptually, 퐿1 partitions the universe 푈. Items that 퐿1 rejects are known to be in 푁 and can be rejected by the Stacked Filter. Items accepted by 퐿1 can have set membership of 푃 or 푁 and thus their status is unknown. If the Stacked Filter ended here after a single filter, as is the case for all query-agnostic filters, all undecided elements would be accepted by the Stacked Filter.

Instead, Stacked Filters construction continues by probing 퐿1 for each element in 푁푓. Using all elements of 푁푓 accepted by 퐿1, which therefore normally would become false positives, Stacked Filters build a second layer with another query-agnostic filter. During a query, values which are still undecided after 퐿1 are passed to 퐿2. If 퐿2 rejects the value, the value is definitely in 푁̄푓 (the complement of 푁푓), which includes both 푃 and 푁 \ 푁푓; we denote 푁 \ 푁푓 by 푁푖 and call it the infrequently queried negative set. Since 푃 ∪ 푁푖 contains both positives and negatives, the overall Stacked Filter accepts all the rejected elements of 퐿2 in order to maintain a zero false negative rate. If the element is instead accepted by 퐿2, then its set membership is still undecided and so it continues down the stack.

Construction then continues by querying 퐿2 for all elements in 푃 and building a third layer with a query-agnostic filter. This layer uses as input all elements of 푃 whose set membership is undecided after querying 퐿2. At query time, 퐿3 performs the same operations as 퐿1; elements rejected by 퐿3 are certainly in 푃̄ = 푁 and so are rejected by the Stacked Filter.

Workload-aware Design. 퐿2 and 퐿3 are how Stacked Filters structurally incorporate workload knowledge. They collaborate to filter out frequent negatives to minimize FPR. All frequent negatives that are false positives on 퐿1 reach 퐿3 since they are in the construction set for 퐿2, and so such frequent negatives need to pass an extra membership check to be false positives for the full Stacked Filter. To make deeper Stacked Filters, and thus perform more checks on frequently queried negatives, we recursively perform this process, adding more paired sets of layers.

An intuitive understanding of the effectiveness of Stacked Filters comes from the interplay between the extra size of additional layers vs. their benefit for FPR. Compare the simple 3-layer example above, assuming each layer has an FPR of 0.01, with a single traditional filter using FPR 0.01. If |푁푓| = |푃|, then 퐿2 and 퐿3 are on average 1/100 the size of 퐿1, and the Stacked Filter has 2% higher space costs than a traditional filter. But for every element in 푁푓, the extra membership check makes their FPR a full 100× lower; thus if 푁푓 contains any significant portion of the query distribution, the Stacked Filter has a much lower EFPR.
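The arithmetic of this example is easy to check directly (a sketch using the numbers assumed above):

    alpha = 0.01                     # FPR of each of the three layers
    fpr_frequent = alpha ** 2        # N_f must slip past both L1 and L3: 0.0001
    fpr_infrequent = alpha * (1 - alpha) + alpha ** 3   # rejected at L2, or passes all layers
    space_overhead = 2 * alpha       # L2 and L3 are each ~1/100 the size of L1, if |N_f| = |P|
    print(fpr_frequent, fpr_infrequent, space_overhead)  # 1e-4, ~0.0099, 0.02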

Figure 1: Stacked Filters are built in layer order, with each layer containing either positives or negatives. The set of elements to be encoded at each layer decreases as construction progresses down the stack. For queries, the layers are queried in order and the element is accepted or rejected based on whether the first layer to return "not present" is negative or positive.

General Stacked Filters Construction. Algorithm 1 shows the full construction algorithm. Like both query-agnostic and classifier-based filters, Stacked Filters need two inputs: 1) the data set, and 2) a constraint (memory budget or a desired maximum EFPR).

Algorithm 1 ConstructStackedFilter(푆푃, 푊, 퐶)
Input: 푆푃, 푊, 퐶: data set, workload, constraint
 1: 푆푁푓, {훼푖} = OptimizeSF(W, C)  // In Sec. 5
 2: // Construct the layers in the filter sequentially.
 3: for 푖 = 1 to 푇퐿 do
 4:   푆푟 = {}
 5:   if 푖 mod 2 = 1 then  // layer positive
 6:     construct 퐿푖 w/ space 푠(훼푖) ∗ |푆푃|
 7:     for 푥푝 ∈ 푆푃 do
 8:       퐿푖.insert(푥푝)
 9:     for 푥푛 ∈ 푆푁푓 do
10:       if 퐿푖(푥푛) = 1 then
11:         푆푟 = 푆푟 ∪ {푥푛}
12:     푆푁푓 = 푆푟
13:   else  // layer negative
14:     allocate 퐿푖 w/ space 푠(훼푖) ∗ |푆푁푓|
15:     for 푥푛 ∈ 푆푁푓 do
16:       퐿푖.insert(푥푛)
17:     for 푥푝 ∈ 푆푃 do
18:       if 퐿푖(푥푝) = 1 then
19:         푆푟 = 푆푟 ∪ {푥푝}
20:     푆푃 = 푆푟
21: return F

Algorithm 2 Query(푥)
Input: 푥: the element being queried
1: // Iterate through the layers until one rejects 푥.
2: for 푖 = 1 to 푇퐿 do
3:   if 퐿푖(푥) = 0 then
4:     if 푖 mod 2 = 1 then
5:       return reject  // Layer positive, reject 푥
6:     else return accept  // Layer negative, accept 푥
7: return accept  // No layer rejected, accept 푥
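A compact Python rendering of Algorithms 1 and 2 follows (a sketch: `make_filter(elements, alpha)` is an assumed factory for a query-agnostic filter with FPR `alpha`, and the optimizer of Section 5 is taken as having already produced `alphas` and the frequent negatives):

    def construct_stacked_filter(positives, frequent_negatives, alphas, make_filter):
        layers = []
        pos, neg = list(positives), list(frequent_negatives)
        for i, alpha in enumerate(alphas, start=1):
            if i % 2 == 1:                                  # positive layer
                layer = make_filter(pos, alpha)
                neg = [x for x in neg if layer.query(x)]    # survivors of N_f continue down
            else:                                           # negative layer
                layer = make_filter(neg, alpha)
                pos = [x for x in pos if layer.query(x)]    # survivors of P continue down
            layers.append(layer)
        return layers

    def query_stacked_filter(layers, x):
        for i, layer in enumerate(layers, start=1):
            if not layer.query(x):
                # First rejection decides: positive layer -> reject,
                # negative layer -> accept (preserving zero false negatives).
                return i % 2 == 0
        return True                                         # accepted by every layer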

Stacked Filters also need workload knowledge, similarly to classifier-based filters. Later on we describe in detail how much knowledge is needed and how to gather it (even adaptively). For now, we denote workload knowledge abstractly as 푊, and feed this into an optimization algorithm which returns 1) a set of frequently queried negatives, denoted by 푁푓, and 2) the number of layers 푇퐿 and the false positive rate of each layer 훼1, . . . , 훼푇퐿.

Querying a Stacked Filter. Algorithm 2 formalizes the description given in the example for querying a Stacked Filter. Querying for an element 푥 starts with the first layer and goes through the layers in ascending order. At every layer, if the element is accepted by the layer, it continues to the next layer. If an element is rejected by a layer containing positive elements (called a positive layer), it is rejected by the Stacked Filter. Conversely, if an element is rejected by a layer containing negative elements (a negative layer), it is accepted by the Stacked Filter. If the element reaches the end of the stack, i.e. it was accepted by every layer, then the Stacked Filter accepts the element. We now show this algorithm is correct, and explain how the algorithm differently affects positives, frequently queried negatives, and infrequently queried negatives.

Querying a Positive. Stacked Filters maintain the crucial property of having no false negatives. During construction, every positive element 푥푝 is added into each positive layer until it either hits the end of the stack or is rejected by a negative layer. Querying the Stacked Filter for 푥푝 follows the same path. Either 푥푝 makes it to the end of the stack and is accepted by the Stacked Filter, or it is rejected by some negative layer and thus accepted by the Stacked Filter. Figure 1 shows how this process works for the positive element 푝1.

Querying a Negative. An infrequently queried negative element 푥 ∈ 푁 \ 푁푓 is in neither 푃 nor 푁푓 and so has a high likelihood of being rejected by both positive and negative layers. As a result, the majority of infrequently queried negatives are rejected at 퐿1, and the majority of false positives come from elements rejected at 퐿2. Frequently queried negatives are a false positive for the Stacked Filter only if they are a false positive for each positive layer. This drives the FPR of frequently queried negatives towards 0, as their false positive rate decreases exponentially in the number of layers. Figure 1 shows how this process works for infrequently queried negatives 푛4 and 푛5 as well as frequently queried negative 푛2.

Item Insertion and Deletion. Stacked Filters retain the supported operations of the query-agnostic filters they use. If the underlying query-agnostic filter supports insertion, then so does the Stacked Filter. The same is true for deletion of positives.

Insertion of an element after construction follows the same path as an insertion during the original construction of the filter. The positive element alternates between inserting itself into every positive filter and checking itself against every negative filter, stopping at the first negative filter which rejects the element. This process works even in the case that the element was previously a frequently queried negative. The deletion algorithm follows the same pattern as the insertion algorithm, except it deletes instead of inserts the element at every positive layer.
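A sketch of the post-construction insertion path, under the same assumed layer interface as before:

    def insert_positive(layers, x):
        # A new positive inserts itself into every positive layer, but stops at the
        # first negative layer that rejects it: no deeper layer can ever see x.
        for i, layer in enumerate(layers, start=1):
            if i % 2 == 1:
                layer.insert(x)                 # positive layer
            elif not layer.query(x):
                break                           # rejected by a negative layer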
4 METRIC EQUATIONS

We now present the metric equations and derivations for Stacked Filters. These equations are then used in Section 5 to show how to tune Stacked Filters for optimal performance, and in Section 7 to show how Stacked Filters outperform state-of-the-art filters. We show that compared to query-agnostic filters, Stacked Filters are as robust, while improving drastically in EFPR and size.

Notation. Stacked Filters metrics are written as a function of 훼⃗ = (훼1, . . . , 훼푇퐿) plus an additional dummy 훼0 = 1 which is used for readability. To distinguish them from the metrics for the base filter, metrics for Stacked Filters are denoted with a prime, so for instance, the size and EFPR of a Stacked Filter are 푠′(훼⃗) and 퐸퐹푃푅′(훼⃗). The metrics are variations on either an exponential function or a geometric series. To make this clear by immediate inspection, we will often consider that all 훼푖 values have the same value 훼 (this also makes 훼 and 푇퐿 easier to optimize in Section 5).

4.1 Stacked Filters EFPR

To calculate the total EFPR for a Stacked Filter, we introduce a new variable 휓 which captures the probability that a negative element from query distribution 퐷 is in 푁푓, i.e. 휓 = P(푥 ∈ 푁푓 | 푥 ∈ 푁).

Frequently Queried Negatives. For 푥 ∈ 푁푓, a Stacked Filter returns 1 if and only if 푥 makes it to the end of the stack. This occurs only if it is a false positive on each positive layer, and so the probability of this happening is

$$\mathbb{P}(F(x) = 1 \mid x \in N_f) \;=\; \prod_{i=0}^{(T_L-1)/2} \alpha_{2i+1}$$

Infrequently Queried Negatives. For 푥 ∈ 푁푖, its total false positive probability is the sum of the probabilities that it is rejected by each negative layer, plus the probability it makes it through the entire stack. For negative layer 2푖, the probability of rejecting this element is $\prod_{j=1}^{2i-1} \alpha_j \cdot (1-\alpha_{2i})$, where the first factor is the probability of making it to layer 2푖 and the second factor is the probability that this layer rejects 푥. Summing up these terms and adding in the probability of making it through the full stack, we have

$$\mathbb{P}(F(x) = 1 \mid x \in N_i) \;=\; \prod_{i=1}^{T_L} \alpha_i \;+\; \sum_{i=1}^{(T_L-1)/2} \Big(\prod_{j=1}^{2i-1} \alpha_j\Big)(1 - \alpha_{2i})$$

Expected False Positive Rate. Since 푁푖 and 푁푓 partition 푁, the EFPR of a Stacked Filter is

$$\psi \prod_{i=0}^{(T_L-1)/2} \alpha_{2i+1} \;+\; (1-\psi)\Big(\prod_{i=1}^{T_L} \alpha_i + \sum_{i=1}^{(T_L-1)/2} \big(\prod_{j=1}^{2i-1} \alpha_j\big)(1-\alpha_{2i})\Big)$$

If all 훼 values are equal, then this is equal to

$$EFPR = \psi\,\alpha^{\frac{T_L+1}{2}} \;+\; (1-\psi)\,\frac{\alpha + \alpha^{T_L+1}}{1+\alpha} \qquad (1)$$

Thus, the FPR for frequently queried negatives is exponential in the number of layers and goes quickly to 0, whereas infrequently queried negatives have EFPR close to the FPR of the first layer.
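Equation (1) translates directly into code (a sketch; `num_layers` is the odd layer count 푇퐿):

    def stacked_efpr(alpha, num_layers, psi):
        fpr_frequent = alpha ** ((num_layers + 1) / 2)
        fpr_infrequent = (alpha + alpha ** (num_layers + 1)) / (1 + alpha)
        return psi * fpr_frequent + (1 - psi) * fpr_infrequent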
4.2 Stacked Filter Sizes

Size of a Stacked Filter Given the FPR at Each Layer. For every positive layer after the first, an element from 푃 is added to the layer if it appears as a false positive in every previous negative layer. Thus, the size of all positive layers in bits per positive element is

$$\sum_{i=0}^{(T_L-1)/2} s(\alpha_{2i+1}) \cdot \Big(\prod_{j=0}^{i} \alpha_{2j}\Big)$$

Similarly, negatives appear in a negative layer if they are false positives for every prior positive layer, and so the size of all negative layers (using the traditional metric of bits per positive element) is

$$\sum_{i=1}^{(T_L-1)/2} s(\alpha_{2i}) \cdot \frac{|N_f|}{|P|} \cdot \Big(\prod_{j=1}^{i} \alpha_{2j-1}\Big)$$

The total space for a Stacked Filter is the sum of these two equations. Because 훼푖 values are small, the products in parentheses go to 0 quickly in both equations, and so the total size of a Stacked Filter is dominated by its first layer.

Size When Each Layer has Equal FPR. In the case that all 훼 values are the same, we can use a geometric series bound on both equations above, giving

$$s'(\vec{\alpha}) \;\le\; s(\alpha)\cdot\Big(\frac{1}{1-\alpha} + \frac{|N_f|}{|P|}\,\frac{\alpha}{1-\alpha}\Big) \qquad (2)$$

where 푠′(훼) represents the size in bits per (positive) element.

Stochasticity of Size or Filter Behavior. When constructing Stacked Filters, there are two choices when it comes to space. First, all filters can have their memory allocated up front. Using this method, size is fixed, but if a higher proportion of elements makes it through the initial layers of the stack, bad behavior can happen at the subsequent filters in the stack. This happens in the form of increased FPR (Bloom filters), failed construction (Cuckoo filters), or long probe times (Quotient filters). Instead, our default is to allocate size proportional to the number of items which make it to a layer in the stack (see lines 6 and 14 of Algorithm 1). This makes the size of a Stacked Filter random; however, for large sets the size of a Stacked Filter concentrates sharply around its mean. In particular,

$$\mathbb{P}\big(|s' - E[s']| \ge k\,E[s']\big) \;\le\; \frac{1}{|P|} \cdot \frac{1 + \frac{|N_f|}{|P|}\,\alpha_{max}}{\big(1 + \frac{|N_f|}{|P|}\,\alpha_{min}\big)^2} \cdot \frac{1}{k^2} \cdot \Big(\frac{s(\alpha_{min})(1-\alpha_{min})}{s(\alpha_{max})(1-\alpha_{max})}\Big)^2$$

where 훼푚푖푛, 훼푚푎푥 are the lowest and highest FPRs of any layer in the Stacked Filter. The proof can be found in Technical Report Section 11.3.1 [16]. The leading 1/|푃| term ensures that for large sets the chance of deviating away from the expected size is negligible.
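Similarly, the equal-FPR size bound of Equation (2) is a one-liner; `s` is the base filter's bits-per-element function and `r` is |푁푓|/|푃| (a sketch):

    def stacked_size_bound(alpha, s, r):
        # Equation (2): total bits per positive element across all layers.
        return s(alpha) * (1 / (1 - alpha) + r * alpha / (1 - alpha))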
4.3 Stacked Filter Robustness

The First Layer Provides Robustness. Any element in 푁 is either in 푁푖 or 푁푓, and so its probability of being a false positive is either P(퐹(푥) = 1 | 푥 ∈ 푁푖) or P(퐹(푥) = 1 | 푥 ∈ 푁푓). Since elements of 푁푖 have a higher chance of being a false positive, the EFPB of a Stacked Filter is P(퐹(푥) = 1 | 푥 ∈ 푁푖). For a Stacked Filter, an easy bound on this is the FPR of the first layer. Since the majority of the size of a Stacked Filter is in its first layer, worst-case performance is similar to a query-agnostic filter (of the same size).

Performance Change Under Workload Shift. While the EFPB provides worst-case bounds, the EFPR equation shows what happens under the common case of milder workload drifts. For an initial query distribution 퐷 with corresponding 휓, which changes to 퐷′ with corresponding 휓′, the change in EFPR from 퐷 to 퐷′ depends only on the change from 휓 to 휓′. In particular, the change in EFPR is

$$(\psi' - \psi)\cdot\big(\mathbb{P}(F(x)=1 \mid x \in N_f) - \mathbb{P}(F(x)=1 \mid x \in N_i)\big)$$

Thus, the performance in terms of EFPR for a Stacked Filter decreases linearly with the change in the proportion of queries aimed at frequently queried negatives.

4.4 Stacked Filter Computational Costs

Like the previous derivations, the resulting equations for query computation time and construction time are modified geometric series. We give here bounds on the resulting equations specifically for all 훼푖 equal to 훼. The derivations for exact equations with general 훼푖 and an arbitrary number of layers are in Technical Report Section 11.2 [16]. All equations are in terms of average computational cost and given as the number of base filter operations required.

Construction. The construction cost of Stacked Filters is

$$|P|(c_i + c_q)\,\frac{1}{1-\alpha} \;+\; |N_f|\,c_q \;+\; |N_f|(c_i + c_q)\,\frac{\alpha}{1-\alpha}$$

Like in the previous subsections, filters after the first add only negligible costs to the construction of Stacked Filters when 훼 is small. The majority of the cost above comes from 1) 푐푖 · |푃| for constructing the first layer and 2) 푐푞 · (|푁푓| + |푃|) for querying the first layer.

Query Costs. The costs for querying a Stacked Filter for a positive, a frequently queried negative, and an infrequently queried negative are:

$$c'_{q,P} \le \frac{2}{1-\alpha}\,c_q, \qquad c'_{q,N_f} \le \frac{1+2\alpha}{1-\alpha}\,c_q, \qquad c'_{q,N_i} \le \frac{1}{1-\alpha}\,c_q \qquad (3)$$

For small 훼, the cost of querying negative elements is essentially identical to querying a single filter. The cost of querying positive elements is about 2× the cost of a single filter, as they make it through the first layer with certainty before being rejected at layer 2 with high probability.

5 OPTIMIZING STACKED FILTERS

Sections 3 and 4 introduced Stacked Filters given their parameters: the set of frequent negatives, the FPRs of each layer, and the number of layers. We now complete the picture of how Stacked Filters are constructed. This is done in two stages: first we go from a sample of past queries to a workload model, and then we go from a model of the workload to the choice of Stacked Filter parameters.

5.1 Modeling the Workload

Like classifier-based filters, Stacked Filters require workload knowledge in the form of a sample of past queries. This set of past queries can have multiple sources depending on the application. For instance, it can be: 1) publicly available, as in the case of URL blacklisting [28] with popular non-spam websites and their query frequencies collected by OpenPageRank [2], 2) collected by the application by default, as is the case for web indexing and document search [24], where query term frequencies are collected and stored, or 3) collected by the system by choice, as is the case for most data systems including key-value stores [41].

After collecting the set of sample queries, Stacked Filters create a model to identify frequently queried elements. They do this by creating a smoothed histogram of the empirical query frequencies. More specifically, each element observed in the set of sample queries is put into a set 푁푠푎푚푝. Then, the proportion of queries at elements outside 푁푠푎푚푝 is estimated by looping over all possible subsets of 푄 − 1 queries from the set of 푄 sample queries, and seeing for what proportion of subsets the 푄th query value is not present in the set of 푄 − 1 queries. If we denote this value by ℓ, then for each 푥푖 ∈ 푁푠푎푚푝, its query frequency is estimated as

$$f(x_i) = \frac{1-\ell}{Q}\cdot\sum_{j=1}^{Q} \mathbb{1}_{q_j = x_i}$$

Our optimization algorithms below then choose some of the values in 푁푠푎푚푝 to be in 푁푓, creating the frequent negatives set.
5.2 Optimization Algorithms

The final step in constructing Stacked Filters is to use the workload model to choose 푁푓, 푇퐿, and {훼푖} for 푖 = 1, . . . , 푇퐿. The optimization algorithms which do so depend on the base filter being used and fall into two categories. In the first, the query-agnostic filter can take on any value of 훼, which is a good approximation for filters such as Bloom Filters. In the second, the possible 훼 values are of the form 2^{−푘} for 푘 ∈ ℕ, which is true or a very close approximation for fingerprint-based filters such as Cuckoo and Quotient Filters. For both methods, we assume that the base filter has a size equation of the form 푠(훼) = (−log₂(훼) + 푐)/푓 with 푐 ≥ 0, 푓 ≥ 1. This holds true or is a very close approximation for all major filters in practice, including Bloom, Cuckoo, and Quotient filters, and additionally covers the equation for the theoretical lower bound on the size of query-agnostic filters. Throughout the section, optimization is given in terms of minimizing EFPR with respect to a constraint on size. Optimization of size with respect to a bound on EFPR is similar. Additional constraints on the EFPB or the expected number of filter checks may be added with only minor modifications.

5.2.1 Outer Loop: Sweeping over 푁푓. For both continuous and discrete FPR filters, there is an outer loop which chooses sets 푁푓 to optimize and an inner optimization which optimizes the Stacked Filter given 푁푓. The best performing value of 푁푓 is then used.

To choose 푁푓, we make use of the workload model and choose 푁푓 to be a subset of 푁푠푎푚푝. The 휓(푁푓) value is the sum of the estimated query frequencies of each value chosen to be in 푁푓. Because the EFPR is a monotonically decreasing function of 휓, for a fixed size of 푁푓 it is optimal to greedily choose the negative elements queried most. Thus we can order 푁푠푎푚푝 by the elements' query frequencies and then sweep over various sizes for 푁푓, always using the most frequently queried elements of 푁푠푎푚푝 to be in 푁푓. The following theorem shows we can choose the size of 푁푓 efficiently. Its proof, and the proofs of all other theorems in this paper, are given in the Technical Report [16].

Theorem 1. Given an oracle returning the optimal EFPR for a given set 푁푓, finding the optimal EFPR across all values of |푁푓| to within 휖 requires 푂(1/휖) calls to the oracle.

The core idea of the theorem is that values of |푁푓| that are close together have solutions with optimal EFPR close to each other. Using the theorem, our algorithm starts with a “current" 푁푓 of size 0. It then increases |푁푓| to a strategically chosen larger value, making sure the skipped values of |푁푓| could have EFPR no more than 휖 lower than the checked values, and runs the optimization with the new fixed 휓 and |푁푓|. It continues to do so until |푁푓| = |푁푠푎푚푝|, and returns the setup giving the best observed EFPR.
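In outline, the outer loop looks as follows (a sketch; `inner_optimize` stands for the inner optimization of Section 5.2.2 or 5.2.3, and `next_size` is a hypothetical helper implementing the 휖-spaced jumps justified by Theorem 1):

    def sweep_frequent_negatives(freqs, inner_optimize, next_size):
        # Greedily rank sampled negatives by estimated query frequency.
        ranked = sorted(freqs, key=freqs.get, reverse=True)
        best, size = None, 0
        while size <= len(ranked):
            n_f = ranked[:size]
            psi = sum(freqs[x] for x in n_f)
            candidate = inner_optimize(n_f, psi)
            if best is None or candidate.efpr < best.efpr:
                best = candidate
            if size == len(ranked):
                break
            # Skip sizes whose optimal EFPR can differ by less than epsilon.
            size = min(next_size(size), len(ranked))
        return best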
5.2.2 Inner Optimization: Continuous FPR Filters. The inner optimization loop for continuous filters has 푁푓 and 휓 given and works in two steps. First, we assume that all layers have the same FPR and optimize the filter as if it had infinitely many layers. Second, we truncate the infinite-layer Stacked Filter to a small finite-layered one that is close in performance to the infinite-layer one. An optional third step modifies the procedure to search filters with varying 훼 values across layers, but we note that this procedure is optional as it generally does not improve the EFPR.

Step 1: Fixed 푁푓, Infinite Equal-FPR Layers. Take the equations of Section 4 with FPR equal across layers and let 푇퐿 → ∞. The equations for EFPR, size, and EFPB converge to

$$s'(\alpha) = s(\alpha)\cdot\Big(\frac{1}{1-\alpha} + \frac{|N_f|}{|P|}\,\frac{\alpha}{1-\alpha}\Big), \qquad EFPR(\alpha) = (1-\psi)\,\frac{\alpha}{1+\alpha}, \qquad EFPB = \frac{\alpha}{1-\alpha}$$

and the equations for computation converge to Equation (3).

By inspection, the equations for EFPR, EFPB, and computation monotonically increase with 훼. Thus attempts to minimize the EFPR or to satisfy EFPB or computation constraints should all lower 훼. For the size equation, the following theorem holds.

Theorem 2. The function $s'(\alpha) = \frac{-\log_2(\alpha)+c}{f}\cdot\big(\frac{1}{1-\alpha} + \frac{|N_f|}{|P|}\,\frac{\alpha}{1-\alpha}\big)$ is quasiconvex on (0, 1) when 푐 ≥ 0, 푓 > 0.

Quasiconvex functions have unique global minima, and as a result, the size equation can be minimized via gradient descent. Specifically, we use gradient descent with backtracking line search to choose the step size. To minimize EFPR using size as a constraint, at a given time step, if we are below the size constraint we decrease 훼. Otherwise, we use the gradient of size with respect to 훼 to decrease the size. If for the given 푁푓 one or more constraints is not satisfiable, we return an exception.
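Because the objective is quasiconvex (Theorem 2), any unimodal line search works; the sketch below uses golden-section search in place of the paper's gradient descent with backtracking:

    import math

    def infinite_stack_size(alpha, s, r):
        # Converged size equation of Step 1; r = |N_f| / |P|.
        return s(alpha) * (1 / (1 - alpha) + r * alpha / (1 - alpha))

    def minimize_stack_size(s, r, lo=1e-9, hi=1 - 1e-9, iters=200):
        phi = (math.sqrt(5) - 1) / 2
        a, b = lo, hi
        for _ in range(iters):
            c, d = b - phi * (b - a), a + phi * (b - a)
            if infinite_stack_size(c, s, r) < infinite_stack_size(d, s, r):
                b = d
            else:
                a = c
        return (a + b) / 2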
Step 2: Truncating to a Finite Stack. When performing truncation, we measure the difference in each metric equation using infinite 푇퐿 and an increasing finite value of 푇퐿, and stop when the difference is below 휖 for all metric equations. Because each metric equation is either an exponentially decreasing function of 푇퐿 or a geometric series in 푇퐿, the convergence to the infinite-layer values is on the order of 푂(훼^{푇퐿/2}), and so usually 5 or 7 layers suffice.

Algorithm Analysis. By using 휖/3 in both the outer loop over 푁푓 and both steps 1 and 2, the overall algorithm is an 휖-approximation to the best possible EFPR for a Stacked Filter with 훼 fixed across layers. Its runtime is 푂(휖⁻¹ + |푁푠푎푚푝|), and its empirical optimization times for Stacked Bloom Filters at 10 bits per element are listed in Table 2 under “Bloom, fixed 훼". The workload, described in detail in Section 8.2, is the synthetic integer dataset with a Zipf distribution with 휂 = 1, and |푁푠푎푚푝| = 5 · 10^7.

Table 2: Optimization is efficient and tunably optimal

휖     | Bloom, fixed 훼     | Bloom, varied 훼    | Cuckoo
      | EFPR      Time     | EFPR      Time     | EFPR      Time
10⁻²  | 0.00175   775 휇s   | 0.00175   47 ms    | 0.00203   12 ms
10⁻³  | 0.00173   781 휇s   | 0.00173   328 ms   | 0.00190   13 ms
10⁻⁴  | 0.00172   1.01 ms  | 0.00172   2.9 s    | 0.00184   44 ms
10⁻⁵  | 0.00172   3.7 ms   | 0.00172   26.7 s   | 0.00184   367 ms

For both this algorithm as well as the subsequent two, we note the runtime has two regions. When 휖⁻¹ is small, the runtime is approximately linear in |푁푠푎푚푝|. As 휖⁻¹ grows, it becomes the primary cost of algorithmic runtime and the runtime is linear in 휖⁻¹.

Varying FPRs Across Layers. When allowing the 훼푖 values to change across layers, we are faced with a non-convex optimization objective and constraint, even when fixing 푇퐿 (this can be seen by taking second derivatives). To perform optimization, at each checked value of 푁푓 we first run the optimization using equal FPR across layers. We then polish the resulting filter by using the gradient-free algorithm COBYLA [36] to modify the FPRs of each layer. While this method very occasionally achieves improvements over the fixed-FPR-per-layer method, it generally does not, as seen in Table 2. Additionally, an alternative strategy of discretizing the search space for the FPRs at each layer and using the optimization routines described in Section 5.2.3 also did not in general improve upon the equal-FPR-per-layer solution. Thus we view this final polishing as optional in the optimization of continuous FPR filters.

5.2.3 Inner Optimization: Fingerprint-Based Filters. For fingerprint-based filters, the discrete number of fingerprint bits makes search easier. The main idea of our approach is to use breadth-first search expanding the number of fingerprint bits used at each layer, working two layers at a time: one positive and one negative. At each pair of layers, derived bounds on which possible fingerprint lengths can lead to an optimal solution at each layer are used, constraining the number of options expanded. Eventually, each search path terminates, either because its choices already created too many false positives, it used all the available space budget, or the number of queries which would reach the current layer of the chosen stack is less than 휖. The full algorithm, its explanation, and proofs of its theoretical properties are given in the Technical Report [16].

Theoretically, the algorithm is guaranteed to return a filter with 퐸퐹푃푅 ≤ 퐸퐹푃푅* + 휖, where 퐸퐹푃푅* is the EFPR of the best possible filter satisfying all constraints. We can bound the runtime of the algorithm theoretically by 푂(|푁푠푎푚푝| + 휖⁻³). Additionally, the proof of the runtime bound does not rely on several key optimizations of the algorithm, and the experimental runtime of the algorithm behaves more like 푂(휖⁻¹). Thus, the algorithmic runtime is both tunable and efficient, as can be seen in Table 2 for Cuckoo Filters.

6 INCREMENTAL CONSTRUCTION AND ADAPTIVITY

So far we assumed that workload knowledge can be collected and that workloads are static or drift slowly. While this holds for many filter use cases such as URL Blacklisting [28] and Web Indexing [24], there are many other applications where workloads change quickly, in which case continuously gathering workload knowledge is expensive. To address these use cases, we introduce Adaptive Stacked Filters (ASFs), which require knowledge about workload shape (such as how skewed workloads are), but do not require gathering a set of negative queries. Crucial to the design of ASFs is the idea of incremental construction, which allows ASFs to process queries immediately, learn frequent negatives during query evaluation, and gain benefits from stacking before finishing construction. Mirroring how we described Stacked Filters, we explain first the structure of ASFs given their parameters, then how to collect workload knowledge, and finally how to optimize their parameters.

Incremental Construction. ASFs start by constructing 퐿1 and use this to answer incoming queries until more layers are constructed. They also allocate an empty 퐿2. During query processing, when a false positive occurs, it is added to 퐿2. Then, when 퐿2 is full (in that it either hits its load factor for Cuckoo and Quotient filters or has half its bits set for Bloom filters), the ASF brings in the positive set, queries it against 퐿2, and adds the false positives of 퐿2 to a new 퐿3. Processing then continues using the layers up until 퐿3, gaining the benefits in terms of EFPR that come with extra layers. Additionally, construction of layers 퐿4 and 퐿5 can begin (if they exist), using the same procedure. Since 푁푓 is captured during query processing, it does not need to be gathered before the construction of the ASF. This is the primary benefit of ASFs over Stacked Filters: they require only the workload shape (to figure out how big each layer should be) but not which values are important.
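The incremental step can be sketched as follows (illustrative only; `asf.positives`, `is_full`, and `make_filter` are assumed names, and only the 퐿1-퐿3 case is shown):

    def on_false_positive(asf, x, make_filter):
        # A false positive observed at query time becomes a learned frequent
        # negative: insert it into the pre-allocated, initially empty L2.
        asf.layers[1].insert(x)
        if asf.layers[1].is_full():
            # L2 is full: bring in the positive set, and seed L3 with the
            # positives that L2 would wrongly reject.
            survivors = [p for p in asf.positives if asf.layers[1].query(p)]
            asf.layers.append(make_filter(survivors, asf.alphas[2]))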
Rank-Frequency Workload Knowledge. To create ASFs, we need a rank-frequency distribution of the negative query workload. This is akin to the workload knowledge of Section 5.1, but instead of describing the query frequencies of actual values, the distribution describes the query frequency of the value at each rank, where rank is itself defined by ordering elements by their query frequencies. A classic example of this type of distribution is the Zipf distribution, which models how often the first and second most popular values occur without reference to the actual values.

The rank-frequency distribution can be calculated in many different ways. We do so using a set of past queries to make a smoothed histogram as described in Section 5.1. ASFs assume that the workload shape is relatively static even if the values are not, and in this case, ASFs rebuild themselves without performing a new analysis of the workload. For instance, YouTube video queries and other periodic workloads are a good example of a case where this holds: queries for popular videos on a given week follow consistent patterns even if which videos are popular changes each week [12]. In this case, an ASF being rebuilt and adapting to new frequent values knows what shape to take before it knows the new frequent values.

Optimizing Collection vs Exploitation. We optimize ASFs in two different ways, depending on the nature of the workload. The first uses the optimization procedures of base Stacked Filters, assuming we pick the most frequently queried negatives, and allocates the exact same filter. It then creates the filter incrementally during query processing instead of all at once.

The above process builds the best eventual ASF but can face many queries before achieving a fully built ASF. For this reason, we additionally create a second approach which assumes that ASFs are 3 layers and focuses on building a fully built ASF very quickly. This form of optimization takes as input an estimate of how long the filter will last and then chooses a number of queries to observe when building the second layer, denoted by 푁표. The value of 푁표 determines the expected values of 휓 and |푁푓|:

$$E[\psi] = \sum_{x \in N_{samp}} f(x)\big(1 - (1 - f(x))^{N_o}\big)$$

$$E[|N_f|] = (\ell \cdot N_o) + \sum_{x \in N_{samp}} \big(1 - (1 - f(x))^{N_o}\big)$$

where we recall that ℓ is the estimate of what portion of queries fall on values outside our sample. The optimization then weights the EFPR using just 퐿1 for 푁표 queries vs. the EFPR of the ASF using all 3 layers on the rest of the queries. To choose the best configuration, we perform grid search on 푁표. At each value of 푁표, we either calculate or estimate 휓 and |푁푓|, depending on the size of 푁푠푎푚푝, and perform optimization of the three layers using discrete search for both continuous FPR filters and integer-length fingerprint filters (see Technical Report Sec. 11.5 for details [16]).

Monitoring and Adapting. To maintain robust performance, if the elements in 푁푓 become less frequently queried over time, this needs to be rectified. To address this, the ASF monitors its performance and initiates a rebuild whenever the FPR differs by more than 50% from its expected FPR. The ASF initially performs a rebuild assuming that the popularity of particular values has changed but not the rank-frequency distribution; the layers after the first are dropped and the procedure for construction starts again from 퐿1. If this does not fix performance, a remodeling of the workload happens, and the filter is re-optimized and rebuilt from scratch.

Positive Set Adaptivity. ASFs address the common use case where the positive set is static but the frequent negatives are changing. This is common for read-only datasets such as the levels of an LSM-tree, and it is common in general for filters because filters are not easily adaptive to changes in data size. However, in cases where new items are frequently seen, filters need to be able to adapt. We explore several preliminary strategies for this in Technical Report Section 11.6 [16].

7 BETTER SIZE-FPR TRADEOFFS

With Stacked Filters and their adaptive counterparts described fully, we ask the following critical question: when are Stacked Filters better than query-agnostic filters, and by how much? In terms of their trade-off between EFPR and size, the following theorems answer this question using only summary properties of the workload: namely, a choice of |푁푓| and 휓(푁푓). Along with each theorem, we present a visualization of its results, which can be used by systems designers to estimate the benefits of Stacked Filters on their workload prior to spending the time to gather workload knowledge. Each theorem holds exactly in the case that 훼 is a continuous parameter for the base query-agnostic filter, and we discuss how the theorems apply to integer-length fingerprint filters at the end of this section. The first theorem shows when a Stacked Filter is strictly better than a query-agnostic filter, as opposed to when a 1-layer filter is best.

Theorem 3.
Let the positive set have size |푃|, let the distribution of our negative queries be 퐷, and let 훼 be a desired expected false positive rate. If there exists any set 푁푓, with 휓 = P_퐷(푥 ∈ 푁푓 | 푥 ∈ 푁), and 0 ≤ 푘 ≤ 휓 such that

$$\frac{|N_f|}{|P|} \;\le\; \frac{1-k}{\alpha}\cdot\frac{\ln(1-k)}{\ln\alpha + c}\;-\;1$$

then a Stacked Filter (optimized using Section 5 and given access to any 푁푓 satisfying the constraint) achieves the EFPR 훼 using fewer bits than a query-agnostic filter.

Figures 2a and 2b use this theorem to visualize for which workloads Stacked Filters are certainly better than query-agnostic filters. Figure 2a shows this for a desired EFPR of 훼 = 0.03; any workload with a set 푁푓 such that (|푁푓|, 휓(푁푓)) is above the line has a Stacked Filter which is strictly better than a Bloom filter. Figure 2b shows this trend for more alpha values. Even at a high desired 훼 of 0.05, Stacked Filters cover a sizeable number of workloads; many workloads contain a negative set half the size of their positive set which contains 25% of all negative queries. As the desired EFPR decreases, Stacked Filters cover almost all real workloads; for instance, at a desired EFPR of 훼 = 0.01, if |푁푓| = |푃|, then only 7% of negative queries need to be at values in 푁푓 for the Stacked Filter to be more space efficient. At even lower values, the amount needed becomes negligible and almost any workload sees improvements.

Estimating Space Savings from Stacked Filters. Using the optimization routines of the prior sections and given values for |푁푓| and 휓(푁푓), it becomes possible to estimate the space savings of using a Stacked Filter as compared to a query-agnostic filter for any desired EFPR. Namely, we can optimize the size of a Stacked Filter with equal FPRs across layers while requiring that its EFPR is less than that of a query-agnostic filter:

$$s(\alpha) \;-\; \min_{\alpha_L}\; s(\alpha_L)\Big(\frac{1}{1-\alpha_L} + \frac{|N_f|}{|P|}\,\frac{\alpha_L}{1-\alpha_L}\Big) \qquad s.t.\;\; \alpha_L \in \Big(0, \frac{\alpha}{1-\psi}\Big]$$
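This optimization is one-dimensional and easy to evaluate numerically (a sketch using a simple grid over the feasible 훼퐿 range):

    def estimated_space_savings(s, r, psi, alpha, grid=10_000):
        # s: base filter bits-per-element function, r = |N_f| / |P|.
        upper = alpha / (1 - psi)          # feasibility bound on the per-layer FPR
        best = float("inf")
        for i in range(1, grid + 1):
            a_l = upper * i / grid
            if a_l >= 1:
                break
            cost = s(a_l) * (1 / (1 - a_l) + r * a_l / (1 - a_l))
            best = min(best, cost)
        return s(alpha) - best             # positive when stacking saves space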
Figure 2c uses this to graph how a combination of |푁푓|/|푃| and 휓 produces a reduction in filter size using Stacked Bloom Filters at a desired EFPR of 훼 = 0.01. For each fixed value of |푁푓|/|푃|, the graph contains three parts. In the first part, it is not advantageous to build Stacked Filters and a query-agnostic filter is built. For all values of |푁푓|/|푃|, this is a small area and covers workloads with no frequently queried negative elements. In the second part, Stacked Filters have superlinear improvement in 휓, with improvement starting at 0 bits per element saved and going up to 7 bits per element saved. At the tail end of the graphs, the improvement stops even as 휓 increases. This is the point where 훼/(1 − 휓) crosses the minimal 훼퐿 value for size. After this point, the Stacked Filter could choose larger false positive rates at each layer while having the same EFPR as the query-agnostic filter, but this larger 훼퐿 value would increase size. Instead, the Stacked Filter keeps the minimal 훼퐿 value for size, and the Stacked Filter produces both a space benefit and a lower EFPR than the input 훼.

Integer-Length Fingerprint Filters. The above equations and theory assumed that all FPRs were possible at each layer. For integer-length fingerprint filters, this is not the case and the theory does not hold exactly; however, the general trajectory remains the same. Additionally, the experimental results for integer-length fingerprint filters are often better than the continuous FPR approximations suggest. This is because query-agnostic filters also suffer from limitations on the FPRs they can choose; often, given more size as a budget, there is not enough space to add a full bit for every positive element. In these cases, Stacked Filters can often make use of this space to build layers deeper in the stack, and the added flexibility of being able to use space on any layer in the stack provides additional improvements over the theory above.

Figure 2: Analytical equations can predict both when Stacked Filters are better, and by how much.

8 EXPERIMENTAL ANALYSIS

We now experimentally demonstrate that Stacked Filters offer better false positive rates compared to query-agnostic filters for the same size, or the same false positive rate at a smaller size. We also show that Stacked Filters are more computationally efficient, robust, and more generally applicable than classifier-based filters while offering similar false positive rates and sizes.

Filter Implementations. All filters use CityHash as the hash function [35]. The Counting Quotient Filter (CQF) and Cuckoo Filter (CF) implementations are taken from the original papers [20, 34]. In the original implementation, CQF is constrained to have the filter size be a power of two to allow for operations such as resizing and merging. This is not relevant to our testing, so we removed this restriction. For CF, the implementation provided only supports certain signature lengths, so we implemented a fix to allow all integer signature lengths. For classifier-based filters, we use Learned Bloom Filters [28], with text data using a 16-dimensional character-level GRU as in the original paper, and integer data using a shallow feed-forward neural network. Additionally, we compare with Sandwiched Learned Filters (SLF) [32], which use the same model as the Learned Filter but have a query-agnostic pre-filter as well as a backup filter (Bloom filters). For Stacked Filter layers we use the same implementations as in the query-agnostic filters. For most experiments we use Bloom Filters and we refer to the filter as a Stacked Bloom Filter, but we also show results with other filters.

Experimental Infrastructure. All experiments are run on a machine with an Intel Core i7-9750H (2.60GHz with 6 cores), 32 GB of RAM, and an Nvidia GeForce GTX 1660 Ti graphics card. Each experimental number reported is the average of 25 runs.

Datasets. We use three diverse datasets:
(1) URL Blacklisting: The URL Blacklisting application was used to introduce Learned Filters [28]. As the dataset in [28] is not publicly available, we instead use two open-source databases: Shalla's Blacklists [1] as a positive set of dangerous URLs, and the top 10 million websites from the Open Page Rank Initiative [2] as a negative set of safe URLs, with the probability of querying a safe URL proportional to its PageRank.
from limitations on the FPRs they can choose; often given more (2) Packet Filtering: Packet filtering is a common application for size as a budget there isn’t enough space to add a full for every filters and was used to evaluate Counting Quotient Filters34 [ ]. positive element. In these cases, Stacked Filters can often make use Following their lead, we use the benchmark Firehose [3] which of this space to build layers deeper in the stack, and the added flex- simulates an environment where some subset of packets are iblity of being able to use space on any layer in the stack provides labeled suspicious and need to be filtered. The benchmark is run additional improvements over the theory above. under its default settings. (3) Synethetic Integers: To more finely control experimental settings, 8 EXPERIMENTAL ANALYSIS we also use synthetic data. The data set consists of 1 million pos- itive elements using randomly generated integer keys. Negative We now experimentally demonstrate that Stacked Filters offer better queries on the dataset come from a set of 100 million negative false positive rates compared to query-agnostic Filters for the same elements, also with randomly generated keys, and follow a Zip- size, or they offer the same false positive rate at a smaller size.We fian distribution. The skew of the negative query distribution is also show that Stacked Filters are more computationally efficient, a controlled parameter 휂 taking values between 0.5 and 1.25. For robust, and are more generally applicable than classifier-based all experiments and graphs where 휂 is not listed, 휂 = 0.75. filters while offering similar false positive rates and sizes. 2421 24 24 2424 24 79 11568 62 133 14027 10 2.00 9 QA Filter 2 DiskFilter Learned Filter 50 47 7.8 10 1 1.75 Stacked Filter 8 Sandwiched Filter Query Agnostic Filter Stacked Filter 7 1.50 40 Learned Filter (GPU) Sandwiched LF (GPU) 2 6 10 1.25 30 5 4.4 1.00 25 26 10 3 4 0.75 20 19 2.5 14 14 3 4 0.50 10 s) Query Time ( 8.4 2 1.6 URL Blacklisting URL Rate Positive False .25 .27 10 6.4 .17 6 6.2 1 1 1 1 0.25 .14.13 .13.13 .14 4.23.8 1 e.Fle HDD Latency Rel. Filter + 1 1 1 1 10 5 0.00 0 Latency SSD + Filter Rel. 0 a) 4 6 8 10 12 14 b) 10 Bits 14 Bits 10 Bits 14 Bits 8 10 12 14 8 10 12 14 Bits Per Positive Element Negative Positive c) Bits Per Positive Element d) Bits Per Positive Element 24 24 24 2424 24 1013 1028 603 711 740 721 25 0.10 .1 500 471 .09 .09 22 1 10 421 19 0.08 .08 400 20 361 353 17 10 2 .06 15 0.06 300 13 10 3 240 .04 0.04 200 10 FIrehose .03 .03 .03 .03 142 10 4 125 130 105 6

8 EXPERIMENTAL ANALYSIS

We now experimentally demonstrate that Stacked Filters offer better false positive rates than query-agnostic filters for the same size, or the same false positive rate at a smaller size. We also show that Stacked Filters are more computationally efficient, robust, and more generally applicable than classifier-based filters while offering similar false positive rates and sizes.

Figure 3: Stacked Bloom Filters achieve a similar EFPR to Learned Bloom Filters while maintaining a 170x throughput advantage and beat both query-agnostic and learned filters in overall performance on typical workloads. (Panels a, e, i: false positive rate vs. bits per positive element for URL Blacklisting, Firehose, and Synthetic Integers; panels b, f, j: query time for negative and positive queries at 10 and 14 bits; panels c, g, k and d, h, l: relative filter + HDD and filter + SSD latency.)

For each dataset, we simulate having incomplete information about the query distribution by giving Learned, Sandwiched Learned, and Stacked Filters only the higher frequency half of the negative set for training. We also varied this amount between 10% and 80% of the training data and saw the same relative results.

8.1 Evaluating Total Filter Performance

The overall performance of a filter can be broken down into two pieces: 1) the overhead incurred by the filter checks, and 2) the cost of unnecessary operations incurred by false positives. While 1) is a characteristic of just the filter, 2) depends both on the FPR of the filter and on the cost of the operations the filter protects against. Thus, we fix memory and measure FPR and probe (computation) time. We then translate these two metrics into total filter performance for the common case of protecting against base data accesses on persistent hardware, using a slower HDD (favoring lower FPRs) and a faster NVMe SSD (favoring lower computational costs).

Workload Knowledge Improves FPRs. Figures 3a, e, and i show the false positive rates for all filters across all three datasets. Here we use Bloom filters for both the query-agnostic filter and the Stacked Filter. We see that the Stacked Bloom Filter is the clear winner across all workloads.
Learned Filters come closer for the URL Blacklisting scenario, which is favorable for learning, but Stacked Filters still provide a better FPR for most memory budgets. For the other workloads, Learned Filters give similar or slightly worse FPRs than traditional query-agnostic filters while Stacked Filters provide a drastic benefit.

Across all three workloads, the negative query distribution has frequently queried elements and so Stacked Filters can find a "small" set of negative elements that contains a large portion of the query workload (we need only $N_f$ to not be dramatically larger than $P$). Under these conditions, deeper stacks have only marginal overhead compared to a single layer and so Stacked Filters allocate almost all their space budget to the first layer. This way, they achieve essentially the same FPR as query-agnostic filters on infrequent negatives and, at the same time, eliminate all false positives from frequently queried negatives. Stacked Filters improve upon the FPR of query-agnostic Bloom Filters by 5-100x, as seen in Figures 3a, e, and i. Alternatively, Stacked Filters can reduce their size significantly while achieving the same FPRs as query-agnostic filters.

Learned and Sandwiched Learned Filters' performance depends heavily on the dataset. With URL Blacklisting (Fig. 3a), where keys have semantic meaning and are correlated with their appearance in the positive or negative set, Learned and Stacked filters achieve similar results. When the keys have less semantic meaning, such as in Firehose (Fig. 3e) or the synthetic integer data (Fig. 3i), Learned Filters' performance degrades and becomes worse than a standard Bloom Filter. Overall, the evaluation of Learned Filters depends on 1) how correlated keys are with their positive/negative set membership, and 2) how complicated their decision boundaries are for the dataset (this affects computational performance as well). Figure 3a shows that even on tasks that are considered a good fit for Learned Filters, their performance can be matched by Stacked Filters.

Hash-Based Filters Dominate Classifier-Based Filters Computationally. Figures 3b, f, and j depict the computational performance of each filter. Noticeably, Learned Filters have computational performance that is orders of magnitude slower than hash-based filters, with the difference between 90-190x. In comparison, Stacked Filters have computational performance more in line with query-agnostic filters, with the difference between 1-2x. For queries on negative elements, the Bloom Filter and Stacked Bloom Filter have essentially identical computational cost, and for queries on positive elements, the Stacked Filter probe is about 1.5x the cost of the Bloom Filter probe. For Sandwiched Learned Filters, their performance is a weighted combination of hash-based filters and the classifier; overall, they are at least 3x more expensive and can be as much as 90x more computationally expensive.

Stacked Filters Maximize Overall Performance. As Figures 3 c-g-k and d-h-l show, Stacked Filters strike the best balance between decreased false positive rates and affordable computational speeds, resulting in the best overall performance across both hard disk and SSD. Compared to query-agnostic filters, both have fast hash-based computational performance but Stacked Filters have significantly fewer false positives; as a result, they provide total workload costs which are 1.4-130x lower. In comparison to classifier-based filters, Stacked Filters are better or equal in terms of false positive rates and better by orders of magnitude in computational performance, resulting in 1.5-1028x lower total workload costs.
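To make the probe-cost discussion concrete, below is a self-contained toy sketch of a three-layer Stacked Bloom Filter. It is not the paper's implementation; the layer sizes, hash scheme, and Python formulation are illustrative assumptions. The parity of the first rejecting layer determines the answer, which is why most negative probes cost a single layer while positive probes cost roughly two.

    import hashlib

    class BloomLayer:
        # A deliberately tiny Bloom filter; a stand-in for any base filter.
        def __init__(self, items, m, k=3):
            self.m, self.k = m, k
            self.bits = bytearray(m)
            for x in items:
                for p in self._positions(x):
                    self.bits[p] = 1

        def _positions(self, x):
            for i in range(self.k):
                h = hashlib.blake2b(f"{i}:{x}".encode(), digest_size=8).digest()
                yield int.from_bytes(h, "big") % self.m

        def contains(self, x):
            return all(self.bits[p] for p in self._positions(x))

    class StackedFilter:
        # Odd layers (L1, L3, ...) index P; even layers index the frequently
        # queried negatives that slip through the preceding positive layer.
        def __init__(self, layers):
            self.layers = layers

        def contains(self, x):
            for depth, layer in enumerate(self.layers):
                if not layer.contains(x):
                    # Rejection by a positive layer (even depth) means "absent";
                    # rejection by a negative layer (odd depth) means "present".
                    return depth % 2 == 1
            return True  # survived the final positive layer

    P = [f"p{i}" for i in range(1000)]
    Nf = [f"n{i}" for i in range(1000)]
    L1 = BloomLayer(P, m=4000)
    L2 = BloomLayer([x for x in Nf if L1.contains(x)], m=400)
    L3 = BloomLayer([x for x in P if L2.contains(x)], m=400)
    sf = StackedFilter([L1, L2, L3])
    assert all(sf.contains(x) for x in P)  # structurally, no false negatives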
8.2 Stacking Improves Diverse Filter Types

Figures 4a and b show that the benefits of Stacked Filters generalize across the diverse filter types used for the stacked layers. More specifically, Figure 4 shows the performance of Stacked Cuckoo and Stacked Quotient Filters on the synthetic integer dataset. This is the same experiment as for Stacked Bloom Filters in Figure 3i. Collectively, these three graphs all show the same benefit: Stacked Filters achieve lower false positive rates compared to their query-agnostic counterparts as workloads become more skewed. This is because more skewed workloads query their frequent negatives more often, and because at lower FPRs the first layer in a Stacked Filter culls almost all elements, making subsequent layers in the stack cheaper to build (as can be seen in Equation (2)). Further, while not shown here, the computational costs of Stacked Cuckoo and Stacked Quotient Filters follow the same patterns as seen in Figures 3b, f, and j, leading to similar benefits in terms of total throughput.

Figure 4: Stacking provides robust performance benefits across a variety of underlying filter types. (Panels: a) Stacked Cuckoo Filter, b) Stacked Quotient Filter; false positive rate vs. bits per positive element for $\eta \in \{0.5, 0.75, 1, 1.25\}$.)

8.3 Stacked Filters are Workload-Robust

We now demonstrate that Stacked Filters retain the robustness of query-agnostic filters, bringing an additional benefit over classifier-based filters. We focus on two facets of robustness: maintaining performance under shifting workloads and providing utility for a variety of workloads. We use the synthetic integer dataset with size $s = 10$, and Bloom filters for both the query-agnostic and the Stacked filter.

Robust to Workload Shifts. Figure 5a shows how Stacked Filters' performance adapts to workload changes with diverse values for $\eta$. We vary the workload by reducing the value of $\psi$ from its initial value to a value of 0. This captures scenarios where the frequently queried negative values are changing over time, and so Stacked Filters are no longer optimized for the most frequently queried negatives. For Stacked Filters, regardless of the skew of their initial distribution, drastic workload shifts are needed for them to become worse than query-agnostic filters, with query-agnostic filters being better only after the frequently queried negative set loses more than 60% of its initial set. Even in the extreme case that every frequently queried element from when the Stacked Filter was built is no longer queried ($\psi = 0$), the Stacked Filter is never more than 50% worse than a query-agnostic filter.

Robust to Workload Misspecification. Figure 5b shows the behavior when the sample queries used to build the filter come from a different distribution than the future workload. Here the queries come from the integer dataset with $\eta = 0.75$; however, during workload modeling, we used $\eta$ values from 0.15 to 1.35. Figure 5b shows that even when the modeled workload is significantly different from the true workload, the Stacked Filter retains most of its performance and outperforms query-agnostic filters.

Robust to Workload Type. Because Stacked Filters rely on taking advantage of frequently queried negative values, uniform workloads are the most difficult ones. However, if $|N|$ is relatively small, every negative element is still frequently queried. Figure 5c shows that Stacked Bloom Filters outperform query-agnostic Bloom filters when $|N|$ is a reasonable multiple of $|P|$, anywhere up to 25x. Since the uniform distribution is the worst possible distribution for Stacked Filters, this shows that Stacked Filters are better than query-agnostic filters for all small universe sizes.

Easy to Reconstruct. While Stacked Filters are effective under shifting workloads, they need to be rebuilt to regain their best performance. Figure 5d shows that the construction cost grows slowly with the number of frequently queried negatives, is significantly faster than classifier-based filters, and is comparable to query-agnostic filters. Because Stacked Filters can be reconstructed quickly, they can handle periodic bulk updates. Such a strategy is key for handling workloads with dynamic positive data.

Figure 5: Stacked Filters are robust to workload shifts, skew, noisy data, and can be rebuilt quickly. (Panels: a) EFPR under workload shift; b) EFPR vs. the Zipf parameter used in optimization when the true parameter is 0.75; c) EFPR vs. negative universe size in multiples of |P|; d) construction time (sec) vs. negative elements stored in millions.)

8.4 Incrementally Adapting to Workload Shifts

We now show that Adaptive Stacked Filters (ASFs) are capable of adapting quickly to changing workload patterns, improving on the (good) worst-case guarantees of base Stacked Filters. We use the synthetic integer dataset with Zipf parameter 1, let $s = 10$ bits per element, and use Bloom Filters for both the query-agnostic and the Stacked Filter. The ASF is optimized using the 3-layer approach.

Figure 6: High performance without workload knowledge. (Panels: a) latency on 10^7 SSD queries, split into filter and disk time for the 1-layer and 3-layer configurations; b) time elapsed vs. queries processed across two workload phases, with markers for when L1 and L3 are built; c) FPR table: ASF 1-Layer .0101, QA Filter .0087, ASF 3-Layer .0028, Static SF .0017.)

Fast Incremental Construction. Figure 6a shows the performance of an ASF and a query-agnostic filter on a static workload of 10 million queries, with the filters protecting against data accesses to an SSD. The height of each bar shows the total time taken to process the queries, and the colors inside each bar show the breakdown of how that time is spent. Overall, the ASF results in better performance over the workload compared with the query-agnostic filter, as its gains in FPR after the 3rd filter is built more than compensate for the cost to build $L_3$ and its slightly worse performance when using only $L_1$.

Adapting to Shifting Workloads. Figure 6b shows that ASFs work well for shifting workloads. Phase 1 of the experiment replicates the previous experiment; then, in phase 2, the query frequency distribution remains a Zipf with parameter $\eta = 1$ but the popularity of the elements is flipped so that the previously least popular element is now the most popular and vice versa. The figure details how long the ASF and query-agnostic filter take to process each workload. Additionally, the colored bars show what phase the ASF is in during query processing at the time of each query. While phase 1 shows as before that the ASF performs well on static workloads, phase 2 shows that the ASF can adapt to shifts in workload. The ASF does so by recognizing at the start of phase 2 that a shift has occurred and signaling a rebuild (an illustrative sketch of such a detection rule is given below). Then, performance continues as in a static phase: the ASF learns the new workload pattern quickly by incrementally building layer 2, then builds layer 3 and capitalizes on the reduced false positive rate once that occurs. Ultimately, across both phases, ASFs process the workload 1.6x faster than query-agnostic filters.

Breaking Down ASF Performance. Figure 6c breaks down the benefits of ASFs and compares them with static Stacked Filters. The table repeats a trend seen throughout the paper: the first layer of a Stacked Filter is nearly as performant as a query-agnostic filter. Then, as the ASF builds its 3rd layer, its FPR drops to nearly 1/4 of its previous value and it is 3x more performant than the query-agnostic filter. Finally, we see that compared to a static Stacked Filter on phase 1, the ASF is about 1.6x worse, so if workloads are sufficiently static, then a static Stacked Filter is best.
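The excerpt above does not spell out the shift-detection rule, so the following is only one plausible mechanism, sketched under the assumption that the ASF can observe which negative queries turn out to be false positives: compare the observed FPR over a window of recent queries against the EFPR the filter was optimized for, and signal a rebuild on a large sustained deviation. All names and thresholds are hypothetical.

    class ShiftDetector:
        # Illustrative only: signal a rebuild when the observed FPR on recent
        # negative queries drifts well above the EFPR the ASF was built for.
        def __init__(self, expected_fpr, window=100_000, factor=2.0):
            self.expected, self.window, self.factor = expected_fpr, window, factor
            self.queries = 0
            self.false_positives = 0

        def record(self, was_false_positive):
            self.queries += 1
            self.false_positives += int(was_false_positive)
            if self.queries < self.window:
                return False
            observed = self.false_positives / self.queries
            self.queries = self.false_positives = 0
            return observed > self.factor * self.expected  # True -> rebuild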

9 RELATED WORK

Sandwiched Learned Filters. Sandwiched Learned Filters (SLFs) are an extension of Learned Bloom Filters which use a preliminary filter before querying the classifier [32]. Similarly to our findings, the use of a sequence of filters brings increased space efficiency and reduced false positive rates. However, SLFs remain centered around a computationally expensive model, whereas Stacked Filters achieve performance similar to hash-based filters. In addition, Stacked Filters extend to arbitrary heights, while extending SLFs to more layers would need to take into account the difficulties arising from using a series of dependent ML models.

Filters that Use Frequent Negatives. There is a limited existing literature on Bloom Filters which use knowledge of the negative set during construction to achieve better false positive rates. Most of these require different assumptions than Stacked Filters, such as 1) being able to store the whole negative set [10, 30] or 2) allowing some false negatives [19]. The closest in functionality are Adaptive Cuckoo Filters [33], which correct false positives at the cost of extra space. Compared to Stacked Filters, they adapt to the workload more easily but have a worse FPR-size tradeoff.

Filter Variations. Many designs have been proposed to solve the standard filtering problem and work to lower the FPR vs. size tradeoff of filters [4, 6, 20, 44]. Other filter designs aim at building better filters in terms of their system throughput and optimize their computation or memory accesses [9, 15, 26, 27, 29, 31, 34, 37]. This work is all complementary to Stacked Filters, as increasing the efficiency of each base filter creates better Stacked Filters.

10 CONCLUSION

Stacked Filters learn to structurally take advantage of workload knowledge via hashing, as opposed to learning to separate keys from non-keys based on semantic information via a classifier. They eschew intra-key similarity in order to focus on succinct hash-based memorization of frequent negatives. This provides provably good properties in terms of computation, robustness, and false positive rate. By taking workload knowledge into account, Stacked Filters also provide better FPR/space tradeoffs than query-agnostic filters, offering a good balance of the best properties of both query-agnostic and Learned filters.
REFERENCES
[1] Shalla secure services kg. http://www.shallalist.de.
[2] Top 10 million websites: Open PageRank. https://www.domcop.com/openpagerank/what-is-openpagerank.
[3] K. Anderson and S. Plimpton. Firehose streaming benchmarks, 2015. https://firehose.sandia.gov/.
[4] M. A. Bender, M. Farach-Colton, R. Johnson, R. Kraner, B. C. Kuszmaul, D. Medjedovic, P. Montes, P. Shetty, R. P. Spillane, and E. Zadok. Don't thrash: How to cache your hash on flash. Proceedings of the VLDB Endowment, 5(11):1627-1637, 2012.
[5] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422-426, 1970.
[6] A. D. Breslow and N. S. Jayasena. Morton filters: Faster, space-efficient cuckoo filters via biasing, compression, and decoupled logical sparsity. Proceedings of the VLDB Endowment, 11(9):1041-1055, 2018.
[7] A. Broder and M. Mitzenmacher. Network applications of bloom filters: A survey. Internet Mathematics, pages 636-646, 2002.
[8] J. Bruck, J. Gao, and A. Jiang. Weighted bloom filter. In 2006 IEEE International Symposium on Information Theory, pages 2304-2308, 2006.
[9] M. Canim, G. A. Mihaila, B. Bhattacharjee, C. A. Lang, and K. A. Ross. Buffered bloom filters on solid state storage. In ADMS@VLDB, pages 1-8, 2010.
[10] L. Carrea, A. Vernitski, and M. Reed. Yes-no bloom filter: A way of representing sets with fewer false positives. arXiv preprint arXiv:1603.01060, 2016.
[11] L. Carter, R. Floyd, J. Gill, G. Markowsky, and M. Wegman. Exact and approximate membership testers. In Proceedings of the Tenth Annual ACM Symposium on Theory of Computing, pages 59-65. ACM, 1978.
[12] M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon. Analyzing the video popularity characteristics of large-scale user generated content systems. IEEE/ACM Transactions on Networking, 17(5):1357-1370, 2009.
[13] Z. Dai and A. Shrivastava. Adaptive learned bloom filter (Ada-BF): Efficient utilization of the classifier, 2019.
[14] N. Dayan, M. Athanassoulis, and S. Idreos. Monkey: Optimal navigable key-value store. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 79-94. ACM, 2017.
[15] B. Debnath, S. Sengupta, J. Li, D. J. Lilja, and D. H. Du. BloomFlash: Bloom filter on flash-based storage. In 2011 31st International Conference on Distributed Computing Systems, pages 635-644. IEEE, 2011.
[16] K. Deeds, B. Hentschel, and S. Idreos. Technical report: Stacked filters, 2020. http://daslab.seas.harvard.edu/publications/StackedFilters.pdf.
[17] F. Deng and D. Rafiei. Approximately detecting duplicates for streaming data using stable bloom filters. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 25-36. ACM, 2006.
[18] S. Dharmapurikar, P. Krishnamurthy, and D. E. Taylor. Longest prefix matching using bloom filters. In Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 201-212. ACM, 2003.
[19] B. Donnet, B. Baynat, and T. Friedman. Retouched bloom filters: Allowing networked applications to trade off selected false positives against false negatives. In Proceedings of the 2006 ACM CoNEXT Conference, page 13. ACM, 2006.
[20] B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher. Cuckoo filter: Practically better than bloom. In Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies, pages 75-88. ACM, 2014. https://github.com/efficient/cuckoofilter.
[21] L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3):281-293, 2000.
[22] D. Geneiatakis, N. Vrakas, and C. Lambrinoudakis. Utilizing bloom filters for detecting flooding attacks against SIP based services. Computers & Security, 28(7):578-591, 2009.
[23] A. Gervais, S. Capkun, G. O. Karame, and D. Gruber. On the privacy provisions of bloom filters in lightweight bitcoin clients. In Proceedings of the 30th Annual Computer Security Applications Conference, pages 326-335. ACM, 2014.
[24] B. Goodwin, M. Hopcroft, D. Luu, A. Clemmer, M. Curmei, S. Elnikety, and Y. He. BitFunnel: Revisiting signatures for search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017.
[25] T. M. Graf and D. Lemire. Xor filters: Faster and smaller than bloom and cuckoo filters, 2019.
[26] T. Gubner, D. Tome, H. Lang, and P. Boncz. Fluid co-processing: GPU bloom-filters for CPU joins. In Proceedings of the 15th International Workshop on Data Management on New Hardware, page 9. ACM, 2019.
[27] A. Iacob, L. Itu, L. Sasu, F. Moldoveanu, and C. Suciu. GPU accelerated information retrieval using bloom filters. In 2015 19th International Conference on System Theory, Control and Computing (ICSTCC), pages 872-876. IEEE, 2015.
[28] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pages 489-504. ACM, 2018.
[29] H. Lang, T. Neumann, A. Kemper, and P. Boncz. Performance-optimal filtering: Bloom overtakes cuckoo at high throughput. Proceedings of the VLDB Endowment, 12(5):502-515, 2019.
[30] H. Lim, J. Lee, and C. Yim. Complement bloom filter for identifying true positiveness of a bloom filter. IEEE Communications Letters, 19(11):1905-1908, 2015.
[31] L. Ma, R. D. Chamberlain, J. D. Buhler, and M. A. Franklin. Bloom filter performance on graphics engines. In 2011 International Conference on Parallel Processing, pages 522-531. IEEE, 2011.
[32] M. Mitzenmacher. A model for learned bloom filters and optimizing by sandwiching. In Advances in Neural Information Processing Systems, pages 464-473, 2018.
[33] M. Mitzenmacher, S. Pontarelli, and P. Reviriego. Adaptive cuckoo filters. ACM Journal of Experimental Algorithmics, 25, Mar. 2020.
[34] P. Pandey, M. A. Bender, R. Johnson, and R. Patro. A general-purpose counting filter: Making every bit count. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 775-787. ACM, 2017. https://github.com/splatlab/cqf.
[35] G. Pike and J. Alakuijala. CityHash, Jan. 2011. https://github.com/google/cityhash/blob/master/src/city.h.
[36] M. J. D. Powell. A Direct Search Optimization Method That Models the Objective and Constraint Functions by Linear Interpolation, pages 51-67. 1994.
[37] F. Putze, P. Sanders, and J. Singler. Cache-, hash- and space-efficient bloom filters. In International Workshop on Experimental and Efficient Algorithms, pages 108-121. Springer, 2007.
[38] D. L. Quoc, I. E. Akkus, P. Bhatotia, S. Blanas, R. Chen, C. Fetzer, and T. Strufe. ApproxJoin: Approximate distributed joins. In ACM Symposium on Cloud Computing (SoCC), 2018.
[39] J. W. Rae, S. Bartunov, and T. P. Lillicrap. Meta-learning neural bloom filters. arXiv preprint arXiv:1906.04304, 2019.
[40] S. Ramesh, O. Papapetrou, and W. Siberski. Optimizing distributed joins with bloom filters. In International Conference on Distributed Computing and Internet Technology, pages 145-156. Springer, 2008.
[41] RocksDB. RocksDB trace, replay, analyzer, and workload generation, 2020. https://github.com/facebook/rocksdb/wiki/RocksDB-Trace,-Replay,-Analyzer,-and-Workload-Generation.
[42] H. Stranneheim, M. Kaller, T. Allander, B. Andersson, L. Arvestad, and J. Lundeberg. Classification of DNA sequences using bloom filters. Bioinformatics, 26(13):1595-1600, 2010.
[43] S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz. Theory and practice of bloom filters for distributed systems. IEEE Communications Surveys & Tutorials, 14(1):131-155, 2012.
[44] M. Wang, M. Zhou, S. Shi, and C. Qian. Vacuum filters: More space-efficient and faster replacement for bloom and cuckoo filters. Proceedings of the VLDB Endowment, 13(2):197-210, 2019.
[45] X. Wang, Y. Ji, Z. Dang, X. Zheng, and B. Zhao. Improved weighted bloom filter and space lower bound analysis of algorithms for approximated membership querying. In M. Renz, C. Shahabi, X. Zhou, and M. A. Cheema, editors, Database Systems for Advanced Applications, pages 346-362, Cham, 2015.
11 PROOFS

11.1 Proof of Size Concentration Bounds

Theorem 4. Let $s'$ be the size of a filter using Algorithm 1, and assume a Stacked Filter of $T_L$ layers has false positive rate $\alpha_i$ for each layer $i \in \{1,\ldots,T_L\}$. Let $\alpha_{max} = \max_i \alpha_i \le \frac{1}{2}$ and $\alpha_{min} = \min_i \alpha_i$. Then

$$\mathbb{P}\big(|s' - E[s']| \ge kE[s']\big) \;\le\; \frac{1}{|P|}\cdot\frac{1}{k^2}\cdot\Big(\frac{s(\alpha_{min})(1-\alpha_{min})}{s(\alpha_{max})(1-\alpha_{max})}\Big)^2\cdot\frac{\alpha_{max}\big(1+\frac{|N_f|}{|P|}\big)}{\big(1+\frac{|N_f|}{|P|}\,\alpha_{min}\big)^2}$$

Proof. The result is an application of Chebyshev's inequality. Start by letting $X_N^{n,j}$, $j \in \{1,\ldots,|N_f|\}$, denote the event that item $j$ of the set $N_f$ made it to negative layer $n$ in construction.
Similarly, let $X_P^{n,j}$ denote the event that positive item $j$ made it to positive layer $n$. Since $s'$ and $s$ are in bits per positive element, we have

$$s' = \frac{1}{|P|}\Big(\sum_{j=1}^{|P|}\sum_{i=1}^{(T_L+1)/2} s(\alpha_{2i-1})X_P^{i,j} \;+\; \sum_{j=1}^{|N_f|}\sum_{i=1}^{(T_L-1)/2} s(\alpha_{2i})X_N^{i,j}\Big)$$

By using $\alpha_{max}$ in the size equation and $\alpha_{min}$ for the conditional chance of making it to the next layer, we have

$$E[s'] \;\ge\; s(\alpha_{max})\Big(1 + \frac{|N_f|}{|P|}\cdot\frac{\alpha_{min}}{1-\alpha_{min}}\Big)$$

For the variance part, we have that $X_{C_1}^{m,j} \perp\!\!\!\perp X_{C_2}^{m',j'}$ unless $C_1 = C_2$ and $j = j'$, by our method of construction. Thus

$$\mathrm{Var}[s'] \;\le\; \Big(\frac{s(\alpha_{min})}{|P|}\Big)^2\Big(|P|\,\mathrm{Var}\Big[\sum_{i=1}^{(T_L+1)/2} X_P^{i,1}\Big] + |N_f|\,\mathrm{Var}\Big[\sum_{i=1}^{(T_L-1)/2} X_N^{i,1}\Big]\Big)$$

To break down the summation, we use Eve's law. Letting $\mathcal{F}_N^n$ be the filtration containing all events up to negative layer $n$, we have

$$\mathrm{Var}\Big[\sum_{i=1}^{n} X_N^{i,1}\Big] = E\Big[\mathrm{Var}\Big[\sum_{i=1}^{n} X_N^{i,1} \,\Big|\, \mathcal{F}_N^{n-1}\Big]\Big] + \mathrm{Var}\Big[E\Big[\sum_{i=1}^{n} X_N^{i,1} \,\Big|\, \mathcal{F}_N^{n-1}\Big]\Big]$$
$$\le (1-\alpha_{max})\alpha_{max} + \mathrm{Var}\Big[\sum_{i=1}^{n-2} X_N^{i,1} + (1+\alpha_{max})X_N^{n-1,1}\Big]$$

From here, one can use Eve's law recursively to expand the expression on the right until reaching an expression only involving $X_N^{0}$. Evaluating this gives:

$$\mathrm{Var}\Big[\sum_{i=1}^{n} X_N^{i,1}\Big] \;\le\; \sum_{j=1}^{n}\Big(\sum_{i=0}^{j}\alpha_{max}^{i}\Big)^2\,\alpha_{max}^{\,n-j+1}(1-\alpha_{max}) \;\le\; \frac{\alpha_{max}}{(1-\alpha_{max})^2}$$

We note that this makes intuitive sense, as it is less variance than that of a geometric series with success parameter $(1-\alpha_{max})$. A similar result holds for $\mathrm{Var}[\sum_i X_P^{i,1}]$, and combining these two results with the variance expression above, the expected value expression above, and Chebyshev's inequality gives the desired result. □

11.2 Derivation of Computational Costs

Construction. For both positives and frequently queried negatives, the construction algorithm is best analyzed in pairs, wherein the same number of elements are inserted at one layer and queried against the next. For positives, every element is inserted into $L_1$ and checked against $L_2$. The false positives from $L_2$ are then inserted into $L_3$ and checked against $L_4$, and so on. The total cost, in terms of base filter operations, is $|P|(c_i+c_q) + |P|\alpha_2(c_i+c_q) + |P|\alpha_2\alpha_4(c_i+c_q) + \cdots$. In more concise notation, with the convention $\alpha_0 := 1$, this is

$$|P|(c_i+c_q)\sum_{i=0}^{\frac{T_L-1}{2}-1}\prod_{j=0}^{i}\alpha_{2j} \;+\; |P|\,c_i\prod_{j=0}^{\frac{T_L-1}{2}}\alpha_{2j}$$

where the final term comes from the last layer insertions. For negative layers, the analysis is similar, but the paired layers are instead 2 and 3, 4 and 5, and so on, with frequently queried negatives having the first layer unpaired instead of the last. The number of operations needed to insert the negatives is

$$|N_f|\,c_q \;+\; |N_f|(c_i+c_q)\sum_{i=1}^{\frac{T_L-1}{2}}\prod_{j=1}^{i}\alpha_{2j-1}$$

The total construction cost is the sum of the operations for positive and frequently queried negative elements. In the case that the $\alpha$ values are all equal, the total cost can be bounded using geometric series by

$$|P|(c_i+c_q)\frac{1}{1-\alpha} \;+\; |N_f|\,c_q \;+\; |N_f|(c_i+c_q)\frac{\alpha}{1-\alpha}$$

In this case, comparing against the cost of $c_i\cdot|P|$ for a single base filter, the overhead is seen to be the cost of querying a base filter once for every positive and frequently queried negative element, plus a small overhead from stacking.

Query Costs. The analysis of querying a Stacked Filter for a positive or frequently queried negative element is almost exactly the same as the analysis for constructing a Stacked Filter, except that inserts are replaced with queries. For instance, a query for a positive probes $L_1$ and $L_2$ with certainty, $L_3$ and $L_4$ with probability $\alpha_2$, $L_5$ and $L_6$ with probability $\alpha_2\alpha_4$, and so on. The resulting expected costs for positive and frequently queried negative elements are (again with $\alpha_0 := 1$)

$$c'_{q,P} = 2c_q\sum_{i=0}^{\frac{T_L-3}{2}}\prod_{j=0}^{i}\alpha_{2j} \;+\; c_q\prod_{j=0}^{\frac{T_L-1}{2}}\alpha_{2j}$$
$$c'_{q,N_f} = c_q + 2c_q\sum_{i=1}^{\frac{T_L-1}{2}}\prod_{j=1}^{i}\alpha_{2j-1}$$

An infrequently queried negative element can be rejected by both positive and negative layers. Thus, the probability of it reaching layer $i$ is the product of the false positive rates of layers $1,\ldots,i-1$:

$$c'_{q,N_i} = c_q\sum_{i=1}^{T_L}\prod_{j=1}^{i-1}\alpha_{j}$$

Letting the FPRs at each layer be equal and using geometric series bounds, we have:

$$c'_{q,P} \le \frac{2}{1-\alpha}\,c_q, \qquad c'_{q,N_f} \le \frac{1+\alpha}{1-\alpha}\,c_q, \qquad c'_{q,N_i} \le \frac{1}{1-\alpha}\,c_q$$

11.3 Proofs

11.3.1 Proof of Quasiconvexity. Note: one can cite the paper "Beyond Convexity: Stochastic Quasi-Convex Optimization" for theoretical convergence to a local minimum at some convergence speed.

Theorem 5. The size function $s(\alpha) = \frac{-\log_2(\alpha)}{f}\Big(\frac{1}{1-\alpha} + \frac{|N_f|}{|P|}\cdot\frac{\alpha}{1-\alpha}\Big)$ is quasiconvex on $(0,1)$ for $f > 0$.
the construction algorithm is best analyzed in pairs, wherein the same number of elements are inserted at one layer and queried Proof. For functions of a single variable, a sufficient condition for quasiconvexity is that 푠′ starts negative, has at most one root, against the next. For positives, every element is inserted into 퐿1 and and is then positive after (in the case a root exists). We use this to checked against 퐿2. The false positives from 퐿2 are then inserted into prove the quasiconvexity of 푠. For notational convenience, we will As a result, we can start at |푁푓 | = 0 elements and scan the most |푁푓 | frequently queried elements in order. When more than 휖 difference replace |푃 | with 푁 , and note 푁 > 0. We have exists in the 휓 values between the last call to the oracle and the 푁훼2 + (1 − 푁 )훼 − (푁 + 1)훼 ln 훼 − 1 current 푁푓 , then we call the oracle again. By the above statements, 푠′(훼) = ((1 − 훼)2훼 ln 2) this produces an EFPR within an 휖 amount of the best EFPR possible Let 푓 be the numerator of 푠′, i.e. 푁훼2 + (1−푁 )훼 − (푁 +1)훼 ln 훼 −1), across all possible sets for 푁푓 . □ and note the denominator is positive for 훼 ∈ (0, 1). It follows that ′ Both Algorithm 3 and the algorithm for optimizing continu- sign(푠 ) = sign(푓 ), and therefore it is enough to show 푓 starts ous FPR Stacked Filters use less optimization subroutines than negative and either has no roots, or has one root and is positive for 1 the proof above, although the number is still 푂 ( 휖 ). The change all values after its root. comes modifying the final line of the proof, using the fact thatif First, as 훼 → 0, 푓 (훼) → −1, so 푓 starts negative. Second, we ∗ ≤ 퐸퐹푃푅2 훼, where 훼 is the FPR of a 1-layer Stacked Filter, then note the functional forms of 푓 ′, 푓 ′′: 훼 푃2 (푥|푥 ∈ 푁푢 ) ≤ − . 푓 ′(훼) = 2푁 (훼 − 1) − (푁 + 1) ln 훼 1 휓2 푁 + 1 11.3.3 Proof of Theorem 3. We restate Theorem 3 here for conve- 푓 ′′(훼) = 2푁 − 훼 nience: By inspection, 푓 ′′(훼) has one root in R, and so by Rolle’s Theorem, Theorem 8. Assume the equation for the size of a base filter in ′ ′ 푓 (훼) has at most two roots in R. 푓 (1) = 0, so it has at most one − log2 (훼푖 )+푐 bits per positive element is of the form 푠(훼푖 ) = . Let the other root. Since lim 푓 ′(훼) = ∞, then if the other root is after 1, 푓 훼→0 positive set have size |푃 |, let the distribution of our negative queries be 푓 is monotonically increasing on (0, 1) and we are done. Otherwise, 퐷, and let 훼 be a desired EFPR. If there exists any set 푁 , 휓 = P (푥 ∈ 푓 ′ has a single root 훼 ∈ (0, 1). It follows that 푓 cannot have a root 푓 퐷 1 푁 |푥 ∈ 푁 ), and 푘 ≤ 휓 such that [ ) ( ) ( ) ∈ [ ) 푓 in 훼1, 1 since 푓 1 = 0 and 푓 훼2 = 0 for 훼2 훼1, 1 would imply 1 ′ |푁푓 | ln 1 − 푘 − 훼 the existence of another root for 푓 . ≤ 1−푘 · − 1 ′ 1−푘 Since 푓 (훼) > 0 : ∀훼 ∈ (0, 훼1), it follows that if 푓 has a root in |푃 | ln + 푐 훼 ∗ ∗ 훼 훼 ∈ (0, 훼1), then 푓 (훼) > 0 : ∀훼 ∈ (훼 , 1). This is what we set out then a Stacked Filter (optimized using Section 5 and given access to to show and so 푠 is quasiconvex. □ any 푁푓 satisfying the constraint) achieves the EFPR 훼 using fewer − ( )+ |푁 | bits than a query-agnostic filter. ( ) = log2 훼 푐 ∗ ( 1 + 푓 훼 ) Theorem 6. The function 푠 훼 푓 1−훼 |푃 | 1−훼 is quasiconvex on (0,1) when 푐 ≥ 0, 푓 > 0. Proof. We limit ourselves to Stacked Filters which have an equal FPR 훼퐿 at each layer. In this case, we have that the EFPR of the | = 푁푘| filter is Proof. As before let 푁 |푃 | . Take the derivative of 푠, note − / 휓훼푇퐿 1 2 + (1 − 휓)[(1 − 훼 ) ∗ (훼 + 훼3 + ... 
+ 훼푇퐿−2) + 훼푇퐿 ] the denominator is positive, and form 푓푐 from the numerator of 퐿 퐿 퐿 퐿 ≥ ( − ) 푠′, which in this case is 푁훼2 + (1 − 푁 )훼 − (푁 + 1)훼 ln 훼 − 1 + For some 푁 , 푇퐿 푁 implies this is less than 1 휓 훼퐿. Thus, a ′ Stacked Filter of length 푁 and 훼 ≤ 훼 has an EFPR less than 훼. (푁 + 1)푐훼 ln 2. As in the proof of Theorem 5, sign(푓푐 ) = sign(푠 ). 퐿 1−휓 Additionally, note that 푓푐 is equal to 푓 from the previous problem Our goal is then to show that there exists a Stacked Filter with ( + ) ≤ 훼 − ln(훼)+푐 plus an additional positive term: 푁 1 푐훼 ln 2. 훼퐿 1−휓 which has size less than 푓 . From Section 4, we In the case that 푓 is an increasing function on (0, 1), then so have that the size of a Stacked Filter is bounded above by ′ (︃ )︃ is 푓푐 and we are done. Otherwise, let 푓 have a root 훼1. Then on 1 |푁푓 | 훼퐿 ( , 훼 ) 푓 푓 푠(훼퐿) + 0 1 both and 푐 are increasing functions and so have at most 1 − 훼퐿 |푃 | 1 − 훼퐿 one root. On (훼1, 1), 푓 > 0 and so 푓푐 > 0; thus both have 0 roots Plugging in our equations for the size of a base filter, a Stacked in (훼 , 1). Finally, we note that 푙푖푚 → 푓 (훼) = −1 and 푓 (1) = ≤ ≤ 훼 1 훼 0 푐 푐 Filter is better than a traditional filter if 0 훼퐿 1−휓 and (푁 + 1) ∗ 푐 ∗ ln 2 > 0. Thus 푓푐 starts negative, has exactly one root, (︃ )︃ − ln 훼퐿 + 푐 1 |푁푓 | 훼퐿 − ln 훼 + 푐 and is then positive. The same is therefore true of 푠′ and so 푠(훼) is + ≤ (4) 푓 ln 2 1 − 훼퐿 |푃 | 1 − 훼퐿 푓 ln 2 quasiconvex. □ = 훼 We now let 훼퐿 1−푘 , and our EFPR equation gives the restraint 11.3.2 Sweep over 푁 Proof. 0 ≤ 푘 ≤ 휓. We further break apart our equation for size into two |푁 | Given an oracle returning the optimal EFPR for a parts: our goal will be to show that 1 + 푓 훼퐿 ≤ 1 + 푏, and Theorem 7. 1−훼퐿 |푃 | 1−훼퐿 − + given set 푁푓 satisfying all constraints, finding the optimal EFPR that (1 + 푏) ln 훼퐿 푐 ≤ − ln 훼+푐 . If both equations are satisfied, then | | ( 1 ) 푓 ln 2 푓 ln 2 across all values of 푁푓 to within 휖 requires 푂 휖 calls to the oracle. (4) is satisfied as well. ( + ) − ln 훼퐿+푐 ≤ − ln 훼+푐 = 훼 Proof. For 푖 ∈ {1, 2}, let 푁푖 be the set of 푛푖 most heavily queried Part 1: 1 푏 푓 ln 2 푓 ln 2 : Plugging in 훼퐿 1−푘 and solving ∗ elements, let 휓푖 = 푃 (푥 ∈ 푁푖 |푥 ∈ 푁 ), let 퐸퐹푃푅 be the best EFPR ≤ − ln 1−푘 푖 this inequaliity gives: 푏 − ln 훼 +푐 . using 푁 satisfying all constraints, and let 푃 (푥) be shorthand for the 1−푘 푖 푖 |푁 | |푁 | ∗ Part 2: 1 + 푓 훼퐿 ≤ 1 + 푏: Rearranging, we get 푓 ≤ chance 푥 is a false positive using the Stacked Filter giving 퐸퐹푃푅푖 . 1−훼퐿 |푃 | 1−훼퐿 |푃 | Then 푏 ln 1−푘 ∗ − 1 − 푏. We now set 푏 = 훼 , satisfying the inequality above, 훼퐿 ln 퐸퐹푃푅2 = 휓2푃2 (푥|푥 ∈ 푁푓 ) + (1 − 휓2)푃2 (푥|푥 ∈ 푁푢 ) 1−푘 = 휓1푃2 (푥|푥 ∈ 푁푓 ) + (1 − 휓1)푃2 (푥|푥 ∈ 푁푢 ) + (휓2 − 휓1)(푃2 (푥|푥 ∈ 푁푓 ) − 푃2 (푥|푥 ∈ 푁푢 )) ∗ ≥ 퐸퐹푃푅1 − (휓2 − 휓1)푃2 (푥|푥 ∈ 푁푢 ) ∗ ≥ 퐸퐹푃푅1 − (휓2 − 휓1) Algorithm 5 createNewOptObj 훼 and plug in 1−푘 for 훼퐿. The resulting equation is 1 Input: 푓1, 푓2: fingerprint sizes of new positive and negative layer |푁푓 | ln 1 − 푘 − 훼 ≤ 1−푘 · − 1 Input: optObj: prior optimization object |푃 | 1−푘 훼 푐−푓 푐−푓 ln + 푐 1: 훼1, 훼2 = 2 1 , 2 2 훼 |푁푓 | as desired, with the constraint that 0 ≤ 푘 ≤ 휓. 2: newOpt.s = (optObj.s - 푓 - objObj. × 훼 × 푓 )/훼 ; 1 |푃 | 1 2 2 To see that the constructed Stacked Filter achieves this goal, note 표푝푡푂푏 푗.휓 3: newOpt.휓 = 표푝푡푂푏 푗.휓 +(1−표푝푡푂푏 푗.휓 )×훼2 that as 휖 → 0, the optimization schemes of Section 5 approach the |푁푓 | |푁푓 | 훼 4: newOpt. = optObj. × 1 optimal filter using an equal FPR at all layers. 
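As a small numerical companion to Theorem 8, the sketch below evaluates the bound above, i.e. the largest $|N_f|/|P|$ for which stacking is guaranteed to beat a query-agnostic filter at EFPR $\alpha$. It relies on the reconstruction of the bound given above; the parameter values are illustrative.

    import math

    def max_nf_over_p(alpha, k, c):
        # Theorem 8 bound: largest |N_f|/|P| for which a Stacked Filter with
        # per-layer FPR alpha / (1 - k) beats a query-agnostic filter.
        b = math.log(1.0 / (1.0 - k)) / (math.log((1.0 - k) / alpha) + c)
        return b * (1.0 - k - alpha) / alpha - 1.0

    # e.g. alpha = 1%, k = psi = 0.8, c = 3 metadata bits per fingerprint
    print(max_nf_over_p(0.01, 0.8, 3.0))   # roughly 4.1: stacking wins up to |N_f| ~ 4|P|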
11.4 Optimizing Integer Length Fingerprint Filters

The optimization algorithms for integer length fingerprint Stacked Filters work similarly to those for continuous FPR filters. One routine loops through possible values of $|N_f|$, always greedily choosing the most heavily queried elements, and calls another routine to optimize a Stacked Filter for the given $N_f$ and $\psi$. The first routine does so in a way that ensures the skipped values can produce no more than an $\epsilon$ better solution, as proven in Theorem 7.

Algorithm 3 Optimization of fingerprint-based filters
Input: $\psi_{dist}$: a function translating $|N_f|$ into a $\psi$ value
Input: $s$: a size constraint in bits/element
Input: $\epsilon$: a slack from optimum
1: FPR_1L = FPR_SINGLE_LAYER($s$)
2: bestFPR, bestSetup = FPR_1L, {$\alpha_{1L}$}
   // remove the constant load factor $f$; the size equation becomes $s(\alpha) = -\log_2(\alpha) + c$
3: $s' = s \cdot f$
4: lastPsi = 0
5: for $|N_f| = \lceil |P|/1000 \rceil$ to $|N_{samp}|$ do
6:   newPsi = $\psi_{dist}(|N_f|)$
7:   if (newPsi $-$ lastPsi) $\cdot$ min(1, FPR_1L / (1 $-$ newPsi)) $\ge \epsilon/2$ then
8:     FPR, setup = optNFixed($s'$, newPsi, $|N_f|/|P|$, $\epsilon/2$)
9:     lastPsi = newPsi
10:    if FPR < bestFPR then
11:      bestFPR, bestSetup = FPR, setup
12: return bestFPR, bestSetup

11.4.1 Breadth First Search on Stacked Filters. Algorithms 4 and 5 deal with the optimization of an integer fingerprint filter given $N_f$ and $\psi$. The description of the algorithm is split between the pseudocode given and the steps described in the text here. For integer-length fingerprint filters, the optimization algorithm uses breadth first search (BFS) on the discrete options available, and its goal is to find a solution within $\epsilon$ of the best solution possible. The breadth first search starts with an empty stack and at each step expands upon the current stack. It does so both by 1) finishing the current stack with a positive layer of maximum size (lines 6-9 of Algorithm 4), which puts no new options onto the BFS queue, and 2) looping over the possible fingerprint lengths for 1 positive and 1 negative layer (lines 11-13 of Algorithm 4), adding each of these deeper stacks onto the BFS queue.

Algorithm 4 optNFixed
Input: $\psi$, $s$, $|N_f|/|P|$, $\epsilon$
   // the constructor arguments below correspond to the order they appear in Algorithm 5
1: startObject = OptObj($s$, $\psi$, $|N_f|/|P|$, 1.0, 0.0)
2: queue = {startObject}
3: bestFPR, bestSetup = FPR_1_Layer, {$\alpha_{1L}$}
   // begin breadth first search
4: while !queue.empty() do
5:   curObj = queue.pop()
     // calculate the FPR with this as the final layer and check if it is the best so far
6:   addedFPR = $2^{c-\lfloor \mathrm{curObj.s} \rfloor} \times$ curObj.prop
7:   if curObj.FPR + addedFPR < bestFPR then
8:     bestFPR = curObj.FPR + addedFPR
9:     bestSetup = curObj.fingerprints + {$\lfloor$curObj.s$\rfloor$}
     // check whether prop times the FPR of the final layer is less than $\epsilon$;
     // since the best FPR is $\ge 0$, this implies the best expansion is within $\epsilon$
10:  if addedFPR < $\epsilon$ then continue
     // perform breadth first search; see text for the bounds
11:  for i = f1_min; i $\le \lfloor$curObj.s$\rfloor$; i++ do
12:    for j = c+1; j < f2_max; j++ do
13:      queue.push(createNewOptObj(i, j, curObj))
14: return bestFPR, bestSetup

Algorithm 5 createNewOptObj
Input: $f_1$, $f_2$: fingerprint sizes of the new positive and negative layer
Input: optObj: prior optimization object
1: $\alpha_1$, $\alpha_2$ = $2^{c-f_1}$, $2^{c-f_2}$
2: newOpt.s = (optObj.s $-$ $f_1$ $-$ optObj.$\frac{|N_f|}{|P|}$ $\times \alpha_1 \times f_2$) / $\alpha_2$
3: newOpt.$\psi$ = optObj.$\psi$ / (optObj.$\psi$ + (1 $-$ optObj.$\psi$) $\times \alpha_2$)
4: newOpt.$\frac{|N_f|}{|P|}$ = optObj.$\frac{|N_f|}{|P|}$ $\times \alpha_1 / \alpha_2$
   // prop is the proportion of negative queries which reach the end of the stack
5: newOpt.prop = optObj.prop $\times \alpha_1 \times$ (optObj.$\psi$ + (1 $-$ optObj.$\psi$) $\times \alpha_2$)
6: newOpt.FPR = optObj.FPR + optObj.prop $\times$ (1 $-$ optObj.$\psi$) $\times \alpha_1 \times$ (1 $-$ $\alpha_2$)
7: newOpt.fingerprints = optObj.fingerprints + {$f_1$, $f_2$}
8: return newOpt

A Brute Force Approach. As there are a discrete number of options for fingerprints at each layer (given our bounds on $s$), the BFS will build all possible 1 layer stacks, then all possible 3 layer stacks, then all possible 5 layer stacks, etc. Taking the best FPR solution of each of these would result in the exact optimal solution among the best Stacked Filters with 1 layer, 3 layers, 5 layers, etc. However, the number of possible Stacked Filters grows exponentially with the number of layers, and so this is infeasibly slow even for just 7 or 9 layers. Additionally, there isn't any guarantee about how close the found solution is to the optimal.

From Brute Force to $\epsilon$-approximation. A more efficient solution closes each search path (i.e. a partially built stack) by building a final positive layer in a way that preserves the performance of BFS. In particular, we do so when using a final positive layer produces a solution within $\epsilon$ EFPR of the best possible Stacked Filter starting with the same initial layers as the current partially built stack. Then, by aggregating the best seen result across all search paths, the overall algorithm finds a Stacked Filter within $\epsilon$ EFPR of the optimal.

Stopping Condition. A search path is stopped when the proportion of negative queries which reach the end of the partial stack, multiplied by the FPR of a single layer built using all available space budget, is less than $\epsilon$. As an example, assume we have a four layer stack each with FPR 0.01, that the initial $\psi = 0.5$, that $\epsilon = 10^{-5}$, and that using all space left for the filter on a single positive layer gives a final layer with FPR 0.01. The proportion of queries which will reach the newly built 5th layer is $0.5\cdot 0.01^2 + 0.5\cdot 0.01^4 \approx 5\cdot 10^{-5}$, where the first term is the proportion of queries from $N_f$ and the second is the proportion from $N_i$. The best hypothetical Stacked Filter rejects all these queries, resulting in no false positives from layer 5 onwards; a single layer stack would have an approximate EFPR of $10^{-2}\cdot 5\cdot 10^{-5} = 5\cdot 10^{-7}$ from layer 5 onwards, which is less than $\epsilon$ different from rejecting all queries. Thus it is suitably close to the best possible Stacked Filter starting with four layers of FPR 0.01, and so we choose the one layer stack and terminate the search path. The stopping condition is lines 6-10 of Algorithm 4.

Expanding the Search. To perform the BFS, we need to figure out which limited set of paths must be searched to produce an optimal solution. One obvious constraint is that all search paths should be feasible; i.e. we should only choose fingerprint lengths for each layer that do not go over the allowed filter size. A second obvious constraint is that layers should have FPR less than 1, which gives that the maximum FPR a layer has will be $\frac{1}{2}$. A third set of constraints comes from the following bounds, which are derived from the fact that the derivative of size with respect to $\alpha_1$ should never be positive (as higher $\alpha_1$ values always lead to worse EFPR, it makes no sense to use $\alpha_1$ values which have both worse EFPR and worse size). The second equation is just a rewritten version of the first.

$$\alpha_1 < \frac{|P|}{|N_f|}\cdot\frac{1}{\ln 2}\cdot\frac{1}{-\log_2\alpha_2 + c} \qquad\qquad \alpha_2 > 2^{-\frac{|P|}{|N_f|}\cdot\frac{1}{\alpha_1\ln 2} + c}$$

The bounds on $\alpha_1$ and $\alpha_2$ then become the following bounds for fingerprint sizes, where $f_1$, $f_2$ are the integer-valued fingerprint sizes for elements in $L_1$ and $L_2$:

$$f_1 > \Big\lfloor -\log_2\Big(\frac{|P|}{|N_f|}\cdot\frac{1}{(1+c)\ln 2}\Big)\Big\rfloor \tag{5}$$
$$f_2 < \Big\lceil \frac{|P|}{|N_f|}\cdot\frac{2^{f_1-c}}{\ln 2}\Big\rceil \tag{6}$$

In the first equation we plugged in $\alpha_2 \le \frac{1}{2}$, since we choose $\alpha_1$ before $\alpha_2$ in Algorithm 4. The ceiling and floor functions come from the fact that the problem is discrete.

Recursive Nature of BFS. One of the key insights of the BFS algorithm is that optimizing a Stacked Filter starting at level 3 (i.e. with the first two levels decided already) is equivalent to optimizing a Stacked Filter at level 1. In particular, because the problem of designing an optimal stack from layer 3 onwards is identical to designing a stack from layer 1, the bounds above on $f_1$ and $f_2$ apply each time we look into building 2 new layers (using an updated value of $|N_f|/|P|$ which represents the expected sizes of $N_f$ and $P$ reaching this layer).

Algorithm 5 updates all the parameters for the next layer. That is, it tracks the parameters $\psi$, $s$, and $|N_f|/|P|$ after going through 2 layers, where these values represent their values conditioned on making it to the current layer. For instance, the new $\psi$ value represents what proportion of queries reaching the current layer are from $N_f$ vs. from $N_i$. The parameter 'prop' tells what portion of negative queries make it to this layer, and the FPR and fingerprints parameters keep track of the already accrued false positives from negative layers and the chosen fingerprint sizes so far.

Proof Sketch of Theorem ??. Since each search path terminates within $\epsilon$ of its best possible EFPR, and Algorithm 3 chooses the best EFPR amongst all search paths, the overall algorithm is within $\epsilon$ of the best EFPR achievable by any Stacked Filter.

For the runtime analysis of Algorithm 3, we detail only the major steps of the proof. The theorem is meant to show that the runtime will not combinatorially explode. The actual runtime, which is of more interest, is much closer to $O(\epsilon^{-1})$ for reasons explained below. The main ideas of the proof are:
(1) If at any point on a search path the product of the FPRs of all positive layers is less than $\epsilon$, we are done. This can be seen easily since the proportion of negative queries reaching the end of the path is less than the product of the FPRs of all positive layers.
(2) Thus, choosing lower values of $\alpha_1$ leads to faster progress towards termination. Because $\alpha_1$ values decrease exponentially (the options are $\frac{1}{2}, \frac{1}{4}, \cdots$), bounding the algorithm's runtime focuses on showing that not too many paths continually pick large $\alpha_1$ values (or equivalently, small $f_1$ values).
(3) High $\alpha_1$ values lead to a small number of choices for $\alpha_2$, as seen in Equation (6). Additionally, the $|P|/|N_f|$ value two layers deeper changes proportionally by the factor $\alpha_2/\alpha_1$, so that $\alpha_2$ cannot continually be larger than $\alpha_1$, as this eventually drives the bound in (6) to 0.
(4) Using this fact, it is possible to prove that for a fixed $N_f$ the discrete optimization runs in time $O\big(\epsilon^{-2}(\frac{|N_f|}{|P|})^{-2}\big)$.
(5) To make the total algorithm runtime $O(\epsilon^{-3})$, an additional assumption is made that the optimal filter has $\frac{|N_f|}{|P|} > \frac{1}{1000}$. We find this to be true in practice all the time.

Explosion in $s$ Along Many Paths. The previous theorem relied only on the interplay between the choices of $f_1$ and $f_2$ and the bound from Equation (6). However, the real reason the algorithm performs efficiently is that a path terminates if the parameter $s$ explodes and becomes large, as this means a single layer filter often has FPR less than $\epsilon$. As a reminder, $s$ is in bits per element, so even as the algorithm uses up space, if a negative layer choice eliminates many positive elements, $s$ can rise dramatically. This is exactly what happens; if we let $s'$ be the next value of $s$, then

$$s' = 2^{f_2-c}\cdot\Big(s - f_1 - \alpha_1\cdot\frac{|N_f|}{|P|}\cdot f_2\Big)$$

For almost all values of $s$ and $f_1$ (which decides $\alpha_1$), this equation either has very few $f_2$ values which keep $s'$ positive, or it grows nearly exponentially and so $s'$ explodes. Since the FPR is of the form $2^c 2^{-s}$, growth in $s$ drives the FPRs of a 1-layer stack quickly towards 0.
For instance, at $s = 40$, the FPR is around $10^{-12}$, which is a lower FPR than essentially any filter requires in practice. The result is that many search paths terminate essentially immediately, and so far fewer Stacked Filter possibilities are added to the queue than the algorithmic analysis above predicts.
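To make the stopping condition concrete, the following sketch recomputes the worked example from the Stopping Condition paragraph above; it is illustrative Python, not part of the evaluated implementation.

    def reach_probability(alphas, psi):
        # Proportion of negative queries that reach layer len(alphas)+1 of a
        # stack whose layers have FPRs `alphas` (layer 1 is a positive layer).
        p_freq = 1.0    # frequent negatives pass negative layers with prob. 1
        p_infreq = 1.0  # infrequent negatives must pass every layer
        for depth, a in enumerate(alphas):
            p_infreq *= a
            if depth % 2 == 0:   # positive layers reject N_f with prob. 1 - a
                p_freq *= a
        return psi * p_freq + (1 - psi) * p_infreq

    # Worked example: four layers of FPR 0.01 with initial psi = 0.5.
    prop = reach_probability([0.01] * 4, 0.5)   # ~5e-5
    final_layer_fpr = 0.01
    print(prop, prop * final_layer_fpr)         # ~5e-7, below epsilon = 1e-5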

11.5 3-Layer Optimization of Adaptive Stacked Filters

This subsection details the process of optimizing a 3-layer ASF. As input, the optimization takes the number of queries the filter is expected to last, denoted by $FTTL$, as well as a size budget $s$. We start by searching over a range of values for $N_o$, with the search performed logarithmically over values between 1000 and $FTTL$, i.e. each value of $N_o$ checked is 1.4 times as big as the previous value.

The value of $N_o$ determines the expected values of $\psi$ and $|N_f|$, which are:

$$E[\psi] = \sum_{x\in N_{samp}} f(x)\big(1 - (1-f(x))^{N_o}\big)$$
$$E[|N_f|] = (\ell\cdot N_o) + \sum_{x\in N_{samp}} \big(1 - (1-f(x))^{N_o}\big)$$

Recall that $\ell$ is the estimate of what portion of queries fall on values outside our histogram. The second equation pessimistically assumes that all elements of $N_f$ which were not present in the histogram sample will never be queried again, i.e. they are one-off queries. If the universe of elements is not extremely large, or if the query pattern of the elements not present in the histogram is skewed, then the true $E[\psi]$ may be higher and the true $E[|N_f|]$ may be smaller.

If the size of $N_{samp}$ is large, then calculating $E[\psi]$ and $E[|N_f|]$ exactly can be expensive. Thus, when dataset sizes are larger than 1 million elements, we sample uniformly with replacement 100,000 values of $N_{samp}$ and estimate $E[\psi]$ and $E[|N_f|]$ by:

$$E[\psi] = (1-\ell) - \frac{|N_{samp}|}{100{,}000}\sum_{i=1}^{100{,}000} f(x_i)\,(1-f(x_i))^{N_o}$$
$$E[|N_f|] = (\ell\cdot N_o) + \frac{|N_{samp}|}{100{,}000}\Big(\sum_{i=1}^{100{,}000} 1 - (1-f(x_i))^{N_o}\Big)$$

In the above estimations, we make use of the fact that $\sum_{x\in N_{samp}} f(x) = 1-\ell$, and hence we only estimate the $-f(x)(1-f(x))^{N_o}$ term. This considerably reduces the variance of the estimate. (A short illustrative sketch of this estimator is given after this subsection's figures.)

The optimization goal is to minimize the weighted sum of the EFPR using only $L_1$ for the first $N_o$ queries and the EFPR of the ASF using all 3 layers on the rest of the queries. To do this, we use discrete search on the FPRs of each layer for both continuous FPR filters and integer-length fingerprint filters. For integer-length fingerprint filters, we check all fingerprint sizes on each layer between $c+1$ and $c+16$, so that the FPRs are between $\frac{1}{2}$ and $2^{-16}$, and return the best configuration satisfying the size constraint. For the continuous FPR filters, we similarly use size bounds which provide FPRs of $\frac{1}{2}$ and $2^{-16}$ on each layer, and then perform a grid search where we check sizes that are 0.1 bits per element apart. As before, we return the best configuration that satisfies the size constraint.

Figures 7 and 8 display the optimization times and FPRs for ASFs on the synthetic integer dataset with and without sampling. The dataset has 1 million positive values, 100 million negative values, and a Zipf parameter of 1. The number of negative values in $N_{samp}$ is the dependent variable. In general, for small datasets the times are comparable and execute in sub-second time. As the dataset size grows, the sampling-based approach becomes significantly faster while maintaining similar accuracy.

Figure 7: Optimization time (ms) vs. collected negatives (millions), with full data and with sampling. Figure 8: Filter FPR vs. collected negatives (millions), with full data and with sampling.
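The sampling estimator above translates directly into code; the following is a minimal sketch under the assumption that the histogram is available as a map from sampled negatives to their query probabilities $f(x)$, with names chosen for illustration only.

    import random

    def estimate_psi_nf(hist, ell, n_o, samples=100_000):
        # Monte Carlo estimates of E[psi] and E[|N_f|] for a candidate N_o.
        # `hist` maps each sampled negative x to its query probability f(x);
        # `ell` is the probability mass of queries outside the histogram.
        fs = random.choices(list(hist.values()), k=samples)  # uniform, with replacement
        scale = len(hist) / samples                          # |N_samp| / samples
        e_psi = (1.0 - ell) - scale * sum(f * (1.0 - f) ** n_o for f in fs)
        e_nf = ell * n_o + scale * sum(1.0 - (1.0 - f) ** n_o for f in fs)
        return e_psi, e_nf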
11.6 Positive Set Adaptivity

In general, insertions are a very difficult area for all filters. All of them risk either blowing up their FPR (Bloom, Quotient) or rejecting the insertion (Cuckoo) once they grow beyond their desired number of elements $n$. However, Stacked Filters suffer from insertions more acutely than query-agnostic filters. Because the under-capacity positive layers have a lower initial EFPR, many of the negative elements which will eventually be false positives once the filter is at capacity do not register as false positives initially. Therefore, the Stacked Filter is unable to take advantage of its knowledge of known negatives as effectively as it normally would. For Bloom Filters, there is a fairly easy fix: one can insert negatives into $L_2$, $L_4$, and so on if they are close to being false positives, in that they are one set bit away from being a false positive (a sketch of this check appears below).

Figures 9 and 10 show the changes in FPR of a Stacked Filter from additions into the filter, with and without the one-bit adjustment. In the experiment, we use the integer experiment from the paper with an initial positive set of 1 million elements, a negative universe of 100 million elements, 10 bits per positive element, and a Zipf parameter of 0.75. To perform the experiment, we start by allocating a filter under the assumption that the filter can hold 1.25 million positive elements. We then insert half a million new positive elements and test the incremental change in the EFPR of both the Stacked Bloom Filter and the query-agnostic Bloom filter. The results show that Stacked Filters can struggle with new insertions more than query-agnostic filters do, but also that with the adjustment Stacked Filters remain more performant. For both Stacked Filters and query-agnostic filters, the difference in FPR between the filter's low point and its high point is a full order of magnitude.

Because filters struggle with new items, a common approach is to rebuild the filter on new updates. To assess whether this strategy would be effective for Stacked Filters as well, we include the results of an additional experiment in Figure 11. This experiment compares the latency of query-agnostic filters and Stacked Filters on a workload consisting of a single bulk load followed by 10 million queries against an SSD, for a range of positive set sizes. All other conditions are the same as in the above experiments. The results show that for reasonable positive set sizes the Stacked Filter provides lower overall latency on a rebuild-and-query workload, implying that Stacked Filters can be used effectively in filtering schemes which rely on this paradigm. The graph also shows how Learned and Sandwiched Learned Filters perform under such a scenario; in this case the slow filter rebuild makes them a significantly harder option to deploy when fast rebuilds are needed.
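The one-bit adjustment described above can be sketched as follows. This is an illustrative reading of the heuristic, assuming a Bloom layer that exposes its bit array and hash positions (as in the earlier three-layer sketch); the names in the commented usage are hypothetical.

    def one_bit_away(bloom, x):
        # x misses the Bloom filter by exactly one unset bit, i.e. setting a
        # single additional bit would turn x into a false positive.
        return sum(1 for p in bloom._positions(x) if not bloom.bits[p]) == 1

    # As new positives are inserted into L1 (possibly setting new bits),
    # tracked negatives that become near-misses can be promoted into the
    # negative layer so they keep being filtered out:
    #
    #   for x in tracked_negatives:
    #       if one_bit_away(L1, x):
    #           insert x into L2   # and similarly for L4, and so on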

Figure 9: Normal Stacked Filter (EFPR vs. percent capacity filled). Figure 10: With the one-bit adjustment (EFPR vs. percent capacity filled). Figure 11: Rebuild + query time (sec) vs. |P| in millions.