Weighted Bloom Filter

Jehoshua Bruck∗ Jie Gao† Anxiao (Andrew) Jiang‡

∗ Department of Electrical Engineering, California Institute of Technology. [email protected]. † Department of Computer Science, Stony Brook University, Stony Brook, NY 11794. [email protected]. ‡ Department of Computer Science, Texas A&M University, College Station, TX 77843-3112. [email protected].

Abstract— A Bloom filter is a simple randomized are directed to proxies instead of the original Web server, that answers membership query with no false negative and a in order to reduce traffic and alleviate network bottlenecks. small false positive probability. It is an elegant data compression Upon a miss at a local , a proxy wants to check whether technique for membership information, and has broad applica- tions. In this paper, we generalize the traditional Bloom filter to other proxies have the desired document before sending out Weighted Bloom Filter, which incorporates the information on the requests to fetch the document. A Bloom filter is adopted query frequencies and the membership likelihood of the elements to summarize the contents of each proxy for two reasons – into its optimal design. It has been widely observed that in many a small false positive probability is tolerable, and the size applications, some popular elements are queried much more often of the Bloom filter is much smaller compared with the list than the others. The traditional Bloom filter for data sets with irregular query patterns and non-uniform membership likelihood of full URLs or documents. With a similar spirit, Bloom can be further optimized. We derive the optimal configuration filters are used in many network applications such as file of the Bloom filter with query-frequency and membership- search in P2P network [19], packet classification [6], [26], likelihood information, and show that the adapted Bloom filter trajectory sampling [12], en-route filtering of false data in the always outperforms the traditional Bloom filter. Under reasonable network [27], fast lookup [25] and many others [22], frequency models such as the step distribution or the Zipf’s distribution, the improvement of the false positive probability [21]. Motivated by these applications, Bloom filter, since its of the weighted Bloom filter over that of the traditional Bloom first invention by Bloom [3], has also been augmented in filter is usually of orders of magnitude. various ways [20], [8], [9], [14], [18], [19], [23]. Keywords: Bloom Filter, Membership Query, Combinatorics The Bloom filter takes a hidden assumption that all elements in the universe are viewed and treated identically, which is the I.INTRODUCTION best we can assume without further information of the query A Bloom filter [3] is a compact randomized data structure frequency distribution or their membership likelihood (likeli- for representing a in order to support membership queries. hood of being a member). And under this assumption it can It encodes the elements in a set S, called members, that exists be shown that the Bloom filter is space-wise asymptotically in a much larger universe U. (So S ⊆ U.) In short, a Bloom optimal with a fixed false positive probability and zero false filter is an m- array such that each member in S is hashed negative probability. However, it commonly occurs in practice to k positions () in the array, and those bits are set to ‘1’. that elements do not get queried evenly — popular elements A membership query takes an input element and checks the are queried much more often than unpopular elements. In k hashed positions. If all those bits are ‘1’, then the query the measurement of the Internet traffic pattern, it is observed returns ‘yes’. A Bloom filter has no false positive but a low that traffic flow is highly skewed and concentrates heavily false negative probability. In particular, when k = ln 2 · m/n, on popular files [15], [4]. If the query frequency or the with n = |S| being the number of members, the false positive membership likelihood is not uniform over all the elements probability achieves its minimum value (1/2)k. A Bloom filter in the universe, the traditional configuration of the Bloom uses only a constant number of bits on average in the memory filter does not give the optimal performance. In other words, space for each member and answers membership queries with the Bloom filter can be further optimized if we know the a very small false positive probability. It is a well known query frequency or the membership likelihood distribution of compression technique for membership information and has the elements in the universe. In many real systems, statistics broad applications. about the traffic flow such as frequencies of elements, top k The Bloom filter is a simple, elegant and space efficient categories to which the most elements belong, can be and is structure. Although theoretically a hash table supports mem- already being monitored [10], [17], [11], [7], [13], [2]. Thus bership queries and yields an asymptotically vanishing prob- it makes a lot of sense to use such statistics and adapt the ability of error by using only Θ(log n) bits per element, the Bloom filter with the information about query frequencies Bloom filter attracts a lot of interest in practice due to its and membership likelihood. Indeed, information such as query simplicity and space efficiency. In particular, the space effi- frequencies has already been used to improve the performance ciency of the Bloom filter makes it very appealing in network of caching structures [4]. applications [5], where systems need to share the information In this paper, we give the optimal configuration of the about their available resources. In a typical scenario, e.g., the Bloom filter for items with variable query frequencies and Web cache sharing [14], user queries for desired documents membership likelihood. We call such a Bloom filter the weighted Bloom filter. In particular, each element e ∈ U is that a specific bit remains ‘0’ is simply assigned ke hash functions, where ke depends on its query fre- 1 quency and its likelihood of being a member. Therefore, each p = (1 − )kn ≈ e−kn/m. m non-member element has a different false positive probability. The average false positive probability is actually a weighted Therefore, the probability of a false positive is sum over the query frequencies of the elements in the universe. ¡ 1 ¢k P = 1 − (1 − )kn ≈ (1 − e−kn/m)k = (1 − p)k. Intuitively, an element is assigned more hash functions if its FP m query frequency is high and its chance of being a member is low. We have derived the optimal configuration of the Bloom To obtain the best performance of the Bloom filter, we filter with respect to the query frequencies and membership would like to choose k that minimizes the false positive likelihood. When the query frequencies and the membership probability. Intuitively, more hash functions for an element likelihoods are the same for all elements in the universe, the will increase the chances of finding a ‘0’ for a non-member weighted Bloom filter just converges to the traditional Bloom but will also increases the total number of ‘1”s in the filter. filter. Thus our model is indeed a generalization and further The optimal number of hash functions can be obtained by optimization of the traditional Bloom filter. We also evaluate taking the derivative of PFP to be zero. This will reveal the performance gain of the weighted Bloom filter over the that the Bloom filter has the best performance if k is set to ln 2 · m/n. In that case, the false positive probability is traditional one, under various frequency distributions such as k −(m/n) ln 2 step-like functions or Zipf distributions [28], [4], [24]. We PFP = 1/2 = 2 . observe that the improvement in the false positive probability III.WEIGHTED BLOOM FILTER is of orders of magnitude under reasonable frequency models. The weighted Bloom filter can also be gracefully integrated The traditional Bloom filter does not differentiate the query with the frequency estimators [10], [17], [11], [7], [13], [2], frequencies of different elements or their a priori likelihoods which usually maintain a set of ‘hot’ categories. In practice, of being members. In this section, we study Weighted Bloom one can adapt the Bloom filter with such rough frequency Filter, the Bloom filter optimized based on the elements’ query estimations, represented by popularity buckets. All the ele- frequencies and their probabilities of being members. In many ments in a ‘hot’ category share the same frequency estimation, applications, query frequencies and membership likelihoods and they are associated with the category by simply checkable are estimated or collected with well developed techniques [10], attributes. Examples include the web files of a popular website [17], [11], [7], [13], [2]. Statistics of such information are or the songs of a popular singer. Similarly the membership maintained, especially for a set of ‘hot’ categories. As will likelihood can also be estimated using such a category-based be shown later, such data are very useful for the Bloom technique. That makes the obtaining of the estimations very filter, even if they are category-oriented and are only rough efficient and the design of the weighted Bloom filter very estimations for the individual elements. We will present the practical. We discuss more on the implementation issues in optimal configuration for the weighted Bloom filter, and show Section III. that it generalizes the traditional Bloom filter. Same as the traditional Bloom filter, a weighted Bloom filter II.AN OVERVIEW OF BLOOM FILTER uses m bits to record the n member elements in a set S. (|S| = n.) There are totally N elements in the universe, where We first give a quick introduction to Bloom filter [3]. A N À n. We assume that the probability of one element’s Bloom filter is used to represent a set of elements S in a big being a member, denoted by xe, is independent of all the other universe U. The number of elements in S (called members), elements. We introduce the indicator random variable Xe for n = |S|, is usually much smaller than the size of the universe, each e ∈ U as follows: N = |U|. (Therefore, S ⊆ U and n ¿ N.) A Bloom ½ 1 , if e ∈ S filter is an m- and uses k independent uniformly Xe = 0 , if e∈ / S random hash functions {hi|i = 1, 2, ··· , k} which map to range {1, ··· , m}. For each element in S, the hi(x)-th bits in Notice that E{Xe} = xe. the Bloom filter, i = 1, ··· , k, are set to ‘1’. For a membership The basic rational for designing the weighted Bloom filter is query about whether an element y is in the set S, the answer as follows. The filter’s false positive probability is a weighted is ‘yes’ if all the bits hi(y) are ‘1’ and ‘no’ otherwise. sum of each individual element’s false positive probability, The query to a Bloom filter has no false negative – if an where the weight corresponding to an individual element is element is a member, the query always returns ‘yes’. There is positively correlated with the element’s query frequency and a small false positive probability. The query to a non-member is negatively correlated with the element’s probability of being may get an answer ‘yes’ if the bits corresponding to its hashed a member. Therefore, we would in general like to assign more positions are all ‘1’. Assuming that the hash functions are hash functions to an element with a higher query frequency perfectly random, the probability of a false positive for a non- or with a lower probability of being a member, in order to member element can be calculated in a simple way. After all reduce the false-positive probability of an element with a larger the n members are hashed to the Bloom filter, the probability weight. For each element e ∈ U, denote its query frequency by With the above ke for e ∈ U, the probability that a bit in the fe. Denote the number of hash functions used for it by ke. Bloom filter is ‘0’ is The total numberP of hash functionsP used for elements in S is p = 1/2; (6) denoted by k = e∈S ke = e∈U Xe·ke. (Note that there is a slight abuse of notation here: the notation ‘k’ here corresponds and the expectation of the false-positive probability is: Y to ‘nk’ for the traditional Bloom filter.) The rule of answering −(m/n) ln 2 xe/n E{PFP } = 2 · N · E{re} . (7) the membership queries is the same as before. Specifically, for e∈U an element e, we check all the ke corresponding bits in the Proof: We explain here the intuition on how to derive the Bloom filter, and say ‘e is a member’ if and only if all the ke best configuration of the weighted Bloom filter. The detailed bits are 1. So there is no false negative, only false positive. derivation is in the Appendix. Basically, we see the number All the hash functions are uniformly random hash functions. of hash functions assigned to an element e ∈ U as a variable, Therefore the probability that a specific bit remains ‘0’ in the and seek a local optimum of E{PFP } by taking its partial Bloom filter, denoted by p, only depends on the total number derivative with respect to ke. By solving the equations we of hash functions operated on the Bloom filter. Specifically, obtain the relative ratios of ke for each e ∈ U. While adjusting we have: the number of hash functions assign to individual elements, we 1 k − k p = (1 − ) ≈ e m . (1) should also scale those numbers by an appropriate factor in m order to control the portion of the Bloom filter that are set to To simplify the analysis, we assume in the standard way [5], be ‘1’s, because the overall false-positive probability will be [20] that each bit in the Bloom filter is set to ‘0’ with probabil- hurt whether the portion is too large or too small. We choose ity p and ‘1’ with probability 1−p, independently of the other the total number of hash functions, k, to be an appropriate bits. This is a valid simplification because with n being relative value so as to minimize the expected false positive probability large (the typical setting for using a Bloom filter), the fraction E{PFP }, and show that the local optimum found is in fact of 0 bits in the Bloom filter is sharply concentrated around p, also globally optimum. ¤ as shown in [20]. In practice, the dependency between the bits In practice, ke needs to be a non-negative integer. So in the Bloom filter is negligible [1]. The probability of a false when implementing the weighted Bloom filter, we round positive is the weighted sum of the false positive probabilities ke to a nearby non-negative integer. More details on the of all non-members in the universe. Thus the probability of implementation of the weighted Bloom filter will be studied false positive, PFP , is: in subsection III-B. P f (1−p)ke e∈PU−S e PFP = f P e∈U−S e A. Generalization of Bloom filter (1−X )·f (1−p)ke e∈PU e e (2) = (1−X )·f In this subsection, we compare the weighted Bloom filter P e∈U e e P (1−Xe)fe ke to the traditional Bloom filter and show that our result is = e∈U (1−X )·f · (1 − p) i∈U i i a generalization and further optimization of the traditional Denote by re the normalized query frequency of e ∈ U, optimal configuration. In the traditional Bloom filter [3], no (1 − X )f knowledge about an element’s query frequency or membership r = P e e . (3) e (1 − X ) · f likelihood is assumed. Its optimal configuration is: i∈U i i  then the false positive probability can be represented as  p = 1/2; X ∀e ∈ U, ke = (m/n) ln 2; ke PFP = re(1 − p) , (4)  −(m/n) ln 2 PFP = 2 . e∈U We use E{·} to denote the expectation of a random variable. In the weighted Bloom filter setting, all the elements should Then, the expectation of the false positive probability is be treated the same way when no knowledge about query frequencies or membership likelihoods is available. Then ∀e ∈ E{PFP }. Given a fixed-sized Bloom filter of m bits, we would like to minimize the expected false positive probability U, we can set fe and xe to be constants. More specifically, by optimizing the number of hash functions assigned to each xe = E{Xe} = n/N, where n = |S| and N = |U|. By element. Our main result in this section is to derive the best equation 3, we see thatP E{re} is a constantP as well for any configuration of the weighted Bloom filter. The result is shown e ∈ U. Then since e∈U E{re} = E{ e∈U re} = 1, we in the following theorem. have that ∀ e ∈ U, E{re} = 1/N. By plugging xe = n/N and E{re} = 1/N for all e ∈ U into equations 5 and 7, we Theorem 3.1. In order to minimize the expected false positive −(m/n) ln 2 obtain the formulas ke = (m/n) ln 2 and PFP = 2 , probability E{PFP } of a weighted Bloom filter, the Bloom the same as the second and the third formulas of the tradi- filter should be configured as follows, assuming that ke, for tional Bloom filter’s configuration. By taking equation 6 into e ∈ U, can be any real number. The number of hash functions account, we see that the traditional optimal configuration of for an element e ∈ U is Bloom filter is reduced to our solution as a special case. m X x k = · ln 2 + (ln E{r } − i · ln E{r })/ ln 2; (5) When the elements in U have varied query frequencies e n e n i i∈U or membership likelihoods, the false-positive probability of the weighted Bloom filter usually becomes better than that If ke falls between two non-negative integers, we can also of the traditional Bloom filter. We use here a simple sce- round it probabilistically to one of them with a simple binary nario for a clear illustration. The more general case is its distribution, so that its expected value after rounding is still extension. Consider the case where all the elements have ke. That is the approach adopted in the numerical analysis in identical membership likelihood, but their query frequencies the following subsection. vary from element to element. Then the expected false-positive C. Numerical Analysis of Weighted Bloom filter probability, as shown in equation 7, becomes: Q It is commonly observed that across many scales in society −(m/n) ln 2 xe/n E{PFP } = 2 · N · E{re} Q e∈U and economics, human behaviors exhibit inherent character- E{r }1/N = 2−(m/n) ln 2 · Pe∈U e (8) istics. Statistics about query frequencies of Web pages [4], e∈U E{re}/N ≤ 2−(m/n) ln 2. [24] and input to search engines [15] often revealed that user queries data are skewed where a few popular items or files TheP first step holds because xe = n/N for all e ∈ U, are searched much more often than majority unpopular ones. and e∈U E{re} = 1. The second step, the inequality, Numerous studies have found that the distribution of query holds because the geometric average of E{re} must be no frequencies follows Zipf’s law [28], which states that the greater than the arithmetic average of E{re}. In fact, the relative probability of a request for the ith most popular item inequality will become equality only if E{re} = 1/N for is inversely proportional to the power of i. We also study a all e ∈ U, which, in turn, leads to the requirement that simplified analog of the Zipf’s distribution, which we denote fi = fj for all i, j ∈ U. (We skip the details of the proof.) by the step distribution where there are only two categories, the So the false-positive probability of the traditional Bloom filter popular (or hot) set and the unpopular (cold) set. This models always exceeds that of the weighted Bloom filter unless all the case when accurate frequency distributions are not known the elements have the same query frequency. and only a rough bucketing on the popularity of the elements B. Efficient implementation of weighted Bloom filter is maintained. Such kind of simple bucketing can be created and updated efficiently by recent on streaming data, The optimal result we have derived does not consider the which usually use an extremely small amount of storage space constraint that the number of hash functions assigned to each and a few passes of the data set [10], [17], [11], [7], [13], element needs to be a non-negative integer. Also, computing [2]. In this section, we evaluate the performance of weighted E{r } is often non-trivial — the expression for r (equation 3) e e Bloom filter on queries whose frequency distributions follow contains N binary random variables X , so there are totally i Zipf’s law or the like. Even with a rough estimation of the 2N disjoint cases. Below we present an efficient way to query frequency distribution, such as the 2-bucketing, the approximately compute the optimal solution. performance improvement of weighted Bloom filter over the Firstly, consider the expression for r as shown in equa- e traditional configuration is very impressive. tion 3, from which we get: 1) Different Query Frequency Models, Uniform Member- E{re} ship Likelihood: We first assume that each element in the = E{ P (1−Xe)fe } universe has the same likelihood of being a member and its i∈U (1−Xi)·fi P (1−0)fe query frequency follows some known distributions. The case = Prob{Xe = 0}· E{ (1−X )·f +(1−0)f }+ i∈U−{e} i i e when the membership likelihood is correlated with its query (1−1)fe Prob{Xe = 1}· E{ P } frequency is studied in the next subsection. i∈U−{e}(1−Xi)·fi+(1−1)fe fe a) Step Distribution: We start with a simplified analog = (1 − xe) · E{ P } i∈U−{e}(1−Xi)·fi+fe (9) of the Zipf’s distribution, which we denote by the step P distribution. The universe U is partitioned into two subsets: When the value of (1−Xi)·fi well concentrates i∈U−{e} a hot set A and a cold set B. The percentage of A in the around its expectation, E{re} can be approximately computed as universe is α = |A|/|U|. So 0 ≤ α ≤ 1 and α = 1 − |B|/|U|. E{re} Each element in B has query frequency f, while each element fe ≈ (1 − xe) · P in A has query frequency c · f.(c ≥ 1.) The value of c shows E{ i∈U−{e}(1−Xi)·fi}+fe fe how popular the ‘hot’ elements are in terms of query compared = (1 − xe) · P i∈U−{e}(1−E{Xi})·fi+fe to the ‘cold’ elements. fe (10) = (1 − xe) · P A simple calculation shows that the number of hash func- i∈U−{e}(1−xi)·fi+fe = P (1−xe)fe tions assigned to elements in set A and B are respectively, i∈U (1−xi)fi−(1−xe)fe+fe ½ = (1P−xe)fe (m/n) ln 2 + (1 − α)(ln c/ln 2) , if e ∈ A xefe+ i∈U (1−xi)fi ke = . P (m/n) ln 2 − α(ln c/ln 2) , if e ∈ B Thus we first compute the term (1−x )f , which will i∈U i i The ratio R of the false positive probability of the weighted be used in computing E{r } for each e ∈ U. Equation 10 e Bloom filter to the false positive probability of the traditional provides an efficient way to compute E{r }. Given E{r }, e e Bloom filter is: we can compute k for e ∈ U using equation 5. Then as cα e R = . a heuristic, we round ke to the closest non-negative integer. αc + (1 − α) We call 1/R, the inverse ratio of the two false positive positive correlation is Web caching and distributed file storage, probabilities, the PFP improvement. (The greater the PFP where popular files are more likely to be stored and replicated. improvement is, the better.) The above formulas assume that Examples of negative correlation appear in network anomaly the number of hash functions assigned to each element, detection [16] and security check systems, where suspicious ke, can be any real number; when we round ke to nearby events appear much less often than ordinary events. Bloom non-negative integers, the two formulas should be adjusted filter is an especially attractive data structure for such security- accordingly. When all ke take non-negative integer values, the related applications for its quicker rejection time and zero false PFP improvement corresponding to different values of α and negative rate. c are illustrated in Fig. 1 (i). We assume that the correlation between the membership In Figure 1 (i), the vertical axis is the PFP improvement. probabilities and the query frequencies can be expressed as a The two horizontal axes are α that varies in [0, 1] and c function — namely, ∀e ∈ U, the probability that e ∈ S is that varies in [1, 10000]. We see that the P improvement FP x = F (f ). increases with c, because the weighted Bloom filter improves e e its performance when the query frequencies become more The function F (·) can be diversified by the applications. In skewed. When c is fixed, we see that neither too many nor too this subsection, we study a basic family of functions: few hot elements leads to large values of the PFP improve- b ment. (The ratio of the hot elements is controlled by the pa- F (fe) = afe , rameter α.) That’s because in both cases, the query frequencies where we let b take all possible values. (The valueP of a depends get closer to a uniform distribution. So the best performance onP b, because F (·) is a probability function, so e∈U F (fe) = improvement that the weighted Bloom filter makes appears e∈U xe = n, which gives us an equation to solve a using somewhere in the middle. The difference in the number of b.) Such polynomial functions are of fundamental interest hash functions assigned to hot elements and cold elements is because they can help us study F (·) of more general forms. moderate — when m/n = 14, which is a typical Bloom filter To specifically explain it, let’s first assume that the function configuration, the number of hash functions assigned to hot F (·) can be expressed as the summation of a set of other elements varies between 10 and 23, and the number of hash probability functions g1(·), g2(·), ··· , gt(·) — namely, there functions assigned to cold elements varies between 0 and 10. exist constants λ1, λ2, ··· , λt, such that However, the improvement of the false positive probability is x = F (f ) = λ g (f ) + λ g (f ) + ··· + λ g (f ). a lot. The maximum PFP improvement here is 396.9. e e 1 1 e 2 2 e t t e b) Zipf Distribution: The Zipf distribution is defined as Since for any i ∈ {1, 2, ··· , t}, gi(·) is a probability function, follows. The elements in U are sorted according to their query P we must have e∈U gi(fe) = n. So we must have frequencies, from high to low. For the i-th element (where i P P P t λ = [ t λ g (f )]/n is called the element’s rank), its query frequency f is i=1 t Pi=1 Pi e∈U i e = [ t λ g (f )]/n 1 Pe∈U i=1 i i e = [ F (f )]/n f ∝ α e∈U e i = 1 for some constant parameter α. Now let’s look at Equation 7, the expression for the false- It is difficult to obtain a closed form expression for P FP positive probability of the weighted Bloom filter. We can improvement because of the divergence of some summations rewrite Equation 7 as in the related formulas. So we resort to numerical computation based on the known formulas. The result is as shown in E{PFP } Q = 2−(m/n) ln 2 · N · E{r }H(fe)/n Figure 1 (ii). There the horizontal axis is α and the vertical e∈U e P Qt Q t −(m/n) ln 2 λi i=1 λigi(fe)/n axis is the PFP improvement. The larger α is, the more = 2 · ( i N ) · e∈U E{re} Qt Q skewed the query frequencies are, so the P performance −(m/n) ln 2 gi(fe)/n λi FP = 2 · i=1(N e∈U E{re} ) increases. The PFP improvement corresponds to α = 1.6 is (11) 115.7. To reduce the false-positive probability by the factor of We denote by R the ratio of the false-positive probability 115.7, it will require the traditional Bloom filter to increase of the weighted Bloom filter to that of the traditional Bloom its size by 9.89n bits, where n is the number of members. filter. Then since the false-positive probability of the traditional 1 −(m/n) ln 2 That increase is about the size of a normal Bloom filter in Bloom filter is ( 2 ) , we get many applications. Therefore, such a P improvement may FP Yt Y be seen as significant. gi(fe)/n λi R = (N E{re} ) (12) 2) Considering Membership Likelihoods: Both positive i=1 e∈U correlations and negative correlations exist in real life between the membership likelihoods of elements and the query frequen- Let’s use Ri to denote the value of R when xe = gi(e) for all e ∈ U. Clearly, for 1 ≤ i ≤ t, cies. Positive correlation means that the higher an element’s Y gi(fe)/n query frequency is, the more likely it is a member, whose Ri = N E{re} (13) negative correlation is the other way around. An example of e∈U 120 400

350 100

300 80 250

Z 200 60

150

40 100 Ratio of False−Positive Probabilities 50 20

0 0 0 5000 0.8 1 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0.2 0.4 0.6 100000 alpha X Y

(i) (ii)

Fig. 1. PFP improvement of the weighted Bloom filter, when the query frequencies are modelled, respectively, with the step distribution and the Zipf distribution. In both cases m/n = 14. (i) Step distribution. The vertical axis is the PFP improvement. The two horizontal axes are, respectively, α that varies in [0, 1] and c that varies in [1, 10000]. (ii) Zipf distribution. The vertical axis is the PFP improvement. The horizontal axis is α.

So Firstly we determine the value of a. We know that Yt λi P R = Ri (14) n = Pe∈U xe i=1 = e∈U F (fe) P b P b = e∈A afe + e∈B afe R is the performance improvement when H(·) is used as the = |A| · a · (cf)b + |B| · a · f b membership probability function, while Ri is the performance = αNacbf b + (1 − α)Naf b Pt improvement when gi(·) is used. Note that i=1 λi = 1. So R is the weighted geometry average (with λi as the weight) of So Ri,R2, ··· ,Rt. That means if we can learn Ri,R2, ··· ,Rt, we can learn R as well. Now note that many smooth functions n a = can be expressed as its Taylor series. In such cases, we use αNcbf b + (1 − α)Nf b Taylor series to express F (·) as X bi Then F (fe) = λi(aife ). i ½ b b b nc /(αNc + (1 − α)N), if e ∈ A xe = afe = b bi n/(αNc + (1 − α)N), if e ∈ B Here each term aife corresponds to a function gi(fe). Thus if we can learn the performance improvement of the weighted b Bloom filter with F (fe) = afe , we can learn the performance Since N is sufficiently large, the probability that a single when F (·) takes more general forms as well. And that is why element e is a member, x , approaches 0. Therefore, from b e it is interesting to study the family of functions F (fe) = afe . Equation 10, we get b In the following, we let F (fe) = afe , and study the performance of the weighted Bloom filter. When b > 0, the E{r } membership probability x has a positive correlation with the eP e = f /( f ) query frequency — the greater f is, the greater x = F (f ) e Pi∈U i P e e e = f /( f + f ) is. When b < 0, they have a negative correlation. We let b take e i∈A i i∈B i = f /(|A| · c · f + |B| · f) any value and study both cases. e = ½fe/(αNcf + (1 − α)Nf) We adopt here the step distribution as the model for the c/(αNc + (1 − α)N) , if e ∈ A = elements’ query frequencies for the purpose of simplified 1/(αNc + (1 − α)N) , if e ∈ B illustration, with the notations A, B, α, c and f defined as before. The analysis holds for a broad range of other query frequency models as well, but the calculation there becomes We can now compute the ratio of the false-positive prob- more complex. ability of the weighted Bloom filter to that of the traditional Bloom filter, R: [10] G. Cormode and S. Muthukrishnan. What’s hot and what’s not: Q tracking most frequent items dynamically. In PODS ’03: Proceedings R = N · E{r }re/n of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Qe∈U e Q re/n re/n Principles of database systems, pages 296–306, New York, NY, USA, = N · e∈A E{re} · e∈B E{re} b 2003. ACM Press. nc · αN = N · ( c ) αNcb+(1−α)N n [11] E. D. Demaine, A. Lopez-Ortiz,´ and J. I. Munro. Frequency estimation αNc+(1−α)N of internet packet streams with limited space. In ESA ’02: Proceedings n (1−α)N b · n of the 10th Annual European Symposium on Algorithms, pages 348–360, ·( 1 ) αNc +(1−α)N αNc+(1−α)N (15) London, UK, 2002. Springer-Verlag. αcb c αcb+1−α [12] N. Duffield and M. Grossglauser. Trajectory sampling with unreliable = N · ( αNc+(1−α)N ) reporting. In Proc. INFOCOM’04, volume 3, pages 1570–1581, Hong 1−α 1 b Kong, China, 2004. ·( ) αc +1−α αNc+(1−α)N [13] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ull- αcb 1 man. Computing iceberg queries efficiently. In VLDB ’98: Proceedings = · c αcb+1−α αc+1−α of the 24rd International Conference on Very Large Data Bases, pages 299–310, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Figure. 2 shows the PFP improvement. Comparing it to Inc. the result in Figure. 1 (i) (where there is no correlation [14] L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: a between membership likelihoods and query frequencies), we scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3):281–293, 2000. see that the the correlation makes a clear difference. Positive [15] Google. Google zeitgeist – search patterns, trends, and surprises correlation decreases the PFP improvement, while negative according to google. http://www.google.com/press/zeitgeist.html. correlation increases it, as expected. [16] G. Gu, M. Sharif, X. Qin, D. Dagon, W. Lee, and G. Riley. Worm detection, early warning and response based on local victim information. In Proc. of the 20th Annual Computer Security Applications Conference IV. CONCLUSIONS (ACSAC ’04), December 2004. [17] R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple In this paper we have deepened our understanding on the for finding frequent elements in streams and bags. ACM Trans. Database well-known Bloom filter structure. When more information Syst., 28(1):51–55, 2003. about the membership likelihood and query frequencies is [18] A. Kumar, J. Xu, J. Wang, O. Spatschek, and L. Li. Space-code bloom filter for efficient per-flow traffic measurement. In Proc. INFOCOM’04, available, we derived the optimal Bloom filter configuration volume 3, pages 1762–1773, Hong Kong, China, 2004. and investigated efficient implementation schemes in practice. [19] A. Kumar, J. Xu, and E. W. Zegura. Efficient and scalable query routing We have compared the performance of the weighted Bloom for unstructured peer-to-peer networks. In Proc. INFOCOM’05, Miami, Florida, U.S.A., 2005. filter with the traditional Bloom filter under reasonable query [20] M. Mitzenmacher. Compressed bloom filters. IEEE/ACM Transactions frequency and membership likelihood models. For the future on Networking, 10(5):604–612, 2002. work, it would be interesting to evaluate the performance gain [21] R. Morselli, B. Bhattacharjee, J. Katz, and P. Keleher. Trust-preserving of the weighted Bloom filter on real network traces and further set operations. In Proceedings of INFOCOM, volume 4, pages 2231– 2241, Hong Kong, China, 2004. investigate the integration of the Bloom filter with network [22] P. Mutaf and C. Castelluccia. Compact neighbor discovery. In Proc. statistics estimators. INFOCOM’05, Miami, FL, USA, 2005. [23] S. C. Rhea and J. Kubiatowicz. Probabilistic location and routing. In Proc. INFOCOM’02, volume 3, pages 1248–1257, 2002. REFERENCES [24] P. Rodriguez, C. Spanner, and E. W. Biersack. Analysis of web [1] M. Adler, S. Chakrabarti, M. Mitzenmacher, and L. Rasmussen. Par- caching architectures: hierarchical and distributed caching. IEEE/ACM allel randomized load balancing. Random Structures and Algorithms, Transactions on Networking, 9(4):404–418, 2001. 13(2):159–188, 1998. [25] H. Song, S. Dharmapurikar, J. Turner, and J. Lockwood. Fast hash table [2] B. Babcock and C. Olston. Distributed top-k monitoring. In SIGMOD lookup using extended Bloom filter: An aid to network processing. In ’03: Proceedings of the 2003 ACM SIGMOD international conference Proc. SIGCOMM, Philadelphia, PA, USA, 2005. on Management of data, pages 28–39, New York, NY, USA, 2003. ACM [26] D. E. Taylor and J. S. Turner. Scalable packet classification using Press. distributed crossproducting of field labels. In Proc. INFOCOM’05, [3] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Miami, FL, USA, 2005. Communications of the ACM, 13(7):422–426, 1970. [27] F. Ye, H. Luo, S. Lu, and L. Zhang. Statistical en-route filtering of [4] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching injected false data in sensor networks. In Proc. INFOCOM’04, volume 4, and zipf-like distributions: evidence and implications. In Proc. IEEE pages 2446–2457, 2004. INFOCOM’99, pages 126–134, 1999. [28] C. K. Zipf. Human Behavior and the Principle of Least Effort: An [5] A. Broder and M. Mitzenmacher. Network applications of bloom filters: Introduction to Human Ecology. Addison Wesley Press, Inc., Cambridge, A survey. In Proceedings of the 40th Annual Allerton Conference on MA, 1949. Communication, Control, and Computing, pages 636–646, 2002. [6] F. Chang, W. Feng, and K. Li. Approximate caches for packet PPENDIX classification. In Proceedings of INFOCOM, volume 4, pages 2196– V. A 2207, Hong Kong, China, 2004. [7] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items Proof: [Theorem 3.1] In the derivation of the best config- in data streams. In ICALP ’02: Proceedings of the 29th International uration of the weighted Bloom filter, we assume that the orders Colloquium on Automata, Languages and Programming, pages 693–703, of N and n, as well as the size of the Bloom filter, m, are London, UK, 2002. Springer-Verlag. [8] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. The Bloomier filter: An sufficiently large. Also, we assume that the distributions of the efficient data structure for static support lookup tables. In Proceedings query frequencies of different elements and their probabilities of the Annual ACM-SIAM Symposium on Discrete Algorithms, pages of being members are not overly skewed, so that the term 30–39, New Orleans, LA., United States, 2004. P [9] S. Cohen and Y. Matias. Spectral Bloom filters. In Proc. SIGMOD, Pe∈U Xe·ke is sharply concentrated around its expected value pages 241–252, 2003. e∈U E{Xe}· ke. 3500 5000

3000 4000 2500 3000 2000

1500 2000

1000 1000

500 0 0 10000 10000 8000 8000 6000 6000 4000 4000

2000 2000

0 0 0.3 0.4 0.1 0.2 0.3 0.4 0 0.1 0.2 −0.4 −0.3 −0.2 −0.1 0 −0.4 −0.3 −0.2 −0.1

(i) (ii)

4500

4000 5000

3500 4000 3000 3000 2500

2000 2000

1500 1000

1000 0 −0.4 500 −0.2 1 0.8 0 0 1 0.6 0.8 0.4 0.4 0.6 0.2 0.2 0.4 0 0.2 −0.2 0.2 0 −0.4 0.4 0

(iii) (iv)

Fig. 2. PFP improvement with correlated membership likelihood and query frequencies. The vertical axis in all four figures is the PFP improvement. (i) α = 0.4. The two horizontal axes are, respectively, b that varies in [−0.4, 0.4] and c that varies in [1, 10000]. (ii) α = 0.7. The two horizontal axes are the same as in (i). (iii) and (iv) the same figure viewed from two different angles. Here c = 10000. The two horizontal axes are, respectively, b that varies in [−0.4, 0.4] and α that varies in [0, 1].

Define A as X Below we minimize the expectation of the false positive A = ke (16) probability, E{PFP }. e∈U Let e0 ∈ U be an element in U. Then, X and for e ∈ U, define te as te0 = 1 − ti (21)

ke ke i∈U−{e0} te = = P . (17) A i∈U ki and X P Also, define q as te 1− e∈U−{e } te E{PFP } = E{re}q + E{re }q 0 . q = (1 − p)A, (18) 0 e∈U−{e0} then equation 4 becomes: (22) Then, for any element e ∈ U − {e0}, we take the partial X ke X A· te PFP = re(1 − p) A = req . (19) derivative of equation 22 for te: e∈U e∈U P ∂E{PFP } t 1− ti e i∈U−{e0} = E{re}·ln q·q −E{re0 }·ln q·q . We have assumed that p almost always equals its expecta- ∂te tion. Therefore the expectation of PPF can be written as (23) X By setting equation 23 to be 0, we get: te E{PFP } = E{re}· q . (20) P t 1− ti e i∈U−{e0} e∈U E{re}· q = E{re0 }· q . (24) Multiplying equations 24 for all e ∈ U − {e0}, we get: Since p is seen as a constant, so should k by equation 36. Y Y P t 1− ti Therefore by equation 35, we have: e i∈U−{e0} (E{re}· q ) = (E{re0 }· q ), X e∈U−{e0} e∈U−{e0} xe · ke = E{k} = k ≈ −m ln p. (37) e∈U which gives P P Q By plugging e∈U xe · ke = k into equation 34, we get te e∈U−{e0} P q i∈U−{e } E{ri} P 0 (25) xe ln E{re} + B ln(1 − p) 1− ti e∈U i∈U−{e0} N−1 P (38) = (q · E{re0 }) . A 1 = N · [ln(1 − p)] · n + N · [ i∈U ln E{ri}] · n. By reorganizing equation 25, we have Raising both sides of equation 38 to the power of e, we get P Q Y Y (N)· te q e∈U−{e0} E{r } k xe An/N n/N i∈U−{e0} i (1 − p) E{re} = (1 − p) · E{ri} , (39) N−1 (26) = (q · E{re0 }) , e∈U i∈U which is equivalent to which gives Y Y P k xe 1/n A/N 1/N N−1 Y 1 [(1−p) E{re} ] = (1−p) · E{ri} . (40) te E{re0 } e∈U−{e0} N N q = q · ( ) . (27) e∈U i∈U E{ri} i∈U−{e0} By equations 30 and 40, we get that for any e ∈ U, Y By equations 24 and 27, we get that for any e ∈ U − {e0}: ke k xi 1/n E{re}· (1 − p) = [(1 − p) E{ri} ] . (41) t e i∈U E{re}· q P ti i∈U−{e0} = E{re0 }· q/q By equations 4, 36 and 41, we get 1 1 Q 1 (28) = E{r } N · q N · E{r } N e0 i∈U−{e0} i E{PFP } 1 Q 1 P k = q N · E{r } N . = E{r }(1 − p) e i∈U i Pe∈U e Q = [(1 − p)k E{r }xi ]1/n By using equation 21, we find that equation 28 holds also e∈U Q i∈U i k xi 1/n (42) = N · [(1 − p) E{ri} ] for e = e0. So for any element e ∈ U, we have: i∈UQ −m ln p xi 1/n Y = N · [(1 − p) E{ri} ] t 1 1 Qi∈U E{r }· q e = q N · E{r } N . (29) −m ln p ln(1−p) xi 1/n e i = N · [e i∈U E{ri} ] . i∈U So E{PFP } is minimized when ln p ln(1−p) is maximized. Combining equations 17, 18 and 29, we get that for any From the symmetry of the expression ln p ln(1−p), it is simple e ∈ U, to find that its value is maximized when A Y 1 ke 1 E{re}· (1 − p) = (1 − p) N · E{ri} N . (30) p = . (43) i∈U 2 Taking logarithm of both sides of equation 30, we get Plugging p = 1/2 into equation 42, we get A 1 X E{PFP } Q ln E{re} + ke · ln(1 − p) = · ln(1 − p) + · ln E{ri}. = N · [2−m ln 2 E{r }xe ]1/n N N eQ∈U e (44) i∈U = 2−m ln 2/n · N · E{r }xe/n. (31) e∈U e Multiplying both sides of equation 31 by Xe, we get By applying equations 36 and 43 to equation 41, we get that for any e ∈ U, Xe ln E{re} + Xeke ln(1 − p) P Y E{r }· 2−ke = [2−m ln 2 E{r }xi ]1/n. (45) = Xe · (A/N) · ln(1 − p) + Xe · (1/N) · i∈U ln E{ri}. e i (32) i∈U Taking the expectation of both sizes of equation 32, we get Taking logarithm of both sides of equation 45, we get X x ln E{r } + x k ln(1 − p) 1 x 2 e e e e P ln E{r } − k ln 2 = · [ ln E{r } i − m ln 2]. (46) = x · (A/N) · ln(1 − p) + x · (1/N) · ln E{r }. e e n i e e i∈U i i∈U (33) Summing over e ∈ U for both sizes of equation 33, we get So ∀ e ∈ U, we have P P m ln 2 1 1 X x ln E{r } + [ln(1 − p)] · x k ke = + (ln E{re} − · xi ln E{ri}). (47) e∈U e e Pe∈U e e n ln 2 n i∈U = (A/N) · [ln(1 − p)] · n + (1/N) · [ i∈U ln E{ri}] · n. (34) Equations 43, 47 and 42 are the same as equations 6, 5 Remember that we have defined k as and 7, which is the result we want to derive. We obtained X X k = k = X · k , (35) the result by taking the partial derivatives of the expected e e e false-positive probability and setting the partial derivatives to e∈S e∈U be zero, which gave a unique solution. This unique solution then by equation 1 we can determine the value of k: is clearly a global minimum for the expected false-positive k ≈ −m ln p. (36) probability. ¤