Folding and Unfolding Bloom Filters - an Off-Line Planning and On-Line Optimization Problem
Total Page:16
File Type:pdf, Size:1020Kb
Folding and Unfolding Bloom Filters - An Off-Line Planning and On-Line Optimization Problem Francoise Sailhan∗y, and Mark-Oliver Stehr∗ ∗ Computer Science Laboratory, SRI international Menlo Park, CA 94025 USA yCedric Laboratory, CNAM, 75003 Paris, France Emails: [email protected], [email protected] Abstract—The Internet of things has reached a stage that do not occur. Nevertheless, the approximate nature of Bloom allows ubiquitous data access. Still, practical limitations remain filter implies that false positives might appear. The probability in networks with scarce bandwidth. Here, we examine the of a false positives depends on (i) the Bloom filter size, (ii) Bloom filter data structure and its use in distributed protocols. We discuss how to minimize the bandwidth and energy usage the number of hash functions and (iii) the number of elements consumed when distributed protocols exchange Bloom filters, stored. Keeping the false positive rate to a minimum requires through dynamic Bloom filter resizing. We propose a general dynamic adaptation of the number of hash functions and/or of and novel formalization of Bloom filter resizing, through foldings the Bloom filter size. Further attempts to increase the size of and unfoldings. The key challenge in the folding approach is the Bloom filter have appeared in the literature. The basic idea determining suitable parameters and how to perform a folding. Specifically, we address the number of times that a Bloom filter is to create a set of Bloom filters [6], [16], [1]. As a side effect, should be folded and optionally unfolded, and how to determine a membership query is a function on the size of this set. While an ideal reduction factor for this process. We formulate our these methods deal with the increase of the Bloom filter size, approach as off-line planning of the integer factorization problem the related decrease is neglected. An alternative [7] consists in (where the integers correspond to the size of a Bloom filter), and partitioning the Bloom filter into k disjoint sub-ranges wherein propose further directions for optimizing the dynamic folding and unfolding of a Bloom filter. an element is added by hashing the sub-range with the related hash function hk. Depending on the expected probability of a I. INTRODUCTION false positives, k is either dynamically increased or decreased, The last decade is characterised by the convergence of per- hence leading to a reduced or increased Bloom filter size. vasive technologies including wireless communication, smart Nevertheless, changing the number of hash functions offers devices and the Internet. The resulting Internet of things lower granularity than resizing the Bloom filter. enables not only users but also things to access resources We herein consider the problem of dynamically resizing anywhere, anytime. In practice, data flows ubiquitously us- the Bloom filter by folding and unfolding it. This can be ing global wireless connectivity (e.g., 3G) or ad-hoc local expressed as follows: given a number of items stored in the area networking, as enabled by, for example, IEEE 802.15.4 Bloom filter, let reduce/increase the size of the Bloom filter by protocols. Two key limitations of these networks remain the folding/unfolding it so that the false positive rate ρ stays in the limits on bandwidth and energy. In this context, early-stage following interval [ρ − , ρ + ] and the increase of memory data and resource discovery should be provided so as to allow size is kept to a minimum. We acknowledge that halving a efficient data dissemination [14]. A Bloom filter [2] can play a Bloom filter was originally suggested in [3] and successfully crucial role by allowing to represent a set of data in a compact applied [5] so as to reduce the bandwidth consumption induced manner and by supporting efficient membership queries. More by the exchange of Bloom filters in the context of correlated precisely, a Bloom filter is a vector of bits. It is distinguished anomaly detection in large-scale grid computing [15]. We by the fact that the time associated with a membership query herein extend and generalize this approach by introducing the is independent of the number of elements stored in the Bloom concept of folded and unfolded Bloom filters, which permits filter as well the size of the Bloom filter. In a nutshell, the highly flexible resizing. Bloom filter works as follows: an element to be added to the We also introduce a novel formulation of the problem, Bloom filter is first hashed. Typically, k > 1 hash functions which allows to keep the false positive rate in check by are used. Then, the k outputs are used to flip to 1 the bits at the injecting entropy into the folding process. The key challenge related positions in the Bloom filter. In order to check whether consists in determining how a folding should be performed, an element is stored in the Bloom filter, the queried element namely the number of times the Bloom filter should be is hashed (relying on the same hashing functions). If the k folded/unfolded and the reduction factor associated with each bits at the related position are set to 1 in the Bloom filter, this folding/unfolding. We show that this can be formulated as (i) element is stored. It is worth mentioning that false negatives an off-line planning problem of the factorization of an integer (namely the Bloom filter size) and (ii) an on-line optimization This work was carried out when F. Sailhan was a visiting fellow at SRI. problem of the Bloom filter folding/unfolding. Our main contribution is twofold: we propose an algorithm for planning that depends on the space allocated to count. These counters the folding and unfolding of the Bloom filter depending on must be selected to be large enough to avoid overflows. the expected false positive rate and the related resource waste. A. Properties of Bloom Filters We formulate this problem and provide both a theoretical and experimental analysis of such planning. In addition, based on A [counting] Bloom filter is characterized by the following these planned foldings/unfoldings, we propose an approach to properties: dynamically optimizing the folding and unfolding of a Bloom 1) The time associated with a membership query is filter. independent of the number of elements stored in The remaining sections are organized as follows. We first the [counting] Bloom filter as well as its size. More introduce Bloom filters and survey the related literature (xII). precisely, the time required to insert or find an element Then, we present the Bloom filter folding/unfolding (xIII) depends on the number of hash functions and is hence and the planning of the folding/unfolding (xIV). Finally, we o(k). conclude this report with further optimization and research directions (xV). 2) Both basic and counting Bloom filters engender false positives. Given n the number of elements that are stored II. BACKGROUND ON BLOOM FILTERS in the [counting] Bloom filter (see Table I) and assuming Bloom filters [2] are commonly used to represent a set that the hash functions are perfect, the probability of a so as to support efficient membership queries, that is, to false positive, denoted ρ, satisfies: efficiently test whether an element is a member of a set. For 1 kn k kn this purpose, a Bloom filter B is represented as a vector of ρ = (1 − (1 − ) ) ≈ (1 − e m ) (1) bits, denoted b(1); ··· ; b(m). This vector is initially set to m 0, i.e., 8i 2 [0; m], b(i) = 0. The vector is then updated by Unlike a basic Bloom filter, a counting Bloom adding additional elements. filter may produce a false negative, which occurs if the counting vector is undersized and generates an overflow. a) Bloom filter update: In order to add an element e to the Bloom filter, k hash functions h1; ··· ; hk are used. Each 3) The union of two sets S1 and S2 can be obtained as hash function is applied to the element: h1(e); ··· ; hk(e). follows: Then, the k bits at positions h1(e) mod(m); ··· ; hk(e) • applying a bit-wise OR on the two corresponding mod(m) are set to 1. basic Bloom filters, and • adding the vectors of the two counting Bloom filters. Note that the resulting value cannot exceed m: Bloom filter size b(1); ··· ; b(m): Bloom filter a maximum value defined by the counters’ sizes. k: number of hash functions h1(); ··· ; hk(): set of hash functions n: set of items stored in the Bloom filter 4) The intersection of two sets S1 and S2 cannot be obtained in a deterministic manner on the [counting] TABLE I NOTATION Bloom filter. 5) With a counting Bloom filter, the subtraction of two b) Checking the membership of an element: The sets S1 − S2 can be obtained by subtracting the process of checking whether an element q belongs to the corresponding counting Bloom filters, CB1 − CB2. Bloom filter is very similar to adding an element. First, q is Basic Bloom filters do not support such subtraction. hashed: h1(q) mod(m); ··· ; hk(q) mod(m). If any bit at the related positions is set to 0, then the element is not stored Despite the above promising properties, [counting] Bloom in the Bloom filter. Otherwise (i.e., if none of those bits filters suffer from a fundamental shortcoming: once set, the is set to 0), the element is said to be stored in the Bloom filter. size of the Bloom filter cannot be easily changed. An over- or under-estimate of this size leads to a waste of mem- c) Suppressing an element: Items cannot be removed ory/bandwidth and/or an increased false positive rate.