Bloofi: A Hierarchical Bloom Filter Index with Applications to Distributed Data Provenance

Adina Crainiceanu
United States Naval Academy
[email protected]

ABSTRACT

Bloom filters are probabilistic data structures that have been successfully used for approximate membership problems in many areas of Computer Science (networking, distributed systems, databases, etc.). With the huge increase in data size and distribution of data, problems arise where a large number of Bloom filters are available, and all the Bloom filters need to be searched for potential matches. As an example, in a federated cloud environment, with hundreds of geographically distributed clouds participating in the federation, information needs to be shared by the semi-autonomous cloud providers. Each cloud provider could encode the information using Bloom filters and share the Bloom filters with a central coordinator. The problem of interest is not only whether a given object is in any of the sets represented by the Bloom filters, but which of the existing sets contain the given object. This problem cannot be solved by just constructing a Bloom filter on the union of all the sets. We propose Bloofi, a hierarchical index structure for Bloom filters that speeds up the search process and can be efficiently constructed and maintained. We apply our index structure to the problem of determining the complete data provenance graph in a geographically distributed setting. Our theoretical and experimental results show that Bloofi provides a scalable and efficient solution for searching through a large number of Bloom filters.

Categories and Subject Descriptors: E.1 Data Structures - Trees; H.3.3 Information Search and Retrieval; H.3.5 Online Information Services - Data sharing

General Terms: Management, Performance.

Keywords: Bloom filter index, federated cloud, data provenance.
1. INTRODUCTION

Bloom filters [1] are used to efficiently check whether an object is likely to be in a set (match) or whether the object is definitely not in the set (no match). False positives are possible, but false negatives are not. Due to their efficiency, compact representation, and flexibility in allowing a trade-off between space and false positive probability, Bloom filters are increasingly popular in representing diverse sets of data. They are used in databases [11], distributed systems [4], web caching [9], and other network applications [2]. As digital data increases in both size and distribution, applications generate a large number of Bloom filters, and these filters need to be searched to find the sets containing particular objects.

Our work was motivated by a highly distributed data provenance application, in which data is tracked as it is created, modified, or sent/received between the multiple sites participating in the application, each site maintaining its data in a cloud environment. For security and technical reasons, it is impractical to maintain a centralized index of all the data in the geographically distributed system. However, information needs to be correlated across the semi-independent, geographically distributed clouds. Bloom filters can be maintained by each individual site, in the "local" cloud, and shared with a central location. For each piece of data, we need to find the sites holding the data. Thus, we need to search through a large number of Bloom filters stored at the central location.

Indexing Bloom filters is different than indexing generic objects to improve search time. There is one level of indirection between the elements searched for and the objects directly indexed by the index structure. In particular, each Bloom filter is a compact representation of an underlying set of elements. The question of interest is an all-membership query: given a particular element (not a Bloom filter), which underlying sets contain that element? The query subject is an element, but the objects we are indexing and searching through are Bloom filters, so what we are creating is a meta-index. Traditional index structures, such as hash indexes, B+-trees, R-trees, etc., do not directly apply in this case.

There has been significant work in using Bloom filters in various applications and in developing variations of Bloom filters. Counting filters [9] support deletions from the Bloom filter; compressed Bloom filters [10] are used with web caching; stable Bloom filters [6] eliminate duplicates in streams; spectral Bloom filters [5] extend the applicability of Bloom filters to multi-sets. Not much work has been done for the case when a very large number of Bloom filters need to be checked for membership.

We propose Bloofi (Bloom Filter Index), a hierarchical index structure for Bloom filters. Bloofi provides probabilistic answers to all-membership queries and scales to tens of thousands of Bloom filters. When the probability of false positives is low, Bloofi of order d (a tunable parameter) provides O(d * log_d N) search cost, where N is the number of Bloom filters indexed. Bloofi also provides support for inserts, deletes, and updates with O(d * log_d N) cost and requires O(N) storage cost. Bloofi could be used whenever a large number of Bloom filters that use the same hash functions need to be checked for matches. Bloofi can also be used to support deletions from a Bloom filter when only current data from a discretely moving window needs to be indexed.

The rest of this paper is structured as follows: Section 2 introduces Bloofi, a new hierarchical index structure for Bloom filters. Section 3 introduces the maintenance algorithms and a theoretical performance analysis. Section 4 describes the distributed data provenance application for Bloofi and Section 5 shows the experimental results. We discuss related work in Section 6 and conclude in Section 7.
2. INDEXING BLOOM FILTERS

Each Bloom filter [1] is a bit array of length m constructed by using a set of k hash functions. The empty Bloom filter has all bits set to 0. To add an element to the filter, each of the k hash functions maps the new element to a position in the bit array, and the bit in that position is set to 1. To check whether an element is a member of the set represented by the Bloom filter, the k hash functions are applied to the test element. If any of the resulting k positions is 0, the test element is not in the set, with probability 1. If all k positions are 1, the Bloom filter matches the test element, and the test element might be in the set (it might be a true positive or a false positive). There is a trade-off between the size of the Bloom filter and the probability of false positives, probf, returned by it: probf can be lowered by increasing the size of the Bloom filter. In the rest of the paper we assume that all the Bloom filters indexed have the same length and use the same set of hash functions.
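For concreteness, the mechanics above can be sketched in Java as follows. This is a minimal illustration only, not the Bloom filter implementation from [13] used in the experiments; the double-hashing scheme and class name are arbitrary choices.

import java.util.BitSet;

// Minimal Bloom filter sketch: a bit array of length m and k hash functions.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits
    private final int k;   // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive k positions from two base hashes (an illustrative scheme).
    private int position(Object element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, m);
    }

    public void add(Object element) {
        for (int i = 0; i < k; i++) {
            bits.set(position(element, i));
        }
    }

    // True means "might be in the set"; false means "definitely not in the set".
    public boolean mightContain(Object element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) {
                return false;
            }
        }
        return true;
    }

    public BitSet bits() {
        return (BitSet) bits.clone();   // the bit vector a site could share
    }
}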
2.1 Bloofi: A Hierarchical Bloom Filter Index

Bloofi, the Bloom filter index, is based on the following idea. We construct a tree: the leaves of the tree are the Bloom filters to be indexed, and the parent nodes are Bloom filters obtained by applying a bitwise or on the child nodes. This process continues until the root is reached. The index has the property that each non-leaf Bloom filter in the tree represents the union of the sets represented by the Bloom filters in the sub-tree rooted at that node. As a consequence, if an object matches a leaf-level Bloom filter, it matches all the Bloom filters on the path from that leaf to the root. Conversely, if a particular Bloom filter in Bloofi does not match an object, there is no match in the entire sub-tree rooted at that node.

Using Bloofi, a membership query starts by first querying the root Bloom filter: if it does not match the queried object, then none of the indexed sets contains the object and a negative answer is returned. If the root does match the object, the query proceeds by checking which of the child Bloom filters matches the object. The query continues down the paths of Bloom filters matching the object, until the leaf level is reached. In a balanced tree, the height of Bloofi is logarithmic in the number of Bloom filters indexed, and each step in the query process goes down one level in the tree. In the best case, a query with a negative answer is answered in constant time (check the root only), and a query with a positive answer is answered in logarithmic time. However, if multiple paths in the index are followed during the query process, the query time increases. Section 3.1 introduces our heuristics for Bloofi construction such that the number of "misleading" paths in Bloofi is reduced (similar Bloom filters are placed in the same sub-tree).

There are many possible implementations for Bloofi: as a tree similar to binary search trees, AVL trees, B+-trees, etc. We implement Bloofi similarly to B+-trees. We define a parameter d, the order, and each non-leaf node maintains l child pointers, with d ≤ l ≤ 2d for all non-root nodes and 2 ≤ l ≤ 2d for the root. Each node in Bloofi stores only one value, unlike general search trees. For the leaves, the value is the Bloom filter to be indexed. For all non-leaf nodes, the value is obtained by applying bitwise or on the values of the node's children. Throughout the paper we use the usual definitions for tree, node in a tree, root, leaf, depth of a node (the number of edges from the root to the node), height of a tree (maximum depth of a node in the tree), sibling, and parent.

Figure 1: Bloofi Tree of Order 2

Fig 1 shows an example of a Bloofi index of order 2. Each internal node has between 2 and 4 child pointers. The leaf level of the tree contains the original Bloom filters indexed by the Bloofi index, with identifiers 1, 2, 3, 4, 5, and 6. The node identifiers are shown here for ease of presentation, but they are also used in practice during updates and deletes to identify the node that needs to be updated or deleted (see Section 3.2 and Section 3.3 for details). In the rest of the paper, we often use "node X" to refer to the node with identifier X. At the next higher level, the values are obtained by applying bitwise or on the values of the children nodes, so the value of node 7 is the bitwise or between the values of nodes 1, 2, 3, and 4. The process continues until the root is reached.

More formally, the structure of a Bloofi index of order d is a tree such that:
• node.nbDesc = sizeOf(node.children), ∀ node
• node.nbDesc = 0, ∀ node such that node is a leaf
• d ≤ node.nbDesc ≤ 2*d, ∀ node such that node ≠ root and node is not a leaf
• 2 ≤ root.nbDesc ≤ 2*d
• If | denotes bitwise OR, then node.val = node.children[0].val | node.children[1].val | ... | node.children[node.nbDesc−1].val, ∀ node such that node is not a leaf
• node1.parent = node2 ⇔ node1 ∈ node2.children
• node.level = height, ∀ node such that node is a leaf (all leaves are at the same level)

Bloofi height: Since Bloofi is a balanced tree, with each internal node having at least d child nodes, where d is the order of the tree, the height of a Bloofi index of order d is at most ⌊log_d N⌋, where N is the number of Bloom filters to index.
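A node satisfying this definition could be represented as follows. The field names val, parent, and children and the class name BFINode follow the pseudocode in Section 3; isLeaf and recomputeVal are illustrative helpers, not part of the paper.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of a Bloofi node matching the formal definition above
// (nbDesc is simply children.size()).
public class BFINode {
    BitSet val;                               // Bloom filter value of this node
    BFINode parent;                           // null for the root
    List<BFINode> children = new ArrayList<>();
    int id;                                   // identifier, meaningful for leaves

    boolean isLeaf() {
        return children.isEmpty();
    }

    // Non-leaf value: bitwise OR of all children values.
    void recomputeVal() {
        BitSet v = new BitSet();
        for (BFINode child : children) {
            v.or(child.val);
        }
        val = v;
    }
}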
2.2 Search

The search procedure (Algorithm 1) returns the identifiers of all the leaf-level Bloom filters that match a given object in the subtree rooted at the given node. It first checks whether the current node value matches the object (line 3). If not, then none of the Bloom filters in that sub-tree match the object, so the empty set is returned (line 4). If the current node does match the object, then either it is a leaf, in which case it returns the identifier (line 8), or it is an inner node, in which case the findMatches function is called recursively for all of its child nodes (lines 11-14).

Algorithm 1: findMatches(node, o)
1: //RETURN VALUE: the identifiers of leaves in the subtree rooted at node with Bloom filters matching the object o
2: //if node does not match the object, return empty set, else check the descendants
3: if !match(node.val, o) then
4:   return ∅;
5: else
6:   //if this node is a leaf, just return the identifier
7:   if node.nbDesc == 0 then
8:     return node.id;
9:   else
10:    //if not leaf, check the descendants
11:    returnList = ∅;
12:    for i = 0; i < node.nbDesc; i++ do
13:      returnList.add(findMatches(node.children[i], o));
14:    end for
15:  end if
16: end if
17: return returnList;

Example: Consider a query for object value 4 in the Bloofi tree in Figure 1. The findMatches function in Algorithm 1 is invoked with arguments root and 4. In line 3 of Algorithm 1, the algorithm checks whether the value of the root matches 4. For simplicity of presentation, assume that there is only one hash function used by the Bloom filters, the function is modulo 8, and the elements in the underlying set are integers. Since root.val[4] is 1, the root matches the queried object 4 and the search proceeds by invoking the findMatches function for each of its child nodes. The first child node, node 7, does not match the queried object, so findMatches for that sub-tree returns null. The second child node of the root, node 8, matches 4, so the search continues at the next lower level. Node 5 matches the queried object, and is a leaf, so the findMatches function returns the identifier 5. Leaf 6 does not match the queried object, so that findMatches call returns null. Now, the recursive call on node 8 returns with the value {5}, and finally the call on the root returns {5}, which is the result of the query.

Search cost: The complexity of the search process is given by the number of findMatches invocations. In the best case, if there are no leaf-level Bloom filters matching a given object, the number of Bloom filters to be checked for matches is 1 (the root). To find a leaf-level Bloom filter that matches a query, the number of findMatches invocations is O(d * log_d N) in the best case (one path is followed, and at each node, all children are checked to find the one that matches) and O(N) in the worst case (if all the nodes in the tree are checked for matches).
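The logic of Algorithm 1 can be rendered compactly in Java as shown below. This is a sketch, not the paper's implementation: it returns a list at every node rather than a single identifier at leaves, and it assumes the k bit positions of the queried object are computed once up front, which is possible because all indexed filters share the same hash functions.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Java sketch of findMatches over the BFINode structure above.
public class BloofiSearch {

    // A node matches if every hashed position of the query object is set.
    static boolean match(BitSet val, int[] positions) {
        for (int p : positions) {
            if (!val.get(p)) return false;   // some bit is 0: definite non-match
        }
        return true;                          // possible match (may be a false positive)
    }

    public static List<Integer> findMatches(BFINode node, int[] positions) {
        List<Integer> result = new ArrayList<>();
        if (!match(node.val, positions)) {
            return result;                    // nothing in this subtree matches
        }
        if (node.isLeaf()) {
            result.add(node.id);              // matching leaf-level Bloom filter
        } else {
            for (BFINode child : node.children) {
                result.addAll(findMatches(child, positions));
            }
        }
        return result;
    }
}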
3. BLOOFI MAINTENANCE

We introduce now the algorithms for inserting, deleting, and updating Bloom filters.

3.1 Insert

The algorithm for insert (Algorithm 2) finds a leaf which is "close" to the input Bloom filter in a defined metric space, and inserts the new Bloom filter next to that leaf. The intuition is that similar Bloom filters should be in the same sub-tree to improve search performance. As distance metric we use the Hamming distance.

The new Bloom filter is inserted by first updating the value of the current node, or-ing it with the value of the filter to be inserted (since that node will be in the sub-tree), and then recursively calling the insert function on the child node most similar to the new value (lines 9-10). Once the most similar leaf node is located (lines 33-39), a new leaf is created for the new Bloom filter (lines 35-36) and is inserted as a sibling of the node by calling the insertIntoParent function (Algorithm 3). This function takes as parameters the new node newEntry, and the most similar node to it, node. newEntry is inserted as a sibling of node. If the number of children in the parent is still at most 2d, the insert is complete. If an overflow occurs, the node splits (lines 9-16) and the newly created node is returned by the function. The splits could occasionally propagate up to the root level. In that case, a new root is created, and the height of the Bloofi tree increases (lines 17-26 in Algorithm 2).
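The child-selection step (line 8 of Algorithm 2) and the Hamming distance can be sketched as follows; hamming and closestChild are illustrative helper names, not from the paper.

import java.util.BitSet;

// Hamming distance between two bit vectors, and selection of the closest child.
public class Closeness {

    static int hamming(BitSet a, BitSet b) {
        BitSet x = (BitSet) a.clone();
        x.xor(b);                    // differing bits
        return x.cardinality();      // number of differing bits
    }

    // Child of node whose value is closest to the new Bloom filter.
    static BFINode closestChild(BFINode node, BitSet newBloomFilter) {
        BFINode best = null;
        int bestDist = Integer.MAX_VALUE;
        for (BFINode child : node.children) {
            int d = hamming(child.val, newBloomFilter);
            if (d < bestDist) {
                bestDist = d;
                best = child;
            }
        }
        return best;
    }
}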

Algorithm 2: insert(newBloomFilter, node)
1: //insert into the sub-tree rooted at the given node
2: //RETURN: null or pointer to new child if split occurred
3: //if node is not leaf, direct the search for the new filter place
4: if node.nbDesc > 0 then
5:   //update the value of the node to contain the new filter
6:   node.val = node.val OR newBloomFilter;
7:   //find the most similar child and insert there
8:   find child C with the closest value to newBloomFilter
9:   newSibling = insert(newBloomFilter, C);
10:  //if there was no split, just return null
11:  if newSibling == null then
12:    return null;
13:  else
14:    //there was a split; check whether a new root is needed
15:    if node.parent == null then
16:      //root was split; create a new root
17:      newRoot = new BFINode();
18:      newRoot.val = node.val OR newSibling.val;
19:      newRoot.parent = null;
20:      newRoot.children.add(node);
21:      newRoot.children.add(newSibling);
22:      root = newRoot;
23:      node.parent = newRoot;
24:      newSibling.parent = newRoot;
25:      return null;
26:    else
27:      newSibling = insertIntoParent(newSibling, node)
28:      return newSibling;
29:    end if //current node is root or not
30:  end if //there was a split or not
31: else
32:   //if node is leaf, need to insert into the parent
33:   //create a BFI node for newBloomFilter
34:   newLeaf = new BFINode();
35:   newLeaf.val = newBloomFilter;
36:   //insert the new leaf into the parent node
37:   newSibling = insertIntoParent(newLeaf, node);
38:   return newSibling;
39: end if //current node is leaf or not

Algorithm 3: insertIntoParent(newEntry, node)
1: //insert into the node's parent, after the node pointer
2: //RETURN: null or pointer to new child if split occurred
3: node.parent.children.addAfter(newEntry, node);
4: newEntry.parent = node.parent;
5: //check for overflow
6: if !node.parent.overflow then
7:   return null;
8: else
9:   //overflow, so split
10:  P = node.parent;
11:  P' = new BFINode();
12:  move last d children from P to P';
13:  update parent information for all children of P';
14:  re-compute P.val as the OR between its children values;
15:  compute P'.val as the OR between its children values;
16:  return P';
17: end if

Example: Consider the Bloofi tree in Fig 1 and assume that we insert the Bloom filter with value "00100100". The new node is inserted as a child of node 7 (the Hamming distance between the new node and nodes 7 and 8 is 4, so let's assume that node 7 is chosen as the closest node), which needs to split. The resulting Bloofi tree is shown in Fig 2.

Figure 2: Bloofi Tree After Insert and Split

Theorem 1 gives the cost of the insert algorithm for Bloofi. The cost metric we use to measure the performance of an operation in Bloofi is the number of Bloofi nodes accessed by that operation: either the Bloom filter value is read/modified by that operation, or the parent or children pointers in the node are read/modified.

THEOREM 1 (INSERT COST). The number of Bloofi nodes accessed during the insert operation in Bloofi is O(d * log_d N), where d is the order of the Bloofi index and N is the number of Bloom filters that are indexed.

PROOF. The values of all nodes in the path from the root to the new leaf are updated to reflect the newly inserted Bloom filter. The height of the Bloofi tree is at most ⌊log_d N⌋ and the cost of each update is constant, therefore the total cost for that update is O(log_d N) (1). At each level in the tree, a search for the most similar child node is performed (line 9), and the cost of that search is O(d), so the total cost due to the search for the placement of the new leaf is O(d * log_d N) (2). The cost of a split is O(d), since there are at most 2d + 1 children for a node. In the worst case, the split propagates up to the root, and the height of the tree is O(log_d N), so the worst-case cost for the split operations is O(d * log_d N) (3). From (1), (2), and (3), the cost of the insert operation is O(d * log_d N).
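As a rough numeric illustration of this bound (not an experimental result from the paper), the default experimental parameters d = 2 and N = 1000 give a tree of height at most 9, with at most 2*d children compared per level:

// Back-of-the-envelope instance of Theorem 1's O(d * log_d N) bound.
public class InsertCostBound {
    public static void main(String[] args) {
        int d = 2, N = 1000;
        int height = (int) Math.floor(Math.log(N) / Math.log(d));  // floor(log_2 1000) = 9
        int comparisonsPerLevel = 2 * d;                            // children examined per level
        System.out.println(height + " levels, on the order of "
                + (height * comparisonsPerLevel) + " node accesses per insert");
    }
}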
3.2 Delete

The delete algorithm (Algorithm 4) deletes the given node from the Bloofi index. When the procedure is first invoked with the leaf to be deleted as argument, the pointer to that leaf in the parent is deleted. If the parent node is not underflowing (at least d children are left), the Bloom filter values of the nodes on the path from the parent node to the root are re-computed to be the bitwise or of their remaining children (line 35) and the delete procedure terminates. If there is an underflow (lines 13-31), the parent node tries to redistribute its entries with a sibling. If redistribution is possible, the entries are re-distributed, the parent information in the moving nodes is updated, and the Bloom filter values in the parent node, the sibling node, and all the way up to the root are updated (lines 16-22). If redistribution is not possible, the parent node merges with a sibling, by giving all its entries to the sibling (lines 24-30). The Bloom filter value in the sibling node is updated, and the delete procedure is called recursively for the parent node. Occasionally, the delete propagates up to the root. If only one child remains in the root, the root is deleted and the height of the tree decreases (lines 6-10).

Algorithm 4: delete(childNode)
1: //find the parent node
2: parentNode = childNode.parent;
3: //remove the reference to the node from its parent
4: parentNode.children.remove(childNode);
5: //check whether the tree height needs to be reduced
6: if parentNode == root AND parentNode.nbDesc == 1 then
7:   root = parentNode.children.get(0);
8:   root.parent = null;
9:   return null
10: end if
11: //if not, check if underflow at the parent
12: if parentNode.underflow then
13:   //underflow, so try to redistribute first
14:   sibling = sibling of parentNode;
15:   if sibling.canRedistribute then
16:     //redistribute with sibling
17:     remove some children from sibling to even out the number of children
18:     insert new children into parentNode
19:     update .parent information for all nodes moved
20:     //update value of all nodes involved, up to the root
21:     sibling.val = OR of all children values;
22:     recomputeValueToTheRoot(parentNode);
23:   else
24:     //merge with sibling
25:     move all children from parentNode to sibling;
26:     update .parent information for all nodes moved
27:     //recompute sibling value
28:     sibling.val = OR of all children values;
29:     //delete the parentNode
30:     delete(parentNode);
31:   end if //merge or redistribute
32: else
33:   //no underflow
34:   //re-compute the value of all Bloom filters up to the root
35:   recomputeValueToTheRoot(parentNode);
36: end if

Example: Assume that the node with id 5 and value "00001000" is deleted from Fig 1. The resulting tree, after deletion of node 5 and redistribution between nodes 8 and 7, is shown in Fig 3.

Figure 3: Bloofi Tree After Delete and Redistribute

THEOREM 2 (DELETE COST). The number of Bloofi nodes accessed during the delete operation in Bloofi is O(d * log_d N), where d is the order of the Bloofi index and N is the number of Bloom filters that are indexed.

PROOF. Once the reference to a node is deleted from its parent in the Bloofi tree (a constant-cost operation), the values of the nodes from the deleted node to the root need to be recomputed, so the total cost is O(d * log_d N) (1). Occasionally, a redistribute or merge is needed, with a cost of O(d). In the worst case, the merge and delete propagate up to the root, so the worst-case cost for merge is O(d * log_d N) (2). From (1) and (2), it follows that the cost of the delete operation is O(d * log_d N).

3.3 Update

Object insertions in the underlying set lead to updates of the Bloom filters, so we expect the update operation for Bloofi to be quite frequent. Instead of treating a Bloom filter update as a delete followed by an insert, we use an "in-place" update. If the Bloofi tree becomes inefficient in routing due to updates (too many false positives during search), the Bloofi tree can be reconstructed from scratch in batch mode. Algorithm 5 shows the pseudo-code for the update algorithm. The algorithm takes as parameters the leaf node corresponding to the updated value and the new Bloom filter value for that node. All the Bloom filters on the path from the leaf to the root are updated by or-ing with the new value.

Algorithm 5: update(leaf, newBloomFilter)
1: //update all values on the path from leaf to the root
2: node = leaf;
3: repeat
4:   node.val = node.val OR newBloomFilter;
5:   node = node.parent;
6: until node == null

Example: In the Bloofi tree in Fig 1, assume that we update the value of node 6 to be "00000011". The values of all the nodes on the path from node 6 to the root are or-ed with "00000011" and the resulting tree is shown in Fig 4.

Figure 4: Bloofi Tree After Update

THEOREM 3 (UPDATE COST). The number of Bloofi nodes accessed during the update operation in Bloofi is O(log_d N), where d is the order of the Bloofi index and N is the number of Bloom filters that are indexed.

PROOF. The number of nodes to be updated is equal to the height of the Bloofi tree + 1, and each update accesses only the current Bloom filter for that node and the new Bloom filter. The height of Bloofi is at most ⌊log_d N⌋, therefore the update cost is O(log_d N).

3.4 Improving Pruning Efficiency

In Bloofi, each non-leaf node value is the bitwise or of its children's values. As the total number of objects in the underlying sets indexed by Bloofi increases, the probability of false positive results returned by the Bloom filters at the higher levels in the tree increases. In the worst case, all bits in the Bloofi nodes at higher levels in the tree could be one. This leads to decreased pruning efficiency at the higher levels in the tree, as more false positive paths are followed during search. To improve upon the number of Bloom filters that need to be checked for matches during a query, we propose the following heuristic.

During the insert procedure, we do not split a node that has all the bits set to one, even if the node is full. This could stop the splitting a little too early, but it avoids creating multiple levels in the tree with all the bits set to one. Our experimental results in Section 5.2 show that the search cost is indeed improved by using this heuristic when the root-level value has all bits set to one.

Alternatively, we could dynamically change the size of the Bloom filters when the false positive probability at the root reaches 1. In such a case, if the application allows, we could reconstruct the base Bloom filters to have a lower false positive probability, and reconstruct the Bloofi tree from the bottom up. The index construction time for 100,000 Bloom filters was about 15 minutes in our experiments, so periodic reconstruction of the index is a viable solution.

4. APPLICATION TO DISTRIBUTED DATA PROVENANCE

In this section we describe the distributed data provenance application that motivated our work on Bloofi.

Let us assume that a multinational corporation with hundreds of offices in geographically distributed locations (sites) around the world is interested in tracking the documents produced and used within the corporation. Each document is given a universally unique identifier (uuid) and is stored in the local repository, in a cloud environment. Documents can be sent to another location (site) or received from other locations, multiple documents can be bundled together to create new documents, which are therefore identified by new uuids, documents can be decomposed into smaller parts that become documents themselves, and so on. All these "events" that are important to the provenance of a document are recorded in the repository at the site generating the event. In our application, all events are stored as RDF triples in Rya [12], a scalable cloud triple store. The data can be modeled as a Directed Acyclic Graph (DAG), with labeled edges (event names) and nodes (document uuids). As documents travel between sites, the DAG is in fact distributed not only over the machines in the cloud environment at each site, but also over hundreds of geographically distributed locations. The data provenance problem we are interested in solving is finding all the events and document uuids that form the "provenance" path of a given uuid (all "ancestors" of a given node in the distributed graph).

Storing all the data, or even all the uuids and their location, in a centralized place is not feasible, due to the volume and speed of the documents generated globally. Moreover, local regulations might impose restrictions on where the data can be stored. However, since all the global locations belong to the same corporation, data exchange and data tracking must be made possible.

Without any information on the location of a uuid, each provenance query for a uuid must be sent to all sites. Each site can then determine the local part of the provenance path and return it. However, the provenance might contain new uuids, so a new query needs to be sent to each site for each new uuid connected to the original uuid, until no new uuids are found. This recursive process could consume significant bandwidth and latency in a geographically distributed system.

To minimize the number of unnecessary messages sent to determine the full provenance of an object, each local site maintains a Bloom filter of all the uuids in the local system. Updates to the Bloom filter are periodically propagated to a centralized location (the headquarters, for example). At the central location, a Bloofi index is constructed from the Bloom filters, and every time a provenance query for a uuid is made, the Bloofi index is used to quickly determine the sites that might store provenance information for the given uuid.

5. EXPERIMENTAL EVALUATION

We evaluate Bloofi's search performance and maintenance cost for different numbers and sizes of Bloom filters, different similarity metrics used for Bloofi construction, and different underlying data distributions. We also compare Bloofi's performance with the "baseline" case where the Bloom filters are searched linearly, without any index. We show that in most cases, Bloofi achieves logarithmic search performance, with low maintenance cost, regardless of the underlying data distribution.

5.1 Experiments Setup

We implemented Bloofi in Java. For Bloom filters, we use the implementation from [13]. The experiments were run on a Dell Latitude E6510 with a 2.67 GHz Intel Core i7 CPU and 4GB RAM.

As performance metrics we use: search cost: the number of Bloom filters checked to find the one(s) matching a queried object, averaged over 1000 searches; search time: the average time, in milliseconds, to find all the Bloom filters matching a queried object; storage cost: the space required for the index, as estimated by the number of bytes for a Bloom filter multiplied by the number of nodes in the Bloofi tree; maintenance cost: the average number of Bloofi nodes accessed during an insert, delete, or update operation.

We vary the following parameters: N, the number of Bloom filters indexed by Bloofi; d, the Bloofi order; the Bloom filter size; the index construction method: iterative (insert Bloom filters one by one using the algorithm in Section 3.1) or bulk (first sort all Bloom filters such that the first Bloom filter is the one closest to the empty Bloom filter, the second is the filter closest to the first Bloom filter, etc., and then construct the Bloofi tree by always inserting next to the right-most leaf); the similarity measure used to define "closeness" during insert; and the data distribution: nonrandom (non-overlapping ranges for each Bloom filter: each Bloom filter i contains nbEl integers in the range [i*nbEl, (i+1)*nbEl)) and random (overlapping ranges for the data in the Bloom filters: each Bloom filter i contains nbEl random integers in a randomly assigned range).

For each experiment we vary one parameter and use the default values shown in Table 1 for the rest. The numbers reported are the averages over 5 runs.

Table 1: Default Values for Parameters
N - Number of Bloom filters indexed: 1000
d - Bloofi order: 2
Construction method: iterative
Distance metric: Hamming
Data distribution: nonrandom
Bloom filter size (bits): 100,992
Number of hash functions used: 7
Number of elements in each indexed Bloom filter: 100

5.2 Performance Results

Varying Number of Bloom Filters Indexed N: Fig 5 shows the increase in the search cost as N increases from 100 to 100,000. Note the logarithmic scale on both X and Y axes. In the "ideal" case, when exactly one path from root to leaf is followed during search, the search cost is approximately l * log_l N + 1 if each non-leaf node has l children. In our experiments, the increase in search cost is logarithmic as long as the false positive probability (probf) at the root is less than one (N ≤ 1000), but the increase is higher than logarithmic after that, due to the high value of probf at the high levels of the tree. However, even for 100,000 filters, Bloofi still performs two orders of magnitude better than the baseline case. If the size of the Bloom filters increases, the search cost decreases to the "ideal" cost, as shown in Fig 11.

Bulk construction performs slightly better than incremental construction (indistinguishable from the After Updates curve), since a global sort is performed before the bulk insert, while the incremental construction is greedy and might not lead to optimal placement. However, the cost of sorting is O(N^2) in our implementation, which leads to a high cost of bulk construction (that is why no numbers are shown for N=100,000 for Bulk).

We run the same experiment using the heuristic in Section 3.4. When there are Bloofi nodes with all bits 1, the search cost when the heuristic is used is lower than when the heuristic is not used (103.16 vs. 111.27 for N = 10,000 and 885.66 vs. 969.80 for N = 100,000), which shows the effectiveness of the heuristic.

To evaluate the effect of the in-place update, the Bloofi tree is built incrementally using only half of the elements in each Bloom filter. The rest of the elements are then inserted in each Bloom filter and Bloofi is updated in-place. The After Updates curve in Fig 5 is the search cost in the final tree. The After Updates and Incremental curves are almost identical, which shows that the in-place update maintains the search performance of the Bloofi tree.

Fig 6 shows similar trends for the search time, since the time depends on the number of Bloom filters checked. The search time also depends on the locality of nodes in memory (due to the memory hierarchy), so the difference between Bloofi and the baseline case is not as big when only a few Bloom filters are checked. The difference increases as N increases.

The storage cost increases linearly with N (not shown), as the number of nodes in the Bloofi tree is proportional to N.

The maintenance cost increases logarithmically with N (Fig 7), as expected from Theorems 1, 2, and 3. The cost of insert is higher than that of delete, as the place for the new node needs to be found during insert. The update cost is the lowest, as we use in-place updates, and no splits, merges, or redistributions are needed.

Fig 5. Search Cost vs N
Fig 6. Search Time vs N
Fig 7. Maintenance Cost vs N

Varying Bloofi Order d: The search cost increases as d increases (Fig 8), since the search cost is proportional to d * log_d N. The search cost obtained from experiments is very close to the ideal search cost for full trees (2d children per non-leaf node), which shows that our algorithms for tree construction perform very well, and not many nodes outside of the path from the root to the answer leaf are checked during search. Bulk construction performs better than incremental construction, due to the global sort. The After Updates and Incremental lines are almost identical, which shows that our in-place update maintains the performance of the Bloofi tree. The same trends were obtained for the search time.

The storage cost decreases with order (Fig 9), as the number of non-leaf nodes in a Bloofi tree of order d is between (N−1)/(2d−1) and (N−1)/(d−1), so the higher the order, the lower the storage cost. While the storage cost is lower for higher orders, the search cost is higher (Fig 8), so there is a trade-off between search and storage cost. The storage cost for bulk construction is slightly higher than for incremental construction, because always inserting into the right-most leaf leads to skinnier, taller trees, with more nodes. The additional storage cost for Bloofi versus the baseline is at most equal to the baseline storage cost.

The insert and delete costs (O(d * log_d N)) increase with d (Fig 10), while the update cost (O(log_d N)) decreases with d.

Fig 8. Search Cost vs d
Fig 9. Storage Cost vs d
Fig 10. Maintenance Cost vs d

Varying Bloom Filter Size: Fig 11 shows the search cost variation with the Bloom filter size in a system with 100,000 total elements (1000 filters with 100 elements each). The search cost for Bloofi is always below the baseline cost, and it decreases to the ideal cost as the size of the Bloom filters increases and the false positive probability of the high-level nodes in Bloofi decreases. In fact, O(d * log_d N) search cost is achieved as soon as probf < 1 for the root, even if probf is very close to 1 (0.993 in our experiments for size 100,992).

The search time (Fig 12) also decreases for Bloofi as probf at the higher levels is reduced, while the search time for the baseline case is constant. For very small Bloom filter size, the search time for the baseline case is slightly lower than using Bloofi, likely due to memory locality.

Fig 11. Search Cost vs Filter Size
Fig 12. Search Time vs Filter Size

Varying Similarity Metric: We use the following metrics: Hamming(A, B) = |A xor B|, Jaccard(A, B) = 1 − |A and B| / |A or B|, and Cosine(A, B) = 1 − |A and B| / (||A||_2 * ||B||_2), where |A| is the number of 1s in the bit vector A. The search cost is very similar for all the metrics (Fig 13). When Jaccard is used, the search cost is a little lower, but the differences are very small, and they might be just due to chance. Fig 14 shows the search time is the lowest when the Hamming distance is used, because the trees constructed using the Hamming distance were shorter than the others, and therefore fewer levels in the tree were searched for matches. Likely due to the memory hierarchy and physical layout of the data, the search time does not depend only on the number of Bloom filters searched, but also on their location in memory.

Fig 13. Search Cost For Different Metrics vs N
Fig 14. Search Time For Different Metrics vs N
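Computed directly on the bit vectors, the three metrics can be sketched as follows (since the vectors are 0/1, the L2 norm of A is sqrt(|A|); the class and method names are illustrative, not from the paper's implementation).

import java.util.BitSet;

// The three similarity metrics from Section 5.2, over Bloom filter bit vectors.
public class Metrics {
    static int ones(BitSet a) { return a.cardinality(); }

    static double hamming(BitSet a, BitSet b) {
        BitSet x = (BitSet) a.clone();
        x.xor(b);
        return ones(x);                                   // |A xor B|
    }

    static double jaccard(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone(); and.and(b);
        BitSet or  = (BitSet) a.clone(); or.or(b);
        return 1.0 - (double) ones(and) / ones(or);       // 1 - |A and B| / |A or B|
    }

    static double cosine(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone(); and.and(b);
        return 1.0 - ones(and) / Math.sqrt((double) ones(a) * ones(b));
    }
}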
Varying Underlying Data Distribution: We use the following distributions for the underlying data in the Bloom filters: nonrandom, in which we insert in each Bloom filter i (0 ≤ i < N) the nbEl integers in the range [i*nbEl, i*nbEl+nbEl), and random, in which we insert in each Bloom filter nbEl integers randomly chosen from a random range assigned to that Bloom filter. nbEl is the actual number of elements in the Bloom filters, which is set to 100 for these experiments. In the nonrandom case, there is no overlap between the sets represented by different Bloom filters, while in the random case, we could have overlap between the sets. We expect that in a real scenario, the random case is more common. The search cost is shown in Fig 15 and the search time is shown in Fig 16. We expected the performance in the random case to be a little worse than in the nonrandom case, since multiple Bloom filters might match a queried object. However, the overlap of ranges in the random case was not enough to lead to a substantial increase in the search cost or search time, even if the number of results was indeed larger in that case. The search performance is very similar for both distributions used, which shows that Bloofi is robust to the underlying data distribution.

Fig 15. Search Cost For Different Distributions vs N
Fig 16. Search Time For Different Distributions vs N
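As an illustration, the two distributions could be generated along the following lines, reusing the SimpleBloomFilter sketch from Section 2 rather than the implementation from [13]; the range width chosen for the random case is an arbitrary assumption, since the paper does not specify one.

import java.util.Random;

// Sketch of the nonrandom and random workloads described above.
public class Workloads {
    // Nonrandom: filter i holds the nbEl integers in [i*nbEl, (i+1)*nbEl).
    static SimpleBloomFilter nonrandomFilter(int i, int nbEl, int m, int k) {
        SimpleBloomFilter f = new SimpleBloomFilter(m, k);
        for (int x = i * nbEl; x < (i + 1) * nbEl; x++) {
            f.add(x);
        }
        return f;
    }

    // Random: nbEl random integers drawn from a randomly assigned range
    // (range start and width here are illustrative choices only).
    static SimpleBloomFilter randomFilter(int nbEl, int m, int k, Random rnd) {
        int rangeStart = rnd.nextInt(1_000_000);
        SimpleBloomFilter f = new SimpleBloomFilter(m, k);
        for (int j = 0; j < nbEl; j++) {
            f.add(rangeStart + rnd.nextInt(nbEl * 10));   // overlapping ranges possible
        }
        return f;
    }
}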

5.3 Experimental Results Conclusion

Our experimental evaluation shows that:

• Search cost increases logarithmically with N, but gets worse if the false positive probability at the root is 1. For optimal performance, the size of the Bloom filters should be based on the estimate of the total number of elements in the entire system. If the false positive probability at the root is below 1, even if very close to 1, the search cost is O(d * log_d N).
• Search cost increases with order, since the search cost is O(d * log_d N), so a low order is preferred for Bloofi.
• Storage cost is O(N/d).
• Insert cost and delete cost are O(d * log_d N), and update cost is O(log_d N).
• Bulk construction gives slightly better search results, since the trees constructed with bulk construction are skinnier. However, differences in performance between bulk and incremental construction are small. The cost of sorting, which is the first step in bulk construction, is O(N^2) in our implementation, so the operation is expensive. Our experimental results show that the Bloofi algorithm for insertion produces a tree very close to the tree obtained by using a global ordering of Bloom filters.
• The distance metric used to compare the "closeness" between Bloom filters does not have a big effect on the search cost of the resulting Bloofi tree, but the Hamming distance seems to lead to the best search time.

6. RELATED WORK

Part of the existing work related to Bloom filters [5, 6, 8, 9, 10] is concerned with extending or improving the Bloom filters themselves and does not deal with the problem of searching through a large set of Bloom filters.

Applications of Bloom filters to web caching [9] use multiple Bloom filters, but their number is in general small and a linear search through the Bloom filters is performed. [11] uses Bloom filters to reduce the cost of semijoins in distributed databases. Most applications of Bloom filters [2, 4] use the Bloom filters directly and do not search through a large number of Bloom filters.

B+-trees inspired the implementation of the Bloofi tree. However, each node in Bloofi has only one value, and the children of a node do not represent completely disjoint sets, so multiple paths might be followed during search. Bitmap indexes [3] are usually used for columns with a low number of distinct values, while the Bloom filters can have thousands of bits. S-trees [7] are used for set-valued attributes, but they were not designed specifically for Bloom filters.

7. CONCLUSIONS AND FUTURE WORK

We introduced Bloofi, a hierarchical index structure for Bloom filters. By taking advantage of intrinsic properties of Bloom filters, Bloofi reduces the search cost of membership queries over thousands of Bloom filters and efficiently supports updates of the existing Bloom filters as well as insertion and deletion of filters. Our experimental results show that Bloofi scales to tens of thousands of Bloom filters, with low storage and maintenance cost. In the extreme worst case, Bloofi's performance is similar to not using any index (O(N) search cost), and in the vast majority of scenarios, Bloofi delivers close to logarithmic performance even if the false positive probability at the root is close to (but less than) one. For future work, we plan on exploring different ways of improving pruning efficiency when probf = 1 for the root.

8. ACKNOWLEDGEMENTS

Special thanks to Ciprian Crainiceanu for insightful discussions while developing Bloofi, and to David Rapp, Andrew Skene, and Bobbe Cooper for providing motivation and applications for this work.

9. REFERENCES

[1] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), July 1970.
[2] A. Broder and M. Mitzenmacher. Network applications of Bloom filters: A survey. In Internet Mathematics, 2002.
[3] C.-Y. Chan and Y. E. Ioannidis. Bitmap index design and evaluation. In SIGMOD, 1998.
[4] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, and D. A. Wallach. Bigtable: A distributed storage system for structured data. In OSDI, 2006.
[5] S. Cohen and Y. Matias. Spectral Bloom filters. In SIGMOD, 2003.
[6] F. Deng and D. Rafiei. Approximately detecting duplicates for streaming data using stable Bloom filters. In SIGMOD, 2006.
[7] U. Deppisch. S-tree: a dynamic balanced signature index for office retrieval. In SIGIR, 1986.
[8] S. Dutta, S. Bhattacherjee, and A. Narang. Towards "intelligent compression" in streams: a biased reservoir sampling based Bloom filter approach. In EDBT, 2012.
[9] L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: a scalable wide-area web cache sharing protocol. In SIGCOMM, 1998.
[10] M. Mitzenmacher. Compressed Bloom filters. In PODC, 2001.
[11] J. Mullin. Optimal semijoins for distributed database systems. In TSE, 1990.
[12] R. Punnoose, A. Crainiceanu, and D. Rapp. Rya: a scalable RDF triple store for the clouds. In Cloud-I, 2012.
[13] M. Skjegstad. Bloom filter implementation. https://github.com/magnuss/java-bloomfilter.