Folding and Unfolding Bloom Filters - an Off-Line Planning and On-Line Optimization Problem

Total Page:16

File Type:pdf, Size:1020Kb

Folding and Unfolding Bloom Filters - an Off-Line Planning and On-Line Optimization Problem Folding and Unfolding Bloom Filters - An Off-Line Planning and On-Line Optimization Problem Francoise Sailhan∗y, and Mark-Oliver Stehr∗ ∗ Computer Science Laboratory, SRI international Menlo Park, CA 94025 USA yCedric Laboratory, CNAM, 75003 Paris, France Emails: [email protected], [email protected] Abstract—The Internet of things has reached a stage that do not occur. Nevertheless, the approximate nature of Bloom allows ubiquitous data access. Still, practical limitations remain filter implies that false positives might appear. The probability in networks with scarce bandwidth. Here, we examine the of a false positives depends on (i) the Bloom filter size, (ii) Bloom filter data structure and its use in distributed protocols. We discuss how to minimize the bandwidth and energy usage the number of hash functions and (iii) the number of elements consumed when distributed protocols exchange Bloom filters, stored. Keeping the false positive rate to a minimum requires through dynamic Bloom filter resizing. We propose a general dynamic adaptation of the number of hash functions and/or of and novel formalization of Bloom filter resizing, through foldings the Bloom filter size. Further attempts to increase the size of and unfoldings. The key challenge in the folding approach is the Bloom filter have appeared in the literature. The basic idea determining suitable parameters and how to perform a folding. Specifically, we address the number of times that a Bloom filter is to create a set of Bloom filters [6], [16], [1]. As a side effect, should be folded and optionally unfolded, and how to determine a membership query is a function on the size of this set. While an ideal reduction factor for this process. We formulate our these methods deal with the increase of the Bloom filter size, approach as off-line planning of the integer factorization problem the related decrease is neglected. An alternative [7] consists in (where the integers correspond to the size of a Bloom filter), and partitioning the Bloom filter into k disjoint sub-ranges wherein propose further directions for optimizing the dynamic folding and unfolding of a Bloom filter. an element is added by hashing the sub-range with the related hash function hk. Depending on the expected probability of a I. INTRODUCTION false positives, k is either dynamically increased or decreased, The last decade is characterised by the convergence of per- hence leading to a reduced or increased Bloom filter size. vasive technologies including wireless communication, smart Nevertheless, changing the number of hash functions offers devices and the Internet. The resulting Internet of things lower granularity than resizing the Bloom filter. enables not only users but also things to access resources We herein consider the problem of dynamically resizing anywhere, anytime. In practice, data flows ubiquitously us- the Bloom filter by folding and unfolding it. This can be ing global wireless connectivity (e.g., 3G) or ad-hoc local expressed as follows: given a number of items stored in the area networking, as enabled by, for example, IEEE 802.15.4 Bloom filter, let reduce/increase the size of the Bloom filter by protocols. Two key limitations of these networks remain the folding/unfolding it so that the false positive rate ρ stays in the limits on bandwidth and energy. In this context, early-stage following interval [ρ − , ρ + ] and the increase of memory data and resource discovery should be provided so as to allow size is kept to a minimum. We acknowledge that halving a efficient data dissemination [14]. A Bloom filter [2] can play a Bloom filter was originally suggested in [3] and successfully crucial role by allowing to represent a set of data in a compact applied [5] so as to reduce the bandwidth consumption induced manner and by supporting efficient membership queries. More by the exchange of Bloom filters in the context of correlated precisely, a Bloom filter is a vector of bits. It is distinguished anomaly detection in large-scale grid computing [15]. We by the fact that the time associated with a membership query herein extend and generalize this approach by introducing the is independent of the number of elements stored in the Bloom concept of folded and unfolded Bloom filters, which permits filter as well the size of the Bloom filter. In a nutshell, the highly flexible resizing. Bloom filter works as follows: an element to be added to the We also introduce a novel formulation of the problem, Bloom filter is first hashed. Typically, k > 1 hash functions which allows to keep the false positive rate in check by are used. Then, the k outputs are used to flip to 1 the bits at the injecting entropy into the folding process. The key challenge related positions in the Bloom filter. In order to check whether consists in determining how a folding should be performed, an element is stored in the Bloom filter, the queried element namely the number of times the Bloom filter should be is hashed (relying on the same hashing functions). If the k folded/unfolded and the reduction factor associated with each bits at the related position are set to 1 in the Bloom filter, this folding/unfolding. We show that this can be formulated as (i) element is stored. It is worth mentioning that false negatives an off-line planning problem of the factorization of an integer (namely the Bloom filter size) and (ii) an on-line optimization This work was carried out when F. Sailhan was a visiting fellow at SRI. problem of the Bloom filter folding/unfolding. Our main contribution is twofold: we propose an algorithm for planning that depends on the space allocated to count. These counters the folding and unfolding of the Bloom filter depending on must be selected to be large enough to avoid overflows. the expected false positive rate and the related resource waste. A. Properties of Bloom Filters We formulate this problem and provide both a theoretical and experimental analysis of such planning. In addition, based on A [counting] Bloom filter is characterized by the following these planned foldings/unfoldings, we propose an approach to properties: dynamically optimizing the folding and unfolding of a Bloom 1) The time associated with a membership query is filter. independent of the number of elements stored in The remaining sections are organized as follows. We first the [counting] Bloom filter as well as its size. More introduce Bloom filters and survey the related literature (xII). precisely, the time required to insert or find an element Then, we present the Bloom filter folding/unfolding (xIII) depends on the number of hash functions and is hence and the planning of the folding/unfolding (xIV). Finally, we o(k). conclude this report with further optimization and research directions (xV). 2) Both basic and counting Bloom filters engender false positives. Given n the number of elements that are stored II. BACKGROUND ON BLOOM FILTERS in the [counting] Bloom filter (see Table I) and assuming Bloom filters [2] are commonly used to represent a set that the hash functions are perfect, the probability of a so as to support efficient membership queries, that is, to false positive, denoted ρ, satisfies: efficiently test whether an element is a member of a set. For 1 kn k kn this purpose, a Bloom filter B is represented as a vector of ρ = (1 − (1 − ) ) ≈ (1 − e m ) (1) bits, denoted b(1); ··· ; b(m). This vector is initially set to m 0, i.e., 8i 2 [0; m], b(i) = 0. The vector is then updated by Unlike a basic Bloom filter, a counting Bloom adding additional elements. filter may produce a false negative, which occurs if the counting vector is undersized and generates an overflow. a) Bloom filter update: In order to add an element e to the Bloom filter, k hash functions h1; ··· ; hk are used. Each 3) The union of two sets S1 and S2 can be obtained as hash function is applied to the element: h1(e); ··· ; hk(e). follows: Then, the k bits at positions h1(e) mod(m); ··· ; hk(e) • applying a bit-wise OR on the two corresponding mod(m) are set to 1. basic Bloom filters, and • adding the vectors of the two counting Bloom filters. Note that the resulting value cannot exceed m: Bloom filter size b(1); ··· ; b(m): Bloom filter a maximum value defined by the counters’ sizes. k: number of hash functions h1(); ··· ; hk(): set of hash functions n: set of items stored in the Bloom filter 4) The intersection of two sets S1 and S2 cannot be obtained in a deterministic manner on the [counting] TABLE I NOTATION Bloom filter. 5) With a counting Bloom filter, the subtraction of two b) Checking the membership of an element: The sets S1 − S2 can be obtained by subtracting the process of checking whether an element q belongs to the corresponding counting Bloom filters, CB1 − CB2. Bloom filter is very similar to adding an element. First, q is Basic Bloom filters do not support such subtraction. hashed: h1(q) mod(m); ··· ; hk(q) mod(m). If any bit at the related positions is set to 0, then the element is not stored Despite the above promising properties, [counting] Bloom in the Bloom filter. Otherwise (i.e., if none of those bits filters suffer from a fundamental shortcoming: once set, the is set to 0), the element is said to be stored in the Bloom filter. size of the Bloom filter cannot be easily changed. An over- or under-estimate of this size leads to a waste of mem- c) Suppressing an element: Items cannot be removed ory/bandwidth and/or an increased false positive rate.
Recommended publications
  • Various Arithmetic Functions and Their Applications
    University of New Mexico UNM Digital Repository Mathematics and Statistics Faculty and Staff Publications Academic Department Resources 2016 Various Arithmetic Functions and their Applications Florentin Smarandache University of New Mexico, [email protected] Octavian Cira Follow this and additional works at: https://digitalrepository.unm.edu/math_fsp Part of the Algebra Commons, Applied Mathematics Commons, Logic and Foundations Commons, Number Theory Commons, and the Set Theory Commons Recommended Citation Smarandache, Florentin and Octavian Cira. "Various Arithmetic Functions and their Applications." (2016). https://digitalrepository.unm.edu/math_fsp/256 This Book is brought to you for free and open access by the Academic Department Resources at UNM Digital Repository. It has been accepted for inclusion in Mathematics and Statistics Faculty and Staff Publications by an authorized administrator of UNM Digital Repository. For more information, please contact [email protected], [email protected], [email protected]. Octavian Cira Florentin Smarandache Octavian Cira and Florentin Smarandache Various Arithmetic Functions and their Applications Peer reviewers: Nassim Abbas, Youcef Chibani, Bilal Hadjadji and Zayen Azzouz Omar Communicating and Intelligent System Engineering Laboratory, Faculty of Electronics and Computer Science University of Science and Technology Houari Boumediene 32, El Alia, Bab Ezzouar, 16111, Algiers, Algeria Octavian Cira Florentin Smarandache Various Arithmetic Functions and their Applications PONS asbl Bruxelles, 2016 © 2016 Octavian Cira, Florentin Smarandache & Pons. All rights reserved. This book is protected by copyright. No part of this book may be reproduced in any form or by any means, including photocopying or using any information storage and retrieval system without written permission from the copyright owners Pons asbl Quai du Batelage no.
    [Show full text]
  • Université De Montréal Prime Number Races Tony Haddad
    Université de Montréal Prime Number Races par Tony Haddad Département de mathématiques et de statistique Faculté des arts et des sciences Mémoire présenté en vue de l’obtention du grade de Maître ès sciences (M.Sc.) en mathématiques Orientation mathématiques pures Août 2020 c Tony Haddad, 2020 Université de Montréal Faculté des arts et des sciences Ce mémoire intitulé Prime Number Races présenté par Tony Haddad a été évalué par un jury composé des personnes suivantes : Andrew Granville (président-rapporteur) Dimitris Koukoulopoulos (directeur de recherche) Matilde Lalín (membre du jury) Résumé Sous l’hypothèse de Riemann généralisée et l’hypothèse d’indépendance linéaire, Rubinstein et Sarnak ont prouvé que les valeurs de x > 1 pour lesquelles nous avons plus de nombres premiers de la forme 4n + 3 que de nombres premiers de la forme 4n + 1 en dessous de x ont une densité logarithmique d’environ 99,59 %. En général, l’étude de la différence #{p 6 x : p ∈ A} − #{p 6 x : p ∈ B} pour deux sous-ensembles de nombres premiers A et B s’appelle la course entre les nombres premiers de A et de B. Dans ce mémoire, nous cherchons ultimement à analyser d’un point de vue numérique et statistique la course entre les nombres premiers p tels que 2p + 1 est aussi premier (aussi appelés nombres premiers de Sophie Germain) et les nombres premiers p tels que 2p − 1 est aussi premier. Pour ce faire, nous présentons au préalable l’analyse de Rubinstein et Sarnak pour pouvoir repérer d’où vient le biais dans la course entre les nombres premiers 1 (mod 4) et les nombres premiers 3 (mod 4) et émettons une conjecture sur la distribution des nombres premiers de Sophie Germain.
    [Show full text]
  • Folding and Unfolding Bloom Filters - an Off-Line Planing and on Line-Optimisation Francoise Sailhan, Mark-Oliver Stehr
    Folding and Unfolding Bloom Filters - an Off-line Planing and on Line-Optimisation Francoise Sailhan, Mark-Oliver Stehr To cite this version: Francoise Sailhan, Mark-Oliver Stehr. Folding and Unfolding Bloom Filters - an Off-line Planing and on Line-Optimisation. IEEE International Conference on Internet of Things, Nov 2012, Besancon, France. pp.34-41, 10.1109/GreenCom.2012.16. hal-01126174 HAL Id: hal-01126174 https://hal.archives-ouvertes.fr/hal-01126174 Submitted on 27 Nov 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Folding and Unfolding Bloom Filters - An Off-Line Planning and On-Line Optimization Problem Francoise Sailhan∗y, and Mark-Oliver Stehr∗ ∗ Computer Science Laboratory, SRI international Menlo Park, CA 94025 USA yCedric Laboratory, CNAM, 75003 Paris, France Emails: [email protected], [email protected] Abstract—The Internet of things has reached a stage that do not occur. Nevertheless, the approximate nature of Bloom allows ubiquitous data access. Still, practical limitations remain filter implies that false positives might appear. The probability in networks with scarce bandwidth. Here, we examine the of a false positives depends on (i) the Bloom filter size, (ii) Bloom filter data structure and its use in distributed protocols.
    [Show full text]
  • Folding and Unfolding Bloom Filters - an Off-Line Planing and On-Line Optimisation
    Folding and Unfolding Bloom Filters - An off-Line Planing and on-Line Optimisation Francoise Sailhan∗†, and Mark-Oliver Stehr∗ ∗ Computer Science Laboratory, SRI international Menlo Park, CA 94025 USA Email: fi[email protected] †Cedric Laboratory, CNAM, 75003 Paris, France Email: fi[email protected] I. INTRODUCTION an element that is added by hashing this latter with the related Last decade is characterised by the convergence of pervasive hash function hk. Depending on the expected probability of technologies including wireless communication, smart devices false positive, k is either dynamically increased and also and the Internet. The resulting Internet of the things enables decreased, hence leading to an reduced/increased Bloom filter not-only users but also things to access resource anywhere, size. Nevertheless, still changing the number of hash functions anytime. In practice, data is ubiquitously flowed relying on offers a lower granularity on the resizing of the Bloom filter global wireless connectivity (e.g., 3G) or on ad hoc-based comparing to a direct fine-grained resizing. local area networking, as enabled by e.g., IEEE 802.15.4 protocols. One key limitation of those networks remains the We herein consider the problem of dynamically resizing limited bandwidth and energy. In this context, early-stage data the Bloom filter by folding and unfolding it. This can be and resource discovery should be provided so as to provide expressed as follows. Given a number of items stored in the efficient data dissemination [12]. A Bloom filter is a space- Bloom filter, let us reduce/ increase the size of the Bloom filter efficient structure (namely a vector of bits) that is dedicated by folding/unfolding it so that the false positive rate keeps into [ρ ǫ,ρ + ǫ] and that the unnecessary increase of the to the representation of a set and designed so as to support − efficient membership queries.
    [Show full text]
  • Partitions, Hypergeometric Systems, and Dirichlet Processes in Statistics Springerbriefs in Statistics
    SPRINGER BRIEFS IN STATISTICS JSS RESEARCH SERIES IN STATISTICS Shuhei Mano Partitions, Hypergeometric Systems, and Dirichlet Processes in Statistics SpringerBriefs in Statistics JSS Research Series in Statistics Editors-in-Chief Naoto Kunitomo Akimichi Takemura Series editors Genshiro Kitagawa Tomoyuki Higuchi Toshimitsu Hamasaki Shigeyuki Matsui Manabu Iwasaki Yasuhiro Omori Masafumi Akahira Takahiro Hoshino Masanobu Taniguchi The current research of statistics in Japan has expanded in several directions in line with recent trends in academic activities in the area of statistics and statistical sciences over the globe. The core of these research activities in statistics in Japan has been the Japan Statistical Society (JSS). This society, the oldest and largest academic organization for statistics in Japan, was founded in 1931 by a handful of pioneer statisticians and economists and now has a history of about 80 years. Many distinguished scholars have been members, including the influential statistician Hirotugu Akaike, who was a past president of JSS, and the notable mathematician Kiyosi Itô, who was an earlier member of the Institute of Statistical Mathematics (ISM), which has been a closely related organization since the establishment of ISM. The society has two academic journals: the Journal of the Japan Statistical Society (English Series) and the Journal of the Japan Statistical Society (Japanese Series). The membership of JSS consists of researchers, teachers, and professional statisticians in many different fields including mathematics, statistics, engineering, medical sciences, government statistics, economics, business, psychology, education, and many other natural, biological, and social sciences. The JSS Series of Statistics aims to publish recent results of current research activities in the areas of statistics and statistical sciences in Japan that otherwise would not be available in English; they are complementary to the two JSS academic journals, both English and Japanese.
    [Show full text]
  • Optimizing Robustness While Generating Shared Secret Safe Primes
    Optimizing Robustness while Generating Shared Secret Safe Primes Emil Ong and John Kubiatowicz University of California, Berkeley Abstract. We develop a method for generating shared, secret, safe primes applicable to use in threshold RSA signature schemes such as the one developed by Shoup. We would like a scheme usable in practical settings, so our protocol is robust and efficient in asynchronous, hostile environments. We show that the techniques used for robustness need spe- cial care when they must be efficient. Specifically, we show optimizations that minimize the number and size of the proofs of knowledge used. We also develop optimizations based on computer arithmetic algorithms, in particular, precomputation and Montgomery modular multiplication. Keywords: Distributed key generation, safe primes, threshold RSA signatures. 1 Introduction Shoup's scheme [1] for threshold RSA signatures was a great leap forward in making threshold signature schemes practical. Its ability to avoid interaction while signing makes the scheme efficient and easy to implement. Unfortunately, Shoup's scheme required the use of a safe prime product modulus for its proof of correctness. Moreover, the scheme assumes a trusted dealer to create and distribute this modulus and the private key shares. Since the development of Shoup's scheme, several works ([2{4]) have been published to try to eliminate the single dealer, but none have shown the costs associated with a robust solution. In this paper, we show the cost required for a robust implementation of a distributed safe prime generation scheme. We follow the basic form of the al- gorithm in [2], but we also show that the changes necessary for robustness are non-trivial if we want efficiency.
    [Show full text]
  • Arxiv:1603.08456V1 [Math.GM] 28 Mar 2016 2 Preface
    Various Arithmetic Functions and their Applications Octavian Cira and Florentin Smarandache March 29, 2016 arXiv:1603.08456v1 [math.GM] 28 Mar 2016 2 Preface Over 300 sequences and many unsolved problems and conjectures related to them are presented herein. These notions, definitions, unsolved problems, questions, theorems corollaries, formulae, conjectures, examples, mathemat- ical criteria, etc. on integer sequences, numbers, quotients, residues, expo- nents, sieves, pseudo-primes squares cubes factorials, almost primes, mobile periodicals, functions, tables, prime square factorial bases, generalized factori- als, generalized palindromes, so on, have been extracted from the Archives of American Mathematics (University of Texas at Austin) and Arizona State Uni- versity (Tempe): "The Florentin Smarandache papers" special collections, Uni- versity of Craiova Library, and Arhivele Statului (Filiala Vâlcea & Filiala Dolj, România). The book is based on various articles in the theory of numbers (starting from 1975), updated many times. Special thanks to C. Dumitrescu and V. Se- leacufrom the University of Craiova (see their edited book "Some Notions and Questions in Number Theory", Erhus Press, Glendale, 1994), M. Bencze, L. Tu- tescu, E. Burton, M. Coman, F. Russo, H. Ibstedt, C. Ashbacher, S. M. Ruiz, J. Sandor, G. Policarp, V. Iovan, N. Ivaschescu, etc. who helped incollecting and editing this material. This book was born from the collaboration of the two authors, which started in 2013. The first common work was the volume "Solving Diophantine Equa- tions", published in 2014. The contribution of the authors can be summarized as follows: Florentin Smarandache came with his extraordinary ability to pro- pose new areas of study in number theory, and Octavian Cira – with his algo- rithmic thinking and knowledge of Mathcad.
    [Show full text]
  • Rough Numbers and Stick-Breaking
    Utrecht University Faculty of Science Department of Mathematics Rough numbers and stick-breaking Master Thesis Daily supervisor: dr. L.P. Grimmelt Author: First reader: Bart van Hoek dr. D. Schindler Second reader: Prof. dr. G.L.M. Cornelissen July 9, 2019 Abstract We construct the Buchstab function using a \stick-breaking process". This gives us a new way of expressing the function and thereby some new interpretations. To do this we first look at the basic results concerning rough numbers like the prime number theorem. We analyse the strongest known result for counting rough numbers by G´eraldTenenbaum [2]. 1 Contents 1 Introduction 3 2 The prime number theorem 5 2.1 Estimating sums over prime numbers . .5 2.2 Abel's summation formula . 10 2.3 Mertens' theorem . 11 2.4 Perron's formula . 12 2.5 The prime number theorem . 18 3 Rough numbers 24 3.1 Smooth numbers . 24 3.2 Rough numbers for small y .............................. 25 3.3 The Buchstab function . 27 3.4 Buchstab with better main term . 30 3.5 Strongest known result . 32 4 Stick-breaking 35 4.1 The stick-breaking process . 35 4.2 The Dickman function . 36 4.3 A naive guess . 37 4.4 Stick-breaking for the Buchstab function . 38 4.5 Interpretation of the !k ................................ 41 2 1 Introduction It was in 1896 that the prime number theorem was first proven. It was shown, independently, by both Hadamard [10] and de la Vall´eePoussin [11] that Z x 1 π(x) ≈ dt: (1) 2 log t The error term of this approximation has been improved multiple times over the last century and is still being looked at today.
    [Show full text]