2/16/2017
Bloom Filters
References A. Broder and M. Mitzenmacher, “Network applications of Bloom filters: A survey,” Internet Mathematics, vol. 1 no. 4, pp. 485-509, 2004.
Li Fan, Pei Cao, Jussara Almeida, Andrei Broder, “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM Transactions on Networking, Vol. 8, No. 3, June 2000. o OiinfOrigin of cou ntinBlnting Bloo mfiltm filter s
2/16/2017 Bloom Filters (Simon S. Lam) 1
1 2/16/2017
Origin and applications
Randomized data structure introduced by Burton Bloom [CACM 1970] o It represents a set for membership queries, with false positives o Probability of false positive can be controlled by design parameters o When space efficiency is important, a Bloom filter may be used if the effect o f false po sitives can be mitigated.
First applications in dictionaries and databases
2/16/2017 Bloom Filters (Simon S. Lam) 2
2 2/16/2017
First application in networking: distributed cache (2000)
Proxy 2 Cache 2 Proxy 1 Summary 1 Cache 1 Summary 3 Summary 2 Summary 3 Proxy 3 Cache 3 Summary 1 Summary 2
Numerous appli ca tions in ne twor king s ince 2000
2/16/2017 Bloom Filters (Simon S. Lam) 3
3 2/16/2017
Standard Bloom Filter
A Bloom filter is an array of m bits representing
a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially
k independent hash functions h1, … , hk with range {1, 2, …, m } o Assume that each hash function maps each item in the universe to a random number uniformly over the range {1, 2, …, m}
For each element x in S, the bit hi(x) in the array itt1f1is set to 1, for 1 ≤ i ≤ k, o A bit in the array may be set to 1 multiple times for different ele ments
2/16/2017 Bloom Filters (Simon S. Lam) 4
4 2/16/2017
A Bloom filter example (()three hash functions)
Insert X1 and X2
Check Y1 and Y2
2/16/2017 Bloom Filters (Simon S. Lam) 5
5 2/16/2017
Standard Bloom Filter (cont.)
To check membership of y in S, check
whether h i(y), 1≤i≤k, are all set to 1 o If not, y is definitely not in S o Else, we conclude that y is in S, but sometimes this conclusion is wrong (false positive) For many applications, false positives are acceptbltable as l ong as th e prob bbilitability o f a false positive is small enough
We will assume that kn < m
2/16/2017 Bloom Filters (Simon S. Lam) 6
6 2/16/2017
False positive probability
After all members of S have been hashed to a Bloom filter, the probability that a specific bit is still 0 is 1 − p '(1)=−kn ep kn/ m = m For a non member, it may be found to be a member of S (all of its k bits are nonzero)with) with false positive probability (1−−p ')kk (1p )
2/16/2017 Bloom Filters (Simon S. Lam) 7
7 2/16/2017
False positive probability (cont.)
Define 1 fp'(1')(1(1))=−kknk =−− m fp=−(1 )kknmk =− (1 e− / )
Two competing forces as k increases k o Larger k -> (1 − p ' ) is smaller for a fixed p’
o Larger k -> p’= (1 − 1 / m ) kn is smaller -> 1-p’ larger
2/16/2017 Bloom Filters (Simon S. Lam) 8
8 2/16/2017
False positive rate vs. k m Number of bits per member = 8 n
Number of
2/16/2017 Bloom Filters (Simon S. Lam) 9
9 2/16/2017
Optimal number k from derivative Rewrite f as fe=−=exp(ln(1−−kn// m ) k ) exp( ke ln(1 − kn m )) Let gk=− ln(1 e−kn/ m ) Minimizinggp() gfg will minimize = exp( ) −kn/ m ∂∂−gke− (1 ) =−ln(1e kn/ m ) + ∂∂∂kkk 1− e−kn/ m ∂k −−kn =−ln(1eekn// m ) + kn m =−+ ln(2) ln(2) = 0 1− em−kn/ m if we plug kmn= ( / )ln 2 which is optimal (iif(It is in fact a gl lblobal opti mum)
2/16/2017 Bloom Filters (Simon S. Lam) 10
10 2/16/2017
Optimal k from symmetry −kn/ m Alternatively, from pe = we get m kp=− ln( ) n From previous slide, we have
− m gk=−ln(1 ekn/ m ) =− ln( p )ln(1 − p ) n From above, symmetry indicates that the minimum value for ggp occurs when p=1/2. Thus mm k =−ln(1/ 2) = ln(2) opt nn
2/16/2017 Bloom Filters (Simon S. Lam) 11
11 2/16/2017
Optimal k from symmetry using the precise probability of false positive fp' =−(()1')k =exp( p(())k ln(1' −p ))
From p ' =−((),g11/mk),kn solving for 1 kp=ln(') nml(11/ln(1− 1 / ) Let ggp '=−k ln(()1'p ) ((qin equation for f ' above ) 1 =−ln(pp ')ln(1 ') nmln(1− 1/ )
2/16/2017 Bloom Filters (Simon S. Lam) 12
12 2/16/2017
Using the precise probability of false ppgp()ositive to get optimal k (cont.) From previous slide 1 gpp'=− ln( ')ln(1 ') nmln(1− 1/ )
By symmetry, g’ (also f’) minimized at p’=1/2
Optimal k is 11 kp'ln(')ln(1/2)== opt nmln(1−− 1 / ) nm ln(1 1 / )
2/16/2017 Bloom Filters (Simon S. Lam) 13
13 2/16/2017
Optimal number of hash functions m Using k = ln(2) the false positive rate is opt n mm ln(()2) ln ()(2) (1−=p )nn (0.5) (0.6185)mn/ , where ln(2) = 0.6931
In practice, k should be an integer. May choose an integer
valllhlue smaller than kopt to redhhihdduce hashing overhead m/n denotes False positive rate bits per entry
2/16/2017 Bloom Filters (Simon S. Lam) 14
14 2/16/2017
False positive rate vs. bits per entry False positive rate 4 hash functions
Using optimal number of hash functions
m/n 2/16/2017 Bloom Filters (Simon S. Lam) 15
15 2/16/2017
Standard Bloom Filter tricks
Two Bloom filters representing sets S1 and S2 with the same number of bits and using the same hash functions.
o A Bloom filter that represents the union of S1 and S2 can be obtained by taking the OR of the bit vectors A Bloom filter can be halved in size. Suppose thiihe size is a power o f2f 2. o Just OR the first and second halves of the bit vector o When hashing to do a lookup, the highest order bit is masked
Notation: OR denotes bitwise or
2/16/2017 Bloom Filters (Simon S. Lam) 16
16 2/16/2017
Counting Bloom filters
Proposed by Fan et al. [2000] for distributed caching Every entry in a counting Bloom filter is a small counter ((g)rather than a single bit). o When an item is inserted into the set, the corresponding counters are each incremented by 1 o When an item is dldelete dfhfrom the set, th e corresponding counters are each decremented by 1 To avoid counter overflow , its size must be sufficiently large. It was found that 4 bits per couuug.nter are enough.
2/16/2017 Bloom Filters (Simon S. Lam) 17
17 2/16/2017
Counter overflow probability Consider a set of n elements, k hash functions, and m counters o C(i) is the count for the ith counter nk j nk− j ==11 − Pci[() j ] 1 j mm nk ≥≤1 P[()ci j ] j j m
j ≤ enk (a very loose upper bound) jm
2/16/2017 Bloom Filters (Simon S. Lam) 18
18 2/16/2017
Counter overflow probability (cont.) Choose k such that k ≤ m/n (ln 2) Then j j ≥≤enk ≤ eln 2 Pci[()j ] jm j j ≥≤e ln 2 Pcijm[max ( ) ] for some i 1≤≤im j Using 4 bits, each counter counts from 0 to 15 Pcim[max ( )≥≤×× 16] 1.37 10−15 1≤≤im
2/16/2017 Bloom Filters (Simon S. Lam) 19
19 2/16/2017
Counter overflow consequences
When a counter does overflow, it may be left at its max imum value. This can later cause a false negative only if eventuallyyg the counter goes down to 0 when it should have remain at nonzero. The exppygected time to this event is very large but is something we need to keep in mind for any application that does not allow false negatives
2/16/2017 Bloom Filters (Simon S. Lam) 20
20 2/16/2017
Conclusions
Wherever a list or set is used, and space is at a prem i um, a Bloom f i lter may be used iffthe the effect of false positives can be mitigated o No false negative With a counting Bloom filter, false negatives are possible, albeit highly unlikely
2/16/2017 Bloom Filters (Simon S. Lam) 21
21 2/16/2017
The End
2/16/2017 Bloom Filters (Simon S. Lam) 22
22