2/16/2017

Bloom Filters

References A. Broder and M. Mitzenmacher, “Network applications of Bloom filters: A survey,” Internet Mathematics, vol. 1 no. 4, pp. 485-509, 2004.

Li Fan, Pei Cao, Jussara Almeida, Andrei Broder, “Summary : A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM Transactions on Networking, Vol. 8, No. 3, June 2000. o OiinfOrigin of cou ntinBlnting Bloo mfiltm filter s

2/16/2017 Bloom Filters (Simon S. Lam) 1

1 2/16/2017

Origin and applications

Randomized introduced by Burton Bloom [CACM 1970] o It represents a for membership queries, with false positives o Probability of false positive can be controlled by design parameters o When space efficiency is important, a Bloom filter may be used if the effect o f false po sitives can be mitigated.

First applications in dictionaries and databases

2/16/2017 Bloom Filters (Simon S. Lam) 2

2 2/16/2017

First application in networking: distributed cache (2000)

Proxy 2 Cache 2 Proxy 1 Summary 1 Cache 1 Summary 3 Summary 2 Summary 3 Proxy 3 Cache 3 Summary 1 Summary 2

 Numerous appli ca tions in ne twor king s ince 2000

2/16/2017 Bloom Filters (Simon S. Lam) 3

3 2/16/2017

Standard Bloom Filter

 A Bloom filter is an array of m representing

a set S = { x1, x2, … , xn} of n elements o Array set to 0 initially

 k independent hash functions h1, … , hk with range {1, 2, …, m } o Assume that each maps each item in the universe to a random number uniformly over the range {1, 2, …, m}

 For each element x in S, the hi(x) in the array itt1f1is set to 1, for 1 ≤ i ≤ k, o A bit in the array may be set to 1 multiple times for different ele ments

2/16/2017 Bloom Filters (Simon S. Lam) 4

4 2/16/2017

A Bloom filter example (()three hash functions)

Insert X1 and X2

Check Y1 and Y2

2/16/2017 Bloom Filters (Simon S. Lam) 5

5 2/16/2017

Standard Bloom Filter (cont.)

To check membership of y in S, check

whether h i(y), 1≤i≤k, are all set to 1 o If not, y is definitely not in S o Else, we conclude that y is in S, but sometimes this conclusion is wrong (false positive) For many applications, false positives are acceptbltable as l ong as th e prob bbilitability o f a false positive is small enough

We will assume that kn < m

2/16/2017 Bloom Filters (Simon S. Lam) 6

6 2/16/2017

False positive probability

 After all members of S have been hashed to a Bloom filter, the probability that a specific bit is still 0 is 1 − p '(1)=−kn ep kn/ m = m  For a non member, it may be found to be a member of S (all of its k bits are nonzero)with) with false positive probability (1−−p ')kk (1p )

2/16/2017 Bloom Filters (Simon S. Lam) 7

7 2/16/2017

False positive probability (cont.)

Define 1 fp'(1')(1(1))=−kknk =−− m fp=−(1 )kknmk =− (1 e− / )

 Two competing forces as k increases k o Larger k -> (1 − p ' ) is smaller for a fixed p’

o Larger k -> p’= (1 − 1 / m ) kn is smaller -> 1-p’ larger

2/16/2017 Bloom Filters (Simon S. Lam) 8

8 2/16/2017

False positive rate vs. k m Number of bits per member = 8 n

Number of

2/16/2017 Bloom Filters (Simon S. Lam) 9

9 2/16/2017

Optimal number k from derivative Rewrite f as fe=−=exp(ln(1−−kn// m ) k ) exp( ke ln(1 − kn m )) Let gk=− ln(1 e−kn/ m ) Minimizinggp() gfg will minimize = exp( ) −kn/ m ∂∂−gke− (1 ) =−ln(1e kn/ m ) + ∂∂∂kkk 1− e−kn/ m ∂k −−kn =−ln(1eekn// m ) + kn m =−+ ln(2) ln(2) = 0 1− em−kn/ m if we plug kmn= ( / )ln 2 which is optimal (iif(It is in fact a gl lblobal opti mum)

2/16/2017 Bloom Filters (Simon S. Lam) 10

10 2/16/2017

Optimal k from symmetry −kn/ m  Alternatively, from pe = we get m kp=− ln( ) n From previous slide, we have

− m gk=−ln(1 ekn/ m ) =− ln( p )ln(1 − p ) n  From above, symmetry indicates that the minimum value for ggp occurs when p=1/2. Thus mm k =−ln(1/ 2) = ln(2) opt nn

2/16/2017 Bloom Filters (Simon S. Lam) 11

11 2/16/2017

Optimal k from symmetry using the precise probability of false positive fp' =−(()1')k =exp( p(())k ln(1' −p ))

From p ' =−((),g11/mk),kn solving for 1 kp=ln(') nml(11/ln(1− 1 / ) Let ggp '=−k ln(()1'p ) ((qin equation for f ' above ) 1 =−ln(pp ')ln(1 ') nmln(1− 1/ )

2/16/2017 Bloom Filters (Simon S. Lam) 12

12 2/16/2017

Using the precise probability of false ppgp()ositive to get optimal k (cont.) From previous slide 1 gpp'=− ln( ')ln(1 ') nmln(1− 1/ )

By symmetry, g’ (also f’) minimized at p’=1/2

Optimal k is 11 kp'ln(')ln(1/2)== opt nmln(1−− 1 / ) nm ln(1 1 / )

2/16/2017 Bloom Filters (Simon S. Lam) 13

13 2/16/2017

Optimal number of hash functions m  Using k = ln(2) the false positive rate is opt n mm ln(()2) ln ()(2) (1−=p )nn (0.5) (0.6185)mn/ , where ln(2) = 0.6931

 In practice, k should be an integer. May choose an integer

valllhlue smaller than kopt to redhhihdduce hashing overhead m/n denotes False positive rate bits per entry

2/16/2017 Bloom Filters (Simon S. Lam) 14

14 2/16/2017

False positive rate vs. bits per entry False positive rate 4 hash functions

Using optimal number of hash functions

m/n 2/16/2017 Bloom Filters (Simon S. Lam) 15

15 2/16/2017

Standard Bloom Filter tricks

Two Bloom filters representing sets S1 and S2 with the same number of bits and using the same hash functions.

o A Bloom filter that represents the union of S1 and S2 can be obtained by taking the OR of the bit vectors A Bloom filter can be halved in size. Suppose thiihe size is a power o f2f 2. o Just OR the first and second halves of the bit vector o When hashing to do a lookup, the highest order bit is masked

Notation: OR denotes bitwise or

2/16/2017 Bloom Filters (Simon S. Lam) 16

16 2/16/2017

Counting Bloom filters

Proposed by Fan et al. [2000] for distributed caching Every entry in a counting Bloom filter is a small counter ((g)rather than a single bit). o When an item is inserted into the set, the corresponding counters are each incremented by 1 o When an item is dldelete dfhfrom the set, th e corresponding counters are each decremented by 1 To avoid counter overflow , its size must be sufficiently large. It was found that 4 bits per couuug.nter are enough.

2/16/2017 Bloom Filters (Simon S. Lam) 17

17 2/16/2017

Counter overflow probability Consider a set of n elements, k hash functions, and m counters o C(i) is the count for the ith counter nk j nk− j ==11 −  Pci[() j ] 1  j mm  nk ≥≤1 P[()ci j ] j j m

j ≤ enk  (a very loose upper bound) jm

2/16/2017 Bloom Filters (Simon S. Lam) 18

18 2/16/2017

Counter overflow probability (cont.) Choose k such that k ≤ m/n (ln 2) Then j j ≥≤enk ≤ eln 2 Pci[()j ]  jm j j ≥≤e ln 2 Pcijm[max ( ) ]  for some i 1≤≤im j Using 4 bits, each counter counts from 0 to 15 Pcim[max ( )≥≤×× 16] 1.37 10−15 1≤≤im

2/16/2017 Bloom Filters (Simon S. Lam) 19

19 2/16/2017

Counter overflow consequences

When a counter does overflow, it may be left at its max imum value. This can later cause a false negative only if eventuallyyg the counter goes down to 0 when it should have remain at nonzero. The exppygected time to this event is very large but is something we need to keep in mind for any application that does not allow false negatives

2/16/2017 Bloom Filters (Simon S. Lam) 20

20 2/16/2017

Conclusions

Wherever a list or set is used, and space is at a prem i um, a Bloom f i lter may be used iffthe the effect of false positives can be mitigated o No false negative With a counting Bloom filter, false negatives are possible, albeit highly unlikely

2/16/2017 Bloom Filters (Simon S. Lam) 21

21 2/16/2017

The End

2/16/2017 Bloom Filters (Simon S. Lam) 22

22