Talk surveys results from I M. Patras¸cuˇ and M. Thorup: The power of simple tabulation hashing. STOC’11 and J. ACM’12. I M. Patras¸cuˇ and M. Thorup: Twisted Tabulation Hashing. SODA’13 I M. Thorup: Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence. FOCS’13. I S. Dahlgaard and M. Thorup: Approximately Minwise Independence with Twisted Tabulation. SWAT’14. I T. Christiani, R. Pagh, and M. Thorup: From Independence to Expansion and Back Again. STOC’15. I S. Dahlgaard, M.B.T. Knudsen and E. Rotenberg, and M. Thorup: Hashing for Statistics over K-Partitions. FOCS’15. .
Using Tabulation to Implement The Power of Hashing Using Tabulation to Implement The Power of Hashing Talk surveys results from I M. Patras¸cuˇ and M. Thorup: The power of simple tabulation hashing. STOC’11 and J. ACM’12. I M. Patras¸cuˇ and M. Thorup: Twisted Tabulation Hashing. SODA’13 I M. Thorup: Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence. FOCS’13. I S. Dahlgaard and M. Thorup: Approximately Minwise Independence with Twisted Tabulation. SWAT’14. I T. Christiani, R. Pagh, and M. Thorup: From Independence to Expansion and Back Again. STOC’15. I S. Dahlgaard, M.B.T. Knudsen and E. Rotenberg, and M. Thorup: Hashing for Statistics over K-Partitions. FOCS’15. . I Providing algorithmically important probabilisitic guarantees akin to those of truly random hashing, yet easy to implement. I Bridging theory (assuming truly random hashing) with practice (needing something implementable). I Many randomized algorithms are very simple and popular in practice, but often implemented with too simple hash functions, so guarantees only for sufficiently random input. I Too simple hash functions may work deceivingly well in random tests, but the real world is full of structured data on which they may fail miserably (as we shall see later).
Target
I Simple and reliable pseudo-random hashing. I Bridging theory (assuming truly random hashing) with practice (needing something implementable). I Many randomized algorithms are very simple and popular in practice, but often implemented with too simple hash functions, so guarantees only for sufficiently random input. I Too simple hash functions may work deceivingly well in random tests, but the real world is full of structured data on which they may fail miserably (as we shall see later).
Target
I Simple and reliable pseudo-random hashing. I Providing algorithmically important probabilisitic guarantees akin to those of truly random hashing, yet easy to implement. I Many randomized algorithms are very simple and popular in practice, but often implemented with too simple hash functions, so guarantees only for sufficiently random input. I Too simple hash functions may work deceivingly well in random tests, but the real world is full of structured data on which they may fail miserably (as we shall see later).
Target
I Simple and reliable pseudo-random hashing. I Providing algorithmically important probabilisitic guarantees akin to those of truly random hashing, yet easy to implement. I Bridging theory (assuming truly random hashing) with practice (needing something implementable). I Too simple hash functions may work deceivingly well in random tests, but the real world is full of structured data on which they may fail miserably (as we shall see later).
Target
I Simple and reliable pseudo-random hashing. I Providing algorithmically important probabilisitic guarantees akin to those of truly random hashing, yet easy to implement. I Bridging theory (assuming truly random hashing) with practice (needing something implementable). I Many randomized algorithms are very simple and popular in practice, but often implemented with too simple hash functions, so guarantees only for sufficiently random input. Target
I Simple and reliable pseudo-random hashing. I Providing algorithmically important probabilisitic guarantees akin to those of truly random hashing, yet easy to implement. I Bridging theory (assuming truly random hashing) with practice (needing something implementable). I Many randomized algorithms are very simple and popular in practice, but often implemented with too simple hash functions, so guarantees only for sufficiently random input. I Too simple hash functions may work deceivingly well in random tests, but the real world is full of structured data on which they may fail miserably (as we shall see later). Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers
• x a t • ! ! • v • ! • f s r • ! ! ! Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers
• x a t x • ! ! ! • v • ! • f s r • ! ! ! Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers I linear probing: sequential search in one array
• q x a g ! c ! ! • • t Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers I linear probing: sequential search in one array
• q x a g ! c ! x ! • t Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers I linear probing: sequential search in one array I cuckoo hashing: search 2 locations, complex updates
a • s • z • y f x w • r • x b • Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers I linear probing: sequential search in one array I cuckoo hashing: search 2 locations, complex updates
a • s • z • y f x w • r • x b • Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers I linear probing: sequential search in one array I cuckoo hashing: search 2 locations, complex updates
a • s • z • y f x w • r • x b • Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers I linear probing: sequential search in one array I cuckoo hashing: search 2 locations, complex updates
a • s • z • y f x w • r • x b • Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers I linear probing: sequential search in one array I cuckoo hashing: search 2 locations, complex updates
a • s • z • y f x w • r • x b • Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers I linear probing: sequential search in one array I cuckoo hashing: search 2 locations, complex updates
z • s • w • y f x x • a • r x b I sketch A and B to later find A B / A B | \ | | [ | A B / A B = Pr[min h(A)=min h(B)] | \ | | [ | h We need h to be "-minwise independent:
1 " x S : Pr[h(x)=min h(S)] = ± 2 S | |
Sketching, streaming, and sampling: 2 I second moment estimation: F2(x¯)= i xi P
Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers. I linear probing: sequential search in one array I cuckoo hashing: search 2 locations, complex updates I sketch A and B to later find A B / A B | \ | | [ | A B / A B = Pr[min h(A)=min h(B)] | \ | | [ | h We need h to be "-minwise independent:
1 " x S : Pr[h(x)=min h(S)] = ± 2 S | |
Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers. I linear probing: sequential search in one array I cuckoo hashing: search 2 locations, complex updates Sketching, streaming, and sampling: 2 I second moment estimation: F2(x¯)= i xi P Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers. I linear probing: sequential search in one array I cuckoo hashing: search 2 locations, complex updates Sketching, streaming, and sampling: 2 I second moment estimation: F2(x¯)= i xi I sketch A and B to later find A B / A B | \ | | P[ | A B / A B = Pr[min h(A)=min h(B)] | \ | | [ | h We need h to be "-minwise independent:
1 " x S : Pr[h(x)=min h(S)] = ± 2 S | | Prototypical example: degree k 1 polynomial I u = b prime;
I choose a0, a1,...,ak 1 randomly in [u]; k 1 I h(x)= a0 + a1x + + ak 1x mod u. ··· Many solutions for k-independent hashing proposed, but generally slow for k > 3 and too slow for k > 5.
Wegman & Carter [FOCS’77] We do not have space for truly random hash functions, but
Family = h :[u] [b] k-independent iff for random h : H { ! } 2H I ( )x [u], h(x) is uniform in [b]; 8 2 I ( )x ,...,x [u], h(x ),...,h(x ) are independent. 8 1 k 2 1 k Many solutions for k-independent hashing proposed, but generally slow for k > 3 and too slow for k > 5.
Wegman & Carter [FOCS’77] We do not have space for truly random hash functions, but
Family = h :[u] [b] k-independent iff for random h : H { ! } 2H I ( )x [u], h(x) is uniform in [b]; 8 2 I ( )x ,...,x [u], h(x ),...,h(x ) are independent. 8 1 k 2 1 k
Prototypical example: degree k 1 polynomial I u = b prime;
I choose a0, a1,...,ak 1 randomly in [u]; k 1 I h(x)= a0 + a1x + + ak 1x mod u. ··· Wegman & Carter [FOCS’77] We do not have space for truly random hash functions, but
Family = h :[u] [b] k-independent iff for random h : H { ! } 2H I ( )x [u], h(x) is uniform in [b]; 8 2 I ( )x ,...,x [u], h(x ),...,h(x ) are independent. 8 1 k 2 1 k
Prototypical example: degree k 1 polynomial I u = b prime;
I choose a0, a1,...,ak 1 randomly in [u]; k 1 I h(x)= a0 + a1x + + ak 1x mod u. ··· Many solutions for k-independent hashing proposed, but generally slow for k > 3 and too slow for k > 5. Independence has been the ruling measure for quality of hash functions for 30+ years, but is it right?
How much independence needed? Chaining E[t]=O(1) 2 E[tk ]=O(1) 2k + 1 max t = O(lg n/ lg lg n) ⇥(lg n/ lg lg n) Linear probing 5 [Pagh2, Ruziˇ c’07]´ 5 [PT ICALP’10] Cuckoo hashing O(lg n) 6 [Cohen, Kane’05] F2 estimation 4 [Alon, Mathias, Szegedy’99] 1 1 "-minwise indep. O lg " [Indyk’99] ⌦ lg " [PT ICALP’10] How much independence needed? Chaining E[t]=O(1) 2 E[tk ]=O(1) 2k + 1 max t = O(lg n/ lg lg n) ⇥(lg n/ lg lg n) Linear probing 5 [Pagh2, Ruziˇ c’07]´ 5 [PT ICALP’10] Cuckoo hashing O(lg n) 6 [Cohen, Kane’05] F2 estimation 4 [Alon, Mathias, Szegedy’99] 1 1 "-minwise indep. O lg " [Indyk’99] ⌦ lg " [PT ICALP’10]
Independence has been the ruling measure for quality of hash functions for 30+ years, but is it right? I For i = 1,...,c, we have truly random hash table: R : char hash values (bit strings) i ! I Hash value
h(x)=R [x ] R [x ] 1 1 ··· c c 1/c I Space cN and time O(c). With 8-bit characters, each Ri has 256 entries and fit in L1 cache. I Simple tabulation is the fastest 3-independent hashing scheme. Speed like 2 multiplications. I Not 4-independent: h(a a ) h(a b ) h(b a ) h(b b ) 1 2 1 2 1 2 1 2 =(R [a ] R [a ]) (R [a ] R [b ]) 1 1 2 2 1 1 2 2 (R [b ] R [a ]) (R [b ] R [b ]) = 0. 1 1 2 2 1 1 2 2
Simple Tabulation Hashing [Zobrist’70 chess]
I Key x divided into c = O(1) characters x1,...,xc, e.g., 32-bit key as 4 8-bit characters. ⇥ I Hash value
h(x)=R [x ] R [x ] 1 1 ··· c c 1/c I Space cN and time O(c). With 8-bit characters, each Ri has 256 entries and fit in L1 cache. I Simple tabulation is the fastest 3-independent hashing scheme. Speed like 2 multiplications. I Not 4-independent: h(a a ) h(a b ) h(b a ) h(b b ) 1 2 1 2 1 2 1 2 =(R [a ] R [a ]) (R [a ] R [b ]) 1 1 2 2 1 1 2 2 (R [b ] R [a ]) (R [b ] R [b ]) = 0. 1 1 2 2 1 1 2 2
Simple Tabulation Hashing [Zobrist’70 chess]
I Key x divided into c = O(1) characters x1,...,xc, e.g., 32-bit key as 4 8-bit characters. ⇥ I For i = 1,...,c, we have truly random hash table: R : char hash values (bit strings) i ! 1/c I Space cN and time O(c). With 8-bit characters, each Ri has 256 entries and fit in L1 cache. I Simple tabulation is the fastest 3-independent hashing scheme. Speed like 2 multiplications. I Not 4-independent: h(a a ) h(a b ) h(b a ) h(b b ) 1 2 1 2 1 2 1 2 =(R [a ] R [a ]) (R [a ] R [b ]) 1 1 2 2 1 1 2 2 (R [b ] R [a ]) (R [b ] R [b ]) = 0. 1 1 2 2 1 1 2 2
Simple Tabulation Hashing [Zobrist’70 chess]
I Key x divided into c = O(1) characters x1,...,xc, e.g., 32-bit key as 4 8-bit characters. ⇥ I For i = 1,...,c, we have truly random hash table: R : char hash values (bit strings) i ! I Hash value
h(x)=R [x ] R [x ] 1 1 ··· c c I Simple tabulation is the fastest 3-independent hashing scheme. Speed like 2 multiplications. I Not 4-independent: h(a a ) h(a b ) h(b a ) h(b b ) 1 2 1 2 1 2 1 2 =(R [a ] R [a ]) (R [a ] R [b ]) 1 1 2 2 1 1 2 2 (R [b ] R [a ]) (R [b ] R [b ]) = 0. 1 1 2 2 1 1 2 2
Simple Tabulation Hashing [Zobrist’70 chess]
I Key x divided into c = O(1) characters x1,...,xc, e.g., 32-bit key as 4 8-bit characters. ⇥ I For i = 1,...,c, we have truly random hash table: R : char hash values (bit strings) i ! I Hash value
h(x)=R [x ] R [x ] 1 1 ··· c c 1/c I Space cN and time O(c). With 8-bit characters, each Ri has 256 entries and fit in L1 cache. I Not 4-independent: h(a a ) h(a b ) h(b a ) h(b b ) 1 2 1 2 1 2 1 2 =(R [a ] R [a ]) (R [a ] R [b ]) 1 1 2 2 1 1 2 2 (R [b ] R [a ]) (R [b ] R [b ]) = 0. 1 1 2 2 1 1 2 2
Simple Tabulation Hashing [Zobrist’70 chess]
I Key x divided into c = O(1) characters x1,...,xc, e.g., 32-bit key as 4 8-bit characters. ⇥ I For i = 1,...,c, we have truly random hash table: R : char hash values (bit strings) i ! I Hash value
h(x)=R [x ] R [x ] 1 1 ··· c c 1/c I Space cN and time O(c). With 8-bit characters, each Ri has 256 entries and fit in L1 cache. I Simple tabulation is the fastest 3-independent hashing scheme. Speed like 2 multiplications. Simple Tabulation Hashing [Zobrist’70 chess]
I Key x divided into c = O(1) characters x1,...,xc, e.g., 32-bit key as 4 8-bit characters. ⇥ I For i = 1,...,c, we have truly random hash table: R : char hash values (bit strings) i ! I Hash value
h(x)=R [x ] R [x ] 1 1 ··· c c 1/c I Space cN and time O(c). With 8-bit characters, each Ri has 256 entries and fit in L1 cache. I Simple tabulation is the fastest 3-independent hashing scheme. Speed like 2 multiplications. I Not 4-independent: h(a a ) h(a b ) h(b a ) h(b b ) 1 2 1 2 1 2 1 2 =(R [a ] R [a ]) (R [a ] R [b ]) 1 1 2 2 1 1 2 2 (R [b ] R [a ]) (R [b ] R [b ]) = 0. 1 1 2 2 1 1 2 2 PT’11: Despite its 4-dependence, simple tabulation suffices for all the above applications: One simple and fast hashing scheme for almost all your needs.
Knuth recommends simple tabulation but cites only 3-independence as mathematical quality. Need to prove, for worst-case input, that the dependence of simple tabulation is not harmful in any of the above applications.
How much independence needed? Wrong question Chaining E[t]=O(1) 2 E[tk ]=O(1) 2k + 1 max t = O(lg n/ lg lg n) ⇥(lg n/ lg lg n) Linear probing 5 [Pagh2, Ruziˇ c’07]´ 5 [PT ICALP’10] Cuckoo hashing O(lg n) 6 [Cohen, Kane’05] F2 estimation 4 [Alon, Mathias, Szegedy’99] 1 1 "-minwise indep. O(lg " ) [Indyk’99] ⌦(lg " ) [PT ICALP’10] Knuth recommends simple tabulation but cites only 3-independence as mathematical quality. Need to prove, for worst-case input, that the dependence of simple tabulation is not harmful in any of the above applications.
How much independence needed? Wrong question Chaining E[t]=O(1) 2 E[tk ]=O(1) 2k + 1 max t = O(lg n/ lg lg n) ⇥(lg n/ lg lg n) Linear probing 5 [Pagh2, Ruziˇ c’07]´ 5 [PT ICALP’10] Cuckoo hashing O(lg n) 6 [Cohen, Kane’05] F2 estimation 4 [Alon, Mathias, Szegedy’99] 1 1 "-minwise indep. O(lg " ) [Indyk’99] ⌦(lg " ) [PT ICALP’10]
PT’11: Despite its 4-dependence, simple tabulation suffices for all the above applications: One simple and fast hashing scheme for almost all your needs. Need to prove, for worst-case input, that the dependence of simple tabulation is not harmful in any of the above applications.
How much independence needed? Wrong question Chaining E[t]=O(1) 2 E[tk ]=O(1) 2k + 1 max t = O(lg n/ lg lg n) ⇥(lg n/ lg lg n) Linear probing 5 [Pagh2, Ruziˇ c’07]´ 5 [PT ICALP’10] Cuckoo hashing O(lg n) 6 [Cohen, Kane’05] F2 estimation 4 [Alon, Mathias, Szegedy’99] 1 1 "-minwise indep. O(lg " ) [Indyk’99] ⌦(lg " ) [PT ICALP’10]
PT’11: Despite its 4-dependence, simple tabulation suffices for all the above applications: One simple and fast hashing scheme for almost all your needs.
Knuth recommends simple tabulation but cites only 3-independence as mathematical quality. How much independence needed? Wrong question Chaining E[t]=O(1) 2 E[tk ]=O(1) 2k + 1 max t = O(lg n/ lg lg n) ⇥(lg n/ lg lg n) Linear probing 5 [Pagh2, Ruziˇ c’07]´ 5 [PT ICALP’10] Cuckoo hashing O(lg n) 6 [Cohen, Kane’05] F2 estimation 4 [Alon, Mathias, Szegedy’99] 1 1 "-minwise indep. O(lg " ) [Indyk’99] ⌦(lg " ) [PT ICALP’10]
PT’11: Despite its 4-dependence, simple tabulation suffices for all the above applications: One simple and fast hashing scheme for almost all your needs.
Knuth recommends simple tabulation but cites only 3-independence as mathematical quality. Need to prove, for worst-case input, that the dependence of simple tabulation is not harmful in any of the above applications. Chaining/hashing into bins Theorem Consider hashing n balls into m n1 1/(2c) bins by simple tabulation. Let q be an additional query ball, and define Xq as the number of regular balls hashing to the same bin as q. n Let µ = E[Xq]= m . The following probability bounds hold for any constant :