The Power of Tabulation Hashing

Mikkel Thorup, University of Copenhagen and AT&T

Joint work with Mihai Pătrașcu. Some of it found in Proc. STOC'11.

Thank you for inviting me to China Theory Week.

Target

- Simple and reliable pseudo-random hashing.
- Providing algorithmically important probabilistic guarantees akin to those of truly random hashing, yet easy to implement.
- Bridging theory (assuming truly random hashing) with practice (needing something implementable).

Applications of Hashing

Hash tables (n keys and 2n hashes: expect 1/2 keys per hash)

- chaining: follow pointers
- linear probing: sequential search in one array
- cuckoo hashing: search ≤ 2 locations, complex updates

[Figures: bucket arrays illustrating chaining, linear probing, and cuckoo hashing.]

Sketching, streaming, and sampling:

- second moment estimation: F2(x̄) = Σ_i x_i²
- sketch A and B to later find |A ∩ B|/|A ∪ B| (a toy estimator is sketched in code below):

  |A ∩ B|/|A ∪ B| = Pr_h[min h(A) = min h(B)]

  We need h to be ε-minwise independent:

  (∀) x ∉ S : Pr[h(x) < min h(S)] = (1 ± ε)/(|S| + 1)

Important outside theory: these simple practical hash tables are often bottlenecks in the processing of data; a substantial fraction of the world's computational resources is spent here.
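
As a concrete illustration of the min-wise identity, here is a toy C sketch that estimates |A ∩ B|/|A ∪ B| by comparing min h(A) and min h(B) over many independently seeded hash functions. The splitmix64-style mixer and all constants are placeholder choices for the demo, not the ε-minwise schemes analyzed in this talk.

```c
#include <stdint.h>
#include <stdio.h>

/* Placeholder mixer (splitmix64-style), seeded per trial; NOT one of
 * the epsilon-minwise schemes analyzed in the talk. */
static uint64_t toy_hash(uint64_t x, uint64_t seed) {
    x += seed * 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

static uint64_t min_hash(const uint64_t *set, int n, uint64_t seed) {
    uint64_t m = UINT64_MAX;
    for (int i = 0; i < n; i++) {
        uint64_t h = toy_hash(set[i], seed);
        if (h < m) m = h;
    }
    return m;
}

int main(void) {
    /* A = {0..29}, B = {15..44}: |A ∩ B| = 15, |A ∪ B| = 45. */
    uint64_t A[30], B[30];
    for (int i = 0; i < 30; i++) { A[i] = i; B[i] = i + 15; }

    int trials = 100000, agree = 0;
    for (int t = 0; t < trials; t++)
        if (min_hash(A, 30, t + 1) == min_hash(B, 30, t + 1)) agree++;

    /* Pr[min h(A) = min h(B)] estimates the Jaccard similarity 1/3. */
    printf("estimate %.3f (true 0.333)\n", (double)agree / trials);
    return 0;
}
```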

Carter & Wegman (1977)

We do not have space for truly random hash functions, but:

A family H = {h : [u] → [b]} is k-independent iff for random h ∈ H:

- (∀) x ∈ [u], h(x) is uniform in [b];
- (∀) distinct x1, ..., xk ∈ [u], h(x1), ..., h(xk) are independent.

Prototypical example: degree k − 1 polynomial (sketched in code below)

- u = b prime;
- choose a0, a1, ..., ak−1 randomly in [u];
- h(x) = a0 + a1·x + ··· + ak−1·x^(k−1) mod u.

Many solutions for k-independent hashing have been proposed, but they are generally slow for k > 3 and too slow for k > 5.
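
To make the prototype concrete, here is a sketch of such a polynomial hash for the Mersenne prime p = 2^61 − 1 (presumably what the "5-indep-Mersenne-prime" rows in the benchmarks later refer to); the use of unsigned __int128 for the intermediate products is a GCC/Clang assumption.

```c
#include <stdint.h>

#define P ((1ULL << 61) - 1)  /* Mersenne prime p = 2^61 - 1 */

/* Reduce a product modulo p, exploiting 2^61 = 1 (mod p). */
static uint64_t mod_p(unsigned __int128 x) {
    uint64_t r = (uint64_t)(x & P) + (uint64_t)(x >> 61);
    if (r >= P) r -= P;
    if (r >= P) r -= P;
    return r;
}

/* h(x) = a[0] + a[1]*x + ... + a[k-1]*x^(k-1) mod p, by Horner's rule.
 * With a[0..k-1] drawn uniformly from [p], this is k-independent;
 * keys x are assumed < p. */
static uint64_t poly_hash(uint64_t x, const uint64_t a[], int k) {
    uint64_t h = a[k - 1];
    for (int i = k - 2; i >= 0; i--)
        h = mod_p((unsigned __int128)h * x + a[i]);
    return h;
}
```

Filling a[] with five random values in [p] gives a 5-independent scheme; the five multiplications and reductions per key are what make it slow compared to what follows.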

How much independence is needed?

Chaining, E[t] = O(1):                 2
Chaining, E[t^k] = O(1):               2k + 1
Chaining, t = O(lg n/lg lg n) w.h.p.:  Θ(lg n/lg lg n)
Linear probing:                        ≤ 5 [Pagh, Pagh, Ružić'07], ≥ 5 [PT ICALP'10]
Cuckoo hashing:                        O(lg n) sufficient, ≥ 6 necessary [Cohen, Kane'05]
F2 estimation:                         4 [Alon, Matias, Szegedy'99]
ε-minwise independence:                O(lg(1/ε)) [Indyk'99], Ω(lg(1/ε)) [PT ICALP'10]

Independence has been the ruling measure for quality of hash functions for 30+ years, but is it right?

Simple tabulation

- Simple tabulation goes back to Carter and Wegman'77.
- Key x is divided into c = O(1) characters x1, ..., xc, e.g., a 32-bit key as 4 × 8-bit characters.
- For i = 1, ..., c, we have a truly random table: Ri : char → hash values (bit strings).
- Hash value: h(x) = R1[x1] ⊕ ··· ⊕ Rc[xc] (sketched in code below).
- Space cN^(1/c) and time O(c). With 8-bit characters, each Ri has 256 entries and fits in L1 cache.
- Simple tabulation is the fastest 3-independent hashing scheme. Speed like 2 multiplications.
- Not 4-independent:

  h(a1a2) ⊕ h(a1b2) ⊕ h(b1a2) ⊕ h(b1b2)
    = (R1[a1] ⊕ R2[a2]) ⊕ (R1[a1] ⊕ R2[b2]) ⊕ (R1[b1] ⊕ R2[a2]) ⊕ (R1[b1] ⊕ R2[b2])
    = 0.
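
In code, for 32-bit keys with c = 4, this is a direct rendering of the definition; filling the tables from rand() is just a stand-in for truly random entries.

```c
#include <stdint.h>
#include <stdlib.h>

static uint32_t R[4][256];  /* R_i : 8-bit character -> 32-bit hash value */

/* Fill the four character tables; rand() stands in for truly random bits. */
static void init_tables(unsigned seed) {
    srand(seed);
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 256; j++)
            R[i][j] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

/* h(x) = R1[x1] XOR R2[x2] XOR R3[x3] XOR R4[x4] */
static uint32_t simple_tab(uint32_t x) {
    return R[0][x & 0xff]
         ^ R[1][(x >> 8) & 0xff]
         ^ R[2][(x >> 16) & 0xff]
         ^ R[3][x >> 24];
}
```

Four lookups and three XORs per key; the four 1 KB tables (4 KB total) sit comfortably in L1 cache, which is what makes this competitive with two multiplications.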

How much independence needed? Wrong question

Chaining, E[t] = O(1):                 2
Chaining, E[t^k] = O(1):               2k + 1
Chaining, t = O(lg n/lg lg n) w.h.p.:  Θ(lg n/lg lg n) [Celis, yesterday]
Linear probing:                        ≤ 5 [Pagh, Pagh, Ružić'07], ≥ 5 [PT ICALP'10]
Cuckoo hashing:                        O(lg n) sufficient, ≥ 6 necessary [Cohen, Kane'05]
F2 estimation:                         4 [Alon, Matias, Szegedy'99]
ε-minwise independence:                O(lg(1/ε)) [Indyk'99], Ω(lg(1/ε)) [PT ICALP'10]

New result: Despite its 4-dependence, simple tabulation suffices for all the above applications: one simple and fast hashing scheme for almost all your needs.

Knuth recommends simple tabulation but cites only 3-independence as its mathematical quality. We prove that the dependence of simple tabulation is not harmful in any of the above applications.

Chaining/hashing into bins

Theorem. Consider hashing n balls into m ≥ n^(1−1/(2c)) bins by simple tabulation. Let q be an additional query ball, and define Xq as the number of regular balls that hash into a bin chosen as a function of h(q). Let µ = E[Xq] = n/m. The following probability bounds hold for any constant γ:

Pr[Xq ≥ (1 + δ)µ] ≤ (e^δ / (1 + δ)^(1+δ))^Ω(µ) + m^(−γ)

Pr[Xq ≤ (1 − δ)µ] ≤ (e^(−δ) / (1 − δ)^(1−δ))^Ω(µ) + m^(−γ)

With m ≤ n bins, every bin gets n/m ± O(√(n/m) · log^c n) keys with probability 1 − n^(−γ).

Hashing into many bins

Lemma. If we hash n keys into n^(1+Ω(1)) bins, then all bins get O(1) keys w.h.p.

Nothing like this lemma holds if, instead of simple tabulation, we assume k-independent hashing with k = O(1).

Proof that for any positive constants ε, γ, if we hash n keys into m bins and n ≤ m^(1−ε), then all bins get fewer than d = 2^((1+γ)/ε) keys with probability ≥ 1 − m^(−γ).

Claim 1. Any set T contains a subset U of log2 |T| keys that hash independently; in particular, if |T| ≥ d then |U| ≥ (1 + γ)/ε. (The extraction is written out as code below.)

- Let i be a character position where the keys in T differ.
- Let a be the least common character in position i, and pick x ∈ T with xi = a.
- Reduce T to T′ by removing all keys y from T with yi = a.
- The hash of x is independent of the hashes of T′, as only h(x) depends on Ri[a].
- Return {x} ∪ U′, where U′ is an independent subset of T′.

Claim 2. The probability that there exist u = (1 + γ)/ε keys hashing independently to the same bin is at most m^(−γ).

- There are (n choose u) < n^u sets U of u keys to consider.
- Each such U hashes to one bin with probability 1/m^(u−1).
- The probability bound over all U is n^u · m^(−(u−1)) ≤ m^((1−ε)u + 1 − u) = m^(1−εu) = m^(−γ).
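
To make Claim 1 concrete, here is the extraction it describes written as a procedure. An illustrative sketch only: the fixed c = 4, the alphabet size, and all names are hypothetical, and the keys are assumed distinct.

```c
#define C 4        /* character positions per key */
#define SIGMA 256  /* alphabet size */

/* Writes the indices of an "independent" subset into out[] and returns
 * its size (>= log2 of the original set size). Sketch of Claim 1. */
static int independent_subset(unsigned char key[][C], int n, int out[]) {
    char alive[n];
    int left = n, u = 0;
    for (int j = 0; j < n; j++) alive[j] = 1;

    while (left > 0) {
        int first = 0;
        while (!alive[first]) first++;

        /* A position i where the remaining keys differ; none means a
         * single key remains, so take it and stop. */
        int i = -1;
        for (int p = 0; p < C && i < 0; p++)
            for (int j = first + 1; j < n; j++)
                if (alive[j] && key[j][p] != key[first][p]) { i = p; break; }
        if (i < 0) { out[u++] = first; break; }

        /* Least common character a in position i among remaining keys. */
        int cnt[SIGMA] = {0};
        for (int j = 0; j < n; j++) if (alive[j]) cnt[key[j][i]]++;
        int a = -1;
        for (int s = 0; s < SIGMA; s++)
            if (cnt[s] && (a < 0 || cnt[s] < cnt[a])) a = s;

        /* Pick x with x_i = a and remove every key using (i, a); only
         * h(x) among the survivors depends on R_i[a]. */
        int picked = 0;
        for (int j = 0; j < n; j++)
            if (alive[j] && key[j][i] == a) {
                if (!picked) { out[u++] = j; picked = 1; }
                alive[j] = 0; left--;
            }
    }
    return u;
}
```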

Basic proof pattern with m ≥ n^(1−1/(2c)) bins

- Deterministically partition the key set S into groups G that are mutually "independent", each of size ≤ n^(1−1/c) ≤ m^(1−ε).
- By the lemma, w.h.p., each G distributes with ≤ d keys in each bin.
- Let XG ≤ d be the contribution of G to a fixed bin, and X = Σ_G XG.
- If the XG were really independent, then by Chernoff:

  Pr[X ≥ (1 + δ)µ] ≤ (e^δ / (1 + δ)^(1+δ))^(µ/d)

  Pr[X ≤ (1 − δ)µ] ≤ (e^(−δ) / (1 − δ)^(1−δ))^(µ/d)

Recursive partition into "independent" groups

Define position character (i, a) in key x iff xi = a. Let (i, a) be the least common position character among the keys in S, and let G(i,a) ⊆ S be the group of keys using it.

Claim: |G(i,a)| ≤ n^(1−1/c).

- For each position i ∈ [c], we have < n^(1/c) characters that are each used by > n^(1−1/c) keys.
- So the claim being false would imply that S lies in a hypercube of size < (n^(1/c))^c = n, a contradiction since |S| = n.

Recursively, we group S \ G(i,a) and hash all position characters in S excluding (i, a). This fixes

- the hash of all keys in S \ G(i,a);
- the hash of the keys in G(i,a) except for Ri[a], which is a common "shift" moving bin h to h ⊕ Ri[a];
- in particular, it is fixed which keys from G(i,a) end in the same bin. By the Lemma, w.h.p., at most d in every bin.

Now we randomly pick Ri[a], finalizing the hashing of group G(i,a).

- The contribution XG(i,a) to our bin is a random variable.
- The distribution of XG(i,a) depends on the previous fixings.
- But always E[XG(i,a)] = |G(i,a)|/m. Moreover, XG(i,a) ≤ d.
- Good enough for Chernoff bounds.

Chernoff with m ≥ n^(1−1/(2c)) bins

W.h.p., the contribution X to a given bin obeys the Chernoff bounds

Pr[X ≥ (1 + δ)µ] ≤ (e^δ / (1 + δ)^(1+δ))^(µ/d)

Pr[X ≤ (1 − δ)µ] ≤ (e^(−δ) / (1 − δ)^(1−δ))^(µ/d)

Thus, from the perspective of chaining, simple tabulation gives the same type of tail bounds as truly random hash functions, modulo a constant-factor loss and down to polynomially small probabilities.

Similar story for linear probing.

Cuckoo hashing

Each key is placed in one of two hash locations (the insertion loop is sketched in code below).

[Figure: a cuckoo hash table; each key sits in one of its two possible locations.]

Theorem. With simple tabulation, cuckoo hashing works with probability 1 − Θ̃(n^(−1/3)).

- For chaining and linear probing we did not care about a constant loss, but obstructions to cuckoo hashing may be of just constant size, e.g., 3 keys sharing the same two hash locations.
- Very delicate proof, showing that an obstruction can be used to encode the random tables Ri with few bits.
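
For orientation, here is the standard two-table insertion loop. The hash functions are placeholder multiply-shifts; in the setting of the theorem they would be two independently seeded simple tabulation functions, and the table size, eviction limit, and "key 0 means empty" convention are demo choices.

```c
#include <stdint.h>
#include <stdbool.h>

#define TSIZE 1024   /* each of the two tables; demo size, 10-bit hashes */
#define MAXLOOP 100  /* eviction limit before giving up and rehashing    */
#define EMPTY 0      /* demo convention: key 0 marks an empty slot       */

static uint32_t T1[TSIZE], T2[TSIZE];

/* Placeholder multiply-shift hashes standing in for two independently
 * seeded simple tabulation functions. */
static uint32_t h1(uint32_t x) { return (2654435761u * x) >> 22; }
static uint32_t h2(uint32_t x) { return (2246822519u * x) >> 22; }

/* Lookup probes at most 2 locations. */
static bool lookup(uint32_t x) {
    return T1[h1(x)] == x || T2[h2(x)] == x;
}

/* Insert by alternately evicting between the two tables. Returns false
 * when an obstruction forces a rehash; the theorem bounds how often the
 * random tables admit such obstructions. */
static bool insert(uint32_t x) {
    for (int i = 0; i < MAXLOOP; i++) {
        uint32_t y = T1[h1(x)]; T1[h1(x)] = x;   /* place x, maybe evict y */
        if (y == EMPTY) return true;
        x = y;
        y = T2[h2(x)]; T2[h2(x)] = x;            /* move evictee to table 2 */
        if (y == EMPTY) return true;
        x = y;
    }
    return false; /* rehash with fresh random tables and reinsert */
}
```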

Speed

Hashing random keys; hashing time (ns):

bits  hashing scheme            32-bit computer   64-bit computer
32    univ-mult-shift (a*x)>>s        1.87              2.33
32    2-indep-mult-shift              5.78              2.88
32    5-indep-Mersenne-prime         99.70             45.06
32    5-indep-TZ-table               10.12             12.66
32    simple-table                    4.98              4.61
64    univ-mult-shift                 7.05              3.14
64    2-indep-mult-shift             22.91              5.90
64    5-indep-Mersenne-prime        241.99             68.67
64    5-indep-TZ-table               75.81             59.84
64    simple-table                   15.54             11.40

Experiments with help from Yin Zhang.
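
For reference, the two mult-shift rows are the classic multiply-shift schemes; a sketch for 32-bit keys hashed to l-bit values, where a must be a uniformly random odd 32-bit word for the universal variant and a, b uniformly random 64-bit words for the 2-independent one.

```c
#include <stdint.h>

/* Universal multiply-shift: the (a*x)>>s scheme from the table. */
static inline uint32_t univ_mult_shift(uint32_t x, uint32_t a, int l) {
    return (a * x) >> (32 - l);
}

/* 2-independent multiply-shift for 32-bit keys. */
static inline uint32_t two_indep_mult_shift(uint32_t x, uint64_t a,
                                            uint64_t b, int l) {
    return (uint32_t)((a * x + b) >> (64 - l));
}
```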

Robustness in linear probing for dense interval

[Plot: cumulative fraction of experiments vs. average time per insert+delete cycle (nanoseconds), comparing simple-table, univ-mult-shift, 2-indep-mult-shift, 5-indep-TZ-table, and 5-indep-Mersenne-prime.]

Pitch for theory in the case of linear probing

- Multiplicative hashing is used in practice, but turns out to be very unreliable under typical denial-of-service (DoS) attacks based on consecutive IP addresses: systematically good performance 95% of the time, but systematically terrible performance 5% of the time [TZ'10].
- Problems in randomized algorithms like hashing are hard for practitioners to detect: it is hard to know whether bad performance comes from being unlucky or from systematic problems.
- Linear probing had gotten a reputation for being fastest in practice, but sometimes unreliable, needing special protection against bad cases.
- Here we proved linear probing safe, with good probabilistic performance for all input, if we use simple tabulation.
- Simple tabulation is also powerful for chaining, cuckoo hashing, and min-wise hashing: one simple and fast scheme for (almost) all your needs.

Work in progress: twisted tabulation

- With chaining and linear probing, each operation takes expected constant time, but out of √n operations, some are expected to take Ω̃(log n) time.
- With truly random hash functions, we handle every window of log n operations in O(log n) time w.h.p.
- Hence, with a small buffer (as in Internet routers), we do get down to constant time per operation!
- Simple tabulation does not achieve this: it may often spend Ω̃(log² n) time on log n consecutive operations.
- Twisted tabulation (sketched in code below):

  (h, α) = R1[x1] ⊕ ··· ⊕ Rc−1[xc−1];
  h(x) = h ⊕ Rc[α ⊕ xc]

- With twisted tabulation, we handle every window of log n operations in O(log n) time w.h.p.
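
A sketch of the definition for 32-bit keys with c = 4, with tables filled with random bits as before. How the pair (h, α) is packed into the combined table entries is an implementation choice, not fixed by the definition; here α sits in the low 8 bits and the hash value in the high 32 bits.

```c
#include <stdint.h>

/* R[0..2]: char -> packed (h, alpha) pair; alpha in the low 8 bits,
 * hash value in the high 32 bits. R3: char -> 32-bit hash table for
 * the twisted last lookup. All filled with random bits. */
static uint64_t R[3][256];
static uint32_t R3[256];

static uint32_t twisted_tab(uint32_t x) {
    uint64_t ha = R[0][x & 0xff]
                ^ R[1][(x >> 8) & 0xff]
                ^ R[2][(x >> 16) & 0xff];   /* (h, alpha) combined */
    uint8_t alpha = (uint8_t)ha;            /* the twist */
    uint32_t h = (uint32_t)(ha >> 32);
    return h ^ R3[alpha ^ (x >> 24)];       /* h(x) = h XOR Rc[alpha XOR xc] */
}
```

The cost over simple tabulation is essentially the extra XOR on the last character; the lookups are the same.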

Implementing Chernoff bounds with twisted tabulation

- 0-1 variables Xi, where Xi = 1 with probability pi.
- Exponential concentration of X = Σ_i Xi around its mean.
- Application: trust a polynomial number of logarithmic estimates with high probability: the log factor in many randomized algorithms.
- With hashing into [0, 1], set Xi = 1 if h(i) < pi (see the code sketch below).
- With bounded independence, only polynomial concentration.
- With twisted tabulation: for any constant γ,

  Pr[X ≥ (1 + δ)µ] ≤ (e^δ / (1 + δ)^(1+δ))^Ω(µ) + Σ^(−γ)

  Pr[X ≤ (1 − δ)µ] ≤ (e^(−δ) / (1 − δ)^(1−δ))^Ω(µ) + Σ^(−γ)

- With simple tabulation, the additive term is (max_i pi)^γ; in the hash tables above we had p ≈ 1/n.

Open problems

- Take any application using an abstract truly random hash function, and prove that simple/twisted tabulation works.
- Could this be the first implementable hash function/RNG making classic quicksort work directly, using the hash of i to generate the index of the ith pivot?
- Hash tables are used to look up keys in a dynamic set of stored keys. Can this be done without randomization?
- Can we both insert and look up keys in constant deterministic time (not just with high probability)?
- Currently, the best answer is that we can do both in O(√(log n / log log n)) worst-case time [Andersson, Thorup STOC'00], which is tight for the more general predecessor problem.
- Most people believe that deterministic constant time is not possible without randomization, but nobody can prove it.
- So far, no technique is known that can make any such separation between deterministic and randomized solutions for any data structure problem.