Practical Survey on Hash Tables

Aurelian Țuțuianu

In memoriam

Mihai Pătraşcu (17 July 1982 – 5 June 2012)

“I have no intention to ever teach computer science. I want to teach the love for computer science, and let the learning happen.”

Teaching Statement (http://people.csail.mit.edu/mip/docs/job-application07/statements.pdf)

Abstract

• Hash table definition
• Collision resolving schemas:
  – Chained hashing
  – Linear and quadratic probing
  – Cuckoo hashing
• Some theory
• Simple tabulation hashing

Omnipresence of hash tables

• Symbol tables in compilers
• Cache implementations
• Database storages
• Managing memory pages in Linux
• Route tables
• Indexing large numbers of documents

Hash Tables

Consider a set of elements S drawn from a finite and much larger universe U.

A hash table consists of:

• a hash function h: U → {0, ..., m − 1}
• a vector v of size m

Collisions

Two different keys get the same hash. (Figure: IP-address keys such as 26.17.41.60, 126.15.12.154, 202.223.224.33, 7.239.203.66 and 176.136.103.233 mapping into the table.)

What to do?
• Ignore them
• Chain colliding values
• Skip and try again
• Hash and displace
• Find a perfect hash function
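To make collisions concrete, here is a toy sketch (not from the talk) using the IP-address keys above; the byte-sum hash and the table size m = 5 are illustrative choices, deliberately weak so a collision shows up:

```python
# Toy sketch: a weak hash h(x) = (sum of bytes) mod m applied to
# the slide's example IP-address keys. m is kept deliberately small.
m = 5

def h(key: str) -> int:
    """A deliberately weak hash function: byte sum reduced mod m."""
    return sum(key.encode()) % m

keys = ["26.17.41.60", "126.15.12.154", "202.223.224.33",
        "7.239.203.66", "176.136.103.233"]

slots = {}
for key in keys:
    slots.setdefault(h(key), []).append(key)

for slot, bucket in sorted(slots.items()):
    print(slot, bucket)
# 126.15.12.154 and 202.223.224.33 land in the same slot: a collision.
```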

War Story: cache with hash tables

Problem: An application gets some data from an expensive repository.

Solution: A hash table used as a cache, with collision replacement (a new entry simply overwrites a colliding one).

Key point: a big chunk of users “watched” a lot of common data.

Collision Resolution Schemas

• Chained hashing
• Open addressing: linear and quadratic probing
• Cuckoo hashing

And many, many others: perfect hashing, coalesced hashing, Robin Hood hashing, hopscotch hashing, etc.

Chained Hashing

Each slot contains a linked list.

O(n/m) = O(1) expected time for all operations.

Load factor: n/m < 1.
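A minimal sketch of the scheme (not the talk's code): each slot holds a Python list standing in for the linked list, and Python's built-in hash plays the role of h.

```python
# Minimal chained hash table: one chain (here a Python list) per slot.
class ChainedHashTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [[] for _ in range(m)]  # one chain per slot

    def _h(self, key):
        return hash(key) % self.m            # h: U -> {0, ..., m-1}

    def put(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                     # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))           # otherwise append to the chain

    def get(self, key):
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

t = ChainedHashTable()
t.put("x", 1); t.put("y", 2); t.put("x", 3)
print(t.get("x"), t.get("y"))  # -> 3 2
```

Even with n > m the table keeps working; the chains just grow, which is why operations cost O(n/m) on average.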

• easy to implement
• works with weak hash functions
• consumes significant memory
• the default implementation in many libraries

Linear and quadratic probing

All records are stored in the bucket array itself.

• Probe – an attempt to find an empty slot. (Figure: a key with h0(x) = 4 probes cells 4, 5, 6, ...)

• Linear probing:

h(x, i) = (h0(x) + i) mod m
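A sketch of insertion and lookup with linear probing (assuming integer keys and n < m, so a free slot always exists; a real implementation would resize):

```python
# Linear probing sketch: probe cells h0(x), h0(x)+1, ... mod m.
m = 11
table = [None] * m

def h0(x: int) -> int:
    return x % m

def insert(x: int) -> int:
    """Probe until an empty (or matching) slot is found."""
    for i in range(m):
        slot = (h0(x) + i) % m
        if table[slot] is None or table[slot] == x:
            table[slot] = x
            return slot
    raise RuntimeError("table full")

def lookup(x: int) -> bool:
    for i in range(m):
        slot = (h0(x) + i) % m
        if table[slot] is None:
            return False          # hit an empty slot: x is absent
        if table[slot] == x:
            return True
    return False

insert(4); insert(15); insert(26)    # all three hash to slot 4
print(table[4], table[5], table[6])  # -> 4 15 26
```

Note how keys with the same h0 pile up in a run of consecutive cells; this clustering is exactly what makes the quality of the hash function matter for linear probing.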

• Quadratic probing:

h(x, i) = (h0(x) + (i + i*i)/2) mod m

War Story: Linear probing trick

                  Min.  1st Qu.  Median    Mean  3rd Qu.    Max.
linear probing       1     1947    3861    3925     5867    8070
chained hashing      1     8983   18370   21150    35600   50920

War Story: Let it be quadratic!

Replace the library implementation with a “home-made” hash table.

4 hours of work.

Cuckoo hashing

Two hash tables, T1 and T2, of size m, and two hash functions h1, h2 : U → {0, ..., m − 1}.

Value x is stored in cell h1(x) of T1 or in cell h2(x) of T2.

Hash and displace: inserting x may kick the current occupant of its cell over to that key’s alternative cell.

Lookup is constant in the worst case!

Updates run in constant amortized time.
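A minimal cuckoo hashing sketch (not the talk's code), assuming integer keys and two toy hash functions h1, h2; a real implementation would pick new hash functions and rehash when a cycle is detected:

```python
# Cuckoo hashing sketch: two tables, two hash functions,
# insertion by displacement ("hash and displace").
m = 16
T1, T2 = [None] * m, [None] * m

def h1(x: int) -> int:
    return x % m

def h2(x: int) -> int:
    return (x // m) % m

def lookup(x: int) -> bool:
    # Worst-case constant time: only two cells can possibly hold x.
    return T1[h1(x)] == x or T2[h2(x)] == x

def insert(x: int, max_kicks: int = 32) -> None:
    for _ in range(max_kicks):
        slot = h1(x)
        x, T1[slot] = T1[slot], x   # place x in T1, evicting any occupant
        if x is None:
            return
        slot = h2(x)
        x, T2[slot] = T2[slot], x   # move the evicted key to T2
        if x is None:
            return
    raise RuntimeError("cycle detected: rehash with new h1, h2")

for key in [3, 19, 35, 7, 100]:
    insert(key)
print(all(lookup(k) for k in [3, 19, 35, 7, 100]))  # -> True
```

Here 3, 19 and 35 all share h1 = 3, so each insertion displaces the previous occupant into T2; lookups still touch at most two cells.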

What about hash functions?

Is any hash function “good”? What does a “good” hash function mean? Can I write my own?

The beginning of time

Introduced by Arnold Dumey in 1956 for the symbol table in a compiler. He used a “crazy”, “chaotic”, “random”-looking function h: U → {0, ..., m − 1}.

h(x)=(x mod p) mod m, with p a big prime number.
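In code, the division scheme is a one-liner; the particular prime and table size below are illustrative:

```python
# Dumey-style division hashing: h(x) = (x mod p) mod m,
# with p a big prime. The constants here are just for illustration.
p = 2**31 - 1   # a big (Mersenne) prime
m = 1024        # table size

def h(x: int) -> int:
    return (x % p) % m

# Keys that are congruent modulo p necessarily share a hash value.
print(h(42), h(42 + p * m))  # -> 42 42
```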

It seems to work, but why?

First station: rigorous analysis

Consider that h really is a random function!

Knuth showed how to carry out a complete analysis, but it rests on this false assumption.

No matter how long you stare at h(x) = (x mod p) mod m, it will not morph into a random function!

Next station: universality and k-independence

Wegman and Carter (1978):
• A family of hash functions.
• No need for a perfectly random hash function; a universal family suffices:

∀ x1, x2 ∈ S with x1 ≠ x2: Pr[h(x1) = h(x2)] ≤ 1/m

In its generalized form, the k-independence model uses statistics to measure “how much randomness” a family of hash functions can produce.

How does it work?

(Diagram: x → formula → h(x); the formula should turn keys into random-looking data.)

• Universal multiplicative shift:

h_a(x) = (a * x) >> (l − l_out)

• 2-independent multiplicative shift:

h_{a,b}(x) = (a * x + b) >> (2l − l_out)

• k-independent polynomial hashing:

h(x) = ((a_0 + a_1·x + ... + a_{k−1}·x^{k−1}) mod p) mod 2^{l_out}

Facts on k-independence

Chained hashing
• 1978 – Wegman, Carter: requires only 2-independence

Linear probing
• 1990 – Siegel, Schmidt: O(log n)-independence is enough
• 2007 – Pagh: 5-independence suffices
• 2010 – Pătrașcu, Thorup: 4-independence is not enough

Cuckoo hashing
• 2001 – Pagh: O(log n)-independence is enough
• 2005 – Cohen, Kane: 5-independence is not enough
• 2006 – Cohen, Kane: 6-independence is enough

Simple tabulation hashing

Simple tabulation is the fastest 3-independent family of hash functions known.

• A key x of length len (the bit width required to store values) is divided into c chars x1, x2, ..., xc
• We create c tables R1, R2, ..., Rc, filled with independent random values
• The hash value is computed with the function (⊕ is bitwise XOR):

h(x) = R1[x1] ⊕ R2[x2] ⊕ ... ⊕ Rc[xc]

(Figure: x split into four chars; h(x) = R1[x1] ⊕ R2[x2] ⊕ R3[x3] ⊕ R4[x4].)
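A minimal sketch of the scheme (assuming 32-bit keys split into c = 4 eight-bit chars; the seed and table contents are illustrative):

```python
# Simple tabulation sketch: split a 32-bit key into four 8-bit chars,
# look each char up in its own table of random values, XOR the results.
import random

random.seed(7)  # fixed seed so the sketch is reproducible
c = 4
R = [[random.getrandbits(32) for _ in range(256)] for _ in range(c)]

def h(x: int) -> int:
    out = 0
    for j in range(c):
        char = (x >> (8 * j)) & 0xFF  # j-th 8-bit character of the key
        out ^= R[j][char]             # XOR the random table entries
    return out

# The same key always hashes the same way; two keys differing only in
# one char differ by exactly the XOR of two table entries.
print(h(0xDEADBEEF) == h(0xDEADBEEF))
```

The per-key work is just c table lookups and XORs, which is why this family is so fast in practice.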

Four lookup tables, indexed by 8-bit chars and filled with random values, are XORed to produce h(x).

The power of simple tabulation!

“The power of simple tabulation hashing” – Mihai Pătrașcu, Mikkel Thorup – December 6, 2011

According to this paper, even though simple tabulation is only 3-independent, we get:
• Constant time for linear probing
• Constant time for static cuckoo hashing

⇒ There are other probabilistic properties that can be exploited, beyond the ones captured by k-independence theory.

Summary

• Easy ways to implement optimal hash tables • Simple scheme to generate a hash function family • Theory produces practical results and is still alive!

There are a lot of occasions to apply these ideas, so: Work hard, have fun and make history!

Questions?