Practical Survey on Hash Tables

Aurelian Țuțuianu

In memoriam Mihai Pătraşcu (17 July 1982 – 5 June 2012)
“I have no intention to ever teach computer science. I want to teach the love for computer science, and let the learning happen.”
Teaching Statement (http://people.csail.mit.edu/mip/docs/job-application07/statements.pdf)

Abstract
• Hash table definition
• Collision resolution schemes:
  – Chained hashing
  – Linear and quadratic probing
  – Cuckoo hashing
• Some hash function theory
• Simple tabulation hashing

Omnipresence of hash tables
• Symbol tables in compilers
• Cache implementations
• Database storage
• Memory page management in Linux
• Routing tables
• Large collections of documents

Hash Tables
Consider a set of elements $S$ drawn from a finite and much larger universe $U$. A hash table consists of:
• a hash function $h: U \to \{0, \dots, m-1\}$
• a vector $v$ of size $m$

Collisions
A collision occurs when two different keys get the same hash.
[Figure: example keys 26.17.41.60, 126.15.12.154, 202.223.224.33, 7.239.203.66, 176.136.103.233]
What to do?
• Ignore them
• Chain colliding values
• Skip and try again
• Hash and displace
• Find a perfect hash function

War Story: cache with hash tables
Problem: an application gets some data from an expensive repository.
Solution: a hash table with collision replacement.
Key point: a big chunk of users “watched” a lot of common data.
[Figure: application, hash table, data source]

Collision Resolution Schemes
• Chained hashing
• Open addressing: linear and quadratic probing
• Cuckoo hashing
And many others: perfect hashing, coalesced hashing, Robin Hood hashing, hopscotch hashing, etc.

Chained Hashing
Each slot contains a linked list.
$O(n/m) = O(1)$ for all operations.
Load factor: $n/m < 1$.
[Figure: slots 0 to 6 with chains holding y, x, w, z]
• easy to implement
• works with weak hash functions
• consumes significant memory
• a common default in library implementations

Linear and quadratic probing
All records are stored in the bucket array itself.
• Probe: an attempt to find an empty slot.
• Linear probing: $h(x, i) = h_0(x) + i$
• Quadratic probing: $h(x, i) = h_0(x) + \frac{i + i \cdot i}{2}$
[Figure: bucket array with slots 0 to 6 holding w, y, z, x; the example probe sequence is h(x, i) = 4 + i]

War Story: Linear probing trick
                  Min.  1st Qu.  Median   Mean  3rd Qu.   Max.
linear probing       1     1947    3861   3925     5867   8070
chained hashing      1     8983   18370  21150    35600  50920

War Story: Let it be quadratic!
Replace the library implementation with a “home-made” hash table.
4 hours of work.

Cuckoo hashing
Two hash tables, $T_1$ and $T_2$, each of size $m$, and two hash functions $h_1, h_2: U \to \{0, \dots, m-1\}$.
Value $x$ is stored in cell $h_1(x)$ of $T_1$ or in cell $h_2(x)$ of $T_2$.
Hash and displace.
Lookup is constant in the worst case!
Updates take expected amortized constant time.
[Figure: tables T1 and T2 holding x, y, z, w, with arrows for h1(x) and h2(x)]

What about hash functions?
Is any hash function “good”?
What does a “good” hash function mean?
Can I have my own?

The beginning of time
Hashing was introduced by Arnold Dumey in 1956 for the symbol table in a compiler.
He used a “crazy”, “chaotic”, “random” function $h: U \to \{0, \dots, m-1\}$:
$h(x) = (x \bmod p) \bmod m$, with $p$ a big prime number.
It seems to work, but why?

First station: rigorous analysis
Consider that h really is a random function!
Knuth established a way to make a complete analysis, but based on a false assumption.
No matter how long you stare at $h(x) = (x \bmod p) \bmod m$, it will not morph into a random function!

Next station: universality and k-independence
Wegman and Carter (1978)
• A family of hash functions
• No need for a perfectly random hash function, only a universal one:
  $\forall x_1, x_2 \in S,\ x_1 \neq x_2:\ \Pr[h(x_1) = h(x_2)] \le \frac{1}{N}$, where $N$ is the number of possible hash values
In generalized form, the k-independence model uses statistics to measure how much randomness a family of hash functions can produce.

How does it work?
[Figure: random data and the key x go into a formula that produces h(x)]
• Universal multiplicative shift: $h_a(x) = (a \cdot x) \gg (l - l_{out})$
• 2-independent multiplicative shift: $h_{a,b}(x) = (a \cdot x + b) \gg (2l - l_{out})$
• k-independent polynomial hashing: $h(x) = \left( \left( \sum_{i=0}^{k-1} a_i x^i \right) \bmod p \right) \bmod 2^{l_{out}}$

Facts on k-independence
Chained hashing
• 1978, Wegman and Carter: requires only universal hashing
Linear probing
• 1990, Siegel and Schmidt: O(log n)-independence is enough
• 2007, Pagh: 5-independence suffices
• 2010, Pătrașcu and Thorup: 4-independence is not enough
Cuckoo hashing
• 2001, Pagh: O(log n)-independence is enough
• 2005, Cohen and Kane: 5-independence is not enough
• 2006, Cohen and Kane: 6-independence is enough

Simple tabulation hashing
Simple tabulation is the fastest 3-independent family of hash functions known.
• A key $x$ of length $len$ (the bit width required to store values) is divided into $c$ characters $x_1, x_2, \dots, x_c$
• We create $c$ tables $R_1, R_2, \dots, R_c$, filled with independent random values
• The hash value is computed as $h(x) = R_1[x_1] \oplus R_2[x_2] \oplus \dots \oplus R_c[x_c]$, where $\oplus$ is bitwise XOR
[Figure: key x split into characters x_1 to x_4; 4 lookup tables with random 8-bit values; the lookups R_1[x_1], R_2[x_2], R_3[x_3], R_4[x_4] are combined into h(x)]

The power of simple tabulation!
“The power of simple tabulation hashing”, Mihai Pătrașcu and Mikkel Thorup, December 6, 2011
According to this paper, even though simple tabulation is only 3-independent, we have:
• Constant time for linear probing
• Constant time for static cuckoo hashing
=> There are other probabilistic properties that can be exploited, beyond the ones captured by k-independence theory.

Summary
• Easy ways to implement optimal hash tables
• Simple scheme to generate a hash function family
• Theory produces practical results and is still alive!
There are a lot of occasions to apply these ideas, so: work hard, have fun and make history!

Questions?
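The simple tabulation scheme and the linear probing sequence $h(x, i) = h_0(x) + i$ described above can be sketched together in a few lines of Python. This is a minimal illustration, not the implementation from the talk or the paper: the class names `SimpleTabulation` and `LinearProbingSet`, the 32-bit key width, the split into four 8-bit characters, the fixed capacity and the `seed` parameter are all assumptions made for the example.

```python
import random


class SimpleTabulation:
    """Simple tabulation hash for 32-bit keys split into c = 4 characters.

    Hypothetical helper for illustration; key width and table sizes are
    assumptions, not values prescribed by the slides.
    """

    def __init__(self, out_bits=32, seed=None):
        rng = random.Random(seed)
        # Tables R_1..R_4, each holding 256 independent random values.
        self.tables = [
            [rng.getrandbits(out_bits) for _ in range(256)] for _ in range(4)
        ]

    def __call__(self, x):
        if not 0 <= x < 2**32:
            raise ValueError("key must fit in 32 bits")
        h = 0
        for i, table in enumerate(self.tables):
            char = (x >> (8 * i)) & 0xFF  # i-th 8-bit character of the key
            h ^= table[char]              # combine the lookups with XOR
        return h


class LinearProbingSet:
    """Fixed-capacity set of 32-bit integers using linear probing."""

    def __init__(self, capacity=64, seed=None):
        self.slots = [None] * capacity
        self.size = 0
        self.hash = SimpleTabulation(seed=seed)

    def _probe(self, key):
        # Visit slots (h_0(x) + i) mod m for i = 0, 1, ..., m - 1.
        m = len(self.slots)
        start = self.hash(key) % m
        for i in range(m):
            yield (start + i) % m

    def add(self, key):
        """Insert key; return False if it was already present."""
        for idx in self._probe(key):
            if self.slots[idx] is None:
                self.slots[idx] = key
                self.size += 1
                return True
            if self.slots[idx] == key:
                return False
        raise RuntimeError("table is full")

    def __contains__(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is None:
                return False  # an empty slot ends the probe sequence
            if self.slots[idx] == key:
                return True
        return False
```

A set built this way is used like any other container: `s = LinearProbingSet(capacity=16)`, then `s.add(42)` and `42 in s`. The sketch omits deletion and resizing, which a practical table would need.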

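The two multiplicative shift formulas from the “How it works?” slide can be tried out with the short sketch below. It assumes that the product wraps around at the word size, as it would in fixed-width machine arithmetic, so the code masks to $l$ (or $2l$) bits explicitly. The key width `L = 32` and the function names are assumptions made for the example.

```python
import random

L = 32  # assumed key width l, in bits


def make_universal_multiply_shift(l_out, rng=random):
    """Return h_a(x) = ((a * x) mod 2^l) >> (l - l_out) for a random odd a."""
    a = rng.getrandbits(L) | 1  # random l-bit multiplier, forced to be odd
    mask = (1 << L) - 1

    def h(x):
        return ((a * x) & mask) >> (L - l_out)

    return h


def make_2independent_multiply_shift(l_out, rng=random):
    """Return h_{a,b}(x) = ((a * x + b) mod 2^(2l)) >> (2l - l_out)."""
    a = rng.getrandbits(2 * L)  # random 2l-bit multiplier
    b = rng.getrandbits(2 * L)  # random 2l-bit offset
    mask = (1 << (2 * L)) - 1

    def h(x):
        return ((a * x + b) & mask) >> (2 * L - l_out)

    return h
```

Each call to a `make_...` function draws one member of the family; for example, `h = make_universal_multiply_shift(10)` maps 32-bit keys to values in the range 0 to 1023.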