Practical Survey on Hash Tables

Aurelian Țuțuianu

In memoriam Mihai Pătrașcu (17 July 1982 – 5 June 2012)
“I have no intention to ever teach computer science. I want to teach the love for computer science, and let the learning happen.”
Teaching Statement (http://people.csail.mit.edu/mip/docs/job-application07/statements.pdf)

Abstract
• Hash table definition
• Collision resolution schemes:
  – Chained hashing
  – Linear and quadratic probing
  – Cuckoo hashing
• Some hash function theory
• Simple tabulation hashing

Omnipresence of hash tables
• Symbol tables in compilers
• Cache implementations
• Database storage
• Memory page management in Linux
• Routing tables
• Large numbers of documents

Hash Tables
Consider a set of elements $S$ from a finite and much larger universe $U$. A hash table consists of:
• a hash function $h: U \to \{0, \dots, m-1\}$
• a vector $v$ of size $m$

Collisions
A collision is the same hash for two different keys.
[Figure: example keys 26.17.41.60, 126.15.12.154, 202.223.224.33, 7.239.203.66, 176.136.103.233 hashed into the table]
What to do?
• Ignore them
• Chain colliding values
• Skip and try again
• Hash and displace
• Find a perfect hash function

War Story: cache with hash tables
Problem: an application gets some data from an expensive repository.
Solution: a hash table with collision replacement.
Key point: a big chunk of users “watched” a lot of common data.
[Figure: application, hash table, data source]

Collision Resolution Schemes
• Chained hashing
• Open hash: linear and quadratic probing
• Cuckoo hashing
And many others: perfect hashing, coalesced hashing, Robin Hood hashing, hopscotch hashing, etc.

Chained Hashing
Each slot contains a linked list.
$O(n/m) = O(1)$ for all operations.
Load factor: $n/m < 1$.
[Figure: a table with slots 0 to 6; the chains hold the keys x, y, z, w]
• easy to implement
• works with weak hash functions
• consumes significant memory
• default implementation

Linear and quadratic probing
All records are stored in the bucket array itself.
• Probe: an attempt to find an empty place.
• Linear probing: $h(x, i) = h_0(x) + i$
• Quadratic probing: $h(x, i) = h_0(x) + \frac{i + i \cdot i}{2}$
[Figure: a table with slots 0 to 6 holding w, y, z, x; example probe sequence $h(x, i) = 4 + i$]

War Story: Linear probing trick
                   Min.   1st Qu.   Median    Mean   3rd Qu.    Max.
linear probing        1      1947     3861    3925      5867    8070
chained hashing       1      8983    18370   21150     35600   50920

War Story: Let it be quadratic!
• Replace the library implementation with a “home-made” hash table
• 4 hours of work

Cuckoo hashing
Two hash tables, $T_1$ and $T_2$, of size $m$, and two hash functions $h_1, h_2: U \to \{0, \dots, m-1\}$.
Value $x$ is stored in cell $h_1(x)$ of $T_1$ or in cell $h_2(x)$ of $T_2$.
Hash and displace.
Lookup is constant in the worst case!
Updates in constant amortized time.
[Figure: tables T1 and T2 holding x, y, z, w, with the candidate cells h1(x) and h2(x) marked]

What about hash functions?
Is any hash function “good”?
What does a “good” hash function mean?
Can I have my own?

The beginning of time
Introduced by Arnold Dumey in 1956 for the symbol table in a compiler.
He used a “crazy”, “chaotic”, “random” function $h: U \to \{0, \dots, m-1\}$:
$h(x) = (x \bmod p) \bmod m$, with $p$ a big prime number.
It seems to work, but why?

First station: rigorous analysis
Consider that $h$ really is a random function!
Knuth established a way to make a complete analysis, but based on a false assumption.
No matter how long you stare at $h(x) = (x \bmod p) \bmod m$, it will not morph into a random function!

Next station: universality and k-independence
Wegman and Carter (1978)
• A family of hash functions
• No need for a perfectly random hash function, only a universal one:
  $\forall x_1, x_2 \in S,\ x_1 \neq x_2:\ \Pr[h(x_1) = h(x_2)] \le \frac{1}{N}$
In generalized form, the k-independence model uses statistics to measure how much randomness a family of hash functions can produce.

How does it work?
[Figure: random data and the key x are fed to a formula that produces h(x)]
• Universal multiplicative shift: $h_a(x) = (a \cdot x) \gg (l - l_{out})$
• 2-independent multiplicative shift: $h_{a,b}(x) = (a \cdot x + b) \gg (2l - l_{out})$
• k-independent polynomial hashing: $h(x) = \left( \sum_{i=0}^{k-1} a_i x^i \bmod p \right) \bmod 2^{l_{out}}$

Facts on k-independence
Chained hashing
• 1978 – Wegman, Carter: requires only universal hashing
Linear probing
• 1990 – Siegel, Schmidt: $O(\log n)$-independence is enough
• 2007 – Pagh: 5-independence suffices
• 2010 – Pătrașcu, Thorup: 4-independence is not enough
Cuckoo hashing
• 2001 – Pagh: $O(\log n)$-independence is enough
• 2005 – Cohen, Kane: 5-independence is not enough
• 2006 – Cohen, Kane: 6-independence is enough

Simple tabulation hashing
Simple tabulation is the fastest 3-independent family of hash functions known.
• Key $x$ of length $len$ (the bit width required to store values) is divided into $c$ chars $x_1, x_2, \dots, x_c$
• We create $c$ tables $R_1, R_2, \dots, R_c$, filled with independent random values
• The hash value is computed as $h(x) = R_1[x_1] \oplus R_2[x_2] \oplus \cdots \oplus R_c[x_c]$, where $\oplus$ is bitwise XOR
[Figure: x is split into 4 chars; $R_1[x_1] \oplus R_2[x_2] \oplus R_3[x_3] \oplus R_4[x_4]$ gives h(x); 4 lookup tables with random 8-bit values]

The power of simple tabulation!
“The power of simple tabulation hashing”, Mihai Pătrașcu and Mikkel Thorup, December 6, 2011
According to this paper, even though simple tabulation is only 3-independent, we have:
• Constant time for linear probing
• Constant time for static cuckoo hashing
=> There are also other probabilistic properties which can be exploited, beyond the ones captured in k-independence theory.

Summary
• Easy ways to implement optimal hash tables
• Simple scheme to generate a hash function family
• Theory produces practical results and is still alive!
There are a lot of occasions to apply these ideas, so:
Work hard, have fun and make history!

Questions?
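
To make the two main ideas of the talk concrete, here is a minimal Python sketch that combines simple tabulation hashing with a linear probing table. It is only an illustration under stated assumptions, not the author's implementation: keys are assumed to be 32-bit unsigned integers split into c = 4 chars of 8 bits, the class names `SimpleTabulationHash` and `LinearProbingSet` are hypothetical, the probe index is wrapped modulo m, and the table never resizes or deletes.

```python
import random


class SimpleTabulationHash:
    """Simple tabulation for 32-bit keys split into c = 4 chars of 8 bits."""

    def __init__(self, out_bits=16, seed=None):
        rng = random.Random(seed)
        self.out_bits = out_bits
        # c = 4 tables R_1..R_4, each holding 256 independent random values.
        self.tables = [
            [rng.getrandbits(out_bits) for _ in range(256)] for _ in range(4)
        ]

    def __call__(self, x):
        h = 0
        for i, table in enumerate(self.tables):
            char = (x >> (8 * i)) & 0xFF  # i-th 8-bit char of the key
            h ^= table[char]              # combine the lookups with XOR
        return h


class LinearProbingSet:
    """Open-addressing set using the probe h(x, i) = (h0(x) + i) mod m."""

    def __init__(self, capacity_bits=4, seed=None):
        self.m = 1 << capacity_bits
        self.slots = [None] * self.m
        self.size = 0
        # h0 outputs capacity_bits bits, so it already lies in {0, ..., m-1}.
        self.h0 = SimpleTabulationHash(out_bits=capacity_bits, seed=seed)

    def _probe(self, x):
        # Yield h0(x), h0(x) + 1, ... wrapped around the end of the array.
        start = self.h0(x)
        for i in range(self.m):
            yield (start + i) % self.m

    def add(self, x):
        for pos in self._probe(x):
            if self.slots[pos] is None:
                self.slots[pos] = x
                self.size += 1
                return True
            if self.slots[pos] == x:
                return False  # already present
        raise RuntimeError("table is full")  # this sketch does not resize

    def __contains__(self, x):
        for pos in self._probe(x):
            if self.slots[pos] is None:
                return False  # an empty slot ends the probe sequence
            if self.slots[pos] == x:
                return True
        return False


if __name__ == "__main__":
    s = LinearProbingSet(capacity_bits=4, seed=1)
    for key in (10, 26, 42, 0xDEADBEEF):
        s.add(key)
    print(42 in s, 7 in s, s.size)
```

Drawing the table contents from a seeded generator keeps the sketch reproducible; a real deployment would fill the tables from a stronger source of randomness and grow the array before the load factor approaches 1.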
