Hashing Algorithms


Symbol-Table ADT

Records with keys (priorities). Basic operations:
• insert
• search
• create, test if empty, destroy, copy — generic operations common to many ADTs; not needed for one-time use, but critical in large systems

ST interface in C (ST.h):

  void STinit();
  void STinsert(Item);
  Item STsearch(Key);
  int STempty();

Balanced and randomized trees implement the ST ADT with O(lg N) comparisons.
• Is lg N required? no (and yes)
• Are comparisons necessary? no

Lecture outline:
• Hash functions
• Separate chaining
• Linear probing
• Double hashing
• Problem solved (?)

ST implementations cost summary

"Guaranteed" asymptotic costs for an ST with N items:

                      insert  search  delete  find kth largest  sort    join
  unordered array       1       N       1          N            N lg N    N
  BST                   N       N       N          N            N         N
  randomized BST *      lg N    lg N    lg N       lg N         N         lg N
  red-black BST         lg N    lg N    lg N       lg N         lg N      lg N
  hashing *             1       1       1          N            N lg N    N

  * assumes system can produce "random" numbers

Can we do better?

Hashing: basic plan

Save items in a key-indexed table (the index is a function of the key).
• Hash function: method for computing a table index from a key
• Collision resolution strategy: algorithm and data structure to handle two keys that hash to the same index

Classic time-space tradeoff:
• no space limitation: trivial hash function, with the key as the address
• no time limitation: trivial collision resolution, sequential search
• limitations on both time and space: the real world
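The basic plan can be sketched in a few lines of C: a key-indexed table in which the index is a function of the key, so insert and search are single array accesses. The names, the int keys, and the use of 0 as the empty-slot mark are illustrative assumptions, and the sketch deliberately omits collision resolution in order to show why it is needed:

```c
#define M 101                      /* prime table size (illustrative) */

/* Key-indexed table: the index is a function of the key, so
   insert and search are array accesses.  0 marks an empty slot
   (assumption: 0 is never a key).  Collisions are NOT handled:
   a second key with the same index silently overwrites the first. */
static int table[M];

static int hash_index(int key) { return key % M; }

static void put(int key) { table[hash_index(key)] = key; }

static int get(int key)  { return table[hash_index(key)] == key; }
```

For example, keys 11 and 213 both map to index 11 (213 % 101 = 11), so inserting both loses one of them — the collision-resolution strategies in the rest of the lecture exist precisely to fix this.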
Hash function

Goal: random map (each table position equally likely for each key).
Treat the key as an integer, and use a prime table size M:
• hash function: h(K) = K mod M

Ex: 4-char keys, table size 101. There are 26^4 ≈ .5 million different 4-char keys but only 101 hash values, so roughly 5,000 keys map to each value. Huge number of keys, small table: most collide!

Ex: abcd (ASCII: a = 0x61, b = 0x62, c = 0x63, d = 0x64)
  0x61626364 = 256*(256*(256*97 + 98) + 99) + 100 = 1633837924
  1633837924 % 101 = 11, so abcd hashes to 11
  0x64636261 = 1684234849; 1684234849 % 101 = 57, so dcba hashes to 57
  0x61626263 = 1633837667; 1633837667 % 101 = 57, so abbc also hashes to 57

Hash function (long keys)

Goal: random map, as before. Treat the key as a long integer, use the same hash function h(K) = K mod M, and compute the value with Horner's method. Numbers too big? It is OK to take the mod after each operation:

  256*97 + 98  = 24930;  24930 % 101 = 84
  256*84 + 99  = 21603;  21603 % 101 = 90
  256*90 + 100 = 23140;  23140 % 101 = 11   (abcd hashes to 11)

This can continue indefinitely, for any length of key. Scramble by using 117 instead of 256:

hash.c — hash function for strings in C

  int hash(char *v, int M)
  { int h, a = 117;
    for (h = 0; *v != '\0'; v++)
      h = (a*h + *v) % M;
    return h;
  }

How much work to hash a string of length N? N add, multiply, and mod ops.
Uniform hashing: use a different random multiplier for each digit.
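As a check on the arithmetic above, hash.c can be generalized so the multiplier is a parameter (a hypothetical variant, not from the slides): base 256 reproduces the worked example exactly, and the slide's a = 117 is the scrambled version of the same loop.

```c
/* hash.c with the multiplier as a parameter.  Taking the mod
   after each operation gives the same result as reducing the
   full base-a number mod M, so base 256 reproduces the slide's
   worked example. */
int hash_base(char *v, int M, int a)
{
    int h;
    for (h = 0; *v != '\0'; v++)
        h = (a*h + *v) % M;   /* OK to take the mod after each op */
    return h;
}
```

With a = 256 and M = 101 this reproduces the slide's numbers: abcd → 11, dcba → 57, abbc → 57.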
Collision resolution

Two approaches:

Separate chaining
• M much smaller than N; ~N/M keys per table position
• put keys that collide in a list
• constant-time access to the list with the key; need to search the lists

Open addressing (linear probing, double hashing)
• M much larger than N; plenty of empty table slots
• when a new key collides, find an empty slot
• complex collision patterns

Separate chaining

Hash to an array of linked lists.
• Hash: map key to a value between 0 and M-1
• Array: constant-time access to the list for a key
• Linked lists: constant-time insert; search through the list using the elementary algorithm

(Slide figure: a chaining trace distributing the keys A S E R C H I N G X M P among 10 lists.)

Trivial analysis: the average list length is N/M. Worst case: all keys hash to the same list.
Theorem (from classical probability theory): the probability that any list length is > tN/M is exponentially small in t. The guarantee depends on the hash function being a random map.

M too large: too many empty array entries. M too small: lists too long.
Typical choice M ~ N/10: constant-time search/insert.

Linear probing

Hash to a large array of items; use sequential search within clusters.
• Hash: map key to an array index between 0 and M-1
• Large array: at least twice as many slots as items
• Cluster: a contiguous block of items; search through the cluster using the elementary algorithm for arrays

(Slide figure: a probing trace inserting the keys A S E R C H I N G X M P one at a time.)

Trivial analysis: the average cluster length is N/M ≡ α. Worst case: all keys hash to the same cluster.
Theorem (deep): insert takes (1/2)(1 + 1/(1-α)^2) probes on average; search takes (1/2)(1 + 1/(1-α)).

M too large: too many empty array entries. M too small: clusters coalesce.
Typical choice M ~ 2N: constant-time search/insert. The guarantee depends on the hash function being a random map.
Disadvantage: delete is cumbersome to implement.

Double hashing

Avoid clustering by using a second hash to compute a skip value for the search.
• Hash: map key to a value between 0 and M-1
• Second hash: map key to a nonzero skip value, best if relatively prime to M; a quick hack is OK, e.g. 1 + (k mod 97)
• Avoids clustering: skip values give different search paths for keys that collide
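A minimal separate-chaining sketch along the lines described above (the names, the fixed table size, and string keys are illustrative assumptions; the lecture's own code uses the Item/Key abstractions):

```c
#include <stdlib.h>
#include <string.h>

#define M 11   /* small prime table size (illustrative) */

/* One node per key; each table slot heads a linked list of
   all the keys that hash to that slot. */
typedef struct node { char *key; struct node *next; } node;
static node *heads[M];

static int hash(const char *v)   /* Horner hash from the lecture */
{
    int h = 0;
    for (; *v != '\0'; v++) h = (117*h + *v) % M;
    return h;
}

/* Constant-time insert: link the new node at the front of its list. */
static void insert(const char *key)
{
    int i = hash(key);
    node *x = malloc(sizeof *x);
    x->key = malloc(strlen(key) + 1);
    strcpy(x->key, key);
    x->next = heads[i];
    heads[i] = x;
}

/* Search: elementary sequential search through one list —
   ~N/M nodes on average when the hash is a random map. */
static int search(const char *key)
{
    for (node *x = heads[hash(key)]; x != NULL; x = x->next)
        if (strcmp(x->key, key) == 0) return 1;
    return 0;
}
```

Front-insertion is what makes insert constant-time; only search pays the N/M-long list traversal.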
Trivial analysis: the average list length is N/M ≡ α. Worst case: all keys hash to the same list and the same skip.
Theorem (beyond classical probability theory): insert takes 1/(1-α) probes on average; search takes (1/α) ln(1/(1-α)).

M too large: too many empty array entries. M too small: search paths lengthen.
Typical choice M ~ 2N: constant-time search/insert. The guarantees depend on both hash functions being random maps.

Double hashing ST implementation

  static Item *st;   /* code assumes Items are pointers, initialized to NULL */

  void STinsert(Item x)
  { Key v = ITEMkey(x);
    int i = hash(v, M);
    int skip = hashtwo(v, M);
    while (st[i] != NULL) i = (i+skip) % M;   /* probe loop */
    st[i] = x; N++;
  }

  Item STsearch(Key v)
  { int i = hash(v, M);
    int skip = hashtwo(v, M);
    while (st[i] != NULL)                     /* probe loop */
      if (eq(v, ITEMkey(st[i]))) return st[i];
      else i = (i+skip) % M;
    return NULL;
  }

Linear probing is the same code with skip = 1.

Hashing tradeoffs

Separate chaining vs. linear probing/double hashing:
• space for links vs. empty table slots
• small table plus linked allocation vs. one big coherent array

Linear probing vs. double hashing — average probes by load factor α:

                       50%    66%    75%    90%
  linear     search    1.5    2.0    3.0    5.5
  probing    insert    2.5    5.0    8.5   55.5
  double     search    1.4    1.6    1.8    2.6
  hashing    insert    1.5    2.0    3.0    5.5

Hashing vs. red-black BSTs:
• arithmetic to compute the hash vs. comparisons
• hashing's performance guarantee is weaker (but the code is simpler)
• easier to support the other ST ADT operations with BSTs

ST implementations cost summary

"Guaranteed" asymptotic costs for an ST with N items:

                      insert  search  delete  find kth largest  sort    join
  unordered array       1       N       1          N            N lg N    N
  BST                   N       N       N          N            N         N
  randomized BST *      lg N    lg N    lg N       lg N         N         lg N
  red-black BST         lg N    lg N    lg N       lg N         lg N      lg N
  hashing *             1       1       1          N            N lg N    N

  * assumes system can produce "random" numbers
  * assumes our hash functions can produce random values for all keys

"Guaranteed"? Not really: need lg N bits to distinguish N keys.
Can we do better? Tough to be sure....
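To make the slide's double-hashing code concrete, here is a self-contained variant with string keys standing in for the Item/Key abstractions. The table size, the Horner-style hashes, and the second-hash modulus are illustrative choices (the slide's quick hack is 1 + (k mod 97); with a tiny M = 13 a smaller modulus keeps the skip in range):

```c
#include <string.h>
#include <stddef.h>

#define M 13                  /* small prime table size (illustrative) */
static const char *st[M];     /* NULL marks an empty slot, as in the slide */
static int N = 0;

static int hash(const char *v)
{
    int h = 0;
    for (; *v != '\0'; v++) h = (117*h + *v) % M;
    return h;
}

/* Second hash: a nonzero skip in 1..M-2.  Because M is prime,
   every such skip is relatively prime to M, so the probe
   sequence visits every slot. */
static int hashtwo(const char *v)
{
    int h = 0;
    for (; *v != '\0'; v++) h = (31*h + *v) % (M - 2);
    return 1 + h;             /* never 0 */
}

void STinsert(const char *key)  /* assumes the table is not full */
{
    int i = hash(key), skip = hashtwo(key);
    while (st[i] != NULL) i = (i + skip) % M;   /* probe loop */
    st[i] = key; N++;
}

const char *STsearch(const char *key)
{
    int i = hash(key), skip = hashtwo(key);
    while (st[i] != NULL)                       /* probe loop */
        if (strcmp(key, st[i]) == 0) return st[i];
        else i = (i + skip) % M;
    return NULL;              /* probe path ended at an empty slot */
}
```

As in the slide's code, search stops at the first empty slot on the key's probe path, which is why delete (punching a hole in other keys' probe paths) is cumbersome for open addressing.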
