
Symbol-Table ADT

Records with keys (priorities).

Basic operations
• insert
• search

Generic operations, common to many ADTs (not needed for one-time use, but critical in large systems)
• create
• test if empty
• destroy
• copy

Problem solved (?)
• balanced, randomized trees use O(lg N) comparisons
• Is lg N required?  no (and yes)
• Are comparisons necessary?  no

Hashing algorithms (this lecture)
• hash functions
• separate chaining

ST interface in C (ST.h)

  void STinit();
  void STinsert(Item);
  Item STsearch(Key);
  int STempty();
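A minimal client sketch of the ST interface above (my own, not from the slides). It assumes Item.h defines Item as a pointer to a record with a string key and provides a hypothetical constructor NEWitem(key, value); only the ST calls themselves come from the interface above.

  #include <stdio.h>
  #include "Item.h"     /* assumed: Item/Key typedefs and NEWitem helper */
  #include "ST.h"       /* the four declarations above                   */

  int main(void)
    { STinit();                             /* create an empty table     */
      if (STempty()) printf("table starts empty\n");
      STinsert(NEWitem("abcd", 1));         /* insert two records        */
      STinsert(NEWitem("dcba", 2));
      if (STsearch("abcd") != NULL)         /* search by key             */
        printf("found abcd\n");
      return 0;
    }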


ST implementations cost summary

“Guaranteed” asymptotic costs for an ST with N items

                      insert   search   delete   find kth largest   sort     join
  unordered array       1        N        1             N           N lg N    N
  BST                   N        N        N             N           N         N
  randomized BST *    lg N     lg N     lg N          lg N          N        lg N
  red-black BST       lg N     lg N     lg N          lg N          N        lg N
  hashing *             1        1        1             N           N lg N    N

  * assumes system can produce “random” numbers

Can we do better?


Hashing: basic plan

Save items in a key-indexed table (index is a function of the key).

Hash function
• method for computing table index from key

Collision resolution strategy
• algorithm and structure to handle two keys that hash to the same index

Classic time-space tradeoff
• no space limitation: trivial hash function with key as address
• no time limitation: trivial collision resolution: sequential search
• limitations on both time and space (the real world): hashing

Hash function

Goal: random map (each table position equally likely for each key).

Treat key as integer, use prime table size M
• hash function: h(K) = K mod M

Ex: 4-char keys, table size M = 101
• 26^4 ≈ .5 million different 4-char keys, 101 table values: ~5,000 keys per value
• abcd is binary 01100001 01100010 01100011 01100100, hex 0x61 0x62 0x63 0x64
• abcd hashes to 11:  0x61626364 = 1633837924,  1633837924 % 101 = 11
• dcba hashes to 57:  0x64636261 = 1684234849,  1684234849 % 101 = 57
• abbc also hashes to 57:  0x61626263 = 1633837667,  1633837667 % 101 = 57

Huge number of keys, small table: most collide!
• 25 items, 11 table positions: ~2 items per table position
• 5 items, 11 table positions: ~.5 items per table position


Hash function (long keys)

Goal: random map (each table position equally likely for each key).

Treat key as long integer, use prime table size M
• use same hash function: h(K) = K mod M
• compute value with Horner’s method

Ex: abcd hashes to 11
  0x61626364 = 256*(256*(256*97+98)+99)+100 = 1633837924
  1633837924 % 101 = 11

Numbers too big? OK to take mod after each op:
  256*97 + 98  = 24930,  24930 % 101 = 84
  256*84 + 99  = 21603,  21603 % 101 = 90
  256*90 + 100 = 23140,  23140 % 101 = 11

Scramble by using 117 instead of 256. Hash function for strings in C (hash.c):

  int hash(char *v, int M)
    { int h, a = 117;
      for (h = 0; *v != '\0'; v++)
        h = (a*h + *v) % M;
      return h;
    }

... can continue indefinitely, for any length key.

How much work to hash a string of length N?  N add, multiply, and mod ops.

Uniform hashing: use a different random multiplier for each digit.
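As a quick sanity check of the arithmetic above, a small driver (my own sketch, not from the slides): it uses the same Horner loop as hash.c but with multiplier 256 rather than 117, so its output matches the hand-computed values of K mod 101.

  #include <stdio.h>

  /* Horner hash with multiplier 256, taking the mod after each op,
     so it computes exactly K mod M for the 4-char keys above. */
  int hash256(char *v, int M)
    { int h;
      for (h = 0; *v != '\0'; v++)
        h = (256*h + *v) % M;
      return h;
    }

  int main(void)
    { printf("%d\n", hash256("abcd", 101));    /* prints 11 */
      printf("%d\n", hash256("dcba", 101));    /* prints 57 */
      printf("%d\n", hash256("abbc", 101));    /* prints 57 */
      return 0;
    }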

Collision Resolution

Two approaches:

Separate chaining
• M much smaller than N
• ~N/M keys per table position
• put keys that collide in a list
• need to search lists

Open addressing (linear probing, double hashing)
• M much larger than N
• plenty of empty table slots
• when a new key collides, find an empty slot
• complex collision patterns


Separate chaining

Hash to an array of linked lists.

Hash
• map key to value between 0 and M-1

Array
• constant-time access to list with key

Linked lists
• constant-time insert
• search through list using elementary algorithm

[Figure: table of M = 11 lists after inserting the sample keys; e.g. list 1 holds L A A A, list 2 holds M X, list 3 holds N C, list 5 holds E P E E, list 7 holds G R, list 8 holds H S, list 9 holds I.]

Trivial: average list length is N/M
Worst: all keys hash to same list

Theorem (from classical probability theory): the probability that any list length is > tN/M is exponentially small in t.

• M too large: too many empty array entries
• M too small: lists too long
• Typical choice M ~ N/10: constant-time search/insert

Guarantee depends on hash function being random map.
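A minimal separate-chaining sketch (my own, for illustration): integer keys, hash k % M, insert at the head of the list the key hashes to, and search scans only that one list. The slides' own list-based implementation is not shown here.

  #include <stdio.h>
  #include <stdlib.h>

  #define M 11

  typedef struct node { int key; struct node *next; } node;
  static node *heads[M];                       /* M list heads, all NULL   */

  void insert(int key)
    { int i = key % M;                         /* hash to one of M lists   */
      node *t = malloc(sizeof *t);
      t->key = key; t->next = heads[i];        /* constant-time insert     */
      heads[i] = t;
    }

  node *search(int key)
    { node *t;
      for (t = heads[key % M]; t != NULL; t = t->next)   /* scan one list  */
        if (t->key == key) return t;
      return NULL;
    }

  int main(void)
    { insert(3); insert(14); insert(25);       /* all three hash to list 3 */
      printf("%s\n", search(14) ? "found" : "not found");
      return 0;
    }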

Linear probing

Hash to a large array of items, use sequential search within clusters.

Hash
• map key to array index between 0 and M-1

Large array
• at least twice as many slots as items

Cluster
• contiguous block of items
• search through cluster using elementary algorithm for arrays

[Figure: trace of inserting the sample keys S A C E R H I N G X M P into the array, showing clusters of occupied slots growing and coalescing.]

Trivial: average list length is N/M ≡ α
Worst: all keys hash to same list

Theorem (beyond classical probability theory): expected probes are about

  insert:  (1/2) (1 + 1/(1−α)²)
  search:  (1/2) (1 + 1/(1−α))

• M too large: too many empty array entries
• M too small: clusters coalesce
• Typical choice M ~ 2N: constant-time search/insert

Guarantees depend on hash function being random map.


Double hashing

Avoid clustering by using a second hash to compute the skip for each search path.

Hash
• map key to value between 0 and M-1

Second hash
• map key to nonzero skip value (best if relatively prime to M)
• quick hack OK, e.g. 1 + (k mod 97)

Avoids clustering
• skip values give different search paths for keys that collide

Trivial: average list length is N/M ≡ α
Worst: all keys hash to same list and same skip

Theorem (deep): expected probes are about

  insert:  1/(1−α)
  search:  (1/α) ln(1/(1−α))

• Typical choice M ~ 2N: constant-time search/insert
• Disadvantage: delete cumbersome to implement

Guarantees depend on hash functions being random map.
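The STinsert/STsearch code on the next slide calls a second hash function hashtwo, which the slides do not show. A plausible sketch, following the "quick hack" 1 + (k mod 97) suggestion above; the multiplier 31 and the exact form are my assumptions.

  /* Hypothetical second hash for double hashing: rehash the key with a
     different multiplier, then force a nonzero skip in 1..97.  For a prime
     table size M larger than 97, every such skip is relatively prime to M. */
  int hashtwo(char *v, int M)
    { int h, a = 31;
      for (h = 0; *v != '\0'; v++)
        h = (a*h + *v) % M;
      return 1 + (h % 97);
    }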

Double hashing ST implementation

  static Item *st;        /* code assumes Items are pointers, initialized to NULL */

  void STinsert(Item x)
    { Key v = ITEMkey(x);
      int i = hash(v, M);
      int skip = hashtwo(v, M);          /* linear probing: take skip = 1 */
      while (st[i] != NULL) i = (i+skip) % M;       /* probe loop */
      st[i] = x; N++;
    }

  Item STsearch(Key v)
    { int i = hash(v, M);
      int skip = hashtwo(v, M);
      while (st[i] != NULL)                         /* probe loop */
        if (eq(v, ITEMkey(st[i]))) return st[i];
        else i = (i+skip) % M;
      return NULL;
    }
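The code above uses a table size M and item count N, and assumes st[] has been allocated and cleared. A minimal sketch of the missing initialization (my own; the fixed prime 2003 and the STempty body are assumptions, not the slides' code):

  #include <stdlib.h>

  /* these declarations would sit alongside "static Item *st" above */
  static int M = 2003;        /* prime table size, ~2N for up to ~1000 items */
  static int N = 0;           /* number of items currently in the table      */

  void STinit()
    { int i;
      st = malloc(M * sizeof(Item));
      for (i = 0; i < M; i++) st[i] = NULL;    /* empty slots are NULL */
      N = 0;
    }

  int STempty()
    { return N == 0; }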

Hashing tradeoffs

Separate chaining vs. linear probing/double hashing
• space for links vs. empty table slots
• small table + linked allocation vs. big coherent array

Linear probing vs. double hashing (cost by load factor α)

                          load factor (α)
                        50%    66%    75%    90%
  linear     search     1.5    2.0    3.0    5.5
  probing    insert     2.5    5.0    8.5   55.5
  double     search     1.4    1.6    1.8    2.6
  hashing    insert     1.5    2.0    3.0    5.5

Hashing vs. red-black BSTs
• arithmetic to compute hash vs. comparison
• hashing performance guarantee is weaker (but with simpler code)
• easier to support other ST ADT operations with BSTs

ST implementations cost summary

“Guaranteed” asymptotic costs for an ST with N items

                      insert   search   delete   find kth largest   sort     join
  unordered array       1        N        1             N           N lg N    N
  BST                   N        N        N             N           N         N
  randomized BST *    lg N     lg N     lg N          lg N          N        lg N
  red-black BST       lg N     lg N     lg N          lg N          N        lg N
  hashing **            1†       1†       1†            N           N lg N    N

  †  not really: need lg N bits to distinguish N keys
  *  assumes system can produce “random” numbers
  ** assumes our hash functions can produce random values for all keys

Can we do better? Tough to be sure....