Hashing: Basic Plan

Hashing: basic plan Save items in a key-indexed table (index is a function of the key). Hashing Hash function. Method for computing array index from key. 0 1 Tyler Moore hash("it") = 3 2 3 "it" ?? 4 CS 2123, The University of Tulsa Issues. hash("times") = 3 5 ・Computing the hash function. ・Equality test: Method for checking whether two keys are equal. ・Collision resolution: Algorithm and data structure to handle two keys that hash to the same array index. Some slides created by or adapted from Dr. Kevin Wayne. For more information see http://www.cs.princeton.edu/courses/archive/fall12/cos226/lectures.php. Classic space-time tradeoff. ・No space limitation: trivial hash function with key as index. ・No time limitation: trivial collision resolution with sequential search. ・Space and time limitations: hashing (the real world). 3 2 / 22 Computing the hash function Uniform hashing assumption Idealistic goal. Scramble the keys uniformly to produce a table index. Uniform hashing assumption. Each key is equally likely to hash to an Efficiently computable. integer between 0 and M - 1. ・ key ・Each table index equally likely for each key. thoroughly researched problem, still problematic in practical applications Bins and balls. Throw balls uniformly at random into M bins. Ex 1. Phone numbers. ・Bad: first three digits. Better: last three digits. table ・ index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Ex 2. Social Security numbers. Bad: first three digits. 573 = California, 574 = Alaska Birthday problem. Expect two balls in the same bin after ~ π M / 2 tosses. ・ (assigned in chronological order within geographic region) ・Better: last three digits. Coupon collector. Expect every bin has ≥ 1 ball after ~ M ln M tosses. Practical challenge. Need different approach for each key type. Load balancing. After M tosses, expect most loaded bin has Θ ( log M / log log M ) balls. 5 13 3 / 22 4 / 22 Options for dealing with collisions Collisions Collision. Two distinct keys hashing to same index. ・Birthday problem ⇒ can't avoid collisions unless you have a ridiculous (quadratic) amount of memory. ・Coupon collector + load balancing ⇒ collisions are evenly distributed. 1 Open hashing aka separate chaining: store collisions in a linked list 2 Closed hashing aka open addressing: keep keys in the table, shift to unused space Collision resolution policies 0 1 1 Linear probing hash("it") = 3 2 2 Quadratic probing aka quadratic residue search 3 "it" 3 Double hashing ?? 4 hash("times") = 3 5 Challenge. Deal with collisions efficiently. 16 5 / 22 6 / 22 Separate chaining symbol table Analysis of separate chaining Use an array of M < N linked lists. [H. P. Luhn, IBM 1953] Proposition. Under uniform hashing assumption, prob. that the number of ・Hash: map key to integer i between 0 and M - 1. keys in a list is within a constant factor of N / M is extremely close to 1. ・Insert: put at front of ith chain (if not already there). ・Search: need to search only ith chain. Pf sketch. Distribution of list size obeys a binomial distribution. key hash value (10, .12511...) .125 S 2 0 E 0 1 A 8 E 12 0 A 0 2 0 10 20 30 R 4 3 st[] null Binomial distribution (N = 104, M = 103, ␣ = 10) C 4 4 0 H 4 5 1 2 X 7 S 0 E 0 6 3 equals() and hashCode() X 2 7 4 A 0 8 L 11 P 10 Consequence. Number of probes for search/insert is proportional to N / M. M 4 9 M too large ⇒ too many empty chains. P 3 10 M 9 H 5 C 4 R 3 ・ M too small ⇒ chains too long. M times faster than L 3 11 ・ sequential search E 0 12 ・Typical choice: M ~ N / 5 ⇒ constant-time ops. Hashing with separate chaining for standard indexing client 17 20 7 / 22 8 / 22 Closed hashing Closed hashing: insert Records stored directly in table of size M at hash index h(x) for key x Hash(key) into table at position i When a collision occurs: Repeat up to the size of the table Hashes to occupied home position If entry at position i in table isf blank or marked as deleted Record stored in first available slot based on repeatable collision then insert and exit resolution policy Let i be the next position using the collision resolution function Formally, for each i collisions h0(x); h1(x);::: hi (x) tried in succession where hi (x) = (h(x) + f (i)) mod M g 9 / 22 10 / 22 Closed hashing: search Closed hashing: delete Hash(key) into table at position i Hash(key) into table at position i Repeat up to the size of the table Repeat up to the size of the table If entry at position i in table matchesf key and not marked as deleted If entry at position i in table matchesf key then found and exit then mark as deleted and exit If entry at position i in table is blank If entry at position i in table is blank then not found and exit then not found and exit Let i be the next position using the collision resolution function Let i be the next position using the collision resolution function Not found and exit Not found and exit g g 11 / 22 12 / 22 Linear probing Clustering Cluster. A contiguous block of items. Observation. New keys likely to hash into middle of big clusters. Collision resolution function f (i) = i: hi (x) = (h(x) + i) mod M Work example 28 13 / 22 14 / 22 Knuth's parking problem Analysis of linear probing Model. Cars arrive at one-way street with M parking spaces. Proposition. Under uniform hashing assumption, the average # of probes Each desires a random space i : if space i is taken, try i + 1, i + 2, etc. in a linear probing hash table of size M that contains N = α M keys is: 1 1 1 1 Q. What is mean displacement of a car? 1+ 1+ ∼ 2 1 α ∼ 2 (1 α)2 − − search hit search miss / insert displacement = 3 Pf. Half-full. With M / 2 cars, mean displacement is ~ 3 / 2. Full. With M cars, mean displacement is ~ π M / 8 . Parameters. ・M too large ⇒ too many empty array entries. M too small ⇒ search time blows up. ・ # probes for search hit is about 3/2 Typical choice: α = N / M ~ ½. ・ # probes for search miss is about 5/2 29 30 15 / 22 16 / 22 Performance comparison of search Load factors and cost of probing Tree Worst-case cost Avg.-case cost (after n inserts) (after n inserts) Ordered What size hash table do we need when using linear probing and a load search insert delete search insert delete iteration? factor of α = 0:75 for closed hashing to achieve a more efficient Sequential search expected search time than a balanced binary search tree? (unordered list) Θ(n) Θ(n) Θ(n) Θ(n) Θ(n) Θ(n) no 1 1 Binary search Search hit: 2 (1 + 1−3=4 ) = 2:5 (ordered array) Θ(log(n)) Θ(n) Θ(n) Θ(log(n)) Θ(n) Θ(n) yes 1 1 BST Θ(n) Θ(n) Θ(n) Θ(log(n)) Θ(log(n)) Θ(log(n)) yes Search miss/insert: 2 (1 + (1−3=4)2 ) = 8:5 AVL Θ(log(n)) Θ(log(n)) Θ(log(n)) Θ(log(n)) Θ(log(n)) Θ(log(n)) yes B-tree Θ(log(n)) Θ(log(n)) Θ(log(n)) Θ(log(n)) Θ(log(n)) Θ(log(n)) yes Thus we need a hash table of size M where log2 M = 8:5, so Hash table Θ(n) Θ(n) Θ(n) Θ(1) Θ(1) Θ(1) no M 28:5 = 362 ≥ 17 / 22 18 / 22 Load factors and cost of probing Quadratic probing 50 2 2 Collision resolution function f (i) = i : hi (x) = (h(x) i ) mod M 20 ± ± for 1 i (M−1) 1000000000000 ≤ ≤ 2 10 M is a prime number of the form 4j + 3, which guarantees that the 100000000 probe sequence is a permutation of the table address space 5 alpha = 0.9 Eliminates primary clustering (when collisions group together causing breakeven input size hash table/BST input size breakeven expected # probes insert/searchexpected miss 10000 2 more collisions for keys that hash to different values) alpha = 0.9 Work example 1 1 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 load factor alpha load factor alpha 19 / 22 20 / 22 Double hashing Rehashing We have already seen how hash table performance falls rapidly as the With quadratic probing, secondary clustering remains: keys that table load factor approaches 1 (in practice, any load factor above 1/2 collide must follow sequence of prior collisions to find an open spot should be avoided) 0 Double hashing reduces both primary and secondary clustering: probe To rehash: create a new table whose capacity M is the first prime sequence is dependent on original key, not just one hash value more than twice as large as M Scan through the old table and insert into the new table, ignoring Collision resolution function f (i) = i hb(x): · cells marked as deleted hi (x) = (hA(x) + i hB (x)) mod M · Works best if M is prime Running time Θ(M) Relatively expensive operation on its own Our approach: hA(x) = x mod M, hB (x) = R (x mod R) where − But good hash table implementations will only rehash when the table is R is a prime < M.

Hashing: Basic Plan

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support