Data Structures and Algorithm Analysis (CSC317) Hash Tables
Total Page:16
File Type:pdf, Size:1020Kb
Data Structures and Algorithm Analysis (CSC317) Hash tables Hash table • We have elements with key and satellite data • Operations performed: Insert, Delete, Search/lookup • We don’t maintain order information • We’ll see that all operations on average O(1) • But worse case can be O(n) 254 Chapter 11 Hash Tables 11.1 Direct-address tables Direct addressing is a simple technique that works well when the universe U of keys is reasonably small. Suppose that an application needs a dynamic set in which each element has a key drawn from the universe U 0; 1; : : : ; m 1 ,wherem D f ! g is not too large. We shall assume that no two elements have the same key. To represent the dynamic set, we use an array, or direct-address table,denoted by TŒ0::m 1,inwhicheachposition,orslot,correspondstoakeyintheuni- ! verse U .Figure11.1illustratestheapproach;slotk points to an element in the set with key k.Ifthesetcontainsnoelementwithkeyk,thenTŒk NIL. D The dictionary operations are trivial to implement: DIRECT-ADDRESS-SEARCH.T; k/ 1 return TŒk Hash table DIRECT-ADDRESS-INSERT.T; x/ 1 TŒx:key x • SimpleD implementation: If universe of keys DIRECT-ADDRESS-DELETE.T; x/ 1 comesTŒx:key fromNIL a small set of integers [0..9], we canD store directly in array using the Each of these operations takes only O.1/ time. keys as indices into the slots. T 0 key satellite data U 1 (universe of keys) 2 0 6 2 9 3 4 7 3 4 1 5 K 2 5 (actual 3 6 5 keys) 8 7 8 8 9 Figure 11.1 How to implement a dynamic set by a direct-address table T .Eachkeyintheuniverse U 0; 1; : : : ; 9 corresponds to an index in the table. The set K 2;3; 5; 8 of actual keys D f g D f g determines the slots in the table that contain pointers to elements. The other slots, heavily shaded, contain NIL. 254 Chapter 11 Hash Tables 11.1 Direct-address tables Direct addressing is a simple technique that works well when the universe U of keys is reasonably small. Suppose that an application needs a dynamic set in which each element has a key drawn from the universe U 0; 1; : : : ; m 1 ,wherem D f ! g is not too large. We shall assume that no two elements have the same key. To represent the dynamic set, we use an array, or direct-address table,denoted by TŒ0::m 1,inwhicheachposition,orslot,correspondstoakeyintheuni- ! verse U .Figure11.1illustratestheapproach;slotk points to an element in the set with key k.Ifthesetcontainsnoelementwithkeyk,thenTŒk NIL. D Hash tableThe dictionary operations are trivial to implement: DIRECT-ADDRESS-SEARCH.T; k/ 1 return TŒk • SimpleDIRECT implementation:-ADDRESS-INSERT.T; x/ If universe of keys 1 TŒx:key x comes fromD a small set of integers [0..9], DIRECT-ADDRESS-DELETE.T; x/ 1 TŒx:key NIL we can storeD directly in array using the keys Eachas of indices these operations takesinto only O.1/thetime. slots. T 0 key satellite data U 1 (universe of keys) 2 0 6 2 9 3 4 7 3 4 1 5 K 2 5 (actual 3 6 5 keys) 8 7 8 8 9 • This isFigure also 11.1 How called to implement a dynamic a direct set by a direct-address-address table T .Eachkeyintheuniverse table U 0; 1; : : : ; 9 corresponds to an index in the table. The set K 2;3; 5; 8 of actual keys D f g D f g determines the slots in the table that contain pointers to elements. The other slots, heavily shaded, • Searchcontain timeNIL. just like in array – O(1) ! Example: Array versus Hash table • Imagine we have keys corresponding to friends that we want to store Example: Array versus Hash table • Imagine we have keys corresponding to friends that we want to store • Could use huge array, with each friend’s name mapped to some slot in array (eg, one slot in array for every possible name; each letter one of 26 characters, n letters in each name..) Example: Array versus Hash table John = A[23] Sally = A[2222177] Pros/cons? Example: Array versus Hash table John = A[23] Sally = A[2222177] Pros/cons? • We could insert, find key, and delete element in O(1) time – very fast! Example: Array versus Hash table John = A[23] Sally = A[2222177] Pros/cons? • We could insert, find key, and delete element in O(1) time – very fast! • But huge waste of memory, with many slots empty in many applications Example: Array versus Hash table John = A[23] Sally = A[2222177] Pros/cons? • We could insert, search key, and delete array element in O(1) time – very fast! • But huge waste of memory, with many slots empty in many applications • More effective: dynamic set of 1000 friends… Example: versus linked list • An alternative might be to use a linked list with all the friend names linked John -> Sally -> Bob • Pro: This is not wasteful because we only store the names that we want • Con: But search time is now O(n) • We want an approach that is fast, and not so wasteful! Example: versus linked list • We’ll eventually want best of both worlds – advantages of array and of linked list Hash table • Extremely useful alternative to static array for insert, search, and delete in O(1) time (on average) – VERY FAST • Useful when universe is large, but at any given time number of keys stored is small relative to total number of possible keys (not wasteful like a huge static array) • We usually don’t store key directly as index into array, but rather compute a hash function of the key k, h(k), as index 256 Chapter 11 Hash Tables 11.2 Hash tables The downside of direct addressing is obvious: if the universe U is large, storing atableT of size U may be impractical, or even impossible, given the memory j j available on a typical computer. Furthermore, the set K of keys actually stored may be so small relative to U that most of the space allocated for T would be wasted. When the set K of keys stored in a dictionary is much smaller than the uni- verse U of all possible keys, a hash table requires much less storage than a direct- address table. Specifically, we can reduce the storage requirement to ‚. K / while j j we maintain the benefit that searching for an element in the hash table still requires only O.1/ time. The catch is that this bound is for the average-case time,whereas for direct addressing it holds for the worst-case time. With direct addressing, an element with key k is stored in slot k.Withhashing, this element is stored in slot h.k/;thatis,weuseahash function h to compute the slot from the key k.Here,h maps the universe U of keys into the slots of a hash table TŒ0::m 1: ! h U 0; 1; : : : ; m 1 ; W ! f ! g where the size m of the hash table is typically much less than U .Wesaythatan Hash table j j element with key k hashes to slot h.k/;wealsosaythath.k/ is the hash value of •keyWhatk.Figure11.2illustratesthebasicidea.Thehashfunctionreducestherange problem can arise if we map keys to of arrayslots indices in a andhash hence table? the size Answer: of the array. collisions Instead of a; size of U ,thearray m j j cantwo have keys size .map to same slot. T 0 U (universe of keys) h(k1) h(k4) k1 K k4 k (actual 5 h(k2) = h(k5) keys) k 2 k 3 h(k3) m–1 Figure 11.2 Using a hash function h to map keys to hash-table slots. Because keys k2 and k5 map to the same slot, they collide. Collisions • Are guaranteed to happen when number of keys in table greater than number of slots in table • Or if “bad” hashing function – all keys were hashed to just one slot of hash table – more later • Even by chance, collisions are likely to happen. Consider keys that are birthdays. Recall the birthday paradox – room of 28 people, then 2 people have a 50 percent chance to have same birthday. Collisions • So we have to deal with collisions! • One solution? Collisions • So we have to deal with collisions! • One solution? Collision resolution by chaining Collision11.2 Hash tablesresolution by chaining 257 T U k1 k4 (universe of keys) k1 K k k 4 5 k k k (actual k 5 2 7 keys) 7 k2 k3 k8 k3 k6 k8 k6 Figure 11.3 Collision resolution by chaining. Each hash-table slot TŒj contains a linked list of all the keys whose hash value is j .Forexample,h.k1/ h.k4/ and h.k5/ h.k7/ h.k2/. D D D The linked list can be either singly or doubly linked; we show it as doubly linked because deletion is faster that way. There is one hitch: two keys may hash to the same slot. We call this situation a collision.Fortunately,wehaveeffectivetechniquesforresolvingtheconflict created by collisions. Of course, the ideal solution would be to avoid collisions altogether. We might try to achieve this goal by choosing a suitable hash function h.Oneideaisto make h appear to be “random,” thus avoiding collisions or at least minimizing their number. The very term “to hash,” evoking images of random mixing and chopping, captures the spirit of this approach. (Of course, a hash function h must be deterministic in that a given input k should always produce the same output h.k/.) Because U >m,however,theremustbeatleasttwokeysthathavethesamehash j j value; avoiding collisions altogether is therefore impossible.