Hash Tables What are hash tables? Chapter 20

 A is used to implement a set, providing basic operations in constant time: - insert CS 3358 - remove (optional) Summer I 2012 - find Jill Seaman - makeEmpty (need not be constant time)  The table uses a function that maps an object in the set to its location in the table.  The function is called a hash function.

1 2

Using a hash function Placing elements in the array

values values

[ 0 ] Empty HandyParts company [ 0 ] Empty Use the hash function

[ 1 ] 4501 makes no more than 100 [ 1 ] 4501

[ 2 ] Empty different parts. But the [ 2 ] Empty Hash(partNum) = partNum % 100

[ 3 ] 8903 parts all have four digit numbers. [ 3 ] 8903 7803 7803 [ 4 ] [ 4 ] to place the element with Empty This hash function can be used to Empty part number 5502 in the . 8 . 8 . store and retrieve parts in an array. . array. .10 10...... [ 97] Hash(partNum) = partNum % 100 [ 97] Empty Empty [ 98] [ 98] 2298 [ 99] [ 99] 2298

3699 3699 41 42 Placing elements in the array How to resolve the collision?

values values

[ 0 ] Empty Next place part number [ 0 ] Empty One way is by .

[ 1 ] 4501 6702 in the array. [ 1 ] 4501 This uses the following function

[ 2 ] 5502 [ 2 ] 5502 [ 3 ] Hash(partNum) = partNum % 100 [ 3 ] (HashValue + 1) % 100 7803 7803 [ 4 ] [ 4 ] Empty 6702 % 100 = 2 Empty repeatedly until an empty location . . . . is found for part number 6702...... But values[2] is already . . . [ 97] occupied. [ 97] Empty Empty [ 98] [ 98] COLLISION OCCURS [ 99] 2298 [ 99] 2298

3699 3699 43 44

Resolving the collision Collision resolved

values values

[ 0 ] Empty Still looking for a place for 6702 [ 0 ] Empty Part 6702 can be placed at

[ 1 ] 4501 using the function [ 1 ] 4501 the location with index 4.

[ 2 ] 5502 [ 2 ] 5502 [ 3 ] (HashValue + 1) % 100 [ 3 ] 7803 7803 [ 4 ] [ 4 ] Empty Empty ...... [ 97] [ 97] Empty Empty [ 98] [ 98]

[ 99] 2298 [ 99] 2298

3699 3699 45 46 Collision resolved Hashing concepts

 values Hash Table: where objects are stored by according to their key (usually an array) [ 0 ] Empty Part 6702 is placed at

[ 1 ] 4501 the location with index 4. - key: attribute of an object used for searching/ sorting [ 2 ] 5502 [ 3 ] - number of valid keys usually greater than number 7803 [ 4 ] Where would the part with of slots in the table 6702 number 4598 be placed using . - number of keys in use usually much smaller than . linear probing? . table size. . .  . Hash function: maps keys to a Table index [ 97] Empty  [ 98] Collision: when two separate keys hash to the

[ 99] 2298 same location 10

3699 47

Hashing concepts Hash Function

 Collision resolution: method for finding an  Goals: open spot in the table for a key that has - computation should be fast collided with another key already in the table. - should minimize collisions (good distribution)  Load Factor: the fraction of the hash table  that is full Some issues: - may be given as a percentage: 50% - should depend on ALL of the key (not just the last 2 digits or first 3 characters, - may be given as a fraction in the range from 0 to which may not themselves be well distributed) 1, as in: .5

11 12 Hash Function Hash Function: string keys

 Final step of hash function is usually:  If the key is not a number, hash function must temp % size transform it to a number, to mod by the size - temp is some intermediate result  Method 1: Add up ascii values - size is the hash table size int hash (string key, int tableSize) { int hashVal = 0; - ensures the value is a valid location in the table for (int i=0; i

Hash Function: string keys Hash Function: string keys

 Method 2: Multiply each char by a power of 128 - could just allow overflow, take mod at the end

3 2 1 0 - but repeated multiplying by 128 tends to shift early chars out hash = k[0]*128 + k[1]*128 + k[2]*128 + k[3]*128 to the left This equivalent to (which avoids large intermediate results): hash = (((k[0]*128 + k[1])*128 + k[2])*128 + k[3])*128  Method 3: Multiply each char by a power of 37 int hash (string key, int tableSize) { int hashVal = 0; for (int i=0; i

 Insert: When there is a collision on, search  Insert: 89, 18, 49, 58, 69, hash(k) = k mod 10

sequentially for the next available slot Probing function (attempt i): hi(K) = (hash(K) + i) % tablesize

 Find: if the key is not at the hashed location, 49 is in 0 because keep searching sequentially for it. 9 was full

- if it reaches an empty slot, the key is not found 58 is in 1 because 8, 9, 0 were full  Problem: if the the table is somewhat full, it may take a long time to find the open slot. 69 is in 1 because 9, 0 were full - May not be O(1) any more  Problem: Removing an element in the middle of a chain 17 18

Linear Probing: delete problem Linear Probing: Lazy deletion

 values Don’t remove the deleted object, just mark as deleted [ 0 ] Empty Part 6702 was placed at

[ 1 ] 4501 the location with index 4,  During find, marked deletions don’t stop the

[ 2 ] 5502 after colliding with 5502 searching

[ 3 ]  7803 During insert, the spot may be reused [ 4 ] Now remove 7803. 6702  If there are a lot of deletions, searching may . . Now find 6702 (hash(6702)=2): still take a long time in a “sparse” table . . . not at values[2] . [ 97] values[3] is empty, so not found Empty [ 98]

[ 99] 2298 20

3699 46 Linear Probing: Collision Resolution: Primary Clustering 2.

 Cluster: a large, sequential block of occupied  An attempt to eliminate primary clustering slots in the table  If the hash function returns H, and H is occupied,  Any key that hashes into the cluster requires try H+1, then H+4, then H+9, ... excessive attempts to resolve the collision - for each attempt i, try H+i2 next.

 2 If it’s during an insert operation, the cluster Probing function (attempt i): hi(K) = (hash(K) + i ) % tablesize gets bigger.  Is it guaranteed to find an empty slot if there is  If two clusters are separated by one slot, a one (like linear probing)? single insertion will drastically degrade the - Yes IF: the table size is prime and the load is <= 50% future performance  Primary clustering is a problem at high load factors (90%), not at 50% or less. 21 22

Quadratic Probing: Quadratic probing: Example expansion of table

 Insert: 89, 18, 49, 58, 69, hash(k) = k mod 10  Since the table should be less than 50% full:

2 Probing function (attempt i): hi(K) = (hash(K) + i ) % tablesize  Can the table be expanded if the load factor gets

49 is in 0 because more than 50%? 9 was full  Yes. 58 is in 2 because 8, 8+1=9 were full, - Find the next prime number greater than (8+4)%10=2 wasn’t 2*tableSize, resize to that. 69 is in 3 because 9, (9+1)%10=0 - Don’t just copy all the elements were full, (new tablesize => new hash function) (9+4)%10=3 wasn’t - Scan old table for non-empties, and use insert Note: smaller clusters function to add them to new table. 23 24  This is called rehashing. Quadratic Probing: Secondary Clustering

 quadratic probing helps reduce primary  To avoid secondary clustering, the probe clustering (more holes, less interference) sequence needs to be a function of the original - leaves more empty slots key, not the result of hashing that key. - less likely for collisions with one location to overlap with other  Solution: use two hash functions insertions - hash(k) still maps to a location  However, it is still prone to secondary clustering: - for collision resolution: use another hash function, hash2(k): - keys that hash to the same location probe the same sequence of cells. Probing function (attempt i): hi(K) = (hash(K) + i*hash2(K)) % tablesize - many keys hashing to same location can still generate long chains to search  hash2(K) must have special properties: - never evaluate to 0 25 26 - capable of probing all slots

Collision Resolution: Separate Chaining 3. Separate chaining

 Use an array of linked lists for the hash table  To insert a an object:

 Each linked list contains all objects that hashed to that - compute hash(k) location - insert at front of list at that location (if empty, make first node) - no collisions  To find an object: - compute hash(k) - search the linked list there for the key of the object Hash function is still: h(K) = k % 10  To delete an object: - compute hash(k) - search the linked list there for the key of the object - if found, remove it 27 28 Separate Chaining

 The load can be 1 or more - more than 1 node at each location, still O(1) inserts and finds - smaller loads do not improve performance - moderately larger loads do not hurt performance  Disadvantages - Memory allocation could be expensive - too many nodes at one position can slow operations (equivalent to secondary clusters)  Advantages: - deletion is easy - don’t have to resize/rehash 29