Hashing (part 2)

CSE 2011 Winter 2011

14 March 2011 1

Collision Handling

 Separate chaining  Pbi(Probing (open a ddress ing ) 

2

1 Quadratic Probing

 Linear probing:  Quadratic probing Insert item (k, e) A[i] is occupied i = h(k) Try A[(i+1) mod N]: used A[i] is occupied Try A[(i+22) mod N]: used Try A[(i+1) mod N]: used Try A[(i+32) mod N] Try A[(i+2) mod N] and so on and so on until an empty cell is found  May not be able to find an empty cell if N is not prime, or the is at least half full

3

Double Hashing

 Double hashing uses a Insert item (k, e) secondary i = h(()k) d(k) and handles collisions A[i] is occupied by placing an item in the first available cell of the Try A[(i+d(k))mod N]: used series Try A[(i+2d(k))mod N]: used (i  j  d(k)) mod N Try A[(i+3d(k))mod N] for j  0, 1, … , N  1 and so on until  The secondary hash an empty cell is found function d(k) cannot have zero values  The table size N must be a prime to allow probing of all the cells

4

2 Example of Double Hashing

kh(k ) d (k ) Probes  Consider a hash 18 5 3 5 tbltable s tori ng itinteger 41 2 1 2 22 9 6 9 keys that handles 44 5 5 5 10 collision with double 59 7 4 7 32 6 3 6 hashing 315 4 590  N 13 73 8 4 8  h(k)  k mod 13  d(k)  7  k mod 7 0123456789101112  Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order 31 41 18 32 59 73 22 44 0123456789101112

5

Double Hashing (2)

 d(k) should be chosen to minimize clustering  Common choice of compression map for the secondary hash function: d(k)  q  k mod q where q  N q is a prime  The possible values for d(k) are 1, 2, … , q

 Note: linear probing has d(k)  1.

6

3 Comparing Collision Handling Schemes  = n/N Unsuccessful: key not found Successful: key found

7

Comparing Collision Handling Schemes (2)

 Separate chaining:  Linear probing: items are – simple implementation clustered into contiguous – faster than open runs (). addressing in general – using more memory  Quadratic probing: secondary clustering.  Open addressing: – using less memory  Double hashing: – slower than chaining in distributes keys more general uniformly than linear probing does. – more complex removals

8

4 Performance of Hashing

 In the worst case, searches,  The expected running time insertions and removals on a of all the dictionary ADT hash table take O(n) time. operations in a hash table is  The worst case occurs when O(1). all the keys inserted into the  In practice, hashing is very map collide. fast provided the load factor  The load factor   nN is not close to 100%. affects the performance of a  rehashing. hash table.  Applications of hash tables:  Assuming that the hash  small databases values are like random numbers, it can be shown  compilers that the expected number of  browser caches probes for an insertion with  converting non-integer open addressing is keys to integers. 1 (1  ) 9

Keys That Are Strings

 We need to convert a string to an integer before hashing.

 One option is to add up the ASCII values of the characters in the string.  Is this a good strategy?

 Polynomial accumulation:  We partition the bits of the key into a sequence of components of fixed length (e.g., 8, 16 or 32 bits). x0 x1 … xn1  We evaluate the polynomial 2 n1 p(z)  x0  x1 z  x2 z  …  xn1z at a fixed value z, ignoring overflows.

10

5 Polynomial Accumulation

 Polynomial p(z) can be evaluated in O(n) time using Horner’s rule:  The following polynomials are successively computed, each from the previous one in O(1) time

p0(z)  xn1 pi (z)  xni1  zpi1(z) (i  1, 2, …, n 1)

 We have p(z)  pn1(z)  GdGood z values: 3337394133, 37, 39, 41.  Especially suitable for strings  z  33 gives at most 6 collisions on a set of 50,000 English words.

11

Summary

 Purpose of hash tables:  If collision occurs, use to obtain O(()1) expected one of the collision query time using O(n+N) handling schemes, taking space. into account available  If the keys are not memory space. integers, convert them to  If the load factor   nN integer keys. approaches the specified  Map integer keys to the threshold, rehash. hash table entries using a compression map function.

12

6 Next time…

 Graphs (chapter 13)

13

7