CS 665 Analysis of Algorithms

Programming Assignment 2

Due date: 12/2/99

Liyan Zhang

Abstract

In this assignment, we tested and evaluated the performance of hash functions created by the division method and the multiplication method, and of collision resolution by chaining and by open addressing.

1. Introduction:

A hash table is an effective data structure for many applications. In this assignment, we worked on the following:

- Generate a file with 897 distinct numbers as test data
- Write a program to test the performance of the division hash function and the multiplication hash function with 4 different table sizes
- Write a program to test two collision resolution methods: (1) chaining; (2) open addressing with linear probing and double hashing

2. Performance of hash functions created by division and multiplication method

a. Division hash function: h(k) = k mod m, where m is the size of the hash table and k is the key of the element.

b. Multiplication hash function: h(k) = ⌊m (k A mod 1)⌋, where 0 < A < 1 and "k A mod 1" is the fractional part of k A. In the experiment, we set A = 0.6180339887, as Knuth suggests.
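As an illustration, the two hash functions can be sketched in C++ as follows. This is a minimal sketch, not the code of collisions.cpp: the function names hash_division and hash_multiplication are hypothetical, and keys are assumed to be non-negative integers small enough that k A is represented accurately in a double.

```cpp
#include <cmath>
#include <cstddef>

// Division method: h(k) = k mod m.
std::size_t hash_division(unsigned long k, std::size_t m) {
    return k % m;
}

// Multiplication method: h(k) = floor(m * (k*A mod 1)).
// The default A is the constant used in the report.
std::size_t hash_multiplication(unsigned long k, std::size_t m,
                                double A = 0.6180339887) {
    double product = static_cast<double>(k) * A;
    double frac = product - std::floor(product);  // fractional part of k*A
    return static_cast<std::size_t>(std::floor(m * frac));
}
```

For example, with k = 123456 and m = 10000 the multiplication method gives slot 41, and with k = 1000 and m = 997 the division method gives slot 3.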

Table 1. Number of Collisions

Size of hash table (m)    Division method    Multiplication method
200                       698                699
512                       470                466
997                       309                288
1024                      301                292

The results show that when the hash table is much smaller than the number of elements, the number of collisions is very high. As the size of the hash table increases, the number of collisions decreases, and the decrease levels off once the table is larger than the number of elements (897). For the multiplication method, the number of collisions even rises slightly, from 288 to 292, when the size of the hash table increases from 997 to 1024.
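A collision count of this kind could be obtained with a routine like the one below. It is only a sketch under an assumption: a collision is counted each time a key hashes to a slot that an earlier key already occupies, which may differ from the exact definition used in collisions.cpp. The name count_collisions is hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Counts the keys whose home slot is already taken by an earlier key.
// h maps (key, table size) to a slot in [0, m).
std::size_t count_collisions(
        const std::vector<unsigned long>& keys, std::size_t m,
        const std::function<std::size_t(unsigned long, std::size_t)>& h) {
    std::vector<bool> occupied(m, false);
    std::size_t collisions = 0;
    for (unsigned long k : keys) {
        std::size_t slot = h(k, m);
        if (occupied[slot]) {
            ++collisions;           // slot already used: one more collision
        } else {
            occupied[slot] = true;  // first key to land in this slot
        }
    }
    return collisions;
}
```

Under this definition, n keys in m slots always give at least n - m collisions when n > m, so 897 keys in a table of size 200 cannot produce fewer than 697.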

3. The performance of collision resolution techniques

The insert and search procedures for open addressing are:

HASH-INSERT(T, k)
    i ← 0
    repeat
        j ← h(k, i)
        if T[j] = NIL
            then T[j] ← k
                 return j
            else i ← i + 1
    until i = m
    error "hash table overflow"

HASH-SEARCH(T, k)
    i ← 0
    repeat
        j ← h(k, i)
        if T[j] = k
            then return j
        i ← i + 1
    until i = m or T[j] = NIL
    return NIL

a. Collision resolution by chaining

In chaining, we put all the elements that hash to the same slot in a linked list.

CHAINED-HASH-INSERT(T, x)
    insert x at the head of list T[h(key[x])]

CHAINED-HASH-SEARCH(T, k)
    search for an element with key k in list T[h(k)]

CHAINED-HASH-DELETE(T, x)
    delete x from the list T[h(key[x])]

b. Open addressing

Given an ordinary hash function h': U → {0, 1, …, m-1}:

a). Linear probing:

Hash function: h(k, i) = (h'(k) + i) mod m

The theoretical values of the number of probes needed for searching, at load factor α, are:

½ (1 + 1/(1-α)) for successful search
½ (1 + 1/(1-α)²) for unsuccessful search
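The HASH-INSERT and HASH-SEARCH procedures with linear probing might look like this in C++. This is a sketch, not the code of hashtest.cpp: it assumes non-negative integer keys, h'(k) = k mod m, and a sentinel value NIL = -1 for empty slots; the function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

const long NIL = -1;  // marks an empty slot; keys are assumed non-negative

// Linear probing: h(k, i) = (h'(k) + i) mod m, with h'(k) = k mod m.
std::size_t probe_linear(long k, std::size_t i, std::size_t m) {
    return (static_cast<std::size_t>(k) % m + i) % m;
}

// HASH-INSERT: returns the slot used, or -1 if the table is full.
long hash_insert(std::vector<long>& T, long k) {
    const std::size_t m = T.size();
    for (std::size_t i = 0; i < m; ++i) {
        std::size_t j = probe_linear(k, i, m);
        if (T[j] == NIL) {
            T[j] = k;
            return static_cast<long>(j);
        }
    }
    return -1;  // hash table overflow
}

// HASH-SEARCH: returns the slot holding k, or -1 if k is absent.
long hash_search(const std::vector<long>& T, long k) {
    const std::size_t m = T.size();
    for (std::size_t i = 0; i < m; ++i) {
        std::size_t j = probe_linear(k, i, m);
        if (T[j] == k) return static_cast<long>(j);
        if (T[j] == NIL) return -1;  // empty slot ends the probe sequence
    }
    return -1;
}
```

With a table of size 7, inserting 10, 17 and 24 (which all have home slot 3) places them in slots 3, 4 and 5, which is the start of a primary cluster.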

b). Double hashing:

Hash function: h(k, i) = (h1(k) + i·h2(k)) mod m

h1(k) = k mod m
h2(k) = 1 + (k mod m'), where m' = m-1 or m-2
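The double hashing probe sequence can be sketched as a single function. The sketch assumes m' = m - 1 (the report allows m - 1 or m - 2) and a table size m of at least 2; the name probe_double is hypothetical.

```cpp
#include <cstddef>

// Double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m,
// with h1(k) = k mod m and h2(k) = 1 + (k mod m').
std::size_t probe_double(unsigned long k, std::size_t i, std::size_t m) {
    const std::size_t m_prime = m - 1;  // assumption: m' = m - 1
    std::size_t h1 = k % m;
    std::size_t h2 = 1 + (k % m_prime); // never 0, so the probe always moves
    return (h1 + i * h2) % m;
}
```

For k = 14 and m = 13, h1 = 1 and h2 = 3, so the first three probes go to slots 1, 4 and 7. When m is prime, h2(k) is relatively prime to m and the sequence visits every slot.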

The theoretical values of the number of probes needed for searching are:

(1/α) ln(1/(1-α)) for successful search
1/(1-α) for unsuccessful search

Table 2. Experiment results: average probes for successful and unsuccessful search (column headings 0.1 to 0.9 are load factors)

                              Successful search                 Unsuccessful search
Table size  Method            0.1   0.25  0.5   0.75  0.9       0.1    0.25   0.5    0.75   0.9
997         Chaining          1.04  1.1   1.24  1.38  1.36      1.06   1.14   1.32   1.42   1.64
            Linear probing    1.04  1.14  1.6   2.96  5.28      1.08   1.22   2.46   8.7    51.04
            Double hashing    1.04  1.1   1.3   1.38  2.66      1.06   1.18   1.86   3.34   10.4
2081        Chaining          1.06  1.08  1.3   1.36  1.36      1.1    1.14   1.36   1.62   1.76
            Linear probing    1.04  1.18  1.3   5.92  5.22      1.12   1.24   2.24   7.54   44.84
            Double hashing    1.04  1.2   1.24  2.36  3.7       1.12   1.26   2.04   4.1    10.08

Table 3. Theoretical number of probes for successful search

Load factor       0.1    0.25   0.5    0.75   0.9
Chaining          1.05   1.13   1.25   1.38   1.45
Linear probing    1.06   1.17   1.5    2.5    5.5
Double hashing    1.05   1.15   1.39   1.85   2.56

Table 4. Theoretical number of probes for unsuccessful search

Load factor       0.1    0.25   0.5    0.75   0.9
Chaining          1.1    1.25   1.5    1.75   1.9
Linear probing    1.12   1.39   2.5    8.5    50.5
Double hashing    1.11   1.33   2.00   4.00   10.00
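The theoretical values in Tables 3 and 4 can be reproduced with a few one-line functions of the load factor a. The linear probing and double hashing formulas are the ones given in Section 3. The chaining formulas, 1 + a/2 and 1 + a, are not stated in the text and are an assumption inferred from the table entries, which they match. All function names are hypothetical.

```cpp
#include <cmath>

// Expected probes as a function of the load factor a, with 0 < a < 1.
// Chaining formulas are inferred from the table values (assumption).
double chaining_successful(double a)   { return 1.0 + a / 2.0; }
double chaining_unsuccessful(double a) { return 1.0 + a; }

// Linear probing.
double linear_successful(double a) {
    return 0.5 * (1.0 + 1.0 / (1.0 - a));
}
double linear_unsuccessful(double a) {
    return 0.5 * (1.0 + 1.0 / ((1.0 - a) * (1.0 - a)));
}

// Double hashing.
double double_successful(double a) {
    return (1.0 / a) * std::log(1.0 / (1.0 - a));
}
double double_unsuccessful(double a) {
    return 1.0 / (1.0 - a);
}
```

For example, linear_unsuccessful(0.9) evaluates to 50.5 and double_successful(0.5) to about 1.39, matching the tables.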

Table 2 shows the average number of probes for successful and unsuccessful search. Tables 3 and 4 show the theoretical values for successful and unsuccessful search, respectively.
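One way averages like those in Table 2 could be measured for open addressing is sketched below. It is not the code of hashtest.cpp. It assumes distinct non-negative keys, at most m of them, and takes the probe function h(k, i, m) as a parameter; the names Probe and average_successful_probes are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Probe function: maps (key, probe number, table size) to a slot.
using Probe = std::function<std::size_t(long, std::size_t, std::size_t)>;

// Inserts all keys by open addressing, then searches for each stored key
// and returns the average number of probes per successful search.
double average_successful_probes(const std::vector<long>& keys,
                                 std::size_t m, const Probe& h) {
    const long NIL = -1;  // empty-slot marker; keys assumed non-negative
    std::vector<long> T(m, NIL);
    for (long k : keys) {
        for (std::size_t i = 0; i < m; ++i) {
            std::size_t j = h(k, i, m);
            if (T[j] == NIL) { T[j] = k; break; }
        }
    }
    std::size_t total = 0;
    for (long k : keys) {
        for (std::size_t i = 0; i < m; ++i) {
            ++total;                       // one probe examined
            if (T[h(k, i, m)] == k) break; // found the key
        }
    }
    return keys.empty() ? 0.0 : static_cast<double>(total) / keys.size();
}
```

With linear probing and a table of size 7, the keys 10, 17 and 24 need 1, 2 and 3 probes respectively, for an average of 2.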

Comparing Table 2 with Tables 3 and 4, we observe that for chaining the number of probes in the experiment is almost always less than the theoretical value, which suggests that the assumption of simple uniform hashing holds in this case.

For linear probing, when the load factor is larger than 0.5, the number of probes in the experiment is often larger than the theoretical value. This is because the assumption of uniform hashing does not hold once the load factor exceeds 0.5: the problem of primary clustering becomes more severe, and linear probing is not a very good approximation to uniform hashing.

For double hashing, the number of probes in the experiment is less than or close to the theoretical value in most cases, so the two are broadly consistent.

We can see that the number of probes for both successful and unsuccessful search does not depend on the size of the hash table; only the load factor affects the performance.

4. Coding

There are three programs for the project:

1) generate.cpp was used to generate test data
2) collisions.cpp was for testing the performance of the hash functions created by division and multiplication
3) hashtest.cpp was a menu-driven test program to test the performance of the various collision resolution techniques

All three programs were compiled using g++ on a Sun workstation. They are in ~lzhang/cs665/prj2/

To compile them:

    g++ -o generate generate.cpp
    g++ -o collisions collisions.cpp
    g++ -o hashtest hashtest.cpp

Experiments:

(1). Run generate 897 50 and generate 1873 50 to generate four key files: test_897, test_50_897_u, test_1873, and test_50_1873_u. The first two were used for testing the various hash functions with m = 997. The last two were used for testing the collision resolution techniques with m = 2081.

(2). Run collisions to test the performance of the division hash function and the multiplication hash function; here, A = 0.6180339887.

(3). Run hashtest to test the performance of the collision resolution techniques.

5. Advantages and disadvantages of hash function created by division and multiplication

a. The division method: The performance of the division method depends on m. When using the division method, we usually avoid certain values of m (the number of slots). Good values for m are primes not too close to exact powers of 2. Similarly, powers of 10 should be avoided if the application deals with decimal keys.

b. The multiplication method: An advantage of the multiplication method is that the value of m is not critical. Although this method works with any value of the constant A, it works better with some values than with others. Knuth suggests A = 0.6180339887.

6. Collision resolution:

Chaining: The worst-case running time for insertion is O(1). For searching, the worst-case running time is proportional to the length of the list.

The worst-case search time of hashing with chaining is Θ(n), which occurs when all n keys hash to the same slot. The average performance of hashing depends on how well the hash function h distributes the set of keys to be stored among the m slots.
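A chained hash table with these costs can be sketched as follows. The sketch stores bare integer keys instead of elements with a key field, uses the division method for h, and relies on std::forward_list for the linked lists; the struct name ChainedTable is hypothetical.

```cpp
#include <cstddef>
#include <forward_list>
#include <vector>

struct ChainedTable {
    std::vector<std::forward_list<unsigned long>> slots;

    explicit ChainedTable(std::size_t m) : slots(m) {}

    // Division method, chosen here only for illustration.
    std::size_t h(unsigned long k) const { return k % slots.size(); }

    // CHAINED-HASH-INSERT: O(1), the key goes at the head of its list.
    void insert(unsigned long k) { slots[h(k)].push_front(k); }

    // CHAINED-HASH-SEARCH: time proportional to the length of the list.
    bool search(unsigned long k) const {
        for (unsigned long x : slots[h(k)]) {
            if (x == k) return true;
        }
        return false;
    }

    // CHAINED-HASH-DELETE: removes every copy of k from its list.
    void remove(unsigned long k) { slots[h(k)].remove(k); }
};
```

With a singly linked list, deletion by key has to walk the list, so it costs the same as a search; CHAINED-HASH-DELETE takes the element itself and can run in constant time when the lists are doubly linked.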

Linear probing: Linear probing is easy to implement, but it suffers from a problem known as primary clustering. Long runs of occupied slots build up, increasing the average search time. Linear probing is not a very good approximation to uniform hashing.
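Primary clustering can be made visible by measuring the longest run of occupied slots in a table. The helper below is a hypothetical sketch that treats the table as circular, because linear probing wraps around from the last slot to the first; it takes a vector of occupancy flags, not the table itself.

```cpp
#include <cstddef>
#include <vector>

// Length of the longest run of consecutive occupied slots,
// treating the table as circular.
std::size_t longest_cluster(const std::vector<bool>& occupied) {
    const std::size_t m = occupied.size();
    std::size_t best = 0;
    std::size_t run = 0;
    // Two passes over the table let a run wrap around the end.
    for (std::size_t i = 0; i < 2 * m; ++i) {
        if (occupied[i % m]) {
            ++run;
            if (run > best) best = run;
        } else {
            run = 0;
        }
    }
    return best < m ? best : m;  // a full table is one cluster of length m
}
```

Comparing this value after filling the same table with linear probing and with double hashing gives a rough picture of how much longer the runs become under linear probing.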

Double hashing: Double hashing is one of the best methods available for open addressing because the permutations produced have many of the characteristics of randomly chosen permutations.

7. Comparison of Extendible Hashing with Double Directories (EHDD) and the three methods

EHDD is an effective solution for highly dynamic applications.

The chaining method can deal with dynamic applications too, but when the set of keys is very large, searching and deletion will not be as effective as with EHDD.

Open addressing cannot deal with dynamic applications well. Because the table size is fixed, it either wastes a lot of storage space or runs short of it, and it is hard to choose the size of the hash table in advance.

8. Conclusion

Through the experiment, we observed that the load factor affects the performance of hashing. Among the collision resolution methods, chaining is an effective way to deal with collisions, and double hashing resolves collisions better than linear probing.

Reference

[1]. Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest, Introduction to Algorithms, MIT Press, 1990, Chapter 12: Hash Tables.