Title : Hashing

By: Ritu Chaturvedi


1. Internal Hashing

2. Hash Tables

3. Hashing Functions

4. Collision resolution techniques:

4.1 Open Addressing:

4.1.1 Linear Probing (Primary Clustering)

4.1.2 Quadratic Probing

4.1.3 Random Probing (Secondary Clustering)

4.1.4 Double Hashing

Handling Overflows

4.2 Separate Chaining

5. External Hashing

6. Applications

7. Conclusion


1. Internal Hashing

Data can be stored in main memory (as tables) as well as in secondary memory (as files).

2. Hash Tables

A table is defined as a collection of n rows (records), where each row has m columns (fields).

Example: Table (4 X 5)

Column 1 Column 2 (Key) Column 3 Column 4 Column 5

Key is defined as a column entry in a row that uniquely identifies that row in the table. Every row in a table is unique.

For simplicity, each row is represented by its key alone (001, 002, ...), and the row positions 1 .. N form the address space.

Operations on Hash tables:

- Insertion
- Deletion
- Update
- Search


Searching Methods:

- Linear Search: O(n)
- Binary Search: O(log2 n)
- Random Search using Hashing: O(1)

O(1): search time is independent of the total number of records.

Hashing using key values can be INTERNAL (tables) or EXTERNAL (files).

Hashing can be defined as a technique in which the actual address in the hash table of the record to be operated on is computed, given a key value.

Figure: each key K1, K2, ..., Ki in the KEY SPACE maps to an address 1 .. N in the ADDRESS SPACE.

Key-to-Address Transformation is called a mapping or a Hashing Function.

Mapping: one-to-one (Perfect Hashing) or many-to-one.


Example: Key space: 89, 18, 49, 58, 9

Hash function: H(X) = X mod 7

Address space: 7 locations, labeled 1 .. 7

Preconditioning : converts the keys to a form which can be easily manipulated by a hashing function.

3.0 HASHING FUNCTIONS:

Main Goals:
- Speed
- Keys should be uniformly distributed over the address space

1. Division Remainder Method :

H(X) = X mod M

where X is the key and M is a prime number (the size of the table).
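As a minimal sketch of the division-remainder method, the snippet below applies H(X) = X mod 7 to the key space 89, 18, 49, 58, 9 used earlier. The function name `h_division` is an illustrative choice, and the resulting addresses run from 0 to M-1.

```python
def h_division(x: int, m: int) -> int:
    """Division-remainder hash: returns an address in 0 .. m-1."""
    return x % m

keys = [89, 18, 49, 58, 9]
addresses = {x: h_division(x, 7) for x in keys}
print(addresses)  # {89: 5, 18: 4, 49: 0, 58: 2, 9: 2}
```

Keys 58 and 9 both map to address 2, so this mapping is many-to-one.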

2. Mid-Square Method :

$H(X) = X^2$, or some digits taken from $X^2$
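One possible reading of the mid-square method is sketched below. Which digits to take is a design choice; this sketch assumes the middle `r` digits of the square, and both `h_mid_square` and `r` are hypothetical names.

```python
def h_mid_square(x: int, r: int = 2) -> int:
    """Square the key and keep r digits from the middle (an assumed choice)."""
    sq = str(x * x)
    start = max(0, (len(sq) - r) // 2)  # left edge of the middle window
    return int(sq[start:start + r])

# 1234 squared is 1522756; the two digits starting at index 2 are "22".
print(h_mid_square(1234))  # 22
```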

3. Folding :

- Folding by shifting: partition X into several parts and then add the parts, ignoring the carry.

- Folding by boundary: partition X into several parts, reverse the digits in the first and the last partition, and then add all the parts, ignoring the carry.
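The two folding variants can be sketched as follows. The part size `k` and the function names are assumptions made for illustration; "ignoring the carry" is taken to mean keeping only the last `k` digits of the sum.

```python
def split_digits(x: int, k: int):
    """Split the decimal digits of x into parts of k digits each."""
    s = str(x)
    return [s[i:i + k] for i in range(0, len(s), k)]

def fold_shift(x: int, k: int) -> int:
    """Folding by shifting: add the parts and drop the carry."""
    return sum(int(p) for p in split_digits(x, k)) % 10 ** k

def fold_boundary(x: int, k: int) -> int:
    """Folding by boundary: reverse the first and last parts, then add."""
    parts = split_digits(x, k)
    parts[0] = parts[0][::-1]
    if len(parts) > 1:
        parts[-1] = parts[-1][::-1]
    return sum(int(p) for p in parts) % 10 ** k

# 123456789 with k = 3 gives parts 123, 456, 789.
print(fold_shift(123456789, 3))     # 123 + 456 + 789 = 1368 -> 368
print(fold_boundary(123456789, 3))  # 321 + 456 + 987 = 1764 -> 764
```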

Collision:

A collision between two keys K and K' occurs when both have to be stored in the table and both hash to the same address in the table.

4.0 COLLISION RESOLUTION TECHNIQUES:
- Open Addressing
- Separate Chaining

Objective: to place colliding records elsewhere in the hash table.

4.1 OPEN ADDRESSING:

If key X collides at location D, then other locations in the hash table are examined until a free one is found. The sequence in which other locations are examined can be formulated in several ways:

4.1.1. Linear Probing

D, D+1, D+2, ..., M-1, 0, 1, ..., D-1 (for a table with addresses 0 .. M-1)

Example:

Key Space: J10, B2, S19, N14, X24, W23

Address Space: 0 .. 6

Preconditioning: the numeric subscripts are used as keys (for example, J10 becomes 10)

Hash Function: H(X) = X mod 7

Address   Key   Probe Sequence   No. of Probes
0         N14   -                1
1         (empty)
2         B2    -                1
3         J10   -                1
4         X24   3, 4             2
5         S19   -                1
6         W23   2, 3, 4, 5, 6    5
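The linear probing example can be traced with a short sketch. It assumes the keys are inserted in the order listed in the key space and that a free slot is marked with `None`; the function name `insert_linear` is illustrative.

```python
def insert_linear(table, key):
    """Insert key with linear probing; return the addresses probed."""
    m = len(table)
    d = key % m                      # home address H(X) = X mod M
    probes = []
    for i in range(m):
        addr = (d + i) % m           # D, D+1, ..., wrapping around to 0
        probes.append(addr)
        if table[addr] is None:
            table[addr] = key
            return probes
    raise OverflowError("table is full")

table = [None] * 7
trace = {k: insert_linear(table, k) for k in [10, 2, 19, 14, 24, 23]}
print(trace[24])  # [3, 4]
print(trace[23])  # [2, 3, 4, 5, 6]
print(table)      # [14, None, 2, 10, 24, 19, 23]
```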


Load Factor $\alpha = n / M$, where n is the number of keys entered and M is the size of the table.

Problems with Linear Probing:
- Deletion is complex
- If the load factor exceeds 0.8, performance degrades rapidly

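Cluster: a sequence of consecutive occupied locations in a hash table.

On inserting a key X, if a collision occurs, X is ultimately added to the end of a cluster, so the length of that cluster increases by 1 (or by more than 1 when the new entry joins two clusters).

Example: in the linear probing example, locations 2-3 and location 5 are separate clusters until X24 is placed at location 4, which merges them into one cluster covering locations 2-5.

This feature is called Primary Clustering.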

4.1.2. Quadratic Probing

$D, D + 1^2, D + 2^2, \ldots, D + i^2$ (each taken mod M)

Example:

Address   Key   Probe Sequence   No. of Probes
0         N14   -                1
1         (empty)
2         B2    -                1
3         J10   -                1
4         X24   3, 4             2
5         S19   -                1
6         W23   2, 3, 6          3
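The same keys can be traced under quadratic probing with the sketch below. It assumes the subscript keys 10, 2, 19, 14, 24, 23 are inserted in that order into a table of size 7; `insert_quadratic` is an illustrative name.

```python
def insert_quadratic(table, key):
    """Insert key with quadratic probing; return the addresses probed."""
    m = len(table)
    d = key % m
    probes = []
    for i in range(m):
        addr = (d + i * i) % m       # D, D+1^2, D+2^2, ... (mod M)
        probes.append(addr)
        if table[addr] is None:
            table[addr] = key
            return probes
    # Quadratic probing need not visit every location, so this can
    # happen even when the table still has free slots.
    raise OverflowError("no free location on this probe sequence")

table = [None] * 7
trace = {k: insert_quadratic(table, k) for k in [10, 2, 19, 14, 24, 23]}
print(trace[24])  # [3, 4]
print(trace[23])  # [2, 3, 6]
```

Compared with linear probing, W23 needs 3 probes instead of 5 because the probe sequence jumps over the cluster.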

4.1.3. Random Probing

D = (D + C) mod M, where C is a random number that is relatively prime to M.

Example:

Address   Probe Sequence   No. of Probes
0         -                1
1         (empty)
2         -                1
3         -                1
4         6, 3             2
5         -                1
6         0, 4             2


Secondary Clustering: If two keys collide, then the same random sequence is generated for both the keys.
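The requirement that C be relatively prime to M can be checked with a small sketch. It uses a fixed increment rather than a random one, and the values of `d`, `c`, and `m` below are illustrative assumptions, not taken from the example.

```python
from math import gcd

def locations_visited(d, c, m):
    """Addresses reached by repeating D = (D + C) mod M, starting at d."""
    seen = []
    for _ in range(m):
        if d in seen:        # the sequence has started to repeat
            break
        seen.append(d)
        d = (d + c) % m
    return seen

full = locations_visited(3, 3, 7)     # gcd(3, 7) == 1
partial = locations_visited(0, 2, 8)  # gcd(2, 8) == 2
print(gcd(3, 7), full)     # 1 [3, 6, 2, 5, 1, 4, 0]
print(gcd(2, 8), partial)  # 2 [0, 2, 4, 6]
```

When C and M share a factor, part of the table is never probed, so a free location may be missed.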

4.1.4. Double Hashing

D = (D+C) mod M

A second hash function, H2(X), is used to compute C.

Example:

H2(X): C = X div 7 (integer division)

For X24 (home address 3, C = 3): 6, 2, 5, 1, 4, 0

For R17 (home address 3, C = 2): 5, 0, 2, 4, 6, 1
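The two probe sequences above can be reproduced with a short sketch, assuming H(X) = X mod 7 for the home address and C = X div 7 for the increment. The function name `probe_sequence` is illustrative.

```python
def probe_sequence(x, m=7):
    """Addresses tried after a collision at the home address x mod m."""
    d = x % m          # first hash: home address
    c = x // m         # second hash: increment C = X div M
    # Note: this sketch does not handle c % m == 0 (for example, keys
    # smaller than m), where the sequence would never leave d.
    seq = []
    for _ in range(m - 1):
        d = (d + c) % m
        seq.append(d)
    return seq

print(probe_sequence(24))  # [6, 2, 5, 1, 4, 0]
print(probe_sequence(17))  # [5, 0, 2, 4, 6, 1]
```

Both keys collide at address 3, yet they follow different probe sequences, which is what avoids secondary clustering.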

Problem with Open Addressing: Handling Overflows


4.2 SEPARATE CHAINING

Prime Area: keys that are initially hashed

Overflow Area: all colliding records

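Average length of a linked list $= \frac{1}{M} \sum_{i=0}^{M-1} N_i = \frac{N}{M} = \alpha$, where $N_i$ is the number of records in the list at address i, N is the total number of records, and $\alpha$ is the load factor.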

A reasonable load factor for this method is 1.0. A load factor below 1 does not improve performance but costs extra space.
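A minimal sketch of separate chaining follows. It assumes Python lists stand in for the linked lists and reuses the keys 10, 2, 19, 14, 24, 23 with H(X) = X mod 7; the class and method names are hypothetical.

```python
class ChainedTable:
    """Hash table with one chain (list) per address."""

    def __init__(self, m):
        self.chains = [[] for _ in range(m)]

    def insert(self, key):
        chain = self.chains[key % len(self.chains)]
        if key not in chain:          # keep keys unique
            chain.append(key)         # colliding keys extend the chain

    def search(self, key):
        return key in self.chains[key % len(self.chains)]

    def average_chain_length(self):
        # (1/M) * sum of chain lengths = N / M
        return sum(len(c) for c in self.chains) / len(self.chains)

t = ChainedTable(7)
for k in [10, 2, 19, 14, 24, 23]:
    t.insert(k)
print(t.chains)                  # [[14], [], [2, 23], [10, 24], [], [19], []]
print(t.average_chain_length())  # 6 / 7
```

Unlike open addressing, the table never overflows: a collision only lengthens one chain.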


5.0 EXTERNAL HASHING:

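To suit the characteristics of disk storage, a key maps to a bucket rather than to a single address. A bucket is a series of b addresses (record slots).

Load Factor $\alpha = n / (M \cdot b)$, where n is the number of keys, M is the number of buckets, and b is the bucket size.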

Figure: keys K1 .. Ki in the key space map to bucket numbers 0 .. M-1; each bucket number corresponds to a block address on disk.


Example: Key space: 21,51,98,87,19,25,70,83

Hash Function:

1. X goes into bucket
- 0 if X is between 1 and 20
- 1 if X is between 21 and 40
- 2 if X is between 41 and 60
- 3 if X is between 61 and 80
- 4 if X is between 81 and 100

2. If a key X collides at some location D, then X is placed in the set of colliding records so as to preserve the order of the keys.

Such a hash function is an Order-Preserving Function: if X1 < X2 then H(X1) <= H(X2).

Example:

Bucket   Keys (in order)
0        19
1        21, 25
2        51
3        70
4        83, 87, 98
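The bucket assignment for this key space can be sketched as follows. The formula `(x - 1) // 20` is one way to express the five ranges, and unbounded Python lists stand in for disk buckets, so bucket size and overflow are not modeled.

```python
import bisect

def bucket_of(x):
    """Order-preserving hash: 1-20 -> 0, 21-40 -> 1, ..., 81-100 -> 4."""
    return (x - 1) // 20

buckets = [[] for _ in range(5)]
for x in [21, 51, 98, 87, 19, 25, 70, 83]:
    # insort keeps each bucket sorted, preserving key order on collision
    bisect.insort(buckets[bucket_of(x)], x)

print(buckets)  # [[19], [21, 25], [51], [70], [83, 87, 98]]
```

Reading the buckets from 0 to 4 yields the keys in ascending order, which is the point of an order-preserving function.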


6.0 Applications of Hashing:

- Symbol tables used in compilers
- Spelling checkers

7.0 Conclusion:

Hash tables are recommended wherever numerous searches must be performed with the highest possible efficiency.

The hash function should be chosen with care - the performance of all hashing methods depends on the use of a good hash function.

The probe increment (or decrement) should be chosen such that it enumerates all table locations.

The performance of the hashing method used should always be checked.

A load factor of 1.0 is considered reasonable for Chaining; 0.5 is considered reasonable for Open Addressing.

Theoretical results: expected number of probes, where $\alpha$ is the load factor

Method            Not Found                          Found
Chaining          $1 + \alpha$                       $1 + \alpha / 2$
Linear probing    $1/2 + 1 / (2 (1 - \alpha)^2)$     $1/2 + 1 / (2 (1 - \alpha))$
Double hashing    $1 / (1 - \alpha)$                 $(1 / \alpha) \ln(1 / (1 - \alpha))$
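To get a feel for these results, the sketch below evaluates the expected probe counts at a load factor of 0.5. It reads the formulas in their standard textbook form, with `a` standing for the load factor; the function names are illustrative.

```python
import math

def chaining(a):
    """(not found, found) expected probes for chaining."""
    return (1 + a, 1 + a / 2)

def linear_probing(a):
    """(not found, found) expected probes for linear probing, a < 1."""
    return (0.5 + 0.5 / (1 - a) ** 2, 0.5 + 0.5 / (1 - a))

def double_hashing(a):
    """(not found, found) expected probes for double hashing, 0 < a < 1."""
    return (1 / (1 - a), (1 / a) * math.log(1 / (1 - a)))

for name, f in [("chaining", chaining),
                ("linear probing", linear_probing),
                ("double hashing", double_hashing)]:
    not_found, found = f(0.5)
    print(f"{name}: not found {not_found:.3f}, found {found:.3f}")
```

At a load factor of 0.5, linear probing expects 2.5 probes for an unsuccessful search and 1.5 for a successful one; both figures grow quickly as the load factor approaches 1.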
