Hashing Techniques

Material based on slides by George Bebis https://www.cse.unr.edu/~bebis/CS477/Lect/Hashing.ppt

The Search Problem

• Find items with keys matching a given search key
  – Given an array A containing n keys and a search key x, find the index i such that x = A[i]
  – As in the case of sorting, a key could be part of a larger record.

Applications

• Keeping track of customer account information at a bank
  – Search through records to check balances and perform transactions
• Keeping track of reservations on flights
  – Search to find empty seats, cancel/modify reservations
• Search engine
  – Looks for all documents containing a given word

Direct Addressing

• Assumptions:
  – Key values are distinct
  – Each key is drawn from a universe U = {0, 1, . . . , m - 1}
• Idea:
  – Store the items in an array, indexed by keys

• Direct-address table representation:
  – An array T[0 . . . m - 1]
  – Each slot, or position, in T corresponds to a key in U
  – For an element x with key k, a pointer to x (or x itself) will be placed in location T[k]
  – If there are no elements with key k in the set, T[k] is empty, represented by NIL
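A minimal sketch of a direct-address table, assuming integer keys drawn from U = {0, 1, . . . , m - 1}; the class and method names below are illustrative, not taken from the slides:

```python
class DirectAddressTable:
    """Direct addressing: the key itself is the array index (keys in 0 .. m-1)."""

    def __init__(self, m):
        self.T = [None] * m       # all slots start empty (NIL)

    def insert(self, key, value):
        self.T[key] = value       # O(1): the key is the index

    def search(self, key):
        return self.T[key]        # O(1): None means "no element with this key"

    def delete(self, key):
        self.T[key] = None        # O(1)
```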

Direct Addressing (cont’d)

Examples Using Direct Addressing

Example 1:

Examples Using Direct Addressing

Example 2:

Hashing

• Hashing provides a means for accessing data without the use of an index structure.
• Data is addressed on disk by computing a function on a search key instead.

Organization

• A bucket in a hash file is a unit of storage (typically a disk block) that can hold one or more records.
• The hash function, h, is a function from the set of all search keys, K, to the set of all bucket addresses, B.
• Insertion, deletion, and lookup are done in constant time (on average).

Hash Tables

• When K is much smaller than U, a hash table requires much less space than a direct-address table
  – Can reduce storage requirements to O(|K|)
  – Can still get O(1) search time, but in the average case, not the worst case

Hash Tables

• Idea:
  – Use a hash function h to compute the slot for each key
  – Store the element in slot h(k)
• A hash function h transforms a key into an index in a hash table T[0 . . . m - 1]:
  h : U → {0, 1, . . . , m - 1}
• We say that k hashes to slot h(k)
• Advantages:
  – Reduce the range of array indices handled: m instead of |U|
  – Storage is also reduced

Example: Hash Tables

[Figure: the universe of keys U contains the actual keys K = {k1, k2, k3, k4, k5}; h maps each key to a slot of the table T[0 . . . m - 1], and here h(k2) = h(k5).]

Revisit Example 2

Do you see any problems with this approach?

[Figure (repeated): the same hash table diagram, where two distinct keys k2 and k5 hash to the same slot: h(k2) = h(k5).]

Collisions!

• Store names using a hash function
• h(k) = k mod m
• k = sum of the alphabet positions of the letters
• Let m = 51

• MOHIT: 13 + 15 + 8 + 9 + 20 = 65, and 65 mod 51 = 14
• RANA: 34
• STOP: 70 mod 51 = 19
• BISWAS: 73 mod 51 = 22
• XEROX: 86 mod 51 = 35
• TOPS: 70 mod 51 = 19 (collides with STOP)
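A small script that reproduces this computation, assuming alphabet positions A = 1, . . . , Z = 26 (the helper name letter_sum is just for illustration):

```python
def letter_sum(word):
    """Sum of alphabet positions: A = 1, ..., Z = 26."""
    return sum(ord(c) - ord('A') + 1 for c in word.upper())

m = 51
for name in ["MOHIT", "RANA", "STOP", "BISWAS", "XEROX", "TOPS"]:
    k = letter_sum(name)
    print(f"{name}: k = {k}, h(k) = {k % m}")

# STOP and TOPS are anagrams, so they have the same letter sum (70)
# and necessarily hash to the same slot: 70 mod 51 = 19.
```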

Collisions

• Two or more keys hash to the same slot!
• For a given set K of keys:
  – If |K| ≤ m, collisions may or may not happen, depending on the hash function
  – If |K| > m, collisions will definitely happen (i.e., there must be at least two keys that have the same hash value)
• Avoiding collisions completely is hard, even with a good hash function

Handling Collisions

• We will review the following methods:
  – Chaining
  – Open addressing:
    • Linear probing
    • Quadratic probing
    • Double hashing

Handling Collisions Using Chaining

• Idea:
  – Put all elements that hash to the same slot into a linked list

  – Slot j contains a pointer to the head of the list of all elements that hash to j

Collision with Chaining - Discussion

• Choosing the size of the table:
  – Small enough not to waste space
  – Large enough such that lists remain short
• How should we keep the lists: ordered or not?
  – Not ordered!
    • Insert is fast
    • Can easily remove the most recently inserted elements

Insertion in Hash Tables

• Worst-case running time is O(1)
• Assumes that the element being inserted isn’t already in the list
• It would take an additional search to check if it was already inserted

Searching in Hash Tables

• Search for an element with key k in the list T[h(k)]
• Running time is proportional to the length of the list of elements in slot h(k)
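A minimal chained hash table sketch, assuming the division method h(k) = k mod m and using a Python list in place of each slot's linked list; the class name ChainedHashTable is illustrative:

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.T = [[] for _ in range(m)]   # one chain per slot

    def _h(self, k):
        return k % self.m                 # division-method hash

    def insert(self, k):
        self.T[self._h(k)].insert(0, k)   # O(1): prepend (assumes k not already present)

    def search(self, k):
        return k in self.T[self._h(k)]    # proportional to the chain length in slot h(k)

    def delete(self, k):
        chain = self.T[self._h(k)]
        if k in chain:
            chain.remove(k)

t = ChainedHashTable(10)
for k in [89, 18, 49, 58, 69]:
    t.insert(k)
print(t.search(49), t.search(40))         # True False
```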

Hash Functions

• A hash function transforms a key into a table address
• What makes a good hash function?
  (1) Easy to compute
  (2) Approximates a random function: for every input, every output is equally likely (simple uniform hashing)
• In practice, it is very hard to satisfy the simple uniform hashing property
  – i.e., we don’t know in advance the probability distribution that keys are drawn from

Good Approaches for Hash Functions

• Minimize the chance that closely related keys hash to the same slot
  – Strings such as pt and pts should hash to different slots
• Derive a hash value that is independent of any patterns that may exist in the distribution of the keys

The Division Method

• Idea:
  – Map a key k into one of the m slots by taking the remainder of k divided by m:
    h(k) = k mod m
• Advantage:
  – Fast, requires only one operation
• Disadvantage:
  – Certain values of m are bad, e.g.:
    • powers of 2
    • non-prime numbers

Example - The Division Method

• If m = 2^p, then h(k) is just the least significant p bits of k
  – p = 1 ⇒ m = 2 ⇒ h(k) ∈ {0, 1}, the least significant 1 bit of k
  – p = 2 ⇒ m = 4 ⇒ h(k) ∈ {0, 1, 2, 3}, the least significant 2 bits of k
• Choose m to be a prime, not close to a power of 2
• [Table: sample keys k alongside k mod 97 (column 2) and k mod 100 (column 3).]
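A quick demo of why the choice of m matters; the keys below are my own illustrative values, all ending in the same two decimal digits, so with m = 100 they all collide while the prime m = 97 spreads them out:

```python
# k mod 100 keeps only the last two decimal digits of k (much like m = 2^p
# keeps only the low-order bits), so these keys all collide for m = 100.
keys = [2318, 5418, 9518, 1218, 3618]   # illustrative keys ending in "18"
for k in keys:
    print(f"{k}: k mod 97 = {k % 97:2d}, k mod 100 = {k % 100}")
```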

Probing

• Without using linked lists
• Use a larger table and try successive locations

Common Open Addressing Methods

• Linear probing
• Quadratic probing
• Double hashing

Linear probing: Inserting a key

• Idea: when there is a collision, check the next available position in the table (i.e., probing)

h(k, i) = (h1(k) + i) mod m,   i = 0, 1, 2, . . .

• First slot probed: h1(k)

• Second slot probed: h1(k) + 1

• Third slot probed: h1(k) + 2, and so on
• Probe sequence: < h1(k), h1(k) + 1, h1(k) + 2, . . . >
• At most m distinct probe sequences can be generated. Why? Because the sequence wraps around to the start of the table.

• Insert keys 89, 18, 49, 58, 69
• h(x) = x mod 10

Slot: 0   1   2   3   4   5   6   7   8   9
Key:  49  58  69   -   -   -   -   -  18  89

• 89 goes to slot 9 and 18 to slot 8
• 49 collides with 89; wrapping around, it is placed in the next free location, 0
• 58 collides with 18; the next available place is 1
• 69 lands in location 2

• Some slots tend to be crowded, forming a cluster (primary clustering)
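A linear-probing insertion sketch that reproduces the table above, assuming h1(k) = k mod m; the function name is illustrative:

```python
def linear_probe_insert(table, k):
    m = len(table)
    for i in range(m):                  # at most m probes before the table is full
        slot = (k % m + i) % m          # h(k, i) = (h1(k) + i) mod m, wrapping around
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("table is full")

table = [None] * 10
for k in [89, 18, 49, 58, 69]:
    print(k, "->", linear_probe_insert(table, k))
print(table)    # [49, 58, 69, None, None, None, None, None, 18, 89]
```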

• Insert keys 89, 18, 49, 58, 69
• h(x) = x mod 12

Slot: 0   1   2   3   4   5   6   7   8   9  10  11
Key:   -  49   -   -   -  89  18   -   -  69  58   -

• 89 → 5, 18 → 6, 49 → 1, 58 → 10, 69 → 9
• No collisions
• Time to search: O(1)

Linear probing: Searching for a key

• Three cases:
  (1) Position in table is occupied with an element of equal key
  (2) Position in table is empty
  (3) Position in table is occupied with a different element
• Case 3: probe the next higher index until the element is found or an empty position is found
• The process wraps around to the beginning of the table

Linear probing: Deleting a key

• Problems:
  – Cannot simply mark the slot as empty
  – Otherwise it would be impossible to retrieve keys inserted after that slot was occupied
• Solution:
  – Mark the slot with a sentinel value DELETED
  – The deleted slot can later be used for insertion
  – Searching will still be able to find all the keys

Problem

• Some slots become more likely than others
• Long chunks of occupied slots are created ⇒ search time increases!

[Figure: initially, every slot has probability 1/m of receiving the next key; once clusters form, the slot at the end of a cluster becomes increasingly likely, e.g., slot b: 2/m, slot d: 4/m, slot e: 5/m.]

Quadratic Probing

• To overcome primary clustering, the scheme of quadratic probing is proposed
• The probe offset is f(i) = i^2 instead of i:
  h(k, i) = (h1(k) + i^2) mod m,   i = 0, 1, 2, . . .

Insert 89, 18, 49, 58, 69 (using h(x) = x mod 10)

Slot: 0   1   2   3   4   5   6   7   8   9
Key:  49   -  58  69   -   -   -   -  18  89

• 49 collides with 89; try with i = 1, and it goes to location 0
• 58 collides with 18, then collides with 89 with i = 1; try with i = 2 (4 cells away), and it goes to location 2
• 69 collides with 89, then with 49, so try with i = 2; it finds an empty slot at 3
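A quadratic-probing insertion sketch reproducing this example, again assuming h1(k) = k mod m:

```python
def quadratic_probe_insert(table, k):
    m = len(table)
    for i in range(m):
        slot = (k % m + i * i) % m      # h(k, i) = (h1(k) + i^2) mod m
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("no empty slot found along the probe sequence")

table = [None] * 10
for k in [89, 18, 49, 58, 69]:
    print(k, "->", quadratic_probe_insert(table, k))
print(table)    # [49, None, 58, 69, None, None, None, None, 18, 89]
```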


Quadratic Probing with a Prime Table Size (TS)

• If the table size is chosen to be a prime, a place to hold an element can be found as long as the table is not yet half full.
• The first TS/2 alternative locations are all distinct. Two of these locations are
  – (h(x) + i^2) mod TS and
  – (h(x) + j^2) mod TS, where i and j are both less than TS/2
• To prove that the locations are distinct, suppose for the sake of contradiction that while i and j are different, the two locations turn out to be the same.

Can i and j point to the same location?

• Then:
  – h(x) + i^2 = h(x) + j^2 (mod TS)
  – i^2 = j^2 (mod TS)
  – i^2 - j^2 = 0 (mod TS)
  – (i + j)(i - j) = 0 (mod TS)
• Since the table size is prime, it follows that either
  – (i + j) = 0 (mod TS), not possible since both i and j are less than TS/2, or
  – (i - j) = 0 (mod TS), not possible since i and j are distinct.
• Thus the first TS/2 locations are distinct.

Example

• Consider a table of size 37.
• Let h(x) be 26. What are the alternative locations?
  – for i = 1: 26 + 1 = 27
  – for i = 2: 26 + 4 = 30
  – for i = 3: 26 + 9 = 35
  – for i = 4: 26 + 16 = 42 mod 37 = 5
  – for i = 5: 26 + 25 = 51 mod 37 = 14
• Such data structures do not support standard deletion: an occupied cell might have caused a later key's collision to go past it, so it cannot simply be emptied.
• One could use lazy deletion (mark the cell with a flag).
• When the table gets half full, enlarge the hash table.
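A quick numeric check of the claim for this example (TS = 37, h(x) = 26):

```python
# For a prime table size TS, the probe locations (h + i*i) mod TS should be
# pairwise distinct for i = 0, 1, ..., TS // 2, as argued above.
TS, h = 37, 26
locations = [(h + i * i) % TS for i in range(TS // 2 + 1)]
print(locations[:6])                          # [26, 27, 30, 35, 5, 14]
print(len(set(locations)) == len(locations))  # True: no repeats in the first half
```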



Double Hashing

• We use a second hash function h2(x).
• On a collision we probe at offsets h2(x), 2 h2(x), 3 h2(x), . . . from the home position.
• A good choice is h2(x) = R - (x mod R), where R is a prime number smaller than TS.
• Consider the problem of inserting 89, 18, 49, 58, 69 into a table of size 10. Let R = 7.
• 49 gets a collision at position 9:
  – h2(49) = 7 - (49 mod 7) = 7
  – Count 7 positions from there (wrapping around), which lands 49 in slot 6

Slot: 0   1   2   3   4   5   6   7   8   9
Key:   -   -   -   -   -   -  49   -  18  89

Double Hashing

• 58: h2(58) = 7 - (58 mod 7) = 5, so from slot 8 it goes to slot (8 + 5) mod 10 = 3
• 69: h2(69) = 7 - (69 mod 7) = 1, so from slot 9 it goes to slot 0
• Now try 60: h2(60) = 7 - (60 mod 7) = 3
  – Collision with 58, try 2 h2(60) = 6
  – Collision with 49, try 3 h2(60) = 9
  – Collision with 89, try 4 h2(60) = 12, and (0 + 12) mod 10 = 2
• Now try 23. Problem?
  – The probe sequence only ever visits slots 3 and 8, both occupied, so it never finds an empty slot
  – The table size is small and not prime

Slot: 0   1   2   3   4   5   6   7   8   9
Key:  69   -  60  58   -   -  49   -  18  89
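A double-hashing insertion sketch that reproduces the table above, assuming h1(x) = x mod 10 and h2(x) = R - (x mod R) with R = 7; the function name is illustrative:

```python
def double_hash_insert(table, k, R=7):
    m = len(table)
    h1, h2 = k % m, R - (k % R)
    for i in range(m):
        slot = (h1 + i * h2) % m        # probe h1, h1 + h2, h1 + 2*h2, ... (mod m)
        if table[slot] is None:
            table[slot] = k
            return slot
    # Inserting 23 ends up here: its probe sequence cycles between occupied
    # slots 3 and 8, the "problem" noted above (table size small, not prime).
    raise RuntimeError("probe sequence found no empty slot")

table = [None] * 10
for k in [89, 18, 49, 58, 69, 60]:
    print(k, "->", double_hash_insert(table, k))
print(table)    # [69, None, 60, 58, None, None, 49, None, 18, 89]
```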

A Different Double Hashing Style

(1) Use one hash function to determine the first slot
(2) Use a second hash function to determine the increment for the probe sequence

h(k, i) = (h1(k) + i h2(k)) mod m,   i = 0, 1, . . .

• Initial probe: h1(k)

• Each subsequent probe is offset by h2(k) (mod m), and so on
• Advantage: avoids clustering

Different Double Hashing: Example

h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k, i) = (h1(k) + i h2(k)) mod 13

• Insert key 14:
  – h(14, 0) = 14 mod 13 = 1 (slot 1 is occupied by 79)
  – h(14, 1) = (h1(14) + h2(14)) mod 13 = (1 + 4) mod 13 = 5 (slot 5 is occupied by 98)
  – h(14, 2) = (h1(14) + 2 h2(14)) mod 13 = (1 + 8) mod 13 = 9 (slot 9 is empty, so 14 is placed there)

Slot: 0   1   2   3   4   5   6   7   8   9  10  11  12
Key:   -  79   -   -  69  98   -  72   -  14   -  50   -
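The probe sequence for key 14 can be checked in a couple of lines:

```python
# h1(k) = k mod 13, h2(k) = 1 + (k mod 11), probes at (h1 + i*h2) mod 13
m, k = 13, 14
h1, h2 = k % 13, 1 + (k % 11)
print([(h1 + i * h2) % m for i in range(3)])   # [1, 5, 9]: slots 1 and 5 are taken, 14 lands in 9
```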

Rehashing

• When the table gets too full, the running time starts getting large
• Solution: double the table size and, with a new hash function, insert the old elements into the new table
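A rehashing sketch under the assumption that the new size is the next prime after doubling and that the table uses linear probing with h(k) = k mod m; next_prime and rehash are illustrative helpers, not from the slides:

```python
def next_prime(n):
    """Smallest prime >= n (simple trial division; fine for a sketch)."""
    def is_prime(x):
        return x > 1 and all(x % d for d in range(2, int(x ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n

def rehash(old_table):
    """Roughly double the table size and reinsert every key with the new modulus."""
    new_m = next_prime(2 * len(old_table))
    new_table = [None] * new_m
    for k in old_table:
        if k is None:
            continue
        i = 0
        while new_table[(k % new_m + i) % new_m] is not None:   # linear probing
            i += 1
        new_table[(k % new_m + i) % new_m] = k
    return new_table

print(len(rehash([49, 58, 69, None, None, None, None, None, 18, 89])))   # 23
```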

Extendible Hashing

• When data is too large to fit in main memory, the main consideration is the number of disk accesses.
• Rehashing is very expensive, as all entries have to be inserted all over again into a new table.
• Borrow an idea from B-trees.
• Let M be the number of records fitting in one disk block. As M increases, the depth of a B-tree decreases (but the branching factor increases, so processing time increases).

• The strategy used in extendible hashing is to reduce the time to search for the appropriate leaf.

• Let the numbers be hashed to 6-bit integers. Create a pointer table (the directory) of size 4, with each cell pointing to a leaf determined by the first 2 bits of a number (D = 2).
• Let us assume that each leaf can hold up to M = 4 elements.

48 00 01 10 11

000100 100000 001000 010100 101000 111000 011000 111001 001010 101100 001011 101110
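A few lines that rebuild this directory by grouping each 6-bit key on its first D = 2 bits; a plain dict of strings stands in for the directory and its leaves:

```python
keys = ["000100", "001000", "001010", "001011", "010100", "011000",
        "100000", "101000", "101100", "101110", "111000", "111001"]

D = 2                                   # global depth: the directory has 2**D = 4 entries
directory = {}
for k in keys:
    directory.setdefault(k[:D], []).append(k)   # leaf chosen by the first D bits

for prefix in sorted(directory):
    print(prefix, "->", directory[prefix])      # each leaf holds at most M = 4 keys here
```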

What happens when a leaf gets full

• Suppose we want to insert 100100. This should go to the third leaf (prefix 10), but that leaf is already full.
• So we split this leaf into 2 leaves, one for prefix 100 and one for prefix 101.
• This results in D being changed to 3 (each leaf is now determined by 3 bits).
• Note that all leaves not involved in the split are now pointed to by two adjacent directory entries.
• The directory is new, but the other leaves are not disturbed.

50 010 101 110 000 001 011 100 111

000100 100000 111000 101000 001000 010100 100100 111001 101100 001010 011000 001011 101110

• If key 000000 is now inserted, then the first leaf needs to be split (the others are not disturbed).
• The scheme is a very simple strategy that gives quick access times for insert and search operations on large databases.

• It helps if the bits are fairly random. This can be accomplished by hashing the keys into a reasonably long integer.

• Balanced search trees are quite expensive to implement for storing a large number of data values. If there is any suspicion that the data might be sorted, hashing would be the data structure of choice.

The end
