CSE 332: Data Structures and Parallelism
Lecture 11: Hashing, Part 2
Adam Blank, Winter 2017

HashTable Review

Hash Tables

Provides the core Dictionary operations in O(1) time on average.

We call the key space the "universe": U. We should use this data structure
only when we expect |U| >> |T|. (Or, the key space is non-integer values.)

    E --[hash function]--> int --[mod |T|]--> fixed table index
        (hash table client)       (hash table library)

Hashing Choices

1. Choose a hash function
2. Choose a table size
3. Choose a collision resolution strategy
   - Separate Chaining
   - Linear Probing
   - Quadratic Probing
   - Double Hashing

Other issues to consider:

4. Choose an implementation of deletion
5. Choose a λ that means the table is "too full"
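Choices 1 and 2 meet in one tiny computation: the client's hash function maps
a key to an int, and the table library reduces that int mod the table size.
Here is a minimal Java sketch of that split (the method name and the
Object-keyed signature are illustrative assumptions, not course code):

    // Client/library split: the client supplies hashCode() (E -> int);
    // the library reduces it to a fixed table index with mod |T|.
    int tableIndexFor(Object key, int tableSize) {
        int hash = key.hashCode();     // client's hash function
        int index = hash % tableSize;  // library's mod step
        if (index < 0) {               // Java's % can yield negatives
            index += tableSize;
        }
        return index;
    }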

Another Consideration?

What do we do when λ (the load factor) gets too large?

We discussed the first few of these choices last time. We'll discuss the rest today.

Review: Collisions

Definition (Collision)
A collision is when two distinct keys map to the same location in the hash
table.

A good hash function attempts to avoid as many collisions as possible, but
they are inevitable. How do we deal with collisions? There are multiple
strategies:

- Separate Chaining
- Open Addressing
  - Linear Probing
  - Quadratic Probing
  - Double Hashing

Open Addressing

Definition (Open Addressing)
Open Addressing is a type of collision resolution strategy that resolves
collisions by choosing a different location when the natural choice is full.

There are many types of open addressing. Here are the key ideas:
- We must be able to duplicate the path we took.
- We want to use all the spaces in the table.
- We want to avoid putting lots of keys close together.
It turns out some of these are difficult to achieve...

Strategy #1: Linear Probing

    i = 0;
    while (index in use) {
        try (h(key) + i) % |T|
    }

Example: Insert 38, 19, 8, 109, 10 into a hash table with hash function
h(x) = x and linear probing:

      8   109   10    -    -    -    -    -   38   19
    T[0] T[1]  T[2] T[3] T[4] T[5] T[6] T[7] T[8] T[9]

(Items with the same hash code are the same color.)
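Here is what that probe loop might look like as real code; a minimal sketch,
assuming h(key) = key, nonnegative keys, an EMPTY sentinel for unused slots,
and a table that is not yet full (linear probing then always finds a slot):

    // Minimal linear-probing insert. EMPTY marks a never-used slot.
    static final int EMPTY = -1;

    void insert(int key, int[] table) {
        int h = key % table.length;
        int i = 0;
        while (table[(h + i) % table.length] != EMPTY) {
            i++;  // natural slot taken: try the next one over
        }
        table[(h + i) % table.length] = key;
    }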

Open Addressing: Linear Probing

Which criteria does linear probing meet?
- We want to use all the spaces in the table.
  Yes! Linear probing will fill the whole table.
- We want to avoid putting lots of keys close together.
  Uh... not so much.

Primary Clustering is when different keys collide to form one big group.

      8   109   10   101   20    -    -    -   38   19
    T[0] T[1]  T[2] T[3]  T[4] T[5] T[6] T[7] T[8] T[9]

Think of this as "clusters of many colors". Even though these keys are all
different, they end up in a giant cluster.

Other Operations with Linear Probing
- insert? Finds the next open spot. The worst case is O(n).
- find? We have to retrace our steps. If the insert chain was k long, then
  find is O(k).
- delete? We don't have a choice; we must use lazy deletion. What happens
  if we delete 19 and then do find(109) in our example?

      8   109   10    -    -    -    -    -   38  19(X)
    T[0] T[1]  T[2] T[3] T[4] T[5] T[6] T[7] T[8] T[9]
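A minimal sketch of find and lazy delete under linear probing (the TOMBSTONE
encoding, alongside the EMPTY sentinel from the earlier sketch, is an
assumption for illustration; the course code may represent deletion
differently):

    // find/delete with lazy deletion. TOMBSTONE marks a deleted slot;
    // keys are assumed nonnegative. find probes past tombstones, so
    // deleting 19 does not hide 109.
    static final int TOMBSTONE = -2;

    boolean find(int key, int[] table) {
        int h = key % table.length;
        for (int i = 0; i < table.length; i++) {
            int slot = table[(h + i) % table.length];
            if (slot == EMPTY) return false;  // truly empty: safe to stop
            if (slot == key) return true;     // tombstones just fall through
        }
        return false;
    }

    void delete(int key, int[] table) {
        int h = key % table.length;
        for (int i = 0; i < table.length; i++) {
            int idx = (h + i) % table.length;
            if (table[idx] == EMPTY) return;  // not present
            if (table[idx] == key) {
                table[idx] = TOMBSTONE;       // mark it; don't empty the slot
                return;
            }
        }
    }

If delete actually emptied T[9] instead, find(109) would stop there and
wrongly report that 109 is missing, even though 109 sits in T[1].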

Analyzing Linear Probing

In linear probing, we expect to get O(lg n)-size clusters. This is really
bad! But, how bad, really?

Load Factor & Space Usage
Note that λ ≤ 1, and we will eventually get to λ = 1.

Average Number of Probes
- Unsuccessful search: (1/2) (1 + 1/(1 - λ)²)
- Successful search:   (1/2) (1 + 1/(1 - λ))

[Figure: average number of probes vs. load factor; both curves stay small
for small λ, but the unsuccessful-search curve blows up past 100 probes as
λ approaches 1.]

Quadratic Probing

There's nothing theoretically wrong with open addressing that forces
primary clustering. We'd like a different (easy to compute) function to
probe with.

Open Addressing In General
Choose a new function f(x), and then probe with:

    (h(key) + f(i)) mod |T|

Strategy #2: Quadratic Probing (f(i) = i²)

    i = 0;
    while (index in use) {
        try (h(key) + i²) % |T|
    }

Example: Insert 89, 18, 49, 58, 79 into a hash table with hash function
h(x) = x and quadratic probing:

     49    -   58   79    -    -    -    -   18   89
    T[0] T[1] T[2] T[3] T[4] T[5] T[6] T[7] T[8] T[9]

    h(58) = 8, taken
      i = 1: 58 + 1² ≡ 9 (mod 10), taken
      i = 2: 58 + 2² ≡ 2 (mod 10), open → T[2]
    h(79) = 9, taken
      i = 1: 79 + 1² ≡ 0 (mod 10), taken
      i = 2: 79 + 2² ≡ 3 (mod 10), open → T[3]
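As real code, the quadratic probe loop might look like this minimal sketch,
reusing the EMPTY sentinel from the earlier sketches (capping the probe
count at the table size is an assumption; as the next example shows, the
sequence can cycle without ever finding an open slot):

    // Quadratic probing insert: try h, h+1, h+4, h+9, ... The probe
    // count is capped because the sequence can cycle forever without
    // covering the table. Keys are assumed nonnegative.
    boolean insert(int key, int[] table) {
        int h = key % table.length;
        for (int i = 0; i < table.length; i++) {
            int idx = (h + i * i) % table.length;
            if (table[idx] == EMPTY) {
                table[idx] = key;
                return true;
            }
        }
        return false;  // gave up; quadratic probing may never find a slot
    }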

Another Quadratic Probing Example

Example: Insert 76, 40, 48, 5, 55, 47 into a hash table of size 7 with hash
function h(x) = x and quadratic probing:

     48    -    5   55    -   40   76
    T[0] T[1] T[2] T[3] T[4] T[5] T[6]

    h(76) = 76 ≡ 6 (mod 7), open → T[6]
    h(40) = 40 ≡ 5 (mod 7), open → T[5]
    h(48) = 48 ≡ 6 (mod 7), taken
      i = 1: 48 + 1² ≡ 0, open → T[0]
    h(5) = 5 (mod 7), taken
      i = 1: 5 + 1² ≡ 6, taken
      i = 2: 5 + 2² ≡ 2, open → T[2]
    h(55) = 55 ≡ 6 (mod 7), taken
      i = 1: 55 + 1² ≡ 0, taken
      i = 2: 55 + 2² ≡ 3, open → T[3]
    h(47) = 47 ≡ 5 (mod 7), taken
      i = 1: 47 + 1² ≡ 6, taken
      i = 2: 47 + 2² ≡ 2, taken
      i = 3: 47 + 3² ≡ 0, taken
      i = 4: 47 + 4² ≡ 0, taken
      i = 5: 47 + 5² ≡ 2, taken
      ...

We will never probe 1 or 4! This means we will never be able to insert 47.
What's going on?

Quadratic Probing: Table Coverage

Why does insert(47) fail?
For all i, (5 + i²) mod 7 ∈ {0, 2, 5, 6}. The proof is by induction. This
actually generalizes:

    For all c, k:  (c + i²) mod k = (c + (i - k)²) mod k

So, quadratic probing doesn't always fill the table.
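The coverage claim is easy to verify by brute force. By the identity above,
the probe sequence repeats with period 7, so checking i = 0..6 checks every
slot the sequence will ever visit (a throwaway sketch, not from the slides):

    // Print every slot quadratic probing can reach from h(47) = 5 in a
    // size-7 table. The period-7 identity means i = 0..6 is exhaustive.
    import java.util.TreeSet;

    public class Coverage {
        public static void main(String[] args) {
            TreeSet<Integer> reachable = new TreeSet<>();
            for (int i = 0; i < 7; i++) {
                reachable.add((5 + i * i) % 7);
            }
            System.out.println(reachable);  // [0, 2, 5, 6] -- never 1 or 4
        }
    }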

The Good News!
If |T| is prime and λ < 1/2, then quadratic probing will find an empty slot
in at most (|T| + 1)/2 probes. So, if we keep λ < 1/2, we don't need to
detect cycles. The proof will be posted on the website.

So, does quadratic probing completely fix clustering?

Quadratic Probing: Clustering

With linear probing, we saw primary clustering (keys hashing near each
other). Quadratic probing fixes this by "jumping". Unfortunately, we still
get secondary clustering:

Secondary Clustering is when different keys hash to the same place and
follow the same probing sequence.

     39    -    -   29    -    -    -    -    9   19
    T[0] T[1] T[2] T[3] T[4] T[5] T[6] T[7] T[8] T[9]

Think of this as long probing chains of the same color. The keys all start
at the same place; so, the chain gets really long.

We can avoid secondary clustering by using a probe function that depends on
the key.

Double Hashing

Strategy #3: Double Hashing

    i = 0;
    while (index in use) {
        try (h(key) + i*g(key)) % |T|
    }

Example: Insert 13, 28, 33, 147, 43 into a hash table with:

    h(x) = x mod |T|
    g(x) = 1 + ((x / |T|) mod (|T| - 1))    (where / is integer division)

We insist g(x) ≠ 0.

      -    -    -   13    -    -    -   33   28  147
    T[0] T[1] T[2] T[3] T[4] T[5] T[6] T[7] T[8] T[9]

    h(33) = 3, taken; g(33) = 1 + (3 mod 9) = 4
      i = 1: 33 + 1·4 ≡ 7 (mod 10), open → T[7]
    h(147) = 7, taken; g(147) = 1 + (14 mod 9) = 6
      i = 1: 147 + 1·6 ≡ 3 (mod 10), taken
      i = 2: 147 + 2·6 ≡ 9 (mod 10), open → T[9]
    h(43) = 3, taken; g(43) = 1 + (4 mod 9) = 5
      i = 1: 43 + 1·5 ≡ 8 (mod 10), taken
      i = 2: 43 + 2·5 ≡ 3 (mod 10), taken
      i = 3: 43 + 3·5 ≡ 8 (mod 10), taken
      ...

We got stuck again!
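A minimal Java sketch of the double-hashing loop with the slide's h and g,
again reusing the EMPTY sentinel (the probe cap is an assumption so that
cases like 43, which alternate between slots 8 and 3 forever, terminate;
the prime-pair g from the next slide plugs into the same loop):

    // Double hashing: the step size g(key) varies per key, which breaks
    // up secondary clusters. h(x) = x mod |T|;
    // g(x) = 1 + ((x / |T|) mod (|T| - 1)).
    boolean insert(int key, int[] table) {
        int size = table.length;
        int h = key % size;
        int g = 1 + ((key / size) % (size - 1));  // never 0: we always move
        for (int i = 0; i < size; i++) {
            int idx = (h + i * g) % size;
            if (table[idx] == EMPTY) {
                table[idx] = key;
                return true;
            }
        }
        return false;  // probe sequence cycled (e.g., key 43 above)
    }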

Double Hashing Analysis

Filling the Table
Just like with quadratic probing, we sometimes hit an infinite loop with
double hashing. We will not get an infinite loop if we choose primes p, q
with 2 < q < p and:

    h(key) = key mod p
    g(key) = q - (key mod q)

Uniform Hashing
For double hashing, we assume uniform hashing, which means:

    Pr[g(key1) mod p = g(key2) mod p] = 1/p

Average Number of Probes
- Unsuccessful search: 1/(1 - λ)
- Successful search:   (1/λ) ln(1/(1 - λ))

This is way better than linear probing.
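To put numbers on "way better", here is a quick comparison of the
unsuccessful-search formulas for linear probing and double hashing at
λ = 0.9 (a throwaway check, not from the slides):

    // Expected probes for an unsuccessful search at load factor 0.9:
    //   linear probing: (1/2)(1 + 1/(1 - λ)²) = 50.5 probes
    //   double hashing:  1/(1 - λ)            = 10   probes
    public class ProbeCompare {
        public static void main(String[] args) {
            double lambda = 0.9;
            double linear = 0.5 * (1 + 1 / ((1 - lambda) * (1 - lambda)));
            double dbl = 1 / (1 - lambda);
            System.out.printf("linear: %.1f, double: %.1f%n", linear, dbl);
        }
    }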

Where We Are

Separate Chaining is Easy!
- find and delete are proportional to the load factor on average
- insert can be constant if we just push on the front of the list

Open Addressing is Tricky!
- Clustering issues
- Doesn't always use the whole table

Why use it?
- Less memory allocation
- Easier data representation

Now, let's move on to resizing the table.

Rehashing

When λ is too big, create a bigger table and copy over the items.

When To Resize
- With separate chaining, we decide when to resize (it should keep λ ≤ 1)
- With open addressing, we need to keep λ < 1/2

New Table Size?
- Like always, we want the new table to be about "twice as big"
- ...but it should still be prime
- So, choose the next prime about twice as big

How To Resize
Go through the old table and do a standard insert of each item into the new
table:
- Iterate over the old table: O(n)
- n inserts / calls to the hash function: n × O(1) = O(n)
But this is amortized O(1).
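A minimal rehash sketch for a separate-chaining table (nextPrime is a
hypothetical helper returning the first prime at least its argument;
java.util.List and java.util.LinkedList imports assumed):

    // Rebuild the table at roughly twice the size. Every key is
    // re-modded by the NEW size, so items generally change buckets.
    List<Integer>[] rehash(List<Integer>[] old) {
        int newSize = nextPrime(2 * old.length);  // next prime, ~2x bigger
        @SuppressWarnings("unchecked")
        List<Integer>[] bigger = new List[newSize];
        for (int i = 0; i < newSize; i++) {
            bigger[i] = new LinkedList<>();
        }
        for (List<Integer> bucket : old) {        // O(n) over all items
            for (int key : bucket) {
                bigger[key % newSize].add(key);
            }
        }
        return bigger;
    }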

Hashing and Comparing

A hash function isn't enough! We have to compare items:
- With separate chaining, we have to loop through the list, checking if
  each item is the one we're looking for
- With open addressing, we need to know when to stop probing

We have two options for this: equality testing or comparison testing. In
Project 2, you will use both types.

In Java, each Object has an equals method and a hashCode method:

    class Object {
        boolean equals(Object o) {...}
        int hashCode() {...}
        ...
    }

Properties of Comparable and Hashable

For any class, it must be the case that:
- If a.equals(b), then a.hashCode() == b.hashCode()
- If a.compareTo(b) == 0, then a.hashCode() == b.hashCode()
- If a.compareTo(b) < 0, then b.compareTo(a) > 0
- If a.compareTo(b) == 0, then b.compareTo(a) == 0
- If a.compareTo(b) < 0 and b.compareTo(c) < 0, then a.compareTo(c) < 0

A Good Hashcode

    int result = 17;  // start at a prime
    foreach field f:
        int fieldHashcode =
            boolean:                (f ? 1 : 0)
            byte, char, short, int: (int) f
            long:                   (int) (f ^ (f >>> 32))
            float:                  Float.floatToIntBits(f)
            double:                 Double.doubleToLongBits(f),
                                    then treat the long as above
            Object:                 f.hashCode()
        result = 31 * result + fieldHashcode;
    return result;
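Putting the recipe to work on a concrete class might look like the
following (the PhoneNumber class and its fields are made up for
illustration; note that it keeps the equals/hashCode property above):

    // Hypothetical class using the recipe: fold each field's hash into
    // the result with result = 31 * result + fieldHashcode.
    class PhoneNumber {
        private final int areaCode;   // int field
        private final long number;    // long field

        PhoneNumber(int areaCode, long number) {
            this.areaCode = areaCode;
            this.number = number;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof PhoneNumber)) return false;
            PhoneNumber p = (PhoneNumber) o;
            return areaCode == p.areaCode && number == p.number;
        }

        @Override
        public int hashCode() {
            int result = 17;                  // start at a prime
            result = 31 * result + areaCode;  // int: use directly
            result = 31 * result + (int) (number ^ (number >>> 32)); // long
            return result;
        }
    }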

Hashing Wrap-Up

Hash Tables are one of the most important data structures:
- Efficient find, insert, and delete
- Operations based on sorted order are not so efficient
- Useful in many, many real-world applications
- Popular topic for job interview questions

Important to use a good hash function:
- Good distribution; uses enough of the key's values
- Not overly expensive to calculate (bit shifts are good!)

Important to keep the hash table at a good size:
- Prime size
- The right λ depends on the type of table

What we skipped: perfect hashing, universal hash functions, ...