Introduction to Algorithms
Hash Tables
CSE 680
Prof. Roger Crawfis

Motivation
- Arrays provide an indirect way to access a set.
- Many times we need an association between two sets, or a set of keys and associated data.
- Ideally we would like to access this data directly with the keys.
- We would like a data structure that supports fast search, insertion, and deletion.
- We do not usually care about sorting.
- The abstract data type is usually called a Dictionary or Partial Map:
  float googleStockPrice = stocks["Goog"].CurrentPrice;

Dictionaries
- What is the best way to implement this?
  - Linked lists?
  - Doubly linked lists?
  - Queues?
  - Stacks?
  - Multiple indexed arrays (e.g., data[key[i]])?
- To answer this, ask what the complexity of the operations are:
  - Insertion
  - Deletion
  - Search

Direct Addressing
- Let's look at an easy case. Suppose:
  - The range of keys is 0..m-1
  - Keys are distinct
- Possible solution:
  - Set up an array T[0..m-1] in which
    - T[i] = x if x ∈ T and key[x] = i
    - T[i] = NULL otherwise
- This is called a direct-address table
- Operations take O(1) time!
- So what's the problem?

Direct Addressing
- Direct addressing works well when the range m of keys is relatively small
- But what if the keys are 32-bit integers?
  - Problem 1: a direct-address table would have 2^32 entries, more than 4 billion
  - Problem 2: even if memory is not an issue, the time to initialize the elements to NULL may be
- Solution: map keys to a smaller range 0..p-1
- Desire p = O(m)

Hash Tables
- Hash Tables provide O(1) support for all of these operations!
- The key idea: rather than index an array directly, index it through some function, h(x), called a hash function:
  myArray[ h(index) ]
- Key questions:
  - What is the set that x comes from?
  - What is h() and what is its range?
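The direct-address table described above can be sketched in a few lines. This is an illustrative sketch (the class and method names are my own, not from the slides): every key is its own array index, so each operation is a single array access.

```python
# Direct-address table: keys are distinct integers in 0..m-1,
# so each key indexes its own slot -- all operations are O(1).
class DirectAddressTable:
    def __init__(self, m):
        self.table = [None] * m      # T[0..m-1], all slots initially empty

    def insert(self, key, value):
        self.table[key] = value      # T[key] = x

    def delete(self, key):
        self.table[key] = None       # T[key] = NULL

    def search(self, key):
        return self.table[key]       # None means "not present"
```

The O(1) cost is exactly the cost of one array indexing operation; the price is an array as large as the whole key range, which is what breaks down for 32-bit keys.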

Hash Table
- Consider this problem:
  - If I know a priori the m keys from some finite set U, is it possible to develop a function h(x) that will uniquely map the m keys onto the set of numbers 0..m-1?

Hash Functions
- In general, a difficult problem. Try something simpler.
[Figure: a hash function h maps the actual keys K = {k1, ..., k5} from the universe of keys U into table slots 0..m-1.]

Hash Functions
- A collision occurs when h(x) maps two keys to the same location.
[Figure: the same diagram as before, with h(k2) = h(k5) marked as a collision.]

Hash Functions
- A hash function, h, maps keys of a given type to integers in a fixed interval [0, N − 1]
- Example:
  h(x) = x mod N
  is a hash function for integer keys
- The integer h(x) is called the hash value of x.
- A hash table for a given key type consists of:
  - Hash function h
  - Array (called table) of size N
- The goal is to store item (k, o) at index i = h(k)
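The definition above (hash function h plus an array of size N, storing item (k, o) at index h(k)) can be sketched directly. This skeleton deliberately ignores collisions, which the following slides address; N = 97 is an arbitrary illustrative size.

```python
# Minimal hash-table skeleton using h(x) = x mod N for integer keys.
# Collisions are NOT handled here -- a later insert into the same
# slot simply overwrites the earlier item.
N = 97                       # table size (illustrative choice)

def h(x):
    return x % N             # hash value of x

table = [None] * N

def put(k, o):
    table[h(k)] = (k, o)     # store item (k, o) at index i = h(k)

def get(k):
    entry = table[h(k)]
    return entry[1] if entry is not None and entry[0] == k else None
```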

Example
- We design a hash table storing employee records using their social security number, SSN, as the key.
- SSN is a nine-digit positive integer.
- Our hash table uses an array of size N = 10,000 and the hash function
  h(x) = last four digits of x
[Figure: slots 0..9999; e.g., 025-612-0001 at index 1, 981-101-0002 at index 2, 451-229-0004 at index 4, 200-751-9998 at index 9998; empty slots hold ∅.]

Example
- Suppose instead our hash table uses an array of size N = 100.
- We have n = 49 employees.
- We need a method to handle collisions.
- As long as the chance of collision is low, we can achieve our goal.
- Setting N = 1000 and looking at the last four digits will reduce the chance of collision.
[Figure: keys 200-751-9998 and 176-354-9998 share the last four digits and collide at index 9998.]

Collisions
- Can collisions be avoided?
- In general, no. See perfect hashing for the case where the set of keys is static (not covered).
- Two primary techniques for resolving collisions:
  - Chaining – keep a collection at each slot.
  - Open addressing – if the current slot is full, use the next open one.

Chaining
- Chaining puts elements that hash to the same slot in a linked list:
[Figure: each table slot holds a linked list; keys k1..k8 from U hash into slots, with colliding keys such as k5, k2, k7 chained together in one slot.]

Chaining
- How do we insert an element?
[Figure: the chained table from before, shown during an insert.]

Chaining
- How do we delete an element?
- Do we need a doubly-linked list for efficient delete?
[Figure: the same chained table, shown during a delete.]

Chaining
- How do we search for an element with a given key?
[Figure: the chained table; search walks the list in slot h(k).]

Open Addressing
- Basic idea:
  - To insert: if the slot is full, try another slot, ..., until an open slot is found (probing)
  - To search, follow the same sequence of probes as would be used when inserting the element
    - If we reach an element with the correct key, return it
    - If we reach a NULL pointer, the element is not in the table
- Good for fixed sets (adding but no deletion)
  - Example: spell checking
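The chaining operations asked about above (insert, delete, search) can be sketched as follows. This is an illustrative implementation, not the slides' own code; Python lists stand in for the linked lists in the figures, so delete is O(chain length) rather than O(1) even without a doubly-linked list.

```python
# Chained hash table: each slot holds a chain of (key, value) pairs.
# Insert, delete, and search all walk the chain at slot h(k).
class ChainedHashTable:
    def __init__(self, n_slots):
        self.slots = [[] for _ in range(n_slots)]

    def _h(self, key):
        return hash(key) % len(self.slots)

    def insert(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                  # key already present: replace
                chain[i] = (key, value)
                return
        chain.append((key, value))        # otherwise extend the chain

    def delete(self, key):
        i = self._h(key)
        self.slots[i] = [(k, v) for (k, v) in self.slots[i] if k != key]

    def search(self, key):
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None                       # empty chain / key absent
```

With a true singly-linked list, delete needs the predecessor node, which is why the slides ask whether a doubly-linked list is needed for efficient delete.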

Open Addressing
- The colliding item is placed in a different cell of the table.
- No dynamic memory.
- Fixed table size.
- Load factor: n/N, where n is the number of items to store and N is the size of the hash table.
- Clearly, n ≤ N, or n/N ≤ 1.
- To get reasonable performance, n/N < 0.5.

Probing
- The key question is what the next cell to try should be.
- Random would be great, but we need to be able to repeat it (useful for discussion only).
- Three common techniques:
  - Linear probing
  - Quadratic probing
  - Double hashing

Linear Probing
- Linear probing handles collisions by placing the colliding item in the next (circularly) available table cell.
- Each table cell inspected is referred to as a probe.
- Colliding items lump together, causing future collisions to cause a longer sequence of probes.
- Example:
  - h(x) = x mod 13
  - Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order
  - Resulting table (indices 0..12):
    2: 41, 5: 18, 6: 44, 7: 59, 8: 32, 9: 22, 10: 31, 11: 73 (other cells empty)

Search with Linear Probing
- Consider a hash table A that uses linear probing.
- get(k):
  - We start at cell h(k)
  - We probe consecutive locations until one of the following occurs:
    - An item with key k is found, or
    - An empty cell is found, or
    - N cells have been unsuccessfully probed
- To ensure efficiency, if k is not in the table we want to find an empty cell as soon as possible. The load factor can NOT be close to 1.

  Algorithm get(k)
    i ← h(k)
    p ← 0
    repeat
      c ← A[i]
      if c = ∅
        return null
      else if c.key() = k
        return c.element()
      else
        i ← (i + 1) mod N
        p ← p + 1
    until p = N
    return null
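The get(k) pseudocode and the insertion example above can be combined into a runnable sketch (an illustration in Python rather than the slides' own code). Running the slide's insertion sequence reproduces the table shown.

```python
# Linear probing for the slide's example: h(x) = x mod 13,
# inserting 18, 41, 22, 44, 59, 32, 31, 73 in that order.
N = 13
A = [None] * N

def h(x):
    return x % N

def insert(k):
    i = h(k)
    for _ in range(N):
        if A[i] is None:          # open slot found
            A[i] = k
            return
        i = (i + 1) % N           # probe the next cell, circularly
    raise RuntimeError("table is full")

def get(k):
    i = h(k)
    for _ in range(N):            # p = 0 .. N-1, as in the pseudocode
        if A[i] is None:          # empty cell: k is not in the table
            return None
        if A[i] == k:
            return A[i]
        i = (i + 1) % N
    return None

for key in [18, 41, 22, 44, 59, 32, 31, 73]:
    insert(key)
```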

Updates with Linear Probing
- To handle insertions and deletions, we introduce a special object, called AVAILABLE, which replaces deleted elements.
- remove(k):
  - We search for an entry with key k
  - If such an entry (k, o) is found, we replace it with the special item AVAILABLE and we return element o
  - We have to modify the other methods to skip AVAILABLE cells.
- put(k, o):
  - We throw an exception if the table is full
  - We start at cell h(k)
  - We probe consecutive cells until one of the following occurs:
    - A cell i is found that is either empty or stores AVAILABLE, or
    - N cells have been unsuccessfully probed
  - We store entry (k, o) in cell i

Example
- h(x) = x mod 13
- Insert keys 18, 41, 22, 44, 59, 32, 31, 73, 12, 20 in this order:
  0: 20, 2: 41, 5: 18, 6: 44, 7: 59, 8: 32, 9: 22, 10: 31, 11: 73, 12: 12 (other cells empty)
- Search for key = 20:
  - h(20) = 20 mod 13 = 7
  - Go through ranks 8, 9, ..., 12, 0.
- Search for key = 15:
  - h(15) = 15 mod 13 = 2
  - Go through ranks 2, 3 and return null.

Quadratic Probing
- Primary clustering occurs with linear probing because of the same linear pattern:
  - if a bin is inside a cluster, then the next bin must either:
    - also be in that cluster, or
    - expand the cluster
- Instead of searching forward in a linear fashion, consider searching forward using a quadratic function.

Quadratic Probing
- Suppose that an element should appear in bin h:
  - if bin h is occupied, then check the following sequence of bins:
    h + 1², h + 2², h + 3², h + 4², h + 5², ...
    i.e., h + 1, h + 4, h + 9, h + 16, h + 25, ...
- For example, with M = 17: see the following slides.

Quadratic Probing
- If one of the bins h + i² falls into a cluster, this does not imply the next one will.

Quadratic Probing
- For example, suppose an element was to be inserted in bin 23 in a hash table with M = 31 bins.
- The sequence in which the bins would be checked is:
  23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0
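The probe sequence above is just (h + i²) mod M for i = 0, 1, 2, ..., which a one-line helper (my own illustration) reproduces for the slide's numbers:

```python
# Quadratic probing: the i-th probe checks bin (h + i*i) mod M.
# Reproduces the slide's example: h = 23, M = 31, first 16 probes.
def quadratic_probes(h, M, count):
    return [(h + i * i) % M for i in range(count)]

seq = quadratic_probes(23, 31, 16)
# seq == [23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0]
```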

Quadratic Probing
- Even if two bins are initially close, the sequence in which subsequent bins are checked varies greatly.
- Again, with M = 31 bins, compare the first 16 bins which are checked starting with 22 and 23:
  22, 23, 26, 0, 7, 16, 27, 9, 24, 10, 29, 19, 11, 5, 1, 30
  23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0

Quadratic Probing
- Thus, quadratic probing solves the problem of primary clustering.
- Unfortunately, there is a second problem which must be dealt with.
- Suppose we have M = 8 bins:
  1² ≡ 1, 2² ≡ 4, 3² ≡ 1 (mod 8)
- In this case, we are checking bin h + 1 twice having checked only one other bin.

Quadratic Probing
- Unfortunately, there is no guarantee that h + i² mod M will cycle through 0, 1, ..., M – 1.
- Solution:
  - require that M be prime
  - in this case, h + i² mod M for i = 0, ..., (M – 1)/2 will cycle through exactly (M + 1)/2 values before repeating

Quadratic Probing
- Example with M = 11:
  0, 1, 4, 9, 16 ≡ 5, 25 ≡ 3, 36 ≡ 3
- With M = 13:
  0, 1, 4, 9, 16 ≡ 3, 25 ≡ 12, 36 ≡ 10, 49 ≡ 10
- With M = 17:
  0, 1, 4, 9, 16, 25 ≡ 8, 36 ≡ 2, 49 ≡ 15, 64 ≡ 13, 81 ≡ 13

Quadratic Probing
- Thus, quadratic probing avoids primary clustering.
- Unfortunately, we are not guaranteed that we will use all the bins.
- In reality, if the hash function is reasonable, this is not a significant problem until λ approaches 1.

Secondary Clustering
- The phenomenon of primary clustering will not occur with quadratic probing.
- However, if multiple items all hash to the same initial bin, the same sequence of numbers will be followed.
- This is termed secondary clustering.
- The effect is less significant than that of primary clustering.

Double Hashing
- Use two hash functions.
- If M is prime, eventually we will examine every position in the table.

  double_hash_insert(K)
    if (table is full) error
    probe = h1(K)
    offset = h2(K)
    while (table[probe] occupied)
      probe = (probe + offset) mod M
    table[probe] = K

Double Hashing
- Many of the same (dis)advantages as linear probing.
- Distributes keys more uniformly than linear probing does.
- Notes:
  - h2(x) should never return zero.
  - M should be prime.

Double Hashing Example
- h1(K) = K mod 13
- h2(K) = 8 - K mod 8
  - we want h2 to be an offset to add
- Insert keys 18, 41, 22, 44, 59, 32, 31, 73 in this order:
  0: 44, 2: 41, 3: 73, 5: 18, 6: 32, 7: 59, 8: 31, 9: 22 (other cells empty)

Open Addressing Summary
- In general, the hash function contains two arguments now:
  - Key value
  - Probe number
  h(k, p), p = 0, 1, ..., m-1
- Probe sequences:
  - Should be a permutation of <0, 1, ..., m-1>
  - There are m! possible permutations
  - Good hash functions should be able to produce all m! probe sequences
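The double_hash_insert pseudocode with the example's h1 and h2 can be run directly; this Python sketch (my own illustration) reproduces the example table:

```python
# Double hashing for the slide's example: M = 13,
# h1(K) = K mod 13 gives the start cell, h2(K) = 8 - (K mod 8)
# gives the step size (never zero, since K mod 8 < 8).
M = 13
table = [None] * M

def h1(K):
    return K % M

def h2(K):
    return 8 - (K % 8)

def insert(K):
    probe = h1(K)
    offset = h2(K)
    while table[probe] is not None:        # slot occupied: keep stepping
        probe = (probe + offset) % M
    table[probe] = K

for key in [18, 41, 22, 44, 59, 32, 31, 73]:
    insert(key)
```

For instance 44 starts at cell 5 (occupied by 18), steps by h2(44) = 4 to cell 9 (occupied by 22), then to cell 0, where it lands.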

Open Addressing Summary
- None of the methods discussed can generate more than m² different probe sequences.
- Linear probing: clearly, only m probe sequences.
- Quadratic probing: the initial key determines a fixed probe sequence, so only m distinct probe sequences.
- Double hashing: each possible pair (h1(k), h2(k)) yields a distinct probe sequence, so m² permutations.

Choosing a Hash Function
- Clearly, choosing the hash function well is crucial.
  - What will a worst-case hash function do?
  - What will be the time to search in this case?
- What are desirable features of the hash function?
  - Should distribute keys uniformly into slots
  - Should not depend on patterns in the data

From Keys to Indices
- A hash function is usually the composition of two maps:
  - hash code map: key → integer
  - compression map: integer → [0, N − 1]
- An essential requirement of the hash function is to map equal keys to equal indices.
- A "good" hash function minimizes the probability of collisions.

Java Hash
- Java provides a hashCode() method for the Object class, which typically returns the 32-bit memory address of the object.
- This default hash code would work poorly for Integer and String objects.
- The hashCode() method should be suitably redefined by classes.

Popular Hash-Code Maps
- Integer cast: for numeric types with 32 bits or less, we can reinterpret the bits of the number as an int.
- Component sum: for numeric types with more than 32 bits (e.g., long and double), we can add the 32-bit components.

Popular Hash-Code Maps
- Polynomial accumulation: for strings of a natural language, combine the character values (ASCII or Unicode) a0, a1, ..., a(n-1) by viewing them as the coefficients of a polynomial:
  a0 + a1·x + ... + a(n-1)·x^(n-1)

Popular Hash-Code Maps
- The polynomial is computed with Horner's rule, ignoring overflows, at a fixed value x:
  a0 + x (a1 + x (a2 + ... x (a(n-2) + x·a(n-1)) ... ))
- The choice x = 33, 37, 39, or 41 gives at most 6 collisions on a vocabulary of 50,000 English words.
- Why is the component-sum hash code bad for strings?

Random Hashing
- Random hashing:
  - Uses a simple random number generation technique
  - Scatters the items "randomly" throughout the hash table
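Horner's rule turns the polynomial into one multiply-and-add per character. This sketch (an illustration; the 32-bit mask stands in for "ignoring overflows" in a fixed-width integer type) also answers the slide's closing question: a component sum cannot distinguish anagrams, while polynomial accumulation can.

```python
# Polynomial string hash evaluated with Horner's rule at a fixed x,
# kept in 32 bits to mimic integer overflow being ignored.
def poly_hash(s, x=33):
    h = 0
    for ch in s:
        h = (h * x + ord(ch)) & 0xFFFFFFFF   # one Horner step, mod 2^32
    return h

def component_sum_hash(s):
    return sum(ord(ch) for ch in s)          # order-blind: anagrams collide
```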

Popular Compression Maps
- Division: h(k) = |k| mod N
  - the choice N = 2^k is bad because not all the bits are taken into account
  - the table size N is usually chosen as a prime number
  - certain patterns in the hash codes are propagated
- Multiply, Add, and Divide (MAD):
  - h(k) = |ak + b| mod N
  - eliminates patterns provided a mod N ≠ 0
  - same formula used in linear congruential (pseudo) random number generators

The Division Method
- h(k) = k mod m
- In words: hash k into a table with m slots using the slot given by the remainder of k divided by m.
- What happens to elements with adjacent values of k?
- What happens if m is a power of 2 (say 2^P)?
- What if m is a power of 10?
- Upshot: pick table size m = a prime number not too close to a power of 2 (or 10).
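The "power of 2 is bad" point above is easy to see concretely: k mod 2^P keeps only the low P bits of k, so keys that differ only in their high bits all collide. A tiny illustration (my own example values):

```python
# Why m = 2^P is a poor modulus: only the low P bits of k survive.
m = 2 ** 4                        # 16 slots
keys = [0x10, 0x20, 0x30, 0x40]   # identical low 4 bits, differing high bits
slots = [k % m for k in keys]     # every key lands in slot 0

# A prime modulus mixes in the high bits and spreads the same keys out:
prime_slots = [k % 13 for k in keys]
```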

The Multiplication Method
- For a constant A, 0 < A < 1:
  h(k) = ⎣ m (kA − ⎣kA⎦) ⎦
  The term (kA − ⎣kA⎦) is the fractional part of kA.
- Choose m = 2^P.
- Choose A not too close to 0 or 1.
- Knuth: a good choice is A = (√5 − 1)/2.

Analysis of Chaining
- Assume simple uniform hashing: each key in the table is equally likely to be hashed to any slot.
- Given n keys and m slots in the table:
  the load factor α = n/m = average number of keys per slot.

Analysis of Chaining
- What will be the average cost of an unsuccessful search for a key?
- O(1 + α)

Analysis of Chaining
- What will be the average cost of a successful search?
- O(1 + α/2) = O(1 + α)

Analysis of Chaining
- So the cost of searching = O(1 + α).
- If the number of keys n is proportional to the number of slots in the table, what is α?
- A: α = O(1)
- In other words, we can make the expected cost of searching constant if we make α constant.

Analysis of Open Addressing
- Consider the load factor, α, and assume each key is uniformly hashed.
- The probability that we hit an occupied cell is then α.
- The probability that the next probe hits an occupied cell is also α.
- A probe terminates the search when it hits an unoccupied cell, which happens with probability (1 − α).
- From Theorem 11.6, the expected number of probes in an unsuccessful search is at most 1/(1 − α).
- Theorem 11.8: the expected number of probes in a successful search is at most:
  (1/α) ln( 1/(1 − α) )