7.7 Hashing
Dictionary:
▶ S.insert(x): Insert an element x.
▶ S.delete(x): Delete the element pointed to by x.
▶ S.search(k): Return a pointer to an element e with key[e] = k in S if it exists; otherwise return null.

So far we have implemented the search for a key by carefully choosing split-elements. Then the memory location of an object x with key k is determined by successively comparing k to split-elements.

Hashing tries to directly compute the memory location from the given key. The goal is to have constant search time.

Definitions:
▶ Universe U of keys, e.g., U ⊆ N0. U very large.
▶ Set S ⊆ U of keys, |S| = m ≤ n.
▶ Array T[0, …, n − 1], the hash-table.
▶ Hash function h : U → [0, …, n − 1].

The hash function h should fulfill:
▶ Fast to evaluate.
▶ Small storage requirement.
▶ Good distribution of elements over the whole table.

EADS 7.7 Hashing © Ernst Mayr, Harald Räcke

Ideally the hash function maps all keys to different memory locations.

[Figure: keys k1, k3, k6, k7 from the universe U mapped to pairwise distinct positions of the hash-table.]

This special case is known as Direct Addressing. It is usually very unrealistic, as the universe of keys typically is quite large, and in particular larger than the available memory.

Suppose that we know the set S of actual keys (no insert/no delete). Then we may want to design a simple hash function that maps all these keys to different memory locations. Such a hash function h is called a perfect hash function for the set S.

Typically, collisions do not first appear once the size of the set S of actual keys gets close to n, but already once |S| ≥ ω(√n).

If we do not know the keys in advance, the best we can hope for is that the hash function distributes keys evenly across the table.
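The setting above can be sketched in a few lines of Python. The table size n, the keys, and the modular hash function below are illustrative assumptions, not part of the slides:

```python
n = 13
T = [None] * n                 # hash-table T[0 .. n-1]

def h(key: int) -> int:
    """A simple modular hash function h : U -> [0, ..., n-1].
    Fast to evaluate, needs no extra storage; how well it
    distributes elements depends on the key set."""
    return key % n

# These three keys happen to hash to pairwise distinct positions,
# so each one can be stored directly at its hashed slot.
for k in (5, 20, 42):
    T[h(k)] = k

print(h(42))  # -> 3
```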
Problem: Collisions
Usually the universe U is much larger than the table-size n. Hence, there may be two elements k1, k2 from the set S that map to the same memory location (i.e., h(k1) = h(k2)). This is called a collision.

Uniform hashing: Choose a hash function uniformly at random from all functions f : U → [0, …, n − 1].

Lemma 21
The probability of having a collision when hashing m elements into a table of size n under uniform hashing is at least

  1 − e^(−m(m−1)/(2n)) ≈ 1 − e^(−m²/(2n)).

Proof.
Let A_{m,n} denote the event that inserting m keys into a table of size n does not generate a collision. Then

  Pr[A_{m,n}] = ∏_{ℓ=1}^{m} (n − ℓ + 1)/n = ∏_{j=0}^{m−1} (1 − j/n)
              ≤ ∏_{j=0}^{m−1} e^(−j/n) = e^(−∑_{j=0}^{m−1} j/n) = e^(−m(m−1)/(2n)).

Here the first equality follows since the ℓ-th element that is hashed has a probability of (n − ℓ + 1)/n to not generate a collision under the condition that the previous elements did not induce collisions. The inequality 1 − x ≤ e^(−x) is derived by stopping the Taylor expansion of e^(−x) after the second term.

[Figure: the graphs of f(x) = 1 − x and g(x) = e^(−x), illustrating 1 − x ≤ e^(−x).]

Resolving Collisions
The methods for dealing with collisions can be classified into two main types:
▶ open addressing, aka. closed hashing;
▶ hashing with chaining, aka. closed addressing, open hashing.

Hashing with Chaining
Arrange elements that map to the same position in a linear list.
▶ Access: compute h(x) and search the list for key[x].
▶ Insert: insert at the front of the list.

[Figure: keys from S chained in linear lists hanging off their hashed table positions.]

Let A denote a strategy for resolving collisions.
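Lemma 21's bound and the exact product from the proof can be checked numerically. A small sketch (the function names are mine, not from the slides):

```python
import math

def collision_lower_bound(m: int, n: int) -> float:
    """Lower bound 1 - e^(-m(m-1)/(2n)) on the collision probability
    when m keys are hashed uniformly into a table of size n (Lemma 21)."""
    return 1.0 - math.exp(-m * (m - 1) / (2 * n))

def exact_no_collision(m: int, n: int) -> float:
    """Exact Pr[A_{m,n}] = prod_{j=0}^{m-1} (1 - j/n) from the proof."""
    p = 1.0
    for j in range(m):
        p *= 1.0 - j / n
    return p

# With m on the order of sqrt(n), the collision probability is
# already a constant, matching the |S| >= omega(sqrt(n)) remark:
n = 10_000
m = 200   # 2 * sqrt(n)
print(collision_lower_bound(m, n))     # ~0.863
print(1.0 - exact_no_collision(m, n))  # exact value, slightly larger
```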
We use the following notation:
▶ A+ denotes the average time for a successful search when using A;
▶ A− denotes the average time for an unsuccessful search when using A;
▶ We parameterize the complexity results in terms of α := m/n, the so-called fill factor of the hash-table.

We assume uniform hashing for the following analysis.

Hashing with Chaining
The time required for an unsuccessful search is 1 plus the length of the list that is examined. The average length of a list is α = m/n. Hence, if A is the collision resolving strategy "Hashing with Chaining" we have

  A− = 1 + α.

Note that this result does not depend on the hash function that is used.

For a successful search observe that we do not choose a list at random, but we consider a random key k in the hash-table and ask for the search-time for k. This is 1 plus the number of elements that lie before k in k's list.

Let k_ℓ denote the ℓ-th key inserted into the table. For two keys k_i and k_j, let X_ij denote the event that i and j hash to the same position. Clearly, Pr[X_ij = 1] = 1/n for uniform hashing.

The expected successful search cost (the cost for key k_i, averaged over all m keys, where the inner sum counts the keys inserted after k_i that land before k_i in its list) is

  E[(1/m) ∑_{i=1}^{m} (1 + ∑_{j=i+1}^{m} X_ij)]
    = (1/m) ∑_{i=1}^{m} (1 + ∑_{j=i+1}^{m} E[X_ij])
    = (1/m) ∑_{i=1}^{m} (1 + ∑_{j=i+1}^{m} 1/n)
    = 1 + (1/(mn)) ∑_{i=1}^{m} (m − i)
    = 1 + (1/(mn)) · m(m − 1)/2
    = 1 + (m − 1)/(2n)
    = 1 + α/2 − α/(2m).

Hence, the expected cost for a successful search is A+ ≤ 1 + α/2.
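A minimal chaining table along the lines above, with insertion at the front of each list and a search that scans the list; the class and method names are illustrative assumptions:

```python
class ChainedHashTable:
    """Hashing with chaining: each slot of T holds a list of
    (key, value) pairs; new elements go to the front of the list."""

    def __init__(self, n: int):
        self.n = n
        self.table = [[] for _ in range(n)]
        self.m = 0                       # number of stored elements

    def _h(self, key) -> int:
        # Placeholder hash function; the analysis assumes uniform hashing.
        return hash(key) % self.n

    def insert(self, key, value) -> None:
        # Insert at the front of the list: O(1), no search needed.
        self.table[self._h(key)].insert(0, (key, value))
        self.m += 1

    def search(self, key):
        # Cost: 1 + number of elements before `key` in its list.
        for k, v in self.table[self._h(key)]:
            if k == key:
                return v
        return None                      # unsuccessful search

    def fill_factor(self) -> float:
        return self.m / self.n           # alpha = m/n
```

With α = m/n the average unsuccessful search touches 1 + α entries, independent of how full individual lists happen to be.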
Open Addressing
All objects are stored in the table itself. Define a function h(k, j) that determines the table-position to be examined in the j-th step. The values h(k, 0), …, h(k, n − 1) must form a permutation of 0, …, n − 1.

Search(k): Try position h(k, 0); if it is empty your search fails; otherwise continue with h(k, 1), h(k, 2), ….
Insert(x): Search until you find an empty slot; insert your element there. If your search reaches h(k, n − 1), and this slot is non-empty, then your table is full.

Choices for h(k, j):
▶ h(k, i) = (h(k) + i) mod n. Linear probing.
▶ h(k, i) = (h(k) + c1·i + c2·i²) mod n. Quadratic probing.
▶ h(k, i) = (h1(k) + i·h2(k)) mod n. Double hashing.

For quadratic probing and double hashing one has to ensure that the search covers all positions in the table (i.e., for double hashing h2(k) must be relatively prime to n; for quadratic probing c1 and c2 have to be chosen carefully).

Linear Probing
▶ Advantage: Cache-efficiency. The new probe position is very likely to be in the cache.
▶ Disadvantage: Primary clustering. Long sequences of occupied table-positions get longer, as they have a larger probability to be hit. Furthermore, they can merge, forming even larger sequences.

Lemma 22
Let L be the method of linear probing for resolving collisions:

  L+ ≈ (1/2)·(1 + 1/(1 − α))
  L− ≈ (1/2)·(1 + 1/(1 − α)²)

Quadratic Probing
▶ Not as cache-efficient as linear probing.
▶ Secondary clustering: caused by the fact that all keys mapped to the same position have the same probe sequence.

Lemma 23
Let Q be the method of quadratic probing for resolving collisions:

  Q+ ≈ 1 + ln(1/(1 − α)) − α/2
  Q− ≈ 1/(1 − α) + ln(1/(1 − α)) − α

Double Hashing
▶ Any probe into the hash-table usually creates a cache-miss.
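The three probe sequences can be sketched as follows (function names and the concrete constants are illustrative). The assertions show the coverage condition: linear probing, and double hashing with gcd(h2, n) = 1, visit every slot, while quadratic probing with carelessly chosen c1, c2 does not:

```python
from math import gcd

def linear_probe(h, n):
    """Probe sequence h(k, i) = (h(k) + i) mod n."""
    return [(h + i) % n for i in range(n)]

def quadratic_probe(h, n, c1=1, c2=1):
    """Probe sequence h(k, i) = (h(k) + c1*i + c2*i^2) mod n.
    c1, c2 must be chosen so the sequence covers every slot;
    the defaults here are deliberately naive."""
    return [(h + c1 * i + c2 * i * i) % n for i in range(n)]

def double_probe(h1, h2, n):
    """Probe sequence h(k, i) = (h1(k) + i*h2(k)) mod n.
    Covers all slots iff gcd(h2, n) = 1."""
    return [(h1 + i * h2) % n for i in range(n)]

n = 7
assert sorted(linear_probe(3, n)) == list(range(n))      # a permutation
assert gcd(5, n) == 1
assert sorted(double_probe(3, 5, n)) == list(range(n))   # a permutation
# c1 = c2 = 1 with n = 7 revisits slots and misses others:
assert len(set(quadratic_probe(0, n))) < n
```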
Lemma 24
Let D be the method of double hashing for resolving collisions:

  D+ ≈ (1/α)·ln(1/(1 − α))
  D− ≈ 1/(1 − α)

Some values:

  α    | L+   | L−    | Q+   | Q−    | D+   | D−
  0.5  | 1.5  | 2.5   | 1.44 | 2.19  | 1.39 | 2
  0.9  | 5.5  | 50.5  | 2.85 | 11.40 | 2.55 | 10
  0.95 | 10.5 | 200.5 | 3.52 | 22.05 | 3.15 | 20

Analysis of Idealized Open Address Hashing
Let X denote a random variable describing the number of probes in an unsuccessful search.
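The table values can be reproduced directly from the formulas in Lemmas 22 to 24; a small sketch (names assumed):

```python
import math

def costs(a: float):
    """Approximate successful/unsuccessful probe counts at fill
    factor alpha = a for linear probing (L), quadratic probing (Q),
    and double hashing (D), per Lemmas 22-24."""
    L_succ = 0.5 * (1 + 1 / (1 - a))
    L_fail = 0.5 * (1 + 1 / (1 - a) ** 2)
    Q_succ = 1 + math.log(1 / (1 - a)) - a / 2
    Q_fail = 1 / (1 - a) + math.log(1 / (1 - a)) - a
    D_succ = (1 / a) * math.log(1 / (1 - a))
    D_fail = 1 / (1 - a)
    return L_succ, L_fail, Q_succ, Q_fail, D_succ, D_fail

for a in (0.5, 0.9, 0.95):
    print(a, ["%.2f" % c for c in costs(a)])
```

Note how all unsuccessful-search costs blow up as α → 1, linear probing fastest of all.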