Hash Function
Total Page:16
File Type:pdf, Size:1020Kb
Motivation Introduction to Algorithms z Arrayyps provide an indirect way to access a set. z Many times we need an association between Hash Tables two sets, or a set of keys and associated data. z Ideally we would like to access this data directly with the keys. CSE 680 z We would like a data structure that supports fast searchih, inserti on, and ddlti deletion. Prof. Roger Crawfis z Do not usually care about sorting. z The abstract data type is usually called a Dictionary or Partial Map z float googleStockPrice = stocks[“Goog”].CurrentPrice; Dictionaries Direct Addressing z What is the best wayyp to implement this? z Let’ s look at an easy case, suppose: z Linked Lists? z The range of keys is 0..m-1 z Double Linked Lists? z Keys are distinct z Queues? z Stacks? z Possible solution z Multiple indexed arrays (e.g., data[key[i]])? z Set up an array T[0..m-1] in which z To answer this, as k w ha ttht the comp lex ity o fthf the z T[i] = x if x∈ T and key[x] = i operations are: z T[i] = NULL otherwise z Insertion z This is called a direct-address table z Deletion z Operations take O(1) time! z Search z So what’s the problem? Direct Addressing Hash Table z Direct addressing works well when the z Hash Tables provide O(1) support for all range m of keys is relatively small of these operations! z But what if the keys are 32-bit integers? z The key is rather than index an array z Problem 1: direct-address table will have directly, index it through some function, 232 entries, more than 4 billion h(x), called a hash function. z Problem 2: even if memory is not an issue, the z myArray[ h(index) ] time to initialize the elements to NULL may be z Keyyq questions: z Solution: map keys to smaller range 0..p-1 z What is the set that the x comes from? z Desire p = O(m). z What is h() and what is its range? Hash Table Hash Functions z Consider this problem: z In ggpygpeneral a difficult problem. Try something simpler. z If I know a prior the m keys from some finite U 0 set U, is it possible to develop a function (universe of keys) h(x) that will uniquely map the m keys onto h(k1) k1 the set of numbers 0..m-1? h(k4) K k4 k5 (actual h(k ) = h(k ) keys) 2 5 k2 h(k ) k3 3 m-1 Hash Functions Hash Functions z A collision occurs when h(x) maps two keys to the z A hash function, h, mappys keys of a g iven t ype to same ltilocation. integers in a fixed interval [0, N − 1] U 0 z Example: (universe of keys) h(x) = x mod N collisionh(k1) is a hash function for integer keys k 1 h(k ) 4 z ThThee inintegerteger h(x) iiss cacalledlled tthehe hhashash vvaluealue ooff x. K k4 (actual k5 z A hash table for a given key type consists of h(k2) = h(k5) keys) z Hash function h z Array (called table) of size N k2 k h(k3) 3 z The goal is to store item (k, o) at index i = h(k) p -1 Example Example z We design a hash table 0 z Our hash table uses an 0 ∅ array ofif size N = 100. ∅ storing employees 1 1 025-612-0001 z We have n = 49 025-612-0001 records using their 2 2 981-101-0002 employees. 981-101-0002 social security number, 3 z Need a method to handle 3 SSN as the key. ∅ collisions. ∅ 4 4 451-229-0004 451-229-0004 z SSN is a nine-digit z As long as the chance … positive in teger … for co llis ion is low, we z Our hash table uses an can achieve this goal. 9997 z 9997 array of size N = 10,000 ∅ Setting N = 1000 and ∅ and the hash function 9998 looking at the last four 9998 200-751-9998 digits will reduce the 200-751-9998 h(x) = last four digits of x 9999 9999 176-354-9998 ∅ chance of collision. ∅ Collisions Chaining z Can collisions be avoided? z Chaining puts elements that hash to the z In general, no. See perfect hashing for the case same slot in a linked list: were the set of keys is static (not covered). U —— (universe of keys) k k —— z Two primary techniques for resolving 1 4 —— k collisions: 1 —— K k4 z Chaining – keep a collection at each key k5 —— (actual k k k k —— slot. keyy)s) 7 5 2 7 —— k z Open addressing – if the current slot is full k2 3 k8 k3 —— k6 use the next open one. k8 k6 —— —— Chaining Chaining z How do we insert an element? z How do we delete an element? z Do we need a doubly-linked list for efficient delete? U —— U —— (universe of keys) k1 k4 —— (universe of keys) k1 k4 —— —— —— k k 1 —— 1 —— k4 k4 K k5 —— K k5 —— (actual k k k k —— (actual k k k k —— keyy)s) 7 5 2 7 keyy)s) 7 5 2 7 —— —— k k k2 3 k2 3 k8 k3 —— k8 k3 —— k6 k6 k8 k6 —— k8 k6 —— —— —— Chaining Oppgen Addressing z How do we search for a element with a z Basic idea: z To insert: if slot is full, try another slot, …, until given key? T an opp(en slot is found (ppgrobing) U —— (universe of keys) k1 k4 —— z To search, follow same sequence of probes as —— would be used when inserting the element k1 —— z If reach element with correct key, return it k4 K k5 —— z If reach a NULL pointer, element is not in table (actual k k k k —— keys) 7 5 2 7 z Goo dfd for fidfixed se ts ( (ddiadding btbut no dlti)deletion) —— k k2 3 z Example: spell checking k8 k3 —— k6 k8 k6 —— —— Oppgen Addressing Probing z The colliding item is placed in a z They key question is what should the different cell of the table. next cell to try be? z No dynamic memory. z Random would be great, but we need to z Fixed Table size. z Load factor: n/N, where n is the number be able to repeat it. ofitf items t o st ore and dNth N the si ze of fth the h as h z Three common tec hn iques: table. z Linear Probing (useful for discussion only) z Cleary, n ≤ NorN, or n/N ≤ 1. z Quadratic Probing z To get a reasonable performance, n/N<0.5. z Double Hashing Linear Probing Search with Linear Probing z Consider a hash table A that Algorithm get(k) z Linear probing handles z Example: uses linear probing i ← h(k) collisions by placing the z h(x) = x mod 13 z get(k) p ← 0 colliding item in the next z We start at cell h(k) z Insert keys 18, 41, 22, 44, repeat (circularly) available table z We p robe con secutiv e 59, 32, 31, 73, in this locations until one of the c ← A[i] cell. order following occurs if c =∅ z z An item with key k is found, Each table cell inspected or return null is referred to as a probe. z AllifdAn empty cell is found, or eliflse if c.key () = k z N cells have been return c.element() z Colliding items lump 0123456789101112 unsuccessfully probed else together, causing future z To ensure the efficiency, if k collis ions to cause a is not in the table, we want to i ← (i + 1) mod N find an empty cell as soon as p ← p + 1 41 18 44 59 32 22 31 73 longer sequence of possible. The load factor can until p = N 0123456789101112 NOT be close to 1. probes. return null Linear Probing Uppgdates with Linear Probing z To handle insertions and z put(k, o) z Search for key=20 . z Example: dldelet ions, we intro duce a z We throw an exception if the z h(20)=20 mod 13 =7. z h(x) = x mod 13 special object, called table is full z Go through rank 8, 9, …, 12, z Insert keys 18, 41, 22, 44, AVAILABLE, which replaces z We start at cell h(k) 0. 59, 32, 31, 73, 12, 20 in deleted elements z We probe consecutive cells this order z remove(k) until one of the following z Search for key=15 occurs z We search for an entry with z h(()15)=15 mod 13=2. key k z A cell i is found that is either empty or stores z Go through rank 2, 3 and z If such an entry (k, o) is AVAILABLE, or found, we replace it with the return null. 0123456789101112 z N cells have been special item AVAILABLE unsuccessfully probed and we return element o z We store entry (k, o) in cell i z Have to modify other methods 20 41 18 44 59 32 22 31 73 12 to skip available cells. 0123456789101112 Quadratic Probing Quadratic Probing z Primary clustering occurs with linear z Suppose that an element should appear probing because the same linear pattern: in bin h: z if a bin is inside a cluster, then the next bin z if bin h is occupied, then check the following must either: sequence of bins: z also be in that cluster, or h +1+ 12, h +2+ 22, h +3+ 32, h +4+ 42, h +5+ 52, ..