Hashing: Part 2


CSE 332: Data Structures and Parallelism
Lecture 11, Adam Blank, Winter 2017

Hash Table Review

A hash table provides the core dictionary operations in O(1) time on average.
We call the key space the "universe" U and the hash table T. We should use
this data structure only when we expect |U| >> |T| (or when the keys are
non-integer values).

    Client --> hash function --> int --> mod |T| --> table index
                                         --> collision? resolve --> fixed table index

Hashing Choices

1. Choose a hash function.
2. Choose a table size.
3. Choose a collision resolution strategy:
   - Separate Chaining
   - Linear Probing
   - Quadratic Probing
   - Double Hashing
4. Choose an implementation of deletion.
5. Choose a load factor λ that means the table is "too full".

Another consideration: what do we do when λ (the load factor) gets too large?
We discussed the first few of these last time. We'll discuss the rest today.

Review: Collisions

Definition (Collision). A collision is when two distinct keys map to the same
location in the hash table.

A good hash function attempts to avoid as many collisions as possible, but
they are inevitable. How do we deal with collisions? There are multiple
strategies: separate chaining, linear probing, quadratic probing, and double
hashing.

Open Addressing

Definition (Open Addressing). Open Addressing is a type of collision
resolution strategy that resolves collisions by choosing a different location
when the natural choice is full.

There are many types of open addressing. Here are the key ideas:
- We must be able to duplicate the path we took.
- We want to use all the spaces in the table.
- We want to avoid putting lots of keys close together.

It turns out some of these are difficult to achieve.
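As a quick review, the separate-chaining strategy can be sketched in a few lines. This is an illustrative sketch (class and method names are my own, not the course's reference code), showing the pipeline: hash the key, mod by the table size, then resolve collisions with a per-slot list.

```python
class ChainingHashTable:
    """Minimal separate-chaining table (illustrative sketch)."""
    def __init__(self, size=10, hash_fn=hash):
        self.table = [[] for _ in range(size)]  # one bucket (list) per slot
        self.hash_fn = hash_fn

    def insert(self, key, value):
        bucket = self.table[self.hash_fn(key) % len(self.table)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: replace value
                bucket[i] = (key, value)
                return
        bucket.insert(0, (key, value))    # push on front: O(1) insert

    def find(self, key):
        bucket = self.table[self.hash_fn(key) % len(self.table)]
        for k, v in bucket:
            if k == key:
                return v
        return None

t = ChainingHashTable(size=10, hash_fn=lambda x: x)
for k in [38, 19, 8, 109, 10]:
    t.insert(k, str(k))
# 19 and 109 both hash to bucket 9; chaining simply extends that list.
```

Note that no probing is needed: a collision only makes one bucket's list longer.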
Strategy #1: Linear Probing

    i = 0;
    while (index in use) {
        try (h(key) + i) % |T|
        i++;
    }

Example: insert 38, 19, 8, 109, 10 into a hash table with hash function
h(x) = x and linear probing (table size 10):

    T[0]=8  T[1]=109  T[2]=10  T[8]=38  T[9]=19

Which criteria does linear probing meet?
- We want to use all the spaces in the table.
  Yes! Linear probing will fill the whole table.
- We want to avoid putting lots of keys close together.
  Uh... not so much.

Primary Clustering

Primary Clustering is when different keys collide to form one big group.
Think of this as "clusters of many colors": even though the keys are all
different, they end up in one giant cluster. For example, after also
inserting 101 and 20 above, the table becomes one long run:

    T[0]=8  T[1]=109  T[2]=10  T[3]=101  T[4]=20  T[8]=38  T[9]=19

In linear probing, we expect to get O(lg n)-size clusters.

Other Operations with Linear Probing
- insert: finds the next open spot. The worst case is O(n).
- find: we have to retrace our steps. If the insert chain was k long, then
  find ∈ O(k).
- delete: we don't have a choice; we must use lazy deletion (mark the slot
  deleted rather than emptying it). What happens if we delete 19 and then do
  find(109) in our example? If slot T[9] became truly empty, the probe for
  109 would stop there and wrongly report "not found".

This is really bad! But, how bad, really?
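The example above, including lazy deletion, can be replayed with a short sketch. This is an illustrative implementation (the sentinel names EMPTY and DELETED are my own), assuming h(x) = x and a size-10 table as in the slides:

```python
EMPTY, DELETED = object(), object()  # sentinels for open-addressing slots

def linear_probe_insert(table, key, h):
    for i in range(len(table)):
        idx = (h(key) + i) % len(table)
        if table[idx] is EMPTY or table[idx] is DELETED:
            table[idx] = key
            return idx
    raise RuntimeError("table full")

def linear_probe_find(table, key, h):
    for i in range(len(table)):
        idx = (h(key) + i) % len(table)
        if table[idx] is EMPTY:   # a truly empty slot ends the probe chain
            return None
        if table[idx] == key:     # DELETED slots are probed past, not stopped at
            return idx
    return None

table = [EMPTY] * 10
for k in [38, 19, 8, 109, 10]:
    linear_probe_insert(table, k, h=lambda x: x)
# table is now [8, 109, 10, _, _, _, _, _, 38, 19]

# Lazy deletion: mark 19 as DELETED rather than EMPTY, so find(109)
# still probes past slot 9 and succeeds.
table[9] = DELETED
```

If we set `table[9] = EMPTY` instead, `linear_probe_find(table, 109, ...)` would stop at slot 9 and miss 109, which is exactly the failure the slide warns about.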
Analyzing Linear Probing

Load Factor & Space Usage. Note that λ ≤ 1, and we will eventually get to
λ = 1. The average number of probes is:

    Unsuccessful search:  (1/2) (1 + 1/(1 − λ)²)
    Successful search:    (1/2) (1 + 1/(1 − λ))

[Figure: average number of probes vs. load factor; the unsuccessful-search
curve blows up as λ approaches 1.]

Quadratic Probing

There's nothing theoretically wrong with open addressing that forces primary
clustering. We'd like a different (easy to compute) function to probe with.

Open Addressing in General. Choose a new function f(i) and then probe with

    (h(key) + f(i)) mod |T|

Strategy #2: Quadratic Probing

    i = 0;
    while (index in use) {
        try (h(key) + i²) % |T|
        i++;
    }

Example: insert 89, 18, 49, 58, 79 into a hash table with hash function
h(x) = x and quadratic probing (table size 10):

    i = 0:  h(49) = 9       taken (89)
    i = 1:  49 + 1² ≡ 0     open: 49 goes to T[0]

    i = 0:  h(58) = 8       taken (18)
    i = 1:  58 + 1² ≡ 9     taken (89)
    i = 2:  58 + 2² ≡ 2     open: 58 goes to T[2]

    i = 0:  h(79) = 9       taken (89)
    i = 1:  79 + 1² ≡ 0     taken (49)
    i = 2:  79 + 2² ≡ 3     open: 79 goes to T[3]

Final table: T[0]=49  T[2]=58  T[3]=79  T[8]=18  T[9]=89

Another Quadratic Probing Example

Insert 76, 40, 48, 5, 55, 47 into a hash table with hash function h(x) = x
and quadratic probing (table size 7). The first five keys land at:

    T[0]=48  T[2]=5  T[3]=55  T[5]=40  T[6]=76

Why does insert(47) fail? For all i, (5 + i²) mod 7 ∈ {0, 2, 5, 6}. The
proof is by induction. This actually generalizes: for all c and k,

    (c + i²) mod k = (c + (i − k)²) mod k

So, quadratic probing doesn't always fill the table.
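The first quadratic-probing example can be replayed with a short sketch (illustrative code, assuming h(x) = x and a size-10 table as in the slides):

```python
def quadratic_probe_insert(table, key, h):
    """Probe slots (h(key) + i*i) % |T| until an empty one is found."""
    for i in range(len(table)):
        idx = (h(key) + i * i) % len(table)
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("no empty slot found")

table = [None] * 10
positions = {k: quadratic_probe_insert(table, k, h=lambda x: x)
             for k in [89, 18, 49, 58, 79]}
print(positions)  # {89: 9, 18: 8, 49: 0, 58: 2, 79: 3}
```

Note how 58 and 79 both "jump" by 4 on their third probe instead of scanning linearly, which is what breaks up primary clusters.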
Tracing insert(47), with all arithmetic mod 7:

    i = 0:  h(47) = 5    taken (40)
    i = 1:  47 + 1² ≡ 6  taken (76)
    i = 2:  47 + 2² ≡ 2  taken (5)
    i = 3:  47 + 3² ≡ 0  taken (48)
    i = 4:  47 + 4² ≡ 0  taken (48)
    i = 5:  47 + 5² ≡ 2  taken (5)
    ...

We will never get a 1 or a 4! This means we will never be able to insert 47.
What's going on?

The Good News!

If |T| is prime and λ < 1/2, then quadratic probing will find an empty slot
in at most |T|/2 probes. So, if we keep λ < 1/2, we don't need to detect
cycles. The proof will be posted on the website.

So, does quadratic probing completely fix clustering?

Quadratic Probing: Clustering

With linear probing, we saw primary clustering (keys hashing near each
other). Quadratic probing fixes this by "jumping". Unfortunately, we still
get secondary clustering.

Secondary Clustering is when different keys hash to the same place and
follow the same probing sequence. Think of this as long probing chains of
the same color: the keys (e.g., 9, 19, 29, 39 in a size-10 table) all start
at the same place, so the chain gets really long. We can avoid secondary
clustering by using a probe function that depends on the key.

Strategy #3: Double Hashing

    i = 0;
    while (index in use) {
        try (h(key) + i * g(key)) % |T|
        i++;
    }

We insist that g(x) ≠ 0.

Example: insert 13, 28, 33, 147, 43 into a hash table (size 10) with

    h(x) = x mod |T|
    g(x) = 1 + ((x / |T|) mod (|T| − 1))

The first four keys land at T[3]=13, T[7]=33, T[8]=28, T[9]=147. Then
insert(43), where g(43) = 1 + (4 mod 9) = 5:

    i = 0:  h(43) = 3            taken (13)
    i = 1:  43 + 1·5 ≡ 8 (mod 10)  taken (28)
    i = 2:  43 + 2·5 ≡ 3 (mod 10)  taken (13)
    i = 3:  43 + 3·5 ≡ 8 (mod 10)  taken (28)
    ...

We got stuck again!
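The claim that (5 + i²) mod 7 only ever reaches slots {0, 2, 5, 6} is easy to check by brute force, along with the periodicity identity that explains it. A small sketch:

```python
# Which slots can quadratic probing ever reach from h(47) = 5 in a size-7 table?
reachable = {(5 + i * i) % 7 for i in range(1000)}
print(sorted(reachable))  # slots 1 and 4 are never probed

# The reason the set is finite: the probe sequence is periodic, because
# (i - k)^2 = i^2 - 2ik + k^2 differs from i^2 by a multiple of k, so
#   (c + i*i) % k == (c + (i - k) ** 2) % k   for all c, k, i.
assert all((5 + i * i) % 7 == (5 + (i - 7) ** 2) % 7 for i in range(100))
```

Since only 7 distinct probe offsets exist and they miss slots 1 and 4, insert(47) loops forever once the reachable slots are full, which is exactly the failure traced above.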
Double Hashing Analysis

Just like with quadratic probing, we sometimes hit an infinite loop with
double hashing.

Filling the Table. We will not get an infinite loop if we choose primes
p and q such that 2 < q < p and

    h(key) = key mod p
    g(key) = q − (key mod q)

Uniform Hashing. For double hashing, we assume uniform hashing, which means:

    Pr[g(key1) mod p = g(key2) mod p] = 1/p

Average Number of Probes

    Unsuccessful search:  1/(1 − λ)
    Successful search:    (1/λ) ln(1/(1 − λ))

This is way better than linear probing.

Where We Are

Separate chaining is easy!
- find and delete are proportional to the load factor on average.
- insert can be constant if we just push on the front of the list.

Open addressing is tricky!
- Clustering issues.
- Doesn't always use the whole table.

Why use it, then? Less memory allocation and an easier data representation.

Now, let's move on to resizing the table.

Rehashing

When λ is too big, create a bigger table and copy over the items.

When to Resize
- With separate chaining, we decide when to resize (should be λ ≤ 1).
- With open addressing, we need to keep λ < 1/2.

Hashing and Comparing

A hash function isn't enough! We have to compare items:
- With separate chaining, we have to loop through the list checking if each
  item is the one we're looking for.
- With open addressing, we need to know when to stop probing.

We have two options for this: equality testing or comparison testing.
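The safe prime-based construction can be sketched as follows. This is an illustrative implementation with p = 11 and q = 7 chosen by me (the slide's example keys plus an extra key, 2, that actually collides so the probing is visible):

```python
def double_hash_insert(table, key, p, q):
    """h(key) = key % p, g(key) = q - (key % q), with primes 2 < q < p = |T|."""
    h, g = key % p, q - (key % q)   # g(key) is in [1, q], never 0
    for i in range(p):
        idx = (h + i * g) % p
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("table full")

p, q = 11, 7
table = [None] * p
for k in [13, 28, 33, 147, 43, 2]:
    double_hash_insert(table, k, p, q)
# 2 collides with 13 at slot 2, then jumps by g(2) = 5 to the open slot 7.
```

Unlike the g(x) = 1 + ((x / |T|) mod (|T| − 1)) example that got stuck, every key here finds a slot, since with prime p each nonzero step g(key) cycles through the whole table.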