Chapter 5 Hashing

Introduction

Hashing supports the basic operations, such as insertion, deletion, and find, in O(1) average time.

A hash table is merely an array of some fixed size M. Let K be the set of search keys; a hash function h maps K into the set of locations {0, 1, ..., M-1} in the table, so that searching on the key becomes something like an array lookup. Ideally, h distributes the keys evenly over the slots of the hash table to minimize collisions: if we are hashing N items, we want the number of items hashed to each location to be close to N/M. Hashing is typically a many-to-one map, so multiple keys may be mapped to the same array index; mapping multiple keys to the same position results in a collision that must be resolved.

There are two parts to hashing: a hash function, which transforms keys into array indices, and a collision resolution procedure.

Example: the Library of Congress Classification System. The first part of a call number (e.g., E470, PN1995) plays the role of the hash function, and collision resolution involves going to the stacks and looking through the books shelved there. Almost all of computer science is hashed to QA75 and QA76, which is bad.

Hash Functions

Suppose we are storing a set of nonnegative integers. Given a table size M, we can obtain hash values between 0 and M-1 with the hash function h(k) = k mod M, the remainder when k is divided by M. This is a fast operation, but we need to be careful when choosing M. Example: if M = 2^p, then h(k) is just the lowest-order p bits of k; are all the hash values equally likely? Choosing M to be a prime not too close to a power of 2 works well in practice.

We can also use the hash function above for floating-point numbers if we interpret their bits as an integer. There are two ways to do this in C, assuming the long int and double types have the same length: the first method uses C pointer casts to accomplish this task, and the second uses a union, which is a variable that can hold objects of different types and sizes.

We can hash strings by combining a hash of each character, updating h = (h * R + c) mod M for each character c, where R is an additional parameter we get to choose. If R is larger than any character value, then this approach is what you would obtain if you treated the string as a base-R number.
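To make the integer, floating-point, and string cases concrete, here is a minimal C sketch; the table size M = 97, the base R = 128, and the function names are illustrative choices rather than values from the text, and the union trick assumes long and double have the same length, as discussed above.

    #include <stdio.h>

    #define M 97   /* table size: a prime not too close to a power of 2 */
    #define R 128  /* base for string hashing: larger than any char value */

    /* integer keys: the remainder when k is divided by M */
    unsigned hash_int(unsigned long k) {
        return (unsigned)(k % M);
    }

    /* floating-point keys: reinterpret the bits as an integer and hash
       that; this is the union method, assuming the two types have the
       same length */
    unsigned hash_double(double x) {
        union { double d; unsigned long bits; } u;
        u.d = x;
        return (unsigned)(u.bits % M);
    }

    /* string keys: combine a hash of each character, i.e., treat the
       string as a base-R number, reducing mod M to avoid overflow */
    unsigned hash_string(const char *s) {
        unsigned long h = 0;
        for (; *s != '\0'; s++)
            h = (h * R + (unsigned char)*s) % M;
        return (unsigned)h;
    }

    int main(void) {
        printf("%u %u %u\n", hash_int(12345),
               hash_double(3.14159), hash_string("hashing"));
        return 0;
    }

Setting R = 31 or R = 37 in hash_string gives the K&R and Weiss variants mentioned next.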
K&R suggest a slightly simpler string hash function, corresponding to R = 31; Weiss suggests R = 37.

We can use the same idea if our search key has multiple parts, say, street, city, and state:

    hash = (((street * R + city) % M) * R + state) % M;

The same ideas apply to hashing vectors.

The choice of parameters can have a dramatic effect on the results of hashing. Example: compare the text's string hashing algorithm for different pairs of R and M, plotting histograms of the number of words hashed to each hash table location; we use the American dictionary from the aspell program as data (305,089 words). A good choice of parameters spreads the words evenly over the table locations; the slides show histograms for several (R, M) pairs ranging from very bad to better.

Collision Resolution

A collision occurs when two elements hash to the same location in the table. There are various strategies for dealing with collisions: separate chaining, open addressing (linear probing, among others), and other methods.

Separate Chaining

Separate chaining keeps a list of all elements that hash to the same location: each location in the hash table is the header of a linked list. Insert, search, and delete are all proportional to the length of the linked list. Insert example: the first 10 perfect squares. New elements can be inserted at the front of the list, and duplicates can increment a count field. Other structures could be used instead of lists (a binary search tree, or even another hash table), but linked lists are good if the table is large and the hash function is good.

How long are the linked lists in a hash table? The average length is the load factor λ = N/M, where N is the number of keys and M is the size of the table. Observations: the load factor is more important than the table size; the general rule is to make the table about as large as the number of elements to be stored (is it reasonable to assume the hash table would exhibit this average behavior?); and the table size should be kept prime to ensure a good distribution. The time to search is the time to evaluate the hash function plus the time to traverse the list: an unsuccessful search examines λ nodes on average, while a successful search examines about 1 + λ/2.

The slides then give the declaration of the hash structure, the hash member function, and the routines for separate chaining.
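Since those code figures did not survive extraction, here is a minimal C sketch of separate chaining in the style just described: each table location heads a singly linked list, and new elements go at the front. The size M = 11 and the plain int keys are illustrative assumptions, not the text's actual class.

    #include <stdio.h>
    #include <stdlib.h>

    #define M 11  /* table size: a small illustrative prime */

    typedef struct Node {
        int key;
        struct Node *next;
    } Node;

    static Node *table[M];  /* each location is the header of a list */

    static unsigned hash(int k) { return (unsigned)k % M; }

    /* insert at the front of the list for h(k): O(1) plus the hash */
    void insert(int k) {
        Node *n = malloc(sizeof *n);
        n->key = k;
        n->next = table[hash(k)];
        table[hash(k)] = n;
    }

    /* search walks the one list the key can be in, so the time is
       proportional to that list's length, lambda = N/M on average */
    int find(int k) {
        for (Node *p = table[hash(k)]; p != NULL; p = p->next)
            if (p->key == k) return 1;
        return 0;
    }

    int main(void) {
        for (int i = 1; i <= 10; i++)   /* the first 10 perfect squares */
            insert(i * i);
        printf("%d %d\n", find(49), find(50));  /* prints: 1 0 */
        return 0;
    }

A fuller implementation would also check for duplicates on insert (incrementing a count field instead of adding a node) and provide delete, both again proportional to the list length.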
Open Addressing

Linked lists incur extra costs: the insert operation takes time to allocate new cells, and there is the effort and complexity of defining a second data structure. A different collision strategy involves placing colliding keys in nearby slots of the hash table itself: if a collision occurs, try successive cells until an empty one is found. A bigger table is needed, and the load factor should stay below 0.5. Checking whether a table entry is vacant (or is the one we seek) is called a probe. There are three common strategies: linear probing, quadratic probing, and double hashing.

Linear Probing

With linear probing, when key k is hashed, if slot h(k) is open, k is placed there. If there is a collision, we start looking for an empty slot at location h(k) and proceed through h(k)+1, h(k)+2, ..., M-1, 0, 1, 2, ..., wrapping around the hash table until an empty slot is found. The search operation is similar. Example: add 89, 18, 49, 58, 69 with h(k) = k mod 10 and f(i) = i.

As long as the table is not full, a vacant cell can be found, but the time to locate an empty cell can become large: blocks of occupied cells result in primary clustering. Deleting entries leaves holes, so some entries may no longer be found, and repairing the table may require moving many other entries. The expected number of probes is (1/2)(1 + 1/(1-λ)) for search hits and (1/2)(1 + 1/(1-λ)^2) for insertions and search misses; for λ = 0.5, these values are 1.5 and 2.5, respectively. Plots of the performance of linear probing (dashed) versus more random collision resolution, for successful searches, unsuccessful searches, and insertions, show that linear probing is adequate up to λ = 0.5.

Quadratic Probing

Quadratic probing eliminates primary clustering: the collision function is quadratic, f(i) = i^2. Example: add 89, 18, 49, 58, 69 with h(k) = k mod 10 and f(i) = i^2.

In linear probing, letting the table get nearly full greatly hurts performance; quadratic probing is an alternative that causes less than half an extra probe per search. However, keys that hash to the same location will probe the same alternative cells (secondary clustering), and there is no guarantee of finding an empty cell once the table gets more than half full: at most half of the table can be used to resolve collisions, and we could end up in a situation where all of a key's probe locations are taken. If the table is at least half empty and the table size is prime, then we are always guaranteed to be able to accommodate a new element.

Double Hashing

Double hashing applies a second hash function to k and probes cells at distances hash2(k), 2*hash2(k), 3*hash2(k), ..., for example hash2(k) = R - (k mod R), where R is a prime smaller than the table size. The second hash function must never evaluate to 0, and we must make sure all cells can be probed. Example: insert 89, 18, 49, 58, 69 with hash2(k) = 7 - (k mod 7). Continuing the example, note that the size of the table (10) is not prime: if 23 were inserted into the table, it would collide with 58, and since hash2(23) = 7 - (23 mod 7) = 5 and the table size is 10, there is only one alternative location, which is taken.
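The following is a minimal C sketch of open addressing with linear probing on the slides' example keys; the three-state cells for lazy deletion, the non-prime size-10 table (kept to match the example), and all names are illustrative assumptions.

    #include <stdio.h>
    #include <stdbool.h>

    #define SIZE 10  /* matches the example; a prime is better in practice */

    typedef enum { EMPTY, OCCUPIED, DELETED } State;

    typedef struct {
        int   key;
        State state;  /* DELETED marks lazily removed entries */
    } Cell;

    static Cell table[SIZE];  /* all cells start out EMPTY */

    static unsigned hash(int k) { return (unsigned)k % SIZE; }

    /* probe h(k), h(k)+1, h(k)+2, ..., wrapping around (f(i) = i) */
    bool insert(int k) {
        unsigned i = hash(k);
        for (int probes = 0; probes < SIZE; probes++) {
            if (table[i].state != OCCUPIED) {  /* EMPTY or DELETED slot */
                table[i].key = k;
                table[i].state = OCCUPIED;
                return true;
            }
            i = (i + 1) % SIZE;
        }
        return false;  /* table full */
    }

    /* search stops at an EMPTY cell but must skip DELETED ones, or
       keys inserted past a removed entry could no longer be found */
    bool find(int k) {
        unsigned i = hash(k);
        for (int probes = 0; probes < SIZE; probes++) {
            if (table[i].state == EMPTY) return false;
            if (table[i].state == OCCUPIED && table[i].key == k) return true;
            i = (i + 1) % SIZE;
        }
        return false;
    }

    int main(void) {
        int keys[] = { 89, 18, 49, 58, 69 };
        for (int j = 0; j < 5; j++) insert(keys[j]);
        /* expected layout: 49 in cell 0, 58 in 1, 69 in 2, 18 in 8, 89 in 9 */
        for (int i = 0; i < SIZE; i++)
            if (table[i].state == OCCUPIED)
                printf("cell %d: %d\n", i, table[i].key);
        return 0;
    }

Quadratic probing and double hashing change only the probe step: the i-th probe goes to h(k) + i*i or h(k) + i*hash2(k) (mod the table size) instead of h(k) + i.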
Rehashing

The table may get too full: the run time of the operations may take too long, insertions may fail for quadratic resolution, and too many removals may be intermixed with insertions. The solution is to build a new table (with a new hash function), go through the original hash table to compute a hash value for each (non-deleted) element, and insert it into the new table.

Example: insert 13, 15, 24, and 6 into a hash table of size 7 with h(k) = k mod 7. Continuing the example, when 23 is inserted the table will be over 70% full; therefore, a new table is created. The new table has size 17, the new hash function is h(k) = k mod 17, and all old elements are inserted into the new table.

The rehashing run time is O(N), since there are N elements to rehash and the entire new table has size roughly 2N; but there must have been N/2 insertions since the last rehash, so the cost is amortized over those insertions. Rehashing may run OK if done in a batch job, but in an interactive session the rehashing operation could produce a noticeable slowdown.

Rehashing can be implemented with quadratic probing in several ways: we could rehash as soon as the table is half full, rehash only when an insertion fails, or rehash only when a certain load factor is reached. The last option may be best, as performance degrades as the load factor increases.

Hash Tables with Worst-Case O(1) Access

The hash tables so far provide O(1) average-case time for insertions, searches, and deletions. With separate chaining, however, some queries will take nearly logarithmic time in the worst case; worst-case O(1) time would be better, and is important for applications such as lookup tables for routers and memory caches.

Perfect hashing assumes all items are known in advance. If the number of lists continually increases, the lists become shorter and shorter, and with enough lists there is a high probability of O(1) worst-case access. Two problems remain: the number of lists might be unreasonably large, and the hashing might still be unfortunate. If N is known in advance and the elements can be rearranged, worst-case O(1) time is achievable: M can be made large enough to have probability at least 1/2 of no collisions, and if a collision is detected, we clear the table and try again with a different hash function (on average this is done at most 2 times).

How large must M be? Theoretically, M should be N^2, which is impractical. The solution is to use N lists, and to resolve collisions by using hash tables instead of linked lists. Example: slots 1, 3, 5, and 7 are empty; slots 0, 4, and 8 have 1 element each; slots 2 and 6 have 2 elements each; and slot 9 has 3 elements. Each of these lists can then be replaced by a secondary hash table whose size is quadratic in the length of the list.
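Circling back to rehashing, here is a minimal C sketch of the strategy described above, using integer keys and linear probing. The 70% trigger matches the size-7 example, but doubling the table (rather than moving to the next prime, as the jump from 7 to 17 does) and all names are simplifying assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdbool.h>

    typedef struct {
        int  *keys;
        bool *used;
        int   size;  /* ideally prime; doubled here for brevity */
        int   n;     /* number of stored keys */
    } Table;

    static Table *create(int size) {
        Table *t = malloc(sizeof *t);
        t->keys = malloc(size * sizeof *t->keys);
        t->used = calloc(size, sizeof *t->used);
        t->size = size;
        t->n = 0;
        return t;
    }

    static void insert(Table **tp, int k);

    /* build a new table roughly twice as big and reinsert every
       element: O(N) work, paid for by the N/2 or so insertions that
       must have happened since the last rehash */
    static void rehash(Table **tp) {
        Table *old = *tp;
        *tp = create(2 * old->size);
        for (int i = 0; i < old->size; i++)
            if (old->used[i]) insert(tp, old->keys[i]);
        free(old->keys); free(old->used); free(old);
    }

    static void insert(Table **tp, int k) {
        if (10 * ((*tp)->n + 1) > 7 * (*tp)->size)  /* would exceed 70% */
            rehash(tp);
        Table *t = *tp;
        unsigned i = (unsigned)k % t->size;
        while (t->used[i]) i = (i + 1) % t->size;   /* linear probing */
        t->keys[i] = k;
        t->used[i] = true;
        t->n++;
    }

    int main(void) {
        Table *t = create(7);
        int keys[] = { 13, 15, 24, 6, 23 };  /* the slides' example */
        for (int j = 0; j < 5; j++) insert(&t, keys[j]);
        printf("size after rehash: %d\n", t->size);  /* prints 14 */
        return 0;
    }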