Lecture 10 Student Notes
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Using Tabulation to Implement the Power of Hashing Using Tabulation to Implement the Power of Hashing Talk Surveys Results from I M
Talk surveys results from I M. Patras¸cuˇ and M. Thorup: The power of simple tabulation hashing. STOC’11 and J. ACM’12. I M. Patras¸cuˇ and M. Thorup: Twisted Tabulation Hashing. SODA’13 I M. Thorup: Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence. FOCS’13. I S. Dahlgaard and M. Thorup: Approximately Minwise Independence with Twisted Tabulation. SWAT’14. I T. Christiani, R. Pagh, and M. Thorup: From Independence to Expansion and Back Again. STOC’15. I S. Dahlgaard, M.B.T. Knudsen and E. Rotenberg, and M. Thorup: Hashing for Statistics over K-Partitions. FOCS’15. Using Tabulation to Implement The Power of Hashing Using Tabulation to Implement The Power of Hashing Talk surveys results from I M. Patras¸cuˇ and M. Thorup: The power of simple tabulation hashing. STOC’11 and J. ACM’12. I M. Patras¸cuˇ and M. Thorup: Twisted Tabulation Hashing. SODA’13 I M. Thorup: Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence. FOCS’13. I S. Dahlgaard and M. Thorup: Approximately Minwise Independence with Twisted Tabulation. SWAT’14. I T. Christiani, R. Pagh, and M. Thorup: From Independence to Expansion and Back Again. STOC’15. I S. Dahlgaard, M.B.T. Knudsen and E. Rotenberg, and M. Thorup: Hashing for Statistics over K-Partitions. FOCS’15. I Providing algorithmically important probabilisitic guarantees akin to those of truly random hashing, yet easy to implement. I Bridging theory (assuming truly random hashing) with practice (needing something implementable). I Many randomized algorithms are very simple and popular in practice, but often implemented with too simple hash functions, so guarantees only for sufficiently random input. -
Algorithmic Improvements for Fast Concurrent Cuckoo Hashing Presentation by Nir David
Algorithmic Improvements for Fast Concurrent Cuckoo Hashing Presentation by Nir David Xiaozhou Li , David G. Andersen , Michael Kaminsky , Michael J. Freedman Overview Background and Related Work Hash Tables Concurrency Control Mechanisms Naive use of concurrency control fails Principles to Improve Concurrency Concurrent Cuckoo Hashing Cuckoo Hashing Prior Work in Concurrent Cuckoo Algorithmic Optimizations Fine-grained Locking Optimizing for Intel TSX Evaluation Concurrent hash table Provides: Lookup, Insert and Delete operations. On Lookup, a value is returned for the given key, or “does not exist” if the key cannot be found. On Insert, the hash table returns success, or an error code to indicate whether the hash table is full or the key is already exists. Delete simply removes the key’s entry from the hash table. Several definitions before we go further Open Addressing: a method for handling collisions. A collision is resolved by probing, or searching through alternate locations in the array until either the target record is found, or an unused array slot is found. Linear probing - in which the interval between probes is fixed — often at 1 Quadratic probing - in which the interval between probes increases linearly Linear Probing: f(i) = i Insert(k,x) // assume unique keys 1. index = hash(key) % table_size; 2. if (table[index]== NULL) table[index]= new key_value_pair(key, x); 3. Else { • index++; • index = index % table_size; • goto 2; } 89 mod 10 = 9 Linear Probing Example 18 mod 10 = 8 Insert 89, 18, 49, 58, 69 58 mod 10 = 8 49 mod 10 = 9 Several definitions before we go further Chaining: another possible way to resolve collisions. -
The Power of Tabulation Hashing
Thank you for inviting me to China Theory Week. Joint work with Mihai Patras¸cuˇ . Some of it found in Proc. STOC’11. The Power of Tabulation Hashing Mikkel Thorup University of Copenhagen AT&T Joint work with Mihai Patras¸cuˇ . Some of it found in Proc. STOC’11. The Power of Tabulation Hashing Mikkel Thorup University of Copenhagen AT&T Thank you for inviting me to China Theory Week. The Power of Tabulation Hashing Mikkel Thorup University of Copenhagen AT&T Thank you for inviting me to China Theory Week. Joint work with Mihai Patras¸cuˇ . Some of it found in Proc. STOC’11. I Providing algorithmically important probabilisitic guarantees akin to those of truly random hashing, yet easy to implement. I Bridging theory (assuming truly random hashing) with practice (needing something implementable). Target I Simple and reliable pseudo-random hashing. I Bridging theory (assuming truly random hashing) with practice (needing something implementable). Target I Simple and reliable pseudo-random hashing. I Providing algorithmically important probabilisitic guarantees akin to those of truly random hashing, yet easy to implement. Target I Simple and reliable pseudo-random hashing. I Providing algorithmically important probabilisitic guarantees akin to those of truly random hashing, yet easy to implement. I Bridging theory (assuming truly random hashing) with practice (needing something implementable). Applications of Hashing Hash tables (n keys and 2n hashes: expect 1/2 keys per hash) I chaining: follow pointers • x • ! a ! t • • ! v • • ! f ! s -
Non-Empty Bins with Simple Tabulation Hashing
Non-Empty Bins with Simple Tabulation Hashing Anders Aamand and Mikkel Thorup November 1, 2018 Abstract We consider the hashing of a set X U with X = m using a simple tabulation hash function h : U [n] = 0,...,n 1 and⊆ analyse| the| number of non-empty bins, that is, the size of h(X→). We show{ that the− expected} size of h(X) matches that with fully random hashing to within low-order terms. We also provide concentration bounds. The number of non-empty bins is a fundamental measure in the balls and bins paradigm, and it is critical in applications such as Bloom filters and Filter hashing. For example, normally Bloom filters are proportioned for a desired low false-positive probability assuming fully random hashing (see en.wikipedia.org/wiki/Bloom_filter). Our results imply that if we implement the hashing with simple tabulation, we obtain the same low false-positive probability for any possible input. arXiv:1810.13187v1 [cs.DS] 31 Oct 2018 1 Introduction We consider the balls and bins paradigm where a set X U of X = m balls are distributed ⊆ | | into a set of n bins according to a hash function h : U [n]. We are interested in questions → relating to the distribution of h(X) , for example: What is the expected number of non-empty | | bins? How well is h(X) concentrated around its mean? And what is the probability that a | | query ball lands in an empty bin? These questions are critical in applications such as Bloom filters [3] and Filter hashing [7]. -
Cuckoo Hashing
Cuckoo Hashing Outline for Today ● Towards Perfect Hashing ● Reducing worst-case bounds ● Cuckoo Hashing ● Hashing with worst-case O(1) lookups. ● The Cuckoo Graph ● A framework for analyzing cuckoo hashing. ● Analysis of Cuckoo Hashing ● Just how fast is cuckoo hashing? Perfect Hashing Collision Resolution ● Last time, we mentioned three general strategies for resolving hash collisions: ● Closed addressing: Store all colliding elements in an auxiliary data structure like a linked list or BST. ● Open addressing: Allow elements to overflow out of their target bucket and into other spaces. ● Perfect hashing: Choose a hash function with no collisions. ● We have not spoken on this last topic yet. Why Perfect Hashing is Hard ● The expected cost of a lookup in a chained hash table is O(1 + α) for any load factor α. ● For any fixed load factor α, the expected cost of a lookup in linear probing is O(1), where the constant depends on α. ● However, the expected cost of a lookup in these tables is not the same as the expected worst-case cost of a lookup in these tables. Expected Worst-Case Bounds ● Theorem: Assuming truly random hash functions, the expected worst-case cost of a lookup in a linear probing hash table is Ω(log n). ● Theorem: Assuming truly random hash functions, the expected worst-case cost of a lookup in a chained hash table is Θ(log n / log log n). ● Proofs: Exercise 11-1 and 11-2 from CLRS. ☺ Perfect Hashing ● A perfect hash table is one where lookups take worst-case time O(1). ● There's a pretty sizable gap between the expected worst-case bounds from chaining and linear probing – and that's on expected worst-case, not worst-case. -
6.1 Systems of Hash Functions
— 6 Hashing 6 Hashing 6.1 Systems of hash functions Notation: • The universe U. Usually, the universe is a set of integers f0;:::;U − 1g, which will be denoted by [U]. • The set of buckets B = [m]. • The set X ⊂ U of n items stored in the data structure. • Hash function h : U!B. Definition: Let H be a family of functions from U to [m]. We say that the family is c-universal for some c > 0 if for every pair x; y of distinct elements of U we have c Pr [h(x) = h(y)] ≤ : h2H m In other words, if we pick a hash function h uniformly at random from H, the probability that x and y collide is at most c-times more than for a completely random function h. Occasionally, we are not interested in the specific value of c, so we simply say that the family is universal. Note: We will assume that a function can be chosen from the family uniformly at random in constant time. Once chosen, it can be evaluated for any x in constant time. Typically, we define a single parametrized function ha(x) and let the family H consist of all ha for all possible choices of the parameter a. Picking a random h 2 H is therefore implemented by picking a random a. Technically, this makes H a multi-set, since different values of a may lead to the same function h. Theorem: Let H be a c-universal family of functions from U to [m], X = fx1; : : : ; xng ⊆ U a set of items stored in the data structure, and y 2 U n X another item not stored in the data structure. -
Linear Probing with 5-Independent Hashing
Lecture Notes on Linear Probing with 5-Independent Hashing Mikkel Thorup May 12, 2017 Abstract These lecture notes show that linear probing takes expected constant time if the hash function is 5-independent. This result was first proved by Pagh et al. [STOC’07,SICOMP’09]. The simple proof here is essentially taken from [Pˇatra¸scu and Thorup ICALP’10]. We will also consider a smaller space version of linear probing that may have false positives like Bloom filters. These lecture notes illustrate the use of higher moments in data structures, and could be used in a course on randomized algorithms. 1 k-independence The concept of k-independence was introduced by Wegman and Carter [21] in FOCS’79 and has been the cornerstone of our understanding of hash functions ever since. A hash function is a random function h : [u] [t] mapping keys to hash values. Here [s]= 0,...,s 1 . We can also → { − } think of a h as a random variable distributed over [t][u]. We say that h is k-independent if for any distinct keys x ,...,x [u] and (possibly non-distinct) hash values y ,...,y [t], we have 0 k−1 ∈ 0 k−1 ∈ Pr[h(x )= y h(x )= y ] = 1/tk. Equivalently, we can define k-independence via two 0 0 ∧···∧ k−1 k−1 separate conditions; namely, (a) for any distinct keys x ,...,x [u], the hash values h(x ),...,h(x ) are independent 0 k−1 ∈ 0 k−1 random variables, that is, for any (possibly non-distinct) hash values y ,...,y [t] and 0 k−1 ∈ i [k], Pr[h(x )= y ]=Pr h(x )= y h(x )= y , and ∈ i i i i | j∈[k]\{i} j j arXiv:1509.04549v3 [cs.DS] 11 May 2017 h i (b) for any x [u], h(x) is uniformly distributedV in [t]. -
Cuckoo Hashing
Cuckoo Hashing Rasmus Pagh* BRICSy, Department of Computer Science, Aarhus University Ny Munkegade Bldg. 540, DK{8000 A˚rhus C, Denmark. E-mail: [email protected] and Flemming Friche Rodlerz ON-AIR A/S, Digtervejen 9, 9200 Aalborg SV, Denmark. E-mail: ff[email protected] We present a simple dictionary with worst case constant lookup time, equal- ing the theoretical performance of the classic dynamic perfect hashing scheme of Dietzfelbinger et al. (Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput., 23(4):738{761, 1994). The space usage is similar to that of binary search trees, i.e., three words per key on average. Besides being conceptually much simpler than previous dynamic dictionaries with worst case constant lookup time, our data structure is interesting in that it does not use perfect hashing, but rather a variant of open addressing where keys can be moved back in their probe sequences. An implementation inspired by our algorithm, but using weaker hash func- tions, is found to be quite practical. It is competitive with the best known dictionaries having an average case (but no nontrivial worst case) guarantee. Key Words: data structures, dictionaries, information retrieval, searching, hash- ing, experiments * Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT). This work was initiated while visiting Stanford University. y Basic Research in Computer Science (www.brics.dk), funded by the Danish National Research Foundation. z This work was done while at Aarhus University. 1 2 PAGH AND RODLER 1. INTRODUCTION The dictionary data structure is ubiquitous in computer science. -
The Power of Simple Tabulation Hashing
The Power of Simple Tabulation Hashing Mihai Pˇatra¸scu Mikkel Thorup AT&T Labs AT&T Labs May 10, 2011 Abstract Randomized algorithms are often enjoyed for their simplicity, but the hash functions used to yield the desired theoretical guarantees are often neither simple nor practical. Here we show that the simplest possible tabulation hashing provides unexpectedly strong guarantees. The scheme itself dates back to Carter and Wegman (STOC'77). Keys are viewed as consist- ing of c characters. We initialize c tables T1;:::;Tc mapping characters to random hash codes. A key x = (x1; : : : ; xc) is hashed to T1[x1] ⊕ · · · ⊕ Tc[xc], where ⊕ denotes xor. While this scheme is not even 4-independent, we show that it provides many of the guarantees that are normally obtained via higher independence, e.g., Chernoff-type concentration, min-wise hashing for estimating set intersection, and cuckoo hashing. Contents 1 Introduction 2 2 Concentration Bounds 7 3 Linear Probing and the Concentration in Arbitrary Intervals 13 4 Cuckoo Hashing 19 5 Minwise Independence 27 6 Fourth Moment Bounds 32 arXiv:1011.5200v2 [cs.DS] 7 May 2011 A Experimental Evaluation 39 B Chernoff Bounds with Fixed Means 46 1 1 Introduction An important target of the analysis of algorithms is to determine whether there exist practical schemes, which enjoy mathematical guarantees on performance. Hashing and hash tables are one of the most common inner loops in real-world computation, and are even built-in \unit cost" operations in high level programming languages that offer associative arrays. Often, these inner loops dominate the overall computation time. -
Incremental Edge Orientation in Forests Michael A
Incremental Edge Orientation in Forests Michael A. Bender Stony Brook University [email protected] Tsvi Kopelowitz Bar-Ilan University [email protected] William Kuszmaul Massachusetts Institute of Technology [email protected] Ely Porat Bar-Ilan University [email protected] Clifford Stein Columbia University cliff@ieor.columbia.edu Abstract For any forest G = (V, E) it is possible to orient the edges E so that no vertex in V has out-degree greater than 1. This paper considers the incremental edge-orientation problem, in which the edges E arrive over time and the algorithm must maintain a low-out-degree edge orientation at all times. We give an algorithm that maintains a maximum out-degree of 3 while flipping at most O(log log n) edge orientations per edge insertion, with high probability in n. The algorithm requires worst-case time O(log n log log n) per insertion, and takes amortized time O(1). The previous state of the art required up to O(log n/ log log n) edge flips per insertion. We then apply our edge-orientation results to the problem of dynamic Cuckoo hashing. The problem of designing simple families H of hash functions that are compatible with Cuckoo hashing has received extensive attention. These families H are known to satisfy static guarantees, but do not come typically with dynamic guarantees for the running time of inserts and deletes. We show how to transform static guarantees (for 1-associativity) into near-state-of-the-art dynamic guarantees (for O(1)-associativity) in a black-box fashion. -
Fundamental Data Structures Contents
Fundamental Data Structures Contents 1 Introduction 1 1.1 Abstract data type ........................................... 1 1.1.1 Examples ........................................... 1 1.1.2 Introduction .......................................... 2 1.1.3 Defining an abstract data type ................................. 2 1.1.4 Advantages of abstract data typing .............................. 4 1.1.5 Typical operations ...................................... 4 1.1.6 Examples ........................................... 5 1.1.7 Implementation ........................................ 5 1.1.8 See also ............................................ 6 1.1.9 Notes ............................................. 6 1.1.10 References .......................................... 6 1.1.11 Further ............................................ 7 1.1.12 External links ......................................... 7 1.2 Data structure ............................................. 7 1.2.1 Overview ........................................... 7 1.2.2 Examples ........................................... 7 1.2.3 Language support ....................................... 8 1.2.4 See also ............................................ 8 1.2.5 References .......................................... 8 1.2.6 Further reading ........................................ 8 1.2.7 External links ......................................... 9 1.3 Analysis of algorithms ......................................... 9 1.3.1 Cost models ......................................... 9 1.3.2 Run-time analysis -
Vivid Cuckoo Hash: Fast Cuckoo Table Building in SIMD
ViViD Cuckoo Hash: Fast Cuckoo Table Building in SIMD Flaviene Scheidt de Cristo, Eduardo Cunha de Almeida, Marco Antonio Zanata Alves 1Informatics Department – Federal Univeristy of Parana´ (UFPR) – Curitiba – PR – Brazil ffscristo,eduardo,[email protected] Abstract. Hash Tables play a lead role in modern databases systems during the execution of joins, grouping, indexing, removal of duplicates, and accelerating point queries. In this paper, we focus on Cuckoo Hash, a technique to deal with collisions guaranteeing that data is retrieved with at most two memory ac- cess in the worst case. However, building the Cuckoo Table with the current scalar methods is inefficient when treating the eviction of the colliding keys. We propose a Vertically Vectorized data-dependent method to build Cuckoo Ta- bles - ViViD Cuckoo Hash. Our method exploits data parallelism with AVX-512 SIMD instructions and transforms control dependencies into data dependencies to make the build process faster with an overall reduction in response time by 90% compared to the scalar Cuckoo Hash. 1. Introduction The usage of hash tables in the execution of joins, grouping, indexing, and removal of duplicates is a widespread technique on modern database systems. In the particular case of joins, hash tables do not require nested loops and sorting, dismissing the need to execute multiple full scans over the same relation (table). However, a hash table is as good as its strategy to avoid or deal with collisions. Cuckoo Hash [Pagh and Rodler 2004] stands among the most efficient ways of dealing with collisions. Using open addressing, it does not use additional structures and pointers - as Chained Hash [Cormen et al.