Using Tabulation to Implement the Power of Hashing


Talk surveys results from:

• M. Pătrașcu and M. Thorup: The Power of Simple Tabulation Hashing. STOC'11 and J. ACM'12.
• M. Pătrașcu and M. Thorup: Twisted Tabulation Hashing. SODA'13.
• M. Thorup: Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence. FOCS'13.
• S. Dahlgaard and M. Thorup: Approximately Minwise Independence with Twisted Tabulation. SWAT'14.
• T. Christiani, R. Pagh, and M. Thorup: From Independence to Expansion and Back Again. STOC'15.
• S. Dahlgaard, M.B.T. Knudsen, E. Rotenberg, and M. Thorup: Hashing for Statistics over K-Partitions. FOCS'15.

Target

• Simple and reliable pseudo-random hashing.
• Providing algorithmically important probabilistic guarantees akin to those of truly random hashing, yet easy to implement.
• Bridging theory (assuming truly random hashing) with practice (needing something implementable).
• Many randomized algorithms are very simple and popular in practice, but are often implemented with too simple hash functions, so their guarantees hold only for sufficiently random input.
• Too simple hash functions may work deceivingly well in random tests, but the real world is full of structured data on which they may fail miserably (as we shall see later).

Applications of Hashing

Hash tables (n keys and 2n hashes: expect 1/2 keys per hash):
• chaining: follow pointers
• linear probing: sequential search in one array
• cuckoo hashing: search 2 locations, complex updates
[diagrams of chaining, linear probing, and cuckoo hash tables omitted]

Sketching, streaming, and sampling:
• second moment estimation: F₂(x̄) = Σᵢ xᵢ²
• sketch A and B to later find |A ∩ B| / |A ∪ B|, using
  |A ∩ B| / |A ∪ B| = Pr_h[min h(A) = min h(B)].
  We need h to be ε-minwise independent: ∀x ∈ S: Pr[h(x) = min h(S)] = (1 ± ε)/|S|.

Wegman & Carter [FOCS'77]

We do not have space for truly random hash functions, but: a family H = {h : [u] → [b]} is k-independent iff for a random h ∈ H:
• ∀x ∈ [u]: h(x) is uniform in [b];
• ∀x₁, ..., x_k ∈ [u]: h(x₁), ..., h(x_k) are independent.

Prototypical example: a degree k−1 polynomial:
• u = b prime;
• choose a₀, a₁, ..., a_{k−1} randomly in [u];
• h(x) = (a₀ + a₁x + ··· + a_{k−1}x^{k−1}) mod u.

Many solutions for k-independent hashing have been proposed, but they are generally slow for k > 3 and too slow for k > 5. (A sketch of the polynomial scheme in C follows.)
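As a concrete illustration (not from the talk), here is a minimal C sketch of the degree k−1 polynomial scheme, using Horner's rule and the Mersenne prime 2⁶¹ − 1 in place of the generic prime u = b; rand() stands in for a proper source of random coefficients, and the 128-bit multiply relies on the GCC/Clang __uint128_t extension.

```c
#include <stdint.h>
#include <stdlib.h>

/* Degree k-1 polynomial hashing:
 * h(x) = (a_0 + a_1 x + ... + a_{k-1} x^{k-1}) mod p. */

#define K 5                              /* k = 5: 5-independent */
static const uint64_t P = ((uint64_t)1 << 61) - 1;
static uint64_t a[K];                    /* random coefficients in [p] */

/* x * y mod 2^61 - 1, via a 128-bit intermediate product. */
static uint64_t mulmod(uint64_t x, uint64_t y) {
    __uint128_t z = (__uint128_t)x * y;
    uint64_t s = (uint64_t)(z & P) + (uint64_t)(z >> 61);
    return s >= P ? s - P : s;
}

void poly_init(void) {                   /* pick the k coefficients */
    for (int i = 0; i < K; i++)
        a[i] = (((uint64_t)rand() << 31) ^ (uint64_t)rand()) % P;
}

uint64_t poly_hash(uint64_t x) {         /* Horner's rule: k-1 mults */
    uint64_t h = a[K - 1];
    for (int i = K - 2; i >= 0; i--)
        h = (mulmod(h, x % P) + a[i]) % P;
    return h;
}
```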
Recommended publications
  • The Power of Tabulation Hashing
    The Power of Tabulation Hashing
    Mikkel Thorup, University of Copenhagen · AT&T

    Thank you for inviting me to China Theory Week. Joint work with Mihai Pătrașcu; some of it found in Proc. STOC'11.

    Target
    • Simple and reliable pseudo-random hashing.
    • Providing algorithmically important probabilistic guarantees akin to those of truly random hashing, yet easy to implement.
    • Bridging theory (assuming truly random hashing) with practice (needing something implementable).

    Applications of Hashing
    Hash tables (n keys and 2n hashes: expect 1/2 keys per hash)
    • chaining: follow pointers
    [diagram of hash-table chains omitted]
  • Non-Empty Bins with Simple Tabulation Hashing
    Non-Empty Bins with Simple Tabulation Hashing
    Anders Aamand and Mikkel Thorup — November 1, 2018 (arXiv:1810.13187)

    Abstract. We consider the hashing of a set X ⊆ U with |X| = m using a simple tabulation hash function h : U → [n] = {0, ..., n−1} and analyse the number of non-empty bins, that is, the size of h(X). We show that the expected size of h(X) matches that with fully random hashing to within low-order terms. We also provide concentration bounds. The number of non-empty bins is a fundamental measure in the balls and bins paradigm, and it is critical in applications such as Bloom filters and Filter hashing. For example, normally Bloom filters are proportioned for a desired low false-positive probability assuming fully random hashing (see en.wikipedia.org/wiki/Bloom_filter). Our results imply that if we implement the hashing with simple tabulation, we obtain the same low false-positive probability for any possible input.

    1 Introduction. We consider the balls and bins paradigm where a set X ⊆ U of |X| = m balls are distributed into a set of n bins according to a hash function h : U → [n]. We are interested in questions relating to the distribution of |h(X)|, for example: What is the expected number of non-empty bins? How well is |h(X)| concentrated around its mean? And what is the probability that a query ball lands in an empty bin? These questions are critical in applications such as Bloom filters [3] and Filter hashing [7].
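    For intuition (not from the paper): the fully random baseline it matches is the standard balls-and-bins expectation, which a few lines of C can evaluate.

```c
#include <math.h>
#include <stdio.h>

/* Expected number of non-empty bins with fully random hashing:
 * E[|h(X)|] = n * (1 - (1 - 1/n)^m) for m balls in n bins.
 * The paper shows simple tabulation matches this to low-order terms. */
int main(void) {
    double n = 1 << 20, m = 1 << 20;   /* m = n: expect ~ n(1 - 1/e) */
    printf("expected non-empty bins: %.0f of %.0f\n",
           n * (1.0 - pow(1.0 - 1.0 / n, m)), n);
    return 0;
}
```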
  • 6.1 Systems of Hash Functions
    6 Hashing

    6.1 Systems of hash functions

    Notation:
    • The universe U. Usually, the universe is a set of integers {0, ..., U−1}, which will be denoted by [U].
    • The set of buckets B = [m].
    • The set X ⊂ U of n items stored in the data structure.
    • Hash function h : U → B.

    Definition: Let H be a family of functions from U to [m]. We say that the family is c-universal for some c > 0 if for every pair x, y of distinct elements of U we have

        Pr_{h ∈ H}[h(x) = h(y)] ≤ c/m.

    In other words, if we pick a hash function h uniformly at random from H, the probability that x and y collide is at most c times more than for a completely random function h. Occasionally, we are not interested in the specific value of c, so we simply say that the family is universal.

    Note: We will assume that a function can be chosen from the family uniformly at random in constant time. Once chosen, it can be evaluated for any x in constant time. Typically, we define a single parametrized function h_a(x) and let the family H consist of all h_a for all possible choices of the parameter a (a sketch of one such family follows this excerpt). Picking a random h ∈ H is therefore implemented by picking a random a. Technically, this makes H a multi-set, since different values of a may lead to the same function h.

    Theorem: Let H be a c-universal family of functions from U to [m], X = {x_1, ..., x_n} ⊆ U a set of items stored in the data structure, and y ∈ U \ X another item not stored in the data structure.
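    As a hedged illustration (not part of the notes), the classic Carter–Wegman multiply-mod-prime construction is one such parametrized family; the prime choice and names below are illustrative, and rand() stands in for a proper random source.

```c
#include <stdint.h>
#include <stdlib.h>

/* Multiply-mod-prime family: h_{a,b}(x) = ((a*x + b) mod p) mod m.
 * With p prime, a in {1,...,p-1}, b in {0,...,p-1}, the family is
 * universal: Pr[h(x) = h(y)] <= c/m for distinct x, y, with c close to 1. */

typedef struct { uint64_t a, b, p, m; } uhash;

uhash uhash_pick(uint64_t m) {        /* pick a random member for m buckets */
    uhash h = { .p = ((uint64_t)1 << 31) - 1, .m = m }; /* p > U for 31-bit keys */
    h.a = 1 + (uint64_t)rand() % (h.p - 1);
    h.b = (uint64_t)rand() % h.p;
    return h;
}

uint64_t uhash_eval(uhash h, uint64_t x) {
    return ((h.a * x + h.b) % h.p) % h.m;  /* 64-bit product exact for p < 2^32 */
}
```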
  • Linear Probing with 5-Independent Hashing
    Lecture Notes on Linear Probing with 5-Independent Hashing
    Mikkel Thorup — May 12, 2017 (arXiv:1509.04549)

    Abstract. These lecture notes show that linear probing takes expected constant time if the hash function is 5-independent. This result was first proved by Pagh et al. [STOC'07, SICOMP'09]. The simple proof here is essentially taken from [Pătrașcu and Thorup ICALP'10]. We will also consider a smaller space version of linear probing that may have false positives like Bloom filters. These lecture notes illustrate the use of higher moments in data structures, and could be used in a course on randomized algorithms.

    1 k-independence. The concept of k-independence was introduced by Wegman and Carter [21] in FOCS'79 and has been the cornerstone of our understanding of hash functions ever since. A hash function is a random function h : [u] → [t] mapping keys to hash values. Here [s] = {0, ..., s−1}. We can also think of h as a random variable distributed over [t]^[u]. We say that h is k-independent if for any distinct keys x_0, ..., x_{k−1} ∈ [u] and (possibly non-distinct) hash values y_0, ..., y_{k−1} ∈ [t], we have Pr[h(x_0) = y_0 ∧ ··· ∧ h(x_{k−1}) = y_{k−1}] = 1/t^k. Equivalently, we can define k-independence via two separate conditions; namely,

    (a) for any distinct keys x_0, ..., x_{k−1} ∈ [u], the hash values h(x_0), ..., h(x_{k−1}) are independent random variables, that is, for any (possibly non-distinct) hash values y_0, ..., y_{k−1} ∈ [t] and i ∈ [k], Pr[h(x_i) = y_i] = Pr[h(x_i) = y_i | ⋀_{j ∈ [k]\{i}} h(x_j) = y_j], and

    (b) for any x ∈ [u], h(x) is uniformly distributed in [t].
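    To make the setting concrete, here is a minimal C sketch of linear probing itself (not from the notes). The hash function h is assumed to be, e.g., a 5-independent polynomial hash as analyzed above, and the table is assumed to be kept at load factor below 1 so probe sequences terminate.

```c
#include <stdint.h>
#include <string.h>

/* Linear probing: probe h(x), h(x)+1, ... (mod T) until x or an empty
 * slot is found. With 5-independent h, expected probes are O(1). */

#define T 1024                 /* table size, at least 2n for load 1/2 */
#define EMPTY UINT64_MAX       /* sentinel: assumes keys != UINT64_MAX */
static uint64_t table[T];

extern uint64_t h(uint64_t x); /* e.g. the 5-independent hash above */

void lp_init(void) { memset(table, 0xff, sizeof table); }

void lp_insert(uint64_t x) {   /* assumes the table never fills up */
    for (uint64_t i = h(x) % T; ; i = (i + 1) % T)
        if (table[i] == EMPTY || table[i] == x) { table[i] = x; return; }
}

int lp_lookup(uint64_t x) {
    for (uint64_t i = h(x) % T; table[i] != EMPTY; i = (i + 1) % T)
        if (table[i] == x) return 1;
    return 0;
}
```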
  • The Power of Simple Tabulation Hashing
    The Power of Simple Tabulation Hashing
    Mihai Pătrașcu (AT&T Labs) and Mikkel Thorup (AT&T Labs) — May 10, 2011 (arXiv:1011.5200)

    Abstract. Randomized algorithms are often enjoyed for their simplicity, but the hash functions used to yield the desired theoretical guarantees are often neither simple nor practical. Here we show that the simplest possible tabulation hashing provides unexpectedly strong guarantees. The scheme itself dates back to Carter and Wegman (STOC'77). Keys are viewed as consisting of c characters. We initialize c tables T_1, ..., T_c mapping characters to random hash codes. A key x = (x_1, ..., x_c) is hashed to T_1[x_1] ⊕ ··· ⊕ T_c[x_c], where ⊕ denotes xor. While this scheme is not even 4-independent, we show that it provides many of the guarantees that are normally obtained via higher independence, e.g., Chernoff-type concentration, min-wise hashing for estimating set intersection, and cuckoo hashing.

    Contents: 1 Introduction · 2 Concentration Bounds · 3 Linear Probing and the Concentration in Arbitrary Intervals · 4 Cuckoo Hashing · 5 Minwise Independence · 6 Fourth Moment Bounds · A Experimental Evaluation · B Chernoff Bounds with Fixed Means

    1 Introduction. An important target of the analysis of algorithms is to determine whether there exist practical schemes which enjoy mathematical guarantees on performance. Hashing and hash tables are one of the most common inner loops in real-world computation, and are even built-in "unit cost" operations in high level programming languages that offer associative arrays. Often, these inner loops dominate the overall computation time.
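    The scheme in the abstract is short enough to write out in full. A minimal C sketch for 32-bit keys split into c = 4 eight-bit characters; rand() again stands in for a real source of random table entries.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simple tabulation: h(x) = T[0][x_0] ^ T[1][x_1] ^ T[2][x_2] ^ T[3][x_3],
 * where x_i are the 8-bit characters of a 32-bit key. */

#define C 4
static uint32_t T[C][256];     /* c character tables of random hash codes */

void tab_init(void) {          /* fill tables with random 32-bit codes */
    for (int i = 0; i < C; i++)
        for (int j = 0; j < 256; j++)
            T[i][j] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

uint32_t tab_hash(uint32_t x) {
    uint32_t h = 0;
    for (int i = 0; i < C; i++, x >>= 8)
        h ^= T[i][x & 0xff];   /* one cache-resident lookup per character */
    return h;
}
```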
  • Fundamental Data Structures Contents
    Fundamental Data Structures — Contents

    1 Introduction
      1.1 Abstract data type: 1.1.1 Examples · 1.1.2 Introduction · 1.1.3 Defining an abstract data type · 1.1.4 Advantages of abstract data typing · 1.1.5 Typical operations · 1.1.6 Examples · 1.1.7 Implementation · 1.1.8 See also · 1.1.9 Notes · 1.1.10 References · 1.1.11 Further reading · 1.1.12 External links
      1.2 Data structure: 1.2.1 Overview · 1.2.2 Examples · 1.2.3 Language support · 1.2.4 See also · 1.2.5 References · 1.2.6 Further reading · 1.2.7 External links
      1.3 Analysis of algorithms: 1.3.1 Cost models · 1.3.2 Run-time analysis
  • Fast and Powerful Hashing Using Tabulation
    Fast and Powerful Hashing using Tabulation
    Mikkel Thorup, University of Copenhagen — January 4, 2017

    Abstract. Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees.

    Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as consisting of c characters and we have precomputed character tables h_1, ..., h_c mapping characters to random hash values. A key x = (x_1, ..., x_c) is hashed to h_1[x_1] ⊕ h_2[x_2] ⊕ ··· ⊕ h_c[x_c]. This scheme is very fast with character tables in cache. While simple tabulation is not even 4-independent, it does provide many of the guarantees that are normally obtained via higher independence, e.g., linear probing and Cuckoo hashing.

    Next we consider twisted tabulation, where one input character is "twisted" in a simple way. The resulting hash function has powerful distributional properties: Chernoff-style tail bounds and a very small bias for min-wise hashing. This also yields an extremely fast pseudo-random number generator that is provably good for many classic randomized algorithms and data structures. (A rough sketch of the twist appears below.)

    Finally, we consider double tabulation, where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Carter and Wegman [1977]. In fact, w.h.p., for a given set of size proportional to that of the space consumed, double tabulation gives fully-random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space.
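    Simple tabulation is sketched after the Pătrașcu–Thorup entry above. For the twist, the following rough C sketch conveys the idea under our own illustrative table layout (64-bit entries packing an 8-bit "twister" with a hash code); it is a loose reading of the survey's description, not the paper's exact scheme.

```c
#include <stdint.h>

/* Twisted tabulation (sketch): lookups for the first c-1 characters
 * return both a hash code and an 8-bit twister; the twister is XORed
 * into the last character before its own lookup. Tables are filled
 * with random bits at init, as in simple tabulation (omitted). */

#define C 4
static uint64_t Tt[C - 1][256];  /* entry = (hash << 8) | twister */
static uint32_t Th[256];         /* plain table for the last character */

uint32_t twisted_hash(uint32_t x) {
    uint64_t ht = 0;
    for (int i = 0; i < C - 1; i++, x >>= 8)
        ht ^= Tt[i][x & 0xff];
    uint8_t twist = (uint8_t)ht;                  /* low 8 bits twist... */
    return (uint32_t)(ht >> 8) ^ Th[(x & 0xff) ^ twist]; /* ...the last char */
}
```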
  • Fast Hashing with Strong Concentration Bounds
    Fast Hashing with Strong Concentration Bounds
    Anders Aamand, Jakob Bæk Tejs Knudsen, Mathias Bæk Tejs Knudsen, Peter Michael Reichstein Rasmussen, and Mikkel Thorup (BARC, University of Copenhagen; SupWiz, Denmark)

    Published in: K. Makarychev, Y. Makarychev, M. Tulsiani, G. Kamath, & J. Chuzhoy (Eds.), STOC 2020 — Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (pp. 1265–1278). Association for Computing Machinery. DOI: 10.1145/3357713.3384259.

    Keywords: hashing, Chernoff bounds, concentration bounds, streaming algorithms, sampling.

    Abstract. Previous work on tabulation hashing by Pătrașcu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums.
  • Hashing, Load Balancing and Multiple Choice
    Hashing, Load Balancing and Multiple Choice
    Udi Wieder, VMware Research — [email protected]

    Full text available at: http://dx.doi.org/10.1561/0400000070

    Published, sold and distributed by now Publishers Inc., Boston — Delft. ISBN: 978-1-68083-282-2. © 2017 U. Wieder.

    The preferred citation for this publication is: U. Wieder. Hashing, Load Balancing and Multiple Choice. Foundations and Trends in Theoretical Computer Science, vol. 12, no. 3-4, pp. 275–379, 2016.
  • Hashing: We Don't Know in Advance What the User Will Put into the Table
    Algorithms — Lecture 5: Hash Tables [Sp'20]

    Insanity is repeating the same mistakes and expecting different results. — Narcotics Anonymous (1981)

    Calvin: There! I finished our secret code!
    Hobbes: Let's see.
    Calvin: I assigned each letter a totally random number, so the code will be hard to crack. For letter "A", you write 3,004,572,688. "B" is 28,731,569½.
    Hobbes: That's a good code all right.
    Calvin: Now we just commit this to memory.
    Calvin: Did you finish your map of our neighborhood?
    Hobbes: Not yet. How many bricks does the front walk have?
    — Bill Watterson, "Calvin and Hobbes" (August 23, 1990)

    [RFC 1149.5 specifies 4 as the standard IEEE-vetted random number.] — Randall Munroe, xkcd (http://xkcd.com/221/), reproduced under a Creative Commons Attribution-NonCommercial 2.5 License

    5 Hash Tables

    5.1 Introduction. A hash table is a data structure for storing a set of items, so that we can quickly determine whether an item is or is not in the set. The basic idea is to pick a hash function h that maps every possible item x to a small integer h(x). Then we use the hash value h(x) as a key to access x in the data structure. In its simplest form, a hash table is an array, in which each item x is stored at index h(x). Let's be a little more specific. We want to store a set of n items. Each item is an element of a fixed set U called the universe; we use u = |U| to denote the size of the universe, which is just the number of items in U.
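    The "simplest form" above stores x directly at index h(x); the chaining variant mentioned throughout these references resolves collisions with a per-slot list instead. A minimal C sketch (illustrative names and sizes, no deletion or resizing):

```c
#include <stdint.h>
#include <stdlib.h>

/* Chaining: slot h(x) holds a linked list of all items hashing to it;
 * lookups walk one chain, of expected O(1) length at constant load n/M. */

#define M 256                      /* number of slots */
typedef struct node { uint64_t key; struct node *next; } node;
static node *slot[M];

extern uint64_t h(uint64_t x);     /* any of the hash functions above */

void ht_insert(uint64_t x) {
    node *n = malloc(sizeof *n);   /* sketch: allocation unchecked */
    n->key = x;
    n->next = slot[h(x) % M];      /* push onto the chain for slot h(x) */
    slot[h(x) % M] = n;
}

int ht_lookup(uint64_t x) {
    for (node *n = slot[h(x) % M]; n; n = n->next)
        if (n->key == x) return 1;
    return 0;
}
```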
  • Chapter 13 Part 2
    Chapter 13: Randomized Algorithms, Part 2
    Slides by Kevin Wayne, modified by Rasmus Pagh. Copyright © 2005 Pearson-Addison Wesley. All rights reserved.

    13.10 Load Balancing

    Chernoff Bounds (13.6). Q: If we hash m items into n buckets, how many items will be in the largest bucket?

    Theorem. Suppose X_1, ..., X_n are independent 0-1 random variables. Let X = X_1 + ... + X_n. Then for any µ ≥ E[X] and for any δ > 0, we have:
    • Upper tail bound: Pr[X > (1+δ)µ] < [e^δ / (1+δ)^{1+δ}]^µ — "a sum of independent 0-1 random variables is tightly centered on the mean."
    • Lower tail bound: Pr[X < (1−δ)µ] < e^{−δ²µ/2}

    Load Balancing: Average load 1. Theorem. For m = n items, with probability ≥ 1 − 1/n no bucket receives more than C log n / log log n items, for some constant C. Fact: this bound is asymptotically tight: with high probability, some bucket receives Θ(log n / log log n) items.

    Analysis. Let X_i = number of items hashed to bucket i. Let Y_ij = 1 if item j hashed to bucket i, and 0 otherwise. We have E[Y_ij] = 1/n. Thus, X_i = Σ_j Y_ij, and µ = E[X_i] = 1. Apply Chernoff to bound the probability that X_i > C log n / log log n (one way to finish this step is sketched below).

    Load Balancing: Larger buckets. Theorem. If there are m = 16 n ln n items (avg. µ = 16 ln n items/bucket), then with high probability every bucket will have between half and twice the average load.

    Pf. Let X_i, Y_ij be as before. Applying the Chernoff upper (lower) tail bound with δ = 1 (δ = 1/2) yields:

        Pr[X_i > 2µ] < (e/4)^{16 ln n} < (1/e)^{2 ln n} = 1/n²
        Pr[X_i < µ/2] < e^{−(1/2)(1/2)²(16 ln n)} = e^{−2 ln n} = 1/n²

    A union bound ⇒ every bucket has load between half and twice the average with probability ≥ 1 − 2/n. ▪

    13.6 Universal Hashing. Hash function.
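    The slides leave the final Chernoff computation of the average-load theorem implicit; one way to fill in that step (our sketch, not the slides') in LaTeX:

```latex
% With \mu = 1 and 1+\delta = C\log n/\log\log n, the upper tail bound gives
\Pr\Bigl[X_i > \tfrac{C\log n}{\log\log n}\Bigr]
  < \Bigl(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\Bigr)^{\mu}
  < \Bigl(\frac{e}{1+\delta}\Bigr)^{1+\delta}
  = \exp\bigl((1+\delta)(1-\ln(1+\delta))\bigr)
  \le n^{-2}
% for C large enough, since \ln(1+\delta) = \Theta(\log\log n);
% a union bound over the n buckets then yields probability >= 1 - 1/n.
```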
  • Lecture 10 Student Notes
    6.897: Advanced Data Structures — Spring 2012
    Lecture 10 — March 20, 2012. Prof. Erik Demaine

    1 Overview. In the last lecture, we finished up talking about memory hierarchies and linked cache-oblivious data structures with geometric data structures. In this lecture we talk about different approaches to hashing. First, we talk about different hash functions and their properties, from basic universality to k-wise independence to a simple but effective hash function called simple tabulation. Then, we talk about different approaches to using these hash functions in a data structure. The approaches we cover are basic chaining, perfect hashing, linear probing, and cuckoo hashing (a sketch of cuckoo hashing follows this excerpt).

    The goal of hashing is to provide a solution that is faster than binary trees. We want to be able to store our information in less than O(u lg u) space and perform operations in less than O(lg u) time. In particular, FKS hashing achieves O(1) worst-case query time with O(n) expected space and takes O(n) construction time for the static dictionary problem. Cuckoo hashing achieves O(n) space, O(1) worst-case query and deletion time, and O(1) amortized insertion time for the dynamic dictionary problem.

    2 Hash Function. In order to hash, we need a hash function. The hash function allows us to map a universe U of u keys to a slot in a table of size m. We define four different hash functions: totally random, universal, k-wise independent, and simple tabulation hashing.

    Definition 1. A hash function is a map h such that h : {0, 1, ..., u−1} → {0, 1, ..., m−1}.
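    Of the approaches listed, cuckoo hashing has the least obvious mechanics. A minimal C sketch of its two-table variant (not from the notes; illustrative sizes, 0 reserved as the empty marker, and hypothetical hash functions h1, h2):

```c
#include <stdint.h>

/* Cuckoo hashing: each key lives at h1(x) in table 1 or h2(x) in
 * table 2, so lookups probe exactly two slots. Inserts evict the
 * current occupant and rehome it, with a cap to detect cycles. */

#define T 1024
static uint64_t t1[T], t2[T];          /* 0 used as "empty" for brevity */

extern uint64_t h1(uint64_t x), h2(uint64_t x);

int cuckoo_lookup(uint64_t x) {
    return t1[h1(x) % T] == x || t2[h2(x) % T] == x;
}

int cuckoo_insert(uint64_t x) {        /* returns 0 if a rehash is needed */
    for (int kicks = 0; kicks < 32; kicks++) {
        uint64_t *s = &t1[h1(x) % T];
        if (*s == 0) { *s = x; return 1; }
        uint64_t y = *s; *s = x;       /* evict occupant of table 1 */
        s = &t2[h2(y) % T];
        if (*s == 0) { *s = y; return 1; }
        x = *s; *s = y;                /* evict from table 2, repeat */
    }
    return 0;                          /* too many kicks: rebuild tables */
}
```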