Fast Hashing with Strong Concentration Bounds
Anders Aamand (BARC, University of Copenhagen, Denmark)
Jakob Bæk Tejs Knudsen (BARC, University of Copenhagen, Denmark)
Mathias Bæk Tejs Knudsen (SupWiz, Denmark)
Peter Michael Reichstein Rasmussen (BARC, University of Copenhagen, Denmark)
Mikkel Thorup (BARC, University of Copenhagen, Denmark)

ABSTRACT
Previous work on tabulation hashing by Pătraşcu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of c = O(1) characters, e.g., a 64-bit key as c = 8 characters of 8 bits. The character domain Σ should be small enough that character tables of size |Σ| fit in fast cache. The schemes then use O(1) tables of this size, so the space of tabulation hashing is O(|Σ|). However, the concentration bounds by Pătraşcu and Thorup only apply if the expected sums are ≪ |Σ|.

To see the problem, consider the very simple case where we use tabulation hashing to throw n balls into m bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if n = m, for then the expected value is 1. However, if m = 2, as when tossing n unbiased coins, the expected value n/2 is ≫ |Σ| for large data sets, e.g., data sets that do not fit in fast cache.

To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.
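To fix ideas, here is a minimal Python sketch of simple tabulation hashing as outlined above: the key is viewed as c characters, each character indexes its own table of random words, and the lookups are combined by XOR (the standard combination step in simple tabulation). The concrete parameters below (32-bit keys, c = 4 characters of 8 bits, the reduction to m bins at the end) are illustrative assumptions for this sketch, not choices made by the paper.

```python
import random

# Illustrative parameters (assumptions for this sketch, not fixed by the paper):
# 32-bit keys viewed as c = 4 characters of 8 bits, so |Sigma| = 2^8 = 256.
C = 4                      # number of characters per key
CHAR_BITS = 8              # bits per character
SIGMA = 1 << CHAR_BITS     # size of the character domain |Sigma|
HASH_BITS = 32             # bits in a hash value

# One table of |Sigma| random words per character position; total space is
# C * |Sigma| words = O(|Sigma|), small enough to sit in fast cache.
TABLES = [[random.getrandbits(HASH_BITS) for _ in range(SIGMA)]
          for _ in range(C)]

def simple_tabulation(key: int) -> int:
    """Split the key into C characters, look each up in its table, XOR the results."""
    h = 0
    for i in range(C):
        char = (key >> (i * CHAR_BITS)) & (SIGMA - 1)  # extract the i-th character
        h ^= TABLES[i][char]
    return h

# Example: hash a key and reduce it to one of m bins.
m = 1024
print(simple_tabulation(0xDEADBEEF) % m)
```

Evaluating the hash is just c cache-resident table lookups and XORs, which is where the speed comes from; the space is c·|Σ| words, i.e., O(|Σ|), as stated in the abstract.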
CCS CONCEPTS
• Theory of computation → Pseudorandomness and derandomization; Data structures design and analysis; Sketching and sampling.

KEYWORDS
hashing, Chernoff bounds, concentration bounds, streaming algorithms, sampling

ACM Reference Format:
Anders Aamand, Jakob Bæk Tejs Knudsen, Mathias Bæk Tejs Knudsen, Peter Michael Reichstein Rasmussen, and Mikkel Thorup. 2020. Fast Hashing with Strong Concentration Bounds. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC '20), June 22–26, 2020, Chicago, IL, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3357713.3384259

1 INTRODUCTION
Chernoff's concentration bounds [13] date back to the 1950s, but bounds of this type go even further back to Bernstein in the 1920s [8]. Originating from the area of statistics, they are now one of the most basic tools of randomized algorithms [36]. A canonical form considers the sum X = Σ_{i=1}^n X_i of independent random variables X_1, ..., X_n ∈ [0, 1]. Writing µ = E[X], it holds for any ε ≥ 0 that

    Pr[X ≥ (1 + ε)µ] ≤ exp(−µ C(ε))                  (1)
                     ≤ exp(−ε²µ/3)   for ε ≤ 1,

and for any 0 ≤ ε ≤ 1,

    Pr[X ≤ (1 − ε)µ] ≤ exp(−µ C(−ε))                 (2)
                     ≤ exp(−ε²µ/2).

Here C : (−1, ∞) → [0, ∞) is given by C(x) = (x + 1) ln(x + 1) − x, so exp(−C(x)) = e^x / (1 + x)^(1+x). Textbook proofs of (1) and (2) can be found in [36, §4] (the bounds there are stated as working only for X_i ∈ {0, 1}, but the proofs easily handle any X_i ∈ [0, 1]). Writing σ² = Var[X], a more general bound is that for any t ≥ 0,

    Pr[|X − µ| ≥ t] ≤ 2 exp(−σ² C(t/σ²))             (3)
                    ≤ 2 exp(−(t/σ)²/3)   for t ≤ σ².

Since σ² ≤ µ and C(−ε) ≤ 1.5 C(ε) for ε ≤ 1, (3) is at least as good as (1) and (2), up to constant factors, and often better. In this work, we state our results in relation to (3), known as Bennett's inequality [7].

Hashing is another fundamental tool of randomized algorithms dating back to the 1950s [24]. A random hash function, h : U → R, assigns a hash value, h(x) ∈ R, to every key x ∈ U. Here both U and R are typically bounded integer ranges. The original application was hash tables with chaining, where x is placed in bin h(x), but today, hash functions are ubiquitous in randomized algorithms. For instance, they play a fundamental role in streaming and distributed settings where a system uses a hash function to coordinate the random choices for a given key. In most applications, we require concentration bounds for one of the following cases of increasing generality.

(1) Let S ⊆ U be a set of balls and assign to each ball, x ∈ S, a weight, w_x ∈ [0, 1]. We wish to distribute the balls of S into a set of bins R = [m] = {0, 1, ..., m − 1}. For a bin, y ∈ [m], X = Σ_{x∈S} w_x · [h(x) = y] is then the total weight of the balls landing in bin y.
(2) We may instead be interested in the total weight of the balls with hash values in the interval [y_1, y_2) for some y_1, y_2 ∈ [m], that is, X = Σ_{x∈S} w_x · [y_1 ≤ h(x) < y_2].
(3) More generally, we may consider a fixed value function v : U × R → [0, 1]. For each key x ∈ U, we define the random variable X_x = v(x, h(x)), where the randomness of X_x stems from that of h(x). We write X = Σ_{x∈U} v(x, h(x)) for the sum of these values.

To exemplify applications, the first case is common when trying to allocate resources; the second case arises in streaming algorithms; and the third case handles the computation of a complicated statistic, X, on incoming data. In each case, we wish the variable X to be concentrated around its mean, µ = E[X], according to the Chernoff-style bound of (3).
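As a concrete reading of case (1) together with bound (3), the sketch below hashes weighted balls into m bins, computes the weight X landing in one bin, and evaluates the right-hand side of Bennett's inequality for a chosen deviation t. The hash function used here is fully random and the data is synthetic; both are stand-in assumptions for illustration, since the point of the paper is to obtain bounds of this style from fast, implementable hash functions rather than full randomness.

```python
import math
import random

def bennett_bound(t: float, variance: float) -> float:
    """Right-hand side of (3): 2 * exp(-sigma^2 * C(t / sigma^2)),
    where C(x) = (x + 1) * ln(x + 1) - x."""
    x = t / variance
    return 2.0 * math.exp(-variance * ((x + 1.0) * math.log(x + 1.0) - x))

# Case (1): a set S of balls with weights w_x in [0, 1], hashed into m bins.
m = 1024
weights = {x: random.random() for x in range(100_000)}   # w_x for each ball x in S
h = {x: random.randrange(m) for x in weights}            # fully random h : S -> [m]

y = 0                                                    # the bin we inspect
X = sum(w for x, w in weights.items() if h[x] == y)      # total weight landing in bin y

mu = sum(weights.values()) / m                           # E[X]
var = (1 / m) * (1 - 1 / m) * sum(w * w for w in weights.values())  # Var[X], fully random h

t = 3.0 * math.sqrt(var)
print(f"X = {X:.1f}, mu = {mu:.1f}")
print(f"Bennett bound on Pr[|X - mu| >= {t:.1f}]: {bennett_bound(t, var):.3g}")
```

The variance formula above is the exact one for a fully random h; the tabulation-based schemes discussed in this paper aim to guarantee the same style of tail bound, up to constants, while remaining fast and space-efficient.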
Let x_0 ∈ [u] be a distinguished query key. Then since for every y_0 ∈ [m], the hash function h′ = (h | h(x_0) = y_0) satisfies that X′ = Σ_{x∈S} w_x · [h′(x) = y_0] is strongly concentrated, it follows that X′′ = Σ_{x∈S} w_x · [h(x) = h(x_0)] is strongly concentrated. Thus, h allows us to get Chernoff-style concentration on the weight of the balls landing in the same bin as x_0.

This may be generalized such that in the third case from above, the weight function may be chosen as a function of h(x_0). Thus, the property of being query invariant is very powerful. It is worth noting that the constants of the asymptotics may change when conditioning on a query. Furthermore, the expected value and variance of X′ may differ from those of X, but this is included in the definition.

One way to achieve Chernoff-style bounds in all of the above cases is through the classic k-independent hashing framework of Wegman and Carter [47]. The random hash function h : U → R is k-independent if for any k distinct keys x_1, ..., x_k ∈ U, (h(x_1), ..., h(x_k)) is uniformly distributed in R^k. Schmidt and Siegel [42] have shown that with k-independence, the above Chernoff bounds hold with an added error probability decreasing exponentially in k. Unfortunately, a lower bound by Siegel [43] implies that evaluating a k-independent hash function takes Ω(k) time unless we use a lot of space (to be detailed later).
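For contrast with the tabulation-based approach, here is the textbook way to realize the k-independent framework just described: a polynomial of degree k − 1 with uniformly random coefficients over a prime field is k-independent on keys below the prime. This sketch is a standard construction given for illustration; the specific prime, the key range, and the final reduction to m bins are conveniences assumed here (the reduction introduces a small bias unless m divides the prime), and it is not a construction taken from this paper.

```python
import random

# A classic Wegman-and-Carter style k-independent hash: a random polynomial of
# degree k-1 over a prime field. The Mersenne prime below is an illustrative
# choice; keys must lie in [0, P).
P = (1 << 61) - 1

def make_k_independent(k: int, m: int):
    """Return h : [P) -> [m). The polynomial mod P is k-independent into [P);
    the final mod m is the usual (slightly biased) range reduction."""
    coeffs = [random.randrange(P) for _ in range(k)]
    def h(x: int) -> int:
        acc = 0
        for a in reversed(coeffs):   # Horner's rule: k multiplications and additions
            acc = (acc * x + a) % P
        return acc % m
    return h

h = make_k_independent(k=8, m=1024)
print(h(12345), h(67890))
```

Evaluation costs Θ(k) arithmetic operations (the Horner loop) while using only O(k) space, which illustrates the time/space trade-off behind the Ω(k) evaluation time mentioned above.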
Pătraşcu and Thorup have shown that Chernoff-style bounds can be achieved in constant time with tabulation based hashing methods; namely simple tabulation [38] for the first case described