Linear Probing with Constant Independence

Anna Pagh∗   Rasmus Pagh∗   Milan Ružić∗

∗Computational Logic and Algorithms Group, IT University of Copenhagen, Rued Langgaards Vej 7, 2300 København S, Denmark.

ABSTRACT

Hashing with linear probing dates back to the 1950s, and is among the most studied algorithms. In recent years it has become one of the most important hash table organizations since it uses the cache of modern computers very well. Unfortunately, previous analyses rely either on complicated and space consuming hash functions, or on the unrealistic assumption of free access to a truly random hash function. Already Carter and Wegman, in their seminal paper on universal hashing, raised the question of extending their analysis to linear probing. However, we show in this paper that linear probing using a pairwise independent family may have expected logarithmic cost per operation. On the positive side, we show that 5-wise independence is enough to ensure constant expected time per operation. This resolves the question of finding a space and time efficient hash function that provably ensures good performance for linear probing.

Categories and Subject Descriptors
E.2 [Data]: Data Storage Representations; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Performance, Theory

Keywords
Hashing, Linear Probing

1. INTRODUCTION

Hashing with linear probing is perhaps the simplest algorithm for storing and accessing a set of keys that obtains nontrivial performance. Given a hash function h, a key x is inserted in an array by searching for the first vacant array position in the sequence h(x), h(x)+1, h(x)+2, ... (Here, addition is modulo r, the size of the array.) Retrieval of a key proceeds similarly, until either the key is found, or a vacant position is encountered, in which case the key is not present in the data structure. Deletions can be performed by moving elements back in the probe sequence in a greedy fashion (ensuring that no key x is moved beyond h(x)), until a vacant array position is encountered.
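The three operations can be stated compactly in code. The sketch below is our own illustration of the procedure just described, not code from the paper; `hash_fn` is assumed to map keys to integers, and the table size r plays the role of the array size.

```python
# A minimal sketch of linear probing insert / lookup / delete (illustration only).

class LinearProbingTable:
    def __init__(self, r, hash_fn):
        self.r = r
        self.hash_fn = hash_fn
        self.slots = [None] * r                 # None marks a vacant position

    def _home(self, key):
        return self.hash_fn(key) % self.r

    def _find(self, key):
        """Return the slot holding `key`, or None if a vacant slot is hit first."""
        i = self._home(key)
        for _ in range(self.r):
            if self.slots[i] is None or self.slots[i] == key:
                return i if self.slots[i] == key else None
            i = (i + 1) % self.r                # probe h(x), h(x)+1, h(x)+2, ...
        return None

    def insert(self, key):
        i = self._home(key)
        for _ in range(self.r):
            if self.slots[i] is None or self.slots[i] == key:
                self.slots[i] = key             # first vacant position
                return
            i = (i + 1) % self.r
        raise RuntimeError("table is full")

    def lookup(self, key):
        return self._find(key) is not None

    def _in_cyclic_range(self, k, lo, hi):
        # True iff k lies in the cyclic half-open interval (lo, hi] modulo r.
        return 0 < (k - lo) % self.r <= (hi - lo) % self.r

    def delete(self, key):
        i = self._find(key)
        if i is None:
            return False
        self.slots[i] = None                    # vacate, then move elements back greedily
        j = i
        while True:
            j = (j + 1) % self.r
            if self.slots[j] is None:
                return True                     # stop at a vacant position
            y = self.slots[j]
            # y may be moved back into the hole only if that never takes it
            # cyclically before its own position h(y).
            if not self._in_cyclic_range(self._home(y), i, j):
                self.slots[i], self.slots[j], i = y, None, j
```

Which functions `hash_fn` make these operations provably fast is exactly the question studied in the rest of the paper.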
Linear probing dates back to 1954, but was first analyzed by Knuth in a 1963 memorandum [7] now considered to be the birth of the area of analysis of algorithms [10]. Knuth's analysis, as well as most of the work that has since gone into understanding the properties of linear probing, is based on the assumption that h is a truly random function. In 1977, Carter and Wegman's notion of universal hashing [2] initiated a new era in the design of hashing algorithms, where explicit and efficient ways of choosing hash functions replaced the unrealistic assumption of complete randomness. In their seminal paper, Carter and Wegman state it as an open problem to "Extend the analysis to [...] and open addressing."¹

¹Nowadays the term "open addressing" refers to any hashing scheme where the data structure is an array containing only keys and empty locations. However, Knuth used the term to refer to linear probing in [7], and since it is mentioned here together with the double hashing probe sequence, we believe that it refers to linear probing.

1.1 Previous results using limited randomness

The first analysis of linear probing relying only on limited randomness was given by Siegel and Schmidt in [11, 13]. Specifically, they show that O(log n)-wise independence is sufficient to achieve essentially the same performance as in the fully random case. (We use n to denote the number of keys inserted into the hash table.) Another paper by Siegel [12] shows that evaluation of a hash function from a O(log n)-wise independent family requires time Ω(log n) unless the space used to describe the function is n^Ω(1). A family of functions is given that achieves space usage n^ε and constant time evaluation of functions, for any ε > 0. However, this result is only of theoretical interest since the associated constants are very large (and growing exponentially with 1/ε).

A potentially more practical method is the "split and share" technique described in [4]. It can be used to achieve characteristics similar to those of linear probing, still using space n^ε, for any given ε > 0. The idea is to split the set of keys into many subsets of roughly the same size, and simulate full randomness on each part. Thus, the resulting solution would be a collection of linear probing hash tables.

A significant drawback of both methods above, besides a large number of instructions for function evaluation, is the use of random accesses to the hash function description. The strength of linear probing is that for many practical parameters, almost all lookups will incur only a single cache miss. Performing random accesses while computing the hash function value may destroy this advantage.
1.2 Our results

We show in this paper that linear probing using a pairwise independent family may have expected logarithmic cost per operation. Specifically, we resolve the open problem of Carter and Wegman by showing that linear probing insertion of n keys in a table of size 2n using a function of the form x ↦ ((ax + b) mod p) mod 2n, where p = 4n + 1 is prime and we randomly choose a ∈ [p] \ {0} and b ∈ [p], requires Ω(n log n) insertion steps in expectation for a worst case insertion sequence (chosen independently of a and b). Since the total insertion cost equals the total cost of looking up all keys, the expected average time to look up a key in the resulting hash table is Ω(log n). The main observation behind the proof is that if a is the multiplicative inverse (modulo p) of a small integer m, then inserting a certain set that consists of two intervals has expected cost Ω(n²/m).

On the positive side, we show that 5-wise independence is enough to ensure constant expected time per operation, for load factor α = n/r bounded away from 1. Our proof is based on a new way of bounding the cost of linear probing operations, by counting intervals in which "many" probe sequences start. Our analysis gives a bound of O(1 + 1/(1−α)³) expected time per operation at load factor α. This implies a bound of O(1 + 1/(1−α)²) expected time on average for successful searches. These bounds are a factor Ω(1/(1−α)) higher than for linear probing with full independence. However, this situation changes if r is a power of 2 and the probe sequence slightly changes to
  h(x), h(x) ⊕ 1, h(x) ⊕ 2, h(x) ⊕ 3, ...
where ⊕ denotes bitwise exclusive or. This probe sequence is arguably even more cache friendly than classical linear probing if we assume that memory block boundaries are at powers of 2. In fact, we analyze a slightly more general class of open addressing methods called blocked probing, which also includes a special kind of bidirectional linear probing. For this class we get the same dependence on α (up to constant factors) as for full independence, again using only 5-wise independent hash functions. A particularly precise analysis of successful searches is conducted, showing that the expected number of probes made during a search for a random element in the table is less than 1 + 2/(1−α).

1.3 Significance

Several recent experimental studies [1, 5, 9] have found linear probing to be the fastest hash table organization for moderate load factors (30-70%). While linear probing operations are known to require more instructions than those of other open addressing methods, the fact that they access an interval of array entries means that linear probing works very well with modern architectures for which sequential access is much faster than random access (assuming that the elements we are accessing are each significantly smaller than a cache line, or a disk block, etc.). However, the hash functions used to implement linear probing in practice are heuristics, and there is no known theoretical guarantee on their performance. Since linear probing is particularly sensitive to a bad choice of hash function, Heileman and Luo [5] advise against linear probing for general-purpose use. Our results imply that simple and efficient hash functions, whose description can be stored in CPU registers, can be used to give provably good performance.
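To make the family in the lower bound concrete, the following sketch (ours, for illustration only) samples a function x ↦ ((ax + b) mod p) mod 2n as in the statement of Section 1.2. It is pairwise independent, yet the lower bound says that for a worst-case key set it makes linear probing expensive.

```python
import random

def sample_mod_prime_hash(n, p):
    """Sample h(x) = ((a*x + b) mod p) mod 2n with a in [p]\\{0} and b in [p].
    p is assumed to be a prime of the form 4n + 1, as in Section 1.2."""
    a = random.randrange(1, p)
    b = random.randrange(p)
    return lambda x: ((a * x + b) % p) % (2 * n)
```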
2. PRELIMINARIES

2.1 Notation and definitions

Define [x] = {0, 1, ..., x−1}. Throughout this paper S denotes a subset of some universe U, and h will denote a function from U to R = [r]. We denote the elements of S by {x₁, x₂, ..., x_n}, and refer to the elements of S as keys. We let n = |S|, and α = n/r. For Q ⊆ R we define
  h(S) ∩ₘ Q = {x ∈ S | h(x) ∈ Q}.
For any integers x and a define x ⊖ a = x − (x mod a). The function x ↦ x − ⌊x⌋ is denoted by frac(x).

A family H of functions from U to R is k-wise independent if for any k distinct elements x₁, ..., x_k ∈ U and h chosen uniformly at random from H, the random variables h(x₁), ..., h(x_k) are independent². We refer to the value
  ᾱ_H = n · max_{x ∈ U, ρ ∈ R} Pr_{h ∈ H}{h(x) = ρ}
as the maximum load of H. When the hash function family in question is understood from the context, we omit the subscript of ᾱ. If H distributes hash function values of all elements of U uniformly on R, we will have ᾱ = α, and in general ᾱ ≥ α.

Let Q ⊆ R. By a + Q we denote the translated set {(a + y) mod r | y ∈ Q}, and we define a − Q analogously. An interval (modulo r) is a set of the form a + [b], for integers a and b. We will later use sets of the form h(x) + Q, for a fixed x and with Q being an interval.

We introduce a function which simplifies statements of some upper bounds — those in which constant factors are explicit. Define T(ᾱ) as the function over the domain (0, 1) with the value given by
  T(ᾱ) = 5.2ᾱ/(1−ᾱ)² + 1/3,  if ᾱ ≥ 1/3,
  T(ᾱ) = 2.5ᾱ/(1−ᾱ)⁴,        if ᾱ < 1/3.
It can be checked that T(ᾱ) < 5.7ᾱ/(1−ᾱ)².

²We note that in some papers, the notion of k-wise independence is stronger in that it is required that function values are uniformly distributed in R. However, some interesting k-wise independent families have a slightly nonuniform distribution, and we will provide analysis for such families as well.

2.2 Hash function families

Carter and Wegman [15] observed that the family of degree k−1 polynomials in any finite field is k-wise independent. Specifically, for any prime p we may use the field defined by arithmetic modulo p to get a family of functions from [p] to [p] where a function can be evaluated in time O(k) on a RAM, assuming that addition and multiplication modulo p can be performed in constant time. To obtain a smaller range R = [r] we may map integers in [p] down to [r]. The tabulation-based construction of Thorup and Zhang [14] offers an alternative that is attractive in terms of evaluation time, at the price of larger space usage; in practice, the space usage can be kept so small that it does not matter. The construction for 4-wise independence has been shown to be particularly efficient. Though this is not stated in [14], it is not hard to verify that the same construction in fact gives 5-wise independence, and thus our analysis will apply.

2.3 A probabilistic lemma

Here we state a lemma that is essential for our upper bound results, described in sections 4 and 5. It gives an upper bound on the probability that an interval around a particular hash function value contains the hash function values of "many" keys. The proof is similar to the proof of [8, Lemma 4.19].

Lemma 1. Let S ⊆ U be a set of size n, and H a 5-wise independent family of functions from U to R with maximum load at most ᾱ < 1. If h is chosen uniformly at random from H, then for any Q ⊂ R of size q, and any fixed x ∈ U \ S,
  Pr{ |h(S) ∩ₘ (h(x) + Q)| ≥ ᾱq + d } < (3(ᾱq)² + ᾱq) / d⁴.

Proof. We will show a stronger statement, namely that for any ρ ∈ R the bound holds if we choose h uniformly at random from the subfamily H_ρ = {h ∈ H | h(x) = ρ}. Notice that H_ρ is 4-wise independent on U \ {x}, and that the distribution of function values is identical to the distribution when h is chosen from H. Let p_i = Pr{h(x_i) ∈ (h(x) + Q)}, and consider the random variables (over the choice of h from H_ρ)
  X_i = 1 − p_i, if h(x_i) ∈ h(x) + Q;   X_i = −p_i, otherwise.
Let X = Σ_i X_i and observe that
  |h(S) ∩ₘ (h(x) + Q)| = X + Σ_i p_i ≤ X + ᾱq.
The last inequality above is by the definition of maximum load. Clearly, E(X_i) = 0 for any i, and the variables X₁, ..., X_n are 4-wise independent. Therefore we have E(X_{i₁} X_{i₂} X_{i₃} X_{i₄}) = 0 unless i₁ = i₂ = i₃ = i₄ or (i₁, i₂, i₃, i₄) contains 2 numbers, both of them exactly twice. This means that
  E(X⁴) = Σ_{1≤i₁,i₂,i₃,i₄≤n} E(X_{i₁} X_{i₂} X_{i₃} X_{i₄})
        ≤ Σ_{1≤i≤n} E(X_i⁴) + 6 Σ_{1≤i<j≤n} E(X_i²) E(X_j²).
Since E(X_i⁴) ≤ E(X_i²) ≤ p_i, the first sum is at most Σ_i p_i ≤ ᾱq, and the second term is at most 3(Σ_i p_i)² ≤ 3(ᾱq)². In conclusion we have
  Pr{X ≥ d} ≤ E(X⁴)/d⁴ < (3(ᾱq)² + ᾱq)/d⁴,
finishing the proof.
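As an illustration of the construction in Section 2.2, the sketch below (ours, not from the paper) samples a degree-(k−1) polynomial modulo a prime and reduces it modulo r; with k = 5 this yields a 5-wise independent function of the kind assumed by Lemma 1 and by Sections 4 and 5. The particular prime is only an example.

```python
import random

def sample_polynomial_hash(k, r, p=(1 << 61) - 1):
    """Sample h(x) = (c_{k-1} x^{k-1} + ... + c_1 x + c_0 mod p) mod r,
    a member of the k-wise independent polynomial family of Section 2.2.
    p must be a prime larger than every key (2^61 - 1 is a Mersenne prime)."""
    coeffs = [random.randrange(p) for _ in range(k)]

    def h(x):
        acc = 0
        for c in reversed(coeffs):       # Horner's rule: O(k) additions/multiplications
            acc = (acc * x + c) % p
        return acc % r                   # final reduction to the range R = [r]
    return h

h5 = sample_polynomial_hash(5, r=1 << 16)   # a 5-wise independent function
```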

3. PAIRWISE INDEPENDENCE

In this section we show that pairwise independence is not sufficient to ensure good performance for linear probing: Logarithmic time per operation is needed for a worst-case set. This complements our upper bounds for 5-wise (and higher) independence. We will consider two pairwise independent families: The first one is a very commonly used hash function family. The latter family is similar to the first, except that we have ensured function values to be uniformly distributed in R. To lower bound the cost of linear probing we use the following lemma.

Lemma 2. Suppose that n keys are inserted in a linear probing hash table of size r with probe sequences starting at i₁, ..., i_n, respectively. Further, suppose that I₁, ..., I_ℓ is any set of intervals (modulo r) such that we have the multiset equality ⋃_j {i_j} = ⋃_j I_j. Then the total number of steps to perform the insertions is at least
  Σ_{1≤j₁<j₂≤ℓ} |I_{j₁} ∩ I_{j₂}| / 2.

Proof. We proceed by induction on ℓ. Since the number of insertion steps is independent of the order of insertions, we may assume that the insertions corresponding to I_ℓ occur last. By the induction hypothesis, the total number of steps to do all preceding insertions is at least
  Σ_{1≤j₁<j₂≤ℓ−1} |I_{j₁} ∩ I_{j₂}| / 2.
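The quantity appearing in Lemma 2 is easy to compute directly; the sketch below (our own illustration) does so for intervals given as (start, length) pairs modulo r.

```python
def lemma2_lower_bound(intervals, r):
    """Sum over pairs j1 < j2 of |I_{j1} ∩ I_{j2}| / 2 for intervals modulo r,
    the lower bound of Lemma 2 on the total number of insertion steps."""
    sets = [{(s + t) % r for t in range(length)} for s, length in intervals]
    total = sum(len(sets[a] & sets[b])
                for a in range(len(sets)) for b in range(a + 1, len(sets)))
    return total / 2

# Two overlapping intervals of length 8 in a table of size 32 share 4 slots,
# so any insertion order needs at least 2 extra steps.
print(lemma2_lower_bound([(0, 8), (4, 8)], r=32))    # -> 2.0
```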

2. Now suppose that for all i,
  Σ_{I₁,I₂ ∈ L_i, I₁ ≠ I₂} |I₁ ∩ I₂| < r/16.

By Lemma 3, H* is pairwise independent with uniformly distributed function values.

Proof. Consider the parameters a, b, and v of a random function in H*(p, r). Since r = ⌈p/2⌉ we have p̂ = p + 1, and (p/p̂)^p > 1/4. Therefore, with constant probability it holds that a ≠ 0 and v ∈ [p]^p. Restricted to functions satisfying this, the family H*(p, r) is identical to H(p, r). Thus, the lower bound carries over (with a smaller constant).

We remark that the lower bound is tight. A corresponding O(n log n) upper bound can be shown by applying the analysis of Section 4.1 and using Chebychev's inequality to bound the probability of a fully loaded interval, rather than using the 4th moment inequality as in Lemma 1.

4. 5-WISE INDEPENDENCE

We want to bound the cost of a given operation (insertion, deletion, or lookup of a key x) performed when the hash table contains the set S of keys. It is well known that for linear probing, the set P of occupied table positions depends only on the set S and the hash function, independently of the sequence of insertions and deletions performed. An upper bound for the cost of any operation on x is
  O(1 + max{l | h(x) + [l] ⊆ P}).
This is because the cost is bounded by the distance from h(x) to the next unoccupied position. To avoid two similar calculations for the cases x ∈ S and x ∉ S we will consider the set S′ = S ∪ {x} and the corresponding set P′ of occupied table positions. We first show a lemma which intuitively says that if the operation on the key x goes on for at least l steps, then there are either "many" keys hashing to the interval h(x) + [l], or there are "many" keys that hash to some interval having h(x) as its right endpoint.

Lemma 4. For any l > 0 and ᾱ ∈ (0, 1), if h(x) + [l] ⊆ P′ and l′ = max{ℓ | h(x) − [ℓ] ⊆ P′}, then at least one of the following holds:
  1. |h(S′) ∩ₘ (h(x) + [l])| ≥ (1+ᾱ)/2 · l − 1, or
  2. |h(S′) ∩ₘ (h(x) − [l′])| ≥ l′ + (1−ᾱ)/2 · l.

Proof. Suppose that |h(S′) ∩ₘ (h(x) + [l])| < (1+ᾱ)/2 · l − 1. Now, fix any way of placing the keys in the hash table, e.g., suppose that keys are inserted in sorted order. Consider the set S* ⊆ S′ of keys stored in the interval I = (h(x) − [l′]) ∪ (h(x) + [l]). By the choice of l′ there must be an empty position to the left of I, so h(S*) ⊆ I. This means:
  |h(S′) ∩ₘ (h(x) − [l′])| ≥ |h(S*) ∩ₘ (h(x) − [l′])|
                           ≥ |S*| − |h(S*) ∩ₘ (h(x) + [l])|
                           > |I| − ((1+ᾱ)/2 · l − 1)
                           = l′ + (1−ᾱ)/2 · l.

Theorem 2. Consider any sequence of operations (insertions, deletions, and lookups) in a linear probing hash table where the hash function h used has been chosen uniformly at random from a 5-wise independent family of functions H. Let n and ᾱ < 1 denote, respectively, the maximum number of keys in the table during a particular operation and the corresponding maximum load. Then the expected cost of that operation is O(1 + (1−ᾱ)⁻³). As a consequence, the expected average cost of successful lookups is O(1 + (1−ᾱ)⁻²).

Proof. We refer to x, S′, and P′ as defined previously in this section. As argued above, the expected cost of the operation is
  O(1 + Σ_{l>0} Pr{h(x) + [l] ⊆ P′}).
Let l₀ = 6/(1−ᾱ)³. For l ≤ l₀ we use the trivial upper bound Pr{h(x) + [l] ⊆ P′} ≤ 1. In the following we consider the case l > l₀.

Let A_ℓ be the event that |h(S′) ∩ₘ (h(x) + [ℓ])| ≥ (1+ᾱ)/2 · ℓ − 1, and let A′_ℓ be the event that |h(S′) ∩ₘ (h(x) − [ℓ])| ≥ ℓ. Notice that if the second inequality of Lemma 4 holds then A′_{l′}, ..., A′_{l′+⌈(1−ᾱ)/2·l⌉} hold. Consequently, if l′ div (⌈(1−ᾱ)/2·l⌉ + 1) = k, then A′_{(k+1)⌈(1−ᾱ)/2·l⌉} holds. Combining this observation with Lemma 4 shows that for any l such that h(x) + [l] ⊆ P′ the following event holds:
  A_l ∪ ( ⋃_{k>0} A′_{k⌈(1−ᾱ)/2·l⌉} ).
Hence,
  Σ_{l>l₀} Pr{h(x)+[l] ⊆ P′} ≤ Σ_{l>l₀} ( Pr(A_l) + Σ_{k>0} Pr(A′_{k⌈(1−ᾱ)/2·l⌉}) ).
The event A_l implies |h(S′ \ {x}) ∩ₘ (h(x) + [l])| ≥ (1+ᾱ)/2 · l − 2. Thus, using Lemma 1 on S′ \ {x} we have
  Pr(A_l) ≤ Pr{ |h(S′ \ {x}) ∩ₘ (h(x) + [l])| ≥ ᾱl + ((1−ᾱ)/2 · l − 2) }
          < (3(ᾱl)² + ᾱl) / ((1−ᾱ)/2 · l − 2)⁴
          = O((1−ᾱ)⁻⁴ l⁻²).
In the last relation we used the fact that l > 6/(1−ᾱ). Using the inequality Pr(A′_ℓ) < O((1−ᾱ)⁻⁴ ℓ⁻²), again derived from Lemma 1, we get
  Σ_{k>0} Pr(A′_{k⌈(1−ᾱ)/2·l⌉}) < Σ_{k>0} O((1−ᾱ)⁻⁴ (k·(1−ᾱ)/2·l)⁻²) = O((1−ᾱ)⁻⁶ l⁻²).
Finally,
  Σ_{l>l₀} Pr{h(x) + [l] ⊆ P′} < Σ_{l>l₀} O((1−ᾱ)⁻⁶ l⁻²) = O((1−ᾱ)⁻³).

For the average successful lookup cost, notice that the n keys in the hash table were inserted (in some order) at maximum loads ᾱi/n, for i = 1, ..., n, so the expected total insertion cost of these keys is
  O( Σ_{i=1}^{n} (1 + (1 − ᾱi/n)⁻³) ) = O(n + (1−ᾱ)⁻² n).
Observing that the total insertion cost equals the total cost of looking up all keys, the claim follows.
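The behaviour promised by Theorem 2 can also be explored experimentally. The sketch below is our own, with arbitrary parameters: it inserts keys by linear probing using a random degree-4 polynomial hash function (a 5-wise independent one) and reports the average displacement, which by the results of this section stays bounded for fixed ᾱ < 1.

```python
import random

def average_displacement(n, r, k=5, p=(1 << 61) - 1, trials=10):
    """Average displacement of n linear probing insertions into a table of
    size r, hashing with a random degree-(k-1) polynomial modulo the prime p."""
    total = 0
    for _ in range(trials):
        coeffs = [random.randrange(p) for _ in range(k)]

        def h(x):
            acc = 0
            for c in reversed(coeffs):
                acc = (acc * x + c) % p
            return acc % r

        table = [None] * r
        for key in random.sample(range(p), n):
            i, d = h(key), 0
            while table[i] is not None:      # walk to the first vacant slot
                i = (i + 1) % r
                d += 1
            table[i] = key
            total += d
    return total / (trials * n)

# e.g. average_displacement(3 * 10**4, 1 << 16) runs at load factor ~0.46
```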
4.1 A bound with explicit constant factor

In earlier analyses of linear probing it was common to consider the number of probes into the table made during an operation. When the number of probes is analyzed, and not the number of machine instructions executed, it makes sense to be more precise and compute the constant factor involved. The analysis in the previous section hides a rather large constant factor. We will use a slightly different proof technique to obtain a better bound on the average successful search cost. This proof technique will be used again in section 5.

We estimate the total expected number of probes made by a sequence of n insertions into an empty table of size r. From this we can easily derive a bound on the number of probes made by a successful search of a random element in the set. Our bounds show that the expected probe count is not big, for moderate values of ᾱ, although definitely higher than in the case of truly random functions.

The number of probes made during an insertion does not depend on the order of insertions — or equivalently, on the policy of placing elements being inserted. We assume that the following policy is in effect: if x is the new element to be inserted, place x into the first slot h(x) + i that is either empty or contains an element x′ such that h(x′) ∉ h(x) + [i + 1]. If x is placed into a slot previously occupied by x′ then the probe sequence continues as if x′ is being inserted. The entire procedure terminates when an empty slot is found.

Theorem 3. Let H be a 5-wise independent family of functions which map U to R. When linear probing is used with a hash function chosen uniformly at random from H, the expected total number of probes made by a sequence of n insertions into an empty table is less than n(1 + T(ᾱ)).

Proof. For every x_i ∈ S, we define d_i to be the displacement of x_i, i.e., the number such that x_i resides in slot (h(x_i) + d_i) mod r after all keys have been inserted. The total number of probes made over all insertions is equal to Σ_{i=1}^{n} (1 + d_i). From the way elements are inserted, we conclude that, for 1 ≤ i ≤ n and 1 ≤ l ≤ d_i, every interval h(x_i) + [l] is fully loaded, meaning that at least l elements of S \ {x_i} hash into it. Let A_{il} be the event that the interval of slots h(x_i) + [l] is fully loaded. Then,
  E(d_i) = Σ_{k=1}^{r} Pr{d_i ≥ k} ≤ Σ_{k=1}^{r} Pr( ⋂_{l=1}^{k} A_{il} ) ≤ Σ_{j=0}^{⌊lg r⌋} 2^j · Pr(A_{i,2^j}).
An upper bound on Pr(A_{i,2^j}) can be derived from Lemma 1. However, for small lengths (and ᾱ not small) the bound is useless, so then we will simply use the trivial upper bound of 1. Let K = 3ᾱ²/(1−ᾱ)⁴. We first consider the case K ≥ 1. Denoting j* = ⌈½ lg K⌉ we have
  E(d_i) ≤ 2^{j*} − 1 + Σ_{j=j*}^{lg r} 2^j ( K/2^{2j} + K/(3ᾱ·2^{3j}) )
         = 2^{j*} − 1 + K Σ_{j=j*}^{lg r} 2^{−j} + (K/3ᾱ) Σ_{j=j*}^{lg r} 4^{−j}
         < 2^{j*} − 1 + 2K·2^{−j*} + (4K/9ᾱ)·4^{−j*}
         ≤ √K·2^{1−frac(lg √K)} + 2√K·2^{−(1−frac(lg √K))} − 1 + 4/(9ᾱ)
         ≤ 3√K + 4/(9ᾱ) − 1.
The last inequality is true because 2^t + 2·2^{−t} ≤ 3, for t ∈ [0, 1]. The assumption that K ≥ 1 implies that ᾱ > 0.29, but we decide to use the obtained bound for ᾱ ≥ 1/3. Thus, we may replace 4/(9ᾱ) − 1 with the upper bound of 1/3; since 3√K = 3√3·ᾱ/(1−ᾱ)² < 5.2ᾱ/(1−ᾱ)², this gives E(d_i) < T(ᾱ) in this case.

Doing an easier calculation without splitting of the sum at index j* gives E(d_i) < K(2 + 4/(9ᾱ)). When ᾱ < 1/3, we may replace K(2 + 4/(9ᾱ)) with the upper bound of 2.5ᾱ/(1−ᾱ)⁴, that is, T(ᾱ). Summing Σ_{i} (1 + d_i) proves the theorem.
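For convenience, the explicit bound can be evaluated directly; this small helper (ours) encodes T(ᾱ) from Section 2.1 and the resulting bound of Theorem 3.

```python
def T(alpha_bar):
    """The function T(ᾱ) from Section 2.1, defined on (0, 1)."""
    if not 0 < alpha_bar < 1:
        raise ValueError("the maximum load must lie in (0, 1)")
    if alpha_bar >= 1 / 3:
        return 5.2 * alpha_bar / (1 - alpha_bar) ** 2 + 1 / 3
    return 2.5 * alpha_bar / (1 - alpha_bar) ** 4

def insertion_probe_bound(n, alpha_bar):
    """Theorem 3: expected total probes of n insertions is below n(1 + T(ᾱ))."""
    return n * (1 + T(alpha_bar))
```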
The bound for average Theorem 3. Let be a 5-wise independent family of H successful search requires 4-wise independence. functions which map U to R. When linear probing is used Suppose that keys are hashed into a table of size r by a with a hash function chosen uniformly at random from , H function h. For simplicity we assume that r is a power of the expected total number of probes made by a sequence of n i i two. Let Vj = j, j + 1,...,j + 2 1 where j is assumed insertions into an empty table is less than n(1 + T (¯α)). { i − }i to be a multiple of 2 . The intervals Vj may be thought of

The search stops after traversal of an interval if any of the following three cases hold:
  a) key x was found,
  b) the interval contained empty slot(s),
  c) the interval contained key(s) whose hash value does not belong to the interval.
In case (a) the search may obviously stop immediately on discovery of x — there is no need to traverse through the rest of the interval. In cases (b) and (c) we will be able to conclude that x is not in the hash table.

Traversal of unexamined halves of intervals V^i_j may take different concrete forms — the only requirement is that every slot is probed exactly once. From a practical point of view, a good choice is to probe slots sequentially in a way that makes the scheme a variant of bidirectional linear probing. This concrete scheme defines a probe sequence that in probe numbers 2^i to 2^{i+1}−1 inspects either the slots
  (h(x)⊖2^i + 2^i, h(x)⊖2^i + 2^i + 1, ..., h(x)⊖2^i + 2^{i+1} − 1)
or
  (h(x)⊖2^i − 1, h(x)⊖2^i − 2, ..., h(x)⊖2^i − 2^i),
depending on whether h(x) mod 2^i = h(x) mod 2^{i+1} or not. A different probe sequence that falls in this class of methods, but is not sequential, is (x, j) ↦ h(x) xor j, with j starting from 0.
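The following sketch (ours, not the paper's pseudocode) implements the search procedure just described for the simplest member of the class, the non-sequential xor probe order; after each completed block it applies the stopping rules a)–c). It assumes a table whose size is a power of two and whose empty slots hold None.

```python
def blocked_probing_search(table, h, x):
    """Search for x with the probe order (x, j) -> h(x) xor j, checking the
    stopping rules after each fully traversed block V^i containing h(x)."""
    r = len(table)
    hx = h(x)
    block = 1                                  # size 2^i of the block being completed
    while block <= r:
        lo = hx - (hx % block)                 # h(x) ⊖ 2^i, start of the aligned block
        saw_empty = False
        saw_foreign = False
        # Traverse the untraversed half of the block (all of it when block == 1).
        for j in range(block // 2 if block > 1 else 0, block):
            y = table[hx ^ j]
            if y == x:
                return True                    # case a): stop on discovery of x
            if y is None:
                saw_empty = True
            elif not (lo <= h(y) < lo + block):
                saw_foreign = True             # case c): key hashing outside the block
        # Keys seen in smaller blocks already passed this test there, so only
        # the newly traversed half needs checking.
        if saw_empty or saw_foreign:
            return False                       # cases b)/c): x is not stored
        block *= 2
    return False
```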
Insertions. Until key x which is being inserted is placed in a slot, the same probe sequence is followed as in a search for x. However, x may be placed in a non-empty slot if its hash value is closer to the slot number in a special metric which we will now define. Let d(y₁, y₂) = min{i | y₂ ∈ V^i_{y₁⊖2^i}}. The value of d(y₁, y₂) is equal to the position of the most significant bit in which y₁ and y₂ differ. If during insertion of x we encounter a slot y containing key x′ then key x is put into slot y if d(h(x), y) < d(h(x′), y). In an implementation there is no need to evaluate d(h(x), y) values every time. We can keep track of what interval V^i_{h(x)⊖2^i} is being traversed at the moment and check whether h(x′) belongs to that interval.

When x is placed in slot y which was previously occupied by x′, a new slot for x′ has to be found. Let i = d(h(x′), y). The procedure now continues as if x′ is being inserted and we are starting with traversal of V^i_{h(x′)⊖2^i} \ V^{i−1}_{h(x′)⊖2^{i−1}}. If the variant of bidirectional linear probing is used, the traversal may start from position y, which may matter in practice.

Deletions. After removal of a key we have to check if the new empty slot can be used to bring some keys closer to their hash values, in terms of the metric d. Let x be the removed key, y be the slot in which it resided, and i = d(h(x), y). There is no need to examine V^{i−1}_{h(x)⊖2^{i−1}}. If V^i_{h(x)⊖2^i} \ V^{i−1}_{h(x)⊖2^{i−1}} contains another empty slot then the procedure does not continue in wider intervals. If it continues and an element gets repositioned then the procedure is recursively applied starting from the new empty slot.

It is easy to formally check that an appropriate invariant holds and that the above described set of procedures works correctly. We leave this to the reader.

5.1 Analysis

We analyze the performance of operations on a hash table of size r when blocked probing is used. Suppose that the hash table stores an arbitrary fixed set of n elements, and let ᾱ denote the maximum load of H on sets of size n. Let C^U_ᾱ, C^I_ᾱ, C^D_ᾱ, and C^S_ᾱ be the random variables that represent, respectively: the number of probes made during an unsuccessful search for a fixed key, the number of probes made during an insertion of a fixed key, the number of probes made during a deletion of a fixed key, and the number of probes made during a successful search for a random element from the set. In the above notation we did not explicitly include the fixed set and fixed elements that are used in the operations – they have to be implied by the context. The upper bounds on the expectations of the C^Ξ_ᾱ variables, which are given by the following theorem, do not depend on choices of those elements.

Theorem 4. Let H be a 5-wise independent family of functions which map U to R. For a maximum load of ᾱ < 1, blocked probing with a hash function chosen uniformly at random from H provides the following expectations:
  E(C^U_ᾱ) < 1 + T(ᾱ),
  E(C^I_ᾱ) < 1 + 2T(ᾱ),
  E(C^D_ᾱ) < 1 + 2T(ᾱ).

Proof. Denote by x the fixed element from U \ S that is being inserted or searched (unsuccessfully). Let C̄^U_ᾱ be the random variable that takes value 2^i when 2^{i−1} < C^U_ᾱ ≤ 2^i, 0 ≤ i ≤ lg r. We can write C̄^U_ᾱ = 1 + Σ_{i=1}^{lg r} 2^{i−1} T_i, where T_i is an indicator variable whose value is 1 when at least 2^{i−1} + 1 probes are made during the search. Let A_j be the event that the interval of slots V^j_{h(x)⊖2^j} is fully loaded, meaning that at least 2^j elements of S are hashed into the interval. If T_i = 1 then A_{i−1} holds, so we have:
  E(C̄^U_ᾱ) ≤ 1 + Σ_{i=1}^{lg r} 2^{i−1} Pr(A_{i−1}).
The sum Σ_{i=0}^{lg r} 2^i Pr(A_i) appeared in the proof of Theorem 3, so we reuse the upper bound found there to conclude that E(C^U_ᾱ) < 1 + T(ᾱ).

We now move on to analyzing insertions. Let C̄^I_ᾱ be the random variable that takes value 2^i when 2^{i−1} < C^I_ᾱ ≤ 2^i, 0 ≤ i ≤ lg r. The variable C^U_ᾱ gives us the slot where x is placed, but we have to consider possible movements of other elements. If x is placed into a slot previously occupied by key x′ from a "neighboring" interval V^{i+1}_{h(x)⊖2^{i+1}} \ V^i_{h(x)⊖2^i}, then as many as 2^i probes may be necessary to find a place for x′ within that neighboring interval, if there is one. If V^{i+1}_{h(x)⊖2^{i+1}} is fully loaded, then as many as 2^{i+1} additional probes may be needed to find a place within V^{i+2}_{h(x)⊖2^{i+2}} \ V^{i+1}_{h(x)⊖2^{i+1}}, and so on. In general — and taking into account all repositioned elements — we use the following accounting to get an overestimate of E(C̄^I_ᾱ): For every fully loaded interval V^i_{h(x)⊖2^i} we charge 2^i probes, and for every fully loaded neighboring interval V^{i+1}_{h(x)⊖2^{i+1}} \ V^i_{h(x)⊖2^i} we also charge 2^i probes. The probability of a neighboring interval of length 2^i being full is equal to Pr(A_i). As a result,
  E(C̄^I_ᾱ) ≤ 1 + Σ_{i=0}^{lg r − 1} 2·2^i Pr(A_i) < 1 + 2T(ᾱ).
The analysis of deletions is analogous.
For higher values of ᾱ, the dominant term in the upper bounds is O((1−ᾱ)⁻²). The constant factors in front of the term (1−ᾱ)⁻² are relatively high compared to standard linear probing with fully random hash functions. This is in part due to the approximative nature of the proof of Theorem 4, and in part due to the tail bounds that we use, which are weaker than those for fully independent families. In the fully independent case, the probability that an interval of length q is fully loaded is less than e^{q(1−α+ln α)}, according to Chernoff-Hoeffding bounds [3, 6]. Plugging this bound into the proof of Theorem 4 would give, e.g.,
  E(C^U_α) < 1 + e^{1−α+ln α} / (α ln 2 · |1−α+ln α|).    (1)
For α close to 1, a good upper bound on (1) is 1 + (2/ln 2)(1−α)⁻². The constant factor here is ≈ 2.88, as opposed to ≈ 5.2 from the statement of Theorem 4. As α gets smaller, the bound in (1) gets further below 1 + (2/ln 2)(1−α)⁻².

5.2 Analysis of successful searches

As before, we assume that the function h is chosen uniformly at random from H. For a subset Q of R, let X_i be the indicator random variable that has value 1 iff h(x_i) ∈ Q, 1 ≤ i ≤ n. The variable X = Σ_{i=1}^{n} X_i counts the number of elements that are mapped to Q. We introduce a random variable that counts the number of elements that have overflowed on Q. Define Y = max{X − q, 0}, where q = |Q|. We will find an upper bound on E(Y), such that it is expressed only in terms of q and ᾱ. Denote such a bound by M^ᾱ_q. It is clear that
  E(C^S_ᾱ) ≤ 1 + (1/n) Σ_{l=0}^{lg r − 1} 2^l · (r/2^l) · M^ᾱ_{2^l} = 1 + (1/α) Σ_{l=0}^{lg r − 1} M^ᾱ_{2^l}.

We are starting the analysis of E(Y), for an arbitrary fixed set Q. The value of E(Y) is Σ_{j=1}^{n−q} j · Pr{X = q + j}. A bound on E(Y) can be obtained as the optimal value of an optimization problem, which we will introduce. We denote E(X) shortly by µ. Let variables p_i, 0 ≤ i ≤ n, have domain [0, 1]. Define the following optimization problem in variables p₀, ..., p_n:

  maximize Σ_{j=1}^{n−q} j · p_{q+j}
  subject to constraints:
    Σ_{i=0}^{n} p_i = 1,
    Σ_{i=0}^{⌊µ⌋} (µ − i) p_i = Σ_{i=⌊µ⌋+1}^{n} (i − µ) p_i,
    Σ_{i=0}^{n} |µ − i|^k p_i ≤ D_k.

If we choose an even k ≥ 2, and set D_k to be an upper bound on the value of the kth central moment of X, then the optimal value of the objective function is an upper bound on E(Y). Remark that in an optimal solution p_i = 0 for ⌈µ⌉ ≤ i ≤ q. We will actually not solve the above problem, but its relaxation. We introduce variables d_i, i ∈ I = {−⌈µ⌉, ..., −2, −1, 1, 2, ..., n−q}. For i ∈ {−⌈µ⌉, ..., −1} the domain of variables d_i is (0, µ]; for i ∈ {1, ..., n−q} the domain of variables d_i is (q−µ, n]. The variables representing probabilities are still denoted by p_i, but the index set is now I.
Define the optimization problem Π over all variables p_i, d_i as:

  maximize Σ_{j=1}^{n−q} p_j (d_j − (q − µ))
  subject to constraints:
    Σ_{i∈I} p_i = 1,                                   (2)
    Σ_{i=−⌈µ⌉}^{−1} d_i p_i = Σ_{i=1}^{n−q} d_i p_i,   (3)
    Σ_{i∈I} d_i^k p_i = D_k.                           (4)

The optimal value of the objective function for problem Π is not smaller than the optimal value in the original problem.

Lemma 5. Let (p̄, d̄) be an optimal solution to problem Π such that d̄_i ≠ d̄_j for i ≠ j. Then only one value among p̄₁, ..., p̄_{n−q} is not equal to 0, and only one value among p̄_{−1}, ..., p̄_{−⌈µ⌉} is not equal to 0.

Proof. We will prove the claims for i > 0 and i < 0 separately. Suppose the contrary, that there are two non-zero values among p̄₁, ..., p̄_{n−q}. W.l.o.g. we may assume that p̄₁ > 0 and p̄₂ > 0. Define the function f(d₁, d₂) = p̄₁(d₁ − (q−µ)) + p̄₂(d₂ − (q−µ)) over d₁, d₂ ≥ 0. We are interested in finding the maximum of the function f subject to the constraint p̄₁d₁^k + p̄₂d₂^k − D̄ = 0, where D̄ = p̄₁d̄₁^k + p̄₂d̄₂^k. According to the method of Lagrange multipliers, the only point in the interior of the domain that is a candidate for an extreme point is (d̂, d̂), where d̂ = (D̄/(p̄₁+p̄₂))^{1/k}. We cannot establish the character of the point directly through an appropriate quadratic form, because the required forms evaluate to zero (the form should be positive definite or negative definite to establish the local minimum or maximum, respectively). However, by comparing f(d̂, d̂) to the values of f at the boundary points ((D̄/p̄₁)^{1/k}, 0) and (0, (D̄/p̄₂)^{1/k}), we find that the maximum is reached in the interior of the domain and it must be at the point (d̂, d̂). It is not hard to show that d̂ ∈ (q−µ, n], by looking at the set determined by the equation p̄₁d₁^k + p̄₂d₂^k − D̄ = 0 and using the fact that (d̄₁, d̄₂) ∈ (q−µ, n] × (q−µ, n].

We construct a new solution (p̃, d̃) to problem Π in two phases. First, set p̃_i = p̄_i and d̃_i = d̄_i for i ∉ {1, 2}, set p̃₂ = d̃₂ = 0, d̃₁ = d̂ and p̃₁ = (p̄₁d̄₁ + p̄₂d̄₂)/d̂. From (d̂, d̂) being the point of maximum of f, it follows that
  (p̄₁d̄₁ + p̄₂d̄₂)/d̂ < p̄₁ + p̄₂.
Now (p̃, d̃) exactly satisfies (3) and it satisfies inequality conditions corresponding to (2) and (4). Already now, the value of the objective function is higher than the value at (p̄, d̄). It increases further when we increase the values of p̃₁ and p̃_{−1} in a way that satisfies all the conditions with equalities. We reached a contradiction, meaning that (p̄, d̄) cannot have two non-zero values among p̄₁, ..., p̄_{n−q}.

Suppose now that there are two non-zero values among p̄_{−1}, ..., p̄_{−⌈µ⌉}. W.l.o.g. we may assume that p̄_{−1} > 0 and p̄_{−2} > 0. To reach a contradiction we use an argument that is in some way dual to the argument for the first part. Define the function g(d_{−1}, d_{−2}) = p̄_{−1}d_{−1}^k + p̄_{−2}d_{−2}^k. Now we are interested in finding the minimum of g subject to the constraint p̄_{−1}d_{−1} + p̄_{−2}d_{−2} − (p̄_{−1}d̄_{−1} + p̄_{−2}d̄_{−2}) = 0. The proof continues similarly to the first part.
Lemma 6. Let OPT be the optimal value of the objective function for problem Π. Then OPT < ½·D_k^{1/k}, and also
  OPT < D_k^{1/k}/2^{1−1/k} − (q−µ)/2,                            if D_k(1 − 1/k)^k ≥ (q−µ)^k/2,
  OPT < (D_k/(q−µ)^{k−1}) · ((1 − 1/k)^{k−1} − (1 − 1/k)^k),      otherwise.

Proof. According to Lemma 5, an optimal solution to problem Π can be obtained from the following simplified problem:
  maximize p₁(d₁ − (q − µ))
  subject to constraints:
    p₁d₁ = (1 − p₁)d_{−1},
    p₁d₁^k + (1 − p₁)d_{−1}^k = D_k.
Eliminating d_{−1} yields the constraint d₁^k (p₁ + p₁^k/(1−p₁)^{k−1}) = D_k. Expressing d₁ from this constraint shows that the objective function is equivalent to
  f(p₁) = D_k^{1/k} · ( p₁^{k−1}(1−p₁)^{k−1} / (p₁^{k−1} + (1−p₁)^{k−1}) )^{1/k} − p₁(q − µ).
The value of the first term is symmetrical around the point ½. Therefore, the maximum of f is reached in the interval (0, ½]. By analyzing only the first term, we get the simple upper bound of ½·D_k^{1/k}. To get the other bounds we look at the function
  f̄(p₁) = D_k^{1/k} · p₁^{1−1/k} − p₁(q − µ),
which satisfies f < f̄. Analyzing f̄′ we easily find the maximum of f̄(p₁) over p₁ ∈ (0, ½].

Corollary 2. Suppose that H is 4-wise independent. Define
  M^ᾱ_q = ½√(ᾱq),                              if ᾱ/((1−ᾱ)²q) ≥ (√2+1)²,
  M^ᾱ_q = √(ᾱq)/√2 − ½q(1−ᾱ),                  if 2 ≤ ᾱ/((1−ᾱ)²q) < (√2+1)²,
  M^ᾱ_q = ᾱ/(4(1−ᾱ)),                          if 1/√2 ≤ ᾱ/((1−ᾱ)²q) < 2,
  M^ᾱ_q = 0.11·(3ᾱ²q + ᾱ)/((1−ᾱ)³q²),          if ᾱ/((1−ᾱ)²q) < 1/√2.
It holds that E(Y) < M^ᾱ_q.

Proof. For the first three cases we used k = 2, and for the fourth case k = 4 was used. The constants in problem Π were substituted as: µ = ᾱq, D₂ = ᾱq, and D₄ = 3ᾱ²q² + ᾱq. The bound on the fourth central moment is proved very similarly to the proof of Lemma 1. The bound on the variance is easier to prove.

The analysis is finalized with the following theorem.

Theorem 5. Let H be a 4-wise independent family of functions which map U to R. When blocked probing is used with a hash function chosen uniformly at random from H, then
  E(C^S_ᾱ) < 1 + (ᾱ/α) · 2/(1−ᾱ),     if ᾱ ≥ 0.5,
  E(C^S_ᾱ) < 1 + (ᾱ/α) · 1.1/(1−ᾱ),   if 0.3 < ᾱ < 0.5,
  E(C^S_ᾱ) < 1 + (ᾱ/α) · 0.85/(1−ᾱ),  if ᾱ ≤ 0.3.

Proof. As stated at the beginning of this section, E(C^S_ᾱ) is upper-bounded by
  1 + (1/α) Σ_{l=0}^{lg r − 1} M^ᾱ_{2^l}.
Now that we have the M^ᾱ_q values, we are only left to carry out a summation of Σ_l M^ᾱ_{2^l}. We split the sum into four parts. The splitting indexes are set as follows: l₁ = ⌊lg(ᾱ/((√2+1)²(1−ᾱ)²))⌋, l₂ = ⌊lg(ᾱ/(2(1−ᾱ)²))⌋, l₃ = ⌊lg(√2ᾱ/(1−ᾱ)²)⌋. Some of the indices may be smaller than 0, in which case the sum is split into fewer parts. For tighter bounding, we will also need the following values: t₁ = frac(lg(ᾱ/((√2+1)²(1−ᾱ)²))), t₂ = frac(lg(ᾱ/(2(1−ᾱ)²))), and t₃ = frac(lg(√2ᾱ/(1−ᾱ)²)).

Suppose first that l₁ ≥ 0. Simple calculations yield the following four inequalities:
  Σ_{l=0}^{l₁} M^ᾱ_{2^l} < (1/√2)·(ᾱ/(1−ᾱ))·2^{−t₁/2},
  Σ_{l=l₁+1}^{l₂} M^ᾱ_{2^l} ≤ (ᾱ/(1−ᾱ))·( (1 + 1/√2)·2^{−t₂/2} − 2^{−t₁/2} − ½·2^{−t₂} + 2^{−t₁}/(√2+1)² ),
  Σ_{l=l₂+1}^{l₃} M^ᾱ_{2^l} ≤ ¼·(ᾱ/(1−ᾱ))·(l₃ − l₂),
  Σ_{l=l₃+1}^{∞} M^ᾱ_{2^l} ≤ 0.24·(ᾱ/(1−ᾱ))·2^{t₃} + 0.02·((1−ᾱ)/ᾱ)·2^{2t₃}.
For the third inequality, we notice that l₃ − l₂ = 2 for t₃ ≤ ½, and l₃ − l₂ = 1 for t₃ > ½. The assumption that l₁ ≥ 0 implies that ᾱ > ½ (we do not need the exact bound on ᾱ), which we use for the second term of the r.h.s. of the fourth inequality. Maximizing over t₁, t₂, t₃ shows that Σ_{l=0}^{∞} M^ᾱ_{2^l} < 2ᾱ/(1−ᾱ).

Now suppose that l₁ < 0 and l₂ ≥ 0. The first part of the sum is now
  Σ_{l=0}^{l₂} M^ᾱ_{2^l} ≤ (ᾱ/(1−ᾱ))·( (1 + 1/√2)·2^{−t₂/2} − ½·2^{−t₂} ) − (1 + 1/√2)·√ᾱ + (1−ᾱ)/2,
while the bounds on the other two parts stay the same. Since l₁ < 0, it follows that ᾱ < 0.7. When ᾱ < 0.7, it holds that 3√ᾱ > ᾱ/(1−ᾱ). On the other hand, l₂ ≥ 0 is equivalent to ᾱ ≥ ½. When ᾱ ≥ ½, it holds that √ᾱ > 1−ᾱ. Simple calculations again show that Σ_{l=0}^{∞} M^ᾱ_{2^l} < 2ᾱ/(1−ᾱ) (a slightly lower constant could also be stated).

Now suppose that l₂ < 0 and l₃ ≥ 0. This assumption implies that ᾱ > 0.3, and we may write (1−ᾱ)/ᾱ < 6ᾱ/(1−ᾱ). In this case, we get Σ_{l=0}^{∞} M^ᾱ_{2^l} < 1.1·ᾱ/(1−ᾱ).

In the final case, l₃ < 0, we have Σ_{l=0}^{∞} M^ᾱ_{2^l} < 0.66ᾱ²/(1−ᾱ)³ + 0.15ᾱ/(1−ᾱ)³. Since ᾱ < 0.33 in this case, we may use ᾱ/(1−ᾱ)² < 0.75 and 1/(1−ᾱ)² < 2.25 to get the stated bound.
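The bound of Theorem 5 is simple to evaluate; the helper below (ours) returns it for a given maximum load ᾱ and load factor α (for families that are uniform on R one may take α = ᾱ).

```python
def successful_search_bound(alpha_bar, alpha=None):
    """Theorem 5's upper bound on the expected number of probes of a
    successful search under blocked probing with 4-wise independent hashing."""
    if alpha is None:
        alpha = alpha_bar
    if alpha_bar >= 0.5:
        c = 2.0
    elif alpha_bar > 0.3:
        c = 1.1
    else:
        c = 0.85
    return 1 + (alpha_bar / alpha) * c / (1 - alpha_bar)
```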

6. OPEN PROBLEMS

An immediate question is whether it is possible to improve the dependence on α in the analysis of linear probing with constant independence. In general, the problem of finding practical, and provably good hash functions for a range of other important hashing methods remains unsolved. For example, cuckoo hashing [9] and its variants presently have no such functions. Also, if we consider the problem of hashing a set into n/log n buckets such that the number of keys in each bucket is O(log n) w.h.p., there is no known explicit class achieving this with function descriptions of O(log |U|) bits. Possibly, such families could be designed using efficient circuits, rather than a standard RAM instruction set.

Acknowledgment. We thank Martin Dietzfelbinger and an anonymous reviewer for numerous suggestions improving the presentation. In particular, we thank Martin Dietzfelbinger for showing us a simplified proof of Lemma 1.

7. REFERENCES

[1] J. R. Black, C. U. Martel, and H. Qi. Graph and hashing algorithms for modern architectures: Design and performance. In Proceedings of the 2nd International Workshop on Algorithm Engineering (WAE '98), pages 37–48. Max-Planck-Institut für Informatik, 1998.
[2] J. L. Carter and M. N. Wegman. Universal classes of hash functions. J. Comput. System Sci., 18(2):143–154, 1979.
[3] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23(4):493–507, 1952.
[4] M. Dietzfelbinger and C. Weidling. Balanced allocation and dictionaries with tightly packed constant size bins. In Proceedings of the 32nd International Colloquium on Automata, Languages and Programming (ICALP '05), volume 3580 of Lecture Notes in Computer Science, pages 166–178. Springer, 2005.
[5] G. L. Heileman and W. Luo. How caching affects hashing. In Proceedings of the 7th Workshop on Algorithm Engineering and Experiments (ALENEX '05), pages 141–154. SIAM, 2005.
[6] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[7] D. E. Knuth. Notes on "open" addressing, July 22 1963. Unpublished memorandum. Available at http://citeseer.ist.psu.edu/knuth63notes.html.
[8] C. P. Kruskal, L. Rudolph, and M. Snir. A complexity theory of efficient parallel algorithms. Theoretical Computer Science, 71(1):95–132, Mar. 1990.
[9] R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms, 51:122–144, 2004.
[10] H. Prodinger and W. Szpankowski (eds.). Special issue on average case analysis of algorithms. Algorithmica, 22(4), 1998. Preface.
[11] J. P. Schmidt and A. Siegel. The analysis of closed hashing under limited randomness (extended abstract). In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing (STOC '90), pages 224–234. ACM Press, 1990.
[12] A. Siegel. On universal classes of extremely random constant-time hash functions. SIAM J. Comput., 33(3):505–543, 2004.
[13] A. Siegel and J. Schmidt. Closed hashing is computable and optimally randomizable with universal hash functions. Technical Report TR1995-687, New York University, Apr. 1995.
[14] M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with applications to second moment estimation. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '04), pages 615–624, 2004.
[15] M. N. Wegman and J. L. Carter. New hash functions and their use in authentication and set equality. J. Comput. System Sci., 22(3):265–279, 1981.