The Maximum Displacement for Linear Probing Hashing

UPPSALA DISSERTATIONS IN MATHEMATICS 63 The Maximum Displacement for Linear Probing Hashing Niclas Petersson Department of Mathematics Uppsala University UPPSALA 2009 Dissertation presented at Uppsala University to be publicly examined in Polhemsalen, Ångströmlaboratoriet, Lägerhyddsvägen 1, Uppsala, Friday, February 27, 2009 at 13:15 for the degree of Doctor of Philosophy. The examination will be conducted in English. Abstract Petersson, N. (2009). The Maximum Displacement for Linear Probing Hashing. Uppsala Dissertations in Mathematics 63. viii+20 pp. Uppsala. ISBN 978-91-506- 2050-4. In this thesis we study the standard probabilistic model for hashing with linear probing. The main purpose is to determine the asymptotic distribution for the maximum displacement. Depending on the ratio between the number of items and the number of cells, there are several cases to consider. Paper I solves the problem for the special case of almost full hash tables. That is, hash tables where every cell but one is occupied. Paper II completes the analysis by solving the problem for all remaining cases. That is, for every case where the number of items divided by the number of cells lies in the interval [0, 1]. The last two papers treat quite different topics. Paper III studies the area covered by the supremum process of Brownian motion. One of the main theorems in Paper I is expressed in terms of the Laplace transform of this area. Paper IV provides a new sufficient condition for a collection of independent random variables to be negatively associated when conditioned on their total sum. The condition applies to a collection of independent Borel-distributed random variables, which made it possible to prove a Poisson approximation that where essential for the completion of Paper II. Keywords: Probabilistic analysis of algorithms, hashing, linear probing, negative dependence, Brownian motion. Niclas Petersson, Department of Mathematics, Box 480, Uppsala University, SE-751 06 Uppsala, Sweden Copyright © Niclas Petersson, 2009 ISSN 1401-2049 ISBN 978-91-506-2050-4 urn:nbn:se:uu:diva-9545 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-9545) Finally, we need a great deal of faith in probability theory when we use hashing methods, since they are efficient only on the average, while the worst case is terrible! Donald Knuth, The Art of Computer Programming List of Papers This thesis is based on the following papers, which are referred to in the text by their Roman numerals. I Petersson, N. (2008). The maximum displacement for linear probing hashing I. II Petersson, N. (2009). The maximum displacement for linear probing hashing II. III Janson, S. and Petersson, N. (2007). The integral of the supremum process of Brownian motion. IV Petersson, N. (2008). Negative dependence of independent random variables conditional on their total sum. v Contents 1 Introduction ....................................... 1 1.1 Linear Probing Hashing . ........................... 1 1.2 Displacements................................... 2 1.3 Modelvs.Reality................................ 3 1.4 VariationsoftheAlgorithm......................... 5 2 Summary of Results . .............................. 7 2.1 PaperI ........................................ 7 2.2 PaperII........................................ 9 2.3 PaperIII....................................... 11 2.4 PaperIV....................................... 12 Summary in Swedish ................................... 15 Acknowledgements ..................................... 17 Bibliography . ........................................ 19 vii 1. Introduction As research subject, hashing is shared by mathematicians and computer scientists with a common interest in probabilistic analysis of algorithms. Hashing can in the present context be described as a special class of search methods used for storage of information in a computer memory or database.1 Such methods are also called scatter storage techniques, since the principal idea is to spread the information randomly in the memory. Although the subject of hashing originates from computer science, it is actually quite advantageous to introduce the algorithm in a strictly mathematical way. Especially in this thesis, which is devoted to the more mathematical aspects of hashing. 1.1 Linear Probing Hashing The standard version of hashing with linear probing can be described as follows, where n and m are integers such that 0 ≤ n ≤ m.(Forthe classic, computer science orientated definition, see Knuth [14, Sec. 6.4, Algorithm 6.4.L].) n items x1,...,xn are placed sequentially into a table with m n cells 1,...,m according to some hash sequence {hi}1 where hi ∈{1,...,m}. That is, item xi is inserted into cell hi if it is empty. If not, then xi (linearly) probes cell hi +1, hi +2, etc until an empty cell is found and the item can be inserted. All positions in the table can be interpreted modulo m as any item that is rejected from the last cell m, continues to probe cell 1, 2, etc until it is inserted in an empty cell. Throughout this thesis, all hash sequences also follow the standard 2 probabilistic model. That is, the hash addresses hi are independent random numbers, uniformly distributed on {1,...,m}. Hence all mn n possible hash sequences {hi}1 are equally likely when choosing a sequence at random. A hash table with m cells and n items generated by the above algorithm will from now on be denoted Tm,n. 1The first known documentation of the main ideas behind hashing is an internal IBM memorandum by H. P. Luhn from 1953, see Knuth [14, Sec. 6.4]. The word hashing did not become standard terminology (in print at least) until the late 1960s. 2The validity of this model is briefly discussed in Section 1.3. 1 A hash table Tm,n is called full if n = m and almost full if n = m − 1. A non-full hash table is said to be confined if the last cell in the table is empty. Confined hash tables are from now on distinguished with a prime Tm,n. Due to the circular symmetry of the construction procedure, all cells are equally likely to be empty in the final table. Thus, a confined hash table is nothing but a standard table conditioned on the last cell being empty. So any non-full hash table can be given the confined property by choosing an empty cell with uniform probability (among all the empty cells) and then relabeling the table so that the chosen cell becomes the last. It can therefore be assumed that the hash table is confined whenever it is suitable for the analysis. A block in a hash table Tm,n is defined as a sequence of consecutive cells {i +1,...,j} (modulo m) such that cell i and j are empty and all cells in between are occupied. Thus, every block contains exactly one empty cell (the last one) and the table has as many blocks as it has empty cells. The total number of blocks (usually denoted by N) is therefore given by N = m − n. The smallest block size is 1, since an empty cell following the empty cell of the previous block constitutes a block of its own. Note that the internal structure of different blocks are independent of each other. Thus, any block of a given length k may be regarded as a confined almost full hash table Tk,k−1 in itself. An almost full hash table Tm,m−1 is in this sense the most primitive type of table, since it only consists of one block. 1.2 Displacements If item xi is inserted into cell qi,thenitsdisplacement or individual displacement is di := (qi − hi)modm. Thus, the displacement equals the number of unsuccessful probes item xi makes before being inserted into an empty cell. The displacement is a measure of time (or cost) it takes to insert an item into the table. Once the hash table has been made, xi may be allocated with the same type of linear probing as when the item was inserted. That is, by probing cell hi, hi +1, etc until the item is found. The search time needed to allocate xi in the final table is therefore proportional to di +1. Some displacements has already been thoroughly analyzed. The total displacement, i.e. n D := di , i=1 has for instance been studied by Knuth [14], Flajolet, Poblete and Viola [8] and Janson [9]. The (individual) displacement of an item chosen at random has also been studied in detail by Janson [10] and Viola [20]. 2 The maximum displacement (or worst case) is defined as D := max di . 1≤i≤n The primary purpose of this thesis is to determine the asymptotic distribution of the maximum displacement for all 0 ≤ λ ≤ 1 when n, m →∞ in such a way that n/m → λ. Since the maximum displacement also is the record among the maximum displacements within each block, it is important to know how the block sizes are distributed. Block behavior has been studied by Chassaing et al.3 in [3], [4] and [5] and Janson [9]. A secondary purpose is to study the number of items with displacements very close to the maximum. Such items are from now on said to have a critical displacement. The exact definition of a critical displacement must be adjusted to the ratio λ, but the principle is to define “item xi has a critical displacement” ⇔ di ≥ ζ, for some appropriate function ζ = ζ(m, n). Here, the word appropriate means that the total number of critical displacements should be a non- degenerate random variable asymptotically. 1.3 Model vs. Reality The goal of this section is to connect the mathematical model for hashing to its origin in computer science. This will be done with the aid of a small example in form of the creation of an address book. Assume that one wants to store some sort of basic contact information to n different persons in a hash table with m cells.

The Maximum Displacement for Linear Probing Hashing

Linear Probing with Constant Independence

CSC 344 – Algorithms and Complexity Why Search?

Implementing the Map ADT Outline

Hash Tables & Searching Algorithms

Pseudorandom Data and Universal Hashing∗

Collisions There Is Still a Problem with Our Current Hash Table

Cuckoo Hashing

Linear Probing with 5-Independent Hashing

Cuckoo Hashing

The Power of Simple Tabulation Hashing

Fundamental Data Structures Contents

Efficient Usage of One-Sided RDMA for Linear Probing