Hash Tables with Pseudorandom Global Order

Wolfgang Brehm

[email protected], Coherent Imaging Division, CFEL, DESY, Notkestraße 85, 22607 Hamburg

Abstract. Given a sorting of the keys stored in a hash table one can guarantee a worst case time complexity for associative lookups of O(log(n)) and an average complexity of O(log(log(n))), thereby improving upon the guarantees usually encountered for hash tables using open addressing. The idea is to use the numerical order given by a hashing function and to resolve collisions upholding said order by using insertion sort on the small patches that inevitably form. The name patchmap has been devised for the implementation of this data structure and the source code is freely available.

Keywords: hash table, complexity analysis, C++


1 Introduction

Hashmaps are an essential part of many algorithms as they allow associative retrieval of n elements in average constant time and have a space complexity of O(n). The key is associated to the value by computing a hash of the key and storing the key-value pair in an array at the position indicated by the hash. This position is called a bucket. As long as there are no collisions, that is two different keys that produce the same position, different hash map implementations are conceptually identical. This can be the case in the limit of an infinitely large array or in the case of a perfect hashing function. As memory is constrained and there can be no one predetermined perfect hash function for arbitrary keys, the key step where different hash map implementations differ is the resolution of hash collisions. The better the hash collisions are handled, the more of them can be allowed to occur and the smaller the internal array can be. Different hash table implementations are therefore to be judged by the tradeoff they offer between the memory required and the time to find, insert or delete a key.

Collision resolution strategies are commonly classified into open addressing and chaining. Chaining means that each position in the array stores a pointer to another data structure that stores all the entries with the same hash value. Using a data structure with O(log(n)) worst case lookup, insertion and deletion, like a balanced tree, will yield strong worst case guarantees for all operations but low practical performance due to the high degree of indirection and the cache misses it is therefore likely to incur. Open addressing means that the address in the hash table is not directly determined by the hash of the key but is “open”. The position to store an entry is found by starting from the position indicated by the hash value and then searching for a free bucket with some probing strategy. For hash tables using open addressing the load factor α, the ratio between the number of elements in the table and the total number of buckets, is the main determinant of memory efficiency.

The idea to revisit ordered hash tables was inspired by a scientific use case. Many large histograms needed to be computed and the standard C++ hash table was not able to hold all the elements it would have needed to within the given memory constraints. The interim solution was to use sorted arrays, which provided the minimum possible memory footprint, but a higher time complexity. The desire arose to interpolate between the two solutions. A data structure that would act like a sorted array when memory is tight and gradually more like a hash map the more memory becomes available would be the ideal solution.

The central idea of this publication is to use an ordering of the keys consistent with the hash value of the key to resolve the collisions that occur in a hash table.

2 The Algorithm

The key-value pairs are stored in an array of size m. A binary mask of the same size is used to indicate whether a bucket in the array is free or set.

The proposed algorithm requires a hashing function, preferably bijective; an order on the keys to be stored, if the hashing function is not bijective; and a mapping function that compresses the range of the hash value onto the range of the size of the array while preserving the order of the hash.
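To make the storage layout concrete, the following is a minimal sketch in C++; the names (table, bucket, is_set) are illustrative and not taken from the patchmap source:

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // An array of m buckets for the key-value pairs plus a binary mask
    // with one bit per bucket indicating whether it is set (1) or free (0).
    template<class K, class V>
    struct table {
      struct bucket { K key; V value; };
      std::vector<bucket>   data; // the m buckets
      std::vector<uint64_t> mask; // ceil(m/64) words of the binary mask

      explicit table(std::size_t m) : data(m), mask((m + 63) / 64, 0) {}

      bool is_set(std::size_t i) const {
        return (mask[i / 64] >> (i % 64)) & 1u;
      }
      void set(std::size_t i)   { mask[i / 64] |=  uint64_t(1) << (i % 64); }
      void unset(std::size_t i) { mask[i / 64] &= ~(uint64_t(1) << (i % 64)); }
    };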
The hash function is chosen in a way that its application to the keys will yield uniformly distributed numbers even if the keys themselves are not evenly distributed. Using sorting to resolve hash collisions has been suggested before, most notably by Amble and Knuth [3]. The difference in the proposed algorithm is that it does not use the order given by the value of the key but the order given by the hash of the key. This leads to a global order of the keys stored in the hash table and therefore allows for efficient lookup using interpolation search and binary search. Interpolation search has the best average case complexity and binary search provides the best worst case complexity for lookups in a sorted array. This means that the lookup of a key, successful or not, will have a worst case time complexity of O(log(n)), when done right, and an average complexity of O(log(log((1 − α)^−1))). These guarantees are as good as for optimally ordered hash tables [12]¹ and stronger than the guarantees that most other probing strategies can give for lookups [13]. And just like for binary tree hashing, which approximates optimal ordering, the faster lookups come at a higher cost for insertions of up to O(n) for a full table, while insertions remain constant for tables that have a load factor less than 1. Note that the probable worst case for advanced probing strategies equally is O(log(n)) as they also approach optimal ordering, and that robin hood hashing [2] with dynamic resizing achieves the same worst case complexity with exceedingly high probability, as the hash table can be resized as soon as the maximum displacement grows larger than O(log(n)); the iterative resizing is, however, not guaranteed to terminate with O(n) space. Robin hood hashing with dynamic resizing is successfully implemented, for example, in the flat_hash_map and sherwood_map [8] of Malte Skarupke.

¹ The authors cautiously claim that the average complexity of successful lookups in an optimally ordered hash table is bounded by a constant but give no proof or convincing simulations.

The hash function can be any function that maps the keys onto positive integers less than some integer, but as the resolution of collisions is essentially a variant of linear probing it is crucial for good performance that the hash values are evenly distributed. A hash function that works well can be composed from a binary rotation and two multiplications with uneven random integer constants. The map function used multiplies the hash value with m and then divides the result by the maximum hash value plus one:

    map(h) := (h ∗ m) / (maximum hash value + 1)

This can be computed efficiently for fixed width integer types using long multiplication with two fixed width integers of the same length or a simple multiplication with a fixed width integer of double the size. Note that for hash values that fill the full range given by their digits the division is a simple shift of the digits and therefore does not need to be computed explicitly; this technique is known as fastrange [4].

Two types of comparisons need to be considered: comparing keys with keys and comparing keys with free bucket positions. When comparing two keys, a key is considered to be less if its hash value is less than that of the other key or, in case the hash values are equal, if the key itself is less than the other key. When comparing a key to an empty position, the key is considered to compare less if the application of the map function to the result of the hash function of the key is less than the position: map(hash(key)) < position. As long as there have been no collisions the keys stored in the hash table are already in the order given by the hash value.
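As a concrete illustration, such a hash function and the fastrange-style map could look as follows in C++; the rotation distance and the odd constants are arbitrary examples and not the ones used by the patchmap:

    #include <cstdint>

    // Example hash: two multiplications with odd constants around a
    // binary rotation; spreads even regularly structured keys roughly
    // uniformly over the full 32-bit range.
    uint32_t hash(uint32_t key) {
      key *= 0x9e3779b9u;               // multiplication with odd constant
      key  = (key << 16) | (key >> 16); // binary rotation by 16 bits
      key *= 0x85ebca6bu;               // multiplication with odd constant
      return key;
    }

    // map(h) = (h * m) / (maximum hash value + 1). For a 32-bit hash the
    // divisor is 2^32, so the division reduces to taking the upper half
    // of the 64-bit product (fastrange [4]). The map is monotonic in h
    // and therefore preserves the order given by the hash.
    uint32_t map(uint32_t h, uint32_t m) {
      return uint32_t((uint64_t(h) * uint64_t(m)) >> 32);
    }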
The lookup depends on this order. The operations for insertion and deletion should therefore be implemented in a way such that they uphold this order. This is achieved by finding the closest free bucket starting from the position indicated by map(hash(key)). The key-value pair is inserted into the free bucket and then swapped with neighbouring entries until the order of the hash table is restored. This procedure behaves like linear probing in a complexity analysis and therefore needs the same number of probes, O(1 + (1 − α)^−2), see [10], 6.4 Hashing. The term (α(1 − α)^−2 + α)/4 has been found to describe the required number of swaps well. For a numerical simulation see figure 3. Deletion works analogously to insertion, but in reverse.

The procedure for inserting is conceptually identical to the insertion step in Algorithm B in [3] with linear probing, but instead of comparing the keys directly, the hash values of the keys are compared.

    insert(key, value) :=
    1. find the free position p closest to map(hash(key))
    2. mark p as set
    3. insert the key-value pair at p
    4. while there is a key to the right that is less
       4.1 swap the keys
       4.2 swap the values
       4.3 p := p + 1
    5. while there is a key to the left to which the key is less
       5.1 swap the keys
       5.2 swap the values
       5.3 p := p − 1

The lookup uses interpolation search: starting with the first and the last element of the array as lower_limit and upper_limit respectively, the range is iteratively subdivided by linear interpolation until the key at the midpoint equals the key that needs to be found, the range is empty, or the maximum number of iterations has been reached. If the number of iterations exceeds O(log(n)) the search switches to a binary search. Almost always the interpolation search will terminate before that, but switching to a binary search guarantees that the search will terminate in O(log(n)).

    find_binary(key, lower_limit, upper_limit) :=
    1. midpoint :=
       lower_limit + (upper_limit − lower_limit + 1)/2
    2. if midpoint is set and (key at midpoint = key)
       then return midpoint
    3. if (lower_limit = upper_limit) then return not_found
    4. if key is less than midpoint
       then upper_limit := midpoint
    5. if midpoint is less than the key to be found
       then lower_limit := midpoint
    6. goto 1

For a table that has been filled completely the average time complexity for key retrieval will be O(log(log(n))) and the worst case time complexity will be O(log(n)), as this case reduces to the search in a sorted list of uniformly distributed integers. The algorithm for inserting new keys behaves like linear probing in the way that clustering occurs. The complexity for lookup with a linear search would therefore be (1 + (1 − α)^−1)/2 probes for a successful lookup and at most one probe more for an unsuccessful one.

Deletion restores the order by moving displaced entries back towards their ideal positions:

    erase(key) :=
    1. p := find(key)
    2. while there is a key right of p where map(hash(key)) < position
       2.1 move the key-value pair left
       2.2 advance p to the right
    3. while there is a key left of p where map(hash(key)) > position
       3.1 move the key-value pair right
       3.2 advance p to the left
    4. mark p as free
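To illustrate the lookup strategy, the following C++ sketch searches a plain sorted array of hash values, taking interpolation steps first and falling back to bisection once a budget of about log2(n) iterations is exhausted; it is a simplification in that it ignores free buckets and the mask, which the find operation of the table handles through the comparison rules given above:

    #include <cstdint>
    #include <cstddef>

    // Search the sorted array h[0..n) for the hash value x.
    // Interpolation probes need O(log(log(n))) steps on average for
    // uniformly distributed values; after about log2(n) iterations the
    // search switches to bisection, bounding the worst case by O(log(n)).
    std::size_t find_sorted(const uint32_t* h, std::size_t n, uint32_t x) {
      std::size_t lo = 0, hi = n; // half-open search range [lo, hi)
      std::size_t budget = 0;     // number of interpolation steps allowed
      for (std::size_t k = n; k > 1; k /= 2) ++budget;
      while (lo < hi) {
        std::size_t mid;
        if (budget > 0 && h[lo] < h[hi - 1]) {
          --budget; // interpolate the expected position of x in [lo, hi)
          if      (x <= h[lo])     mid = lo;
          else if (x >= h[hi - 1]) mid = hi - 1;
          else mid = lo + std::size_t((double(x) - double(h[lo]))
                        / (double(h[hi - 1]) - double(h[lo]))
                        * double(hi - 1 - lo));
        } else {
          mid = lo + (hi - lo) / 2; // binary search fallback
        }
        if (h[mid] == x) return mid; // found
        if (h[mid] <  x) lo = mid + 1;
        else             hi = mid;
      }
      return std::size_t(-1); // not found
    }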

An implementation of the patchmap as outlined in this paper is available on github under an unmodified MIT license:
https://github.com/1ykos/ordered_patch_map

Figure 2: Selected hash table implementations and their performance, given as average memory efficiency on the horizontal axis and average time for lookup, insertion and deletion of a random key in nanoseconds on the vertical axis, computed for up to 2^27 keys. Insertion contains the time needed for dynamic expansions. Compared are patchmap, khash, bytell, google::sparse_hash_map, google::dense_hash_map, flat_hash_map, std::unordered_map and sparsepp; there are two points for the patchmap - one optimized for memory efficiency and one for speed. Panel (c) shows the time for deletion in nanoseconds against memory efficiency.

5 Discussion

While chaining hash tables can achieve better or at least equal bounds on the complexity of all hash table operations, this comes at the cost of a higher degree of indirection and the additional overhead of storing the pointers to the secondary data structure containing the elements with the same hash value. This is why in practice chaining usually performs worse in terms of memory efficiency and speed. Of the hash tables using open addressing the patchmap offers the tightest worst case bounds for lookup, but double hashing and other advanced probing strategies achieve better run time complexities for insertion and deletion. Probing strategies that approximate an optimal arrangement at a higher cost of insertion, like robin hood hashing in conjunction with double hashing [2][11] and binary tree hashing [12], and the algorithm presented here seem to have the same average case complexity for lookups, equally asymptotically approaching optimality. In practice however the mask structure and the cache-friendly linear access pattern of the patchmap allow the constants associated with insertion and deletion to be lowered effectively, and the patchmap seems to be a good compromise for medium-high load factors of around 83%. The fact that the patchmap can perform lookups quickly even when it is very full could make it suitable for use cases where memory is constrained and the hash table itself is virtually static. Other than that, the performance characteristics of the patchmap mean that there are some cases for which it is the optimal choice in general. The quadratically increasing costs for inserting and deleting elements when the patchmap fills up limit its utility where memory is very constrained and insertion and deletion are frequent.

Sorting based on the hash is almost identical to robin hood hashing with bidirectional linear probing: when deciding which key to displace and which to keep at a given position, the larger key will also be displaced more than the smaller key. The key difference is that when an absolute order is ensured, a guarantee can be given for the worst case time complexity because a binary search can be employed.

Figure 3: Average cost for insertion and lookup in the hash map as it is being filled up with 2^28 pseudorandom keys. Insertion (in red) shows a strongly quadratic behavior and the function 1 + (α(1 − α)^−2 + α)/4 is indicated with a solid black line. Lookup (in blue) is almost linear in α and shows a steep increase only for load factors close to 1. The theoretically derived complexity bound for lookup, 1 + log2(log2((2 − α)/(2 − 2α)) + 1), is indicated with a dashed black line. The vertical axis gives the number of operations, the horizontal axis the load factor.

6 Acknowledgements

This research was undertaken alongside my research activities in crystallographic data processing under Professor Henry Chapman. While the requirement to generate very large histograms prompted the inception of this algorithm, I am very grateful for the freedoms I am enjoying in the Coherent Imaging research group and that I was able to pursue the development of the algorithm as well.

References

[1] attractivechaos, klib. http://attractivechaos.github.io/klib/ (8th of April 2019).

[2] Celis, P. (1986) Robin Hood Hashing. Ph.D. Thesis, Waterloo, Ont., Canada.

[3] Amble, O. and Knuth, D. (1974) Ordered Hash Tables. Comp. J., 17, 135–142.

[4] Lemire, D. (2018) Fast Random Integer Generation in an Interval. ArXiv, abs/1805.10941.

[5] Popovitch, G. sparsepp. https://github.com/greg7mdp/sparsepp (8th of April 2019).

[6] Skarupke, M. A new fast hash table in response to Google's new fast hash table. https://probablydance.com/2018/05/28/a-new-fast-hash-table-in-response-to-googles-new-fast-hash-table/ (4th of April 2019).

[7] Skarupke, M. flat_hash_map. https://github.com/skarupke/flat_hash_map (4th of April 2019).

[8] Skarupke, M. I wrote the fastest hashtable. https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ (4th of April 2019).

[9] Popovitch, G., Hide, D. et al. google::sparsehash. https://github.com/sparsehash/sparsehash (8th of April 2019).

[10] Knuth, D. (1998) The Art of Computer Programming, Volume 3: Sorting and Searching. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA.

[11] Devroye, L., Morin, P. and Viola, A. (2012) On Worst Case Robin-Hood Hashing. SIAM J. Comput., 33(4), 923–936.

[12] Gonnet, G. H. and Munro, J. I. (1979) Efficient ordering of hash tables. SIAM J. Comput., 8(3), 463–478.

[13] Ramakrishna, M. V. (1988) An exact probability model for finite hash tables. Proceedings, Fourth International Conference on Data Engineering, 362–368.
