Separate Chaining Meets Compact Hashing
Total Page:16
File Type:pdf, Size:1020Kb
Separate Chaining Meets Compact Hashing Dominik K¨oppl Department of Informatics, Kyushu University, Japan Society for Promotion of Science Abstract In fact, almost all modern replacements feature open addressing layouts. Their implementations While separate chaining is a common strategy for are highlighted with detailed benchmarks putting resolving collisions in a hash table taught in most separate chaining with unordered map as its major textbooks, compact hashing is a less common tech- representation in the backlight of interest. How- nique for saving space when hashing integers whose ever, when considering compact hashing with satel- domain is relatively small with respect to the prob- lite data, separate chaining becomes again a com- lem size. It is widely believed that hash tables petitive approach, on which we shed a light in this waste a considerable amount of memory, as they ei- article. ther leave allocated space untouched (open address- ing) or store additional pointers (separate chain- ing). For the former, Cleary introduced the com- 1.1 Related Work pact hashing technique that stores only a part of The hash table of Askitis [2] also resorts to separate a key to save space. However, as can be seen by chaining. Its buckets are represented as dynamic the line of research focusing on compact hash ta- arrays. On inserting an element into one of these bles with open addressing, there is additional in- array buckets, the size of the respective array incre- formation, called displacement, required for restor- ments by one (instead of, e.g., doubling its space). ing a key. There are several representations of this The approach differs from ours in that these arrays displacement information with different space and store a list of (key,value)-pairs while our buckets time trade-offs. In this article, we introduce a sep- separate keys from values. arate chaining hash table that applies the compact The scan of the buckets in a separate chaining hashing technique without the need for the dis- hash table can be accelerated with SIMD (single placement information. Practical evaluations re- instruction multiple data) instructions as shown by veal that insertions in this hash table are faster Ross [20] who studied the application of SIMD in- or use less space than all previously known com- structions for comparing multiple keys in parallel pact hash tables on modern computer architectures in a bucketized Cuckoo hash table. when storing sufficiently large satellite data. For reducing the memory requirement a of hash table, a sparse hash table layout was introduced 1 Introduction by members of Google2. Sparse hash tables are a arXiv:1905.00163v1 [cs.DS] 1 May 2019 lightweight alternative to standard open addressing A major layout decision for hash tables is hash tables, which are represented as plain arrays. how collisions are resolved. A well-studied Most of the sparse variants replace the plain array and easy-implementable layout is separate chain- with a bit vector of the same length marking posi- ing, which is also applied by the hash ta- + https://probablydance.com/2017/02/26/i-wrote-the- ble unordered map of the C+ standard library fastest-hashtable/, libstdc++ [4, Sect. 22.1.2.1.2]. On the downside, https://attractivechaos.wordpress.com/2018/10/01/advanced- it is often criticized for being bloated and slow1. techniques-to-implement-fast-hash-tables, https://tessil.github.io/2016/08/29/benchmark-hopscotch- 1Cf. http://www.idryman.org/blog/2017/05/03/writing- map.html, to name a few. a-damn-fast-hash-table-with-tiny-memory-footprints/, 2https://github.com/sparsehash/sparsehash 1 tions at which an element would be stored in the size. Choosing an adequate value for bmax is im- array. The array is emulated by this bit vector and portant, as it affects the resizing and the search its partitioning into buckets, which are dynamically time of our hash table. resizeable and store the actual data. Resize. When we try to insert an element into a The notion of compact hashing was coined by bucket of maximum size bmax, we create a new hash Cleary [5] who studied a hash table with bidirec- table with twice the number of buckets 2 H and j j tional linear probing. The idea of compact hashing move the elements from the old table to the new is to use an injective function mapping keys to pairs one, bucket by bucket. After a bucket of the old of integers. Using one integer, called remainder, as table becomes empty, we can free up its memory. a hash value, and the other, called quotient, as the This reduces the memory peak commonly seen in data stored in the hash table, the hash table can hash tables or dynamic vectors reserving one large restore a key by maintaining its quotient and an array, as these data structures need to reserve space information to retain its corresponding remainder. for 3m elements when resizing from m to 2m. This This information, called displacement, is crucial as technique is also common for sparse hash tables. the bidirectional linear probing displaces elements Search in Cache Lines. We can exploit mod- on a hash collision from the position correspond- ern computer architectures featuring large cache ing to its hash value, i.e., its remainder. Poyias sizes by selecting a sufficiently small bmax such that and Raman [19] gave different representations for buckets fit into a cache line. Since we are only in- the displacement in the case that the hash table terested in the keys of a bucket during a lookup, an applies linear probing. optimization is to store keys and values separately: In this paper, we show that it is not necessary In our hash table, a bucket is a composition of a to store additional data in case that we resort to key bucket and a value bucket, each of the same separate chaining as collision resolution. The main size. This gives a good locality of reference [7] for strength of our hash table is its memory-efficiency searching a key. This layout is favorable for large during the construction while being at least as fast values of bmax and (keys,value)-pairs where the key as other compact hash tables. Its main weakness is size is relatively small to the value size, since (a) the the slow lookup time for keys, as we do not strive cost for an extra pointer to maintain two buckets for small bucket sizes. instead of one becomes negligible while (b) more keys fit into a cache line when searching a key in a bucket. An overview of the proposed hash table 2 Separate Chaining with layout is given in Fig. 1. Compact Hashing 2.1 Compact Hashing Our hash table H has H buckets, where H is a Compact hashing restricts the choice of the hash j j j j power of two. Let h be the hash function of H. function h. It requires an injective transform f An element with key K is stored in the (h(K) mod that maps a key K to two integers (q; r) with H )-th bucket. To look up an element with key K, 1 r H , where r acts as the hash value h(K). j j the (h(K) mod H )-th bucket is linearly scanned. The≤ values≤ j j q and r are called quotient and re- j j A common implementation represents a bucket mainder, respectively. The quotient q can be used with a linked list, and tries to avoid collisions as to restore the original key K if we know its cor- they are a major cause for decelerating searches. responding remainder r. We translate this tech- Here, the buckets are realized as dynamic arrays, nique to our separate chaining layout by storing q similar to the array hash table of Askitis [2]. We as key in the r-th bucket on inserting a key K with further drop the idea of avoiding collisions. Instead, f(K) = (q; r). we want to maintain buckets of sufficiently large A discussion of different injective transforms is sizes to compensate the extra memory for main- given by Fischer and K¨oppl[11, Sect. 3.2]. Sup- taining (a) the pointers to the buckets and (b) their pose that all keys can be represented by k bits. We sizes. To prevent a bucket from growing too large, want to construct a bijective function f : [1::2k] k ! we introduce a threshold bmax for the maximum f([1::2 ]), where we use the last lg m bits for the 2 transform-type,value-bucket-type quotient-bucket-type bucket<transform::quotient-type> → hash map declare key-type : tranform-type::key-type item-type declare quotient-type : tranform-type::quotient-type value-bucket-type declare value-type : value-bucket-type::item-type bucket transform : transform-type buckets : uint8 t (storing lg H ) | | bucket sizes : uint8 t[2buckets] set(position : int, item-type) quotient buckets : quotient-bucket-type[2buckets] get(position : int): item-type value buckets : value-bucket-type[2buckets] insert(key-type, value-type) transform-type find(key-type): value-type user: hash map: transform: quotient buckets:[r] find(K) key-type transform f(K) (q, r) declare quotient-type declare remainder-type find i with get(i) = q v : quotient buckets[r][i] f(key-type):(quotient-type, remainder-type) 1 f − (quotient-type, remainder-type): key-type v Figure 1: Diagram of our proposed hash table. The call of find(K) returns the i-th element of the r-th bucket at which K is located. The injective transform determines the types of the key and the quotient.