Concurrent Hash Tables: Fast and General(?)!

Tobias Maier¹, Peter Sanders¹, and Roman Dementiev²
¹ Karlsruhe Institute of Technology, Karlsruhe, Germany ([email protected])
² Intel Deutschland GmbH ([email protected])

arXiv:1601.04017v2 [cs.DS] 6 Sep 2016

Abstract

Concurrent hash tables are one of the most important concurrent data structures, used in numerous applications. Since hash table accesses can dominate the execution time of whole applications, we need implementations that achieve good speedup even in these cases. Unfortunately, currently available concurrent hashing libraries turn out to be far away from this requirement, in particular when adaptively sized tables are necessary or contention on some elements occurs.

Our starting point for better performing data structures is a fast and simple lock-free concurrent hash table based on linear probing that is, however, limited to word-sized key-value types and does not support dynamic size adaptation. We explain how to lift these limitations in a provably scalable way and demonstrate that dynamic growing has a performance overhead comparable to the same generalization in sequential hash tables.

We perform extensive experiments comparing the performance of our implementations with six of the most widely used concurrent hash tables. Ours are considerably faster than the best algorithms with similar restrictions and an order of magnitude faster than the best more general tables. In some extreme cases, the difference even approaches four orders of magnitude.

Categories: [D.1.3] Programming Techniques: Concurrent Programming; [E.1] Data Structures: Tables; [E.2] Data Storage Representations: Hash-table representations

Terms: Performance, Experimentation, Measurement, Design, Algorithms

Keywords: Concurrency, dynamic data structures, experimental analysis, hash table, lock-freedom, transactional memory

1 Introduction

A hash table is a dynamic data structure which stores a set of elements that are accessible by their key. It supports insertion, deletion, find, and update in constant expected time. In a concurrent hash table, multiple threads have access to the same table. This allows threads to share information in a flexible and efficient way. Therefore, concurrent hash tables are one of the most important concurrent data structures. See Section 4 for a more detailed discussion of concurrent hash table functionality.

To show the ubiquity of hash tables we give a short list of example applications: A very simple use case is storing sparse sets of precomputed solutions (e.g. [27], [3]). A more complicated one is aggregation, as frequently used in analytical database queries of the form SELECT FROM ... COUNT ... GROUP BY x [25]. Such a query selects rows from one or several relations and counts for every key x how many rows have been found (similar queries work with SUM, MIN, or MAX). Hashing can also be used for a database join [5]. Another group of examples is the exploration of a large combinatorial search space, where a hash table is used to remember the already explored elements (e.g., in dynamic programming [36], itemset mining [28], a chess program, or when exploring an implicitly defined graph in model checking [37]). Similarly, a hash table can maintain a set of cached objects to save I/Os [26]. Further examples are duplicate removal, storing the edge set of a sparse graph in order to support edge queries [23], maintaining the set of nonempty cells in a grid data structure used in geometry processing (e.g. [7]), or maintaining the children in tree data structures such as van Emde Boas search trees [6] or suffix trees [21].

Many of these applications have in common that, even in the sequential version of the program, hash table accesses constitute a significant fraction of the running time. Thus, it is essential to have highly scalable concurrent hash tables that actually deliver significant speedups in order to parallelize these applications. Unfortunately, currently available general purpose concurrent hash tables do not offer the needed scalability (see Section 8 for concrete numbers). On the other hand, it seems to be folklore that a lock-free linear probing hash table where keys and values are machine words, which is preallocated to a bounded size, and which supports no true deletion operation can be implemented using atomic compare-and-swap (CAS) instructions [36]. Find operations can even proceed naively and without any write operations. In Section 4 we explain our own implementation (folklore) in detail, after elaborating on some related work and introducing the necessary notation (in Sections 2 and 3, respectively). To see the potentially big performance differences, consider an exemplary situation with mostly read-only access to the table and heavy contention for a small number of elements that are accessed again and again by all threads. folklore actually profits from this situation because the contended elements are likely to be replicated into local caches. On the other hand, any implementation that needs locks or CAS instructions for find operations will become much slower than the sequential code on current machines. The purpose of our paper is to document and explain performance differences and, more importantly, to explore to what extent we can make folklore more general with an acceptable deterioration in performance.

These generalizations are discussed in Section 5. We explain how to grow (and shrink) such a table, and how to support deletions and more general data types. In Section 6 we explain how hardware transactional memory can be used to speed up insertions and updates and how it may help to handle more general data types.

After describing implementation details in Section 7, Section 8 experimentally compares our hash tables with six of the most widely used concurrent hash tables for microbenchmarks including insertion, finding, and aggregating data. We look at both uniformly distributed and skewed input distributions. Section 9 summarizes the results and discusses possible lines of future research.

2 Related Work

This publication follows up on our previous findings about generalizing fast concurrent hash tables [18]. In addition to describing how to generalize a fast linear probing hash table, we offer an extensive experimental analysis comparing many concurrent hash tables from several libraries.

There has been extensive previous work on concurrent hashing. The widely used textbook "The Art of Multiprocessor Programming" [12] by Herlihy and Shavit devotes an entire chapter to concurrent hashing and gives an overview of previous work. However, it seems to us that a lot of previous work focuses more on concepts and correctness but surprisingly little on scalability. For example, most of the discussed growing mechanisms assume that the size of the hash table is known exactly, without a discussion that this introduces a performance bottleneck limiting the speedup to a constant. Similarly, the actual migration is often done sequentially.

Stivala et al. [36] describe a bounded concurrent linear probing hash table specialized for dynamic programming that supports only insert and find. Their insert operation starts from scratch when the CAS fails, which seems suboptimal in the presence of contention. An interesting point is that they need only word-sized CAS instructions, at the price of reserving a special empty value. This technique could also be adapted to port our code to machines without 128-bit CAS.

Kim and Kim [14] compare this table with a cache-optimized lockless implementation of hashing with chaining and with hopscotch hashing [13]. The experiments use only uniformly distributed keys, i.e., there is little contention. Both linear probing and hashing with chaining perform well in that case. The evaluation of find performance is a bit inconclusive: chaining wins, but uses more space than linear probing. Moreover, it is not specified whether this is for successful (use keys of inserted elements) or mostly unsuccessful (generate fresh keys) accesses. We suspect that varying these parameters could reverse the result.

Gao et al. [10] present a theoretical dynamic linear probing hash table that is lock-free. The main contribution is a formal correctness proof. Not all details of the algorithm, or even an implementation, are given. There is also no analysis of the complexity of the growing procedure.

Shun and Blelloch [34] propose phase-concurrent hash tables, which are allowed to use only a single operation within a globally synchronized phase. They show how phase concurrency helps to implement some operations more efficiently and even deterministically in a linear probing context. For example, deletions can adapt the approach from [15] and rearrange elements. This is not possible in a general hash table since it might cause find operations to report false negatives. They also outline an elegant growing mechanism, albeit without implementing it and without filling in all the details, such as how to initialize newly allocated tables. They propose to trigger a growing operation when any operation has to scan more than k log n elements

3 Preliminaries

We assume that each application thread has its own designated hardware thread or processing core and denote the number of these threads with p. A data structure is non-blocking if no blocked thread currently accessing this data structure can block an operation on the data structure by another thread. A data structure is lock-free if it is non-blocking and guarantees global progress, i.e., there must always be at least one thread finishing its operation in a finite number of steps.

Hash Tables store a set of ⟨Key, Value⟩ pairs (elements). A hash function h maps each key to a cell of a table (an array).
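The bounded "folklore" table discussed in the introduction can be sketched as follows. This is a hedged illustration under our own assumptions, not the authors' code: the class name `FolkloreMap` and the 32-bit key/value split are ours. Keys and values are packed into a single 64-bit word so that one word-sized CAS claims a cell atomically (reserving a special empty key, in the spirit of Stivala et al., avoids the need for a wide 128-bit CAS), insertions probe linearly, and find performs no writes at all, which is why contended elements can stay replicated read-only in local caches.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal sketch of a bounded, lock-free linear probing hash table.
// Key 0 is reserved as the "empty" marker; there is no deletion and no growing.
class FolkloreMap {
    static constexpr uint32_t EMPTY = 0;
    std::vector<std::atomic<uint64_t>> cells_;
    size_t mask_;

    static uint64_t pack(uint32_t k, uint32_t v) { return (uint64_t(k) << 32) | v; }
    static size_t hash(uint32_t k) { return size_t((k * 0x9E3779B97F4A7C15ull) >> 32); }

public:
    explicit FolkloreMap(size_t log_capacity)
        : cells_(size_t(1) << log_capacity),
          mask_((size_t(1) << log_capacity) - 1) {
        for (auto& c : cells_) c.store(pack(EMPTY, 0), std::memory_order_relaxed);
    }

    // Returns false if the key is already present or the table is full.
    bool insert(uint32_t key, uint32_t value) {
        for (size_t i = 0; i <= mask_; ++i) {
            size_t pos = (hash(key) + i) & mask_;   // linear probing
            uint64_t cur = cells_[pos].load(std::memory_order_acquire);
            if (uint32_t(cur >> 32) == key) return false;     // duplicate
            if (uint32_t(cur >> 32) == EMPTY) {
                uint64_t expected = pack(EMPTY, 0);
                // One word-sized CAS claims key and value together.
                if (cells_[pos].compare_exchange_strong(expected, pack(key, value)))
                    return true;
                // CAS failed: another thread filled this cell; re-examine it.
                if (uint32_t(expected >> 32) == key) return false;
            }
        }
        return false;  // table full: the folklore table cannot grow
    }

    // find issues no writes whatsoever.
    bool find(uint32_t key, uint32_t& value) const {
        for (size_t i = 0; i <= mask_; ++i) {
            size_t pos = (hash(key) + i) & mask_;
            uint64_t cur = cells_[pos].load(std::memory_order_acquire);
            if (uint32_t(cur >> 32) == key) { value = uint32_t(cur); return true; }
            if (uint32_t(cur >> 32) == EMPTY) return false;   // hit an empty cell
        }
        return false;
    }
};
```

Note how the sketch reflects the limitations named above: keys and values must fit a machine word together, key 0 is sacrificed as the empty marker, and a full table simply rejects further insertions, which is exactly the size-adaptation problem the paper sets out to solve.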
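The aggregation use case from the introduction (SELECT ... COUNT ... GROUP BY x) needs one more ingredient than plain insert: an atomic update of an existing element's value. On a word-packed cell this can be done with a CAS retry loop, sketched below under the same packing assumption as before; the helper name `increment_or_claim` is hypothetical, and in a full table the cell would first be located by linear probing as usual.

```cpp
#include <atomic>
#include <cstdint>

// Hedged sketch: atomically count occurrences of `key` in one packed cell
// (key in the upper 32 bits, running COUNT in the lower 32 bits).
// Returns false if the cell is already owned by a different key.
bool increment_or_claim(std::atomic<uint64_t>& cell, uint32_t key) {
    uint64_t cur = cell.load(std::memory_order_acquire);
    for (;;) {
        uint32_t cur_key = uint32_t(cur >> 32);
        if (cur_key == 0) {                        // empty: claim with count 1
            if (cell.compare_exchange_weak(cur, (uint64_t(key) << 32) | 1))
                return true;
        } else if (cur_key == key) {               // present: bump the counter
            if (cell.compare_exchange_weak(cur, cur + 1))
                return true;
        } else {
            return false;                          // collision with another key
        }
        // compare_exchange_weak reloaded `cur` on failure; just retry.
    }
}
```

Under contention every failed CAS forces a retry on this hot word, which is one concrete reason why skewed key distributions stress an update-heavy table far more than a read-mostly workload stresses the write-free find path.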