Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming

Josh Triplett, Portland State University ([email protected])
Paul E. McKenney, IBM Linux Technology Center ([email protected])
Jonathan Walpole, Portland State University ([email protected])

Abstract

We present algorithms for shrinking and expanding a hash table while allowing concurrent, wait-free, linearly scalable lookups. These resize algorithms allow Read-Copy Update (RCU) hash tables to maintain constant-time performance as the number of entries grows, and to reclaim memory as the number of entries decreases, without delaying or disrupting readers. We call the resulting data structure a relativistic hash table.

Benchmarks of relativistic hash tables in the Linux kernel show that lookup scalability during resize improves 125x over reader-writer locking, and 56% over Linux's current state of the art. Relativistic hash lookups experience no performance degradation during a resize. Applying this algorithm to memcached removes a scalability limit for get requests, allowing memcached to scale linearly and service up to 46% more requests per second.

Relativistic hash tables demonstrate the promise of a new concurrent programming methodology known as relativistic programming. Relativistic programming makes novel use of existing RCU synchronization primitives, namely the wait-for-readers operation that waits for unfinished readers to complete. This operation, conventionally used to handle reclamation, here allows ordering of updates without read-side synchronization or memory barriers.

1 Introduction

Hash tables offer applications and operating systems many convenient properties, including constant average time for accesses and modifications [3, 10]. Hash tables used in concurrent applications require some sort of synchronization to maintain internal consistency. Frequently accessed hash tables will become application bottlenecks unless this synchronization scales to many threads on many processors.

Existing concurrent hash tables primarily make use of mutual exclusion, in the form of locks. These approaches do not scale, due to contention for those locks. Alternative implementations exist, using non-blocking synchronization or transactions, but many of these techniques still require expensive synchronization operations, and still do not scale well. Running any of these hash-table implementations on additional processors does not provide a proportional increase in performance.

One solution for scalable concurrent hash tables comes in the form of Read-Copy Update (RCU) [18, 16, 12]. Read-Copy Update provides a synchronization mechanism for concurrent programs, with very low overhead for readers [13]. Thus, RCU works particularly well for data structures with significantly more reads than writes; this category includes many data structures commonly used in operating systems and applications, such as read-mostly hash tables.

Existing RCU-based hash tables use open chaining, with RCU-based linked lists for each hash bucket. These tables support insertion, removal, and lookup operations [13]. Our previous work introduced an algorithm to move hash items between hash buckets due to a change in the key [24, 23], making RCU-based hash tables even more broadly usable.

Unfortunately, RCU-based hash tables still have a major deficiency: they do not support dynamic resizing. The performance of a hash table depends heavily on the number of hash buckets. Making a hash table too small will lead to excessively long hash chains and poor performance. Making a hash table too large will consume too much memory, increasing hardware requirements or reducing the memory available for other applications or performance-improving caches. Many users of hash tables cannot know the proper size of a hash table in advance, since no fixed size suits all system configurations and workloads, and the system's needs may change at runtime. Such systems require a hash table that supports dynamic resizing.

Resizing a concurrent hash table based on mutual exclusion requires relatively little work: simply acquire the appropriate locks to exclude concurrent reads and writes, then move items to a new table. However, RCU-based hash tables cannot exclude readers. This property proves critical to RCU's scalability and performance, since excluding readers would require expensive read-side synchronization. Thus, any RCU-based hash-table resize algorithm must cope with concurrent reads while resizing.

Solving this problem without reducing read performance has seemed intractable.

Existing RCU-based scalable concurrent hash tables in the Linux kernel, such as the directory-entry cache (dcache) [17, 11], do not support resizing; they allocate a fixed-size table at boot time based on system heuristics such as available memory. Nick Piggin proposed a resizing algorithm for RCU-based hash tables, known as "Dynamic Dynamic Data Structures" (DDDS) [21], but this algorithm slows common-case lookups by requiring them to check multiple hash tables, and this slowdown increases significantly during a resize.

As our primary contribution, we present the first algorithm for resizing an RCU-based hash table without blocking or slowing concurrent lookups. Because lookups can occur at any time, we keep our relativistic hash table in a consistent state at all times, and never allow a lookup to spuriously miss an entry due to a concurrent resize operation. Furthermore, our resize algorithms avoid copying the individual hash-table nodes, allowing readers to maintain persistent references to table entries.

A key insight made our relativistic hash table possible: we use an existing RCU synchronization primitive, the wait-for-readers operation, to control which versions of the hash-table data structure concurrent readers can observe. This use of wait-for-readers generalizes and subsumes its original application of safely managing memory reclamation. This general-purpose ordering primitive forms the basis of a new concurrent programming methodology, which we call relativistic programming (RP). Relativistic programming enables scalable, high-performance data structures previously considered intractable for RCU.

We use the phrase relativistic programming by analogy with relativity, in which observers can disagree on the order of causally unrelated events. Relativistic programming aims to minimize synchronization by allowing reader operations to occur concurrently with writers; writers may never block readers to enforce a system-wide serialization of memory operations. Inevitably, then, independent readers can disagree on the order of unrelated writer operations, such as the insertion order of two items into separate hash-table chains; however, writers can still synchronize to preserve the order of causally related operations. Concurrent programming methodologies such as transactional memory always preserve the ordering of even unrelated writes, at significant cost to performance and scalability, since readers must use synchronization to support blocking or retries; relativistic programming, by contrast, provides the means to program even complex, whole-data-structure operations such as resizing with excellent performance and scalability.

Section 2 compares our algorithms to other related work. Section 3 provides an introduction to RCU and to the relativistic programming techniques supporting this work. Section 4 documents our new hash-table resize algorithms, and the corresponding lookup operation. Section 5 describes the other hash-table implementations we tested for comparison. Section 6 discusses the implementation and benchmarking of our relativistic hash-table algorithm, including both microbenchmarks and real-world benchmarks. Section 7 presents and analyzes the benchmark results. Section 8 discusses the future of the relativistic programming methodology supporting this work.

2 Related Work

Relativistic hash tables use the RCU wait-for-readers operation to enforce the ordering and visibility of write operations, without requiring synchronization operations in the reader. This novel use of wait-for-readers evolved through a series of increasingly sophisticated write-side barriers. Paul McKenney originally proposed the elimination of read memory barriers by introducing a new write memory barrier primitive that forced a barrier on all CPUs via inter-processor interrupts [14]. McKenney's later work on Sleepable Read-Copy Update (SRCU) [15] used the RCU wait-for-readers operation to manage the order in which write operations became visible to readers, providing the first example of using wait-for-readers to order non-reclamation operations; this use served the same function as a write memory barrier, but without requiring a corresponding read memory barrier in every RCU reader. Philip Howard further refined this approach in his work on relativistic red-black trees [9], using the wait-for-readers operation to order the visibility of tree rotation and balancing operations and to prevent readers from observing inconsistent states. Howard's work used wait-for-readers as a stronger barrier than a write memory barrier, enforcing the order of write operations regardless of the order in which a reader encounters them in the data structure. Relativistic programming builds on this stronger barrier.

Relativistic and RCU-based data structures typically use mutual exclusion to synchronize between writers. Philip Howard and Jonathan Walpole [8] demonstrated an alternative approach, combining relativistic readers with software transactional memory (STM) writers, and integrating the wait-for-readers operation into the transaction commit. This approach provides scalable high-performance relativistic readers, while also allowing scalable writers within the limits of STM. Relativistic transactions could substantially simplify relativistic hash-table writers compared to fine-grained locking, while still providing good scalability.

Prior attempts to build resizable RCU hash tables have arisen from the limitations of fixed-size RCU hash tables in the Linux kernel. Nick Piggin's DDDS [21] supports hash-table resizes, but DDDS slows down all lookups by requiring checks for concurrent resizes, and furthermore requires that lookups during resizes examine both the old and the new structures; relativistic hash tables do neither. We discuss DDDS further in section 5. Herbert Xu implemented a resizable multi-hash-table structure based on RCU, in which every hash-table entry contains two sets of linked-list pointers so it can appear in the old and new hash tables simultaneously [25]. Together with a global version number for the structure, this allows readers to effectively snapshot all links in the hash table simultaneously. However, this approach drastically increases memory usage and cache footprint.

Various authors [7, 5, 19, 2] have proposed resizable concurrent hash tables. Unlike relativistic hash tables, these algorithms require expensive synchronization operations on reads, such as locks, atomic instructions, or memory barriers. Furthermore, like DDDS, several of these algorithms require retries on failure.

Maurice Herlihy and Nir Shavit documented numerous concurrent hash tables, including both open-chained and closed tables [7]; all of these require expensive synchronization, and some require retries. Gao, Groote, and Hesselink proposed a lock-free hash table using closed hashing [5]; their approach relies on atomic operations and on helping concurrent operations complete.

Maged Michael implemented a lock-free hash table based on compare-and-swap (CAS) [19], though he did not propose a resize algorithm. Michael's table lookups avoid most expensive synchronization operations in the common case (with the exception of read barriers), but must retry on any concurrent modification. To support safe memory reclamation, Michael uses hazard pointers [20], which provide a wait-for-readers operation similar to that of RCU; hazard pointers can reduce wait-for-readers latency, but impose higher reader cost [6].

Relativistic hash tables use open hashing with per-bucket chaining. Closed hash tables, which store entries inline in the array, can offer smaller lookup cost and better cache behavior, but force copies on resize. Closed tables also require more frequent resizing, as they do not gracefully degrade in performance when overloaded, but rather become pathologically more expensive and then stop working entirely. Depending on the implementation, removals from the table may not make the table any emptier, as the entries must remain as "tombstones" to preserve reader probing behavior.

Cliff Click presented a scalable lock-free resizable hash for Java based on closed hashing [2]; this hash avoids most synchronization operations for readers and writers by leaving the ordering of memory operations entirely unspecified and reasoning about all possible resulting memory states and transitions. (Readers require a read memory barrier but no other synchronization. Writers require a CAS but not a write memory barrier.) Click's use of state-based reasoning to avoid ordering provides an interesting and potentially higher-performance alternative to the causal-order enforcement in relativistic writers. In contrast with relativistic hash tables, but like DDDS, Click's hash-table readers must probe alternate hash tables during resizing.

Other approaches to resizable hash tables include that of Ori Shalev and Nir Shavit, who proposed a "split-ordered list" structure consisting of a single linked list with hash buckets pointing to intermediate list nodes [22, 7]. This structure allows resizing by adding or removing buckets, splitting or joining the existing buckets respectively. This approach keeps the underlying linked list in a novel sort order based on the hash key, as with the variation of our algorithms proposed in section 4.3, to allow splitting or joining buckets without reordering. Split-ordered lists seem highly amenable to a simple relativistic implementation, making the lookups scalable and synchronization-free while preserving the lock-free modifications and simple resizes; we plan to implement a relativistic split-ordered list in future work.

Our previous work developed a relativistic algorithm for moving a hash-table entry from one bucket to another atomically [24, 23]. This algorithm introduced the notion of cross-linking hash buckets to make entries appear in multiple buckets simultaneously. However, this move algorithm required changing the hash key and potentially copying the entry.

We chose to implement our benchmarking framework rcuhashbash-resize as a Linux kernel module, as documented in section 6.1. However, several portable RCU implementations exist outside the Linux kernel. Mathieu Desnoyers reimplemented RCU as a POSIX userspace library, liburcu, for use with pthreads, with no Linux-specific code outside of optional optimizations [4]. For our real-world benchmarks with the memcached key-value storage engine (documented in section 6.2), we used liburcu to support our modified storage engine.

3 Read-Copy Update Background

Read-Copy Update (RCU) provides synchronization between readers and writers of a shared data structure. In sharp contrast to locking, non-blocking synchronization, or transactional memory, RCU readers perform no expensive synchronization operations whatsoever: no locks, no atomic operations, no compare-and-swap, and no memory barriers. RCU readers typically incur little to no overhead even compared to concurrency-unsafe single-threaded implementations; furthermore, by avoiding expensive synchronization, RCU readers avoid the need for communication between threads, allowing wait-free operation and excellent scalability.

RCU readers execute concurrently, both with each other and with writers, and thus readers can potentially observe writers in progress. (Other concurrent programming models prevent readers from viewing intermediate memory states via locking or conflict detection.) The methodologies of RCU-based concurrent programming primarily address the safe management of reader/writer concurrency. Since writers may not impede readers in any way, programmers must reason about the memory states readers can observe, and avoid exposing inconsistent intermediate states from writers.

Commonly, writers preserve data-structure invariants by atomically transitioning data structures between consistent states. On all existing CPU architectures, aligned writes to machine-word-sized memory regions (such as pointers) have atomic semantics, such that a reader sees either the old or the new state, with no intermediate value; thus, structures linked together via pointers support many structural manipulations via direct updates. For more complex manipulations, such as insertion of a new item into a data structure, RCU writers typically allocate memory initially unreachable by readers, initialize it, and then atomically publish it by updating a pointer in reachable memory. The publish operation requires a write memory barrier between initialization and publication, to ensure that readers traversing the pointer will observe initialized memory. Readers may also require compiler directives to prevent certain aggressive optimizations across the pointer dereference; RCU wraps those directives into a read primitive.¹

¹On certain obsolete architectures (DEC Alpha), readers must also use a memory barrier.

These operations allow RCU writers to update data structures and maintain invariants for readers. However, RCU writers must also manage object lifetimes, which requires knowing when readers might hold references to an item in memory. Unlinking an item from a data structure makes it unreachable to new readers, but does not stop accesses from unfinished readers; writers may not reclaim the unlinked item's memory until all such readers have completed. This resembles a garbage-collection problem, but RCU must support runtime environments without automatic garbage collection.

To this end, RCU provides a barrier-like synchronization operation called wait-for-readers, which blocks until all readers which started before the barrier have completed. Thus, once a writer makes memory unreachable from the published data structure, a wait-for-readers operation ensures that no readers still hold references to that memory. Wait-for-readers does not prevent new readers from starting; it simply waits for existing unfinished readers to complete. This barrier operates conservatively: the currently unfinished readers might not hold references to that item, and the barrier itself may wait longer than strictly necessary in order to run efficiently or batch several reclamations into a single wait operation. This conservative semantic allows much more efficient and scalable implementations, particularly for readers. The portion of a writer that follows a wait-for-readers barrier often consists only of memory reclamation; because memory-reclamation operations can safely occur concurrently and need not occur immediately (assuming sufficient memory), common RCU APIs also provide an asynchronous wait-for-readers callback.
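In the Linux kernel, these primitives appear as rcu_read_lock(), rcu_read_unlock(), rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu(), with call_rcu() as the asynchronous wait-for-readers form. The following minimal sketch shows the pattern; the foo structure and global_foo pointer are hypothetical, and error handling is kept to a minimum:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo { int a; };
static struct foo __rcu *global_foo;

/* Writer: initialize privately, publish, wait for readers, then reclaim. */
int update_foo(int a)
{
	struct foo *newp, *oldp;

	newp = kmalloc(sizeof(*newp), GFP_KERNEL);
	if (!newp)
		return -ENOMEM;
	newp->a = a;                          /* initialize before publishing */
	oldp = rcu_dereference_protected(global_foo, 1);
	rcu_assign_pointer(global_foo, newp); /* publish: implies write barrier */
	synchronize_rcu();                    /* wait-for-readers */
	kfree(oldp);                          /* no reader can still see oldp */
	return 0;
}

/* Reader: no locks, no atomic operations, no memory barriers. */
int read_foo(void)
{
	int a;

	rcu_read_lock();
	a = rcu_dereference(global_foo)->a;   /* ordered pointer dereference */
	rcu_read_unlock();
	return a;
}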
However, writers have more reasons to order operations than just reclaiming memory. Our work on relativistic programming provides a general framework for writers to enforce the ordering of operations visible to readers, using the same wait-for-readers primitive. Relativistic programming builds on RCU's key benefits of minimized communication, minimized expensive synchronization, and readers that run concurrently with writers, to achieve scalability. We present here a specific application of the relativistic programming methodology to maintain the data-structure invariants of a hash table while resizing it; section 8 discusses the future development of the general methodology to support maintenance of arbitrary data-structure invariants.

4 Relativistic Hash Tables

Any hash table requires a hash function, which maps entries to hash buckets based on their key. The same key will always hash to the same bucket; different keys will ideally hash to different buckets, but may map to the same bucket, requiring some kind of conflict resolution. The algorithms described here work with hash tables using open chaining, where each hash bucket has a linked list of entries whose keys hash to that bucket. As the number of entries in the hash table grows, the average depth of a bucket's list grows and lookups become less efficient, necessitating a resize.

Resizing the table requires allocating a new region of memory for the new number of hash buckets, then linking all the nodes into the new buckets. To allow resizes to atomically substitute the new hash table for the old, readers access the hash-table structure through a pointer; this structure includes the array of buckets and the size of the table.

Our resize algorithms synchronize with the corresponding lookup algorithm using existing RCU programming primitives; however, any semantically equivalent implementation will work.

The RP hash lookup reader follows the standard algorithm for open-chain hash-table lookups:

1. Snapshot the hash-table pointer in case a resizer replaces it during the lookup.

2. Hash the desired key, modulo the number of buckets.

3. Look up the corresponding hash bucket in the array.

4. Traverse the linked list, comparing each entry's key to the desired key.

5. If the current entry's key matches the desired key, the desired value appears in the same entry; use or return that value.²

²If the lookup algorithm needs to hold a reference to the entry after the reader ends, it must take any additional steps to protect that entry before ending the reader.
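Rendered in C, the lookup takes roughly the following shape. The rh_table and rh_entry structures and the hash() function are simplified stand-ins for this paper's hlist-based implementation, the table size is assumed to be a power of two, and sparse __rcu annotations are abbreviated:

struct rh_entry {
	struct rh_entry *next;
	unsigned long	key;
};

struct rh_table {
	unsigned long	nbuckets;	/* always a power of two */
	struct rh_entry	*buckets[];
};

static struct rh_table __rcu *table;

struct rh_entry *rh_lookup(unsigned long key)
{
	struct rh_table *t;
	struct rh_entry *e;

	rcu_read_lock();			/* begin the reader */
	t = rcu_dereference(table);		/* step 1: snapshot the table */
	e = rcu_dereference(t->buckets[hash(key) & (t->nbuckets - 1)]);
						/* steps 2-3: hash, find bucket */
	while (e && e->key != key)		/* steps 4-5: walk the chain */
		e = rcu_dereference(e->next);
	rcu_read_unlock();			/* end the reader */
	return e;	/* see footnote 2 before dereferencing the entry */
}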

In a concrete implementation, lookup (like any RCU reader) will include explicit operations delimiting the start and end of the reader. Depending on the choice of RCU implementation, these delimiter operations may compile to compiler directives, to requests to prevent preemption, to manipulation of CPU-local or thread-local counters, or to other lightweight operations.

Lookups will traverse the hash table concurrently with other operations, including resizes. To avoid disrupting lookups, we require that a lookup can never fail to find a node, even in the presence of a concurrent resize. This means that each hash chain must contain those items that hash to the corresponding bucket. Most prior hash-table resize algorithms ensure that a hash chain contains exactly those items. We loosen this constraint, instead allowing hash chains to ephemerally contain items that hash to different buckets. We call such hash chains imprecise, since they include all items which hash to that bucket but may include others as well. Readers and writers must tolerate imprecise hash chains (although some operations, such as lookup, require no adaptation). Imprecise hash chains allow us to resize hash tables and otherwise manipulate buckets without copying items or wasting memory.

For simplicity, relativistic hash tables constrain resizing to change the number of buckets by integral factors, for instance doubling or halving the number of buckets. This guarantees two constraints: first, when shrinking the table, each bucket of the new table will contain all entries from multiple buckets of the old table; and second, when growing the table, each bucket of the new table will contain entries from at most one bucket of the old table.

Based on the first constraint, the RP hash shrink writer can shrink a table as follows:

1. Allocate the new, smaller table.

2. Link each bucket in the new table to the first bucket in the old table that contains entries which will hash to the new bucket.

3. Link the end of each such bucket to the beginning of the next such bucket; each new bucket will thus chain through as many old buckets as the resize factor.

4. Set the table size.

5. Publish the new, valid hash table.

6. Wait for readers. No new readers will have references to the old hash table.

7. Reclaim the old hash table.

Concurrent inserts and removes must block until the shrink algorithm finishes publishing the new hash table and waits for readers to drop references to the old table. See section 4.1 for further details on insertion and removal.

For an example of the shrink algorithm, see figure 1.

[Figure 1: Shrinking a relativistic hash table. (a) The initial state has two buckets, one for odd numbers and one for even numbers. White nodes indicate reachability by odd readers, and black nodes by even readers. (b) The resizer allocates a new one-bucket table and links it to the appropriate old bucket. Dashed nodes exist only in writer-private memory, unreachable by readers. (c) The resizer links the odd bucket's chain to the even bucket, making the odd bucket's chain imprecise. Gray nodes indicate reachability by both odd and even readers. (d) The resizer publishes the new table. (e) After waiting for readers, (f) the resizer can free the old table.]
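The following condensed sketch shows a shrink by a factor of two over the structures from the lookup sketch. For brevity it assumes non-empty buckets and omits allocation-failure handling, sparse annotations, and the exclusion of concurrent updaters:

void rh_shrink(void)
{
	struct rh_table *old = rcu_dereference_protected(table, 1);
	unsigned long i, n = old->nbuckets / 2;
	struct rh_table *new;

	/* Step 1: allocate the new, smaller table; it stays private for now. */
	new = kmalloc(sizeof(*new) + n * sizeof(new->buckets[0]), GFP_KERNEL);
	new->nbuckets = n;			/* step 4: set the table size */

	for (i = 0; i < n; i++) {
		struct rh_entry *e;

		/* Step 2: each new bucket points at the first old bucket
		 * whose entries will hash to it. */
		new->buckets[i] = old->buckets[i];

		/* Step 3: splice old bucket i+n onto the end of old bucket i,
		 * forming a valid but imprecise chain.  Readers of the old
		 * table may now see extra entries, which lookups tolerate. */
		for (e = old->buckets[i]; e->next; e = e->next)
			;
		rcu_assign_pointer(e->next, old->buckets[i + n]);
	}

	rcu_assign_pointer(table, new);		/* step 5: publish */
	synchronize_rcu();			/* step 6: wait for readers */
	kfree(old);				/* step 7: reclaim the old array */
}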
Based on the second constraint, the RP hash expand writer can expand a table as follows:

1. Allocate the new, larger table.

2. For each new bucket, search the corresponding old bucket for the first entry that hashes to the new bucket, and link the new bucket to that entry. Since all the entries which will end up in the new bucket appear in the same old bucket, this constructs an entirely valid new hash table, but with multiple buckets "zipped" together into a single imprecise chain.

3. Set the table size.

4. Publish the new table pointer. Lookups may now traverse the new table, but they will not benefit from any additional efficiency until later steps unzip the buckets.

5. Wait for readers. All new readers will see the new table, and thus no references to the old table will remain.

6. For each bucket in the old table (each of which contains items from multiple buckets of the new table):

   6.1 Advance the old bucket pointer one or more times until it reaches a node that doesn't hash to the same bucket as the previous node. Call the previous node p.

   6.2 Find the subsequent node which does hash to the same bucket as node p, or NULL if no such node exists.

   6.3 Set p's next pointer to that subsequent node pointer, bypassing the nodes which do not hash to p's bucket.

7. Wait for readers. New readers will see the changes made in this pass, so they won't miss a node during the next pass.

8. If any changes occurred in this pass, repeat from step 6. Note that this loop depends only on the interleaving of nodes with different destination buckets in the zipped bucket, not on subsequent inserts or removals; thus, this loop cannot livelock.

9. Reclaim the old hash table.

The wait in step 7 orders unzip operations for concurrent readers. Without it, a reader traversing a zipped chain could follow an updated pointer from an item in a different bucket, and thus erroneously skip some items from its own bucket.

For an example of the expansion algorithm, see figure 2. This version of the algorithm uses the old hash table for auxiliary storage during unzip steps. The algorithm could avoid this auxiliary storage at the cost of additional traversals.

Concurrent inserts and removes on a given bucket must block until after all unzips have completed on that bucket.
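The sketch below renders this expansion for a growth factor of two in the same simplified style. The per-bucket locking that excludes concurrent updaters during each unzip pass is again elided, and newdest() is shorthand for an entry's destination bucket in the new table:

static unsigned long newdest(struct rh_table *new, struct rh_entry *e)
{
	return hash(e->key) & (new->nbuckets - 1);
}

void rh_expand(void)
{
	struct rh_table *old = rcu_dereference_protected(table, 1);
	unsigned long i, n = old->nbuckets * 2;
	struct rh_table *new;
	bool moved;

	/* Step 1: allocate the new, larger table. */
	new = kmalloc(sizeof(*new) + n * sizeof(new->buckets[0]), GFP_KERNEL);
	new->nbuckets = n;			/* step 3: set the table size */

	/* Step 2: point each new bucket at the first entry of its single
	 * old bucket that hashes to it; pairs of new chains stay zipped. */
	for (i = 0; i < n; i++) {
		struct rh_entry *e = old->buckets[i % old->nbuckets];

		while (e && newdest(new, e) != i)
			e = e->next;
		new->buckets[i] = e;
	}

	rcu_assign_pointer(table, new);		/* step 4: publish */
	synchronize_rcu();			/* step 5: flush old-table readers */

	/* Steps 6-8: unzip one interleave point per bucket per pass; the old
	 * bucket array serves as the auxiliary ("aux") position pointers. */
	do {
		moved = false;
		for (i = 0; i < old->nbuckets; i++) {
			struct rh_entry *p, *e = old->buckets[i];
			unsigned long b;

			if (!e)
				continue;
			b = newdest(new, e);
			do {			/* 6.1: find where dest changes */
				p = e;
				e = e->next;
			} while (e && newdest(new, e) == b);
			old->buckets[i] = e;	/* resume here next pass */
			if (!e)
				continue;
			while (e && newdest(new, e) != b)
				e = e->next;	/* 6.2: next node matching p */
			rcu_assign_pointer(p->next, e);	/* 6.3: bypass */
			moved = true;
		}
		synchronize_rcu();		/* step 7: order unzip passes */
	} while (moved);			/* step 8 */

	kfree(old);				/* step 9: reclaim old table */
}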

4.1 Handling Insertion and Removal

Existing RCU-based hash tables synchronize insertion and removal operations with concurrent lookups via standard RCU linked-list operations on the appropriate buckets. Multiple insertion and removal operations synchronize with each other using per-bucket mutual exclusion. (Herlihy and Shavit describe a common workload for hash tables as 90% lookups, 9% insertions, and 1% removals [7], justifying an emphasis on fast concurrent lookups. Nevertheless, several other hash-table implementations offer finer-grained update algorithms based on compare-and-swap [2, 22], which we could potentially adapt to improve concurrent update performance.) Resizes, however, introduce an additional operation that modifies the hash table, and thus require synchronization with insertions and removals. We initially consider it sufficient to minimize performance degradation versus a non-resizable hash table, particularly with no concurrent resize running.

During the initial period of initializing new buckets and publishing the new table, our resizers block all updates, either using a hash-table-wide reader-writer lock (where inserts and removes acquire a read lock and resizers acquire the write lock) or by acquiring all per-bucket locks [7]. In the simplest case, concurrent updates can continue to block until the resize completes; however, concurrent updates can potentially run earlier if they carefully handle the intermediate states produced by the resizer. For a sufficiently large hash table, this may prove necessary to avoid excessive delays on concurrent updates.
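For instance, an insertion under per-bucket locking can take the following shape; bucket_lock() is a hypothetical accessor for the lock guarding a chain, and a resizer is assumed to hold the locks it needs while restructuring:

void rh_insert(struct rh_table *t, struct rh_entry *e)
{
	unsigned long b = hash(e->key) & (t->nbuckets - 1);
	spinlock_t *lock = bucket_lock(t, b);

	spin_lock(lock);
	/* Initialize e->next before making e reachable; concurrent lookups
	 * then see either the old chain head or the fully linked new one. */
	e->next = t->buckets[b];
	rcu_assign_pointer(t->buckets[b], e);
	spin_unlock(lock);
}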

[Figure 2: Growing a relativistic hash table. Colors as in figure 1. (a) The initial state contains one bucket. (b) The resizer allocates a new two-bucket table and points each bucket to the first item with a matching hash; this produces valid imprecise hash chains. (c) The resizer can now publish the new hash table. However, an even reader might have read the old hash chain just before publication, making item 1 gray (reachable by both odd and even readers) and preventing safe modification of its next pointer. (d) The resizer waits for readers; new even readers cannot reach item 1. (e) The resizer updates item 1's next pointer to point to the next odd item. (f) After another wait for readers, (g) the unzipping process can continue. (h) The final state.]

In our shrink algorithm, the resizer must complete all cross-link steps before publishing the table; once the resizer has published the table, the algorithm has effectively completed, with the exception of reclamation, allowing no opportunity for concurrent updates. However, the shrink algorithm could choose to publish the initialized table for updaters as soon as it completes initialization, allowing concurrent updates to proceed while the cross-linking continues. The shrink algorithm may then drop the per-bucket lock for a bucket as soon as it has finished cross-linking that bucket, allowing concurrent insertions and removals on that bucket.

Insertion and removal operations during expansion must take extra care when operating on the zipped buckets. When performing a single unzip pass on a given set of buckets, the expansion algorithm must acquire the per-bucket locks for all buckets in that set. This proves sufficient to handle insertions, which simply insert at the beginning of the appropriate new bucket without disrupting the next resize pass.

Removal, however, may occur at any point in a zipped bucket, including at the location of the resizer's aux pointer that marks the start of the next unzip pass. If a removal occurs with a table expansion in progress, the removal must check for a conflict with this aux pointer, and update the pointer if it points to the removed node. Given the relatively low frequency of removal versus lookup and insertion, and the even lower frequency of resizes, we consider it acceptable to require this additional check in the removal algorithm.
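A hypothetical sketch of that removal-time fixup follows; aux stands for the old-table bucket pointer at which the next unzip pass will resume, and the caller is assumed to hold the per-bucket locks covering the zipped set:

void rh_remove_during_expand(struct rh_entry **prevp,
			     struct rh_entry *victim,
			     struct rh_entry **aux)
{
	if (*aux == victim)
		*aux = victim->next;	/* keep the resizer's position valid */
	rcu_assign_pointer(*prevp, victim->next);	/* unlink */
	synchronize_rcu();		/* wait for readers holding victim */
	kfree(victim);
}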

4.2 Variation: Resizing in Place

The preceding descriptions of the resize algorithms assumed an out-of-place resize: allocate a new table, move all the nodes, reclaim the old table. However, given a memory allocator which can resize existing allocations without moving them, we can adapt the resize algorithms to resize in place. This has two primary side effects: the resizer cannot count on the new table remaining private until published, and the buckets shared with the old table will remain initialized to the same values.

To shrink a hash table in place, we adapt the previous shrink algorithm to avoid disrupting unfinished readers:

1. The smaller table will consist of a prefix of the current table, and the buckets in that prefix already point to the first of the lists that will appear in those buckets.

2. As before, concatenate all the buckets which contain entries that hash to the same bucket in the smaller table.

3. Wait for readers. All new readers will see the concatenated buckets.

4. Set the table size to the new, smaller size.

5. Wait for readers. No new readers will have references to the buckets beyond the common prefix.

6. Shrink the table's memory allocation.

To expand a hash table in place, we can make a similar adaptation to the expansion algorithm by adding a single wait-for-readers before setting the new size. However, the algorithm still requires auxiliary storage equal to the size of the current table. Together with the newly expanded allocation, this makes in-place expansion require the same amount of memory as out-of-place expansion.

4.3 Variation: Keeping Buckets Sorted

Typically, a hash-table implementation will not enforce any ordering on the items within a hash bucket. This allows insertions to take constant time even if a bucket contains many items. However, if we keep the items in a bucket sorted carefully, modulo-based hashing will keep all the items destined for a given new bucket together in the same old bucket. This allows a resize increasing the size of the table to make only as many passes as the resize factor, minimizing the number of waits for readers. This approach optimizes resizes significantly.

Furthermore, an application may find sorted buckets useful for other reasons, such as optimizing failed lookups. Sorted buckets do not provide an algorithmic improvement for lookups, nor can they do anything to accelerate successful lookups; however, sorted buckets do allow failed lookups to terminate sooner, providing a constant-factor improvement for failed lookups and for removals. Blind inserts without checking for duplicates will incur a performance penalty to find the insertion point; however, insertions which check for duplicates will incur minimal additional cost.

Our hash-table expansion algorithm already performs a stable partition of the entries in a bucket, preserving the relative order of entries within each of the subsets that move to the buckets of the new table. The shrink algorithm, however, simply concatenates a set of old buckets into a single new bucket. A simple sort will not allow concatenation or splitting to preserve the sort, but a well-chosen sort order based on the hash key can allow concatenation without a merge step. Ori Shalev and Nir Shavit presented such a sorting mechanism in their "split-ordered list" proposal [22, 7]: they propose sorting by the bit-reversed key. Alternatively, the bucket selection could use the high-order bits of the hash key.
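To make the split-ordered idea concrete, comparing entries by the bit-reverse of their hash yields an order that bucket concatenation and splitting both preserve. This small helper is our own illustration, not code from the cited proposals; bitrev32() comes from the kernel's <linux/bitrev.h>:

#include <linux/bitrev.h>

/* True if an entry with hash ha may precede one with hash hb in a chain. */
static bool split_order_le(u32 ha, u32 hb)
{
	return bitrev32(ha) <= bitrev32(hb);
}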
We do not pursue this variation further in this paper, but we do consider it a potentially productive avenue for future investigation.

5 Comparisons with Other Algorithms

We evaluated relativistic hash tables through both microbenchmarks on the data-structure operations themselves and via real-world benchmarks on an adapted version of the memcached key-value storage engine. The microbenchmarks directly compare our hash-table resize algorithm with two other resize algorithms: reader-writer locking and DDDS. The real-world benchmarks compare memcached's default storage engine with a modified memcached storage engine based on relativistic hash tables.

First, as a baseline, we implemented a simple resizable hash table based on reader-writer locking. In this implementation, lookups acquired a reader-writer lock for reading, to lock out concurrent resizes. Resizes acquired the reader-writer lock for writing, to lock out concurrent lookups. With lookups excluded, the resizer could simply allocate the new table, move all entries from the old table to the new, publish the new table, and reclaim the old table. We do not expect this implementation to scale well, but it represents the best-known method based on mutual exclusion, and we included it to provide a baseline for comparison.

For a more competitive comparison, we turned to Nick Piggin's "Dynamic Dynamic Data Structures" (DDDS) [21]. DDDS provides a generic algorithm to safely move nodes between any two data structures, given only the standard insertion, removal, and lookup operations for those structures. In particular, DDDS provides another method for resizing an RCU-protected hash table without outright blocking concurrent lookups (though it can delay them).

The DDDS algorithm uses two technologies to synchronize between resizes and lookups: RCU to detect when readers have finished with the old data structure, and a Linux construct called a sequence counter or seqcount to detect if a lookup races with a resize. A seqcount employs a counter incremented before and after moving each entry; the reader can use that counter, together with an appropriate read memory barrier, to check for a resize step running concurrently with any part of the read.

The DDDS lookup reader first checks for the presence of an old hash table, which indicates a concurrent resize. If present, the lookup proceeds via the concurrent-resize slow path; otherwise, the lookup uses a fast path that simply performs a lookup within the current hash table. The slow path uses a sequence counter to check for a race with a resize, then performs a lookup first in the current hash table and then in the old table. It returns the result of the first successful lookup, or loops if both lookups fail and the sequence counter indicates a race with a resize. Note that the potentially unbounded number of retries makes DDDS lookups non-wait-free, and could theoretically lead to a livelock, though in practice resizes do not occur frequently enough for a livelock to arise.
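Sketched in terms of the kernel's seqcount API, such a slow path has roughly the following shape. This paraphrase is ours, not Piggin's code; lookup_in(), cur_table, and old_table are stand-ins:

#include <linux/seqlock.h>

extern struct rh_table *cur_table, *old_table;
extern struct rh_entry *lookup_in(struct rh_table *t, unsigned long key);

static seqcount_t resize_seq;

struct rh_entry *ddds_style_lookup(unsigned long key)
{
	struct rh_entry *e;
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&resize_seq); /* includes read barrier */
		e = lookup_in(cur_table, key);
		if (!e)
			e = lookup_in(old_table, key);	/* probe the old table */
	} while (!e && read_seqcount_retry(&resize_seq, seq));
	return e;	/* may loop while a resize moves entries */
}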

We expect DDDS to perform fairly competitively with relativistic hash tables. However, the DDDS lookup incurs more overhead than relativistic hash tables, due to the additional conditionals, the secondary table lookup, the expensive read memory barrier in the sequence counter, and the potential retries with a concurrent resize. Thus, we expect relativistic hash tables to outperform DDDS significantly when running a concurrent resize, and slightly even without a concurrent resize.

For a real-world benchmark, we chose memcached, a key-value storage engine widely used in Internet applications as a high-performance cache. Memcached stores key-value associations in a hash table, and supports a network protocol for setting and getting key-value pairs. Memcached also supports timed expiry of values, and eviction of values to limit maximum memory usage.

The default memcached storage engine makes extensive use of global locks. In particular, a single global lock guards all accesses to the hash table. As a result, we expect memcached's default engine to hit a hard scalability limit, beyond which it will not scale to more requests regardless of available resources.

Memcached requires the ability to scale to varying workload sizes at runtime; as a result, it requires a resizable hash table. Previous non-resizable RCU hash tables could not provide the flexibility necessary for memcached.

We implemented a new RP-based storage engine in memcached, and modified memcached to support a new fast path for the GET request. memcached's default implementation goes to great lengths to avoid copying data when servicing a GET request; memcached also services multiple concurrent client connections per thread in an event-driven manner. As a result of these two constraints, memcached maintains reference counts on each key-value pair in the hash table, and holds a reference to the found item for a GET from the time of the hash lookup to the time the response gets written back to the client. In implementing the RP-based storage engine, we chose instead to copy the value out of a key-value pair while still within an RP reader; this allows the GET fast path to avoid interaction with the reference-counting mechanism entirely. The GET fast path checks the retrieved item for potential expiry or other conditions which would require mutating the store, and falls back to the slow path in those cases.
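The following userspace sketch conveys the shape of that fast path on top of liburcu; item_lookup(), item_is_usable(), and the item layout are hypothetical stand-ins for the engine's internals:

#include <urcu.h>
#include <string.h>
#include <stdbool.h>
#include <sys/types.h>

struct item { size_t nbytes; char data[]; };
extern struct item *item_lookup(const char *key, size_t nkey);
extern bool item_is_usable(const struct item *it);

ssize_t get_fast_path(const char *key, size_t nkey, char *buf, size_t buflen)
{
	struct item *it;
	ssize_t ret = -1;

	rcu_read_lock();
	it = item_lookup(key, nkey);		/* relativistic hash lookup */
	if (it && item_is_usable(it) && it->nbytes <= buflen) {
		memcpy(buf, it->data, it->nbytes); /* copy inside the reader */
		ret = it->nbytes;
	}
	rcu_read_unlock();
	return ret;	/* negative: fall back to the reference-counted slow path */
}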
We expect that with the new RP-based storage engine, memcached will no longer hit the hard scalability limit observed with the default engine, and GET requests should continue to scale up to the limits of the test machine. Since we added wait-for-readers operations to the SET handling, SET will become marginally slower, but the scalability of SET requests should not change; we believe this tradeoff will prove acceptable in exchange for making GET requests scalable.

6 Benchmark Methodology

6.1 Microbenchmark: rcuhashbash-resize

To compare the performance and scalability of our algorithms to the alternatives, we created a test harness and benchmarking framework for resizable hash-table implementations. We chose to implement this framework as a Linux kernel module, rcuhashbash-resize. The Linux kernel already includes a scalable implementation of RCU, locking primitives, and linked-list primitives. Furthermore, we created our hash-table resize algorithms with specific use cases of the Linux kernel in mind, such as the directory entry cache. This made the Linux kernel an ideal development and benchmarking environment.

The rcuhashbash-resize framework provides a common structure for hash tables based on Linux's hlist abstraction, a doubly-linked list with a single head pointer. On top of this common base, rcuhashbash-resize includes the lookup and resize functions for the three resizable hash-table implementations: our relativistic resizable hash table, DDDS, and the simple rwlock-based implementation.

The current Linux memory allocator supports shrinking memory allocations in place, but does not support growing in place. Thus, we implemented the in-place variation of our shrink algorithm and the copying implementation of our expansion algorithm.

rcuhashbash-resize accepts the following configuration parameters:

• The name of the hash-table implementation to test.
• An initial and alternate hash-table size, specified as a power of two.
• The number of entries to appear in the table.
• The number of reader threads to run.
• Whether to run a resize thread.
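In kernel-module form, these might be declared as standard module parameters, as in the following sketch; the names here are illustrative rather than necessarily the module's actual ones:

#include <linux/module.h>
#include <linux/moduleparam.h>

static char *test = "rcu";		/* which implementation to test */
module_param(test, charp, 0444);
static int shift1 = 13;			/* initial size: 2^shift1 buckets */
module_param(shift1, int, 0444);
static int shift2 = 14;			/* alternate size: 2^shift2 buckets */
module_param(shift2, int, 0444);
static unsigned long entries = 65536;	/* entries to place in the table */
module_param(entries, ulong, 0444);
static int readers = 16;		/* number of reader threads */
module_param(readers, int, 0444);
static bool resize = true;		/* whether to run a resize thread */
module_param(resize, bool, 0444);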

rcuhashbash-resize starts by creating a hash table with the specified number of buckets, and adds entries to it containing integer values from 0 to the specified upper bound. It then starts the reader threads and optional resize thread, which record statistics in thread-local variables to avoid the need for additional synchronization. When the test completes, rcuhashbash-resize stops all threads, sums their recorded statistics, and presents the results via the kernel message buffer.

The reader threads choose a random value from the range of values present in the table, look up that value, and record a hit or miss. Since the readers only look up entries that should exist in the table, any miss would indicate a test failure.

The resize thread continuously resizes the hash table from the initial size to the alternate size and back. While continuous resizes do not necessarily reflect a common usage pattern for a hash table, they will most noticeably demonstrate the impact of resizes on concurrent lookups. In practice, most hash tables will choose growth factors and hysteresis to avoid frequent resizes, but such a workload would not allow accurate measurement of the impact of resizing on lookups. We consider a continuous resize a harsh benchmark, but one which a scalable concurrent implementation should handle reasonably. Furthermore, we can perform separate benchmark runs to evaluate the cost of the lookup in the absence of resizes.

The benchmark runs in this paper all used a hash table with 2^16 entries. For each of the three implementations, we collected statistics for three cases: no resizing and 2^13 buckets, no resizing and 2^14 buckets, and continuous resizing between 2^13 and 2^14 buckets. We expect lookups to take less time in a table with more buckets, and thus if the resize algorithms have minimal impact on lookup performance, we would expect to see the number of lookups with a concurrent resizer fall between the no-resize cases with the smaller and larger tables.

For each set of test parameters, we performed 10 benchmark runs of 10 seconds each, and averaged the results.

Our test system had two Intel "Westmere" Xeon DP processors at 2.4GHz, each of which had 6 hardware cores of two logical threads each, for a total of 24 hardware-supported threads (henceforth referred to as "CPUs"). To observe scalability, we ran each benchmark with 1, 2, 4, 8, and 16 concurrent reader threads, with and without an additional resize thread. In all cases, we ran fewer threads than the hardware supported, thus minimizing the need to pass through the scheduler and allowing free CPUs to soak up any unremovable OS background noise. (We do, however, expect that performance may behave somewhat less than linearly when passing 12 threads, as that matches the number of hardware cores.)

All of our tests occurred on a Linux 2.6.37 kernel, targeting the x86-64 architecture. We used the default configuration (make defconfig), with the hierarchical RCU implementation, and no involuntary preemption.

6.2 Real-World Benchmarks: memcached

As a client-server program, memcached required a separate benchmarking program. At the recommendation of memcached developers, we used mc-benchmark, developed by Salvatore Sanfilippo. To minimize the impact of network overhead, we ran the client and server on the same system, communicating via the loopback interface. To generate enough load to reach the limits of memcached, the benchmarking program requires resources comparable to those supplied to memcached. Thus, on the same 24-CPU system, we chose to run 12 memcached threads and up to 12 benchmark processes. mc-benchmark runs a single thread per process, but simulates multiple clients per process using the same kind of event-driven socket handling that memcached does. Experimentation showed that on the test system, one mc-benchmark process could run up to 4 simulated clients with increasing throughput, but at 4 clients it reached the limit of available CPU power, and adding additional clients would result in the same total request throughput. Thus, we ran from 1 to 12 mc-benchmark processes, each of which simulated 4 clients.

To run the memcached server and mc-benchmark client, and collect statistics on the request rate, we used a benchmark script supplied by the memcached developers. For each test run, the benchmark would start memcached and wait for it to initialize, start the desired number of concurrent mc-benchmark processes, wait 20 seconds for the processing to ramp up (mc-benchmark has to first run SET commands to insert test data, then either SET or GET requests depending on the benchmark), and then collect samples of the rate of processed requests directly from memcached; the benchmark collected three rate samples at 2-second intervals, and took the highest observed rate among those three samples.

7 Benchmark Results

To evaluate baseline reader performance in the absence of resizes, we first compare lookups per second for all the implementations with a fixed table size of 8192 buckets; figure 3 shows this comparison. As predicted, our relativistic hash table, shown as RP, and DDDS remain very competitive when not concurrently resizing, though as the number of concurrent readers increases, our implementation's performance pulls ahead of DDDS slightly. Reader-writer locking does not scale at all. In this test case, the reader-writer lock never gets acquired for writing, yet the overhead of the read-lock acquisition prevents any reader parallelism.

We observe the expected deviation from linear growth for 16 readers, likely due to passing the limit of 12 hardware cores. In particular, notice that the performance for 16 threads appears approximately 50% more than that for 8, which agrees with the expected linear increase for fully utilizing 12 hardware cores rather than 8.

Figure 4 compares the lookups per second for our implementation and DDDS in the face of concurrent resizes. (We omit rwlock from this figure, because it would vanish against the horizontal axis; with 16 CPUs, relativistic hash tables provide 125 times the lookup rate of rwlock.)
[Figure 3: Lookups/second by number of reader threads for each of the three implementations, with a fixed hash-table size of 8k buckets, and no concurrent resizes.]

[Figure 4: Lookups/second by number of reader threads for our RP-based implementation versus DDDS, with a concurrent resize thread continuously resizing the hash table between 8k and 16k buckets. rwlock omitted as it vanishes against the horizontal axis.]

With a resizer running, our lookup rate scales better than DDDS, with its lead growing as the number of reader threads increases; with 16 threads, relativistic hashing provides 56% more lookups per second than DDDS. DDDS has sub-linear performance, while our lookup rate improves linearly with reader threads.

To more precisely evaluate the impact of resizing on lookup performance for each implementation, we compare the lookups per second when resizing to the no-resize cases for the larger and smaller table sizes. Figure 5 shows the results of this comparison for our implementation. The lookup rate with a concurrent resize falls between the no-resize runs for the two table sizes that the resizer toggles between. This suggests that our resize algorithms add little to no overhead to concurrent lookups.

[Figure 5: Lookups/second by number of reader threads for our resize algorithms. "8k" and "16k" indicate fixed hash-table sizes in buckets; "resize" indicates continuous resize between the two sizes.]

Figure 6 shows the same comparison for the DDDS resize algorithm. In this case, the lookup rate with a resizer running falls below the lower bound of the smaller hash table. This suggests that the DDDS resizer adds significant overhead to concurrent lookups, as predicted.

[Figure 6: Lookups/second by number of reader threads for the DDDS resize algorithm. "8k" and "16k" indicate fixed hash-table sizes in buckets; "resize" indicates continuous resize between the two sizes.]

Finally, figure 7 shows the same comparison for the rwlock-based implementation. With a resizer running, the rwlock-based lookups suffer greatly, falling initially by two orders of magnitude with a single reader, and struggling back up to only one order of magnitude down at the 16-reader mark.

[Figure 7: Lookups/second by number of reader threads for the rwlock-based implementation. "8k" and "16k" indicate fixed hash-table sizes in buckets; "resize" indicates continuous resize between the two sizes.]

Figure 8 shows the results of our benchmarks on memcached. Note that the default engine hits the expected hard limit on GET scalability, and fails to improve its request-processing rate beyond that limit. The RP-based engine encounters no such scalability limit, and the GET rate grows steadily up to the limits of the system. With a full 12 client processes and 12 server threads, memcached with the RP-based engine services 46% more GET requests per second than the default engine.

[Figure 8: GET and SET operations per second by number of mc-benchmark processes for the default memcached storage engine and our RP-based storage engine. Each mc-benchmark process simulated 4 clients to saturate the CPU.]

As expected, SET requests do not scale in either engine. In the RP engine, SET requests incur the expected marginal performance hit due to wait-for-readers operations; however, this tradeoff will prove acceptable for many workloads, particularly when a successful GET request corresponds to a cache hit that can avoid a database query or other heavyweight processing.

We hypothesize that memcached's default engine only managed to scale to as many clients as it did because it spends the vast majority of its time in the kernel rather than in the memcached userspace code, and the kernel code supported more concurrency than the serialized engine code. Profiling confirmed that memcached spends several times as much time in the kernel as in userspace, regardless of storage engine.

We also performed separate runs of the benchmark using the mutex profiler mutrace. By doing so, we observed that the default engine spent long periods of time contending for the global lock, whereas with the RP-based engine, GET requests no longer incurred any contention for the global lock.

7.1 Benchmark Summary

Our relativistic resizable hash table provides linearly scalable lookup performance in both microbenchmarks and real-world benchmarks. In our microbenchmarks, the relativistic implementation surpassed DDDS by a widening margin of up to 56% with 16 reader threads; both implementations vastly dwarfed reader-writer locking, with our RP implementation providing a 125x improvement with 16 readers. Furthermore, our resize algorithms minimized the impact of concurrent resizing on lookup performance, as demonstrated through the comparison with fixed-size hash tables. In the real-world benchmarks using memcached, the RP-based engine eliminated the hard scalability limit of the default storage engine, and consistently serviced more GET requests per second than the default engine, up to 46% more requests per second when saturating the machine with a full 12 client processes and 12 server threads.
8 Future Work

Our proposed hash-table resize algorithms demonstrate the use of the wait-for-readers operation to order update operations. We use this operation not merely as a write memory barrier, but as a means of flushing existing readers from a structure when their current position could otherwise cause them to see writes out of order. Figure 2 provided a specific example of this, in which a reader has already traversed past the location of an earlier write, but would subsequently encounter a later write if the writer did not first wait for such readers to finish.

We have developed a full methodology for ordering writes to any acyclic data structure while allowing concurrent readers, based on the order in which readers traverse a data structure. This methodology allows writers to consider only the effect of any prefix of their writes, rather than any possible subset of those writes. This proves equivalent to allowing a reader to perform a full traversal of the data structure between any two write operations, but not overlapping any write operation. This methodology forms the foundation of our work on relativistic programming.

Relativistic readers traversing a data structure have a current position, or read cursor. Writes to a data structure also have a position relative to read cursors: some read cursors will subsequently pass through that write, while others have already passed that point. In an acyclic data structure, readers will start their read cursors at designated entry points, and advance their read cursors through the structure until they find what they needed to read or reach the end of their path.

When a writer performs two writes to the data structure, it needs to order those writes with respect to any potential read cursors that may observe them. These writes will either occur in the same direction as reader traversals (with the second write later than the first), or in the opposite direction (with the second write earlier than the first). If the second write occurs later, read cursors between the two writes may observe the second write and not the first; thus, the writer must wait for readers to finish before performing the second write. However, if the second write occurs earlier in the structure, no read cursor may observe the second write and subsequently fail to observe the first write in the same pass (if it reaches the location of the first); thus, the writer need only use the relativistic publish operation, which uses a simple write memory barrier.
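In code, this rule reduces to a choice between the publish barrier and wait-for-readers, as in the following schematic; first_write() and second_write() are placeholders for two dependent writes to an acyclic structure that readers traverse from its entry points:

extern void first_write(void);
extern void second_write(void);

void ordered_update(bool second_lies_later_in_traversal)
{
	first_write();
	if (second_lies_later_in_traversal) {
		/* A read cursor between the two locations could otherwise
		 * see the second write but not the first: flush it out. */
		synchronize_rcu();
	}
	/* Otherwise the second write sits earlier in the structure, and
	 * publication via rcu_assign_pointer()'s write barrier suffices:
	 * any cursor that sees it must subsequently pass the first. */
	second_write();
}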

"Laws of Order" [1] presents a set of constraints on concurrent algorithms, such that any algorithm meeting those constraints must necessarily use expensive synchronization instructions. In particular, these constraints include strong non-commutativity: multiple operations whose order affects the results of both. Our relativistic programming methodology allows readers to run without synchronization instructions, because at a minimum those readers do not execute strongly non-commutative operations: reordering a read and a write cannot affect the results of the write.

We originally developed a more complex hash-table resize operation, which required lookups to retry in a secondary hash table if the primary lookup failed; this approach mirrored that of the DDDS lookup slow path. Our work on the RP methodology motivated the simplified version that now appears in this paper. We plan to use the same methodology to develop algorithms for additional data structures not previously supported by RCU.

As an immediate example, the RP methodology allows a significantly simplified variation of our previous hash-table move algorithm [24, 23]. This variation will no longer need to copy the moved entry and remove the original, a limitation which breaks persistent references and which made the original move algorithm unsuitable for use in the Linux dcache.
9 Availability

The authors have published the code supporting this paper as Free and Open Source Software under the GNU General Public License. For details, see http://git.kernel.org/?p=linux/kernel/git/josh/rcuhashbash.git.

10 Acknowledgments

Thanks to Nick Piggin, Linux kernel hacker and inventor of the DDDS algorithm, for reviewing our implementation of DDDS to ensure that it fairly represents his work. Thanks to Intel for access to the 24-way system used for benchmarking. Thanks to Jamey Sharp, Phil Howard, Eddie Kohler, and the USENIX ATC reviewers for their review, feedback, and advice. Thanks to memcached developer "dormando" for providing technical help with memcached, as well as the benchmarking framework. Thanks to Salvatore Sanfilippo for the original mc-benchmark.

Funding for this research provided by two Maseeh Graduate Fellowships, and by the National Science Foundation under Grant No. CNS-0719851. Thanks to Dr. Fariborz Maseeh and the National Science Foundation for their support.

References

[1] Attiya, H., Guerraoui, R., Hendler, D., Kuznetsov, P., Michael, M. M., and Vechev, M. Laws of Order: Expensive Synchronization in Concurrent Algorithms Cannot be Eliminated. In Proceedings of the ACM POPL '11 (2011).

[2] Click, C. A Lock-Free Hash Table. In JavaOne Conference (2007).

[3] Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms, second ed. MIT Press, 2001, ch. 11: Hash Tables.

[4] Desnoyers, M. Low-Impact Operating System Tracing. PhD thesis, École Polytechnique Montréal, 2009.

[5] Gao, H., Groote, J. F., and Hesselink, W. H. Lock-free dynamic hash tables with open addressing. Distributed Computing 18, 1 (July 2005).

[6] Hart, T. E., McKenney, P. E., Brown, A. D., and Walpole, J. Performance of memory reclamation for lockless synchronization. Journal of Parallel and Distributed Computing 67, 12 (2007), 1270–1285.

[7] Herlihy, M., and Shavit, N. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers, 2008, ch. 13: Concurrent Hashing and Natural Parallelism.

[8] Howard, P. W., and Walpole, J. A relativistic enhancement to software transactional memory. In 3rd USENIX Workshop on Hot Topics in Parallelism (HotPar 2011) (2011).

[9] Howard, P. W., and Walpole, J. Relativistic red-black trees. Tech. Rep. 1006, Portland State University, 2011. http://www.cs.pdx.edu/pdfs/tr1006.pdf.

[10] Knuth, D. The Art of Computer Programming, second ed. Addison-Wesley, 1998, ch. 6.4: Hashing.

[11] Linder, H., Sarma, D., and Soni, M. Scalability of the directory entry cache. In Ottawa Linux Symposium (June 2002), pp. 289–300.

[12] McKenney, P. E. Exploiting Deferred Destruction: An Analysis of Read-Copy-Update Techniques in Operating System Kernels. PhD thesis, OGI School of Science and Engineering at Oregon Health and Sciences University, 2004.

[13] McKenney, P. E. RCU vs. locking performance on different CPUs. In linux.conf.au (Adelaide, Australia, January 2004).

[14] McKenney, P. E. Software implementation of synchronous memory barriers. US Patent 6996812, February 2006.

[15] McKenney, P. E. Sleepable read-copy update. Linux Weekly News. http://lwn.net/Articles/202847/, 2008.

[16] McKenney, P. E., Sarma, D., Arcangeli, A., Kleen, A., Krieger, O., and Russell, R. Read Copy Update. In Ottawa Linux Symposium (June 2002), pp. 338–367.

[17] McKenney, P. E., Sarma, D., and Soni, M. Scaling dcache with RCU. Linux Journal 2004, 117 (2004).

[18] McKenney, P. E., and Slingwine, J. D. Read-Copy Update: Using Execution History to Solve Concurrency Problems. In Parallel and Distributed Computing and Systems (October 1998), pp. 509–518.

[19] Michael, M. M. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the Fourteenth ACM Symposium on Parallel Algorithms and Architectures (2002), SPAA '02, pp. 73–82.

[20] Michael, M. M. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems 15, 6 (2004), 491–504.

[21] Piggin, N. ddds: "dynamic dynamic data structure" algorithm, for adaptive dcache hash table sizing. Linux kernel mailing list. http://mid.gmane.org/20081007064834.[email protected], October 2008.

[22] Shalev, O., and Shavit, N. Split-ordered lists: Lock-free extensible hash tables. Journal of the ACM 53 (May 2006), 379–405.

[23] Triplett, J. Lockless hash table lookups while performing key update on hash table element. US Patent 7668851, February 2010.

[24] Triplett, J., McKenney, P. E., and Walpole, J. Scalable Concurrent Hash Tables via Relativistic Programming. ACM Operating Systems Review 44, 3 (July 2010).

[25] Xu, H. bridge: Add core IGMP snooping support. Linux netdev mailing list. http://mid.gmane.org/E1NlbuT-00021C-0b@gondolin.me.apana.org.au, February 2010.
