Algorithmic Improvements for Fast Concurrent Cuckoo Hashing

Xiaozhou Li¹, David G. Andersen², Michael Kaminsky³, Michael J. Freedman¹
¹Princeton University, ²Carnegie Mellon University, ³Intel Labs

Abstract

Fast concurrent hash tables are an increasingly important building block as we scale systems to greater numbers of cores and threads. This paper presents the design, implementation, and evaluation of a high-throughput and memory-efficient concurrent hash table that supports multiple readers and writers. The design arises from careful attention to systems-level optimizations such as minimizing critical section length and reducing interprocessor coherence traffic through algorithm re-engineering. As part of the architectural basis for this engineering, we include a discussion of our experience and results adopting Intel's recent hardware transactional memory (HTM) support to this critical building block. We find that naively allowing concurrent access using a coarse-grained lock on existing data structures reduces overall performance with more threads. While HTM mitigates this slowdown somewhat, it does not eliminate it. Algorithmic optimizations that benefit both HTM and designs for fine-grained locking are needed to achieve high performance.

Our performance results demonstrate that our new hash table design—based around optimistic cuckoo hashing—outperforms other optimized concurrent hash tables by up to 2.5x for write-heavy workloads, even while using substantially less memory for small key-value items. On a 16-core machine, our hash table executes almost 40 million insert and more than 70 million lookup operations per second.

Figure 1: Highest throughput achieved by different hash tables on a 4-core machine (64 bit key/value pairs, read-to-write ratio = 1:1, 120 million keys; x-axis: throughput in million reqs per sec). Hash tables compared: Google dense_hash_map, C++11 std::unordered_map, optimistic concurrent cuckoo, Intel TBB concurrent_hash_map, (*) cuckoo+ with fine-grained locking, and (*) cuckoo+ with HTM. (*) are our new hash tables.

1. Introduction

High-performance, concurrent hash tables are one of the fundamental building blocks for modern systems, used both in concurrent user-level applications and in system applications such as kernel caches. As we continue our hardware-driven race towards more and more cores, the importance of having high-performance, concurrency-friendly building blocks increases. Obtaining these properties increasingly requires a combination of algorithmic engineering and careful attention to systems issues such as internal parallelism, cache alignment, and cache coherency.

At the outset of this research, we hoped to capitalize on the recently introduced hardware transactional memory (HTM) support in Intel's new Haswell chipset, the TSX instructions [1]. Contrary to our expectations, however, we ended up implementing a design that performs well regardless of its use of HTM, and the bulk of our time was not spent dealing with concurrency mechanisms, but rather in algorithm and data structure engineering to optimize for concurrent access. For fast hash tables, HTM's biggest benefit may be to software engineering, by reducing the intellectual complexity of locking, with a modest performance gain as a secondary benefit.

As a result of these efforts, this paper presents the design and implementation of the first high-performance, multiple-reader/writer hash table that achieves the memory efficiency of multi-way Cuckoo hashing [18]. Most fine-grained concurrent hash tables today store entries in a linked list with per-bucket locks [2] or Read-Copy-Update (RCU) mechanisms [17, 19]. While often fast, the pointers used in these approaches add high overhead when the key/value items are small. In contrast, our Cuckoo-based design achieves high occupancy with no pointers.

We contribute a design that provides high throughput for multiple writers; the prior work we build upon [8] allowed only a single writer, limiting the generality of the data structure. Our design uses algorithmic engineering of Cuckoo hashing, combined with architectural tuning in the form of effective prefetching, use of striped fine-grained spinlocks, and an optimistic design that minimizes the size of the locked critical section during updates.

The result of these engineering efforts is a solid building block for small key-value storage. On a 16-core machine, our table achieves almost 40 million inserts per second, outperforming the concurrent hash table in Intel's Threading Building Blocks by 2.5x, while using less than half of the memory for 64 bit key/value pairs. Figure 1 gives an example of how our scheme (cuckoo+) outperforms other hash tables with mixed random read/write workloads. Section 6 presents a performance evaluation detailing the advantages of this cuckoo-based approach for multicore applications.

2. Background and Related Work

This section provides background information on hash tables and concurrency control mechanisms. We conclude with a brief performance evaluation of the effects of naively applying standard concurrency control techniques to several common hash table implementations. These results remind us that high-performance concurrency is not trivial: careful algorithm engineering is important regardless of the underlying concurrency control mechanisms, and the algorithmic effects dominate the choice of concurrency mechanism.

2.1 Hash Tables

As used in this paper, a hash table provides Lookup, Insert, and Delete operations for indexing all key-value objects. Hash tables do not support retrieval by any key ordering. Popular designs vary in their support for iterating through the hash table in the presence of concurrent modifications; we omit consideration of this feature.

Interface. On Lookup, a value is returned for the given key, or "does not exist" if the key cannot be found. On Insert, the hash table returns success, or an error code to indicate whether the hash table is too full or the key already exists. Delete simply removes the key's entry from the hash table. We focus on Lookup and Insert, as Delete is very similar to Lookup.

High-performance single-thread hash tables. As an example of a modern, extremely fast hash table, we compare in several places against Google's dense_hash_map, a hash table available in the Google SparseHash [9] library. Dense hash sacrifices space efficiency for extremely high speed: It uses open addressing with quadratic internal probing. It maintains a maximum 0.5 load factor by default, and stores entries in a single large array.

C++11 introduces an unordered_map implemented as a separate chaining hash table. It has very fast lookup performance, but at the cost of more memory usage.

The performance of these hash tables does not scale with the number of cores in the machine, because only one writer or one reader is allowed at the same time.

Multiple-reader, single-writer hash tables. As a middle ground between no thread safety and full concurrency, single-writer tables can be extended to permit many concurrent readers. Such designs often use optimistic techniques such as versioning or the read-copy-update (RCU) [17] techniques becoming widely used within the Linux kernel.

Our work builds upon one such hash table design. Cuckoo hashing [18] is an open-addressed hashing technique with high memory efficiency and O(1) amortized insertion time and retrieval. As a basis for its hashing, our work uses the multi-reader version of cuckoo hashing from MemC3 [8], which is optimized for high memory efficiency and fast concurrent reads.

Scalable concurrent hash tables. The Intel Threading Building Blocks library (Intel TBB) [2] provides a concurrent_hash_map that allows multiple threads to concurrently access and update values. This hash table is also based upon the classic separate chaining design, where keys are hashed to a bucket that contains a linked list of entries. This design is quite popular for concurrent hash tables: Because a key hashes to one unique bucket, holding a per-bucket lock permits guaranteed exclusive modification while still allowing fine-grained access. Further care must be taken if the hash table permits expansion.

2.2 Concurrency Control Mechanisms

As noted earlier, part of our motivation was to explore the application of hardware transactional memory to this core data structure. All concurrent data structures require some mechanism for arbitrating concurrent access, which we briefly list below, focusing on those used in this work.

Locking. Multi-threaded applications take advantage of an increasing number of cores to achieve high performance. To ensure thread safety, multiple threads have to serialize their operations when accessing shared data, often through the use of a critical section protected by a lock.

The simplest form of locking is to wrap a coarse-grained lock around the whole shared data structure. Only one thread can hold the lock at the same time. This tends to be pessimistic, since the thread with the lock prevents any other threads from accessing the shared resource, even if they only want to read the data or make non-conflicting updates.

Another option is to use fine-grained locking by splitting the coarse-grained lock into multiple locks. Each fine-grained lock is responsible for protecting a region of the data, and multiple threads can operate on different regions of the data at the same time. Fine-grained locking can improve the overall performance of a concurrent system. However, it must be carefully designed and implemented to behave correctly without deadlock, livelock, starvation, etc.

Hardware Transactional Memory (HTM). It is often hard to write fast and correct multi-threaded code using fine-grained locking. Transactional memory [10] is designed to make the creation of reliable multi-threaded programs easier. Much like database transactions, all shared memory accesses and their effects are applied atomically, i.e., they are either committed together or discarded as a group. With transactional memory, threads no longer need to take locks when accessing the shared data structures held in memory, yet the system will still guarantee thread safety. Previous experience, implementations and evaluations of HTM include Sun's Rock processor [3, 6], AMD's Advanced Synchronization Facility [5, 4], IBM Blue Gene/Q [21] and System Z [13].
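As a concrete picture of the coarse-grained locking baseline discussed above, the sketch below wraps an ordinary hash map in one global mutex; every reader and writer serializes on that single lock, which is exactly the behavior §2.3 measures. This is our own illustrative example, not code from any of the cited libraries, and the class and method names are hypothetical.

#include <cstdint>
#include <mutex>
#include <optional>
#include <unordered_map>

// Coarse-grained locking: one global lock protects the entire table, so at
// most one thread (reader or writer) can touch it at a time.
class GlobalLockMap {
public:
    void insert(uint64_t key, uint64_t value) {
        std::lock_guard<std::mutex> guard(lock_);
        map_[key] = value;
    }

    std::optional<uint64_t> lookup(uint64_t key) const {
        std::lock_guard<std::mutex> guard(lock_);
        auto it = map_.find(key);
        if (it == map_.end()) return std::nullopt;   // "does not exist"
        return it->second;
    }

private:
    mutable std::mutex lock_;                        // the single coarse-grained lock
    std::unordered_map<uint64_t, uint64_t> map_;
};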

Recently, Intel released Transactional Synchronization Extensions (TSX) [1], an extension to the Intel 64 architecture that adds transactional memory support in hardware. Part of the recently released Intel Haswell microarchitecture, TSX allows the processor to determine dynamically whether threads need to serialize through lock-protected critical sections, and to serialize only when required. With TSX, the program can declare a region of code as a transaction. A transaction executes and atomically commits all results to memory when the transaction succeeds, or aborts and cancels all the results if the transaction fails (e.g., conflicts occur). We focus on the Restricted Transactional Memory (RTM) interface of TSX, which gives the programmer the flexibility to start, commit and abort transactional execution. Intel evaluated TSX for high-performance computing workloads [22], already optimized for parallelism, and showed that TSX provides an average speedup of 1.41x.

Figure 2: Insert throughput vs. number of threads for single-writer hash tables with and without TSX lock elision. Each thread is pinned to a different hyper-threaded core. 16 million different keys are inserted in each table. (Tables compared: cuckoo hash table, Google dense_hash_map, and C++11 std::unordered_map, each with and without TSX; y-axis: throughput in million reqs per sec.)

2.3 Naive use of concurrency control fails

Before making deeper changes, we begin by examining the performance of several hash tables without algorithmic optimization, using both naive global locking and using Intel's TSX to optimize this approach. While the poor performance of these approaches is not surprising, their relative simplicity makes them an important starting baseline for understanding further improvements.

Haswell's hardware memory transactions are a best-effort model intended for fast paths. The hardware provides no guarantees as to whether a transactional region will ever successfully commit. Therefore, any transaction implemented with TSX needs a fallback path. The simplest fallback mechanism is "lock elision": the program executes a lock-protected region speculatively as a transaction, and only falls back to normal locking if the transaction does not succeed. An implementation of TSX lock elision for glibc [20] has been released. It adds a TSX-elided lock as a new type of POSIX mutex. Applications linked against this new glibc library automatically have their pthread locks elided.

Lock elision may seem promising for designing a concurrent, multi-writer hash table: multiple threads may be able to update different sets of non-conflicting entries of the hash table at the same time. Through a set of experiments, we make two observations about TSX lock elision: It outperforms the naive use of a global lock, but it does not ensure that multicore concurrent writes are faster than single-core exclusive access.

We evaluated the Insert throughput of the optimistic cuckoo hash table in MemC3, std::unordered_map in C++11, and dense_hash_map in the Google SparseHash [9] library, both with and without TSX lock elision, on a quad-core machine with hyperthreading enabled. All these hash tables allow only one writer at a time, as each Insert has to lock the entire table. Global counters were removed in the cuckoo hash table and dense_hash_map to avoid obvious common data conflicts.

Figure 2 shows the results of our experiment. With global pthread locks, each hash table's multi-thread aggregate write throughput is much lower than that of a single thread, due to extensive lock contention. By enabling TSX lock elision, the aggregate write throughput is higher than that with pthread global locks, but still much lower than the single-thread throughput. This is because most transactions fail and abort, forcing the program to take the fallback lock frequently, resulting in sequential behavior. According to Intel Performance Counter Monitor [12], the transactional abort rates are above 80% for all three hash tables with 8 concurrent writers. We will discuss the reasons for transactional aborts and how to reduce the abort rate in Section 5.

Through this experiment, we find that naively making a data structure concurrent may harm its performance. Simply applying lock elision using hardware transactional memory can mitigate the performance degradation caused by lock contention, but may not be able to scale up throughput as more cores access the same lock-protected data structure.

3. Principles to Improve Concurrency

Given that naive application of global locking with or without hardware transactional memory support fails to provide scalable performance, what must be done? In this section we present our design principles to improve the concurrent performance of data structures. Although these principles are general and well known, we state them here to illustrate the framework within which our algorithmic engineering, discussed in the next section, optimizes for concurrent access in cuckoo hashing. In general, the key to improving concurrency for a data structure is to reduce lock contention. We present three principles to help achieve this reduction:

P1. Avoid unnecessary or unintentional access to common data. When possible, make globals thread-local; for example, disable instant global statistics counters in favor of lazily aggregated per-thread counters (a minimal sketch follows this list). These simple optimizations are already included in our results for the cuckoo hash table and Google dense_hash_map in Figure 2. Without them, concurrent performance was much worse.

P2. Minimize the size and execution time of critical sections. A promising strategy is to move data accesses out of the critical section whenever possible. As we show in the following section, an optimistic approach can work well here if there are search-like operations that must be performed: Perform the entire search outside of a critical section, and then transactionally execute by only verifying that the found value remains unchanged.

P3. Optimize the concurrency control mechanism. Tune the concurrency control implementation to match the expected behavior of the data structure. For example, because the critical sections of our optimized hash tables are all very short, we use lightweight spinlocks and lock striping in the fine-grained locking implementation, and optimize TSX lock elision to reduce the transactional abort rate when applying it to the coarse-grained locking implementation.
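As an illustration of P1, the sketch below keeps statistics in per-thread counters that are only summed when a reader asks for the total, rather than in a single shared counter that every writer updates. This is our own illustrative example, not code from the paper; the class and member names are hypothetical.

#include <array>
#include <atomic>
#include <cstdint>

// Hypothetical illustration of principle P1: each thread bumps its own
// cache-line-aligned counter, so the hot path never touches shared data.
// Totals are aggregated lazily, only when someone reads the statistic.
class PerThreadCounter {
public:
    static constexpr int kMaxThreads = 64;

    void increment(int thread_id) {
        // Relaxed ordering is enough: the counter is only read for statistics.
        slots_[thread_id].value.fetch_add(1, std::memory_order_relaxed);
    }

    uint64_t read() const {
        uint64_t total = 0;
        for (const auto& s : slots_)
            total += s.value.load(std::memory_order_relaxed);
        return total;
    }

private:
    struct alignas(64) Slot {            // one cache line per thread avoids false sharing
        std::atomic<uint64_t> value{0};
    };
    std::array<Slot, kMaxThreads> slots_;
};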
By following these principles, data structures can reduce the possibility of multiple threads attempting to access data protected by a shared lock or within the same transactional region, and thus improve the concurrent performance with either fine-grained or coarse-grained locking. We show how to apply these principles to the design of a concurrent cuckoo hash table in the next two sections, to greatly improve multi-threaded read/write throughput.

4. Concurrent Cuckoo Hashing

We now present the design of a multi-reader/multi-writer cuckoo hash table that is optimized for fast concurrent writes. By applying the principles previously described, our resulting design achieves high and scalable multi-threading performance for both read- and write-heavy workloads. We begin by presenting the basic operation of cuckoo hashing [18], followed by the multiple-reader/single-writer version that we build upon to create our final solution [8].

4.1 Cuckoo Hashing

Cuckoo hashing [18] is an open-addressed hash table design. All items are stored in a large array, with no pointers or linked lists. To resolve collisions, two techniques are used: First, items can be stored in one of two buckets in the array, and they can be moved to their other location if the first is full. Second, in common use, the hash buckets are multi-way set associative, i.e., each bucket has B "slots" for items. B = 4 is a common value in practice.¹ A lookup for key proceeds by computing two hashes of key to find buckets b1 and b2 that could be used to store the key, and examining all of the slots within each of those buckets to determine if the key is present. A basic "2,4-cuckoo" hash table (two hash functions, four slots per bucket) is shown in Figure 3. A consequence of this is that Lookup operations are both fast and predictable, always checking 2B keys.

¹ Without set-associativity, basic cuckoo hashing allows only 50% of the table entries to be occupied before unresolvable collisions occur. It is possible to improve the space utilization to over 90% by using a 4-way (or higher) set-associative hash table [7].

Figure 3: Cuckoo hash table overview: Each key is mapped to 2 buckets by hash functions and associated with 1 version counter. ∅ represents an empty slot. "a → b → c → ∅" is a cuckoo path to make one bucket available to insert key y.

To Insert a new key into the table, if either of the two buckets has an empty slot, the key is inserted in that bucket; if neither bucket has space, a random key from one candidate bucket is displaced by the new item. The displaced item is then relocated to its own alternate location, possibly displacing another item, and so on, until a maximum number of displacements is reached. If no vacant slot is found, the hash table is considered too full to insert and an expansion process is scheduled.

We call the sequence of displaced keys in an Insert operation a cuckoo path, as illustrated in Figure 3. Write performance of cuckoo hashing degrades as the table occupancy increases, since the cuckoo path length will increase, and more random reads/writes are needed for each Insert.
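To make the bucket layout and the two-choice lookup concrete, here is a minimal single-threaded sketch of a "2,4-cuckoo" table as described above. It is an illustration under our own simplifying assumptions (fixed 64-bit keys and values, two arbitrary hash functions), not the paper's implementation.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Minimal sketch of a 2,4-cuckoo table: every key can live in one of two
// buckets, and each bucket holds B = 4 slots. Lookup always inspects at
// most 2B slots, which is what makes reads fast and predictable.
struct Slot { bool occupied = false; uint64_t key = 0, value = 0; };
struct Bucket { Slot slots[4]; };

class CuckooTable {
public:
    explicit CuckooTable(size_t num_buckets) : buckets_(num_buckets) {}

    std::optional<uint64_t> lookup(uint64_t key) const {
        for (size_t b : {bucket1(key), bucket2(key)}) {      // the two candidate buckets
            for (const Slot& s : buckets_[b].slots) {
                if (s.occupied && s.key == key) return s.value;
            }
        }
        return std::nullopt;                                 // "does not exist"
    }

private:
    // Two independent hash functions; any reasonable choice works for a sketch.
    size_t bucket1(uint64_t key) const {
        return std::hash<uint64_t>{}(key) % buckets_.size();
    }
    size_t bucket2(uint64_t key) const {
        return std::hash<uint64_t>{}(key * 0x9E3779B97F4A7C15ULL + 1) % buckets_.size();
    }

    std::vector<Bucket> buckets_;
};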
4.2 Prior Work in Concurrent Cuckoo

Basic cuckoo hashing does not support concurrent access. Our work builds upon the two major prior approaches to concurrent cuckoo hashing: Herlihy's lock-striped approach [11], and the optimistic cuckoo hashing with separated path discovery and item movement from MemC3 [8]. Our resulting design realizes the strengths of each: the low space overhead of MemC3's approach and the concurrent writer support of Herlihy's approach.

Our starting point was MemC3's table, which used three building blocks:

• To eliminate reader/writer false misses, change the order of the basic cuckoo hashing insertions. Allow concurrent reads and cuckoo movement by moving "holes" backwards along the cuckoo path instead of moving "items" forward along the cuckoo path. This ensures that an item can always be found by a reader thread; if it is undergoing concurrent cuckoo movement, it may be present twice in the table, but will never be missing. Providing this property requires separating the process of searching for a cuckoo path from using it: find the empty slot, and then use the path. As we show, this has a second benefit: This searching process can be moved outside of the critical section.

• Implement efficient concurrency control by using lock striping. Lock striping [11] uses a smaller vector of locks (or, in MemC3, version counters) that each maps to a set of items in the hash table. To lock a bucket, a writer thread computes the lock stripe entry corresponding to the bucket and locks that entry. By using reasonably sized lock tables, such as 1K-8K entries, the locking can be both very fine-grained and low-overhead.

• Allow reads to be performed with no cache line writes by using optimistic locking [14]. Instead of locking for reads, the hash table uses a lock-striped version counter associated with the buckets, updates it upon insertion or displacement, and looks for a version change during lookup.

By using these techniques with only version counters and a simple global lock for writers, MemC3 provided substantial gains for read-intensive workloads, but still performed poorly for write-heavy workloads. Unfortunately, the basic scheme used in MemC3 was not obviously amenable to fine-grained locking:

1. The cuckoo path can be very long. Grabbing a few hundred locks in the right order to avoid deadlock and livelock is tricky. There is also a nontrivial probability that a path becomes invalid, and the execution of Insert needs to restart, further complicating locking, increasing the risk of livelock, and harming performance.

2. The Insert procedure for optimistic concurrent cuckoo hashing in MemC3 [8] involves nested locks if fine-grained locking is implemented, which can easily cause deadlock.

4.3 Algorithmic Optimizations

4.3.1 Lock After Discovering a Cuckoo Path

In MemC3 cuckoo hashing, each Insert operation locks the hash table at the very beginning of the process, and releases the lock after the insertion completes. The separated phases of search and execution of the cuckoo path are all protected by the lock within one (big) critical section.

To reduce the size of critical sections, our first optimization was to search for an empty slot before acquiring the lock, and then only lock the table when displacing the items along the cuckoo path and inserting the new item. In this way, multiple Insert threads can look for their cuckoo paths at the same time without interfering with each other. Inserts are still serialized, but the critical section is smaller.

Algorithm 1: MemC3 Cuckoo Insert procedure. The largest possible critical section spans the entire locked region (lines 3–9).
 1: function INSERT(h, x)              ▷ Insert key x to table h
 2:   b1, b2 ← two buckets mapped by key x
 3:   LOCK(h)
 4:   if ADD(h, b1, x) or ADD(h, b2, x) then
 5:     UNLOCK(h); return true
 6:   if path ← SEARCH(h, b1, b2) then
 7:     EXECUTE(h, path)
 8:     UNLOCK(h); return true
 9:   UNLOCK(h); return false

Algorithm 1 shows the basic Insert procedure that allows concurrent reads. ADD(h, b, x) tries to insert key x into bucket b, and returns true on success or false if the bucket is full. SEARCH(h, b1, b2) searches for a cuckoo path that makes either bucket b1 or b2 available to insert a new item. EXECUTE(h, path) moves items backwards along the cuckoo path, and then inserts key x into the bucket made available. The critical section of this algorithm is the whole process. When the table occupancy is high, this may involve hundreds of bucket reads to search for a cuckoo path, followed by hundreds of item displacements along that path, during which all Insert operations of other threads are blocked.

Algorithm 2: Cuckoo Insert – lock after discovering a path. The largest possible critical section is the validate-and-execute region (lines 10–13).
 1: function INSERT(h, x)              ▷ Insert key x to table h
 2:   b1, b2 ← two buckets mapped by key x
 3:   for i ← 1, 2 do
 4:     if AVAILABLE(h, bi) then       ▷ if bi has an empty slot
 5:       LOCK(h)
 6:       if ADD(h, bi, x) then
 7:         UNLOCK(h); return true
 8:       UNLOCK(h)
 9:   while path ← SEARCH(h, b1, b2) do
10:     LOCK(h)
11:     if VALIDATE_EXECUTE(h, path) then
12:       UNLOCK(h); return true
13:     UNLOCK(h)
14:   return false

Algorithm 2 shows our new Insert procedure. The lock is acquired only when doing the actual writes to the hash table. As the search phase is not protected by the lock, there exists a potential race condition: After one thread reads a bucket to extend its cuckoo path, another thread can write to the same bucket and cause the first thread to read corrupted data. Therefore, Insert must re-check whether the related entries have been modified before each item displacement in the execution phase, which is handled by VALIDATE_EXECUTE(h, path). If the existing path becomes invalid, it restarts and looks for a new path. Each displacement relocates only one item to its alternate bucket, so there is no undo needed if execution aborts. We omit the steps to check whether key x already exists in both Algorithm 1 and 2; this check should be performed within each critical section.
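A sketch of what the validate-and-execute step might look like in C++ follows: before each single-item displacement, the writer re-checks, under the lock, that the key it expects to move is still where the previously discovered path says it is, and gives up (so the caller can re-search) if anything changed. This is our own illustrative rendering of the idea behind VALIDATE_EXECUTE, not the paper's code; the Table interface and all names are hypothetical.

#include <cstddef>
#include <cstdint>
#include <vector>

// One hop of a previously discovered cuckoo path: move the key currently in
// (from_bucket, from_slot) into the empty (to_bucket, to_slot).
struct PathStep {
    size_t from_bucket, from_slot;
    size_t to_bucket, to_slot;
    uint64_t expected_key;   // key observed during the unlocked search phase
};

// Executes the path backwards (last hop first), so a "hole" moves toward the
// insertion point and every intermediate state still contains each key at
// least once. Returns false if any bucket changed since the search, in which
// case the caller simply searches for a new path and retries.
template <typename Table>
bool validate_execute(Table& t, const std::vector<PathStep>& path) {
    for (auto it = path.rbegin(); it != path.rend(); ++it) {
        auto& src = t.bucket(it->from_bucket).slots[it->from_slot];
        auto& dst = t.bucket(it->to_bucket).slots[it->to_slot];
        if (!src.occupied || src.key != it->expected_key || dst.occupied)
            return false;                // path invalidated by a concurrent writer
        dst = src;                       // copy key/value into the empty slot
        src.occupied = false;            // then free the source slot
    }
    return true;                         // the first bucket on the path now has a hole
}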
To summarize, each Insert optimistically searches for a cuckoo path, then displaces items along the path with lock protection. Execution terminates at the end of the path or if the path becomes invalid (and then Insert restarts). How often does a path become invalid after being discovered? We can estimate the probability that a cuckoo path of one writer overlaps with paths of other writers: Let N denote the number of entries in the hash table, L (≪ N) denote the maximum length of a cuckoo path, and T denote the number of threads. For a given writer, the maximum probability that its cuckoo path overlaps with at least one of the other paths is as below (the derivation can be found in Appendix B):

    P_invalid_max ≈ 1 − ((N − L)/N)^(L(T−1))    (1)

For example, the maximum length of a cuckoo path in MemC3 is L = 250. Suppose N = 10 million and T = 8; then P < 4.28%. This upper bound assumes all paths are invalid at maximum length, which occurs only rarely; the expected probability is much lower. It is, however, non-negligible. We apply further algorithmic optimizations next to reduce the odds of such a failure by several orders of magnitude.

4.3.2 Breadth-first Search for an Empty Slot

Basic cuckoo hashing searches for an empty slot using a greedy algorithm: if the current bucket is full, a random key is "kicked out" to its alternate location, and possibly kicks out another random key there, until a vacant position is found. Each bucket touched by the process is a part of the cuckoo path. As table occupancy grows, the average length of cuckoo paths increases, because more buckets must be examined to extend the cuckoo path and find an empty slot. It may require hundreds of displacements for one Insert, which greatly slows down the performance.

A cuckoo hash table can be viewed as an undirected graph called a cuckoo graph, which has a vertex for each bucket, and an edge for each key in the table, connecting the two alternative buckets of the key. The "random displacements" scheme used by basic cuckoo hashing to look for an empty slot is thus a random depth-first search (DFS) of the graph. To reduce the number of item displacements and the size of critical sections, we use breadth-first search (BFS) instead. Each slot in a bucket is considered as a possible path, and extends its own path to alternate buckets in the same way.

Figure 4 shows an example of the two searching schemes in a 2-way set-associative hash table. Both schemes examine 18 slots (9 buckets) to find an empty slot in the search, with no item actually moved. Figure 4a is the traditional searching scheme where each time only one random key is displaced. The cuckoo path discovered is a → e → b → h → x → f → d → t → ∅. Figure 4b uses BFS to look for an empty slot. While the number of examined slots is the same, the BFS cuckoo path is a → z → u → ∅, which is much shorter.

Figure 4: Search for an empty slot by Insert in a 2-way set-associative hash table. Left (4a) is the traditional random-displacement approach; right (4b) is our breadth-first search. Slots in gray are examined before the empty slot is found. Letters are keys selected to be moved to their alternate locations along the cuckoo path, represented by the arrows (→).

The prior work on MemC3 used an optimization of tracking two cuckoo paths in parallel, completing when either found an empty slot, but still used a DFS strategy. This strategy, in general, reduced the expected length of a cuckoo path by a factor of two. In contrast, the BFS strategy we present here reduces the expected length to a logarithmic factor: For a B-way set-associative cuckoo hash table, where M is the maximum number of slots to be checked while looking for an available bucket before declaring the table too full, the maximum length of a cuckoo path from BFS is as below (the derivation can be found in Appendix C):

    L_BFS = ⌈log_B(M/2 − M/(2B) + 1)⌉    (2)

As used in MemC3, B = 4 and M = 2000. With two-way DFS, the maximum number of displacements for a single Insert is 250, whereas with BFS it is L_BFS = 5.

This optimization is key to reducing the size of the critical section: While the total number of slots examined is still M, this is work that can be performed without a lock held. With BFS, however, at most five buckets must be examined and modified with the lock actually held, reducing both the duration of the critical section and the number of cache lines dirtied while doing so.

Shorter cuckoo paths also reduce the chance of a path becoming invalid (and of transactional aborts). Based on Eq. 1, with L_BFS = 5 and the same settings as the example at the end of §4.3.1, the new worst-case P_invalid < 1.75 × 10⁻⁵ — an extremely rare event.
Prefetching. BFS provides a second benefit: because the schedule of buckets to visit is predictable, we can prefetch buckets into cache before they are accessed to reduce the cache-miss penalty. In the cuckoo graph, the alternate buckets of the keys in the current bucket are considered neighbors of that bucket. BFS scans all neighbors of a bucket to extend the cuckoo path. Before scanning one neighbor, the processor can load the next neighbor into cache, which will be accessed soon if no empty slot is found in the current neighbor. This cannot be done with the traditional DFS approach, because the next bucket location is unknown until one key in the current bucket is "kicked out".

4.3.3 Increase Set-associativity

As discussed in §4.1, higher set-associativity improves space utilization. The cuckoo hash table in MemC3 is 4-way set-associative, which achieves a 95% maximum load factor and high performance for read-intensive workloads.

The impact of set-associativity on the read and write performance of cuckoo hashing is two-fold:

• Higher set-associativity leads to lower read throughput, since each Lookup must scan up to 2B slots from two buckets in a B-way set-associative hash table. If a bucket fits in a cache line, then the read throughput would not be affected too much.

• Higher set-associativity may improve write throughput, because each Insert can read fewer random buckets (with fewer cache misses) to find an empty slot, and needs fewer item displacements to insert a new item. However, the set-associativity cannot be too high, since a Lookup is required to check whether the new key already exists in the hash table before each Insert, which becomes slower as set-associativity increases.

To achieve a good balance between read- and write-heavy workloads, we use an 8-way set-associative hash table. §6 evaluates the performance with different set-associativities and different workloads. Our choice of 8-way associativity may require reading more than one cache line per bucket, but this extra cost is offset by the fact that the two lines can be fetched together, costing only memory bandwidth, not latency, and that sequential memory reads are substantially faster because they typically hit in the DRAM row buffer.

4.4 Fine-grained Locking

Fine-grained locking is often used to improve concurrency. However, it is non-trivial to implement fine-grained per-bucket locking for traditional cuckoo hashing. There are high deadlock and livelock risks.

In basic cuckoo hashing, it is not known before displacing the keys how many and which buckets will be modified, because each displaced key depends on the one previously kicked out. Therefore, standard techniques to make Insert atomic and avoid deadlock, such as acquiring all necessary locks in advance, are not obviously applicable. As noted earlier, simply using the optimization of finding the path in advance was not enough to solve this problem because of lingering locking complexity issues.

By reducing the length of the cuckoo path and reordering the locking procedure, our optimizations make fine-grained locking practical. To do so, we go back to the basic design of lock-striped cuckoo hashing and maintain an actual lock in the stripe in addition to the version counter (our lock uses the high-order bit of the counter). Here we favor spinlocks using compare-and-swap over more general-purpose mutexes. A spinlock wastes CPU cycles spinning on the lock while other writers are active, but has low overhead, particularly for uncontended access. Because the operations that our hash tables support are all very short and have low contention, very simple spinlocks are often the best choice.

To Insert each new key-value pair, there is at most one new item inserted and four item displacements. Each insert or displacement involves exactly two buckets. The Insert operation only locks the pair of buckets associated with the ongoing insertion or displacement, and releases the lock immediately after it completes, before locking the next pair. Locks of the pair of buckets are ordered by the bucket id to avoid deadlock. If two buckets share the same lock, then only one lock is acquired and released during the process. In summary, a writer must only lock at most five (usually fewer than three) pairs of buckets sequentially for an Insert operation.

Although there is a small chance that any cuckoo insert will abort because of other concurrent inserts, it is likely to succeed on a retry. It is worth noting that this design only avoids livelock probabilistically. A writer thread that encounters excessive insert aborts could pessimistically acquire a full-table lock by acquiring each of the 2048 locks in the lock-striped table, but we have never observed a condition where this would be warranted.

The combination of these techniques results in a cuckoo hash table that (i) retains high memory efficiency (the efficiency of the basic table plus the small additional lock-striping table), (ii) permits highly concurrent read-write access, and (iii) has a minimally-sized critical section that reads and dirties few cache lines while holding the lock or executing under hardware transactional memory.
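A compact sketch of how such striped spinlocks could be combined with the version counters is shown below: the lock is the high-order bit of a 32-bit counter in a fixed-size stripe array, and a displacement locks its two buckets in stripe-index order so that writers cannot deadlock. This is an illustrative sketch with hypothetical names and policies, not the authors' implementation.

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <utility>

// A stripe of 2048 combined lock/version words. Bit 31 is the spinlock;
// the low bits form the version counter that optimistic readers check.
class LockStripe {
public:
    static constexpr size_t kStripes = 2048;
    static constexpr uint32_t kLockBit = 1u << 31;

    size_t stripe_of(size_t bucket) const { return bucket & (kStripes - 1); }

    void lock(size_t stripe) {
        for (;;) {
            uint32_t v = words_[stripe].load(std::memory_order_relaxed);
            if (!(v & kLockBit) &&
                words_[stripe].compare_exchange_weak(v, v | kLockBit,
                                                     std::memory_order_acquire))
                return;
            // else: keep spinning; a PAUSE hint could go here
        }
    }

    void unlock(size_t stripe) {
        // Clearing the lock bit and bumping the version in one store tells
        // optimistic readers that the protected buckets may have changed.
        uint32_t v = words_[stripe].load(std::memory_order_relaxed);
        words_[stripe].store((v & ~kLockBit) + 1, std::memory_order_release);
    }

    // Lock the two buckets touched by one insert/displacement, ordered by
    // stripe index to avoid deadlock; equal stripes are locked only once.
    void lock_two(size_t b1, size_t b2) {
        size_t s1 = stripe_of(b1), s2 = stripe_of(b2);
        if (s1 > s2) std::swap(s1, s2);
        lock(s1);
        if (s2 != s1) lock(s2);
    }

    void unlock_two(size_t b1, size_t b2) {
        size_t s1 = stripe_of(b1), s2 = stripe_of(b2);
        if (s1 > s2) std::swap(s1, s2);
        if (s2 != s1) unlock(s2);
        unlock(s1);
    }

private:
    std::array<std::atomic<uint32_t>, kStripes> words_{};
};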

5. Optimizing for Intel TSX

As shown in §2.3, naive use of TSX lock elision on hash tables with a global lock does not provide high multi-threaded throughput. The key to improving concurrent performance is to reduce the "transactional abort rate." In the Haswell implementation of TSX, the underlying hardware transactional memory system uses tags in the L1 cache to track the read- and write-sets of transactions at a granularity of a cache line. Transactions abort for three common reasons:

1. Data conflict on a transactionally accessed address. A transaction encounters a conflict if a cache line in its read-set is written by another thread, or if a cache line in its write-set is read or written by another thread.

2. Limited resources for transactional stores. A transaction will abort if there is not enough space to buffer its reads and writes in cache. Current implementations can track only 16KB of data.

3. TSX-unfriendly instructions. Several instructions (e.g., XABORT, PAUSE) and system calls (e.g., mmap) cause transactions to abort.

For high performance, the program must minimize transactional aborts. From the first two causes, we draw several conclusions about general issues with transactional aborts:

• Transactions that touch more memory are more likely to conflict with others, as well as to exceed the L1-cache-limited capacity for transactional reads and writes.

• Transactions that take longer to execute are more likely to conflict with others.

• Sharing of commonly-accessed data, such as global statistics counters, can greatly increase conflicts.

• Because the hardware tracks reads and writes at the granularity of a cache line, false sharing can create transactional conflicts even if no data appears to be shared.

The observant reader will no doubt note that many of these same issues arise in cache-centric performance optimizations. Our solutions are similar but not identical. To address these issues and improve the multi-threaded concurrent performance of cuckoo hashing with coarse-grained locking and TSX lock elision enabled, we need only follow principles P1 and P2 presented in §3, as detailed in §4. Our algorithmic optimizations can significantly reduce the size of the transactional region in a cuckoo Insert process from hundreds of bucket reads and writes to only a few bucket writes, which greatly reduces the transactional abort rate caused by data conflicts or limited transactional stores.

The third cause of transactional aborts indicates that a program should minimize the occurrence of TSX-unfriendly instructions within transactional regions. A common example is if dynamic memory allocation must invoke a system call such as brk, futex, or mmap. While our implementation of Cuckoo hashing does not do this, we observed this problem when testing TSX using chained hashing and Masstree [16]. It is therefore useful to pre-allocate structures that may be needed inside the transactional region. If they are not used, one can simply store them in a per-thread cache and use them for a subsequent transaction (or preallocate and free if using a malloc that already does this, such as tcmalloc). This is an application of principle P3.

Further, we use a tuned version of TSX lock elision that matches the expected behavior of the data structure. The generic glibc version of TSX lock elision for pthread mutexes can be improved substantially if the application's transactional behavior is known in advance, as is the case for our optimized cuckoo hash table, in which every transaction is small. This is another application of principle P3. We detail our implementation of TSX lock elision in Appendix A.
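For readers unfamiliar with the RTM interface, the sketch below shows the general shape of such a lock-elision wrapper: try the short critical section a few times as a hardware transaction, read the fallback lock inside the transaction so that a real lock holder aborts speculating threads, and only take the fallback spinlock after repeated aborts. It is a generic illustration of the pattern using the standard RTM intrinsics, not the authors' Appendix A implementation; the retry count and names are placeholders.

#include <atomic>
#include <immintrin.h>   // RTM intrinsics: _xbegin, _xend, _xabort, _XBEGIN_STARTED

// A very small elided lock: callers bracket short critical sections with
// acquire()/release(). Most of the time the section runs as a hardware
// transaction and the fallback lock is never written.
class ElidedSpinlock {
public:
    void acquire() {
        for (int attempt = 0; attempt < kMaxRetries; ++attempt) {
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                // Reading the fallback lock inside the transaction means that
                // if another thread later takes it for real, we abort.
                if (!locked_.load(std::memory_order_relaxed)) return;
                _xabort(0xff);           // lock already held; do not elide
            }
            // Transaction aborted (conflict, capacity, unfriendly instruction); retry.
        }
        // Fallback path: take the lock non-speculatively.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            while (locked_.load(std::memory_order_relaxed)) _mm_pause();
        }
    }

    void release() {
        if (!locked_.load(std::memory_order_relaxed)) {
            _xend();                     // we were speculating; commit the transaction
        } else {
            locked_.store(false, std::memory_order_release);
        }
    }

private:
    static constexpr int kMaxRetries = 3;   // placeholder retry policy
    std::atomic<bool> locked_{false};
};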
6. Evaluation

In this section, we investigate how the proposed techniques and optimizations contribute to the improvements of read and write performance in cuckoo hashing.

Platform. Most experiments (except Figure 7) run on a 4-core Haswell-microarchitecture Intel i7-4770 at 3.4GHz. This is the highest core count currently available with TSX support. The L1 D-cache is 32KB; the L2 cache is 256KB; the L3 cache is 8MB. The machine is equipped with 16GB of DDR3 SDRAM.

Method and Workloads. 8 byte keys and 8 byte values are used for most experiments. The default cuckoo hash table is 8-way set-associative with 2^27 = 134,217,728 slots, which uses about 2 GB of memory. In each bucket, all the keys come first and then the values, and the bucket fits exactly two cache lines: one for 8 keys and another for 8 values. We evaluate different set-associativities in §6.3 and key-value sizes in §6.4.

We focus on the performance benefit from our optimizations and TSX support for workloads with concurrent writes by measuring the aggregate throughput of multiple threads accessing the same hash table. We focus on three workloads: a) 100% Insert, b) 50% Insert and 50% Lookup, and c) 10% Insert and 90% Lookup.

Each experiment first creates an empty cuckoo hash table and then fills it to 95% capacity, with random mixed concurrent reads and writes as per the specified insert/lookup ratio. Because cuckoo hashing slows down as the table fills (more items must be moved), we measure both overall throughput and throughput for certain load factor intervals (e.g., empty to 50% full). Each data point in the graphs of this section is the average of 10 runs. We observed that the performance is always stable, so we do not include error bars in the graphs.

6.1 Factor Analysis of Insert Performance

This experiment investigates how much our optimizations and the use of Intel TSX improve the Insert performance of cuckoo hashing. We break down the performance gap between basic optimistic cuckoo hashing and our optimized concurrent cuckoo hashing. We measure different hash table designs with the Insert-only workload, starting from the basic cuckoo and adding optimizations cumulatively as follows:

• cuckoo: The optimistic concurrent multi-reader/single-writer cuckoo hashing used in MemC3 [8]. Each Insert locks the whole hash table.
• +lock later: Lock after discovering a cuckoo path.
• +BFS: Look for an empty slot by breadth-first search.
• +prefetch: Prefetch the next bucket into cache.
• +TSX-glibc: Use the released glibc TSX lock elision [20] to support concurrent writers.
• +TSX*: Use our TSX lock elision implementation that is optimized for short transactions (Appendix A) instead of TSX-glibc.

Figure 5: Contribution of optimizations to the hash table Insert performance; optimizations are cumulative. (a) Single-thread Insert performance (all locks disabled). (b) Aggregate Insert performance of 8 threads, with locking. Throughput (million reqs per sec) is reported overall (load 0-0.95) and for the load intervals 0.75-0.9 and 0.9-0.95.

Single-thread Insert performance is shown in Figure 5a. All locks are disabled, so "lock later" and "TSX" do not apply here. At high load factors, BFS improves single-thread write performance by ∼26%, and data prefetching further increases the throughput by ∼9%.

At low table occupancy, these optimizations are less important. In most cases, there are plenty of empty slots, and so no keys need to be moved. Further, when the cuckoo paths are all short, there is no savings in item motion to outweigh the slightly increased search cost of BFS over DFS. At high occupancy, BFS substantially reduces the number of item displacements, and prefetching is more useful because more buckets need to be evaluated as insertion candidates.

Multi-thread Insert performance is shown in Figure 5b, measured by aggregating the throughput from 8 threads accessing the same hash table. A global lock is used for each Insert in the optimistic cuckoo hashing. Due to lock contention, the multi-threaded aggregate throughput of the optimistic cuckoo hashing is much lower than the single-thread throughput. The performance difference between the original optimistic cuckoo hashing scheme and optimized cuckoo hashing with TSX lock elision is roughly 20×.

To understand the source of these benefits, the upper plot of Figure 5b shows the optimization sequence with lock elision enabled first and algorithmic optimizations applied later. With no algorithmic optimizations, using the customized TSX* elision improves overall throughput by ∼4.3× over basic TSX lock elision. Comparing the top and bottom figures, when TSX* is applied after our algorithmic changes, it still improves throughput by almost 2x. This demonstrates the importance of using TSX in a way that is well-matched to the properties of the transactions it is handling. The improvements from fine-grained locking (not shown) are similar to those from applying TSX*, but slightly slower.

Simply reducing the size of the critical section without TSX or fine-grained locking results in only modest improvements (bottom graph, far left): from 1.38 to 3.7 million operations per second. However, once the system is capable of supporting fine-grained concurrent access, the improvement from the algorithmic changes is large (top graph, far left): from 7.94 to 29.2 million operations per second.

High performance is a consequence of both sufficiently fine-grained concurrency and data structures optimized to make that concurrency efficient. Neither of these optimizations alone was able to achieve more than 8 million operations per second, but they combine to achieve almost 30 million. Of particular note was that the algorithmic improvements needed here were concurrency-specific: Without concurrency, for example, the BFS changes were performance-neutral, but with fine-grained locking, BFS increased performance by over 30%.

This latter conclusion is particularly true under high contention: The rightmost graphs in the figure show the performance improvements for the highly-loaded portion of the hash table fill, growing from 90% to 95% (a load factor that might occur with a heavy insert/delete workload). In this case, the performance gains of the algorithmic engineering are even more important: The high contention means that TSX alone encounters frequent aborts, only improving performance by about 10%. The algorithmic optimizations then provide a roughly 11x improvement.

Figure 6: Throughput vs. number of threads, for 100% Insert, 50% Insert, and 10% Insert workloads. (a) Average throughput to fill the table from 0% to 95% occupancy. (b) Average throughput at high table occupancy (90% − 95%). Compared tables: cuckoo, cuckoo with TSX lock elision, cuckoo+, cuckoo+ with TSX lock elision, cuckoo+ with fine-grained locking, and TBB concurrent_hash_map. "cuckoo" is the optimistic cuckoo hashing used in MemC3; "cuckoo+" is cuckoo with the optimizations in §4.3. TSX lock elision is the optimized version in Appendix A. The cuckoo hash table is 2 GB with ∼134.2 million slots. Table occupancy is for cuckoo hashing only. The TBB concurrent_hash_map is filled with the same number and size of key-value pairs, using 2× to 3× more memory than the cuckoo hash table.

6.2 Multi-core Scaling Comparison

This section evaluates hash table performance under an increasing number of cores, comparing both our original and optimized tables, and also the Intel TBB [2] concurrent_hash_map. We initialize the TBB table with the same number of buckets and key-value type, then operate with the same workloads.

Cuckoo+ scales well as the number of cores increases, both on our 4-core Haswell machine (Figure 6) and when using fine-grained locking on a 16-core Xeon machine without TSX support (Figure 7). On the Haswell machine, the performance increase from 4 to 8 threads is slightly lower than that up to 4 threads because there are only 4 physical cores.

In comparison, the basic optimistic cuckoo hash table scales poorly for a write-heavy workload, even using TSX lock elision. As shown in Figure 6, its total Insert throughput actually drops as more cores are used, except for the read-heavy workloads (rightmost graphs) for which its optimistic design works well. Notably, however, even under 10% inserts, cuckoo+ still substantially outperforms optimistic cuckoo.

The fine-grained locking version of Cuckoo+ also scales well for all workloads. Its absolute performance is up to 20% less than the TSX-optimized version, however, suggesting that there is a non-negligible benefit from hardware support.

To put these numbers in perspective, we also compare against the Intel Threading Building Blocks hash table. This comparison is slightly unfair: TBB supports concurrent iteration and other features that our hash table does not, but at a high level, it demonstrates both that our table's performance is good (it outperforms TBB substantially), particularly for read-intensive workloads, and that Cuckoo+ retains the memory efficiency advantages of the core Cuckoo design: It uses 2 − 3× less memory for these small key-value objects, occupying only about 2GB of DRAM versus TBB's 6GB.

The results in Figure 7 show that these results also extend to larger machines, using a dual-socket Xeon server with 16 total cores, each a bit slower than those in the Haswell machine. Neither server has perfect speedup after 8 cores—memory operations begin to traverse the QPI interconnect between the sockets—but Cuckoo+ continues to scale for write-heavy workloads where TBB scales only for read-heavy workloads.

Figure 7: Overall throughput vs. number of cores, on a 16-core machine without TSX support, for 100% Insert, 50% Insert, and 10% Insert workloads. Compared tables: cuckoo+ with fine-grained locking and Intel TBB concurrent_hash_map.

Figure 8: 8-thread aggregate Lookup throughput of hash tables with different set-associativities at 95% occupancy, using optimized cuckoo hashing with TSX lock elision (4-way: 68.95, 8-way: 63.64, 16-way: 54.17 million reqs per sec).

6.3 Set-associativity and Load Factor

In this section, we evaluate the impact of set-associativity and load factor on cuckoo hashing performance, using the optimized cuckoo hashing with TSX lock elision. The experiments use the same workloads and hash tables with the same number of slots as before.

Figure 8 shows the aggregate Lookup-only throughput of 8 threads for 4-, 8-, and 16-way set-associative hash tables, all at 95% table occupancy. As expected, lower associativity improves throughput, because each reader needs to check fewer slots in order to find the key. Each Lookup in a 4-way set-associative hash table needs at most two cache line reads to find the key and get the value. Each Lookup in an 8-way set-associative hash table needs at most two cache line reads to find the key and one more cache line read to get the value. Each Lookup in a 16-way set-associative hash table needs at most four cache line reads to find the key and one more cache line read to get the value.

Figure 9 shows the 8-thread aggregate throughput of tables with different set-associativities, for different workloads at different table occupancies. Write performance degrades as the table occupancy increases, since an Insert operation has to read more buckets to find an empty slot, and needs more item displacements to insert the new key.

The load factor is important in this discussion because of the different use modes for hash tables: Some applications may simply fill the table in one go and then use it (perhaps modifying inserted values but not deleting keys), thus caring more about total insert rate. Others may issue inserts and deletes to a table at high occupancy, thus caring more about 90%-95% insert throughput.

Our results show that 8-way set-associativity has the best overall performance. It always outperforms 4-way set-associativity for 100% and 50% Insert workloads, and for 10% Insert workloads when the load factor is above 0.85. 16-way set-associativity always performs worst at low or moderate table occupancy. It starts to outperform 4-way set-associativity when the load factor is above 0.75, and achieves the highest throughput for write-heavy workloads when the load factor is above 0.92. We therefore use 8-way associativity as our default because of its generality.

6.4 Different Key-Value Sizes

All previous experiments used workloads with 8 byte keys and 8 byte values. In this section, we evaluate the cuckoo hash table performance with different value sizes. Figure 10 shows the results of our two experiments.

In Figure 10a, we configure the hash table with 2^25 entries and show throughput as the value size increases from 8 bytes to 256 bytes. As expected, the throughput decreases as the value size increases because of the increased memory bandwidth needed. On our 4-core machine, hyperthreading becomes much less effective with large values, because the machine runs out of memory bandwidth, and so performance scales only to the point of running one thread on each of the 4 physical cores. For example, with 256 byte values, single-thread throughput is 3.05 million reqs per second, 4-thread throughput is 3.6× higher than 1-thread throughput, but 8-thread throughput is only 27% higher than 4-thread throughput.

Figure 10b reveals an interesting consequence of our current design when used with TSX: Large values increase the amount of memory touched during the transaction and therefore increase the odds of a transactional abort. For this experiment, we fix the hash table at 4GB and increase the key-value pair size to 1024 bytes. TSX lock elision outperforms fine-grained locking with small key-value sizes, but is worse at 1024 bytes. Improving our table design to reduce this effect seems a worthwhile area of future improvement.

Figure 9: 8-thread aggregate throughput of hash tables with 4-way, 8-way, and 16-way set-associativity at different table occupancies (load factors 0.3–0.9), for 100% Insert, 50% Insert, and 10% Insert workloads, using optimized cuckoo hashing with TSX lock elision.

(a) Hash table with fixed number (∼33.4 million) of entries, using optimized cuckoo hashing with TSX lock elision.
(b) Hash table with fixed size (4 GB), using optimized cuckoo hashing with fine-grained locking or TSX lock elision.
Figure 10: Throughput with 8 byte keys and different sizes of values. thr stands for thread, ins for insert.

7. Discussion and Implementation Availability

Our results about TSX can be interpreted in two ways. On one hand, in almost all of our experiments, hardware transactional memory provided a modest but significant speedup over either global locking or our best-engineered fine-grained locking, and it was easy to use. This confirms other recent results showing, e.g., a “free” 1.4x speedup from using TSX in HPC workloads [22]. On the other hand, the benefits of data structure engineering for efficient concurrent access contributed substantially more to improving performance, but also required deep algorithmic changes to the point of being a research contribution on their own.

The focus of this paper was on the algorithmic and systems changes needed to achieve the highest possible hash table performance. As is typical in a research paper, this results in a fast, but somewhat “bare-bones” building block with several limitations, such as supporting only short fixed-length key-value pairs. To facilitate the wider applicability of our results, one of our colleagues has, subsequent to the work described herein, incorporated this design into an open-source C++ library, libcuckoo [15]. The libcuckoo library offers an easy-to-use interface that supports variable-length key-value pairs of arbitrary types, including those with pointers or strings, provides iterators, and dynamically resizes itself as it fills. The price of this generality is that it uses locks for reads as well as writes, so that pointer-valued items can be safely dereferenced, at the cost of a 5-20% slowdown. Specialized applications will, of course, still get the most performance using the hybrid locking/optimistic approach described herein, and part of our future work will be to provide one implementation that provides the best of both of these worlds.
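As a usage illustration of the libcuckoo interface described above, a minimal sketch follows; the header path and namespace shown match the current library release and may differ between versions, and the keys and values are arbitrary examples.

#include <libcuckoo/cuckoohash_map.hh>
#include <string>
#include <iostream>

int main() {
    // Concurrent map from string keys to integer values; the table
    // resizes itself automatically as it fills.
    libcuckoo::cuckoohash_map<std::string, int> table;

    table.insert("apple", 1);          // returns false if the key already exists

    int value = 0;
    if (table.find("apple", value))    // copies the value out if the key is present
        std::cout << "apple -> " << value << "\n";

    table.erase("apple");
    std::cout << "size = " << table.size() << "\n";
    return 0;
}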

8. Conclusion

This paper describes a new high-performance, memory-efficient concurrent hash table based on cuckoo hashing. We demonstrate that careful algorithm and data structure engineering is a necessary first step to achieving increased performance. Our re-design minimizes the size of the hash table’s critical sections to allow for significantly increased parallelism. These improvements, in turn, allow for two very different concurrency control mechanisms, fine-grained locking and hardware transactional memory. On a 16-core machine, with write-heavy workloads, our system outperforms existing concurrent hash tables by up to 2.5x while using less than half of the memory for small key-value objects.

Acknowledgments

The authors are grateful to Manu Goyal for his critical contributions to the release of the cuckoo hashing C++ library, as well as to Bin Fan, Dong Zhou, Hyeontaek Lim, our shepherd Frank Dabek, and the anonymous reviewers for their insights and helpful comments. This work was supported by National Science Foundation Awards CSR-0953197 (CAREER) and CCF-0964474, and by Intel via the Intel Science and Technology Center for Cloud Computing (ISTC-CC).

References

[1] Intel® 64 and IA-32 Architectures Software Developer’s Manual. Number 253665-047US. Intel Corporation, June 2013.
[2] Intel Threading Building Blocks. https://www.threadingbuildingblocks.org/.
[3] S. Chaudhry, R. Cypher, M. Ekman, M. Karlsson, A. Landin, S. Yip, H. Zeffer, and M. Tremblay. Rock: A High-Performance Sparc CMT Processor. IEEE Micro, 29(2):6–16, Mar. 2009.
[4] D. Christie, J.-W. Chung, S. Diestelhorst, M. Hohmuth, M. Pohlack, C. Fetzer, M. Nowack, T. Riegel, P. Felber, P. Marlier, and E. Rivière. Evaluation of AMD’s advanced synchronization facility within a complete transactional memory stack. In Proc. 5th EuroSys, pages 27–40, 2010.
[5] J. Chung, L. Yen, S. Diestelhorst, M. Pohlack, M. Hohmuth, D. Christie, and D. Grossman. ASF: AMD64 Extension for Lock-Free Data Structures and Transactional Memory. In Proc. 43rd MICRO, pages 39–50, 2010.
[6] D. Dice, Y. Lev, M. Moir, and D. Nussbaum. Early Experience with a Commercial Hardware Transactional Memory Implementation. In Proc. 14th ASPLOS, pages 157–168, 2009.
[7] U. Erlingsson, M. Manasse, and F. McSherry. A Cool and Practical Alternative to Traditional Hash Tables. In Proc. 7th Workshop on Distributed Data and Structures (WDAS’06), Santa Clara, CA, Jan. 2006.
[8] B. Fan, D. G. Andersen, and M. Kaminsky. MemC3: Compact and Concurrent Memcache with Dumber Caching and Smarter Hashing. In Proc. 10th USENIX NSDI, Lombard, IL, Apr. 2013.
[9] Google SparseHash. https://code.google.com/p/sparsehash/.
[10] M. Herlihy and J. E. B. Moss. Transactional Memory: Architectural Support for Lock-free Data Structures. In Proc. 20th ISCA, pages 289–300, 1993.
[11] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.
[12] Intel Performance Counter Monitor. www.intel.com/software/pcm.
[13] C. Jacobi, T. Slegel, and D. Greiner. Transactional Memory Architecture and Implementation for IBM System Z. In Proc. 45th MICRO, pages 25–36, 2012.
[14] H. T. Kung and J. T. Robinson. On Optimistic Methods for Concurrency Control. ACM Trans. Database Syst., 6(2):213–226, June 1981.
[15] libcuckoo. https://github.com/efficient/libcuckoo.
[16] Y. Mao, E. Kohler, and R. T. Morris. Cache craftiness for fast multicore key-value storage. In Proc. 7th EuroSys, pages 183–196, 2012.
[17] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-Copy Update. In Ottawa Linux Symposium, pages 338–367, 2001.
[18] R. Pagh and F. F. Rodler. Cuckoo Hashing. Journal of Algorithms, 51(2):122–144, May 2004.
[19] J. Triplett, P. E. McKenney, and J. Walpole. Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming. In Proc. USENIX ATC, pages 11–11, 2011.
[20] TSX lock elision for glibc. https://github.com/andikleen/glibc.
[21] A. Wang, M. Gaudet, P. Wu, J. N. Amaral, M. Ohmacht, C. Barton, R. Silvera, and M. Michael. Evaluation of Blue Gene/Q Hardware Support for Transactional Memories. In Proc. 21st PACT, pages 127–136, 2012.
[22] R. M. Yoo, C. J. Hughes, K. Lai, and R. Rajwar. Performance Evaluation of Intel Transactional Synchronization Extensions for High-Performance Computing. In Proc. SC, 2013.
A. Optimized TSX lock elision

Intel TSX provides two interfaces for transactional memory. The first is Hardware Lock Elision (HLE), a legacy-compatible instruction set extension that allows easy conversion of lock-based programs to transactional programs. The second mode is Restricted Transactional Memory (RTM), a new instruction set interface with a more complete transactional memory implementation. It provides three explicit instructions—XBEGIN, XEND, and XABORT—for programmers to start, commit, and abort a transactional execution, respectively. RTM is not backwards compatible, but it allows much finer control of the transactions than HLE. We focus on the use of RTM since it is more powerful and flexible than HLE and can serve as an upper bound of the performance improvements one may realize through TSX.

The released TSX RTM lock elision implementation for glibc [20] can be improved by specializing it for our hash tables. As a generic implementation, it is designed to work well for any mix of transactions, including the case of a mix of short transactions that must potentially coexist with long-running ones. In contrast, in the hash table workloads, all transactions are short. We further observed that the generic version misuses the EAX abort status code for RTM and takes the fallback lock too frequently. This causes performance to
suffer because whenever a fallback lock is taken by one core, all the other cores have to abort their concurrent transactions.

We implemented our own TSX elision wrapper around existing lock functions. It is optimized for short transactions and elides the lock more aggressively. Figure 11 shows the implementation of our RTM elision wrapper, a modified version of the released glibc one [20]. It is a small library separated from glibc pthread, and thus does not require building a new glibc library. Its fallback lock can be of any type, including the custom spinlocks we use for cuckoo hashing.

void elided_lock_wrapper(lock) {
    xbegin_retry = 0; abort_retry = 0;
    while (xbegin_retry < _MAX_XBEGIN_RETRY) {
        // Start transaction
        if ((status = _xbegin()) == _XBEGIN_STARTED) {
            // Check lock and put it into the read-set
            if (lock is free)
                return;                    // Execute in transaction
            // Abort transaction as lock is busy
            _xabort(_ABORT_LOCK_BUSY);
        }
        // Transaction may not succeed on a retry
        if (!(status & _ABORT_RETRY)) {
            // There is no chance for a retry
            if (abort_retry >= _MAX_ABORT_RETRY)
                break;
            abort_retry++;
        }
        xbegin_retry++;
    }
    take fallback lock;
}

void elided_unlock_wrapper(lock) {
    if (lock is free)
        _xend();                           // Commit transaction
    else
        unlock lock;
}

Figure 11: Optimized TSX lock elision

Implementation details. _xbegin(), _xabort(), and _xend() calls are wrappers around the special instructions that begin, abort, and commit the transaction. _xbegin() returns _XBEGIN_STARTED if the transaction begins successfully. _ABORT_RETRY is an EAX abort status code which indicates the transaction may succeed on a retry. We found that even if _ABORT_RETRY is not set in the EAX register, the transaction may still succeed on a retry. Whenever _ABORT_RETRY is not set, however, the glibc TSX lock elision aborts the transaction and takes the fallback lock immediately, forcing all other concurrent transactions to abort. Instead, we always retry several times before taking the fallback lock (using more retries if _ABORT_RETRY is set).
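For concreteness, the following is a compilable rendering of Figure 11 using the RTM intrinsics from <immintrin.h> (compile with -mrtm). The retry budgets, the explicit abort code, and the simple spinlock used as the fallback lock are illustrative choices rather than the exact values in our implementation.

#include <immintrin.h>   // _xbegin, _xend, _xabort, _XBEGIN_STARTED, _XABORT_RETRY
#include <atomic>

// Illustrative constants, not the exact values used in our implementation.
#define MAX_XBEGIN_RETRY 8
#define MAX_ABORT_RETRY  4
#define ABORT_LOCK_BUSY  0xff

struct Spinlock {
    std::atomic<int> locked{0};
    void lock()   { while (locked.exchange(1, std::memory_order_acquire)) { /* spin */ } }
    void unlock() { locked.store(0, std::memory_order_release); }
    bool is_locked() const { return locked.load(std::memory_order_relaxed) != 0; }
};

void elided_lock_wrapper(Spinlock& lk) {
    int xbegin_retry = 0, abort_retry = 0;
    while (xbegin_retry < MAX_XBEGIN_RETRY) {
        unsigned status = _xbegin();          // start the transaction
        if (status == _XBEGIN_STARTED) {
            if (!lk.is_locked())              // lock word enters the read-set
                return;                       // run the critical section transactionally
            _xabort(ABORT_LOCK_BUSY);         // lock is busy: abort explicitly
        }
        // Aborted: status holds the EAX abort code.
        if (!(status & _XABORT_RETRY)) {      // hardware suggests not retrying,
            if (abort_retry >= MAX_ABORT_RETRY)
                break;                        // but we still retry a few times
            abort_retry++;
        }
        xbegin_retry++;
    }
    lk.lock();                                // fall back to the real lock
}

void elided_unlock_wrapper(Spinlock& lk) {
    if (!lk.is_locked())
        _xend();                              // commit the elided transaction
    else
        lk.unlock();
}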

B. Cuckoo path overlap probability

Upper bound for the probability of a cuckoo path being invalid. Let N denote the number of entries in the hash table, L (≪ N) denote the maximum length of a cuckoo path, and T denote the number of concurrent writers. A cuckoo path has the highest probability of overlapping with others when all T paths are at their maximum length L. For a cuckoo path with length L, the probability that it does not overlap with another cuckoo path with length L is

P = \binom{N-L}{L} \Big/ \binom{N}{L} = \prod_{i=0}^{L-1} \frac{N-L-i}{N-i}.   (3)

The probability that the cuckoo path overlaps with at least one of the other (T−1) paths is

P_{\mathrm{invalid\_max}} = 1 - P^{T-1} = 1 - \left( \prod_{i=0}^{L-1} \frac{N-L-i}{N-i} \right)^{T-1}.   (4)

Because i ≪ N, we can assume \frac{N-L-i}{N-i} \approx \frac{N-L}{N}, so that

P_{\mathrm{invalid\_max}} \approx 1 - \left( \frac{N-L}{N} \right)^{L(T-1)}.   (5)

C. BFS cuckoo path length

Maximum length of cuckoo paths by breadth-first search. Let B denote the set-associativity of the hash table, M denote the maximum number of slots to be examined when looking for an empty slot before declaring the table full, and L_{BFS} denote the maximum length of the cuckoo path. The search process expands into two BFS trees rooted at the two alternative buckets of the key to be inserted. Each tree has at most M/2 slots. Therefore,

B + B^2 + B^3 + \cdots + B^{L_{BFS}} \ge M/2,   (6)

which gives us

L_{BFS} = \left\lceil \log_B \left( M/2 - M/(2B) + 1 \right) \right\rceil.   (7)
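As a worked illustration of (5) and (7), with representative values chosen for exposition (not the parameters used in our experiments) of N = 2^{25} entries, L = 5, T = 8 concurrent writers, B = 8, and M = 512:

P_{\mathrm{invalid\_max}} \approx 1 - \left(1 - \frac{5}{2^{25}}\right)^{5 \cdot 7} \approx 35 \cdot \frac{5}{2^{25}} \approx 5.2 \times 10^{-6},

L_{BFS} = \left\lceil \log_8 \left( 512/2 - 512/16 + 1 \right) \right\rceil = \lceil \log_8 225 \rceil = 3.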