Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming

Alexander Matveev (MIT), Nir Shavit (MIT and Tel-Aviv University), Pascal Felber (University of Neuchâtel), Patrick Marlier (University of Neuchâtel)

Abstract

This paper introduces read-log-update (RLU), a novel extension of the popular read-copy-update (RCU) synchronization mechanism that supports scalability of concurrent code by allowing unsynchronized sequences of reads to execute concurrently with updates. RLU overcomes the major limitations of RCU by allowing, for the first time, concurrency of reads with multiple writers, and by providing automation that eliminates most of the programming difficulty associated with RCU programming. At the core of the RLU design is a logging and coordination mechanism inspired by software transactional memory algorithms. In a collection of micro-benchmarks in both the kernel and user space, we show that RLU both simplifies the code and matches or improves on the performance of RCU. As an example of its power, we show how it readily scales the performance of a real-world application, Kyoto Cabinet, a truly difficult concurrent programming feat to attempt in general, and in particular with classic RCU.

1. Introduction

Context. An important paradigm in concurrent data structure scalability is to support read-only traversals: sequences of reads that execute without any synchronization (and hence require no memory fences and generate no contention [20]). The gain from such unsynchronized traversals is significant because they account for a large fraction of operations in many data structures and applications [20, 34].

The popular read-copy-update (RCU) mechanism of McKenney and Slingwine [28] provides scalability by enabling this paradigm. It allows read-only traversals to proceed concurrently with updates by creating a copy of the data structure being modified. Readers access the unmodified data structure while updaters modify the copy. The key to RCU is that once modifications are complete, they are installed using a single pointer modification in a way that does not interfere with ongoing readers. To avoid synchronization, updaters wait until all pre-existing readers have finished their operations, and only then install the modified copies. This barrier-based mechanism allows for simple epoch-based reclamation [17] of the old copies, and the mechanism as a whole eliminates many of the atomic read-modify-write instructions, memory barriers, and cache misses that are so expensive on modern multicore systems.

RCU is supported in both user space and the kernel. It has been widely used in operating system programming (over 6,500 API calls in the Linux kernel as of 2013 [30]) and in concurrent applications (as reported at http://urcu.so/, user-space RCU is notably used in a DNS server, a networking toolkit, and a distributed storage system).

Motivation. Despite its many benefits, RCU programming is not a panacea and its performance has some significant limitations. First, it is quite complex to use. It requires the programmer to implement dedicated code in order to duplicate every object it modifies, and to ensure that the pointers inside the copies are properly set to point to the correct locations, before finally connecting this set of copies using a single atomic pointer assignment (or using lock-protected critical sections). This complexity is evidenced by the fact that there are not many RCU-enhanced data structures beyond simple linked lists, and the few other RCU-ed data structures are innovative algorithms published in research papers [2, 5]. A classic example of this difficulty is the doubly linked list implementation in the Linux kernel, in which threads are only allowed to traverse the list in the forward direction (the backward direction is unsafe and may return an inconsistent sequence of items) because of the limitations of the single pointer manipulation semantics [5, 26].

Second, RCU is optimized for a low number of writers. The RCU kernel implementation [25, 27] reduces contention and latency, but does not provide concurrency among writers.

Third, threads using RCU experience delays when waiting for readers to complete their operations. This makes RCU potentially unfit for time-critical applications. Recent work by Arbel and Morrison [3] suggests how to reduce these delays by having the programmer provide RCU with predicates that define the access patterns to data structures.

Our objective is to propose an alternative mechanism that will be simpler for the programmer while matching or improving on the scalability obtainable using RCU.

Contributions. In this paper, we propose read-log-update (RLU), a novel extension of the RCU framework that supports read-only traversals concurrently with multiple updates, and does so in a semi-automated way. Unlike with RCU, adding support for concurrency to common sequential data structures is straightforward with RLU because it removes from the programmer the burden of handcrafting the concurrent copy management using only single pointer manipulations. RLU can be implemented in such a way that it remains API-compatible with RCU. Therefore, it can be used as a drop-in replacement in the large collection of RCU-based legacy code.

In a nutshell, RLU works as follows. We keep the overall RCU barrier mechanism, with updaters waiting until all pre-existing readers have finished their operations, but replace the hand-crafted copy creation and installation with a clock-based logging mechanism inspired by the ones used in software transactional memory systems [1, 8, 33]. The biggest limitation of RCU is that it cannot support multiple updates to the data structure because it critically relies on a single atomic pointer swing. To overcome this limitation, in RLU we maintain an object-level write-log per thread, and record all modifications to objects in this log. The objects being modified are automatically copied into the write-log and manipulated there, so that the original structures remain untouched. These locations are locked so that at most one thread at a time will modify them.

The RLU system maintains a global clock [8, 33] that is read at the start of every operation and used to decide which version of the data to use, the old one or the logged one. Each writer thread, after completing the modifications to the logged object copies, commits them by incrementing the clock to a new value, and waits for all readers that started before the clock increment (i.e., that have a clock value lower than it) to complete. This modification of the global clock using a single operation has the effect of making all the logged changes take effect at the same time. In other words, if RCU uses a single read-modify-write to switch a pointer to one copy, in RLU the single read-modify-write of the clock switches multiple object copies at once.

The simple barrier that we just described, where a writer increments the counter and waits for all readers with smaller clock values, can be further improved by applying a deferral mechanism. Instead of every writer incrementing the clock, we have writers increment it only if they have an actual conflict with another writer or with their own prior writes. These conflicts are detectable via the locks placed on objects. Thus, writers typically complete without committing their modifications, and only when a conflict is detected is the clock incremented. This deferred increment applies all the deferred modifications at once, with a much lower overhead, since the number of accesses to the shared clock is lowered and, more importantly, the number of times a writer must wait for readers to complete is significantly reduced in comparison to RCU. In fact, this deferral achieves in an automated way almost the same improvement in delays as the RCU predicate approach of Arbel and Morrison [3].

We provide two versions of RLU that differ in the way writers interact. One version allows complete writer concurrency and uses object locks which writers attempt to acquire before writing. The other has the system serialize all writers and thus guarantee that they will succeed. The writers' operations in the serialized version can further be parallelized using hardware transactions with a software fallback [24, 36]. Our RLU implementation is available as open source from https://github.com/rlu-sync.

We conducted an in-depth performance study and comparison of RLU against kernel and user-space RCU implementations. Our findings show that RLU performs as well as RCU on simple structures like linked lists, and outperforms it on more complex structures like search trees. We show how RLU can be readily used to implement a doubly linked list with full semantics (of atomicity), allowing threads to move forward or backward in the list, a task that to date is unattainable with RCU since the doubly linked list in the Linux kernel restricts threads to only move forward or risk viewing an inconsistent sequence of items in the list [26]. We also show how RLU can be used to increase the scalability of a real-world application, Kyoto Cabinet, to which it would be difficult to apply RCU because it requires modifying several data structures concurrently, a complex feat if one can only manipulate one pointer at a time. Replacing the use of reader-writer locks with RLU improves the performance by a factor of 3 with 16 hardware threads.

2. Background and Related Work

RLU owes much of its inspiration to the read-copy-update (RCU) algorithm, introduced by McKenney and Slingwine [28] as a solution to lock contention and synchronization overheads in read-intensive applications. Harris et al. [16] and Hart et al. [17] used RCU ideas for epoch-based explicit memory reclamation schemes. A formal semantics of RCU appears in [13, 15].

RCU minimizes synchronization overhead for sequences of reads traversing a data structure, at the price of making the code of writers more complex and slower. As noted in the introduction, the core idea of RCU is to duplicate an object each time it is modified and perform the modifications on the private copy. In this way, writers do not interfere with readers until they atomically update shared data structures, typically by "connecting" their private copies.

To ensure consistency, such updates must be done at a safe point when no reader can potentially hold a reference to the old data.

RCU readers delimit their read operations by calls to rcu_read_lock() and rcu_read_unlock(), which essentially define read-side critical sections. RCU-protected data structures are accessed in critical sections using rcu_dereference() and rcu_assign_pointer(), which ensure dependency-ordered loads and stores by adding memory barriers as necessary. (Note that on many architectures rcu_dereference() calls are replaced by simple loads and hence do not add any overhead.) When they are not inside a critical section, readers are said to be in a quiescent state. A period of time during which every thread goes through at least one quiescent state is called a grace period. The key principle of RCU is that, if an updater removes an element from an RCU-protected shared data structure and waits for a grace period, there can be no reader still accessing that element. It is therefore safe to dispose of it. Waiting for a grace period can be achieved by calling synchronize_rcu(). The basic principle of read-side critical sections and grace periods is illustrated in Figure 1, with three reader threads (T1, T2, T3) and one writer thread (T4). As the grace period starts while T1 and T3 are in read-side critical sections, T4 needs to wait until both other threads exit their critical sections. In contrast, T2 starts a critical section after the call to synchronize_rcu() and hence cannot hold a reference to old data. Therefore the grace period can end before that critical section completes (and similarly for the second critical section of T3, started during the grace period).

Figure 1. Principle of RCU: read-side critical sections and grace periods.

An emblematic use case of RCU is for reclaiming memory in dynamic data structures. To illustrate how RCU helps in this case, consider a simple linked list with operations to add, remove, and search for integer values. The (simplified) code of the search() and remove() methods is shown in Listing 1. It is important to note that synchronization between writers is not managed by RCU, but must be implemented via other mechanisms such as locks. Another interesting observation is how similar the RCU code is to a reader-writer lock in this simple example: read-side critical sections correspond to shared locking while writers acquire the lock in exclusive mode.

int search(int v) {
    node_t *n;
    rcu_read_lock();
    n = rcu_dereference(head->next);
    while (n != NULL && n->value != v)
        n = rcu_dereference(n->next);
    rcu_read_unlock();
    return n != NULL;
}

int remove(int v) {
    node_t *n, *p, *s;
    spin_lock(&writer_lock);
    for (p = head, n = rcu_dereference(p->next);
         n != NULL && n->value != v;
         p = n, n = rcu_dereference(n->next));
    if (n != NULL) {
        s = rcu_dereference(n->next);
        rcu_assign_pointer(p->next, s);
        spin_unlock(&writer_lock);
        synchronize_rcu();
        kfree(n);
        return 1;
    }
    spin_unlock(&writer_lock);
    return 0;
}

Listing 1. RCU-based linked list.

A sample run is illustrated in Figure 2, with thread T1 searching for element c while thread T2 concurrently removes element b. T1 enters a read-side critical section while T2 locates the element to remove. The corresponding node is unlinked from the list by T2 while T1 traverses the list. T2 cannot yet free the removed node, as it may still be accessed by other readers; hence it calls synchronize_rcu(). T1 continues its traversal of the list while T2 still waits for the end of the grace period. Finally T1 exits the critical section and the grace period completes, which allows T2 to free the removed node.

Figure 2. Concurrent search and removal with the RCU-based linked list.

RCU has been supported in the Linux kernel since 2002 [29] and has been heavily used within the operating system. The kernel implementation is very efficient because, by running in kernel space and being tightly coupled to the scheduler, it can use a high-performance quiescent-state-based reclamation strategy wherein each thread periodically announces that it is in a quiescent state. The implementation also provides a zero-overhead implementation of the rcu_read_lock() and rcu_read_unlock() operations. As a drawback, synchronization delays for writers can become unnecessarily long as they are tied to scheduling periods. Furthermore, the Linux kernel's RCU implementation is not suitable for general-purpose user-space applications.

User-space RCU [6] is another popular library, implementing the RCU algorithms entirely in user space. It provides several variants of synchronization algorithms (using quiescent-state-based reclamation, signals, or memory barriers) offering different trade-offs in terms of read-/write-side overheads and usage constraints. User-space RCU is widely applicable to general-purpose code but in general does not perform as well as the kernel implementation.

Our implementation of RLU is based on the use of a global clock mechanism inspired by the one used in some software transactional memory systems [8, 33], which notably use lock-based designs to avoid some of the costs inherent to lock-free or obstruction-free algorithms [11, 22].

The global clock is a memory location that is updated by threads when they wish to establish a synchronization point. All threads use the clock as a reference point, time-stamping their operations with this clock's value. The observation in [1, 8, 33] is that despite concurrent clock updates and multiple threads reading the global clock while it is being updated, the overall contention and bottlenecking it introduces is typically minimal.

3. The RLU Algorithm

In this section we describe the design and implementation of RLU.

3.1 Basic Idea

For simplicity of presentation, we first assume in this section that write operations execute serially, and later show various programming patterns that allow us to introduce concurrency among writers.

RLU provides support for multiple object updates in a single operation by combining the quiescence mechanism of RCU with a global clock and per-thread object-level logs. All operations read the global clock when they start, and use this clock to dereference shared objects. In turn, a write operation logs each object it modifies in a per-thread write-log: to modify an object, it first copies the object into the write-log, and then manipulates the object copy in this log. In this way, write modifications are hidden from concurrent reads and, to avoid conflicts with concurrent writes, each object is also locked before its first modification (and duplication). Then, to commit the new object copies, a write operation increments the global clock, which effectively splits operations into two sets: (1) old operations that started before the clock increment, and (2) new operations that start after the clock increment. The first set of operations will read the old object copies while the second set will read the new object copies of this writer. Therefore, in the next step, the writer waits for old operations to finish by executing the RCU-style quiescence loop, while new operations "steal" new object copies of this writer by accessing the per-thread write-log of this writer. After the completion of old operations, no other operation may access the old object memory locations, so the writer can safely write back the new objects from the write-log into the memory, overwriting the old objects. It can then release the locks.

Figure 3 depicts how RLU provides multiple object updates in one operation. In the figure, execution flows from top to bottom. Thread T2 updates objects O2 and O3, whereas threads T1 and T3 only perform reads. Initially, the global clock is 22, and T2 has an empty write-log and a local write-clock variable that holds ∞ (the maximum 64-bit integer value). These per-thread write-clocks are used by the stealing mechanism to ensure correctness (details follow).

In the top figure, threads T1 and T2 start by reading the global clock and copying its value to their local clocks, and then proceed to reading objects. In this case, none of the objects is locked, so the reads are performed directly from the memory.

In the middle figure, T2 locks and logs O2 and O3 before updating these objects. As a result, O2 and O3 are copied into the write-log of T2, and all modifications are re-routed into the write-log. Meanwhile, T1 reads O2 and detects that this object is locked by T2. T1 must thus determine whether it needs to steal the new object copy. To that end, T1 compares its local clock with the write-clock of T2, and only when the local clock is greater than or equal to the write-clock of T2 does T1 steal the new copy. This is not the case in the depicted scenario, hence T1 reads the object directly from the memory.

In the bottom figure, T2 starts the process of committing the new objects. It first computes the next clock value, which is 23, and then installs this new value into the write-clock and the global clock (notice that the order here is critical). At this point, as we explained before, operations are split into "old" and "new" (before and after the clock increment), so T2 waits for the old operations to finish. In this example, T2 waits for T1. Meanwhile, T3 reads O2 and classifies this operation as new by comparing its local clock with the write-clock of T2; it therefore "steals" the new copy of O2 from the write-log of T2. In this way, new operations read only new object copies so that, after T2's wait completes, no one can read the old copies and it is safe to write back the new copies of O2 and O3 to memory.

Figure 3. Basic Principle of RLU.
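To make the stealing rule from this walkthrough concrete, the following is a minimal, self-contained C sketch (an illustration only, not code from the RLU implementation) of the clock comparison a reader performs when it finds an object locked; the 22/23 values mirror Figure 3.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A reader steals the writer's logged copy only if the reader started at or
 * after the writer's commit point, i.e., its local clock already reflects the
 * incremented global clock. */
static bool steals_new_copy(uint64_t reader_local_clock, uint64_t writer_write_clock) {
    return reader_local_clock >= writer_write_clock;
}

int main(void) {
    printf("%d\n", steals_new_copy(22, 23)); /* 0: T1 started before the commit, reads memory */
    printf("%d\n", steals_new_copy(23, 23)); /* 1: T3 started after the commit, steals from T2's write-log */
    return 0;
}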

3.2 Synchronizing Write Operations

The basic idea of RLU described above provides read-write synchronization for object accesses. It does not, however, ensure write-write synchronization, which must be managed by the programmer if needed (as with RCU). A simple way to synchronize writers is to execute them serially, without any concurrency. In this case, the benefit is simplicity of code and the concurrency that does exist between read-only and write operations. On the other hand, the drawback is a lack of scalability.

Another approach is to use fine-grained locks. In RLU, each object that a writer modifies is logged and locked by the RLU mechanism. Programmers can therefore use this locking process to coordinate write operations. For example, in a linked-list implementation, instead of grabbing a global lock for each writer, a programmer can use RLU to traverse the linked list, and then use the RLU mechanism to lock the target node and its predecessor. If the locking fails, the operation restarts; otherwise the programmer can proceed to modify the target node (e.g., insertion or removal) and release the locks.

3.3 Fine-grained Locking Using RLU

Programmers can use RLU locks as a fine-grained locking mechanism, in the same way they use standard locks. However, RLU locks are much easier to use due to the fact that all object reads and writes execute inside "RLU protected" sections that are subject to the RCU-based quiescence mechanism of each writer. This means that when some thread reads or writes (locks and logs) objects, no other concurrent thread may overwrite any of these objects. With RLU, after the object lock has been acquired, no other action is necessary whereas, with standard locks, one needs to execute customized post-lock verifications to ensure that the state of the object is still the same as it was before locking.

3.4 RLU Metadata

Global. RLU maintains a global clock and a global array of threads. The global array is used by the quiescence mechanism to identify the currently active threads.

Thread. RLU maintains two write-logs, a run counter, and a local clock and write clock for each thread. The write-logs hold new object copies, the run counter indicates when the thread is active, and the local clock and write clock control the write-log stealing mechanism of threads. In addition, each object copy in the write-log has a header that includes: (1) a thread identifier, (2) a pointer to the actual object, (3) the object size, and (4) a special pointer value that indicates this is a copy (a constant).

Object. RLU attaches a header to each object, which includes a single pointer that points to the copy of this object in a write-log. If this pointer is NULL, then there is no copy and the object is unlocked.

In our implementation, we attach a header to each object by hooking the malloc() call with a call to rlu_alloc() that allocates each object with the attached header. In addition, we also hook the free() call with rlu_free() to ensure proper deallocation of objects that include headers. Note that any allocator library can be used with RLU.

We use simple macros to access and modify the metadata headers of an object. First, we use get_copy(obj) to get ptr-copy: the value of the pointer (to the copy) that resides in the header of obj. Then, we use this ptr-copy as a parameter in macros: (1) is_unlocked(ptr-copy), which checks if the object is free, (2) is_copy(ptr-copy), which checks if this object is a copy in a write-log, (3) get_actual(obj), which returns a pointer to the actual object in memory in case this is a copy in a write-log, and (4) get_thread_id(ptr-copy), which returns the identifier of the thread that currently locked this object.

We use 64-bit clocks and counters to avoid overflows, and initialize all RLU metadata to zero. The only exception is the write clocks of threads, which are initialized to ∞ (the maximum 64-bit value).
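The C sketch below illustrates one possible layout for the metadata just described; it is an illustration of the text, not the actual structures of the open-source implementation, and the type and field names are hypothetical.

#include <stddef.h>
#include <stdint.h>

#define RLU_MAX_THREADS 32
#define PTR_IS_COPY ((void *)0x1)       /* constant value marking a write-log copy */

/* Header attached in front of every RLU-managed object (installed by the
 * malloc() hook): a single pointer to the object's copy in some write-log,
 * or NULL when the object is unlocked. */
typedef struct rlu_obj_header {
    void *ptr_copy;
} rlu_obj_header_t;

/* Header preceding each object copy inside a per-thread write-log. */
typedef struct rlu_ws_obj_header {
    uint64_t thr_id;                    /* thread that locked and copied the object */
    void *actual_obj;                   /* pointer back to the actual object in memory */
    size_t obj_size;                    /* size of the copied object */
    void *ptr_is_copy;                  /* always PTR_IS_COPY, identifies this as a copy */
} rlu_ws_obj_header_t;

/* Per-thread metadata: two write-logs, a run counter, and the two clocks. */
typedef struct rlu_thread {
    uint64_t run_cnt;                   /* odd while inside an RLU section */
    uint64_t local_clock;               /* global clock sampled at section start */
    uint64_t write_clock;               /* UINT64_MAX except during commit */
    void *write_log;
    void *write_log_quiesce;
} rlu_thread_t;

/* Global metadata: the clock and the array of registered threads. */
extern uint64_t rlu_global_clock;
extern rlu_thread_t *rlu_threads[RLU_MAX_THREADS];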

3.5 RLU Pseudo-Code

Algorithm 1 presents the pseudo-code for the main functions of RLU.

Algorithm 1. RLU pseudo-code: main functions.

 1: function RLU_READER_LOCK(ctx)
 2:   ctx.is-writer ← false
 3:   ctx.run-cnt ← ctx.run-cnt + 1              ▷ Set active
 4:   memory fence
 5:   ctx.local-clock ← global-clock             ▷ Record global clock

 6: function RLU_READER_UNLOCK(ctx)
 7:   ctx.run-cnt ← ctx.run-cnt + 1              ▷ Set inactive
 8:   if ctx.is-writer then
 9:     RLU_COMMIT_WRITE_LOG(ctx)                ▷ Write updates

10: function RLU_DEREFERENCE(ctx, obj)
11:   ptr-copy ← GET_COPY(obj)                   ▷ Get copy pointer
12:   if IS_UNLOCKED(ptr-copy) then              ▷ Is free?
13:     return obj                               ▷ Yes ⇒ return object
14:   if IS_COPY(ptr-copy) then                  ▷ Already a copy?
15:     return obj                               ▷ Yes ⇒ return object
16:   thr-id ← GET_THREAD_ID(ptr-copy)
17:   if thr-id = ctx.thr-id then                ▷ Locked by us?
18:     return ptr-copy                          ▷ Yes ⇒ return copy
19:   other-ctx ← GET_CTX(thr-id)                ▷ No ⇒ check for steal
20:   if other-ctx.write-clock ≤ ctx.local-clock then
21:     return ptr-copy                          ▷ Stealing ⇒ return copy
22:   return obj                                 ▷ No stealing ⇒ return object

23: function RLU_TRY_LOCK(ctx, obj)
24:   ctx.is-writer ← true                       ▷ Write detected
25:   obj ← GET_ACTUAL(obj)                      ▷ Read actual object
26:   ptr-copy ← GET_COPY(obj)                   ▷ Get pointer to copy
27:   if ¬IS_UNLOCKED(ptr-copy) then
28:     thr-id ← GET_THREAD_ID(ptr-copy)
29:     if thr-id = ctx.thr-id then              ▷ Locked by us?
30:       return ptr-copy                        ▷ Yes ⇒ return copy
31:     RLU_ABORT(ctx)                           ▷ No ⇒ retry RLU section
32:   obj-header.thr-id ← ctx.thr-id             ▷ Prepare write-log
33:   obj-header.obj ← obj
34:   obj-header.obj-size ← SIZEOF(obj)
35:   ptr-copy ← LOG_APPEND(ctx.write-log, obj-header)
36:   if ¬TRY_LOCK(obj, ptr-copy) then           ▷ Try to install copy
37:     RLU_ABORT(ctx)                           ▷ Failed ⇒ retry RLU section
38:   LOG_APPEND(ctx.write-log, obj)             ▷ Locked ⇒ copy object
39:   return ptr-copy

40: function RLU_CMP_OBJS(ctx, obj1, obj2)
41:   return GET_ACTUAL(obj1) = GET_ACTUAL(obj2)

42: function RLU_ASSIGN_PTR(ctx, handle, obj)
43:   *handle ← GET_ACTUAL(obj)

44: function RLU_COMMIT_WRITE_LOG(ctx)
45:   ctx.write-clock ← global-clock + 1         ▷ Enable stealing
46:   FETCH_AND_ADD(global-clock, 1)             ▷ Advance clock
47:   RLU_SYNCHRONIZE(ctx)                       ▷ Drain readers
48:   RLU_WRITEBACK_WRITE_LOG(ctx)               ▷ Safe to write back
49:   RLU_UNLOCK_WRITE_LOG(ctx)
50:   ctx.write-clock ← ∞                        ▷ Disable stealing
51:   RLU_SWAP_WRITE_LOGS(ctx)                   ▷ Quiesce write-log

52: function RLU_SYNCHRONIZE(ctx)
53:   for thr-id ∈ active-threads do
54:     other ← GET_CTX(thr-id)
55:     ctx.sync-cnts[thr-id] ← other.run-cnt
56:   for thr-id ∈ active-threads do
57:     while true do                            ▷ Spin loop on thread
58:       if ctx.sync-cnts[thr-id] is even then
59:         break                                ▷ Not active
60:       other ← GET_CTX(thr-id)
61:       if ctx.sync-cnts[thr-id] ≠ other.run-cnt then
62:         break                                ▷ Progressed
63:       if ctx.write-clock ≤ other.local-clock then
64:         break                                ▷ Started after me

65: function RLU_SWAP_WRITE_LOGS(ctx)
66:   ptr-write-log ← ctx.write-log-quiesce      ▷ Swap pointers
67:   ctx.write-log-quiesce ← ctx.write-log
68:   ctx.write-log ← ptr-write-log

69: function RLU_ABORT(ctx, obj)
70:   ctx.run-cnt ← ctx.run-cnt + 1              ▷ Set inactive
71:   if ctx.is-writer then
72:     RLU_UNLOCK_WRITE_LOG(ctx)                ▷ Unlock
73:   RETRY                                      ▷ Specific retry code

An RLU protected section starts by calling rlu_reader_lock(), which registers the thread: it increments the run counter and initializes the local clock to the global clock. Then, during execution of the section, the thread dereferences each object by calling the rlu_dereference() function, which first checks whether the object is unlocked or is itself a copy and, in that case, returns the object. Otherwise, the object is locked by some other thread, so the function checks whether it needs to steal the new copy from the other thread's write-log. For this purpose, the current thread checks if its local clock is greater than or equal to the write clock of the other thread and, if so, it steals the new copy. Notice that the write-clock of a thread is initially ∞, so stealing from a thread is only possible when it updates the write-clock during the commit.

Next, the algorithm locks each object to be modified by calling rlu_try_lock(). First, this function sets a flag to indicate that this thread is a writer. Then, it checks if the object is already locked by some other thread, in which case it fails and retries. Otherwise, it starts the locking process, which first prepares a write-log header for the object and then installs a pointer to the object copy using a compare-and-swap (CAS) instruction. If the locking succeeds, the thread copies the object to the write-log and returns a pointer to the newly created copy. Note that the code also uses the rlu_cmp_objs() and rlu_assign_ptr() functions, which hide the internal implementation of object duplication and manipulation.
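As a usage illustration only (a sketch, not code from the paper or the library), the functions above compose as follows for a single-object update; the counter_t type, its ctr field, and the shared pointer are hypothetical, while the rlu_* calls follow the signatures used later in Listing 2.

/* Hypothetical RLU-managed object type. */
typedef struct counter {
    long ctr;
} counter_t;

/* Increment a field of one shared, RLU-managed object. */
void counter_increment(rlu_thread_data_t *self, counter_t **shared) {
restart:
    rlu_reader_lock();                                       /* start section, sample the global clock */
    counter_t *c = (counter_t *)rlu_dereference(*shared);    /* latest visible version of the object */
    if (!rlu_try_lock(self, &c)) {                           /* lock it and copy it into our write-log */
        rlu_abort(self);                                     /* conflict with another writer */
        goto restart;                                        /* retry the whole RLU section */
    }
    c->ctr++;                                                /* modify the log copy, not shared memory */
    rlu_reader_unlock();                                     /* commit: clock++, synchronize, write back */
}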

An RLU protected section completes by calling rlu_reader_unlock(), which first unregisters the thread by incrementing the run counter, and then checks if this thread is a writer. If so, it calls rlu_commit_write_log(), which increments the global clock and sets the write clock of the thread to the new clock to enable write-log stealing. As we mentioned before, the increment of the global clock is the critical point at which all new object copies of the write-log become visible at once (atomically) to all concurrent RLU protected sections that start after the increment. As a result, in the next step, the function executes rlu_synchronize() in order to wait for the completion of the RLU protected sections that started before the increment of the global clock, i.e., those that are currently active (have an odd run counter) and have a local clock smaller than the write clock of this thread. Thereafter, the function writes back the new object copies from the write-log to the actual memory, unlocks the objects, and sets the write-clock back to ∞ to disable stealing. Finally, it is important to notice that an additional quiescence call is necessary to clear the current write-log of threads that steal copies from this write-log, before the given thread can reuse this write-log once again. For this purpose, the function swaps the current write-log with a new write-log, and only after the next rlu_synchronize() call has completed is the current write-log swapped back and reused.

3.6 RLU Correctness

The key to correctness is to ensure that RLU protected sections always execute on a consistent memory view (snapshot). In other words, an RLU protected section must be resistant to possible concurrent overwrites of objects that it currently reads. For this purpose, RLU sections sample the global clock on start. RLU writers modify object copies and commit these copies by updating the global clock.

More precisely, correctness is guaranteed by a combination of three key mechanisms in Algorithm 1.

1. On commit, an RLU writer first increments the global clock (line 46), which effectively "splits" all concurrent RLU sections into old sections that observed the old global clock and new sections that will observe the new clock. The check of the clock in the RLU dereference function (line 20) ensures that new sections can only read object copies of modified objects (via stealing), while old sections continue to read the actual objects in the memory. As a result, after old sections complete, no other thread can read the actual memory of modified objects and it is safe to overwrite this memory with the new object copies. Therefore, in the next steps, the RLU writer first executes the RLU synchronize call (line 47), which waits for old sections to complete, and only then writes back the new object copies to the actual memory (line 48).

2. After the write-back of an RLU writer, the write-log cannot be reused immediately, since other RLU sections may still be reading from it (via stealing). Therefore, the RLU writer swaps the current write-log L1 with a new write-log L2. In this way, L2 becomes the active write-log for the next writer, and only after this next writer arrives at the commit and completes its RLU synchronize call does it swap back L1 with L2. This RLU synchronize call (line 46) effectively "splits" concurrent RLU sections into old sections that may be reading from the write-log L1 and new sections that cannot be reading from the write-log L1. Therefore, after this RLU synchronize call completes, L1 cannot be read by any thread, so it is safe to swap it back and reuse it. Notice that this holds since the RLU writer of L1 completes by disabling stealing from L1: it unlocks all modified objects of L1 and sets the write-clock back to ∞ (line 49).

3. Finally, to avoid write-write conflicts between writers, each RLU writer locks each object it wants to modify (line 36).

The combination of these three mechanisms ensures that an RLU protected section that starts with a global clock value g will not be able to see any concurrent overwrite that was made for a global clock value g′ > g. As a result, the RLU protected section always executes on a consistent memory view that existed at global time g.

3.7 RLU Deferring

As shown in the pseudo-code, each RLU writer must execute one RLU synchronize call during the commit process. The relative penalty of RLU synchronize calls depends on the specific workload, and our tests show that, usually, it becomes expensive when operations are short and fast. Therefore, we implement the RLU algorithm in a way that allows us to defer the RLU synchronize calls to as late a point as possible and only execute them when they are necessary. Note that a similar approach is used in flat combining [19] and OpLog [4]. However, they defer at the level of data-structure operations, while RLU defers at the level of individual object accesses.

RLU synchronize deferral works as follows. On commit, instead of incrementing the global clock and executing RLU synchronize, the RLU writer simply saves the current write-log and generates a new log for the next writer. In this way, RLU writers execute without blocking on RLU synchronize calls, while aggregating the write-logs and locks of objects being modified. The RLU synchronize call is actually only necessary when a writer tries to lock an object that is already locked. Therefore, only in this case, the writer sends a "sync request" to the conflicting thread to force it to release its locks, by making that thread increment the global clock, execute RLU synchronize, write back, and unlock.
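The deferral policy described above can be summarized by the following C-style sketch. It is a paraphrase of the text with hypothetical helper names (detach_write_log(), send_sync_request(), and so on), not the actual implementation, and the cap of 10 deferred write-sets is the limit mentioned later in Section 4.1.

#define RLU_MAX_DEFERRED 10   /* defer limit used in the evaluation (Section 4.1) */

/* Commit path with deferral: keep the objects locked, detach the write-log,
 * and postpone the clock increment and the synchronize call. */
void rlu_deferred_commit(rlu_thread_data_t *self) {
    detach_write_log(self);                       /* save the log, start a fresh one */
    if (++self->num_deferred >= RLU_MAX_DEFERRED)
        rlu_sync_and_writeback(self);             /* clock++, synchronize, write back, unlock */
}

/* Locking path: a conflict on a locked object is what actually forces a sync. */
int rlu_deferred_try_lock(rlu_thread_data_t *self, void **obj) {
    int owner = lock_owner(*obj);
    if (owner >= 0 && owner != self->thr_id) {
        send_sync_request(owner);                 /* ask the conflicting writer to flush and unlock */
        return 0;                                 /* caller aborts and retries the RLU section */
    }
    return rlu_try_lock(self, obj);               /* normal lock-and-log path */
}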

Deferring RLU synchronize calls and aggregating write-logs and locks over multiple write operations provides several advantages. First, it significantly reduces the number of RLU synchronize calls, effectively limiting them to the number of actual write-write data conflicts that occur during runtime execution. In addition, it significantly reduces the contention on the global clock, since this clock only gets updated after RLU synchronize calls, which now execute less often. Moreover, the aggregation of write-logs and the deferral of the global clock update defers the stealing process to a later time, which allows threads to read from memory without experiencing the cache misses that would otherwise occur when systematically updating data in memory.

We note that the described deferral mechanism is sensitive to scheduling constraints. For example, a lagging thread may delay other threads that wait for a sync response from this thread. In our benchmarks we have not experienced such behavior; however, it is possible to avoid the dependency on scheduling constraints by allowing a waiting thread to help the other thread: the former can simply execute the sync and write-back for the latter.

Also, we point out that these optimizations work when the code that executes outside of RLU protected sections can tolerate deferred updates. In practice, this requires defining specific sync points in the code where it is critical to see the most recent updates. In general, as we later explain, this optimization is more significant for high thread counts, such as when benchmarking the Citrus tree on an 80-way 4-socket machine (see Section 4). In all other benchmarks, which execute on a 16-way processor, using deferral provides modest improvements over the simple RLU algorithm.

3.8 RLU Implementation

We implement Algorithm 1 and provide two flavors of RLU: (1) coarse-grained and (2) fine-grained. The coarse-grained flavor has no support for RLU deferral and provides writer locks that programmers can use to serialize and coordinate writers. In this way, the coarse-grained RLU is simpler to use, since all operations take immediate effect, and they execute once and never abort. In contrast, the fine-grained flavor has no support for writer locks. Instead it uses the per-object locks of RLU to coordinate writers, and it does provide support for RLU deferral. As a result, in fine-grained RLU, writers can execute concurrently while avoiding RLU synchronize calls.

Our current RLU implementation consists of approximately 1,000 lines of C code and is available as open source. Note that it does not support the same set of features as RCU, which is a more mature library. In particular, callbacks, which allow programmers to register a function called once the grace period is over, are currently not supported. RCU also provides several implementations with different synchronization primitives and various optimizations.

This lack of customization may limit the ability to readily replace RCU by RLU in specific contexts such as in the kernel. For instance, RCU is closely coupled with the operating system scheduler to detect the end of a grace period based on context switches.

The current version of RLU can be used in the kernel but it requires special care when interacting with signals, synchronization primitives, or thread-specific features, in the same way as RCU. For example, RCU can suffer from deadlocks due to the interaction of synchronize_rcu() and RCU read-side sections with mutexes. Furthermore, one should point out that approximately one third of RCU calls in the Linux kernel are performed using the RCU list API, which is supported on top of RLU, hence enabling seamless use of RLU in the kernel.

Finally, we note that a recently proposed passive locking scheme [23] can be used to eliminate memory barriers from RLU section start calls. Our preliminary results of RLU with this scheme show that it is beneficial for benchmarks that are read-dominated and have short and fast operations. Therefore, we plan to incorporate this feature in the next version of RLU.

4. Evaluation

In this section, we first evaluate the basic RLU algorithm on a set of user-space micro-benchmarks: a linked list, a hash table, and an RCU-based resizable hash table [35]. We compare an implementation using the basic RLU scheme with the state-of-the-art concurrent designs of those data structures based on the user-space RCU library [6] with the latest performance fix [2]. We use RLU locks to provide concurrency among write operations, yielding code that is as simple as sequential code; RCU can achieve this only by using a writer lock that serializes all writers and introduces severe overheads. We also study the costs of RLU object duplication and synchronize calls in pathological scenarios.

We then apply the more advanced RLU scheme with the deferral mechanism for synchronize calls to the state-of-the-art RCU-based Citrus tree [2], an enhancement of the Bonsai tree of Clements et al. [5]. The Citrus tree uses both RCU and fine-grained locks to deliver the best performing search tree to date [2, 5]. We show that the code of a Citrus tree based on RLU is significantly simpler and provides better scalability than the original version based on RCU and locks.

Next, we show an evaluation of RLU in the Linux kernel. We compare RLU to the kernel RCU implementation on a classic doubly linked list implementation, the most popular use of RCU in the kernel, as well as on a singly-linked list and a hash table. We show that RLU matches the performance of RCU while always being safe, that is, while eliminating the restrictions on use imposed by the RCU implementation (in the kernel RCU doubly linked list, traversing the list forwards and backwards is done in unsafe mode since it can lead to inconsistencies [26]). We also evaluate the correctness of RLU using a subset of the kernel-based RCU torture test module.

Finally, to show the expressibility of RLU beyond RCU, we convert a real-world application, the popular in-memory Kyoto Cabinet Cache DB, to use RLU instead of a single global reader-writer lock for thread coordination. We show that the new RLU-based code has almost linear scalability. It is unclear how one could convert Kyoto Cabinet, which requires simultaneous manipulation of several concurrent data structures, to use RCU.

4.1 Linked List

In our first benchmark, we apply RLU to the linked list data structure. We compare our RLU implementation to the state-of-the-art concurrent non-blocking Harris-Michael linked list (designed by Harris and improved by Michael) [16, 31]. The code for the Harris-Michael list is from synchrobench [14] and, since this implementation leaks memory (it has no support for memory reclamation), we generate an additional, more practical version of the list that uses hazard pointers [32] to detect stale pointers to deleted nodes. We also compare our RLU list to an RCU implementation based on the user-space RCU library [2, 6]. In the RCU list, the simple implementation serializes RCU writers. Note that it may seem that one can combine RCU with fine-grained per-node locks and make RCU writers concurrent. However, this is not the case, since nodes may change after the locking is complete. As a result, it requires special post-lock validations, ABA checks, and more, which complicates the solution and makes it similar to the Harris-Michael list. We show that by using RLU locks, one can provide concurrency among RLU writers and maintain the same simplicity as that of RCU code with serial writers. Our evaluation is performed on a 16-way Intel Core i7-5960X 3GHz 8-core chip with two hardware hyperthreads per core, on Linux 3.13 x86_64 with the GCC 4.8.2 C/C++ compiler.

In Figure 4 one can see throughput results for various linked lists with various mutation ratios. Specifically, the figure presents 2%, 20%, and 40% mutation ratios for each algorithm (the insert:remove ratio is always 1:1):

1. Harris leaky: The original list of Harris-Michael that leaks memory.
2. Harris HP: The more practical list of Harris-Michael with the memory leak fixed via the use of hazard pointers.
3. RCU: The RCU-based list that uses the user-space RCU library and executes serial writers.
4. RLU: The basic RLU, as described in Section 3, that uses RLU locks to concurrently execute RLU writers.
5. RLU defer: The more advanced RLU that defers RLU synchronize calls to the actual data conflicts between threads. The maximum defer limit is set to 10 write-sets.

Figure 4. Throughput for user-space linked lists (1,000 nodes) with 2% (left), 20% (middle), and 40% (right) updates.

In Figure 4, as expected, the leaky Harris-Michael list provides the best overall performance across all concurrency levels. The more realistic HP Harris-Michael list with the leak fixed is much slower due to the overhead of hazard pointers, which execute a memory fence on each object dereference. Next, the RCU-based list with writers executing serially has a significant overhead due to writer serialization. This is the cost RCU must pay to achieve a simple implementation, whereas by using RLU we achieve the same simplicity but better concurrency, which allows RLU to perform much better than RCU. Listing 2 shows the actual code for the list add() function that uses RLU. One can see that the implementation is simple: by using RLU to lock nodes, there is no need to program custom post-lock validations, ABA identifications, mark bits, tags, and more, as is usually done for standard fine-grained locking schemes [2, 18, 21] (consider also the related example in Listing 3). Finally, in this execution, the difference between RLU and deferred RLU is not significant, so we plot only RLU.

int rlu_list_add(rlu_thread_data_t *self,
                 list_t *list, val_t val) {
    int result;
    node_t *prev, *next, *node;
    val_t v;

restart:
    rlu_reader_lock();
    prev = rlu_dereference(list->head);
    next = rlu_dereference(prev->next);
    while (next->val < val) {
        prev = next;
        next = rlu_dereference(prev->next);
    }
    result = (next->val != val);
    if (result) {
        if (!rlu_try_lock(self, &prev) ||
            !rlu_try_lock(self, &next)) {
            rlu_abort(self);
            goto restart;
        }
        node = rlu_new_node();
        node->val = val;
        rlu_assign_ptr(&(node->next), next);
        rlu_assign_ptr(&(prev->next), node);
    }
    rlu_reader_unlock();
    return result;
}

Listing 2. RLU list: add function.

4.2 Hash Table

Next, we construct a simple hash table data structure that uses one linked list per bucket. For each key, it first hashes the key into a bucket, and then traverses the associated linked list using the specific implementations discussed above.
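As an illustration of this construction (a sketch under assumptions, not code from the paper), the per-bucket operations can simply reuse the RLU list code from Listing 2; the hash_t type, the bucket count, and the hash function below are hypothetical.

#define N_BUCKETS 1000

typedef struct hash {
    list_t *buckets[N_BUCKETS];       /* one RLU-protected linked list per bucket */
} hash_t;

static inline size_t bucket_of(val_t key) {
    return (size_t)key % N_BUCKETS;   /* placeholder hash function */
}

int rlu_hash_add(rlu_thread_data_t *self, hash_t *h, val_t key) {
    /* Hash to a bucket, then run the list operation from Listing 2; RLU
     * locking is per node, so writers in different buckets never conflict. */
    return rlu_list_add(self, h->buckets[bucket_of(key)], key);
}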

Figure 5 shows the results for various hash tables and mutation ratios. Here we base the RCU hash table implementation on per-bucket locks, so RCU writers that access different buckets can execute concurrently. As a result, RCU is highly effective and even outperforms (by 15%) the highly concurrent hash table design that uses Harris-Michael lists as buckets. The reason for this is simply the fact that RCU readers do less constant work than the readers of Harris-Michael (which execute mark-bit checks and more). In addition, in this benchmark we show deferred RLU, since it has a more significant effect here than in the linked list. This is because the probability of getting an actual data conflict in a hash table is significantly lower than getting a conflict in a list, so the deferred RLU reduces the number of synchronize calls by an order of magnitude as compared to the basic RLU. As one can see, the basic RLU incurs visible penalties for increasing mutation ratios. This is a result of RLU synchronize calls, which have more effect when operations are short and fast. However, the deferred RLU eliminates these penalties and matches the performance of Harris-Michael. Note that hazard pointers are less expensive in this case, since operations are shorter and are more prone to generate a cache miss due to the sparse memory accesses of hashing.

Figure 5. Throughput for user-space hash tables (1,000 buckets of 100 nodes) with 2% (left), 20% (middle), and 40% (right) updates.

4.3 Resizable Hash Table

To further evaluate RLU on highly efficient data structures, we implement the RCU-based resizable hash table of Triplett, McKenney, and Walpole [35]. In RCU, the table expand process first creates a new array of buckets that is linked to the old buckets. As a result, the new buckets are "zipped" and, in the next stage, the algorithm uses a "column-wise" iterative RCU process to unzip the buckets: it unzips each pair of buckets one step and, before moving to the next step, it executes the RCU synchronize call. The main reason for this column-wise design is the single pointer update limitation of RCU. Notice that this design exposes intermediate "unzip point" nodes to concurrent inserts and removes, which significantly complicates these operations in RCU.

We convert the RCU table to RLU with per-bucket writer locks, and eliminate the column-wise design: each pair of buckets is fully unzipped in a "one shot" unzip operation. As a result, in RLU, there is no need to handle intermediate "unzip point" nodes during inserts or removes, so they can execute concurrently without any programming effort.

The authors of the RCU-based resizable hash table provide source code that has no support for concurrent inserts or removes (only lookups). As a result, we use the same benchmark as in their original paper [35]: a table of 2^16 items that constantly expands and shrinks between 2^13 and 2^14 buckets (resizing is done by a dedicated thread), while there are concurrent lookups.

Figure 6 presents results for the RCU and RLU resizable hash tables. For both, it shows the 8K graph, which is the 2^13-bucket table without resizes, the 16K graph, which is the 2^14-bucket table without resizes, and the 8K-16K graph, which is the table that constantly resizes between 2^13 and 2^14 buckets. As can be seen in the graphs, RLU provides throughput that is similar to RCU.

Figure 6. Throughput for the resizable hash table (64K items, 8-16K buckets).

We also compared the total number of resizes and saw that the RLU resize is twice as slow as the RCU resize, due to the overheads of duplicating nodes in RLU. However, resizes are usually infrequent, so we would expect the latency of a resize to be less critical than the latency that it introduces into concurrent lookups.

4.4 Update-only Stress Test

In order to evaluate the main overheads of RLU compared to RCU, we execute an additional micro-benchmark that reproduces pathological cases that stress RLU object duplication and synchronize calls. The benchmark executes 100% updates on a 10,000-bucket hash table that has only one item in each bucket. As a result, RCU-based updates are quick: they simply hash into the bucket and update the single item of this bucket, whereas the RLU-based updates must also duplicate the single item of the bucket and then execute the RLU synchronize call.

Figure 7 presents results for this stress test. As can be seen, RLU is 2-5 times slower than RCU. Notice that, for a single thread, RLU is already twice as slow as RCU, which is a result of RLU object duplications (synchronize has no penalty for a single thread). Then, with increased concurrency, the RLU penalty increases due to RLU synchronize calls. However, by using RLU deferral, this penalty decreases to the level of the single-thread execution. This means that RLU deferral is effective in eliminating the penalty of RLU synchronize calls.

Figure 7. Throughput for the stress test on a hash table (10,000 buckets of 1 node) with 100% updates and a single item per bucket.

4.5 Citrus Search Tree

A recent paper by Arbel and Attiya [2] presents a new design of the Bonsai search tree of Clements et al. [5], which was initially proposed to provide a scalable implementation for address spaces in the kernel. The new design, called the Citrus tree, combines RCU and fine-grained locks to support concurrent write operations that traverse the search tree using RCU protected sections. The results of this work are encouraging, and they show that scalable concurrency is possible using RCU.

The design of Citrus is however quite complex, and it requires a careful understanding of concurrency and rigorous proof procedures. Specifically, a write operation in Citrus first traverses the tree using an RCU protected read-side section, and then uses fine-grained locks to lock the target node (and possibly successor and parent nodes). Then, after node locking succeeds, it executes post-lock validations, makes node duplications, performs an RCU synchronize call, and manipulates object pointers. As a result, the first phase that traverses the tree is simple and efficient, while the second phase of locking, validation, and modification is manual, complex, and error-prone.

We use RLU to reimplement the Citrus tree, and our results show that the new code is much simpler: RLU completely automates the complex locking, validation, duplication, and pointer manipulation steps of the Citrus writer, which a programmer would have previously needed to manually design, code, and verify. Listings 3 and 4 present the code of the Citrus delete() function for RCU and RLU (for clarity, some details are omitted). Notice that the RCU implementation is based on mark bits, tags (to avoid ABA), post-lock custom validations, and manual RCU-style node duplication and installation. In contrast, the RLU implementation is straightforward: it just locks each node before writing to it, and then performs "sequential" reads and writes.

Figure 8 presents performance results for the RCU and RLU Citrus trees. In this benchmark, we execute on an 80-way, highly concurrent, 4-socket Intel machine, in which each socket is an Intel Xeon E7-4870 2.4GHz 10-core chip with two hyperthreads per core. In addition, we apply the deferred RLU algorithm to reduce the synchronization calls of RLU and provide better scalability. We show results for 10%, 20%, and 40% mutation ratios for both RCU and RLU, and also provide some RLU statistics:

1. RLU write-back quiescence: The average number of iterations spent in the RLU synchronize waiting loop. This provides a rough estimate of the cost of RLU synchronize for each number of threads (each iteration includes one cpu_relax() call to reduce bus noise and contention).

2. RLU sync ratio: The probability for a write operation to execute the RLU synchronize call, write-back, and unlock. In other words, the sync ratio indicates the probability of an actual data conflict between threads, where a thread sends a sync request that forces another thread to synchronize and "flush" the new data to the memory.

3. RLU read copy ratio: The probability for an object read to steal a new copy from the write-log of another thread. This provides an approximate indication of how many read-write conflicts occur during benchmark execution.

4. RLU sync request ratio: The probability for a thread to find a node locked by another thread. Notice that this number is higher than the actual RLU sync ratio, since multiple threads may find the same locked object and send multiple requests to the same thread to unlock the same object.

Figure 8. Throughput for the Citrus tree (100,000 nodes) with RCU and RLU (top) and RLU statistics (bottom): write-back quiescence, sync per writer (%), read copy per writer (%), and sync request per writer (%).

bool RCU_Citrus_delete(citrus_t *tree, int key) {
  node_t *pred, *curr, *succ, *parent, *next, *node;
  urcu_read_lock();
  pred = tree->root;
  curr = pred->child[0];
  ... Traverse the tree ...
  urcu_read_unlock();
  pthread_mutex_lock(&(pred->lock));
  pthread_mutex_lock(&(curr->lock));
  // Check that pred and curr are still there
  if (!validate(pred, 0, curr, direction)) {
    ... Restart operation ...
  }
  ... Handle case with 1 child, assume 2 children now ...
  // Find successor and its parent
  parent = curr;
  succ = curr->child[1];
  next = succ->child[0];
  while (next != NULL) {
    parent = succ;
    succ = next;
    next = next->child[0];
  }
  pthread_mutex_lock(&(succ->lock));
  // Check that succ and its parent are still there
  if (validate(parent, 0, succ, succDirection) &&
      validate(succ, succ->tag[0], NULL, 0)) {
    curr->marked = true;
    // Create a new successor copy
    node = new_node(succ->key);
    node->child[0] = curr->child[0];
    node->child[1] = curr->child[1];
    pthread_mutex_lock(&(node->lock));
    // Install the new successor
    pred->child[direction] = node;
    // Ensure no reader is accessing the old successor
    urcu_synchronize();
    // Update tags/marks and redirect the old successor
    if (pred->child[direction] == NULL)
      pred->tag[direction]++;
    succ->marked = true;
    if (parent == curr) {
      node->child[1] = succ->child[1];
      if (node->child[1] == NULL)
        node->tag[1]++;
    } else {
      parent->child[0] = succ->child[1];
      if (parent->child[1] == NULL)
        parent->tag[1]++;
    }
  }
  ... Unlock all nodes ...
  // Deallocate the removed node
  free(curr);
  return true;
}

Listing 3. RCU-based Citrus delete operation [2].

bool RLU_Citrus_delete(citrus_t *tree, int key) {
  node_t *pred, *curr, *succ, *parent, *next, *node;
  rlu_reader_lock();
  pred = (node_t *)rlu_dereference(tree->root);
  curr = (node_t *)rlu_dereference(pred->child[0]);
  ... Traverse the tree, assume 2 children now ...
  // Find successor and its parent
  parent = curr;
  succ = (node_t *)rlu_dereference(curr->child[1]);
  next = (node_t *)rlu_dereference(succ->child[0]);
  while (next != NULL) {
    parent = succ;
    succ = next;
    next = (node_t *)rlu_dereference(next->child[0]);
  }
  // Lock nodes and manipulate pointers as in serial code
  if (parent == curr) {
    rlu_lock(&succ);
    rlu_assign_ptr(&(succ->child[0]), curr->child[0]);
  } else {
    rlu_lock(&parent);
    rlu_assign_ptr(&(parent->child[0]), succ->child[1]);
    rlu_lock(&succ);
    rlu_assign_ptr(&(succ->child[0]), curr->child[0]);
    rlu_assign_ptr(&(succ->child[1]), curr->child[1]);
  }
  rlu_lock(&pred);
  rlu_assign_ptr(&(pred->child[direction]), succ);
  // Deallocate the removed node
  rlu_free(curr);
  // Done with delete
  rlu_reader_unlock();
  return true;
}

Listing 4. RLU-based Citrus delete operation.

Performance results show that RLU Citrus matches RCU for low thread counts, and improves over RCU for high thread counts by a factor of 2. This improvement is due to the deferral of synchronize calls, which allows RLU to execute expensive synchronize calls only on actual data conflicts, whereas the original Citrus must execute synchronize for each delete. As can be seen in the RLU statistics, the reduction is almost an order of magnitude (10% sync and write-back ratio) relative to the basic RLU. It is important to note that Arbel and Morrison [3] proposed an RCU predicate primitive that allows them to reduce the cost of synchronize calls. However, defining an RCU predicate requires explicit programming and internal knowledge of the data structure, in contrast to RLU, which automates this process.

4.6 Kernel-space Tests

The kernel implementation of RCU differs from user-space RCU in a few key aspects. It notably leverages kernel features to guarantee non-preemption and scheduling of a task after a grace period. This makes RCU extremely efficient and gives it very low overhead: with non-preemptible RCU, the read-side lock acquisition can be as short as a compiler memory barrier. Thus, to compare the performance of the kernel implementation of RCU with RLU, we create a Linux kernel module along the same lines as Triplett et al. with the RCU hash table [35].

One main use case of kernel RCU is linked lists, which are used throughout Linux, from the kernel to the drivers. We therefore first compare the kernel RCU implementation of this structure to our RLU version. We implemented our RLU list by leveraging the same API as the RCU list (list_for_each_entry_rcu(), list_add_rcu(), list_del_rcu()) and replacing the RCU API with RLU calls.
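For reference, here is a minimal kernel-module-style sketch of the RCU list pattern these benchmarks exercise: readers traverse inside an RCU read-side section, writers are serialized by a spinlock, and removal defers the free until a grace period has elapsed. The node layout and lock name are ours, not the benchmark module's; the RLU version keeps exactly the same structure and swaps the rcu_*/*_rcu calls for their RLU counterparts (with rlu_free() replacing synchronize_rcu() followed by kfree()).

#include <linux/cache.h>
#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/* Benchmark-style node, padded to a full cache line as described below. */
struct test_node {
    int key;
    struct list_head link;
} ____cacheline_aligned;

static LIST_HEAD(test_list);
static DEFINE_SPINLOCK(test_lock);

/* Reader: unsynchronized traversal inside an RCU read-side section. */
static bool test_list_contains(int key)
{
    struct test_node *n;
    bool found = false;

    rcu_read_lock();
    list_for_each_entry_rcu(n, &test_list, link) {
        if (n->key == key) {
            found = true;
            break;
        }
    }
    rcu_read_unlock();
    return found;
}

/* Writers are serialized by a spinlock; a removed node is freed only after
 * a grace period, so concurrent readers never touch freed memory. */
static void test_list_add(struct test_node *n)
{
    spin_lock(&test_lock);
    list_add_rcu(&n->link, &test_list);
    spin_unlock(&test_lock);
}

static void test_list_del(struct test_node *n)
{
    spin_lock(&test_lock);
    list_del_rcu(&n->link);
    spin_unlock(&test_lock);
    synchronize_rcu();    /* wait for pre-existing readers */
    kfree(n);
}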

For benchmarking we inserted dummy nodes with appropriate padding to fit an entire cache line. We used the same test machine as for the user-space experiment (16-way Intel i7-5960X) with version 3.16 of the Linux kernel and non-preemptible RCU enabled, and we experimented with low update rates of 0.1% and 1%, which represent the common case for using RCU-based synchronization in the kernel. We implemented a performance fix in the RCU list implementation (in list_entry_rcu()), which we have reported to the Linux kernel mailing list. Results with the fix are labeled as "RCU (fixed)" in the graphs.

Figure 9. Throughput (operations/µs) for kernel doubly linked lists (512 nodes, list_* APIs) with 0.1% (left) and 1% (right) updates, for RCU, RCU (fixed), and RLU.

We observe in Figure 9 that RCU has reduced overhead compared to RLU in read-mostly scenarios. However, the semantics provided by the two lists are different. RCU cannot add an element atomically to the doubly linked list and therefore restricts all concurrent readers to traversing only forward. Traversing the list backwards is unsafe since it can lead to inconsistencies, so special care must be taken to avoid memory corruption and system crashes [26]. In contrast, RLU provides a consistent list at a reasonable cost.

Figure 10. Throughput (operations/µs) for linked lists (1,000 nodes) running in the kernel with 2% (left), 20% (middle), and 40% (right) updates.

Figure 11. Throughput (operations/µs) for hash tables (1,000 buckets of 100 nodes) running in the kernel with 2% (left), 20% (middle), and 40% (right) updates, for RCU, RLU, and RLU (defer).

We also conducted kernel tests with higher update rates of 2%, 20%, and 40% on a singly linked list and a hash table to match the user-space benchmarks and compare RLU against the kernel implementation of RCU. Note that these data structures are identical to those tested earlier in user space, but they use the kernel implementation of RCU instead of the user-space RCU library. Results are shown in Figure 10 and Figure 11. As expected, in the linked list, increasing the number of writers in RCU introduces a sequential bottleneck, while RLU writers proceed concurrently and allow RLU to scale. In the hash table, RCU scales since it uses per-bucket locks, and RLU matches RCU. Note that the deferred RLU slightly outperforms RCU, which is due to faster memory deallocation (and reuse) in RLU compared to RCU (which must wait for the kernel threads to context-switch).

4.7 Kernel-space Torture Tests

The kernel implementation of RCU comes with a module, named RCU torture, which tests the RCU implementation for possible bugs and correctness problems. It contains many tests that exercise the different implementations and operating modes of RCU. As RLU does not support all the features and variants of RCU, we only considered the basic RCU torture tests that check for consistency of the classic implementation of RCU and can be readily applied to RLU.

These basic consistency tests consist of one writer thread, n reader threads, and n fake writer threads. The writer thread gets an element from a private pool, shares it with the other threads using a shared variable, and then takes it back to the private pool using the RCU mechanism (deferred free, synchronize, etc.). The reader threads continuously read the shared variable while the fake writers just invoke synchronize with random delays. All the threads perform consistency checks at different steps and with different delays.
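As an illustration only, a user-space analogue of this basic consistency test could be structured as follows with pthreads. The element layout, the a == b invariant, and the rlu_synchronize() name are our assumptions (the driver that spawns the threads is omitted), and the real RCU torture module performs far more elaborate checks.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* RLU calls as used in Listing 4; assumed to be provided by the RLU library
 * (the exact name of the synchronize call may differ). */
void rlu_reader_lock(void);
void rlu_reader_unlock(void);
void rlu_synchronize(void);

typedef struct { int a; int b; } elem_t;          /* invariant checked: a == b */

static elem_t pool[2] = { { 1, 1 }, { 2, 2 } };   /* writer-private pool */
static _Atomic(elem_t *) shared = &pool[0];       /* the shared variable */
static atomic_bool stop;

/* n reader threads: continuously read the shared element and check it. */
static void *reader(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop)) {
        rlu_reader_lock();
        elem_t *e = atomic_load(&shared);
        if (e->a != e->b)
            fprintf(stderr, "torture: inconsistent element\n");
        rlu_reader_unlock();
    }
    return NULL;
}

/* n fake writer threads: just invoke synchronize with random delays. */
static void *fake_writer(void *arg)
{
    unsigned seed = (unsigned)(uintptr_t)arg;
    while (!atomic_load(&stop)) {
        rlu_synchronize();
        usleep(1 + rand_r(&seed) % 1000);
    }
    return NULL;
}

/* One writer thread: publish an element from the private pool, wait for
 * pre-existing readers, and only then take the old element back and reuse it. */
static void *writer(void *arg)
{
    (void)arg;
    int next = 1;
    while (!atomic_load(&stop)) {
        elem_t *old = atomic_load(&shared);
        atomic_store(&shared, &pool[next]);
        rlu_synchronize();        /* no reader can still hold 'old' */
        old->a++;                 /* mutate it back in the private pool; */
        old->b++;                 /* readers must never observe a != b   */
        next = 1 - next;
    }
    return NULL;
}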

We have successfully run this RLU torture test with deferred free and up to 15 readers and fake writers on our 16-way Intel i7-5960X machine. While our experiments only cover a subset of all the RCU torture tests, they still provide strong evidence of the correctness of our algorithm and its implementation. We plan to expand the list of tests as we add additional features and APIs to RLU.

4.8 Kyoto Cabinet Cache DB

We finally illustrate how to use RLU for an existing application with the popular in-memory database implementation Kyoto Cabinet Cache DB [12]. Kyoto Cache DB is written in C++ and its DBM implementation is relatively simple and straightforward. Internally, Kyoto breaks the database into slots, where each slot is composed of buckets and each bucket is a search tree. As a result, to find a key, Kyoto first hashes the key into a slot, and then into a bucket in this slot. Then, it traverses the search tree that resides in the bucket, processes the record that includes the key, and returns. Database operations in Kyoto Cache DB are fast due to the double hashing procedure and the use of search trees. However, Kyoto fails to scale with increasing numbers of threads, and in fact it usually collapses after 3-4 threads. Recent work by Dice et al. [9] observed a scalability bottleneck and indicated that the problem is the global reader-writer lock that Kyoto uses to synchronize database operations.

We conducted a performance analysis of Kyoto Cache DB and concur with [9] that the global reader-writer lock is indeed the problem. However, we also found that Kyoto performs an excessive amount of thread context switches due to the specific implementation of reader-writer spin locks in the Linux pthreads library. We therefore decided to first eliminate the context switches by replacing the reader-writer lock of Kyoto with an ingress-egress reader-writer lock implementation [7]. To the best of our knowledge, ingress-egress reader-writer locks perform the best on Intel machines (an ingress/enter counter and an egress/exit counter for read-lock/read-unlock) [1].
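For reference, a minimal C11 sketch of the ingress-egress idea (field and function names are ours): read-lock increments an ingress counter, read-unlock increments a separate egress counter, and a writer blocks new readers and waits until the two counters match. In practice the two counters would also be placed on separate cache lines.

#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    _Atomic unsigned long ingress;   /* readers that have entered */
    _Atomic unsigned long egress;    /* readers that have left */
    _Atomic bool writer;             /* a writer is waiting or active */
} ie_rwlock_t;

void ie_read_lock(ie_rwlock_t *l)
{
    for (;;) {
        while (atomic_load(&l->writer))      /* give priority to a waiting writer */
            sched_yield();
        atomic_fetch_add(&l->ingress, 1);    /* enter: bump the ingress counter */
        if (!atomic_load(&l->writer))
            return;                          /* no writer raced us: we are in */
        atomic_fetch_add(&l->egress, 1);     /* a writer arrived: back out and retry */
    }
}

void ie_read_unlock(ie_rwlock_t *l)
{
    atomic_fetch_add(&l->egress, 1);         /* exit: bump the egress counter */
}

void ie_write_lock(ie_rwlock_t *l)
{
    bool expected = false;
    while (!atomic_compare_exchange_weak(&l->writer, &expected, true)) {
        expected = false;                    /* one writer at a time */
        sched_yield();
    }
    /* Wait until every reader that entered has also left. */
    while (atomic_load(&l->ingress) != atomic_load(&l->egress))
        sched_yield();
}

void ie_write_unlock(ie_rwlock_t *l)
{
    atomic_store(&l->writer, false);
}

Because both the read-lock and the read-unlock are simple atomic increments and the lock spins instead of sleeping, this avoids the excess context switching of the pthreads reader-writer lock noted above.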
We note that one could use hierarchical cohort-based reader-writer locks [10] in our benchmark to reduce the cache traffic in Kyoto, but this would not have a significant effect since the performance analysis reveals that the cache miss ratio is already low (4%-5%).

In addition to the global reader-writer lock, Kyoto also uses a lock per slot. As a result, each operation acquires the global reader-writer lock for a read or a write, depending on whether the actual operation is read-only or not, and then acquires the lock of the relevant slot. Based on this, we apply the RLU scheme to Kyoto Cache DB in a way that eliminates the need for the global reader-writer lock, and we use the per-slot locks to synchronize the RLU writers. An added benefit of this design is that RLU writers are irrevocable and have no need to support abort or undo procedures. As a result, the conversion to RLU is simple and straightforward.

Figure 12. Throughput (operations/µs) for the original (reader-writer lock), ingress-egress lock, and RLU versions of the Kyoto Cache DB (1GB in-memory DB) with 2% (left) and 10% (right) updates.

Figure 12 shows throughput results for the original, fixed (ingress-egress reader-writer lock), and RLU-based Kyoto Cache DB for 2% and 10% mutation ratios and a 1GB DB. This benchmark runs on a 16-way Intel 8-core chip, where each thread randomly executes set(), add(), remove(), and get() DB operations. In the performance graph one can see that the new RLU-based Kyoto provides continued scalability where the original Kyoto fails to scale due to the global reader-writer lock (the slight drop of RLU from 8 to 10 threads is due to 8-core hyper-threading). Observe that the original Kyoto implementation fails to scale despite the fact that the fraction of read-only operations is high, about 90-98%. Fixing this problem by replacing the global reader-writer lock with an ingress-egress lock eliminates the excess context switching and allows Kyoto to scale up to 6-8 threads. Note that it is possible to combine the ingress-egress lock with a passive locking scheme [23] to avoid memory barriers on the read-side of the lock, but writers still cannot execute concurrently with readers, and this approach introduces a sequential bottleneck and limits scalability.

We believe that if one were to convert Kyoto to RCU using the per-slot locks for writer synchronization (as we did), it would provide the same performance as with RLU. However, it is not clear how to even begin to convert those update operations to use RCU. Kyoto's update operation may modify multiple nodes in a search tree, multiple locations in the hash tables, and possibly more locations in other helper data structures. The result, we fear, will be a non-trivial design, which in the end will deliver performance similar to the one RLU provides quite readily today.

5. Conclusion

In summary, one can see that the increased parallelism hidden under the hood of the RLU system provides for a simple programming methodology that delivers performance similar to or better than that obtainable with RCU, but at a significantly lower intellectual cost to the programmer. RLU is compatible with the RCU interface, and we hope that its combination of good performance and more expressive semantics will convince both kernel and user-space programmers to use it to parallelize real-world applications.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments, as well as Haibo Chen for his help in preparing the final version of this paper. Support is gratefully acknowledged from the National Science Foundation under grants CCF-1217921, CCF-1301926, and IIS-1447786, the Department of Energy under grant ER26116/DE-SC0008923, the European Union under COST Action IC1001 (Euro-TM), and the Intel and Oracle corporations.

References

[1] Y. Afek, A. Matveev, and N. Shavit. Pessimistic software lock-elision. In Proceedings of the 26th International Conference on Distributed Computing, DISC'12, pages 297-311, Berlin, Heidelberg, 2012. Springer-Verlag. ISBN 978-3-642-33650-8.

[2] M. Arbel and H. Attiya. Concurrent updates with RCU: Search tree as an example. In Proceedings of the 2014 ACM Symposium on Principles of Distributed Computing, PODC '14, pages 196-205, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2944-6.

[3] M. Arbel and A. Morrison. Predicate RCU: An RCU for scalable concurrent updates. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 21-30, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3205-7.

[4] S. Boyd-Wickizer. Optimizing Communication Bottlenecks in Multiprocessor Operating System Kernels. PhD thesis, Massachusetts Institute of Technology, 2013.

[5] A. T. Clements, M. F. Kaashoek, and N. Zeldovich. Scalable address spaces using RCU balanced trees. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 199-210, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-0759-8.

[6] M. Desnoyers, P. E. McKenney, A. S. Stern, M. R. Dagenais, and J. Walpole. User-level implementations of read-copy update. IEEE Transactions on Parallel and Distributed Systems, 23(2):375-382, 2012. ISSN 1045-9219.

[7] D. Dice and N. Shavit. TLRW: Return of the read-write lock. In Proceedings of the Twenty-second Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '10, pages 284-293, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0079-7. URL http://doi.acm.org/10.1145/1810479.1810531.

[8] D. Dice, O. Shalev, and N. Shavit. Transactional locking II. In Proceedings of the 20th International Conference on Distributed Computing, DISC'06, pages 194-208, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3-540-44624-9, 978-3-540-44624-8.

[9] D. Dice, A. Kogan, Y. Lev, T. Merrifield, and M. Moir. Adaptive integration of hardware and software lock elision techniques. In Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '14, pages 188-197, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2821-0.

[10] D. Dice, V. J. Marathe, and N. Shavit. Lock cohorting: A general technique for designing NUMA locks. ACM Trans. Parallel Comput., 1(2):13:1-13:42, Feb. 2015. ISSN 2329-4949.

[11] R. Ennals. Software transactional memory should not be obstruction-free. Technical report, Intel Research Cambridge, 2006. IRC-TR-06-052.

[12] FAL Labs. Kyoto Cabinet: A straightforward implementation of DBM, 2011. URL http://fallabs.com/kyotocabinet/.

[13] A. Gotsman, N. Rinetzky, and H. Yang. Verifying concurrent memory reclamation algorithms with grace. In Proceedings of the 22nd European Conference on Programming Languages and Systems, ESOP'13, pages 249-269, Berlin, Heidelberg, 2013. Springer-Verlag. ISBN 978-3-642-37035-9. URL http://dx.doi.org/10.1007/978-3-642-37036-6_15.

[14] V. Gramoli. More than you ever wanted to know about synchronization: Synchrobench, measuring the impact of the synchronization on concurrent algorithms. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 1-10, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3205-7.

[15] D. Guniguntala, P. E. McKenney, J. Triplett, and J. Walpole. The Read-Copy-Update mechanism for supporting real-time applications on shared-memory multiprocessor systems with Linux. IBM Systems Journal, 47(2):221-236, 2008.

[16] T. L. Harris. A pragmatic implementation of non-blocking linked-lists. In Proceedings of the 15th International Conference on Distributed Computing, DISC '01, pages 300-314, London, UK, 2001. Springer-Verlag.

[17] T. E. Hart. Comparative performance of memory reclamation strategies for lock-free and concurrently-readable data structures. Master's thesis, University of Toronto, 2005.

[18] S. Heller, M. Herlihy, V. Luchangco, M. Moir, W. N. Scherer, and N. Shavit. A lazy concurrent list-based set algorithm. In Proceedings of the 9th International Conference on Principles of Distributed Systems, OPODIS'05, pages 3-16, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3-540-36321-1, 978-3-540-36321-7. URL http://dx.doi.org/10.1007/11795490_3.

[19] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In Proceedings of the Twenty-second Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '10, pages 355-364, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0079-7. URL http://doi.acm.org/10.1145/1810479.1810540.

[20] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers, San Mateo, CA, 2008. ISBN 0-12-370591-6.

[21] M. Herlihy, Y. Lev, V. Luchangco, and N. Shavit. A simple optimistic skiplist algorithm. In Proceedings of the 14th International Conference on Structural Information and Communication Complexity, SIROCCO'07, pages 124-138, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-72918-1. URL http://dl.acm.org/citation.cfm?id=1760631.1760646.

[22] B. Hindman and D. Grossman. Atomicity via source-to-source translation. In Proceedings of the 2006 Workshop on Memory System Performance and Correctness, MSPC '06, pages 82-91, New York, NY, USA, 2006. ACM. ISBN 1-59593-578-9. URL http://doi.acm.org/10.1145/1178597.1178611.

[23] R. Liu, H. Zhang, and H. Chen. Scalable read-mostly synchronization using passive reader-writer locks. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC'14, pages 219-230, Berkeley, CA, USA, 2014. USENIX Association. ISBN 978-1-931971-10-2. URL http://dl.acm.org/citation.cfm?id=2643634.2643658.

[24] A. Matveev and N. Shavit. Reduced hardware NOrec: A safe and scalable hybrid transactional memory. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '15, pages 59-71, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-2835-7.

[25] P. E. McKenney. Sleepable RCU, 2006. URL http://lwn.net/Articles/202847/.

[26] P. E. McKenney. What is RCU, fundamentally?, 2007. URL https://lwn.net/Articles/262464/.

[27] P. E. McKenney. Hierarchical RCU, 2008. URL http://lwn.net/Articles/305782/.

[28] P. E. McKenney and J. D. Slingwine. Read-Copy-Update: Using execution history to solve concurrency problems. In Parallel and Distributed Computing and Systems, pages 509-518, Las Vegas, NV, Oct. 1998.

[29] P. E. McKenney and J. Walpole. Introducing technology into the Linux kernel: A case study. SIGOPS Oper. Syst. Rev., 42(5):4-17, July 2008. ISSN 0163-5980.

[30] P. E. McKenney, S. Boyd-Wickizer, and J. Walpole. RCU usage in the Linux kernel: One decade later. Technical report, 2013.

[31] M. M. Michael. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '02, pages 73-82, New York, NY, USA, 2002. ACM. ISBN 1-58113-529-7. URL http://doi.acm.org/10.1145/564870.564881.

[32] M. M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Trans. Parallel Distrib. Syst., 15(6):491-504, June 2004. ISSN 1045-9219.

[33] T. Riegel, P. Felber, and C. Fetzer. A lazy snapshot algorithm with eager validation. In Proceedings of the 20th International Conference on Distributed Computing, DISC'06, pages 284-298, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3-540-44624-9, 978-3-540-44624-8.

[34] M. L. Scott. Shared-Memory Synchronization. Morgan & Claypool Publishers, San Mateo, CA, 2013. ISBN 1-60-845956-X.

[35] J. Triplett, P. E. McKenney, and J. Walpole. Resizable, scalable, concurrent hash tables via relativistic programming. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC'11, pages 11-11, Berkeley, CA, USA, 2011. USENIX Association. URL http://dl.acm.org/citation.cfm?id=2002181.2002192.

[36] R. M. Yoo, C. J. Hughes, K. Lai, and R. Rajwar. Performance evaluation of Intel transactional synchronization extensions for high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, pages 19:1-19:11, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2378-9.
