Concurrent Updates with RCU: Search Tree As an Example∗

Concurrent Updates with RCU: Search Tree as an Example∗ Maya Arbel Hagit Attiya Department of Computer Science Department of Computer Science Technion Technion [email protected] [email protected] ABSTRACT alternative, contemporary approach is presented by read copy up- Read copy update (RCU) is a novel synchronization mechanism, date (RCU) [22], a distinctive synchronization mechanism favor- in which the burden of synchronization falls completely on the ing read-only operations even further, by allowing them to proceed updaters, by having them wait for all pre-existing readers to fin- even while writers are modifying the data they are reading. Instead, the read-side critical section is wrapped by the rcu_read_lock and ish their read-side critical section. This paper presents CITRUS, a concurrent binary search tree (BST) with a wait-free contains rcu_read_unlock functions; an additional synchronize_rcu func- operation, using RCU synchronization and fine-grained locking tion can be used by a writer as a barrier ensuring that all preceding for synchronization among updaters. This is the first RCU-based read-side critical sections complete. In typical RCU usage, the bur- data structure that allows concurrent updaters. While there are den of synchronization is placed on updates, who can wait for all methodologies for using RCU to coordinate between readers and pre-existing readers to finish their read-side critical section. updaters, they do not address the issue of coordination among up- RCU is used extensively within the Linux kernel [20], mostly daters, and indeed, all existing RCU-based data structures rely on to facilitate memory reclamation [12, 21]. However, its potential coarse-grained synchronization between updaters. for concurrent programming remained unexploited. Although several RCU-based data structures have been proposed, for example, Experimental evaluation shows that CITRUS beats previous RCU-based search trees, even under mild update contention, and search trees and hash tables [6, 18, 25, 26], they all do not allow compares well with the best-known concurrent dictionaries. concurrent updates, either pessimistically, using a coarse-grained lock [6], or optimistically, using transactional memory [18]. At best, the data structure is partitioned into segments, e.g., buckets in Categories and Subject Descriptors a hash table [25, 26], each guarded by a single lock. D.1.3 [Programming Techniques]: Concurrent Programming— This leaves unanswered the question of coordination between Concurrent programming; F.1.2 [Computation By Abstract De- concurrent updates to the data structure, while using RCU, which vices]: Modes of Computation—Parallelism and concurrency is the topic of this paper. A natural approach for supporting concurrent updates is to em- General Terms ploy fine-grained locks, acquired and released according to some locking policy, e.g., two-phase locking or hand-over-hand locking. Algorithms, Architecture We found that these fine-grained approaches fail to ensure consistent views for read-only operations that access several items in the Keywords data structure, e.g., partial snapshots [2] or other iterators [24]. Fig- Shared memory; internal search tree; read-copy-update ure 1 shows an example in which two readers, r1 and r2, attempt to collect the leaves, by in-order traversal, while two updates delete leaves 9 and 12. In (a), r already traversed the left sub-tree while 1. INTRODUCTION 2 r1 has not yet started its traversal. After 9 is deleted (b), r1 finishes The traditional approach to synchronization between readers and traversing the tree while r2 did not take additional steps, and 12 is writers allows concurrency among readers, but excludes readers deleted before both readers complete (c). Since each reader may when a writer is executing, e.g., through a readers/writer lock. An observe a different permutation of the writes to the data structure, ∗ the values returned by r1 and r2 are such that they observed the This research is supported by Yad-HaNadiv foundation and the updates in different order. This means that without additional itera- Israel Science Foundation (grant number 1227/10). tions or interaction, there is no consistent way to order the updates and the readers, complicating the task of designing a concurrent Permission to make digital or hard copies of all or part of this work for personal or data structure. classroom use is granted without fee provided that copies are not made or distributed Nevertheless, this example (and others that we found) hints that for profit or commercial advantage and that copies bear this notice and the full cita- tion on the first page. Copyrights for components of this work owned by others than the difficulty is in having read-only operations that need to atom- ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- ically access several locations. The counter-examples do not hold publish, to post on servers or to redistribute to lists, requires prior specific permission when the read-only operation only searches for a particular data and/or a fee. Request permissions from [email protected]. item, e.g., in a dictionary. PODC’14, July 15–18, 2014, Paris, France. This observation led to the design of CITRUS, a binary search Copyright 2014 ACM 978-1-4503-2944-6/14/07 ...$15.00. tree implementing a dictionary with concurrent updates (insert and http://dx.doi.org/10.1145/2611462.2611471. delete) and contains queries. Updates synchronize using fine- r1 10 r2 10 r2 10 rcu_read_lock 7 15 7 15 7 15 Read-side critical section r1 rcu_read_lock rcu_read_unlock 9 12 12 (a) (b) (c) synchronize_rcu r = fg r = f7; 12g r = f7; 12g 1 1 1 Figure 2: The semantics of synchronize_rcu r2 = f9g r2 = f9g r2 = f9; 15g Figure 1: An example in which RCU readers may observe a different permutation of the writes to the data structure. A configuration is an instantaneous snapshot of the system de- scribing the state of all local and shared variables. In the initial configuration, all variables hold an initial value. grained locks, while contains proceeds in a wait-free manner, in An execution π is an alternating sequence of configurations and parallel with updates, relying on RCU to ensure correctness. The steps, C0; s1; :::; si;Ci; :::, where C0 is the initial configuration, combination of RCU synchronization with fine-grained locking of and each configuration Ci is the result of executing the step si in modified nodes yields a simple design, greatly resembling the se- configuration Ci−1.A prefix σ of π is a sub-sequence of π starting quential algorithm, and leading to a (relatively) simple proof of in C0 and ending with a configuration. An interval of π is a sub- correctness, only several pages long. sequence that starts with a step and ends with a configuration. The CITRUS implements an internal search tree, with keys stored in interval of an operation op starts with the invocation step of op and all nodes, and it is unbalanced. Searching for a key, either in a ends with the configuration following the response of op or the end contains operation or at the beginning of an update, is done in a of π, if there is no such response. wait-free manner, as in the sequential data structure, but inside an The API for read copy update (RCU) provides several functions, RCU read-side critical section. three of which are used in CITRUS to synchronize between read- Updates start by searching for their key, in a manner similar ers and updates: rcu_read_lock, rcu_read_unlock and synchro- to contains. An unsuccessful update (insert that finds its key or nize_rcu.A read-side critical section is the interval starting with delete that does not find its key), returns immediately. Otherwise, the step returning from rcu_read_lock and ending with the config- the update is located where the change should be done. uration after the invocation of rcu_read_unlock. The implementa- A new node is inserted as a leaf, requiring little synchronization, tions of rcu_read_lock and rcu_read_unlock must be wait-free. but delete may need to remove an internal node. When the node The RCU property (Figure 2) ensures that if a step of a read-side has two children, this is done by replacing the node with its succes- critical section precedes the invocation of synchronize_rcu, then sor in the tree. This scenario requires coordination with concurrent all steps of this critical section precede the return from synchro- searches that could miss the successor in both its previous location nize_rcu [12, 13]. and in its new location, necessitating delicate synchronization in A dictionary is a set of key-value pairs, with totally ordered keys, some concurrent search trees [4, 7, 9, 19]. CITRUS easily circum- with the following operations: vents this pitfall by copying the successor to the new location, and insert(k,val) adds (k; val) to the set; returns true if k is not in the using the RCU synchronization mechanism to wait for on-going set and false otherwise. searches, before removing the successor from its previous location. Experimental evaluation of CITRUS shows that its performance delete(k) removes (k; val) from the set; returns true if k is in the beats previous RCU-based search trees [6, 18], even under mild set and false otherwise. update contention. It also compares well with the best-available concurrent dictionaries, e.g., [4, 15, 23]. contains(k) returns val if (k; val) is in the set; otherwise, it re- The evaluation also reveals that the user-level RCU implemen- turns false. tation [8] is ill-suited for workloads in which many updates con- An update is either an insert or a delete. currently synchronize through it. We sketch a new implementation A binary search tree implements a dictionary; it is internal if that avoids these pitfalls and scales better with growing number of key-value pairs are stored in all nodes.

Concurrent Updates with RCU: Search Tree As an Example∗

A Fast Concurrent and Resizable Robin Hood Hash Table Endrias

Intel Threading Building Blocks

16 Concurrent Hash Tables: Fast and General(?)!

Locks, Lock Based Data Structures  Review: Proportional Share Scheduler – Ch

Read-Copy-Update for Helenos (Slides)

Threading for Performance with Intel® Threading Building Blocks

Concurrent Data Structures

29. Lock-Based Concurrent Data Structures Operating System: Three Easy Pieces

Blocking and Non-Blocking Concurrent Hash Tables in Multi-Core Systems

Phase-Concurrent Hash Tables for Determinism

Fast, Dynamically-Sized Concurrent Hash Table⋆

Concurrent Hash Tables: Fast and General?(!)