arXiv:1911.09671v3 [cs.DC] 29 Feb 2020

LL/SC and Atomic Copy: Constant Time, Space Efficient Implementations using only pointer-width CAS

Guy E. Blelloch and Yuanhao Wei

Carnegie Mellon University, USA
{guyb, yuanhao1}@cs.cmu.edu

March 3, 2020

Abstract

When designing concurrent algorithms, Load-Link/Store-Conditional (LL/SC) is often the ideal primitive to have because it is immune to the ABA problem. Unfortunately, the full semantics of LL/SC are not supported in hardware by any modern architecture, so there has been a significant amount of work on simulations of LL/SC using Compare-and-Swap (CAS), a synchronization primitive that enjoys widespread hardware support. However, all of the simulations so far that are constant time either use unbounded sequence numbers (and thus base objects of unbounded size), or require Ω(MP) space for M LL/SC objects shared by P processes.

We present a constant time implementation of M LL/SC objects using only Θ(M + kP²) space (where k is the maximum number of outstanding LL operations per process) and requiring only pointer-width CAS. In particular, using pointer-width CAS means that we do not use unbounded sequence numbers. For most algorithms that use LL/SC, k is a small constant, so this result implies that these algorithms can be implemented directly from CAS objects of the same size with asymptotically no time overhead and Θ(kP²) additive space overhead. This extra space is paid for once across all the algorithms running on a system, so asymptotically, there is little benefit to using CAS over LL/SC. Our algorithm can also be extended to implement M L-word LL/SC objects in Θ(L) time for LL and SC, O(1) time for VL, and Θ((M + kP²)L) space. All our implementations use only pointer-width read, write, and CAS.

To achieve these bounds, we begin by implementing a new primitive called Single-Writer Copy (swcopy) which takes a pointer to a word-sized memory location and atomically copies its contents into another object. The restriction is that only one process is allowed to write/copy into the destination object at a time. We believe this primitive will be very useful in designing other concurrent algorithms as well.

1 Introduction

In lock-free, shared memory programming, it is well known that the choice of atomic primitives makes a big difference in terms of ease of programmability, efficiency, and even computability. Most processors today support a basic set of synchronization primitives such as Compare-and-Swap, Fetch-and-Add, Fetch-and-Store, etc. However, many useful primitives are not supported, which motivates the need for efficient software implementations of them. In this work, we present constant time, space-efficient implementations of a widely used primitive called Load-Link/Store-Conditional (LL/SC) as well as a new primitive we call Single-Writer Copy (swcopy). Our implementations use only pointer-width read, write, and CAS. In particular, restricting ourselves to pointer-width operations means that we do not use unbounded sequence numbers, which are often used in other LL/SC from CAS implementations [23, 22, 18]. Many other algorithms based on CAS also use unbounded sequence numbers (often alongside double-wide CAS), sometimes called the IBM tag methodology [20, 14]. Our LL/SC implementation can be used to avoid the use of unbounded sequence numbers and double-wide CAS in these algorithms.

We implemented a Single-Writer Atomic Copy (swcopy) primitive and found that it greatly simplified our implementation of LL/SC. We believe it will be useful in a wide range of other applications as well. The swcopy primitive can be used to atomically read one memory location and write the result into another. The memory location being read can be arbitrary, but the location being written to has to be a special Destination object. A Destination object supports three operations, read, write, and swcopy, and it allows any process to read from it, but only a single process to write or swcopy into it. We expect this primitive to be very useful in concurrent algorithms that use announcement arrays, as it allows the algorithm to atomically read a memory location and announce the value that it read. This primitive can also be used to solve various problems related to resource management, such as concurrent reference counting, in a constant (expected) time, wait-free manner [8].

In this work, we focus on wait-free solutions. Roughly speaking, wait-freedom ensures that all processes make progress regardless of how they are scheduled. In particular, this means wait-free algorithms do not suffer from problems such as deadlock and livelock. All algorithms in this paper take O(1) or O(L) time (where L is the number of words spanned by the implemented object), which is stronger than wait-freedom. The correctness condition we consider is linearizability, which intuitively means that all operations appear to take effect at a single point. In our results below, the time complexity of an operation is the number of instructions that it executes (both local and shared) in a worst-case execution, and the space complexity of an object is the number of words that it uses (both local and shared). Counting local objects/operations is consistent with previous papers on the topic [4, 22, 18]. There has been a significant amount of prior work on implementing LL/SC from CAS [5, 23, 16, 18, 22], and we discuss it in more detail in Section 2.
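To make the ABA problem mentioned above concrete, consider the following illustrative sketch (not from this paper) of a lock-free stack pop written with CAS; the Node type and the CAS syntax are assumptions for the example only.

    // Illustrative sketch: the ABA problem with a CAS-based pop.
    // Between reading t and performing the CAS, other processes may pop t,
    // pop t->next, and push t back. The CAS then succeeds even though the
    // stack changed, installing a stale next pointer. An SC in place of
    // the CAS would fail here, because top was written in the interim.
    Node* pop(Node** top) {
      while (true) {
        Node* t = *top;
        if (t == NULL) return NULL;
        Node* next = t->next;            // may be stale by the time of the CAS
        if (CAS(top, t, next)) return t; // ABA: can succeed incorrectly
      }
    }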

Result 1 (Load-Link/Store-Conditional): A collection of M LL/SC objects operating on L-word values shared by P processes, each performing at most k outstanding LL operations, can be implemented with:

1. Θ(L) time for LL and SC, O(1) time for VL,

2. Θ((M + kP²)L) space,

3. single word (at least pointer-width) read, write, CAS.

For many data structures implemented from LL/SC, such as Fetch-And-Increment [10] and various universal constructions [12, 7, 1], k is at most 2. Our result implies that we can implement any number of such data structures from equal-sized CAS while maintaining the same time complexities and using only Θ(P²) additional space across all the objects. In contrast, implementing LL/SC from equal-sized CAS using previous approaches [4, 5, 16, 23] would require Ω(P) space per LL/SC object. Θ(P²) space overhead is very small compared to the memory size of most machines, so this result says that there is almost no disadvantage, from an asymptotic complexity perspective, to using LL/SC rather than CAS.
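As a concrete illustration of why k is small, the sketch below implements a counter from a single LL/SC object in the C++-style pseudocode used in this paper. It is the standard lock-free retry loop rather than the wait-free construction of [10], and the word-sized LLSC wrapper type is an assumption of the example; since every LL is followed by an SC before the next LL, k = 1.

    // Sketch (assumes a word-sized LLSC object with LL() and SC()).
    int fetch_and_increment(LLSC* counter) {
      while (true) {
        int v = counter->LL();    // link to the current value
        if (counter->SC(v + 1))   // succeeds iff no SC since our LL
          return v;               // immune to ABA by the semantics of SC
      }
    }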

We also implement a Destination object supporting read, write and swcopy with the following bounds.

Result 2 (Single-Writer Copy): A collection of M Destination objects shared by P processes can be implemented with:

1. O(1) worst-case time for read, write, and swcopy

2. Θ(M + P²) space

3. single word (at least pointer-width) read, write, CAS.

To help implement the Destination objects, we implement a weaker version of LL/SC with the bounds below. Our version of weak LL/SC is a little different from what was previously studied [4, 16, 23]. We compare the two in more detail in Section 2.

Result 3 (Weak Load-Link/Store-Conditional): A collection of M weak LL/SC objects operating on L-word values shared by P processes, each performing at most one outstanding wLL, can be implemented with:

1. Θ(L) time for wLL and SC, O(1) time for VL,

2. Θ((M + P²)L) space,

3. single word (at least pointer-width) read, write, CAS.

Our implementations of swcopy and LL/SC are closely related. We begin in Section 4 by implementing a weaker version of LL/SC (Result 3). Then, in Section 5, we use this weaker LL/SC to implement swcopy (Result 2), and finally, in Section 6, we use swcopy to implement the full semantics of LL/SC (Result 1). As we shall see, once we have swcopy, our algorithm for regular LL/SC is almost the same as our algorithm for weak LL/SC.

2 Related Work

LL/SC from CAS. Results for implementing LL/SC from CAS are summarized in Table 1. The column titled "Size of LL/SC Object" lists the largest possible LL/SC object supported by each algorithm. For example, W − 2 log P means that the implemented LL/SC object can store at most W − 2 log P bits, and LW means that the implemented object can be arbitrarily large. All the algorithms shown in the table are wait-free and have optimal time bounds. The time and space bounds listed in Table 1 are for the common case where k is a constant. So far, all previous algorithms suffer from one of three drawbacks: they either (1) are not both wait-free and constant time [9, 15], (2) use unbounded sequence numbers [23, 22, 18, 19], or (3) require Ω(MP) space [5, 16, 3, 4, 17, 23]. There are also some other desirable properties that an algorithm can satisfy. For example, the algorithms by Jayanti and Petrovic [19] and Doherty et al. [9] do not require knowing the number of processes in the system. Also, some algorithms are capable of implementing multi-word LL/SC from single-word CAS, whereas others only work when LL/SC values are smaller than the word size.

Weak LL/SC from CAS. A variant of WeakLLSC was introduced by [4] and also studied in [16, 23]. The version we consider is even less restrictive than theirs because they require a failed wLL operation to return the process id of the SC operation that caused it to fail, whereas we don't require failed wLL operations to return anything. While prior work is able to implement the stronger version of wLL, it either employs stronger primitives like LL/SC [4], uses unbounded sequence numbers [23], requires O(MP) space for M WeakLLSC objects [4, 16], or requires storing (4 log P) bits in a single word [16]. To match the bounds stated in Result 3, we define and implement a version of weak LL/SC that is sufficient for our swcopy algorithm. Conveniently, the majority of our weak LL/SC algorithm from Section 4 can be reused when implementing full LL/SC in Section 6.

Atomic Copy. A similar primitive called memory-to-memory move was studied in Herlihy's wait-free hierarchy paper [11]. The primitive allows atomic reads and writes to any memory location and supports a move instruction which atomically copies the value at one memory location into another. Herlihy showed that this primitive has consensus number infinity. Our swcopy is a little different because it allows arbitrary atomic operations (e.g., Fetch-and-Add, Compare-and-Swap, Write, etc.) on the source memory location, as long as the source object supports an atomic read. Another difference is that we restrict the destination of the copy to be single-writer. Herlihy's proof that memory-to-memory move has unbounded consensus number would also work with our swcopy primitive. This means that swcopy objects (or more precisely, the Destination objects defined in Section 5.1) also have consensus number infinity.

Prior Work                     | Word Size (W)         | Size of LL/SC Object | Time  | Space
Anderson and Moir [5], Figure 1 | W > 2 log P          | W − 2 log P          | O(1)  | O(P²M)
Moir [23], Figure 4            | unbounded¹            | W − tag_size         | O(1)  | O(P + M)
Moir [23], Figure 7            | W > 3 log P           | W − 3 log P          | O(1)  | O(P² + PM)
Jayanti and Petrovic [16]      | W ≥ 4 log P           | W                    | O(1)  | O(PM)
Michael [22]                   | unbounded¹            | LW                   | O(L)² | O((P² + M)L)
Jayanti and Petrovic [18]      | unbounded¹            | LW                   | O(L)  | O((P² + M)L)
Jayanti and Petrovic [19]      | unbounded¹            | W                    | O(1)  | O(P² + PM)
Aghazadeh et al. [3]           | W ≥ 2 log M + 6 log P | LW                   | O(L)  | O(MP⁵L)
Anderson and Moir [4], Figure 2 | W ≥ ptr_size         | LW                   | O(L)  | O(P²ML)
This Paper                     | W ≥ ptr_size          | LW                   | O(L)  | O((P² + M)L)

Table 1: Cost of implementing M LL/SC variables from CAS. Sizes are measured in bits. The time and space bounds are for the common case where the maximum number of outstanding LL operations per process is a constant. (¹ Uses unbounded sequence numbers. ² Amortized expected time.)


3 Preliminaries

We work in the standard asynchronous shared memory model [6] with P processes communicating through base objects that are either registers, CAS objects, or LL/SC objects. Processes may fail by crashing. All base objects are word-sized, and we assume they are large enough to store pointers into memory. In our model, an execution (or equivalently, execution history) is an alternating sequence of configurations and steps C0, e1, C1, e2, C2, . . ., where C0 is an initial configuration. Each step is a shared operation on a base object. Configuration Ci consists of the states of all base objects and all processes after step ei is applied to configuration Ci−1. If configuration C precedes configuration C′ in an execution, the execution interval from C to C′ is the set of all configurations and steps between C and C′, inclusive. Similarly, the execution interval of an operation is the set of all configurations and steps from the first step of that operation to the last step of that operation. The execution interval of an incomplete operation is the set of all configurations and steps starting from the first step of that operation.

We say the implementation of an object is linearizable [13] if, for every possible execution and for each operation on that object in the execution, we can pick a configuration or step in its execution interval to be its linearization point, such that the operation appears to occur instantaneously at this point. In other words, all operations on the object must behave as if they were performed sequentially, ordered by their linearization points. If multiple operations have the same linearization point, then an ordering must be defined among these operations.


All implementations that we discuss will be wait-free. This means that each operation by any non-faulty process pi is guaranteed to complete within a finite number of steps by pi. Consider an execution where the base objects are LL/SC objects. If a process performs an LL operation, then the LL is considered to be outstanding until the process performs a corresponding SC on the same object. For algorithms that use LL/SC as base objects, we frequently use k to denote the maximum number of outstanding LL operations per process at any point during an execution. If k = 1, then each process alternates between performing LL and SC.

4 Weak LL/SC from CAS

As a subroutine, our swcopy operation makes use of a weaker version of LL/SC. This weaker version supports three operations, wLL, VL and SC, and works the same way as regular LL/SC except that wLL is allowed to not return anything if the subsequent SC is guaranteed to fail. We call a wLL operation successful if it returns a value; otherwise, we call it unsuccessful. We call an SC operation successful if it returns true and unsuccessful otherwise. Note that a wLL operation can only be unsuccessful if it is concurrent with a successful SC. We assume that VL and SC are only performed if the previous wLL by that process was successful. A wLL operation is considered outstanding if it is successful and there has not yet been a corresponding SC operation. In Section 4.1, we present a constant time algorithm for weak LL/SC in the case where the maximum number of outstanding wLL operations per process is one. This version of weak LL/SC is sufficient to implement the other algorithms in our paper.
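To illustrate how a client copes with unsuccessful wLLs, here is a minimal sketch in the same pseudocode style (a one-word value is assumed for brevity): an empty wLL means a concurrent SC succeeded, so the operation simply retries, and every successful wLL is followed by an SC so that at most one wLL is outstanding at a time.

    // Sketch: lock-free increment of a one-word WeakLLSC object X.
    void increment(WeakLLSC* X) {
      while (true) {
        optional<Value> cur = X->wLL();
        if (!cur.hasValue()) continue;       // unsuccessful wLL: an SC succeeded
        if (X->SC(cur.value() + 1)) return;  // succeeds iff unchanged since wLL
      }
    }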

4.1 Implementation of Weak LL/SC In this section, we show how to implement M WeakLLSC objects, each spanning L words, in wait-free constant time and O((M + P²)L) space. The high level idea is to use a layer of indirection along with an algorithm similar to Hazard Pointers [21] to bound the memory usage. Each WeakLLSC object is represented using a pointer, buf, to an L-word buffer storing the current value of the object. To perform an SC, the process first allocates a new L-word buffer, writes the new value into it, and then tries to write a pointer to this buffer into buf with a CAS. A wLL operation simply reads buf and returns the value that it points to. The problem with this algorithm is that it uses an unbounded amount of space. Our goal is to recycle buffer objects so that we use at most O(M + P²) of them. The idea of recycling buffers is an important part of many previous algorithms [18, 20, 2]. However, since we are only interested in implementing WeakLLSC, we are able to avoid using unbounded sequence numbers and provide better time/space complexities.

We recycle buffers with a variant of Hazard Pointers that is worst-case constant time rather than expected constant time. Before accessing a buffer, a wLL operation has to first protect it by writing its address to an announcement array. To make sure that its announcement happened "in time", the wLL operation re-reads buf and makes sure it is the same as what was announced. If buf has changed, then the wLL operation can return empty, because it must have been concurrent with a successful SC and it can linearize immediately before the linearization point of that SC. If buf is equal to the announced pointer, then the buffer has been protected and the wLL operation can safely read from it. A VL operation by process pi simply checks if buf is equal to the buffer announced by its previous wLL operation. If so, it returns true; otherwise, it returns false.

For the purpose of the SC operation, each process maintains two lists of buffers: a free list (flist) and a retired list (rlist). In an SC operation, the process allocates by popping a buffer off its local free list. If the CAS instruction performed by the SC is successful, the process adds the old value of the CAS to its retired list. Each process's free list starts off with 2P buffers, and we maintain the invariant that the free list and retired list always add up to 2P buffers. When the free list becomes empty and the retired list hits 2P buffers, the process moves some buffers from the retired list to the free list. To decide which buffers are safe to reuse, the process scans the announcement array (the scan does not have to be atomic) and moves a buffer from the retired list to the free list if it was not seen in the array.

1  shared variables:
2    Buffer* A[P]; // announcement array

4  local variables:
5    list flist;
6    list rlist;
7    // initial size of flist is 2P
8    // rlist is initially empty

10 struct Buffer {
11   // Member Variables
12   Value[L] val;
13   int pid;
14   bool seen;

16   void init(Value[L] initialV) {
17     copy initialV into val
18     pid = -1; seen = 0; } };

20 struct WeakLLSC {
21   // Member Variables
22   Buffer* buf;

24   // Constructor
25   WeakLLSC(Value[L] initialV) {
26     buf = new Buffer();
27     buf->init(initialV); }

28   optional wLL() {
29     Buffer* tmp = buf;
30     A[pid].write(tmp);
31     if(buf == tmp)
32       return tmp->val;
33     else return empty; }

35   bool VL() {
36     Buffer* old = A[pid].read();
37     return buf == old; }

39   bool SC(Value[L] newV) {
40     Buffer* old = A[pid].read();
41     Buffer* newBuf = flist.pop();
42     newBuf->init(newV);
43     bool b = CAS(&buf, old, newBuf);
44     if(b) retire(old);
45     else flist.add(newBuf);
46     A[pid].write(NULL);
47     return b; }

49   void retire(Buffer* old) {
50     rlist.add(old);
51     if(rlist.size() == 2*P) {
52       list reserved = [];
53       for(int i = 0; i < P; i++)
54         reserved.add(A[i].read());
55       newlyFreed = rlist \ reserved;
56       rlist.remove(newlyFreed);
57       flist.add(newlyFreed); }}};

Figure 1: Amortized constant time implementation of L-word Weak LL/SC from CAS. Code for process with id pid.

Since the process sees at most P different buffers in the announcement array during its scan, its free list's size is guaranteed to be at least P after this step. In a later paragraph, we show how this step can be performed in worst-case O(P) time, which amortizes over the number of free buffers found.

Pseudo-code is shown in Figure 1. In the pseudo-code, we use A[i].read and A[i].write to read from and write to the announcement array A. Since each element of the announcement array is a pointer type, read and write are trivially implemented using the corresponding atomic instructions. We wrap these instructions in function calls so that the code can be reused in Section 6.1. The argument from the previous paragraph also implies that flist cannot be empty on line 41, so we do not run the risk of dereferencing an invalid pointer on line 42. In the pseudo-code, we use T* to denote a pointer to an object of type T and Value[L] to denote an array of L word-sized values. If var is a variable, &var denotes the address of that variable.

Initialization. Each WeakLLSC object starts off pointing to a different Buffer object, and each free list is initialized with 2P distinct Buffers. Buffers in the free lists are not pointed to by any of the WeakLLSC objects, and no Buffer appears in two free lists. This property is maintained as the algorithm executes.

Linear-time set difference. The operation rlist \ reserved on line 55 represents set difference. What makes Hazard Pointers expected rather than worst-case constant time is that they use a hash table to perform this set difference. Instead, we add some space for meta-data in each Buffer object so that it can store a process id, pid, and a bit, seen. To perform the set difference rlist \ reserved, the process first visits each buffer B in rlist and prepares the buffer by setting B.pid to its own process id and setting B.seen to false. Then, the process loops through reserved and, for each buffer, sets seen to true if pid equals its own process id. Next, the process loops through rlist again and constructs a list of the buffers that have not been seen. This list is the result of the set difference. Finally, the process has to reset everything by setting B.pid to ⊥ for each B in rlist.

Deamortization. So far, the algorithm we have described takes amortized constant time. To deamortize it, each process can maintain two pairs of retired and free lists. Each time the process pops from one free list, it performs a constant amount of work towards populating the other.

Space complexity. The algorithm uses P words of shared space for the announcement array, O(P²) local space for all the retired and free lists, and O((M + P²)L) shared space for all the buffers and WeakLLSC objects. Therefore, its total space usage is O((M + P²)L). In addition, it only uses pointer-width read, write, and CAS as atomic operations, so it fulfills the claims in Result 3.
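Before proving this scheme correct, we spell out the set-difference computation described above as a sketch (names follow Figure 1; the list iteration syntax is illustrative). It makes three passes over rlist and one over reserved, so it runs in O(P) time with no hashing.

    // Sketch of rlist \ reserved using the pid and seen fields of Figure 1.
    // Lemma 4.1 argues this is safe: these fields of a retired buffer are
    // only ever written by the process that currently has it retired.
    list setDifference(list rlist, list reserved) {
      for (Buffer* b : rlist) { b->pid = pid; b->seen = false; }
      for (Buffer* b : reserved)
        if (b != NULL && b->pid == pid) b->seen = true; // still announced
      list newlyFreed = [];
      for (Buffer* b : rlist)
        if (!b->seen) newlyFreed.add(b);                // safe to reuse
      for (Buffer* b : rlist) b->pid = -1;              // reset, as in init()
      return newlyFreed;
    }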

4.2 Correctness Proof We begin by defining some useful terms and then reasoning about the lifecycle of a buffer. We will use M to denote the number of WeakLLSC objects. A buffer can be in one of the following 2P + M possible states: it can be pointed to by a WeakLLSC object, it can be in the retired list of some process, or it can be in the free list of some process. We consider a buffer to be in the retired list of a process if it is in that process's rlist or if no WeakLLSC object points to it and it is about to be added to that process's rlist. Similarly, we consider a buffer to be in the free list of a process if it is in that process's flist or if it has been popped off that process's flist and not yet written into any WeakLLSC object. We can show by induction that a buffer cannot be in two different states at the same time. For example, if a buffer is in a process's free list, then it cannot be in any process's retired list and it cannot be pointed to by any WeakLLSC object. We will make use of this fact several times throughout our correctness proof. The next step is to prove that the linear-time algorithm we described for set difference is correct.

Lemma 4.1. The algorithm we described for linear-time set difference (Section 4.1) is correct when used on line 55 of Figure 1.

Proof. Recall that in the set difference algorithm, each buffer has an extra seen and pid field, and that these fields are only accessed during the set difference computation. We begin by arguing that the set difference algorithm is correct as long as no process writes to the pid or seen field of a buffer that is in another process's retired list. Recall that the first step of the algorithm (when executed by pi) is to set pid to i and seen to false for each buffer in pi's retired list. Then, for each buffer in reserved, it sets seen to true if pid equals i. The set of buffers in pi's retired list with seen equal to false are returned. Note that pi's retired list remains the same throughout this computation. If no other process writes to the pid or seen field of any buffer in pi's retired list, then this computation behaves as if it were executed in a sequential setting, and so it returns the correct value.

All that remains is to prove that no process writes to the pid or seen field of a buffer that is in another process's retired list. Since no buffer can be in two different retired lists, it suffices to show that whenever pi writes to the pid or seen field of a buffer, that buffer is in pi's retired list. From the description of the algorithm, we can see that this holds for the pid fields. We focus on proving this for the seen fields. The only place where this could potentially not hold is when process pi loops through the buffers in reserved and, for each buffer, sets seen to true if pid equals i. We argue that a buffer's pid equals i only if the buffer is in pi's retired list. The pid field of each buffer is initially ⊥, and during a set difference operation by process pi, the pid fields of all the buffers in pi's retired list get temporarily set to i and then reset to ⊥. Since pi's retired list stays the same throughout its set difference operation, all the pid fields that get set to i are reset to ⊥. Therefore, whenever pi sees a buffer with pid equal to i, it knows the buffer is in its retired list. From the description of the algorithm, we can see that pi only sets seen to true if the buffer is in its retired list.

Next, we define the linearization points for wLL, VL and SC operations.

Definition 4.2. The linearization point of an SC operation is on line 43. VL operations are linearized on line 37. For a wLL operation, its linearization point depends on whether or not it was successful. A successful one is linearized on line 31, whereas an unsuccessful one is linearized at its first step.

Let E be an execution history of the WeakLLSC implementation. We assume E is a valid execution history where pi invokes an SC or a VL on the object X only after a successful wLL on X. We also assume that there is at most one outstanding wLL operation per process in E. At each configuration, we define the value of a WeakLLSC object X to be the L-word value stored in X.buf->val. To prove that a WeakLLSC object X is linearizable, it suffices to prove the following properties:

1. The value of X only changes at the linearization point of a successful X.SC operation.

2. The linearization point of a successful X.SC(newV) operation changes the value of X to newV.

3. A successful X.wLL operation returns the value of X at its linearization point.

4. An X.SC operation S by process p is successful if and only if no successful X.SC is linearized between the linearization points of S and the last successful X.wLL before S by process p.

5. An X.VL operation V by process p is successful if and only if no successful X.SC is linearized between the linearization points of V and the last successful X.wLL before V by process p.

6. If an X.wLL is unsuccessful, then a successful X.SC is linearized during its execution interval.

To help prove these properties, we make the following observations that are easy to verify by examining the pseudo-code. The first observation is a weaker version of Property 1.

Observation 4.3. The value of X.buf can only be changed at the linearization point of a successful X.SC operation.

The following observation states the converse, and it holds because a process's free list never contains a buffer that is being pointed to by Y.buf for any WeakLLSC variable Y. This implies that old != newBuf at the linearization point of each successful SC operation.

Observation 4.4. Each successful X.SC operation changes X.buf at its linearization point.

Proof of Property 1. Let s be a step in the execution history E. Suppose s is not the linearization point of a successful X.SC operation. We want to show that s could not have changed the value of X. By Observation 4.3, we know that s could not have changed the value of X.buf. Therefore, we only need to worry about writes to the array X.buf->val. This array can only be written to while its buffer is in some process's free list, and the buffer cannot be in any process's free list because it is being pointed to by X.buf.

Proof of Property 2. To prove this property, we just need to show that newBuf->val equals newV on line 43 of an SC(newV) operation. This holds because newV was written to newBuf->val on line 42, and no other process can write to newBuf->val between the start of line 42 and the execution of line 43. This is because a process can only write to buffers that it has in its free list.

Proof of Property 3. A successful X.wLL operation returns on line 32. This line is not atomic because tmp->val could be an array of words. Since tmp->val contains the value of X at line 31 by definition, it suffices to show that this array cannot be written to between line 31 and the end of the wLL operation. This would mean that the X.wLL operation sees a consistent snapshot of the array on line 32, and moreover, it would mean that the L-word value that X.wLL reads from tmp->val is equal to the value of X at the X.wLL's linearization point. Thus, it suffices to show that tmp->val cannot be written to between line 31 and the end of the wLL operation. To show this, we take advantage of the fact that X.buf and A[i] both point to the same buffer, tmp, at line 31. Also note that A[i] remains equal to tmp until the end of the wLL operation. As long as A[i] equals tmp, tmp cannot appear in the free list of any process, and we prove this fact in Claim 4.5. If tmp cannot appear in the free list of any process between line 31 and the end of the wLL operation, then the array tmp->val cannot be written to in this interval.

Claim 4.5. If at configuration C, both X.buf and A[i] point to buffer b, then until A[i] changes, b cannot appear in the free list of any process.

Proof. At configuration C, b cannot be in any process’s retired or free lists because it is being pointed to by X.buf. In order for b to appear in a free list, b must first be added to a process’s retired list, then move onto that process’s free list. However, after b is added to a process’s retired list, that process will not add b to its free list as long as A[i] points to b.

Proof of Properties 4 and 5. Both properties follow directly from the following claim.

Claim 4.6. Let O be either an SC or a VL operation, and let L be the last successful wLL operation before O by the same process. O returns true if and only if no successful SC operation is linearized between the linearization points of L and O.

Proof. Let pi be the process that performed O. Since there is at most one outstanding wLL per process, we know that pi does not perform any wLL operations on any other WeakLLSC object between L and O. Therefore, A[i] does not change between the linearization points of L and O. This means that O returns true if and only if the value of X.buf is the same at the linearization points of L and O. Thus, it suffices to show that the value of X.buf is the same at the two linearization points if and only if no successful SC operation is linearized between them. The backwards direction follows directly from Observation 4.3. For the forwards direction, let b represent the value of X.buf at the linearization points of L and O. Since L was a successful wLL, we know that at the linearization point of L, X.buf and A[i] both store the pointer b. Since A[i] does not change between the linearization points of L and O, by Claim 4.5, b cannot appear in the free list of any process during this interval. Suppose for contradiction that a successful SC operation S′ is linearized in this interval. S′ must have changed X.buf by Observation 4.4. Also by Observation 4.4, another X.SC operation that changed X.buf back to b must have been linearized between the linearization points of L and O. For its SC to succeed, that operation must have popped b off some process's free list, which is a contradiction because b cannot appear in the free list of any process during this interval.

Proof of Property 6. In order for an X.wLL to be unsuccessful, the value of X.buf must have changed between lines 29 and 31. By Observation 4.3, there must have been a successful X.SC that linearized in this interval, which completes the proof.

5 Single-Writer Atomic Copy

The copy primitive, swcopy, can be used to atomically read a value from some source memory location and write it into a Destination object. It is similar to the memory-to-memory move primitive that was studied in [11], except that our Destination objects are single-writer and we allow the source memory location to be modified by any instruction (e.g. write, fetch-and-add, swap, CAS, etc). The sequential specifications of swcopy and Destination objects are given below.

Definition 5.1. A Destination object supports three operations, read, write and swcopy, with the following sequential specifications:

• read(): returns the current value in the Destination object (initially ⊥).

• write(Value v): sets v as the current value of the Destination object.

• swcopy(Value* addr): reads the value pointed to by addr and sets it as the current value of the Destination object.

Any number of processes can perform read operations, but only one process is allowed to write or swcopy into a particular Destination object.
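In the pseudocode style of this paper, the interface of Definition 5.1 looks as follows (a sketch; the representation we actually use appears in Figure 2):

    // Sketch of the Destination interface from Definition 5.1.
    struct Destination {
      Value read();              // may be invoked by any process
      void write(Value v);       // invoked only by the single designated writer
      void swcopy(Value* addr);  // also writer-only: atomically reads *addr
                                 // and installs it as the current value
    };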

We restricted this interface to be single-writer because it was sufficient for the use cases we consider. We find that single-writer Destination objects are very useful in announcement array based algorithms where it is beneficial for the read and the announcement to happen atomically. It’s possible to generalize this interface to support atomic copies that concurrently write to the same destination object. However, it is unclear what the desired behavior should be in this case. One option would be to give atomic copy ‘store’ semantics where the value of the Destination object is determined by the last write or copy to that location. Another option would be to give atomic copy ‘CAS’ semantics where the copy is only successful if the Destination object stores the expected value. The right choice of definition will likely depend on the potential application. Section 5.1 describes our implementation of swcopy.
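As an example of the announcement-array pattern, the sketch below protects a shared pointer the way a hazard pointer would, but without the usual announce-then-recheck loop: because swcopy reads and announces in one atomic step, the announced value is guaranteed to equal the value of the source at the swcopy's linearization point. The names Ann and src are illustrative.

    // Sketch: atomically read a shared location and announce what was read.
    // Ann[pid] is this process's Destination slot; src is the shared location.
    Value protectAndRead(Destination* Ann, Value* src) {
      Ann[pid].swcopy(src);    // read *src and publish it in one atomic step
      return Ann[pid].read();  // the value returned equals the value announced
    }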

5.1 Algorithm for Single-Writer Atomic Copy In this section, we show how to implement Destination objects that support read, write, and swcopy in O(1) time and O(M + P²) space (where M is the number of Destination objects). Our algorithm only requires pointer-width read, write, and CAS instructions.

We represent a Destination object D internally using a triplet: D.val, D.ptr, and D.old. When there is no swcopy in progress, D.val stores the current value of the Destination object. When there is a copy in progress, D.ptr stores a pointer to the location that is being copied from. Operations that see a copy in progress will help complete the copy. Finally, D.old stores the previous value of the Destination object. The variables D.val and D.ptr are stored together in a WeakLLSC object (defined in Section 4). This allows us to read from and write to them atomically, as well as prevent any potential ABA problems. The downside is that the only way to read D.val or D.ptr is through a wLL operation, which can repeatedly fail due to concurrent SC operations. For this reason, we keep D.old in a separate object, so that readers can return D.old if they fail too many times on wLL. Readers will only perform SC operations that change D.ptr from a non-NULL value to NULL. Therefore, the writer's wLL will be successful whenever D.ptr is NULL. We will maintain the invariant that D.ptr is NULL whenever there is no concurrent swcopy. We also ensure that D.ptr changes exactly twice during each swcopy. The first change writes a valid pointer and the second change resets it back to NULL.

A swcopy(Value* src) on Destination object D begins by backing up the current value from D.val into D.old. At this point, D.ptr is guaranteed to be NULL, so the writer can successfully read D.val with a wLL. The swcopy proceeds by writing src into D.ptr with an SC. Finally, it reads the value v pointed to by src and tries to write (v, NULL) into (D.val, D.ptr) with an SC. It is not a problem if this SC fails, because that means another process has helped complete the copy.

To read from D, a process begins by trying to read the pair (D.val, D.ptr) with a wLL. If it fails on this wLL twice, then it is safe to return D.old, because the value of D has been updated at least once during this read. Now we focus on the case where one of the wLLs succeeds and reads (D.val, D.ptr) into local variables (val, ptr). If ptr is NULL, then val stores the current value, which the read returns. If ptr is not NULL, then there is a concurrent swcopy operation, and the read tries to help by reading the value v referenced by ptr and writing (v, NULL) into (D.val, D.ptr) with an SC. If the SC is successful, then the read returns v. Otherwise, the process performs one last wLL. If it is successful and sees that D.ptr is NULL, then it returns D.val. Otherwise, it is safe to return D.old.

1  struct Data {Value val; Value* ptr;};
2  struct Destination {
3    // Member Variables
4    WeakLLSC data;
5    // data is initially ⟨⊥, NULL⟩
6    Value old;

8    void swcopy(Value* src) {
9      // This wLL() cannot fail
10     old = data.wLL().val;
11     data.SC(⟨empty, src⟩);
12     Value val = *src;
13     optional d = data.wLL();
14     if(d.hasValue() && d.ptr != NULL)
15       data.SC(⟨val, NULL⟩); }

16   void write(Value new_val) {
17     // This wLL() cannot fail
18     old = data.wLL().val;
19     data.SC(⟨new_val, NULL⟩); }

21   Value read() {
22     optional d = data.wLL();
23     if(!d.hasValue()) {
24       d = data.wLL();
25       if(!d.hasValue()) return old; }
26     if(d.ptr == NULL) return d.val;
27     Value v = *(d.ptr);
28     if(data.SC(⟨v, NULL⟩)) return v;
29     d = data.wLL();
30     if(d.hasValue() && d.ptr == NULL)
31       return d.val;
32     return old; } };

Figure 2: Atomic copy (single-writer). Code for process with id pid.

The write operation is the most straightforward to implement. Since each Destination object has only a single writer, a write operation simply uses a wLL and an SC to store the new value into D.val. There cannot be any successful SC operations concurrent with this wLL, because the other processes can only succeed on an SC during a swcopy operation. Therefore, the wLL and the SC performed by the write will both always succeed. The write operation also needs to keep D.old up to date, so it updates D.old before performing the SC.

In our algorithm, we assumed that the source objects fit in a single word so that they can be atomically read from and written to. However, this assumption is not necessary. The algorithm can be generalized to work for larger source objects as long as they support an atomic read operation.

Pseudo-code is shown in Figure 2. From the pseudo-code, we can see that each operation takes constant time. To implement M Destination objects, the algorithm uses M WeakLLSC objects, each spanning two words, and O(M) pointer-width read, write, CAS objects. Using the algorithm from Result 3 to implement the WeakLLSC objects, we get an overall space usage of O(M + P²), which satisfies the properties in Result 2.

5.2 Correctness Proof

We begin by defining the linearization points of write and swcopy. The linearization point of read is more complicated, so we defer its definition until later. Each write operation is linearized on line 19. For swcopy operations, we will prove in Claim 5.4 that there exists exactly one SC instruction S during the swcopy that sets data.ptr to NULL, and that this instruction either happens on line 15 of the swcopy or on line 28 of a concurrent read R. If S is from line 15, then the swcopy is linearized when it executes line 12. Otherwise, the swcopy is linearized on line 27 of R. We show in Claim 5.5 that this linearization point is contained in the execution interval of the swcopy. Note that partially complete swcopy operations without an SC instruction setting data.ptr to NULL are not linearized.

For the purposes of this proof, we will focus on an execution E consisting of operations on a single Destination object D. For simplicity, we will write data.ptr instead of D.data.ptr and swcopy instead of D.swcopy. At each configuration C in E, we define the current value of D to be the value written by the last modifying operation (either a write or a swcopy) linearized before C. To show that the algorithm in Figure 2 is correct, it suffices to show that the value returned by each read operation is the value of D at some step during the read. The read is linearized at that step.

We first prove two useful claims about the structure of the algorithm. Throughout the proof, it is important to remember that there can only be one write or swcopy operation active at any time. We say that a pointer is valid if it is not NULL.

Claim 5.2. Suppose the SC performed by a read operation is successful. Then data.ptr was valid at all configurations between line 26 of the read and the SC.

Proof. Let R be a read operation with a successful SC operation S on line 28. Let L be the successful wLL operation corresponding to S. L was either executed on line 22 of R or line 24 of R. Since S is successful, data.ptr cannot have changed between L and S. If data.ptr was NULL in this interval, then the if statement on line 26 would have evaluated to true, and S would not have been executed. Therefore, data.ptr is valid at all configurations between L and S, which includes all configurations between line 26 of R and S.

Claim 5.3. Suppose the SC on line 15 of a swcopy operation is successful. Then data.ptr is valid at all configurations between line 11 of the swcopy and the SC.

Proof. Let Y be a swcopy operation with a successful SC operation S on line 15. For S to be executed, the if statement on line 14 must evaluate to true, which means that the wLL operation L on line 13 must have been successful. Since S is a successful SC, data.ptr cannot have changed between L and S. Again, due to the if statement on line 14, data.ptr is valid in this interval. It remains to show that data.ptr is valid between lines 11 and 13. Suppose for contradiction that data.ptr is NULL in this interval. The only operation that can change data.ptr to be valid is on line 11 of swcopy, so data.ptr would have remained NULL until the end of Y . This contradicts the fact that data.ptr is valid between L and S. Therefore data.ptr is valid at all configurations between line 11 of Y and S.

The following two claims show that the linearization point of each swcopy operation is well defined and lies within its execution interval.

Claim 5.4. There is exactly one successful SC instruction during a swcopy Y that sets data.ptr to NULL and this SC instruction is either from line 15 of Y or line 28 of some read. Furthermore, this SC instruction is executed after the first SC of Y .

Proof. Let Yi be the ith swcopy operation in E. The order is well defined because there can be only one swcopy operation active at a time. We proceed by induction on i, alternating between two different propositions. Let Pi be the proposition that data.ptr equals NULL at the start of Yi. Let Qi be the proposition that Claim 5.4 holds for Yi. P1 acts as our base case, and for the inductive step, we show that Pi implies Qi and that Qi implies Pi+1.

For the base case, we know that data.ptr is initialized to NULL and can only be changed to something valid by the first SC of a swcopy operation. Therefore, data.ptr remains NULL until the first swcopy operation.

To show that Pi implies Qi, we use the same argument to argue that data.ptr is NULL between the first wLL/SC pair performed by Yi. By Claim 5.2, no SC operation from a read can succeed between the first wLL/SC pair of Yi. This means the first SC performed by Yi (on line 11) is guaranteed to succeed and set data.ptr to something valid. Between the first SC of Yi and the end of Yi, the only two operations that could possibly change data.ptr are the SC on line 15 of Yi and the SC on line 28 of a read operation. During this interval, if there are no successful SC operations from line 28, then the SC on line 15 of Yi is guaranteed to execute and succeed. This shows that there is at least one successful SC from line 15 or line 28 between the first SC and the end of Yi. By Claim 5.3, the SC on line 15 cannot succeed if data.ptr is NULL, and similarly for the SC on line 28 (Claim 5.2). Since the SCs on lines 15 and 28 both set data.ptr to NULL, at most one such SC can succeed between the first SC of Yi and the end of Yi. Therefore, Pi implies Qi.

All that remains is to show that Qi implies Pi+1. From Qi, we know that data.ptr gets set to NULL between the first SC of Yi and the end of Yi. It will remain NULL until the first SC of Yi+1, which means it is NULL at the beginning of Yi+1.

Claim 5.5. The linearization point of each swcopy operation Y lies between the first SC and the end of Y.

Proof. A swcopy operation Y is either linearized at line 12 of its own operation or at line 27 of a read operation R. Clearly, the claim holds in the former case, so we focus on the latter. By Claim 5.4, we know that the SC operation S on line 28 of R happens between the first SC of Y and the end of Y. This means that the successful wLL operation L corresponding to S must have happened after the first SC of Y and before S. From the code, we can see that line 27 of R (which is the linearization point of Y) happens between L and S. By transitivity, the linearization point of Y happens between the first SC of Y and the end of Y.

The next claim is useful for arguing that data.ptr is NULL at all configurations during a write operation and at all configurations between the beginning and the first SC of a swcopy.

Claim 5.6. data.ptr can only be valid between the first SC of a swcopy and the end of the swcopy.

Proof. data.ptr is initially NULL, and the only instruction that sets data.ptr to something valid is the first SC of a swcopy operation. By Claim 5.4, we know that after this SC instruction and before the end of the swcopy, data.ptr is set back to NULL. Therefore, data.ptr can only be valid between the first SC of a swcopy and the end of the swcopy.

Finally, we prove the main claim.

Claim 5.7. If data.ptr is NULL, then data.val stores the current value of D.

Proof. We will prove this by induction on the execution history E. The fields of D are initialized so that data.ptr stores NULL and data.val stores the initial value of D, so the claim holds for the initial configuration. Suppose, for induction, that the claim holds for some configuration C; we need to show that it holds for the next configuration C′. If D.data.ptr is valid in C′, then the claim is vacuously true, so suppose D.data.ptr is NULL at C′. Let S be the step between C and C′. There are four cases for S: either (1) S is a successful SC operation from line 15, (2) S is a successful SC operation from line 28, (3) S is a successful SC operation from line 19, or (4) S is not a successful SC on data.

In the first case, S is executed by a swcopy operation Y, which is linearized on line 12 of Y. The value written into data.val by S is equal to the value of the source location at the linearization point of Y. There cannot be any swcopy or write operation linearized between the linearization point of Y and S, so data.val stores the current value of D at C′.

For the second case, we first show that S occurs during some swcopy operation. Due to the if statement on line 26, S can only be successful if data.ptr is valid. By Claim 5.6, data.ptr can only be valid during a swcopy operation, which means that S must occur during some swcopy operation Y. By Claim 5.4, we know that Y is linearized on line 27 of the read operation that executed S. Since there can only be one swcopy or write at a time, there cannot be any other swcopy or write operation linearized between the linearization point of Y and S. Since the value written into data.val by S is equal to the value of the source location at the linearization point of Y, data.val stores the current value of D at C′.

For case (3), S is the linearization point of a write operation, and S writes the value of that write operation into data.val. This means data.val stores the current value of D at C′.

Finally, for the fourth case, suppose S is not a successful SC on data. This means the value of data.val remains unchanged between C and C′. By the inductive hypothesis, data.val stores the current value of D at C, so it suffices to show that no write or swcopy operation is linearized at S. By Claims 5.2 and 5.3, data.ptr is valid at the linearization point of a swcopy operation. Since data.ptr is NULL both before and after S, no swcopy operation can be linearized at S. To show that no write operation can be linearized at S, it suffices to show that the SC at the linearization point of a write operation is always successful. Let W be a write operation by process p. The only SC operations on data that can be concurrent with W are from read operations. By Claim 5.6, data.ptr is NULL for the duration of W, and by Claim 5.2, no SC from a read operation can succeed during W. Therefore, both the wLL and the SC performed by W are guaranteed to succeed.

Suppose R is a completed read operation that returns v. As previously noted, to prove that Figure 2 is a linearizable implementation of a Destination object, it suffices to show that there exists a step during R such that the value of the Destination object at that step is equal to v. We linearize R at that step. If there are multiple operations linearized at the same step, read operations are always linearized last. Note that there cannot be multiple write or swcopy operations linearized at the same step.

There are five possible return points for a read operation. If R returns on line 28 or 31, then on line 28 or 29 (respectively), we know that data.val equals v and data.ptr equals NULL. If R returns on line 26, then either on line 22 or line 24, data.val equals v and data.ptr equals NULL. By Claim 5.7, data.val stores the current value of the Destination object whenever data.ptr is NULL, so for these three return points there exists a step during R at which v is the current value.

Now suppose R returns on line 25 or 32 (i.e., the cases where R reads and returns the value in D.old). There must have been two successful SCs, S1 and S2, on D.data during R. In the case where R returns on line 25, these two successful SC operations occurred during the wLLs on lines 22 and 24. In the case where R returns on line 32, S1 was the one that caused the SC on line 28 to fail, and S2 occurred during the wLL on line 29. By Claims 5.2 and 5.3, there cannot be two successful SCs from lines 15 or 28 in a row without a successful SC from line 11 of swcopy or line 19 of write in between. Therefore, there must have been a successful SC either from line 11 of swcopy or from line 19 of write during R. We'll use S to denote this SC operation. In both cases, the line immediately before S updates D.old by first performing a wLL on data. By Claim 5.6, data.ptr equals NULL during this wLL operation, and since the only SC operations that could potentially cause it to fail are by read operations, by Claim 5.2, this wLL is guaranteed to succeed. By Claim 5.7, data.val stores the current value v′ at the time of this wLL operation. This value gets written into old, so old stores the current value immediately after this step. Since there is only a single write or swcopy at a time, old still contains the current value immediately before S. R reads and returns the value of old at its last step, so there are two cases: either R reads v′ from old, or it reads something newer. If R reads v′, then it returns the current value of D at the step immediately before S (which happens during R). If R reads something newer, then old must have been updated between S and the end of R. This can only happen on line 10 or on line 18, and we've already argued that old stores the current value of D on these two lines. Therefore, in either case, R returns a value that was the current value of D at some point during R.

6 LL/SC from CAS

Now we have all the tools we need to implement LL/SC from CAS (Result 1). We begin, in Section 6.1, by presenting an algorithm that works whenever there is at most one outstanding LL per process. Then in Section 6.3, we show how to generalize this to support k outstanding LLs per process.

6.1 Implementation of LL/SC from CAS This algorithm is almost identical to our algorithm for weak LL/SC from CAS (Section 4.1). To ensure that the LL operation always succeeds, we use swcopy to atomically read and announce the current buffer (replacing lines 29 and 30 of Figure 1). This means that the announcement array needs to be an array of Destination objects (from Section 5.1) rather than raw pointers. Other than that, the algorithm remains the same. Figure 3 shows the difference between this algorithm and the weak LL/SC algorithm from Figure 1. This algorithm uses O((M + P²)L) pointer-width read, write, CAS objects just like the one in Figure 1, but it also uses P Destination objects for the announcement array. From Result 2, we know that P Destination objects can be implemented in constant time and O(P²) space, so this algorithm achieves the bounds in Result 1.

1  Destination A[P];

3  struct LLSC {
4    Buffer* buf;
5    ...
6    Value[L] LL() {
7      A[pid].swcopy(&buf);
8      Buffer* tmp = A[pid].read();
9      return tmp->val; }
10 };

Figure 3: Amortized constant time implementation of L-word LL/SC from CAS. The algorithm is exactly the same as the one in Figure 1 except for the parts shown here. Code for process with id pid. Note that the type of the announcement array has changed, so the way we read from and write to the announcement array is also different.

6.2 LL/SC Correctness Proof (outline)

In the proof of correctness for our WeakLLSC algorithm, the key property we made use of is that at the linearization point of a successful D.wLL operation, A[pid] and D.buf both point to the same buffer. We linearize our LL operation from Figure 3 so that the same property holds. At the linearization point of the swcopy operation on line 7, both A[pid] and D.buf store the same value, so we linearize LL operations at this point. SC and VL operations are linearized just as they were in the WeakLLSC algorithm. This way, we can basically reuse the proof from Section 4.2. The first two properties hold without modification because SC operations are exactly the same in both algorithms. Property 6 is unnecessary because LL operations are always treated as successful. For Properties 3, 4 and 5, and Claims 4.5 and 4.6, we just need to replace line numbers from the wLL pseudo-code (in Figure 1) with line numbers from the LL pseudo-code (in Figure 3). For example, line 32 in Figure 1 would be replaced with line 9 in Figure 3.

6.3 Handling multiple outstanding LL operations per process So far, we’ve worked under the assumption that each process only has at most a single outstanding LL operation at any time. In this section, we slightly modify the algorithm from Section 6.1 so that it can handle up to k outstanding LL operations per process. The new algorithm has the same time complexity and Θ((M + kP 2)L) space complexity. Before describing the algorithm, we first have to modify the LL/SC interface slightly to efficiently support large k. In addition to returning a value, LL now also has to return a handle. This handle is passed as an argument to future SC, and VL operations. In a valid execution, we require that whenever process p performs a X.SC(newV, h) or a X.VL(h) operation, h must have been the handle returned by the previous X.LL operation by process p. The same interface modifications were done in [23]. Pseudo-code for the new algorithm is shown in Figure 4 with all the major modifications highlighted. The main change to give each process k announcement slots rather than a single one. To keep track of which ones are free, each process maintains a local list in a variable called freeSlots, which is initialized to {0, ..., k − 1}. Process pi begins an LL operation by popping an index i off the freeSlots list and using that index to determine which of its announcement slots to use. The LL operation then proceeds as before and it returns i as the handle. An SC(newV, slot) operation uses slot to determine which of its announcement slots was used by the corresponding LL. Before returning, the SC operation adds slot back to freeSlots so that it can be used again in future LL operations. We also support a CL(slot) (Clear-Link) operation which simply adds slot back to freeSlots and clears the corresponding location in the announcement array. This is used by the programmer to indicate that he or

1 shared variables:
2   Destination A[P][k];
3   // announcement array

5 local variables:
6   list freeSlots;
7   // initialized to {0,1,...,k-1}
8   list flist;
9   list rlist;
10  // initial size of flist: 2kP
11  // rlist is initially empty

13 struct Buffer {
14   // Member Variables
15   Value[L] val;
16   int pid;
17   bool seen;

19   void init(Value[L] initialV) {
20     copy initialV into val
21     pid = -1; seen = 0; } };

23 struct LLSC {
24   // Member Variables
25   Buffer* buf;

27   // Constructor
28   LLSC(Value[L] initialV) {
29     buf = new Buffer();
30     buf->init(initialV); }

28   <Value[L], int> LL() {
29     int slot = freeSlots.pop();
30     A[pid][slot].swcopy(&buf);
31     Buffer* tmp = A[pid][slot].read();
32     return <tmp->val, slot>; }

34   void CL(int slot) {
35     A[pid][slot].write(NULL);
36     freeSlots.push(slot); }

38   bool VL(int slot) {
39     Buffer* old = A[pid][slot].read();
40     return buf == old; }

42   bool SC(Value[L] newV, int slot) {
43     Buffer* old = A[pid][slot].read();
44     Buffer* newBuf = flist.pop();
45     newBuf->init(newV);
46     bool b = CAS(&buf, old, newBuf);
47     if(b) retire(old);
48     else flist.add(newBuf);
49     A[pid][slot].write(NULL);
50     freeSlots.push(slot);
51     return b; }

53   void retire(Buffer* old) {
54     rlist.add(old);
55     if(rlist.size() == 2kP) {
56       list reserved = [];
57       for(int i = 0; i < P; i++)
58         for(int j = 0; j < k; j++)
59           reserved.add(A[i][j].read());
60       newlyFreed = rlist \ reserved;
61       rlist.remove(newlyFreed);
62       flist.add(newlyFreed); }}};

Figure 4: Amortized constant time implementation of L-word LL/SC from CAS supporting k outstanding LLs per process. Code for process with id pid. The differences between this algorithm and those of Figures 1 and 3 are highlighted.

A process p will never run out of announcement slots because the number of slots that are "in-use" (i.e., not in freeSlots) is exactly the number of outstanding LL operations by p. In the retire operation, we now have to scan kP announcements, so to make the amortization argument work out, we ensure this scan happens at most once every kP calls to retire. This is achieved by initializing the free list (flist) with 2kP buffers and performing a full scan only when the retired list (rlist) reaches 2kP elements (line 55). Since at most kP buffers can be reserved by the announcement array (one per slot), each scan moves at least kP buffers from rlist to flist, so at least kP calls to retire occur between consecutive scans. All operations are still linearized at the same configurations as before. For example, LL operations are still linearized on line 30 and SC operations are still linearized on line 46. To see why these modifications preserve the correctness of the original LL/SC algorithm, consider an LL/SC pair by process p_i, and suppose h is the handle returned by the LL operation. The key property is that A[i][h] does not change between the linearization points of the LL and the SC, just as in the original algorithm. After applying the same deamortization and set difference techniques as in Section 4.1, we get Θ(L) time for LL and SC and O(1) time for VL and CL, while using Θ((M + kP^2)L) space. This algorithm satisfies all the properties from Result 1.
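As a usage sketch of the handle-based interface (all variable names here are hypothetical), a process with k ≥ 2 can keep two LLs outstanding on the same object and resolve each one with the handle it received:

    // Hypothetical usage of the interface from Figure 4, for one process.
    LLSC X(initialV);           // an L-word LL/SC object

    <v1, h1> = X.LL();          // first outstanding LL: value plus handle
    <v2, h2> = X.LL();          // second outstanding LL (requires k >= 2)

    if (X.VL(h1)) ...           // true iff no SC on X succeeded since that LL
    bool ok = X.SC(newV, h1);   // SC paired with the first LL via its handle
    X.CL(h2);                   // abandon the second LL without an SC,
                                // returning its slot to freeSlots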

7 Conclusion and Discussion

We introduced a new primitive called swcopy and showed how to implement it efficiently. We used this primitive to implement constant time LL/SC from CAS in a way that is both space efficient and avoids the use of unbounded sequence numbers. We believe the swcopy primitive can simplify the design of many other concurrent algorithms and make reasoning about them more modular. Our LL/SC from CAS algorithm assumes that k, the maximum number of outstanding LLs per process, and P are known in advance. It would be interesting to see if the same time and space bounds are possible without requiring advance knowledge of k or P, and without unbounded sequence numbers. One way to support dynamically changing k and P is to implement the announcement array in our algorithms as a linked list [21]. However, adding to and removing from the linked list would take more than constant time.

References

[1] Y. Afek, D. Dauber, and D. Touitou. Wait-free made fast. In Proceedings of the twenty-seventh annual ACM symposium on Theory of computing, pages 538–547, 1995.

[2] Z. Aghazadeh, W. Golab, and P. Woelfel. Making objects writable. In ACM Symposium on Principles of Distributed Computing (PODC), pages 385–395, 2014.

[3] Z. Aghazadeh and P. Woelfel. Upper bounds for boundless tagging with bounded objects. In International Symposium on Distributed Computing (DISC), pages 442–457, 2016.

[4] J. H. Anderson and M. Moir. Universal constructions for large objects. In International Workshop on Distributed Algorithms, pages 168–182. Springer, 1995.

[5] J. H. Anderson and M. Moir. Universal constructions for multi-object operations. In ACM Symposium on Principles of Distributed Computing (PODC), volume 95, pages 184–193, 1995.

[6] H. Attiya and J. Welch. Distributed computing: fundamentals, simulations, and advanced topics, vol- ume 19. John Wiley & Sons, 2004.

[7] G. Barnes. A method for implementing lock-free shared-data structures. In Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures, pages 261–270, 1993.

[8] G. E. Blelloch and Y. Wei. Concurrent reference counting and resource management in wait-free constant time, 2020.

[9] S. Doherty, M. Herlihy, V. Luchangco, and M. Moir. Bringing practical lock-free synchronization to 64-bit applications. In ACM Symposium on Principles of Distributed Computing (PODC), pages 31–39. ACM, 2004.

[10] F. Ellen and P. Woelfel. An optimal implementation of fetch-and-increment. In International Symposium on Distributed Computing, pages 284–298. Springer, 2013.

[11] M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems (TOPLAS), 13(1):124–149, 1991.

[12] M. Herlihy. A methodology for implementing highly concurrent data objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 15(5):745–770, 1993.

[13] M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans. Prog. Lang. Syst., 12(3):463–492, 1990.

[14] IBM. IBM System/370 Extended Architecture, Principles of Operation. Technical report, Publication No. SA22-7085, 1983.

[15] A. Israeli and L. Rappoport. Disjoint-access-parallel implementations of strong shared memory primitives. In ACM Symposium on Principles of Distributed Computing (PODC), pages 151–160. ACM, 1994.

[16] P. Jayanti and S. Petrovic. Efficient and practical constructions of LL/SC variables. In ACM Symposium on Principles of Distributed Computing (PODC), pages 285–294. ACM, 2003.

[17] P. Jayanti and S. Petrovic. Efficient wait-free implementation of multiword LL/SC variables. In IEEE International Conference on Distributed Computing Systems (ICDCS), pages 59–68. IEEE, 2005.

[18] P. Jayanti and S. Petrovic. Efficiently implementing a large number of LL/SC objects. In International Conference on Principles of Distributed Systems (OPODIS), pages 17–31. Springer, 2005.

[19] P. Jayanti and S. Petrovic. Efficiently implementing LL/SC objects shared by an unknown number of processes. In International Workshop on Distributed Computing, pages 45–56. Springer, 2005.

[20] M. M. Michael. ABA prevention using single-word instructions. Technical Report RC23089 (W0401-136), IBM Research Division, 2004.

[21] M. M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems, 15(6):491–504, 2004.

[22] M. M. Michael. Practical lock-free and wait-free LL/SC/VL implementations using 64-bit CAS. In International Symposium on Distributed Computing (DISC), pages 144–158. Springer, 2004.

[23] M. Moir. Practical implementations of non-blocking synchronization primitives. In ACM Symposium on Principles of Distributed Computing (PODC), volume 97, pages 219–228, 1997.
