Reagents: Expressing and Composing Fine-Grained Concurrency


Aaron Turon
Northeastern University
[email protected]

Abstract

Efficient communication and synchronization is crucial for fine-grained parallelism. Libraries providing such features, while indispensable, are difficult to write, and often cannot be tailored or composed to meet the needs of specific users. We introduce reagents, a set of combinators for concisely expressing concurrency algorithms. Reagents scale as well as their hand-coded counterparts, while providing the composability existing libraries lack.

Such libraries are an enormous undertaking—and one that must be repeated for new platforms. They tend to be conservative, implementing only those data structures and primitives likely to fulfill common needs, and it is generally not possible to safely combine the facilities of the library. For example, JUC provides queues, sets and maps, but not stacks or bags. Its queues come in both blocking and nonblocking forms, while its sets and maps are nonblocking only. Although the queues provide atomic (thread-safe) dequeuing and sets provide atomic insertion, it is not possible to combine these into a single atomic operation that moves an element from a queue into a set.

In short, libraries for fine-grained concurrency are indispensable, but hard to write, hard to extend by composition, and hard to tailor to the needs of particular users.

Categories and Subject Descriptors: D.1.3 [Programming techniques]: Concurrent programming; D.3.3 [Language constructs and features]: Concurrent programming structures

General Terms: Design, Algorithms, Languages, Performance

Keywords: fine-grained concurrency, nonblocking algorithms, monads, arrows, compositional concurrency

Our contribution: We have developed reagents, which abstractly represent fine-grained concurrent operations.

1. Introduction
Programs are what happens between cache misses.

The problem: Amdahl's law tells us that sequential bottlenecks fundamentally limit our profit from parallelism. In practice, the effect is amplified by another factor: interprocessor communication, often in the form of cache coherence. When one thread waits on another, the program pays the cost of lost parallelism and an extra cache miss. The extra misses can easily accumulate to yield parallel slowdown, more than negating the benefits of the remaining parallelism. Cache, as ever, is king.

The easy answer is: avoid communication. In other words, parallelize at a coarse grain, giving threads large chunks of independent work. But some work doesn't easily factor into large chunks, or equal-size chunks. Fine-grained parallelism is easier to find, and easier to sprinkle throughout existing sequential code.

Another answer is: communicate efficiently. The past two decades have produced a sizable collection of algorithms for synchronization, communication, and shared storage which minimize the use of memory bandwidth and avoid unnecessary waiting. This research effort has led to industrial-strength libraries—java.util.concurrent (JUC) the most prominent—offering a wide range of primitives appropriate for fine-grained parallelism.

Reagents are expressive and composable, but when invoked retain the scalability and performance of existing algorithms:

Expressive: Reagents provide a basic set of building blocks for writing concurrent data structures and synchronizers. The building blocks include isolated atomic updates to shared state, and interactive synchronous communication through message passing. This blend of isolation and interaction is essential for expressing the full range of fine-grained concurrent algorithms (§3.3). The building blocks also bake in many common concurrency patterns, like optimistic retry loops, backoff schemes, and blocking and signaling. Using reagents, it is possible to express sophisticated concurrency algorithms at a higher level of abstraction, while retaining the performance and scalability of direct, hand-written implementations.

Composable: Reagents also provide a set of composition operators, including choice, sequencing and pairing. Using these operators, clients can extend and tailor concurrency primitives without knowledge of the underlying algorithms. For example, if a reagent provides only a nonblocking version of an operation like dequeue, a user can easily tailor it to a version that blocks if the queue is empty; this extension will work regardless of how dequeue is defined, and will continue to work even if the dequeue implementation is changed. Similarly, clients of reagent libraries can sequence operations together to provide atomic move operations between arbitrary concurrent collections—again, without access to or knowledge of the implementations. Compositionality is also useful to algorithm designers, since many sophisticated algorithms can be understood as compositions of simpler ones (§3.3, §3.4).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PLDI'12, June 11–16, 2012, Beijing, China. Copyright © 2012 ACM 978-1-4503-1205-9/12/06. $10.00

We begin in §2 by illustrating, through examples, some of the common patterns, tradeoffs and concerns in designing fine-grained concurrent algorithms. The two subsequent sections contain our core technical contributions:

• The design of the reagent combinators is given in §3.
Each combinator is motivated by specific needs and patterns in concurrent algorithms; the section shows in particular how to write all of the algorithms described in §2 concisely and at a higher-than-usual level of abstraction.

• The implementation of reagents is detailed in §5, via both high-level discussion and excerpts from our code in Scala. It reveals the extent to which reagents turn patterns of fine-grained concurrency into a general algorithmic framework. It also shows how certain choices in the design of reagents enable important optimizations in their implementation.

Reagents have a clear cost model that provides an important guarantee: they do not impose extra overhead on the atomic updates performed by individual concurrent algorithms. That is, a reagent-based algorithm manipulates shared memory in exactly the same way as its hand-written counterpart, even if it reads and writes many disparate locations. This is a guarantee not generally provided by concurrency abstractions like software transactional memory (STM), which employ redo or undo logs or other mechanisms to support composable atomicity. Reagents only impose overheads when multiple algorithms are sequenced into a large atomic block, e.g. in an atomic transfer between collections.

In principle, then, reagents offer a strictly better situation than with current libraries: when used to express the algorithms provided by current libraries, reagents provide a higher level of abstraction yet impose negligible overhead; nothing is lost. But unlike with current libraries, the algorithms can then be extended, tailored, and combined. Extra costs are only paid when the new atomic compositions are used.

We test this principle empirically in §6 by comparing multiple reagent-based collections to their hand-written counterparts, as well as to lock-based and STM-based implementations. The benchmarks include both single operations used in isolation (where little overhead for reagents is expected) and combined atomic transfers. We compare at both high and low levels of contention. Reagents perform universally better than the lock- and STM-based implementations, and are indeed competitive with hand-written lock-free implementations.

Finally, it is worth noting in passing that reagents generalize both Concurrent ML [22] and Transactional Events [4], yielding the first lock-free implementation for both. We give a thorough discussion of this and other related work—especially STM—in §7. A prototype implementation of reagents, together with the benchmarks

The prevailing wisdom is that such coarse-grained locking is inherently unscalable:

• It forces operations on the data structure to be serialized, even when they could be performed in parallel.

• It adds extra cache-coherence traffic, since each core must acquire the same lock's cache line in exclusive mode before operating. For fine-grained communication, that means at least one cache miss per operation.

• It is susceptible to preemptions or stalls of a thread holding a lock, which prevents other threads from making progress.

To achieve scalability, data structures employ finer-grained locking, or eschew locking altogether. Fine-grained locking associates locks with small, independent parts of a data structure, allowing those parts to be manipulated in parallel. Lockless (or nonblocking) data structures instead perform updates directly, using

```scala
import java.util.concurrent.atomic.AtomicReference

// Backoff and the cas(...) method on AtomicReference are helpers defined in
// the paper (exponential backoff, and sugar for compareAndSet, respectively).
class TreiberStack[A] {
  private val head = new AtomicReference[List[A]](Nil)

  def push(a: A) {
    val backoff = new Backoff
    while (true) {
      val cur = head.get()
      if (head.cas(cur, a :: cur)) return
      backoff.once()
    }
  }

  def tryPop(): Option[A] = {
    val backoff = new Backoff
    while (true) {
      val cur = head.get()
      cur match {
        case Nil => return None
        case a :: tail =>
          if (head.cas(cur, tail)) return Some(a)
      }
      backoff.once()
    }
  }
}
```

Figure 1. Treiber's stack (in Scala)
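The coarse-grained baseline that Treiber's stack improves on is easy to state concretely. Below is a minimal sketch in Java (the language of java.util.concurrent); the `CoarseStack` name and the use of `java.util.ArrayDeque` are illustrative choices, not from the paper. Every `push` and `tryPop` acquires the same monitor, so all operations serialize and each core must pull the lock's cache line in exclusive mode before operating.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Coarse-grained stack: a single monitor guards the whole structure, so every
// operation serializes and contends for one lock (the unscalable baseline).
class CoarseStack<A> {
    private final Deque<A> items = new ArrayDeque<>();

    // Both methods lock `this`; no two operations ever run in parallel.
    public synchronized void push(A a) {
        items.push(a);
    }

    public synchronized Optional<A> tryPop() {
        return items.isEmpty() ? Optional.empty() : Optional.of(items.pop());
    }
}
```

Correctness is trivial here, but scalability is not: under contention every call bounces the lock's cache line between cores, and a preempted lock holder stalls everyone else. Treiber's stack instead retries a single compare-and-set on `head`, so a stalled thread never blocks the others.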