P0098R0: Towards Implementation and Use of memory_order_consume

Doc. No.: WG21/P0098R0
Date: 2015-09-24
Reply to: Paul E. McKenney, Torvald Riegel, Jeff Preshing, Hans Boehm, Clark Nelson, Olivier Giroux, and Lawrence Crowl
Email: [email protected], [email protected], jeff@preshing.com, [email protected], [email protected], [email protected], and [email protected]
Other contributors: Alec Teal, David Howells, David Lang, George Spelvin, Jeff Law, Joseph S. Myers, Linus Torvalds, Mark Batty, Michael Matz, Peter Sewell, Peter Zijlstra, Ramana Radhakrishnan, Richard Biener, Will Deacon, Faisal Vali, Behan Webster, Tony Tye, JF Bastien, ...

September 29, 2015

This document is a revision of WG21/N4321, based on email discussions and including two more proposals in Sections 7.9 and 7.10. This proposal has been further refined by discussions on various email reflectors. WG21/N4321 is itself a revision of WG21/N4215, based on feedback at the 2014 UIUC meeting and on the various email reflectors. WG21/N4215 is in turn a revision of WG21/N4036, based on feedback at the 2014 Rapperswil meeting, at the 2014 Redmond SG1 meeting, and on the various email reflectors. A detailed change log appears starting on page 41.

1 Introduction

The most obscure member of the C11 and C++11 memory_order enum seems to be memory_order_consume [29]. The purpose of memory_order_consume is to allow reading threads to correctly traverse linked data structures without the need for locks, atomic instructions, or (with the exception of DEC Alpha) memory-fence instructions, even though new elements are being inserted into these linked structures before, during, and after the traversal. Without memory_order_consume, both the compiler and (again, in the case of DEC Alpha) the CPU would be within their rights to carry out aggressive data-speculation optimizations that would permit readers to see pre-initialization values in the newly added data elements. The purpose of memory_order_consume is to prevent these optimizations.

Of course, memory_order_acquire may be used as a substitute for memory_order_consume; however, doing so results in costly explicit memory-fence instructions (or, where available, load-acquire instructions) on weakly ordered systems such as ARM, Itanium, and PowerPC [3, 9, 12, 13]. These systems enforce dependency ordering in hardware. In other words, if the address used by one memory-reference instruction depends on the value from a preceding load instruction, the hardware forces that earlier load to complete before the later memory-reference instruction commences (see footnote 1). Similarly, if the data to be stored by a given store instruction depends on the value from a preceding load instruction, the hardware again forces that earlier load to complete before the later store instruction commences. Recent software tools for ARM and PowerPC can help explicate their memory models [1, 2, 19, 25]. Note that strongly ordered systems like x86, IBM mainframe, and SPARC TSO enforce dependency ordering as a side effect of the fact that they do not reorder loads with subsequent memory references. Therefore, memory_order_consume is beneficial on hot code paths, removing the need for hardware ordering instructions for weakly ordered systems and permitting additional compiler optimizations on strongly ordered systems.

Footnote 1: But please note that hardware can and does take advantage of the as-if rule, just as compilers do.

When implementing concurrent insertion-only data structures, a few of which are found in the Linux kernel, memory_order_consume is all that is required. However, most data structures also require removal of data elements. Such removal requires that the thread removing the data element wait for all readers to release their references to it before reclaiming that element. The traditional way to do this is via garbage collectors (GCs), which have been available for more than half a century [15] and which are now available even for C and C++ [4]. Another way to wait for readers is to use read-copy update (RCU) [21, 24], which explicitly marks read-side regions of code and provides primitives that wait for all pre-existing readers to complete. RCU is gaining significant use both within the Linux kernel [16] and outside of it [5, 6, 8, 14, 30].

Despite the growing number of memory_order_consume use cases, there are no known high-performance implementations of memory_order_consume loads in any C11 or C++11 environments. This situation suggests that some change is in order: After all, if implementations do not support the standard's memory_order_consume facility, users can be expected to continue to exploit whatever implementation-specific facilities allow them to get their jobs done. This document therefore provides a brief overview of RCU in Section 2 and surveys memory_order_consume use cases within the Linux kernel in Section 3. Section 4 looks at how dependency ordering is currently supported in pre-C11 implementations, and then Section 5 looks at possible ways to support those use cases in existing C11 and C++11 implementations, followed by some thoughts on incremental paths towards official support of these use cases in the standards. Section 6 lists some weaknesses in the current C11 and C++11 specification of dependency ordering, and finally Section 7 outlines a few possible alternative dependency-ordering specifications.

Note: SC22/WG14 liaison issue.

2 Introduction to RCU

The RCU synchronization mechanism is often used as a replacement for reader-writer locking because RCU avoids the high-overhead cache thrashing that is characteristic of many common reader-writer-locking implementations. RCU is based on three fundamental concepts:

1. Light-weight in-memory publish-subscribe operation.

2. Operation that waits for pre-existing readers.

3. Maintaining multiple versions of data to avoid disrupting old readers that are still referencing old versions.

These three concepts taken together allow readers and updaters to make forward progress concurrently.

We would like to use C11's and C++11's memory_order_consume to implement RCU's lightweight subscribe operation, rcu_dereference(). We assume that rcu_dereference() is a good example of how developers would exploit the dependency-ordering feature of weakly ordered systems, so we look to rcu_dereference() as an indication of the semantics that memory_order_consume should have.

In one typical RCU use case, updaters publish new versions of a data structure while readers concurrently subscribe to whatever version is current at the time a given reader starts. Once all pre-existing readers complete, old versions can be reclaimed. This sort of use case may be a bit unfamiliar to many, but it is extremely effective in many situations, offering excellent performance, scalability, real-time latency, deadlock avoidance, and read-side composability. More details on RCU are readily available [8, 17, 18, 20, 21, 23, 26].

Figure 1 shows the growth of RCU usage over time within the Linux kernel, which is strong evidence of RCU's effectiveness. However, RCU is a specialized mechanism, so its use is much smaller than that of general-purpose techniques such as locking, as can be seen in Figure 2. It is unlikely that RCU's usage will ever approach that of locking because RCU coordinates only between readers and updaters, which means that some other mechanism is required to coordinate among concurrent updates. In the Linux kernel, that update-side mechanism is normally locking, although pretty much any synchronization mechanism may be used, including transactional memory [10, 11, 28].

[Plot: "# RCU API Uses" versus "Year", 2002 to 2016.]
Figure 1: Growth of RCU Usage

[Plot: "# RCU/locking API Uses" versus "Year", 2002 to 2016.]
Figure 2: Growth of RCU Usage vs. Locking

However, RCU is now being used in many situations where reader-writer locking would be used. Figure 3 shows that the use of reader-writer locking has changed little since RCU was introduced. This data suggests that RCU is at least as important to parallel software as is reader-writer locking.

[Plot: "# RCU/rwlocking API Uses" versus "Year", 2002 to 2016, with curves labeled RCU and rwlock.]
Figure 3: Growth of RCU Usage vs.

In more recent years, a user-level library implementation of RCU has been available [7]. This library is now available for many platforms and has been included in a number of Linux distributions. It has been pressed into service for a number of open-source software projects, proprietary products, and research efforts.

Full and fully performant C11/C++11 support for memory_order_consume is therefore quite important. However, good progress can often be made in the short term by focusing on the cases that are commonly used in practice rather than on the general case. The next section therefore takes a rough census of the Linux kernel's use of the rcu_dereference() family of primitives, which memory_order_consume is intended to implement.

3 Linux-Kernel Use Cases

Section 3.1 lists types of dependency chains in the Linux kernel, Section 3.2 lists operators used within

    void new_element(struct foo **pp, int a)
    {
        struct foo *p = malloc(sizeof(*p));

        if (!p)
            abort();
        p->a = a;
        /* Publish the initialized element. */
        atomic_store_explicit(pp, p, memory_order_release);
    }

    int traverse(struct foo_head *ph)
    {
        int a = -1;
        struct foo *p;

        /* Subscribe to the list head, then follow the links. */
        p = atomic_load_explicit(&ph->h, memory_order_acquire);
        while (p != NULL) {
            a = p->a;
            p = atomic_load_explicit(&p->n, memory_order_acquire);
        }
        return a;
    }

Figure 4: Release/Acquire Linked Structure Traver-
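The listing in Figure 4 does not show its type definitions, so the following is only a minimal sketch of the same publish/subscribe pattern with the reader's loads switched from memory_order_acquire to memory_order_consume. The definitions of struct foo and struct foo_head, and the names push_front(), traverse_consume(), and demo_single_threaded(), are assumptions made for illustration and do not come from the paper. The sketch also assumes a single updater and never frees elements, so it covers only the insertion-only case. Because implementations are free to treat consume loads as acquire loads, it demonstrates the intended usage rather than any performance benefit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical types: Figure 4 does not show these definitions. */
struct foo {
    int a;
    struct foo *_Atomic n;      /* next element, or NULL */
};

struct foo_head {
    struct foo *_Atomic h;      /* first element, or NULL */
};

/*
 * Publish: fully initialize the new element, then make it reachable
 * with a release store. Assumes a single updater thread.
 */
void push_front(struct foo_head *ph, int a)
{
    struct foo *p = malloc(sizeof(*p));

    if (!p)
        abort();
    p->a = a;
    atomic_init(&p->n, atomic_load_explicit(&ph->h, memory_order_relaxed));
    atomic_store_explicit(&ph->h, p, memory_order_release);
}

/*
 * Subscribe: each access to p->a and p->n is address-dependent on the
 * preceding consume load of p, which is the ordering relied upon here.
 * Returns the value in the last element visited, or -1 for an empty list.
 */
int traverse_consume(struct foo_head *ph)
{
    int a = -1;
    struct foo *p;

    p = atomic_load_explicit(&ph->h, memory_order_consume);
    while (p != NULL) {
        a = p->a;
        p = atomic_load_explicit(&p->n, memory_order_consume);
    }
    return a;
}

/* Single-threaded usage check; elements are intentionally never freed. */
void demo_single_threaded(void)
{
    struct foo_head head;

    atomic_init(&head.h, NULL);
    assert(traverse_consume(&head) == -1);
    push_front(&head, 1);
    push_front(&head, 2);       /* list is now 2 -> 1 */
    assert(traverse_consume(&head) == 1);
}
```

Removing elements safely would additionally require waiting for pre-existing readers, for example with a garbage collector or RCU as described in Section 1; the sketch deliberately leaves that out.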
