P0098R0: Towards Implementation and Use of memory order consume

Doc. No.: WG21/P0098R0 Date: 2015-09-24 Reply to: Paul E. McKenney, Torvald Riegel, Jeff Preshing, Hans Boehm, Clark Nelson, Olivier Giroux, and Lawrence Crowl Email: paulmck@.vnet.ibm.com, [email protected], jeff@preshing.com [email protected], [email protected], [email protected], and [email protected]

Other contributors: Alec Teal, David Howells, David Lang, George Spelvin, Jeff Law, Joseph S. Myers, , Mark Batty, Michael Matz, Peter Sewell, Peter Zijlstra, Ramana Radhakrishnan, Richard Biener, Will Deacon, Faisal Vali, Behan Webster, Tony Tye, JF Bastien, ... September 29, 2015

This document is a revision of WG21/N4321, based DEC Alpha) memory-fence instructions, even though on email discusssions and including two more pro- new elements are being inserted into these linked posals in Sections 7.9 and 7.10. This proposal structures before, during, and after the traversal. has been further refined by discussions on various Without memory order consume, both the email reflectors. WG21/N4321 is itself a revision and (again, in the case of DEC Alpha) the CPU of WG21/N4215, based on feedback at the 2014 would be within their rights to carry out aggres- UIUC meeeting and on the various email reflectors. sive data-speculation optimizations that would per- WG21/N4215 is in turn a revision of WG21/N4036, mit readers to see pre-initialization values in the based on feedback at the 2014 Rapperswil meeting, at newly added data elements. The purpose of memory the 2014 Redmond SG1 meeting, and on the various order consume is to prevent these optimizations. email reflectors. A detailed change log appears starting on page 41. Of course, memory order acquire may be used as a substitute for memory order consume, however do- ing so results in costly explicit memory-fence instruc- 1 Introduction tions (or, where available, load-acquire instructions) on weakly ordered systems such as ARM, , The most obscure member of the C11 and ++11 and PowerPC [3, 9, 12, 13]. These systems enforce memory order enum seems to be memory order dependency ordering in hardware, in other words, if consume [29]. The purpose of memory order the address used by one memory-reference instruction consume is to allow reading threads to correctly tra- depends on the value from a preceding load instruc- verse linked data structures without the need for tion, the hardware forces that earlier load to com- locks, atomic instructions, or (with the exception of plete before the later memory-reference instruction

1 WG21/P0098R0 2 commences.1 Similarly, if the data to be stored by a memory order consume use cases within the Linux given store instruction depends on the value from a kernel in Section 3. Section 4 looks at how depen- preceding load instruction, the hardware again forces dency ordering is currently supported in pre-C11 im- that earlier load to complete before the later store in- plementations, and then Section 5 looks at possible struction commences. Recent software tools for ARM ways to support those use cases in existing C11 and and PowerPC can help explicate their memory mod- C++11 implementations, followed by some thoughts els [1, 2, 19, 25]. Note that strongly ordered systems on incremental paths towards official support of these like , IBM mainframe, and SPARC TSO enforce use cases in the standards. Section 6 lists some weak- dependency ordering as a side effect of the fact that nesses in the current C11 and C++11 specification of they do not reorder loads with subsequent memory dependency ordering, and finally Section 7 outlines a references. Therefore, memory order consume is ben- few possible alternative dependency-ordering specifi- eficial on hot code paths, removing the need for hard- cations. ware ordering instructions for weakly ordered systems Note: SC22/WG14 liason issue. and permitting additional compiler optimizations on strongly ordered systems. When implementing concurrent insertion-only 2 Introduction to RCU data structures, a few of which are found in the , memory order consume is all that is required. The RCU synchronization mechanism is often used as However, most data structures also require removal a replacement for reader-writer locking because RCU of data elements. Such removal requires that the avoids the high-overhead cache thrashing that is char- removing the data element wait for all read- acteristic of many common reader-writer-locking im- ers to release their references to it before reclaim- plementations. RCU is based on three fundamental ing that element. The traditional way to do this is concepts: via garbage collectors (GCs), which have been avail- able for more than half a century [15] and which are 1. Light-weight in-memory publish-subscribe oper- now available even for C and C++ [4]. Another ation. way to wait for readers is to use read-copy update (RCU) [21, 24], which explicitly marks read-side re- 2. Operation that waits for pre-existing readers. gions of code and provides primitives that wait for all pre-existing readers to complete. RCU is gaining 3. Maintaining multiple versions of data to avoid significant use both within the Linux kernel [16] and disrupting old readers that are still referencing outside of it [5, 6, 8, 14, 30]. old versions. Despite the growing number of memory order consume use cases, there are no known high- These three concepts taken together allow readers performance implementations of memory order and updaters to make forward progress concurrently. consume loads in any C11 or C++11 environments. We would like to use C11’s and C++11’s memory This situation suggests that some change is in or- order consume to implement RCU’s lightweight sub- der: After all, if implementations do not support scribe operation, rcu dereference(). We assume the standard’s memory order consume facility, users that rcu dereference() is a good example of how can be expected to continue to exploit whatever developers would exploit the dependency-ordering implementation-specific facilities allow them to get feature of weakly ordered systems, so we look to rcu their jobs done. This document therefore provides dereference() as an indication of the semantics that a brief overview of RCU in Section 2 and surveys memory order consume should have. In one typical RCU use case, updaters publish 1 But please note that hardware can and does take advan- new versions of a data structure while readers con- tage of the as-if rule, just as do. currently subscribe to whatever version is current WG21/P0098R0 3

12000 140000

10000 120000 100000 8000 80000 6000 60000 locking

4000 40000 # RCU API Uses

2000 # RCU/locking API Uses 20000 RCU rwlock 0 0 2002 2004 2006 2008 2010 2012 2014 2016 2002 2004 2006 2008 2010 2012 2014 2016

Year Year

Figure 2: Growth of RCU Usage vs. Locking Figure 1: Growth of RCU Usage

changed little since RCU was introduced. This data at the time a given reader starts. Once all pre- suggests that RCU is at least as important to parallel existing readers complete, old versions can be re- software as is reader-writer locking. claimed. This sort of use case may be a bit unfa- In more recent years, a user-level library implemen- miliar to many, but it is extremely effective in many tation of RCU has been available [7]. This library is situations, offering excellent performance, scalability, now available for many platforms and has been in- real-time latency, deadlock avoidance, and read-side cluded in a number of Linux distributions. It has composability. More details on RCU are readily avail- been pressed into service for a number of open-source able [8, 17, 18, 20, 21, 23, 26]. software projects, proprietary products, and research Figure 1 shows the growth of RCU usage over time efforts. within the Linux kernel, which is strong evidence of Fully and fully performant C11/C++11 support RCU’s effectiveness. However, RCU is a specialized for memory order consume is therefore quite impor- mechanism, so its use is much smaller than general- tant. However, good progress can often be made in purpose techniques such as locking, as can be seen in the short term by focusing on the cases that are com- Figure 2. It is unlikely that RCU’s usage will ever monly used in practice rather than on the general approach that of locking because RCU coordinates case. The next section therefore takes a rough census only between readers and updaters, which means of the Linux kernel’s use of the rcu dereference() that some other mechanism is required to coordinate family of primitives, which memory order consume is among concurrent updates. In the Linux kernel, that intended to implement. update-side mechanism is normally locking, although pretty much any synchronization mechanism may be used, including transactional memory [10, 11, 28]. 3 Linux-Kernel Use Cases However RCU is now being used in many situa- tions where reader-writer locking would be used. Fig- Section 3.1 lists types of dependency chains in the ure 3 shows that the use of reader-writer locking has Linux kernel, Section 3.2 lists operators used within WG21/P0098R0 4

12000 1 void new_element(struct foo **pp, int a) 2 { 3 struct foo *p = malloc(sizeof(*p)); 10000 4 5 if (!p) 6 abort(); 8000 7 p->a = a; RCU 8 atomic_store_explicit(pp, p, memory_order_release); 9 } 10 6000 11 int traverse(struct foo_head *ph) 12 { rwlock 13 int a = -1; 4000 14 struct foo *p; 15 16 p = atomic_load_explicit(&ph->h, memory_order_acquire);

# RCU/rwlocking API Uses 2000 17 while (p != NULL) { 18 a = p->a; 19 p = atomic_load_explicit(&p->n, memory_order_acquire); 0 20 } 21 return a; 22 }

2002 2004 2006 2008 2010 2012 2014 2016 23

Year 24 Figure 4: Release/Acquire Linked Structure Traver- Figure 3: Growth of RCU Usage vs. Reader-Writer sal Locking

pointers while locklessly traversing the data struc- these dependency chains, Section 3.3 lists operators ture, as shown in Figure 4. that are considered to terminate dependency chains, Unfortunately, a memory order acquire load re- Section 3.4 lists operator that often act as the last link quires expensive special load instructions or memory- in a dependency chain, and finally Section 3.5 surveys fence instructions on weakly ordered systems such a longer-than-average (but by no means maximal) as ARM, Itanium, and PowerPC. Furthermore, in dependency chain that appears in the Linux kernel. traverse(), the address of each memory order It is worth reviewing the relationship between acquire load within the while loop depends on the 2 memory order acquire and memory order consume value of the previous memory order acquire load. loads, both of which interact with memory order Therefore, in this case, most weakly ordered systems release stores. don’t really need the special load instructions or the A memory order release store is said to synchro- memory-fence instructions, as these systems can in- nize with a memory order acquire load if that load stead rely on the hardware-enforced dependency or- returns the value stored or in some special cases, dering. On strongly ordered systems, memory order some later value [29, 1.10p6-1.10p8]. When a memory acquire needlessly suppresses memory-movement order release store synchronizes with a memory compiler optimizations. order acquire load, any memory reference preced- This is the use case for memory order consume, ing the memory order release store will happen be- which can be substituted for memory order acquire fore any memory reference following the memory in cases where hardware dependency ordering applies. order acquire load [29, 1.10p11-1.10p12]. This One such case is the preceding example, and Fig- property allows a linked structure to be locklessly tra- 2 versed by using memory order release stores when The initial load on line 16 might well depend on an earlier load, but for simplicity, this example assumes that the initial updating pointers to reference new data elements and foo head structure is statically allocated, and thus not subject by using memory order acquire loads when loading to updates. WG21/P0098R0 5

1 void new_element(struct foo **pp, int a) 2 { which might result in guidance to implementations 3 struct foo *p = malloc(sizeof(*p)); or perhaps even modifications to the definition of 4 5 if (!p) memory order consume. 6 abort(); 7 p->a = a; 8 atomic_store_explicit(pp, p, memory_order_release); 9 } 3.1 Types of Linux-Kernel Depen- 10 dency Chains 11 int traverse(struct foo_head *ph) 12 { 13 int a = -1; One goal for memory order consume is to implement 14 struct foo *p; rcu dereference(), which heads a Linux-kernel 15 16 p = atomic_load_explicit(&ph->h, memory_order_consume); dependency-ordering tree. There is a number of 17 while (p != NULL) { variants of rcu dereference() in the Linux kernel 18 a = p->a; 19 p = atomic_load_explicit(&p->n, memory_order_consume); in order to implement the four flavors of RCU and 20 } also to enable RCU usage diagnositics for code 21 return a; 22 } that is shared by readers and updaters. These 23 additional variants are rcu dereference(), rcu dereference bh(), rcu dereference bh check(), 24 rcu dereference bh check(), rcu dereference Figure 5: Release/Consume Linked Structure Traver- check(), rcu dereference index check() (now sal removed), rcu dereference protected(), rcu dereference raw(), rcu dereference sched(), rcu dereference sched check(), srcu ure 5 shows that same example recast in terms of dereference(), and srcu dereference check(). memory order consume.A memory order release Taken together, there are about 1300 uses of these store is dependency ordered before a memory order functions in version 3.13 of the Linux kernel. consume load when that load returns the value stored, However, about 250 of those are rcu dereference or in some special cases, some later value [29, 1.10p1]. protected(), which is used only in update-side code Then, if the load carries a dependency to some and thus does not head up read-side dependency later memory reference [29, 1.10p9], any memory chains, which leaves about 1000 uses to be inspected reference preceding the memory order release store for dependency-ordering usage. will happen before that later memory reference [29, 1.10p9-1.10p12]. This means that when there is de- 3.2 Operators in Linux-Kernel De- pendency ordering, memory order consume gives the pendency Chains same guarantees that memory order acquire does, but at lower cost. A surprisingly small fraction of the possible C opera- On the other hand, memory order consume re- tors appear in dependency chains in the Linux kernel, quires the compiler to track the carries-a-dependency namely ->, infix =, casts, prefix &, prefix *, [], infix relationships, with the set of such relationships +, infix -, ternary ?:, and infix (bitwise) &. headed by a given memory order consume load be- By far the two most common operators are the ing called that load’s dependency chains. It is quite -> pointer field selector and the = assignment oper- possible that the complexity of implementing this ca- ator. Enabling the carries-dependency relationship pability has thus far prevented high-quality memory through only these two operators would likely cover order consume implementations from appearing. It better than 90% of the Linux-kernel use cases. is therefore worthwhile to review use of dependency Casts, the prefix * indirection operator, and the chains in practice in order to determine what types prefix & address-of operator are used to implement of operations typically appear in dependency chains, Linux’s list primitives, which translate from list WG21/P0098R0 6

1 struct foo { 2 int a; algorithms as markers. Such markers, though not 3 }; common in the Linux kernel, are well-known in the 4 struct foo *fp; 5 struct foo default_foo; art, with hazard pointers being but one example [27]. 6 This operator is also sometimes used to locate the 7 int bar(void) 8 { beginning of an aligned structure, for example, if p 9 struct foo *p; references a field within a data structure that is 4096 10 11 p = rcu_dereference(fp); bytes in size (or smaller), and that is also aligned to a 12 return p ? p->a : default_foo.a; 4096-byte boundary, then p & ~0xfff will, with the 13 } addition of appropriate casting, produce a pointer to the beginning of the structure. Note that it is ex- Figure 6: Default Value For RCU-Protected Pointer, pected that both operands of infix & are expected to Linux Kernel have some non-zero bits, because otherwise a NULL pointer will result, and NULL pointers cannot reason- 1 class foo { ably be said to carry much of anything, let alone a 2 int a; 3 }; dependency. 4 std::atomic fp; Although I did not find any infix operators in 5 foo default_foo; | 6 my census of Linux-kernel dependency chains, sym- 7 int bar(void) metry considerations argue for also including it, for 8 { 9 std::atomic p; example, for read-side pointer tagging, or, for another 10 example, locating the beginning of the next in an ar- 11 p = fp.load_explicit(memory_order_consume); 12 return p ? kill_dependency(p->a) : default_foo.a; ray of aligned structures. Presumably both of the 13 } operands of infix | must have at least one zero bit. To recap, the operators appearing in Linux-kernel Figure 7: Default Value For RCU-Protected Pointer, dependency chains are: ->, infix =, casts, prefix &, C++11 prefix *, [], infix +, infix -, ternary ?:, infix (bitwise) &, and probably also |. pointers embedded in a structure to the structure it- self. These operators are also used to get some of the 3.3 Operators Terminating Linux- effects of C++ subtyping in the C language. Kernel Dependency Chains The [] array-indexing operator, and the infix + Although C++11 has the kill dependency() func- and - arithmetic operators are used to manipulate tion to terminate a dependency chain, no such func- RCU-protected arrays, as well as to index into arrays tion exists in the Linux kernel. Instead, Linux-kernel contained within RCU-protected structures. RCU- dependency chains are judged to have terminated protected arrays are becoming less common because upon exit from the outermost RCU read-side critical they are being converted into more complex data section,3 when existence guarantees are handed off structures, such as trees. However, RCU-protected from RCU to some other synchronization mechanism structures containing arrays are still fairly common. (usually locking or reference counting), or when the The ternary ?: if-then-else expression is used to variable carrying the dependency goes out of scope. handle default values for RCU-protected pointers, for example, as shown in Figure 6, or in C++11 form 3 The beginning of a given RCU read-side critical section is in Figure 7. Note that the dependency is carried marked with rcu read lock(), rcu read lock bh(), rcu read only through the rightmost two operands of ?:, never lock sched(), or srcu read lock(), and the end by the cor- through the leftmost one. responding primitive from the list rcu read unlock(), rcu read unlock bh(), rcu read unlock sched(), or srcu read The infix & operator is used to mask low-order bits unlock(). There is currently no C++11 counterpart for an from RCU pointers. These bits are used by some RCU read-side critical section. WG21/P0098R0 7

That said, it is possible to analyze Linux-kernel in version 3.13 of the Linux kernel. dependency chains to see what part of the chain is actually required by the algorithm in question. We can therefore define the essential subset of a depen- 3.4 Operators Acting as Last Link in dency chain to be that subset within which ordering Linux-Kernel Dependency Chains is required by the algorithm. In the 3.13 version of the Linux kernel, the following operators always mark Although the -> operator is frequently used as part of the end of the essential subset of a dependency chain: a Linux-kernel dependency chain, it often is intended (), !, ==, !=, &&, ||, infix *, /, and %. to be the last link in that chain. Therefore, the uses The postfix () function-invocation operator is an cases for the -> operator deserve special mention. interesting special case in the Linux kernel. In theory, The first use case involves fetching non-pointer RCU could be used to protect JITed function bodies, data from an RCU-protected data structure. For but in current practice RCU is instead used to wait example, in the DRDB subsystem in Linux, -> is for all pre-existing callers to the function referenced used to fetch a timeout value. This code requires by the previous pointer. The functions are all com- that dependency ordering apply to this fetch, but it piled into the kernel, and the dependency chains are does not require a dependency chain extending be- therefore irrelevant to the () operator. Hence, in ver- yond that point. This sort of case would require sion 3.13 of the Linux kernel, the () operator marks a kill dependency() for implementations based on the end of the essential subset of any dependency the C++11 and C11 standards. chain that it resides in. The second use case involves linked data struc- The !, ==, !=, &&, and || operators are used ex- tures where an RCU update might be applied on clusively in “if” statements to make control-flow de- any pointer in the chain, for example, the stan- cisions, and therefore also mark the end of the essen- dard Linux-kernel linked list. The -> operator pro- tial subset of any dependency chains that they reside vides dependency ordering for the fetch of the ->next in. In theory, these relational and boolean operators pointer, but that fetch must itself be a memory order could be used to form array indexes, but in practice consume load in order to provide the required depen- the Linux kernel does not yet do this in RCU de- dency ordering for the fields in the next structure pendency chains, and furthermore, as of version 4.2 in the list. Thus, a linked-list traversal consists of of the Linux kernel, integers are no longer allowed a series of back-to-back non-overlapping dependency to carry dependencies except in very restricted situ- chains. ations. The other relational operators (>, <, >=, and These two use cases raise the question of whether <=) should probably also be added to this list. a dependency chain can continue beyond a -> oper- The infix *, /, and % arithmetic operators could ator. The answer is “yes”, and this occurs when a potentially be used for construct array addresses, but linked structure is made visible to RCU readers as a they are not yet used that way in the Linux kernel. unit. For example, consider a linked list where each Instead, they are used to do computation on values list element links to a constant binary search tree. fetched as the last operation in an essential subset of If this tree is in place when the element is added to a dependency chain. the list, then a memory order consume load is needed In short, in the current Linux kernel, (), !, ==, only when fetching the pointer to the element. The !=, &&, ||, infix *, /, and % all mark the end of the dependency chain headed by this fetch suffices to or- essential subset of a dependency chain. That said, der accesses to the binary search tree. there is potential for them to be used as part of the These cases need to be differentiated. The third essential subset of dependency changes in future ver- use case appears to be the least frequent, which sug- sions of the Linux kernel. And the same is of course gests that the -> operator (or a sequence of -> oper- true of the remaining C-language operators, which ators) always be the last link of a dependency chain. did not appear within any of the dependency chains Another alternative is to differentiate on type, so WG21/P0098R0 8

1 void new_element(struct foo **pp, int a) 2 { • The arp process() function invokes the follow- 3 struct foo *p = malloc(sizeof(*p)); ing macros and functions: 4 5 if (!p) 6 abort(); – IN DEV ROUTE LOCALNET(), which expands 7 p->a = a; to the ipv4 devconf get() function. 8 atomic_store_explicit(pp, p, memory_order_release); 9 } – arp ignore(), which in turn calls: 10 11 int traverse(struct foo_head *ph) ∗ IN DEV ARP IGNORE(), which expands 12 { 13 int a = -1; to the ipv4 devconf get() function. 14 struct foo *p; ∗ inet confirm addr(), which calls: 15 16 p = atomic_load_explicit(&field_dep(ph, h), · dev net(), which in turn calls 17 memory_order_consume); 18 while (p != NULL) { read pnet(). 19 a = field_dep(p, a); 20 p = atomic_load_explicit(&field_dep(p, n), – IN DEV ARPFILTER(), which expands to 21 memory_order_consume); ipv4 devconf get(). 22 } 23 return a; – IN DEV CONF GET(), which also expands to 24 } ipv4 devconf get(). – arp fwd proxy(), which calls: Figure 8: Decorated Linked Structure Traversal ∗ IN DEV PROXY ARP(), which expands to ipv4 devconf get(). that only pointer-like types (including intptr t and ∗ IN DEV MEDIUM ID(), which also ex- uintptr t) carry dependencies. pands to ipv4 devconf get(). – arp fwd pvlan(), which calls: 3.5 Linux-Kernel Dependency Chain ∗ IN DEV PROXY ARP PVLAN(), which ex- Length pands to ipv4 devconf get(). Many Linux-kernel dependency chains are very short – pneigh enqueue(). and contained, with a fair number living within the confines of a single C statement. If there were only Again, although a great many dependency chains a few short dependency chains in the Linux kernel, in the Linux kernel are quite short, there are quite a one could imagine decorating all the operators in each few that spread both widely and deeply. We therefore dependency chain, for example, replacing the -> op- cannot expect Linux kernel hackers to look fondly erator with something like the mythical field dep() on any mechanism that requires them to decorate operator shown on lines 16, 19, and 20 of Figure 8. each and every operator in each and every depen- However, there are a great many dependency dency chain as was shown in Figure 8. In fact, even chains that extend across multiple functions. One kill dependency() will likely be an extremely diffi- relatively modest example is in the Linux network cult sell. stack, in the arp process() function. This depen- dency chain extends as follows, with deeper nesting 4 Dependency Ordering in Pre- indicating deeper function-call levels: C11 Implementations • The arp process() function invokes in dev get rcu(), which returns an RCU-protected Pre-C11 implementations of the C language do not pointer. The head of the dependency chain is have any formal notion of dependency ordering, but therefore within the in dev get rcu() func- these implementations are nevertheless used to build tion. the Linux kernel—and most likely all other software WG21/P0098R0 9 using RCU. This section lays out a few straightfor- introducing misordering. Similar arithmetic pit- ward rules for both implementers (Section 4.2) and falls must be avoided if the infix *, /, or % oper- users of these pre-C11 C-language implementations ators appear in the essential subset of a depen- (Section 4.1). dency chain. Again, recent experience indicates that it is even better to make sure that one of 4.1 Rules for C-Language RCU Users the operands of these arithmetic operators is a pointer-like type. The rules for C-language RCU users have evolved over time, so this section will present them in reverse 5. Avoid all-zero operands to the bitwise & opera- chronological order. tor, and similarly avoid all-ones operands to the bitwise | operator. If the compiler is able to 4.1.1 Rules for 2014 GCC Implementations deduce the value of such operands, it is within its rights to substitute the corresponding con- The primary rule for developers implementing RCU- stant for the bitwise operation. Once again, this based algorithms is to avoid letting the compiler de- breaks the dependency chain, introducing mis- terming the value of any variable in any dependency ordering. chain. This primary rule implies a number of sec- Please note that single-bit operands to bitwise & ondary rules: can be dangerous because the compiler requires 1. Use only intrinsic operators on basic types. If only a small amount of additional information to you are making use of C++ template metapro- deduce the exact value, which could again result gramming or operator overloading, more elabo- in constant substitution. Operands to bitwise | rate rules apply, and those rules are outside the that have only one zero bit are similarly danger- scope of this document. ous. 2. Use a volatile load to head the dependency chain. 6. If you are using RCU to protect JITed functions, This is necessary to avoid the compiler repeating so that the () function-invocation operator is a the load or making use of (possibly erroneous) member of the essential subset of the dependency prior knowledge of the contents of the memory tree, you may need to interact directly with the location, each of which can break dependency hardware to flush instruction caches. This issue chains. arises on some systems when a newly JITed func- tion is using the same memory that was used by 3. Avoid use of single-element RCU-protected ar- an earlier JITed function. rays. The compiler is within its right to assume that the value of an index into such an array 7. Do not use the boolean && and || operators in must necessarily evaluate to zero. The com- essential dependency chains. The reason for this piler could then substitute the constant zero for prohibition is that they are often compiled using the computation, breaking the dependency chain branches. Weak-memory machines such as ARM and introducing misordering. That said, recent or PowerPC order stores after such branches, but experience indicates that it is even better to sim- can speculate loads, which can break data depen- ply avoid carrying dependencies through integer dency chains. types. 8. Do not use relational operators (==, !=, >, >=, 4. Avoid cancellation when using the + and - infix <, or <=) in the essential subset of a dependency arithmetic operators. For example, for a given chain. The reason for this prohibition is that, as variable x, avoid (x − x). The compiler is within for boolean operators, relational operators are its rights to substitute zero for any such cancel- often compiled using branches. Weak-memory lation, breaking the dependency chain and again machines such as ARM or PowerPC order stores WG21/P0098R0 10

1 struct foo { after such branches, but can speculate loads, 2 int a; which can break dependency chains. 3 }; 4 struct foo *fp; 5 struct foo default_foo; 9. Be very careful about comparing pointers in the 6 essential subset of a dependency chain. As Linus 7 int bar(void) 8 { Torvalds explained, if the two pointers are equal, 9 struct foo *p; the compiler could substitute the pointer you are 10 11 p = fp; comparing against for the pointer in the essen- 12 smp_read_barrier_depends(); tial subset of the dependency chain. On ARM 13 return p ? p->a : default_foo.a; 14 } and Power hardware, it might be that only the original value carried a hardware dependency, so this substitution would break the chain, in turn Figure 9: Default Value For RCU-Protected Pointer, permitting misordering. Such comparisons are Old Linux Kernel OK in the following cases:

(a) The pointer references memory that was values, then it will be able to deduce the initialized at boot time, or otherwise long exact value based on a not-equals compar- enough ago that readers cannot still have ison.) pre-initialized data cached. Examples in- clude module-init time for module code, be- The same issue arises when a sequence of in- fore kthread creation for code running in a equality comparison operators narrow to a single kthread, while the update-side lock is held, value. and so on. (b) The pointer is never dereferenced after 10. Disable any value-speculation optimizations that being compared. This exception applies your compiler might provide, especially if you are when comparing against the NULL pointer making use of feedback-based optimizations that or when scanning RCU-protected circular take data collected from prior runs. linked lists. (c) The pointer being compared against is part 4.1.2 Rules for 2003 GCC Implementations of the essential subset of a dependency chain. This can be a different dependency Prior to the 2.6.9 version of the Linux kernel, there chain, but only as long as that chain stems was neither rcu dereference() nor rcu assign from a pointer that was modified after any pointer(). Instead, explicit memory barriers were initialization of interest. This exception can used, smp read barrier depends() by readers and apply when carrying out RCU-protected smp wmb() by updaters. For example, the code shown traversals from different entry points that for current Linux kernels in Figure 6 would be as converged on the same data structure. shown in Figure 9 for 2.6.8 and earlier versions of the Linux kernel. A similar transformation relates the (d) The pointer being compared against is older use of smp wmb() and the more recent use of fetched using rcu access pointer() and rcu assign pointer(). all subsequent dereferences are stores. This older API was clearly much more vulnerable (e) The pointers compared not-equal and the to compiler optimizations than is the current API, compiler does not have enough information but the real motivation for this change was read- to deduce the value of the pointer. (For ability and maintainability, as can be seen from the example, if the compiler can see that the commit log for the mid-2004 patch introducing rcu pointer will only ever take on one of two dereference(): WG21/P0098R0 11

1 void new_element(struct foo **pp, int a) This patch introduced an rcu 2 { dereference() macro that replaces most 3 struct foo *p = malloc(sizeof(*p)); 4 uses of smp read barrier depends(). 5 if (!p) The new macro has the advantage of 6 abort(); 7 p->a = a; explicitly documenting which pointers 8 rcu_assign_pointer(pp, p); are protected by RCU – in contrast, it 9 } 10 is sometimes difficult to figure out which 11 int traverse(struct foo_head *ph) pointer is being protected by a given 12 { 13 int a = -1; smp read barrier depends() call. 14 struct foo *p; 15 16 p = rcu_dereference(&ph->h); 17 while (p != NULL) { The commit log for the mid-2004 patch introducing 18 if (p == (struct foo *)0xbadfab1e) rcu assign pointer() justifies the change in terms 19 a = ((struct foo *)0xbadfab1e)->a; 20 else of eliminating hard-to-use explicit memory barriers: 21 a = p->a; 22 p = rcu_dereference(&p->n); 23 } Attached is a patch that adds an rcu 24 return a; 25 } assign pointer() that allows a number of explicit smp wmb() memory barriers to be dispensed with, improving readability. Figure 10: Dangerous Optimizations: Hardware Branch Predictions

The importance of suppressing compiler optimiza- tions did not become apparent until much later. In 4.2 Rules for C-Language Imple- fact, a volatile cast was not added to the implementa- menters tion of rcu dereference() until 2.6.24 in early 2008. The main rule for C-language implementers is to avoid any sort of value speculation, or, at the very 4.1.3 Rules for 1990s Sequent C Implemen- least, provide means for the user to disable such tations speculation. An example of a value-speculation op- timization that can be carried out with the help of 1990s systems featured far slower CPUs and much hardware branch prediction is shown in Figure 10, less memory that is commonly provisioned to- which is an optimized version of the code in Fig- day, and the compilers were correspondingly less ure 5. This sort of transformation might result from sophisticated. Therefore, at that time, a sim- feedback-directed optimization, where profiling runs ple C-language field selector was used instead of determined that the value loaded from ph was almost any sort of rcu dereference() or memory order alway 0xbadfab1e. Although this transformation is consume operation. Not only was there no correct in a single-threaded environment, in a concur- volatile cast, there also was nothing resembling smp rent environment, nothing stops the compiler or the read barrier depends(). The lack of smp read CPU from speculating the load on line 19 before it barrier depends() is not too surprising, given that executes the rcu dereference() on line 16, which DYNIX/ptx did not run on DEC Alpha. could result in line 19 executing before the corre- sponding store on line 7, resulting in a garbage value This approach was nevertheless quite reliable be- 4 cause the use cases within the DYNIX/ptx kernel in variable a. were both few and straightforward, and provided lit- There are some situations where this sort of opti- tle or no opportunity for optimizations that might 4 Kudos to Olivier Giroux for pointing out use of branch break dependency chains. prediction to enable value speculation. WG21/P0098R0 12

1 int a(struct foo *p [[carries_dependency]]) mization would be safe, including: 2 { 3 return kill_dependency(p->a != 0); 1. The value speculated is a numeric value rather 4 } 5 than a pointer, so that if the guess proves correct 6 int b(int x) after the fact, the computation will be appropri- 7 { 8 return x; ate after the fact. 9 } 10 2. The value speculated is a pointer to invariant 11 foo *c(void) 12 { data, then reasonable values will be produced 13 return fp.load_explicit(memory_order_consume); by dereferencing, even if the guess proves to have 14 /* return rcu_dereference(fp) in Linux kernel. */ 15 } been correct only after the fact. 16 17 int d(void) 3. As above, but where any updates result in data 18 { 19 int a; that produces appropriate computations at any 20 foo *p; and all phases of the update. 21 22 rcu_read_lock(); 23 p = c(); However, this list does not contain the general case 24 a = p->a; of memory order consume loads. 25 rcu_read_unlock(); 26 return a; Pure hardware implementations of value specula- 27 } tion can avoid this problem because they monitor cache-coherence protocol events that would result from some other CPU invalidating the guess. Figure 11: Example Functions for Dependency Or- In short, compiler writers must provide means to dering, Part 1 disable all forms of value speculation, unless the spec- ulation is accompanied by some means of detecting the race condition that Figure 10 is subject to. dependency-chain annotations and also all issues Are there other dependency-breaking optimizations that might otherwise arise from use of dependency- that should be called out separately? breaking optimizations. This approach is fully com- patible with the Linux kernel community’s current approach to dependency chains. Unfortunately, there 5 Dependency Ordering in C11 are any number of valuable optimizations that break and C++11 Implementations dependency chains, so this approach seems impracti- cal. The simplest way to avoid dependency-ordering is- A third approach is to avoid value speculation sues is to strengthen all memory order consume oper- and other dependency-breaking optimizations in any ations to memory order acquire. This functions cor- function containing either a memory order consume rectly, but may result in unacceptable performance load or a [[carries dependency]] attribute. For due to memory-barrier instructions on weakly or- example, the hardware-branch-predition optimiza- 5 dered systems such as ARM and PowerPC, and may tion shown in Figure 10 would be prohibited in such further unnecessarily suppress code-motion optimiza- functions, as would cancellation optimizations such tions. as optimizing a = b + c - c into a = b. This too Another straightforward approach is to avoid value can result in missed opportunities for optimization, speculation and other dependency-breaking opti- though very probably many fewer than the previous mizations. This might result in missed opportu- approach. This approach can also result in issues due nities for optimization, but avoids any need for to dependency-breaking optimizations in functions 5 From a Linux-kernel community viewpoint, that should lacking [[carries dependency]] attributes, for ex- read “will result in unacceptable performance”. ample, function d() in Figure 11. It can also result WG21/P0098R0 13

1 [[carries_dependency]] struct foo *e(void) 2 { approach would make functions c() and d() in Fig- 3 return fp.load_explicit(memory_order_consume); ure 11 handle dependency chains in a natural manner, 4 /* return rcu_dereference(fp) in Linux kernel. */ 5 } but avoiding whole-program analysis would require 6 something similar to the [[carries dependency]] 7 int f(void) 8 { annotations called out in the C11 and C++11 stan- 9 int a; dards. 10 foo *p; 11 A fifth approach would be to require that all op- 12 rcu_read_lock(); erations on the essential subset of any dependency 13 p = e(); 14 a = p->a; chain be annotated. This would greatly ease imple- 15 rcu_read_unlock(); mentation, but would not be likely to be accepted by 16 return kill_dependency(a); 17 } the Linux kernel community. 18 A sixth approach is to track dependencies as called 19 int g(void) 20 { out in the C11 and C++11 standards. However, in- 21 int a; stead of emitting a memory-barrier instruction when 22 foo *p; 23 a dependency chain flows into or out of a function 24 rcu_read_lock(); without the benefit of [[carries dependency]], in- 25 p = e(); 26 a = p->a; sert an implicit kill dependency() invocation. Im- 27 rcu_read_unlock(); plementation should also optionally issue a diagnostic 28 return b(a); 29 } in this case. The motivation for this approach is that it is expected that many more kill dependencies() than [[carries dependency]] would be required to Figure 12: Example Functions for Dependency Or- convert the Linux kernel’s RCU code to C11. In the dering, Part 2 example in Figure 12, this approach would allow func- tion g() to avoid emitting an unnecessary memory- barrier instruction, but without function f()’s ex- in spurious memory-barrier instructions when a de- plicit kill dependency(). Both functions are in Fig- pendency chain goes out of scope, for example, with ure 12. the of function g() in Figure 12. A seventh and final approach is to track dependen- A fourth approach is to add a compile-time op- cies as called out in in the C11 and C++11 standards. eration corresponding to the beginning and end of With this approach, functions e() and f() properly RCU read-side critical section. These would need to preserve the required amount of dependency order- be evaluated at compile time, taking into account ing. the fact that these critical sections can nest and can be conditionally entered and exited. Note that the exit from an outermost RCU read-side critical sec- 6 Weaknesses in C11 and tion should imply a kill dependency() operation on C++11 Dependency Or- each variable that is live at that point in the code.6 Although it is probably impossible to precisely de- dering termine the bounds of a given RCU read-side critical section in the general case, conservative approaches Experience has shown several weaknesses in the de- that might overestimate the extent of a given sec- pendency ordering specified in the C11 and C++11 tion should be acceptable in almost all cases. This standards: 1. The C11 standard does not provide attributes, 6 What if a given rcu read unlock() sometimes marked the and in particular, does not provide the end of an outermost RCU read-side critical section, but other times was nested in some other RCU read-side critical section? [[carries dependency]] attribute. This pre- In that case, there should be no kill dependency(). vents the developer from specifying that a given WG21/P0098R0 14

1 p = atomic_load_explicit(gp, memory_order_consume); 2 if (p == ptr_a) ing a dependency. For another example, con- 3 a = p->special_a; sider the value-speculation-like code shown in 4 else 5 a = p->normal_a; Figure 13 that is sometimes written by devel- opers, and that was described in bullet 9 of Section 4.1. In this example, the standard re- Figure 13: Dependency-Ordering Value-Narrowing quires dependency ordering between the memory Hazard order consume load on line 1 and the sub- sequent dereference on line 3, but a typical compiler would not be expected to differenti- dependency chain passes into or out of a given ate between these two apparently identical val- function. The usual C-language response to this ues. These two examples show that a compiler situation is to introduce a new keyword. would need to detect and carefully handle these 2. The implementation complexity of the cases either by artificially inserting dependen- dependency-chain tracking required by both cies, omitting optimizations, differentiating be- standard can be quite onerous on the one hand, tween apparently identical values, or even by and the overhead of unconditionally promoting emitting memory order acquire fences. memory order consume loads to memory order 5. The whole point of memory order consume and acquire can be excessive on weakly ordered the resulting dependency chains is to allow de- implementations on the other. There is therefore velopers to optimize their code. Such optimiza- no easy way out for a memory order consume tion attempts can be completely defeated by the implementation on a weakly ordered system. memory order acquire fences that the standard currently requires when a dependency chain goes 3. The function-level granularity of [[carries out of scope without the benefit of a [[carries dependency]] seems too coarse. One problem dependency]] attribute. Preventing the com- is that points-to analysis is non-trivial, so that piler from emitting these fences requires liberal compilers are likely to have difficulty determin- use of kill dependency(), which clutters code, ing whether or not a given pointer carries a de- requires large developer effort, and further re- pendency. For example, the current wording of quires that the developer know quite a bit about the standard (intentionally!) does not disallow which code patterns a given version of a given dependency chaining through stores and loads. compiler can optimize (thus avoiding needless Therefore, if a dependency-carrying value might fences) and which it cannot (thus requiring man- ever be written to a given variable, an implemen- ual insertion of kill dependency(). tation might reasonably assume that any load from that variable must be assumed to carry a As of this writing, no known implementations fully dependency. support C11 or C++11 dependency ordering. It is worth asking why Paul didn’t anticipate these 4. The rules set out in the standard [29, 1.10p9] weaknesses. There are several reasons for this: do not align well with the rules that develop- 1. Compiler optimizations have become more ag- ers must currently adhere to in order to main- gressive over the eight years since Paul started tain dependency chains when using pre-C11 and working on standardization. pre-C++11 compilers (see Section 4.1). For ex- ample, the standard requires (x-x) to carry a 2. New dependency-ordering use cases have arisen dependency, and providing this guarantee would during that same time, in particular, there are at the very least require the compiler to also longer dependency chains and more of them, turn off optimizations that remove (x-x) (and including dependency chains spanning multiple similar patterns) if x might possibly be carry- compilation units. WG21/P0098R0 15

3. The number of dependency chains has increased Section 7.10 uses a new Carries dependency stor- by more than an order of magnitude during that age class to mark objects that carry dependencies. time, so that changes in code style can be ex- Section 7.11 provides an initial comparative evalua- pected to face a commeasurate increase in resis- tion. tance from the Linux kernel community – unless those changes bring some tangible benefit. 7.1 Revising C11 and C++11 With that, let’s look at some potential alternatives Dependency-Ordering Definition to dependency ordering as defined in the C11 and The following sections each describe a proposed revi- C++11 standards. sion of the dependency-ordering definition from that in the current C11 and C++11 standards. In many of these proposals, developers are required to follow 7 Potential Alternatives to C11 an additional rule in order to be able to rely on de- and C++11 Dependency Or- pendency ordering: Subsequent execution must not lead to a situation where there is only one possible dering value for the variable that is intended to carry the dependency.7 This is shown in Figure 17, where the Given the weaknesses in the current standard’s spec- compiler is permitted to break dependency ordering ification of dependency ordering, it is quite reason- on line 6 because it knows that the value of p is equal able to consider alternatives. To this end, Section 7.1 to that of q, which means that it could substitute discusses ease-of-use issues involved with revisions to the latter value from the former, which would break the C11 and C++11 definitions of dependency or- dependency ordering. In short, a dependency chain dering, Section 7.2 enlists help from the type system, breaks if it comes to a point where only a single value but also imposes value restrictions (thus revising the is possible, regardless of the value of the memory C11 and C++11 semantics for dependencies), Sec- order consume load heading up the chain. At first tion 7.3 enlists help from the type system without glance, this additional rule could be quite difficult to the value restrictions, and Section 7.4 describes a live with, as dependency ordering could come and go whole-program approach to dependency chains (also depending on small details of code far away from that revising the C11 and C++11 semantics for depen- point in the dependency chain. dencies). Section 7.5 describes a post-Rapperswil However, a review of the Linux-kernel operators in proposal that dependency chains be restricted to Section 3.2 shows that the most commonly used op- function-scope local variables and temporaries, and erators act identically under both definitions. The Section 7.6 describes a second post-Rapperswil pro- problem-free operators include ->, infix =, casts, pre- posal that the [[carries dependency]] attribute fix &, prefix *, and ternary ?:. be used to label local-scope variables that carry de- One example of a potentially troublesome opera- pendencies. Section 7.7 describes a proposal dis- tor, namely ==, is shown in Figure 17, where line 6 cussed verbally at Rapperswil that explicitly marks breaks dependency ordering because the value of p is the tails of dependency chains. Section 7.8 describes known to be equal to that of q, which is not part of a the inverse, namely marking the heads of dependency dependency chain. This example could be addressed chains. Section 7.9 describes an approach that avoids through careful diagnostic design coupled with appro- marking by sharply restricting the number and type priate coding standards. For example, the compiler of operations permitted in dependency chains. Each approach appears to have advantages and disadvan- 7 This restricted notion of dependence is sometimes called tages, so it is hoped that further discussion will either semantic dependence, and the value at the end of a depen- dence chain that does not represent a semantic dependence is help settle on one of these alternatives or generate sometimes said to be independent of the value at the head of something better. To help initiate this discussion, the dependency chain. WG21/P0098R0 16

1 int my_array[MY_ARRAY_SIZE]; 1 struct liststackhead { 2 2 struct liststack __rcu *first; 3 i = atomic_load_explicit(gi, memory_order_consume); 3 }; 4 r1 = my_array[i]; 4 5 struct liststack { 6 struct liststack __rcu *next; 7 void *t; Figure 14: Single-Element Arrays and Dependency 8 struct rcu_head rh; Ordering 9 }; 10 11 _Carries_dependency 12 void *ls_front(struct liststackhead *head) could emit a warning on line 6, but remain silent for 13 { 14 _Carries_dependency void *data; the equivalent line substituting q for p, namely, do 15 struct liststack *lsp; . 16 something with(q->a) 17 rcu_read_lock(); Another example is the use of postfix [] that 18 lsp = rcu_dereference(head->first); 19 if (lsp == NULL) is shown in Figure 14. If this code fragment was 20 data = NULL; compiled with MY ARRAY SIZE equal to one, there is 21 else 22 data = rcu_dereference(lsp->t); no dependency ordering between lines 3 and 4, but 23 rcu_read_unlock(); that same code fragment compiled with MY ARRAY 24 return data; SIZE equal to two or greater would be dependency- 25 } ordered. Here a diagnostic for single-element arrays might prove useful, and such a diagnostic can easily Figure 15: List-Based-Stack Example Code, 1 of 2 be supplied in this case using #if and #error. Or, better yet, don’t carry dependencies through through integers for use as array indexes. some sample code using Linux-kernel nomenclature In the Linux kernel, infix + and - are used for with the addition of a mythical C keyword Carries pointer and array computations. These are all safe dependency to annotate parameters, variables, and in that they operate on an integer and pointer, so return values that carry dependencies. Please note that any cancellation will not normally be detectable that this code example in no way endorses the dubi- at compile time. However, one big purpose of diag- ous practice of creating a parallel program with the nostics is to detect abnormal conditions indicating sort of choke point exemplified by the head of this probable bugs. Therefore, in cases where the com- list. Note also that cmpxchg() heads a dependency piler can determine that two values from dependency chain, which is completely reasonable within the con- chains are annihilating each other via infix + and -, text of the Linux kernel due to its acquire semantics, a diagnostic would be appropriate. which of course might be argued to indicate that the Similarly, the Linux kernel uses infix (bitwise) & to annotations in ls pop() are unnecessary. manipulate bits at the bottom of a pointer, where again cancellation will not normally be detectable at 7.2 Type-Based Designation of De- compile time—except in the case of operations on a pendency Chains With Restric- NULL pointer, for which dependency ordering is not tions meaningful in any case. However, as with infix + and -, if the compiler detects value annihilation, a This approach was formulated by Torvald Riegel in diagnostic would be appropriate. response to Linus Torvalds’s spirited criticisms of the Although issues with false positives and negatives current C11 and C++11 wording. needs further investigation, there is reason to hope This approach introduces a new value dep that this revision of the definition of dependency or- preserving type qualifier. Dependency ordering is dering might avoid significant impacts on ease of use. preserved only via variables having this type quali- With this hope, we proceed to the specific propos- fier. This is meant to model the real scope of depen- als, using the code in Figures 15 and 16 to show dencies, which is data flow, not execution at function- WG21/P0098R0 17

1 value_dep_preserving struct foo *p; 2 3 p = atomic_load_explicit(gp, memory_order_consume); 1 int ls_push(struct liststackhead *head, void *t) 4 q = some_other_pointer; 2 { 5 if (p == q) 3 struct liststack *lsp; 6 do_something_with(p->a); 4 struct liststack *lsnp1; 7 else 5 struct liststack *lsnp2; 8 do_something_else_with(p->b); 6 size_t sz; 7 8 sz = sizeof(*lsp); 9 sz = (sz + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE; Figure 17: Single-Value Variables and Dependency 10 sz *= CACHE_LINE_SIZE; Ordering 11 lsp = malloc(sz); 12 if (!lsp) 13 return -ENOMEM; 14 if (!t) level granularity. This approach should therefore give 15 abort(); 16 lsp->t = t; developers much finer control of which dependencies 17 rcu_read_lock(); are tracked. 18 lsnp2 = ACCESS_ONCE(head->first); 19 do { Assigning from a value dep preserving value to a 20 lsnp1 = lsnp2; 21 lsp->next = lsnp1; non-value dep preserving variable terminates the 22 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); tracking of dependencies in much the same way that 23 } while (lsnp1 != lsnp2); 24 rcu_read_unlock(); an explicit kill dependency() would. However, un- 25 return 0; like an explicit kill dependency(), compilers should 26 } 27 be able to emit a suppressable warning on implicit 28 static void ls_rcu_free_cb(struct rcu_head *rhp) conversions, so as to alert the developer about other- 29 { 8 30 struct liststack *lsp; wise silent dropping of dependency tracking. 31 Next, we specify that memory order consume loads 32 lsp = container_of(rhp, struct liststack, rh); 33 free(lsp); return a value dep preserving type by default; the 34 } compiler must assume such a load to be capable of 35 36 _Carries_dependency producing any value of the underlying type. In other 37 void *ls_pop(struct liststackhead *head) words, the implementation is not permitted to apply 38 { 39 _Carries_dependency struct liststack *lsp; any value-restriction knowledge it might gain from 40 struct liststack *lsnp1; whole-program analysis. We call this a local seman- 41 _Carries_dependency struct liststack *lsnp2; 42 _Carries_dependency void *data; tic dependency to distinguish not only from a pure 43 (syntactic) dependency, but also from a global seman- 44 rcu_read_lock(); 45 lsnp2 = rcu_dereference(head->first); tic dependency, where global information may be ap- 46 do { plied. Note that any global semantic dependency is 47 lsnp1 = lsnp2; 48 if (lsnp1 == NULL) { also a local semantic dependency, but that any local 49 rcu_read_unlock(); semantic dependency which is headed by a variable 50 return NULL; 51 } that can be proven to take on only a single value is not 52 lsp = rcu_dereference(lsnp1->next); a global semantic dependency. The term “semantic 53 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); 54 } while (lsnp1 != lsnp2); dependency” should be interpreted to mean a global 55 data = rcu_dereference(lsnp2->t); semantic dependency unless otherwise stated. 56 rcu_read_unlock(); 57 call_rcu(&lsnp2->rh, ls_rcu_free_cb); This allows developers to start with a clean slate 58 return data; for the additional rule that they must follow to be 59 } able to rely on dependency ordering: Subsequent ex- ecution must not lead to a situation there is only one Figure 16: List-Based-Stack Example Code, 2 of 2 8 Other choices are possible in this case, including emit- ting a memory order acquire fence in order to conservatively preserve a potentially intended ordering. WG21/P0098R0 18

1 #define rcu_dereference(x) \ possible value for the value dep preserving expres- 2 atomic_load_explicit((x), memory_order_consume); sion, because otherwise the implementation is per- 3 4 struct liststackhead { mitted to break the dependency chain. As noted 5 struct liststack value_dep_preserving *first; earlier, this is shown in Figure 17, where the com- 6 }; 7 piler is permitted to break dependency ordering on 8 struct liststack { line 6 because it knows that the value of p is equal 9 struct liststack value_dep_preserving *next; 10 void *t; to that of q, which means that it could substitute 11 struct rcu_head rh; the latter value from the former, which would break 12 }; 13 dependency ordering. 14 value_dep_preserving This approach has several advantages: 15 void *ls_front(struct liststackhead *head) 16 { 17 value_dep_preserving void *data; 1. The implementation is simpler because no de- 18 value_dep_preserving struct liststack *lsp; pendency chains need to be traced. The imple- 19 20 rcu_read_lock(); mentation can instead drive optimization deci- 21 lsp = rcu_dereference(head->first); sions strictly from type information. 22 if (lsp == NULL) 23 data = NULL; 24 else 2. Use of the value dep preserving type modifier 25 data = rcu_dereference(lsp->t); allows the developer to limit the extent of the 26 rcu_read_unlock(); 27 return data; dependency chains. 28 } 3. This type modifier can be used to mark a depen- dency chain’s entry to and exit from a function Figure 18: List-Based-Stack Restricted Type-Based in a straightforward way, without the need for Designation, 1 of 2 attributes.

4. The value dep preserving type modifiers serve will be increasingly necessary for large-scale concur- as valuable documentation of the developer’s in- rent programming efforts. In particular, the concern tent. is that forcing the compiler to assume that a memory order consume load could possibly return any value 5. This approach permits many additional opti- permitted by its type might require program-analysis mizations compared to those permitted by the tools to consider counterfactual hypothetical execu- current standard on code that carries a depen- tions, which might complicate specification of seman- dency. Expressions such as (x-x) no longer tics and verification. require establishment of artificial dependencies Figures 18 and 19 show how this approach plays and the compiler is no longer required to detect out with the list-based stack. value-narrowing hazards like that shown in Fig- ure 13. However, the compiler is still prohibited from adding its own value-speculation optimiza- 7.3 Type-Based Designation of De- tions. pendency Chains

6. Linus Torvalds seems to be OK with it, which Jeff Preshing made an off-list suggestion of using a indicates that this set of rules might be practical value dep preserving type modifier as suggested from the perspective of developers who currently by Torvald Riegel, but using this type modifier to exploit dependency chains. strictly enforce dependency ordering. For example, consider the code fragment shown in Figure 17. The According to Peter Sewell, one disadvantage is that scheme described in Section 7.2 would not necessar- this approach will be quite difficult to model, which in ily enforce dependency ordering between the load on turn will pose obstacles for the analysis tooling that line 3 and the access one line 6, while the approach WG21/P0098R0 19

described in this section would enforce dependency 1 int ls_push(struct liststackhead *head, void *t) ordering in this case. 2 { Furthermore, cancelling or value-destruction oper- 3 struct liststack *lsp; 4 struct liststack *lsnp1; ations on value dep preserving values would not 5 struct liststack *lsnp2; disrupt dependency ordering. As with the cur- 6 size_t sz; 7 rent C11 and C++11 standards, the implementation 8 sz = sizeof(*lsp); would be required to emit a memory-barrier instruc- 9 sz = (sz + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE; 10 sz *= CACHE_LINE_SIZE; tion or compute an artificial dependency for such op- 11 lsp = malloc(sz); erations. (Note however that use of cancelling or 12 if (!lsp) 13 return -ENOMEM; value-destruction operations on dependency chains 14 if (!t) has proven quite rare in practice.) 15 abort(); 16 lsp->t = t; This approach shares many of the advantages of 17 rcu_read_lock(); Torvald Riegel’s approach: 18 lsnp2 = ACCESS_ONCE(head->first); 19 do { 20 lsnp1 = lsnp2; 1. The implementation is simpler because no de- 21 lsp->next = lsnp1; pendency chains need be traced. The implemen- 22 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); 23 } while (lsnp1 != lsnp2); tation can instead drive optimization decisions 24 rcu_read_unlock(); strictly from type information. 25 return 0; 26 } 27 2. Use of the value dep preserving type modifier 28 static void ls_rcu_free_cb(struct rcu_head *rhp) allows the developer to limit the extent of the 29 { 30 struct liststack *lsp; dependency chains. 31 32 lsp = container_of(rhp, struct liststack, rh); 3. This type modifier can be used to mark a depen- 33 free(lsp); dency chain’s entry to and exit from a function 34 } 35 in a straightforward way, without the need for 36 value_dep_preserving attributes. 37 void *ls_pop(struct liststackhead *head) 38 { 39 value_dep_preserving struct liststack *lsp; 4. The value dep preserving type modifiers serve 40 struct liststack *lsnp1; as valuable documentation of the developer’s in- 41 value_dep_preserving struct liststack *lsnp2; 42 value_dep_preserving void *data; tent. 43 44 rcu_read_lock(); 5. Although optimizations on a dependency chain 45 lsnp2 = rcu_dereference(head->first); 46 do { are restricted just as in the current standard, 47 lsnp1 = lsnp2; the use of value dep preserving restricts the 48 if (lsnp1 == NULL) { 49 rcu_read_unlock(); dependency chains to those intended by the de- 50 return NULL; veloper. 51 } 52 lsp = rcu_dereference(lsnp1->next); 6. Restricting dependency-breaking optimizations 53 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); 54 } while (lsnp1 != lsnp2); on all dependency chains marked value dep 55 data = rcu_dereference(lsnp2->t); , without exceptions for cases in 56 rcu_read_unlock(); preserving 57 call_rcu(&lsnp2->rh, ls_rcu_free_cb); which the compiler knows too much, might make 58 return data; this approach easier to learn and to use. 59 } It is expected that modeling this approach should Figure 19: List-Based-Stack Restricted Type-Based be straightforward because the modeling tools would Designation, 2 of 2 be able to make use of the type information. This approach results in the same code as shown in Fig- ures 18 and 19 of the previous section. WG21/P0098R0 20

7.4 Whole-Program Option when loaded by the same thread that did the store. This approach, also suggested off-list by Jeff Presh- ing, has the goal of reusing existing non-dependency- Some implementations might allow the developer ordered source code unchanged (albeit requiring re- to choose between these two approaches, for example, 9 compilation in most cases). For example, this ap- by using a compiler switch provided for that purpose. proach permits an instance of std::map to be refer- This approach also has the effect of permitting a enced by a pointer loaded via memory order consume trivial implementation of a memory order consume and to provide that std::map instance with the atomic thread fence(). When using the first im- benefits of dependency ordering without any code plementation approach, the atomic thread fence() changes whatsoever to std::map. It is important to is simply promoted to memory order acquire. In- note that this protection will be provided only to a terestingly enough, when using the second ap- read-only std::map that is referenced by a changing proach, the memory order consume atomic thread pointer loaded via memory order consume, in partic- fence() may simply be ignored. The reason for ular, not to a concurrently updated std::map refer- this is that this approach has the effect of promot- enced by a pointer (read-only or otherwise) loaded ing memory order relaxed loads to memory order via memory order consume. This latter case would consume, which already globally enforces all the require changes to the underlying std:map implemen- ordering that the memory order consume atomic tation, at a minimum, changing some of the loads to thread fence() is required to provide locally.10 be memory order consume loads. Nevertheless, the This approach has its own set of advantages and ability to provide dependency-ordering protection to disadvantages: pre-existing linked data structures is valuable, even with this read-only restriction. 1. This approach dispenses with the [[carries This approach, which again does require full re- dependency]] attribute and the kill compilation, can be implemented using two ap- dependency() primitive. proaches: 2. This approach better promotes reuse of existing 1. Promote all memory order consume loads to source code. In particular, it should require no memory order acquire, as may be done with changes to the current Linux-kernel source base, the current standard. aside from changes to the rcu dereference() family of primitives. 2. On architectures that respect memory order- ing, prohibit all dependency-breaking optimiza- 3. This approach allows implementations to carry tions throughout the entire program, but only out dependency-breaking optimizations on de- in cases where a change in the value returned pendency chains as long as a change in the by a memory order consume load could cause a value from the memory order consume load does change in the value computed later in that same not change values further down the dependency dependency chain, in other words, where there is chain, both with and without the optimization. a global semantic dependency. Note again that Jeff conjectures that the set of dependency- the possibility of storing a value obtained from breaking optimizations used in practice apply a memory order consume load, then loading it only outside of dependency chains, by the re- later, means that normal loads as well as memory vised definition in which single-value restrictions order relaxed loads often must be considered 10 Of course, this presumed promotion from memory order to head their own dependency chains, but only relaxed to memory order consume means that architectures such as DEC Alpha that do not respect dependency order- 9 A module or library that is known to never carry a de- ing must continue to use the first option of emitting memory- pendency need not be recompiled. ordering instructions for memory order consume loads. WG21/P0098R0 21

1 #define rcu_dereference(x) \ 2 atomic_load_explicit((x), memory_order_consume); 3 4 struct liststackhead { 1 int ls_push(struct liststackhead *head, void *t) 5 struct liststack *first; 2 { 6 }; 3 struct liststack *lsp; 7 4 struct liststack *lsnp1; 8 struct liststack { 5 struct liststack *lsnp2; 9 struct liststack *next; 6 size_t sz; 10 void *t; 7 11 struct rcu_head rh; 8 sz = sizeof(*lsp); 12 }; 9 sz = (sz + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE; 13 10 sz *= CACHE_LINE_SIZE; 14 void *ls_front(struct liststackhead *head) 11 lsp = malloc(sz); 15 { 12 if (!lsp) 16 void *data; 13 return -ENOMEM; 17 struct liststack *lsp; 14 if (!t) 18 15 abort(); 19 rcu_read_lock(); 16 lsp->t = t; 20 lsp = rcu_dereference(head->first); 17 rcu_read_lock(); 21 if (lsp == NULL) 18 lsnp2 = ACCESS_ONCE(head->first); 22 data = NULL; 19 do { 23 else 20 lsnp1 = lsnp2; 24 data = rcu_dereference(lsp->t); 21 lsp->next = lsnp1; 25 rcu_read_unlock(); 22 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); 26 return data; 23 } while (lsnp1 != lsnp2); 27 } 24 rcu_read_unlock(); 25 return 0; 26 } 27 Figure 20: List-Based-Stack Whole-Program Ap- 28 static void ls_rcu_free_cb(struct rcu_head *rhp) proach, 1 of 2 29 { 30 struct liststack *lsp; 31 32 lsp = container_of(rhp, struct liststack, rh); 11 33 free(lsp); break dependency chains. If this conjecture 34 } holds, it also applies to Torvald’s approach de- 35 36 void *ls_pop(struct liststackhead *head) scribed in Section 7.2. 37 { 38 struct liststack *lsp; 39 struct liststack *lsnp1; 4. Code that follows the rules presented in Sec- 40 struct liststack *lsnp2; tion 4.1 (substituting memory order consume 41 void *data; 42 loads for volatile loads) would have its depen- 43 rcu_read_lock(); dency ordering properly preserved. 44 lsnp2 = rcu_dereference(head->first); 45 do { 46 lsnp1 = lsnp2; It is unlikely that this approach could be modeled 47 if (lsnp1 == NULL) { 48 rcu_read_unlock(); reasonably given the current state of the art. The 49 return NULL; requirement that any given memory order consume 50 } 51 lsp = rcu_dereference(lsnp1->next); load be able to generate at least two different val- 52 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); ues at the tail of the dependency chain is believed 53 } while (lsnp1 != lsnp2); 54 data = rcu_dereference(lsnp2->t); to be a show-stopper, especially when coupled with 55 rcu_read_unlock(); whole-program analysis, which might find that there 56 call_rcu(&lsnp2->rh, ls_rcu_free_cb); 57 return data; is only one value entering at the head of the depen- 58 } dency chain. This approach allows annotations to be discarded, Figure 21: List-Based-Stack Whole-Program Ap- as shown in Figures 20 and 21. However, the memory proach, 2 of 2 11 This is certainly the case for the usual optimizations ex- emplified by replacing (x-x) with zero. WG21/P0098R0 22

order consume loads are still required in order to units, as is the case in some parts of the Linux enable the promote-to-acquire implementation style. kernel.

2. The implementation is likely to be some- 7.5 Local-Variable Restriction what simpler because only those dependency chains passing through local variables, compiler- This approach, suggested off-list by Hans Boehm, generated temporaries, compiler-visible function limits the extent of dependency trees to a local, which arguments, and compiler-visible return values includes local variables, temporaries, function argu- need be traced. One could also argue that func- ments, and return variables. Assigning a value from a tion arguments and return values marked with memory order consume load to such an object begins [[carries dependency]] attribute also need to a dependency chain. Assigning a value loaded from be traced. such a local to a global variable (including function- local variables marked static) or to the heap implies 3. Many irrelevant dependency chains are pruned a kill dependency(), so that dependency chains are by default, thus fewer std::kill dependency() confined to locals. However, if the compiler is unable calls are required. to see the full dependency chain, for example, be- cause it passes into a function in another translation 4. Although optimizations on dependency chains unit that is not marked [[carries dependency]], must be restricted, the restricted scope of de- the compiler should promote memory order consume pendency chains reduces the impact of these re- 12 to memory order acquire. strictions. Section 3.2 indicates that the following operators should transmit dependency status from one local 5. Applying this approach to the Linux kernel variable or temporary to another: ->, infix =, casts, would only require the addition of markings on prefix &, prefix *, [], infix +, infix -, ternary ?:, function parameters and return values corre- infix (bitwise) &, and probably also |. Similarly, Sec- sponding to cross-translation-unit function calls. tion 3.3 indicates that the following operators should However, there are a significant number of these, imply a kill dependency(): (), !, ==, !=, &&, ||, so this approach can expect significant resistance infix *, /, and %. from the Linux community. It will also be necesary to check whether Linux- kernel usage expects dependency chains to pass It is expected that modeling this approach should through globals and heap objects that are in some be no more difficult than for the current C11 and way thread-local. If there are such use cases, and if C++11 standards. they are sane and cannot easily be changed to use This approach allows local-variable annotations to local variables, should [[carries dependency]] be be dropped, as shown in Figure 22 and 23 used to flag dependency-carrying globals and heap objects? This approach has the following advantages and 7.6 Mark Dependency-Carrying Lo- disadvantages: cal Variables This approach, suggested offlist by Clark Nelson, uses 1. This approach requires that the C language add the [[carries dependency]] attribute to mark non- the [[carries dependency]] attribute if de- static local-scope variables as carrying a dependency, pendency chains are to span multiple translation in addition to its current use marking function ar- guments and return values as carrying dependen- 12 Some implementations might provide means to allow the user to specify that a diagnostic be generated if such promotion cies. It is not permissible to mark global variables is necessary. or structure members with this attribute. Assigning WG21/P0098R0 23

1 #define rcu_dereference(x) \ 2 atomic_load_explicit((x), memory_order_consume); 3 1 int ls_push(struct liststackhead *head, void *t) 4 struct liststackhead { 2 { 5 struct liststack *first; 3 struct liststack *lsp; 6 }; 4 struct liststack *lsnp1; 7 5 struct liststack *lsnp2; 8 struct liststack { 6 size_t sz; 9 struct liststack *next; 7 10 void *t; 8 sz = sizeof(*lsp); 11 struct rcu_head rh; 9 sz = (sz + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE; 12 }; 10 sz *= CACHE_LINE_SIZE; 13 11 lsp = malloc(sz); 14 _Carries_dependency 12 if (!lsp) 15 void *ls_front(struct liststackhead *head) 13 return -ENOMEM; 16 { 14 if (!t) 17 void *data; 15 abort(); 18 struct liststack *lsp; 16 lsp->t = t; 19 17 rcu_read_lock(); 20 rcu_read_lock(); 18 lsnp2 = ACCESS_ONCE(head->first); 21 lsp = rcu_dereference(head->first); 19 do { 22 if (lsp == NULL) 20 lsnp1 = lsnp2; 23 data = NULL; 21 lsp->next = lsnp1; 24 else 22 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); 25 data = rcu_dereference(lsp->t); 23 } while (lsnp1 != lsnp2); 26 rcu_read_unlock(); 24 rcu_read_unlock(); 27 return data; 25 return 0; 28 } 26 } 27 28 static void ls_rcu_free_cb(struct rcu_head *rhp) Figure 22: List-Based-Stack Local-Variable Restric- 29 { 30 struct liststack *lsp; tion, 1 of 2 31 32 lsp = container_of(rhp, struct liststack, rh); 33 free(lsp); 34 } from a [[carries dependency]] object to a non- 35 object results in an im- 36 _Carries_dependency [[carries dependency]] 37 void *ls_pop(struct liststackhead *head) plicit kill dependency(). 38 { This approach is similar to that of Section 7.3, ex- 39 struct liststack *lsp; 40 struct liststack *lsnp1; cept that it uses an attribute rather than a type mod- 41 struct liststack *lsnp2; ifier. As such, it has many of the advantages and dis- 42 void *data; 43 advantages of that approach, however, some believe 44 rcu_read_lock(); that an attribute-based approach will be more ac- 45 lsnp2 = rcu_dereference(head->first); 46 do { ceptable to the committee than would a type-modifier 47 lsnp1 = lsnp2; approach.13 However, this approach does require 48 if (lsnp1 == NULL) { 49 rcu_read_unlock(); that C add attributes. 50 return NULL; This leave the question of which operators 51 } 52 lsp = rcu_dereference(lsnp1->next); transmit dependency chains from one [[carries 53 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); object to another. Section 3.2 indi- 54 } while (lsnp1 != lsnp2); dependency]] 55 data = rcu_dereference(lsnp2->t); cates that the following operators should transmit 56 rcu_read_unlock(); dependency status from one local variable or tem- 57 call_rcu(&lsnp2->rh, ls_rcu_free_cb); 58 return data; porary to another: ->, infix =, casts, prefix &, prefix 59 } *, [], infix +, infix -, ternary ?:, infix (bitwise) &, and probably also . Similarly, Section 3.3 shows | Figure 23: List-Based-Stack Local-Variable Restric- that the following operators should imply a kill tion, 2 of 2 13 Lawrence Crowl suggests a third approach, namely a vari- able modifier. WG21/P0098R0 24

1 #define rcu_dereference(x) \ dependency(): (), !, ==, !=, &&, ||, infix *, /, and 2 atomic_load_explicit((x), memory_order_consume); %. 3 4 struct liststackhead { This approach has the following advantages and 5 struct liststack *first; disadvantages: 6 }; 7 8 struct liststack { 1. This approach requires that the C language add 9 struct liststack *next; the [[carries dependency]] attribute. 10 void *t; 11 struct rcu_head rh; 12 }; 2. The implementation is likely to be simpler be- 13 14 _Carries_dependency cause only those dependency chains passing 15 void *ls_front(struct liststackhead *head) through variables marked with the [[carries 16 { 17 _Carries_dependency void *data; dependency]] attribute need be traced. 18 _Carries_dependency struct liststack *lsp; 19 3. Many irrelevant dependency chains are pruned 20 rcu_read_lock(); 21 lsp = rcu_dereference(head->first); by default, thus fewer std::kill dependency() 22 if (lsp == NULL) calls are required. 23 data = NULL; 24 else 25 data = rcu_dereference(lsp->t); 4. The [[carries dependency]] calls serve as 26 rcu_read_unlock(); valuable documentation of the developer’s in- 27 return data; tent. 28 }

5. Although optimizations on dependency chains Figure 24: List-Based-Stack Marked Local Variables, must be restricted, use of explicit [[carries 1 of 2 dependency]] greatly reduces unnecessary re- striction of optimizations on unintentional de- pendency chains. 7.7 Explicitly Tail-Marked Depen- 6. Applying this to the Linux kernel would require dency Chains significant marking of variables carrying depen- dencies, given that the Linux kernel currently This approach, suggested at Rapperswil by Olivier requires no such markings. Giroux, can be thought of as the inverse of std::kill dependency(). Instead of explicitly 7. Common types of abstraction need to be han- marking where the dependency chains terminate, dled correctly. For example, there are more than Olivier’s proposal uses a std::dependency() prim- 500 invocations of list for each entry rcu() itive to indicate the locations in the code that the in the v4.1 Linux kernel, each of which heads dependency chains are required to reach. The first ar- a distinct dependency chain. In addition, some gument to std::dependency() is the value to which of these distinct dependency chains invoke com- the dependency must be carried, and the second argu- mon functions and macros. It is not clear that ment is the variable that heads the dependency chain, compile-time marking suffices for these cases. in other words, the second argument is the variable that was loaded from by a memory order consume It is expected that modeling this approach should load. This proposal differs from the others in that be no more difficult than for the current C11 and it is expected to be implemented not necessarily by C++11 standards. preserving the dependency, but instead by inserting This approach results in code as shown in Fig- barriers in those cases where optimizations have elim- ures 24 and 25, where the [[carries dependency]] inated any required dependencies. The goal here is to attributes have been replaced with a mythical impose minimal restrictions on optimizations of code Carries dependency C keyword. containing dependency chains. WG21/P0098R0 25

1 p = atomic_load_explicit(&gp, memory_order_consume); 2 if (p != NULL) 1 int ls_push(struct liststackhead *head, void *t) 3 do_it(atomic_dependency(p, gp)); 2 { 3 struct liststack *lsp; 4 struct liststack *lsnp1; Figure 26: Explicit Dependency Operations 5 struct liststack *lsnp2; 6 size_t sz; 7 1 void foo(struct bar *q [[carries_dependency]]) 8 sz = sizeof(*lsp); 2 { 9 sz = (sz + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE; 3 if (q != NULL) 10 sz *= CACHE_LINE_SIZE; 4 do_it(atomic_dependency(q->b, q)); 11 lsp = malloc(sz); 5 } 12 if (!lsp) 6 13 return -ENOMEM; 7 p = atomic_load_explicit(&gp, memory_order_consume); 14 if (!t) 8 foo(atomic_dependency(p, gp)); 15 abort(); 16 lsp->t = t; 17 rcu_read_lock(); 18 lsnp2 = ACCESS_ONCE(head->first); Figure 27: Explicit Dependency Operations and car- 19 do { ries dependency 20 lsnp1 = lsnp2; 21 lsp->next = lsnp1; 22 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); 23 } while (lsnp1 != lsnp2); 24 rcu_read_unlock(); A C-language example is shown in Figure 26, 25 return 0; where std::dependency() is transliterated to the C- 26 } 27 language atomic dependency() function. On line 3, 28 static void ls_rcu_free_cb(struct rcu_head *rhp) atomic dependency() returns the value of its first 29 { 30 struct liststack *lsp; argument (p), while ensuring that the data depen- 31 dency from the memory order consume load from gp 32 lsp = container_of(rhp, struct liststack, rh); 33 free(lsp); is faithfully reflected in the assembly language imple- 34 } menting this code fragment. The assembly-language 35 36 _Carries_dependency reflection of this dependency might be in terms of 37 void *ls_pop(struct liststackhead *head) an assembly-language dependency (for example, on 38 { 39 _Carries_dependency struct liststack *lsp; ARM or PowerPC), implicit (for 40 struct liststack *lsnp1; example, on x86 or mainframe), or by an explicit 41 _Carries_dependency struct liststack *lsnp2; 42 _Carries_dependency void *data; memory-barrier instruction. However, if there was no 43 atomic dependency() function, the compiler would 44 rcu_read_lock(); 14 45 lsnp2 = rcu_dereference(head->first); be under no obligation to preserve the dependency. 46 do { These explicitly specified dependencies may be 47 lsnp1 = lsnp2; 48 if (lsnp1 == NULL) { combined with [[carries dependency]] attributes 49 rcu_read_unlock(); on function arguments, for example, as shown in Fig- 50 return NULL; 51 } ure 27. Note the interplay of atomic dependency() 52 lsp = rcu_dereference(lsnp1->next); and [[carries dependency]], where line 8 estab- 53 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); 54 } while (lsnp1 != lsnp2); lishes the dependency between the load from gp and 55 data = rcu_dereference(lsnp2->t); the [[carries dependency]] argument q of foo(), 56 rcu_read_unlock(); 57 call_rcu(&lsnp2->rh, ls_rcu_free_cb); and where line 4 establishes the further dependency 58 return data; between argument q of foo() and do it()s argu- 59 } ment. This approach is not yet complete. One issue is Figure 25: List-Based-Stack Marked Local Variables, the possibility of a given operation being dependent 2 of 2 14 Would it be better to have the second argument to atomic dependency() be a label rather than an expression? WG21/P0098R0 26

on multiple memory order consume loads. One ap- plicit atomic dependency() calls (and, option- proach is of course to omit this functionality, and an- ally, intermediate [[carries dependency]] at- other is to allow atomic dependency() to allow an tributes) need be traced. expression as its first argument and a variable list of memory order consume loaded variables. 3. Irrelevant dependency chains are pruned by de- Another issue is connecting [[carries fault, with no std::kill dependency() calls re- dependency]] return values to subsequent atomic quired. dependency() invocations. There are a number 4. The atomic dependency() calls serve as valu- of possible resolutions to this issue. One approach able documentation of the developer’s intent. would be to use [[carries dependency]] attribute to mark the declaration of the variable to which 5. Although optimizations on dependency chains the function’s return value is assigned, bringing the must be restricted, use of explicit atomic proposal from Section 7.6 to bear. In the special dependency() greatly reduces unnecessary re- case where the memory order consume load is in the striction of optimizations on unintentional de- same function body as the atomic dependency() pendency chains. that depends on it, the atomic dependency() could reference the variable that was the source 6. Applying this to the Linux kernel would require of the original memory order consume load. An- significant marking of dependency chains, given other approach would be to allow function-return that the Linux kernel currently relies on implicit carries dependency]] attributes to define names ends of dependency chains. that could be used by later atomic dependency() invocations. 7. Common types of abstraction need to be han- A third issue arises when atomic dependency() dled correctly. For example, there are more than must be applied after the head of the dependency 500 invocations of list for each entry rcu() chain has gone out of scope, for example, if the head in the v4.1 Linux kernel, each of which heads was contained in a variable defined in an inner scope a distinct dependency chain. In addition, some that has since been exited. of these distinct dependency chains invoke com- A fourth issue arises if optimizations along a mon functions and macros. It is not clear that needed dependency chain allow ordering the depen- compile-time marking suffices for these cases. dent operation to precede the head of the depen- dency chain, in which case inserting barriers would It is not yet known whether this approach can be be ineffective. The current proposal for addressing reasonably modeled. this issue is to suppress memory-movement optimiza- The result is shown in Figures 28 and 29. tions across the atomic dependency(), perhaps us- ing something like atomic signal fence() or the 7.8 Explicitly Head-Marked Depen- Linux kernel’s barrier() macro. This approach al- dency Chains lows dependency checking and fence insertion to be carried out as a final pass in the compilation process. This approach, suggested via email by Olivier Giroux, This approach has the following advantages and can be thought of as another inverse of std::kill disadvantages: dependency(). In this case the heads of the de- pendency chains are marked, indicating to which 1. This approach requires that the C language add pointed-to objects dependencies should be carried. the [[carries dependency]] attribute. This description is an extrapolation of a very con- cise proposal, and corrections are welcome. 2. The implementation is likely to be simpler be- The general idea is to provide an augmented cause only those dependency chains having ex- form of the load() member function that indicates WG21/P0098R0 27

1 int ls_push(struct liststackhead *head, void *t) 2 { 3 struct liststack *lsp; 4 struct liststack *lsnp1; 5 struct liststack *lsnp2; 6 size_t sz; 7 8 sz = sizeof(*lsp); 9 sz = (sz + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE; 10 sz *= CACHE_LINE_SIZE; 11 lsp = malloc(sz); 12 if (!lsp) 13 return -ENOMEM; 14 if (!t) 15 abort(); 1 #define rcu_dereference(x) \ 16 lsp->t = t; 2 atomic_load_explicit((x), memory_order_consume); 17 rcu_read_lock(); 3 18 lsnp2 = ACCESS_ONCE(head->first); 4 struct liststackhead { 19 do { 5 struct liststack *first; 20 lsnp1 = lsnp2; 6 }; 21 lsp->next = lsnp1; 7 22 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); 8 struct liststack { 23 } while (lsnp1 != lsnp2); 9 struct liststack *next; 24 rcu_read_unlock(); 10 void *t; 25 return 0; 11 struct rcu_head rh; 26 } 12 }; 27 13 28 static void ls_rcu_free_cb(struct rcu_head *rhp) 14 _Carries_dependency 29 { 15 void *ls_front(struct liststackhead *head) 30 struct liststack *lsp; 16 { 31 17 void *data; 32 lsp = container_of(rhp, struct liststack, rh); 18 struct liststack *lsp; 33 free(lsp); 19 34 } 20 rcu_read_lock(); 35 21 lsp = rcu_dereference(head->first); 36 _Carries_dependency 22 if (lsp == NULL) 37 void *ls_pop(struct liststackhead *head) 23 data = NULL; 38 { 24 else 39 struct liststack *lsp; 25 data = 40 struct liststack *lsnp1; 26 rcu_dereference(atomic_dependency(lsp->t, 41 struct liststack *lsnp2; 27 head->first)); 42 void *data; 28 rcu_read_unlock(); 43 29 return atomic_dependency(data, lsp->t); 44 rcu_read_lock(); 30 } 45 lsnp2 = rcu_dereference(head->first); 46 do { 47 lsnp1 = lsnp2; Figure 28: List-Based-Stack Tail-Marked Dependen- 48 if (lsnp1 == NULL) { 49 rcu_read_unlock(); cies, 1 of 2 50 return NULL; 51 } 52 lsp = rcu_dereference(lsnp1->next); 53 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); 54 } while (lsnp1 != lsnp2); 55 data = rcu_dereference(atomic_dependency(lsnp2->t, lsnp)); 56 rcu_read_unlock(); 57 call_rcu(&lsnp2->rh, ls_rcu_free_cb); 58 return atomic_dereference(data, lsnp2->t); 59 }

Figure 29: List-Based-Stack Tail-Marked Dependen- cies, 2 of 2 WG21/P0098R0 28

1 struct foo { 1 #define rcu_dereference(x) \ 2 struct foo *a; 2 atomic_load_explicit((x), memory_order_consume); 3 struct foo *b; 3 4 struct foo *c; 4 struct liststackhead { 5 int d; 5 struct liststack *first; 6 }; 6 }; 7 7 8 p = atomic_load_explicit(&gp, memory_order_consume, 8 struct liststack { 9 p->a, p->b); 9 struct liststack *next; 10 qa = p->a; /* Dependency carried. */ 10 void *t; 11 qb = p->b; /* Dependency carried. */ 11 struct rcu_head rh; 12 qc = p->c; /* No dependency carried. */ 12 }; 13 d = p->d; /* No dependency carried. */ 13 14 _Carries_dependency 15 void *ls_front(struct liststackhead *head) 16 { Figure 30: Explicit Dependency Operations and Aug- 17 void *data; mented Load 18 struct liststack *lsp; 19 20 rcu_read_lock(); 21 lsp = rcu_dereference(head->first, head->first->t); dependencies, for example, x.load(memory order 22 if (lsp == NULL) 23 data = NULL; consume, x->next) would cause a dependency to 24 else be carried through the ->next field, but through 25 data = 26 rcu_dereference(lsp->t, *lsp->t)); /* ??? */ no other field. This is shown in Figure 30, where 28 rcu_read_unlock(); the explicit dependency information on line 9 causes 29 return data; 30 } lines 10 and 11 to carry a dependency, but lines 12 and 13 not to do so. Some open questions regarding this approach: Figure 31: List-Based-Stack Head-Marked Depen- dencies, 1 of 2 1. How does this interact with arguments and re- turn values? Do the corresponding annotations need to indicate to which fields dependencies dependency ordering in the declaration of the might be carried? Should mismatches be con- struct or class? sidered an error, and if so, which sorts of mis- 4. Common types of abstraction need to be han- matches? dled correctly. For example, there are more than 2. How are opaque types handled? For exam- 500 invocations of list for each entry rcu() ple, consider the Linux kernel linked-list facil- in the v4.1 Linux kernel, each of which heads ity, which embeds a list head structure into a distinct dependency chain. In addition, some the enclosing object that is to be placed on the of these distinct dependency chains invoke com- list. The memory order consume load returns mon functions and macros. It is not clear that a (struct list head *), but it may be nec- compile-time marking suffices for these cases. essary to carry a dependency to one or more of the fields in the enclosing object. Should this be handled via something like x.load(memory 7.9 Restricted Dependency Chains order consume, *x), but if so, doesn’t this re-introduce the need for lots of std::kill This approach restricts dependency chains to opera- dependency() calls? tions for which compilers would naturally carry de- pendencies. As such, this approach can be consid- 3. Larger structures might have quite a few fields ered to be a refinement of the whole-program option that need dependencies carried. Should there (Section 7.4), restricted as described in Section 4.1.2, be some sort of shorthand to make this easier but also omitting control dependencies and RCU- to code, for example, tagging the fields needing protected array indexes. It can also be thought of WG21/P0098R0 29

1 #define rcu_ptr_extract(p) \ 2 ({ \ 1 int ls_push(struct liststackhead *head, void *t) 3 uintptr_t ___ip = (uintptr_t)(p); \ 2 { 4 \ 3 struct liststack *lsp; 5 ___ip = ___ip & ~0x7; \ 4 struct liststack *lsnp1; 6 (typeof(p)) ___ip; \ 5 struct liststack *lsnp2; 7 }) 6 size_t sz; 8 7 9 #define rcu_ptr_set_bits(p, bits) \ 8 sz = sizeof(*lsp); 10 ({ \ 9 sz = (sz + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE; 11 uintptr_t ___ip = (uintptr_t)(p); \ 10 sz *= CACHE_LINE_SIZE; 12 \ 11 lsp = malloc(sz); 13 ___ip = ___ip & ~0x7; \ 12 if (!lsp) 14 ___ip = ___ip | (bits); \ 13 return -ENOMEM; 15 (typeof(p)) ___ip; \ 14 if (!t) 16 }) 15 abort(); 17 16 lsp->t = t; 18 #define rcu_ptr_get_bits(p, bits) \ 17 rcu_read_lock(); 19 ({ \ 18 lsnp2 = ACCESS_ONCE(head->first); 20 uintptr_t ___ip = (uintptr_t)(p); \ 19 do { 21 \ 20 lsnp1 = lsnp2; 22 ___ip & ~0x7; \ 21 lsp->next = lsnp1; 23 }) 22 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); 23 } while (lsnp1 != lsnp2); 24 rcu_read_unlock(); Figure 33: Pointers and Bit Manipulation on 64-Bit 25 return 0; 26 } System 27 28 static void ls_rcu_free_cb(struct rcu_head *rhp) 1 p = rcu_dereference(gp); 29 { 2 *p = 5; 30 struct liststack *lsp; 31 32 lsp = container_of(rhp, struct liststack, rh); 33 free(lsp); Figure 34: Extending Dependency Chain on Left- 34 } 35 Hand Side 36 _Carries_dependency 37 void *ls_pop(struct liststackhead *head) 38 { 39 struct liststack *lsp; as a refinement of the local-variable restriction (Sec- 40 struct liststack *lsnp1; tion 7.5). As always, a memory order consume load 41 struct liststack *lsnp2; 42 void *data; heads a dependency chain, and as always, a memory 43 order consume load may be implemented with a sim- 44 rcu_read_lock(); 45 lsnp2 = rcu_dereference(head->first, head->first->next); ple load instruction on architectures such as x86, 46 do { ARM, and Power. 47 lsnp1 = lsnp2; 48 if (lsnp1 == NULL) { This results in a specific list of operations that ex- 49 rcu_read_unlock(); tend dependency chains and a separate specific list of 50 return NULL; 51 } operations that terminate such chains, each covered 52 lsp = rcu_dereference(lsnp1->next, lsnp->next->t); in its own section. 53 lsnp2 = cmpxchg(&head->first, lsnp1, lsp); 54 } while (lsnp1 != lsnp2); 55 data = rcu_dereference(lsnp2->t, *lsnp2->t)); 56 rcu_read_unlock(); 7.9.1 Extending Dependency Chains 57 call_rcu(&lsnp2->rh, ls_rcu_free_cb); 58 return data; The following primitive operations extend that 59 } chain:15

Figure 32: List-Based-Stack Head-Marked Depen- 1. If any value is part of a dependency chain, then dencies, 2 of 2 15 In case of operator overloading, the actual functions called must be analyzed in order to determine their effects on depen- dency chains. WG21/P0098R0 30

using that value as the left-hand side of an as- break the dependency chain. Therefore, instead signment expression extends the chain to cover of casting to an integral type to carry out the the assignment. This rule is exercised in the addition, cast to a pointer to char.16 Line 24 of Linux kernel by stores into fields making up an Figure 20 illustrates this, given that the ->t acts RCU protected data element. This is illustrated as a pointer offset prior to indirection. by Figure 34. 6. If a pointer is part of a dependency chain, then 2. If any value is part of a dependency chain, then subtracting an integer from that pointer extends using that value as the right-hand side of an as- the chain to the resulting value. This applies signment expression extends the chain to cover for both positive and negative integers. Again, both the assignment and the value returned by casting to an integral type and then carrying that assignment statement. Line 20 of Figure 20 out the subtraction will break the dependency shows how this rule may be used to extend a chain, so instead cast to a pointer to char. The dependency chain into a local variable. Linux-kernel container of() macro illustrates this. This macro is used to find the beginning of 3. If any value that is part of a dependency chain is a structure given a pointer to a field within that stored to a non-shared variable, then any value same structure. loaded by a later load from that same variable by that same thread is also part of the dependency 7. If a pointer is part of a dependency chain, then chain. Lines 20 and 24 of Figure 20 illustrate this dereferencing it using the prefix * operator ex- rule, though this rule would apply even if local tends the chain through the dereference opera- variable lsp was a intptr t instead of a pointer. tion. Line 24 of Figure 20 illustrates this, given Note that the job of determining whether or not that the ->t acts as a pointer offset prior to in- a given variable is non-shared falls to the devel- direction. oper, not the compiler. That said, a high-QoI 8. If a pointer is part of a dependency chain, then compiler might choose to make this determina- dereferencing it using the -> field-selection oper- tion in order to issue helpful diagnostic messages. ator extends the chain to the field. Note that the 4. If a pointer that is part of a dependency chain when the -> operator is followed by one or more is stored to any variable, then any value loaded . operators, these latter operators are equiva- by a later load from that same variable by that lent to adding a constant integer to the original same thread is also part of the dependency chain. pointer. Line 24 of Figure 20 directly illustrates Lines 20 and 24 of Figure 20 illustrate this rule, this rule. though this rule would apply even if local vari- 9. If a pointer is part of a dependency chain, then able was instead a shared variable. As be- lsp casting it (either explicitly or implicitly) to any fore, determining whether or not the store and pointer-sized type extends the chain to the re- load were carried out by the same thread falls to sult. Line 3 of Figure 33 illustrates this rule. the developer, not to the compiler. 10. If a value of type intptr t or uintptr t is 5. If a pointer is part of a dependency chain, then part of a dependency chain, then casting it to adding an integral value to that pointer extends a pointer type extends the chain to the result. the chain to the resulting value. This applies Line 6 of Figure 33 illustrates this rule. for both positive and negative integers, and also to addition via the infix + operator and via the 16 Yes, some old systems had strange formats for character postfix [] operator. Note that the addition must pointers, and this restriction does exclude those systems from this nuance of dependency ordering. However, to the best of be carried out on a pointer: Casting to an inte- my knowledge, all such systems were uniprocessors, so this is gral type and then carrying out the addition will not a real problem. WG21/P0098R0 31

1 struct bar { 11. If a value of type intptr t or uintptr t is part 2 struct bar *next; of a dependency chain, then the bitwise infix & 3 int a; 4 int b; and | operators extend the dependency chain to 5 }; the resulting value. Line 5 of Figure 33 illus- 6 struct bar *head = { &head, 1, 2 }; 7 trates this rule. 8 for (p = head->next; p; p = rcu_dereference(p->next)) { 9 foo += p->a; 12. If a value of type intptr t or uintptr t is part 10 if (p == &head) 11 break; of a dependency chain, the the bitwise infix ^ 12 } operator extends the dependency chain to the 13 bar *= head->b; resulting value. The use case for this traversal of buddy-allocator-like lists or dense-array heaps, Figure 35: Back-Propagation of Dependency-Chain but it is not clear whether these use cases justify Breakage Due to Comparisons this addition to dependency ordering. 1 if (p > &foo) 13. If a pointer is part of a dependency chain, then 2 do_something(p); 3 else if (p < &foo) applying the unary & address-of operator, op- 4 do_something_else(p); tionally casting this address to a pointer type 5 else 6 do_something_nodep(p); (perhaps repeatedly to different pointer types, either explicitly or implicitly), then applying the * dereference operator extends the chain to the Figure 36: Inequality-Comparison Dependency- result. This is used by some of the Linux-kernel Chain Breakage list-processing macros.

14. If a pointer is part of a dependency chain, and dependency chain extends to the returned value that pointer appears in the entry of a ?: expres- in the calling function. sion selected by the condition, then the chain 18. If a given operation extends a dependency chain, extends to the result. Please note that ?: does then so does its atomic counterpart. For exam- not extend chains from its condition, only from ple, the rules applying to assignments also apply its second or third argument. to atomic loads and stores. It also applies to 15. If a pointer to a function is part of a depen- atomic exchange and and atomic compare and dency chain, then invoking the pointed-to func- swap. tion extends the chain from the pointer to the Any other operation terminates a dependency instructions executed. Note that the exact mech- chain. anism used to update instructions is implemen- tation defined, and might require use of special instruction-cache-flush operations.17 7.9.2 Terminating Dependency Chains Even though all other operations terminate depen- 16. If a pointer is part of a dependency chain, then dency chains, there are a few that deserve special if that pointer is used as the actual parameter of mention: a function call, the dependency chain extends to the formal parameter. 1. Equality comparisons. 17. If a pointer is part of a dependency chain, then 2. Narrowing magnitude comparisons. if that pointer is returned from a function, the 3. Narrowing arithmetic operations. 17 This may sound strange, but just you try implementing dynamic linking without the ability to update instructions! 4. Narrowing bitwise operations. WG21/P0098R0 32

5. Storing non-pointers into shared variables. optimization, especially for pointers obtained from the heap. After all, use of statically allocated struc- 6. Passing values between threads without using a tures in RCU-protected lists could be quite useful memory order consume load. during out-of-memory conditions, in which case the 7. Undefined behavior. specialization optimization would almost always re- duce performance, which is not what optmizations 8. Use of kill dependency. are supposed to be doing.

Each of these is covered below. Narrowing magnitude comparisons: A series of >, <, >=, or <= operators that informs the compiler Equality comparisons: If a pointer is part of a of the exact value of a pointer causes that pointer dependency chain, then a == or != comparison that to no longer be part of the dependency chain. See compares equal to some other pointer, where that Figure 36 for an example of this. On line 6 of this other pointer is not part of any dependency chain, figure, the compiler knows that the value of p is equal will cause any uses of the original pointer to no longer to &foo, so although there is dependency ordering to be part of the dependency chain. This dependency- lines 2 and 4, there is no dependency ordering to chain breakage can back-propagate to earlier uses of line 6. This dependency-chain breakage can back- the pointer, so that in Figure 35, if the comparison propagate, just as for equality comparisons. How- on line 10 compares equal, then the access on line 9 ever, dependencies are maintained for normal uses, is not part of the dependency chain. This is admit- for exmaple, the use of comparisons for deadlock tedly a rather strange code fragment, and besides, avoidance when acquiring locks contained in multi- the Linux-kernel barrier() macro could prevent this ple RCU-protected data elements. if placed between lines 9 and 10. Furthermore, the Linux kernel’s list macros avoid this situation because Narrowing arithmetic operations: If a pointer the equal comparison terminates the loop. is part of a dependency chain, and if the values So what if the compiler introduces an equality com- added to or subtracted from that pointer cancel the parison? This might happen when doing feedback- pointer value so as to allow the compiler to pre- directed optimization, where the compiler might no- cisely determine the resulting value, then the result- tice (for example) that a particularly statically allo- ing value will not be part of any dependency chain. cated structure was almost always the first element For example, if p is part of a dependency chain, then on a given list. The compiler might therefore in- ((char *)p-(uintptr t)p)+65536 will not be.18 troduce a specialization optimization, comparing the addresses and generating code using the statically al- located structure on equals comparison. On the one Narrowing bitwise operations: If a value of type hand, in the cases where the Linux kernel adds a stat- intptr t or uintptr t is part of a dependency chain, ically allocated structure to an RCU-protected linked and if that value is one of the operands to an & data structure, that structure has been initialized at or | infix operator whose result has too few or too compile time, so that dependency ordering is not re- many bits set, then the resulting value will not be quired. On the other hand, this appears to be an part of any dependency chain. For example, on a extremely dubious optimization for linked data struc- 64-bit system, if p is part of a dependency chain, tures: In a great many cases, the added overhead of then (p & 0x7) provides just the tag bits, and nor- the comparison would overwhelm the benefits of gen- mally cannot even be legally dereferenced. Similarly, erating code based on the statically allocated struc- (p | ~0) normally cannot be legally dereferenced. ture. 18 That said, 5.7p4 of C++ and 6.5.6p8 of C both say that High-quality implementations would therefore be indexing outside of an object is undefined behavior, so the loss expected to provide means for disabling this sort of of dependency ordering is likely the least of the problems here. WG21/P0098R0 33

However, (p & ~0x7) will provide a usable pointer 7.9.3 Restricted Dependency Chains and the that is part of p’s dependency chain. Setting or Linux Kernel clearing bits that are not used by the implementa- tion or that would be zero for a properly aligned ob- This covers all known pointer-based RCU uses in the ject will not break the dependency chain. That said, Linux kernel, aside from RCU-protected array in- the compiler might not know the definition of “prop- dexes (more on these later). However, as noted ear- erly aligned”, for example, in the cases of manually lier, these restrictions might prove too constraining cache-aligned or page-aligned objects. for future code. Therefore, it might be necessary to Note that the ^ exclusive-OR operator can also combine this approach with some variation on one of break dependency chains, most straightforwardly via the methods for explicitly marking variables, formal something like a ^ a. parameters, and return values that are intended to carry dependencies. Storing non-pointers into shared variables: If Section 3.2 discussed dependency chains headed a non-pointer value that is part of a dependency chain by memory order consume loads of integers that are is stored into a shared variable, then the dependency later used as array indexes or pointer offsets. Al- chain does not extend to a later load from that vari- though there are a (very) few such uses in the Linux able. kernel, accommodating dependency chains headed by loads of integers greatly complicates the handling of dependency chains.19 For example, given an integer x Passing values between threads: If a value that produced by a memory order consume load, we must is part of a dependency chain is stored into a variable correctly handle expressions containing (x - x), in- by one thread, and loaded from that same variable by cluding cases where the cancellation is not at all obvi- some other thread using either a non-atomic load or ous from the source code. In contrast, given a pointer a memory order relaxed load, then the dependency p, the expression (p - p) not a pointer, and the rules chain does not extend to the second thread. To get given above do not require carrying a dependency this effect, the second thread would instead need to through such an expression. This approach there- use a memory order consume load. Note that this fore excludes dependency chains headed by memory would extend the dependency chain even if the cor- order consume loads from non-pointer atomic vari- responding store was a memory order relaxed store ables. because the required store-side ordering is provided This raises the question of what should be done by the dependency chain. about the Linux kernel code that relies on order- ing carried through integer array indexes. Paul has Undefined behavior: If undefined behavior is in- answered this question by creating a Linux-kernel voked, then, consistent with the notion of undefined patch that removes the kernel’s dependency on RCU- behavior, there are no dependency-chain guarantees. protected array indexes [22], and this patch was ac- If a given pointer, intptr t, or uintptr t is used cepted into the v4.2 release of the Linux kernel. such that only one value avoids undefined behavior, then the dependency chain is broken in the same way as it would be in the case of an equality comparison 7.9.4 Restricted Dependency Chains: Ad- with that same value. vantages and Disavantages

This approach to dependency ordering has the fol- kill dependency(): The result of calling kill lowing advantages and disadvantages: dependency is never part of any dependency chain. This operation can be used to suppress diagnostics 19 And for this reason, carrying of dependencies via integers that implementations might omit for likely misuses is no longer supported in the Linux kernel as of v4.2, except of dependency ordering. in a few very restricted instances. WG21/P0098R0 34

1 p = atomic_load_explicit(gp, memory_order_consume); 2 if (p == ptr_a) { use of std::kill dependency() might at some 3 q = kill_dependency(p); future point allow the compiler to produce bet- 4 a = q->special_a; 5 } else { ter diagnostics for dubious dependency-chain use 6 a = p->normal_a; cases. For example a compiler might issue a 7 } warning for the value-narrowing hazard shown on line 3 of Figure 13, and that diagnostic might Figure 37: Avoiding Diagnostic Due To Dependency- be suppressed as shown in Figure 37. Ordering Value-Narrowing Hazard 7. More generally, this approach allows annotations to be discarded, as was shown in Figures 20 1. There is no need for compilers to trace depen- and 21. However, the memory order consume dency chains. Instead, dependencies are an au- loads are still required in order to support DEC tomatic result of current code-generation and op- Alpha and prevent compiler optimizations that timization practices. might otherwise destroy the dependency chain. 2. There would be little or no limitation on com- piler optimizations. Compilers are no longer re- 8. It would not be necessary to use [[carries quired to establish artificial dependencies for ex- dependency]] attributes. However, as with pressions such as (x - x) or to detect value- std::kill dependency(), use of something like narrowing hazards involving == and !=. [[carries dependency]] might produce bet- ter diagnostics for dubious use dependency-chain 3. The breaking of dependency chains when a use cases. That said, it might be preferable to pointer compares equal to some other pointer substitute either type or object modifiers for at- might prove to be onerous, but it is not a tributes, given that use of attributes is not per- problem for current Linux use cases.20 The mitted to change the meaning of the program most common cases are comparison against a and also that attributes are not yet supported list header (in which case an equality compar- by the C language. ison terminates the traversal) and comparison against NULL (in which case an equality com- 9. With the exception of one use case involving ar- parison indicates a pointer that cannot be deref- rays, no changes to the Linux-kernel source code erenced in any case). are required. As noted earlier, a patch is avail- able to remove the Linux kernel’s dependency 4. The compiler is prohibited from carrying out on RCU-protected array indexes [22], and this value-speculation optimizations on pointers or patch was accepted into the v4.2 release of the values of type intptr t or uintptr t that have Linux kernel. been cast from a pointer type and subjected to no operations other than infix &, |, and ^. (Is this an advantage or a disadvantage? The an- 10. This approach has the advantage of implemen- swer to this question is left to the reader.) tation experience extending for more than two decades. 5. In a great many cases, dependency chains can reliably pass through library functions compiled In the future, this approach could be augmented by by pre-C11 compilers. attributes, type modifiers, object modifiers, or other markings to allow more elaborate dependency chains 6. It would not be necessary to use std::kill to be created on the one hand, and to improve the dependency() calls in most cases. That said, compiler’s ability to emit diagnostics for dubious uses 20 This needs to be re-verified. of dependency chains on the other. WG21/P0098R0 35

7.10 Storage Class for the carries a dependency relation shown in 1.10p9 of the standard. This approach, similar to that suggested by Lawrence Assignment from an object with Carries Crowl in response to the approach described in Sec- dependency storage class to an object not of this tion 7.6, uses a new storage class to indicate where storage class would act as an implicit std::kill dependencies need to be carried. This section also dependency(). This is useful when protection of a applies lessons learned in discussions of the other given object is shifted from a dependency chain to approaches, particularly those of Sections 7.3, 7.5 a lock or reference count. However, high quality-of- and 7.9. In particular, this approach retains the re- implementation compilers would be expected to pro- striction called out in Section 7.9, namely allowing vide options to issue warnings when such an implicit dependencies to be carried only through pointer-like std::kill dependency() was encountered, which types, including restricted operations on intptr t would allow developers to enforce coding-style rules and uintptr t.21 requiring explicit std::kill dependency(), if de- The idea is to provide a storage class Carries sired. dependency that indicates that the modified local A high quality-of-implementation compiler would variable, struct/union field, formal parameter, ac- also be expected to provide options to enable warn- tual parameter, or return value carries a depen- ings to be issued when a dependency chain was dency. This is similar to the existing [[carries headed by something other than a memory order attribute, but is usable by C (which dependency]] consume load or an object with the Carries lacks attributes) as well as by C++, and which, dependency storage class. A cast could be used to unlike [[carries dependency]], is permitted to suppress these warnings, which could be provided as change the program’s semantics. This last is impor- std::unkill dependency() if experience indicates tant because it provides greater flexibility in avoiding that this is useful. This std::unkill dependency() unsolicited (and costly) memory-fence instructions. is likely to be useful to enable common code that This approach results in the code shown in Fig- traverses data structures using dependencies in some ures 24 and 25. cases and using in other cases. This approach uses strict dependencies rather than The new Carries dependency storage class may semantic dependencies, which means that the imple- appear with register, static, thread local, and mentation will need to emit explicit memory-fence extern. If Carries dependency appears in one dec- instructions in cases where the dependencies would laration of a variable, it shall appear in all other dec- not be preserved by the underlying hardware. How- larations of that entity. The Carries dependency ever, high quality-of-implementation compilers would storage class can be applied only to the names of be expected to provide options to issue warnings variables, functions, formal parameters, and struc- when such instructions were emitted, allowing the ture/union fields. Only variables of scalar or pointer developer to modify the code to avoid these ex- type may be marked as Carries dependency.22 pensive instructions. In addition, high quality-of- Here are use cases for the storage-class combina- implementation compilers are expected to carry de- tions: pendencies past things like successful equality com- parisons with no performance degradation. The oper- • register Carries dependency: A local vari- ations carrying dependencies are exactly those shown able that carries a dependency that should also 21 JF Bastien points out that in C++ this could be imple- be stored in a machine register. mented as a class that is special in a manner similar to the atomic classes. One example of the special handling required • static Carries dependency: A static local is to preserve dependencies past successful pointer equality variable that carries a dependency. Such a vari- comparisons. However, a C-language implementation would require something resembling a storage class, so this proposal 22 But if you have a valid use case for an aggregate construct starts at that point. that carries a dependency, please let us know. WG21/P0098R0 36

able is presumably subject to some sort of mu- to use the approach called out in Section 7.9. tual exclusion mechanism so as to avoid inter- Longer term, some portions of the Linux ker- thread access. nel might incrementally transition to this object- modifier approach, especially if doing so provides • static thread local Carries dependency: significant diagnostic benefits. A static thread-local variable that carries a dependency. 6. The strict dependencies should permit tools based on formal methods to successfully analyze • extern Carries dependency: A variable dependency chains. High quality tooling might shared among translation units that carries a well encourage existing RCU-using projects to dependency. Such a variable is presumably sub- adopt this storage-class-based approach. ject to some sort of mutual exclusion mechanism so as to avoid inter-thread access. 7.11 Evaluation • extern thread local Carries dependency: This evaluation starts by enumerating the different A thread-local variable shared among transla- audiences that any change to tion units that carries a dependency. memory order consume must address (Section 7.11.1) and then compares the • thread local Carries dependency:A various proposals based on the perceived viewpoints thread-local variable that carries a dependency. of these audiences (Section 7.11.2). Again, this Carries dependency storage class 7.11.1 Audiences can only be applied to pointer-like types, with only restricted operations permitted on intptr t and The main audiences for any change to memory uintptr t (see Section 7.9.1 for details). order consume include standards committee mem- This approach has the following advantages and bers, compiler implementers, formal-methods re- disadvantages: searchers, developers intending to write new code, and developers working with existing RCU code. The 1. The implementation is likely to be simpler in Linux kernel community is of course a notable exam- that dependencies need not be traced. Instead, ple of this last category. only computations involving objects of the Standards committee members would like a clean Carries dependency storage class need be han- and non-intrusive change to the standard. They dled specially. This eliminates unnecessary re- would of course also like solutions minimizing the striction of optimizations on unintentional de- number and vehemence of complaints from the other pendency chains. audiences, or, failing that, reducing the complaints 2. The implicit std::kill dependency() means to a tolerable noise level. that many irrelevant dependency chains are Compiler implementers would like a mechanism pruned by default, so that fewer explicit that fits nicely into current implementations, which std::kill dependency() invocations are re- does much to explain their satisfaction with the ap- quired. proach of strengthening memory order consume to memory order acquire. In particular, they would 3. Attributes need not be added to the C language. like to avoid unbounded tracing of dependencies, and 4. No modifications to the type system are re- would prefer minimal constraints on their ability to quired. apply time-honored optimizations. Formal-methods researchers would like a definition 5. Applying this to the Linux kernel would require of memory order consume that fits into existing the- significant marking of variables carrying depen- oretical frameworks without undue conceptual vio- dencies, however, the Linux kernel is expected lence. Of particular concern is any need to deal with WG21/P0098R0 37 counter-factuals, in other words, any need to rea- tic dependency.24 Variable, formal-parameter, and son not only about values of variables required for return-value marking can either be type-based (“T”), the solution of a given litmus test, but also about storage-class-based (“S”), attribute-based (“A”), or other unrelated values for these variables. As such, not required (“ ”). Beginning-of-chain handling can counter-factuals are the rock upon which otherwise either require explicit indication to which quanti- attractive approaches involving semantic dependency ties dependencies must be carried (“D”) or noth- have foundered.23 Some practitioners might won- ing (“ ”). End-of-chain handling can either re- der why the opinion of formal-methods researchers quire an explicit kill dependency (“K”), an implicit should be given any weight at all, and the answer to kill dependency (“k”), explicit designation of de- this question is that it is the work of formal-methods pendency (“D”), or nothing (“ ”).25 Dependency researchers that provides us the much-needed tools tracking might be required for all chains (“Y”), ex- that we need to analyze both the memory-ordering plicitly designated chains (“y”), or not required at specification itself as well as programs using that all (“ ”). C-language [[carries dependency]] sup- specification. port might be required (“Y”) or not (“ ”). Diagnostic Developers writing new code need something that capability is difficult to evaluate for many of the pro- expresses their algorithm with a minimum of syn- posals, but for some proposals it is limited (“n”), for tactic saccharine, that is easy to learn, and that is others quite difficult (“N”), and for still others too easy to maintain. For example, one of the weak- early to tell (“?”). Some proposals inflict unsolicited nesses of the current standards’ definition of memory memory fences on developers (“Y”), and others do order consume is the need to sprinkle large numbers not (“ ”), and some proposals are source-code com- of kill dependency() calls throughout one’s code. patible with the Linux kernel (“ ”), while others are In short, developers would like it to be easy to write, not (“N”). Anything other than pure syntactic depen- analyze, and maintain code that uses dependency or- dendencies (“dep”) are quite challenging for current dering. formal-verification tools, resulting in most proposals Developers with existing RCU code have the same getting (“N”) or (“?”). desires as do developers writing new code, but are The ideal proposal would have dependency type also very interested in minimizing the code churn re- “dep” (thus making it easier to model dependency quired to adhere to the standard. ordering and making it unnecessary for developer to The challenge if of course to find a proposal that have to outwit full-program optimizations), no need addresses the viewoints of all of these audiences. As for variable, formal-parameter, or return-value mark- we will see in the next session, this is not easy. How- ing (thus minimizing changes required for existing ever, there is some hope that the approach presented RCU code), implicit “do the right thing” end-of-chain in Section 7.9 might suffice. handling,26 (thus minimizing the need for whack- a-mole source-code markups), no need for depen- dency tracking (thus making it easier to implement), 7.11.2 Comparison

A summary comparison of the proposals is shown in 24 Recall that a local semantic dependency remains a de- pendency even if the memory order consume load at its head Table 1. can return only a single value. In contrast, a global seman- The dependency type can either be “dep” for nor- tic dependency remains a dependency only if more than one mal dependency, “rdep” for the restricted depen- value can appear at the end of the chain. Therefore, optimiza- dencies discussed in Section 7.9, “sdep” for (global) tions based on global full-program analysis can break a global semantic dependency but can break neither a local semantic semantic dependency, or “lsdep” for local seman- dependency nor a normal dependency. 25 Variables that go out of scope always have any dependency 23 That said, Alan Jeffries is making another attempt to chain implicitly killed. come up with a suitable formal definition of semantic depen- 26 Perhaps implemented by a careful choice of exactly which dency. operators carry dependencies in which situations. WG21/P0098R0 38 Dependency Type Object Marking Formal-Parameter Marking Return-Value Marking Beginning-Of-Chain Handling End-Of-Chain Handling Dependency Tracing Required C Attribute Support Required Diagnostic Capability Unsolicited Memory Fences Linux-Kernel Compatibility Suitable for Formal Verification C11 / C++11 dep A A K Y Y n Y N Type-Based Designation of Dependency lsdep T T T k ? Y N N Chains With Restrictions (Section 7.2) Type-Based Designation of Dependency dep T T T k ? Y N Chains (Section 7.3) Whole-Program Option (Section 7.4) sdep N N Local-Variable Restriction (Section 7.5) dep A A k Y ? Y N Mark Dependency-Carrying Local Variables dep A A A k Y ? Y N (Section 7.6) Explicitly Tail-Marked Dependency Chains dep A A Dk y Y N Y N ? (Section 7.7) Explicitly Head-Marked Dependency Chains dep ? ? D k y Y N Y N ? (Section 7.8) Restricted Dependency Chains (Section 7.9) rdep N N Storage Class (Section 7.10) rdep S S S Y N

Marking: “A”: attribute, “S”: Object, “T”: type. Beginning of chain: “D”: explicit designation. End of chain: “D”: explicit designation, “k/K”: implicit/explicit kill dependency. Dependency tracking required: “y”: only for marked chains, “Y”: always. Diagnostic capability: “n”: incomplete, “N”: very limited, “?”: could be provided.

Table 1: Comparison of Consume Proposals WG21/P0098R0 39 no need for C-language support for the [[carries Design and Implementation (New York, NY, dependency]] attribute (thus minimizing changes to USA, 2014), PLDI ’14, ACM, pp. 40–40. the C standard), good diagnostic capabilities, refrain from supplying unsolicited memory fences, be source- [3] ARM Limited. ARM Architecture Reference code compatible with the Linux kernel, and lend itself Manual: ARMv7-A and ARMv7-R Edition, well to analysis by formal-verification tools. 2010. However, there appears to be no such proposal pos- [4] Space efficient conservative sible, so the recommended approach is to use the re- Boehm, H. J. garbage collection. SIGPLAN Not. 39, 4 (Apr. stricted dependency chains discussed in Section 7.9 2004), 490–501. for large existing code bases and the storage-class approach discussed in Section 7.10 for new code and [5] Bonzini, P., and Day, M. RCU imple- perhaps also as a future direction for large existing mentation for Qemu. http://lists.gnu. code bases. org/archive/html/qemu-devel/2013-08/ msg02055.html, August 2013.

8 Summary [6] Dalton, M. THE DESIGN AND IMPLE- MENTATION OF DYNAMIC INFORMATION This document has analyzed Linux-kernel use of de- FLOW TRACKING SYSTEMS FOR SOFT- pendency ordering and has laid out the status-quo in- WARE SECURITY. PhD thesis, Stanford teraction between the Linux kernel and pre-C11 com- University, 2009. Available: http://csl. pilers. It has also put forward some possible ways of stanford.edu/ christos/publications/ building towards a full implementation of C11’s and ~ 2009.michael_dalton.phd_thesis.pdf C++11’s handling of dependency ordering. It calls [Viewed March 9, 2010]. out some weaknesses in C11’s and C++11’s handling of dependency ordering and offers some alternatives. [7] Desnoyers, M. [RFC git tree] userspace RCU Finally, it recommends one approach (Section 7.9) for (urcu) for Linux. http://liburcu.org, Febru- large existing code bases and a second approach (Sec- ary 2009. tion 7.10) for new code bases. Over time, it is hoped that the availability of diagnostic tools will motivate [8] Desnoyers, M., McKenney, P. E., Stern, migration to the second of these two approaches. A., Dagenais, M. R., and Walpole, J. User-level implementations of read-copy update. IEEE Transactions on Parallel and Distributed References Systems 23 (2012), 375–382.

[1] Alglave, J., Maranget, L., Pawan, P., [9] Grisenthwaite, R. ARM Barrier Litmus Sarkar, S., Sewell, P., Williams, D., and Tests and Cookbook. ARM Limited, 2009. Nardelli, F. Z. PPCMEM/ARMMEM: A tool for exploring the POWER and ARM memory [10] Howard, P. W., and Walpole, J. A rel- models. http://www.cl.cam.ac.uk/~pes20/ ativistic enhancement to software transactional ppc-supplemental/pldi105-sarkar.pdf, memory. In Proceedings of the 3rd USENIX con- June 2011. ference on Hot topics in parallelism (Berkeley, CA, USA, 2011), HotPar’11, USENIX Associa- [2] Alglave, J., Maranget, L., and tion, pp. 1–6. Tautschnig, M. Herding cats: Modelling, simulation, testing, and data-mining for weak [11] Howard, P. W., and Walpole, J. Relativis- memory. In Proceedings of the 35th ACM SIG- tic red-black trees. Concurrency and Computa- PLAN Conference on Programming Language tion: Practice and Experience (2013), n/a–n/a. WG21/P0098R0 40

[12] Intel Corporation. A Formal Speci- [22] McKenney, P. E. [PATCH tip/core/rcu fication of Intel Itanium Processor Fam- 1/4] mce: Stop using array-index-based RCU ily Memory Ordering, 2002. Avail- primitives. [PATCHtip/core/rcu1/4]mce: able: http://developer.intel.com/ Stopusingarray-index-basedRCUprimitives, design/itanium/downloads/251429.htm May 2015. ftp://download.intel.com/design/ [23] Itanium/Downloads/25142901.pdf [Viewed: McKenney, P. E., Purcell, C., Al- January 10, 2007]. gae, Schumin, B., Cornelius, G., Qwer- tyus, Conway, N., Sbw, Blainster, Ru- [13] International Business Machines Corpo- fus, C., Zoicon5, Anome, and Eisen, H. ration. Power ISA Version 2.07, 2013. Read-copy update. http://en.wikipedia. org/wiki/Read-copy-update, July 2006. [14] Kannan, H. Ordering decoupled metadata ac- cesses in multiprocessors. In MICRO 42: Pro- [24] McKenney, P. E., and Slingwine, J. D. ceedings of the 42nd Annual IEEE/ACM Inter- Read-copy update: Using execution history to national Symposium on Microarchitecture (New solve concurrency problems. In Parallel and York, NY, USA, 2009), ACM, pp. 381–390. Distributed Computing and Systems (Las Vegas, NV, October 1998), pp. 509–518. [15] McCarthy, J. Recursive functions of symbolic expressions and their computation by machine, [25] McKenney, P. E., and Stern, A. part i. Commun. ACM 3, 4 (Apr. 1960), 184– Axiomatic validation of memory bar- 195. riers and atomic instructions. http: //lwn.net/Articles/608550/, August 2014. [16] McKenney, P. E. Read-copy update (RCU) usage in Linux kernel. Available: [26] McKenney, P. E., and Walpole, J. What is http://www.rdrop.com/users/paulmck/ RCU, fundamentally? Available: http://lwn. RCU/linuxusage/rculocktab.html [Viewed net/Articles/262464/ [Viewed December 27, January 14, 2007], October 2006. 2007], December 2007. [27] Michael, M. M. Hazard pointers: Safe mem- [17] McKenney, P. E. What is RCU? part 2: ory reclamation for lock-free objects. IEEE Usage. Available: http://lwn.net/Articles/ Transactions on Parallel and Distributed Sys- 263130/ [Viewed January 4, 2008], January tems 15, 6 (June 2004), 491–504. 2008. [28] Rossbach, C. J., Hofmann, O. S., Porter, [18] McKenney, P. E. The RCU API, 2010 edi- D. E., Ramadan, H. E., Bhandari, A., and tion. http://lwn.net/Articles/418853/, De- Witchel, E. TxLinux: Using and managing cember 2010. hardware transactional memory in an operating [19] McKenney, P. E. Validating memory barri- system. In SOSP’07: Twenty-First ACM ers and atomic instructions. http://lwn.net/ Symposium on Operating Systems Principles Articles/470681/, December 2011. (October 2007), ACM SIGOPS. Avail- able: http://www.sosp2007.org/papers/ [20] McKenney, P. E. Is Parallel Programming sosp056-rossbach.pdf [Viewed October 21, Hard, And, If So, What Can You Do About It? 2007]. kernel.org, Corvallis, OR, USA, 2012. [29] Toit, S. D. Working draft, stan- [21] McKenney, P. E. Structured deferral: syn- dard for programming language C++. chronization via procrastination. Commun. http://www.open-std.org/jtc1/sc22/wg21/ ACM 56, 7 (July 2013), 40–49. docs/papers/2013/n3691.pdf, May 2013. WG21/P0098R0 41

[30] Vigueras, G., Orduna,˜ J. M., and Lozano, advantages and disadvantages of Jeff Preshing’s M. A Read-Copy Update based parallel server whole-program option in Section 7.4 Add advan- for distributed crowd simulations. The Journal tages and disadvantages for Hans Boehm’s pro- of Supercomputing (Apr. 2012). posal in Section 7.5 and for Clark Nelson’s pro- posal in Section 7.6. (August 23, 2014.) Change Log • Update Olivier Giroux’s explicit dependency- chain marking proposal in Section 7.7 based on This paper first appeared as N4026 in May of 2014. feedback from Olivier. (August 24, 2014.) Revisions to this document are as follows: • Add a section header (Section 4.1.1) for the • Add a suggestion to Section 7.1 that developers dependency-chain rules for 2014 GCC implemen- provide diagnostics for singleton arrays when us- tations. Add Section 4.1.3 laying out the sim- ing RCU-protected indexes. (May 28, 2014.) pler rules for managing dependency chains in the older 1990s Sequent C implementations. Add • Apply changes from Torvald Riegel feedback: a footnote to Section 7.6 recording Lawrence 1. Move Section 7.1, which summarizes the Crowl’s preference for object modifiers instead of Linux-kernel uses of dependency chains, up either attributes or type modifiers. Add words to to the beginning of Section 7. Section 7.7 noting that atomic signal fence() or the Linux kernel’s barrier() can be used to 2. Fix some typos in Section 7.4, and add suppress code-motion optimizations that might words accounting for DEC Alpha, which otherwise prevent fixing up tail-marked depen- must continue to promote memory order dency chains that were broken by code-motion consume loads to memory order acquire optimizations. (October 1, 2014.) with this section’s approach. Also clarify Jeff Preshing’s conjecture about the lack of • Add Section 3.4, which describes operators that optimizations that break semantic depen- act as the last link in the Linux kernel’s depen- dency chains. dency chains, including the ambiguous role of the -> operator. (October 1, 2014.) (June 5, 2014.) • Update plots of RCU API usage. (October 5, • Add Section 7.5, which covers Hans Boehm’s 2014.) approach, which restricts dependency chains to local variable. Also add Section 7.6, which • Flesh out Section 4.1.2, which lays out a short covers Clark Nelso’s approach, which uses history of dependency-chain management rules the [[carries dependency]] attribute to mark for the Linux kernel. Numerous minor clarifica- those local variables that carry dependencies. tions and corrections throughout the document. (July 17, 2014.) (October 5, 2014.) • Mark the paper as a revision of N4036. (July 17, • Add “and” before final author’s name and email 2014.) address. Also add a pointer back to N4036. (Oc- tober 5, 2014.) • Add Jeff Preshing, Hans Boehm, and Clark Nel- son as authors, and clarify Hans’s proposal in • Add a brain-dead build script. (October 5, Section 7.5. (August 13, 2014.) 2014.) • Add Olivier Giroux’s proposal for explicitly • Add Section 7.11, which has the beginnings of marking the tails of dependency chains as Sec- an evaluation of the various proposals. (October tion 7.7, and add Olivier as author. Update the 5, 2014.) WG21/P0098R0 42

The paper was then published as N4215, which • Apply feedback from Torvald Riegel to Sec- was revised as follows: tion 7.9. (May 20, 2015.)

• Add verbiage to Sections 7.2 and 7.11.2 differ- • Don’t allow integers to carry dependencies into entiating between various forms of syntactic and or out of unmarked functions in Section 7.9. semantic dependencies. (November 11, 2014.) (May 21, 2015.)

• Add a now-deleted section suggesting a C • Update Section 7.9 to note that assignments can- keyword instead of the C++ [[carries not extend a dependency chain from one thread dependency]] attribute. This section also sug- to another, and that assignments cannot extend gests that combining ideas from several propos- an intptr t- or uintptr t-based dependency als might be helpful. (November 11, 2014.) chain through a shared variable, even within a given thread. (May 21, 2015.) • Add Section 7.8 which contains a first attempt to describe Olivier Giroux’s head-marked depen- • Update Section 7.9 to note that atomic op- dency chains. (November 11, 2014.) erations can extend dependency chains in the same way that their non-atomic counterparts • Add example code in Figures 15 and 16. Add can. (May 21, 2015.) similar figures to the proposals showing how they mark the dependency chains in the sample code. • Update Section 7.9 to explicitly call out the fact (November 11, 2014.) that stores and subsequent loads can extend de- pendency chains. (May 21, 2015.) At this point, the paper was published as N4321, which was revised as follows: • Update Section 7.9 to explicitly state that un- defined behavior terminates dependency chains, • Add a warning against using operands to & and and that if only one value of a pointer avoids | that leave only one bit set (or, respectively, undefined behavior, any dependency chains cleared) to Section 4.1.1. (April 9, 2015.) through that pointer at that point are termi- nated. (May 22, 2015.) • Update plots of RCU API usage. (April 19, 2014.) • Update Section 7.9 to document the difficul- ties stemming from value-specialization opti- • Add Section 7.9, which restricts dependency mizations. (May 23, 2015.) chains with the goal of allowing unmarked de- pendency chains while avoiding restricting com- • Update Section 7.9 to note the back-propagation piler optimizations. One important restriction is of the dependency-chain-breaking effects of the the elimination of RCU-protected array indexes. compiler deducing the exact value of a pointer. (April 19, 2014.) (May 27, 2015.)

• Add verbiage to Section 3.2 noting that the & • Update Section 7.9 to give more information operator is sometimes used to find the beginning on how much bit-setting and bit-clearing is too of a power-of-two aligned structure. (April 19, much. (June 8, 2015.) 2014.) • Update Section 7.9 to note that the Linux ker- • Apply feedback from Linus Torvalds to Sec- nel does not rely on dependency ordering when tion 7.9. (May 20, 2015.) a statically allocated structure is linked into an RCU-protected structure, but also note that op- • Apply feedback from Linus Torvalds to Sec- timizations breaking dependency ordering in this tion 7.9. (May 20, 2015.) case seem quite dubious. (July 11, 2015.) WG21/P0098R0 43

• Update Section 7.9 illustrating dependency- pointer-like types, with restrictions on opera- chain rules with examples. (July 11, 2015.) tions on intptr t and uintptr t. (September 21, 2015.) • Update Section 7.9 noting that simple loads suf- fice for memory order consume loads. (July 11, At this point, the paper was published as 2015.) P0098R0. • Update Sections 7.7 and 7.8 calling out abstrac- tion challenges for head- and tail-marking ap- proaches. (July 13, 2015.)

• Add subsections to Section 7.9. (July 13, 2015.)

• Update Section 7.9 to amplify objections to spe- cialization optimizations applied to heap-based pointers. (July 13, 2015.)

• Update Section 7.9.1 to note that if a given oper- ation extends a dependency chain, then so does its atomic counterpart. (August 21, 2015.)

• Add text throughout stating that the patches prohibiting carrying of dependencies through in- teger variables have been accepted into v4.2 of the Linux kernel (with a very few exceptions). (September 5, 2015.)

• Add clarifications to Section 7.9, particularly surrounding division of responsibility between the developer and the compiler. (September 5, 2015.)

• Add diagnostic capability and unsolicited mem- ory fences as criteria in Table 1. (September 14, 2015.)

• Add Linux-kernel compatibility and formal ver- ification as criteria in Table 1. (September 14, 2015.)

• Add a storage-class proposal in Section 7.10 and update Table 1 accordingly. (September 17, 2015.)

• Update Linux-kernel synchronization statistics to v4.2. (September 19, 2015.)

• Update Section 7.10 to clarify that the storage- class approach retains type restriction: Only