Scalable Correct Memory Ordering via Relativistic Programming

Josh Triplett, Portland State University ([email protected])
Paul E. McKenney, IBM Linux Technology Center ([email protected])
Philip W. Howard, Portland State University ([email protected])
Jonathan Walpole, Portland State University ([email protected])

Portland State University Computer Science Department Technical Report 11-03, March 2011

Abstract

We propose and document a new concurrent programming model, relativistic programming. This model allows readers to run concurrently with writers, without blocking or using expensive synchronization. Relativistic programming builds on existing synchronization primitives that allow writers to wait for current readers to finish with minimal reader overhead. Our methodology models data structures as graphs, and reader algorithms as traversals of these graphs; from this foundation we show how writers can implement arbitrarily strong ordering guarantees for the visibility of their writes, up to and including total ordering.

1 Introduction

Concurrent programming faces a difficult tradeoff between simplicity and scalability. Simple concurrent algorithms incur high synchronization costs, while highly scalable algorithms prove complex to develop and maintain. This dilemma hinges primarily on the ordering of memory operations, and the order in which those operations become visible to threads. Simple concurrent algorithms force more serialization to preserve sequential reasoning techniques, while scalable algorithms allow more reordering and force the programmer to reason about concurrent behavior.

Existing concurrent programming models approach this dilemma by choosing a point on the sliding scale and attempting to incrementally reduce the drawbacks associated with that point. Transactional memory presents a simple programming model in which transactions appear serialized, and strives to eliminate the expensive synchronization costs. Scalable algorithms seek simplicity by introducing reasoning models for weakly ordered memory operations, and by using proof systems to verify correctness.

We propose relativistic programming, a new model for concurrent programming which achieves scalability without sacrificing the simplicity of strong ordering guarantees. This model allows readers to run concurrently with writers, without blocking or using expensive synchronization. Despite this concurrency, relativistic writers can strictly control the order in which their writes become visible to readers, allowing straightforward reasoning about relativistic algorithms.

Readers commonly perform multiple separate memory reads as they traverse a data structure. By allowing readers to run concurrently with writers, and by avoiding any delays in readers, we allow readers to perform their reads interleaved with a writer's writes. It therefore becomes the writer's responsibility to prevent readers from observing inconsistent states. Without blocking readers, the writer cannot generally make a series of writes appear atomic, but our methodology allows the writer to maintain the order of its writes. With no write-side synchronization, readers could observe any possible permutation of writes; by adding write-side synchronization, writers may strictly preserve program order for any or all writes, allowing any ordering policy up to and including total order. The choice of ordering policy affects writer latency, but never impacts reader performance or scalability.

"Laws of Order" [1] documents a set of interlocked properties of concurrent algorithms, and proves that any algorithm with all of those properties must rely on expensive synchronization instructions. These properties include "strong non-commutativity": multiple operations whose results both depend on their relative ordering. Relativistic readers, however, cannot execute strongly non-commutative operations: a relativistic reader cannot affect the results of a write, regardless of ordering. This allows relativistic readers to avoid expensive synchronization while maintaining correctness.

Relativistic programming builds on several existing synchronization primitives; in particular, relativistic writers rely on a primitive that waits for all current readers to finish. Read-Copy Update (RCU) implementations [10, 9, 7] provide such a primitive, with little to no overhead for the corresponding read-side delineation [8]. Any synchronization primitive with the same semantics will work as well.
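To make these semantics concrete, the sketch below shows the shape of the two primitives using the userspace RCU library (liburcu) as one possible implementation; the model assumes only the stated semantics, not this particular library, and the two functions here are purely illustrative.

#include <urcu.h>  /* liburcu: one implementation of these primitives */

/* Read-side delineation: cheap, never blocks, and never excludes other
 * readers or writers. (With liburcu, each reader thread must also call
 * rcu_register_thread() once at startup.) */
void example_reader(void)
{
        rcu_read_lock();          /* begin a finite-duration read */
        /* ... traverse the data structure ... */
        rcu_read_unlock();        /* end the read */
}

/* Write-side primitive: returns only after every reader that was
 * already running when the call began has finished. */
void example_writer_step(void)
{
        /* ... first write ... */
        synchronize_rcu();        /* wait for current readers */
        /* ... second write: no reader sees this without the first ... */
}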
Section 2 documents the relativistic programming methodology. Section 3 presents the status of our ongoing research in this area. Section 4 summarizes our contributions.

2 Methodology

Consider a simple set of reader and writer algorithms based on mutual exclusion. Each algorithm begins by acquiring a lock, and ends by releasing that lock. Readers and writers run for a finite duration, delineated by these lock operations. As far as other threads can observe, all operations of a running thread occur between the lock and unlock operations; since other threads cannot examine the data structure without acquiring the lock themselves, the operations performed by a thread appear atomic. Thus, any set of reads and writes in this system will appear sequential.

Reader-writer locking does not change this property. Readers cannot modify the memory observed by other readers, so running readers concurrently does not affect the sequential consistency of the system; to generate an equivalent sequential order, choose any arbitrary order for the read operations. Similarly, common applications of non-blocking synchronization and transactional memory will roll back readers or writers to ensure that reader operations cannot observe intermediate states.

When using a relativistic programming technique such as RCU, readers and writers still have finite durations, but readers no longer exclude other readers, or even writers. Thus, any number of readers can run concurrently, together with a simultaneous writer, providing joint-access parallelism. Multiple writers continue to use disjoint-access parallelism, accessing separate data simultaneously through the use of locking or partitioned data structures.

A reader may perform multiple distinct read operations on memory in the course of traversing a single data structure. For instance, in the course of looking up an item in a hash table, a reader may read the base address of the hash table, the head pointer of the desired hash bucket, and the various "next" pointers of the linked list in that bucket. These multiple reads may still represent a single semantic operation on some abstract data type, such as the lookup of an item in a map or set.
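The sketch below (our own illustration; the types and names are hypothetical, and no concurrency control appears yet) makes this sequence of distinct reads explicit for a chained hash table:

#include <stddef.h>

struct node {
        int key;
        int value;
        struct node *next;
};

struct hash_table {
        size_t nbuckets;
        struct node **buckets;     /* array of bucket head pointers */
};

/* One semantic operation (a map lookup) performed as several distinct
 * memory reads: the table base, one bucket head pointer, and then a
 * chain of "next" pointers. */
struct node *lookup(struct hash_table *ht, int key)
{
        struct node *n;

        n = ht->buckets[(size_t)key % ht->nbuckets]; /* base, then head */
        while (n != NULL) {
                if (n->key == key)        /* read the node's key */
                        return n;
                n = n->next;              /* read one "next" pointer */
        }
        return NULL;
}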
Similarly, a single writer may need to perform multiple modifications to implement a single semantic operation. For example, to implement removal from a linked list or hash table, a writer must modify the preceding next pointer (or head pointer) to point past the disappearing item, and then reclaim the memory associated with that node. To move a node in a linked list, a writer could insert the new node and remove the old, in some order.

In the absence of additional synchronization, if a writer with concurrent readers performs two memory operations on a data structure, a reader traversing the data structure and reading those memory locations at different times may observe four distinct states of memory: the state with neither operation performed, the state with both operations performed, or either of the two states with a single operation performed. (The same principles extend to more than two memory operations in the obvious manner, with 2^n possible reader observations for n memory operations; subsequent examples will initially focus on the case of two memory operations.)

With neither operation performed, the state of the data structure appears the same to a reader as if that reader had completed before the writer began; thus, observing this state cannot harm a reader. Similarly, with both operations performed, the state of the data structure appears the same to a reader as if that reader had begun after the writer completed; thus, observing this state can do no harm either. Incorrect behavior can occur only if a reader observes one of the two possible intermediate states.

For an example of such harm, consider the removal of a node described earlier: first unlink the node, then free the memory. If a reader does not observe the node unlinked (and thus sees the node), the reader must not observe the memory reclamation; otherwise, the reader would access reclaimed memory.

Mutual exclusion takes the simplest approach to avoiding this problem: by preventing readers from running concurrently with writers, readers cannot observe intermediate states at all, and the writer's modifications appear atomic. This simplifies reasoning about the writer's algorithm, making it necessary to consider only the initial and final states of the data structure. In the removal example, the reader would either observe the node in the data structure and not reclaimed, or would not observe the node at all.
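Written as code, the removal is just two plain stores. The sketch below deliberately shows the unsynchronized, incorrect writer (list layout as in the lookup sketch above), with the four possible reader observations spelled out in comments:

#include <stdlib.h>

struct node { int key; int value; struct node *next; };

/* BROKEN with concurrent readers; shown only to illustrate the hazard.
 * A reader interleaving its reads with these two writes may observe:
 *   - neither write: as if the reader ran entirely before the writer;
 *   - both writes:   as if the reader ran entirely after the writer;
 *   - the unlink but not the reclamation: the reader merely misses the
 *     node, which is harmless;
 *   - the node (via a pointer fetched before the unlink) after its
 *     memory has been reclaimed: the reader dereferences freed memory.
 * Only the last, intermediate state causes incorrect behavior. */
void remove_node_unsafe(struct node **prevp, struct node *victim)
{
        *prevp = victim->next;    /* write 1: unlink the node */
        free(victim);             /* write 2: reclaim its memory */
}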

However, the writer often does not need to guarantee full atomicity of its changes. Of the four possible reader observations of the two changes required to remove a node, only one causes a problem in practice: not observing the removal (thus finding the node) and subsequently observing the memory reclamation (reading reclaimed memory). The reverse situation cannot arise for removal: a reader cannot observe the removal and then care about the reclamation one way or another. Thus, the writer need only ensure that the operations appear to readers in a particular order, rather than making them appear atomic.

In the case of moving a node in a linked list or hash table, both orderings of the operations produce unusual results. If a reader observes the removal and not the insertion, it will fail to find the item at all. If a reader observes the insertion and not the removal, it will observe two copies of the item. However, the latter scenario might not violate the semantics of the data structure: if looking for that node, the reader will find one of the two identical copies, and if looking for another node, the reader will skip the duplicated items. Barring some additional semantic requirement, this ordering does not pose a problem. Thus, again, the writer need only preserve the order of its write operations. This tolerance of duplicates allows relativistic writers to copy part of a data structure and operate on the copy to avoid disturbing concurrent readers.

Without the ability to block readers, a writer must instead prevent readers from observing any alternative ordering of its write operations. Once the writer has performed the first operation, some concurrent readers will observe that operation, and others will not. The former readers cannot cause a problem: whether or not they observe the second write, they will not violate the ordering. The latter readers, which have not seen the first write, constitute potential hazards, since if they subsequently observe the second write, they violate the ordering constraint.

The writer can achieve this ordering with a "wait for current readers" operation, such as those provided by RCU. A reader can only observe two operations in different orders if its traversal runs concurrently with both operations. Thus, if the writer can prevent readers from running concurrently with both operations, the writer has enforced the order of those operations. The semantics of the "wait for current readers" operation require that if the beginning of a reader occurs before the beginning of the wait operation, the end of the reader must occur before the end of the wait operation. Thus, if a writer inserts a "wait for current readers" operation between two other writes, no reader can run concurrently with both writes, because any such reader would necessarily begin before the wait operation begins and end after the wait operation ends, violating the wait semantics.

This introduces a key property of our methodology for correct relativistic programming: waiting for current readers acts as a strong ordering barrier between two write operations. This mechanism extends to ordering an arbitrary number of operations, as long as a wait operation occurs between each pair. Thus, of the 2^n possible observations of n operations, a writer can eliminate all but n + 1: the remaining possibilities consist of observing incrementally longer prefixes of the series of operations, from observing none of them up to observing the entire set.
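Applied to the removal example, a wait operation placed between the two writes turns the earlier unsafe sketch into a correct writer (liburcu names again stand in for any primitive with these semantics):

#include <stdlib.h>
#include <urcu.h>

struct node { int key; int value; struct node *next; };

/* Correct relativistic removal: the wait ensures that no reader runs
 * concurrently with both the unlink and the reclamation, so no reader
 * can still hold a reference to the node when it is freed. */
void remove_node(struct node **prevp, struct node *victim)
{
        *prevp = victim->next;    /* write 1: unlink the node */
        synchronize_rcu();        /* every reader that might still see
                                     the node has now finished */
        free(victim);             /* write 2: reclaim its memory */
}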
In theory, a relativistic writer can order its operations with this approach alone and achieve correctness. However, the "wait for current readers" operation adds significant latency to the writer, and using it multiple times replicates that latency. Thus, for practical reasons, the writer needs an alternative means of ordering write operations that does not always require a wait operation.

To implement a lightweight ordering mechanism, writers can take the reader algorithms into account. Readers, when traversing a data structure, introduce a natural ordering corresponding to the order in which they traverse nodes. Almost all modern CPU and memory architectures guarantee that if a CPU reads a pointer value and then dereferences that pointer to read the target memory (a dependent read), the fetch of the pointer will occur before the fetch of the target memory. (On architectures such as DEC Alpha, which do not automatically guarantee memory ordering for dependent reads [5], such traversals require appropriate read-side barriers, such as smp_read_barrier_depends in the Linux kernel, used in rcu_dereference. Aggressive optimizations, particularly those regarding local caches of global data, can also lead to problems in this step [2]; this may necessitate compile-time barriers to locally prevent such optimizations.) Thus, dependent reads occur in the reader's program order.

Common data structures storable in shared memory look like some subset of a directed graph: allocated memory corresponds to a graph node, and a pointer corresponds to a directed edge from the node containing the pointer to the node containing the target address. Because dependent reads occur in program order, the reader has a well-defined node position at any given point in its traversal; call this the reader's cursor. A writer cannot easily observe a reader's cursor directly, but it can nonetheless rely on these cursors moving according to the reader's traversal algorithm.
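In code, a reader's cursor is simply the pointer it advances through dependent loads. A minimal relativistic list reader might look like the following sketch, where liburcu's rcu_dereference supplies whatever read-side barriers the architecture and compiler require, per the note above:

#include <urcu.h>

struct node { int key; int value; struct node *next; };

/* Return the value stored under key, or -1 if absent. The variable n
 * is the reader's cursor; each rcu_dereference() is a dependent read,
 * so the cursor moves through the list in program order. */
int find_value(struct node **headp, int key)
{
        struct node *n;
        int val = -1;

        rcu_read_lock();
        for (n = rcu_dereference(*headp); n != NULL;
             n = rcu_dereference(n->next)) {
                if (n->key == key) {
                        val = n->value;   /* copy out before the read ends */
                        break;
                }
        }
        rcu_read_unlock();
        return val;
}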

Similarly, write operations occur at a well-defined node position. Readers whose cursors will reach that node position, but have not yet done so, must necessarily observe that write operation. In an acyclic data structure, readers whose cursors have already passed that node position cannot observe that write operation.

To avoid excess complexity in this reasoning, consider only those structures isomorphic to an acyclic graph. This excludes structures in which the read algorithms may revisit the same node more than once in a single traversal, which significantly simplifies reasoning about the progression of read cursors. It does not prohibit structures in which cycles exist, only structures in which readers may perform cyclic traversals. For example, a doubly linked list would support cyclic traversals, but if we prohibit readers from following the "previous" pointers, reader traversals remain acyclic. Writers may still use those previous pointers for their own operations.

With this conceptual framework in mind, consider again the algorithm for moving a node via insertion and removal. The writer must insert the new node, and subsequently remove the original node, without allowing any reader to observe the second operation and not the first. The writer's insertion occurs at a particular position in the list. After the writer has inserted the new node, reader cursors before that position will observe the node, and cursors past that position will not. An ordering violation will occur if reader cursors past the insertion position can subsequently observe the removal; this would contradict the writer's program order, and cause the reader to miss the node entirely.

If the newly inserted node appears in the list before the original node, read cursors between the two positions might observe an ordering violation: they have already missed the inserted node, and might potentially miss the original node as well, making the node appear missing. The writer can preserve ordering by waiting for those readers to finish; this "flushes" all existing cursors from the structure, and any new readers must start from the well-defined start locations for traversals (in the case of a linked list, the head of the list). Any new read cursors must pass through the inserted node, and thus cannot observe the ordering violation.

However, if the newly inserted node appears in the list after the original node, no reader can observe a problematic ordering violation. Readers between the two locations must have observed the original node, and may subsequently observe the new node, but as previously described this duplication need not violate the data structure semantics. Thus, the writer need not flush all cursors from the structure. As long as the writer preserves the program order of its writes via appropriate write memory barriers, the readers will observe the writes in the same order. Such a write barrier can occur implicitly as part of a relativistic "publish" primitive for pointers.

This observation provides a general rule for efficiently ordering writes: writers may rely on reader traversal order to provide a natural order for observations, and need only wait for current readers to finish if that natural order does not imply the desired order.
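Both cases of the move algorithm appear in the sketch below. It is our own illustration under the same list layout as before: liburcu's rcu_assign_pointer serves as the publish primitive, cmm_smp_wmb as the plain write barrier, and the caller is assumed to know (and pass as a flag) whether reader cursors reach the new position before the old one:

#include <stdlib.h>
#include <urcu.h>
#include <urcu/arch.h>    /* cmm_smp_wmb() */

struct node { int key; int value; struct node *next; };

/* Move an item by inserting new_node at one list position and then
 * unlinking old_node from another. new_pos_first says whether reader
 * cursors reach the new position before the old one. */
void move_node(struct node **insert_prevp, struct node *new_node,
               struct node **remove_prevp, struct node *old_node,
               int new_pos_first)
{
        /* Publish the new node. rcu_assign_pointer() includes a write
         * barrier, so any reader that finds the node also sees its
         * contents fully initialized. */
        new_node->next = *insert_prevp;
        rcu_assign_pointer(*insert_prevp, new_node);

        if (new_pos_first) {
                /* Cursors between the two positions missed the new node
                 * and might also miss the old one: flush all existing
                 * cursors before unlinking the original. */
                synchronize_rcu();
        } else {
                /* Readers reach the original node first; a write barrier
                 * keeps the insertion visible before the unlink, and the
                 * worst case is a harmless duplicate. */
                cmm_smp_wmb();
        }

        *remove_prevp = old_node->next;   /* unlink the original node */

        synchronize_rcu();                /* as with removal, wait before */
        free(old_node);                   /* reclaiming the memory */
}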
Finally, while we have provided the tools by which a relativistic writer can preserve the ordering of arbitrary write operations, a given write algorithm need not preserve every possible ordering constraint. A writer may omit the synchronization between two writes if readers will not care about the order in which those writes become visible; subsequent synchronization will still order that entire set of writes with respect to later writes. In particular, a writer may choose to batch a series of updates with a single synchronization. Similarly, independent writers may not need to order their updates with respect to each other, despite ordering the individual steps of each update to preserve correctness. Whether a writer may weaken ordering in this way depends on the semantics required by the corresponding read algorithms; however, this choice cannot affect the performance of readers, only that of writers.

Summarizing the properties of correct relativistic algorithms:

1. Writers need only enforce the ordering of their operations, not the atomicity of an entire block of operations.

2. Writers can wait for current readers between two write operations to enforce the order of those operations for concurrent readers.

3. Writers may rely on reader traversal order to provide a natural order for their observations, and need only wait for current readers to finish if the writer might overtake a concurrent reader traversing in the same direction; otherwise, a standard write memory barrier suffices.

4. Writers may choose to leave a series of writes unordered and omit synchronization between them, while still using synchronization before or after those writes to order other writes with respect to the entire series. This allows writers to reduce write latency by relying on the semantics of the reader algorithm, without affecting reader scalability; the batching sketch after this list illustrates the pattern.
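For instance, a writer removing several nodes whose relative visibility order readers do not care about can amortize a single wait over the whole batch (same assumptions as the earlier removal sketch; the caller supplies each victim's preceding pointer):

#include <stdlib.h>
#include <urcu.h>

struct node { int key; int value; struct node *next; };

/* Unlink n nodes with no synchronization between the unlinks, since
 * readers do not care about their relative order; one wait then orders
 * the entire series of unlinks before the entire series of frees. */
void remove_batch(struct node **prevp[], struct node *victim[], int n)
{
        int i;

        for (i = 0; i < n; i++)
                *prevp[i] = victim[i]->next;   /* unordered unlinks */

        synchronize_rcu();    /* one wait, amortized over the batch */

        for (i = 0; i < n; i++)
                free(victim[i]);
}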

3 Ongoing Research

Our current methodology for relativistic programming focuses on the scalability of concurrent readers, pushing the costs of synchronization and correctness onto writers. This approach proves ideal for many common read-mostly workloads. The converse case of write-mostly workloads demands an approach which allows writers to proceed without synchronization, with readers bearing the synchronization costs. We can achieve that goal through partitioned data structures, which allow writers to perform CPU-local or thread-local writes without synchronization; readers then observe all relevant local structures to obtain a result. Simple cases of partitioned data structures include sloppy counters [3] and approximate counters [4], which partition a single integer counter. We plan to provide a more comprehensive treatment of write-optimized concurrent algorithms in the future.
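A minimal sketch of this write-optimized pattern, in the spirit of the cited counters but written here as our own C11 illustration rather than code from those systems:

#include <stdatomic.h>

#define NSLOTS 64    /* e.g., one slot per CPU or per thread */

struct counter_slot {
        _Atomic long v;
        char pad[64 - sizeof(_Atomic long)];   /* avoid false sharing */
};

static struct counter_slot counter[NSLOTS];

/* Writer: one uncontended local update; no synchronization with other
 * writers or with readers. */
void counter_inc(int slot)
{
        atomic_fetch_add_explicit(&counter[slot].v, 1,
                                  memory_order_relaxed);
}

/* Reader: sums every slot; the result is approximate while updates are
 * in flight, which the read-side semantics must tolerate. */
long counter_read(void)
{
        long sum = 0;

        for (int i = 0; i < NSLOTS; i++)
                sum += atomic_load_explicit(&counter[i].v,
                                            memory_order_relaxed);
        return sum;
}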
Most existing relativistic algorithms synchronize between writers using mutual exclusion, at various degrees of granularity. However, to simplify the construction of relativistic writer algorithms, we have also developed a technique to implement relativistic writers using transactional memory, without disrupting or rolling back concurrent relativistic reads.

Scalable relativistic implementations of various data structures already exist based on Read-Copy Update, including linked lists, radix trees, and hash tables. We recently developed relativistic algorithms for resizable hash tables [11] and balanced trees [6], based heavily on the general methodology presented here.

In Section 1, we discussed "Laws of Order", the properties it introduces which require expensive synchronization instructions, and how relativistic readers can avoid those properties and the corresponding cost. We have completed an extensive analysis of "Laws of Order" and its implications for scalable concurrent programming, including the remaining properties which imply expensive synchronization instructions, and the various ways to avoid those properties and the associated cost. We plan to document this analysis in future work.

4 Conclusion

Our proposed relativistic programming methodology supports the construction of concurrent algorithms for arbitrary acyclic data structures. Relativistic readers provide excellent scalability and performance by never blocking or using expensive synchronization instructions. The corresponding relativistic writers may enforce a total ordering of write operations based on writer program order, or weaken that ordering to decrease write latency, without affecting reader scalability. Relativistic programming thus allows concurrent algorithms to achieve scalability without sacrificing the simplicity of strong memory ordering.

5 Acknowledgments

Funding for this research was provided by two Maseeh Graduate Fellowships and by the National Science Foundation under Grant No. CNS-0719851. Thanks to Dr. Fariborz Maseeh and the National Science Foundation for their support.

References

[1] Attiya, H., Guerraoui, R., Hendler, D., Kuznetsov, P., Michael, M. M., and Vechev, M. Laws of Order: Expensive Synchronization in Concurrent Algorithms Cannot be Eliminated. In Proceedings of the ACM POPL '11 (2011).

[2] Boehm, H.-J. Threads cannot be implemented as a library. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (2005), ACM, pp. 261–268.

[3] Boyd-Wickizer, S., Clements, A. T., Mao, Y., Pesterev, A., Kaashoek, M. F., Morris, R., and Zeldovich, N. An Analysis of Linux Scalability to Many Cores. In OSDI '10: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (2010).

[4] Corbet, J. The search for fast, scalable counters. Linux Weekly News, http://lwn.net/Articles/202847/, February 2006.

[5] Digital Equipment Corporation. Shared Memory, Threads, Interprocess Communication. August 2001.

[6] Howard, P. W., and Walpole, J. Relativistic red-black trees. Tech. Rep. 10-06, Portland State University, 2010. http://www.cs.pdx.edu/pdfs/tr1006.pdf.

[7] McKenney, P. E. Exploiting Deferred Destruction: An Analysis of Read-Copy-Update Techniques in Operating System Kernels. PhD thesis, OGI School of Science and Engineering at Oregon Health and Sciences University, 2004.

[8] McKenney, P. E. RCU vs. locking performance on different CPUs. In linux.conf.au (Adelaide, Australia, January 2004).

[9] McKenney, P. E., Sarma, D., Arcangeli, A., Kleen, A., Krieger, O., and Russell, R. Read-copy update. In Ottawa Linux Symposium (June 2002), pp. 338–367.

[10] McKenney, P. E., and Slingwine, J. D. Read-Copy Update: Using Execution History to Solve Concurrency Problems. In Parallel and Distributed Computing and Systems (October 1998), pp. 509–518.

[11] Triplett, J., McKenney, P. E., and Walpole, J. Resizable, scalable, concurrent hash tables. Tech. Rep. 11-01, Portland State University, 2011. http://www.cs.pdx.edu/pdfs/tr1101.pdf.
