David Padua (Ed.) Encyclopedia of Parallel Computing

With  Figures and  Tables

123 Editor-in-Chief David Padua University of Illinois at Urbana-Champaign Urbana, IL USA

ISBN ---- e-ISBN ---- DOI ./---- Print and electronic bundle ISBN: ---- Springer New York Dordrecht Heidelberg London

Library of Congress Control Number: 

© Springer Science+Business Media, LLC 

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC,  Spring Street, New York, NY , USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com) Synchronization S 

. Duato J () A new theory of deadlock-free adaptive routing may produce incorrect results. As a trivial example, con- in wormhole networks. IEEE Trans Parallel Distrib Syst (): sider a global counter incremented by multiple threads. – Each loads the counter into a register, increments . Duato J () A necessary and sufficient condition for deadlock- the register, and writes the updated value back to mem- free adaptive routing in wormhole networks. IEEE Trans Parallel Distrib Syst ():– ory. If two threads load the same value before either . Duato J () A necessary and sufficient condition for deadlock- stores it back, updates may be lost: free routing in cut-through and store-and-forward networks. IEEE Trans Parallel Distrib Syst ():– c==0 . Peh L-S, Dally WJ () Flit reservation flow control. In: Thread 1: Thread 2: Proceedings of the th international symposium on high- r1 := c performance computer architecture, Toulouse, France, January , pp – r1:= c . Stunkel CB et al () Architecture and implementation of vul- ++r1 can. In: Proceedings of the th international parallel processing ++r1 symposium, Cancun, Mexico, pp – c:=r1 . Stunkel CB et al () The SP high-performance switch. In: Pro- c:=r1 ceedings of the scalable high performance computing conference, Knoxville, pp – c==1 . Seitz C () The cosmic cube. Commun ACM ():– Synchronization serves to preclude invalid thread interleavings. It is commonly divided into the subtasks of atomicity and condition synchronization. Atomicity Symmetric Multiprocessors ensures that a given sequence of instructions, typi- cally performed by a single thread, appears to all other Shared-Memory Multiprocessors threads as if it had executed indivisibly – not inter- leaved with anything else. In the example above, one would typically specify that the load-increment-store instruction sequence should execute atomically. 
Synchronization Condition synchronization forces a thread to wait, before performing an operation on shared data, until MichaelL. Scott some desired precondition is true. In the example above, University of Rochester, Rochester, NY, USA one might want to wait until all threads had performed their increments before reading the final count. Synonyms While it is tempting to suspect that condition syn- Fences; Multiprocessor synchronization; Mutual exclu- chronization subsumes atomicity (make the precondi- sion; Process synchronization tion be that no other thread is currently executing a conflicting operation), atomicity is in fact considerably S Definition harder, because it requires consensus among all com- Synchronization is the use of language or library mech- peting threads: they must all agree as to which will anisms to constrain the ordering (interleaving) of proceed and which will wait. Put another way, condi- instructions performed by separate threads, to preclude tion synchronization delays a thread until some locally orderings that lead to incorrect or undesired results. observable condition is seen to be true; atomicity is a property of the system as a whole. Discussion Like many aspects of parallel computing, syn- In a parallel program, the instructions of any given chronization looks different in shared-memory and thread appear to occur in sequential order (at least from message-passing systems. In the latter, synchronization that thread’s point of view), but if the threads run inde- is generally subsumed in the message-passing meth- pendently, their sequences of instructions may inter- ods; in a shared-memory system, it typically employs a leave arbitrarily, and many of the possible interleavings separate set of methods.  S Synchronization
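The lost-update scenario can be made concrete in Python. The first half of this sketch replays the unlucky interleaving by hand (a real race is timing-dependent and hard to provoke on demand); the second half shows that bracketing the load-increment-store sequence with a mutex makes it atomic:

```python
import threading

# Replay the unlucky schedule by hand: both "threads" load c before
# either stores back, so one of the two increments is lost.
c = 0
r1 = c          # Thread 1 loads c (sees 0)
r2 = c          # Thread 2 loads c (also sees 0)
r1 += 1         # Thread 1 increments its register
r2 += 1         # Thread 2 increments its register
c = r1          # Thread 1 stores 1
c = r2          # Thread 2 stores 1: Thread 1's update is lost
print(c)        # 1, not 2

# With a mutex, the whole load-increment-store sequence is atomic.
c = 0
lock = threading.Lock()

def increment(times):
    global c
    for _ in range(times):
        with lock:      # acquire ... release brackets the critical section
            c += 1

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(c)        # 40000: no updates lost
```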

Shared-memory implementations of synchronization can be categorized as busy-wait (spinning) or scheduler-based. The former actively consume processor cycles until the running thread is able to proceed. The latter deschedule the current thread, allowing the processor to be used by other threads, with the expectation that future activity by one of those threads will make the original thread runnable again. Because it avoids the cost of two context switches, busy-wait synchronization is typically faster than scheduler-based synchronization when the expected wait time is short and when the processor is not needed for other purposes. Scheduler-based synchronization is typically faster when expected wait times are long; it is necessary when the number of threads exceeds the number of processors (else quantum-long delays or even deadlock can occur). In the typical implementation, busy-wait synchronization is built on top of whatever hardware instructions execute atomically. Scheduler-based synchronization, in turn, is built on top of busy-wait synchronization, which is used to protect the scheduler's own data structures (see entries on Scheduling Algorithms and on Processes, Tasks, and Threads).

Hardware Primitives
In the earliest multiprocessors, load and store were the only memory-access instructions guaranteed to be atomic, and busy-wait synchronization was implemented using these. Modern machines provide a variety of atomic read-modify-write (RMW) instructions, which serve to update a memory location atomically. These significantly simplify the implementation of synchronization. Common RMW instructions include:

Test-and-set (l) sets the Boolean variable at location l to true, and returns the previous value.
Swap (l, v) stores the value v to location l and returns the previous value.
Atomic-ϕ (l, v) replaces the value o at location l with ϕ(o, v) for some simple arithmetic function ϕ (add, sub, and, etc.).
Fetch-and-ϕ (l, v) is like atomic-ϕ, but also returns the previous value.
Compare-and-swap (l, o, n) inspects the value v at location l, and if it is equal to o, replaces it with n. In either case, it returns the previous value, from which one can deduce whether the replacement occurred.
Load-linked (l) and store-conditional (l, v). The first of these returns the value at location l and "remembers" l. The second stores v to l if l has not been modified by any other processor since a previous load-linked by the current processor.

These instructions differ in their expressive power. Herlihy has shown [9] that compare-and-swap (CAS) and load-linked / store-conditional (LL/SC) are universal primitives, meaning, informally, that they can be used to construct a non-blocking implementation of any other RMW operation. The following code provides a simple implementation of fetch-and-ϕ using CAS:

    val old := *l;
    loop
        val new := phi(old);
        val found := CAS(l, old, new);
        if (old == found) break;
        old := found;

If the CAS test in this code fails (found differs from old), it must be because some other thread successfully modified *l. The system as a whole has made forward progress, but the current thread must try again. As discussed in the entry on Non-blocking Algorithms, this simple implementation is lock-free but not wait-free. There are stronger (but slower and more complex) non-blocking implementations in which each thread is guaranteed to make forward progress in a bounded number of its own instructions.
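Python provides no hardware CAS, so this sketch models a memory word as a small class whose compare_and_swap is made atomic with an internal lock; the class and helper names are illustrative, but the retry loop itself follows the pseudocode above line for line:

```python
import threading

class Word:
    """A memory word with a CAS primitive (atomicity simulated via a lock)."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()
    def load(self):
        return self._value
    def compare_and_swap(self, old, new):
        # Atomically: if the word equals old, store new; either way
        # return the previous value.
        with self._lock:
            prev = self._value
            if prev == old:
                self._value = new
            return prev

def fetch_and_phi(word, phi):
    """Lock-free retry loop from the article: repeat CAS until it succeeds."""
    old = word.load()
    while True:
        new = phi(old)
        found = word.compare_and_swap(old, new)
        if found == old:
            return old          # CAS succeeded; return the previous value
        old = found             # someone else got in first; try again

counter = Word(0)
def worker():
    for _ in range(1000):
        fetch_and_phi(counter, lambda v: v + 1)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.load())   # 8000
```

As in the pseudocode, a failed CAS means some other thread's update committed, so the system as a whole made progress even though this thread retries.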
NB: In any distributed system, and in most modern shared-memory systems, instructions executed by a given thread are not, in general, guaranteed to be seen in sequential order by other threads, and instructions of any two threads are not, in general, guaranteed to be seen in the same order by all of their peers. Modern processors typically provide so-called fence or barrier instructions (not to be confused with the barriers discussed under Condition Synchronization below) that force previous instructions of the current thread to be seen by other threads before subsequent instructions of the current thread. Implementations of synchronization methods typically include sufficient fences that if synchronization method s1 in thread t1 occurs before synchronization method s2 in thread t2, then all instructions that precede s1 in t1 will appear in t2 to have occurred before any of its own instructions that follow s2. For more information, see the entry on Memory Models. The remainder of the discussion here assumes that memory is sequentially consistent, that is, that instructions appear to interleave in some global total order that is consistent with program order in every thread.

Atomicity
A multi-instruction operation is said to be atomic if it appears to occur "all at once" from every other thread's point of view. In a sequentially consistent system, this means that the program behaves as if the instructions of the atomic operation were contiguous in the global instruction interleaving. More specifically, in any system, intermediate states of the atomic operation should never be visible to other threads, and actions of other threads should never become visible to a given thread in the middle of one of its own atomic operations.

The most straightforward way to implement atomicity is with a mutual-exclusion (mutex) lock – an abstract object that can be held by at most one thread at a time. In standard usage, a thread invokes the acquire method of the lock when it wishes to begin an atomic operation and the release method when it is done. Acquire waits (by spinning or rescheduling) until it is safe for the operation to proceed. The code between the acquire and release (the body of the atomic operation) is known as a critical section.
Critical sections that conflict with one another (typically, that access some common location, with at least one section writing that location) must be protected by the same lock. Programming discipline commonly ensures this property by associating data with locks. A thread must then acquire locks for all the data accessed in a critical section. It may do so all at once, at the beginning of the critical section, or it may do so incrementally, as the need for data is encountered. Considerable care may be required to ensure that locks are acquired in the same order by all critical sections, to avoid deadlock. All locks are typically held until the end of the critical section. This two-phase locking (all acquires occur before any releases) ensures that the global set of critical section executions remains serializable.

Relaxations of Mutual Exclusion
So-called reader–writer locks increase concurrency by observing that it is safe for more than one thread to read a location concurrently, so long as no thread is modifying that location. Each critical section is classified as either a reader or a writer of the data associated with a given lock. The reader_acquire method waits until there is no concurrent writer of the lock; the writer_acquire method waits until there is no concurrent reader or writer.

In a standard reader–writer lock, a thread must know, when it first reads a location, whether it will ever need to write that location in the current critical section. In some contexts it may be possible to relax this restriction. The Linux kernel, for example, provides a sequence lock mechanism that allows a reader to abort its peers and upgrade to writer status. Programmers are required to follow a restrictive programming discipline that makes critical sections "restartable," and checks, before any write or "dangerous" read, to see whether a peer's upgrade has necessitated a restart.

For data structures that are almost always read, and very occasionally written, several operating system kernels provide some variant of a mechanism known as RCU (originally an abbreviation for read-copy update). RCU divides execution into so-called epochs. A writer creates a new copy of any data structure it needs to update. It replaces the old copy with the new, typically using a single CAS instruction. It then waits until the end of the current epoch to be sure that all readers that might have been using the old copy have completed their critical sections (at which point it can reclaim the old copy, or perform other actions that depend on the visibility of the update). The advantage of RCU, in comparison to locks, is that it imposes zero overhead in the read-only case.

For more general-purpose use, transactional memory (TM) allows arbitrary operations to be executed atomically, with an underlying implementation based on speculation and rollback. Originally proposed [10] as a hardware assist for lock-free data structures – sort of a multi-word generalization of LL/SC – TM has seen a flurry of activity in recent years, and several hardware and software implementations are now widely available. Each keeps track of the memory locations accessed by transactions (would-be atomic operations). When two concurrent transactions are seen to conflict, at most one is allowed to commit; the others abort, "roll back," and try again, using a fully automated, transparent analogue of the programming discipline required by sequence locks. For further details, see the separate entry on TM.
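The reader–writer discipline described above can be sketched with a mutex and a condition variable. Python's standard library has no reader–writer lock, so the class below is illustrative; it omits writer preference, fairness, and upgrade:

```python
import threading

class ReaderWriterLock:
    """Readers may share the lock; a writer excludes everyone.
    Illustrative only: no writer preference, fairness, or upgrade."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0        # number of active readers
        self._writing = False    # True while a writer holds the lock

    def reader_acquire(self):
        with self._cond:
            while self._writing:              # wait until no concurrent writer
                self._cond.wait()
            self._readers += 1

    def reader_release(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()       # wake any waiting writer

    def writer_acquire(self):
        with self._cond:
            while self._writing or self._readers > 0:
                self._cond.wait()             # wait until no reader or writer
            self._writing = True

    def writer_release(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()

# Writers maintain the invariant x == y; readers check it.
rw = ReaderWriterLock()
data = {"x": 0, "y": 0}
violations = []

def writer():
    for _ in range(200):
        rw.writer_acquire()
        data["x"] += 1
        data["y"] += 1
        rw.writer_release()

def reader():
    for _ in range(200):
        rw.reader_acquire()
        if data["x"] != data["y"]:
            violations.append((data["x"], data["y"]))
        rw.reader_release()

threads = ([threading.Thread(target=writer) for _ in range(2)] +
           [threading.Thread(target=reader) for _ in range(4)])
for t in threads: t.start()
for t in threads: t.join()
print(len(violations), data["x"])   # 0 400: readers never see a torn update
```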

Fairness
Because they sometimes force multiple threads to wait, synchronization mechanisms inevitably raise issues of fairness. When a lock is released by the current holder, which waiting thread should be allowed to acquire it? In a system with reader–writer locks, should a thread be allowed to join a group of already-active readers when writers are already waiting? When transactions conflict in a TM system, which should be permitted to proceed, and which should wait or abort?

Many answers are possible. The choice among conflicting threads may be arbitrary, random, first-come-first-served (FIFO), or based on some other notion of priority. From the point of view of an individual thread, the resulting behavior may range from potential starvation (no progress guarantees) to some sort of proportional share of system run time. Between these extremes, a thread may be guaranteed to run eventually if it is continuously ready, or if it is ready infinitely often. Even given the possibility of starvation, the system as a whole may be livelock-free (guaranteed to make forward progress) as a result of algorithmic guarantees or pseudo-random heuristics. (Actual livelock is generally considered unacceptable.) Any starvation-free system is clearly livelock-free.

Simple Busy-Wait Locks
Several early locking algorithms were based on only loads and stores, but these are mainly of historical interest today. All required Ω(tn) space for t threads and n locks, and ω(1) (more-than-constant) time to arbitrate among threads competing for a given lock.

In modern usage, the simplest constant-space, busy-wait mutual exclusion lock is the test-and-set (TAS) lock, in which a thread acquires the lock by using a test-and-set instruction to change a Boolean flag from false to true. Unfortunately, spinning by waiting threads tends to induce extreme contention for the lock location, tying up bus and memory resources needed for productive work. On a cache-coherent machine, better performance can be achieved with a "test-and-test-and-set" (TATAS) lock, which reduces contention by using ordinary load instructions to spin on a value in the local cache so long as the lock remains held:

    type lock = Boolean;

    proc acquire(lock *l):
        while (test-and-set(l))
            while (*l) /* spin */ ;

    proc release(lock *l):
        *l := false;

This lock works well on small machines (up to, say, four processors). Which waiting thread acquires a TATAS lock at release time depends on vagaries of the hardware, and is essentially arbitrary. Strict FIFO ordering can be achieved with a ticket lock, which uses fetch-and-increment (FAI) and a pair of counters for constant space and (per-thread) time. To acquire the lock, a thread atomically performs an FAI on the "next available" counter and waits for the "now serving" counter to equal the value returned. To release the lock, a thread increments its own ticket, and stores the result to the "now serving" counter. While arguably fairer than a TATAS lock, the ticket lock is more prone to performance anomalies on a multiprogrammed system: if any waiting thread is preempted, all threads behind it in line will be delayed until it is scheduled back in.
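The ticket lock can be simulated in Python. Since Python exposes no FAI instruction, the sketch emulates it with a small lock-protected counter, which preserves the algorithm's structure if not its performance:

```python
import threading, time

class FetchAndIncrement:
    """Emulates the FAI instruction; atomicity comes from an internal lock."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()
    def fai(self):
        with self._lock:
            v = self._value
            self._value += 1
            return v

class TicketLock:
    """Two counters, as in the article: "next available" and "now serving"."""
    def __init__(self):
        self._next_available = FetchAndIncrement()
        self._now_serving = 0
    def acquire(self):
        my_ticket = self._next_available.fai()     # take a ticket
        while self._now_serving != my_ticket:      # spin until it is our turn
            time.sleep(0)                          # yield while busy-waiting
        return my_ticket
    def release(self):
        # Only the lock holder writes this, so a plain store suffices.
        self._now_serving += 1

lock = TicketLock()
count = 0
service_order = []

def worker():
    global count
    for _ in range(200):
        ticket = lock.acquire()
        service_order.append(ticket)   # recorded inside the critical section
        count += 1
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(count, service_order == sorted(service_order))   # 800 True: strict FIFO
```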

Scalable Busy-Wait Locks
On a machine with more than a handful of processors, TATAS and ticket locks scale poorly, with time per critical section growing linearly with the number of waiting threads. Anderson [1] showed that exponential backoff (reminiscent of the Ethernet contention-control algorithm) could substantially improve the performance of TATAS locks. Mellor-Crummey and Scott [17] showed similar results for linear backoff in ticket locks (where a thread can easily deduce its distance from the head of the line).

To eliminate contention entirely, waiting threads can be linked into an explicit queue, with each thread spinning on a separate location that will be modified when the thread ahead of it in line completes its critical section. Mellor-Crummey and Scott showed how to implement such queues in total space O(t + n) for t threads and n locks; their MCS lock is widely used in large-scale systems. Craig [4] and, independently, Landin and Hagersten [16] developed an alternative CLH lock that links the queue in the opposite direction and performs slightly faster on some cache-coherent machines. Auslander et al. developed a variant of the MCS lock that is API-compatible with traditional TATAS locks [3]. Kontothanassis et al. [14] and He et al. [8] developed variants of the MCS and CLH locks that avoid performance anomalies due to preemption of threads waiting in line.
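The queueing idea can be sketched at a high level: each waiter blocks on its own flag, and the releasing thread hands the lock directly to its successor. The Python class below is a loose analogue only, not the real MCS algorithm (which links queue nodes through swap and CAS on a tail pointer and spins on per-node flags):

```python
import threading
from collections import deque

class QueueLock:
    """Hand-off lock: each waiter blocks on its own Event, and the
    releasing thread passes the lock directly to its successor."""
    def __init__(self):
        self._mutex = threading.Lock()   # protects the wait queue itself
        self._queue = deque()
        self._held = False

    def acquire(self):
        me = threading.Event()
        with self._mutex:
            if not self._held:
                self._held = True        # lock was free: take it immediately
                return
            self._queue.append(me)       # otherwise join the line
        me.wait()                        # predecessor hands us the lock

    def release(self):
        with self._mutex:
            if self._queue:
                self._queue.popleft().set()   # wake exactly the next waiter
            else:
                self._held = False

lock = QueueLock()
total = 0
def worker():
    global total
    for _ in range(500):
        lock.acquire()
        total += 1
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(total)   # 2000
```

Because the lock is passed to the oldest waiter, acquisition order is FIFO, mirroring the fairness property of queue-based locks.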
Craig [] and, independently, Landin variablemaybemadesuitableforconditionsynchro- and Hagersten [] developed an alternative CLH lock nization by labeling it volatile (or, in C++’X, that links the queue in the opposite direction and per- atomic<>). The compiler will insert appropriate forms slightly faster on some cache-coherent machines. fences at reads and writes of volatile variables, and Auslander et al. developed a variant of the MCS will refrain from reordering them with respect to other lock that is API-compatible with traditional TATAS instructions. locks []. Kontothanassis et al. []andHeetal.[] Some other systems provide special event objects, developed variants of the MCS and CLH locks that with methods to set and await them. Semaphores and avoid performance anomalies due to preemption of monitors, described in the following two subsections, threads waiting in line. can be used for both mutual exclusion and condition synchronization. Scheduler-Based Locks In systems with dynamically varying concurrency, A busy-wait lock wastes processor resources when the fork and join methods used to create threads and expected wait times are long. It may also cause perfor- to verify their completion can be considered a form of mance anomalies or deadlock in a multiprogrammed condition synchronization. (These are, in fact, the prin- system. The simplest solution is to yield the proces- cipal form of synchronization in systems like Cilk and sorinthebodyofthespinloop,effectivelymoving OpenMP.) the current thread to the end of the scheduler’s ready list and allowing other threads to run. More com- Barriers monly, scheduler-based locks are designed to deschedule One form of condition synchronization is particularly the waiting thread, moving it (atomically) from the common in data-parallel applications, where threads ready list to a separate queue associated with the lock. 
Condition Synchronization
It is tempting to assume that busy-wait condition synchronization can be implemented trivially with a Boolean flag: a waiting thread spins until the flag is true; a thread that satisfies the condition sets the flag to true. On most modern machines, however, additional fence instructions are required both in the satisfying thread, to ensure that its prior writes are visible to other threads, and in the waiting thread, to ensure that its subsequent reads do not occur until after the spin completes. And even on a sequentially consistent machine, special steps are required to ensure that the compiler does not violate the programmer's expectations by reordering instructions within threads.

In some programming languages and systems, a variable may be made suitable for condition synchronization by labeling it volatile (or, in C++0x, atomic<>). The compiler will insert appropriate fences at reads and writes of volatile variables, and will refrain from reordering them with respect to other instructions.

Some other systems provide special event objects, with methods to set and await them. Semaphores and monitors, described in the following two subsections, can be used for both mutual exclusion and condition synchronization.

In systems with dynamically varying concurrency, the fork and join methods used to create threads and to verify their completion can be considered a form of condition synchronization. (These are, in fact, the principal form of synchronization in systems like Cilk and OpenMP.)

Barriers
One form of condition synchronization is particularly common in data-parallel applications, where threads iterate together through a potentially large number of algorithmic phases. A synchronization barrier, used to separate phases, guarantees that no thread continues to phase n + 1 until all threads have finished phase n.

In most (though not all) implementations, the barrier provides a single method, composed internally of an arrival phase that counts the number of threads that have reached the barrier (typically via a log-depth tree) and a departure phase in which permission to continue is broadcast back to all threads. In a so-called fuzzy barrier [6], these arrival and departure phases may be separate methods. In between, a thread may perform any instructions that neither depend on the arrival of other threads nor are required by other threads prior to their departure. Such instructions can serve to "smooth out" phase-by-phase imbalances in the work assigned to different threads, thereby reducing overall wait time.
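Phase-structured execution can be illustrated with Python's threading.Barrier, which provides the single arrival-plus-departure method described above:

```python
import threading

NUM_THREADS, NUM_PHASES = 4, 3
barrier = threading.Barrier(NUM_THREADS)
log = []                       # (phase, thread) records, appended under a lock
log_lock = threading.Lock()

def worker(tid):
    for phase in range(NUM_PHASES):
        with log_lock:
            log.append((phase, tid))   # the "work" for this phase
        barrier.wait()                 # no one proceeds to phase + 1 early

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads: t.start()
for t in threads: t.join()

phases = [p for p, _ in log]
print(phases == sorted(phases))   # True: phase records never interleave out of order
```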

Wait time may also be reduced by an adaptive barrier [7, 19], which completes the arrival phase in constant time after the arrival of the final thread.

Unfortunately, when t threads arrive more or less simultaneously, no barrier implementation using ordinary loads, stores, and RMW instructions can complete the arrival phase in less than Ω(log t) time. Given the importance of barriers in scientific applications, some supercomputers have provided special near-constant-time hardware barriers. In some cases the same hardware has supported a fast eureka method, in which one thread can announce an event to all others in constant time.

Semaphores
First proposed by Dijkstra in 1965 [5] and still widely used today, semaphores support both mutual exclusion and condition synchronization. A general semaphore is a nonnegative counter with an initial value and two methods, known as V and P. The V method increases the value of the semaphore by one. The P method waits for the value to be positive and then decreases it by one. A binary semaphore has values restricted to zero and one (it is customarily initialized to one), and serves as a mutual exclusion lock. The P method acquires the lock; the V method releases the lock. Programming discipline is required to ensure that P and V methods occur in matching pairs.

The typical implementation of semaphores pairs the counter with a queue of waiting threads. The V method checks to see whether the counter is currently zero. If so, it checks to see whether any threads are waiting in the queue and, if there are, moves one of them to the ready list. If the counter is already positive (in which case the queue is guaranteed to be empty) or if the counter is zero but the queue is empty, V simply increments the counter. The P method also checks to see whether the counter is zero. If so, it places the current thread on the queue and calls the scheduler to yield the processor. Otherwise it decrements the counter.

General semaphores can be used to represent resources of which there is a limited number, but more than one. Examples include I/O devices, communication channels, or free or full slots in a fixed-length buffer. Most operating systems provide semaphores as part of the kernel API.
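The resource-counting use of a general semaphore can be illustrated with Python's threading.Semaphore; the pool of three "channels" below is a hypothetical resource, and the peak counter merely verifies that no more than three threads are ever inside at once:

```python
import threading

MAX_CHANNELS = 3
channels = threading.Semaphore(MAX_CHANNELS)   # general semaphore, initial value 3
state = {"in_use": 0, "peak": 0}
state_lock = threading.Lock()

def use_channel():
    channels.acquire()             # P: wait for a free channel, then take it
    with state_lock:
        state["in_use"] += 1
        state["peak"] = max(state["peak"], state["in_use"])
    # ... use the channel ...
    with state_lock:
        state["in_use"] -= 1
    channels.release()             # V: give the channel back

threads = [threading.Thread(target=use_channel) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
print(state["in_use"], state["peak"] <= MAX_CHANNELS)   # 0 True
```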
Monitors
While semaphores remain the most widely used scheduler-based shared-memory synchronization mechanism, they suffer from several limitations. In particular, the association between a binary semaphore (mutex lock) and the data it protects is solely a matter of convention, as is the paired usage of P and V methods. Early experience with semaphores, combined with the development of language-level abstraction mechanisms in the 1970s, led several developers to suggest building higher-level synchronization abstractions into programming languages. These efforts culminated in the definition of monitors [12], variants of which appear in many languages and systems.

A monitor is a data abstraction (a module or class) with an implicit mutex lock and an optional set of condition variables. Each entry (method) of the monitor automatically acquires and releases the mutex lock; entry invocations thus exclude one another in time. Programmers typically devise, for each monitor, a program-specific invariant that captures the mutual consistency of the monitor's state (data members – fields). The invariant is assumed to be true at the beginning of each entry invocation, and must be true again at the end.

Condition variables support a pair of methods superficially analogous to P and V; in Hoare's original formulation, these were known as wait and signal. Unlike P and V, these methods are memory-less: a signal invocation is a no-op if no thread is currently waiting. For each reason that a thread might need to wait within a monitor, the programmer declares a separate condition variable. When it waits on a condition, the thread releases exclusion on the monitor. The programmer must thus ensure that the invariant is true immediately prior to every wait invocation.

Semantic Details
The details of monitors vary significantly from one language to another. The most significant issues, discussed in the paragraphs below, are commonly known as the nested monitor problem and the modeling of signals as hints vs. absolutes. More minor issues include language syntax, alternative names for signal and wait, the modeling of condition variables in the type system, and the prioritization of threads waiting for conditions or for access to the mutex lock.

The nested monitor problem arises when an entry of one monitor invokes an entry of another monitor, and the second entry waits on a condition variable. Should the wait method release exclusion on the outer monitor? If it does, there is no guarantee that the outer monitor will be available again when execution is ready to resume in the inner call. If it does not, the programmer must take care to ensure that the thread that will perform the matching signal invocation does not need to go through the outer monitor in order to reach the inner one. A variety of solutions to this problem have been proposed; the most common is to leave the outer monitor locked.

Signal methods in Hoare's original formulation were defined to transfer monitor exclusion directly from the signaler to the waiter, with no intervening execution. The purpose of this convention was to guarantee that the condition represented by the signal was still true when the waiter resumed. Unfortunately, the convention often has the side effect of inducing extra context switches, and requires that the monitor invariant be true immediately prior to every signal invocation. Most modern monitor variants follow the lead of Mesa [15] in declaring that a signal is merely a hint, and that a waiting process must double-check the condition before continuing execution. In effect, code that would be written

    if (!condition)
        cond_var.wait();

in a Hoare monitor is written

    while (!condition)
        cond_var.wait();

in a Mesa monitor. To make it easier to write programs in which a condition variable "covers" a set of possible conditions (particularly when signals are hints), many monitor variants provide a signal-all or broadcast method that awakens all threads waiting on a condition, rather than only one.

Message Passing
In a system in which threads interact by exchanging messages, rather than by sharing variables, synchronization is generally implicit in the send and receive methods. A receive method typically blocks until an appropriate message is available (a matching send has been performed). Blocking semantics for send methods vary from one system to another:

Asynchronous send – In some systems, a sender continues execution immediately after invoking a send method, and the underlying system takes responsibility for delivering the message. While often desirable, this behavior complicates the delivery of failure notifications, and may be limited by finite buffering capacity.

Synchronous send – In other systems – notably those based on Hoare's Communicating Sequential Processes (CSP) [13] – a sender waits until its message has been received.

Remote-invocation send – In yet other systems, a send method has both ingoing and outgoing parameters; the sender waits until a reply is received from its peer.

Distributed Locking
Libraries, languages, and applications commonly implement higher-level distributed locks or transactions on top of message passing. The most common lock implementation is analogous to the MCS lock: acquire requests are sent to a lock manager thread. If the lock is available, the manager responds directly; otherwise it forwards the request to the last thread currently waiting in line. The release method sends a message to the manager or, if a forwarding request has already been received, to the next thread in line for the lock. Races in which the manager forwards a request at the same time the last lock holder sends it a release are trivially resolved by statically choosing one of the two (perhaps the lock holder) to inform the next thread in line. Distributed transaction systems are substantially more complex.

Rendezvous and Remote Procedure Call
In some systems, a message must be received explicitly by an already existing thread. In other systems, a thread is created by the underlying system to handle each arriving message. Either of these options – explicit or implicit receipt – can be paired with any of the three send options described above. The combination of remote-invocation send with implicit receipt is often called remote procedure call (RPC). The combination of remote-invocation send with explicit receipt is known as rendezvous.

Interestingly, if all shared data is encapsulated in monitors, one can model – or implement – each monitor with a manager thread that executes entry calls one at a time. Each such call then constitutes a rendezvous between the sender and the monitor.

Related Entries
Actors
Cache Coherence
Concurrent Collections Programming Model
Deadlocks
Memory Models
Monitors, Axiomatic Verification of
Non-Blocking Algorithms
Path Expressions
Processes, Tasks, and Threads
Race Conditions
Scheduling Algorithms
Shared-Memory Multiprocessors
Transactions, Nested

Bibliographic Notes
The study of synchronization began in earnest with Dijkstra's "Cooperating Sequential Processes" monograph of 1965 [5]. Andrews and Schneider provide an excellent survey of synchronization mechanisms circa 1983 [2]. Mellor-Crummey and Scott describe and compare a variety of busy-wait spin locks and barriers, and introduce the MCS lock [17]. More extensive coverage of synchronization can be found in Scott's programming languages text [18], or in the recent texts of Herlihy and Shavit [11] and Taubenfeld [20].

Bibliography
1. Anderson TE (Jan 1990) The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans Parallel Distrib Syst 1(1)
2. Andrews GR, Schneider FB (Mar 1983) Concepts and notations for concurrent programming. ACM Comput Surv 15(1)
3. Auslander MA, Edelsohn DJ, Krieger OY, Rosenburg BS, Wisniewski RW. Enhancement to the MCS lock for increased functionality and improved programmability. U.S. patent application
4. Craig TS (Feb 1993) Building FIFO and priority-queueing spin locks from atomic swap. Technical report, University of Washington Computer Science Department
5. Dijkstra EW (Sept 1965) Cooperating sequential processes. Technical report, Technological University, Eindhoven, The Netherlands. Reprinted in Genuys F (ed) Programming Languages, Academic Press, New York, 1968. Also available at www.cs.utexas.edu/users/EWD/transcriptions/EWDxx/EWD.html
6. Gupta R (Apr 1989) The fuzzy barrier: a mechanism for high speed synchronization of processors. In: Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, MA
7. Gupta R, Hill CR (June 1989) A scalable implementation of barrier synchronization using an adaptive combining tree. Int J Parallel Program
8. He B, Scherer WN III, Scott ML (Dec 2005) Preemption adaptivity in time-published queue-based spin locks. In: Proceedings of the 2005 International Conference on High Performance Computing, Goa, India
9. Herlihy MP (Jan 1991) Wait-free synchronization. ACM Trans Program Lang Syst 13(1)
10. Herlihy MP, Moss JEB (May 1993) Transactional memory: architectural support for lock-free data structures. In: Proceedings of the 20th International Symposium on Computer Architecture, San Diego, CA
11. Herlihy MP, Shavit N (2008) The Art of Multiprocessor Programming. Morgan Kaufmann, Burlington, MA
12. Hoare CAR (Oct 1974) Monitors: an operating system structuring concept. Commun ACM 17(10)
13. Hoare CAR (Aug 1978) Communicating sequential processes. Commun ACM 21(8)
14. Kontothanassis LI, Wisniewski RW, Scott ML (Feb 1997) Scheduler-conscious synchronization. ACM Trans Comput Syst 15(1)
15. Lampson BW, Redell DD (Feb 1980) Experience with processes and monitors in Mesa. Commun ACM 23(2)
16. Magnussen P, Landin A, Hagersten E (Apr 1994) Queue locks on cache coherent multiprocessors. In: Proceedings of the 8th International Parallel Processing Symposium, Cancun, Mexico
17. Mellor-Crummey JM, Scott ML (Feb 1991) Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans Comput Syst 9(1)
18. Scott ML (2009) Programming Language Pragmatics, 3rd edn. Morgan Kaufmann, Burlington, MA
19. Scott ML, Mellor-Crummey JM (Aug 1994) Fast, contention-free combining tree barriers. Int J Parallel Program 22(4)
20. Taubenfeld G (2006) Synchronization Algorithms and Concurrent Programming. Prentice Hall, Upper Saddle River