Scalable Join Patterns
Aaron Turon, Northeastern University ([email protected])
Claudio V. Russo, Microsoft Research ([email protected])
Abstract

Coordination can destroy scalability in parallel programming. A comprehensive library of scalable synchronization primitives is therefore an essential tool for exploiting parallelism. Unfortunately, such primitives do not easily combine to yield solutions to more complex problems. We demonstrate that a concurrency library based on Fournet and Gonthier's join calculus can provide declarative and scalable coordination. By declarative, we mean that the programmer needs only to write down the constraints of a coordination problem, and the library will automatically derive a correct solution. By scalable, we mean that the derived solutions deliver robust performance both as the number of processors increases and as the complexity of the coordination problem grows. We validate our claims empirically on seven coordination problems, comparing our generic solution to specialized algorithms from the literature.

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs and Features—Concurrent programming structures

General Terms: Algorithms, Languages, Performance

Keywords: message passing, concurrency, parallelism

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
OOPSLA'11, October 22–27, 2011, Portland, Oregon, USA.
Copyright © 2011 ACM 978-1-4503-0940-0/11/10. . . $10.00

1. Introduction

Parallel programming is the art of keeping many processors busy with real work. But except for embarrassingly-parallel cases, solving a problem in parallel requires coordination between threads, which entails waiting. When coordination is unavoidable, it must be carried out in a way that minimizes both waiting time and interprocessor communication. Effective implementation strategies vary widely, depending on the coordination problem. Asking an application programmer to grapple with these concerns—without succumbing to concurrency bugs—is a tall order.

The proliferation of specialized synchronization primitives is therefore no surprise. For example, java.util.concurrent [13] contains a rich collection of carefully engineered classes, including various kinds of locks, barriers, semaphores, count-down latches, condition variables, exchangers and futures, together with nonblocking collections. Several of these classes led to research publications [14, 15, 26, 28]. A Java programmer faced with a coordination problem covered by the library is therefore in great shape.

Inevitably, though, programmers are faced with new problems not directly addressed by the primitives. The primitives must then be composed into a solution, and doing so correctly and scalably can be as difficult as designing a new primitive.

Take the classic Dining Philosophers problem [3], in which philosophers sitting around a table must coordinate their use of the chopstick sitting between each one; such competition over limited resources appears in many guises in real systems. The problem has been thoroughly studied, and there are solutions using primitives like semaphores that perform reasonably well. There are also many natural "solutions" that do not perform well—or do not perform at all. Naive solutions may suffer from deadlock, if for example each philosopher picks up the chopstick to their left and then finds the one to their right taken. Correct solutions may scale poorly with the number of philosophers (threads). For example, using a single global lock to coordinate philosophers is correct, but will force non-adjacent philosophers to take turns through the lock, adding unnecessary sequentialization. Avoiding these pitfalls takes experience and care.

In this paper, we demonstrate that Fournet and Gonthier's join calculus [4, 5] provides the basis for a declarative and scalable coordination library. By declarative, we mean that the programmer needs only to write down the constraints for a coordination problem, and the library will automatically derive a correct solution. By scalable, we mean that the derived solutions deliver robust, competitive performance both as the number of processors or cores increases, and as the complexity of the coordination problem grows.

Figure 1 shows a solution to Dining Philosophers using our library, which is a drop-in replacement for Russo's C# JOINS library [24]:

```csharp
var j = Join.Create();
Synchronous.Channel[] hungry;
Asynchronous.Channel[] chopstick;
j.Init(out hungry, n); j.Init(out chopstick, n);
for (int i = 0; i < n; i++) {
  var left = chopstick[i];
  var right = chopstick[(i+1) % n];
  j.When(hungry[i]).And(left).And(right).Do(() => {
    eat(); left(); right(); // replace chopsticks
  });
}
```

Figure 1. Dining Philosophers, declaratively

The library uses the message-passing paradigm of the join calculus. For Dining Philosophers, we use two arrays of channels (hungry and chopstick) to carry value-less messages; being empty, these messages represent unadorned events. The declarative aspect of this example is the join pattern starting with j.When. The declaration says that when events are available on the channels hungry[i], left, and right, they may be simultaneously and atomically consumed. When the pattern fires, the philosopher, having obtained exclusive access to two chopsticks, eats and then returns the chopsticks. In neither the join pattern nor its body is the order of the chopsticks important. The remaining details are explained in §2.

Most implementations of join patterns, including Russo's, use coarse-grained locks to achieve atomicity, resulting in poor scalability (as we show experimentally in §4). Our contribution is a new implementation of the join calculus that uses ideas from fine-grained concurrency to achieve scalability on par with custom-built synchronization primitives:

• We first recall how join patterns can be used to solve a wide range of coordination problems (§2), as is well-established in the literature [2, 4, 5]. The examples provide basic implementations of some of the java.util.concurrent classes mentioned above. In each case, the JOINS-based solution is as straightforward to write as the one for dining philosophers.

• To maximize scalability, we must allow concurrent, independent processing of messages, avoid centralized contention, and spin or block intelligently. This must be done, of course, while guaranteeing atomicity and progress for join patterns. We meet these challenges by (1) using lock-free [9] data structures to store messages and (2) treating messages as resources that threads race to claim (§3). We present our algorithm incrementally, both through diagrams and through working C# code (§3).

• We validate our scalability claims experimentally on seven different coordination problems (§4). For each coordination problem we evaluate a joins-based implementation running in both Russo's lock-based library and our new fine-grained library. We compare these results to the performance of direct, custom-built solutions. In all cases, our new library scales significantly better than Russo's. We scale competitively with—sometimes better than—the custom-built solutions, though we suffer from higher constant-time overheads in some cases.

• We state and sketch proofs for the key safety and liveness properties characterizing our algorithm (§5).

Our goal is not to replace libraries like java.util.concurrent, but rather to complement them, by exposing some of their insights in a way easily and safely usable by application programmers. We discuss this point further, along with related work in general, in the final section.

2. The JOINS library

Russo's JOINS [24] is a concurrency library derived from Fournet and Gonthier's join calculus [4, 5], a process algebra similar to Milner's π-calculus but conceived for efficient implementation. In this section, we give an overview of the library API and illustrate it using examples drawn from the join calculus literature [2, 4, 5].

The join calculus takes a message-passing approach to concurrency: messages are sent over channels, and channels are themselves first-class values that can be sent as messages. What makes the calculus interesting is the way messages are received. Programs do not actively request to receive messages from a channel. Instead, they employ join patterns (also called chords [2]) to declaratively specify reactions to message arrivals. The power of join patterns lies in their ability to respond atomically to messages arriving simultaneously on several different channels.

Suppose, for example, that we have two channels, Put and Get, used by producers and consumers of data. When a producer and a consumer message are available, we would like to receive both simultaneously, and transfer the produced value to the consumer. Using Russo's Joins API, we write:

```csharp
class Buffer ...
```

This simple example introduces many aspects of the API. First, we are using two different kinds of channels: Put is an asynchronous channel that carries messages of type T, while Get is a synchronous channel that returns T replies but takes no argument. A sender never blocks on an asynchronous channel, even if the message cannot immediately be received through a join pattern. For the Buffer class, that means that a single producer may send many Put messages, even if none of them are immediately consumed. Because Get is a synchronous channel, on the other hand, senders will block until a pattern involving it can fire.

```csharp
Asynchronous.Channel Fst;
Asynchronous.Channel Snd;
Synchronous ...
```

The pattern will consume messages Fst(a), Snd(b) and Both() atomically, when all three are available. Only the first two messages carry arguments, so the body of the pattern takes two arguments. Its return value, a pair, becomes the return value of the call to Both(). On the other hand, several patterns may mention the same channel, expressing choice:
Figure 6. Sending a message
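The matching semantics of join patterns can be made concrete with a small executable model. The sketch below is ours, not Russo's library: a single-threaded Python analogue in which each channel keeps a bag of pending messages and a pattern fires as soon as every channel it mentions is nonempty, consuming one message from each. Like the paper's simplified Figure 5 code, it ignores patterns that repeat a channel.

```python
# Our single-threaded model of join-pattern matching (illustrative
# names; not the C# JOINS API). A pattern fires when every channel it
# mentions has a pending message, consuming one from each atomically.

class Channel:
    def __init__(self, name):
        self.name = name
        self.bag = []          # pending messages

class Join:
    def __init__(self):
        self.patterns = []     # (channels, body) pairs

    def when(self, channels, body):
        self.patterns.append((channels, body))

    def send(self, chan, msg=None):
        chan.bag.append(msg)
        return self.try_fire()

    def try_fire(self):
        for channels, body in self.patterns:
            if all(c.bag for c in channels):
                # consume one message per channel, pass them to the body
                args = [c.bag.pop(0) for c in channels]
                return body(*args)
        return None            # no pattern matchable yet

# A buffer in the style of the Put/Get example: when a Get and a Put
# are both available, hand the produced value to the consumer.
j = Join()
put, get = Channel("Put"), Channel("Get")
j.when([get, put], lambda _event, value: value)
```

Sending on `put` with no consumer simply queues the message (the asynchronous sender never blocks in this model); a later send on `get` joins with it and returns the value.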
ensuring that only one thread will succeed in claiming a given message. If at any point we fail to claim a message, we roll back all of the messages claimed so far (lines 23–24). The implementation ensures that the Chans arrays for each chord are ordered consistently, so that in any race at least one thread entering the second phase will complete the phase successfully (§5).

Note that the code in Figure 5 is a simplified version of our implementation that does not handle patterns with repeated channels, and does not stack-allocate or recycle message arrays. These differences are discussed in §3.6.

3.4 Firing, blocking and rendezvous

Message resolution does not depend on the (a)synchrony of a channel, but the rest of the message-sending process does. The code for sending messages is shown in Figure 6, with separate entry points AsyncSend and SyncSend. The actions taken while sending depend, in part, on the result of message resolution:

```
Send         We CLAIMED       They CONSUMED          No match
Async (AC)   Spawn (10)       Return (5)             Return (5)
Async (SC)   Wake (13–20)     Return (5)             Return (5)
Sync         Fire (35)        Wait for result (28)   Block (28)
```

where AC and SC, respectively, stand for asynchronous chord (no synchronous channels) and synchronous chord (at least one synchronous channel).

First we follow the path of an asynchronous message, which begins by adding and resolving the message (lines 2–3). If either the message was CONSUMED by another thread (in which case that thread is responsible for firing the chord) or no pattern is matchable, we immediately exit (line 5). In both of these cases, we are assured that any chords involving the message will be (or have been) dealt with elsewhere, and since the message is itself asynchronous, we need not wait for this to occur.

On the other hand, if we resolved the message by claiming it and enough other messages to fire a chord, we proceed by consuming all involved messages (which does not require a memory fence). If the chord's pattern involved only asynchronous channels (line 10), we spawn a new thread to execute the chord body asynchronously. Otherwise we must communicate with a synchronous sender, telling it to execute the body.

A key difference from asynchronous senders is that synchronous sending should not return until a relevant chord has fired. Moreover, synchronous chord bodies should be executed by a synchronous caller, rather than by a newly-spawned thread. Since multiple synchronous callers can be combined in a single chord, exactly one of them should be chosen to execute the chord, and then share the result with (and wake up) all the others (we call this "rendezvous").

Each synchronous message has a Signal associated with it. Signals provide methods Block and Set, allowing synchronous senders to block (a sender spins a bit first; see §3.6) and be woken. Calls to Set atomically trigger the signal. If a thread has already called Block, it is awoken and the signal is reset. Otherwise, the next call to Block will immediately return, again resetting the signal. We ensure that Block and Set are each called by at most one thread; the implementation ensures that waking only occurs as a result of triggering the signal (no "spurious wakeups").

If an asynchronous sender consumed a synchronous sender's message (line 6), the synchronous sender will be blocked waiting to be informed of this event (line 28). The asynchronous sender chooses one such synchronous sender to wake (lines 15–18), telling it which chord and messages to fire (line 16).

Now we consider synchronous senders. Just as before, we first add and resolve the message (lines 24–25). If the message was resolved by claiming enough additional messages to fire a chord, all relevant messages are immediately consumed (line 32). Otherwise, the sender must block (line 28). There are two ways a blocked, synchronous sender can be woken: by an asynchronous sender or by another synchronous sender (rendezvous). In the former case, the (initially null) ShouldFire field will contain a match that the synchronous caller is responsible for firing. In the latter case, ShouldFire remains null, but the Result field will contain the result of a chord body as executed by another synchronous sender, which is immediately returned (line 30).

If a synchronous sender does execute a chord body (line 35), it must then wake up any other synchronous senders involved in the chord and inform them of the result (lines 36–42). For simplicity, we ignore the possibility that the chord body raises an exception, but proper handling is easy to add and is addressed by the benchmarked implementation.

3.5 Lazy queuing, counter channels, and message stealing

To achieve competitive scalability, we make three enhancements to the above implementation. The importance of these enhancements is shown empirically in §4.

First, we do not always need to allocate or enqueue a message when sending. For example, in the Lock class, when sending an Acquire message we could first check whether a corresponding Release message is available, and if so, immediately claim and consume it without ever touching the Acquire bag. This shortcut saves both on allocation and potentially on interprocessor communication. More generally, we provide an optimistic fast path that attempts to immediately match the remaining messages to fire a chord. We call this lazy queuing. Implementing lazy queuing is relatively straightforward, so we do not provide the code here.

The second enhancement applies to void, asynchronous channels (e.g. Lock.Release). Sophisticated lock-based implementations of join patterns typically optimize the representation of such channels to a simple counter, neatly avoiding the cost of allocation for messages used just as signals [2]. We have implemented a similar optimization, adapted to suit our protocol. The main challenge in employing the counter representation is that, in our fine-grained protocol, it must be possible to tentatively decrement the counter (the analog of claiming a message) in such a way that other threads do not incorrectly assume the message has actually been consumed. Our approach is to represent void, asynchronous message bags as a word-sized pair of half-words, separately counting claimed and pending messages. Implementations of, for example, Chan.AddPending and Msg.TryClaim are specialized to atomically update the shared-state word by CASing in a classic read-modify-try-update loop. For example, we claim a "message" as follows:

```csharp
bool TryClaim() {
  uint startState = chan.state; // shared state
  uint curState = startState;
  while (true) {
    startState = curState;
    ushort claimed;
    ushort pending = Decode(startState, out claimed);
    if (pending > 0) {
      var nextState = Encode(++claimed, --pending);
      curState = CAS(ref chan.state,
                     comparand: startState,
                     value: nextState);
      if (curState == startState)
        return true;
    } else return false;
  }
}
```

More importantly, Chan.FindPending no longer needs to traverse a data structure but can merely atomically read the bag's encoded state once, setting sawClaimed if the claimed count is non-zero. While the counter representation avoids allocation, it does lead to more contention over the same shared state (compared with a proper bag). It also introduces the possibility of overflow, which we ignored here. Nevertheless, we have found it to be beneficial in practice (§4), especially for non-singleton resources like Semaphore.Release messages.

The final enhancement is only relevant to synchronous channels. When an asynchronous sender matches a synchronous chord, it consumes all the relevant messages, and then wakes up one of the synchronous senders. If the synchronous caller is actually blocked—so that waking requires a context switch—significant time may elapse before the chord is actually fired. Since we do not provide a fairness guarantee, we can instead permit "stealing": we wake up one synchronous caller, but roll back the rest of the messages to PENDING status, putting them back up for grabs by currently-active threads—including the thread that just sent the asynchronous message. In low-traffic cases, messages are unlikely to be stolen; in high-traffic cases, stealing is likely to lead to better throughput. This strategy is similar to the one taken in Polyphonic C# [2], as well as the "barging" allowed by the java.util.concurrent synchronizer framework [15].

Some care must be taken to ensure that our key liveness property still holds: when an asynchronous message wakes a synchronous sender, it moves from a safely resolved state (CLAIMED as part of a chord) to an unresolved state (PENDING). There is no guarantee that the woken synchronous sender will be able to fire a chord involving the original asynchronous message (see [2] for an example). Yet AsyncSend simply returns to its caller. We must somehow ensure that the original asynchronous message is successfully resolved. Thus, when a synchronous sender is woken, we record the asynchronous message that woke it, transferring responsibility for the message.

The new code for an asynchronous sender notifying a synchronous waiter is shown in Figure 7; it replaces lines 12–20 of AsyncSend.

```csharp
 1 bool foundSleeper = false;
 2 for (int i = 0; i < m.Chord.Chans.Length; i++) {
 3   if (m.Chord.Chans[i].IsSync && !foundSleeper) {
 4     foundSleeper = true;
 5     m.Claims[i].AsyncWaker = myMsg;      // hand over msg
 6     m.Claims[i].Status = Stat.WOKEN;
 7     m.Claims[i].Signal.Set();            // assumed fence
 8   } else {
 9     m.Claims[i].Status = Stat.PENDING;   // relinquish
10   }
11 }
```

Figure 7. Waking a synchronous sender while allowing stealing; replaces lines 12–20 of AsyncSend

We introduce a new status, WOKEN. A synchronous message is marked WOKEN if an asynchronous sender is transferring responsibility, and CONSUMED if a synchronous sender is going to fire a chord involving it. In both cases, the signal is set after the status is changed; in the latter case, it is set after the chord is actually fired and the return value is available. Resolve is revised to return null at any point that the message is seen at status WOKEN or CONSUMED. The WOKEN status ensures that a blocked synchronous caller is woken only once, which is important both for our use of Signal, and to ensure that the synchronous sender will only be responsible for one waking asynchronous message. The message field AsyncWaker replaces ShouldFire, and is used to inform the woken thread which asynchronous message woke it up.

Figure 8 gives the revised code for sending a synchronous message in the presence of stealing; it is meant to replace lines 27–33 of the original implementation.

```csharp
 1 Msg waker = null;
 2 var backoff = new Backoff();
 3 while (true) {
 4   m = Resolve(myMsg);
 5
 6   if (m != null) break;        // claimed a chord; exit
 7   if (waker != null)
 8     RetryAsync(waker);         // retry last async waker
 9
10   myMsg.Signal.Block();
11
12   if (myMsg.Status == Stat.CONSUMED) {
13     return myMsg.Result;       // rendezvous
14   } else {                     // status is Stat.WOKEN
15     waker = myMsg.AsyncWaker;  // will retry later
16     myMsg.Status = Stat.PENDING;
17     backoff.Once();
18   }
19 }
20
21 ConsumeAll(claims);
22 // retry last async waker, *after* consuming myMsg
23 if (waker != null) RetryAsync(waker);
```

Figure 8. Sending a synchronous message while coping with stealing; replaces lines 27–33 of SyncSend

A synchronous sender loops until its message is resolved by claiming a chord (exit on line 6) or by another thread consuming it (exit on line 13). In each iteration of the loop, the sender blocks (line 10); even if its message has already been CONSUMED, it must wait for the signal to get its return value. In the case that the synchronous sender is woken by an asynchronous message (lines 15–17), it records the waking message and ultimately tries once more to resolve its own message. We perform exponential backoff every time this happens, since continually being awoken only to find messages stolen indicates high traffic.

After every resolution of the synchronous sender's message (myMsg), the sender retries sending the last asynchronous message that woke it, if any (lines 7–8, 22). Doing so fulfills the liveness requirements outlined above: the synchronous sender has become responsible for the message that woke it. We use RetryAsync, which is similar to AsyncSend but uses an already-added message rather than adding a new one. We are careful to avoid calling RetryAsync while myMsg is CLAIMED by the calling thread; doing so could result in the retry code forever waiting for us to revert or consume the claim. On the other hand, it is fine to retry the message even if it has already been successfully consumed as part of a chord; RetryAsync will simply exit in this case.

3.6 Pragmatics and extensions

There are a few smaller differences between the presented code and the actual implementation:

• We avoid boxing (allocation) and downcasts whenever possible: on .NET, additional polymorphism (beyond what the code showed) can help avoid uses of object.

• We do not allocate a fresh message array every time TryClaim is called; in fact, we do not heap-allocate message arrays at all. Instead, we stack-allocate an array in SyncSend and AsyncSend (stack-allocated arrays are not directly provided by .NET, so we use a custom value type built by polymorphic recursion) and reuse this array for every call to TryClaim.

• We handle nonlinear patterns, in which a single channel appears multiple times. The code is straightforward.

An important pragmatic point is that the Signal class first performs some spinwaiting before blocking. Spinning is performed on a memory location associated with the signal, so each spinning thread will wait on a distinct memory location whose value will only be changed when the thread should be woken, an implementation strategy long known to perform well on cache-coherent architectures [17]. The amount of spinning performed is determined adaptively on a per-thread, per-channel basis.

We expect it to be straightforward to add timeouts and nonblocking attempts for synchronous sends to our implementation, because we can always use CAS to consume a message we have added to a bag to abort a send (which will, of course, also discover if the send has already succeeded).

Finally, to add channels with ordering constraints, one needs only use a queue or stack rather than a bag. Switching from bags to fair queues and disabling message stealing yields per-channel fairness for joins. In Dining Philosophers, for example, queues will guarantee that requests from waiting philosophers are fulfilled before those of philosophers that have just eaten. Such guarantees come at the cost of significantly decreased parallelism, since they entail sequential matching of join patterns. At an extreme, programmers can enforce a round-robin scheme for matching patterns using an additional internal channel [25].

4. Performance

In this section we provide empirical support for the claims that our implementation is (1) scalable and (2) competitive with purpose-built solutions.

4.1 Methodology

To evaluate our implementation, we constructed a series of microbenchmarks for seven classic coordination problems: dining philosophers, producers/consumers, mutual exclusion, semaphores, barriers, rendezvous, and reader-writer locking. Our solutions for these problems are fully described in the first two sections of the paper. They cover a spectrum of shapes and sizes of join patterns. In some cases (producer/consumer, locks, semaphores, rendezvous) the size and number of join patterns stays fixed as we increase the number of processors, while in others a single pattern grows in size (barriers) or there is an increasing number of fixed-size patterns (philosophers).

Each benchmark follows standard practice for evaluating synchronization primitives: we repeatedly synchronize, for a total of k synchronizations between n threads [17]. We use k ≥ 100,000 and average over three trials for all benchmarks. To test interaction with thread-scheduling preemption, we let n range up to 96—twice the 48 cores in our benchmarking machine.

Each benchmark has two variants for measuring different aspects of synchronization:

PARALLEL SPEEDUP: In the first variant, we simulate doing a small amount of work between synchronization episodes (and during the critical section, when appropriate). By performing some work, we can gauge to what extent a synchronization primitive inhibits or allows parallel speedup. By keeping the amount of work small, we gauge in particular speedup for fine-grained parallelism, which presents the most challenging case for scalable coordination.

PURE SYNCHRONIZATION: In the second variant, we synchronize in a tight loop, yielding the cost of synchronization in the limiting case where the actual work is negligible. In addition to providing some data on constant-time overheads, this variant serves as a counterpoint to the previous one: it ensures that scalability problems were not hidden by doing too much work between synchronization episodes. Rather than looking for speedup, we are checking for slowdown.

To simulate work, we use .NET's Thread.SpinWait method, which spins in a tight loop for a specified number of times (and ensures that the spinning will not be optimized away). To make the workload a bit more realistic—and to avoid "lucky" schedules—we randomize the number of spins between each synchronization, which over 100,000 iterations will yield a normal distribution of total work with very small standard deviation. We ensure that the same random seeds are provided across trials and compared algorithms, so we always compare the same amount of total work. The mean spin counts are determined per-benchmark and given in the next section.

For each problem we compare performance between:

• a join-based solution using our fully-optimized implementation (S-Join, for "scalable joins"),
• a join-based solution using Russo's library (L-Join, for "lock-based joins"),
• at least one purpose-built solution from the literature or the .NET libraries (label varies), and
• when relevant, our implementation with some or all optimizations removed, to demonstrate the effect of the optimization (e.g., S-J w/o S,C drops the Stealing and Counter optimizations; S-J w/o L drops the Lazy queuing optimization).

We detail the purpose-built solutions below.

Figure 9. Speedup on simulated fine-grained workloads: throughput versus threads
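The two-phase discipline described here (claim one pending message per channel of a chord, in the chord's fixed channel order, rolling every claim back if any channel comes up empty, then consume the whole set) can be sketched as a small executable model. The Python below is ours and models only the status transitions, not the lock-free atomics or the racing threads.

```python
# Our sketch of two-phase claiming with rollback: a channel is a list
# of messages, each message carries a status. Phase one CLAIMs one
# PENDING message per channel in a consistent order; any failure rolls
# claims back to PENDING. Phase two marks the claimed set CONSUMED.

PENDING, CLAIMED, CONSUMED = "PENDING", "CLAIMED", "CONSUMED"

class Msg:
    def __init__(self):
        self.status = PENDING

def find_pending(chan):
    for m in chan:
        if m.status == PENDING:
            return m
    return None

def try_claim_chord(chans):
    claimed = []
    for chan in chans:                 # same order for every thread
        m = find_pending(chan)
        if m is None:                  # no match: roll back all claims
            for c in claimed:
                c.status = PENDING
            return None
        m.status = CLAIMED
        claimed.append(m)
    for m in claimed:                  # second phase: consume the set
        m.status = CONSUMED
    return claimed
```

The consistent channel ordering is the point of the exercise: when two threads race over overlapping chords, at least one of them finds every message it needs still PENDING and completes the second phase.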
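The Signal contract stated here (Set and Block each called by at most one thread; a Set before Block makes the next Block return immediately, resetting the signal; no spurious wakeups) is precisely the behavior of a binary semaphore. The Python sketch below is our reading of that contract, not the library's code, and it omits the adaptive spinning discussed in §3.6.

```python
# Our model of the Signal contract via a semaphore initialized to 0:
# set() triggers the signal; block() returns immediately if the signal
# was already triggered (resetting it), and otherwise waits for set().

import threading

class Signal:
    def __init__(self):
        self._sem = threading.Semaphore(0)

    def set(self):
        self._sem.release()    # wake the (at most one) blocked sender

    def block(self):
        self._sem.acquire()    # consumes the trigger, i.e. resets it
```

With a single setter and a single blocker, the semaphore count is only ever 0 or 1, which is what makes "woken exactly once, reset on wake" hold.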
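The half-word counter protocol is easy to model executably. In the Python sketch below, `encode`, `decode`, and the CAS cell are our stand-ins for the C# helpers in the listing; a real implementation CASes a single shared machine word, whereas here atomicity is only simulated.

```python
# Our model of the claimed/pending counter word: the high half-word
# counts claimed "messages", the low half-word counts pending ones.
# try_claim tentatively moves one message from pending to claimed.

class Cell:
    def __init__(self, v=0):
        self.v = v
    def cas(self, expect, new):
        # simulated compare-and-swap: returns the value actually seen
        old = self.v
        if old == expect:
            self.v = new
        return old

def encode(claimed, pending):
    return (claimed << 16) | pending

def decode(state):
    return state >> 16, state & 0xFFFF   # (claimed, pending)

def try_claim(chan):
    cur = chan.v
    while True:                          # read-modify-try-update loop
        start = cur
        claimed, pending = decode(start)
        if pending == 0:
            return False                 # nothing to claim
        cur = chan.cas(start, encode(claimed + 1, pending - 1))
        if cur == start:
            return True                  # our update won the race
```

Because a claim only moves a count from pending to claimed, other threads see that a message is spoken for without concluding it has been consumed, which is the property the prose asks for.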
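The lazy-queuing fast path for the Lock example can be illustrated with a toy model. The sketch below is ours and is single-threaded: it shows only the shape of the optimization (try to consume a pending Release before allocating and enqueuing an Acquire message), not the claiming races of the real protocol.

```python
# Our toy model of lazy queuing for a join-based lock: acquire() first
# tries the fast path against the Release counter; only on failure does
# it enqueue an Acquire "message". release() prefers a queued waiter.

class LockModel:
    def __init__(self):
        self.release_count = 1   # lock starts free: one Release pending
        self.acquire_queue = []  # Acquire messages we had to enqueue

    def acquire(self, who):
        if self.release_count > 0:       # fast path: no allocation,
            self.release_count -= 1      # no touch of the Acquire bag
            return True
        self.acquire_queue.append(who)   # slow path: enqueue and wait
        return False

    def release(self):
        if self.acquire_queue:           # hand the lock to a waiter
            self.acquire_queue.pop(0)
            return
        self.release_count += 1          # otherwise record the Release
```

The fast path is what saves allocation and interprocessor traffic in the common uncontended case; under contention the model degenerates to ordinary queuing.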
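The seeding discipline of §4.1 is worth making concrete: with a fixed seed, every trial and every compared algorithm draws the identical sequence of spin counts, so the total simulated work is exactly the same across comparisons. The Python sketch below is ours; the uniform distribution and the mean of 200 are illustrative choices, not figures from the paper.

```python
# Our sketch of seeded workload randomization (a stand-in for calls to
# .NET's Thread.SpinWait): spin counts are drawn uniformly around a
# per-benchmark mean from a fixed-seed generator, so compared runs
# perform identical total work, while per-iteration work still varies.

import random

def spin_counts(seed, iterations, mean):
    rng = random.Random(seed)            # fixed seed => same schedule
    # uniform in [mean/2, 3*mean/2]; over many iterations the total
    # concentrates tightly around iterations * mean
    return [rng.randint(mean // 2, 3 * mean // 2)
            for _ in range(iterations)]
```

Replaying the same seed for each algorithm under test is what makes the throughput numbers directly comparable.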
[Figure 9 comprises eight panels, one per benchmark with simulated work: Philosophers, Producer/consumer, Lock, Semaphore, Rendezvous, Barrier, RWLock (50/50), and RWLock (75/25). Each panel plots throughput (iterations per 10µs; bigger is better) against threads (6 to 96 on the 48-core machine) for S-Join, L-Join, the relevant purpose-built competitors (Dijkstra, .NET-queue, .NET-bag, .NET, .NET-Slim, .NET-spin, Exchanger, S-Join-Tree), and de-optimized variants (S-J w/o C,L; S-J w/o C,S; S-J w/o L; S-J w/o C).]
Figure 10. Pure synchronization performance: throughput versus threads
[Figure 10 comprises the corresponding eight pure-synchronization panels (no simulated work): Philosophers, Producer/consumer, Lock, Semaphore, Rendezvous, Barrier, RWLock (50/50), and RWLock (75/25), with the same axes and series as Figure 9.]
Two benchmarks (rendezvous and barriers) required extending Russo's library to support multiple synchronous channels in a pattern; in these cases, and only in these cases, we use a modified version of the library.

Our benchmarking machine has four AMD "Magny-Cours" Opteron 6100 Series 1.7GHz processors, with 12 cores each (for a total of 48 cores), 32GB RAM, and runs Windows Server 2008 R2 Datacenter. All benchmarks were run under the 64-bit CLR.

4.2 The benchmarks

The results for all benchmarks appear in Figure 9 (for parallel speedup) and Figure 10 (for pure synchronization). The axes are consistent across all graphs: the x-axis measures the number of threads, and the y-axis measures throughput (as iterations performed every 10µs). Larger y values reflect better performance.

For measuring parallel speedup, we used the following mean spin counts for simulated work:

Benchmark      In crit. section   Out of crit. section
Philosophers   25                 5,000
Prod/Cons      N/A                producer: 5,000; consumer: 500
Lock           50                 200
Semaphore      25                 100
Rendezvous     N/A                5,000
Barrier        N/A                10,000
RWLock         50                 200

With too little simulated work, there is no hope of speedup; with too much, the parallelism becomes coarse-grained and thus insensitive to the performance of synchronization. These counts were chosen to be high enough that at least one implementation showed speedup, and low enough to yield significant performance differences.

The particulars of the benchmarks are as follows, where n is the number of threads and k the total number of iterations (so each thread performs k/n iterations):

Philosophers Each of the n threads is a philosopher; the threads are arranged around a table. An iteration consists of acquiring and then releasing the appropriate chopsticks. We compare against Dijkstra's original solution, using a lock per chopstick, acquiring these locks in a fixed order, and releasing them in the reverse order.

Producer/consumer We let n/2 threads be producers and n/2 be consumers. Producers repeatedly generate trivial output and need not wait for consumers, while consumers repeatedly take and throw away that output. We compare against the .NET 4 BlockingCollection class, which transforms a nonblocking collection into one that blocks when attempting to extract an element from an empty collection. We wrap the BlockingCollection around the .NET 4 ConcurrentQueue class (a variant of Michael and Scott's classic lock-free queue [18]) and ConcurrentBag.

Lock An iteration consists of acquiring and then releasing a single, global lock. We compare against both the built-in .NET lock (which is a highly-optimized part of the CLR implementation itself) and System.Threading.SpinLock (which is implemented in .NET).

Semaphore We let the initial semaphore count be n/2; an iteration consists of acquiring and then releasing the semaphore. We compare to two .NET semaphores: the Semaphore class, which wraps kernel semaphores, and SemaphoreSlim, a faster, pure .NET implementation of semaphores.

Rendezvous The n threads perform a total of k synchronous exchanges as quickly as possible. Unfortunately, .NET 4 does not provide a built-in library for rendezvous, so we ported Scherer et al.'s exchanger [26] from Java; this is the exchanger included in java.util.concurrent.

Barriers An iteration consists of passing through the barrier. We show results for both the tree and the flat versions of the join-based barrier. We compare against the .NET 4 Barrier class, a standard sense-reversing barrier.

RWLock An iteration consists of (1) choosing at random whether to be a reader or a writer and (2) acquiring, and then releasing, the appropriate lock. We give results for 50-50 and 75-25 splits between readers and writers. We compare against two .NET implementations: the ReaderWriterLock class, which wraps the kernel RWLocks, and the ReaderWriterLockSlim class, which is a pure .NET implementation.

4.3 Analysis

The results of Figure 9 demonstrate that our scalable join patterns are competitive with—and can often beat—state-of-the-art custom libraries. Application programmers can solve coordination problems in the simple, declarative style we have presented here, and expect excellent scalability, even for fine-grained parallelism.

In evaluating benchmark performance, we are most interested in the slope of the throughput graph, which measures scalability with the number of cores (up to 48) and then scheduler robustness (from 48 to 96). In the parallel speedup benchmarks, both in terms of scalability (high slope) and absolute throughput, we see the following breakdown:

S-Join clear winner    Producer/consumer, Semaphore, Barrier
S-Join competitive     Philosophers, Lock, Rendezvous, RWLock 50/50
.NET clear winner      RWLock 75/25

The .NET concurrency library could benefit from replacing some of its primitives with ones based on the joins implementation we have shown—the main exception being locks. With further optimization, it may be feasible to build an entire scalable synchronization library around joins.

The performance of our implementation is mostly robust as we oversubscribe the machine. The Barrier benchmark is a notable exception, but this is due to the structure of the problem: every involved thread must pass through the barrier at every iteration, so at n > 48 threads, a context switch is required for every iteration. Context switches are very expensive in comparison to the small amount of work we are simulating.

Not all is rosy, of course: the pure synchronization benchmarks show that scalable join patterns suffer from constant-time overheads in some cases, especially for locks. The table below approximates the overhead of pure synchronization in our implementation compared to the best .NET solution, by dividing the scalable join pure synchronization time by the best .NET pure synchronization time:

Overhead compared to best custom .NET solution
n    Phil  Pr/Co  Lock  Sema  Rend  Barr  RWL
6    5.2   0.7    6.5   2.9   0.7   1.5   4.2
12   5.2   0.9    7.4   4.0   1.7   0.3   3.9
24   1.9   0.9    6.6   3.0   1.1   0.2   1.8
48   1.6   1.2    7.4   2.3   1.0   0.2   1.4
(n threads; smaller is better)

Overheads are most pronounced for benchmarks that use .NET's built-in locks (Philosophers, Lock). This is not surprising: .NET locks are mature and highly engineered, and are not themselves implemented as .NET code. Notice, too, that in Figure 10 the overhead of the spinlock (which is implemented within .NET) is much closer to that of scalable join patterns. In the philosophers benchmark, we are able to compensate for our higher constant factors by achieving better parallel speedup, even in the pure-synchronization version of the benchmark.

One way to decrease overhead, we conjecture, would be to provide compiler support for join patterns. Our library-based implementation spends some of its time traversing data structures representing user-written patterns. In a compiler-based implementation, these runtime traversals could be unrolled, eliminating a number of memory accesses and conditional control flow. Removing that overhead could put scalable joins within striking distance of the absolute performance of .NET locks. On the other hand, such an implementation would probably not allow the dynamic construction of patterns that we use to implement barriers.

In the end, constant overhead is trumped by scalability: for those benchmarks where the constant overhead is high, our implementation nevertheless shows strong parallel speedup when simulating work. The constant overheads are dwarfed by even the small amount of work we simulate. Finally, even for the pure synchronization benchmarks, our implementation provides competitive scalability, in some cases extracting speedup despite the lack of simulated work.

Effect of optimizations Each of the three optimizations discussed in §3.5 is important for achieving competitive throughput. Stealing tends to be most helpful for those problems where threads compete for limited resources (Philosophers, Lock), because it minimizes the time between resource release and acquisition, favoring threads that are in the right place at the right time (this is essentially the same observation Doug Lea made about barging for abstract synchronizers [15]). Lazy queuing improves constant factors across the board, in some cases (Producer/Consumer, Rendezvous) also aiding scalability. Finally, the counter representation provides a considerable boost for benchmarks, like Semaphore, in which the relevant channel often has multiple pending messages.

Performance of lock-based joins Russo's lock-based implementation of joins is consistently—often dramatically—the poorest performer for both parallel speedup and pure synchronization. On the one hand, this result is not surprising: the overserialization induced by coarse-grained locking is a well-known problem. On the other hand, Russo's implementation is fairly sophisticated in the effort it makes to shorten critical sections. The implementation includes all the optimizations proposed for Polyphonic C# [2], including a form of stealing, a counter representation for void-async channels (simpler than ours, since it is lock-protected), and bitmasks approximating the state of message queues for fast matching. Despite this sophistication, it is clear that lock-based joins do not scale.

We consider STM-based join implementations in §6.2.

5. Correctness

There are many ways to characterize correctness of concurrent algorithms. The most appropriate specification for our algorithm is something like the process-algebraic formulation of the join calculus [4]. In that specification, multiple messages are consumed—and a chord is fired—in a single step. A full proof that our implementation satisfies this specification is too much to present here, and we have not yet carried out such a rigorous proof. We have, however, identified what we believe are the key lemmas for performing such a proof. We take for granted that our bag implementation is linearizable [10] and lock-free [9]; roughly, this means that operations on bags are observably atomic and dead/livelock-free even under unfair scheduling. There are a pair of key properties—one safety, one liveness—characterizing the Resolve method:

Lemma 1 (Resolution Safety). Assume that msg has been inserted into a channel. If a subsequent call to Resolve(msg) returns, msg is in a resolved state; moreover, the return value correctly reflects how the message was resolved.

Lemma 2 (Resolution Liveness). Assume that threads are scheduled fairly. If a sender is attempting to resolve a message, eventually some message is resolved by its sender.

Recall that there are three ways a message can be resolved: it and a pattern's worth of messages can be marked CLAIMED by the calling thread; it can be marked CONSUMED by another thread; and it can be in an arbitrary status when it is determined that there are not enough messages prior to it to fire a chord.

Safety for the first two cases is fairly easy to show: we can assume that CAS works properly, and can see that

• Once a message is CLAIMED by a thread, the next change to its status is by that thread.
• Once a message is CONSUMED, its status never changes.

These facts mean that interference cannot "unresolve" a message that has been resolved in those three ways. The other fact we need to show is that the retry flag is only false if, indeed, no pattern is matched using only the message and messages that arrived before it. Here we use the linearizability assumptions about bags, together with the facts about the status flags just given.

Now we turn to the liveness property. Notice that a call to Resolve fails to return only if retry is repeatedly true. This can only happen as a result of messages being CLAIMED. We can prove, using the consistent ordering of channels during the claiming process, that if any thread reaches the claiming process (lines 33–42), some thread succeeds in claiming a pattern's worth of messages. The argument goes: claiming by one thread can fail only if claiming/consuming by another thread has succeeded, which means that the other thread has […]

[…] find no evaluation of its performance on parallel hardware. Benton, Cardelli and Fournet proposed an object-oriented version of join patterns for C# called Polyphonic C# [2]; around the same time, von Itzstein and Kearney independently described JoinJava [11], a similar extension of Java. The advent of generics in C# 2.0 led Russo to encapsulate join pattern constructs in the Joins library [24], which served as the basis for our library. There are also implementations for Erlang [22], C++ [16], and VB [25].

All of the above implementations use coarse-grained locking to achieve the atomicity present in the join calculus semantics. In some cases (e.g. Polyphonic C#, Russo's library) significant effort is made to minimize the critical section, but as we have shown (§4), coarse-grained locking remains an impediment to scalability.

By implementing joins as a library we forgo some expressivity, not just static checking. JoCaml, for example, supports restricted polymorphic sends: the type of a channel can be generalized in those type variables that do not appear in the types of other, conjoined channels [6]. Since our channels are monomorphic C# delegates, we are, unfortunately, unable to capture that polymorphism. Nevertheless, one can still express a wide range of useful generic abstractions (e.g. Buffer […]).
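For concreteness, the coarse-grained locking strategy shared by these earlier implementations can be sketched in miniature. The sketch below is Java rather than C#, uses hypothetical names (CoarseJoin, when, send), and is not the code of any cited library: a single lock guards every channel's queue, and each send scans the registered patterns for an enabled chord while holding that lock.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical miniature of a coarse-grained lock-based join library:
// one global lock serializes every channel operation and every match.
final class CoarseJoin {
    static final class Channel {
        final Queue<Object> pending = new ArrayDeque<>();
    }

    static final class Pattern {
        final List<Channel> channels;
        final Runnable body;
        Pattern(List<Channel> channels, Runnable body) {
            this.channels = channels;
            this.body = body;
        }
    }

    private final Object lock = new Object();
    private final List<Pattern> patterns = new ArrayList<>();

    Channel newChannel() { return new Channel(); }

    void when(List<Channel> chans, Runnable body) {
        synchronized (lock) { patterns.add(new Pattern(chans, body)); }
    }

    // Send a message on c; if some pattern now has a pending message on
    // every one of its channels, consume one message per channel and
    // fire the body. The whole match runs under the single global lock.
    void send(Channel c, Object msg) {
        Runnable toRun = null;
        synchronized (lock) {
            c.pending.add(msg);
            for (Pattern p : patterns) {
                boolean enabled = true;
                for (Channel ch : p.channels) {
                    if (ch.pending.isEmpty()) { enabled = false; break; }
                }
                if (enabled) {
                    for (Channel ch : p.channels) ch.pending.remove();
                    toRun = p.body;
                    break;
                }
            }
        }
        if (toRun != null) toRun.run();  // run the body outside the lock
    }
}
```

Every sender contends for the one lock even when its channels are unrelated to any other sender's; this is the overserialization that the measurements of §4 show to be the bottleneck.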
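By contrast, the scalable implementation replaces any global lock with a per-message status word resolved by CAS, as described in §5. The following is a minimal Java sketch of the PENDING/CLAIMED/CONSUMED discipline behind Lemmas 1 and 2; the names are ours for illustration, not the paper's C# implementation.

```java
import java.util.concurrent.atomic.AtomicReference;

// A message moves PENDING -> CLAIMED -> CONSUMED via CAS. A claim may
// be rolled back to PENDING, but CONSUMED is final, so interference
// can never "unresolve" a consumed message.
final class Message<T> {
    enum Status { PENDING, CLAIMED, CONSUMED }

    final T payload;
    private final AtomicReference<Status> status =
        new AtomicReference<>(Status.PENDING);

    Message(T payload) { this.payload = payload; }

    // A thread assembling a pattern's worth of messages tentatively
    // claims each one; a failed CAS means another thread won the race.
    boolean tryClaim() {
        return status.compareAndSet(Status.PENDING, Status.CLAIMED);
    }

    // Only the claiming thread may take these steps: once a message is
    // CLAIMED, the next change to its status is by that thread.
    void consume()  { status.set(Status.CONSUMED); }
    void rollback() { status.set(Status.PENDING); }

    boolean isConsumed() { return status.get() == Status.CONSUMED; }
}
```

Two threads racing to claim the same message cannot both succeed, and neither blocks the other: the loser simply rolls back any claims it already holds and retries, which is where the liveness argument of §5 does its work.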
Figure 11. Comparison with Haskell-STM implementations on 48-core machine. Note log scale.

[…] guards and propagated clauses in patterns, and to handle these features they spawn a thread per message; Haskell threads are lightweight enough to make such an approach viable. The manuscript provides some performance data, but only on a four-core machine, and does not provide comparisons against direct solutions to the problems they consider.

The simplest—but perhaps most important—advantage of our implementation over STM-based implementations is that we do not require STM, making our approach more portable. STM is an active area of research, and state-of-the-art implementations require significant effort.

The other advantage over STM is the specificity of our algorithm. An STM implementation must provide a general mechanism for declarative atomicity, conflict resolution and contention management. Since we are attacking a more constrained problem, we can employ a more specialized (and likely more efficient and scalable) solution. For example, in our implementation one thread can be traversing a bag looking for PENDING messages, determine that none are available, and exit message resolution, all while another thread is adding a new PENDING message to the bag. Or worse: two threads might both add messages to the same bag. It is not clear how to achieve the same degree of concurrency with STM: depending on the implementation, such transactions would probably be considered conflicting, and one aborted and retried. While such spurious retries might be avoidable by making STM aware of the semantics of bags, or by carefully designing the data structures to play well with the STM implementation, the effort involved is likely to exceed that of the relatively simple algorithm we have presented.

To test our suspicions about the STM-based join implementations, we replicated the pure synchronization benchmarks for producer/consumer and locks on top of both Singh's and Sulzmann's implementations. Figure 11 gives the results on the same 48-core benchmarking machine we used for §4. Note that, to avoid losing too much detail, we plot the throughput in these graphs using a log scale. The comparison is, of course, a loose one, as we are comparing across two very different languages and runtime systems. However, it seems clear that the STM implementations suffer from both drastically increased constant overheads and much poorer scalability. Surprisingly, of the two STM implementations, Singh's much simpler implementation was the better performer.

Given these results, and the earlier results for lock-based implementations, our Joins implementation is the only one we know to scale when used for fine-grained parallelism. While STM can be used to implement joins, it is also possible to see STM and joins as two disparate points in the spectrum of declarative concurrency: STM allows arbitrary shared-state computation to be declared atomic, while joins only permits highly-structured atomic blocks in the form of join patterns. From this standpoint, our library is a bit like k-compare single swap [20] in attempting to provide scalable atomicity somewhere between CAS and STM. There is a clear tradeoff: by reducing expressiveness relative to a framework like STM, our joins library admits a relatively simple implementation with robust performance and scalability.

6.3 Coordination in java.util.concurrent

The java.util.concurrent library contains a class called AbstractQueuedSynchronizer that provides basic functionality for queue-based, blocking synchronizers [15]. Internally, it represents the state of the synchronizer as a single 32-bit integer, and requires subclasses to implement tryAcquire and tryRelease methods in terms of atomic operations on that integer. It is used as the base class for at least six synchronizers in the java.util.concurrent package, thereby avoiding substantial code duplication. In a sense, our Joins library is a generalization of the abstract synchronizer framework: we support arbitrary internal state (represented by asynchronous messages), n-way rendezvous, and the exchange of messages at the time of synchronization.

Another interesting aspect of java.util.concurrent is its use of dual data structures [27], in which blocking calls to a data structure (such as Pop on an empty stack) insert a "reservation" in a nonblocking manner; they can then spinwait to see whether that reservation is quickly fulfilled, and otherwise block. Reservations provide an analog to the conditions used in monitors, but apply to nonblocking data structures. Joins provide another perspective on reservations: for us, Pop can be used as a method, but it is really a message-send on a synchronous channel. Reservations, spinwaiting and blocking thus fall out as a natural consequence. With Joins, it is also easy to allow such synchronous messages to be involved in several different patterns, allowing the requests to be fulfilled in multiple ways, each of which can involve complex synchronization; the correct protocol for handling reservations is then automatically provided.

Our benchmarking results indicate that it may be reasonable to build some parts of a library like java.util.concurrent around Joins; certainly, some of .NET's concurrency library could be profitably replaced with a Joins-based implementation. While this is a nice result, our goal is not to replace such carefully-engineered libraries. Rather, we want to push forward the frontier, taking the insights that made java.util.concurrent successful and putting them in the hands of application programmers, without exposing their intricacies.

6.4 Parallel CML

Our algorithm draws some inspiration from Reppy, Russo and Xiao's Parallel CML, a combinator library for first-class synchronous events [23]. The difficulty in implementing CML is, in a sense, dual to that of the join calculus: disjunctions of events (rather than conjunctions of messages) must be atomically resolved. Parallel CML implements disjunction (choice) by adding a single event to the queue of each involved channel. Events have a "state" similar to our message statuses; event resolution is performed by an optimistic protocol that uses CAS to claim events.

The Parallel CML protocol is, however, much simpler than the protocol we have presented: events are resolved while holding a channel lock. In particular, if an event is offering a choice between sending on channel A and receiving on channel B, the resolution code will first lock channel A while looking for partners, then (if no partners are found) unlock A and lock channel B. These channel locks prevent concurrent changes to channel queues, allowing the implementation to avoid subtle questions about when it is safe to stop running the protocol—exactly the questions we address in §3.3. The tradeoff for this simplicity is reduced scalability under high contention due to reduced concurrency, as in our experimental results for lock-based joins.

7. Conclusion

We have demonstrated that it is possible to place many of the insights behind scalable synchronization algorithms directly into the hands of library users—without requiring those users to understand those insights, to use a fixed set of primitives, or to give up scalability. There is still much work to be done, both in evaluation (does declarative coordination help in real applications?) and scope (can we expand beyond the join calculus?), but we believe the algorithms presented in this paper provide a promising foundation for further exploration.

Acknowledgements We are indebted to both Sam Tobin-Hochstadt and Vincent St-Amour for careful reading and many helpful discussions; Jesse Tov, Daniel Brown, and Stephen Chang for feedback on drafts of this paper; Stephen Toub and Microsoft's PCP team for providing access to our test machine; Luc Maranget for motivating our counter optimization; and Samin Ishtiaq and Matthew Parkinson for general support.

References

[1] N. Benton. Jingle bells: Solving the Santa Claus problem in Polyphonic C#. Unpublished manuscript, Mar. 2003. URL http://research.microsoft.com/~nick/santa.pdf.
[2] N. Benton, L. Cardelli, and C. Fournet. Modern concurrency abstractions for C#. ACM Transactions on Programming Languages and Systems, 26(5), Sept. 2004.
[3] E. W. Dijkstra. Hierarchical ordering of sequential processes. Acta Informatica, 1(2):115–138, 1971.
[4] C. Fournet and G. Gonthier. The reflexive chemical abstract machine and the join-calculus. In ACM SIGPLAN Symposium on Principles of Programming Languages (POPL), 1996.
[5] C. Fournet and G. Gonthier. The join calculus: a language for distributed mobile programming. In APPSEM Summer School, Caminha, Portugal, Sept. 2000, 2002.
[6] C. Fournet, C. Laneve, L. Maranget, and D. Rémy. Implicit typing à la ML for the join-calculus. In International Conference on Concurrency Theory (CONCUR), 1997.
[7] C. Fournet, F. Le Fessant, L. Maranget, and A. Schmitt. JoCaml: a language for concurrent distributed and mobile programming. In Advanced Functional Programming, 4th International School, Oxford, Aug. 2002, 2003.
[8] T. Harris, S. Marlow, S. Peyton-Jones, and M. Herlihy. Composable memory transactions. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2005.
[9] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, 2008.
[10] M. P. Herlihy and J. M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, July 1990.
[11] G. S. Itzstein and D. Kearney. Join Java: An alternative concurrency semantics for Java. Technical Report ACRC-01-001, University of South Australia, 2001.
[12] F. Le Fessant and L. Maranget. Compiling join-patterns. In International Workshop on High-Level Concurrent Languages (HLCL), Sept. 1998.
[13] D. Lea. URL http://gee.cs.oswego.edu/dl/concurrency-interest/.
[14] D. Lea. A Java fork/join framework. In ACM Conference on Java, 2000.
[15] D. Lea. The java.util.concurrent synchronizer framework. Science of Computer Programming, 58(3):293–309, 2005.
[16] Y. Liu, 2009. URL http://channel.sourceforge.net/.
[17] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems (TOCS), 9(1):21–65, February 1991.
[18] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In ACM Symposium on Principles of Distributed Computing (PODC), 1996.
[19] M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared memory multiprocessors. Journal of Parallel and Distributed Computing, 51(1):1–26, May 1998.
[20] M. Moir, V. Luchangco, and N. Shavit. Non-blocking k-compare single swap. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2003.
[21] M. Odersky. An overview of functional nets. In APPSEM Summer School, Caminha, Portugal, Sept. 2000, 2002.
[22] H. Plociniczak and S. Eisenbach. JErlang: Erlang with Joins. In Coordination Models and Languages, 2010.
[23] J. Reppy, C. V. Russo, and Y. Xiao. Parallel Concurrent ML. In ACM SIGPLAN International Conference on Functional Programming (ICFP), 2009.

A. Library Reference

A new Join instance j is allocated by calling one of two overloads of the factory method Join.Create:

Join j = Join.Create();
Join j = Join.Create(size);

The optional integer size is used to explicitly bound the number of channels supported by Join instance j. An omitted size argument defaults to 32; size initializes the constant, read-only property j.Size.

A Join object notionally owns a set of channels, each obtained by calling an overload of method Init, passing the location, channel(s), of a channel or array of channels using an out argument:

j.Init(out channel);
j.Init(out channels, length);

The second form takes a length argument to initialize location channels with an array of length distinct channels.

Channels are instances of delegate types. In all, the library provides six channel flavors:

// void-returning asynchronous channels
delegate void Asynchronous.Channel();
delegate void Asynchronous.Channel<A>(A a);
// void-returning synchronous channels
delegate void Synchronous.Channel();
delegate void Synchronous.Channel<A>(A a);
// value-returning synchronous channels
delegate R Synchronous<R>.Channel();
delegate R Synchronous<R>.Channel<A>(A a);

[…] for some type P:

• If ci is a channel of type Channel<P>, then When(ci)/And(ci) adds one parameter pj of type Pj = P to delegate d.
• If ci is an array of type Channel<P>[], then When(ci)/And(ci) adds one parameter pj of array type Pj = P[] to delegate d.

Parameters are added to d from left to right, in increasing order of i. In the current implementation, a continuation can receive at most m ≤ 16 parameters.

A join pattern associates a set of channels with a body d. A body can execute only once all the channels guarding it have been invoked. Invoking a channel may enable zero, one or more patterns:

• If no pattern is enabled then the channel invocation is queued up. If the channel is asynchronous, then the argument is added to an internal bag. If the channel is synchronous, then the calling thread is blocked, joining a notional bag of threads waiting on this channel.
• If there is a single enabled join pattern, then the arguments of the invocations involved in the match are consumed, any blocked thread involved in the match is awakened, and the body of the pattern is executed in that thread. Its result—some value or exception—is broadcast to all other waiting threads, awakening them. If the pattern contains no synchronous channels, then its body runs in a new thread.
• If there are several enabled patterns, then an unspecified one is chosen to run.
• Similarly, if there are multiple invocations of a particular channel pending, which invocation will be consumed when there is a match is unspecified.

The current number of channels initialized on j is available as read-only property j.Count; its value is bounded by j.Size. Any invocation of j.Init that would cause j.Count to exceed j.Size throws JoinException.

Join patterns must be well-formed, both individually and collectively. Executing Do(d) to complete a join pattern will throw JoinException if d is null, the pattern repeats a channel (and the implementation requires linear patterns), a chan[…]
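As a cross-language point of comparison for the discussion in §6.3, the AbstractQueuedSynchronizer recipe described there (all state in one 32-bit integer, subclass-supplied tryAcquire/tryRelease) can be seen in a minimal, non-reentrant mutex. This is a standard illustrative use of the class, not code taken from java.util.concurrent itself.

```java
import java.util.concurrent.locks.AbstractQueuedSynchronizer;

// Minimal non-reentrant mutex built on AbstractQueuedSynchronizer.
// The Sync subclass encodes its entire state in the single integer
// AQS provides (0 = free, 1 = held); AQS supplies the wait queue,
// blocking, and wakeup machinery.
final class SimpleMutex {
    private static final class Sync extends AbstractQueuedSynchronizer {
        @Override protected boolean tryAcquire(int ignored) {
            return compareAndSetState(0, 1);  // atomically take the lock
        }
        @Override protected boolean tryRelease(int ignored) {
            setState(0);                      // free it; AQS wakes a waiter
            return true;
        }
    }

    private final Sync sync = new Sync();

    public void lock()   { sync.acquire(1); }
    public void unlock() { sync.release(1); }
}
```

Where an AQS subclass gets one integer and one wait queue, a join pattern gets arbitrary internal state as asynchronous messages; that is the sense in which the paper calls Joins a generalization of this framework.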