Chapter 10

Work-Stealing

10.1 Introduction

In this chapter we look at the problem of scheduling multithreaded computations. This problem is interesting in its own right, and we will see that solutions to it require the design and implementation of novel lock-free concurrent objects. A delightful aspect of this topic is that it combines well-established practical results with a non-trivial mathematical foundation.

Figure 10.1 shows one way to decompose the well-known Fibonacci function into a multithreaded program. This implementation is an extremely inefficient way to compute Fibonacci numbers, but we use it here to illustrate the basic principles of multithreaded programming. We use the standard Java threads package.¹ To compute the n-th Fibonacci number, we create a new Fib object:

    Fib f = new Fib(10);

A Fib object extends java.lang.Thread, so we can start it in parallel with the main computation:

    f.start();

Later, when we want the result, we call its join method. This method pauses the caller until the Fib object has completed its computation, and it is safe to pick up the result:

    f.join();
    System.out.println("10-th Fibonacci number is: " + f.result);

The Fib object's run method creates two child Fib objects and starts them in parallel. The parent cannot use the results computed by its children until those children join the parent. The parent then sums the children's results, and halts.

This chapter is part of the manuscript Multiprocessor Synchronization by Maurice Herlihy and Nir Shavit. The current text is for your personal use and not for distribution outside our classroom.

¹ For clarity, some exception handling is omitted from figures.


public class Fib extends Thread {
  public int arg;
  public int result;

  public Fib(int n) {
    arg = n;
    result = -1;
  }

  public void run() {
    if (arg < 2) {
      result = arg;
    } else {
      Fib left  = new Fib(arg-1);
      Fib right = new Fib(arg-2);
      left.start();
      right.start();
      left.join();
      right.join();
      result = left.result + right.result;
    }
  }
}

Figure 10.1: Multithreaded Fibonacci
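To make the calling pattern concrete, here is a complete driver corresponding to the fragments above. This is a sketch of ours, not part of the figure; the FibDriver name is an assumption.

import static java.lang.System.out;

// A minimal driver for the Fib class of Figure 10.1.
public class FibDriver {
  public static void main(String[] args) throws InterruptedException {
    Fib f = new Fib(10); // thread that will compute the 10th Fibonacci number
    f.start();           // advise the scheduler it may run in parallel
    // ... the main computation may proceed here ...
    f.join();            // wait until the result is safe to read
    out.println("10-th Fibonacci number is: " + f.result);
  }
}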

Figure 10.2: Multithreaded Fibonacci DAG

Notice that starting and joining threads in Java does not guarantee that any computations actually happen in parallel. Instead, you should think of these methods as advisory: they tell the underlying scheduler that it may execute these computations in parallel. This chapter is concerned with the design and implementation of effective schedulers.

10.2 Model

It is convenient to model a multithreaded computation as a directed acyclic graph (DAG), where each node represents an atomic step, and edges represent dependencies. For example, a single thread is just a linear chain of nodes. A step corresponding to a start method call has two outgoing edges: one to its successor in the same thread, and one to the first step of the newly-started thread. There is an edge from a child thread's last step to the parent thread's step in which it calls the child's join method. Figure 10.2 shows the DAG corresponding to a short Fibonacci execution. All the DAGs considered here have out-degree at most two.

Clearly, some computations are more parallel than others. We now consider some ways to make such notions precise. Let T_P be the minimum number of steps needed to execute a multithreaded program on a system of P dedicated processors. Note that T_P is an idealized measure: it may not always be possible for every processor to "find" steps to execute, and actual computation time may be partly determined by other concerns, such as memory usage. Nevertheless, T_P is clearly a lower bound on how much parallelism one can extract from a multithreaded computation.

Some values of P are important enough that they have special names. T_1, the number of steps needed to execute the program on a single processor, is called the computation's work. Work is also the total number of steps in the entire computation. In one time step, P processors can execute at most P steps, so

    T_P ≥ T_1/P.

The other extreme is also of special importance: T_∞, the number of steps needed to execute the program on an unlimited number of processors, is called the critical-path length. Because finite resources cannot do better than infinite resources,

    T_P ≥ T_∞.

The speedup on P processors is the ratio

    T_1/T_P.

We say a computation has linear speedup if T_1/T_P = Θ(P). Finally, the parallelism of a computation is the maximum possible speedup: T_1/T_∞.
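To make these measures concrete, consider the Fibonacci program of Figure 10.1 (this back-of-the-envelope calculation is ours, not from the figures). Each Fib node performs a constant number of steps apart from its children, so the work satisfies

    T_1(n) = T_1(n−1) + T_1(n−2) + Θ(1),

which grows exponentially in n, like the Fibonacci numbers themselves. The two children run in parallel, so

    T_∞(n) = max(T_∞(n−1), T_∞(n−2)) + Θ(1) = T_∞(n−1) + Θ(1) = Θ(n).

The parallelism T_1(n)/T_∞(n) therefore grows exponentially in n.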

To illustrate these concepts, we now examine a simple multithreaded matrix multiplication program. Matrix multiplication can be decomposed as follows:

    ( A11 A12 )   ( B11 B12 )   ( C11 C12 )
    (         ) = (         ) · (         )
    ( A21 A22 )   ( B21 B22 )   ( C21 C22 )

                  ( B11·C11 + B12·C21   B11·C12 + B12·C22 )
                = (                                       )
                  ( B21·C11 + B22·C21   B21·C12 + B22·C22 )

To turn this observation into code, assume we have a Matrix class, with put and get methods to access elements. This class also provides a method that splits an n-by-n matrix into four (n/2)-by-(n/2) submatrices:

    public Matrix[][] split() { ... }

In Java terminology, the four submatrices are "backed" by the original matrix, meaning that changes to the submatrices are reflected in the matrix, and vice-versa. This method can be implemented to take constant time by appropriate repositioning of indexes (left as an exercise for the reader). The code for multithreaded matrix addition appears in Figure 10.3, and multiplication in Figure 10.4.

Let A_P(n) be the number of steps needed to execute Add on P processors. The work A_1(n) is defined by the recurrence:

    A_1(n) = 4A_1(n/2) + Θ(1) = Θ(n²).

This work is the same as that of the conventional doubly-nested loop implementation. The critical-path length is also easy to compute:

    A_∞(n) = A_∞(n/2) + Θ(1) = Θ(log n).

public class Add extends Thread {
  public Matrix sum, arg;

  public Add(Matrix sum, Matrix arg) {
    this.sum = sum;
    this.arg = arg;
  }

  public void run() {
    if (sum.getDimension() == 1) {
      sum.put(0, 0, sum.get(0,0) + arg.get(0,0));
    } else {
      Matrix[][] s = this.sum.split();
      Matrix[][] a = this.arg.split();
      // create children
      Add[] child = {
        new Add(s[0][0], a[0][0]),
        new Add(s[0][1], a[0][1]),
        new Add(s[1][0], a[1][0]),
        new Add(s[1][1], a[1][1])
      };
      // start children
      for (int i = 0; i < child.length; i++)
        child[i].start();
      // join children
      for (int i = 0; i < child.length; i++) {
        try {
          child[i].join();
        } catch (InterruptedException e) {}
      }
    }
  }
}

Figure 10.3: Matrix addition

public class Mult extends Thread {
  public Matrix prod, arg0, arg1;

  public Mult(Matrix prod, Matrix arg0, Matrix arg1) {
    this.prod = prod;
    this.arg0 = arg0;
    this.arg1 = arg1;
  }

  public void run() {
    int n = prod.getDimension();
    if (n == 1) {
      prod.put(0, 0, arg0.get(0,0) * arg1.get(0,0));
    } else {
      Matrix tmp = new Matrix(n,n);
      Matrix[][] r = this.prod.split();
      Matrix[][] a = this.arg0.split();
      Matrix[][] b = this.arg1.split();
      Matrix[][] t = tmp.split();
      // create children
      Mult[] child = {
        new Mult(r[0][0], a[0][0], b[0][0]),
        new Mult(r[0][1], a[0][0], b[0][1]),
        new Mult(r[1][0], a[1][0], b[0][0]),
        new Mult(r[1][1], a[1][0], b[0][1]),
        new Mult(t[0][0], a[0][1], b[1][0]),
        new Mult(t[0][1], a[0][1], b[1][1]),
        new Mult(t[1][0], a[1][1], b[1][0]),
        new Mult(t[1][1], a[1][1], b[1][1])
      };
      // start children
      for (int i = 0; i < child.length; i++)
        child[i].start();
      // join children
      for (int i = 0; i < child.length; i++) {
        try {
          child[i].join();
        } catch (InterruptedException e) {}
      }
      // add tmp into prod (the final addition assumed by the
      // critical-path analysis below)
      Add add = new Add(prod, tmp);
      add.start();
      try {
        add.join();
      } catch (InterruptedException e) {}
    }
  }
}

Figure 10.4: Matrix multiplication

This claim follows because each of the half-size additions is performed in parallel with the others. Let M_P(n) be the number of steps needed to execute Mult on P processors. The work M_1(n) is defined by the recurrence:

    M_1(n) = 8M_1(n/2) + A_1(n)
           = 8M_1(n/2) + Θ(n²)
           = Θ(n³).

This work is also the same as that of the conventional triply-nested loop implementation. The critical-path length is:

    M_∞(n) = M_∞(n/2) + A_∞(n)
           = M_∞(n/2) + Θ(log n)
           = Θ(log² n).

This claim follows because the half-size multiplications are performed in parallel, followed by a single addition. The parallelism for the Mult program is

    M_1(n)/M_∞(n) = Θ(n³ / log² n),

which is quite high. For example, suppose we want to multiply two 1000-by-1000 matrices. Here, n³ = 10⁹, and log n = log 1000 ≈ 10 (logs are base two), so the parallelism is approximately 10⁹/10² = 10⁷. Roughly speaking, this instance of matrix multiplication could, in principle, keep roughly 10⁷ processors busy, well beyond the powers of any existing multiprocessor.

You should understand that the parallelism computed above is a highly idealized upper bound on the performance of any multithreaded matrix multiplication program. For example, when there are idle threads, it may not be easy to assign those threads to idle processors. Moreover, a program that displays less parallelism but consumes less memory may perform better because it encounters fewer page faults. The actual performance of a multithreaded computation remains a complex engineering problem, but the kind of analysis presented in this section is an indispensable first step in understanding the degree to which a problem can be solved in parallel.

10.3 Realistic Multiprocessor Scheduling

Our analysis so far has been based on the assumption that each multithreaded program has P dedicated processors. This assumption, unfortunately, does not correspond to the way shared-memory multiprocessors are used in real life. Multiprocessors typically run a mix of jobs, where jobs come and go dynamically. One might start, say, a matrix multiplication job on P processors. At some point, the operating system may decide to download mail, preempting one processor, and our job is now running on P − 1 processors. Later, the mail program pauses waiting for a disk read or write to complete, and in the interim the matrix program has P processors again.

Most operating systems provide user-level processes, where a process consists of a program counter (like a thread) and usually an address space. The operating system kernel includes a scheduler that runs user-level processes on physical processors. The mapping between processes and processors, and when the processes are scheduled, is typically not under the control of the application.

One approach is to set up a one-to-one correspondence between application-level threads and processes: creating a new thread creates a new process, and ending a thread ends that process. This approach, however, is impractical because process creation is expensive. Instead, it makes more sense to create a fixed collection of relatively long-lived processes to execute the varying collection of short-lived threads.

We end up with a three-level model. At the top level, we write multithreaded programs (such as matrix multiplication) that decompose an application into a dynamically-varying number of logical threads. At the middle level, we write a user-level scheduler that maps threads to a fixed number P of processes. At the bottom level, the kernel maps our P user-level processes onto a dynamically-varying number of processors. This last level is not under our control: applications cannot tell the kernel how to schedule their processes, and the scheduling policies of most modern operating system kernels are hidden from users anyway. Our challenge is to design a user-level scheduler that makes the best use of an unknown kernel scheduling policy.

Let us assume for now that the kernel works in discrete steps. (This discrete-step assumption is not required for correctness, but it makes the analysis easier.) At each step i, the kernel chooses an arbitrary subset of p_i user-level processes to run for one step, where 0 ≤ p_i ≤ P. The processor average P_A over T steps is defined to be

    P_A = (1/T) · Σ_{i=0}^{T−1} p_i.                      (10.1)

Instead of designing a user-level scheduler to achieve a P-fold speedup, we can try to achieve a P_A-fold speedup. A schedule is greedy if the number of program steps executed at each time step is the minimum of p_i (the number of available processors) and the number of ready nodes in the program DAG.
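For example (with illustrative numbers of our own), suppose the kernel runs p_0, p_1, p_2, p_3 = 3, 1, 2, 2 processes over T = 4 steps. Then

    P_A = (3 + 1 + 2 + 2)/4 = 2,

so even if the application created P = 4 processes, a greedy user-level scheduler should be judged against a 2-fold, not a 4-fold, speedup.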

Theorem 10.3.1 Consider a multithreaded program with work T_1, critical-path length T_∞, and P user-level processes. Any greedy execution has length T at most

    T_1/P_A + T_∞(P − 1)/P_A.

Proof: Equation 10.1 implies that

    T = (1/P_A) · Σ_{i=0}^{T−1} p_i.

We will bound T by bounding the sum of the p_i. At each kernel-level step, imagine placing tokens in one of two buckets. For each user-level process that executes a node at step i, we place a token in a work bucket, and for each scheduled process that remains idle at that step, we place a token in an idle bucket. After the last step, the work bucket contains T_1 tokens, one for each node of the computation DAG.

How many tokens does the idle bucket contain? Call a step idle if some process places a token in the idle bucket at that step. Because the schedule is greedy, and because an unfinished computation always has at least one ready node, at least one process executes a node at every step; so of the p_i processes scheduled at step i, at most p_i − 1 ≤ P − 1 can be idle.

Let G_i be the sub-DAG of the computation consisting of the nodes that have not been executed at the end of step i. Every node with in-degree 0 in G_{i−1} was ready at the start of step i. If step i is idle, there must be fewer than p_i such nodes, because otherwise the greedy schedule would have kept all p_i processes busy. It follows that at every idle step the greedy schedule executes all the ready nodes, so the longest directed path in G_i is one shorter than the longest directed path in G_{i−1}. The longest directed path before step 0 has length T_∞, so the greedy schedule can have at most T_∞ idle steps. Combining these observations shows that the idle bucket contains at most T_∞(P − 1) tokens. The total number of tokens in both buckets is therefore

    Σ_{i=0}^{T−1} p_i ≤ T_1 + T_∞(P − 1),

yielding the desired bound.

It turns out that this bound is within a factor of two of optimal. Actually computing an optimal schedule is NP-complete, so greedy schedules are a simple and practical way to get performance that is reasonably close to optimal.
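To get a feel for the theorem, we can plug in the earlier matrix-multiplication numbers (an illustrative calculation of ours): for 1000-by-1000 Mult, T_1 ≈ 10⁹ and T_∞ ≈ 10². With P = 32 processes and a processor average of P_A = 16,

    T ≤ 10⁹/16 + 10² · (32 − 1)/16 ≈ 6.25 × 10⁷ + 194,

so the idle-step term contributed by the critical path is negligible, and the execution time is essentially the work divided by the average number of available processors.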

10.4 Work Stealing

We now understand that if we keep the user-level processes supplied with work, then the resulting schedule is greedy, and our multithreaded application will achieve a pretty good speedup. Multithreaded computations, however, create and destroy threads dynamically, sometimes in unpredictable ways. We still need a way to connect ready threads with idle processes.

There are two basic approaches to keeping processes busy. In work sharing, processes distribute surplus work to other processes, with the goal of ensuring that all processes are assigned approximately the same amount of work. In work stealing, a process that runs out of work tries to "steal" work from other processes. Here, we focus on work stealing, which has the attractive feature that no inter-process synchronization is needed as long as all processes have plenty of work.

Each process keeps a pool of ready threads in the form of a double-ended queue (DEQueue), providing pushBottom, popBottom, and popTop methods (we do not need a pushTop method). The process that owns a DEQueue is called the local process for that object.

public class DEQueue {
  longRMWregister top; // tag & top thread index
  int bottom;          // bottom thread index
  Thread[] deq;        // array of threads

  public class Abort extends java.lang.Exception {};

  // extract tag field from top
  private int TAG_MASK  = 0xFFFF0000;
  private int TAG_SHIFT = 16;
  private int getTag(long i) {
    return (int)((i & TAG_MASK) >> TAG_SHIFT);
  }

  // extract index field from top
  private int INDEX_MASK  = 0x0000FFFF;
  private int INDEX_SHIFT = 0;
  private int getIndex(long i) {
    return (int)((i & INDEX_MASK) >> INDEX_SHIFT);
  }

  // combine tag and index to form new top
  private long makeTop(int tag, int index) {
    return ((long)tag << TAG_SHIFT) | ((long)index << INDEX_SHIFT);
  }

Figure 10.5: Manipulating the top field

public class DEQueue {
  longRMWregister top; // tag & top thread index
  int bottom;          // bottom thread index
  Thread[] deq;        // array of threads

...

  /**
   * called by local thread to set aside work
   **/
  public void pushBottom(Thread t) {
    this.deq[this.bottom] = t; // store object
    this.bottom++;             // advance bottom
  }

  ...
}

Figure 10.6: The pushBottom method

When the local process creates a new thread, it calls pushBottom to push that thread onto its DEQueue. When the local process needs more work, it calls popBottom to remove a thread from the DEQueue. If the local process discovers its DEQueue is empty, then it becomes a thief: it chooses a victim process at random, and calls that process's DEQueue's popTop method, attempting to pop a thread from the top of that DEQueue.

Ideally, we would like an efficient, wait-free, linearizable DEQueue implementation. In practice, we have to settle for slightly weaker conditions. Our implementation of popTop may throw an exception if a concurrent popTop call succeeds, or if a concurrent popBottom takes the last thread in the DEQueue.

The DEQueue class has three fields: top, bottom, and deq. The top field is a long integer encompassing two subfields: the high-order 16 bits constitute the tag value, while the low-order 16 bits constitute the index value. Figure 10.5 shows how to extract the tag and index values. The tag field is needed to avoid the "ABA" problem examined in the previous section.

The pushBottom method (Figure 10.6) simply stores the new thread at the bottom queue location and increments the bottom field. The popBottom method (Figure 10.7) is more complex. It tests whether the DEQueue contains more than one thread. If so, then it returns the bottom thread without performing a CAS. Otherwise, there is a danger that the bottom thread may be stolen. The method tries to set both the top and bottom fields to zero. If it succeeds, it returns the last remaining thread; otherwise, that thread has been stolen, and the method returns null. The important aspect of this protocol is that an expensive CAS operation is needed only when the DEQueue is almost empty.

public class DEQueue {
  longRMWregister top; // tag & top thread index
  int bottom;          // bottom thread index
  Thread[] deq;        // array of threads

  ...

  /**
   * Called by local thread to get more work
   **/
  Thread popBottom() {
    // is the queue empty?
    if (this.bottom == 0) // empty
      return null;
    this.bottom--;
    Thread t = this.deq[this.bottom];
    long oldTop = this.top.read();
    if (this.bottom > getIndex(oldTop))
      return t; // more than one thread left: no CAS needed
    // the queue held at most one thread: contend with thieves
    long newTop = makeTop(getTag(oldTop) + 1, 0); // advance tag to avoid ABA
    int oldBottom = this.bottom;
    this.bottom = 0;
    if (oldBottom == getIndex(oldTop))
      if (this.top.CAS(oldTop, newTop))
        return t; // won the race for the last thread
    this.top.write(newTop); // lost the race: thread was stolen
    return null;
  }
}

Figure 10.7: The popBottom method

public class DEQueue {
  longRMWregister top; // tag & top thread index
  int bottom;          // bottom thread index
  Thread[] deq;        // array of threads

  ...

  /**
   * Called by thieves to try to steal a thread
   **/
  Thread popTop() throws Abort {
    long oldTop = this.top.read();
    int bottom = this.bottom;
    if (bottom <= getIndex(oldTop)) // empty
      return null;
    Thread t = this.deq[getIndex(oldTop)];
    long newTop = makeTop(getTag(oldTop), getIndex(oldTop) + 1);
    if (this.top.CAS(oldTop, newTop))
      return t;
    throw new Abort();
  }

  ...
}

Figure 10.8: The popTop method

The popTop method (Figure 10.8) checks whether the DEQueue is empty, and if not, tries to steal the top thread by applying CAS to the top field. If the CAS succeeds, the theft is successful; otherwise the method throws an exception.
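Putting the pieces together, here is a minimal sketch of the scheduler loop each process might run. The WorkStealingWorker class name, the queues array, and the random victim choice are illustrative assumptions of ours; the chapter does not prescribe a particular loop.

import java.util.Random;

// A sketch of a per-process work-stealing scheduler loop.
public class WorkStealingWorker {
  DEQueue[] queues; // one DEQueue per process
  int me;           // index of this process's own DEQueue
  Random random = new Random();

  public void workLoop() {
    while (true) {
      Thread t = queues[me].popBottom(); // first try local work
      while (t == null) {
        // local queue empty: become a thief and pick a random victim
        int victim = random.nextInt(queues.length);
        if (victim == me)
          continue;
        try {
          t = queues[victim].popTop(); // null if the victim is empty
        } catch (DEQueue.Abort e) {
          // lost a race with a concurrent pop; try another victim
        }
      }
      t.run(); // execute the thread's steps directly in this process
    }
  }
}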

10.5 The Steal-Half Protocol

In the work-stealing protocol described in the previous section, each process maintains a local work queue, and steals an item from another process's queue if its own becomes empty. At its core is a lock-free protocol for stealing an individual item from a bounded-size queue, minimizing the need for costly CAS operations when fetching items locally. We have seen that stealing one item at a time ensures that the time needed to execute a multithreaded computation is within a constant factor of optimal. Nevertheless, there is reason to believe that the scheme can be improved by allowing the thief process to steal multiple items from the victim.

The most straightforward way to steal multiple threads is simply for the thief to call popTop multiple times. This solution is unsatisfactory because each such call requires an expensive CAS operation. This section shows how to generalize the previous section's algorithm to allow processes to steal up to half of the threads in a given queue at a time. This revised protocol preserves the key properties of the original: it is lock-free, and it minimizes the number of CAS operations that the local process needs to perform.

As before, each process has a local work queue, here called an extended double-ended queue (EDEQueue), where the local process calls the pushBottom and popBottom methods, while thieves call the stealTop method. The stealTop method can return multiple threads, not just one.

The original DEQueue implementation had a nice property: as long as there is more than one item in the DEQueue, the local process can pop from the bottom of the DEQueue without an expensive CAS operation. If there is a single item in the DEQueue, then the process needs to use CAS to synchronize with potential thieves. You should think of this CAS as a consensus protocol in which the processes decide what is to become of the contested item. For any "reasonable" sequence of k pushBottom or popBottom calls, this protocol requires a constant number of CAS operations.

If a thief process could remove up to half of the items, it may be necessary to reach consensus on the status of each item in the overlap. For any sequence of k pushBottom or popBottom calls, a protocol that removes items one at a time would require Θ(k) CAS operations, an unacceptable overhead. The extended deque algorithm we present manages to steal up to half the items and pay only Θ(log k) CAS operations for any "reasonable" sequence of k pushBottom or popBottom calls.

The extended DEQueue implementation presented here differs from the original in two ways: (1) it is implemented as a cyclic array, and (2) the top field is replaced by a field called stealRange, which defines the range of items that can be stolen atomically by a thief process.

public class EDEQueue {
  public longRMWregister stealRange; // where to steal
  int bottom;                        // bottom thread index
  Object[] deq;                      // array of threads

  private static final int QUEUE_SIZE = 32;

  // extract tag field from stealRange
  private int TAG_MASK  = 0xFFFF0000;
  private int TAG_SHIFT = 16;
  private int getTag(long i) {
    return (int)((i & TAG_MASK) >> TAG_SHIFT);
  }

  // extract top field from stealRange
  private int TOP_MASK  = 0x0000FF00;
  private int TOP_SHIFT = 8;
  private int getTop(long i) {
    return (int)((i & TOP_MASK) >> TOP_SHIFT);
  }

  // extract steal field from stealRange
  private int STEAL_MASK  = 0x000000FF;
  private int STEAL_SHIFT = 0;
  private int getSteal(long i) {
    return (int)((i & STEAL_MASK) >> STEAL_SHIFT);
  }

  // combine tag, top, and steal to form a new stealRange
  private long makeStealRange(int tag, int top, int steal) {
    return ((long)tag << TAG_SHIFT) | ((long)top << TOP_SHIFT)
         | ((long)steal << STEAL_SHIFT);
  }

  private int log2(int x) { // floor of log base two
    int result = 0;
    while (x > 1) {
      x = x >> 1;
      result++;
    }
    return result;
  }

  public int getSize() {
    return this.bottom - getTop(this.stealRange.read());
  }

  /**
   * Adjust the steal range if needed.
   * @return false if and only if a needed CAS failed
   **/
  private boolean updateStealRange(long oldStealRange) {
    int size = this.getSize();
    int logSize = log2(size); // floor of actual log
    long currentStealRange = this.stealRange.read();
    // is size a power of two, or did someone steal something?
    if (size == (1 << logSize) || oldStealRange != currentStealRange) {
      // try to update stealRange to contain max(1, 2^(logSize-1)) threads
      int newSize = Math.max(1, (1 << (logSize - 1)));
      int top = getTop(currentStealRange);
      int tag = getTag(currentStealRange);
      return this.stealRange.CAS(currentStealRange,
                                 makeStealRange(tag + 1, top, top + newSize));
    }
    return true;
  }
  ...
}

Figure 10.9: Methods for manipulating stealRange

Figure 10.10: The extended DEQueue

The stealRange field has three subfields: tag is used to avoid the ABA problem, top is the index of the thread at the top of the queue, and steal is the index of the last thread to be stolen. A local process updates the stealRange field only when

• The number of items in the EDEQueue becomes a power of two, or

• A successful steal has occurred since the last time the process observed the EDEQueue.

Figure 10.9 illustrates the methods for manipulating the object's stealRange field. The EDEQueue object has one additional field: deq is an array of size QUEUE_SIZE that stores threads. The range of occupied entries in the deq array is the half-open range [top, bottom) modulo QUEUE_SIZE. For simplicity, the queue's top and bottom counters are incremented without bound, but are used modulo QUEUE_SIZE when indexing into the array. The bottom field points to the entry following the last entry containing a thread. If bottom and top are equal, the EDEQueue is empty.

Each process keeps track of its own prevStealRange value, of the same type as stealRange, which the process uses to determine whether a steal has occurred since its last method call.
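As a concrete illustration of updateStealRange (with numbers of our own choosing): suppose top = 10 and bottom = 18, so the queue holds

    size = 18 − 10 = 8 = 2³

threads. Because the size is a power of two, the local process tries to install a range of max(1, 2^(3−1)) = 4 threads, so the stealable range becomes the four threads at indexes 10 through 13, leaving the threads at indexes 14 through 17 for the exclusive use of the local process.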

/**
 * called by local thread to set aside work
 **/
public void pushBottom(Thread t, long oldStealRange) throws Full {
  if (this.getSize() == QUEUE_SIZE)
    throw new Full();
  this.deq[this.bottom % QUEUE_SIZE] = t; // store object
  this.bottom++;                          // advance bottom
  updateStealRange(oldStealRange);
}

Figure 10.11: Code for pushBottom

When and how to steal work is a policy decision best made by the individual application. Reasonable policies include the following:

• Try to steal only when the local DEQueue is empty (steal-on-empty).

• Try to steal probabilistically, with the probability decreasing as the number of items in the DEQueue increases (probabilistic balancing); see the sketch after this list.

• Try to steal whenever the number of items in the DEQueue increases or decreases by a constant factor since the last steal attempt.
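As an example of the second policy, a process might consult a helper like the following before each popBottom. The method name and the exact probability 1/(size + 1) are illustrative assumptions, not prescribed by the text.

// A sketch of probabilistic balancing: steal with probability that
// decreases as the local queue grows.
boolean shouldTrySteal(int size, java.util.Random random) {
  if (size == 0)
    return true;                        // out of work: always steal
  return random.nextInt(size + 1) == 0; // steal with probability 1/(size+1)
}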

The local process calls the pushBottom method (illustrated in Figure 10.11) whenever it needs to insert a new thread into its local EDEQueue. If the EDEQueue is not full, the new thread is placed in the entry indexed by bottom, bottom is incremented, and the process checks whether the stealRange field needs to be updated. The updateStealRange method tries to reset the stealRange field if either the queue size is a power of two, or the stealRange field has changed since the last time the process examined the object (meaning that some threads have been stolen).

The local process calls popBottom (illustrated in Figure 10.12) when it needs to consume another thread. If the queue is empty, the method returns null. It then calls updateStealRange to check the stealRange field. If that method detects that the field should be updated, but fails to complete the update, then popBottom throws an exception. Otherwise, the method pops off a thread and checks stealRange. If that range does not include the thread, then the method returns it, since no thief could have taken the thread. Otherwise, there are two possibilities: the queue is empty and the thread was stolen, or the queue contains a single thread. The method tries to update the stealRange field with an empty value. If it fails, it returns null, and if it succeeds, it returns the thread.

public Object popBottom(long prevStealRange) throws Abort {
  if (this.getSize() == 0)
    return null;
  boolean ok = updateStealRange(prevStealRange);
  if (!ok)
    throw new Abort();

  this.bottom--;
  Object t = this.deq[this.bottom % QUEUE_SIZE];
  long oldStealRange = this.stealRange.read();

  int rangeTop = getTop(oldStealRange);
  int rangeBot = getSteal(oldStealRange);
  if (this.bottom > rangeTop) {
    return t; // no need to synchronize
  } else if (rangeTop == rangeBot) { // oldStealRange is empty
    this.bottom = 0; // last thread already stolen
    return null;
  } else {
    // try to make stealRange empty
    long currentStealRange = this.stealRange.read();
    int tag = getTag(currentStealRange);
    int bot = getSteal(currentStealRange);
    if (this.stealRange.CAS(currentStealRange,
                            makeStealRange(tag + 1, bot + 1, bot))) {
      return t;    // thread was not stolen
    } else {
      return null; // thread was stolen
    }
  }
}

Figure 10.12: Code for popBottom

public int stealTop(EDEQueue victim) {
  long oldStealRange = victim.stealRange.read();
  int oldSteal = getSteal(oldStealRange);
  int oldTop = getTop(oldStealRange);
  int oldTag = getTag(oldStealRange);
  int rangeLen = oldSteal - oldTop;
  // figure out how much we can steal
  int capacity = QUEUE_SIZE - this.getSize();
  int numToSteal = Math.min(capacity, rangeLen);
  // tentatively copy stolen threads
  for (int i = 0; i < numToSteal; i++)
    this.deq[(this.bottom + i) % QUEUE_SIZE] =
      victim.deq[(oldTop + i) % QUEUE_SIZE];
  // try to make the theft complete by advancing the victim's top
  long newStealRange =
    makeStealRange(oldTag + 1, oldTop + numToSteal, oldSteal);
  if (victim.stealRange.CAS(oldStealRange, newStealRange)) {
    this.bottom += numToSteal; // make stolen threads visible to the thief
    this.updateStealRange(0);  // adjust thief's own steal range
    return numToSteal;
  }
  return 0;
}

Figure 10.13: Code for stealTop

A process calls the stealTop method (illustrated in Figure 10.13) to steal threads from another process's EDEQueue. The method first computes how many threads it can steal by comparing the victim's stealRange with the excess capacity of the thief's deq array. It then tentatively copies that many threads from the victim to the thief, but without yet updating either the thief's bottom field or the victim's stealRange field. The method then calls CAS to adjust the victim's stealRange field to reflect the missing items. If the CAS succeeds, the thief updates its own bottom field, followed by its own stealRange, and returns the number of threads stolen. If the CAS fails, the theft has failed, and the method returns zero.

Note that if another process succeeds in stealing concurrently from the thief, then the thief may fail to update its own stealRange, but it will not be prevented from updating its bottom field and completing the theft.
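As a concrete trace (with numbers of our own choosing): suppose the victim's stealRange has top = 4 and steal = 8, so

    rangeLen = 8 − 4 = 4,

and the thief's deq has 30 free entries. Then numToSteal = min(30, 4) = 4; the thief copies the threads at victim indexes 4 through 7 to the bottom of its own deq, and a successful CAS advances the victim's top to 8, leaving the victim's steal range empty until its next call to updateStealRange.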

10.6 Course Notes

This chapter adapted material from Leiserson and Prokop's minicourse on multithreaded programming [3]. The DAG-based model for the analysis of multithreaded computations was formalized by Blumofe and Leiserson [2], who also gave the first deque-based work-stealing implementation; their deque, however, was not lock-free. Theorem 10.3.1 and its proof first appeared in [1]. The steal-half protocol is a simplified version of a protocol due to Hendler and Shavit [4].

Bibliography

[1] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In ACM Symposium on Parallel Algorithms and Architectures, 1998.

[2] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 356–368, Santa Fe, New Mexico, November 1994.

[3] C. Leiserson and H. Prokop. A minicourse on multithreaded programming. ftp://theory.lcs.mit.edu/pub/cilk/minicourse.ps.gz.

[4] D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the 21st Annual ACM Symposium on Principles of Distributed Computing, 2002.
