Chapter 10

Work-Stealing

10.1 Introduction

In this chapter we look at the problem of scheduling multithreaded computations. This problem is interesting in its own right, and we will see that solutions to it require the design and implementation of novel lock-free concurrent objects. A delightful aspect of this topic is that it combines well-established practical results with a non-trivial mathematical foundation.

Figure 10.1 shows one way to decompose the well-known Fibonacci function into a multithreaded program. This implementation is an extremely inefficient way to compute Fibonacci numbers, but we use it here to illustrate the basic principles of multithreaded programming. We use the standard Java threads package.¹ To compute the n-th Fibonacci number, we create a new Fib object:

    Fib f = new Fib(10);

A Fib object extends java.lang.Thread, so we can start it in parallel with the main computation:

    f.start();

Later, when we want the result, we call its join method. This method pauses the caller until the Fib object has completed its computation, and it is safe to pick up the result:

    f.join();
    System.out.println("10-th Fibonacci number is: " + f.result);

The Fib object's run method creates two child Fib objects and starts them in parallel. The parent cannot use the results computed by its children until those children join the parent. The parent then sums the children's results, and halts.

This chapter is part of the manuscript Multiprocessor Synchronization by Maurice Herlihy and Nir Shavit. The current text is for your personal use and not for distribution outside our classroom.

¹ For clarity, some exception handling is omitted from figures.


public class Fib extends Thread {
  public int arg;
  public int result;

  public Fib(int n) {
    arg = n;
    result = -1;
  }

  public void run() {
    if (arg < 2) {
      result = arg;
    } else {
      Fib left  = new Fib(arg-1);
      Fib right = new Fib(arg-2);
      left.start();
      right.start();
      left.join();
      right.join();
      result = left.result + right.result;
    }
  }
}

Figure 10.1: Multithreaded Fibonacci
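To make the calling pattern concrete, here is a complete driver corresponding to the fragments above. This is a sketch of ours, not part of the figure; the FibDriver name is an assumption.

import static java.lang.System.out;

// A minimal driver for the Fib class of Figure 10.1.
public class FibDriver {
  public static void main(String[] args) throws InterruptedException {
    Fib f = new Fib(10); // thread that will compute the 10th Fibonacci number
    f.start();           // advise the scheduler it may run in parallel
    // ... the main computation may proceed here ...
    f.join();            // wait until the result is safe to read
    out.println("10-th Fibonacci number is: " + f.result);
  }
}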

Figure 10.2: Multithreaded Fibonacci DAG

Notice that starting and joining threads in Java does not guarantee that any computations actually happen in parallel. Instead, you should think of these methods as advisory: they tell the underlying scheduler that it may execute these computations in parallel. This chapter is concerned with the design and implementation of effective schedulers.

10.2 Model

It is convenient to model a multithreaded computation as a directed acyclic graph (DAG), where each node represents an atomic step, and edges represent dependencies. For example, a single thread is just a linear chain of nodes. A step corresponding to a start method call has two outgoing edges: one to its successor in the same thread, and one to the first step of the newly-started thread. There is an edge from a child thread's last step to the parent thread's step in which it calls the child's join method. Figure 10.2 shows the DAG corresponding to a short Fibonacci execution. All the DAGs considered here have out-degree at most two.

Clearly, some computations are more parallel than others. We now consider some ways to make such notions precise. Let T_P be the minimum number of steps needed to execute a multithreaded program on a system of P dedicated processors. Note that T_P is an idealized measure: it may not always be possible for every processor to "find" steps to execute, and actual computation time may be partly determined by other concerns, such as memory usage. Nevertheless, T_P is clearly a lower bound on how much parallelism one can extract from a multithreaded computation.

Some values of P are important enough that they have special names. T_1, the number of steps needed to execute the program on a single processor, is called the computation's work. Work is also the total number of steps in the entire computation. In one time step, P processors can execute at most P steps, so

    T_P ≥ T_1/P.

The other extreme is also of special importance: T_∞, the number of steps needed to execute the program on an unlimited number of processors, is called the critical-path length. Because finite resources cannot do better than infinite resources,

    T_P ≥ T_∞.

The speedup on P processors is the ratio

    T_1/T_P.

We say a computation has linear speedup if T_1/T_P = Θ(P). Finally, the parallelism of a computation is the maximum possible speedup: T_1/T_∞.
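To make these measures concrete, consider the Fibonacci program of Figure 10.1 (this back-of-the-envelope calculation is ours, not from the figures). Each Fib node performs a constant number of steps apart from its children, so the work satisfies

    T_1(n) = T_1(n−1) + T_1(n−2) + Θ(1),

which grows exponentially in n, like the Fibonacci numbers themselves. The two children run in parallel, so

    T_∞(n) = max(T_∞(n−1), T_∞(n−2)) + Θ(1) = T_∞(n−1) + Θ(1) = Θ(n).

The parallelism T_1(n)/T_∞(n) therefore grows exponentially in n.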

To illustrate these concepts, we now examine a simple multithreaded matrix multiplication program. Matrix multiplication can be decomposed as follows:

    ( A11 A12 )   ( B11 B12 )   ( C11 C12 )
    (         ) = (         ) · (         )
    ( A21 A22 )   ( B21 B22 )   ( C21 C22 )

                  ( B11·C11 + B12·C21   B11·C12 + B12·C22 )
                = (                                       )
                  ( B21·C11 + B22·C21   B21·C12 + B22·C22 )

To turn this observation into code, assume we have a Matrix class, with put and get methods to access elements. This class also provides a method that splits an n-by-n matrix into four (n/2)-by-(n/2) submatrices:

    public Matrix[][] split() { ... }

In Java terminology, the four submatrices are "backed" by the original matrix, meaning that changes to the submatrices are reflected in the matrix, and vice-versa. This method can be implemented to take constant time by appropriate repositioning of indexes (left as an exercise for the reader). The code for multithreaded matrix addition appears in Figure 10.3, and multiplication in Figure 10.4.

Let A_P(n) be the number of steps needed to execute Add on P processors. The work A_1(n) is defined by the recurrence:

    A_1(n) = 4A_1(n/2) + Θ(1) = Θ(n²).

This work is the same as that of the conventional doubly-nested loop implementation. The critical-path length is also easy to compute:

    A_∞(n) = A_∞(n/2) + Θ(1) = Θ(log n).

public class Add extends Thread {
  public Matrix sum, arg;

  public Add(Matrix sum, Matrix arg) {
    this.sum = sum;
    this.arg = arg;
  }

  public void run() {
    if (sum.getDimension() == 1) {
      sum.put(0, 0, sum.get(0,0) + arg.get(0,0));
    } else {
      Matrix[][] s = this.sum.split();
      Matrix[][] a = this.arg.split();
      // create children
      Add[] child = {
        new Add(s[0][0], a[0][0]),
        new Add(s[0][1], a[0][1]),
        new Add(s[1][0], a[1][0]),
        new Add(s[1][1], a[1][1])
      };
      // start children
      for (int i = 0; i < child.length; i++)
        child[i].start();
      // join children
      for (int i = 0; i < child.length; i++) {
        try {
          child[i].join();
        } catch (InterruptedException e) {}
      }
    }
  }
}

Figure 10.3: Matrix addition

public class Mult extends Thread {
  public Matrix prod, arg0, arg1;

  public Mult(Matrix prod, Matrix arg0, Matrix arg1) {
    this.prod = prod;
    this.arg0 = arg0;
    this.arg1 = arg1;
  }

  public void run() {
    int n = prod.getDimension();
    if (n == 1) {
      prod.put(0, 0, arg0.get(0,0) * arg1.get(0,0));
    } else {
      Matrix tmp = new Matrix(n,n);
      Matrix[][] r = this.prod.split();
      Matrix[][] a = this.arg0.split();
      Matrix[][] b = this.arg1.split();
      Matrix[][] t = tmp.split();
      // create children
      Mult[] child = {
        new Mult(r[0][0], a[0][0], b[0][0]),
        new Mult(r[0][1], a[0][0], b[0][1]),
        new Mult(r[1][0], a[1][0], b[0][0]),
        new Mult(r[1][1], a[1][0], b[0][1]),
        new Mult(t[0][0], a[0][1], b[1][0]),
        new Mult(t[0][1], a[0][1], b[1][1]),
        new Mult(t[1][0], a[1][1], b[1][0]),
        new Mult(t[1][1], a[1][1], b[1][1])
      };
      // start children
      for (int i = 0; i < child.length; i++)
        child[i].start();
      // join children
      for (int i = 0; i < child.length; i++) {
        try {
          child[i].join();
        } catch (InterruptedException e) {}
      }
      // add tmp into prod (the final addition assumed by the
      // critical-path analysis below)
      Add add = new Add(prod, tmp);
      add.start();
      try {
        add.join();
      } catch (InterruptedException e) {}
    }
  }
}

Figure 10.4: Matrix multiplication

This claim follows because each of the half-size additions is performed in parallel with the others. Let M_P(n) be the number of steps needed to execute Mult on P processors. The work M_1(n) is defined by the recurrence:

    M_1(n) = 8M_1(n/2) + A_1(n)
           = 8M_1(n/2) + Θ(n²)
           = Θ(n³).

This work is also the same as that of the conventional triply-nested loop implementation. The critical-path length is:

    M_∞(n) = M_∞(n/2) + A_∞(n)
           = M_∞(n/2) + Θ(log n)
           = Θ(log² n).

This claim follows because the half-size multiplications are performed in parallel, followed by a single addition. The parallelism for the Mult program is

    M_1(n)/M_∞(n) = Θ(n³ / log² n),

which is quite high. For example, suppose we want to multiply two 1000-by-1000 matrices. Here, n³ = 10⁹, and log n = log 1000 ≈ 10 (logs are base two), so the parallelism is approximately 10⁹/10² = 10⁷. Roughly speaking, this instance of matrix multiplication could, in principle, keep roughly 10⁷ processors busy, well beyond the powers of any existing multiprocessor.

You should understand that the parallelism computed above is a highly idealized upper bound on the performance of any multithreaded matrix multiplication program. For example, when there are idle threads, it may not be easy to assign those threads to idle processors. Moreover, a program that displays less parallelism but consumes less memory may perform better because it encounters fewer page faults. The actual performance of a multithreaded computation remains a complex engineering problem, but the kind of analysis presented in this section is an indispensable first step in understanding the degree to which a problem can be solved in parallel.

10.3 Realistic Multiprocessor Scheduling

Our analysis so far has been based on the assumption that each multithreaded program has P dedicated processors. This assumption, unfortunately, does not correspond to the way shared-memory multiprocessors are used in real life. Multiprocessors typically run a mix of jobs, where jobs come and go dynamically. One might start, say, a matrix multiplication job on P processors. At some point, the operating system may decide to download mail, preempting one processor, and our job is now running on P − 1 processors. Later, the mail program pauses waiting for a disk read or write to complete, and in the interim the matrix program has P processors again.

Most operating systems provide user-level processes, where a process consists of a program counter (like a thread) and usually an address space. The operating system kernel includes a scheduler that runs user-level processes on physical processors. The mapping between processes and processors, and when the processes are scheduled, is typically not under the control of the application.

One approach is to set up a one-to-one correspondence between application-level threads and processes: creating a new thread creates a new process, and ending a thread ends that process. This approach, however, is impractical because process creation is expensive. Instead, it makes more sense to create a fixed collection of relatively long-lived processes to execute the varying collection of short-lived threads.

We end up with a three-level model. At the top level, we write multithreaded programs (such as matrix multiplication) that decompose an application into a dynamically-varying number of logical threads. At the middle level, we write a user-level scheduler that maps threads to a fixed number P of processes. At the bottom level, the kernel maps our P user-level processes onto a dynamically-varying number of processors. This last level is not under our control: applications cannot tell the kernel how to schedule their processes, and the scheduling policies of most modern operating system kernels are hidden from users anyway. Our challenge is to design a user-level scheduler that makes the best use of an unknown kernel scheduling policy.

Let us assume for now that the kernel works in discrete steps. (This discrete-step assumption is not required for correctness, but it makes the analysis easier.) At each step i, the kernel chooses an arbitrary subset of p_i user-level processes to run for one step, where 0 ≤ p_i ≤ P. The processor average P_A over T steps is defined to be

    P_A = (1/T) · Σ_{i=0}^{T−1} p_i.                      (10.1)

Instead of designing a user-level scheduler to achieve a P-fold speedup, we can try to achieve a P_A-fold speedup. A schedule is greedy if the number of program steps executed at each time step is the minimum of p_i (the number of available processors) and the number of ready nodes in the program DAG.
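For example (with illustrative numbers of our own), suppose the kernel runs p_0, p_1, p_2, p_3 = 3, 1, 2, 2 processes over T = 4 steps. Then

    P_A = (3 + 1 + 2 + 2)/4 = 2,

so even if the application created P = 4 processes, a greedy user-level scheduler should be judged against a 2-fold, not a 4-fold, speedup.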

Theorem 10.3.1 Consider a multithreaded program with work T_1, critical-path length T_∞, and P user-level processes. Any greedy execution has length T at most

    T_1/P_A + T_∞(P − 1)/P_A.

Proof: Equation 10.1 implies that

    T = (1/P_A) · Σ_{i=0}^{T−1} p_i.

We will bound T by bounding the sum of the p_i. At each kernel-level step, imagine placing tokens in one of two buckets. For each user-level process that executes a node at step i, we place a token in a work bucket, and for each scheduled process that remains idle at that step, we place a token in an idle bucket. After the last step, the work bucket contains T_1 tokens, one for each node of the computation DAG.

How many tokens does the idle bucket contain? Call a step idle if some process places a token in the idle bucket at that step. Because the schedule is greedy, and because an unfinished computation always has at least one ready node, at least one process executes a node at every step; so of the p_i processes scheduled at step i, at most p_i − 1 ≤ P − 1 can be idle.

Let G_i be the sub-DAG of the computation consisting of the nodes that have not been executed at the end of step i. Every node with in-degree 0 in G_{i−1} was ready at the start of step i. If step i is idle, there must be fewer than p_i such nodes, because otherwise the greedy schedule would have kept all p_i processes busy. It follows that at every idle step the greedy schedule executes all the ready nodes, so the longest directed path in G_i is one shorter than the longest directed path in G_{i−1}. The longest directed path before step 0 has length T_∞, so the greedy schedule can have at most T_∞ idle steps. Combining these observations shows that the idle bucket contains at most T_∞(P − 1) tokens. The total number of tokens in both buckets is therefore

    Σ_{i=0}^{T−1} p_i ≤ T_1 + T_∞(P − 1),

yielding the desired bound.

It turns out that this bound is within a factor of two of optimal. Actually computing an optimal schedule is NP-complete, so greedy schedules are a simple and practical way to get performance that is reasonably close to optimal.
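To get a feel for the theorem, we can plug in the earlier matrix-multiplication numbers (an illustrative calculation of ours): for 1000-by-1000 Mult, T_1 ≈ 10⁹ and T_∞ ≈ 10². With P = 32 processes and a processor average of P_A = 16,

    T ≤ 10⁹/16 + 10² · (32 − 1)/16 ≈ 6.25 × 10⁷ + 194,

so the idle-step term contributed by the critical path is negligible, and the execution time is essentially the work divided by the average number of available processors.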

10.4 Work Stealing

We now understand that if we keep the user-level processes supplied with work, then the resulting schedule is greedy, and our multithreaded application will achieve a pretty good speedup. Multithreaded computations, however, create and destroy threads dynamically, sometimes in unpredictable ways. We still need a way to connect ready threads with idle processes.

There are two basic approaches to keeping processes busy. In work sharing, processes distribute surplus work to other processes, with the goal of ensuring that all processes are assigned approximately the same amount of work. In work stealing, a process that runs out of work tries to "steal" work from other processes. Here, we focus on work stealing, which has the attractive feature that no inter-process synchronization is needed as long as all processes have plenty of work.

Each process keeps a pool of ready threads in the form of a double-ended queue (DEQueue), providing pushBottom, popBottom, and popTop methods (we do not need a pushTop method). The process that owns a DEQueue is called the local process for that object.

public class DEQueue {
  longRMWregister top; // tag & top thread index
  int bottom;          // bottom thread index
  Thread[] deq;        // array of threads

  public class Abort extends java.lang.Exception {};

  // extract tag field from top
  private int TAG_MASK  = 0xFFFF0000;
  private int TAG_SHIFT = 16;
  private int getTag(long i) {
    return (int)((i & TAG_MASK) >> TAG_SHIFT);
  }

  // extract index field from top
  private int INDEX_MASK  = 0x0000FFFF;
  private int INDEX_SHIFT = 0;
  private int getIndex(long i) {
    return (int)((i & INDEX_MASK) >> INDEX_SHIFT);
  }

  // combine tag and index to form new top
  private long makeTop(int tag, int index) {
    return ((long)tag << TAG_SHIFT) | ((long)index << INDEX_SHIFT);
  }

Figure 10.5: Manipulating the top field

public class DEQueue {
  longRMWregister top; // tag & top thread index
  int bottom;          // bottom thread index
  Thread[] deq;        // array of threads

...

  /**
   * called by local thread to set aside work
   **/
  public void pushBottom(Thread t) {
    this.deq[this.bottom] = t; // store object
    this.bottom++;             // advance bottom
  }

  ...
}

Figure 10.6: The pushBottom method

When the local process creates a new thread, it calls pushBottom to push that thread onto its DEQueue. When the local process needs more work, it calls popBottom to remove a thread from the DEQueue. If the local process discovers its DEQueue is empty, then it becomes a thief: it chooses a victim process at random, and calls that process's DEQueue's popTop method, attempting to pop a thread from the top of that DEQueue.

Ideally, we would like an efficient, wait-free, linearizable DEQueue implementation. In practice, we have to settle for slightly weaker conditions. Our implementation of popTop may throw an exception if a concurrent popTop call succeeds, or if a concurrent popBottom takes the last thread in the DEQueue.

The DEQueue class has three fields: top, bottom, and deq. The top field is a long integer encompassing two subfields: the high-order 16 bits constitute the tag value, while the low-order 16 bits constitute the index value. Figure 10.5 shows how to extract the tag and index values. The tag field is needed to avoid the "ABA" problem examined in the previous section.

The pushBottom method (Figure 10.6) simply stores the new thread at the bottom queue location and increments the bottom field. The popBottom method (Figure 10.7) is more complex. It tests whether the DEQueue contains more than one thread. If so, then it returns the bottom thread without performing a CAS. Otherwise, there is a danger that the bottom thread may be stolen. The method tries to set both the top and bottom fields to zero. If it succeeds, it returns the last remaining thread; otherwise, that thread has been stolen, and the method returns null. The important aspect of this protocol is that an expensive CAS operation is needed only when the DEQueue is almost empty.

public class DEQueue {
  longRMWregister top; // tag & top thread index
  int bottom;          // bottom thread index
  Thread[] deq;        // array of threads

  ...

  /**
   * Called by local thread to get more work
   **/
  Thread popBottom() {
    // is the queue empty?
    if (this.bottom == 0) // empty
      return null;
    this.bottom--;
    Thread t = this.deq[this.bottom];
    long oldTop = this.top.read();
    if (this.bottom > getIndex(oldTop))
      return t; // more than one thread left: no CAS needed
    // the queue held at most one thread: contend with thieves
    long newTop = makeTop(getTag(oldTop) + 1, 0); // advance tag to avoid ABA
    int oldBottom = this.bottom;
    this.bottom = 0;
    if (oldBottom == getIndex(oldTop))
      if (this.top.CAS(oldTop, newTop))
        return t; // won the race for the last thread
    this.top.write(newTop); // lost the race: thread was stolen
    return null;
  }
}

Figure 10.7: The popBottom method

public class DEQueue {
  longRMWregister top; // tag & top thread index
  int bottom;          // bottom thread index
  Thread[] deq;        // array of threads

  ...

  /**
   * Called by thieves to try to steal a thread
   **/
  Thread popTop() throws Abort {
    long oldTop = this.top.read();
    int bottom = this.bottom;
    if (bottom <= getIndex(oldTop)) // empty
      return null;
    Thread t = this.deq[getIndex(oldTop)];
    long newTop = makeTop(getTag(oldTop), getIndex(oldTop) + 1);
    if (this.top.CAS(oldTop, newTop))
      return t;
    throw new Abort();
  }

  ...
}

Figure 10.8: The popTop method

The popTop method (Figure 10.8) checks whether the DEQueue is empty, and if not, tries to steal the top thread by applying CAS to the top field. If the CAS succeeds, the theft is successful; otherwise the method throws an exception.
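Putting the pieces together, here is a minimal sketch of the scheduler loop each process might run. The WorkStealingWorker class name, the queues array, and the random victim choice are illustrative assumptions of ours; the chapter does not prescribe a particular loop.

import java.util.Random;

// A sketch of a per-process work-stealing scheduler loop.
public class WorkStealingWorker {
  DEQueue[] queues; // one DEQueue per process
  int me;           // index of this process's own DEQueue
  Random random = new Random();

  public void workLoop() {
    while (true) {
      Thread t = queues[me].popBottom(); // first try local work
      while (t == null) {
        // local queue empty: become a thief and pick a random victim
        int victim = random.nextInt(queues.length);
        if (victim == me)
          continue;
        try {
          t = queues[victim].popTop(); // null if the victim is empty
        } catch (DEQueue.Abort e) {
          // lost a race with a concurrent pop; try another victim
        }
      }
      t.run(); // execute the thread's steps directly in this process
    }
  }
}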

10.5 The Steal-Half Protocol

In the work-stealing protocol described in the previous section, each process maintains a local work queue, and steals an item from another process's queue if its own becomes empty. At its core is a lock-free protocol for stealing an individual item from a bounded-size queue, minimizing the need for costly CAS operations when fetching items locally. We have seen that stealing one item at a time ensures that the time needed to execute a multithreaded computation is within a constant factor of optimal. Nevertheless, there is reason to believe that the scheme can be improved by allowing the thief process to steal multiple items from the victim.

The most straightforward way to steal multiple threads is simply for the thief to call popTop multiple times. This solution is unsatisfactory because each such call requires an expensive CAS operation. This section shows how to generalize the previous section's algorithm to allow processes to steal up to half of the threads in a given queue at a time. This revised protocol preserves the key properties of the original: it is lock-free, and it minimizes the number of CAS operations that the local process needs to perform.

As before, each process has a local work queue, here called an extended double-ended queue (EDEQueue), where the local process calls the pushBottom and popBottom methods, while thieves call the stealTop method. The stealTop method can return multiple threads, not just one.

The original DEQueue implementation had a nice property: as long as there is more than one item in the DEQueue, the local process can pop from the bottom of the DEQueue without an expensive CAS operation. If there is a single item in the DEQueue, then the process needs to use CAS to synchronize with potential thieves. You should think of this CAS as a consensus protocol in which the processes decide what is to become of the contested item. For any "reasonable" sequence of k pushBottom or popBottom calls, this protocol requires a constant number of CAS operations.

If a thief process could remove up to half of the items, it may be necessary to reach consensus on the status of each item in the overlap. For any sequence of k pushBottom or popBottom calls, a protocol that removes items one at a time would require Θ(k) CAS operations, an unacceptable overhead. The extended deque algorithm we present manages to steal up to half the items and pay only Θ(log k) CAS operations for any "reasonable" sequence of k pushBottom or popBottom calls.

The extended DEQueue implementation presented here differs from the original in two ways: (1) it is implemented as a cyclic array, and (2) the top field is replaced by a field called stealRange, which defines the range of items that can be stolen atomically by a thief process.

public class EDEQueue {
  public longRMWregister stealRange; // where to steal
  int bottom;                        // bottom thread index
  Object[] deq;                      // array of threads

  private static final int QUEUE_SIZE = 32;

  // extract tag field from stealRange
  private int TAG_MASK  = 0xFFFF0000;
  private int TAG_SHIFT = 16;
  private int getTag(long i) {
    return (int)((i & TAG_MASK) >> TAG_SHIFT);
  }

  // extract top field from stealRange
  private int TOP_MASK  = 0x0000FF00;
  private int TOP_SHIFT = 8;
  private int getTop(long i) {
    return (int)((i & TOP_MASK) >> TOP_SHIFT);
  }

  // extract steal field from stealRange
  private int STEAL_MASK  = 0x000000FF;
  private int STEAL_SHIFT = 0;
  private int getSteal(long i) {
    return (int)((i & STEAL_MASK) >> STEAL_SHIFT);
  }

  // combine tag, top, and steal to form a new stealRange
  private long makeStealRange(int tag, int top, int steal) {
    return ((long)tag << TAG_SHIFT) | ((long)top << TOP_SHIFT)
         | ((long)steal << STEAL_SHIFT);
  }

  private int log2(int x) { // floor of log base two
    int result = 0;
    while (x > 1) {
      x = x >> 1;
      result++;
    }
    return result;
  }

  public int getSize() {
    return this.bottom - getTop(this.stealRange.read());
  }

  /**
   * Adjust the steal range if needed.
   * @return false if and only if a needed CAS failed
   **/
  private boolean updateStealRange(long oldStealRange) {
    int size = this.getSize();
    int logSize = log2(size); // floor of actual log
    long currentStealRange = this.stealRange.read();
    // is size a power of two, or did someone steal something?
    if (size == (1 << logSize) || oldStealRange != currentStealRange) {
      // try to update stealRange to contain max(1, 2^(logSize-1)) threads
      int newSize = Math.max(1, (1 << (logSize - 1)));
      int top = getTop(currentStealRange);
      int tag = getTag(currentStealRange);
      return this.stealRange.CAS(currentStealRange,
                                 makeStealRange(tag + 1, top, top + newSize));
    }
    return true;
  }
  ...
}

Figure 10.9: Methods for manipulating stealRange

Figure 10.10: The extended DEQueue

The stealRange field has three subfields: tag is used to avoid the ABA problem, top is the index of the thread at the top of the queue, and steal is the index of the last thread to be stolen. A local process updates the stealRange field only when

• The number of items in the EDEQueue becomes a power of two, or

• A successful steal has occurred since the last time the process observed the EDEQueue.

Figure 10.9 illustrates the methods for manipulating the object's stealRange field. The EDEQueue object has one additional field: deq is an array of size QUEUE_SIZE that stores threads. The range of occupied entries in the deq array is the half-open range [top, bottom) modulo QUEUE_SIZE. For simplicity, the queue's top and bottom counters are incremented without bound, but are used modulo QUEUE_SIZE when indexing into the array. The bottom field points to the entry following the last entry containing a thread. If bottom and top are equal, the EDEQueue is empty.

Each process keeps track of its own prevStealRange value, of the same type as stealRange, which the process uses to determine whether a steal has occurred since its last method call.
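As a concrete illustration of updateStealRange (with numbers of our own choosing): suppose top = 10 and bottom = 18, so the queue holds

    size = 18 − 10 = 8 = 2³

threads. Because the size is a power of two, the local process tries to install a range of max(1, 2^(3−1)) = 4 threads, so the stealable range becomes the four threads at indexes 10 through 13, leaving the threads at indexes 14 through 17 for the exclusive use of the local process.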

/**
 * called by local thread to set aside work
 **/
public void pushBottom(Thread t, long oldStealRange) throws Full {
  if (this.getSize() == QUEUE_SIZE)
    throw new Full();
  this.deq[this.bottom % QUEUE_SIZE] = t; // store object
  this.bottom++;                          // advance bottom
  updateStealRange(oldStealRange);
}

Figure 10.11: Code for pushBottom

When and how to steal work is a policy decision best made by the individual application. Reasonable policies include the following:

• Try to steal only when the local DEQueue is empty (steal-on-empty).

• Try to steal probabilistically, with the probability decreasing as the number of items in the DEQueue increases (probabilistic balancing); see the sketch after this list.

• Try to steal whenever the number of items in the DEQueue increases or decreases by a constant factor since the last steal attempt.
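As an example of the second policy, a process might consult a helper like the following before each popBottom. The method name and the exact probability 1/(size + 1) are illustrative assumptions, not prescribed by the text.

// A sketch of probabilistic balancing: steal with probability that
// decreases as the local queue grows.
boolean shouldTrySteal(int size, java.util.Random random) {
  if (size == 0)
    return true;                        // out of work: always steal
  return random.nextInt(size + 1) == 0; // steal with probability 1/(size+1)
}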

The local process calls the pushBottom method (illustrated in Figure 10.11) whenever it needs to insert a new thread into its local EDEQueue. If the EDEQueue is not full, the new thread is placed in the entry indexed by bottom, bottom is incremented, and the process checks whether the stealRange field needs to be updated. The updateStealRange method tries to reset the stealRange field if either the queue size is a power of two, or the stealRange field has changed since the last time the process examined the object (meaning that some threads have been stolen).

The local process calls popBottom (illustrated in Figure 10.12) when it needs to consume another thread. If the queue is empty, the method returns null. It then calls updateStealRange to check the stealRange field. If that method detects that the field should be updated, but fails to complete the update, then popBottom throws an exception. Otherwise, the method pops off a thread and checks stealRange. If that range does not include the thread, then the method returns it, since no thief could have taken the thread. Otherwise, there are two possibilities: the queue is empty and the thread was stolen, or the queue contains a single thread. The method tries to update the stealRange field with an empty value. If it fails, it returns null, and if it succeeds, it returns the thread.

public Object popBottom(long prevStealRange) throws Abort {
  if (this.getSize() == 0)
    return null;
  boolean ok = updateStealRange(prevStealRange);
  if (!ok)
    throw new Abort();

  this.bottom--;
  Object t = this.deq[this.bottom % QUEUE_SIZE];
  long oldStealRange = this.stealRange.read();

  int rangeTop = getTop(oldStealRange);
  int rangeBot = getSteal(oldStealRange);
  if (this.bottom > rangeTop) {
    return t; // no need to synchronize
  } else if (rangeTop == rangeBot) { // oldStealRange is empty
    this.bottom = 0; // last thread already stolen
    return null;
  } else {
    // try to make stealRange empty
    long currentStealRange = this.stealRange.read();
    int tag = getTag(currentStealRange);
    int bot = getSteal(currentStealRange);
    if (this.stealRange.CAS(currentStealRange,
                            makeStealRange(tag + 1, bot + 1, bot))) {
      return t;    // thread was not stolen
    } else {
      return null; // thread was stolen
    }
  }
}

Figure 10.12: Code for popBottom

public int stealTop(EDEQueue victim) {
  long oldStealRange = victim.stealRange.read();
  int oldSteal = getSteal(oldStealRange);
  int oldTop = getTop(oldStealRange);
  int oldTag = getTag(oldStealRange);
  int rangeLen = oldSteal - oldTop;
  // figure out how much we can steal
  int capacity = QUEUE_SIZE - this.getSize();
  int numToSteal = Math.min(capacity, rangeLen);
  // tentatively copy stolen threads
  for (int i = 0; i < numToSteal; i++)
    this.deq[(this.bottom + i) % QUEUE_SIZE] =
      victim.deq[(oldTop + i) % QUEUE_SIZE];
  // try to make the theft complete by advancing the victim's top
  long newStealRange =
    makeStealRange(oldTag + 1, oldTop + numToSteal, oldSteal);
  if (victim.stealRange.CAS(oldStealRange, newStealRange)) {
    this.bottom += numToSteal; // make stolen threads visible to the thief
    this.updateStealRange(0);  // adjust thief's own steal range
    return numToSteal;
  }
  return 0;
}

Figure 10.13: Code for stealTop

A process calls the stealTop method (illustrated in Figure 10.13) to steal threads from another process's EDEQueue. The method first computes how many threads it can steal by comparing the victim's stealRange with the excess capacity of the thief's deq array. It then tentatively copies that many threads from the victim to the thief, but without yet updating either the thief's bottom field or the victim's stealRange field. The method then calls CAS to adjust the victim's stealRange field to reflect the missing items. If the CAS succeeds, the thief updates its own bottom field, followed by its own stealRange, and returns the number of threads stolen. If the CAS fails, the theft has failed, and the method returns zero.

Note that if another process succeeds in stealing concurrently from the thief, then the thief may fail to update its own stealRange, but it will not be prevented from updating its bottom field and completing the theft.
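As a concrete trace (with numbers of our own choosing): suppose the victim's stealRange has top = 4 and steal = 8, so

    rangeLen = 8 − 4 = 4,

and the thief's deq has 30 free entries. Then numToSteal = min(30, 4) = 4; the thief copies the threads at victim indexes 4 through 7 to the bottom of its own deq, and a successful CAS advances the victim's top to 8, leaving the victim's steal range empty until its next call to updateStealRange.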

10.6 Course Notes

This chapter adapted material from Leiserson and Prokop's minicourse on multithreaded programming [3]. The DAG-based model for the analysis of multithreaded computations was formalized by Blumofe and Leiserson [2], who also gave the first deque-based work-stealing implementation; their deque, however, was not lock-free. Theorem 10.3.1 and its proof first appeared in [1]. The steal-half protocol is a simplified version of a protocol due to Hendler and Shavit [4].

Bibliography

[1] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In ACM Symposium on Parallel Algorithms and Architectures, 1998.

[2] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 356–368, Santa Fe, New Mexico, November 1994.

[3] C. Leiserson and H. Prokop. A minicourse on multithreaded programming. ftp://theory.lcs.mit.edu/pub/cilk/minicourse.ps.gz.

[4] D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the 21st Annual ACM Symposium on Principles of Distributed Computing, 2002.
