The Data Locality of Work Stealing
Umut A. Acar, School of Computer Science, Carnegie Mellon University ([email protected])
Guy E. Blelloch, School of Computer Science, Carnegie Mellon University ([email protected])
Robert D. Blumofe, Department of Computer Sciences, University of Texas at Austin ([email protected])
Abstract

This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation.

As a lower bound, we show that there is a family of multithreaded computations $G_n$, each member of which requires $\Theta(n)$ total instructions (work), for which, when using work stealing, the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is $\Omega(n)$. This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on $P$ processors the expected additional number of cache misses beyond those on a single processor is bounded by $O(C \lceil m/s \rceil P T_\infty)$, where $m$ is the execution time of an instruction incurring a cache miss, $s$ is the steal time, $C$ is the size of the cache, and $T_\infty$ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing.

For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional work loads but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, locality-guided work stealing improves the performance of work stealing up to 80%.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

SPAA 2000, Bar Harbor, Maine USA © ACM 2000 1-58113-185-2/00/07...$5.00

1 Introduction

Many of today's parallel applications use sophisticated, adaptive algorithms which are best realized with parallel programming systems that support dynamic, lightweight threads such as Cilk [8], Nesl [5], Hood [10], and many others [3, 16, 17, 21, 32]. The core of these systems is a thread scheduler that balances load among the processes. In addition to a good load balance, however, good data locality is essential in obtaining high performance from modern parallel systems.

Several researchers have studied techniques to improve the data locality of multithreaded programs. One class of such techniques is based on software-controlled distribution of data among the local memories of a distributed shared memory system [15, 22, 26]. Another class of techniques is based on hints supplied by the programmer so that "similar" tasks might be executed on the same processor [15, 31, 34]. Both these classes of techniques rely on the programmer or compiler to determine the data access patterns in the program, which may be very difficult when the program has complicated data access patterns. Perhaps the earliest class of techniques was to attempt to execute threads that are close in the computation graph on the same processor [1, 9, 20, 23, 26, 28]. The work-stealing algorithm is the most studied of these techniques [9, 11, 19, 20, 24, 36, 37]. Blumofe et al. showed that fully-strict computations achieve provably good data locality [7] when executed with the work-stealing algorithm on a dag-consistent distributed shared memory system. In recent work, Narlikar showed that work stealing improves the performance of space-efficient multithreaded applications by increasing the data locality [29]. None of this previous work, however, has studied upper or lower bounds on the data locality of multithreaded computations executed on existing hardware-controlled shared memory systems.

In this paper, we present theoretical and experimental results on the data locality of work stealing on hardware-controlled shared memory systems (HSMSs). Our first set of results are upper and lower bounds on the number of cache misses in multithreaded computations executed by the work-stealing algorithm. Let $M_1(C)$ denote the number of cache misses in the uniprocessor execution and $M_P(C)$ denote the number of cache misses in a $P$-processor execution of a multithreaded computation by the work-stealing algorithm on an HSMS with cache size $C$. Then, for a multithreaded computation with $T_1$ work (total number of instructions) and $T_\infty$ critical path (longest sequence of dependences), we show the following results for the work-stealing algorithm running on an HSMS.

- Lower bounds on the number of cache misses for general computations: We show that there is a family of computations $G_n$ with $T_1 = \Theta(n)$ such that $M_1(C) = 3C$ while even on two processors the number of misses $M_2(C) = \Omega(n)$.

- Upper bounds on the number of cache misses for nested-parallel computations: For a nested-parallel computation, we show that $M_P(C) \le M_1(C) + 2Cs$, where $s$ is the number of steals in the $P$-processor execution. We then show that the expected number of steals is $O(\lceil m/s \rceil P T_\infty)$, where $m$ is the time for a cache miss and $s$ is the time for a steal.

- Upper bound on the execution time of nested-parallel computations: We show that the expected execution time of a nested-parallel computation on $P$ processors is $O(T_1(C)/P + m \lceil m/s \rceil C T_\infty + (m + s) T_\infty)$, where $T_1(C)$ is the uniprocessor execution time of the computation including cache misses.
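The deque-based work-stealing discipline that these bounds analyze can be sketched in code. The following is a minimal sequential simulation, not the implementation measured in the paper: each process owns a deque of ready nodes, works from the bottom, and steals from the top of a uniformly random victim when its own deque is empty. The dag is simplified to a spawn tree (no join dependencies), and the names (`work_steal_schedule`, `children`, `roots`) are illustrative.

```python
import random
from collections import deque

def work_steal_schedule(tasks, children, roots, P, seed=0):
    """Sequentially simulate P processes running deque-based work stealing
    over a spawn tree: pop work from the bottom of the local deque, and
    steal from the top of a random victim's deque when the deque is empty."""
    rng = random.Random(seed)
    deques = [deque() for _ in range(P)]
    for i, r in enumerate(roots):
        deques[i % P].append(r)
    executed = []                # nodes in simulated execution order
    remaining = len(tasks)
    while remaining:
        for p in range(P):
            if deques[p]:
                node = deques[p].pop()           # work from the bottom
            else:
                victim = rng.randrange(P)        # uniformly random victim
                if victim == p or not deques[victim]:
                    continue                     # failed steal attempt
                node = deques[victim].popleft()  # steal from the top
            executed.append(node)
            remaining -= 1
            for c in children.get(node, ()):     # enabled nodes go on the
                deques[p].append(c)              # bottom of the local deque
    return executed
```

Running this on a small spawn tree executes every node exactly once; the steal pattern, and hence the data locality, varies with the random victim choices, which is precisely what locality-guided work stealing later constrains.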
We are interested in the performance of work-stealing computations on hardware-controlled shared memory systems (HSMSs). We model an HSMS as a group of identical processors, each of which has its own cache, together with a single shared memory. Each cache contains $C$ blocks and is managed by the memory subsystem automatically. We allow for a variety of cache organizations and replacement policies, including both direct-mapped and associative caches. We assign a server process to each processor and associate the cache of a processor with the process assigned to that processor. One limitation of our work is that we assume that there is no false sharing.

As in previous work [6, 9], we represent a multithreaded computation as a directed, acyclic graph (dag) of instructions. Each node in the dag represents a single instruction and the edges represent ordering constraints. A nested-parallel computation [5, 6] is a race-free computation that can be represented with a series-parallel dag [33]. Nested-parallel computations include computations consisting of parallel loops and fork and joins and any nesting of them. This class includes most computations that can be expressed in Cilk [8], and all computations that can be expressed in Nesl [5]. Our results show that nested-parallel computations have much better locality characteristics under work stealing than do general computations. We also briefly consider another class of computations, computations with futures [12, 13, 14, 20, 25], and show that they can be as bad as general computations.

The second part of our results concerns further improving the data locality of multithreaded computations with work stealing. In work stealing, a processor steals a thread from a randomly (with uniform distribution) chosen processor when it runs out of work. In certain applications, such as iterative data-parallel applications, random steals may cause poor data locality. Locality-guided work stealing is a heuristic modification to work stealing that allows a thread to have an affinity for a process. In locality-guided work stealing, when a process obtains work it gives priority to a thread that has affinity for the process. Locality-guided work stealing can be used to implement a number of techniques that researchers have suggested to improve data locality. For example, the programmer can achieve an initial distribution of work among the processes, or schedule threads based on hints, by appropriately assigning affinities to threads in the computation.

Our preliminary experiments with locality-guided work stealing give encouraging results, showing that for certain applications the performance is very close to that of static partitioning in dedicated mode (i.e., when the user can lock down a fixed number of processors), but does not suffer the performance cliff problem [10] in multiprogrammed mode (i.e., when processors might be taken by other users or the OS). Figure 1 shows a graph comparing work stealing, locality-guided work stealing, and static partitioning for a simple over-relaxation algorithm on a 14-processor Sun Ultra Enterprise. The over-relaxation algorithm iterates over a 1-dimensional array, performing a 3-point stencil computation on each step. The superlinear speedup for static partitioning and locality-guided work stealing is due to the fact that the data for each run does not fit into the L2 cache of one processor but fits into the collective L2 cache of 6 or more processors. For this benchmark, the following can be seen from the graph.

1. Locality-guided work stealing does significantly better than standard work stealing, since on each step the cache is pre-warmed with the data it needs.

2. Locality-guided work stealing does approximately as well as static partitioning for up to 14 processes.

3. When trying to schedule more than 14 processes on 14 processors, static partitioning has a serious performance drop. The initial drop is due to load imbalance caused by the coarse-grained partitioning. The performance then approaches that of work stealing as the partitioning gets more fine-grained.

Figure 1: The speedup obtained by three different over-relaxation algorithms (standard work stealing, locality-guided work stealing, and static partitioning), plotted as speedup against the number of processes.

2 Related Work

As mentioned in Section 1, there are three main classes of techniques that researchers have suggested to improve the data locality of multithreaded programs. In the first class, the program data is distributed among the nodes of a distributed shared-memory system by the programmer, and a thread in the computation is scheduled on the node that holds the data that the thread accesses [15, 22, 26]. In the second class, data-locality hints supplied by the programmer are used in thread scheduling [15, 31, 34]. Techniques from both classes are employed in distributed shared memory systems such as COOL and Illinois Concert [15, 22] and are also used to improve the data locality of sequential programs [31]. However, the first class of techniques does not apply directly to HSMSs, because HSMSs do not allow software-controlled distribution of data among the caches. Furthermore, both classes of techniques rely on the programmer to determine the data access patterns in the application and thus may not be appropriate for applications with complex data-access patterns.

The third class of techniques, which is based on the execution of threads that are close in the computation graph on the same process, is applied in many scheduling algorithms including work stealing [1, 9, 19, 23, 26, 28]. Blumofe et al. showed bounds on the number of cache misses in a fully-strict computation executed by the work-stealing algorithm under the dag-consistent distributed shared memory of Cilk [7]. Dag consistency is a relaxed memory-consistency model that is employed in the distributed shared-memory implementation of the Cilk language. In a distributed Cilk application, processes maintain the dag consistency by means of the BACKER algorithm. In [7], Blumofe et al. bound the number of shared-memory cache misses in a distributed Cilk application for caches that are maintained with the LRU replacement policy. They assumed that accesses to the shared memory are distributed uniformly and independently, which is not generally true because threads may concurrently access the same pages by algorithm design. Furthermore, they assumed that processes do not generate steal attempts frequently, by making processes do additional page transfers before they attempt to steal from another process.
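The over-relaxation benchmark just described can be sketched as follows. This is an illustrative reconstruction, not the benchmark's source: the paper specifies only a 3-point stencil over a 1-dimensional array, so the equal-weight averaging kernel and the function names are assumptions.

```python
def relax_step(a):
    """One over-relaxation step: a 3-point stencil that replaces each
    interior element with the average of itself and its two neighbors,
    holding the boundary elements fixed."""
    return [a[0]] + [(a[i - 1] + a[i] + a[i + 1]) / 3.0
                     for i in range(1, len(a) - 1)] + [a[-1]]

def relax(a, steps):
    """Iterate the stencil. Under static partitioning, each process owns a
    fixed contiguous block of the array on every step; locality-guided work
    stealing recovers a similar assignment by giving the thread for each
    block an affinity for the process whose cache already holds the block."""
    for _ in range(steps):
        a = relax_step(a)
    return a
```

Because each step reuses the same array blocks, whichever scheduler keeps a block on the same processor across steps runs out of a warm cache, which is the effect visible in Figure 1.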
3 The Model

In this section, we present a graph-theoretic model for multithreaded computations, describe the work-stealing algorithm, define series-parallel and nested-parallel computations, and introduce our model of an HSMS (hardware-controlled shared-memory system).

As with previous work [6, 9], we represent a multithreaded computation as a directed acyclic graph, a dag, of instructions (see Figure 2). Each node in the dag represents an instruction and the edges represent ordering constraints. There are three types of edges: continuation, spawn, and dependency edges. A thread is a sequential ordering of instructions, and the nodes that correspond to the instructions are linked in a chain by continuation edges. A spawn edge represents the creation of a new thread and goes from the node representing the instruction that spawns the new thread to the node representing the first instruction of the new thread. A dependency edge from instruction $i$ of a thread to instruction $j$ of some other thread represents a synchronization between the two instructions such that instruction $j$ must be executed after $i$. We draw spawn edges with thick straight arrows, dependency edges with curly arrows, and continuation edges with straight arrows throughout this paper. Also, we show paths with wavy lines.

Figure 2: A dag (directed acyclic graph) for a multithreaded computation. Threads are shown as gray rectangles.

For a computation with an associated dag $G$, we define the computational work, $T_1$, as the number of nodes in $G$ and the critical path, $T_\infty$, as the number of nodes on the longest path of $G$.

Let $u$ and $v$ be any two nodes in a dag. Then we call $u$ an ancestor of $v$, and $v$ a descendant of $u$, if there is a path from $u$ to $v$. Any node is its own descendant and ancestor. We say that two nodes are relatives if there is a path from one to the other; otherwise we say that the nodes are independent. The children of a node are independent, because otherwise the edge from the node to one child would be redundant. We call a common descendant $y$ of $u$ and $v$ a merger of $u$ and $v$ if the paths from $u$ to $y$ and from $v$ to $y$ have only $y$ in common. We define the depth of a node $u$ as the number of edges on the shortest path from the root node to $u$. We define the least common ancestor of $u$ and $v$ as the ancestor of both $u$ and $v$ with maximum depth. Similarly, we define the greatest common descendant of $u$ and $v$ as the descendant of both $u$ and $v$ with minimum depth. An edge $(u, v)$ is redundant if there is a path between $u$ and $v$ that does not contain the edge $(u, v)$. The transitive reduction of a dag is the dag with all the redundant edges removed.

In this paper we are only concerned with the transitive reductions of the computational dags. We also require that the dags have a single node with in-degree 0, the root, and a single node with out-degree 0, the final node.

In a multiprocess execution of a multithreaded computation, independent nodes can execute at the same time. If two independent nodes read or modify the same data, we say that they are RR or WW sharing, respectively. If one node is reading and the other is modifying the data, we say they are RW sharing. RW or WW sharing can cause data races, and the output of a computation with such races usually depends on the scheduling of nodes. Such races are typically indicative of a bug [18]. We refer to computations that do not have any RW or WW sharing as race-free computations. In this paper we consider only race-free computations.

The work-stealing algorithm is a thread scheduling algorithm for multithreaded computations. The idea of work stealing dates back to the research of Burton and Sleep [11] and has been studied extensively since then [2, 9, 19, 20, 24, 36, 37]. In the work-stealing algorithm, each process maintains a pool of ready threads and obtains work from its pool. When a process spawns a new thread, the process adds the thread into its pool. When a process runs out of work and finds its pool empty, it chooses a random process as its victim and tries to steal work from the victim's pool.

In our analysis, we imagine the work-stealing algorithm operating on individual nodes in the computation dag rather than on the threads. Consider a multithreaded computation and its execution by the work-stealing algorithm. We divide the execution into discrete time steps such that at each step, each process is either working on a node, which we call the assigned node, or is trying to steal work. The execution of a node takes 1 time step if the node does not incur a cache miss and $m$ steps otherwise. We say that a node is executed at the time step that a process completes executing the node. The execution time of a computation is the number of time steps that elapse between the time step at which a process starts executing the root node and the time step at which the final node is executed. The execution schedule specifies the activity of each process at each time step.

During the execution, each process maintains a deque (doubly ended queue) of ready nodes; we call the ends of a deque the top and the bottom. When a node, $u$, is executed, it enables some other node $v$ if $u$ is the last parent of $v$ that is executed. We call the edge $(u, v)$ an enabling edge and $u$ the designated parent of $v$. When a process executes a node that enables other nodes, one of the enabled nodes becomes the assigned node and the process pushes the rest onto the bottom of its deque. If no node is enabled, then the process obtains work from its deque by removing a node from the bottom of the deque. If a process finds its deque empty, it becomes a thief and steals from a randomly chosen process, the victim. This is a steal attempt and takes at least $s$ and at most $ks$ time steps, for some constant $k \ge 1$, to complete. A thief process might make multiple steal attempts before succeeding, or might never succeed. When a steal succeeds, the thief process starts working on the stolen node at the step following the completion of the steal. We say that a steal attempt occurs at the step it completes.

The work-stealing algorithm can be implemented in various ways. We say that an implementation of work stealing is deterministic if, whenever a process enables other nodes, the implementation always chooses the same node as the assigned node for the next step on that process, and the remaining nodes are always placed in the deque in the same order. This must be true for both multiprocess and uniprocess executions. We refer to a deterministic implementation of the work-stealing algorithm together with the HSMS that runs the implementation as a work stealer. For brevity, we refer to an execution of a multithreaded computation with a work stealer as an execution. We define the total work as the number of steps taken by a uniprocess execution, including the cache misses, and denote it by $T_1(C)$, where $C$ is the cache size. We denote the number of cache misses in a $P$-process execution with $C$-block caches as $M_P(C)$. We define the cache overhead of a $P$-process execution as $M_P(C) - M_1(C)$, where $M_1(C)$ is the number of misses in the uniprocess execution on the same work stealer.

We refer to a multithreaded computation for which the transitive reduction of the corresponding dag is series-parallel [33] as a series-parallel computation. A series-parallel dag $G(V, E)$ is a dag with two distinguished vertices, a source $s \in V$ and a sink $t \in V$, and can be defined recursively as follows (see Figure 3).

- Base: $G$ consists of a single edge connecting $s$ to $t$.

- Series composition: $G$ consists of two series-parallel dags $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ with disjoint edge sets such that $s$ is the source of $G_1$, $u$ is the sink of $G_1$ and the source of $G_2$, and $t$ is the sink of $G_2$. Moreover, $V_1 \cap V_2 = \{u\}$.

- Parallel composition: The graph consists of two series-parallel dags $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ with disjoint edge sets such that $s$ and $t$ are the source and the sink of both $G_1$ and $G_2$. Moreover, $V_1 \cap V_2 = \{s, t\}$.

Figure 3: Illustrates the recursive definition of series-parallel dags. Figure (a) is the base case, figure (b) depicts the series composition, and figure (c) depicts the parallel composition.

A nested-parallel computation is a race-free series-parallel computation [6].

We also consider multithreaded computations that use futures [12, 13, 14, 20, 25]. The dag structures of computations with futures are defined elsewhere [4]. This is a superclass of nested-parallel computations, but still much more restrictive than general computations. The work-stealing algorithm for futures is a restricted form of the work-stealing algorithm, where a process starts executing a newly created thread immediately, putting its assigned thread onto its deque.

In our analysis, we consider several cache organizations and replacement policies for an HSMS. We model a cache as a set of (cache) lines, each of which can hold the data belonging to a memory block (a consecutive, typically small, region of memory). One instruction can operate on at most one memory block. We say that an instruction accesses a block, or the line that contains the block, when the instruction reads or modifies the block. We say that an instruction overwrites a line that contains a block $b$ when the instruction accesses some other block that replaces $b$ in the cache. We say that a cache replacement policy is simple if it satisfies two conditions. First, the policy is deterministic. Second, whenever the policy decides to overwrite a cache line, $l$, it makes the decision to overwrite $l$ by using only information pertaining to the accesses that are made after the last access to $l$. We refer to a cache managed with a simple cache-replacement policy as a simple cache. Simple caches and replacement policies are common in practice. For example, the least-recently-used (LRU) replacement policy, direct-mapped caches, and set-associative caches where each set is maintained by a simple cache-replacement policy are all simple.

In regard to the definition of RW or WW sharing, we assume that reads and writes pertain to the whole block. This means we do not allow for false sharing, in which two processes accessing different portions of a block invalidate the block in each other's caches. In practice, false sharing is an issue, but it can often be avoided with knowledge of the underlying memory system and by appropriately padding the shared data to prevent two processes from accessing different portions of the same block.

4 General Computations

In this section, we show that the cache overhead of a multiprocess execution of a general computation or a computation with futures can be large even though the uniprocess execution incurs a small number of misses.

Theorem 1 There is a family of computations $\{G_n : n = kC, \text{for } k \in \mathbb{Z}^+\}$ with $O(n)$ computational work, whose uniprocess execution incurs $3C$ misses while any 2-process execution of the computation incurs $\Omega(n)$ misses on a work stealer with a cache size of $C$, assuming that $S = O(C)$, where $S$ is the maximum steal time.

Figure 4: The structure of the dag of a computation with a large cache overhead.

Proof: Figure 4 shows the structure of a dag, $G_{4C}$, for $n = 4C$. Each node except the root node represents a sequence of $C$ instructions accessing a set of $C$ distinct memory blocks. The root node represents $C + S$ instructions that access $C$ distinct memory blocks. The graph has two symmetric components, $L_{4C}$ and $R_{4C}$, which correspond to the left and the right subtrees of the root, excluding the leaves. We partition the nodes in $G_{4C}$ into three classes such that all nodes in a class access the same memory blocks, while nodes from different classes access mutually disjoint sets of memory blocks. The first class contains the root node only, the second class contains all the nodes in $L_{4C}$, and the third class contains the rest of the nodes, which are the nodes in $R_{4C}$ and the leaves of $G_{4C}$. For general $n = kC$, $G_n$ can be partitioned into $L_n$, $R_n$, the leaves of $G_n$, and the root similarly. Each of $L_n$ and $R_n$ contains $\lceil k/2 \rceil - 1$ nodes and has the structure of a complete binary tree with additional leaves at the lowest level. There is a dependency edge from the leaves of both $L_n$ and $R_n$ to the leaves of $G_n$.

Consider a work stealer that executes the nodes of $G_n$ in the order that they are numbered in a uniprocess execution. In the uniprocess execution, no node in $L_n$ incurs a cache miss except the root of $L_n$, since all nodes in $L_n$ access the same memory blocks as the root of $L_n$. The same argument holds for $R_n$ and the leaves of $G_n$. Hence the execution of the nodes in $L_n$, $R_n$, and the leaves causes $2C$ misses, and since the root node causes $C$ misses, the total number of misses in the uniprocess execution is $3C$. Now, consider a 2-process execution with the same work stealer and call the processes process 0 and process 1. At time step 1, process 0 starts executing the root node, which enables the root of $R_n$ no later than time step $m$. Since process 1 starts stealing immediately and there are no other processes to steal from, process 1 steals and starts working on the root of $R_n$ no later than time step $m + S$. Hence, the root of $R_n$ executes before the root of $L_n$, and thus all the nodes in $R_n$ execute before the corresponding symmetric nodes in $L_n$. Therefore, for any leaf of $G_n$, the parent that is in $R_n$ executes before the parent in $L_n$, so each leaf node of $G_n$ is executed immediately after its parent in $L_n$ and thus causes $C$ cache misses. Thus, the total number of cache misses is $\Omega(kC) = \Omega(n)$.
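The simple caches used in the counting arguments above can be made concrete. Below is a minimal, fully associative LRU miss counter; LRU is one instance of a simple replacement policy, and the class name and interface are illustrative rather than taken from the paper.

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache with C lines and least-recently-used
    replacement; counts the misses of a sequence of block accesses."""
    def __init__(self, C):
        self.C = C
        self.lines = OrderedDict()   # blocks currently cached, in LRU order
        self.misses = 0

    def access(self, block):
        if block in self.lines:
            self.lines.move_to_end(block)        # hit: refresh recency
        else:
            self.misses += 1                     # miss: fetch the block
            if len(self.lines) == self.C:
                self.lines.popitem(last=False)   # evict the LRU line
            self.lines[block] = True
```

For example, with C = 2 the access sequence 1, 2, 1, 3, 2 incurs four misses: only the second access to block 1 hits.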
1
l X l i
1 1
2 . Consider and let be a cache line. Let be the first instruc-
l l 2
tion that accesses or overwrites 1 . Let be the cache line that
X l 1
the same instruction accesses or overwrites in 2 and map to
l 2 l 1
11 2 . Since the caches are simple, an instruction that overwrites in
X l X
2 2
1 overwrites in . Therefore the number of misses that over-
l X l
1 2
3 6 12 13 writes 1 in is equal to the number of misses that overwrites
X i i 1
in 2 after instruction . Since itself can cause miss, the number
l X 1 1
487 14 of misses that overwrites 1 in is at most more than the num-
l X 2
ber of misses that overwrites 2 in . We construct the mapping X 5 10 9 for each cache line in 1 in the same way. Now, let us show that
the mapping is one-to-one. For the sake of contradiction, assume
l l X X
2 1 2
that two cache lines, 1 and ,in map to the same line in .
i i 2 Let 1 and be the first instructions accessing the cache lines in
Figure 5: The structure for dag of a computation with futures that
X i i i i
1 2 1 2 1 such that is executed before . Since and map to the
can incur a large cache overhead.
X i 2
same line in 2 and the caches are simple, accesses the line that
i X l = l
1 1 2
1 accessesin but then , a contradiction. Hence, the total
X C
number of cache misses in 1 is at most more than the misses
There exists computations similar to the computation in Fig- X
in 2 . ure 4 that generalizes Theorem 1 for arbitrary number of processes
by making sure that all the processes but 2 steal throughout any multiprocess execution. Even in the general case, however, where Theorem 3 Let D denote the total number of drifted nodes in an
the average parallelism is higher than the number of processes, execution of a nested-parallel computation with a work stealer on C Theorem 1 can be generalized with the same bound on expected P processes, each of which has a simple cache with words. Then
the cache overhead of the execution is at most CD. G number of cache misses by exploiting the symmetry in n and by
assuming a symmetrically distributed steal-time. With a symmetri-
X P X 1
Proof: Let P denote the -process execution and let be the cally distributed steal-time, for any , a steal that takes steps more uniprocess execution of the same computation with the same work than mean steal-time is equally likely to happen as a steal that takes
stealer. We divide the multiprocess computation into D pieces each
less steps than the mean. Theorem 1 holds for computations with
of which can incur at most C more misses than in the uniprocess
futures as well. Multithreaded computing with futures is a fairly q execution. Let u be a drifted node let be the process that executes
restricted form of multithreaded computing compared to comput-
v q u. Let be the next drifted node executed on (or the final node ing with events such as synchronization variables. The graph F in
of the computation). Let the ordered set O represent the execution
Figure 5 shows the structure of a dag, whose 2-process execution u order of all the nodes that are executed after u ( is included) and
causes large number of cache misses. In a 2-process execution of
v q before v ( is excluded if it is drifted, included otherwise) on in
F , the enabling parent of the leaf nodes in the right subtree of the
X O
P . Then nodes in are executed on the same process and in the
root are in the left subtree, and therefore the execution of each such leaf node causes C misses.

5 Nested-Parallel Computations

In this section, we show that the cache overhead of an execution of a nested-parallel computation with a work stealer is at most twice the product of the number of steals and the cache size. Our proof has two steps. First, we show that the cache overhead is bounded by the product of the cache size and the number of nodes that are executed "out of order" with respect to the uniprocess execution order. Second, we prove that the number of such out-of-order executions is at most twice the number of steals.

Consider a computation G and its P-process execution, X_P, with a work stealer, and the uniprocess execution, X_1, with the same work stealer. Let v be a node in G and let u be the node that executes immediately before v in X_1. Then we say that v is drifted in X_P if u is not executed immediately before v by the process that executes v in X_P.

Lemma 2 establishes a key property of an execution with simple caches.

Lemma 2 Consider a process with a simple cache of C blocks. Let X_1 denote the execution of a sequence of instructions on the process starting with cache state S_1, and let X_2 denote the execution of the same sequence of instructions starting with cache state S_2. Then X_2 incurs at most C more misses than X_1.

Proof: We construct a one-to-one mapping between the cache lines in X_1 and X_2 such that an instruction that accesses a line l_1 in X_1 accesses the line l_2 in X_2 if and only if l_1 is mapped to l_2. Under such a mapping the two executions differ only in the misses caused by the initial contents of the C mapped lines, and therefore X_2 incurs at most C more misses than X_1.

Theorem 3 Let D denote the number of drifted nodes in a P-process execution X_P of a nested-parallel computation with a work stealer on processes with simple caches of C blocks each. Then the cache overhead of X_P is at most CD.

Proof: Partition the instructions of the computation into sequences such that each drifted node starts a new sequence. Each such sequence O of instructions is executed by a single process, say q, and its instructions execute in the same order in both X_1 and X_P. Now consider the number of cache misses during the execution of the nodes in O in X_1 and X_P. Since the computation is nested parallel, and therefore race free, a process that executes in parallel with q does not cause q to incur cache misses due to sharing. Therefore, by Lemma 2, during the execution of the nodes in O the number of cache misses in X_P is at most C more than the number of misses in X_1. This bound holds for each of the D sequences of instructions O corresponding to drifted nodes. Since the sequence starting at the root node and ending at the first drifted node incurs the same number of misses in X_1 and X_P, X_P takes at most CD more misses than X_1, and the cache overhead is at most CD.

Lemma 2 (and thus Theorem 3) does not hold for caches that are not simple. For example, consider the execution of a sequence of instructions on a cache with the least-frequently-used replacement policy, starting at two different cache states. In the first cache state, the blocks that are frequently accessed by the instructions are in the cache with high frequencies, whereas in the second cache state, the blocks that are in the cache are not accessed by the instructions and have low frequencies. The execution starting with the second cache state therefore incurs many more misses than the size of the cache compared to the execution starting with the first cache state.

Now we show that the number of drifted nodes in an execution of a series-parallel computation with a work stealer is at most twice the number of steals. The proof is based on the representation of series-parallel computations as sp-dags. We call a node with out-degree of at least 2 a fork node, and we partition the nodes of an sp-dag except the root into three categories: join nodes, stable nodes, and nomadic nodes. We call a node that has in-degree of at least 2 a join node, and we partition the nodes that have in-degree 1 into two classes: a nomadic node has a parent that is a fork node, and a stable node has a parent that has out-degree 1. The root node has in-degree 0 and does not belong to any of these categories.
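The classification just given is easy to compute from a dag's adjacency lists. The sketch below, with a hypothetical `classify` helper and an example dag (both illustrative assumptions, not from the paper), labels each node as nomadic, join, stable, or root:

```python
# Classify sp-dag nodes: fork = out-degree >= 2; join = in-degree >= 2;
# of the in-degree-1 nodes, nomadic = parent is a fork, stable = parent
# has out-degree 1; the root is left unclassified.

def classify(edges, root):
    out, parents = {}, {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
        parents.setdefault(v, []).append(u)

    def kind(v):
        if v == root:
            return "root"
        if len(parents.get(v, [])) >= 2:
            return "join"
        p = parents[v][0]  # in-degree 1: look at the unique parent
        return "nomadic" if len(out.get(p, [])) >= 2 else "stable"

    nodes = {root} | {v for e in edges for v in e}
    return {v: kind(v) for v in nodes}

# A dag in the flavor of Figure 6: s forks to u and v, which merge at z.
edges = [("r", "s"), ("s", "u"), ("s", "v"), ("u", "z"), ("v", "z")]
kinds = classify(edges, "r")
print(kinds["u"], kinds["z"], kinds["s"])  # nomadic join stable
```

Consistent with Lemma 5 below, the children u and v of the fork s come out nomadic, not join.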
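Lemma 2's bound can be checked empirically for LRU replacement, an example of a simple cache: running one access sequence from two different initial cache states changes the miss count by at most C. The function and the access sequence below are illustrative, not from the paper:

```python
from collections import OrderedDict

def lru_misses(accesses, C, initial):
    """Count misses of an access sequence on a fully associative LRU cache
    of C blocks, starting from the given initial cache contents."""
    cache = OrderedDict((b, None) for b in initial)
    misses = 0
    for b in accesses:
        if b in cache:
            cache.move_to_end(b)          # hit: refresh recency
        else:
            misses += 1                   # miss: evict least recently used
            if len(cache) >= C:
                cache.popitem(last=False)
            cache[b] = None
    return misses

C = 4
accesses = [1, 2, 3, 1, 2, 5, 6, 1, 2, 3, 5, 6]
m_warm = lru_misses(accesses, C, initial=[1, 2, 3, 5])   # favorable start state
m_cold = lru_misses(accesses, C, initial=[7, 8, 9, 10])  # unrelated start state
print(m_warm, m_cold)
assert m_cold - m_warm <= C   # Lemma 2's bound for a simple cache
```

Here the two runs differ by exactly C misses, so the bound of Lemma 2 is tight for this sequence.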
Figure 6: Children of s and their merger.

Figure 7: The joint embedding of u and v.

Figure 8: The join node t and the joint embedding of its parents y and z; s is the least common ancestor of y and z, and u and v are the children of s.

Lemma 4 lists two fundamental properties of sp-dags; one can prove both properties by induction on the number of edges in an sp-dag.

Lemma 4 Let G be an sp-dag. Then G has the following properties.

1. The least common ancestor of any two nodes in G is unique.

2. The greatest common descendant of any two nodes in G is unique and is equal to their unique merger.

Lemma 5 Let s be a fork node. Then no child of s is a join node.

Proof: Let u and v denote two children of s and suppose u is a join node as in Figure 6. Let t denote some other parent of u and z denote the unique merger of u and v. Then both z and u are mergers for s and t, which contradicts Lemma 4. Hence u is not a join node.

Corollary 6 Only nomadic nodes can be stolen in an execution of a series-parallel computation by the work-stealing algorithm.

Proof: Let u be a stolen node in an execution. Then u is pushed on a deque, and thus the enabling parent of u is a fork node. By Lemma 5, u is not a join node and has in-degree 1. Therefore u is nomadic.

Consider a series-parallel computation and let G be its sp-dag. Let u and v be two independent nodes in G, and let s and t denote their least common ancestor and greatest common descendant, respectively, as shown in Figure 7. Let G_1 denote the graph that is induced by the relatives of u that are descendants of s and also ancestors of t. Similarly, let G_2 denote the graph that is induced by the relatives of v that are descendants of s and ancestors of t. Then we call G_1 the embedding of u with respect to v, and G_2 the embedding of v with respect to u. We call the graph that is the union of G_1 and G_2 the joint embedding of u and v with source s and sink t. Now consider an execution of G, and let y and z be the children of s such that y is executed before z. Then we call y the leader and z the guard of the joint embedding.

Lemma 7 Let G = (V, E) be an sp-dag and let y and z be two parents of a join node t in G. Let G_1 denote the embedding of y with respect to z and G_2 denote the embedding of z with respect to y, and let s denote the source and t the sink of the joint embedding. Then the parents of any node in G_1 except for s and t are in G_1, and the parents of any node in G_2 except for s and t are in G_2.

Proof: Since y and z are independent, both y and z are different from s and t (see Figure 8). First, we show that there is no edge that starts at a node in G_1 other than s and ends at a node in G_2 other than t, or vice versa. For the sake of contradiction, assume there is an edge (m, n) such that m ≠ s is in G_1 and n ≠ t is in G_2. Then m is the least common ancestor of y and z; hence no such edge exists. A similar argument holds when m is in G_2 and n is in G_1. Second, we show that there does not exist an edge that originates from a node outside of G_1 and G_2 and ends at a node in G_1 or G_2. For the sake of contradiction, let (w, x) be an edge such that x is in G_1 and w is not in G_1 or G_2. Then x is the unique merger for the two children of the least common ancestor of w and s, which we denote with r. But then t is also a merger for the children of r. The children of r are independent and have a unique merger, hence there is no such edge (w, x). A similar argument holds when x is in G_2. Therefore, we conclude that the parents of any node in G_1 except s and t are in G_1, and the parents of any node in G_2 except s and t are in G_2.

Lemma 8 Let G be an sp-dag and let y and z be two parents of a join node t in G. Consider the joint embedding of y and z and let u be the guard node of the embedding. If the guard node u is not stolen, then y and z are executed in the same respective order in a multiprocess execution as they are executed in the uniprocess execution.

Proof: Let s be the source, t the sink, and v the leader of the joint embedding. Since u is not stolen, v is not stolen. Hence, by Lemma 7, before it starts working on u, the process that executes s has executed v and all its descendants in the embedding except for t. Hence, z is executed before u and y is executed after u, as in the uniprocess execution. Therefore, y and z are executed in the same respective order as they execute in the uniprocess execution.

Lemma 9 A nomadic node is drifted in an execution only if it is stolen.

Proof: Let u be a nomadic and drifted node. Then, by Lemma 5, u has a single parent s that enables u. If u is the first child of s to execute in the uniprocess execution, then u is not drifted in the multiprocess execution. Hence, u is not the first child to execute. Let v be the last child of s that is executed before u in the uniprocess execution. Now, consider the multiprocess execution and let q be the
process that executes v. For the sake of contradiction, assume that u is not stolen. Consider the joint embedding of u and v as shown in Figure 8. Since all parents of the nodes in G_2 except for s and t are in G_2 by Lemma 7, q executes all the nodes in G_2 before it executes u, and thus z precedes u on q. But then u is not drifted, because z is the node that is executed immediately before u in the uniprocess computation. Hence u is stolen.

Figure 9: Nodes t_1 and t_2 are two join nodes with the common guard u.

Let us define the cover of a join node t in an execution as the set of all the guard nodes of the joint embeddings of all possible pairs of parents of t in the execution. The following lemma shows that a join node is drifted only if a node in its cover is stolen.

Lemma 10 A join node is drifted in an execution only if a node in its cover is stolen in the execution.

Proof: Consider the execution and let t be a join node that is drifted. Assume, for the sake of contradiction, that no node in the cover of t, C(t), is stolen. Let y and z be any two parents of t as in Figure 8. Then y and z are executed in the same order as in the uniprocess execution by Lemma 8. But then all parents of t execute in the same order as in the uniprocess execution. Hence, the enabling parent of t in the execution is the same as in the uniprocess execution. Furthermore, the enabling parent of t has out-degree 1, because otherwise t is not a join node by Lemma 5, and thus the process that enables t executes t. Therefore, t is not drifted. A contradiction; hence a node in the cover of t is stolen.

Lemma 11 The number of drifted nodes in an execution of a series-parallel computation is at most twice the number of steals in the execution.

Proof: We associate each drifted node in the execution with a steal such that no steal has more than 2 drifted nodes associated with it. Consider a drifted node u. Then u is not the root node of the computation, and it is not stable either. Hence, u is either a nomadic or a join node. If u is nomadic, then u is stolen by Lemma 9, and we associate u with the steal that steals u. Otherwise, u is a join node and there is a node in its cover C(u) that is stolen by Lemma 10; we associate u with the steal that steals a node in its cover. Now, assume there are more than 2 nodes associated with a steal that steals node u. Then there are at least two join nodes t_1 and t_2 that are associated with u. Therefore, node u is in the joint embedding of two parents of t_1 and also of t_2. Let x_1, y_1 be these parents of t_1 and x_2, y_2 be the parents of t_2, as shown in Figure 9. But then u has a parent that is a fork node and u is a join node, which contradicts Lemma 5. Hence no such u exists.
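To make Lemma 11 concrete, here is a small hand-built example (the dag, node names, and schedules are illustrative constructions, not from the paper). The root r forks a chain a1-a2-a3 and a chain b1-b2 that meet at a join node j; with two processes, b1 is stolen, and checking which nodes are drifted is then mechanical:

```python
# Uniprocess execution order X_1: the lone process runs the a-chain, pops b1
# from its own deque, runs the b-chain, then the join j.
X1 = ["r", "a1", "a2", "a3", "b1", "b2", "j"]

# Two-process execution X_P (hand-traced under the work-stealing rules):
# process p1 steals b1 after r forks. Each list is one process's order.
XP = {"p0": ["r", "a1", "a2", "a3", "j"], "p1": ["b1", "b2"]}
stolen = ["b1"]  # the only stolen node; it is nomadic (its parent r is a fork)

def drifted_nodes(x1, xp):
    """A node v is drifted if the node u just before v in X_1 is not executed
    immediately before v by the process that executes v in X_P."""
    pred = {v: u for u, v in zip(x1, x1[1:])}  # predecessor in uniprocess order
    drifted = set()
    for order in xp.values():
        for i, v in enumerate(order):
            if v not in pred:                 # the root has no predecessor
                continue
            if i == 0 or order[i - 1] != pred[v]:
                drifted.add(v)
    return drifted

d = drifted_nodes(X1, XP)
print(sorted(d))                    # ['b1', 'j']
assert len(d) <= 2 * len(stolen)    # Lemma 11's bound, tight here
```

The two drifted nodes are exactly the stolen nomadic node b1 (Lemma 9) and the join node j, whose cover contains the stolen node (Lemma 10), so the bound of two drifted nodes per steal is met with equality.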
Theorem 12 The cache overhead of an execution of a nested-parallel computation with simple caches is at most twice the product of the number of steals in the execution and the cache size.

Proof: Follows from Theorem 3 and Lemma 11.

6 An Analysis of Nonblocking Work Stealing

The non-blocking implementation of the work-stealing algorithm delivers provably good performance under traditional and multiprogrammed workloads. A description of the implementation and its analysis is presented in [2]; an experimental evaluation is given in [10]. In this section, we extend the analysis of the non-blocking work-stealing algorithm for classical workloads and bound the execution time of a nested-parallel computation with a work stealer to include the number of cache misses, the cache-miss penalty, and the steal time. First, we bound the number of steal attempts in an execution of a general computation by the work-stealing algorithm. Then we bound the execution time of a nested-parallel computation with a work stealer using results from Section 5. The analysis that we present here is similar to the analysis given in [2] and uses the same potential function technique.

We associate a nonnegative potential with the nodes in a computation's dag and show that the potential decreases as the execution proceeds. We assume that a node in a computation dag has out-degree at most 2. This is consistent with the assumption that each node represents one instruction. Consider an execution of a computation with dag G = (V, E) with the work-stealing algorithm. The execution grows a tree, the enabling tree, that contains each node in the computation and its enabling edge. We define the distance of a node u ∈ V as d(u) = T_∞ − depth(u), where depth(u) is the depth of u in the enabling tree of the computation. Intuitively, the distance of a node indicates how far the node is away from the end of the computation. We define the potential function in terms of distances. At any given step i, we assign a positive potential to each ready node; all other nodes have 0 potential. A node is ready if it is enabled and not yet executed to completion. Let u denote a ready node at time step i. Then we define φ_i(u), the potential of u at time step i, as

    φ_i(u) = 3^(2d(u)−1)  if u is assigned;
    φ_i(u) = 3^(2d(u))    otherwise.

The potential at step i, Φ_i, is the sum of the potentials of the ready nodes at step i. When an execution begins, the only ready node is the root node, which has distance T_∞ and is assigned to some process, so we start with Φ_0 = 3^(2T_∞−1). As the execution proceeds, nodes that are deeper in the dag become ready and the potential decreases. There are no ready nodes at the end of an execution, and the potential is 0.

Let us give a few more definitions that enable us to associate a potential with each process. Let R_i(q) denote the set of ready nodes that are in the deque of process q, along with q's assigned node, if any, at the beginning of step i. We say that each node u in R_i(q) belongs to process q. Then we define the potential of q's deque as

    Φ_i(q) = Σ_{u ∈ R_i(q)} φ_i(u).

In addition, let A_i denote the set of processes whose deque is empty at the beginning of step i, and let D_i denote the set of all other processes. We partition the potential Φ_i into two parts,

    Φ_i = Φ_i(A_i) + Φ_i(D_i),

where

    Φ_i(A_i) = Σ_{q ∈ A_i} Φ_i(q)   and   Φ_i(D_i) = Σ_{q ∈ D_i} Φ_i(q),

and we analyze the two parts separately.
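The potential function can be sketched numerically. The helper `phi` and the example distances below are illustrative assumptions; the second check illustrates the geometric domination of a deque's potential by its top node, which is used later for property 3 of Lemma 13:

```python
# phi_i(u) = 3^(2d(u)-1) if u is assigned, else 3^(2d(u)).
def phi(d, assigned):
    return 3 ** (2 * d - 1) if assigned else 3 ** (2 * d)

T_inf = 4                            # critical path of a hypothetical dag
Phi_0 = phi(T_inf, assigned=True)    # at the start: only the root, assigned
assert Phi_0 == 3 ** (2 * T_inf - 1) # matches Phi_0 = 3^(2 T_inf - 1)

# Distances decrease from the top of a deque to the bottom, so the deque
# potential is a sum of geometrically decreasing terms dominated by the
# top node, which holds at least 3/4 of the total.
deque_distances = [5, 4, 3, 2, 1]    # top to bottom, strictly decreasing
pots = [phi(d, assigned=False) for d in deque_distances]
assert pots[0] >= (3 / 4) * sum(pots)
```

Since unassigned potentials are powers of 9, the top node in fact holds at least an 8/9 fraction of the deque's potential, comfortably above the 3/4 claimed in property 3.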
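The analysis that follows leans on a balls-into-weighted-bins bound (Lemma 14 below). A quick Monte Carlo sanity check of the inequality Pr{X ≥ βW} > 1 − 1/((1 − β)e), with arbitrary illustrative weights and a fixed seed:

```python
import math
import random

def trial(P, weights, rng):
    """Throw P balls uniformly into P bins; return total weight of hit bins."""
    hit = set(rng.randrange(P) for _ in range(P))
    return sum(w for i, w in enumerate(weights) if i in hit)

rng = random.Random(0)        # fixed seed so the check is reproducible
P = 10
weights = list(range(1, P + 1))
W = sum(weights)
beta = 0.5
trials = 10_000
succ = sum(trial(P, weights, rng) >= beta * W for _ in range(trials))
bound = 1 - 1 / ((1 - beta) * math.e)   # Lemma 14's lower bound, about 0.264
print(succ / trials, bound)
assert succ / trials > bound
```

The empirical success rate is far above the bound here; the lemma's value is that the bound holds for every weight assignment, however skewed.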
Lemma 13 lists four basic properties of the potential that we use frequently. The proofs for these properties are given in [2], and the listed properties hold independent of the time that the execution of a node or a steal takes. Therefore, we give only a short proof sketch.

Lemma 13 The potential function satisfies the following properties.

1. Suppose node u is assigned to a process at step i. Then the potential decreases by at least (2/3)φ_i(u).

2. Suppose a node u is executed at step i. Then the potential decreases by at least (5/9)φ_i(u) at step i.

3. Consider any step i and any process q in D_i. The topmost node u in q's deque contributes at least 3/4 of the potential associated with q. That is, we have φ_i(u) ≥ (3/4)Φ_i(q).

4. Suppose a process p chooses process q in D_i as its victim at time step i (a steal attempt of p targeting q occurs at step i). Then the potential decreases by at least (1/2)Φ_i(q) due to the assignment or execution of a node belonging to q at the end of step i.

Property 1 follows directly from the definition of the potential function. Property 2 holds because a node enables at most two children with smaller potential, one of which becomes assigned. Specifically, the potential after the execution of node u decreases by at least φ_i(u)(1 − 1/3 − 1/9) = (5/9)φ_i(u). Property 3 follows from a structural property of the nodes in a deque: the distances of the nodes in a process's deque decrease monotonically from the top of the deque to the bottom. Therefore, the potential in the deque is the sum of geometrically decreasing terms and is dominated by the potential of the top node. The last property holds because when a process chooses process q in D_i as its victim, the node u at the top of q's deque is assigned at the next step. Therefore, the potential decreases by (2/3)φ_i(u) by property 1. Moreover, φ_i(u) ≥ (3/4)Φ_i(q) by property 3, and the result follows.

Lemma 15 shows that the potential decreases as a computation proceeds. The proof for Lemma 15 utilizes the balls-and-weighted-bins bound from Lemma 14.

Lemma 14 (Balls and Weighted Bins) Suppose that at least P balls are thrown independently and uniformly at random into P bins, where bin i has a weight W_i, for i = 1, ..., P. The total weight is W = Σ_{i=1}^{P} W_i. For each bin i, define the random variable X_i as

    X_i = W_i  if some ball lands in bin i;
    X_i = 0    otherwise.

If X = Σ_{i=1}^{P} X_i, then for any β in the range 0 < β < 1, we have Pr{X ≥ βW} > 1 − 1/((1 − β)e).

Lemma 15 Suppose that P or more steal attempts occur at or after step i. Then, with probability greater than 1/4, the potential decreases by at least (1/4)Φ_i(D_i) due to the execution or assignment of nodes that belong to a process in D_i.

Proof: Consider all P processes and the P steal attempts that occur at or after step i. For each process q in D_i, if one or more of the P attempts target q as the victim, then the potential decreases by (1/2)Φ_i(q) due to the execution or assignment of nodes that belong to q, by property 4 of Lemma 13. If we think of each attempt as a ball toss, then we have an instance of the Balls and Weighted Bins Lemma (Lemma 14). For each process q in D_i, we assign a weight W_q = (1/2)Φ_i(q), and for each other process q in A_i, we assign a weight W_q = 0. The weights sum to W = (1/2)Φ_i(D_i). Using β = 1/2 in Lemma 14, we conclude that the potential decreases by at least βW = (1/4)Φ_i(D_i) with probability greater than 1 − 1/((1 − β)e) > 1/4, due to the execution or assignment of nodes that belong to a process in D_i.

We now bound the number of steal attempts in a work-stealing computation.

Lemma 16 Consider a P-process execution of a multithreaded computation with the work-stealing algorithm. Let T_1 and T_∞ denote the computational work and the critical path of the computation. Then the expected number of steal attempts in the execution is O(⌈m/s⌉ P T_∞). Moreover, for any ε > 0, the number of steal attempts is O(⌈m/s⌉ P (T_∞ + lg(1/ε))) with probability at least 1 − ε.

Proof: We analyze the number of steal attempts by breaking the execution into phases of ⌈m/s⌉ P steal attempts. We show that, with constant probability, a phase causes the potential to drop by a constant factor. The first phase begins at step t_1 = 1 and ends at the first step t_1' such that at least ⌈m/s⌉ P steal attempts occur during the interval of steps [t_1, t_1']. The second phase begins at step t_2 = t_1' + 1, and so on. Let us first show that there are at least m steps in a phase. A process has at most 1 outstanding steal attempt at any time, and a steal attempt takes at least s steps to complete. Therefore, at most P steal attempts occur in a period of s time steps. Hence a phase of ⌈m/s⌉ P steal attempts takes at least (⌈m/s⌉ P) s / P = ⌈m/s⌉ s ≥ m time units.

Consider a phase beginning at step i, and let j be the step at which the next phase begins. Then j ≥ i + m. We will show that we have Pr{Φ_j ≤ (3/4)Φ_i} > 1/4. Recall that the potential can be partitioned as Φ_i = Φ_i(A_i) + Φ_i(D_i). Since the phase contains ⌈m/s⌉ P steal attempts, Pr{Φ_i − Φ_j ≥ (1/4)Φ_i(D_i)} > 1/4, due to the execution or assignment of nodes that belong to a process in D_i, by Lemma 15. Now we show that the potential also drops by a constant fraction of Φ_i(A_i) due to the execution of assigned nodes that are assigned to the processes in A_i. Consider a process, say q in A_i. If q does not have an assigned node, then Φ_i(q) = 0. If q has an assigned node u, then Φ_i(q) = φ_i(u). In this case, process q completes executing node u at step i + m − 1 at the latest, and the potential drops by at least (5/9)φ_i(u) by property 2 of Lemma 13. Summing over each process q in A_i, we have