The Data Locality of Work Stealing

Umut A. Acar Guy E. Blelloch Robert D. Blumofe School of Computer Science School of Computer Science Department of Computer Sciences Carnegie Mellon University Carnegie Mellon University University of Texas at Austin [email protected] [email protected] [email protected]

Abstract of these systems is a scheduler that balances load among the processes. In addition to a good load balance, however, good data This paper studies the data locality of the work-stealing locality is essential in obtaining high performance from modern algorithm on hardware-controlled shared-memory machines. We parallel systems. present lower and upper bounds on the number of cache misses Several researches have studied techniques to improve the data using work stealing, and introduce a locality-guided work-stealing locality of multithreaded programs. One class of such techniques algorithm along with experimental validation. is based on software-controlled distribution of data among the lo-

As a lower bound, we show that there is a family of multi- cal memories of a distributed shared memory system [15, 22, 26]. ¢¡ threaded computations each member of which requires £¥¤§¦©¨ Another class of techniques is based on hints supplied by the pro- total operations (work), for which when using work-stealing the to- grammer so that “similar” tasks might be executed on the same tal number of cache misses on one processor is constant, while even processor [15, 31, 34]. Both these classes of techniques rely on

on two processors the total number of cache misses is ¤§¦©¨ . For the programmer or to determine the data access patterns

nested-parallel computations, however, we show that on proces- in the program, which may be very difficult when the program has

sors the expected additional number of cache misses beyond those complicated data access patterns. Perhaps the earliest class of tech-

 ¨  on a single processor is bounded by ¥¤§ , where is niques was to attempt to execute threads that are close in the com-

the execution time of an instruction incurring a cache miss,  is the putation graph on the same processor [1, 9, 20, 23, 26, 28]. The

  steal time,  is the size of cache, and is the number of nodes work-stealing algorithm is the most studied of these techniques [9, on the longest chain of dependences. Based on this we give strong 11, 19, 20, 24, 36, 37]. Blumofe et al showed that fully-strict com- bounds on the total running time of nested-parallel computations putations achieve a provably good data locality [7] when executed using work stealing. with the work-stealing algorithm on a dag-consistent distributed For the second part of our results, we present a locality-guided shared memory systems. In recent work, Narlikar showed that work work stealing algorithm that improves the data locality of multi- stealing improves the performance of space-efficient multithreaded threaded computations by allowing a thread to have an affinity for applications by increasing the data locality [29]. None of this pre- a processor. Our initial experiments on iterative data-parallel appli- vious work, however, has studied upper or lower bounds on the cations show that the algorithm matches the performance of static- data locality of multithreaded computations executed on existing partitioning under traditional work loads but improves the perform- hardware-controlled shared memory systems.

ance up to  "! over static partitioning under multiprogrammed In this paper, we present theoretical and experimental results work loads. Furthermore, the locality-guided work stealing im- on the data locality of work stealing on hardware-controlled shared

proves the performance of work-stealing up to #$ "! . memory systems (HSMSs). Our first set of results are upper and lower bounds on the number of cache misses in multithreaded com-

putations executed by the work-stealing algorithm. Let %'&(¤§¢¨ de-

1 Introduction note the number of cache misses in the uniprocessor execution and

%')*¤§¢¨ denote the number of cache misses in a -processor ex- Many of today’s parallel applications use sophisticated, adaptive ecution of a multithreaded computation by the work stealing algo- algorithms which are best realized with parallel programming sys- rithm on an HSMS with cache size  . Then, for a multithreaded

tems that support dynamic, lightweight threads such as [8],

  computation with & work (total number of instructions), criti- Nesl [5], Hood [10], and many others [3, 16, 17, 21, 32]. The core cal path (longest sequence of dependences), we show the following results for the work-stealing algorithm running on a HSMS.

+ Lower bounds on the number of cache misses for general

computations: We show that there is a family of computa-

¢¡ %'&(¤§¢¨/,102 tions with &-,.£¥¤§¦©¨ such that while

even on two processors the number of misses %435¤§¢¨6,

£¥¤§¦©¨ .

+ Upper bounds on the number of cache misses for nested-

parallel computations: For a nested-parallel computation, we

)87

%'&9¤§¢¨:<;=¢> > show that % , where is the number of

steals in the -processor execution. We then show that the linear stealing, locality-guided work stealing, and static partitioning for a 18 work-stealing

locality-guided workstealing simple over-relaxation algorithm on a J9K processor Sun Ultra En- static partioning 16 terprise. The over-relaxation algorithm iterates over a J dimen-

14 sional array performing a 0 -point stencil computation on each step. The superlinear speedup for static partitioning and locality-guided

12 work stealing is due to the fact that the data for each run does not

? 10 ; fit into the L ; cache of one processor but fits into the collective L Speedup 8 cache of L or more processors. For this benchmark the following can be seen from the graph. 6

4 1. Locality-guided work stealing does significantly better than standard work stealing since on each step the cache is pre- 2 warmed with the data it needs.

0 0 5 10 15 20 25 30 35 40 45 50 Number of Processes 2. Locality-guided work stealing does approximately as well as static partitioning for up to 14 processes. Figure 1: The speedup obtained by three different over-relaxation 3. When trying to schedule more than 14 processes on 14 algorithms. processors static partitioning has a serious performance drop. The initial drop is due to load imbalance caused by the coarse-grained partitioning. The performance then ap-

proaches that of work stealing as the partitioning gets more

  

(  ¨  expected number of steals is ¥¤( , where is the fine-grained.

time for a cache miss and  is the time for a steal.

We are interested in the performance of work-stealing computa- + Upper bound on the execution time of nested-parallel com- tions on hardware-controlled shared memory (HSMSs). We model

putations: We show that the expected execution time of a an HSMS as a group of identical processors each of which has its

¥¤=@$A9BDCFE :

nested-parallel computation on processors is ) own cache and has a single shared memory. Each cache contains

 

$  :/¤§G:H ¨I ¨ &9¤§¢¨ '

  , where is the uniprocessor  blocks and is managed by the memory subsystem automatically. execution time of the computation including cache misses. We allow for a variety of cache organizations and replacement poli- cies, including both direct-mapped and associative caches. We as- As in previous work [6, 9], we represent a multithreaded com- sign a server process with each processor and associate the cache putation as a directed, acyclic graph (dag) of instructions. Each of a processor with process that the processor is assigned. One lim- node in the dag represents a single instruction and the edges repre- itation of our work is that we assume that there is no false sharing. sent ordering constraints. A nested-parallel computation [5, 6] is a race-free computation that can be represented with a series-parallel dag [33]. Nested-parallel computations include computations con- 2 Related Work sisting of parallel loops and fork an joins and any nesting of them. This class includes most computations that can be expressed in As mentioned in Section 1, there are three main classes of tech- Cilk [8], and all computations that can be expressed in Nesl [5]. niques that researchers have suggested to improve the data locality Our results show that nested-parallel computations have much bet- of multithreaded programs. In the first class, the program data is ter locality characteristics under work stealing than do general com- distributed among the nodes of a distributed shared-memory system putations. We also briefly consider another class of computations, by the programmer and a thread in the computation is scheduled on computations with futures [12, 13, 14, 20, 25], and show that they the node that holds the data that the thread accesses [15, 22, 26]. can be as bad as general computations. In the second class, data-locality hints supplied by the programmer The second part of our results are on further improving the data are used in thread scheduling [15, 31, 34]. Techniques from both locality of multithreaded computations with work stealing. In work classes are employed in distributed shared memory systems such as stealing, a processor steals a thread from a randomly (with uniform COOL and Illinois Concert [15, 22] and also used to improve the distribution) chosen processor when it runs out of work. In certain data locality of sequential programs [31]. However, the first class applications, such as iterative data-parallel applications, random of techniques do not apply directly to HSMSs, because HSMSs do steals may cause poor data locality. The locality-guided work steal- not allow software controlled distribution of data among the caches. ing is a heuristic modification to work stealing that allows a thread Furthermore, both classes of techniques rely on the programmer to to have an affinity for a process. In locality-guided work stealing, determine the data access patterns in the application and thus, may when a process obtains work it gives priority to a thread that has not be appropriate for applications with complex data-access pat- affinity for the process. Locality-guided work stealing can be used terns. to implement a number of techniques that researchers suggest to The third class of techniques, which is based on execution improve data locality. 
For example, the programmer can achieve an of threads that are close in the computation graph on the same initial distribution of work among the processes or schedule threads process, is applied in many scheduling algorithms including work based on hints by appropriately assigning affinities to threads in the stealing [1, 9, 23, 26, 28, 19]. Blumofe et al showed bounds computation. on the number of cache misses in a fully-strict computation exe- Our preliminary experiments with locality-guided work steal- cuted by the work-stealing algorithm under the dag-consistent dis- ing give encouraging results, showing that for certain applications tributed shared-memory of Cilk [7]. Dag consistency is a relaxed the performance is very close to that of static partitioning in ded- memory-consistency model that is employed in the distributed icated mode (i.e. when the user can down a fixed number of shared-memory implementation of the Cilk language. In a dis- processors), but does not suffer a performance cliff problem [10] tributed Cilk application, processes maintain the dag consistency in multiprogrammed mode (i.e. when processors might be taken by means of the BACKER algorithm. In [7], Blumofe et al bound by other users or the OS). Figure 1 shows a graph comparing work the number of shared-memory cache misses in a distributed Cilk

single node with in-degree , the root, and a single node with out-

P Q

degree , the final node.

QV

Q

Q TU

R O

Q In a multiprocess execution of a multithreaded computation, in-

M N

S dependent nodes can execute at the same time. If two independent P

nodes read or modify the same data, we say that they are RR or R Q WW sharing respectively. If one node is reading and the other is modifying the data we say they are RW sharing. RW or WW shar- ing can cause data races, and the output of a computation with such Figure 2: A dag (directed acyclic graph) for a multithreaded com- races usually depends on the scheduling of nodes. Such races are putation. Threads are shown as gray rectangles. typically indicative of a bug [18]. We refer to computations that do not have any RW or WW sharing as race-free computations. In this paper we consider only race-free computations. application for caches that are maintained with the LRU replace- The work-stealing algorithm is a thread scheduling algorithm ment policy. They assumed that accesses to the shared memory for multithreaded computations. The idea of work-stealing dates are distributed uniformly and independently, which is not gener- back to the research of Burton and Sleep [11] and has been studied ally true because threads may concurrently access the same pages extensively since then [2, 9, 19, 20, 24, 36, 37]. In the work-stealing by algorithm design. Furthermore, they assumed that processes do algorithm, each process maintains a pool of ready threads and ob- not generate steal attempts frequently by making processes do ad- tains work from its pool. When a process spawns a new thread the ditional page transfers before they attempt to steal from another process adds the thread into its pool. When a process runs out of process. work and finds its pool empty, it chooses a random process as its victim and tries to steal work from the victim’s pool. In our analysis, we imagine the work-stealing algorithm oper- 3 The Model ating on individual nodes in the computation dag rather than on the threads. Consider a multithreaded computation and its execution by In this section, we present a graph-theoretic model for multithreaded the work-stealing algorithm. We divide the execution into discrete computations, describe the work-stealing algorithm, define series- time steps such that at each step, each process is either working on parallel and nested-parallel computations and introduce our model a node, which we call the assigned node, or is trying to steal work. of an HSMS (Hardware-controlled Shared-Memory System). The execution of a node takes J time step if the node does not incur

As with previous work [6, 9] we represent a multithreaded com- a cache miss and  steps otherwise. We say that a node is executed putation as a directed acyclic graph, a dag, of instructions (see Fig- at the time step that a process completes executing the node. The ure 2). Each node in the dag represents an instruction and the edges execution time of a computation is the number of time steps that represent ordering constraints. There are three types of edges, con- elapse between the time step that a process starts executing the root tinuation, spawn, and dependency edges. A thread is a sequential node to the time step that the final node is executed. The execution ordering of instructions and the nodes that corresponds to the in- schedule specifies the activity of each process at each time step. structions are linked in a chain by edges. A spawn During the execution, each process maintains a deque (doubly edge represents the creation of a new thread and goes from the node ended queue) of ready nodes; we call the ends of a deque the top

representing the instruction that spawns the new thread to the node and the bottom. When a node, Y , is executed, it enables some other

Y Z

representing the first instruction of the new thread. A dependency node Z if is the last parent of that is executed. We call the edge

¤§Y]\^Z2¨ Y Z X edge from instruction W of a thread to instruction of some other an enabling edge and the designated parent of . When a

thread represents a synchronization between two instructions such process executes a node that enables other nodes, one of the enabled W that instruction X must be executed after . We draw spawn edges nodes become the assigned node and the process pushes the rest with thick straight arrows, dependency edges with curly arrows and onto the bottom of its deque. If no node is enabled, then the process continuation edges with thick straight arrows throughout this paper. obtains work from its deque by removing a node from the bottom of Also we show paths with wavy lines. the deque. If a process finds its deque empty, it becomes a thief and

For a computation with an associated dag , we define the com- steals from a randomly chosen process, the victim. This is a steal

_F attempt and takes at least  and at most time steps for some

putational work, & , as the number of nodes in and the critical

_a`bJ

path,  , as the number of nodes on the longest path of . constant to complete. A thief process might make multiple

Z Y

Let Y and be any two nodes in a dag. Then we call an an- steal attempts before succeeding, or might never succeed. When a

Z Y Y Z cestor of Z , and a descendant of if there is a path from to . steal succeeds, the thief process starts working on the stolen node Any node is its descendant and ancestor. We say that two nodes are at the step following the completion of the steal. We say that a steal relatives if there is a path from one to the other, otherwise we say attempt occurs at the step it completes. that the nodes are independent. The children of a node are inde- The work-stealing algorithm can be implemented in various

pendent because otherwise the edge from the node to one child is ways. We say that an implementation of work stealing is deter-

Y Z

redundant. We call a common descendant [ of and a merger of ministic if, whenever a process enables other nodes, the imple-

Z Y [ Z [ [ Y and if the paths from to and to have only in common. mentation always chooses the same node as the assigned node for

We define the depth of a node Y as the number of edges on the then next step on that process, and the remaining nodes are always

shortest path from the root node to Y . We define the least common placed in the deque in the same order. This must be true for both

Z Y Z ancestor of Y and as the ancestor of both and with maximum multiprocess and uniprocess executions. We refer to a determin-

depth. Similarly, we define the greatest common descendant of Y istic implementation of the work-stealing algorithm together with

Y Z

and Z , as the descendant of both and with minimum depth. An the HSMS that runs the implementation as a work stealer. For

Y Z edge ¤§Y]\^Z2¨ is redundant if there is a path between and that brevity, we refer to an execution of a multithreaded computation

does not contain the edge ¤§Y©\^Z2¨ . The transitive reduction of a dag with a work stealer as an execution. We define the total work as

is the dag with all the redundant edges removed. the number of steps taken by a uniprocess execution, including the  In this paper we are only concerned with the transitive reduction cache misses, and denote it by & ¤§¢¨ , where is the cache size.

of the computational dags. We also require that the dags have a We denote the number of cache misses in a -process execution %')c¤§¢¨ with  -block caches as . We define the cache overhead

‡

~

ƒ

‰

~

ˆ

Š

~

€

i

h

„

~

~

~ ‚

~ f

2

~

‡

h

2

f f

1

~

‚



‰

ƒ

f ˆ

1

‹

„



g

g

de Figure 4: The structure for dag of a computation with a large cache overhead. (a) (b) (c)

Figure 3: Illustrates the recursive definition for series-parallel dags. mapped caches and set associative caches where each set is main- Figure (a) is the base case, figure (b) depicts the serial, and figure tained by a simple cache replacement policy are simple. (c) depicts the parallel composition. In regards to the definition of RW or WW sharing, we assume that reads and writes pertain to the whole block. This means we do not allow for false sharing—when two processes accessing differ-

ent portions of a block invalidate the block in each other’ s caches.

)

% ¤§¢¨kjl%'&(¤§¢¨ %'&9¤§¢¨ of a -process execution as , where is In practice, false sharing is an issue, but can often be avoided by the number of misses in the uniprocess execution on the same work a knowledge of underlying memory system and appropriately pad- stealer. ding the shared data to prevent two processes from accessing dif- We refer to a multithreaded computation for which the transi- ferent portions of the same block.

tive reduction of the corresponding dag is series-parallel [33] as

a series-parallel computation. A series-parallel dag ¤§mn\po¢¨ is a

dag with two distinguished vertices, a source, rqsm and a sink, 4 General Computations t

qrm and can be defined recursively as follows (see Figure 3).

In this section, we show that the cache overhead of a multiprocess

t +

Base: consists of a single edge connecting  to . execution of a general computation and a computation with futures

+ Series Composition: consists of two series-parallel dags can be large even though the uniprocess execution incurs a small

number of misses. 3$¤§mu32\^o3=¨

&9¤§m]&(\^o &^¨ and with disjoint edge sets such that

& Y &

 is the source of , is the sink of and the source of

t Theorem 1 There is a family of computations

m m ,yx$YFz

3 3 3

, and is the sink of . Moreover &wv .

+

Parallel Composition: The graph consists of two series-parallel ¡Œ

x ¦,y_F \^ŽF"n_qr‘H’cz 3$¤§mu32\^o3=¨

dags &9¤§m]&(\^o &^¨ and with disjoint edges sets

t 

such that and are the source and the sink of both & and

t

¥¤§¦©¨

m m ,{x$$\ z

3 3

. Moreover &*v . with computational work, whose uniprocess execution incurs ;

02 misses while any -process execution of the computation incurs 

A nested-parallel computation is a race-free series-parallel compu- ¤§¦©¨ misses on a work stealer with a cache size of , assuming “ tation [6]. that “a,y ¥¤§¢¨ , where is the maximum steal time.

We also consider multithreaded computations that use futures [12, ”

13, 14, 20, 25]. The dag structures of computations with futures Proof: Figure 4 shows the structure of a dag, for ¦4,•K2 . C are defined elsewhere [4]. This is a superclass of nested-parallel Each node except the root node represents a sequence of  in-

computations, but still much more restrictive than general com- structions accessing a set of  distinct memory blocks. The root 

putations. The work-stealing algorithm for futures is a restricted node represents :/“ instructions that accesses distinct memory

” ” —

form of work-stealing algorithm, where a process starts executing a blocks. The graph has two symmetric components – and , C

newly created thread immediately, putting its assigned thread onto which corresponds to the left and the right subtree of C the root ex- ” its deque. cluding the leaves. We partition the nodes in into three classes, In our analysis, we consider several cache organization and re- such that all nodes in a class access the same memoryC blocks while placement policies for an HSMS. We model a cache as a set of nodes from different classes access mutually disjoint set of memory

(cache) lines, each of which can hold the data belonging to a mem- blocks. The first class contains the root node only, the second class ”

ory block (a consecutive, typically small, region of memory). One contains all the nodes in – , and the third class contains the rest

” ”

C

instruction can operate on at most one memory block. We say that of the nodes, which are the nodes in — and the leaves of .

¡ ¡ ¢¡

C C

– — _

an instruction accesses a block or the line that contains the block For general ¦,y_˜ , can be partitioned into , and the

¡ ¡ ¡ —

when the instruction reads or modifies the block. We say that an leaves of and the root similarly. Each of – and contains

;]™ ]j-J

instruction overwrites a line that contains the block | when the in- nodes and has the structure of a complete binary tree with 3 _

struction accesses some other block that replaces | in the cache. additional leaves at the lowest level. There is a dependency edge

¡ ¡ ¡ — We say that a cache replacement policy is simple if it satisfies two from the leaves of both – and to the leaves of . conditions. First the policy is deterministic. Second whenever the Consider a work stealer that executes the nodes of ¢¡ in the or-

policy decides to overwrite a cache line, } , it makes the decision der that they are numbered in a uniprocess execution. In the unipro-

¡ –

to overwrite } by only using information pertaining to the accesses cess execution, no node in incurs a cache miss except the root

¡ –

that are made after the last access to } . We refer to a cache man- node, since all nodes in access the same memory blocks as the

¡ ¡

— _

aged with a simple cache-replacement policy as a simple cache. root of – . The same argument holds for and the leaves of

¡ ¡ ¡ —

Simple caches and replacement policies are common in practice. . Hence the execution of the nodes in – , , and the leaves  For example, least-recently used (LRU) replacement policy, direct causes ;= misses. Since the root node causes misses, the total

Ÿ

¢ ¨)

that executes Z in .

¢

› ¢ ¢ Lemma 2 establishes a key property of an execution with simple

caches.

š ¢

Lemma 2 Consider a process with a simple cache of  blocks. Let

¢ ¨

& denote the execution of a sequence of instructions on the proc-

¡ ¨/3

ess starting with cache state “& and let denote the execution “

of the same sequence of instructions starting with cache state 3 .

š 

¢£

¨  ¨ 3 Then & incurs at most more misses than .

ž Proof: We construct a one-to-one mapping between the cache

› Ÿ œ ¨3

lines in ¨& and such that an instruction that accesses a line

¨& }©3 ¨/3 }ª&

}¥& in accesses the entry in , if and only if is mapped to

} ¨ } W

& &

Figure 5: The structure for dag of a computation with futures that 3 . Consider and let be a cache line. Let be the first instruc- }©3

can incur a large cache overhead. tion that accesses or overwrites }ª& . Let be the cache line that }¥&

the same instruction accesses or overwrites in ¨/3 and map to

} } &

3 . Since the caches are simple, an instruction that overwrites in

¨ } ¨

3 3 & overwrites in . Therefore the number of misses that over-

number of misses in the uniprocess execution is 0" . Now, consider

} ¨ }

& 3 writes & in is equal to the number of misses that overwrites

a ; -process execution with the same work stealer and call the pro-

W W J

in ¨/3 after instruction . Since itself can cause miss, the number

J J

cesses, process and . At time step , process starts executing

¡

} ¨ J & of misses that overwrites & in is at most more than the num-

the root node, which enables the root of — no later than time step

} ¨ 3

ber of misses that overwrites 3 in . We construct the mapping  . Since process starts stealing immediately and there are no

for each cache line in ¨& in the same way. Now, let us show that

other processes to steal from, process J steals and starts working

¡ the mapping is one-to-one. For the sake of contradiction, assume b:<“

on the root of — , no later than time step . Hence, the root

¡ ¡ ¡

} } ¨ ¨

3 & 3

that two cache lines, & and , in map to the same line in .

– –

of — executes before the root of and thus, all the nodes in

¡

W W 3 Let & and be the first instructions accessing the cache lines in

execute before the corresponding symmetric node in — . There-

¢¡ ¡

WI& WD3 W«& WD3 ¨& such that is executed before . Since and map to the

fore, for any leaf of , the parent that is in — executes before

¡ ¡ W¬3 same line in ¨3 and the caches are simple, accesses the line that

the parent in – . Therefore a leaf node of is executed immedi-

¡

W ¨ } ,y}

& & 3

& accesses in but then , a contradiction. Hence, the total 

ately after its parent in – and thus, causes cache misses. Thus,  number of cache misses in ¨& is at most more than the misses

the total number of cache misses is ¤¤¥_˜¢¨*,y ¤¤§¦©¨ . ¨ in 3 . There exists computations similar to the computation in Fig- ure 4 that generalizes Theorem 1 for arbitrary number of processes

Theorem 3 Let ­ denote the total number of drifted nodes in an

by making sure that all the processes but ; steal throughout any

multiprocess execution. Even in the general case, however, where execution of a nested-parallel computation with a work stealer on  the average parallelism is higher than the number of processes, processes, each of which has a simple cache with words. Then

Theorem 1 can be generalized with the same bound on expected the cache overhead of the execution is at most ¢­ .

number of cache misses by exploiting the symmetry in ¡ and by

¨) ¨ Proof: Let denote the -process execution and let & be the assuming a symmetrically distributed steal-time. With a symmetri-

uniprocess execution of the same computation with the same work ¦ cally distributed steal-time, for any ¦ , a steal that takes steps more

stealer. We divide the multiprocess computation into ­ pieces each than mean steal-time is equally likely to happen as a steal that takes

of which can incur at most  more misses than in the uniprocess

¦ less steps than the mean. Theorem 1 holds for computations with ® execution. Let Y be a drifted node let be the process that executes

futures as well. Multithreaded computing with futures is a fairly

Z ® Y . Let be the next drifted node executed on (or the final node restricted form of multithreaded computing compared to comput-

of the computation). Let the ordered set represent the execution

ing with events such as synchronization variables. The graph § in Y order of all the nodes that are executed after Y ( is included) and

Figure 5 shows the structure of a dag, whose ; -process execution

Z ® before Z ( is excluded if it is drifted, included otherwise) on in

causes large number of cache misses. In a ; -process execution of

)

¨ . Then nodes in are executed on the same process and in the

§ , the enabling parent of the leaf nodes in the right subtree of the

¨ ¨) same order in both & and . root are in the left subtree and therefore the execution of each such Now consider the number of cache misses during the execution

leaf node causes  misses.

)

¨-& ¨ of the nodes in in and . Since the computation is nested

parallel and therefore race free, a process that executes in paral- ® 5 Nested-Parallel Computations lel with ® does not cause to incur cache misses due to sharing.

Therefore by Lemma 2 during the execution of the nodes in the

) 

In this section, we show that the cache overhead of an execution of number of cache misses in ¨ is at most more than the number

¨ ­

a nested-parallel computation with a work stealer is at most twice of misses in & . This bound holds for each of the sequence of ­ the product of the number of steals and the cache size. Our proof such instructions corresponding to drifted nodes. Since the

has two steps. First, we show that the cache overhead is bounded by sequence starting at the root node and ending at the first drifted

¨ ¨)¨)

the product of the cache size and the number of nodes that are exe- node incurs the same number of misses in & and takes at

¢­ ¨ cuted “out of order” with respect to the uniprocess execution order. most more misses than & and the cache overhead is at most

Second, we prove that the number of such out-of-order executions ¢­ . is at most twice the number of steals.

Lemma 2 (and thus Theorem 3) does not hold for caches that ¨-) Consider a computation and its -process execution, , are not simple. For example, consider the execution of a sequence

with a work stealer and the uniprocess execution, ¨& with the same of instructions on a cache with least-frequently-used replacement

Y

work stealer. Let Z be a node in and node be the node that exe- policy starting at two cache states. In the first cache state, the blocks

Z ¨ Z

cutes immediately before in & . Then we say that is drifted in that are frequently accessed by the instructions are in the cache with

Y Z ¨-) if node is not executed immediately before by the process high frequencies, whereas in the second cache state, the blocks that

À

±

½

² ³

° ¯ 2

1

¾¿ G

G

¹

» ¼

Á º

Figure 6: Children of  and their merger.

·

¸ [

Figure 8: The join node  is the least common ancestor of and .

Z  Node Y and are the children of . 2

1

´ µ

G G

Y 

¶ induced by the relatives of that are descendants of and also an-

t

cestors of . Similarly, let 3 denote the graph that is induced by

t 

the relatives of Z that are descendants of and ancestors of . Then

& Y Z 3 Z

Figure 7: The joint embedding of Y and . we call the embedding of with respect to and the em- Y

bedding of Z with respect to . We call the graph that is the union

Y Z  3

of & and the joint embedding of and with source and

t ¸

sink . Now consider an execution of and [ and be the children

¸

[ [ are in the cache are not accessed by the instruction and have low of  such that is executed before . Then we call the leader and frequencies. The execution with the second cache state, therefore, ¸ the guard of the joint embedding.

incurs many more misses than the size of the cache compared to

¸ [

the execution with the second cache state. Lemma 7 Let ¤§mn\^o¢¨ be an sp-dag and let and be two par-

t [

Now we show that the number of drifted nodes in an execution ents of a join node in . Let & denote the embedding of with

¸ ¸ [

of a series-parallel computation with a work stealer is at most twice respect to and 3 denote the embedding of with respect to . t

the number of steals. The proof is based on the representation of Let  denote the source and denote the sink of the joint embed-

t 

series-parallel computations as sp-dags. We call a node with out- ding. Then the parents of any node in & except for and is in

t



& 3 3 degree of at least ; a fork node and partition the nodes of an sp-dag and the parents of any node in except for and is in .

except the root into three categories: join nodes, stable nodes and

t

¸ 

nomadic nodes . We call a node that has an in-degree of at least Proof: Since [ and are independent, both of and are different

¸

[ J

; from and (see Figure 8). First, we show that there is not an

a join node and partition all the nodes that have in-degree into

 3

two classes: a nomadic node has a parent that is a fork node, and a edge that starts at a node in & except at and ends at a node in t

J except at and vice versa. For the sake of contradiction, assume

stable node has a parent that has out-degree . The root node has in-

t

¤§G\¦©¨ ÄÃ,b ¦yÃ, &

there is an edge such that is in and is in

degree and it does not belong to any of these categories. Lemma 4

¸

 [ 3 . Then is the least common ancestor of and ; hence no

lists two fundamental properties of sp-dags; one can prove both

¤§G\^¦©¨  such exists. A similar argument holds when is in 3 and

properties by induction on the number of edges in an sp-dag. ¦

is in & .

Lemma 4 Let be an sp-dag. Then has the following proper- Second, we show that there does not exists an edge that origi-

3 &

ties. nates from a node outside of & or and ends at a node at or

¤§Å¢\ÂÆ©¨ Æ 3

. For the sake of contradiction, let be an edge such that

1. The least common ancestor of any two nodes in is unique.

Å Æ

& 3

is in & and is not in or . Then is the unique merger for

 the two children of the least common ancestor of Å and , which

2. The greatest common descendant of any two nodes in is t  unique and is equal to their unique merger. we denote with  . But then is also a merger for the children of .

The children of  are independent and have a unique merger, hence Æ

there is no such edge ¤§ÅÇ\§Æ©¨ . A similar argument holds when is

3 & 

Lemma 5 Let  be a fork node. Then no child of is a join node. in . Therefore we conclude that the parents of any node in

t

  3

except and is in & and the parents of any node in except

t

Y Z  Y

Proof: Let and denote two children of and suppose is a and is in 3 . t

join node as in Figure 6. Let denote some other parent of Y and

¸ ¸

¸

Z Y

denote the unique merger of Y and . Then both and are [

t Lemma 8 Let be an sp-dag and let and be two parents of a

t

¸ Y mergers for  and , which is a contradiction of Lemma 5. Hence join node in . Consider the joint embedding of [ and and let

is not a join node. ¸ [ Y be the guard node of the embedding. Then and are executed in the same respective order in a multiprocess execution as they

Corollary 6 Only nomadic nodes can be stolen in an execution of are executed in the uniprocess execution if the guard node Y is not

a series-parallel computation by the work-stealing algorithm. stolen.

t Y

Proof: Let Y be a stolen node in an execution. Then is pushed Z Proof: Let  be the source, the sink, and the leader of the

on a deque and thus the enabling parent of Y is a fork node. By Z

joint embedding. Since Y is not stolen, is not stolen. Hence, by

Y J

Lemma 5, is not a join node and has an incoming degree . There- Lemma 7, before it starts working on Y , the process that executes

t

Y Z

fore is nomadic.  executed and all its descendants in the embedding except for

¸

[ Y

Consider a series-parallel computation and let be its sp-dag. Hence, is executed before Y and is executed after as in the

¸ t

uniprocess execution. Therefore, [ and are executed in the same

Z  Let Y and be two independent nodes in and let and denote

their least common ancestor and greatest common descendant re- respective order as they execute in the uniprocess execution.

spectively as shown in Figure 7. Let & denote the graph that is

t t

Æ [

& & 3

embedding of two parents of & and also . Let , be these

t t

Ê

É

Î Ï

Æ [

3 3 3

parents of & and , be the parents of , as shown in Figure 9. Y

Ê But then has parent that is a fork node and is a joint node, which

Ê

Ì Ì

É

Í É Í

contradicts Lemma 5. Hence no such Y exists. Ë

Theorem 12 The cache overhead of an execution of a nested-parallel

Ê

È É È computation with simple caches is at most twice the product of the

number of misses in the execution and the cache size. t

t Proof: Follows from Theorem 3 and Lemma 11. 3 Figure 9: Nodes & and are two join nodes with the common

guard Y . 6 An Analysis of Nonblocking Work Stealing

Lemma 9 A nomadic node is drifted in an execution only if it is The non-blocking implementation of the work-stealing algorithm stolen. delivers provably good performance under traditional and multi- programmed workloads. A description of the implementation and its analysis is presented in [2]; an experimental evaluation is given Proof: Let Y be a nomadic and drifted node. Then, by Lemma 5,

in [10]. In this section, we extend the analysis of the non-blocking

 Y Y  Y has a single parent that enables . If is the first child of to work-stealing algorithm for classical workloads and bound the ex- execute in the uniprocess execution then Y is not drifted in the mul-

ecution time of a nested-parallel, computation with a work stealer Z tiprocess execution. Hence, Y is not the first child to execute. Let

to include the number of cache misses, the cache-miss penalty and Y be the last child of  that is executed before in the uniprocess ex- the steal time. First, we bound the number of steal attempts in an ecution. Now, consider the multiprocess execution and let ® be the execution of a general computation by the work-stealing algorithm. process that executes Z . For the sake of contradiction, assume that

Then we bound the execution time of a nested-parallel computation

Y Z Y is not stolen. Consider the joint embedding of and as shown

with a work stealer using results from Section 5. The analysis that 

in Figure 8. Since all parents of the nodes in 3 except for and

t we present here is similar to the analysis given in [2] and uses the

® 3 are in 3 by Lemma 7, executes all the nodes in before it

¸ same potential function technique.

Y ® Y executes Y and thus, precedes on . But then is not drifted,

¸ We associate a nonnegative potential with nodes in a computa- because is the node that is executed immediately before Y in the tion’ s dag and show that the potential decreases as the execution uniprocess computation. Hence Y is stolen.

t proceeds. We assume that a node in a computation dag has out- Let us define the cover of a join node in an execution as the set degree at most ; . This is consistent with the assumption that each of all the guard nodes of the joint embedding of all possible pairs

t node represents on instruction. Consider an execution of a compu- of parents of in the execution. The following lemma shows that a tation with its dag, ¤§mc\^o¢¨ with the work-stealing algorithm. The join node is drifted only if a node in its cover is stolen. execution grows a tree, the enabling tree, that contains each node

in the computation and its enabling edge. We define the distance t¥Ó

Lemma 10 A join node is drifted in an execution only if a node in tªÓ



ÐF¤§YF¨  j4Ð2ÑÂÒ ¤§YF¨ Ð2ÑÂÒ ¤§YF¨ of a node YqGm , , as , where is the its cover is stolen in the execution. depth of Y in the enabling tree of the computation. Intuitively, the

t distance of a node indicates how far the node is away from end of Proof: Consider the execution and let be a join node that is the computation. We define the potential function in terms of dis-

drifted. Assume, for the sake of contradiction, that no node in the

t t t W

¸ tances. At any given step , we assign a positive potential to each

¨ [

cover of , ¥¤ , is stolen. Let and be any two parents of as

¸ ready node, all other nodes have potential. A node is ready if it is

in Figure 8. Then [ and are executed in the same order as in the t enabled and not yet executed to completion. Let Y denote a ready

uniprocess execution by Lemma 8. But then all parents of exe-

ÔÕI¤§YF¨ Y node at time step W . Then we define, , the potential of at

cute in the same order as in the uniprocess execution. Hence, the W t time step as

enabling parent of in the execution is the same as in the uniprocess

t Ö

execution. Furthermore, the enabling parent of has out-degree J ,

3^× &

t

0 Y

because otherwise is not a join node by Lemma 5 and thus, the B©Ø9EIÙ if is assigned;

Ô ¤§Y˜¨c,

Õ

t t t

3^× 0 process that enables executes . Therefore, is not drifted. A B©Ø9E otherwise.

contradiction, hence a node in the cover of t is stolen.

W Ú The potential at step , Õ , is the sum of the potential of each ready

Lemma 11 The number of drifted nodes in an execution of a series- node at step W . When an execution begins, the only ready node is 

parallel computation is at most twice the number of steals in the the root node which has distance  and is assigned to some proc-

3 & ÚÛ-,Ü0 execution. ess, so we start with @ Ý/Ù . As the execution proceeds, nodes that are deeper in the dag become ready and the potential Proof: We associate each drifted node in the execution with a decreases. There are no ready nodes at the end of an execution and

steal such that no steal has more than ; drifted nodes associated

the potential is . Y

with it. Consider a drifted node, Y . Then is not the root node Let us give a few more definitions that enable us to associate

Y

— ¤¥®=¨

of the computation and it is not stable either. Hence, is either a a potential with each process. Let Õ denote the set of ready

Y Y ®

nomadic or join node. If is nomadic, then is stolen by Lemma 9 nodes that are in the deque of process ® along with ’s assigned

Y Y Y Y

and we associate with the steal that steals . Otherwise, is a node, if any, at the beginning of step W . We say that each node

¥¤§YF¨

® ® join node and there is a node in its cover that is stolen by in —ÞÕp¤¥®=¨ belongs to process . Then we define the potential of ’s

Lemma 10. We associate Y with the steal that steals a node in its deque as

cover. Now, assume there are more than ; nodes associated with a

t

Ú ¤¥®=¨, ß Ô ¤§YF¨nå

Õ Õ &

steal that steals node Y . Then there are at least two join nodes

t

3 Y Y and that are associated with . Therefore, node is in the joint Ø$à"áâªBäãIE æ

In addition, let Õ denote the set of processes whose deque is empty This lemma can be proven with an application of Markov’ s in-

ç W ­

at the beginning of step , and let Õ denote the set of all other equality. The proof of a weaker version of this lemma for the case

processes. We partition the potential ÚèÕ into two parts of exactly throws is similar and given in [2]. Lemma 14 also

follows from the weaker lemma because ¨ does not decrease with

more throws.

Ú ,{Ú ¤¥æ ¨©:8Ú ¤§­ ¨n\

Õ Õ Õ Õ Õ

We now show that whenever or more steal attempts occur, the

Ú ¤§­ ¨ Õ where potential decreases by a constant fraction of Õ with constant

probability.

W X

Ú ¤¥æ ¨*,éß Ú ¤¥®=¨ Ú ¤§­ ¨*,éß Ú ¤¥®=¨n\

Õ Õ Õ Õ Õ

Õ and Lemma 15 Consider any step and any later step such that at

W X

least steal attempts occur at steps from (inclusive) to (exclu- ãêàuìcâ ãêà2ëâ sive). Then we have

and we analyze the two parts separately.

£¨¤

J J

þ*Ú j Ú ©Þ` ¤§­ ¨ ¦ å

Lemma 13 lists four basic properties of the potential that we use Ú Õ Õ Õ K frequently. The proofs for these properties are given in [2] and the K listed properties are correct independent of the time that execution of a node or a steal takes. Therefore, we give a short proof sketch. Moreover the potential decrease is because of the execution or as-

signment of nodes belonging to a process in ­/Õ .

Lemma 13 The potential function satisfies the following proper-

ties. Proof: Consider all processes and steal attempts that occur

W ® ­

at or after step . For each process in Õ , if one or more of the

® W

1. Suppose node Y is assigned to a process at step . Then the attempts target as the victim, then the potential decreases by

¤ÂJ(í=;$¨ÂÚèÕI¤¥®=¨

¤§;$í(02¨IÔ ¤§YF¨

potential decreases by at least Õ . due to the execution or assignment of nodes that belong K

to ® by property in Lemma 13. If we think of each attempt as a W

2. Suppose a node Y is executed at step . Then the potential ball toss, then we have an instance of the Balls and Weighted Bins

¤§$í î"¨IÔÕ«¤§Y˜¨ W ­/Õ

decreases by at least at step . Lemma (Lemma 14). For each process ® in , we assign a weight

,õ¤ÂJ(í=;$¨ÂÚèÕp¤¥®=¨ ® æ¢Õ

ö , and for each other process in , we assign

ã

W ® ­

3. Consider any step and any process in Õ . The topmost

ö ,y ö , ¤ÂJí$;$¨ÂÚ ¤§­ ¨ Õ

a weight . The weights sum to Õ . Using

ã

® 0"í K

node Y in ’s deque contributes at least of the potential

, J(í=;

ÿ in Lemma 14, we conclude that the potential decreases ÔÕI¤§YF¨ï`y¤¥0"í K"¨ÂÚðÕI¤¥®=¨

associated with ® . That is, we have .

ÿnö , ¤ÂJí K"¨ÂÚ ¤§­ ¨ Jèj Õ

by at least Õ with probability greater than

¦ Jí K

J(í2¤Â¤ÂJ j4ÿ*¨IÑ=¨ due to the execution or assignment of nodes

Ò ® ­

4. Suppose a process chooses process in Õ as its victim at ­

that belong to a process in Õ .

Ò ® W

time step W (a steal attempt of targeting occurs at step ).

¤ÂJ(í$;=¨ÂÚ ¤¥®=¨ Then the potential decreases by at least Õ due to We now bound the number of steal attempts in a work-stealing

the assignment or execution of a node belonging to ® at the computation.

end of step W .

Lemma 16 Consider a -process execution of a multithreaded com-

J

  Property follows directly from the definition of the potential putation with the work-stealing algorithm. Let & and denote

function. Property ; holds because a node enables at most two the computational work and the critical path of the computation.

children with smaller potential, one of which becomes assigned. Then the expected number of steal attempts in the execution is

Y



¥¤(   ¨ ¦• 

Specifically, the potential after the execution of node decreases by  . Moreover, for any , the number of steal at-

& &

ò ó ó

Ôc¤§YF¨ê¤ÂJñj j ¨*,õô Ôc¤§YF¨ 0



  :F¤ÂJ(í =¨Â¨ Jèj at least . Property follows from a struc- ¥¤(  tempts is  with probability at least . tural property of the nodes in a deque. The distance of the nodes in

a process’ deque decrease monotonically from the top of the deque Proof: We analyze the number of steal attempts by breaking the

  to bottom. Therefore, the potential in the deque is the sum of geo- execution into phases of  steal attempts. We show that with

metrically decreasing terms and dominated by the potential of the constant probability, a phase causes the potential to drop by a con-

t

, J

top node. The last property holds because when a process chooses stant factor. The first phase begins at step & and ends at

t

® ­ ®

Õ

   process in as its victim, the node at the top of ’ s deque 

the first step & such that at least steal attempts occur dur-

t t

is assigned at the next step. Therefore, the potential decreases by 

\ 

& 

ing the interval of steps & . The second phase begins at step

t t



;=í 0=Ô ¤§YF¨ J Ô ¤§YF¨k`{¤¥02í K"¨ÂÚ ¤¥®=¨

Õ Õ

Õ by property . Moreover, by prop-

, : J 3 & , and so on. Let us first show that there are at least

erty 0 and the result follows. J  steps in a phase. A process has at most outstanding steal Lemma 16 shows that the potential decreases as a computation attempt at any time and a steal attempt takes at least  steps to proceeds. The proof for Lemma 16 utilizes balls and bins game complete. Therefore, at most steal attempts occur in a period bound from Lemma 14.

of  time steps. Hence a phase of steal attempts takes at least

I¤( (¨Â ¢¨Âí$ ( `s   time units.

Lemma 14 (Balls and Weighted Bins) Suppose that at least balls X Consider a phase beginning at step W , and let be the step at

are thrown independently and uniformly at random into bins,

7 X

which the next phase begins. Then W˜:< . We will show that

£¨¤

W ö W,•J$\9å÷å9å9\^

where bin has a weight Õ , for . The total weight is

7

x2Ú© ¤¥02í(K2¨ÂÚ z¦ J(í K

we have Õ . Recall that the potential can

)

öø,úù ö W ¨

Õ Õ

. For each bin , define the random variable as ÚèÕ,yÚèÕI¤¥æ¢Õ¥¨2:GÚèÕª¤§­/Õ¥¨

Õüû& be partitioned as . Since the phase contains

£¥¤

( w x$Ú j8Ú© `{¤ÂJ(í(K2¨ÂÚ ¤§­ ¨Âz¦ Jí K

Õ Õ

steal attempts, Õ due ­

to execution or assignment of nodes that belong to a process in Õ , öaÕ if some ball lands in bin W ; ¨ýÕ,•þ by Lemma 15. Now we show that the potential also drops by a

otherwise.

Ú ¤¥æ ¨ Õ

constant fraction of Õ due to the execution of assigned nodes

æ Õ

) that are assigned to the processes in . Consider a process, say

¨1,úù ¨ ÿ ¡ 'ÿ¢ úJ

If Õ , then for any in the range , we have

Õüû&

® æ ® Ú ¤¥®=¨,. Õ

in Õ . If does not have an assigned node, then .

£¥¤

x(¨é`'ÿnöúz§¦yJèj8J(í2¤Â¤ÂJðj4ÿ*¨IÑ=¨

Y ÚèÕI¤¥®=¨ý, ÔÕI¤§YF¨

. If ® has an assigned node , then . In this case,

Y W=:aúj4J X process ® completes executing node at step at the

¤§$í(î2¨IÔ ¤§YF¨

latest and the potential drops by at least Õ by property

-

2

; ® æ

of Lemma 13. Summing over each process in Õ , we have

. / 01

. 1

/ 0

©

Ú ` ¤§$í î"¨ÂÚèÕ«¤¥æ¤Õ¥¨

ÚèÕ]j . Thus, we have shown that the potential

,

-

Ú ¤¥æ ¨ Ú ¤§­ ¨

Õ Õ Õ

decreases at least by a quarter of Õ and . Therefore

, ,

- -

æ ­ Õ

no matter how the total potential is distributed over Õ and , the ,

total potential decreases by a quarter with probability more than -

£¥¤

Jí K x=Ú jÚ©¢`y¤ÂJí K"¨ÂÚ z¦{J(í(K Õ

, that is, Õ . -

We say that a phase is successful if it causes the potential to -

, -

drop by at least a Jí K fraction. A phase is successful with prob-

3 &

, ,

- -

Jí K ÚÛ,.0

Ý Ù

ability at least . Since the potential starts at @

, -

and ends at (and is always an integer), the number of success-

”"!

ò

 

0 b#$ j{J(¨  

ful phases is at most ¤§;  . The expected 

number of phases needed to obtain #= successful phases is at ¥¤¥¨

most 0";  . Thus, the expected number of phases is , and

 w 

because each phase contains  steal attempts, the expected Figure 10: The tree of threads created in a data-parallel work-

( 4¨ ¥¤(  number of steal attempts is  . The high probability stealing application.

bound follows by an application of the Chernoff bound.



Theorem 17 Let %8)c¤§¢¨ be the number of cache misses in a - 3 process execution of a nested-parallel computation with a work- with probability at least Jèj .

The total number of dollars in steal bucket is the total number

 % ¤§¢¨ stealer that has simple caches of blocks each. Let & be the of steal attempts multiplied by the number of dollars added to the number of cache misses in the uniprocess execution Then

steal bucket for each steal attempt, which is ¥¤§(¨ . Therefore total

number of dollars in the steal bucket is

 

%')*¤§¢¨c,{% ¤§¢¨: ¥¤(  ¢ {:•  ¢ #%$c¤ÂJí =¨Â¨

&

 





¥¤§4) <¤¥{:3%$c¤ÂJí ¨Â¨Â¨

* & with probability at least J2j . The expected number of cache misses

is 

 4

with probability at least J/j . Each process adds exactly one



%'&÷¤§¢¨:8 ¥¤(  ¢  ¨

 dollar to a bucket at each step so we divide the total number of

dollars by to get the high probability bound in the theorem. A Proof: Theorem 12 shows that the cache overhead of a nested- similar argument holds for the expected time bound. parallel computation is at most twice the product of the number of

steals and the cache size. Lemma 16 shows that the number of steal

( /¤¥ý: %$c¤ÂJí =¨Â¨Â¨ J2j

attempts is ¥¤(( with probability at least 7 Locality-Guided Work Stealing ( <¨ and the expected number of steals is ¥¤(( . The number of steals is not greater than the number of steal attempts. Therefore The work-stealing algorithm achieves good data locality by execut- the bounds follow. ing nodes that are close in the computation graph on the same proc- ess. For certain applications, however, regions of the program that

Theorem 18 Consider a -process, nested-parallel, work-stealing access the same data are not close in the computational graph. As

§¦y computation with simple caches of  blocks. Then, for any , an example, consider an application that takes a sequence of steps the execution time is each of which operates in parallel over a set or array of values. We

will call such an application an iterative data-parallel application.

 ¤§¢¨ 

& Such an application can be implemented using work-stealing by

 

¥¤ :8 ˜ ¤¥ :%$¤ÂJ(í =¨Â¨=: ¤§8:(¨ê¤¥ :'%$¤ÂJí =¨Â¨Â¨  forking a tree of threads on each step, in which each leaf of the tree

updates a region of the data (typically disjoint). Figure 10 shows

( =¨ with probability at least ¤ÂJ j . Moreover, the expected running an example of the trees of threads created in two steps. Each node time is represents a thread and is labeled with the process that executes it.

The gray nodes are the leaves. The threads synchronize in the same &(¤§¢¨

 order as they fork. The first and second steps are structurally iden-

 

:<8  :s¤§b:8(¨I ¨nå

¥¤ tical, and each pair of corresponding gray nodes update the same

 region, often using much of the same input data. The dashed rect- Proof: We use an accounting argument to bound the running time. angle in Figure 10, for example, shows a pair of such gray nodes. At each step in the computation, each process puts a dollar into one To get good locality for this application, threads that update the of two buckets that matches its activity at that step. We name the same data on different steps ideally should run on the same proces- two buckets as the work and the steal bucket. A process puts a sor, even though they are not “close” in the dag. In work stealing,

dollar into the work bucket at a step if it is working on a node in however, this is highly unlikely to happen due to the random steals.  the step. The execution of a node in the dag adds either J or Figure 10, for example, shows an execution where all pairs of cor- dollars to the work bucket. Similarly, a process puts a dollar into responding gray nodes run on different processes.

the steal bucket for each step that it spends stealing. Each steal In this section, we describe and evaluate locality-guided work ¥¤§(¨ attempt takes ¥¤§(¨ steps. Therefore, each steal adds dollars stealing, a heuristic modification to work stealing which is de-

to the steal bucket. The number of dollars in the work bucket at the signed to allow locality between nodes that are distant in the com-

¥¤¥ :y¤§ j8J(¨%8)k¤§¢¨Â¨ end of execution is at most & , which is putational graph. In locality-guided work stealing, each thread can

be given an affinity for a process, and when a process obtains work 

 it gives priority to threads with affinity for it. To enable this, in addi-

¥¤¥ ¤§¢¨©: ¤§ j8J¨) ¢ ¤¥y:%$c¤ÂJ(í+ ¨Â¨Â¨

& tion to a deque each process maintains a mailbox: a first-in-first-out

*

7 Locality-Guided Work Stealing

The work-stealing algorithm achieves good data locality by executing nodes that are close in the computation graph on the same process. For certain applications, however, regions of the program that access the same data are not close in the computational graph. As an example, consider an application that takes a sequence of steps, each of which operates in parallel over a set or array of values. We will call such an application an iterative data-parallel application. Such an application can be implemented using work-stealing by forking a tree of threads on each step, in which each leaf of the tree updates a region of the data (typically disjoint). Figure 10 shows an example of the trees of threads created in two steps. Each node represents a thread and is labeled with the process that executes it. The gray nodes are the leaves. The threads synchronize in the same order as they fork. The first and second steps are structurally identical, and each pair of corresponding gray nodes update the same region, often using much of the same input data. The dashed rectangle in Figure 10, for example, shows a pair of such gray nodes. To get good locality for this application, threads that update the same data on different steps ideally should run on the same processor, even though they are not "close" in the dag. In work stealing, however, this is highly unlikely to happen due to the random steals. Figure 10, for example, shows an execution where all pairs of corresponding gray nodes run on different processes.
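The following minimal C++ sketch shows the shape of such a step, using plain std::thread rather than the multithreading primitives discussed in this paper; kGrain, update_region, and the array sizes are illustrative assumptions, not code from the benchmarks.

#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kGrain = 4096;                  // leaf region size (assumed)

void update_region(std::vector<double>& a, std::size_t lo, std::size_t hi) {
  for (std::size_t i = lo; i < hi; ++i) a[i] *= 0.5;  // stand-in for the real update
}

// Interior nodes fork two children; the leaves (the gray nodes) do the work.
// The joins mirror the forks, so threads synchronize in the order they fork.
void step_tree(std::vector<double>& a, std::size_t lo, std::size_t hi) {
  if (hi - lo <= kGrain) { update_region(a, lo, hi); return; }
  std::size_t mid = lo + (hi - lo) / 2;
  std::thread left([&] { step_tree(a, lo, mid); });   // fork the left subtree
  step_tree(a, mid, hi);                              // execute the right subtree
  left.join();
}

int main() {
  std::vector<double> a(1 << 20, 1.0);
  for (int step = 0; step < 100; ++step)              // the iterative steps
    step_tree(a, 0, a.size());
}

On every step the leaf covering a given range [lo, hi) touches the same data, which is exactly the reuse that locality-guided work stealing tries to exploit.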

In this section, we describe and evaluate locality-guided work stealing, a heuristic modification to work stealing which is designed to allow locality between nodes that are distant in the computational graph. In locality-guided work stealing, each thread can be given an affinity for a process, and when a process obtains work it gives priority to threads with affinity for it. To enable this, in addition to a deque each process maintains a mailbox: a first-in-first-out (FIFO) queue of pointers to threads that have affinity for the process. There are then two differences between the locality-guided work-stealing and work-stealing algorithms. First, when creating a thread, a process will push the thread onto both the deque, as in normal work stealing, and also onto the tail of the mailbox of the process that the thread has affinity for. Second, a process will first try to obtain work from its mailbox before attempting a steal. Because threads can appear twice, once in a mailbox and once on a deque, there needs to be some form of synchronization between the two copies to make sure the thread is not executed twice.
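As a concrete illustration of these two differences, the following lock-based C++ sketch shows where the mailbox enters the scheduling loop; it is a simplified stand-in (the real implementation discussed below is nonblocking), and Thread, Process, create_thread, and acquire_work are hypothetical names. Checking the local deque before the mailbox is also an assumption about ordering.

#include <cstdlib>
#include <deque>
#include <mutex>
#include <queue>
#include <vector>

struct Thread { int id; };            // stands in for a ready thread

struct Process {
  std::deque<Thread*> deque;          // owner works at the back; thieves steal from the front
  std::queue<Thread*> mailbox;        // FIFO of threads with affinity for this process
  std::mutex lock;                    // a real implementation would be nonblocking
};

// Difference 1: a new thread goes onto the creator's deque, as in ordinary
// work stealing, and also onto the tail of the mailbox of the process that
// the thread has affinity for.
void create_thread(Process& creator, Process& affine, Thread* t) {
  { std::lock_guard<std::mutex> g(creator.lock); creator.deque.push_back(t); }
  { std::lock_guard<std::mutex> g(affine.lock);  affine.mailbox.push(t); }
}

// Difference 2: an idle process drains its mailbox before attempting a steal.
Thread* acquire_work(Process& self, std::vector<Process*>& all) {
  {
    std::lock_guard<std::mutex> g(self.lock);
    if (!self.deque.empty())   { Thread* t = self.deque.back();    self.deque.pop_back(); return t; }
    if (!self.mailbox.empty()) { Thread* t = self.mailbox.front(); self.mailbox.pop();    return t; }
  }
  if (all.empty()) return nullptr;
  Process& victim = *all[std::rand() % all.size()];   // random steal attempt
  if (&victim == &self) return nullptr;               // count as a failed attempt
  std::lock_guard<std::mutex> g(victim.lock);
  if (!victim.deque.empty()) { Thread* t = victim.deque.front(); victim.deque.pop_front(); return t; }
  return nullptr;                                     // failed steal attempt
}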

A number of techniques that have been suggested to improve the data locality of multithreaded programs can be realized by the locality-guided work-stealing algorithm together with an appropriate policy to determine the affinities of threads. For example, an initial distribution of work among processes can be enforced by setting the affinity of a thread to the process that it will be assigned to at the beginning of the computation. We call this locality-guided work stealing with initial placements. Likewise, techniques that rely on hints from the programmer can be realized by setting the affinity of threads based on the hints. In the next section, we describe an implementation of locality-guided work stealing for iterative data-parallel applications. The implementation described can be modified easily to implement the other techniques mentioned.

[Table 1: Measured benchmark characteristics, listing the work, overhead, critical-path length, and average parallelism of the staticHeat, heat, lgHeat, ipHeat, staticRelax, relax, lgRelax, and ipRelax benchmarks. We compiled all applications with the Sun CC compiler using the -xarch=v8plus -O5 -dalign flags. All times are given in seconds. $T_s$ denotes the execution time of the sequential algorithm for the application (40.99 seconds for Relax).]

7.1 Implementation

We built locality-guided work stealing into Hood. Hood is a multithreaded programming library with a nonblocking implementation of work stealing that delivers provably good performance under both traditional and multiprogrammed workloads [2, 10, 30]. In Hood, the programmer defines a thread as a C++ class, which we refer to as the thread definition. A thread definition has a method named run that defines the code that the thread executes. The run method is a C++ function which can call Hood library functions to create and synchronize with other threads. A rope is an object that is an instance of a thread definition class. Each time the run method of a rope is executed, it creates a new thread. A rope can have an affinity for a process, and when the Hood run-time system executes such a rope, the system passes this affinity to the thread. If the thread does not run on the process for which it has affinity, the affinity of the rope is updated to the new process.
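The sketch below illustrates this programming model in ordinary C++; the base class ThreadDefinition, the class UpdateLeaf, and the update loop are illustrative placeholders and are not Hood's actual API.

#include <cstddef>

// A hypothetical thread definition for a leaf of the update tree. An instance
// of UpdateLeaf plays the role of a rope: on every step, executing its run
// method creates a thread that updates the same region of the array, so that
// thread can inherit the affinity the rope acquired on the previous step.
struct ThreadDefinition {        // a "thread definition": a C++ class with a run method
  virtual void run() = 0;        // the code that the thread executes
  virtual ~ThreadDefinition() {}
};

struct UpdateLeaf : ThreadDefinition {
  double* data;
  std::size_t lo, hi;            // the region this leaf updates on every step
  UpdateLeaf(double* d, std::size_t l, std::size_t h) : data(d), lo(l), hi(h) {}
  void run() override {
    for (std::size_t i = lo; i < hi; ++i) data[i] *= 0.5;   // stand-in for the real update
  }
};

Because the same instance is reused on every step, corresponding threads across steps come from one rope and therefore share an affinity.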

Iterative data-parallel applications can effectively use ropes by making sure all "corresponding" threads (threads that update the same region across different steps) are generated from the same rope. A thread will therefore always have an affinity for the process on which its corresponding thread ran on the previous step. The dashed rectangle in Figure 10, for example, represents two threads that are generated in two executions of one rope. To initialize the ropes, the programmer needs to create a tree of ropes before the first step. This tree is then used on each step when forking the threads.

To implement locality-guided work stealing in Hood, we use a nonblocking queue for each mailbox. Since a thread is put into a mailbox and onto a deque, one issue is making sure that the thread is not executed twice, once from the mailbox and once from the deque. One solution is to remove the other copy of a thread when a process starts executing it. In practice, this is not efficient because it has a large synchronization overhead. In our implementation, we do this lazily: when a process starts executing a thread, it sets a flag using an atomic update operation such as test-and-set or compare-and-swap to mark the thread. When executing a thread, a process identifies a marked thread with the atomic update and discards the thread. The second issue comes up when one wants to reuse the thread data structures, typically those from the previous step. When a thread's structure is reused in a step, the copies from the previous step, which can be in a mailbox or a deque, need to be marked invalid. One can implement this by invalidating all the multiple copies of threads at the end of a step and synchronizing all processes before the next step starts. In multiprogrammed workloads, however, the kernel can swap a process out, preventing it from participating in the current step. Such a swapped-out process prevents all the other processes from proceeding to the next step. In our implementation, to avoid the synchronization at the end of each step, we time-stamp thread data structures such that each process closely follows the time of the computation and ignores a thread that is "out-of-date".
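The following C++11 sketch shows one way to realize the lazy scheme just described; the ExecState wrapper, its field names, and the exact staleness test are assumptions for illustration rather than Hood's actual data structures.

#include <atomic>

// Per-thread execution state shared by the two copies (mailbox and deque).
struct ExecState {
  std::atomic<bool> executed{false};  // set by whichever copy is claimed first
  std::atomic<long> stamp{0};         // step number at which the thread was created
};

// Returns true if the calling process successfully claims the thread: the copy
// is up to date and no other copy has been executed yet.
bool try_claim(ExecState& s, long current_step) {
  if (s.stamp.load(std::memory_order_acquire) != current_step)
    return false;                     // out-of-date copy: ignore it
  // Atomic test-and-set: exchange returns the previous value, so exactly one
  // of the two copies observes false and gets to run; the other is discarded.
  return !s.executed.exchange(true, std::memory_order_acq_rel);
}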

7.2 Experimental Results

In this section, we present the results of our preliminary experiments with locality-guided work stealing on two small applications. The experiments were run on a 14-processor Sun Ultra Enterprise with 400 MHz processors and a 4 Mbyte L2 cache per processor, running Solaris 2.7. We used the processor_bind system call of Solaris 2.7 to bind processes to processors, to prevent the Solaris kernel from migrating a process among processors and causing the process to lose its cache state. When the number of processes is less than the number of processors, we bind one process to each processor; otherwise we bind processes to processors such that the processes are distributed among the processors as evenly as possible.

We use the applications Heat and Relax in our evaluation. Heat is a Jacobi over-relaxation that simulates heat propagation on a 2-dimensional grid for a number of steps. This benchmark was derived from similar Cilk [27] and SPLASH [35] benchmarks. The main data structures are two equal-sized arrays. The algorithm runs in steps, each of which updates the entries in one array using the data in the other array, which was updated in the previous step. Relax is a Gauss-Seidel over-relaxation algorithm that iterates over a single 1-dimensional array, updating each element by a weighted average of its value and that of its two neighbors. We implemented each application with four strategies: static partitioning, work stealing, locality-guided work stealing, and locality-guided work stealing with initial placements. The static partitioning benchmarks divide the total work equally among the processes and make sure that each process accesses the same data elements in all the steps; they are implemented directly with Solaris threads. The three work-stealing strategies are all implemented in Hood. The plain work-stealing version uses threads directly, and the two locality-guided versions use ropes by building a tree of ropes at the beginning of the computation. The initial placement strategy assigns initial affinities to the ropes near the top of the tree to achieve a good initial load balance. We use the following prefixes in the names of the benchmarks: static (static partitioning), none (work stealing), lg (locality-guided work stealing), and ip (lg with initial placement).

We ran all Heat benchmarks with the parameters -x 8K -y 128 -s 100. With these parameters each Heat benchmark allocates two arrays of double-precision floating-point numbers with 8192 columns and 128 rows and does relaxation for 100 steps. We ran all Relax benchmarks with the parameters -n 3M -s 100. With these parameters each Relax benchmark allocates one array of 3 million double-precision floating-point numbers and does relaxation for 100 steps. With the specified input parameters, a Relax benchmark allocates 24 Megabytes and a Heat benchmark allocates 16 Megabytes of memory for the main data structures. Hence, the main data structures for the Heat benchmarks fit into the collective L2 cache space of 4 or more processes, and the data structures for the Relax benchmarks fit into that of 6 or more processes. The data for no benchmark fits into the collective L1 cache space of the Ultra Enterprise. We observe superlinear speedups with some of our benchmarks when the collective caches of the processes hold a significant amount of frequently accessed data. Table 1 shows characteristics of our benchmarks. Neither the work-stealing benchmarks nor the locality-guided work-stealing benchmarks have significant overhead compared to the serial implementation of the corresponding algorithms.
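As a consistency check on these sizes, the footprints follow from the parameters above, assuming 8-byte double-precision values, -n 3M meaning $3 \times 2^{20}$ elements, and 4 Mbyte per-processor L2 caches (a worked calculation under these assumptions):

\[
\text{Heat: } 2 \times 8192 \times 128 \times 8\ \text{bytes} = 16\ \text{MB}, \qquad
\text{Relax: } 3 \times 2^{20} \times 8\ \text{bytes} = 24\ \text{MB},
\]
\[
\lceil 16/4 \rceil = 4 \text{ caches for Heat}, \qquad \lceil 24/4 \rceil = 6 \text{ caches for Relax}.
\]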

Figures 11 and 1 show the speedup of the Heat and Relax benchmarks, respectively, as a function of the number of processes. The static partitioning benchmarks deliver superlinear speedups under traditional workloads, but they suffer from the performance-cliff problem and deliver poor performance under multiprogrammed workloads. The work-stealing benchmarks deliver poor performance with almost any number of processes. The locality-guided work-stealing benchmarks, with or without initial placements, however, match the static partitioning benchmarks under traditional workloads and deliver superior performance under multiprogrammed workloads. The initial placement strategy improves the performance under traditional workloads, but it does not perform consistently better under multiprogrammed workloads. This is an artifact of binding processes to processors. The initial placement strategy distributes the load among the processes equally at the beginning of the computation, but binding creates a load imbalance between processors and increases the number of steals. Indeed, the benchmarks that employ the initial-placement strategy do worse only when the number of processes is slightly greater than the number of processors.

Locality-guided work stealing delivers good performance by achieving good data locality. To substantiate this, we counted the average number of times that an element is updated by two different processes in two consecutive steps, which we call a bad update. Figure 12 shows the percentage of bad updates in our Heat benchmarks with work stealing and locality-guided work stealing. The work-stealing benchmarks incur a high percentage of bad updates, whereas the locality-guided work-stealing benchmarks achieve a very low percentage. Figure 13 shows the number of random steals for the same benchmarks for varying numbers of processes. The graph is similar to the graph for bad updates, because it is the random steals that cause the bad updates. The figures for the Relax application are similar.

[Figure 11: Speedup of the Heat benchmarks (heat, lgHeat, ipHeat, staticHeat, and linear speedup) on 14 processors, as a function of the number of processes.]

[Figure 12: Percentage of bad updates (drifted leaves) for the Heat benchmarks, as a function of the number of processes.]

[Figure 13: Number of steals in the Heat benchmarks, as a function of the number of processes.]

References

[1] Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. The performance implications of thread management alternatives for shared-memory multiprocessors. IEEE Transactions on Computers, 38(12):1631–1644, December 1989.
[2] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.
[3] F. Bellosa and M. Steckermeier. The performance implications of locality information usage in shared memory multiprocessors. Journal of Parallel and Distributed Computing, 37(1):113–121, August 1996.
[4] Guy Blelloch and Margaret Reid-Miller. Pipelining with futures. In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 249–259, Newport, RI, June 1997.
[5] Guy E. Blelloch. Programming parallel algorithms. Communications of the ACM, 39(3), March 1996.
[6] Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1–12, Santa Barbara, California, July 1995.
[7] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 297–308, Padua, Italy, June 1996.
[8] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, August 1996.
[9] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356–368, Santa Fe, New Mexico, November 1994.
[10] Robert D. Blumofe and Dionisios Papadopoulos. The performance of work stealing in multiprogrammed environments. Technical Report TR-98-13, The University of Texas at Austin, Department of Computer Sciences, May 1998.
[11] F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture, pages 187–194, Portsmouth, New Hampshire, October 1981.
[12] David Callahan and Burton Smith. A future-based parallel language for a general-purpose highly-parallel computer. In David Padua, David Gelernter, and Alexandru Nicolau, editors, Languages and Compilers for Parallel Computing, Research Monographs in Parallel and Distributed Computing, pages 95–113. MIT Press, 1990.
[13] M. C. Carlisle, A. Rogers, J. H. Reppy, and L. J. Hendren. Early experiences with OLDEN (parallel programming). In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, pages 1–20. Springer-Verlag, August 1993.
[14] Rohit Chandra, Anoop Gupta, and John Hennessy. COOL: A Language for Parallel Programming. In David Padua, David Gelernter, and Alexandru Nicolau, editors, Languages and Compilers for Parallel Computing, Research Monographs in Parallel and Distributed Computing, pages 126–148. MIT Press, 1990.
[15] Rohit Chandra, Anoop Gupta, and John L. Hennessy. Data locality and load balancing in COOL. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 249–259, San Diego, California, May 1993.
[16] D. E. Culler and G. Arvind. Resource requirements of dataflow programs. In Proceedings of the International Symposium on Computer Architecture, pages 141–151, 1988.
[17] Dawson R. Engler, David K. Lowenthal, and Gregory R. Andrews. Shared Filaments: Efficient fine-grain parallelism on shared-memory multiprocessors. Technical Report TR 93-13a, Department of Computer Science, The University of Arizona, April 1993.
[18] Mingdong Feng and Charles E. Leiserson. Efficient detection of determinacy races in Cilk programs. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1–11, Newport, Rhode Island, June 1997.
[19] Robert H. Halstead, Jr. Implementation of Multilisp: Lisp on a multiprocessor. In Conference Record of the 1984 ACM Symposium on Lisp and Functional Programming, pages 9–17, Austin, Texas, August 1984.
[20] Robert H. Halstead, Jr. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programming Languages and Systems, 7(4):501–538, October 1985.
[21] High Performance Fortran Forum. High Performance Fortran Language Specification, May 1993.
[22] Vijay Karamcheti and Andrew A. Chien. A hierarchical load-balancing framework for dynamic multithreaded computations. In Proceedings of ACM/IEEE SC98: 10th Anniversary High Performance Networking and Computing Conference, 1998.
[23] Richard M. Karp and Yanjun Zhang. A randomized parallel branch-and-bound procedure. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing (STOC), pages 290–300, Chicago, Illinois, May 1988.
[24] Richard M. Karp and Yanjun Zhang. Randomized parallel algorithms for backtrack search and branch-and-bound computation. Journal of the ACM, 40(3):765–789, July 1993.
[25] David A. Krantz, Robert H. Halstead, Jr., and Eric Mohr. Mul-T: A High-Performance Parallel Lisp. In Proceedings of the SIGPLAN'89 Conference on Programming Language Design and Implementation, pages 81–90, 1989.
[26] Evangelos Markatos and Thomas LeBlanc. Locality-based scheduling for shared-memory multiprocessors. Technical Report TR-094, Institute of Computer Science, F.O.R.T.H., Crete, Greece, 1994.
[27] MIT Laboratory for Computer Science. Cilk 5.2 Reference Manual, July 1998.
[28] Eric Mohr, David A. Kranz, and Robert H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2(3):264–280, July 1991.
[29] Girija J. Narlikar. Scheduling threads for low space requirement and good locality. In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), June 1999.
[30] Dionysios Papadopoulos. Hood: A user-level thread library for multiprogrammed multiprocessors. Master's thesis, Department of Computer Sciences, University of Texas at Austin, August 1998.
[31] James Philbin, Jan Edler, Otto J. Anshus, Craig C. Douglas, and Kai Li. Thread scheduling for cache locality. In Proceedings of the Seventh ACM Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 60–71, Cambridge, Massachusetts, October 1996.
[32] Dan Stein and Devang Shah. Implementing lightweight threads. In Proceedings of the USENIX 1992 Summer Conference, San Antonio, Texas, June 1992.
[33] Jacobo Valdes. Parsing Flowcharts and Series-Parallel Graphs. PhD thesis, Stanford University, December 1978.
[34] B. Weissman. Performance counters and state sharing annotations: a unified approach to thread locality. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 262–273, October 1998.
[35] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA), pages 24–36, Santa Margherita Ligure, Italy, June 1995.
[36] Y. Zhang and A. Ortynski. The efficiency of randomized parallel backtrack search. In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing, Dallas, Texas, October 1994.
[37] Yanjun Zhang. Parallel Algorithms for Combinatorial Search Problems. PhD thesis, Department of Electrical Engineering and Computer Science, University of California at Berkeley, November 1989. Also: University of California at Berkeley, Computer Science Division, Technical Report UCB/CSD 89/543.