Responsive Parallel Computation

Stefan Muller, Umut A. Acar, Robert Harper
Carnegie Mellon University
[email protected], [email protected], [email protected]

Abstract

As parallel (multicore) hardware proliferates, there is growing interest in developing languages and techniques for writing and reasoning about parallel programs. One important direction is to abstract away from the details of how parallelism is implemented by using implicitly parallel programming languages and reasoning about performance using abstract cost models based on the metrics of work and span, which can then be mapped to actual performance by taking advantage of classic results in the field, such as Brent's Theorem. While very effective for compute-bound applications, the integration of these methods with effects such as input-output has not been well understood, even though many applications (e.g., games and servers) increasingly involve such interaction.

In this paper, we propose responsive parallel computation to bring together performant implicit parallelism with responsive interaction. We first present a parallel language that provides separate mechanisms for interaction and parallelism, and for prioritization of computations. The language separates foreground (high priority) and background (low priority) computations using a type system based on linear temporal logic and comes with both an operational semantics and a cost model. The cost model is based on a refinement of the work-span model that introduces the notions of foreground work and span. As our key result, we prove a "Brent-type" theorem that establishes a bound on the responsiveness as well as the completion time of a computation. We present a small implementation and several examples that give some evidence of the practical implications of our results.

1. Introduction

Shared-memory parallelism has become mainstream, with nearly every computer today including multicore chips. Many programming languages have been developed to support parallel computing on such shared-memory multiprocessors, including OpenMP, Cilk [19], Fork/Join Java [30], Habanero Java [25], TPL [31], TBB [26], X10 [12], parallel ML [18], and parallel Haskell [11, 28].

These languages enable the programmer to express parallelism at an abstract level by means of primitives such as fork-join, async-finish, and futures. This approach relieves the programmer of the complexities of specifying the mapping of tasks onto processors, instead relying on the runtime system to schedule tasks onto processors in an online fashion to minimize completion time. Determining an optimal schedule of tasks is NP-hard [44], but a classic result by Brent [9] shows that a 2-factor approximation can be computed by using a greedy principle, which requires keeping processors as busy as possible by greedily assigning ready tasks to them. Brent's scheduling principle leads naturally to cost models for parallelism based on the notions of work, defined as the total computation to be performed, and span, defined as the longest chain of dependent computations [5, 27, 41]. Brent's scheduling principle implies that a parallel computation with W work and S span can be executed in W/P + S time. As a result, well-designed parallel computations should show good speedup, which measures relative improvement in completion time with respect to a single processor.

However, nearly all of these advances in programming languages, cost models and scheduling algorithms focus on minimizing completion time in compute-intensive applications such as matrix operations, Fast Fourier Transformation, Barnes-Hut, etc. Such algorithms perform a large amount of computation over given data to produce an output, with little or no interaction with the outside world.

As the potential applications of parallelism expand beyond these traditional, computational workloads, we are interested in the question of how to write and reason about parallel programs that interact with the external world. For example, an application such as a game or server that interacts with users or clients must respond quickly to ensure effective interaction while also performing a compute-intensive task (e.g. analytics on a database, AI strategy in a game) in parallel, mixing parallel computation and interaction. Such applications must maximize not only speedup, but also responsiveness (or, equivalently, minimize response time, the time by which interactive sub-computations are delayed). Very little research, theoretical or experimental, has focused on parallel interaction.

In this paper, we seek to address this problem of responsive parallel computation by developing a language and a cost model for interactive parallel programs which highlight not just overall completion time but also the responsiveness of portions of the program which are marked as high-priority. Contributions of this paper include the following:

• A calculus λip with features for parallelism, interaction and priority annotations. The type system of λip, based on ideas from linear temporal logic (e.g. [13]), cleanly separates computation at different priorities. Its operational semantics utilizes a prompt scheduling principle which generalizes the traditional notion of greedy scheduling to prioritize foreground computations.
• A cost model which constructs cost graphs indicating both the parallel structure and the priorities of subcomputations.
• Upper bounds on both the parallel completion time and the total response time of executions of λip programs, generalizing Brent's Theorem to show that a prompt schedule is within a constant factor of optimal in terms of both completion time and responsiveness.
• A preliminary implementation of a responsive parallel library for Standard ML, and implemented examples to give some evidence of the practicality of the proposed techniques.

2. Overview

We present an overview of the results in the paper, giving relevant background as necessary. Focusing on intuition, the presentation here is informal but the rest of the paper makes the ideas precise. For the purposes of illustration, we use an ML-like (strict, purely functional) language.

Fork-Join Parallelism. Our starting point is a purely functional ML-like language with fork-join parallelism. A par(e1, e2) tuple evaluates e1 and e2 in parallel, returning the resulting values as an ordinary tuple. For example, we can write a function that computes the nth Fibonacci number by using the standard recursive algorithm as follows.

    function fib n =
      if n <= 1 then n
      else
        let (a, b) = par(fib (n - 1), fib (n - 2))
        in a + b

Since the two recursive calls are independent, they can be performed in parallel, leading to a parallel algorithm. This function is not an efficient way to compute Fibonacci numbers but is used to illustrate a compute-intensive parallel computation.

To execute such a program, modern parallel programming languages generate light-weight (user-level) threads (a.k.a. tasks, strands, fibers, sparks, etc.) and load balance them over processors using a parallel scheduler. Many such schedulers have been proposed; most modern languages use a variant of work stealing [10, 21]. In work stealing, processors work from a queue of threads. When they run out of work, they steal threads from another, usually randomly chosen, processor.

Challenge: Responsive Interaction. To see the challenge of responsive interaction, we extend our parallel language with input and output constructs. Since our techniques do not depend on exactly how these operations perform interaction (e.g. via a console, through GUI operations, over a network), we leave this aspect unspecified. We can now write a simple interactive program that asks the user two questions, responds, and repeats, i times.

    function quest i =
      if i <= 0 then ()
      else
        let _ = output("What is your name?") in
        let nm = input() in
        let _ = output("What is your quest?") in
        let qu = input() in
        let _ = output("You may pass, " ^ nm) in
        quest (i - 1)

Composing our two examples, Fibonacci and quest, we can now write a very simple parallel interactive function that performs a large Fibonacci computation as it interacts with the user.

    function fib_quest () = par(fib 43, quest 15)

To see how our fib_quest example runs on an actual parallel machine, we implemented a variant of the example using a parallel extension to the ML language [40, 41] that uses a well-engineered work-stealing scheduler¹. The average response time for each user input, shown by the top curve (labeled "Standard work stealing") in Figure 1, is on the order of 1 second with 4 or more processors. With fewer processors, we registered no response. Such response times are unacceptable—effective interaction requires response times to be on the order of milliseconds. These measurements are consistent with how work stealing operates: it treats the threads performing I/O in the same way as the thousands of threads generated by the parallel Fibonacci function.

Figure 1. The average response time of a simple terminal application (in microseconds, log scale).

Language λip. The fib_quest function illustrates a fundamental challenge in parallel computing: interaction requires certain blocks of computation, such as those that perform interaction, to be given high priority, but existing parallelism constructs and cost models do not support such prioritization. To enable priority, we introduce two language constructs: fg annotates a piece of code, which we refer to as a foreground block, as running in the foreground (with high priority). The annotation bg embeds a background computation, which runs with low priority, inside a foreground block.

To keep interaction responsive, it is key that foreground computations never wait on background computations by demanding their results. However, a foreground block should be able to start background computations and "pass them around" until they can be used (in the background). In Section 3, we present a language called λip and a type system based on linear temporal logic that enforces this key invariant. The language extends a simply-typed lambda calculus with constructs for I/O and with prioritized computations. The basic idea behind the type system is to differentiate between two "worlds", background and foreground, and allow background computations to be typed in the foreground by a distinguished type that has no elimination form in the foreground.

In λip, we can write the fib_quest example so that it runs quest in the foreground. As shown below, since it returns to the background, quest returns a background value.

    function quest i =
      if i <= 0 then bg()
      else
        let _ = output("What is your name?") in
        let nm = input() in
        let _ = output("What is your quest?") in
        let qu = input() in
        let _ = output("You may pass, " ^ nm) in
        quest (i - 1)

    function fib_quest () = par(fib 43, fg(quest 15))

The operational semantics of λip specifies an evaluation strategy, which we call prompt scheduling, which guarantees that foreground computations are run with high priority. The basic idea behind prompt scheduling is to create two kinds of threads, background and foreground, and to schedule background threads only when foreground threads are exhausted, effectively giving priority to foreground threads. Processors are left idle only when there are not enough threads to execute (this latter requirement subsumes the traditional notion of greedy scheduling). Other than these constraints, the order in which threads are scheduled is left non-deterministic. In Figure 1, the plot labeled "Work stealing with priorities" illustrates the response time with a simplified implementation (Section 6) of λip on top of the Parallel ML extension. The interaction is now instantaneous, completing under a millisecond.

¹The implementation allows I/O operations to proceed without blocking the underlying OS process.
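To make the prompt principle concrete, the following Standard ML fragment sketches the selection policy for a single scheduling step: fill up to P processors with ready foreground threads first, and only then with ready background threads. This is our simplified illustration of the policy, not the implementation of Section 6; the list-based interface is hypothetical.

    (* Sketch of the prompt selection policy.  Given the ready
       foreground and ready background threads, choose which threads
       run this step: foreground first, then background, up to p.
       The polymorphic 'a stands in for a thread representation. *)
    fun selectThreads (p : int) (readyFg : 'a list) (readyBg : 'a list) : 'a list =
      let
        val fg = List.take (readyFg, Int.min (p, length readyFg))
        val slots = p - length fg
        val bg = List.take (readyBg, Int.min (slots, length readyBg))
      in
        fg @ bg  (* processors idle only if fewer than p threads are ready *)
      end

For example, with p = 4, three ready foreground threads and ten ready background threads, the step runs the three foreground threads and one background thread.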

    function fib_server () =              function fib_server () =
      let n = input() in                    let n = input() in
      if n < 0 then ()                      if n < 0 then bg()
      else                                  else
        output(fib n);                        bg(output(fib n));
        fib_server()                          fib_server()

    function main () =                    function main () =
      fib_server()                          fg(fib_server ())

Figure 2. Parallel Fibonacci without (left) and with priorities (right).
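As a usage illustration (ours): with the right-hand version, entering 3 causes fib_server to spawn a background thread that computes fib 3 and outputs it, while the foreground loop immediately prompts for the next input; entering a negative number ends the loop. Long-running requests therefore never delay the prompt.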

In the fib_quest example, foreground computations and background computations do not interact in interesting ways. As a more complex example, consider a "Fibonacci server", fib_server, shown in Figure 2 on the left. The function asks the user for an input n (a natural number) and computes the nth Fibonacci number using fib. Since a Fibonacci computation can take a long time, the input loop could become sluggish. To solve this problem, the programmer can run fib_server in the foreground while pushing the call to fib to the background, as shown on the right in Figure 2. The expression bg(output(fib n)) spawns a new background thread to asynchronously perform the Fibonacci computation and output the result. The foreground computation can spawn many background computations, each of which computes the requested Fibonacci number in parallel with other background computations as well as the foreground interactive server loop.

Note that the foreground computation never demands the results of the background computations (though it may pass around handles to them as first class values). This requirement that foreground computations do not depend on background computations is the key principle that responsive parallel computations must enforce.

Cost Semantics. We establish that the evaluation strategy of λip is efficient and responsive by extending the traditional work-span model for non-interactive parallel computing. The classic cost models represent parallel computations using a dag (directed acyclic graph) in which each vertex represents an instruction and each edge represents a dependency between the instructions. In its most basic form, each vertex/instruction represents a machine-level operation, but more abstractly, any sequence of operations can be considered as a vertex/instruction. As an example, Figure 3 illustrates the dag for our fib function with input value n = 3, using a vertex to represent each recursive call to fib. Vertices with out-degree two "fork" two parallel computations. Vertices with in-degree two "join" two parallel computations; a join vertex synchronizes its two in-neighbors by waiting for both of them to complete before executing.

Figure 3. A dag representation of fib(3).

Given a dag, work W is defined as the number of vertices in the dag and span S is defined as the length of the longest path in the dag. Work can be thought of as the (asymptotic) time needed to complete the computation with one processor. Span can be thought of as the (asymptotic) time needed to complete the computation with infinitely many processors. The structure of the fib function shows that it performs exponential work in linear span, i.e., W(n) = Θ(φⁿ) and S(n) = Θ(n). Brent's Theorem establishes the key result that a computation with W work and S span can be executed on P processors in T_P ≤ W/P + S time. This bound is within a factor of two of optimal, since W/P and S are each, individually, lower bounds on the total computation time.

In interactive parallel computation, in addition to parallel run time, we are interested in the responsiveness of the foreground blocks. For this, we use total response time, which we define as the total time required to execute foreground blocks, including the time blocks wait to be scheduled. We extend the classical model with the notions of foreground work W!, foreground span S!, and foreground width D. Foreground work and span refer to the total work and span of the foreground blocks, which we mark off in the dag. Foreground width is the maximum number of foreground blocks which can be running at a time. It is called "width" since it corresponds to the maximum number of foreground blocks that can be crossed by a cut separating the already-executed and the not-yet-executed vertices in the dag at any given time in an execution. Figure 4 shows the dag for fib_quest (using n = 3 and i = 1). The foreground block for quest is indicated by the rectangle enclosing part of the dag. The edge weights δ1 and δ2 stand for the latency incurred by the two input instructions, which are included in the span but not the work. The foreground work and span are thus 5 and 5 + δ1 + δ2, respectively. The foreground width is 1.

Figure 4. A dag representation of fib(3) composed with quest(1).

The cost semantics for λip (Section 4) specifies these notions precisely by generating a dag that makes it possible to read off the 1) work and span, 2) foreground work and span, and 3) foreground width. We then show (Section 5) that the operational semantics of λip evaluates a program with work W, span S, foreground work W!, foreground span S! and foreground width D in time W/P + S with total response time D·W!/P + S!. Note that the total response time depends only on the properties of the foreground blocks, effectively excluding whatever other work might be performed in the background, and that the completion time for the whole program is the same as given by Brent's classic result. We are thus able to show that responsiveness can be guaranteed without penalizing non-interactive computations. For example, for both fib_quest and fib_server, the response time is bounded by the relatively small D·W!/P + S! and is independent of the (much larger) work and span of the Fibonacci computations.

An important property of Brent's principle is that it guarantees that the P-processor run-time is within a factor 2 of optimal. For our prompt scheduling principle, we establish a similar but slightly weaker optimality bound for total response time which assumes that the scheduler has no information about the shape of the dag and thus cannot make decisions based on its future shape. Since computation dags unfold dynamically, this is a reasonable assumption.
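To get a concrete sense of these bounds, suppose (our illustrative numbers) that the Fibonacci computations in fib_quest contribute W = 10⁹ and S = 10³ while, as computed above, the quest block has W! = 5, S! = 5 + δ1 + δ2 and D = 1. On P = 8 processors, a prompt schedule then completes in at most

    W/P + S = 10⁹/8 + 10³ steps,

while the total response time is at most

    D·W!/P + S! = 5/8 + 5 + δ1 + δ2 steps,

that is, essentially just the user's own input latency, independent of the billion units of background work.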

3. Language

In this section, we introduce a core calculus called λip, which extends a simply-typed lambda calculus with the features introduced in Section 2 for I/O, parallelism and priority. The type system of λip separates subcomputations by priority (foreground or background). The operational semantics for the language makes explicit the available pool of threads and simulates a run of a P-processor parallel scheduler on the program. The threads scheduled at each step of the semantics are chosen nondeterministically, allowing for a variety of scheduling policies and mechanisms (e.g. work stealing). The semantics require only that the scheduler be prompt, keeping all processors busy when possible and prioritizing foreground blocks.

3.1 Syntax

    x, y, z ∈ Variables        a, b, c ∈ Threads        d ∈ InputIDs

    Types  τ ::= unit | nat | τ1 → τ2 | τ1 × τ2 | τ1 + τ2 | ◯τ
    Exprs. e ::= x | ⟨⟩ | n | λx:τ.e | e e | ⟨e, e⟩ | fst(e) | snd(e) |
                 inl(e) | inr(e) | case(e){x.e; y.e} | fix x:τ is e |
                 e ∥ e | join[a, a] | out(e) | inp[d](x.e) | in(x.e) |
                 bg(e) | tid[a] | fg(e)

Figure 5. Syntax

The syntax of λip is presented in Figure 5. Most features are fairly standard for a simply-typed lambda calculus. The types τ include two base types, unit and natural numbers, as well as functions, binary tuples, binary sums and the circle type ◯τ, which represents handles to background threads. The expressions e include the standard introduction and elimination forms for base types, functions, pairs and sums: natural numbers n, λ-abstractions, application, pairs, projection, injection, and case analysis. Recursion is possible through the fixed point operator fix x:τ is e.

Parallel pairs e1 ∥ e2 evaluate e1 and e2 simultaneously and return a pair of their values when both computations complete. The form join[a, b], where a and b are thread identifiers, will, in the operational semantics, indicate the point at which two parallel computations join, but this form is not included in source programs.

The command out(e) evaluates e to a (natural number) value and then outputs it, returning ⟨⟩. The command inp[d](x.e) takes a natural number as input from the user, which blocks for some amount of time, and binds this value to x in e. The input happens in two stages; inp[d](x.e) performs the blocking and steps to in(x.e) (which does not appear in source programs), which actually substitutes the input into e. Blocking is controlled by the symbol d, which identifies this input. The operational semantics and the cost semantics of Section 4 will be parametrized over an assignment ∆ which maps these symbols to sets of non-negative integer delays, thus specifying how long each input command may block. Generally, the set will be an interval specifying the minimum and maximum delays. The use of identifiers and mappings in this way allows different inputs to incur different delays. For example, input might be used to represent a "sleep" operation that waits exactly n time steps and then returns 0. If this input was marked with identifier d, then we would run the program with a ∆ such that ∆(d) = {n}. On the other hand, if an input waits for a response from the user, we might use an interval containing a reasonable estimate of how long it would take the user to respond. There is no difficulty in extending the I/O constructs to handle other base types, but we restrict ourselves to natural numbers for simplicity.

The circle type ◯τ is the type of a handle to a background thread running an expression of type τ. The introduction form for ◯τ is bg(e), which returns a first-class thread handle tid[b]. Thread handles are eliminated by fg(e), which runs e in the foreground until it evaluates to a background thread handle, reaching fg(tid[b]). At this point, evaluation blocks until thread b has evaluated its expression down to an irreducible value e′, which is then returned. Since the results of background computation are only used when a foreground block terminates, the requirement that foreground computations cannot wait on background computations is equivalent to the requirement that foreground blocks fg(e) are not nested. This is enforced by our type system, which will be described in detail in Section 3.2.

The final syntactic form we introduce is thread pools µ. A thread pool is a mapping from thread identifiers a to pairs (δ, e) of a delay and an expression, indicating that thread a may run command e after time δ (the unit of which will be steps of the global transition semantics defined in Section 3.3). We write a thread pool as

    a1 ↪ (δ1, e1) ⊗ ... ⊗ an ↪ (δn, en)

and write the concatenation of two disjoint thread pools as µ1 ⊗ µ2.
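The grammar of Figure 5 can be transcribed directly into a datatype. The following Standard ML rendering is our own illustration — the constructor names and the use of strings for variables, thread identifiers and input identifiers are choices we made for this sketch, not part of the formal calculus:

    (* A possible SML representation of lambda-ip abstract syntax. *)
    datatype typ
      = Unit | Nat
      | Arrow of typ * typ
      | Prod of typ * typ
      | Sum of typ * typ
      | Circ of typ                        (* circle type: bg-thread handles *)

    datatype exp
      = Var of string
      | Triv                               (* the unit value *)
      | Num of int
      | Lam of string * typ * exp
      | App of exp * exp
      | Pair of exp * exp
      | Fst of exp | Snd of exp
      | Inl of exp | Inr of exp
      | Case of exp * (string * exp) * (string * exp)
      | Fix of string * typ * exp
      | Par of exp * exp                   (* e1 || e2 *)
      | Join of string * string            (* join[a,b]; not in source programs *)
      | Out of exp
      | Inp of string * string * exp       (* inp[d](x.e) *)
      | In of string * exp                 (* in(x.e); not in source programs *)
      | Bg of exp | Tid of string | Fg of exp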

    Expression typing   Γ ⊢Σ e : τ @ w

    Γ, x : τ @ w ⊢Σ x : τ @ w
    Γ, x : nat @ w ⊢Σ x : nat @ w′
    Γ ⊢Σ ⟨⟩ : unit @ w        Γ ⊢Σ n : nat @ w
    Γ, x : τ @ w ⊢Σ e : τ′ @ w  ⟹  Γ ⊢Σ λx:τ.e : τ → τ′ @ w
    Γ ⊢Σ e1 : τ → τ′ @ w,  Γ ⊢Σ e2 : τ @ w  ⟹  Γ ⊢Σ e1 e2 : τ′ @ w
    Γ ⊢Σ e1 : τ1 @ w,  Γ ⊢Σ e2 : τ2 @ w  ⟹  Γ ⊢Σ ⟨e1, e2⟩ : τ1 × τ2 @ w
    Γ ⊢Σ e1 : τ1 @ w,  Γ ⊢Σ e2 : τ2 @ w  ⟹  Γ ⊢Σ e1 ∥ e2 : τ1 × τ2 @ w
    Γ ⊢Σ e : τ1 × τ2 @ w  ⟹  Γ ⊢Σ fst(e) : τ1 @ w
    Γ ⊢Σ e : τ1 × τ2 @ w  ⟹  Γ ⊢Σ snd(e) : τ2 @ w
    Γ ⊢Σ,a∼τ1@w,b∼τ2@w join[a, b] : τ1 × τ2 @ w
    Γ ⊢Σ e : τ1 @ w  ⟹  Γ ⊢Σ inl(e) : τ1 + τ2 @ w
    Γ ⊢Σ e : τ2 @ w  ⟹  Γ ⊢Σ inr(e) : τ1 + τ2 @ w
    Γ ⊢Σ e : τ1 + τ2 @ w,  Γ, x : τ1 @ w ⊢Σ e1 : τ′ @ w,  Γ, y : τ2 @ w ⊢Σ e2 : τ′ @ w
        ⟹  Γ ⊢Σ case(e){x.e1; y.e2} : τ′ @ w
    Γ, x : τ @ w ⊢Σ e : τ @ w  ⟹  Γ ⊢Σ fix x:τ is e : τ @ w
    Γ ⊢Σ e : nat @ w  ⟹  Γ ⊢Σ out(e) : unit @ w
    Γ, x : nat @ w ⊢Σ e : τ @ w  ⟹  Γ ⊢Σ inp[d](x.e) : τ @ w
    Γ, x : nat @ w ⊢Σ e : τ @ w  ⟹  Γ ⊢Σ in(x.e) : τ @ w
    Γ ⊢Σ e : τ @ B  ⟹  Γ ⊢Σ bg(e) : ◯τ @ F
    Γ ⊢Σ,b∼τ@B tid[b] : ◯τ @ F
    Γ ⊢Σ e : ◯τ @ F  ⟹  Γ ⊢Σ fg(e) : τ @ B

    Thread pool typing   Γ ⊢Σ′ µ : Σ

    Γ ⊢Σ′ ∅ : ·
    Γ ⊢Σ,Σ′ e : τ @ w,  Γ ⊢Σ′ µ : Σ  ⟹  Γ ⊢Σ′ a ↪ (δ, e) ⊗ µ : Σ, a ∼ τ @ w

Figure 6. Static semantics of λip

3.2 Static Semantics

The type system of λip makes explicit the distinction between foreground and background computations, which prevents responsiveness problems that could result if foreground blocks were accidentally allowed to wait for background code, or if time-sensitive foreground code were accidentally run in the background. We enforce this separation using a type system based on ideas drawn from linear temporal logic and other type systems based on LTL, especially those for staged computation. The relationship with staged languages is discussed further in Section 7.

The main typing judgment for λip is Γ ⊢Σ e : τ @ w. This judgment indicates that e has type τ at world w (where w is F or B). The judgment makes use of two contexts. Variable contexts Γ have entries of the form x : τ @ w, indicating that variable x is in the context with type τ at world w. Thread signatures Σ have entries of the form a ∼ τ @ w, indicating that thread a is running an expression of type τ at world w. Most of the rules allow expressions to type at any world, but require all subexpressions to be at the same world as the whole expression, enforcing the restriction that code can only be moved between worlds by spawning an asynchronous background thread with bg(e) or starting a foreground block with fg(e). If e is background code of type τ, the expression bg(e) starts a background thread of type τ and immediately returns a handle of type ◯τ in the foreground. If e types in the foreground with type ◯τ, i.e. it will evaluate to (a thread running) background code of type τ, the expression fg(e) has type τ at B. Typing fg(e) at B prevents the nesting of foreground blocks, as desired.

There are two rules for typing variables. If x : τ @ w is in the context, the variable x has type τ at world w. We also allow variables of type nat to type at either world, allowing foreground code to make use of variables (of type nat) bound in the background and vice versa. The restriction to type nat ensures that code can't "escape" to the wrong world encapsulated in a function or thread. This is related to the mobility restriction of Murphy et al. [33], and could easily be expanded to allow any "mobile" type, including unit, sums and products (but not functions or ◯τ). The rules for join[a, b] and tid[b] look up the thread identifiers in the signature and produce the appropriate types.

The final judgment, Γ ⊢Σ′ µ : Σ, indicates that the thread pool µ has the signature Σ. The rules require that a ∼ τ @ w ∈ Σ if and only if a ↪ (δ, e) ∈ µ and e has type τ at world w. For the purposes of typing, thread pools are ordered and threads may only refer to threads that come later in µ. This ensures that the references between threads are acyclic. However, we will occasionally treat the thread pool as the unordered set of its threads when this property is not important. Expressions are also allowed to refer to threads in Σ′, which must be disjoint from Σ, allowing us to type just a part of a thread pool, whose expressions may refer to threads outside this part. Whenever µ is the entire thread pool, Σ′ will be empty.
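Returning to the world discipline, here is a small worked example of these rules (ours): if · ⊢Σ e : nat @ B, then

    · ⊢Σ bg(e) : ◯nat @ F        (spawn e in the background; handle in the foreground)
    · ⊢Σ fg(bg(e)) : nat @ B     (a foreground block, itself typed as background code)

but fg(fg(bg(e))) is ill-typed: the inner fg(bg(e)) types at world B with type nat, while the outer fg(−) requires an argument of type ◯τ at world F. This is exactly how the type system rules out nested foreground blocks.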

Lemma 1 states a property of thread pool typing which will be useful later: the concatenation of two thread pools is well-typed with the concatenation of the two signatures.

Lemma 1. If Γ ⊢Σ2,Σ′ µ1 : Σ1 and Γ ⊢Σ′ µ2 : Σ2 then Γ ⊢Σ′ µ1 ⊗ µ2 : Σ1, Σ2.

Proof. By induction on the derivation of Γ ⊢Σ2,Σ′ µ1 : Σ1. If µ1 = ∅, then the result is trivial. Otherwise, µ1 = a ↪ (δ, e) ⊗ µ1′ and Σ1 = Σ1′, a ∼ τ @ w and Γ ⊢Σ2,Σ′,Σ1′ e : τ @ w and Γ ⊢Σ2,Σ′ µ1′ : Σ1′. By induction, Γ ⊢Σ′ µ1′ ⊗ µ2 : Σ1′, Σ2. By weakening, Γ ⊢Σ′,Σ1′,Σ2 e : τ @ w. The result follows from the thread pool typing rules. □

3.3 Dynamic Semantics

The operational semantics of λip consists of two components: local and global. This separation, and much of our notation, is drawn from Harper [23]. The local semantics concerns individual threads, and indicates how expressions transition. Selected rules are presented in Figure 7 as small-step transition rules. The rules in this figure define two judgments. The judgment e val indicates that e is an irreducible value. Values are the unit value, numerals, functions, pairs and injections of values, and thread handles tid[b].

The local transition judgment is

    e | µ ⇒a^∆ (δ′, e′) | µ ⊗ µ′

which states that thread a running e transitions to e′, possibly spawning new threads, which are collected in µ′. The original thread pool µ is unchanged; threads are never altered or removed by local transitions. The thread identifier a is not important for the local transition, but will be used in some of the global definitions and results. The new expression e′ will be able to run after a delay of δ′ steps (if δ′ = 0, it can run immediately). The judgment is also parametrized by ∆ : InputIDs → 2^ℕ, a mapping which assigns a set of possible delays to each input identifier d.

Most of the transition rules are straightforward and are omitted. The complete rules for function application are given as an example: in e1 e2, the subexpression e1 is stepped until it is a lambda abstraction, then e2 is stepped until it is a value, which is then substituted for the variable in the body of the abstraction using standard capture-avoiding substitution. A parallel tuple e1 ∥ e2 spawns two new threads b and c to execute e1 and e2, respectively. The local thread a steps to join[b, c], indicating that this thread is now waiting for b and c to complete. When both threads have stepped to irreducible values, join[b, c] steps to a pair of the two values. In the same vein, bg(e) spawns a new thread b to evaluate e and returns the thread handle tid[b]. Note that, while threads spawned by parallel tuples and threads spawned by bg(e) are treated identically by the semantics (i.e. they are stepped with the same transitions and not distinguished in the thread pool), the threads b and c spawned by a parallel tuple are never referred to by thread handles (e.g. tid[b]) because these threads are not first class.

The expression fg(e) steps e until it reaches fg(tid[b]), which then blocks until thread b has evaluated its expression down to an irreducible value e′, at which point fg(tid[b]) steps to e′. The input rule is the only one which results in a delay, which is chosen nondeterministically from ∆(d). After the delay, the new expression in(x.e) nondeterministically chooses a natural number n to substitute for x in e, representing the uncertainty in the input from the user or environment.

We can prove type safety at the local level by showing that progress and preservation results hold for the local dynamics. Both lemmas have some unusual features. Local progress states that if an expression is well-typed, it is either fully evaluated or can take a step, or is waiting for some other thread using join or fg (which can take a step or is delayed).

Lemma 2 (Local Progress). If · ⊢Σ e : τ @ w and · ⊢· µ : Σ, a ∼ τ @ w, then either e val or e | µ ⇒a^∆ (δ, e′) | µ ⊗ µ′ or there exists b ↪ (δb, eb) ∈ µ such that δb > 0 or eb | µ ⇒b^∆ (δb′, eb′) | µ ⊗ µ′.

Proof. By induction on the derivations of · ⊢Σ e : τ @ w and · ⊢· µ : Σ, a ∼ τ @ w. In most of the base cases, e either is a value or can step. The interesting cases are e = fg(tid[b]) and e = join[b, c]. Consider the case for foreground. By inversion, b ∼ τ′ @ B ∈ Σ and b ↪ (δb, eb) ∈ µ. If δb > 0, the case is proven, so suppose δb = 0. By induction, either (1) eb val and e | µ ⇒a^∆ (0, eb) | µ or (2) eb | µ ⇒b^∆ (δb′, eb′) | µ ⊗ µ′ or (3) there exists c ↪ (δc, ec) ∈ µ such that δc > 0 or ec | µ ⇒c^∆ (δc′, ec′) | µ ⊗ µ′. In cases (2) or (3), b and c, respectively, meet the conditions in the theorem. The case for join is similar. □

The statement of preservation is more standard but requires finding a signature Σ′ which accounts for the new threads that are created when e takes a step.

Lemma 3 (Local Preservation). If · ⊢Σ e : τ @ w and e | µ ⇒a^∆ (δ, e′) | µ ⊗ µ′, then there exists Σ′ such that · ⊢Σ,Σ′ e′ : τ @ w and · ⊢Σ µ′ : Σ′.

    Values:
    ⟨⟩ val      n val      λx:τ.e val      tid[a] val
    e1 val, e2 val ⟹ ⟨e1, e2⟩ val
    e val ⟹ inl(e) val        e val ⟹ inr(e) val

    Transitions:
    e1 | µ ⇒a^∆ (δ, e1′) | µ ⊗ µ′ ⟹ e1 e2 | µ ⇒a^∆ (δ, e1′ e2) | µ ⊗ µ′
    e2 | µ ⇒a^∆ (δ, e2′) | µ ⊗ µ′ ⟹ (λx:τ.e1) e2 | µ ⇒a^∆ (δ, (λx:τ.e1) e2′) | µ ⊗ µ′
    e2 val ⟹ (λx:τ.e1) e2 | µ ⇒a^∆ (0, [e2/x]e1) | µ
    b, c fresh ⟹ e1 ∥ e2 | µ ⇒a^∆ (0, join[b, c]) | µ ⊗ b ↪ (0, e1) ⊗ c ↪ (0, e2)
    µ = b ↪ (δb, eb) ⊗ c ↪ (δc, ec) ⊗ µ′, eb val, ec val ⟹ join[b, c] | µ ⇒a^∆ (0, ⟨eb, ec⟩) | µ
    fix x:τ is e | µ ⇒a^∆ (0, [fix x:τ is e/x]e) | µ
    b fresh ⟹ bg(e) | µ ⇒a^∆ (0, tid[b]) | µ ⊗ b ↪ (0, e)
    e | µ ⇒a^∆ (δ, e′) | µ ⊗ µ′ ⟹ fg(e) | µ ⇒a^∆ (δ, fg(e′)) | µ ⊗ µ′
    µ = b ↪ (δ, e) ⊗ µ′, e val ⟹ fg(tid[b]) | µ ⇒a^∆ (0, e) | µ
    e | µ ⇒a^∆ (δ, e′) | µ ⊗ µ′ ⟹ out(e) | µ ⇒a^∆ (δ, out(e′)) | µ ⊗ µ′
    e val ⟹ out(e) | µ ⇒a^∆ (0, ⟨⟩) | µ
    δ ∈ ∆(d) ⟹ inp[d](x.e) | µ ⇒a^∆ (δ − 1, in(x.e)) | µ
    in(x.e) | µ ⇒a^∆ (0, [n/x]e) | µ

Figure 7. Selected local dynamic rules.
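As a worked illustration of these rules (our example), a parallel pair running in thread a evolves as follows:

    a ↪ (0, e1 ∥ e2)
      ⟶ a ↪ (0, join[b, c]) ⊗ b ↪ (0, e1) ⊗ c ↪ (0, e2)          (spawn b and c)
      ⟶ ... ⟶ a ↪ (0, join[b, c]) ⊗ b ↪ (0, v1) ⊗ c ↪ (0, v2)    (b, c step independently)
      ⟶ a ↪ (0, ⟨v1, v2⟩) ⊗ b ↪ (0, v1) ⊗ c ↪ (0, v2)             (join collects the values)

Note that b and c remain in the pool after the join; as stated above, threads are never altered or removed by local transitions.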

Proof. Induction on the derivation of e | µ ⇒a^∆ (δ, e′) | µ ⊗ µ′. □

The global rules in Figure 8, together with the auxiliary definitions in Figure 9, define the transitions of entire thread pools, i.e. the entire state of the computation. The judgment µ final states that µ has completed evaluating and its rules simply require that all threads in µ be irreducible. If a thread pool is not final, we wish to step threads according to the prompt scheduling principle. We categorize threads in several (possibly overlapping) ways. A delayed thread a ↪ (δ, e) is waiting on input, as indicated by a nonzero delay. A ready thread is one that can step. The judgment isready_µ(a) indicates that thread a is ready in µ. The single rule, defined in Figure 8, requires that a be present in µ, not be delayed, and its expression be able to take a step (i.e. it cannot be waiting for other threads and cannot be an irreducible value).

A foreground thread is one that is currently executing a foreground block. The auxiliary definitions of Figure 9 determine which threads are currently running foreground blocks. For a thread pool µ, a thread a and its associated expression e, RFB_µ(e, a) is a set A1, ..., An where each Ai represents a separate, currently ready, foreground block in e and is a set of the threads currently working on the foreground block. A foreground block is ready if any of the threads working on it is ready. Expressions not containing foreground blocks result in the empty set. Because RFB_µ(e, a) should only contain ready foreground blocks, the rule for e1 e2 must split on whether e1 or e2 is currently evaluating. The subexpression which is currently evaluating is recursively explored for foreground blocks. Finally, fg(e) contains one foreground block involving a and any other threads which are involved in executing e (since e may contain parallel pairs). This is determined by the function J_µ(e), which gives the set of threads on which e is (transitively) waiting with join expressions. For example, if e = join[b, c] and thread b is running join[d, e] and thread c is fully evaluated, then J_µ(e) = {b, c, d, e}. The definition is inductive on the expression and is straightforward. At a join, the two parent threads are added to the set and their expressions are explored recursively. Finally, RFB(µ) collects all of the ready foreground blocks for each thread in µ. The judgment isfg_µ(a) indicates that a is a foreground thread in µ and simply checks whether a is involved in any block of RFB(µ).

The global step relation is r; µ ⇒g r′; µ′, and has only one rule, which allows some number of threads whose delay is 0 to step using the local dynamics. The relation also includes a counter for the total response time r, which at each step is incremented by the number of ready foreground blocks. The rule first separates the threads a1, ..., am into categories (here we treat the thread pool as an unordered set of threads). Threads 1 through j are ready and foreground. Threads j + 1 through k are ready and background. Threads k + 1 through n are neither ready nor delayed, and threads n + 1 through m are delayed. The rule will run the first N threads, where N is the smaller of k and P. In this way, the scheduler runs as many threads as possible, prioritizing foreground threads, as required by the prompt scheduling principle. Since the threads may be reordered within the categories arbitrarily, the rule does not, for example, specify which ready foreground threads to schedule if more than P are available. The new thread pool consists of the updated threads 1 through N, the unaltered threads N + 1 through n and the delayed threads n + 1 through m with their delays decremented. Finally, the total response time r is incremented by the number of ready foreground blocks. (Commuting the summations, counting the number of blocks at each step is equivalent to counting the number of steps taken to execute each block, which is the response time.)

We can now prove progress and preservation for the global semantics. Most of the work is done by Lemmas 2 and 3.

Lemma 4 (Progress). If · ⊢· µ : Σ, then either µ final or there exist r′ and µ′ such that r; µ ⇒g r′; µ′.

Proof. Let µ = a1 ↪ (δ1, e1) ⊗ ... ⊗ am ↪ (δm, em). If any δi > 0, then the configuration can take a step to reduce δi, so consider the case where δ1 = ··· = δm = 0. By inversion on the configuration typing derivation, we have Σ = a1 ∼ τ1 @ w, ..., am ∼ τm @ w and for all i, there exists Σi such that · ⊢Σi ei : τi @ w. By Lemma 2, either there exists some i such that ei | µ ⇒ai^∆ (δi′, ei′) | µ ⊗ µi′ or for all i, ei val. In the former case, the configuration can take a step, and in the latter case, µ final. □

Lemma 5 (Preservation). If · ⊢· µ : Σ and r; µ ⇒g r′; µ′, then there exists Σ′ such that · ⊢· µ′ : Σ′.

Proof. Apply Lemma 3 to each local step, then use weakening and Lemma 1 to combine the results. See the appendix in the supplementary materials for details. □

We could now show a fairly standard type safety theorem, showing that a well-typed thread pool will not become "stuck". However, there is one additional property, in addition to well-typedness, which we wish to ensure is preserved during execution. We call this property "well-joinedness". It is defined by the judgment e wj ("e is well-joined") in Figure 10.

    ∅ final        e val, µ final ⟹ a ↪ (δ, e) ⊗ µ final

    µ = µ0 ⊗ a ↪ (0, e),  e | µ ⇒a^∆ (δ′, e′) | µ ⊗ µ′ ⟹ isready_µ(a)

    a ∈ ⋃_{A ∈ RFB(µ)} A ⟹ isfg_µ(a)

    µ = a1 ↪ (0, e1) ⊗ ... ⊗ an ↪ (0, en) ⊗ a(n+1) ↪ (δ(n+1) + 1, e(n+1)) ⊗ ... ⊗ am ↪ (δm + 1, em)
    ∀1 ≤ i ≤ j. isready_µ(ai) ∧ isfg_µ(ai)        ∀j < i ≤ k. isready_µ(ai) ∧ ¬(isfg_µ(ai))
    ∀k < i ≤ n. ¬(isready_µ(ai))        N = min(k, P)        ∀1 ≤ i ≤ N. ei | µ ⇒ai^∆ (δi′, ei′) | µ ⊗ µi
    µ′ = a1 ↪ (δ1′, e1′) ⊗ ... ⊗ aN ↪ (δN′, eN′) ⊗ a(N+1) ↪ (0, e(N+1)) ⊗ ... ⊗ an ↪ (0, en)
         ⊗ a(n+1) ↪ (δ(n+1), e(n+1)) ⊗ ... ⊗ am ↪ (δm, em) ⊗ µ1 ⊗ ... ⊗ µN
    ─────────────────────────────────────────────
    r; µ ⇒g r + |RFB(µ)|; µ′

Figure 8. Global Dynamics
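As an illustration of the global rule (our example): suppose P = 2 and µ consists of a1 and a2, both ready and foreground, a3 ready and background, and a4 delayed with delay 3. Then j = 2, k = 3 and N = min(3, 2) = 2, so the rule steps exactly the two foreground threads a1 and a2, leaves a3 unaltered, decrements a4's delay to 2, and increases r by |RFB(µ)|, the number of ready foreground blocks.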

    J_µ(⟨⟩) = {}
    J_µ(e1 e2) = J_µ(e1) ∪ J_µ(e2)
    J_µ(join[b, c]) = {b, c} ∪ J_µ(eb) ∪ J_µ(ec)   (if b ↪ (δb, eb), c ↪ (δc, ec) ∈ µ)
    J_µ(tid[b]) = {}
    ...

    RFB_µ(⟨⟩, a) = {}
    RFB_µ(e1 e2, a) = RFB_µ(e1, a)   (¬(e1 val))
    RFB_µ(e1 e2, a) = RFB_µ(e2, a)   (e1 val)
    RFB_µ(fg(e), a) = {{a} ∪ J_µ(e)}   (¬(e val))
    RFB_µ(fg(e), a) = {}   (e val)
    ...

    RFB_µ′(∅) = {}
    RFB_µ′(a ↪ (δ, e) ⊗ µ) = RFB_µ′(e, a) ∪ RFB_µ′(µ)
    RFB(µ) = RFB_µ(µ)

Figure 9. Auxiliary definitions for the global dynamics.

    x wj        ⟨⟩ wj        n wj        e nj ⟹ λx:τ.e wj
    e1 wj, e2 nj ⟹ e1 e2 wj        e1 val, e2 wj ⟹ e1 e2 wj
    e1 nj, e2 nj ⟹ e1 ∥ e2 wj        join[b, c] wj
    e wj ⟹ fst(e) wj        e wj ⟹ snd(e) wj
    e wj ⟹ inl(e) wj        e wj ⟹ inr(e) wj
    e wj, e1 nj, e2 nj ⟹ case(e){x.e1; y.e2} wj
    e nj ⟹ bg(e) wj        tid[a] wj        e wj ⟹ fg(e) wj
    e wj ⟹ out(e) wj        e nj ⟹ inp[d](x.e) wj        e nj ⟹ fix x:τ is e wj

Figure 10. Rules for well-joinedness

Intuitively, well-joinedness is the property that join expressions appear only in the part of an expression which is currently being evaluated². In particular, they may not appear encapsulated in functions, or in expressions which have not yet been evaluated. The auxiliary judgment e nj ("no joins") indicates that e contains no join expressions. Its straightforward definition is omitted for space reasons.

²For those familiar with evaluation contexts or stack machine semantics, join can only appear in the "hole" of an evaluation context or at the top of a stack.

We first show that well-joinedness is preserved by local transitions: if all expressions of a thread pool are well-joined and one thread steps, then all resulting expressions are well-joined.

Lemma 6. If e0 | a1 ↪ (δ1, e1) ⊗ ... ⊗ an ↪ (δn, en) ⇒a^∆ (δ, e0′) | a1 ↪ (δ1, e1) ⊗ ... ⊗ an ↪ (δn, en) ⊗ ... ⊗ am ↪ (δm, em) and for all 0 ≤ i ≤ n, we have ei wj, then e0′ wj and for all n < i ≤ m, we have ei wj.

Proof. By induction on the derivation of the transition judgment. See the appendix in the supplementary materials for details. □

Finally, we prove a theorem which encompasses type safety and well-joinedness. If an initial thread pool consisting of a single source expression (which is well-typed under the empty context and signature) evaluates to µ′ after some number of steps, then µ′ is well-typed, not stuck and all of its expressions are well-joined.

Theorem 1 (Type Safety and Well-Joinedness). If · ⊢· e : τ @ w and 0; a ↪ (0, e) ⇒g* r; µ′, then

1. there exists Σ′ such that · ⊢· µ′ : Σ′
2. either µ′ final or there exist r″ and µ″ such that r; µ′ ⇒g r″; µ″
3. for all b ↪ (δ, eb) ∈ µ′, we have eb wj.

Proof. Parts 1 and 2 are simply an inductive application of Lemmas 4 and 5. We prove part 3 by induction on the derivation of 0; a ↪ (0, e) ⇒g* r; µ′. If µ′ = a ↪ (0, e), then we must have e nj since e types with an empty signature, and this implies that e wj (these facts can be shown by a straightforward induction on the typing derivation and the derivation of e nj, respectively). Otherwise, suppose 0; a ↪ (0, e) ⇒g* r″; µ″ and r″; µ″ ⇒g r; µ′. By induction, eb″ wj for all b ↪ (δ″, eb″) ∈ µ″. Let b ↪ (δ′, eb′) ∈ µ′. We have three cases: (1) b ↪ (δ′, eb′) ∈ µ″ or (2) b ↪ (0, eb″) ∈ µ″ and eb″ | µ″ ⇒b^∆ (δ′, eb′) | µ″ ⊗ µb′ or (3) there exists c ↪ (0, ec) ∈ µ″ such that ec | µ″ ⇒c^∆ (δc′, ec′) | µ″ ⊗ µc′ and b ↪ (δ′, eb′) ∈ µc′. In case (1), the result is clear by induction. In cases (2) and (3), eb′ wj by Lemma 6. □

4. Cost Semantics

In this section, we define a cost semantics which constructs a cost dag of the form described in Section 2 for a λip program. The parallel structure of the program, as well as the cost metrics such as work and span, can be read off from the resulting dag. Recall that, in such a dag, vertices represent instructions of a program and edges represent control dependencies between the instructions. Vertices with no ancestor relationship between them may be executed in parallel. We also use a recent extension to the dag model [32] which labels each edge with a positive integer weight δ to represent the delays incurred by input operations. An edge from u1 to u2 with weight δ is written (u1, u2, δ). If δ = 1, u1 incurred no latency and u2 may execute on the next step. If δ > 1, u1 incurred a latency of δ and u2 may not execute fewer than δ steps after u1. In the weighted dag model, the notion of work is unchanged; it is still the total number of vertices. The span, on the other hand, is now defined as the longest weighted path. This captures the notion that the time spent blocking on inputs should not be counted as computational work; a scheduler need not dedicate a processor to the blocked thread. The span, of course, must take these delays into account since the computation cannot complete until all of the inputs are available.
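To make the weighted-span metric concrete, here is a small Standard ML sketch (ours, not from the paper) that computes the longest weighted path of a dag whose edge list is given in topological order. The convention that a lone vertex costs one step and each edge adds its weight is a simplification of the metric described above.

    (* Longest weighted path ("span") of a dag.  Vertices are 0..n-1;
       each edge (u, v, w) has weight w >= 1 and must appear after all
       edges into u (i.e., the edge list is topologically sorted). *)
    fun span (n : int) (edges : (int * int * int) list) : int =
      let
        val dist = Array.array (n, 1)  (* a lone vertex costs one step *)
        fun relax (u, v, w) =
          let val d = Array.sub (dist, u) + w
          in if d > Array.sub (dist, v) then Array.update (dist, v, d) else ()
          end
      in
        List.app relax edges;
        Array.foldl Int.max 0 dist
      end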

Typically, dags are used to represent entire programs which start and end as a single thread. For this reason, it is generally assumed that a dag has a single source vertex (with in-degree zero) and a single sink vertex (with out-degree zero). However, in order to show a correspondence between the cost semantics and the operational semantics, we wish to generate dags for thread pools µ which can represent whole programs or programs that have already begun to execute. As such, dags may no longer have a single source vertex, though they will continue to have a single sink vertex (the final instruction of the initial thread). They will have a source vertex for each ready thread. This modification is relatively straightforward: for each thread, we will generate a traditional dag, which we now call a thread graph or thread dag, with a single source and single sink. These are then composed to form a configuration graph or configuration dag by adding edges that correspond to the inter-thread dependencies created by join and fg.

We will use metavariables g (and similar) for thread dags. The notation u1 ⪰_g u2 indicates that u1 is an ancestor of u2 in g. If u has no ancestor in g, we write u ⋠ g. We write g1 ≅ g2 to mean that g1 and g2 are isomorphic. We write a non-empty thread dag as a tuple (s, t, V, E, F) (an empty dag is written ∅). The first two components, s and t, are the source and sink vertices respectively, and V is the set of vertices. We have s, t ∈ V and s ⪰_g t. The fourth component is a set of weighted, directed edges E ⊂ P((V ∪ Threads ∪ AuxVertices) × V × ℕ). Note that the source of an edge can, in addition to a vertex, be a thread identifier or an auxiliary vertex. Edges starting at thread identifiers indicate dependencies between thread dags in a configuration dag. Auxiliary vertices α are used as placeholders for delays. A vertex u which is delayed and can run in δ timesteps will have an in-edge (α, u, δ). The auxiliary vertex α does not count toward the work but is counted as an ancestor of u, e.g. for the purposes of determining whether u is ready. The final component of the tuple is a set F of foreground blocks, where a foreground block f takes one of two forms:

1. a pair of the source and sink vertices of the block, written s ↦ t, where s, t ∈ V and s ⪰_g t. The foreground block is the induced subdag of g consisting of all u ∈ V such that s ⪰_g u and u ⪰_g t.
2. a sink vertex, written ↦ t, where t ∈ V. The foreground block is the induced subdag of g consisting of all u ∈ V such that u ⪰_g t.

The former represents a foreground block whose source vertex has not yet been executed. It may still be ready if the source vertex is ready. The latter represents a ready foreground block whose source vertex has been executed. Multiple threads may be running code that is part of this block. We write u ∈ f to indicate that u is part of the subdag induced by f, and we say that u is a foreground vertex.

The work W(g) of a dag is the total number of vertices in the dag. The span S(g) is the longest weighted path in the dag. For a foreground block f which is part of g, we will write W_g(f) and S_g(f) for the work and span, respectively, of the subdag of g induced by f. For a graph g = (s, t, V, E, F), the foreground work W!(g) and foreground span S!(g) are defined as:

    W!(g) = Σ_{f ∈ F} W_g(f)        S!(g) = Σ_{f ∈ F} S_g(f)

Two foreground blocks f1 and f2 are serial if there exists a directed path in the graph from a vertex of f1 to a vertex of f2 or vice versa. A set of foreground blocks F′ ⊂ F may happen in parallel if for all f1, f2 ∈ F′, f1 and f2 are not serial. The foreground width D of a graph is defined as

    D = max{ |F′| : F′ ⊂ F ∧ F′ may happen in parallel }

That is, D is the maximum number of foreground blocks that may be ready at the same time.

A configuration graph G mirrors the structure of the thread pool µ; it is a mapping from thread names to thread graphs:

    G = a1 ↪ g1 ⊗ ... ⊗ an ↪ gn

The vertices, edges and foreground blocks of a configuration graph are the union of the vertices, edges and foreground blocks of the component thread graphs. If G = a ↪ ga ⊗ b ↪ gb, an edge (a, u, δ) may be viewed as an edge from the sink of ga to u. If ga = ∅, this edge is ignored. The metrics such as work, span and foreground width extend in the natural way to configuration graphs.

The cost semantics in Figure 11 generates thread graphs for expressions. The judgment e; µg ⇓t^∆ v; g indicates that the expression e evaluates to v and has cost graph g in the presence of µg. Values v consist of irreducible expressions, plus a new form of thread handle which abstractly represents a thread as the value to which it will evaluate and a handle to the sink of its expression's cost graph:

    Values v ::= ⟨⟩ | n | λx:τ.e | ⟨v, v⟩ | inl(v) | inr(v) | tid[b] | thread[u](v)

The expression being evaluated may refer to threads in µg. These threads are included so that the value can be generated, but their cost is not included in g. Many of the rules for the sequential components of the language and parallel tuples are based on the cost semantics of Spoonhower et al. [41], with nontrivial modifications to allow the representation of in-progress computations. The rules for generating and joining with background threads (bg(e) and fg(e), respectively) are based on Spoonhower's treatment of futures [40], which share the property that an asynchronous expression is spawned in one part of a computation and demanded in another. The generation of cost graphs is defined inductively on expressions. Subexpressions are evaluated, and their cost graphs are combined using the operations defined in Figure 12. The figure also defines notation for simple graphs consisting of a single vertex [u] or a single edge [(u1, u2, δ)]. In most operations, the subexpressions are evaluated sequentially, represented in the cost graph by combining the cost graphs of the subexpressions using serial composition g1 ⊕ g2, which joins the sink of g1 to the source of g2 by an edge of weight 1 (a more general form, ⊕δ, uses an edge of weight δ, as shown in Figure 13). The empty graph ∅ acts as a unit for the ⊕ operator. In the rule for e1 ∥ e2, however, the cost graphs for e1 and e2 are combined using parallel composition g1 ⊗ g2, which joins the graphs in parallel with new vertices s and t as the source and sink (Figure 14). If one of the graphs is empty, the other is simply composed with s and t. The rule for bg(e) uses the left parallel composition operator [40]. If g is the cost graph for e, the graph g ⋉ u "hangs g off of" vertex u (Figure 15). For the purposes of sequentially composing this graph with other graphs, u is both the source and the sink, reflecting the fact that the new thread is executed concurrently with the continuation of the current thread. The rule for fg(e) evaluates e to a background thread and also gets a handle to the sink of the cost graph for the thread's expression. The sink is u if the thread is of the form thread[u](v) (a thread that hasn't yet been spawned and is represented abstractly in the cost graph) or is b if the thread is of the form tid[b] (an active thread with identifier b). The rule adds an edge between the sink and the vertex representing the fg instruction. In the rule for fg(e), the cost graph for e is marked as foreground with the operation g^! (defined in Figure 12). This operation produces a foreground block s ↦ t if s and t are the source and sink of g and s has no ancestors (i.e. is not a join point), or a foreground block ↦ t if t is the sink of g and g depends on other threads. Finally, the input rule adds an edge of weight δ, where δ is chosen nondeterministically from ∆(d).

The judgment µl; µg ⇓c^∆ {G} generates a portion of a configuration graph from the threads in a partial thread pool µl by generating a thread graph for each thread and composing any non-empty graphs that result. As above, the whole thread pool µg is included so that threads may refer to other threads which are not currently under attention, but these threads are not included in G. If a thread is delayed with delay δ > 0, its cost graph is composed serially after a fresh auxiliary vertex using an edge of weight δ.

    Expression cost semantics   e; µ ⇓t^∆ v; g

    e val ⟹ e; µ ⇓t^∆ e; ∅
    e1; µ ⇓t^∆ λx:τ.e; g1,  e2; µ ⇓t^∆ v; g2,  [v/x]e; µ ⇓t^∆ v′; g3,  u fresh ⟹ e1 e2; µ ⇓t^∆ v′; g1 ⊕ g2 ⊕ [u] ⊕ g3
    e; µ ⇓t^∆ ⟨v1, v2⟩; g,  u fresh ⟹ fst(e); µ ⇓t^∆ v1; g ⊕ [u]
    e; µ ⇓t^∆ ⟨v1, v2⟩; g,  u fresh ⟹ snd(e); µ ⇓t^∆ v2; g ⊕ [u]
    e1; µ ⇓t^∆ v1; g1,  e2; µ ⇓t^∆ v2; g2 ⟹ ⟨e1, e2⟩; µ ⇓t^∆ ⟨v1, v2⟩; g1 ⊕ g2
    e1; µ ⇓t^∆ v1; g1,  e2; µ ⇓t^∆ v2; g2 ⟹ e1 ∥ e2; µ ⇓t^∆ ⟨v1, v2⟩; g1 ⊗ g2
    µ = µ′ ⊗ b ↪ (δb, eb) ⊗ c ↪ (δc, ec),  eb; µ ⇓t^∆ v1; g1,  ec; µ ⇓t^∆ v2; g2,  u fresh
        ⟹ join[b, c]; µ ⇓t^∆ ⟨v1, v2⟩; (u, u, {u}, {(b, u, 1), (c, u, 1)}, ∅)
    e; µ ⇓t^∆ inl(v); g1,  [v/x]e1; µ ⇓t^∆ v′; g2,  u fresh ⟹ case(e){x.e1; y.e2}; µ ⇓t^∆ v′; g1 ⊕ [u] ⊕ g2
    e; µ ⇓t^∆ inr(v); g1,  [v/y]e2; µ ⇓t^∆ v′; g2,  u fresh ⟹ case(e){x.e1; y.e2}; µ ⇓t^∆ v′; g1 ⊕ [u] ⊕ g2
    e; µ ⇓t^∆ v; g,  g = (s, t, V, E, F),  u fresh ⟹ bg(e); µ ⇓t^∆ thread[t](v); g ⋉ u
    e; µ ⇓t^∆ thread[u1](v); g,  u2 fresh ⟹ fg(e); µ ⇓t^∆ v; (g^! ⊕ [u2]) ∪ {(u1, u2, 1)}
    µ = µ′ ⊗ b ↪ (δ, eb),  e; µ ⇓t^∆ tid[b]; g,  eb; µ ⇓t^∆ v; gb,  u fresh ⟹ fg(e); µ ⇓t^∆ v; (g^! ⊕ [u]) ∪ {(b, u, 1)}
    e; µ ⇓t^∆ v; g,  u fresh ⟹ out(e); µ ⇓t^∆ ⟨⟩; g ⊕ [u]
    [n/x]e; µ ⇓t^∆ v; g,  u1 fresh,  u2 fresh,  δ ∈ ∆(d) ⟹ inp[d](x.e); µ ⇓t^∆ v; [u1] ⊕δ [u2] ⊕ g
    [n/x]e; µ ⇓t^∆ v; g,  u fresh ⟹ in(x.e); µ ⇓t^∆ v; [u] ⊕ g
    [fix x:τ is e/x]e; µ ⇓t^∆ v; g,  u fresh ⟹ fix x:τ is e; µ ⇓t^∆ v; [u] ⊕ g

    Thread pool cost semantics   µl; µg ⇓c^∆ {G}

    ∅; µg ⇓c^∆ {}
    µl; µg ⇓c^∆ {G},  e; µg ⇓t^∆ v; g,  g ≠ ∅ ⟹ a ↪ (δ, e) ⊗ µl; µg ⇓c^∆ {a ↪ g ▷ δ ⊗ G}

Figure 11. Cost Semantics

    [u] = (u, u, {u}, ∅, ∅)
    [(u1, u2, δ)] = (u1, u2, {u1, u2}, {(u1, u2, δ)}, ∅)
    (s1, t1, V1, E1, F1) ⊕δ (s2, t2, V2, E2, F2) = (s1, t2, V1 ∪ V2, E1 ∪ E2 ∪ {(t1, s2, δ)}, F1 ∪ F2)
    g1 ⊕ g2 = g1 ⊕1 g2
    (s1, t1, V1, E1, F1) ⊗ (s2, t2, V2, E2, F2) = (s, t, V1 ∪ V2 ∪ {s, t}, E1 ∪ E2 ∪
        {(s, s1, 1), (s, s2, 1), (t1, t, 1), (t2, t, 1)}, F1 ∪ F2)   (s, t fresh)
    (s, t, V, E, F) ⋉ u = (u, u, V ∪ {u}, E ∪ {(u, s, 1)}, F)
    g ▷ 0 = g
    g ▷ δ = [α] ⊕δ g   (δ > 0, α fresh)
    (s, t, V, E, F)^! = (s, t, V, E, F ∪ {s ↦ t})   (∄ a, δ. (a, s, δ) ∈ E)
    (s, t, V, E, F)^! = (s, t, V, E, F ∪ {↦ t})   (∃ a, δ. (a, s, δ) ∈ E)

Figure 12. Graph building and composition operations
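To make these operations concrete, the following Standard ML fragment is a minimal rendering (ours, not the paper's implementation) of thread dags and the serial and parallel composition operations of Figure 12; the vertex supply and the foreground block forms are simplified.

    type vertex = int
    datatype block = Full of vertex * vertex   (* s |-> t *)
                   | Sink of vertex            (* |-> t *)
    type dag = { src : vertex, snk : vertex,
                 verts : vertex list,
                 edges : (vertex * vertex * int) list,
                 blocks : block list }

    local val ctr = ref 0
    in fun fresh () = (ctr := !ctr + 1; !ctr) end

    (* serial composition: an edge of weight d from g1's sink to g2's source *)
    fun serial (d : int) (g1 : dag) (g2 : dag) : dag =
      { src = #src g1, snk = #snk g2,
        verts = #verts g1 @ #verts g2,
        edges = (#snk g1, #src g2, d) :: (#edges g1 @ #edges g2),
        blocks = #blocks g1 @ #blocks g2 }

    (* parallel composition: a fresh source forks to, and a fresh sink
       joins, both subgraphs *)
    fun parallel (g1 : dag) (g2 : dag) : dag =
      let
        val s = fresh () and t = fresh ()
      in
        { src = s, snk = t,
          verts = s :: t :: (#verts g1 @ #verts g2),
          edges = (s, #src g1, 1) :: (s, #src g2, 1) ::
                  (#snk g1, t, 1) :: (#snk g2, t, 1) ::
                  (#edges g1 @ #edges g2),
          blocks = #blocks g1 @ #blocks g2 }
      end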

Figure 13. Serial composition g1 ⊕δ g2.    Figure 14. Parallel composition g1 ⊗ g2.    Figure 15. Left parallel composition g ⋉ u.

The cost semantics allows us to assign costs (work, span, etc.) to programs, as represented by thread pools. The work and span of a thread pool that is in the middle of execution can be thought of as the remaining work and span of the program. The work of a thread pool µ under ∆ is written W(µ, ∆) and is defined as the maximum work over all dags that can be generated from µ:

    W(µ, ∆) = max{ W(G) | µ; µ ⇓c^∆ {G} }

We take the maximum since the cost semantics is nondeterministic. The definitions of S(µ, ∆), W!(µ, ∆) and S!(µ, ∆) are similar.

In the remainder of this section, we show important properties of the cost semantics and its correspondence with the operational semantics. Lemma 7 relates the invariants of the type system to cost graphs: cost graphs generated by F expressions have no nested foreground blocks and edges from other threads occur only at joins.

Lemma 7. If · ⊢Σ e : τ @ F and · ⊢· a ↪ (δ, e) ⊗ µ : Σ, a ∼ τ @ F and a ↪ (δ, e) ⊗ µ; µ ⇓c^∆ {G} where

    G = a ↪ g ⊗ a1 ↪ (s1, t1, V1, E1, F1) ⊗ ... ⊗ an ↪ (sn, tn, Vn, En, Fn)

and g = (s, t, V, E, F), then

1. F = ∅

9 2016/7/7 2. if e nj, then there does not exist (b, u, δ) ∈ E (where b is a thread If G2 = a1 ,→ (s1, t1, V1, E1, F1) ⊗ ... ⊗ an ,→ (sn, tn, Vn, En, Fn)

identifier) and tG2 is the sink vertex of G2, then 3. if e wj, then there does not exist (b, u, δ) ∈ E for any u , s. (s, t, V, E, F) ⊕ G2 = (s, tG2 , V ∪ V1 ∪ · · · ∪ Vn, 0 0 4. for all ai, we have si wG t if and only if ai ∈ Jµ(e) E ∪ E1 ∪ · · · ∪ En ∪ {(t, s , 1) | s @G2 } 0 0 ∪{(t, s , δ + 1) | (α, s , δ) ∈ E1 ∪ · · · ∪ En}, Proof. Parts 1-3 are by induction on the derivation of · `Σ e : τ@F. F ∪ F1 ∪ · · · ∪ Fn) In part 3, inversion on e wj is used to show that any graphs that may be serially composed before a join are empty. Part 4 is by Note that the operation G1 ⊕ g2 is not necessary; ordinary serial lexicographic induction on the derivations of · `Σ e : τ@F and composition works in this case since G1 has a unique sink. 0 0 · `· a ,→ (0, e) ⊗ µ : Σ, a ∼ τ @ F. See the appendix in the Lemma 9 examines the effect of a transition r; µ ⇒g r ; µ on supplementary materials for details.  the cost graph of a specified thread a ,→ (δ, e) of µ0. In other words, the lemma shows how the cost semantics behaves under “converse Many of the functions and properties we have defined over evaluation” or “head expansion”, a standard step in relating small- thread pools have corresponding definitions over cost graphs. For step and big-step semantics. Part 1 considers the case in which a example, we have already defined foreground threads of a thread is not one of the threads that transitions. In this case, if e evaluates pool and foreground vertices of a cost graph. We can also define to v0 with cost graph g0 under µ0, it evaluates to a related value and the function RFB(·) over graphs. The ready foreground blocks of a isomorphic cost graph under µ. Part 2 considers the more complex 0 0 graph are the foreground blocks that have ready vertices. Formally, case in which a steps from e to e , adding the threads in µa. In this 0 0 0 0 0 0 case, if e evaluates to v under µ and a ,→ (δ , e ) ⊗ µa produces RFB((s, t, V, E, F)) = {7→ t ∈ F} ∪ {s 7→ t ∈ F | s @G} the cost graph G0, then e evaluates to a value related to v0 with a cost graph that adds at least one vertex as an ancestor of G0. This RFB(a0 ,→ g0 ⊗ ... ⊗ an ,→ gn) = RFB(g0) ∪ · · · ∪ RFB(gn) lemma will then be used to show that the cost semantics and the For these definitions to make sense, it should be the case that the operational semantics correspond on the final values, and will later thread pool and cost graph versions of the definitions agree. First, a be used to show the Brent-type theorem that the cost graph is an µ G thread is ready in if and only if its source has no ancestors in . accurate representation of the length of a prompt schedule. Second, an element of RFB(µ) consists of threads whose sources are elements of a ready foreground block in RFB(G). Lemma 9. Fix ∆ and suppose that · `Σ e : τ@w and · `· µ0 : Σ. Let µ = a ,→ (δ, e) ⊗ µ0. Lemma 8. Fix ∆. Suppose µ = a1 ,→ (δ1, e1) ⊗ ... ⊗ an ,→ (δn, en) ∆ ⇒ 0 0 0 ⇓∆ 0 0 and · `· µ : Σ and ei wj for all ei and µ; µ ⇓c {G}, where 1. Suppose r; µ g r ; µ and e; µ t v ; g . There exist v and g ∆ 0 0 such that e; µ ⇓ v; g and g  g and v µ0 v . G = a1 ,→ (s1, t1, V1, E1, F1) ⊗ ... an ,→ (sn, tn, Vn, En, Fn) t 0 0 0 0 0 0 0 2. Suppose µ = a ,→ (δ , e ) ⊗ µ0 ⊗ µa and r; µ ⇒g r ; µ where Then ∆ 0 0 0 0 0 ∆ 0 0 0 ∆ 0 e | µ ⇒a (δ , e ) | µ ⊗ µa and e ; µ ⇓ v ; ga and µa; µ ⇓c {G0}. 0 0 0 t 00 00 0 1. isreadyµ(ai) if and only if si @G. Let G = a ,→ ga δ ⊗ G0. There exist v and g and g , where g ∆ 00 0 0 ⇓ ⊕  0 2. 
Lemma 9 examines the effect of a transition r; µ ⇒g r'; µ' on the cost graph of a specified thread a ↦ (δ, e) of µ. In other words, the lemma shows how the cost semantics behaves under "converse evaluation" or "head expansion", a standard step in relating small-step and big-step semantics. Part 1 considers the case in which a is not one of the threads that transitions. In this case, if e evaluates to v' with cost graph g' under µ', it evaluates to a related value and an isomorphic cost graph under µ. Part 2 considers the more complex case in which a steps from e to e', adding the threads in µa'. In this case, if e' evaluates to v' under µ' and a ↦ (δ', e') ⊗ µa' produces the cost graph G', then e evaluates to a value related to v' with a cost graph that adds at least one vertex as an ancestor of G'. This lemma will then be used to show that the cost semantics and the operational semantics correspond on the final values, and will later be used to show the Brent-type theorem that the cost graph is an accurate representation of the length of a prompt schedule.

Lemma 9. Fix ∆ and suppose that · ⊢Σ e : τ @ w and · ⊢· µ0 : Σ. Let µ = a ↦ (δ, e) ⊗ µ0.

1. Suppose r; µ ⇒g r'; µ' and e; µ' ⇓∆t v'; g'. There exist v and g such that e; µ ⇓∆t v; g and g ≅ g' and v ∼µ' v'.
2. Suppose µ' = a ↦ (δ', e') ⊗ µ0 ⊗ µa' and r; µ ⇒g r'; µ' where e | µ ⇒a (δ', e') | µ ⊗ µa' and e'; µ' ⇓∆t v'; ga' and µa'; µ' ⇓∆c {G0'}. Let G' = a ↦ (ga', δ') ⊗ G0'. There exist v, g and g'', where g'' is nonempty, such that e; µ ⇓∆t v; g and g ≅ g'' ⊕ G' and v ∼µ' v'.

Proof. By induction on the derivations of e; µ' ⇓∆t v'; g' and e | µ ⇒a (δ', e') | µ ⊗ µa'. See the appendix in the supplementary materials for details. □

We can now show the final result of this section: that if a well-typed λip program evaluates to a value using the operational semantics, the cost semantics will produce a cost graph for that program, along with the same final value.

Theorem 2. If · ⊢· a ↦ (δ, e) ⊗ µ : Σ, a ∼ τ @ w and

  r; a ↦ (δ, e) ⊗ µ ⇒g* r'; a ↦ (0, e') ⊗ µ'

and a ↦ (0, e') ⊗ µ' final, then there exist v and g such that e; µ ⇓∆t v; g and v ∼a↦(0,e')⊗µ' e'.

Proof. Since e' val, we have e'; µ' ⇓∆t e'; ∅. Proceed by an inductive application of Lemma 9. □

5. Cost Bounds for the Prompt Scheduling Principle

The main result of this paper is a generalization of Brent's Theorem (and similar results) to our language and our cost model, one which takes into account responsiveness as well as total computation time. We show that a P-processor prompt schedule of a responsive parallel computation with work W, span S, foreground work W!, foreground span S! and foreground width D completes in total time at most W/P + S, with total response time at most D·W!/P + S!. The bound on the computation time is known to be within a factor 2 of optimal. We will show that, in the worst case and for an online scheduler (one that does not know the computation ahead of time), the bound on the response time is also within a factor 2 of optimal.
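As a concrete illustration of the two bounds, consider a hypothetical computation (all numbers here are ours, chosen only for illustration) with W = 10^9, S = 10^4, W! = 10^6, S! = 10^3 and D = 10, scheduled on P = 100 processors:

  % Hypothetical numbers, for illustration only.
  \[
    \frac{W}{P} + S = \frac{10^9}{100} + 10^4 = 1.001 \times 10^7,
    \qquad
    D\,\frac{W_{!}}{P} + S_{!} = 10 \cdot \frac{10^6}{100} + 10^3 = 1.01 \times 10^5 .
  \]

Although such a computation may run for on the order of 10^7 steps in total, the guaranteed cumulative delay of the foreground blocks is two orders of magnitude smaller, since it scales with the (small) foreground work and span rather than with W and S.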

The key step in showing the bound on the computation time is showing that a global transition step decreases the total work by P or the total span by 1. The intuition behind this proof, as in most proofs of Brent-type theorems, is that, by definition, a greedy scheduler (all prompt schedulers are greedy) will either execute P instructions or execute all ready instructions (an entire "level" of the dag), decreasing the critical path by 1. To show the bound on the response time, we show that, if any foreground blocks are ready, a global step decreases the foreground work by P or the foreground span by the number of ready foreground blocks. The intuition is similar to the above: a prompt scheduler will either execute P foreground instructions or a level of every ready block. The proof of this lemma makes heavy use of part 2 of Lemma 9, which shows that a local transition on a thread decreases the work and span of the thread's dag by at least 1.

Lemma 10. Fix ∆ and suppose that · ⊢· µ : Σ and that e wj for all a ↦ (δ, e) ∈ µ. If r; µ ⇒g r'; µ', then

1. W(µ', ∆) ≤ W(µ, ∆)
2. S(µ', ∆) ≤ S(µ, ∆)
3. W(µ, ∆) − W(µ', ∆) ≥ P or S(µ, ∆) − S(µ', ∆) ≥ 1
4. W!(µ', ∆) ≤ W!(µ, ∆)
5. S!(µ', ∆) ≤ S!(µ, ∆)
6. W!(µ, ∆) − W!(µ', ∆) ≥ P or S!(µ, ∆) − S!(µ', ∆) ≥ r' − r or r' = r.

Proof. See the appendix in the supplementary materials. □

The proof of the response time and computation time bounds is then straightforward.

Theorem 3. Fix ∆ and let e be such that · ⊢· e : τ @ B. Suppose e; ∅ ⇓∆t v; g and let W = W(g) and S = S(g) and W! = W!(g) and S! = S!(g) and D = D(g). If 0; a ↦ (0, e) ⇒g^T r; µ and µ final, then T ≤ W/P + S and r ≤ D·W!/P + S!.

Proof. Let µ0 = a ↦ (0, e) and µT = µ and r0 = 0 and rT = r. We have a sequence 0; µ0 ⇒g r1; µ1 ⇒g · · · ⇒g rT; µT. For each i, let Wi = W(µi, ∆) (and similarly for Si, W!i and S!i). Note that W0 = W (and similarly for S, W! and S!) and that WT = ST = W!T = S!T = 0.

By Theorem 1, eb wj for all b ↦ (δ, eb) ∈ µi. By Lemma 10,

  W0/P + S0 ≥ 1 + W1/P + S1 ≥ · · · ≥ T + WT/P + ST = T

This immediately gives W0/P + S0 ≥ T, that is, T ≤ W/P + S.

For each i, consider the quantity D·W!i/P + S!i + ri. Note that for i = 0 this quantity is D·W!/P + S!, and for i = T it is r. When ri; µi ⇒g ri+1; µi+1, by Lemma 10, either

1. ri+1 = ri and the other terms do not increase, or
2. W!i − W!i+1 ≥ P and ri+1 − ri = |RFB(µi)| ≤ D (the last inequality is by definition of D), or
3. S!i − S!i+1 ≥ |RFB(µi)| and ri+1 − ri = |RFB(µi)|.

In all three cases, the quantity decreases or remains the same, so r ≤ D·W!/P + S!. □

Clearly, W/P and S are both lower bounds on the computation time, so the bound of W/P + S is within a factor of two of optimal. Recall that the response time is the sum over all foreground blocks f of the time taken to execute f. Since Sg(f) is a lower bound on the time to execute f, it is clear that S!, the sum of the spans over all blocks, is a lower bound on response time.

In order to argue that the bound on response time given by Theorem 3 is within a factor 2 of optimal, it remains to show that D·W!/P is also a lower bound on response time. This is not the case in general, but we will argue that it is a lower bound in the worst case assuming an online scheduler, by presenting a class of computations on which the bound is tight. Consider a computation with total work W! which consists only of D ≪ W! foreground blocks, each of which is sequential.⁴ Think of the work of the computation as W! "bricks" which are distributed arbitrarily into D stacks. At each step, a prompt scheduler will remove one brick from each of P stacks (blocks). When a stack is empty, that block is complete and no longer counts toward the response time. Since an online scheduler only knows which blocks are ready (which stacks have a brick on top) and cannot base its decisions on how large each stack is (this would require knowing how long a block will take to execute, which is impossible in general), we may play a game against the scheduler. Start by placing two bricks on each stack. Keep the rest of the bricks hidden. At each step, when the scheduler removes a brick from a stack, place another brick at the bottom of that stack, until you run out of bricks. In this way, all D blocks will be ready for at least (W! − 2D)/P steps (the number of steps it will take to run out of bricks), which forces the response time to be at least D(W! − 2D)/P ∈ Ω(D·W!/P).

⁴Such a computation is not technically expressible in our language, since it will take some background work to start up the blocks, but if W! is large enough, this starting background work and span can be neglected.
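The game is easy to simulate. The following Standard ML sketch (model and names ours) plays the adversary against a scheduler that always takes bricks from the first P live stacks; for W! much larger than D it yields a total response time of roughly D·W!/P.

  (* Adversary game: d stacks start with 2 bricks; each scheduler
     step removes one brick from each of up to p live stacks, and the
     adversary immediately refills each touched stack from a hidden
     supply of w! - 2d bricks (position in the stack is irrelevant
     for counting).  Returns the total response time: the number of
     live stacks, summed over all steps. *)
  fun game (d, wf, p) =
    let
      val stacks = Array.array (d, 2)
      val hidden = ref (wf - 2 * d)
      fun live () =
        Array.foldl (fn (n, k) => if n > 0 then k + 1 else k) 0 stacks
      fun step () =
        let
          fun take (i, budget) =
            if i >= d orelse budget = 0 then ()
            else if Array.sub (stacks, i) > 0 then
              ( Array.update (stacks, i, Array.sub (stacks, i) - 1)
              ; if !hidden > 0 then
                  ( hidden := !hidden - 1
                  ; Array.update (stacks, i, Array.sub (stacks, i) + 1) )
                else ()
              ; take (i + 1, budget - 1) )
            else take (i + 1, budget)
        in take (0, p) end
      fun loop resp =
        if live () = 0 then resp
        else let val l = live () in (step (); loop (resp + l)) end
    in loop 0 end

For example, game (10, 1000000, 4) returns roughly 2.5 × 10^6, matching D·W!/P: while hidden bricks remain, every removal is refilled, so all D stacks stay live.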
6. Implementation and Examples

We developed a preliminary implementation of λip by building on a parallel extension of Standard ML [40, 41]. The implementation uses fork-join parallelism and futures, both built in to the parallel ML extension, to create threads, and supplies constructs to allow threads to be annotated as foreground or background. The implementation also includes a simple prioritized scheduler. We did not extend SML's type system to implement λip's type system.

We implemented several parallel interactive examples to show that the language features are easy to use and that prioritization can ensure responsiveness. All of the examples are very responsive when written with correct priorities but, as expected, quickly become unresponsive when all code is run in the background and the amount of computation is increased.

Fibonacci Server. Our implementation of the Fibonacci server from Section 2 accepts integer inputs from the user (over standard input) in the foreground, and computes their Fibonacci numbers in the background, outputting the results asynchronously.

Interactive Convex Hull. This program displays a window on which the user can click to add points. The points are displayed immediately (in the foreground). When a new point is added, a background computation is started to compute the convex hull of the current collection of points using the parallel Quickhull algorithm. The hull is displayed asynchronously when computed.

Web Server. An interaction loop (in the foreground) listens for connections. When a connection is opened, the loop spawns a new thread which is immediately promoted to the foreground (using bg(fg(...))) to listen for requests over that connection, allowing the main loop to immediately listen for more connections. The new connection thread waits for an HTTP request, which it serves in the foreground. The request is also added to a log (stored as a global mutable reference). Meanwhile, a background thread periodically checks the log and performs analytics on it (currently just tallying the number of visits to each page). The analytics can be viewed by requesting stats.html from the server.
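To give a flavor of the programming model, here is a minimal sketch of the Fibonacci server described above. The background-thread construct bg is a stand-in, assumed here to have type (unit -> unit) -> unit; the actual constructs supplied by the parallel ML extension may differ. The input loop itself runs in the foreground.

  (* Minimal sketch (ours) of the Fibonacci server.  `bg' is an
     assumed stand-in for the implementation's background-thread
     construct; everything else is ordinary Standard ML. *)
  fun fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

  fun serve () =
    case TextIO.inputLine TextIO.stdIn of
      NONE => ()                                   (* end of input *)
    | SOME line =>
        ( (case Int.fromString line of
             SOME n =>
               (* compute in the background; print asynchronously *)
               bg (fn () =>
                 print (Int.toString n ^ " -> " ^ Int.toString (fib n) ^ "\n"))
           | NONE => print "please enter an integer\n")
        ; serve () )                               (* loop stays responsive *)

The web server follows the same pattern, with the additional bg(fg(...)) idiom described above to promote each per-connection thread to the foreground.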

[Figure 16: Convex hull (left) and web server (right).]

7. Related Work

Abstractions and cost models for parallel programming have been studied extensively and many programming languages and extensions have been created [12, 18, 19, 25, 26, 28, 30, 31]. The focus of nearly all of this work on parallel computing has been maximizing throughput in compute-intensive applications. Our work builds on this prior work by proposing language abstractions, a cost semantics and a scheduling principle for responsiveness in interactive parallel applications. Our results build on prior work on type systems for staged computation, semantics and cost semantics for parallel computing, and also more remotely on the broader area of scheduling.

Type Systems for Staged Computation. The type system of λip is based on that of Davies [13] for binding-time analysis, which he derived from linear temporal logic via the Curry-Howard correspondence. This work influenced much followup work on metaprogramming and staged computation [14, 29, 34, 42]. The idea behind these systems is to allow computation at a stage to create and manipulate, but not eliminate, a computation in a later stage. For example, a stage 1 computation can create a stage 2 computation as a "black box" but cannot inspect that computation by, for example, pattern matching on its result. We specifically use a two-stage variant of the modality of Davies [13], similar to that of Feltman et al. [17], which inspires some of our notation.

While our type system is essentially a staged type system, our operational interpretation is different from that of staged computation. In staged computation, evaluation proceeds in order of increasing stages. For example, in a two-staged system, all computations of the first stage are evaluated, followed by the second stage. In λip, we don't order evaluation according to the stages; we allow them to occur concurrently. We know that a stage 1 (F) computation cannot possibly inspect a stage 2 (B) computation, but there is no need to wait for all stage 1 computations to complete before we can start a stage 2 computation. This is key to responsive and efficient parallel computation.
Cost Semantics. The cost semantics for λip can be viewed as instrumenting the evaluation to help the programmer reason about cost. This idea of instrumenting evaluations goes back to the early 1990s [36, 37]. Cost semantics have proved to be particularly important in lazy languages (e.g., [37, 38]) and parallel languages (e.g., [4, 5, 41]). Our approach builds directly on the work of Blelloch and Greiner [5] and Spoonhower et al. [41], who use computation graphs represented as dags (directed acyclic graphs) to reason about time and space in functional parallel programs. These cost models, however, consider compute-intensive applications and do not consider interactive applications and responsiveness.

Scheduling. The scheduling principle presented in this paper, prompt scheduling, can guarantee completion time and response time bounds for parallel interactive computations. Prompt scheduling, however, is a principle rather than an algorithm, in the sense that our bounds do not take into account the cost of determining the schedule itself. We only bound the length of the schedule (which implies time) and the total response time. Our bounds are thus similar to Brent's result for scheduling parallel (non-interactive) computations [9]. The design, analysis, and implementation of scheduling algorithms is a vast research topic, spanning multiple areas such as parallel computing, high-performance computing, operating systems, and queueing theory. Here, we briefly discuss a sample of the more closely related work.

The work on scheduling for parallel programs goes back to the 1970's. Ullman [44], Brent [9], and Eager et al. [15] established the hardness of optimal scheduling and the greedy (or Brent) scheduling principle. Based on these early results, many scheduling algorithms have been developed and bounds have been proven [1, 2, 6, 7, 10, 16, 19–21, 35, 43]. More recent papers showed that priority-based schedulers can improve performance in practice [24, 45, 46], but offer no bounds. All of this work, however, considers non-interactive, compute-intensive applications. Muller and Acar [32] developed an algorithm for scheduling blocking parallel programs to hide latency, but do not consider responsiveness.

Scheduling is a key problem in the operating systems community [39]. There has been significant recent interest in making operating systems work well on multicore machines [3, 8]. The focus, however, has been on reducing contention within the OS and, as in the high-performance computing community, distributing resources to jobs so that they can run effectively. Scheduling within a job is less central to OS research.

There has been a great deal of work on scheduling for responsiveness in queueing theory [22]. This line of work assumes a continuous stream of independent jobs arriving for processing according to some stochastic process. Each job is processed or "served" by a single processor (server) that decides at every point in time which of the current jobs to run. The work on queueing-theoretic scheduling, however, has given almost no consideration to parallel jobs, typically assuming jobs to be sequential. Nor has there been any consideration of jobs which, as part of their execution, interact with the external world and thus might need to guarantee responsiveness bounds for specific blocks or tasks.

8. Conclusion

The problem of responsive parallel computing consists of writing parallel programs which perform both computational tasks and interaction, and running these programs so that they show good parallel speedup and remain responsive to input. We predict that this problem will become more important as parallel programming becomes the norm. The language features presented in this paper allow easy expression of responsive parallel programs. A promising area of future work would be to combine these features with a new, well-engineered scheduler based on the prompt scheduling principle. The cost metrics and the cost model for reasoning about responsiveness developed in this paper will hopefully prove useful in such further studies of responsive parallelism.

References

[1] U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems (TOCS), 35(3):321–347, 2002.

[2] U. A. Acar, A. Charguéraud, and M. Rainey. Scheduling parallel programs by work stealing with private deques. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2013.

[3] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 29–44, 2009.

[4] G. Blelloch and J. Greiner. Parallelism in sequential functional languages. In Proceedings of the 7th International Conference on Functional Programming Languages and Computer Architecture, pages 226–237, 1995.

[5] G. E. Blelloch and J. Greiner. A provable time and space efficient implementation of NESL. In Proceedings of the 1st ACM SIGPLAN International Conference on Functional Programming, pages 213–225. ACM, 1996.

[6] G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and H. V. Simhadri. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '11, pages 355–366, 2011.

[7] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46:720–748, Sept. 1999.

[8] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI '08, pages 43–57, 2008.

[9] R. P. Brent. The parallel evaluation of general arithmetic expressions. J. ACM, 21(2):201–206, 1974.

[10] F. W. Burton and M. R. Sleep. Executing functional programs on a virtual tree of processors. In Functional Programming Languages and Computer Architecture (FPCA '81), pages 187–194. ACM Press, Oct. 1981.

[11] M. M. T. Chakravarty, R. Leshchinskiy, S. Peyton Jones, G. Keller, and S. Marlow. Data Parallel Haskell: a status report. In Workshop on Declarative Aspects of Multicore Programming, DAMP '07, pages 10–18. ACM, 2007.

[12] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA '05, pages 519–538. ACM, 2005.

[13] R. Davies. A temporal-logic approach to binding-time analysis. In LICS, pages 184–195, 1996.

[14] R. Davies and F. Pfenning. A modal analysis of staged computation. J. ACM, 48(3):555–604, 2001.

[15] D. L. Eager, J. Zahorjan, and E. D. Lazowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, 38(3):408–423, 1989.

[16] D. G. Feitelson, L. Rudolph, and U. Schwiegelshohn. Parallel job scheduling - a status report. In Job Scheduling Strategies for Parallel Processing (JSSPP), 10th International Workshop, pages 1–16, 2004.

[17] N. Feltman, C. Angiuli, U. A. Acar, and K. Fatahalian. Automatically splitting a two-stage lambda calculus. In Proceedings of the 25th European Symposium on Programming, ESOP '16, Eindhoven, The Netherlands, 2016. Springer-Verlag.

[18] M. Fluet, M. Rainey, J. Reppy, and A. Shaw. Implicitly threaded parallelism in Manticore. Journal of Functional Programming, 20(5-6):1–40, 2011.

[19] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI, pages 212–223, 1998.

[20] J. Greiner and G. E. Blelloch. A provably time-efficient parallel implementation of full speculation. ACM Transactions on Programming Languages and Systems, 21(2):240–285, Mar. 1999.

[21] R. H. Halstead, Jr. Implementation of Multilisp: Lisp on a multiprocessor. In Proceedings of the 1984 ACM Symposium on LISP and Functional Programming, LFP '84, pages 9–17. ACM, 1984.

[22] M. Harchol-Balter. Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, New York, NY, USA, 1st edition, 2013.

[23] R. Harper. Practical Foundations for Programming Languages. Cambridge University Press, New York, NY, USA, 2012.

[24] S. Imam and V. Sarkar. Load balancing prioritized tasks via work-stealing. In Euro-Par 2015: Parallel Processing - 21st International Conference on Parallel and Distributed Computing, Vienna, Austria, August 24-28, 2015, Proceedings, pages 222–234, 2015.

[25] S. M. Imam and V. Sarkar. Habanero-Java library: a Java 8 framework for multicore programming. In 2014 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages and Tools, PPPJ '14, Cracow, Poland, September 23-26, 2014, pages 75–86, 2014.

[26] Intel. Intel Threading Building Blocks, 2011.

[27] J. JáJá. An Introduction to Parallel Algorithms. Addison Wesley Longman Publishing Company, 1992.

[28] G. Keller, M. M. Chakravarty, R. Leshchinskiy, S. Peyton Jones, and B. Lippmeier. Regular, shape-polymorphic, parallel arrays in Haskell. In Proceedings of the 15th ACM SIGPLAN International Conference on Functional Programming, ICFP '10, pages 261–272, 2010.

[29] T. B. Knoblock and E. Ruf. Data specialization. In Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation, PLDI '96, pages 215–225, New York, NY, USA, 1996. ACM.

[30] D. Lea. A Java fork/join framework. In Proceedings of the ACM 2000 Conference on Java Grande, JAVA '00, pages 36–43, 2000.

[31] D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. In Proceedings of the 24th ACM SIGPLAN Conference on Object-Oriented Programming Systems Languages and Applications, OOPSLA '09, pages 227–242, 2009.

[32] S. K. Muller and U. A. Acar. Latency-hiding work stealing. In Proceedings of the 28th Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '16. ACM, 2016.

[33] T. Murphy, VII, K. Crary, and R. Harper. Distributed control flow with classical modal logic. In L. Ong, editor, Computer Science Logic, 19th International Workshop (CSL 2005), Lecture Notes in Computer Science. Springer, August 2005.

[34] A. Nanevski and F. Pfenning. Staged computation with names and necessity. Journal of Functional Programming, 15(5):893–939, 2005.

[35] G. J. Narlikar and G. E. Blelloch. Space-efficient scheduling of nested parallelism. ACM Transactions on Programming Languages and Systems, 21, 1999.

[36] M. Rosendahl. Automatic complexity analysis. In FPCA '89: Functional Programming Languages and Computer Architecture, pages 144–156. ACM, 1989.

[37] D. Sands. Complexity analysis for a lazy higher-order language. In ESOP '90: Proceedings of the 3rd European Symposium on Programming, pages 361–376, London, UK, 1990. Springer-Verlag.

[38] P. M. Sansom and S. L. Peyton Jones. Time and space profiling for non-strict, higher-order functional languages. In Principles of Programming Languages, pages 355–366, 1995.

[39] A. Silberschatz, P. B. Galvin, and G. Gagne. Operating System Concepts (7th ed.). Wiley, 2005.

[40] D. Spoonhower. Scheduling Deterministic Parallel Programs. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2009.

[41] D. Spoonhower, G. E. Blelloch, R. Harper, and P. B. Gibbons. Space profiling for parallel functional programs. In International Conference on Functional Programming (ICFP), 2008.

[42] W. Taha and T. Sheard. MetaML and multi-stage programming with explicit annotations. Theoretical Computer Science, 248(1):211–242, 2000.

[43] O. Tardieu, B. Herta, D. Cunningham, D. Grove, P. Kambadur, V. Saraswat, A. Shinnar, M. Takeuchi, and M. Vaziri. X10 and APGAS at petascale. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 53–66, 2014.

[44] J. Ullman. NP-complete scheduling problems. Journal of Computer and System Sciences, 10(3):384–393, 1975.

[45] M. Wimmer, D. Cederman, J. L. Träff, and P. Tsigas. Work-stealing with configurable scheduling strategies. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 315–316, 2013.

[46] M. Wimmer, F. Versaci, J. L. Träff, D. Cederman, and P. Tsigas. Data structures for task-based priority scheduling. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, pages 379–380, 2014.
