Synthesis of Divide and Conquer Parallelism for Loops ∗

Azadeh Farzan and Victor Nicolet † University of Toronto, Canada

Abstract 1. Introduction Divide-and-conquer is a common parallel programming Despite big advances in optimizing and parallelizing compil- skeleton supported by many cross-platform multithreaded ers, correct and efficient parallel code is often hand-crafted libraries, and most commonly used by programmers for par- in a difficult and error-prone process. The introduction of allelization. The challenges of producing (manually or au- libraries like Intel’s TBB [39] was motivated by this diffi- tomatically) a correct divide-and-conquer parallel program culty and with the aim of making the task of writing well- from a given sequential code are two-fold: (1) assuming that performing parallel programs easier for an average program- a good solution exists where individual worker threads exe- mer. These libraries offer efficient implementations for com- cute a code identical to the sequential one, the programmer monly used parallel skeletons, which makes it easier for the has to provide the extra code for dividing the tasks and com- programmer to write code in the style of a given skeleton bining the partial results (i.e. joins), and (2) the sequential without having to make special considerations for important code may not be suitable for divide-and-conquer paralleliza- performance factors like scalability of memory allocation or tion as is, and may need to be modified to become a part of task scheduling. Divide-and-conquer parallelism is the most a good solution. We address both challenges in this paper. commonly used of these skeletons. We present an automated synthesis technique to synthesize Consider the function sum = 0; correct joins and an algorithm for modifying the sequential sum that returns the sum for (i = 0; i < |s|; i++) { code to make it suitable for parallelization when modifica- of the elements of an ar- sum = sum + s[i]; } tions are necessary. We focus on a class of loops that tra- ray of integers. The code verse a read-only collection and compute a scalar function on the right is a sequential loop that computes this function. over that collection. We present theoretical results for when To compute the sum, in the style of divide and conquer par- the necessary modifications to sequential code are possi- allelism, the computation should be structured as illustrated ble, theoretical guarantees for the algorithmic solutions pre- in Figure 1. The array s of length ∣s∣ is partitioned into n+1 sented here, and experimental evaluation of the approach’s success in practice and the quality of the produced parallel sum(s[0..k1]) sum(s[k1 +1..k2]) sum(s[kn 1 +1..kn]) sum(s[kn +1.. s 1]) ··· | | programs.

CCS Concepts • Computing methodologies → Paral- ··· lel programming languages;• Theory of computation → sum(s[0..k2]) sum(s[kn 1 +1.. s 1]) ··· ··· | | Program verification; Parallel computing models ···

Keywords Divide and Conquer Parallelism, Program Syn- sum(s) thesis, Homomorphisms Figure 1. Divide and conquer style computation of sum. ∗ An extended version of this paper including proofs of theorems can be chunks, and sum is individually computed for each chunk. found at www.cs.toronto.edu/~azadeh/papers/pldi17-ex.pdf † joined Funded by Ontario Ministry of Research’s Early Researcher Award and The results of these partial sum computations are NSERC Discovery Accelerated Supplement Award (operator ⊙) into results for the combined chunks at each in- termediate step, with the join at the root of the tree returning the result of sum for the entire array. The burden of a correct design is to come up with the correct implementation of the Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and join operator. In this example, it is easy to quickly observe the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). that the join has to simply return the sum of the two partial PLDI’17 June 18–23, 2017, Barcelona, Spain results. In general, it could be tricky to reformulate an arbi- © 2017 Copyright held by the owner/author(s). ACM ISBN 978-1-nnnn-nnnn-n/17/06. . . $15.00 trary sequential computation in this style. Recent advances DOI: http://dx.doi.org/10.1145/3062341.3062355 in program synthesis [2] demonstrate the power of synthesis in producing non-trivial computations. A natural question to foundations to answer these questions, and in Section 6, we ask is whether this power can be leveraged for this problem. present an algorithm for producing this lifting automatically In this paper, we focus on a class of divide-and-conquer that satisfies all aforementioned criteria. In summary, the pa- parallel programs that operate on sequences (lists, arrays, or per makes the following contributions: in general any collection type with a linear traversal itera- • We present an algorithm to synthesize a join for divide- tor) in which the divide operator is assumed to be the default and-conquer parallelism when one exists. Moreover, sequence concatenation operator (i.e. divide s into s1 and these joins are accompanied by automatically generated s2 where s = s1 • s2). In Section 4, we discuss how we use machine-checked proofs of correctness (Sections 4, 7). syntax-guided synthesis (SyGuS) [2] efficiently to synthesize • We present an algorithm for automatic lifting of non- the join operators. Moreover, in Section 7, we discuss how parallelizable loops (where a join does not exist) to trans- the proofs of correctness for synthesized join operations can form them into parallelizable ones (Section 6). be automatically generated and checked. This addresses a • We lay the theoretical foundations for when a loop is ef- challenge that many SyGuS schemes seem to bypass, in that, ficiently parallelized (divide-and-conquer-style), and ex- they only guarantee the synthesized artifact be correct for the plore when an efficient lift exists and when it can be au- set of examples used in the synthesis process (or boundedly tomatically discovered (Sections 5, 6). many input instances), and do not provide a proof of correct- • We built a tool, PARSYNT, and present experimental ness for the entire (infinite) data domain of program inputs. results that demonstrate the efficacy of the approach and A general divide-and-conquer parallel solution is not al- the efficiency of the produced parallel code (Section 8). ways as simple as the diagram in Figure 1. Consider the function is-sorted(s) which returns true if an array is sorted, 2. Overview and false otherwise. Providing the partial computation re- sults, a boolean value in this case, from both sides (of a join) We start by presenting an overview of our contributions by will not suffice. If both sub-arrays are sorted, the join cannot way of two examples that demonstrate the challenges of con- make a decision about the sortedness of the concatenated ar- verting a sequential computation to divide-and-conquer par- ray. In other words, a join cannot be defined solely in terms allelism and help us illustrate our solution. We use sequences of the sortedness of the subarrays. as linear collections that abstract any collection data type To a human programmer, it is clear that the join re- with a linear traversal iterator. quires the last element of the first subarray and the first Second Smallest. Con- element of the second subarray to connect the dots. The m = MAX_INT; sider the loop implemen- extra information in this case is as simple as remember- m2 = MAX_INT; tation of the function for (i = 0; i < |s|; i++) { ing two extra values. But, as we demonstrate with an- m2 = min(m2, max(m,s[i])); min2 on the right, that re- other example in Section 2, the extra information required m=min(m, s[i]); turns the second smallest } for the join may need to be computed by worker threads element of a sequence. Figure 2. Second Smallest to be available to the join. Intuitively, this means that Functions min and max worker threads have to be are used for brevity and can be replaced by their standard modified (compared to the O A definitions min(a, b) = a < b ? a ∶ b and max(a, b) = a > sequential code) to compute parallelizable ⇥ loop projection b ? a ∶ b. Here, m and m2 keep track of the smallest and the this extra information in or- I O second smallest elements of the sequence respectively. der to guarantee the exis- sequential loop tence of a join operator. We For a novice, in devising a join for m = min(ml, mr) 1 a divide and conquer paralleliza- call this modification of the code, lifting , for short, after m2 = min(m2l, m2r) the identical standard mathematical concept as illustrated in tion of this example, it is easy to the diagram above, where A stands for the extra (auxiliary) make the mistake of using the incorrect updates illustrated 2 information that needs to be computed by the loop. on the right . The necessity of lifting in some cases raises two ques- The correct join operator computes the combined small- tions: (1) does such a lifting always exist? And, (2) can est and second smallest of two subsequences according the the overhead from lifting and the accompanying join over- following two equations, where the l and subscripts distin- shadow the performance gains from parallelization? In Sec- guish the values coming from the left and the right subse- tion 5, we address both questions. The challenge for automa- quences (respectively): tion is to modify the original sequential loop in a way that it computes enough additional information such that (i) a join m = min(ml, mr) does exist, (ii) the join is efficient, and (iii) the overhead m2 = min(min(m2l, m2r), max(ml, mr)) of lifting is not unreasonably high. We lay the theoretical 2 A significant percentage of undergraduate students in an elementary algo- 1 Not to be confused with lambda lifting or the informal use of it in [24]. rithms class routinely make this mistake when given this exercise. We use syntax-guided synthesis, to synthesize the correct more than just coming up with the correct join implemen- join operators for sequential loops like this one. We use a tation. She has to modify the original sequential code to template which is based on the code of the loop body with make it parallelizable, and then devise the correct join for strategically introduced unknowns to define the state space the lifted sequential loop. The algorithm presented in Sec- of synthesis, and we use homomorphisms to introduce a tion 6 does exactly that. Let us illustrate how the algorithm sufficient correctness specification for synthesis (more on discovers the auxiliary accumulator sum. this in Section 4) that lends itself to efficient synthesis and Consider the general case of computing mts(s) sequen- facilitates automatic generation of correctness proofs for the tially, where sequence s is of length n. Imagine a point in the synthesized code (more on this in Section 7). middle of the computation, when the sequence s has been processed up to and including the index i computing the Maximum Tail Sum. Consider the function mts that re- value mtsi. We have the following sequence of recurrence- turns the maximum suffix sum of a sequence of integers like equations that represent the first two unfoldings of this (positive and negative). For example, mts([1, −2, 3, −1, 3]) = sequential loop starting from index i + 1:

5, which is associated to sum of the suffix [3, −1, 3]. mtsi+1 = max(mtsi+s[i + 1], 0) (1) = The code on the right il- mts = 0; mtsi+2 max(mtsi+1 + s[i + 2], 0) for (i = 0; i < |s|; i++) { lustrates a loop that imple- = max(max(mtsi + s[i + 1], 0) + s[i + 2], 0) (2) ments . To parallelize mts = max(mts + s[i], 0) When the (unfolded) computation above starts, the initial mts } this loop, the programmer value of mtsi is not available to it, since it is being computed needs to come up with a correct join operator. In this case, by a different thread simultaneously. The challenge is to even the most clever programmer would be at a loss, since perform the computation above, without having the value of no correct join operator exists. Consider the partial sequence mtsi in the beginning, and then adjust the computed value [1, 3], when joined with two different partial sequences accordingly once the value becomes available (at join time). [−2, 5] and [0, 5]. We have: Looking at Equation 1, if the value of s[i + 1] is known to the join operator, then it can plug values mtsi, s[i + 1], and mts([1, 3]) = 4 mts([1, 3]) = 4 0 (a program constant) into the expression on the righthand mts([−2, 5]) = 5 mts([0, 5]) = 5 side and compute the correct value of mtsi+1. At this stage, our algorithm (presented in Section 6) conjectures s i + 1 mts([1, 3, −2, 5]) = 7 mts([1, 3, 0, 5]) = 9 [ ] as an auxiliary value and + as its accumulator, since it is actively looking for accumulators to learn, and + appears The values of mts for both pairs of partial sequences are the next to s[i + 1] in the expression. same, yet, the values of the mts for the two concatenations Once an auxiliary accumulator is conjectured, the algo- are different. If a join function ⊙ exists such that mts(s • rithm turns its attention to the next unfolding, namely Equa- t = mts s ⊙ mts t , then this function would have to ) ( ) ( ) tion 2 to test its conjecture. Here, the unknown mtsi is sitting produce two different results for 4 ⊙ 5 in the two instances one level deeper in the expression. This means that the join above. In other words, the mts value of the concatenation has to perform more steps to compute the value of mtsi+2. It is not computable solely based on the mts values of partial is easy to see that the depth of mtsi will linearly increase for sequences. What is to be done? mtsi+3 and so on. At this point, the tool decides to use a set At this point, a clever programmer makes the observa- of standard algebraic equalities to rewrite the righthand side tion that the loop is not parallelizable in its original form. of Equation 2 to a different expression where mtsi appears at a lower depth, following the insight that the closer it is to She discovers that beyond knowing mts([1, 3]) = 4 and the root, the less the amount of work left to the join opera- mts([−2, 5]) = 5, she needs to know that sum([−2, 5]) = = tor. The step by step rewriting of the expression happens as 3, in order to conclude that mts([1, 3, −2, 5]) 7. follows:

Consider a modifica- mts = 0; mtsi+2 = max(max(mtsi + s[i + 1], 0) + s[i + 2], 0) tion of the previous se- sum = 0; for (i = 0; i < |s|; i++) { = max(max(max(mtsi + s[i + 1], s[i + 1]), 0) + s[i + 2], 0) quential loop for mts on mts = max(mts + s[i], 0); = max(max(max(mtsi + s[i + 1] + s[i + 2], the right. There is an addi- sum = sum + s[i]; tional loop variable, sum, } s[i + 1] + s[i + 2]), s[i + 2]), 0) that records the sum of all the sequence elements. In the = max(mtsi + s[i + 1] + s[i + 2], sequential loop that computes mts, this is a redundant com- max(s[i + 1] + s[i + 2], s[i + 2], 0)) putation. We call sum an auxiliary accumulator, and the new = max(mtsi+s[i + 1] + s[i + 2], mts([s[i + 1..i + 2])). loop a lifting of the sequential loop. The structure of the expression in the last line provides The lifted code can now sum = suml + sumr a clue to how the parallel computation can be organized. be parallelized using While one thread is computing mtsi (i.e. mts(s[0..i])), mts = max(mtsr, sumr + mtsl) the join operator on the the second thread computes mts(s[i + 1..i + 2]). To com- right. For loops like this, the burden of the programmer is pute the overall value of mts(s[0..i + 2]), the join would ′ ′ need the partial results mtsi and mts(s[i + 1..i + 2]), e ∈ Exp ∶∶= e ○ e e, e ∈ Exp Expressions and the value of s[i + 1] + s[i + 2]. The algorithm decides ∣x ∣ k x ∈ Var, k ∈ Z, Q, R at this point that s i + 1 + s i + 2 is the auxiliary value ∣s[e] s ∈ SeVar, e ∈ Exp [ ] [ ] ′ that needs to be computed, and realizes that the conjectured ∣if be then e else e accumulator from the previous round makes this informa- ∈ BExp = ′ ′ ∈ Exp tion already available. Therefore, not having seen anything be ∶∶ e e e, e Boolean Exps < ′ ′ ∈ BExp new, it stops and concludes that the auxiliary accumulator ∣be ⋏ be ∣¬be be, be ∣t[e] t ∈ BSVar, e ∈ Exp sum = sum + s[k] (where k is the current iteration num- true false ber) is the new auxiliary computation required to make the ∣ ∣ Program ∶∶= c; c′ c, c′ ∈ Program loop parallelizable. ∣x:=e x ∈ Var, e ∈ Exp ∣b:=be b ∈ BVar, be ∈ Exp ′ ′ 3. Background and Notation ∣if(e){c}else{c } be ∈ Exp, c, c ∈ Program ∣for(i ∈ I){c} i ∈ Iterator This section introduces the notation used in the remainder of the paper. It includes definitions of some new concepts, and Figure 3. Program Syntax . The binary ○ operator represents any arithmetic operation +, −, ∗, , operator represents any some already known in the literature. ( /) comparator (<, ≤, >, ≥, =, ≠). I is an< iteration domain, and ⋏ operator represents any boolean operation (∧, ∨). 3.1 Sequences and Functions Definition 3.3 (Single-Pass Computable). Function h ∶ We assume a generic type Sc that refers to any scalar types → D is single-pass computable iff it is rightwards or used in typical programming languages, such as int, float, leftwards.S char, and bool whenever the specific type is not important in the context. The significance of the type is that scalars are 3.2 Homomorphisms assumed to be of constant size, and conversely, any constant- Homomorphisms are a well-studied class of mathematical size representable data type is assumed to be scalar in this functions. In this paper, we focus on a special class of homo- paper. Moreover, we assume all operations on scalars to have morphisms, where the source structure is a set of sequences constant-time complexity. with the standard concatenation operator. Type defines the set of all sequences of elements of type . For anyS sequence , we use to denote the length of Definition 3.4. A function h ∶ → D is called ⊙- Sc x ∣x∣ S the sequence. x[i] (for 0 ≤ i < ∣x∣) denotes the element of homomorphic for binary operator ⊙ ∶ D × D → D iff the sequence at index , and denotes the subsequence for all sequences x, y ∈ we have h(x • y) = h(x) ⊙ h(y). i x[i..j] S between indexes i and j (inclusive). Concatenation operator Note that, even though it is not explicitly stated in the def- • ∶ × → is defined over sequences in the standard way, S S S inition above, ⊙ is necessarily associative on D since con- and is associative. The sequence type stands in for many catenation is associative (over sequences). Moreover, h([]) collection types such as arrays, lists, or any other collection (where [] is the empty sequence) is the unit of ⊙, because that admits a linear iterator and an associative composition [] is the unit of concatenation. If ⊙ has no unit, that means operator (i.e. concatenation). h([]) is undefined. For example, function head(x), that re- In this paper, our focus is on single-pass computable turns the first element of a sequence, is not defined for an functions on sequences (of scalars to scalars). The assump- empty list. head(x) is ⊛-homomorphic, where a ⊛ b = a tion is that a sequential implementation for the function is (for all a, b) but ⊛ does not have a left unit element. given as an imperative sequential loop. Below, we formally define what a single-pass computable function is. 3.3 Programs and Loops Definition 3.1. A function h ∶ → D is called rightwards For the presentation of the results in this paper, we assume iff there exists a binary operatorS⊕ ∶ D × Sc → D such that that our sequential program is written in a simple imperative for all x ∈ and a ∈ Sc, we have h(x • [a]) = h(x) ⊕ a language with basic constructs for branching and looping. S We assume that the language includes scalar types int and Note that the notion of associativity for ⊕ is not well- bool, and a collection type seq. The syntax of our input pro- defined, since it is not a binary operation defined over a set grams is illustrated in Figure 3. We forego a semantic defini- (i.e. the two arguments to the operator have different types). tion since it is standard and intuitively clear. For readability, we use a simple iterator and an integer index (instead of the Definition 3.2. A function h ∶ → D is called leftwards iff S generic i ∈ ), and use the standard array random access there exists a binary operator ⊗ ∶ Sc × D → D such that I notation a[i] to refer to each element of a collection type. In for all x ∈ and a ∈ Sc, we have h([a] • x) = a ⊗ h(x). S principle, any collection with an iterator and a split function Associativity of ⊗ is likewise not well-defined. (that implements inverse of concatenation) works. There has been a lot of research on iteration spaces and iterators (e.g. [51] in the context of translation validation and [25] in the 4.1 Parallelizable Loops context of partitioning) that formalize complex traversals by We start by formally defining when a loop is paralleliz- abstract iterators. able. This is intuitively related to when the computation per- State and Input Variables. Every variable that appears on formed by the loop body is a homomorphic function. First, the lefthand side of an assignment statement in a loop body we define a function fE that models E (the body of the is called a state variable, and the set of state variables is loop). The reader who is not interested in the details can fast- denoted by SVar. Every other variable is an input variable, forward to Definition 4.1 with an intuitive understanding of and the set of input variables is denoted by IVar. The se- function fE. quence that is being read by the loop is considered an input Let the system of recurrence equations E = ⟨s1 = variable, since it is only read and not modified. Note that exp1(SVar, IVar), . . . , sn = expn(SVar, IVar)⟩ represent SVar ∩ IVar = ∅. the body of a loop. Let IVar = {σ, ι, i1, . . . , ik} where σ is the sequence variable, and ι is the current index (which Model of a Loop Body. We introduce a formal model for we assume to be an integer for simplicity) in the sequence, the body of a loop which has no other loops nested inside = and ⃗i ⟨i1, . . . , ik⟩ captures the rest of the input variables it. The loop body, therefore, consists of assignment and in the loop body. Let be the set of all such k-ary vectors. conditional statements only. I For a loop body E, define function fE = f1 œ f2 œ Let SVar = {s1, . . . , sn}. The body is modelled by a ... œ fn, where each fi ∶ × int × → type(si) is a func- system of (ordered) recurrence equations E = ⟨E1,...En⟩, S I tion such that fi(σ, ι,⃗i) returns the value of si at iteration ι where each equation Ei is of the form si = expi(SVar, IVar). with input values σ and ⃗i. It is, by definition, straightforward Remark 3.5. Every non-nested loop in the program model to see that ⟨f1(σ, ι,⃗i), . . . , fn(σ, ι,⃗i)⟩ represents the state of in Figure 3 can be modelled by a system of recurrence the loop at iteration ι. equations (as defined above). Definition 4.1. A loop, with body E, is parallelizable iff fE This can be achieved through a translation of the loop body is ⊙-homomorphic for some ⊙. from the imperative form to the functional form. In [15], we Definition 4.1 basically takes the encoding of the loop as include a description of how this can be done, but also re- a tupled function, and declares the loop parallelizable if this fer the interested reader to [3, 26] for a more complete ex- tupled function is homomorphic. It is important to note that planation. The essence of this translation, which is through parallelizable here is not used in the broadest sense of the transformation of conditional statements into conditional ex- term. It is limited to the particular scheme of divide-and- pressions, is illustrated in the example below. conquer parallelism where the divide operator is the inverse Example 3.6. Consider m = MAX_INT; of concatenation. m2 = MAX_INT; the long version of the for (i = 0; i < |s|; i++) { 4.2 Syntax-Guided Synthesis of Join second smallest example if m > s[i] then if m2 > m then m2 = m; We use syntax-guided synthesis (SyGuS)[2] to synthesize from Section 2 on the else right, where the auxiliary if m2 > s[i] then m2 = s[i]; the join operator for parallelizable loops. The goal of pro- if m > s[i] then m = s[i]; gram synthesis is to automatically synthesize a correct pro- functions min and max } are replaced with their definitions as conditional statements gram based on a given specification. SyGuS is an emerg- and where the code complies with the syntax in Figure 3. ing field with several existing solvers. The task of the user of these solvers is to define (1) a correctness specification The loop body m2 = if m2 < (if m > s[i] then m else s[i]) for the program to be synthesized, and (2) provide syntactic is then mod- then m2 constraints that define the state space of possible programs elled by re- else (if m > s[i] then m else s[i]); m=if m < s[i] then m else s[i]; (mainly for tractability). In this section, we describe what currence equa- the correctness specification and syntactic constrains that we tions on the right, through the transformation of the condi- use. Beyond that, the specific design of these two elements, tional statements above into the assignment statements that ′ this work does not make any new contributions to SyGuS. use conditional expressions (if be then e else e ). Note that we focus on synthesizing binary join operators 4. Synthesis of Join in this paper. The restriction is superficial, in the sense that any other statically fixed number would work. And, it is not The premise of this section is that the sequential loop under hard to imagine an easy generalization to a parametric join. consideration is parallelizable, in the sense that a join op- eration for divide-and-conquer parallelism exists. We use a 4.2.1 Correctness Specification system of recurrence equations E in the style of Section 3.3 The homomorphism definition 3.4 provides us with the cor- to model the body of a loop for all formal and algorithmic rectness specification to synthesize a join operator ⊙ for a developments in this paper, and therefore, we use the term loop body E (or rather fE to be precise). In case of the loop body to refer to E whenever appropriate. specific SyGuS solver used in this paper (and many others similar to it), a bounded set of possible inputs are required, ne0 ∶∶= x ∣ c x ∈ nV ars, c a numeric constant and therefore, the correctness specification is formulated on ned>0 ∶∶= ned−1 ⊕ ned−1 ∣ − ned−1 symbolic inputs of bounded length K. A join ⊙ is a solution ∣ if (bed−1) ned−1 else ned−1 of the synthesis problem if for all sequences x, y of length be0 ∶∶= b ∣ true ∣ false b ∈ bV ars less than K, we have f x • y = f x ⊙ f y . E( ) E( ) E( ) bed>0 ∶∶= bed−1 ∧◯ bed−1 ∣ ¬bed−1 There is a tension between having a small enough K for ∣ ned−1 ned−1 < the solver to scale, and having a large enough K for the ∣ if (bed−1) bed−1 else bed−1 answer to be correct for inputs that are larger than K. We use small enough values for the solver to scale, and take care ⊕ ∶= +, −, min, max, ×, ÷ binary numeric = > >= < <= = of general correctness by automatically generating proofs of ∶ , , , , comparisons < = , correctness (see Section 7). ∧◯ ∶ ∧ ∨ binary boolean Figure 4. Grammar of expressions used for ?? and ?? holes. 4.2.2 Syntactic Restrictions LR R ned and bed correspond to expressions of depth up to and equal to The syntactic restrictions are defined through a pair of a d. nVars and bVars stand for numeric and booleans variables. sketch and a grammar for expressions. A sketch is a partial To characterize the scope of expressivity of our compilation program containing holes (i.e. unknown values) to be com- function , we provide formal conditions under which the pleted by expressions in the state space defined by the gram- synthesisC approach of Section 4.2 is successful in discover- mar. The sketch is an ordered set of equations E = ⟨s1 = ing a correct join. This also facilitates a comparison of our exp1(SVar, IVar), . . . , sn = expn(SVar, IVar)⟩. Intuitively, join synthesis with the parallelization approach of [34]. it is produced from the loop body by replacing occurrences of variables and constants by holes. Note that the inputs to Definition 4.3. Function g is a weak right inverse of func- = a join are the results produced from two worker threads, to tion f iff f ◦ g ◦ f f. which, we refer as left and right threads. To contain the state Each function may have many weak right inverses. Intu- ′ space of solutions described by this sketch, we distinguish itively, g produces a sequence s (out of a result r = f(s)) ′ ′ two different types of holes. such that f(s ) = r = f(s). Therefore, s (if much shorter Left holes ??LR are holes that can be completed by an than s) can be viewed as a summary of s with respect to the expression over variables from both the left and the right computation of f on s. In [34], this very notion of a weak threads. Right holes ??R will be filled with expressions over right inverse is used to produce parallel implementations of variables from the right thread only. functions. It is required that g’s output is bounded or equally We define a compilation function as that s′ is always a constant length sequence (i.e. independent C C (c) = ??R of ∣s∣). Our join synthesis is also guaranteed to succeed un- ?? if x ∈ IVar = R der the same assumption. Note that, however, in contrast to C (x) v ∈ ??LR if x SVar [34], we do not require both left and right implementations = C (x[e]) ??R of the function as input, and either one alone suffices. C (op(e1, .., en)) = op (C(e1), ..., C(en)) Proposition 4.4. If the loop body E is parallelizable and where e is an expression, op is an operator from the input language, x is a variable, and c is a constant. The sketch for the weak right inverse of fE is constant time computable the join code will then be (that guarantees that it returns a list of bounded length), then there is a ⊙ in the space described by the sketch of E such C(E) = ⟨s1 = C(exp1(SVar, IVar)), . . . , sn = C(expn(SVar, IVar))⟩ that is -homomorphic. where each hole in (exp ) (1 ≤ i ≤ n) can be completed fE ⊙ C i by expressions in a predefined grammar that is suitable for Proposition 4.4 (see [15] for a proof sketch) provides us a given class of programs. For the experiments in this paper, with the guarantee that under the conditions given, the state the grammar in Figure 4 is used, where the expression depth space defined for the join, through the sketch compilation d is gradually increased until a solution is found. function does not miss an existing solution. Note that the compilationC function maintains the structure of the ex- Example 4.2. m2 = min(??LR, max(??LR, ??R)); pressions when it compilesC a sketch (Remember Example Consider the m=min(??LR, ??R); second smallest 4.2). When the weak inverse is constant-time computable, example from Section 2. The sketch (for the code in Figure the third homomorphism theorem [19] guarantees that a so- 2) is illustrated on the right. It is clear that the join presented lution faithful to this structure exists. The computation of in Section 2 can be simply discovered using this Sketch the weak inverse itself is done through the plugged expres- when the unknowns are filled by instances of state variables. sions for the unknowns ??L and ??LR where a fully generic grammars as illustrated in Figure 4 is used, that includes all 4.3 Efficacy of Join Synthesis possible constant-time computations at an appropriate depth. Syntactic limitations imposed for scalability in SyGuS run Proposition 4.4 characterizes only a subset of functions, the risk of leaving a correct candidate out of the search space. for which join synthesis succeeds. There are many (simple) functions whose weak right inverse is not bounded, where from scratch. It is clear that a parallel implementation based a join is successfully synthesized. For example, function on this idea is less efficient than the sequential implementa- length is a simple example, where the weak right inverse tion of f. Therefore, just making the function homomorphic always has to be a list of the same length as the original list will not result in a good solution for parallelism. Next, we (which would lead to an inefficient parallel implementation). identify a subset of homomorphic liftings that are computa- If join synthesis fails, our tool un-constrains the compiled tionally efficient. sketch to include more expressions gradually until a join is found. In practice, this was not necessary for any of our 5.2 Computationally Efficient Homomorphisms benchmarks. In Section 6.4, we comment on how some Let us assume that f ∶ → D is a single-pass linear time feedback from the algorithm presented in Section 6.1 helps computable (see DefinitionS 3.3) function and elements of D with constructing an effective sketch when generalization are tuples of scalars. The assumption that the function is beyond the sketch compilation function may be required. linear time comes from the fact that we target loops without C any loops nested inside, which (by definition) are linear-time 5. Parallelizability computable (on the size of their input sequence). The mts (maximum tail sum) example from Section 2 Consider the (rightwards) sequential computation of f demonstrates that a loop is not always parallelizable. Here that is defined using the binary operator ⊗ as follows: (and later in Section 6), we discuss how a loop can be lifted f y • a = f y ⊗ a to become parallelizable. ( ) ( ) 5.1 Homomorphisms and Parallelism where a ∈ Sc, f ∶ → D, and ⊗ ∶ D × Sc → D. The fact that f is single-passS linear time computable implies that A strong connection between homomorphisms and paral- the time complexity of computing f at every step (of the lelism is well-known [18, 20, 34], specifically that, homo- ′ recursion above) is constant-time. Let f be a lifting of f morphic functions admit efficient parallel implementations. that is ⊙-homomorphic (for some ⊙). Theorem 5.1 (First Homomorphism Lemma [9]). A func- Proposition 5.3. The (balanced) divide-and-conquer imple- tion h ∶ → Sc is a homomorphism if and only if h can be ′ mentation based on f has the same asymptotic complexity written asS a composition of a map and a reduce operation. as f (i.e. linear time) iff the asymptotic complexity of ⊙ is The if part of Theorem 5.1 basically states that a ho- sub-linear (i.e o(n)). momorphism has an efficient parallel implementation, since The simplest case is when ⊙ is constant time. We call efficient parallel implementations for maps and reductions these constant homomorphisms for short. Note that the syn- are known. The only if part is equally important, because thesis routine presented in Section 4 synthesizes constant it indicates that a large class of simple and easy to formu- time join operators exclusively. In the next section, we late parallel implementations of functions ought to be homo- present an algorithm that lifts a non-homomorphic function morphisms. Theorem 5.1 gives rise to an important research to a constant homomorphism if one exists. question: If a function is not a homomorphism, can the se- Let us briefly consider the super-constant yet sub-linear quential code be modified (lifted) so that it corresponds to case for joins, to justify why this paper only focuses on a homomorphic function to facilitate easy parallelization? In the constant-time joins and ignores the super-constant joins. Section 6, we answer the question in the affirmative. But first Based on the definition of homomorphism, we know: in this section, we argue why this lifting should be done care- ′ ′ ′ fully to maintain performance advantages of parallelization. f ([x1, . . . xn]) = f ([x1, . . . xk]) ⊙ f ([xk+1, . . . xn]) ′ Non-homomorphic Functions. There is a simple observa- For ⊙ to be super-constant in n, f ([x1, . . . xk]) (respec- ′ tion, which was made in [21] (and probably also in earlier tively, f ([xk+1, . . . xn])) has to produce a super-constant work), that every non-homomorphic function can be made size output. Constant size inputs (to ⊙) cannot yield super- homomorphic by a rather trivial lifting: constant time executions. Since the output of f is assumed to Proposition 5.2. Given a function → , the function be constant size (scalars and tuples of them are by definition f ∶ Sc ′ f œ ι is ⟐-homomorphic where S constant size), the output to f is super-constant only due to the additional auxiliary computation that makes f ′ a homo- ι(x) = x ∧ (a, x) ⟐ (b, y) = f(x • y) morphism, but is not part of the sequential computation of f. We believe the automatic discovery of auxiliary informa- Operation œ denotes tupling of the two functions in the stan- tion of this nature can be a fundamentally hard problem. The dard sense f œ g (x) = (f(x), g(x)). Basically, the ι com- sub-linear (super-constant) auxiliary information is often the ponent of the tuple remembers a copy of the string that is be- result of a clever algorithmic trick that improves over the ing processed and the partial computation results f(x) and trivial linear auxiliary (i.e. remembering the entire sequence f(y) are discarded by ⟐, which then computes f(x • y) as was discussed in Proposition 5.2). The field of efficient data streaming algorithms [1, 4] includes a few examples of ofs, bal ([]) ofs, bal ([0..k]) such clever algorithms. Join synthesis can be adapted to han- h i h i dle such cases if the auxiliary information is available. How- s[0] s[k + 1] ever, since discovery of super-constant auxiliary information ofs, bal (s[0]) ofs, bal (s[0..k + 1]) is a necessary step for automation, and extremely difficult to ··· h i h i ··· do automatically, we only target constant homomorphisms in this paper. ofs, bal (s[0..k 1]) ofs, bal (s[0.. s 2]) h i h i | | 6. Synthesizing Parallelizable Code s[k] s[ s 1] | | ofs, bal (s[0..k]) ofs, bal (s[0.. s 1]) Let us assume that the synthesis of the join operator from h i h i | | Section 4.2 fails. The reason could be that either (1) the function is not a homomorphism, or (2) the function is a ho- Figure 5. Sequential computation broken into two threads. momorphism, but syntactic restrictions for synthesis exclude unknown ⟨ofs, bal⟩(s[0..k]). Then at the end, when the val- the correct join operator. We deal with case (1) by proposing ues become known (through the left thread), the join opera- an algorithm that lifts the function to a homomorphism by tor would adjust the result based on this known value. adding new computation to the loop body, and briefly com- Algorithm 1 illustrates our main algorithm. The algo- ment on case (2) in Section 6.4. rithm (in function Lift()) inspects each state variable sep- arately to see if its divide-and-conquer computation re- 6.1 The Algorithm quires introduction of new auxiliary accumulators (by call- ing Solve()). In our example, the algorithm discovers that In Section 2, we used the bal = true; mts example to show how ofs = 0; ofs requires no new auxiliary accumulators, but bal does. for (i = 0; i < |s|; i++) { In function Solve(), it starts simulating the loop from an inspecting unfoldings of 0 0 if s[i] == ’(’ then arbitrary state ⟨s , . . . , s ⟩, mirroring the arbitrary break in the computation could re- ofs = ofs + 1; 0 n sult in the discovery of dis- else Figure 5 where the red value is swapped with this arbitrary initial state. The while loop then iteratively unfolds the ex- covery of the auxiliary ac- ofs = ofs - 1; bal = bal && (ofs >= 0); pression in the style of Equations 1 and 2 in Section 2 for cumulator sum. There, the } the mts example. ‘unfold’ (in Algorithm 1) computes the k- intuitive idea was to inspect expressions computing mts val- th unfolding, for a given k. Let us focus on bal, which is the ues to find an equivalent expression where the unknown ap- problematic part of the loop state, and consider the first few peared at a lower depth. Here, we use a new example, to in- unfoldings of the computation of the right hand side thread. troduce our algorithm, and provide insight for why the algo- The first step is as follows: rithm goes beyond just lowering the depth of the unknowns. bal s 0..k + 1 = bal s 0..k ∧ ofs s 0..k + 1 ≥ 0 The code above corresponds to a sequential loop that checks ( [ ]) ( [ ]) ( ( [ ]) ) if an input string is a balanced sequence of parentheses. The = bal(s[0..k]) ∧ integer variable ofs (for offset) keeps track of the differ- ofs(s[0..k]) + ofs([s[k + 1]]) ≥ 0 ence in the number of open and closed parentheses seen so using the equality ofs(x • y) = ofs(x) + ofs(y). The far, and the boolean variable bal maintains the status of bal- final expression above has both its unknown (red) values ancedness of prefix read so far. At the end of the loop, the at the optimal depth in the expression tree, and does not string is balanced if ofs = 0 ∧ bal = true. seem to require any auxiliary information to compute it once If the loop only consisted of ofs, then it would corre- the unknowns become knowns3. Let us look at the next spond to a homomorphism. It is easy to see that ofs(x•y) = unfolding (using the result of the first step above): ofs(x) + ofs(y). The same is not true for bal. If x is bal- bal s 0..k 2 = bal s 0..k 1 ofs s 0..k 2 ≥ 0 anced and y is not, then x • y could be balanced (or not). ( [ + ]) ( [ + ]) ∧ ( ( [ + ]) ) To determine this, information beyond the boolean values of = [bal(s[0..k]) ∧ bal and the integer values of ofs for x and y is required. (ofs(s[0..k]) + ofs([s[k + 1]])) ≥ 0] ∧ Let us see how our algorithm discovers the required auxil- [ofs(s[0..k]) + ofs(s[k + 1..k + 2]) ≥ 0] iary information. Consider breaking the sequential computa- tion of the above code into two threads as illustrated in Fig- Now, an alarming pattern seems to emerge. The number of occurrences of the unknown ofs s 0..k has doubled, ure 5. The value of ⟨ofs, bal⟩(s[0..k]) (illustrated in red), ( [ ]) and to compute the expression once it is known, one needs to that is the result of the computation on the left, will not be store the intermediate result for ofs([s[k + 1]]) to use. Intu- known to the computation on the right when it is needed itively, it is clear that with 3 or more unfoldings, the pattern at the very beginning. This dependency is the killer of par- continues: the number of unknowns are arbitrary replicated, allelism. The idea behind Algorithm 1 is to see if we can rearrange the computation on the right so that it can be per- 3 To be fully accurate, the algorithm guesses an auxiliary accumulator here formed with a constant known initial value, instead of the which will later be declared to be redundant since it is effectively ofs. Algorithm 1: Homomorphic Lifting Computation ‘normalize’ takes the unfolded expression and returns an equivalent expression of the form exp(e , . . . , e , e) where Data: A set of recurrence equations E 1 m e only refers to the input variables and each e is an expres- Result: A set of recurrence equations ′ i E sion of the form illustrated in the inset, where v is a state Function Lift(E) ′ variable. The intuition is that other than the occurrences of E ← ∅; the state variables (which is the = for each si expi in E (in order) do unknown), the remainder of e ⌦ ′′ i v if Solve(“si = expi ) = null then needs to captured by an aux- report failure input iliary accumulator and made variables else available to the join operator. ′ ′ = ′′ E ← E ∪Solve(“si expi ); In our example, we would have m = 2, e = true, ′ return E ; e1 = bal(s[0..k]), and e2 = ofs(s[0..k]) ≥ Function Solve(s = Exp(SVar, IVar)) max(−ofs([s[k + 1]]), −ofs(s[k + 1..k + 2])). This 0 0 leads to − + − + + Initially k = 1, Aux = ∅, σ0 = ⟨s0, . . . , sn⟩ max( ofs([s[k 1]]), ofs(s[k 1..k 2])) while Aux =/ OldAux do being conjectured as an auxiliary accumulator from e2, and OldAux ← Aux ; nothing to be conjectured from e1. Finally, ‘collect’ gathers these e ’s in . For each e , we check if it can be assembled τ ← unfold(σ0, s, E, k); i i using existingE auxiliary accumulators in Aux. If it can, the ` ← Normalize(τ, E); algorithm moves on. If no, a new auxiliary variable and if ` = empty then accumulator is added. The while loop continues until no return empty; new auxiliaries are discovered. ← collect(`, SVar); E for each e in do if e alreadyE covered by something in Aux 6.1.1 Normalization then The most complex step of Algorithm 1 is the task performed Add the accumulator and continue by ‘normalize’. The description of ‘normalize’ in Algorithm else Create a new variable in Aux, with is intentionally left as a functional specification only, without an implementation. This abstract version of normalize is expression e. k ← k + 1; assumed to always succeeds when the appropriate auxiliary accumulator exist. This is based on the simple idea that if the Aux ← remove-redundancies(Aux); algebraic rules needed for simplification/manipulation of the return Aux; expression are given, and effective strategy exists to explore Function Normalize(τ, E) the state space of expressions defined by these rules, then Output: A set of expressions e , . . . , e , e such { 1 k } an existing expression in the space will always be found. We that τ = f e , . . . , e , e , each e has ( 1 k ) i assume this idealized version of normalize to propose a crisp exactly one occurrence of a state variable completeness theorem in Section 6.3.2. at depth 1. The existence of an effective and efficient implementation and all intermediate values of ofs([s[k + 1..k + m]])) need (to replace this idealized version) depends on the given ex- to be stored (for all 1 ≤ m ≤ n − k − 1 ), to retain the pression, operators that appear in it, and the algebraic equal- ability to compute the total bal at the end. This leads us to ities that hold for those operators. There has been a lot of the second principle that we use in rearranging these expres- research in the area of rewrite systems [29, 30, 35] that can sions, namely reducing the number of occurrences of the un- inspire several heuristics for normalization. The problem can knowns. Using the last algebraic rule in Figure 6, the above also be formulated using standard syntax-guided synthesis, expression can be rewritten as: although we suspect that this solution will not scale well bal(s[0..k + 2]) = bal(s[0..k]) ∧ since (in addition to standard scalability problems in SyGuS) [ofs(s[0..k]) ≥ max(−ofs([s[k + 1]])), −ofs(s[k + 1..k + 2])] non-linearity can easily be introduced, even in simple cases. The main complication in a search for an equivalent ex- This solves the problem. Instead of having to remember pression (of a given form) is that the state space for this all intermediate values of ofs([s[k + 1..k + m]])), the loop search can be infinite, with no particular structure to guide a can just remember the maximum value of their negations. finitary search. We propose a heuristic normalization proce- This can be added as an auxiliary accumulator to the sequen- dure that utilizes a cost function to enforce a finitary search. tial loop to lift it to a homomorphism. Algorithm 1 finds the The heuristic is simple and yet seems to work well (effi- equivalent expression above through the call to ‘normalize’, ciently and effectively) for the benchmarks in this paper. Be- which is the most important step of the algorithm. The de- low, depe(v) is the depth of the deepest occurrence of v and tails of normalization are explained in the next section. occe(v) is the number of occurrences of v in expression e. Cost-directed rewrite rules (apply the rule only if it reduces 6.3.1 Existence of Constant Homomorphic Liftings the cost of the expression): Here, independently of any algorithm, we deliberate over the a ⊙ b → b ⊙ a (a ⊙ b) ⊙ c → a ⊙ (b ⊙ c) question of existence of a constant homomorphic lifting (see (a ⊙ b) ⊗ c → (a ⊗ c) ⊙ (b ⊗ c) Section 5.2). Remember that lifting the loop to a homomor- (c ? x ∶ y) ⊙ z → c ? (x ⊙ z) ∶ (y ⊙ z) phism with a single linear-size auxiliary variable (Proposi- c1 ? (c2 ? x ∶ y) ∶ z → c1 ∧ c2 ? x ∶ (¬c2 ? y ∶ z) tion 5.2) is always possible. ¬(a ∧ b) → (¬a) ∨ (¬b) The important observation of this section is that for a −(a + b) → (−a) − b single-pass linear time computable function, a constant ho- a > c ∨ b > c → max(a, b) > c momorphic lifting does not necessarily exist. First, a func- c > a ∧ c > b → c > max(a, b) tion may be single-pass linear time computable in one direc- Figure 6. ⊙ and ⊗ stand in for the appropriate arithmetic tion (e.g. leftward) and not the other (e.g. rightward): ∗ ones. For brevity, we only include one rule from each alge- Remark 6.3. Let hn ∶ {0, 1} → bool be defined as braic equality omitting the symmetric versions from the list. hn(w) = true iff w[n] = 1 (i.e. the n-th letter). h is right- Definition 6.1. The cost function CostV ∶ exp → Int × wards but not leftwards linear time computable in n. Int assigns to any expression , a pair where is e (d, n) d For a function like h, there does not exist a constant Max ∈ dep v , and n is occ v . v V e( ) ∑v∈V e( ) homomorphic lifting. The process of normalization can be formalized by a set Proposition 6.4. If a function h is not leftwards (rightwards) of rewrite rules R that are derived from a set of standard linear-time computable, then there exists no homomorphic algebraic equalities that hold for the operators appearing in lifting of h that is linear-time computable. the program text. Each algebraic equality gives rise to two rewrite rules (one for each direction of the equality). Figure This is a simple consequence of the Third Homomorphism 6 includes instances of the rules that we use in our imple- Theorem [19]. A homomorphism naturally induces a left- mentation. Given a set of algebraic rules R, we define the ward and a rightward computable function. Existence of cost minimizing normalization procedure as the successive a constant homomorphic lifting would then imply that the application of rewrite rules in R in order to reduce an ex- function should be single-pass linear-time computable both pression to an equivalent expression with minimal cost. The leftwards and rightwards which is a contradiction. algorithm is a standard cost-driven search algorithm. Evalua- The natural question to ask, after the negative result of tion of this algorithm (with the set of rules provided in Figure Proposition 6.4 is: when does a constant homomorphic lift- 6) in Section 8 demonstrates that despite its simplicity, it is ing exist? One characterization is under a condition similar fast and effective in finding the fruitful normal forms. It fails to the one used in Proposition 4.4. to infer only 1 auxiliary accumulator (among 15), and the Theorem 6.5. If f is leftwards and rightwards linear time reason for that is the shortcoming of the set of rewrite rules computable, and the weak right inverse of f is constant time in dealing with conditional statements. The algebraic rules computable, then there exists a binary operator ⊙ such that regarding conditionals in Figure 6 can only normalize light f is a constant ⊙-homomorphism. instances of conditionals. Devising a sophisticated heuristic normalization is a topic of interest for future work. It remains an open problem whether the restriction on the computation of the weak right inverse above can be lifted to 6.2 Soundness of Algorithm 1 give rise to the following desirable result. The following Proposition formally states the correctness Conjecture 6.6. If f is leftwards and rightwards linear time of Algorithm 1. Note that in all formal statements about computable, then there exists a binary operator ⊙ such that Algorithm 1, we assume the idealized version of ‘normalize’ f is a constant ⊙-homomorphism. as illustrated in Algorithm 1. For the analysis of the algorithm in Section 6.3.2, the Proposition 6.2. If Algorithm 1 terminates successfully and most important observation of this section is that not all reports no new auxiliary variables, then the loop body cor- single-pass linear time loops can be parallelized through a responds to a ⊙-homomorphic function (for some ⊙). If it constant homomorphic lifting. terminates successfully and discovers new auxiliary accu- mulators, then the lifted loop body augmented with them cor- 6.3.2 Completeness of Algorithm 1 responds to a ⊙-homomorphic function (for some ⊙). Since not every single-pass linearly computable function can be lifted into a constant homomorphism, there exists 6.3 Completeness no complete algorithm the entire class of these function. Here, we formally discuss when Algorithm 1 can be ex- It remains to determine how complete Algorithm 1 is for pected to terminate and successfully generate a divide and single-pass linearly computable functions that can be lifted conquer parallelization of the sequential loop. to constant homomorphisms (see Theorem 6.5). Theorem 6.7. For a function f, if there exists a constant 7. Correctness of the Synthesized Programs ⊙-homomorphic lifting f œ g, then there exists a ⊛- We use a SyGus solver that relies on bounded checks to homomorphic lifting where synthesize join operators, and therefore, correctness of the (f(x), g(x)) ⊛ (f(y), g(y)) = synthesized join is not guaranteed for all input sequences. In this section, we discuss a heuristic scheme to generate proofs exp f x , f y , g y , exp′ f x , g x , f y , g y ( ( ( ) ( ) ( )) ( ( ) ( ) ( ) ( ))) of correctness for the general case automatically. Automatic that is, the value of f(x • y) component of the join does not verification of code is known to be a hard problem. Full depend on g(x). automation often comes at the cost of incompleteness. The method presented here is no exception to this rule. The claim Theorem 6.7 is very essential in the completeness argu- is that based on a few key observations, the boundaries of ment for Algorithm 1. The way the algorithm operates is automation can be extended to include many of the typical that it tries to discover exp(f(x), f(y), g(y)) through un- benchmarks that are in scope for this work. folding the y part of f(x • y), where it does not have access We use an example to introduce our automated proof to g(x). To make any claims about completeness of Algo- construction scheme. Let us assume that we want to verify rithm 1, one has to argue that it is sufficient to look at f(x), the correctness of the join operator synthesized to parallelize f(y), and g(y) and not g(x). Equally important is to con- the mts example. We use Dafny [28] program verifier to sider that when two expressions are equivalent, in general, a encode and check the proof of correctness. set of algebraic rules (i.e. axioms) that prove their equiv- alence through finitelyR many steps may not exist. Under the A full functional correctness proof of the parallel program is assumption of existence of such rules, we can state the fol- not necessary, rather, it suffices to show that the join forms a homomorphism together with the (lifted) sequential code. lowing completeness theorem about Algorithm 1: Theorem 6.8 (Completeness). If there exists a constant ho- The sequential code together with Definition 3.4 consti- momorphic lifting of the loop body E then Algorithm 1 suc- tute the specification of correctness for the join operator. ceeds in discovering this lifting, if the appropriate set of al- Note that, by piggybacking on the assumption that origi- gebraic rules are provided. nal sequential code is correct, only this one type of property R (e.g. formation of a homomorphism) has to ever be proved Note that completeness is contingent on the availability of a for any given program. This conveniently limits the scope sufficient set of algebraic rules and an effective search strat- of automated verification. As mentioned in Section 3.3, we egy that would lead to the discovery of the correct candidate. construct a functional variant of the loop body, which is used Having a sufficient set of rules is theoretically impossible at every stage of our approach as its representation. This is in general, yet practically achievable for standard programs. very helpful for proof construction. An efficient convergent search strategy that does not miss the correct candidate, however, is more elusive in practice since Proofs are constructed for program fragments in our inter- mediate functional form, and therefore, there is no need to the problem is known to be undecidable [12] under certain handle loops or synthesize loop invariants. conditions for the algebraic rules. We use this intermediate functional form to model the 6.4 Feedback to Join Synthesis loop body as a collection of functions in Dafny. The general There is an important observation that Algorithm 1 includes rule is that every state variable in the loop body is modelled information that can be helpful to join synthesis. When Al- by a unique Dafny function. Figures 7(a) and 7(b) illustrate gorithm 1 terminates successfully and discovers no new aux- the Dafny functions corresponding to state variables sum and iliary accumulators, it certifies the loop as a homomorphic mts respectively. Proofs are constructed for these recursively function (Proposition 6.2). If join synthesis has previously defined functions, where more often than not, the correctness failed to synthesize a join for this loop, the conclusion is that specifications for functions, specially for simpler ones, are syntactic restrictions for join must have been too strict. In inductive. This means that no further (human provided anno- this instance, the normalized expression used for the discov- tations) are required for the proof, which is in harsh contrast ery of the accumulators contains hints about the shape of the to imperative looped programs where loop invariants have join operator, and these hints can be used for re-instantiating to be provided even for the simplest of loops. It is impor- join synthesis. These hints are structure information (i.e. a tant to note that instances do exist that a strengthening of the template for the expression tree) that can be extracted from specification (in form of injection of known invariants) is re- exp(e1, . . . , em, e), the result of ‘normalize’. This expres- quired to make the specification inductive. However, we did sion provides a recipe for how partial results (for bounded not encounter any such example among our benchmarks. unfoldings) can be combined at join time, and therefore, can The next step is to model the synthesized join. Each be used as a cheat sheet for devising a join sketch. In our state variable has a distinct join function. In this example, benchmarks, join synthesis succeeded on all instances and we introduce a Dafny function to model the join for Sum this trick was never used. as illustrated in Figure 7(d), and another one for Mts as (a) function Mts(s: seq): int (b) (c) function Sum(s: seq): int function Max(x: int, y: int): int { if s == [] then 0 else { if s == [] then 0 else { if x

function SumJoin(sum_l: int, sum_r: int): int function MtsJoin(mts_l: int, sum_l: int, mts_r: int, sum_r: int): int { sum_l + sum_r } (d) { Max(mts_r, mts_l + sum_r) } (e)

lemma HomomorphismSum(s: seq, t: seq) lemma HomorphismMts(s: seq, t: seq) ensures Sum(s + t) == SumJoin(Sum(s), Sum(t)) ensures Mts(s + t) == MtsJoin(Mts(s), Sum(s), Mts(t), Sum(t)) { { if t == [] { assert(s + t == s); } if t == [] { assert(s + [] == s); } else { else { calc { calc { Mts(s + t); Sum(s + t); == {HomomorphismSum(s,t[..|t|-1]); == {assert (s + t[..|t|-1]) + [t[|t|-1]] == s + t;} assert (s + t[..|t|-1]) + [t[|t|-1]] == s + t;} SumJoin(Sum(s), Sum(t)); } MtsJoin(Mts(s), Sum(s), Mts(t), Sum(t)); } } } } (f) } (g)

Figure 7. The encodings of sum (a) mts (b), max (c). The encoding of join in two parts for sum (d) and mts (e). The proofs that the joins form a homomorphism in two parts for sum (f) and mts (g). s + t denotes the concatenation of sequences s and t in Dafny, [] is the empty sequence while |s| stands for the length of the sequence s. The lemmas are proved by induction through the separation of the base case t == [] and the induction step. The calc environment provides the hints necessary for the induction step proof. illustrated in 7(e). Each function corresponds to an update This guidance is generic, i.e. not dependent on the nature (a line of code) synthesized for the (overall) join operator of the function Sum, and is applicable to any other function. (as presented in Section 2). Consider the proof for the correctness of MtsJoin (Figure 7(g)). This proof is identical to that of HomomorphismSum The proof is modularly constructed for each state variable. lemma, with the minor difference of recalling of (the al- There exists a natural decomposition for homomorphism ready proved) MtsSum lemma (bright pink text). Note that proofs of tupled functions, that is, each component of the Mts calls Sum in its definition, and therefore the proof of tuple can be shown to be a homomorphism independently. correctness of the join operator for Mts has to assume Therefore, the proof argument can then be modularly con- the correctness of the join for Sum (which is exactly the structed for each state variable of the loop. The overall proof, HomomorphismSum lemma). This rule, namely, if value of u therefore, consists of simpler (to construct) proofs of the join depends on value of v then recall the homomorphism lemma operator producing the correct result for each state variable. for v in the proof of homomorphism for u, is generically ap- In our example, we use two Dafny lemmas accordingly. The plied in all constructed proofs. lemma in Figure 7(f) states correctness of join for Sum, and Our tool follows the simple rules illustrated by this ex- the one in Figure 7(g) states the correctness of join for Mts. ample, and generates proofs like the one illustrate in Figure Let us focus on HomomorphismSum lemma in Figure 7(f). 7 automatically once the synthesis task is finished. It is im- This lemma states that Sum is SumJoin-homomorphic (as portant to note that if the proof checks, then correctness is defined in Definition 3.4). Dafny cannot prove this lemma guaranteed. If it fails, then it can mean that either problem- automatically, if the body of the lemma is left empty. It can specific invariants were required, or the bounded synthesizer prove the lemma, however, if it is given guidance about how has synthesized a wrong solution. We did not have any in- to formulate the argument, in other words, if the body of the stances of failure in our benchmarks. Our proposed solution lemma (as illustrated in Figure 7(f)) is provided. in case of failure is to do a more extensive bounded check to discover counterexamples or acquire more coverage for test- Each homomorphism proof is an induction argument over the length of the input sequence(s). ing. A backup plan like this is required due to the boundaries of applicability of automated proof generation techniques. This fact is true independent of the specific function un- der consideration and enables us to devise a generic (induc- 8. Experiments tion) proof template that can be instantiated for each function and its corresponding join operator. Let us inspect this guid- Our parallelization technique is implemented in a prototype ance (i.e. the body of the lemma in Figure 7(f)) more closely. tool called PARSYNT, for which we report experimental The body basically tells Dafny that the proof is an induction results in this section. argument on the length of t (the second sequence argument), with the base case of an empty sequence (reflected in the if 8.1 Implementation condition). In the base case, Dafny should be aware of the We use CIL [36] to parse C programs, do basic program fact that an empty t simplifies the reasoning about s + t to analysis, and locate inner loops. The loop bodies are then reasoning about s. In the induction step (i.e. the else part), converted into functional form, from which a sketch and Dafny is guided to peel off the last element of t to get a the correctness specification is generated. ROSETTE [47] smaller instance for induction. is used as our backend solver, with a custom grammar for ae(niae y“”i h al) hnaxlayvari- auxiliary When table). the in is auxiliaries “–” discover by to (indicated parallelizable attempt made is no then benchmark form), a original if (in that 1- Note average). (about times on negligible The 2ms are time. synthesis total accumulator the auxiliary dominate for which synthesis, join in of Performance opeels.Tebnhak r l nC: in all are benchmarks includes The 1 list. Table complete approach. a our of effectiveness the evaluate Benchmarks. Evaluation 8.2 space search involved. smaller are operators a non-linear with when produce sketch for we join problem, constrained this more sidestep a To handle. to the solvers for the when bound especially contains also expressions, body We loop synthesizable solver. the the for of search size the the of reduce to space information type state use we 4, Section of in discussed narrowing using by the space to search the addition times generation/checking In proof m3-6Y30. and core expressions. synthesis dual Auxiliary Intel synthesizable auxiliaries. and necessary RAM 2 8G out 1 with finding laptop in cases.Hardware: succeeds all tool in *: negligible case. this in reported be can al 1. Table • • • • • Axrequired #Aux time? Synt Join required? Aux eemnsi uligi iil rmasuc naline a heights. in various source of do a buildings from other that visible of is elements building a the if predicate. determines all given a sequence satisfy a not of beginning the dropwhile 1. strings. two a between after distance seen hamming been has 0 1’s. a 0 which and in strings 0’s accepts that of 1’s. sequence (contiguous) a of in 1’s max-block-1 of sequences) ous count-1’s list. a C. from balanced-() in function integer) given to (string are standard coefficients the when corre- returned. the also is of it poly value defines that the position to the sum, addition sponding in where [43]) (from 43]. [34, from pearls and 2. mps Section in discussed lists. example smallest over functions standard min ∗ 1 ∗ , mxmmpexsum), prefix (maximum mss max sarglrepeso filter. expression regular a is optstevleo oyoila ie point given a at polynomial a of value the computes xeietlrslsfrpromneo P of performance for results Experimental , mxmmsgetsm r programming are sum) segment (maximum length fo akl)i a is Haskell) (from onstenme fbok ie contigu- (i.e. blocks of number the counts ecletdadvresto ecmrsto benchmarks of set diverse a collected We 1.6 no P – hcsi tigo rcesi balanced. is brackets of string a if checks eun h egho h aiu block maximum the of length the returns

AR sum o-ieroperators non-linear 2.0 no S , –

YNT min sum mps-p 0after1 left 1.7 no –

, max is-sorted and al it h ie spent times the lists 1 Table 22.9 no and – average mts line-sight right salnug recognizer language a is 2nd-min mts-p filter 1.5 no – hamming hamming mxmmti sum), tail (maximum oe ukon)as (unknowns) holes 1.6 hc r difficult are which , no – htrmvsfrom removes that

and length r h variations the are 6.1 no – stesecond the is average optsthe computes 2nd-min atoi AR fo [34]), (from yes 3.1 – S mps YNT sthe is yes 3.9 1

vralbnhak.Tmsaei eod.“”idctsta orlvn data relevant no that indicates “–” seconds. in are Times benchmarks. all over mts are 29.4 yes 2 mss 82.2 yes 1 rbe ihTB ti u oTBsshdln overhead scheduling known TBB’s a to is due cores is 32 It above TBB. performance well TBB’s with scaling of problem [14] not cores study that of A shown number cores. has the 32 on around linear to are up speedups It the input. that sequential clear original is the more than is iteration version per parallel expensive computation the auxiliary parallelization, 50k. for that at required set benchmarks input was is those size the for grain of the that and size Note elements The 2bn about programs. is arrays sequential input the over Ubuntu. 64-bit running total) RAM cores DL980 of (64 Proliant 256G processors and a X6550 Intel on eight-core done 8 with was G7 of code quality parallel the of generated Evaluation the task. mechanical simple a came solu- our R (from supports transforming tions It so platforms. parallelism, accom- different divide-and-conquer and across processors, portability available across modates tasks redistribut- parallel dynamically ing by scalability im- offers performance that library, proved C++ runtime popular syn- a we is TBB that imple- thesize. solutions to parallel library divide-and-conquer the Intel’s the as use ment [39] We (TBB) target). Blocks Building (our Thread parameters design many algorithmic on the quality depends beyond The implementation, programs. parallel a synthesized of the evaluate we of Here, performance one. optimized the be- the and difference version performance this negligible tween a however, redundant. was is, other There the dis- accumulators and than necessary, auxiliary other strictly two was the them covered of of one only any one, for this cannot For version we better programmers, a as produce and programs, synthesized Code all Synthesized the of Quality ae aigaoto ne sadtems xesv case expensive most the and 1s ( under or most with about proofs times, taking check synthesis cases join and the generate to compared to negligible in time is discussed the as Finally, conditional 6 6.1. Figure involving Section in rules illustrated those sophisticated than more statements for indicates need procedure sec- failed a the the discover of inspection to Manual fails one. tool ond auxiliaries the 2 but the discovered, of is 1 required case, this In succeeds. always synthesis (i.e. discov- exception were one many With how ered. reports 1 Table required, were ables

mss mts-p iue8ilsrtstesedp forprle solutions parallel our of speedups the illustrates 8 Figure 77.1 yes aigaot7s. about taking ) – mps-p 115.2 no – poly OSETTE yes 7.0 1 is-sorted 84.4 yes noaTBbsdipeetto be- implementation TBB-based a into ) 1 atoi yes 3.7 1 dropwhile 19.8 max-block-1 yes 1 balanced-() yes 1.4 emnal inspected manually We 2 0∗1∗ yes 3.0 1 count-1’s ,teauxiliary the ), yes 6.6 1 line-sight yes 7.6 1 0after1 0 yes 7.5 1 ∗ ∗

1 max-block-1 ∗ . Later, the theoretical ideas of [34] were extended to trees in [33], and generalized for lists in [31]. But these papers do not provide a practical way of producing parallel code. In [13], a study of how functions may be extended to homo- morphisms is presented, but no algorithm is provided.

Loop Parallelization More recently in [42], symbolic execution is used to iden- tify and break dependences in loops that are hard to paral- lelize. Since we produce correct parallel implementations, we do not have to incur the extra cost of symbolic execution at runtime. That said, the scope of applicability of [42] and our approach are not comparable. In the related area of dis- tributed computation, there has been research on producing MapReduce programs automatically, for example through using specific rewrite rules [41] or synthesis [44]. The most relevant and recent work is GraSSP [16] which uses synthe- sis to parallelize a reference sequential implementation by analyzing data dependencies. To the best of our knowledge, Figure 8. Speedups relative to the sequential implementation since the (constant size) prefix information used in [16] is only a special case of our auxiliary accumulators, our ap- and not due to a design problem in our produced parallel proach subsumes theirs. programs. We separately measured the overhead of TBB, by limiting the number of cores to 1. The slowdown (over sequential) is on average negligible; the average slowdown Synthesis and Concurrency Synthesis techniques have is close to 1, with a standard deviation of 0.04. been leveraged for the parallel programs before. Instances include synthesis of distributed map/reduce programs from input/output examples [44], optimization and parallelization of stencils [24, 45], concurrent data structures synthesis [46], 9. Related Work concurrency bug repair [49]. Other than use of synthesis, Homomorphisms and Parallelism Most closely related these problem areas and the solutions have very little in com- our work are the approaches that use homomorphisms for mon with this paper. parallelization. There has been previous attempts in using the derivation of list homomorphisms for parallelization, Parallelizing Compilers and Runtime Environments such as methods based on the third homomorphism theorem Automatic parallelization in compilers is a prolific field of [18, 20], those based on function composition [17], their less research, with source-to-source compilers using highly so- expressive/more practical variant based on matrix multipli- phisticated methods to parallelize generic code [50, 11, 5, cation [43], methods based on the quantifier elimination [32] 37] or more specialized nested loops with polyhedral opti- as well as those based on recurrence equations [8]. We will mization [6, 7, 48]. There is a body of work specific to reduc- discuss the most closely related one here. tions and parallel-prefix computations [10, 22, 27] that deal In [34], the third homomorphism theorem and the con- with dependencies that cannot be broken. Breaking static de- struction of the weak right inverse are used to derive paral- pendencies at runtime, Galois [40] being a good example lel programs. The approach in [34] requires much more in- of this category, is another type of approaching the auto- formation from the programmer compared to our approach. parallelization problem. Handling irregular reductions, when The programmer needs to provide the leftwards and right- the operations in the loop body are not immediately associa- wards sequential implementations of a function to get the tive, has been explored through employing techniques such parallel one. This is often unreasonable, specially when the as data replication or synchronization [23]. Unfortunately, parallel version has a twist, since coming up with the left- there is not enough space to do the mountain of research wards implementation of a function could be as complex on parallelizing compilers justice here. In contrast to correct (and time consuming) as parallelizing it in the first place. source-to-source transformation achieved through provably Intuitively, by providing the reverse computation, the pro- correct program transformation rules, the aim of this paper grammer is providing the information that we compute au- is to use search (in the style of synthesis) in a space that in- tomatically here in Section 6. By contrast, we only need one cludes many incorrect programs. This facilitates discovery (reference) sequential implementation as input. In Section of equivalent parallel implementations that are not reachable 4.3, we make a more technical comparison against [34]. through generally correct program transformations. References [15]F ARZAN,A., AND NICOLET, V. Automated synthesis of divide and conquer parallelism. [1]A LON,N.,MATIAS, Y., AND SZEGEDY, M. The space complexity of approximating the frequency moments. In [16]F EDYUKOVICH,G.,MAAZ BIN SAFEER,A., AND BODIK, Proceedings of the Twenty-eighth Annual ACM Symposium R. Gradual synthesis for static parallelization of single-pass on Theory of Computing (1996), STOC ’96, pp. 20–29. array-processing programs. In PLDI (2017). [2]A LUR,R.,BOD´IK,R.,DALLAL,E.,FISMAN,D.,GARG, [17]F ISHER,A.L., AND GHULOUM, A. M. Parallelizing com- P., JUNIWAL,G.,KRESS-GAZIT,H.,MADHUSUDAN, P., plex scans and reductions. In Proceedings of the ACM SIG- MARTIN,M.M.K.,RAGHOTHAMAN,M.,SAHA,S.,SE- PLAN 1994 Conference on Programming Language Design SHIA,S.A.,SINGH,R.,SOLAR-LEZAMA,A.,TORLAK, and Implementation (1994), PLDI ’94, pp. 135–146. AND DUPA Depend- E., U , A. Syntax-guided synthesis. In [18]G ESER,A., AND GORLATCH, S. Parallelizing functional able Software Systems Engineering. 2015, pp. 1–25. programs by generalization. In Proceedings of the 6th Inter- [3]A PPEL, A. W. SSA is . SIGPLAN national Joint Conference on Algebraic and Logic Program- Not. 33, 4 (Apr. 1998), 17–20. ming (1997), ALP ’97-HOA ’97, pp. 46–60. [4]B ABCOCK,B.,BABU,S.,DATAR,M.,MOTWANI,R., AND [19]G IBBONS, J. The third homomorphism theorem. J. Funct. WIDOM, J. Models and issues in data stream systems. In Pro- Program. 6, 4 (1996), 657–665. ceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART [20]G ORLATCH, S. Systematic extraction and implementation Symposium on Principles of Database Systems (2002), PODS of divide-and-conquer parallelism. In Proceedings of the ’02, pp. 1–16. 8th International Symposium on Programming Languages: [5]B ACON, D. F., GRAHAM,S.L., AND SHARP, O. J. Com- Implementations, Logics, and Programs (1996), PLILP ’96, piler transformations for high-performance computing. ACM pp. 274–288. Comput. Surv. 26 , 4 (Dec. 1994), 345–420. [21]G ORLATCH, S. Extracting and implementing list homomor- [6]B ASTOUL, C. Efficient code generation for automatic par- phisms in parallel program development. Sci. Comput. Pro- allelization and optimization. In Proceedings of the Second gram. 33, 1 (Jan. 1999), 1–27. International Conference on Parallel and Distributed Com- [22]H ILLIS, W. D., AND STEELE JR, G. L. Data parallel algo- puting (2003), ISPDC’03, pp. 23–30. rithms. Communications of the ACM 29, 12 (1986), 1170– [7]B ASTOUL, C. Code generation in the polyhedral model is 1183. Proceedings of the 13th Interna- easier than you think. In [23]H WANSOO,H., AND CHAU-WEN, T. A comparison of tional Conference on Parallel Architectures and Compilation parallelization techniques for irregular reductions. In Parallel Techniques (2004), PACT ’04, pp. 7–16. and Distributed Processing Symposium., Proceedings 15th [8]B EN-ASHER, Y., AND HABER, G. Parallel solutions of International (2001), p. 27. IEEE Trans. Parallel simple indexed recurrence equations. [24]K AMIL,S.,CHEUNG,A.,ITZHAKY,S., AND SOLAR- Distrib. Syst. 12 , 1 (Jan. 2001), 22–37. LEZAMA, A. Verified lifting of stencil computations. In Pro- [9]B IRD, R. S. An introduction to the theory of lists. In Pro- ceedings of the 37th ACM SIGPLAN Conference on Program- ceedings of the NATO Advanced Study Institute on Logic of ming Language Design and Implementation (2016), PLDI Programming and Calculi of Discrete Design (1987), pp. 5– ’16, pp. 711–726. 42. [25]K EJARIWAL,A.,D’ALBERTO, P., NICOLAU,A., AND [10]B LELLOCH, G. E. Prefix sums and their applications. POLYCHRONOPOULOS, C. D. A geometric approach for par- titioning n-dimensional non-rectangular iteration spaces. In [11]B LUME, W., DOALLO,R.,EIGENMANN,R.,GROUT,J., Proceedings of the 17th International Conference on Lan- HOEFLINGER,J.,LAWRENCE, T., LEE,J.,PADUA,D., guages and Compilers for High Performance Computing PAEK, Y., POTTENGER,B.,RAUCHWERGER,L., AND TU, P. Parallel programming with Polaris. Computer 29, 12 (Dec. (2005), LCPC’04, pp. 102–116. 1996), 78–82. [26]K ELSEY, R. A. A correspondence between continuation passing style and static single assignment form. In Papers [12]B OONE, W. W. The word problem. Proceedings of the National Academy of Sciences of the United States of America from the 1995 ACM SIGPLAN Workshop on Intermediate 44 (1958), 1061–1065. Representations (1995), IR ’95, pp. 13–22. [27]L ADNER,R.E., AND FISCHER, M. J. Parallel prefix com- [13]C HIN, W.-N., TAKANO,A., AND HU, Z. Parallelization via context preservation. In Proceedings of the 1998 Inter- putation. Journal of the ACM (JACM) 27, 4 (1980), 831–838. national Conference on Computer Languages (1998), ICCL [28]L EINO, K. R. M. Dafny: An automatic program verifier for ’98, pp. 153–162. functional correctness. In Logic for Programming, Artificial Intelligence, and Reasoning (LPAR) (2010), pp. 348–370. [14]C ONTRERAS,G., AND MARTONOSI, M. Characterizing and improving the performance of Intel Threading Build- [29]M ARCHE´, C. Normalized rewriting: an alternative to rewrit- ing Blocks. In 4th International Symposium on Workload ing modulo a set of equations. Journal of Symbolic Compu- Characterization (IISWC 2008), Seattle, Washington, USA, tation 21, 3 (1996), 253–288. September 14-16, 2008 (2008), pp. 57–66. [30]M ARCHE´,C., AND URBAIN,X. Termination of associative- commutative rewriting by dependency pairs. Springer Berlin ings of the 2014 ACM International Conference on Object Heidelberg, Berlin, Heidelberg, 1998, pp. 241–255. Oriented Programming Systems Languages & Applications (2014), OOPSLA ’14, pp. 909–927. [31]M ORIHATA, A. A short cut to parallelization theorems. In ACM SIGPLAN International Conference on Functional [42]R AYCHEV, V., MUSUVATHI,M., AND MYTKOWICZ, T. Par- Programming, ICFP’13, Boston, MA, USA - September 25 - allelizing user-defined aggregations using symbolic execu- 27, 2013 (2013), pp. 245–256. tion. In Proceedings of the 25th Symposium on Operating Systems Principles (2015), SOSP ’15, pp. 153–167. [32]M ORIHATA,A., AND MATSUZAKI, K. Automatic paral- lelization of recursive functions using quantifier elimination. [43]S ATO,S., AND IWASAKI, H. Automatic parallelization via In Functional and Logic Programming, 10th International matrix multiplication. SIGPLAN Not. 46, 6 (June 2011), 470– Symposium, FLOPS 2010, Sendai, Japan, April 19-21, 2010. 479. Proceedings (2010), pp. 321–336. [44]S MITH,C., AND ALBARGHOUTHI, A. Mapreduce program [33]M ORIHATA,A.,MATSUZAKI,K.,HU,Z., AND TAKEICHI, M. The third homomorphism theorem on trees: downward & synthesis. SIGPLAN Not. 51, 6 (June 2016), 326–340. upward lead to divide-and-conquer. In Proceedings of the [45]S OLAR-LEZAMA,A.,ARNOLD,G.,TANCAU,L.,BODIK, 36th ACM SIGPLAN-SIGACT Symposium on Principles of R.,SARASWAT, V., AND SESHIA, S. Sketching stencils. Programming Languages, POPL 2009, Savannah, GA, USA, SIGPLAN Not. 42, 6 (June 2007), 167–178. January 21-23, 2009 (2009), pp. 177–185. [46]S OLAR-LEZAMA,A.,JONES,C.G., AND BODIK,R. [34]M ORITA,K.,MORIHATA,A.,MATSUZAKI,K.,HU,Z., Sketching concurrent data structures. In Proceedings of the AND TAKEICHI, M. Automatic inversion generates divide- 29th ACM SIGPLAN Conference on Programming Language and-conquer parallel programs. In Proceedings of the 28th Design and Implementation (2008), PLDI ’08, pp. 136–148. ACM SIGPLAN Conference on Programming Language De- [47]T ORLAK,E., AND BOD´IK, R. Growing solver-aided lan- sign and Implementation (2007), PLDI ’07, pp. 146–155. guages with rosette. In ACM Symposium on New Ideas in [35]N ARENDRAN, P., AND RUSINOWITCH,M. Any ground Programming and Reflections on Software, Onward! 2013, associative-commutative theory has a finite canonical sys- part of SPLASH ’13, Indianapolis, IN, USA, October 26-31, tem. Springer Berlin Heidelberg, Berlin, Heidelberg, 1991, 2013 (2013), pp. 135–152. pp. 423–434. [48]V ASILACHE,N.,BASTOUL,C., AND COHEN, A. Polyhe- [36]N ECULA,G.C.,MCPEAK,S.,RAHUL, S. P., AND dral code generation in the real world. In Proceedings of WEIMER, W. CIL: Intermediate Language and Tools for the 15th International Conference on Compiler Construction Analysis and Transformation of C Programs. 2002. (2006), CC’06, pp. 185–201. [37]P ADUA,D.A., AND WOLFE, M. J. Advanced compiler [49] Cˇ ERNY´ , P., HENZINGER, T. A., RADHAKRISHNA,A., optimizations for supercomputers. Commun. ACM 29, 12 RYZHYK,L., AND TARRACH, T. Regression-free synthe- (Dec. 1986), 1184–1201. sis for concurrency. In Proceedings of the 16th International [38]P APADIMITRIOU,C.H., AND SIPSER, M. Communication Conference on Computer Aided Verification - Volume 8559 complexity. J. Comput. Syst. Sci. 28, 2 (1984), 260–269. (2014), pp. 568–584. [39]P HEATT, C. Intel® threading building blocks. Journal of [50]W ILSON,R.,FRENCH,R.,WILSON,C.,AMARASINGHE, Computing Sciences in Colleges 23, 4 (2008), 298–298. S.,ANDERSON,J.,TJIANG,S.,LIAO,S.,TSENG,C., HALL,M.,LAM,M., AND HENNESSY, J. The suif com- [40]P INGALI,K.,NGUYEN,D.,KULKARNI,M.,BURTSCHER, piler system: A parallelizing and optimizing research com- M.,HASSAAN,M.A.,KALEEM,R.,LEE, T.-H., piler. Tech. rep., Stanford, CA, USA, 1994. LENHARTH,A.,MANEVICH,R.,MENDEZ´ -LOJO,M., PROUNTZOS,D., AND SUI, X. The tao of parallelism in [51]Z UCK,L.D.,PNUELI,A.,FANG, Y., GOLDBERG,B., AND algorithms. SIGPLAN Not. 46, 6 (June 2011), 12–25. HU, Y. Translation and run-time validation of optimized code. Electr. Notes Theor. Comput. Sci. 70, 4 (2002), 179– [41]R ADOI,C.,FINK,S.J.,RABBAH,R., AND SRIDHARAN, 200. M. Translating imperative code to mapreduce. In Proceed-