Synthesis of Divide and Conquer Parallelism for Loops ∗

Azadeh Farzan and Victor Nicolet † University of Toronto, Canada

Abstract

Divide-and-conquer is a common parallel programming skeleton supported by many cross-platform multithreaded libraries, and it is the skeleton most commonly used by programmers for parallelization. The challenges of producing (manually or automatically) a correct divide-and-conquer parallel program from given sequential code are two-fold: (1) assuming that a good solution exists in which individual worker threads execute code identical to the sequential code, the programmer has to provide the extra code for dividing the tasks and combining the partial results (i.e., joins), and (2) the sequential code may not be suitable for divide-and-conquer parallelization as is, and may need to be modified to become part of a good solution. We address both challenges in this paper. We present an automated synthesis technique that produces correct joins, and an algorithm for modifying the sequential code to make it suitable for parallelization when modifications are necessary. We focus on a class of loops that traverse a read-only collection and compute a scalar function over that collection. We present theoretical results for when the necessary modifications to sequential code are possible, theoretical guarantees for the algorithmic solutions presented here, and an experimental evaluation of the approach's success in practice and of the quality of the produced parallel programs.

CCS Concepts • Computing methodologies → Parallel programming languages; • Theory of computation → Program verification; Parallel computing models

Keywords Divide and Conquer Parallelism, Program Synthesis, Homomorphisms

∗ An extended version of this paper, including proofs of theorems, can be found at www.cs.toronto.edu/~azadeh/papers/pldi17-ex.pdf
† Funded by Ontario Ministry of Research's Early Researcher Award and NSERC Discovery Accelerator Supplement Award.

1. Introduction

Despite big advances in optimizing and parallelizing compilers, correct and efficient parallel code is often hand-crafted in a difficult and error-prone process. The introduction of libraries like Intel's TBB [39] was motivated by this difficulty, with the aim of making the task of writing well-performing parallel programs easier for the average programmer. These libraries offer efficient implementations of commonly used parallel skeletons, which makes it easier for the programmer to write code in the style of a given skeleton without having to make special considerations for important performance factors like scalability of memory allocation or task scheduling. Divide-and-conquer parallelism is the most commonly used of these skeletons.

Consider the function sum that returns the sum of the elements of an array of integers. The following sequential loop computes this function:

    sum = 0;
    for (i = 0; i < |s|; i++) {
      sum = sum + s[i];
    }

To compute sum in the style of divide-and-conquer parallelism, the computation should be structured as illustrated in Figure 1: the array s of length |s| is partitioned into n+1 chunks s[0..k1], s[k1+1..k2], ..., s[kn+1..|s|-1], and sum is computed individually for each chunk.

[Figure 1. Divide and conquer style computation of sum.]

The results of these partial sum computations are joined (operator ⊙) into results for the combined chunks at each intermediate step, with the join at the root of the tree returning the result of sum for the entire array. The burden of a correct design is to come up with a correct implementation of the join operator. In this example, it is easy to quickly observe that the join simply has to return the sum of the two partial results.
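To make the skeleton concrete, here is a minimal C++ sketch of the computation of Figure 1 (our own illustration, not output of the tool presented in this paper): the leaves run the unmodified sequential loop, the join operator ⊙ is integer addition, and the chunking cutoff of 1024 is an arbitrary choice.

    #include <future>
    #include <vector>

    // Leaf computation: the unmodified sequential loop for sum.
    long sum_seq(const std::vector<int>& s, size_t lo, size_t hi) {
      long sum = 0;
      for (size_t i = lo; i < hi; i++) sum = sum + s[i];
      return sum;
    }

    // Divide-and-conquer: split the range, recurse in parallel, join with +.
    long sum_dc(const std::vector<int>& s, size_t lo, size_t hi) {
      if (hi - lo <= 1024) return sum_seq(s, lo, hi);   // arbitrary cutoff
      size_t mid = lo + (hi - lo) / 2;
      auto left = std::async(std::launch::async, sum_dc, std::cref(s), lo, mid);
      long right = sum_dc(s, mid, hi);
      return left.get() + right;   // join: sum(x . y) = sum(x) + sum(y)
    }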

In general, it can be tricky to reformulate an arbitrary sequential computation in this style. Recent advances in program synthesis [2] demonstrate the power of synthesis in producing non-trivial computations. A natural question to ask is whether this power can be leveraged for this problem.

In this paper, we focus on a class of divide-and-conquer parallel programs that operate on sequences (lists, arrays, or in general any collection type with a linear traversal iterator) in which the divide operator is assumed to be the inverse of the default sequence concatenation operator (i.e., divide s into s1 and s2 such that s = s1 • s2). In Section 4, we discuss how we use syntax-guided synthesis (SyGuS) [2] efficiently to synthesize the join operators. Moreover, in Section 7, we discuss how proofs of correctness for the synthesized join operations can be automatically generated and checked. This addresses a challenge that many SyGuS schemes seem to bypass: they only guarantee that the synthesized artifact is correct for the set of examples used in the synthesis process (or for boundedly many input instances), and do not provide a proof of correctness for the entire (infinite) data domain of program inputs.

A general divide-and-conquer parallel solution is not always as simple as the diagram in Figure 1. Consider the function is-sorted(s), which returns true if an array is sorted, and false otherwise. Providing the partial computation results, a boolean value in this case, from both sides of a join will not suffice: if both sub-arrays are sorted, the join cannot make a decision about the sortedness of the concatenated array. In other words, a join cannot be defined solely in terms of the sortedness of the subarrays.

To a human programmer, it is clear that the join requires the last element of the first subarray and the first element of the second subarray to connect the dots. The extra information in this case is as simple as remembering two extra values. But, as we demonstrate with another example in Section 2, the extra information required by the join may need to be computed by the worker threads in order to be available at join time. Intuitively, this means that the worker threads have to be modified (compared to the sequential code) to compute this extra information in order to guarantee the existence of a join operator. We call this modification of the code lifting¹, for short, after the identical standard mathematical concept: the sequential loop, computing outputs O from inputs I, is the projection of a lifted, parallelizable loop that additionally computes the extra (auxiliary) information A.

¹ Not to be confused with lambda lifting or the informal use of the term in [24].

The necessity of lifting in some cases raises two questions: (1) does such a lifting always exist? And (2) can the overhead of lifting and the accompanying join overshadow the performance gains from parallelization? In Section 5, we address both questions. The challenge for automation is to modify the original sequential loop so that it computes enough additional information that (i) a join exists, (ii) the join is efficient, and (iii) the overhead of lifting is not unreasonably high. We lay the theoretical foundations to answer these questions, and in Section 6, we present an algorithm that produces this lifting automatically and satisfies all the aforementioned criteria. In summary, the paper makes the following contributions:

• We present an algorithm to synthesize a join for divide-and-conquer parallelism when one exists. Moreover, these joins are accompanied by automatically generated machine-checked proofs of correctness (Sections 4, 7).
• We present an algorithm for automatic lifting of non-parallelizable loops (where a join does not exist) to transform them into parallelizable ones (Section 6).
• We lay the theoretical foundations for when a loop can be efficiently parallelized (divide-and-conquer style), and explore when an efficient lifting exists and when it can be automatically discovered (Sections 5, 6).
• We built a tool, PARSYNT, and present experimental results that demonstrate the efficacy of the approach and the efficiency of the produced parallel code (Section 8).
2. Overview

We start by presenting an overview of our contributions by way of two examples that demonstrate the challenges of converting a sequential computation to divide-and-conquer parallelism and that help us illustrate our solution. We use sequences as linear collections that abstract any collection data type with a linear traversal iterator.

Second Smallest. Consider the following loop implementation of the function min2, which returns the second smallest element of a sequence:

    m = MAX_INT;
    m2 = MAX_INT;
    for (i = 0; i < |s|; i++) {
      m2 = min(m2, max(m, s[i]));
      m = min(m, s[i]);
    }

Figure 2. Second Smallest

Functions min and max are used for brevity and can be replaced by their standard definitions min(a, b) = a < b ? a : b and max(a, b) = a > b ? a : b. Here, m and m2 keep track of the smallest and the second smallest elements of the sequence, respectively.

For a novice devising a join for a divide-and-conquer parallelization of this example, it is easy to make the mistake of using the incorrect updates

    m = min(ml, mr)
    m2 = min(m2l, m2r)

(a significant percentage of undergraduate students in an elementary algorithms class routinely make this mistake when given this exercise). The correct join operator computes the combined smallest and second smallest of two subsequences according to the following two equations, where the l and r subscripts distinguish the values coming from the left and the right subsequences (respectively):

    m = min(ml, mr)
    m2 = min(min(m2l, m2r), max(ml, mr))

We use syntax-guided synthesis to synthesize correct join operators for sequential loops like this one. We use a template based on the code of the loop body, with strategically introduced unknowns, to define the state space of synthesis, and we use homomorphisms to formulate a sufficient correctness specification for synthesis (more on this in Section 4) that lends itself to efficient synthesis and facilitates automatic generation of correctness proofs for the synthesized code (more on this in Section 7).
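In C++ terms (an illustrative sketch of ours; the tool itself operates on the recurrence equations of the loop, not on C++), the chunk computation and the synthesized join for this example are:

    #include <algorithm>
    #include <climits>
    #include <vector>

    struct MinState { int m; int m2; };   // smallest and second smallest so far

    // The sequential loop of Figure 2, run on one chunk.
    MinState min2_seq(const std::vector<int>& s, size_t lo, size_t hi) {
      MinState st{INT_MAX, INT_MAX};
      for (size_t i = lo; i < hi; i++) {
        st.m2 = std::min(st.m2, std::max(st.m, s[i]));
        st.m  = std::min(st.m, s[i]);
      }
      return st;
    }

    // The correct join: note the cross term max(l.m, r.m), which the
    // naive join m2 = min(m2_l, m2_r) misses.
    MinState min2_join(MinState l, MinState r) {
      return { std::min(l.m, r.m),
               std::min(std::min(l.m2, r.m2), std::max(l.m, r.m)) };
    }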

Maximum Tail Sum. Consider the function mts that returns the maximum suffix sum of a sequence of integers (positive and negative). For example, mts([1, -2, 3, -1, 3]) = 5, which is the sum of the suffix [3, -1, 3]. The following loop implements mts:

    mts = 0;
    for (i = 0; i < |s|; i++) {
      mts = max(mts + s[i], 0);
    }

To parallelize this loop, the programmer needs to come up with a correct join operator. In this case, even the most clever programmer would be at a loss, since no correct join operator exists. Consider the partial sequence [1, 3], when joined with each of two different partial sequences [-2, 5] and [0, 5]. We have:

    mts([1, 3]) = 4            mts([1, 3]) = 4
    mts([-2, 5]) = 5           mts([0, 5]) = 5
    mts([1, 3, -2, 5]) = 7     mts([1, 3, 0, 5]) = 9

The values of mts for both pairs of partial sequences are the same, yet the values of mts for the two concatenations are different. If a join function ⊙ existed such that mts(s • t) = mts(s) ⊙ mts(t), then this function would have to produce two different results for 4 ⊙ 5 in the two instances above. In other words, the mts value of the concatenation is not computable solely from the mts values of the partial sequences. What is to be done?

At this point, a clever programmer observes that the loop is not parallelizable in its original form. She discovers that beyond knowing mts([1, 3]) = 4 and mts([-2, 5]) = 5, she needs to know that sum([-2, 5]) = 3 in order to conclude that mts([1, 3, -2, 5]) = 7. Consider, then, the following modification of the sequential loop for mts, with an additional loop variable, sum, that records the sum of all the sequence elements:

    mts = 0;
    sum = 0;
    for (i = 0; i < |s|; i++) {
      mts = max(mts + s[i], 0);
      sum = sum + s[i];
    }

In a sequential loop that computes only mts, this is a redundant computation. We call sum an auxiliary accumulator, and the new loop a lifting of the sequential loop. The lifted code can now be parallelized using the join

    sum = suml + sumr
    mts = max(mtsr, sumr + mtsl)

For loops like this, the burden on the programmer is more than just coming up with the correct join implementation: she has to modify the original sequential code to make it parallelizable, and then devise the correct join for the lifted sequential loop. The algorithm presented in Section 6 does exactly that.
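As with the previous example, the lifted chunk computation and its join can be rendered in C++ (again an illustrative sketch of ours rather than tool output):

    #include <algorithm>
    #include <vector>

    struct MtsState { long mts; long sum; };   // lifted state: mts plus auxiliary sum

    // The lifted sequential loop, run on one chunk.
    MtsState mts_seq(const std::vector<int>& s, size_t lo, size_t hi) {
      MtsState st{0, 0};
      for (size_t i = lo; i < hi; i++) {
        st.mts = std::max(st.mts + s[i], 0L);
        st.sum = st.sum + s[i];
      }
      return st;
    }

    // Join: a maximal suffix of x . y either lies entirely inside y (mts_r),
    // or is a suffix of x extended by all of y (mts_l + sum_r).
    MtsState mts_join(MtsState l, MtsState r) {
      return { std::max(r.mts, l.mts + r.sum), l.sum + r.sum };
    }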

Let us illustrate how the algorithm discovers the auxiliary accumulator sum. Consider the general case of computing mts(s) sequentially, where sequence s is of length n. Imagine a point in the middle of the computation, when the sequence s has been processed up to and including index i, computing the value mts_i. The following recurrence-like equations represent the first two unfoldings of the sequential loop starting from index i+1:

    mts_{i+1} = max(mts_i + s[i+1], 0)                         (1)
    mts_{i+2} = max(mts_{i+1} + s[i+2], 0)
              = max(max(mts_i + s[i+1], 0) + s[i+2], 0)        (2)

When the (unfolded) computation above starts, the initial value mts_i is not available to it, since it is being computed by a different thread simultaneously. The challenge is to perform the computation above without having the value of mts_i at the beginning, and then adjust the computed value accordingly once the value becomes available (at join time).

Looking at Equation 1, if the value of s[i+1] is known to the join operator, then it can plug the values mts_i, s[i+1], and 0 (a program constant) into the expression on the righthand side and compute the correct value of mts_{i+1}. At this stage, our algorithm (presented in Section 6) conjectures s[i+1] as an auxiliary value and + as its accumulator, since it is actively looking for accumulators to learn, and + appears next to s[i+1] in the expression.

Once an auxiliary accumulator is conjectured, the algorithm turns its attention to the next unfolding, namely Equation 2, to test its conjecture. Here, the unknown mts_i sits one level deeper in the expression. This means that the join has to perform more steps to compute the value of mts_{i+2}, and it is easy to see that the depth of mts_i increases linearly for mts_{i+3} and so on. At this point, the tool decides to use a set of standard algebraic equalities to rewrite the righthand side of Equation 2 into a different expression in which mts_i appears at a lower depth, following the insight that the closer the unknown is to the root, the less work is left to the join operator. The step-by-step rewriting of the expression proceeds as follows:

    mts_{i+2} = max(max(mts_i + s[i+1], 0) + s[i+2], 0)
              = max(max(max(mts_i + s[i+1], s[i+1]), 0) + s[i+2], 0)
              = max(max(max(mts_i + s[i+1] + s[i+2],
                            s[i+1] + s[i+2]), s[i+2]), 0)
              = max(mts_i + s[i+1] + s[i+2],
                    max(s[i+1] + s[i+2], s[i+2], 0))
              = max(mts_i + s[i+1] + s[i+2], mts(s[i+1..i+2]))

The structure of the expression in the last line provides a clue to how the parallel computation can be organized. While one thread is computing mts_i (i.e., mts(s[0..i])), the second thread computes mts(s[i+1..i+2]). To compute the overall value of mts(s[0..i+2]), the join needs the partial results mts_i and mts(s[i+1..i+2]), and the value of s[i+1] + s[i+2]. The algorithm decides at this point that s[i+1] + s[i+2] is the auxiliary value that needs to be computed, and realizes that the accumulator conjectured in the previous round already makes this information available. Therefore, not having seen anything new, it stops and concludes that the auxiliary accumulator sum = sum + s[k] (where k is the current iteration number) is the new auxiliary computation required to make the loop parallelizable.

3. Background and Notation

This section introduces the notation used in the remainder of the paper. It includes definitions of some new concepts, as well as some already known in the literature.

3.1 Sequences and Functions

We assume a generic type Sc that refers to any scalar type used in typical programming languages, such as int, float, char, and bool, whenever the specific type is not important in the context. The significance of the type is that scalars are assumed to be of constant size, and conversely, any constant-size representable data type is assumed to be scalar in this paper. Moreover, we assume all operations on scalars to have constant-time complexity. The type S denotes the set of all sequences of elements of type Sc.
For any sequence x, we use |x| to denote the length of the sequence. x[i] (for 0 ≤ i < |x|) denotes the element of the sequence at index i, and x[i..j] denotes the subsequence between indexes i and j (inclusive). The concatenation operator • : S × S → S is defined over sequences in the standard way, and is associative. The sequence type stands in for many collection types such as arrays, lists, or any other collection that admits a linear iterator and an associative composition operator (i.e., concatenation).

In this paper, our focus is on single-pass computable functions on sequences (from sequences of scalars to scalars). The assumption is that a sequential implementation of the function is given as an imperative sequential loop. Below, we formally define what a single-pass computable function is.

Definition 3.1. A function h : S → D is called rightwards iff there exists a binary operator ⊕ : D × Sc → D such that for all x ∈ S and a ∈ Sc, we have h(x • [a]) = h(x) ⊕ a.

Note that the notion of associativity for ⊕ is not well-defined, since it is not a binary operation defined over a single set (the two arguments of the operator have different types).

Definition 3.2. A function h : S → D is called leftwards iff there exists a binary operator ⊗ : Sc × D → D such that for all x ∈ S and a ∈ Sc, we have h([a] • x) = a ⊗ h(x).

Associativity of ⊗ is likewise not well-defined.

Definition 3.3 (Single-Pass Computable). A function h : S → D is single-pass computable iff it is rightwards or leftwards.

3.2 Homomorphisms

Homomorphisms are a well-studied class of mathematical functions. In this paper, we focus on a special class of homomorphisms, where the source structure is the set of sequences with the standard concatenation operator.

Definition 3.4. A function h : S → D is called ⊙-homomorphic for a binary operator ⊙ : D × D → D iff for all sequences x, y ∈ S we have h(x • y) = h(x) ⊙ h(y).

Note that, even though it is not explicitly stated in the definition above, ⊙ is necessarily associative on D, since concatenation is associative (over sequences). Moreover, h([]) (where [] is the empty sequence) is the unit of ⊙, because [] is the unit of concatenation. If ⊙ has no unit, that means h([]) is undefined. For example, the function head(x), which returns the first element of a sequence, is not defined for an empty list. head(x) is ⊛-homomorphic, where a ⊛ b = a (for all a, b), but ⊛ has no left unit element.
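To ground Definition 3.4, the following self-contained C++ snippet (our illustration, not part of the paper's formalism) checks the homomorphism equation on sample inputs for h = sum with ⊙ = +, and for h = head with the join a ⊛ b = a:

    #include <cassert>
    #include <vector>

    int sum(const std::vector<int>& x) {
      int r = 0;
      for (int a : x) r += a;
      return r;
    }

    std::vector<int> concat(std::vector<int> x, const std::vector<int>& y) {
      x.insert(x.end(), y.begin(), y.end());
      return x;
    }

    int main() {
      std::vector<int> x{1, -2, 3}, y{4, 5};
      // sum is (+)-homomorphic: sum(x . y) == sum(x) + sum(y), unit sum([]) = 0.
      assert(sum(concat(x, y)) == sum(x) + sum(y));
      // head is homomorphic for the join a (*) b = a; that join has no left
      // unit, matching the fact that head([]) is undefined.
      auto head = [](const std::vector<int>& v) { return v.front(); };
      assert(head(concat(x, y)) == head(x));
      return 0;
    }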
3.3 Programs and Loops

For the presentation of the results in this paper, we assume that our sequential program is written in a simple imperative language with basic constructs for branching and looping. We assume that the language includes scalar types int and bool, and a collection type seq. The syntax of our input programs is illustrated in Figure 3.

    e ∈ Exp     ::= e ○ e'                 e, e' ∈ Exp
                  | x | k                  x ∈ Var, k ∈ Z, Q, R
                  | s[e]                   s ∈ SeVar, e ∈ Exp
                  | if be then e else e'
    be ∈ BExp   ::= e ⊲ e'                 e, e' ∈ Exp
                  | be ⋏ be' | ¬be         be, be' ∈ BExp
                  | t[e]                   t ∈ BSVar, e ∈ Exp
                  | true | false
    c ∈ Program ::= c; c'                  c, c' ∈ Program
                  | x := e                 x ∈ Var, e ∈ Exp
                  | b := be                b ∈ BVar, be ∈ BExp
                  | if(be){c}else{c'}      be ∈ BExp, c, c' ∈ Program
                  | for(i ∈ I){c}          i ∈ Iterator

Figure 3. Program Syntax. The binary ○ operator represents any arithmetic operation (+, -, *, /), the ⊲ operator represents any comparator (<, ≤, >, ≥, =, ≠), I is an iteration domain, and the ⋏ operator represents any boolean operation (∧, ∨).

We forego a semantic definition since it is standard and intuitively clear. For readability, we use a simple iterator and an integer index (instead of the generic i ∈ I), and use the standard array random access notation a[i] to refer to each element of a collection type. In principle, any collection with an iterator and a split function (implementing the inverse of concatenation) works. There has been a lot of research on iteration spaces and iterators (e.g., [51] in the context of translation validation and [25] in the context of partitioning) that formalizes complex traversals by abstract iterators.

State and Input Variables. Every variable that appears on the lefthand side of an assignment statement in a loop body is called a state variable, and the set of state variables is denoted by SVar. Every other variable is an input variable, and the set of input variables is denoted by IVar. The sequence that is being read by the loop is considered an input variable, since it is only read and not modified. Note that SVar ∩ IVar = ∅.

Model of a Loop Body. We introduce a formal model for the body of a loop that has no other loops nested inside it. The loop body, therefore, consists of assignment and conditional statements only. Let SVar = {s1, . . . , sn}. The body is modelled by a system of (ordered) recurrence equations E = ⟨E1, . . . , En⟩, where each equation Ei is of the form si = expi(SVar, IVar).

Remark 3.5. Every non-nested loop in the program model of Figure 3 can be modelled by a system of recurrence equations (as defined above).

This can be achieved through a translation of the loop body from imperative form to functional form. In [15], we include a description of how this can be done, and also refer the interested reader to [3, 26] for a more complete explanation. The essence of this translation, which works by transforming conditional statements into conditional expressions, is illustrated in the example below.

Example 3.6. Consider the long version of the second smallest example from Section 2, where the auxiliary functions min and max are replaced with their definitions as conditional statements, so that the code complies with the syntax in Figure 3:

    m = MAX_INT;
    m2 = MAX_INT;
    for (i = 0; i < |s|; i++) {
      if m > s[i] then
        { if m2 > m then m2 = m; }
      else
        { if m2 > s[i] then m2 = s[i]; }
      if m > s[i] then m = s[i];
    }

The loop body is then modelled by the recurrence equations

    m2 = if m2 < (if m > s[i] then m else s[i])
         then m2
         else (if m > s[i] then m else s[i]);
    m  = if m < s[i] then m else s[i];

obtained by transforming the conditional statements above into assignments that use conditional expressions (if be then e else e').

4. Synthesis of Join

The premise of this section is that the sequential loop under consideration is parallelizable, in the sense that a join operation for divide-and-conquer parallelism exists. We use a system of recurrence equations E in the style of Section 3.3 to model the body of a loop for all formal and algorithmic developments in this paper, and we therefore use the term loop body to refer to E whenever appropriate.

4.1 Parallelizable Loops

We start by formally defining when a loop is parallelizable. This is intuitively related to when the computation performed by the loop body is a homomorphic function. First, we define a function fE that models E (the body of the loop). The reader who is not interested in the details can fast-forward to Definition 4.1 with an intuitive understanding of function fE.

Let the system of recurrence equations E = ⟨s1 = exp1(SVar, IVar), . . . , sn = expn(SVar, IVar)⟩ represent the body of a loop. Let IVar = {σ, ι, i1, . . . , ik}, where σ is the sequence variable, ι is the current index in the sequence (which we assume to be an integer for simplicity), and the vector i = ⟨i1, . . . , ik⟩ captures the rest of the input variables in the loop body. Let I be the set of all such k-ary vectors.

For a loop body E, define the function fE = f1 œ f2 œ ... œ fn, where each fi : S × int × I → type(si) is a function such that fi(σ, ι, i) returns the value of si at iteration ι with input values σ and i. It is, by construction, straightforward to see that ⟨f1(σ, ι, i), . . . , fn(σ, ι, i)⟩ represents the state of the loop at iteration ι.

Definition 4.1. A loop, with body E, is parallelizable iff fE is ⊙-homomorphic for some ⊙.

Definition 4.1 basically takes the encoding of the loop as a tupled function, and declares the loop parallelizable if this tupled function is homomorphic. It is important to note that parallelizable is not used here in the broadest sense of the term; it is limited to the particular scheme of divide-and-conquer parallelism where the divide operator is the inverse of concatenation.
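For intuition, fE for the second-smallest loop of Example 3.6 can be written out directly. The following C++ rendering is our own illustration (this loop has no extra input variables beyond the sequence and the index); it returns the tuple ⟨m, m2⟩ after iteration ι:

    #include <climits>
    #include <utility>
    #include <vector>

    // f_E(sigma, iota): the state <m, m2> after the loop has processed
    // sigma[0], ..., sigma[iota].
    std::pair<int, int> fE(const std::vector<int>& sigma, size_t iota) {
      int m = INT_MAX, m2 = INT_MAX;
      for (size_t i = 0; i <= iota && i < sigma.size(); i++) {
        int t = (m > sigma[i]) ? m : sigma[i];   // max(m, s[i])
        m2 = (m2 < t) ? m2 : t;                  // min(m2, max(m, s[i]))
        m  = (m < sigma[i]) ? m : sigma[i];      // min(m, s[i])
      }
      return {m, m2};
    }

Definition 4.1 then asks whether this tupled function is ⊙-homomorphic; for this loop it is, with ⊙ being the join of Section 2.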
The task of the user of these solvers is to define (1) a correctness specification The loop body m2 = if m2 < (if m > s[i] then m else s[i]) for the program to be synthesized, and (2) provide syntactic is then mod- then m2 constraints that define the state space of possible programs elled by re- else (if m > s[i] then m else s[i]); m=if m < s[i] then m else s[i]; (mainly for tractability). In this section, we describe what currence equa- the correctness specification and syntactic constrains that we tions on the right, through the transformation of the condi- use. Beyond that, the specific design of these two elements, tional statements above into the assignment statements that ′ this work does not make any new contributions to SyGuS. use conditional expressions (if be then e else e ). Note that we focus on synthesizing binary join operators 4. Synthesis of Join in this paper. The restriction is superficial, in the sense that any other statically fixed number would work. And, it is not The premise of this section is that the sequential loop under hard to imagine an easy generalization to a parametric join. consideration is parallelizable, in the sense that a join op- eration for divide-and-conquer parallelism exists. We use a 4.2.1 Correctness Specification system of recurrence equations E in the style of Section 3.3 The homomorphism definition 3.4 provides us with the cor- to model the body of a loop for all formal and algorithmic rectness specification to synthesize a join operator ⊙ for a developments in this paper, and therefore, we use the term loop body E (or rather fE to be precise). In case of the loop body to refer to E whenever appropriate. specific SyGuS solver used in this paper (and many others similar to it), a bounded set of possible inputs are required, ne0 ∶∶= x ∣ c x ∈ nV ars, c a numeric constant and therefore, the correctness specification is formulated on ned>0 ∶∶= ned−1 ⊕ ned−1 ∣ − ned−1 symbolic inputs of bounded length K. A join ⊙ is a solution ∣ if (bed−1) ned−1 else ned−1 of the synthesis problem if for all sequences x, y of length be0 ∶∶= b ∣ true ∣ false b ∈ bV ars less than K, we have f x • y = f x ⊙ f y . E( ) E( ) E( ) bed>0 ∶∶= bed−1 ∧◯ bed−1 ∣ ¬bed−1 There is a tension between having a small enough K for ∣ ned−1 ned−1 < the solver to scale, and having a large enough K for the ∣ if (bed−1) bed−1 else bed−1 answer to be correct for inputs that are larger than K. We use small enough values for the solver to scale, and take care ⊕ ∶= +, −, min, max, ×, ÷ binary numeric = > >= < <= = of general correctness by automatically generating proofs of ∶ , , , , comparisons < = , correctness (see Section 7). ∧◯ ∶ ∧ ∨ binary boolean Figure 4. Grammar of expressions used for ?? and ?? holes. 4.2.2 Syntactic Restrictions LR R ned and bed correspond to expressions of depth up to and equal to The syntactic restrictions are defined through a pair of a d. nVars and bVars stand for numeric and booleans variables. sketch and a grammar for expressions. A sketch is a partial To characterize the scope of expressivity of our compilation program containing holes (i.e. unknown values) to be com- function , we provide formal conditions under which the pleted by expressions in the state space defined by the gram- synthesisC approach of Section 4.2 is successful in discover- mar. The sketch is an ordered set of equations E = ⟨s1 = ing a correct join. This also facilitates a comparison of our exp1(SVar, IVar), . . . , sn = expn(SVar, IVar)⟩. 
4.2.2 Syntactic Restrictions

The syntactic restrictions are defined through a pair of a sketch and a grammar for expressions. A sketch is a partial program containing holes (i.e., unknown values) to be completed by expressions in the state space defined by the grammar. The sketch is an ordered set of equations E = ⟨s1 = exp1(SVar, IVar), . . . , sn = expn(SVar, IVar)⟩. Intuitively, it is produced from the loop body by replacing occurrences of variables and constants by holes. Note that the inputs to a join are the results produced by two worker threads, which we refer to as the left and right threads. To contain the state space of solutions described by the sketch, we distinguish two different types of holes. Left holes ??_LR are holes that can be completed by an expression over variables from both the left and the right threads. Right holes ??_R will be filled with expressions over variables from the right thread only.

We define a compilation function C as

    C(c)              = ??_R
    C(x)              = ??_R    if x ∈ IVar
    C(x)              = ??_LR   if x ∈ SVar
    C(x[e])           = ??_R
    C(op(e1, .., en)) = op(C(e1), ..., C(en))

where e is an expression, op is an operator of the input language, x is a variable, and c is a constant. The sketch for the join code is then

    C(E) = ⟨s1 = C(exp1(SVar, IVar)), . . . , sn = C(expn(SVar, IVar))⟩

where each hole in C(expi) (1 ≤ i ≤ n) can be completed by expressions in a predefined grammar that is suitable for a given class of programs. For the experiments in this paper, the grammar in Figure 4 is used, where the expression depth d is gradually increased until a solution is found.

    ne_0    ::= x | c                             x ∈ nVars, c a numeric constant
    ne_d>0  ::= ne_{d-1} ⊕ ne_{d-1} | - ne_{d-1}
              | if (be_{d-1}) ne_{d-1} else ne_{d-1}
    be_0    ::= b | true | false                  b ∈ bVars
    be_d>0  ::= be_{d-1} ⋏ be_{d-1} | ¬ be_{d-1}
              | ne_{d-1} ⊲ ne_{d-1}
              | if (be_{d-1}) be_{d-1} else be_{d-1}

    ⊕ ::= +, -, min, max, ×, ÷     binary numeric
    ⊲ ::= >, >=, <, <=, ==          comparisons
    ⋏ ::= ∧, ∨                      binary boolean

Figure 4. Grammar of expressions used for ??_LR and ??_R holes. ne_d and be_d correspond to expressions of depth up to and equal to d. nVars and bVars stand for numeric and boolean variables.

Example 4.2. Consider the second smallest example from Section 2. The sketch (for the code in Figure 2) is

    m2 = min(??_LR, max(??_LR, ??_R));
    m  = min(??_LR, ??_R);

It is clear that the join presented in Section 2 can be discovered from this sketch by filling the unknowns with instances of state variables.
4.3 Efficacy of Join Synthesis

Syntactic limitations imposed for scalability in SyGuS run the risk of leaving a correct candidate out of the search space. To characterize the scope of expressivity of our compilation function C, we provide formal conditions under which the synthesis approach of Section 4.2 is successful in discovering a correct join. This also facilitates a comparison of our join synthesis with the parallelization approach of [34].

Definition 4.3. Function g is a weak right inverse of function f iff f ◦ g ◦ f = f.

Each function may have many weak right inverses. Intuitively, g produces a sequence s' (out of a result r = f(s)) such that f(s') = r = f(s). Therefore, s' (if much shorter than s) can be viewed as a summary of s with respect to the computation of f. In [34], this very notion of a weak right inverse is used to produce parallel implementations of functions. There, it is required that g's output is bounded, or equivalently, that s' is always a constant-length sequence (i.e., of length independent of |s|). Our join synthesis is also guaranteed to succeed under the same assumption. Note, however, that in contrast to [34], we do not require both leftwards and rightwards implementations of the function as input; either one alone suffices.

Proposition 4.4. If the loop body E is parallelizable and the weak right inverse of fE is constant time computable (which guarantees that it returns a sequence of bounded length), then there is a ⊙ in the space described by the sketch of E such that fE is ⊙-homomorphic.

Proposition 4.4 (see [15] for a proof sketch) provides the guarantee that, under the given conditions, the state space defined for the join through the sketch compilation function C does not miss an existing solution. Note that the compilation function C maintains the structure of the expressions when it compiles a sketch (remember Example 4.2). When the weak inverse is constant-time computable, the third homomorphism theorem [19] guarantees that a solution faithful to this structure exists. The computation of the weak inverse itself is done through the expressions plugged into the unknowns ??_R and ??_LR, where a fully generic grammar as illustrated in Figure 4 is used, which includes all possible constant-time computations at an appropriate depth.

Proposition 4.4 characterizes only a subset of the functions for which join synthesis succeeds. There are many (simple) functions whose weak right inverse is not bounded for which a join is still successfully synthesized. The function length is a simple example: its weak right inverse always has to be a sequence of the same length as the original (which would lead to an inefficient parallel implementation), and yet a constant-time join exists. If join synthesis fails, our tool gradually un-constrains the compiled sketch to include more expressions until a join is found. In practice, this was not necessary for any of our benchmarks. In Section 6.4, we comment on how feedback from the algorithm presented in Section 6.1 helps with constructing an effective sketch when generalization beyond the sketch compilation function C may be required.
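As a concrete instance of Definition 4.3 and Proposition 4.4, the second-smallest computation of Section 2 has a constant-size weak right inverse. In C++ (illustrative only; the synthesizer relies on the existence of g, not on constructing it):

    #include <vector>

    struct MinState { int m; int m2; };

    // Weak right inverse for min2: from a result <m, m2>, produce a
    // two-element sequence on which min2 recomputes exactly <m, m2>.
    // Relies on the invariant m <= m2, which holds for every reachable result.
    std::vector<int> g(MinState r) { return {r.m, r.m2}; }

Since g always returns a sequence of length two, it is constant time computable, and Proposition 4.4 guarantees that the sketch space for this loop contains a correct join.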
5. Parallelizability

The mts (maximum tail sum) example from Section 2 demonstrates that a loop is not always parallelizable. Here (and later in Section 6), we discuss how a loop can be lifted to become parallelizable.

5.1 Homomorphisms and Parallelism

A strong connection between homomorphisms and parallelism is well-known [18, 20, 34]; specifically, homomorphic functions admit efficient parallel implementations.

Theorem 5.1 (First Homomorphism Lemma [9]). A function h : S → Sc is a homomorphism if and only if h can be written as a composition of a map and a reduce operation.

The if part of Theorem 5.1 basically states that a homomorphism has an efficient parallel implementation, since efficient parallel implementations for maps and reductions are known. The only if part is equally important, because it indicates that a large class of simple and easy-to-formulate parallel implementations of functions ought to be homomorphisms. Theorem 5.1 gives rise to an important research question: If a function is not a homomorphism, can the sequential code be modified (lifted) so that it corresponds to a homomorphic function, to facilitate easy parallelization? In Section 6, we answer the question in the affirmative. But first, in this section, we argue why this lifting should be done carefully, to maintain the performance advantages of parallelization.

Non-homomorphic Functions. There is a simple observation, made in [21] (and probably also in earlier work), that every non-homomorphic function can be made homomorphic by a rather trivial lifting:

Proposition 5.2. Given a function f : S → Sc, the function f œ ι is ⟐-homomorphic, where

    ι(x) = x    and    (a, x) ⟐ (b, y) = (f(x • y), x • y)

Operation œ denotes tupling of two functions in the standard sense: (f œ g)(x) = (f(x), g(x)). Basically, the ι component of the tuple remembers a copy of the string being processed; the partial computation results f(x) and f(y) are discarded by ⟐, which then computes f(x • y) from scratch. It is clear that a parallel implementation based on this idea is less efficient than the sequential implementation of f. Therefore, just making the function homomorphic will not result in a good solution for parallelism. Next, we identify a subset of homomorphic liftings that are computationally efficient.

5.2 Computationally Efficient Homomorphisms

Let us assume that f : S → D is a single-pass linear time computable (see Definition 3.3) function, and that elements of D are tuples of scalars. The assumption that the function is linear time comes from the fact that we target loops without any loops nested inside, which (by definition) are linear-time computable (in the size of their input sequence). Consider the (rightwards) sequential computation of f, defined using the binary operator ⊗ as follows:

    f(y • [a]) = f(y) ⊗ a

where a ∈ Sc, f : S → D, and ⊗ : D × Sc → D. The fact that f is single-pass linear time computable implies that the time complexity of computing f at every step (of the recursion above) is constant. Let f' be a lifting of f that is ⊙-homomorphic (for some ⊙).

Proposition 5.3. The (balanced) divide-and-conquer implementation based on f' has the same asymptotic complexity as f (i.e., linear time) iff the asymptotic complexity of ⊙ is sub-linear (i.e., o(n)).

The simplest case is when ⊙ is constant time. We call these constant homomorphisms for short. Note that the synthesis routine presented in Section 4 synthesizes constant-time join operators exclusively. In the next section, we present an algorithm that lifts a non-homomorphic function to a constant homomorphism if one exists.

Let us briefly consider the super-constant yet sub-linear case for joins, to justify why this paper focuses only on constant-time joins and ignores super-constant ones. Based on the definition of homomorphism, we know:

    f'([x1, . . . , xn]) = f'([x1, . . . , xk]) ⊙ f'([xk+1, . . . , xn])

For ⊙ to be super-constant in n, f'([x1, . . . , xk]) (respectively, f'([xk+1, . . . , xn])) has to produce a super-constant size output: constant-size inputs to ⊙ cannot yield super-constant time executions. Since the output of f is assumed to be constant size (scalars, and tuples of them, are by definition constant size), the output of f' is super-constant only due to the additional auxiliary computation that makes f' a homomorphism but is not part of the sequential computation of f. We believe the automatic discovery of auxiliary information of this nature can be a fundamentally hard problem. Sub-linear (super-constant) auxiliary information is often the result of a clever algorithmic trick that improves over the trivial linear auxiliary (i.e., remembering the entire sequence, as was discussed in Proposition 5.2). The field of efficient data streaming algorithms [1, 4] includes a few examples of such clever algorithms. Join synthesis can be adapted to handle such cases if the auxiliary information is available. However, since the discovery of super-constant auxiliary information is a necessary step for automation, and extremely difficult to do automatically, we only target constant homomorphisms in this paper.
adding new computation to the loop body, and briefly com- Algorithm 1 illustrates our main algorithm. The algo- ment on case (2) in Section 6.4. rithm (in function Lift()) inspects each state variable sep- arately to see if its divide-and-conquer computation re- 6.1 The Algorithm quires introduction of new auxiliary accumulators (by call- ing Solve()). In our example, the algorithm discovers that In Section 2, we used the bal = true; mts example to show how ofs = 0; ofs requires no new auxiliary accumulators, but bal does. for (i = 0; i < |s|; i++) { In function Solve(), it starts simulating the loop from an inspecting unfoldings of 0 0 if s[i] == ’(’ then arbitrary state ⟨s , . . . , s ⟩, mirroring the arbitrary break in the computation could re- ofs = ofs + 1; 0 n sult in the discovery of dis- else Figure 5 where the red value is swapped with this arbitrary initial state. The while loop then iteratively unfolds the ex- covery of the auxiliary ac- ofs = ofs - 1; bal = bal && (ofs >= 0); pression in the style of Equations 1 and 2 in Section 2 for cumulator sum. There, the } the mts example. ‘unfold’ (in Algorithm 1) computes the k- intuitive idea was to inspect expressions computing mts val- th unfolding, for a given k. Let us focus on bal, which is the ues to find an equivalent expression where the unknown ap- problematic part of the loop state, and consider the first few peared at a lower depth. Here, we use a new example, to in- unfoldings of the computation of the right hand side thread. troduce our algorithm, and provide insight for why the algo- The first step is as follows: rithm goes beyond just lowering the depth of the unknowns. bal s 0..k + 1 = bal s 0..k ∧ ofs s 0..k + 1 ≥ 0 The code above corresponds to a sequential loop that checks ( [ ]) ( [ ]) ( ( [ ]) ) if an input string is a balanced sequence of parentheses. The = bal(s[0..k]) ∧ integer variable ofs (for offset) keeps track of the differ- ofs(s[0..k]) + ofs([s[k + 1]]) ≥ 0 ence in the number of open and closed parentheses seen so using the equality ofs(x • y) = ofs(x) + ofs(y). The far, and the boolean variable bal maintains the status of bal- final expression above has both its unknown (red) values ancedness of prefix read so far. At the end of the loop, the at the optimal depth in the expression tree, and does not string is balanced if ofs = 0 ∧ bal = true. seem to require any auxiliary information to compute it once If the loop only consisted of ofs, then it would corre- the unknowns become knowns3. Let us look at the next spond to a homomorphism. It is easy to see that ofs(x•y) = unfolding (using the result of the first step above): ofs(x) + ofs(y). The same is not true for bal. If x is bal- bal s 0..k 2 = bal s 0..k 1 ofs s 0..k 2 ≥ 0 anced and y is not, then x • y could be balanced (or not). ( [ + ]) ( [ + ]) ∧ ( ( [ + ]) ) To determine this, information beyond the boolean values of = [bal(s[0..k]) ∧ bal and the integer values of ofs for x and y is required. (ofs(s[0..k]) + ofs([s[k + 1]])) ≥ 0] ∧ Let us see how our algorithm discovers the required auxil- [ofs(s[0..k]) + ofs(s[k + 1..k + 2]) ≥ 0] iary information. Consider breaking the sequential computa- tion of the above code into two threads as illustrated in Fig- Now, an alarming pattern seems to emerge. The number of occurrences of the unknown ofs s 0..k has doubled, ure 5. 
The value of ⟨ofs, bal⟩(s[0..k]) (illustrated in red in Figure 5), which is the result of the computation on the left, will not be known to the computation on the right when it is needed at its very beginning. This dependency is the killer of parallelism. The idea behind Algorithm 1 is to see whether we can rearrange the computation on the right so that it can be performed with a constant known initial value instead of the unknown ⟨ofs, bal⟩(s[0..k]); then, at the end, when the values become known (through the left thread), the join operator adjusts the result based on them.

Algorithm 1 illustrates our main algorithm. The algorithm (in function Lift()) inspects each state variable separately to see whether its divide-and-conquer computation requires the introduction of new auxiliary accumulators (by calling Solve()). In our example, the algorithm discovers that ofs requires no new auxiliary accumulators, but bal does. In function Solve(), it starts simulating the loop from an arbitrary state ⟨s0_1, . . . , s0_n⟩, mirroring the arbitrary break in Figure 5, where the red value is replaced by this arbitrary initial state. The while loop then iteratively unfolds the expression in the style of Equations 1 and 2 of Section 2 for the mts example; 'unfold' (in Algorithm 1) computes the k-th unfolding, for a given k. Let us focus on bal, which is the problematic part of the loop state, and consider the first few unfoldings of the computation of the right-hand thread. The first step is:

    bal(s[0..k+1]) = bal(s[0..k]) ∧ (ofs(s[0..k+1]) ≥ 0)
                   = bal(s[0..k]) ∧ (ofs(s[0..k]) + ofs([s[k+1]]) ≥ 0)

using the equality ofs(x • y) = ofs(x) + ofs(y). The final expression above has both of its unknown (red) values at the optimal depth in the expression tree, and does not seem to require any auxiliary information to compute it once the unknowns become known³. Let us look at the next unfolding (using the result of the first step above):

³ To be fully accurate, the algorithm guesses an auxiliary accumulator here which will later be declared redundant, since it is effectively ofs.

    bal(s[0..k+2]) = bal(s[0..k+1]) ∧ (ofs(s[0..k+2]) ≥ 0)
                   = [bal(s[0..k]) ∧ (ofs(s[0..k]) + ofs([s[k+1]]) ≥ 0)] ∧
                     [ofs(s[0..k]) + ofs(s[k+1..k+2]) ≥ 0]

Now, an alarming pattern emerges. The number of occurrences of the unknown ofs(s[0..k]) has doubled, and to compute the expression once the unknown is known, one needs to store the intermediate result ofs([s[k+1]]). Intuitively, it is clear that with 3 or more unfoldings the pattern continues: the occurrences of the unknowns are arbitrarily replicated, and all intermediate values ofs(s[k+1..k+m]) need to be stored (for all 1 ≤ m ≤ n-k-1) to retain the ability to compute the total bal at the end. This leads us to the second principle that we use in rearranging these expressions, namely reducing the number of occurrences of the unknowns. Using the last algebraic rule in Figure 6, the expression above can be rewritten as:

    bal(s[0..k+2]) = bal(s[0..k]) ∧
                     [ofs(s[0..k]) ≥ max(-ofs([s[k+1]]), -ofs(s[k+1..k+2]))]

This solves the problem. Instead of having to remember all intermediate values ofs(s[k+1..k+m]), the loop can just remember the maximum value of their negations. This can be added as an auxiliary accumulator to the sequential loop to lift it to a homomorphism. Algorithm 1 finds the equivalent expression above through the call to 'normalize', the most important step of the algorithm. The details of normalization are explained in the next section.
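Before turning to the algorithm itself, for reference, the lifted loop and join that this derivation leads to can be written as follows (our C++ rendering of the result, not the tool's literal output). The auxiliary accumulator maux records the maximum of the negated prefix offsets:

    #include <algorithm>
    #include <string>

    const long NEG_INF = -1000000000;   // stands in for -infinity; a finite
                                        // sentinel avoids overflow when shifted

    struct ParState {
      long ofs;    // #open - #close parentheses seen so far
      bool bal;    // all prefix offsets nonnegative so far
      long maux;   // auxiliary: max over nonempty prefixes p of -ofs(p)
    };

    ParState bal_seq(const std::string& s, size_t lo, size_t hi) {
      ParState st{0, true, NEG_INF};
      for (size_t i = lo; i < hi; i++) {
        st.ofs += (s[i] == '(') ? 1 : -1;
        st.bal  = st.bal && (st.ofs >= 0);
        st.maux = std::max(st.maux, -st.ofs);
      }
      return st;
    }

    // Join: x shifts every prefix offset of y by ofs(x), hence
    // bal(x . y) = bal(x) && (ofs(x) >= maux(y)).
    ParState bal_join(ParState l, ParState r) {
      return { l.ofs + r.ofs,
               l.bal && (l.ofs >= r.maux),
               std::max(l.maux, r.maux - l.ofs) };
    }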
The existence of an effective and efficient implementation and all intermediate values of ofs([s[k + 1..k + m]])) need (to replace this idealized version) depends on the given ex- to be stored (for all 1 ≤ m ≤ n − k − 1 ), to retain the pression, operators that appear in it, and the algebraic equal- ability to compute the total bal at the end. This leads us to ities that hold for those operators. There has been a lot of the second principle that we use in rearranging these expres- research in the area of rewrite systems [29, 30, 35] that can sions, namely reducing the number of occurrences of the un- inspire several heuristics for normalization. The problem can knowns. Using the last algebraic rule in Figure 6, the above also be formulated using standard syntax-guided synthesis, expression can be rewritten as: although we suspect that this solution will not scale well bal(s[0..k + 2]) = bal(s[0..k]) ∧ since (in addition to standard scalability problems in SyGuS) [ofs(s[0..k]) ≥ max(−ofs([s[k + 1]])), −ofs(s[k + 1..k + 2])] non-linearity can easily be introduced, even in simple cases. The main complication in a search for an equivalent ex- This solves the problem. Instead of having to remember pression (of a given form) is that the state space for this all intermediate values of ofs([s[k + 1..k + m]])), the loop search can be infinite, with no particular structure to guide a can just remember the maximum value of their negations. finitary search. We propose a heuristic normalization proce- This can be added as an auxiliary accumulator to the sequen- dure that utilizes a cost function to enforce a finitary search. tial loop to lift it to a homomorphism. Algorithm 1 finds the The heuristic is simple and yet seems to work well (effi- equivalent expression above through the call to ‘normalize’, ciently and effectively) for the benchmarks in this paper. Be- which is the most important step of the algorithm. The de- low, depe(v) is the depth of the deepest occurrence of v and tails of normalization are explained in the next section. occe(v) is the number of occurrences of v in expression e. Cost-directed rewrite rules (apply the rule only if it reduces 6.3.1 Existence of Constant Homomorphic Liftings the cost of the expression): Here, independently of any algorithm, we deliberate over the a ⊙ b → b ⊙ a (a ⊙ b) ⊙ c → a ⊙ (b ⊙ c) question of existence of a constant homomorphic lifting (see (a ⊙ b) ⊗ c → (a ⊗ c) ⊙ (b ⊗ c) Section 5.2). Remember that lifting the loop to a homomor- (c ? x ∶ y) ⊙ z → c ? (x ⊙ z) ∶ (y ⊙ z) phism with a single linear-size auxiliary variable (Proposi- c1 ? (c2 ? x ∶ y) ∶ z → c1 ∧ c2 ? x ∶ (¬c2 ? y ∶ z) tion 5.2) is always possible. ¬(a ∧ b) → (¬a) ∨ (¬b) The important observation of this section is that for a −(a + b) → (−a) − b single-pass linear time computable function, a constant ho- a > c ∨ b > c → max(a, b) > c momorphic lifting does not necessarily exist. First, a func- c > a ∧ c > b → c > max(a, b) tion may be single-pass linear time computable in one direc- Figure 6. ⊙ and ⊗ stand in for the appropriate arithmetic tion (e.g. leftward) and not the other (e.g. rightward): ∗ ones. For brevity, we only include one rule from each alge- Remark 6.3. Let hn ∶ {0, 1} → bool be defined as braic equality omitting the symmetric versions from the list. hn(w) = true iff w[n] = 1 (i.e. the n-th letter). h is right- Definition 6.1. The cost function CostV ∶ exp → Int × wards but not leftwards linear time computable in n. 
Int assigns to any expression , a pair where is e (d, n) d For a function like h, there does not exist a constant Max ∈ dep v , and n is occ v . v V e( ) ∑v∈V e( ) homomorphic lifting. The process of normalization can be formalized by a set Proposition 6.4. If a function h is not leftwards (rightwards) of rewrite rules R that are derived from a set of standard linear-time computable, then there exists no homomorphic algebraic equalities that hold for the operators appearing in lifting of h that is linear-time computable. the program text. Each algebraic equality gives rise to two rewrite rules (one for each direction of the equality). Figure This is a simple consequence of the Third Homomorphism 6 includes instances of the rules that we use in our imple- Theorem [19]. A homomorphism naturally induces a left- mentation. Given a set of algebraic rules R, we define the ward and a rightward computable function. Existence of cost minimizing normalization procedure as the successive a constant homomorphic lifting would then imply that the application of rewrite rules in R in order to reduce an ex- function should be single-pass linear-time computable both pression to an equivalent expression with minimal cost. The leftwards and rightwards which is a contradiction. algorithm is a standard cost-driven search algorithm. Evalua- The natural question to ask, after the negative result of tion of this algorithm (with the set of rules provided in Figure Proposition 6.4 is: when does a constant homomorphic lift- 6) in Section 8 demonstrates that despite its simplicity, it is ing exist? One characterization is under a condition similar fast and effective in finding the fruitful normal forms. It fails to the one used in Proposition 4.4. to infer only 1 auxiliary accumulator (among 15), and the Theorem 6.5. If f is leftwards and rightwards linear time reason for that is the shortcoming of the set of rewrite rules computable, and the weak right inverse of f is constant time in dealing with conditional statements. The algebraic rules computable, then there exists a binary operator ⊙ such that regarding conditionals in Figure 6 can only normalize light f is a constant ⊙-homomorphism. instances of conditionals. Devising a sophisticated heuristic normalization is a topic of interest for future work. It remains an open problem whether the restriction on the computation of the weak right inverse above can be lifted to 6.2 Soundness of Algorithm 1 give rise to the following desirable result. The following Proposition formally states the correctness Conjecture 6.6. If f is leftwards and rightwards linear time of Algorithm 1. Note that in all formal statements about computable, then there exists a binary operator ⊙ such that Algorithm 1, we assume the idealized version of ‘normalize’ f is a constant ⊙-homomorphism. as illustrated in Algorithm 1. For the analysis of the algorithm in Section 6.3.2, the Proposition 6.2. If Algorithm 1 terminates successfully and most important observation of this section is that not all reports no new auxiliary variables, then the loop body cor- single-pass linear time loops can be parallelized through a responds to a ⊙-homomorphic function (for some ⊙). If it constant homomorphic lifting. terminates successfully and discovers new auxiliary accu- mulators, then the lifted loop body augmented with them cor- 6.3.2 Completeness of Algorithm 1 responds to a ⊙-homomorphic function (for some ⊙). 
6.3.2 Completeness of Algorithm 1

Since not every single-pass linearly computable function can be lifted to a constant homomorphism, there exists no complete algorithm for the entire class of these functions. It remains to determine how complete Algorithm 1 is for the single-pass linearly computable functions that can be lifted to constant homomorphisms (see Theorem 6.5).

Theorem 6.7. For a function f, if there exists a constant ⊙-homomorphic lifting f œ g, then there exists a ⊛-homomorphic lifting in which

    (f(x), g(x)) ⊛ (f(y), g(y)) =
        (exp(f(x), f(y), g(y)), exp'(f(x), g(x), f(y), g(y)))

that is, the f(x • y) component of the join does not depend on g(x).

Theorem 6.7 is essential to the completeness argument for Algorithm 1. The algorithm operates by trying to discover exp(f(x), f(y), g(y)) through unfolding the y part of f(x • y), where it does not have access to g(x). To make any claims about the completeness of Algorithm 1, one has to argue that it is sufficient to look at f(x), f(y), and g(y), and not at g(x). Equally important is the fact that when two expressions are equivalent, a set of algebraic rules R (i.e., axioms) that proves their equivalence in finitely many steps may not, in general, exist. Under the assumption that such rules exist, we can state the following completeness theorem about Algorithm 1:

Theorem 6.8 (Completeness). If there exists a constant homomorphic lifting of the loop body E, then Algorithm 1 succeeds in discovering this lifting, provided that an appropriate set of algebraic rules R is given.

Note that completeness is contingent on the availability of a sufficient set of algebraic rules and on an effective search strategy that leads to the discovery of the correct candidate. Having a sufficient set of rules is impossible in general in theory, yet practically achievable for standard programs. An efficient convergent search strategy that does not miss the correct candidate, however, is more elusive in practice, since the problem is known to be undecidable [12] under certain conditions on the algebraic rules.

6.4 Feedback to Join Synthesis

An important observation is that Algorithm 1 produces information that can be helpful to join synthesis. When Algorithm 1 terminates successfully and discovers no new auxiliary accumulators, it certifies the loop as a homomorphic function (Proposition 6.2). If join synthesis has previously failed to synthesize a join for this loop, the conclusion is that the syntactic restrictions for the join must have been too strict. In this case, the normalized expression used for the discovery of the accumulators contains hints about the shape of the join operator, and these hints can be used to re-instantiate join synthesis. The hints are structural information (i.e., a template for the expression tree) that can be extracted from exp(e1, . . . , em, e), the result of 'normalize'. This expression provides a recipe for how partial results (for bounded unfoldings) can be combined at join time, and can therefore be used as a cheat sheet for devising a join sketch. In our benchmarks, join synthesis succeeded on all instances, and this trick was never needed.
7. Correctness of the Synthesized Programs

We use a SyGuS solver that relies on bounded checks to synthesize join operators, and therefore the correctness of the synthesized join is not guaranteed for all input sequences. In this section, we discuss a heuristic scheme that generates proofs of correctness for the general case automatically. Automatic verification of code is known to be a hard problem, and full automation often comes at the cost of incompleteness; the method presented here is no exception to this rule. The claim is that, based on a few key observations, the boundaries of automation can be extended to include many of the typical benchmarks that are in scope for this work.

We use an example to introduce our automated proof construction scheme. Let us assume that we want to verify the correctness of the join operator synthesized to parallelize the mts example. We use the Dafny [28] program verifier to encode and check the proof of correctness.

A full functional correctness proof of the parallel program is not necessary; rather, it suffices to show that the join forms a homomorphism together with the (lifted) sequential code. The sequential code, together with Definition 3.4, constitutes the specification of correctness for the join operator. Note that, by piggybacking on the assumption that the original sequential code is correct, only this one type of property (formation of a homomorphism) ever has to be proved for any given program. This conveniently limits the scope of automated verification.

As mentioned in Section 3.3, we construct a functional variant of the loop body, which is used at every stage of our approach as its representation. This is very helpful for proof construction: proofs are constructed for program fragments in our intermediate functional form, and therefore there is no need to handle loops or synthesize loop invariants.

We use this intermediate functional form to model the loop body as a collection of functions in Dafny. The general rule is that every state variable in the loop body is modelled by a unique Dafny function. Figures 7(a) and 7(b) illustrate the Dafny functions corresponding to state variables sum and mts, respectively. Proofs are constructed for these recursively defined functions, where, more often than not, the correctness specifications for the functions, especially the simpler ones, are inductive. This means that no further (human-provided) annotations are required for the proof, in sharp contrast to imperative loops, where loop invariants have to be provided even for the simplest of loops. It is important to note that instances do exist where a strengthening of the specification (in the form of an injection of known invariants) is required to make the specification inductive; however, we did not encounter any such example among our benchmarks.

The next step is to model the synthesized join. Each state variable has a distinct join function. In this example, we introduce a Dafny function to model the join for Sum, as illustrated in Figure 7(d), and another one for Mts, as illustrated in Figure 7(e).

    (a) function Sum(s: seq<int>): int
        { if s == [] then 0 else Sum(s[..|s|-1]) + s[|s|-1] }

    (b) function Mts(s: seq<int>): int
        { if s == [] then 0 else Max(Mts(s[..|s|-1]) + s[|s|-1], 0) }

    (c) function Max(x: int, y: int): int
        { if x < y then y else x }

(d) function SumJoin(sum_l: int, sum_r: int): int
    { sum_l + sum_r }

(e) function MtsJoin(mts_l: int, sum_l: int, mts_r: int, sum_r: int): int
    { Max(mts_r, mts_l + sum_r) }
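Two details of these joins are worth noting. First, MtsJoin consumes the Sum values: mts on its own is not ⊙-homomorphic, which is precisely why sum is part of the lifted state. Second, the parameter sum_l is passed but never used, matching the guarantee of Theorem 6.7 that the f component of the join does not depend on g(x).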

(f) lemma HomomorphismSum(s: seq<int>, t: seq<int>)
      ensures Sum(s + t) == SumJoin(Sum(s), Sum(t))
    {
      if t == [] { assert(s + t == s); }
      else {
        calc {
          Sum(s + t);
          == { assert (s + t[..|t|-1]) + [t[|t|-1]] == s + t; }
          SumJoin(Sum(s), Sum(t));
        }
      }
    }

(g) lemma HomomorphismMts(s: seq<int>, t: seq<int>)
      ensures Mts(s + t) == MtsJoin(Mts(s), Sum(s), Mts(t), Sum(t))
    {
      if t == [] { assert(s + [] == s); }
      else {
        calc {
          Mts(s + t);
          == { HomomorphismSum(s, t[..|t|-1]);
               assert (s + t[..|t|-1]) + [t[|t|-1]] == s + t; }
          MtsJoin(Mts(s), Sum(s), Mts(t), Sum(t));
        }
      }
    }

Figure 7. The encodings of sum (a), mts (b), and max (c); the encoding of the join, in two parts, for sum (d) and mts (e); and the proofs that the joins form a homomorphism, in two parts, for sum (f) and mts (g). s + t denotes the concatenation of sequences s and t in Dafny, [] is the empty sequence, and |s| stands for the length of the sequence s. The lemmas are proved by induction through the separation of the base case t == [] and the induction step. The calc environment provides the hints necessary for the induction step proof.

Each join function corresponds to an update (a line of code) synthesized for the (overall) join operator (as presented in Section 2).

The proof is modularly constructed for each state variable. There is a natural decomposition for homomorphism proofs of tupled functions: each component of the tuple can be shown to be a homomorphism independently. The overall proof therefore consists of simpler (to construct) proofs that the join operator produces the correct result for each state variable of the loop. In our example, we use two Dafny lemmas accordingly: the lemma in Figure 7(f) states the correctness of the join for Sum, and the one in Figure 7(g) states the correctness of the join for Mts.

Let us focus on the HomomorphismSum lemma in Figure 7(f). This lemma states that Sum is SumJoin-homomorphic (as defined in Definition 3.4). Dafny cannot prove this lemma automatically if the body of the lemma is left empty. It can prove the lemma, however, if it is given guidance on how to formulate the argument, in other words, if the body of the lemma (as illustrated in Figure 7(f)) is provided. Each homomorphism proof is an induction argument over the length of the input sequence(s). This is true independently of the specific function under consideration, and it enables us to devise a generic (induction) proof template that can be instantiated for each function and its corresponding join operator.

Let us inspect this guidance (i.e. the body of the lemma in Figure 7(f)) more closely. The body tells Dafny that the proof is an induction argument on the length of t (the second sequence argument), with the base case of an empty sequence (reflected in the if condition). In the base case, Dafny has to be made aware of the fact that an empty t reduces reasoning about s + t to reasoning about s. In the induction step (i.e. the else branch), Dafny is guided to peel off the last element of t to obtain a smaller instance for the induction.
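To make the induction step explicit, here is the chain of equalities that the calc block abbreviates (our unrolling; the induction hypothesis for the shorter prefix t[..|t|-1] is supplied by Dafny's automatic induction):

  Sum(s + t)
    == Sum((s + t[..|t|-1]) + [t[|t|-1]])      // the asserted sequence identity
    == Sum(s + t[..|t|-1]) + t[|t|-1]          // definition of Sum
    == (Sum(s) + Sum(t[..|t|-1])) + t[|t|-1]   // induction hypothesis
    == Sum(s) + Sum(t)                         // definition of Sum, reassociated
    == SumJoin(Sum(s), Sum(t))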
This guidance is generic, i.e. not dependent on the nature of the function Sum, and is applicable to any other function. Consider the proof of correctness of MtsJoin (Figure 7(g)). This proof is identical to that of the HomomorphismSum lemma, with the minor difference of recalling the (already proved) HomomorphismSum lemma in the calc hint. Note that Mts calls Sum in its definition, and therefore the proof of correctness of the join operator for Mts has to assume the correctness of the join for Sum (which is exactly the HomomorphismSum lemma). This rule, namely, if the value of u depends on the value of v, then recall the homomorphism lemma for v in the proof of homomorphism for u, is generically applied in all constructed proofs.

Our tool follows the simple rules illustrated by this example, and generates proofs like the one illustrated in Figure 7 automatically once the synthesis task is finished. It is important to note that if the proof checks, then correctness is guaranteed. If it fails, this can mean either that problem-specific invariants were required, or that the bounded synthesizer has synthesized a wrong solution. We did not have any instances of failure in our benchmarks. Our proposed remedy in case of failure is to perform a more extensive bounded check to discover counterexamples or to acquire more coverage for testing. A backup plan like this is required due to the limits of applicability of automated proof generation techniques.

8. Experiments

Our parallelization technique is implemented in a prototype tool called PARSYNT, for which we report experimental results in this section.

8.1 Implementation

We use CIL [36] to parse C programs, do basic program analysis, and locate inner loops. The loop bodies are then converted into functional form, from which a sketch and the correctness specification are generated. ROSETTE [47] is used as our backend solver, with a custom grammar for the holes (unknowns) of the sketch. In addition to narrowing the search space using the information discussed in Section 4, we use state variable type information to reduce the size of the search space of the solver. We also bound the size of synthesizable loop body expressions, especially when the loop body contains non-linear operators, which are difficult for the solvers to handle. To sidestep this problem, we produce a more constrained join sketch with a smaller search space when non-linear operators are involved. Hardware: synthesis was performed on a laptop with 8G of RAM and a dual-core Intel m3-6Y30.

8.2 Evaluation

Benchmarks. We collected a diverse set of benchmarks to evaluate the effectiveness of our approach. Table 1 includes the complete list. The benchmarks are all in C:

• min, max, sum, length, mps (maximum prefix sum), mts (maximum tail sum), and mss (maximum segment sum) are standard functions over lists. mps and mts are the examples discussed in Section 2, and mss is from programming pearls [34, 43]. mts-p and mps-p are the variations (from [34]) where, in addition to the sum, the position that defines the value of the corresponding sum is also returned. 2nd-min is the second smallest element of a list.
• poly (from [43]) computes the value of a polynomial at a given point, when the coefficients are given in a list. atoi is the standard (string to integer) function in C.
• balanced-() checks if a string of brackets is balanced. 0∗1∗ is a language recognizer for the regular expression 0∗1∗, and 0after1 accepts strings of 0's and 1's in which a 0 has been seen after a 1. hamming computes the hamming distance between two strings.
• count-1's counts the number of blocks (i.e. contiguous sequences) of 1's in a sequence of 0's and 1's, and max-block-1 returns the length of the maximum block of (contiguous) 1's.
• is-sorted checks whether a list is sorted. dropwhile (from Haskell) is a filter that removes from the beginning of a sequence all elements that satisfy a given predicate. line-sight determines if a building is visible from a source in a line of buildings of various heights.

Performance of PARSYNT. Table 1 lists the times spent in join synthesis and in auxiliary accumulator synthesis, which dominate the total synthesis time; the proof generation/checking times are negligible (about 1-2ms on average). Note that if a benchmark is parallelizable in its original form, then no attempt is made to discover auxiliaries (indicated by "–" in the table). When auxiliary variables were required, Table 1 reports how many were discovered. With one exception (marked * in the table), the auxiliary synthesis always succeeds; in that case, 1 of the 2 required auxiliaries is discovered, but the tool fails to discover the second one. Manual inspection of the failed instance indicates the need for rules more sophisticated than those illustrated in Figure 6 and discussed in Section 6.1, involving conditional statements. Finally, the time to generate and check proofs is negligible compared to the join synthesis times, with most cases taking under 1s and the most expensive case (mss) taking about 7s.

Table 1. Experimental results for the performance of PARSYNT over all benchmarks: whether auxiliaries were required, the join synthesis time, and the number of auxiliaries discovered. Times are in seconds. "–" indicates that no relevant data can be reported in this case. *: the tool succeeds in finding 1 out of the 2 necessary and synthesizable auxiliaries in this case.

Quality of the Synthesized Code. We manually inspected all of the synthesized programs and, as programmers, we could not produce a better version for any one of them, with a single exception: in that instance, only one of the two discovered auxiliary accumulators was strictly necessary, and the other is redundant. There was, however, a negligible performance difference between this version and the optimized one.

Evaluation of the Quality of the Parallel Code. Here, we evaluate the performance of the synthesized parallel programs. The quality of a parallel program depends on many algorithmic design parameters beyond the synthesized join (our target). We use Intel's Thread Building Blocks (TBB) [39] as the library to implement the divide-and-conquer parallel solutions that we synthesize. TBB is a popular C++ runtime library that offers improved performance and scalability by dynamically redistributing parallel tasks across available processors, and accommodates portability across different platforms. It supports divide-and-conquer parallelism, so transforming our solutions (from ROSETTE) into a TBB-based implementation became a simple mechanical task.

Evaluation of the quality of the generated parallel code was done on a Proliant DL980 G7 with 8 eight-core Intel X6550 processors (64 cores in total) and 256G of RAM, running 64-bit Ubuntu. The size of the input arrays is about 2bn elements, and the grain size for parallelization was set at 50k. Note that for those benchmarks that required auxiliary computation for parallelization, the parallel version is more expensive per iteration than the original sequential one. Figure 8 illustrates the speedups of our parallel solutions over the sequential input programs. It is clear that the speedups are linear up to around 32 cores. It is a known problem that TBB's performance does not scale well above 32 cores; a study [14] of TBB has shown that this is due to TBB's scheduling overhead, and not due to a design problem in our produced parallel programs. We separately measured the overhead of TBB by limiting the number of cores to 1. The slowdown (over sequential) is on average negligible; the average slowdown is close to 1, with a standard deviation of 0.04.

Figure 8. Speedups relative to the sequential implementation.
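To make the mechanical nature of this translation concrete, the following is a minimal sketch of how the lifted mts loop and the join of Figure 7(d,e) could map onto TBB's functional parallel_reduce (illustrative only: this is not the tool's generated code, and the names MtsState and mts_parallel are ours):

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Lifted state: mts alone is not a homomorphism, so the auxiliary
// accumulator sum travels with it (cf. Section 6).
struct MtsState {
  long long sum;
  long long mts;
};

long long mts_parallel(const std::vector<long long>& s) {
  MtsState identity{0, 0};
  MtsState res = tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, s.size(), 50000 /* grain size */),
      identity,
      // Leaf: the original sequential loop body over a chunk, continued
      // from the partial state st.
      [&s](const tbb::blocked_range<std::size_t>& r, MtsState st) {
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
          st.sum += s[i];
          st.mts = std::max(st.mts + s[i], 0LL);
        }
        return st;
      },
      // Join: the synthesized operator, SumJoin and MtsJoin of Figure 7(d,e).
      [](const MtsState& l, const MtsState& r) {
        return MtsState{l.sum + r.sum, std::max(r.mts, l.mts + r.sum)};
      });
  return res.mts;
}

The leaf lambda is exactly the sequential fold, so the homomorphism lemmas of Section 7 are what justify the correctness of the reduction.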

9. Related Work

Homomorphisms and Parallelism. Most closely related to our work are the approaches that use homomorphisms for parallelization. There have been previous attempts at using the derivation of list homomorphisms for parallelization, such as methods based on the third homomorphism theorem [18, 20], those based on function composition [17], their less expressive but more practical variant based on matrix multiplication [43], methods based on quantifier elimination [32], as well as those based on recurrence equations [8]. We discuss the most closely related one here.

In [34], the third homomorphism theorem and the construction of the weak right inverse are used to derive parallel programs. The approach in [34] requires much more information from the programmer than ours: the programmer needs to provide both the leftwards and the rightwards sequential implementations of a function to get the parallel one. This is often unreasonable, especially when the parallel version has a twist, since coming up with the leftwards implementation of a function can be as complex (and time-consuming) as parallelizing it in the first place. Intuitively, by providing the reverse computation, the programmer supplies the information that we compute automatically in Section 6. By contrast, we only need one (reference) sequential implementation as input. In Section 4.3, we make a more technical comparison against [34]. Later, the theoretical ideas of [34] were extended to trees in [33], and generalized for lists in [31], but these papers do not provide a practical way of producing parallel code. In [13], a study of how functions may be extended to homomorphisms is presented, but no algorithm is provided.

Loop Parallelization. More recently, in [42], symbolic execution is used to identify and break dependences in loops that are hard to parallelize. Since we produce correct parallel implementations, we do not have to incur the extra cost of symbolic execution at runtime. That said, the scopes of applicability of [42] and of our approach are not comparable. In the related area of distributed computation, there has been research on producing MapReduce programs automatically, for example through specific rewrite rules [41] or synthesis [44]. The most relevant and recent work is GraSSP [16], which uses synthesis to parallelize a reference sequential implementation by analyzing data dependencies. To the best of our knowledge, since the (constant size) prefix information used in [16] is only a special case of our auxiliary accumulators, our approach subsumes theirs.

Synthesis and Concurrency. Synthesis techniques have been leveraged for parallel programs before. Instances include the synthesis of distributed map/reduce programs from input/output examples [44], optimization and parallelization of stencils [24, 45], synthesis of concurrent data structures [46], and concurrency bug repair [49]. Beyond the use of synthesis, these problem areas and their solutions have very little in common with this paper.

Parallelizing Compilers and Runtime Environments. Automatic parallelization in compilers is a prolific field of research, with source-to-source compilers using highly sophisticated methods to parallelize generic code [50, 11, 5, 37] or more specialized nested loops with polyhedral optimization [6, 7, 48]. There is a body of work specific to reductions and parallel-prefix computations [10, 22, 27] that deals with dependencies that cannot be broken. Breaking static dependencies at runtime, with Galois [40] being a good example of this category, is another approach to the auto-parallelization problem. Handling irregular reductions, where the operations in the loop body are not immediately associative, has been explored through techniques such as data replication or synchronization [23]. Unfortunately, there is not enough space to do the mountain of research on parallelizing compilers justice here. In contrast to correct source-to-source transformation achieved through provably correct program transformation rules, the aim of this paper is to use search (in the style of synthesis) in a space that includes many incorrect programs. This facilitates the discovery of equivalent parallel implementations that are not reachable through generally correct program transformations.

References

[1] ALON, N., MATIAS, Y., AND SZEGEDY, M. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-eighth Annual ACM Symposium on Theory of Computing (1996), STOC '96, pp. 20–29.

[2] ALUR, R., BODÍK, R., DALLAL, E., FISMAN, D., GARG, P., JUNIWAL, G., KRESS-GAZIT, H., MADHUSUDAN, P., MARTIN, M. M. K., RAGHOTHAMAN, M., SAHA, S., SESHIA, S. A., SINGH, R., SOLAR-LEZAMA, A., TORLAK, E., AND UDUPA, A. Syntax-guided synthesis. In Dependable Software Systems Engineering. 2015, pp. 1–25.

[3] APPEL, A. W. SSA is functional programming. SIGPLAN Not. 33, 4 (Apr. 1998), 17–20.

[4] BABCOCK, B., BABU, S., DATAR, M., MOTWANI, R., AND WIDOM, J. Models and issues in data stream systems. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (2002), PODS '02, pp. 1–16.

[5] BACON, D. F., GRAHAM, S. L., AND SHARP, O. J. Compiler transformations for high-performance computing. ACM Comput. Surv. 26, 4 (Dec. 1994), 345–420.

[6] BASTOUL, C. Efficient code generation for automatic parallelization and optimization. In Proceedings of the Second International Conference on Parallel and Distributed Computing (2003), ISPDC '03, pp. 23–30.

[7] BASTOUL, C. Code generation in the polyhedral model is easier than you think. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (2004), PACT '04, pp. 7–16.

[8] BEN-ASHER, Y., AND HABER, G. Parallel solutions of simple indexed recurrence equations. IEEE Trans. Parallel Distrib. Syst. 12, 1 (Jan. 2001), 22–37.

[9] BIRD, R. S. An introduction to the theory of lists. In Proceedings of the NATO Advanced Study Institute on Logic of Programming and Calculi of Discrete Design (1987), pp. 5–42.

[10] BLELLOCH, G. E. Prefix sums and their applications.

[11] BLUME, W., DOALLO, R., EIGENMANN, R., GROUT, J., HOEFLINGER, J., LAWRENCE, T., LEE, J., PADUA, D., PAEK, Y., POTTENGER, B., RAUCHWERGER, L., AND TU, P. Parallel programming with Polaris. Computer 29, 12 (Dec. 1996), 78–82.

[12] BOONE, W. W. The word problem. Proceedings of the National Academy of Sciences of the United States of America 44 (1958), 1061–1065.

[13] CHIN, W.-N., TAKANO, A., AND HU, Z. Parallelization via context preservation. In Proceedings of the 1998 International Conference on Computer Languages (1998), ICCL '98, pp. 153–162.

[14] CONTRERAS, G., AND MARTONOSI, M. Characterizing and improving the performance of Intel Threading Building Blocks. In 4th International Symposium on Workload Characterization (IISWC 2008), Seattle, Washington, USA, September 14-16, 2008 (2008), pp. 57–66.

[15] FARZAN, A., AND NICOLET, V. Automated synthesis of divide and conquer parallelism.

[16] FEDYUKOVICH, G., MAAZ BIN SAFEER, A., AND BODIK, R. Gradual synthesis for static parallelization of single-pass array-processing programs. In PLDI (2017).

[17] FISHER, A. L., AND GHULOUM, A. M. Parallelizing complex scans and reductions. In Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation (1994), PLDI '94, pp. 135–146.

[18] GESER, A., AND GORLATCH, S. Parallelizing functional programs by generalization. In Proceedings of the 6th International Joint Conference on Algebraic and Logic Programming (1997), ALP '97-HOA '97, pp. 46–60.

[19] GIBBONS, J. The third homomorphism theorem. J. Funct. Program. 6, 4 (1996), 657–665.

[20] GORLATCH, S. Systematic extraction and implementation of divide-and-conquer parallelism. In Proceedings of the 8th International Symposium on Programming Languages: Implementations, Logics, and Programs (1996), PLILP '96, pp. 274–288.

[21] GORLATCH, S. Extracting and implementing list homomorphisms in parallel program development. Sci. Comput. Program. 33, 1 (Jan. 1999), 1–27.

[22] HILLIS, W. D., AND STEELE JR., G. L. Data parallel algorithms. Communications of the ACM 29, 12 (1986), 1170–1183.

[23] HWANSOO, H., AND CHAU-WEN, T. A comparison of parallelization techniques for irregular reductions. In Parallel and Distributed Processing Symposium, Proceedings 15th International (2001), p. 27.

[24] KAMIL, S., CHEUNG, A., ITZHAKY, S., AND SOLAR-LEZAMA, A. Verified lifting of stencil computations. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (2016), PLDI '16, pp. 711–726.

[25] KEJARIWAL, A., D'ALBERTO, P., NICOLAU, A., AND POLYCHRONOPOULOS, C. D. A geometric approach for partitioning n-dimensional non-rectangular iteration spaces. In Proceedings of the 17th International Conference on Languages and Compilers for High Performance Computing (2005), LCPC '04, pp. 102–116.

[26] KELSEY, R. A. A correspondence between continuation passing style and static single assignment form. In Papers from the 1995 ACM SIGPLAN Workshop on Intermediate Representations (1995), IR '95, pp. 13–22.

[27] LADNER, R. E., AND FISCHER, M. J. Parallel prefix computation. Journal of the ACM (JACM) 27, 4 (1980), 831–838.

[28] LEINO, K. R. M. Dafny: An automatic program verifier for functional correctness. In Logic for Programming, Artificial Intelligence, and Reasoning (LPAR) (2010), pp. 348–370.

[29] MARCHÉ, C. Normalized rewriting: an alternative to rewriting modulo a set of equations. Journal of Symbolic Computation 21, 3 (1996), 253–288.

[30] MARCHÉ, C., AND URBAIN, X. Termination of associative-commutative rewriting by dependency pairs. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998, pp. 241–255.

[31] MORIHATA, A. A short cut to parallelization theorems. In ACM SIGPLAN International Conference on Functional Programming, ICFP '13, Boston, MA, USA, September 25-27, 2013 (2013), pp. 245–256.

[32] MORIHATA, A., AND MATSUZAKI, K. Automatic parallelization of recursive functions using quantifier elimination. In Functional and Logic Programming, 10th International Symposium, FLOPS 2010, Sendai, Japan, April 19-21, 2010, Proceedings (2010), pp. 321–336.

[33] MORIHATA, A., MATSUZAKI, K., HU, Z., AND TAKEICHI, M. The third homomorphism theorem on trees: downward & upward lead to divide-and-conquer. In Proceedings of the 36th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2009, Savannah, GA, USA, January 21-23, 2009 (2009), pp. 177–185.

[34] MORITA, K., MORIHATA, A., MATSUZAKI, K., HU, Z., AND TAKEICHI, M. Automatic inversion generates divide-and-conquer parallel programs. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (2007), PLDI '07, pp. 146–155.

[35] NARENDRAN, P., AND RUSINOWITCH, M. Any ground associative-commutative theory has a finite canonical system. Springer Berlin Heidelberg, Berlin, Heidelberg, 1991, pp. 423–434.

[36] NECULA, G. C., MCPEAK, S., RAHUL, S. P., AND WEIMER, W. CIL: Intermediate language and tools for analysis and transformation of C programs. 2002.

[37] PADUA, D. A., AND WOLFE, M. J. Advanced compiler optimizations for supercomputers. Commun. ACM 29, 12 (Dec. 1986), 1184–1201.

[38] PAPADIMITRIOU, C. H., AND SIPSER, M. Communication complexity. J. Comput. Syst. Sci. 28, 2 (1984), 260–269.

[39] PHEATT, C. Intel® Threading Building Blocks. Journal of Computing Sciences in Colleges 23, 4 (2008), 298–298.

[40] PINGALI, K., NGUYEN, D., KULKARNI, M., BURTSCHER, M., HASSAAN, M. A., KALEEM, R., LEE, T.-H., LENHARTH, A., MANEVICH, R., MÉNDEZ-LOJO, M., PROUNTZOS, D., AND SUI, X. The tao of parallelism in algorithms. SIGPLAN Not. 46, 6 (June 2011), 12–25.

[41] RADOI, C., FINK, S. J., RABBAH, R., AND SRIDHARAN, M. Translating imperative code to MapReduce. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications (2014), OOPSLA '14, pp. 909–927.

[42] RAYCHEV, V., MUSUVATHI, M., AND MYTKOWICZ, T. Parallelizing user-defined aggregations using symbolic execution. In Proceedings of the 25th Symposium on Operating Systems Principles (2015), SOSP '15, pp. 153–167.

[43] SATO, S., AND IWASAKI, H. Automatic parallelization via matrix multiplication. SIGPLAN Not. 46, 6 (June 2011), 470–479.

[44] SMITH, C., AND ALBARGHOUTHI, A. MapReduce program synthesis. SIGPLAN Not. 51, 6 (June 2016), 326–340.

[45] SOLAR-LEZAMA, A., ARNOLD, G., TANCAU, L., BODIK, R., SARASWAT, V., AND SESHIA, S. Sketching stencils. SIGPLAN Not. 42, 6 (June 2007), 167–178.

[46] SOLAR-LEZAMA, A., JONES, C. G., AND BODIK, R. Sketching concurrent data structures. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (2008), PLDI '08, pp. 136–148.

[47] TORLAK, E., AND BODÍK, R. Growing solver-aided languages with Rosette. In ACM Symposium on New Ideas in Programming and Reflections on Software, Onward! 2013, part of SPLASH '13, Indianapolis, IN, USA, October 26-31, 2013 (2013), pp. 135–152.

[48] VASILACHE, N., BASTOUL, C., AND COHEN, A. Polyhedral code generation in the real world. In Proceedings of the 15th International Conference on Compiler Construction (2006), CC '06, pp. 185–201.

[49] ČERNÝ, P., HENZINGER, T. A., RADHAKRISHNA, A., RYZHYK, L., AND TARRACH, T. Regression-free synthesis for concurrency. In Proceedings of the 16th International Conference on Computer Aided Verification - Volume 8559 (2014), pp. 568–584.

[50] WILSON, R., FRENCH, R., WILSON, C., AMARASINGHE, S., ANDERSON, J., TJIANG, S., LIAO, S., TSENG, C., HALL, M., LAM, M., AND HENNESSY, J. The SUIF compiler system: A parallelizing and optimizing research compiler. Tech. rep., Stanford, CA, USA, 1994.

[51] ZUCK, L. D., PNUELI, A., FANG, Y., GOLDBERG, B., AND HU, Y. Translation and run-time validation of optimized code. Electr. Notes Theor. Comput. Sci. 70, 4 (2002), 179–200.