Scalable Conditional Induction Variables (CIV) Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Scalable Conditional Induction Variables (CIV) Analysis ifact Cosmin E. Oancea Lawrence Rauchwerger rt * * Comple A t te n * A te s W i E * s e n l C l o D C O Department of Computer Science Department of Computer Science and Engineering o * * c u e G m s E u e C e n R t v e o d t * y * s E University of Copenhagen Texas A & M University a a l d u e a [email protected] [email protected] t Abstract k = k0 Ind. k = k0 DO i = 1, N DO i =1,N Var. DO i = 1, N IF(cond(b(i)))THEN Subscripts using induction variables that cannot be ex- k = k+2 ) a(k0+2*i)=.. civ = civ+1 )? pressed as a formula in terms of the enclosing-loop indices a(k)=.. Sub. ENDDO a(civ) = ... appear in the low-level implementation of common pro- ENDDO k=k0+MAX(2N,0) ENDIF ENDDO gramming abstractions such as filter, or stack operations and (a) (b) (c) pose significant challenges to automatic parallelization. Be- Figure 1. Loops with affine and CIV array accesses. cause the complexity of such induction variables is often due to their conditional evaluation across the iteration space of its closed-form equivalent k0+2*i, which enables its in- loops we name them Conditional Induction Variables (CIV). dependent evaluation by all iterations. More importantly, This paper presents a flow-sensitive technique that sum- the resulted code, shown in Figure 1(b), allows the com- marizes both such CIV-based and affine subscripts to pro- piler to verify that the set of points written by any dis- gram level, using the same representation. Our technique re- tinct iterations do not overlap, a.k.a., output independence: quires no modifications of our dependence tests, which is k0+2*i1 = k0+2*i2 ) i1=i2. It follows that the loop in agnostic to the original shape of the subscripts, and is more Figure 1(b) can be safely parallelized. powerful than previously reported dependence tests that rely A known difficulty in performing this analysis arises on the pairwise disambiguation of read-write references. when subscripts use scalars that do not form a uniform re- We have implemented the CIV analysis in our paralleliz- currence, i.e., their stride (increment) is not constant across ing compiler and evaluated its impact on five Fortran bench- iteration space of the analyzed loop. We name such vari- marks. We have found that that there are many important ables conditional induction variables or CIV because they loops using CIV subscripts and that our analysis can lead to are typically updated (only) on some of the possible control- their scalable parallelization. This in turn has led to the par- flow paths of an iteration. For example, the loops in Fig- allelization of the benchmark programs they appear in. ures 1(a) and (c) are similar, except that in (c) both civ and the array update are performed only when condition cond(b(i)) holds. Although the recurrence computing civ 1. Introduction values can still be parallelized via a scan [2] (prefix sum) An important step in the automatic parallelization of loops is precomputation, neither CIV nor the subscript can be sum- the analysis of induction variables and their transformation marized as affine expressions of loop index i, and hence the into a form that allows their parallel evaluation. In the case of previous technique would fail. “well behaved” induction variables, that take monotonic val- This paper presents a novel induction-variable analysis ues with a constant stride, their generating sequential recur- that allows both affine and CIV based subscripts to be sum- rence can be substituted with the evaluation of closed form marized at program level, using one common representation. expression of the loop indices. This transformation enables Current solutions [3, 8, 10, 23, 26] use a two-step ap- the parallel evaluation of the induction variables. When these proach: First, CIV scalars are recognized and their proper- variables are used to form addresses of shared data struc- ties, such as monotonicity and the way they evolve inside tures, the memory references can be statically analyzed and and across iterations, are inferred. Intuitively, in our exam- possibly parallelized. Thus we can conclude that the analy- ple, this corresponds to determining that the values of civ sis of induction-variables use is crucial to loop dependence are increasing within the loop with step at most 1. This pa- analysis [1, 7, 18] and their subsequent parallelization. per does not contribute to this stage, but exploits previously For example, the loop in Figure 1(a) increments by two developed techniques [23]. Second and more relevant, each the value of k (produced by the previous iteration), and up- pair of read-write subscripts is disambiguated, by means of dates the kth element of array a. The uniform incremen- specialized dependency tests that exploit the CIV’s proper- tation of k allows to substitute this recurrence on k with ties. In our example, the cross-iteration monotonicity of the 2015 IEEE/ACM International Symposium on Code Generation and Optimization 978-1-4799-8161-8/15/$31.00 c 2015 IEEE 1-4799-8161-8/15/$31.00 ©2015 IEEE 213 CIV values dictates that the CIV value, named civ2 after our analysis simplifies CIV-summary expressions enabling being incremented in some iteration i2 is strictly greater successful verification of the relevant invariants. Second, than the value, named civ1 of any previous iteration i1. It our analysis extracts the slice of the loop that computes the follows by induction that the update of a(civ) cannot re- CIV values and evaluates it in parallel before loop execution. sult in cross-iteration (output) dependencies, i.e., the system This allows: (i) runtime verification of CIV monotonicity civ2-civ1 ≥ 1 and civ2 =civ1 has no solution. whenever this cannot be statically established, (ii) sufficient Other summarization techniques [9, 14, 16] aggregate ar- conditions for safe parallelization to be expressed in terms ray references across control-flow constructs, e.g., branches, of the CIV values at loop’s entry, end, and anywhere in be- loops, and model dependency testing as an equation on the tween, and (iii) safe parallelization in the simple case when resulted abstract sets. In practice, they were found to scale CIV are not used for indexing. The latter is not reported in better than previously developed analysis based on pairwise related work. In summary, main contributions are: accesses [9], but they do not support CIVs. • A flow-sensitive analysis that extends program level sum- This paper proposes an extension to summary-based anal- marization to support CIV-based indices and leads to the ysis that allows CIV based subscripts to use the same rep- parallelization of previously unreported loops, resentation as the affine subscripts. This enables scalable memory reference analysis without modification of the pre- • A non-trivial code generation that separates the program viously developed dependency tests. The gist of our tech- slice that performs the parallel-prefix-sum evaluation of nique is to aggregate symbolically the CIV references on ev- the CIV values from the main (parallel) computation in ery control-flow path of the analyzed loop, in terms of the which they are inserted. CIV values at the entry and end of each iteration or loop. • An evaluation of five difficult to parallelize benchmarks The analysis succeeds if (i) the symbolic-summary results with important CIV-subscripted loops, which measures are identical on all paths and (ii) they can be aggregated all runtime overheads and shows application level speed- across iterations in the interval domain. ups as high as 7:1× and on average 4:33× on 8 cores. We demonstrate the technique on the loop in Figure 1(c), Speedups of two of this paper benchmarks shown were where we use civi and civi to denote the values of civ µ b also reported in [14]. They were obtained in part by using at the entry and end of iteration i, respectively. On the the CIV technique that is presented here for the first time. THEN path, i.e., cond(b(i)) holds, the write set of array is interval W = i i , i.e., point f i g, a i [civµ+1,civb] civµ+1 2. An Intuitive Demonstration because the incremented value of civ is live at the iteration end. On the ELSE path, the write set of a is the empty set. For Figure 2(a) shows a simplified version of loop CORREC do401 uniformity of representation we express this empty set as an from BDNA benchmark, as an example of non-trivial loop interval with its lower bound greater than its upper bound. that uses both CIV and affine-based subscripts. Variable civ In particular, the interval of the THEN path matches, because uses (gated) single-static-assignment (SSA) notation [25]: i i For example, statement civ@2=γ(i.EQ.M,civ@1,civ@4) civ is not updated and so civµ+1>civb. Aggregating Wk across the first i-1 iterations results in has the semantics that variable civ@2 takes, depending on i−1 1 i the evaluation of i.EQ.M, either the value of civ@1 for the [k=1Wk = [civµ+1,civµ], where we have used civ val- k−1 k first iteration of the loop that starts at M, or the value of civ@4 ues’ monotonicity and the implicit invariant civb ≡civµ, i.e., the civ value at an iteration end is equal to the civ value for all other iterations.