Automatic Correction of Loop Transformations

Nicolas Vasilache, Albert Cohen, Louis-Noël Pouchet
ALCHEMY Group, INRIA Futurs and Paris-Sud 11 University
fi[email protected]

Abstract

Loop nest optimization is a combinatorial problem. Due to the growing complexity of modern architectures, it involves two increasingly difficult tasks: (1) analyzing the profitability of sequences of transformations to enhance parallelism, locality, and resource usage, which amounts to a hard problem on a non-linear objective function; (2) the construction and exploration of a search space of legal transformation sequences. Practical optimizing and parallelizing compilers decouple these tasks, resorting to a predefined set of enabling transformations to eliminate all sorts of optimization-limiting semantical constraints. State-of-the-art optimization heuristics face a hard decision problem on the selection of enabling transformations, only remotely related to performance.

We propose a new design where optimization heuristics first address the main performance anomalies, then correct potentially illegal loop transformations a posteriori, attempting to minimize the performance impact of the necessary adjustments. We propose a general method to correct any sequence of loop transformations through a combination of loop shifting, code motion and index-set splitting. Sequences of transformations are modeled by compositions of geometric transformations on multidimensional affine schedules. We provide experimental evidence of the scalability of the algorithms on real loop optimizations.

1. Introduction

Loop tiling, fusion, distribution, shifting as well as unimodular transformations are pervasive tools to enhance locality and parallelism. In most real-world programs, a blunt use of these transformations violates the causality of the computation. On the bright side, designers observed that a sequence of systematic preconditioning steps can often eliminate these violations; these steps are called enabling transformations. On the other side, this puts a heavy burden on any locality and parallelism enhancement heuristic [20]: the hard profitability problem it deals with is complicated by a decision problem to identify enabling transformations. Loop shifting or pipelining [6] and loop skewing [27] are such enabling transformations of the schedule, while loop peeling isolates violations on boundary conditions [13] and variable renaming or privatization removes dependences. Unfortunately, except in special cases (e.g., the unimodular transformation framework), one faces the combinatorial decision problem of choosing enabling transformations with no guarantee that they will do any good. This approach does not scale to real programs, involving sequences of tens or hundreds of carefully selected enabling transformations.

Our method makes these enabling transformations obsolete, replacing them by a much more tractable "apply and correct" approach. While selecting these enabling transformations a priori is a complex decision problem involving legality and profitability, our method takes the opposite direction: we compute the transformations needed to make a candidate sequence legal a posteriori.

We rely on the polyhedral model, which promoted affine schedules as an abstract intermediate representation of the program semantics. The main algorithm to exhibit a "good" legal schedule in this model has been proposed by Feautrier [11] and was recently improved [12]. This algorithm, and its main extensions [19, 15, 3], rely on linear optimization models and suffer from multiple sources of combinatorial complexity: (1) the number of polyhedra it considers is exponential in the program size; (2) the dimensionality of the polyhedra is proportional to the program size, which incurs an exponential complexity in the ILP the algorithm relies on.

Radical improvements are needed to scale these algorithms to real-size loop nests (a few hundred statements or more), and to complement linear programming with more pragmatic, empirical operation research heuristics. Our contribution reduces the dimensions of this combinatorial problem: it allows operation research heuristics to focus on the linear part of the schedule and to perform a simpler, scalable correction pass on the constant part of the schedule.

Our first contribution is an automatic correction algorithm to fix an illegal transformation sequence with "minimal changes", based on the analysis of dependence violations. This correction scheme amounts to translating (a.k.a. shifting) the schedule until dependences are satisfied. We prove two important properties: (1) the algorithm is sound and complete: any multidimensional affine schedule will be corrected as long as dependence violations can be corrected by translation; (2) it applies a polynomial number of operations on dependence polyhedra.

Our second contribution is a novel, more practical formulation of index-set splitting: it is the most general technique to decompose iteration domains, allowing to build more expressive piecewise affine schedules. Its state-of-the-art formulation is another complex decision problem [13] associated with a non-scalable scheduling algorithm. Our approach replaces it by a simple heuristic, combined with the automatic correction procedure. This is extremely helpful, considering that each decision problem has a combinatorial impact on phase selection, ordering and parametrization. The ability to directly control code size expansion is another advantage of this reformulation.

Our third contribution is a practical implementation of these algorithms in the URUK platform [14]. Building on recent scalability improvements in polyhedral compilation [25], we show the effectiveness of our automatic correction scheme on two full SPEC CPU2000fp benchmarks.

The paper is structured as follows. We discuss related work in Section 2. The polyhedral model and array dependence analysis are briefly introduced in Section 3. We then define the correction problem in Section 4 and a complete algorithm to fix an illegal transformation by means of affine translations. Section 5 revisits and improves index-set splitting in the context of schedule corrections. Section 6 presents experimental results.

2. Related Work

The induced combinatorics of enabling transformation decision problems drive the search for a more compositional approach, where transformation legality is guaranteed by construction, using the Farkas lemma on affine schedules. As stated in the introduction, this is one of our main motivations, but taking all legality constraints into account makes the algorithms not scalable, and suggests staging the selection of a loop nest transformation into an "approximately correct" combinatorial step – addressing the main performance concerns – and a second correction step. The idea of applying "after-thought" corrections on affine schedules was first proposed by Bastoul [3]. However, it relies on considering the program and all its dependences at each step of the resolution, and so does not scale with program size.

We follow the idea of correcting illegal schedules, but focus on a particular kind of correction that generalizes loop shifting, fusion and distribution [1]. These three classical (operational) program transformations boil down to the same algebraic translation in multidimensional affine scheduling. These are some of the most important enabling transformations in the context of automatic loop parallelization [1, 6]. They induce less code complexity than, e.g., loop skewing (which may degrade performance due to the complexity of loop bound expressions), and can cope with a variety of cyclic dependence graphs. Yet the maximal parallelism problem, seen as a decision one, has been proved NP-complete by Darte [6]. This results from the inability to represent coarse-grain data-parallelism in loops in a linear fashion (many data-parallel programs cannot be expressed with multidimensional schedules [19]). Although we use the same tools as Darte to correct schedules, we avoid this pitfall by stating our problem as a linear optimization.

Decoupling research on the linear and constant parts of the schedule has been proposed earlier [7, 6, 26]. These techniques all use a simplified representation of dependences (i.e., dependence vectors) and rely on finding enabling transformations for their specific purpose, driven by optimization heuristics. Our approach is more general, as it applies to any potentially illegal affine schedule. Alternatively, Crop and Wilde [5] and Feautrier [12] attempt to reduce the complexity of affine scheduling with a modular or structural decomposition. These algorithms are effective, but still resort to solving large integer-linear programs. They are complementary to our correction scheme.

Our contribution also helps improving the productivity of domain experts. A typical case is the design of domain-specific program generators [22, 8], a pragmatic solution to the design of adaptive, nearly optimal libraries. Although human-written code is concise (code factoring in loops and functions), optimizing it for modern architectures incurs several code-size increasing steps, including function inlining, specialization, loop versioning, tiling, pipelining and unrolling. Domain-specific program generators (also known as active libraries) rely on feedback-directed and iterative optimization. Our approach focuses the problem on the dominant performance anomalies, eliminating the majority of the transformation steps (the enabling ones).

3. Polyhedral Model

The polyhedral model is a thorough tool to reason about analysis, optimization and parallelization of loop nests. We use the notations of the URUK framework [14], a normalized representation of programs and transformations. Under the normalization invariants of this framework, it is possible to apply any sequence of transformations without worrying about its legality, lifting the tedious constraint of ensuring the legality of a transformation before applying it. Only after applying a complete transformation sequence do we care about its correctness as a whole.

Polyhedral Representation. The target of our optimizations are sequences of loop nests with constant strides and affine bounds. This includes non-rectangular, non-perfectly nested loops, and conditionals with Boolean expressions of affine inequalities. Loop nests fulfilling these hypotheses are amenable to representation in the polyhedral model. We call Static Control Part (SCoP) any maximal syntactic program segment satisfying these constraints [14]. Invariant variables within a SCoP are called global parameters, and referred to as g; the dimension of g is denoted by d_g.¹ In classical loop nest optimization frameworks, the basic unit of transformation is the loop; the polyhedral model increases expressiveness, applying transformations individually to each statement. For each statement S within a SCoP, we extract the following information:

• its iteration domain D^S;

• its schedule Θ^S, which follows the lexicographic order on multidimensional execution dates;

• the set of all its memory references of the form ⟨X, f^S(i^S)⟩, where X is an array and f^S is its affine subscript function. We also consider more general non-affine references with conservative approximations.

Each iteration i^S of a statement S is called an instance, and is denoted by ⟨S, i^S⟩.

Polyhedral compilation usually distinguishes between three steps: first, represent an input program in the formalism; then apply a transformation to this representation; and finally generate the target (syntactic) code. It is well known that arbitrarily complex sequences of loop transformations can be captured in one single transformation step of the polyhedral model [27, 14]; this includes parallelism extraction and exploitation [12, 19, 15]. Yet to ease the composition of program transformations on the polyhedral representation, we further split the representation of the schedule Θ into smaller, interleaved, matrix and vector parts:

• the A^S part is a square matrix of size (d_S, d_S) and expresses the speed at which different statement instances execute along a given dimension;

• the Γ^S matrix of size (d_S, d_g) allows to perform multidimensional shifting with respect to the global parameters;

• the β^S vector of size (d_S + 1) encodes the syntactical, loop-independent interleaving of statements at every loop depth.

Such encodings with 2d_S + 1 dimensions were previously proposed by Feautrier [11], then by Kelly and Pugh [16]. We refer the reader to the URUK framework for a complete formalization and analysis [14]. For improved readability, we use a small example at depth 1 (i.e., domain dimensions d_S1 = d_S2 = 1) and one parameter N (i.e., d_g = 1).

  [Figure 1. Original code (truncated in this copy: "for (i=0; i<...").]
  [Figure 2. Fusion.]
  [Figure 3. Shift: A^S1 = [1], A^S2 = [1]; β^S1 = [0,0], β^S2 = [0,1]; Γ^S1 = [0,0], Γ^S2 = [1,−3].]
  [Figure 4. Slow: A^S1 = [2], A^S2 = [1]; β^S1 = [0,0], β^S2 = [0,1]; Γ^S1 = [0,0], Γ^S2 = [0,0].]

The original code is shown on Figure 1. Using our framework, it is natural to express transformations like fusion, fission and code motion by simple operations on the β vectors (Figure 2), even for misaligned loops with different parametric bounds [25]. Shifting by a constant amount or by a parametric amount (Figure 3) is equally simple, with operations on the Γ. We also show the result of slowing S1 by a factor 2 with respect to S2, operating on A^S1 (Figure 4). It is the duty of the final code generation algorithm [24, 2] to reconstruct loop nests from affine schedules and iteration domains.

An important property is that modifications of β and Γ are commutative and trivially reversible. It is obvious that this does not hold in AST representations of the program. Typically, on Figure 3, one would need to rebuild the original statement S1 from the prologue and kernel when working with an AST representation, whereas in the polyhedral model, S1 is still a single statement until the final regeneration of the code, which alleviates such annoying pattern-matching based reconstructions. This crucial observation is fundamental for defining a complete correction scheme, avoiding the traditional decision problems associated with the selection of enabling transformations.

¹ We insert the non-parametric constant as the last element of g and will not distinguish it through the remainder of the paper.

Dependence Analysis. Array dependence analysis is the starting point for any polyhedral optimization. It computes non transitively-redundant, iteration-vector-to-iteration-vector, directed dependences [9, 25]. In order to correct dependence violations, it is first needed to compute the exact dependence information between every pair of instances, i.e., every pair of statement iterations. Considering a pair of statements S and T accessing memory locations where at least one of them is a write, there is a dependence from an instance ⟨S, i^S⟩ of S to an instance ⟨T, i^T⟩ of T (or ⟨T, i^T⟩ depends on ⟨S, i^S⟩) if and only if the following instance-wise conditions are met: (1) both instances belong to the corresponding statement iteration domains: D^S · i^S ≥ 0 and D^T · i^T ≥ 0; (2) both instances refer to the same memory location: f^S · i^S = f^T · i^T; and (3) the instance ⟨S, i^S⟩ is executed before ⟨T, i^T⟩ in the original execution: Θ^S · i^S ≪ Θ^T · i^T, where ≪ is the lexicographic order on vectors.

The multidimensional logical date at which an instance is executed is determined, for statement S, by the (2d_S + 1)-vector given by Θ^S · i^S. The schedule of statement instances is given by the lexicographic order of their schedule vectors and is the core of the code generation algorithm.

The purpose of dependence analysis is to compute a directed dependence multi-graph DG. Unlike traditional reduced dependence graphs, an arc S → T in DG is labeled by a polyhedron capturing the pairs of iteration vectors (i^S, i^T) in dependence.

Violated Dependences. After transforming the SCoP, the question arises whether the resulting program still executes correct code. Our approach consists in saving the dependence graph before applying any transformation [25], then applying a given transformation sequence, and eventually running a legality analysis at the very end of the sequence.

We consider a dependence from S to T in the original code and we want to determine whether it has been preserved in the transformed program. Violated dependence analysis [25] efficiently computes the iterations of the Cartesian product space D^S × D^T that were in a dependence relation in the original program and whose order has been reversed by the transformation. These iterations, should they exist, do not preserve the causality of the original program. Let δ^{S→T} denote the dependence polyhedron from S to T; we are looking for the exact set of iterations of δ^{S→T} such that there is a dependence from T to S at transformed depth p. By reasoning in the transformed space, it is straightforward to see that the set of iterations that violate the causality condition is the intersection of a dependence polyhedron with the constraint set Θ^S · i^S ≥ Θ^T · i^T.

We denote a violated dependence polyhedron at depth p by VIO_p^{S→T}. We also define a slackness polyhedron at depth p, which contains the set of points originally in dependence that are still executed in correct order after transformation. Such a polyhedron will be referred to as SLA_p^{S→T}.

4. Correction by Shifting

We propose a greedy algorithm to incrementally correct violated dependences, from the outermost to the innermost nesting depth. This algorithm addresses a similar problem as retiming, in an extended multidimensional case with parametric affine dependences [18, 6]. The basic idea is to correct violations via iterative translation of the schedules; these translations reflect into loop shifting for the case of loop-carried violated dependences, and into loop fusion or distribution for the loop-independent case [25].

4.1. Violation and Slackness

At depth p, the question arises whether a subset of those iterations – whose dependence has not yet been resolved up to depth p – are in violation and must be corrected. When correcting loop-carried violations, we define the following affine functions from Z^{d_S+d_T+d_g+1} to Z^{d_g+1}:²

• ∆VIOΘ_p^{S→T} = A^S_{p,•} − A^T_{p,•} + Γ^S_{p,•} − Γ^T_{p,•}, which computes the amount of time units at depth p by which a specific instance of T executes before one of S;

• ∆SLAΘ_p^{S→T} = −A^S_{p,•} + A^T_{p,•} − Γ^S_{p,•} + Γ^T_{p,•}, which computes the amount of time units at depth p by which a specific instance of S executes before one of T.

Then, let us define the parametric extremal values of these two functions:

• SHIFT^{S→T} = max_{X ∈ D^S × D^T} { ∆VIOΘ_p^{S→T} · X > 0 | X ∈ VIO_p^{S→T} }

• SLACK^{S→T} = min_{X ∈ D^S × D^T} { ∆SLAΘ_p^{S→T} · X ≥ 0 | X ∈ SLA_p^{S→T} }

We use the parametric integer linear program solver PIP [10] to perform these computations. The result is a piecewise, quasi-affine function of the parameters (affine with additional parameters to encode modulo operations). It is characterized by a disjoint union of polyhedra where this function is quasi-affine. This piecewise affine function is encoded as a parametric quasi-affine selection tree – or quast – [10, 9].

The loop-independent case is much simpler. Each violated dependence must satisfy conditions on β:

• if VIO_p^{S→T} ≠ ∅ ∧ β^S_p > β^T_p, then SHIFT^{S→T} = β^S_p − β^T_p;

• if VIO_p^{S→T} ≠ ∅ ∧ β^S_p ≤ β^T_p, then SLACK^{S→T} = β^T_p − β^S_p.

The correction problem can then be reformulated as finding a solution to a system of differential constraints on parametric quasts. For any statement S, we shall denote by c_S the unknown amount of correction: a piecewise, quasi-affine function; c_S is called the shift amount for statement S. The problem we need to solve becomes:

  ∀(S,T) ∈ SCoP:  c_T − c_S ≤ −SHIFT^{S→T}  and  c_T − c_S ≤ SLACK^{S→T}

Such a problem can be solved with a variant of the Bellman-Ford algorithm [4], with piecewise quasi-affine functions (quasts) labeling the edges of the graph. Parametric case distinction arises when considering addition and maximization of the shift amounts resulting from different dependence polyhedra. In addition, for the correctness proofs of the Bellman-Ford algorithm to hold, triangular inequalities and transitivity of the ≤ operator on quasts must also hold. The algorithm in Figure 5 allows to maintain case disjunction while enforcing all the required properties for any configuration of the parameters. At each step of the separation algorithm, two cases are computed and tested for emptiness. The SHIFT^COND predicate holds the linear constraints on the parameters for a given violation to occur. Step 15 implements the complementary check.³ As an optimization, it is often possible to extend disjoint conditionals by continuity, which reduces the number of versions (associated with different parameter configurations), hence reduces complexity and the resulting code size. For example:

  if (M = 3) then 2 else if (M ≥ 4) then M − 1   ≡   if (M ≥ 3) then M − 1

Experimentally, performing this post-processing allows up to 20% less versioning at each correction depth. Given the multiplicative nature of duplications, this can translate into exponentially smaller generated code.

  SeparateMinShifts: Eliminate redundancy in a list of shifts
  Input: redundantlist: list of redundant shift amounts and conditions
  Output: non-redundant list of shift amounts and conditions
     resultinglist ← empty list
   1 while (redundantlist not empty)
   2   if (resultinglist is empty)
   3     resultinglist.append(redundantlist.head)
   4     redundantlist ← redundantlist.tail
   5   else
   6     tmplist ← empty list
   7     SHIFT1 ← redundantlist.head
   8     redundantlist ← redundantlist.tail
   9     while (resultinglist not empty)
  10       SHIFT2 ← resultinglist.head
  11       resultinglist ← resultinglist.tail
  12       cond1 ← SHIFT1^COND ∧ SHIFT2^COND ∧ (SHIFT2 < SHIFT1)
  13       if (cond1 not empty)
  14         tmplist.append(cond1, SHIFT1)
  15       cond2 ← SHIFT1^COND ∧ SHIFT2^COND ∧ (SHIFT2 ≥ SHIFT1)
  16       if (cond2 not empty)
  17         tmplist.append(cond2, SHIFT2)
  18       if (cond1 is empty and cond2 is empty)
  19         tmplist.append(SHIFT1^COND, SHIFT1)
  20         tmplist.append(SHIFT2^COND, SHIFT2)
  21     resultinglist ← tmplist
  22 return resultinglist

  Figure 5. Conditional separation

² We write A_{p,•} to express the p-th line of A, and A_{1..p−1,•} for lines 1 to p − 1.

³ Unlike the more costly separation algorithm by Quilleré [23] used for code generation, this one only needs intersections (no complement or difference).

Notice however that shifting T by the SHIFT^{S→T} amount does not fully solve the dependence problem. It solves it for the subset of points { X ∈ VIO_p^{S→T} | ∆VIOΘ_p^{S→T} · X < SHIFT^{S→T} }. The remaining points – the facet of the polyhedron such that { X ∈ VIO_p^{S→T} | ∆VIOΘ_p^{S→T} · X = SHIFT^{S→T} } – are carried for correction at the next depth and will eventually be solved at the innermost level, as will be seen shortly.

For a loop-independent VDG_p, the correction is simply done by reordering the β_p values. No special computation is necessary, as only the relative values of β^S_p and β^T_p are needed to determine the sequential order. When such a dependence is violated, it means the candidate dependence polyhedron VIO_p^{S→T} is not empty and β^S_p > β^T_p. The correction algorithm forces the synchronization of the loop-independent schedules by setting β^S_p = β^T_p and carries VIO_p^{S→T} to be corrected at depth p + 1.

4.2. Constraints Graphs

We outline the construction of the constraints graph used in the correction. For depth p, the violated dependence graph VDG_p is a directed multigraph where each node represents a statement in the SCoP. The construction of VDG_p proceeds as follows. For each violated polyhedron VIO_p^{S→T}, the minimal necessary correction is computed and results in a parametric conditional and a shift amount. We add an edge between S and T in VDG_p of type V (violation), decorated with the tuple (VIO_p^{S→T}, SHIFT^COND, −SHIFT^{S→T}). When V type arcs are considered, they bear the minimal shifting requirement by which T must be shifted for the corrected values of Θ^S and Θ^T to nullify the violated polyhedron VIO_p^{S→T}.

For each original dependence δ^{S→T} in the DG that has not been completely solved up to the current depth, we add an edge between S and T in VDG_p of type S (slackness). Such an edge is decorated with the tuple (VIO_p^{S→T}, SLACK^COND, SLACK^{S→T}). When S type edges are considered, they bear the maximal shifting allowed for S so that the causality relation S → T is ensured. If at any time a node is shifted by a quantity bigger than one of the maximal allowed outgoing slacks, it will give rise to new outgoing shift edges.

4.3. The Algorithm

The correction algorithm is interleaved with the incremental computation of the VDG at each depth level. The fundamental reason is that corrections at previous depths need to be taken into account when computing the violated dependence polyhedra at the current level. The main idea for depths p > 0 is to shift targets of violated dependences by the minimal shifting amount necessary. If any of those incoming shifts is bigger than any outgoing slack, the outgoing slacks turn into new violations that need to be corrected. During the graph traversal, any node is seen at most |V| − 1 times, where V is the set of vertices. At each traversal, we gather the previously computed corrections along with incoming violations and we apply the separation phase of Figure 5. For loop-carried dependences, the algorithm to correct a node is outlined in Figure 6.

  CorrectLoopDependentNode: Corrects a node by shifting it
  Input: node: a node in VDG_p
  Output: a list of parametric shift amounts and conditionals for the node
     corrections ← corrections of node already computed in previous passes
   1 foreach (edge (S, node, V_i) incoming into node)
   2   compute minimal SHIFT^{S→node} and SHIFT^COND
   3   corrections.append(SHIFT^{S→node}, SHIFT^COND)
   4 if (corrections.size > 1)
   5   corrections ← SeparateMinShifts(corrections)
   6 foreach (edge (node, T) outgoing from node)
   7   foreach (corr in corrections)
   8     compute a new shift VIO_p^{node→T} using corr for node
   9     if (VIO_p^{node→T} not empty)
  10       addedge(node, T, V, VIO_p^{node→T}) to VDG_p
  11     else
  12       compute a new slack SLA_p^{node→T} using corr for node
  13       addedge(node, T, S, SLA_p^{node→T}) to VDG_p
  14   removeedge(edge) from VDG_p
  15 return corrections

  Figure 6. Shifting a node for correction

We use an incrementally growing cache to speed up polyhedral computations, as proposed in [25]. Step 8 of Figure 6 uses PIP to compute the minimal incoming shift amount; it may introduce case distinctions, and since we label edges in the VDG with single polyhedra and not quasts, it may result in inserting new outgoing edges. When many incoming edges generate many outgoing edges, Step 11 separates these possibly redundant amounts using the algorithm formerly given in Figure 5. In practice, PIP can also introduce new modulo parameters for a certain class of ill-behaved schedules. These must be treated with special care, as they will expand d_g and are generally a hint that the transformation will eventually generate code with many internal modulo conditionals. On the other hand, for loop-independent corrections all quantities are just integer differences, without any case distinction. The much simpler algorithm is just a special case.

  CorrectLoopDependent: Corrects a VDG by shifting
  Input: VDG: the violated dependence graph at depth p
  Output: a list of parametric shift amounts and conditionals for the nodes
     corrections ← empty list
   1 for (i = 1 to |V| − 1)
   2   nodelist ← nodes(VDG) with incoming edge of type V
   3   foreach (node in nodelist)
   4     corrections.append(CorrectLoopDependentNode(node))
   5 return corrections

  Figure 7. Correcting a VDG

The full algorithm recursively computes the current VDG for each depth p, taking into account previously corrected dependences, and is outlined in Figure 8. Termination and soundness are straightforward, from those of the Bellman-Ford version [4] applied successively at each depth on the SHIFT amount.

  CorrectSchedules: Corrects an illegal schedule
  Input: program: a program in URUK form
         dependences: the list of polyhedral dependences of the program
  Output: corrected program in URUK form
   1 Build VDG_0
   2 correctionList ← CorrectLoopIndependent(VDG_0)
   3 commit shifts in correctionList
   4 for (p = 1; p <= max_{S ∈ SCoP} {rank(β^S)})
   5   Build VDG_p
   6   correctionList ← CorrectLoopDependent(VDG_p)
   7   commit shifts in correctionList
   8   correctionList ← CorrectLoopIndependent(VDG_p)
   9   commit shifts in correctionList

  Figure 8. Correction algorithm

Lemma 1 (Depth-p Completeness) If shifting amounts satisfying the system of violation and slackness constraints at depth p exist, the correction algorithm removes all violations at depth p.

The proof derives from the completeness of Bellman-Ford's algorithm. Determining the minimal correcting shift amounts to computing the maximal value of linear multivariate functions (∆VIOΘ_p^{S→T} and ∆SLAΘ_p^{S→T}) over bounded parametrized convex polyhedra (VIO_p^{S→T} and SLA_p^{S→T}). This problem is solved in the realm of parametric integer linear programming. The separation algorithm ensures equality or incompatibility of the conditionals enclosing the different amounts. The resulting quasts therefore satisfy the transitivity of the operations of max, min, + and ≤. When VDG_p has no negative weight cycle, the correction at depth p succeeds; the proof is the same as for the Bellman-Ford algorithm and can be found in [4].

As mentioned earlier, the computed shift amounts are minimal such that the schedule at a given depth does not violate any dependence; full resolution of those dependences is carried for correction at the next level. Another solution would be to shift target statements by SHIFT^{S→T} + 1, but this is deemed too intrusive. Indeed, this amounts to adding −1 to all negative edges on the graph, potentially making the correction impossible.

The question arises whether a shift amount chosen at a given depth may interfere with the correction algorithm at a higher depth. The next lemma guarantees it is not the case.

Lemma 2 Correction at a given depth by the minimal shift amount does not hamper correction at a subsequent depth.

For p > 1, if VDG_p contains a negative weight cycle, VDG_{p−1} contains a null weighted slackness cycle traversing the same nodes. By construction, any violated edge at

AS1 = [1] AS2 = [1] AS3 = [1] AS1 = [1] AS2 = [1] AS3 = [1] AS1 = [1] AS2 = [1] AS3 = [1] β β β β , β , β , β , β , β , S1 = [0,0] S2 = [1,0] S3 = [2,0] S1 = [2 0] S2 = [1 0] S3 = [0 0] S1 = [2 0] S2 = [2 0] S3 = [2 0] Γ Γ Γ Γ Γ Γ Γ Γ Γ S1 = [0,0] S2 = [0,0] S3 = [0,0] S1 = [0,0] S2 = [0,0] S3 = [0,0] S1 = [0,0] S2 = [0,0] S3 = [0,0]

Figure 9. Original code Figure 10. Illegal schedule Figure 11. After correcting
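To make the violation concrete, the logical dates of Section 3 can be replayed on these schedules. The sketch below is an illustrative model (not the URUK implementation): each instance ⟨S, i⟩ is assigned the date (β_0, A·i, β_1), and the RAW dependence from S1's write of A[i+1] to S3's read is checked under the β vectors of Figures 9 and 10.

```python
# Illustrative model of the 2d+1 logical dates (beta_0, A*i, beta_1)
# for the depth-1 statements of Figures 9 and 10 (A = [1], Gamma = [0,0]).

def date(beta, i, A=1):
    # Execution order is the lexicographic order on these tuples.
    return (beta[0], A * i, beta[1])

orig = {'S1': (0, 0), 'S2': (1, 0), 'S3': (2, 0)}      # Figure 9
illegal = {'S1': (2, 0), 'S2': (1, 0), 'S3': (0, 0)}   # Figure 10

# RAW: <S1, i+1> writes A[i+1], which <S3, i> reads.
i = 3
assert date(orig['S1'], i + 1) < date(orig['S3'], i)        # original: legal
assert date(illegal['S1'], i + 1) > date(illegal['S3'], i)  # reversed: violated
```

Python tuples compare lexicographically, so the `<` test matches the ≪ order on schedule vectors directly.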

  Figure 12. Outline of the correction for p = 0. [Three successive snapshots of VDG_0 over the nodes S1, S2, S3, each labeled with its β vector (B); violation edges carry the shift amounts [1] and [2], which drop to [0] once S2 and S3 have been pushed to B = [2,0].]
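The loop-independent step just outlined can be sketched as a fixpoint on the β_0 values. This is a simplified model of the VDG_0 traversal; the violated edge list is the one of the running example.

```python
# Simplified model of the loop-independent correction at depth 0:
# every violated edge S -> T with beta0[S] > beta0[T] is repaired by
# pushing T to S's position (fusing their loops); the leftover
# violations are carried to depth 1.

beta0 = {'S1': 2, 'S2': 1, 'S3': 0}                   # Figure 10
violated = [('S1', 'S2'), ('S1', 'S3'), ('S2', 'S3')]

changed = True
while changed:             # the VDG is a DAG here, so this terminates
    changed = False
    for S, T in violated:
        if beta0[S] > beta0[T]:
            beta0[T] = beta0[S]       # synchronize: beta_T := beta_S
            changed = True

assert beta0 == {'S1': 2, 'S2': 2, 'S3': 2}           # Figure 11: one loop
```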

  Figure 13. Correcting + versioning S2
  for (i=0; i<=N; i++)
    S1   A[i] = ...;
    S21  if (i==1 && N<=4) A[1] = A[5];
    S22  if (i==5 && N>=5) A[1] = A[5];
    S3   B[i] = A[i+1];
  A^S1 = [1], A^S21 = [1], A^S22 = [1], A^S3 = [1]
  β^S1 = [2,0], β^S21 = [2,0], β^S22 = [2,0], β^S3 = [2,0]
  Γ^S1 = [0,0], Γ^S21 = [0,0], Γ^S22 = [0,5], Γ^S3 = [0,0]

  Figure 14. Correcting + versioning S3
  for (i=0; i<=N; i++)
    S1   A[i] = ...;
    S21  if (i==1 && N<=4) A[1] = A[5];
    S22  if (i==5 && N>=5) A[1] = A[5];
    S31  if (i>=1 && N<=4) B[i-1] = A[i];
    S32  if (i>=6 && N>=5) B[i-6] = A[i-5];
  A^S1 = [1], A^S21 = [1], A^S22 = [1], A^S31 = [1], A^S32 = [1]
  β^S1 = [2,0], β^S21 = [2,0], β^S22 = [2,0], β^S31 = [2,0], β^S32 = [2,0]
  Γ^S1 = [0,0], Γ^S21 = [0,0], Γ^S22 = [0,5], Γ^S31 = [0,1], Γ^S32 = [0,6]
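In the special case where all SHIFT amounts are plain integers (no parameters), the system of differential constraints of Section 4.1 becomes an ordinary difference-constraint system, and a textbook Bellman-Ford relaxation yields the minimal shifts. The sketch below is illustrative (not the URUK implementation) and is written in the equivalent "delay" orientation, where each violated dependence S → T forces its target T to be delayed by at least SHIFT^{S→T} relative to S. It replays the depth-1 amounts of the example for N ≥ 5: the S1 → S2 amount 5 is from the text, while the S2 → S3 amount of 1 is an assumption made for illustration.

```python
# Sketch: minimal corrective delays for constant violation amounts,
# via Bellman-Ford-style relaxation (delay orientation of the
# difference-constraint system of Section 4.1).

def minimal_shifts(nodes, shifts):
    """shifts: list of (S, T, SHIFT). Returns the least nonnegative
    delays, or None when a cycle of shifts makes correction impossible."""
    c = {n: 0 for n in nodes}
    for _ in range(len(nodes) - 1):
        for S, T, s in shifts:
            if c[T] - c[S] < s:
                c[T] = c[S] + s      # minimally delay the target
    if any(c[T] - c[S] < s for S, T, s in shifts):
        return None                  # unsatisfiable system
    return c

# Depth-1 amounts of the example for N >= 5 (S2 -> S3 amount assumed).
c = minimal_shifts(['S1', 'S2', 'S3'], [('S1', 'S2', 5), ('S2', 'S3', 1)])
assert c == {'S1': 0, 'S2': 5, 'S3': 6}   # the constants in Figure 14's Gamma
```

A cyclic set of shifts, e.g. S → T and T → S each requiring a positive delay, makes the check fail and returns None, mirroring the negative-weight-cycle condition of Lemma 1.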

[Figure 15: the VDG at depth p = 1. Nodes S1, S2, S3 all carry G = [0,0]; edges bear parametric amounts such as [0,5] if (N>=5) on S1 → S2 and, after the S2 → S3 slack is updated, [0,4] if (N>=5).]

Figure 15. Outline of the correction for p = 1
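The minimal depth-1 shift of a target is the maximum of its incoming violation amounts; with parametric amounts this maximum is piecewise in N, which is what forces versioning. The following toy stands in for the PIP maximization step (`minimal_shift` is a hypothetical helper; the guards and amounts restate the S2 example):

```python
# Toy stand-in for the PIP maximization of incoming violation amounts;
# the paper uses parametric integer programming, not explicit enumeration.

def minimal_shift(violations, N):
    """violations: list of (amount, guard) with guard a predicate on N."""
    return max((a for a, guard in violations if guard(N)), default=0)

viol_S2 = [(5, lambda N: N >= 5),   # RAW <S1,5> -> <S2,0>, only if N >= 5
           (1, lambda N: True)]     # WAW <S1,1> -> <S2,0>
print(minimal_shift(viol_S2, 4))    # 1 -> version S21 (guard N <= 4)
print(minimal_shift(viol_S2, 10))   # 5 -> version S22 (Gamma = [0,5])
```

The two outcomes correspond exactly to the two guarded versions S21 and S22 of Figure 13.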

depth p presupposes that the candidate violated polyhedron VIO_{p−1}^{S→T} at the previous depth is not empty. Hence, for any violated edge at depth p, there exists a 0-slack edge at all previous depths, thus ensuring the existence of a 0-slack cycle at all previous depths.

In other words, the fact that a schedule cannot be corrected at a given depth is an intrinsic property of the schedule. Combined with the previous lemma, we deduce the completeness of our greedy algorithm.

Theorem 1 (Completeness) If shift amounts satisfying the system of violation and slackness constraints exist at all depths, the correction algorithm removes all violations.

Let us outline the algorithm on the example of Figure 9, assuming N ≥ 2. Nodes represent statements of the program and are labeled with their respective schedules (B for β and G for Γ). Dashed edges represent slackness while plain edges represent violations and are labeled with their constant or parametric amount. Suppose the chosen transformation tries to perform the modifications of the above statements' schedules according to the values in Figure 10. The first pass of the correction algorithm for depth p = 0 detects the following loop-independent violations: S1 → S2, S1 → S2 if N ≥ 5, S1 → S3, S2 → S3 if N ≥ 5. No slack edges are introduced in the initial graph. The violated dependence graph is a DAG and the correction mechanism will first push S2 to the same β0 as S1. Then, after updating outgoing edges, it will do the same for S3, yielding the code of Figure 11. Figure 12 shows the resulting VDG. So far, statement ordering within a given loop iteration is not fully specified (hence the statements are considered parallel); the code generation phase arbitrarily decides the shape of the final code. In addition, notice that S2 and S3, initially located in different loop bodies than S1, have been merged with it through the loop-independent step of the correction.

The next pass of the correction algorithm, for depth p = 1, now detects the following loop-carried violations: RAW dependence ⟨S1,5⟩ → ⟨S2,0⟩ is violated with the amount 5 − 0 = 5 if N ≥ 5, WAW dependence ⟨S1,1⟩ → ⟨S2,0⟩ is violated with the amount 1 − 0 = 1, and RAW dependence ⟨S1,i+1⟩ → ⟨S3,i⟩ is violated with the amount i + 1 − i = 1. There is also a 0-slack edge S2 → S3 resulting from S2 writing A[1] and S3 reading it at iteration i = 0. All this information is stored in the violation polyhedron. A maximization step with PIP determines a minimal shift amount of 5 if N ≥ 5. Stopping the correction after shifting S2 and versioning given the values of N would generate the intermediate result of Figure 13. However, since versioning only happens at the commit phases of the algorithm in Figure 8, the graph is given in Figure 15 and no node duplication is performed yet. The algorithm moves forward to correcting the node S3 and, at step 3, the slack edge S2 → S3 is updated with the new shift for S2. The result is an incoming shift edge with violation amount 4 if N ≥ 6, which yields the versions of Figure 16 after code generation optimization.

5. Correction by Index-Set Splitting

Index-set splitting was originally crafted as an enabling transformation. It is usually formulated as a decision problem to express more parallelism by allowing the construction of piecewise affine functions. Yet, the iterative method proposed by Feautrier et al. [13] relies on calls to a costly, non-scalable scheduling algorithm, and aims at exhibiting more parallelism by breaking cycles in the original dependence graph. However, significant portions of numerical programs do not exhibit such cycles, but still suffer from major inefficiencies; this is the case of the simplified excerpts from the SPEC CPU2000fp benchmarks swim and mgrid, see Figures 17–18. Other methods fail to enable important optimizations in the presence of parallelization- or fusion-preventing dependences, or when loop bounds are not identical. Feautrier's index-set splitting heuristic aims at improving the expressiveness of affine schedules. In the context of schedule corrections, the added expressiveness helps our greedy algorithm find less intrusive (i.e., deeper) shifts. Nevertheless, since not all schedules may be corrected by a combination of translation and index-set splitting, it is interesting to have a local necessary criterion to rule out impossible solutions.

5.1. Correction Feasibility

When a VDG contains a circuit with negative weight, correction by translation alone becomes infeasible. If no circuit exists in the DG, then no circuit can exist in any VDG, since by construction, edges of the VDG are built from edges in the DG. In this case, whatever A, β and Γ parts are chosen for any statement, a correction is found. If the DG contains circuits, a transformation modifying β and Γ only is always correctable. Indeed, the reverse translation on β and Γ for all statements is a trivial solution. As a corollary, any combination of loop fusion, loop distribution and loop shifting [1] can be corrected. Thanks to this strong property of our algorithm, we often eliminate the decision problem of finding enabling transformations such as loop bounds alignment, loop shifting and peeling. This observation leads to a local necessary condition for ruling out non-correctable schedules.

Lemma 3 Let C = S → S1 → … → Sn → S be a circuit in the DG. By successive projections onto the image of every dependence polyhedron, we can incrementally construct a polyhedron δ_{S→Si}^P that contains all the instances of S and Si that are transitively in dependence along the prefix P of the circuit. If δ_{S→S}^P is not empty, the function A_{p,•}^S · x′ − A_{p,•}^S · x must be positive on it for a correction to exist at depth p.

Without this necessary property, index-set splitting would not enhance the expressiveness of affine schedules enough for a correction to be found (by loop shifting only). This is not sufficient to ensure the schedule can be corrected.

5.2. Index-Set Splitting for Correction

Our splitting heuristic aims at preserving the asymptotic locality and ordering properties of the original schedule while avoiding code explosion. The heuristic runs along with the shifting-based correction algorithm, by splitting only target nodes of violated dependences. Intuitively, we decompose the target domain when the two following criteria hold: (1) the amount of correction is "too intrusive" with respect to the original (illegal) schedule; (2) it allows a "significant part" of the target domain to be preserved. A correction is intrusive if it is a parametric shift (Γ) or a motion (β).

    if (N<=4)
      A[0] = ...;
      A[1] = ...;
      A[1] = A[5];
      B[0] = A[1];
      for (i=2; i<=N; i++)
        A[i] = ...;
        B[i-1] = A[i];
      B[N-1] = A[N];
    else /* N>=5 */
      for (i=0; i<=4; i++)
        A[i] = ...;
      A[5] = ...;
      A[1] = A[5];
      for (i=6; i<=N; i++)
        A[i] = ...;
        B[i-6] = A[i-5];
      for (i=N+1; i<=N+6; i++)
        B[i-6] = A[i-5];

Figure 16. Versioning after code generation

To assess the second criterion, we consider, for every incoming shift edge, the projection of the polyhedron onto every non-parametric iterator dimension. A dimension whose projection is non-parametric and included in an interval smaller than a given constant (3 in our experiments) is called degenerate. In the following example, dimension j is degenerate and simplifies into 2i + 2 ≥ 6j ≥ 2i − 5:

    −2i + 3j + 4M + 5 ≥ 0
    i + j − M + 2 ≥ 0
    i − 3j + 2M − 9 ≥ 0
    −i + M ≥ 0
    i ≥ 0

The separation and commit phases of our algorithm create as many disjoint versions of the correction as needed to enforce minimal shifts. The node duplications allow expressing different schedules for different portions of the domain of each statement. To determine if a domain split should be performed, we incrementally remove violated parts of the target domain corresponding to intrusive corrections…

S1 A[0] = ...; for (i=1; i
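The feasibility condition of Section 5.1 amounts to detecting a negative-weight cycle in the VDG. Assuming constant edge weights (the paper handles parametric weights with polyhedral machinery instead), a standard Bellman-Ford sketch suffices:

```python
# Negative-cycle detection (Bellman-Ford) as a sketch of the Section 5.1
# feasibility test; constant weights are an assumption for illustration.

def has_negative_cycle(nodes, edges):
    """edges: (src, dst, weight). True iff some cycle has negative total
    weight, i.e. no set of shifts can satisfy all constraints."""
    dist = {v: 0 for v in nodes}               # implicit zero-weight source
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return any(dist[u] + w < dist[v] for u, v, w in edges)

# A two-statement circuit: correction by shifting alone is impossible
# exactly when the circuit's total weight is negative.
print(has_negative_cycle(["S", "T"], [("S", "T", -2), ("T", "S", 1)]))   # True
print(has_negative_cycle(["S", "T"], [("S", "T", -2), ("T", "S", 3)]))   # False
```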

…shifts, our correction scheme triggers a very sophisticated index-set splitting, effectively enabling aggressive fusion, at the cost of some control-flow overhead (triangular iteration spaces). The speed-up drops to 15%, and adding register blocking and unrolling does not help because of the increased control-flow overhead. On this example, the running time of the correction algorithm is only a few seconds.

4 Intuitively, a multidimensional distance of (4 × N, 0, 0) is equivalent to 4N · N² + 0 · N + 0, assuming every loop has N iterations.
5 -march=athlon64 -LNO:fusion=2:prefetch=2 -m64 -Ofast -msse2 -lmpath; pathf90 always outperformed Intel ICC by a small percentage.

7. Conclusion and Perspectives

We presented a general and complete algorithm to correct dependence violations on multidimensional affine schedules. This algorithm is based on loop shifting, a generalization of retiming that may also simulate the effect of loop fusion and distribution. We combined this algorithm with the first index-set splitting technique that operates after affine schedules are computed. The result is a very effective technique to dramatically reduce the complexity of loop nest optimization in the polyhedral model. We demonstrated its scalability on two real-world benchmarks, whereas previous affine scheduling algorithms would only consider much smaller kernels. Overall, we replaced the combinatorial decision problem of finding a sequence of enabling transformations by an a posteriori, tractable and controllable correction step.

References

[1] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.
[2] C. Bastoul. Code generation in the polyhedral model is easier than you think. In Parallel Architectures and Compilation Techniques (PACT'04), Antibes, France, Sept. 2004.
[3] C. Bastoul and P. Feautrier. Adjusting a program transformation for legality. Parallel Processing Letters, 15(1):3–17, March 2005.
[4] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1989.
[5] J. B. Crop and D. K. Wilde. Scheduling structured systems. In EuroPar'99, LNCS, pages 409–412, Toulouse, France, Sept. 1999. Springer-Verlag.
[6] A. Darte and G. Huard. Loop shifting for loop parallelization. Intl. J. of Parallel Programming, 28(5):499–534, 2000.
[7] A. Darte, G.-A. Silber, and F. Vivien. Combining retiming and scheduling techniques for loop parallelization and loop tiling. Parallel Processing Letters, 7(4):379–392, 1997.
[8] S. Donadio, J. Brodman, T. Roeder, K. Yotov, D. Barthou, A. Cohen, M. Garzaran, D. Padua, and K. Pingali. A language for the compact representation of multiple program versions. In Languages and Compilers for Parallel Computing (LCPC'05), LNCS, Hawthorne, New York, Oct. 2005. Springer-Verlag. 15 pages.
[9] P. Feautrier. Array expansion. In ACM Intl. Conf. on Supercomputing, pages 429–441, St. Malo, France, July 1988.
[10] P. Feautrier. Parametric integer programming. RAIRO Recherche Opérationnelle, 22:243–268, Sept. 1988.
[11] P. Feautrier. Some efficient solutions to the affine scheduling problem, part II, multidimensional time. Intl. J. of Parallel Programming, 21(6):389–420, Dec. 1992. See also Part I, one-dimensional time, 21(5):315–348.
[12] P. Feautrier. Scalable and structured scheduling. To appear in Intl. J. of Parallel Programming, 28, 2006.
[13] P. Feautrier, M. Griebl, and C. Lengauer. On index set splitting. In Parallel Architectures and Compilation Techniques (PACT'99), Newport Beach, CA, Oct. 1999. IEEE Computer Society.
[14] S. Girbal, N. Vasilache, C. Bastoul, A. Cohen, D. Parello, M. Sigler, and O. Temam. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. Intl. J. of Parallel Programming, 2006. Special issue on Microgrids. 57 pages.
[15] M. Griebl. Automatic parallelization of loop programs for distributed memory architectures. Habilitation thesis, Fakultät für Mathematik und Informatik, Universität Passau, 2004.
[16] W. Kelly. Optimization within a unified transformation framework. Technical Report CS-TR-3725, University of Maryland, 1996.
[17] C. Lee and M. Stoodley. UTDSP benchmark suite, 1998. http://www.eecg.toronto.edu/~corinna/DSP/.
[18] C. E. Leiserson and J. B. Saxe. Retiming synchronous circuitry. Algorithmica, 6(1), 1991.
[19] A. W. Lim and M. S. Lam. Communication-free parallelization via affine transformations. In 24th ACM Symp. on Principles of Programming Languages, pages 201–214, Paris, France, Jan. 1997.
[20] K. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424–453, July 1996.
[21] L.-N. Pouchet, C. Bastoul, A. Cohen, and N. Vasilache. Iterative optimization in the polyhedral model: Part I, one-dimensional time. In International Symposium on Code Generation and Optimization, pages 144–156, San Jose, California, Mar. 2007. IEEE Comp. Soc.
[22] M. Püschel, B. Singer, J. Xiong, J. Moura, J. Johnson, D. Padua, M. Veloso, and R. W. Johnson. SPIRAL: A generator for platform-adapted libraries of signal processing algorithms. Journal of High Performance Computing and Applications, special issue on Automatic Performance Tuning, 18(1):21–45, 2004.
[23] F. Quilleré and S. Rajopadhye. Optimizing memory usage in the polyhedral model. Technical Report 1228, IRISA, Université de Rennes, France, Jan. 1999.
[24] F. Quilleré, S. Rajopadhye, and D. Wilde. Generation of efficient nested loops from polyhedra. Intl. J. of Parallel Programming, 28(5):469–498, Oct. 2000.
[25] N. Vasilache, A. Cohen, C. Bastoul, and S. Girbal. Violated dependence analysis. In ACM Intl. Conf. on Supercomputing (ICS'06), Cairns, Australia, June 2006.
[26] S. Verdoolaege, M. Bruynooghe, G. Janssens, and F. Catthoor. Multi-dimensional incremental loop fusion for data locality. In ASAP, pages 17–27, 2003.
[27] M. E. Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Stanford University, Aug. 1992. Published as CSL-TR-92-538.