Data-Flow/Dependence Profiling for Structured Transformations

Fabian Gruber, Manuel Selva, Diogo Sampaio, Christophe Guillon, Antoine Moynault, Louis-Noël Pouchet, Fabrice Rastello.
PPoPP 2019, 24th Symposium on Principles and Practice of Parallel Programming, Feb 2019, Washington, D.C., United States, pp. 173-185. DOI: 10.1145/3293883.3295737. HAL: hal-02060796 (submitted 7 Mar 2019).

Fabian Gruber, Manuel Selva, Diogo Sampaio, Fabrice Rastello (Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG); Christophe Guillon, Antoine Moynault (STMicroelectronics); Louis-Noël Pouchet (Colorado State University)

Abstract

Profiling feedback is an important technique used by developers for performance debugging, where it is usually used to pinpoint performance bottlenecks and to find optimization opportunities. Assessing the validity and potential benefit of a program transformation requires accurate knowledge of the data flow and dependencies, which can be uncovered by profiling a particular execution of the program.

In this work we develop poly-prof, an end-to-end infrastructure for dynamic binary analysis, which produces feedback about the potential to apply complex program rescheduling. Our tool can handle both inter- and intraprocedural aspects of the program in a unified way, thus providing interprocedural transformation feedback.

Keywords: Performance Feedback, Polyhedral Model, Loop Transformations, Optimization, Binary, Instrumentation, Dynamic Dependence Graph

1 Introduction

The most effective program transformations for improving performance or energy consumption are typically based on rescheduling instructions so as to expose data locality and/or parallelism. Optimizing compilers typically attempt, via static analysis, to build a representation of the program precise enough to enable useful program transformations. A key issue faced when analyzing general-purpose languages is the ambiguity introduced by the language itself: for example, the use of pointers typically restricts the ability to precisely characterize the data being accessed, in turn triggering conservative and approximate dependence analysis [27, 40, 68]. Such frameworks therefore rely on conservative assumptions, limiting the ability to reason about the legality of complex program restructuring. In this work we specifically target structured transformations, that is, a (possibly interprocedural) reordering of operations involving complex sequences of multidimensional loop transformations, including (partial) loop fusion/fission, loop tiling [11], etc.

When a region of the source program fits specific syntactic and semantic restrictions, such as avoiding function calls, using only arrays as data structures, and very simple conditional statements with no indirections [16, 20], transformation frameworks such as the polyhedral model [11] showed that multidimensional loop nest rescheduling can be successfully implemented, leading to significantly improved performance [28, 53, 68]. The input program can be massaged manually via function inlining, loop rewriting/normalization, etc. to enable such static analyses to succeed, but this is a rare scenario: in full programs, and in particular those relying on external binaries visible only to the linker in compiled form, data allocation and even the full call stack are often inaccessible to static analysis.

Dynamic analysis frameworks [9, 21, 47, 63] address this limitation by reasoning on a particular execution of the program, through the analysis of its execution trace. That is, disambiguation of dependence information and accessed data is addressed by monitoring the stream of operations and accessed addresses. In general, the result is only valid for that particular execution. Such systems provide some feedback to the user, pinpointing the probable absence of dependencies along some loop, thus highlighting potential parallelism [37, 67, 70, 74] or SIMD vectorization [33].

The contributions of this work are: 1. the development of an interprocedural intermediate representation that compacts program traces effectively while being amenable to polyhedral optimization, and all associated algorithms and implementation to make the process scalable to full programs; 2. poly-prof, a tool that provides optimization feedback using binary translation and instrumentation. poly-prof is the first framework that provides feedback on structured transformation potential for arbitrary binaries.

2 Overview of poly-prof

In contrast to, for instance, basic loop parallelization, structured transformations can be found even in the presence of data dependencies between loop iterations, dramatically broadening the class of loop transformations that can be applied to the program. This, however, requires detailed information about such dependencies. That is, instead of just proving the absence of dependencies, one must capture and represent them in order to be able to find patterns and reason about their structure.

An overview of poly-prof is shown in Fig. 1. Taking a carefully generated execution trace of the program, the first objective of poly-prof is to construct a useful and analyzable representation of the trace. Building such a representation constitutes our first key set of contributions, and covers the first three stages below. We then extended a polyhedral compiler. The goal here is not to provide automated transformations, but instead to assist the user in figuring out where optimizations should be implemented, and importantly which ones. This corresponds to the fourth stage below, which forms the second contribution of this work.

Figure 1. Overview of the poly-prof framework. [Pipeline: Binary Instrumentation I feeds the interprocedural control structure stage (call stack to dynamic CFG, SCC loop-forest recognition, producing the loop-nesting tree and recursive-component set); Instrumentation II feeds the DDG stage (dynamic interprocedural iteration vectors, shadow memory, memory addresses and integer values, with relevance/scalability handled by chunking and clamping); folding (ISL) produces the compact polyhedral DDG (statement domains, dependence relations, access functions); PluTo/PolyFeat/ISL produce the polyhedral feedback on fat regions: transformations (fuse, skew, tile, vectorise, parallelise, ...) and metrics (hotness, fatness, ...), rendered as an annotated flame-graph via schedule-tree decoration and best-effort assembly/source matching.]

Interprocedural control structure. To eventually reconstruct a compact and useful representation of the program trace, the first stage starts by dynamically computing the intraprocedural control flow graphs (CFGs) and the interprocedural call graph (CG) from an optimized binary. This information is then used to construct the interprocedural loop context tree, which combines the notion of loop-nesting-trees and their equivalent for call graphs, the recursive-component-set, both of which are input to the next stage. Their structure and construction are detailed in Section 3.

Dynamic Dependence Graph. This stage generates the actual dynamic dependence graph (DDG) [50], a trace representing both the dynamic execution of a program's instructions as well as their data dependencies. It uses the interprocedural loop context tree to construct dynamic interprocedural iteration vectors, a new interprocedural representation of the program execution which unifies two abstractions: the intraprocedural schedule tree [35, 72] and the interprocedural calling-context-tree [2]. Both the DDG and the dynamic interprocedural iteration vector are described in Section 4.

Compact polyhedral DDG. The third stage compacts, or folds, the DDG into a polyhedral program representation where, essentially, sequences of individual operations in the trace are merged together into sets of points. Intuitively, when a loop is executed it generates one execution of a statement for each loop iteration, and therefore one point in the trace for each execution. This stage amounts to folding those points back into a loop-equivalent representation, which itself can be the input of a polyhedral compiler. We proceed, however, in a general setting where folding occurs across multi-level loops but also across possibly recursive procedure calls, leading to interprocedural folding. Details are presented in [29], and an outline of the process is given in Sec. 5.

DDG polyhedral feedback. The last stage of poly-prof consists of a customized state-of-the-art polyhedral toolchain for the purpose of analyzing very large polyhedral programs. It analyzes the polyhedral DDG and suggests sequences of program transformations needed to expose multi-level parallelism and/or improved data locality, along with numerous statistics on the original and potentially transformed program, including SIMDization potential. Section 6 describes this last step as well as how poly-prof provides human-interpretable feedback to the user.

Section 7 illustrates this feedback through a few case studies. Section 8 thoroughly evaluates poly-prof on the Rodinia benchmarking suite. Related work is discussed in Section 9 before concluding.

As most of the execution time of a program is usually spent in loops, poly-prof has been designed to find loop-based transformations. In practice, however, as illustrated in Ex. 1 of Fig. 3, interesting loop nests are often spread across multiple functions over long call chains, obfuscating them to traditional static analysis. poly-prof is designed to represent both classical loops as well as function calls and returns in a unified way with polyhedra [23, 75].

poly-prof has been designed to handle several scenarios. 1. Nested loops containing function calls with side effects: here calls can be viewed as atomic instructions, but profiling the storage locations they access or modify, and computing the dependencies between each call and the rest of the loop body, is required. 2. Loops containing function calls that themselves contain loops, as in Ex. 1 of Fig. 3: here the objective is to view the whole interprocedural region as a multidimensional loop nest. 3. Recursive functions, as in Ex. 2 of Fig. 3: here the primary objective is to keep the depth of the generated representation from growing proportionally with the depth of the call stack; a secondary objective is to detect any underlying regular loop structure amenable to polyhedral compilation (after recursion-to-iteration conversion).

3 Interprocedural Control Structure

This section explains how to go from the execution of a program to a representation of its interprocedural control structure. The first step of poly-prof is to extract the inter- and intraprocedural static control structure from a program in compiled binary form.

This static control structure of the program will then be used in a later stage of poly-prof to transform the stream of raw events (jump/call/return) gathered during execution into loop events (entry/iterate/exit). poly-prof performs this through instrumentation of jump, call, return and a few other instructions, to dynamically trace any transfer of control during execution. It thus "dynamically" extracts the control-flow-graph (CFG) of each function and then proceeds to statically build the loop-nesting-tree. It also dynamically extracts the call-graph (CG) of the whole program and builds the counterpart of loop-nesting-trees for call-graphs, the recursive-component-set. An advantage of our approach is that only the part of a program that is actually executed will be analyzed. For large programs with only a small performance-critical part this can substantially reduce the amount of work for the analyzer.

Figure 2. (a/b) Example CFG and associated loop-nesting-tree; (c/d) example CG and associated recursive-component-set (components = {L1}, L1.entries = {B}, L1.headers = {B, C}).

3.1 Control-flow-graph and loop-nesting-tree

For the profiled CFG, the loop detection algorithm used by poly-prof is the one described by Havlak [31]. As formalized by Ramalingam in [58], a loop-nesting-forest can be recursively defined (the actual algorithm is almost linear) as follows: 1. each strongly connected component (SCC) of the CFG containing at least one cycle constitutes the region of an outermost loop; 2. for each loop, one of its entry nodes is designated the header node of the loop; 3. all edges within the loop that target its header node are called back-edges; 4. "removing" all back-edges from the CFG allows one to recursively define the next level of sub-loops.

An example CFG and its corresponding loop-nesting-tree are given in Figures 2a and 2b. Here, the CFG can be partitioned into one SCC (giving rise to loop L1 with B as its header) plus two separate basic-blocks A and E. The region of L1, once its back-edge (D, B) has been removed, contains one sub-SCC formed by the loop L2, plus the basic-block B. Among the two entry nodes C and D of L2, only one, C, is selected to be its header.
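The following Python sketch illustrates this recursive definition on a CFG shaped like Fig. 2a. It is illustrative only: the exact edge set of Fig. 2a, the use of Tarjan's SCC algorithm, and the policy of picking the lexicographically smallest entry as header are assumptions of the sketch (poly-prof uses Havlak's almost-linear algorithm instead).

    from collections import defaultdict

    def sccs(nodes, succ):
        # Tarjan's strongly-connected-components algorithm (recursive variant).
        index, low, on, stack, out, cnt = {}, {}, set(), [], [], [0]
        def dfs(v):
            index[v] = low[v] = cnt[0]; cnt[0] += 1
            stack.append(v); on.add(v)
            for w in succ[v]:
                if w not in index:
                    dfs(w); low[v] = min(low[v], low[w])
                elif w in on:
                    low[v] = min(low[v], index[w])
            if low[v] == index[v]:
                comp = []
                while True:
                    w = stack.pop(); on.discard(w); comp.append(w)
                    if w == v: break
                out.append(comp)
        for v in nodes:
            if v not in index:
                dfs(v)
        return out

    def loop_forest(nodes, edges):
        # One loop per cyclic SCC; recurse on the SCC body minus its back-edges.
        succ = defaultdict(list)
        for a, b in edges:
            succ[a].append(b)
        loops = []
        for comp in sccs(nodes, succ):
            body = set(comp)
            if len(comp) == 1 and (comp[0], comp[0]) not in edges:
                continue                         # acyclic SCC: not a loop region
            entries = [v for v in body
                       if any(a not in body for a, b in edges if b == v)]
            header = sorted(entries or body)[0]  # designate one entry as header
            back = {(a, b) for a, b in edges if a in body and b == header}
            inner = [e for e in edges
                     if e[0] in body and e[1] in body and e not in back]
            loops.append({'header': header, 'body': body,
                          'subloops': loop_forest(body, inner)})
        return loops

    # Assumed edge set for Fig. 2a: outer loop L1 headed by B with back-edge
    # (D, B); removing it leaves the inner loop L2 = {C, D}, headed by C.
    cfg = [('A','B'), ('B','C'), ('B','D'), ('C','D'),
           ('D','C'), ('D','B'), ('B','E')]
    print(loop_forest('ABCDE', cfg))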
As depicted in Alg. 1, these notions (region, header, back-edge) are important as they allow associating loop events (E: entry, I: iterate, and X: exit) with the stream of raw control events (jump, call, return) produced by our instrumented execution. Here, the generation of loop events is driven by the control event "jump" emitted by our instrumentation. Whenever we detect the start of a new iteration of a given loop (Line 8), all live sub-loops are considered exited, that is, we emit an "exit" event (Line 4). This is especially important for recursive loops, as further explained with Alg. 2.

Algorithm 1: CFG-loop events generated from a jump event.
Input: event: branch event; inLoops: stack of currently live loops, from outermost to innermost.
Emitted events: E(L, H): entry into loop L, H is the header; I(L, H): iteration of loop L, H is the header; X(L, B): exit of loop L, jumping to basic-block B; N(B): local jump to basic-block B.
1  if event.type is local-jump:
2    B := event.dstBB
3    while L := inLoops.peek() and L.isCFG and B not in L:
4      L.visiting := false; inLoops.pop(); emit X(L,B)
5    if B is loop-header:
6      H := B; L := loop(H)
7      if L.visiting == false:
8        L.visiting := true; inLoops.push(L); emit E(L,H)
9      else: emit I(L,H)
10   emit N(B)
11 else:
12   ... # Alg. 2
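To make Alg. 1 concrete, here is a minimal runnable sketch of its jump handling on the loops of Fig. 2. The loop table, the sample jump trace, and the simplification of the visiting flag into a top-of-stack comparison are all assumptions of this sketch.

    # Loops given as {header: set of member blocks}; L1 headed by B, L2 by C.
    loops = {'B': {'B', 'C', 'D'}, 'C': {'C', 'D'}}

    def on_jump(dst, in_loops, emit):
        # Exit every live loop that does not contain the jump target (lines 3-4).
        while in_loops and dst not in loops[in_loops[-1]]:
            emit('X(%s,%s)' % (in_loops.pop(), dst))
        if dst in loops:                          # dst is a loop header (lines 5-9)
            if not in_loops or in_loops[-1] != dst:
                in_loops.append(dst); emit('E(%s,%s)' % (dst, dst))
            else:
                emit('I(%s,%s)' % (dst, dst))
        emit('N(%s)' % dst)                       # line 10

    live, out = [], []
    for b in ['A', 'B', 'C', 'D', 'C', 'D', 'B', 'E']:  # one possible trace
        on_jump(b, live, out.append)
    print(out)   # ... I(C,C) ... X(C,B) I(B,B) ... X(B,E) N(E)

Note how reaching header B again first exits the live sub-loop L2 before the iterate event of L1 is emitted, exactly the behavior the paragraph above describes.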
3.2 Call-graph and recursive-component-set

To uniquely identify each dynamic execution of an instruction, also called a dynamic instruction, poly-prof uses interprocedural iteration vectors. This is described in more detail in Section 4. Note that modeling programs containing recursive function calls with calling-context-paths is memory inefficient, since the length of the paths is proportional to the depth of the recursion. While recursive function calls are found to be very uncommon in performance-critical code (the Rodinia benchmark suite, for example, does not use any recursion), we do need to handle them to ensure robustness. poly-prof handles recursiveness in an elegant way that we believe to be useful beyond the restricted scope of this paper, for example to detect properties of tree-recursive calls. The main data structure for treating recursive control flow is the recursive-component-set, which is for the call-graph what the loop-nesting-tree is for the control-flow-graph.

Before providing more details, we go through the illustrating Ex. 2 of Fig. 3. Here, the edge from B3 to B0 (in orange) is not a CFG-edge but a recursive call to B from call site B3. The recursive-component-set computed from the CG contains one SCC with a cycle made up of a single function B. This example raises several questions: 1. Should C0 be part of the associated recursive loop? It actually depends on the context: it should when called from B1, while it should not when called from D0. 2. What about B5? One should observe that B5 will be executed as many times as the number of recursive calls to B: in other words, B5 should be part of a loop.

To conclude, while CG-cycles clearly represent potential dynamic loop structures, a CG-edge does not have the same semantics as a CFG-edge. In particular, a call will never exit a recursive loop. For poly-prof, the recursive loop "L1" of Ex. 2 is a dynamic notion defined as follows: 1. entering L1 is characterized by a call to B (step 1); 2. L1 is exited when this (first) call gets unstacked, that is, when the number of returns reaches the number of calls (step 22); 3. everything executed in between is part of the loop, and the corresponding increment of induction variables takes place whenever a call/return to/from B occurs (steps 10, 15, 20, and 21). As one can see, once the recursive-component-set has been computed from the CG, the only relevant information dynamically used to devise recursive-loop events corresponds to the header functions, here B.

As already stated, the recursive-component-set is for the CG what the loop-nesting-tree is for the CFG. In a similar way it can be recursively defined as follows:
1. Find all the top-level SCCs with at least one cycle in the CG. Each gives rise to a distinct recursive-component.
2. For each component, determine all its entry nodes.
3. Repeat the following phases until no more cycles exist:
a. For a given SCC, choose an entry node and add it to the headers-set of the recursive-component (top-level SCC) it belongs to.
b. Remove all edges inside this SCC that point to this node.
Havlak's loop-nesting-forest construction algorithm can easily be adapted to build the recursive-component-set in almost linear time. The end result of the algorithm is a possibly empty set of recursive-components, where each recursive-component has a non-empty set of headers plus a non-empty set of entries associated with it. Alg. 2 uses this data structure to associate loop events (entry, iterate, and exit) with control events (call and return). Here: 1. entering a recursive loop is characterized by a call to a component's entry function (Line 4); 2. a new iteration is started whenever a call/return to/from one of the component's headers occurs (Line 6); 3. an exit occurs when all the iterating calls to the headers have been unstacked, that is, when the number of returns equals the number of calls and we are returning from the original function that entered the loop (Line 18). Tracking the state of the call stack is done with the following two data structures: L.stackcount represents, for a recursive-component L, a counter of the number of calls-to a header minus the number of returns from it; L.entry represents the function through which L was entered.

Algorithm 2: Recursive-loop events generated from a call or a return event.
Input: event, inLoops: same as for Alg. 1.
Emitted events: EC(L, B): call to a function header of recursive-component L and entry into the corresponding loop, B is the current basic-block after the call; IC(L, B) / IR(L, B): call to / return from a function header of recursive-component L and iteration of the corresponding loop, B is the current basic-block after the call/return; XR(L, B): return from a function header of recursive-component L and exit from that loop.
1  if event.type is call:
2    F := event.callee; B := event.dstBB
3    L := F.recursive_component
4    if F is recursive-component entry and L.entry == undef:
5      L.entry := F; inLoops.push(L); emit EC(L,B)
6    elif F is recursive-component header:
7      while L' := inLoops.peek() and L' in L:
8        L'.visiting := false; inLoops.pop()
9        emit X(L',B)
10     L.stackcount++; emit IC(L,B)
11   else: emit C(F,B)
12 if event.type is return:
13   F := event.callee; B := event.dstBB
14   while L' := inLoops.peek() and L'.isCFG and L' in F:
15     L'.visiting := false; inLoops.pop(); emit X(L',B)
16   L := F.recursive_component
17   if F is recursive-component entry and
18      L.stackcount == 0 and L.entry == F:
19     L.entry := undef; emit XR(L,B)
20   elif F is recursive-component header:
21     L := F.recursive_component; L.stackcount--
22     emit IR(L,B)
23   else:
24     ... # Alg. 1
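The recursive definition above can be sketched in a few lines of Python, reusing the sccs() helper from the loop-forest sketch in Sec. 3.1. The call-graph of Ex. 2 of Fig. 3 and the header-selection order are assumptions of this sketch.

    from collections import defaultdict

    def recursive_components(nodes, edges):
        def succ_of(es):
            s = defaultdict(list)
            for a, b in es:
                s[a].append(b)
            return s
        comps = []
        for comp in sccs(nodes, succ_of(edges)):      # step 1: top-level SCCs
            body = set(comp)
            if len(comp) == 1 and (comp[0], comp[0]) not in edges:
                continue                              # no cycle: not recursive
            entries = {b for a, b in edges            # step 2: entry nodes
                       if b in body and a not in body} or body
            headers = set()
            inner = [e for e in edges if e[0] in body and e[1] in body]
            while True:                               # step 3: peel headers
                cyclic = [c for c in sccs(body, succ_of(inner))
                          if len(c) > 1 or (c[0], c[0]) in inner]
                if not cyclic:
                    break
                h = sorted(set(cyclic[0]) & entries or set(cyclic[0]))[0]
                headers.add(h)                        # step 3a
                inner = [e for e in inner if e[1] != h]   # step 3b
            comps.append({'entries': entries, 'headers': headers})
        return comps

    # Ex. 2 of Fig. 3: Main calls D and B, B calls C and (recursively) itself,
    # D calls C. The only recursive component is the one around B.
    cg = [('Main','D'), ('Main','B'), ('B','C'), ('B','B'), ('D','C')]
    print(recursive_components(['Main','B','C','D'], cg))
    # -> [{'entries': {'B'}, 'headers': {'B'}}]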
4), its sub-regions 4 DDG: Dynamic dependence graph (once back-edges have been removed – here statements S and The objective of the second stage of poly-prof (“Instrumen- T ) form a directed-acyclic-graph (reduced DAG represented tation II”) is to profile the dynamic dependence graph ofa in dashed in Fig. 4) that can be numbered using a topological given execution, that is, to build a graph that has one vertex order. This numbering is used to index the corresponding for every dynamic instruction and one edge for every data nodes at every level of the loop-nesting-tree (e.g. S(0) and dependence. Because we want to enable feedback with struc- T (1) for the fused schedule or Li (0) and Li′ (1) for the fis- tured loop transformations, we need to map this graph to sioned one). The second decoration consists of associating a a “geometric” space that reflects the structural properties of canonical , that is, an induction variable the program. To this end, we tag each dynamic instruction that starts at value 0 and increments by 1, to each loop-vertex. with its iteration vector (IV). The IVs uniquely identify each As an example the loop-vertex associated with Lj is labeled 4 r Main step : trace event dynamic IIV (⊥) M0 1 : M0 N(M0) (M0) M0(0) A B 2 : A0 C(A0) (M0/A0) Main AB 3 : A1 E(L1, A1) (M0/L1, 0, A1) A0 B0 A (0) L (1), (i ) A (2) 4 : B0 C(B0) (M0/L1, 0, A1/B0) 0 1 1 3 5 : B1 E(L2, B1) (M0/L1, 0, A1/L2, 0, B1) L1 L2 (b)

N( ) ( / / )

A B 6 : B2 B2 M0 L1, 0, A1 L2, 0, B2 1 1 r(Main) r(A) r(B) A1 (0) A1(0) A2 (1) 7 : B1 I(L2, B1) (M0/L1, 0, A1/L2, 1, B1) 8 : B2 N(B2) (M0/L1, 0, A1/L2, 1, B2) A B 9 : B3 X(L2, B3) (M0/L1, 0, A1/B3) B (0) L (1), (i ) B (2) 2 2 M0 A0 L1:A1 A3 B0 L2:B1 B3 0 2 2 3 10 : A2 R(A2) (M0/L1, 0, A2) 11 : A1 I(L1, A1) (M0/L1, 1, A1) B (1) A3 B3 ...... B1 (0) 2 A1 A2 B1 B2 (a) Example 1 (c) (d) (e) step : trace call-stack event dynamic IIV M (⊥) r 1 : M0 M N(M0) (M0) M0 (0) 2 : D0 M/D C(D0) (M0/D0) M0(0) M1 (0) M1(0) D 3 : C0 M/D/C C(C0) (M0/D0/C0) 2 D0 (0) 4 : M1 M R (M1) (M1) D (0) 0 L1(0), (i1) 5 : B0 M/B EC(L1, B0) (M1/L1, 0, B0) Main BC C (0) Main 6 : B1 M/B N(B1) (M1/L1, 0, B1) 0 7 : C M/B/C C(C ) (M /L , 0, B /C ) M L1 0 0 1 1 1 0 0 B (0) B (1) B (1) B (2) B (3) B (4) B (5) 8 : B M/B R(B ) (M /L , 0, B ) 0 1 1 2 3 4 5 D 2 2 1 1 2 (g) 9 : B3 M/B N(B3) (M1/L1, 0, B3) M1 D0 10 : B0 M/B/B IC(L1, B0) (M1/L1, 1, B0) C (0) Main 0 B 11 : B1 M/B/B N(B1) (M1/L1, 1, B1) (j) B 12 : C0 M/B/B/C C(C0) (M1/L1, 1, B1/C0) 0 {M0() : } D(M0) B(M1) 13 : B2 M/B/B R(B2) (M1/L1, 1, B2) 14 : B M/B/B N(B ) (M /L , 1, B ) {M0D0() : } 3 0 1 1 3 {M D C () : } B1 15 : B0 M/B/B/B IC(L1, B0) (M1/L1, 2, B0) 0 0 0 C(D ) C(B ) B(B ) {M () : } C 0 1 3 16 : B1 M/B/B/B N(B1) (M1/L1, 2, B1) 1 {M1L1B0(i) : 0 ≤ i ≤ 2} B 17 : C0 M/B/B/B/C C(C0) (M1/L1, 2, B1/C0) 2 C0 C(B ) B(B ) {M L B (i) : 0 ≤ i ≤ 2} 1 3 18 : B2 M/B/B/B R(B2) (M1/L1, 2, B2) 1 1 1 19 : B M/B/B/B N(B ) (M /L , 2, B ) {M1L1B1C0(i) : 0 ≤ i ≤ 2} 4 4 1 1 4 {M L B (i) : 0 ≤ i ≤ 2} B3 B4 C(B ) B(B ) 20 : B M/B/B IR(L , B ) (M /L , 3, B ) 1 1 2 1 3 5 1 5 1 1 5 {M L B (i) : 0 ≤ i ≤ 1} 21 : B5 M/B IR(L1, B5) (M1/L1, 4, B5) 1 1 3 {M1L1B4(i) : i = 2} 22 : ⊥ XR2 (L1, ⊥) (⊥) B5 {M1L1B5(i) : 3 ≤ i ≤ 4} (f) Example 2 (h) (i) (k) Figure 3. (a/f) Code skeletons; Purple is used to represent CFG, while orange is used to represent CG; (b/g) Call-graph and recursive-component-set; (c) Loop-nesting-forests; (h) Calling-context-tree with call-sites information; (d/i) Example trace, associated loop events, and dynamic interprocedural interation vectors. R2 stands for two consecutive returns; (e/j) Dynamic schedule tree; (k) Folded domains. with Lj as well as the static index 0 followed by its canonical calling context is fully encoded making it possible to differ- variables j, resulting in Lj (0), (j). For any given statement, entiate the different contexts within which basic-block C0 is an IV with Kelly’s mapping is nothing else than the vector executed. Obviously, looking at this example, one wants to given by the path from the root to its corresponding leaf, fold all the repeated calls from B1 to C. called its iteration vector. Fig. 4 shows this mapping both in its textual form using region names and numerical form Dynamic IIV The dynamic IIV is basically a combination using indices. As one can see, an interesting property of the of Kelly’s mapping and the CCT. Similarly to Kelly’s map- numerical form of this mapping is that the scheduling of the ping, the dynamic IIV alternates between context-ids and original code is given by the lexicographical order of the so canonical induction variables. Differently from Kelly’s map- obtained iteration vectors (unique per dynamic instance). ping, but more like the CCT, each context-id in a dynamic Calling-context-tree Differently from the schedule tree, IIV is a, possibly empty, stack of calling contexts and the the calling-context-tree [2] (CCT) is only enumerative (it identifier for a basic block. 
does not contain any loop indices) and reflects a dynamic Examples of IIVs are shown in Fig. 3a: Function A contains behavior. In other words, it encodes the dynamic-call-tree in a loop L1 that contains a call to function B, itself contain- a “compact” way. The calling-context-tree of the example in ing a loop L2. In this interprocedural example one clearly Fig. 3f is reported in Fig. 3h. This figure is slightly different wants to see that basic-block B1 in loop L2 belongs to a two- from the “original” CCT, but corresponds to the current prac- dimensional nested loop. This is reflected by our dynamic IIV tice: to differentiate two calls to a common function from (see steps 5,7 of Fig. 3d) which will be (M0/L1, i1, A1/L2, i2, B1) different basic-blocks, callees are labeled with call sites (in where i1 (resp. i2) are the canonical induction variables of purple and under parenthesis in Fig. 3h). In this example, loop L1 (resp. L2). Here, our context-ids are loop-ids (e.g. L2) a calling-context-path (for example M1/B3 ... B3/B1/C) can or statement-ids (e.g. B0) and each context-id is decorated be as long as the number of recursive calls to B, but the with a, possibly empty, call stack (e.g. A1). 5 mentioned, B5 is also part of the loop: Indeed there are as 1 for (i=0;i
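The following toy replay of Alg. 3 reproduces the first steps of Example 1 (Fig. 3d). The event records and the list-based diiv encoding are assumptions of this sketch, and the EC/IC/IR/XR variants for recursive components are omitted for brevity.

    # A diiv alternates context stacks and counters: [ctx0, iv0, ..., ctxN].
    def update(diiv, kind, L=None, B=None):
        ctx = diiv[-1]                  # innermost context stack
        if kind == 'C':                 # call: push the callee block
            ctx.append(B)
        elif kind == 'E':               # loop entry: open a new dimension
            ctx[-1] = L
            diiv.extend([0, [B]])
        elif kind == 'I':               # iteration: bump the innermost counter
            diiv[-2] += 1
            ctx[-1] = B
        elif kind == 'X':               # loop exit: close innermost dimension
            del diiv[-2:]
            diiv[-1][-1] = B
        elif kind == 'R':               # return: pop the call site
            ctx.pop()
            ctx[-1] = B
        elif kind == 'N':               # local jump
            ctx[-1] = B

    diiv = [['M0']]                                  # after step 1, N(M0)
    trace = [('C', None, 'A0'), ('E', 'L1', 'A1'), ('C', None, 'B0'),
             ('E', 'L2', 'B1'), ('N', None, 'B2'), ('I', 'L2', 'B1'),
             ('N', None, 'B2'), ('X', 'L2', 'B3'), ('R', None, 'A2'),
             ('I', 'L1', 'A1')]                      # steps 2-11 of Fig. 3d
    for kind, L, B in trace:
        update(diiv, kind, L, B)
    print(diiv)   # [['M0', 'L1'], 1, ['A1']], i.e. (M0/L1, 1, A1)

Each intermediate state of this replay matches the dynamic IIV column of Fig. 3d, e.g. (M0/L1, 0, A1/L2, 1, B1) after step 7.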

Similar to Kelly's mapping, one can also construct a dynamic schedule tree from the dynamic IIVs of a program execution. Also, note that the schedule tree is for the dynamic IIVs what the calling-context-tree is for the set of calling-context-paths. The schedule tree for our example is shown in Fig. 3e. The relationship between these three structures is summarized in Fig. 5. As described in more detail further on, poly-prof exposes the schedule tree to the programmer in the form of a flame-graph [26], where the root of the tree is at the bottom and the leaves at the top.

Figure 5. Dynamic schedule tree. [(a) Relationship between dynamic IIVs, the dynamic schedule tree, and the calling-context-tree; (b) annotated flame-graph (backprop with libc).]

5 Compact Polyhedral DDG

In the DDG, every dynamic instruction is represented by one vertex and every data dependence by one edge. A program running on a modern CPU for just a few seconds can create billions of nodes in the DDG. Our folding algorithm compresses those large graphs into a form that is amenable to the polyhedral analysis performed by poly-prof's back-end. The details are presented in a separate paper [29].
Only the main properties are outlined here. Our folding algorithm borrows ideas from trace compression [36], with the notable difference that it is based on a geometric approach and is able to perform over-approximations, as briefly described below.

Folding interface. The output of the second stage of poly-prof (Instrumentation II) that feeds the folding stage is a stream of IIVs and labels. The IIV representation used at this stage differs slightly from the previous section. IIVs are split into two parts: 1. the context corresponds to the non-numerical part of the vector; 2. the coordinates correspond to the numerical part. Folding is then performed for each context separately. The folding algorithm takes as input, for each dynamic instruction and each dynamic dependence: 1. I, its iteration vector, which uniquely identifies it (i.e., its unique coordinates); 2. Label, a vector a(I) of associated integer values. As output it produces: 1. a union of polyhedra that represents the set of all I; 2. for each polyhedron P, an affine function A such that for all I in P, A(I) = a(I). For dynamic instructions, a(I) are the integer and pointer values produced by the instruction, if any. For dependencies, a(I) is the IV of the producer instruction.

Dependency recognition. Some of the streams compressed by the folding stage are dependencies. To illustrate its functionality on dependencies, consider bpnn_layerforward, a kernel of the backprop benchmark from the Rodinia benchmark suite. Part of the input stream and the corresponding folded output dependencies can be seen in Tables 1 and 2. Note that the iterators used by poly-prof are canonical ones (here cj and ck) computed on the fly through program instrumentation, which do not necessarily match the ones present in the original code (e.g. j of the outer loop starts at one).

1  for (j = 1; j <= n2)
2    sum = 0.0;
3    for (k = 0; k <= n1)
4      tmp1 = load(&conn + k)    // I1
5      tmp2 = load(tmp1 + j)     // I2
6      tmp3 = load(&l1 + k)      // I3
7      sum = sum + tmp2 * tmp3   // I4
8      k = k + 1                 // I5
9    tmp4 = call squash(sum);    // I6
10   store(&l2 + j, tmp4)        // I7
11   j = j + 1                   // I8

Figure 6. Pseudo-assembler for the first kernel of the backprop benchmark.

Table 1. Dependency input stream from the example in Fig. 6.

  I1 -> I2                 I2 -> I4                 I4 -> I4
  IV (cj,ck)  Label        IV (cj,ck)  Label        IV (cj,ck)  Label
  (0,0)       (0,0)        (0,0)       (0,0)
  (0,1)       (0,1)        (0,1)       (0,1)        (0,1)       (0,0)
  ...         ...          ...         ...          ...         ...

Table 2. Output of the folding algorithm for the dependency stream shown in Table 1.

  Id        Polyhedron (cj,ck)          Label expression f(cj,ck)
  I1 -> I2  0 <= cj <= 15, 0 <= ck <= 42   cj' = cj + 0ck, ck' = 0cj + ck
  I2 -> I4  0 <= cj <= 15, 0 <= ck <= 42   cj' = cj + 0ck, ck' = 0cj + ck
  I4 -> I4  0 <= cj <= 15, 1 <= ck <= 42   cj' = cj + 0ck, ck' = 0cj + ck - 1

SCEV recognition. At the machine code level, even very regular programs contain a large amount of "unimportant" code, such as instructions that increment loop counters or calculate addresses relative to a base pointer. Those are identified by compilers as scalar evolution expressions (SCEVs) [54, 69], as they can be expressed as functions of the canonical induction variables. Detecting SCEVs is important for two reasons: first, the chains of dependencies associated with their computation shall be ignored, as they otherwise greatly and unnecessarily constrain possible code transformations; second, it allows detecting an important class of memory access patterns, the strided accesses, which are used to evaluate the profitability of some loop transformations. If we detect during folding of an instruction that all the label values of a polyhedron can be expressed using a SCEV, that is, if the folding succeeds in constructing an affine function to express them, the corresponding points in the folded polyhedra of IIVs as well as the associated dependencies are removed from the DDG. This happens, for example, for instructions I5 and I8. For instruction I5, the folding algorithm finds a SCEV with value a(cj, ck) = 0cj + 1ck + 1.

Over-approximations. Even in programs where the hot region is affine, such as in PolyBench [56], profiling the entire benchmark reveals a large amount of non-regular parts. Keeping an accurate representation of non-regular parts causes scalability issues both in the profiling part and in the polyhedral back-end of poly-prof. Our idea to handle such non-regular parts in a scalable fashion is through over-approximation. This over-approximation, detailed in [29], concerns labels for non-affine dependencies or non-affine SCEVs, and polyhedra for dependencies and instructions with iteration domains with holes.
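The following sketch shows, on the I4 -> I4 stream of Table 1, the kind of exact affine fit the folding interface describes. It is a toy stand-in, not the algorithm of [29]: it fits one label component with a brute-force candidate-and-verify scheme rather than geometric folding.

    # Fit f(cj,ck) = a*cj + b*ck + c exactly matching all (IV -> label) pairs.
    def affine_fit(points):
        (x0, y0), f0 = points[0]
        a = b = None
        for (x, y), f in points[1:]:
            if y == y0 and x != x0 and a is None:
                a = (f - f0) // (x - x0)
            if x == x0 and y != y0 and b is None:
                b = (f - f0) // (y - y0)
        if a is None or b is None:
            return None
        c = f0 - a * x0 - b * y0
        if all(f == a * x + b * y + c for (x, y), f in points):
            return (a, b, c)
        return None   # not affine: a candidate for over-approximation

    # The ck' component of the I4 -> I4 dependence: producer ck' = ck - 1
    # over the polyhedron 0 <= cj <= 15, 1 <= ck <= 42 (Table 2).
    pts = [((cj, ck), ck - 1) for cj in range(16) for ck in range(1, 43)]
    print(affine_fit(pts))   # -> (0, 1, -1), i.e. ck' = 0cj + 1ck - 1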

6 DDG Polyhedral Feedback

An essential motivation for folding DDGs into polyhedral structures is to enable the use of advanced polyhedral compilation systems, which are capable of finding a schedule that maximizes parallelism, finds tiling opportunities, etc. [11].

Polyhedral compilation of folded DDGs. Typically, a polyhedral compiler is applied to small numerical kernels made of a handful of statements [11, 24, 48, 57]. Polyhedral schedulers face scalability challenges for larger programs [48]: their complexity typically grows exponentially with the number of statements in the input program. Our DDG folding and over-approximation techniques allow going from programs with thousands of statements (vastly exceeding the typical program scale these schedulers can handle) to only a few hundred.

Numerous customizations for scalability of the polyhedral compiler have been implemented, ranging from constraining the space of legal schedules to accelerate the scheduling process, to approximating the code generation process to quickly output a decorated, simplified AST describing the program structure after transformation. For example, the presence of large integer constants causes combinatorial blow-up in the ILP solver used to solve the scheduling problem [52]. We implemented a parameterization of iteration domains to replace those constants by a parameter (an unknown, but constant, integer value). That is, a domain {[i] : 0 <= i < 1024} is replaced by [n] -> {[i] : 0 <= i < n and n >= 1024} prior to scheduling. We control the number of parameters introduced by reusing the same parameter for a range of values, that is, if the value x lies in [1024 - s, 1024 + s] with s an integer (we typically set s = 20), then we replace x by n + (x - 1024).
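A minimal sketch of this parameter-reuse rule follows; the data representation (string expressions, a flat parameter list) is an assumption of the sketch, while the real implementation lives inside PoCC.

    S = 20            # reuse window around each parameterized constant
    params = []       # list of (parameter name, base value)

    def parameterize(x):
        for name, base in params:
            if abs(x - base) <= S:            # reuse an existing parameter
                return name if x == base else '%s + %d' % (name, x - base)
        name = 'n%d' % len(params)            # introduce a fresh parameter
        params.append((name, x))
        return name

    print(parameterize(1024))   # n0
    print(parameterize(1030))   # n0 + 6   (within the +/- 20 window)
    print(parameterize(4096))   # n1
    # so {[i] : 0 <= i < 1024} becomes [n0] -> {[i] : 0 <= i < n0 and n0 >= 1024}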

The reader may refer to the available implementation in PoCC [55] for further details about the simplifications implemented; the computation of profitability metrics is implemented in the PolyFeat module.

Final output. The main visual support used for reporting the aggregated feedback is the dynamic schedule-tree described in Sec. 4, along with a simplified AST that shows the code structure after the application of the suggested structured transformation. This AST embeds various metrics on instruction count, loop properties (parallelism and tilability) and the list of statements surrounded by every loop. This lets the user visualize the potential effort of manually writing the code structure corresponding to the application of the suggested transformation.

To better visualize the various program regions, metrics can weight each tree-node and be rendered with flame-graphs [26]. An example discussed in Sec. 7 is reported in Fig. 7. Here, the "surface" of a region in the flame graph is proportional to its estimated computation weight. Such flame graphs are SVG files, which can be clicked to obtain detailed information on each region/box. In the example of Fig. 7, loop/call nodes are marked with the label loop/call, respectively. Colors and gradients are used so that non-interesting regions can be grayed out. In Fig. 7, grayed regions are non-affine and blacklisted (initialization and extensive calls to libc) ones. Node width is used to highlight the hotness of regions. In our example, functions adjustweight and layerforward take substantially more space than other regions.

Figure 7. Annotated flame-graph for backprop (see Sec. 7). [Highlighted regions: layerforward (1st call), a 2D loop at 14% of execution, with suggested steps specialize, interchange, simd; adjustweight (2nd call), a 2D loop at 46%, with suggested steps specialize, interchange, omp parallel, simd.]

Numerous additional extensions have been developed to provide useful feedback to the user: various metrics about the size and structure of polyhedral regions in the DDG before (original program) and after (transformed program) applying an optimizing schedule, the possible code structure including the precise fusion/distribution scheme, a simplified sequence of core loop transformations that should be applied to achieve the target transformation, etc. Given the extensive textual length of the feedback we provide, an example is shown only in the supplementary document.

7 Case Studies

This section presents case studies that illustrate both the type of feedback provided by poly-prof and the different transformations it can suggest. Current QEMU [7], on which poly-prof is based, cannot handle newer AVX instructions. Feedback was obtained from binaries compiled with GCC 8.1.1 and the flags -g -O2 -msse3. Speedup measurements use the Intel icc and ifort compilers (version 18.0.3, flags -Ofast -march=native -mtune=native) on a machine with two 6-core Xeon Ivy Bridge CPUs (24 hyperthreads), each running at 2.1 GHz (GFlop/s are averaged over 50 runs).

Case study I. This case study illustrates a simple (structured) feedback: pinpointing that dependencies live within the first quadrant [4] allows exposing data locality and fine-grain parallelism through a loop interchange. For this study, we selected backprop, a supervised learning method used to train artificial neural networks, extracted from the Rodinia benchmark suite [15]. In addition to what was outlined in the previous case study, poly-prof exploits its ability to associate an iteration vector with each memory access, and to match pointer values with scalar evolution. Since the scalar evolution expressions for a memory access describe the addresses it reads/writes, they can be used to detect strided accesses along any dimension, which in turn can be used to detect vectorization potential.

The flame-graph provided by poly-prof is reported in Fig. 7, where the regions of interest are: 1. the first call (of two) to bpnn_layerfoward, with a constant value of n2 = 16; 2. the last call (of two) to bpnn_adjust_weights, with a constant value of ndelta = 16.
Both functions contain a 2D nested loop, called Llayer and Ladjust respectively, for which poly-prof suggests an interchange and SIMDization. For example, see Llayer in Tab. 3, for which the two loop dimensions are reported to be on lines 253 (outer) and 254 (inner) of file backprop.c: the loop nest is fully permutable, that is, interchange is possible; only the outermost loop is parallel; and there are more stride-0/1 accesses along the outermost dimension (100%) than along the innermost (67%).

Table 3. backprop case study. Reported lines (e.g. 52, 253, ...) are from debug information. The suggested interchange is represented using a permutation of code lines. Statistics/properties are given per loop dimension as (outer, inner).

  Fat regions                     %ops  interchange+SIMD      parallel   permutable  %stride 0/1  speedup
  backprop_kernel.c:52 (Llayer)   14%   backprop.c:(254,253)  (yes,no)   (yes,yes)   (100%, 50%)  5.3x
  backprop_kernel.c:57 (Ladjust)  46%   backprop.c:(322,320)  (yes,yes)  (yes,yes)   (100%, 50%)  7.8x

To enable the suggested transformations, one needs to specialize the two interesting function calls and array-expand the scalar sum. The other calls to bpnn_layerfoward and bpnn_adjust_weights are left unchanged, since they (a) take up much less of the overall program runtime and (b) have different arguments and do not profit from the transformation. Applying the transformation improved, in our case, bpnn_layerfoward from 0.5 GFlop/s to 2.8 GFlop/s and bpnn_adjust_weights from 0.3 GFlop/s to 5.1 GFlop/s.
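For illustration, here is one possible shape of the restructured bpnn_layerforward after interchange and array expansion of sum, written as a Python sketch for readability (the names conn, l1, l2, squash, n1, n2 follow Fig. 6; the exact specialized C code is not reproduced from the paper).

    def layerforward_interchanged(l1, l2, conn, n1, n2, squash):
        sum_ = [0.0] * (n2 + 1)            # array-expanded scalar `sum`
        for k in range(n1 + 1):            # interchanged: k is now outermost
            for j in range(1, n2 + 1):     # inner j loop: stride-1 on conn[k][j],
                sum_[j] += conn[k][j] * l1[k]   # hence SIMDizable
        for j in range(1, n2 + 1):
            l2[j] = squash(sum_[j])

After the interchange, the innermost loop walks conn[k][j] along j with stride 1, which is exactly the stride-0/1 improvement reported in Tab. 3.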
Case study II. This case study illustrates an advanced feedback: providing exact dependence vector "directions" allows exposing data locality and coarse-grain parallelism through loop skewing and tiling. For this study, we selected GemsFDTD, a finite difference time domain method from the SPEC CPU2006 benchmark suite [32], written in Fortran 90. Analyzing Fortran code is not a problem for poly-prof as it works at the binary level. However, the compiler we used (gfortran 8.1.1) messes up the debug information, making it necessary for the user to shift the line numbers of the provided code references by hand. This case study fully exploits poly-prof's ability to model (and compress in a polyhedral form) the data dependencies, instead of simply checking their existence/absence. This knowledge about the structure of dependencies allows poly-prof to check for tiling opportunities. First, poly-prof detects that four functions from the benchmark are fat (i.e., they execute a large amount of the program's total number of dynamic instructions): updateH_homo, updateE_homo, UPML_updateH, and UPML_updateE. Inside the first two of those functions are the five hottest loop nests, so we focus on them. As reported in Tab. 4, poly-prof annotates all five loops as fully parallel and tilable. So to obtain a speedup we tile each loop along all dimensions with a tile size of 32 and mark the outermost loop parallel with an OMP PARALLEL DO directive. We recall that tiled code can always also be coarse-grain parallelized using wavefront parallelism, as exploited by the Pluto polyhedral scheduler [11]. Tiling and parallelizing the loops increased performance in updateE_homo from 1.3 GFlop/s to 2.7 GFlop/s and in updateH_homo from 1.3 GFlop/s to 3.7 GFlop/s.

Table 4. GemsFDTD case study. Reported lines are shifted debug info.

  Fat regions     %ops  tiling                     speedup
  update.F90:106  20%   update.F90:{106,107,121}   2.6x
  update.F90:240  18%   update.F90:{240,241,244}   1.9x

8 Experiments

The goal of this section is to demonstrate that poly-prof can be systematically applied on a full benchmark suite and find potential for optimization. Note that the output of poly-prof for each benchmark is extensive: flame graph, statistics on each sub-region, potential structured transformation(s), simplified annotated AST after the application of the transformation, complete AST, etc. Consequently, we only illustrate here the application of poly-prof on Rodinia by using aggregate metrics. These metrics are not meant to be used as-is by the end user: instead, the user is expected to work on one benchmark at a time and navigate the feedback we provide.

Experiment I. In addition to the case studies, we evaluate our tool-chain using the latest revision (3.1) of the Rodinia benchmark suite [14, 15]. As poly-prof does not support multithreaded applications yet, each benchmark is manually modified to run in a single thread, and compiled using the same compiler and flags as the case studies (GCC 8.1.1, -g -O2 -msse3).

The dynamic analysis is based on the QEMU-plugin instrumentation interface [30]. We have extended the interface itself to support multiple interacting C/C++ plugins. Plugins primarily work at the level of the generic QEMU compiler instructions, making them CPU-architecture agnostic. The choice of using QEMU for instrumenting the code is orthogonal to our contribution. There are other candidates (e.g. Valgrind [51] or Pin [45]) for implementing a dynamic binary analysis tool, each having their own advantages and disadvantages. Since we use a shadow memory to track data dependencies, dynamic analysis obviously does not come for free. As an example, the total CPU time (summed over all cores) on our server required by the first three stages of poly-prof to analyze the full Rodinia benchmark suite is 3h06' (the full execution, including libc, was instrumented).

We have built our polyhedral feedback pass described in Sec. 6 in the PoCC compiler [55], which implements the PluTo scheduler [11], along with a new polyhedral DDG analysis and optimization in PoCC/PolyFeat [55]. Extensions for scalability have been implemented in the schedulers and code generator, and in PolyFeat.

Experiment II. To show the problems static approaches encounter with the Rodinia suite, we also ran the static LLVM-based [40] polyhedral compiler Polly [28] over the entire suite. We used Polly version 7.0.1 and the flags -O3 -ffast-math -polly-process-unprofitable. Kernels that span multiple functions were inlined to allow Polly to see the same code region as poly-prof. Calls to functions from libc or the OpenMP runtime were not inlined. Where such calls are present, this usually results in Polly being unable to analyze the kernel, though it can handle calls to simple functions such as exp or sqrt. Polly was unable to build a polyhedral model of the whole region of interest for any of the 19 benchmarks. It was able to model some smaller subregions, 1D or 2D loop nests, in most benchmarks, but in nearly all cases its own profitability metric decided not to optimize them.
sub-region, potential structured transformation(s), simplified Two notable exceptions are the heartwall and lud bench- annotated AST after the application of the transformation, marks. In heartwall Polly was able to model a sequence of complete AST, etc. Consequently, we only illustrate here the nine 2D loop nests which accounts for roughly two thirds application of poly-prof on Rodinia by using aggregate met- of the code in the body of the kernel. For lud it managed to rics. These metrics are not meant to be used as is by the end model the whole inner 3D loop nest of the kernel, but not user: instead, the user is expected to work on one benchmark the outermost loop. In all benchmarks the inability to model at a time, and navigate the feedback we provide. the outermost loops blocks Polly from exploiting the thread level parallelism inherent in the Rodinia benchmark suite. Experiment I In addition to the case studies, we evalu- Summary statistics. Table 5 presents summary statistics ate below our tool-chain using the latest revision (3.1) of about the Rodinia 3.1 (CPU only) benchmarking suite, that the Rodinia benchmark suite [14, 15]. As poly-prof does we computed/aggregated by processing the feedback from not support multithreaded applications yet, each benchmark poly-prof and Polly on each benchmark. is manually modified to run in a single thread, and com- Column %Aff reports the percentage of dynamic oper- piled using the same compiler and flags as the case studies ations that are part of a fully affine region without over- (GCC 8.1.1, -g -O2 -msse3). approximation. The low proportion of affine code reported 9 i eatc-rsrigsrcue transformations. structured semantics-preserving via oe htaesrd- rsrd- nteeitn innermost existing the in stride-1 or stride-0 are that tores program: the in available loops. innermost parallel in occur that re- operations is of loop a when Similarly, at exposed parallel. iterations, be ported two can least blocks at parallel has two loop considered least a are As operations parallelizable. its be all to then parallel, dimension as loop detected non-inner is a If pragmas. us- parallel parallelized OpenMP be ing can that operations dynamic of percentage provides tilability. really and it parallelism unless in skewing improvements avoid transfor- to proposed tend the we in mation, used is skewing whether displays aliasing; pointer possible indirection); pointer (includes function statements; access conditional non-affine or bound an as region as: call; coded are whole the They model program. to affine unable was Polly LLVM operation. load an both perform can and instruction memory single to/from a store and x86 on since 100% of than sum itself. the region the of that Note operations point) floating (resp. memory the inside executed while program region, the of or operations dynamic libc of to call Column a runtime. contained OpenMP it the if or transformation the perform (column reported procedural is reference column code the The in hand. transformation by a selected suggests been optimizer has the which for region as a whole. the program for and wecan transformations thoseregions, find for affine, potentially not Note over-approximations are affine. 
affine benchmark find fully a still of as parts recognized when even not that, are ex- so modulo use and bounds pressions whose loops contain nested programs linearized These hand time: folding at lattices supporting for The achieved be can what shows metrics of group next The column The by provided statistics the on Based ahidr6 48M67% M 4% 62 M 71 M 42 pathfinder 98% M 372 lud M 251 cfd akrp1 0M85% M 10 M 10 backprop rdv 99% 98% 97% G 2 1% 99% M G 864 1 M M 2 93 M G 600 1 M 779 M27% M 89% 80 M 1 streamcluster 678 M 39% K 866 srad_v2 0% srad_v1 628 G 9 M 1 G 1 97% particlefilter 0% G nw 4 1% M 647 99% nn M 21% 879 M 16 myocyte G M 8 513 M 49% 256 M 1 M 11 leukocyte M M 24 210 G 4 lavaMD M 2 kmeans M 23 hotspot3D hotspot heartwall b+tree bfs heartwall Bench. C. %reuse ope F ( CFG complex al 5. Table %Mops / , esn h ol failed Polly why Reasons %Preuse Region hotspot #V interproc. umr ttsiscmue from computed statistics Summary and ecniee eint einter- be to region a considered We . %Mops

break #E erc eotsaelclt htis that locality space report metrics and , %reuse P. %FPops %simdops

aepitrntlo invariant. loop not pointer base %Aff lud / filnn a eurdto required was inlining if ) _m.p:299%6 3 Y Y Y 13% 92% 93% N 6% 31% 31% 14% 16%N 96% Y 99% 99% Y 5% Y 83% 27% Y 80% 99% 70% *_omp.cpp:1269 73% srad.cpp:114 80% 31% 71% 44% main.c:241 Y N 79% 99% 42% 99% pathfinder.cpp:99 N 81% 64% *_seq.c:593 31% Y Y 92% needle.cpp:308 28% 64% nn_openmp.c:119 98% 69% 56% Y 49% 37% N main.c:283 56% Y 99% lud.c:121 N 38% 89% 0% detect_main.c:51 97% 99% 99% 18% Y 35% kernel_cpu.c:123 99% *_clustering.c:160 42% 66% 81% 76% 3D.c:261 26% 98% 55% *_openmp.cpp:318 30% main.c:536 67% *3d_cpu.cpp:480 main.c:2345 bfs.cpp:137 facetrain.c:25 return %ops and Region stepretg fload/s- of percentage the is stecneuneo not of consequence the is eot h ecnaeof percentage the reports %FPops R. eot h percentage the reports eot h percentage the reports poly-prof ); it h esn why reasons the lists nade function unhandled B. % o-fieloop non-affine a egreater be can ||

A. %ops F. ops h biggest the , non-affine unhandled

ie the gives %Mops skew %FPops poly-prof interproc. 10 RCBFAP RCBFAP RCBF CBA RFA Reasons CF BP BF BF BF BF BF RF RF RF RF A B F why n uems opwt oeta %o h oa region total the of 5% than more with loop outermost Any %Tilops hsscindsrbspeiu okrltdt profiling, to related work previous describes section This htmdfe h opdph(.. ullo noln for full (e.g., depth loop the transformation modifies any that performed compiler the whether shows ( code source permutations. loop of sequence a via stride-0 made stride-1 be or can that operations load/store of percentage while program, the in loops ntuto.Vrositrae uha PToKt[ HPCToolKit machine as of such granularity interfaces Various the at instruction. rate) miss instruction cache as or (such throughput statistics provide sampling, using and, debug- performance [ [ Perf for optimization guiding both for technique and ging used widely a is tools feedback Performance compression. trace and debugging, performance Work Related 9 displayed. is result no therefore and stage, uling strategy). fusion/distribution balanced somewhat in reported heuristic fusion fusion the using struc- transformation, proposed the tured applying after code; components of binary number the the in components reported. of be number will the components reports 2 each then executing operations, nests the loop of consecutive half the two if of example, made is For “component”. region one as counts operation code. atransformed implement tomanually indication difficulty an the is of and proposed, transformation from structured originates the that structure fusion/distribution loop the tiling/interchange. through ments the in illustrated study as but potential, improved. locality be could reuse data poly-prof and possible, is parallelism ( depth tiling maximal the Next Polly failed erpr h aia opdpho h eini the in region the of depth loop maximal the report We oethat Note metrics Finally, (wavefront) coarse-grain tiled, be can region a as soon As sfebc nteRdnabnhaksuite. benchmark Rodinia the on feedback ’s N N N N N N N N N N N N N N N Y Y Y - 18 skew %reuse ( h ecnaeo prtosta a etiled. be can that operations of percentage the , rItlVue[ VTune Intel or ] M 0%10 4 14% 40% 14% 0% 0% 100% 77% 0% 100% 47% 0% 77% 100% 100% 47% 0% 63% 0% 100% 99% 63% 100% 53% 11% 0% 100% 3% 100% 46% 0% 100% 11% 100% 100% 42% 3% 100% 99% 0% 100% 44% 1% 100% 18% 100% 100% 44% 100% 61% 100% 1% 100% 100% 50% 100% 100% 100% 100% 100% 100% 9 0%1%18% 18% 55% 100% 55% 1% 99% 100% 99% 0% 98% 99% - - - - o aia opfso,and fusion, loop maximal for osntcretypoiefebc ntemporal on feedback provide currently not does %| |ops streamcluster ld-src lost vlaesaillclt improve- locality spatial evaluate to allows %simdops C n ntebnr oe( code binary the in and ) /

9 Related Work
This section describes previous work related to profiling, performance debugging, and trace compression.

Using profiling feedback Performance feedback is a widely used technique both for performance debugging and for guiding optimization, known as feedback-directed optimization (FDO) [64]. Tools such as Perf [18] or Intel VTune [59] gather hardware counter values and, using sampling, provide statistics (such as throughput or cache miss rate) at the granularity of the machine instruction. Various interfaces such as HPCToolKit [1] allow navigating through the gathered statistics thanks to a best-effort mapping from binary to source code. Compiler-based profiling tools, such as gprof [25], make it possible to provide complementary statistics (such as branch probabilities). While these allow pinpointing important parts of the code where the programmer or compiler should focus their attention, the metrics they report are very low level, and consequently they leave the role of finding optimizing code transformations to the programmer or to compiler heuristics [65]. Recent work such as MAQAO [13, 19], AutoSCOPE [65] or MIAMI [46] combines these metrics with static analysis of the binary/source code. Intel IACA [17], a purely static analysis tool, uses a precise machine model, which the vendor does not publish, to pinpoint resource bottlenecks and predict the performance of snippets of binary code.

Dynamic data-flow analyses have been presented to provide useful performance feedback. The detection of parallelism (such as vectorization) along canonical directions has particularly been investigated [3, 12, 21, 37, 38, 41, 67, 70, 74], as it requires only relatively localized information. Another use case is the evaluation of effective reuse [8, 10, 44, 46] with the objective of pinpointing data-locality problems. The APOLLO [34, 47] framework uses an interesting technique that corresponds to detecting dynamic regularities (affine expressions) of memory accesses in nested loops, and uses this information to perform speculative loop transformation; it differs in scope and challenges from the profiling feedback tool developed in this paper.
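To give a flavor of what detecting such dynamic regularities involves, here is a minimal illustrative sketch (not APOLLO's actual implementation) that checks online whether the addresses produced by one static memory access fit an affine expression base + i * stride of the enclosing loop counter i:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Online recognizer for one static memory access: decides whether the
 * addresses observed at iterations 0, 1, 2, ... fit base + i * stride. */
typedef struct {
    uintptr_t base;    /* address observed at iteration 0 */
    intptr_t  stride;  /* candidate stride, fixed at iteration 1 */
    size_t    count;   /* number of addresses observed so far */
    bool      affine;  /* still consistent with an affine expression? */
} affine_tracker;

void tracker_init(affine_tracker *t) {
    t->count = 0;
    t->affine = true;
}

/* Feed the address observed at the next iteration of the loop. */
void tracker_observe(affine_tracker *t, uintptr_t addr) {
    if (t->count == 0) {
        t->base = addr;
    } else if (t->count == 1) {
        t->stride = (intptr_t)(addr - t->base);
    } else if (t->affine) {
        uintptr_t predicted =
            t->base + (uintptr_t)((intptr_t)t->count * t->stride);
        if (addr != predicted)
            t->affine = false;   /* fall back to treating it as irregular */
    }
    t->count++;
}

An access that stays affine can be summarized by the pair (base, stride) instead of a full address trace, which is also the intuition behind trace compression by nested loop recognition [36].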

Polyhedral compilation Integer linear algebra is a natural formalism for representing the computation space of a loop nest. The polyhedral framework [23] leverages, among others, operators on polyhedra (e.g. union, intersection, projection), enumeration (for code generation [6]), and parametric integer linear programming [22] (for dependence analysis [16]). Historically, it was designed to work on restricted programming languages and was used as a framework to perform source-to-source transformations. More recently, efforts have been made to integrate the technology in mainstream compilers (e.g. Graphite [53, 68] for GCC [27] and Polly [28] for LLVM [40]). The set of loop transformations (known as affine transformations) that the polyhedral model can perform is wide and covers most of the important ones for exposing locality and parallelism to improve performance [11]. Tools such as PoCC [55] provide a convenient interface to the numerous existing state-of-the-art scheduling heuristics and to polyhedral libraries such as ISL [71].
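As a small worked example of this formalism (ours, for illustration only), consider a loop nest over counters i and j. Its iteration domain and an affine schedule performing a loop interchange can be written as

\[
\mathcal{D} = \{\, (i,j) \in \mathbb{Z}^2 \mid 0 \le i < N \wedge 0 \le j < M \,\},
\qquad
\theta(i,j) = (j,\, i),
\]

where \(\mathcal{D}\) is the set of integer points scanned by the loops and the schedule \(\theta\) assigns each point its new lexicographic execution order. The interchange is legal if and only if every dependence from instance \((i,j)\) to instance \((i',j')\) satisfies \(\theta(i,j) \prec_{\text{lex}} \theta(i',j')\), i.e. the source is still executed before the sink.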

Dynamic dependence graph Shadow memory [76] records a piece of information for each storage location used in a program. For dependency tracking this is usually the last statement or dynamic instruction that modified that location. But shadow memories are also a core component of memory error debugging tools [51, 62]. Like poly-prof, Redux [50] builds a data-flow graph from binary-level programs. It has, however, no notion of loops or calling contexts, does not try to compress the produced graph, and is consequently only able to handle very small programs.

A few techniques exist to address the high overhead of monitoring data dependencies. The most common is the use of static analysis so as to remove redundant instrumentation [42], or to avoid monitoring must-dependencies [37]. Another approach, which leads to over-approximation, consists of reducing the size of the shadow memory through the use of signature-based addressing [43]. The most sophisticated technique is the one developed in SD3 [39], which amounts to detecting strided memory accesses to compress the shadowing. A last technique that could be used for our purposes is the parallelization of shadowing, as done in [42, 49].
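As an illustration of the last-writer shadow-memory scheme described above, here is a minimal sketch (not poly-prof's actual data structure; instr_id and the toy direct-mapped table are hypothetical):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical identifier of a dynamic instruction; 0 means that no
 * writer has been recorded yet for a location. */
typedef uint64_t instr_id;

#define SHADOW_SLOTS (1u << 20)              /* toy direct-mapped table */
static instr_id shadow_last_writer[SHADOW_SLOTS];

static size_t shadow_index(uintptr_t addr) {
    return (addr >> 3) & (SHADOW_SLOTS - 1); /* 8-byte granularity */
}

/* Called on every store: remember who last wrote this location. */
void on_store(instr_id writer, uintptr_t addr) {
    shadow_last_writer[shadow_index(addr)] = writer;
}

/* Called on every load: the recorded last writer, if any, is the source
 * of a read-after-write dependence edge ending at 'reader'. */
void on_load(instr_id reader, uintptr_t addr) {
    instr_id writer = shadow_last_writer[shadow_index(addr)];
    if (writer != 0)
        printf("RAW dependence: %llu -> %llu\n",
               (unsigned long long)writer, (unsigned long long)reader);
}

Because the direct-mapped table aliases distinct addresses, this sketch over-approximates dependences, in the same spirit as the signature-based addressing of [43]; a full profiler would rather use a sparse multi-level shadow map, possibly compressed by stride detection as in SD3 [39].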
Calling context tree Using calling context trees (CCTs) to disambiguate instructions in different calling contexts is an idea from Ammons, Ball and Larus [2]. But in the presence of recursive functions the size of their CCTs grows proportionally with the depth of the recursion, leading to an unreasonably high memory overhead. Loop-call context trees [60] simply encode the calling context of intraprocedural loops and suffer from the same size problems in recursive programs.
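To make the size argument concrete, a CCT node could look as follows (our own illustrative sketch, not the representation of [2]); since each call descends into a child keyed by the callee, a recursive chain f -> f -> ... of depth d materializes d distinct nodes:

#include <stdlib.h>

/* One node of a calling context tree: a function identifier plus one
 * child per distinct callee invoked from this context. A realistic
 * implementation would use a hash table instead of a linear scan. */
typedef struct cct_node {
    int func_id;                  /* function this context ends in */
    struct cct_node *parent;
    struct cct_node **children;
    int n_children;
} cct_node;

/* On a call, descend to (or create) the child for 'callee'. */
cct_node *cct_enter(cct_node *cur, int callee) {
    for (int k = 0; k < cur->n_children; k++)
        if (cur->children[k]->func_id == callee)
            return cur->children[k];
    cct_node *child = calloc(1, sizeof *child);
    child->func_id = callee;
    child->parent = cur;
    cur->children = realloc(cur->children,
                            (cur->n_children + 1) * sizeof *cur->children);
    cur->children[cur->n_children++] = child;
    return child;
}

/* On a return, pop back to the parent context. */
cct_node *cct_leave(cct_node *cur) { return cur->parent; }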
10 Conclusion and Future Work
In this paper we introduced poly-prof, a profiling-based polyhedral optimization feedback tool. poly-prof tackles the problem of dynamic data-dependence profiling and polyhedral optimization from binary programs, while addressing the associated scalability challenges of handling traces with billions of operations, as we demonstrated. Our tool handles non-regular programs with recursive function calls. Numerous technical contributions as presented in this paper were required to enable structured transformation feedback on binary programs, a significant step in complexity compared to prior approaches, which are typically limited to unstructured transformations. We implemented numerous feedback reporting schemes for the user in poly-prof: flame graphs, statistics on each sub-region, the proposed potential structured transformation, a simplified annotated AST after the application of the transformation, the complete AST, etc.

While this represents an important step towards being able to provide polyhedral feedback on full-scale applications, there are issues that still need to be addressed. This includes the development of under-approximation schemes in the DDG, as well as overall scalability enhancements, like using approximate (non-optimal) polyhedral scheduling strategies. Lastly, the process of mapping our results from the machine code back to the source code is in itself a research topic, and currently we do not provide much more than what objdump does. Our ongoing efforts in this direction leverage polyhedral program equivalence techniques [5, 61, 73].

Acknowledgements
This work was supported in part by the U.S. National Science Foundation awards CCF-1645514, CCF-1731612 and CCF-1750399, and by the LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01) funded by the French program Investissement d'avenir.

References
[1] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. 2010. HPCTOOLKIT: tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience 22, 6 (2010), 685-701.
[2] Glenn Ammons, Thomas Ball, and James R. Larus. 1997. Exploiting Hardware Performance Counters with Flow and Context Sensitive Profiling. In Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language Design and Implementation (PLDI '97).
[3] Ran Ao, Guangming Tan, and Mingyu Chen. 2013. ParaInsight: An Assistant for Quantitatively Analyzing Multi-granularity Parallel Region. In High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on. IEEE, 698-707.
[4] Utpal K. Banerjee. 1988. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Norwell, MA, USA.
[5] Wenlei Bao, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, and P. Sadayappan. 2016. PolyCheck: Dynamic Verification of Iteration Space Transformations on Affine Programs. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '16). ACM, New York, NY, USA, 539-554.
[6] Cédric Bastoul. 2004. Generating loops for scanning polyhedra: Cloog users guide. Polyhedron 2 (2004), 10.
[7] Fabrice Bellard. 2005. QEMU, a Fast and Portable Dynamic Translator. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (ATEC '05).
[8] Erik Berg and Erik Hagersten. 2005. Fast data-locality profiling of native execution. In ACM SIGMETRICS Performance Evaluation Review, Vol. 33. ACM, 169-180.
[9] Kristof Beyls and Erik H. D'Hollander. 2006. Discovery of Locality-improving Refactorings by Reuse Path Analysis. In Proceedings of the Second International Conference on High Performance Computing and Communications (HPCC '06). Springer-Verlag, Berlin, Heidelberg, 220-229. https://doi.org/10.1007/11847366_23
[10] Kristof Beyls and Erik D'Hollander. 2006. Discovery of Locality-Improving Refactorings by Reuse Path Analysis. High Performance Computing and Communications (2006), 220-229.
[11] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Program Optimization System. In PLDI.
[12] Khansa Butt, Abdul Qadeer, Ghulam Mustafa, and Abdul Waheed. 2012. Runtime analysis of application binaries for function level parallelism potential using QEMU. In Open Source Systems and Technologies (ICOSST), 2012 International Conference on. IEEE, 33-39.
[13] Andres S. Charif-Rubial, Emmanuel Oseret, José Noudohouenou, William Jalby, and Ghislain Lartigue. 2014. CQA: A code quality analyzer tool at binary level. In High Performance Computing (HiPC), 2014 21st International Conference on.
[14] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on.
[15] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, Liang Wang, and K. Skadron. 2010. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. In Workload Characterization (IISWC), 2010 IEEE International Symposium on.
[16] Jean-François Collard, Denis Barthou, and Paul Feautrier. 1995. Fuzzy array dataflow analysis. In ACM SIGPLAN Notices, Vol. 30. ACM, 92-101.
[17] Intel Corporation. 2009. Intel Architecture Code Analyzer - User's Guide.
[18] Arnaldo Carvalho de Melo. 2010. The new Linux 'perf' tools. In Slides from Linux Kongress, Vol. 18.
[19] Lamia Djoudi, Denis Barthou, Patrick Carribault, Christophe Lemuet, Jean-Thomas Acquaviva, William Jalby, et al. 2005. MAQAO: Modular assembler quality analyzer and optimizer for Itanium 2. In The 4th Workshop on EPIC architectures and compiler technology, San Jose, Vol. 200.
[20] Johannes Doerfert, Clemens Hammacher, Kevin Streit, and Sebastian Hack. 2013. Spolly: speculative optimizations in the polyhedral model. IMPACT 2013 (2013), 55.
[21] Karl-Filip Faxén, Konstantin Popov, Sverker Jansson, and Lars Albertsson. 2008. Embla: data dependence profiling for parallel programming. In Complex, Intelligent and Software Intensive Systems, 2008. CISIS 2008. International Conference on. IEEE, 780-785.
[22] Paul Feautrier. 1988. Parametric integer programming. RAIRO-Operations Research 22, 3 (1988), 243-268.
[23] Paul Feautrier and Christian Lengauer. 2011. Polyhedron model. In Encyclopedia of Parallel Computing. Springer, 1581-1592.
[24] Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, and Olivier Temam. 2006. Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies. Intl. J. of Parallel Programming 34, 3 (2006).
[25] Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. 1982. Gprof: A call graph execution profiler. In ACM SIGPLAN Notices, Vol. 17. ACM, 120-126.
[26] Brendan Gregg. 2016. The flame graph. Commun. ACM 59, 6 (2016), 48-57.
[27] Arthur Griffith. 2002. GCC: the complete reference. McGraw-Hill, Inc.
[28] Tobias Grosser, Armin Groesslinger, and Christian Lengauer. 2012. Polly: Performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters 22, 4 (2012).
[29] F. Gruber, M. Selva, D. Sampaio, C. Guillon, L.-N. Pouchet, and F. Rastello. 2019. Building of a Polyhedral Representation from an Instrumented Execution: Making Dynamic Analyses of non-Affine Programs Scalable. Technical Report RR-9244. Inria.
[30] Christophe Guillon. 2011. Program Instrumentation with QEMU. In Proceedings of the International QEMU User's Forum (QUF '11).
[31] Paul Havlak. 1997. Nesting of Reducible and Irreducible Loops. ACM Trans. Program. Lang. Syst. 19, 4 (July 1997).
[32] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1-17.
[33] Justin Holewinski, Ragavendar Ramamurthi, Mahesh Ravishankar, Naznin Fauzia, Louis-Noël Pouchet, Atanas Rountev, and P. Sadayappan. 2012. Dynamic trace-based analysis of vectorization potential of applications. ACM SIGPLAN Notices 47, 6 (2012), 371-382.
[34] Alexandra Jimborean, Luis Mastrangelo, Vincent Loechner, and Philippe Clauss. 2012. VMAD: An Advanced Dynamic Program Analysis and Instrumentation Framework.
[35] W. Kelly and W. Pugh. 1995. A unifying framework for iteration reordering transformations. In Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing.
[36] Alain Ketterlin and Philippe Clauss. 2008. Prediction and Trace Compression of Data Access Addresses Through Nested Loop Recognition. In Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '08).
[37] A. Ketterlin and P. Clauss. 2012. Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. 437-448.
[38] Minjang Kim, Hyesoon Kim, and Chi-Keung Luk. 2010. Prospector: A dynamic data-dependence profiler to help parallel programming. In HotPar '10: Proceedings of the USENIX workshop on Hot Topics in parallelism.
[39] Minjang Kim, Nagesh B. Lakshminarayana, Hyesoon Kim, and Chi-Keung Luk. 2013. SD3: An Efficient Dynamic Data-Dependence Profiling Mechanism. IEEE Trans. Comput. 62, 12 (2013), 2516-2530.
[40] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (CGO '04).
[41] Zhen Li, Rohit Atre, Zia Ul-Huda, Ali Jannesari, and Felix Wolf. 2015. DiscoPoP: A profiling tool to identify parallelization opportunities. In Tools for High Performance Computing 2014. Springer, 37-54.
[42] Zhen Li, Michael Beaumont, Ali Jannesari, and Felix Wolf. 2015. Fast data-dependence profiling by skipping repeatedly executed memory operations. In International Conference on Algorithms and Architectures for Parallel Processing. Springer, 583-596.
[43] Zhen Li, Ali Jannesari, and Felix Wolf. 2015. An efficient data-dependence profiler for sequential and parallel programs. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International. IEEE, 484-493.
[44] Xu Liu and John Mellor-Crummey. 2011. Pinpointing data locality problems using data-centric analysis. In Code Generation and Optimization (CGO), 2011 9th Annual IEEE/ACM International Symposium on. IEEE, 171-180.
[45] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Notices, Vol. 40. ACM, 190-200.
[46] G. Marin, J. Dongarra, and D. Terpstra. 2014. MIAMI: A framework for application performance diagnosis. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[47] Juan Manuel Martinez Caamaño, Manuel Selva, Philippe Clauss, Artyom Baloian, and Willy Wolff. 2017. Full runtime polyhedral optimizing loop transformations with the generation, instantiation, and scheduling of code-bones. Concurrency and Computation: Practice and Experience 29, 15 (2017), e4192.
[48] Sanyam Mehta and Pen-Chung Yew. 2015. Improving Compiler Scalability: Optimizing Large Programs at Small Price. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '15). ACM, New York, NY, USA, 143-152. https://doi.org/10.1145/2737924.2737954
[49] Tipp Moseley, Alex Shye, Vijay Janapa Reddi, Dirk Grunwald, and Ramesh Peri. 2007. Shadow profiling: Hiding instrumentation costs with parallelism. In Code Generation and Optimization, 2007. CGO '07. International Symposium on. IEEE, 198-208.
[50] Nicholas Nethercote and Alan Mycroft. 2003. Redux: A dynamic dataflow tracer. Electronic Notes in Theoretical Computer Science 89, 2 (2003), 149-170.
[51] Nicholas Nethercote and Julian Seward. 2007. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation. SIGPLAN Not. (June 2007).
[52] Christos H. Papadimitriou and Kenneth Steiglitz. 1998. Combinatorial optimization: algorithms and complexity. Courier Corporation.
[53] Sebastian Pop, Albert Cohen, Cédric Bastoul, Sylvain Girbal, Georges-André Silber, and Nicolas Vasilache. 2006. GRAPHITE: Polyhedral analyses and optimizations for GCC. In Proceedings of the 2006 GCC Developers Summit.
[54] Sebastian Pop, Albert Cohen, and Georges-André Silber. 2005. Induction Variable Analysis with Delayed Abstractions. In Proceedings of the First International Conference on High Performance Embedded Architectures and Compilers (HiPEAC '05).
[55] Louis-Noël Pouchet. [n.d.]. The PoCC polyhedral compiler collection. http://pocc.sourceforge.net.
[56] Louis-Noël Pouchet. 2017. PolyBench: The polyhedral benchmark suite (version 4.2). http://polybench.sf.net (Accessed: 2017-09-13).
[57] Benoît Pradelle, Benoît Meister, Muthu Baskaran, Athanasios Konstantinidis, Thomas Henretty, and Richard Lethin. 2016. Scalable hierarchical polyhedral compilation. In Parallel Processing (ICPP), 2016 45th International Conference on. IEEE, 432-441.
[58] Ganesan Ramalingam. 2002. On Loops, Dominators, and Dominance Frontiers. ACM Trans. Program. Lang. Syst. 24, 5 (Sept. 2002).
[59] James Reinders. 2005. VTune performance analyzer essentials. Intel Press (2005).
[60] Yukinori Sato, Yasushi Inoguchi, and Tadao Nakamura. 2011. On-the-fly Detection of Precise Loop Nests Across Procedures on a Dynamic Binary Translation System. In Proceedings of the 8th ACM International Conference on Computing Frontiers (CF '11).
[61] Markus Schordan, Pei-Hung Lin, Dan Quinlan, and Louis-Noël Pouchet. 2014. Verification of polyhedral optimizations with constant loop bounds in finite state space computations. In International Symposium On Leveraging Applications of Formal Methods, Verification and Validation. Springer, 493-508.
[62] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov. 2012. AddressSanitizer: A Fast Address Sanity Checker. In USENIX Annual Technical Conference. 309-318.
[63] Andreas Simbürger, Sven Apel, Armin Größlinger, and Christian Lengauer. 2018. PolyJIT: Polyhedral Optimization Just in Time. International Journal of Parallel Programming (Aug 2018), 33.
[64] Michael D. Smith. 2000. Overcoming the Challenges to Feedback-directed Optimization (Keynote Talk). SIGPLAN Not. 35, 7 (Jan. 2000), 1-11.
[65] O. A. Sopeju, M. Burtscher, A. Rane, and J. Browne. 2011. AutoSCOPE: Automatic Suggestions for Code Optimizations Using PerfExpert. (July 2011).
[66] O. Tange. 2011. GNU Parallel: The Command-Line Power Tool. ;login: The USENIX Magazine 36, 1 (Feb 2011), 42-47. http://www.gnu.org/s/parallel
[67] Georgios Tournavitis and Björn Franke. 2010. Semi-automatic extraction and exploitation of hierarchical pipeline parallelism using profiling information. In Parallel Architectures and Compilation Techniques (PACT), 2010 19th International Conference on. IEEE, 377-388.
[68] Konrad Trifunovic, Albert Cohen, David Edelsohn, Feng Li, Tobias Grosser, Harsha Jagasia, Razya Ladelsky, Sebastian Pop, Jan Sjödin, and Ramakrishna Upadrasta. 2010. GRAPHITE Two Years After: First Lessons Learned From Real-World Polyhedral Compilation. In GCC Research Opportunities Workshop (GROW '10). ACM, Pisa, Italy.
[69] Robert A. Van Engelen. 2001. Efficient symbolic analysis for optimizing compilers. In International Conference on Compiler Construction. Springer.
[70] Hans Vandierendonck, Sean Rul, and Koen De Bosschere. 2010. The Paralax infrastructure: automatic parallelization with a helping hand. In Parallel Architectures and Compilation Techniques (PACT), 2010 19th International Conference on. IEEE, 389-399.
[71] Sven Verdoolaege. 2010. isl: An Integer Set Library for the Polyhedral Model. In Proceedings of the Third International Congress on Mathematical Software (ICMS 2010).
[72] Sven Verdoolaege, Serge Guelton, Tobias Grosser, and Albert Cohen. 2014. Schedule trees. (2014).
[73] Sven Verdoolaege, Gerda Janssens, and Maurice Bruynooghe. 2012. Equivalence checking of static affine programs using widening to handle recurrences. ACM Transactions on Programming Languages and Systems (TOPLAS) 34, 3 (2012), 11.
[74] Zheng Wang, Georgios Tournavitis, Björn Franke, and Michael F. P. O'Boyle. 2014. Integrating profile-driven parallelism detection and machine-learning-based mapping. ACM Transactions on Architecture and Code Optimization (TACO) 11, 1 (2014), 2.
[75] Michael E. Wolf and Monica S. Lam. 1991. A data locality optimizing algorithm. In ACM SIGPLAN Notices, Vol. 26. ACM, 30-44.
[76] Qin Zhao, Derek Bruening, and Saman Amarasinghe. 2010. Umbra: Efficient and scalable memory shadowing. In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization. ACM, 22-31.
