A general compilation algorithm to parallelize and optimize counted loops with dynamic data-dependent bounds


Jie Zhao and Albert Cohen
INRIA & DI, École Normale Supérieure
45 rue d'Ulm, 75005 Paris

ABSTRACT

We study the parallelizing compilation and loop nest optimization of an important class of programs where counted loops have a dynamically computed, data-dependent upper bound. Such loops are amenable to a wider set of transformations than general while loops with inductively defined termination conditions: for example, the substitution of closed forms for induction variables remains applicable, removing the data dependences induced by termination conditions. We propose an automatic compilation approach to parallelize and optimize dynamic counted loops. Our approach relies on affine relations only, as implemented in state-of-the-art polyhedral libraries. Revisiting a state-of-the-art framework to parallelize arbitrary while loops, we introduce additional control dependences on data-dependent predicates. Our method goes beyond the state of the art in fully automating the process, specializing the code generation algorithm to the case of dynamic counted loops and avoiding the introduction of spurious loop-carried dependences. We conduct experiments on representative irregular computations, from computer vision and finite element methods to sparse matrix linear algebra. We validate that the method is applicable to general affine transformations for locality optimization, vectorization and parallelization.

Keywords

Parallelizing compilation; loop nest optimization; polyhedral model; dynamic counted loops

1. INTRODUCTION

While a large number of computationally intensive applications spend most of their time in static control loop nests—with affine conditional expressions and array subscripts—several important algorithms do not meet such statically predictable requirements. We are interested in the class of loop nest kernels involving dynamic counted loops. These are regular counted loops with numerical constant strides, iterating until a dynamically computed, data-dependent upper bound. Such bounds are loop invariants, but they are often recomputed in the immediate vicinity of the loop they control; for example, their definition may take place in the immediately enclosing loop. Dynamic counted loops play an increasingly important role in numerical solvers, media processing applications, and data analytics, as we will see in the experimental evaluation. They can be seen as a special case of while loop that does not involve an arbitrary, inductively defined termination condition. The ability to substitute their counter with a closed form—an affine induction variable—makes them amenable to a wider set of transformations than while loops. Dynamic counted loops are commonly found in sparse matrix computations, but are not restricted to this class of algorithms. They are also commonly found in conjunction with statically unpredictable, non-affine array subscripts. It is important to address the performance challenge of such loops on modern architectures.

A significant amount of work aims at the parallelization of while loops, including [10, 9, 5, 14, 15, 16, 17] to cite polyhedral methods only. These techniques face a painful problem, solved in the better behaved case of static control loops for more than a decade: the lack of a robust, general-purpose algorithm and tool to generate imperative code after the application of an affine transformation. The state-of-the-art approach to model while loops in a polyhedral framework, and in the code generator in particular, is the work of Benabderrahmane et al. [5]. This work uses over-approximations to translate a while loop into a static control loop iterating from 0 to infinity that can be represented and optimized in the polyhedral model, and introduces exit predicates and the associated data dependences to preserve the computation of the original termination condition, and to enforce the proper termination of the generated loops the first time this condition is true. These data dependences severely restrict the application of loop transformations involving a while loop, since reordering the iterations of the latter is not permitted and would involve the removal of memory-based dependences on the exit predicates. The framework was also not fully automated at the time of its publication, leaving much room for the interpretation of its applicable cases and of the space of legal transformations it effectively models. A significant effort remains to be invested in the completion of a fully operational while loop polyhedral framework, even a strictly conservative one (i.e., non-speculative). In this paper, we take a more pragmatic, short-term direction: we focus on the special case of dynamic counted loops where the most difficult of these problems do not occur.

There has also been a significant body of research specializing in high-performance implementations of sparse matrix computations. Manually tuned libraries [2, 4, 7, 20, 21, 27] are a commonly used approach, but it is obviously impossible to port them to every representation and to the various architectures. A polyhedral framework that can handle non-affine subscripts [25] thus has a greater potential to achieve transformations and optimizations of sparse matrix computations.

In this paper, we propose an automatic polyhedral compilation approach to parallelize and optimize dynamic counted loops.
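To make the class concrete, the following minimal sketch (our own illustration; relax, row_begin, row_end and process are hypothetical) contrasts a general while loop, whose exit condition is inductively updated, with a dynamic counted loop, whose trip count is data-dependent but loop-invariant:

  double relax(double *grid);   /* hypothetical solver sweep */
  void process(int i, int j);   /* hypothetical kernel statement */

  void example(int n, const int *row_begin, const int *row_end,
               double *grid, double tol) {
    /* General while loop: the exit condition depends on an inductively
       updated value, so the iteration count has no closed form. */
    double err = tol + 1.0;
    while (err > tol)
      err = relax(grid);

    /* Dynamic counted loop: the bound m is computed in the enclosing
       loop and is invariant while the j loop runs, so the counter j
       admits an affine closed form. */
    for (int i = 0; i < n; i++) {
      int m = row_end[i] - row_begin[i];
      for (int j = 0; j < m; j++)
        process(i, j);
    }
  }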

2. BACKGROUND AND MOTIVATION

The polyhedral compilation framework is a powerful formalism to express parallelizing loop transformations and loop nest optimizations [12, 13, 6, 3]. Its application domain was traditionally constrained to static-control, regular loop nests, but the polyhedral community has been working on extending its applicability to more complex loops with dynamic, data-dependent behavior, and has made steady progress.

The polyhedral framework abstracts computational regions of the control flow as sets and relations over statement instances. It represents a program and its semantical properties using iteration domains, access relations, dependences and schedule components. The statement instances are collected in iteration domains and are mapped to the accessed array elements by access relations. Dependences relate statement instances depending on each other, while a schedule defines a partial execution order on these statement instances.

  S0: n = f(1);
  for (i = 0; i < 100; i++) {
    S1: m = g(i);
    for (j = 0; j < m; j++)
      for (k = 0; k < n; k++)
        S2: ...;      /* body elided in the extracted source */
    S3: n = h(i);     /* h is hypothetical; S3 rewrites n, per the write access relation (3) */
  }

Figure 1: Example of dynamic counted loops.

Figure 2: The dependence graph of the example (nodes S0, S1, S2, S3; edges S0→S2, S0→S3, S1→S1, S1→S2, S3→S2, S3→S3).

By capturing control dependences as affine relations from exit predicates to dominated statements in loop bodies, one may build a sound abstraction of the scheduling constraints for the loop nest. This technique is applicable to arbitrary while loops, in conjunction with a suitable code generation strategy to recover the exact control flow protected by the exit predicate, and by over-approximating the loop upper bound as +∞.

This is the approach explored by Benabderrahmane et al., but the resulting polyhedral representation is plagued by additional spurious loop-carried dependences to update the exit predicate, removing many useful loop nest transformations from the affine scheduling space. In the more restricted context of dynamic counted loops, it is possible to eliminate those loop-carried dependences, as the exit predicate only depends on loop-invariant data.

We base our formalism and experiments on the schedule tree representation [18]. A schedule tree typically comprises a domain node describing the overall extent of the statement instances; context nodes introducing symbolic parameters and constraints on those; sequence and set nodes expressing ordered or unordered branches, respectively; filter nodes selecting a subset of the statement instances as the children of a sequence or set node; and band nodes defining a partial schedule as well as permutability and/or parallelism properties on a group of statements. Band nodes are derived from tilable bands in the Pluto framework [6].

Figure 3 shows the schedule tree of the example in Figure 1. A schedule tree has the same expressiveness as any affine schedule representation, but it facilitates local schedule manipulations and offers a systematic way to associate non-polyhedral semantical extensions. The domain node in Figure 3 is

  Domain = {S0(); S1(i) : 0 ≤ i < 100; S2(i, j, k) : 0 ≤ i < 100 ∧ 0 ≤ j ∧ 0 ≤ k; S3(i) : 0 ≤ i < 100}    (1)

  domain
    context: [m, n]
      sequence
        filter: S0()
        filter: S1(i); S2(i, j, k); S3(i)
          band: S1(i) → (i); S2(i, j, k) → (i); S3(i) → (i + 1)
            sequence
              filter: S1(i)
              filter: S2(i, j, k) : j < m ∧ k < n
                band: S2(i, j, k) → (j)
                  band: S2(i, j, k) → (k)
              filter: S3(i)

Figure 3: The schedule tree of the example.

We will leverage this extensibility to represent non-affine loop bounds. Since m and n are defined in the i-loop, the loops they control cannot be appropriately represented in the iteration domain. Interestingly, a schedule tree makes it possible to introduce constraints on symbolic parameters deeper than the context node introducing those parameters. These constraints, captured through filter nodes, only apply in the branch controlled by the filter node, even if the context node introducing the parameter itself is located higher up in the tree. We may thus insert a context node right underneath the domain node, as recommended in the existing schedule tree implementation, while capturing the constraints induced by m and n in the filter node of statement S2. This filter node effectively splits the three-dimensional schedule into two nested bands. The resulting schedule tree may indeed be seen as a one-dimensional external domain and schedule enclosing a two-dimensional inner domain and schedule controlled by two additional parameters. In such a schedule tree, the domain and access relations of statement S2 can be represented exactly and parametrically in m and n. This representation can be used to compute the dependence relation of the whole schedule tree.

Based on this dependence information, one may derive a new schedule using the Pluto algorithm or one of its variants [6, 26], to optimize locality and extract parallelism. The final step is to generate code from the schedule tree down to a high-level program. The generation of the abstract syntax tree (AST) follows the approach implemented in isl [18], traversing the schedule tree and specializing the code generation algorithm to integrate the specific control flow template corresponding to the converted dynamic counted loops. Before encountering a filter node associated with a dynamic counted loop, the exit predicate and its controlled loop body are seen as a single black-box statement by the AST generation algorithm. When passing the filter node constraining the dynamic upper bound, it is necessary to complement the standard code generation procedure with a dedicated "dynamic counted loop template". This template involves the reconstruction of the exit predicate and the introduction of an early exit (break) instruction guarded by the predicate. Our algorithm generates code in a single traversal of the schedule tree, avoiding the need to call the AST generator multiple times for dynamic counted loops. This is another difference with Benabderrahmane's framework.

3. PROGRAM ANALYSIS

Dynamic counted loops arise frequently in irregular applications, but they are sometimes not written in a form that can be handled with our technique. We need a preprocessing step to make them amenable to our approach.

3.1 Preparation

The first category of dynamic counted loops is the one shown in the example of Figure 1. It is referred to as the normalized format.

Sparse matrix computations constitute a second category of dynamic counted loops. They are a class of computations using a compressed data layout to store the nonzero elements of a matrix. Loops iterating on the compressed layout usually have a dynamic upper bound as well as a dynamic lower bound. However, these loops can easily be transformed into the normalized format by a non-affine subtraction of the lower bound from the upper bound. Note that this transformation may introduce non-affine array subscripts; we assume the dependence analysis will conservatively handle such subscripts, or symbolically eliminate identical non-affine expressions on the left- and right-hand sides, or benefit from PENCIL annotations to refine its precision [8, 1].

Some forms of while loops may also be modeled, as long as an affine induction variable can be identified and assuming the variant part of the exit condition reduces to this induction variable.
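For instance, a minimal sketch of this preparation step on a CSR-style traversal (array names are our own):

  /* Original form: the inner loop has dynamic lower AND upper bounds. */
  for (int i = 0; i < n_rows; i++)
    for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
      y[i] += val[j] * x[col[j]];

  /* Normalized format: subtract the dynamic lower bound, so the inner
     loop counts from 0 to a data-dependent trip count m.  The subscripts
     become non-affine (lb + j), which the dependence analysis handles
     conservatively or with the help of PENCIL annotations. */
  for (int i = 0; i < n_rows; i++) {
    int lb = row_ptr[i];
    int m  = row_ptr[i + 1] - lb;
    for (int j = 0; j < m; j++)
      y[i] += val[lb + j] * x[col[lb + j]];
  }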

3.2 Modeling Control Dependences

To make dynamic counted loops amenable to a polyhedral representation, we introduce additional dependences associated with exit predicates and their definition statements. An exit predicate definition and check is inserted at the beginning of each iteration of a dynamic counted loop. At code generation time, all statements in the body of the counted loop will have to be dominated by a break instruction conditioned by the exit predicate. This follows the state-of-the-art method for while loops [5], but without the inductive computation and loop-carried dependence on the exit predicate. Of course, we delay the introduction of break instructions until code generation, to keep the control flow in a statically manageable form for a polyhedral compiler. As a very simple illustrative example, the code in Figure 4(a) is preprocessed into the version in Figure 4(b) before constructing the affine representation.

  (a) original:
  for (j = 0; j < m; j++)
    for (k = 0; k < n; k++)
      S(i, j, k);

  (b) preprocessed:
  for (j = 0; 1; j++)
    for (k = 0; 1; k++)
      if (j < m && k < n)
        S(i, j, k);

Figure 4: Preprocessing of a nest of dynamic counted loops; the break instructions are introduced later, at code generation.

The read access relation of the example in Figure 1 is

  R = [m, n] → {S2(i, j, k) → m[] : 0 ≤ i < 100 ∧ 0 ≤ j < m ∧ 0 ≤ k < n;
                S2(i, j, k) → n[] : 0 ≤ i < 100 ∧ 0 ≤ j < m ∧ 0 ≤ k < n}    (2)

and its write access relation is

  W = [m, n] → {S0() → n[]; S1(i) → m[] : 0 ≤ i < 100; S3(i) → n[] : 0 ≤ i < 100}    (3)

According to the variant of the Pluto algorithm implemented in isl, one may set the validity dependences, associated with semantics preservation, to

  Validity = (R⁻¹ ∘ W′ + W⁻¹ ∘ R′ + W⁻¹ ∘ W′) ∩ (Schedule ≺ Schedule′)    (4)

and the proximity dependences, associated with locality enhancement, to

  Proximity = (R⁻¹ ∘ W′ + W⁻¹ ∘ R′ + W⁻¹ ∘ W′ + R⁻¹ ∘ R′) ∩ (Schedule ≺ Schedule′)    (5)

where Schedule represents the original schedule constructed from the original code according to the procedure above, and the primed maps distinguish the iterations in dependence.
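As a worked instance (our own illustration, not spelled out in the paper), combine the write of m by S1 in (3) with the reads of m by S2 in (2): S1(i) redefines m at every iteration of the outer loop and precedes S2(i, j, k) in the schedule, so (4) contains the flow dependence

  {S1(i) → S2(i, j, k) : 0 ≤ i < 100 ∧ 0 ≤ j < m ∧ 0 ≤ k < n}

Note that no dependence is carried by the j or k loops on the exit predicates themselves: m and n are not updated inside those loops, which is precisely what distinguishes dynamic counted loops from general while loops.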

4. CODE GENERATION

Our code generator specializes the AST generation algorithm of Grosser et al. [18]; we implement this specialization along the filter nodes introducing constraints on the parameters associated with dynamically evaluated exit predicates. The algorithm performs a depth-first traversal of the schedule tree, applying a customizable form of Quilleré's algorithm [22] on each band node. Since special care is needed to handle exit predicates and their impact on the dynamic upper bounds, the Grosser et al. algorithm is not able, in its original form, to generate semantically correct code for our extended schedule tree.

However, it is easy to modify it to handle the special case of exit predicates being homogeneous over all statements in a sequence or set node of the schedule tree (e.g., all statements in a band of permutable loops). Indeed, it is possible to hook the generation of custom templates for the computation of exit predicates and the insertion of early exit statements directly into the processing of filter nodes associated with dynamically updated parameters. This is facilitated by the ability to attach syntactic annotations about exit predicates using a so-called mark node of the schedule tree. A mark node is designed to attach any kind of information to a subtree. While the symbolic parameters are introduced by a context node immediately below the domain node, a mark node further down the tree can track the nesting level where the symbolic constant should be treated in a specific way, streamlining the code generation of loop nests with multiple dynamic counted loops. In particular, the AST generator can read information from a mark node to handle the generation of exit predicates.

In the following, we thus restrict ourselves to the case of homogeneous exit predicates over statements belonging to sequence or set nodes in the schedule tree. This practically prevents the code generator from implementing loop fusion across static and dynamically counted loops, but it still enables tiling and unimodular transformations (interchange, skewing, reversal). The extension to loop fusion requires additional work in the separation step of the AST generation algorithm, similar to Benabderrahmane's technique, and its complete automation on schedule trees is left for future work.

4.1 A Static Upper Bound

A general while loop may be converted to an unbounded for loop. Dynamic counted loops, in contrast, do have an upper bound; it simply varies dynamically. To facilitate the implementation of loop transformations, following Benabderrahmane's work [5], one may optionally derive a static upper bound u that is greater than or equal to all bounds reached by a given dynamic counted loop.

The u parameter can be approximated statically, as the dynamic upper bounds are functions of outer enclosing loop variables: a typical solution relies on Fourier-Motzkin elimination, projecting out enclosing dimensions and eliminating non-affine constraints. The u parameter can also be determined in other ways, from data structure properties or from additional user-defined predicates in Pencil [1]. For example, in sparse matrix computations, it may be computed by inspecting the maximum number of non-zero entries in a row in compressed sparse row (CSR) format. In many cases, a symbolic expression for the u parameter can also be computed automatically, but we did not yet automate this analysis.
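For example, the following inspector (a minimal sketch; names are our own) derives u for a CSR matrix as the maximum row length:

  /* Derive the static upper bound u of the dynamic counted loop over
     the non-zero entries of each row, by scanning the CSR row pointer. */
  int csr_static_upper_bound(const int *row_ptr, int n_rows) {
    int u = 0;
    for (int i = 0; i < n_rows; i++) {
      int len = row_ptr[i + 1] - row_ptr[i];
      if (len > u)
        u = len;
    }
    return u;
  }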

4.2 Template for Nested Bands

As there is a definition statement of the dynamic upper bound between the outer loops and a nested dynamic counted loop, and since we introduce additional dependences between this definition statement and the statements nested inside dynamic counted loops, the canonically constructed schedule tree isolates two nested band nodes representing the different levels of the loop nest. Let us show a code generation template that can be applied to customize the control flow for each of these bands.

Generating the conditional for a single dynamic counted loop is straightforward, since the predicate attached in a mark node can easily be extracted. As a result, the AST generator only needs to generate a conditional around the loop statements, as well as an early exit statement in the else branch.

For cases with multiple dynamic counted loops, we propose to systematically generate one conditional at the innermost level. Figure 4(b) shows an example illustrating this scenario. The predicates of both loops are included in a single conditional, generated under the inner loop. This scheme opens opportunities for affine loop transformations, such as loop interchange, that are not expressible in Benabderrahmane's framework due to the presence of spurious loop-carried dependences.

When the Grosser et al. algorithm traverses the band nodes of a schedule tree, it projects the local schedule constraints out of the domain node. As the symbolic constants and their constraints do not appear in the domain node, the generated loops iterate from 0 to u. It is thus necessary to emit an early exit statement, or many empty iterations will be wasted. This is straightforward in the case of a single dynamic counted loop; in nested cases, since there is only one conditional capturing multiple predicates, we need to extract the parameters one by one from the predicate list and generate the corresponding exit statements from the innermost level outwards. The exit predicates are generated in the form of multiple conditionals rather than else branches. Figure 5 shows the result on the example of Figure 4(b).

  for (j = 0; j < u; j++) {
    for (k = 0; k < u; k++) {
      if (j < m && k < n)
        S(i, j, k);
      if (k >= n)
        break;
    }
    if (j >= m)
      break;
  }

Figure 5: Generation of exit predicates.

The templates serve as a post-processing step, after code generation. A break statement with the appropriate guard is always inserted in a dynamically counted loop of the generated code. As an illustrative example, Figure 6(b) shows the result of applying loop interchange to the example of Figure 6(a). Unlike the work of Jimborean et al. [19], which handles general while loops and needs to speculate on the number of iterations, our technique always executes the same number of iterations as the original dynamic counted loops.

  (a) dynamic counted loop:
  for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
      S(i, j);

  (b) after loop interchange, with the dynamic bound abstracted by a conditional:
  for (j = 0; j < u; j++)
    for (i = 0; i < n; i++)
      if (j < m)
        S(i, j);

Figure 6: Conditional abstraction of dynamic counted loops.

Loop tiling is a special case that should be taken into account. Before the generation of exit predicates, one may apply additional affine loop transformations on the schedule tree when possible. The typical candidate is loop tiling, which involves the insertion of an additional schedule dimension through strip-mining. When strip-mining a dynamic counted loop, there should be an exit statement at both levels. For the point loop (iterating within a tile), the common case above applies. For the tile loop (iterating among tiles), we align its bounds and strides to follow the structure of the inner loop, so that its counter can also be compared systematically with the same bound. Figure 7 shows an example after the application of loop tiling to the code in Figure 5.

  for (jj = 0; jj * BB < u; jj++) {
    for (kk = 0; kk * CC < u; kk++) {
      for (j = jj * BB; j < (jj + 1) * BB; j++) {
        for (k = kk * CC; k < (kk + 1) * CC; k++) {
          if (j < m && k < n)
            S(i, j, k);
          if (k >= n)
            break;
        }
        if (j >= m)
          break;
      }
      if (kk * CC >= n)
        break;
    }
    if (jj * BB >= m)
      break;
  }

Figure 7: Generation of exit predicates in the tiling case (BB and CC are the tile sizes).

Guarded execution may still run through empty iterations of the dynamic counted loop itself. This is the price to pay for a fixed upper bound on the number of iterations. It may be mitigated with additional strip-mining of the outer loops, to better control the value of u, effectively partitioning the loop nest into coarse-grain sub-computations amenable to execution on a heterogeneous target.

4.3 Template for Combined Parallel Bands

When the target architectures are shared memory multiprocessors, the code generation template for separate bands described above is fully applicable. However, when targeting GPU accelerators or producing fixed-length vector code, we usually expect to combine nested bands to express parallelism at multiple levels. This motivates the exploitation of data parallelism within dynamic counted loops, in combination with other nested loops. Since dynamic counted loops result in nested bands in the schedule tree, the combined exploitation of multiple levels of parallelism including one or more dynamic counted loops requires special treatment that is not directly modeled by affine sets and relations in the tree, or by band-local parallelism annotations. On GPU targets, the constraints on the grid/ND-range of multi-level data parallelism require the collection of bound information across nested bands: when launching a kernel, the parameters of the grid/ND-range must be known and may not evolve during the whole run of the kernel.

Unfortunately, the statements between nested bands that occur in dynamic counted loops are used to initialize dynamic upper bounds. Statements in the body of these dynamic counted loops depend on those definition statements, through the added dependences modeling the original dependence of the dynamic loop. Still, one may sink these definition statements within the dynamic counted loops, as a preprocessing step. As a result, the nested bands can easily be combined together, with no intervening computation or control flow.
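The following sketch illustrates the effect of sinking the bound definition (our own simplified rendering with a single dynamic counted loop; g and S are hypothetical, and g must be side-effect-free):

  int m;
  /* Before: the definition of m separates the two bands. */
  for (int i = 0; i < N; i++) {
    m = g(i);                      /* intervening definition statement */
    for (int j = 0; j < u; j++)
      if (j < m) S(i, j);
  }

  /* After sinking: the definition moves into the inner loop, so the
     i and j loops form a single combined band with no intervening
     statement, at the cost of redundant evaluations of g(i). */
  for (int i = 0; i < N; i++)
    for (int j = 0; j < u; j++) {
      m = g(i);
      if (j < m) S(i, j);
    }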

5. EXPERIMENTAL EVALUATION

Our framework takes a C program with Pencil functions as input [1]; Pencil is intended as a target language for DSL compilers and as a high-level, portable implementation language for programming accelerators. It allows a domain expert to convey information that the compiler could not derive statically.

We use PPCG [26], a polyhedral compiler for Pencil that performs loop nest transformations, parallelization and data locality optimization, and generates OpenCL or CUDA code. PPCG exploits the information provided by the Pencil extensions, performs transformations and generates CUDA code automatically. In a follow-up auto-tuning step, we look for optimal parameter values for tile sizes, block sizes, grid sizes, etc., for a given application and target architecture. PPCG supports exporting and importing schedule trees, enabling us to prototype some of the transformation steps of our method in iscc.

We compare the performance of the code generated with our technique against the code generated from the Pencil source without our algorithm.

The experiments are conducted on an NVIDIA Quadro K4000 GPU. The CPU of the experimental platform is an Intel Xeon E5-2630. The sequential code is compiled with the icc compiler from Intel Parallel Studio XE 2016, with the -Ofast -fstrict-aliasing optimization flags. The CUDA code is compiled with the NVIDIA CUDA 7.5 toolkit with the -O3 optimization flag. We run each benchmark 9 times and retain the median value.

5.1 HOG Benchmark

The HOG benchmark is extracted from the Pencil benchmark suite, a collection of applications and kernels for evaluating Pencil compilers. The distribution of intensity gradients or edge directions describes the local object appearance and shape within an image. When processing an image, the HOG descriptor divides it into small connected regions called cells. A histogram of gradient directions is then compiled for the pixels within each cell. The descriptor finally concatenates these histograms together. It also contrast-normalizes the local histograms by calculating an intensity measure across a block, a larger region of the image, and then uses this value to normalize all cells within the block, improving accuracy. This normalization results in better invariance to changes in illumination and shadowing.

The kernel of the HOG descriptor contains two nested dynamic counted loops. The upper bounds of these inner loops are defined, and vary, as the outermost loop iterates. The dynamic parameter is an expression of max and min functions of the outer loop iterator and an array of constants. We derive the static upper bound parameter u from the BLOCK_SIZE constant, a globally defined parameter of the program declaring the size of an image block.

Since we target a GPU architecture, we ought to extract large degrees of parallelism from multiple nested loops. As explained in the previous section, we thus sink the definition statements of the dynamic parameters within the inner dynamic counted loops and apply our template for a combined band. We then generate the CUDA code with auto-tuned parameter values for tile sizes, block sizes, grid sizes, etc. We show performance results with and without host-device data transfer time, in Figure 8 and Figure 9 respectively, considering multiple block sizes. The detection accuracy improves as the block size increases. Our algorithm achieves a promising performance improvement for each block size: our technique obtains a speedup ranging from 4.4× to 23.3×, while the Pencil code suffers from a degradation of about 75%.

Figure 8: Performance of the HOG descriptor with data transfer time (speedup over sequential code, PPCG vs. our work, for BLOCK_SIZE in {16, 32, 64, 128, 256, 512, 1024}).

Figure 9: Performance of the HOG descriptor without data transfer time (same setup as Figure 8).

5.2 SpMV Computations

Sparse matrix operations are an important class of algorithms, arising frequently in applications ranging from graph processing and physical simulation to data analytics. They have attracted a lot of parallelization and optimization efforts. Programmers may use different formats to store a sparse matrix, among which we consider four representations: Compressed Sparse Row (CSR), Block CSR (BCSR), Diagonal (DIA) and ELLPACK (ELL) [27]. In their original form, the loop bounds do not match our canonical structure: we apply a non-affine shift by the dynamic lower bound, as discussed earlier. Our experiments in this subsection target the benchmarks presented in [25], with our own modifications to suit the syntactic constraints of our framework.

We first consider the Pencil function for the CSR representation. All arrays are declared through the C99 variable-length array syntax with the static const restrict type qualifiers, allowing PPCG to derive the size of the arrays offloaded to the accelerator despite the presence of indirect accesses (subscripts of subscripts), and guaranteeing that these arrays do not alias. The three other representations can be modeled with a make-dense transformation, as proposed by Venkat et al. [25], followed by a series of loop transformations: BCSR is obtained by applying make-dense and tiling transformations, DIA can be obtained after make-dense, shift and permutation transformations, and a non-affine shift together with tiling results in the ELL format. See Venkat et al. [25] for details.
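A sketch of the CSR kernel in this style (our own reconstruction, not the exact benchmark source; the prototype and names are hypothetical, and the matrix is assumed square for brevity):

  /* C99 variable-length array parameters with static const restrict:
     the sizes of the offloaded arrays are apparent to PPCG despite the
     indirect accesses, and the arrays are guaranteed not to alias. */
  void spmv_csr(int n_rows, int nnz,
                const int row_ptr[static const restrict n_rows + 1],
                const int col[static const restrict nnz],
                const double val[static const restrict nnz],
                const double x[static const restrict n_rows],
                double y[static const restrict n_rows])
  {
    for (int i = 0; i < n_rows; i++) {
      double s = 0.0;
      for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
        s += val[j] * x[col[j]];
      y[i] = s;
    }
  }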

The maximum number of non-zero entries in a row is the static upper bound and may be set as the u parameter. It can be derived through an inspection pass. As a result, the references with indirect array subscripts can be sunk under the inner dynamic counted loop, exposing a combined band in the schedule tree. We show the performance without data transfer time in Figure 10. The input sparse matrices are obtained from the University of Florida sparse matrix collection [11]. Our technique accelerates SpMV by 1.1× to 4.7×, which is consistent with the numbers reported by Venkat et al. [25].

BCSR is the blocked version of CSR; its parallel version is the same as that of CSR after tiling with PPCG, so we do not show its performance separately. Similarly, we generate the CUDA code with auto-tuned parameter values for the DIA and ELL formats, as shown in Figure 11 and Figure 12. The original DIA code runs into a segmentation fault for the mc2depi and pwtk matrices, hence we remove these two matrices from the corresponding experiments. Speedups range from 1.2× to 5.0× for DIA and from 1.9× to 7.4× for ELL.

5.3 Inspector/Executor Codes

We also compare the performance of the generated code when using an inspector/executor strategy to optimize the data layout of sparse matrix computations. The inspector is used to analyze memory reference patterns and to generate communication schedules, so we mainly focus on the application of our technique to the executor. Since BCSR can be obtained by applying a tiling transformation to a CSR kernel, we only take the executor of CSR into account. The executor of the DIA format is not a dynamic counted loop and will not be studied.

Following the approach of Venkat et al. [25], inspector/executor schemes may cooperate with a polyhedral transformation and optimization framework. The framework uses three dedicated transformations to achieve this cooperation: make-dense, compact and compact-and-pad.

Both the compact and compact-and-pad transformations can be applied to the CSR and ELL formats. The executor of the CSR format is roughly equivalent to the original CSR SpMV code, so the performance comparison is similar to that of the original CSR SpMV code, as shown in Figure 13. The ELL executor uses a transposed matrix to achieve global memory coalescing, whose efficiency depends heavily on the number of rows that have a similar number of non-zero entries. The performance of the ELL executor is shown in Figure 14, for which we obtain speedups from 2.9× to 10.2×.
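For reference, a sketch of the ELL executor's access pattern (our own reconstruction; array names are hypothetical):

  /* ELL SpMV executor: every row is padded to u entries (padded entries
     store the value 0.0), and the arrays are stored transposed, with
     entry j of row i at offset j*n_rows + i, so that consecutive rows,
     and hence consecutive threads, access consecutive memory locations. */
  void spmv_ell(int n_rows, int u,
                const int *ell_col, const double *ell_val,
                const double *x, double *y)
  {
    for (int i = 0; i < n_rows; i++) {
      double s = 0.0;
      for (int j = 0; j < u; j++)
        s += ell_val[j * n_rows + i] * x[ell_col[j * n_rows + i]];
      y[i] = s;
    }
  }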

Figure 10: Execution time comparison between the sequential and CUDA CSR SpMV code (milliseconds; matrices cant, consph, pwtk, rma10, cop20_A, mc2depi, pdb1HYS, Press_Poisson, tomographic1, mac_econ_fwd500).

Figure 11: Execution time of the sequential and CUDA DIA SpMV code (seconds; matrices cant, consph, pwtk, cop20_A, mc2depi, pdb1HYS, Press_Poisson, mac_econ_fwd500).

Figure 12: Execution time of the sequential and CUDA ELL SpMV code (milliseconds; same ten matrices as Figure 10).

Figure 13: Execution time of the sequential executor and the CUDA CSR SpMV code (milliseconds; same ten matrices as Figure 10).

6. RELATED WORK

The polyhedral framework is a powerful compilation technique to parallelize and optimize loops. It has become one of the main approaches for the construction of modern parallelizing compilers. Its application domain used to be constrained to static-control, regular loop nests, but the extension of the polyhedral framework to handle irregular applications is increasingly important given the growing adoption of the technique. The polyhedral community has invested significant efforts to make progress in this direction.

Figure 14: Execution time of the sequential executor and the CUDA ELL SpMV code (milliseconds; same ten matrices as Figure 10).

A representative application of irregular polyhedral techniques is the parallelization of while loops. The polyhedral model is expected to handle loop structures with arbitrary bounds that are typically regarded as while loops. Collard [10, 9] proposed a speculative approach based on the polyhedral model that extends the iteration domain of the original program and performs speculative execution on the new iteration domain. Parallelism is exposed at the expense of an invalid space-time mapping that needs to be corrected at run time. An alternative, conservative technique consists in enumerating a super-set of the target execution space [14, 15, 16, 17], and then eliminating invalid iterations by performing termination detection on the fly. The authors present solutions for both distributed and shared memory architectures. Benabderrahmane et al. [5] introduce a general framework to parallelize and optimize arbitrary while loops by modeling control-flow predicates. They transform a while loop into a for loop iterating from 0 to +∞. Compared to these approaches to parallelizing while loops in the polyhedral model, our technique relies on systems of affine inequalities only, as implemented in state-of-the-art polyhedral libraries. It does not need to resort to first-order logic such as uninterpreted functions or predicates, it does not involve speculative execution features, and it makes dynamic counted loops amenable to a wider set of transformations than general while loops.

A significant body of work addressed the transformation and optimization of sparse matrix computations. The implementation of manually tuned libraries [2, 4, 7, 20, 21, 27] is the common approach to achieving high performance, but it is difficult to port to each new representation and to different architectures. Sparse matrix compilers based on polyhedral techniques have been proposed [25], abstracting the indirect array subscripts and complex loop bounds in a domain-specific fashion, and leveraging conventional Pluto-based optimizers on an abstracted form of the sparse matrix computation kernel. We aim to extend the applicability of polyhedral techniques one step further, considering general Pencil code as input, and leveraging the semantical annotations expressible in Pencil to improve the efficiency of the generated code and to abstract non-affine expressions.

7. CONCLUSION

In this paper, we studied the parallelizing compilation and optimization of an important class of loop nests where counted loops have a dynamically computed, data-dependent upper bound. Such loops are amenable to a wider set of transformations than general while loops with inductively defined termination conditions. To achieve this, we model control dependences on data-dependent predicates by revisiting a state-of-the-art framework to parallelize arbitrary while loops. We specialize this framework to facilitate its integration in schedule-tree-based affine scheduling and code generation algorithms, and we provide code generation templates for multiple scenarios, from single dynamic counted loops to the nested bands case, and a combined band case for fixed-size multi-level data-parallel grids on GPUs. Our method relies on systems of affine inequalities, as implemented in state-of-the-art polyhedral libraries. It takes a C program with Pencil functions as input, covering a wide range of non-static control applications encompassing a well-studied class of sparse matrix computations. The experimental evaluation using the PPCG source-to-source compiler on representative irregular computations, from computer vision and finite element methods to sparse matrix linear algebra, validated the general applicability of the method and its performance benefits compared to shallower, black-box approximations of the control flow. Our next steps will be to fully automate and implement the algorithm in PPCG, and to conduct further experiments on CPU and GPU platforms, comparing the performance with manually tuned libraries like OSKI and CUSP as well as reference implementations of computer vision and data analytics kernels. Full automation will also involve the static analysis of symbolic upper bounds, possibly building on additional Pencil constructs.

Acknowledgments

This work was partly supported by the European Commission and French Ministry of Industry through the ECSEL project COPCAMS id. 332913, and by the French ANR through the European CHIST-ERA project DIVIDEND. Our experiments with Pencil benchmarks and PPCG benefited from the direct support of Michael Kruse, Riyadh Baghdadi and Chandan Reddy. Finally, none of this work would have been possible without the contributions of Sven Verdoolaege and Tobias Grosser on isl and PPCG.

8. REFERENCES

[1] Riyadh Baghdadi, Ulysse Beaugnon, Albert Cohen, Tobias Grosser, Michael Kruse, Chandan Reddy, Sven Verdoolaege, Adam Betts, Alastair F. Donaldson, Jeroen Ketema, et al. PENCIL: a platform-neutral compute intermediate language for accelerator programming. In Proceedings of the International Conference on Parallel Architecture and Compilation (PACT), pages 138–149. IEEE Computer Society, 2015.

[2] S. Balay, S. Abhyankar, M. Adams, J. Brown, P. Brune, K. Buschelman, V. Eijkhout, W. Gropp, D. Kaushik, M. Knepley, et al. PETSc users manual, revision 3.5. Argonne National Laboratory (ANL), 2014.
[3] Cedric Bastoul. Code generation in the polyhedral model is easier than you think. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 7–16. IEEE Computer Society, 2004.
[4] Nathan Bell and Michael Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC), pages 18:1–18:11. ACM, 2009.
[5] Mohamed-Walid Benabderrahmane, Louis-Noël Pouchet, Albert Cohen, and Cédric Bastoul. The polyhedral model is more widely applicable than you think. In Proceedings of the 19th International Conference on Compiler Construction (CC), pages 283–303. Springer, 2010.
[6] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral program optimization system. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2008.
[7] Aydın Buluç and John R. Gilbert. The Combinatorial BLAS: design, implementation, and applications. International Journal of High Performance Computing Applications, pages 496–509, 2011.
[8] J.-F. Collard, D. Barthou, and P. Feautrier. Fuzzy array dataflow analysis. In ACM Symposium on Principles and Practice of Parallel Programming, pages 92–102, Santa Barbara, CA, July 1995.
[9] Jean-François Collard. Space-time transformation of while-loops using speculative execution. In Proceedings of the Scalable High-Performance Computing Conference 1994, pages 429–436. IEEE Computer Society, 1994.
[10] Jean-François Collard. Automatic parallelization of while-loops using speculative execution. International Journal of Parallel Programming, 23(2):191–219, 1995.
[11] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1:1–1:25, 2011.
[12] P. Feautrier. Dataflow analysis of scalar and array references. International Journal of Parallel Programming, 20(1):23–53, February 1991.
[13] P. Feautrier. Some efficient solutions to the affine scheduling problem, part II, multidimensional time. International Journal of Parallel Programming, 21(6):389–420, December 1992. See also Part I, one-dimensional time, 21(5):315–348.
[14] Max Geigl, Martin Griebl, and Christian Lengauer. A scheme for detecting the termination of a parallel loop nest. Proc. GI/ITG FG PARS, 98, 1998.
[15] Max Geigl, Martin Griebl, and Christian Lengauer. Termination detection in parallel loop nests with while loops. Parallel Computing, 25(12):1489–1510, 1999.
[16] Martin Griebl and Jean-François Collard. Generation of synchronous code for automatic parallelization of while loops. In Proceedings of the 1st International Euro-Par Conference on Parallel Processing, pages 313–326. Springer, 1995.
[17] Martin Griebl and Christian Lengauer. On scanning space-time mapped while loops. In Proceedings of the 3rd Joint International Conference on Vector and Parallel Processing (CONPAR 94-VAPP VI), pages 677–688. Springer, 1994.
[18] Tobias Grosser, Sven Verdoolaege, and Albert Cohen. Polyhedral AST generation is more than scanning polyhedra. ACM Transactions on Programming Languages and Systems (TOPLAS), 37(4):12, 2015.
[19] Alexandra Jimborean, Philippe Clauss, Jean-François Dollinger, Vincent Loechner, and Juan Manuel Martinez Caamaño. Dynamic and speculative polyhedral parallelization using compiler-generated skeletons. International Journal of Parallel Programming, 42(4):529–545, 2014.
[20] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727, 2012.
[21] John Mellor-Crummey and John Garvin. Optimizing sparse matrix-vector product computations using unroll and jam. International Journal of High Performance Computing Applications, 18(2):225–236, 2004.
[22] Fabien Quilleré, Sanjay Rajopadhye, and Doran Wilde. Generation of efficient nested loops from polyhedra. International Journal of Parallel Programming, 28(5):469–498, 2000.
[23] Michelle Mills Strout, Larry Carter, and Jeanne Ferrante. Compile-time composition of run-time data and iteration reorderings. ACM SIGPLAN Notices, 38(5):91–102, 2003.
[24] Michelle Mills Strout, Alan LaMielle, Larry Carter, Jeanne Ferrante, Barbara Kreaseck, and Catherine Olschanowsky. An approach for code generation in the sparse polyhedral framework. Parallel Computing, 53:32–57, 2016.
[25] Anand Venkat, Mary Hall, and Michelle Strout. Loop and data transformations for sparse matrix code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 521–532, 2015.
[26] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization (TACO), 9(4):54, 2013.
[27] Richard Vuduc, James W. Demmel, and Katherine A. Yelick. OSKI: a library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16(1):521, 2005.
[28] David Wonnacott and William Pugh. Nonlinear array dependence analysis. In Proc. Third Workshop on Languages, Compilers and Run-Time Systems for Scalable Computers, Troy, New York, 1995.
