A general compilation algorithm to parallelize and optimize counted loops with dynamic data-dependent bounds


Jie Zhao and Albert Cohen
INRIA & DI, École Normale Supérieure
45 rue d'Ulm, 75005 Paris

ABSTRACT

We study the parallelizing compilation and loop nest optimization of an important class of programs where counted loops have a dynamically computed, data-dependent upper bound. Such loops are amenable to a wider set of transformations than general while loops with inductively defined termination conditions: for example, the substitution of closed forms for induction variables remains applicable, removing the data dependences induced by termination conditions. We propose an automatic compilation approach to parallelize and optimize dynamic counted loops. Our approach relies on affine relations only, as implemented in state-of-the-art polyhedral libraries. Revisiting a state-of-the-art framework to parallelize arbitrary while loops, we introduce additional control dependences on data-dependent predicates. Our method goes beyond the state of the art in fully automating the process, specializing the code generation algorithm to the case of dynamic counted loops and avoiding the introduction of spurious loop-carried dependences. We conduct experiments on representative irregular computations, from computer vision and finite element methods to sparse matrix linear algebra. We validate that the method is applicable to general affine transformations for locality optimization, vectorization and parallelization.

Keywords

Parallelizing compilation; loop nest optimization; polyhedral model; dynamic counted loops

1. INTRODUCTION

While a large number of computationally intensive applications spend most of their time in static control loop nests—with affine conditional expressions and array subscripts—several important algorithms do not meet such statically predictable requirements. We are interested in the class of loop nest kernels involving dynamic counted loops. These are regular counted loops with numerical constant strides, iterating until a dynamically computed, data-dependent upper bound. Such bounds are loop invariants, but they are often recomputed in the immediate vicinity of the loop they control; for example, their definition may take place in the immediately enclosing loop. Dynamic counted loops play an increasingly important role in numerical solvers, media processing applications, and data analytics, as we will see in the experimental evaluation. They can be seen as a special case of while loop that does not involve an arbitrary, inductively defined termination condition. The ability to substitute their counter with a closed form—an affine induction variable—makes them amenable to a wider set of transformations than while loops. Dynamic counted loops are commonly found in sparse matrix computations, but are not restricted to this class of algorithms. They are also commonly found in conjunction with statically unpredictable, non-affine array subscripts. It is important to address the performance challenge of such loops on modern architectures.

A significant amount of work aims at the parallelization of while loops, including [10, 9, 5, 14, 15, 16, 17] to cite polyhedral methods only. These techniques face a painful problem, solved in the better behaved case of static control loops for more than a decade: the lack of a robust, general-purpose algorithm and tool to generate imperative code after the application of an affine transformation. The state-of-the-art approach to model while loops in a polyhedral framework, and in the code generator in particular, is the work of Benabderrahmane et al. [5]. This work uses over-approximations to translate a while loop into a static control loop iterating from 0 to infinity that can be represented and optimized in the polyhedral model, and introduces exit predicates and the associated data dependences to preserve the computation of the original termination condition, and to enforce the proper termination of the generated loops the first time this condition is true. These data dependences severely restrict the application of loop transformations involving a while loop, since reordering the iterations of the latter is not permitted and would involve the removal of memory-based dependences on the exit predicates. The framework was also not fully automated at the time of its publication, leaving much room for the interpretation of its applicable cases and of the space of legal transformations it effectively models. A significant effort remains to be invested in the completion of a fully operational while loop polyhedral framework, even a strictly conservative one (i.e., non-speculative). In this paper, we take a more pragmatic, short-term direction: we focus on the special case of dynamic counted loops where the most difficult of these problems do not occur.

There has also been a significant body of research specializing in high-performance implementations of sparse matrix computations. Manually tuned libraries [2, 4, 7, 20, 21, 27] are a commonly used approach, but it is obviously impossible to port them to every representation and to the various architectures. A polyhedral framework that can handle non-affine subscripts [25] thus has a greater potential to achieve transformations and optimizations of sparse matrix computations.

In this paper, we propose an automatic polyhedral compilation approach to parallelize and optimize dynamic counted loops.
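To make the class concrete, the following minimal sketch (our own illustration; relax, row_begin, row_end and process are hypothetical) contrasts a general while loop, whose exit condition is inductively updated, with a dynamic counted loop, whose trip count is data-dependent but loop-invariant:

  double relax(double *grid);   /* hypothetical solver sweep */
  void process(int i, int j);   /* hypothetical kernel statement */

  void example(int n, const int *row_begin, const int *row_end,
               double *grid, double tol) {
    /* General while loop: the exit condition depends on an inductively
       updated value, so the iteration count has no closed form. */
    double err = tol + 1.0;
    while (err > tol)
      err = relax(grid);

    /* Dynamic counted loop: the bound m is computed in the enclosing
       loop and is invariant while the j loop runs, so the counter j
       admits an affine closed form. */
    for (int i = 0; i < n; i++) {
      int m = row_end[i] - row_begin[i];
      for (int j = 0; j < m; j++)
        process(i, j);
    }
  }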

2. BACKGROUND AND MOTIVATION

The polyhedral compilation framework is a powerful formalism to express parallelizing loop transformations and loop nest optimizations [12, 13, 6, 3]. Its application domain was traditionally constrained to static-control, regular loop nests, but the polyhedral community has been working on extending its applicability to more complex loops with dynamic, data-dependent behavior, and has made steady progress.

The polyhedral framework abstracts computational regions of the control flow as sets and relations over statement instances. It represents a program and its semantical properties using iteration domains, access relations, dependences and schedule components. The statement instances are collected in iteration domains and are mapped to the accessed array elements by access relations. Dependences relate statement instances depending on each other, while a schedule defines a partial execution order on these statement instances.

  S0: n = f(1);
  for (i = 0; i < 100; i++) {
    S1: m = g(i);
    for (j = 0; j < m; j++)
      for (k = 0; k < n; k++)
        S2: ...;      /* body elided in the extracted source */
    S3: n = h(i);     /* h is hypothetical; S3 rewrites n, per the write access relation (3) */
  }

Figure 1: Example of dynamic counted loops.

Figure 2: The dependence graph of the example (nodes S0, S1, S2, S3; edges S0→S2, S0→S3, S1→S1, S1→S2, S3→S2, S3→S3).

By capturing control dependences as affine relations from exit predicates to dominated statements in loop bodies, one may build a sound abstraction of the scheduling constraints for the loop nest. This technique is applicable to arbitrary while loops, in conjunction with a suitable code generation strategy to recover the exact control flow protected by the exit predicate, and by over-approximating the loop upper bound as +∞.

This is the approach explored by Benabderrahmane et al., but the resulting polyhedral representation is plagued by additional spurious loop-carried dependences to update the exit predicate, removing many useful loop nest transformations from the affine scheduling space. In the more restricted context of dynamic counted loops, it is possible to eliminate those loop-carried dependences, as the exit predicate only depends on loop-invariant data.

We base our formalism and experiments on the schedule tree representation [18]. A schedule tree typically comprises a domain node describing the overall extent of the statement instances; context nodes introducing symbolic parameters and constraints on those; sequence and set nodes expressing ordered or unordered branches, respectively; filter nodes selecting a subset of the statement instances as the children of a sequence or set node; and band nodes defining a partial schedule as well as permutability and/or parallelism properties on a group of statements. Band nodes are derived from tilable bands in the Pluto framework [6].

Figure 3 shows the schedule tree of the example in Figure 1. A schedule tree has the same expressiveness as any affine schedule representation, but it facilitates local schedule manipulations and offers a systematic way to associate non-polyhedral semantical extensions. The domain node in Figure 3 is

  Domain = {S0(); S1(i) : 0 ≤ i < 100; S2(i, j, k) : 0 ≤ i < 100 ∧ 0 ≤ j ∧ 0 ≤ k; S3(i) : 0 ≤ i < 100}    (1)

  domain
    context: [m, n]
      sequence
        filter: S0()
        filter: S1(i); S2(i, j, k); S3(i)
          band: S1(i) → (i); S2(i, j, k) → (i); S3(i) → (i + 1)
            sequence
              filter: S1(i)
              filter: S2(i, j, k) : j < m ∧ k < n
                band: S2(i, j, k) → (j)
                  band: S2(i, j, k) → (k)
              filter: S3(i)

Figure 3: The schedule tree of the example.

We will leverage this extensibility to represent non-affine loop bounds. Since m and n are defined in the i-loop, the loops they control cannot be appropriately represented in the iteration domain. Interestingly, a schedule tree makes it possible to introduce constraints on symbolic parameters deeper than the context node introducing those parameters. These constraints, captured through filter nodes, only apply in the branch controlled by the filter node, even if the context node introducing the parameter itself is located higher up in the tree. We may thus insert a context node right underneath the domain node, as recommended in the existing schedule tree implementation, while capturing the constraints induced by m and n in the filter node of statement S2. This filter node effectively splits the three-dimensional schedule into two nested bands. The resulting schedule tree may indeed be seen as a one-dimensional external domain and schedule enclosing a two-dimensional inner domain and schedule controlled by two additional parameters. In such a schedule tree, the domain and access relations of statement S2 can be represented exactly and parametrically in m and n. This representation can be used to compute the dependence relation of the whole schedule tree.

Based on this dependence information, one may derive a new schedule using the Pluto algorithm or one of its variants [6, 26], to optimize locality and extract parallelism. The final step is to generate code from the schedule tree down to a high-level program. The generation of the abstract syntax tree (AST) follows the approach implemented in isl [18], traversing the schedule tree and specializing the code generation algorithm to integrate the specific control flow template corresponding to the converted dynamic counted loops. Before encountering a filter node associated with a dynamic counted loop, the exit predicate and its controlled loop body are seen as a single black-box statement by the AST generation algorithm. When passing the filter node constraining the dynamic upper bound, it is necessary to complement the standard code generation procedure with a dedicated "dynamic counted loop template". This template involves the reconstruction of the exit predicate and the introduction of an early exit (break) instruction guarded by the predicate. Our algorithm generates code in a single traversal of the schedule tree, avoiding the need to call the AST generator multiple times for dynamic counted loops. This is another difference with Benabderrahmane's framework.

3. PROGRAM ANALYSIS

Dynamic counted loops arise frequently in irregular applications, but they are sometimes not written in a form that can be handled with our technique. We need a preprocessing step to make them amenable to our approach.

3.1 Preparation

The first category of dynamic counted loops is the one shown in the example of Figure 1. It is referred to as the normalized format.

Sparse matrix computations constitute a second category of dynamic counted loops. They are a class of computations using a compressed data layout to store the nonzero elements of a matrix. Loops iterating on the compressed layout usually have a dynamic upper bound as well as a dynamic lower bound. However, these loops can easily be transformed into the normalized format by a non-affine subtraction of the lower bound from the upper bound. Note that this transformation may introduce non-affine array subscripts; we assume the dependence analysis will conservatively handle such subscripts, or symbolically eliminate identical non-affine expressions on the left- and right-hand sides, or benefit from PENCIL annotations to refine its precision [8, 1].

Some forms of while loops may also be modeled, as long as an affine induction variable can be identified and assuming the variant part of the exit condition reduces to this induction variable.
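For instance, a minimal sketch of this preparation step on a CSR-style traversal (array names are our own):

  /* Original form: the inner loop has dynamic lower AND upper bounds. */
  for (int i = 0; i < n_rows; i++)
    for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
      y[i] += val[j] * x[col[j]];

  /* Normalized format: subtract the dynamic lower bound, so the inner
     loop counts from 0 to a data-dependent trip count m.  The subscripts
     become non-affine (lb + j), which the dependence analysis handles
     conservatively or with the help of PENCIL annotations. */
  for (int i = 0; i < n_rows; i++) {
    int lb = row_ptr[i];
    int m  = row_ptr[i + 1] - lb;
    for (int j = 0; j < m; j++)
      y[i] += val[lb + j] * x[col[lb + j]];
  }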

3.2 Modeling Control Dependences

To make dynamic counted loops amenable to a polyhedral representation, we introduce additional dependences associated with exit predicates and their definition statements. An exit predicate definition and check is inserted at the beginning of each iteration of a dynamic counted loop. At code generation time, all statements in the body of the counted loop will have to be dominated by a break instruction conditioned by the exit predicate. This follows the state-of-the-art method for while loops [5], but without the inductive computation and loop-carried dependence on the exit predicate. Of course, we delay the introduction of break instructions until code generation, to keep the control flow in a statically manageable form for a polyhedral compiler. As a very simple illustrative example, the code in Figure 4(a) is preprocessed into the version in Figure 4(b) before constructing the affine representation.

  (a) original:
  for (j = 0; j < m; j++)
    for (k = 0; k < n; k++)
      S(i, j, k);

  (b) preprocessed:
  for (j = 0; 1; j++)
    for (k = 0; 1; k++)
      if (j < m && k < n)
        S(i, j, k);

Figure 4: Preprocessing of a nest of dynamic counted loops; the break instructions are introduced later, at code generation.

The read access relation of the example in Figure 1 is

  R = [m, n] → {S2(i, j, k) → m[] : 0 ≤ i < 100 ∧ 0 ≤ j < m ∧ 0 ≤ k < n;
                S2(i, j, k) → n[] : 0 ≤ i < 100 ∧ 0 ≤ j < m ∧ 0 ≤ k < n}    (2)

and its write access relation is

  W = [m, n] → {S0() → n[]; S1(i) → m[] : 0 ≤ i < 100; S3(i) → n[] : 0 ≤ i < 100}    (3)

According to the variant of the Pluto algorithm implemented in isl, one may set the validity dependences, associated with semantics preservation, to

  Validity = (R⁻¹ ∘ W′ + W⁻¹ ∘ R′ + W⁻¹ ∘ W′) ∩ (Schedule ≺ Schedule′)    (4)

and the proximity dependences, associated with locality enhancement, to

  Proximity = (R⁻¹ ∘ W′ + W⁻¹ ∘ R′ + W⁻¹ ∘ W′ + R⁻¹ ∘ R′) ∩ (Schedule ≺ Schedule′)    (5)

where Schedule represents the original schedule constructed from the original code according to the procedure above, and the primed maps distinguish the iterations in dependence.
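As a worked instance (our own illustration, not spelled out in the paper), combine the write of m by S1 in (3) with the reads of m by S2 in (2): S1(i) redefines m at every iteration of the outer loop and precedes S2(i, j, k) in the schedule, so (4) contains the flow dependence

  {S1(i) → S2(i, j, k) : 0 ≤ i < 100 ∧ 0 ≤ j < m ∧ 0 ≤ k < n}

Note that no dependence is carried by the j or k loops on the exit predicates themselves: m and n are not updated inside those loops, which is precisely what distinguishes dynamic counted loops from general while loops.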

4. CODE GENERATION

Our code generator specializes the AST generation algorithm of Grosser et al. [18]; we implement this specialization along the filter nodes introducing constraints on the parameters associated with dynamically evaluated exit predicates. The algorithm performs a depth-first traversal of the schedule tree, applying a customizable form of Quilleré's algorithm [22] on each band node. Since special care is needed to handle exit predicates and their impact on the dynamic upper bounds, the Grosser et al. algorithm is not able, in its original form, to generate semantically correct code for our extended schedule tree.

However, it is easy to modify it to handle the special case of exit predicates being homogeneous over all statements in a sequence or set node of the schedule tree (e.g., all statements in a band of permutable loops). Indeed, it is possible to hook the generation of custom templates for the computation of exit predicates and the insertion of early exit statements directly into the processing of filter nodes associated with dynamically updated parameters. This is facilitated by the ability to attach syntactic annotations about exit predicates using a so-called mark node of the schedule tree. A mark node is designed to attach any kind of information to a subtree. While the symbolic parameters are introduced by a context node immediately below the domain node, a mark node further down the tree can track the nesting level where the symbolic constant should be treated in a specific way, streamlining the code generation of loop nests with multiple dynamic counted loops. In particular, the AST generator can read information from a mark node to handle the generation of exit predicates.

In the following, we thus restrict ourselves to the case of homogeneous exit predicates over statements belonging to sequence or set nodes in the schedule tree. This practically prevents the code generator from implementing loop fusion across static and dynamically counted loops, but it still enables tiling and unimodular transformations (interchange, skewing, reversal). The extension to loop fusion requires additional work in the separation step of the AST generation algorithm, similar to Benabderrahmane's technique, and its complete automation on schedule trees is left for future work.

4.1 A Static Upper Bound

A general while loop may be converted to an unbounded for loop. Dynamic counted loops, in contrast, do have an upper bound; it simply varies dynamically. To facilitate the implementation of loop transformations, following Benabderrahmane's work [5], one may optionally derive a static upper bound u that is greater than or equal to all bounds reached by a given dynamic counted loop.

The u parameter can be approximated statically, as the dynamic upper bounds are functions of outer enclosing loop variables: a typical solution relies on Fourier-Motzkin elimination, projecting out enclosing dimensions and eliminating non-affine constraints. The u parameter can also be determined in other ways, from data structure properties or from additional user-defined predicates in Pencil [1]. For example, in sparse matrix computations, it may be computed by inspecting the maximum number of non-zero entries in a row in compressed sparse row (CSR) format. In many cases, a symbolic expression for the u parameter can also be computed automatically, but we did not yet automate this analysis.
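For example, the following inspector (a minimal sketch; names are our own) derives u for a CSR matrix as the maximum row length:

  /* Derive the static upper bound u of the dynamic counted loop over
     the non-zero entries of each row, by scanning the CSR row pointer. */
  int csr_static_upper_bound(const int *row_ptr, int n_rows) {
    int u = 0;
    for (int i = 0; i < n_rows; i++) {
      int len = row_ptr[i + 1] - row_ptr[i];
      if (len > u)
        u = len;
    }
    return u;
  }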

4.2 Template for Nested Bands

As there is a definition statement of the dynamic upper bound between the outer loops and a nested dynamic counted loop, and since we introduce additional dependences between this definition statement and the statements nested inside dynamic counted loops, the canonically constructed schedule tree isolates two nested band nodes representing the different levels of the loop nest. Let us show a code generation template that can be applied to customize the control flow for each of these bands.

Generating the conditional for a single dynamic counted loop is straightforward, since the predicate attached in a mark node can easily be extracted. As a result, the AST generator only needs to generate a conditional around the loop statements, as well as an early exit statement in the else branch.

For cases with multiple dynamic counted loops, we propose to systematically generate one conditional at the innermost level. Figure 4(b) shows an example illustrating this scenario. The predicates of both loops are included in a single conditional, generated under the inner loop. This scheme opens opportunities for affine loop transformations, such as loop interchange, that are not expressible in Benabderrahmane's framework due to the presence of spurious loop-carried dependences.

When the Grosser et al. algorithm traverses the band nodes of a schedule tree, it projects the local schedule constraints out of the domain node. As the symbolic constants and their constraints do not appear in the domain node, the generated loops iterate from 0 to u. It is thus necessary to emit an early exit statement, or many empty iterations will be wasted. This is straightforward in the case of a single dynamic counted loop; in nested cases, since there is only one conditional capturing multiple predicates, we need to extract the parameters one by one from the predicate list and generate the corresponding exit statements from the innermost level outwards. The exit predicates are generated in the form of multiple conditionals rather than else branches. Figure 5 shows the result on the example of Figure 4(b).

  for (j = 0; j < u; j++) {
    for (k = 0; k < u; k++) {
      if (j < m && k < n)
        S(i, j, k);
      if (k >= n)
        break;
    }
    if (j >= m)
      break;
  }

Figure 5: Generation of exit predicates.

The templates serve as a post-processing step, after code generation. A break statement with the appropriate guard is always inserted in a dynamically counted loop of the generated code. As an illustrative example, Figure 6(b) shows the result of applying loop interchange to the example of Figure 6(a). Unlike the work of Jimborean et al. [19], which handles general while loops and needs to speculate on the number of iterations, our technique always executes the same number of iterations as the original dynamic counted loops.

  (a) dynamic counted loop:
  for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
      S(i, j);

  (b) after loop interchange, with the dynamic bound abstracted by a conditional:
  for (j = 0; j < u; j++)
    for (i = 0; i < n; i++)
      if (j < m)
        S(i, j);

Figure 6: Conditional abstraction of dynamic counted loops.

Loop tiling is a special case that should be taken into account. Before the generation of exit predicates, one may apply additional affine loop transformations on the schedule tree when possible. The typical candidate is loop tiling, which involves the insertion of an additional schedule dimension through strip-mining. When strip-mining a dynamic counted loop, there should be an exit statement at both levels. For the point loop (iterating within a tile), the common case above applies. For the tile loop (iterating among tiles), we align its bounds and strides to follow the structure of the inner loop, so that its counter can also be compared systematically with the same bound. Figure 7 shows an example after the application of loop tiling to the code in Figure 5.

  for (jj = 0; jj * BB < u; jj++) {
    for (kk = 0; kk * CC < u; kk++) {
      for (j = jj * BB; j < (jj + 1) * BB; j++) {
        for (k = kk * CC; k < (kk + 1) * CC; k++) {
          if (j < m && k < n)
            S(i, j, k);
          if (k >= n)
            break;
        }
        if (j >= m)
          break;
      }
      if (kk * CC >= n)
        break;
    }
    if (jj * BB >= m)
      break;
  }

Figure 7: Generation of exit predicates in the tiling case (BB and CC are the tile sizes).

Guarded execution may still run through empty iterations of the dynamic counted loop itself. This is the price to pay for a fixed upper bound on the number of iterations. It may be mitigated with additional strip-mining of the outer loops, to better control the value of u, effectively partitioning the loop nest into coarse-grain sub-computations amenable to execution on a heterogeneous target.

4.3 Template for Combined Parallel Bands

When the target architectures are shared memory multiprocessors, the code generation template for separate bands described above is fully applicable. However, when targeting GPU accelerators or producing fixed-length vector code, we usually expect to combine nested bands to express parallelism at multiple levels. This motivates the exploitation of data parallelism within dynamic counted loops, in combination with other nested loops. Since dynamic counted loops result in nested bands in the schedule tree, the combined exploitation of multiple levels of parallelism including one or more dynamic counted loops requires special treatment that is not directly modeled by affine sets and relations in the tree, or by band-local parallelism annotations. On GPU targets, the constraints on the grid/ND-range of multi-level data parallelism require the collection of bound information across nested bands: when launching a kernel, the parameters of the grid/ND-range must be known and may not evolve during the whole run of the kernel.

Unfortunately, the statements between nested bands that occur in dynamic counted loops are used to initialize dynamic upper bounds. Statements in the body of these dynamic counted loops depend on those definition statements, through the added dependences modeling the original dependence of the dynamic loop. Still, one may sink these definition statements within the dynamic counted loops, as a preprocessing step. As a result, the nested bands can easily be combined together, with no intervening computation or control flow.
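The following sketch illustrates the effect of sinking the bound definition (our own simplified rendering with a single dynamic counted loop; g and S are hypothetical, and g must be side-effect-free):

  int m;
  /* Before: the definition of m separates the two bands. */
  for (int i = 0; i < N; i++) {
    m = g(i);                      /* intervening definition statement */
    for (int j = 0; j < u; j++)
      if (j < m) S(i, j);
  }

  /* After sinking: the definition moves into the inner loop, so the
     i and j loops form a single combined band with no intervening
     statement, at the cost of redundant evaluations of g(i). */
  for (int i = 0; i < N; i++)
    for (int j = 0; j < u; j++) {
      m = g(i);
      if (j < m) S(i, j);
    }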

5. EXPERIMENTAL EVALUATION

Our framework takes a C program with Pencil functions as input [1]; Pencil is intended as a target language for DSL compilers and as a high-level, portable implementation language for programming accelerators. It allows a domain expert to convey information that the compiler could not derive statically.

We use PPCG [26], a polyhedral compiler for Pencil that performs loop nest transformations, parallelization and data locality optimization, and generates OpenCL or CUDA code. PPCG exploits the information provided by the Pencil extensions, performs transformations and generates CUDA code automatically. In a follow-up auto-tuning step, we look for optimal parameter values for tile sizes, block sizes, grid sizes, etc., for a given application and target architecture. PPCG supports exporting and importing schedule trees, enabling us to prototype some of the transformation steps of our method in iscc.

We compare the performance of the code generated with our technique against the code generated from the Pencil source without our algorithm.

The experiments are conducted on an NVIDIA Quadro K4000 GPU. The CPU of the experimental platform is an Intel Xeon E5-2630. The sequential code is compiled with the icc compiler from Intel Parallel Studio XE 2016, with the -Ofast -fstrict-aliasing optimization flags. The CUDA code is compiled with the NVIDIA CUDA 7.5 toolkit with the -O3 optimization flag. We run each benchmark 9 times and retain the median value.

5.1 HOG Benchmark

The HOG benchmark is extracted from the Pencil benchmark suite, a collection of applications and kernels for evaluating Pencil compilers. The distribution of intensity gradients or edge directions describes the local object appearance and shape within an image. When processing an image, the HOG descriptor divides it into small connected regions called cells. A histogram of gradient directions is then compiled for the pixels within each cell. The descriptor finally concatenates these histograms together. It also contrast-normalizes the local histograms by calculating an intensity measure across a block, a larger region of the image, and then uses this value to normalize all cells within the block, improving accuracy. This normalization results in better invariance to changes in illumination and shadowing.

The kernel of the HOG descriptor contains two nested dynamic counted loops. The upper bounds of these inner loops are defined, and vary, as the outermost loop iterates. The dynamic parameter is an expression of max and min functions of the outer loop iterator and an array of constants. We derive the static upper bound parameter u from the BLOCK_SIZE constant, a globally defined parameter of the program declaring the size of an image block.

Since we target a GPU architecture, we ought to extract large degrees of parallelism from multiple nested loops. As explained in the previous section, we thus sink the definition statements of the dynamic parameters within the inner dynamic counted loops and apply our template for a combined band. We then generate the CUDA code with auto-tuned parameter values for tile sizes, block sizes, grid sizes, etc. We show performance results with and without host-device data transfer time, in Figure 8 and Figure 9 respectively, considering multiple block sizes. The detection accuracy improves as the block size increases. Our algorithm achieves a promising performance improvement for each block size: our technique obtains a speedup ranging from 4.4× to 23.3×, while the Pencil code suffers from a degradation of about 75%.

Figure 8: Performance of the HOG descriptor with data transfer time (speedup over sequential code, PPCG vs. our work, for BLOCK_SIZE in {16, 32, 64, 128, 256, 512, 1024}).

Figure 9: Performance of the HOG descriptor without data transfer time (same setup as Figure 8).

5.2 SpMV Computations

Sparse matrix operations are an important class of algorithms, arising frequently in applications ranging from graph processing and physical simulation to data analytics. They have attracted a lot of parallelization and optimization efforts. Programmers may use different formats to store a sparse matrix, among which we consider four representations: Compressed Sparse Row (CSR), Block CSR (BCSR), Diagonal (DIA) and ELLPACK (ELL) [27]. In their original form, the loop bounds do not match our canonical structure: we apply a non-affine shift by the dynamic lower bound, as discussed earlier. Our experiments in this subsection target the benchmarks presented in [25], with our own modifications to suit the syntactic constraints of our framework.

We first consider the Pencil function for the CSR representation. All arrays are declared through the C99 variable-length array syntax with the static const restrict type qualifiers, allowing PPCG to derive the size of the arrays offloaded to the accelerator despite the presence of indirect accesses (subscripts of subscripts), and guaranteeing that these arrays do not alias. The three other representations can be modeled with a make-dense transformation, as proposed by Venkat et al. [25], followed by a series of loop transformations: BCSR is obtained by applying make-dense and tiling transformations, DIA can be obtained after make-dense, shift and permutation transformations, and a non-affine shift together with tiling results in the ELL format. See Venkat et al. [25] for details.
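A sketch of the CSR kernel in this style (our own reconstruction, not the exact benchmark source; the prototype and names are hypothetical, and the matrix is assumed square for brevity):

  /* C99 variable-length array parameters with static const restrict:
     the sizes of the offloaded arrays are apparent to PPCG despite the
     indirect accesses, and the arrays are guaranteed not to alias. */
  void spmv_csr(int n_rows, int nnz,
                const int row_ptr[static const restrict n_rows + 1],
                const int col[static const restrict nnz],
                const double val[static const restrict nnz],
                const double x[static const restrict n_rows],
                double y[static const restrict n_rows])
  {
    for (int i = 0; i < n_rows; i++) {
      double s = 0.0;
      for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
        s += val[j] * x[col[j]];
      y[i] = s;
    }
  }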

The maximum number of non-zero entries in a row is the static upper bound and may be set as the u parameter. It can be derived through an inspection pass. As a result, the references with indirect array subscripts can be sunk under the inner dynamic counted loop, exposing a combined band in the schedule tree. We show the performance without data transfer time in Figure 10. The input sparse matrices are obtained from the University of Florida sparse matrix collection [11]. Our technique accelerates SpMV by 1.1× to 4.7×, which is consistent with the numbers reported by Venkat et al. [25].

BCSR is the blocked version of CSR; its parallel version is the same as that of CSR after tiling with PPCG, so we do not show its performance separately. Similarly, we generate the CUDA code with auto-tuned parameter values for the DIA and ELL formats, as shown in Figure 11 and Figure 12. The original DIA code runs into a segmentation fault for the mc2depi and pwtk matrices, hence we remove these two matrices from the corresponding experiments. Speedups range from 1.2× to 5.0× for DIA and from 1.9× to 7.4× for ELL.

5.3 Inspector/Executor Codes

We also compare the performance of the generated code when using an inspector/executor strategy to optimize the data layout of sparse matrix computations. The inspector is used to analyze memory reference patterns and to generate communication schedules, so we mainly focus on the application of our technique to the executor. Since BCSR can be obtained by applying a tiling transformation to a CSR kernel, we only take the executor of CSR into account. The executor of the DIA format is not a dynamic counted loop and will not be studied.

Following the approach of Venkat et al. [25], inspector/executor schemes may cooperate with a polyhedral transformation and optimization framework. The framework uses three dedicated transformations to achieve this cooperation: make-dense, compact and compact-and-pad.

Both the compact and compact-and-pad transformations can be applied to the CSR and ELL formats. The executor of the CSR format is roughly equivalent to the original CSR SpMV code, so the performance comparison is similar to that of the original CSR SpMV code, as shown in Figure 13. The ELL executor uses a transposed matrix to achieve global memory coalescing, whose efficiency depends heavily on the number of rows that have a similar number of non-zero entries. The performance of the ELL executor is shown in Figure 14, for which we obtain speedups from 2.9× to 10.2×.
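For reference, a sketch of the ELL executor's access pattern (our own reconstruction; array names are hypothetical):

  /* ELL SpMV executor: every row is padded to u entries (padded entries
     store the value 0.0), and the arrays are stored transposed, with
     entry j of row i at offset j*n_rows + i, so that consecutive rows,
     and hence consecutive threads, access consecutive memory locations. */
  void spmv_ell(int n_rows, int u,
                const int *ell_col, const double *ell_val,
                const double *x, double *y)
  {
    for (int i = 0; i < n_rows; i++) {
      double s = 0.0;
      for (int j = 0; j < u; j++)
        s += ell_val[j * n_rows + i] * x[ell_col[j * n_rows + i]];
      y[i] = s;
    }
  }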

Figure 10: Execution time comparison between the sequential and CUDA CSR SpMV code (milliseconds; matrices cant, consph, pwtk, rma10, cop20_A, mc2depi, pdb1HYS, Press_Poisson, tomographic1, mac_econ_fwd500).

Figure 11: Execution time of the sequential and CUDA DIA SpMV code (seconds; matrices cant, consph, pwtk, cop20_A, mc2depi, pdb1HYS, Press_Poisson, mac_econ_fwd500).

Figure 12: Execution time of the sequential and CUDA ELL SpMV code (milliseconds; same ten matrices as Figure 10).

Figure 13: Execution time of the sequential executor and the CUDA CSR SpMV code (milliseconds; same ten matrices as Figure 10).

6. RELATED WORK

The polyhedral framework is a powerful compilation technique to parallelize and optimize loops. It has become one of the main approaches for the construction of modern parallelizing compilers. Its application domain used to be constrained to static-control, regular loop nests, but the extension of the polyhedral framework to handle irregular applications is increasingly important given the growing adoption of the technique. The polyhedral community has invested significant efforts to make progress in this direction.

Figure 14: Execution time of the sequential executor and the CUDA ELL SpMV code (milliseconds; same ten matrices as Figure 10).

A representative application of irregular polyhedral techniques is the parallelization of while loops. The polyhedral model is expected to handle loop structures with arbitrary bounds that are typically regarded as while loops. Collard [10, 9] proposed a speculative approach based on the polyhedral model that extends the iteration domain of the original program and performs speculative execution on the new iteration domain. Parallelism is exposed at the expense of an invalid space-time mapping that needs to be corrected at run time. An alternative, conservative technique consists in enumerating a super-set of the target execution space [14, 15, 16, 17], and then eliminating invalid iterations by performing termination detection on the fly. The authors present solutions for both distributed and shared memory architectures. Benabderrahmane et al. [5] introduce a general framework to parallelize and optimize arbitrary while loops by modeling control-flow predicates. They transform a while loop into a for loop iterating from 0 to +∞. Compared to these approaches to parallelizing while loops in the polyhedral model, our technique relies on systems of affine inequalities only, as implemented in state-of-the-art polyhedral libraries. It does not need to resort to first-order logic such as uninterpreted functions or predicates, it does not involve speculative execution features, and it makes dynamic counted loops amenable to a wider set of transformations than general while loops.

A significant body of work addressed the transformation and optimization of sparse matrix computations. The implementation of manually tuned libraries [2, 4, 7, 20, 21, 27] is the common approach to achieving high performance, but it is difficult to port to each new representation and to different architectures. Sparse matrix compilers based on polyhedral techniques have been proposed [25], abstracting the indirect array subscripts and complex loop bounds in a domain-specific fashion, and leveraging conventional Pluto-based optimizers on an abstracted form of the sparse matrix computation kernel. We aim to extend the applicability of polyhedral techniques one step further, considering general Pencil code as input, and leveraging the semantical annotations expressible in Pencil to improve the efficiency of the generated code and to abstract non-affine expressions.

7. CONCLUSION

In this paper, we studied the parallelizing compilation and optimization of an important class of loop nests where counted loops have a dynamically computed, data-dependent upper bound. Such loops are amenable to a wider set of transformations than general while loops with inductively defined termination conditions. To achieve this, we model control dependences on data-dependent predicates by revisiting a state-of-the-art framework to parallelize arbitrary while loops. We specialize this framework to facilitate its integration in schedule-tree-based affine scheduling and code generation algorithms, and we provide code generation templates for multiple scenarios, from single dynamic counted loops to the nested bands case, and a combined band case for fixed-size multi-level data-parallel grids on GPUs. Our method relies on systems of affine inequalities, as implemented in state-of-the-art polyhedral libraries. It takes a C program with Pencil functions as input, covering a wide range of non-static control applications encompassing a well-studied class of sparse matrix computations. The experimental evaluation using the PPCG source-to-source compiler on representative irregular computations, from computer vision and finite element methods to sparse matrix linear algebra, validated the general applicability of the method and its performance benefits compared to shallower, black-box approximations of the control flow. Our next steps will be to fully automate and implement the algorithm in PPCG, and to conduct further experiments on CPU and GPU platforms, comparing the performance with manually tuned libraries like OSKI and CUSP as well as reference implementations of computer vision and data analytics kernels. Full automation will also involve the static analysis of symbolic upper bounds, possibly building on additional Pencil constructs.

Acknowledgments

This work was partly supported by the European Commission and French Ministry of Industry through the ECSEL project COPCAMS id. 332913, and by the French ANR through the European CHIST-ERA project DIVIDEND. Our experiments with Pencil benchmarks and PPCG benefited from the direct support of Michael Kruse, Riyadh Baghdadi and Chandan Reddy. Finally, none of this work would have been possible without the contributions of Sven Verdoolaege and Tobias Grosser on isl and PPCG.

8. REFERENCES

[1] Riyadh Baghdadi, Ulysse Beaugnon, Albert Cohen, Tobias Grosser, Michael Kruse, Chandan Reddy, Sven Verdoolaege, Adam Betts, Alastair F. Donaldson, Jeroen Ketema, et al. PENCIL: a platform-neutral compute intermediate language for accelerator programming. In Proceedings of the International Conference on Parallel Architecture and Compilation (PACT), pages 138–149. IEEE Computer Society, 2015.

[2] S. Balay, S. Abhyankar, M. Adams, J. Brown, P. Brune, K. Buschelman, V. Eijkhout, W. Gropp, D. Kaushik, M. Knepley, et al. PETSc users manual, revision 3.5. Argonne National Laboratory (ANL), 2014.
[3] Cedric Bastoul. Code generation in the polyhedral model is easier than you think. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 7–16. IEEE Computer Society, 2004.
[4] Nathan Bell and Michael Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC), pages 18:1–18:11. ACM, 2009.
[5] Mohamed-Walid Benabderrahmane, Louis-Noël Pouchet, Albert Cohen, and Cédric Bastoul. The polyhedral model is more widely applicable than you think. In Proceedings of the 19th International Conference on Compiler Construction (CC), pages 283–303. Springer, 2010.
[6] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral program optimization system. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2008.
[7] Aydın Buluç and John R. Gilbert. The Combinatorial BLAS: design, implementation, and applications. International Journal of High Performance Computing Applications, pages 496–509, 2011.
[8] J.-F. Collard, D. Barthou, and P. Feautrier. Fuzzy array dataflow analysis. In ACM Symposium on Principles and Practice of Parallel Programming, pages 92–102, Santa Barbara, CA, July 1995.
[9] Jean-François Collard. Space-time transformation of while-loops using speculative execution. In Proceedings of the Scalable High-Performance Computing Conference 1994, pages 429–436. IEEE Computer Society, 1994.
[10] Jean-François Collard. Automatic parallelization of while-loops using speculative execution. International Journal of Parallel Programming, 23(2):191–219, 1995.
[11] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1:1–1:25, 2011.
[12] P. Feautrier. Dataflow analysis of scalar and array references. International Journal of Parallel Programming, 20(1):23–53, February 1991.
[13] P. Feautrier. Some efficient solutions to the affine scheduling problem, part II, multidimensional time. International Journal of Parallel Programming, 21(6):389–420, December 1992. See also Part I, one-dimensional time, 21(5):315–348.
[14] Max Geigl, Martin Griebl, and Christian Lengauer. A scheme for detecting the termination of a parallel loop nest. Proc. GI/ITG FG PARS, 98, 1998.
[15] Max Geigl, Martin Griebl, and Christian Lengauer. Termination detection in parallel loop nests with while loops. Parallel Computing, 25(12):1489–1510, 1999.
[16] Martin Griebl and Jean-François Collard. Generation of synchronous code for automatic parallelization of while loops. In Proceedings of the 1st International Euro-Par Conference on Parallel Processing, pages 313–326. Springer, 1995.
[17] Martin Griebl and Christian Lengauer. On scanning space-time mapped while loops. In Proceedings of the 3rd Joint International Conference on Vector and Parallel Processing (CONPAR 94-VAPP VI), pages 677–688. Springer, 1994.
[18] Tobias Grosser, Sven Verdoolaege, and Albert Cohen. Polyhedral AST generation is more than scanning polyhedra. ACM Transactions on Programming Languages and Systems (TOPLAS), 37(4):12, 2015.
[19] Alexandra Jimborean, Philippe Clauss, Jean-François Dollinger, Vincent Loechner, and Juan Manuel Martinez Caamaño. Dynamic and speculative polyhedral parallelization using compiler-generated skeletons. International Journal of Parallel Programming, 42(4):529–545, 2014.
[20] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727, 2012.
[21] John Mellor-Crummey and John Garvin. Optimizing sparse matrix-vector product computations using unroll and jam. International Journal of High Performance Computing Applications, 18(2):225–236, 2004.
[22] Fabien Quilleré, Sanjay Rajopadhye, and Doran Wilde. Generation of efficient nested loops from polyhedra. International Journal of Parallel Programming, 28(5):469–498, 2000.
[23] Michelle Mills Strout, Larry Carter, and Jeanne Ferrante. Compile-time composition of run-time data and iteration reorderings. ACM SIGPLAN Notices, 38(5):91–102, 2003.
[24] Michelle Mills Strout, Alan LaMielle, Larry Carter, Jeanne Ferrante, Barbara Kreaseck, and Catherine Olschanowsky. An approach for code generation in the sparse polyhedral framework. Parallel Computing, 53:32–57, 2016.
[25] Anand Venkat, Mary Hall, and Michelle Strout. Loop and data transformations for sparse matrix code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 521–532, 2015.
[26] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization (TACO), 9(4):54, 2013.
[27] Richard Vuduc, James W. Demmel, and Katherine A. Yelick. OSKI: a library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16(1):521, 2005.
[28] David Wonnacott and William Pugh. Nonlinear array dependence analysis. In Proc. Third Workshop on Languages, Compilers and Run-Time Systems for Scalable Computers, Troy, New York, 1995.
