Polyhedral-Model Guided Loop-Nest Auto-Vectorization Konrad Trifunović, Dorit Nuzman, Albert Cohen, Ayal Zaks, Ira Rosen

Polyhedral-Model Guided Loop-Nest Auto-Vectorization Konrad Trifunović, Dorit Nuzman, Albert Cohen, Ayal Zaks, Ira Rosen To cite this version: Konrad Trifunović, Dorit Nuzman, Albert Cohen, Ayal Zaks, Ira Rosen. Polyhedral-Model Guided Loop-Nest Auto-Vectorization. The 18th International Conference on Parallel Architectures and Com- pilation Techniques, Sep 2009, Raleigh, United States. hal-00645325 HAL Id: hal-00645325 https://hal.inria.fr/hal-00645325 Submitted on 27 Nov 2011 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Polyhedral-Model Guided Loop-Nest Auto-Vectorization Konrad Trifunovic †, Dorit Nuzman ∗, Albert Cohen †, Ayal Zaks ∗ and Ira Rosen ∗ ∗IBM Haifa Research Lab, {dorit, zaks, irar}@il.ibm.com †INRIA Saclay, {konrad.trifunovic, albert.cohen }@inria.fr Abstract—Optimizing compilers apply numerous inter- Modern architectures must exploit multiple forms of par- dependent optimizations, leading to the notoriously difficult allelism provided by platforms while using the memory phase-ordering problem — that of deciding which trans- hierarchy efficiently. Systematic solutions to harness the formations to apply and in which order. Fortunately, new infrastructures such as the polyhedral compilation framework interplay of multi-level parallelism and locality are emerg- host a variety of transformations, facilitating the efficient explo- ing, by advances in automatic parallelization and loop nest ration and configuration of multiple transformation sequences. optimization [8], [9]. These rely on the polyhedral model Many powerful optimizations, however, remain external to the of compilation to facilitate efficient exploration and appli- polyhedral framework, including vectorization. The low-level, cation of very complex transformation sequences. However, target-specific aspects of vectorization for fine-grain SIMD has so far excluded it from being part of the polyhedral framework. exploiting subword parallelism by vectorization is excluded In this paper we examine the interactions between loop from the polyhedral model due to its low-level machine- transformations of the polyhedral framework and subsequent dependent nature. As a result, there remains a gap in pro- vectorization. We model the performance impact of the dif- viding a combined framework for exploring complex loop ferent loop transformations and vectorization strategies, and transformation sequences together with vectorization. In this then show how this cost model can be integrated seamlessly into the polyhedral representation. This predictive modelling work we help bridge this gap by incorporating vectorization facilitates efficient exploration and educated decision making considerations into a polyhedral model. This methodology to best apply various polyhedral loop transformations while could be extended in the future to consider the effects of considering the subsequent effects of different vectorization additional transformations within the polyhedral framework. schemes. Our work demonstrates the feasibility and benefit of The contributions of this work are fourfold: tuning the polyhedral model in the context of vectorization. Experimental results confirm that our model has accurate predictions, providing speedups of over 2.0x on average over • Cost model for vectorization. We developed a fast traditional innermost-loop vectorization on PowerPC970 and Cell-SPU SIMD platforms. and accurate cost model designed to compare the performance of various vectorization alternatives and their I. INTRODUCTION interactions with other loop optimizations. • Polyhedral modelling of subword parallelism. We Fine-grain data level parallelism is one of the most demonstrate how to leverage the polyhedral compilation effective ways to achieve scalable performance of numerical framework naturally and efficiently to assess oppor- computations. Automatic vectorization for modern short- tunities for subword parallelism in combination with SIMD instruction sets, such as Altivec, Cell SPU and complex loop transformation sequences. SSE, has been a popular topic, with successful impact on • Evaluation in a production compiler. Our model is production compilers [1], [2], [3], [4]. Exploiting subword fully automated and implemented based on GCC 4.4. parallelism in modern SIMD architectures, however suffers • Studying the interplay between loop transforma- from several limitations and overheads (involving alignment, tions. We provide a thorough empirical investigation redundant loads and stores, support for reductions and more) of the interplay between loop interchange with array which complicate the optimization dramatically. Automatic expansion and loop nest vectorization of both inner and vectorization was also extended to handle more sophisticated outer loops on modern short-SIMD architectures. control-flow restructuring including if-conversion [5] and outer-loop vectorization [6]. Classical techniques of loop distribution and loop interchange [7] can dramatically impact The rest of this paper is organized as follows: Section II the profitability of vectorization. To be successful, it is discusses related work. Section III studies a motivating vital to avoid inapt strategies that incur severe overheads. example. Section IV provides an overview, introduces some Nevertheless, little has been done to devise reliable profit notation and captures loop interchange and vectorization models to guide the compiler through this wealth of loop variants as affine transformations. Section V describes the nest transformation candidates, vectorization strategies and optimization process in detail. Section VI describes the code generation techniques. Our main goal in this paper is cost function. Section VII exposes performance results, and to propose a practical framework for such guidance. Section VIII concludes. II. RELATED WORK study the interplay of loop transformations with backend optimizations (including vectorization) and complex microar- Vectorization Cost-Model Related Work. Leading op- chitectures by constructing huge search spaces of unique, timizing compilers recognize the importance of devising a valid transformation sequences [9]. These search spaces cost model for vectorization, but have so far provided only are tractable using carefully crafted heuristics that exploit partial solutions. Wu et al. conclude [1] regarding the XL the structure of affine schedules. An analytical performance compiler that “Many further issues need to be investigated model capable of characterizing the effect of such complex before we can enjoy the performance benefit of simdization transformations (beyond loop interchange, and accommodat- ... The more important features among them are ... the ing for large-scale locality effects) does not currently exist. ability to decide when simdization is profitable. Equally Known analytical cache models for loop transformations important is a better understanding of the interaction be- are quite mature in some domains, loop tiling in particular tween simdization and other optimizations in a compiler [14], yet remain sensitive to syntactic patterns and miss key framework”. Likewise, Bik stresses the importance of user semantical features such as loop fusion effects [15], [16]. hints in the ICC vectorizer’s profitability estimation [2], to avoid vectorization slowdowns due to “the performance III. MOTIVATING EXAMPLE penalties of data rearrangement instructions, misaligned memory references, failure of store-to-load forwarding, or for (v = 0; v < N; v++) additional overhead of run-time optimizations to enable for (h = 0; h < N; h++) { S1: s = 0; vectorization.”; on the other hand opportunities may be for (i = 0; i < K; i++) missed due to overly conservative heuristics. for (j = 0; j < K; j++) S2: s += image[v+i][h+j] * filter[i][j]; These state-of-the-art vectorizing compilers incorporate a S3: out[v][h] = s >> factor;} cost model to decide whether vectorization is expected to be profitable. These models however typically apply to a Figure 1. Main loop kernel in Convolve single loop or basic-block, and do not consider alternatives combined with other transformations at the loop-nest level. The first and foremost goal of a vectorization cost- This work is the first to incorporate a polyhedral model to model is to avoid performance degradations while not miss- consider the overall cost of different vectorization alterna- ing out on improvement opportunities. In addition, a cost tives in a loop-nest, as well as the interplay with other loop model should also drive the selection of a vectorization transformations. strategy, assuming there exist a profitable one. Given Loop-nest auto-vectorization in conjunction with loop- a loop-nest, a compiler needs to choose which loop to interchange has been addressed in prior art [10], [7], [11]. vectorize, and at which position, employing one of several This however was typically in the context of traditional vec- strategies (innermost- or outer-loop vectorization, in-place tor machines (such as Cray), and interchange was employed or based on innermosting,

Load more