Implementation of Invariant Code Motion Optimization in an Embedded Processor Oriented Compiler Infrastructure
Ivan Považan, Marko Krnjetin, Miodrag Đukić, Miroslav Popović

Abstract—In order to achieve better performance of the generated code, various compiler optimizations are performed during the compilation process. Even though an optimization is expected to improve code quality, for embedded processors with a small amount of resources applying certain optimizations can sometimes lead to the generation of a larger number of instructions (i.e. larger code size and slower execution). One such optimization is a loop optimization: invariant code motion. In this paper we present improvements to the implementation of this optimization in an embedded processor oriented compiler infrastructure which give good results in cases of high register pressure. The enclosed test results for code translated with and without the proposed improvements show that our new approach generates code with better performance in all cases.

Index Terms—Compiler optimization, invariant code motion, high register pressure

I. INTRODUCTION

Having a high quality compiler for a certain processor is not an uncommon practice. It is actually quite the opposite: nowadays there is a high demand for fast, robust and reliable compilers that can generate high quality code. Quality of the code most often refers to code size (the number of instructions in memory) or code speed (the number of cycles needed for the code execution). In the embedded systems domain both characteristics are highly desirable, and ideally the generated code has performance near to hand-written assembly code. In other words, the generated code is small and fast. However, designing and developing such compilers is not a trivial job, especially if the quality of the source code is taken into account. The compiler has to understand and recognize certain code constructs in the source code and to perform the best compilation strategy in order to generate the best code, regardless of the way those constructs are written in the starting language. There are two approaches for dealing with this problem:
- source code adaptations,
- compiler optimizations.
The first approach relates to adapting the source code for a certain compiler according to compiler-specific coding guidelines, which can help the compiler in code generation. On the other hand, and more ideally, the compiler should perform the same transformations automatically and should not depend on the source code. Therefore compilers implement various optimizations which are performed automatically on the source code, or are available to the user through compiler switches, in order to achieve the best performance. Compiler optimizations are not independent, and applying one can affect another, which can result in better or worse code. Choosing the combination of optimizations which would generate the best code is a complex problem.
Furthermore, compilation passes like instruction selection, resource allocation and instruction scheduling also need to be interleaved and interconnected for embedded processors to achieve better results, which additionally complicates choosing the best compilation strategy [1]. For DSP applications, the most critical parts of the source code, which need to be condensed as much as possible, are loops. Optimizations like loop unrolling, software pipelining, loop tiling, invariant code motion, induction variable detection, hardware loop detection and similar generally improve the quality of the generated code by a significant percentage. However, the question is: is this always true?
Most of the mentioned loop optimizations are machine independent and are performed on a high-level intermediate representation of the translated code. This means that some target architecture specifics, such as available resources or available machine instructions, are not considered at that point in the compilation process, and applying such an optimization can have undesirable effects.
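To make the trade-off concrete: invariant code motion itself is semantics-preserving, since the hoisted computation produces the same values and is merely evaluated once in the preheader instead of on every iteration. A minimal sketch of the effect (a hypothetical illustration, not RTCC code):

```python
def loop_original(a, b, n):
    """Loop as written: the invariant expression 'a & b' is recomputed each iteration."""
    out = []
    for i in range(n):
        t = a & b          # loop-invariant: neither a nor b changes in the loop
        out.append(t + i)
    return out

def loop_hoisted(a, b, n):
    """After invariant code motion: 'a & b' is evaluated once before the loop."""
    t = a & b              # hoisted to the (conceptual) preheader
    out = []
    for i in range(n):
        out.append(t + i)
    return out
```

Note that after hoisting, t is live across the entire loop rather than within a single iteration; with many hoisted expressions this is precisely what raises register pressure on a machine with few registers.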
Moving code outside of a loop usually increases the number of interferences between operands and can cause spills, which increases the overall number of instructions. In this paper we present improvements to the implementation of the invariant code motion optimization which solve these problems.
The paper is organized in six sections. This section gives an introduction to the problem domain. The next section includes a description of the used compiler framework, while the third describes the initial algorithm for invariant code motion and the introduced improvements. The following section encloses test results and a discussion of the improvements acquired with the modifications. The fifth section covers related work, and finally the last section gives a conclusion.

Ivan Považan, RT-RK Institute for Computer Based Systems, Narodnog fronta 23a, Novi Sad, Serbia (e-mail: [email protected])
Marko Krnjetin, RT-RK Institute for Computer Based Systems, Narodnog fronta 23a, Novi Sad, Serbia (e-mail: [email protected])
Miodrag Đukić, Faculty of Technical Sciences, Trg Dositeja Obradovića 6, Novi Sad, Serbia (e-mail: [email protected])
Miroslav Popović, Faculty of Technical Sciences, Trg Dositeja Obradovića 6, Novi Sad, Serbia (e-mail: [email protected])
Proceedings of 4th International Conference on Electrical, Electronics and Computing Engineering, pp. RTI1.4.1-5, IcETRAN 2017, Kladovo, Serbia, June 05-08, ISBN 978-86-7466-692-0

II. RTCC
The implementation of the invariant code motion optimization described in this paper is part of an embedded processor oriented compiler infrastructure called RTCC [2]. RTCC is a class library which includes compiler modules for creating back-ends of embedded processor compilers. The framework defines sets of modules for:
- analyses – control flow, data dependency, call graph, liveness, alias, static single assignment;
- optimizations – common sub-expression elimination, dead code elimination, constant propagation, hardware loop detection, invariant code motion, induction variable detection, copy propagation;
- compilation passes – instruction selection, resource allocation, instruction scheduling, code generation;
- modelling the target architecture – resources, instructions, hazards, latencies, parallelization rules;
- IR (intermediate representation) – translation from the FE IR (front-end intermediate representation), definition of the BE IR (back-end intermediate representation).
The IR is defined in the form of a graph of connected basic blocks that contain lists of operations. HLIR, the high-level intermediate representation, includes high-level operations which are independent of the target architecture, while LLIR, the low-level intermediate representation, includes target specific instructions. Creating a custom compiler requires using an existing front-end and modelling the target processor with RTCC modules, where some of them can be used directly while others can be derived and adjusted according to target architecture specifics.
Source code translation is performed in the compiler's FE until its IR is reached. After that, the FE IR is transformed into RTCC HLIR and compilation resumes in the BE of the compiler. Control flow, data dependency, call graph and other analyses are then performed on the HLIR in order to provide additional information used by other compilation passes and high-level code optimizations. Afterwards, HLIR is lowered to LLIR by the instruction selection module, which applies rewriting rules that translate the abstract representation into target specific instructions. Once LLIR is reached, low-level optimizations, instruction scheduling and resource allocation are performed. Finally, the assembly code is generated.
One of the available optimization modules within the RTCC library is the invariant code motion module. It detects computation in a loop that does not change over loop iterations, and moves that computation out of the loop body in order to improve the performance of the generated code. This motion is also called hoisting.
The technique is most often applied on HLIR in order to hoist invariant expressions, but it can also be used during low-level passes. Initially, RTCC included HLICM, described in paper [3], which was implemented to work only as a high-level optimization.
In order to describe the HLICM algorithm, we first need to define how to detect invariant computation, then where the code can be moved, and finally how to determine whether the code is movable. These three procedures are defined according to [4] as follows:
Definition 1: The definition d: t ← a1 op a2 (where op is some operation) is loop-invariant within loop L if, for each source operand ai:
1. ai is a constant,
2. or all the definitions of ai that reach d are outside the loop,
3. or only one definition of ai reaches d, and that definition is loop-invariant.
Definition 2: The preheader basic block p of the loop L is the immediate dominator of the head basic block of the loop L.
Definition 3: The definition d: t ← a1 op a2 can be hoisted to the loop preheader p if:
1. d dominates all loop exits at which t is live-out,
2. and there is only one definition of t in the loop,
3. and t is not live-out of the loop preheader p.
Fig. 1 shows a simple example with invariant code which can be detected on HLIR. The output generated by the HLICM module is displayed in Fig. 2, which shows that the algorithm can successfully detect such scenarios.

L1:
  i = 0
L2:
  if (i > 10) jmp L3
  a = 0x101
  array[i] = a & b
  i++
  jmp L2
L3:
Fig. 1. ICM example before HLICM optimization

L1:
  i = 0
  a = 0x101
  t = a & b
L2:
  if (i > 10) jmp L3
  array[i] = t
  i++
  jmp L2
L3:
Fig. 2. ICM example after HLICM optimization
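A compact way to see Definition 1 at work (together with the single-definition requirement from Definition 3, condition 2) is a fixed-point computation over a toy three-address IR. The tuple encoding, the function name find_invariants, and the defs_outside parameter below are illustrative assumptions for this sketch, not part of RTCC:

```python
def find_invariants(loop_body, defs_outside):
    """Return targets of loop-invariant definitions per Definition 1 (simplified).

    loop_body: list of (target, op, operand1, operand2); operands are ints
    (constants) or variable names. defs_outside: variables whose only
    reaching definitions are outside the loop.
    """
    # Count how many times each variable is defined inside the loop.
    def_count = {}
    for tgt, _op, _a1, _a2 in loop_body:
        def_count[tgt] = def_count.get(tgt, 0) + 1

    def invariant_operand(a, invariant):
        return (isinstance(a, int)                # condition 1: constant
                or a in defs_outside              # condition 2: all reaching defs outside L
                or (def_count.get(a, 0) == 1      # condition 3: single in-loop def,
                    and a in invariant))          #              itself loop-invariant

    invariant = set()
    changed = True
    while changed:                                # iterate to a fixed point
        changed = False
        for tgt, _op, a1, a2 in loop_body:
            if tgt in invariant or def_count[tgt] != 1:   # Definition 3, condition 2
                continue
            if invariant_operand(a1, invariant) and invariant_operand(a2, invariant):
                invariant.add(tgt)
                changed = True
    return invariant
```

For a body modelling Fig. 1 — here "mov" with a duplicated constant operand is a modelling shortcut for the simple assignment a = 0x101 — the sketch marks both a and t invariant, while the induction variable i, which depends on its own in-loop definition, is correctly rejected:

```python
body = [("a", "mov", 0x101, 0x101),   # a = 0x101
        ("t", "&", "a", "b"),         # t = a & b
        ("i", "+", "i", 1)]           # i++
find_invariants(body, defs_outside={"b"})   # yields {"a", "t"}
```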