Fast-Path Loop Unrolling of Non-Counted Loops to Enable Subsequent Compiler Optimizations∗

Fast-Path Loop Unrolling of Non-Counted Loops to Enable Subsequent Compiler Optimizations∗ David Leopoldseder Roland Schatz Lukas Stadler Johannes Kepler University Linz Oracle Labs Oracle Labs Austria Linz, Austria Linz, Austria [email protected] [email protected] [email protected] Manuel Rigger Thomas Würthinger Hanspeter Mössenböck Johannes Kepler University Linz Oracle Labs Johannes Kepler University Linz Austria Zurich, Switzerland Austria [email protected] [email protected] [email protected] ABSTRACT Non-Counted Loops to Enable Subsequent Compiler Optimizations. In 15th Java programs can contain non-counted loops, that is, loops for International Conference on Managed Languages & Runtimes (ManLang’18), September 12–14, 2018, Linz, Austria. ACM, New York, NY, USA, 13 pages. which the iteration count can neither be determined at compile https://doi.org/10.1145/3237009.3237013 time nor at run time. State-of-the-art compilers do not aggressively optimize them, since unrolling non-counted loops often involves 1 INTRODUCTION duplicating also a loop’s exit condition, which thus only improves run-time performance if subsequent compiler optimizations can Generating fast machine code for loops depends on a selective appli- optimize the unrolled code. cation of different loop optimizations on the main optimizable parts This paper presents an unrolling approach for non-counted loops of a loop: the loop’s exit condition(s), the back edges and the loop that uses simulation at run time to determine whether unrolling body. All these parts must be optimized to generate optimal code such loops enables subsequent compiler optimizations. Simulat- for a loop. Therefore, a multitude of loop optimizations were pro- ing loop unrolling allows the compiler to determine performance posed to reduce iteration counts, remove loop conditions [2], hoist and code size effects for each potential transformation prior to invariant computations outside loop bodies [8], vectorize instruc- performing it. tions [17], revert iteration spaces and schedule loop code to utilize We implemented our approach on top of the GraalVM, a high- pipelined architectures [1]. However, non-counted loops [10], that performance virtual machine for Java, and evaluated it with a set is, loops for which we cannot statically reason about induction vari- of Java and JavaScript benchmarks in terms of peak performance, ables and iteration count, are less optimizable than counted loops compilation time and code size increase. We show that our approach as their iteration count cannot be statically determined. Optimizing can improve performance by up to 150% while generating a median such loops could potentially lead to large performance gains, given code size and compile-time increase of not more than 25%. Our that many programs contain non-counted loops. results indicate that fast-path unrolling of non-counted loops can As a generic way to optimize non-counted loops, we propose be used in practice to increase the performance of Java applications. to unroll them without attempting to remove their exit conditions. However, this does not necessarily improve performance as the CCS CONCEPTS number of loop exit condition checks is not reduced and existing compiler optimizations often cannot optimize them away. There- • Software and its engineering → Just-in-time compilers; Dy- fore, an approach is required that determines other optimization op- namic compilers; Virtual machines; portunities enabled by unrolling a non-counted loop. Such optimiza- KEYWORDS tion opportunities have already been researched by Leopoldseder et al.[29] for code duplication of control-flow merges, showing Loop Unrolling, Non-Counted Loops, Loop-Carried Dependencies, significant increases in run-time performance if applied selectively. Code Duplication, Compiler Optimizations, Just-In-Time Compila- Loop unrolling is in fact a sequential code duplication transformation, Virtual Machines tion; the body of a loop is duplicated in front of itself. We developed ACM Reference Format: an algorithm to duplicate the body of a non-counted loop, effectively David Leopoldseder, Roland Schatz, Lukas Stadler, Manuel Rigger, Thomas unrolling the loop and enabling subsequent compiler optimizations. Würthinger, and Hanspeter Mössenböck. 2018. Fast-Path Loop Unrolling of We developed a set of simulation-based unrolling strategies that ∗This research project is partially funded by Oracle Labs. analyze a loop for compiler optimizations enabled by loop unrolling. In summary, this paper contributes the following: ManLang’18, September 12–14, 2018, Linz, Austria • We present an optimization, called fast-path loop creation, that © 2018 Association for Computing Machinery. This is the author’s version of the work. It is posted here for your personal use. Not can be used to unroll general loops. Fast-path loop creation for redistribution. The definitive Version of Record was published in 15th International is based on splitting loops with multiple back edges into hot- Conference on Managed Languages & Runtimes (ManLang’18), September 12–14, 2018, Linz, Austria, https://doi.org/10.1145/3237009.3237013. path and slow-path loops, enabling code motion to optimize may-aliasing memory accesses. 1 ManLang’18, September 12–14, 2018, Linz, Austria David Leopoldseder et al. # Counted # Non-Counted % Non-Counted Benchmark Loops Loops Loops Loop Unrolling [10–12, 14, 25, 41] non-counted loops [10] is only lusearch 219 449 67% possible in a general way if a loop’s exit conditions are unrolled jython 1779 3333 64% together with the body of a loop. Previous work proposed avoiding h2 538 4473 89% to unroll non-counted loops together with their exit conditions like pmd 1456 3379 69% avrora 169 593 77% the one proposed by Huang and Leng [25] who used the weakest luindex 364 660 64% pre-condition calculus [13] to insert a loop body’s weakest pre- fop 653 1289 66% condition into a given loop’s initial condition to maintain a loops xalan 510 1022 66% batik 723 1857 71% semantic after unrolling it. However, side-effects incur problems sunflow 249 389 59% for optimizing compilers when trying to unroll non-counted loops. Arithmetic Mean 69:8% If the body of a loop contains side-effecting instructions, deriving Table 1: Number of counted and non-counted loops in the the weakest pre-condition requires modeling the state of the virtual Java DaCapo Benchmark suite. machine (VM), which we consider impractical for just-in-time compilation. As the compiler cannot generally infer the exact number of loop iterations for non-counted loops, it cannot speculatively unroll them and generate code containing side-effects that are unknown. • We present an algorithm to partially unroll the hot path of This is a general problem, as side-effects and multi-threaded ap- a non-counted loop based upon the proposed fast-path loop plications prevent compilers in managed execution systems like creation. the Java virtual machine (JVM) from statically reasoning about • We present a set of simulation-based unrolling strategies to the state of the memory, thus preventing the implementation of selectively apply unrolling of non-counted loops to improve generic loop unrolling approaches. Consider the code in Listing2. peak performance. The loop’s condition is the value in memory at index x. However, • We implemented our approach in an SSA-based Java bytecode- the body of the loop contains a memory write at index y. If alias to-native code just-in-time (JIT) compiler and evaluated the analysis [15, 31] fails to prove that x and y do not alias, there is no implementation in terms of performance, code size, and compile way, without duplicating the exit conditions as well, to unroll the time using a set of Java, Scala, and JavaScript benchmarks. loop without violating the read after write dependency of the read in iteration n on the write in iteration n − 1. Figure2 shows the 2 UNROLLING NON-COUNTED LOOPS memory dependencies of the loop from Listing2 in a graph illus- In this section, we explain why unrolling non-counted loops is more trating memory reads and write over two loop iterations. Unrolling challenging than unrolling counted loops. We present an initial idea the loop requires the compiler to respect the memory constraints of to tackle this issue based on simulation, on which we will expand the loop. That means that the condition cannot be re-ordered with in the subsequent sections. the loop’s body without violating the memory constraints of the Non-counted loops are loops for which neither at compile time loop. Therefore, a compiler cannot derive a weakest pre-condition nor at run time the number of loop iterations can be determined. with respect to the body of the loop as the post-condition of the This can be due to one out of several reasons; the most common loop body is unknown and the effect of the body of the loop onthe ones are shown in Listing1: the loop exit conditions are not nu- state of the VM is unknown. merical (see Figure 1a), the loop body consists of side-effecting We propose to unroll non-counted loops with code duplica- instructions1 aliasing with the loop condition (see Figure 1b), or tion [29]. Duplicating the loop body together with its loop condition loop induction variable analysis [48] cannot guarantee that a loop enables unrolling of general loops. We can represent every loop as contains induction variables leading to termination of the loop (see a while(true) loop with the exit condition moved into the loop Figure 1c). We informally observed that a significant amount of Java code 1 while ( true ){ contains non-counted loops. To further test this hypothesis, we 2 if (...){ 3 body instrumented the Graal [36] just-in-time (JIT) compiler to count 4 } else { 2 the number of counted and non-counted loops in the Java Da- 5 break Capo benchmark suite. The results in Table1 demonstrate that 6 } 7 } non-counted loops are more frequent than counted loops, which Listing 3: General Loop Construct.

Fast-Path Loop Unrolling of Non-Counted Loops to Enable Subsequent Compiler Optimizations∗

Loop Pipelining with Resource and Timing Constraints

CSE 582 – Compilers

Research of Register Pressure Aware Loop Unrolling Optimizations for Compiler

ICS803 Elective – III Multicore Architecture Teacher Name: Ms

CS153: Compilers Lecture 19: Optimization

Compiler Optimization for Configurable Accelerators Betul Buyukkurt Zhi Guo Walid A

A Hierarchical Region-Based Static Single Assignment Form

The Effect of C Ode Expanding Optimizations on Instruction Cache Design

A Tiling Perspective for Register Optimization Fabrice Rastello, Sadayappan Ponnuswany, Duco Van Amstel

Graduate Computer Architecture Chapter

The Effect of Code Expanding Optimizations on Instruction Cache Design

Functional Array Programming in Sac