
Loop Optimizations in Modern C Compilers

Chae Jubb
[email protected]
10 December 2014

Abstract

Many programs spend a significant portion of execution time in loops. Because of this, loop optimizations are increasingly important. We see two major types of optimizations affecting loop performance: general and loop-specific. General optimizations help loop performance simply because the loop is repeated many times. Loop-specific optimizations improve performance because they alter the structure of the loop or make improvements that require many iterations to pay off their large overhead. We discuss these strategies and then, using directed test cases, analyze how gcc and clang apply them at different optimization levels. We find clang to be much more aggressive in its optimizations at lower levels. At the highest optimization levels, the two compilers produce executables that perform similarly. Importantly, at the most common optimization level (-O2), clang equals or exceeds gcc performance.

1 Introduction

Loop performance is a large contributor to the overall performance of many applications. Programs, especially mathematical and scientific ones, spend a large majority of their execution in loops, which makes loop performance especially critical: small inefficiencies are magnified over hundreds or thousands of iterations. Because it is in the compiler's best interest to emit high-performing code, many special techniques are used specifically to increase loop performance. We examine a subset of these techniques and compare both the emitted code and its runtime performance across multiple optimization levels for multiple C compilers.

2 Method

To examine loop optimization techniques, we first prepare some sample programs. These programs are then compiled using clang¹ and gcc². For each compiler, various optimization levels are examined: -O0, -O1, -O2, -O3. The binaries are generated for Intel x86 targets: an i7 quad-core, a Xeon 16-core, and a Pentium 4.

We then analyze each output binary in two ways. First, we examine the emitted x86 assembly code. This allows us to directly verify the techniques used by the compiler in transforming the source code to the target x86. Second, the binary is performance-tested. Because the compiler strives to create binaries that perform well—not binaries that merely look as if they perform well—this portion is critical.

The disparities in runtimes of each test are drastic; thus, performance characteristics are used to evaluate the effectiveness of a given optimization level on a certain program compiled with a certain compiler. The times ought not be compared across tests.

3 Loop Optimizations

Before beginning discussion of results, we first introduce common loop optimizations. The techniques described include both machine-independent and machine-dependent optimizations. As the names suggest, the former category covers general loop optimizations concerned with the algorithm used, whereas the latter is concerned with the implementation and takes into account specifics of the target device.

¹ clang version 3.5
² gcc version 4.8.2

void bad_invariant() {
    int i, a = 4;
    for (i = 0; i < 5; ++i) {
        int a2 = a * a;
        /* loop body; `a` not touched */
    }
}

void good_invariant() {
    int i, a = 4;
    int a2 = a * a;
    for (i = 0; i < 5; ++i) {
        /* loop body; `a` not touched */
    }
}

Figure 1: Re-calculating a loop invariant each iteration.

Figure 2: Calculating a loop invariant outside the loop.

Before beginning our examination of loop optimizations, we note that many optimizations that are not loop-specific will nevertheless help loop performance. Optimizations such as moving variables from the stack into registers help simply because the gains are realized on every iteration.

void bad_induction() {
    int i, j, *array;
    for (i = 0; i < 32; ++i) {
        j = 4 * i;
        /* loop body, uses `i` */
        array[j] = rand();
    }
}

Figure 3: Inefficient redundant induction variables.

3.1 Machine-Independent

We first consider optimizations made independent of the x86 architecture. These include strategies such as identifying loop invariants, inverting the loop, and removing induction variables.

3.1.1 Loop Invariants

One simple technique used to improve the performance of loops is moving invariant calculations outside the loop. We see a sample program in Figure 1 that unnecessarily re-calculates a loop invariant on each iteration. This wastes cycles, hurting performance. Luckily, many compilers will recognize that program as equivalent to the one in Figure 2 and, as such, produce the optimized code. (The compiler will, of course, take into account the difference in scope of the loop-invariant variable.)

3.1.2 Induction Variables

Nearly all for loops and some while loops have variables that function as a sort of loop counter. We label variables such as these "induction variables". More formally, any variable whose value is altered by a fixed amount each loop iteration is an induction variable.

We can generally apply two types of optimizations to induction variables: reduction of strength and elimination. Generally, reduction of strength involves replacing an expensive operation (like multiplication) with a less expensive one (such as addition). Sometimes, however, a compiler may realize an induction variable is redundant and eliminate it completely.

Figure 3 shows an example of an inefficient use of two redundant induction variables. We improve this slightly in Figure 4 when we invoke a reduction of strength. Finally, Figure 5 shows the redundant variable completely optimized away.

3.1.3 Loop Unrolling

We next turn our attention to a simple trick sometimes employed: loop unrolling. This optimization is extremely straightforward and can only be applied to loops with a known length. Rather than emitting a loop with n iterations, the compiler produces target code that simply repeats the loop body n times. This optimization may increase performance on some processors because it eliminates jump instructions. In fact, minimizing jump instructions is often the goal of many optimizations (even those not loop-specific) because that type of instruction presents the possibility of a costly branch misprediction. We do not always unroll loops, though, because of a major drawback: increased binary size.
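As a concrete illustration of unrolling (our sketch; these functions are not part of the paper's test suite), consider a four-iteration loop and its fully unrolled equivalent:

void rolled(int *a) {
    int i;
    for (i = 0; i < 4; ++i)   /* 4 compares and conditional jumps */
        a[i] = i;
}

void unrolled(int *a) {
    a[0] = 0;                 /* straight-line code: no compares,  */
    a[1] = 1;                 /* no jumps, but a larger binary     */
    a[2] = 2;
    a[3] = 3;
}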

void ros_induction() {
    int i, j, *array;
    for (i = 0, j = -4; i < 32; ++i) {
        j += 4;
        /* loop body, uses `i` */
        array[j] = rand();
    }
}

void pre_inversion() {
    while (/* condition */) {
        /* loop body */
    }
}

Figure 6: Example before inversion.

Figure 4: Reduction of strength with redundant induction variables.

void post_inversion() {
    if (/* condition */) {
        do {
            /* loop body */
        } while (/* condition */);
    }
}

Figure 7: Example while loop after inversion.

void elim_induction() {
    int i, *array;
    for (i = 0; i < 32; ++i) {
        /* loop body, uses `i` */
        array[i * 4] = rand();
    }
}

Figure 5: Elimination of redundant induction variables.

3.1.4 Loop Inversion

We now turn our attention to another optimization designed to reduce the number of branch instructions. Loop inversion is a fairly simple transformation: a while loop is converted to a do-while loop wrapped in an if statement, as shown in Figure 6 and Figure 7. To fully analyze the effectiveness of this optimization, we consider three cases: the first iteration, any middle iteration, and the final iteration.

First Iteration. We consider the two cases of the first iteration: either the loop is entered or it is not. When the condition is not true before entering the loop, each version executes a single jump instruction; we see no gain but, more importantly, no loss in performance. When the condition is true, both execution flows behave as they would for any middle, non-final iteration.

Middle Iteration. With a non-final iteration, we obviously see the same behavior between the two versions. This is due to the well-known behavior of while and do-while loops.

Final Iteration. By having the comparison after the loop body rather than at the beginning, we can save cycles. The unoptimized version runs the last iteration and jumps to the beginning to check the condition; seeing the condition is no longer satisfied, it jumps back out of the loop. The optimized version runs the last iteration and then checks the loop condition; it is not satisfied, so we do not jump to the beginning and instead fall through.

This optimization results in a savings of two jumps. While this savings may seem trivial, consider nested loops: if this optimization were applied to an inner loop, the savings could be quite noticeable.

3.2 x86 Optimizations

We now turn our attention to the more specialized optimizations that directly target the x86 ISA's specific features. These optimizations exploit things such as memory addressing, flags, and register number and width, to name a few.

3.2.1 Use of Flags

Often a for loop is used to repeat an exact execution sequence a pre-determined number of times, and the index variable is not used in any way other than managing the number of iterations. In a scenario such as this, we can take advantage of the x86 Zero Flag (ZF), a special bit set according to rules that essentially amount to checking whether the most recently computed value was equal to zero.

By taking advantage of this flag, we can eliminate an instruction in this type of loop: we replace an explicit cmp plus a flag-checking jump with the implicit comparison performed by the decrement itself, checked by the jne instruction. We see an unoptimized counting loop

    mov ecx, 0x0
    add ecx, 0x1
    cmp ecx, 0x20
    jb  <loop start>

transformed into an optimized

    mov ecx, 0x20
    sub ecx, 0x1
    jne <loop start>

3.2.2 SSE2 / XMM

Since the Pentium III processor, Intel has included support for Streaming SIMD Extensions. In conjunction with this, eight 128-bit registers, xmm0 through xmm7, were added. Originally permitted to hold only four single-precision floating-point numbers, SSE2 expanded this to support two 64-bit integers, four 32-bit integers, eight 16-bit integers, or sixteen 8-bit integers. Using these xmm* registers, we can now manipulate four integers at a time! This means we have the potential to cut the number of writes four-fold. We consider this among the loop-specific optimizations because of the overhead to set up these registers: they would likely not be used for a one-time write to memory.

4 Test Cases

Now that we have discussed the loop optimizations we wish to target, we examine the directed test cases used to evaluate gcc and clang performance.

The first, simplest test case explicitly checks for loop inversion; its source can be found in Figure 8. We also test explicitly for the handling of induction variables and loop constants using the source found in Figure 9.

The final base test case is a program that counts the number of zeros in the binary representation of a 32-bit integer. To more fully explore how these compilers optimize, we analyze the code emitted for 5 different input programs. Source for three of these programs can be found in Figure 10 (branching version), Figure 11 (non-branching version), and Figure 12 (nested for loop). The final two programs are similar except that they use an infinite for and while loop, respectively, with an explicit break statement.

#include <stdint.h>
#include <stdlib.h>

int loop_inv(uint8_t len) {
    int a[256];
    int i = 0;

    while (len < 255) {
        a[len] = 0;
        ++len;
        ++i;
    }

    return i;
}

Figure 8: Directed test for loop inversion.
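To make the four-fold write reduction of Section 3.2.2 concrete, here is a minimal sketch using SSE2 compiler intrinsics (our illustration; the function and its parameters are ours, not part of the test suite):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Write the same four 32-bit values into each 4-element block of
 * `out`: one 128-bit store per iteration instead of four 32-bit
 * stores. Filling the register is the one-time set-up overhead. */
void fill_blocks(int32_t *out, int n,
                 int32_t v0, int32_t v1, int32_t v2, int32_t v3) {
    __m128i block = _mm_set_epi32(v3, v2, v1, v0);
    int i;
    for (i = 0; i < n; ++i)
        _mm_storeu_si128((__m128i *)(out + 4 * i), block);
}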

#include <stdint.h>

int count_zeros(uint32_t a) {
    unsigned int counter = 0;
    unsigned int i, j = 0;
    for (; j < 4; ++j) {
        for (i = 0; i < 8; ++i, a >>= 1) {
            counter += !(a & 0x00000001);
        }
    }

    return counter;
}

Figure 12: Nested loop version of count_zeros.

#include <stdint.h>
#include <stdlib.h>

int *ind_var(uint16_t in) {
    int *a = malloc(sizeof *a * in * 4);
    int i, marker;
    for (i = 0; i < in; ++i) {
        marker = 4 * i;
        a[marker + 0] = in % 2;
        a[marker + 1] = in % 3;
        a[marker + 2] = in % 5;
        a[marker + 3] = in % 7;
    }

    return a;
}

Figure 9: Directed test for induction variables and loop constants.

5 Analysis

The above sample programs were compiled with each of clang and gcc at each of the previously mentioned optimization levels. We can comment on the emitted x86 as well as the performance. Notable performance measures are discussed below, with the full performance data available in Appendix A.

5.0.1 General Loop Format

Before continuing, we introduce the general layout that each compiler uses when it generates loops. The layout itself has little performance effect; it mostly reflects a preference for where the jump statements ought to occur. We see these general layouts for gcc and clang in Figure 13 and Figure 14, respectively.

#include <stdint.h>

int count_zeros(uint32_t a) {
    unsigned int counter = 0;
    unsigned int i = 0;
    for (; i < 32; ++i, a >>= 1) {
        if ((a & 0x00000001) == 0)
            counter++;
    }

    return counter;
}

Figure 10: Branching version of count_zeros.

Unless otherwise noted, all loops have the general structure corresponding to the compiler that produced them. Most of the optimizations simply dictate the appearance of the "body" section of the loop and the placement of the loop itself.

5.1 Loop Inversion

We begin with one of our simpler examples, the loop inversion test of Figure 8. The primary optimizations we expect are loop inversion, elimination of unnecessary variables, and, finally, elimination of the loop itself.

#include <stdint.h>

int count_zeros(uint32_t a) {
    unsigned int counter = 0;
    unsigned int i = 0;
    for (; i < 32; ++i, a >>= 1) {
        counter += !(a & 0x00000001);
    }

    return counter;
}

Figure 11: Non-branching version of count_zeros.
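For concreteness, a sketch of the infinite for variant described in Section 4 (our reconstruction; we assume it mirrors the non-branching version with an explicit break):

#include <stdint.h>

int count_zeros(uint32_t a) {
    unsigned int counter = 0;
    unsigned int i = 0;
    for (;;) {
        if (i >= 32)
            break;              /* explicit exit from the "infinite" loop */
        counter += !(a & 0x00000001);
        ++i;
        a >>= 1;
    }

    return counter;
}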

    xor   eax, eax
    cmp   edi, 0xff
    je    4005d4
    mov   al, 0xfe
    sub   al, dil
    movzx eax, al
    inc   eax
    ret

Figure 15: Loop Inversion: clang at -O1. Compiled from Figure 8

5.1.1 No Optimization

Without optimization, the code emitted by both compilers is as expected. We see a clear transformation of each statement into corresponding x86. Each compiler uses its preferred loop structure as described above.

Figure 13: gcc loop layout.

5.1.2 Some Optimization -O1

Once we apply any optimization, we immediately see loop inversion from gcc. The first two instructions check whether we should do anything at all or simply return 0:

    cmp dil, 0xff
    je  4005f8

We also see the array and the i variable optimized away as unnecessary. The return value is calculated indirectly from len.

clang is much more aggressive in its initial optimizations than gcc. All local variables are eliminated, as is the loop itself. All that remains is a few instructions that compute i directly from the len parameter: since the loop body runs once for each value from len up to 254, the result is simply 255 - len (or 0 when len is already 255). We see how compact the entire optimized output is in Figure 15.

At this point, clang is finished optimizing; higher levels have no effect on the function.

5.1.3 More Optimization -O2

Increasing the optimization level for gcc eliminates the loop and produces output with only trivial differences from clang at -O1.

Figure 14: clang loop layout.
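As a rough textual sketch of the two layouts in Figures 13 and 14 (our schematic; labels and the loop bound are illustrative): gcc enters through a jump to a condition test placed at the bottom of the loop, while clang tests the condition at the top and jumps back unconditionally.

    ; gcc-style layout: test at the bottom
            jmp   check
    body:   ; ... loop body ...
            inc   ecx
    check:  cmp   ecx, 0x20
            jl    body

    ; clang-style layout: test at the top
    top:    cmp   ecx, 0x20
            jge   done
            ; ... loop body ...
            inc   ecx
            jmp   top
    done: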

5.1.4 Summary

At high optimization levels, the compilers emit code with nearly identical performance across machines. Optimizing produces up to a 30x boost in performance. Because it optimizes more aggressively, clang hits that peak sooner. However, as -O1 is not a common optimization level, this early boost may not have a large real-world effect.

5.2 Induction Variables and Loop Constants

Our attention now turns to the program designed to exercise the handling of induction variables and loop constants, described in Figure 9.

    movzx  eax, WORD PTR [rbp-0x2]
    cdq
    mov    ecx, 0x7
    idiv   ecx                        ; result: eax = eax' * ecx + edx
    mov    eax, DWORD PTR [rbp-0x18]  ; marker
    add    eax, 0x3
    movsxd r8, eax
    mov    r9, QWORD PTR [rbp-0x10]   ; a
    mov    DWORD PTR [r9+r8*4], edx

Figure 16: Implementation of the modulus operator: clang at -O0, computing modulus by 7. Simplified from the original (structure preserved).

5.2.1 No Optimization

Compiling with -O0 provides insight into what is guaranteed by compilers when using a flag indicating that "no optimization" ought to be performed. Statements in the C code must have a correspondence to an instruction or sequence of instructions in the emitted code. That is, statements cannot be intermixed, and every statement must have some corresponding instructions that perform it.

While this may seem straightforward, we present the portion of the output assembly each compiler produces for the modulus operator, considering the line a[marker + 3] = in % 7.

The clang assembly (Figure 16) is more transparent and expected, though it does use the infamously slow idiv instruction, retrieving the modulus computed by that instruction from edx. gcc, by contrast, uses a faster³ but more obscure algorithm (Figure 17). Because there is still a clear correspondence between the C statement and a block of x86, this is acceptable at -O0.

    mov    eax, DWORD PTR [rbp-0xc]   ; marker
    cdqe
    add    rax, 0x3
    lea    rdx, [rax*4+0x0]
    mov    rax, QWORD PTR [rbp-0x8]   ; a
    lea    rsi, [rdx+rax*1]           ; rsi = addr of a[marker + 3]
    movzx  ecx, WORD PTR [rbp-0x14]   ; in
    movzx  eax, cx
    imul   eax, eax, 0x2493           ; begin magic
    shr    eax, 0x10
    mov    edx, ecx
    sub    edx, eax
    shr    dx, 1
    add    eax, edx
    shr    ax, 0x2
    mov    edx, eax
    mov    eax, edx
    shl    eax, 0x3
    sub    eax, edx
    sub    ecx, eax
    mov    edx, ecx
    movzx  eax, dx                    ; end magic
    mov    DWORD PTR [rsi], eax       ; result of % 7

Figure 17: Implementation of the modulus operator: gcc at -O0, computing modulus by 7.

5.2.2 Some Optimization -O1

Adding a touch of optimization, we see a few changes in the code emitted by gcc. Most noticeably, we see loop inversion.

³ gcc runs this test nearly twice as fast as clang without optimization.
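The "magic" in Figure 17 is multiplication by an approximate reciprocal: 0x2493 = 9363 ≈ 2^16/7, so the imul and shift produce a first approximation of in/7, which the next few instructions refine before the final subtraction recovers the remainder. A C rendering of the same arithmetic (our reconstruction; the function name is ours):

#include <stdint.h>

uint16_t mod7(uint16_t n) {
    uint32_t t = ((uint32_t)n * 0x2493) >> 16;  /* ~ n/7, may be slightly low */
    uint32_t q = (t + ((n - t) >> 1)) >> 2;     /* corrected: q == n/7 exactly */
    return n - ((q << 3) - q);                  /* n - 7q == n % 7 */
}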

    vpinsrw xmm0, xmm0, r8d, 0x0   ; in % 2
    vpinsrw xmm0, xmm0, edi, 0x2   ; in % 3
    vpinsrw xmm0, xmm0, ecx, 0x4   ; in % 5
    vpinsrw xmm0, xmm0, esi, 0x6   ; in % 7
    xor     ecx, ecx
    mov     rdx, rax
    ; excluding NOPs for alignment of loop
    vmovdqu XMMWORD PTR [rdx], xmm0
    add     rdx, 0x10
    inc     ecx
    cmp     ecx, ebx
    jl      400740                 ; vmovdqu command

Figure 18: Use of the xmm registers: clang at -O1. Simplified from the original (structure preserved).

Additionally, the redundant induction variable marker is removed, replaced with a direct multiplication by 4 in the memory addressing. The implementation of this optimization is machine-dependent and is possible here because multiplication by a constant in this way is a permitted x86 addressing mode (an operand of the form [rax+rcx*4]).⁴ However, the loop constants are still computed on every iteration, greatly hindering performance gains.

With our other compiler, clang, we see another aggressive optimization. Along with loop inversion and the elimination of an induction variable, we see use of the architecture-specific xmm* registers. As mentioned previously, these registers permit us to write a block of 4 integers in one write. Because of the novelty and efficiency of this approach, we include a portion of the emitted x86 in Figure 18. Using these specialized registers shows a huge performance bonus: about 2x over the usual registers. All optimizations considered, -O1 gives about a 25x speedup over -O0.

5.2.3 Use of xmm Registers by clang

We saw above that the xmm registers were used easily because we were writing 4 entries at a time. We naturally wonder about the effects of writing some non-multiple of 4 in each loop iteration.

Fewer than Eight Entries. When we write five entries in each loop iteration, we see clang using the xmm registers along with one extended register, ecx. We do not revert to solely using extended-width registers just because we do not have a multiple of 4.

Eight Entries. Expectations are not met when storing eight entries simultaneously: only one xmm register is used, in conjunction with four extended-width registers. In an attempt to determine whether this was a conscious decision due to some performance hit caused by using multiple xmm registers interchangeably, we hand-tuned the assembly to use both xmm0 and xmm1 in conjunction with two vmovdqu instructions.⁵ By using two xmm registers rather than one, we notice a nearly 30 percent performance improvement. We are left only to wonder why clang does not use multiple specialized registers.

5.2.4 More Optimization -O2 and -O3

clang makes no further optimizations after the -O1 level. gcc, on the other hand, continues making optimizations. -O2 sees loop invariants calculated outside the loop; that is, all moduli are computed before entering the loop and repeatedly assigned during iteration.

Finally, at -O3, we see gcc also using the xmm registers. It is perhaps worth noting that with 8 writes per loop, gcc does in fact use multiple xmm registers to achieve the same performance as the above hand-tuned x86 (though in a more complex way).

5.2.5 Architecture Differences

If an architecture does not support xmm registers holding integers, they obviously cannot be used. The tests run on the Pentium 4 processor do not compile to xmm registers. For machines such as these, we do not see comparable speedups: only a 5x speedup, against the 25x speedup above (clang, -O0 to -O1).

⁴ Other ISAs such as MIPS do not permit this. On a MIPS architecture, we would likely see reduction of strength: the multiplication converted to a constant addition during each iteration.
⁵ The relevant code snippets are available in Appendix B.

This difference underscores the importance of the target architecture to performance. Without advanced hardware, compilers cannot use advanced optimization techniques.

5.2.6 Summary

Again we see clang making more aggressive optimizations than gcc. Interestingly, clang performs much more poorly with no optimization, running nearly twice as slow as its gcc counterpart because of its choice of modulus implementation.

At maximum optimization, we see nearly identically performing code, though at the commonly used -O2 we see clang as a clear winner, performing approximately twice as fast as the corresponding gcc binary.

5.3 Constant-Length Loops

Our final analysis considers the programs designed to count the number of zeros in the binary representation of a 32-bit integer, as seen in Figure 10, Figure 11, and Figure 12. In addition to these samples, versions with an infinite for and while loop in conjunction with an explicit break statement were examined. By analyzing these different versions of the program, we can gather some sense of how much "error-correcting" the compiler can do on non-optimal code.

5.3.1 No Optimization

All versions of the program emit assembly as expected according to each compiler's preferred loop structure. The performance of the samples was generally homogeneous—with the strong exception of the branching sample, which took 2–3 times as long as the others. This serves as a clear indicator that branching in tight loops ought to be avoided as much as possible.

5.3.2 Some Optimization -O1

We first note that both compilers immediately recognize the infinite for and infinite while loops as equivalent to the non-branching version with a finite for loop.

The most obvious optimization, applied to all versions by both compilers, is the conversion of the incrementing for loop into a decrementing one. The compiler identifies the static number of loop iterations and takes advantage of the Zero Flag to save an instruction, as described previously. We note a slight difference in the handling of the nested for loop between compilers: gcc converts both loops to the decrementing style, while clang optimizes only the outer loop in this way; the inner loop remains an incrementing loop.

While not a loop-specific optimization, we also see that use of the stack is completely eliminated. All temporary values are stored in registers.

Branching Case. The optimization of the branching case (again, Figure 10) is much more compiler-dependent. We see gcc use an add-with-carry instruction as follows:

    and ecx, 0x1
    cmp ecx, 0x1
    adc eax, 0x0

While this version is certainly an improvement over the -O0 version, its performance relative to the optimized version of the non-branching sample varies greatly across architectures.

We see clang perform comparatively better than gcc. clang recognizes the branching sample as functionally identical to the non-branching algorithm and emits the exact same target code for the two samples! In fact, all but the nested for sample have exactly the same target code when compiled with clang.

5.3.3 More Optimization with gcc: -O2 and -O3

The only further loop-based optimization we see on advancing to -O2 is a slight loop reorganization that has little performance effect. In fact, performance is actually reduced on some architectures.⁶ Because we see the difference across all gcc samples, it is likely due to some change in the consistent loop body.

Further optimizing at the -O3 level produces two changes.
⁶ We recall that compilers will optimize for the "average" machine.

First, the nested for example now has a completely unrolled inner loop. This eliminates a few jump instructions, which is good for pipelined performance. Second, the branching version now uses a conditional move:

    lea   ecx, [rax+0x1]
    test  dil, 0x1
    cmove eax, ecx

This has little effect on two of the three architectures tested; on the third (the Pentium 4), the program slows by approximately 45 percent.

5.3.4 More Optimization with clang: -O2

The only further optimization made at the -O2 level is loop unrolling. clang completely unrolls the nested for loop program. We now have 32 repetitions of a shr, not, and, add cycle.

5.3.5 Summary

For this third example, we see clang making the clever realization that the branching and non-branching samples are functionally equivalent. This allows it to remove conditional statements, which lets the pipeline flow more naturally. A more naturally flowing pipeline—that is, one with fewer stalls from mis-predictions—gives better performance.

Loop unrolling takes advantage of this fact: by eliminating conditional jump statements, we cannot have mis-predictions! It is not surprising, then, that the samples which were ultimately unrolled have the best performance across architectures. It is surprising, however, that this comes from the (arguably) worst-written sample. Both the branching and non-branching samples are reasonable implementations of the algorithm, and, ostensibly, an unnecessary nested for loop is exactly that: unnecessary. Yet we see 30–40 percent performance boosts in comparison to the non-unrolled loops.

We should not, however, take this result as an endorsement of unnecessary and cluttering control constructs. They serve only to obfuscate the purpose of the code, and once compilers recognize the opportunity to optimize such structures, there will be no advantage. If performance is that high a consideration, inline assembly is likely the better solution.

6 Conclusions

In line with Amdahl's Law, improving the performance of loops will likely improve the performance of the entire program, as most of the execution time of many programs is spent inside loops. Modern C compilers such as gcc and clang provide many loop-specific optimizations that improve performance metrics.

With these machine-independent and machine-dependent optimizations available, the compiler is able to take quite inefficient programs and increase their performance by orders of magnitude.

Both gcc and clang use standard loop optimizations such as loop inversion, reduction of induction variables, extraction of loop constants, and loop unrolling. They also seem to have a nearly identical set of x86 optimizations. As these two compilers dominate the market and compete with each other, it is only natural that they mirror each other's progress.

However, we see obvious potential optimizations that neither compiler is able to make. Nested for loops are sometimes converted to a single loop via loop unrolling, yet neither compiler was able to recognize that the nested loop was functionally unnecessary and could be reduced to an equivalent, still-rolled single loop. Unused variables are optimized away; why not superfluous control structures?

If a generalization about the loop performance of these compilers could be made, it would be this: clang seems to produce better target code at lower optimization levels, but the two compilers produce very similarly performing code at the highest optimization levels. Considering pure, unoptimized code, in cases with a clear performance differential, gcc seems to outperform clang.

We note, though, that many shun the highest levels of optimization because they often create large binaries, and many avoid the lowest levels because they do not optimize enough. Perhaps the most common optimization level is -O2, where compilers often strike a pleasant balance between speed and size. With this level of optimization on the modern systems tested, running these loop-dominant programs, clang executable performance is equal to, if not better than, gcc binary performance.

A Performance Evaluation

Below we see the median execution times over 20 runs. The magnitudes of the times are largely incomparable between different tests but are held equivalent across all runs of the same test over different optimization levels, compilers, and machines. The number of iterations was chosen to provide a reasonable runtime. We present a table for each compiler on each machine. Performance for the i7 quad-core can be found in Table 1 and Table 2 for gcc and clang, respectively. Performance for the Xeon 16-core can be found in Table 3 and Table 4 for gcc and clang, respectively. Performance for the Pentium 4 can be found in Table 5 and Table 6 for gcc and clang, respectively.

gcc                 O0         O1         O2         O3
Basic               1.535250   0.740309   0.750498   0.744492
Branch              4.331550   0.745291   0.738352   0.745730
Infinite for        1.756510   0.743746   0.752588   0.741769
Infinite while      1.743800   0.740280   0.751422   0.739939
Nested for          1.595010   0.623453   0.440711   0.437040
Induction variable  5.004260   0.203298   0.203790   0.192637
Loop inversion      6.033830   0.182433   0.187445   0.187577

Table 1: gcc, Intel i7 Quad Core, times in seconds

clang               O0         O1         O2         O3
Basic               1.620      0.640      0.640      0.640
Branch              3.640      0.640      0.640      0.640
Infinite for        1.670      0.640      0.640      0.640
Infinite while      1.670      0.640      0.640      0.640
Nested for          1.670      0.720      0.460      0.460
Induction variable  2.320      0.410      0.410      0.410
Loop inversion      4.050      0.180      0.180      0.180

Table 2: clang, Intel i7 Quad Core, times in seconds

gcc                 O0         O1         O2         O3
Basic               1.565      0.890      0.740      0.740
Branch              3.460      0.720      0.750      0.740
Infinite for        1.580      0.890      0.740      0.740
Infinite while      1.580      0.890      0.740      0.740
Nested for          1.630      0.820      0.800      0.460
Induction variable  3.380      0.410      0.410      0.220
Loop inversion      4.590      1.760      0.170      0.170

Table 3: gcc, Intel Xeon 16 core, times in seconds

clang               O0         O1         O2         O3
Basic               1.620      0.640      0.640      0.640
Branch              3.640      0.640      0.640      0.640
Infinite for        1.670      0.640      0.640      0.640
Infinite while      1.670      0.640      0.640      0.640
Nested for          1.670      0.720      0.460      0.460
Induction variable  2.320      0.410      0.410      0.410
Loop inversion      4.050      0.180      0.180      0.180

Table 4: clang, Intel Xeon 16 core, times in seconds

gcc                 O0          O1         O2         O3
Basic               4.015420    1.484550   1.490820   1.491320
Branch              7.169850    1.814280   1.831730   2.643850
Infinite for        4.386650    1.481490   1.489430   1.491940
Infinite while      4.376770    1.479890   1.487900   1.490690
Nested for          4.804280    1.605150   1.717000   1.188230
Induction variable  9.997740    3.948260   4.112130   4.387550
Loop inversion      11.033000   2.070470   0.688073   0.688639

Table 5: gcc, Intel Pentium 4, times in seconds

clang               O0          O1         O2         O3
Basic               3.377270    1.654560   1.654590   1.654020
Branch              6.779970    1.651640   1.652980   1.654070
Infinite for        3.882610    1.652520   1.654050   1.652970
Infinite while      3.876050    1.651210   1.653400   1.652260
Nested for          3.626910    1.719660   0.993197   0.997602
Induction variable  16.628900   3.657940   3.833720   4.089550
Loop inversion      12.505200   0.679849   0.676784   0.680679

Table 6: clang, Intel Pentium 4, times in seconds

B SSE2 Usage

Below we show excerpts of the emitted x86 that use the xmm registers. Figure 20 shows what clang emits. Figure 22 shows the hand-tuned x86 based on clang's output. Figure 21 shows what gcc emits. We also present the modified source code in Figure 19 (modified from Figure 9).

#include <stdint.h>
#include <stdlib.h>

int *ind_var(uint16_t in) {
    int *a = malloc(sizeof *a * in * 8);
    int i, marker;
    for (i = 0; i < in; ++i) {
        marker = 8 * i;
        a[marker + 0] = in % 2;
        a[marker + 1] = in % 3;
        a[marker + 2] = in % 5;
        a[marker + 3] = in % 7;
        a[marker + 4] = in % 11;
        a[marker + 5] = in % 13;
        a[marker + 6] = in % 17;
        a[marker + 7] = in % 19;
    }

    return a;
}

Figure 19: Alteration of Figure 9 with 8 sequential writes per loop.

    ; prepare r8d, esi, ebx, edx
    vpinsrw xmm0, xmm0, r8d, 0x0   ; in % 2
    vpinsrw xmm0, xmm0, esi, 0x2   ; in % 3
    vpinsrw xmm0, xmm0, ebx, 0x4   ; in % 5
    vpinsrw xmm0, xmm0, edx, 0x6   ; in % 7
    ; prepare r8d, edx, esi, edi
    vmovdqu XMMWORD PTR [rbx-0x1c], xmm0
    mov     DWORD PTR [rbx-0xc], r8d
    mov     DWORD PTR [rbx-0x8], edx
    mov     DWORD PTR [rbx-0x4], esi
    mov     DWORD PTR [rbx], edi
    add     rbx, 0x20
    inc     ecx
    cmp     ecx, r14d
    jl      400740                 ; vmovdqu

Figure 20: Selections from clang version of induction variable C code from Figure 19.

    ; prepare edx
    vmovd   xmm1, edx              ; in % 2
    ; prepare esi
    vmovd   xmm0, esi
    ; prepare esi, edi
    vpinsrd xmm0, xmm0, esi, 0x1
    xor     esi, esi
    vpinsrd xmm1, xmm1, edi, 0x1
    vpunpcklqdq xmm2, xmm1, xmm0
    vinsertf128 ymm2, ymm2, xmm2, 0x1
    ; beginning of main loop
    mov     rdi, rsi
    add     rsi, 0x1
    shl     rdi, 0x5
    cmp     ecx, esi
    vmovdqu XMMWORD PTR [rax+rdi*1], xmm2
    vextractf128 XMMWORD PTR [rax+rdi*1+0x10], ymm2, 0x1
    ja      400776                 ; beginning of main loop
    cmp     edx, r8d
    mov     ecx, r8d
    je      4007d0                 ; exit_final
    vzeroupper
shift_unpack_leave:
    shl     ecx, 0x2
    vpunpcklqdq xmm0, xmm1, xmm0
    movsxd  rcx, ecx
    vmovdqu XMMWORD PTR [rax+rcx*4], xmm0
    mov     rbx, QWORD PTR [rbp-0x8]
    leave
    ret
    ; NOPs for alignment
    xor     ecx, ecx
    vpinsrd xmm0, xmm0, esi, 0x1
    vpinsrd xmm1, xmm1, edi, 0x1
    jmp     40079d                 ; shift_unpack_leave
    ; NOPs for alignment
exit_final:
    vzeroupper
    mov     rbx, QWORD PTR [rbp-0x8]
    leave
    ret

Figure 21: Selections from gcc version of induction variable C code from Figure 19.

    ; prepare r8d, esi, ebx, edx
    vpinsrw xmm0, xmm0, r8d, 0x0   ; in % 2
    vpinsrw xmm0, xmm0, esi, 0x2   ; in % 3
    vpinsrw xmm0, xmm0, ebx, 0x4   ; in % 5
    vpinsrw xmm0, xmm0, edx, 0x6   ; in % 7
    ; prepare r8d, edx, esi, edi
    vpinsrw xmm1, xmm1, r8d, 0x0   ; in % 11
    vpinsrw xmm1, xmm1, edx, 0x2   ; in % 13
    vpinsrw xmm1, xmm1, esi, 0x4   ; in % 17
    vpinsrw xmm1, xmm1, edi, 0x6   ; in % 19
    ; NOPs for alignment
    vmovdqu XMMWORD PTR [rbx-0x1c], xmm0
    vmovdqu XMMWORD PTR [rbx-0xc], xmm1
    add     rbx, 0x20
    inc     ecx
    cmp     ecx, r14d
    jl      4007b0                 ; first vmovdqu

Figure 22: Selections from the hand-optimized version of the induction variable C code from Figure 19.
