DOI:10.1145/3369754

Article development led by queue.acm.org

A practical journey.

Optimizations in C++

BY MATT GODBOLT

COMPILERS ARE A necessary technology to turn high-level, easier-to-write code into efficient machine code for computers to execute. Their sophistication at doing this is often overlooked. You may spend a lot of time carefully considering algorithms and fighting error messages but perhaps not enough time looking at what compilers are capable of doing.

This article introduces some compiler and code generation concepts, and then shines a torch over a few of the very impressive feats of transformation your compilers are doing for you, with some practical demonstrations of my favorite optimizations. I hope you will gain an appreciation for what kinds of optimizations you can expect your compiler to do for you, and how you might explore the subject further. Most of all, you may learn to love looking at the assembly output and may learn to respect the quality of the engineering in your compilers.

The examples shown here are in C or C++, which are the languages I have had the most experience with, but many of these optimizations are also available in other compiled languages. Indeed, the advent of front-end-agnostic toolkits such as LLVM3 means most of these optimizations work in the exact same way in languages such as Rust, Swift, and D.

I have always been fascinated by what compilers are capable of. I spent a decade making video games where every CPU cycle counted in the war to get more sprites, explosions, or complicated scenes on the screen than our competitors. Writing custom assembly, and reading the compiler output to see what it was capable of, was par for the course. Fast-forward five years and I was at a trading company, having switched out sprites and polygons for fast processing of financial data.

FEBRUARY 2020 | VOL. 63 | NO. 2 | COMMUNICATIONS OF THE ACM

Just as before, knowing what the compiler was doing with code helped inform the way we wrote the code. Obviously, nicely written, testable code is extremely important—especially if that code has the potential to make thousands of financial transactions per second. Being fastest is great, but not having bugs is even more important.

In 2012, we were debating which of the new C++11 features could be adopted as part of the canon of acceptable coding practices. When every nanosecond counts, you want to be able to give advice to programmers about how best to write their code without being antagonistic to performance. While experimenting with how code uses new features such as auto, lambdas, and range-based for, I wrote a shell script (a) to run the compiler continuously and show its filtered output. This proved so useful in answering all these "what if?" questions that I went home that evening and created Compiler Explorer.1

Over the years I have been constantly amazed by the lengths to which compilers go in order to take our code and turn it into a work of assembly code art. I encourage all compiled-language programmers to learn a little assembly in order to appreciate what their compilers are doing for them. Even if you cannot write it yourself, being able to read it is a useful skill.

All the assembly code shown here is for 64-bit x86 processors, as that is the CPU I'm most familiar with and is one of the most common server architectures. Some of the examples shown here are x86-specific, but in reality, many types of optimizations apply similarly on other architectures. Additionally, I cover only the GCC and Clang compilers, but equally clever optimizations show up on compilers from Microsoft and Intel.

Optimization 101

This is far from a deep dive into compiler optimizations, but some concepts used by compilers are useful to know. In these pages, you will note a running column of examples of scripts and instructions for the processes and operations discussed. All are linked by the corresponding (letter).

Many optimizations fall under the umbrella of strength reduction: taking expensive operations and transforming them to use less expensive ones. A very simple example of strength reduction would be taking a loop with a multiplication involving the loop counter, as shown in (b). Even on today's CPUs, multiplication is a little slower than simpler arithmetic, so the compiler will rewrite that loop to be something like (c). Here, strength reduction took a loop involving multiplication and turned it into a sequence of equivalent operations using only addition. There are many forms of strength reduction, more of which show up in the practical examples given later.

Another key optimization is inlining, in which the compiler replaces a call to a function with the body of that function. This removes the overhead of the call and often unlocks further optimizations, as the compiler can optimize the combined code as a single unit. You will see plenty of examples of this later.
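Listings (b) and (c) are not reproduced in this extraction. A minimal sketch of the kind of transformation described, with names of my own invention: a multiply by the loop counter on every iteration becomes a running index bumped by addition.

```cpp
#include <cassert>

// (b)-style source: each iteration multiplies the loop counter.
int sumEveryThird(const int *data, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += data[i * 3];  // multiply inside the loop
    return total;
}

// (c)-style strength-reduced equivalent the compiler might produce:
// the multiplication becomes an index incremented by 3 each time.
int sumEveryThirdReduced(const int *data, int n) {
    int total = 0;
    int index = 0;             // replaces i * 3
    for (int i = 0; i < n; i++) {
        total += data[index];
        index += 3;            // strength-reduced: add instead of multiply
    }
    return total;
}
```

Both versions compute the same result; the second just expresses the work the way an optimizing compiler would.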

Other optimization categories include:

• Constant folding. The compiler takes expressions whose values can be calculated at compile time and replaces them with the result of the calculation directly.

• Constant propagation. The compiler tracks the provenance of values and takes advantage of knowing that certain values are constant for all possible executions.

• Common subexpression elimination. Duplicated calculations are rewritten to calculate once and duplicate the result.

• Dead code removal. After many of the other optimizations, there may be areas of the code that have no effect on the output, and these can be removed. This includes loads and stores whose values are unused, as well as entire functions and expressions.

• Instruction selection. This is not an optimization as such, but as the compiler takes its internal representation of the program and generates CPU instructions, it usually has a large set of equivalent instruction sequences from which to choose. Making the right choice requires the compiler to know a lot about the architecture of the processor it's targeting.

• Loop invariant code movement. The compiler recognizes that some expressions within a loop are constant for the duration of that loop and moves them outside of the loop. On top of this, the compiler is able to move a loop-invariant conditional check out of a loop, and then duplicate the loop body twice: once if the condition is true, and once if it is false. This can lead to further optimizations.

• Peephole optimizations. The compiler takes short sequences of instructions and looks for local optimizations between those instructions.

• Tail call removal. A recursive function that ends in a call to itself can often be rewritten as a loop, reducing call overhead and reducing the chance of stack overflow.

The golden rule for helping the compiler optimize is to ensure it has as much information as possible to make the right optimization decisions. One source of information is your code: If the compiler can see more of your code, it's able to make better decisions. Another source of information is the compiler flags you use: telling your compiler the exact CPU architecture you are targeting can make a big difference. Of course, the more information a compiler has, the longer it could take to run, so there is a balance to be struck here.

Let's take a look at an example (d), counting the number of elements of a vector that pass some test (compiled with GCC, optimization level 3; https://godbolt.org/z/acm19_count1). If the compiler has no information about testFunc, it will generate an inner loop like (e).

To understand this code, it's useful to know that a std::vector<> contains some pointers: one to the beginning of the data; one to the end of the data; and one to the end of the storage currently allocated (f). The size of the vector is not directly stored; it's implied in the difference between the begin() and end() pointers. Note that the calls to vector<>::size() and vector<>::operator[] have been inlined completely.

In the assembly code (e), rbp points to the vector object, and the begin() and end() pointers are therefore QWORD PTR [rbp+0] and QWORD PTR [rbp+8], respectively.

Another neat trick the compiler has done is to remove any branching: you might reasonably expect if (testFunc(...)) would turn into a comparison and branch. Here the compiler does the comparison cmp al, 1, which sets the processor carry flag if testFunc() returned false, otherwise it clears it. The sbb r12d, -1 instruction then subtracts -1 with borrow, the subtract equivalent of carrying, which also uses the carry flag. This has the desired side effect: If the carry is clear (testFunc() returned true), it subtracts -1, which is the same as adding 1. If the carry is set, it subtracts -1 + 1, which has no effect on the value. Avoiding branches can be advantageous in some cases if the branch is not easily predictable by the processor.

It may seem surprising that the compiler reloads the begin() and end() pointers each loop iteration, and indeed it rederives size() each time too. However, the compiler is forced to do so: it has no idea what testFunc() does and must assume the worst. That is, it must assume calls to testFunc() may cause the vec to be modified. The const reference here doesn't allow any additional optimizations for a couple of reasons: testFunc() may have a non-const reference to vec (perhaps through a global variable), or testFunc() might cast away const.

If, however, the compiler can see the body of testFunc(), and from this know that it does not in fact modify vec (g), the story is very different (https://godbolt.org/z/acm19_count2).


In this case the compiler has realized that the vector's begin() and end() are constant during the operation of the loop. As such it has been able to realize that the call to size() is also a constant. Armed with this knowledge, it hoists these constants out of the loop, and then rewrites the index operation (vec[i]) to be a pointer walk, starting at begin() and walking up one int at a time to end(). This vastly simplifies the generated assembly.

In this example, I gave the compiler a body to testFunc() but marked it as non-inlineable (a GNU extension) to demonstrate this optimization in isolation. In a more realistic codebase, the compiler could inline testFunc() if it believed it beneficial.

Another way to enable this optimization without exposing the body of the function to the compiler is to mark it as [[gnu::pure]] (another language extension). This promises the compiler that the function is pure—entirely a function of its arguments with no side effects.

Interestingly, using range-for in the initial example yields optimal assembly, even without knowing that testFunc() does not modify vec (https://godbolt.org/z/acm19_count3). This is because range-for is defined as a source code transformation that puts begin() and end() into local variables: code written as shown in (h) is interpreted as (i).

All things considered, if you need to use a "raw" loop, the modern range-for style is preferred: it's optimal even if the compiler cannot see the body of called functions, and it is clearer to the reader. Arguably better still is to use the STL's count_if function to do all the work for you: the compiler still generates optimal code (https://godbolt.org/z/acm19_count4).

In the traditional single-translation-unit-at-a-time compilation model, function bodies are often hidden from call sites, as the compiler has seen only their declaration. Link time optimization (LTO, also known as LTCG, for link time code generation) can be used to allow the compiler to see across translation unit boundaries. In LTO, individual translation units are compiled to an intermediate form instead of machine code. During the link process—when the entire program (or dynamic linked library) is visible—machine code is generated. The compiler can take advantage of this to inline across translation units, or at least use information about the side effects of called functions to optimize.

Enabling LTO for optimized builds can be a good win in general, as the compiler can see your whole program. I now rely on LTO to let me move more function bodies out of headers to reduce coupling, compile time, and build dependencies for debug builds and tests, while still giving me the performance I need in final builds.

Despite being a relatively established technology (I used LTCG in the early 2000s on the original Xbox), I have been surprised how few projects use LTO. In part this may be because programmers unintentionally rely on undefined behavior that becomes apparent only when the compiler gets more visibility: I know I have been guilty of this.

Favorite Optimization Examples

Over the years, I have collected a number of interesting real-world optimizations, both from first-hand experience optimizing my own code and from helping others understand their code on Compiler Explorer. Here are some of my favorite examples of how clever the compiler can be.

Integer division by a constant. It may be surprising to learn that—until very recently—about the most expensive thing you could do on a modern CPU is an integer divide. Division is more than 50 times more expensive than addition and more than 10 times more expensive than multiplication. (This was true until Intel's release of the Cannon Lake microarchitecture, where the maximum latency of a 64-bit divide was reduced from 96 cycles to 18.6 This is only around 20 times slower than an addition, and 5 times more expensive than multiplication.)

Thankfully, compiler authors have some strength reduction tricks up their sleeves when it comes to division by a constant. I am sure we have all realized that division by a power of two can often be replaced by a logical shift right—rest assured the compiler will do this for you. I would advise not writing a >> in your code to do division; let the compiler work it out for you. It's clearer, and the compiler also knows how to account properly for signed values: integer division truncates toward zero, and shifting down by itself truncates toward negative infinity.

However, what if you are dividing by a non-power-of-two value, as illustrated in (j)? Are you out of luck? Luckily, the compiler has your back again. The code compiles to (k) (https://godbolt.org/z/acm19_div3).

Not a divide instruction in sight. Just a shift, and a multiply by a strange large constant: the 32-bit unsigned input value is multiplied by 0xaaaaaaab, and the resulting 64-bit value is shifted down by 33 bits.
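You can check the multiply-and-shift trick directly. The sketch below is my own restatement of what the generated code does, not the article's listing:

```cpp
#include <cassert>
#include <cstdint>

// Division by 3 the way the compiler emits it: multiply by the fixed-point
// reciprocal of 3 (0xaaaaaaab, roughly 1/3 scaled by 2^33), then shift.
uint32_t divideBy3(uint32_t x) {
    return static_cast<uint32_t>((x * UINT64_C(0xaaaaaaab)) >> 33);
}
```

For every 32-bit unsigned input this matches x / 3 exactly, which is the point: the compiler chooses the constant and shift so the rounding is identical to a real division.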

The compiler has replaced division with a cheaper multiplication by the reciprocal, in fixed point. The fixed point in this case is at bit 33, and the constant is one-third expressed in these terms (it's actually 0.33333333337213844). The compiler has an algorithm for determining appropriate fixed points and constants to achieve the division while preserving the same rounding as an actual division operation with the same precision over the range of the inputs. Sometimes this requires a number of extra operations—for example, in dividing by 1023 as shown in (l, https://godbolt.org/z/acm19_div1023). The algorithm is well known and documented extensively in the excellent book Hacker's Delight.8

In short, you can rely on the compiler to do a great job of optimizing division by a compile-time-known constant.

You might be thinking: Why is this such an important optimization? How often does one actually perform integer division, anyway? The issue is not so much with division itself as with the related modulus operation, which is often used in hash-map implementations as the operation to bring a hash value into the range of the number of hash buckets.

Knowing what the compiler can do here can lead to interesting hash-map implementations. One approach is to use a fixed number of buckets to allow the compiler to generate the perfect modulus operation without using the expensive divide instruction.

Most hash maps support rehashing to a different number of buckets. Naively, this would lead to a modulus with a number known only at runtime, forcing the compiler to emit a slow divide instruction. Indeed, this is what the GCC libstdc++ library implementation of std::unordered_map does.

Clang's libc++ goes a little further: it checks if the number of buckets is a power of two, and if so skips the divide instruction in favor of a logical AND. Having a power-of-two bucket count is alluring as it makes the modulus operation fast, but in order to avoid excessive collisions it relies on having a good hash function. A prime-number bucket count gives decent collision resistance even for simplistic hash functions.

Some libraries such as boost::multi_index go a step further: instead of storing the actual number of buckets, they use a fixed number of prime-sized bucket counts (m). That way, for all possible hash-table sizes the compiler generates the perfect modulus code, and the only extra cost is to dispatch to the correct piece of code in the switch statement.
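A sketch of that approach, in the spirit of (m) but with a bucket-count list and names of my own choosing rather than any library's actual code:

```cpp
#include <cassert>
#include <cstddef>

// Each case's modulus is by a compile-time constant, so the compiler
// strength-reduces every one of them to multiply-and-shift sequences;
// the only runtime cost is the switch dispatch. The prime list here is
// illustrative, not boost::multi_index's real table.
std::size_t bucketIndex(std::size_t hash, int sizeIndex) {
    switch (sizeIndex) {
        case 0: return hash % 53;
        case 1: return hash % 97;
        case 2: return hash % 193;
        case 3: return hash % 389;
        default: return hash % 769;
    }
}
```

Rehashing then means moving to the next sizeIndex, never taking a modulus by a value the compiler couldn't see.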

(j)

(k)


GCC 9 has a neat trick (n) for checking for divisibility by a non-power-of-two (https://godbolt.org/z/acm19_multof3), which compiles to (o):

  divisibleBy3(unsigned int):
    imul edi, edi, -1431655765   ; edi = edi * 0xaaaaaaab
    cmp  edi, 1431655765         ; compare with 0x55555555
    setbe al                     ; return 1 if edi <= 0x55555555
    ret

This apparent witchcraft is explained very well by Daniel Lemire in his blog.2 As an aside, it's possible to do these kinds of integer division tricks at runtime too. If you need to divide many numbers by the same value, you can use a library such as libdivide.5

Counting Set Bits

How often have you wondered, How many set bits are in this integer? Probably not all that often. But it turns out this simple operation is surprisingly useful in a number of cases. For example, calculating the Hamming distance between two bitsets, dealing with packed representations of sparse matrices, or handling the results of vector operations.

You might write a function to count the bits as shown in (p). Of note is the bit manipulation "trick" a &= (a - 1);, which clears the bottom-most set bit. It's a fun one to prove to yourself how it works on paper. Give it a go.

When targeting the Haswell microarchitecture, GCC 8.2 compiles this code to the assembly shown in (q) (https://godbolt.org/z/acm19_bits). Note how GCC has cleverly found the BLSR bit-manipulation instruction to pick off the bottom set bit. Neat, right? But not as clever as Clang 7.0, illustrated in (r).

This operation is common enough that there is an instruction on most CPU architectures to do it in one go: POPCNT (population count). Clang is clever enough to take a whole loop in C++ and reduce it to a single instruction. This is a great example of good instruction selection: Clang's code generator recognizes this pattern and is able to choose the perfect instruction.

I was actually being a little unfair here: GCC 9 also implements this (s), and in fact shows a slight difference. At first glance this appears to be suboptimal: Why on earth would you write a zero value, only to overwrite it immediately with the result of the "population count" instruction popcnt?

A little research brings up Intel CPU erratum SKL029: "POPCNT Instruction May Take Longer to Execute Than Expected"—there is a CPU bug! Although the popcnt instruction completely overwrites the output register eax, it is incorrectly tagged as depending on the prior value of eax. This limits the CPU's ability to schedule the instruction until any prior instructions writing to eax have completed—even though they have no impact.

GCC's approach here is to break the dependency on eax: the CPU recognizes xor eax, eax as a dependency-breaking idiom. No prior instruction can influence eax after xor eax, eax, and the popcnt can run as soon as its input operand edi is available. This affects only Intel CPUs and seems to be fixed in the Cannon Lake microarchitecture, although GCC still emits the XOR when targeting it.

Chained Conditionals

Maybe you have never needed to count the number of set bits in an integer, but you have probably written code like the example in (t). Instinctively, I thought the code generation would be full of compares and branches, but both Clang and GCC use a clever trick to make this code pretty efficient. GCC 9.1's output is shown in (u; https://godbolt.org/z/acm19_conds).

The compilers turn this sequence of comparisons into a lookup table. The magic value loaded into rax is a 33-bit lookup table, with a one-bit in the locations where you would return true (indices 32, 13, 10, and 9 for ' ', \r, \n, and \t, respectively). The shift and & then pick out the cth bit and return it. Clang generates slightly different but broadly equivalent code. This is another example of strength reduction.

I was pleasantly surprised to see this kind of optimization. This is definitely the kind of thing that—prior to investigating in Compiler Explorer—I would have written manually, assuming I knew better than the compiler.

One unfortunate thing I did notice while experimenting is that—for GCC, at least—the order of the comparisons can affect the compiler's ability to make this optimization. If you switch the order of the comparison of the \r and \n, GCC generates the code in (v). There's a pretty neat trick with the and to combine the comparison of \r and \t, but this seems worse than the code generated before. That said, a simplistic benchmark on Quick Bench suggests the compare-based version might be a tiny bit faster in a predictable tight loop.a Who ever said this was simple, eh?
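The code in (t) is a whitespace check of roughly this shape, and the compilers' table trick can be imitated by hand. The table version below is my own reconstruction, built from the same bit indices the text lists; the real generated code differs in detail.

```cpp
#include <cassert>
#include <cstdint>

// (t)-style chained comparison.
bool isWhitespace(char c) {
    return c == ' ' || c == '\r' || c == '\n' || c == '\t';
}

// Hand-written imitation of the compilers' trick: a mask with one-bits
// at indices 32 (' '), 13 ('\r'), 10 ('\n'), and 9 ('\t').
bool isWhitespaceTable(char c) {
    const uint64_t mask = (UINT64_C(1) << 32) | (UINT64_C(1) << 13) |
                          (UINT64_C(1) << 10) | (UINT64_C(1) << 9);
    unsigned idx = static_cast<unsigned char>(c);
    return idx <= 32 && ((mask >> idx) & 1) != 0;
}
```

The two functions agree over the whole ASCII range, which is exactly the equivalence the compiler proves before making the substitution.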

Summation

Sometimes you need to add a bunch of things up. Compilers are extremely good at taking advantage of the vectorized instructions available in most CPUs these days, so even a pretty straightforward piece of code such as (w) gets turned into code whose core loop looks like (x) (https://godbolt.org/z/acm19_sum).

The compiler has been able to process eight values per instruction, by separating the total into eight separate subtotals. At the end it sums across those subtotals to make the final total. It's as if the code was rewritten for you to look more like the (y) example. Simply place the compiler's optimization level at a high enough setting and pick an appropriate CPU architecture to target, and vectorization kicks in. Fantastic!

This does rely on the fact that separating the totals into individual subtotals and then summing at the end is equivalent to adding them in the order the program specified. For integers, this is trivially true; but for floating-point data types this is not the case. Floating-point operations are not associative: (a+b)+c is not the same as a+(b+c), as—among other things—the precision of the result of an addition depends on the relative magnitude of the two inputs.

This means, unfortunately, that changing the vector to hold floats does not result in the code you would ideally want. The compiler could use some vector operations (it can square eight values at once), but it is forced to sum across those values serially, as shown in (z; https://godbolt.org/z/acm19_sumf).

This is unfortunate, and there is not an easy way around it. If you are absolutely sure the order of addition is not important in your case, you can give GCC the dangerous (but amusingly named) -funsafe-math-optimizations flag. This lets it generate the beautiful inner loop illustrated in (a'; https://godbolt.org/z/acm19_sumf_unsafe). Amazing stuff: processing eight floats at a time, using a single instruction to accumulate and square. The drawback is potentially unbounded precision loss. Additionally, GCC does not allow you to turn this feature on for just the functions that need it—it's a per-compilation-unit flag. Clang at least lets you control it in the source code with #pragma clang fp contract.

While playing around with these kinds of optimizations, I discovered that compilers have even more tricks up their sleeves: check out (b'). GCC generates fairly straightforward code for this, and with appropriate compiler settings will use vector operations as noted earlier. Clang, however, generates the code in (c') (https://godbolt.org/z/acm19_sum_up).

First, note there is no loop at all. Working through the generated code, you see that Clang returns (x – 1)(x – 2)/2 + (x + 1). It has replaced the iteration of a loop with a closed-form general solution of the sum. The solution differs from what I would naively write myself: x(x – 1)/2.

a http://quick-bench.com/0TbNkJr6KkEXyy6ixHn3ObBEi4w
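The closed-form replacement can be sanity-checked by hand. Listing (b') is not reproduced in this extraction, so the loop below is my own variant of a simple integer sum, compared against the Gauss formula for the same range:

```cpp
#include <cassert>

// A simple sum-to-x loop of the kind Clang can replace with a
// closed-form expression instead of iterating.
int sumLoop(int x) {
    int result = 0;
    for (int i = 0; i < x; i++)
        result += i;
    return result;
}

// The closed form for the same sum: 0 + 1 + ... + (x-1) = x(x-1)/2.
int sumClosedForm(int x) {
    return x * (x - 1) / 2;
}
```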


This is presumably a result of the general algorithm Clang uses. Further experimentation shows that Clang is clever enough to opti- mize many of these types of loops. Both Clang and GCC track loop vari- ables in a way that allows this kind of optimization, but only Clang chooses to generate the closed-form version. It’s not always less work: for small values of x the overhead of the closed- form solution might be more than just looping. Krister Walfridsson goes into great detail about how this is achieved in a blog post.7 It is also worth noting that in order to do this optimization, the compiler may rely on signed integer overflow being undefined behavior. As such, it can as- sume your code cannot pass a value of x that would overflow the result (65536, in this case). If Clang cannot make that assumption, it is sometimes unable to find a closed-form solution (https:// godbolt.org/z/acm19_sum_fail).
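For reference, the sum-of-squares code discussed in this section (listing (w)) and the spirit of the compiler's subtotal rewrite (listing (y)) look roughly like the following sketch. Both functions and the eight-lane structure are my reconstruction from the text, not the article's listings:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// (w)-style loop: a sum of squares, readily auto-vectorized.
int sumSquares(const std::vector<int> &v) {
    int total = 0;
    for (int x : v)
        total += x * x;
    return total;
}

// (y)-style illustration of what vectorization effectively does:
// eight independent subtotals ("lanes"), combined only at the end.
int sumSquaresLanes(const std::vector<int> &v) {
    int lanes[8] = {0};
    std::size_t i = 0;
    for (; i + 8 <= v.size(); i += 8)
        for (int lane = 0; lane < 8; lane++)
            lanes[lane] += v[i + lane] * v[i + lane];
    int total = 0;
    for (int lane = 0; lane < 8; lane++)
        total += lanes[lane];
    for (; i < v.size(); i++)  // leftover elements
        total += v[i] * v[i];
    return total;
}
```

For integers the two orderings always agree; for floats they need not, which is exactly why the compiler refuses to reorder without -funsafe-math-optimizations.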

Devirtualization

Although it seems to have fallen out of favor a little, traditional virtual-function-based polymorphism has its place. Whether it's to allow for genuine polymorphic behavior, or add a "seam" for testability, or allow for future extensibility, polymorphism through virtual functions can be a convenient choice.

As we know, though, virtual functions are slow. Or are they? Let's see how they affect the sum-of-squares example from earlier—something like (d'). Of course, this is not polymorphic yet. A quick run through the compiler shows the same highly vectorized assembly (https://godbolt.org/z/acm19_poly1).

Now adding the virtual keyword in front of the int operator() should result in a much slower implementation, filled with indirect calls, right? Well, sort of (https://godbolt.org/z/acm19_poly2). There is a lot more going on than before, but at the core of the loop is something perhaps surprising (e').

What has happened here is GCC has made a bet. Given that it has seen only one implementation of the Transform class, it is likely going to be that one implementation that is used. Instead of blindly indirecting through the virtual function pointer, it has taken the slight hit of comparing the pointer


against the only known implementation. If it matches, then the compiler knows what to do: it inlines the body of Transform::operator() and squares it in place.

That's right: the compiler has inlined a virtual call. This is amazing, and was a huge surprise when I first discovered it. This optimization is called speculative devirtualization and is the source of continued research and improvement by compiler writers. Compilers are capable of devirtualizing at LTO time too, allowing for whole-program determination of possible function implementations.

The compiler has missed a trick, however. Note that at the top of the loop it reloads the virtual function pointer from the vtable every time. If the compiler were able to notice that this value remains constant if the called function does not change the dynamic type of Transform, this check could be hoisted out of the loop, and then there would be no dynamic checks in the loop at all. The compiler could use loop-invariant code motion to hoist the vtable check out of the loop. At that point the other optimizations could kick in, and the whole code could be replaced with the vectorized loop from earlier in the case of the vtable check passing.

You would be forgiven for thinking that the dynamic type of the object could not possibly change, but it's actually allowed by the standard: an object can placement-new over itself so long as it returns to its original type by the time it's destructed. I recommend that you never do this, though. Clang has an option to promise you never do such horrible things in your code: -fstrict-vtable-pointers.

Of the compilers I use, GCC is the only one that does this optimization as a matter of course, but Clang is overhauling its type system to leverage this kind of optimization more.4

C++11 added the final specifier to allow classes and virtual methods to be marked as not being further overridden. This gives the compiler more information about which methods may profit from such optimizations, and in some cases may even allow the compiler to avoid a virtual call completely (https://godbolt.org/z/acm19_poly3). Even without the final keyword, sometimes the analysis phase is able to prove that a particular concrete class is being used (https://godbolt.org/z/acm19_poly4). Such static devirtualization can yield significant performance improvements.

Conclusion

Hopefully, after reading this article, you will appreciate the lengths to which the compiler goes to ensure efficient code generation. I hope that some of these optimizations are a pleasant surprise and will factor in your decisions to write clear, intention-revealing code and leave it to the compiler to do the right thing. I have reinforced the idea that the more information the compiler has, the better job it can do. This includes allowing the compiler to see more of your code at once, as well as giving the compiler the right information about the CPU architecture you are targeting. There is a trade-off to be made in giving the compiler more information: it can make compilation slower. Technologies such as link time optimization can give you the best of both worlds.

Optimizations in compilers continue to improve, and upcoming improvements in indirect calls and virtual function dispatch might soon lead to even faster polymorphism. I am excited about the future of compiler optimizations. Go take a look at your compiler's output.

Acknowledgments. The author would like to extend his thanks to Matt Hellige, Robert Douglas, and Samy Al Bahra, who gave feedback on drafts of this article.

Related articles on queue.acm.org

C Is Not a Low-level Language
David Chisnall
https://queue.acm.org/detail.cfm?id=3212479

Uninitialized Reads
Robert C. Seacord
https://queue.acm.org/detail.cfm?id=3041020

You Don't Know Jack about Shared Variables or Memory Models
Hans-J. Boehm and Sarita V. Adve
https://queue.acm.org/detail.cfm?id=2088916

References
1. Godbolt, M. Compiler Explorer, 2012; https://godbolt.org/.
2. Lemire, D. Faster remainders when the divisor is a constant: beating compilers and libdivide, 2019; http://bit.ly/33nzsy4/.
3. LLVM. The LLVM compiler infrastructure, 2003; https://llvm.org.
4. Padlewski, P. RFC: Devirtualization v2. LLVM; http://lists.llvm.org/pipermail/llvm-dev/2018-March/121931.html.
5. ridiculous_fish. libdivide, 2010; https://libdivide.com/.
6. uops.info. Instruction latency tables; https://uops.info/table.html.
7. Walfridsson, K. How LLVM optimizes power sums, 2019; https://kristerw.blogspot.com/2019/04/how-llvm-optimizes-geometric-sums.html.
8. Warren, H.S. Hacker's Delight, 2nd edition. Addison-Wesley Professional, 2012.

Matt Godbolt is the creator of the Compiler Explorer website. He is passionate about writing efficient code. He currently works at Aquatic Capital, and has worked on low-latency trading systems, worked on mobile apps at Google, run his own C++ tools company, and spent more than a decade making console games.

Copyright held by author/owner. Publication rights licensed to ACM.
