
Optimizing real-world applications with GCC Link Time Optimization

Taras Glek (Mozilla Corporation, [email protected])
Jan Hubička (SuSE ČR s.r.o., [email protected])

Abstract

GCC has a new infrastructure to support link-time optimization (LTO). The infrastructure is designed to allow linking of large applications using a special mode (WHOPR) which supports parallelization of the compilation process. In this paper we present an overview of the design and implementation of WHOPR and present test results of its behavior when optimizing large applications. We give numbers on compile time, memory usage and code quality, compared to the classical file-by-file optimization model. In particular we focus on the Firefox web browser. We show the main problems seen only when compiling a large application, such as startup time and code size growth.

1 Introduction

Link Time Optimization (LTO) is a compilation mode in which an intermediate language (an IL) is written to the object files and the optimizer is invoked during the linking stage. This allows the compiler to extend the scope of inter-procedural analysis and optimization to encompass the whole program visible at link-time. This gives the compiler more freedom than the file-by-file compilation mode, in which each compilation unit is optimized independently, without any knowledge of the rest of the program being constructed.

Development of the LTO infrastructure in the GNU Compiler Collection (GCC) started in 2005 [LTOproposal] and the initial implementation was first included in GCC 4.5.0, released in 2010. The inter-procedural optimization framework was independently developed starting in 2003 [Hubička04], and was designed to be used both independently and in tandem with LTO.

The LTO infrastructure represents an important change to the compiler, as well as the whole tool-chain. It consists of the following components:

1. A middle-end (the part of the GCC back-end independent of the target architecture) extension that supports streaming an intermediate language representing the program to disk,

2. A new compiler front-end (the LTO front-end), which is able to read back the intermediate language, merge multiple units together, and process them in the compiler's optimizer and code generation backend,

3. A linker plugin integrated into the Gold linker, which is able to call back into the LTO front-end during linking [Plugin],

   (The plugin interface is designed to be independent of both the Gold linker and the rest of GCC's LTO infrastructure; thus the effort to extend the tool-chain for plugin support can be shared with other compilers with LTO support. Currently it is used also by LLVM [Lattner].)

4. Modifications to the GCC driver (collect2) to support linking of LTO object files using either the linker plugin or direct invocation of the LTO front-end,

5. Various infrastructure updates, including a new symbol table representation and support for merging of declarations and types within the middle-end, and

6. Support for using the linker plugin in other components of the tool-chain, such as ar and nm. (Libtool was also updated to support LTO.)

The inter-procedural optimization infrastructure consists of the following major components:

1. Callgraph and varpool data structures representing the program in an optimizer-friendly form,

2. Inter-procedural dataflow support,

3. A pass manager capable of executing inter-procedural and local passes, and

4. A number of different inter-procedural optimization passes.

Sections 2 and 3 contain an overview with more details on the most essential components of both infrastructures.

GCC is the third free software C/C++ compiler with LTO support (LLVM and Open64 both supported LTO in their initial respective public releases). GCC 4.5.0's LTO support was sufficient to compile small- to medium-sized C, C++ and Fortran programs, but had several deficiencies, including incompatibilities with various language extensions, issues mixing multiple languages, and the inability to output debug information. In Section 4 we describe an ongoing effort to make LTO useful for large real-world applications, discuss existing problems and present early benchmarks. We focus on two applications as a running example throughout the paper: the GCC compiler itself and the Mozilla Firefox browser.

2 Design and implementation of the Link Time Optimization in GCC

The link-time optimization in GCC is implemented by storing the intermediate language into object files. Instead of producing "fake" object files of a custom format, GCC produces standard object files in the target format (such as ELF) with extra sections containing the intermediate language, which is used for LTO. This "fat" object format makes it easier to integrate LTO into existing build systems, as one can, for instance, produce archives of the files. Additionally, one might be able to ship one set of fat objects which could be used both for development and the production of optimized builds, although this isn't currently feasible, for reasons detailed below.

As a surprising side-effect, any mistake in the tool-chain that leads to the LTO code generation not being used (e.g. an older libtool calling ld directly) leads to the silent skipping of LTO. This is both an advantage, as the system is more robust, and a disadvantage, as the user isn't informed that the optimization has been disabled.

The current implementation is limited in that it only produces "fat" objects, effectively doubling compilation time. This hides the problem that some tools, such as ar and nm, need to understand the symbol tables of LTO sections. These tools were extended to use the plugin infrastructure, and with these problems solved, GCC will also support "slim" objects consisting of the intermediate code alone.

The GCC intermediate code is stored in several sections:

• Command line options (.gnu.lto_.opts)

  This section contains the command line options used to generate the object files. This is used at link-time to determine the optimization level and other settings when they are not explicitly specified at the linker command line.

  At the time of writing this paper, GCC does not support combining LTO object files compiled with different sets of command line options into a single binary.

• The symbol table (.gnu.lto_.symtab)

  This table replaces the ELF symbol table for functions and variables represented in the LTO IL. Symbols used and exported by the optimized assembly code of "fat" objects might not match the ones used and exported by the intermediate code. The intermediate code is less optimized and thus requires a separate symbol table.

  There is also the possibility that the binary code in the "fat" object will lack a call to a function, since the call was optimized out at compilation time after the intermediate language was streamed out. In some special cases, the same optimization may not happen during the link-time optimization. This would lead to an undefined symbol if only one symbol table was used.

• Global declarations and types (.gnu.lto_.decls)

  This section contains an intermediate language dump of all declarations and types required to represent the callgraph, static variables and top-level debug info.

• The callgraph (.gnu.lto_.cgraph)

  This section contains the basic data structure used by the GCC inter-procedural optimization infrastructure (see Section 2.2). This section stores an annotated multi-graph which represents the functions and call sites as well as the variables, aliases and top-level asm statements.

• IPA references (.gnu.lto_.refs)

  This section contains references between functions and static variables.

• Function bodies

  This section contains function bodies in the intermediate language representation. Every function body is in a separate section to allow copying of the section independently to different object files or reading the function on demand.

• Static variable initializers (.gnu.lto_.vars)

• Summaries and optimization summaries used by IPA passes (see Section 2.2).

The intermediate language (IL) is the on-disk representation of GCC GIMPLE [Henderson]. It is used for high-level optimization in the SSA form [Cytron]. The actual file formats for the individual sections are still in a relatively early stage of development. It is expected that in future releases the representation will be re-engineered to be more stable and allow redistribution of object files containing LTO sections. Stabilizing the intermediate language format will require a more formal definition of the GIMPLE language itself. This is mentioned as one of the main requirements in the original proposal for LTO [LTOproposal], yet five years later we have to admit that work on this has made almost no progress.

2.1 Fast and Scalable Whole Program Optimizations — WHOPR

One of the main goals of the GCC link-time infrastructure was to allow effective compilation of large programs. For this reason GCC implements two link-time compilation modes.

1. LTO mode, in which the whole program is read into the compiler at link-time and optimized in a similar way as if it were a single source-level compilation unit.

2. WHOPR mode,¹ which was designed to utilize multiple CPUs and/or a distributed compilation environment to quickly link large applications² [Briggs].

¹An acronym for "scalable WHole PRogram optimizer", not to be confused with the -fwhole-program concept described later.
²Distributed compilation is not implemented yet, but since the parallelism is facilitated via generating a Makefile, it would be easy to implement.

WHOPR employs three main stages:

1. Local generation (LGEN)

   This stage executes in parallel. Every file in the program is compiled into the intermediate language and packaged together with the local callgraph and summary information. This stage is the same for both the LTO and WHOPR compilation modes.

2. Whole Program Analysis (WPA)

   WPA is performed sequentially. The global callgraph is generated, and a global analysis procedure makes transformation decisions. The global callgraph is partitioned to facilitate parallel optimization during phase 3. The results of the WPA stage are stored into new object files which contain the partitions of the program expressed in the intermediate language and the optimization decisions.

3. Local transformations (LTRANS)

   This stage executes in parallel. All the decisions made during phase 2 are implemented locally in each partitioned object file, and the final object code is generated. Optimizations which cannot be decided efficiently during phase 2 may be performed on the local callgraph partitions.

WHOPR can be seen as an extension of the usual LTO mode of compilation. In LTO, WPA and LTRANS are executed within a single execution of the compiler, after the whole program has been read into memory.

When compiling in WHOPR mode the callgraph partitioning is done during the WPA stage. The whole program is split into a given number of partitions of about the same size, with the compiler attempting to minimize the number of references which cross partition boundaries. The main advantage of WHOPR is to allow the parallel execution of LTRANS stages, which are the most time-consuming part of the compilation process. Additionally, it avoids the need to load the whole program into memory.

The WHOPR compilation mode is broken in GCC 4.5.x. GCC 4.6.0 will be the first release with a usable WHOPR implementation, which will by default replace the LTO mode. In this paper we concentrate on the real-world behavior of WHOPR.

2.2 Inter-procedural optimization infrastructure

The program is represented in a callgraph (a multi-graph where nodes are functions and edges are call sites) and the varpool (a list of static and external variables in the program) [Hubička04].

The inter-procedural optimization is organized as a sequence of individual passes, which operate on the callgraph and the varpool. To make the implementation of WHOPR possible, every inter-procedural optimization pass is split into several stages that are executed at different times of a WHOPR compilation:

• LGEN time:

  1. Generate summary

     Every function body and variable initializer is examined and the relevant information is stored into a pass-local data structure.

  2. Write summary

     Pass-specific information is written into an object file.

• WPA time:

  3. Read summary

     The pass-specific information is read back into a pass-local data structure in memory.

  4. Execute

     The pass performs the inter-procedural propagation. This must be done without actual access to the individual function bodies or variable initializers. In the future we plan to implement functionality to bring a function body into memory on demand, but this should be used with care to avoid memory usage problems.

  5. Write optimization summary

     The result of the inter-procedural propagation is stored into the object file.

• LTRANS time:

  6. Read optimization summary

     Inter-procedural optimization decisions are read from an object file.

  7. Transform

     The actual function bodies and variable initializers are updated based on the information passed down from the Execute stage.

The implementation of the inter-procedural passes is shared between LTO, WHOPR and classic non-LTO compilation. In the file-by-file mode every pass executes its own Generate summary, Execute, and Transform stages within the single execution context of the compiler. In LTO compilation mode every pass uses the Generate summary and Write summary stages at compilation time, while the Read summary, Execute, and Transform stages are executed at link time. In WHOPR mode all stages are used: all passes first execute their Generate summary stage, and the summary writing concludes LGEN. At WPA time the summaries are read back into memory and all passes run their Execute stage. Optimization summaries are then streamed and shipped to LTRANS, where finally all passes execute their Transform stage.

One of the main challenges of introducing the WHOPR compilation mode was solving interactions between the optimization passes. In LTO compilation mode, the passes are executed in a sequence, each of which consists of analysis (or Generate summary), propagation (or Execute) and Transform stages. Once the work of one pass is finished, the next pass sees the updated program representation and can execute. This makes the individual passes independent of each other.

Most optimization passes split naturally into analysis, propagation and transformation stages. The main problem arises when one pass performs changes and the following pass gets confused by seeing different callgraphs at the Transform stage than at the Generate summary or the Execute stage. This means that the passes are required to communicate their decisions with each other. Introducing an interface in between each pair of optimization passes would quickly make the compiler unmaintainable.

For this reason, the GCC callgraph infrastructure implements a method of representing the changes performed by the optimization passes in the callgraph without needing to update function bodies.

A virtual clone in the callgraph is a function that has no associated body, just a description of how to create its body based on a different function (which itself may be a virtual clone) [Hubička07].

The description of function modifications includes adjustments to the function's signature (which allows, for example, removing or adding function arguments), substitutions to perform on the function body, and, for inlined functions, a pointer to the function it will be inlined into.

It is also possible to redirect any edge of the callgraph from a function to its virtual clone. This implies updating the call site to adjust for the new function signature.

Most of the transformations performed by inter-procedural optimizations can be represented via virtual clones: for instance, a constant propagation pass can produce a virtual clone of a function which replaces one of its arguments by a constant. The inliner can represent its decisions by producing a clone of a function whose body will later be integrated into a given function.

Using virtual clones, the program can be easily updated at the Execute stage, solving most of the pass interaction problems that would otherwise occur at the Transform stages. Virtual clones are later materialized in the LTRANS stage and turned into real functions. Passes executed after the virtual clones were introduced also perform their Transform stages on the new functions, so for a pass there is no significant difference between operating on a real function or a virtual clone introduced before its Execute stage.

To simplify the task of generating summaries, several data structures in addition to the callgraph are constructed. These are used by several passes.

We represent IPA references in the callgraph. For a function or variable A, the IPA reference is a list of all locations where the address of A is taken and, when A is a variable, a list of all direct stores and reads to/from A. References represent an oriented multi-graph on the union of the nodes of the callgraph and the varpool.

Finally, we implement a common infrastructure for jump functions. Suppose that an optimization pass sees a function A and knows the values of (some of) its arguments. The jump function [Callahan, Ladelsky05] describes the value of a parameter of a given function call in function A based on this knowledge (when doing so is easily possible). Jump functions are used by several optimizations, such as the inter-procedural constant propagation pass and the devirtualization pass. The inliner also uses jump functions to perform inlining of callbacks.

For easier development, the GCC pass manager differentiates between normal inter-procedural passes and small inter-procedural passes. A small inter-procedural pass is a pass that does everything at once and thus cannot be executed at WPA time. It defines only the Execute stage, and during this stage it accesses and modifies the function bodies. Such passes are useful for optimization at LGEN or LTRANS time and are used, for example, to implement early optimization before writing the object files. The small inter-procedural passes can also be used for easier prototyping and development of a new inter-procedural pass.
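In source terms, the clone produced by constant propagation can be pictured as follows. This is a hand-written sketch of a transformation the compiler performs internally on the callgraph; the function names are ours:

```c
/* Original function; it stays in the program because it is
 * externally visible and might be called with other arguments. */
int scale(int x, int factor)
{
    return x * factor;
}

/* Virtual clone, materialized at LTRANS, for the case where every
 * local call site is known to pass factor == 2: the parameter is
 * dropped from the signature and the constant substituted. */
static int scale_for_2(int x)
{
    return x * 2;
}

int local_caller(int x)
{
    /* Call edges that passed the constant are redirected to the clone. */
    return scale_for_2(x);
}
```

Until LTRANS, only the description of the clone exists; no duplicate body is kept in memory.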

Optimization passes then work on the virtual clones introduced before their Execute stage as if they were real functions. The only difference is that clones are not visible at the Generate summary stage.

To keep the function summaries updated, the callgraph interface allows an optimizer to register a callback that is called every time a new clone is introduced, as well as when the actual function or variable is generated or when a function or variable is removed. These hooks are registered at the Generate summary stage and allow the pass to keep its information intact until the Execute stage. The same hooks can also be registered at the Execute stage to keep the optimization summaries updated for the Transform stage.

2.3 Whole program assumptions, linker plugin and symbol visibilities

Link-time optimization gives relatively minor benefits when used alone. The problem is that propagation of inter-procedural information does not work well across functions and variables that are called or referenced by other compilation units (such as from a dynamically linked library). We say that such functions and variables are externally visible.

To make the situation even more difficult, many applications organize themselves as a set of shared libraries, and the default ELF visibility rules allow one

to overwrite any externally visible symbol with a different symbol at runtime. This basically disables any optimizations across such functions and variables, because the compiler cannot be sure that the function body it is seeing is the same function body that will be used at runtime. Any function or variable not declared static in the sources degrades the quality of inter-procedural optimization.

To avoid this problem the compiler must assume that it sees the whole program when doing link-time optimization. Strictly speaking, the whole program is rarely visible even at link-time. Standard system libraries are usually linked dynamically or not provided with the link-time information. In GCC, the whole program option (-fwhole-program) declares that every function and variable defined in the current compilation unit³ is static, except for the main function. Since some functions and variables need to be referenced externally, for example by another DSO or from an assembler file, GCC also provides the function and variable attribute externally_visible, which can be used to disable the effect of -fwhole-program on a specific symbol.

³At link-time optimization the current unit is the union of all objects compiled with LTO.

The whole program mode assumptions are slightly more complex in C++, where inline functions in headers are put into COMDAT sections. COMDAT functions and variables can be defined by multiple object files and their bodies are unified at link-time and dynamic link-time. COMDAT functions are changed to local only when their address is not taken, and thus un-sharing them with a library is not harmful. COMDAT variables always remain externally visible; however, for read-only variables it is assumed that their initializers cannot be overwritten by a different value.

The whole program mode assumptions do not fit well when shared libraries are compiled with link-time optimization. The fact that the ELF specification allows overwriting symbols at runtime causes common problems with the increase of dynamic linking time, and for this reason common mechanisms to solve this problem are already available [Drepper].

GCC provides the function and variable attribute visibility that can be used to specify the visibility of externally visible symbols (or alternatively the -fvisibility command line option). ELF defines the default, protected, hidden and internal visibilities. Most commonly used is hidden visibility. It specifies that the symbol cannot be referenced from outside of the current shared library.

Sadly this information cannot be used directly by the link-time optimization in the compiler, since the whole shared library might also contain non-LTO objects and those are not visible to the compiler.

GCC solves this with the linker plugin. The linker plugin [Plugin] is an interface to the linker that allows an external program to claim the ownership of a given object file. The linker then performs the linking procedure by querying the plugin about the symbol table of the claimed objects and, once the linking decisions are complete, the plugin is allowed to provide the final object file before the actual linking is made. The linker plugin obtains the symbol resolution information which specifies which symbols provided by the claimed objects are bound from the rest of the binary being linked.

At the current time, the linker plugin works only in combination with the Gold linker, but a GNU ld implementation is under development.

GCC is designed to be independent of the rest of the tool-chain and aims to support linkers without plugin support. For this reason it does not use the linker plugin by default. Instead, the object files are examined before being passed to the linker, and objects found to have LTO sections are passed through the link-time optimizer first. This mode does not work for library archives. The decision about which object files from the archive are needed depends on the actual linking, and thus GCC would have to implement the linker itself. The resolution information is missing too, and thus GCC needs to make an educated guess based on -fwhole-program. Without the linker plugin GCC also assumes that symbols declared as hidden are not referred to by non-LTO code.

The current behavior on object archives is suboptimal, since the LTO information is silently ignored and no report is given that LTO optimization was not performed. The user can then easily be disappointed by not seeing any benefits at all from the LTO optimization.

The linker plugin is enabled via the command line option -fuse-linker-plugin. We hope that this becomes the standard behavior in the near future. Many optimizations are not possible without linker plugin support.

3 Inter-procedural optimizations performed

GCC implements several inter-procedural optimization passes. In this section we provide a quick overview and discuss their effectiveness on the test cases.

3.1 Early optimization passes

Before the actual inter-procedural optimization is performed, the functions are early optimized. Early optimization is a combination of the lowering passes (where the SSA form is constructed), the scalar optimization passes and the small inter-procedural optimization passes. The functions are sorted in reverse postorder (to make them topologically ordered for acyclic callgraphs) and all the passes are executed sequentially on the individual functions. Functions are optimized in this order, and only after all passes have finished with a given function is the next function processed.

Early optimization is performed at LGEN time to reduce the abstraction penalty before the real inter-procedural optimization is done. Since this work is done at LGEN time (rather than at link-time), it reduces the size of object files as well as the linking time, because the work is not re-done each time the object file is linked. An object file is often compiled once and used many times. It is consequently beneficial to keep as much of the work as possible in LGEN rather than doing link-time optimization on unoptimized output from the front-end.

The following optimizations are performed:

• Early inlining

  Functions that have already been optimized earlier and are very small are inlined into the current function. Because the early inliner lacks any global knowledge of the program, the inlining decisions are driven by the code size growth, and only very small code size growth is allowed.

• Scalar optimization

  GCC currently performs constant propagation, copy propagation, dead code elimination, and scalar replacement.

• Inter-procedural scalar replacement

  For static functions whose address is not taken, dead arguments are eliminated and calling conventions are updated by promoting small arguments passed by reference to arguments passed by value. Also, when a whole aggregate is not needed, only the fields which are used are passed, when this transformation is expected to simplify the resulting code [Jambor]. The main motivation for this pass is to make object-oriented programs easier to analyze locally by avoiding the need to pass the this pointer to simple methods.

• Tail recursion elimination

• Exception handling optimization

  This pass reduces the number of exception handling regions in the program, primarily by removing cleanup actions that were proved to be empty, and regions that contain no code that can possibly throw.

• Static profile estimation

  When profile feedback is not available, GCC attempts to guess the function profile based on a set of simple heuristics [Ball, Hubička05]. Based on the profile, cold parts of the function body are identified (such as parts reachable only from exception handling or leading to a function call that never returns). The static profile estimation can be controlled by the user via the __builtin_expect builtin and the cold function attribute.

• Function attributes discovery

  GCC has C and C++ language extensions that allow the programmer to specify several function attributes as optimization hints. In this pass some of those attributes can be auto-detected. In particular we detect functions that cannot throw (nothrow), functions that never return (noreturn), and const and pure functions. Const functions in GCC terminology are functions that only return their value, and their return value depends only on the function arguments. For many optimization passes const functions behave like a simple expression (allowing dead code removal, common subexpression elimination etc.). Pure functions are like const functions but are allowed to read global memory. See [GCCmanual] for details.

• Function splitting pass

  This pass splits functions into headers and tails to aid partial inlining.
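The attributes this pass detects can also be written by hand. A small sketch (the attributes are standard GCC extensions; the function names are ours):

```c
#include <stdlib.h>

/* const: the result depends only on the arguments, so repeated
 * calls can be subject to common subexpression elimination. */
__attribute__((const))
int square(int x)
{
    return x * x;
}

static int limit = 100;

/* pure: like const, but reading (not writing) global memory is allowed. */
__attribute__((pure))
int below_limit(int x)
{
    return x < limit;
}

/* noreturn: the flow graph ends at calls to this function, and static
 * profile estimation marks paths leading here as cold. */
__attribute__((noreturn))
void fatal(void)
{
    abort();
}

/* __builtin_expect gives the same kind of hint for a single branch:
 * here the error path is declared unlikely. */
int checked_div(int a, int b)
{
    if (__builtin_expect(b == 0, 0))
        fatal();
    return a / b;
}
```

The attribute-discovery pass infers such properties automatically when they are not annotated.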

Early optimization is very effective for reducing the abstraction penalty. It is essential for benchmarks that, for instance, use the Pooma library—early inlining in the Tramp3d benchmark [Günther] causes an order of magnitude improvement [Hubička07].

The discovery of function attributes also improves the flow graph accuracy, and the discovery of noreturn functions significantly improves the effectiveness of the static profile estimation. Error handling and sanity checking code is often discovered as cold.

Early optimization is of lesser importance on code bases that do not have significant abstraction or which are highly hand-optimized, such as the Linux kernel. The drawback is that most of the work done by the scalar optimizers needs to be re-done after inter-procedural optimization; inlining, improved aliasing and other facts derived from the whole program make the scalar passes operate more effectively. On such code bases early inlining leads to slowdowns in compile time while yielding small benefits at best.

After early optimization, unreachable functions and variables are removed.

3.2 The whole program visibility pass

This is the first optimization pass run at WPA, when the callgraph of the whole program is visible. The pass decides which symbols are externally visible in the current unit (entry points). The decisions are rather tricky in their details and are briefly described in Section 2.3. The pass also identifies functions which are not externally visible and only called directly (local functions). Local functions are later subject to more optimizations; for example, on i386 they can use register passing conventions.

Unreachable functions and variables are removed from the program. The removal of unreachable functions is re-done after each pass that might render more functions unreachable.

3.3 IPA profile propagation (ipa-profile)

Every function in GCC can be hot (when it is declared with the hot attribute or when profile feedback is present), normal, executed once (such as static constructors, destructors, main, or functions that never return), or unlikely executed.

This information is then used to decide whether to optimize for speed or for code size. If optimization for size is not enabled, hot and normal functions are optimized for speed (except for their cold regions), while functions executed once are optimized for speed only inside loops. Unlikely executed functions are always optimized for size.

When profile feedback is not available, this pass attempts to propagate the static knowledge based on the callers of the function:

• When all calls of a given function are unlikely (that is, either the caller is unlikely executed or the function profile says that the particular call is cold), the function is unlikely executed, too.

• When all callers are executed once or unlikely executed, and call the function just once, the function is executed once too. This allows the number of executions of a function executed once to be bounded by a known constant.

• When all callers are executed only at startup, the function is also marked as executed only at startup. This helps to optimize the code layout of static constructors.

To optimize the program layout, hot functions are placed in a separate subsection of the text segment (.text.hot). Unlikely functions are placed in the subsection .text.unlikely. For GCC 4.6.0 we will also place functions used only at startup into the subsection .text.startup.

While the pass provides an easy and cheap way to use the profile-driven compilation infrastructure to save some code size, its benefits on large programs are small. The decisions on which calls are cold are too conservative to give substantial improvements on a large program. For example, on Firefox, only slightly over 1000 functions are identified as cold, accounting for fewer than 1% of the functions in the whole program. The pass seems to yield substantial improvements only on small benchmarks, where code size is not much of a concern.

A more important effect of the pass is the identification of functions executed at startup, which we will discuss in Sections 3.5 and 4.4.
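The hot and cold classifications can also be forced from source with the attributes this pass consumes; a small sketch (function names are ours):

```c
/* Optimized for size; placed in the .text.unlikely subsection. */
__attribute__((cold))
void report_error(const char *msg)
{
    (void)msg;  /* actual error reporting elided in this sketch */
}

/* Optimized for speed; placed in the .text.hot subsection. */
__attribute__((hot))
int sum(const int *v, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}
```

With profile feedback present, the same classification is derived from measured execution counts instead of annotations.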

3.4 Constant propagation (ipa-cp)

GCC's inter-procedural constant propagation pass [Ladelsky05] implements a standard algorithm using basic jump functions [Callahan]. Unlike the classical formulation, the GCC pass does not implement return functions yet. The pass also makes no attempt to propagate constants passed in static variables.

As an extension the pass also collects a list of types of objects passed to arguments to allow inter-procedural devirtualization when all types are known and the virtual method pointers in their respective virtual tables match.

Finally the pass performs cloning to allow propagation across functions which are externally visible. Cloning happens when all calls to a function are determined to pass the same constant, but the function can be called externally too. This makes the assumption that the user forgot about the static keyword and all the calls actually come from the current compilation unit. The original function remains in the program as a fallback in case the external call happens.

More cloning would be possible: when a function is known to be used with two different constant arguments, it would make sense to produce two clones; this however does not fit the standard inter-procedural constant propagation formulation and is planned for a future function cloning pass.

A simple cost model is employed which estimates the code size effect of cloning. Cloning which reduces overall program size (by reducing the sizes of call sequences) is always performed. Cloning that increases overall code size is performed only at the -O3 compilation level and is bound by the function size and overall unit growth parameters.

The constant propagation is important for Fortran benchmarks, where it is often possible to propagate arguments used to specify loop bounds and to enable further optimization, such as auto-vectorization. SPECfp2006 has several benchmarks that benefit from this optimization. The pass is also useful to propagate symbolic constants, in particular this pointers when the method is only used on a single static instance of the object. Often various strings used for error handling are also constant propagated.

So far relatively disappointing results are observed on the devirtualization component of the pass. Firefox has many virtual calls and only about 200 calls are devirtualized this way. Note that prior to this pass, devirtualization is performed on a local basis during the early optimizations.

3.5 Constructor and destructor merging

In this simple pass we collect all static constructors and destructors of a given priority and produce a single function calling them all that serves as a new static constructor or destructor. Inlining will later most likely produce a single function initializing the whole program.

This optimization was implemented after examining the disk access patterns at startup of Firefox, see Section 4.4.

3.6 Inlining (ipa-inline)

The inliner, unlike the early inliner, has information about the current unit and profile, either statically estimated or read as profile feedback. As a result it can make better global decisions than the early inliner.

The inliner is implemented as a pass which tries to do as much useful inlining as possible within the bounds given by several parameters: the unit growth limits the code size expansion on a whole compilation unit, function growth limits the expansion of a single function (to avoid problems with non-linear algorithms in the compiler), and stack frame growth. Since the relative growth limits do not work well for very small units, they apply only when the unit, function, or stack frame is considered to be already large. All these parameters are user-controllable [GCCmanual].

The inliner performs several steps:

1. Functions marked with the always_inline attribute are inlined, as are all calls within functions marked with the flatten attribute [GCCmanual].

2. Compilation unit size is computed.

3. Small functions are inlined. Functions are considered small until a specified bound on function body size is met. The bound differs for functions declared inline and functions that are auto-inlined. Unless -O3 or -finline-functions is in effect, auto-inlining is done only when doing so is expected to reduce the code size.

This step is driven by a simple greedy algorithm which tries to inline functions in the order specified by estimated badness until the limits are hit. After it is decided that a given function should be inlined, the badness of its callers and callees is recomputed. The badness is computed as:

• The estimated code size growth after inlining the function into all callers, when this growth is negative,

• The estimated code size growth divided by the number of calls, when profile feedback is available, or

• Otherwise it is computed by the formula

    (growth × c) / benefit + growth_for_all.

Here benefit is an estimated speedup of inlining the call, growth is the estimated code size growth caused by inlining this particular call, growth_for_all is the estimated growth caused by inlining all calls of the function, and c is a sufficiently large magical constant.

The idea is to inline the most beneficial calls first, but also give some importance to the information how hard it is to inline all calls of the given function.

Calls marked as cold by the profile information are inlined only when doing so is expected to reduce the overall code size.

At this step the inliner also performs inlining of recursive functions into themselves. For functions that have a large probability of (non-tail) self-recursion this brings benefits similar to loop unrolling.

4. Functions called once are inlined unless the function body or the stack frame growth limit is reached or the function is not inlinable for another reason.

The inliner is the most important inter-procedural optimization pass and it is traditionally difficult to tune. The main challenge is to tune the inline limits to get reasonable benefits for the code size growth, and to specify the correct priorities for inlining. The requirements on inliner behavior depend on the particular coding style and the type of application being compiled.

GCC has a relatively simple cost metric compared to other compilers [Chakrabarti06, Zhao]. Some other compilers attempt to estimate, for example, the effect on instruction cache pollution and other parameters, combining them into a single badness value. We believe that these ideas have serious problems in handling programs with a large abstraction penalty. The code seen by the inliner is very different from the final code and thus it is very difficult to get reasonable estimates. For example, the Tramp3d benchmark has over 200 function calls in the program before the inlining for every operation performed at execution time by the optimized binary. As a result, inline heuristics have serious garbage-in-garbage-out problems.

We try to limit the number of metrics we use for inlining in GCC. At the moment we use only code size growth and time estimates. We employ several heuristics predicting what code will be optimized out, and plan to extend them more in the future. For example, we could infer that functions optimize better when some of their operands are known constants. We combine early inlining to reduce the abstraction penalty with careful estimates of the overall code size growth and dynamic updating of priorities in the queue. GCC takes into account when offline copies of the function will be eliminated. Dynamic updating of the queue improves the inliner's ability to solve more complex scenarios over algorithms processing functions in a pre-defined order. On the other hand this is a source of scalability problems, as the number of callers of a given function can be very large. The badness computation and priority queue maintenance have to be efficient. The GCC inliner seems to perform well when compared with implementations in other compilers, especially on benchmarks with a large C++ abstraction penalty.

3.7 Function attributes (ipa-pure-const)

This is equivalent to the pass described in Section 3.1 but propagates across the callgraph and is thus able to propagate across boundaries of the original source files as well as handle non-trivial recursion.

3.8 MOD/REF analysis (ipa-reference)

This pass [Berlin] first identifies static variables which are never written to as read-only and static variables whose address is never taken by a simple analysis of the IPA reference information.

For static variables that do not have their address taken (non-escaping variables), the pass then collects information on which functions read or modify them. This is done by simple propagation across the callgraph, with strongly connected regions being reduced; the final information is stored into bitmaps which are later used by the alias analysis oracle.

To improve the quality of the information collected, a new function attribute leaf was introduced. This attribute specifies that a call to an external leaf function may return to the current module only by returning or with exception handling. As a result the calls to leaf functions can be considered as not accessing any of the non-escaping variables.

This pass has several limitations: First, the bitmaps tend to be quadratic and should be replaced by a different data structure. There are also a number of possible extensions for the granularity of the information collected. The pass should not work on the level of variables, but instead analyze fields of structures independently. Also the pass should not give up when the address of a variable is passed to e.g. a memset call.

It is expected that the pass will be replaced by the inter-procedural points-to analysis once it matures.

MOD/REF is effective for some Fortran benchmarks. On programs written in a modern paradigm it suffers from the lack of static variables. The code quality effect on Firefox and GCC is minimal. Enabling the optimization saves about 0.1% of the GCC binary size.

3.9 Function reordering (ipa-reorder)

This is an experimental pass we implemented while analyzing Firefox startup problems. It specifies the order of the functions in the final binary by concatenating the callgraph functions in priority order, where the priority is given by the likeliness that one function will call another. This pass increases code locality and thus reduces the number of pages that need to be read at program startup.

During typical Firefox startup, poor code locality causes 84% of the .text section to be paged in by the Linux kernel while only 19% is actually needed for program execution. Thus there is room for up to a 3× improvement in library loading speed and memory usage [Glek]. At the time of writing the paper we cannot demonstrate consistent improvements in Firefox. When starting the GCC binary, the operating system needs to read about 3% fewer pages. It is not decided yet if the pass will be included in the GCC 4.6.0 release.

3.10 Other experimental passes

GCC also implements several optimization passes that are not yet ready for compiling larger applications: in particular inter-procedural points-to analysis, a structure reordering pass [Golovanevsky] and a matrix reorganization pass [Ladelsky07].

3.11 Late local and inter-procedural optimization

At the LTRANS stage the functions of a given callgraph partition are compiled in the reverse postorder of the callgraph (unless the function reordering pass is enabled). This order allows GCC to pass down certain information from callees to callers. In particular we re-do function attribute discovery and propagate the stack frame alignment information. This reduces the resulting size of the binary by an additional 1% (on both the Firefox and GCC binaries).

4 Compiling large applications

In this section we describe major problems observed while building large applications. We discuss the performance of the GCC LTO implementation and its effects on the applications compiled.

Ease of use is a critical consideration for compiler features. For example, although compiling with profile feedback can yield large performance improvements to many applications [Hubicka05], it is used by few software packages. Even programs which might easily be made to benefit from profile-guided optimizations, such as scripting language interpreters, have not widely adopted this feature. One of the main design goals of the LTO infrastructure was to integrate as easily as possible into existing build setups [LTOproposal, Briggs]. This has been partially met. In many cases it is enough to add -flto -fwhole-program as command line options.

In more complex packages, however, the user still needs to understand the use of -fuse-linker-plugin (to enable the linker plugin to support object

archives and better optimization) as well as -fwhole-program to enable the whole program assumptions. GCC 4.6.0 will introduce the WHOPR mode (enabled via the -flto command line option). The user then needs to specify the parallelism via -flto=n. We plan to add auto-detection of GNU Make that allows parallel compilation via its job server (controlled by the well known -j command line option). The job server can be detected using an environment variable, but it requires the user to add + at the beginning of the Makefile rule.

In this section we concentrate mostly on our experiences building Firefox and GCC itself with link-time optimization.

Firefox is a complex application. It consists of dozens of libraries totaling about 6 million lines of C and C++ code. In addition to being a large application it is used heavily by desktop users. Many libraries built as part of Firefox are developed by third parties with independent coding standards. As such they stress areas of the link-time optimization infrastructure not used at all by portable C and C++ programs such as the ones present in the SPEC2006 benchmark suite. For this reason we chose Firefox as a good test for the quality and practical usability of GCC LTO support. We believe that by fixing numerous issues arising during the Firefox build we also enabled GCC to build many other large applications.

On the other hand the GCC compiler itself is a portable C application. The implementation of its main module, the compiler binary itself, consists of about 800 000 lines of hand written C code and about 500 000 lines of code auto-generated from the machine description. We test the effect of the link-time optimization on GCC itself especially because the second author is very familiar with the code base.

At the time of writing this paper, both GCC and Firefox compile and work with LTO. Firefox requires minor updates to the source code — in particular, we had to annotate the variables and functions used by asm statements. This is done with attribute used [GCCmanual].

4.1 Compilation times

Compilation time increases are always noticeable when switching from the normal compilation to the link-time optimizing environment. Linking is more difficult to distribute and parallelize than the compilation itself. Moreover, during development, the program is re-linked many times after modifications in some of the source files. In the file-by-file compilation mode only modified files are re-compiled and re-linking is relatively fast. With LTO most of the optimization work is lost and all optimizations at link-time have to be redone again.

With the current GCC implementation of LTO, the overall build time is expected to at least double. This is because of the use of "fat" object files. This problem will be solved soon by the introduction of "slim" object files.

The actual expense of streaming IL to object files is minor during the compilation stage, so we focus on actual link-times. We use a 24-core AMD workstation for our testing. The "fat" object files are about 70% larger than object files which contain assembler code only. (GCC uses zlib to compress the LTO sections.)

4.1.1 GCC

Linking GCC in single CPU LTO mode needs 6 minutes and 31 seconds. This is similar to the time needed for the whole non-LTO compilation (8 minutes and 12 seconds). Consequently the time needed to build the main GCC binary from scratch is about 15 minutes.

The overall increase of the build time of the GCC package is bigger. Several compiler binaries are built during the process and they all are linked with a common backend library. With the link-time optimizations the backend library is thus re-optimized several times. We do not count this and instead measure the time needed to build only one compiler binary.

The most time-consuming steps of the link-time compilation are the following:

• Reading the intermediate language from the object file into the compiler: 3% of the overall compilation time.

• Merging of declarations: 1%.

• Outputting of the assembly file: 2%.

• Debug information generation (var-tracking and symout): 8%.

• Garbage collection: 2%.

• Local optimizations consume the majority of the compilation time. The most expensive components are: partial redundancy elimination (5%), GIMPLE to RTL expansion (8%), RTL level dataflow analysis (11%), instruction combining (3%), register allocation (6%), scheduling (5%).

The actual inter-procedural optimizations are very fast, with the slowest being the inline heuristics, which still accounts for less than 1% of the compilation time (1.5 seconds).

In WHOPR mode with the use of all 24 cores of our testing workstation we reduce the link time to 48 seconds. This is also faster than the parallelized non-LTO compilation, which needs 56 seconds. As a result, even in a parallel build setup, the LTO accounts for a two-fold slowdown. The slower non-LTO build time is partly caused by the existence of large, auto-generated source files, such as insn-attrtab.c, which reduce the overall parallelism. WHOPR linking has the advantage of partitioning insn-attrtab.c into multiple pieces.

The serial WPA stage takes 19 seconds. The most expensive steps are:

• Reading global declarations and types: 28% of the overall time taken by the WPA stage.

• Merging declarations: 6%.

• Inter-procedural optimization: 9%.

• Streaming of object files to be passed to LTRANS: 42%.

The rest of the compilation process, including all parallel LTRANS stages and the actual linking, consumes 29 seconds.

4.1.2 Firefox

Linking Firefox in single CPU LTO mode needs 19 minutes and 29 seconds (compared to 39 minutes needed to build Firefox from scratch in non-LTO mode). The time is distributed as follows:

• Reading of the intermediate language from the object file into the compiler: 7%.

• Merging of declarations: 4%.

• Output of the assembly file: 3%.

• Debug information generation is disabled in our builds.

• Garbage collection: 2%.

• Local optimizations consume the majority of the compilation time. The most expensive components are: operand scan (5%), partial redundancy elimination (5%), GIMPLE to RTL expansion (13%), RTL level dataflow analysis (5%), instruction combining (3%), register allocation (9%), scheduling (3%).

The WHOPR mode reduces the overall link-time to 5 minutes and 30 seconds. The WPA stage takes 4 minutes 24 seconds. This compares favorably to non-LTO parallel compilation, which takes 9 minutes and 38 seconds. The source code of Firefox is organized into multiple directories, leading to less parallelism exposed to Make. The most expensive steps are:

• Reading global declarations and types: 24%.

• Merging declarations: 20%.

• Inter-procedural optimization: 8%.

• Streaming of object files to be passed to LTRANS: 28%.

• Callgraph and WPA overhead (callgraph merging and partitioning): 12%.

The fact that the link-time optimization seems to scale linearly and maintain a two-fold slowdown can be seen as a success. WHOPR mode successfully enables GCC to use parallelism to noticeably reduce the build time. It is however obvious that the intermediate language input and output is a bottleneck. We can address this problem in two ways. First, we can reduce the number of global types streamed by separating debug information and analyzing the reason why so many types and declarations are needed. Second, we can optimize the on-disk representation. By solving these problems, the WPA stage can become several times faster, since the

actual inter-procedural optimization seems to scale very well. Just shortly before finishing the paper, Richard Günther submitted a first patch to reduce the number of declarations at the WPA stage to about 1/4th. This demonstrates that there is quite a lot of space for improvement left here.

Because WHOPR makes linking faster than the time needed to build the non-LTO application, there is hope that with the introduction of "slim" objects, the LTO build times will actually be shorter than non-LTO for many applications. This is because slow code generation is better distributed to multiple CPUs with WHOPR than with average parallel build machinery.

Linking time comparable to the time needed to rebuild the whole application is still very negative for the edit/recompile/link experience of the developers. It is not expected that developers will use LTO optimization at all stages of development, but optimizing for quick re-linking after a local modification is still important. We plan to address this by the introduction of an incremental WHOPR mode, as discussed in Section 5.

4.2 Compile time memory usage

A traditional problem in GCC is the memory usage of its intermediate languages. Both GIMPLE and RTL require an order of magnitude more memory than the size of the final binary. While the intermediate languages need to keep more information than the actual machine code, other compilers achieve this with much slimmer ILs [Lattner]. In addition to several projects to reduce the memory usage of GCC (see, for example, [Henderson]), this was also one of the motivations for designing WHOPR to avoid the need to load the whole program into memory at once.

In LTO mode memory usage peaks at 2GB for the GCC compilation and 8.5GB for the Firefox compilation.

In WHOPR mode, the WPA stage operates only on the callgraph and optimization summaries, while the LTRANS stage sees only parts of the program at a time. In our tests we configure WHOPR to split the program into 32 partitions. Compiling large programs is currently dominated by the memory usage of the WPA stage: in a 64-bit environment the memory usage of the compilation of the GCC binary peaks at 415MB, while the compilation of Firefox peaks slightly over 4GB. The actual LTRANS compilations do not consume more than 400MB, averaging 120MB for the GCC compilation and 300MB for Firefox.

The main binary of the GCC compiler (cc1) is 10MB, so the GCC compile-time memory consumption is still about 50 times larger than the size of the program. Compiling Firefox uses about 130 times more memory than the size of the resulting binary. This is a serious problem especially for a 32-bit environment, where the memory usage of a single process is bound by the address space size. In 32-bit mode, GCC barely fits in the address space when compiling Firefox!

The main sources of the problem are the following:

• GCC is mapping the source object files into memory. This is fast, but careless about address space limits in 32-bit environments. For GCC this accounts for about 170MB of the address space. It is not effective to open one file at a time, since the files are read in several stages, each stage accessing all files. Clearly a more effective scheme is still possible.

• A large amount of memory is occupied by the representation of types and declarations. Many of these are not really needed at the WPA stage and should be streamed independently. For GCC this accounts for 260MB of the memory.

• The MOD/REF pass has a tendency to create overly large bitmaps. This is not a problem when building GCC or Firefox, but it can be observed on some benchmarks in the SPEC2006 test-suite, where over 100MB of bitmaps are needed.

Note that a relatively small amount (about 52MB in the compilation of GCC) is used by the actual callgraph, varpool, and other data structures used for the inter-procedural optimization.

The memory distribution of the compilation of Firefox is very similar to the one seen for the GCC compilation. The percentage of memory used by declarations and types is even higher — about 3.7GB. This is because the C++ language implies more types and longer identifiers.

Similarly to the compilation time analysis, we can identify declarations and types as being a major problem for the scalability of GCC LTO.

benchmark name   speedup
dromaeo css       1.83%
tdhtml           -0.54%
tp_dist           0.50%
tsvg              0.07%

Figure 1: Firefox performance.

4.3 Code size and quality

Link time optimization promises both performance improvements as well as code size reductions. It is not difficult to demonstrate benchmarks where cross-module inlining and constant propagation cause substantial performance improvements. However in applications that have been profiled and hand optimized for a single-file compilation model (this is the case of both Firefox and GCC), the actual performance improvements are limited. This is because the authors of the software already did by hand most of the work the inter-procedural optimizer would otherwise do.

In the short term, we expect the value of link-time optimizations on such applications to be primarily in the reduction of the code size. It is harder to use hand optimization to reduce the overall size of the application than to increase performance. Most programs spend most of their time in a rather small portion of the code, so one can optimize for speed only the hot code. But to decrease overall program size, one must tune the whole application.

Once more widely adopted, the link-time optimization will simplify the task of developers by reducing the amount of hand-tuning needed. For example, with LTO it is not necessary to place short functions into headers for better inlining. This allows a cleaner cut between interface and implementation.

4.3.1 Firefox

When compiled with link-time optimization (using the -O3 optimization level) the size of the Firefox main module is reduced from 33.5MB to 31.5MB, a reduction of 6%. Runtime performance tests comparing Firefox built without LTO to Firefox built with LTO using the -O3 optimization level are shown in Figure 1.

When optimizing for size (-Os), early reports on building Firefox with the LLVM compiler and LTO enabled claim to save 13% of the code size compared to a GCC non-LTO build with the same settings [Espindola]. Comparing the GCC non-LTO build (28.2MB) with the GCC LTO build (25.3MB) at -Os, we get an 11% smaller binary.

We also observed that further reductions of the code size are possible by limiting the overall unit growth (the --param inline-unit-growth parameter). We found that limiting overall growth to 5% seems to give considerable code size savings (an additional 12%), while keeping most of the performance benefits of -O3. Since this generally applies to other big compilation units too, we plan to re-tune the inliner to automatically cut the code size growth with an increasing unit size.

The non-LTO -O3 build is 18% bigger than the non-LTO -Os build. Consequently enabling LTO (and tweaking the overall program growth) has a code size effect comparable to switching from the aggressive optimization for speed to the aggressive optimization for size. LTO however has positive performance effects, while -Os is reported to be about 10%–17% slower.

4.3.2 GCC

GCC by default uses the -O2 optimization level to build itself. When compiled with the link-time optimization, the GCC binary shrinks from 10MB to 9.3MB, a reduction of 7%. The actual speedups are small (within the noise level) because during the work on the link-time optimization we carefully examined possibilities for cross-module inlining and reorganized the sources to make them possible in single-file compilation mode, too.

Some improvements are seen when GCC is compiled with -O3. The non-optimizing compilation of C programs is then 4% faster. The binary size is 11MB.

4.3.3 SPEC2006

For reference we include SPEC2006 results on an AMD64 machine, comparing the options:

-O3 -fpeel-loops -ffast-math -march=native

             speedup   size
perlbench     +1.4%    +4%
bzip2         +2.6%   -45%
gcc           -0.3%   +1.2%
mcf           +1.9%   -33%
gobmk         +3.4%   +1.8%
hmmer         +0.8%   -55%
sjeng         +1.2%   -11%
libquantum    -0.5%   -61%
h264ref       +7.0%    -9%
omnetpp       -0.8%   -11%
astar         -1.3%   -20%

Figure 2: SPECint 2006.

             speedup      size
bwaves       0% (+15%)   -27%
gamess       -0.7%       -50%
milc         +2.2%       -26%
zeusmp       +0.4%       -27%
gromacs      0%          -18%
cactusADM    -0.8%       -42%
leslie3d     -2.1% (0%)  +0.6%
namd         0%          -40%
soplex       +1.5%       -50%
povray       +5%         -2.3%
calculix     1.1%        -38%
GemsFDTD     0%          -70%
tonto        -0.2%       -25%
lbm          +3.2%        0%
wrf          0%          -36%
sphinx3      +2.9%       -32%

Figure 3: SPECfp 2006.

with the same options plus -flto -fwhole-program.

In Figures 2 and 3 the first column compares SPEC rates (bigger is better), and the second column compares executable sizes (smaller is better).

The results show significant code size savings derived from improved inlining decisions and the whole program assumption (especially in code bases that do not use the static keyword in declarations consistently). The performance is also generally improved. The bwaves and leslie3d benchmarks demonstrate a problem in the GCC static profile estimation algorithm where inlining too many loops together causes a hot part of the program to be predicted cold. The results in parentheses show the results with hot/cold decisions disabled.

We did not analyze the regressions in astar or zeusmp yet. We also excluded the xalancbmk and dealII benchmarks, since they do not work with the current GCC build⁴.

⁴ Both benchmarks were working with GCC LTO in the past.

4.4 Startup time problems

One of Firefox's goals is to start quickly. We devote a special section to this problem, because it is often overlooked by tool-chain developers. Startup time issues are not commonly visible in common benchmarks, which generally consist of small- to medium-sized applications.

Unfortunately, it turns out that currently the Linux + GNU tool-chain is ill-suited for starting large applications efficiently. Many of the other open source programs of Firefox's size (e.g. OpenOffice, Chromium, Evolution) suffer from various degrees of slow startup.

There are various ways to measure startup speed. In this paper we will focus on cold startup as a worst-case scenario.

4.4.1 Overview of Firefox startup

The firefox-bin "stub" calls into libxul.so, which is a large library that implements most of Firefox's functionality. For historical reasons the rest of Firefox is broken up into 14 other smaller libraries (e.g. the nspr portability library, nss security libraries). Additionally, Firefox depends on a large number of X/GTK/GNOME libraries.

Components of Firefox startup

The Firefox startup can be categorized into the following phases:

1. The kernel loads the executable.

2. The dynamic library loader then loads all of the prerequisite libraries.

3. For every library loaded, initializers (static constructors) are run.

4. Firefox main() and the rest of the application code is run.

5. The dynamic library loader loads additional libraries.

6. Finally, a browser window is shown.

It may come as a surprise that steps 2–3 currently dominate Firefox startup on Linux. This being a cold startup, IO dominates. There are two kinds of IO: explicit IO done via explicit read() calls and implicit IO facilitated by mmap().

Most of the overhead comes from not using mmap() carefully. This IO is triggered by page faults, which are essentially random IO. Typically a page fault causes 128KB of IO around the faulted page. For example, it takes 162 page faults (20MB/128KB) to page in the .text section of libxul.so, Firefox's main library. Each page fault incurs a seek followed by a read. Hard drive manufacturers specify disk seeks ranging from 5ms to 14ms (7200 to 4200 RPM)⁵. In practice a missed seek seems to cost 20–50ms.

Modern storage media excels at bulky IO, but random IO in small chunks keeps devices from performing at their best.

Figure 4 illustrates the page faults which occur while loading libxul.so from disk.

[Figure 4: page faults that occur while loading libxul.so from disk. The plot shows time against file offset (0M–21M); the labeled regions attribute 46% to static initializers, 15% to relocations, and 6% to the runtime linker and miscellaneous work.]

We now examine each of the stages of loading Firefox in more detail:

Loading firefox-bin

This is cheap because firefox-bin is a small executable, weighing in at only 48K. This is smaller than Linux's readahead, so loading firefox-bin only requires a single read from disk. The dynamic loader then proceeds to load various libraries that firefox-bin depends on.

Dynamic linker (ld.so)

In the shared-lib-happy Linux world, the runtime linker, ld.so, loads a total of 105 libraries on firefox-bin's behalf. It is essential that ld.so carefully avoids performing unnecessary page faults and that the compile-time linker arranges the binary to facilitate that.

ld.so loads dependent libraries, sets up the memory mappings, zeroes .bss⁶, etc. As seen in Figure 4, relocations are the most expensive part of library loading. Prelink addresses some of the relocation overhead on distributions like Fedora.

⁵ SSDs do not suffer from disk seek latency. However, there are still IO delays ranging from 0.1ms to 2s depending on the types of flash and controllers used [Anandtech].

⁶ .bss follows .data, which usually does not end on a page boundary.

Static Initializers

Once a library is loaded by ld.so, the static initializers for that library are enumerated from the .ctors section and executed.

In Firefox, static initializers arise from initializing C++ globals with non-primitives. Most of the time they are unintentional, and can even be caused by including standard C++ headers (e.g. iostream). Better inlining of simple constructors into POD-initialization could alleviate this problem.

Compounding the problem of unintentional static initializers is the fact that GCC treats them inefficiently. GCC creates a static initializer function and a corresponding .ctors entry for every compilation unit with static initializers. When the linker concatenates the resulting object files, this has the effect of evenly spreading the static initializers across the entire executable. The GCC runtime then reads the .ctors entries and executes them in reverse order. The order is reversed to ensure that any statically linked libraries are initialized before code that depends on them.

The combined problem of the abundance of C++ static initializers, their layout in the program, and the lack of special treatment by the linker means that executing the .ctors section of a large C++ program likely causes its executable to be paged in backwards!

We noticed that, for example, the Microsoft C++ compiler and linker group static initializers together to avoid this problem.

A nice property of static initializers is that the order of their execution is known at compile time.

Application Execution

Once the application code starts executing, .text is paged in in an essentially random pattern. During file-by-file compilation, functions are laid out based on the source file implementing them, with no relation to their callgraph relationships. This results in poor page cache locality.

4.4.2 Startup time improvements within reach of the LTO infrastructure

While working on enabling LTO compilation of Firefox, we also experimented with several simple optimizations targeted at improving startup time. The most obvious transformation is to merge the static initializers into a single function for better code locality. This is implemented as a special constructor merging pass, see Section 3.5.

This transformation almost eliminates the part of the disk access graph attributed to the execution of static initializers. Ironically, for Firefox itself this only delays most of the disk accesses to a later stage of startup, because a lot of code is needed for the rest of the startup process. Other C++ applications, however, suffer from the same problem; if an application does less work during the rest of startup, its startup time will benefit noticeably.

To further improve the code locality of the startup procedure, we implemented a function reordering pass, see Section 3.9. Unfortunately we were not yet able to show consistent improvements from this pass on Firefox itself. The problem is that the compiler, aside from the execution of static initializers, has little information about the rest of the startup process, and merely grouping functions based on their relative references seems not to interact well with the kernel's readahead strategy.

To improve the static analysis, further work will be needed to track virtual calls. The design of the Firefox APIs allows a lot of devirtualization which GCC is not currently capable of. When devirtualization fails, we can produce speculative callgraph edges from each virtual call to every virtual method of a compatible type (may edges) and use them as hints for ordering. Clearly, may edges are one of the main missing parts of the GCC inter-procedural optimization infrastructure. It is however not clear how much benefit these methods will have in practice, as the actual problem of lacking knowledge of the startup procedure remains. Improving GCC's devirtualization capabilities alone would however be an important improvement in its own right.

Locality is also improved by aggressive inlining of functions which are called once.

In the near future we plan to experiment more with profile-feedback-directed optimization in combination with LTO. With profile feedback available, the actual problem of ordering functions for better startup time is a lot easier: all we need is to extend the program instrumentation to record the time when a given function was first invoked, and order functions according to their invocation times.

Finally, data layout can be optimized. Data structures with static constructors can be placed at a common location to reduce the amount of paging as well.

5 Conclusion

The link-time optimization infrastructure in GCC is mature enough to compile large real-world applications. The code quality improvements are comparable with other compilers.

Unlike heavily benchmark-optimized compilers, GCC usually produces smaller binaries when compiling with LTO support than without. We expect that the code size effect will be one of the main selling points of LTO support in GCC in the near future. Code size and code locality are both very important factors affecting the performance of large, real-world applications. The link-time optimization model allows the compiler to make substantial improvements in this area. The size of the Firefox binary built with link time optimization for speed is comparable to the size of the Firefox binary built with file-by-file optimization for size.

LTO also brings runtime performance improvements. The magnitude of these improvements largely depends on how much the benchmarked application was profiled and hand-tuned for the single-file compilation model. Many of the major software packages today went through this tuning. Consequently the immediate benefits of classical inter-procedural optimizations are limited. Both GCC and Firefox show a runtime improvement of less than 1% with LTO.

GCC also exposes a lack of tuning for the new environment, as seen in two larger regressions in the SPECfp2006 benchmark. We hope to re-tune the inliner and profile estimation before GCC 4.6.0 is released. More benefits are possible when the compiler enables some of its more aggressive optimizations (such as code-size-expanding auto-inlining) by default. This can be done because the compiler is significantly more aware of the global tradeoffs at link time than at compile time. Enabling automatic inlining with a small overall program growth limit (such as 5%) improves GCC performance by 4%. This suggests that in the future GCC should enable some of those code expanding optimizations by default during link-time optimization, with the overall code size growth bounded to reasonable settings.

GCC also lacks some of the more advanced inter-procedural optimizations available in other compilers. To make our inter-procedural optimizer complete, we should introduce more aggressive devirtualization, points-to analysis, a function specialization pass, and a pass merging functions with identical bodies. The callgraph module also lacks support for may edges representing possible targets of indirect calls. There is also a lot of potential in implementing data structure layout optimizations [Chakrabarti08], such as structure field reordering for better locality. Some of these passes already exist in the form of experimental passes [Golovanevsky, Ladelsky07]. Getting these passes into production quality will involve a lot of work.

More performance improvements are possible by a combination of link time optimization and profile-feedback-directed optimization [Li]: link-time optimization gives the compiler more freedom, while profile feedback tells the compiler more precisely which transformations are beneficial. Immediate benefits can be observed, for example, in the quality of inlining decisions, code layout, speculative devirtualization and code size. GCC has profile feedback support [Hubicka05] and it is tested to work with link time optimization. A more detailed study of the benefits is however out of the scope of this paper.

There are a number of remaining problems. The on-disk representation of the intermediate language is not standardized at all. This implies that all files need to be recompiled when the compiler or compiler options change, which limits the possibilities of distributing LTO-compiled libraries. The memory usage is comparable with other compilers (such as MSVC), yet a number of improvements are possible, especially by reducing the number of declarations and types streamed.

A GCC-specific feature, the WHOPR mode, allows parallel compilation. Unlike a multi-threaded compilation model, it allows distributed compilation, too, although it is questionable how valuable this is, since today's and tomorrow's workstation machines will likely have a good deal of parallelism available locally. Increasing the number of parallel compilations past the 24 we used in our testing would probably have few benefits, since the serial WPA stage would likely dominate the compilation.

The WHOPR mode makes a clean cut between inter-procedural propagation and local compilation using optimization summaries. This invites the implementation of an incremental mode, where during recompilation not all of the work is re-done: only the inter-procedural passes would be re-run, and the assembly code of functions whose body and summary did not change would be reused from the previous run. This is planned for the next GCC releases.

Still, before GCC 4.6.0 is released (and probably even for GCC 4.7.0) there is a lot of work left to do on the correctness and feature completeness of the basic link-time optimization infrastructure. The main areas lacking include debugging information, which is a lot worse than in file-by-file compilation. At the time of writing this paper, enabling debug information also leads to a compiler crash when building Firefox. Clearly this is an important problem, as no major project will use a compiler that does not produce usable debug information for building official binaries.

6 Acknowledgment

We thank Andi Kleen and Justin Lebar for several remarks and corrections which significantly improved the quality of this paper.

References

[Ball] T. Ball, J. Larus, Branch prediction for free, Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation, June 1993.

[Berlin] D. Berlin, K. Zadeck, Compilation wide alias analysis, Proceedings of the 2005 GCC Developers' Summit.

[Briggs] P. Briggs, D. Evans, B. Grant, R. Hundt, W. Maddox, D. Novillo, S. Park, D. Sehr, I. Taylor, O. Wild, WHOPR — Fast and Scalable Whole Program Optimizations in GCC, initial draft, Dec 2007.

[Callahan] D. Callahan, K. D. Cooper, K. Kennedy, Interprocedural constant propagation, ACM SIGPLAN Notices, Volume 21, Issue 3, 1986.

[Chakrabarti06] D. R. Chakrabarti, S. M. Liu, Inline analysis: Beyond selection heuristics, Proceedings of the International Symposium on Code Generation and Optimization, 2006.

[Chakrabarti08] G. Chakrabarti, F. Chow, Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements, Open64 Workshop at CGO, 2008.

[Cytron] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, F. K. Zadeck, Efficiently computing static single assignment form and the control dependence graph, ACM Transactions on Programming Languages and Systems, Volume 13, 1991.

[Drepper] U. Drepper, How to write shared libraries.

[Espindola] R. Espindola, LTO, plugins and binary sizes, email sent to the LLVM Developers Mailing List, 9 Oct 2010.

[Glek] T. Glek, libxul.so startup coverage, blog entry, http://glandium.org/blog/?p=1105.

[Golovanevsky] O. Golovanevsky, A. Zaks, Struct-reorg: current status and future perspectives, Proceedings of the 2007 GCC Developers' Summit.

[Günther] R. Günther, Three-dimensional Parallel Hydrodynamics and Astrophysical Applications, PhD thesis, Eberhard-Karls-Universität, 2005.

[Henderson] R. Henderson, A. Hernandez, A. Macleod, D. Novillo, Tuple Representation for GIMPLE.

[Hubicka04] J. Hubička, Call graph module in GCC, Proceedings of the 2004 GCC Developers' Summit.

[Hubicka05] J. Hubička, Profile driven optimisations in GCC, Proceedings of the 2005 GCC Developers' Summit.

[Hubicka06] J. Hubička, Interprocedural optimization on function local SSA form, Proceedings of the 2006 GCC Developers' Summit.

[Hubicka07] J. Hubička, Interprocedural optimization framework in GCC, Proceedings of the 2007 GCC Developers' Summit.

[Jambor] M. Jambor, Interprocedural optimizations of function parameters, Proceedings of the 2009 GCC Developers' Summit, 53–64.

20 [Ladelsky05] R. Ladelsky, M. Namolaru, Interprocedural constant propagation and method versioning in GCC, Proceedings of the 2005 GCC Developers’ Summit.

[Ladelsky07] R. Ladelsky, Matrix flattening and transposing in GCC, Proceedings of the 2007 GCC Developers’ Summit.

[Lattner] C. Lattner, V. Adve, LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation, International Symposium on Code Generation and Optimization (CGO’04), 2004.

[Li] D. X. Li, R. Ashok, R. Hundt, Lightweight feedback-directed cross-module optimization, Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, 2010.

[Novillo] D. Novillo, Design and implementation of Tree SSA, Proceedings of the 2004 GCC Developers’ Summit.

[Zhao] P. Zhao, J. N. Amaral, To inline or not to inline? Enhanced inlining decisions, Languages and Compilers for Parallel Computing, 2004.

[Plugin] http://gcc.gnu.org/wiki/LinkTimeOptimization.

[Anandtech] http://www.anandtech.com/show/2738/17.

[LTOproposal] Link-Time Optimization in GCC: Requirements and High-Level Design, November 15, 2005. http://gcc.gnu.org/projects/lto/lto.pdf

[GCCmanual] Using the GNU Compiler Collection (GCC).
