
BOLT: A Practical Binary Optimizer for Data Centers and Beyond

Maksim Panchenko    Rafael Auler    Bill Nell    Guilherme Ottoni
Facebook, Inc.
Menlo Park, CA, USA
{maks,rafaelauler,bnell,ottoni}@fb.com

Abstract

Performance optimization for large-scale applications has recently become more important as computation continues to move towards data centers. Data-center applications are generally very large and complex, which makes code layout an important optimization to improve their performance. This has motivated recent investigation of practical techniques to improve code layout at both compile time and link time. Although post-link optimizers had some success in the past, no recent work has explored their benefits in the context of modern data-center applications.

In this paper, we present BOLT, a post-link optimizer built on top of the LLVM framework. Utilizing sample-based profiling, BOLT boosts the performance of real-world applications even for highly optimized binaries built with both feedback-driven optimizations (FDO) and link-time optimizations (LTO). We demonstrate that post-link performance improvements are complementary to conventional compiler optimizations, even when the latter are done at a whole-program level and in the presence of profile information. We evaluated BOLT on both Facebook data-center workloads and open-source compilers. For data-center applications, BOLT achieves up to 8.0% performance speedups on top of profile-guided function reordering and LTO. For the GCC and Clang compilers, our evaluation shows that BOLT speeds up their binaries by up to 20.4% on top of FDO and LTO, and by up to 52.1% if the binaries are built without FDO and LTO.

1. Introduction

Given the large scale of data centers, optimizing their workloads has recently gained a lot of interest. Modern data-center applications tend to be very large and complex programs. Due to their sheer amount of code, optimizing the code locality for these applications is very important to improve their performance.

The large size and performance bottlenecks of data-center applications make them good targets for feedback-driven optimizations (FDO), also called profile-guided optimizations (PGO), particularly code layout. At the same time, the large sizes of these applications also impose scalability challenges to apply FDO to them. Instrumentation-based profilers incur significant memory and computational performance costs, often making it impractical to gather accurate profiles from a production system. To simplify deployment and increase adoption, it is desirable to have a system that can obtain profile data for FDO from unmodified binaries running in their normal production environments. This is possible through the use of sample-based profiling, which enables high-quality profiles to be gathered with minimal operational complexity. This is the approach taken by tools such as Ispike [21], AutoFDO [6], and HFSort [25]. This same principle is used as the basis of the BOLT tool presented in this paper.

Profile data obtained via sampling can be retrofitted to multiple points in the compilation chain. The point at which the profile data is used can vary from compilation time (e.g. AutoFDO [6]), to link time (e.g. LIPO [18] and HFSort [25]), to post-link time (e.g. Ispike [21]). In general, the earlier in the compilation chain the profile information is inserted, the larger the potential for its impact, because more phases and optimizations can benefit from this information. This benefit has motivated recent work on compile-time and link-time FDO techniques. At the same time, post-link optimizations, which in the past were explored by a series of proprietary tools such as Spike [8], Etch [28], FDPR [11], and Ispike [21], have not attracted much attention in recent years. We believe that the lack of interest in post-link optimizations is due to folklore and the intuition that this approach is inferior because the profile data is injected very late in the compilation chain.

In this paper, we demonstrate that the intuition described above is incorrect. The important insight that we leverage in this work is that, although injecting profile data earlier in the compilation chain enables its use by more optimizations, injecting this data later enables more accurate use of the information for better code layout. In fact, one of the main challenges with AutoFDO is to map the profile data, collected at the binary level, back to the compiler's intermediate representations [6]. In the original compilation used to produce the binary where the profile data is collected, many optimizations are applied to the code by the compiler and linker before the machine code is emitted. In a post-link optimizer, which operates at the binary level, this problem is much simpler, resulting in more accurate use of the profile data. This accuracy is particularly important for low-level optimizations such as code layout.

We demonstrate the finding above in the context of a static binary optimizer we built, called BOLT. BOLT is a modern, retargetable binary optimizer built on top of the LLVM compiler infrastructure [16]. Our experimental evaluation on large real-world applications shows that BOLT can improve performance by up to 20.41% on top of FDO and LTO. Furthermore, our analysis demonstrates that this improvement is mostly due to the improved code layout that is enabled by the more accurate usage of sample-based profile data at the binary level.

Overall, this paper makes the following contributions:

1. It describes the design of a modern, open-source post-link optimizer built on top of the LLVM infrastructure.¹
2. It demonstrates empirically that a post-link optimizer is able to better utilize sample-based profiling data to improve code layout compared to a compiler-based approach.
3. It shows that neither compile-time, link-time, nor post-link-time FDO supersedes the others but, instead, they are complementary.

This paper is organized as follows. Section 2 motivates the case for using sample-based profiling and static binary optimization to improve the performance of large-scale applications. Section 3 then describes the architecture of the BOLT binary optimizer, followed by a description of the optimizations that BOLT implements in Section 4 and a discussion about profiling techniques in Section 5. An evaluation of BOLT and a comparison with other techniques is presented in Section 6. Finally, Section 7 discusses related work and Section 8 concludes the paper.

¹ BOLT is available at https://github.com/facebookincubator/BOLT.

[Figure 1 diagram: Parser → Compiler IR → Code Gen. → Object Files → Linker → Binary → Binary Opt. → Optimized Binary, with Profile Data feeding into the alternative injection points]

Figure 1: Example of a compilation pipeline and the various alternatives to retrofit sample-based profile data.

2. Motivation

In this section, we motivate the post-link optimization approach used by BOLT.

2.1 Why sample-based profiling?

Feedback-driven optimizations (FDO) have been proved to help increase the impact of code optimizations in a variety of systems (e.g. [6, 9, 13, 18, 24]). Early developments in this area relied on instrumentation-based profiling, which requires a special instrumented build of the application to collect profile data. This approach has two drawbacks. First, it complicates the build process, since it requires a special build for profile collection. Second, instrumentation typically incurs very significant CPU and memory overheads. These overheads generally render instrumented binaries inappropriate for running in real production environments.

In order to increase the adoption of FDO in production environments, recent work has investigated FDO-style techniques based on sample-based profiling [6, 7, 25]. Instead of instrumentation, these techniques rely on much cheaper sampling using hardware profile counters available in modern CPUs, such as Intel's Last Branch Records (LBRs) [15]. This approach is more attractive not only because it does not require a special build of the application, but also because the profile-collection overheads are negligible. By addressing the two main drawbacks of instrumentation-based FDO techniques, sample-based profiling has increased the adoption of FDO-style techniques in complex, real-world production systems [6, 25]. For these same practical reasons, we opted to use sample-based profiling in this work.

2.2 Why a binary optimizer?

Sample-based profile data can be leveraged at various levels in the compilation pipeline. Figure 1 shows a generic compilation pipeline to convert source code into machine code. As illustrated in Figure 1, the profile data may be injected at different program-representation levels, ranging from source code, to the compiler's intermediate representations (IR), to the linker, to post-link optimizers. In general, the designers of any FDO tool are faced with the following trade-off. On the one hand, injecting profile data earlier in the pipeline allows more optimizations along the pipeline to benefit from this data. On the other hand, since sample-based profile data must be collected at the binary level, the closer a level is to this representation, the higher the accuracy with which the data can be mapped back to this level's program representation. Therefore, a post-link binary optimizer allows the profile data to be used with the greatest level of accuracy.

    (01) function foo(int x) {
    (02)   if (x > 0) {
    (03)     ...                // B1
    (04)   } else {
    (05)     ...                // B2
    (06)   }
    (07) }
    (08) function bar() {
    (09)   foo(... /* > 0 */);  // gets inlined
    (10) }
    (11) function baz() {
    (12)   foo(... /* < 0 */);  // gets inlined
    (13) }

Figure 2: Example showing a challenge in mapping binary-level events back to higher-level code representations.

AutoFDO [6] retrofits profile data back into a compiler's intermediate representation (IR). Chen et al. [7] quantified the precision of the profile data that is lost by retrofitting profile data even at a reasonably low-level representation in the GCC compiler. They quantified that the profile data had 84.1% accuracy, which they were able to improve to 92.9% with some techniques described in that work.

The example in Figure 2 illustrates the difficulty in mapping binary-level performance events back to a higher-level representation. In this example, both functions bar and baz call function foo, which gets inlined in both callers. Function foo contains a conditional branch for the if statement on line (02). For forward branches like this, on modern processors, it is advantageous to make the most common successor be the fall-through, which can lead to better branch prediction and instruction-cache locality. This means that, when foo is inlined into bar, block B1 should be placed before B2, but the blocks should be placed in the opposite order when inlined into baz. When this program is profiled at the binary level, two branches corresponding to the if on line (02) will be profiled, one within bar and one within baz. Assume that functions bar and baz execute the same number of times at runtime. Then, when mapping the branch frequencies back to the source code in Figure 2, one will conclude that the branch on line (02) has a 50% chance of branching to either B1 or B2. And, after foo is inlined in both bar and baz, the compiler will not be able to tell what layout is best in each case. Notice that, although this problem can be mitigated by injecting the profile data into a lower-level representation after function inlining has been performed, this does not solve the problem in case foo is declared in a different module than bar and baz, because in this case inlining cannot happen until link time.

Since our initial motivation for BOLT was to improve large-scale data-center applications, where code layout plays a major role, a post-link binary optimizer was very appealing. Traditional code-layout techniques are highly dependent on accurate branch frequencies [26], and using inaccurate profile data can actually lead to performance degradation [7]. Nevertheless, as we mentioned earlier, feeding profile information at a very low level prevents earlier optimizations in the compilation pipeline from leveraging this information. Therefore, with this approach, any optimization that we want to benefit from the profile data needs to be applied at the binary level. Fortunately, code layout algorithms are relatively simple and easy to apply at the binary level.

2.3 Why a static binary optimizer?

The benefits of a binary-level optimizer outlined above can be exploited either statically or dynamically. We opted for a static approach for two reasons. The first one is the simplicity of the approach. The second was the absence of runtime overheads. Even though dynamic binary optimizers have had some success in the past (e.g. Dynamo [2], DynamoRIO [5], StarDBT [29]), these systems incur non-trivial overheads that go against the main goal of improving the overall performance of the target application. In other words, these systems need to perform really well in order to recover their overheads and achieve a net performance win. Unfortunately, since they need to keep their overheads low, these systems often have to implement faster, sub-optimal code optimization passes. This has been a general challenge to the adoption of dynamic binary optimizers, as they are not suited for all applications and can easily degrade performance if not tuned well. The main benefit of a dynamic binary optimizer over a static one is the ability to handle dynamically generated and self-modifying code.

3. Architecture

Large-scale data-center binaries may contain over 100 MB of code from multiple source-code languages, including assembly language. In this section, we discuss the design of the BOLT binary optimizer that we created to operate in this scenario.

3.1 Initial Design

We developed BOLT by incrementally increasing its binary coverage. At first, BOLT was only able to optimize the code layout of a limited set of functions. With time, code coverage gradually increased by adding support for more complex functions. Even today, BOLT is still able to leave some functions in the binary untouched while processing and optimizing others, conservatively skipping code that violates its current assumptions.

The initial implementation targeted x86_64 Linux ELF binaries and relied exclusively on ELF symbol tables to guide binary content identification. By doing that, BOLT was able to optimize code layout within existing function boundaries. When BOLT was not able to reconstruct the control-flow graph of a given function with full confidence, it would just leave the function untouched.

Due to the nature of code layout optimizations, the effective code size may increase for a couple of reasons. First, this may happen due to an increase in the number of branches on cold paths. Second, there is a peculiarity of x86's conditional branch instruction, which occupies 2 bytes if a (signed) offset to a destination fits in 8 bits but otherwise takes 6 bytes for 32-bit offsets. Naturally, moving cold code further away showed a tendency to increase the hot code size. If an optimized function would not fit into the original function's allocated space, BOLT would split the cold code and move it to a newly created ELF segment. Note that such function splitting was involuntary and did not provide any extra benefit beyond allowing code-straightening optimizations, as BOLT was not filling out the freed space between the split point and the next function.
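To make the branch-size peculiarity concrete, the following sketch (our illustration, not BOLT code) estimates the encoding size of an x86_64 conditional branch; moving a cold successor far away forces the 6-byte form onto the hot path:

    // Illustrative sketch: a Jcc with a signed 8-bit displacement takes
    // 2 bytes, while the near form with a 32-bit displacement takes 6
    // bytes (two opcode bytes plus a 4-byte displacement).
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    std::size_t jccEncodingSize(int64_t BranchAddr, int64_t TargetAddr) {
      // The displacement is relative to the end of the instruction, so
      // try the 2-byte short form first.
      int64_t ShortDisp = TargetAddr - (BranchAddr + 2);
      if (ShortDisp >= INT8_MIN && ShortDisp <= INT8_MAX)
        return 2; // short Jcc rel8
      return 6;   // near Jcc rel32
    }

    int main() {
      // A target 98 bytes away fits in rel8; one moved ~1 MB away does not.
      std::printf("%zu %zu\n", jccEncodingSize(0x1000, 0x1064),
                  jccEncodingSize(0x1000, 0x101000));
    }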

[Figure 3 diagram: Function discovery → Read debug info → Read profile data → Disassembly → CFG construction → Optimization pipeline → Emit and link functions → Rewrite binary file]

Figure 3: Diagram showing BOLT's binary rewriting pipeline.

3.2 Relocations Mode

A second and more ambitious mode was later added to operate by changing the position of all functions in the binary. While multiple approaches were considered, the most obvious and straightforward one was to rely on relocations recorded and saved by the linker in an executable. Both the BFD and Gold linkers provide such an option (--emit-relocs). However, even with this option, there are still some missing pieces of information. An example is the relative offsets for PIC jump tables, which are removed by the linker. Other examples are some relocations that are not visible even to the linker, such as cross-function references for local functions within a single compilation unit (they are processed internally by the compiler). Therefore, in order to detect and fix such references, it is important to disassemble all the code correctly before trying to rearrange the functions in the binary. Nevertheless, with relocations, the job of gaining complete control over code rewriting became much easier. Handling relocations gives BOLT the ability to change the order of functions in the binary and split function bodies to further improve code locality.

Since linkers have access to relocations, it would be possible to use them for similar binary optimizations. However, there are multiple open-source linkers for x86 Linux alone, and which one is being used for any particular application depends on a number of circumstances that may also change over time. Therefore, in order to facilitate the tool's adoption, we opted for writing an independent post-link optimizer instead of being tied to a specific linker.

3.3 Rewriting Pipeline

Analyzing an arbitrary binary and locating code and data is not trivial. In fact, the problem of precisely disassembling machine code is undecidable in the general case. In practice, there is more information than just an entry point available, and BOLT relies on correct ELF symbol table information for code discovery. Since BOLT works with 64-bit Linux binaries, the ABI requires the inclusion of function frame information that contains function boundaries as well. While BOLT could have relied on this information, it is often the case that functions written in assembly omit frame information. Thus, we decided to employ a hybrid approach using both symbol table and frame information when available.

Figure 3 shows a diagram with BOLT's rewriting steps. Function discovery is the very first step, where function names are bound to addresses. Later, debug information and profile data are retrieved so that disassembly of individual functions can start.

BOLT uses the LLVM compiler infrastructure [16] to handle disassembly and modification of binary files. There are a couple of reasons LLVM is well suited for BOLT. First, LLVM has a modular design that enables relatively easy development of tools based on its infrastructure. Second, LLVM supports multiple target architectures, which allows for easily retargetable tools. To illustrate this point, a working prototype for the ARM architecture was implemented in less than a month. In addition to the assembler and disassembler, many other components of LLVM proved to be useful while building BOLT. Overall, this decision to use LLVM has worked out well. The LLVM infrastructure has enabled a quick implementation of a robust and easily retargetable binary optimizer.

As Figure 3 shows, the next step in the rewriting pipeline is to build the control-flow graph (CFG) representation for each of the functions. The CFG is constructed using the MCInst objects provided by LLVM's Tablegen-generated disassembler. BOLT reconstructs the control-flow information by analyzing any branch instructions encountered during disassembly. Then, in the CFG representation, BOLT runs its optimization pipeline, which is explained in detail in Section 4. For BOLT, we have added a generic annotation mechanism to MCInst in order to facilitate certain optimizations, e.g. as a way of recording dataflow information. The final steps involve emitting functions and using LLVM's run-time dynamic linker (created for the LLVM JIT systems) to resolve references among functions and local symbols (such as basic blocks). Finally, the binary is rewritten with the new contents, while also updating ELF structures to reflect the new sizes.
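The CFG-construction step can be pictured with the following simplified sketch (ours; BOLT's actual implementation works on LLVM MCInst objects): basic-block leaders are derived from branch targets and fall-through addresses discovered during disassembly, and edges are added by analyzing each branch:

    // Simplified CFG construction from a disassembled instruction stream.
    // Types are illustrative stand-ins, not LLVM's API. Fall-through edges
    // between non-branch blocks are omitted for brevity.
    #include <cstdint>
    #include <map>
    #include <optional>
    #include <set>
    #include <vector>

    struct Instr {
      uint64_t Addr;
      unsigned Size;
      bool IsBranch;                  // conditional or unconditional jump
      bool IsConditional;
      std::optional<uint64_t> Target; // empty for indirect branches
    };

    struct BasicBlock {
      uint64_t Start;
      std::vector<uint64_t> Succs;
    };

    std::map<uint64_t, BasicBlock> buildCFG(const std::vector<Instr> &Insts) {
      std::map<uint64_t, BasicBlock> CFG;
      if (Insts.empty())
        return CFG;
      // Leaders: the entry point, branch targets, and the instruction
      // following each branch.
      std::set<uint64_t> Leaders{Insts.front().Addr};
      for (const Instr &I : Insts) {
        if (!I.IsBranch)
          continue;
        if (I.Target)
          Leaders.insert(*I.Target);
        Leaders.insert(I.Addr + I.Size);
      }
      for (uint64_t L : Leaders)
        CFG[L] = BasicBlock{L, {}};
      // Attribute branch edges to the enclosing block.
      uint64_t Cur = Insts.front().Addr;
      for (const Instr &I : Insts) {
        if (Leaders.count(I.Addr))
          Cur = I.Addr;
        if (!I.IsBranch)
          continue;
        if (I.Target)
          CFG[Cur].Succs.push_back(*I.Target);
        if (I.IsConditional)
          CFG[Cur].Succs.push_back(I.Addr + I.Size); // fall-through edge
      }
      return CFG;
    }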

     Pass name            Description
 1.  strip-rep-ret        Strip repz from repz retq instructions used for legacy AMD processors
 2.  icf                  Identical code folding
 3.  icp                  Indirect call promotion
 4.  peepholes            Simple peephole optimizations
 5.  inline-small         Inline small functions
 6.  simplify-ro-loads    Fetch constant data in .rodata whose address is known statically and mutate a load into a mov
 7.  icf                  Identical code folding (second run)
 8.  plt                  Remove indirection from PLT calls
 9.  reorder-bbs          Reorder basic blocks and split hot/cold blocks into separate sections (layout optimization)
10.  peepholes            Simple peephole optimizations (second run)
11.  uce                  Eliminate unreachable basic blocks
12.  fixup-branches       Fix basic block terminator instructions to match the CFG and the current layout (redone by reorder-bbs)
13.  reorder-functions    Apply HFSort [25] to reorder functions (layout optimization)
14.  sctc                 Simplify conditional tail calls
15.  frame-opts           Remove unnecessary caller-saved register spilling
16.  shrink-wrapping      Move callee-saved register spills closer to where they are needed, if profiling data shows it is better to do so

Table 1: Sequence of transformations applied in BOLT's optimization pipeline.

    Binary Function "_Z11filter_onlyi" after building cfg {
      State       : CFG constructed
      Address     : 0x400ab1
      Size        : 0x2f
      Section     : .text
      LSDA        : 0x401054
      IsSimple    : 1
      IsSplit     : 0
      BB Count    : 5
      CFI Instrs  : 4
      BB Layout   : .LBB07, .LLP0, .LFT8, .Ltmp10, .Ltmp9
      Exec Count  : 104
      Profile Acc : 100.0%
    }
    .LBB07 (11 instructions, align : 1)
      Entry Point
      Exec Count : 104
      CFI State : 0
        00000000: pushq %rbp                 # exception4.cpp:22
        00000001: !CFI $0 ; OpDefCfaOffset -16
        00000001: !CFI $1 ; OpOffset Reg6 -16
        00000001: movq %rsp, %rbp            # exception4.cpp:22
        00000004: !CFI $2 ; OpDefCfaRegister Reg6
        00000004: subq $0x10, %rsp           # exception4.cpp:22
        00000008: movl %edi, -0x4(%rbp)      # exception4.cpp:22
        0000000b: movl -0x4(%rbp), %eax      # exception4.cpp:23
        0000000e: movl %eax, %edi            # exception4.cpp:23
        00000010: callq _Z3fooi # handler: .LLP0; action: 1
                                             # exception4.cpp:23
        00000015: jmp .Ltmp9                 # exception4.cpp:24
      Successors: .Ltmp9 (mispreds: 0, count: 100)
      Landing Pads: .LLP0 (count: 4)
      CFI State: 3
    .LLP0 (2 instructions, align : 1)
      Landing Pad
      Exec Count : 4
      CFI State : 3
      Throwers: .LBB07
        00000017: cmpq $-0x1, %rdx           # exception4.cpp:24
        0000001b: je .Ltmp10                 # exception4.cpp:24
      Successors: .Ltmp10 (mispreds: 0, count: 4),
                  .LFT8 (inferred count: 0)
      CFI State: 3
    ....

Figure 4: Partial CFG dump for a function with C++ exceptions.

3.4 C++ Exceptions and Debug Information

BOLT is able to recognize DWARF [10] information and update it to reflect the code modifications and relocations performed during the rewriting pass.

Figure 4 shows an example of a CFG dump demonstrating BOLT's internal representation of the binary for the first two basic blocks of a function with C++ exceptions and a throw statement. The function is quite small, with only five basic blocks in total, and each basic block is free to be relocated to another position, except the entry point. Placeholders for DWARF Call Frame Information (CFI) instructions are used to annotate positions where the frame state changes (for example, when the stack pointer advances). BOLT rebuilds all CFI for the new binary based on these annotations so the frame unwinder works properly when an exception is thrown. The callq instruction at offset 0x00000010 can throw an exception and has a designated landing pad, as indicated by the landing-pad annotation displayed next to it (handler: .LLP0; action: 1). The last annotation on the line indicates the source line of origin for every machine-level instruction.

4. Optimizations

BOLT runs passes with either code transformations or analyses, similar to a compiler. BOLT is also equipped with a dataflow-analysis framework to feed information to passes that need it. This enables BOLT to check register liveness at a given program point, a technique also used by Ispike [21]. Some passes are architecture-independent while others are not. In this section, we discuss the passes applied to the Intel x86_64 target.

Table 1 shows each individual BOLT optimization pass in the order they are applied. For example, the first line presents strip-rep-ret at the start of the pipeline. Notice that passes 1 and 4 are focused on leveraging precise target architecture information to remove or mutate some instructions. A use case of BOLT for data-center applications is to allow the user to trade any optional choices in the instruction space in favor of I-cache space, such as removing alignment NOPs and AMD-friendly REPZ bytes, or using shorter versions of instructions. Our findings show that, for large applications, it is better to aggressively reduce I-cache occupation, except if the change incurs D-cache overhead, since cache is one of the most constrained resources in the data-center space. This explains BOLT's policy of discarding all NOPs after reading the input binary. Even though compiler-generated alignment NOPs are generally useful, the extra space required by them does not pay off, and simply stripping them from the binary provides a small but measurable performance improvement.
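As a concrete illustration of this policy, the sketch below (ours, not BOLT's implementation) shows the two simplest instruction-space reductions discussed above: dropping alignment NOPs after disassembly and rewriting repz retq (bytes 0xF3 0xC3) as a plain retq:

    // Illustrative peephole over raw decoded instructions.
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct RawInstr {
      std::vector<uint8_t> Bytes;
      bool IsNop; // any NOP encoding recognized by the disassembler
    };

    void stripNopsAndRepRet(std::vector<RawInstr> &Func) {
      std::vector<RawInstr> Out;
      for (RawInstr &I : Func) {
        if (I.IsNop)
          continue; // alignment NOPs are dropped; layout is redone anyway
        if (I.Bytes.size() == 2 && I.Bytes[0] == 0xF3 && I.Bytes[1] == 0xC3)
          I.Bytes = {0xC3}; // repz retq -> retq
        Out.push_back(std::move(I));
      }
      Func = std::move(Out);
    }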

BOLT features identical code folding (ICF) to complement the ICF optimization done by the linker. An additional benefit of doing ICF at the binary level is the ability to optimize functions that were compiled without the -ffunction-sections flag and functions that contain jump tables. As a result, BOLT is able to fold more identical functions than the linkers. We have measured the reduction of code size for the HHVM binary [1] to be about 3% on top of the linker's ICF pass.

Passes 3 (indirect call promotion), 5 (inline small functions), and 8 (PLT call optimization) leverage call frequency information to either eliminate or mutate a function call into a more performant version. We note that BOLT's function inlining is a limited version of what compilers perform at higher levels. We expect that most of the inlining opportunities will be leveraged by the compiler (potentially using FDO). The remaining inlining opportunities for BOLT are typically exposed by more accurate profile data, BOLT's indirect-call promotion (ICP) optimization, its cross-module nature, or a combination of these factors.

Pass 6, simplification of load instructions, explores a tricky trade-off by fetching data from statically known values (in read-only sections). In these cases, BOLT may convert loads into immediate-loading instructions, relieving pressure from the D-cache but possibly increasing pressure on the I-cache, since the data is now encoded in the instruction stream. BOLT's policy in this case is to abort the promotion if the new instruction encoding is larger than the original load instruction, even if it means avoiding an arguably more computationally expensive load instruction. However, we found that such opportunities are not very frequent in our workloads.

Pass 9, reorder and split hot/cold basic blocks, reorders basic blocks according to the most frequently executed paths, so the hottest successor will most likely be a fall-through, reducing taken branches and relieving pressure from the branch predictor unit.

Finally, pass 13 reorders the functions via the HFSort technique [25]. This optimization mainly improves I-TLB performance, but it also helps with the I-cache to a smaller extent. Combined with pass 9, these are the most effective passes in BOLT because they directly optimize the code layout.
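To illustrate the shape of pass 3 (indirect call promotion) described above, consider the following C++ rendering (ours; BOLT performs the equivalent rewrite on machine code, using profile counts to pick the dominant target):

    // Conceptual before/after for indirect call promotion. The profile
    // says most calls from this site go to hot_target, so the indirect
    // call is guarded by a pointer comparison and replaced with a direct
    // call on that path.
    using Callback = void (*)(int);

    void hot_target(int) { /* dominant callee per profile */ }

    // Before: every call is indirect through the function pointer.
    void dispatch_before(Callback cb, int arg) { cb(arg); }

    // After promotion: direct call on the hot path, indirect fallback.
    void dispatch_after(Callback cb, int arg) {
      if (cb == &hot_target)
        hot_target(arg); // direct call: cheaper, predictable, inlinable
      else
        cb(arg);         // remaining targets keep the indirect call
    }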
5. Profiling Techniques

This section discusses pitfalls and caveats of different sample-based profiling techniques when trying to produce accurate profiling data.

5.1 Techniques

In recent Intel microprocessors, the LBR is a list of the last 32 taken branches. LBRs are important for profile-guided optimizations not only because they provide accurate counts for critical edges (which cannot be inferred even with perfect basic block count profiling [17]), but also because they make block-layout algorithms more resilient to bad sampling. When evaluating several different sampling events to collect LBRs for BOLT, we found that the performance impact in LBR mode is very consistent even for different sampling events. We have experimented with collecting LBR data with multiple hardware events on Intel x86, including retired instructions, taken branches, and cycles, and also experimented with different levels of Precise Event Based Sampling (PEBS) [15]. In all these cases, for a workload for which BOLT provided a 5.4% speedup, the performance differences were within 1%. In non-LBR mode, using biased events with a non-ideal algorithm to infer edge counts can cause as much as a 5% performance penalty when compared to LBR, meaning it misses nearly all optimization opportunities. An investigation showed that non-LBR techniques can be tuned to stay under 1% worse than LBR in this example workload, but if LBR is available in the processor, one is better off using it to obtain higher and more robust performance numbers. We also evaluate this effect for HHVM in Section 6.5.

5.2 Consequences for Block Layout

Using LBRs, in a hypothetical worst-case biasing scenario where all samples in a function are recorded in the same basic block, BOLT will lay out blocks in the order of the path that leads to this block. It is an incomplete layout that misses the ordering of successor blocks, but it is neither an invalid nor a cold path. In contrast, when trying to infer the same edge counts with non-LBR samples, the scenario is that of a single hot basic block with no information about which path was taken to get to it.

In practice, even in LBR mode, the collected profile is often contradictory, stating that predecessors execute many times more than their single successor, among other violations of the flow equations.² Previous work [17, 23], which includes techniques implemented in IBM's FDPR [12], reports handling the problem of reconstructing edge counts by solving an instance of minimum cost flow (MCF [17]), a graph network flow problem. However, these reports predate LBRs. LBRs only store taken branches, so when handling very skewed data such as the cases mentioned above, BOLT satisfies the flow equation by attributing all surplus flow to the non-taken path that is naturally missing from the LBR, similarly to Chen et al. [7]. BOLT also benefits from being applied after the static compiler: to cope with uncertainty, by putting weight on the fall-through path, it trusts the original layout done by the static compiler. Therefore, the program trace needs to show a significant number of taken branches, which contradict the original layout done by the compiler, to convince BOLT to reorder the blocks and change the original fall-through path. Without LBRs, it is not possible to take advantage of this: algorithms start with guesses for both taken and non-taken branches, without being sure if the taken branches, which are taken for granted in LBR mode, are real or the result of bad edge-count inference.

² I.e., the sum of a block's input flow is equal to the sum of its output flow.
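A minimal sketch of this surplus-flow rule (our illustration of the idea, not BOLT's code) looks as follows: since LBRs record only taken branches, the fall-through count of a block is recovered as whatever flow is needed to balance the block:

    // Infer non-taken (fall-through) edge counts from LBR-derived data by
    // attributing all surplus flow to the fall-through path.
    #include <cstdint>
    #include <vector>

    struct Block {
      uint64_t ExecCount;        // block execution count from the trace
      uint64_t TakenOutCount;    // sum of taken branches recorded by LBRs
      uint64_t FallthroughCount; // to be inferred
    };

    void inferFallthroughs(std::vector<Block> &Blocks) {
      for (Block &B : Blocks) {
        // Skewed samples can make TakenOutCount exceed ExecCount, so clamp
        // at zero instead of producing a negative count.
        B.FallthroughCount =
            B.ExecCount > B.TakenOutCount ? B.ExecCount - B.TakenOutCount : 0;
      }
    }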

6 10 5.3 Consequences for Function Layout 8 6 BOLT uses HFSort [25] to perform function reordering 4 based on a weighted call graph. If LBRs are used, the 2 edge weights of the call graph are directly inferred from the % Speedup 0 branch records, which may also include function calls and HHVM TAO Proxygen Multifeed1 Multifeed2 GeoMean returns. However, without LBRs, BOLT is still able to build Figure 5: Performance improvements from BOLT for our set an incomplete call graph by looking at the direct calls in the of Facebook data-center workloads. binary and creating caller-callee edges with weights corre- sponding to the number of samples recorded in the blocks 20 containing the corresponding call instructions. However, 15 this approach cannot take indirect calls into account. Even 10 with these limitations, we did not observe a performance 5 penalty as severe as using non-LBR mode for basic block 0 reordering (Section 6.5) % Miss Reduction Branch D-Cache I-Cache I-TLB D-TLB LLC Figure 6: Improvements on micro-architecture metrics for 6. Evaluation HHVM. This section evaluates BOLT in a variety of scenarios, in- BOLT PGO+LTO PGO+LTO+BOLT % 49 % % % cluding Facebook server workloads and the GCC and Clang 100 . % % 25 14 % 42 68 . % . % . % 15 92 % 22 % . open-source compilers. A comparison with GCC’s and . 53 98 52 . 54 49 93 . . . 27 52 40 39 . 36 . 33

50 30 Clang’s PGO and LTO is also provided in some scenarios. 29 22 21

The evaluation presented in this section was conducted % Speedup on Linux-based servers featuring Intel microprocessors. 0 input1 input2 input3 clang-build 6.1 Facebook Workloads Figure 7: Performance improvements for Clang. The impact of BOLT was measured on five binaries inside Facebook’s data centers. The first is HHVM [1], the PHP that powers the web servers at Facebook and that HHVM, despite containing a large amount of dynami- many other web sites, including Baidu and Wikipedia. The cally compiled code that is not optimized by BOLT, spends second is TAO [4], a highly distributed, in-memory, data- more time in statically compiled code than in the dynam- caching service used to store Facebook’s social graph. The ically generated code. Among these applications, HHVM third one is Proxygen, which is a cluster load balancer built has the largest total code size, which makes it very front-end on top of the open-source with the same name [27]. bound and thus more amenable to the code layout optimiza- Finally, the other two binaries implement a service called tions that BOLT implements. Multifeed, which is used to select what is shown in the To better understand the performance benefits of BOLT, Facebook News Feed. we performed a more detailed performance analysis of In this evaluation, we compared the performance impact HHVM. Figure 6 shows BOLT’s improvements on impor- of BOLT on top of binaries built using GCC and function tant performance metrics, including i-cache misses, i-TLB reordering via HFSort [25]. The HHVM binary specifically misses, branch misses, and LLC misses. Improving branch is compiled with LTO to further enhance its performance. prediction is an important benefit from the block layout op- Unfortunately, a comparison with FDO and AutoFDO was timization done by BOLT, and for HHVM this metric im- not possible. The difficulties with FDO were the common proved by 11%. Moreover, improving locality leads to better ones outlined in Section 2.1 to deploy instrumented binaries metrics across the entire cache hierarchy, specially the first in these applications’ normal production environments. And level of i-cache, which exhibits 18% reduction in misses. It we found that AutoFDO support in the latest version of GCC is possible to see a small improvement of 1% in the first level available in our environment (version 5.4.1) is not stable of d-cache as well, due to reordering jump tables for locality and caused either internal compiler errors or runtime errors and frame optimizations. The observed TLB improvements related to C++ exceptions. Nevertheless, a direct comparison come from packing accessed instructions and data into fewer between BOLT and FDO was possible for other applications, pages. To better illustrate how cache and TLB locality are and the results are presented in Section 6.2. improved, we present heat maps of address-space accesses Figure 5 shows the performance results for applying in Section 6.4. BOLT on top of HFSort for our set of Facebook data-center workloads (and, in case of HHVM, also on top of LTO). In 6.2 Clang and GCC Compilers all cases, BOLT’s application resulted in a speedup, with an BOLT should be able to improve the performance of any average of 5.4% and a maximum of 8.0% for HHVM. Note front-end bound application, not just data-center workloads.

7 BOLT PGO PGO+BOLT 60 % % 20-core (40-core with hyperthreading) IvyBridge (Intel(R) % % % % 52 08 . % . % 35 28 12 % % . . . 26 % Xeon(R) CPU E5-2680 v2 @ 2.80GHz) system with 32GiB % . 27

40 76 27 28 46 73 . . 24 24 24 . 99 . 42 21 . . 17

17 RAM. 16 15 13 20 12

% Speedup 6.2.2 GCC Setup 0 For our GCC evaluation, we used version 8.2.0. First, GCC input1 input2 input3 clang-build was built using the default build process. The result of this Figure 8: Performance improvements for GCC. Different bootstrap build was our baseline. Second, we built a PGO than Clang, we did not use LTO due to build errors. version using the following configuration:

--enable-linker-build-id --enable-bootstrap To test this theory, we ran BOLT on two open-source com- --enable-languages=c,c++ --with-gnu-as --with-gnu-ld pilers: Clang and GCC. --disable-multilib 6.2.1 Clang Setup Afterwards, make profiledbootstrap was used to For our Clang evaluation, we used the release 60 branch generate our PGO version of GCC. of , clang, and compiler-rt open-source repositories [19]. Since BOLT is incompatible with GCC function splitting, ´ We built a bootstrapped release version of the compiler first. we had to repeat the above builds passing BOOT CFLAGS=-O2 This stage1 compiler provided a baseline for our evaluation. -g -fno-reorder-blocks-and-partition´ to the make We then built an instrumented version of Clang,3 and then command. The resulting compiler, ready to be BOLTed, was used the instrumented compiler to build Clang again with used to build GCC again (our training input), this time with- default options. The collected profile data was used to do out the bootstrap. The profile was then recorded and con- another build of Clang with LTO enabled.4 This is referred verted using perf2bolt to YAML format, and the cc1plus as PGO+LTO in our chart. binary was optimized using BOLT with the same options Each of the 2 compilers was profiled with our training used for Clang and later copied over to GCC’s installation input, a full build of GCC. We used the Linux perf util- directory. ity with the option record -e cycles:u -j any,u. The All 4 different types of GCC compilers, 2 without BOLT profile from perf was converted using perf2bolt utility into and 2 with BOLT, were later used to build the Clang com- YAML format (-w option). Then the profile was used to op- piler using the default configuration. timize the compiler binary using BOLT with the following 6.2.3 Experimental Results options: Figures 7 and 8 show the experimental results for Clang and -b profile.yaml -reorder-blocks=cache+ -reorder-functions=hfsort+ -split-functions=3 -split-all-cold GCC, respectively. We observed a significant improvement -split-eh -dyno-stats -icf=1 -use-gnu-stack on both compilers by using BOLT. On top of GCC with PGO, BOLT provided a 7.45% speedup when doing a full The four compilers were then used to build Clang, and the build of Clang. On top of Clang with LTO and PGO, a 15.0% overall build time was measured for benchmarking purposes. speedup when doing a full build of Clang. For all builds above we used ninja instead of GNU make, Table 2 shows some statistics reported by BOLT as and for all benchmarks we ran them with -j40 clang op- it optimizes the Clang binaries for the baseline and with tions. We chose to build only the clang binary (as opposed PGO+LTO applied. These statistics are based on the input to the full build) to minimize the effect of link time on our profile data. Even when applied on top of PGO+LTO, BOLT evaluation. has a very significant impact in many of these metrics, par- We have also selected 3 Clang/LLVM source files rang- ticularly the ones that affect code locality. For example, we ing from small to large sizes and preprocessed those files see that BOLT reduces the number of taken branches by such that they could be compiled without looking up header 44.3% over PGO+LTO (69.8% over the baseline), which dependencies. The 3 source files we used are: significantly improves i-cache locality. 
• input1: tools/clang/lib/CodeGen/CGVTT.cpp 6.3 Analysis of Suboptimal Compiler Code Layout • input2: lib/ExecutionEngine/Orc/OrcCBindings.cpp Using BOLT’s -report-bad-layout option, we inspected • input3: lib/Target/X86/X86ISelLowering.cpp Clang’s binary built with LTO+PGO to identify frequently executed functions that contain cold basic blocks interleaved Each of the files was then compiled with -std=c++11 with hot ones. Combined with options -print-debug-info -O2 options multiple times, and the results were recorded and , this allowed us to trace for benchmarking purposes. Tests were run on a dual-node -update-debug-sections the source of such blocks. Using this methodology, we an- 3 -DLLVM BUILD INSTRUMENTED=ON alyzed such suboptimal code layout occurrences among the 4 -DLLVM ENABLE LTO=Full -DLLVM PROFDATA FILE=clang.profdata hottest functions. Our analysis revealed that the majority of

8 10 10 60 60

8 8

40 40 6 6

4 4 20 20 2 2

0 0 0 0 0 10 20 30 40 50 60 0 10 20 30 40 50 60

(a) without BOLT (b) with BOLT

Figure 9: Heat maps for instruction memory accesses of the HHVM binary, without and with BOLT. Heat is in a log scale.

Metric Over Baseline Over PGO+LTO Function: clang::Redeclarable::DeclLink::getNext(...) executed forward branches -1.6% -1.0% const taken forward branches -83.9% -61.1% Exec Count : 1723213 executed backward branches +9.6% +6.0% taken backward branches -9.2% -21.8% .Ltmp1100284 (4 instructions, align : 1) Exec Count : 1635334 executed unconditional branches -66.6% -36.3% Predecessors: .Ltmp1100286, .LBB087908 executed instructions -1.2% -0.7% 0000001d: movq %r12, %rbx # PointerIntPair.h:152:40 total branches -7.3% -2.2% 00000020: andq $-0x8, %rbx # PointerIntPair.h:152:40 taken branches -69.8% -44.3% 00000024: testb $0x4, %r12b # PointerUnion.h:143:9 non-taken conditional branches +60.0% +13.7% 00000028: je .Ltmp1100279 # ExternalASTSource.h:462:19 taken conditional branches -70.6% -46.6% Successors: .Ltmp1100279 (mispreds: 2036, count: 1635334), .LFT680413 (mispreds: 0, count: 0) Table 2: Statistics reported by BOLT when applied to .LFT680413 (2 instructions, align : 1) Exec Count : 0 Clang’s baseline and PGO+LTO binaries. Predecessors: .Ltmp1100284 0000002a: testq %rbx, %rbx # ExternalASTSource.h:462:19 0000002d: jne .Ltmp1100280 # ExternalASTSource.h:462:19 such cases originated from function inlining as motivated in Successors: .Ltmp1100280 (mispreds: 0, count: 0), .Ltmp1100279 (mispreds: 0, count: 0) the example in Figure 2. Figure 10 illustrates one of these functions at the binary level. This function contains 3 basic .Ltmp1100279 (9 instructions, align : 1) Exec Count : 1769771 blocks, each one corresponding to source code from a differ- Predecessors: .Ltmp1100284, .LFT680414, .Ltmp1100282, ent source file. In Figure 10, the blocks are annotated with .LFT680413 0000002f: movq %rbx, %rax # Redeclarable.h:140:5 their profile counts (Exec Count). The source code corre- 00000032: addq $0x28, %rsp # Redeclarable.h:140:5 sponding to block .LFT680413 is not cold, but it is very 00000036: popq %rbx # Redeclarable.h:140:5 00000037: popq %r12 # Redeclarable.h:140:5 cold when inlined in this particular call site. By operating at 00000039: popq %r13 # Redeclarable.h:140:5 the binary level and being guided by the profile data, BOLT 0000003b: popq %r14 # Redeclarable.h:140:5 0000003d: popq %r15 # Redeclarable.h:140:5 can easily identify these inefficiencies and improve the code 0000003f: popq %rbp # Redeclarable.h:140:5 layout. 00000040: retq # Redeclarable.h:140:5

6.4 Heat Maps Figure 10: Real example of poor code layout produced by Figure 9 shows heat maps of the instruction address space the Clang compiler (compiling itself) even with PGO. Block for HHVM running with Facebook production traffic. Fig- .LFT680413 is cold (Exec Count: 0), but it is placed be- ure 9a illustrates addresses fetched through I-cache for the tween two hot blocks connected by a forward taken branch. regular binary, while Figure 9b shows the same for HHVM processed with BOLT. This heat map is built as a matrix of addresses. Each line has 64 blocks and the complete graph has 64 lines. from X = 0 to X = 63 plots how code is being accessed in The HHVM binary chosen for this study has 148.2 MB of the first 2,316,032 bytes of the address space. The average text size, which is fully represented in the heat map. Each number of times a byte is fetched is reduced by a logarithm block represents 36,188 bytes and the heat map shows how function to help visualize the data, so we can easily identify many times, on average, each byte of a block is fetched as even code that is executed just a few times. Completely white indicated by profiling data. For example, the line at Y = 0 areas show cold basic blocks that were never sampled during

9 Functions BBs Both function reordering. The reason is because basic-block re-

15 % ordering requires more fine-grained profiling, at the basic- 2 . % % 8 39

16 block level, which is harder to obtain without LBRs. % % . % . %

10 % % 5 % 5 % 88 82 % % 43 % % . 16 % . % 75 . 71 . 41 2 2 . 03 . . 2 66 52 2 35 . 28 09 1 03 1 . . 1 . . . . 1 0

5 0 0 0 0 0 7. Related Work % Reduction 0 Binary or post-link optimizers have been extensively ex- Instructions Branch-missI-cache-miss LLC-miss iTLB-miss CPU time plored in the past. There are two different categories for binary optimization in general: static and dynamic, operat- Figure 11: Improvements on different metrics for HHVM by ing before program or during program execution. using LBRs (higher is better). Post-link optimizers such as BOLT are static binary opti- mizers. Large platforms for prototyping and testing dynamic profiling, while strong red highlights the most frequently binary optimizations are DynamoRIO [5] for same host or accessed areas of instruction memory. QEMU [3] for emulation. Even though it is challenging to Figure 9b demonstrates how BOLT packs together hot overcome the overhead of the virtual machine with wins due code to use about 4 MB of space instead of the original range to the optimizations themselves, these tools can be useful in spanning 148.2 MB. There is still some activity outside the performing dynamic binary instrumentation to analyze pro- dense hot area, but they are relatively cold. These functions gram execution, such as Pin [20] does, or debugging, which were ignored by BOLT’s function-reordering pass because is the main goal of Valgrind [22]. they have an indirect tail call. BOLT marks functions it can Static binary optimizers are typically focused on low- fully understand as simple as a mechanism to allow it to level program optimizations, preferably using information operate on the binary even if it does not fully process all about the precise host that will run the program. MAO [14] functions. Those are non-simple functions. is an example where microarchitectural information is used Indirect tail calls are more challenging for static binary to rewrite programs, although it rewrites source-level as- rewriters because it is difficult to guess if the target is an- sembly and not the binary itself. Naturally, static optimiz- other function or another basic block of the same function, ers tend to be architecture-specific. Ispike [21] is a post-link which could affect the CFG. BOLT leaves these functions optimizer developed by Intel to optimize for the quirks of untouched. This also explains a large cold block of about the Itanium architecture. Ispike also utilizes block layout 160 KB in the hot area at Y = 1 and 16 ≤ X ≤ 20: this techniques similar to BOLT, which are variations of Pettis cold block is part of a large non-simple function whose CFG and Hansen [26]. However, despite supporting architecture- is not fully processed by BOLT, so it is not split in the same specific passes, BOLT was built on top of LLVM [16] to en- way as other functions. able it to be easily ported to other architectures. Ottoni and Function splitting and reordering are important to move Maher [25] present an enhanced function-reordering tech- cold basic blocks out of the hot area, and BOLT uses these nique based on a dynamic call graph. BOLT implements the techniques on the vast majority of functions in the HHVM same algorithm in one of its passes. binary. The result is a tight packing of frequently executed Profile information is most commonly used to augment code as show in Y = 1 of Figure 9b, which greatly benefits the compiler to optimize code based on run-time informa- I-cache and I-TLB. tion, such as done by AutoFDO [6]. The latter has also been studied in the context of data-center applications, like BOLT. 
6.5 Importance of LBR Even though there is some expected overlap in gains be- Not all CPU vendors support a hardware mechanism to col- tween AutoFDO and BOLT, since both tools perform layout, lect a trace of the last branches, such as LBRs on Intel CPUs. in this paper we show that the gains with FDO in general (not We compared the impact of using them for BOLT profile ver- just AutoFDO) and BOLT can be complimentary and both sus relying on plain samples with no such traces. tools can be used together to obtain maximum performance. Figure 11 summarizes our evaluation on different metrics for HHVM, in 3 different scenarios: reordering functions us- 8. Conclusion ing HFSort, reordering basic blocks and applying other op- The complexity of data-center applications often results in timizations, and with both (all optimizations on). For exam- large binaries that tend to exhibit poor CPU performance ple, the first data set shows that the overall reduction on in- due to significant pressure on multiple important hardware structions executed is 0.35% by having more accurate pro- structures, including caches, TLBs, and branch predictors. filing enabled by LBRs. As Figure 6 shows, total CPU time To tackle the challenge of improving performance of such improvements by using BOLT on HHVM are about 8%. Fig- applications, we created a post-link optimizer, called BOLT, ure 11 shows us that using LBRs is responsible for about which is built on top of the LLVM infrastructure. The main 2% of these improvements. Furthermore, the impact is more goal of BOLT is to reorganize the applications’ code to significant for basic block layout optimizations than it is for reduce the pressure that they impose on those important

References

[1] Keith Adams, Jason Evans, Bertrand Maher, Guilherme Ottoni, Andrew Paroski, Brett Simmers, Edwin Smith, and Owen Yamauchi. 2014. The HipHop Virtual Machine. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages & Applications. 777–790.
[2] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. 2000. Dynamo: A Transparent Dynamic Optimization System. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 1–12.
[3] F. Bellard. 2005. QEMU, a Fast and Portable Dynamic Translator. In USENIX Annual Technical Conference.
[4] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. 2013. TAO: Facebook's Distributed Data Store for the Social Graph. In Proceedings of the USENIX Annual Technical Conference. 49–60.
[5] Derek Bruening, Timothy Garnett, and Saman Amarasinghe. 2003. An Infrastructure for Adaptive Dynamic Optimization. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE, 265–275.
[6] Dehao Chen, David Xinliang Li, and Tipp Moseley. 2016. AutoFDO: Automatic Feedback-directed Optimization for Warehouse-scale Applications. In Proceedings of the International Symposium on Code Generation and Optimization. 12–23.
[7] Dehao Chen, Neil Vachharajani, Robert Hundt, Xinliang Li, Stephane Eranian, Wenguang Chen, and Weimin Zheng. 2013. Taming Hardware Event Samples for Precise and Versatile Feedback Directed Optimizations. IEEE Trans. Comput. 62, 2 (2013), 376–389.
[8] Robert Cohn, D. Goodwin, and P. G. Lowney. 1997. Optimizing Alpha Executables on Windows NT with Spike. Digital Technical Journal 9, 4 (1997), 3–20.
[9] James C. Dehnert, Brian K. Grant, John P. Banning, Richard Johnson, Thomas Kistler, Alexander Klaiber, and Jim Mattson. 2003. The Transmeta Code Morphing Software: Using Speculation, Recovery, and Adaptive Retranslation to Address Real-life Challenges. In Proceedings of the International Symposium on Code Generation and Optimization. 15–24.
[10] DWARF Debugging Standards Committee. 2017. DWARF Debugging Information Format Version 5.
[11] Ealan A. Henis, Gadi Haber, Moshe Klausner, and Alex Warshavsky. 1999. Feedback Based Post-link Optimization for Large Subsystems. In Workshop on Feedback Directed Optimization. 13–20.
[12] E. A. Henis, G. Haber, M. Klausner, and A. Warshavsky. 1999. Feedback Based Postlink Optimization for Large Subsystems. In Proceedings of the 2nd Workshop on Feedback Directed Optimization. 13–20.
[13] Urs Hölzle and David Ungar. 1994. Optimizing Dynamically-dispatched Calls with Run-time Type Feedback. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. 326–336.
[14] Robert Hundt, Easwaran Raman, Martin Thuresson, and Neil Vachharajani. 2011. MAO – An Extensible Micro-architectural Optimizer. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization. IEEE Computer Society, 1–10.
[15] Intel Corporation. 2011. Intel 64 and IA-32 Architectures Software Developer's Manual. Number 325384-039US.
[16] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization. 75–86.
[17] Roy Levin. 2007. Complementing Incomplete Edge Profile by Applying Minimum Cost Circulation Algorithms.
[18] Xinliang David Li, Raksit Ashok, and Robert Hundt. 2010. Lightweight Feedback-Directed Cross-Module Optimization. In Proceedings of the International Symposium on Code Generation and Optimization. 53–61.
[19] LLVM Community. 2018. The LLVM Open-source Code Repositories. Web site: http://llvm.org/releases.
[20] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 190–200.
[21] C.-K. Luk, Robert Muth, Harish Patil, Robert Cohn, and Geoff Lowney. 2004. Ispike: A Post-link Optimizer for the Intel Itanium Architecture. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE, 15–26.
[22] Nicholas Nethercote and Julian Seward. 2007. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 89–100.
[23] Diego Novillo. 2014. SamplePGO: The Power of Profile Guided Optimizations Without the Usability Burden. In Proceedings of the 2014 LLVM Compiler Infrastructure in HPC. IEEE Press, 22–28.
[24] Guilherme Ottoni. 2018. HHVM JIT: A Profile-guided, Region-based Compiler for PHP and Hack. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 151–165.
[25] Guilherme Ottoni and Bertrand Maher. 2017. Optimizing Function Placement for Large-scale Data-center Applications. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE, 233–244.
[26] Karl Pettis and Robert C. Hansen. 1990. Profile Guided Code Positioning. In Proceedings of the ACM Conference on Programming Language Design and Implementation. ACM, 16–27.
[27] Proxygen Team. 2017. Proxygen: Facebook's C++ HTTP Libraries. Web site: https://github.com/facebook/proxygen.
[28] Ted Romer, Geoff Voelker, Dennis Lee, Alec Wolman, Wayne Wong, Hank Levy, Brian Bershad, and Brad Chen. 1997. Instrumentation and Optimization of Win32/Intel Executables Using Etch. In Proceedings of the USENIX Windows NT Workshop, Vol. 1997. 1–8.
[29] Cheng Wang, Shiliang Hu, Ho-seop Kim, Sreekumar R. Nair, Mauricio Breternitz, Zhiwei Ying, and Youfeng Wu. 2007. StarDBT: An Efficient Multi-platform Dynamic Binary Translation System. In Proceedings of the Asia-Pacific Conference on Advances in Computer Systems Architecture. Springer, 4–15.