automemcpy: A Framework for Automatic Generation of Fundamental Memory Operations∗

Guillaume Chatelet (Google Research, France), Chris Kennelly (Google, USA), Sam (Likun) Xi (Google, USA), Ondrej Sykora (Google Research, France), Clément Courbet (Google Research, France), Xinliang David Li (Google, USA), Bruno De Backer (Google Research, France)

Abstract

Memory manipulation primitives (memcpy, memset, memcmp) are used by virtually every application, from high performance computing to user interfaces. They often consume a significant portion of CPU cycles. Because they are so ubiquitous and critical, they are provided by language runtimes and in particular by libc, the C standard library. These implementations are heavily optimized, typically written in hand-tuned assembly for each target architecture.

In this article, we propose a principled alternative to hand-tuning these functions: (1) we profile the calls to these functions in their production environment and use this data to drive the important high-level algorithmic decisions, (2) we use a high-level language for the implementation, delegate the job of tuning the generated code to the compiler, and (3) we use constraint programming and automatic benchmarks to select the optimal high-level structure of the functions.

We compile our memfunctions implementations using the same compiler toolchain that we use for application code, which allows leveraging the compiler further by allowing whole-program optimization. We have evaluated our approach by applying it to the fleet of one of the largest computing enterprises in the world. This work increased the performance of the fleet by 1%.

CCS Concepts: · Software and its engineering → Software development techniques; · General and reference → Measurement; Performance; Empirical studies; Metrics.

Keywords: memory functions

ACM Reference Format:
Guillaume Chatelet, Chris Kennelly, Sam (Likun) Xi, Ondrej Sykora, Clément Courbet, Xinliang David Li, and Bruno De Backer. 2021. automemcpy: A Framework for Automatic Generation of Fundamental Memory Operations. In Proceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management (ISMM '21), June 22, 2021, Virtual, Canada. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3459898.3463904

∗ Our memory function implementations [14], benchmarking methodology [11, 13] and raw measurements [10] have been open sourced.

This work is licensed under a Creative Commons Attribution 4.0 International License.
ISMM '21, June 22, 2021, Virtual, Canada
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-8448-3/21/06.
https://doi.org/10.1145/3459898.3463904

1 Introduction

Interacting with memory is an essential part of software development. At the lowest level the two fundamental memory operations are load and store, but developers frequently use higher level memory primitives such as initialization, comparison, and copy. These operations are at the base of nearly all libraries,1 language run-times,2 and even programming language constructs.3 They may be customized for some particular contexts, such as for kernel use or embedded development, but the vast majority of software depends on their implementations in the C standard library: memset, memcmp and memcpy.4

1 e.g., std::string and std::vector in the C++ Standard Template Library.
2 Garbage collectors move memory to reduce fragmentation, reflection APIs rely on run-time type identifier comparison.
3 In some C implementations, passing a large struct by value inserts a call to libc's memcpy.
4 memmove is not considered in this paper as: (1) it requires additional logic and (2) in our experience, its use is anecdotal compared to memcpy, memcmp and memset.


Table 1. Overview of memcpy source code language for various architectures and libc implementations. [The table lists, for each of eight libc implementations (including glibc, FreeBSD, uClibc and eglibc) and for each supported architecture (aarch64, arm, i386, x86-64, alpha, ia64, mips, s390, sh, sparc32, sparc64, and a generic fallback), whether memcpy is written in assembly ("asm") or in C; all but one implementation use assembly.]

Table 2. Binary size of glibc 2.31 memcpy implementations for x86-64.

Name                              Size (bytes)
__memcpy_avx512_no_vzeroupper     1855
__memcpy_avx512_unaligned_erms    1248
__memcpy_avx_unaligned_erms       984
__memcpy_sse2_unaligned_erms      765
__memcpy_ssse3                    10695
__memcpy_ssse3_back               10966

In the rest of this paper, we will focus on the optimization of memcpy as the hottest of the three functions, but the same approach can be applied to the other two.

As a reminder, memcpy is defined in §7.24.2.1 of the C standard [21] like so: The memcpy function copies n characters from the object pointed to by s2 into the object pointed to by s1. If copying takes place between objects that overlap, the behavior is undefined.

2 Overview of Current Implementations

2.1 Pervasive Use of Assembly
Memory primitives have been part of the C standard library for at least three decades. They have seen persistent efforts to exploit the capabilities (and work around the quirks) of each architecture as it evolved over time. Developers, who were typically well-versed in the subtleties of their target micro-architecture, would choose assembly to obtain the maximal amount of control over the resulting generated code. As a result, out of the eight main publicly available libc implementations (see Table 1), all but one use assembly to implement memcpy. As shown in Table 2, glibc5 alone has six x86-64 versions of memcpy to exploit different instruction set extensions such as SSE2, SSSE3, AVX, and AVX512.

5 The GNU C library.

2.2 Dynamic Loading and Relocation
Standard C libraries are usually provided as shared libraries, which brings a number of advantages:
• it avoids code duplication, saving disk space and memory,
• it enables quick updates, which is especially important for maintenance and security reasons,
• to the developer, it accelerates the compilation process by reducing link time.
Shared libraries also come with costs:
• Symbols from shared libraries can't be resolved at compile time and they need extra run-time infrastructure. Modern systems, and x86-64 in particular, implement shared libraries using Position Independent Code (PIC) [5-7]. PIC avoids costly relocation of function addresses at load time by using an extra indirect call through the Procedure Linkage Table (PLT). This indirect call hurts performance and increases instruction Translation Lookaside Buffer (iTLB) misses.
• Functions imported from shared libraries are not visible to the linker. This prevents optimizations such as inlining and Feedback-Directed Optimization (FDO). We discuss FDO in more detail in Section 2.4.
In Google data centers, applications are statically linked and their size is typically in the hundreds of megabytes. Statically linking the memory primitives only marginally increases the binary size and overcomes the aforementioned problems.

2.3 Run-Time Dispatch
As CPUs and Instruction Set Architectures (ISAs) evolve, new instructions become available and performance of older instructions may improve or degrade. Unless a libc is built for a specific CPU model, it must accommodate the older models but it should also provide optimized versions for newer CPU models. To this end, Linux libraries use a run-time dispatching technique on top of the PLT called IFUNC [26] that is commonly used by glibc. On the first call to an IFUNC, the indirect address, stored in the Global Offset Table (GOT), points to a custom resolver function that determines the best implementation for the host machine. The resolver then overrides its own GOT entry so that subsequent calls access the selected function. After the setup phase, an IFUNC has the same indirect call costs and limitations as a function from a shared library, even if it is linked statically.
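For illustration, a minimal sketch of such a dispatch, expressed with the compiler's ifunc attribute (this is not code from the paper; fast_memcpy, memcpy_avx and memcpy_sse2 are hypothetical names):

#include <stddef.h>

extern "C" {
// Two hypothetical implementations compiled into the same binary.
void *memcpy_sse2(void *dst, const void *src, size_t n);
void *memcpy_avx(void *dst, const void *src, size_t n);

// Resolver: runs once and returns the address of the best implementation
// for the host CPU.
void *resolve_memcpy() {
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx")
             ? reinterpret_cast<void *>(memcpy_avx)
             : reinterpret_cast<void *>(memcpy_sse2);
}

// Every call to fast_memcpy goes through the GOT entry that the resolver
// fills in on first use.
void *fast_memcpy(void *dst, const void *src, size_t n)
    __attribute__((ifunc("resolve_memcpy")));
}  // extern "C"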


2.4 Feedback-Directed Optimization
The most significant optimizations in recent years came from the mainstream adoption of Feedback-Directed Optimization6 (FDO) with global efficiency improvements typically over 10% [15]. FDO relies on a measured execution profile to make informed compilation decisions7 based on how the application actually executes the code. Because memory primitives are so pervasive among all applications, they are especially well covered by sampling profilers8 and could greatly benefit from FDO.
However, FDO can only be applied to the parts of the application for which source code is available to the compiler/linker. In particular, it can't be applied to dynamically linked functions and IFUNCs. Moreover, FDO does not understand and is not allowed to modify code written in assembly. This effectively means that the way memory functions are implemented on most systems prevents them from benefiting from one of the most efficient optimization techniques.

6 In some contexts, FDO is also called Profile-Guided Optimization.
7 Examples of what can be achieved are: determining which functions to inline, or how to lay out the code to improve locality and reduce branching.
8 Modern datacenters continuously run sampling profilers on their workloads as a part of standard operations [22].

3 Our Approach
Based on our analysis of the shortcomings of current implementations, we design our implementation primitives around two main principles:
• High-level description. The implementation should be written using a high-level language; we chose C++ for its expressiveness. This makes the code shorter and more readable, and it allows the compiler to use advanced optimization techniques, including FDO. When needed, we examine the generated code to debug performance issues, but never write assembly directly.
• Data-based optimization. On top of fleet-wide execution profiles, we record run-time values of the memcpy size argument for several representative workloads. We use the measured size distribution to automatically select the best set of elementary strategies and design the decision logic driving the function.

3.1 Measuring size Distributions
There are several ways of collecting the run-time values for the size argument:
1. Manually instrument our memcpy implementations.
2. Rely on the compiler's instrumentation and value profiler.
3. Rely on profiling infrastructure.
Solutions 1 and 2 require a special (instrumented) build of each application and add an extra run-time cost which is impractical for use in production. Instead, we make use of existing profiling infrastructure: the Linux perf tool [27]. This is an interface to the Linux performance monitoring subsystem that can attach to a running process and collect profiling data without a need to recompile or restart the application. Among other features, it allows sampling register values at certain points during the execution of the program. In our case, we set it up to sample the RDX register right after a call instruction. Under the System V x86-64 ABI [25], this register contains the size argument of memcpy. Since memcpy is statically linked, its address in the binary is known at compile time, so we can easily filter samples belonging to memcpy and gather them into a histogram.

3.2 Performance Definition
There are many dimensions to consider when designing a memory function implementation. Contrary to existing implementations that usually optimize for throughput,9 we will focus on latency and code size as they are most important for our use cases. Code size is measured with the help of the nm Linux command and latency is measured through a custom benchmarking framework. We define latency as the running time of the memory operation for a given hardware in a given context.

9 Number of bytes processed per unit of time.

3.3 Benchmarking
Measuring the performance of memory functions is both hard and critical, so we devote a whole section to our benchmarking methodology. In the context of Google, we can rely on fleet-wide profiling to measure the efficiency of an implementation. However, linking statically means that we have to wait for all applications to be released before seeing the results, which is not practical.
The aim here is to provide an accurate and reproducible measurement of latency. To do so we heed the following design principles:
1. Measuring instrument. We make use of the Time Stamp Counter (TSC). It is the most precise clock available: a hardware register located on the CPU itself, counting the number of CPU cycles elapsed since boot. There are two versions of the TSC on x86-64 processors. One is in-core and ticks at a rate depending on a number of factors (power saving mode, defined maximum and minimum frequency, thermal regulation); the other one is Uncore,10 which ticks at a fixed frequency usually referred to as ref cycles. Most high end CPU architectures provide similar functionality (SPARC, ARM, IBM POWER). The Uncore counter's frequency is about the same as that of the processor in non-turbo mode. We use the Uncore counter to account for uniformly elapsing time.

10 https://en.wikipedia.org/wiki/Uncore


2. Frequency Scaling. The operating system usually provides a way for the user to control the CPU core frequencies (i.e., Scaling Governors under Linux), but the CPU may reduce its frequency automatically under some workloads [19, 24]. By instructing the CPU to run at the maximum allowed frequency and using the uncore TSC, we make sure to account for potential frequency reductions.
3. Operating System. To lower the impact of interruption and process migration, we pin the benchmarking application to a reserved core.11
4. Compiler perturbations. Clever dead-code and dependency analyses performed automatically by the compiler may cancel or distort the measurement. To avoid this, we use techniques such as the inline assembly described by Carruth [8].
5. Memory Subsystem. On modern non-embedded systems, RAM bus frequency is only a fraction of the CPU core frequency. This well-known bottleneck is balanced out by several levels of caching. The number, size, and speed of each of these layers makes it hard to attribute the result of the measurement to the algorithm itself. For this reason, our framework takes extra care to fit all the needed data into the L1 cache. The results are slightly idealized but more reproducible and comparable between CPUs of different cache sizes.
6. CPU. At the CPU level, out-of-order execution, superscalar architecture, Macro-Operation Fusion (MOP Fusion), Micro-Operation Fusion, and speculative execution make it harder to correlate TSC with individual instructions. To mitigate these imprecisions and increase the Signal to Noise Ratio (SNR) we run enough iterations of each function to get noise down to ±1%. Note that the repeated execution will bias the results as the CPU will learn the branching pattern perfectly. Consequently, we randomize the size of the operation as well as the source and destination pointers. Randomization is pre-computed and stored in an array so it is not accounted for in the measurement. The size of the array is large enough to prevent quantization effects but small enough to keep all data in L1.

11 If the processor supports Simultaneous MultiThreading (SMT) all the logical CPUs sharing the same core should be reserved as well, or SMT should be disabled altogether.

Additionally, it is worth noting that a number of effects do play a role in production and are not directly measurable through microbenchmarks. We end this section with the following known limitations:
1. The code size of the benchmark is smaller than the hot code of real applications and it doesn't exhibit instruction cache and iTLB pressure as much.
2. For the reasons stated in item 5 above, the current version of the benchmark does not handle large operations spanning multiple cache levels.
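As a rough illustration of principles 1, 4 and 6 (a sketch of ours; the actual framework is available in [13]), a measurement loop over pre-computed randomized parameters might look as follows, here timing the libc memcpy itself:

#include <cstddef>
#include <cstring>
#include <vector>
#include <x86intrin.h>  // __rdtsc

struct Parameters {
  char *dst;
  const char *src;
  size_t size;
};

// Returns the average number of TSC ticks per call, measured over
// pre-computed, randomized (dst, src, size) triples that all fit in L1.
double AverageTicksPerCall(const std::vector<Parameters> &params,
                           size_t iterations) {
  const unsigned long long start = __rdtsc();
  for (size_t i = 0; i < iterations; ++i) {
    const Parameters &p = params[i % params.size()];
    memcpy(p.dst, p.src, p.size);
    __asm__ volatile("" ::: "memory");  // keep the call from being elided
  }
  const unsigned long long stop = __rdtsc();
  return static_cast<double>(stop - start) / static_cast<double>(iterations);
}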
3.4 Generic Access Strategies
At a high level, a memcpy implementation is assembled from a set of simpler copying strategies (elementary strategies) along with its associated decision logic that selects the best copying strategy for a given size. By analyzing the code of available implementations and through experimentation, we identified the following patterns for processing memory. In this section, we list these strategies and describe their specifics.

3.4.1 Block Operation
A Block Operation consists of accessing a fixed amount of memory of size BlockSize. To be efficient, BlockSize should correspond to sizes natively supported by load and store instructions of the CPU architecture. It is the simplest access pattern and is implemented using the simple load and store instructions provided by the ISA.

3.4.2 Overlap Operation
This access pattern is the composition of two Block Operations that may partially overlap, and it is a central component of efficient implementations. This approach has several advantages. First, a single access pattern involving two operations of size N can handle a range of sizes ∈ [N; 2 × N]. Second, modern processors offer memory addressing modes that render this pattern efficient in practice. Figure 1 shows an example of Overlap Operation for 11 bytes. As depicted by the darker cells, one or more bytes may be accessed several times. While this strategy is valid for implementing memcpy, such a technique is not suitable when volatile12 semantics apply.

Figure 1. Example Overlapping operation on 11 bytes. Bytes from the source location are copied using two 8-byte block operations that partially overlap. The five bytes in the middle are copied twice.

12 Note that the volatile keyword in C presents a number of issues and is considered for deprecation [4].

3.4.3 Loop Operation
The Loop Operation involves a repetition of Block Operations followed by a trailing Overlap Operation. Note that up to BlockSize − 1 bytes may be accessed twice. Figure 2 shows an example of a Loop Operation.

Figure 2. Example Loop Operation on 13 bytes. The first 12 bytes are copied using three non-overlapping 4-byte Block Operations. The remaining one byte is copied using a partially overlapping 4-byte Block Operation. Three bytes are accessed twice.
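A minimal sketch of this Loop Operation (ours, not the paper's; it relies on the CopyBlock and CopyLastBlock primitives introduced later in Listing 1 and assumes count ≥ kBlockSize):

template <size_t kBlockSize>
void CopyBlocks(char *__restrict dst, const char *__restrict src,
                size_t count) {
  size_t offset = 0;
  // Copy as many non-overlapping blocks as possible.
  for (; offset + kBlockSize < count; offset += kBlockSize)
    CopyBlock<kBlockSize>(dst + offset, src + offset);
  // The trailing block is anchored to the end of the buffers and may overlap
  // up to kBlockSize - 1 bytes that were already copied.
  CopyLastBlock<kBlockSize>(dst, src, count);
}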


3.4.4 Aligned Loop Operation
The Aligned Loop Operation starts with a Block Operation to ensure that reads are aligned. It is then followed by a Loop Operation. Note that in a worst case scenario, up to BlockSize − 2 bytes may be accessed three times. Figure 3 shows an example of an Aligned Loop Operation.

Figure 3. Example Aligned Loop Operation on 17 bytes. Two partially overlapping operations are used at the beginning and at the end. All remaining bytes are handled using aligned non-overlapping Block Operations.

3.4.5 Special Instructions
Some ISAs provide primitives to accelerate string manipulation. For example, x86-64 offers specialized instructions for copy (rep mov), initialization (rep sto) and comparison (repe cmps). The performance of these instructions varies greatly from microarchitecture to microarchitecture [28] but it is important to consider them as their footprint is so small they could be easily inlined everywhere. In the rest of this paper we refer to these special instructions as accelerators.

Table 3. Size coverage for a given strategy
Operation      Size range
Block          [BlockSize, BlockSize]
Overlap        [BlockSize, 2 × BlockSize]
Loop           [BlockSize, +∞]
Aligned Loop   [BlockSize, +∞]
Accelerator    [0, +∞]

3.5 Run-Time Size Coverage
The Generic Access Strategies we presented in the previous section can handle one or more sizes; Table 3 lists their run-time size coverage in terms of BlockSize. These strategies can be composed in different ways to build a full implementation covering the whole size_t range [0, SIZE_MAX]. By way of illustration, a possible coverage is given in Table 4.

Table 4. An example of coverage for a memory operation
Size range       Operation   BlockSize
[0, 0]           -           -
[1, 1]           Block       1
[2, 4]           Overlap     2
[5, SIZE_MAX]    Loop        4

To find the best memcpy implementation, we use a constraint solver to enumerate all valid memcpy implementations matching our specification. We consider only memory functions that subdivide the range [0, SIZE_MAX] into smaller regions and cover each of them using one of the strategies described in Section 3.4 using the following schema:

A | individual sizes | B | overlap | C | loop | D | accelerator | E

where A, B, C, D and E are bounds separating the regions. If the run-time copy size is in [A, B) the copy will be handled with the individual sizes strategy, in [B, C) with the overlap strategy, and so on. A and E are anchored to 0 and SIZE_MAX respectively to make sure the whole range is covered (eq. 1), and we make sure that the bounds are ordered (eq. 2).

(A = 0) ∧ (E = SIZE_MAX)   (1)

(A ≤ B) ∧ (B ≤ C) ∧ (C ≤ D) ∧ (D ≤ E)   (2)

When the two bounds of a region are equal, e.g., when A = B, the region is empty and the corresponding strategy is not used in the generated function.

The individual sizes region handles each run-time size individually with a Block Operation. In this study we restrict B to small values as larger values substantially increase code size.

(A = B) ∨ (B ≤ 8) ∨ (B = 16)   (3)

The overlap region handles run-time sizes from B to C by using one or more Overlap Operations. In this study we explore all overlap regions for size ∈ [4; 256] as well as the empty region.

(B = C) ∨ (B < C ∧ B ∈ {4, 8, 16, 32, 64, 128} ∧ C ∈ {8, 16, 32, 64, 128, 256})   (4)

The loop region employs the aligned loop strategy with a run-time BlockSize ∈ {8, 16, 32, 64}. If C = D, we also force loop_block = 0 to remove redundant implementations.

(C = D ∧ loop_block = 0) ∨ (loop_block ∈ {8, 16, 32, 64} ∧ C > 2 × loop_block ∧ D − C > loop_block)   (5)

The accelerator region makes use of special instructions where available. We allow the accelerator to start for small sizes or amongst a set of discrete values.

(D ≤ 16) ∨ (D ∈ {32, 64, 128, 256, 512, 1024}) ∨ (D = E)   (6)
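To make the enumeration concrete, here is a minimal sketch of ours using the Z3 C++ API (not the paper's actual generator); it encodes equations (1) to (4) and (6), omits the loop_block constraint of equation (5) for brevity, and prints every valid set of bounds:

#include <iostream>
#include <z3++.h>

int main() {
  z3::context ctx;
  z3::solver solver(ctx);
  const z3::expr A = ctx.int_const("A"), B = ctx.int_const("B"),
                 C = ctx.int_const("C"), D = ctx.int_const("D"),
                 E = ctx.int_const("E");
  const z3::expr kSizeMax = ctx.int_val("18446744073709551615");  // SIZE_MAX

  solver.add(A == 0 && E == kSizeMax);                             // eq. (1)
  solver.add(A <= B && B <= C && C <= D && D <= E);                // eq. (2)
  solver.add(A == B || B <= 8 || B == 16);                         // eq. (3)
  solver.add(B == C ||                                             // eq. (4)
             (B < C &&
              (B == 4 || B == 8 || B == 16 || B == 32 || B == 64 || B == 128) &&
              (C == 8 || C == 16 || C == 32 || C == 64 || C == 128 || C == 256)));
  solver.add(D <= 16 || D == 32 || D == 64 || D == 128 ||          // eq. (6)
             D == 256 || D == 512 || D == 1024 || D == E);

  while (solver.check() == z3::sat) {
    z3::model m = solver.get_model();
    std::cout << m.eval(B, true) << " " << m.eval(C, true) << " "
              << m.eval(D, true) << "\n";
    // Block the current solution so that the next check() returns a new one.
    solver.add(B != m.eval(B, true) || C != m.eval(C, true) ||
               D != m.eval(D, true));
  }
}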


Figure 4. A depiction of two branching strategies: (a) a simple tree processing small sizes first; (b) an optimized tree for a distribution where most values are above 8.

3.6 Branching Pattern
Once a coverage has been determined, it is necessary to decide which branch should be favored to lower the whole operation latency. A simple approach is to make sure that small sizes are processed first so that the cost of branching is proportional to the amount of bytes to process (Figure 4a). Another possibility is to prioritize sizes that occur the most. To this end, we make use of Knuth's Optimum Binary Search Trees [23], which construct a binary search tree producing code with the shortest expected control height for a given set of probabilities (Figure 4b).
The implementation can also explore a more elaborate model to represent the cost of compares, branch instructions, and taken branches. A hybrid expansion approach can also be used. For instance, when a sequence of individual sizes gets dispatched to Block Operations, all sizes can be lumped together and share one jump table. Note that this approach will be the most effective when it is used for inlining, using per-callsite profile data.
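As a toy illustration (a sketch of ours, not code from the paper, built on the primitives later introduced in Listing 1), the two patterns differ only in the order of the comparisons; for a workload where most sizes are 8 or more, the second variant reaches the hot strategy after a single test:

// (a) Small sizes first: the branching cost grows with the size to process.
void DispatchSmallSizesFirst(char *__restrict dst, const char *__restrict src,
                             size_t count) {
  if (count == 0) return;
  if (count == 1) return CopyBlock<1>(dst, src);
  if (count < 4) return CopyBlockOverlap<2>(dst, src, count);
  if (count < 8) return CopyBlockOverlap<4>(dst, src, count);
  return CopyAlignedBlocks<8>(dst, src, count);  // most frequent, reached last
}

// (b) Most frequent sizes first: the hot region is one comparison away.
void DispatchMostFrequentFirst(char *__restrict dst, const char *__restrict src,
                               size_t count) {
  if (count >= 8) return CopyAlignedBlocks<8>(dst, src, count);  // hot path
  if (count >= 4) return CopyBlockOverlap<4>(dst, src, count);
  if (count >= 2) return CopyBlockOverlap<2>(dst, src, count);
  if (count == 1) return CopyBlock<1>(dst, src);
}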

3.7 Optimized Memory Functions
The usage of memory primitives varies widely depending on the use case. One extreme case is the Linux kernel that spends a very significant proportion of its time clearing or copying memory pages, which are always 4096 bytes in length. Just like libc, the Linux kernel provides several implementations, with a dynamic selection mechanism. In contrast, for typical applications running on the Google fleet, we've found that sizes are very biased towards zero (see Section 4.1). The best algorithm for the Linux kernel is not necessarily optimal for other applications. In this section we provide a principled approach to design efficient memory functions tailored to a given environment:
1. We first model the function in terms of its coverage strategies as described in Section 3.5, and use the Z3 Solver [16] to enumerate all valid coverages. Each of them is described in terms of its parameters A, B, C, D, E, and loop_block, and whether we use an optimized branching pattern or not (Section 3.6).
2. We generate the source code for all of them and encode the values of the parameters into the function name; this allows us to generate all functions in a single file and quickly identify and understand the implementation.
3. The code is then compiled and benchmarked on all relevant microarchitectures using synthetic distributions observed in production.
4. The best performing implementation is chosen; in the case of a fleet of heterogeneous microarchitectures we pick the implementation that minimizes the overall latency.
Not only does this approach allow the functions to be optimized for a particular application (or set of applications) but it also adapts to special requirements like special compilation flags, new microarchitectures or new CPUs. The model described here generates 290 different functions which are benchmarked in one or two hours. It is straightforward to extend the exploration of the implementation space.
The techniques mentioned here are related to the domain of auto-tuning which has been an important field of research in High Performance Computing (HPC) [3] and Signal Processing libraries [17, 18, 30]. To our knowledge, the only related work in this domain is the one from Ying et al. [31], focusing on optimal generation of assembly for Reduced Instruction Set Computers (RISC).

4 Results

4.1 Memfunctions Size Distributions
The instrumentation procedure described in Section 3.1 introduces a small overhead (1-3%) on the application itself during the time that the profiler is active, due to the overhead introduced by perf_events profiling. A 30-second profiling session typically gathers enough data for analysis, depending on the frequency with which the application calls the memory operations. Figures 5, 6, and 7 show the cumulative probability distributions for sizes of the biggest users of memcpy, memcmp and memset at Google. The data is publicly available [12].


Figure 5. Cumulative histogram of memcpy size for various Google workloads.

Figure 7. Cumulative histogram of memset size for various Google workloads.

Figure 6. Cumulative histogram of memcmp size for various Google workloads.

Table 5. Percent of calls below certain size values
Function   % of calls ≤ 128   % of calls ≤ 1024
memcpy     96%                99%
memcmp     99.5%              100%
memset     91%                99.9%

Table 5 summarizes these figures and shows that the majority of calls to memory functions act on a small number of bytes.
The fact that the size distribution is strongly skewed towards small sizes advocates optimizing for latency instead of throughput. We have seen previously that relocation and run-time dispatch introduce an indirect call which delays processing. For example, a trivial zero size memory operation would still have to go through address resolution and perform the indirect call before returning to the calling code. To remove this latency entirely we link the memory primitives statically inside the binary and refrain from using IFUNC. This comes with the disadvantage that we can no longer pick the best implementation for the CPU at run-time; this choice has to be made during compilation. Arguably, for distributions skewed towards small sizes, the use of wider load and store instructions available on newer microarchitectures doesn't matter as much. In situations where it is important, this can be mitigated by providing several binaries covering the diversity of the fleet.

4.2 C++ Library
In Listing 1, we provide Block, Overlap and Aligned Loop C++ implementations of the copy operation. The individual building blocks are unit tested for correctness, buffer overflow and maximum number of accesses per byte. Note that __builtin_memcpy_inline is a compiler intrinsic that is semantically equivalent to a for loop copying bytes from source to destination, i.e., it has the semantics of memcpy. Its use is necessary to prevent the compiler from recognizing the memcpy pattern, which leads to a reentrancy problem [9].
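For reference, a minimal sketch of what the intrinsic is equivalent to (ByteByByteCopy is a name of our choosing, not a library function); the intrinsic is expanded inline instead of being lowered to a call to memcpy:

void ByteByByteCopy(char *__restrict dst, const char *__restrict src,
                    size_t count) {
  // Semantically what __builtin_memcpy_inline(dst, src, kBlockSize)
  // performs for a given block size.
  for (size_t i = 0; i < count; ++i)
    dst[i] = src[i];
}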


4.3 Fleet-Wide Manual Implementation
Now that we have defined the primitive C++ copy functions, we can assemble them to handle all run-time sizes. Listing 2 shows a generic yet efficient implementation of memcpy that has been tuned by trying out different building block combinations and parameters. Full source code is available in [14]. Note that this version handles small sizes first. In Section 4.4, we will take a closer look at size distributions and explore optimized branching patterns.

template <size_t kBlockSize>
void CopyBlock(char *__restrict dst,
               const char *__restrict src) {
  __builtin_memcpy_inline(dst, src, kBlockSize);
}

template <size_t kBlockSize>
void CopyLastBlock(char *__restrict dst,
                   const char *__restrict src,
                   size_t count) {
  const size_t offset = count - kBlockSize;
  CopyBlock<kBlockSize>(dst + offset, src + offset);
}

template <size_t kBlockSize>
void CopyBlockOverlap(char *__restrict dst,
                      const char *__restrict src,
                      size_t count) {
  CopyBlock<kBlockSize>(dst, src);
  CopyLastBlock<kBlockSize>(dst, src, count);
}

template <size_t kBlockSize, size_t kAlignment = kBlockSize>
void CopyAlignedBlocks(char *__restrict dst,
                       const char *__restrict src,
                       size_t count) {
  CopyBlock<kAlignment>(dst, src);
  const size_t ofla = offset_from_last_aligned(src);
  const size_t limit = count + ofla - kBlockSize;
  for (size_t offset = kAlignment; offset < limit;
       offset += kBlockSize)
    CopyBlock<kBlockSize>(
        dst - ofla + offset,
        assume_aligned(src - ofla + offset));
  CopyLastBlock<kBlockSize>(dst, src, count);
}

Listing 1. Real C++ code for various copy operation strategies.

void memcpy(char *__restrict dst,
            const char *__restrict src,
            size_t count) {
  if (count == 0)
    return;
  if (count == 1)
    return CopyBlock<1>(dst, src);
  if (count == 2)
    return CopyBlock<2>(dst, src);
  if (count == 3)
    return CopyBlock<3>(dst, src);
  if (count == 4)
    return CopyBlock<4>(dst, src);
  if (count < 8)
    return CopyBlockOverlap<4>(dst, src, count);
  if (count < 16)
    return CopyBlockOverlap<8>(dst, src, count);
  if (count < 32)
    return CopyBlockOverlap<16>(dst, src, count);
  if (count < 64)
    return CopyBlockOverlap<32>(dst, src, count);
  if (count < 128)
    return CopyBlockOverlap<64>(dst, src, count);
  return CopyAlignedBlocks<32>(dst, src, count);
}

Listing 2. An efficient yet readable implementation of memcpy.

A note on the use of AVX instructions. Although a large portion of the Google fleet supports the AVX instruction set extension, their use is restricted to prevent important slowdowns due to frequency reduction [19, 24]. To that end we make use of LLVM's -mprefer-vector-width=128 option that limits the width of generated vector instructions. For this reason, the x86-64 generated code (clang 12, options -O3 -mcpu=haswell -mprefer-vector-width=128) will not use AVX. It is still quite compact (377 bytes) and fits in six cache lines.

Performance. It is important to point out that although this implementation does not use all the techniques applied by glibc (e.g., non temporal move or rep mov) [29], it performs better on Google workloads. As suggested in [2], the smaller code size compared to glibc (377 B vs 765 B, see Table 2) allows more application code to stay in the instruction cache. Indeed memcpy, memcmp, and memset are called on a regular basis and can contribute to instruction cache pressure, evicting previous application code from L1.

Using this version of memcpy and associated memcmp and memset improved the throughput of one of our main services by +0.65% ± 0.1% Requests Per Second (RPS) compared to the default shared glibc. Overall we estimate that this work improves the performance of the fleet by 1%.

4.4 Fleet-Wide Autogenerated Implementations
As illustrated in Figure 5, most of the sizes for memcpy at Google are heavily skewed toward small values, the most frequent ones being spread between 0 and 32 with spikes depending on specific workloads. For the optimized branching pattern (Section 3.6) we use the sum of the measured size probability distributions, giving an equal weight to each of them. The code is compiled and benchmarked on a dedicated pool of machines with isolated cores running the performance frequency governor. We benchmark the following x86-64 microarchitectures: Intel Sandybridge, Intel Ivybridge, Intel Broadwell, Intel Haswell, Intel Skylake Server (SKX) and AMD Rome. The benchmarks make use of the rep mov instruction and a total of 290 memcpy configurations. We measure the average latency for each of the 9 size distributions of Figure 5 and an additional synthetic distribution for larger sizes (uniformly distributed sizes between 384 and 4096). We produce 1000 samples for each of them to get a sense of the statistical distribution. We show a sample of the distributions in Figure 8.


Table 6. Score for worst, manual implementation and best memcpy functions taking the system glibc as a baseline.
                  worst   manual   best
Rome              0.42    1.35     1.43
Skylake Server    0.54    1.38     1.53
Haswell           0.52    1.39     1.40
Broadwell         0.52    1.42     1.43
Ivybridge         0.57    1.22     1.34
Sandybridge       0.26    1.23     1.35
Neoverse-N1       0.82    1.21     1.36

Figure 8. Distributions of latency measurement for several memcpy implementations as measured on SKX for service 1.

In the context of this paper, we reduce the thousand samples to a single median value13 as this provides a total order over the set of functions. For each distribution we derive a speedup ratio by dividing each median latency by that of the glibc,14 which is our baseline implementation. We use the geometric mean (geomean) of these speedups to represent the performance of a particular implementation over the baseline and across the selected distributions. In the rest of the paper, we refer to this metric as the score of the function.

13 Considering that some distributions of latency measurement are multimodal we acknowledge that a better metric is desirable.
14 The GNU C Library 2.19 with local modifications.
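Spelled out (with notation of our own, not from the paper), writing median_latency_f(d) for the median latency of implementation f on size distribution d, and D for the set of benchmarked distributions:

score(f) = ( ∏_{d ∈ D} median_latency_glibc(d) / median_latency_f(d) )^(1/|D|)

A score above 1 thus means that, averaged geometrically across the distributions, the implementation is faster than the glibc baseline.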

This systematic exploration exhibits implementations that perform about +10% faster than our manual implementation for four out of six microarchitectures in our particular environment (combination of workloads and special compilation options). For Haswell and Broadwell, the manual implementation is already close to the optimum. In Figure 9, we plot the score for the 290 autogenerated x86-64 memcpy implementations as measured on Skylake Server.

Key Takeaways. Overall, the best performing functions for x86-64 rely on a small individual sizes region (B = 4 or 8) followed by a set of overlapping copies up to 256 bytes, while an aligned copy loop using blocks of 64 bytes helps accommodate the bigger sizes. The rep mov instruction may be useful for even bigger copies depending on the microarchitecture but it is usually on par with implementations relying exclusively on the aligned copy loop. It is to be noted that the worst performing functions for our workloads are the ones that rely almost exclusively on the specialized rep mov instruction.

Timings. The whole process takes a few hours. The enumeration and code generation for all valid implementations takes a few seconds. The duration of the benchmarking process ranges from tens of minutes to a few hours, depending on a number of parameters such as: the number of samples, the number and shapes of the distributions, the number and efficiency of the generated functions, and the number and performance of the considered microarchitectures. In this study, we used 1000 samples, 10 distributions, 290 functions and 6 microarchitectures. The single-threaded benchmark is launched on the 6 microarchitectures in parallel and the results are available in under two hours.

Open Source. One of our goals is to make our improvements available publicly to people who do not have access to the Google fleet. The community should be able to measure performance and contribute improvements, so we released the methodology and the source code of our benchmark as open source [11, 13].

Results. The worst, manual implementation and best performing implementations are summarized in Table 6. It is worth mentioning that the optimized branching pattern does not yield a systematic performance improvement and seems to be quite dependent on the processor microarchitecture. Further investigation is required to determine the usefulness of this optimization.

Case Study. To test the robustness of our method on a different architecture, we conducted the same experiment on an ARM Neoverse-N1 machine. The ARM ISA has no instructions similar to rep mov, so the number of generated functions drops to 60. We complement this set of functions with a particular implementation from ARM Optimized Routines [1] to check how close we get to optimized assembly; this implementation is statically linked. In our context, our top performing function has an overall speedup of 1.36 over the system's glibc,15 which is about +8% faster than the hand-written assembly scoring 1.26. We note that each of them has different strengths and weaknesses depending on a particular size distribution.

15 The GNU C Library 2.27 with local modifications.


Figure 9. The speedups of the 290 generated memcpy functions for individual applications on SKX over glibc (score ∈ [0.53, 1.53]).

Table 7. Percentage of the time spent in memcpy for each SPEC int benchmark (ref dataset) using the glibc implementation on Haswell
benchmark        % time in memcpy
400.perlbench    1.1%
401.bzip2        0.2%
403.gcc          0.3%
429.mcf          0.0%
445.gobmk        0.3%
456.hmmer        0.0%
458.sjeng        0.0%
462.libquantum   0.0%
464.h264ref      24.3%
471.omnetpp      0.0%
473.astar        0.0%
483.xalancbmk    3.3%

Figure 10. Cumulative distribution of memcpy sizes for each SPEC int benchmark (ref set). Note the log scale.

4.5 Tailored Implementations
In the two previous sections, we have shown how we can write a generic implementation of memcpy that works efficiently across the fleet. However, we do not claim that all workloads are similar to those that we run at Google. In this section, we show how our approach can be used to derive efficient implementations for a specific workload.
The SPEC benchmark suite [20] is a widely used set of benchmarks that covers a diverse set of compute-intensive workloads from real-world applications. Figure 10 shows the distribution of memcpy sizes for each benchmark of the ref data set. One important thing to notice is that even though most workloads make small copies, some of them (e.g., h264ref, libquantum) perform a significant number of large copies. Additionally, Table 7 shows that for some of them, memcpy represents a very significant portion of the total time for the benchmark. Among all these examples, 464.h264ref is an outlier as it has both a non-standard distribution of sizes, and spends a large amount of time in memcpy (about 24% of total cycles).
We focus on 464.h264ref to demonstrate the power of this technique on an application with very specific needs.


To generate an optimized memcpy implementation, we collected a size profile when running the benchmark, and ran automemcpy on the result. The framework selected the implementation shown in Listing 3. Intuitively the framework detects, based on data, that memcpy sizes are either 0, 1, or very large, and so it optimizes for these two cases. The resulting code is extremely simple and small.

if (count == 0) {
  return; // 0
} else if (count >= 1) {
  if (count == 1) {
    return CopyBlock<1>(dst, src); // 1
  } else if (count >= 2) {
    return CopyRepMovsb(dst, src, count); // [2, +inf]
  }
}

Listing 3. The auto-generated implementation selected by automemcpy for 464.h264ref.

We ran the 464.h264ref benchmark with the original glibc implementation, our fleet-wide implementation, and the specialized implementation generated by automemcpy. All experiments were performed on a 6-core Xeon E5 running at 3.50GHz with 16MB L3 cache. The results are summarized in Table 8. Unsurprisingly, our fleet-wide implementation is worse than the original glibc, because the implementation optimizes for a different size distribution than that of 464.h264ref. However, the specialized implementation outperforms the one from glibc by a factor of 1.8, resulting in the benchmark running more than 10% faster overall.

Table 8. Performance of the glibc, fleet-wide and specialized memcpy for the 464.h264ref SPEC benchmark, on HSW. Intervals are given with 95% confidence, and the speedup is defined as the ratio between the time for glibc and that for the implementation (larger is better).
memcpy implementation   % time in memcpy   memcpy speedup   total speedup
glibc                   24.35% ± 0.13      1.00             1.00
fleet-wide              25.14% ± 0.10      0.96             0.99
specialized             15.05% ± 0.15      1.80             1.11

Keep in mind that the auto-generated memcpy implementation is only going to be efficient as long as the size distribution of the data does not significantly differ from the one on which it was trained. Therefore, such specializations must be regenerated periodically. Ideally, they should be part of the release process of the target application.

5 Limitations and Further Work

5.1 Collected Data
Since the resulting implementation depends on the collected data, it is critical that the measured size distribution be as representative as possible of the real run-time behavior. In the context of this paper, we used nine archetypal workloads and, as we have seen in Section 4.1, these distributions are all very similar for memcpy. However, it is possible that other workloads use a different size distribution. We plan on working on a more systematic approach to collect size distribution data for all workloads, weighted by their prevalence in the fleet.

5.2 Heterogeneous size Distributions
As seen in Section 4.5, building tailored implementations for a binary can greatly improve its global performance. Some particular workloads may use an entirely different size distribution, some may also have a different size distribution per call site. We want to explore letting the compiler automatically generate the proper implementation by itself, based on sampled run-time data. One can even envision generating multiple specialized implementations for one application, based on the call site context. They could be inlined or shared across multiple contexts via cloning.

5.3 Limitations of Benchmarking
Overall, we pick the function that maximizes the score, but a number of effects do play a role in production that are not directly measurable through benchmarks. For instance, the code size of the benchmark is smaller than the hot code of real applications and doesn't exhibit instruction cache pressure as much. When a candidate function is selected, it is crucial to also test it in production and verify that the performance gains are realized.

5.4 Framework Extensions
The framework presented in this paper can be extended in a number of ways:
1. By design, the benchmarking methodology imposes a restriction on the number of bytes to process: all data movement should stay within L1 to prevent accounting for the specifics of the memory subsystem. If bigger sizes are needed, different benchmarking methods must be used as we can no longer rely on repeating the measurement to increase SNR.
2. More Generic Access Strategies (Section 3.4) can be added if necessary. One can think of new processor capabilities like SVE for recent ARM processors or special instructions hinting the memory subsystem like PREFETCHT0 or PREFETCHNTA.
3. The Constraint System developed in Section 3.5 can be adapted to explore other parts of the implementation space (e.g., use more discrete values, use of unaligned loops, use of aligned loops with different values for BlockSize and Alignment, alignment on destination rather than source for the aligned loop).
4. If the number of possible implementations becomes so great that a systematic exploration is no longer practical, the problem can be viewed as an online optimization problem.


6 Conclusion
In this paper, we have devised a method to quickly iterate and explore the design space of memory primitives. It relies on the composition of elementary strategies activated by a decision logic written in C++. This is a departure from the "hand-crafted assembly" approach where decision logic, elementary strategies and instruction selection are all intertwined. We have demonstrated that our implementation matches the performance of hand-written assembly and ultimately provides better performance, because it is much easier to optimize and tune for a given workload. The high level implementation is also easier to test and maintain, as logical units can be tested separately.
Furthermore, we present a method to automatically generate a set of valid C++ implementations of memcpy and evaluate them through a specific benchmark reproducing the size distributions measured in production. It allows finding and deploying the functions that perform best in a particular environment. This process requires minimal human effort, which makes it easy to repeat whenever the usage patterns of the functions change. This method is suitable for optimizing both fleet-wide deployment (Section 4.4) and particular applications (Section 4.5).
Finally, we have demonstrated that we can improve the performance of applications by taking into account the specific run-time size distributions of their memory primitives. In specific benchmarks (SPEC 464.h264ref) we were able to achieve up to ten percent speedup. In particular, by relying exclusively on the manual implementation, we have improved the performance of one of the most important, and highly optimized, services at Google by +0.65% ± 0.1% Requests Per Second (RPS). We have witnessed similar gains throughout the fleet which lead to an overall 1% performance increase.

References
[1] ARM. 2020. memcpy-advsimd.S. https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/memcpy-advsimd.S
[2] Grant Ayers, Nayana Prasad Nagendra, David I. August, Hyoun Kyu Cho, Svilen Kanev, Christos Kozyrakis, Trivikram Krishnamurthy, Heiner Litz, Tipp Moseley, and Parthasarathy Ranganathan. 2019. AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers. In International Symposium on Computer Architecture (ISCA).
[3] Prasanna Balaprakash, Jack Dongarra, Todd Gamblin, Mary Hall, Jeffrey K. Hollingsworth, Boyana Norris, and Richard Vuduc. 2018. Autotuning in High-Performance Computing Applications. Proc. IEEE 106, 11 (2018), 2068-2083. https://doi.org/10.1109/JPROC.2018.2841200
[4] JF Bastien. 2018. Deprecating volatile. http://wg21.link/P1152R0
[5] Eli Bendersky. 2011. Load-time relocation of shared libraries. https://eli.thegreenplace.net/2011/08/25/load-time-relocation-of-shared-libraries
[6] Eli Bendersky. 2011. Position Independent Code (PIC) in shared libraries. https://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries
[7] Eli Bendersky. 2011. Position Independent Code (PIC) in shared libraries on x64. https://eli.thegreenplace.net/2011/11/11/position-independent-code-pic-in-shared-libraries-on-x64
[8] Chandler Carruth. 2015. CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!". https://youtu.be/nXaxk27zwlk?t=2441
[9] Guillaume Chatelet. 2019. [llvm-dev] [RFC][clang/llvm] Allow efficient implementation of libc's memory functions in C/C++. https://lists.llvm.org/pipermail/llvm-dev/2019-April/131973.html
[10] Guillaume Chatelet. 2020. automemcpy paper supplementary materials. http://github.com/google-research/automemcpy
[11] Guillaume Chatelet. 2020. Benchmarking llvm-libc's memory functions. https://github.com/llvm/llvm-project/blob/main/libc/benchmarks/RATIONALE.md
[12] Guillaume Chatelet. 2020. [libc-dev] Q&A and the round table highlights from the virtual dev meeting. https://lists.llvm.org/pipermail/libc-dev/2020-October/000211.html
[13] Guillaume Chatelet. 2020. Libc mem* benchmarking framework. https://github.com/llvm/llvm-project/tree/main/libc/benchmarks
[14] Guillaume Chatelet and other contributors. 2020. Libc string functions. https://github.com/llvm/llvm-project/tree/main/libc/src/string
[15] Dehao Chen, David Xinliang Li, and Tipp Moseley. 2016. AutoFDO: Automatic Feedback-Directed Optimization for Warehouse-Scale Applications. In CGO 2016: Proceedings of the 2016 International Symposium on Code Generation and Optimization. New York, NY, USA, 12-23.
[16] Leonardo de Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Tools and Algorithms for the Construction and Analysis of Systems, C. R. Ramakrishnan and Jakob Rehof (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 337-340.
[17] Franz Franchetti, Yevgen Voronenko, and Markus Püschel. 2005. Formal Loop Merging for Signal Transforms. SIGPLAN Not. 40, 6 (June 2005), 315-326. https://doi.org/10.1145/1064978.1065048
[18] M. Frigo and S.G. Johnson. 2005. The Design and Implementation of FFTW3. Proc. IEEE 93, 2 (2005), 216-231. https://doi.org/10.1109/JPROC.2004.840301
[19] Mathias Gottschlag and Frank Bellosa. 2019. Mechanism to Mitigate AVX-Induced Frequency Reduction. CoRR abs/1901.04982 (2019). arXiv:1901.04982 http://arxiv.org/abs/1901.04982
[20] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1-17. https://doi.org/10.1145/1186736.1186737
[21] ISO/IEC. 2020. ISO International Standard ISO/IEC 9899:202x (E) - C [Working draft]. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2478.pdf
[22] Svilen Kanev, Juan Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. 2014. Profiling a warehouse-scale computer. In ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture. 158-169.
[23] Donald E. Knuth. 1971. Optimum binary search trees. Acta Informatica 1, 1 (1971), 14-25.
[24] Vlad Krasnov. 2017. On the dangers of Intel's frequency scaling. https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/
[25] Michael Matz, Jan Hubička, and others. 2014. System V Application Binary Interface: AMD64 Architecture Processor Supplement (With LP64 and ILP32 Programming Models), Version 1.0. https://gitlab.com/x86-psABIs/x86-64-ABI


[26] Carlos O'Donell. 2017. What is an indirect function (IFUNC)? https://sourceware.org/glibc/wiki/GNU_IFUNC
[27] 2021. Linux Perf Tool. https://github.com/torvalds/linux/tree/master/tools/perf
[28] Travis Downs aka BeeOnRope and Arnaud. 2019. Enhanced REP MOVSB for memcpy. https://stackoverflow.com/a/43574756
[29] Various contributors. 2021. Glibc memcpy Implementation Strategies. https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S;hb=6e02b3e9327b7dbb063958d2b124b64fcb4bbe3f#l20
[30] R. Whaley and Antoine Petitet. 2005. Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software: Practice and Experience 35 (02 2005), 101-121. https://doi.org/10.1002/spe.626
[31] Huan Ying, Hao Zhu, Donghui Wang, and Chaohuan Hou. 2013. A novel scheme to generate optimal memcpy assembly code. In 2013 IEEE Third International Conference on Information Science and Technology (ICIST). 594-597. https://doi.org/10.1109/ICIST.2013.6747619
