Cross-ISA Machine Instrumentation Using Fast and Scalable Dynamic Binary Translation

Cross-ISA Machine Instrumentation using Fast and Scalable Dynamic Binary Translation Emilio G. Cota Luca P. Carloni Columbia University Columbia University USA USA [email protected] [email protected] Abstract CCS Concepts • Software and its engineering → Vir- The rise in instruction set architecture (ISA) diversity and tual machines; Just-in-time compilers. the growing adoption of virtual machines are driving a need Keywords Dynamic Binary Translation, Binary Instrumen- for fast, scalable, full-system, cross-ISA emulation and instru- tation, Machine Emulation, Floating Point, Scalability mentation tools. Unfortunately, achieving high performance ACM Reference Format: for these cross-ISA tools is challenging due to dynamic bi- Emilio G. Cota and Luca P. Carloni. 2019. Cross-ISA Machine In- nary translation (DBT) overhead and the complexity of instrumentation using Fast and Scalable Dynamic Binary Transla- strumenting full-system emulators. tion. In Proceedings of the 15th ACM SIGPLAN/SIGOPS International In this paper we improve cross-ISA emulation and in- Conference on Virtual Execution Environments (VEE ’19), April 14, strumentation performance through three novel techniques. 2019, Providence, RI, USA. ACM, New York, NY, USA, 14 pages. First, we increase floating point (FP) emulation performance https://doi.org/10.1145/3313808.3313811 by observing that most FP operations can be correctly emulated by surrounding the use of the host FP unit with a 1 Introduction minimal amount of non-FP code. Second, we introduce the The emergence of virtualization, combined with the rise of design of a translator with a shared code cache that scales for multi-cores and the increasing use of different instruction multi-core guests, even when they generate translated code set architectures (ISAs), highlights the need for fast, scal- in parallel at a high rate. Third, we present an ISA-agnostic able, cross-ISA machine emulators. These emulators typi- instrumentation layer that can instrument guest operations cally achieve high performance and portability by leveraging that occur outside of the DBT’s intermediate representation dynamic binary translation (DBT). Unfortunately, state-of- (IR), which are common in full-system emulators. the-art cross-ISA DBT suffers under two common scenarios. We implement our approach in Qelt, a high-performance First, emulated floating point (FP) code incurs a large over- cross-ISA machine emulator and instrumentation tool based head (2 slowdowns are typical [6]), which hinders the use × on QEMU. Our results show that Qelt scales to 32 cores when of these tools for guest applications that are FP-heavy, such emulating a guest machine used for parallel compilation, as the emulation of graphical systems (e.g. Android) or as which demonstrates scalable code translation. Furthermore, a front-end to computer architecture simulators that run experiments based on SPEC06 show that Qelt (1) outper- scientific workloads. This FP overhead is rooted in the diffi- forms QEMU as a full-system cross-ISA machine emulator culty of correctly emulating the guest’s floating point unit by 1:76 /2:18 for integer/FP workloads, (2) outperforms × × (FPU), which can greatly differ from that of the host. This is state-of-the-art, cross-ISA, full-system instrumentation tools commonly solved by forgoing the use of the host FPU, using by 1:5 -3 , and (3) can match the performance of Pin, a × × instead a much slower soft-float implementation, i.e. one in state-of-the-art, same-ISA DBI tool, when used for complex which no FP instructions are executed on the host. instrumentation such as cache simulation. Second, efficient scaling of code translation when emulating multi-core systems is challenging. The scalability of code Permission to make digital or hard copies of all or part of this work for translation is not an obvious concern in DBT, since code personal or classroom use is granted without fee provided that copies execution is usually more common. However, some server are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights workloads [12] as well as parallel compilation jobs (e.g., in for components of this work owned by others than the author(s) must cross-compilation testbeds of software projects that support be honored. Abstracting with credit is permitted. To copy otherwise, or several ISAs, such as the Chromium browser [1]) can show republish, to post on servers or to redistribute to lists, requires prior specific both high parallelism and large instruction footprints, which permission and/or a fee. Request permissions from [email protected]. can limit the scalability of their emulation, particularly when VEE ’19, April 14, 2019, Providence, RI, USA using a shared code cache. © 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. In this paper we first address these two challenges by ACM ISBN 978-1-4503-6020-3/19/04...$15.00 improving cross-ISA DBT performance and scalability. We https://doi.org/10.1145/3313808.3313811 then combine these improvements with a novel ISA-agnostic 74 VEE ’19, April 14, 2019, Providence, RI, USA Emilio G. Cota and Luca P. Carloni instrumentation layer to produce a cross-ISA dynamic binary Guest Code instrumentation (DBI) tool, whose performance is higher stq t9,-30384 than that of existing cross-ISA DBI tools (e.g., [20, 25, 49, 54]). translate() br 0x12004d890 Same-ISA DBI tools such as DynamoRIO [11] and Pin [33] provide highly customizable instrumentation in return for TCG IR tcg_gen_code() Host Code modest performance overheads, and as a result have had movi_i64 tmp3,$0xffffffffffff8950 add_i64 tmp2,t12,tmp3 mov 0xd8(%r14),%rbp tremendous success in enabling work in fields as diverse as qemu_st_i64 t9,tmp2,leq,1 add $0xffffffffffff8950,%rbp security (e.g., [16, 29]), workload characterization (e.g., [9, 45]), deterministic execution (e.g., [3, 18, 38]) and computer Figure 1. An example of portable cross-ISA DBT. Alpha code architecture simulation (e.g., [13, 28, 37]). Our goal is thus to is translated into x86_64 using QEMU’s TCG IR. narrow the performance gap between cross-ISA DBI tools employ a system call translation layer, for example to run and these state-of-the-art same-ISA tools, in order to elicit 32-bit programs on 64-bit hosts or vice versa, it is generally similarly diverse research, yet for a variety of guest ISAs. limited to programs that are compiled for the same operat- Our contributions are as follows: ing system as the host. Full-system emulation refers to the We describe a technique to leverage the host FPU when emulation of an entire system: target hardware is emulated, • performing cross-ISA DBT to achieve high emulation per- which allows DBT-based virtualization of an entire target formance. The key insight behind our approach is to limit guest. The guest’s ISA and operating system are independent the use of the host FPU to the large subset of guest FP from the host’s. operations that yield identical results on both guest and Virtual CPUs (vCPUs). In user-mode, a vCPU corresponds host FPUs, deferring to soft-float otherwise (Section 3.1). to an emulated program’s thread of execution. In full-system We present the design of a parallel cross-ISA DBT engine • mode, a vCPU is the thread of execution used to emulate a that, while remaining memory efficient via the use ofa core of the guest CPU. shared code cache, scales for multi-core guests that generate translated code in parallel (Section 3.2). Translation Block (TB). Target code is translated into We present an ISA-agnostic instrumentation layer that groups of adjacent non-branch instructions that form a so- • converts a cross-ISA DBT engine into a low-overhead called TB. Target execution is thus emulated as a traversal cross-ISA DBI tool, with support for state-of-the-art in- of connected TBs. TBs and basic blocks both terminate at strumentation features such as instrumentation injection branch instructions. TBs, however, can be terminated earlier, at the granularity of individual instructions, as well as the e.g. to force an exit from execution to handle at run-time an ability to instrument guest operations that are emulated update to emulated hardware state. outside the DBT engine (Section 3.3). Intermediate Representation (IR). Portability across ISAs We implement our approach in Qelt, a cross-ISA machine is typically achieved by relying on an IR. The example in emulator and DBI tool based on QEMU [6]. Qelt achieves Figure1 shows how QEMU’s IR, called Tiny Code Generator high performance by combining our novel techniques with (TCG), facilitates independent development of target and further DBT optimizations, which are described in Section 3.4. host-related code. As shown in Section4, Qelt scales when emulating a guest machine used for parallel compilation, and it outperforms Helpers. Sometimes the IR is not rich enough to represent (1) QEMU as both a user-mode and full-system emulator and complex guest behavior, or it could only do so by overly com- (2) state-of-the-art cross-ISA instrumentation tools. Last, for plicating target code. In such cases, helper functions are used complex instrumentation such as cache simulation, Qelt can to implement this complex behavior outside of the IR, lever- match the performance of state-of-the-art, same-ISA DBI aging the expressiveness of a full programming language. tools such as Pin. Helpers are oftentimes used to emulate instructions that in- teract with hardware state, such as the memory management 2 Background unit’s translation lookaside buffer (TLB). 2.1 Cross-ISA DBT Plugins and callbacks. Instrumentation code is built into Target and Host ISAs.

Cross-ISA Machine Instrumentation Using Fast and Scalable Dynamic Binary Translation

Binary Translation Using Peephole Superoptimizers

Understanding Full Virtualization, Paravirtualization, and Hardware Assist

Specification-Driven Dynamic Binary Translation

Systems, Methods, and Computer Programs for Dynamic Binary Translation in an Interpreter

Binary Translation 1 Abstract Binary Translation Is a Technique Used To

The Evolution of an X86 Virtual Machine Monitor

Paper Describes a Technique for Improving Runtime Perfor- Mance of Statically Translated Programs

18 a Retargetable Static Binary Translator for the ARM Architecture

Codbt: a Multi-Source Dynamic Binary Translator Using Hardwareâ

EECS 470 Lecture 20 Binary Translation

Binary Translation Using Peephole Superoptimizers

Dynamic Binary Translation ∗