Paper Describes a Technique for Improving Runtime Perfor- Mance of Statically Translated Programs

Performance, Correctness, Exceptions: Pick Three A Failproof Function Isolation Method for Improving Performance of Translated Binaries Andrea Gussoni Alessandro Di Federico Pietro Fezzardi Giovanni Agosta Politecnico di Milano Politecnico di Milano Politecnico di Milano Politecnico di Milano [email protected] [email protected] [email protected] [email protected] Abstract—Binary translation is the process of taking a pro- existing programs for which the source code is not available, gram compiled for a given CPU architecture and translate it to enabling the analyst to extract insights about its execution that run on another platform without compromising its functionality. would be otherwise very though to obtain. This paper describes a technique for improving runtime performance of statically translated programs. Binary translation presents two major challenges: functional correctness, and performance. Ideally, both are very First, the program to be translated is analyzed to detect important, but in practice any implementation of a binary function boundaries. Then, each function is cloned, isolated and disentangled from the rest of the executable code. This process is translator needs to make trade-offs between them. called function isolation, and it divides the code in two separate On one hand, dynamic binary translators choose to trans- portions: the isolated realm and the non-isolated realm. late code on-the-fly during execution, favoring functional cor- Isolated functions have a simpler control-flow, allowing much rectness. This choice allows to preserve functional correct- more aggressive compiler optimizations to increase performance, ness even in presence of self-modifying code, runtime code but possibly compromising functional correctness. To prevent generation, and unpacking of compressed code. For example, this risk, this work proposes a mechanism based on stack emulators based on dynamic binary translation, such as QEMU, unwinding to allow seamless transition between the two realms are known to be able to emulate even full operating systems. while preserving the semantics, whenever an isolated function However, this requires runtime support and it incurs in a unexpectedly jumps to an unforeseen target. In this way, the significant performance overhead, since, at run-time, the code program runs in the isolated realm with improved performance translation and the execution of its output are interleaved. for most of the time, falling back to the non-isolated realm only when necessary to preserve semantics. On the other hand static binary translators make the oppo- The here proposed stack unwinding mechanism is portable site choice. They make stronger assumptions on their ability to across multiple CPU architectures. The binary translation and the statically identify all the executable code, trading off functional function isolation passes are based on state-of-the-art industry- equivalence for speed. In this way, they avoid the overhead proven open source components – QEMU and LLVM – making of interleaving translation and execution, performing all the them very stable and flexible. The presented technique is very translation ahead of time and then executing the translated robust, working independently from the quality of the functions code at full speed. In addition, static binary translators are boundaries detection. We measure the performance improve- usually able to perform more aggressive optimizations for two ments on the SPECint 2006 benchmarks [12], showing an average key reasons: 1) they have global visibility on the code they are of 42% improvement, while still passing the functional correctness translating (unlike dynamic translation), and 2) they pay the tests. cost of optimizations only once, while dynamic translators pay it at each run. This has shown to be beneficial in automated bug I. INTRODUCTION detection, especially in performance-critical scenarios such as fuzzing [11]. Binary translation is a widely used technique for program analysis. It is most useful to understand the behavior of The main drawback of this approach is that whenever the programs at runtime, on platforms that make unpractical or static analysis is wrong, the translated code loses functional extremely time-consuming to extract the necessary information correctness. Given that binary programs are huge blobs of during the execution. These situations include analysis of code interleaved executable code and data, static binary translators for a CPU architecture different from the analyst’s workstation, still need to err on the safe side, to avoid missing some pieces or exotic embedded devices that might not be physically of code, and be usable for most use cases. available or might not provide the necessary visibility and In this work we focus on a technique for improving runtime control over the execution environment. Binary translation is performance of statically translated programs. We adopt an also useful on more traditional architectures, to instrument approach based on dividing the translated binary in two realms: the realm of isolated functions and the realm of non-isolated functions. The isolated realm is composed by a set of basic Workshop on Binary Analysis Research (BAR) 2019 blocks of the original program that have been marked as 24 February 2019, San Diego, CA, USA ISBN 1-891562-58-4 belonging to the same function by a Function Boundaries https://dx.doi.org/10.14722/bar.2019.23093 Detection Analysis (FBDA). All the detected functions are www.ndss-symposium.org then cloned and isolated from the rest of the code, so that each of them can be separately optimized during translation, To achieve this result, rev.ng employs two key com- generating high-performance code. The non-isolated realm, on ponents: QEMU [7] and LLVM [13]. QEMU is employed as the other hand, is a single large function that contains all the a library to lift each basic block from the input program code of the original executable. into an architecture-independent representation, known as the QEMU tiny code. This representation was designed to perform Due to its complexity, optimizations will be less effective lightweight analyses and transformations and, therefore, it is on the non-isolated realm. However, thanks to its simplicity not very suitable to perform sophisticated analyses. For this and fidelity to the original code, it is considerably safer in reason, rev.ng translates the QEMU tiny code into LLVM IR. terms of correctness. On the other hand, the isolated realm LLVM is a robust and mature compilation framework, providing can break functional correctness if the execution reaches an a wide range of analyses and transformations, and a solid unexpected jump inside an isolated function. In such situations, infrastructure to easily develop new ones. we seamlessly fall back to the non-isolated realm, reusing existing stack unwinding mechanisms. In practice, they are While discussing the full design of rev.ng [9], [10] is used as a safety net in problematic cases to preserve functional outside the scope of this work, in the following we highlight correctness, paying the overhead of the unwinding only in the key points of interest for this paper. exceptional situations, while running at full speed in the isolated functions for most of the time. Our approach is The CPU State Variables (CSV). In rev.ng, each part of independent of the original CPU architecture of the translated the CPU state, such as registers, are represented as global program and agnostic about the ABI. Therefore, functional variables, known as the CPU State Variables (CSV). For correctness is not affected by the quality of the algorithm used instance, in x86-64, there is a 64-bit integer global variable to detect the function boundaries before isolating them, which for each general purpose register. The translated code reads is an orthogonal problem [6] [14] [5] [10]. and writes them with load and store operations. The root function. The actual translated code. Each input In summary, this paper makes the following contributions: basic block can produce multiple LLVM basic blocks. All these basic blocks are collected in a single LLVM function Design of an isolating static binary translator. We known as root. Therefore, the whole control-flow of the designed a static binary translator that divides the original program takes place inside this function. As an output program in two realms, one composed by high- example, if in the original program there is a direct jump performance isolated functions, and one composed by a from address A to address B, the generated program will single large and semantic-preserving function, from which simply jump from one of the basic blocks corresponding it is possible to migrate seamlessly. to A to the first basic block corresponding to B. Implementation of an isolating static binary translator. The dispatcher. In case of an indirect jump for which it rev.ng We implemented the technique in the binary is not possible to statically enumerate all the possible translation tool, based on well-known and tested open destinations, the execution is diverted to the dispatcher. QEMU LLVM rev.ng source frameworks: [7] and [13]. The dispatcher is a piece of code in the root function that can handle all the CPU architectures supported by QEMU decides, depending on the run-time value of the program (22+) and translate programs towards all the LLVM targets counter, the actual target of the unresolved indirect jump libgcc that support stack unwinding via [2]. in the translated program. It is implemented through a Measurement of the performance improvement. We used switch statement on the program counter, where each case rev.ng ’s algorithm [10] to detect function boundaries, and represents the address of a basic block in the original we measured the performance improvement on the SPECint program and its body is a jump to the corresponding LLVM 2006 benchmarks [12], showing an average improvement in basic block in the translated program. performance of 42%, while passing the tests for functional correctness. As an example, see Fig.

Paper Describes a Technique for Improving Runtime Perfor- Mance of Statically Translated Programs

Binary Translation Using Peephole Superoptimizers

Understanding Full Virtualization, Paravirtualization, and Hardware Assist

Specification-Driven Dynamic Binary Translation

Systems, Methods, and Computer Programs for Dynamic Binary Translation in an Interpreter

Binary Translation 1 Abstract Binary Translation Is a Technique Used To

The Evolution of an X86 Virtual Machine Monitor

18 a Retargetable Static Binary Translator for the ARM Architecture

Codbt: a Multi-Source Dynamic Binary Translator Using Hardwareâ

EECS 470 Lecture 20 Binary Translation

Binary Translation Using Peephole Superoptimizers

Dynamic Binary Translation ∗

Low Overhead Dynamic Binary Translation for ARM