Generating Low-Overhead Dynamic Binary Translators

Mathias Payer, ETH Zurich, Switzerland ([email protected])
Thomas R. Gross, ETH Zurich, Switzerland ([email protected])

Abstract

Dynamic (on-the-fly) binary translation is an important part of many systems. In this paper we discuss how to combine efficient translation with the generation of efficient code, while providing a high-level table-driven user interface that simplifies the generation of the binary translator (BT).

The translation actions of the BT are specified in high-level abstractions that are compiled into translation tables; these tables control the runtime program translation. This table generator allows a compact description of changes in the translated code.

We use fastBT, a table-based dynamic binary translator that uses a code cache and various optimizations for indirect control transfers, to illustrate the design tradeoffs in binary translators. We present an analysis of the most challenging sources of overhead and describe optimizations to further reduce these penalties. Keys to the good performance are a configurable inlining mechanism and adaptive self-modifying optimizations for indirect control transfers.

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors

General Terms Languages, Performance, Optimization, Security

Keywords Dynamic instrumentation, Dynamic translation, Binary translation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SYSTOR 2010 May 24-26, Haifa, Israel.
Copyright © 2010 ACM X-XXXXX-XXX-X/YY/MM ... $5.00

1. Introduction

Binary translation (BT) is the art of adding, removing, or replacing individual instructions of the translated program. Two different approaches to BT exist: (i) static, ahead-of-time translation and (ii) dynamic, just-in-time translation at runtime. BT enables program instrumentation and makes it possible to, e.g., add dynamic safety and validity checks to profile and validate a binary program, extend the binary program to include new functionality, or patch the program at runtime. Another aspect of BT is virtualization, where code is encapsulated in an additional layer of protection and all interactions between the binary code and the higher levels are validated.

The key advantage of static translation is that the translation process does not incur a runtime penalty and can be arbitrarily complex. The principal disadvantage is that static BT is limited to code regions that can be identified statically. This approach is problematic for dynamically loaded libraries, self-modifying code, or indirect control transfers. Dynamic BT translates code right before it is executed the first time. Dynamic BT adapts to phase changes in the program and optimizes hot regions incrementally. This paper focuses on dynamic translation.

Most dynamic translators include a code cache to lower the overhead of translation. Translated code is placed in this cache, and subsequent executions of the same code region can benefit from the already translated code. An important design decision is that the stack of the user program remains unchanged. This principle makes virtualization possible where the program does not recognize that it is translated. The untranslated return addresses result in an additional lookup overhead but are needed for some programs (e.g., for exception management, debugging, and self-modifying code).

The two key aspects in designing a fast binary translator are a lean translation process and low additional runtime overhead for any translated code region. Three forms of indirect control transfers are the source of most of the dynamic overhead:

1. Indirect jumps: The target of the jump depends on dynamic information (e.g., a memory address, or a register) and is not known at translation time. Therefore a runtime lookup is needed.

2. Indirect calls: Similar to indirect jumps, the target is not known and a runtime lookup is needed for every execution.

3. Function returns: The target of this indirect control transfer is on the stack, and the stack contains only untranslated addresses; therefore a runtime lookup is needed.

This paper presents fastBT, a generator for low-overhead, low-footprint, table-based dynamic binary translators with efficient optimizations for all forms of dynamic control transfers (e.g., novel combined predictor and fast dynamic control transfer schemes that adaptively select the optimal configuration for each indirect control transfer, or the fastRET return optimization). We argue that a single optimization will not suffice for a whole class of control transfers; depending on the arguments and the encoding, the best fitting optimization depends on the individual code location. Depending on the runtime behavior of the application, a less than optimal fit is adaptively replaced by a different strategy at runtime. This results in an optimal translation on a per-location basis for indirect control transfers and not on a per-class strategy. Additionally, fastBT offers a novel high-level interface to the translation engine based on the table generator, without sacrificing performance.

fastBT's design and implementation are architecture-neutral, but the focus of this paper is on IA-32 Linux. The current implementation provides tables for the Intel IA-32 architecture and uses a per-thread code cache for translated blocks. The output of the binary translator is constructed through user-defined actions that are called in the just-in-time translator and emit IA-32 instructions. The translation tables are generated from a high-level description and are linked to the binary translator at compile time. The user-defined actions and the high-level construction of the translation tables offer an adaptability and flexibility that is not reached by other translators. E.g., the fastBT approach allows, in a few hundred lines of code, the implementation of a library that dynamically detects and redirects all memory accesses inside a transaction to a software transactional memory system, or allows all system calls to be redirected to an authorization framework that encapsulates and virtualizes the user program.

In this paper we present the general infrastructure and methods to optimize indirect control transfers, thereby reducing the overhead of dynamic BT. We show different optimizations for function returns and indirect jumps/calls. These evaluations are made using a low-overhead binary translator: when comparing programs transformed by fastBT with the original program using a "null" transformation (i.e., a transformation without additional instrumentation, checks, or statistics compared to the original program), the transformed programs exhibit a small speedup or a minor slow-down (between -3% and 5% for the majority of benchmarks, i.e., 17 programs), noticeable overhead of 6% to 25% for 9 benchmarks, and two outliers with an overhead of 44% and 56%. The average overhead introduced through dynamic BT is 6% for the SPEC CPU2006 benchmarks. As part of the investigation of different optimization strategies, we also compare fastBT with HDTrans, PIN, and DynamoRIO; fastBT outperforms PIN and HDTrans in most benchmarks. The comparison with DynamoRIO depends on application characteristics and is discussed later in depth.

2. A generic dynamic binary translator

A generic dynamic binary translator processes basic blocks of input instructions, places the translated instructions into a code cache, and adds entries to the mapping table. The mapping table is used to map between addresses in the original program and the code cache. Only translated code of the original program that is placed in the code cache is executed. The translator is started and intercepts user code whenever the user program branches to untranslated code. The user program is resumed as soon as the target basic block is translated. See Figure 1 for an overview of such a basic translator.

The user stack contains only untranslated instruction pointers. This setup is needed to support exception management, debugging, and self-modifying code. Each one of these techniques accesses the program stack to match return addresses with known values.
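As a minimal sketch of this translate-then-execute cycle (the class and member names below are illustrative, not fastBT's actual API; a real translator emits machine code, whereas this toy records block addresses):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy model of the dispatch cycle: addresses of original basic blocks are
// mapped to slots in a code cache; branching to an untranslated address
// invokes the translator exactly once, later branches hit the mapping table.
using Addr = std::uint32_t;

class Translator {
public:
    // Returns the code-cache slot for an original address, translating
    // the basic block on the first request only.
    std::size_t dispatch(Addr original) {
        auto it = mapping_.find(original);
        if (it != mapping_.end()) return it->second;  // already translated
        cache_.push_back(original);                   // "emit" translated block
        std::size_t slot = cache_.size() - 1;
        mapping_.emplace(original, slot);
        return slot;
    }
    std::size_t translatedBlocks() const { return cache_.size(); }

private:
    std::vector<Addr> cache_;                        // stands in for emitted code
    std::unordered_map<Addr, std::size_t> mapping_;  // original -> cache slot
};
```

Executing the same block twice translates it only once; the second dispatch is a pure mapping-table hit.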

    Opcode | flags | *next tbl (or NULL) | &action func
    ···    | ···   | ···                 | ···

Table 1: Layout of an opcode table, the opcode byte is used as table offset. The flags contain information about parameters, immediate bytes, and registers.
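The table walk that this layout implies can be sketched as follows (OpcodeEntry and decode are illustrative names; fastBT's real tables carry more per-row information than shown here):

```cpp
#include <cstdint>

// One row of an opcode table: either a forwarding pointer to the table for
// the next instruction byte, or a final row with flags and an action.
struct OpcodeEntry {
    const OpcodeEntry* next;  // next-level table, or nullptr if final
    std::uint32_t flags;      // parameter/immediate/register information
    const char* action;       // action function chosen at generation time
};

// Follow the tables one instruction byte at a time until a final row is
// reached; the final row describes the decoded instruction.
inline const OpcodeEntry& decode(const OpcodeEntry table[256],
                                 const std::uint8_t* bytes) {
    const OpcodeEntry* row = &table[*bytes];
    while (row->next != nullptr)
        row = &row->next[*++bytes];  // multi-byte opcode: descend one level
    return *row;
}
```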

[Figure 1: Runtime layout of the binary translator. Basic blocks of the original program are translated and placed in the code cache using the opcode tables. (Diagram: original blocks 1-4 map via the mapping table to translated blocks 1'-3'; the untranslated block 4 points to a trampoline that triggers its translation.)]

These comparisons would not work if the stack was changed, because the return addresses on the stack would point into the code cache instead of into the user program. Therefore fastBT replaces every return instruction with an indirect control transfer to return to translated code. This design decision follows our user-space virtualization principle: the program must not know that it is translated.

2.1 Translation tables

The translation engine is a simple table-based iterator. The translator is invoked with a pointer to an untranslated basic block, allocates space for the translated instructions, adds an entry into the mapping table, and translates the basic block one instruction at a time using the translation tables.

Instructions of the IA-32 architecture have a variable length from 1 to 16 bytes. These instructions are decoded using multidimensional translation tables. The translation tables contain extensive information about all possible encodings for all IA-32 instructions and parameters. The decoding of a new instruction starts with the first byte of the instruction, which is used as an index into the first translation table. The indexed row contains information about the instruction, or a forwarding pointer to the next table that is indexed with the next instruction byte if the instruction is not yet complete. The decoding process continues until the instruction is decoded; the final indexed row contains the information about the instruction. Table 1 shows an example.

After the decoding of the instruction, the instruction and its parameters are passed to the corresponding action function. The action function handles the transcription of the instruction to the code cache. The function can alter, copy, replace, or remove the instruction.

The translation process stops at recognizable basic block boundaries like conditional branches, indirect branches, or return instructions. Backward branches are not recognizable by a single-pass translator; in this case the second part of the basic block is translated a second time. The additional translation overhead for these split basic blocks is negligible and smaller than the overhead of a multi-pass translator.

At the end of a basic block the translator checks if the outgoing edges are already translated, and adds jumps to the translated targets. For untranslated targets a trampoline is constructed that starts the translation of the target. Trampolines are short sequences of code that transfer control flow to a different location.

2.2 Predefined actions

A translator needs different action functions to support the identity transformation. The main challenge is to redirect control flow into the code cache while maintaining the illusion that the program is executed from its original location.

Simple action functions remove the instruction or copy the instruction into the code cache. Other actions are needed to keep the execution flow inside the code cache:

action_jmp: Translates a direct jump in the original program. This action uses the mapping table to look up the translated target and encodes a branch to the translated target into the code cache.

action_call: This action emits code to push the original location onto the stack and emits a branch to the translated call target.

action_jcc: Translates a conditional jump. Code that branches to the translated jump targets is emitted into the code cache. A trampoline is constructed for untranslated targets.

action_jmp_ind: This action encodes a runtime lookup that translates the original target into a target in the code cache and emits a branch to the translated target.

action_call_ind: Emits code that pushes the original location onto the stack, followed by a runtime lookup and dispatch of the indirect target to a location in the code cache.

action_ret: This action emits a translation of the return address on the stack into a target in the code cache.

2.3 Code cache

It is likely that code regions are executed multiple times during the runtime of a program. Therefore it makes sense to keep translated code in a code cache for later reuse. Code is translated only once, reducing the overall translation overhead.

As reported in [4] and [7], code sharing between threads is low for desktop applications and moderate for server applications. Therefore fastBT uses a per-thread cache strategy to increase code locality and to potentially extract better thread-local traces.

Important advantages of thread-local caches are that (i) accesses to the code cache need not be synchronized between threads, so the high overhead of locking is avoided, and (ii) it is possible to emit hard-coded pointers to thread-local data structures, which removes the need to call lookup functions. This feature is used in the translator itself for syscall and signal handling and to inline and optimize all thread-local accesses from the instrumented program.

A trampoline is a piece of code that triggers the translation of a basic block. fastBT places these trampolines in a separate code region where they can be reused after the target's translation. This keeps the code cache clean and increases code locality.

The combination of fastBT's basic translator and the code cache leads to a greedy trace extraction and code linearization. Traces are formed as a side effect of the first execution and placed in the code cache.

2.4 Mapping table

The mapping table is a hash map that maps addresses in the original program to addresses in the code cache. The hash map contains all translated basic blocks. The original program does not know that it is translated, so all targets of indirect control transfers (e.g., function returns, indirect jumps, and indirect calls) must be handled through an online lookup and dispatch. These dispatch mechanisms use the mapping table to get the corresponding translated target in the code cache. The translator uses the mapping table whenever it translates function calls, conditional branches, and direct branches.

The hash function that is used in these lookup functions returns a relative offset into the mapping table. This hash function is kept as simple as possible because (i) perfect or near-perfect hashing is not needed, (ii) the overhead for hashing should be low, and (iii) it should be implemented in a few instructions. In the case of a collision the hash function loops through the mapping table until the entry or NULL is found. Tests with SPEC CPU2006 benchmarks and various programs showed that the simple hash function addr<<3 & (HASHTABLE_SIZE-1) results in a low number of collisions. Note that the hashtable size must be 2^x bytes. A good value for x is 23, which results in an 8MB hashtable that holds up to 2^20 individual address mappings.

2.5 Signal and syscall handling

User-space binary translation systems need special treatment for signals and system calls. A task or thread can schedule individual signal handler functions and execute system calls. The kernel transfers control from kernel-space back to user-space after a system call or after the delivery of a signal. A user-space binary translator cannot control these control transfers, but must rewrite any calls that install signals or execute system calls so that the binary translator regains control after the context switch.

fastBT catches signal handlers and system calls and wraps them into trampolines that return the control flow to translated code. fastBT handles signals installed by signal and sigaction and all system calls, covering both interrupts and the sysenter instruction. The sysenter handling is specific to the syscall handling of the Linux 2.6 kernel [12].

3. Table generator

The generic translator uses a multilevel translation table with information about each possible machine code instruction. Writing these translation tables by hand is hard and error-prone, as many different combinations of instructions and prefixes exist.

fastBT uses a two-level process to offer the programmer a high-level interface. A table generator uses high-level opcode tables and generates the low-level translation tables. The low-level translation tables are then used at runtime. Figure 2 shows a schema of the compile-time table generation process. The programmer can specify how to handle specific instructions or where to insert or remove code. This attractive feature allows a compact description of changes in the translated code with a single point of change. Additional action functions that are referenced at compile time can then be added to the runtime translator, where they are used in the translation process.

[Figure 2: Compile time construction of the translator table. (Diagram: the target opcode tables are fed into the table generator, which offers a high-level interface and adapter functions and produces the optimized translator table.)]

The supplied base tables contain detailed information about register usage, memory accesses, different instruction types (e.g., whether the FPU is used), and information about individual instructions. These base tables define an identity transformation where only control flow instructions are altered to keep the execution flow under the control of the translator.

Using adapter functions, the user can specify which types of instructions are instrumented when the table is produced. The table contains an action for each existing machine code instruction. These actions can be selected during table generation based on the instruction and the properties of the instruction (e.g., register access, memory access, control flow, function calls, and so on). Figure 3 shows an adapter function that redirects all instructions that might access memory to a special handler that is used in a simple software transactional memory system. The output of the table generator is then compiled and linked together with the fastBT sources into the binary translator library. This library is then injected into the user program.

    bool isMemOp(const unsigned char* opcode,
                 const instr& disInf, std::string& action) {
      bool res;
      /* check for memory access in instr. */
      res = mayOpAccessMem(disInf.dstFlags);
      res |= mayOpAccessMem(disInf.srcFlags);
      res |= mayOpAccessMem(disInf.auxFlags);
      /* change the default action */
      if (res) { action = "handleMemOp"; }
      return res;
    }

    // in main function:
    addAnalysFunction(isMemOp);

Figure 3: Example of an adapter function that redirects all instructions that access memory to a special action (C++ code).

4. fastBT optimizations

Indirect control transfers are the largest source of overhead in a dynamic binary translator. Code translation and memory management are amortized by multiple executions of the code, but indirect control transfers incur overhead every time they are executed. The optimizations with the biggest effect in fastBT are those targeting indirect control transfers.

    jmp reg/mem      call reg/mem     ret
    -                push eip         -
    push reg/mem     push reg/mem     (addr. on stack)
    push tld         push tld         push tld
    call ind_jmp     call ind_jmp     call ind_jmp

Figure 4: Naive translation of indirect jumps/calls and return instructions. The (translated) target of indirect control transfers is not known at translation time, so the target must be looked up at runtime. tld is thread-local data.

Figure 4 shows a simple translation of the different indirect control transfers. The naive approach uses the ind_jmp routine to translate control transfers, which is responsible for a large portion of the runtime overhead. The indirect jump function maps and dispatches an address in the user program to an address in the code cache. This section discusses the different optimizations for indirect control transfers used in fastBT. These optimizations replace the call to ind_jmp and reduce the overhead through predictions, inlining, and fast lookups.

4.1 Indirect jump/call optimizations

Indirect jumps and indirect calls are very similar. Both result in a runtime lookup in the mapping table and a control transfer to a dynamic location. Additionally, the indirect call pushes its source location onto the stack. fastBT uses three different optimizations for indirect control transfers: (i) a fast indirect control transfer that is inlined into the code cache, (ii) a prediction that compares the target with the last target and caches the last destination, and (iii) a combination of both.
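The mapping-table lookup that all of these optimizations ultimately fall back to can be modeled in a few lines (the shift-and-mask hash and the 2^23-byte/8 MB sizing are from Section 2.4; the C++ scaffolding itself is illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// Open-addressed mapping table with the paper's hash:
// addr<<3 & (HASHTABLE_SIZE-1), where the table size is 2^23 bytes.
constexpr std::size_t HASHTABLE_SIZE = std::size_t(1) << 23;  // 8 MB

struct MapEntry { std::uint32_t orig; std::uint32_t cache; };  // 8 bytes

constexpr std::size_t kEntries = HASHTABLE_SIZE / sizeof(MapEntry);  // 2^20

// The hash returns a byte offset into the table; convert it to an index.
inline std::size_t slotOf(std::uint32_t addr) {
    return ((std::size_t(addr) << 3) & (HASHTABLE_SIZE - 1)) / sizeof(MapEntry);
}

// On a collision, probe linearly until the entry or an empty slot is found.
inline std::uint32_t lookup(const MapEntry* t, std::uint32_t addr) {
    for (std::size_t i = slotOf(addr); ; i = (i + 1) % kEntries) {
        if (t[i].orig == addr) return t[i].cache;  // hit: translated target
        if (t[i].orig == 0) return 0;              // miss: call the translator
    }
}

inline void insert(MapEntry* t, std::uint32_t addr, std::uint32_t cache) {
    std::size_t i = slotOf(addr);
    while (t[i].orig != 0) i = (i + 1) % kEntries;  // skip occupied slots
    t[i] = {addr, cache};
}
```

Two addresses that differ by a multiple of 2^20 hash to the same slot, which is where the probing loop takes over.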

4.1.1 Fast indirect control transfer

The fast indirect control transfer is a set of instructions that is inlined into the code cache. These instructions execute a lookup into the mapping table and dispatch to the corresponding basic block in the code cache. Figure 5 shows an inlined control transfer for indirect jumps. If an indirect call is translated, then the pushfl/popfl instructions can be omitted and a pushl $srcip must be prepended.

    pushfl
    pushl %ebx, %ecx
    movl 12(%esp), %ebx   # load target
    movl %ebx, %ecx
    andl HASH_PATTERN, %ebx
    cmpl maptbl_start(0,%ebx,8), %ecx
    jne nohit
    movl maptbl_start+4(0,%ebx,8), %ebx
    movl %ebx, tld->jumptarget
    popl %ecx, %ebx
    popfl
    jmpl *(tld->jumptarget)
    nohit:
    popl %ecx, %ebx
    popfl
    pushl $tld
    call ind_jmp

Figure 5: A translated fast indirect control transfer. If the mapping table lookup is successful, execution is redirected to the translated counterpart, otherwise an ind_jmp is used to recover. tld is the pointer to thread-local data.

Using this optimization, an indirect jump in the source program is translated to 14 instructions and an indirect call in the source program is translated to 13 instructions if the mapping table lookup is successful. Otherwise control is transferred from the code cache to the translator, which checks if there was a mapping table miss or if the target needs to be translated first.

4.1.2 Indirect branch prediction

The indirect branch prediction is an optimistic optimization that speculates that the number of different targets for a certain branch location is low. This optimization caches one or two source addresses and destinations locally in the code cache. If the prediction is correct, then execution can continue directly at the translated target. Otherwise a fixup routine is executed to update the local cache. Figure 6 shows the different indirect branch predictions for indirect calls and indirect jumps.

    Indirect call               Indirect jump
    pushl $srcip                pushfl
    cmpl $ABCD, xx(%src)        cmpl $ABCD, xx(%src)
    je $transABCD               jne fixup
    cmpl $EF01, xx(%src)        popfl
    je $transEF01               jmp $transABCD
    pushl xx(%src)              fixup: popfl
    pushl $tld                  pushl xx(%src)
    pushl $addrOf(ABCD)         pushl $tld
    pushl $addrOf(EF01)         pushl $addrOf(ABCD)
    call fix_ind_call_pred      call fix_ind_jmp_pred

Figure 6: A translated indirect control transfer using the prediction optimization. The prediction compares the indirect target with the local cache. If there is no cache hit then a fixup routine is used to recover. tld is the pointer to thread-local data.

If the prediction is correct, then an indirect jump is translated to 5 instructions and an indirect call to 3 or 5 instructions, depending on whether the first or second prediction is correct. Otherwise a fixup routine is executed which updates the prediction with the current source address and destination. For the indirect call prediction the least recently updated entry gets discarded.

4.1.3 Adaptive combined optimization

The indirect branch prediction is an optimistic optimization. Depending on the usage of indirect control transfers in the program and the number of branch targets, the prediction can outperform a fast indirect control transfer, or fall behind if the miss rate is high.

The combined optimization for indirect control transfers adds a counter for the number of mispredictions per indirect control transfer. After a certain number of mispredictions the prediction is removed and replaced by a fast indirect control transfer.

This optimization offers the best performance on average. All programs that we analyzed have some locations where the prediction is faster than the fast indirect control transfer, but there are also locations which suffer under a high miss rate. The combined optimization optimizes indirect control transfers on a per-location basis and not through a global, compile-time selection of a single optimization. The adaptive combined optimization adaptively selects the best optimization for each location.

4.1.4 Shadow jump table

The shadow jump table optimization is a novel optimization for indirect jumps. This optimization is applied only to a subset of indirect jumps that look as if they used a jump table (e.g., jmpl *addr(, %reg, 4), where reg is any register and addr is the base address of a jump table).

For such jumps a shadow jump table is constructed that contains translated locations instead of locations in the user program. The indirect jump is then prepended by a bound check, and the base address of the indirect jump is replaced by the new shadow jump table. The size of the jump table cannot be determined at compile time, so the translated jump must check the bounds. Figure 7 shows a translated indirect jump redirected through a shadow jump table.

    pushfl
    cmpl JUMPTABLE_SIZE, %reg
    jge jmpind
    popfl
    jmpl shadow_base(,%reg, 4)
    jmpind:
    popfl
    pushl jmp_base(, %reg, 4)
    pushl $tld
    call ind_jmp

Figure 7: A translated indirect jump which uses the shadow jump table optimization. The shadow jump table is constructed at translation time and values are patched in whenever a location is used the first time. tld is the pointer to thread-local data.

This optimization translates an indirect jump into 5 instructions if the entry in the shadow jump table is already translated and the jump table is smaller than the maximum size. This optimization covers most of the cases where the prediction optimization does not suffice and is able to outperform the fast indirect control transfer.

4.2 Function call optimizations

Return instructions are responsible for a major fraction of the instrumentation overhead. fastBT offers several different optimization strategies for return instructions. These optimizations reduce the costs of the indirect control transfers.

The call/ret relationship offers an optimization opportunity, because the return address is already known at the time of the call. The return cache optimization tries to exploit this duality and decreases the number of executed instructions in the best case.

The indirect branch prediction does not work for return instructions, although they are similar to indirect jumps. The return instruction transfers control flow to the address that is on the top of the stack. Many functions are called from multiple locations, therefore a simple predictor does not work. Even a combined predictor with fastRET is still slower than fastRET alone.

4.2.1 fastRET

The fastRET optimization applies the idea of a fast indirect control transfer to return instructions (see Section 4.1.1). Similar code as for an indirect call is emitted into the code cache that executes a lookup and dispatch of the address that is on the stack. This translates a return instruction into 12 instructions in the code cache.

4.2.2 Return cache

The return cache optimization uses the duality between a function call and the following return instruction and is somewhat similar to HDTrans' return caching.

The call instruction pushes the return address onto the stack, updates the thread-local return cache, and performs a jump to the translated call target. The translation of the return instruction performs an indirect jump through the return cache. The return cache targets check if the top of the stack matches their prediction. If the prediction does not match, a fast indirect jump is used to recover; otherwise execution continues at the correct location.

    Return instruction          Return trampoline
    pushl %ebx                  cmpl 4(%esp), rip
    movzbl 4(%esp), %ebx        jz hit
    jmpl *rc(, %ebx, 4)         popl %ebx
                                call ind_jmp
    Call instruction            hit: popl %ebx
    movl $ret_tramp, rc+offs    leal 4(%esp), %esp

Figure 8: Translated return and call instructions and the corresponding trampoline that is referenced through the return cache. rc is the address of the return cache, rip is the return instruction pointer.

Figure 8 shows a translated return instruction and the corresponding trampoline that is reached through the earlier indirect jump. This optimization results in 7 instructions if the cache contains the correct trampoline. If the call and ret instructions do not match (e.g., due to recursion), an additional indirect jump is executed.
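The per-location switch from prediction to fast lookup (Section 4.1.3) can be sketched as a miss counter; the threshold value and the struct-based modeling are illustrative, since fastBT rewrites the emitted code rather than interpreting state like this at runtime:

```cpp
#include <cstdint>

// Per-location state for an indirect control transfer: start optimistic,
// and after too many mispredictions replace the prediction with the
// fast inlined mapping-table lookup.
struct IndirectSite {
    enum Mode { kPrediction, kFastLookup };
    Mode mode = kPrediction;
    std::uint32_t cached = 0;        // last observed target
    unsigned misses = 0;
    static constexpr unsigned kMissLimit = 16;  // illustrative threshold

    // Returns true on a prediction hit; on a miss, runs the "fixup" and
    // eventually rewrites the site to use the fast lookup.
    bool dispatch(std::uint32_t target) {
        if (mode == kFastLookup) return false;   // always full lookup now
        if (target == cached) return true;       // prediction hit
        cached = target;                         // fixup: update prediction
        if (++misses >= kMissLimit) mode = kFastLookup;  // rewrite site
        return false;
    }
};
```

A site with a stable target stays on the cheap prediction; a site that alternates between targets quickly crosses the threshold and is converted once, after which no further bookkeeping is done.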

Differences to HDTrans are that (i) fastBT uses a generic return trampoline for the call instrumentation that is backpatched once the return location is translated, and (ii) the protocol expects that the ebx register is already pushed and can be used in the return trampoline and during the fixup phase, which saves some instructions if there is no cache hit.
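The return-cache protocol can be modeled as a small direct-mapped table indexed by a byte of the return address, mirroring the movzbl-indexed jump in Figure 8 (the 256-entry size matches the byte index; the C++ names are illustrative, and real slots hold trampoline addresses rather than the raw return addresses used here):

```cpp
#include <cstdint>

// Thread-local return cache: one slot per value of the return address's
// low byte. The call fills the slot; the return jumps through it and the
// trampoline verifies that the prediction matches the stacked address.
struct ReturnCache {
    std::uint32_t slot[256] = {};

    // call: record the return address (stands in for the trampoline patch)
    void onCall(std::uint32_t retAddr) { slot[retAddr & 0xff] = retAddr; }

    // ret: true if the trampoline's check succeeds (fast path); false
    // means falling back to the full mapping-table lookup (ind_jmp).
    bool onReturn(std::uint32_t retAddr) const {
        return slot[retAddr & 0xff] == retAddr;
    }
};
```

Two call sites whose return addresses share a low byte evict each other, which is exactly the case where the extra indirect jump of the slow path is taken.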

4.2.3 Function inlining

Function inlining is an effective way of reducing the overhead of indirect control transfers. If a function is inlined, then the return instruction can be translated as an addition of 4 to the esp register instead of an indirect jump. This saves at least 12 instructions compared to the fastRET optimization.

5. Performance evaluation

This section shows an in-depth performance evaluation of fastBT, including an analysis of individual configurations and collected runtime statistics.

The benchmarks were run on an Ubuntu 9.04 system with an E6850 Core2Duo CPU running at 3.00GHz, 2GB RAM, and GCC version 4.3.3. The SPEC CPU2006 benchmark suite version 1.0.1 is used to evaluate the single-threaded performance.
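Throughout this section, average overheads compare total translated against total untranslated execution time; as a one-line sketch (the function name is illustrative):

```cpp
#include <numeric>
#include <vector>

// Average overhead across a benchmark suite: the ratio of summed
// translated run times to summed untranslated run times, as a percentage.
inline double averageOverheadPercent(const std::vector<double>& untranslated,
                                     const std::vector<double>& translated) {
    double base = std::accumulate(untranslated.begin(), untranslated.end(), 0.0);
    double inst = std::accumulate(translated.begin(), translated.end(), 0.0);
    return (inst / base - 1.0) * 100.0;
}
```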

We compare fastBT with three other binary translators: HDTrans [24] version 0.4.1, PIN [18] revision 29972, and DynamoRIO [5] version 1.4.0-20. Averages in the tables are calculated by comparing the overall execution time of untranslated runs against translated runs for all programs.

5.1 Instrumentation overhead

Figure 9 shows different configurations of fastBT and compares the resulting overhead. The configuration names are composed of optimizations for return instructions (Rx), indirect call instructions (Cx), and indirect jump instructions (Jx). Surveyed optimizations for x are F for fast control transfers, P for predicting optimizations, and C and Mult for combined optimizations. The different configurations are:

1. RFCFJF: A fast indirect control transfer is used for return instructions, indirect calls, and indirect jumps.

2. RFCPJF: A fast indirect control transfer is used for return instructions and indirect jumps. A predictor is used for indirect calls.

3. RFCFJP: A fast indirect control transfer is used for return instructions and indirect calls. A predictor is used for indirect jumps.

4. RFCCJC: The fastRET optimization is used together with the combined indirect call and jump optimization.

5. RFCCJMult: The fastRET optimization is used together with the combined indirect control transfer optimization for indirect calls and indirect jumps and the jump table optimization for suited indirect jumps.

6. RCCCJMult: This configuration combines the combined optimization for indirect calls and indirect jumps, the jump table optimization for suited indirect jumps, and the return cache for return instructions.

Figure 9: Overhead of different fastBT configurations relative to an untranslated run. Lower is better. The graph for 400.perlbench has been cut off for RFCFJP; the overhead for this outlier is 227.16%.

The indirect call prediction is well suited for shared libraries as targets are loaded dynamically after the program has started and the library loader uses indirect calls. Furthermore, the targets of these calls remain constant during program execution; as a consequence the target cache has a high hit rate.

The comparison of the different optimization strategies shows that different benchmarks profit from different optimizations. Benchmarks written in object-oriented languages profit from the indirect call prediction as this optimization enables fast virtual calls without a mapping table lookup. Other benchmarks like 400.perlbench profit from the jump table optimization that speeds up switch statements, which are used often in interpreters.

The RFCFJP configuration shows bad performance due to some indirect jump locations that incur a high number of mispredictions. But the combined optimization selects the best optimization for each individual location so that the predictor can reduce overhead for indirect jumps as well. The statistics show that the number of removed predictions is low for all benchmarks.

The most important result of the comparison in Figure 9 is that no single optimization works for a whole class of indirect control transfers; the binary translator must adapt and select the best possible optimization for each individual translated indirect control transfer. The combined optimizations of fastBT adaptively select the most promising configuration on a per-location basis and perform well in practice.

5.2 Difference to fastBT 0.1

An earlier version of fastBT [22] used a combination of function inlining, an earlier version of fastRET, fast indirect control transfers for indirect jumps, and a single prediction for indirect calls. This combination offered good average performance and low overhead, but the combined predictions and fast control transfers, the jump table optimization for indirect jumps, and the return cache lower the average overhead further. Table 2 offers an overview of the most challenging benchmarks with the highest overhead.

Benchmark       fastBT 0.1  fastBT
400.perlbench       88%       56%
403.gcc             39%       21%
445.gobmk           34%       18%
458.sjeng           49%       25%
464.h264ref         43%       6.2%
471.omnetpp         52%       14%
483.xalan          100%       24%
447.dealII          99%       44%
453.povray         103%       22%

Table 2: Comparison of the old fastBT version to the current optimizations for challenging benchmarks and an average overhead for the benchmarks with overhead above 5%.

The additional optimizations and combined optimizations in the new version of fastBT show that it is important to distinguish between predictable indirect control transfers with only a few target locations and other transfers that incur a high miss rate. The online analysis makes it possible to select the optimal optimization for each individual indirect control transfer.

5.3 Benchmark statistics and classification

Table 3 shows detailed information for the different SPEC CPU2006 benchmarks. For some benchmarks inlining removes up to 85.1% of the function calls and reduces the overhead significantly. For the remaining benchmarks the return cache reduces the overhead in most cases (between 0% and 48.9% of the calls are redirected to the fast return optimization).

The table shows that for indirect jumps the jump table optimization and the prediction cover most of the indirect control transfers, and the fast redirection is seldom needed. For the majority of benchmarks the indirect call prediction is sufficient. But some benchmarks show poor performance for indirect calls. Further optimizations are needed to reduce the overhead of indirect calls for benchmarks like 400.perlbench, 458.sjeng, and 453.povray.

Benchmark        Function calls (inlined) (retc miss)  Indirect jumps (jmptbl) (pred)  Indirect calls  (pred)
400.perlbench     25814694843   8.1%   22.7%    21930067960  93.7%    6.3%    3903053407    7.4%
401.bzip2          6680608156   0.0%   28.2%        1869734   8.8%   91.2%          2076  100.0%
403.gcc           11731029096   2.7%   20.4%     5275640406  84.8%   10.1%     653683244   52.9%
429.mcf            6945269263   0.0%    0.1%        1708543   0.0%  100.0%        574680  100.0%
445.gobmk         18001407703   1.3%   33.9%       93605731   1.0%   99.0%     185811548    4.1%
456.hmmer           212603875  27.6%    3.0%      163062236  39.0%   61.0%       1138925  100.0%
458.sjeng         24769309279   1.1%   13.6%    10992929516  97.5%    2.5%    5070023361    0.0%
462.libquantum     1762604025  50.0%    0.0%            784   0.0%  100.0%           249    1.6%
464.h264ref       37463457335   7.4%    4.0%     1189567749   1.1%   98.9%   28445058263   99.2%
471.omnetpp       19737972588  18.0%   33.1%     3153780227  21.3%   78.6%    2733908888   89.8%
473.astar         15647809660  35.2%   32.0%       10808740   0.0%  100.0%    4996558207  100.0%
483.xalancbmk     28888952021  10.6%   17.7%     2627706146  27.0%   63.6%    9161680928   96.1%
410.bwaves         1270922878  85.1%    0.0%      216329663   0.0%  100.0%          4537   90.9%
416.gamess         2778622627  35.2%   33.0%     1889868887  74.1%   25.8%        820281  100.0%
433.milc           6682737588   1.4%   11.2%       11999701  32.0%   68.0%       3856876  100.0%
434.zeusmp              96755  38.2%   10.3%          20806   0.0%   82.2%          3084  100.0%
435.gromacs        3339398623  78.9%    4.8%       27898268   2.6%   95.7%       3274864   99.1%
436.cactusADM      1668709640   0.5%    0.4%     1650774821   0.0%  100.0%        223605  100.0%
437.leslie3d          6419497  50.7%   18.2%        2063899   0.0%   98.5%         88425  100.0%
444.namd             27828322  24.6%    5.0%       16777617   0.0%   88.9%       1969171  100.0%
447.dealII        52756915715  54.5%   38.8%    21147326182   1.7%   98.3%     540282822   98.4%
450.soplex         2537751419   3.5%    3.3%     1707604759  83.8%   16.2%      27482402  100.0%
453.povray        25743213080   5.9%   28.6%      456989748  34.9%   65.0%    7072491440    5.0%
454.calculix       5193872239  25.5%   26.6%      502409545   2.1%   95.7%      11962078  100.0%
459.GemsFDTD       4527866169  50.3%   48.9%     1510763422   0.0%   99.9%      11839192  100.0%
465.tonto         28170484438  50.2%   21.5%     9572766779   7.6%   92.4%       2648364  100.0%
470.lbm               5268354  50.0%    0.0%        2627702   0.0%  100.0%          3761  100.0%
482.sphinx3        7456046206  10.6%   21.2%      273089361   0.0%   98.4%       6126896  100.0%

Table 3: Detailed per-benchmark statistics. Of the absolute number of function calls, (inlined) gives the percentage executed inline and (retc miss) the percentage of function calls that miss the return cache. The absolute number of indirect jumps per benchmark is divided into the percentage executed through a jump table (jmptbl) and through a prediction (pred); the remaining indirect jumps are executed through a fast indirect jump. Of the overall number of indirect calls, (pred) gives the percentage executed through predictions; the remaining calls are executed with the fast indirect call.
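The prediction columns in Table 3 can be pictured as a per-site inline cache with a slow-path fallback. The sketch below is a simplified C model, not fastBT's emitted code — the translator emits the equivalent compare-and-branch as hand-optimized assembly directly into the code cache, and the names (`call_site`, `mapping_lookup`, `dispatch_indirect`) are illustrative:

```c
#include <stddef.h>

/* Model of an inline prediction at one indirect call or jump site. */
struct call_site {
    void *predicted_orig;   /* last original target seen at this site */
    void *predicted_trans;  /* its translated counterpart in the code cache */
    unsigned long misses;   /* online statistic used to reselect the
                               optimization for this site */
};

/* Slow path: the mapping table from original to translated addresses,
 * modeled here as two parallel arrays. */
struct mapping {
    void **orig;
    void **trans;
    size_t n;
};

static void *mapping_lookup(const struct mapping *m, void *orig)
{
    for (size_t i = 0; i < m->n; i++)
        if (m->orig[i] == orig)
            return m->trans[i];
    return NULL;  /* unknown target: the translator would translate it now */
}

/* Fast path: one compare against the cached prediction; on a miss the
 * prediction is updated and the miss counter feeds the online analysis. */
void *dispatch_indirect(struct call_site *s, const struct mapping *m,
                        void *orig_target)
{
    if (s->predicted_orig == orig_target)
        return s->predicted_trans;       /* prediction hit, no table lookup */
    s->misses++;
    s->predicted_orig = orig_target;
    s->predicted_trans = mapping_lookup(m, orig_target);
    return s->predicted_trans;
}
```

A site whose miss counter stays low keeps the predictor; a site with many misses is a candidate for the jump table optimization or the plain fast transfer — the per-location selection that the combined (C/Mult) configurations perform.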

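The (retc miss) column reflects the return cache, which pairs each call with its return: the translated call announces the translated return address in a direct-mapped slot, and the translated ret jumps through that slot, falling back to the full lookup on a mismatch. The following C model is a simplification under assumed names (`rcache`, `rc_on_call`, `rc_on_ret`); the real protocol is emitted as assembly into the code cache:

```c
#include <stdint.h>
#include <stddef.h>

#define RC_BITS 8
#define RC_SIZE (1u << RC_BITS)

/* One slot per hash bucket: the original return address and its
 * translated counterpart, written by the translated call. */
struct rc_entry {
    uintptr_t orig_rip;   /* original return address */
    void     *trans_rip;  /* translated return address in the code cache */
};

static struct rc_entry rcache[RC_SIZE];

static size_t rc_hash(uintptr_t rip)
{
    return (rip >> 2) & (RC_SIZE - 1);  /* cheap hash, as a stand-in */
}

/* Executed on the call side: announce where the matching ret should go. */
void rc_on_call(uintptr_t orig_ret, void *trans_ret)
{
    struct rc_entry *e = &rcache[rc_hash(orig_ret)];
    e->orig_rip  = orig_ret;
    e->trans_rip = trans_ret;
}

/* Executed on the ret side: returns the cached translated address, or
 * NULL when the slot was evicted or never filled — a "retc miss" that
 * falls back to the regular mapping-table lookup. */
void *rc_on_ret(uintptr_t orig_ret)
{
    struct rc_entry *e = &rcache[rc_hash(orig_ret)];
    return (e->orig_rip == orig_ret) ? e->trans_rip : NULL;
}
```

Recursion and call-heavy code reuse the same slots, which is why the miss rates in Table 3 stay bounded even with a small cache.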
The detailed statistics show that for these three benchmarks only a small number of indirect call sites is responsible for a high percentage of the overall calls. For 400.perlbench, 24 out of 322 translated calls fall back to the fast indirect call but are responsible for 92.6% of the calls. For 458.sjeng, one out of 43 translated indirect calls is responsible for almost 100% of the fast indirect calls. For 453.povray, eight out of 112 indirect calls fall back to the fast indirect call and are responsible for 95% of the overall indirect calls.

This information shows that only a small fraction of indirect calls is responsible for most of the remaining indirect call overhead. A possible future optimization could target the low prediction rate of these indirect calls.

5.4 Comparative performance

Table 4 compares the overhead of fastBT to PIN, HDTrans, and DynamoRIO. fastBT outperforms PIN in every benchmark and HDTrans in most benchmarks. The 400.perlbench benchmark is the only benchmark where the overhead of HDTrans is significantly lower than the fastBT overhead (44% for HDTrans versus 56% for fastBT). Part of the additional overhead results from the missed optimization opportunity for indirect calls that exhibit a low correct prediction rate in the 400.perlbench benchmark.

The 483.xalancbmk benchmark shows that fastBT predicts most of the indirect calls and indirect jumps for benchmarks written in object-oriented languages. The resulting overhead for fastBT is 24% compared to 74% for HDTrans, 97% for PIN, and 18% for DynamoRIO.
Table 4 (reference dataset):

Benchmark        no BT  fastBT   HDT    PIN  dRIO
400.perlbench     486s     56%   44%   126%   36%
401.bzip2         668s      4%    3%    36%    4%
403.gcc           441s     21%   20%   140%   25%
429.mcf           405s      0%    0%     6%    0%
445.gobmk         611s     18%   19%    91%   32%
456.hmmer         926s      5%    4%    45%    1%
458.sjeng         727s     25%   21%    64%   23%
462.libquantum   1020s      1%    2%    25%    2%
464.h264ref       989s      6%   17%   227%   20%
471.omnetpp       496s     14%   18%    28%    8%
473.astar         601s      4%    7%    21%    2%
483.xalancbmk     371s     24%   74%    97%   18%
410.bwaves        895s      2%    1%    11%    4%
416.gamess       1430s     -3%   -6%    13%   -7%
433.milc          827s      1%    1%     7%    2%
434.zeusmp        793s      0%    0%     6%    0%
435.gromacs      1440s      0%    0%     3%    0%
436.cactusADM    1520s      0%    0%     1%    0%
437.leslie3d     1160s      0%    0%    19%    0%
444.namd          616s      1%    0%     7%   -1%
447.dealII        552s     44%   55%    54%   14%
450.soplex        538s      7%    3%    17%    2%
453.povray        362s     22%   25%    39%   17%
454.calculix     1790s     -2%   -1%    13%   -1%
459.GemsFDTD     1120s      2%    1%     4%    1%
465.tonto         925s      9%    9%    34%    8%
470.lbm           930s      0%    0%     0%   -3%
482.sphinx3       846s      2%    3%    19%   -1%
Average           839s      6%    7%    34%    5%

Table 5 (test dataset):

Benchmark        no BT     fBT   HDT    PIN  dRIO
400.perlbench       4s     29%   26%  1415%  308%
401.bzip2           8s      2%    0%    41%    3%
403.gcc             1s     41%   48%   776%  247%
429.mcf             3s      1%    0%    28%    5%
445.gobmk          21s     18%   19%   139%   43%
456.hmmer           6s      6%    5%    46%    0%
458.sjeng           5s     24%   20%    82%   28%
462.libquantum    <1s     33%   11%   453%  100%
464.h264ref        23s      6%   23%    83%   25%
471.omnetpp         1s     66%  142%   363%   74%
473.astar          11s      4%    6%    26%    5%
483.xalancbmk     <1s     56%  139%  3481%  745%
410.bwaves         12s      2%    1%    16%    4%
416.gamess          1s      9%   15%   392%   84%
433.milc           10s      2%    4%    38%    9%
434.zeusmp         21s      0%    0%    14%    1%
435.gromacs         3s      2%    1%    42%    8%
436.cactusADM       4s      0%    0%    37%    7%
437.leslie3d       29s      0%    1%    21%    1%
444.namd           17s      1%    1%    13%    0%
447.dealII         25s     36%   47%    69%   16%
450.soplex        <1s    102%   45%  3836%  929%
453.povray          1s     23%   28%   221%   62%
454.calculix      <1s     59%   45%  2346%  513%
459.GemsFDTD        4s      7%   13%    70%   15%
465.tonto           1s     31%   27%   300%   74%
470.lbm             6s      1%    1%     8%    0%
482.sphinx3         2s     12%    9%    70%   18%
Average             8s     10%   13%    87%   21%

Table 4: Comparison of overhead introduced through different binary translation systems (fastBT, HDTrans, PIN, and DynamoRIO) using the SPEC CPU2006 benchmarks. No BT denotes the runtime in seconds without BT.

Table 5: Comparison of overhead for short programs introduced through different binary translation systems (fastBT, HDTrans, PIN, and DynamoRIO) using the SPEC CPU2006 benchmarks with the test dataset. No BT denotes the runtime in seconds without binary translation.

The same predictors work for the 464.h264ref benchmark, where fastBT has a low overhead of 6% compared to 17% for HDTrans, 227% for PIN, and 20% for DynamoRIO.

The average overhead for fastBT is 6% compared to 7% for HDTrans, 34% for PIN, and 5% for DynamoRIO. These numbers show that fastBT succeeds in combining a high-level interface at compile time with a simple table-based translator implementation and powerful optimizations to achieve performance similar to a full-blown binary optimizer based on intermediate representations like DynamoRIO.

5.5 Translation efficiency

The numbers in Section 5.4 show the translation efficiency of the presented binary translators and the overhead introduced through the binary translation process. Table 5 shows a comparison of overhead for short programs. This comparison uses the test dataset of the SPEC CPU2006 benchmarks to measure the startup and translation overhead of the different binary translators.

It is important that a binary translator produces efficient code, but the overhead of the translation process should be low as well. Only if both properties are fulfilled is it possible to translate programs efficiently.

The average overhead for small programs as shown in Table 5 is 10% for fastBT compared to 13% for HDTrans, 87% for PIN, and 21% for DynamoRIO. These numbers show that fastBT scales well for small programs and that the fastBT translation process is efficient.

It remains unclear whether approaches working with an intermediate representation (IR) offer a significant benefit over the simple table-based translation approach, because the overhead for profiling and trace selection, as well as IR generation and optimization, must be compensated. The numbers show that fastBT competes with IR-based solutions for long-running benchmarks. Especially for short-running programs, IR-based translators cannot achieve competitive execution times, and fastBT clearly outperforms the IR-based solutions DynamoRIO and PIN.

6. Related work

Binary translation has a long history [10, 13, 19]. In this section we focus on current binary translators that are similar to fastBT either in the offered interface and the target application or in the design and implementation. Dynamic binary translators can be separated into two groups: The first group uses a rewriting scheme based on low-level translation tables (e.g., fastBT, HDTrans, Mojo [25], and JudoDBR [21]). The second group parses the machine code stream into an intermediate representation (IR) and uses common techniques to work on that IR (e.g., DynamoRIO [6], PIN [18], and Valgrind [20]).

IR-based approaches offer the advantage of more sophisticated optimization possibilities, but the IR translation comes at a higher runtime cost, which must be amortized. IR-based translators can use profiles to generate runtime information, but there is no structure or type information at the machine-code level; therefore it is hard to achieve good performance with IR-based translation systems. Some binary translation systems (e.g., [15, 17]) use a dual approach of profiling and adaptive optimization to limit the overhead. FX!32 [9] uses emulation for infrequently executed code, and profile generation and offline static recompilation of hot code to speed up overall execution.

The most important overhead of dynamic binary translators is the handling of indirect control transfers. Hiser et al. [16] cover different indirect branch mechanisms like lookup inlining, sieves, and a translation cache.

A related topic that uses binary translation is full system virtualization. QEMU [3] provides full system cross-machine virtualization. VMware [8, 11] is a full system virtualization system that uses dynamic translation for privileged instructions in supervisor mode, unlike Xen [2], which uses para-virtualization and replaces privileged instructions at the source level.

6.1 HDTrans

HDTrans [23, 24] is a light-weight, table-based instrumentation system. A code cache is used for translated code, together with trace linearization and optimizations for indirect jumps. HDTrans resembles fastBT most closely with respect to speed and implementation, but there are significant differences.

HDTrans requires hand-coded low-level translation tables that specify translation properties and actions for every single instruction. This setup requires a very deep understanding of the translator and of the relations between different optimizations. fastBT raises the level of interaction with the translation system: the low-level translation tables are generated by a table generator that specifies transformations on a high level. Instruction transformations can be specified based on instruction properties (e.g., memory access, specific registers, special-function instructions, or extension groups like SSE and FPU) or on an instruction opcode.

To make indirect jumps faster, HDTrans uses a hash table of jump code blocks called a sieve. The target is hashed, and then a jump into the sieve is issued. The code block in the sieve then takes care of the jump to the correct location, whereas fastBT uses a simpler hash table and hand-optimized assembly code that is emitted into the code cache. Instead of a sieve, fastBT uses a lookup table that maps locations in the code cache to locations in the original program.

Return instructions are handled differently from normal indirect control transfers to take care of the call/ret relationship. HDTrans uses a (small) direct-mapped cache to speed up return instructions, with a fall-back to the sieve. fastBT uses a similar approach with the return cache, which uses a bidirectional protocol.

Compared to fastBT, HDTrans translates longer chunks of code at a time, stopping only at conditional jumps or return instructions. This strategy can result in longer stalls for the program. Trampolines that start the translator for not-yet-translated targets are inserted into the basic blocks themselves; therefore the memory regions for these instructions cannot be recycled after the target is translated.

6.2 Dynamo and DynamoRIO

Dynamo is a dynamic optimization system developed by Bala et al. [1]. DynamoRIO [5, 6, 14] is the binary-only IA-32 version of Dynamo for Linux and Windows. The translator extracts and optimizes traces for hot regions. Hot regions are identified by adding profiling information to the instruction stream. These regions are converted into an IR, optimized, and recompiled to native code. DynamoRIO is also used in a project that includes a manager for bounded caches that clears no longer needed blocks out of the code cache [14]. The main purpose of the DynamoRIO project is shifting from a binary optimizer to a binary instrumentation system.

Compared to fastBT, DynamoRIO translates machine code into an IR and uses a binary optimization framework and heavy-weight transformations to generate code. fastBT, on the other hand, uses a light-weight table-based approach.

6.3 PIN

PIN [18] is a dynamic instrumentation system that exports a high-level instrumentation API that can be used during the runtime of a program. The system offers an online high-level interface to the instruction flow of the instrumented program. A user program can define injections or alterations of the instruction stream of a running application. PIN uses the user-supplied definition and dynamically instruments the running program. PIN employs code transformations like register reallocation, inlining, and instruction scheduling to make the instrumented code fast, but PIN still must pay for the high-level interface. Unfortunately the instrumentation system of PIN is distributed in binary form only, except the library, which is available as open source.

In contrast to PIN, fastBT offers the high-level interface at compile time. A table-based translator that is then used at runtime is generated using the high-level interface. This limits fastBT's alteration possibilities at runtime but removes unneeded flexibility. To keep the design simple, fastBT also does not implement transformations like register reallocation, although they could be added to the fastBT framework.

6.4 Valgrind

Valgrind is described in [20] as a heavy-weight dynamic binary analysis tool. Its main applications are checkers and profilers. Machine code is translated into an IR, which is then instrumented using high-level plug-ins. Valgrind keeps information about every instruction and memory cell that is used at runtime. This feature makes it possible to implement new and different instrumentation tools that are not possible in other environments. Valgrind is designed primarily for flexibility and not for performance. This makes Valgrind perfect for basic block profiling, memory checking, and debugging. But due to the high overhead, Valgrind is unsuited for large and performance-critical programs, as shown by experiments with the SPEC CPU2006 benchmark suite.

7. Conclusion

Dynamic binary translation is an important building block of many software systems, and its importance is likely to increase in the future. We use fastBT to gain insight into the overheads of binary translation. fastBT is a low-overhead, simple, table-based binary translator that yields good performance: the translation is fast and incremental, and the translated SPEC CPU2006 programs execute with a low overhead of 6% on average and between -3% and 6% overhead for the majority (17 out of 28) of programs.

fastBT demonstrates that a high-level interface to the binary translator is not an impediment to low overhead in the translated program. The lean translation process makes it possible to use binary translation also in scenarios that include short-running programs. Other binary translators that build up expensive internal data structures have only a slight edge for long-running programs but are unable to recover the substantial translation overhead for short-running programs. As we envision the use of binary translators for a wide range of scenarios (e.g., sandboxing, software transactional memory) that involve programs with short execution time, dynamic binary translators must limit the translation overhead. Only then is it possible to obtain efficiency in the execution of the translated program, as the cost of translation incurs an execution penalty in the context of a dynamic translator.

Different forms of indirect control transfers limit the performance of software-based translation. fastBT lowers the overhead of all indirect control transfers using self-modifying optimizations that select a suitable configuration for each individual location at runtime instead of a global selection of a single optimization for all indirect control transfers.

Using the table generator and a high-level interface, a user can configure the binary translator at compile time and obtain a more flexible design than is possible by using low-level tables.

Binary translation is an important part of the software development tool chain. A table-based generator like fastBT provides the best approach to unify profiling of existing applications, runtime software modifications, as well as debugging and instrumentation of complete large-scale programs.

The fastBT binary translator may be downloaded at http://www.lst.inf.ethz.ch/research/fastBT/

References

[1] BALA, V., DUESTERWALD, E., AND BANERJIA, S. Dynamo: a transparent dynamic optimization system. In PLDI '00 (Vancouver, BC, Canada, 2000), pp. 1–12.

[2] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In SOSP '03 (New York, NY, USA, 2003), pp. 164–177.

[3] BELLARD, F. QEMU, a fast and portable dynamic translator. In ATEC '05 (Berkeley, CA, USA, 2005), pp. 41–41.

[4] BRUENING, D., AND AMARASINGHE, S. Maintaining consistency and bounding capacity of software code caches. In CGO '05 (Washington, DC, USA, 2005), pp. 74–85.

[5] BRUENING, D., DUESTERWALD, E., AND AMARASINGHE, S. Design and implementation of a dynamic optimization framework for Windows. In ACM Workshop on Feedback-directed Dyn. Opt. (FDDO-4) (2001).

[6] BRUENING, D., GARNETT, T., AND AMARASINGHE, S. An infrastructure for adaptive dynamic optimization. In CGO '03 (Washington, DC, USA, 2003), pp. 265–275.

[7] BRUENING, D., KIRIANSKY, V., GARNETT, T., AND BANERJI, S. Thread-shared software code caches. In CGO '06 (Washington, DC, USA, 2006), pp. 28–38.

[8] BUGNION, E. Dynamic binary translator with a system and method for updating and maintaining coherency of a translation cache. US Patent 6704925, March 2004.

[9] CHERNOFF, A., HERDEG, M., HOOKWAY, R., REEVE, C., RUBIN, N., TYE, T., YADAVALLI, S. B., AND YATES, J. FX!32: A profile-directed binary translator. IEEE Micro 18, 2 (1998), 56–64.

[10] DEUTSCH, L. P., AND SCHIFFMAN, A. M. Efficient implementation of the Smalltalk-80 system. In POPL '84 (New York, NY, USA, 1984), pp. 297–302.

[11] DEVINE, S. W., BUGNION, E., AND ROSENBLUM, M. Virtualization system including a virtual machine monitor for a computer with a segmented architecture. US Patent 6397242.

[12] GARG, M. Sysenter based system call mechanism in Linux 2.6 (http://manugarg.googlepages.com/systemcallinlinux2_6.html).

[13] GILL, S. The diagnosis of mistakes in programmes on the EDSAC. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences 206, 1087 (1951), 538–554.

[14] HAZELWOOD, K., AND SMITH, M. D. Managing bounded code caches in dynamic binary optimization systems. TACO 3, 3 (2006), 263–294.

[15] HISER, J., KUMAR, N., ZHAO, M., ZHOU, S., CHILDERS, B. R., DAVIDSON, J. W., AND SOFFA, M. L. Techniques and tools for dynamic optimization. In IPDPS (2006).

[16] HISER, J. D., WILLIAMS, D., HU, W., DAVIDSON, J. W., MARS, J., AND CHILDERS, B. R. Evaluating indirect branch handling mechanisms in software dynamic translation systems. In CGO '07 (Washington, DC, USA, 2007), IEEE Computer Society, pp. 61–73.

[17] KISTLER, T., AND FRANZ, M. Continuous program optimization: Design and evaluation. IEEE Trans. Comput. 50, 6 (2001), 549–566.

[18] LUK, C.-K., COHN, R., MUTH, R., PATIL, H., KLAUSER, A., LOWNEY, G., WALLACE, S., REDDI, V. J., AND HAZELWOOD, K. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI '05 (New York, NY, USA, 2005), pp. 190–200.

[19] MAY, C. Mimic: a fast System/370 simulator. In SIGPLAN '87: Papers of the Symposium on Interpreters and Interpretive Techniques (New York, NY, USA, 1987), pp. 1–13.

[20] NETHERCOTE, N., AND SEWARD, J. Valgrind: a framework for heavyweight dynamic binary instrumentation. In PLDI '07 (New York, NY, USA, 2007), pp. 89–100.

[21] OLSZEWSKI, M., CUTLER, J., AND STEFFAN, J. G. JudoSTM: A dynamic binary-rewriting approach to software transactional memory. In PACT '07 (Washington, DC, USA, 2007), pp. 365–375.

[22] PAYER, M., AND GROSS, T. Requirements for fast binary translation. In 2nd Workshop on Architectural and Microarchitectural Support for Binary Translation (2009).

[23] SRIDHAR, S., SHAPIRO, J. S., AND BUNGALE, P. P. HDTrans: a low-overhead dynamic translator. SIGARCH Comput. Archit. News 35, 1 (2007), 135–140.

[24] SRIDHAR, S., SHAPIRO, J. S., NORTHUP, E., AND BUNGALE, P. P. HDTrans: an open source, low-level dynamic instrumentation system. In VEE '06 (New York, NY, USA, 2006), pp. 175–185.

[25] CHEN, W.-K., LERNER, S., CHAIKEN, R., AND GILLIES, D. M. Mojo: A dynamic optimization system. In ACM Workshop on Feedback-directed Dyn. Opt. (FDDO-3) (2000).