Background Optimization in Full System Binary Translation

Roman A. Sokolov, MCST CJSC, Moscow, Russia (Email: [email protected])
Alexander V. Ermolovich, Intel CJSC, Moscow, Russia (Email: [email protected])

Abstract—Binary translation and dynamic optimization are widely used to provide compatibility between legacy and promising upcoming architectures at the level of executable binary codes. Dynamic optimization is one of the key contributors to dynamic binary translation system performance. At the same time, it can be a major source of overhead, both in terms of CPU cycles and whole-system latency, as long as optimization time is included in the execution time of the application under translation. One solution that eliminates dynamic optimization overhead is to perform optimization simultaneously with execution, in a separate thread. In this paper we present an implementation of this technique in a full system dynamic binary translator. For this purpose, an infrastructure for multithreaded execution was implemented in the binary translation system, allowing dynamic optimization to run in a separate thread independently of, and concurrently with, the main thread executing the binary codes under translation. Depending on the computational resources available, this is achieved either by interleaving the two threads on a single processor core or by moving the optimization thread to an underutilized processor core. In the first case, the latency introduced to the system by computationally intensive dynamic optimization is reduced. In the second case, overlapping the execution and optimization threads also eliminates optimization time from the total execution time of the original binary codes.

I. INTRODUCTION

Technologies of binary translation and dynamic optimization are widely used in modern software and hardware computing systems [1]. In particular, dynamic binary translation systems (DBTSs) combining the two serve as a solution for providing compatibility between widely used legacy and promising upcoming architectures at the level of executable binary codes. In the context of binary translation these architectures are usually referred to as source and target, correspondingly.

DBTSs execute binary codes of the source architecture on top of instruction set (ISA) incompatible target architecture hardware. They translate executable codes incrementally (as opposed to whole-application static compilation), interleaving translation with execution of the generated translated codes. One of the key requirements every DBTS has to meet is that the performance of executing source codes through binary translation be comparable to, or even outperform, native execution (i.e., running them on source architecture hardware).

An optimizing translator is usually employed to achieve higher DBTS performance. It generates highly efficient target architecture code that fully utilizes the architectural features introduced to support binary translation. Besides, dynamic optimization can benefit from actual information about executable behavior which static compilers usually do not possess.

At the same time, dynamic optimization can imply significant overhead as long as optimization time is included in the execution time of the application under translation. Total optimization time can be significant and will not necessarily be compensated by the translated code speed-up if the application run time is too short.

Also, the operation of the optimizing translator can worsen the latency (i.e., increase the pause time) of an interactive application or operating system under translation. By latency we mean the response time of the computer system to external events such as asynchronous hardware interrupts from attached I/O devices and interfaces. This characteristic of a computer system is as important for the end user, for the operation of attached hardware, and for other computers across the network as its overall performance.

Full system dynamic binary translators have to provide low latency of operation as well. Binary translation systems of this class aim to implement the entire semantics and behavior model of the source architecture and to execute the whole hierarchy of system-level and application-level software, including BIOS and operating systems. They exclusively control all the computer system hardware and operation. Throughout this paper we will also refer to this type of binary translation system as virtual machine level (or VM-level) binary translators (as opposed to application-level binary translators).

One recognized technique to reduce dynamic optimization overhead is to perform optimization simultaneously (concurrently) with the execution of the original binary codes by utilizing unemployed computational resources or free cycles. It has been employed in a number of dynamic binary translation and optimization systems [2], [3], [4], [5], [6], [7], [8]. We will refer to this method as background optimization (as opposed to consecutive optimization, when optimizing translation interrupts execution and utilizes processor time exclusively until it completes).

This paper describes the implementation of background optimization in a VM-level dynamic binary translation system. It is achieved by separating optimizing translation from the execution flow into an independent thread, which can then concurrently share the available processing resources with the execution thread. Backgrounding is implemented either by interleaving the two threads in the case of a single-core (single processor) system or by moving the optimization thread to an unemployed processor core in the case of a dual-core (dual processor) system. In the first case, the latency introduced to the system by the "heavy" phase of optimizing translation is reduced. In the second case, overlapping the execution and optimization threads also eliminates the time spent in the dynamic optimization phase from the total run time of the original application under translation.

The specific contributions of this work are as follows:

• implementation of a multithreaded infrastructure in a VM-level dynamic binary translation system;
• a single processor system targeted implementation of the background optimization technique, where processor time sharing is implemented by interleaving optimizing translation with execution of the original binary codes;
• a dual processor system targeted implementation of the background optimization technique, where optimizing translation is completely offloaded onto an underutilized processor core.

The solutions described in the paper were implemented in the VM-level dynamic binary translation system LIntel, which provides full system-level binary compatibility with the Intel IA-32 architecture on top of Elbrus architecture [9], [10] hardware.

Fig. 1. VM-level dynamic binary translation system LIntel. [Layered diagram: IA-32 applications; IA-32 BIOS, OS, drivers and libraries; LIntel; IA-32-incompatible CPU.]

Fig. 2. Adaptive binary translation. [Flow: IA-32 binaries; interpretation and profiling; non-optimizing trace translation; optimizing region translation; translation cache (execution of translated codes and profiling); adaptive retranslation.]

II. LINTEL

Elbrus is a VLIW (Very Long Instruction Word) microprocessor architecture. It has several special features, including hardware support for full compatibility with the IA-32 architecture on the basis of transparent dynamic binary translation.

LIntel is a dynamic binary translation system developed for high performance emulation of an Intel IA-32 architecture system through dynamic translation of source IA-32 instructions into wide instructions of the target Elbrus architecture (the two architectures are ISA-incompatible). It provides full system-level compatibility, meaning that it is capable of translating the entire hierarchy of source architecture software (including BIOS, operating systems and applications) transparently for the end user (Fig. 1). As noted above, LIntel is a co-designed system (developed along with the architecture, with hardware assistance in mind) and heavily utilizes all the architectural features introduced to support efficient IA-32 compatibility.

In its general structure, LIntel is similar to many other binary translation and optimization systems described before [11], [12], [13] and is very close to Transmeta's Code Morphing Software [14], [15]. Like any other VM-level binary translation system, it has to solve the problem of efficiently sharing computational resources between translation and execution of the original binary codes.

A. Adaptive binary translation

LIntel follows an adaptive, profile-directed model of translation and execution of binary codes (Fig. 2). It includes four levels of translation and optimization, varying in the efficiency of the resulting Elbrus code and the overhead implied, namely: an interpreter, a non-optimizing translator of traces, and two optimizing translators of regions. LIntel performs dynamic profiling to identify hot regions of source code and to apply a reasonable level of optimization depending on executable code behavior. A translation cache is employed to store and reuse generated translations throughout execution. The run-time support system controls the overall binary translation and execution process.

When the system starts, the interpreter is used to carefully decode and execute IA-32 instructions sequentially, with attention to memory access ordering and precise exception handling. Non-optimizing translation is launched once the execution counter of a particular basic block exceeds a specified threshold. The non-optimizing translator builds a trace: a semantically accurate mapping of one or several contiguous basic blocks (following one path of control) into target code. The building blocks of a trace are templates of the corresponding IA-32 instructions, where a template is a manually scheduled sequence of Elbrus wide instructions. After code generation and additional corrections, such as patching in actual constants and address values, the trace is stored in the translation cache. The trace translator produces native code without complex optimizations and focuses on fast translation generation rather than code efficiency. It improves start-up time significantly as compared to interpretation. At the same time, non-optimizing translation is only reasonable for executable codes with a low repetition rate. Traces are instrumented to profile hot code for O0-level optimizing translation.

The unit of optimizing translation is a region. In contrast to traces, regions can combine basic blocks from multiple paths of control, providing better opportunities for optimization and speculative execution (an important source of instruction-level parallelism for VLIW processors).

The O0-level translator is a fast region-based optimizer that performs basic optimizations with low computational cost, including peephole optimization, dead-code elimination, constant propagation, code motion, redundant load elimination, superblock if-conversion and scheduling.

The strong O1-level region-based optimizer is the highest level of the system; its power is comparable to that of high-level language optimizing compilers¹. It applies advanced optimizations such as software pipelining, global scheduling, hyperblock if-conversion and many others, and utilizes all the architectural features introduced to support binary optimization and execution of optimized translations. Region translations are stored in the translation cache as well. Profiling of regions for O1-level optimization is carried out by O0-level translations.

¹In fact, the O0/O1 notation of LIntel's binary optimizers corresponds to the conventional O2/O3-O4 optimization levels of language compilers.
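The adaptive promotion just described is driven by execution counters. The following minimal C sketch illustrates the escalation logic; the threshold constants and all identifiers are illustrative assumptions (LIntel's actual thresholds are not given in the paper), and the separate trace and region profiling mechanisms are collapsed into a single per-block counter for brevity.

    /* Hedged sketch of counter-driven tier promotion. */
    enum tier { TIER_INTERPRET, TIER_TRACE, TIER_O0, TIER_O1 };

    #define TRACE_THRESHOLD      50UL  /* hypothetical */
    #define O0_THRESHOLD       5000UL  /* hypothetical */
    #define O1_THRESHOLD     500000UL  /* hypothetical */

    struct block_profile {
        unsigned long exec_count;  /* incremented on each execution */
        enum tier     tier;        /* current translation level     */
    };

    /* Called by the interpreter or by instrumented translations of
       lower tiers each time the block executes; returns the (possibly
       promoted) tier so the caller can trigger (re)translation. */
    enum tier profile_tick(struct block_profile *bp)
    {
        bp->exec_count++;
        if (bp->tier == TIER_INTERPRET && bp->exec_count >= TRACE_THRESHOLD)
            bp->tier = TIER_TRACE;   /* non-optimizing trace translation */
        else if (bp->tier == TIER_TRACE && bp->exec_count >= O0_THRESHOLD)
            bp->tier = TIER_O0;      /* fast region optimizer */
        else if (bp->tier == TIER_O0 && bp->exec_count >= O1_THRESHOLD)
            bp->tier = TIER_O1;      /* aggressive region optimizer */
        return bp->tier;
    }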
Optimized translations do not always result in performance improvement. Unproven optimization-time assumptions can cause execution penalties. These include incorrect speculative optimizations, memory-mapped I/O accesses in optimized code (where I/O access consistency is not guaranteed due to merging and reordering of memory operations), etc. Correctness of optimizations is controlled by the hardware at run time. Upon detecting a failure, retranslation of the region is launched, applying more conservative assumptions depending on the failure type.

Fig. 3 compares the average translation cost of one IA-32 instruction and the performance of translated codes for the different levels of optimization. Adaptivity aims at choosing the appropriate level of optimization throughout the translation and execution process to maintain the overhead/performance balance.

Fig. 3. Average translation overhead per one IA-32 instruction and the performance of translated codes (normalized to O1).

                               Cycles per one        Translated code
                               source instruction    performance
    Non-optimizing translation        1600                0.18
    O0 optimization                  30000                0.58
    O1 optimization                1000000                1.0

Fig. 4 shows the translation and execution time distribution for SPEC2000 tests (the operating system running them is translated as well). While translated codes execute during most of the tests' run time, optimizing translation overhead is significant and equals 7% on average.

Fig. 4. Profile of binary translation in case of consecutive dynamic optimization. [Stacked bars per SPEC2000 test, 164.gzip through 200.sixtrack; legend: non-optimized code, O0 code, O1 code, optimization (O0+O1), other.]

B. Asynchronous interrupt handling

One of the run-time support system's functions is to handle incoming external (asynchronous) interrupts. The method of delayed interrupt handling improves the performance of binary translated code execution and interrupt handling by specifying exactly where and when a pending interrupt can be handled. When an interrupt occurs, the interrupt handler only records this fact by setting the corresponding bit in the processor state register and returns to execution. The interpreter checks for pending interrupts before executing each next instruction. For efficiency reasons, non-optimized traces only include such checks at the beginning of basic blocks. Optimizing translators inject special instructions at particular places in a region's code (where the execution context is guaranteed to be consistent) that check for pending interrupts and force the execution flow to leave the region and switch to the interrupt handler if needed.

This arrangement of pending-interrupt checks simplifies the planning and scheduling of translated codes, as there is no need to support correct execution termination and context recovery at arbitrary moments of time. At the same time, it allows LIntel to respond to external events reactively enough.
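The delayed handling scheme lends itself to a compact illustration. Below is a hedged C sketch: the pending-interrupt bit of the processor state register is modeled as a C11 atomic flag, and the check, which in the real system is a special instruction injected into translated code, is modeled as a function; all names are hypothetical.

    #include <stdatomic.h>

    /* Models the pending-interrupt bit of the processor state register. */
    static atomic_int pending_interrupt;

    /* Hardware interrupt handler: only record the event and resume. */
    void on_external_interrupt(void)
    {
        atomic_store(&pending_interrupt, 1);
    }

    /* Check placed at safe points: before each interpreted instruction,
       at trace basic-block entries, and at consistent points injected
       into optimized regions.  Returns nonzero if control must leave
       translated code and enter the interrupt handling path. */
    int check_pending_interrupt(void)
    {
        return atomic_exchange(&pending_interrupt, 0);
    }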
The bottleneck in this scenario is the optimizing translation phase. If an interrupt occurs while optimization is in progress, it has to wait for the optimization phase to complete before being handled (Fig. 5). Due to the computational complexity of the optimizations employed, optimizing translation can consume a significant amount of processor time, and as such the delay of the system's response to an external event can be noticeable (see the evaluation in Section III-B).
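To make the cost concrete, the worst case can be bounded using the measurements reported later in Fig. 7 (a 300 MHz CPU and, for the interleaved case, a 50000-cycle thread time slice); the bound for the interleaved case is a rough estimate, since delivery also waits for the next pending-interrupt check.

    % Consecutive optimization: an interrupt arriving just after an O1
    % phase starts may wait for the whole phase.
    \[ T_{delay}^{cons} \le T_{O1max} \approx 8.8\,\mathrm{s} \]
    % Interleaved optimization (Section III-B): the wait is roughly
    % bounded by the optimization thread's time slice plus the time to
    % reach the next pending-interrupt check.
    \[ T_{slice} = \frac{50000\ \mathrm{cycles}}{300\ \mathrm{MHz}}
       \approx 167\,\mu\mathrm{s},
       \qquad T_{delay}^{intl} \lesssim T_{slice} + T_{check} \]

For comparison, Fig. 7 reports a measured mean of 54 µs with no optimization in progress and a maximum of 1.7 ms with an O1 phase in progress.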

III. BACKGROUND OPTIMIZATION

To overcome the problems of performance overhead and latency caused by optimizing translation, the method of background optimization was employed in LIntel.

Fig. 5. Asynchronous interrupt delivery delay (latency) due to optimizing translation. [Timeline: a new hot region starts optimization; an interrupt arriving during optimization is delivered only after the optimization ends.]

Fig. 6. Asynchronous interrupt delivery in case of interleaved background optimization. [Timeline: execution and optimization slices alternate; an interrupt arriving during an optimization slice is delivered in the next execution slice.]

The concept of background optimization implies performing the optimizing translation phase concurrently (or pseudo-concurrently) with the main binary translation flow of executing the original binary codes. Application-level binary translators usually implement this by means of the native operating system's multithreading interface and scheduling services, performing optimization in a separate thread. VM-level binary translation systems require an internal implementation of multithreading to support background optimization.

In this section we describe the implementation of background optimization in the VM-level DBTS LIntel. Two cases are considered: in the first case LIntel operates on top of a single-core target platform; in the second, two cores are available for utilization. SPEC2000 tests are used to demonstrate the effect of the background optimization implementation.

Fig. 7. Interrupt delivery time (CPU frequency = 300 MHz; thread time slice = 50000 cycles). O1-level optimization time is used as a reference, as this phase consumes a greater number of processor cycles per source instruction than O0-level optimization.

                                      Consecutive       Interleaved (background)
                                      optimization      optimization
    O1 phase mean time                1.54 s            3 s
    O1 phase max time, T_O1max        8.8 s             29.5 s
    Interrupt delivery mean time
    (no optimization in progress)     54 µs             54 µs
    Interrupt delivery max time
    (O1 phase in progress)            8.8 s (T_O1max)   1.7 ms

A. Execution and optimization threads

A multithreaded execution infrastructure was implemented in LIntel, with optimizing translation capable of running independently in a separate thread; this enabled concurrency between execution and optimization. The execution thread's activity includes the entire process of translating and executing the original binary codes except optimizing translation (of both O0 and O1 levels), i.e., interpretation, non-optimizing translation, run-time support and execution itself. The optimizing translator runs in a separate optimization thread whenever a new region of hot code is identified by the execution thread. When the optimization phase completes, the generated translation of the region is returned to the execution thread, which places it into the translation cache.

During the region optimization phase, the corresponding original codes are executed either by interpretation or by previously generated translations of lower optimization levels. Selection of new hot regions for optimization is not launched until the current optimization activity completes.

By the end of optimization, the memory pages containing the source code of the region under optimization can become invalidated (due to DMA, self-modification, etc.). Therefore, before placing the optimized translation of the region into the translation cache, the execution thread must check the consistency of the region's source code and reject the region if verification fails. This routine is assisted by the memory protection monitoring subsystem (introduced in the Elbrus hardware to support binary translation [16]), which controls the coherency of source and translated (as well as translation-in-progress) codes. A sketch of this handoff is given at the end of this subsection.

Separating the execution and optimization threads makes it possible to schedule them across the available processing resources in the same way multitasking operating systems schedule processes and threads. So far, two simple processor time sharing strategies have been implemented in LIntel, enabling optimization backgrounding for single-core and dual-core systems.
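The handoff between the two threads, including the consistency check before committing a region, can be sketched as follows. This is a minimal sketch, not LIntel's actual code: pthreads stands in for the system's internal threading, and all types and helpers (optimize_region, source_code_consistent, translation_cache_insert) are hypothetical.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct region;        /* a hot region of source code                  */
    struct translation;   /* optimized target code generated for a region */

    extern struct translation *optimize_region(struct region *r);
    extern bool source_code_consistent(struct region *r); /* assisted by
        the memory protection monitoring hardware in the real system    */
    extern void translation_cache_insert(struct translation *t);

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  work = PTHREAD_COND_INITIALIZER;

    /* One region in flight at a time: selection of new hot regions
       stays blocked while an optimization is in progress. */
    static struct region      *pending;  /* handed over by execution thread */
    static struct region      *done;
    static struct translation *result;

    void *optimization_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (pending == NULL)
                pthread_cond_wait(&work, &lock);
            struct region *r = pending;
            pthread_mutex_unlock(&lock);

            struct translation *t = optimize_region(r);  /* O0 or O1 phase */

            pthread_mutex_lock(&lock);
            pending = NULL;
            done    = r;
            result  = t;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    /* Execution thread: submit a newly identified hot region. */
    void submit_hot_region(struct region *r)
    {
        pthread_mutex_lock(&lock);
        if (pending == NULL) {          /* optimizer is free */
            pending = r;
            pthread_cond_signal(&work);
        }
        pthread_mutex_unlock(&lock);
    }

    /* Execution thread: commit a finished translation, first verifying
       that the region's source pages were not invalidated (DMA,
       self-modifying code) while optimization was running. */
    void commit_if_ready(void)
    {
        pthread_mutex_lock(&lock);
        struct region      *r = done;
        struct translation *t = result;
        done = NULL;
        result = NULL;
        pthread_mutex_unlock(&lock);

        if (t != NULL && source_code_consistent(r))
            translation_cache_insert(t);
        /* otherwise the translation is rejected */
    }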
B. Background optimization in a single-core system

In the case of a single-core system, background optimization is implemented by interleaving the execution and optimization threads. Throughout the optimizing translation of a hot region, the processor switches between the two threads. Scheduling is triggered by interrupts from an internal timer dedicated to binary translation and "invisible" to the executable codes under translation. Each thread is assigned a fixed time slice. When the execution thread is active, an incoming external interrupt has a chance to be handled without waiting for region optimization to complete (Fig. 6). If there are no hot regions pending optimization, the execution thread fully utilizes the processor core.

To demonstrate the single-core background optimization approach, a simple strategy of processor time sharing was chosen in which both threads have equal priority and equal time slices (meaning that the optimization thread's processor utilization is 50%, in contrast to 100% utilization when optimizing consecutively). As seen from Fig. 7, interleaving execution and optimization improves interrupt delivery time significantly. At the same time, as Fig. 8 demonstrates, this approach tends to degrade binary translation performance.
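The timer-driven interleaving described above can be sketched as follows. Only the 50000-cycle slice value comes from the experimental setup (Fig. 7); context_switch() and arm_bt_timer() are hypothetical stand-ins for LIntel's internal mechanisms.

    /* Hedged sketch of timer-driven interleaving on a single core. */
    #define SLICE_CYCLES 50000UL      /* equal slices: 50% utilization
                                         for the optimizer when busy  */

    enum bt_thread { EXECUTION_THREAD, OPTIMIZATION_THREAD };

    extern int  optimization_pending; /* a hot region is being optimized */
    extern void context_switch(enum bt_thread next);
    extern void arm_bt_timer(unsigned long cycles);

    static enum bt_thread current = EXECUTION_THREAD;

    /* Invoked by the dedicated timer interrupt, which is invisible to
       the codes under translation. */
    void bt_timer_interrupt(void)
    {
        enum bt_thread next = current;

        if (current == EXECUTION_THREAD && optimization_pending)
            next = OPTIMIZATION_THREAD;  /* give the optimizer its slice */
        else if (current == OPTIMIZATION_THREAD)
            next = EXECUTION_THREAD;     /* pending external interrupts
                                            get a chance to be handled  */

        arm_bt_timer(SLICE_CYCLES);
        if (next != current) {
            current = next;
            context_switch(next);
        }
    }

If no region is pending, the execution thread simply keeps the core, matching the behavior described above.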

Fig. 8. Binary translation slow-down when interleaving optimization with execution (as compared to consecutive optimization). [Per-test bars for SPEC2000; values range from a 2.8% speed-up to a 17.4% slow-down.]

Fig. 9. Utilization of a separate processor core for dynamic optimization. [Core 1: execution, run-time support, interpreter and non-optimizing translation; Core 2: optimizing translation of regions; interactions: acquire new hot region, allocate region translation in the translation cache.]

Degradation can be explained by the fact that the hot region optimization phase now lasts longer. As a result, the injection of optimized translations into execution is delayed, while the source binary codes keep executing non-optimized (or interpreted). Additional overhead comes from context switching routines.

Basically, the single-core background optimization implementation is not of high priority at the moment. At the same time, we believe it is possible to improve its efficiency by tuning various parameters, such as the execution and optimization threads' time slices and the profiling thresholds, to achieve earlier injection of optimized translations into the execution process while keeping whole-system latency acceptable. Besides, the IA-32 "halt" instruction can be used as a hint to utilize free cycles and yield the processor to the optimization thread before the end of the execution thread's time slice. An extensive study of execution and optimization threads' processor time utilization was made in [17].

C. Background optimization in a dual-core system

In a dual-core system, LIntel completely utilizes the second (otherwise unemployed) processor core to perform dynamic optimization in a background thread (Fig. 9). In this case the execution thread exclusively utilizes its own core and only interrupts execution to acquire the next region for optimization and to allocate the generated translation when optimization completes.

As Fig. 10 demonstrates, overlapping execution and optimization by moving the optimization thread onto a separate core not only eliminates the latency problem but also increases overall binary translation system performance. The resulting speed-up (6% on average) agrees well with the dynamic optimization overhead estimated for the case of consecutive optimization (see Section II-A).

Fig. 10. Binary translation speed-up when optimizing on a separate processor core (as compared to consecutive optimization). [Per-test bars for SPEC2000; values range from 0.4% to 13.2%.]

D. Discussion and future work

As noted above, selection of hot regions in the execution thread is blocked until the optimization phase completes. However, profile counters continue to grow, and by the end of optimization there may be several non-overlapping regions in the profile graph with counters exceeding the threshold. As counters are checked during execution of the corresponding translated codes, the next optimizing translation will be launched for the first such region executed, which will not necessarily be the hottest one. As such, a problem of suboptimal hot region selection arises, which also needs to be addressed (profile graph traversal can be quite time-consuming and is not an option).

The profile of binary translation for SPEC2000 tests (Fig. 4) suggests that the current optimization workload is not enough to fully utilize the processor core affiliated with the optimization thread, which runs idle most of the application run time. To improve its utilization, the optimizing translator can be forced to activate more often. This can be achieved by dynamically decreasing the hot region profiling threshold depending on the current load of the core affiliated with the optimizing translator (a sketch of such a heuristic is given below). When execution activity is naturally low, this core should be halted for energy efficiency reasons.
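One possible shape of this heuristic is sketched below; the linear scaling, the bounds and the load metric are illustrative assumptions, not values from the paper.

    /* Hedged sketch: scale the hot-region profiling threshold with the
       measured load of the optimization core.  Constants hypothetical. */
    #define BASE_THRESHOLD 500000UL  /* default profiling threshold   */
    #define MIN_THRESHOLD   50000UL  /* lower bound when core is idle */

    /* load_percent: fraction of recent time the optimization core was
       busy, in 0..100.  An idle core (0) selects regions ten times
       earlier; a saturated core (100) keeps the default threshold. */
    unsigned long adapt_profiling_threshold(unsigned load_percent)
    {
        return MIN_THRESHOLD +
               (BASE_THRESHOLD - MIN_THRESHOLD) * load_percent / 100;
    }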
It is reasonable to ask why not utilize the unemployed processor core to execute source binary codes instead. In other words, if there is more than one target architecture microprocessor core in the system, the source architecture system software (e.g., the operating system) could "see" and utilize the same number of cores. The current Elbrus architecture implementation (used in this paper) does not satisfy IA-32 architecture requirements concerning the organization of multiprocessor systems. As a result, IA-32 multiprocessor support is not possible on top of Elbrus hardware, but we hope to implement this scenario in the future. Still, we believe that having processor cores solely utilized for dynamic optimization is reasonable for the following reasons:

• different classes of software (legacy software, software for embedded systems, etc.), not always developed with multiprocessing or multithreading in mind, can benefit from multicore or multiprocessor systems when executed through binary translation with the background optimization option;

• keeping in mind the tendency towards an ever increasing number of cores per chip, it seems reasonable to utilize some cores to improve dynamic binary translation system performance. The optimizing translator is not the only possible consumer of these resources: other jobs that could also be performed asynchronously include identification and selection of code regions for optimization [18], software code prefetching [19], persistent translated code storage access [20]², etc.

²Asynchronous access to persistent code storage (aka CodeBase) has already been implemented in LIntel but is not covered in this paper, as we only consider the effect of the background optimization implementation.
Finally, we think that a promising direction for future research and development is building a binary translation infrastructure that could support an unrestricted number of execution threads (in terms of the source architecture virtual machine, so that the operating system under translation could "see" more than one processor core), optimization threads and other threads, and schedule them efficiently across the available computational resources depending on their quantity, load and the execution behavior of the binary codes.

IV. CONCLUSION

The paper addresses the problem of optimization overhead in dynamic binary translation systems and presents the application of the background optimization technique in the full system dynamic binary translator LIntel. Implementations for single-core and dual-core systems are considered. In the first case, backgrounding is implemented by interleaving execution and optimization, while in the second case dynamic optimization is completely moved onto a separate processor core. In both cases background optimization solves the problem of high latency caused by dynamic optimization, which is particularly important for a full system execution environment. Performing optimization on a separate core also eliminates optimization overhead from the application run time, thus improving binary translation system performance in general.

REFERENCES

[1] J. Smith and R. Nair, Virtual Machines: Versatile Platforms for Systems and Processes. Morgan Kaufmann, 2005.
[2] S. Campanoni, G. Agosta, and S. C. Reghizzi, "ILDJIT: a parallel dynamic compiler," in VLSI-SoC'08: Proceedings of the 16th IFIP/IEEE International Conference on Very Large Scale Integration, 2008, pp. 13-15.
[3] C. J. Krintz, D. Grove, V. Sarkar, and B. Calder, "Reducing the overhead of dynamic compilation," Software: Practice and Experience, vol. 31, no. 8, pp. 717-738, 2001.
[4] J. Mars, "Satellite optimization: The offloading of software dynamic optimization on multicore systems (poster)," in PLDI '07: 2007 ACM SIGPLAN conference on Programming language design and implementation, 2007.
[5] P. Unnikrishnan, M. Kandemir, and F. Li, "Reducing dynamic compilation overhead by overlapping compilation and execution," in Proceedings of the 11th South Pacific Design Automation Conference (ASP-DAC '06). Piscataway, NJ, USA: IEEE Press, January 2006, pp. 929-934.
[6] M. J. Voss and R. Eigenmann, "A framework for remote dynamic program optimization," in Proceedings of the ACM SIGPLAN workshop on Dynamic and adaptive compilation and optimization, 2000, pp. 32-40.
[7] W. Zhang, B. Calder, and D. M. Tullsen, "An event-driven multithreaded dynamic optimization framework," in Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT '05). Washington, DC, USA: IEEE Computer Society, 2005, pp. 87-98.
[8] H. Guan, B. Liu, T. Li, and A. Liang, "Multithreaded optimizing technique for dynamic binary translator CrossBit," in Proceedings of the International Conference on Computer Science and Software Engineering, vol. 5, 2008, pp. 945-952.
[9] B. Babayan, "E2k technology and implementation," in Euro-Par '00: Proceedings of the 6th International Euro-Par Conference on Parallel Processing. London, UK: Springer-Verlag, 2000, pp. 18-21.
[10] V. Volkonskiy, "Optimizing compilers for Elbrus-2000 (E2k) architecture," in 4th Workshop on EPIC Architectures and Compiler Technology, 2005.
[11] L. Baraz, T. Devor, O. Etzion, S. Goldenberg, A. Skaletsky, Y. Wang, and Y. Zemach, "IA-32 Execution Layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium-based systems," in MICRO 36: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 2003, p. 191.
[12] A. Chernoff, M. Herdeg, R. Hookway, C. Reeve, N. Rubin, T. Tye, S. B. Yadavalli, and J. Yates, "FX!32: A profile-directed binary translator," IEEE Micro, vol. 18, no. 2, pp. 56-64, 1998.
[13] M. Gschwind, E. R. Altman, S. Sathaye, P. Ledak, and D. Appenzeller, "Dynamic and transparent binary translation," Computer, vol. 33, no. 3, pp. 54-59, 2000.
[14] J. C. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler, A. Klaiber, and J. Mattson, "The Transmeta Code Morphing Software: Using speculation, recovery, and adaptive retranslation to address real-life challenges," in Proceedings of the First Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2003.
[15] A. Klaiber, "The technology behind Crusoe processors," Transmeta Corporation, Tech. Rep., January 2000.
[16] A. V. Ermolovich, "Methods of hardware assisted dynamic binary translation systems performance improvement," Ph.D. dissertation, Institute of Microprocessor Computing Systems, Moscow, 2003.
[17] P. Kulkarni, M. Arnold, and M. Hind, "Dynamic compilation: the benefits of early investing," in VEE '07: Proceedings of the 3rd international conference on Virtual execution environments. New York, NY, USA: ACM, 2007, pp. 94-104.
[18] J. Mars and M. L. Soffa, "MATS: Multicore adaptive trace selection," in Proceedings of the 3rd Workshop on Software Tools for MultiCore Systems (STMCS 2008), April 2008.
[19] J. Mars, D. Williams, D. Upton, S. Ghosh, and K. Hazelwood, "A reactive unobtrusive prefetcher for multicore and manycore architectures," in Proceedings of the Workshop on Software and Hardware Challenges of Manycore Platforms (SHCMP), June 2008.
[20] A. V. Ermolovich, "CodeBase: persistent code storage for dynamic binary translation system performance improvement," Information Technologies, vol. 9, pp. 14-22, 2003.