The University of Manchester Research

Low Overhead Dynamic Binary Translation on ARM

DOI: 10.1145/3062341.3062371

Document Version Accepted author manuscript

Link to publication record in Manchester Research Explorer

Citation for published version (APA): d'Antras, A., Gorgovan, C., Garside, J., & Luján, M. (2017). Low Overhead Dynamic Binary Translation on ARM. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017 (pp. 333–346). Association for Computing Machinery. https://doi.org/10.1145/3062341.3062371

Published in: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017

Citing this paper: Please note that where the full-text provided on Manchester Research Explorer is the Author Accepted Manuscript or Proof version, it may differ from the final Published version. If citing, it is advised that you check and use the publisher's definitive version.


Low Overhead Dynamic Binary Translation on ARM

Amanieu d’Antras Cosmin Gorgovan Jim Garside Mikel Luján School of Computer Science, University of Manchester, UK {bdantras,cgorgovan,jgarside,mikel}@cs.manchester.ac.uk

Abstract

The ARMv8 architecture introduced AArch64, a 64-bit execution mode with a new instruction set, while retaining binary compatibility with previous versions of the ARM architecture through AArch32, a 32-bit execution mode. Most hardware implementations of ARMv8 processors support both AArch32 and AArch64, which comes at a cost in hardware complexity.

We present MAMBO-X64, a dynamic binary translator which executes 32-bit ARM binaries using only the AArch64 instruction set. We have evaluated the performance of MAMBO-X64 on three existing ARMv8 processors which support both the AArch32 and AArch64 instruction sets. The performance was measured by comparing the running time of 32-bit benchmarks running under MAMBO-X64 with the same benchmarks running natively. On SPEC CPU2006, we achieve a geometric mean overhead of less than 7.5 % on in-order Cortex-A53 processors and a performance improvement of 1 % on out-of-order X-Gene 1 processors.

MAMBO-X64 achieves such low overhead through novel optimizations that map AArch32 floating-point registers to AArch64 registers dynamically, handle overflowing address calculations efficiently, generate traces that harness hardware return address prediction, and handle operating system signals accurately.

CCS Concepts • Software and its engineering → Just-in-time compilers

Keywords Binary Translation, ARM

1. Introduction

While ARM has traditionally been a 32-bit architecture, the ARMv8 version of the architecture [14] introduced a new 64-bit execution mode and instruction set, called AArch64. Most of the current generation of ARMv8 processors is capable of running existing 32-bit ARM applications directly in AArch32 mode, but maintaining this support comes at a significant cost in hardware complexity, power usage and development time. For example, Cavium does not include hardware support for AArch32 in its ThunderX processors for this reason.

This paper investigates using binary translation to translate an AArch32 program into AArch64 code dynamically, which enables it to run on a processor that only supports the AArch64 instruction set. Binary translation allows a program to be translated transparently, instrumented or modified at the machine code level.

We have developed MAMBO-X64, a Dynamic Binary Translator (DBT) which translates AArch32 Linux programs into AArch64 code. It is implemented as a process-level virtual machine: a separate binary translator instance is started for each 32-bit process, while the kernel and 64-bit processes run natively on the processor. The objective is to support the running of legacy AArch32 code without the need for specific hardware support, preferably at speeds competitive with hardware execution.

There are several complex problems that need to be addressed to achieve this objective. For example, while AArch64 has a strictly larger general-purpose register bank than AArch32, this is not necessarily the case for the floating-point register bank: AArch32 can, in some situations, exploit register bank aliasing to hold up to 48 floating-point values in registers, whereas AArch64 is limited to only 32 floating-point registers. Naïvely translating the latter case can cause significant performance degradation.

Load and store instructions in both AArch32 and AArch64 support a similar set of addressing modes. Despite their similarity, however, simply translating the AArch32 addressing modes into their AArch64 equivalents will not always produce correct results; the address width used can affect whether the calculation overflows.

It is widely acknowledged that indirect branch handling is the single biggest source of overhead for DBTs [16, 18] and that trace compilation is an integral part of advanced DBTs for improving performance. However, we show that existing trace generation algorithms interfere with optimizations for handling indirect branches.
Precise handling of operating system signals is challenging in DBTs because they can interrupt program execution at arbitrary points. In particular, a signal is delivered between two instructions; however, instruction boundaries in the translated code may not match those of the original application code.

This paper presents the architecture of MAMBO-X64 (Section 2) and describes the following contributions:

1. An efficient scheme for mapping AArch32 floating-point registers to AArch64 registers dynamically (Section 3);

2. A method for efficiently translating AArch32 load/store addressing modes into AArch64 by using speculation (Section 3.3);

3. A novel trace compilation algorithm that leverages hardware return address prediction to improve performance (Section 4); and

4. A system for delivering operating system signals to translated programs without suffering from race conditions, with minimal performance impact (Section 5).

While the first two contributions are described as implemented for AArch32 to AArch64 translation, they can be generalized to other architectures with similar characteristics. The last two contributions are general and are applicable to all DBTs.

When evaluated on the X-Gene, an ARMv8 system which supports both AArch32 and AArch64 in hardware (Section 6), a 32-bit build of SPEC CPU2006 runs on average 1 % faster under MAMBO-X64 compared to running the same 32-bit binary natively on the processor. Particular benchmarks are measured to run up to 38 % faster under MAMBO-X64 than natively, although a few other benchmarks suffer from a performance degradation of up to 19 %. No other DBT has achieved a similar level of performance when translating from a 32-bit to a 64-bit architecture (Section 7).

2. MAMBO-X64 Architecture

MAMBO-X64 is structured as three components:

• Binary Translator: The binary translator is an operating system-independent module which performs the translation of AArch32 instructions into AArch64 code.

• System Emulator: The system emulator handles all interactions with the operating system, such as system calls and signals, and translates them between the 32-bit and 64-bit Linux ABIs.

• Support Library: The support library provides OS-specific utilities such as memory management and synchronization primitives to the binary translator and system emulator.

This arrangement isolates OS-specific code from the binary translator, which makes it easier to port MAMBO-X64 to other OSes. From the binary translator's point of view, system interactions primarily consist of either synchronous traps or asynchronous interrupts. This concept has even been extended to HyperMAMBO-X64 [11], which integrates the binary translator component of MAMBO-X64 into a hypervisor to allow fully transparent translation without the need for any OS modifications.

2.1 Binary Translator

The binary translator works by scanning sequences of AArch32 instructions on demand and converting them into AArch64 code fragments, stored in a code cache. Each fragment is either a single-entry, single-exit basic block or a single-entry, multiple-exit trace formed by combining multiple basic blocks. As fragments are added to the code cache, they are linked to each other using branch instructions. In the case of direct branches, this is simply a direct branch instruction to the target fragment, since the branch target is known at translation time.

Indirect branches need to be handled differently because the branch target is only known at execution time and can vary from one execution to the next. This requires its own dynamic translation and can impose a serious runtime penalty. Previous research [16, 18] has shown that indirect branch handling is the biggest performance bottleneck in a DBT. Function returns have been shown to be the most common type of indirect branch [10], which is why MAMBO-X64 uses a novel algorithm called ReTrace to translate these branches in a way that leverages hardware return prediction. The details of this algorithm are described in Section 4.

MAMBO-X64 uses a thread-shared code cache model where the same translated code is shared among multiple threads. This model has been shown to scale significantly better than thread-private code caches on multi-threaded applications [7, 15]. Executing code in the code cache does not require any synchronization, and MAMBO-X64 is able to perform code translation concurrently in multiple threads. Synchronization is only required when adding or removing a fragment from the code cache, which is a rare operation compared to the execution of translated code.

MAMBO-X64 is able to precisely emulate the full AArch32 instruction set, which includes both the traditional ARM instruction set as well as the newer Thumb instruction set. The binary translator was extensively tested through both manually written test suites and randomly generated instruction sequences, each time ensuring that the tests run identically whether translated or run natively. Section 3 describes the translation process for individual fragments.

2.2 System Emulator

The system emulator has three main functions: managing the address space of the translated program, translating system call parameters and handling signals. MAMBO-X64 takes advantage of the 64-bit address space by allocating a 4 GB 'private' address space for the translated application. The program image is loaded into this address space on startup, and all memory accesses performed by the application are restricted to this address space since the original code uses 32-bit memory addresses. This layout, shown in Figure 1, isolates the application from the DBT and ensures that it is impossible for faulty applications to affect the operation of the DBT.

[Figure 1: Address space layout of an application running under MAMBO-X64. From the top of the 64-bit address space (16 EB) down: 64-bit guest kernel; limit of the 64-bit user address space (256 TB); DBT stack; DBT code cache; DBT heap; DBT code & data; limit of the 32-bit address space (4 GB); application stack; application heap; application code & data; bottom of the address space (0).]

The Linux system ABI for AArch64 differs from that for AArch32 in several ways, such as the size and layout of data types used in system calls and the layout of the stack frame when a signal handler is called. MAMBO-X64 therefore needs to emulate the AArch32 Linux ABI by translating the AArch32 system calls generated by the translated program into a format that can be handled by the host kernel. However, Linux exposes a large number of system calls and is constantly evolving¹, which makes it impractical to create and maintain ABI translation wrappers for each of them. Such wrappers are even more impractical for multiplexed system calls, such as ioctl, which exposes thousands of device-specific sub-functions.

This complexity can be avoided by reusing the built-in compatibility layer in the AArch64 kernel. This layer is used to support running native AArch32 applications and provides system call wrappers which translate 32-bit system calls into their 64-bit equivalents. MAMBO-X64 intercepts some system calls and handles them internally, such as those used for virtual memory management and signal handling, and forwards the remaining ones to the compatibility layer in the host kernel.

MAMBO-X64 also intercepts all signals delivered to the translated program and uses a scheme, described in Section 5, which involves fragment unlinking and signal masking to achieve race-free and efficient signal delivery to the application.

3. Translation Process

Upon reaching a code address for which there is no translated code fragment, MAMBO-X64 will begin scanning the instructions of the source program until it reaches a control flow instruction². As instructions are gathered, the DBT will also determine the set of input and output registers and condition flags for each instruction, which are used in the later stages of the translation process. Once a control flow instruction is encountered the fragment ends, and a reverse pass is done through the instructions to determine register liveness and eliminate instructions with no live outputs. Table 1 shows some examples of instructions translated by MAMBO-X64.

After the instruction analysis pass has completed, MAMBO-X64 begins translating the block of AArch32 instructions into AArch64 code. While doing so, it also performs several optimizations to improve the generated code:

Instruction Merging  MAMBO-X64 can take advantage of the new instructions in AArch64 to translate sequences of AArch32 instructions into a single AArch64 instruction. For example, floating-point comparisons on AArch32 require two instructions, one to perform the comparison and one to load the result into the condition flags register. The same operation on AArch64 only requires a single instruction which performs both operations. MAMBO-X64 can recognize the AArch32 VCMP and VMRS pair of instructions and optimize it to a single AArch64 FCMP instruction.

Dead Code Elimination  Some instructions can have more than one output, such as an instruction both writing to a register and updating some condition flags. Not all such instructions can be translated to AArch64 directly, and may require additional instructions to calculate the condition flags. In many cases, some of the condition flags are identifiably 'dead' (i.e. never used), in which case MAMBO-X64 can avoid computing them.

Code Layout Optimization  Some AArch32 instructions have complex behavior that requires many AArch64 instructions to emulate accurately. A large portion of the complexity is due to the need to handle edge cases which rarely occur in real applications, such as non-default rounding modes or overlong bit shifts (shifting by a value greater than the register width). When translating these instructions, MAMBO-X64 moves these cold paths outside the main fragment code, which allows the hot paths to execute without needing to take branches.

Constant Inlining  A common way to load constants into a register in ARM code is by using a PC-relative load instruction. Since this instruction can only generate addresses within 4 kB of the PC, a compiler will mix constants into the code pages of a program. The naïve approach would be to load the constant from the code pages of the original program. Instead, when MAMBO-X64 detects such a pattern, it will copy the constant into the code cache and translate the load into a native PC-relative load of that constant.

Table 1: Examples of AArch32 instruction sequences translated by MAMBO-X64.

  Original AArch32 code        Translated AArch64 code
  ADDS R0, R1, R2, LSL #2      ADDS W0, W1, W2, LSL #2

  VCMP.F64 D0, D1              FCMP D0, D1
  VMRS APSR_nzcv, FPSCR

  MOV R0, R1, LSR R2           AND  W17, W2, #0xe0
                               MOV  W0, #0
                               CBNZ W17, .+8
                               LSRV W0, W1, W2

  VCVTR.U32.F64 S0, D1         AND  W16, W22, #0xc00000
  VMOV R0, S0                  CBNZ W16, cold_path
                               FCVTNU W0, D1
                               continue:
                               [...]
                               cold_path:
                               TBZ  W16, #23, .+16
                               TBNZ W16, #22, .+20
                               FCVTMU W0, D1
                               B continue
                               FCVTPU W0, D1
                               B continue
                               FCVTZU W0, D1
                               B continue

Table 2: Comparison of AArch32 and AArch64 registers and how MAMBO-X64 uses them.

(a) AArch32 registers
  Register    Description
  R0 – R14    32-bit general-purpose registers
  SP          32-bit stack pointer, alias for R13
  LR          32-bit link register, alias for R14
  PC          32-bit exposed program counter
  S0 – S31    32-bit floating-point registers
  D0 – D31    64-bit floating-point/SIMD registers
  Q0 – Q15    128-bit SIMD registers

(b) AArch64 registers
  Register    Description
  X0 – X30    64-bit general-purpose registers
  LR          64-bit link register, alias for X30
  SP          64-bit stack pointer
  XZR         64-bit zero register
  V0 – V31    128-bit floating-point/SIMD registers

(c) AArch64 register usage in MAMBO-X64
  Register    Usage
  X0 – X14    Mapped to R0 – R14
  X15 – X18   Scratch registers
  X19         APSR.Q flag
  X20         APSR.C and APSR.V flags
  X21         FPSCR.NZCV flags
  X22         Shadow copy of the FPCR register
  X23         APSR.GE flags
  X24 – X27   Indirect branch hash table parameters
  X28         Return address stack pointer
  X30         Translated link register
  SP          Pointer to DBT context on the stack
  V0 – V31    Dynamically mapped to FP/SIMD registers

¹ At the time of writing, Linux (version 4.5) has 387 different system calls.
² This is typically a branch instruction, but it can also be a system call or other exception-generating instruction.

3.1 Register Allocation

Table 2 shows the user-visible registers available on AArch32 and AArch64. Because the AArch64 general-purpose registers are a strict superset of the AArch32 ones, MAMBO-X64 uses a one-to-one mapping of each 32-bit AArch32 register into a 64-bit AArch64 register. The remaining AArch64 registers are used to hold various AArch32 flags and pointers to DBT data structures, or simply as scratch registers for emulating certain instructions.

However, this approach does not work for floating-point registers because AArch32 has more floating-point registers than AArch64: an AArch32 program can, by virtue of treating some as single-precision and some as double-precision, use up to 48 floating-point registers (D16 – D31 and S0 – S31), while an AArch64 program is limited to a maximum of 32 floating-point registers (V0 – V31). MAMBO-X64 therefore keeps the AArch32 floating-point register state in memory and dynamically allocates registers from V0 – V31 to hold AArch32 floating-point/SIMD register values as they are needed, similarly to register allocation of variables by a compiler.

This dynamic allocation is further complicated by the register aliasing behavior of AArch32, as shown in Figure 2. For example, a write to S0 or S1 will modify the value of D0 since these alias. MAMBO-X64 handles such a situation by invalidating any AArch64 register holding the value of D0 before the write to S0/S1.

3.2 Dynamic Register Bindings

To avoid having to write modified floating-point register values back to memory before branching to another fragment, MAMBO-X64 tries to keep values in registers across fragment boundaries by using dynamic register bindings.
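The aliasing-driven invalidation rule of Section 3.1 can be sketched in a few lines of C. This is a hypothetical illustration, not MAMBO-X64's actual data structures: it only tracks which host register, if any, caches each AArch32 D register, and invalidates that mapping when an aliased S register is written.

```c
/* Hypothetical sketch of the invalidation rule from Section 3.1: a cached
 * copy of AArch32 register Dn must be invalidated when either of its
 * aliased single-precision halves, S(2n) or S(2n+1), is written. */
#define NUM_D_REGS 32
#define UNMAPPED   (-1)

/* host_reg[n] = index of the AArch64 V register caching Dn, or UNMAPPED. */
int host_reg[NUM_D_REGS];

void reset_mappings(void) {
    for (int i = 0; i < NUM_D_REGS; i++)
        host_reg[i] = UNMAPPED;
}

void map_d_reg(int d, int v) {
    host_reg[d] = v;                 /* Dn is now cached in Vv */
}

void write_s_reg(int s) {
    host_reg[s / 2] = UNMAPPED;      /* Sn aliases D(n/2): cached copy is stale */
}

int is_cached(int d) {
    return host_reg[d] != UNMAPPED;
}
```

For example, after `map_d_reg(0, 5)` the value of D0 lives in V5; a subsequent `write_s_reg(1)` drops that mapping, because S1 is the upper half of D0.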
This – use up to 48 floating-point registers (D16 – D31 and S0 – optimization is based on the work previously done in Pin [19], Q1 Q0 AArch32 AArch64 D3 D2 D1 D0 LDR R0, [R1] LDR W0, [X1] LDR R0, [R1, #8] LDR W0, [X1, #8] S7 S6 S5 S4 S3 S2 S1 S0 LDR R0, [R1, R2] LDR W0, [X1, X2] (a) AArch32 LDR R0, [R1, R2, LSL #2] LDR W0, [X1, X2, LSL #2]

V1 V0 Table 3: Examples of memory addressing modes in AArch32 D1 D0 and AArch64. S1 S0

(b) AArch64 be dead in the target fragment, then these bindings are Figure 2: Floating-point register aliasing in AArch32 and compatible since the value of S15 is never going to be AArch64. read. Register Binding Reconciliation If the number of fragment but has been improved in several ways to make it more suit- variants for a single entry point address exceeds a thresh- able for AArch32 floating-point register translation. old, then new fragments with branches to that address will This optimization works by creating specialized versions be forced to reconcile their register bindings with those of of a fragment, based on the same source AArch32 code one of the existing variant instead of creating a new one. but with different register bindings on entry. The register bindings describe which value each AArch64 floating-point 3.3 Speculative Address Generation register contains and whether it is ‘dirty’ (different from the Load and store instructions in both AArch32 and AArch64 in-memory register state). For example, the bindings [V0=D1 support a similar set of addressing modes3, of which a few V1=S15!] mean that the AArch64 register V0 contains the examples are shown in Table 3. Despite their similarity value of the AArch32 register D1, and AArch64 register V1 however, simply translating the AArch32 addressing modes contains the value of AArch32 register S15 which has not into their AArch64 equivalent will not always produce correct been written back to the in-memory register state yet. results. This is due to the address width used by the processor Since this requires both the source and translated frag- when performing an address calculation, which can affect the ments of a branch to agree on a set of register bindings, it is result if the calculation overflows. only possible to apply this optimization within fragments or For example, consider the instruction LDR R0, [R1, around direct branches. 
As the target of an indirect branch is R2]: if R1 has the value 0xffff0000 and R2 has the value not known in advance, all floating-point register values are 0x40000 then, on AArch32, adding these two registers to- written back to memory before taking such branches. gether in an address calculation will wrap around the 32-bit Generating an excessive number of fragments can bloat address space and result in the address 0x30000. Naïvely code cache memory usage and increase instruction cache translating this instruction to the closest AArch64 equivalent pressure, which can outweigh the benefits of register bindings. (LDR W0, [X1, X2]) will not overflow the 64-bit address To avoid this, MAMBO-X64 has three mechanisms to reduce width and will instead result in the address 0x100030000, the number of fragments that are generated: which is outside the 32-bit address space. Biased Register Allocation When floating-point registers Calculating the target address correctly requires translating are allocated, the register allocator will look at all exit this load instruction into two separate instructions: branches of the current fragment and gather a list of all • ADD W16, W1, W2: Performs the address calculation by existing fragments for the branch targets. It will then try to adding the registers and truncates the result by writing it prefer registers which match the bindings for the branch to a 32-bit register. targets, which can avoid having to create a new fragment • variant with different bindings. LDR W0, [X16]: Performs the actual load using the pre- viously calculated address. Liveness-Aware Binding Matching When linking a frag- ment into the code cache, MAMBO-X64 will take register While correct, this approach has two significant down- liveness in the target block into account when trying to sides: not only does it double the translated code size for match the bindings of an exit branch with a target block. 
what is a very common instruction, which impacts mem- For example, consider a branch with bindings [V0=D1 ory overhead and instruction cache locality, it also requires V1=S15!] and a target fragment with bindings [V0=D1]. Normally these bindings would be incompatible since 3 AArch32 also supports a variety of more obscure address modes that the target fragment expects the value of S15 to be in the AArch64 does not, such as indexing with right shifted or rotated registers, in-memory register state. However if S15 is known to but these are rarely used and thus are not a performance concern. a data-dependent addition cycle before each load or store will either fault on the pages reserved by MAMBO-X64 or instruction4. wrap around and fault in the kernel address space. Note that the same problem also applies to immediate- Once MAMBO-X64 detects a fault due to mis-speculation, offset loads and stores: adding an immediate value (up to the load/store instruction address is blacklisted so that it will ±4095 bytes) from a 32-bit base address can still overflow be translated with a separate ADD instruction in the future. or underflow the 32-bit address space. We developed two The faulting fragment is then invalidated so that it will be different solutions to this problem, one for immediate-offset re-translated later with the blacklisted instruction taken into addressing and one for register-offset addressing. account. This approach allows MAMBO-X64 to handle edge cases Immediate-Offset Addressing The fact that AArch32 lim- that overflow the address calculation correctly and efficiently. its immediate offsets for load/store instructions to ±4095 Mis-speculation is rare: in experiments, mis-speculation was bytes offers a partial solution to this problem since potential only observed in the hand-written assembly code in glibc overflows are limited to just 4 kB outside the 32-bit address which converts numbers to strings. space. 
By reserving a guard page at the end of the address space, MAMBO-X64 ensures that out-of-bounds accesses from immediate offset addressing will raise a SIGSEGV sig- 4. Return-Aware Trace Generation nal because they will either hit the guard page after the 4 GB A significant optimization performed by MAMBO-X64 is address space or wrap around the 64-bit address space and trace generation, which involves collecting a linear sequence hit inaccessible kernel addresses. of basic blocks and combining them into a single large code To maintain transparency, MAMBO-X64 also prevents the fragment. This results in improved code layout and perfor- translated process from mapping the first and last pages of the mance improvements due to the elimination of inter-block virtual address space, which ensures that accessing an address branches as well as additional opportunities for optimizations both directly and through a wrap-around will produce the such as dead code elimination. same signal (e.g. accessing 0x400 and 0x100000400 both New code is initially translated into basic blocks, and generate SIGSEGV). Once MAMBO-X64 catches the signal, frequently-executed basic blocks are detected and translated it can adjust the faulting address before passing it on to the into traces. MAMBO-X64 uses a variant of the Next Execut- signal handler of the translated application. ing Tail (NET) scheme [13] from Dynamo [3] to generate Register-Offset Addressing The previously described ap- traces. NET works by adding an execution counter to basic proach is not viable for register-offset addressing since each blocks which are the target of a backwards branch or an exit operand can have any 32-bit value. Instead, we developed a from an existing trace, which is incremented every time the new, more general approach which speculatively assumes that basic block is executed. 
the address calculation will not overflow, which is the case Once a counter reaches a pre-defined threshold value, the for the vast majority of loads and stores. When translating DBT will begin recording a trace. This involves following a register offset load/store instruction, MAMBO-X64 takes the execution flow of the translated code one basic block at a time until control loops back to the start of the current trace or advantage of a feature of the AArch64 instruction set which 5 allows the second operand of a register offset load/store in- reaches an existing trace . The basic blocks are then collected struction to be sign-extended from 32 bits to 64 bits (LDR W0, and compiled into a single-entry, multiple-exit trace. [X1, W2, SXTW]). This matches the common convention Since MAMBO-X64 uses a thread-shared code cache, all on ARM, which is that the first operand is a base address and running threads will share the same set of counters. Since the second operand is an offset from this base address. The these counters are only used to detect hot code, they do not sign-extension handles the cases where the offset is negative, need to be exact. By exploiting this property, MAMBO- at the expense of the much-rarer cases where the offset is X64 can avoid using expensive atomic add instructions over 2 GB. to increment these counters and use a non-atomic load- MAMBO-X64 must detect situations where this assump- increment-store instruction sequence instead. While data tion is invalid and correct them. This is achieved by extend- races could, in theory, delay trace creation for a basic block ing the range of memory reserved by MAMBO-X64 to the indefinitely, this does not occur in practice and a trace is first 12 GB of virtual memory. AArch64 allows the second always eventually created. 
operand in a 32-bit register to be shifted left by one or two MAMBO-X64 allocates traces in a separate part of the places after being sign-extended, which gives it a potential code cache to keep hot code close together and improve range of ±8 GB. Combined with the first operand, this results instruction cache locality. This also allows fragments to be in such a load/store being able to access any address from linked to each other using conditional branch instructions −8 GB to 12 GB. Any overflowing address computations that have a limited addressing range (±1 MB) rather than

4 On all of the ARM cores that we tested, most address calculations that are 5 There are other conditions for terminating a trace, such as exceeding a size part of a load/store instruction do not cost any additional cycles. limit, but these are rarely triggered. having to use an intermediate ‘trampoline’ branch with a longer range.
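The NET-style hotness counting described in Section 4 can be sketched in C as follows. The structure layout and threshold value are illustrative assumptions, not MAMBO-X64's actual implementation; the deliberately non-atomic increment mirrors the observation that the counters only need to detect hot code, not count exactly.

```c
#include <stdint.h>

/* Illustrative threshold; MAMBO-X64's actual value is not stated here. */
#define HOT_THRESHOLD 16

struct basic_block {
    uint32_t exec_count;    /* racy by design: a lost update is harmless */
    int      is_trace_head; /* target of a backward branch or a trace exit */
};

/* Called each time a block is executed. Returns 1 when the block has
 * become hot and trace recording should begin at this block. */
int profile_block(struct basic_block *bb) {
    if (!bb->is_trace_head)
        return 0;                   /* only trace heads carry counters */
    bb->exec_count++;               /* plain load-increment-store, no atomics */
    return bb->exec_count >= HOT_THRESHOLD;
}
```

In a multi-threaded code cache, several threads may race on `exec_count`; at worst a few increments are lost and the block becomes hot slightly later, which matches the paper's argument that exactness is unnecessary.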

C 4.1 Interactions with Hardware-Assisted Function A (call) X (call) Returns D (loop) B Y MAMBO-X64 uses hardware-assisted function returns [10], E (return) which exploit hardware return prediction by ensuring that the return target of a call is located immediately after the corresponding translated call instruction. This is done by not (a) Original control flow ending a basic block when a call instruction is encountered. Translated code can then use native call and return instruc- A X tions, which take advantage of hardware return address pre- diction automatically. This code layout is necessary because C C the hardware return address predictor makes the assumption D D that a return instruction will jump to the address immediately C after a call instruction. D One of the characteristics of traces generated by NET is that they can span function calls and returns, which allows E NET to inline a function call. However this does not preserve the original call and return instructions, which precludes B Y the use of hardware return address prediction. This property significantly degrades the effectiveness of hardware-assisted function returns when used with NET. (b) Translation result with NET Consider the example in Figure 3 which consists of a function containing a loop that is called from two different places. The first trace that NET will create is the inner loop C (CD) after it has been executed a sufficient number of times. A X D After the call from A to C is executed a few times, two B Y new traces are created. The first (ACD) crosses over the call instruction to inline the first half of the callee before looping E into the CD trace, while the second one (EB) crosses over the return instruction and continues in the caller. 
The latter is able to trace across a return instruction by using a guard: if the return address is different from the one when the trace was generated then the code will fall back to a hash table lookup to handle the indirect branch. However, this guard can be a big source of branch mispredictions if the function is called from multiple places. This is shown by the traces generated when the function is called from X: while the first block is similar, upon exiting CD control will go to the EB trace, which will then fail its guard and jump to the Y trace after a hash table lookup, thus incurring additional overhead.

Figure 3: Example traces showing the differences between NET and the ReTrace algorithm used by MAMBO-X64. It consists of two pieces of code (AB and XY) both calling a single function (CDE) which contains a loop (CD). (a) Original control flow; (b) translation result with NET; (c) translation result with ReTrace.

4.2 Integration with Hardware Return Prediction

MAMBO-X64 introduces an improved version of NET which is compatible with hardware-assisted function returns, called return-aware trace generation, or simply ReTrace. The principle is that traces should not cross function calls or returns. This is implemented by adding the following rules to NET:

• If a return instruction is reached while recording a trace, the trace is stopped at that instruction.
• If a call instruction is reached while recording a trace, the DBT will save the current state of the trace and stop recording. An entry is pushed onto the RAS as would normally be done for any call instruction, however the code cache address in the RAS entry will point to a resume stub instead. Control is then transferred to the call target normally (without recording).
• Once control reaches a resume stub — which are only reachable through a function return — the saved trace context is restored and trace recording is resumed.

Going back to the example in Figure 3, this new algorithm will also create the CD trace first, as it is the inner loop. However, the trace starting at A will be different: trace recording is paused after the call instruction and the trace is pushed onto the stack of active traces. Control is then passed to the inner loop trace. When the inner loop exits, it will create a trace E which stops at the return instruction. The return will pop the address of the resume stub from the RAS and resume recording of the trace started at A to form the AB trace. The XY trace is constructed in the same way as the AB trace. The resulting trace layout is much closer to the original code layout than that generated by NET.

The main advantage of this trace layout is that it allows translated code to make use of hardware return address prediction. This is possible because the use of a resume stub allows the return target in the final trace to be located immediately after the call instruction, which matches the expectations of the hardware predictor.

This approach does have some downsides: the generated traces are shorter, which gives MAMBO-X64 fewer opportunities for optimization, and return instructions force all bound registers to be flushed to memory. However, these disadvantages are outweighed by the performance improvement from hardware-assisted function returns, as shown in Section 6.3.

5. Precise OS Signal Handling

A signal is a mechanism by which an OS can interrupt the execution of an application process to notify it of some event. Such events include external events, such as a timer (SIGALRM), or application-generated events such as an unhandled page fault (SIGSEGV).

Precise handling of operating system signals is challenging in DBTs because they can interrupt program execution at arbitrary points. When a signal is delivered, the operating system invokes an application-defined signal handler function and passes it the execution context at the interruption point. This execution context contains the full register state of the processor at the point where the code was interrupted and is used to resume execution of the interrupted code if the application signal handler returns. This poses several challenges for DBTs:

• Signals are delivered between two instructions, however instruction boundaries in translated code may not match those of the original application code.
• The registers used by translated code will be different from those of the original code, so a DBT must reconstruct the original application register state from the register values of the translated code.
• A DBT may perform optimizations which eliminate writes to registers that appear dead, however these registers must contain a correct value in the context passed to the application signal handler.
• A signal should be delivered to the application in bounded time, otherwise the application may remain stuck in an infinite loop while waiting for a signal.

5.1 State Reconstruction

Signals can be separated into two types: synchronous and asynchronous. Synchronous signals are delivered in response to a processor-generated exception from a specific instruction, usually a load or store instruction. MAMBO-X64 tracks all potentially exception-generating instructions and ensures that, if an exception occurs, the contents of all AArch32 registers are either directly available or can be derived from the values currently in AArch64 registers.

MAMBO-X64 also creates a table of all instructions that can generate an exception (e.g. load/store instructions) within each fragment, containing the original instruction address, the current register mappings and any other metadata necessary to recover the original execution context if a fault occurred at that instruction. Since this metadata is rarely used, it uses a compact encoding scheme to minimize memory overhead.

Asynchronous signals are delivered in response to an external event, usually outside the control of an application, which means that they can occur at any instruction in the translated code. Extending the previously described mechanism to record metadata for all instructions is impractical because it limits optimization opportunities, increases memory usage and complicates the translation of certain AArch32 instructions (e.g. LDM, STM) which require multiple AArch64 instructions to emulate. Since these signals are inexact, signal delivery to the application is instead postponed until control leaves translated code and returns to MAMBO-X64, at which point the full AArch32 register state is available in a well-defined state.

5.2 Fragment Unlinking

While control will naturally return to MAMBO-X64 when the application code tries to execute a system call or when a new block needs to be translated, waiting for such an event to deliver a signal can be a problem if the application is stuck in an infinite loop. To avoid postponing a signal for an unbounded time, MAMBO-X64’s signal handler detects whether it has been interrupted in the middle of a fragment and, if so, will unlink the exits of the interrupted fragment. This will force any exit from that fragment to return control to MAMBO-X64. There are four ways in which control can exit a fragment:

Direct Branches These branches are unlinked by dynamically patching the branch instruction to redirect it to a code stub which records which exit was taken before returning control to MAMBO-X64.

Indirect Branches These branches have been translated into an inline hash table lookup. They are unlinked by replacing the hash table pointer with that of an empty table, which will cause a miss and return control to MAMBO-X64. If the signal was delivered in the middle of a lookup then the program counter is rewound back to the start of the lookup so that the new hash table is used.

Function Returns MAMBO-X64 tracks function calls and returns using a return address stack to predict the target of return instructions. Unlinking returns is done by replacing the top entry of the return address stack with the address of a code stub, which guarantees that control is returned to MAMBO-X64 whether the return address is correctly predicted or not.

Exception-Generating Instructions These instructions include system call (SVC) and undefined instructions as well as normal loads and stores that may trigger a page fault. Since they are already translated into a branch that returns to MAMBO-X64, nothing special needs to be done to handle them.

Once control has exited the code cache and returned to MAMBO-X64, all fragment exits are then re-linked to their previous state. Because multiple threads may receive a signal while in the same fragment, a reference count is used to track whether the direct branches in the fragment should be kept unlinked. The fragment is only re-linked once no more threads have a pending signal while executing inside that fragment. Another thread executing an unlinked fragment will only suffer a minor slowdown due to the forced exit to MAMBO-X64, after which it will resume execution without any adverse effects. Moreover, the window for this race condition is very small and we have only managed to observe it in specially-crafted test programs.

5.3 Race-Free Signal Delivery

Delaying signal delivery until execution has reached a safe point can lead to race conditions if certain events happen between the DBT receiving the signal from the kernel and delivering that signal to the translated application. These events are:

System Call A system call must not be executed while a signal is being held by the DBT, since this could lead to an application missing a signal entirely if the system call involves waiting for a signal. Consider the case of sigwait: invoking this system call during the delay would result in the application blocking indefinitely since, from the point of view of the kernel, the signal has already been delivered to the application.

Asynchronous Signal Receiving a second asynchronous signal during the delay can be problematic since the application signal handler for the first signal will execute with a different signal mask. If this signal mask would have blocked the second signal then that signal must be kept in a list in the DBT until the application signal mask is changed again to allow it to be delivered to the application. However, holding a signal for an extended period can lead to incorrect results from system calls that inspect the set of pending signals in the kernel.

Synchronous Signal A synchronous signal from an exception-generating instruction can also lead to similar issues, however this is complicated by the fact that execution cannot continue after such a signal, since attempting to re-execute the exception-generating instruction would simply lead to the same signal being raised again.

To preserve the Linux signal delivery semantics, MAMBO-X64 only delivers a single signal at a time. All signals are blocked while the DBT signal handler is executing, and all signals except those used for fault handling (e.g. SIGSEGV, SIGBUS) are kept blocked until the signal is delivered to the application. The blocked signals are restored immediately before the DBT starts executing the signal handler set up by the translated program.

If a synchronous fault occurs while a signal is pending then the fault is not delivered to the application. Instead, fragment metadata is used to recover the register state at the fault point and this state is then used when delivering the original signal to the application. The faulting instruction will execute again once the application has finished handling the signal.

MAMBO-X64 handles system calls using an atomic check that only performs a system call if there are no currently pending signals. The code for this is shown in Figure 4: if a signal is waiting to be delivered to the application then the system call will not be executed and an error code indicating a system call restart is returned. MAMBO-X64 will handle this error by immediately delivering the pending signal to the application as if it had occurred immediately before the system call instruction.

Note that this mechanism handles system call restarting transparently, which allows certain system calls that were interrupted by a signal to be restarted automatically once the signal handler returns. The kernel supports this by rewinding the program counter to the system call instruction in the context structure passed to the signal handler. When MAMBO-X64’s signal handler inspects this context, it will see that it was interrupted just before the system call instruction and handle it as if the system call had not been executed yet.

    atomic_begin:
        LDRB W9, signal_pending_flag
        CBNZ W9, restart_syscall
        SVC 0
    atomic_end:
        RET
    restart_syscall:
        MOV X0, #-ERESTARTSYS
        RET

Figure 4: Code to atomically execute a system call if there are no pending signals. Atomicity is ensured by having a signal handler rewind the program counter to atomic_begin if a signal occurs between atomic_begin and atomic_end.
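The atomicity argument behind Figure 4 can be modelled with a small instruction stepper. This is a hedged Python sketch, not the real mechanism (which is the four AArch64 instructions in the figure): the `signal_at` parameter is an invented device to simulate where the signal lands, and the handler's action is modelled by rewinding the simulated program counter to the check.

```python
# Simulate the atomic_begin..SVC window. pc 0 = flag check (LDRB/CBNZ),
# pc 1 = the system call (SVC). A signal arriving at either point sets
# the pending flag and rewinds the pc to atomic_begin, so the check and
# the syscall behave as one atomic unit.
def run(signal_at):
    pending = False
    pc = 0
    while True:
        if signal_at == pc and not pending:
            pending = True
            pc = 0  # DBT signal handler rewinds to atomic_begin
            continue
        if pc == 0:
            if pending:
                return "ERESTARTSYS"  # restart_syscall path
            pc = 1
        elif pc == 1:
            return "syscall-executed"

# Wherever the signal lands before the SVC retires, the syscall is
# suppressed and a restart is requested instead.
assert run(signal_at=0) == "ERESTARTSYS"
assert run(signal_at=1) == "ERESTARTSYS"
assert run(signal_at=None) == "syscall-executed"
```

The point of the model: there is no interleaving in which the syscall executes while a signal is pending, because any signal delivered inside the window forces re-execution of the check.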
Evaluation ever this is complicated by the fact that execution cannot We evaluate the overall performance of MAMBO-X64 (Sec- continue after such a signal, since attempting to re-execute tion 6.1) and how the different novel techniques contribute to Figure 5: Performance of SPEC CPU2006 on MAMBO-X64 on different processors. Performance numbers are relative to the benchmark running natively in 32-bit mode on the same processor. its low performance overhead (Section 6.3) using the SPEC 1 % on the latter. The geometric mean overhead on the in- CPU2006 [24] and PARSEC [5] benchmark suites. order Cortex-A53 core is higher at 7.5 %, but this is likely Because the ARMv8 processors used in these experiments to improve in the future as MAMBO-X64 has not yet been are capable of running AArch32 code directly, all benchmarks optimized to target in-order cores. were executed natively on the same processor and the results MAMBO-X64 is able to run many 32-bit benchmarks are used as a baseline for the experiments. All other results are faster than if they were run natively on the processor. This is normalized to this baseline, showing the relative performance due to a combination of several factors: of the DBT compared to native execution. All benchmarks are compiled with GCC 4.9.1 and optimization level -O2. • MAMBO-X64 takes advantage of the more flexible Two ARMv8 systems were used for the evaluation. The AArch64 instruction encodings to translate certain AArch32 first system is an AppliedMicro X-Gene X-C1 development instruction sequences in to a single AArch64 instruction. kit with 8 X-Gene processor cores running at 2.4 GHz. Each • Previous research in Dynamo [3] has shown that effective core has a 32 kB L1 data cache, a 32 kB L1 instruction cache, trace generation in a DBT can improve runtime perfor- a 256 kB L2 cache shared between each pair of cores and an mance compared to native execution. 8 MB L3 cache. 
The machine comes with 16 GB of RAM • It has been observed that on certain combinations of and runs Debian Unstable with Linux kernel version 4.6. benchmarks and microarchitectures, such as the libquan- The second system is an Intrinsyc Dragonboard 810 tum benchmark on X-Gene, the AArch32 code generated with a Qualcomm Snapdragon 810 processor. The processor by GCC causes processor pipeline stalls which do not is a heterogeneous multicore configuration (big.LITTLE) occur in the AArch64 translated code. Eliminating this with 4 Cortex-A57 out-of-order cores running at 1.96 GHz outlier brings the geometric mean performance down from and 4 Cortex-A53 in-order cores running at 1.56 GHz. The a 1 % speedup to a 0.25 % slowdown. Cortex-A57 cores have 32 kB of L1 data cache, 48 kB of L1 instruction cache and 2 MB of shared L2 cache. The machine Many of the floating-point benchmarks (such as povray, comes with 4 GB of RAM and runs Android 5.0.2 Lollipop sphinx3, tonto and gromacs) have significantly higher over- with Linux kernel version 3.10.49. head on the Cortex-A57 than on the other two micro- We use taskset on the Snapdragon system to force architectures. We have determined that this is due to a pe- benchmarks to run on either the A53 cluster or the A57 culiarity in the floating-point pipeline of the Cortex-A57 cluster. which only affects execution in AArch64 mode [2]; the core will steer floating-point multiply instructions to one of the 6.1 Overall Performance two floating-point execution units depending on whether the Figure 5 shows the performance of SPEC CPU2006 when destination register of that instruction is odd or even, rather running under MAMBO-X64 on the two test systems. than picking an idle execution unit, which can lead to load These results show that MAMBO-X64 reaches near-native imbalance between the two execution units. ARM has fixed performance on out-of-order cores such as the Cortex-A57 this in newer revisions of the Cortex-A57. 
or X-Gene, with a geometric mean overhead of 2.5 % on the The gobmk benchmark performs relatively poorly on all former, and a geometric mean performance improvement of tested systems; it is the only benchmark with an overhead of Figure 6: Performance of the PARSEC benchmarks running on MAMBO-X64 on the X-Gene system with different numbers of threads. Performance numbers are relative to the benchmark running natively in 32-bit mode with the same number of threads. over 10 % on all systems. This is because the gobmk bench- drops down to 3.0 %, which is closer to the results with fewer mark is instruction cache-bound when run natively. Running threads. the benchmark under a DBT increases the instruction cache pressure which contributes to the performance degradation. 6.3 Detailed Analysis We found that SPEC CPU under MAMBO-X64 has a We analyzed the performance impact of the various optimiza- geometric mean memory overhead of 4.24 %. The benchmark tions presented in this paper with SPEC CPU2006 on the on which MAMBO-X64 has the highest memory overhead X-Gene system. is gcc, with an overhead of 31.9 MB, and the one with the lowest memory overhead was libquantum with an overhead ReTrace We also investigated the effects of trace generation of 1.58 MB. and function call handling on the performance of translated The remaining results in this section are only shown for code. Figure 7a shows the performance of MAMBO-X64 in the X-Gene system for brevity. Additionally, the Snapdragon four configurations. Return-aware trace generation combines system produces relatively noisy results, with a typical vari- NET with hardware-assisted function returns using the Re- ation of around ±2 % between runs, whereas the X-Gene Trace algorithm. Hardware-assisted function returns extends system has much more stable results with a typical variation basic blocks across call instructions, which allows the use of of only about 0.1 %. a return address stack and hardware return address prediction. 
Next Executing Tail combines ‘hot’ sequences of basic blocks 6.2 Multi-Threaded Performance into traces using the NET algorithm. However this does not Figure 6 shows the performance of the PARSEC multi- make use of hardware-assisted function returns because NET threaded benchmark suite when running under MAMBO-X64 does not preserve the call structure. The basic blocks only on the X-Gene system. configuration translates only code into single-entry, single- These results show that MAMBO-X64 scales well to multi- exit basic blocks and does not make use of the hardware ple threads thanks to its thread-shared code cache architecture. return address prediction mechanism built into the processor. Since the code cache is shared among all running threads, By themselves, hardware-assisted function returns im- code only needs to be translated once instead of having to prove performance by reducing the overhead from 11.1 % be re-translated for each thread. This also allows significant to 4.1 %. Similarly, NET alone reduces the performance over- savings in memory usage because the code cache and its head from 11.1 % to 2.9 %. ReTrace is able to combine the associated metadata is not duplicated for all threads. benefits of both of these techniques, allowing it to exceed the MAMBO-X64 achieves a low geometric mean overhead of performance of native execution by 1.0 %. 2.1 % when running PARSEC with 1, 2 and 4 threads, but this overhead climbs to 4.9 % when running with 8 threads. This Register Bindings We measured the effect of inter-fragment is mainly due to the dedup benchmark having an overhead register allocation on performance by disabling dynamic reg- of over 30 %, which happens because the benchmark only ister bindings in MAMBO-X64. This is done by forcing all runs for 16 seconds and does not allow execution of the floating-point register values to be written back to memory translated code to amortize the cost of translation. 
If dedup is before any fragment exits and reloading those values from excluded from the results then the geometric mean overhead memory as necessary in the target fragment. (a) ReTrace

(b) Register bindings

(c) Speculative address generation
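The dynamic register bindings evaluated in Section 6.3 can be sketched as a code cache that specializes fragments on the set of registers bound on entry. This is an illustrative model with invented names, not MAMBO-X64's actual data structures: the same source block is translated once per distinct binding set and shared thereafter.

```python
# Toy code cache keyed by (entry address, bound-register set).
code_cache = {}
translations = 0

def get_fragment(entry_pc, bound_regs):
    """Return a translated fragment specialized for the floating-point
    values still live in host registers on entry, translating lazily."""
    global translations
    key = (entry_pc, frozenset(bound_regs))
    if key not in code_cache:
        translations += 1  # a new specialized copy is generated
        code_cache[key] = f"fragment@{entry_pc:#x} bound={sorted(bound_regs)}"
    return code_cache[key]

# Two predecessors reach the same source block with different bindings:
# each binding set gets its own specialization, but repeated entries
# with the same bindings share one translation.
a = get_fragment(0x1000, {"s0", "s1"})
b = get_fragment(0x1000, {"s2"})
c = get_fragment(0x1000, {"s1", "s0"})  # same bindings as `a`
assert a == c and a != b and translations == 2
```

The design point this illustrates is the trade-off measured above: specialization avoids flushing bound values to memory on fragment exits, at the cost of potentially translating the same source instructions more than once.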

Figure 7: Performance of SPEC CPU2006 running on MAMBO-X64 on the X-Gene system with different optimizations. Performance numbers are relative to the benchmark running natively in 32-bit mode.

The results are shown in Figure 7b. The effect is minimal on the SPEC integer benchmarks since MAMBO-X64 uses static register bindings for general purpose registers and therefore does not need to write them to memory. However, there is significant performance degradation on the SPEC floating-point benchmarks, which is mainly due to the extra memory traffic in the benchmark inner loops. These results show that register bindings offer a geometric mean performance improvement of 21.0 % in floating-point benchmarks.

Speculative Address Generation We tested the impact of speculative address generation in MAMBO-X64 by measuring the performance effect of disabling this optimization. This involves translating all load/store instructions which use register offset addressing into an ADD instruction to perform the address calculation, followed by a load/store instruction using the resulting address.

Figure 7c shows that this optimization, unlike the previous one, primarily impacts integer benchmarks as opposed to floating-point benchmarks. This is due to a detail of the AArch32 instruction set: unlike loads and stores to general-purpose registers, memory transfer instructions which target floating-point registers only support a single, immediate offset addressing mode.

The impact of this optimization is clearly shown in benchmarks such as bzip, gcc and hmmer where the overhead is reduced to almost zero. The gcc benchmark has the largest gain due to speculative address generation, going from a performance overhead of 11.1 % to a performance improvement of 0.4 %.

7. Related Work

Binary translation has previously been used successfully to assist architecture transitions. The most well known is Rosetta [1], which was used by Apple to transition their platform from PowerPC to x86. IA-32 EL [4] and HP Aries [26] both supported the transition to the IA-64 architecture from x86 and PA-RISC respectively. FX!32 [9, 17] was similarly used to help migrate x86 applications to the Alpha architecture. Binary translation has also been used to allow executing code from existing instruction sets on a VLIW core, such as Denver (ARM on VLIW) [6] and Crusoe (x86 on VLIW) [12].

StarDBT [25] and memTrace [22] are two DBTs which translate from x86 to x86-64 for the purpose of performing dynamic program instrumentation. These take advantage of the larger address space to lower the overhead of instrumentation code. However, they still suffer from relatively high performance overheads of around 12 % and 17 % respectively on SPEC CPU2006 compared to native execution. Additionally, both of these systems use thread-private code caches which do not scale well for multi-threaded applications.

DynamoRIO [8], Pin [19], Valgrind [20], HDTrans [23] and fastBT [21] are DBTs designed for program instrumentation which have the same host and source architectures. The last two do not perform register state recovery when translating signals, and thus may cause some applications to malfunction. The remaining DBTs described in this section make use of signal queues in the DBT, which can lead to certain race conditions with the kernel signal queues. MAMBO-X64 does not suffer from these issues.

8. Conclusions

The results presented, without having modified hardware, constitute the best DBT results published so far when moving from a 32-bit to a 64-bit architecture. For example, MAMBO-X64 has achieved a geometric mean 1 % speedup when running SPEC CPU2006 on an X-Gene 1 processor.

A significant part of the high performance of MAMBO-X64 is due to the use of the novel return-aware trace generation algorithm, called ReTrace, which combines the benefits of hardware-assisted function returns and next-executing tail trace generation. These two techniques significantly reduce the overhead of MAMBO-X64 independently, but ReTrace is able to combine the benefits of both to reach near-zero performance overhead.

We have also demonstrated an efficient scheme for mapping the AArch32 floating-point/SIMD register bank onto the effectively smaller AArch64 one. This is done by dynamically allocating the values of AArch32 floating-point registers into AArch64 registers and maintaining allocated registers across multiple translation blocks using a technique called dynamic register bindings. This involves creating specialized translated code fragments based on the same source instructions that accept different sets of bound registers on entry. Our effective handling of floating-point registers is the optimization with the highest impact on floating-point benchmarks.

Finally, we introduced a novel signal handling scheme which allows precise delivery of operating system signals while avoiding race conditions and minimizing performance overhead. This works by using fragment unlinking and signal masking to deliver asynchronous signals to the application’s signal handler. Synchronous signals are handled by recording fragment metadata for each potentially faulting instruction, allowing a signal context to be recovered for that fault.

Dynamic binary translation has been used in the past to support architecture transition. Whilst providing functionality, there has always been a significant performance overhead. Our results show that dynamic binary translation can be competitive with hardware-level backward compatibility in supporting legacy ISAs. In view of this, and the cost both in silicon area and verification effort in providing the hardware, dynamic binary translation may be the way to provide ISA migration in the future.

Acknowledgments

This work was supported by UK EPSRC grants DOME EP/J016330/1 and PAMELA EP/K008730/1.
Luján is supported by a Royal Society University Research Fellowship.

References

[1] Apple. Apple — Rosetta, 2006. URL https://www.apple.com/rosetta/. [Archived at http://web.archive.org/web/20060113055505/http://www.apple.com/rosetta/].
[2] Cortex-A57 Software Optimization Guide. ARM, 2016.
[3] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: a transparent dynamic optimization system. In Proceedings of the 2000 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 1–12. ACM, 2000. doi: 10.1145/349299.349303.
[4] L. Baraz, T. Devor, O. Etzion, S. Goldenberg, A. Skaletsky, Y. Wang, and Y. Zemach. IA-32 execution layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium-based systems. In Proceedings of the 36th Annual International Symposium on Microarchitecture, pages 191–204. ACM/IEEE Computer Society, 2003. doi: 10.1109/MICRO.2003.1253195.
[5] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.
[6] D. Boggs, G. Brown, N. Tuck, and K. S. Venkatraman. Denver: Nvidia’s first 64-bit ARM processor. IEEE Micro, 35(2):46–55, 2015. doi: 10.1109/MM.2015.12.
[7] D. Bruening, V. Kiriansky, T. Garnett, and S. Banerji. Thread-shared software code caches. In Fourth IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2006), pages 28–38. IEEE Computer Society, 2006. doi: 10.1109/CGO.2006.36.
[8] D. L. Bruening. Efficient, transparent, and comprehensive runtime code manipulation. PhD thesis, Massachusetts Institute of Technology, 2004.
[9] A. Chernoff, M. Herdeg, R. Hookway, C. Reeve, N. Rubin, T. Tye, S. B. Yadavalli, and J. Yates. FX!32: A profile-directed binary translator. IEEE Micro, (2):56–64, 1998.
[10] A. d’Antras, C. Gorgovan, J. D. Garside, and M. Luján. Optimizing indirect branches in dynamic binary translators. ACM Transactions on Architecture and Code Optimization, 13(1):7, 2016. doi: 10.1145/2866573.
[11] A. d’Antras, C. Gorgovan, J. Garside, J. Goodacre, and M. Luján. HyperMAMBO-X64: Using virtualization to support high-performance transparent binary translation. In Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE 2017, pages 228–241. ACM, 2017. doi: 10.1145/3050748.3050756.
[12] J. C. Dehnert, B. Grant, J. P. Banning, R. Johnson, T. Kistler, A. Klaiber, and J. Mattson. The Transmeta code morphing software: Using speculation, recovery, and adaptive retranslation to address real-life challenges. In 1st IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2003), pages 15–24. IEEE Computer Society, 2003. doi: 10.1109/CGO.2003.1191529.
[13] E. Duesterwald and V. Bala. Software profiling for hot path prediction: Less is more. In ASPLOS-IX Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 202–211. ACM Press, 2000. doi: 10.1145/356989.357008.
[14] R. Grisenthwaite. ARMv8 Technology Preview, 2011.
[15] K. M. Hazelwood, G. Lueck, and R. Cohn. Scalable support for multithreaded applications on dynamic binary instrumentation systems. In Proceedings of the 8th International Symposium on Memory Management, ISMM 2009, pages 20–29. ACM, 2009. doi: 10.1145/1542431.1542435.
[16] J. Hiser, D. W. Williams, W. Hu, J. W. Davidson, J. Mars, and B. R. Childers. Evaluating indirect branch handling mechanisms in software dynamic translation systems. In Fifth International Symposium on Code Generation and Optimization (CGO 2007), pages 61–73. IEEE Computer Society, 2007. doi: 10.1109/CGO.2007.10.
[17] R. J. Hookway and M. A. Herdeg. DIGITAL FX!32: Combining emulation and binary translation. Digital Technical Journal, 9(1), 1997. URL http://www.hpl.hp.com/hpjournal/dtj/vol9num1/vol9num1art1.pdf.
[18] H. Kim and J. E. Smith. Hardware support for control transfers in code caches. In Proceedings of the 36th Annual International Symposium on Microarchitecture, pages 253–264. ACM/IEEE Computer Society, 2003. doi: 10.1109/MICRO.2003.1253200.
[19] C. Luk, R. S. Cohn, R. Muth, H. Patil, A. Klauser, P. G. Lowney, S. Wallace, V. J. Reddi, and K. M. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, pages 190–200. ACM, 2005. doi: 10.1145/1065010.1065034.
[20] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, pages 89–100. ACM, 2007. doi: 10.1145/1250734.1250746.
[21] M. Payer and T. R. Gross. Generating low-overhead dynamic binary translators. In Proceedings of SYSTOR 2010: The 3rd Annual Haifa Experimental Systems Conference. ACM, 2010. doi: 10.1145/1815695.1815724.
[22] M. Payer, E. Kravina, and T. R. Gross. Lightweight memory tracing. In 2013 USENIX Annual Technical Conference, pages 115–126. USENIX Association, 2013. URL https://www.usenix.org/conference/atc13/technical-sessions/presentation/payer.
[23] S. Sridhar, J. S. Shapiro, and P. P. Bungale. HDTrans: a low-overhead dynamic translator. SIGARCH Computer Architecture News, 35(1):135–140, 2007. doi: 10.1145/1241601.1241602.
[24] Standard Performance Evaluation Corporation. SPEC CPU2006. http://www.spec.org/cpu2006/.
[25] C. Wang, S. Hu, H. Kim, S. R. Nair, M. B. Jr., Z. Ying, and Y. Wu. StarDBT: An efficient multi-platform dynamic binary translation system. In Advances in Computer Systems Architecture, 12th Asia-Pacific Conference, ACSAC 2007, Proceedings, volume 4697 of Lecture Notes in Computer Science, pages 4–15. Springer, 2007. doi: 10.1007/978-3-540-74309-5_3.
[26] C. Zheng and C. L. Thompson. PA-RISC to IA-64: transparent execution, no recompilation. IEEE Computer, 33(3):47–52, 2000. doi: 10.1109/2.825695.