XBT: FPGA Accelerated Binary Translation

Ke Chai

Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science

Thesis Advisor: Dr. Christos A. Papachristou

Department of Electrical, Computer and Systems Engineering
Case Western Reserve University
August, 2021

Case Western Reserve University
Case School of Graduate Studies

We hereby approve the thesis¹ of Ke Chai for the degree of Master of Science

Christos A. Papachristou, Committee Chair, Advisor, 07/16/2021
Department of Electrical, Computer and Systems Engineering

Daniel G. Saab, Committee Member, 07/16/2021
Department of Electrical, Computer and Systems Engineering

Seyed Hossein Miri Lavasani, Committee Member, 07/16/2021
Department of Electrical, Computer and Systems Engineering

¹ We certify that written approval has been obtained for any proprietary material contained therein.

Table of Contents

List of Tables
List of Figures
Acknowledgements
ABSTRACT
Chapter 1. Introduction
  Background
  Motivation
  Contribution
  Outline
Chapter 2. Literature Review
  Binary Translation
  Dynamic Binary Translation
  Hardware-Accelerated Binary Translation
Chapter 3. Methodology
  Configuration blocks
  Translation Blocks
  Reallocation Registers
  Branch Offset Issues
  Unrecognized Instructions
Chapter 4. Prototype Design
  Instruction Set
  System Design
  Microcode
  Translation Process
  Architecture Implementation
Chapter 5. Results
  Design Reports
  Benchmark Technique
  Measurement of Speedup
  Results
Chapter 6. Conclusions
Chapter 7. Future Work
References

List of Tables

4.1 MIPS32 User App. Instructions
4.2 Description of XBT Blocks
4.3 IMB/AMB Microcode
4.4 Register Reallocation Example
4.5 Unfolding I-type Example
4.6 Unfolding Load/Store Example
4.7 Reordering Example
4.8 Complex Instruction Example
4.9 XBT Configuration Registers
5.1 Translation Time: BT vs XBT

List of Figures

3.1 XBT System Block Diagram
3.2 Address Mapping Flow
4.1 An XBT Configuration Instance
4.2 Zynq 7000 SoC [24]
4.3 Block Design of XBT in Vivado
5.1 Power Report
5.2 Utilization Report
5.3 Timing Report

Acknowledgements

First, I want to thank my advisor, Dr. Papachristou, and Dr. Wolff. They have generously provided me with their knowledge, experience and help. Without them, this thesis would never have been finished. I also want to thank the committee members who took the time to read this thesis. Thanks to my parents, who gave me their consistent support, both emotionally and financially, to pursue my degree. Last but not least, I want to thank my wife, who left her well-paying job and followed me to America to take care of me. I really enjoyed her company and will never forget how much she has sacrificed for me.

ABSTRACT

Binary translation (BT) is the process of converting an executable binary from one instruction set architecture (ISA) to another. Accelerated binary translation (XBT) refers to BT that uses an FPGA for hardware acceleration and feeds the target processor at speed. This work proposes a reconfigurable pipelined structure built on an FPGA that performs XBT for different ISAs. An XBT system that translates MIPS to RISC-V is implemented and tested on the Xilinx Zynq platform. Results from several benchmarks show a speedup of approximately 48 times over an equivalent software approach.

1 Introduction

1.1 Background

Binary translation (BT) is the process of converting an executable binary from one instruction set architecture (ISA) to another [19].
BT makes it possible to migrate applications between two ISAs without the need for source code or recompilation [8,9,26]. For example, a legacy MIPS program can be translated to an equivalent RISC-V program using BT and run on a RISC-V processor. BT also serves as an emulation method with higher performance than ordinary software-based interpretation. Emulators like QEMU use BT techniques for better performance [5]. BT is a way to achieve Architecture-Independent Computing (AIC), which means enabling code for different ISAs to execute on any machine [3].

There are two main BT approaches: static binary translation (SBT) and dynamic binary translation (DBT). SBT translates the whole binary before execution, while DBT translates at runtime. Software DBT is more widely used for emulation because it deals better with problems such as self-modifying code, but it usually performs worse than SBT.

1.2 Motivation

Unlike a program originally built for the target ISA, a program binary-translated from another ISA suffers a performance loss due to the differences between the ISAs [25]. Since DBT systems translate code on the fly, the translation overhead is also a key factor that affects performance [6,19]. Accelerating the translation process is therefore an important part of improving the overall speed of DBT.

FPGAs are widely used in applications that need flexible hardware acceleration, such as AI and neural networks. FPGA fabrics are even embedded into systems-on-chip (SoCs), where they have high-speed, high-bandwidth connections to the processors. Pipelining on an FPGA provides overlapping parallelism for problems that deal with large amounts of sequential data. Though less efficient than ASICs [17], FPGAs offer flexibility that ASICs cannot provide. The FPGA's reprogrammability enables the system to switch between different configurations at runtime.
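As a rough illustration of why pipelining suits a long stream of instructions, the sketch below compares cycle counts with and without overlap under the textbook idealization of one cycle per stage and no stalls. The function names, the stage count and the instruction count are illustrative assumptions, not figures from the XBT prototype.

```python
def sequential_cycles(n_items: int, n_stages: int) -> int:
    """Cycles needed when each item passes through all stages before the next starts."""
    return n_items * n_stages


def pipelined_cycles(n_items: int, n_stages: int) -> int:
    """Cycles needed when a new item enters the pipeline every cycle (ideal, no stalls)."""
    if n_items == 0:
        return 0
    # The first item needs n_stages cycles to drain; each later item adds one cycle.
    return n_stages + n_items - 1


if __name__ == "__main__":
    # Hypothetical workload: 1000 instructions through a 4-stage pipeline.
    n_items, n_stages = 1000, 4
    seq = sequential_cycles(n_items, n_stages)
    pipe = pipelined_cycles(n_items, n_stages)
    print(f"sequential: {seq} cycles, pipelined: {pipe} cycles")
```

For long streams the ideal ratio approaches the number of stages, which is why a sequential, high-volume task such as instruction translation is a natural fit for a pipelined design.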
1.3 Contribution

This work proposes a pipelined structure built on an FPGA that performs accelerated binary translation (XBT). Using an FPGA makes better use of parallelism, which improves performance. With the speedup brought by the FPGA fabric, the method can efficiently generate semantically equivalent target code (i.e. the generated binary after translation) from source code (i.e. the binary to be translated). In addition to increasing translation speed, it also provides more flexibility at runtime.

An XBT prototype that translates MIPS to RISC-V is presented in this work. Several benchmarks are run on a Xilinx Zynq chip using both the XBT approach and a software-based BT approach. A comparison of their translation speeds shows that XBT achieves a substantial performance gain in the BT process.

1.4 Outline

Section 2 cites and comments on related work and background studies of relevant BT topics.
Section 3 describes the methodology of XBT and how XBT solves the key problems that occur in the BT process.
Section 4 presents a specific XBT prototype that translates MIPS to RISC-V, together with details of the design.
Section 5 gives the design report, benchmark method and results of the MIPS to RISC-V XBT on the Xilinx Zynq platform.
Section 6 draws conclusions from the results.
Section 7 discusses shortcomings and future work.

2 Literature Review

2.1 Binary Translation

Sites et al. [22] described the concept of BT in a 1993 paper, which also presents two binary translators targeting Alpha AXP computers. Altman et al. [2] introduced BT as an effective way of porting code automatically without recompilation. Cifuentes et al. [11] developed a reusable, component-based BT framework called UQBT, which can be adapted easily and inexpensively to different source and target machines. Further works [4,13,18,23] address the optimization of the BT process.
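To make the idea of porting a binary without recompilation concrete, the sketch below translates a single MIPS32 `addu` instruction word into the RISC-V `add` instruction word with the same effect. It is a minimal sketch under a loud assumption: MIPS register number n is mapped directly to RISC-V register xn, which keeps the zero register aligned but ignores ABI differences and is not the register reallocation scheme used by XBT. The function name `translate_addu` is hypothetical.

```python
MIPS_FUNCT_ADDU = 0x21   # MIPS32 SPECIAL-opcode function code for ADDU
RISCV_OPCODE_OP = 0x33   # RISC-V opcode for register-register ALU operations


def translate_addu(mips_word: int) -> int:
    """Translate one MIPS32 'addu rd, rs, rt' word into RISC-V 'add rd, rs1, rs2'.

    Assumption (illustration only): identity register mapping, MIPS $n -> RISC-V xn.
    """
    # MIPS R-type layout: opcode[31:26] rs[25:21] rt[20:16] rd[15:11] shamt[10:6] funct[5:0]
    opcode = (mips_word >> 26) & 0x3F
    rs = (mips_word >> 21) & 0x1F
    rt = (mips_word >> 16) & 0x1F
    rd = (mips_word >> 11) & 0x1F
    shamt = (mips_word >> 6) & 0x1F
    funct = mips_word & 0x3F

    if opcode != 0 or shamt != 0 or funct != MIPS_FUNCT_ADDU:
        raise ValueError(f"not an addu instruction: {mips_word:#010x}")

    # RISC-V R-type layout: funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]
    # ADD uses funct7 = 0 and funct3 = 0, so only the register fields and opcode are set.
    return (rt << 20) | (rs << 15) | (rd << 7) | RISCV_OPCODE_OP


if __name__ == "__main__":
    mips = 0x01095021  # addu $10, $8, $9
    print(f"{mips:#010x} -> {translate_addu(mips):#010x}")
```

A real translator must also handle instructions with no one-to-one counterpart, which is where most of the complexity of BT lies.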
To migrate legacy x86 applications to the newly designed M1 processor with ARM architecture, Apple developed a BT system named Rosetta 2 [16]. It uses a static BT approach that translates before execution. However, it cannot translate kernel extensions or virtual machine apps.

2.2 Dynamic Binary Translation

The concept of DBT dates back to a 1996 paper by Cifuentes et al. [10]. This paper argues that dynamic binary translators can reach performance equal to static ones while requiring a less complex runtime environment. It also presents a new technique as a complement to a retargetable binary translator.

Probst [19] gave the definition and usage of DBT in his 2002 paper. It shows solutions to problems that occur in the DBT process, such as jump/branch offset issues, register mapping and conditional bits. It also mentions the existence of a translation cache.

There are also works that use DBT for architectural emulation. Chapman et al. [7] combine DBT and virtualization for cross-platform emulation. The prototype, named "MagiXen", is an implementation of a Xen virtual machine monitor that can run IA-32 virtual machines on Itanium platforms.

DBT targeting VLIW machines has also been designed around static scheduling, which can handle the trade-off between performance and hardware complexity. Ebcioglu et al. [1,12] proposed an architecture called DAISY (Dynamically Architected Instruction Set from Yorktown), which uses DBT and VLIW machines to gain high instruction-level parallelism with simpler hardware designs.

2.3 Hardware-Accelerated Binary Translation

There are existing works that involve hardware acceleration in the DBT process. Yao et al. [25] propose an FPGA-based, hardware-software co-designed DBT system from x86 to MIPS. A "CCflag" register and several user-defined instructions are added to the MIPS processor core to resolve the problems brought by x86 conditional flags and different byte order, i.e. endianness.
To enhance the speed of translation, a jump address look-up table (JLUT) is also implemented as part of the translator. Though it involves an FPGA, this work does not exploit the FPGA's reconfigurability.

Rokicki et al. [20] proposed a hardware-accelerated DBT that operates on MIPS binaries and targets a custom VLIW core. A small single-issue processor is dedicated to the DBT process, along with blocks designed with high-level synthesis (HLS) technology. A more recent paper by Rokicki [21] extends this approach to heterogeneous multi-core architectures to lower power consumption while maintaining considerable performance.
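The translation cache mentioned in Section 2.2 and the jump address look-up table described above share one idea: record where already-translated code lives, so that a block is translated only once and source jump targets can be mapped to target addresses. The sketch below is a software analogy of that idea, not a description of the hardware JLUT in [25]. The names `TranslationCache` and `translate_block`, the 4-byte word size and the addresses used are all hypothetical.

```python
from typing import Callable, Dict, List


class TranslationCache:
    """Maps source block addresses to the addresses of their translated code."""

    def __init__(self, translate_block: Callable[[int], List[int]]) -> None:
        # translate_block is a stand-in for the real translator: given a source
        # address, it returns the translated instruction words for that block.
        self._translate_block = translate_block
        self._target_addr: Dict[int, int] = {}  # source address -> target address
        self._code: List[int] = []              # translated words, laid out back to back
        self.misses = 0

    def lookup(self, source_addr: int) -> int:
        """Return the target address for a source address, translating on a miss."""
        if source_addr not in self._target_addr:
            self.misses += 1
            words = self._translate_block(source_addr)
            # Assumes 4-byte instruction words in the target code buffer.
            self._target_addr[source_addr] = len(self._code) * 4
            self._code.extend(words)
        return self._target_addr[source_addr]


if __name__ == "__main__":
    # Dummy translator: every block becomes two RISC-V NOPs (0x00000013).
    cache = TranslationCache(lambda addr: [0x00000013, 0x00000013])
    for src in (0x400000, 0x400100, 0x400000):
        print(f"source {src:#x} -> target offset {cache.lookup(src):#x}")
    print("blocks translated:", cache.misses)
```

The repeated lookup of the first address is served from the table without calling the translator again, which is the saving a translation cache provides when code such as a loop body is executed many times.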
