Efficient Multiple-ISA Embedded Processor Core Design Based on RISC-V

Efficient Multiple-ISA Embedded Processor Core Design Based on RISC-V Yuanhu Cheng, Libo Huang*, Yijun Cui, Sheng Ma, Yongwen Wang, Bincai Sui {chengyuanhu,libohuang,cuiyijun18,masheng,wyw,bingcaisui}@nudt.edu.cn National University of Defense Technology Changsha, China ABSTRACT system. Second, most binary translation systems are still designed RISC-V ISA is developing rapidly, and its target field highly overlaps to complete functional simulation, but the performance has not with the ARM ISA. As a later ISA, RISC-V needs to solve the problem received much attention. of software compatibility. In the embedded field, by a multiple-ISA Another method to solve the software compatibility problem processor based on binary interpretation, RISC-V supports for ARM is the multiple-ISA processor that adds dedicated hardware to the Thumb can be implemented efficiently. However, binary interpreta- traditional processor to support more than one ISAs. Although the tion may result in lower performance of running non-native ISA multiple-ISA processor also increases the area and power of the chip, programs than native ISA. As a result, to improve the performance it is a more appropriate method that solving the problem of software of running the ARM Thumb programs, we propose some optimiza- compatibility in the embedded field if it can be implemented with tion methods of hardware support to reduce the number of RISC-V very low hardware resource consumption. However, at present, the instructions required to interpret ARM Thumb instructions. Based most critical issue in using the hardware approach is the additional on the open-source Zero-riscy core, we implement a demo to sup- hardware overhead associated with supporting multiple ISAs. For port ARMv6-M based on RISC-V. The results show that running example, in [7], a dynamic two-level binary translation system is the ARMv6-M code of Dhrystone and CoreMark benchmarks com- implemented. The first level translates from non-native ISA into piled by ARMCC, Zero-riscy can achieve 85.0% and 73.1% of the native ISA and the second level is used to optimize the generated performance of running RISC-V code compiled by GCC with less code, in which the first level increases the hardware overhead by than 13.5% increase of FPGA resources. 29% for supporting ARM, while the second level consumes more than the first. KEYWORDS In this paper, we try to support RISC-V and ARM Thumb using multiple-ISA processors based on binary interpretation, because the RISC-V, ARM Thumb, Embedded, Multiple-ISA, Binary translation method based on binary interpretation is the simplest so that it can achieve a multiple-ISA processor with lowest hardware resources. 1 INTRODUCTION Firstly, we discuss the conversion of registers and instructions In recent years, the RISC-V instruction set architecture (ISA) has between ARM Thumb and RISC-V RV32I, as well as the optimization developed rapidly, and its target field highly overlaps with the methods that can be taken to improve the performance of running ARM ISA. Compared to the traditional ISA, RISC-V has two obvious ARM Thumb programs using RISC-V processors. Then, we develop advantages [21]: 1) open-source, users can apply the RISC-V ISA a simple demo of the processor supporting the ARMv6-M (a subset for free to design processors, and even do secondary development of ARM Thumb) and RISC-V ISAs. The demo is implemented by based on open-source code from an open-source community; 2) adding a binary interpreter that converts ARMv6-M instructions the basic ISA is very simple and extensible. RISC-V has a series of into RISC-V instructions in the fetch stage based on PULPino’s standard extensions, so users can flexibly choose extended ISA to open-source RISC-V core, called Zero-riscy (Ibex) [1, 2]. We employe implement according to the scenario. These advantages of RISC-V Dhrystone and CoreMark to evaluate the performance of Zero-riscy can greatly shorten development costs and the period of the proces- running ARMv6-M programs through simulation, and count the sor. However, a large number of existing programs are developed FPGA resources consumed after synthesis. for traditional ISA (e.g. ARM and X86) and cannot run directly Finally, on optimized Zero-riscy, for the Dhrystone code com- on RISC-V processors. As a later ISA, RISC-V needs to solve the piled by ARMCC compiler, each ARM Thumb instruction is trans- problem of software compatibility. lated into 1.62 RISC-V instructions with a performance of 0.69 In desktop and server systems, software binary translation [19] is DMIPS/MHz, 85.0% of the performance of running RISC-V com- the most common method to overcome software compatibility. This piled by GNU GCC compiler. And for the CoreMark code, each method only requires to install a binary translation system, and ARM Thumb instruction is translated into 1.11 RISC-V instructions then the program based on other ISAs can be directly run through with a performance of 1.22 CoreMark/MHz, 73.1% of RISC-V. Com- this system. The advantage of software binary translation systems pared to without any optimization, the performance is improved by is its convenience and flexibility. Unfortunately, it has limitations 2.01 and 2.80 times respectively while the consumption of hardware in the embedded system. First of all, due to the requirements of the resources is decreased slightly. chip area and power of the embedded system, there is often not enough storage space and environment to run a binary translation *Corresponding author , Cheng and Huang, et al. 2 RELATED WORK be much lower than running a native ISA program, because many Binary translation is the most common way to overcome software native instructions may be required to complete a non-native in- compatibility. According to different implementations, binary trans- struction. In this section, we will focus on how to interpret ARM lation can be divided into three types: binary interpretation (or Thumb instructions into RISC-V instructions, as well as optimiza- emulator), dynamic translation, and static translation [3]. QEMU tion methods for some ARM Thumb instructions to improve per- [5] is a typical software binary translation system, which uses dy- formance. namic binary translation to translate an ISA into the corresponding target ISA and supports to simulate the most popular ISAs. Simi- 3.1 Register Mapping larly, DAISY [10] and FX!32 [8, 13] are binary translation systems The ARM Thumb includes thirteen general-purpose registers (R0 ~ that support specialized ISA. For embedded systems, LLBT [18] R12), one Stack Pointer (SP, R13), one Link Register (LR, R14) and is a static binary translation system based on the LLVM compiler, one Program Counter (PC, R15), all of them are 32-bits. The RISC- and [4, 15–17] study the optimization methods of dynamic binary V includes 32 general-purpose registers, and the PC is a special translation for embedded systems. register. Because RISC-V has more registers than ARM Thumb, There are some works that implement binary translation sys- register mapping can be easily achieved. However, the R0 of RISC-V tem with combination of software and hardware. In these works, is always equal to 0 and cannot be modified, we cannot directly the major function of the hardware is that accelerating the binary map the R0 of ARM Thumb to RISC-V. translation process and optimizing the code to improve perfor- Therefore, we can map the registers R0 ~ R12 of the ARM Thumb mance. GODSON-3 Processor [14] uses software to translate X86 to R16 ~ R28 of RISC-V, SP to R29, LR to R30, and PC to R31. Since instructions into MIPS instructions and inserts some units into most ARM Thumb instructions take 3 bits to represent a register hardware to support X86 instructions better. Furthermore, it adds (R0 ~ R7), we can add a two-bits prefix ’10’ in front of the 3-bits some instructions to MIPS ISA so that one X86 instruction can ARM Thumb register number to map it to the RISC-V register (if be converted into fewer MIPS instructions. Crusoe processor [9] the register number in the ARM Thumb instruction is represented is a microprocessor that uses Code Morphing Software (CMS) to by 4 bits, just add a one-bit prefix ’1’ in front of it). All register convert X86 instructions into VLIW instructions. CMS is a dynamic mapping relationships are shown in Table 1. binary translation program and stored in the Read-Only Memory It is worth mentioning that many instructions can modify the (ROM) of the motherboard and can be considered as part of the value of PC in ARM Thumb ISA, while RISC-V cannot. For a RISC- hardware. By modifying CMS, the processor can support any ISA. V processor that supports ARM Thumb, it can still use the PC And hardware is also used to accelerate the speed of binary transla- register defined in RISC-V to complete the instruction fetch. Ifan tion and optimize the code. And [22] improves the performance of ARM Thumb instruction needs to modify the PC value, the RISC-V MIPS microprocessors running X86 programs by hardware based AUIPC instruction is used firstly to load the PC value into register on software dynamic binary translation. R31. After performing related operations, the JALR instruction is Although there are some existing works that support multiple used to change the PC value and jump to target. ISAs by hardware, the method of hardware still requires more Because RISC-V and ARM Thumb are both Reduced Instruction research. [7, 11] implement a two-level binary translation system Set Computing (RISC) ISA, the instruction mapping between ARM outside a MIPS core.

Efficient Multiple-ISA Embedded Processor Core Design Based on RISC-V

1 Assembly Language Programming Status Flags the Status Flags Reflect the Outcomes of Arithmetic and Logical Operations Performe

ARM Instruction Set

Multiplication and Division Instructions

Overview of IA-32 Assembly Programming

Assembly Language: IA-X86

CS221 Booleans, Comparison, Jump Instructions Chapter 6 Assembly Language Is a Great Choice When It Comes to Working on Individu

Instruction Set Architecture

CS107, Lecture 13 Assembly: Control Flow and the Runtime Stack

X86 Assembly Programming Part 2

Arithmetic and Logical Operations Chapter Nine

Term 2 Lecture 6

Gnu Assembler