A Trigonometric Hardware Acceleration in 32-Bit RISC-V

This article has been accepted and published on J-STAGE in advance of copyediting. Content is final as presented.

IEICE Electronics Express, Vol.VV, No.NN, 1–6

LETTER

A Trigonometric Hardware Acceleration in 32-bit RISC-V Microcontroller with Custom Instruction

Khai-Duy Nguyen1a), Dang Tuan Kiet1b), Trong-Thuc Hoang1c), Nguyen Quang Nhu Quynh2d), Xuan-Tu Tran3e), and Cong-Kha Pham1f)

1 University of Electro-Communications (UEC), 1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585, Japan
2 The University of Danang, University of Science and Technology (DUT), 54 Nguyen Luong Bang st., Danang, Vietnam
3 The Information Technology Institute (VNU-ITI), 144 Xuan Thuy road, Cau Giay dist., Hanoi, Vietnam

a) [email protected], b) [email protected], c) [email protected], d) [email protected], e) [email protected], f) [email protected]

DOI: 10.1587/elex.18.20210266
Received June 22, 2021
Accepted July 07, 2021
Publicized July 20, 2021
Copyright © 2021 The Institute of Electronics, Information and Communication Engineers

Abstract

This work presents a 32-bit Reduced Instruction Set Computer fifth-generation (RISC-V) microprocessor with a COordinate Rotation DIgital Computer (CORDIC) accelerator. The accelerator is implemented inside the core and is used by software via a custom instruction. The microprocessor used is the VexRiscv with the RV32IM Instruction Set Architecture (ISA), that is, 32-bit RISC-V including Integer and Multiplication. The experimental results were collected using a Field-Programmable Gate Array (FPGA) on the DE2-115 development kit and an Application-Specific Integrated Circuit (ASIC) synthesizer with a 180-nm CMOS process library.

key words: 32-bit microprocessor, accelerator, CORDIC, custom instruction, RISC-V, trigonometric.

Classification: Integrated circuits (logic)

1. Introduction

Reduced Instruction Set Computer fifth-generation (RISC-V) [1] is a free and open Instruction Set Architecture (ISA) developed at the University of California at Berkeley. It possesses stand-out characteristics that make it attractive to open-source communities in both academia and industry. Many RISC-V-based processors have been presented recently. Some noteworthy works are the highly customizable Rocket-chip coreplex [2], the 32-bit E-core series [3], the 64-bit U-core series [4], the minimal RISC-V V-core for embedded systems [5], and the 32-bit single-cycle RISC-V [6]. Due to its versatility and customizability, the RISC-V ISA has become a suitable target for developing a highly customizable computer system.

Nowadays, computational tasks have become far more complex than general-purpose computers can serve efficiently. Thus, processor efficiency requirements are becoming increasingly critical. Accelerators are extensively used for many intensive computational tasks, reducing execution time and energy consumption. Different companies and research groups are developing accelerators in RISC-V for various applications such as digital signal processing [7], artificial intelligence [8, 9], and solving mathematical algorithms [10, 11].

Among heavy computational tasks, the calculation of trigonometric functions is widely used, especially in digital signal processing algorithms of wireless and communication systems such as WiMAX [12], 3GPP-LTE [13], MIMO [14], CDMA [15], OFDM [16], and WLAN [17]. However, the complexity of trigonometric algorithms makes them a problem when computing in the digital realm. If a general-purpose processor executes the algorithms, it breaks them down into multiple simple calculations, reducing its efficiency. To cope with this problem, an accelerator specialized in trigonometry computation is needed.

COordinate Rotation DIgital Computer (CORDIC) [18] is a popular algorithm used for trigonometry computation [19–22]. It is a simple and effective algorithm for hyperbolic and trigonometric functions, usually converging by one digit (or bit) per iteration. A CORDIC uses only adders and shifters to calculate the result, so it can be built with relatively basic hardware. Besides, CORDIC can calculate multiple functions with the same hardware, so it is ideal for many applications.

We aim to develop a trigonometric accelerator applying the CORDIC algorithm to increase efficiency in calculating trigonometric functions and to research the RISC-V processor's customizability.

In this paper, we designed and implemented a 32-bit RISC-V microprocessor with a CORDIC algorithm accelerator. The implemented core processor is the VexRiscv Central Processing Unit (CPU), an implementation of the RISC-V ISA with the base integer instructions and multiplication/division on 32-bit registers (RV32IM). Within the VexRiscv core, the CORDIC accelerator was connected directly to the Execute stage. The core was placed in the Briey System-on-Chip (SoC) and was synthesized on a Field-Programmable Gate Array (FPGA) and at the Application-Specific Integrated Circuit (ASIC) level with the cell logic of 180-nm CMOS technology.

The remainder of this paper is organized as follows. Section 2 provides background information for the project. Section 3 describes the architecture of the proposed SoC chip. The implementation results are presented in detail in Section 4. Finally, Section 5 concludes the paper.

2. Background Research

2.1 CORDIC Algorithm

The CORDIC algorithm calculates the trigonometric functions by performing two-dimensional vector rotation in circular coordinate systems.
For example, equation (1) represents the iteration equations of the radix-2 CORDIC algorithm in rotation mode of the circular coordinate system:

$$
\begin{aligned}
x_{i+1} &= x_i - d_i y_i 2^{-i} \\
y_{i+1} &= y_i + d_i x_i 2^{-i} \\
z_{i+1} &= z_i - d_i \alpha_i
\end{aligned}
\qquad (1)
$$

The value of $d_i$ is expressed in equation (2). The values of $\alpha_i$ are chosen so that $\tan(\alpha_i) = 2^{-i}$, and the multiplication by tangent terms is reduced to a simple shift operation. Therefore, we have $\alpha_i = \tan^{-1}(2^{-i})$.

$$
d_i =
\begin{cases}
-1 & \text{if } z_i < 0 \\
1 & \text{otherwise}
\end{cases}
\qquad (2)
$$

2.2 RISC-V Instruction Set Architecture

The RISC-V ISA allows the creation of open-source processors, and, in relation to them, many open-source resources have been developed, such as compilers, debuggers, and hardware implementations in different hardware description languages. Thanks to the community's support, suitable Integrated Development Environments (IDE) and Operating Systems (OS) for many variations of processors are available. Originally, the RISC-V ISA was proposed to be a simple specification of a processor. It means that the base integer ISA is an adequate minimal set of instructions [23]. Moreover, RISC-V has been designed to support extensive customization. Therefore, optional instruction-set extensions can be added to the base integer ISA. The base integer ISA is named "I", prefixed by RV32 or RV64, which provide 32-bit and 64-bit address spaces, respectively. The 64-bit version RV64I is suitable for large and sophisticated systems; on the other hand, the 32-bit address space of RV32I is adequate for embedded and Internet of Things applications [24–26].

Fig. 1. RISC-V base instruction formats [1].

The RV32I base ISA has four core instruction formats (R/I/S/U), as depicted in Fig. 1 [1]. All of these instructions are fixed at 32 bits in length and aligned on a four-byte boundary in memory. The three fields opcode, funct3, and funct7 contain the instruction code. The addresses of the two source registers are contained in rs1 and rs2, while rd contains the address of the destination register. Depending on the instruction, the imm field, which contains the immediate operand, has different widths. Besides the base integer ISA, standard extensions are defined to provide integer multiplication and division, atomic operations, and single/double-precision floating-point arithmetic. The more extensions are added to the base, the higher the number of instructions.

2.3 VexRiscv: 32bit Microprocessor

VexRiscv [27] is an RV32IM implementation of the RISC-V ISA. VexRiscv is written in a new hardware construction language called SpinalHDL [28], a language based on the Scala programming language [29]. VexRiscv has a modular design, with most of the components of the processor being optional. The extensions of VexRiscv include multiplication and division, instruction and data caches, a memory management unit, a hazard controller, etc. The high-level SpinalHDL maximizes the customizability of the VexRiscv, which makes VexRiscv an ideal platform for developing hardware accelerators. The Briey System-on-Chip, an implementation consisting of the VexRiscv core and peripherals, was used as the primary subject in this study.

3. Proposed Implementation

3.1 System Overview

Fig. 2 shows the architecture of the Briey SoC, which includes the VexRiscv core and its peripherals. The core is composed of a 32-bit RV32IM RISC-V CPU, a 4-kilobyte (KB) instruction cache ($I), a 4-KB data cache ($D), an Arithmetic Logic Unit (ALU), and a full barrel shifter. A static memory translator, branching, a debug module via Joint Test Action Group (JTAG), and the CORDIC module are also included. The Briey SoC uses an Advanced eXtensible Interface (AXI) [32] Crossbar to connect the RISC-V core with a Synchronous Dynamic Random Access Memory (SDRAM) Controller, an Advanced Peripheral Bus 3 (APB3) Bridge, and a 16-KB on-chip Random Access Memory (RAM). The SoC also includes peripherals such as a General Purpose Input/Output (GPIO), a timer, and a Universal Asynchronous Receiver/Transmitter (UART) controller.

Fig. 2. Briey SoC Architecture.

$$
\begin{aligned}
x_n &= K\left[x_0 \cos(z_0) - y_0 \sin(z_0)\right] \\
y_n &= K\left[y_0 \cos(z_0) + x_0 \sin(z_0)\right] \\
z_n &= 0
\end{aligned}
\qquad (3)
$$

In this mode, commonly used input values are $x_0 = 1/K$, $y_0 = 0$, and $z_0 = \text{angle}$, with the scaling factor $K \approx 1.646$. From there, the values of cosine and sine can be gathered at
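
As an illustration only, the rotation-mode iteration of equations (1) and (2), started from the inputs listed for equation (3), can be sketched in floating-point Python. This is not the authors' hardware implementation, which relies on adders and shifters. The function names `cordic_gain` and `cordic_cos_sin`, the default of 32 iterations, and the use of floating-point multiplication by 2^-i in place of a hardware shift are assumptions made for this sketch.

```python
import math


def cordic_gain(iterations=32):
    """Scaling factor K = product of sqrt(1 + 2^(-2i)), about 1.646."""
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain


def cordic_cos_sin(angle, iterations=32):
    """Rotation-mode circular CORDIC, returning (cos(angle), sin(angle)).

    angle is in radians and should stay within roughly +/-1.74 rad,
    the sum of all alpha_i, for the iteration to converge.
    """
    # Start from x0 = 1/K, y0 = 0, z0 = angle so the final x and y are
    # already the cosine and sine without a separate scaling step.
    x = 1.0 / cordic_gain(iterations)
    y = 0.0
    z = angle
    for i in range(iterations):
        d = -1.0 if z < 0 else 1.0        # equation (2)
        shift = 2.0 ** (-i)               # a right shift by i bits in hardware
        alpha = math.atan(shift)          # alpha_i = atan(2^-i), a table in hardware
        # Equation (1); the tuple assignment uses the old x and y on the right.
        x, y = x - d * y * shift, y + d * x * shift
        z = z - d * alpha
    return x, y
```

Calling `cordic_cos_sin(0.5)` returns a pair that approximates the cosine and sine of 0.5 rad; a fixed-point hardware version would replace the floating-point values with integers and the table of `alpha` values with precomputed constants.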
