A Soft Processor Overlay with Tightly-Coupled FPGA Accelerator

A Soft Processor Overlay with Tightly-coupled FPGA Accelerator Ho-Cheung Ng, Cheng Liu, Hayden Kwok-Hay So Department of Electrical & Electronic Engineering, The University of Hong Kong fhcng, liucheng, [email protected] Abstract—FPGA overlays are commonly implemented as standard filter kernel cannot readily operate on, are handled in coarse-grained reconfigurable architectures with a goal to im- software. prove designers’ productivity through balancing flexibility and ease of configuration of the underlying fabric. To truly facilitate Data: Pixels of size N × N full application acceleration, it is often necessary to also include a 1 # define BUF 16 // HW computes 16x16 output pixels highly efficient processor that integrates and collaborates with the 2 for r := 0 to N − 1 do accelerators while maintaining the benefits of being implemented 3 for c := 0 to N − 1 do within the same overlay framework. 4 if pixel[r; c] is edge then This paper presents an open-source soft processor that is 5 SW SOBEL( pixel, r; c ); designed to tightly-couple with FPGA accelerators as part of 6 else if ((r − 1) % BUF) == 0 && an overlay framework. RISC-V is chosen as the instruction 7 (c − 1) % BUF) == 0 then set for its openness and portability, and the soft processor is 8 HW SOBEL( pixel, r; c ); 9 else designed as a 4-stage pipeline to balance resource consumption 10 continue; and performance when implemented on FPGAs. The processor 11 end is generically implemented so as to promote design portability 12 end and compatibility across different FPGA platforms. 13 end Experimental results show that integrated software-hardware Algorithm 1: Pseudocode for Sobel edge detector. As applications using the proposed tightly-coupled architecture the hardware accelerator operates on a fixed 16 × 16 achieve comparable performance as hardware-only accelerators while the proposed architecture provides additional run-time array of output pixel at a time, software passes control flexibility. The processor has been synthesized to both low-end to the accelerator only for cases when all 17×17 pixels and high-performance FPGA families from different vendors, are available. Otherwise, the computation is carried out achieving the highest frequency of 268:67 MHz and resource in software. Assume N − 2 is a multiple of BUF. consumption comparable to existing RISC-V designs. I. INTRODUCTION While the design of Algorithm 1 may be specific to the particular implementation of Sobel edge detection, it high- By raising the abstraction level of the underlying config- lights several challenges commonly faced by many real-world urable fabric, many early works have already demonstrated the hardware-software designers. First of all, because of the lim- promise of using FPGA overlays to improve designer’s pro- ited flexibility of most hardware accelerators, the controlling ductivity in developing hardware accelerators [1], [2]. While software must ensure the necessary input data are available such hardware accelerators can often deliver significant perfor- before the accelerator is launched. Furthermore, unless the mance improvement over their software counterparts, they are hardware accelerator is arbitrarily flexible, software running often fixed in functionality and lack the flexibility to process in the CPU must also be able to process any run time data irregular input or data that depends on run-time dynamics. To that cannot readily be processed by the accelerator. truly take advantage of the performance benefit of hardware In view of the above, this paper proposes the use of a small, accelerators, it is therefore desirable to have an efficient CPU open source soft processor to provide fine-grained control in the overlay tightly-coupled with the accelerator to control for the hardware accelerator in the context of an overlay its operations and to maintain compatibility with the rest of framework. The core is designed to be tightly-coupled with the software system. the hardware accelerator in order to minimize the overhead in- To illustrate these intricate hardware-software codesign volved with switching control between hardware and software. challenges, Algorithm 1 shows a simple design that accelerates RISC-V RV32I [3] is chosen as the ISA for its openness and the Sobel edge detection algorithm in such heterogeneous simplicity. Finally, the core is generically designed in order to system. In this implementation, an accelerator that computes promote design portability and compatibility. 16 × 16 output pixels at a time is implemented in FPGA. As such, we consider the main contribution of this work During run time, depending on the user input image size, the rests on the demonstration of the benefits of tightly-coupling software reuses this hardware accelerator for as many complete a lightweight CPU with hardware accelerator to serve within 16 × 16 output pixels as possible. The remaining odd pixels, a combined overlay architecture. We also demonstrated that as well as pixels on the boundary of the image where the a simple RISC-V CPU with 4-stage pipeline is adequate to Copyright held by the owner/author(s). Presented at 31 2nd International Workshop on Overlay Architectures for FPGAs (OLAF2016), Monterey, CA, USA, Feb. 21, 2016. 2nd International Workshop on Overlay Architectures for FPGAs (OLAF2016) TABLE I: The format of BAA instruction. IF DEC EXE/ MEM WB 0x4 31 20 19 15 14 12 11 7 6 0 IR IR imm[11:0] rs1 funct3 - opcode 12 5 3 5 7 offset[11:0] base width - BAA rs1 we PC addr inst A rs2 rd1 we wa rd2 ALU addr rdata TABLE II: The format of RPA instruction. wd B R IMEM Reg. File 31 20 19 15 14 12 11 7 6 0 wdata imm[11:0] rs1 funct3 - opcode 12 5 3 5 7 DMEM offset[11:0] base width - RPA Primary Architecture: Soft Processor Aux_Mem_WrData Stall Aux_Mem_Addr Accelerator (e.g. soft processor is ported onto the legacy FPGA devices which Coarse-grained Aux_Mem_ Reconfigurable could be intrinsically small in size. Also, the additional delay RdData Architecture) Auxiliary Architecture incurred by the extra adder could be partly compensated by the two additional multiplexers which are placed in front of Fig. 1: High-level diagram of the proposed tightly-coupled DMEM. architecture. Moreover, as the benefit of using a virtual layer of overlay architecture on FPGA rests on improving designer’s productivity while providing certain customization capabilities, the provide control while delivering reasonable performance and proposed soft processor can also be customized in terms of the maintaining software compatibility. IMEM and DMEM size. Developers can change the memory In the next section, we elaborate on the design and im- sizes by modifying a few lines of macro or execute a program plementation of the soft processor and the tightly-coupled that comes along with the soft processor design package. architecture. We then evaluate the performance of the pro- It is important to note that, in order to further minimize posed architecture in Section III and discuss related work in FPGA resource consumptions, components that are not strictly Section IV. We make conclusions in Section V. necessary for processor overlay execution are removed from II. DESIGN AND IMPLEMENTATION DETAILS the current implementation. These components include the Control Status Registers (CSR) and their corresponding logic. Figure 1 displays a high-level diagram of the proposed Future versions of the processor implementation will incorpo- tightly-coupled architecture. This design consists of two en- rate the CSR back and will provide tools to allow developers tities where an accelerator can be integrated with the single- to remove them during the processor customization. issue, in-order processor pipeline by sharing the data memory (DMEM). To ensure correct execution flow, control signal is B. The Tightly-coupled Architecture fed from the accelerator to the processor control path so that the pipeline is stalled correctly when the accelerator is carrying In order to support direct memory access and allow zero- out execution. overhead transfers of control, the soft processor is tightly- coupled with the hardware accelerator as illustrated in Fig- A. The Soft Processor ure 1. In the proposed framework, Multiple Runtime Archi- In order to reduce substantial resource consumption while tecture Computer(MURAC) model [4] is adopted to handle maintaining certain efficiency, the soft processor is designed the transfer of control when the execution is switched from as a 4-stage pipeline by integrating the execution-stage and one architecture to another. memory-stage together. This eliminates the load-use hazard The execution model of the tightly-coupled architecture where a LOAD instruction is followed by an instruction that follows naturally from the MURAC model where the proposed uses the value which is just transferred from DMEM to 4-stage pipeline is defined as the primary architecture while Register File. the hardware accelerator is defined as the auxiliary architec- In spite of the above advantage, since the memory-stage is tures. Switching between these two architectures in runtime now merged with the execution-stage and the memory address is achieved by the Branch-Auxiliary Architecture (BAA) and needs to be ready before the load/store instruction reaches Return-To-Primary-Architecture (RPA) machine instructions. DMEM, an extra 32-bit adder is placed at the end of the Logically, a MURAC machine contains only one address space decode-stage. This could incur extra resource consumption and that is shared between two architectures. additional pipeline

A Soft Processor Overlay with Tightly-Coupled FPGA Accelerator

Xilinx Synthesis and Verification Design Guide

AN 307: Altera Design Flow for Xilinx Users Supersedes Information Published in Previous Versions

Implementation, Verification and Validation of an Openrisc-1200

RTL Design and IP Generation Tutorial

Small Soft Core up Inventory ©2019 James Brakefield Opencore and Other Soft Core Processors Reverse-U16 A.T

Introduction to Verilog

Starting Active-HDL As the Default Simulator in Xilinx

Stratix II Vs. Virtex-4 Power Comparison & Estimation Accuracy

Eee4120f Hpes

White Paper FPGA Performance Benchmarking Methodology

Modeling & Simulating ASIC Designs with VHDL

Vivado Design Suite User Guide System-Level Design Entry