
Article

Customizable Vector Acceleration in Extreme-Edge Computing: A RISC-V Software/Hardware Architecture Study on VGG-16 Implementation

Stefano Sordillo, Abdallah Cheikh, Antonio Mastrandrea, Francesco Menichelli and Mauro Olivieri *

Department of Information Engineering, Electronics and Telecommunications (DIET), Sapienza University of Rome, 00185 Rome, Italy; [email protected] (S.S.); [email protected] (A.C.); [email protected] (A.M.); [email protected] (F.M.)
* Correspondence: [email protected]

Abstract: Computing in the cloud-edge continuum, as opposed to cloud-only computing, relies on high performance processing on the extreme edge of the Internet of Things (IoT) hierarchy. Hardware acceleration is a mandatory solution to achieve the performance requirements, yet it can be tightly tied to particular computation kernels, even within the same application. Vector-oriented hardware acceleration has gained renewed interest to support artificial intelligence (AI) applications like convolutional networks or classification algorithms. We present a comprehensive investigation of the performance and power efficiency achievable by configurable vector acceleration subsystems, obtaining evidence of both the high potential of the proposed approach and the advantage of hardware customization in total transparency to the software program.

Keywords: edge-computing; processors; hardware acceleration

1. Introduction

The cloud-edge continuum computing paradigm relies on the possibility of local processing in the edge of the IoT whenever it is convenient for reasons of energy efficiency, reliability, or data security. As a consequence, there is a gradual shift of artificial intelligence (AI) execution from the cloud down to low power embedded IoT devices on the edge, to be used in real-time for example to take voice commands or extract image features, for biometric, security, or filtering purposes [1].

The resultant demand for very high processing speed on extreme edge computing devices turns into unprecedented design challenges, especially because of the usually limited energy budget. Therefore, the implementation of hardware acceleration on edge devices in the IoT hierarchy has become a major trend to reach the speed and energy efficiency requirements.

Vector computing acceleration was a major stream in high performance computing systems for decades and is gaining renewed interest in recent developments in the supercomputing sector [2]. Yet, it is easy to note that the vector computing paradigm can also be applied to AI computing kernels that are run in embedded IoT devices on the edge. Nonetheless, the limited hardware budget usually available in edge devices makes it interesting to explore the possibility of configurable acceleration sub-systems to optimally exploit the available hardware resources according to the specific computation kernels being run during the application execution.

We implemented such exploration addressing the execution of the VGG-16 deep convolutional neural network inference, widely known for its image recognition performance as well as for its high computing power and storage demand. The VGG-16 execution is composed of consecutive layers having different computational characteristics. Therefore, it well represents a stress-test of the hardware micro-architecture with a time-variant


workload profile. Our target micro-architecture is an open-source RISC-V [3] core supporting multi-threaded execution and featuring a highly customizable vector acceleration subsystem [4]. The contributions of this work, to the reader interested in advanced design for IoT extreme-edge computing, are manifold:
• We report the quantitative evidence of the trade-offs in vector co-processor architecture and configuration targeting simple edge-computing soft-cores;
• We present details on the small custom RISC-V compliant instruction extension sufficient to support typical vector operations in a tiny soft-core;
• We present a complete yet very simple library of intrinsic functions to support application development, and we discuss the full detail of source code exploiting the co-processor instructions in each VGG-16 layer execution;
• We give insights into the open-source Klessydra processor core microarchitecture.

The rest of this article is organized as follows: Section 2 covers the related works on hardware acceleration for embedded computing on the IoT edge, including configurable solutions; Section 3 introduces the Klessydra T1 processor soft-core featuring a configurable hardware acceleration subsystem; Section 4 describes the fundamental features of the VGG-16 application case and its implementation on Klessydra T1; Section 5 reports and discusses the results obtained for the different sub-parts of the chosen application cases; and Section 6 summarizes the outcomes of the work.

2. Related Works

Several previous works reported the design of hardware accelerated cores implemented in edge-computing silicon chips. In [5], a RISC-V processor with DSP hardware support is presented, targeting near-threshold voltage operation. The Diet-SODA design implements a similar approach by running its DSP accelerator in the near-threshold regime [6]. In [7–9] application specific accelerators are reported, based on highly parallel operation and minimized off-chip data movements for energy efficiency. All of the above works focus on silicon implementation of units tailored to specific computations. As opposed to this view, the proposed hardware architecture study is independent of technology assumptions, such as the supply voltage, and addresses any physical implementation, particularly soft-cores on commercial FPGA devices, in the view of exploiting application-driven configurability.

Regarding FPGA-based implementations, in [10] the authors present a cluster of RISC-V cores connected to a tightly-coupled scratchpad memory and a special purpose engine dedicated to convolutions only. Thanks to the FPGA implementation, the convolution engine can be configured at synthesis time to optimize the execution of each convolutional layer, yet it exhibits a severe performance degradation when executing layers it was not built to optimize. A recently published work [11] presents a configurable SIMD CNN accelerator connected to a 32-bit RV32IM RISC-V processor. Compared to the pure SIMD Klessydra configuration, which uses 11678 LUTs and takes 824 clock cycles for a 4 × 4 matrix convolution, the work in [11] reports 12872 LUTs and 2070 clock cycles.

In [12] the authors present a coprocessor soft-core for the edge of the IoT, designed to be energy efficient in executing CNNs as well as other machine learning algorithms. In particular, they explore the potential impact on the energy efficiency due to the increased memory bandwidth. In our study, memory traffic as well as the memory static power consumption are taken into account in the energy estimations. The works in [13,14] present a pipelined CNN coprocessor capable of accelerating convolutions based on the extremely high parallelism in the coprocessor, yet limited to convolutional computation kernels. In [15] the authors present different coprocessor configurations integrated with a parallel cluster of RISC-V cores and evaluate which of the configurations is the fastest and most energy efficient. They introduce special co-processing cores dedicated to the standard instruction subset RV32M, without exploring more sophisticated co-processor operations.

In [16] the authors provide a DCNN accelerator for IoT. The accelerator itself is designed to work with 3 × 3 kernels and is not configurable; in order to support larger kernels they use a technique called kernel decomposition, which in fact increases the waste in computational resources and decreases the energy efficiency, similarly to the convolution engine in [10]. The coprocessor architecture proposed in this work is general purpose in nature, being based on vector operations, and can be tailored to support a given computation kernel in the most efficient way. Our work builds on the preliminary developments reported in [17,18] and complements the analysis presented in [4].

The standard “V” vector extension of RISC-V, supported for example by SiFive products [19] and by the EPAC accelerator within the European Processor Initiative project [2], is a large and complex instruction set extension, meant to cover applications ranging from embedded systems to HPC, which goes far beyond the scope of the lightweight Klessydra soft-core vector extension. Additionally, the standard “V” extension adopts a vector processing scheme based on a Vector Register File, while we explicitly chose to use generic Scratchpad Memories (SPMs) as local storage for more flexibility, at the price of losing compliance with any standard ISA extension. Rather than identifying vectors with a vector register number chosen among 32 vector registers, we use pointers within the SPM address space to address vectors or portions of vectors. Additionally, the number of SPMs available to the programmer in the microarchitecture is configurable.

The Ara processor [20], as well as the Xuantie-910 processor [21] and the dual core presented in [22], are all silicon ASIC implementations (thus not configurable as a soft-core is) of micro-architectures which are actually not compliant with the “V” standard extension, yet they are still based on fixed Vector Register Files. Additionally, the Xuantie-910 processor addresses high performance superscalar execution of general-purpose non-vectorizable code, which is out of the scope of the Klessydra architecture. The processor reported in [23] adopts an interesting approach based on directly converting ARM SVE vectorized code into a non-standard vector RISC-V extension, thus it is explicitly based on the same operation and storage scheme of ARM SVE. Klessydra diverges from this approach, favoring a broader exploration through configurability. The processor presented in [24] is a soft-core as Klessydra is, but it is again based on a Vector Register File rather than on a configurable SPM-based acceleration scheme.

3. The Klessydra T1 Customizable Architecture

3.1. Hardware Microarchitecture

Klessydra is a family of open-source, RISC-V compliant and PULPino [25] compatible cores, which includes basic processors (T0 sub-family), hardware accelerated processors (T1 sub-family), and fault-tolerant processors (F0 sub-family) [26]. A characteristic feature of all Klessydra cores is the hardware support for interleaved multi-threading on a single core [27]. The RTL code and manuals of the Klessydra family are available in the Supplementary Materials.

The hardware accelerated T1 cores are an extension of the basic T0 core, which is sketched in Figure 1. The T0 microarchitecture resembles a classic four-stage RISC pipeline, except for having multiple Program Counters to support multi-threading, and replicated register files and Control/Status Registers. Differently from a multi-core implementation, an interleaved multi-threading single core shares all the combinational logic constituting the instruction processing pipeline among the hardware threads ("harts" [3]), by interleaving threads in time, while maintaining separate PCs and registers to keep the state of each thread.

Figure 1. Klessydra T0 core microarchitecture.

In each clock cycle a different Program Counter is used for instruction fetching, on a rotation basis. As a result, instructions belonging to different threads of execution are interleaved in the core pipeline, so that it is never possible that any two instructions in the pipeline manifest any register, structural or branch dependency. By fetching an instruction from a new thread in each clock cycle, pipeline hazards are eliminated, while if the same thread ran for several clock cycles, its own data hazards or branching hazards would impose introducing dependency-check logic and pipeline stalling. The only dependency in the instruction pipeline can occur between two threads on explicit shared memory access, which is the responsibility of the programmer.

The supported number of interleaved threads is a parameter of the synthesizable RTL code of the core.

The T1 microarchitecture (Figure 2) is derived from the T0 by adding the Vector Coprocessing Unit (VCU), being internally comprised of Multi-Purpose Functional Units (MFU) and a Scratch-Pad Memory Interface (SPMI).

At the instruction level, the T1 architecture supports the parallel execution of instructions of different types, belonging to the same hart. In fact, the LSU works in parallel with the other units when executing memory store instructions, which cannot cause a write-back conflict on the register file. The MFU is allowed to read operands from the register file but can only write its results to local scratchpad memories (SPMs), thus keeping the SPMs and the Register File decoupled and allowing parallel execution between instructions writing to each of these memories simultaneously. Scalar instructions of a hart are processed by the "Execution" unit and operate on data in the Register File, while vector instructions are processed by the VCU and operate on data in the SPMs. Data transfers to/from the data memory from/to the SPMs are managed by the LSU via dedicated instructions.


Figure 2. Klessydra T1 core microarchitecture.

The MFUs execute vector arithmetic instructions, whose latency is proportional to the vector length. In an in-order interleaved-multi-threading pipeline, a hart requesting access to the busy MFUs may result in stalling the whole pipeline, stalling other harts that may not need to access the MFU. To circumvent this, in the T1 architecture, the waiting hart executes a self-referencing jump so that the PC for that hart does not advance until the MFU becomes free, avoiding unnecessary stalls of harts that are independent from the MFU being busy. Figure 3 demonstrates a cycle accurate diagram of the mechanism.

Figure 3. Hart interleaving and hart stall timing diagram.

When deploying Klessydra T1 in an IoT edge device, one can configure the number of parallel lanes D in the MFU, the number of MFUs F, the SPM capacity, the number of independently accessible SPMs N in each SPMI, the number of SPMIs M, as well as the way the MFUs and SPMIs are shared between the harts. Representative configurations are the following:
• Thread-Shared coprocessor: All harts in the core share a single MFU/SPM subsystem. Harts in this scheme are required to execute an infinite jump when trying to access the MFU while it is busy. In this approach, instruction level parallelism is limited to occur only between coprocessor instructions writing to the SPM and non-coprocessor instructions writing to the main memory or register file. To mitigate the delays on a hart executing an infinite jump, the coprocessor here may exploit pure data level parallelism (DLP) acceleration, by SIMD execution.


• Thread-Dedicated coprocessor: Each hart is appointed a full MFU/SPM subsystem, eliminating inter-hart coprocessor contention and allowing inter-coprocessor parallel execution. Stalls can only happen if the next instruction of the same hart that is using the MFU requests an MFU operation. DLP by SIMD execution can still be exploited in this approach, but also thread level parallelism (TLP) by fully symmetric MIMD execution, allowing execution of multiple vector instructions in parallel.
• Thread-Dedicated SPMIs with a Shared MFU: The harts here maintain a dedicated SPM address space, yet they share the functional units in the MFU. This scheme still allows inter-hart parallel execution of coprocessor instructions, provided they use different internal functional units (FU) of the MFU (e.g., adder, multiplier). Harts that request a busy internal unit in the MFU will be stalled, and their access will be serialized until the contended unit becomes free, while harts that request a free functional unit can work in parallel with the other active harts in the MFU. DLP by SIMD execution can still be exploited in this approach, but also TLP by heterogeneous MIMD execution.

Table 1 summarizes the design parameters and corresponding configurations, whose names will be used as references in reporting performance results.

Table 1. Summary of explored hardware configurations.

M (Number of SPMI Units) | F (Number of FUs) | D (Number of Lanes in Each FU) | Execution Paradigm
1 | 1 | 1 | SISD
1 | 1 | 2, 4, 8 | Pure SIMD
3 | 3 | 1 | Symmetric MIMD
3 | 3 | 2, 4, 8 | Symmetric MIMD + SIMD
3 | 1 | 1 | Heterogeneous MIMD
3 | 1 | 2, 4, 8 | Heterogeneous MIMD + SIMD

3.2. Programming Paradigm

By default, a Klessydra core runs the maximum number of hardware threads (which is a synthesis parameter) allowed by the microarchitecture. The function Klessydra_get_coreID() reads the id number of the thread executing the function from the MHARTID CSR register, which allows distinguishing threads and possibly having each thread execute a different piece of the program. Figure 4 shows a generic C program skeleton in which each of the three threads executes its own instruction flow. The functions sync_barrier_thread_registration() and sync_barrier() allow implementing a synchronization barrier based on inter-thread software interrupts, to synchronize thread execution at certain points of the program.

sync_barrier_thread_registration(); //Executed by all threads
if (Klessydra_get_coreID()==0){
  // thread_0 subroutine
}
if (Klessydra_get_coreID()==1){
  // thread_1 subroutine
}
if (Klessydra_get_coreID()==2){
  // thread_2 subroutine
}
sync_barrier(); //Executed by all threads

Figure 4. Code for multi-threaded execution on Klessydra-T1.

Figure 5 gives a representation of the memory map assumed by the Klessydra T1 operation.

Figure 5. Klessydra T1 memory map.


The SPM local storage is visible to the programmer as a specific address region in the memory map. The programmer can move data to/from any point of the SPM address space with no constraint except the total capacity of the SPMs, which in turn is a parameter of the microarchitecture design. Inter-thread data transfers may happen via shared global static variables allocated in the main data memory or, in the case of a shared coprocessor configuration, via the shared SPM address space. As in any multi-threading execution scheme, access to shared data must be accompanied by explicit thread synchronization, which is available in Klessydra by means of specific intrinsic functions implementing semaphore locks compliant with RISC-V atomic instructions, not in the scope of this work.

The custom instruction extension supported by the VCU and LSU is summarized in Table 2. The instructions supported by the coprocessor sub-system are exposed to the programmer in the form of very simple intrinsic functions, fully integrated in the RISC-V gcc compiler toolchain. The compiler does not have knowledge of the hardware configuration, so it only translates the intrinsic functions into the corresponding dedicated vector instructions, which are then executed by the hardware according to the chosen hardware configuration. The instructions implement vector operations working on the memory space mapped on the local SPMs. The vector length applied by MFU operations is encoded in a user accessible custom control/status register (CSR) named MVSIZE.

Table 2. RISC-V instruction set custom extension for Klessydra-T processors.

Assembly Syntax ((r) denotes memory addressing via register r) | Function Declaration | Short Description
kmemld (rd), (rs1), (rs2) | kmemld((void*) rd, (void*) rs1, (int) rs2) | load vector into scratchpad region
kmemstr (rd), (rs1), (rs2) | kmemstr((void*) rd, (void*) rs1, (int) rs2) | store vector into main memory
kaddv (rd), (rs1), (rs2) | kaddv((void*) rd, (void*) rs1, (void*) rs2) | add vectors in scratchpad region
ksubv (rd), (rs1), (rs2) | ksubv((void*) rd, (void*) rs1, (void*) rs2) | subtract vectors in scratchpad region
kvmul (rd), (rs1), (rs2) | kvmul((void*) rd, (void*) rs1, (void*) rs2) | multiply vectors in scratchpad region
kvred (rd), (rs1) | kvred((void*) rd, (void*) rs1) | reduce vector by addition
kdotp (rd), (rs1), (rs2) | kdotp((void*) rd, (void*) rs1, (void*) rs2) | vector dot product into register
ksvaddsc (rd), (rs1), (rs2) | ksvaddsc((void*) rd, (void*) rs1, (void*) rs2) | add vector + scalar into scratchpad
ksvaddrf (rd), (rs1), rs2 | ksvaddrf((void*) rd, (void*) rs1, (int) rs2) | add vector + scalar into register
ksvmulsc (rd), (rs1), (rs2) | ksvmulsc((void*) rd, (void*) rs1, (void*) rs2) | multiply vector × scalar into scratchpad
ksvmulrf (rd), (rs1), rs2 | ksvmulrf((void*) rd, (void*) rs1, (int) rs2) | multiply vector × scalar into register
kdotpps (rd), (rs1), (rs2) | kdotpps((void*) rd, (void*) rs1, (void*) rs2) | vector dot product and post scaling
ksrlv (rd), (rs1), rs2 | ksrlv((void*) rd, (void*) rs1, (int) rs2) | vector logic shift within scratchpad
ksrav (rd), (rs1), rs2 | ksrav((void*) rd, (void*) rs1, (int) rs2) | vector arithmetic shift within scratchpad
krelu (rd), (rs1) | krelu((void*) rd, (void*) rs1) | vector ReLU within scratchpad
kvslt (rd), (rs1), (rs2) | kvslt((void*) rd, (void*) rs1, (void*) rs2) | compare vectors and create mask vector
ksvslt (rd), (rs1), rs2 | ksvslt((void*) rd, (void*) rs1, (int) rs2) | compare vector-scalar and create mask
kvcp (rd), (rs1) | kvcp((void*) rd, (void*) rs1) | copy vector within scratchpad region
csr MVSIZE, rs1 | mvsize((int) rs1) | vector length setting
csr MVTYPE, rs1 | mvtype((int) rs1) | element width setting (8, 16, 32 bits)
csr MPSCLFAC, rs1 | mpsclfac((int) rs1) | post scaling factor (kdotpps instruction)
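As a minimal illustration of how these intrinsics combine (a hedged sketch: the SPM addresses, array names and vector length below are illustrative assumptions, while the function names follow Table 2; the code figures later in the paper use a CSR_MVSIZE() macro for the same vector-length setting), an element-wise addition of two vectors through the scratchpads could look as follows:

#define N 64
int src1[N], src2[N], dst[N];            // operands and result in main data memory
void *spmA = (void *)0x01000000;         // assumed addresses inside the SPM region
void *spmB = (void *)0x01001000;
void *spmC = (void *)0x01002000;

mvsize(N * sizeof(int));                 // vector length for the MFU operation (MVSIZE CSR)
kmemld(spmA, src1, N * sizeof(int));     // move the operands into the scratchpads
kmemld(spmB, src2, N * sizeof(int));
kaddv(spmC, spmA, spmB);                 // element-wise add executed by the coprocessor
kmemstr(dst, spmC, N * sizeof(int));     // store the result back to main memory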

4. VGG-16 Implementation on Klessydra T1

4.1. Implementation Workflow

VGG-16 is a deep Convolutional Neural Network (CNN) used in computer vision for classification and detection tasks, consisting of 13 convolutional layers, 5 maxpooling layers, 2 fully-connected layers and one output/softmax layer. The original VGG-16 can label a 224 × 224 pixel RGB image to one class out of 1000, using approximately 554 MB space for 32-bit floating-point weights and bias values.

In the view of a realistic IoT edge embedded scenario, we implemented a VGG-16 derivation based on the widely known CIFAR-10 dataset, targeting 10 classes and 32 × 32 pixel RGB images and requiring 135 MB for weights and bias values. Table 3 reports the breakdown of the inference algorithm layers constituting the Cifar-10 VGG-16. The layers 19 to 21 do not compute operations on matrices, rather they implement dot-product operations between vectors of different sizes; similarly, layer 22 implements a Softmax function on a vector of length 10.

Figure 6 illustrates the workflow to implement a Cifar-10 VGG-16 application on the Klessydra processor platform. Notably, since the target hardware platform supports fixed-point arithmetic, we based the implementation on fixed-point weights and data. We set the integer part to 11 bits and the fractional part to 21 bits, which leads to an accuracy drop from 98.04% to 84.01% on the output results of the inference. We remark that re-training the network, as well as further algorithmic optimizations, such as quantization and compression techniques, are not in the scope of the present work. The verification phase of the network in fixed point arithmetic was done via the MATLAB (The MathWorks, Natick, MA, USA) Deep Learning Toolbox.

In order to be able to exploit the C language intrinsic functions of the Klessydra platform, the original MATLAB code for VGG-16 was ported to C code. This generic C code implementation was used as the basis for the subsequent vectorization to exploit the hardware co-processor, and it was also used to run the same algorithm on the reference platforms used for performance comparison. We verified that no additional loss of quality is introduced by the proposed hardware architecture, which produces an identical output to the C fixed-point version executed on a general purpose computer.

Table 3. Cifar-10 VGG-16 inference layers.

Layer Number | Computation Type | Matrix Size
1 | Convolution | 32 × 32
2 | Convolution | 32 × 32
3 | Max Pool | 16 × 16
4 | Convolution | 16 × 16
5 | Convolution | 16 × 16
6 | Max Pool | 8 × 8
7 | Convolution | 8 × 8
8 | Convolution | 8 × 8
9 | Convolution | 8 × 8
10 | Max Pool | 4 × 4
11 | Convolution | 4 × 4
12 | Convolution | 4 × 4
13 | Convolution | 4 × 4
14 | Max Pool | 2 × 2
15 | Convolution | 2 × 2
16 | Convolution | 2 × 2
17 | Convolution | 2 × 2
18 | Max Pool | 1 × 1
19 | Fully connected | 512 × 512
20 | Fully connected | 4096 × 4096
21 | Fully connected | 4096 × 4096
22 | Softmax | 10

Figure 6. Workflow for the VGG-16 implementation.
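Before moving to the code porting, a small illustration of the Q11.21 fixed-point format mentioned above may be useful (a hedged sketch: the helper names and the rounding choice are assumptions, not part of the original implementation); a floating-point weight is mapped to a 32-bit word with 11 integer bits and 21 fractional bits by scaling with 2^21:

#include <stdint.h>
#include <math.h>

/* Q11.21: 11 integer bits (sign assumed to be counted in the integer part)
   and 21 fractional bits stored in an int32_t. */
static inline int32_t to_q11_21(float w)
{
    return (int32_t)lrintf(w * (float)(1 << 21));   /* scale by 2^21 and round */
}

static inline float from_q11_21(int32_t x)
{
    return (float)x / (float)(1 << 21);             /* back-conversion, e.g., for verification */
}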

4.2.Table Generic 3. Cifar-10 Fixed-Point VGG-16 Cinference Code Porting layers. The generic C code used for convolutional layers is reported in Figure7. Image Layer Number Computation Type Matrix Size convolutions are implemented using the zero-padding technique: the feature map (FM) matrix is converted1 into a new matrix having Convolution two additional rows and columns 32 × 32 of zeros on its borders, to avoid2 having filter elements Convolution without corresponding pixel 32 values × 32 when the centroid element3 of the 3 × 3 kernel slides Max along Pool the borders. As a general 16 × feature16 of the implementation,4 multiplications always Convolution need a pre-scaling and post-scaling 16 × 16 operation in order to re-align5 the fixed-point representation Convolution of the result. The convolution2D() 16 × 16 function performs the pre-scaling6 when creating Max the zero-paddedPool matrix and also 8 × pre-scales8 the kernel values. The7 convolution is carried Convolution out by nested for loops, by which 8 the× 8 Kernel map (KM) matrix slides8 across the input image Convolution with a stride of one element. The 8 partial× 8 result of each multiplication is pre-scaled and added to the corresponding output pixel, completing the multiply and accumulate step. After the convolution is complete, a bias value is added to the output feature map, and the ReLU non-linear activation function is executed across

all the matrix elements to conclude the convolutional layer. Electronics 2021, 10, x FOR PEER REVIEW 10 of 22


a) for (int i = 0; i < layer_outputs; i++){ //scan for every output matrix
     output_pt = &output_fm[i][0][0];
     for (int k = 0; k < layer_inputs ; k++){ //scan for every input matrix
       for (int w=0; w<9; w++)
         kern.kernel_9[w]=layer_filters[(output_pt*9)+w];
       convolution2D(MATRIX_SIZE, input_fm[k], kern.kernel_9, output_pt);
     } //convolutions are completed
     bias = layer_bias[i];
     addBias(MATRIX_SIZE, output_pt, bias);
     relu(MATRIX_SIZE, output_pt);
   } //the output matrix is complete

b) for (i = 1; i < (size+2)-1; i++){
     for (j = 1; j < (size+2)-1; j++){
       output_pixel[(i-1)*size+(j-1)]
         += ( FM_zeropad[i-1][j-1] * kernel[0] ) >> post_scale ;
     } //end of loop for first kernel element
     //…
     for (j = 1; j < (size+2)-1; j++){
       output_pixel[(i-1)*size+(j-1)]
         += ( FM_zeropad[i+1][j+1] * kernel[8] ) >> post_scale ;
     } //end of loop for last kernel element
   } //end of loop "i"

c) //Adding the Bias value
   for (int i = 0; i < size; i++)
     for (int j = 0; j < size; j++)
       matrix[j + size*i] += bias;
   //ReLU function
   for (int i = 0; i < size; i++)
     for (int j = 0; j < size; j++)
       if (input[j + size*i] < 0) {
         input[j + size*i] = 0;
       }
       else continue;

Figure 7. (a) Convolutional layer in generic C code; (b) Convolution2D function inner operations; (c) Bias addition and ReLU function inner operations (Layers: 1, 2, 4, 5, 7, 8, 9, 11, 12, 13, 15, 16, 17).
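As a concrete illustration of the pre-/post-scaling arithmetic described above (a hedged sketch: the helper name and the exact split of the shift amounts are assumptions, not taken from the original code), one fixed-point multiply-accumulate step can be written as:

#include <stdint.h>

/* Both Q11.21 operands are pre-shifted before the 32-bit multiplication to limit
   overflow, and the product is post-shifted so that the result is aligned to 21
   fractional bits again; the split must satisfy 2*pre_scale + post_scale = 21. */
static inline int32_t fxp_mac(int32_t acc, int32_t a, int32_t b,
                              int pre_scale, int post_scale)
{
    int32_t ta = a >> pre_scale;
    int32_t tb = b >> pre_scale;
    return acc + ((ta * tb) >> post_scale);
}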

Figure 8 reports the C code adopted for Maxpool layers. The Maxpool layer halves the width and height of the FMs, sliding across them a 2 × 2 window with a stride equal to two, filtering out all the values except for the highest of the batch. In this way the most important features detected from the image are passed to the successive layers.

a) for (k = 0; k < layer_outputs; k++) {
     input_pt = &input_fm[k][0][0];
     output_pt = &output_fm[k][0][0];
     maxpool(input_size, input_pt,
             output_size, output_pt);
   }

b) for (int m = 0; m < size_i; m+=2){
     for (int n = 0; n < size_i; n+=2){
       max = FM[n + size_i*m];
       for (i = m; i < m+2; i++){
         for (j = n; j < n+2; j++){
           if (FM[j + size_i*i] > max)
             max = FM[j + size_i*i];
         }
       }
       output[index++] = max;
     }
   }

Figure 8. (a) Maxpool layer in generic C code; (b) Maxpool function inner operations (Layers: 3, 6, 10, 14, 18).

The last three layers of the network are Fully Connected, corresponding to the code in Figure 9. The fully-connected layer is implemented by a dot-product, doing the pre-scaling of the inputs and post-scaling of the results from every multiplication, needed for fixed point alignment. This is accomplished by the fullyconnect() function after putting the weights into local buffers and adding a bias to the output value. The results are passed through a Softmax layer, in which the network produces the classification of the image with a given probability.

a) pt_layer_filters = &fully_connect_filter_array[0];
   pt_layer_bias = &fully_connect_fibias_array[0];
   input_pt = &input_vector[0];
   for (i = 0; i < layer_outputs_elements; i++) {
     getWeights(pt_layer_filters, number_of_elements, buffer);
     output_vector[i] = fullyconnect(number_of_elements, input_pt, buffer);
     bias = getBias(pt_layer_bias);
     output_vector[i]+= bias;
   } //the output vector is complete

b) for(int i=0; i<vector_lenght; i++){
     tmp1=vect_1[i]>>pre_scale;
     tmp2=vect_2[i]>>pre_scale;
     output += (tmp1*tmp2)>>post_scale;
   }
   return output;

c) for (int i = 0; i < vector_lenght; i++){
     sum = sum + exp(output[i]);
   }
   for (int i = 0; i < vector_lenght; i++){
     output[i] = exp(output[i])/sum ;
   }

Figure 9. (a) Fully-connected layer in generic C code; (b) Fully-connect inner operations; (c) Softmax layer inner operations (Fully-connected Layer: 19, 20, 21; Softmax Layer: 22).


4.3. Vectorized C Code Implementation

Program code vectorization targeting the Klessydra intrinsic function library is based on two types of intervention: data movement to efficiently exploit the scratchpad memory sub-system, and vector arithmetic operations exploiting the accelerator functional units. A loop of kmemld() functions transfers the FM and KM operands into two SPMs, that we refer to as spmA and spmB, from the main memory. To implement zero-padding, when loading the feature maps into spmA, we first reset the SPM content to zero and then proceed with loading bursts of data from the FM rows, with exact offsets that grant the correctness of zero-padding. Figure 10a displays the code executed to set up the FM in spmA. The offsets added to the pointers passed to the kmemld() function allow for the implementation of zero-padding. The ksrav() function implements fixed-point pre-scaling by performing an arithmetic right shift operation on a vector. It requires a pointer to the source vector, a pointer to store the resulting vector, and a shift amount. Figure 10b similarly shows the loading and pre-scaling of the 9-element KM into spmB and also the calling sequence of the convolution2D() function.

The convolution2D() function requires the addresses of the FM and KM first elements in spmA and spmB, an address pointing to a region in spmD for temporary value storage, and the address to store the output matrix in spmC. Figure 11 reports the internal operations, which are built upon knowing which vectors are to be multiplied as the kernel map slides across all the input map pixels. Taking into account which elements will be multiplied when the kernel completely slides across a row of the FM, and the fact that this is replicated for every row, we can multiply the FM row values with the corresponding scalar from the KM, and update the output matrix (OM) row with a vector of partial results. This process is straightforward and allows to fully exploit the vector coprocessor capabilities by using matrix rows as vector operands.


(Code of Figure 10, panels (a) and (b): loops over the output maps (i = th_output_first_OM … th_output_last_OM) and the input maps (k = 0 … input_per_layer), with CSR_MVSIZE(Row_lenght*SIZE_OF_INT) and per-row kmemld() transfers for the FM in spmA, and CSR_MVSIZE(9*SIZE_OF_INT) with a kmemld() into (int*)spmB + spm_off_B for the KM, followed by the call to convolution2D().)

Figure 10. (a) Zero-padded, pre-scaled FM setup in SPM; (b) Pre-scaled KM collection in SPM and calling sequence of convolution2D() (Layers: 1, 2, 4, 5, 7, 8, 9, 11, 12, 13, 15, 16, 17).
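A minimal sketch of the FM setup of Figure 10a, following the description above (loop bounds, offsets and identifiers are assumptions, not the original listing): the SPM region is first cleared, each FM row is then loaded with kmemld() at an offset that leaves the zero border intact, and finally the padded matrix is pre-scaled with ksrav().

// Hedged sketch: Row_length = FM_size + 2 is the padded row length, Zeropad_off = 1 for the 3x3 kernel.
kbcast((void *)spmA, (void *)zero);                    // clear spmA: establishes the zero border
for (int row = 0; row < FM_size; row++) {              // copy each FM row inside the border
    kmemld((void *)((int *)spmA + (row + Zeropad_off) * Row_length + Zeropad_off),
           (void *)(pt_to_FM + row * FM_size),         // source FM row in main memory
           FM_size * SIZE_OF_INT);
}
CSR_MVSIZE(Row_length * Row_length * SIZE_OF_INT);     // vector length for the MFU operation
ksrav((void *)spmA, (void *)spmA, pre_scale);          // fixed-point pre-scaling of the padded FM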


CSR_MVSIZE(Row_size);
for(i=Zeropad_off; i < Row_size-Zeropad_off; i++){
  k_element=0;
  for(FM_row_pointer=-Zeropad_off; FM_row_pointer <= Zeropad_off; FM_row_pointer++){
    for (column_offset=0; column_offset < kernel_size; column_offset++){
      FM_offset= (i+FM_row_pointer)*Row_size+column_offset;
      ksvmulsc(SPM_D, (SPM_A+FM_offset), (SPM_B + k_element++));
      ksrav(SPM_D, SPM_D, scaling_factor);
      OM_offset = (Row_size*i)+Zeropad_off;
      kaddv((SPM_C+OM_offset),(SPM_C+OM_offset),SPM_D);
    }
  }
}

Figure 11. Convolution2D inner loops operations (Layers: 1, 2, 4, 5, 7, 8, 9, 11, 12, 13, 15, 16, 17).

Referring to Figure 11, after setting the vector length, the loop with index "i" scans the rows of the output matrix (OM); the FM_row_pointer loop and the column_offset loop iterate three times each to cover the vector-scalar products required for the 3 × 3 kernel matrix. The FM_offset variable points to the proper FM row in spmA, from which the source vector is fetched. The ksvmulsc() function performs the scalar-vector multiplication between an FM row vector and a KM scalar, and the result is post-scaled by the ksrav() function for fixed-point alignment. The kaddv() function performs the vector addition, updating the OM row in spmC.

After the convolutions are done, the OM is updated with the addition of the bias value (Figure 12a). A kmemld() is required to have the single scalar value in the scratchpad memory, then the whole matrix is updated by ksvaddsc_v2(), which performs the vector plus scalar operation and includes a fourth parameter to adjust the vector length prior to doing the calculation.

a) //Preload the spm_B with the bias value
   kmemld(
     (void*)((int*)spmB + mem_off_B ),
     (void*)(pt_to_bs+offset),
     (SIZE_OF_INT)
   );
   for (int row_pt=0; row_pt

b) krelu(
     (void*)( (int*)spmC + mem_off_C ),
     (void*)( (int*)spmC + mem_off_C )
   );
   //perform the ReLU on the output matrix

Figure 12. (a) Adding the bias to the Output Matrix; (b) Applying the ReLu function to the Output Matrix (Layers: 1, 2, 4, 5, 7, 8, 9, 11, 12, 13, 15, 16, 17).
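A hedged sketch of the row-wise bias addition of Figure 12a (the loop bound and identifiers are assumptions; ksvaddsc_v2() and its fourth vector-length argument are as described above):

// Load the single bias value into spmB, then add it to every row of the OM in spmC.
kmemld((void *)((int *)spmB + mem_off_B), (void *)(pt_to_bs + offset), SIZE_OF_INT);
for (int row_pt = 0; row_pt < Row_size; row_pt++) {
    ksvaddsc_v2((void *)((int *)spmC + mem_off_C + row_pt * Row_size),   // destination OM row
                (void *)((int *)spmC + mem_off_C + row_pt * Row_size),   // source OM row
                (void *)((int *)spmB + mem_off_B),                        // scalar bias in spmB
                Row_size * SIZE_OF_INT);                                   // vector length
}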

As the last part of the convolutional layers, the ReLU non-linear function is applied to all the OM pixels, which is stored back in main memory. The SPM region is cleared for


a)
kmemld( (void*)spmA, (void*)((int*)pt_to_vector), vector_lenght*SIZE_OF_INT );
CSR_MVSIZE( vector_lenght*SIZE_OF_INT );
ksrav( (void*)spmA, (void*)spmA, scaling_factor );
for (int loop_index = 0; loop_index < divisor_0; loop_index++) {
    kmemld( (void*)((int*)spmB + mem_off_B),
            (void*)((int*)pt_to_wh + (loop_index*vector_lenght)),
            (vector_lenght*SIZE_OF_INT) );
    CSR_MVSIZE( vector_lenght*SIZE_OF_INT );
    ksrav( (void*)((int*)spmB + mem_off_B), (void*)((int*)spmB + mem_off_B), scaling_factor );
    kdotpps( (void*)((int*)spmC + loop_index),
             (void*)((int*)spmA),
             (void*)((int*)spmB + mem_off_B) );
}

b)
kmemld( (void*)spmD, (void*)(pt_to_bs), (vector_lenght*SIZE_OF_INT) );
kaddv( (void*)spmC, (void*)spmC, (void*)spmD );
punt_out = &layer_output[0];
krelu( (void*)spmC, (void*)spmC );
kmemstr( (void*)punt_out, (void*)spmC, (vector_lenght*SIZE_OF_INT) );
kbcast( (void*)spmC, (void*)zero );

Figure 13. Fully-connected layer operations. (a) Dot-product kernel; (b) Bias addition and ReLU (Fully-connected Layers: 19, 20, 21).
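As noted above, the maxpooling layer is left to conventional scalar code. A minimal sketch of such a generic C implementation is given below, assuming VGG-16's usual 2 × 2, stride-2 pooling window (the text does not spell out the pooling parameters) and 32-bit fixed-point pixels; it is illustrative, not the authors' exact code.

#include <stdint.h>

/* Minimal 2x2, stride-2 max-pooling pass over one feature map stored row-major. */
static void maxpool2x2(const int32_t *in, int32_t *out, int in_rows, int in_cols)
{
    int out_rows = in_rows / 2, out_cols = in_cols / 2;
    for (int r = 0; r < out_rows; r++) {
        for (int c = 0; c < out_cols; c++) {
            int32_t m = in[(2 * r) * in_cols + (2 * c)];       /* start from the top-left pixel */
            for (int dr = 0; dr < 2; dr++)
                for (int dc = 0; dc < 2; dc++) {
                    int32_t v = in[(2 * r + dr) * in_cols + (2 * c + dc)];
                    if (v > m) m = v;                          /* keep the window maximum */
                }
            out[r * out_cols + c] = m;
        }
    }
}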

The softmax layer is executed in main memory through conventional scalar instructions, with the same implementation of the generic C code.
The exact execution of the vectorized VGG-16 inference program running on Klessydra T1 cores was verified by comparing the full output produced by RTL simulation against the general purpose VGG-16 fixed-point C code running on an x86 server.

5. Performance and Power Analysis

5.1. Setup

All Klessydra cores are compatible with the PULPino processor platform [25]. Yet, the original PULPino memory subsystem cannot support the execution of the full VGG-16 inference algorithm, which requires 255 MB storage for the constant data consisting of the neural network weights, and at least 1 MB memory space for global and local variables. Thus, we extended the PULPino memory sub-system to include 256 MB of addressable physical data memory, partitioned into a 1 cycle latency 1 MB RAM to be mapped on the FPGA BRAM, and a 6 cycle latency 255 MB space mapped on an external flash memory device, connected via SPI interface. The 1 MB RAM is the physical mapping of the portion of the data memory address space that is dedicated to dynamically allocated data. The program memory is 32 KB mapped in the FPGA BRAM.
The modified PULPino platform featuring Klessydra T1 processor cores was synthesized on a Kintex7 FPGA from Xilinx integrated on the Genesys2 board from Digilent [28], using the Vivado tool flow. The design entry was the RTL VHDL/SystemVerilog description of the platforms under analysis. The C code of the VGG-16 application was compiled by the RISC-V gcc tool chain to produce the binary code executable by the target processors. The execution of the application on the target processors was simulated both at RTL and at post-synthesis gate level, to verify the correct functionality and to extract the signal activity for power estimation in Vivado. Table 4 reports the hardware resource utilization and the maximum clock frequency producing zero or positive slack, for all the processor configurations under analysis.

Table 4. Area and frequency summary of the Klessydra-T cores equipped with 1 MB Data Mem.

Configuration                                  FF       LUT      DSP   B-RAM   LUT-RAM   Top Freq. [MHz]
SISD (M = 1, F = 1, D = 1)                     2482     7083     11    88      264       132.1
Pure SIMD (M = 1, F = 1, D = 2)                2664     9010     15    88      264       127.0
Pure SIMD (M = 1, F = 1, D = 4)                3510     11,678   23    88      264       125.5
Pure SIMD (M = 1, F = 1, D = 8)                4904     18,531   39    88      264       112.6
Symmetric MIMD (M = 3, F = 3, D = 1)           3509     10,701   19    120     264       114.2
Symmetric MIMD+SIMD (M = 3, F = 3, D = 2)      4659     16,556   31    120     264       113.9
Symmetric MIMD+SIMD (M = 3, F = 3, D = 4)      6746     27,485   55    120     264       108.9
Symmetric MIMD+SIMD (M = 3, F = 3, D = 8)      11,253   52,930   103   120     264       96.3
Heterogenous MIMD (M = 3, F = 1, D = 1)        3025     10,655   11    120     264       119.9
Heterogenous MIMD+SIMD (M = 3, F = 1, D = 2)   3741     17,161   15    120     264       115.7
Heterogenous MIMD+SIMD (M = 3, F = 1, D = 4)   4767     25,535   23    120     264       110.4
Heterogenous MIMD+SIMD (M = 3, F = 1, D = 8)   7303     48,066   39    120     264       91.5
T0 (No accl.)                                  1409     4079     7     72      176       194.6
RI5CY                                          1307     6351     6     72      0         65.1
Zeroriscy                                      1605     2834     1     72      0         77.2

The VGG-16 inference fixed-point code was also implemented on the following alternative computing systems, to accomplish a comprehensive comparative analysis:
• An FPGA board featuring a soft-processor comprised of the extended PULPino platform equipped with the DSP-accelerated RI5CY core, reaching 65 MHz clock frequency;
• An FPGA board featuring a soft-processor comprised of the extended PULPino platform equipped with a Zeroriscy core [29], reaching 77 MHz clock frequency;
• An STM32 single board computer featuring an 84 MHz ARM Cortex M4 core with DSP extension and 96 KB data memory;
• A Raspberry-PI 3b+ single board computer featuring a 1.4 GHz ARM Cortex A53 quad-core CPU, 16 KB L1 and 512 KB L2 cache, 1 GB LPDDR2 main memory;
• An x86 single board computer featuring a 3 GHz hexa-core, 12-thread i7 CPU, 384 KB L1 cache, 1.5 MB L2 cache, 9 MB LLC, 8 GB DDR4 main memory.
The system architecture organization corresponding to the devices under comparison is sketched in Figure 14. The read-only storage space dedicated to the VGG-16 weights is hosted by an SPI-connected Flash memory expansion board in all the considered architectures, and the weights are preemptively loaded into the main RAM space for the inference algorithm execution.


Figure 14. System architecture organization of the compared boards.

5.2. Results

The first phase of performance analysis targeted the detailed account of the performance of each coprocessor hardware microarchitecture. Figure 15 shows the execution time obtained by the best performing of all the explored T1 coprocessor configurations and by the non-accelerated T0 core, for each VGG-16 layer. The results give evidence to the fact that the performance of the coprocessor hardware configurations varies with the algorithm layer it executes. The Symmetrical MIMD configurations with D ranging between 2 and 8 turn out to be the best performing for the convolutional layers, while the pure SIMD configuration with D = 4 turns out to be the optimal choice for the largest Fully Connected layers. Notably, the Maxpool and Softmax layers exhibit worse execution time in the accelerated cores than in the non-accelerated T0 core, because in the present software implementation they are executed as scalar computation, and so the data transfer to/from the SPMs constitutes an overhead with no corresponding vector computation acceleration. Nonetheless, the relative impact of those layers on the overall execution time is negligible, as shown by the logarithmic scale.


Figure 15. Absolute execution time [s] of the best performing accelerated configuration and of the non-accelerated T0 core, per layer.

Figure 16 presents the total VGG-16 inference execution time speed-up obtained by each coprocessor configuration over the non-accelerated T0 core. The diagram also includes the ideal speed-up obtained by assuming the optimal configuration is used for each layer. Figure 17 represents the hardware cost of the configurations that exhibit the highest speed-up, normalized to the non-accelerated T0 core hardware cost, for a direct comparison. The resulting hardware utilization efficiency is notable, as the maximum speed-up is over 50X, while the maximum hardware cost overhead is well below 15X.
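To make the ideal speed-up concrete, the following sketch shows how it can be computed from a per-layer, per-configuration execution-time matrix: for each layer the fastest configuration is selected, the selected times are summed, and the total is compared against the T0 total. The array names and the layer and configuration counts are placeholders rather than data taken from the measurements.

#define N_LAYERS  22   /* placeholder: number of VGG-16 layers profiled */
#define N_CONFIGS 12   /* placeholder: number of accelerated coprocessor configurations */

/* Ideal speed-up over the non-accelerated core, assuming the optimal
 * configuration can be selected independently for every layer. */
double ideal_speedup(const double t_cfg[N_CONFIGS][N_LAYERS], const double t_t0[N_LAYERS])
{
    double t_ideal = 0.0, t_ref = 0.0;
    for (int l = 0; l < N_LAYERS; l++) {
        double best = t_cfg[0][l];
        for (int c = 1; c < N_CONFIGS; c++)
            if (t_cfg[c][l] < best)
                best = t_cfg[c][l];          /* optimal configuration for this layer */
        t_ideal += best;
        t_ref   += t_t0[l];
    }
    return t_ref / t_ideal;
}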



Figure 16. Total execution time speed-up over non-accelerated core obtained by each coprocessor configuration, along with the speed-up obtained by using the optimal configuration for each layer.

[Figure 17 chart: hardware cost per FPGA element (FF, LUT, DSP, B-RAM, LUT-RAM), normalized to the T0 core, for the SISD, SIMD D = 8, Heterogeneous MIMD D = 8, and Symmetric MIMD D = 8 configurations.]

Figure 17. Hardware overhead normalized to the non-accelerated T0 core.

Figure 18 shows the total energy consumed by the most efficient of all the explored T1 coprocessor configurations and by the non-accelerated T0 core, for each VGG-16 layer. Again, the optimal coprocessor configuration for energy efficiency depends on the layer being executed. Optimal energy efficiency, unlike absolute performance, swings between Pure SIMD and Symmetrical MIMD configurations. Similarly to the execution time analysis, for pure scalar computation layers the energy consumption worsens in the vector-accelerated microarchitecture, due to the SPM data transfer overhead. Yet, the overall impact of those layers on the total energy is negligible, as shown by the logarithmic scale.


Figure 18. Total energy consumption [J] of the most energy efficient coprocessor configuration and of the non-accelerated T0 core, per layer.

Figure 19 quantifies the total energy saving obtained by each coprocessor configuration over the non-accelerated T0 core. The energy saving is expressed as the fraction of the energy consumed in the accelerated core over the energy consumed in the non-accelerated core, obtaining energy consumption between 6.4% and 4% of the non-accelerated core (energy saving between 93.6% and 96%). The diagram also includes the ideal energy reduction obtained by assuming the optimal configuration is used for each layer.
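In formula form, writing E_acc for the energy of an accelerated configuration and E_T0 for the energy of the non-accelerated core, the quantity plotted in Figure 19 is the reduction factor E_acc / E_T0, and the corresponding energy saving is 1 − E_acc / E_T0; a reduction factor of 0.04, for instance, corresponds to an energy saving of 96%.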



Figure 19. Energy reduction factor with respect to non-accelerated core (lower is better) obtained by each coprocessor configuration, along with the energy obtained by using the optimal configuration for each layer.


Figures 16 and 19 evidence the ideal performance limit achievable by dynamically changing the coprocessor microarchitecture at no overhead, via software controlled Dynamic Partial Reconfiguration (DPR) of the FPGA, so that the system always uses the optimal hardware scheme for speed or energy efficiency according to the computation kernel being executed. The storage, power and time overhead associated with DPR is not included in the analysis, and should be the subject of specific experiments.
The second phase of performance analysis aimed at comparing the efficiency of the proposed soft-processor architecture with the alternative hardware architecture solutions for the execution of the same application. In this analysis, the proposed solution consisted of the extended PULPino platform equipped with the Klessydra T1 core plus the optimal vector coprocessor for each layer being executed.
Table 5 summarizes the performance comparison results, expressed as total execution time, total energy consumption, and average energy consumed per algorithmic operation. Algorithmic operations are the data multiplications and additions that are inherent to the algorithm being computed, and do not depend on the actual software implementation. The absolute execution time obviously favors high-end computing devices, yet the results give evidence of the effectiveness of the Klessydra T1 customizable vector coprocessor sub-system with respect to other single-core PULPino soft-processor FPGA implementations. Additionally, the energy efficiency results show the potential advantage of a Klessydra T1 vector-accelerated soft-processor FPGA implementation with respect to general purpose single-board computers.

Table 5. Performance comparison with alternative solutions.

Processor                        Time [s]   Energy [J]   Energy per op [pJ/op]
Core i7 PC board                 0.08       2.90         21
Cortex A53 Raspberry Pi 3        0.89       2.32         17
Cortex M4 STM32                  117.78     7.77         55
RI5CY PULPino on FPGA            444.30     40.06        285
Zeroriscy PULPino on FPGA        548.04     38.90        277
Klessydra-T1 PULPino on FPGA     7.91       1.74         12
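For reference, the energy-per-operation column in Table 5 is simply the total inference energy divided by the algorithmic operation count. A minimal sketch of that normalization is shown below, with the operation count left as an input, since the exact count used for the table is not reported here.

/* Energy per algorithmic operation: total inference energy divided by the number
 * of multiplications and additions inherent to VGG-16 inference, in pJ/op. */
double energy_per_op_pJ(double total_energy_J, double n_ops)
{
    return (total_energy_J / n_ops) * 1e12;   /* convert J/op to pJ/op */
}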

6. Conclusions

The validation of the VGG-16 inference output data produced by Klessydra processors against conventional processors demonstrated how the Klessydra open-source infrastructure can be used for implementing configurable RISC-V soft-cores equipped with hardware acceleration for vector computing on FPGA. The detailed porting of the target application routines has been documented in this work. Performance results show the effectiveness of the Klessydra microarchitecture scheme, built upon interleaved multi-threading and vector coprocessor hardware acceleration, with respect to other FPGA-based single-core solutions. Looking at energy efficiency, the Klessydra FPGA soft-core solution shows superior performance with respect to commercial single-board computers that may be used as IoT extreme-edge devices.
The results of the performance analysis conducted on the Klessydra T1 vector coprocessor schemes demonstrate the dependency of the optimal hardware configuration on the algorithm layer being executed. This evidence opens the way to the development of software configurable accelerators and, further, to the implementation of self-adapting coprocessors in IoT extreme-edge nodes.

Supplementary Materials: The Klessydra processor core family and accelerators are openly available online at https://www.github.com/klessydra.

Author Contributions: Conceptualization, M.O.; methodology, M.O. and A.C.; hardware design and synthesis, A.C.; software, S.S.; validation, A.C. and S.S.; formal analysis, S.S.; investigation, A.C.; data curation, A.C. and S.S.; tool maintenance, A.M.; writing—original draft preparation, M.O.; writing—review and editing, A.C., A.M., and F.M.; visualization, A.C.; supervision, M.O.; project administration, M.O.; funding acquisition, M.O. All authors have read and agreed to the published version of the manuscript.

Funding: This research was supported by a Sapienza University internal research grant.

Data Availability Statement: Data is contained within the article.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Samie, F.; Bauer, L.; Henkel, J. From Cloud Down to Things: An Overview of Machine Learning in Internet of Things. IEEE Internet Things J. 2019, 4662, 1. [CrossRef]
2. European Processor Initiative (EPI). EU H2020 Research and Innovation Programme GA No 826647. Available online: https://www.european-processor-initiative.eu/project/epi/ (accessed on 26 January 2021).
3. RISC-V. Instruction Set Specifications. Available online: https://riscv.org/specifications/ (accessed on 26 January 2021).
4. Cheikh, A.; Sordillo, S.; Mastrandrea, A.; Menichelli, F.; Scotti, G.; Olivieri, M. Klessydra-T: Designing Vector Coprocessors for Multi-Threaded Edge-Computing Cores. IEEE Micro 2021, 1. [CrossRef]
5. Gautschi, M.; Schiavone, P.; Traber, A.; Loi, I.; Pullini, A.; Rossi, D.; Flamand, E.; Gürkaynak, F.; Benini, L. Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 2700–2713. [CrossRef]
6. Seo, S.; Dreslinski, R.G.; Woh, M.; Chakrabarti, C.; Mahlke, S.; Mudge, T. Diet SODA: A power-efficient processor for digital cameras. In Proceedings of the 16th ACM/IEEE International Symposium on Low Power Electronics and Design, Austin, TX, USA, 18–20 August 2010; pp. 79–84.
7. Chen, Y.H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 2016, 52, 127–138. [CrossRef]
8. Moini, S.; Alizadeh, B.; Emad, M.; Ebrahimpour, R. A resource-limited hardware accelerator for convolutional neural networks in embedded vision applications. IEEE Trans. Circuits Syst. II Express Briefs 2017, 64, 1217–1221. [CrossRef]
9. Conti, F.; Benini, L. A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters. In Proceedings of the IEEE Design, Automation and Test in Europe Conference and Exhibition (DATE), Grenoble, France, 9–13 March 2015; pp. 683–688.
10. Meloni, P.; Deriu, G.; Conti, F.; Loi, I.; Raffo, L.; Benini, L. Curbing the roofline: A scalable and flexible architecture for CNNs on FPGA. In Proceedings of the ACM International Conference on Computing Frontiers, Como, Italy, 16–18 May 2016; pp. 376–383.
11. Wu, N.; Jiang, T.; Zhang, L.; Zhou, F.; Ge, F. A Reconfigurable Convolutional Neural Network-Accelerated Coprocessor Based on RISC-V Instruction Set. Electronics 2020, 9, 1005. [CrossRef]
12. Watanabe, D.; Yano, Y.; Izumi, S.; Kawaguchi, H.; Takeuchi, K.; Hiramoto, T.; Iwai, S.; Murakata, M.; Yoshimoto, M. An Architectural Study for Inference Coprocessor Core at the Edge in IoT Sensing. In Proceedings of the 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Genoa, Italy, 23–25 March 2020; pp. 305–309.
13. Wu, Y.; Wang, J.J.; Qian, K.; Liu, Y.; Guo, R.; Hu, S.G.; Yu, Q.; Chen, T.P.; Liu, Y.; Rong, L. An energy-efficient deep convolutional neural networks coprocessor for multi-object detection. Microelectron. J. 2020, 98, 104737. [CrossRef]
14. Chang, M.C.; Pan, Z.G.; Chen, J.L. Hardware accelerator for boosting convolution computation in image classification applications. In Proceedings of the 2017 IEEE 6th Global Conference on Consumer Electronics (GCCE), Nagoya, Japan, 24–27 October 2017; pp. 1–2.
15. Lima, P.; Vieira, C.; Reis, J.; Almeida, A.; Silveira, J.; Goerl, R.; Marcon, C. Optimizing RISC-V ISA Usage by Sharing Coprocessors on MPSoC. In Proceedings of the 2020 IEEE Latin-American Test Symposium (LATS), Maceio, Brazil, 30 March–2 April 2020; pp. 1–5.
16. Du, L.; Du, Y.; Li, Y.; Su, J.; Kuan, Y.C.; Liu, C.C.; Chang, M.C.F. A reconfigurable streaming deep convolutional neural network accelerator for Internet of Things. IEEE Trans. Circuits Syst. I Regul. Pap. 2018, 65, 198–208. [CrossRef]
17. Olivieri, M.; Cheikh, A.; Cerutti, G.; Mastrandrea, A.; Menichelli, F. Investigation on the optimal pipeline organization in RISC-V multi-threaded soft processor cores. In 2017 New Generation of CAS (NGCAS); IEEE: New York, NY, USA, 2017; pp. 45–48.
18. Cheikh, A.; Sordillo, S.; Mastrandrea, A.; Menichelli, F.; Olivieri, M. Efficient Mathematical Accelerator Design Coupled with an Interleaved Multi-threading RISC-V. In Proceedings of the International Conference on Applications in Electronics Pervading Industry, Environment and Society, Pisa, Italy, 11–13 September 2019; Springer: Cham, Switzerland, 2019; pp. 529–539.
19. Lattner, C. RISC-V Vector Extension Intrinsic Support. Available online: https://www.sifive.com/blog/risc-v-vector-extension-intrinsic-support (accessed on 26 January 2021).
20. Cavalcante, M.; Schuiki, F.; Zaruba, F.; Schaffner, M.; Benini, L. Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multiprecision Floating-Point Support in 22-nm FD-SOI. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 530–543. [CrossRef]
21. Chen, C.; Xiang, X.; Liu, C.; Shang, Y.; Guo, R.; Liu, D.; Lu, Y.; Hao, Z.; Luo, J.; Chen, Z.; et al. Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-bit High Performance RISC-V Processor with Vector Extension: Industrial Product. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 30 May–3 June 2020; pp. 52–64.
22. Wright, J.C.; Schmidt, C.; Keller, B.; Dabbelt, D.P.; Kwak, J.; Iyer, V.; Mehta, N.; Chiu, P.-F.; Bailey, S.; Asanovic, K.; et al. A Dual-Core RISC-V Vector Processor with On-Chip Fine-Grain Power Management in 28-nm FD-SOI. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 2721–2725. [CrossRef]
23. Kimura, Y.; Kikuchi, T.; Ootsu, K.; Yokota, T. Proposal of Scalable Vector Extension for Embedded RISC-V Soft-Core Processor. In Proceedings of the 7th International Symposium on Computing and Networking Workshops (CANDARW), Nagasaki, Japan, 26–29 November 2019; pp. 435–439.
24. Johns, M.; Kazmierski, T.J. A Minimal RISC-V Vector Processor for Embedded Systems. In Proceedings of the 2020 Forum for Specification and Design Languages (FDL), Kiel, Germany, 15–17 September 2020.
25. Traber, A.; Gautschi, M. PULPino: Datasheet; ETH: Zurich, Switzerland; University of Bologna: Bologna, Italy, 2017. Available online: https://pulp-platform.org/docs/pulpino_datasheet.pdf (accessed on 26 January 2021).
26. Blasi, L.; Vigli, F.; Cheikh, A.; Mastrandrea, A.; Menichelli, F.; Olivieri, M. A RISC-V Fault-Tolerant Microcontroller Core Architecture Based on a Hardware Thread Full/Partial Protection and a Thread-Controlled Watch-Dog Timer. In Proceedings of the International Conference on Applications in Electronics Pervading Industry, Environment and Society, Pisa, Italy, 11–13 September 2019; Springer: Cham, Switzerland, 2019; pp. 505–511.
27. Cheikh, A.; Cerutti, G.; Mastrandrea, A.; Menichelli, F.; Olivieri, M. The microarchitecture of a multi-threaded RISC-V compliant processing core family for IoT end-nodes. In Proceedings of the International Conference on Applications in Electronics Pervading Industry, Environment and Society, Rome, Italy, 21–22 September 2017; Springer: Cham, Switzerland, 2017; pp. 89–97.
28. Genesys 2 Kintex-7 FPGA Development Board. Available online: https://reference.digilentinc.com/reference/programmable-logic/genesys-2/start?redirect=1 (accessed on 26 January 2021).
29. Schiavone, P.D.; Conti, F.; Rossi, D.; Gautschi, M.; Pullini, A.; Flamand, E.; Benini, L. Slow and steady wins the race? A comparison of ultra-low-power RISC-V cores for internet-of-things applications. In Proceedings of the 27th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS), Thessaloniki, Greece, 25–27 September 2017; pp. 1–8.