
Article

A Reconfigurable Convolutional Neural Network-Accelerated Coprocessor Based on the RISC-V Instruction Set

Ning Wu *, Tao Jiang, Lei Zhang, Fang Zhou and Fen Ge

College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; [email protected] (T.J.); [email protected] (L.Z.); [email protected] (F.Z.); [email protected] (F.G.)
* Correspondence: [email protected]; Tel.: +86-139-5189-3307

Received: 5 May 2020; Accepted: 8 June 2020; Published: 16 June 2020

Abstract: As a typical artificial intelligence algorithm, the convolutional neural network (CNN) is widely used in Internet of Things (IoT) systems. In order to improve the computing ability of an IoT CPU, this paper designs a reconfigurable CNN-accelerated coprocessor based on the RISC-V instruction set. The interconnection structure of the acceleration chain designed in prior work is optimized, and the accelerator is connected to the RISC-V CPU core in the form of a coprocessor. The corresponding coprocessor instructions are designed and an instruction compiling environment is established. Through inline assembly in the C language, the coprocessor instructions are called, coprocessor acceleration library functions are established, and common algorithms in the IoT system are implemented on the coprocessor. Finally, resource consumption evaluation and performance analysis of the coprocessor are completed on an FPGA. The evaluation results show that the reconfigurable CNN-accelerated coprocessor consumes only 8534 LUTs, accounting for 47.6% of the total SoC system. The number of instruction cycles required to implement functions such as convolution and pooling with the designed coprocessor instructions is lower than with the standard instruction set, and the acceleration ratio of convolution reaches 6.27 times that of the standard instruction set.

Keywords: convolutional neural network (CNN); coprocessor; RISC-V; IoT

1. Introduction

With the rapid development of artificial intelligence (AI) technology, more and more AI applications are being developed and deployed on Internet of Things (IoT) node devices. The intelligent Internet of Things (AI + IoT, AIoT), which integrates the advantages of AI and IoT technology, has gradually become a research hotspot in IoT-related fields [1]. Traditional cloud computing is not suitable for AIoT because of its high delay and poor mobility [2]. For this reason, a new computing paradigm, edge computing, has been proposed: in order to reduce computing delay and network congestion, computation is migrated from the cloud server to the device [2]. Edge computing brings innovation to the IoT system, but it also challenges the AI computing performance of the IoT node processor, which must be improved while meeting the power consumption and area limitations of IoT node devices [3]. To improve the AI computing power of IoT node processors, some IoT chip manufacturers provide artificial intelligence acceleration libraries for their processors, but these only optimize and tailor algorithms at the software level, which is just a stopgap. It is necessary to design an AI algorithm accelerator suitable for the IoT node processor at the hardware level.


Among all kinds of AI algorithms, the convolutional neural network (CNN) algorithm is widely used in various IoT systems with image scenes due to its excellent performance in image recognition. Compared to traditional signal processing algorithms, it has a higher recognition accuracy, can avoid complex and tedious manual feature extraction, and has stronger adaptability to different image scenes [4]. The common hardware for CNN acceleration is the GPU or TPU, but this kind of hardware accelerator is mainly used in high-performance servers and is not suitable for IoT node devices. In References [5,6], most of the constituent units of the CNN network are implemented in hardware; these designs obtain high computing acceleration performance at the cost of considerable hardware resources and cannot meet the resource constraints of nodes in IoT systems. In Reference [7], the matrix multiplication in the CNN algorithm is reduced by the FFT, but the FFT itself consumes considerable hardware resources. The sparse CNN (SCNN) accelerator proposed in Reference [8] uses the data sparsity created by neural network pruning to design a special MAC computing unit, but it has a highly customized structure and a single application scenario. In Reference [9], an acceleration structure based on in-memory computing is proposed; by shortening the distance between computing and storage, accesses to memory are reduced and the computing efficiency is improved. However, the cost of a chip based on in-memory computing is high, so it is not suitable for large-scale application in IoT systems. The work in Reference [10] proposes several alternative reconfiguration schemes that significantly reduce the complexity of sum-of-products operations but does not explicitly propose a CNN architecture. In Reference [11], an FPGA implementation of a CNN addressing portability and power efficiency is designed, but the proposed architecture is not reconfigurable. In Reference [12], a CNN accelerator on the Xilinx ZYNQ 7100 (Xilinx, San Jose, CA, USA) hardware platform that accelerates both standard convolution and depthwise separable convolution is implemented, but the designed accelerator cannot be configured for other algorithms in the IoT system. Several research works [13–15] use the RISC-V ecosystem to design accelerator-centric SoCs or multicore processors, but they do not configure the designed accelerator in the form of a coprocessor with corresponding custom coprocessor instructions to speed up algorithm processing.

Based on Reference [16], this paper further optimizes the structure of the CNN accelerator. The basic operation modules in the acceleration chain designed in Reference [16] are interconnected through a crossbar, which diversifies the flow direction of the input data, thus enriching the functions of the acceleration unit and forming a reconfigurable CNN accelerator. Based on the extensibility of the RISC-V instruction set architecture, the reconfigurable CNN accelerator is connected to the E203 core (Nucleisys, Beijing, China) in the form of a coprocessor, and the corresponding custom coprocessor instructions are designed. The compiler environment and library functions are established for the designed instructions, the hardware and software design of the coprocessor is completed, and the implementation process of common algorithms in the IoT system on the coprocessor is described. Finally, resource evaluation of the coprocessor is completed on a Xilinx FPGA.
The evaluation results show that the coprocessor consumes only 8534 LUTs, accounting for 47.6% of the total SoC system. The RISC-V standard instruction set and the custom coprocessor instruction set are both used to implement four basic algorithms: convolution, pooling, ReLU, and matrix addition. The acceleration performance of the coprocessor is evaluated by comparing the operation cycles of each algorithm under these two implementations. The results show that the coprocessor instruction set has a significant acceleration effect on these four algorithms, and the acceleration of the convolution reaches 6.27 times that achieved by the standard instruction set.

The rest of this paper is organized as follows: Section 2 introduces some background knowledge, including the RISC-V instruction set and the E203 CPU. Section 3 introduces the hardware design of the reconfigurable CNN-accelerated coprocessor. Section 4 introduces the software design of the reconfigurable CNN-accelerated coprocessor. Section 5 introduces the implementation of commonly used IoT algorithms on the coprocessor. Section 6 presents the experiments and resource analysis. Section 7 summarizes the conclusions of this work.


2. Background

2.1. RISC-V Instruction Set

The RISC-V instruction set is a new type of instruction set that was born only in 2010. It summarizes the advantages and disadvantages of the traditional instruction sets, avoids their unreasonable designs, and its open-source features ensure that the existing design can be adjusted at any time. This guarantees the simplicity and efficiency of the instruction set [17]. RISC-V is a modular instruction set, which contains three basic instruction sets and six extended instruction sets. This modular feature enables the designer to select one basic instruction set and several extended instruction sets for combination according to the actual design requirements, so that flexible processor functions can be realized. More information on RISC-V can be found in [17].

Instruction extensibility is a significant advantage of the RISC-V architecture. It has reserved instruction coding space for special domain architectures in the processor design, and users can easily expand their own instruction subsets. The RISC-V architecture reserves 4 groups of custom instruction types in 32-bit instructions, as shown in Figure 1.

Figure 1. RISC-V architecture instruction space.

According to the RISC-V architecture [17,18], the custom-0 and custom-1 instruction spaces are reserved for user-defined extension instructions and will not be used as standard extension instructions in the future. The instruction space marked as custom-2/rv128 and custom-3/rv128 is reserved for the future rv128, which will also avoid being used by standard extensions, so it can also be used for custom extension instructions.

2.2. E203 CPU

In order to realize a reconfigurable CNN-accelerated coprocessor based on the RISC-V instruction set, we need to choose a suitable RISC-V processor as the research object. The E203 core is a 32-bit RISC-V processor core with a 2-stage pipeline [19]. It extends integer multiplication and division instructions, atomic operation instructions, and 16-bit compressed instructions on the basis of the rv32i instruction set. The E203 kernel is closest to the Arm Cortex-M0+ (Arm, Cambridge, UK) in its design goals. Its structure is shown in Figure 2.

The E203 kernel adopts a single-issue, in-order execution architecture. The first stage of the pipeline is instruction fetch; it performs simple decoding of instructions and branch prediction of branch instructions. The second stage of the pipeline completes decoding, execution, memory access, and write back.

The first stage of the pipeline mainly includes a simple decoding unit, a branch prediction unit, and a PC generation unit. The simple decoding unit (label 1 in Figure 2) performs simple decoding on the fetched instruction to obtain the type of instruction and whether it belongs to a branch or jump instruction. For branch and jump instructions, the branch prediction unit (label 2 in Figure 2) performs branch prediction to obtain the predicted jump address of the instruction. The PC generation unit (label 3 in Figure 2) generates the next PC value to be fetched and accesses the instruction tightly-coupled memory (ITCM) or the bus interface unit (BIU) to fetch instructions. The PC value and instruction value are placed in the PC register and IR register and passed to the next stage of the pipeline. The second stage of the pipeline mainly includes a decoding and dispatch unit (label 4 in Figure 2), a branch prediction analysis unit (label 5 in Figure 2), an arithmetic logic operation unit (label 6 in Figure 2), a multi-cycle multiplier and divider (label 7 in Figure 2), a memory access unit (label 8 in Figure 2), and an extension accelerator interface (EAI) (label 9 in Figure 2). The decoding and dispatch unit implements the decoding and dispatching of instructions: it dispatches instructions to different execution units according to their specific types. Ordinary arithmetic logic operation instructions are dispatched to the arithmetic logic operation unit.

Figure 2. E203 core structure.

The arithmetic logic operation unit executes instructions such as logic operations, addition, subtraction, and shift. The branch and jump instructions are dispatched to the branch prediction analysis unit for branch prediction analysis; for instructions whose prediction fails, pipeline flushing and instruction refetching are required. The multiplication and division instructions are dispatched to the multi-cycle multiplier and divider, which executes the multiplication or division operation over multiple cycles. Load and store instructions read and write the memory through the memory access unit. The coprocessor instructions are dispatched to the EAI interface and executed by the coprocessor. The reconfigurable CNN acceleration coprocessor designed in this paper is connected to the E203 processor through the EAI interface. The EAI interface has four channels: the request channel, the feedback channel, the memory request channel, and the memory feedback channel. The request channel is used by the main processor to send custom extended instructions to the coprocessor, including the source operands, dispatch label, and other information. The feedback channel is used by the coprocessor to feed back the execution of custom extended instructions; the feedback information includes the calculation results of the instructions, the dispatch label, etc. The memory request channel is used for memory access by the coprocessor, which reads and writes memory through the main processor. The memory feedback channel is used by the main processor to return the read and write results of the memory to the coprocessor, including the data read from the memory. Please refer to [19] for a detailed introduction of the EAI interface.
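To make the channel contents above easier to picture, the following C structs give a conceptual view of the information carried on each EAI channel. This is a documentation aid only, not the RTL: the field names paraphrase the description above, and the exact signal names and widths are those of the E203 manual [19].

```c
#include <stdint.h>

/* Request channel: main processor -> coprocessor. */
typedef struct {
    uint32_t instruction;    /* custom extended instruction word       */
    uint32_t rs1_value;      /* source operand 1 read by the main core */
    uint32_t rs2_value;      /* source operand 2 read by the main core */
    uint32_t dispatch_label; /* tag for matching the later feedback    */
} eai_request_t;

/* Feedback channel: coprocessor -> main processor. */
typedef struct {
    uint32_t result;         /* calculation result to write back to rd */
    uint32_t dispatch_label; /* tag of the completed instruction       */
} eai_feedback_t;

/* Memory request channel: coprocessor -> main processor. */
typedef struct {
    uint32_t address;        /* memory address to access               */
    uint32_t write_data;     /* data to store (valid for writes)       */
    int      is_write;       /* read/write select                      */
} eai_mem_request_t;

/* Memory feedback channel: main processor -> coprocessor. */
typedef struct {
    uint32_t read_data;      /* data returned from memory              */
} eai_mem_feedback_t;
```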
The coprocessor extended instructions of the E203 processor follow the instruction expansion rules of RISC-V, and the coprocessor extended instruction encoding format is shown in Figure 3.

[31:25] funct7 | [24:20] rs2 | [19:15] rs1 | [14] xd | [13] xs1 | [12] xs2 | [11:7] rd | [6:0] opcode

Figure 3. E203 coprocessor extended instruction encoding format.

The 25th to 31st bits of the instruction form the funct7 field, which is used as the sub-coding space for the parsing of instructions in the decoding stage. Funct7 is 7 bits wide and can encode 128 instructions. rs1, rs2, and rd are the register indexes of source operand 1, source operand 2, and the destination operand, respectively. xs1, xs2, and xd indicate whether the source registers need to be read and whether the destination register needs to be written. The 0th to 6th bits of the instruction form the opcode segment, which can take the opcode encoding values of custom-0, custom-1, custom-2, or custom-3. Given the bit width of funct7, each group of custom instructions can encode 128 instructions, and the four groups of custom instructions can encode 512 coprocessor instructions.
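As a concrete illustration of this layout, the following C helper (a minimal sketch, not code from the paper) packs the Figure 3 fields into one 32-bit instruction word. The custom-0 opcode value 0001011b (0x0B) comes from the RISC-V specification; the example funct7 and register values are arbitrary.

```c
#include <stdint.h>

/* Pack the Figure 3 fields into a 32-bit custom-0 instruction word. */
static uint32_t encode_custom0(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                               uint32_t xd, uint32_t xs1, uint32_t xs2,
                               uint32_t rd)
{
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (xd << 14) | (xs1 << 13) | (xs2 << 12) |
           (rd << 7) | 0x0B;            /* bits [6:0]: custom-0 opcode */
}

/* Example: funct7 = 11, rs1 = x5, rs2 = x6, xs1 = xs2 = 1, xd = 0.   */
/* encode_custom0(11, 6, 5, 0, 1, 1, 0) == 0x1662B00B                 */
```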
3. Hardware Design of CNN-Accelerated Coprocessor

The reconfigurable CNN acceleration coprocessor designed in this paper is a further extension of the compact CNN accelerator designed in Reference [16]. It mainly optimizes the acceleration chain structure and adds the coprocessor-related modules. Other basic components are consistent with those in Reference [16].

3.1. Accelerator Structure Optimization

In the compact CNN accelerator designed in Reference [16], the operation units in the acceleration chain are connected in a serial structure, and the data flow direction is fixed. When some algorithms are implemented, the data need to be moved between the memory and the accelerator many times, which reduces the calculation efficiency and increases the power consumption of the processor. In this paper, four basic operation modules (convolution, pooling, ReLU, and matrix addition) are interconnected by a crossbar, and a reconfigurable CNN accelerator is designed. The accelerator mainly comprises four reconfigurable computing acceleration processing elements (PEs). By configuring the PE units with different parameters, the hardware acceleration of various algorithms can be realized. The structure is shown in Figure 4.


Figure 4. Reconfigurable convolutional neural network (CNN) accelerator structure diagram.

The accelerator includes a source convolution kernel cache module (COE RAM), a reconfigurable circuit controller (Reconfigure Controller), two ping-pong buffer blocks (BUF RAM BANK), and four PE units. Each PE unit contains four basic computing components and a configurable crossbar (Crossbar).

The configuration of the crossbar allows data to flow to different computing components, which can speed up different algorithms. The entire accelerator can be parameterized, and its data can be accessed, through an external bus.

The key to the acceleration of various algorithms by the four PE units is the design of the crossbar circuit. The configuration of the crossbar can make the input data flow pass through any one or more calculation modules. Its structure is shown in Figure 5.


Figure 5. The structure of Crossbar.

The crossbar mainly consists of an input buffer (FIFO), a configuration register group (Cfg Regs), and five multiplexers (MUX). The five MUX select the data paths to be opened according to the configuration information in the Cfg Regs; thus, the data stream passes through different calculation modules in different orders. For example, the edge detection operation in image processing usually takes the input image and convolves it with the edge detection operator, for which the MUX need to be configured as the path shown in the red part of Figure 5. In the convolutional neural network algorithm, it is usually necessary to perform convolution, ReLU, and pooling operations on the source matrix, for which the MUX need to be configured as the path shown in the blue part of Figure 5. The crossbar circuit and each calculation module use the ICB bus for data transmission. The ICB bus is a bus protocol customized by the open-source CPU E203; it combines the advantages of the AXI bus and the AHB bus. For more information about the ICB bus, please refer to [19].
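The path selection just described can be summarized in C as a small configuration model. This is a conceptual sketch of the Cfg Regs contents, not the RTL: the tap and selector names follow Figure 5, while the enum encoding is an assumption.

```c
/* Taps that each MUX can select from (names follow Figure 5). */
typedef enum {
    TAP_SRC_A, TAP_CONV_OUT, TAP_ADD_OUT, TAP_RELU_OUT, TAP_POOL_OUT
} tap_t;

/* Cfg Regs: one selector per MUX. */
typedef struct {
    tap_t sel_conv;  /* drives Conv In  */
    tap_t sel_add;   /* drives Add In   */
    tap_t sel_relu;  /* drives Relu In  */
    tap_t sel_pool;  /* drives Pool In  */
    tap_t sel_res;   /* drives Result   */
} crossbar_cfg_t;

/* Blue path in Figure 5: Src A -> Conv -> ReLU -> Pool -> Result. */
static const crossbar_cfg_t cnn_mode = {
    .sel_conv = TAP_SRC_A,
    .sel_relu = TAP_CONV_OUT,
    .sel_pool = TAP_RELU_OUT,
    .sel_res  = TAP_POOL_OUT,
    .sel_add  = TAP_SRC_A,   /* Add unit unused in this mode */
};
```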

3.2. Coprocessor Design

In addition to optimizing the acceleration chain module designed in Reference [16], this paper also adds an EAI controller, a decoder, and a data fetcher, completing the hardware design of the reconfigurable CNN-accelerated coprocessor, whose structure is shown in Figure 6.

The EAI controller processes the timing of the EAI interface and hands the instruction information and source operands obtained from the EAI request channel to the decoder for decoding. The memory data obtained from the EAI memory response channel are handed over to the data fetcher for allocation to the corresponding cache. The decoder decodes the custom extended instructions: configuration instructions are handed over to the reconfiguration controller to realize the configuration of each functional module, while memory access instructions are handed over to the data fetcher, which realizes the memory access. The data fetcher implements the processing of memory access instructions. It reads data from external memory into the corresponding cache through the memory response channel of the EAI interface; the convolution kernel coefficients are loaded into the COE CACHE unit, and the source matrix is loaded into CACHE BANK1 or CACHE BANK2. The calculation results are written back to the external memory through the memory request channel of the EAI interface.

Figure 6. Hardware structure of coprocessor.

When the main processor executes an instruction, the decoding unit of the main processor first determines whether the instruction belongs to the custom instruction group according to the opcode of the instruction. For instructions belonging to the custom instruction group, it determines whether to read the source operands according to the xs1 and xs2 bits of the instruction; if so, the source operands are read from the register file according to rs1 and rs2. The main processor also needs to maintain the data correlation of instructions. If the instructions have data correlation, the processor will pause the pipeline until the correlation is removed. At the same time, if the instruction needs to be written back to the target register, the target register rd will also be used as one of the bases for the subsequent instruction data correlation judgment. Then, the instructions are sent to the coprocessor through the EAI request channel for processing. After the coprocessor receives the instructions, they are further decoded and allocated to different units for execution according to their types. The coprocessor executes instructions in a blocking manner: only when one instruction has finished executing can the main processor dispatch the next one. Finally, after the instruction is executed, the result of instruction execution is returned to the main processor through the response channel. For instructions with write-back, the result of instruction execution needs to be written back to the target register.

The designed reconfigurable CNN acceleration coprocessor and the compact CNN accelerator designed in Reference [16] are both used to process multimedia data such as voice and images. The data flow of each is shown in Figure 7.

Figure 7a shows the data flow direction of the compact CNN accelerator designed in Reference [16]. It is an external device mounted on the SoC bus, which requires the CPU core to handle the data transfer. First, the CPU core needs to move the multimedia data from the external interface to the data memory, and then to the accelerator for processing. After the processing, the CPU core needs to move the calculation results from the accelerator to the data memory and then to the external interface to send them to the network interface. The data are thus moved between the accelerator and the data memory twice.


Figure 7. Data flow comparison: (a) Data flow direction of the compact CNN accelerator designed in Reference [16]; (b) data flow direction of the reconfigurable CNN-accelerated coprocessor designed in this paper.

Figure 7b shows the data flow direction of the reconfigurable CNN-accelerated coprocessor designed in this paper. Compared to the accelerator designed in Reference [16], the coprocessor can read multimedia data directly from the external interface, and after processing, the results can be sent directly to the network interface through the external interface, with no data movement between the coprocessor and the data memory.

The coprocessor approach reduces data movement, which further speeds up algorithm processing and saves power consumption. In addition, the coprocessor can provide coprocessor instruction support, with higher code density and simpler programming.

4. Software Design of CNN-Accelerated Coprocessor

4.1. Instruction Design of Coprocessor

The coprocessor instructions designed for the reconfigurable CNN-accelerated coprocessor are shown in Table 1.

Table 1. Custom coprocessor instruction list.

Instruction  Funct7  Rd      Xd  Rs1                   Xs1  Rs2                      Xs2
acc.rest     1       -       0   -                     -    -                        0
acc.load.cd  2       -       0   CoeMemoryAddress      1    Length|CacheBaseAddress  1
acc.load.bd  3       BANKID  0   BankMemoryAddress     1    Length|BankBaseAddress   1
acc.cfg.ldl  4       -       0   SrcDataBaseAddress    1    Location                 1
acc.cfg.sdl  5       -       0   StoreDataBaseAddress  1    Location                 1
acc.cfg.sms  6       PEID    0   Width                 1    Height                   1
acc.cfg.ldo  7       -       0   Offset1|Offset2       1    Offset3|Offset4          1
acc.cfg.sdo  8       -       0   Offset1|Offset2       1    Offset3|Offset4          1
acc.cfg.cp   9       PEID    0   CoeCacheBaseAddress   1    CoeHeight|CoeWidth       1
acc.cfg.cb   10      -       0   PEID                  1    CoeBasis                 1
acc.cfg.pw   11      -       0   Width                 1    Height                   1
acc.cfg.adl  12      -       0   SrcAddDataAddress     1    Location                 1
acc.cfg.ado  13      -       0   Offset1|Offset2       1    Offset3|Offset4          1
acc.cfg.wm   14      -       0   PEID                  1    Order                    1
acc.sc       15      -       0   -                     0    -                        0

There are 15 coprocessor instructions in total, which are divided into two categories: coprocessor configuration instructions and coprocessor data loading instructions. The configuration instructions are used to configure each function module of the coprocessor, such as the PE working mode, the crossbar data flow direction, etc. The data loading instructions load the convolution coefficients and the source matrix into the respective coprocessor caches. The assembly format of the coprocessor instructions is shown in Figure 8.

acc.rest
acc.load.cd rs1 rs2
acc.load.bd rd rs1 rs2
acc.cfg.wm rs1 rs2
acc.cfg.ldl rs1 rs2
acc.cfg.ldo rs1 rs2
acc.cfg.sdl rs1 rs2
acc.cfg.sdo rs1 rs2
acc.cfg.sms rd rs1 rs2
acc.cfg.cp rd rs1 rs2
acc.cfg.cb rs1 rs2
acc.cfg.pw rs1 rs2
acc.cfg.adl rs1 rs2
acc.cfg.ado rs1 rs2
acc.sc

Figure 8. Assembly format of coprocessor instructions.

The use of the designed coprocessor instructions is generally divided into 5 steps:

1. Reset the coprocessor. Before using the coprocessor, the acc.rest instruction must be used to reset it.

2. Load the convolution coefficients. If the convolution module in the coprocessor is to be used, the acc.load.cd instruction must be used to load the convolution coefficients into the COE cache.

3. Load the source matrix. The coprocessor supports reading the source matrix directly from memory for calculation, and it also supports moving the source matrix to a CACHE BANK and reading the data from the CACHE BANK for calculation. The instruction to achieve this function is acc.load.bd.

4. Configure the coprocessor parameters.

(1) Working mode. Configure the working mode of each PE of the coprocessor, that is, configure the crossbar in each PE, select the acceleration units that participate in the calculation, and configure the direction of the data flow according to the algorithm. The instruction to achieve this function is acc.cfg.wm.

(2) Location of the calculation data source. Configure the coprocessor calculation data source. The coprocessor can load data directly from memory or from a CACHE BANK. Before starting the coprocessor, the calculation data source must be configured according to the storage location of the source matrix; the instruction to achieve this is acc.cfg.ldl. Then configure the offset of the calculation data required by each PE from the base address of the data source: the calculation data required by the four PEs are stored contiguously in the data source, so the relative offset of the calculation data required by each PE needs to be specified. The instruction to achieve this is acc.cfg.ldo.

(3) Storage location of the calculation results. Configuring the storage location of the coprocessor calculation results mirrors configuring the calculation data source. Saving the calculation results to memory or to a CACHE BANK is supported, and the configuration instruction is acc.cfg.sdl. Then configure the relative offset of each PE's calculation result within the storage location; the configuration instruction is acc.cfg.sdo.

(4) Width and height of the source matrix. Configure the width and height of the source matrix in each PE; the instruction to achieve this function is acc.cfg.sms.

(5) Convolution kernel coefficient parameters. Configure the offset address of the convolution kernel in the COE CACHE and the size of the convolution kernel; the configuration instruction is acc.cfg.cp. Then configure each convolution kernel bias; the configuration instruction is acc.cfg.cb.

(6) Pooling parameters. Configure the height and width of the maximum pooling window; the configuration instruction is acc.cfg.pw.

(7) Matrix addition parameters. Configure the loading position of the other input matrix participating in the matrix addition calculation (the input matrix can be loaded from memory or from a CACHE BANK); the configuration instruction is acc.cfg.adl. Then configure the relative offset of the other input matrix in each PE; the configuration instruction is acc.cfg.ado.

5. Start the calculation. Using the acc.sc instruction enables the coprocessor to start the calculation according to the configured parameters (see the sketch after this list).
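Assuming the modified assembler of Section 4.2 (which teaches Binutils the acc.* mnemonics), the skeleton of this five-step sequence can be written as C inline assembly. The operand lists follow Figure 8 and Table 1; the register contents (addresses, packed Length|BaseAddress words, mode words) are hypothetical placeholders.

```c
/* Five-step skeleton using the custom mnemonics directly (sketch only). */
void conv_via_raw_instructions(unsigned int coe_addr, unsigned int len_base,
                               unsigned int src_addr, unsigned int location,
                               unsigned int peid, unsigned int order)
{
    asm volatile ("acc.rest");                                   /* step 1 */
    asm volatile ("acc.load.cd %0, %1"
                  :: "r"(coe_addr), "r"(len_base));              /* step 2 */
    /* step 3 (acc.load.bd) skipped: source matrix read from memory */
    asm volatile ("acc.cfg.wm %0, %1"
                  :: "r"(peid), "r"(order));                     /* step 4 (1) */
    asm volatile ("acc.cfg.ldl %0, %1"
                  :: "r"(src_addr), "r"(location));              /* step 4 (2) */
    /* ...remaining acc.cfg.* configuration omitted for brevity... */
    asm volatile ("acc.sc");                                     /* step 5 */
}
```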

4.2. Establishment of the Instruction Compiling Environment for the Coprocessor

The RISC-V Foundation's open-source compilation tool chains include GCC and LLVM, of which GCC is the most commonly used. By modifying the Binutils toolset in the GCC compilation tool chain, a compilation environment for the coprocessor instructions is established. The open-source GCC compilation tool chain mainly comes in riscv32 and riscv64 versions. The E203 kernel only supports the rv32IMAC instruction set of RISC-V, so the riscv32 version of the GCC compilation tool chain is used. The GCC compilation tool chain mainly includes the GCC compiler, the Binutils binary toolset, the GDB debugging tool, and the C runtime libraries. The Binutils binary toolset is a set of tools for processing binary programs. Some commonly used tools and their functions in the 32-bit Binutils toolset are shown in Table 2.

Table 2. Binutils toolset list.

Tool Name                    Function
riscv32-unknown-elf-as       Assembler, converts assembly code to an executable ELF file
riscv32-unknown-elf-ld       Linker, links multiple object and library files into executables
riscv32-unknown-elf-objcopy  Converts ELF format files to bin format files
riscv32-unknown-elf-objdump  Disassembler, converts binary files into assembly code

By calling each Binutils tool in the table, the assembly code can be assembled and linked, and an object file can be generated. In order to enable the Binutils toolset to support the custom coprocessor instructions in Section 4.1, the source code of Binutils needs to be modified. After adding all custom coprocessor instructions to the Binutils source code, the Binutils source code is compiled to generate the toolset in Table 2. To test the functionality of the toolset, the test assembly program test.s shown in Figure 9 is written. The riscv32-unknown-elf-as assembler is used to assemble the program and generate the test.out binary object file, and riscv32-unknown-elf-objdump is used to disassemble test.out to obtain the disassembly information shown in Figure 10. The first column is the binary encoding of each instruction, and the second column is the assembly code obtained by disassembling it. The binary encoding of the instructions is consistent with the design of Section 4.1, and the assembly code obtained by disassembly is also consistent with the test assembly code (x5 and t0, x1 and ra, and x2 and sp denote the same registers). This test indicates that the designed custom coprocessor instructions have been successfully added to the Binutils toolset.


Figure 9. Test assembler.




Figure 10. Assembly code disassembly information.

4.3. Coprocessor-Accelerated Library Function Design

After completing the instruction design of the custom coprocessor and the establishment of the instruction compilation environment, the coprocessor can be used by writing assembly code. However, writing assembly code has a low development efficiency. In order to facilitate the use of the coprocessor, the common functions of the coprocessor are packaged into C language function interfaces, and coprocessor acceleration library functions are designed. The designed library function interfaces are shown in Table 3. The library functions use C language inline assembly to call the coprocessor instructions. Taking the CfgPoolWidth function interface as an example, the specific implementation is shown in Figure 11. The CfgPoolWidth function calls the acc.cfg.pw instruction in the form of inline assembly and passes in the values of the width and height variables to implement the height and width configuration of the pooling operation.

Table 3. Library function interface list.

void CoprocessorRest()
    Reset the coprocessor.
int LoadCoeData(unsigned int CoeMemoryAddress, unsigned int CacheBaseAddress, unsigned int Length)
    Load convolution kernel coefficients into the COE CACHE.
int LoadBankData(unsigned int BankMemoryAddress, unsigned int BankBaseAddress, unsigned int Length, int Bankid)
    Load the source matrix into a CACHE BANK.
int CfgLoadDataLoc(int Location, unsigned int SrcDataBaseAddress, unsigned int Offset1, unsigned int Offset2, unsigned int Offset3, unsigned int Offset4)
    Configure the location of the calculation data source and the relative offset of each PE unit's calculation data within the data source.
int CfgStoreDataLoc(int Location, unsigned int StoreDataBaseAddress, unsigned int Offset1, unsigned int Offset2, unsigned int Offset3, unsigned int Offset4)
    Configure the calculation result storage location and the relative offset of each PE unit's calculation result within the storage location.
int CfgAddDataLoc(int Location, unsigned int SrcAddDataAddress, unsigned int Offset1, unsigned int Offset2, unsigned int Offset3, unsigned int Offset4)
    Configure the loading position of the other input matrix participating in the matrix addition calculation and the relative offset of that matrix for each PE unit within the loading position.
int CfgSrcMatrixSize(int Width, int Height, int PEID)
    Configure the width and height of the source matrix.
int CfgCoeParm(unsigned int CoeCacheBaseAddress, int CoeHeight, int CoeWidth, int PEID)
    Configure the convolution kernel size and its offset address in the COE CACHE.
int CfgCoeBias(float CoeBasis1, float CoeBasis2, float CoeBasis3, float CoeBasis4)
    Configure the convolution kernel bias for each PE unit.
int CfgPoolWidth(int Width, int Height)
    Configure the pooling width and height.
int CfgWorkMode(unsigned int Order1, unsigned int Order2, unsigned int Order3, unsigned int Order4)
    Configure the working mode of the 4 PE units, that is, select the acceleration units that participate in the calculation and the direction of the data flow.
int CoprocessorStartCal()
    Start the calculation.
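Putting the interfaces together, the sketch below drives PE1 through one convolution + ReLU + pooling pass over a 32 × 32 source matrix in memory, following the five steps of Section 4.1. Only the function signatures come from Table 3; all addresses, offsets, and mode encodings (LOC_MEMORY, ORDER_*) are hypothetical placeholders, since the paper does not list the numeric encodings.

```c
/* Prototypes from Table 3 (assumed collected in a library header). */
void CoprocessorRest(void);
int  LoadCoeData(unsigned int CoeMemoryAddress, unsigned int CacheBaseAddress,
                 unsigned int Length);
int  CfgWorkMode(unsigned int Order1, unsigned int Order2,
                 unsigned int Order3, unsigned int Order4);
int  CfgLoadDataLoc(int Location, unsigned int SrcDataBaseAddress,
                    unsigned int Offset1, unsigned int Offset2,
                    unsigned int Offset3, unsigned int Offset4);
int  CfgStoreDataLoc(int Location, unsigned int StoreDataBaseAddress,
                     unsigned int Offset1, unsigned int Offset2,
                     unsigned int Offset3, unsigned int Offset4);
int  CfgSrcMatrixSize(int Width, int Height, int PEID);
int  CfgCoeParm(unsigned int CoeCacheBaseAddress, int CoeHeight, int CoeWidth,
                int PEID);
int  CfgCoeBias(float CoeBasis1, float CoeBasis2, float CoeBasis3,
                float CoeBasis4);
int  CfgPoolWidth(int Width, int Height);
int  CoprocessorStartCal(void);

#define LOC_MEMORY           0      /* hypothetical: operands live in memory */
#define ORDER_CONV_RELU_POOL 0x12u  /* hypothetical: blue path of Figure 5   */
#define ORDER_IDLE           0x00u  /* hypothetical: PE not used             */

void run_conv_relu_pool(unsigned int src, unsigned int dst, unsigned int coe)
{
    CoprocessorRest();                             /* step 1: reset           */
    LoadCoeData(coe, 0, 25);                       /* step 2: one 5x5 kernel  */
    CfgWorkMode(ORDER_CONV_RELU_POOL, ORDER_IDLE,  /* step 4 (1): only PE1    */
                ORDER_IDLE, ORDER_IDLE);
    CfgLoadDataLoc(LOC_MEMORY, src, 0, 0, 0, 0);   /* step 4 (2): data source */
    CfgStoreDataLoc(LOC_MEMORY, dst, 0, 0, 0, 0);  /* step 4 (3): result loc. */
    CfgSrcMatrixSize(32, 32, 1);                   /* step 4 (4): PE1 matrix  */
    CfgCoeParm(0, 5, 5, 1);                        /* step 4 (5): kernel size */
    CfgCoeBias(0.0f, 0.0f, 0.0f, 0.0f);            /* step 4 (5): zero biases */
    CfgPoolWidth(2, 2);                            /* step 4 (6): 2x2 pooling */
    CoprocessorStartCal();                         /* step 5: start           */
}
```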

Figure 11. The specific implementation of the CfgPoolWidth function.
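A minimal sketch of what Figure 11 shows is given below. The operand order and register constraints of acc.cfg.pw are assumptions for illustration; the exact encoding is fixed by the modified GCC tool chain described above.

/* Sketch of the CfgPoolWidth library function (cf. Figure 11).
   Assumes acc.cfg.pw takes the pooling width and height in two
   source registers; the actual operand layout is defined by the
   modified assembler. */
int CfgPoolWidth(int Width, int Height)
{
    asm volatile (
        "acc.cfg.pw %0, %1"         /* custom coprocessor instruction */
        :                           /* no outputs */
        : "r"(Width), "r"(Height)   /* width and height in registers */
    );
    return 0;
}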

5. Implementation of Common Algorithms on Coprocessor

The reconfigurable CNN-accelerated coprocessor designed in this paper can accelerate not only the CNN algorithm but also some other commonly used algorithms in the IoT system. This article describes the implementation of the LeNet-5 network, Sobel edge detection, and FIR filtering algorithms on the coprocessor.

5.1. LeNet-5 Network Implementation

In order to verify the acceleration effect of the coprocessor on the CNN algorithm, the classical LeNet-5 network is used in this paper. The structure of the LeNet-5 network is shown in Figure 12.

Figure 12. The structure of the LeNet-5 network (input image 32×32 → convolutions → 28×28 → subsample → 14×14 → convolutions → 10×10 → subsample → 5×5 → expansion → 400×1 → full connection → 10×1).

The LeNet-5 network mainly comprises six hidden layers [20]:

(1) Convolution layer C1. Convolutions are performed using six 5 × 5 convolution kernels on the 32 × 32 original image to generate six 28 × 28 feature maps, and each feature map is activated using the ReLU function.
(2) Pooling layer S2. The S2 layer uses a 2 × 2 pooling filter to perform maximum pooling on the output of C1 to obtain six 14 × 14 feature maps.
(3) Partially connected layer C3. The C3 layer uses 16 5 × 5 convolution kernels to partially connect with the six feature maps output by S2 and calculates 16 10 × 10 feature maps. The partial connection relationship and calculation process are shown in Figure 13 (a software reference sketch follows the figure). Take the calculation of the 0th output feature map as an example: first, use the 0th convolution kernel to convolve with the 0th, 1st, and 2nd feature maps output by the S2 layer; then add the results of the three convolutions plus a bias; and finally activate the sum to obtain the 0th feature map of the C3 layer.
(4) Pooling layer S4. This layer uses a 2 × 2 pooling filter to pool the output of C3 into 16 5 × 5 feature maps.
(5) Expand layer S5. This layer combines the 16 5 × 5 feature maps output by S4 into a one-dimensional matrix of size 400.

(6) Fully connected layer S6. The S6 layer fully connects the one-dimensional matrix output from the S5 layer with 10 convolution operators, and obtains 10 classification results as the recognition results of the input image.

From a comprehensive analysis of the operating characteristics of each layer of the network, the calculation of each layer of LeNet-5 is summarized in Table 4.

Table 4. Calculation steps of each layer of LeNet-5.

Layer | Calculation Formula | Explanation
C1 | [C1^0–C1^5] = ReLU[S1 × [K1^0–K1^5]] | Convolution and ReLU
S2 | [S2^0–S2^5] = Pool[C1^0–C1^5] | Pooling
C3 | [C3^0–C3^15] = ReLU[Σ(i=0..5) S2^i × [K3^0–K3^15]] | Convolution, matrix addition, and ReLU
S4 | [S4^0–S4^15] = Pool[C3^0–C3^15] | Pooling
S5 | S5 = Expand[S4^0–S4^15] | Expand
S6 | [S6^0–S6^9] = S5 × [K6^0–K6^9] | Fully connected


Figure 13. Schematic diagram of the C3 layer calculation process of the LeNet-5 network.
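As a software reference for the partial connection shown in Figure 13, the following sketch computes output map 0 of C3. Following the description above, the same 5 × 5 kernel is convolved with each of the three connected S2 maps; the array layout, relu helper, and fixed sizes are illustrative assumptions, not code from the paper.

/* Software reference for the C3 partial connection of Figure 13:
   output map 0 is built from S2 maps 0-2. Sizes follow LeNet-5:
   14x14 inputs, 5x5 kernel, 10x10 output. */
#define S2_N 14
#define K_N  5
#define C3_N (S2_N - K_N + 1)   /* 10 */

static float relu(float v) { return v > 0.0f ? v : 0.0f; }

void C3Map0(const float in[3][S2_N][S2_N],
            const float ker[K_N][K_N], float bias,
            float out[C3_N][C3_N])
{
    for (int r = 0; r < C3_N; r++)
        for (int c = 0; c < C3_N; c++) {
            float acc = bias;                     /* plus a bias */
            for (int m = 0; m < 3; m++)           /* three connected S2 maps */
                for (int i = 0; i < K_N; i++)
                    for (int j = 0; j < K_N; j++)
                        acc += in[m][r + i][c + j] * ker[i][j];
            out[r][c] = relu(acc);                /* final activation */
        }
}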

As can be seen from the above table, the implementation of the LeNet-5 network on the coprocessor can be performed in four steps:

(1) First, map the C1 layer and the S2 layer. Configure the coprocessor to use the convolution, ReLU, and pooling modules of the PE unit, and configure the parameters of these three modules. Configure the crossbar to enable the data flow in the sequence shown in Figure 14a, start the coprocessor, and calculate the C1 and S2 layers (a configuration sketch using the library functions of Table 3 is given after Figure 14).
(2) Map the C3 layer and the S4 layer. Configure the coprocessor to use the convolution, matrix addition, ReLU, and pooling modules of the PE unit, and configure the parameters of these four modules. Configure the crossbar to enable the data flow in the order shown in Figure 14b, start the coprocessor, calculate the C3 layer and the S4 layer, and cache the calculation result in BUF RAM BANK1.
(3) The S5 layer uses CPU calculations. Use a software program to expand the calculation result of (2) into a 1 × 400 one-dimensional matrix, which is buffered in BUF RAM BANK2.
(4) Map the S6 layer. Configure the coprocessor to use the convolution module of the PE unit so that the convolution module supports one-dimensional convolution. Configure the crossbar so that the data flow follows the sequence in Figure 14c and only the convolution module is used. Configure the data source as BUF RAM BANK2, start the coprocessor, and calculate the S6 layer.


Figure 14. Data flow mapping diagram of each layer of the LeNet-5 network: (a) mapping of C1 layer and S2 layer; (b) mapping of C3 layer and S4 layer; (c) mapping of S6 layer.
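As an illustration of step (1), the following sketch drives the mapping of Figure 14a with the library functions of Table 3. Every address, offset, and mode code is a placeholder invented for the example, and the tiling of the six C1 kernels over the four PE units is omitted; the paper does not specify these values.

/* Illustrative mapping of step (1) (C1 + S2, Figure 14a) onto PE 0.
   Prototypes from the coprocessor acceleration library (Table 3)
   are assumed to be in scope. */
#define IMG_ADDR  0x80001000u    /* assumed address of the 32x32 input image */
#define COE_ADDR  0x80002000u    /* assumed staging address of a C1 kernel */
#define OUT_ADDR  0x80003000u    /* assumed result buffer for the 14x14 maps */
#define MODE_CONV_RELU_POOL 0x1u /* assumed code: Conv -> ReLU -> Pool */
#define MODE_IDLE           0x0u /* assumed code: PE not used */

void RunC1S2(void)
{
    CoprocessorRest();                        /* reset the coprocessor */
    LoadCoeData(COE_ADDR, 0, 5 * 5);          /* one 5x5 kernel into COE CACHE */
    LoadBankData(IMG_ADDR, 0, 32 * 32, 0);    /* input image into CACHE BANK 0 */
    CfgLoadDataLoc(0, 0, 0, 0, 0, 0);         /* data source: CACHE BANK 0 */
    CfgSrcMatrixSize(32, 32, 0);              /* source matrix is 32x32 */
    CfgCoeParm(0, 5, 5, 0);                   /* 5x5 kernel at COE CACHE offset 0 */
    CfgCoeBias(0.1f, 0.0f, 0.0f, 0.0f);       /* assumed bias for PE 0 */
    CfgPoolWidth(2, 2);                       /* 2x2 max pooling for S2 */
    CfgStoreDataLoc(0, OUT_ADDR, 0, 0, 0, 0); /* where the results are stored */
    CfgWorkMode(MODE_CONV_RELU_POOL, MODE_IDLE, MODE_IDLE, MODE_IDLE);
    CoprocessorStartCal();                    /* start the calculation */
}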

5.2. Sobel Edge Detection and FIR Algorithm Implementation

In image processing algorithms, edge detection of images is often required. Edge detection based on Sobel operators is a commonly used method. Sobel operators are first-order gradient algorithms that can effectively filter out noise interference and extract accurate edge information [21]. Sobel edge detection convolves two 3 × 3 matrix operators with the input image to obtain the gray values of the horizontal and vertical edges, respectively. If A is the original image and Gx and Gy are the horizontal and vertical edge images, respectively, the calculation formulas of Gx and Gy are shown in Figure 15.

     | -1  0  1 |              |  1   2   1 |
Gx = | -2  0  2 | × A,    Gy = |  0   0   0 | × A
     | -1  0  1 |              | -1  -2  -1 |

Figure 15. Gx, Gy calculation formula.
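Transcribed from Figure 15 for use by software reference code, the two operators can be declared as constant arrays:

/* The two 3x3 Sobel operators from Figure 15. */
static const int SOBEL_GX[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1,  0,  1} };
static const int SOBEL_GY[3][3] = { { 1, 2, 1}, { 0, 0, 0}, {-1, -2, -1} };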

The complete edge gray value of the image can be approximated by Equation (1):

G = |Gx| + |Gy| (1)

Before image edge detection, downsampling is usually performed to reduce the image size and the amount of data for operations. In summary, the implementation of Sobel edge detection on the accelerator can be performed in three steps:

(1) Calculate Gx and Gy. Configure the coprocessor to use two PE units. Each PE unit uses the pooling and convolution modules, and the convolution modules in the two PEs use the two convolution kernels of the Sobel operator. Configure the crossbar so that the data flow follows the sequence in Figure 16a. Start the coprocessor, calculate Gx and Gy, and cache the calculation result in BUF RAM BANK1.


(2) Use the CPU to calculate |Gx| and |Gy|. Use a software program to take the absolute value of each element in the cached result of (1) to obtain |Gx| and |Gy|, and cache the calculation result in BUF RAM BANK2 (a minimal software sketch of this stage follows Figure 16).
(3) Calculate |G|. Configure the coprocessor to use the matrix addition module of the PE unit and configure the crossbar so that the data flow follows the order in Figure 16b and only the matrix addition module is used. Configure the data source as BUF RAM BANK2, start the coprocessor, and calculate |G|.

Figure 16. Data flow mapping diagram of the Sobel edge detection algorithm: (a) calculate Gx and Gy; (b) calculate |G|.
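A minimal sketch of the CPU stage in step (2), assuming the cached results are stored as int elements in linearly addressed buffers (the element type and buffer handling are assumptions):

/* Step (2) on the CPU: absolute value of every element produced by
   step (1), written back for the matrix addition of step (3). */
void SobelAbsStage(const int *bank1, int *bank2, int n)
{
    for (int i = 0; i < n; i++)
        bank2[i] = bank1[i] < 0 ? -bank1[i] : bank1[i];  /* |Gx|, |Gy| */
}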

In speech signal processing, FIR filters are often used for denoising. FIR filtering is a one-dimensional convolution operation (a software reference sketch is given after Figure 17). You only need to configure the PE to use a convolution module. Configure the crossbar so that the data flow follows the sequence shown in Figure 17, configure the convolution module for one-dimensional convolution, load the convolution kernel into COE RAM, start the coprocessor, and calculate the FIR filtering result.

Figure 17. Data flow mapping of the FIR filter algorithm.
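For reference, the following sketch writes FIR filtering as the one-dimensional convolution the paragraph describes; the float element type and the valid-sample output convention (n - m + 1 outputs) are assumptions.

/* Software reference for FIR filtering as 1-D convolution, matching
   what the coprocessor's convolution module computes when configured
   for one-dimensional operation. */
void FirReference(const float *x, int n, const float *h, int m, float *y)
{
    for (int i = 0; i <= n - m; i++) {   /* each valid output sample */
        float acc = 0.0f;
        for (int j = 0; j < m; j++)
            acc += h[j] * x[i + j];      /* multiply-accumulate over the taps */
        y[i] = acc;
    }
}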


6. Experiment and Resource Analysis

This section presents the resource analysis and performance analysis of the designed reconfigurable CNN acceleration coprocessor based on a Xilinx FPGA. The FPGA model is Xilinx xc7a100tftg256-1 (Xilinx, San Jose, CA, USA), and the synthesis tool is Vivado 18.1 (Xilinx, San Jose, CA, USA). FPGA circuit synthesis is performed on the E203 SoC connected to the coprocessor. Table 5 shows the resource consumption of each main functional unit.

Table 5. E203 SoC resource consumption list.

Module | Slice LUTs | Slice Registers | DSPs
E203 kernel | 4338 | 1936 | 0
Coprocessor | 8534 | 7023 | 21
CLINT | 33 | 132 | 0
PLIC | 464 | 411 | 0
DEBUG | 235 | 568 | 0
Peripherals | 3199 | 2281 | 0
Others | 1092 | 1325 | 0
Total | 17,913 | 13,676 | 21

As shown in Table 5, the E203 kernel and the coprocessor account for most of the resource consumption in the E203 SoC. The E203 core accounts for 24.2% of the total LUT consumption, while the designed coprocessor accounts for 47.6% of the LUT resources in the SoC.

To evaluate the performance of the coprocessor, the four basic algorithms of convolution, pooling, ReLU, and matrix addition are implemented in two ways: one using the I and M subsets of the RISC-V standard instruction set, and the other using the coprocessor instructions designed in this paper. The number of execution cycles is then compared between the two implementations. A testbench file loads the compiled binary file as an input stimulus for the entire E203 SoC, and simulation is used to count the execution cycles of each algorithm under both implementations. The experimental results are shown in Table 6.

Table 6. Number of execution cycles of each algorithm implemented by different instructions.

Algorithm | RV32IM Instructions (cycles) | Coprocessor Instructions (cycles)
Convolution (4 × 4 matrix and 3 × 3 convolution kernel) | 12,982 | 2070
Pooling (6 × 6 matrix and 2 × 2 filter) | 2452 | 1706
ReLU (6 × 6 matrix) | 1249 | 1101
Matrix addition (6 × 6 matrix) | 2755 | 2005

As can be seen from Table 6, using the coprocessor accelerates all four algorithms. The acceleration ratio of each algorithm is the cycle count of the RV32IM implementation divided by that of the coprocessor implementation (for convolution, 12,982 / 2070 ≈ 6.27); the ratios are shown in Figure 18.

Figure 18. Acceleration ratio of each algorithm.

It can be seen from Figure 18 that the coprocessor has the most obvious acceleration effect on the convolution algorithm, which runs 6.27 times faster than with the standard instruction set. This is because, on the one hand, the coprocessor performs the convolution calculation in a dedicated hardware unit, while the RISC-V main processor performs it in software; on the other hand, the coprocessor architecture reduces data movement, further speeding up algorithm processing.

7. Conclusions

In this paper, we optimize the compact CNN accelerator designed in Reference [16] and interconnect the operation units in the acceleration chain with a crossbar to realize the reconfigurable design of a CNN accelerator. Furthermore, the accelerator is connected to the E203 core in the form of a coprocessor, and a reconfigurable CNN acceleration coprocessor is designed. Based on the EAI interface, the hardware design of the reconfigurable CNN-accelerated coprocessor is completed. The designed custom coprocessor instructions are introduced, and the open-source compilation tool chain GCC is modified to complete the compilation environment. Coprocessor instructions are packaged into a C language function interface using inline assembly, and coprocessor acceleration library functions are established. The implementation process of common algorithms on the coprocessor is described. Finally, the resource evaluation of the coprocessor was completed based on the Xilinx FPGA. The evaluation results show that the designed coprocessor consumes only 8534 LUTs, accounting for 47.6% of the entire SoC system. By implementing the four basic algorithms of convolution, pooling, ReLU, and matrix addition with both the RISC-V standard instruction set and the custom coprocessor instructions, the number of running cycles is used to evaluate the acceleration performance of the coprocessor. The results show that the coprocessor instruction set has a significant acceleration effect on these four algorithms, and the acceleration of the convolution reaches 6.27 times that achieved by the standard instruction set.


Author Contributions: Concept and structure of this paper, N.W.; Resources, F.G. and T.J.; Supervision, L.Z.; Writing—original draft, N.W.; Writing—review and editing, F.G. and F.Z. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the National Natural Science Foundation of China (No. 61774086) and the Fundamental Research Funds for Central Universities (No. NP2019102, NS2017023).

Acknowledgments: The authors would like to thank Lei Zhang for his beneficial suggestions and comments.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Sheth, A.P. Internet of Things to Smart IoT Through Semantic, Cognitive, and Perceptual Computing. IEEE Intell. Syst. 2016, 31, 108–112.
2. Shi, W.S.; Cao, J.; Zhang, Q.; Li, Y.H.; Xu, L.Y. Edge Computing: Vision and Challenges. IEEE Internet Things J. 2016, 3, 637–646. [CrossRef]
3. Varghese, B.; Wang, N.; Barbhuiya, S.; Kilpatrick, P.; Nikolopoulos, D.S. Challenges and Opportunities in Edge Computing. In Proceedings of the 2016 IEEE International Conference on Smart Cloud (SmartCloud), New York, NY, USA, 18–20 November 2016.
4. Chua, L.O. CNN: A Paradigm for Complexity; World Scientific: Singapore, 1998.
5. Sainath, T.N.; Kingsbury, B.; Saon, G.; Soltau, H.; Mohamed, A.R.; Dahl, G.; Ramabhadran, B. Deep Convolutional Neural Networks for Large-scale Speech Tasks. Neural Networks 2015, 64, 39–48. [CrossRef]
6. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 15 February 2015.
7. Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 2017, 105, 2295–2329. [CrossRef]
8. Parashar, A.; Rhu, M.; Mukkara, A.; Puglielli, A.; Venkatesan, R.; Khailany, B.; Emer, J.; Keckler, S.W.; Dally, W.J. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. In Proceedings of the 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017.
9. Chi, P.; Li, S.; Xu, C.; Zhang, T.; Zhao, J.; Liu, Y.; Wang, Y.; Xie, Y. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea, 18–22 June 2016.
10. Hardieck, M.; Kumm, M.; Möller, K.; Zipf, P. Reconfigurable Convolutional Kernels for Neural Networks on FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, New York, NY, USA, 24–26 February 2019.
11. Bettoni, M.; Urgese, G.; Kobayashi, Y.; Macii, E.; Acquaviva, A. A Convolutional Neural Network Fully Implemented on FPGA for Embedded Platforms. In Proceedings of the 2017 New Generation of CAS (NGCAS), Genova, Italy, 6–9 September 2017.
12. Liu, B.; Zou, D.; Feng, L.; Feng, S.; Fu, P.; Li, J.B. An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution. Electronics 2019, 8, 281. [CrossRef]
13. Kurth, A.; Vogel, P.; Capotondi, A.; Marongiu, A.; Benini, L. HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA. In Proceedings of the Workshop on Computer Architecture Research with RISC-V (CARRV), Boston, MA, USA, 14 October 2017.
14. Ajayi, T.; Al-Hawaj, K.; Amarnath, A. Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC in TSMC 16nm. In Proceedings of the Workshop on Computer Architecture Research with RISC-V (CARRV), Boston, MA, USA, 14 October 2017.
15. Matthews, E.; Shannon, L. Taiga: A Configurable RISC-V Soft-Processor Framework for Heterogeneous Computing Systems Research. In Proceedings of the Workshop on Computer Architecture Research with RISC-V (CARRV), Boston, MA, USA, 14 October 2017.
16. Ge, F.; Wu, N.; Xiao, H.; Zhang, Y.; Zhou, F. Compact Convolutional Neural Network Accelerator for IoT Endpoint SoC. Electronics 2019, 8, 497. [CrossRef]
17. Patterson, D.; Waterman, A. The RISC-V Reader: An Open Architecture Atlas; Strawberry Canyon: Berkeley, CA, USA, 2017.
18. Available online: https://riscv.org/ (accessed on 26 February 2020).
19. Available online: https://github.com/SI-RISCV/e200_opensource (accessed on 26 February 2020).
20. El-Sawy, A.; El-Bakry, H.; Loey, M. CNN for Handwritten Arabic Digits Recognition Based on LeNet-5. Springer Sci. Bus. Media 2016, 533, 566–575.
21. Bhogal, R.K.; Agrawal, A. Image Edge Detection Techniques Using Sobel, T1FLS, and IT2FLS. In Proceedings of the ICIS 2018, San Francisco, CA, USA, 13–16 December 2018.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).