
Article

Design of Power-Efficient Training Accelerator for Convolution Neural Networks

JiUn Hong 1, Saad Arslan 2 , TaeGeon Lee 1 and HyungWon Kim 1,*

1 Department of Electronics Engineering, Chungbuk National University, Chungdae-ro 1, Seowon-gu, Cheongju 28644, Korea; [email protected] (J.H.); [email protected] (T.L.)
2 Department of Electrical and Computer Engineering, COMSATS University Islamabad, Park Road, Tarlai Kalan, Islamabad 45550, Pakistan; [email protected]
* Correspondence: [email protected]

Abstract: To realize deep learning techniques, a type of deep neural network (DNN) called a convolutional neural network (CNN) is among the most widely used models aimed at image recognition applications. However, there is growing demand for light-weight and low-power neural network accelerators, not only for inference but also for the training process. In this paper, we propose a training accelerator that provides low power and a compact chip size targeted for mobile and edge computing applications. It achieves real-time processing of both inference and training using concurrent floating-point data paths. The proposed accelerator can be externally controlled and employs resource sharing and an integrated convolution-pooling block to achieve low area and low energy consumption. We implemented the proposed training accelerator in an FPGA (Field Programmable Gate Array) and evaluated its training performance using an MNIST CNN example in comparison with a PC with a GPU (Graphics Processing Unit). While both methods achieved a similar training accuracy of 95.1%, the proposed accelerator, when implemented in a silicon chip, reduced the energy consumption by 480 times compared to the counterpart. Additionally, when implemented on an FPGA, an energy reduction of over 4.5 times was achieved compared to the existing FPGA training accelerator for the MNIST dataset. Therefore, the proposed accelerator is more suitable for deployment in mobile/edge nodes compared to the existing software and hardware accelerators.

Keywords: training accelerator; neural network; CNN; AI chip

1. Introduction

Deep learning is a type of machine learning based on artificial neural networks. An artificial neural network (ANN) is a neural network whose structure is modeled based on the human brain.

A convolution neural network (CNN) extends the structure of an ANN by employing convolutional filters and feature map compression layers called pooling to reduce the need for a large number of weights. Recently, a great deal of research has been conducted in optimizing and modifying CNNs to target many applications such as image classification [1], object detection [2], and speech recognition [3], and good results have been reported. Such CNN techniques are already widely used in the industry, and they are expected to penetrate daily life, such as with autonomous driving [4] and home automation [5].

Recently, the development of IoT (Internet of Things) products and smartphones has been remarkable, consequently leading to a growing demand for CNN accelerators, not only for high-speed servers but also for mobile/edge devices. The requirement of high accuracy and high speed in CNN applications such as image recognition, classification, and segmentation has led to extensive research efforts in developing high-speed CNN accelerators [6-9]. However, most accelerators are targeted for either high-speed servers for training large CNNs or low-power compact NPUs (Neural Processing Units) for the inference operations of pretrained CNN models.


The traditional accelerators for high speed tend to consume too much power for mobile and edge applications. To overcome this, many energy-efficient CNN accelerators for inference that are better suited to mobile/edge devices have been researched [10,11]. Some researchers have proposed software accelerators and compression techniques for the inference operation of deep learning models on mobile devices [12,13]. Most CNN inference accelerators download pretrained weights (parameters) that are calculated in advance by utilizing available trained datasets. Training involves more complex and resource-demanding processing, such as in the case of gradient-based learning for the image classification of MNIST [14]. For training in mobile/edge devices, therefore, using a general-purpose CPU (Central Processing Unit) or GPU (Graphics Processing Unit) is not realistic due to their large size and poor energy efficiency.

There have been many attempts to develop hardware accelerator architectures for the training of neural networks [13,15-19]. On-device training in edge nodes is getting increasingly common due to privacy concerns, improved performance-to-power ratios, and reduced bandwidths [13,17,19]. Another motivation for implementing edge nodes that are capable of training lies in their ability to perform online training (or even self-training) for improved accuracy [20,21]. Some existing architectures are not suitable for implementation in edge nodes due to their large size [15,19], multi-FPGA implementation [16], or low energy efficiency [13,19]. Therefore, there is a need for an energy-efficient learning accelerator for mobile/edge devices that can output high accuracy with fewer layers.

Despite the growing need for fast/low-power solutions for training deep neural networks (DNNs) on mobile devices [22], only real-time inference solutions using dedicated hardware units and processors are now common. This is because the computing and memory resources of mobile devices are too limited to train deep learning models. To this end, research on mobile high-speed/low-power training hardware for DNNs is still in its dawn. Currently, such low-power DNN training hardware suffers from issues of poor learning accuracy. The complexity and difficulty of developing mobile training hardware have significantly increased because such hardware needs to support self-training functions such as semi-supervised [23] and unsupervised learning beyond supervised learning.

The goal of this research was to act as a step towards developing a chip that has the capability of updating model weights while being deployed in a mobile environment. The update of weights is carried out to improve personalization to the user/environment, resulting in improved accuracy. The training process of CNNs typically consists of three steps: (1) inference (IF), (2) backpropagation (BP), and (3) weight gradient update (GU). These three steps are repeated using a large amount of data to obtain highly accurate weights. Some studies have reported the evaluation of embedded hardware architectures optimized for OpenCL (Open Computing Language) and neural networks (NNs) to enable learning [24]. However, it is difficult to apply such embedded hardware to mobile/edge devices, because their embedded software requires high-speed CPUs that often incur high power consumption.
Moreover, most of the previous low-power accelerators aimed at inference [25] only implemented integer operators [26] to optimally handle quantized integer weights, so they lacked floating-point (FP) operations. However, training processes require FP operations to process a wide range of numbers. To calculate such values, the authors of [27] implemented an accelerator employing FP operators. However, it consumes excessive power in FP calculations because it needs to cover a wide dynamic range of intermediate values.

In this paper, we present an energy-efficient CNN training accelerator for mobile/edge devices. The key contributions of this paper are as follows.

1. We designed an accelerator architecture that can conduct both inference and training at the same time in a single chip.
2. We present a design and verification methodology for the accelerator architecture and chip implementation by converting a high-level CNN to a hardware structure and comparing the computation results of the hardware against the model.

3. We designed an architecture that shares floating point units (FPUs) to reduce the power consumption and area.
4. We designed a backpropagation architecture that can calculate the gradients of the input and weight simultaneously.
5. We integrated convolution and pooling blocks to improve performance and reduce memory requirements.

Section 2 briefly describes the basic principles and behavior of CNNs. Section 3 discusses the design of the CNN accelerator and the improvements. Section 4 shows the experimental results of the designed accelerator, and finally Section 5 gives the conclusions of this paper.

2. Background

CNN is an abbreviation for a convolutional neural network and refers to a structure that has layers that perform convolution and pooling operations. The convolution operation extracts the characteristics of the data. Much research is currently underway using CNN structures because they promise high accuracy for image processing and classification applications.

Figure 1 shows an example CNN structure used for classification, which includes one convolution, one pooling, and two fully-connected (FC) layers. Here, each layer serves a specific purpose in obtaining a classification result. The convolution layer extracts features to produce a feature map, and the pooling layer reduces the size of the feature map. Finally, the FC layer produces the classification result based on the features found in the feature map.


Figure 1. An example convolutional neural network (CNN) classification structure.

Our CNN structure targeted the MNIST dataset, with 10 classes representing handwritten digits 0-9. Therefore, the output data length was reduced to ten after passing through the two FC layers, and these ten values were stored in the output classification memory. Here, each data element of the memory represents the probability of the corresponding class among the digits 0-9. The highest value among the ten data elements is most likely the correct answer. If, for example, the seventh value in the memory has the largest value, it is determined that digit 6 is provided as the input image. Section 2.1 discusses training in general for CNNs, Sections 2.2-2.4 discuss the operation of the various layers in CNNs during inference and training, and Section 2.5 discusses the operation of the Softmax activation and cross entropy loss functions in CNNs.



2.1. Training in Convolutional Neural Networks

Most layers in convolutional neural networks involve manipulating the input data based on the filter/weight parameters. Training is the automatic acquisition of the optimal values of the weight and filter parameters from the training data. During training, the filter and weight parameters are tuned to achieve the desired operation. The indicators used for the training of neural networks are called loss functions. The final goal of training is to find the weight parameters that make the outcome of the loss function the smallest. This requires calculating the derivatives (gradients) of the parameters. We repeated the process of gradually updating the parameter values, using the calculated derivative values as a guide, to explore the optimal parameters.

The training process is composed of two sub-processes. Firstly, an input image is applied to the network and the image is propagated forward to obtain a classification output from the network. Secondly, the computed loss values are back-propagated to update the weight and filter parameters. This process is repeated multiple times over an available training dataset of images to reach optimal parameter values. Once a network is fully trained, only forward propagation is carried out to classify images, and this process is referred to as inference.

Computational graphs can be used to understand the process of forward and backward propagation, as shown in Figure 2. In a computational graph, the calculation that proceeds from left to right is called forward propagation (forward), and the right-to-left calculation is called backpropagation (backward). In Figure 2, the black arrow explains the forward process, and the red arrow explains the backward process.

Figure 2. A simple computational graph of forward and backward processes.

Firstly, in forward propagation, the input signal 'x' is manipulated by a function f(x) to obtain the output signal 'y', as represented by Equation (1). Next, in the backward computational procedure, the incoming signal Z is multiplied by the local gradient (∂y/∂x) of the node and then transferred to the next (left) node. Here, the local gradient is the derivative of the expression in Equation (1), used during the inference process.

y = f(x)   (1)

2.2. Fully Connected Layer

Most neural networks use an FC layer due to its generic nature, as the FC layer does not make any assumption on the nature of the input data. Each FC layer contains several neurons/nodes, each of which accepts all inputs and performs a weighted sum. Therefore, each neuron is capable of learning to identify features beyond spatial coherence. The fully connected layer, as a whole, multiplies all the weights and all the inputs to produce the result. Since all inputs and neurons meet, it is called a fully connected layer.

During the inference (forward) operation, the FC layer works the same way as matrix multiplication. The output result is produced by multiplying the input X and weight W matrices, as shown in Figure 3a. Assuming an FC layer with ten inputs and ten neurons (outputs), the required weight matrix size is 10 × 10 (100 weight parameters).



Figure 3. Operation of fully connected layer (a) in the inference process and (b) in the training process.

Figure 3b shows the backward process of the FC layer, where the produced loss value is L. It is also convenient to represent the backward process using matrix multiplication, as expressed by Equations (2) and (3). Equation (2) is used to calculate the gradient of the input, while Equation (3) is used to calculate the gradient of the weight. Here, W^T represents the transpose of the weight matrix W. The gradient of the input (∂L/∂X) is propagated backward to the previous (left) layers, while the gradient of the weights is used to update the weights in this layer.

∂L/∂X = (∂L/∂Y) · W^T   (2)

∂L/∂W = X^T · (∂L/∂Y)   (3)
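The following NumPy sketch mirrors Equations (2) and (3) for the 10-input, 10-neuron example above; it is a minimal software analogue for checking the matrix forms, not the accelerator's floating-point data path.

```python
import numpy as np

# Minimal FC layer analogue of Equations (2) and (3); sizes follow the
# 10-input, 10-neuron example in the text (a sketch, not the RTL design).
X = np.random.randn(1, 10)   # input row vector
W = np.random.randn(10, 10)  # weight matrix

Y = X @ W                    # forward: matrix multiplication (Figure 3a)

dY = np.random.randn(1, 10)  # gradient arriving from the next layer (dL/dY)
dX = dY @ W.T                # Equation (2): dL/dX, propagated to the previous layer
dW = X.T @ dY                # Equation (3): dL/dW, used to update this layer's weights
```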

2.3. Convolution Layer

The convolution layer convolves the input image with a set of filters to detect the characteristics of an image. Each filter activates when its corresponding feature is found at a spatial position in the input image. As a result, a feature map that highlights the spots where the corresponding feature is found in the input image is produced.

The convolution operation is performed by applying a filter to the input data. Figure 4 shows an example of a convolution operation. The convolution operation applies the filter to the input data while moving it at regular intervals (stride). In each stride, the corresponding elements of the input image and the filter are multiplied together. The intermediate products are added together and stored, as shown in Figure 4 (steps 1-9). In each step of Figure 4, the multiply and accumulate (MAC) process is carried out in nine single multiply-accumulate steps (fused multiply-add: FMA). For example, the calculation in Figure 4-(1) is performed as shown in Equation (4):

1 × 1 + 1 × 0 + 1 × 1 + 0 × 0 + 1 × 1 + 1 × 0 + 0 × 1 + 0 × 0 + 1 × 1 = 4   (4)

The result is then stored in the corresponding place of the output memory. Performing this process in all locations completes the convolutional output operation. In the inference process, the input image and filter are processed using the FMA. During backpropagation, the process is similar to forward propagation, which is convolution. However, during the backpropagation, the output gradient is convolved with the input image to obtain the gradient of the filter/weight. The gradient of the input is used as an input value in the calculation of the next layer, and the gradient of the weight is used to update the weight/filter value of the current layer.
Therefore, for the first layer, the calculation of the gradient of the input can be skipped. Figure 5 shows the operation to obtain the gradient of the weight for the convolutional layer during the backpropagation process. The backpropagated value from the pooling layer is convolved with the input image to obtain the gradient of the weights. When there are multiple filters, the process is repeated for each filter using the corresponding backpropagated values from the pooling layer.
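The MAC-based convolution of Figure 4 and the weight-gradient convolution of Figure 5 can be reproduced with a few lines of NumPy; the sketch below is a direct (unoptimized) software illustration with stride 1 and no padding, using the 5 × 5 image and 3 × 3 filter of the figures.

```python
import numpy as np

def conv2d(image, kernel):
    """Direct 2-D convolution (stride 1, no padding), one MAC per output element."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # nine multiply-accumulate steps per position, as in Equation (4)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]], dtype=float)
filt = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]], dtype=float)

feature_map = conv2d(image, filt)   # forward pass (Figure 4): [[4,3,4],[2,4,3],[2,3,4]]
d_out = np.random.randn(*feature_map.shape)  # gradient coming back from the pooling layer
d_filt = conv2d(image, d_out)       # weight gradient (Figure 5): image convolved with d_out
```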



Figure 4. Operation of convolution layer in inference.

Figure 5. Operation of convolution layer in backpropagation.

2.4. Pooling Layer

A pooling layer is used to reduce the dimensions of the feature map produced by the convolutional layer. By reducing the dimensions, the memory and processing requirements are lowered in the following layers. The pooling divides the feature map into multiple regions and then replaces the values in each region with only one representative value. The commonly used pooling methods are average pooling and max pooling, where the representative value is the average or maximum value of that region, respectively. Figure 6a,b shows examples of average and max pooling on an input image, respectively. This figure uses an input image size of 4 × 4, a pooling region size of 2 × 2, and a stride of 2. As a result, the pooling output has a size of 2 × 2.




Figure 6. Operation of average pooling (a) and max pooling (b) in inference.
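For a 4 × 4 input, 2 × 2 regions, and a stride of 2 as in Figure 6, both pooling variants reduce to a reshape and a reduction in NumPy. The sketch below is only a software illustration; the input values follow the Figure 6 example as far as it can be recovered.

```python
import numpy as np

# 4 x 4 input, 2 x 2 pooling regions, stride 2 (values as in Figure 6)
x = np.array([[3, 10, 9, 12],
              [7,  0, 4, 11],
              [5,  6, 5, 13],
              [4,  1, 9,  5]], dtype=float)

# Split into non-overlapping 2 x 2 regions: shape (2, 2, 2, 2)
regions = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3)

avg_pool = regions.mean(axis=(2, 3))  # Figure 6a: [[5, 9], [4, 8]]
max_pool = regions.max(axis=(2, 3))   # Figure 6b: [[10, 12], [6, 13]]
```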

During forward propagation, when the feature map passes through a pooling layer, the feature map size is reduced to a predetermined size. Conversely, in backpropagation, when the feature map passes through the pooling layer, the image size needs to be increased. The size is increased to match the size of the original feature map produced by convolution during forward propagation. To this end, the location of the largest value during the forward pooling process is stored in the argmax memory. This allows the pooling layer to remember which parts were extracted and the size of the original feature map, which can be used during backpropagation.

Figure 7a shows the operation of the pooling layer during the forward process in training. In addition to storing the pooling result for each region, it stores the position of the maximum value extracted from each region. For example, the first region has the largest value 10 at position 1, the second region has the largest value 12 at index 2, and so on. This process is repeated until the pooling is complete. During backpropagation, the pooling layer restores the original size of the convolution output, as shown in Figure 7b.
Moreover, it places the backpropagated values at the corresponding positions stored in the argmax memory. For inference only, during forward propagation, argmax is not needed.


Figure 7. Operation of pooling layer in backpropagation: (a) forward propagation storing maximum value position and (b) backward propagation utilizing argmax memory to increase size.
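A software analogue of the argmax mechanism of Figure 7 is sketched below; encoding each maximum position as a flat index into the input is an assumption made for illustration and need not match the accelerator's actual argmax memory format.

```python
import numpy as np

def max_pool_with_argmax(fmap, size=2):
    """Forward pooling that also records where each maximum came from (Figure 7a)."""
    H, W = fmap.shape
    pooled = np.zeros((H // size, W // size))
    argmax = np.zeros((H // size, W // size), dtype=int)  # flat index into the input
    for i in range(0, H, size):
        for j in range(0, W, size):
            region = fmap[i:i + size, j:j + size]
            r, c = np.unravel_index(np.argmax(region), region.shape)
            pooled[i // size, j // size] = region[r, c]
            argmax[i // size, j // size] = (i + r) * W + (j + c)
    return pooled, argmax

def max_pool_backward(d_pooled, argmax, shape):
    """Backward pass (Figure 7b): scatter gradients back to the stored positions."""
    d_fmap = np.zeros(shape)
    np.put(d_fmap, argmax.ravel(), d_pooled.ravel())
    return d_fmap
```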



2.5. Softmax and Cross Entropy Error

Softmax is a function used to represent the resulting value of a fully connected layer as a probability between 0 and 1. In general, the Softmax function is used for classification and is expressed as Equation (5):

y_k = exp(a_k) / ∑_{i=1}^{n} exp(a_i)   (5)

where exp(x) is the exponential function e^x, n is the number of neurons in the output layer, and y_k is the output of the k-th neuron.

Usually, when training, we use a loss function as the last operation. The loss function plays the important role of searching for the optimal parameter values. The loss function is an indicator of the "bad" performance of a neural network: it indicates how poorly the current neural network processes the training data. The smaller the value of the loss function, the smaller the error. If Softmax is used as the activation function, the loss is calculated using the cross entropy error (CEE) as the loss function.

The forward and backward processes of the Softmax and CEE are shown in Figure 8. The black arrows indicate the forward process, and the red arrows indicate the backpropagation process of each function. The final value produced by the forward process is the output of the loss function, represented by L. The result of the Softmax and CEE functions while propagating backwards is y1 − t1, y2 − t2, y3 − t3, ···. Since y1, y2, y3, ··· represent the outputs of the Softmax function and t1, t2, t3, ··· are the correct answer labels, y1 − t1, y2 − t2, y3 − t3, ··· represents the error that is used in calculating the gradients of the inputs. Since this paper focused on updating the weight gradient, we did not use a log value in the CEE layer for the calculation of the loss value.

Figure 8. Operation of Softmax and cross entropy error function.
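The combined behavior of the Softmax and CEE block in Figure 8 can be checked with the short NumPy sketch below; it keeps the log-based CEE for the loss value, whereas the accelerator itself omits the log term, as noted above.

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)                  # shift for numerical stability
    e = np.exp(a)
    return e / np.sum(e)               # Equation (5)

a = np.random.randn(10)                # FC2 outputs for the 10 classes
t = np.zeros(10); t[6] = 1             # one-hot label (e.g., digit 6)

y = softmax(a)
loss = -np.sum(t * np.log(y + 1e-12))  # cross entropy error (log kept in software)
grad = y - t                           # backward result of Softmax + CEE: y_k - t_k
```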

3. Architecture of Proposed Training Accelerator

The key goal of our proposed accelerator was to target the low-area and energy-efficient design of mobile/edge training hardware. As the number of layers in the neural network model increases, the area and power also tend to increase. We therefore employed a network compression technique comprising layer pruning and weight quantization. As a result, we produced an optimized CNN that is shallower, with fewer filters and narrower bit-widths for weight parameters, while maintaining a high accuracy. To prove the effectiveness of the proposed architecture and design methodology, we chose a small CNN and the MNIST dataset for training and validation while setting the target accuracy at 95% or higher.

Figure 9 shows an overall block diagram of the proposed accelerator.


The proposed accelerator comprises five layers, which include one convolution layer, one pooling layer, two FC layers, and one Softmax layer. The blue part indicates the data path blocks for the inference process (the forward propagation), while the red part indicates the ones for the training process (backpropagation). The training process calculates the gradient of the input and the gradient of the weight with respect to the output loss values. The modules named dout calculate the gradient of the input, and the modules marked by dW calculate the derivatives of the weight. The dout and dW gradients are repeatedly calculated for every layer, except for the first convolutional layer. For the first convolution layer, only dW is calculated during backpropagation, since there is no layer prior to this layer in need of dout.

Each layer except convolution has a memory that stores the result of forward and backward propagation. The Softmax function employs a lookup table of 156 entries to calculate the exponential function e^x for values of x in the range from −31 to 0. For the inference mode, the weight parameters of each layer are loaded from the outside and used, while for the training mode, the weight parameters are updated through the backpropagation operations.
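The exponential lookup table mentioned above can be modeled in software as follows; the uniform spacing of the 156 entries over [−31, 0] and the nearest-entry lookup are our assumptions for illustration, since the exact quantization used in the chip is not specified here.

```python
import numpy as np

# Hypothetical model of the 156-entry exponential lookup table over [-31, 0].
LUT_X = np.linspace(-31.0, 0.0, 156)   # assumed uniform grid (step of 0.2)
LUT_EXP = np.exp(LUT_X)                # stored table values

def exp_lut(x):
    """Approximate exp(x) for x in [-31, 0] by nearest-entry table lookup."""
    x = np.clip(x, -31.0, 0.0)
    idx = np.rint((x + 31.0) / (31.0 / 155)).astype(int)
    return LUT_EXP[idx]
```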


Figure 9. Architecture of inference and training hardware blocks. FC: fully-connected. (MEM: memory)

3.1. CNN Model Optimization and Training Procedure

For the CNN optimization step, we used Tensorflow with Keras to determine the minimal number of layers in the target CNN. Figure 10 shows the accuracy and loss values changing (shown on the Y-axis) through a training process of 20 epochs (shown on the X-axis). It can be observed from Figure 10b that the loss value decreased as the epoch increased, while the accuracy increased up to the target accuracy of 95%.

If the validation accuracy satisfied the target accuracy, we continued to reduce the CNN model by applying the network compression technique comprising layer pruning and weight quantization. We repeated this CNN model optimization and training process until the validation accuracy fell short of the target accuracy. Then, we chose the latest compressed model that satisfied the target accuracy.

In the step of CNN model optimization, we employed a high-level framework based on Tensorflow with the Keras library, which is well-suited to repeated trials of network pruning and weight quantization followed by a training process.
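A minimal Tensorflow/Keras sketch of the kind of model iterated in this step is shown below. The layer sizes and activations are placeholders that only follow the general one-convolution, one-pooling, two-FC structure described in this paper, not the final pruned configuration, and the commented training call uses hypothetical variable names.

```python
import tensorflow as tf

# Hypothetical compact MNIST CNN: one convolution, one pooling, two FC layers.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(4, (3, 3), padding="same", activation="relu",
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",                 # Adam is used in the Keras model
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=20, validation_data=(x_val, y_val))
```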




Figure 10. Verified accuracy and loss of TensorFlow model: (a) training and validation accuracy; (b) training and validation loss.

On the other hand, this framework was not suitable for designing and verifying the accelerator hardware, since the functions of the Tensorflow and Keras library are predefined and, furthermore, hidden. Therefore, once CNN optimization was finished, we converted the CNN model from Tensorflow/Keras to generic Python code using only the NumPy library. For the remaining design steps and verification of the accelerator architecture followed by the RTL (Register Transfer Level) design, we used generic Python code.

When comparing the code composed only of the Python language and the Keras model on the same dataset, it was confirmed that almost the same accuracy was obtained. There were some small differences because the input data are entered in random order.

The next design step was determining the optimizer for the backpropagation of the loss gradient. While the high-level model based on Tensorflow and Keras used Adam as an optimizer, the generic Python model employed a momentum optimizer, which is more hardware-friendly. The Adam optimizer requires complex arithmetic functions such as square-root and divider functions, which incur high complexity and large chip sizes. On the other hand, the momentum optimizer only requires multipliers and adders, which can be easily implemented in hardware with an RTL design. We compared the accuracy between the two CNN models, Tensorflow with Keras and generic Python, and confirmed that there was little difference in the accuracy after the complete training of long epochs. Figure 11 shows the accuracy of the CNN modeled in generic Python code using the momentum optimizer.

Figure 11. Result of native Python model training and testing.
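A momentum weight update of the kind used in the generic Python model can be written with only multiplications and additions, as in the sketch below; the learning rate and momentum coefficient are placeholder values rather than the ones used for the reported results.

```python
import numpy as np

def momentum_update(w, dw, velocity, lr=0.01, momentum=0.9):
    """Momentum optimizer step: multipliers and adders only (hardware-friendly)."""
    velocity = momentum * velocity - lr * dw   # decaying running sum of gradients
    w = w + velocity                           # apply the update
    return w, velocity

# Example usage for one FC weight matrix
w = np.random.randn(10, 10)
v = np.zeros_like(w)
dw = np.random.randn(10, 10)                   # gradient from backpropagation
w, v = momentum_update(w, dw, v)
```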


Finally, the accelerator's architecture and RTL design were developed, and their functionality was verified against the generic Python model by comparing the outputs of each layer.

3.2. Combined Convolution and Pooling Operation
In conventional CNNs, large memories are required to store the output feature maps produced by each convolutional layer before pooling is performed. For example, with an input image size of 5 × 5 (25 values) and a filter size of 3 × 3 (9 values), the convolution layer (with stride = 1) produces a feature map of size 3 × 3. Moreover, the feature map size increases with the number of filters, regardless of the number of channels in the input image. For instance, a convolutional layer with four 3 × 3 filters will produce an output of size 3 × 3 × 4. In general, when performing the pooling operation, the pooling stride is the same as the pooling region size. The pooling and convolution layers, however, differ in stride and size, so they cannot be operated simultaneously. Therefore, a convolution feature map memory is required to save the results, which are used as input by the pooling layer after the convolution operation is completed. With an increase in image size and number of filters, the feature map memory grows to the point where it significantly impacts the chip size of the accelerator. It is therefore necessary to reduce the feature map memory size for mobile/edge applications.
The proposed accelerator takes an input image of size 28 × 28 (784 values). In addition, it employs padding to prevent image size loss after convolution. After adding 1-pixel padding, the image processed by the convolutional layer has a size of 30 × 30 (900 pixel values). The convolutional layer comprises four 3 × 3 filters and thus conventionally requires a memory of size 28 × 28 × 4 (=3136 values) to store the output feature maps. In order to reduce the memory size, the convolutional layer and the pooling layer are combined into a single operation block, so that the two layers perform their operations at the same time and only need memory to store the pooling result.
Figure 12 shows the operation of the proposed block that combines the convolution and pooling operations. In a conventional convolution operation, the filter moves horizontally in steps specified by the stride and moves vertically only upon reaching the edge of the image. However, in the proposed block, the movement of the filter considers the regions defined for pooling in Section 2.3. The filter moves in a zigzag manner within a region and moves to the next region only upon completing the current region. This allows pooling to be performed on the completed region while the filter is ready to move to the next region.
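As a behavioral illustration of this combined block (a minimal NumPy sketch under the assumptions of a stride-1 convolution with 1-pixel zero padding and 2 × 2 max pooling, not the RTL itself), the zigzag walk below produces one pooled value per region using only a small per-region register:

```python
import numpy as np

def conv_pool_fused(image, kernel, pool=2):
    """The 3x3 filter visits the convolution positions of one pooling
    region in a zigzag order, the partial results go into a small
    register, and the pooled value is written out before moving on."""
    H, W = image.shape
    k = kernel.shape[0]
    padded = np.pad(image, k // 2)                 # 'same' convolution via zero padding
    pooled = np.zeros((H // pool, W // pool), dtype=float)

    for pr in range(0, H, pool):                   # pooling-region row
        for pc in range(0, W, pool):               # pooling-region column
            region_reg = np.empty(pool * pool)     # small register, not a feature-map memory
            idx = 0
            for dr in range(pool):                 # zigzag inside the region
                cols = range(pool) if dr % 2 == 0 else range(pool - 1, -1, -1)
                for dc in cols:
                    r, c = pr + dr, pc + dc
                    window = padded[r:r + k, c:c + k]
                    region_reg[idx] = np.sum(window * kernel)   # one MAC pass
                    idx += 1
            pooled[pr // pool, pc // pool] = region_reg.max()   # pool as soon as the region completes
    return pooled
```

With this ordering, only the per-region register and the pooled outputs need to be stored, which is the memory saving described above.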


Figure 12. Operation order of the combined convolution and pooling layer.


Figure 12 illustrates the operation of the proposed block in detail, where a small register of size equal to the pooling region is used instead of the feature map memory. In Figure 12, in step 1, the first MAC operation is performed and its result is stored in the register. In step 2, the filter moves to the right and the second MAC operation is performed. In step 3, the filter moves down and to the left to perform the third MAC operation. In step 4, the filter moves to the right and the fourth MAC operation is performed. The results of all four MAC operations are stored in the corresponding register entries, upon which the pooling operation is performed. Similarly, the MAC and pooling operations are repeated for the next pooling regions, thus overwriting the register values.
When using the conventional/sequential method, the convolution takes 3136 clock cycles and pooling takes an additional 196 cycles. Therefore, it takes a total of 3332 clock cycles and needs a feature map memory. On the other hand, the proposed method of Figure 12 takes only 3136 clock cycles, as the pooling operation is calculated in parallel. Moreover, the proposed method only needs a small register instead of the feature map memory, thus reducing the overall chip size.
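A quick back-of-the-envelope check of these figures, reproducing only the numbers quoted above (the per-cycle assumptions are taken from the text, not from the RTL):

```python
FILTERS = 4
OUT_H = OUT_W = 28                          # padded 30x30 input -> 28x28 outputs per filter
POOL_CYCLES = 196                           # pooling cycles quoted in the text (14 x 14)

conv_cycles = OUT_H * OUT_W * FILTERS        # 3136: one MAC pass per output position
sequential_cycles = conv_cycles + POOL_CYCLES  # 3332: pooling runs after convolution
fused_cycles = conv_cycles                   # 3136: pooling overlaps the MAC passes

feature_map_words = OUT_H * OUT_W * FILTERS  # 3136-entry memory in the conventional flow
register_words = 2 * 2                       # per-region register in the fused flow

print(sequential_cycles, fused_cycles, feature_map_words, register_words)
```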

3.3. Operation of Training and Inference Mode
The proposed neural network accelerator supports two modes of operation: inference and training. When trained weights are available, obtained either through the training mode or loaded from an external source, the inference mode is used to classify images. In inference mode, the input images are applied to the first convolution layer and the classification result is obtained from the FC2 layer. This process is repeated once for a given set of images to validate the accuracy of the trained network.
In the training mode, on the other hand, both forward and backward operations are involved in order to acquire trained weights. The forward operation classifies an image, represented by a probability for each class. The final probabilities are compared with the one-hot coded label (true value) for the applied input image to calculate the error. The error is propagated backwards to change the weights/filter values. This process is repeated multiple times for a given set of images to achieve a high classification accuracy. Here, the number of images in the dataset and the number of epochs can be adjusted.
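A high-level sketch of the two modes as described here is given below. It is plain Python pseudostructure: the callables forward, backward, and update_weights stand in for the accelerator's layer pipeline and are assumptions for illustration, not an existing API.

```python
def inference(images, weights, forward):
    """Inference mode: one forward pass per image; the class with the
    highest probability at the FC2 output is the prediction."""
    return [max(range(10), key=lambda c: forward(img, weights)[c]) for img in images]

def train(images, labels, weights, forward, backward, update_weights, epochs=5):
    """Training mode: forward pass, error against the one-hot label,
    backward pass, then a weight update, repeated over the dataset."""
    for _ in range(epochs):
        for img, one_hot in zip(images, labels):
            probs = forward(img, weights)                      # class probabilities
            error = [p - t for p, t in zip(probs, one_hot)]    # compare with one-hot label
            grads = backward(error, weights)                   # propagate error backwards
            weights = update_weights(weights, grads)           # e.g., momentum update
    return weights
```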

3.4. Resource Sharing
For area- and energy-efficient hardware, there is a need to eliminate unused modules and reuse them. Arithmetic operations in the convolution and fully connected layers involve multiplication, addition, and MAC operations in both forward and backward propagation. Moreover, the next layer idles while the current layer utilizes its arithmetic units. Taking advantage of this, the MAC and multiplier units are moved out of each layer. These modules are shared by multiple layers through the use of encoders and decoders. Figure 13 shows the complete structure of the proposed accelerator when multiple blocks share the MAC and multiplier units. When running a model in software using a GPU, the priority is to run immediately so that several operations are processed at once. However, the focus of this paper was on energy efficiency and small area over speed, since the accelerator is to be deployed in energy-efficient mobile/IoT nodes.
Figure 14 illustrates the use of encoders and decoders for the resource sharing of the MAC and multiplier units. Here, the multiplication and MAC units are separated from each other for the sake of speed. In backpropagation, the gradient of the input is calculated using the MAC operation, while the gradient of the weights involves the multiplication operation. These two operations can be performed by reusing one module, but this consumes more time. Therefore, we added one multiplication module (in addition to the MAC unit) to perform the gradient-of-weight and gradient-of-input operations in parallel during backpropagation.
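The sharing scheme can be pictured as a select-driven multiplexer in front of one MAC unit and one multiplier. The behavioral sketch below is a simplification for illustration: the layer names follow Figures 13 and 14, but the interface is an assumption, not the actual encoder/decoder logic.

```python
class SharedArithmetic:
    """One MAC unit and one multiplier time-shared by several layers.
    An encoder-like 'select' picks whose operands reach the unit; the
    result is routed back to the selected layer's output memory."""

    def __init__(self, layers):
        self.ofmaps = {name: [] for name in layers}   # per-layer output storage

    def mac(self, select, operands_a, operands_b, acc=0.0):
        # Multiply-accumulate for the layer chosen by 'select'
        for a, b in zip(operands_a, operands_b):
            acc += a * b
        self.ofmaps[select].append(acc)
        return acc

    def multiply(self, select, a, b):
        # Plain multiplication, e.g., a weight-gradient term during backpropagation
        product = a * b
        self.ofmaps[select].append(product)
        return product

shared = SharedArithmetic(["conv", "fc1", "fc2"])
shared.mac("conv", [0.5, 1.0, -0.25], [0.2, 0.1, 0.4])   # convolution MAC pass
shared.multiply("fc1", 0.37, -0.08)                      # FC1 weight-gradient product
```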


Figure 13. Hardware resource sharing by various layers in the neural network. MAC: multiplication and accumulate.

The circuit proposed in this paper calculates the gradients by reusing the MAC and multiplier/subtractor units, which offers ease of scalability. If more (or larger) layers are needed for the training/inference of more complex images, the memory and MAC/multiplier/subtractor units can be further reused for the calculation. The control logic can conveniently reallocate the memory and timing so that a differently sized network operates seamlessly. For a larger network, a larger memory is needed for storing the parameters and intermediate results. Since the current implementation uses on-chip memories, a network that is too large to fit requires resynthesis.

Figure 14. Resource sharing of operations for forward and backward propagation. (a) Multiplier and add operations for convolution and FC; (b) multiplier for calculating the gradient of the weights in FC.

4. Implementation and Evaluation
We implemented the proposed CNN training accelerator for mobile and edge computing. We designed the architecture and synthesizable RTL code using Verilog. Then, we used the RTL code to implement a silicon chip based on a TSMC 65 nm CMOS process using Synopsys's Design Compiler and IC Compiler, targeting a clock frequency of 100 MHz. To ensure the correctness of the synthesis result, post-synthesis simulations were performed. Since the silicon chip was not ready for measurement yet, we performed an FPGA implementation to verify the correctness of the design. For the silicon chip, the power consumption and chip area were measured using Synopsys tools.
Firstly, we compared the inference accuracy of MNIST classification between (1) the Python model running on a GPU and (2) the proposed accelerator using the Verilog model on an FPGA. For this test, we used the same trained weights in both environments and chose 2200 images from the MNIST test dataset. The trained weights were obtained by training the native Python model to provide an accuracy of 95% or higher. Figure 15 shows the test setup using a Xilinx ZCU104 FPGA board and a Raspberry Pi 4. The Raspberry Pi was used to read/write weights, images, and results from the implemented CNN model via SPI (Serial Peripheral Interface Bus). Moreover, the Raspberry Pi computed the accuracy based on the results obtained from the implemented model on the FPGA.

Figure 15. Environment of simulation using hardware (FPGA board).
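The host-side role of the Raspberry Pi can be sketched roughly as follows. This is a hypothetical spidev-based script: the command bytes, transfer framing, and the test_set placeholder are illustrative assumptions, since the paper does not document the SPI register map.

```python
import spidev

spi = spidev.SpiDev()
spi.open(0, 0)                  # SPI bus 0, chip-select 0 on the Raspberry Pi header
spi.max_speed_hz = 10_000_000

def write_words(command, words):
    """Send a command byte followed by 32-bit payload words to the FPGA."""
    payload = [command]
    for w in words:
        payload += list(int(w).to_bytes(4, "big"))
    spi.xfer2(payload)

def read_result():
    """Read back the predicted class index produced by the accelerator."""
    return spi.xfer2([0xAA, 0x00])[1]   # 0xAA: hypothetical 'read result' command

# write_words(0x02, trained_weights)    # 0x02: hypothetical 'load weights' command
test_set = []                           # placeholder: the 2200 MNIST test images and labels
correct = 0
for image, label in test_set:
    write_words(0x01, image.flatten())  # 0x01: hypothetical 'load image' command
    if read_result() == label:
        correct += 1
print(correct / max(len(test_set), 1))  # accuracy computed on the host
```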





Table 1 shows the evaluation results, revealing an inference accuracy of 96.05% and 2113 out of 2200 correct predictions for both test environments/platforms. This evaluation ensured that the functionality of the proposed accelerator implementation was correct and that its floating-point arithmetic operators were highly accurate.

Table 1. Comparison of Python and RTL test accuracy in inference.

                                      Python Model    Verilog RTL/Hardware Model
Accuracy                              96.05%          96.05%
Number of correct prediction images   2113            2113

Table 2 shows the results of inference after separately training in Python and in Verilog simulation using 2200 training images. Initially, partially trained weights were obtained from Python, providing a test accuracy of 93.5% for 1000 test images. Afterwards, these partially trained weights were provided to both the Python and Verilog models for further training. After separately training the Python and Verilog models, the test accuracy was compared, as shown in Table 2. The Python and Verilog trained models showed similar test accuracies of 95% and 95.1%, respectively. This suggests that even if complete training were carried out separately in Python and Verilog, the deviation would be expected to be small.

Table 2. Comparison of Python and RTL test accuracy for training and inference.

                                      Python Model    Verilog RTL Model
Initial Accuracy (Weights)                       93.5%
Number of Training Data               2200            2200
Number of Inference Data              1000            1000
Accuracy                              95%             95.1%
Number of Correct Prediction Images   950             951

To highlight the benefit of resource sharing in terms of area and energy consumption, we implemented and compared two accelerators using Synopsys tools: (1) an accelerator with dedicated resources and (2) an accelerator with shared resources. Tables 3 and 4 compare the chip area and power consumption of the two architectures. From Tables 3 and 4, it can be observed that the architecture with shared resources provided better results in both respects.

Table 3. Comparison in terms of power for dedicated vs. shared resources architectures.

Accelerator with Dedicated Resources               Accelerator with Shared Resources
Logic            Power (mW)                        Logic            Power (mW)
Combinational    26.069 (79.71%)                   Register         22.8959 (78.88%)
Clock network    0                                 Sequential       0
Register         6.636 (20.29%)                    Combinational    6.1314 (21.12%)
Total            32.7041 mW                        Total            29.0273 mW

Table 4. Comparison in terms of area for dedicated vs. shared resources architectures.

Accelerator with Dedicated Resources               Accelerator with Shared Resources
Logic                Area (µm2)                    Logic                Area (µm2)
Combinational        1,672,173.38                  Combinational        1,570,811.06
Buf/Inv              36,890.2                      Buf/Inv              35,927.28
Non-combinational    173,657.77                    Non-combinational    1,726,553.89
Total                3,408,531.16                  Total                3,297,364.95

Table 5 compares the energy efficiency of three hardware systems with different configurations:
(1) A GPU-based PC running a CNN model (using the Keras/Tensorflow framework).
(2) The proposed accelerator with dedicated resources.
(3) The proposed accelerator with shared resources.
While the precision employed by the GPU-based PC was 64-bit floating point, the precision implemented in the two proposed accelerators was 32-bit floating point. Since the number of bits was different, a minor difference in the calculation results could be expected. However, since the order of magnitude of the weight values was the same in training, it could be confirmed that the weight values were updated by making the best use of each feature. In Table 5, the energy consumption was calculated by multiplying the average power consumption of the training process by the projected time span for five epochs of the complete 50,000 training images of the MNIST dataset. For the GPU-based PC, the average power consumption and the overall training time are reported. During training on the PC, a smart Wi-Fi plug [28] was used to measure the power of the PC box (excluding any peripherals). The smart plug communicated the voltage, current, power, and energy consumed to the app on a smartphone over Wi-Fi, as shown in Figure 16. To calculate the energy consumed by the PC, the time duration of training was multiplied by the average power consumption. On the other hand, for the proposed accelerators, the average power consumption was estimated by Synopsys' Design Compiler, and the overall training time was estimated by the Vivado Verilog simulator. Our experiment, reported in Table 5, demonstrated that the proposed accelerator with shared resources consumes only 0.21% of the energy consumed by the GPU-based PC.

Table 5. Comparison in terms of energy consumption. FP: floating point.

                  GPU-Based PC *    Accelerator with Dedicated Resources    Accelerator with Shared Resources
Bit-precision     FP64              FP32                                    FP32
Power             54.6 W            32.7041 mW                              29.0273 mW
Time              101 s             393.21 s **                             393.21 s **
Energy            5514.6 J          12,859.6 mJ                             11,413.82 mJ

* For the GPU-based PC: Nvidia GeForce GT 710 with Intel(R) Core™ i5-9600KF CPU @ 3.70 GHz. ** 10 ns clock period × 157,284 cycles × 50,000 images × 5 epochs.
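The Table 5 figures can be re-derived from the reported clock period, cycle count, and power numbers. The short check below uses only the values quoted in the table:

```python
CLOCK_PERIOD_S = 10e-9              # 100 MHz target clock
CYCLES_PER_IMAGE = 157_284
IMAGES, EPOCHS = 50_000, 5

train_time_s = CLOCK_PERIOD_S * CYCLES_PER_IMAGE * IMAGES * EPOCHS    # 393.21 s

shared_power_w = 29.0273e-3
shared_energy_j = shared_power_w * train_time_s                       # ~11.41 J (11,413.82 mJ)

gpu_energy_j = 54.6 * 101                                             # 5514.6 J
print(train_time_s, shared_energy_j, shared_energy_j / gpu_energy_j)  # ratio ~0.0021 (0.21%)
```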

Figure 16. The method of measuring average power consumption using a PC.

In order to compare with the existing FPGA accelerators [13,15,29], we obtained the FPGA resource usage and power of the proposed architecture, as shown in Table 6. For a fair comparison among accelerators targeting networks of various sizes, we calculated a scaling factor, which is the ratio of the number of operations in an existing work to that of our work. The results reveal a normalized energy consumption of 17.40 µJ for our design, which is lower than the results of [15,29]. The training accelerator in [13] employed a batch size of 10 to reduce the latency to 363.80 µs per image, resulting in a reduced energy consumption. However, using a batch size of 1 (the same as ours) would have resulted in a much higher energy consumption in [13] due to the added latency of frequent weight updates to DRAM (Dynamic Random Access Memory). Therefore, the low effective normalized energy per image makes the proposed architecture a suitable candidate for IoT/mobile deployments.

Table 6. Comparison of resource usage, execution time, average execution power, and normalized energy consumption with the existing works.

                           [15]             [29]             [13]                Proposed
Precision                  FP 32            FP               Fixed 16            FP 32
Training dataset           MNIST            CIFAR-10         CIFAR-10            MNIST
Device                     Maxeler MPC-X    Xilinx ZU19EG    Intel Stratix 10    Xilinx XCZU7EV
LUT                        69,510           329,288          169,143 (ALM)       20,800
FF                         87,580           466,047          219,372             -
DSP                        23               1500             1699                12
BRAM                       510              174              10.6                304
Operations (OPs)           14,149,798       74,430,000       59,299,400          114,824
Scaling Factor (OPs/OPs)   123.23           648.21           516.44              1.00
Time Per Image (µs)        355.00           864.26           363.80              26.17
Power (W)                  27.30            14.24 a          20.63               0.67 a
Energy Per Image (µJ)      9691.50          12,307.05        7506.25             17.40
Normalized Energy (µJ) b   78.65            18.99            14.53               17.40

a Provided by Xilinx Vivado. b Normalized energy is energy per image/scaling factor, or OPs/(OPs/s/W)/scaling factor.
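The scaling-factor normalization in Table 6 can be re-derived directly from the per-image energies and operation counts. The small check below uses only values from the table; the proposed design's product comes out slightly above the tabulated 17.40 µJ because the 0.67 W power figure is rounded.

```python
# (operations, time_per_image_us, power_w) for each design in Table 6
designs = {
    "[15]":     (14_149_798, 355.00, 27.30),
    "[29]":     (74_430_000, 864.26, 14.24),
    "[13]":     (59_299_400, 363.80, 20.63),
    "proposed": (   114_824,  26.17,  0.67),
}
proposed_ops = designs["proposed"][0]

for name, (ops, t_us, p_w) in designs.items():
    energy_uj = t_us * p_w                    # energy per image in microjoules
    scaling = ops / proposed_ops              # ratio of operations to our work
    print(name, round(energy_uj, 2), round(energy_uj / scaling, 2))  # normalized energy
```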

For the measurements of Table 6, a Raspberry Pi was used for delivering the images/weights and reading the results, which consumed an additional 0.808 W of power. Upon including the power of the external device (the Raspberry Pi), the consumed power and energy values increased to 1.478 W and 38.67 µJ, respectively. In real deployment, however, we plan to use an on-chip RISC-V core, which would have a much lower power consumption than the Raspberry Pi.
The layout of the implemented test core, using the TSMC 65 nm CMOS process and generated by the Synopsys IC Compiler, is shown in Figure 17. The test core includes the accelerator with shared resources and an off-chip interface for writing images/weights and reading the produced results. The core occupies an area of 2.246 × 2.246 mm2.

Figure 17. Layout of the accelerator with shared resources.

5. Conclusions
This paper proposed a single-chip training and inference accelerator for use in mobile/edge and IoT devices. The inference and training functions of the accelerator were evaluated for MNIST handwritten digits 0–9 on an FPGA. The accelerator locally stores the weights and only needs external communication for acquiring input images and reporting results. The architecture has the benefit of a reduced energy consumption compared to a GPU-based accelerator running on a PC, and it demonstrated similar training and inference accuracies. Additionally, the proposed low-power accelerator has the capability for continuous self-training while deployed, thus ensuring a high prediction accuracy. The accelerator promises huge potential in the growing number of IoT edge devices by offering training capabilities at a fraction of the power consumption of a model running on a GPU. To further reduce the area and energy consumption, pruning and quantization down to 16-bit or 8-bit floating point may be used for the calculations. To use the proposed accelerator for various applications, we plan in the future to develop a flexible architecture where the same hardware can be used to accelerate networks of different depths.




Author Contributions: Conceptualization, J.H. and H.K.; methodology, J.H., S.A. and H.K.; software, J.H. and T.L.; validation, J.H. and S.A.; formal analysis, J.H. and S.A.; investigation, J.H. and T.L.; resources, T.L. and S.A., Writing—original draft preparation, J.H.; Writing—review and editing, S.A. and H.K.; supervision, H.K.; project administration, H.K.; funding acquisition, H.K. All authors have read and agreed to the published version of the manuscript. Funding: This work was supported by IITP grant (No. 2020-0-01304), Development of Self-learnable Mobile Recursive Neural Technology Project, and also supported by the Grand Information Technology Research Center support program (IITP-2020-0-01462) supervised by the IITP and funded by the MSIT (Ministry of Science and ICT), Korean government. It was also supported by Industry coupled IoT Semiconductor System Convergence Nurturing Center under System Semiconductor Convergence Specialist Nurturing Project funded by the National Research Foundation (NRF) of Korea (2020M3H2A107678611). Conflicts of Interest: The authors declare no conflict of interest.

References
1. Li, Y.; Song, B.; Kang, X.; Du, X.; Guizani, M. Vehicle-Type Detection Based on Compressed Sensing and Deep Learning in Vehicular Networks. Sensors 2018, 18, 4500. [CrossRef] [PubMed]
2. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
3. Abdel-Hamid, O.; Mohamed, A.; Jiang, H.; Penn, G. Applying Convolutional Neural Networks Concepts to Hybrid NN-HMM Model for Speech Recognition. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4277–4280.
4. Hadsell, R.; Sermanet, P.; Ben, J.; Erkan, A.; Scoffier, M.; Kavukcuoglu, K.; Muller, U.; LeCun, Y. Learning Long-Range Vision for Autonomous off-Road Driving. J. Field Robot. 2009, 26, 120–144. [CrossRef]
5. Elsisi, M.; Tran, M.-Q.; Mahmoud, K.; Lehtonen, M.; Darwish, M.M.F. Deep Learning-Based Industry 4.0 and Internet of Things towards Effective Energy Management for Smart Buildings. Sensors 2021, 21, 1038. [CrossRef] [PubMed]
6. Strigl, D.; Kofler, K.; Podlipnig, S. Performance and Scalability of GPU-Based Convolutional Neural Networks. In Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, Pisa, Italy, 17–19 February 2010; pp. 317–324.
7. Oh, K.-S.; Jung, K. GPU Implementation of Neural Networks. Pattern Recognit. 2004, 37, 1311–1314. [CrossRef]
8. Dong, S.; Gong, X.; Sun, Y.; Baruah, T.; Kaeli, D. Characterizing the Microarchitectural Implications of a Convolutional Neural Network (CNN) Execution on GPUs. In Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering, Berlin, Germany, 30 March 2018; pp. 96–106.
9. Potluri, S.; Fasih, A.; Vutukuru, L.K.; Machot, F.A.; Kyamakya, K. CNN Based High Performance Computing for Real Time Image Processing on GPU. In Proceedings of the Joint INDS'11 & ISTET'11, Klagenfurt am Worthersee, Austria, 25–27 July 2011; pp. 1–7.
10. Choi, Y.; Bae, D.; Sim, J.; Choi, S.; Kim, M.; Kim, L.-S. Energy-Efficient Design of Processing Element for Convolutional Neural Network. IEEE Trans. Circuits Syst. II 2017, 64, 1332–1336. [CrossRef]
11. Chang, X.; Pan, H.; Zhang, D.; Sun, Q.; Lin, W. A Memory-Optimized and Energy-Efficient CNN Acceleration Architecture Based on FPGA. In Proceedings of the 2019 IEEE 28th International Symposium on Industrial Electronics (ISIE), Vancouver, BC, Canada, 12–14 June 2019; pp. 2137–2141.
12. Lane, N.D.; Bhattacharya, S.; Georgiev, P.; Forlivesi, C.; Jiao, L.; Qendro, L.; Kawsar, F. DeepX: A Software Accelerator for Low-Power Deep Learning Inference on Mobile Devices. In Proceedings of the 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), Vienna, Austria, 11–14 April 2016; pp. 1–12.
13. Venkataramanaiah, S.K.; Ma, Y.; Yin, S.; Nurvithadhi, E.; Dasu, A.; Cao, Y.; Seo, J.-S. Automatic Compiler Based FPGA Accelerator for CNN Training. In Proceedings of the 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 8–12 September 2019; pp. 166–172.

14. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [CrossRef]
15. Zhao, W.; Fu, H.; Luk, W.; Yu, T.; Wang, S.; Feng, B.; Ma, Y.; Yang, G. F-CNN: An FPGA-Based Framework for Training Convolutional Neural Networks. In Proceedings of the 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP), London, UK, 6–8 July 2016; pp. 107–114.
16. Geng, T.; Wang, T.; Sanaullah, A.; Yang, C.; Xu, R.; Patel, R.; Herbordt, M. FPDeep: Acceleration and Load Balancing of CNN Training on FPGA Clusters. In Proceedings of the 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Boulder, CO, USA, 29 April–1 May 2018; pp. 81–84.
17. Sapna, S.; Sreenivasalu, N.; Paul, K. DAPP: Accelerating Training of DNN. In Proceedings of the 2018 IEEE 20th International Conference on High Performance Computing and Communications, Exeter, UK, 28–30 June 2018; pp. 867–872.
18. Narayanan, D.; Harlap, A.; Phanishayee, A.; Seshadri, V.; Devanur, N.R.; Ganger, G.R.; Gibbons, P.B.; Zaharia, M. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, Huntsville, ON, Canada, 27–30 October 2019; pp. 1–15.
19. Lu, J.; Lin, J.; Wang, Z. A Reconfigurable DNN Training Accelerator on FPGA. In Proceedings of the 2020 IEEE Workshop on Signal Processing Systems (SiPS), Coimbra, Portugal, 20–22 October 2020; pp. 1–6.
20. Park, D.; Kim, S.; An, Y.; Jung, J.-Y. LiReD: A Light-Weight Real-Time Fault Detection System for Edge Computing Using LSTM Recurrent Neural Networks. Sensors 2018, 18, 2110. [CrossRef] [PubMed]
21. Sun, Q.; Zhang, H.; Li, Z.; Wang, C.; Du, K. ADAS Acceptability Improvement Based on Self-Learning of Individual Driving Characteristics: A Case Study of Lane Change Warning System. IEEE Access 2019, 7, 81370–81381. [CrossRef]
22. Kim, Y.-D.; Park, E.; Yoo, S.; Choi, T.; Yang, L.; Shin, D. Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications. arXiv 2015, arXiv:1511.06530.
23. Lee, H.-W.; Kim, N.; Lee, J.-H. Deep Neural Network Self-Training Based on Unsupervised Learning and Dropout. Int. J. Fuzzy Log. Intell. Syst. 2017, 17, 1–9. [CrossRef]
24. Liu, B.; Zou, D.; Feng, L.; Feng, S.; Fu, P.; Li, J. An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution. Electronics 2019, 8, 281. [CrossRef]
25. Sim, J.; Lee, S.; Kim, L.-S. An Energy-Efficient Deep Convolutional Neural Network Inference Processor With Enhanced Output Stationary Dataflow in 65-nm CMOS. IEEE Trans. VLSI Syst. 2020, 28, 87–100. [CrossRef]
26. Truong, N.D.; Nguyen, A.D.; Kuhlmann, L.; Bonyadi, M.R.; Yang, J.; Ippolito, S.; Kavehei, O. Integer Convolutional Neural Network for Seizure Detection. IEEE J. Emerg. Sel. Top. Circuits Syst. 2018, 8, 849–857. [CrossRef]
27. Fleischer, B.; Shukla, S.; Ziegler, M.; Silberman, J.; Oh, J.; Srinivasan, V.; Choi, J.; Mueller, S.; Agrawal, A.; Babinsky, T.; et al. A Scalable Multi-TeraOPS Deep Learning Processor Core for AI Training and Inference. In Proceedings of the 2018 IEEE Symposium on VLSI Circuits, Honolulu, HI, USA, 18–22 June 2018; pp. 35–36.
28. Girier Smart Plug JR-PM01. Available online: https://templates.blakadder.com/girier_JR-PM01.html (accessed on 6 March 2021).
29. Liu, Z.; Dou, Y.; Jiang, J.; Wang, Q.; Chow, P. An FPGA-Based Processor for Training Convolutional Neural Networks.
In Proceedings of the 2017 International Conference on Field Programmable Technology (ICFPT), Melbourne, VIC, Australia, 11–13 December 2017; pp. 207–210.