
Article

Hardware Implementation of a Softmax-Like Function for Deep Learning Applications

Ioannis Kouretas * and Vassilis Paliouras

Electrical and Computer Engineering Department, University of Patras, 26504 Patras, Greece; [email protected]
* Correspondence: [email protected]
† This paper is an extended version of our paper published in the Proceedings of the 8th International Conference on Modern Circuits and Systems Technologies (MOCAST), Thessaloniki, Greece, 13–15 May 2019.

 Received: 28 April 2020; Accepted: 25 August 2020; Published: 28 August 2020 

Abstract: In this paper, a simplified hardware implementation of a softmax-like layer for CNNs is proposed. Initially, the softmax is analyzed in terms of the required numerical accuracy and certain optimizations are proposed. A proposed adaptable hardware architecture is evaluated in terms of the error introduced by the proposed softmax-like function. The proposed architecture can be adapted to the accuracy required by the application by retaining or eliminating certain terms of the approximation, thus allowing accuracy to be traded off against complexity. Furthermore, the proposed circuits are synthesized in a 90 nm 1.0 V CMOS standard-cell library using Synopsys Design Compiler. Comparisons reveal that significant reductions are achieved in the area × delay and power × delay products for certain cases over prior art. Area and power savings are achieved with respect to performance and accuracy.

Keywords: softmax; convolutional neural networks; VLSI; deep learning

1. Introduction

Deep neural networks (DNNs) have emerged as a means to tackle complex problems such as image classification. The success of DNNs is attributed to the availability of big data, the easy access to enormous computational power, and the introduction of novel algorithms that have substantially improved the effectiveness of training and inference [1]. A DNN is defined as a neural network (NN) which contains more than one hidden layer. In the literature, a graph is used to represent a DNN, with a set of nodes in each layer, as shown in Figure 1. The nodes at each layer are connected to the nodes of the subsequent layer. Each node performs processing including the computation of an activation function [2]. The extremely large number of nodes at each layer impels the training procedure to require extensive computational resources. A class of DNNs are the convolutional neural networks (CNNs) [2]. CNNs offer high accuracy in computer-vision problems such as face recognition and video processing [3] and have been adopted in many modern applications. A typical CNN consists of several layers, each one of which can be convolutional, pooling, or normalization, with the last one being a non-linear activation function. A common choice for the normalization layer is the softmax function, as shown in Figure 1. To cope with the increased computational load, several FPGA accelerators have been proposed, and these have demonstrated that the convolutional layers exhibit the largest hardware complexity in a CNN [4–15]. In addition to CNNs, hardware accelerators for RNNs and LSTMs have also been investigated [16–18]. In order to implement a CNN in hardware, the softmax layer should also be implemented with low complexity. Furthermore, the hidden layers of a DNN can use the softmax function when the model is designed to choose one among several different options for some internal variable [2]. In particular,


neural Turing machines (NTM) [20] and differentiable neural computers (DNC) [19] use softmax layers within the neural network. Moreover, softmax is incorporated in attention mechanisms, an application of which is machine translation [21]. Furthermore, both hardware [22–28] and memory-optimized software [29,30] implementations of the softmax function have been proposed. This paper, extending previous work published in MOCAST 2019 [31], proposes a simplified architecture for a softmax-like function, the hardware implementation of which is based on a proposed approximation which exploits the statistical structure of the vectors processed by the softmax layers in various CNNs. Compared to the previous work [31], this paper uses a large set of known CNNs and performs extensive and fair experiments to study the impact of the applied optimizations in terms of the achieved accuracy. Moreover, the architecture in [31] is further elaborated and generalized by taking into account the requirements of the targeted application. Finally, the proposed architecture is compared with various softmax hardware implementations. In order for the softmax-like function to be implemented efficiently in hardware, the approximation requirements are relaxed.

The remainder of the paper is organized as follows. Section 2 revisits the softmax activation function. Section 3 describes the proposed algorithm and Section 4 offers a quantitative analysis of the proposed architecture. Section 5 discusses the hardware complexity of the proposed scheme based on synthesis results. Finally, conclusions are summarized in Section 6.

Figure 1. A typical deep learning network.

2. Softmax Layer Review

CNNs consist of a number of stages, each of which contains several layers. The final layer is usually fully connected, uses ReLU as an activation function, and drives a softmax layer before the final output of the CNN. The classification performed by the CNN is accomplished at the final layer of the network. In particular, for a CNN which consists of $i+1$ layers, the softmax function is used to transform the real values generated by the $i$th CNN layer to probabilities, according to

$$f_j(z) = \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}}, \qquad (1)$$

where $z$ is an arbitrary vector with real values $z_j$, $j = 1, \ldots, n$, generated at the $i$th layer of the CNN and $n$ is the size of the vector. The $(i+1)$st layer is called the softmax layer. By applying the logarithm function to both sides of (1), it follows that

$$\log(f_j(z)) = \log\left(\frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}}\right) \qquad (2)$$
$$= \log(e^{z_j}) - \log\left(\sum_{k=1}^{n} e^{z_k}\right) \qquad (3)$$
$$= z_j - \log\left(\sum_{k=1}^{n} e^{z_k}\right). \qquad (4)$$

In (4), the term $\log\left(\sum_{k=1}^{n} e^{z_k}\right)$ is computed as

$$\log\left(\sum_{k=1}^{n} e^{z_k}\right) = \log\left(\sum_{k=1}^{n} \frac{e^m}{e^m}\, e^{z_k}\right) \qquad (5)$$
$$= \log\left(e^m \sum_{k=1}^{n} \frac{1}{e^m}\, e^{z_k}\right) \qquad (6)$$
$$= \log(e^m) + \log\left(\sum_{k=1}^{n} e^{-m} e^{z_k}\right) \qquad (7)$$
$$= m + \log\left(\sum_{k=1}^{n} e^{z_k - m}\right), \qquad (8)$$

where $m = \max_k(z_k)$. From (4) and (8) it follows that

$$\log(f_j(z)) = z_j - \left(m + \log\left(\sum_{k=1}^{n} e^{z_k - m}\right)\right) \qquad (9)$$
$$= z_j - \left(m + \log\left(\sum_{k=1}^{n} e^{z_k - m} - 1 + 1\right)\right) \qquad (10)$$
$$= z_j - \left(m + \log(Q + 1)\right), \qquad (11)$$

where

$$Q = \sum_{k=1}^{n} e^{z_k - m} - 1 = \sum_{\substack{k=1 \\ z_k \neq \max}}^{n} e^{z_k - m} + 1 - 1 = \sum_{\substack{k=1 \\ z_k \neq \max}}^{n} e^{z_k - m}. \qquad (12)$$

Due to the definition of $m$, it holds that

$$z_j \leq m \Rightarrow z_j - m \leq 0 \Rightarrow 0 \leq e^{z_j - m} \leq 1 \Rightarrow \qquad (13)$$
$$\sum_{k=1}^{n} e^{z_k - m} \leq n \Rightarrow \sum_{\substack{k=1 \\ z_k \neq \max}}^{n} e^{z_k - m} + 1 \leq n \Rightarrow \qquad (14)$$
$$Q + 1 \leq n \Rightarrow \frac{Q}{n-1} \leq 1 \Rightarrow Q' \leq 1, \qquad (15)$$

where $Q' = \frac{Q}{n-1}$. Expressing $Q$ in terms of $Q'$, (11) becomes

$$\log(f_j(z)) = z_j - m - \log\left((n-1)Q' + 1\right). \qquad (16)$$
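The identities above can be checked numerically; the following short Python sketch, using an arbitrary, hypothetical logit vector, verifies (8) and the reformulation (16) in terms of $Q'$.

```python
import numpy as np

# Numerical check of (8) and (16) on an arbitrary (hypothetical) real vector.
z = np.array([1.2, -0.7, 3.4, 0.1, 2.9])
n = z.size
m = z.max()

# (8): log(sum_k exp(z_k)) = m + log(sum_k exp(z_k - m))
lhs = np.log(np.exp(z).sum())
rhs = m + np.log(np.exp(z - m).sum())
assert np.isclose(lhs, rhs)

# (12) and (15): Q sums the non-maximum terms and Q' = Q/(n-1) lies in [0, 1].
Q = np.exp(z - m).sum() - 1.0
Q_prime = Q / (n - 1)
assert 0.0 <= Q_prime <= 1.0

# (16): the log-softmax rewritten in terms of Q' matches z_j - log(sum_k exp(z_k)).
log_softmax = z - lhs
reconstruction = z - m - np.log((n - 1) * Q_prime + 1.0)
assert np.allclose(log_softmax, reconstruction)
```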

The next section presents the proposed simplifications for (16) and the derived architecture for the softmax-like hardware implementation.

3. Proposed Softmax Architecture

Equation (15) involves the distance of the maximum component from the remainder of the components of a vector. As $Q' \approx 0$, the differences among the $z_i$'s increase and $z_j \ll m$. On the contrary, as $Q' \to 1$, the differences among the $z_i$'s are eliminated. Based on this observation, a simplifying approximation can be obtained, as follows. The third term on the right-hand side of (16), $\log\left((n-1)Q' + 1\right)$, can be roughly approximated by 0. Hence, (16) is approximated by

$$\log(\hat{f}_j(z)) \simeq z_j - m. \qquad (17)$$

Furthermore, this simplification substantially reduces hardware complexity, as described below. From (17), it follows that

$$\hat{f}_j(z) = e^{z_j - \max_k(z_k)}. \qquad (18)$$
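As an illustration of (18) and of the properties established next (Theorem 1 and Corollary 1), the following Python sketch compares the softmax of (1) with the proposed softmax-like function. The logit values are hypothetical and only serve to show that the argmax decision is preserved while the outputs no longer sum to one.

```python
import numpy as np

def softmax(z):
    """Softmax of (1), computed with the usual max-subtraction of (4)-(8)."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_like(z):
    """Proposed approximation of (17)-(18): the log((n-1)Q'+1) term of (16) is dropped."""
    return np.exp(z - np.max(z))

# Hypothetical logit vector, not taken from the paper's experiments.
z = np.array([3.0, 6.0, 4.0, 2.0, 0.5, -1.0])

p_exact = softmax(z)
p_approx = softmax_like(z)

# Same decision (Theorem 1) ...
assert np.argmax(p_exact) == np.argmax(p_approx)
# ... and a constant per-vector ratio (Corollary 1), equal to sum_k exp(z_k - m).
print(p_approx / p_exact)   # identical entries
print(p_approx.sum())       # greater than 1, so (18) is not a pdf
```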

Theorem 1. Let $q = \arg\max_j f_j(z)$ and $r = \arg\max_j \hat{f}_j(z)$ be the decisions obtained by (1) and (18), respectively. Then $q = r$.

Proof. Due to the softmax definition, it holds that

$$\max_j f_j(z) = \max\left(\frac{e^{z_q}}{\sum_{k=1}^{n} e^{z_k}}\right) \Rightarrow \qquad (19)$$
$$z_q = \max_k(z_k). \qquad (20)$$

For the case of the proposed function (18), it holds that

$$\max_j \hat{f}_j(z) = \max\left(e^{z_r - \max_k(z_k)}\right) \Rightarrow \qquad (21)$$
$$z_r = \max_k(z_k). \qquad (22)$$

From (20) and (22), it is derived that $z_q = z_r$. Hence, $\arg\max_j f_j(z) = \arg\max_j \hat{f}_j(z)$, i.e., $q = r$. □

Corollary 1. It holds that $\dfrac{\hat{f}_j(z)}{f_j(z)} = \sum_{k=1}^{n} e^{z_k - m}$.

Proof. The proof is trivial and is omitted. □

Theorem 1 states that the proposed softmax-like function and the actual softmax function always derive the same decisions. The proposed softmax-like approximation is based on the idea that the softmax function is used during training to target an output $y$ by using maximum log-likelihood [2]. Hence, if the correct answer already has the maximum input value to the softmax function, the term $\log\left((n-1)Q' + 1\right) \simeq 0$ hardly alters the output decision, due to the exponential function used in the term $Q'$. In general, $\sum_j \hat{f}_j(z) > 1$, and hence the sequence $\hat{f}_j(z)$ cannot be regarded as a probability density function. For models where the normalization function is required to be a pdf, a modified approach can be followed, as detailed below.

According to the second approach, from (11) and (12) it follows:

$$\log(\hat{f}_j(z)) = z_j - m - \log\left(\sum_{\substack{k=1 \\ z_k \neq \max}}^{n} e^{z_k - m} + 1\right) \qquad (23)$$
$$= z_j - m - \log\left(\sum_{\substack{k=1 \\ m_k \neq \max}}^{p} e^{m_k - m} + \sum_{\substack{k=p+1 \\ m_k \neq \max}}^{n} e^{m_k - m} + 1\right) \qquad (24)$$
$$= z_j - m - \log\left(\sum_{k=1}^{p} e^{m_k - m} + \sum_{k=p+1}^{n} e^{m_k - m}\right) \qquad (25)$$
$$= z_j - m - \log(Q_1 + Q_2), \qquad (26)$$

with $Q_1 = \sum_{k=1}^{p} e^{m_k - m}$, where $M_1 = \{m_1, \ldots, m_p\} = \{m_k : k = 1, \ldots, p\}$ are chosen to be the top $p$ maximum values of $z$. For the quantity $Q_2$, it holds $Q_2 = \sum_{k=p+1}^{n} e^{m_k - m}$, with $M_2 = \{m_{p+1}, \ldots, m_n\} = \{m_k : k = p+1, \ldots, n\}$ being the remainder values of the vector $z$, i.e., $M_1 \cup M_2 = \{z_1, \ldots, z_n\}$. A second approximation is performed as

$$\log(Q_1) = \log\left(\sum_{k=1}^{p} e^{m_k - m} - 1 + 1\right) \approx \sum_{k=1}^{p} e^{m_k - m} - 1, \qquad (27)$$
$$Q_2 \approx 0. \qquad (28)$$

From (26)–(28), it follows that

$$\log(\hat{f}_j(z, p)) \approx z_j - m - \sum_{k=1}^{p} e^{m_k - m} + 1 \qquad (29)$$

$$\hat{f}_j(z, p) = e^{\,z_j - m - \sum_{k=1}^{p} e^{m_k - m} + 1}. \qquad (30)$$

Equation (30) uses the parameter $p$, which defines the number of additional terms used. By properly selecting $p$, it holds that $\sum_j \hat{f}_j(z, p) \simeq 1$ and (30) approximates a pdf better than (18). This derives from the fact that, in a real-life CNN, the $p$ maximum values are those that contribute to the computation of the softmax, since all the remaining values are close to zero.
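A minimal Python sketch of the refined approximation (30) follows; the logit vector is hypothetical, and the top-$p$ selection is done in software rather than with the maximum tree of the hardware architecture.

```python
import numpy as np

def softmax_like_p(z, p):
    """Sketch of (30): keep the top-p terms of the correction, using (27) for log(Q1)
    and the approximation Q2 ~ 0 of (28)."""
    m = np.max(z)
    top_p = np.sort(z)[-p:]                 # m_1, ..., m_p: the top-p values of z
    correction = np.exp(top_p - m).sum()    # sum_{k=1..p} exp(m_k - m)
    return np.exp(z - m - correction + 1.0)

# Hypothetical logits with a dominant maximum and a few smaller components.
z = np.array([6.0, 4.0, 3.0, 2.0, 0.5, 0.1, -1.0])
exact = np.exp(z) / np.exp(z).sum()

print(softmax_like_p(z, p=1).sum())                  # reduces to (18); sum exceeds 1
print(softmax_like_p(z, p=3).sum())                  # closer to 1, i.e., closer to a pdf
print(np.abs(softmax_like_p(z, p=3) - exact).max())  # small approximation error
```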

Lemma 1. It holds that $\hat{f}_j(z, p) = \hat{f}_j(z)$ when $p = 1$.

Proof. By definition, when $p = 1$ it holds that $m_1 = m$, since the single top value $m_1$ is identified as the maximum $m$. Hence, by substituting $p = 1$ in (30), it follows that

$$\hat{f}_j(z, 1) = e^{\,z_j - m - \sum_{k=1}^{1} e^{m_k - m} + 1} \Rightarrow$$
$$\hat{f}_j(z, 1) = e^{\,z_j - m - e^{m - m} + 1} \Rightarrow$$
$$\hat{f}_j(z, 1) = e^{\,z_j - m} \overset{(18)}{\Rightarrow}$$
$$\hat{f}_j(z, 1) = \hat{f}_j(z). \qquad \square$$


From a hardware perspective, (18) and (30) can be evaluated by the same circuit which implements the exponential function. The contributions of the paper are as follows. Firstly, the quantity $\log\left((n-1)Q' + 1\right)$ is eliminated from (16), implying that the target application requires decision making. Secondly, further mathematical manipulations are proposed to be applied to (30), in order to approximate the outputs as a pdf, i.e., probabilities that sum to one. Thirdly, the circuit for the evaluation of $e^x$ is simplified, since

$$z_j \leq m \Rightarrow \qquad (31)$$
$$z_j - m \leq 0 \Rightarrow \qquad (32)$$
$$e^{z_j - m} \leq 1, \qquad (33)$$

and

$$z_j \leq m \Rightarrow \qquad (34)$$
$$z_j - m - \sum_{k=1}^{p} e^{m_k - m} + 1 < 0 \Rightarrow \qquad (35)$$

$$e^{\,z_j - m - \sum_{k=1}^{p} e^{m_k - m} + 1} < 1. \qquad (36)$$

Figure 2 depicts the various building blocks of the proposed architecture. More specifically, the proposed architecture comprises the block which computes the maximum $m$, i.e., $m = \max_k(z_k)$. The particular computation is performed by a tree which generates the maximum by comparing the elements two by two, as shown in Figure 3. The depicted tree structure generates $m = \max_k(z_k)$, $k = 0, 1, \ldots, 7$. Notation $z_{ij}$ denotes the maximum of $z_i$ and $z_j$, while $z_{ijkl}$ denotes the maximum of $z_{ij}$ and $z_{kl}$. The same architecture is used to compute the top $p$ maximum values of the $z_i$'s. For example, $z_{01}$, $z_{23}$, $z_{45}$ and $z_{67}$ are the top four maximum values and $m = \max_k(z_k)$ is the maximum. Subsequently, $m$ is subtracted from all the component values $z_k$, as dictated by (17). The subtraction is performed through the adders denoted in Figure 2, using two's complement representation for the negated input values $-m$. The obtained differences, also represented in two's complement, are used as inputs to a LUT, which performs the proposed simplified $e^x$ operation of (18), to compute the final vector $\hat{f}_j(z)$ as shown in Figure 2a. Additional $p$ terms are added and subsequently each output $\hat{f}_j(z)$ through (30) generates the final value for the softmax-like layer output, as shown in Figure 2b. For the hardware implementation of the $e^x$ function, a LUT is adopted, the input of which is $x = z_j - m$. The LUT size increases with the range of $e^x$. Our proposed hardware implementation is simpler than other exponential implementations which propose CORDIC transformations [32], use floating-point representation [33], or LUTs [34]. In (33), the $e^x$ values are restricted to the range (0, 1]; hence, the derived LUT size significantly diminishes and leads to a simplified hardware implementation.
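A behavioural sketch of this datapath is given below; it is not RTL, and the LUT input range is an assumed illustrative value rather than a design figure from the paper, while the 5 fractional bits follow the 10-bit (5, 5) fixed-point example used later in Section 4.

```python
import numpy as np

FRAC_BITS = 5      # fractional bits, matching the 10-bit (5,5) example of Section 4
LUT_RANGE = 16     # assumed: LUT covers x in (-LUT_RANGE, 0]; smaller inputs saturate

# Table addressed by the quantized magnitude of x = z_j - m, so entries lie in (0, 1].
magnitudes = np.arange(0, LUT_RANGE * 2**FRAC_BITS)
EXP_LUT = np.exp(-magnitudes / 2**FRAC_BITS)

def exp_lut(x):
    """Look up e^x for x <= 0; addresses beyond the table saturate to the last entry."""
    addr = np.minimum((-x * 2**FRAC_BITS).astype(int), EXP_LUT.size - 1)
    return EXP_LUT[addr]

def softmax_like_hw(z):
    m = np.max(z)            # maximum tree of Figure 3
    diff = z - m             # subtractors (two's complement in hardware)
    return exp_lut(diff)     # parallel LUT look-ups, outputs restricted to (0, 1]

print(softmax_like_hw(np.array([3.0, 6.0, 4.0, 2.0])))
```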


Furthermore, no conversion from the logarithmic to the linear domain is required, since $\hat{f}_j(z)$ represents the final classification layer. The next section quantitatively investigates the validity and usefulness of employing $\hat{f}_j(z)$, in terms of the approximation error.

Figure 2. Proposed softmax-like layer architecture. The circuit $\max(z_k)$, $k = 0, 1, \ldots, n$ computes the maximum value $m$ of the input vector $z = [z_1 \cdots z_n]^T$. Next, $m$ is subtracted from each $z_k$, as described in (17). (a) Proposed softmax-like architecture with $p = 1$. Each output $\hat{f}_j(z)$ through (18) generates the final value for the softmax-like layer output. (b) Proposed softmax-like architecture. The notation ◦ denotes negation. Additional $p$ terms are added and subsequently each output $\hat{f}_j(z)$ through (30) generates the final value for the softmax-like layer output.

Figure 3. Tree structure for the computation of $\max(z_k)$, $k = 0, 1, \ldots, 7$. Notation $z_{ij}$ denotes the maximum of $z_i$ and $z_j$, while $z_{ijkl}$ denotes the maximum of $z_{ij}$ and $z_{kl}$.

4. Quantitative Analysis of Introduced Error

This section quantitatively verifies the applicability of the approximation introduced in Section 2, for certain applications, by means of a series of examples. In order to quantify the error introduced by the proposed architecture, the mean square error (MSE) is evaluated as

$$\mathrm{MSE} = \frac{1}{n} \sum_j \left(f_j(z) - \hat{f}_j(z)\right)^2, \qquad (37)$$

where $f_j(z)$ and $\hat{f}_j(z)$ are the expected and the actually evaluated softmax output, respectively. As an illustrative example, denoted as Example 1, Figure 4 depicts the histograms of component values in test vectors used as inputs to the proposed architecture, selected to have specific properties detailed below. The corresponding parameters are evaluated by using the proposed architecture for the case of a 10-bit fixed-point representation, where 5 bits are used for the integer part and 5 bits are allocated to the fractional part. More specifically, the vector in Figure 4a contains $n = 30$ values, for which it holds that $z_j = 5$, $j = 1, \ldots, 11$, $z_j = 4.99$, $j = 12, \ldots, 22$ and $z_j = 5$, $j = 23, \ldots, 30$. For this case the softmax values obtained by (1) are

$$f_j(z) = \begin{cases} 0.0343, & j = 1, \ldots, 11 \\ 0.0314, & j = 12, \ldots, 22 \\ 0.0347, & j = 23, \ldots, 30. \end{cases} \qquad (38)$$

From a CNN perspective, the softmax layer output generates similar values, where the probabilities are all around 3%, and hence classification or a decision cannot be made with high confidence. By using (18), the modified softmax values are

$$\hat{f}_j(z) = \begin{cases} 0.9844, & j = 1, \ldots, 11 \\ 0.8906, & j = 12, \ldots, 22 \\ 1, & j = 23, \ldots, 30. \end{cases} \qquad (39)$$

The statistical structure of the vector is characterized by the quantity $Q' = 0.9946$ of (15). The estimated MSE $= 0.8502$ dictates that the particular vector is not suitable as an input to the softmax alternative, in terms of CNN performance, i.e., the obtained classification is performed with low confidence. Hence, although the proposed approximation in (18) demonstrates large differences when compared to (1), neither is applicable in CNN terms. Consider the following Example 2. The component values for vector $z$ in Example 2 are

$$z_j = \begin{cases} 3, & j = 1 \\ 6, & j = 2 \\ 4, & j = 3 \\ 2, & j = 4 \\ < 1, & j = 5, \ldots, 30, \end{cases} \qquad (40)$$

the histogram of which is shown in Figure 4b. In this case, the statistical structure of the vector demonstrates $Q' = 0.0449$ and MSE $= 0.0018$. The feature of vector $z$ in Example 2 is that it contains a few large component values close to each other, namely $z_1 = 3$, $z_2 = 6$, $z_3 = 4$, $z_4 = 2$, while all other components are smaller than 1. The softmax outputs in (1) for the values in the particular $z$ are

$$f_j(z) = \begin{cases} 0.0382, & j = 1 \\ 0.7675, & j = 2 \\ 0.1039, & j = 3 \\ 0.0141, & j = 4 \\ < 0.005, & j = 5, \ldots, 30. \end{cases} \qquad (41)$$

By using the proposed approximation (18), the obtained modified softmax values are

$$\hat{f}_j(z) = \begin{cases} 0.0469, & j = 1 \\ 1, & j = 2 \\ 0.1250, & j = 3 \\ 0.0156, & j = 4 \\ 0, & j = 5, \ldots, 30. \end{cases} \qquad (42)$$

Equations (41) and (42) show that the proposed architecture chooses component $z_2$ with value 1, while the actual probability is 0.7675. This means that the introduced error of MSE $= 0.0018$ can be negligible, depending on the application, as dictated by $Q' = 0.0449 \ll 1$.

In the following, tests using vectors obtained from real CNN applications are considered. More specifically, in an example shown in Figure 5, the vectors are obtained from image and digit classification applications. In particular, Figure 5a,b depict the values used as input to the final softmax layer, generated during a single inference for a VGG-16 image classification network for 1000 classes and a custom net for MNIST digit classification for 10 classes. The quantity $Q'$ can be used to determine whether the proposed architecture is appropriate for application on a vector $z$ before evaluating the MSE. It is noted that the MSE for the example of Figure 5a and the MSE for the example of Figure 5b are of the orders of $10^{-13}$ and $10^{-5}$, respectively, which renders them negligible.
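For completeness, the quantities used in these examples can be reproduced with a few lines of Python; the helpers below evaluate the MSE of (37) and the statistic $Q'$ of (15), applied to vectors merely in the spirit of Examples 1 and 2 (the exact test data are not reproduced here).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mse(z):
    """MSE of (37) between the softmax of (1) and the approximation of (18)."""
    approx = np.exp(z - z.max())
    return np.mean((softmax(z) - approx) ** 2)

def q_prime(z):
    """Q' of (15): normalized sum of exp(z_k - m) over the non-maximum components."""
    return (np.exp(z - z.max()).sum() - 1.0) / (z.size - 1)

# Vectors merely in the spirit of Examples 1 and 2 (illustrative, not the exact data).
z1 = np.full(30, 5.0)
z1[11:22] = 4.99                                                # nearly flat vector
z2 = np.concatenate(([3.0, 6.0, 4.0, 2.0], np.full(26, 0.5)))   # dominant maximum

print(q_prime(z1), mse(z1))   # Q' close to 1, large MSE: low-confidence classification
print(q_prime(z2), mse(z2))   # Q' << 1, small MSE: the decision is preserved cheaply
```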

Figure 4. Values obtained from Examples 1 and 2. (a) $Q' = 0.9946$, MSE $= 0.8502$. (b) $Q' = 0.0449$, MSE $= 0.0018$.

Figure 5. Values obtained from imagenet image classification for 1000 classes and a custom net for MNIST digit classification for 10 classes. (a) $Q' = 0.001$, MSE $= 1.5205 \times 10^{-5}$. (b) $Q' = 0.1111$, MSE $= 6.0841 \times 10^{-13}$.

Subsequently, the proposed method is applied to the ResNet-50 [35], VGG-16, VGG-19 [36], InceptionV3 [37] and MobileNetV2 [38] CNNs, for 1000 classes with 10,000 inferences of a custom image data set. In particular, for the case of ResNet-50, Figure 6a,b depict the histograms of the MSE and $Q'$ values, respectively. More specifically, Figure 6a demonstrates that the MSE values are of magnitude $10^{-3}$, with 8828 of the values being in the interval $[2.18 \times 10^{-28}, 8.16 \times 10^{-4}]$. Furthermore, Figure 6b shows that the $Q'$ values are of magnitude $10^{-2}$, with 9096 of them being in the interval $[0, 0.00270]$. Furthermore, Table 1a–e depict the actual softmax and the proposed softmax-like values obtained by executing inference on the CNN models for six custom images. The values are sorted from left to right, with the maximum value on the left side. Furthermore, the inferences $a, \ldots, f$ and $a', \ldots, f'$ denote the same custom image as input to the model. More specifically, Table 1a demonstrates values from six inferences for the ResNet-50 model. It is shown that for the case of inference $a$ the maximum obtained values are $f_1(z) = 0.75371140$ and $\hat{f}_1(z) = 1$, for $f_j(z)$ and $\hat{f}_j(z)$, respectively. Other values are $f_2(z) = 0.027212996$, $f_3(z) = 0.018001331$, $f_4(z) = 0.014599146$, $f_5(z) = 0.014546203$ and $\hat{f}_2(z) = 0.03515625$, $\hat{f}_3(z) = 0.0234375$, $\hat{f}_4(z) = 0.0185546875$, $\hat{f}_5(z) = 0.0185546875$, respectively. Hence, as shown by Corollary 1, the maximum takes the value '1' and the remainder of the values follow the values obtained by the actual softmax function. A similar analysis can be obtained for all the inferences $a, \ldots, f$ and $a', \ldots, f'$ for each one of the CNNs. Furthermore, the same class is outputted from the CNN in both cases for each inference.

For the case of VGG-16, Figure 7a depicts the histogram of the MSE values. It is shown that the values are of magnitude $10^{-3}$ and 8616 are in the interval $[1.12 \times 10^{-34}, 7.26 \times 10^{-4}]$. Figure 7b demonstrates the histogram of the $Q'$ values, with magnitude of $10^{-2}$ and more than 8978 values in the interval $[0, 0.00247]$. Table 1b demonstrates that in the case of the $e$ and $e'$ inference the top values are 0.51182884 and 1, respectively. The second top values are 0.18920843 and 0.369140625, respectively. In this case, the decision is made with a confidence of 0.51182884 for the actual softmax value and 1 for the softmax-like value. Furthermore, the second top value, 0.369140625, is not negligible when compared to 1 and hence denotes that the selected class is of low confidence. The same conclusion derives for the case of the actual softmax value. Furthermore, for the case of the $a$–$f$ and $a^*$–$f^*$ inferences, the values obtained by the architecture of Figure 2b are close to the actual pdf softmax outputs. In particular, for the $d$ and $d^*$ cases, the top-5 values are 0.747070313, 0.10941570, 0.065981410, 0.018329350, 0.012467328 and 0.747070313, 0.110351563, 0.06640625, 0.017578125, 0.01171875, respectively. It is shown that the values are similar. Hence, depending on the application, an alternative architecture, as shown in Figure 2b, can be used to generate pdf softmax values as outputs.
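The proposed-method entries in Table 1 are consistent with softmax-like outputs expressed on a fine fixed-point grid; the sketch below shows how such quantized values could be produced, where the choice of 10 fractional bits is an assumption made for illustration rather than a design parameter stated in the text.

```python
import numpy as np

FRACTION_BITS = 10   # assumed output quantization step of 2**-10, for illustration only

def softmax_like_quantized(z, frac_bits=FRACTION_BITS):
    """Softmax-like outputs of (18), truncated to a fixed-point grid."""
    approx = np.exp(z - z.max())
    step = 2.0 ** -frac_bits
    return np.floor(approx / step) * step

# Hypothetical logits: the maximum maps exactly to 1, the rest to multiples of 2**-10,
# mirroring the pattern of the proposed-method rows in Table 1.
z = np.array([2.1, 5.6, 3.2, 0.4])
print(softmax_like_quantized(z))
```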

Figure 6. Values obtained from imagenet image ResNet-50 net classification for 1000 classes. (a) Histogram of the MSE values. (b) Histogram of the $Q'$ values.

Moreover, Figure 8a,b depict graphically the values of the actual softmax and the proposed softmax-like output for inferences A and B, respectively, for the case of the VGG-16 CNN for the output classification. Furthermore, Figure 8c–f depict values for the architectures in Figure 2a,b, respectively.

0.6 0.6

0.4 0.4

0.2 0.2 number of inferences number of inferences

0 0

0 2 4 6 0 0.5 1 1.5 2 2.5 0 2 MSE values 3 Q values 10− 10− · · (a) Histogram of the MSE (b) Histogram of the Q0 values. values.

Figure 7. Values obtained from imagenet image VGG-16 net classification for 1000 classes.


It is shown that, in the case of Figure 2a, the values demonstrate a similar structure. In the case of Figure 2b, the values are similar to the actual softmax outputs.

Similar analysis can be performed for the case of VGG-19. In particular, Figure 9a demonstrates that the MSE is of magnitude $10^{-3}$, 8351 values of which are in the interval $[2.49 \times 10^{-34}, 6.91 \times 10^{-4}]$. In Figure 9b, 8380 values are in the interval $[0, 0.00192]$. For the case of InceptionV3, the histograms in Figure 10a,b demonstrate MSE and $Q'$ values, 9194 and 9463 of which are in the intervals $[2.62 \times 10^{-25}, 7.08 \times 10^{-4}]$ and $[0, 0.003]$, respectively. For the MobileNetV2 network, Figure 11a,b demonstrate MSE and $Q'$ values, 8990 and 9103 of which are in the intervals $[2.48 \times 10^{-25}, 1.05 \times 10^{-3}]$ and $[1.55 \times 10^{-7}, 0.004]$, respectively. Furthermore, Table 1c–e lead to similar conclusions as in the case of VGG-16.

4 4 10 10 0.8 · 0.8 1 · 1

0.60.8 0.80.6

0.40.6 0.60.4

0.20.4 0.40.2 number of inferences number of inferences

0.2 0.20

0 number of inferences number of inferences 0 0 0 0.2 0.4 0.6 0.8 1 0 1 2 3 4 MSE values 2 Q values 2 10− 0 0.5 1 1.5 2 2.510− 0 2 4 6 · · 0 0 2 (a) Histogram ofMSE the values MSE values. 3 (b) HistogramQ ofvalues the Q values.10− 10− · · Figure 10.(a)ValuesHistogram obtained( ofa from)( the imagenet MSE image MobileNetV2(b) Histogram net classification of theb)Q0 values. for 1000 classes. values. Figure 7. Values obtained from imagenet image VGG-16 net classification for 1000 classes. (a) Histogram of theFigure MSE values. 7. Values (b) obtainedHistogram from of the imagenetQ0 values. image VGG-16 net classification for 1000 classes.

1 0.4 4 4 0.8 10 10 · 1 0.3· 1 0.6 0.8 0.8 Version August 12,0.2 2020 submitted to 12 of 24 Technologies 0.4 0.6 0.6 0.1 0.2 Softmax value for VGG-16 0.4 Softmax value for0.4 VGG-16 0 0

0.2 0 200 400 600 800 1,000 0.2 0 200 400 600 800 1,000 number of inferences

number of inferences Classes Classes 0 0 (a) Actual softmax(a)( values for (b) Actual softmaxb) values for inference A. inference0 B.0.5 1 1.5 2 1 0 2 4 6 1 0 2 MSE values 3 Q values 10− 10− · 0.8 · 0.8 (a) Histogram of the MSE values. (b) Histogram of the Q0 values. 0.6 0.6 Figure 8. Values obtained from imagenet image VGG-19 net classification for 1000 classes. 0.4 0.4

0.2 0.2 Softmax value for VGG-16 Softmax value for VGG-16 0 0 0 200 400 600 800 1,000 0 200 400 600 800 1,000 Classes Classes (c) Proposed approximation(c)((d) Proposed approximationd) softmax values for inference A softmax values for inference B based on architecture of Fig. 2a Figure 8. Cont.based on architecture of Fig. 2a and (18). and (18).

1

0.4 0.8

0.3 0.6

0.2 0.4

0.1 0.2 Softmax value for VGG16 Softmax value for VGG-16 0 0 0 200 400 600 800 1,000 0 200 400 600 800 1,000 Classes Classes (e) Proposed approximation (f) Proposed approximation softmax values for inference A softmax values for inference B based on architecture of Fig. 2b based on architecture of Fig. 2b and (30) with p = 10. and (30) with p = 10.

Figure 11. Actual and proposed approximation softmax values for two inferences namely A and B, for the VGG-16 CNN. Version August 12, 2020 submitted to Technologies 12 of 24

1 1

0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2 Softmax value for VGG-16 Softmax value for VGG-16 0 0 Version August 12, 2020 submitted to Technologies 10 of 24 0 200 400 600 800 1,000 0 200 400 600 800 1,000 Classes Classes (c) Proposed approximation (d) Proposed approximation 4 softmax4 values for inference A softmax10 values for inference B 10 1 · based1 · on architecture of Fig. 2a based on architecture of Fig. 2a and (18). and (18). 0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2 number of inferences number of inferences

0 0

0 2 4 6 8 0 0.5 1 1.5 2 2.5 3 0 2 MSE values 3 Q values 10− 10− Technologies 2020, 8, 46 · · 12 of 20 (a) Histogram of the MSE (b) Histogram of the Q0 values. values.

Figure 6. Values obtained from imagenet image ResNet-501 net classification for 1000 classes.

0.4 0.8

0.3 0.6 4 4 10 10 · 1 · 1 0.2 0.4 0.8 0.8 0.1 0.2 0.6

0.6 Softmax value for VGG16 Softmax value for VGG-16 0 0 0.4 0.4 0 200 400 600 800 1,000 0 200 400 600 800 1,000 Classes Classes 0.2 0.2 number of inferences number of inferences (e) Proposed approximation(e)((f) Proposed approximationf) 0 0 softmax values for inference A softmax values for inference B Figure 8. Actual and proposed approximation softmax values for two inferences, namely A and B, based on0 architecture2 4 of Fig. 2b6 based0 on0.5 architecture1 1.5 of Fig.2 2b2.5 0 2 for the VGG-16 CNN. (a) ActualMSE values softmax values3 for inference A. (b) ActualQ values softmax values10− for inference and (30) with p = 10. 10− and (30) with p = 10. · B. ( ) Proposed approximation softmax values· for inference A based on architecture of Figure2a c (a) Histogram of the MSE (b) Histogram of the Q0 values. Figure 11. Actual and proposed approximation softmax values for two inferences namely A and B, for and (18).(d) Proposedvalues. approximation softmax values for inference B based on architecture of Figure2a andthe( VGG-1618).(e) Proposed CNN. approximation softmax values for inference A based on architecture of Figure2b Figure 7. Values obtained from imagenet image VGG-16 net classification for 1000 classes. and (30) with p = 10.(f) Proposed approximation softmax values for inference B based on architecture of Figure2b and (30) with p = 10.

Figure 9. Values obtained from imagenet image VGG-19 net classification for 1000 classes. (a) Histogram of the MSE values. (b) Histogram of the $Q'$ values.

Figure 10. Values obtained from imagenet image InceptionV3 net classification for 1000 classes. (a) Histogram of the MSE values. (b) Histogram of the $Q'$ values.


In general, in all cases identical output decisions are obtained for the actual softmax and the softmax-like output layer for each one of the CNNs.

Figure 11. Values obtained from imagenet image MobileNetV2 net classification for 1000 classes. (a) Histogram of the MSE values. (b) Histogram of the $Q'$ values.

MSE values obtained for the case of 1000 inferences by0.8 the VGG-16 CNN. It is shown that the case w = (6, 2) demonstrates0.3 the smaller MSE values. The reason for this is that the maximum value of the 0.6 inputs in the softmax layer is 56 for all the 1000 inferences, and hence the value of 6 for the integral 0.2 part is sufficient. 0.4 0.1 0.2 w=(3,2) Softmax value for VGG-16 Softmax value for VGG-16 ww=(3,2)=(4,2) 0 800 0 w=(4,2) 800 0 200 400 600 800 1,000 0 200 400 600 800 1,000 Classes600 Classes 600 (a) Actual softmax values for (b) Actual softmax values for 400

inference A. inference400 number inference B.

inference number 200 200 0 2 2 2 2 0 2 10− 4 10− 6 10− 8 10− 0.1 0 · · · · 2 MSE2 values 2 2 0 2 10− 4 10− 6 10− 8 10− 0.1 · · · · (a) Histograms for the MSEMSE values values for (3,2) and (4,2) data (a)wordlength.Histograms for the MSE( valuesa) for (3,2) and (4,2) data wordlength. 600 w=(5,2) w=(6,2) 600 w=(5,2) 500 w=(6,2) 500 400

400 300

300

inference number 200

inference number 200 100

100 0 0.2 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 2.2 − 0 MSE values 3 0.2 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 102− 2.2 − · (b) Histograms for the MSEMSE values values for (5,2) and (6,2) data3 (b) 10− wordlength. · (b) Histograms for the MSE values for (5,2) and (6,2) data Figure 12. HistogramsFigure 14. Histograms for thewordlength. MSE for the values MSE values for for various various data data wordlengths wordlengths for the case for of the 1000 case inferences of 1000 inferences in the VGG-16 CNN. in the VGG-16Figure CNN. 14. Histograms (a) Histograms for the MSE values for forthe various MSE data values wordlengths for (3,for the 2) case and of 1000 (4, inferences 2) data wordlength. (b) Histogramsin the for VGG-16 the MSE CNN. values for (5, 2) and (6, 2) data wordlength. Technologies 2020, 8, 46 14 of 20
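The wordlength study can be emulated in software; the following sketch quantizes the softmax inputs to an assumed unsigned $(l, k)$ fixed-point grid with saturation and reports the resulting MSE, using a hypothetical logit vector whose maximum is close to the value of 56 reported above.

```python
import numpy as np

def quantize(x, l, k):
    """Quantize to an assumed unsigned (l, k) fixed-point grid: l integral and k
    fractional bits, rounding to the nearest step and saturating at the top value."""
    step = 2.0 ** -k
    top = 2.0 ** l - step
    return np.clip(np.round(x / step) * step, 0.0, top)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mse_for_wordlength(z, l, k):
    zq = quantize(z, l, k)
    approx = np.exp(zq - zq.max())   # softmax-like function (18) on quantized inputs
    return np.mean((softmax(z) - approx) ** 2)

# Hypothetical logit vector whose maximum is close to 56, as reported for VGG-16.
z = np.array([56.0, 41.3, 12.7, 3.2, 0.8])
for l, k in [(3, 2), (4, 2), (5, 2), (6, 2)]:
    print((l, k), mse_for_wordlength(z, l, k))
```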

Summarizing, it is shown that the proposed architecture is well suited for the final stage of a CNN as an alternative implementation of the softmax layer, since the MSE is negligible. Next, the proposed architecture is implemented in hardware and compared with published counterparts.

Table 1. Top-5 softmax values for six indicative inferences for each model. The actual softmax values and the proposed method values are obtained by (1) and (18), respectively. For the VGG-16 CNN, values obtained by (30) are also presented.

(a) ResNet-50.

Inference  Top-5 Softmax Values

(1)
a   0.75371140   0.027212996    0.018001331    0.014599146    0.014546203
b   0.99992204   4.0420113 × 10−5   8.1600538 × 10−6   5.8431901 × 10−6   2.5279753 × 10−6
c   0.36263093   0.13568024     0.090758167    0.063106202    0.061747193
d   0.93937486   0.011548955    0.010892190    0.0068663955   0.0043140189
e   0.98696542   0.0090351542   0.0010830049   0.00068612932  0.00025251327
f   0.99833566   0.0015795525   4.6098357 × 10−5   2.2146454 × 10−5   1.0724646 × 10−5

(18)
a′  1   0.03515625     0.0234375      0.0185546875   0.0185546875
b′  1   0              0              0              0
c′  1   0.3740234375   0.25           0.173828125    0.169921875
d′  1   0.01171875     0.0107421875   0.0068359375   0.00390625
e′  1   0.0087890625   0.0009765625   0              0
f′  1   0.0009765625   0              0              0

(b) VGG-16.

Inference  Top-5 Softmax Values

(1)
a   0.99999309   2.3484120 × 10−6   9.8677640 × 10−7   4.2830493 × 10−7   3.1285174 × 10−7
b   0.99389344   0.0014656042   0.00072381413   0.00062438881   0.00028156079
c   0.94363135   0.022138771    0.0087750498    0.0048798379    0.0047590565
d   0.73901427   0.10941570     0.065981410     0.018329350     0.012467328
e   0.51182884   0.18920843     0.10042682      0.055410255     0.030226296
f   0.59114474   0.40821150     0.00043615574   0.00017272410   1.3683101 × 10−5

(18)
a′  1   0              0              0              0
b′  1   0.0009765625   0              0              0
c′  1   0.0234375      0.0087890625   0.0048828125   0.0048828125
d′  1   0.1474609375   0.0888671875   0.0244140625   0.0166015625
e′  1   0.369140625    0.1962890625   0.107421875    0.05859375
f′  1   0.6904296875   0              0              0

(30)

a*  0.999023438   0             0             0             0
b*  0.99609375    0.000976563   0             0             0
c*  0.954101563   0.021484375   0.008789063   0.004882813   0.00390625
d*  0.747070313   0.110351563   0.06640625    0.017578125   0.01171875
e*  0.456054688   0.16796875    0.088867188   0.048828125   0.026367188
f*  0.5           0.345703125   0             0             0

(c) VGG-19.

Inference  Top-5 Softmax Values

(1)
a   0.99204987   0.0035423702   0.0018605839   0.00044701522   0.00035935538
b   0.99999964   2.9884677 × 10−7   1.5874324 × 10−11   1.5047866 × 10−11   9.9192204 × 10−13
c   0.99405837   0.0013178071   0.00071631809   0.00040839455   0.00027133274
d   0.19257018   0.12952097     0.12107860      0.10589719      0.074582554
e   0.99603385   0.0014963613   0.0010812994    0.00024322474   0.00015848021
f   0.74559504   0.15503055     0.010651816     0.0081892628    0.0075844983

(18)
a′  1   0.0029296875   0.0009765625   0             0
b′  1   0              0              0             0
c′  1   0.0009765625   0              0             0
d′  1   0.671875       0.6279296875   0.548828125   0.38671875
e′  1   0.0009765625   0.0009765625   0             0
f′  1   0.20703125     0.013671875    0.0107421875  0.009765625

(d) InceptionV3.

Inference  Top-5 Softmax Values

(1)
a   0.98136926   0.00067191740   0.00022632803   0.00020886297   0.00018680355
b   0.52392030   0.17270486      0.12838276      0.0024479097    0.0017230138
c   0.61721277   0.042022489     0.038270507     0.011870607     0.0036431390
d   0.96187764   0.0011140818    0.00084153039   0.00069097377   0.00045776321
e   0.99643219   0.00058087677   0.00015713122   5.3965716 × 10−5   4.0285959 × 10−5
f   0.45723280   0.41415739      0.00078048115   0.00071852183   0.00068869896

(18)
a′  1   0              0              0              0
b′  1   0.3291015625   0.244140625    0.00390625     0.0029296875
c′  1   0.0673828125   0.0615234375   0.0185546875   0.005859375
d′  1   0.0009765625   0              0              0
e′  1   0              0              0              0
f′  1   0.9052734375   0.0009765625   0.0009765625   0.0009765625

(e) MobileNetV2.

Inference  Top-5 Softmax Values

(1)
a   0.81305408   0.014405688    0.012406061    0.0091119893   0.0077789603
b   0.95702046   0.0042284634   0.0040278519   0.0020813416   0.00098748843
c   0.49231452   0.022776684    0.020905942    0.018753875    0.018386556
d   0.60401917   0.29827181     0.015593613    0.010511264    0.0038427035
e   0.97501647   0.0032496843   0.0014790110   0.0008857667   0.00076536590
f   0.87092900   0.022609057    0.0044059716   0.0023696721   0.0014177967

(18)
a′  1   0.017578125    0.0146484375   0.0107421875   0.0087890625
b′  1   0.00390625     0.00390625     0.001953125    0.0009765625
c′  1   0.0458984375   0.0419921875   0.0380859375   0.037109375
d′  1   0.4931640625   0.025390625    0.0166015625   0.005859375
e′  1   0.0029296875   0.0009765625   0              0
f′  1   0.025390625    0.0048828125   0.001953125    0.0009765625

5. Hardware Implementation Results

This section describes implementation results obtained by synthesizing the proposed architecture outlined in Figure 2. Among several authors reporting results on CNN accelerators, [22–24] have recently published works focusing on the hardware implementation of the softmax function. In particular, in [23], a study based on stochastic computation is presented. Geng et al. provide a framework for the design and optimization of softmax implementations in hardware [26]. They also discuss operand bit-width minimization, taking into account application accuracy constraints. Du et al. propose a hardware architecture that derives the softmax function without a divider [25]. The approach relies on an equivalent softmax expression which requires natural logarithms and exponentials. They provide a detailed evaluation of the impact of the particular implementation on several benchmarks. Li et al. describe a 16-bit fixed-point hardware implementation of the softmax function [27]. They use a combination of look-up tables and multi-segment linear approximations for the approximation of exponentials, and a radix-4 Booth–Wallace-based 6-stage pipeline multiplier and a modified shift-compare divider. In [24], the architecture demonstrates LUT-based computations that add complexity and exhibits 444,858 µm2 area complexity using a 65 nm standard-cell library. For the same library, the architecture in [25] reports 640,000 µm2 area complexity with 0.8 µW power consumption at a 500 MHz clock frequency. The architecture in [28] reports 104,526 µm2 area complexity with 4.24 µW power consumption at a 1 GHz clock frequency. The proposed architecture in [26] demonstrates power consumption and area complexity of 1.8 µW and 3000 µm2, respectively, at a 500 MHz clock frequency with a UMC 65 nm standard-cell library. In [27], a frequency of 3.3 GHz and an area complexity of 34,348 µm2 are reported at a 45 nm technology node. Yuan [22] presented an architecture for implementing the softmax layer. Nevertheless, there is no discussion of the implementation of the LUTs and there are no synthesis results.

Our proposed softmax-like function differs from the actual softmax function due to the approximation of the quantity $\log\left((n-1)Q' + 1\right)$, as discussed in Section 3. In particular, (18) approximates the softmax output for a decision-making application and not as a pdf. The proposed softmax-like function in (30) approximates the outputs as a pdf, depending on the number of terms $p$ used. As $p \to n$, (30) approaches the actual softmax function. The hardware complexity reduction derives from the fact that a limited number, $p$, of the $z_i$'s contribute to the computation of the softmax function. Summarizing, we compare both architectures depicted in Figure 2a,b with [22] to quantify the impact of $p$ on the hardware complexity. Section 4 shows that the softmax-like function suits a CNN well. For a fair comparison we have implemented and synthesized both architectures, ours and that of [22], using a 90 nm 1.0 V CMOS standard-cell library with Synopsys Design Compiler [39]. Figure 13 depicts the architecture obtained from synthesis, where the various building blocks, namely the maximum evaluation, the subtractors and the simplified exponential LUTs that operate in parallel, are shown. Furthermore, registers have been added at the circuit inputs and outputs for applying the delay constraints.
Detailed results are depicted in Table 2a–c for the proposed softmax-like layer of Figure 2a, the layer of [22] and the proposed softmax-like layer of Figure 2b, respectively, for a layer of size 10. Furthermore, the results are plotted graphically in Figure 14a,b, where area vs. delay and power vs. delay are depicted, respectively. The results demonstrate that substantial area savings are achieved with no delay penalty. More specifically, for a 4 ns delay constraint the area complexity is 25,597 µm2 and 43,576 µm2 in the case of the architectures of Figure 2b and [22], respectively. For the case where a pdf output is not significant, the area complexity can be further reduced to 17,293 µm2 with the architecture of Figure 2a. Summarizing, depending on the application and the design constraints, there is a trade-off regarding the number of additional terms p used for the evaluation of the softmax output. As the value of the parameter p increases, the actual softmax value is better approximated, while the hardware complexity increases. When p = 1, the hardware complexity is minimized, while the softmax output approximation diverges most from the actual softmax values.

Figure 13. Proposed architecture obtained by synthesis.

Table 2. Area, delay and power consumption for the 10-class softmax layer output of a convolutional neural network (CNN).

(a) Architecture of Figure 2a.

Delay (ns)   Area (µm2)   Power (µW)
2.93         16,891       1611.6
3.43         17,293       1423.7
3.95         15,550       1070.5
4.42         15,788       936.8
4.94         15,084       812.4
5.47         15,349       503.5

(b) Architecture in [22].

Delay (ns)   Area (µm2)   Power (µW)
3.99         43,576       3228.2
5.19         32,968       1695.1
6.45         25,445       871.8
7.95         26,358       714.4
9.44         25,846       624.9
10.41        26,154       570.2

(c) Architecture of Figure 2b with p = 5.

Delay (ns)   Area (µm2)   Power (µW)
3.42         27,615       1933.2
3.92         25,597       1576.3
4.91         23,654       1216.4
5.42         21,458       1050.6
6.45         20,251       838.6
7.94         20,147       636.1
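The area × delay and power × delay products referred to in the comparisons follow directly from Table 2; the short script below simply tabulates them (plain arithmetic on the synthesis figures, no additional data).

```python
# Area x delay and power x delay products computed from the synthesis points of Table 2
# (plain arithmetic on the reported figures; no additional data involved).
table2 = {
    "Figure 2a": [(2.93, 16891, 1611.6), (3.43, 17293, 1423.7), (3.95, 15550, 1070.5),
                  (4.42, 15788, 936.8), (4.94, 15084, 812.4), (5.47, 15349, 503.5)],
    "[22]": [(3.99, 43576, 3228.2), (5.19, 32968, 1695.1), (6.45, 25445, 871.8),
             (7.95, 26358, 714.4), (9.44, 25846, 624.9), (10.41, 26154, 570.2)],
    "Figure 2b, p=5": [(3.42, 27615, 1933.2), (3.92, 25597, 1576.3), (4.91, 23654, 1216.4),
                       (5.42, 21458, 1050.6), (6.45, 20251, 838.6), (7.94, 20147, 636.1)],
}

for name, rows in table2.items():
    for delay_ns, area_um2, power_uw in rows:
        print(f"{name:14s}  delay={delay_ns:5.2f} ns  "
              f"area*delay={area_um2 * delay_ns:11.1f}  power*delay={power_uw * delay_ns:9.1f}")
```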

Figure 14. Area, delay and power complexity plots for a softmax layer of size 10, for the proposed and [22] circuits in the case of 10-bit wordlength implemented in a standard-cell library. A, B and C in the legends denote the architectures of Figure 2a, Figure 2b and [22], respectively. (a) Area vs. delay plot. (b) Power vs. delay plot.

6. Conclusions

This paper proposes hardware architectures for implementing the softmax layer in a CNN with substantially reduced area × delay and power × delay products for certain cases. A family of architectures that can approximate the softmax function has been introduced and evaluated, each member of which is obtained through a design parameter p, which controls the number of terms employed for the approximation. It is found that a very simple approximation using p = 1 suffices to deliver accurate results in certain cases, even though the derived approximation is not a pdf. Furthermore, it has been demonstrated that for image and digit classification applications the proposed architecture is ideally suited, as it achieves MSEs of the order of $10^{-13}$ and $10^{-5}$, respectively, which are considered low.

Author Contributions: All authors contributed equally. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflicts of interest.

References

1. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [CrossRef] [PubMed]
2. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. Available online: http://www.deeplearningbook.org (accessed on 27 August 2020).
3. Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014.
4. Carreras, M.; Deriu, G.; Meloni, P. Flexible Acceleration of Convolutions on FPGAs: NEURAghe 2.0; Ph.D. Workshop; CPS Summer School: Alghero, Italy, 23 September 2019.
5. Zainab, M.; Usmani, A.R.; Mehrban, S.; Hussain, M. FPGA Based Implementations of RNN and CNN: A Brief Analysis. In Proceedings of the 2019 International Conference on Innovative Computing (ICIC), Lahore, Pakistan, 1–2 November 2019; pp. 1–8. [CrossRef]
6. Sim, J.; Lee, S.; Kim, L. An Energy-Efficient Deep Convolutional Neural Network Inference Processor with Enhanced Output Stationary Dataflow in 65-nm CMOS. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 87–100. [CrossRef]
7. Hareth, S.; Mostafa, H.; Shehata, K.A. Low power CNN hardware FPGA implementation. In Proceedings of the 2019 31st International Conference on Microelectronics (ICM), Cairo, Egypt, 15–18 December 2019; pp. 162–165. [CrossRef]

8. Zhang, S.; Cao, J.; Zhang, Q.; Zhang, Q.; Zhang, Y.; Wang, Y. An FPGA-Based Reconfigurable CNN Accelerator for YOLO. In Proceedings of the 2020 IEEE 3rd International Conference on Electronics Technology (ICET), Chengdu, China, 8–12 May 2020; pp. 74–78.
9. Tian, T.; Jin, X.; Zhao, L.; Wang, X.; Wang, J.; Wu, W. Exploration of Memory Access Optimization for FPGA-based 3D CNN Accelerator. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; pp. 1650–1655.
10. Nakahara, H.; Que, Z.; Luk, W. High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression. In Proceedings of the 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Fayetteville, AR, USA, 3–6 May 2020; pp. 1–9.
11. Shahan, K.A.; Sheeba Rani, J. FPGA based and memory architecture for Convolutional Neural Network. In Proceedings of the 2020 33rd International Conference on VLSI Design and 2020 19th International Conference on Embedded Systems (VLSID), Bangalore, India, 4–8 January 2020; pp. 183–188.
12. Shan, J.; Lazarescu, M.T.; Cortadella, J.; Lavagno, L.; Casu, M.R. Power-Optimal Mapping of CNN Applications to Cloud-Based Multi-FPGA Platforms. IEEE Trans. Circuits Syst. II Express Briefs 2020, 1. [CrossRef]
13. Zhang, W.; Liao, X.; Jin, H. Fine-grained Scheduling in FPGA-Based Convolutional Neural Networks. In Proceedings of the 2020 IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), Chengdu, China, 10–13 April 2020; pp. 120–128.
14. Zhang, M.; Li, L.; Wang, H.; Liu, Y.; Qin, H.; Zhao, W. Optimized Compression for Implementing Convolutional Neural Networks on FPGA. Electronics 2019, 8, 295. [CrossRef]
15. Wang, D.; Shen, J.; Wen, M.; Zhang, C. Efficient Implementation of 2D and 3D Sparse Deconvolutional Neural Networks with a Uniform Architecture on FPGAs. Electronics 2019, 8, 803. [CrossRef]
16. Bank-Tavakoli, E.; Ghasemzadeh, S.A.; Kamal, M.; Afzali-Kusha, A.; Pedram, M. POLAR: A Pipelined/Overlapped FPGA-Based LSTM Accelerator. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 838–842. [CrossRef]
17. Xiang, L.; Lu, S.; Wang, X.; Liu, H.; Pang, W.; Yu, H. Implementation of LSTM Accelerator for Speech Keywords Recognition. In Proceedings of the 2019 IEEE 4th International Conference on Integrated Circuits and Microsystems (ICICM), Beijing, China, 25–27 October 2019; pp. 195–198. [CrossRef]
18. Azari, E.; Vrudhula, S. An Energy-Efficient Reconfigurable LSTM Accelerator for Natural Language Processing. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 4450–4459. [CrossRef]
19. Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka, I.; Grabska-Barwinska, A.; Colmenarejo, S.G.; Grefenstette, E.; Ramalho, T.; Agapiou, J.; et al. Hybrid computing using a neural network with dynamic external memory. Nature 2016, 538, 471–476. [CrossRef] [PubMed]
20. Graves, A.; Wayne, G.; Danihelka, I. Neural Turing Machines. arXiv 2014, arXiv:1410.5401.
21. Olah, C.; Carter, S. Attention and Augmented Recurrent Neural Networks. Distill 2016. [CrossRef]
22. Yuan, B. Efficient hardware architecture of softmax layer in deep neural network. In Proceedings of the 2016 29th IEEE International System-on-Chip Conference (SOCC), Seattle, WA, USA, 6–9 September 2016; pp. 323–326. [CrossRef]
23. Hu, R.; Tian, B.; Yin, S.; Wei, S. Efficient Hardware Architecture of Softmax Layer in Deep Neural Network. In Proceedings of the 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), Shanghai, China, 19–21 November 2018; pp. 1–5. [CrossRef]
24. Sun, Q.; Di, Z.; Lv, Z.; Song, F.; Xiang, Q.; Feng, Q.; Fan, Y.; Yu, X.; Wang, W. A High Speed SoftMax VLSI Architecture Based on Basic-Split. In Proceedings of the 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), Qingdao, China, 31 October–3 November 2018; pp. 1–3. [CrossRef]
25. Du, G.; Tian, C.; Li, Z.; Zhang, D.; Yin, Y.; Ouyang, Y. Efficient Softmax Hardware Architecture for Deep Neural Networks. In Proceedings of the 2019 Great Lakes Symposium on VLSI, Tysons Corner, VA, USA, 9–11 May 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 75–80. [CrossRef]

26. Geng, X.; Lin, J.; Zhao, B.; Kong, A.; Aly, M.M.S.; Chandrasekhar, V. Hardware-Aware Softmax Approximation for Deep Neural Networks. In Lecture Notes in Computer Science, Proceedings of Computer Vision–ACCV 2018, 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Revised Selected Papers, Part IV; Jawahar, C.V., Li, H., Mori, G., Schindler, K., Eds.; Springer: Cham, Switzerland, 2018; Volume 11364, pp. 107–122. [CrossRef]
27. Li, Z.; Li, H.; Jiang, X.; Chen, B.; Zhang, Y.; Du, G. Efficient FPGA Implementation of Softmax Function for DNN Applications. In Proceedings of the 2018 12th IEEE International Conference on Anti-counterfeiting, Security, and Identification (ASID), Xiamen, China, 9–11 November 2018; pp. 212–216.
28. Alabassy, B.; Safar, M.; El-Kharashi, M.W. A High-Accuracy Implementation for Softmax Layer in Deep Neural Networks. In Proceedings of the 2020 15th Design & Technology of Integrated Systems in Nanoscale Era (DTIS), Marrakech, Morocco, 1–3 April 2020; pp. 1–6.
29. Dukhan, M.; Ablavatski, A. The Two-Pass Softmax Algorithm. arXiv 2020, arXiv:2001.04438.
30. Wei, Z.; Arora, A.; Patel, P.; John, L.K. Design Space Exploration for Softmax Implementations. In Proceedings of the 31st IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), Manchester, UK, 6–8 July 2020.
31. Kouretas, I.; Paliouras, V. Simplified Hardware Implementation of the Softmax Activation Function. In Proceedings of the 2019 8th International Conference on Modern Circuits and Systems Technologies (MOCAST), Thessaloniki, Greece, 13–15 May 2019; pp. 1–4. [CrossRef]
32. Hertz, E.; Nilsson, P. Parabolic synthesis methodology implemented on the sine function. In Proceedings of the 2009 IEEE International Symposium on Circuits and Systems, Taipei, Taiwan, 24–27 May 2009; pp. 253–256. [CrossRef]
33. Yuan, W.; Xu, Z. FPGA based implementation of low-latency floating-point exponential function. In Proceedings of the IET International Conference on Smart and Sustainable City 2013 (ICSSC 2013), Shanghai, China, 19–20 August 2013; pp. 226–229. [CrossRef]
34. Tang, P.T.P. Table-lookup algorithms for elementary functions and their error analysis. In Proceedings of the 10th IEEE Symposium on Computer Arithmetic, Grenoble, France, 26–28 June 1991; pp. 232–236. [CrossRef]
35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
36. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
37. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
38. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [CrossRef]
39. Synopsys. Available online: https://www.synopsys.com (accessed on 27 August 2020).

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).