<<

2020 International Conference on High Performance Big Data and Intelligent Systems (HPBD&IS)

A Cordic-Based Acceleration Method on FPGA for CNN Normalization Layer

Yongxiang Cao
China Agricultural University
Key Laboratory of Agricultural Informatization and Standardization, Ministry of Agriculture and Rural Affairs, China Agricultural University
Beijing, China
[email protected]

Wan'ang Xiao*
Institute of Semiconductors, Chinese Academy of Sciences
Center of Materials Science and Optoelectronics Engineering, University of Chinese Academy of Sciences
Beijing, China
[email protected]

Jingdun Jia
China Agricultural University
Key Laboratory of Agricultural Informatization and Standardization, Ministry of Agriculture and Rural Affairs, China Agricultural University
Beijing, China
[email protected]

Abstract—This paper studies the accelerated method used by FPGA to implement the Normalization layer of Alexnet, a classical network. This paper uses Cordic to implement basic operations such as the division and square root functions, combining them with the characteristics of the FPGA to reduce the consumption of floating-point operation resources. This paper analyzes the errors generated by different bit widths of data after quantization and fixed-point input of the layer, in order to determine the final data bit width. Vivado 2018.1 is used as the realization and simulation environment of this paper. The bit width is selected to meet the design requirements and save resources as much as possible. After reducing the bit width, the relative error is controlled below 0.02 and the Normalization layer is completed in 51.5 cycles for the calculation of one spatial position (x, y) unit. The final relative error is controlled below 0.015 by using this method, which greatly improves the calculation speed of the Normalization layer.

Keywords—FPGA, Cordic, Normalization, Acceleration Method

I. INTRODUCTION

With the vigorous development of computing power, the Convolutional Neural Network (CNN) has also developed rapidly. New CNNs with more layers and better performance continue to appear. A CNN is a multi-layer feed-forward neural network whose layers have very different structures and calculation methods. Due to the need to collect large-scale image and video data and the increased accuracy of CNN network structures, the scale of data calculation has risen sharply. The number of operations and the amount of intermediate data have increased exponentially, and the requirements on device performance increase accordingly. Generally, traditional computing resources are not able to adapt to these booming needs. The Field Programmable Gate Array (FPGA) has gradually become the best choice for deploying and accelerating CNNs and has become a current research hotspot.

However, due to the ever-expanding scale of CNNs, the large power consumption of GPUs and the objective conditions that GPUs are too expensive and cannot easily be integrated as chips, people gradually turn their attention to FPGAs. At present, studies have proposed the use of GPU[2], FPGA[3], ASIC[4] and even other platforms to implement CNN hardware acceleration.

II. CORDIC-BASED ACCELERATION PRINCIPLES AND METHODS

With the continuous expansion of CNN's application fields and functions, the number of required calculations has gradually increased during training and inference. At present, the existing computing platforms are gradually unable to meet these needs. Hardware acceleration refers to the technology of alleviating the workload of the processor by allocating a very large amount of computational work to specialized hardware for image processing. The original convolutional neural network was implemented by software programming on a general-purpose processor [1]. Later, with the expansion of network scale and depth, the demand for data volume and computing power increased dramatically. Currently, to implement convolutional neural networks, the training and prediction of the network have to be accelerated, and people have begun to pursue more efficient models and hardware platforms with faster computing speeds. Therefore, the FPGA is often used to accelerate the forward inference process of the CNN, so as to meet the needs of fast processing of pictures and so on.

When deploying on an FPGA, large-scale high-precision floating-point operations cannot be performed; even small-scale high-precision floating-point operations will consume a lot of DSP resources. However, the DSP resources on FPGA boards are very expensive and limited, which limits the improvement of speed. Therefore, the data on the FPGA needs to be converted by fixed-point quantization into a data type suitable for FPGA processing. This brings some obstacles and difficulties to the calculation of non-linear functions, especially the implementation of transcendental functions such as division, square root and the exponential function. To achieve the hardware acceleration of the convolutional neural network, we have adopted a mixed-precision method to save resources and improve computing efficiency. This paper proposes a method based on the Cordic algorithm to implement the Normalization layer of the CNN on the FPGA and improves the algorithm to expand its convergence range to the range of data formats required by the mixed-precision accelerated network. Thus, the layer's forward reasoning is accelerated.
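To make the fixed-point conversion concrete, the sketch below is an illustrative Python model only; the signed word format, rounding rule and integer/fraction split are assumptions and are not the data format actually chosen in this paper. It quantizes a floating-point activation to a given bit width and reports the relative error introduced:

def to_fixed(x, total_bits, frac_bits):
    """Quantize a real value to a signed fixed-point word (round to nearest, saturate)."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, round(x * scale)))

def from_fixed(word, frac_bits):
    """Convert a fixed-point word back to a real value."""
    return word / (1 << frac_bits)

x = 1.68                                  # one of the test values used later in Section IV
for total_bits in (16, 12, 8):
    frac_bits = total_bits - 3            # assumed split: 3 integer bits (incl. sign)
    q = from_fixed(to_fixed(x, total_bits, frac_bits), frac_bits)
    rel = abs(q - x) / abs(x)
    print(f"{total_bits:2d}-bit fixed point: {q:.6f}  relative error {rel:.2e}")

Sweeping the total width in this way is the kind of error-versus-width trade-off explored experimentally in Section IV.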



However, people gradually turn their attention to using the FPGA to accelerate the forward reasoning process of the CNN because of its comprehensive advantages in design cycle, flexibility, reconfigurability, security, latency, adaptability, power consumption, etc. This paper presents an effective FPGA-based IP (Intellectual Property) core design method to implement some of the functions required by the CNN, including the IP-core implementation process, parameterized RTL-level software IP core design, hardware synthesis, simulation, and verification. Reference [3] verified through experiments that the throughput of the Gzip algorithm on an OpenCL-based FPGA platform has a 10-fold performance improvement over the CPU platform, and that the development cycle is shortened by three times compared to Verilog-based FPGAs with similar performance. Jiantao Qiu [5] proposed a CNN accelerator design on an embedded FPGA to deploy a large-scale image classification model. Reference [6] proposed a new processor architecture for implementing discrete-time cellular CNNs on FPGAs. Reference [7] proposed a method for implementing CNN on FPGA and DSP, analyzed its computational characteristics and designed an alternative model for deployment. Xilinx also released a DNN model to assist FPGA designers with CNNs and, at the same time, released the Versal chip for deploying CNNs, whose AI Engines module was specially designed to implement CNN calculations; its Processing System and Programmable Logic communicate with each other to realize the deployment and acceleration of the CNN on the FPGA. It can be seen that the design of CNN accelerators on FPGAs has gradually become a hotspot, and the most important issue in designing accelerators is how each layer implements its acceleration method on hardware with a completely different physical structure and logical mechanism, down to a specific acceleration method for each layer.

A. Normalization layer

In neurobiology, there is a concept called lateral inhibition, which refers to the suppression of neighboring neurons by firing neurons. The normalization layer can be understood as a process of intermediate data preprocessing added between different layers of the convolutional neural network: the output data of the previous layer is normalized and then input to the next layer, which can effectively prevent the occurrence of gradient diffusion and thus accelerate the CNN training process. Generally, when large-scale normalization is performed in a CNN, feature maps that have not been ReLU-activated are batch-normalized and then used as input to the excitation layer, thereby adjusting the partial derivative of the excitation function. The normalization layer is a very important layer to prevent gradients from exploding or vanishing; those situations cause the weights of the model to stop updating, or to grow larger and larger so that training cannot converge. Therefore, a normalization layer needs to be added inside the neural network.

The LRN normalization method mainly occurs between the outputs of different convolution kernels after ReLU. This method needs to traverse n adjacent kernel maps: after passing through different filters, the points at the same (x, y) position of adjacent feature maps are normalized. This normalization implements a lateral inhibition inspired by actual neurons, using the outputs of neurons from different kernels[8]. Batch normalization mainly occurs between different samples, while nearest neighbor normalization (LRN) mainly occurs between the outputs of different convolution kernels. Therefore, batch normalization is based on mini-batch data, and the learned parameters obtained during training need to be used during testing and inference; neighbor normalization, however, is based only on its own data, so during testing and inference only the input is required: the data of the next layer can be obtained from the data of the upper layer without additional parameters or external input. Alexnet's calculation formula for the LRN layer is shown in (1):

b^{i}_{x,y} = a^{i}_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} (a^{j}_{x,y})^{2} \Big)^{\beta}    (1)

Among them, a^{i}_{x,y} represents the value at position (x, y) on the i-th convolution surface, and b^{i}_{x,y} represents the nearest-neighbor normalized value. N is the total number of convolutional or pooling surfaces, n is the number of adjacent surfaces, and k, α and β are adjustable parameters. By selecting appropriate parameters, the local response can be normalized, thereby improving the accuracy of the network. In Alexnet's paper, the parameters chosen by the authors are k=2, n=5, α=10^-4, and β=0.75.
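As a concrete reading of (1), the following Python sketch is purely illustrative: it is not the FPGA implementation, and the example activations are made up. It evaluates the LRN output for every channel at one spatial position with the AlexNet parameters k=2, n=5, α=10^-4, β=0.75:

def lrn_at_position(a_xy, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """a_xy: activations a^i_{x,y} over all N channels at one (x, y) position.
    Returns the normalized values b^i_{x,y} following Eq. (1)."""
    N = len(a_xy)
    b_xy = []
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        s = sum(a_xy[j] ** 2 for j in range(lo, hi + 1))   # sum over the adjacent channels
        b_xy.append(a_xy[i] / (k + alpha * s) ** beta)
    return b_xy

# example: 8 channels at one spatial position (made-up values)
print(lrn_at_position([0.5, 1.2, -0.3, 2.0, 0.0, 1.1, -0.7, 0.4]))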
B. Cordic algorithm

In the field of digital signal processing, some theory-based algorithms cannot be directly mapped to hardware, making hardware approximation solutions preferable. The Cordic (Coordinate Rotation Digital Computer) algorithm is an iterative solution for the trigonometric functions and other transcendental functions; the more iterations it takes, the closer it gets to the real value[9]. The number of iterations can be set according to the requirements on resources and precision. In the end, these functions only need to be approximated by additions and shifts, greatly reducing the use of floating-point operations. Current Cordic theory has also been extended to provide solutions for a wider range of functions, and Cordic rotation has been used to calculate a variety of operations[10]. All of them can be calculated using vector rotation or derived from secondary calculations, and can be used for the mutual conversion between polar and rectangular coordinates and for vector size transformation[11].

The Cordic algorithm obtains the corresponding sine and cosine values through continuous rotation in the rotation mode, and further values such as the tangent can be obtained from them by additional calculations. The rotation angles are very specific: the angle of each micro-rotation is chosen so that its tangent equals 2^-i in the i-th iteration. In the vector mode, the purpose of the rotations is to make the y component approach zero; the rotation angles are accumulated, so the Cordic algorithm obtains the corresponding arc tangent value, from which further meaningful values can be derived by additional calculations.

The Cordic algorithm is thus divided into a rotation mode and a vector mode. The trigonometric functions defined by the original circular pattern can be extended to straight lines and hyperbolic curves to calculate multiplication, division and hyperbolic trigonometric functions. This paper mainly uses the rotation mode in the hyperbolic coordinate system of the Cordic algorithm and the vector mode in the circular coordinate system and the linear coordinate system.


The normalization layer in the CNN mainly uses the square root operation based on the Cordic implementation. For a CNN, accuracy and calculation time are important factors for judging whether a network is excellent and whether it can be applied in practice. Therefore, this paper also uses the Cordic algorithm to test and compare the errors and implementation times for data of different bit widths; a more suitable data format was then selected for the actual operation. Figure 1 shows the coordinate rotation model in the circular coordinate system. The Cordic algorithm defines the derivation process as shown in (2), (3) and (4).

Fig. 1. Coordinate rotation model diagram

x_p = \cos\varphi, \qquad y_p = \sin\varphi    (2)

x_q = \cos(\varphi+\theta) = \cos\varphi\cos\theta - \sin\varphi\sin\theta = \cos\theta\,(x_p - y_p\tan\theta)
y_q = \sin(\varphi+\theta) = \sin\varphi\cos\theta + \cos\varphi\sin\theta = \cos\theta\,(y_p + x_p\tan\theta)    (3)

If the cosine factor is ignored, a pseudo-rotation is formed, and the pseudo-rotation can later be compensated to obtain the real result. The modulus value is increased by 1/\cos\theta times, because removing the cosine simplifies the operation. The iterative process is shown in (4).

x_{i+1} = x_i - d_i y_i 2^{-i}
y_{i+1} = y_i + d_i x_i 2^{-i}
z_{i+1} = z_i - d_i \tan^{-1}(2^{-i})
d_i = \begin{cases} +1, & z_i \ge 0 \\ -1, & z_i < 0 \end{cases}    (4)
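The recurrence (4) can be checked with a short software model. The sketch below is a floating-point reference, not the fixed-point RTL; it pre-scales the starting vector by the accumulated pseudo-rotation gain and then applies (4), so that the rotation mode converges to (cos z, sin z):

import math

def cordic_sin_cos(angle, iterations=16):
    """Circular rotation mode, Eq. (4): drive z toward 0, accumulate the rotation in (x, y)."""
    K = 1.0
    for i in range(iterations):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))   # gain of the pseudo-rotations
    x, y, z = K, 0.0, angle                            # pre-scale so the final vector has unit length
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        x, y, z = (x - d * y * 2.0 ** (-i),
                   y + d * x * 2.0 ** (-i),
                   z - d * math.atan(2.0 ** (-i)))
    return x, y                                        # approx. (cos(angle), sin(angle))

print(cordic_sin_cos(0.6))
print(math.cos(0.6), math.sin(0.6))                    # reference values for comparison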

1) Division method: Traditional division can be realized by shift-and-subtract [12], and it can also be implemented by the Cordic algorithm; the performance differs in scenarios with different bit widths. This paper uses the Cordic algorithm to implement it. In the linear coordinate system, division can be achieved using the vector mode, as shown in (5).

x_n = x_1
y_n = 0
z_n = z_1 + y_1 / x_1
d_i = \begin{cases} -1, & y_i \ge 0 \\ +1, & y_i < 0 \end{cases}    (5)

2) Square root: The initial values are set in the vector mode of the hyperbolic coordinate system according to (6), so that the square root value of a is found indirectly, as shown in (7).

x_1 = a + 1, \qquad y_1 = a - 1, \qquad z_1 = 0    (6)

x_n = A_n \sqrt{x_1^2 - y_1^2} = 2 A_n \sqrt{a}
y_n = 0
z_n = z_1 + \tanh^{-1}(y_1/x_1) = \tfrac{1}{2}\ln\frac{x_1+y_1}{x_1-y_1}
A_n = \prod_{i=1}^{n} \sqrt{1 - 2^{-2i}}    (7)
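Both special cases can be prototyped in the same way. The sketch below is a software reference with assumed iteration counts (the hardware uses the Xilinx Cordic IP instead): division uses the linear-coordinate vector mode of (5), and the square root uses the hyperbolic-coordinate vector mode of (6) and (7), including the usual repetition of iterations 4 and 13 that hyperbolic Cordic needs for convergence.

import math

def cordic_divide(y1, x1, iterations=16):
    """Linear coordinate system, vector mode (Eq. (5)): drive y to 0;
    the quotient y1/x1 accumulates in z. Assumes x1 > 0 and |y1/x1| < 2."""
    x, y, z = x1, y1, 0.0
    for i in range(iterations):
        d = 1.0 if y >= 0 else -1.0
        y -= d * x * 2.0 ** (-i)
        z += d * 2.0 ** (-i)
    return z                                   # approx. y1 / x1

def cordic_sqrt(a, iterations=16):
    """Hyperbolic coordinate system, vector mode (Eqs. (6)-(7)):
    start from x = a + 1, y = a - 1; x converges to 2 * A_n * sqrt(a)."""
    x, y = a + 1.0, a - 1.0
    gain = 1.0
    i, done = 1, 0
    while done < iterations:
        reps = 2 if i in (4, 13) else 1        # iterations 4 and 13 must be repeated
        for _ in range(reps):
            d = -1.0 if y >= 0 else 1.0        # drive y toward zero
            x, y = x + d * y * 2.0 ** (-i), y + d * x * 2.0 ** (-i)
            gain *= math.sqrt(1.0 - 2.0 ** (-2 * i))   # scale factor A_n, accumulated
            done += 1
        i += 1
    return x / (2.0 * gain)                    # approx. sqrt(a)

print(cordic_divide(0.75, 1.5), 0.75 / 1.5)
print(cordic_sqrt(1.68), math.sqrt(1.68))      # 1.68 is one of the test values in Section IV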

xq cos(   )  cos  cos   sin  sin  DESIGN METHOD  cos (xy tan ) The Normalization layer is because there are too many  pp calculation processes such as division and square root. If the  (3) y sin(   )  sin  cos   cos  sin  FPGA uses DSP resources to implement floating-point  q operations, it will consume a lot of resources and excessively  cos (yxpp tan ) long cycles, causing unnecessary time and area costs. The  use of the Cordic algorithm allows the FPGA to implement If cosine is ignored, pseudo-rotation is formed and the specific operations in a shorter period, optimizes the formula pseudo-rotation can be compensated to obtain the real result. implementation of the normalization algorithm on FPGA, The modulus value is increased by 1/ cos times because observes its formula characteristics and designs a more removing cosine can simplify the operation. The iterative rational pipeline to strive to calculate the Normalization layer process is shown in (4). more quickly. FPGAs are good at performing a lot of integer operations, i xxi1 = i d i y i 2 not good at floating-point operations. Dividing, square root,  i exponential function, tangent function, etc. in the yi1  y i d i x i 2 Normalization layer are not suitable for direct  implementation on FPGA. Otherwise, not only a large 1 i (4) zi1  z i d i tan 2 number of DSP resources and LUT resources will be  10z required to increase the design area, but also to cause many  i operations to be unfolded in parallel through the pipeline. di    Increasing the implementation cycle can not take advantage  10zi of FPGA's high parallelism, adaptive, reconfigurable logic 1) Division method functions. The formula of each (x, y) position of each Traditional division can be realized by shift channel of the normalization layer is shown in (8). [12], and can also be implemented by the Cordic algorithm. The performance is different in scenarios with different bit widths. This paper uses the Cordic algorithm to implement it. In a linear coordinate system, division can be achieved using the vector model, as shown in (5).


b^{i}_{x,y} = a^{i}_{x,y} \Big/ \Big( 2 + \tfrac{1}{10000} \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} (a^{j}_{x,y})^{2} \Big)^{3/4}
            = 1000\, a^{i}_{x,y} \Big/ \Big( 20000 + \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} (a^{j}_{x,y})^{2} \Big)^{3/4}
            = 1000\, a^{i}_{x,y} / m^{3/4}
m = 20000 + \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} (a^{j}_{x,y})^{2}    (8)

The functions implemented by this layer are analyzed before the acceleration scheme is designed; the purpose is shown in Fig. 2. The values at the corresponding positions of the channels to be calculated, n/2 channels before and n/2 channels after the current one (n channels in total), are substituted into the above formula; the normalization operation is performed and the calculated value is stored in the same location as the new value. When fewer than n/2 adjacent channels exist in front of or behind the current channel, the missing channels are ignored. Based on this feature, the pipelined (flow-through) method described in this paper is proposed to accelerate the calculation process. The method describes the calculation of the middle channels; the n-1 edge channels can also be given corresponding pipelines in this way, and, thanks to the repeated design of the pipeline, their calculation time does not exceed that of the middle channels, as shown in Figure 3.
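One consequence of (8) is that each output only needs squaring, accumulation, square roots and one division: since m^(3/4) = sqrt(m * sqrt(m)), the 3/4 power can be produced by two square-root units. The sketch below only checks that arithmetic; the exact decomposition used in the hardware pipeline is not spelled out in the paper, so this is one plausible choice:

import math

def norm_unit(a_i, adjacent_sq_sum):
    """One (x, y) unit of Eq. (8): b = 1000 * a_i / m^(3/4), where m is 20000 plus the
    sum of squared activations over the adjacent channels (including the channel itself)."""
    m = 20000.0 + adjacent_sq_sum
    m_34 = math.sqrt(m * math.sqrt(m))     # m^(3/4) formed from two square roots
    return 1000.0 * a_i / m_34

# example: channel value 1.68 with the five adjacent squared activations already summed
print(norm_unit(1.68, 1.68**2 + 0.686**2 + 0.5**2 + 1.1**2 + 0.9**2))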

Fig. 2. Schematic calculation

Fig. 3. Normalization function hardware implementation pipeline design

As shown in the figure above, parallel multipliers based on the addition tree and reasonable implementation cycles can greatly reduce the data wait time and operation delay time of traditional serial instruction processing of the data. Both the Sqrt (square root) and Div (division) blocks in the figure above can be implemented by the Cordic algorithm, which is shorter than the traditional implementation.
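The addition-tree idea can be illustrated in software: instead of accumulating the n squared activations serially, they are reduced pairwise, which needs only about log2(n) adder stages. The sketch below is illustrative only (it counts stages, not FPGA timing):

def tree_sum(values):
    """Pairwise (addition-tree) reduction: about ceil(log2(n)) addition stages
    instead of n - 1 serial additions to combine the parallel squarers' outputs."""
    level = list(values)
    stages = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                 # an odd element passes through to the next stage
            nxt.append(level[-1])
        level = nxt
        stages += 1
    return level[0], stages

squares = [v * v for v in (1.68, 0.686, 0.5, 1.1, 0.9)]   # n = 5 adjacent channels
total, depth = tree_sum(squares)
print(total, depth)    # same sum as serial accumulation, but only 3 addition stages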

IV. EXPERIMENTAL RESULTS AND ANALYSIS

Based on the working principle of the normalization function of Alexnet's Normalization layer, this paper proposes a fast function implementation method on a Cordic-algorithm-based FPGA platform with small error and low accuracy loss. By using the Cordic algorithm, functions such as division, square root and the exponential function are designed and implemented on the FPGA. On this basis, the accelerated design and implementation of the main formulas of the Norm and Softmax layers of the CNN are completed, and the calculation can be finished quickly. The formula acceleration implementation proposed in this paper can use the advantages of the FPGA's high parallelism to perform tiled expansion and process the data of one layer at the same time in a parallel manner, thereby accelerating the calculation of the CNN.

To save area, the calculation accuracy is tested with data of different bit widths. The Cordic IP core provided by Xilinx is used, taking the calculation of 1.68 and 0.686 as examples. The respective outputs, calculation cycles and errors are shown in Table I.

TABLE I. ANALYSIS OF SQUARE ROOT CALCULATION ERROR BASED ON CORDIC ALGORITHM FOR DIFFERENT BIT WIDTHS

Bit width | Operation cycle (Clock) | Absolute error | Relative error
16 | 4.5 | 1.40846E-06 | 1.08344E-06
16 | 4.5 | 4.46792E-05 | 0.000170615
15 | 5.5 | 0.389952205 | 0.299966051
15 | 5.5 | 0.193434572 | 0.738920966
14 | 5.5 | 0.000184514 | 0.000142014
14 | 5.5 | 0.000685548 | 0.002619409
13 | 6.5 | 0.001161076 | 0.00089327
13 | 6.5 | 0.00019726 | 0.000754414
12 | 6.5 | 0.003114201 | 0.0023959
12 | 6.5 | 0.011916017 | 0.045614959
11 | 7.5 | 1.40846E-06 | 1.08359E-06
11 | 7.5 | 0.000136232 | 0.000520528
10 | 7.5 | 0.000184514 | 0.000142062
10 | 7.5 | 0.000441408 | 0.001686573
9 | 8.5 | 0.000184514 | 0.000142276
9 | 8.5 | 0.000197267 | 0.000765157
8 | 8.5 | 0.003114201 | 0.002401312
8 | 8.5 | 1.40846E-06 | 1.08344E-06

a. The true value is calculated by MATLAB.
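Table I can be read as a word-length sweep. The sketch below is a purely software illustration with an assumed integer/fraction split and an ideal square root; it does not reproduce the Xilinx Cordic IP results above, but it shows how truncating the operands of a square root to a given width changes the relative error for the two test values 1.68 and 0.686:

import math

def quantize(x, frac_bits):
    """Round x to a fixed number of fractional bits (one possible rounding model)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

for width in (16, 14, 12, 10, 8):
    frac_bits = width - 2                        # assumed: 2 integer bits, rest fraction
    for value in (1.68, 0.686):                  # the two test inputs from Table I
        approx = quantize(math.sqrt(quantize(value, frac_bits)), frac_bits)
        exact = math.sqrt(value)
        rel = abs(approx - exact) / exact
        print(f"{width:2d} bits  sqrt({value}) ~ {approx:.6f}  relative error {rel:.2e}")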


Fig. 4. Normalization layer function acceleration results

Fig. 5. Normalization layer acceleration results

During the realization of the normalization formula of the Normalization layer, the calculation of one unit of the Normalization layer can be completed in 51.5 cycles. Here a is the input data and b_out is the calculation result; the simulation results are shown in the figures above, and the error analysis result is shown in Table II.
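The error figures in Table II follow directly from the simulated output and the MATLAB reference value; the check below merely reproduces that arithmetic:

out_value = 0.4931640625        # decimal value of the simulated output reported in Table II
true_value = 0.4859711395       # MATLAB reference value reported in Table II
abs_err = abs(out_value - true_value)
rel_err = abs_err / true_value
print(abs_err, rel_err)         # about 7.1929e-03 and 1.4801e-02, matching the table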

TABLE II. ERROR ANALYSIS TABLE OF CALCULATION RESULTS

Output | Corresponding decimal | True value | Absolute error | Relative error
23F2 | 0.4931640625 | 0.4859711395 | 7.192923E-03 | 1.480113203E-02

V. CONCLUSION

Data with different bit widths has a certain impact on calculation accuracy, but the experiments have found that this impact on the normalization layer is small and can be ignored in the calculations. In the experiment, the FPGA was used to accelerate the CNN, mainly the Normalization layer. The Cordic algorithm is used to calculate the division, square root and exponential function instead of the traditional implementation. In this experiment, the Cordic IP core provided by Xilinx is used to realize the calculation of the Cordic algorithm and to communicate with the rest of the design, thus replacing the IP cores for the above functions. The calculation part of the Normalization layer is further optimized through the ideas of pipelining and adder trees, and the calculation of the data of the 5 adjacent channels at each spatial (x, y) position is achieved. In the implementation of the entire layer there are (n-4)*h*w such identical calculations and 4*h*w similar edge calculations; in total, n*h*w calculations turn the convolution output into the new layer formed after normalization, saving a certain amount of time and thereby achieving the acceleration of the Normalization layer of the CNN, which can be applied to the entire CNN to realize CNN inference acceleration. This paper uses Alexnet's normalization layer calculation formula for the experiments. Because its application is nearest-neighbor normalization, this method can also be used on the neighbor normalization layers of other networks; the specific parameters and formulas are adjusted according to the different networks.

ACKNOWLEDGMENT

This work was partially supported by the Field Farming Online Monitoring Technology and System Standard Research project 2018YFF0213602, College of Information and Electrical Engineering, China Agricultural University. We thank the Center for Materials Science and Optoelectronic Engineering of the Institute of Semiconductors, University of Chinese Academy of Sciences, for supporting the research.

REFERENCES

[1] H. Schneiderman and T. Kanade, "A statistical method for 3D object detection applied to faces and cars," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[2] L. Teng, D. Yong, J. Jiang, Y. Wang, and L. Qi, "Optimized deep belief networks on CUDA GPUs," in Proc. International Joint Conference on Neural Networks, 2015.
[3] M. S. Abdelfattah, A. Hagiescu, and D. Singh, "Gzip on a chip: high-performance lossless data compression on FPGAs using OpenCL," 2014.
[4] G. Lei, W. Chao, L. Xi, H. Chen, and X. Zhou, "A power-efficient and high performance FPGA accelerator for convolutional neural networks: work-in-progress," in Proc. Twelfth IEEE/ACM/IFIP International Conference, 2017.
[5] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016.
[6] K. Kayaer and V. Tavsanoglu, "A new approach to emulate CNN on FPGAs for real time video processing," 2008.
[7] J. J. Martínez, F. J. Toledo, and J. M. Ferrández, "New emulated discrete model of CNN architecture for FPGA and DSP applications," in Proc. International Work-Conference on Artificial Neural Networks, 2003.
[8] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, no. 2, 2012.


[9] J. E. Volder, "The birth of CORDIC," Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 25, no. 2, pp. 101-105, 2000.
[10] E. F. Deprettere, P. Dewilde, and R. Udo, "Pipelined architectures for fast VLSI filtering and array processing," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '84), 1984.
[11] R. Andraka, "A survey of CORDIC algorithms for FPGA based computers," in Proc. 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, 1998.
[12] L. Han, W. Hongsheng, Z. Yang, and C. Junguang, "Optimized design and implementation of a shift-subtraction divider based on FPGA," National Defense Technology Foundation, no. 8, pp. 39-42.
