A Cordic-Based Acceleration Method on FPGA for CNN Normalization Layer

2020 International Conference on High Performance Big Data and Intelligent Systems (HPBD&IS) A Cordic-based Acceleration Method On FPGA For CNN Normalization layer Yongxiang Cao Wan’ang Xiao* Jingdun Jia China Agricultural University Institute of Semiconductors Chinese China Agricultural University Key Laboratory of Agricultural Academy of Sciences Key Laboratory of Agricultural Informatization and Standardization, Center of Materials Science and Informatization and Standardization, Ministry of Agriculture and Rural Optoelectronics Engineering University Ministry of Agriculture and Rural Affairs China Agricultural University of Chinese Academy of Sciences Affairs China Agricultural University Beijing, China Beijing, China Beijing, China [email protected] [email protected] [email protected] Abstract—This paper studies the accelerated method used lot of DSP resources. However, the DSP resources on FPGA by FPGA to implement the Alexnet Normalization layer, a boards are very expensive and limited. This limits the impro- classic network of the classical network. This paper uses vement of speed. Therefore, the fixed-point quantization of Cordic algorithm to implement basic operations such as the data on the FPGA needs to be converted into a data type division, exponential function, square root function, etc, suitable for FPGA processing. But this brings some obstacles combining with the characteristics of FPGA to consume and difficulties in the calculation of non-linear functions, floating-point operation resources. This paper analyzes the especially the implementation of transcendental functions errors generated by the different digits of data after such as division, square root and exponential functions. To quantization and fixed-point input of the layer to determine the achieve the hardware acceleration of the convolutional neural final data bit width. Vivado 2018.1 is used for the realization network, I have adopted a method of mixing precision to and simulation environment of this paper. The bit width is selected to meet the design requirements and save resources as save resources and improve computing efficiency. This paper much as possible. After reducing the bit width, the relative proposes a method based on the Cordic algorithm to error is controlled below 0.02 and the Normalization layer is implement the Normalization layer in the CNN on the FPGA completed in 51.5 cycles in the calculation of the spatial and improves the algorithm to expand its convergence range position (x, y) unit. The final result relative error is controlled to the range of data formats required by the mixed- below 0.015 by using the method which greatly improves the acceleration network to achieve Norm layer. Thus, the layer's calculation speed of the Norma- lization layer. forward reasoning is accelerated. Keywords—FPGA, Cordic, Normalization, Acceleration II. CORDIC-BASED ACCELERATION PRINCIPLES Method AND METHODE With the continuous expansion of CNN's application I. INTRODUCTION field and increasing functions, the number of required With the vigorous development of computing power, calculations has gradually increased during the training and Convolutional Neural Network (CNN) has also developed inference process. At present, the existing computing rapidly. Now, new CNNs with more layers and better platforms are gradually unable to meet their needs. Hardware performance continue to appear. It is a multi-layer feed- acceleration refers to the technology of alleviating the forward neural network, whose different layers have workload of the central processing unit by allocating a very different structures and calculation methods. Due to the need large amount of computational work to specialized hardware to collect large-scale image data, video data, and the for image processing. The original convolutional neural increased accuracy of the CNN network structure, the scale network was implemented by software programming on a of data calculation has risen sharply. The number of general-purpose processor[1]. Later, with the expansion of operations and the number of intermediate data have the network scale and depth, the demand for data volume and increased exponentially and the requirements for device computing power has increased dramatically. Currently, to performance increase. Generally, traditional computing implement convolutional neural networks, the training and resources are not able to adapt to their current booming prediction of the network have accelerated, and people have needs. Field Programmable Gate Array(FPGA) has gradually begun to pursue more efficient models and hardware become the best choice for people to deploy and accelerate platforms with faster computing speeds. CNNs and become a current research hotspot. Therefore, FPGA is often used to accelerate the forward inference However, due to the ever-expanding scale of CNNs, the process of CNN, which is used to meet the needs of fast large power consumption of GPUs and the objective processing of pictures and so on. conditions that GPUs are too expensive and cannot be chipped, people gradually turn their attention to FPGAs. At When deploying on FPGA, large-scale high-precision present, some studies have proposed the use of GPU[2], floating-point operations cannot be performed. Even small- FPGA[3], ASIC[4] and even heterogeneous computing scale high-precision floating-point operations will consume a platforms to implement CNN hardware acceleration. 978-1-7281-6512-7/20/$31.00 ©2020 IEEE Authorized licensed use limited to: INSTITUTE OF SEMICONDUCTORS CAS. Downloaded on January 30,2021 at 11:03:19 UTC from IEEE Xplore. Restrictions apply. However, people gradually turn their attention to using between different samples, while nearest neighbor FPGA to accelerate the forward reasoning process of CNN in normalization (LRN) mainly occurs between the outputs of terms of comprehensive design cycle, flexibility, recon- different convolution kernels. Therefore, batch normalization figurability, security, latency, adaptability, power is based on mini-batch data and the learning parameters consumption, etc. This paper presents an effective FPGA- obtained during training need to be used during testing and based IP (Intellectual Property) core design method to inference. However, neighbor normalization is based only on implement some algorithms required by CNN, including IP- its data. Therefore, during the testing and inference process, core implementation process, parameterized RTL-level only input is required. The data of the upper layer can get the software IP core design, hardware synthesis, simulation, and data of the next layer without additional parameters and verification. Reference[3] verified through experiments that external input. Alexnet's calculation formula for the LRN the throughput of the Gzip algorithm of the OpenCL-based layer is as shown in： FPGA platform has a 10-fold performance improvement min(N 1, i n /2) over the CPU platform, which shortens the development i i j 2 cycle by three times compared to Verilog-based FPGAs with bx,,, y a x y/ (k ( a x y ) ) (1) similar performance. Jiantao Qiu[5] proposed a CNN jmax(0, i n /2) accelerator design on an embedded FPGA to deploy a large- i scale image classification model. Reference[6] proposed a Among them, axy, represents the value at the position (x, new processor architecture for implementing discrete-time y) on the ith convolution surface, and bi represents the cellular CNNs on FPGAs. Reference[7] proposed a method xy, for implementing CNN on FPGA and DSP and analyzed normalized value of the nearest neighbor. computational. N CNN and designed an alternative model during deployment. is the total number of convolutional surfaces or pooling Xilinx also released a DNN model to assist FPGA in surfaces, n is the number of adjacent surfaces and the k, α, designing CNNs. At the same time, it released a chip Versal and β are adjustable parameters. By selecting appropriate for deploying CNNs. The AI Engines module was specially parameters, the local response can be normalized, thereby designed to implement CNN calculations. The Processing improving the accuracy of the network. In Alexnet's paper, System and Programmable Logic communicate with each the parameters chosen by the author are k=2, n=5, α=10-4, other to realize the deployment and acceleration of CNN's and β=0.75. FPGA. It can be seen that the design of CNN accelerators by FPGAs has gradually become a hotspot, and the most B. Cordic algorithm important thing in designing accelerators is how each layer In the field of digital signal processing, some theory- implements acceleration methods on hardware with based algorithms cannot be directly mapped to hardware, completely different physical structures and different logical making hardware approximation solutions preferred. The mechanisms, even specific to each Layer acceleration Cordic (Coordinate Rotation Digital Computer) algorithm is method. an iterative solution for the trigonometric function and other transcendental functions. The more iterations it takes, the A. Normalization layer closer it is to the real value[9]. The number of iterations can In neurobiology, there is a concept called inhibition, be set by the requirements of resources and precision. In the which refers to the suppression of neighboring neurons by end, these functions only need to be approximated by fire-collecting neurons. The normalization layer can be addition and shift, greatly reducing the operation of floating- understood as a process of intermediate

A Cordic-Based Acceleration Method on FPGA for CNN Normalization Layer

3.2 the CORDIC Algorithm

CORDIC-Like Method for Solving Kepler's Equation

CORDIC V6.0 Logicore IP Product Guide

A Review on Hardware Accelerator Design and Implementation of CORDIC Algorithm for a Gaming Application

A.1 CORDIC Algorithm

An Optimization of CORDIC Algorithm and FPGA Implementation

FPGA Technology in Beam Instrumentation and Related Tools

A Unified Reconfigurable CORDIC Processor for Floating-Point Arithmetic

Design and Analysis of Double Precision Floating Point Division Operator Based on CORDIC Algorithm

A Highly Optimized Arithmetic Software Library and Hardware Co

A Trigonometric Hardware Acceleration in 32-Bit RISC-V

Cordic Algorithm and Its Applications In