Revista Ciencia e Ingeniería Neogranadina ■ Vol. 30(1) ■ January - July 2020 ■ ISSN: 0124-8170 ■ ISSN-e: 1909-7735 ■ pp. 107-116 ■ Editorial Neogranadina

DOI: https://doi.org/10.18359/rcin.4194

A Hardware Accelerator for the Inference of a Convolutional Neural Network

Edwin González a ■ Walter D. Villamizar Luna b ■ Carlos Augusto Fajardo Ariza c

Abstract: Convolutional Neural Networks (CNNs) are becoming increasingly popular in applications such as image classification, speech recognition, and medicine, to name a few. However, CNN inference is computationally intensive and demands large memory resources. This work proposes a hardware accelerator for CNN inference, implemented as a co-processing scheme. The aim is to reduce hardware resources and achieve the best possible throughput. The design is implemented on the Digilent Arty Z7-20 development board, which is based on the Xilinx Zynq-7000 System on Chip (SoC). Our implementation achieved an accuracy of 97.59 % for the MNIST database using only a 12-bit fixed-point format. Results show that the co-processing scheme, operating at a conservative speed of 100 MHz, can identify around 441 images per second, which is about 17 % faster than a 650 MHz software implementation. It is difficult to compare our results against other Field-Programmable Gate Array (FPGA)-based implementations because they are not exactly like ours. However, some comparisons regarding logical resources used and accuracy suggest that our work could improve on previous ones. In addition, the proposed scheme is compared with a hardware-only implementation in terms of power consumption and throughput.

Keywords: CNN; FPGA; hardware accelerator; MNIST; Zynq

Received: July 12th, 2019 Accepted: October 31st, 2019

Available online: July 15th, 2020

How to cite: E. González, W. D. Villamizar Luna, and C. A. Fajardo Ariza, "A Hardware Accelerator for the Inference of a Convolutional Neural Network", Cien. Ing. Neogranadina, vol. 30, no. 1, pp. 107-116, Nov. 2019.

a Universidad Industrial de Santander. E-mail: [email protected] ORCID: https://orcid.org/0000-0003-2217-9817

b Universidad Industrial de Santander. E-mail: [email protected] ORCID: https://orcid.org/0000-0003-4341-8020

c Universidad Industrial de Santander. E-mail: [email protected] ORCID: https://orcid.org/0000-0002-8995-4585

Hardware accelerator for the inference of a convolutional neural network

Abstract: Convolutional neural networks are becoming increasingly popular in deep learning applications, such as image classification, speech recognition, and medicine, among others. However, these networks are computationally expensive and require large memory resources. This work proposes a hardware accelerator for the inference process of the Lenet-5 network, as a hardware/software co-processing scheme. The aim of the implementation is to reduce hardware resource usage and to obtain the best possible computational performance during the inference process. The design was implemented on the Digilent Arty Z7-20 development board, which is based on the Xilinx Zynq-7000 System on Chip (SoC). Our implementation achieved an accuracy of 97.59 % for the MNIST database using only 12 bits in the fixed-point format. The results show that the co-processing scheme, which operates at 100 MHz, can identify approximately 441 images per second, roughly 17 % faster than a 650 MHz software implementation. It is difficult to compare our implementation with similar ones because the implementations found in the literature are not exactly like the one carried out in this work. However, some comparisons regarding the use of logical resources and accuracy suggest that our work outperforms previous works.

Keywords: CNN; FPGA; hardware accelerator; MNIST; Zynq


1 Introduction

Image classification has been widely used in many areas of the industry. Convolutional neural networks (CNNs) have achieved high accuracy and robustness for image classification (e.g., Lenet-5 [1], GoogLenet [2]). CNNs have numerous potential applications in object recognition [3, 4], face detection [5], and medical applications [6], among others. In contrast, modern CNN models demand a high computational cost and large memory resources. This high demand is due to the multiple arithmetic operations to be solved and the huge amount of parameters to be saved. Thus, large computing systems and high energy dissipation are required.

Therefore, several hardware accelerators have been proposed in recent years to achieve low power usage and high performance [7-10]. In [11], an FPGA-based accelerator is proposed, which is implemented on a Xilinx Zynq-7020 FPGA. They implement three different architectures: a 32-bit floating-point, a 20-bit fixed-point, and a binarized (one-bit) version, which achieved accuracies of 96.33 %, 94.67 %, and 88 %, respectively. In [12], an automated design methodology was proposed to map different subsets of CNN convolutional layers onto multiple processors by partitioning the available FPGA resources. Results achieved 3.8x, 2.2x, and 2.0x higher throughput than previous works on AlexNet, SqueezeNet, and GoogLenet, respectively. In [13], a CNN for a low-power embedded system was proposed. Results achieved 2x energy efficiency compared with a GPU implementation. In [14], some methods are proposed to optimize CNNs regarding energy efficiency and high throughput. In that work, an FPGA-based CNN for Lenet-5 was implemented on the Zynq-7000 platform. It achieved 37 % higher throughput and 93.7 % less energy dissipation than a GPU implementation, with the same error rate of 0.99 % as the software implementation. In [15], a six-layer accelerator is implemented for MNIST digit recognition, which uses 25 bits and achieves an accuracy of 98.62 %. In [16], a 5-layer accelerator is implemented for MNIST digit recognition, which uses 11 bits and achieves an accuracy of 96.8 %.

It is important to note that most of the works mentioned above have used High-Level Synthesis (HLS) software. This tool allows creating a hardware accelerator directly from a high-level software description, without the need to manually write a Register Transfer Level (RTL) design. However, HLS software generally causes higher hardware resource utilization, which explains why our implementation required fewer logical resources than previous works.

In this work, we implemented an RTL architecture for Lenet-5 inference, described directly in a Hardware Description Language (HDL) (e.g., Verilog). It aims to achieve low hardware resource utilization and high throughput. We have designed a Software/Hardware (SW/HW) co-processing scheme to reduce hardware resources. This work established the number of bits in a fixed-point representation that achieves the best trade-off between accuracy and number of bits. The implementation was done using a 12-bit fixed-point format on the Zynq platform. Our results show that there is not a significant decrease in accuracy, besides the low hardware resource usage.

This paper is organized as follows: Section 2 describes CNNs; Section 3 explains the proposed scheme; Section 4 presents details of the implementation and the results of our work; Section 5 discusses the results and some ideas for further research; and finally, Section 6 concludes this paper.

2 Convolutional neural networks

CNNs allow the extraction of features from input data to classify them into a set of pre-established categories. To classify data, CNNs should be trained. The training process aims at fitting the parameters to classify data with the desired accuracy. During the training process, many input data are presented to the CNN with their respective labels. Then, a gradient-based learning algorithm is executed to minimize a loss function by updating the CNN parameters (weights and biases) [1]. The loss function evaluates the inconsistency between the predicted label and the true label.

A CNN consists of a series of layers that run sequentially. The output of a specific layer is the input of the subsequent layer. The CNN typically uses three types of layers: convolutional, sub-sampling, and fully connected layers.


Convolutional layers extract features from the input image. Sub-sampling layers reduce both the spatial size and the computational complexity. Finally, fully connected layers classify the data.

2.1 Lenet-5 architecture

The Lenet-5 architecture [1] includes seven layers (Figure 1): three convolutional layers (C1, C3, and C5), two sub-sampling layers (S2 and S4), and two fully connected layers (F6 and OUTPUT). In our implementation, the size of the input image is 28×28 pixels. Table 1 shows the dimensions of the inputs, feature maps, weights, and biases for each layer.

Convolutional layers consist of kernels (matrices of weights), biases, and an activation function (e.g., the Rectified Linear Unit (ReLu) or the Sigmoid). The convolutional layer takes the feature maps of the previous layer, with depth p, and convolves them with d kernels of the same depth. Then, the bias is added to the output, and this result passes through an activation function, in this case ReLu [17]. Kernels shift over the input with a stride of one to obtain an output feature map. Convolutional layers therefore produce feature maps of (n-k+1)×(n-k+1) pixels (for example, C1: 28 - 5 + 1 = 24).

Note that weight and bias parameters are not found in the S2 and S4 layers. The output of a sub-sampling layer is obtained by taking the maximum value from each batch of the feature map in the preceding layer. A fully connected layer has d feature maps of 1×1 pixels. Each feature map results from the dot product between the input vector of the layer and a weight vector. Following the notation of Table 1, the input and weight vectors have n×n×p elements.

[Figure 1: INPUT 28×28 → Convolutions → C1 feature maps 24×24×6 → Subsampling → S2 feature maps 12×12×6 → Convolutions → C3 feature maps 8×8×16 → Subsampling → S4 feature maps 4×4×16 → Convolutions → C5 feature maps 120 → Full connection → F6 feature maps 84 → Full connection → OUTPUT 10]

Fig. 1. Lenet-5 architecture. Source: Adapted from [1].

Table 1. Dimensions of CNN layers

Layer    Input (n×n×p)   Weight (k×k×p×d)   Bias (d)   Feature map (m×m×d)
C1       28×28×1         5×5×1×6            6          24×24×6
S2       24×24×6         N/A                N/A        12×12×6
C3       12×12×6         5×5×6×16           16         8×8×16
S4       8×8×16          N/A                N/A        4×4×16
C5       4×4×16          4×4×16×120         120        1×1×120
F6       1×1×120         1×1×120×84         84         1×1×84
Output   1×1×84          1×1×84×10          10         1×1×10

Source: The authors.
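For reference, a Keras sketch of this architecture follows (Section 4.1 states that the network was trained with Keras on a TensorFlow backend; the optimizer, loss, number of epochs, and softmax output below are our assumptions, not details given in the paper):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist

# Lenet-5 variant with the dimensions of Table 1 ('valid' convolutions,
# stride 1, 2x2 max pooling, ReLU activations).
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),              # INPUT: 28x28 grayscale digit
    layers.Conv2D(6, 5, activation="relu"),       # C1 -> 24x24x6
    layers.MaxPooling2D(2),                       # S2 -> 12x12x6
    layers.Conv2D(16, 5, activation="relu"),      # C3 -> 8x8x16
    layers.MaxPooling2D(2),                       # S4 -> 4x4x16
    layers.Conv2D(120, 4, activation="relu"),     # C5 -> 1x1x120
    layers.Flatten(),
    layers.Dense(84, activation="relu"),          # F6 -> 84
    layers.Dense(10, activation="softmax"),       # OUTPUT -> 10 digit scores
])

# Assumed training setup on MNIST (60,000 training / 10,000 validation images).
(x_tr, y_tr), (x_va, y_va) = mnist.load_data()
x_tr, x_va = x_tr[..., None] / 255.0, x_va[..., None] / 255.0
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_tr, y_tr, epochs=5, validation_data=(x_va, y_va))
```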


3 Software/hardware co-processing scheme

Fig. 2 presents the proposed SW/HW co-processing scheme to perform the Lenet-5 inference on the Zynq-7000 SoC. It consists of two main parts: an Advanced RISC Machine (ARM) processor and an FPGA, which are connected by the Advanced eXtensible Interface 4 (AXI4) bus [18]. A C application runs on the ARM processor and is responsible for the data transfer between software and hardware. Furthermore, a hardware accelerator is implemented on the FPGA. This accelerator consists of a custom computational architecture that performs the CNN inference process.

[Figure 2: Zynq-7000 SoC with the ARM processor and the hardware accelerator on the FPGA, connected through an AMBA AXI4 interface; the accelerator signals the processor through an interruption line.]

Fig. 2. General software/hardware co-processing scheme. Source: The authors.

3.1 Hardware accelerator

Fig. 3 shows the hardware accelerator that performs the inference process of the CNN. The Finite State Machine (FSM) controls the hardware resources inside the accelerator. REGISTER_BANK contains the dimensions of the entire architecture (Table 1). MEMORY_SYSTEM consists of four Block RAMs (BRAMs). The first one stores the weights, and the second one stores the biases. RAM_3 and RAM_4 store the intermediate values of the process. These memories are overwritten in each layer. All four memories are addressed by the ADDRESS_RAM module.

MATH_ENGINE is the main module in our design because it performs all the necessary arithmetic operations in the three types of layers mentioned in Section 2. All feature maps of all layers are calculated by reusing this module, which highlights the saving of hardware resources achieved with its implementation. This module is used to:

◾ Perform a convolution process in a parallel fashion
◾ Calculate a dot product between vectors
◾ Add the bias
◾ Evaluate the ReLu activation function
◾ Perform the sub-sampling process

[Figure 3: Hardware accelerator block diagram — the CONTROL FSM, REGISTER_BANK, ADDRESS_RAM, the MEMORY_SYSTEM (RAM 1-4), and the MATH_ENGINE; the accelerator is configured through an AXI4-Lite interface and reports completion through an interrupt.]

Fig. 3. Hardware accelerator scheme. Source: The authors.
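The operations listed above can be emulated in software. The sketch below is our own single-channel illustration (not the RTL): each 5×5 window is flattened into a length-25 dot product, the bias and ReLu are applied afterwards, and sub-sampling is a 2×2 max pooling; six such dot products running in parallel match the 150 simultaneous multiplications mentioned in Section 4.3.

```python
import numpy as np

def conv_dot(window25, weights25):
    # One compute block: a 5x5 convolution expressed as a
    # length-25 dot product (also reused for fully connected layers).
    return np.dot(window25, weights25)

def bias_relu(x, bias):
    # Add the bias, then apply the ReLu activation.
    return max(x + bias, 0.0)

def sub_sampling(feature_map):
    # 2x2 max pooling over one feature map (layers S2 and S4).
    m = feature_map.shape[0]
    return feature_map.reshape(m // 2, 2, m // 2, 2).max(axis=(1, 3))

# Six blocks working in parallel, e.g. one per slice of a 5x5x6 kernel
# (our reading of the design; 6 x 25 = 150 multiplications at a time).
x_slices = np.random.rand(6, 25)   # six flattened 5x5 input windows
w_slices = np.random.rand(6, 25)   # six flattened 5x5 kernel slices
acc = sum(conv_dot(x_slices[c], w_slices[c]) for c in range(6))
pixel = bias_relu(acc, bias=0.1)                # one output feature-map pixel
pooled = sub_sampling(np.random.rand(24, 24))   # 24x24 -> 12x12
```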


To perform all the processes mentioned above, the MATH_ENGINE module is composed of the CONV_DOT, BIAS_&_ReLu, and SUB_SAMPLING submodules. The overall architecture of the MATH_ENGINE is shown in Fig. 4. CONV_DOT consists of six blocks. Each block performs either a 5×5 convolution or a dot product between vectors with a length of 25. The BIAS_&_ReLu submodule adds a bias to the output of the CONV_DOT submodule and performs the ReLu function. The SUB_SAMPLING submodule performs a max-pooling process (layers S2 and S4).

3.2 Description of the SW/HW scheme process

Before the accelerator classifies an image, it is necessary to store all the required parameters (weights, biases, and dimensions of the CNN layers). The processor sends the input image and the parameters to the MEMORY_SYSTEM and the REGISTER_BANK, respectively. Initially, RAM3 stores the image. Then, the processor sets the start signal of the FSM, which initiates the inference process. The FSM configures MATH_ENGINE, ADDRESS_RAM, and MEMORY_SYSTEM according to the data in the REGISTER_BANK.

As mentioned, for convolutional and fully connected layers, the CONV_DOT submodule performs the convolution and dot product operations, respectively. The BIAS_&_ReLu submodule adds the bias and performs the ReLu activation function. Moreover, for sub-sampling layers, the input is passed through the SUB_SAMPLING submodule. According to the current layer, the MATH_ENGINE output is either the result of BIAS_&_ReLu or of SUB_SAMPLING.

To reduce the amount of memory required, we implemented a memory re-use strategy. The strategy is based on using just two memories, which are switched to store the input and the output of each layer. For example, in Layer_i the input is taken from RAM3 and the output is stored into RAM4. Then, in Layer_(i+1) the input is taken from RAM4 and the output is stored into RAM3, and so on.

Lenet-5 has ten outputs, one for each digit. The image classification is the digit represented by the output that has the maximum value. The inference process is carried out by executing all the layers. Once this process is finished, the FSM sends an interruption to the ARM processor, which enables the transfer of the result from the hardware accelerator to the processor.

4 Results

This section describes the performance of the SW/HW co-processing scheme implementation. The implementation was carried out on a Digilent Arty Z7-20 development board using two's complement fixed-point arithmetic. This board is designed around the Zynq-7000 SoC with 512 MB of DDR3 memory. The SoC integrates a dual-core ARM Cortex-A9 processor with Xilinx Artix-7 FPGA fabric. The ARM processor runs at 650 MHz and the FPGA is clocked at 100 MHz. The project was synthesized using the Xilinx Vivado Design Suite 2018.4 software.

[Figure 4: MATH_ENGINE block diagram — the CONV_DOT, BIAS_&_ReLu, and SUB_SAMPLING submodules between the MEMORY_SYSTEM read and write ports, sequenced by the FSM.]

Fig. 4. Math engine. Source: The authors.
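As a software analogy of the layer sequencing and RAM3/RAM4 memory re-use described in Section 3.2 (a sketch only, not the RTL), the inference flow can be written as:

```python
import numpy as np

def run_inference(image, layer_fns):
    """Layer-by-layer execution with RAM3/RAM4 ping-pong buffering."""
    buffers = {"ram3": image, "ram4": None}   # RAM3 initially holds the image
    src, dst = "ram3", "ram4"
    for layer in layer_fns:                   # C1, S2, C3, S4, C5, F6, OUTPUT
        buffers[dst] = layer(buffers[src])    # read one RAM, write the other
        src, dst = dst, src                   # swap roles for the next layer
    scores = buffers[src]                     # ten outputs, one per digit
    return int(np.argmax(scores))             # classification = largest output
```

Called with seven layer functions matching Table 1 (for instance, built from the sketches after Fig. 3), this mimics the accelerator's layer-by-layer execution and returns the classified digit.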


4.1 CNN training and validation

The CNN was trained and validated on a set of handwritten digit images from the MNIST database [1]. The MNIST database contains handwritten digits from 0 to 9. The digits are centered on a 28×28-pixel grayscale image. Each pixel is represented by 8 bits, taking values in a range from 0 to 255, where 0 means white background and 255 means black foreground. The MNIST database has a training set of 60,000 images and a validation set of 10,000 images. The Keras API with a TensorFlow backend was used to train and validate the CNN in Python. Python was set up to use a floating-point representation to achieve high precision.

Furthermore, the inference process was iterated eight times on the proposed scheme by changing the word length and the fractional length. The CNN parameters and the validation set were quantized in two's complement fixed-point format using MATLAB. Table 2 shows the accuracy obtained in each validation.

Table 2. Accuracy for different data representations

Precision        Word Length [bits]   Integer, Fraction   Accuracy %
Floating Point   32                   N/A                 98.85
Fixed Point      17                   7,10                98.85
Fixed Point      15                   6,9                 98.70
Fixed Point      16                   8,8                 98.70
Fixed Point      15                   7,8                 98.70
Fixed Point      14                   6,8                 98.68
Fixed Point      12                   6,6                 97.59
Fixed Point      11                   6,5                 91.46
Fixed Point      16                   5,11                61.31

Source: The authors.

In the Integer, Fraction column, the first value indicates the integer length and the second the fractional length. A starting point was set by finding the number of bits needed to represent the maximum value of the integer part.

4.2 Execution time

The number of clock cycles spent in the data transfer between software and hardware is counted by a global timer. A counter was implemented on the FPGA, which counts the number of clock cycles required by the hardware accelerator to perform the inference process. Table 3 shows the execution time of the implementation.

Table 3. Execution times per image

Process               Time [ms]
Load parameters       7.559*
Load image input      0.122
Inference process     2.143
Extract output data   0.002
Total time            2.268

Source: The authors.

The parameters are sent to the hardware only once, when the accelerator is configured by the processor. Therefore, the time to upload the parameters (marked with * in Table 3) was not considered in the total execution time. The execution time per image is 2.268 ms. Thus, our implementation achieves a throughput of approximately 441 images per second (1 / 2.268 ms ≈ 440.9).

4.3 Hardware resource utilization

Table 4 shows the hardware resource utilization for different word lengths. As the MATH_ENGINE performs 150 multiplications in parallel (six CONV_DOT blocks of 25 multiplications each), the number of Digital Signal Processors (DSPs) used is constant in all cases: 150 DSPs in the MATH_ENGINE and 3 DSPs in the rest of the design.

Note that a shorter word length reduces hardware resources. However, it is important to consider how accuracy is affected by the integer and fractional lengths (Table 2).
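As an illustration of the quantization described in Section 4.1, the following sketch converts values to a two's complement fixed-point format with a given integer and fractional length. The rounding and saturation behavior, and reading the integer length as including the sign bit, are our assumptions, since the paper only specifies the word and fractional lengths.

```python
import numpy as np

def to_fixed_point(x, int_bits, frac_bits):
    """Quantize to two's complement fixed point with `int_bits` integer bits
    (including sign, assumed) and `frac_bits` fractional bits."""
    scale = 2 ** frac_bits
    lo = -2 ** (int_bits - 1)               # most negative representable value
    hi = 2 ** (int_bits - 1) - 1.0 / scale  # most positive representable value
    q = np.round(x * scale) / scale         # round to the nearest step (assumed)
    return np.clip(q, lo, hi)               # saturate on overflow (assumed)

w = np.array([0.7183, -1.25, 3.999, -40.0])
print(to_fixed_point(w, 6, 6))   # (6,6): the 12-bit format of the final design
```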


Table 4. Comparison of hardware resource utilization for different word lengths

Precision     Word Length [bits]   LUTs   FFs    BRAM [KB]   Max. accuracy %
Fixed point   17                   4738   2922   173.25      98.85
Fixed point   16                   4634   2892   164.25      98.70
Fixed point   15                   4549   2862   155.25      98.70
Fixed point   14                   4443   2832   146.25      98.68
Fixed point   12                   4254   2772   119.25      97.59
Fixed point   11                   4151   2742   114.75      91.46

Source: The authors.

4.4 Power estimation

The power estimation for the proposed scheme was made using the Xilinx Vivado software (Fig. 5). This estimation only reports the consumption of the Zynq-7000 SoC (the DDR3 RAM is considered part of the ARM processor).

The FPGA consumes power in the software implementation due to the architecture of the Zynq-7000 platform. Note that the ARM processor consumes most of the total power even when the hardware accelerator performs the inference process, which takes most of the execution time.

[Figure 5: Distribution of the 1.719 W power estimate for the SW/HW scheme between the ARM processor and the FPGA.]

Fig. 5. Power estimation for the proposed scheme. Source: The authors.

5 Discussion

We compared the implementation of our co-processing scheme with two different implementations on the Zynq-7000 platform: software-only and hardware-only solutions. In the software-only solution, the input image and CNN parameters are taken to the DDR3 RAM. The hardware-only solution uses a serial communication module (Universal Asynchronous Receiver-Transmitter, UART) to replace the ARM processor in the co-processing scheme.

Table 5 shows the results of the three implementations on the Arty Z7 board. Although the proposed scheme uses less than half the word length of the software-only solution, the accuracy only fell by 1.27 %. Also, our co-processing scheme achieved the highest throughput. The hardware-only implementation is a low-power version of the proposed scheme. Future research will focus on improving the performance of the hardware-only solution. Note that the hardware-only implementation has the lowest throughput because of the bottleneck imposed by the UART (the UART is used to transfer the input image, while the CNN parameters are stored in BRAMs). However, this implementation could be the best option for applications in which power consumption is critical and throughput is not.

As mentioned, it is difficult to compare our results against other FPGA-based implementations because they are not exactly like ours. However, some comparisons can be made regarding the use of logical resources and accuracy. Table 6 presents a comparison with some predecessors.


Table 5. Power estimation for different implementations

Metric                       Co-processing scheme   Hardware-only        Software-only
Data format                  12-bit fixed-point     12-bit fixed-point   32-bit floating-point
Frequency (MHz)              100                    100                  650
Accuracy (%)                 97.59                  97.59                98.85
Throughput (images/second)   441                    7                    365
Power estimated (W)          1.719                  0.123                1.403

Source: The authors.
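As a derived illustration (not a figure reported in the paper), combining the power and throughput rows of Table 5 gives the energy spent per classified image:

```python
# Energy per image = estimated power / throughput (values taken from Table 5).
implementations = {
    "co-processing": (1.719, 441),   # (power [W], images per second)
    "hardware-only": (0.123, 7),
    "software-only": (1.403, 365),
}
for name, (power_w, imgs_per_s) in implementations.items():
    print(f"{name}: {1000 * power_w / imgs_per_s:.1f} mJ/image")
# co-processing: 3.9 mJ/image, hardware-only: 17.6 mJ/image,
# software-only: 3.8 mJ/image
```

On this derived metric, the co-processing and software-only solutions are similar, while the hardware-only solution trades much lower instantaneous power for a higher energy cost per image.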

Table 6. Comparison with some predecessors

Metric                Our Design                [11]                      [15]                      [16]
Model                 3 Conv, 2 Pooling, 2 FC   2 Conv, 2 Pooling, 1 FC   2 Conv, 2 Pooling, 2 FC   2 Conv, 2 Pooling
Fixed point           12 bits                   20 bits                   25 bits                   11 bits
Operation frequency   100 MHz                   100 MHz                   100 MHz                   150 MHz
BRAM                  45                        3                         27                        0
DSP48E                158                       9                         20                        83
FF                    2 772                     40 534                    54 075                    40 140
LUT                   4 254                     38 899                    14 832                    80 175
Accuracy              97.59 %                   94.67 %                   98.62 %                   96.8 %

Source: The authors.

Note that our design uses fewer Look-Up Tables (LUTs) and Flip-Flops (FFs) than these previous works. Only [15] achieves a better accuracy, because that implementation uses more bits; however, this number of bits increases the number of logical resources.

6 Conclusion

In this paper, an SW/HW co-processing scheme for Lenet-5 inference was proposed and implemented on a Digilent Arty Z7-20 board. Results show an accuracy of 97.59 % using a 12-bit fixed-point format. The implementation of the proposed scheme achieved a higher throughput than a software implementation on the Zynq-7000 platform. Our results suggest that the usage of a fixed-point data format allows the reduction of hardware resources without compromising accuracy. Furthermore, the co-processing scheme makes it possible to improve the inference processing time. This encourages future advances in energy efficiency on embedded devices for deep learning applications.

References

[1] Y. LeCun et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998. https://doi.org/10.1109/5.726791

[2] C. Szegedy et al., "Going deeper with convolutions," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 1-9. https://doi.org/10.1109/CVPR.2015.7298594

[3] A. Dundar, J. Jin, B. Martini, and E. Culurciello, "Embedded streaming deep neural networks accelerator with applications," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 7, pp. 1572-1583, July 2017. https://doi.org/10.1109/TNNLS.2016.2545298

[4] B. Ahn, "Real-time video object recognition using convolutional neural network," in 2015 International Joint Conference on Neural Networks (IJCNN), July 2015, pp. 1-7. https://doi.org/10.1109/IJCNN.2015.7280718

[5] B. Yu, Y. Tsao, S. Yang, Y. Chen, and S. Chien, "Architecture design of convolutional neural networks for face detection on an fpga platform," in 2018 IEEE International Workshop on Signal Processing Systems (SiPS), Oct. 2018, pp. 88-93. https://doi.org/10.1109/SiPS.2018.8598428

[6] Z. Xiong, M. K. Stiles, and J. Zhao, "Robust ecg signal classification for detection of atrial fibrillation using a novel neural network," in 2017 Computing in Cardiology (CinC), Sep. 2017, pp. 1-4. https://doi.org/10.22489/CinC.2017.066-138

[7] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, "Angel-eye: A complete design flow for mapping cnn onto embedded fpga," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35-47, Jan. 2018. https://doi.org/10.1109/TCAD.2017.2705069

[8] N. Suda et al., "Throughput-optimized opencl-based fpga accelerator for large-scale convolutional neural networks," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '16. New York, NY, USA: ACM, 2016, pp. 16-25. https://doi.org/10.1145/2847263.2847276

[9] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing fpga-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '15. New York, NY, USA: ACM, 2015, pp. 161-170. https://doi.org/10.1145/2684746.2689060

[10] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, "Accelerating deep convolutional neural networks using specialized hardware," Microsoft Research Whitepaper, vol. 2, no. 11, pp. 1-4, 2015.

[11] T. Tsai, Y. Ho, and M. Sheu, "Implementation of fpga-based accelerator for deep neural networks," in 2019 IEEE 22nd International Symposium on Design and Diagnostics of Electronic Circuits Systems (DDECS), April 2019, pp. 1-4. https://doi.org/10.1109/DDECS.2019.8724665

[12] Y. Shen, M. Ferdman, and P. Milder, "Maximizing cnn accelerator efficiency through resource partitioning," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), June 2017, pp. 535-547. https://doi.org/10.1145/3140659.3080221

[13] Y. Wang et al., "Low power convolutional neural networks on a chip," in 2016 IEEE International Symposium on Circuits and Systems (ISCAS), May 2016, pp. 129-132. https://doi.org/10.1109/ISCAS.2016.7527187

[14] G. Feng, Z. Hu, S. Chen, and F. Wu, "Energy-efficient and high-throughput fpga-based accelerator for convolutional neural networks," in 2016 13th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), Oct. 2016, pp. 624-626. https://doi.org/10.1109/ICSICT.2016.7998996

[15] S. Ghaffari and S. Sharifian, "FPGA-based convolutional neural network accelerator design using high level synthesize," in Proceedings - 2016 2nd International Conference of Signal Processing and Intelligent Systems, ICSPIS 2016, 2016, pp. 1-6. https://doi.org/10.1109/ICSPIS.2016.7869873

[16] Y. Zhou and J. Jiang, "An FPGA-based accelerator implementation for deep convolutional neural networks," in Proceedings of 2015 4th International Conference on Computer Science and Network Technology, ICCSNT 2015, 2015, vol. 01, pp. 829-832. https://doi.org/10.1109/ICCSNT.2015.7490869

[17] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning, ser. ICML'10. USA: Omnipress, 2010, pp. 807-814. [Online]. Available: http://dl.acm.org/citation.cfm?id=3104322.3104425

[18] Xilinx, AXI Reference Guide (UG1037). [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/axi_ref_guide/latest/ug1037-vivado-axi-reference-guide.pdf
