
Article

A Systolic Accelerator for Neuromorphic Visual Recognition

Shuo Tian 1,*, Lei Wang 1, Shi Xu 2, Shasha Guo 1, Zhijie Yang 1, Jianfeng Zhang 1 and Weixia Xu 1

1 College of Computer Science and Technology, National University of Defense Technology, Changsha 410000, China; [email protected] (L.W.); [email protected] (S.G.); [email protected] (Z.Y.); [email protected] (J.Z.); [email protected] (W.X.)
2 National Innovation Institute of Defense Technology, Beijing 100000, China; [email protected]
* Correspondence: [email protected]; Tel.: +86-150-8472-1519

 Received: 27 August 2020; Accepted: 10 October 2020; Published: 15 October 2020 

Abstract: Advances in neuroscience have encouraged researchers to focus on developing computational models that behave like the human brain. HMAX is one of the potential biologically inspired models that mimic the functions and structures of the primate visual cortex. HMAX has shown its effectiveness and versatility in multi-class object recognition with a simple computational structure. However, it remains a challenge to implement the HMAX model in embedded systems because of its computationally heaviest phase, the S2 layer. Previous implementations such as CoRe16 have used a reconfigurable two-dimensional processing element (PE) array to speed up the S2 layer of HMAX. However, the adder tree mechanism that CoRe16 uses to produce output pixels, accumulating partial sums across different PEs, increases the runtime of HMAX. To speed up the execution of the S2 layer, in this paper we propose SAFA (systolic accelerator for HMAX), a systolic-array based architecture that computes and accelerates the S2 stage of HMAX. Using the output stationary (OS) dataflow, each PE in SAFA not only calculates its output pixel independently, without additional accumulation of partial sums across multiple PEs, but also reduces the multiplexers required by reconfigurable accelerators. Besides, data forwarding of the same input or weight data in the OS dataflow reduces the memory bandwidth requirements. The simulation results show that the runtime of the computationally heaviest S2 stage of the HMAX model is decreased by 5.7%, and the required memory bandwidth is reduced by 3.53× on average across different kernel sizes (excluding kernel = 12), compared with CoRe16. In ASIC synthesis, SAFA also obtains lower power and area costs than other reconfigurable accelerators.

Keywords: neuromorphic; HMAX model; systolic array; hardware accelerator

1. Introduction

The human brain is the most power-efficient processor. The last three decades have seen great success in understanding the ventral and dorsal pathways of the human visual cortex. These advances have motivated computer vision scientists to develop so-called "neuromorphic vision" for perception and cognition [1–4]. Neuromorphic models have shown high computing power and impressive performance with simple neural structures via reverse-engineering of the human brain. Among them, the hierarchical model and X (HMAX) [5] is one of the potential primate cortical models. HMAX was first proposed by Riesenhuber and Poggio in 1999 [5] and later extended by Serre et al. in 2007 [6]. It is a feedforward hierarchical feature model for object recognition tasks with a simple computational structure. HMAX achieves an excellent balance between computation cost, simplicity, and recognition performance, which benefits embedded and low-power systems. Thus,


HMAX has attracted many researchers to improve its performance or to accelerate its computation process [7–12]. HMAX mainly consists of four processing stages (S1, C1, S2, C2) with a hierarchy of alternating S (convolutional) and C (pooling) layers that progressively integrate inputs from the lower layers. Among the four stages, the S2 layer is the most computation-intensive: it convolves each scale of the C1 layer with a large number of convolution kernels. The computational complexity of the S2 layer is O(k²rM²Pi), where k × k is the kernel size, r is the number of orientations, M × M is the size of the input C1 layer, and Pi (i ≈ 4000) is the number of weight patches. The Cortical Network Simulator (CNS) [13] for HMAX on GPU also indicates that HMAX spends nearly 85% of its total runtime computing the S2 features. Therefore, deploying the HMAX model for real-time artificial intelligence at the edge remains a significant challenge.

Fortunately, much work has investigated accelerating the S2 layer of HMAX [11,14,15]. Sabarad et al. [11] composed processing element (PE) arrays into a reconfigurable two-dimensional convolution accelerator, CoRe16, on FPGA to accelerate the computationally heaviest S2 layer of the HMAX model. Experimental results show that it achieves speedups of 5 to 15 times over the CNS-based implementation. Park et al. [15] used a different reconfigurable PE array to accelerate the S2 layer of HMAX. However, the adder tree mechanism in CoRe16, Park et al. [15], and Maashri et al. [14], which calculates each output pixel by accumulating partial sums across different PEs, lowers the execution speed for HMAX.

To further speed up the execution of the S2 layer in HMAX, in this paper we propose SAFA (systolic accelerator for HMAX), a systolic-array based accelerator for the S2 stage of the HMAX model. Systolic arrays achieve high computational parallelism through direct communication between large numbers of relatively simple, identical PEs [16]. Work in [17] indicates that the output stationary (OS) dataflow in a systolic array runs faster than the input stationary (IS) or weight stationary (WS) dataflows. We therefore use the OS dataflow [18] to accelerate the S2 computation in SAFA. Each PE in SAFA produces an output pixel independently, without the additional accumulation of partial sums across multiple PEs used in CoRe16, which increases the calculation speed. Unlike the traditional PE, which executes multiplication and accumulation (MAC) in a systolic array, our PE performs a Gaussian radial basis function (GRBF) [19] operation for the S2 layer of HMAX. Following the OS dataflow, our PE also needs fewer multiplexers, so SAFA attains lower power and area costs than reconfigurable accelerators. Besides, data forwarding of the same input or weight data within one cycle of the OS dataflow reduces the required memory bandwidth. The main contributions are as follows:

1. We propose a systolic convolution array for the HMAX model that can calculate convolutions of different sizes.
2. We utilize the OS dataflow to compute each output pixel in every PE independently, which speeds up the execution of the S2 layer in HMAX. Our PE not only matches the computation of the S2 layer in HMAX, but also reduces the multiplexers used in SAFA, which helps to obtain low power and area costs. Meanwhile, data forwarding of the same input feature map or weight parameters greatly reduces the memory bandwidth requirement of SAFA.
3. We compared the runtime of SAFA with different array shapes and found the best form of the systolic array for accelerating the S2 layer of HMAX. The simulation results show that the runtime of the most computationally intensive S2 phase is decreased by 5.7%, and the storage bandwidth is reduced by 3.53 times on average across different kernel sizes (except for kernel = 12) compared with CoRe16. The synthesis results of SAFA on ASIC also show lower power and area costs than reconfigurable accelerators such as CoRe16.

The rest of our work is organized as follows. Section 2 presents the background and preliminary information. In Section 3, we introduce the structure of SAFA, the PE function, and the dataflow for SAFA in detail. Simulation results and discussion in comparison with CoRe16 and other reconfigurable accelerators are given in Section 4. Finally, we conclude this paper in Section 5.

2. Background and Preliminary Information

The HMAX model we utilize in this paper was extended by Serre et al. [6]. As shown in Figure 1, the HMAX model [6] is a four-stage (S1, C1, S2, C2) feature extraction model that follows a hierarchical layout of alternating convolutional template matching (S1 and S2 layers) and max pooling operations (C1 and C2 layers). The overall goal of the HMAX model is to transform grayscale images into feature vectors, which can be passed to a support vector machine (SVM) for the final classification. The detailed computation process of HMAX is as follows.


Figure 1. Overview of the HMAX model. The gray-value image is first processed by convolutions at four orientations and eight scales (the full model uses 16 scales) in the S1 layer. The next C1 layer performs local max pooling over a neighborhood of the S1 layer, in both scale and space. In the next stage, the S2 layer applies the GRBF unit with a set of patches that were randomly sampled from a set of representative images [6]. Finally, a global max pooling operation over the S2 layer generates the C2 values.

S1 layer: The grayscale input image is first convolved with Gabor filters in the S1 layer at different orientations and at every possible position. As shown in Table 1, the filter sizes in the S1 layer range from 7 × 7 to 37 × 37 with a step size of two pixels (only four scales are visualized in Figure 1). Thus, the S1 layer has 64 different receptive fields (4 orientations × 16 scales) used for convolution.

C1 layer: The C1 layer is the first max pooling layer; it pools over two nearby S1 scales with the same orientation within a scale band. This operation also subsamples the S1 features, leading to a smaller C1 layer while maintaining the same number of orientations as the S1 layer. The pooling units spread from 8 × 8 to 22 × 22 with a step size of two for the corresponding scale bands of the S1 layer. Note that the C1 responses are not calculated at every possible location: the C1 units are sampled with an overlap parameter ranging from 4 to 11 with a step size of 1 (Table 1), corresponding to the different scale bands of the S1 layer. This also makes the computations in the next layer more efficient.

Table 1. Summary of the S1 and C1 layer parameters. The filter sizes of the S1 layer range from 7 × 7 to 37 × 37 with a step size of 2, giving 64 different receptive fields in total (4 orientations × 16 scales). The C1 layer pooling sizes spread from 8 × 8 to 22 × 22 with a step size of 2, corresponding to the S1 layer scale bands.

Scale Band    S1 Filter Sizes      C1 Spatial Pooling Grid    Overlap
Band 1        7 × 7, 9 × 9         8 × 8                      4
Band 2        11 × 11, 13 × 13     10 × 10                    5
Band 3        15 × 15, 17 × 17     12 × 12                    6
Band 4        19 × 19, 21 × 21     14 × 14                    7
Band 5        23 × 23, 25 × 25     16 × 16                    8
Band 6        27 × 27, 29 × 29     18 × 18                    9
Band 7        31 × 31, 33 × 33     20 × 20                    10
Band 8        35 × 35, 37 × 37     22 × 22                    11
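To make these parameters concrete, the short sketch below (our illustration, not part of the original implementation) derives the C1 grid size for each scale band from Table 1, assuming C1 units are evaluated every `overlap` pixels over the S1 responses of a 256 × 256 input, the image size used later in Section 3.3; the band-1 result reproduces the 63 × 63 grid quoted there.

```python
# Sketch: C1 grid size per scale band, assuming C1 units are evaluated
# every `overlap` pixels over the S1 responses of a 256 x 256 image.
BANDS = [  # (C1 pooling grid, overlap) from Table 1
    (8, 4), (10, 5), (12, 6), (14, 7), (16, 8), (18, 9), (20, 10), (22, 11),
]
IMG = 256  # S1 convolution preserves the input image size

for band, (pool, overlap) in enumerate(BANDS, start=1):
    c1 = (IMG - pool) // overlap + 1
    print(f"Band {band}: pool {pool} x {pool}, overlap {overlap} "
          f"-> C1 grid {c1} x {c1}")
# Band 1 prints 63 x 63, i.e., M = (256 - 8)/4 + 1 as used in Section 3.3.
```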

S2 layer: The S2 features are computed by template matching each position and scale band of the C1 layer (referred to as Xs) against a dictionary of predefined patches (Pi). Each prototype kernel Pi is of size n × n × r, where n ∈ {4, 8, 12, 16} and r denotes the number of orientations. The following GRBF [19] gives the response of S2:

R(Xs, Pi) = exp(−‖Xs − Pi‖² / (2ασ²)),   (1)

where Xs (s = 1, . . . , 4) represents the scale bands of the C1 layer, and Pi (i = 1, . . . , 4000) indexes the prototype kernels in the dictionary. We set α = (n/4)² and the standard deviation σ to 1, the same as CoRe16 [11]. The S2 output can be Pi (i ≈ 4000) groups of feature maps, depending on the number of Pi.

C2 layer: The C2 response is calculated by a global max pooling over all scales and positions of each S2 type to obtain a single vector. Each S2 type corresponds to the match between a prototype Pi and the input image at every position and scale. The C2 response RC2 is expressed as

RC2 = max{R(Xs, Pi)}.   (2)

To implement the S2 and C2 computations in hardware, the calculations can alternatively be reformulated as

R(Xs, Pi) = ‖Xs − Pi‖² / (2ασ²),   (3)

RC2 = exp(−min{R(Xs, Pi)}).   (4)

This reformulation shows that the exponential operation of the S2 calculation can be moved to after the C2 layer to reduce the number of exponential functions. Consequently, the C2 operation becomes a global minimum rather than a global maximum, and we only need to apply the exponential once to the reduced data (Equation (3)) to obtain the C2 vectors. Similarly, the constant denominator term in Equation (3) can be precomputed for a given Pi kernel size. This reduces the computing resources (multipliers) needed for the S2 layer, which in turn are used to accelerate the convolution operations. Above all, after the reformulation we only need to compute the Euclidean distance between the C1 input and the patches (‖Xs − Pi‖²) to accelerate the S2 process in hardware.
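To make the reformulation concrete, the following small NumPy sketch (an illustrative check under our sign convention for Equation (4), not the hardware datapath) evaluates the C2 response both ways; because exp(−x) is monotonically decreasing, taking the maximum of exp(−d) over the distances d is equivalent to exponentiating the negated minimum of d.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 4, 4                         # kernel size and number of orientations
alpha, sigma = (n / 4) ** 2, 1.0    # constants as defined above

Xs = rng.random((10, n, n, r))      # 10 candidate C1 windows of one scale band
Pi = rng.random((n, n, r))          # one prototype patch

# Reduced response of Eq. (3): plain Euclidean distances, no exponential.
d = np.array([np.sum((x - Pi) ** 2) for x in Xs]) / (2 * alpha * sigma ** 2)

c2_direct = np.max(np.exp(-d))      # Eq. (1) response, then global max (Eq. (2))
c2_reform = np.exp(-np.min(d))      # global min first, one exponential (Eq. (4))

assert np.isclose(c2_direct, c2_reform)
print(c2_direct, c2_reform)
```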

3. SAFA: Systolic Accelerator for HMAX

In this section, we introduce the accelerator architecture SAFA (systolic accelerator for HMAX), which accelerates the S2 computation of HMAX. SAFA is built from a systolic array with highly modular spatial and temporal regularity.

3.1. Structure of the Systolic Array

As shown in Figure 2, SAFA is a 16 × 16 systolic array that contains 256 identical and simple PEs. The global, large fan-out interconnect is divided into local interconnections between neighboring PEs. Besides, the input/output data are transferred into/from the PE array between neighboring PEs, avoiding redundant multiplexers. With these short, local, peer-to-peer interconnects, SAFA can sustain a high clock frequency with massive parallelization, even over a thousand PEs.

[Figure 2 diagram: a control interface and a patch fetcher feed the 16 × 16 array of PEs, PE(0,0) through PE(15,15).]

Figure 2. The structure of SAFA. SAFA is a 16 × 16 systolic array containing 256 identical and simple PEs. The keys to SAFA are (1) local interconnects and (2) shifting data transfer.

3.2. Processing Element

The processing element we designed is the fundamental computational element of SAFA. Figure 3 shows the schematic diagram of our PE. The PE includes three calculation components and four registers. The activation register passes C1 input data to the right neighboring PE, and the patch register passes the S2 patch to the PE below. The subtractor is followed by a multiplier to calculate the partial sum of the Euclidean distance between the C1 feature and the patch parameter. The adder then accumulates this partial sum. The control register governs the number of accumulations; it depends on the size of the Pi kernel and the orientation, ensuring that each PE computes one output pixel of the S2 layer. Once the current PE has computed one output pixel, a register transfers it to the right neighboring PE.

Unlike the traditional systolic array, where the PE performs a MAC operation, our PE implements subtraction, multiplication, and accumulation to carry out the reduced GRBF [19] function of the S2 layer (details are given in Section 2). Compared with reconfigurable accelerators such as CoRe16 or Park et al. [15], our PE also does not use multiplexers or other selectors to choose the available C1 layer input or patches. This reduces the computational resources and helps decrease the computing power of SAFA.


Figure 3. Schematic diagram of a single PE in SAFA. Unlike the traditional systolic array where the PE performs a MAC operation, our designed PE implements subtraction, multiplication, and accumulation operations to carry out the reduced GRBF function (details are illustrated in Section 2) in the S2 layer.
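As an illustration of this behavior, here is a minimal cycle-level sketch of the PE datapath in Python (the class name and interface are ours, not the RTL): each step performs one subtract-multiply-accumulate of the reduced GRBF and forwards its operands to the neighboring PEs, and a PE owns exactly one output pixel.

```python
class SafaPE:
    """Behavioral sketch of one SAFA PE (illustrative, not the RTL)."""

    def __init__(self, accum_limit):
        self.acc = 0.0                  # running partial Euclidean distance
        self.count = 0                  # control register: accumulations done
        self.accum_limit = accum_limit  # n * n * r steps per output pixel

    def step(self, activation, patch):
        # Subtractor -> multiplier -> adder: one reduced-GRBF partial sum.
        diff = activation - patch
        self.acc += diff * diff
        self.count += 1
        # The activation is forwarded to the right neighbor and the patch
        # to the neighbor below; the result leaves only when complete.
        result = self.acc if self.count == self.accum_limit else None
        return activation, patch, result

# With kernel n = 4 and r = 4 orientations, one output pixel takes
# 4 * 4 * 4 = 64 subtract-multiply-accumulate steps in a single PE.
pe = SafaPE(accum_limit=64)
```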

3.3. Schematic Dataflow for SAFA

Figure 4 depicts the schematic dataflow for SAFA, which mainly follows the OS dataflow. Assume the input C1 feature map is of size M × M × r; then the output S2 feature map for each Pi is of size (M − n + 1) × (M − n + 1) × r. We mainly use the HMAX model extended by reference [6] in this paper. It is a classic implementation of the HMAX model that exhibits an excellent balance between recognition performance and computation cost across many different object categories. A larger number of orientations r, a larger kernel size n, and a larger number of patches Pi can improve the accuracy of HMAX. Reference [6] indicates that with four orientations, a kernel size of four, and 400 patches, HMAX provides sufficient performance for object classification. Considering both performance and simplicity, we therefore set the orientation to four, the kernel size to four, and the number of patches to 400 in this paper. Besides, we use the same input image size of 256 × 256 as CoRe16 [11]. The S1 layer has the same size as the input image, since the convolutional Gabor filters of the S1 layer do not change the size of the input feature map. As shown in Table 1, the first scale band of the C1 layer is obtained from the S1 layer through 8 × 8 pooling with an overlap of 4. As a result, the size of the first C1 scale band is 63 × 63 × 4 (M = (256 − 8)/4 + 1). We take this first C1 scale band as an example to present the OS dataflow for SAFA.

The green dashed lines in Figure 4 illustrate the computation sequence of each PE in SAFA. As shown in Figure 4, PE(0,0) receives the C1 input A1,1 and the P1 patch group value a1,1 at the first cycle (the first green dashed line). PE(0,0) performs the subtraction and multiplication of the two inputs and accumulates the result in the adder (the reduced GRBF function) within the PE, along with the previous partial sums. Meanwhile, the other PEs are idle, since no data arrive at one or both of their inputs. In cycle 2 (the second green dashed line), the patch value a1,1 is passed to PE(1,0), and the input A1,1 is shifted from PE(0,0) to PE(0,1). Consequently, both PE(0,1) and PE(1,0) have the data required to execute an operation. In the same cycle, PE(0,0) can also compute with new data from the C1 input layer. For the 16 × 16 systolic array shown in Figure 4, all PEs are active after 31 cycles. From then on, in each cycle they process the data from their neighboring PEs synchronously, execute their computation, and transfer data to the next PEs simultaneously. Once the in-PE computation has been completed, the output pixel in the shift register is passed across the horizontal PEs to external memory. Each blue value in Figure 4 represents the final pixel of the C1 layer needed to compute one output pixel. Once all PEs are active, in every cycle SAFA receives 16 input pixels and performs 256 subtract-multiply-accumulate operations with the corresponding patch parameters. Each row's PEs are responsible for the output pixels of different patch groups, whereas each column produces the adjacent output pixels of the same patch group.

Figure 4 also shows that the same red values A1,4 are used four times (equal to the kernel size n) in one cycle, so we only need to fetch them from memory once. This significantly reduces the required bandwidth of SAFA.

The accuracy of the HMAX model depends on the data bit-width of the hardware implementation. A higher bit-width usually means better accuracy, while SAFA is mainly intended for embedded systems. Considering both performance and power consumption, we set the data bit-width to 32 bits, similar to reference [20]. Examining the data stream of SAFA: for the 16 rows of C1 input, we load 5 × 32 = 160 bits per cycle; for the 16 rows of patches, we need 16 × 32 = 512 bits per cycle, since each patch group is different. In total, SAFA only needs to load 512 + 160 = 672 bits per cycle.


Figure 4. The SAFA dataflow diagram. We mainly use the OS dataflow for SAFA. The green dashed lines indicate the calculation sequence of each PE in SAFA. The red values A1,4 represent the data reused in our dataflow, whereas the blue values denote the last pixels of the C1 input required to calculate an output pixel.

Similarly, CoRe16 [11], a reconfigurable systolic array, consists of 16 composed 4 × 4 PE arrays. The patches in the composed PE arrays are transmitted to the next row's PEs in every cycle, while each column receives different input data simultaneously. Each composed PE array uses the same patch, so CoRe16 needs 4 × 32 = 128 bits per cycle for the patch group. As for the C1 input data, every composed PE array requires at least four different pixels per cycle. There are 16 such composed PE arrays in CoRe16, so its memory requirement is 4 × 16 × 32 + 128 = 2176 bits per cycle. Compared to CoRe16, when the kernel size equals four, the storage bandwidth of SAFA is thus reduced by 2176/672 = 3.24 times. We mainly use a square 16 × 16 array for SAFA, as it achieves the best performance among the different shapes of the systolic array (details are given in Section 4.6).
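These per-cycle bit counts can be packaged into a small model. The sketch below reproduces the 672-bit and 2176-bit figures for kernel = 4 and, under two assumptions we infer from the text (SAFA needs ceil(16/n) + 1 distinct input pixels per cycle, which gives the 5 pixels stated above for n = 4, and CoRe16 activates twelve of its sixteen 4 × 4 arrays when the kernel is 12, as described in Section 4.3), it also matches the per-kernel reductions reported in Section 4.4.

```python
import math

WIDTH = 32  # data bit-width (Section 3.3)
ROWS = 16   # SAFA rows; CoRe16 likewise has sixteen 4 x 4 PE arrays

def safa_bits(n):
    # Assumption: one input pixel feeds n rows per cycle, so 16 rows need
    # ceil(16 / n) + 1 distinct pixels (5 when n = 4, the figure above);
    # the 16 patch values are always distinct.
    return (math.ceil(ROWS / n) + 1) * WIDTH + ROWS * WIDTH

def core16_bits(n):
    # Assumption: CoRe16 uses twelve of its sixteen 4 x 4 arrays when the
    # kernel is 12 (Section 4.3) and all sixteen otherwise; each array
    # takes 4 input pixels plus a shared 128-bit patch stream.
    arrays = 12 if n == 12 else 16
    return 4 * arrays * WIDTH + 128

for n in (4, 8, 12, 16):
    s, c = safa_bits(n), core16_bits(n)
    print(f"kernel {n:2d}: SAFA {s} bits/cycle, CoRe16 {c}, "
          f"reduction {c / s:.2f}x")
# kernel = 4 gives 672 vs. 2176 bits (3.24x); averaging the 4, 8, and 16
# cases gives the 3.53x figure reported in Section 4.4.
```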

4. Experimental Results and Analysis

4.1. Experiment Setup

We used Matlab to simulate the calculations and runtimes of SAFA and CoRe16 for the performance evaluation. Matlab ran under Mac OS Mojave on a 2.6 GHz Intel Core i5 CPU with 8 GB of DDR3-1600 memory. The basic implementation of the HMAX model can be found in [21]. The example dataset was 10 images from the Labeled Faces in the Wild database [22]. We also synthesized the prototype of SAFA, as well as the other reconfigurable accelerators, on ASIC using Synopsys DC with a 28 nm technology library. The design met timing at a frequency of 1.25 GHz. The synthesis results were used to build an event-based power and area model.

4.2. Implementation Details

In this paper, we use CoRe16 as the main baseline. In hardware implementations, the multiplier usually takes longer than the adder. The subtractor is essentially equivalent to an adder (adding a negative number), so its latency can be taken as equal to the adder's. To simulate the compute process, we set the subtraction operation to one cycle, the multiplication operation to five cycles, and the addition operation to one cycle for each PE in SAFA and CoRe16, following the cycle counts set in reference [23].

The runtime of SAFA is calculated as follows. The SAFA runtime mainly depends on the data loop, which is divided into two parts: the input data loop and the patch group loop. The number of input data loops is related to the S2 size (M − n + 1), while the number of patch group loops depends on the number of patch groups. Taking the first C1 scale band with input size M = 63 as an example, the output S2 size is 60 (63 − 4 + 1). As each column in the systolic array corresponds to 16 output pixels of a patch group, the number of input data loops is 60 × 60/16 = 225; i.e., 16 output pixels of an S2 output group are obtained in one data loop. As for the patch group loop, SAFA computes 16 patch groups per loop. With 400 patches [6], the patch group loop count is 400/16 = 25. The calculation time of one output pixel in each PE depends on the kernel size, the orientation, and the calculation time of a partial sum. In this paper, the kernel size is four, the orientation is four [6], and the calculation time of one partial sum, including subtraction, multiplication, and accumulation, is seven cycles [23]. Since SAFA is a 16 × 16 systolic array, it takes 31 cycles to fill the entire array (as illustrated in Section 3.3). Therefore, the runtime of SAFA for the S2 layer is 225 × 25 × 4 × 4 × 4 × 7 + 31 = 2,520,031 cycles.
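The calculation above can be packaged as a small cycle-count model (the function name and its parameterization are ours); it reproduces the 2,520,031-cycle figure for the first scale band and is reused for other array shapes in Section 4.6.

```python
import math

def safa_runtime(s2, patches, n, r, rows=16, cols=16, mac_cycles=7):
    """Cycle-count sketch of the calculation above (names are ours).

    Each pass through the array covers `rows` output pixels of `cols`
    patch groups; filling the array costs rows + cols - 1 cycles once.
    """
    data_loops = math.ceil(s2 * s2 / rows)   # input data loops (225)
    patch_loops = math.ceil(patches / cols)  # patch group loops (25)
    per_pixel = n * n * r * mac_cycles       # cycles per output pixel (448)
    return data_loops * patch_loops * per_pixel + rows + cols - 1

# First C1 scale band: S2 size 60, 400 patches, n = 4, r = 4.
print(safa_runtime(60, 400, 4, 4))  # -> 2520031
```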

4.3. Comparison of Runtime

Figure 5 presents the effect of the S2 size on the runtime for different kernel sizes. As can be seen, the runtimes of SAFA and CoRe16 grow rapidly, since the cycle count scales with the square of the S2 size. When the S2 size is large, the HMAX model needs to calculate more pixels, resulting in a significant runtime increase.

Figure 6a shows the percentage of runtime reduction of SAFA compared to CoRe16; the horizontal axis varies the S2 size from 1 to 60. We can see that in most cases, SAFA achieves better performance. Although SAFA's acceleration is not as good as CoRe16's when the kernel equals 12 or 16 and the S2 size is small (less than 10), this case accounts for only a small proportion of the HMAX computation. The reason is that CoRe16 uses the adder tree accumulation mechanism to calculate each pixel, and when the S2 size is small, its unit utilization is higher (see Section 4.5 for the analysis). Figure 6b shows the percentage of runtime reduction compared to CoRe16 when the S2 size ranges from 10 to 60, where most of the calculation is concentrated. As can be seen, SAFA outperforms CoRe16 under the different kernel sizes. For kernel sizes from 4 to 16, the average runtime is reduced by 11.18%, 3.44%, 39.67%, and 2.5%, respectively.

Note that we obtain a particularly large speedup over CoRe16 when the kernel size is 12. The reason is that CoRe16 computes the output of the larger kernels (8, 12, 16) by combining the small one (4) through its adder mechanism. CoRe16 mainly comprises sixteen 4 × 4 PE arrays, each used to accumulate the partial sums of the larger kernels. Thus, CoRe16 can process up to sixteen 4 × 4 or four 8 × 8 S2 kernels in parallel, or one 12 × 12 or 16 × 16 kernel. When the kernel is 12, only twelve 4 × 4 PE arrays (equal to the kernel size, i.e., 75% of the whole array) are utilized in CoRe16, so it does not work at full load. In contrast, SAFA produces output pixels in each PE and performs at full load under all kernel sizes. For a fair comparison, we exclude the performance gain at kernel = 12 when comparing against CoRe16.


Figure 5. Effects of the S2 size on the runtimes of SAFA and CoRe16 under different kernel sizes ((a) kernel = 4; (b) kernel = 8; (c) kernel = 12; (d) kernel = 16). The runtimes of both accelerators grow rapidly with the S2 size.


Figure 6. Percentage of SAFA runtime reduction compared to CoRe16, for S2 sizes (a) from 1 to 60 and (b) from 10 to 60. When the kernel size is 12, CoRe16 does not work at full load due to its adder tree mechanism for producing an output pixel, so for a fair comparison we exclude the kernel = 12 gain, although SAFA still performs better there.

We can also see that a larger kernel generally brings less runtime reduction over CoRe16. The reason is that CoRe16 behaves more like SAFA as the kernel size grows. In particular, with the largest kernel (16), CoRe16 produces only one output pixel per calculation, acting like a single large PE of SAFA. As shown in Figure 6b, the average runtime is reduced by 5.7% over the different kernel sizes excluding kernel = 12 (since that case is unsuitable for CoRe16).

4.4. Comparison of Storage Bandwidth

The storage bandwidth is based on the maximum amount of data that the accelerator needs in a single cycle. Figure 7 illustrates the storage bandwidth requirements under the different kernels. Note that SAFA needs less storage bandwidth than CoRe16 for every kernel. The reason is that SAFA transmits and reuses both the input and patch data, while CoRe16 reuses only the patch group and therefore needs a larger input data stream. Meanwhile, as the kernel size increases, the storage bandwidth required by SAFA gradually decreases, while that of CoRe16 is almost unchanged except at kernel size 12. This is because, with a larger kernel size, each datum is multiplexed more times per cycle in SAFA, whereas CoRe16 calculates the output pixels of the larger kernels from the small one; under a perfect combination match (i.e., kernel size 8 or 16), its required bandwidth does not change. The memory requirement reduction compared with CoRe16 is 3.24, 3.58, 2.74, and 3.78 times for kernel = 4, 8, 12, and 16, respectively. Since CoRe16 does not work at full load when the kernel size is 12, the storage bandwidth of SAFA is reduced by 3.53 times on average (excluding kernel = 12) over the different kernels.

Figure 7. Storage bandwidth requirement for both SAFA and CoRe16. The storage bandwidth of SAFA is reduced by 3.53 times on average under different kernels (except for kernel = 12).

4.5. Comparison of Unit Utilization

Figure 8 presents the unit utilization of SAFA and CoRe16 under the different kernels. We can see that SAFA performs well at larger S2 sizes and reaches nearly 100% unit utilization under all kernels. The reason is that, as the S2 size increases, SAFA works at full load more of the time, leaving fewer unused units. The unit utilization of SAFA is constant across kernels, because SAFA calculates output pixels within each PE unit; the kernel size only influences the calculation time of an output pixel in each PE.


Figure 8. Unit utilization comparison between SAFA and CoRe16 ((a) kernel = 4; (b) kernel = 8; (c) kernel = 12; (d) kernel = 16). SAFA performs well at larger S2 sizes and reaches nearly 100% unit utilization under all kernels, because it works at full load more of the time as the S2 size grows, leaving fewer unused units.

For CoRe16, the unit utilization gradually stabilizes as the S2 size increases, and it differs across kernels. The reason is that CoRe16 calculates output pixels by adder tree accumulation; with a larger kernel size, fewer pixels are calculated per full-load pass, and the utilization ratio is relatively higher. To better reveal the comparison between SAFA and CoRe16, Figure 9 shows the percentage of unit utilization improvement of SAFA over CoRe16 under the different kernels. In most cases (except kernel = 12), the unit utilization of SAFA is close to or better than that of CoRe16. When the kernel equals 12 or 16 and the S2 size is small (less than 10), SAFA's unit utilization is lower than CoRe16's. This is because CoRe16 processes only one output pixel per patch group in that case, whereas SAFA produces sixteen output pixels per patch group; when the S2 size is small, SAFA is thus more likely not to run at full load.


Figure 9. Unit utilization improvement over CoRe16 under the different kernels. In most cases (except kernel = 12), the unit utilization of SAFA is close to or better than that of CoRe16.

4.6. Sensitivity of Array Shape on Runtime

In this subsection, we again use Matlab to simulate the runtime of SAFA for different array shapes. Figure 10 presents the runtime of SAFA in different shapes. We can see that the 16 × 16 array yields the shortest runtime. This is because the square systolic array makes full use of the patch group loop (the number of patch groups divides evenly by 16) as well as the data loop, so the unit utilization is high. Note that the closer the array is to a square, the shorter the runtime. Compared with the 1 × 256, 2 × 128, 4 × 64, and 8 × 32 shapes, the runtime of the square SAFA is decreased by 21.9%, 21.8%, 10.7%, and 3.8%, respectively. In comparison with 256 × 1, 128 × 2, 64 × 4, and 32 × 8, it is reduced by 6.2%, 3.0%, 1.3%, and 0.4%, respectively. Therefore, we choose the square 16 × 16 array for SAFA.

Figure 10. Runtime of SAFA in different shapes. Note that a square array (16 × 16) for SAFA achieves the shortest runtime.
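These numbers are consistent with the cycle-count model from Section 4.2 if it is generalized to an R × C array, with rows bounding the output-pixel batch, columns bounding the patch-group batch, and a fill time of R + C − 1 cycles; this generalization is our assumption, and the sweep below reproduces the quoted reductions to within rounding.

```python
import math

def safa_runtime(s2, patches, n, r, rows, cols, mac_cycles=7):
    # Section 4.2 cycle model, generalized (our assumption) to rows x cols.
    return (math.ceil(s2 * s2 / rows) * math.ceil(patches / cols)
            * n * n * r * mac_cycles) + rows + cols - 1

base = safa_runtime(60, 400, 4, 4, 16, 16)  # square array: 2,520,031 cycles
shapes = [(1, 256), (2, 128), (4, 64), (8, 32),
          (256, 1), (128, 2), (64, 4), (32, 8)]
for rows, cols in shapes:
    t = safa_runtime(60, 400, 4, 4, rows, cols)
    print(f"{rows:3d} x {cols:3d}: {t:8d} cycles; "
          f"16 x 16 is {100 * (t - base) / t:.1f}% faster")
```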

4.7. Area and Power Evaluation

The synthesis results on ASIC for SAFA and the other reconfigurable accelerators are reported in Table 2. As shown in Table 2, SAFA consumes 3.32 W and occupies 3.93 mm², achieving the lowest power and area among the reconfigurable accelerators. We also use frames per second (FPS) as a metric for the processing speed of each PE in SAFA, CoRe16, and Park et al. [15]; the image size is 256 × 256, the same as for CoRe16 [11]. Table 3 presents the area, power, and FPS of the PE in these S2 accelerators. Our PE obtains a higher FPS than the other reconfigurable accelerators while maintaining lower power and area.

This is because, in SAFA, the input feature map and the patches can be reused to produce output pixels following the OS dataflow. This feature also helps to reduce the multiplexers utilized in the PE of SAFA, as most of the reconfigurable accelerators rely on multiplexers to select proper input pixels for further computation.

Table 2. Area and power evaluation for SAFA and other reconfigurable accelerators (28 nm). As can be seen, SAFA obtains the lowest power and area compared with the rest of the reconfigurable accelerators.

              Maashri et al. [14]    Park et al. [15]    CoRe16 [11]    SAFA
Area (mm²)    4.44                   4.06                4.14           3.93
Power (mW)    3715.3                 3390.1              3456.9         3320.8

Table 3. Area, power, and FPS evaluation for the PE in SAFA and other reconfigurable accelerators. The PE of SAFA also achieves a higher FPS than the other reconfigurable accelerators while maintaining lower power and area.

              Park et al. [15]    CoRe16 [11]    SAFA
Area (mm²)    0.0147              0.0154         0.0141
Power (mW)    12.6                13.1           12.2
FPS           1.74                1.69           1.81

5. Conclusions and Future Work

In this paper, we proposed SAFA, a systolic accelerator for HMAX-based multi-class object recognition algorithms. The core of SAFA is to calculate the corresponding output pixels within each PE using the OS dataflow. Meanwhile, our PE avoids the overhead of accumulating partial sums across different PEs and lessens the multiplexers used in most reconfigurable accelerators. Data forwarding of the same input or patch parameters also reduces the required memory bandwidth. The results show that SAFA reduces the runtime of the most computationally intensive S2 phase of HMAX by 5.7%, and the memory bandwidth requirement is decreased by 3.53× on average over the different kernels (excluding kernel = 12) compared with CoRe16. Besides, SAFA also provides lower power and area costs than the reconfigurable accelerators in ASIC synthesis. Future work is to optimize SAFA to support other computer vision algorithms, such as deep convolutional neural networks.

Author Contributions: Conceptualization, S.T. and L.W.; methodology, S.T. and S.X.; software, S.T. and S.X.; validation, Z.Y., J.Z. and S.T.; formal analysis, S.G., L.W.; investigation, S.T.; resources, W.X.; data curation, S.T. and Z.Y.; writing–original draft preparation, S.T.; writing–review and editing, S.T. and L.W.; visualization, S.T. and J.Z.; supervision, L.W. and W.X.; project administration, L.W.; funding acquisition, L.W. and W.X. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded by the National Key R&D Program of China (grant number 2018YFB2202603) and in part by the National Natural Science Foundation of China (grant numbers 61802427 and 61832018). Conflicts of Interest: The authors declare no conflict of interest.

References

1. Sanchez, J.; Soltani, N.; Chamarthi, R.; Sawant, A.; Tabkhi, H. A novel 1D-convolution accelerator for low-power real-time CNN processing on the edge. In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 25–27 September 2018; pp. 1–8. 2. Serre, T.; Oliva, A.; Poggio, T. A feedforward architecture accounts for rapid categorization. Proc. Natl. Acad. Sci. USA 2007, 104, 6424–6429. [CrossRef] [PubMed] 3. Liu, X.; Yan, M.; Bohg, J. MeteorNet: Deep learning on dynamic 3D point cloud sequences. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Korea, 21 October 2019; pp. 9246–9255.

4. Iscen, A.; Tolias, G.; Avrithis, Y.; Chum, O. Label propagation for deep semi-supervised learning. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 13 June 2019; pp. 5070–5079. 5. Riesenhuber, M.; Poggio, T. Hierarchical models of object recognition in cortex. Nat. Neurosci. 1999, 2, 1019–1025. 6. Serre, T.; Wolf, L.; Bileschi, S.; Riesenhuber, M.; Poggio, T. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2007, 29, 411–426. [CrossRef] [PubMed] 7. Zhang, H.-Z.; Lu, Y.-F.; Kang, T.-K.; Lim, M.-T. B-HMAX: A fast binary biologically inspired model for object recognition. Neurocomputing 2016, 218, 242–250. [CrossRef] 8. Wang, Y.; Deng, L. Modeling object recognition in visual cortex using multiple firing k-means and non-negative sparse coding. Signal Process. 2016, 124, 198–209. [CrossRef] 9. Sufikarimi, H.; Mohammadi, K. Role of the secondary visual cortex in HMAX model for object recognition. Cogn. Syst. Res. 2020, 64, 15–28. [CrossRef] 10. Cherloo, M.N.; Shiri, M.; Daliri, M.R. An enhanced HMAX model in combination with SIFT algorithm for object recognition. Signal Image Video Process. 2020, 14, 425–433. [CrossRef] 11. Sabarad, J.; Kestur, S.; Park, M.S.; Dantara, D.; Narayanan, V.; Chen, Y.; Khosla, D. A reconfigurable accelerator for neuromorphic object recognition. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), Sydney, Australia, 9 January 2012; pp. 813–818. 12. Sufikarimi, H.; Mohammadi, K. Speed up biological inspired object recognition, HMAX. In Proceedings of the 2017 3rd Iranian Conference on Intelligent Systems and Signal Processing (ICSPIS), Shahrood, Iran, 6 December 2017; pp. 183–187. 13. Mutch, J.; Knoblich, U.; Poggio, T. CNS: A GPU-Based Framework for Simulating Cortically-Organized Networks; Massachusetts Institute of Technology: Cambridge, MA, USA, 2010. 14. Maashri, A.A.; DeBole, M.; Yu, C.L.; Narayanan, V.; Chakrabarti, C. A hardware architecture for accelerating neuromorphic vision algorithms. In Proceedings of the IEEE Workshop on Signal Processing Systems (SIPS), Beirut, Lebanon, 8 October 2011. 15. Park, M.; Kestur, S.; Sabarad, J.; Narayanan, V.; Irwin, M. An FPGA accelerator for cortical object classification. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), Dresden, Germany, 19 March 2012. 16. Liu, B.; Chen, X.; Wang, Y.; Han, Y.; Li, J.; Xu, H.; Li, X. Addressing the issue of processing element under-utilization in general-purpose systolic deep learning accelerators. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo, Japan, 20 January 2019; pp. 733–738. 17. Samajdar, A.; Zhu, Y.; Whatmough, P.; Mattina, M.; Krishna, T. SCALE-Sim: Systolic CNN accelerator simulator. arXiv 2018, arXiv:1811.02883. 18. Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 2017, 105, 2295–2329. [CrossRef] 19. Poggio, T.; Bizzi, E. Generalization in vision and motor control. Nature 2004, 431, 768–774. [CrossRef] [PubMed] 20. Liu, Z.; Dou, Y.; Jiang, J.; Wang, Q.; Chow, P. An FPGA-based processor for training convolutional neural networks. In Proceedings of the 2017 International Conference on Field Programmable Technology (ICFPT), Melbourne, Australia, 7 December 2017; pp. 207–210. 21. Riesenhuber, M.; Serre, T.R.; Bileschi, S.; Martin, J.G.; Rule, J. HMAX Tarball.
Available online: https://maxlab.neuro.georgetown.edu/hmax.html (accessed on 26 February 2020). 22. Learned-Miller, E. Labeled Faces in the Wild: A Survey. In Advances in Face Detection and Facial Image Analysis; Springer International Publishing: Basel, Switzerland, 2016. 23. Hwang, K.; Jotwani, N. Advanced Computer Architecture; McGraw-Hill Education: New York, NY, USA, 2016.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).