
Sparse Winograd Convolutional neural networks on small-scale systolic arrays

Feng Shi, Haochen Li, Yuhe Gao, Benjamin Kuschner, Song-Chun Zhu
University of California, Los Angeles, Los Angeles, USA
{shi.feng,sczh}@cs.ucla.edu

ABSTRACT

The reconfigurability, energy efficiency, and massive parallelism of FPGAs make them one of the best choices for implementing efficient deep learning accelerators. However, state-of-the-art implementations seldom consider the balance between the high throughput of the computation engine and the ability of the memory subsystem to support it. In this paper, we implement an accelerator on FPGA by combining sparse Winograd convolution, clusters of small-scale systolic arrays, and a tailored memory layout design. We also provide an analytical model of the general Winograd convolution as a design reference. Experimental results on VGG16 show that the design achieves very high computational resource utilization, 20×∼30× energy efficiency, and more than 5× speedup compared with the dense implementation.

KEYWORDS

FPGA, Neural networks, Winograd convolution, systolic arrays

1 INTRODUCTION

Convolutional neural networks (CNNs) are a class of deep learning models that have become dominant in various computer vision tasks [10, 18], so they are attracting research on acceleration for computational and power efficiency. The core computations in the algorithm are convolution operations on multi-dimensional data, e.g. 3-D feature maps (FM) and 4-D filters, which require a high density of memory accesses and high throughput from the computation engine. One research topic emerging in recent years is to deploy the convolution operations onto FPGAs [6, 7, 9, 11], since FPGAs consist of massive compute units, e.g. DSP blocks, and storage elements interconnected by reconfigurable switch blocks. The most recent works on systolic array-based FPGA accelerators [3, 15] deliver significant performance improvement on the automation of the high-level synthesis (HLS) design flow.

Unlike the works [5, 15], which first construct a 2-D mesh architecture for the systolic array and then fit the loops of the code onto these arrays (with the bitstream generated once), we recursively break the memory layout down into small blocks, map these blocks onto small-scale systolic arrays to perform multiplications of submatrices, and share these submatrices among the working arrays to reduce the required memory bandwidth. Another performance improvement can be achieved from the algorithmic perspective by applying the Winograd transform. This approach has attracted more and more attention from researchers since its first GPU implementation [17]. Winograd CNN accelerators on FPGAs have also been well studied recently [1, 11]; however, the greater data volume after the Winograd transformation stresses FPGAs. To handle this issue we adopt an efficient memory layout, adopt the pruned Winograd weights [2] and their elaborate hardware, and extend the computation into 3-D. Pruning neural networks has been proven to greatly decrease both latency and energy consumption for a wide range of devices [13]. The major contributions are summarized in the following:

• Unified small-scale systolic arrays for both Winograd transform and matrix multiplications. We maximize the reusability of the existing design, e.g. RTL, across multiple modules. These modules share common characteristics, such as matrix-multiplication-like arithmetic operations.
• Efficient memory access layout. We employ a recursive memory access pattern to increase the locality of buffers. This pattern significantly impacts the overall performance.
• Block-based sparse matrix compression. We employ this compression technique to fit the above-mentioned recursive memory layout.
• A comprehensive model analysis of Winograd convolution. We propose an analytical model to investigate the performance and energy consumption, and we use the conclusions of this analysis as our design guidance.

2 BACKGROUND

2.1 Spatial Convolution

The convolution layer in a feedforward pass takes C channels of H × W feature maps D as input, and convolves each of K filters of dimension C × r × r with the input feature maps to produce K output feature maps Y of dimension (H − r + 1) × (W − r + 1). Let s be the stride and assume that the width and height of the filters are the same; then the mathematical description of the convolution is

Y_{k,i,j} = Σ_{t=1}^{C} Σ_{p=1}^{r} Σ_{q=1}^{r} G_{k,t,p,q} × D_{t, i·s+p, j·s+q}    (1)
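As a concrete reference for (1), the following NumPy sketch implements the spatial convolution directly. It is a minimal illustration (no padding, the array shapes follow the notation above) rather than an optimized implementation.

```python
import numpy as np

def spatial_conv(D, G, s=1):
    """Direct convolution of eq. (1).
    D: input feature maps, shape (C, H, W)
    G: filters, shape (K, C, r, r)
    Returns Y with shape (K, (H - r)//s + 1, (W - r)//s + 1)."""
    C, H, W = D.shape
    K, _, r, _ = G.shape
    Ho, Wo = (H - r) // s + 1, (W - r) // s + 1
    Y = np.zeros((K, Ho, Wo))
    for k in range(K):
        for i in range(Ho):
            for j in range(Wo):
                patch = D[:, i * s:i * s + r, j * s:j * s + r]
                Y[k, i, j] = np.sum(G[k] * patch)  # sum over channels and taps
    return Y
```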

2.2 Winograd Algorithm

Winograd proposed an efficient algorithm for short convolutions [20] in the computation of finite impulse response (FIR) filters in the signal processing field. [17] extends the Winograd algorithm to convolutional neural networks on GPU and CPU.

By applying the Winograd transform to an r-tap FIR filter, denoted F(m, r), which computes m outputs with a filter of size r, the number of multiplications is reduced from m × r for the spatial convolution to m + r − 1.

2.2.1 1-D Winograd Convolution. Taking F(2, 3) as an example, the Winograd algorithm first transforms an input vector d = (d0, d1, d2, d3) and a filter g = (g0, g1, g2) into j = (j0, j1, j2, j3) and h = (h0, h1, h2, h3), respectively, through

j0 = d0 − d2,    h0 = g0
j1 = d1 + d2,    h1 = (g0 + g1 + g2) / 2
j2 = d2 − d1,    h2 = (g0 − g1 + g2) / 2
j3 = d1 − d3,    h3 = g2

Next, element-wise multiplications are performed:

c0 = j0 × h0,  c1 = j1 × h1,  c2 = j2 × h2,  c3 = j3 × h3    (2)

Finally, the output y = (y0, y1) is generated via:

y0 = c0 + c1 + c2,  y1 = c1 − c2 − c3    (3)

The matrix form of the above procedure can be written as y = A^T [(Gg) ⊙ (B^T d)], where ⊙ represents element-wise multiplication and

A^T = [ 1    1    1    0
        0    1   −1   −1 ]

G   = [ 1     0     0
        1/2   1/2   1/2
        1/2  −1/2   1/2
        0     0     1 ]

B^T = [ 1    0   −1    0
        0    1    1    0
        0   −1    1    0
        0    1    0   −1 ]

The element-wise product in (2) requires m + r − 1 = 4 multiplications, whereas the direct method requires m × r = 2 × 3 = 6 multiplications.

2.2.2 2-D Winograd Convolution. The 1-D Winograd algorithm F(m, r) can easily be extended to 2-D or higher-dimensional convolutions by being nested with itself. The 2-D Winograd algorithm F(m × m, r × r) can be formulated as

Y = A^T [(G g G^T) ⊙ (B^T d B)] A    (4)

where d and g are tiles of the input and the filter, of size l × l (l = m + r − 1) and r × r, respectively. The size of the output tile Y is m × m.

For larger input images, the Winograd transform is performed with overlapping tiles, with overlap size r − 1, along each dimension. When applying the Winograd algorithm to a convolution layer of a CNN, the tiles along the channel dimension of this layer can be fetched simultaneously, and (4) is applied to each of them.

3 ALGORITHM AND OPTIMIZATIONS

This section gives an overview of our algorithm and presents several optimization methods. Fig. 1 shows the overview of our algorithm, which consists of the three stages of Winograd-based convolution: input feature map and kernel transformations, matrix multiplications, and the inverse transformation of the output feature maps. These three stages form the pipeline of the data flow of our system design.

Figure 1: An overview of the Winograd convolution layer.

3.1 Reduction to Matrix Multiplication

By reformulating (4) with the augmentation of the channel dimension, filter k, and tile coordinates (x̃, ỹ), and substituting U = G g G^T and V = B^T d B, we get

Y_{k,x̃,ỹ} = A^T [ Σ_{c=1}^{C} U_{k,c} ⊙ V_{c,x̃,ỹ} ] A    (5)

The summation inside the brackets of (5) can be disentangled into (m + r − 1)² individual multiplications of a matrix of size (C × K) with another of size (C × ⌈H/m⌉⌈W/m⌉):

M_{k,x̃,ỹ} = Σ_{c=1}^{C} U_{k,c} ⊙ V_{c,x̃,ỹ}   ⟶ collapsing (x̃, ỹ) to b, for each element (ĩ, j̃) of the tile ⟶   M^{(ĩ,j̃)}_{(k,b)} = Σ_{c=1}^{C} U^{(ĩ,j̃)}_{k,c} · V^{(ĩ,j̃)}_{c,b}

Another benefit of this reformulation into matrix multiplications is that the number of inverse transforms is also reduced over the C channels [17], since the factorization of the inverse transform along channels amortizes its cost. With this reformulation, the matrix multiplications can then be implemented efficiently on FPGAs.
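The transform pipeline can be checked in a few lines. The NumPy sketch below (an illustrative check, not the hardware implementation) instantiates the F(2 × 2, 3 × 3) matrices from Section 2.2 and verifies (4) and the channel summation of (5) on a single input tile against the direct convolution of (1).

```python
import numpy as np

# F(2x2, 3x3) transform matrices from Section 2.2 (l = m + r - 1 = 4).
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G_m = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def winograd_tile(d_tiles, g_filters):
    """Eq. (5): one 2x2 output tile per filter, summed over the C channels.
    d_tiles: (C, 4, 4) input tile, g_filters: (K, C, 3, 3)."""
    V = np.einsum('ij,cjk,lk->cil', B_T, d_tiles, B_T)      # B^T d B per channel
    U = np.einsum('ij,kcjl,ml->kcim', G_m, g_filters, G_m)  # G g G^T per (k, c)
    M = np.einsum('kcij,cij->kij', U, V)                    # sum_c U ⊙ V
    return np.einsum('ij,kjl,ml->kim', A_T, M, A_T)         # A^T M A

rng = np.random.default_rng(0)
d = rng.standard_normal((8, 4, 4))       # C = 8 channels of one 4x4 tile
g = rng.standard_normal((16, 8, 3, 3))   # K = 16 filters

direct = np.zeros((16, 2, 2))            # direct convolution of eq. (1) on the tile
for k in range(16):
    for i in range(2):
        for j in range(2):
            direct[k, i, j] = np.sum(g[k] * d[:, i:i + 3, j:j + 3])

assert np.allclose(winograd_tile(d, g), direct)
```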

3.2 Matrix Multiplications and Memory Access Patterns

As described in section 3.1, Winograd convolution can be computed efficiently with matrix multiplications on GPU or FPGA platforms. To optimize the performance of the matrix multiplications, we employ the Z-Morton memory layout [8], which has been widely studied for cache-oblivious algorithms on multithreaded CPUs [8, 12] and for image processing on FPGAs [4]. This memory layout increases both the spatial and the temporal locality of the memory accesses of matrix multiplication and arithmetic operations [8].

Figure 2: Z-Morton memory layout for both dense and sparse matrices [4, 8]: (a) the translation from logical layout to physical layout, (b) the block-based compressed coordinates (BCOO, l × l blocks with l = 4 in our design) for pruned Winograd weights.

Algorithm 1 Divide-and-Conquer Matrix Multiplication
1: function RECURSIVE-MATMULT(A, B)
2:   n = A.rows
3:   if n == l then                       ▷ l is the smallest tiling size
4:     return A × B                       ▷ matrix multiply of l × l tiles
5:   end if
6:   partition A and B into tiles of size n/2 × n/2
7:   C1,1 = RECURSIVE-MATMULT(A1,1, B1,1) + RECURSIVE-MATMULT(A1,2, B2,1)
8:   C1,2 = RECURSIVE-MATMULT(A1,1, B1,2) + RECURSIVE-MATMULT(A1,2, B2,2)
9:   C2,1 = RECURSIVE-MATMULT(A2,1, B1,1) + RECURSIVE-MATMULT(A2,2, B2,1)
10:  C2,2 = RECURSIVE-MATMULT(A2,1, B1,2) + RECURSIVE-MATMULT(A2,2, B2,2)
11:  return C = [C1,1 C1,2; C2,1 C2,2]
12: end function

Z-Morton uses a divide-and-conquer approach to access memory, as shown in Fig. 2 (a). It is derived from the recursive matrix multiplication described in Algorithm 1. Compared with Strassen's algorithm, the latter is not cache-friendly in practice, whereas the former provides a notable improvement in performance [12]. Note that, instead of implementing the algorithm exactly, we unroll its memory access order to reorganize the memory layout.

The physical memory layout in FPGAs is essentially linear; Fig. 2 (a) also provides an example of translating a logical block address to a physical block address. As shown in Fig. 2 (a), the address translation is easily implemented with LUTs in FPGAs by interleaving the bits of the logical column and row addresses to generate the physical address of a block.

3.3 Pruned Winograd Weights and Memory Access Patterns

After pruning the Winograd weights, we store them in a block-based sparse coordinates format (BCOO): only those 4 × 4 blocks containing nonzeros are compressed and stored. Fig. 2 (b) shows an example where block B5 is a 4 × 4 tile with 3 nonzeros. The information of these nonzeros is stored in the vectors BN, BI, AI, AJ, and AN. BN contains the block number of each block in the memory layout, e.g. 5 for B5. BI is the list of starting indices of each block within the other three arrays, e.g. entry i5 of BI refers to the starting index in AI, AJ, and AN of the information corresponding to B5. Elements of AI and AJ represent the row and column numbers of the nonzeros within their own block, respectively, and AN stores the value of the corresponding nonzero. For B5, the values of the nonzeros are b0,0, b1,2, and b3,1; the corresponding column numbers are 0, 2, 1 and the row numbers are 0, 1, 3 in AJ and AI, respectively. The compressed blocks are still fetched following the order determined by the Z-Morton layout.
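The sketch below makes the layout concrete: it computes Z-Morton block addresses by bit interleaving and packs a blocked sparse matrix into the BCOO vectors BN, BI, AI, AJ, AN of Fig. 2 (b). The interleaving convention (row bits one position above column bits) is an assumption on our part; it is the convention that reproduces the block numbering used in Section 4.2.

```python
import numpy as np

def morton(row, col, bits=8):
    """Interleave the logical row/column block indices into a Z-Morton block address
    (assumed convention: row bit above column bit at each level)."""
    addr = 0
    for b in range(bits):
        addr |= ((col >> b) & 1) << (2 * b)
        addr |= ((row >> b) & 1) << (2 * b + 1)
    return addr

def bcoo_compress(W, l=4):
    """Pack an (n*l) x (n*l) matrix into BCOO: only l x l blocks containing nonzeros
    are stored, enumerated in Z-Morton order."""
    n = W.shape[0] // l
    BN, BI, AI, AJ, AN = [], [], [], [], []
    blocks = sorted((morton(i, j), i, j) for i in range(n) for j in range(n))
    for addr, i, j in blocks:                  # fetch order follows the Z-Morton layout
        blk = W[i * l:(i + 1) * l, j * l:(j + 1) * l]
        rows, cols = np.nonzero(blk)
        if rows.size == 0:
            continue                           # all-zero blocks are skipped entirely
        BN.append(addr)                        # block number in the memory layout
        BI.append(len(AN))                     # start index of this block's nonzeros
        AI.extend(rows.tolist())               # row numbers within the block
        AJ.extend(cols.tolist())               # column numbers within the block
        AN.extend(blk[rows, cols].tolist())    # nonzero values
    return BN, BI, AI, AJ, AN
```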

4 ARCHITECTURE DESIGN

This section discusses our implementation of the accelerator for Winograd convolution. The most time-consuming parts of the computation pipeline are the Winograd transform of the feature maps and the matrix multiplications. In our design, we propose using unified small-scale systolic arrays, of size l × l (l = m + r − 1), for both of these arithmetic operations.

4.1 Winograd Transform by Systolic Arrays

Recall that the 2-D Winograd transform nests two transform matrices, B^T · D · B. Instead of directly computing B^T · D · B, we reformulate it as (D^T · B)^T · B and let the transform matrix B be stationary inside the systolic arrays. In the first iteration ① of Fig. 3, D^T passes through the systolic arrays to operate with B, and the output is P = C + (D^T · B)^T (no additional transpose is needed). This intermediate result (D^T · B)^T is fed back into the systolic arrays as the "new D^T" in the second iteration ②. Then P′ = C′ + P · B = (D^T · B)^T · B = B^T · D · B is the final result. Note that C and C′ are zero matrices and that no multiplication occurs inside these systolic arrays: the value of each element of B is only used to control the adder, i.e. "1" for addition, "−1" for subtraction, and "0" for passing the data on to the next processing element (PE) of the systolic array.

The data sharing is realized through the overlapping of tiles, which has been described in section 2.2.2. Fig. 3 illustrates that (m + r − 1)-wide data streams enter each systolic array; among these data, (r − 1) of them travel through the current systolic array and are forwarded to the next systolic array in the same direction. The output is streamed out in the orthogonal direction after two iterations, as stated previously, and is transferred into shift registers for scattering into matrices.

Figure 3: Small-scale systolic arrays for the Winograd transform.
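As a quick functional check of this two-pass scheme (a sketch, not the RTL; B^T is the F(2 × 2, 3 × 3) matrix from Section 2.2), the following code shows that two identical passes with a stationary B reproduce B^T · D · B, and that B's entries lie in {−1, 0, 1}, so each PE only needs an add/subtract/bypass control rather than a multiplier.

```python
import numpy as np

# F(2x2, 3x3) input-transform matrix; its entries are only 0 and ±1.
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
B = B_T.T
assert set(np.unique(B)) <= {-1.0, 0.0, 1.0}   # adder control: add / subtract / bypass

def array_pass(X, B):
    """One pass through the B-stationary array: accumulate X·B into a zero matrix C."""
    C = np.zeros((X.shape[0], B.shape[1]))
    return C + X @ B

D = np.random.default_rng(1).standard_normal((4, 4))   # one input tile

P = array_pass(D.T, B).T     # iteration 1: feed D^T; the orthogonal read-out
                             # yields P = (D^T·B)^T with no extra transpose step
P_final = array_pass(P, B)   # iteration 2: feed P back as the "new D^T"
assert np.allclose(P_final, B_T @ D @ B)   # equals B^T·D·B
```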

Figure 4: Systolic arrays for Algorithm 1: (a) the original design for the dense case, (b) the modified architecture for the sparse case.

4.2 Matrix Multiplication by Systolic Arrays

To perform the recursive matrix multiplication of Algorithm 1 in hardware, we conceive a cluster of small-scale systolic arrays. Each cluster consists of four l × l systolic arrays (l = 4 in our case) and a set of shared circular FIFOs built from shift registers, as shown in Fig. 4. To understand how this cluster works, let us examine the example from Fig. 2. By unrolling the recursive code given by Algorithm 1 and using the tiles of the matrices organized by the Z-Morton layout, we calculate submatrix C0 by summing up the products of submatrices A0 × B0 and A1 × B2, C4 as the sum of A0 × B4 and A1 × B6, and so on:

C0  += A0 × B0 + A1 × B2;
C4  += A0 × B4 + A1 × B6;
C8  += A8 × B0 + A9 × B2;
C12 += A8 × B4 + A9 × B6;
···
C0  += A4 × B8  + A5 × B10;
C4  += A4 × B12 + A5 × B14;
···

As shown in Fig. 4 (a), A0 is shared by the northwest and southwest systolic arrays, A8 is shared by the northeast and southeast systolic arrays, and so on. After the first iteration, the partial results of C0, C4, C8, and C12 are produced and stored inside the corresponding systolic arrays. In the second iteration, the blocks A1, A9, B2, and B6 enter their corresponding systolic arrays and perform the matrix multiplications, and their products are accumulated onto the partial results, which remain in their systolic arrays from iteration 1. At iteration 3 the results of C0, C4, C8, and C12 are spilled out, and the systolic arrays continue to work on the partial results of C1, C5, C9, and C13. This procedure continues until all the submatrices have been calculated. The sharing of the circular FIFOs also reduces the memory bandwidth requirement by a factor of 4.

When the computation comprises sparse matrix multiplications, we need some modifications to the cluster of systolic arrays. First, each of the circular FIFOs that supply the compressed Winograd weight blocks must be equipped with a decompressor. Second, the circular FIFOs for the Winograd feature maps are virtually split into two halves, since some Winograd feature map blocks are no longer shared between the systolic arrays. The overall memory access pattern is now determined by how the sparse blocks are distributed in the memory layout. Take the sparse blocks B2 and B5 from Fig. 2 as an example: the computation of C0 now becomes A1 × B2 only and C8 becomes A9 × B2; block B2 is still shared by the products for submatrices C0 and C8.
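The dense block schedule above can be reproduced in software. The sketch below (illustrative only; it models one cluster as four accumulating C blocks and uses the same assumed Morton bit-interleaving convention as the example in Section 3.2) runs the Z-Morton block schedule and verifies the accumulated result against a plain matrix product.

```python
import numpy as np

l = 4   # systolic array size / block size

def morton(row, col, bits=4):
    a = 0
    for b in range(bits):                     # row bit above column bit (assumed)
        a |= ((col >> b) & 1) << (2 * b)
        a |= ((row >> b) & 1) << (2 * b + 1)
    return a

def demorton(addr, bits=4):
    row = col = 0
    for b in range(bits):
        col |= ((addr >> (2 * b)) & 1) << b
        row |= ((addr >> (2 * b + 1)) & 1) << b
    return row, col

def block(M, idx):
    r, c = demorton(idx)
    return M[r * l:(r + 1) * l, c * l:(c + 1) * l]   # view into the Morton-numbered block

rng = np.random.default_rng(2)
n_blk = 4                                            # 4x4 grid of l x l blocks
A = rng.standard_normal((n_blk * l, n_blk * l))
B = rng.standard_normal((n_blk * l, n_blk * l))
C = np.zeros_like(A)

# Dense schedule of Section 4.2: each C block accumulates Morton-indexed A and B
# block products over the inner dimension, e.g. C0 += A0·B0 + A1·B2, C4 += A0·B4 + A1·B6, ...
for ci in range(n_blk):
    for cj in range(n_blk):
        Cb = block(C, morton(ci, cj))
        for k in range(n_blk):
            Cb += block(A, morton(ci, k)) @ block(B, morton(k, cj))

assert np.allclose(C, A @ B)
```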
buffers, Di = C × K × l2 (8) FIFOs) and external memory accesses are several times and orders wk of magnitude higher than arithmetic operations, respectively [14]. The Winograd transform dilates both the input feature maps and 2 Let us assume for the sake of simplicity that every storage ele-  l  weights by a scale factor of m , e.g. when m takes value of 2 and ment in both local and external memory is accessed exactly once, r of 3, the transformed feature maps and weights require roughly transformed feature maps are stored in local memory after Wino- 1.78 times larger storage. The increased volume of the storage not grad transform, and the Winograd weights are read from external only affects the latency of computations due to the drastically slow memory. access speed, but also causes more energy consumption. Let Eme and Eml be the unit energies consumed by an access to the external memory and an access to the local memory, re- 5.1.2 Arithmetic complexity. The arithmetic complexity greatly spectively. Let Emul and Eadd be the unit energies consumed by depends on the data layout since the volume of feature maps and a multiplication operation and an addition operation, respectively. weights decides how much data does the algorithm needs to . Then the total energy consumption of layer i is The number of multiplications performed by Winograd convolution   i · i i · i layer i is Etot = Eml Dwi + Dwo + Eme Dwk +      2   i H W 2 l i i i i M = · · C · K · l ≈ H · W · C · K · Emul · M + Eadd · S + S + S W m m m W W B A The number of additions involved in matrix multiplications is Another fact derived by eq. (6) and (8) is that greater m generates less elements of the transformed feature maps but more elements      2 of the transformed weights. This fact indicates that the pruning of i H W 2 l SW = · · (C − 1) · K · l ≈ H · W · (C − 1) · K · Winograd weights is more efficient with greater m. m m m After having given the above formulas and summarizations, we The number of additions required by Winograd transforms are conduct the analysis and experiments in section 6.2.  T   T    SB and SA for B dB and A Mk,x˜,y˜ A respectively. In most cases, Winograd transform matrices B and A are sparse, therefore, Table 1: number of parameters in each convolution layer of (9) and (10) utilize the operator nnz (·) (number of nonzeros). different stages inVGG [19] after Winograd transform (m=2) H  W  Si = 2 × × × C × K × l × [nnz (B) − l] (9) B m m Stage [19] # of Winograd neurons # of Winograd weights H  W  Conv1 (×2) 12,845,056 65,536 Si = 2 × × × C × K × l × [nnz (A) − m] (10) A m m Conv2 (×3) 6,422,528 262,144 The Winograd weights are pre-calculated and stored in memory, so Conv3 (×4) 3,211,264 1,048,576 the overhead of computing Winograd weights has not been taken Conv4 (×4) 1,605,632 4,194,304 into account. Conv5 (×4) 401,408 4,194,304 Conv6 131,072 4,194,304

6 EXPERIMENTAL EVALUATION

VGG [19] is one of the most popular and mature deep learning models and has been widely used in research and industry. In this work, we use VGG16 for our analysis and experiments.

6.1 Experiment Setup

For the CNN model part, we set the input feature map size to 224 × 224 × 3, which are the standard input dimensions of the VGG pipeline. Table 1 shows the number of neurons and weights of each layer in the different stages after the Winograd transform. For the hardware part, we evaluate our design on an FPGA board, the Xilinx Virtex UltraScale XCVU095. Although it is not fabricated with the latest technology and is equipped with only a medium number of DSPs (768 DSPs), this configuration reveals the performance gain better than the latest FPGAs would, since optimization for FPGAs with scarce computation power is more representative.

Figure 7: Energy consumption estimation and latency of Winograd convolution.

6.2 Experiment on Energy Consumption Analysis

Fig. 7 (a) plots the trend when different values of m are applied. The simulations run by the synthesis tools show that designs with small values of m normally consume less energy. In order to simplify our design, we decided to use m = 2, which eventually affects the dimensions of our systolic arrays, the tiling size, the memory access patterns of our accelerator design, and so on. Although the plot indicates that m = 4 might be the optimal value for energy consumption, we are limited by other hardware resources in our FPGA system; the situation might be different when designing with a different FPGA system. In Fig. 7 (b) we provide the latencies for inference with VGG for different configurations of m and sparsity ranging from 60% to 90%. In the best case, we achieve almost 5× speedup.

6.3 Results and Analysis

With m = 2, we obtain the synthesized result with the resource usage shown in Table 3. The end-to-end comparison with state-of-the-art CNN FPGA accelerators is listed in Table 2. We achieve the highest DSP usage and power efficiency. Due to time limitations, we only test our design on a medium-scale FPGA. In the current design, we use four 4 × 4 systolic arrays as one cluster for one matrix multiplication, and stack 8 such clusters for eight matrix multiplications in parallel. Meanwhile, 16 further 4 × 4 systolic arrays work on the Winograd transform. In total, all 768 PEs are used. We will try to transfer our design to the latest and most powerful FPGA boards in the future, and the performance will improve further.
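As a rough sanity check of the throughput figures in Table 2, the back-of-the-envelope sketch below reproduces the peak numbers from the DSP count and clock frequency. The packing factors (one 16-bit MAC or two packed 8-bit MACs per DSP per cycle, one MAC counted as 2 ops, and a 2× effective gain from weight sparsity for the projected figure) are assumptions introduced here for illustration.

```python
# Peak-throughput sanity check for Table 2 (assumed packing factors).
dsps = 512 + 256            # arithmetic + Winograd-transform DSPs (Table 3)
f_clk = 150e6               # clock frequency in Hz (Table 2)

peak_16bit = dsps * 1 * 2 * f_clk / 1e9   # one 16-bit MAC/DSP/cycle -> Gops/s
peak_8bit  = dsps * 2 * 2 * f_clk / 1e9   # two packed 8-bit MACs/DSP/cycle -> Gops/s

print(peak_16bit, peak_8bit)              # 230.4 and 460.8, matching Table 2
print(peak_8bit * 2)                      # 921.6, the projected sparse figure,
                                          # under an assumed 2x gain from sparsity
```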

7 CONCLUSION

In this paper we propose a design with a highly efficient recursive memory access layout for both dense and sparse Winograd convolutions, unified systolic arrays for both the Winograd transforms and the matrix multiplications, and a three-dimensional compute engine for Winograd convolution. We also provide a comprehensive algorithm-level analysis of the performance model of Winograd convolution. We achieve high computation power usage and high power efficiency in our design. There are several aspects that we can investigate further in the future. In particular, an automated design flow would greatly reduce the burden of development, and the progress in memory technology is also a promising direction, as more and more new FPGA architectures incorporate such concepts.

REFERENCES

[1] Utku Aydonat et al. 2017. An OpenCL™ Deep Learning Accelerator on Arria 10. In FPGA '17. ACM, New York, NY, USA, 55–64.
[2] Y. Choi, M. El-Khamy, and J. Lee. 2018. Compression of Deep Convolutional Neural Networks under Joint Sparsity Constraints. ArXiv e-prints (May 2018). arXiv:cs.CV/1805.08303
[3] Jason Cong and Jie Wang. 2018. PolySA: Polyhedral-Based Systolic Array Auto-Compilation. In ICCAD '18. ACM.
[4] P. Deepa and C. Vasanthanayaki. 2012. FPGA based efficient on-chip memory for image processing algorithms. Microelectronics Journal 43, 11 (2012), 916–928.
[5] C. Farabet et al. 2011. NeuFlow: A runtime reconfigurable dataflow processor for vision. In CVPR 2011 Workshops. 109–116.
[6] C. Zhang et al. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In FPGA '15. ACM, New York, NY, USA, 161–170.
[7] C. Zhang et al. 2016. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In ICCAD '16. 1–8.
[8] Matteo Frigo et al. 2012. Cache-Oblivious Algorithms. ACM Trans. Algorithms 8, 1, Article 4 (Jan 2012), 22 pages.
[9] Naveen Suda et al. 2016. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In FPGA '16. ACM, New York, NY, USA, 16–25.
[10] Q. Zhang et al. 2018. Interpreting CNN knowledge via an Explanatory Graph. AAAI '18 (2018).
[11] R. DiCecco et al. 2016. Caffeinated FPGAs: FPGA framework For Convolutional Neural Networks. In FPT '16. 265–268.
[12] S. Chatterjee et al. 2002. Recursive array layouts and fast matrix multiplication. IEEE Transactions on Parallel and Distributed Systems 13, 11 (Nov 2002), 1105–1123.
[13] S. Han et al. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR (2016).
[14] Vivienne Sze et al. 2017. Hardware for machine learning: Challenges and opportunities. In 2017 IEEE Custom Integrated Circuits Conference (CICC). 1–8.
[15] X. Wei et al. 2017. Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs. In DAC '17. ACM, New York, NY, USA, Article 29, 6 pages.
[16] Xilinx Inc. 2015. UltraScale Architecture. (June 2015). https://www.xilinx.com/products/technology/ultrascale.html
[17] Andrew Lavin. 2016. Fast Algorithms for Convolutional Neural Networks. In CVPR '16. 4013–4021.
[18] Wei Liu et al. 2016. SSD: Single Shot MultiBox Detector. In ECCV '16.
[19] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[20] Shmuel Winograd. 1980. Arithmetic Complexity of Computations. CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 33. Society for Industrial and Applied Mathematics, Philadelphia.

Table 2: Comparison with State-of-the-art implementations

Impl.                       | FPGA '15 [6] | FPGA '16 [7] | FPGA '16 [9]   | DAC '17 [15]                  | our impl.
FPGA                        | V7 VX485T    | Xilinx VC709 | Stratix-V GSD8 | Arria10 GT1150                | V-Ultra XCVU095
Precision                   | 32 bit float | 16 bit fixed | 8–16 bit fixed | 32 bit float / 8–16 bit fixed | 8–16 bit fixed
Frequency (MHz)             | 100          | 200          | 120            | 221.65 / 231.85               | 150
Throughput (Gops/s)         | 61.6         | 354          | 47.5           | 460.5 / 1171.3                | 460.8/230.4 (8/16 bit fixed); 921.6 (projected, 8 bit fixed sparse)
DSP utilization             | 1120/1400    | 2833/3632    | 727/1963       | 1340/1523 / 1500/3046         | (512+256)/768
Power efficiency (Gops/s/W) | 3.31         | 14.22        | 1.84           | 25.78                         | 55.9

(Cells with two values in the DAC '17 [15] column list its floating-point and fixed-point design points, respectively.)

Table 3: Resource usage

Resources      | LUTs    | FF        | BRAM  | DSP
Used           | 241,202 | 634,136   | 1,480 | 512 (arith.) + 256 (wino.)
Available [16] | 537,600 | 1,057,200 | 1,728 | 768
Percentage     | 44.9%   | 60.8%     | 85.6% | 67% + 33% = 100%