An 8.93 TOPS/W LSTM Recurrent Neural Network Accelerator Featuring Hierarchical Coarse-Grain Sparsity with All Parameters Stored On-Chip
Deepak Kadetotad, Visar Berisha, Chaitali Chakrabarti, Jae-sun Seo
School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85287, USA
{deepak.kadetotad, visar, chaitali, jaesun.seo}@asu.edu

Abstract—Long short-term memory (LSTM) networks are widely used for speech applications but pose difficulties for efficient implementation on hardware due to large weight storage requirements. We present an energy-efficient LSTM recurrent neural network (RNN) accelerator, featuring an algorithm-hardware co-optimized memory compression technique called hierarchical coarse-grain sparsity (HCGS). Aided by HCGS-based block-wise recursive weight compression, we demonstrate LSTM networks with up to 16× fewer weights while achieving minimal accuracy loss. The prototype chip fabricated in 65nm LP CMOS achieves 8.93/7.22 TOPS/W for 2-/3-layer LSTM RNNs trained with HCGS for the TIMIT/TED-LIUM corpora.

Index Terms—long short-term memory, speech recognition, hardware accelerator, weight compression, structured sparsity

Fig. 1. Illustration of LSTM cell with computation equations:
    i_t = σ(W_xi x_t + W_hi h_{t−1} + b_i)
    f_t = σ(W_xf x_t + W_hf h_{t−1} + b_f)
    o_t = σ(W_xo x_t + W_ho h_{t−1} + b_o)
    c̃_t = tanh(W_xc x_t + W_hc h_{t−1} + b_c)
    c_t = f_t · c_{t−1} + i_t · c̃_t
    h_t = o_t · tanh(c_t)
(x_t: input vector; h_{t−1}: hidden vector with time lag; W_x*, W_h*: weight matrices.)

I. INTRODUCTION

The emergence of internet of things (IoT) devices that require edge computing with limited area and energy has garnered intense interest in energy-efficient ASIC accelerators for deep learning applications. The particular challenge of performing on-device automatic speech recognition (ASR) is that LSTMs that show high accuracy suffer from high complexity and require a large number of parameters to be trained and stored [1].

Recent works presented methods to reduce the complexity and storage of ASR hardware. Magnitude-based pruning was applied to LSTM hardware in [2], resulting in 20× model size reduction, but element-wise sparsity incurs considerable index memory and irregular memory access, hurting both performance and power. To overcome this, structured sparsity techniques have been proposed with row-/column-wise sparsity for RNNs [3], with block-wise sparsity for multi-layer perceptrons (MLPs) [4], and with block-circulant weight matrices for RNNs [5] in speech processing applications. However, these works exhibit limited weight compression of ∼4× [3], [4] or high error rate [5], and have not been implemented in ASIC [2]–[5]. While recent ASIC designs targeting RNNs focus on improved energy-efficiency [6], [7], they do not incorporate compression techniques and do not report RNN accuracy for representative benchmarks, which are both necessary to accomplish practical ASR on small-form-factor edge devices.

In this work, we present a new hierarchical coarse-grain sparsity (HCGS) scheme that structurally compresses LSTM weights by 16× with minimal accuracy loss. The HCGS-based LSTM accelerator, which executes 2-/3-layer LSTMs for real-time speech recognition, was prototyped in 65nm LP CMOS. It consumes 1.85/3.42 mW of power and achieves 8.93/7.22 TOPS/W for the TIMIT/TED-LIUM corpora.

II. LSTM AND HIERARCHICAL COARSE-GRAIN SPARSITY

A. LSTM-based Speech Recognition

LSTM is a type of recurrent neural network (RNN) that shows state-of-the-art accuracy for speech recognition [1]. Each layer of an LSTM consists of neurons, each of which computes the final output h_t through four intermediate results called gates (Fig. 1). From the LSTM equations in Fig. 1, we see that the weight memory requirement of LSTMs is 8× when compared to MLPs with the same number of neurons per layer.
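To make the 8× weight-memory figure concrete, the sketch below restates the Fig. 1 equations in floating-point NumPy for one time step. The 440-input, 512-cell shapes follow the layer sizes described later in this paper; everything else (random initialization, dictionaries keyed by gate name) is purely illustrative and is not the chip's 6-bit fixed-point datapath.

```python
import numpy as np

def lstm_cell_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM time step following the Fig. 1 equations.

    Each of the four gates (i, f, o, c) needs its own W_x and W_h matrix,
    i.e., eight weight matrices per layer versus one for an MLP layer of
    the same width -- the source of the 8x weight memory requirement.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sigmoid(Wx['i'] @ x_t + Wh['i'] @ h_prev + b['i'])        # input gate
    f = sigmoid(Wx['f'] @ x_t + Wh['f'] @ h_prev + b['f'])        # forget gate
    o = sigmoid(Wx['o'] @ x_t + Wh['o'] @ h_prev + b['o'])        # output gate
    c_tilde = np.tanh(Wx['c'] @ x_t + Wh['c'] @ h_prev + b['c'])  # candidate memory
    c = f * c_prev + i * c_tilde                                  # memory cell update
    h = o * np.tanh(c)                                            # cell output
    return h, c

# 512-cell layer fed by 440 fMLLR features (first-layer shapes used in this work)
n_in, n_hid = 440, 512
rng = np.random.default_rng(0)
Wx = {g: 0.01 * rng.standard_normal((n_hid, n_in)) for g in 'ifoc'}
Wh = {g: 0.01 * rng.standard_normal((n_hid, n_hid)) for g in 'ifoc'}
b = {g: np.zeros(n_hid) for g in 'ifoc'}
h, c = lstm_cell_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), Wx, Wh, b)
```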
LSTM-based speech recognition typically consists of a pipeline of a feature extraction module, followed by an LSTM RNN and then by a Viterbi decoder. A commonly used feature for speech recognition is feature-space Maximum Likelihood Linear Regression (fMLLR). fMLLR features are extracted from Mel Frequency Cepstral Coefficient (MFCC) features, derived typically from 25 ms windows of audio samples with a 10 ms overlap between subsequent windows. The features for the current window are combined with those of past and future windows to provide context, and the merged set of features is the input to the neural network. In our implementation, we merge five past and five future windows with the current window to create an input frame with 11 windows, leading to 440 fMLLR features per frame. The output layer of the LSTM consists of probability estimates that are sent to the Viterbi decoder unit to determine the best sequence of phonemes/words.

Fig. 2. Illustration of LSTM RNN weight compression featuring the proposed hierarchical coarse-grain sparsity (HCGS).

Fig. 3. Overall architecture of the proposed LSTM RNN accelerator.

B. Hierarchical Coarse-Grain Sparsity

The proposed HCGS scheme maintains coarse-grain sparsity while further allowing fine-grain weight connectivity, leading to significant area and energy savings. Two-level HCGS is illustrated in Fig. 2, where the first level compresses weights (e.g., 4× compression) using a larger block size (e.g., 32×32), and the remaining weights in the large blocks go through the second level of compression (e.g., 4×) with a smaller block size (e.g., 8×8). The hierarchical structure of block-wise weights is randomly selected before the RNN training process starts, and is maintained throughout the training and classification phases. The dropped blocks remain at zero and do not contribute to the physical memory footprint during both training and classification. TIMIT and TED-LIUM corpora are used to train the RNNs for phoneme and speech recognition, respectively. The baseline 3-layer, 512-cell LSTM RNN that performs speech recognition for the TED-LIUM corpus requires 24 MB of weight memory in floating-point precision. Aided by (1) the proposed HCGS that reduces the number of weights by 16× and (2) low-precision (6-bit) representation of weights, the compressed parameters of a 3-layer, 512-cell LSTM RNN are reduced to only 288 kB (83× reduction in model size compared to 24 MB). The resultant LSTM network can be fully stored on-chip, which results in energy-efficient acceleration.
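As a quick sanity check on these numbers: a 3-layer, 512-cell network with 440 inputs holds roughly 6.1M weights, about 24 MB in 32-bit floating point; keeping 1/16 of the weights and storing them in 6 bits gives 24 MB × (1/16) × (6/32) ≈ 288 kB, consistent with the quoted ≈83× model-size reduction. The sketch below shows one way a two-level HCGS mask could be drawn once before training under the example parameters above (4× at 32×32 blocks, then 4× at 8×8 blocks); the balanced per-row-group selection is an assumption chosen to mirror the fixed activation-selection budget described in Sec. III, and the on-chip index format is not modeled.

```python
import numpy as np

def hcgs_mask(rows, cols, big=32, small=8, keep_big=4, keep_small=4, seed=0):
    """Illustrative two-level HCGS mask (Fig. 2 style).

    Level 1 keeps 1-in-`keep_big` of the big (32x32) column blocks per block
    row; level 2 keeps 1-in-`keep_small` of the small (8x8) column blocks per
    sub-row inside each surviving big block. Drawn once, then frozen for
    training and inference.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((rows, cols), dtype=bool)
    n_big_col = cols // big
    for br in range(0, rows, big):                           # level 1: 32x32 blocks
        kept_big = rng.choice(n_big_col, n_big_col // keep_big, replace=False)
        for bci in kept_big:
            bc = bci * big
            for sr in range(br, br + big, small):            # level 2: 8x8 blocks
                kept_small = rng.choice(big // small,
                                        (big // small) // keep_small,
                                        replace=False)
                for sci in kept_small:
                    sc = bc + sci * small
                    mask[sr:sr + small, sc:sc + small] = True
    return mask

m = hcgs_mask(512, 512)
print(f"kept fraction = {m.mean():.4f} (target 1/16 = {1/16:.4f})")
```

With this balanced layout, every group of 8 output rows touches exactly 32 of the 512 input activations, which is the property the two-level HCGS selector in the hardware exploits.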
III. ARCHITECTURE AND DESIGN OPTIMIZATIONS

A. Hardware Architecture

Fig. 3 shows the overall architecture of the proposed LSTM accelerator. It consists of the HCGS selector, input and output buffers, the MAC unit, the H-buffer, the C-buffer, two memory banks (144 kB each) for weight storage, a bias/index memory bank (8.5 kB), and the global controller. The proposed architecture facilitates the computation of one LSTM cell output per cycle after

1) HCGS Selector: The HCGS selector (Fig. 3, top-left) has two levels: the first level of the selector only enables the propagation of activations associated with the larger non-zero weight blocks, and the second level further filters through the activations associated with the smaller non-zero blocks. For 16× HCGS compression, only 32 activation outputs are required from a total of 512 activations, ensuring that only activations corresponding to non-zero weights propagate to the MAC unit, greatly boosting energy-efficiency.

2) Input and Output buffers: An input frame consists of fMLLR features as described in Sec. II-A. The input buffer is used to store the fMLLR features of an input frame, which is streamed in 13 bits at a time over 512 cycles. The output buffer consists of two identical buffers for double buffering, which enables continuous computation of the LSTM accelerator while streaming out the final layer outputs. Each output buffer employs an HCGS selector and a 6656:416 multiplexer to feed back the output of the current layer to the next layer. The feedback from the output buffer to the input of the MAC facilitates the reuse of the MAC unit.

3) H-buffer and C-buffer: The H-buffer and C-buffer store the outputs of the previous frame (h_{t−1}) and cell state (c_{t−1}) for each layer, respectively.

4) MAC Unit: The MAC unit consists of 64 parallel MACs (computing vector-matrix multiplications) and the LSTM gate computation module (computing intermediate LSTM gate and final output values), which can effectively perform 2,064 operations in each cycle aided by the proposed HCGS compression. The non-linear activation functions (sigmoid and hyperbolic tangent) are implemented through piece-wise linear modules using 20 linear segments (a sketch of such an approximation is given at the end of this section).

5) Weight/bias storage and global controller: Weights are stored in the interleaved fashion described in Sec. III-B, where each memory sub-bank (W1-W3) stores weights cor-
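The text above states only that sigmoid and tanh are realized as piece-wise linear modules with 20 segments. The sketch below is a minimal software model of such an approximation; the uniform breakpoints and the clipped input ranges are assumptions for illustration, not the chip's actual segment table.

```python
import numpy as np

def make_pwl(fn, lo, hi, n_seg=20):
    """Build an n_seg-segment piece-wise linear approximation of `fn` on [lo, hi].

    Inputs outside the range are clamped; segment endpoints lie on fn itself.
    Uniform breakpoints are an assumption -- the hardware segment placement
    is not disclosed in this section.
    """
    xs = np.linspace(lo, hi, n_seg + 1)   # 21 breakpoints -> 20 linear segments
    ys = fn(xs)
    def pwl(x):
        x = np.clip(x, lo, hi)
        return np.interp(x, xs, ys)       # linear interpolation between breakpoints
    return pwl

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
pwl_sigmoid = make_pwl(sigmoid, -8.0, 8.0)
pwl_tanh = make_pwl(np.tanh, -4.0, 4.0)

x = np.linspace(-10, 10, 2001)
print("max |sigmoid error| =", np.abs(pwl_sigmoid(x) - sigmoid(x)).max())
print("max |tanh error|    =", np.abs(pwl_tanh(x) - np.tanh(x)).max())
```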