
Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing

Zejian Liu1,2, Gang Li1, Jian Cheng1,2 1National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences 2School of Future Technology, University of Chinese Academy of Sciences Beijing, China [email protected], {gang.li, jcheng}@nlpr.ia.ac.cn

Abstract—BERT is the most recent Transformer-based model that achieves state-of-the-art performance in various NLP tasks. In this paper, we investigate the hardware acceleration of BERT on FPGA for edge computing. To tackle the issue of huge computational complexity and memory footprint, we propose to fully quantize the BERT (FQ-BERT), including weights, activations, softmax, layer normalization, and all the intermediate results. Experiments demonstrate that the FQ-BERT can achieve 7.94× compression for weights with negligible performance loss. We then propose an accelerator tailored for the FQ-BERT and evaluate it on ZCU102 and ZCU111 FPGAs. It can achieve a performance-per-watt of 3.18 fps/W, which is 28.91× and 12.72× over an Intel(R) Core(TM) i7-8700 CPU and an NVIDIA K80 GPU, respectively.

Fig. 1. Left: the overall BERT architecture; middle: the architecture of the multi-head self-attention module; right: the architecture of the scaled dot-product attention layer.

I. INTRODUCTION

In recent years, transformer-based models [1] have become ubiquitous in various natural language processing (NLP) applications, including machine translation [2], question answering and document analysis [3]. Compared with the traditional recurrent neural network (RNN) and long short-term memory (LSTM) [4], a transformer-based model uses the multi-head self-attention mechanism to encode and decode the information of the input sentence and achieve better representative ability, and the Bidirectional Encoder Representations from Transformers (BERT) [5] is the state-of-the-art transformer-based model.

The architecture of BERT is depicted in Figure 1. Different from the original transformer, BERT only consists of stacked encoder layers, as it focuses on learning language representations, similar to convolutional neural networks for vision tasks. The building block of self-attention is essentially matrix-matrix multiplication, which is suitable for massive parallelism, so high-performance GPUs are preferred to accelerate training and inference in the cloud. However, the vast computational complexity (>20 GFLOPs) and memory requirement (>320 MB of floating-point parameters) of BERT pose a significant challenge for resource-constrained platforms, which hinders its application on edge devices.

In this paper, we investigate the acceleration of BERT on resource-constrained FPGAs through algorithm/hardware co-design. Specifically, to reduce the memory footprint, we adopt quantization to compress the original model. Different from Q8BERT [6] and Q-BERT [7], which only quantize part of the parameters, we propose a fully quantized BERT (FQ-BERT). FQ-BERT represents weights, activations, scale factors, softmax, layer normalization, and all the other intermediate results with integer or fixed-point format. We then present an accelerator tailored for the proposed FQ-BERT and evaluate it on two separate FPGAs. Experiments demonstrate that the proposed accelerator outperforms CPU and GPU by a large margin in terms of performance-per-watt with less than 1% accuracy degradation on the SST-2 dataset. The main contributions of this paper can be summarized as follows:

• We propose an efficient FPGA-based BERT accelerator via algorithm/hardware co-design. On the algorithm side, we propose a hardware-friendly FQ-BERT that represents all parameters and intermediate results with integer or fixed-point types. On the hardware side, to support the different operations during inference, we present an accelerator with dot-product based PEs and a bit-level reconfigurable multiplier that handles both 8/4-bit and 8/8-bit multiplication at run-time.

• We evaluate the 8/4-bit FQ-BERT on the SST-2 and MNLI datasets, achieving a compression ratio of 7.94× with 0.7% and 3.5% accuracy drops, respectively. We then implement the accelerator on the Xilinx ZCU102 and ZCU111 MPSoC. Experiments demonstrate that our accelerator can achieve a performance-per-watt of 3.18 fps/W, which is 28.91× and 12.72× over an Intel(R) Core(TM) i7-8700 CPU and an NVIDIA K80 GPU, respectively.


Fig. 2. The overall architecture of the proposed accelerator for fully quantized BERT.

II. FULLY QUANTIZED BERT

In this section, we introduce the methods and tricks that we use to quantize the parameters (weights, activations, biases and others) of the original BERT to obtain FQ-BERT.

A. Weights and Activations

To reduce the computational complexity and memory footprint, we adopt quantization to compress BERT. In this work, we use a symmetric linear quantization strategy for both weights and activations, because symmetric quantization is more hardware friendly owing to the lack of a zero-point. Specifically, for each element x of an input tensor, the k-bit quantization is:

\[ x_c = \mathrm{clamp}(x, \mathit{MIN}, \mathit{MAX}), \quad s = \mathrm{scale}(x_c, k), \quad x_I = \lfloor x_c \times s \rceil, \quad x_q = x_I / s, \quad \text{where } \mathrm{clamp}(x, a, b) = \min(\max(x, a), b) \qquad (1) \]

MIN and MAX are clip thresholds, which need to be carefully tuned during training; their impact on accuracy is also depicted in Figure 3. Note that for symmetric quantization, MIN = −MAX. s is a scaling factor, which is determined by the value of the weights or activations and the quantization bitwidth k. For weights, the scaling factor is calculated according to:

\[ s_w = \mathrm{scale}(W, k) = \frac{2^{k-1} - 1}{\max(|W|)} \qquad (2) \]

We use an exponential moving average (EMA) to collect the statistics of the activations and determine their scaling factor during inference:

\[ s_a = \mathrm{scale}(A, k) = \frac{2^{k-1} - 1}{\mathrm{EMA}(\max(|A|))} \qquad (3) \]

Figure 3 shows the quantization bitwidth of weights and the corresponding accuracy on two document classification datasets. It can be seen that when the bitwidth of weights is lower than 4, the classification accuracy drops dramatically, which indicates that 4-bit quantization for weights is a reasonable trade-off between accuracy and compression ratio.

Fig. 3. Impact of different quantization bitwidths of weights on accuracy (CLIP vs. NO_CLIP): (a) SST-2; (b) MNLI.
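To make the scheme of Eq. (1)-(3) concrete, the following is a minimal PyTorch-style sketch of the fake-quantization step. The function and class names (symmetric_quantize, weight_scale, ActQuantizer) and the EMA momentum are illustrative assumptions rather than the paper's actual code, and the clip is applied on the integer grid, which for symmetric quantization is equivalent to choosing MIN = −MAX.

import torch

def symmetric_quantize(x, scale, k):
    # Fake-quantize x with a symmetric k-bit linear quantizer (Eq. 1):
    # round to the integer grid, clip, then map back to floating point.
    qmax = 2 ** (k - 1) - 1
    x_int = torch.clamp(torch.round(x * scale), -qmax, qmax)  # x_I
    return x_int / scale                                      # x_q

def weight_scale(w, k):
    # Per-tensor weight scaling factor (Eq. 2).
    return (2 ** (k - 1) - 1) / w.abs().max()

class ActQuantizer:
    # Tracks activation statistics with an EMA to derive s_a (Eq. 3).
    def __init__(self, k=8, momentum=0.9):
        self.k, self.momentum, self.ema_max = k, momentum, None

    def __call__(self, a):
        cur_max = a.abs().max().detach()
        if self.ema_max is None:
            self.ema_max = cur_max
        else:
            self.ema_max = self.momentum * self.ema_max + (1 - self.momentum) * cur_max
        scale = (2 ** (self.k - 1) - 1) / self.ema_max
        return symmetric_quantize(a, scale, self.k)

# Example: 4-bit weights and 8-bit activations (the 8/4-bit FQ-BERT setting).
w = torch.randn(768, 768)
a = torch.randn(1, 768)
w_q = symmetric_quantize(w, weight_scale(w, 4), 4)
a_q = ActQuantizer(k=8)(a)

A complete quantization-aware training setup would also pass the rounding through a straight-through estimator so that gradients can flow; that detail is omitted here.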

B. Biases and Others

Different from previous work in which only the weights and activations are quantized, our goal is to obtain a hardware-friendly BERT that facilitates deployment on hardware and improves the utilization of the computing units. To this end, we further quantize all the biases to 32-bit integers with the help of the weights' and activations' scaling factors:

\[ bias_q = \lfloor bias \times s_{bias} \rceil / s_{bias}, \qquad s_{bias} = s_a \times s_w \qquad (4) \]

As the weights, activations and biases are all quantized, the calculation of the output can be rewritten as:

\[ y_I = \lfloor y \times s_y \rceil = \Big( \sum a_I \times w_I + b_I \Big) \times s_f, \qquad s_f = \frac{s_y}{s_a \times s_w} \qquad (5) \]

where s_a, s_w and s_y are quantized to 8-bit values, and s_f is a 32-bit integer. In addition, we also quantize the numerator and output of the softmax module, as well as the parameters of layer normalization, to 8-bit fixed-point values.
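The integer-only inference path implied by Eq. (4)-(5) can be modelled as below. This is a plain-Python sketch; the helper names (quantize_bias, int_linear_output) are hypothetical, and s_f is applied here as a single floating-point factor, whereas on the accelerator it is a precomputed quantity.

def quantize_bias(bias, s_a, s_w):
    # Quantize a bias onto the s_a * s_w grid as a 32-bit integer (Eq. 4).
    s_bias = s_a * s_w
    return int(round(bias * s_bias))

def int_linear_output(a_int, w_int, b_int, s_a, s_w, s_y):
    # Integer dot product followed by requantization to the output grid (Eq. 5).
    # a_int: 8-bit activations, w_int: 4-bit weights, b_int: 32-bit bias.
    acc = sum(a * w for a, w in zip(a_int, w_int)) + b_int  # 32-bit accumulator
    s_f = s_y / (s_a * s_w)                                 # rescaling factor, folded offline
    y_int = int(round(acc * s_f))                           # y_I = round(y * s_y)
    return max(-128, min(127, y_int))                       # clip to the signed 8-bit output

Folding s_y, s_a and s_w into the single factor s_f is what lets the hardware stay in integer arithmetic from the multipliers all the way to the quantized output.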
III. ACCELERATOR

A. System Overview

The overall architecture of our proposed accelerator is shown in Figure 2. It can be divided into two parts: the program running on the CPU and the hardware accelerator on the FPGA. As shown in Figure 1, the whole model consists of embedding layers, encoder layers and task-specific layers. Different from the encoder layers, the computation cost of the other layers is small but their memory cost is large, so these computations are completed on the CPU and the results are sent to the FPGA. The weights are stored in off-chip memory and sent to the FPGA through an AXI4 interface.

On the FPGA side, the accelerator consists of Processing Units (PUs), a Layer Normalization (LN) Core, a Softmax Core and On-chip Buffers. Each PU contains multiple Processing Elements (PEs), which are used for variable-bitwidth matrix-vector multiplication. The On-chip Buffers include the input/output buffer, weight buffer, parameter buffer and intermediate buffer. Specifically, the weight buffer is double buffered to overlap off-chip data transfer with computation. The intermediate buffer stores all the intermediate data, including Q, K, V and the attention matrix. The parameter buffer caches the data required for special computations, such as scaling factors and softmax lookup table values; these data are loaded during the initialization procedure.

B. Computing Unit

PE. As different parts of FQ-BERT have different bitwidths, the inputs of a multiplication come in different bitwidth combinations, such as 8-bit × 4-bit for XW_Q and 8-bit × 8-bit for QK^T. In order to reuse the same module to support these different operations, we design a Bit-split Inner-product Module (BIM), which is similar to Bit-fusion [8], as depicted in Figure 4. Each BIM contains M = 2m 8-bit × 4-bit multipliers, two m-input adder trees and several shift-add logic blocks. Each multiplier has a sign signal to indicate whether its input is signed or unsigned. The shift-add logic shifts the partial sum when the bitwidth of the input is larger than 4. The position of the shift-add logic has two choices, as shown in Fig. 4; placing the shift logic at the adder tree's output (Type A) saves more resources, though it requires rearranging the input data.

Fig. 4. Two types of BIM (M = 4): (a) Type A; (b) Type B.
The output of a BIM is a signed partial sum, which is sent to an accumulator, as shown in Figure 2. The accumulator sums this input with previous data and sends the result to the Psum Buf. When the calculation is completed, the results are sent to the quantization module to obtain the final values, incorporating the biases and scaling factors. Since the quantization takes more than one cycle to complete, the Psum Buf is double buffered so that the calculation can be pipelined.

Softmax Core. BERT uses the softmax function to convert QK^T into the attention matrix. The softmax is defined as:

\[ \mathrm{Softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} \qquad (6) \]

where x is a row vector of the QK^T matrix. As the exponential function exp(x_i) is very expensive in terms of resource consumption, we use a lookup table instead. However, it is difficult to determine the number of entries of the lookup table, as the range of the exponential function is very large and sensitive to the input value. Fortunately, softmax is invariant to subtraction, i.e., subtracting a constant from all the elements in the vector does not change the final result. If the maximum value of the input vector is subtracted from all the elements, the output of the exponential function is always limited between 0 and 1. Moreover, as we quantize exp(x_i) to 8 bits, only 256 sampling points are needed (a short behavioural sketch of this approximation is given at the end of this section).

LN Core. Most operations in layer normalization are element-wise multiplications and additions, together with some non-linear operations, so PE arrays are unsuitable for this computation pattern. We design a coarse-grained 3-stage pipelined SIMD unit to fully utilize its parallelism. The first stage consumes two input vectors with two scaling factors and generates one vector and its mean value. The second stage subtracts the mean value from each element of the vector and calculates the variance. The third stage finishes the element-wise multiplication.

C. Scheduling

The dataflow of the accelerator is shown in Figure 5. According to the different computation units and required weights, the computing process can be divided into several stages. Although the weights have been quantized to 4 bits, loading all the weights needed for a stage at once is still impractical. We therefore further divide each stage into several sub-stages, and only load the part of the weights that will be used in the next sub-stage. Through task-level scheduling, the off-chip transfer can be completely overlapped with computation.

Fig. 5. The dataflow of the proposed accelerator.
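Returning to the Softmax Core described above, the following sketch models the max-subtraction and the 8-bit, 256-entry exponential lookup table. The table range (MAX_RANGE), the uniform index mapping and the floating-point normalization are illustrative assumptions; the paper only states that, after subtracting the row maximum, exp is confined to (0, 1] and sampled at 256 points.

import math

# 256-entry table of exp(x) for x in [-MAX_RANGE, 0]; after subtracting the
# row maximum every input falls in this range, so exp(x) lies in (0, 1].
MAX_RANGE = 8.0
EXP_LUT = [round(255 * math.exp(-MAX_RANGE * i / 255)) for i in range(256)]

def lut_exp(x):
    # Approximate exp(x) for x <= 0 with the 8-bit, 256-entry table.
    idx = min(255, int(round(-x / MAX_RANGE * 255)))
    return EXP_LUT[idx] / 255.0

def softmax_row(row):
    # Softmax of one row of QK^T using max-subtraction and the LUT (Eq. 6).
    m = max(row)
    exps = [lut_exp(x - m) for x in row]   # all arguments are <= 0
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_row([2.0, 1.0, 0.5, -1.0]))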

IV. EVALUATION

In this section, we evaluate the performance of the proposed accelerator. First, we introduce the software and hardware environment setup of our experiments. Then, the performance of FQ-BERT is given. Finally, hardware resource utilization, latency and power consumption are provided.

A. Experimental Setup

We implement the FQ-BERT using PyTorch and evaluate it on two tasks of the GLUE benchmark [9], SST-2 and MNLI. To get the quantized model, we first train the original model for 3 epochs with default hyper-parameters. Then we fine-tune the model with the quantization function.

The proposed hardware accelerator is built on the Xilinx ZCU102 FPGA and the ZCU111 MPSoC, and the frequency of the FPGA part is set to 214 MHz. We implement the accelerator using High Level Synthesis: Vivado HLS 2019.1 is used to compile the C++ code into synthesizable RTL.

The baseline programs run on an Intel(R) Core(TM) i7-8700 CPU @ 3.2 GHz and an NVIDIA K80 GPU with CUDA 10.1. As we only consider latency, all experiments are run with a batch size of 1, and the sentence length is set to 128.
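For reference, the performance-per-watt figures reported later in Table IV follow directly from the measured latency and power; the short calculation below reproduces them, where a frame is a single sentence since the batch size is 1.

# Reproduce the fps/W column of Table IV from latency (ms) and power (W).
platforms = {
    "CPU":    (145.06, 65),
    "GPU":    (27.84, 143),
    "ZCU102": (43.89, 9.8),
    "ZCU111": (23.79, 13.2),
}
for name, (latency_ms, power_w) in platforms.items():
    fps = 1000.0 / latency_ms            # one sentence per inference (batch size 1)
    print(f"{name}: {fps:.2f} fps, {fps / power_w:.2f} fps/W")
# ZCU111: 42.03 fps, 3.18 fps/W, i.e. about 28.9x the CPU and 12.7x the GPU.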

B. Accuracy

The accuracy of FQ-BERT is shown in Table I. For the SST-2 task, the accuracy drop is 0.81%, and for the MNLI task, the accuracy drops are 3.08% and 3.61%, respectively. The accuracy loss on MNLI is larger than on SST-2 as it is a more difficult task. After quantization, the model size is compressed by 7.94×. We also investigate the impact of quantizing different parts of BERT; the results are shown in Table II. We observe that the accuracy drop does not always increase when more parameters are quantized, as quantizing softmax can improve the accuracy from 91.28% to 91.86%.

TABLE I
ACCURACY OF FQ-BERT AND BASELINE BERT.

           w/a     SST-2   MNLI    MNLI-m   Comp. Ratio
BERT       32/32   92.32   84.19   83.97    1×
FQ-BERT    4/8     91.51   81.11   80.36    7.94×

TABLE II
ABLATION STUDY OF QUANTIZATION FOR DIFFERENT PARTS OF BERT (X = PART IS QUANTIZED).

w/a   scale   softmax   layer norm   accuracy
-     -       -         -            92.32
X     -       -         -            91.63
X     X       -         -            91.28
X     X       X         -            91.86
X     X       X         X            91.51

C. Resource Consumption

Table III shows the resource consumption of the proposed accelerator for different configurations. Specifically, the number of PEs N and the number of multipliers per BIM M are examined. We can see that the DSP usage is very high for the targeted FPGA. To demonstrate the scalability of our accelerator, we double the total number of multipliers and implement it on the ZCU111, obtaining nearly twice the performance.

TABLE III
RESOURCE CONSUMPTION AND LATENCY FOR DIFFERENT NUMBERS OF PES AND MULTIPLIERS IN BIM. THE NUMBER OF PUS IS 12.

(N, M)               BRAM18K   DSP48E   FF       LUT      Latency (ms)
ZCU102 (available)   1824      2520     548160   274080   -
(8, 16)              838       1751     124433   123157   43.89
(16, 8)              877       1671     151010   154192   45.35
ZCU111 (available)   2160      4272     850560   425280   -
(16, 16)*            679       3287     201469   189724   23.79
*Some memory is implemented using URAM, which is not reported.

D. Performance and Energy Efficiency

We further compare the performance and energy efficiency of our accelerator with the CPU and GPU; the results are shown in Table IV. Compared with the CPU, our accelerator achieves 6.10× and 28.91× improvements in latency and energy efficiency. Compared with the GPU, it achieves 1.17× and 12.72× improvements.

TABLE IV
PERFORMANCE COMPARISON ON CPU, GPU AND FPGA.

              CPU      GPU     ZCU102   ZCU111
Latency (ms)  145.06   27.84   43.89    23.79
Power (W)     65       143     9.8      13.2
fps/W         0.11     0.25    2.32     3.18

V. CONCLUSION

In this paper, we investigate the acceleration of BERT on FPGAs. To reduce the memory footprint, we propose to fully quantize all parameters of BERT, including weights, activations, scale factors, softmax, layer normalization and other intermediate results. We then present an energy-efficient accelerator and evaluate it on two separate FPGAs. Experiments show that our accelerator outperforms the CPU and GPU by factors of 28.91× and 12.72× in performance-per-watt, respectively.

VI. ACKNOWLEDGMENT

This work was supported in part by the National Natural Science Foundation of China (No. 61972396, 61876182, 61906193), the National Key Research and Development Program of China (No. 2020AAA0103402), and the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB32050200).

REFERENCES

[1] A. Vaswani et al., "Attention is all you need," in NeurIPS, 2017.
[2] Q. Wang et al., "Learning deep transformer models for machine translation," in ACL, 2019.
[3] B. van Aken et al., "How does BERT answer questions? A layer-wise analysis of transformer representations," in CIKM, 2019.
[4] S. Hochreiter et al., "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[5] J. Devlin et al., "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL, 2019.
[6] O. Zafrir et al., "Q8BERT: Quantized 8Bit BERT," in NeurIPS Workshop, 2019.
[7] S. Shen et al., "Q-BERT: Hessian based ultra low precision quantization of BERT," in AAAI, 2019.
[8] H. Sharma et al., "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural networks," in ISCA, 2018.
[9] A. Wang et al., "GLUE: A multi-task benchmark and analysis platform for natural language understanding," in ICLR, 2019.