An FPGA-Based Overlay Processor for Natural Language Processing

NPE: An FPGA-based Overlay Processor for Natural Language Processing Hamza Khan Asma Khan Zainab Khan UC Los Angeles UC Irvine Stanford University Lun Bin Huang Kun Wang Lei He Independent Researcher UC Los Angeles UC Los Angeles ABSTRACT and DistilBERT [20] have shown even better performance and ac- In recent years, transformer-based models have shown state-of-the- curacy. Implementing efficient accelerators for inference of these art results for Natural Language Processing (NLP). In particular, the transformer models has proven a challenging task. introduction of the BERT language model brought with it break- FPGA-based overlay processors have provided effective solutions throughs in tasks such as question answering and natural language for CNN and other network inference on the edge, allowing for inference, advancing applications that allow humans to interact nat- flexibility across networks without reconfiguring the FPGA[24–26]. urally with embedded devices. FPGA-based overlay processors have CNNs consist mostly of linear matrix operations like convolution been shown as effective solutions for edge image and video process- and pooling, and have consistently shown resilience to extreme ing applications, which mostly rely on low precision linear matrix quantization. However, BERT and its successors [13, 15, 18, 20] can- operations. In contrast, transformer-based NLP techniques employ not be efficiently accelerated using existing CNN FPGA accelerators. a variety of higher precision nonlinear operations with significantly Although they compute many matrix multiplications, transformer higher frequency. We present NPE, an FPGA-based overlay proces- models also introduce complex nonlinear operations that require sor that can efficiently execute a variety of NLP models. NPE offers higher precision, and are called with higher frequency. For instance, software-like programmability to the end user and, unlike FPGA softmax and layer normalization [2] are performed several times in designs that implement specialized accelerators for each nonlinear each encoder for BERT, and block subsequent computation until function, can be upgraded for future NLP models without requir- they are finished processing. As such, it is essential to calculate ing reconfiguration. We demonstrate that NPE can meet real-time them efficiently while maintaining high throughput. These non- conversational AI latency targets for the BERT language model linearities must be computed on-device for performance-sensitive with 4× lower power than CPUs and 6× lower power than GPUs. applications, as sending to a CPU causes significant latency over- We also show NPE uses 3× fewer FPGA resources relative to com- head and is not practical on the edge. parable BERT network-specific accelerators in the literature. NPE Most existing accelerators [11, 19] include specialized units for provides a cost-effective and power-efficient FPGA-based solution computing each type of nonlinearity. For instance, FTRANS [17], for Natural Language Processing at the edge. the only previously published FPGA accelerator for transformers, includes separate softmax and layer normalization modules. Since KEYWORDS NLP is a constantly-evolving field that may introduce different types of nonlinearities, this specialized approach means that an FPGA de- NLP, FPGA, Overlay, Processor, Accelerator, Nonlinear, BERT, Ma- sign may need reconfiguration for additional NLP networks. It also chine Learning leads to unnecessary area overhead and under-utilized resources across nonlinear operations. 1 INTRODUCTION In this paper we propose NPE, an FPGA-based overlay processor One of the most common Natural Language Processing (NLP) tasks for NLP model inference at the edge. As shown in Figure 1, unlike is sequence transduction, translating an input sequence to an out- most other accelerators, NPE employs a common method for ap- arXiv:2104.06535v1 [cs.AR] 13 Apr 2021 put sequence. Traditionally, convolutional neural networks (CNNs) proximating different nonlinear functions efficiently and accurately and recurrent neural networks (RNNs) have been used for this task without added overhead. The main contributions of our work are [9, 10]. The Transformer architecture [22] removes all recurrent as follows: and convolutional components and instead relies on self-attention mechanisms, typically showing better computation time and per- • We design a software-programmable domain-specific over- formance. A transformer is made up of two parts, encoders and de- lay processor with a matrix multiply unit and a multi-precision coders. The Bidirectional Encoder Representations from Transform- vector unit for NLP processing. ers (BERT) model [6] incorporates the encoders from transformers • We employ a unified piecewise polynomial approach for to generate a state-of-the-art language representation model. When nonlinear function approximation to allow extensibility to it was first introduced, BERT broke records on eleven different future nonlinear functions that may be required. NLP tasks. Since then, variations of BERT such as RoBERTa [18] • We demonstrate that our proposed accelerator can meet the real-time latency constraints for conversational AI while FPGA ’21, February 28-March 21, 2021, Virtual 2021. ACM ISBN 978-1-4503-8218-2/21/02...$15.00 maintaining 4× and 6× lower power than GPUs and CPUs https://doi.org/10.1145/3431920.3439477 respectively. Our design utilizes 3× fewer FPGA resources 3 BACKGROUND 3.1 Conversational AI Low power and low latency NLP is a prerequisite for conversational AI at the network edge. Conversational AI allows people to interact naturally with machines in a dialogue. For instance, a user may ask a question to a smart speaker and expect a human-like response almost immediately. For this response to seem natural, it must be returned within 300 milliseconds. In these 300 ms, the device must perform several complex steps including speech-to- text, user authentication, and natural language processing. As a result, any single model’s inference should be complete within 10- 15 milliseconds. Recently, many efforts have been made by the GPU community to optimize GPU implementations of BERT to reach this threshold. 3.2 The BERT Model Figure 1: BERT network structure, showing how each opera- BERT adopts the structure of the encoders from transformers. While tion would be mapped onto NPE compared to a conventional there are many BERT variants, the particular structure can be de- network-specialized accelerator. scribed by three parameters: number of encoders !, number of attention heads 퐴, and hidden layer size 퐻. We focus on BERTBASE, which is composed of 12 encoders, each with 12 attention heads relative to a comparable network-specific transformer FPGA and a hidden size of 768 (! = 12, 퐴 = 12, 퐻 = 768). Figure 1 shows accelerator. the general structure for BERT. The model starts with an embedding layer that converts each 2 RELATED WORK input language sequence into features. For instance, an input se- While BERT has been highly optimized at the software level for quence with 512 tokens in BERTBASE would be converted to a CPU and GPU, little work has been published related to custom 512 × 768 matrix, where each token is replaced by a 768-length hardware acceleration of any transformer-based networks, particu- feature vector. Here, a token refers to a few adjacent characters, larly on FPGAs. Two ASICs have been recently proposed, OPTIMUS where a word is made up of one or more tokens. The embedding [19] and 퐴3 [11], that each accelerate different parts of transformer step has negligible computation but requires lots of memory. There- inference. OPTIMUS optimizes matrix multiplication for transform- fore, we assume this initial step is performed off-chip and focus on ers by exploiting sparsity. It has dedicated exponential, divider, and accelerating the computationally-intensive encoders. square root components for nonlinearities, leading to wasted area Embedding is followed by ! = 12 encoders. Each encoder per- since each is only used a small fraction of the time. 퐴3 accelerates forms four back-to-back operations: multi-headed self-attention, attention mechanisms using various approximation methods. It has layer normalization, feed-forward layers, then layer normalization a very deep pipeline that is specialized for attention, which is inef- again. The encoder calculation can be decomposed into matrix- ficient from an overall design point of view because the multipliers matrix multiplies followed by one of three primitives: softmax, layer and function approximation units cannot be reused even for other normalization, and the GELU activation [12]. Table 1 describes the matrix multiplications. In particular, 퐴3 would need to be paired computation in further detail. with a conventional matrix multiply accelerator to implement BERT. The BERT model is typically trained to handle sequences of up At any one time, either the 퐴3 unit or the main accelerator would be to 512 tokens. For any particular task we pick a sequence length computing, leading to many idle resources and wasted performance B4@_;4= ≤ 512, padding any input sequences less than B4@_;4= potential. and truncating any sequences larger than B4@_;4=. This parameter To the best of our knowledge, FTRANS [17] is the only currently determines the complexity of the operations performed and the published FPGA accelerator for BERT and related transformer net- inference speed. Sequence lengths less than 32 are usually too small works. FTRANS takes a very specialized

An FPGA-Based Overlay Processor for Natural Language Processing

Fill Your Boots: Enhanced Embedded Bootloader Exploits Via Fault Injection and Binary Analysis

Computer Organization and Architecture Designing for Performance Ninth Edition

ARM Instruction Set

3.2 the CORDIC Algorithm

Consider an Instruction Cycle Consisting of Fetch, Operators Fetch (Immediate/Direct/Indirect), Execute and Interrupt Cycles

Akukwe Michael Lotachukwu 18/Sci01/014 Computer Science

Parallel Computing

Instruction Cycle

VLIW Architectures Lisa Wu, Krste Asanovic

18-741 Advanced Computer Architecture Lecture 1: Intro And

Introduction to Cpu

Fpga-Based Digital Phase-Locked Loop Analysis and Implementation