SqueezeBERT: What can computer vision teach NLP about efficient neural networks?

Forrest N. Iandola, Albert E. Shaw, Ravi Krishna (UC Berkeley EECS), Kurt W. Keutzer (UC Berkeley EECS)

Abstract

Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. Toward this end, we consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. To begin to address this problem, we draw inspiration from the computer vision community, where work such as MobileNet has demonstrated that grouped convolutions (e.g., depthwise convolutions) can enable speedups without sacrificing accuracy. We demonstrate how to replace several operations in self-attention layers with grouped convolutions and use this technique in a novel network architecture called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set. A PyTorch-based implementation of SqueezeBERT is available as part of the Hugging Face Transformers library: https://huggingface.co/squeezebert

1 Introduction and Motivation

The human race writes over 300 billion messages per day (Sayce, 2019; Schultz, 2019; Al-Heeti, 2018; Templatify, 2017). Out of these, more than half of the world's emails are read on mobile devices, and nearly half of Facebook users exclusively access Facebook from a mobile device (Lovely Mobile News, 2017; Donnelly, 2018). Natural language processing (NLP) technology has the potential to aid these users and communities in several ways. When a person writes a message, NLP models can help with spelling and grammar checking as well as sentence completion. When content is added to a social network, NLP can facilitate content moderation before it appears in other users' news feeds. When a person consumes messages, NLP models can help classify messages into folders, compose news feeds, prioritize messages, and identify duplicates.

In recent years, the development and adoption of attention neural networks have led to dramatic improvements in almost every area of NLP. In 2017, Vaswani et al. proposed the multi-head self-attention module, which demonstrated superior accuracy to recurrent neural networks on English-German machine language translation (Vaswani et al., 2017).[1] These modules have since been adopted by GPT (Radford et al., 2018) and BERT (Devlin et al., 2019) for sentence classification, and by GPT-2 (Radford et al., 2019) and CTRL (Keskar et al., 2019) for sentence completion and generation. Recent works such as ELECTRA (Clark et al., 2020) and RoBERTa (Liu et al., 2019) have shown that larger datasets and more sophisticated training regimes can further improve the accuracy of self-attention networks.

[1] Neural networks that use the self-attention modules of Vaswani et al. are sometimes called "Transformers," but in the interest of clarity, we call them "self-attention networks."

Considering the enormity of the textual data created by humans on mobile devices, a natural approach is to deploy NLP models directly onto mobile devices, embedding them in the apps used to read, write, and share text. Unfortunately, highly-accurate NLP models are computationally expensive, making mobile deployment impractical. For example, we observe that running the BERT-base network on a Google Pixel 3 smartphone takes approximately 1.7 seconds to classify a single text data sample.[2]

[2] Note that BERT-base (Devlin et al., 2019), RoBERTa-base (Liu et al., 2019), and ELECTRA-base (Clark et al., 2020) all use the same self-attention encoder architecture, and therefore these networks incur approximately the same latency on a smartphone.

Much of the research on efficient self-attention networks for NLP has just emerged in the past year. However, starting with SqueezeNet (Iandola et al., 2016b), the mobile computer vision (CV) community has spent the last four years optimizing neural networks for mobile devices. Intuitively, it seems like there must be opportunities to apply the lessons learned from the rich literature of mobile CV research to accelerate mobile NLP. In the following, we review what has already been applied and propose two additional techniques from CV that we will leverage to accelerate NLP models.

1.1 What has CV research already taught NLP research about efficient networks?

In recent months, novel self-attention networks have been developed with the goal of achieving faster inference. At present, the MobileBERT network defines the state-of-the-art in low-latency text classification for mobile devices (Sun et al., 2020). MobileBERT takes approximately 0.6 seconds to classify a text sequence on a Google Pixel 3 smartphone while achieving higher accuracy on the GLUE benchmark, which consists of 9 natural language understanding (NLU) datasets (Wang et al., 2018), than other efficient networks such as DistilBERT (Sanh et al., 2019), PKD (Sun et al., 2019a), and several others (Lan et al., 2019; Turc et al., 2019; Jiao et al., 2019; Xu et al., 2020). To achieve this, MobileBERT introduced two concepts into their NLP self-attention network that are already in widespread use in CV neural networks:

1. Bottleneck layers. In ResNet (He et al., 2016), the 3x3 convolutions are computationally expensive, so a 1x1 "bottleneck" convolution is employed to reduce the number of channels input to each 3x3 convolution layer. Similarly, MobileBERT adopts bottleneck layers that reduce the number of channels before each self-attention layer, reducing the computational cost of the self-attention layers.

2. High-information flow residual connections. In BERT-base, the residual connections serve as links between the low-channel-count (768 channels) layers. The high-channel-count (3072 channels) layers in BERT-base do not have residual connections. However, the ResNet and Residual-SqueezeNet (Iandola et al., 2016b) CV networks connect the high-channel-count layers with residuals, enabling higher information flow through the network. Similar to these CV networks, MobileBERT adds residual connections between the high-channel-count layers.

1.2 What else can CV research teach NLP research about efficient networks?

We are encouraged by the progress that MobileBERT has made in leveraging ideas that are popular in the CV literature to accelerate NLP. However, we are aware of two other ideas from CV, which were not used in MobileBERT, that could be applied to accelerate NLP:

1. Convolutions. Since the 1980s, computer vision neural nets have relied heavily on convolutional layers (Fukushima, 1980; LeCun et al., 1989). Convolutions are quite flexible and well-optimized in software, and they can implement things as simple as a 1D fully-connected layer, or as complex as a 3D dilated layer that performs upsampling or downsampling.

2. Grouped convolutions. A popular technique in modern mobile-optimized neural networks is grouped convolutions (see Section 3). Proposed by Krizhevsky et al. in the 2012 winning submission to the ImageNet image classification challenge (Krizhevsky et al., 2011, 2012; Russakovsky et al., 2015), grouped convolutions disappeared from the literature for some years, then re-emerged as a key technique circa 2016 (Chollet, 2016; Xie et al., 2017), and today they are extensively used in efficient CV networks such as MobileNet (Howard et al., 2017), ShuffleNet (Zhang et al., 2018), and EfficientNet (Tan and Le, 2019). While common in the CV literature, we are not aware of work applying grouped convolutions to NLP.

1.3 SqueezeBERT: Applying lessons learned from CV to NLP

In this work, we describe how to apply convolutions, and particularly grouped convolutions, in the design of a novel self-attention network for NLP, which we call SqueezeBERT.

Empirically, we find that SqueezeBERT runs at lower latency on a smartphone than BERT-base, MobileBERT, and several other efficient NLP models, while maintaining competitive accuracy.

2 Implementing self-attention with convolutions

In this section, first, we review the basic structure of self-attention networks. Next, we identify that their biggest computational bottleneck is in their position-wise fully-connected (PFC) layers. We then show that these PFC layers are equivalent to a 1D convolution with a kernel size of 1.

2.1 Self-attention networks

In most BERT-derived networks there are typically 3 stages: the embedding, the encoder, and the classifier (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020; Sun et al., 2020; Lan et al., 2019).[3] The embedding converts preprocessed words (represented as integer-valued tokens) into learned feature-vectors of floating-point numbers. The encoder is comprised of a series of self-attention and other layers. The classifier produces the network's final output. As we will see later in Table 1, the embedding and the classifier account for less than 1% of the runtime of a self-attention network, so we focus our discussion on the encoder.

[3] Some self-attention networks such as (Vaswani et al., 2017; Radford et al., 2018) also have a "decoder" stage. The decoder typically uses a similar neural architecture as the encoder, but is auto-regressive.

We now describe the encoder that is used in BERT-base (Devlin et al., 2019). The encoder consists of a stack of blocks. Each block consists of three position-wise fully-connected (PFC) layers, then a self-attention module, and finally a stack of three position-wise fully-connected layers, known as feed-forward network (FFN) layers. The initial three PFC layers are used to generate the query (Q), key (K), and value (V) activation vectors for each position in the feature embedding. Each of these Q, K, and V layers applies the same operation to each position in the feature embedding independently. While neural networks traditionally multiply weights by activations, a distinguishing factor of attention neural networks is that they multiply activations by other activations, enabling dynamic weighting of tensor elements to adjust based on the input data. Further, attention networks allow modeling of arbitrary dependencies regardless of their distance in the input or output (Vaswani et al., 2017). The self-attention module proposed by Vaswani et al. (Vaswani et al., 2017) (which is also used by GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ELECTRA (Clark et al., 2020), and others) multiplies the Q, K, and V activations together using the equation $\mathrm{softmax}(QK^T / \sqrt{d_k})\,V$, where $d_k$ is the number of channels in one attention head.[4]

[4] For example, in BERT-base, the self-attention module has 768 channels and 12 heads, so $d_k = 768/12 = 64$.

2.2 Benchmarking BERT for mobile inference

To identify the parts of BERT that are time-consuming to compute, we profile BERT on a smartphone. Specifically, we measure the neural network's latency using PyTorch (Paszke et al., 2019) and TorchScript on a Google Pixel 3 smartphone, with an input sequence length of 128 and a batch size of 1. This is a reasonable sequence length for text messages, instant messages, short emails, and other messages that are commonly written and read by smartphone users. In Table 1, we show the breakdown of FLOPs and latency among the main components of the BERT network, and we observe that the self-attention calculations (i.e., $\mathrm{softmax}(QK^T / \sqrt{d_k})\,V$) account for only 11.3% of the total latency. However, PFC layers account for 88.3% of the latency.

Table 1: How does BERT spend its time? This is a breakdown of computation (in floating-point operations, or FLOPs) and latency (on a Google Pixel 3 smartphone) in BERT-base. The sequence length is 128.

Stage            | Module type                 | FLOPs  | Latency
Embedding        | Embedding                   | 0.00%  | 0.26%
Encoder          | Self-attention calculations | 2.70%  | 11.3%
Encoder          | PFC layers                  | 97.3%  | 88.3%
Final Classifier | PFC layers                  | 0.00%  | 0.02%
Total            |                             | 100%   | 100%
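As a rough, desktop-side approximation of this measurement setup (the numbers above were collected on a Pixel 3), the sketch below traces BERT-base with TorchScript and times repeated forward passes. It assumes the Hugging Face transformers package is available; measuring on the phone itself would additionally require saving the traced model and executing it with PyTorch Mobile on the device.

```python
import time
import torch
from transformers import BertModel

# torchscript=True makes the model return tuples so that it can be traced
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True).eval()
input_ids = torch.randint(1000, 2000, (1, 128))   # batch size 1, sequence length 128
traced = torch.jit.trace(model, input_ids)

with torch.no_grad():
    for _ in range(3):                            # warm-up passes
        traced(input_ids)
    runs, start = 20, time.time()
    for _ in range(runs):                         # timed passes
        traced(input_ids)
print(f"average latency: {1000 * (time.time() - start) / runs:.1f} ms")
```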
2.3 Replacing the position-wise fully-connected (PFC) layers with convolutions

Given that PFC layers account for the overwhelming majority of the latency, we now focus on reducing the PFC layers' latency. In particular, we intend to replace the PFC layers with grouped convolutions, which have been shown to produce significant speedups in computer vision networks. As a first step in this direction, we now show that the position-wise fully-connected layers used throughout the BERT encoder are a special case of non-grouped 1D convolution.

Let $w$ denote the weights of the position-wise fully-connected layer, with dimensions $(C, C)$. Given an input feature vector $f$ of dimensions $(P, C)$ with $P$ positions and $C$ channels, to generate an output of $(P, C)$ features, the operation performed by the position-wise fully-connected layer for each output channel $c$ at position $p$ can be defined:

$$\mathrm{PFC}_{p,c}(f, w) = \sum_{i}^{C} w_{c,i} \, f_{p,i}$$

Then, if we consider the definition of a 1D convolution with kernel size $K$ with the same input and output dimensions, letting $q$ be the weights of the convolution with dimensions $(C, C, K)$:

$$\mathrm{Conv}_{p,c}(f, q) = \sum_{i}^{C_{in}} \sum_{k}^{K} q_{c,i,k} \, f_{(p - \frac{K}{2} + k),\, i}$$

we observe that the position-wise fully-connected operation is equivalent to a convolution with a kernel size of $K = 1$ where $q_{c,i,0} = w_{c,i}$:

$$\mathrm{Conv}_{p,c}(f, q) = \sum_{i}^{C} q_{c,i,0} \, f_{p,i}$$

Thus, the PFC layers of Vaswani et al. (Vaswani et al., 2017), GPT, BERT, and similar self-attention networks can be implemented using convolutions without changing the networks' numerical properties or behavior.
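The equivalence above is easy to verify numerically. The following sketch (our own illustration, not code from the paper) copies the weights of a position-wise fully-connected layer into a kernel-size-1 Conv1d and checks that the two layers produce the same output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
P, C = 128, 768                        # positions and channels
x = torch.randn(1, C, P)               # (batch, channels, positions) layout

fc = nn.Linear(C, C)                   # position-wise fully-connected layer
conv = nn.Conv1d(C, C, kernel_size=1)  # non-grouped 1D convolution, K = 1
with torch.no_grad():
    conv.weight.copy_(fc.weight.unsqueeze(-1))   # q[c, i, 0] = w[c, i]
    conv.bias.copy_(fc.bias)

out_fc = fc(x.transpose(1, 2)).transpose(1, 2)   # apply the FC at every position
out_conv = conv(x)
print(torch.allclose(out_fc, out_conv, atol=1e-5))  # True
```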
3 Incorporating grouped convolutions into self-attention

Now that we have shown how to implement the expensive PFC layers in self-attention networks using convolutions, we can incorporate efficient grouped convolutions into a self-attention network. Grouped convolutions are defined as follows. Given an input feature vector of dimensions $(P, C)$ with $P$ positions and $C$ channels, outputting a vector with dimensions $(P, C)$, a 1D convolution with kernel size $K = 1$, $g$ groups, and weight vector $q$ of dimensions $(C, \frac{C}{g})$ can be defined as follows. Let $N = \frac{C}{g}$, where $N$ is the number of channels in each group. Then:

$$\mathrm{GConv}_{p,c}(f, q) = \sum_{i}^{N} q_{c,i,0} \, f_{p,\,(i + \lfloor \frac{c}{N} \rfloor N)}$$

This is equivalent to splitting the input vector into $g$ separate vectors of size $(P, \frac{C_{in}}{g})$ along the $C$ dimension and running $g$ separate convolutions with independent weights, each computing vectors of size $(P, \frac{C_{out}}{g})$. The grouped convolution, however, requires only $\frac{1}{g}$ as many floating-point operations (FLOPs) and $\frac{1}{g}$ as many weights as an ordinary convolution, not counting the small (and unchanged) amount of operations needed for the channel-wise bias term that is often included in convolutional layers.[5] Finally, to complement the mathematical explanation of grouped convolutions, we illustrate the difference between traditional convolutions and grouped convolutions in Figure 1.

[5] Note that a grouped convolution with g = 1 is identical to an ordinary convolution.

Figure 1: Traditional vs. grouped convolutions. In panel (a), we illustrate the weight matrix of a traditional 1D convolution with 8 input channels, 8 output channels, and a kernel size of 1. In panel (b), we illustrate a grouped convolution with g = 4. White cells in the grid are empty. Observe that with g = 4, the weight matrix has one-fourth the number of parameters of a traditional convolution.
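The parameter counts illustrated in Figure 1 can be checked directly with PyTorch's groups argument; the small sketch below (ours) also confirms that the same 1/g weight reduction holds at BERT-base scale.

```python
import torch.nn as nn

def n_weights(m):
    return m.weight.numel()   # bias excluded, matching the discussion above

dense   = nn.Conv1d(8, 8, kernel_size=1, groups=1)   # Figure 1(a)
grouped = nn.Conv1d(8, 8, kernel_size=1, groups=4)   # Figure 1(b)
print(n_weights(dense), n_weights(grouped))           # 64 vs 16 weights

# the same 1/g ratio holds for a 768-channel layer with g = 4
print(n_weights(nn.Conv1d(768, 768, 1, groups=1)) //
      n_weights(nn.Conv1d(768, 768, 1, groups=4)))    # 4
```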

3.1 SqueezeBERT

Now, we describe our proposed neural architecture called SqueezeBERT, which uses grouped convolutions. SqueezeBERT is much like BERT-base, but with PFC layers implemented as convolutions, and grouped convolutions for many of the layers. Recall from Section 2 that each block in the BERT-base encoder has a self-attention module that ingests the activations from 3 PFC layers, and the block also has 3 more PFC layers called feed-forward network layers (FFN1, FFN2, and FFN3).

[Figure 2 diagram: an input tensor (W, C) feeds the Q, K, and V layers (each a grouped convolution with g = 4), producing Q, K, and V tensors of shape (W, C); these are reshaped into E heads, multiplied (MatMul) to form the QK tensor (E, W, W) and then the QKV tensor (E, W, C/E), and reshaped back to (W, C); this is followed by Feed Forward Network Layer 1 (g = 1, output (W, C)), Feed Forward Network Layer 2 (g = 4, output (W, 4C)), and Feed Forward Network Layer 3 (g = 4, output (W, C)). W = sequence length = 128; C = channels = 768; E = number of heads = 12; g = number of groups. Residual connections not shown.]

Figure 2: One block of the SqueezeBERT encoder. The SqueezeBERT encoder consists of a stack of 12 of these modules.
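To complement Figure 2, here is a minimal PyTorch sketch (ours, not the released SqueezeBERT code) of the data flow through one such block, using the dimensions listed in the figure legend. Layer normalization, dropout, activation placement, and the residual connections that the figure omits are also left out here, so treat it as an illustration of the tensor shapes and grouped convolutions rather than an exact reimplementation.

```python
import torch
import torch.nn as nn

class SqueezeBertLikeBlock(nn.Module):
    """Illustrative block: grouped-conv Q/K/V, multi-head attention, grouped FFN layers."""

    def __init__(self, c=768, heads=12, groups=4):
        super().__init__()
        self.heads, self.d_head = heads, c // heads
        # Q, K, V as 1D convolutions with kernel size 1 and g = 4 groups (Figure 2)
        self.q = nn.Conv1d(c, c, kernel_size=1, groups=groups)
        self.k = nn.Conv1d(c, c, kernel_size=1, groups=groups)
        self.v = nn.Conv1d(c, c, kernel_size=1, groups=groups)
        # FFN1 (g = 1) allows mixing across groups; FFN2 and FFN3 use g = 4
        self.ffn1 = nn.Conv1d(c, c, kernel_size=1, groups=1)
        self.ffn2 = nn.Conv1d(c, 4 * c, kernel_size=1, groups=groups)
        self.ffn3 = nn.Conv1d(4 * c, c, kernel_size=1, groups=groups)

    def forward(self, x):                      # x: (batch, C, W)
        b, c, w = x.shape

        def split_heads(t):                    # (B, C, W) -> (B, E, W, C/E)
            return t.view(b, self.heads, self.d_head, w).transpose(2, 3)

        q, k, v = split_heads(self.q(x)), split_heads(self.k(x)), split_heads(self.v(x))
        scores = torch.softmax(q @ k.transpose(-1, -2) / self.d_head ** 0.5, dim=-1)
        ctx = (scores @ v).transpose(2, 3).reshape(b, c, w)   # back to (B, C, W)
        return self.ffn3(torch.relu(self.ffn2(self.ffn1(ctx))))

block = SqueezeBertLikeBlock()
out = block(torch.randn(1, 768, 128))          # batch 1, C = 768, W = 128
print(out.shape)                               # torch.Size([1, 768, 128])
```

Counting the parameters of this sketch's six convolutions gives roughly 2.2M per block, versus roughly 7.1M for the corresponding dense BERT-base layers, which is broadly consistent with the drop from 109M to 51.1M parameters reported in Table 2.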

The FFN layers have the following dimensions: FFN1 has $C_{in} = C_{out} = 768$, FFN2 has $C_{in} = 768$ and $C_{out} = 3072$, and FFN3 has $C_{in} = 3072$ and $C_{out} = 768$. In all PFC layers of the self-attention modules, and in the FFN2 and FFN3 layers, we use grouped convolutions with g = 4. To allow for mixing across channels of different groups, we use g = 1 in the less-expensive FFN1 layers. Note that in BERT-base, FFN2 and FFN3 each have 4 times more arithmetic operations than FFN1. However, when we use g = 4 in FFN2 and FFN3, all FFN layers have the same number of arithmetic operations. We illustrate one block of the SqueezeBERT encoder in Figure 2.

Finally, the embedding size (768), the number of blocks in the encoder (12), the number of heads per self-attention module (12), the WordPiece tokenizer (Schuster and Nakajima, 2012; Wu et al., 2016), and other aspects of SqueezeBERT are adopted from BERT-base. Aside from the convolution-based implementation and the adoption of grouped convolutions, the SqueezeBERT architecture is identical to BERT-base.

4 Experimental Methodology

4.1 Datasets

Pretraining Data. For pretraining, we use a combination of Wikipedia and BooksCorpus (Zhu et al., 2015), setting aside 3% of the combined dataset as a test set. Following the ALBERT paper, we use Masked Language Modeling (MLM) and Sentence Order Prediction (SOP) as pretraining tasks (Lan et al., 2019).

Finetuning Data. We finetune and evaluate SqueezeBERT (and other baselines) on the General Language Understanding Evaluation (GLUE) set of tasks. This benchmark consists of a diverse set of 9 NLU tasks; thanks to the structure and breadth of these tasks (see supplementary material for detailed task-level information), GLUE has become the standard evaluation benchmark for NLP research. A model's performance across the GLUE tasks likely provides a good approximation of that model's generalizability (especially for text classification tasks).

4.2 Training Methodology

Many recent papers on efficient NLP networks report results on models trained with bells and whistles such as distillation, adversarial training, and/or transfer learning across GLUE tasks. However, there is no standardization of these training schemes across different papers, making it difficult to distinguish the contribution of the model from the contribution of the training scheme to the final accuracy number.

Therefore, we first train SqueezeBERT using a simple training scheme (described in Section 4.2.1, with results reported in Section 5.1), and then we train SqueezeBERT with distillation and other techniques (described in Section 4.2.2, with results reported in Section 5.2).

4.2.1 Training without bells and whistles

We pretrain SqueezeBERT from scratch (without distillation) using the LAMB optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28 (You et al., 2020). Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.

For finetuning, we use the AdamW optimizer with a batch size of 16, without momentum or weight decay, with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ (Loshchilov and Hutter, 2019). As is common in the literature, during finetuning for each task, we perform hyperparameter tuning on the learning rate and dropout rate. We present more details on this in the supplementary material. In the interest of a fair comparison, we also train BERT-base using the aforementioned pretraining and finetuning protocol.

4.2.2 Training with bells and whistles

We now review recent techniques for improving the training of NLP networks, and we describe the approaches that we will use for the training and evaluation of SqueezeBERT in Section 5.2.

Distillation approaches used in other efficient NLP networks. While the term "knowledge distillation" was coined by Hinton et al. to describe a specific method and equation (Hinton et al., 2015), the term "distillation" is now used in reference to a diverse range of approaches where a "student" network is trained to replicate a "teacher" network. Some researchers distill only the final layer of the network (Sanh et al., 2019), while others also distill the hidden layers (Sun et al., 2019a, 2020; Xu et al., 2020). When distilling the hidden layers, some apply layer-by-layer distillation warmup, where each module of the student network is distilled independently while downstream modules are frozen (Sun et al., 2020). Some distill during pretraining (Sun et al., 2020; Sanh et al., 2019), some distill during finetuning (Xu et al., 2020), and some do both (Sun et al., 2019a; Jiao et al., 2019).

Bells and whistles used for training SqueezeBERT (for results in Section 5.2). Distillation is not a central focus of this paper, and there is a large design space of potential approaches to distillation, so we select a relatively simple form of distillation for use in SqueezeBERT training. We apply distillation only to the final layer, and only during finetuning. On the GLUE sentence classification tasks, we use a soft cross entropy loss with respect to a weighted sum of the teacher's logits ($\hat{y}_t$) and a one-hot encoding of the ground truth ($\hat{y}_g$). The weighting between the teacher logits and the ground truth is controlled by a hyperparameter $\alpha$. Formally, we write this weighted sum as:

$$\hat{y} = (1 - \alpha)\,\hat{y}_t + \alpha\,\hat{y}_g$$

Also note that GLUE has one regression task (STS-B text similarity), and for this task we replace the soft cross entropy loss with mean squared error. In addition to distillation, inspired by STILTs (Phang et al., 2018) and ELECTRA (Clark et al., 2020), we apply transfer learning from the MNLI GLUE task to other GLUE tasks as follows. The SqueezeBERT student model is pretrained using the approach described in Section 4.2.1, and then it is finetuned on the MNLI task. The weights from MNLI training are used as the initial student weights for other GLUE tasks except for CoLA.[6] Similarly, the teacher model is a BERT-base model that is pretrained using the ELECTRA method and then finetuned on MNLI. The teacher model is then finetuned independently on each GLUE task, and these task-specific teacher weights are used for distillation.

[6] For CoLA, the student weights are pretrained (per Section 4.2.1) but not finetuned on MNLI prior to task-specific training.
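As an illustration of this loss, the sketch below mixes the teacher output with the one-hot label according to the weighted sum above and applies a soft cross entropy to the student's predictions. Treating the teacher term as a softmax-normalized distribution (rather than raw logits) is our assumption; the text above does not spell out this detail.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, num_classes=2):
    # y = (1 - alpha) * y_t + alpha * y_g, with y_t softmax-normalized (our assumption)
    one_hot = F.one_hot(labels, num_classes).float()
    target = (1 - alpha) * F.softmax(teacher_logits, dim=-1) + alpha * one_hot
    # soft cross entropy between the mixed target and the student's predictions
    return -(target * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()

student = torch.randn(4, 2)                 # a batch of 4 examples, 2 classes
teacher = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
print(distillation_loss(student, teacher, labels))
```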
5 Results

We now turn our attention to comparing SqueezeBERT to other efficient neural networks.

5.1 Results without bells and whistles

In the upper portions of Tables 2 and 3, we compare our results to other efficient networks on the dev and test sets of the GLUE benchmark.

Table 2: Comparison of neural networks on the development set of the GLUE benchmark. For tasks that have 2 metrics (e.g. MRPC's metrics are Accuracy and F1), we report the average of the 2 metrics. * denotes models trained by the authors of the present paper. Bells and whistles are: A = adversarial training; D = distillation of final layer; E = distillation of encoder layers; S = transfer learning across GLUE tasks (a.k.a. STILTs (Phang et al., 2018)); W = per-layer warmup. In GLUE accuracy, a dash means that accuracy for this task is not provided in the literature.

(MNLI-m through Average are GLUE accuracy; #MParams through Speedup are efficiency metrics.)

Model | Bells & Whistles | MNLI-m | MNLI-mm | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average | #MParams | GFLOPs | Latency (ms) | Speedup
Results without bells and whistles | | | | | | | | | | | | | | |
BERT-base* | - | 85.2 | 84.8 | 89.9 | 92.2 | 92.7 | 62.8 | 90.7 | 91.2 | 76.5 | 85.1 | 109 | 22.5 | 1690 | 1.0x
MobileBERT (Sun et al., 2020) | - | 80.8 | - | - | 88.2 | 90.1 | - | - | 84.3 | - | - | 25.3 | 5.36 | 572 | 3.0x
ALBERT-base (Lan et al., 2019) | - | 81.6 | - | - | - | 90.3 | - | - | - | - | - | 12.0 | 22.5 | 1690 | 1.0x
SqueezeBERT* | - | 82.3 | 82.9 | 89.4 | 90.5 | 92.0 | 53.7 | 89.4 | 89.8 | 71.8 | 82.4 | 51.1 | 7.42 | 390 | 4.3x
Results with bells and whistles | | | | | | | | | | | | | | |
DistilBERT 6/768 (Sanh et al., 2019) | D | 82.2 | - | 88.5 | 89.2 | 91.3 | 51.3 | 86.9 | 87.5 | 59.9 | - | 66 | 11.3 | 814 | 2.1x
Turc 6/768 (Turc et al., 2019) | D | 82.5 | 83.4 | 89.6 | 89.4 | 91.1 | - | - | 87.2 | 66.7 | - | 67.5 | 11.3 | 814 | 2.1x
Theseus 6/768 (Xu et al., 2020) | DESW | 82.3 | - | 89.6 | 89.5 | 91.5 | 51.1 | 88.7 | 89.0 | 68.2 | - | 66 | 11.3 | 814 | 2.1x
MobileBERT (Sun et al., 2020) | DEW | 84.4 | - | - | 91.5 | 92.5 | - | - | 87.0 | - | - | 25.3 | 5.36 | 572 | 3.0x
SqueezeBERT* | DS | 82.5 | 82.9 | 89.5 | 90.9 | 92.2 | 53.7 | 90.3 | 92.0 | 80.9 | 84.0 | 51.1 | 7.42 | 390 | 4.3x

Table 3: Comparison of neural networks on the test set of the GLUE benchmark. * denotes models trained by the authors of the present paper. Bells and whistles are: A = adversarial training; D = distillation of final layer; E = distillation of encoder layers; S = transfer learning across GLUE tasks (a.k.a. STILTs (Phang et al., 2018)); W = per-layer warmup.

(MNLI-m through GLUE score are GLUE accuracy; #MParams through Speedup are efficiency metrics.)

Model | Bells & Whistles | MNLI-m | MNLI-mm | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | WNLI | GLUE score | #MParams | GFLOPs | Latency (ms) | Speedup
Results without bells and whistles | | | | | | | | | | | | | | | |
BERT-base* | - | 84.4 | 84.2 | 80.5 | 91.4 | 92.8 | 51.3 | 86.9 | 87.9 | 70.7 | 65.1 | 79.0 | 109 | 22.5 | 1690 | 1.0x
BERT-base (Devlin et al., 2019) | - | 84.6 | 83.4 | 80.2 | 90.5 | 93.5 | 52.1 | 86.5 | 86.9 | 66.4 | 65.1 | 78.3 | 109 | 22.5 | 1690 | 1.0x
SqueezeBERT* | - | 82.0 | 81.1 | 80.1 | 90.1 | 91.0 | 46.5 | 84.9 | 86.1 | 66.7 | 65.1 | 76.9 | 51.1 | 7.42 | 390 | 4.3x
Results with bells and whistles | | | | | | | | | | | | | | | |
TinyBERT 4/312 (Jiao et al., 2019) | DE | 82.5 | 81.8 | - | 87.7 | 92.6 | 43.3 | 79.9 | - | 62.9 | 65.1 | - | 14.5 | 1.2 | 118 | 14x
ELECTRA-Small++ (Clark et al., 2020) | AS | 81.6 | - | - | 88.3 | 91.1 | 55.6 | 84.6 | 84.9 | 63.6 | 65.1 | - | 14.0 | 2.62 | 248 | 6.8x
PKD 6/768 (Sun et al., 2019a) | DE | 81.5 | 81.0 | 79.8 | 89.0 | 92.0 | - | - | 82.5 | - | 65.1 | - | 67.0 | 11.3 | 814 | 2.1x
Turc 6/768 (Turc et al., 2019) | D | 82.8 | 82.2 | 79.7 | 89.4 | 91.8 | - | - | 84.3 | 65.3 | 65.1 | - | 67.5 | 11.3 | 814 | 2.1x
Theseus 6/768 (Xu et al., 2020) | DESW | 82.4 | 82.1 | 80.5 | 89.6 | 92.2 | 47.8 | 84.9 | 85.4 | 66.2 | 65.1 | 77.1 | 66 | 11.3 | 814 | 2.1x
MobileBERT (Sun et al., 2020) | DEW | 84.3 | 83.4 | 79.4 | 91.6 | 92.6 | 51.1 | 85.5 | 86.7 | 70.4 | 65.1 | 78.5 | 25.3 | 5.36 | 572 | 3.0x
SqueezeBERT* | DS | 82.0 | 81.1 | 80.3 | 90.1 | 91.4 | 46.5 | 86.7 | 87.8 | 73.2 | 65.1 | 78.1 | 51.1 | 7.42 | 390 | 4.3x

Note that relatively few of the efficiency-optimized networks report results without bells and whistles, and most such results are reported on the development (not test) set of GLUE. Fortunately, the authors of MobileBERT (a network which, as we will see in the next section, compares favorably to other efficient networks when bells and whistles are enabled) do provide development-set results without distillation on 4 of the GLUE tasks.[7] We observe in the upper portion of Table 2 that, when both networks are trained without distillation, SqueezeBERT achieves higher accuracy than MobileBERT on all of these tasks. This provides initial evidence that the techniques from computer vision that we have adopted can be applied to NLP, and reasonable accuracy can be obtained. Further, we observe that SqueezeBERT is 4.3x faster than BERT-base, while MobileBERT is 3.0x faster than BERT-base.[8]

[7] Note that some papers report results on only the development set or the test set, and some papers only report results on a subset of GLUE tasks. Our aim with this evaluation is to be as inclusive as possible, so we include papers with incomplete GLUE results in our results tables.
[8] In our measurements, we find MobileBERT takes 572ms to classify one length-128 sequence on a Pixel 3 phone. This is slightly faster than the 620ms reported by the MobileBERT authors in the same setting (Sun et al., 2019b). We use the faster number in our comparisons. Further, all latencies in our results tables were benchmarked by us.

Due to the dearth of efficient neural network results on GLUE without bells and whistles, we also provide a comparison in Table 2 with the ALBERT-base network. ALBERT-base is a version of BERT-base that uses the same weights across multiple attention layers, and it has a smaller encoder than BERT. Due to these design choices, ALBERT-base has 9x fewer parameters than BERT-base. However, ALBERT-base and BERT-base have the same number of FLOPs, and we observe in our measurements in Table 2 that ALBERT-base does not offer a speedup over BERT-base on a smartphone.[9] Further, on the two GLUE tasks where the ALBERT authors reported the accuracy of ALBERT-base, MobileBERT and SqueezeBERT both outperform the accuracy of ALBERT-base.

[9] However, reducing the number of parameters while retaining a high number of FLOPs can present other advantages, such as faster distributed training (Lan et al., 2019; Iandola et al., 2016a) and superior energy-efficiency (Iandola and Keutzer, 2017).

5.2 Results with bells and whistles

Now, we turn our attention to comparing SqueezeBERT to other models, all trained with bells-and-whistles. Note that the bells-and-whistles come at the cost of extra training time, but they do not change the inference time or model size. In the lower portion of Table 3, we first observe that, when trained with bells-and-whistles, MobileBERT matches or outperforms the accuracy of the other efficient models (except SqueezeBERT) on 8 of the 9 GLUE tasks. Further, on 4 of the 9 tasks SqueezeBERT outperforms the accuracy of MobileBERT; on 4 of 9 tasks MobileBERT outperforms SqueezeBERT; and on 1 task (WNLI) all models predict the most frequently occurring category.[10] Also, SqueezeBERT achieves an average score across all GLUE tasks that is within 0.4 percentage-points of MobileBERT. Given the speedup of SqueezeBERT over MobileBERT, we think it is reasonable to say that SqueezeBERT and MobileBERT each offer a compelling speed-accuracy tradeoff for NLP inference on mobile devices.

[10] Note that data augmentation approaches have been proposed to improve accuracy on WNLI; see (Kocijan et al., 2019). For fairness in comparing against our baselines, we choose not to use data augmentation to improve WNLI results.

6 Related Work

Quantization and Pruning. Quantization is a family of techniques which aims to reduce the number of bits required to store each parameter and/or activation in a neural network, while at the same time maintaining the accuracy of that network. This has been successfully applied to NLP in such works as (Shen et al., 2020; Zafrir et al., 2019). Pruning aims to directly eliminate certain parameters from the network while maintaining accuracy, thereby reducing the storage and potentially the computational cost of that network; for an application of this to NLP, please see Sanh et al. (2020). These methods could be applied to SqueezeBERT to yield further efficiency improvements, but quantization and pruning are not a focus of this paper.

Addressing long sequence-lengths. In work such as SqueezeBERT and MobileBERT, the inference FLOPs and latency are evaluated using a sequence length of 128. This is a reasonable sequence length for use-cases such as classifying text messages, instant messages, and short emails. However, if the goal is to classify longer-form texts such as book chapters or even an entire book, then the typical sequence length is much longer. While the position-wise fully-connected (PFC) layers in BERT scale linearly in the sequence length, the self-attention calculations scale quadratically in the sequence length. So, when classifying a long sequence, the self-attention calculations are the dominant factor in the FLOPs and latency of the neural network.
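A back-of-envelope count makes this scaling concrete. The sketch below uses rough per-block operation counts for a BERT-base-shaped encoder (our own approximation, counting two FLOPs per multiply-accumulate): the PFC/FFN terms grow linearly with sequence length, while the attention terms grow quadratically and dominate for long inputs.

```python
def rough_encoder_flops(seq_len, channels=768, blocks=12):
    # per block: Q, K, V, and FFN1 are CxC; FFN2 is Cx4C; FFN3 is 4CxC -> ~12*W*C^2 MACs
    pfc = 12 * seq_len * channels ** 2
    # attention: QK^T and its product with V -> ~2*W^2*C MACs
    attn = 2 * seq_len ** 2 * channels
    return 2 * blocks * pfc, 2 * blocks * attn          # 2 FLOPs per MAC

for w in (128, 512, 4096):
    pfc, attn = rough_encoder_flops(w)
    print(f"W={w:5d}  PFC: {pfc / 1e9:7.1f} GFLOPs   attention: {attn / 1e9:7.1f} GFLOPs")
```

At W = 128 this rough count lands near the 22.5 GFLOPs and roughly 2.7% attention share reported in Tables 1 and 2, while at several thousand tokens the attention term overtakes the PFC term.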

Several recent projects have worked to address this problem. For instance, Funnel Transformer downsamples the sequence length in the first few layers of the network, and it upsamples the sequence length in the final few layers of the network (Dai et al., 2020). This approach is similar to computer vision models for semantic segmentation such as U-Net (Ronneberger et al., 2015). In addition, Longformer reduces the number of FLOPs by introducing structured sparsity into the self-attention tensors (Beltagy et al., 2020). Further, Linformer projects long sequences into shorter fixed-length sequences (Wang et al., 2020b). Finally, Tay et al. (2020) provide an extensive survey of approaches for redesigning self-attention networks to efficiently classify long sequences.

Self-attention networks with dynamic computational cost. DeeBERT (Xin et al., 2020), FastBERT (Liu et al., 2020), and Schwartz et al. (2020) each describe a method to dynamically adjust the amount of computation for different sequences. The intuition is that some sequences are easier to classify than others, and the "easy" sequences can be correctly classified by only computing the first few layers of a BERT-like network.

Convolutions in self-attention networks for language-generation tasks. In this paper, our experiments focus on natural language understanding (NLU) tasks such as sentence classification. However, another widely-studied area is natural language generation (NLG), which includes the tasks of machine translation (e.g., English-to-German) and language modeling (e.g., automated sentence-completion). While we are not aware of work that adopts convolutions in self-attention networks for NLU, we are aware of such work in NLG. For instance, the Evolved Transformer and Lite Transformer architectures contain self-attention modules and convolutions in separate portions of the network (So et al., 2019; Wu et al., 2020). Additionally, LightConv shows that well-designed convolutional networks without self-attention produce comparable results to self-attention networks on certain NLG tasks (Wu et al., 2019b). Also, Wang et al. sparsify the self-attention matrix multiplication using a pattern of nonzeros that is inspired by dilated convolutions (Wang et al., 2020a). Finally, while not an attention network, Kim applied convolutional networks to NLU several years before the development of multi-head self-attention (Kim, 2014).

7 Conclusions & Future Work

In this paper, we have studied how grouped convolutions, a popular technique in the design of efficient computer vision neural networks, can be applied to natural language processing. First, we showed that the position-wise fully-connected layers of self-attention networks can be implemented with mathematically-equivalent 1D convolutions. Further, we proposed SqueezeBERT, an efficient NLP model which implements most of the layers of its self-attention encoder with 1D grouped convolutions. This model yields an appreciable >4x latency decrease over BERT-base when benchmarked on a Pixel 3 phone. We also successfully applied distillation to improve our approach's accuracy to a level that is competitive with a distillation-trained MobileBERT and with the original version of BERT-base.

We now discuss some possibilities for future work in the design of computationally-efficient neural networks for NLP. As we observed in Section 6, in recent months numerous approaches have been proposed for reducing the computational cost of self-attention neural architectures for natural language processing. These approaches include new model structures (e.g. MobileBERT), rethinking the dimensions of attention calculations (e.g. Linformer), grouped convolutions (SqueezeBERT), and much more. Further, once the neural architecture has been selected, approaches such as quantization and pruning can further reduce some of the costs associated with self-attention neural network inference. The combination of all of these potential techniques opens up a broad search-space of neural architecture designs for NLP. This motivates the application of automated neural architecture search (NAS) approaches such as those described in (Shaw et al., 2019; Wu et al., 2019a) to further improve the design of neural networks for NLP.

Acknowledgements

K. Keutzer's research is supported by Alibaba, Amazon, Google, Facebook, Intel, and Samsung. We would like to thank the EMNLP SustaiNLP Workshop's reviewers for their helpful comments on this paper.

References

Abrar Al-Heeti. 2018. WhatsApp: 65B messages sent each day, and more than 2B minutes of calls. CNET.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.
Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Text Analysis Conference (TAC).
Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity, multilingual and cross-lingual focused evaluation. In Eleventh International Workshop on Semantic Evaluations.
Zihan Chen, Hongbo Zhang, Xiaoji Zhang, and Leqi Zhao. 2018. Quora question pairs.
Francois Chollet. 2016. Xception: Deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/1610.02357.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations (ICLR).
Zihang Dai, Guokun Lai, Yiming Yang, and Quoc V. Le. 2020. Funnel-Transformer: Filtering out sequential redundancy for efficient language processing. arXiv:2006.03236.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing.
Gordon Donnelly. 2018. 75 super-useful Facebook statistics for 2018. https://www.wordstream.com/blog/ws/2017/11/07/facebook-statistics.
Kunihiko Fukushima. 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv:1503.02531.
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861.
Forrest Iandola and Kurt Keutzer. 2017. Small neural nets are beautiful: Enabling embedded systems with small deep-neural-network architectures. In ESWEEK Keynote.
Forrest N. Iandola, Khalid Ashraf, Matthew W. Moskewicz, and Kurt Keutzer. 2016a. FireCaffe: Near-linear acceleration of deep neural network training on compute clusters. In CVPR.
Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016b. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv:1909.10351.
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv:1909.05858.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for the Winograd schema challenge. In ACL.
Alex Krizhevsky et al. 2011. cuda-convnet. https://code.google.com/archive/p/cuda-convnet.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NeurIPS.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR.
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation.
Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Proceedings of the Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. 2020. FastBERT: A self-distilling BERT with adaptive inference time. arXiv:2004.02178.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In ICLR.
Lovely Mobile News. 2017. Mobile has largely displaced other channels for email.
NVIDIA. 2020a. APEX, a PyTorch extension: Tools for easy mixed precision and distributed training in PyTorch. https://github.com/NVIDIA/apex.
NVIDIA. 2020b. Deep learning examples for tensor cores. https://github.com/NVIDIA/DeepLearningExamples.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS.
Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv:1811.01088.
Quora. 2017. Quora question pairs.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv:1910.01108.
Victor Sanh, Thomas Wolf, and Alexander M. Rush. 2020. Movement pruning: Adaptive sparsity by fine-tuning. arXiv:2005.07683.
David Sayce. 2019. The number of tweets per day in 2019. https://www.dsayce.com/social-media/tweets-day/.
Jeff Schultz. 2019. How much data is created on the internet each day?
Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A. Smith. 2020. The right tool for the job: Matching model and instance complexities. arXiv:2004.07453.
Albert Shaw, Daniel Hunter, Forrest Iandola, and Sammy Sidhu. 2019. SqueezeNAS: Fast neural architecture search for faster semantic segmentation. In ICCV Neural Architects Workshop.
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI.
David R. So, Chen Liang, and Quoc V. Le. 2019. The evolved transformer. In ICLR.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019a. Patient knowledge distillation for BERT model compression. In Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2019b. MobileBERT: Task-agnostic compression of BERT by progressive knowledge transfer. OpenReview submission.
Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: A compact task-agnostic BERT for resource-limited devices. In Annual Meeting of the Association for Computational Linguistics (ACL). arXiv:2004.02984.
Mingxing Tan and Quoc V. Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML).
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. arXiv:2009.06732.
Templatify. 2017. How many emails are sent every day? Top email statistics for business.
Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv:1908.08962.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS).
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv:1804.07461.
Chenguang Wang, Zihao Ye, Aston Zhang, Zheng Zhang, and Alexander J. Smola. 2020a. Transformer on a diet. arXiv:2002.06170.
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020b. Linformer: Self-attention with linear complexity. arXiv:2006.04768.
Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv:1910.03771.
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. 2019a. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, and Michael Auli. 2019b. Pay less attention with lightweight and dynamic convolutions. In ICLR.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144.
Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2020. Lite transformer with long short term attention. In ICLR.
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. 2020. DeeBERT: Dynamic early exiting for accelerating BERT inference. In ACL.
Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. 2020. BERT-of-Theseus: Compressing BERT by progressive module replacing. arXiv:2002.02925.
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large batch optimization for deep learning: Training BERT in 76 minutes. In ICLR.
Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8bit BERT. arXiv:1910.06188.
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In IEEE International Conference on Computer Vision (ICCV).