arXiv:2105.03322v1 [cs.CL] 7 May 2021
Are Pre-trained Convolutions Better than Pre-trained Transformers?

Yi Tay (Google Research, Mountain View, California)
Mostafa Dehghani (Google Research, Brain Team, Amsterdam, Netherlands)
Jai Gupta (Google Research, Mountain View, California)
Vamsi Aribandi∗ (Google Research, Mountain View, California)
Dara Bahri (Google Research, Mountain View, California)
Zhen Qin (Google Research, Mountain View, California)
Donald Metzler (Google Research, Mountain View, California)

∗Google AI Resident

Abstract

In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive with Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy amount of optimism in alternative architectures.

1 Introduction

In the modern era of pre-training, there appears to be an unbreakable tie between Transformer architectures (Vaswani et al., 2017) and pre-trained language models. Models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2019) have all adopted Transformers as their underlying architecture. As a matter of fact, there are barely any recent pre-trained models not based on Transformers.

While contextual representation learning has a rich history (Pennington et al., 2014; Dai and Le, 2015; Chidambaram et al., 2018; Liu et al., 2020; Qiu et al., 2020), modern pre-trained language modeling started with models like ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017), which are based on recurrent (e.g. LSTM (Hochreiter and Schmidhuber, 1997)) architectures. Although they were successful, research using these architectures dwindled as Transformers stole the hearts of the NLP community, having, possibly implicitly, been perceived as an unequivocal advancement over their predecessors.

Recent work demonstrates the promise of entirely convolution-based models (Wu et al., 2019; Gehring et al., 2017) and questions the necessity of self-attentive architectures like Transformers. For example, in (Wu et al., 2019), the proposed convolutional seq2seq models outperform Transformers on a series of canonical benchmarks such as machine translation and language modeling. From these findings emerges a rather natural line of questioning: should we consider pre-trained models beyond Transformers?

Despite early success, the relevance of convolutional models in the era of pre-trained language models remains an open question. To the best of our knowledge, convolutional architectures have not yet been rigorously evaluated under the pre-train-fine-tune paradigm. This is the primary purpose of this work. Concretely, this paper seeks to empirically validate whether pre-trained convolutions are competitive with pre-trained Transformers across a range of tasks.

The interaction between pre-training schemes and model architectures is an under-studied topic. Are only Transformers able to capitalize on the benefits of pre-training? If we use a different architectural inductive bias, would there also be a substantial gain unlocked by pre-training? Are pre-trained convolutions better in particular scenarios? This paper investigates these questions.

There are a number of obvious benefits of convolution-based models. Firstly, convolutions do not suffer from the quadratic memory complexity of self-attention, a problem significant enough that it spawned the creation of the entirely new category of "efficient" Transformer architectures (Tay et al., 2020b, 2021). Secondly, convolutions operate locally and do not rely on positional encodings as an order signal to the model. That said, convolutions also come with a slew of downsides. For example, being unable to access global information means such models are unable to perform a form of cross-attention across multiple sequences. We dive into the details of this more in subsequent sections.

In this paper, we present a pre-trained convolutional sequence-to-sequence, or Seq2Seq, model. We train our convolutional model using span-based sequence-to-sequence denoising objectives similar to those employed in T5 (Raffel et al., 2019). We evaluate a variety of convolutional variants (e.g., dilated, lightweight, dynamic (Wu et al., 2019), etc.) under both raw (no pre-training) and pre-train-fine-tune paradigms. Our goal is to understand the true competitiveness of convolutional architectures in the era of pre-training.

We show that pre-trained convolutions are competitive against pre-trained Transformers via a set of experiments on a potpourri of NLP tasks, like toxicity detection, sentiment classification, news classification, query understanding and semantic parsing/compositional generalization (Kim and Linzen, 2020). Moreover, we find that pre-trained convolutions can outperform, in terms of model quality and training speed, state-of-the-art pre-trained Transformers (Raffel et al., 2019) in certain scenarios. However, to provide a balanced perspective, we also describe scenarios where pre-trained convolutions do not perform well and may be deemed unsuitable.

Contributions. Overall, the main contributions of this paper can be summarized as follows:

• We perform a comprehensive empirical evaluation of convolutional Seq2Seq models under the pre-train-fine-tune paradigm. To the best of our knowledge, the competitiveness and relevance of pre-trained convolutions still remain an open question.

• We make several important observations. Specifically, we find that (1) pre-training helps convolutional models just as much as it helps Transformers, and (2) pre-trained convolutions are competitive alternatives in certain scenarios in terms of model quality and training speed.

• We conduct extensive experiments across 8 datasets spanning a diverse range of tasks and domains. On 7 out of 8 tasks, we find that pre-trained convolutions outperform a recent state-of-the-art Transformer (T5 (Raffel et al., 2019)) with and without pre-training. We examine the speed and operation count (FLOPS) of convolutions versus Transformers and find that convolutions are not only faster but also scale better to longer sequence lengths.

2 Related Work

Pre-training on a large corpus has become the primary method of learning universal language representations to solve different downstream NLP tasks. The first generation of pre-trained models aimed at learning embeddings for words, like Skip-Gram (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), and quickly developed to learning contextualized representations for words, like ELMo (Peters et al., 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2018). This, however, is not the only axis in which pre-trained models have evolved.

Different objective functions and various tasks, both supervised and unsupervised, have been explored for pre-training. For instance, CoVe (McCann et al., 2017) uses machine translation as the pre-training task, ELMo (Peters et al., 2018) and GPT (Radford et al., 2018) use language modeling objectives, BERT (Devlin et al., 2018) uses masked language modeling, T5 (Raffel et al., 2019) and MASS (Song et al., 2019) use Seq2Seq masked language modeling, and XLNet (Yang et al., 2019) utilizes permuted language modeling. In addition to this, BART (Lewis et al., 2019) uses a denoising autoencoder setup during pre-training, where the model takes a partially corrupted input and is trained to recover the original, undistorted input. Some models use a contrastive learning setup during pre-training, like replaced token detection, used by ELECTRA (Clark et al., 2020), and sentence order prediction, used by ALBERT (Lan et al., 2019) and StructBERT (Wang et al., 2019).

Another axis where pre-trained models in NLP explored different ideas is model architecture. ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017) used LSTMs as the base model. Later, Transformers (Vaswani et al., 2017) became the de facto architecture of pre-trained NLP models. BERT (Devlin et al., 2018), XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019) use the Transformer encoder, while GPT (Radford et al., 2018), GPT-2 (Radford et al.), and GPT-3 (Brown et al., 2020) use the Transformer decoder as the backbone. Some pre-trained models are also based on the encoder-decoder Transformer architecture, like T5 (Raffel et al., 2019), MASS (Song et al., 2019), and BART (Lewis et al., 2019). In this paper, we investigate another model architecture variation by studying the power of convolutional neural networks as the backbone of pre-trained models for NLP.

3.1 Lightweight Depthwise Convolution

This section introduces Lightweight Depthwise Convolutions (Wu et al., 2019), which form the backbone of our pre-trained convolution model.

3.1.1 Depthwise convolutions

Depthwise convolutions convolve independently over every channel. Given an input tensor $X$ of dimensions $n \times d$, the depthwise convolution $D(X, W_{c,:}, i, c)$ is defined as:

$$O_{i,c} = \sum_{j=1}^{k} W_{c,j} \cdot X_{i+j-\lceil \frac{k+1}{2} \rceil,\, c} \qquad (1)$$

where $W \in \mathbb{R}^{d \times k}$ are the learnable parameters of the layer and $k$ is the kernel width. $O_{i,c}$ is the output at position $i$ and channel $c$. The overall output is a tensor of dimensions $n \times d$, identical in shape to the input.

3.1.2 Lightweight Convolutions

$L(\cdot)$ are depthwise separable convolutions with (1)