
Efficient Contextual Representation Learning With Continuous Outputs

Liunian Harold Li†, Patrick H. Chen∗, Cho-Jui Hsieh∗, Kai-Wei Chang∗
†Peking University  ∗University of California, Los Angeles
[email protected], [email protected], {chohsieh, kwchang}@cs.ucla.edu

Abstract

Contextual representation models have achieved great success in improving various downstream natural language processing tasks. However, these language-model-based encoders are difficult to train due to their large parameter size and high computational complexity. By carefully examining the training procedure, we observe that the softmax layer, which predicts a distribution of the target word, often induces significant overhead, especially when the vocabulary size is large. Therefore, we revisit the design of the output layer and consider directly predicting the pre-trained embedding of the target word for a given context. When applied to ELMo, the proposed approach achieves a 4 times speedup and eliminates 80% of the trainable parameters while achieving competitive performance on downstream tasks. Further analysis shows that the approach maintains the speed advantage under various settings, even when the sentence encoder is scaled up.

1 Introduction

In recent years, text representation learning approaches, such as ELMo (Peters et al., 2018a), GPT (Radford et al., 2018), BERT (Devlin et al., 2019), and GPT-2 (Radford et al., 2019), have been developed to represent generic contextual information in natural languages by training an encoder with a language model objective on a large unlabelled corpus. During the training process, the encoder is given part of the text and asked to predict the missing pieces. Prior studies show that the encoders trained in this way can capture generic contextual information of the input text and improve a variety of downstream tasks significantly.

However, training contextual representations is known to be a resource-hungry process. For example, ELMo is reported to take about two weeks to train on a one-billion-token corpus with a vocabulary of 800,000 words using three GPUs.1 This slow training procedure hinders the development cycle, prevents fine-grained parameter tuning, and makes training contextual representations inaccessible to the broader community. Recent work also raises concerns about the environmental implications of training such large models (Strubell et al., 2019). In addition, the success of these models stems from the large amount of data they are trained on. It is challenging, if not impossible, to train a contextual representation model on a larger corpus with tens or hundreds of billions of tokens.

1 https://github.com/allenai/bilm-tf/issues/55

In this work, we explore how to accelerate contextual representation learning. We identify the softmax layer as the primary cause of inefficiency. This component takes up a considerable portion of all trainable parameters (80% for ELMo) and consumes a huge amount of training time. However, it is often not needed in the final model, as the goal of contextual representation learning is to build a generic encoder. Therefore, it is rather a waste to allocate extensive computational resources to the softmax layer.

Inspired by Kumar and Tsvetkov (2019), we consider learning contextual representation models with continuous outputs. In the training process, the contextual encoder is learned by minimizing the distance between its output and a pre-trained target word embedding. The constant time complexity and small memory footprint of the output layer perfectly serve our desire to decouple learning contexts and words and devote most computational resources to the contextual encoder. In addition, we combine the approach with open-vocabulary word embeddings such that the model can be trained without the need to pre-define a closed word set as the vocabulary. We also provide an alternative interpretation of learning contextual encoders with continuous outputs that sheds light on how the pre-trained embedding could affect the performance of the model.

We conduct a comprehensive empirical study to analyze the proposed approach and several existing methods originally proposed to reduce the complexity of the output layer in language models, such as the adaptive softmax and the sub-word methods. We incorporate these approaches into ELMo and compare them in terms of training speed and performance on five downstream tasks. We demonstrate that the proposed approach effectively reduces the training time and trainable parameters while maintaining competitive performance compared with the baselines. Our approach also exhibits a consistent computational advantage under different conditions (e.g., with different vocabulary sizes, different sentence encoders, and different numbers of GPUs).

Source code is available at https://github.com/uclanlp/ELMO-C.

2 Background and Related Work

Contextual Representation We review contextual representation models from two aspects: how they are trained and how they are used in downstream tasks.

CoVe (McCann et al., 2017) uses the source language encoder from a machine translation model as a contextual representation model. Peters et al. (2018a) advocate for the use of larger unlabelled corpora and propose ELMo, a forward and a backward LSTM-based (Hochreiter and Schmidhuber, 1997) language model, while GPT (Radford et al., 2018) and GPT-2 (Radford et al., 2019) build language models with the Transformer (Vaswani et al., 2017). BERT (Devlin et al., 2019) introduces the masked language model and provides deep bidirectional representations.

There are two existing strategies for applying pre-trained contextual representations to downstream tasks: 1) feature-based and 2) fine-tuning. In the feature-based approach, fixed features are extracted from the contextual encoder (e.g., ELMo, CoVe) and inserted as an input into a task-specific model. In the fine-tuning approach, the contextual encoder is designed as a part of the network architecture for downstream tasks, and its parameters are fine-tuned with the downstream task. BERT is designed for the fine-tuning approach but is also evaluated with the feature-based approach. GPT-2 is a scaled-up version of GPT and exhibits strong performance under zero-shot settings.
Speeding Up Language Models Training Considerable efforts have been devoted to accelerating the training process of language models. One line of research focuses on developing faster sequence encoder architectures such as CNNs (Kim et al., 2016; Dauphin et al., 2017), QRNN (Bradbury et al., 2016), SRU (Lei et al., 2018), and the Transformer (Vaswani et al., 2017). These architectures have been extensively used for learning language representations (Radford et al., 2018; Devlin et al., 2019; Tang et al., 2018). Another line of work focuses on the large-vocabulary issue, as a large and ever-growing vocabulary results in an intractable softmax layer. Our work falls into the second line, and we review existing solutions in detail.

Several studies for language modeling focus on directly reducing the complexity of the softmax layer. Following Kumar and Tsvetkov (2019), we group them into two categories: sampling-based approximations and structural approximations. Sampling-based approximations include the sampled softmax (Bengio et al., 2003) and NCE (Mnih and Teh, 2012). The sampled softmax approximates the normalization term of the softmax by sampling a subset of negative targets, while NCE replaces the softmax with a binary classifier. On the other hand, structural approximations such as the hierarchical softmax (Morin and Bengio, 2005) and the adaptive softmax (Grave et al., 2016) form a structural hierarchy to avoid expensive normalization. The adaptive softmax, in particular, groups words in the vocabulary into either a short-list or clusters of rare words. For frequent words, a softmax over the short-list suffices, which reduces computation and memory usage significantly. The adaptive softmax has been shown to achieve results close to those of the full softmax while maintaining high GPU efficiency (Merity et al., 2018).

Regarding contextual representation models, ELMo used the sampled softmax, while GPT and BERT resorted to a subword method. Specifically, they used WordPiece (Wu et al., 2016) or BPE (Sennrich et al., 2016) to split words into subwords, and the language models were trained to take subwords as input and also predict subwords. This method is efficient and scalable, as the subword vocabulary can be kept small. One potential drawback of these subword-level language models, however, is that they produce representations for fragments of words. Therefore, it takes extra effort to generate word-level representations (see the discussion in Section 4.2).

The high cost of the softmax layer has also been noted in the sentence representation learning literature. Following the success of word2vec (Mikolov et al., 2013), methods such as SkipThought (Kiros et al., 2015) were developed to learn distributed sentence representations by predicting the context sentences of a given sentence, which involves sequentially decoding words of the target sentences. Jernite et al. (2017) and Logeswaran and Lee (2018) notice the inefficiency of the softmax layer during decoding and propose to use discriminative instead of generative objectives, eliminating the need for decoding. However, these approaches are not directly applicable to contextual representation learning.

3 Approach

A contextual representation model, at its core, is a language model pre-trained on a large unlabelled corpus. In the following, we review the objective of language models and the architectures of existing contextual representation models. We then introduce the proposed model.

Language Model Objective Given a set of text sequences as the training corpus, we can construct a collection of word-context pairs (w, c), and the goal of a language model is to predict the word w based on the context c. In a forward language model, the context c is defined as the previous words in the sequence, while for a backward language model, the context of a word is defined as the following words. For a masked language model, some words in the input sentence are masked (e.g., replaced by a [MASK] token) and the objective is to predict the masked words from the remainder. Different contextual representation models optimize different objectives. For example, ELMo trains a forward and a backward language model, while BERT trains a masked language model.

Model Architecture A typical neural language model consists of three parts: 1) an input layer, 2) a sequence encoder, and 3) a softmax layer. Given a word-context pair (w, c), the input layer uses a word embedding or a character-CNN model (Kim et al., 2016) to convert the input words in c into word vectors. Then the sequence encoder embeds the context into a context vector c ∈ R^m using a multi-layer LSTM (Hochreiter and Schmidhuber, 1997), a Gated CNN (Dauphin et al., 2017), or a Transformer (Vaswani et al., 2017). The softmax layer then multiplies the context vector c with an output word embedding2 W ∈ R^{V×m} and uses a softmax function to produce a conditional distribution p(w|c) over the vocabulary of size V.

2 The dimension of the original output from the sequence encoder may not match the dimension of the output word embedding. In that case, a projection layer is added after the original sequence encoder to ensure the two dimensions match.

In a language model, the learning objective l(w, c) for a pair (w, c) is then expressed as:

l(w, c) = − log p(w|c)
        = − log softmax(c W^T)
        = − c · w + log Σ_{w′} exp(c · w′),    (1)

where w ∈ R^m is the row of W corresponding to the target word w and the second term sums over the vocabulary.
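To make the cost of Eq. (1) concrete, below is a minimal PyTorch sketch of a full-softmax output layer (not the paper's code); the dimensions are illustrative placeholders rather than ELMo's actual configuration.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only: m is the context dimension, V the vocabulary size.
m, V, batch = 512, 800_000, 128

W = torch.nn.Parameter(torch.randn(V, m) * 0.01)  # trainable output word embedding
c = torch.randn(batch, m)                         # context vectors from the encoder
targets = torch.randint(0, V, (batch,))           # indices of the target words

logits = c @ W.t()                       # (batch, V): work and memory scale with V
loss = F.cross_entropy(logits, targets)  # = -c·w + log sum_{w'} exp(c·w')
loss.backward()                          # also produces a (V, m) gradient for W
```

Approximations such as the sampled or adaptive softmax shrink the (batch, V) term, but the output embedding W itself remains trainable.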
After the model is trained, the contextual representations are generated from the latent states of the sequence encoder. For example, ELMo combines the hidden states of the LSTMs to generate a contextualized word embedding for each word in a sentence. We refer the reader to Peters et al. (2018a) for details.

Note that the size of W and the computational complexity of the second term in Eq. (1) scale linearly with the vocabulary size V. Therefore, when V is large, the softmax layer becomes the speed bottleneck.

Our Approach The scaling issue of the softmax also occurs in other language generation and sequence-to-sequence models. In the literature, several approaches have been proposed to approximate the softmax layer or bypass it with a subword method (see Section 2). Recently, Kumar and Tsvetkov (2019) propose to treat the context vector as a continuous output and directly minimize the distance between the context vector and the pre-trained word embedding associated with the target word,

l(w, c) = d(c, w).    (2)

The distance function d could be the L2 distance ||c − w||_2, the cosine distance c · w / (||c|| ||w||), or a probabilistic distance metric.
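As a contrast to the softmax sketch above, the following is a minimal sketch of the continuous-output loss in Eq. (2) with the cosine distance, assuming a pre-trained output embedding that is kept frozen; the random matrix is only a stand-in for, e.g., a FastText table.

```python
import torch
import torch.nn.functional as F

m, V, batch = 512, 800_000, 128

pretrained = torch.randn(V, m)        # stand-in for a pre-trained embedding table
pretrained.requires_grad_(False)      # fixed: contributes no trainable parameters

c = torch.randn(batch, m, requires_grad=True)  # context vectors from the encoder
targets = torch.randint(0, V, (batch,))

w = pretrained[targets]               # only the target rows are ever touched
loss = (1.0 - F.cosine_similarity(c, w, dim=-1)).mean()  # d(c, w) as in Eq. (2)
loss.backward()                       # cost is O(batch · m), independent of V
```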
We also provide an encoder from tens or hundreds of millions to zero. alternative way to interpret training contextual en- 3.1.2 Overall Training Time coders with continuous outputs. We now discuss how the efficiency improvements 3.1 Computational Efficiency to the output layer contribute to the reduction of the overall training time, in the context of syn- The continuous output layer has a reduced arith- chronous stochastic training on metic complexity and trainable parameter size. We multiple GPUs. In general, the following three illustrate these improvements and how they con- factors determine the training time. tribute to reducing the overall training time of a contextual representation model in the following. Arithmetic Complexity The arithmetic com- For comparison, we include the sampled softmax, plexity of a model includes the complexity of the the adaptive softmax, and the subword method in forward and backward propagation on the input the discussion. layer, the sequence encoder, and the output layer. It also includes the overhead of the optimization 3.1.1 Learning with Continue Outputs algorithm such as gradient clipping and model up- Arithmetic Complexity The arithmetic com- dates. The complexity of this optimization over- plexity (i.e., FLOPs) of evaluating loss with con- head is often proportional to the number of param- tinue outputs (i.e., Eq.2) takes O(m), as we eters that need updating. With the continuous out- only need to calculate the distance between two put layer, not only the arithmetic complexity but m-dimensional vectors. The complexity of the also the optimization overhead are reduced. sampled softmax is proportional to the number of negative samples per batch. When the vocabulary GPU Memory Consumption The training time is huge, a large number of negative samples are is also affected by GPU memory consumption, needed (Jozefowicz et al., 2016). For the adaptive as less GPU memory consumption leads to larger softmax, the time complexity is determined by the batch size. For the same amount of data and hard- capacities of the short-list and the rare-word clus- ware resource, larger batch size means better par- ters, which grows sub-linearly to the vocabulary allelism and less training time. Our approach ex- size. The complexity of the subword method is de- hibits small GPU memory footprints, due to re- termined by the subword vocabulary size. In con- ductions of the arithmetic complexity (with fewer trast, the time spent on the continuous output layer intermediate results to keep) and trainable param- and loss evaluation remains constant with respect eter size (with fewer parameters to store). As a to the vocabulary size and is negligible. result, training with continuous outputs is 2 to 4 times more memory efficient than with the soft- Trainable Parameter Size The output word max layer (see Section 5.2). embedding usually takes up a huge part of the pa- Note that as the output word embedding is fixed, rameters of a language model. For example, the we can keep that embedding in the main memory softmax layer in ELMo trained on the One Bil- and only load the required part to the GPU mem- lion Word Benchmark (Chelba et al., 2013) takes ory. Despite that this comes with an overhead of up more than 80% trainable parameters of the en- moving part of the output word embedding from tire model. 
3.1.2 Overall Training Time

We now discuss how the efficiency improvements to the output layer contribute to the reduction of the overall training time, in the context of synchronous stochastic gradient training on multiple GPUs. In general, the following three factors determine the training time.

Arithmetic Complexity The arithmetic complexity of a model includes the complexity of the forward and backward propagation on the input layer, the sequence encoder, and the output layer. It also includes the overhead of the optimization algorithm, such as gradient clipping and model updates. The complexity of this optimization overhead is often proportional to the number of parameters that need updating. With the continuous output layer, not only the arithmetic complexity but also the optimization overhead is reduced.

GPU Memory Consumption The training time is also affected by GPU memory consumption, as less GPU memory consumption allows a larger batch size. For the same amount of data and hardware resources, a larger batch size means better parallelism and less training time. Our approach has a small GPU memory footprint, due to reductions in the arithmetic complexity (with fewer intermediate results to keep) and the trainable parameter size (with fewer parameters to store). As a result, training with continuous outputs is 2 to 4 times more memory efficient than with the softmax layer (see Section 5.2).

Note that as the output word embedding is fixed, we can keep that embedding in the main memory and only load the required part into the GPU memory. Although this comes with the overhead of moving part of the output word embedding from CPU to GPU memory at each iteration, the benefit of parallelism often dominates the communication overhead on mainstream hardware, where GPU memory is often comparatively limited.

We also note that a larger batch size may lead to difficulty in optimization. Several methods have been developed to ease the large-batch training issue (Goyal et al., 2017; You et al., 2018). We show that these methods are sufficient for resolving the optimization difficulty in our experiments (Section 4).

Communication Cost To train large neural network models, using multiple GPUs almost becomes a necessity. In addition, one way to scale up current systems is to increase the number of GPUs used. In such cases, the communication cost across GPUs needs to be taken into consideration. The cost comes from synchronizing the parameters and their gradients across GPUs, which is proportional to the size of the parameters that need to be updated. For the sampled softmax, due to the use of the sparse gradient, the communication cost is proportional to the number of sampled words. For the adaptive softmax and the subword language model, the communication cost is proportional to the trainable parameter size. The continuous output layer, on the other hand, incurs little communication cost across GPUs.

3.2 Open-Vocabulary Training

We utilize the open-vocabulary word embedding as both the input and the output layer embedding. Open-vocabulary word embeddings, such as the FastText embedding and the MIMICK model (Pinter et al., 2017), utilize character or subword information to provide word embeddings. They can represent an unlimited number of words with a fixed number of parameters. As a result, we can train contextual encoders with an open vocabulary, which means we do not need to pre-define a closed word set as the vocabulary and the model can be trained on any sequence of words.

Open-vocabulary Input Layer To be easily applied in various tasks, the contextual encoder usually has an open-vocabulary input layer. ELMo uses a character-CNN, but it is relatively slow. Thus we use a pre-trained open-vocabulary word embedding as the input layer of the contextual encoder, reducing the time complexity of the input layer to a negligible level. This also aligns with the main spirit of our approach, which is to spend computational resources on the most important part, the sequence encoder.

Open-vocabulary Output Layer For the softmax layer, including efficient variants such as the adaptive softmax, the output vocabulary has to be pre-defined so that the normalization term can be calculated. As the softmax layer's arithmetic complexity and parameter size grow with the vocabulary size, the vocabulary is often truncated to avoid expensive computation. With the continuous output layer, we can conduct training on an arbitrary sequence of words, as long as the output embedding for those words can be derived. This can be achieved by using the open-vocabulary embedding. This feature is especially attractive if we are training on domains or languages with a long-tail word distribution, such as the biomedical domain, where truncating the vocabulary may not be acceptable.
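As a concrete illustration of the open-vocabulary output layer, the sketch below derives target embeddings for arbitrary words from a subword-based FastText model. It assumes the official fasttext Python bindings and a separately downloaded .bin model (the file name is a placeholder); the paper's own pipeline may differ.

```python
import fasttext  # official fastText bindings; the model file must be downloaded separately

ft = fasttext.load_model("cc.en.300.bin")   # placeholder path to a Common Crawl model

# Any word, even one never seen during embedding training, gets a vector
# assembled from its character n-grams, so no closed output vocabulary is needed.
for word in ["electrocardiogram", "crowdsourcing", "bioluminescence"]:
    vec = ft.get_word_vector(word)
    print(word, vec.shape)                  # (300,)
```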
3.3 Interpretation of Learning Contextual Encoders with Continuous Outputs

In the following, we justify the intuition behind learning with continuous outputs and discuss how the pre-trained word embedding affects the performance of the model.

Language models essentially model the word-context conditional probability matrix A ∈ R^{N×V}, where A_{c,w} = p(w|c), N is the number of all possible contexts, and V is the vocabulary size (Levy and Goldberg, 2014; Yang et al., 2017). The continuous output layer can be viewed as modeling A after using the word embedding as a projection matrix. To illustrate this, consider the global objective of the layer with the cosine distance:3

L = Σ_{(w,c)} #(w, c) l(w, c)
  = − Σ_{(w,c)} #(w, c) c · w
  = − Σ_c #(c) c · Σ_w p(w|c) w
  = − Σ_c #(c) c · Σ_w A_{c,w} w,

where #(w, c) is the number of occurrences of the pair (w, c) in the corpus and #(c) is the number of occurrences of the context c. We assume all vectors (c and w) are normalized.

3 For simplicity, we take the cosine distance as a running example, but the conclusions hold for other distance functions.

To optimize the inner product between c and Σ_w p(w|c) w, we essentially align the direction of the context vector c with the expected word vector under the context c, Σ_w p(w|c) w = E_{w∼p(w|c)}[w]. In other words, given a word embedding matrix W ∈ R^{V×d}, our approach aligns c with the corresponding row (AW)_{c,:} of AW. Therefore, the objective can be viewed as conducting "multivariate regression" to approximate (AW)_{c,:} given the context.

Based on this view, the performance of the contextual representation model depends on how much information of the original matrix A is preserved after projection with W. In the spirit of PCA (Jolliffe, 2011), to keep the variance of A, we would like (AW)_{c,:} and (AW)_{c′,:} to be distant from each other if c and c′ are very different. Therefore, a pre-trained word embedding, which projects words with different meanings into different positions in space, is a natural choice for the projection matrix W and can help preserve much of the variance of A.
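A tiny numerical check of the alignment claim above (all numbers made up for illustration): among unit-norm context vectors, the one maximizing Σ_w p(w|c) c · w is the normalized expected word vector.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m = 50, 8
W = rng.normal(size=(V, m))
W /= np.linalg.norm(W, axis=1, keepdims=True)        # normalized word vectors

p = rng.random(V)
p /= p.sum()                                          # an arbitrary p(w|c)
expected_w = p @ W                                    # E_{w~p(w|c)}[w]

best_c = expected_w / np.linalg.norm(expected_w)      # claimed optimum
score = lambda c: float(p @ (W @ c))                  # sum_w p(w|c) c·w

random_cs = rng.normal(size=(1000, m))
random_cs /= np.linalg.norm(random_cs, axis=1, keepdims=True)
assert all(score(best_c) >= score(c) for c in random_cs)
```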

4 Experiment

We accelerate ELMo with the proposed approach and show that the resulting model, ELMO-C, is computationally efficient and maintains competitive performance, compared to the original ELMo model (ELMO), an ELMo variant with the adaptive softmax (ELMO-A),4 and another variant with the subword method (ELMO-Sub).

4 We include ELMO-A instead of a model with the sampled softmax because the adaptive softmax has been shown to have superior performance (Grave et al., 2016).

4.1 Setup

Models In the following, we introduce the models in detail. Table 1 provides a summary of them.

The original ELMo consists of a character-CNN as the input layer, a forward and a backward LSTM with projection as the sequence encoder, and a sampled softmax as the output layer. Adagrad (Duchi et al., 2011) is used as the optimizer. We conduct experiments using the reimplementation of ELMO in AllenNLP (Gardner et al., 2018) and build the other models upon it.

The key difference between ELMO-C and ELMO is that ELMO-C produces continuous outputs and we train it with the cosine distance loss. A FastText embedding trained on Common Crawl (Mikolov et al., 2017) (FASTTEXTCC) is used as the output embedding. Based on preliminary experiments, we also make three minor changes: 1) we use FASTTEXTCC as the input layer as it is more efficient than the character-CNN model; 2) we add a layer norm (Ba et al., 2016) after the projection layer of the LSTM to improve the convergence speed; 3) we use Adam with the learning rate schedule from Chen et al. (2018) to help training with a large batch size.

Our main goal is to study how different output layers affect the training speed and performance, which cannot be achieved by just comparing ELMO-C and ELMO, due to the aforementioned minor changes to ELMO-C. Therefore, we introduce two additional baseline models (ELMO-A and ELMO-Sub), which differ from ELMO-C in a minimal way. Specifically, their sequence encoders and training recipes are kept the same as ELMO-C. Thus ELMO-C, ELMO-A, and ELMO-Sub can be directly compared.

ELMO-A uses the adaptive softmax as its output layer. We carefully choose the hyper-parameters of the adaptive softmax to obtain an efficient yet strong baseline. It has only half of the parameters of the one reported in Grave et al. (2016) but achieves a perplexity of 35.8, lower than ELMO's 39.7.

ELMO-Sub takes subwords as input and also predicts subwords. Thus, unlike the other models, its vocabulary consists of around 30,000 subwords created using BPE (Sennrich et al., 2016). For this reason, a lookup-table-style embedding rather than FASTTEXTCC is used as its input layer and a vanilla softmax is used as its output layer. Its input and output word embeddings are tied and trained from scratch.

For reference, we also list the results of ELMo and the baseline reported in Peters et al. (2018a) as ELMOORG and BASE. However, these models are evaluated using different configurations. Finally, we also include FASTTEXTCC, a (non-contextual) word embedding model, as another baseline.

All contextual representation models are trained on the One Billion Word Benchmark for 10 epochs, and the experiments are conducted on a workstation with 8 GeForce GTX 1080Ti GPUs, 40 Intel Xeon E5 CPUs, and 128G main memory.

Downstream Benchmarks We follow Peters et al. (2018a) and use the feature-based approach to evaluate contextual representations on downstream benchmarks. ELMo was evaluated on six benchmarks and we conduct evaluations on five of them. SQuAD (Rajpurkar et al., 2016) is not available for implementation reasons.5

5 The SQuAD experiment in Peters et al. (2018a) was conducted with an implementation in TensorFlow.

Model  Input  Sequence Encoder  Output

ELMO CNN LSTM Sampled Softmax

ELMO-C (OURS)  FASTTEXTCC  LSTM w/ LN  CONT w/ FASTTEXTCC
ELMO-A  FASTTEXTCC  LSTM w/ LN  Adaptive Softmax
ELMO-Sub  Subword  LSTM w/ LN  Softmax

ELMO-C ONEB  FASTTEXTONEB  LSTM w/ LN  CONT w/ FASTTEXTONEB
ELMO-C RND  FASTTEXTCC  LSTM w/ LN  CONT w/ Random Embedding
ELMO-C CNN  Trained CNN  LSTM w/ LN  CONT w/ Trained CNN
ELMO-C CNN-CC  Trained CNN  LSTM w/ LN  CONT w/ FASTTEXTCC
ELMO-C CC-CNN  FASTTEXTCC  LSTM w/ LN  CONT w/ Trained CNN

Table 1: Specifications of the ELMo variants compared in Section 4 and Section 5. CONT means the model has continuous outputs. LN means layer normalization.

ELMOORG BASE FASTTEXTCC ELMO ELMO-A ELMO-Sub ELMO-C Time - - - 14 x 3 5.7 x 4 3.9 x 4 2.5 x 4 Batch - - - 128 256 320 768 Params - - - 499M 196M 92M 76M SNLI 88.7 88.0 87.7 88.5 88.9 87.1 88.8 Coref NA NA 68.90 72.9 72.9 72.4 72.9 SST-5 54.7 51.4 51.30 ± 0.77 52.96 ± 2.26 53.58 ± 0.77 53.02 ± 2.08 53.80 ± 0.73 NER 92.22 90.15 90.97 ± 0.43 92.51 ± 0.28 92.28 ± 0.20 92.17 ± 0.56 92.24 ± 0.10 SRL 84.6 81.4 80.2 83.4 82.7 82.4 82.4

Table 2: Computational efficiency of the main competing models and their performance on five NLP benchmarks. Time is the overall training time in Days x Cards format. Batch is the maximal batch size per card. Params is the number of trainable parameters in millions. Due to the small test sizes for NER and SST-5, we report mean and standard deviation across three runs. Our approach (ELMO-C ) exhibits better computational efficiency and shows comparable performance compared to ELMO,ELMO-A , and ELMO-Sub. ing, we briefly review the benchmarks and task- four different entity types. The model is specific models. For details please refer to Peters a biLSTM-CRF from Peters et al.(2018a), et al.(2018a). similar to Collobert et al.(2011). • SNLI (Bowman et al., 2015): The textual • SRL: Semantic role labeling models the entailment task seeks to determine whether predicate-argument structure of a sentence. It a “hypothesis” can be entailed from a seeks to answer “Who did what to whom”. “premise”. The task-specific model is ESIM The model is from He et al.(2017) and the (Chen et al., 2017). dataset is from Pradhan et al.(2013). • Coref: Coreference resolution is the task of For SNLI, SST-5, NER, and SRL, we use clustering mentions in text that refer to the the same downstream models as in Peters et al. same underlying entities. The dataset is from (2018a) re-implemented in AllenNLP6. For Coref, CoNLL 2012 shared task (Pradhan et al., Peters et al.(2018a) uses the model from Lee et al. 2012) and the model is from Lee et al.(2018). (2017) and we use an improved model (Lee et al., Note that we use an improved version of the 2018) from the same authors. For all the tasks, the Coref system (Lee et al., 2017) used in Peters hyper-parameters are tuned to maximize the per- et al.(2018a). formance for the original ELMo and all models • SST-5 (Socher et al., 2013): The task in- are tested under the same configurations. volves selecting one of five labels to describe a sentence from a movie review. We use the 4.2 Main Results BCN model from McCann et al.(2017). We report the main results in Table2. Our ap- • NER: The CoNLL 2003 NER task (Sang proach (ELMO-C ) enjoys a substantial computa- and De Meulder, 2003) consists of newswire tional advantage while maintaining competitive or from the Reuters RCV1 corpus tagged with even superior performance, compared to ELMO, ELMO-A , and ELMO-Sub. experiment setting is not currently available in AllenNLP (https://github.com/allenai/allennlp/pull/1626), nor can it be 6For SRL, the score reported by AllenNLP is lower than easily replicated in PyTorch. the score reported by CoNLL official script. Model Efficiency For model efficiency, the 5 Analysis statistics of ELMO is reported by the original We conduct analysis regarding the effect of the authors and they use three GTX 1080 Tis. We pre-trained word embedding on the performance train ELMO-A ,ELMO-Sub, and ELMO-C us- of the contextual encoder. We also investigate ing four GTX 1080 Tis. Roughly speaking, com- the contributions of different factors to the over- pared to ELMO,ELMO-C is 4.2x faster and 6x all training time and study the speedup of our ap- more memory efficient. To give a clear view of proach under various conditions. the speedup the CONT layer brings, we compare ELMO-C with ELMO-A .ELMO-A differs from 5.1 Effect of the Pre-trained Embedding ELMO-C only in the output layer. 
Still, ELMO- C has a 2.28x speed advantage and is 3x more We show the effect of the pre-trained embedding by introducing several variants of ELMO-C (sum- memory efficient. Compared to ELMO-Sub, our approach holds a 1.56x speed advantage and is marized in Table 1) and list their performance in 2x more memory efficient. The results here only Table3. show the overall efficiency of our approach under Quality of the Pre-trained Embedding We the setting of ELMo and a detailed analysis of the aim to understand how the quality of the pre- efficiency is desirable, which we provide in Sec- trained output word embedding W affects the per- tion 5.2. formance of the contextual encoder. To study this, we train a FastText word embedding on the One Performance on Downstream Tasks ELMO-C Billion Word Benchmark, a much smaller corpus works especially well on semantic-centric tasks, than Common Crawl. We then train an ELMO- such as SNLI, Coref, and SST-5. However, for C variant, ELMO-C , by using this embed- tasks that required a certain level of syntactic in- ONEB ding in the input and output layers. Comparing formation, such as NER and SRL (He et al., 2018), it to ELMO-C ,ELMO-C ONEB holds up surpris- ELMO-C suffers from slight performance degra- ingly well and it is competitive on SNLI, Coref, dation compared to ELMO, but it remains compet- and SST-5 while being inferior on NER and SRL. itive with ELMO-A and ELMO-Sub. We suspect This motivates us to take a step further and use that the performance degradation is related to the a completely random output word embedding. We pre-trained embedding and conduct further analy- replace the output embedding of ELMO-C with sis in Section 5.1. a random embedding matrix, of which each ele- In addition, we notice that the performance of ment is randomly drawn from a standard normal ELMO-Sub is unstable. It shows satisfying per- distribution. We denote this model as ELMO- formance on SST-5, NER, and SRL. However, C RND. We find that this model performs well (Ta- it lags behind on Coref and even fails to out- ble3), with only a mild performance drop com- perform FASTTEXT on SNLI. ELMO-Sub pro- CC pared to ELMO-C . The performance of ELMO- vides subword-level contextual representations, C RND shows the robustness of the proposed ap- which we suspect could be the cause of unstable proach and demonstrates that the deep LSTM is performance. Specifically, to get the representa- expressive enough to fit a complex output space. tion for a word in evaluation on word-level tasks, However, we find that the pre-trained input word we follow Devlin et al.(2019) to use the repre- embedding is still indispensable because using a sentation of its first subword. This could be sub- randomly initialized input embedding would lead optimal if precise word-level representation is de- to brittle performance (e.g., 85.8 on SNLI). sired (e.g., the suffix of a word is an important fea- ture). These results are consistent with the obser- Pre-trained CNN Layer as Word Embedding vation in Kitaev and Klein(2018). They find that In Section4, we observed that models using Fast- special design has to be made to apply BERT to Text embedding (ELMO-C and ELMO-A ) as in- constituency parsing because of the subword seg- put performed worse than ELMo on SRL, a task mentation. However, we notice that the scope of relying heavily on syntactic information. We sus- our experiment is limited. 
It is likely that when pect that the FastText embedding is weaker on ELMO-Sub is scaled up or used with the fine- capturing syntactic information, compared to the tuning method, the aforementioned issue is alle- character-CNN trained in ELMo (Peters et al., viated and we leave that to future work. 2018b). To verify this, we train ELMO-C us- Task ELMO-C ELMO-C ONEB ELMO-C RND ELMO-C CNN ELMO-C CNN-CC ELMO-C CC-CNN SNLI 88.8 88.4 88.4 88.2 88.0 88.4 Coref 72.9 73.0 72.4 72.9 72.8 72.6 SST-5 53.80 ± 0.73 52.70 ± 0.90 53.01 ± 1.67 53.38 ± 0.68 54.33 ± 1.26 54.16 ± 0.96 NER 92.24 ± 0.10 92.03 ± 0.47 91.99 ± 0.35 92.24 ± 0.36 92.04 ± 0.33 91.93 ± 0.53 SRL 82.4 82.2 82.9 82.8 83.4 83.3

Table 3: Performance of ablation models on five NLP benchmarks. ELMO-C is included for reference. ing the trained CNN layer from ELMo as the in- mance is controlled by their hyper-parameters. We put layer (ELMO-C CNN-CC) or the output embed- choose hyper-parameters close to those reported in ding (ELMO-C CC-CNN). We observe that the two the literature to strike a balance between speed and models exhibit notably better performance on SRL performance. (see Table3). We also consider a ELM O-C CNN model, which uses the CNN layer as both the input 5.2.1 Speedup Breakdown and output embedding. On SRL, ELMO-C CNN We pair the five different output layers with the performs favourably compared to ELMO-C but same input layer (fixed word embedding) and se- slightly worse than ELMO-C CNN-CC or ELMO- quence encoder (ELMo’s LSTM with projection). C CC-CNN. We suspect that this is because ELMO- We then test the training speed of these models un- C CNN-CC and ELMO-C CC-CNN benefit from dif- der three scenarios, which are designed to reflect ferent kinds of embeddings in the input layer and the individual effect of the arithmetic complexity, the output layer. the GPU memory consumption, and the commu- nication cost: 5.2 Computational Efficiency • S1 (small batch): We use one GPU card and Next, we study the computational efficiency of the set the batch size to be 1. The asynchronous continuous output layer against several baselines execution feature of the GPU is disabled. The from two aspects. First, in Section 3.1, we dis- time needed to finish one batch is reported. cussed three factors governing the overall training • S2 (large batch): We use one GPU card and time of the model: 1) arithmetic complexity, 2) the maximal batch size. The time needed to GPU memory consumption, and 3) communica- finish training on one million words for each tion cost. We aim to study how each factor affects model is reported. the overall training time of each model. Second, • S3 (multiple GPUs): We use 4 GPU cards in the above experiments, we focus on ELMo with and the maximal batch size. The time needed the LSTM as the sequence encoder. We wonder if to finish training on one million words for the continuous output layer can deliver attractive each model is reported. speedup for sequence encoders of different types In Table4, we report the training speed of the and sizes. models under each scenario7. In addition, we re- We investigate the continuous output layer port the parameter size and the maximal batch size (CONT) and three common baseline output on one GPU card. For ADAPTIVE and SAMPLED, layers: 1) the subword-level language model the vocabulary size also affects the training speed (SUBWORD), 2) the adaptive softmax layer so we test them under three different vocabulary (ADAPTIVE), and 3) the sampled softmax layer sizes8: 40K, 800K and 2,000K. (SAMPLED). Besides, we include a variant of the sampled softmax denoted as FIXED where the out- 7CONT under S3 is slightly slower than the ELMO-C put word embedding is initialized by the FastText model reported in Section 4.2. This is because when training embedding and fixed during the training. This out- the ELMO-C model reported in 4.2, we actually train a for- ward ELMO-C on two cards and train a backward ELMO-C put layer is similar to a special case of CONT with on two other cards, which reduces the communication cost a ranking loss, where the model encourages its by half. 
This optimization is only applicable to our approach output to be close to the target word embedding in the setting of ELMo and does not work for other baseline but far from a negative sample. methods. In this experiment, we disable this optimization for In total, we study five different output lay- generosity. 8The 2,000K vocabulary is created on the tokenized ers. For several output layers, the trade-off be- 250-billion-word Common Crawl corpus (Panchenko et al., tween computational efficiency and model perfor- 2017), which covers words that appear more than 397 times. Vocab Params Batch S1 (small batch) S2 (large batch) S3 (multiple GPUs)

CONT ∞ 76M 640 0.47s 115.28s 34.58s

FIXED ∞ 76M 512 1.17x 1.24x 1.24x

SUBWORD  ∞  92M  320  1.09x  1.53x  1.55x
ADAPTIVE  40K  97M  384  1.08x  1.30x  1.34x
ADAPTIVE  800K  196M  256  1.16x  1.47x  1.89x
ADAPTIVE  2000K  213M  192  1.25x  1.82x  2.49x
SAMPLED  40K  96M  512  1.07x  1.18x  1.30x
SAMPLED  800K  483M  256  1.15x  1.35x  1.91x
SAMPLED  2000K  1102M  64  1.16x  2.35x  16.09x

Table 4: Statistics on the computational efficiency of different models. For CONT, we report the actual training time in seconds. For other models, we report the relative training time compared to CONT. Params: number of trainable parameters of the whole model in millions. Batch: maximal batch size per card.
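For readers who want to reproduce this kind of breakdown, the following is a hypothetical timing harness in the spirit of the S1-S3 scenarios: a single optimizer step is timed with explicit GPU synchronization so that asynchronous execution does not hide the output-layer cost. The model, batch, and optimizer here are placeholders, not the paper's training code.

```python
import time
import torch

def time_one_step(model, batch, optimizer):
    """Wall-clock time of one forward/backward/update step on the GPU."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    loss = model(batch)          # the model is assumed to return a scalar loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.synchronize()
    return time.perf_counter() - start
```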

LSTM  LSTMX2  TRANS BASE  ELMO  TRANS LARGE  GPT

CONT  3.97s  10.42s  15.87s  34.58s  48.55s  43.53s
FIXED  1.93x  1.32x  1.52x  1.24x  1.37x  1.14x
SUBWORD  2.32x  1.49x  1.78x  1.55x  1.72x  1.44x
ADAPTIVE  4.58x  2.20x  2.62x  1.89x  3.28x  2.33x
SAMPLED  2.50x  1.60x  2.91x  1.91x  OOM  8.31x

Table 5: Time needed to finish training on one million words for each model using 4 GPU cards and the maximal batch size. For CONT, we report the actual training time in seconds. For other models, we report the relative training time compared to CONT. OOM means that the GPU memory is not sufficient. CONT shows substantial speedup over common baselines under all scenarios.

Arithmetic Complexity The arithmetic com- words into subwords leads to longer sequences, plexity of the models is reflected by the speed un- which increases the training time by 1.1x. Thus der S1, where the GPU memory is always abun- it is 1.53x slower than CONT under S2. SAMPLED dant and the arithmetic complexity is the dominat- suffers from its huge parameter size and exhibits ing factor. CONT holds a mild advantage (1.07x- poor scalability with respect to the vocabulary size 1.25x) over baseline models, which is expected (2.35x slower when the vocabulary size reaches because the LSTM layers in ELMo are quite slow 2,000K). and that undermines the advantage of the contin- uous output layer. For ELMO-Sub, the small yet Communication Cost The effect of the com- non-negligible softmax layer adds overhead to the munication cost across GPUs can be observed by arithmetic complexity. FIXED,ADAPTIVE, and comparing the statistics under S2 and S3. As the SAMPLED have similar arithmetic complexity but communication cost and GPU memory consump- ADAPTIVE has the highest complexity when the tion both are highly dependent on the parameter vocabulary size is large (e.g., 2,000K). size, the observations are similar.

GPU Memory Consumption The effect of 5.2.2 The Continuous Output Layer with GPU memory consumption can be observed by Different Sequence Encoders comparing the statistics under S1 and S2. The For this experiment, we pair the output layers with difference between S2 and S1 is that the paral- different sequence encoders and investigate their lel computing of the GPU is fully utilized. For training time. We start from a single-layer LSTM CONT, its great GPU memory efficiency helps it with a hidden size of 2048 (LSTM) and a two- gain larger speedup under S2, especially against layer version (LSTMX2), both reported in Grave common baselines such as SUBWORD,ADAP- et al.(2016). They are all smaller than the se- TIVE, and SAMPLED. For ELMO-Sub, in addition quence encoder used in ELMO. We then scale to the overhead from the softmax layer, breaking up to the forward and backward Transformer re- ported in Peters et al.(2018b)(T RANS BASE) and tional Science Foundation Grant IIS-1760523 and the multi-layer LSTM with projection in ELMO IIS-1901527. (ELMO). Finally, we test two larger Transformer, TRANS LARGE, a scaled-up version of TRANS BASE, and a uni-directional Transformer (denoted References as GPT) with the same size as BERTBASE (Devlin Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E et al., 2019) and GPT (Radford et al., 2018), re- Hinton. 2016. Layer normalization. arXiv spectively. For all models but GPT, the lengths of preprint arXiv:1607.06450. the input sequences are fixed at 20. For GPT, we use input sequences of length 512, following its , Jean-Sébastien Senécal, et al. original setting. For ADAPTIVE and SAMPLED, 2003. Quick training of probabilistic neural we fix the vocabulary size at 800K. nets by importance sampling. In AISTATS. We report the training time of each model us- Samuel R. Bowman, Gabor Angeli, Christopher ing four GPU cards and the maximal batch size Potts, and Christopher D. Manning. 2015. A (S3) in Table5. We find that the continuous out- large annotated corpus for learning natural lan- put layer remains attractive, even when the se- guage inference. In EMNLP. quence encoder is as large as GPT. In that case, the speedup of CONT over SUBWORD,ADAP- James Bradbury, Stephen Merity, Caiming TIVE, and SAMPLED is still substantial (1.44x - Xiong, and Richard Socher. 2016. Quasi- 8.31x). In addition, we observe that for sequence recurrent neural networks. arXiv preprint encoders of the same type, more complex they get, arXiv:1611.01576. less speedup CONT enjoys, which is expected. For instance, from LSTM to LSTMX2, the speedup Ciprian Chelba, Tomas Mikolov, Mike Schus- of CONT decreases noticeably. However, the ter, Qi Ge, Thorsten Brants, Phillipp Koehn, speedup the continuous output brings also depends and Tony Robinson. 2013. One billion word on the architecture of the sequence encoder For in- benchmark for measuring progress in sta- stance, though TRANS BASE and TRANS LARGE tistical language modeling. arXiv preprint are more complex than LSTMX2, CONT enjoys arXiv:1312.3005. larger speedup with those transformers. Profiling Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin the training process of sequence decoders such as Johnson, Wolfgang Macherey, George Foster, LSTM and the Transformer on GPU devices is an Llion Jones, Mike Schuster, Noam Shazeer, interesting research topic but out of the scope of Niki Parmar, Ashish Vaswani, Jakob Uszkor- this study. eit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. 
The best of both 6 Conclusion worlds: Combining recent advances in neural We introduced an efficient framework to learn machine translation. In ACL. contextual representation without the softmax Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, layer. The experiments with ELMo showed that Hui Jiang, and Diana Inkpen. 2017. Enhanced we significantly accelerate the training of the cur- LSTM for natural language inference. In ACL. rent models while maintaining competitive perfor- mance on various downstream tasks. Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Acknowledgements Pavel P. Kuksa. 2011. Natural language pro- We wish to thank the anonymous reviewers, the cessing (almost) from scratch. Journal of Ma- editor, Mark Yatskar, Muhao Chen, Xianda Zhou, chine Learning Research. and members at UCLANLP lab for helpful com- Yann N. Dauphin, Angela Fan, Michael Auli, and ments. We also thank Yulia Tsvetkov and Sachin David Grangier. 2017. Language modeling Kumar for help with implementing the CONT with gated convolutional networks. In ICML. layer as well as Jieyu Zhao, Kenton Lee, and Nel- son Liu for providing reproducible source code Jacob Devlin, Ming-Wei Chang, Kenton Lee, and for experiments. This work was supported by Na- Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, understanding. In NAACL-NLT. Richard Zemel, Raquel Urtasun, Antonio Tor- ralba, and Sanja Fidler. 2015. Skip-thought vec- John Duchi, Elad Hazan, and Yoram Singer. 2011. tors. In NIPS. Adaptive subgradient methods for online learn- ing and stochastic optimization. Journal of Ma- Nikita Kitaev and Dan Klein. 2018. Multilingual chine Learning Research. constituency parsing with self-attention and pre-training. arXiv preprint arXiv:1812.11760. Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Sachin Kumar and Yulia Tsvetkov. 2019. Von Matthew E. Peters, Michael Schmitz, and Luke Mises-Fisher loss for training sequence to se- Zettlemoyer. 2018. AllenNLP: A deep seman- quence models with continuous outputs. In tic natural language processing platform. arXiv ICLR. preprint arXiv:1803.07640. Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural corefer- Priya Goyal, Piotr Dollár, Ross Girshick, Pieter ence resolution. In EMNLP. Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming Kenton Lee, Luheng He, and Luke Zettlemoyer. He. 2017. Accurate, large minibatch SGD: 2018. Higher-order coreference resolution with training Imagenet in 1 hour. arXiv preprint coarse-to-fine inference. In NAACL-HLT. arXiv:1706.02677. Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Edouard Grave, Armand Joulin, Moustapha Cissé, Yoav Artzi. 2018. Simple recurrent units for David Grangier, and Hervé Jégou. 2016. Effi- highly parallelizable recurrence. In EMNLP. cient softmax approximation for GPUs. arXiv Omer Levy and Yoav Goldberg. 2014. Neural preprint arXiv:1609.04309. word embedding as implicit matrix factoriza- Luheng He, Kenton Lee, Mike Lewis, and Luke tion. In NIPS. Zettlemoyer. 2017. Deep semantic role label- Lajanugen Logeswaran and Honglak Lee. 2018. ing: What works and what’s next. In ACL. An efficient framework for learning sentence Shexia He, Zuchao Li, Hai Zhao, and Hongxiao representations. ICLR. Bai. 2018. Syntax for semantic role labeling, to Bryan McCann, James Bradbury, Caiming Xiong, be, or not to be. In ACL. 
and Richard Socher. 2017. Learned in transla- tion: Contextualized word vectors. In NIPS. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation. Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural Yacine Jernite, Samuel R Bowman, and David language modeling at multiple scales. arXiv Sontag. 2017. Discourse-based objectives for preprint arXiv:1803.08240. fast unsupervised sentence representation learn- ing. arXiv preprint arXiv:1705.00557. Tomas Mikolov, Kai Chen, Greg Corrado, and Jef- frey Dean. 2013. Efficient estimation of word Ian Jolliffe. 2011. Principal component analysis. representations in vector space. arXiv preprint In International Encyclopedia of Statistical Sci- arXiv:1301.3781. ence. Springer Berlin Heidelberg. Tomas Mikolov, Edouard Grave, Piotr Bo- Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, janowski, Christian Puhrsch, and Armand Noam Shazeer, and Yonghui Wu. 2016. Ex- Joulin. 2017. Advances in pre-training dis- ploring the limits of language modeling. arXiv tributed word representations. arXiv preprint preprint arXiv:1602.02410. arXiv:1712.09405. Yoon Kim, Yacine Jernite, David Sontag, and A Mnih and YW Teh. 2012. A fast and simple Alexander M Rush. 2016. Character-aware algorithm for training neural probabilistic lan- neural language models. In AAAI. guage models. In ICML. Frederic Morin and Yoshua Bengio. 2005. Hier- Rico Sennrich, Barry Haddow, and Alexandra archical probabilistic neural network language Birch. 2016. Neural machine translation of rare model. AISTATS. words with subword units. In ACL.

Alexander Panchenko, Eugen Ruppert, Stefano Richard Socher, Alex Perelygin, Jean Wu, Jason Faralli, Simone Paolo Ponzetto, and Chris Bie- Chuang, Christopher D. Manning, Andrew Y. mann. 2017. Building a web-scale dependency- Ng, and Christopher Potts. 2013. Recursive parsed corpus from CommonCrawl. arXiv deep models for semantic compositionality over preprint arXiv:1710.01779. a sentiment treebank. In EMNLP.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Emma Strubell, Ananya Ganesh, and Andrew Mc- Matt Gardner, Christopher Clark, Kenton Lee, Callum. 2019. Energy and policy considera- and Luke Zettlemoyer. 2018a. Deep contextu- tions for in NLP. arXiv preprint alized word representations. In NAACL-HLT. arXiv:1906.02243.

Matthew E. Peters, Mark Neumann, Luke Zettle- Shuai Tang, Hailin Jin, Chen Fang, Zhaowen moyer, and Wen-tau Yih. 2018b. Dissecting Wang, and Virginia de Sa. 2018. Speeding contextual word embeddings: Architecture and up context-based sentence representation learn- representation. In EMNLP. ing with non-autoregressive convolutional de- Yuval Pinter, Robert Guthrie, and Jacob Eisen- coding. In Workshop on Representation Learn- stein. 2017. Mimicking word embeddings us- ing for NLP. ing subword RNNs. In EMNLP. Ashish Vaswani, Noam Shazeer, Niki Parmar, Sameer Pradhan, Alessandro Moschitti, Nian- Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, wen Xue, Hwee Tou Ng, Anders Björkelund, Lukasz Kaiser, and Illia Polosukhin. 2017. At- Olga Uryupina, Yuchen Zhang, and Zhi Zhong. tention is all you need. In NIPS. 2013. Towards robust linguistic analysis using Yonghui Wu, Mike Schuster, Zhifeng Chen, OntoNotes. In CoNLL. Quoc V Le, Mohammad Norouzi, Wolfgang Sameer Pradhan, Alessandro Moschitti, Nianwen Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Xue, Olga Uryupina, and Yuchen Zhang. 2012. Klaus Macherey, et al. 2016. Google’s neural CoNLL-2012 shared task: Modeling multi- machine translation system: Bridging the gap lingual unrestricted coreference in ontonotes. between human and machine translation. arXiv In Joint Conference on EMNLP and CoNLL- preprint arXiv:1609.08144. Shared Task. Zhilin Yang, Zihang Dai, Ruslan Salakhutdi- Alec Radford, Karthik Narasimhan, Tim Sali- nov, and William W. Cohen. 2017. Break- mans, and Ilya Sutskever. 2018. Improving lan- ing the softmax bottleneck: A high-rank guage understanding by generative pre-training. RNN language model. arXiv preprint OpenAI Blog. arXiv:1711.03953. Alec Radford, Jeffrey Wu, Rewon Child, David Yang You, Zhao Zhang, Cho-Jui Hsieh, James Luan, Dario Amodei, and Ilya Sutskever. 2019. Demmel, and Kurt Keutzer. 2018. ImageNet Language models are unsupervised multitask training in minutes. In Proceedings of the 47th learners. OpenAI Blog. International Conference on Parallel Process- ing, ICPP 2018. Pranav Rajpurkar, Jian Zhang, Konstantin Lopy- rev, and Percy Liang. 2016. SQuAD: 100, 000+ questions for machine comprehension of text. In EMNLP. Erik F Sang and Fien De Meulder. 2003. Introduc- tion to the CoNLL-2003 shared task: Language- independent named entity recognition. arXiv preprint cs/0306050.