ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding

Dongling Xiao, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu and Haifeng Wang
Baidu Inc., China
{xiaodongling, liyukun01, zhanghan17, sunyu02, tianhao, wu_hua, ...}@baidu.com

Abstract

Coarse-grained linguistic information, such as named entities or phrases, facilitates adequate representation learning in pre-training. Previous works mainly focus on extending the objective of BERT's Masked Language Modeling (MLM) from masking individual tokens to contiguous sequences of n tokens. We argue that such a contiguous masking method neglects to model the intra-dependencies and inter-relations of coarse-grained linguistic information. As an alternative, we propose ERNIE-Gram, an explicitly n-gram masking method that enhances the integration of coarse-grained information into pre-training. In ERNIE-Gram, n-grams are masked and predicted directly using explicit n-gram identities rather than contiguous sequences of n tokens. Furthermore, ERNIE-Gram employs a generator model to sample plausible n-gram identities as optional n-gram masks and to predict them in both coarse-grained and fine-grained manners, enabling comprehensive n-gram prediction and relation modeling. We pre-train ERNIE-Gram on English and Chinese text corpora and fine-tune it on 19 downstream tasks. Experimental results show that ERNIE-Gram outperforms previous pre-training models such as XLNet and RoBERTa by a large margin, and achieves results comparable to state-of-the-art methods. The source code and pre-trained models have been released at https://github.com/PaddlePaddle/ERNIE.

1 Introduction

Pre-trained on large-scale text corpora and fine-tuned on downstream tasks, self-supervised representation models (Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019; Lan et al., 2020; Clark et al., 2020) have achieved remarkable improvements in natural language understanding (NLU). As one of the most prominent pre-trained models, BERT (Devlin et al., 2019) employs masked language modeling (MLM) to learn representations by masking individual tokens and predicting them based on their bidirectional context. However, BERT's MLM focuses on the representations of fine-grained text units (e.g., words or subwords in English and characters in Chinese), rarely considering coarse-grained linguistic information (e.g., named entities or phrases in English and words in Chinese), thus incurring inadequate representation learning.

Many efforts have been devoted to integrating coarse-grained semantic information by independently masking and predicting contiguous sequences of n tokens, namely n-grams, such as named entities, phrases (Sun et al., 2019b), whole words (Cui et al., 2019) and text spans (Joshi et al., 2020). We argue that such contiguous masking strategies are less effective and reliable, since the predictions of the tokens in a masked n-gram are independent of each other, which neglects the intra-dependencies of n-grams. Specifically, given a masked n-gram $w = \{x_1, \dots, x_n\}$, $x \in \mathcal{V}_F$, we maximize $p(w) = \prod_{i=1}^{n} p(x_i \mid c)$ for n-gram learning, where models learn to recover $w$ in a huge and sparse prediction space $\mathcal{F} \in \mathbb{R}^{|\mathcal{V}_F|^n}$. Note that $\mathcal{V}_F$ is the fine-grained vocabulary¹ and $c$ is the context.

We propose ERNIE-Gram, an explicitly n-gram masked language modeling method in which n-grams are masked with single [MASK] symbols and predicted directly using explicit n-gram identities rather than sequences of tokens, as depicted in Figure 1(b). The models learn to predict an n-gram $w$ in a small and dense prediction space $\mathcal{N} \in \mathbb{R}^{|\mathcal{V}_N|}$, where $\mathcal{V}_N$ indicates a prior n-gram lexicon² and normally $|\mathcal{V}_N| \ll |\mathcal{V}_F|^n$.

¹ $\mathcal{V}_F$ contains 30K BPE codes in BERT (Devlin et al., 2019) and 50K subword units in RoBERTa (Liu et al., 2019).
² $\mathcal{V}_N$ contains 300K n-grams, where $n \in [2, 4)$ in this paper; n-grams are extracted at the word level before tokenization.
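To make the contrast between the two prediction spaces concrete, the following back-of-the-envelope calculation (ours, not from the paper) plugs in the vocabulary sizes given in the footnotes, taking the 30K-subword BERT vocabulary and a trigram ($n = 3$) as an example:

\[
|\mathcal{V}_F|^{n} = (3 \times 10^{4})^{3} = 2.7 \times 10^{13}
\quad \text{versus} \quad
|\mathcal{V}_N| = 3 \times 10^{5},
\]

i.e., the explicit n-gram prediction space is roughly eight orders of magnitude smaller, and correspondingly denser, than the joint token-level space implied by independent token prediction.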
Figure 1: Illustrations of different MLM objectives: (a) Contiguously MLM, (b) Explicitly N-gram MLM, (c) Comprehensive N-gram MLM, where $x_i$ and $y_i$ represent the identities of fine-grained tokens and explicit n-grams, respectively. Note that the weights of the fine-grained classifier ($W_F \in \mathbb{R}^{h \times |\mathcal{V}_F|}$) and the N-gram classifier ($W_N \in \mathbb{R}^{h \times |\langle \mathcal{V}_F, \mathcal{V}_N \rangle|}$) are not used in the fine-tuning stage, where $h$ is the hidden size and $L$ is the number of layers.

To learn the semantics of n-grams more adequately, we adopt a comprehensive n-gram prediction mechanism, simultaneously predicting masked n-grams in coarse-grained (explicit n-gram identities) and fine-grained (contained token identities) manners with well-designed attention mask matrices, as shown in Figure 1(c).

In addition, to model the semantic relationships between n-grams directly, we introduce an enhanced n-gram relation modeling mechanism, which masks n-grams with plausible n-gram identities sampled from a generator model and then recovers them to the original n-grams using the pair relation between plausible and original n-grams. Inspired by ELECTRA (Clark et al., 2020), we incorporate the replaced token detection objective to distinguish original n-grams from plausible ones, which enhances the interactions between explicit n-grams and fine-grained contextual tokens.

In this paper, we pre-train ERNIE-Gram on both base-scale and large-scale text corpora (16GB and 160GB, respectively) under comparable pre-training settings. We then fine-tune ERNIE-Gram on 13 English NLU tasks and 6 Chinese NLU tasks. Experimental results show that ERNIE-Gram consistently outperforms previous well-performing pre-training models on various benchmarks by a large margin.
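The following minimal, framework-free Python sketch illustrates one reading of the data construction implied by the enhanced n-gram relation modeling and replaced token detection objectives described above; it is not the released implementation. The toy lexicon, the uniform "generator" sampler, and all function names are our own assumptions for illustration, whereas ERNIE-Gram samples plausible n-grams from a jointly pre-trained generator model.

```python
import random

# Toy n-gram lexicon, a stand-in for the 300K-entry prior lexicon V_N.
NGRAM_LEXICON = ["new york", "machine learning", "language model", "san francisco"]

def sample_plausible_ngram(original, lexicon, rng):
    """Toy 'generator': sample a replacement n-gram other than the original.
    (ERNIE-Gram instead samples from a small generator model.)"""
    candidates = [g for g in lexicon if g != original]
    return rng.choice(candidates)

def build_relation_modeling_instance(tokens, ngram_span, lexicon, rng):
    """Build one illustrative training instance for n-gram relation modeling.

    tokens     : list of word-level tokens
    ngram_span : (start, end) indices of the masked n-gram in `tokens`
    Returns the corrupted sequence, per-token RTD labels (1 = replaced),
    and the original n-gram identity that the model should recover.
    """
    start, end = ngram_span
    original_ngram = " ".join(tokens[start:end])
    plausible_ngram = sample_plausible_ngram(original_ngram, lexicon, rng)
    replacement = plausible_ngram.split()

    corrupted = tokens[:start] + replacement + tokens[end:]
    rtd_labels = [0] * start + [1] * len(replacement) + [0] * (len(tokens) - end)
    return corrupted, rtd_labels, original_ngram

if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "she moved to new york last year".split()
    corrupted, labels, target = build_relation_modeling_instance(
        sentence, ngram_span=(3, 5), lexicon=NGRAM_LEXICON, rng=rng
    )
    print(corrupted)  # e.g. ['she', 'moved', 'to', 'san', 'francisco', 'last', 'year']
    print(labels)     # [0, 0, 0, 1, 1, 0, 0]
    print(target)     # 'new york'
```

In this reading, the RTD labels drive the ELECTRA-style discrimination between original and plausible n-grams, while the returned original n-gram identity serves as the recovery target.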
2 Related Work

2.1 Self-Supervised Pre-Training for NLU

Self-supervised pre-training has been used to learn contextualized sentence representations through various training objectives. GPT (Radford et al., 2018) employs unidirectional language modeling (LM) to exploit large-scale corpora. BERT (Devlin et al., 2019) proposes masked language modeling (MLM) to learn bidirectional representations efficiently, which is a representative pre-training objective with numerous extensions such as RoBERTa (Liu et al., 2019), UNILM (Dong et al., 2019) and ALBERT (Lan et al., 2020). XLNet (Yang et al., 2019) adopts permutation language modeling (PLM) to model the dependencies among predicted tokens. ELECTRA (Clark et al., 2020) introduces a replaced token detection (RTD) objective to learn from all input tokens for more compute-efficient pre-training.

2.2 Incorporating Coarse-grained Linguistic Information for Pre-Training

Coarse-grained linguistic information is indispensable for adequate representation learning. Many studies implicitly integrate coarse-grained information by extending BERT's MLM to masking and predicting contiguous sequences of tokens. For example, ERNIE (Sun et al., 2019b) masks named entities and phrases to enhance contextual representations, BERT-wwm (Cui et al., 2019) masks whole Chinese words to achieve better Chinese representations, and SpanBERT (Joshi et al., 2020) masks contiguous spans to improve performance on span selection tasks.

A few studies attempt to inject coarse-grained n-gram representations into fine-grained contextualized representations explicitly, such as ZEN (Diao et al., 2020) and AMBERT (Zhang and Li, 2020), in which additional transformer encoders and computations for explicit n-gram representations are incorporated into both pre-training and fine-tuning. Li et al. (2019) demonstrate that explicit n-gram representations are not sufficiently reliable for NLP tasks because of n-gram data sparsity and the ubiquity of out-of-vocabulary n-grams. Differently, we incorporate n-gram information only through auxiliary n-gram classifier and embedding weights in pre-training, which are completely removed during fine-tuning, so our method maintains the same parameters and computations as BERT.

3 Proposed Method

In this section, we present the detailed implementation of ERNIE-Gram, including n-gram lexicon $\mathcal{V}_N$ extraction in Section 3.5, the explicitly n-gram MLM pre-training objective in Section 3.2, and the comprehensive n-gram prediction and relation modeling mechanisms in Sections 3.3 and 3.4.

Figure 2: (a) Detailed structure of comprehensive N-gram MLM; (b) the corresponding self-attention mask, indicating which positions are allowed to attend and which are prevented from attending.

3.1 Background

To inject n-gram information into pre-training, many works (Sun et al., 2019b; Cui et al., 2019; Joshi et al., 2020) extend BERT's masked language modeling (MLM) from masking individual tokens to contiguous sequences of n tokens.

Contiguously MLM. Given an input sequence $x = \{x_1, \dots, x_{|x|}\}$, $x \in \mathcal{V}_F$, and n-gram starting boundaries $b = \{b_1, \dots, b_{|b|}\}$, let $z = \{z_1, \dots, z_{|b|-1}\}$ be the sequence of n-grams, where $z_i = x_{[b_i:b_{i+1})}$. MLM samples 15% of the starting boundaries from $b$ to mask n-grams.
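As a concrete reading of the contiguous masking setup just defined, the sketch below (ours, not the paper's code) samples 15% of the n-gram starting boundaries in $b$ and masks every token of the selected n-grams $z_i = x_{[b_i:b_{i+1})}$; the function names and the example boundaries are illustrative assumptions.

```python
import random

MASK = "[MASK]"

def contiguous_mlm_mask(tokens, boundaries, mask_ratio=0.15, seed=0):
    """Contiguously MLM masking as sketched in Section 3.1.

    tokens     : x = {x_1, ..., x_|x|}
    boundaries : n-gram starting boundaries b = {b_1, ..., b_|b|}, with the
                 final entry equal to len(tokens), so the i-th n-gram is
                 z_i = tokens[b_i : b_{i+1}].
    Samples roughly `mask_ratio` of the starting boundaries, masks every
    token of the selected n-grams, and returns the corrupted sequence
    together with the positions and original tokens to predict.
    """
    rng = random.Random(seed)
    num_ngrams = len(boundaries) - 1
    num_to_mask = max(1, round(mask_ratio * num_ngrams))
    picked = rng.sample(range(num_ngrams), num_to_mask)

    corrupted = list(tokens)
    targets = {}  # position -> original token to recover
    for i in picked:
        start, end = boundaries[i], boundaries[i + 1]
        for pos in range(start, end):
            targets[pos] = corrupted[pos]
            corrupted[pos] = MASK
    return corrupted, targets

if __name__ == "__main__":
    x = ["the", "new", "york", "times", "is", "a", "news", "paper"]
    b = [0, 1, 4, 5, 6, 8]  # n-grams: "the", "new york times", "is", "a", "news paper"
    print(contiguous_mlm_mask(x, b))
```

Each token of a selected n-gram is replaced by its own [MASK] and predicted independently; the explicitly n-gram MLM of ERNIE-Gram instead collapses the whole n-gram into a single [MASK] and predicts its identity from the lexicon $\mathcal{V}_N$.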