Revisiting Pre-Trained Models for Chinese Natural Language Processing
Yiming Cui 1,2, Wanxiang Che 1, Ting Liu 1, Bing Qin 1, Shijin Wang 2,3, Guoping Hu 2
1 Research Center for Social Computing and Information Retrieval (SCIR), Harbin Institute of Technology, Harbin, China
2 State Key Laboratory of Cognitive Intelligence, iFLYTEK Research, China
3 iFLYTEK AI Research (Hebei), Langfang, China
1 {ymcui,car,tliu,qinb}[email protected]   2,3 {ymcui,sjwang3,gphu}[email protected]

Abstract

Bidirectional Encoder Representations from Transformers (BERT) has shown marvelous improvements across various NLP tasks, and consecutive variants have been proposed to further improve the performance of pre-trained language models. In this paper, we revisit Chinese pre-trained language models to examine their effectiveness in a non-English language and release the Chinese pre-trained language model series to the community. We also propose a simple but effective model called MacBERT, which improves upon RoBERTa in several ways, especially the masking strategy that adopts MLM as correction (Mac). We carried out extensive experiments on eight Chinese NLP tasks to revisit the existing pre-trained language models as well as the proposed MacBERT. Experimental results show that MacBERT could achieve state-of-the-art performance on many NLP tasks, and we also ablate details with several findings that may help future research.[1]

[1] https://github.com/ymcui/MacBERT

1 Introduction

Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) has become enormously popular and has proven to be effective in recent natural language processing studies; it utilizes large-scale unlabeled training data and generates enriched contextual representations. Across several popular machine reading comprehension benchmarks, such as SQuAD (Rajpurkar et al., 2018), CoQA (Reddy et al., 2019), QuAC (Choi et al., 2018), NaturalQuestions (Kwiatkowski et al., 2019), and RACE (Lai et al., 2017), most of the top-performing models are based on BERT and its variants (Dai et al., 2019; Zhang et al., 2019; Ran et al., 2019), demonstrating that pre-trained language models have become new fundamental components in the natural language processing field.

Starting from BERT, the community has made great and rapid progress in optimizing pre-trained language models, such as ERNIE (Sun et al., 2019a), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), SpanBERT (Joshi et al., 2019), ALBERT (Lan et al., 2019), ELECTRA (Clark et al., 2020), etc. However, training Transformer-based (Vaswani et al., 2017) pre-trained language models is not as easy as training word embeddings or other traditional neural networks. Typically, training a powerful BERT-large model, which has a 24-layer Transformer with 330 million parameters, to convergence requires high-memory computing devices such as TPUs, which are very expensive. On the other hand, though various pre-trained language models have been released, most of them are based on English, and there are few efforts on building powerful pre-trained language models for other languages.

In this paper, we aim to build a Chinese pre-trained language model series and release it to the public to facilitate the research community, as Chinese and English are among the most spoken languages in the world. We revisit the existing popular pre-trained language models and adjust them to the Chinese language to see whether these models generalize well in a language other than English. Besides, we also propose a new pre-trained language model called MacBERT, which replaces the original MLM task with the MLM as correction (Mac) task and mitigates the discrepancy between the pre-training and fine-tuning stages. Extensive experiments are conducted on eight popular Chinese NLP datasets, ranging from sentence-level to document-level, such as machine reading comprehension, text classification, etc. The results show that the proposed MacBERT gives significant gains on most of the tasks against other pre-trained language models, and detailed ablations are given to better examine the composition of the improvements. The contributions of this paper are listed as follows.

• Extensive empirical studies are carried out to revisit the performance of Chinese pre-trained language models on various tasks with careful analyses.

• We propose a new pre-trained language model called MacBERT that mitigates the gap between the pre-training and fine-tuning stages by masking a word with a similar word (sketched below), which has proven to be effective on downstream tasks.

• To further accelerate future research on Chinese NLP, we create and release the Chinese pre-trained language model series to the community.
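To make the "MLM as correction" idea concrete before the technical sections begin, the following is a minimal, hypothetical sketch of the kind of corruption step it implies: selected words are replaced by similar words rather than by an artificial [MASK] symbol, and the model is trained to correct them back to the originals. The SIMILAR table, the mac_style_mask helper, and the 30% rate used in the demo call are illustrative assumptions, not the exact masking procedure, which is specified later in the paper.

```python
import random

# Toy similarity table; a real implementation would use a proper
# word-similarity resource to pick the replacement words.
SIMILAR = {"quick": "fast", "brown": "dark", "jumps": "leaps", "lazy": "idle"}

def similar_word(word):
    """Return a similar word if the table knows one, otherwise keep the word."""
    return SIMILAR.get(word, word)

def mac_style_mask(tokens, mask_rate=0.15, seed=2):
    """Corrupt a token sequence by swapping selected tokens for similar words.

    Returns the corrupted sequence plus the prediction targets: the model is
    asked to restore the original token at every corrupted position, so the
    pre-training input never contains a [MASK] symbol that would be absent
    at fine-tuning time.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            corrupted.append(similar_word(tok))
            targets.append(tok)    # this position must be "corrected"
        else:
            corrupted.append(tok)
            targets.append(None)   # this position is not predicted
    return corrupted, targets

corrupted, targets = mac_style_mask("the quick brown fox jumps over the lazy dog".split(),
                                    mask_rate=0.3)
print(corrupted)   # e.g. ['the', 'quick', 'dark', 'fox', ...]
print(targets)     # e.g. [None, None, 'brown', 'fox', ...]
```

Because the corrupted input still consists only of ordinary words, the input distribution seen at pre-training time matches the one seen at fine-tuning time, which is exactly the discrepancy the Mac objective is meant to reduce.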
2 Related Work

In this section, we revisit the techniques of the representative pre-trained language models in the recent natural language processing field. The overall comparisons of these models, as well as the proposed MacBERT, are depicted in Table 1. We elaborate on their key components in the following subsections.

             BERT     ERNIE    XLNet    RoBERTa  ALBERT   ELECTRA  MacBERT
Type         AE       AE       AR       AE       AE       AE       AE
Embeddings   T/S/P    T/S/P    T/S/P    T/S/P    T/S/P    T/S/P    T/S/P
Masking      T        T/E/Ph   -        T        T        T        WWM/NM
LM Task      MLM      MLM      PLM      MLM      MLM      Gen-Dis  Mac
Paired Task  NSP      NSP      -        -        SOP      -        SOP

Table 1: Comparisons of the pre-trained language models. (AE: Auto-Encoder, AR: Auto-Regressive, T: Token, S: Segment, P: Position, W: Word, E: Entity, Ph: Phrase, WWM: Whole Word Masking, NM: N-gram Masking, NSP: Next Sentence Prediction, SOP: Sentence Order Prediction, MLM: Masked LM, PLM: Permutation LM, Mac: MLM as correction)

2.1 BERT

BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) has proven to be successful in natural language processing studies. BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all Transformer layers. Primarily, BERT consists of two pre-training tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP).

• MLM: Randomly masks some of the tokens from the input, and the objective is to predict the original word based only on its context.

• NSP: To predict whether sentence B is the next sentence of A.

Later, they further proposed a technique called whole word masking (wwm) for optimizing the original masking in the MLM task. In this setting, instead of randomly selecting WordPiece (Wu et al., 2016) tokens to mask, we always mask all of the tokens corresponding to a whole word at once. This explicitly forces the model to recover the whole word in the MLM pre-training task instead of just recovering WordPiece tokens (Cui et al., 2019a), which is much more challenging. As whole word masking only affects the masking strategy of the pre-training process, it does not bring additional burdens to downstream tasks. Moreover, as training pre-trained language models is computationally expensive, they also released all the pre-trained models as well as the source code, which has stimulated great interest in the community in the research of pre-trained language models.

2.2 ERNIE

ERNIE (Enhanced Representation through kNowledge IntEgration) (Sun et al., 2019a) is designed to optimize the masking process of BERT, which includes entity-level masking and phrase-level masking. Different from selecting random words in the input, entity-level masking masks named entities, which are often formed by several words. Phrase-level masking masks consecutive words, which is similar to the N-gram masking strategy (Devlin et al., 2019; Joshi et al., 2019).[2]

[2] Though N-gram masking was not included in Devlin et al. (2019), judging from their model name on the SQuAD leaderboard, they are often credited with this method.

2.3 XLNet

Yang et al. (2019) argue that the existing pre-trained language models that are based on autoencoding, such as BERT, suffer from a discrepancy between the pre-training and fine-tuning stages because the masking symbol [MASK] never appears in the fine-tuning stage. To alleviate this problem, they proposed XLNet, which is based on Transformer-XL (Dai et al., 2019). XLNet mainly introduces two modifications. The first is to maximize the expected likelihood over all permutations of the factorization order of the input, which they call the Permutation Language Model (PLM). The second is to change the autoencoding language model into an autoregressive one, similar to traditional statistical language models.

2.4 RoBERTa

RoBERTa (Robustly Optimized BERT Pretraining Approach) (Liu et al., 2019) aims to adopt the original BERT architecture but make much more precise modifications to show the power of BERT, which was underestimated.

2.6 ELECTRA

ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) (Clark et al., 2020) employs a new generator-discriminator framework that is similar to a GAN (Goodfellow et al., 2014). The generator is typically a small MLM that learns to predict the original words of the masked tokens. The discriminator is trained to discriminate whether an input token has been replaced by the generator. Note that, to achieve efficient training, the discriminator only needs to predict a binary label indicating "replacement," unlike MLM, which must predict the exact masked word. In the fine-tuning stage, only the discriminator is used.
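To make the replaced token detection objective described above concrete, here is a minimal, self-contained sketch of how the discriminator's training signal could be constructed. The toy_generator below, its tiny vocabulary, and the example sentence are illustrative assumptions standing in for ELECTRA's small MLM generator, not its actual implementation; the labeling convention (a sampled token identical to the original counts as "original") follows the ELECTRA paper.

```python
import random

def toy_generator(context, position):
    """Stand-in for ELECTRA's small MLM generator: propose a word for a masked slot.

    A real generator samples from its predicted distribution conditioned on the
    context; this toy version ignores the context and picks from a tiny
    hypothetical vocabulary.
    """
    return random.Random(position).choice(["cat", "dog", "chef", "cooked", "ate"])

def build_discriminator_example(tokens, mask_positions):
    """Corrupt the input with generator samples and label every token.

    The discriminator is trained on (corrupted_tokens, labels), where the label
    is 1 if the token was replaced by the generator and 0 if it is original.
    Every position receives a label, not only the masked ones, which is part of
    what makes the objective sample-efficient.
    """
    corrupted, labels = list(tokens), [0] * len(tokens)
    for pos in mask_positions:
        proposal = toy_generator(tokens, pos)
        corrupted[pos] = proposal
        # If the generator happens to reproduce the original word, the token
        # is treated as "original" rather than "replaced".
        labels[pos] = int(proposal != tokens[pos])
    return corrupted, labels

tokens = "the chef cooked the meal".split()
corrupted, labels = build_discriminator_example(tokens, mask_positions=[1, 2])
print(corrupted)  # e.g. ['the', 'dog', 'ate', 'the', 'meal']
print(labels)     # e.g. [0, 1, 1, 0, 0]
```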
3 Chinese Pre-trained Language Models

While we believe most of the conclusions in the previous works are true for English, we wonder whether these techniques still generalize well in other languages.
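One immediate way the language matters is that Chinese text has no whitespace between words, so word-level strategies such as whole word masking or N-gram masking need an explicit word segmentation step before masking spans can be chosen. The sketch below is a hedged illustration of that dependency only, not the procedure used for the models in this paper; the segmenter output shown, the mask_whole_words helper, and the "[M]" placeholder are all assumptions.

```python
def mask_whole_words(words, word_ids_to_mask, mask_symbol="[M]"):
    """Replace every character of each selected word with a mask symbol,
    so the model has to recover whole words rather than single characters."""
    masked_tokens = []
    for idx, word in enumerate(words):
        if idx in word_ids_to_mask:
            masked_tokens.extend([mask_symbol] * len(word))
        else:
            masked_tokens.extend(list(word))
    return masked_tokens

# Hypothetical output of a Chinese word segmenter for the sentence
# 使用语言模型来预测下一个词的概率
# ("use a language model to predict the probability of the next word").
words = ["使用", "语言", "模型", "来", "预测", "下一个", "词", "的", "概率"]

# Character-level tokenization alone would only let us mask "模" on its own;
# with word boundaries we can mask the whole word "模型" ("model") at once.
print(mask_whole_words(words, word_ids_to_mask={2}))
# ['使', '用', '语', '言', '[M]', '[M]', '来', ...]
```

In practice, the word boundaries would be supplied by a Chinese word segmentation tool, and the masked positions would then be predicted as in the MLM-style objectives described in Section 2.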