HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish

Robert Mroczkowski1  Piotr Rybak1  Alina Wróblewska2  Ireneusz Gawlik1,3
1 ML Research at Allegro.pl
2 Institute of Computer Science, Polish Academy of Sciences
3 Department of Computer Science, AGH University of Science and Technology
{firstname.lastname}@allegro.pl, [email protected]

Abstract

BERT-based models are currently used for solving nearly all Natural Language Processing (NLP) tasks and most often achieve state-of-the-art results. Therefore, the NLP community conducts extensive research on understanding these models, but above all on designing effective and efficient training procedures. Several ablation studies investigating how to train BERT-like models have been carried out, but the vast majority of them concerned only the English language. A training procedure designed for English does not have to be universal and applicable to other, especially typologically different, languages. Therefore, this paper presents the first ablation study focused on Polish, which, unlike the isolating English language, is a fusional language. We design and thoroughly evaluate a pretraining procedure of transferring knowledge from multilingual to monolingual BERT-based models. In addition to multilingual model initialization, other factors that possibly influence pretraining are also explored, i.e. training objective, corpus size, BPE-Dropout, and pretraining length. Based on the proposed procedure, a Polish BERT-based language model – HerBERT – is trained. This model achieves state-of-the-art results on multiple downstream tasks.

1 Introduction

Recent advancements in self-supervised pretraining techniques have drastically changed the way we design Natural Language Processing (NLP) systems. Even though pretraining has been present in NLP for many years (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017), only recently have we observed a shift from task-specific to general-purpose models. In particular, the BERT model (Devlin et al., 2019) proved to be a dominant architecture and obtained state-of-the-art results for a variety of NLP tasks.

While most of the research related to analyzing and improving BERT-based models focuses on English, there is an increasing body of work aimed at training and evaluating models for other languages, including Polish. Thus far, a handful of models specific to Polish have been released, e.g. Polbert,1 the first version of HerBERT (Rybak et al., 2020), and Polish RoBERTa (Dadas et al., 2020). The aforementioned works lack ablation studies, making it difficult to attribute hyperparameter choices to model performance. In this work, we fill this gap by conducting an extensive set of experiments and developing an efficient BERT training procedure. As a result, we were able to train and release a new BERT-based model for Polish language understanding. Our model establishes a new state-of-the-art on a variety of downstream tasks including semantic relatedness, question answering, sentiment analysis, and part-of-speech tagging.

To summarize, our contributions are:

1. development and evaluation of an efficient pretraining procedure for transferring knowledge from multilingual to monolingual language models, based on work by Arkhipov et al. (2019),
2. detailed analysis and an ablation study challenging the effectiveness of the Sentence Structural Objective (SSO, Wang et al., 2020) and Byte Pair Encoding Dropout (BPE-Dropout, Provilkov et al., 2020),
3. release of HerBERT2 – a BERT-based model for Polish language understanding, which achieves state-of-the-art results on the KLEJ Benchmark (Rybak et al., 2020) and the POS tagging task (Wróblewska, 2020).

1 https://github.com/kldarek/polbert
2 https://huggingface.co/allegro/herbert-large-cased
Proceedings of the 8th BSNLP Workshop on Balto-Slavic Natural Language Processing, pages 1–10, April 20, 2021. ©2021 Association for Computational Linguistics

The rest of the paper is organized as follows. In Section 2, we provide an overview of related work. After that, Section 3 introduces the BERT-based language model and experimental setup used in this work. In Section 4, we conduct a thorough ablation study to investigate the impact of several design choices on the performance of downstream tasks. Next, in Section 5, we apply the drawn conclusions and describe the training of the HerBERT model. In Section 6, we evaluate HerBERT on a set of eleven tasks and compare its performance to other state-of-the-art models. Finally, we conclude our work in Section 7.

2 Related Work

The first significant ablation study of BERT-based language pretraining was described by Liu et al. (2019). The authors demonstrated the ineffectiveness of the Next Sentence Prediction (NSP) objective, the importance of dynamic token masking, and gains from using both a large batch size and a large training dataset. Further large-scale studies analyzed the relation between the model and training dataset sizes (Kaplan et al., 2020), the amount of compute used for training (Brown et al., 2020), and training strategies and objectives (Raffel et al., 2019).

Other work focused on studying and improving BERT training objectives. As mentioned before, the NSP objective was either removed (Liu et al., 2019) or enhanced, either by predicting the correct order of sentences (Sentence Order Prediction (SOP), Lan et al., 2020) or by discriminating between the previous, next, and a random sentence (Sentence Structural Objective (SSO), Wang et al., 2020). Similarly, the Masked Language Modelling (MLM) objective was either extended to predict spans of tokens (Joshi et al., 2019), extended to re-order shuffled tokens (Word Structural Objective (WSO), Wang et al., 2020), or replaced altogether with a binary classification problem using mask generation (Clark et al., 2020).

For tokenization, the Byte Pair Encoding algorithm (BPE, Sennrich et al., 2016) is commonly used. The original BERT model used the WordPiece implementation (Schuster and Nakajima, 2012), which was later replaced by SentencePiece (Kudo and Richardson, 2018). Gong et al. (2018) discovered that rare words lack semantic meaning. Provilkov et al. (2020) proposed the BPE-Dropout technique to solve this issue.

All of the above work was conducted for English language understanding. There was little research into understanding how different pretraining techniques affect BERT-based models for other languages. The main research focus was to train BERT-based models and report their performance on downstream tasks. The first such models were released for German3 and Chinese (Devlin et al., 2019), recently followed by Finnish (Virtanen et al., 2019), French (Martin et al., 2020; Le et al., 2020), Polish (Rybak et al., 2020; Dadas et al., 2020), Russian (Kuratov and Arkhipov, 2019), and many other languages.4 Research on developing and investigating an efficient procedure of pretraining BERT-based models was rather neglected in these languages.

Language understanding for low-resource languages has also been addressed by training jointly for several languages at the same time. That approach improves performance for moderate- and low-resource languages, as shown by Conneau and Lample (2019). The first model of this kind was the multilingual BERT trained for 104 languages (Devlin et al., 2019), followed by Conneau and Lample (2019) and Conneau et al. (2020).

3 https://deepset.ai/german-bert
4 https://huggingface.co/models

3 Experimental Setup

In this section, we describe the experimental setup used in the ablation study. First, we introduce the corpora we used to train models. Then, we give an overview of the language model architecture and training procedure. In particular, we describe the method of transferring knowledge from multilingual to monolingual BERT-based models. Finally, we present the evaluation tasks.

3.1 Training Data

We gathered six corpora to create two datasets on which we trained HerBERT. The first dataset (henceforth called Small) consists of corpora of the highest quality, i.e. NKJP, Wikipedia, and Wolne Lektury. The second dataset (Large) is over five times larger, as it additionally contains texts of lower quality (CCNet and Open Subtitles). Below, we present a short description of each corpus. Additionally, we include the basic corpora statistics in Table 1.

Corpus           Tokens   Documents   Avg len
Source Corpora
NKJP             1357M    3.9M        347
Wikipedia        260M     1.4M        190
Wolne Lektury    41M      5.5k        7447
CCNet Head       2641M    7.0M        379
CCNet Middle     3243M    7.9M        409
Open Subtitles   1056M    1.1M        961
Final Corpora
Small            1658M    5.3M        313
Large            8599M    21.3M       404

Table 1: Overview of all data sources used to train HerBERT. We combine them into two corpora. The Small corpus consists of the highest quality text resources: NKJP, Wikipedia, and Wolne Lektury. The Large corpus consists of all sources. Avg len is the average number of tokens per document in each corpus.

NKJP (Narodowy Korpus Języka Polskiego, eng. National Corpus of Polish) (Przepiórkowski, 2012) is a well balanced collection of Polish texts. […]

[…] representative parts of our corpus, i.e. an annotated subset of the NKJP, and the Wikipedia.

Subword regularization is supposed to emphasize the semantic meaning of tokens (Gong et al., 2018; Provilkov et al., 2020). To verify its impact on training a language model, we used BPE-Dropout (Provilkov et al., 2020) with a probability of dropping a merge equal to 10%.

Architecture  We followed the original BERT (Devlin et al., 2019) architectures for both the BASE (12 layers, 12 attention heads, and a hidden dimension of 768) and LARGE (24 layers, 16 attention heads, and a hidden dimension of 1024) variants.
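The BPE-Dropout regularization described above can be illustrated with a short sketch. This is a toy implementation under simplified assumptions (a hand-made merge table and greedy, priority-ordered merge application), not HerBERT's actual tokenizer: each applicable merge is skipped with the dropout probability, so the same word can be segmented differently across epochs.

```python
import random

# Toy merge table: pairs in priority order, as produced by BPE training.
# These merges are illustrative, not HerBERT's real vocabulary.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

def bpe_encode(word, merges, dropout=0.0, rng=random):
    """Greedy BPE segmentation; each applicable merge is skipped
    with probability `dropout` (Provilkov et al., 2020)."""
    tokens = list(word)
    for left, right in merges:  # apply merges in priority order
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right and rng.random() >= dropout:
                tokens[i:i + 2] = [left + right]  # merge the adjacent pair
            else:
                i += 1  # a dropped merge is simply skipped at this position
    return tokens

# Deterministic segmentation (no dropout):
print(bpe_encode("lower", MERGES))  # ['lower']
# With a 10% chance of dropping each merge, segmentations vary:
print(bpe_encode("lower", MERGES, dropout=0.1))
```

With dropout set to 0 the segmentation collapses to the usual single-token output; with the paper's 10% setting the model occasionally sees the same word split into smaller subwords, which is what exposes it to rare-token contexts.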
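For reference, the two encoder shapes given in the Architecture paragraph can be captured in a small configuration sketch (the class and field names here are illustrative, not taken from the authors' code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BertVariant:
    """Transformer encoder shape, following Devlin et al. (2019)."""
    num_layers: int
    num_attention_heads: int
    hidden_size: int

    @property
    def head_dim(self) -> int:
        # Each attention head operates in a hidden_size / num_heads subspace.
        return self.hidden_size // self.num_attention_heads

# The two variants used for HerBERT in the paper.
BASE = BertVariant(num_layers=12, num_attention_heads=12, hidden_size=768)
LARGE = BertVariant(num_layers=24, num_attention_heads=16, hidden_size=1024)
```

Note that both variants keep the per-head dimension at 64; LARGE gains capacity through depth and width rather than wider attention heads.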
Initialization  We initialized models either randomly or by using weights from XLM-RoBERTa (Conneau et al., 2020). In the latter case, the parameters of all layers except the word embeddings and token type embeddings were copied directly from the source model. Since XLM-RoBERTa does not use the NSP objective and does not have token type embeddings, we took them from the original BERT model.
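The initialization scheme above can be sketched in a few lines, with plain dictionaries standing in for model state dicts (all parameter names here are illustrative, not the real checkpoint keys):

```python
def init_from_multilingual(target, source, bert):
    """Copy all parameters from the multilingual source model except the
    word embeddings (the monolingual vocabulary differs) and the token
    type embeddings (absent in XLM-RoBERTa, taken from BERT instead)."""
    skip = {"word_embeddings", "token_type_embeddings"}
    for name, weights in source.items():
        if name not in skip:
            target[name] = weights  # transferred from XLM-RoBERTa
    # Token type embeddings come from the original BERT checkpoint.
    target["token_type_embeddings"] = bert["token_type_embeddings"]
    # Word embeddings keep their own initialization for the new vocabulary.
    return target

# Toy state dicts (single floats standing in for weight tensors):
xlmr = {"encoder.layer.0.weight": 0.5, "word_embeddings": 0.9}
bert = {"token_type_embeddings": 0.1}
herbert = init_from_multilingual({"word_embeddings": 0.0}, xlmr, bert)
```

The key design point is that only vocabulary-dependent parameters are excluded from the transfer; everything shape-compatible in the encoder is reused.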