Improving Pre-Trained Multilingual Models with Vocabulary Expansion
Hai Wang1* Dian Yu2 Kai Sun3* Jianshu Chen2 Dong Yu2
1 Toyota Technological Institute at Chicago, Chicago, IL, USA
2 Tencent AI Lab, Bellevue, WA, USA
3 Cornell, Ithaca, NY, USA
[email protected], [email protected], {yudian,jianshuchen,[email protected]}

* This work was done when H. W. and K. S. were at Tencent AI Lab, Bellevue, WA.

Abstract

Recently, pre-trained language models have achieved remarkable success in a broad range of natural language processing tasks. However, in the multilingual setting, it is extremely resource-consuming to pre-train a deep language model over large-scale corpora for each language. Instead of exhaustively pre-training monolingual language models independently, an alternative solution is to pre-train a powerful multilingual deep language model over large-scale corpora in hundreds of languages. However, the vocabulary size for each language in such a model is relatively small, especially for low-resource languages. This limitation inevitably hinders the performance of these multilingual models on tasks such as sequence labeling, wherein in-depth token-level or sentence-level understanding is essential. In this paper, inspired by previous methods designed for monolingual settings, we investigate two approaches (i.e., joint mapping and mixture mapping) based on the pre-trained multilingual BERT model for addressing the out-of-vocabulary (OOV) problem on a variety of tasks, including part-of-speech tagging, named entity recognition, machine translation quality estimation, and machine reading comprehension. Experimental results show that using mixture mapping is more promising. To the best of our knowledge, this is the first work that attempts to address and discuss the OOV issue in multilingual settings.

1 Introduction

It has been shown that performance on many natural language processing tasks drops dramatically on held-out data when a significant percentage of words do not appear in the training data, i.e., out-of-vocabulary (OOV) words (Søgaard and Johannsen, 2012; Madhyastha et al., 2016). A higher OOV rate (i.e., the percentage of unseen words in the held-out data) may lead to a more severe performance drop (Kaljahi et al., 2015).
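To make the OOV-rate notion concrete, here is a minimal sketch in Python that computes the percentage of held-out tokens absent from the training vocabulary. The token lists are toy placeholders, not the corpora used in this paper.

```python
# Minimal sketch: OOV rate = percentage of held-out tokens unseen in training.
# The token lists below are illustrative placeholders, not the paper's data.

def oov_rate(train_tokens, heldout_tokens):
    """Return the fraction of held-out tokens that are out-of-vocabulary."""
    vocab = set(train_tokens)
    if not heldout_tokens:
        return 0.0
    unseen = sum(1 for tok in heldout_tokens if tok not in vocab)
    return unseen / len(heldout_tokens)

if __name__ == "__main__":
    train = "the cat sat on the mat".split()
    heldout = "the dog sat on the rug".split()
    print(f"OOV rate: {oov_rate(train, heldout):.1%}")  # 2 of 6 tokens unseen -> 33.3%
```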
OOV problems have been addressed in previous work under monolingual settings, through replacing OOV words with their semantically similar in-vocabulary words (Madhyastha et al., 2016; Kolachina et al., 2017), or by using character/word information (Kim et al., 2016, 2018; Chen et al., 2018) or subword information such as byte pair encoding (BPE) (Sennrich et al., 2016; Stratos, 2017).

Recently, fine-tuning a pre-trained deep language model, such as Generative Pre-Training (GPT) (Radford et al., 2018) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), has achieved remarkable success on various downstream natural language processing tasks. Instead of pre-training many monolingual models like the existing English GPT, English BERT, and Chinese BERT, a more natural choice is to develop a powerful multilingual model such as the multilingual BERT. However, all these pre-trained models rely on language modeling, where a common trick is to tie the weights of the softmax and the word embeddings (Press and Wolf, 2017). Due to the expensive computation of the softmax (Yang et al., 2017) and the data imbalance across different languages, the vocabulary size for each language in a multilingual model is relatively small compared to that of the monolingual BERT/GPT models, especially for low-resource languages. Even for a high-resource language like Chinese, its vocabulary size of 10k in the multilingual BERT is only half the size of that in the Chinese BERT. Just as in monolingual settings, the OOV problem also hinders the performance of a multilingual model on tasks that are sensitive to token-level or sentence-level information. For example, in the POS tagging problem (Table 2), 11 out of 16 languages have significant OOV issues (OOV rate ≥ 5%) when using multilingual BERT.

According to previous work (Radford et al., 2018; Devlin et al., 2018), it is time-consuming and resource-intensive to pre-train a deep language model over large-scale corpora. To address the OOV problems, instead of pre-training a deep model with a large vocabulary, we aim at enlarging the vocabulary size when we fine-tune a pre-trained multilingual model on downstream tasks.

We summarize our contributions as follows: (i) We investigate and compare two methods to alleviate the OOV issue. To the best of our knowledge, this is the first attempt to address the OOV problem in multilingual settings. (ii) By using English as an interlingua, we show that bilingual information helps alleviate the OOV issue, especially for low-resource languages. (iii) We conduct extensive experiments on a variety of token-level and sentence-level downstream tasks to examine the strengths and weaknesses of these methods, which may provide key insights into future directions.[1]

[1] Improved models will be available at https://github.com/sohuren/multilingul-bert.

2 Approach

We use the multilingual BERT as the pre-trained model. We first introduce the pre-training procedure of this model (Section 2.1) and then introduce two methods we investigate to alleviate the OOV issue by expanding the vocabulary (Section 2.2). Note that these approaches are not restricted to BERT but are also applicable to other similar models.

2.1 Pre-Trained BERT

Compared to GPT (Radford et al., 2018) and ELMo (Peters et al., 2018), BERT (Devlin et al., 2018) uses a bidirectional transformer, whereas GPT pre-trains a left-to-right transformer (Liu et al., 2018); ELMo (Peters et al., 2018) independently trains left-to-right and right-to-left LSTMs (Peters et al., 2017) to generate representations as additional features for end tasks.

In the pre-training stage, Devlin et al. (2018) use two objectives: masked language model (LM) and next sentence prediction (NSP). In masked LM, they randomly mask some input tokens and then predict these masked tokens. Compared to a unidirectional LM, masked LM enables representations to fuse the context from both directions. In the NSP task, given a pair of sentences, the model predicts whether the second sentence is the actual next sentence of the first. The purpose of adding the NSP objective is that many downstream tasks, such as question answering and language inference, require sentence-level understanding, which is not directly captured by LM objectives.
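As a brief illustration of the masked-LM objective just described, the sketch below randomly replaces a fraction of input tokens with a [MASK] symbol and keeps the originals as prediction targets. It is a simplified toy version, not the actual BERT pre-training code: Devlin et al. (2018) additionally substitute random tokens for, or leave unchanged, some of the selected positions, which is omitted here.

```python
import random

# Toy masked-LM data preparation: select ~15% of tokens, replace them with
# [MASK], and remember the originals as the prediction targets.

def mask_tokens(tokens, mask_prob=0.15, mask_symbol="[MASK]", seed=0):
    """Return (masked_tokens, targets); targets[i] is the original token at
    masked positions and None elsewhere."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_symbol)
            targets.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            targets.append(None)  # not a prediction target
    return masked, targets

masked, targets = mask_tokens("the quick brown fox jumps over the lazy dog".split())
print(masked)
print(targets)
```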
After pre-training on large-scale corpora like Wikipedia and BookCorpus (Zhu et al., 2015), we follow recent work (Radford et al., 2018; Devlin et al., 2018) to fine-tune the pre-trained model on different downstream tasks with minimal architecture adaptation. We show how to adapt BERT to different downstream tasks in Figure 1 (left).

Figure 1: Left: fine-tuning BERT on different kinds of end tasks. Right: illustration of joint and mixture mapping (in this example, during mixture mapping, we represent e(cer) = 0.7 * e(er) + 0.2 * e(or) + 0.1 * e(ch)).

2.2 Vocabulary Expansion

Devlin et al. (2018) pre-train the multilingual BERT on Wikipedia in 102 languages, with a shared vocabulary that contains 110k subwords calculated from the WordPiece model (Wu et al., 2016). If we ignore the subwords shared between languages, on average, each language has a 1.1k vocabulary, which is significantly smaller than that of a monolingual pre-trained model such as GPT (40k). The OOV problem tends to be less serious for languages (e.g., French and Spanish) that belong to the same language family as English. However, this is not always true, especially for morphologically rich languages such as German (Ataman and Federico, 2018; Lample et al., 2018). The OOV problem is much more severe in low-resource scenarios, especially when a language (e.g., Japanese and Urdu) uses an entirely different character set from high-resource languages.

We focus on addressing the OOV issue at the subword level in multilingual settings. Formally, suppose we have an embedding $E_{bert}$ extracted from the (non-contextualized) embedding layer of the multilingual BERT (i.e., the first layer of BERT), and suppose we have another set of (non-contextualized) subword embeddings $\{E_{l_1}, E_{l_2}, \ldots, E_{l_n}\} \cup \{E_{en}\}$, which are pre-trained on large corpora using any standard word embedding toolkit. Specifically, $E_{en}$ represents the pre-trained embedding for English, and $E_{l_i}$ represents the pre-trained embedding for the non-English language $l_i$ at the subword level. We denote the vocabularies of $E_{bert}$, $E_{en}$, and $E_{l_i}$ by $V_{bert}$, $V_{en}$, and $V_{l_i}$, respectively. For each subword $w$ in $V_{bert}$, we use $E_{bert}(w)$ to denote the pre-trained embedding of word $w$ in $E_{bert}$; $E_{l_i}(\cdot)$ and $E_{en}(\cdot)$ are defined in a similar way as $E_{bert}(\cdot)$. For each non-English language $l \in \{l_1, l_2, \ldots, l_n\}$, we aim to enrich $E_{bert}$ with more subwords from the vocabulary of $E_{l_i}$, since $E_{l_i}$ contains a larger vocabulary of language $l_i$ compared to $E_{bert}$.

As there is no previous work to address multilingual OOV issues, we investigate two approaches inspired by methods designed for monolingual settings: joint mapping and mixture mapping (illustrated in Figure 1, right). In mixture mapping, to obtain the subword embeddings, we represent each subword in $E'_l$ (described in joint mapping) as a mixture of English subwords, where those English subwords are already in the BERT vocabulary $V_{bert}$.
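To make mixture mapping concrete, the sketch below reconstructs the Figure 1 example, e(cer) = 0.7 * e(er) + 0.2 * e(or) + 0.1 * e(ch): the embedding of an unseen non-English subword is a probability-weighted sum of English subword embeddings already in $V_{bert}$. The embedding vectors and mixture weights here are illustrative placeholders; how the weights are actually estimated (e.g., from the phrase-table-style alignments shown in Figure 1) is described later in the paper and is not reproduced in this sketch.

```python
import numpy as np

# Minimal sketch of mixture mapping, following the Figure 1 example:
# e(cer) = 0.7 * e(er) + 0.2 * e(or) + 0.1 * e(ch).
# E_bert maps subwords already in the BERT vocabulary to their embeddings;
# the vectors and mixture weights below are illustrative placeholders.

rng = np.random.default_rng(0)
dim = 4  # toy dimensionality; multilingual BERT actually uses 768
E_bert = {w: rng.normal(size=dim) for w in ["er", "or", "ch"]}

# Mixture weights for the OOV subword "cer" over in-vocabulary English
# subwords (e.g., derived from phrase-table-style alignment probabilities).
mixture = {"er": 0.7, "or": 0.2, "ch": 0.1}

def mixture_embedding(weights, embeddings):
    """Weighted sum of in-vocabulary subword embeddings."""
    return sum(p * embeddings[w] for w, p in weights.items())

e_cer = mixture_embedding(mixture, E_bert)
print(e_cer)  # new row to append to the expanded BERT embedding matrix
```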