Improving Pre-Trained Multilingual Models with Vocabulary Expansion

Hai Wang1*, Dian Yu2, Kai Sun3*, Jianshu Chen2, Dong Yu2
1Toyota Technological Institute at Chicago, Chicago, IL, USA
2Tencent AI Lab, Bellevue, WA, USA
3Cornell, Ithaca, NY, USA
Abstract

Recently, pre-trained language models have achieved remarkable success in a broad range of natural language processing tasks. However, in a multilingual setting, it is extremely resource-consuming to pre-train a deep language model over large-scale corpora for each language. Instead of exhaustively pre-training monolingual language models independently, an alternative solution is to pre-train a powerful multilingual deep language model over large-scale corpora in hundreds of languages. However, the vocabulary size for each language in such a model is relatively small, especially for low-resource languages. This limitation inevitably hinders the performance of these multilingual models on tasks such as sequence labeling, wherein in-depth token-level or sentence-level understanding is essential.

Natural language processing systems often struggle with words that are unseen in the training data, i.e., out-of-vocabulary (OOV) words (Søgaard and Johannsen, 2012; Madhyastha et al., 2016). A higher OOV rate (i.e., the percentage of unseen words in the held-out data) may lead to a more severe performance drop (Kaljahi et al., 2015).
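As a concrete illustration of the OOV rate defined above, the following Python sketch computes the percentage of held-out tokens that never appear in the training data; the toy token lists are hypothetical and serve only as an example.

    # Minimal sketch: OOV rate as the percentage of held-out tokens
    # that are unseen in the training data (toy, hypothetical data).

    def oov_rate(train_tokens, heldout_tokens):
        """Percentage of held-out tokens absent from the training vocabulary."""
        vocab = set(train_tokens)
        unseen = sum(1 for tok in heldout_tokens if tok not in vocab)
        return 100.0 * unseen / len(heldout_tokens)

    train = "the cat sat on the mat".split()
    heldout = "the dog sat on the rug".split()
    print(f"OOV rate: {oov_rate(train, heldout):.1f}%")  # 2 of 6 tokens unseen -> 33.3%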
OOV problems have been addressed in previous work under monolingual settings by replacing OOV words with semantically similar in-vocabulary words (Madhyastha et al., 2016; Kolachina et al., 2017), or by using character/word information (Kim et al., 2016, 2018; Chen et al., 2018) or subword information such as byte pair encoding (BPE) (Sennrich et al., 2016; Stratos, 2017).
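As a rough illustration of the OOV-replacement strategy mentioned above, the sketch below maps an OOV word to its most similar in-vocabulary word by cosine similarity over word embeddings. The toy vectors and words are hypothetical, and the sketch is a simplified stand-in for the cited methods rather than a reproduction of any of them.

    import numpy as np

    # Toy, hypothetical embeddings for three in-vocabulary words.
    in_vocab_emb = {
        "good":  np.array([0.9, 0.1, 0.0]),
        "movie": np.array([0.1, 0.8, 0.2]),
        "bad":   np.array([-0.8, 0.1, 0.1]),
    }

    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def replace_oov(oov_vec, vocab_emb):
        """Return the in-vocabulary word whose embedding is closest to the OOV word."""
        return max(vocab_emb, key=lambda w: cosine(oov_vec, vocab_emb[w]))

    # Suppose "great" is OOV for the task model but has an external embedding.
    great_vec = np.array([0.85, 0.15, 0.05])
    print(replace_oov(great_vec, in_vocab_emb))  # -> "good"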
Recently, fine-tuning a pre-trained deep language model, such as Generative Pre-Training (GPT) (Radford et al., 2018) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), has achieved re-