Adapting Monolingual Models: Data can be Scarce when Language Similarity is High

Wietse de Vries∗, Martijn Bartelds∗, Malvina Nissim and Martijn Wieling
University of Groningen, The Netherlands
{wietse.de.vries, m.bartelds, m.nissim, [email protected]

Abstract

For many (minority) languages, the resources needed to train large models are not available. We investigate the performance of zero-shot transfer learning with as little data as possible, and the influence of language similarity in this process. We retrain the lexical layers of four BERT-based models using data from two low-resource target language varieties, while the Transformer layers are independently fine-tuned on a POS-tagging task in the model’s source language. By combining the new lexical layers and fine-tuned Transformer layers, we achieve high task performance for both target languages. With high language similarity, 10MB of data appears sufficient to achieve substantial monolingual transfer performance.

Languages that are not included in mBERT pre-training usually show poor performance (Nozza et al., 2020; Wu and Dredze, 2020).

An alternative to multilingual transfer learning is the adaptation of existing monolingual models to other languages. Zoph et al. (2016) introduce a method for transferring a pre-trained machine translation model to lower-resource languages by only fine-tuning the lexical layer. This method has also been applied to BERT (Artetxe et al., 2020) and GPT-2 (de Vries and Nissim, 2020). Artetxe et al. (2020) also show that BERT models with retrained lexical layers perform well in downstream tasks, but comparatively high performance has only been demonstrated for languages for which at least 400MB of data is available.
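To make the lexical-layer retraining setup concrete, the sketch below shows one possible implementation with the Hugging Face Transformers library. It is a minimal illustration rather than the implementation used in this paper: the model name "bert-base-cased", the tokenizer and model paths, and the omitted training loop are assumptions made only for the example.

from transformers import (
    AutoModelForMaskedLM,
    AutoModelForTokenClassification,
    AutoTokenizer,
)

# Source-language BERT model and a tokenizer trained on the target variety.
# "bert-base-cased" and the paths are placeholders, not the models used here.
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
target_tokenizer = AutoTokenizer.from_pretrained("path/to/target-tokenizer")

# Resize the lexical layer to the target vocabulary (new rows are randomly
# initialised) and freeze all other parameters, so that masked language
# modelling on target-language text only updates the word embeddings.
mlm_model.resize_token_embeddings(len(target_tokenizer))
for param in mlm_model.parameters():
    param.requires_grad = False
for param in mlm_model.get_input_embeddings().parameters():
    param.requires_grad = True

# ... run masked language modelling on the (small) target-language corpus,
#     e.g. with transformers.Trainer ...

# Independently, a copy of the source-language model is fine-tuned on
# POS tagging in the source language (not shown). The two parts are then
# combined by copying the retrained target-language embeddings into the
# fine-tuned task model for zero-shot evaluation on the target variety.
pos_model = AutoModelForTokenClassification.from_pretrained("path/to/pos-model")
pos_model.resize_token_embeddings(len(target_tokenizer))
pos_model.set_input_embeddings(mlm_model.get_input_embeddings())

In standard BERT implementations the input embeddings are tied to the masked language modelling output weights, so training only the embedding parameters also adapts that output projection.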