BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding

Abhik Bhattacharjee1, Tahmid Hasan1, Kazi Samin1, Md Saiful Islam2, Anindya Iqbal1, M. Sohel Rahman1, Rifat Shahriyar1
Bangladesh University of Engineering and Technology (BUET)1, University of Rochester2
{abhik,samin}@ra.cse.buet.ac.bd, [email protected], [email protected], {anindya,msrahman,rifat}@cse.buet.ac.bd

Abstract

In this paper, we introduce "Embedding Barrier", a phenomenon that limits the monolingual performance of multilingual models on low-resource languages having unique typologies. We build 'BanglaBERT', a Bangla language model pretrained on 18.6 GB of Internet-crawled data, and benchmark it on five standard NLU tasks. We discover a significant drop in the performance of the state-of-the-art multilingual model (XLM-R) compared to BanglaBERT and attribute this to the Embedding Barrier through comprehensive experiments.

[Figure 1 example sentences:
en: Joe went there today for the first time since 2010.
bn: জো ২০১০ সালের পর আজ প্রথমবারের মত সেখানে গেল।
sw: Joe alienda pale leo kwa mara ya kwanza tangu 2010.]

Figure 1: We present one sentence from each of the three languages we study: English (en), Bangla (bn), and Swahili (sw). Swahili and English have the same typology, allowing Swahili to use better (subword) tokenizations of common entities like names, numerals, etc. (colored blue) by readily borrowing them from English in a multilingual vocabulary. Some tokens (colored orange), despite having different meanings, may also be borrowed and can later be contextualized for the corresponding language. In contrast, Bangla, having a completely different script, does not enjoy this benefit.
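The borrowing effect in Figure 1 can be sketched with a toy greedy longest-match subword tokenizer over a shared vocabulary. This is an illustrative invention, not the paper's actual tokenizer: real multilingual models such as XLM-R use a learned SentencePiece vocabulary of ~250k pieces, and the miniature `VOCAB` below contains only words from the figure.

```python
# Toy shared subword vocabulary (hypothetical miniature of a
# multilingual vocabulary; entries taken from Figure 1's sentences).
VOCAB = {
    "Joe", "2010", "went", "there", "today", "for", "the", "first",
    "time", "since",                      # English pieces
    "alien", "da", "pale", "leo", "kwa",  # Swahili can reuse "alien"
    "mara", "ya", "kwanza", "tangu",      # (an English word) as a piece
}

def greedy_tokenize(word, vocab):
    """Greedy longest-match-first subword split; pieces not in the
    vocabulary fall back to single characters (an <unk>-like path)."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # character fallback
            i += 1
    return tokens

# Same-script sharing: Swahili "alienda" ("went") reuses the English
# piece "alien" despite its different meaning, as in the figure.
print(greedy_tokenize("Joe", VOCAB))      # whole-token match
print(greedy_tokenize("alienda", VOCAB))  # ["alien", "da"]
# Bangla, written in a different script, finds no shared pieces and
# degrades to per-character tokens:
print(greedy_tokenize("জো", VOCAB))
```

The Bangla word splits into isolated characters because nothing in the shared (Latin-script) vocabulary matches its script, which is the intuition behind the Embedding Barrier the paper develops.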