BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding
Abhik Bhattacharjee1, Tahmid Hasan1, Kazi Samin1, Md Saiful Islam2, Anindya Iqbal1, M. Sohel Rahman1, Rifat Shahriyar1
Bangladesh University of Engineering and Technology (BUET)1, University of Rochester2
{abhik,samin}@ra.cse.buet.ac.bd, [email protected], [email protected], {anindya,msrahman,rifat}@cse.buet.ac.bd
arXiv:2101.00204v2 [cs.CL] 28 Aug 2021

Abstract

In this paper, we introduce the "Embedding Barrier", a phenomenon that limits the monolingual performance of multilingual models on low-resource languages having unique typologies. We build 'BanglaBERT', a Bangla language model pretrained on 18.6 GB of Internet-crawled data, and benchmark it on five standard NLU tasks. We discover a significant performance drop of the state-of-the-art multilingual model (XLM-R) compared to BanglaBERT and attribute this to the Embedding Barrier through comprehensive experiments. We identify that a multilingual model's performance on a low-resource language is hurt when its writing script is not similar to that of any of the high-resource languages. To tackle the barrier, we propose a straightforward solution of transcribing languages to a common script, which can effectively improve the performance of a multilingual model for the Bangla language. As a by-product of the standard NLU benchmarks, we introduce a new downstream dataset on natural language inference (NLI) and show that BanglaBERT outperforms previous state-of-the-art results on all tasks by up to 3.5%. We are making the BanglaBERT language model and the new Bangla NLI dataset publicly available in the hope of advancing the community. The resources can be found at https://github.com/csebuetnlp/banglabert.

en: Joe went there today for the first time since 2010.
bn: জো ২০১০ সালের পর আজ প্রথমবারের মত সেখানে গেল।
sw: Joe alienda pale leo kwa mara ya kwanza tangu 2010.

Figure 1: We present one sentence from each of the three languages we study: English (en), Bangla (bn), and Swahili (sw). Swahili and English have the same typology, allowing Swahili to use better (subword) tokenizations of common entities like names, numerals, etc. (colored blue in the original figure) by readily borrowing them from English in a multilingual vocabulary. Some tokens (colored orange), despite having different meanings, may also be borrowed and can later be contextualized for the corresponding language. In contrast, Bangla, having a completely different script, does not enjoy this benefit.

1 Introduction

Self-supervised pretraining (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019) has become a standard practice in natural language processing. This pretraining stage allows language models to learn general linguistic representations (Jawahar et al., 2019) that can later be fine-tuned for different downstream natural language understanding (NLU) tasks (Wang et al., 2018; Rajpurkar et al., 2016; Tjong Kim Sang and De Meulder, 2003). However, the NLU field is yet to be democratized, as most notable works are restricted to only a few high-resource languages. For example, despite Bangla being the sixth most spoken language, there has not been any comprehensive study on Bangla NLU. With these shortcomings in mind, we build BanglaBERT, a Transformer-based (Vaswani et al., 2017) NLU model for Bangla pretrained on 18.6 GB of data crawled from 60 popular Bangla websites; introduce a new NLI dataset (a task previously unexplored in Bangla); and evaluate BanglaBERT on five downstream tasks. BanglaBERT outperforms multilingual baselines and previous state-of-the-art results on all tasks by up to 3.5%.[1]

[1] We are releasing the BanglaBERT model and the NLI dataset to advance the Bangla NLP community.

Although there have been efforts to pretrain multilingual NLU models (Pires et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020a) that can generalize to hundreds of languages, low-resource languages are not their primary focus, leaving them underrepresented. Moreover, these models often trail in performance compared to language-specific models (Canete et al., 2020; Antoun et al., 2020; Nguyen and Tuan Nguyen, 2020), resulting in a surge of interest in pretraining language-specific models in the NLP community. However, pretraining is highly expensive in terms of data and computational resources, rendering it very difficult for low-resource communities.

To systematically analyze the performance gap in multilingual models, we use BanglaBERT as a pivot and compare it with the best-performing multilingual model, XLM-RoBERTa (XLM-R) (Conneau et al., 2020a), on the Bangla NLU tasks. We identify that even though XLM-R was pretrained with a similar objective, a comparable amount of text, and the same architecture as BanglaBERT, the vocabulary dedicated to the Bangla script in its embedding layer was less than 1% of the total vocabulary (and less than a tenth the size of BanglaBERT's vocabulary), which limited the representation ability of XLM-R for Bangla. Also, since Bangla does not share its vocabulary or writing script with any other high-resource language, the vocabulary is strictly limited within that bound. We name this phenomenon the 'Embedding Barrier'.
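To make the vocabulary-share argument concrete, the short sketch below (an illustration added here, not the paper's exact measurement procedure) counts how many entries of a public XLM-R vocabulary contain characters from the Bengali Unicode block. It assumes the Hugging Face transformers library and uses the xlm-roberta-base checkpoint as a stand-in for the model discussed above.

```python
from transformers import AutoTokenizer

# Illustrative sketch: estimate how much of XLM-R's subword vocabulary is
# dedicated to the Bangla (Bengali) script by checking each token for
# characters in the Bengali Unicode block (U+0980-U+09FF).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
vocab = tokenizer.get_vocab()  # maps token string -> integer id

def uses_bangla_script(token: str) -> bool:
    return any("\u0980" <= ch <= "\u09ff" for ch in token)

bangla_tokens = sum(1 for token in vocab if uses_bangla_script(token))
share = 100 * bangla_tokens / len(vocab)
print(f"{bangla_tokens}/{len(vocab)} tokens ({share:.2f}%) contain Bangla-script characters")
```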
Furthermore, we conduct a series of experiments to establish that the Embedding Barrier is indeed a key contributing factor to the performance degradation of XLM-R. Through our experiments, we show that:

1. There is a considerable loss in performance (up to 4% accuracy) because of the Embedding Barrier when vocabularies are limited.
2. Even a distant low-resource language having the same typology as a high-resource one may benefit from borrowing (subword) tokens (albeit having different meanings).
3. On the contrary, linguistically similar languages may be hindered from contributing to each other's monolingual performance because of being written in different scripts.
4. Sister languages with identical scripts can boost each other's monolingual performance even when they are low-resource.

In addition to analyzing the Embedding Barrier, we propose a simple solution to overcome it: transcribing texts into a common script. Our proposed method alleviates the barrier by increasing the effective vocabulary size beyond the native script. Although not a perfect solution, it establishes that having an uncommon typology indeed limits the performance of multilingual models. We call for new ways to represent vocabularies of low-resource languages in multilingual models that can better lift this barrier.
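As a minimal sketch of what transcription into a common script can look like (the actual transliteration scheme used in the paper is not detailed in this section, so this is only an assumption-laden illustration), the Bengali (U+0980-U+09FF) and Devanagari (U+0900-U+097F) Unicode blocks are laid out largely in parallel, so a fixed codepoint offset maps most Bangla letters onto the script shared by several higher-resource languages such as Hindi.

```python
# Illustrative sketch only: crude character-level transcription of Bangla text
# into Devanagari by shifting codepoints between the two parallel Unicode
# blocks. This is not the transliteration scheme used in the paper, and a few
# Bangla-specific characters do not map cleanly.
BLOCK_OFFSET = 0x0980 - 0x0900  # distance between the Bengali and Devanagari blocks

def bangla_to_devanagari(text: str) -> str:
    out = []
    for ch in text:
        code = ord(ch)
        if 0x0980 <= code <= 0x09FF:          # character from the Bengali block
            out.append(chr(code - BLOCK_OFFSET))
        else:                                  # leave everything else untouched
            out.append(ch)
    return "".join(out)

print(bangla_to_devanagari("বাংলা"))  # -> approximate Devanagari rendering of the word
```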
2 BanglaBERT

In this section, we describe the pretraining data, pre-processing steps, model architecture, objectives, and downstream task benchmarks of BanglaBERT.

2.1 Pretraining Data

A high volume of good-quality text data is a prerequisite for pretraining large language models. For instance, BERT (Devlin et al., 2019) is pretrained on the English Wikipedia and the Books corpus (Zhu et al., 2015), containing 3.3 billion tokens. Subsequent works like RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019) used an order of magnitude more web-crawled data that had passed through heavy automatic filtering and cleaning. Bangla is a rather resource-constrained language in the web domain; for example, the Bangla Wikipedia dump from June 2020 is only 350 megabytes in size, two orders of magnitude smaller than the English Wikipedia. As a result, we had to crawl the web extensively to collect our pretraining data. We selected 60 Bangla websites by their Amazon Alexa website rankings[2] and volume of extractable text. The contents included encyclopedias, news, blogs, e-books, and story sites. The amount of collected data totaled around 30 GB. There are other noisy data dumps publicly available, a prominent one being OSCAR (Suárez et al., 2019). However, the OSCAR dataset contained a lot of offensive texts and repetitive content, which we found too difficult to clean thoroughly, and hence opted not to use it.

[2] www.alexa.com/topsites/countries/BD

2.2 Pre-processing

We performed thorough deduplication on the pretraining data, removed non-textual contents (e.g., HTML/JavaScript tags), and filtered out non-Bangla pages using a language classifier (Joulin et al., 2017). After this processing, the dataset was reduced to 18.6 GB in size. A detailed description of the corpus can be found in the Appendix.

We trained a WordPiece (Wu et al., 2016) vocabulary of 32k tokens on the resulting corpus with a 400-character alphabet, kept larger than the native Bangla alphabet to capture code-switching (Poplack, 1980) and allow romanized Bangla contents for better generalization. While creating training samples, we limited the sequence length to 512 tokens and did not cross document boundaries (Liu et al., 2019).

2.3 Pretraining Objective and Model

Language models are pretrained with self-supervised objectives that can take advantage of unannotated text data. BERT (Devlin et al., 2019) was pretrained with masked language modeling (MLM) and next sentence prediction (NSP).

Pretraining was carried out on a p3.16xlarge instance on AWS. We used the Adam (Kingma and Ba, 2015) optimizer with a 2e-4 learning rate and a linear warmup of 10,000 steps.

2.4 Downstream Tasks and Results

Model/Baseline               | SC (Acc.) | EC (Weighted F-1) | DC (Acc.) | NER (Macro F-1) | NLI (Acc.)
Previous SoTA                | 85.67     | 69.73             | 72.50     | 57.44           | N/A
mBERT                        | 83.39     | 56.02             | 98.64     | 67.40           | 71.41
XLM-R                        | 89.49     | 66.70             | 98.71     | 70.63           | 76.76
monsoon-nlp/bangla-electra   | 73.54     | 34.55             | 97.64     | 52.57           | 64.43
sagorsarker/bangla-bert-base | 87.30     | 61.51             | 98.79     | 70.97           | 69.18
BanglaBERT (ours)            | 92.18     | 72.75             | 99.07     | 71.54           | 80.15

Table 1: Performance comparison on different downstream tasks
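As a usage-oriented sketch of approaching one of the downstream classification tasks in Table 1 (not the fine-tuning recipe of the paper), the snippet below assumes the released checkpoint is available on the Hugging Face Hub under the identifier csebuetnlp/banglabert; only the GitHub link above is given in the text, so this Hub id is an assumption. The classification head it attaches is randomly initialized and would still need to be fine-tuned on task data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sketch: load the (assumed) released checkpoint with a sequence-classification
# head for a sentiment-style task. The head is untrained until fine-tuned.
model_name = "csebuetnlp/banglabert"  # assumed Hugging Face Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer(
    "বইটি খুব ভালো লেগেছে।",  # "I liked the book very much." -- a sentiment-style example
    return_tensors="pt",
    truncation=True,
    max_length=512,           # matches the 512-token limit used during pretraining
)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities from the not-yet-fine-tuned head
```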