Arxiv:2106.03958V2 [Cs.CL] 9 Jun 2021 Can Be an Effective Way to Adapt Lms for Lrls, However, There Is a Challenge

Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study Yash Khemchandani1∗ Sarvesh Mehtani1∗ Vaidehi Patil1 Abhijeet Awasthi1 Partha Talukdar2 Sunita Sarawagi1 1Indian Institute of Technology Bombay, India 2Google Research, India fyashkhem,smehtani,awasthi,[email protected] [email protected], [email protected] Abstract Recent research in multilingual language models (LM) has demonstrated their ability to effectively handle multiple languages in a single model. This holds promise for low web-resource languages (LRL) as multilingual models can enable transfer of supervision from high resource languages to LRLs. However, in- corporating a new language in an LM still re- mains a challenge, particularly for languages with limited corpora and in unseen scripts. In Figure 1: Number of wikipedia articles for top-few In- this paper we argue that relatedness among lan- dian Languages and English. The height of the English guages in a language family may be exploited bar is not to scale as indicated by the break. Number of to overcome some of the corpora limitations of English articles is roughly 400x more than articles in LRLs, and propose RelateLM. We focus on In- Oriya and 800x more than articles in Assamese. dian languages, and exploit relatedness along two dimensions: (1) script (since many In- dic scripts originated from the Brahmic script), languages. For example, the Multilingual BERT and (2) sentence structure. RelateLM uses (mBERT) (Devlin et al., 2019) model was trained transliteration to convert the unseen script of on 104 different languages. When fine-tuned for limited LRL text into the script of a Re- various downstream tasks, multilingual LMs have lated Prominent Language (RPL) (Hindi in our demonstrated significant success in generalizing case). While exploiting similar sentence struc- across languages (Hu et al., 2020; Conneau et al., tures, RelateLM utilizes readily available bilingual dictionaries to pseudo translate RPL text 2019). Thus, such models make it possible to into LRL corpora. Experiments on multiple transfer knowledge and resources from resource real-world benchmark datasets provide valida- rich languages to Low Web-Resource Languages tion to our hypothesis that using a related lan- (LRL). This has opened up a new opportunity to- guage as pivot, along with transliteration and wards rapid development of language technologies pseudo translation based data augmentation, for LRLs. arXiv:2106.03958v2 [cs.CL] 9 Jun 2021 can be an effective way to adapt LMs for LRLs, However, there is a challenge. The current rather than direct training or pivoting through English. paradigm for training Mutlilingual LM requires text corpora in the languages of interest, usually in 1 Introduction large volumes. However, such text corpora is often available in limited quantities for LRLs. For exam- BERT-based pre-trained language models (LMs) ple, in Figure1 we present the size of Wikipedia, have enabled significant advances in NLP (Devlin a common source of corpora for training LMs, for et al., 2019; Liu et al., 2019; Lan et al., 2020). Pre- top-few scheduled Indian languages1 and English. trained LMs have also been developed for the mul- The top-2 languages are just one-fiftieth the size of tilingual setting, where a single multilingual model is capable of handling inputs from many different 1According to Indian Census 2011, more than 19,500 languages or dialects are spoken across the country, with 121 of ∗Authors contributed equally them being spoken by more than 10 thousand people. इक है … … ਇਹ ਇਕ ਬਾਂਦਰ ਹੈਬ拓ਦਰ इह इक है बांदर है इह MASK बांदर है ... ... MLM Training … इह एक है बंदर है ... … … इह इक है बांदर है ਇਹ : एक है इक है : एक है ਬ拓ਦਰ : बंदर, वानर बांदर : बंदर, वानर ... ... इह एक है बंदर है Alignment Loss … … अ楍छा : ਚੰਗਾ, ਵਧੀਆ अ楍छा : चँगा, वधीआ बालक है : ਮੁੰਡਾ बालक है : मअच्छा बालकहैुँडा ... ... … राम अच्छाबालकहै चँगा मअच्छा बालकहैुँडा है राम अच्छा बालक हैचँगामअच्छाबालकहैुँडा ... राम अच्छाबालकहै अ楍छा बालक है है … Alignment Loss राम अच्छा बालक हैअ楍छाबालक ... Transliteration Pre-training with Pseudo Translation into R’s script MLM and Alignment Figure 2: Pre-training with MLM and Alignment loss in RelateLM with LRL L as Punjabi (pa) in Gurumukhi script and RPL R as Hindi (hi) in Devanagari script. RelateLM first transliterates LRL text in the monolingual corpus (DL) and bilingual dictionaries (BL!R and BR!L) to the script of the RPL R. The transliterated bilingual dictionaries are then used to pseudo translate the RPL corpus (DR) and transliterated LRL corpus (DLR ). This pseudo translated data is then used to adapt the given LM M for the target LRL L using a combination of Masked Language Model (MLM) and alignment losses. For notations and further details, please see Section3. English, and yet Hindi is seven times larger than as they are all part of the Indo-Aryan family, ge- the O(20,000) documents of languages like Oriya ographically, and syntactically matching on key and Assamese which are spoken by millions of peo- features like the Subject-Object-Verb (SOV) order ple. This calls for the development of additional as against the Subject-Verb-Object (SVO) order mechanisms for training multilingual LMs which in English. Even though the scripts of several In- are not exclusively reliant on large monolingual dic languages differ, they are all part of the same corpora. Brahmic family, making it easier to design rule- Recent methods of adapting a pre-trained mul- based transliteration libraries across any language tilingual LM to a LRL include fine-tuning the full pair. In contrast, transliteration of Indic languages model with an extended vocabulary (Wang et al., to English is harder with considerable phonetic 2020), training a light-weight adapter layer while variation in how words are transcribed. The geo- keeping the full model fixed (Pfeiffer et al., 2020b), graphical and phylogenetic proximity has lead to and exploiting overlapping tokens to learn embed- significant overlap of words across languages. This dings of the LRL (Pfeiffer et al., 2020c). These are implies that just after transliteration we are able general-purpose methods that do not sufficiently to exploit overlap with a Related Prominent Lan- exploit the specific relatedness of languages within guage (RPL) like Hindi. On three Indic languages the same family. we discover between 11% and 26% overlapping We propose RelateLM for this task. RelateLM tokens with Hindi, whereas with English it is less exploits relatedness between the LRL of interest than 8%, mostly comprising numbers and entity and a Related Prominent Language (RPL). We names. Furthermore, the syntax-level similarity focus on Indic languages, and consider Hindi as between languages allows us to generate high qual- the RPL. The languages we consider in this pa- ity data augmentation by exploiting pre-existing per are related along several dimensions of linguis- bilingual dictionaries. We generate pseudo parallel tic typology (Dryer and Haspelmath, 2013; Lit- data by converting RPL text to LRL and vice-versa. tell et al., 2017): phonologically, phylogenetically These allow us to further align the learned embed- dings across the two languages using the recently is the lack of model’s capacity to accommodate proposed loss functions for aligning contextual em- all languages simultaneously. This has led to in- beddings of word translations (Cao et al., 2020; Wu creased interest in adapting multilingual LMs to and Dredze, 2020). LRLs and we discuss these in the following two In this paper, we make the following contributions: settings. • We address the problem of adding a Low Web- Resource Language (LRL) to an existing pre- LRL adaptation using monolingual data For trained LM, especially when monolingual cor- eleven languages outside mBERT, Wang et al. pora in the LRL is limited. This is an impor- (2020) demonstrate that adding a new target lan- tant but underexplored problem. We focus guage to mBERT by simply extending the embed- on Indian languages which have hundred of ding layer with new weights results in better per- millions of speakers, but traditionally under- forming models when compared to bilingual-BERT studied in the NLP community. pre-training with English as the second language. • We propose RelateLM which exploits relat- Pfeiffer et al.(2020c) adapt multilingual LMs to edness among languages to effectively in- the LRLs and languages with scripts unseen during corporate a LRL into a pre-trained LM. We pre-training by learning new tokenizers for the un- highlight the relevance of transliteration and seen script and initializing their embedding matrix pseudo translation for related languages, and by leveraging the lexical overlap w.r.t. the lan- use them effectively in RelateLM to adapt a guages seen during pre-training. Adapter (Pfeiffer pre-trained LM to a new LRL. et al., 2020a) based frameworks like (Pfeiffer et al., 2020b; Artetxe et al., 2020; Ust¨ un¨ et al., 2020) ad- • Through extensive experiments, we find that dress the lack of model’s capacity to accommodate RelateLM is able to gain significant improve- multiple languages and establish the advantages of ments on benchmark datasets. We demon- adding language-specific adapter modules in the strate how RelateLM adapts mBERT to Oriya BERT model for accommodating LRLs. These and Assamese, two low web-resource Indian methods generally assume access to a fair amount languages by pivoting through Hindi. Via ab- of monolingual LRL data and do not exploit relat- lation studies on bilingual models we show edness across languages explicitly. These methods that RelateLM is able to achieve accuracy of provide complimentary gains to our method of di- zero-shot transfer with limited data (20K doc- rectly exploiting language relatedness.

Arxiv:2106.03958V2 [Cs.CL] 9 Jun 2021 Can Be an Effective Way to Adapt Lms for Lrls, However, There Is a Challenge

The Festvox Indic Frontend for Grapheme-To-Phoneme Conversion

Tai Lü / ᦺᦑᦟᦹᧉ Tai Lùe Romanization: KNAB 2012

Download Download

Neural Machine Translation System of Indic Languages - an Attention Based Approach

An Introduction to Indic Scripts

A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection

The Formal Kharoṣṭhī Script from the Northern Tarim Basin in Northwest

Designing Devanagari Type

Töwkhön, the Retreat of Öndör Gegeen Zanabazar As a Pilgrimage Site Zsuzsa Majer Budapest

3 Writing Systems

Initial Literacy in Devanagari: What Matters to Learners

6 the Hindu-Arabic System (800 BC)