
Samsung R&D Institute Poland submission to WAT 2021 Indic Language Multilingual Task

Adam Dobrowolski, Marcin Szymański, Marcin Chochowski, Paweł Przybysz
Samsung R&D Institute, Warsaw, Poland
{a.dobrowols2, m.szymanski, m.chochowski, p.przybysz}@samsung.com

MultiIndicMT: An Indic Language Multilingual Task
Team ID: SRPOL

Abstract

This paper describes the submission to the WAT 2021 Indic Language Multilingual Task by Samsung R&D Institute Poland. The task covered translation between ten Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu) and English.

We combined a variety of techniques: transliteration, filtering, backtranslation, domain adaptation, knowledge distillation and, finally, ensembling of NMT models. We applied an effective approach to low-resource training that consists of pretraining on backtranslations and tuning on parallel corpora. We experimented with two different domain-adaptation techniques which significantly improved translation quality when applied to monolingual corpora. We researched and applied a novel approach for finding the best hyperparameters for ensembling a number of translation models.

All techniques combined gave a significant improvement - up to +8 BLEU over baseline results. The quality of the models has been confirmed by the human evaluation, where SRPOL models scored best for all 5 manually evaluated languages.

1 Introduction

The Samsung R&D Institute Poland team researched effective techniques that work especially well for low-resource languages: transliteration, and iterative backtranslation followed by tuning on parallel corpora. We successfully applied these techniques during the WAT 2021 competition (Nakazawa et al., 2021). Especially for the competition we also applied custom domain-adaptation techniques which substantially improved the final results.

Most of the applied techniques and ideas are commonly used in work on Indian-language machine translation (Chu and Wang, 2018; Dabre et al., 2020).

This document is structured as follows. In Section 2 we describe the sources and techniques of corpora preparation used for the training. In Sections 3 and 4 we describe the model architecture and the techniques used in training, tuning and ensembling, and finally Section 5 presents the results obtained at every stage of the training.

All trainings were performed on Transformer models. We used the standard Marian NMT framework, v.1.9 (github.com/marian-nmt/marian).

2 Data

2.1 Multilingual trainings

Multilingual models trained for the competition use a target-language tag at the beginning of the sentence to select the direction of the translation.

2.2 Transliteration

Indian languages use a variety of scripts. Using transliteration between the scripts of similar languages may improve the quality of multilingual models, as described in (Bawden et al., 2019; Goyal and Sharma, 2019). The transliteration we applied replaces Indian letters of the various scripts with their equivalents in the Hindi (Devanagari) script. We used the indic_nlp_library (https://github.com/anoopkunchukuttan/indic_nlp_library) to perform the transliteration.

In our previous experiments with Indian languages we noticed an overall quality improvement for multi-Indian models, so we used transliteration in all trainings. However, additional experiments on transliteration during the competition were not conclusive: the results for trainings on raw corpora, without transliteration, were similar (see Table 1).
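As an illustration of the preprocessing described in Sections 2.1-2.2, the sketch below transliterates an Indic source sentence into the Hindi (Devanagari) script with the indic_nlp_library and prepends a target-language tag. The tag format (<2xx>) and the exact pipeline are illustrative assumptions; the paper only states that a target-language tag is placed at the beginning of the sentence.

```python
# Sketch of the preprocessing from Sections 2.1-2.2: map an Indic source
# sentence to a common (Hindi/Devanagari) script and prepend a target-language
# tag that selects the translation direction in the multilingual model.
# The <2xx> tag format is an assumption made for illustration.
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator


def preprocess(sentence: str, src_lang: str, tgt_lang: str) -> str:
    # English is left untouched; Indic scripts are mapped to Devanagari.
    if src_lang != "en":
        sentence = UnicodeIndicTransliterator.transliterate(sentence, src_lang, "hi")
    return f"<2{tgt_lang}> {sentence}"


# e.g. a Tamil source sentence translated into English:
print(preprocess("ஒரு எடுத்துக்காட்டு வாக்கியம்", "ta", "en"))
```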


2.3 Parallel Corpora Filtering

The base corpus for all trainings was the concatenation of the complete bilingual corpora provided by the organizers (further referred to as bitext), 11M lines in total. No filtering or preprocessing (apart from the transliteration) was performed on this corpus. The corpus included parallel data from: CVIT-PIB, PMIndia, IITB 3.0, JW, NLPC, UFAL EnTam, Uka Tarsadia, Wiki Titles, ALT, OpenSubtitles, Bible-uedin, MTEnglish2Odia, OdiEnCorp 2.0, TED, and WikiMatrix. During the competition we performed several experiments to enrich or filter this parallel corpus:

• Inclusion of the CCAligned corpus
• Removing far-from-domain sentence pairs, e.g. religious corpora
• Removing sentence pairs of low probability (according to e.g. sentence lengths, detected language, etc.)
• Domain adaptation by fastText
• Domain adaptation by language model

None of these techniques applied to the parallel corpora led to a quality improvement, which is why we decided to continue with the basic non-filtered corpora as the base for further trainings.

2.4 Backtranslation

Backtranslation of monolingual corpora is a commonly used technique for improving machine translation, especially for low-resource languages where only small bilingual corpora are available (Edunov et al., 2018). Training on backtranslations enriches the target-side language model, which improves the overall translation quality. The synthetic backtranslated corpus was joined with the original bilingual corpus for the trainings.

Using backtranslations of the full monolingual corpora improved the results in the Indian-to-English directions by 1.2 BLEU on average. There was no improvement in the opposite directions. See Tables 5 and 6.

2.5 Domain adaptation

We enriched the parallel training corpora with backtranslated monolingual data, selecting only sentences similar to the PMI domain to increase the share of in-domain data in the training corpus. We used two different techniques to select the in-domain sentences for backtranslation. With these techniques we trained two separate families of MT models.

Domain adaptation by fastText (FT) - We applied the domain adaptation described in (Yu et al., 2020). Following the hints from the paper, we trained a fastText (Joulin et al., 2017) model on a balanced corpus containing sentences from PMIndia labelled as in-domain and CCAligned sentences labelled as out-of-domain. Using the trained model we filtered the parallel as well as the monolingual corpora.

Domain adaptation by language model (LM) - As the second approach to selecting a subset of the most PMI-like sentences from the monolingual general-domain AI4Bharat (Kunchukuttan et al., 2020) corpora available for the task, we used the approach described in (Axelrod et al., 2011). For each of the 10 Indian languages two RNN language models were constructed using the Marian toolkit: an in-domain model trained on the respective part of the PMI corpus, and an out-of-domain model trained on a similar number of lines drawn from a mix of all other corpora available for that language. All these models were regularized with exponential smoothing of 0.0001 and dropout of 0.2, along with source and target word token dropout of 0.1. To rank the sentences of the AI4Bharat mono corpus, we used the cross-entropy difference between the scores of the two models, as suggested in (Axelrod et al., 2011), normalized by the line length. Only sentences with a score above an arbitrarily chosen threshold were selected for further processing. We noticed a significant influence of domain adaptation when selecting the mono corpora used for backtranslation (see Table 3).
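A minimal sketch of the fastText-based selection described above is given below. It assumes a training file in the standard fastText supervised format built from PMIndia (in-domain) and CCAligned (out-of-domain) sentences; the label names, hyperparameters, file names and probability threshold are illustrative assumptions rather than the exact values used for the submission.

```python
# Hedged sketch of domain filtering with a fastText classifier (Section 2.5).
# "domain_train.txt" is assumed to contain lines such as
#   __label__in  <sentence from PMIndia>
#   __label__out <sentence from CCAligned>
import fasttext

# Train a supervised classifier on the balanced in-domain/out-of-domain corpus.
model = fasttext.train_supervised(
    input="domain_train.txt",  # hypothetical file name
    epoch=5, lr=0.5, wordNgrams=2,
)


def is_in_domain(sentence: str, threshold: float = 0.5) -> bool:
    """Keep a sentence if the classifier assigns enough probability to __label__in."""
    labels, probs = model.predict(sentence.replace("\n", " "))
    prob_in = probs[0] if labels[0] == "__label__in" else 1.0 - probs[0]
    return prob_in >= threshold


# Filter a monolingual corpus before backtranslation (file names are placeholders).
with open("mono.txt", encoding="utf-8") as src, \
     open("mono.filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if is_in_domain(line.strip()):
            dst.write(line)
```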
2.6 Multi-Agent Dual Learning

For some of the trainings, we used the simplified version of Multi-Agent Dual Learning (MADL) (Wang et al., 2019) proposed in Kim et al. (2019) to generate additional training data from the parallel corpus. We generated n-best translations of both the source and the target sides of the parallel data with strong ensembles of, respectively, the forward and the backward models. Next, we picked the best translation from among the n candidates w.r.t. the sentence-level BLEU score. Thanks to these steps, we tripled the number of sentences by combining three types of datasets:

1. original source – original target,
2. original source – synthetic target,
3. synthetic source – original target,

where the synthetic target is the translation of the original source with the forward model, and the synthetic source is the translation of the original target with the backward model.
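The n-best selection step above can be sketched as follows: among n candidate translations of a sentence, keep the one with the highest sentence-level BLEU against the original reference. sacrebleu is used here as the sentence-level scorer; candidate generation (decoding with the forward or backward ensemble) is assumed to happen elsewhere.

```python
# Sketch of the n-best selection used in the simplified MADL data generation
# (Section 2.6): pick the candidate with the highest sentence-level BLEU
# against the original reference side of the parallel pair.
from sacrebleu.metrics import BLEU

bleu = BLEU(effective_order=True)  # effective order is advisable at sentence level


def best_candidate(candidates: list[str], reference: str) -> str:
    return max(candidates, key=lambda hyp: bleu.sentence_score(hyp, [reference]).score)


# Example: pick the synthetic target for one parallel sentence pair.
n_best = ["a sample translation", "an example translation", "sample translation a"]
print(best_candidate(n_best, "an example translation"))
```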

Corpus                        En-In    In-En
Parallel
  bitext                      18.03    31.41
  CCAligned                    6.82    12.15
  PMIndia                      5.59    11.94
  bitext+CC                   17.62    30.56
  bitext, no religious        15.33    29.02
  bitext, filtered FT         17.84    29.38
  bitext, most likely         17.98    31.00
  bitext, no transliteration  18.36    31.27
With backtranslation
  bitext+BT filtered LM       18.22    31.38
  bitext+BT filtered FT       18.71    32.77
  bitext+CC+BT filtered FT    18.21    30.64
MADL
  MADL                        18.87    31.94
  MADL+BT filtered FT         18.83    33.25

Table 1: Average BLEU for preliminary trainings (4.1) on different corpora.

2.7 Postprocessing

In comparison to our competitors we noticed significantly weaker performance in the En-Or direction. After an analysis we found out that the generated corpora contain sequences of characters (U+0B2F–U+0B3C, U+0B5F) not present in the devset corpora. Replacing these sequences with the sequence (U+0B5F–U+0B3E) gave a significant improvement for En-Or of about +4 BLEU.
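The Oriya post-processing step amounts to a simple character-sequence substitution on the decoded output. The sketch below shows one reading of it, mapping the decomposed "ya + nukta" sequence (U+0B2F U+0B3C) to the precomposed letter U+0B5F; the exact sequences handled for the submission are those described above, and this single mapping is an illustrative assumption.

```python
# Hedged sketch of the En-Or post-processing (Section 2.7): replace character
# sequences that never occur in the devset with their devset-style equivalents.
# The single mapping below (decomposed "ya + nukta" -> precomposed "yya",
# U+0B2F U+0B3C -> U+0B5F) is an illustrative reading of the description above.
ORIYA_REPLACEMENTS = {
    "\u0B2F\u0B3C": "\u0B5F",
}


def postprocess_oriya(text: str) -> str:
    for decomposed, composed in ORIYA_REPLACEMENTS.items():
        text = text.replace(decomposed, composed)
    return text
```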

3 NMT System Overview

All of our systems are trained with the Marian NMT framework (Junczys-Dowmunt et al., 2018) (github.com/marian-nmt/marian).

3.1 Baseline systems for preliminary experiments

The first experiments were performed with transformer models (Vaswani et al., 2017), which we will refer to as transformer-base. The only difference from the default configuration is that we used 8 encoder layers and 4 decoder layers instead of the default 6-6. The model has the default embedding dimension of 512 and a feed-forward layer dimension of 2048.

We also used layer normalization (Ba et al., 2016) and tied the weights of the target-side embedding and the transpose of the output weight matrix, as well as the source- and target-side embeddings (Press and Wolf, 2017). Optimizer delay was used to simulate batches of size up to 200GB. Adam (Kingma and Ba, 2017) was used as the optimizer, with a learning rate of 0.0003 and a linear warm-up for the initial 48,000 updates with subsequent inverse square-root decay. No dropout was applied.

3.2 Final configuration

After the first experiments, further trainings were performed on a transformer-big model. It has bigger dimensions than the transformer-base: an embedding dimension of 1024 and a feed-forward layer dimension of 4096. The transformer-big trainings were regularized with a dropout of 0.1 between transformer layers and label smoothing of 0.1, unlike the transformer-base, which was trained without dropout.
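For concreteness, the optimizer schedule from Section 3.1 (peak learning rate 0.0003, linear warm-up over the first 48,000 updates, then inverse square-root decay) can be written as below. This is the standard form of such a schedule; small implementation details may differ from Marian's.

```python
# Sketch of the learning-rate schedule described in Section 3.1.
import math

BASE_LR = 3e-4
WARMUP = 48_000


def learning_rate(step: int) -> float:
    step = max(step, 1)
    # Linear warm-up up to WARMUP updates, inverse square-root decay afterwards.
    return BASE_LR * min(step / WARMUP, math.sqrt(WARMUP / step))


# e.g. learning_rate(24_000) == 1.5e-4 (warm-up),
#      learning_rate(48_000) == 3.0e-4 (peak),
#      learning_rate(192_000) == 1.5e-4 (decay).
```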

4 Trainings

4.1 Preliminary trainings

During the preliminary trainings, we tested which of the filtering/backtranslation/MADL techniques work best for the task. Preliminary trainings were performed for all 20 directions on a single transformer-base model with no dropout.

There was no clear answer as to which of the techniques works best. Generally, adding the CCAligned corpus worsened the results. Training only on the big CCAligned corpus (15M lines) gave results similar to training on the small PMIndia corpus (300k lines). For further trainings we decided to use the most promising techniques: filtered backtranslation (both the fastText and the language-model method) and MADL.

The preliminary training of one transformer-base model lasted 50 hours on two V100 GPUs - 13 epochs. A summary of the preliminary results is gathered in Table 1.

4.2 Pretraining with backtranslations

For the final trainings we prepared various corpora with backtranslations filtered with the domain-transfer methods described in the previous sections: fastText and language model. Trainings were performed on separate models: one transformer-big many-to-one model for the 10 to-English directions and a second one-to-many model for the 10 from-English directions.

The whole pretraining of one transformer-big model lasted 200 hours on four V100 GPUs - 8 epochs. Further tunings took an additional 20 hours of processing.

Corpus component            Source   Selected
fastText filtering
  backtranslations          400M     86M
  bitext filtered FT        11M      1.5M
  CCAligned                 15M      400k
  PMIndia                   300k     300k
  bitext full               11M      11M
Language Model filtering
  backtranslations          400M     58M
  bitext full               11M      11M
  bitext distilled forward  11M      11M
  bitext distilled backward 11M      11M

Table 2: Components of mixed corpora used for pretrainings with backtranslation (4.2), using fastText filtering and language-model filtering of monolingual corpora.

                     BLEU             Improvement
Stage                2In     2En      2In     2En
No filtering
  Bitext only        18.81   31.80
  Full BT            18.77   33.02    -0.04   1.22
LM filtering
  Filtered BT        20.06   35.43    1.25    3.63
  Tuned Bitext       21.03   36.95    0.97    1.52
  FT PMIndia         21.39   37.26    0.36    0.31
fastText filtering
  Filtered BT        19.77   36.62    0.96    4.86
  Tuned BT-PMI       21.01   37.64    1.24    1.02
  Tuned Bitext       21.31   38.47    1.54    1.85
  FT PMIndia         21.91   38.72    0.60    0.25
  FT BT-PMI          21.81   38.42    0.50    -0.05
  FT MADL                    38.67            0.20

Table 3: Comparison of domain-adaptation techniques - average BLEU over 10 directions for subsequent stages of the final training: pretraining with backtranslation, tuning with bitext, tuning with backtranslated mono PMIndia, finetuning with bitext PMIndia, finetuning with backtranslated mono PMIndia, finetuning with MADL.
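The language-model selection behind the "LM filtering" rows of Table 2 is the length-normalized cross-entropy difference from Section 2.5 (Axelrod et al., 2011). A minimal sketch is given below; the two scoring functions stand in for per-sentence cross-entropies produced by the in-domain and out-of-domain Marian language models, and the threshold value is an assumption.

```python
# Hedged sketch of cross-entropy-difference selection (Axelrod et al., 2011),
# as described in Section 2.5. The scoring callables are placeholders for
# the per-sentence cross-entropies of the in-domain (PMI) and out-of-domain
# RNN language models; the threshold is illustrative.
from typing import Callable


def select_in_domain(
    sentences: list[str],
    xent_in: Callable[[str], float],    # cross-entropy under the in-domain LM
    xent_out: Callable[[str], float],   # cross-entropy under the out-of-domain LM
    threshold: float = 0.0,             # illustrative cut-off
) -> list[str]:
    selected = []
    for sent in sentences:
        length = max(len(sent.split()), 1)
        # Higher score = more in-domain-like (the out-of-domain LM is more
        # surprised than the in-domain LM); normalized by sentence length.
        score = (xent_out(sent) - xent_in(sent)) / length
        if score > threshold:
            selected.append(sent)
    return selected
```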

4.3 Tuning with bitext

The two best pretrained models with domain transfer (LM-filtered and FT-filtered) were the baselines for tuning with the parallel corpora. During the bitext tuning we used all bilingual data provided by the organizers except the CCAligned corpus - 11M sentences in total. Tuning the baselines with the original parallel corpora improved the average BLEU of the pretrained models by 0.97-1.85 BLEU (see Table 3).

4.4 Finetuning with PMIndia

We performed several attempts to finetune the final results with different corpora:

1. the PMIndia parallel corpus (300k lines),
2. the backtranslated PMIndia mono corpus (1.1M lines),
3. MADL on the PMIndia parallel corpus (3 * 300k lines).

The first of these attempts, finetuning with the bilingual PMIndia corpus, gave the best improvement of the final result - 0.25-0.6 BLEU on average. All three finetuned models were taken into the process of mixing the best ensemble.

4.5 Ensembling

To further boost the translation quality, we used ensembles of models during decoding. Two separate ensembles were formed and tuned: one for transliterated Indian to English, the other for the opposite direction. Each ensemble consisted of: a number of neural translation models, derived from various stages of training and model tuning - up to as many as 9 NMT models were used during weight optimization; and a single neural language model, either English or common Indian (based on all languages, transliterated into Hindi), depending on the direction.

The tuning of the ensemble weights was performed on the Development set and consisted of the following stages:

• Expectation-Maximization of the posterior emission probability for a mixture of models (Kneser and Steinbiss, 1993), based on NMT log-scores of Development sentence pairs obtained using marian-score (a minimal sketch follows after this list). This procedure, as well as being fast due to not requiring actual decoding, also worked well in practice, despite being based on interpolation in the linear probability domain, as opposed to the log-domain interpolation used in Marian;

• tuning single weights of the ensemble (a bisectioning procedure, performed for a limited number of iterations; weights were tuned in an arbitrary order), based on BLEU scores of the translated Development set (before normalization and tokenization);

• (only for Indian-to-English) a sweep of the normalization factor (the translation score of each hypothesis is divided by length^factor; this value is then used to select the final translation; the default is 1), also based on BLEU.
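The sketch below illustrates the Expectation-Maximization step from the first stage above: the ensemble is treated as a linear mixture of models and the mixture weights are estimated from per-sentence log-scores of the Development pairs (e.g. as produced by Marian's scorer). Details such as length normalization of the scores are omitted; this is an assumption-laden illustration, not the exact procedure used for the submission.

```python
# Hedged sketch of EM estimation of mixture weights from per-sentence
# log-scores (Section 4.5, first tuning stage).
import numpy as np


def em_mixture_weights(log_scores: np.ndarray, iterations: int = 50) -> np.ndarray:
    """log_scores: array of shape (n_models, n_sentences) with log P_m(pair_i)."""
    n_models, _ = log_scores.shape
    weights = np.full(n_models, 1.0 / n_models)
    # Per-sentence shift stabilizes exp() and does not change the posteriors.
    shifted = log_scores - log_scores.max(axis=0, keepdims=True)
    probs = np.exp(shifted)                      # unnormalized P_m(pair_i)
    for _ in range(iterations):
        weighted = weights[:, None] * probs      # E-step: joint of model and pair
        posteriors = weighted / weighted.sum(axis=0, keepdims=True)
        weights = posteriors.mean(axis=1)        # M-step: updated mixture weights
    return weights


# Example with random scores for 3 models on 100 development sentence pairs.
rng = np.random.default_rng(0)
print(em_mixture_weights(rng.normal(size=(3, 100))))
```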

Table 4 presents the impact of tuning on BLEU scores on both the devset and the testset, in relation to a few manually selected setups, namely the best single model, the uniform ensembles and the expert-selected "50-25-25%" ensemble. The final weight selection improved translation in the Indian-to-En directions by ca. 0.5 BLEU compared to the expert ensemble, or by ca. 1.2 BLEU compared to the best single model; in the En-to-Indian directions, the improvement was <0.1 BLEU or ca. 0.5 BLEU, respectively. The results on the testset differ slightly from our final submissions because, during the ensemble tuning, we used a simplified BLEU calculation algorithm (before normalization and tokenization).

                Indian-En         En-Indian
Technique       Dev      Test     Dev      Test
Best sng        41.39    38.52    22.32    20.50
Unif. w/o LM    42.33    39.6     22.57    20.79
Unif. w/ LM     40.69    37.88    21.51    20.01
Expert sel.     42.11    39.24    22.62    20.94
E-M (*)         42.35    39.71    22.64    20.99
+ ind. wgts     42.49    39.65    22.74    20.99
+ norm-fact.    42.50    39.58    n/a      n/a

Table 4: BLEU scores for different techniques of determining ensemble weights. (*) Expectation-Maximization of likelihoods optimized the weights of the translation models only; the language model was then added with a small arbitrary weight of ca. 0.3%, and the presented scores were obtained using such an ensemble.

Individual tuning for the target languages of the English-to-Indian directions was originally planned but was not eventually used for the submission, mostly due to lack of time; moreover, visual inspection of the partial results showed that some weights varied wildly, so devset over-fitting could be suspected at this point. Normalization-factor optimization was planned to be performed after the aforementioned optimization, so it was consequently also skipped for the English-to-Indian directions. Post-submission tests showed an average improvement of ca. 0.2 BLEU when using tuning for individual Indian target languages, but the gain was strongly dominated by the improvement on a single direction (En→Hi).

We experimented with several beam sizes, increasing it up to 40. For the final submission we chose a beam size of 16. The larger beam gave little or no improvement at the cost of slowing down the decoding. For very large ensembles of 10 big models, the decoding of the whole devset for 10 directions (10k lines) lasts about 25 minutes on a single V100 GPU.
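The normalization-factor sweep referenced above (and the "+ norm-fact." row of Table 4) relies on simple length-normalized hypothesis selection. A minimal sketch with placeholder data is shown below; the paper only specifies that the translation score is divided by length^factor, with a default factor of 1, so the tokenization of "length" here is an assumption.

```python
# Hedged sketch of length-normalized hypothesis selection (Section 4.5):
# each hypothesis score is divided by length**factor before choosing the
# final translation; factor = 1 is the default, and the sweep searches for
# a better value on the development set.
def select_hypothesis(hypotheses: list[tuple[str, float]], factor: float = 1.0) -> str:
    """hypotheses: list of (text, log_score) pairs from the ensemble's beam."""
    def normalized(item: tuple[str, float]) -> float:
        text, score = item
        return score / (max(len(text.split()), 1) ** factor)
    return max(hypotheses, key=normalized)[0]


# Example with made-up beam outputs.
beam = [("a short guess", -4.0), ("a somewhat longer candidate translation", -7.5)]
print(select_hypothesis(beam, factor=1.0))
```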
5 Final Results

The detailed results of each stage of the best branch of trainings are gathered in Tables 5 and 6. The ensemble values are the submission evaluation results provided by the organizers.

Tables 7 and 8 contain the results of the models submitted by SRPOL compared with the best results of the competitors. The tables present the values provided by the WAT2021 organizers, calculated with 3 different metrics: BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010) and AMFM (Banchs and Li, 2011).

Figure 1 shows the results of the human evaluation. The figure presents the values provided by the WAT2021 organizers, showing a significant advantage over the competitors. In particular, the amount of bad translations (scored 1-2) has been significantly reduced.

5.1 English → Indian

Application of all techniques to the En→In directions gave an overall improvement of 3.6 BLEU, from a baseline average of 18.8 to a final 22.4 BLEU. Adding non-filtered backtranslations gave no improvement, probably because the general Indian mono corpus is too different from the specific language used in PMIndia. However, after domain adaptation of the training corpus we gained an improvement of 1 BLEU. Most of the improvement was gained by finetuning on the parallel corpora (1.5 BLEU) and the PMI corpora (0.6 BLEU). The final ensembling process gave an average improvement of 0.5 BLEU.

Stage              Bn    Gu    Hi    Kn    Ml    Mr    Or    Pa    Ta    Te    AVG    Boost
Baseline - bitext  13.1  23.7  35.8  15.8  12.2  16.7  17.0  29.8  12.0  11.9  18.81
Backtranslations   12.5  23.5  36.1  16.6  12.4  17.1  17.0  29.8  11.8  11.0  18.77  -0.04
Domain adapt.      13.4  23.8  36.8  17.2  13.8  18.7  18.3  30.6  12.9  12.2  19.77  1.00
Tuning bitext      14.6  25.9  38.1  19.5  14.9  19.6  19.5  32.3  13.6  14.9  21.31  1.54
Tuning PMIndia     15.5  27.2  38.1  20.8  15.1  19.8  19.1  32.9  13.7  16.8  21.91  0.60
Ensemble           16.0  27.8  38.7  21.3  15.5  20.4  19.9  33.4  14.2  16.9  22.40  0.49

Table 5: Final results - BLEU for 10 directions from-English in subsequent stages of final training

Stage              Bn    Gu    Hi    Kn    Ml    Mr    Or    Pa    Ta    Te    AVG    Boost
Baseline - bitext  25.2  36.4  39.9  31.0  29.5  29.8  30.2  38.0  28.5  29.5  31.80
Backtranslations   25.3  37.8  40.6  33.6  30.8  30.6  31.8  39.2  29.3  31.2  33.02  1.22
Domain adapt.      29.4  40.6  44.5  36.9  35.1  33.6  35.2  42.7  33.0  35.0  36.62  3.60
Tuning bitext      31.1  42.7  45.3  38.9  37.2  35.2  36.2  44.8  34.9  38.3  38.47  1.85
Tuning PMIndia     31.8  43.3  45.6  39.1  37.1  35.7  36.2  44.8  35.0  38.6  38.72  0.25
Ensemble           31.9  44.0  46.9  40.3  38.4  36.6  37.1  46.4  36.1  39.8  39.75  1.03

Table 6: Final results - BLEU for 10 directions to-English in subsequent stages of final training

5.2 Indian → English

Application of all techniques to the In→En directions gave an overall improvement of 8 BLEU, from a baseline average of 31.8 to a final 39.8 BLEU. Adding non-filtered backtranslations gave a 1.2 BLEU improvement, but most of the gain came from domain adaptation, which gave a surprisingly high improvement of 3.6 BLEU. Further improvement was gained by finetuning on the parallel corpora (1.9 BLEU) and the PMI corpora (0.3 BLEU). The final ensembling process gave an additional improvement of 1.0 BLEU.

6 Conclusions

We presented an effective approach to low-resource training, consisting of pretraining on backtranslations and tuning on parallel corpora. We successfully applied domain-adaptation techniques which significantly improved translation quality measured by BLEU. We presented an effective approach for finding the best hyperparameters for ensembling a number of single translation models.

We applied transliteration, but the final results did not confirm that this approach is effective, at least for this particular task.

We tried several filtering techniques for the parallel corpora, but the results showed no improvement. This may be a confirmation that the parallel corpora provided by the competition organizers are of high quality which is hard to improve. Probably for the same reason, domain adaptation applied to the parallel corpora did not improve the results. However, domain adaptation worked surprisingly well for monolingual corpora.

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization.

Rafael E. Banchs and Haizhou Li. 2011. AM-FM: A semantic framework for translation quality assessment. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 153–158, Portland, Oregon, USA. Association for Computational Linguistics.

Rachel Bawden, Nikolay Bogoychev, Ulrich Germann, Roman Grundkiewicz, Faheem Kirefu, Antonio Valerio Miceli Barone, and Alexandra Birch. 2019. The University of Edinburgh's submissions to the WMT19 news translation task.

Chenhui Chu and Rui Wang. 2018. A survey of domain adaptation for neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1304–1319, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. 2020. A brief survey of multilingual neural machine translation.

Model        Bn      Gu      Hi      Kn      Ml      Mr      Or      Pa      Ta      Te      AVG
BLEU
Baseline     12.03   22.99   35.25   14.72   11.93   16.07   12.33   28.65   11.44   10.65   17.61
Competitor   14.73   26.97   38.25   19.57   12.79   19.48   20.15   33.35   14.43   15.61   21.53
Best single  15.58   27.31   38.04   20.91   15.43   19.93   19.15   32.88   13.89   16.82   21.99
Ensemble     15.97   27.80   38.65   21.30   15.49   20.42   19.94   33.43   14.15   16.85   22.40
RIBES
Baseline     0.7072  0.8020  0.8438  0.7281  0.6874  0.7388  0.7146  0.8203  0.6971  0.6924  0.7432
Competitor   0.7242  0.8202  0.8542  0.7601  0.7074  0.7600  0.7503  0.8376  0.7215  0.7284  0.7664
Best single  0.7328  0.8223  0.8525  0.7701  0.7341  0.7669  0.7497  0.8355  0.7288  0.7345  0.7727
Ensemble     0.7336  0.8249  0.8559  0.7712  0.7369  0.7718  0.7511  0.8375  0.7307  0.7398  0.7753
AMFM
Baseline     0.7675  0.8166  0.8224  0.8091  0.7986  0.8050  0.7146  0.7733  0.7957  0.7633  0.7866
Competitor   0.7796  0.8201  0.8228  0.8178  0.8053  0.8115  0.7699  0.8137  0.8029  0.7898  0.8033
Best single  0.7723  0.8199  0.8224  0.8213  0.8080  0.8108  0.7715  0.8132  0.7994  0.7930  0.8032
Ensemble     0.7710  0.8212  0.8246  0.8219  0.8081  0.8097  0.7718  0.8141  0.7988  0.7911  0.8032

Table 7: Official results of translations from-English, by 3 metrics, for the submitted results of: the baseline model, the best competitor's result, SRPOL's submitted single model, and SRPOL's best submitted ensemble

Model        Bn      Gu      Hi      Kn      Ml      Mr      Or      Pa      Ta      Te      AVG
BLEU
Baseline     25.39   35.86   39.49   30.67   28.69   29.10   30.07   37.61   28.01   29.05   31.39
Competitor   29.96   39.39   43.23   35.46   33.21   34.02   34.11   41.24   31.94   35.44   35.80
Best single  31.82   42.87   45.61   39.01   37.04   35.68   36.04   44.87   35.06   38.57   38.66
Ensemble     31.87   43.98   46.93   40.34   38.38   36.64   37.06   46.39   36.13   39.80   39.75
RIBES
Baseline     0.7649  0.8186  0.8448  0.7984  0.7927  0.7879  0.7895  0.8335  0.7881  0.7803  0.7999
Competitor   0.7983  0.8394  0.8591  0.8209  0.8132  0.8103  0.8017  0.8495  0.8070  0.8168  0.8216
Best single  0.8001  0.8497  0.8677  0.8373  0.8304  0.8212  0.8128  0.8614  0.8160  0.8315  0.8328
Ensemble     0.8005  0.8533  0.8729  0.8405  0.8354  0.8248  0.8170  0.8658  0.8223  0.8364  0.8369
AMFM
Baseline     0.7699  0.8129  0.8250  0.7927  0.7936  0.7916  0.7940  0.8151  0.7884  0.7872  0.7970
Competitor   0.7786  0.8207  0.8345  0.8097  0.8068  0.7958  0.8082  0.8235  0.7961  0.8040  0.8078
Best single  0.7924  0.8331  0.8435  0.8204  0.8207  0.8103  0.8149  0.8364  0.8036  0.8204  0.8196
Ensemble     0.7897  0.8358  0.8471  0.8237  0.8230  0.8123  0.8173  0.8416  0.8065  0.8209  0.8218

Table 8: Official results of translations to-English, by 3 metrics, for the submitted results of: the baseline model, the best competitor's result, SRPOL's submitted single model, and SRPOL's best submitted ensemble

Figure 1: Summary results for all 5 manually evaluated languages - Bengali, Kannada, Malayalam, Marathi, Oriya

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.

Vikrant Goyal and Dipti Misra Sharma. 2019. The IIIT-H Gujarati-English machine translation system for WMT19. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 191–195, Florence, Italy. Association for Computational Linguistics.

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 944–952, Cambridge, MA. Association for Computational Linguistics.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics.

Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, and Sadao Kurohashi. 2021. Overview of the 8th workshop on Asian translation. In Proceedings of the 8th Workshop on Asian Translation, Bangkok, Thailand. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Yiren Wang, Yingce Xia, Tianyu He, Fei Tian, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019. Multi-agent dual learning. In International Conference on Learning Representations.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia. Association for Computational Linguistics.

Zhengzhe Yu, Zhanglin Wu, Xiaoyu Chen, Daimeng Wei, Hengchao Shang, Jiaxin Guo, Zongyao Li, Minghan Wang, Liangyou Li, Lizhi Lei, Hao Yang, and Ying Qin. 2020. HW-TSC's participation in the WAT 2020 indic languages multilingual task. In Proceedings of the 7th Workshop on Asian Translation, pages 92–97, Suzhou, China. Association for Computational Linguistics.

Young Jin Kim, Marcin Junczys-Dowmunt, Hany Hassan, Alham Fikri Aji, Kenneth Heafield, Roman Grundkiewicz, and Nikolay Bogoychev. 2019. From research to production and back: Ludicrously fast neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 280–288, Hong Kong. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization.

Reinhard Kneser and Volker Steinbiss. 1993. On the dynamic adaptation of stochastic language models. In 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 586–589.

Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. AI4Bharat-IndicNLP corpus: Monolingual corpora and word embeddings for Indic languages.
