Neural Machine Translation for Similar Languages: The Case of Indo-Aryan Languages

Santanu Pal¹, Marcos Zampieri²
¹Wipro AI Lab, ²Rochester Institute of Technology, USA
[email protected]

Abstract

In this paper we present the WIPRO-RIT systems submitted to the Similar Language Translation shared task at WMT 2020. The second edition of this shared task featured parallel data from pairs/groups of similar languages from three different language families: Indo-Aryan languages (Hindi and Marathi), Romance languages (Catalan, Portuguese, and Spanish), and South Slavic languages (Croatian, Serbian, and Slovene). We report the results obtained by our systems in translating from Hindi to Marathi and from Marathi to Hindi. WIPRO-RIT achieved competitive performance, ranking 1st in Marathi to Hindi and 2nd in Hindi to Marathi translation among 22 systems.

1 Introduction

WMT 2020 is the fifth edition of WMT as a conference, following a series of well-attended workshops that date back to 2006. WMT became a well-established conference due to its blend of research papers and popular shared tasks on topics such as translation in various domains (e.g. biomedical, news), translation quality estimation, and automatic post-editing. The competitions co-organized with WMT provide important datasets and benchmarks that are widely used in the MT community. The vast majority of these tasks so far, however, involved training systems to translate to and from English (Bojar et al., 2016, 2017), while only a few of them addressed the problem of translating between pairs of languages with fewer resources.

To address this issue, the Similar Language Translation (SLT) shared task was introduced at WMT in 2019. SLT's purpose was to evaluate the performance of state-of-the-art MT systems on translating between pairs of similar languages without English as a pivot language (Barrault et al., 2019). The organizers provided participants with training, development, and testing parallel data for three pairs of languages from three different language families: Spanish-Portuguese (Romance languages), Czech-Polish (Slavic languages), and Hindi-Nepali (Indo-Aryan languages). Systems were evaluated using automatic metrics, namely BLEU (Papineni et al., 2002) and TER (Snover et al., 2006).

In SLT 2020, the task organizers once again included an Indo-Aryan language track, this time with Hindi and Marathi. Indo-Aryan languages are a sub-family of the Indo-European language family which includes Bengali, Bhojpuri, Hindi, Marathi, and Nepali. These languages are mainly spoken in North and Central India and in neighbouring countries such as Nepal, Bangladesh, and Pakistan. The scripts used for most of these languages are derived from the ancient Brahmi script and exhibit a high grapheme-to-phoneme correspondence, leading to many orthographic similarities across these languages.

In addition to Hindi and Marathi, SLT 2020 features two other tracks with similar languages from the following language families: Romance languages (Catalan, Portuguese, and Spanish) and South Slavic languages (Croatian, Serbian, and Slovene). In this paper we describe the WIPRO-RIT submission to the SLT 2020 Indo-Aryan track. Our WIPRO-RIT system is based on the model described in Johnson et al. (2017). WIPRO-RIT achieved competitive performance, ranking 1st in Marathi to Hindi and 2nd in Hindi to Marathi translation among 22 systems.

2 Related Work

With the substantial performance improvements brought to MT by neural approaches, a growing interest in translating between pairs of similar languages, language varieties, and dialects has been observed. Recent studies have addressed MT between Arabic dialects (Harrat et al., 2019; Shapiro and Duh, 2019), Catalan and Spanish (Costa-jussà, 2017), Croatian and Serbian (Popović et al., 2020), Brazilian and European Portuguese (Costa-jussà et al., 2018), and several pairs of languages and language varieties such as Brazilian and European Portuguese, Canadian and European French, Croatian and Serbian, and Indonesian and Malay (Lakew et al., 2018).

The interest in diatopic language variation is evidenced by the recent iterations of the VarDial workshop, in which papers on MT applied to similar languages, varieties, and dialects (Shapiro and Duh, 2019; Myint Oo et al., 2019; Popović et al., 2020) have been presented along with evaluation campaigns featuring multiple shared tasks on a number of related topics such as cross-lingual morphological analysis, cross-lingual parsing, dialect identification, and morphosyntactic tagging (Zampieri et al., 2018, 2019; Găman et al., 2020).

3 Data

For our experiments, we use the Hindi-Marathi and Marathi-Hindi WMT 2020 SLT data. The released parallel dataset was collected from the news (Siripragada et al., 2020), PMIndia (Haddow and Kirefu, 2020), and IndoWordNet (Bhattacharyya, 2010; Kunchukuttan, 2020a) datasets. To augment our dataset, we use the English-Hindi parallel data released in WMT 2014 (Bojar et al., 2014), consisting of more than 2 million parallel sentences, which is available as an additional resource. We use a subset of 5 million segments of Hindi monolingual news, drawn from ca. 32 million crawled sentences, and a subset of 5 million Marathi monolingual sentences. We applied cleaning and pre-processing methods similar to those described below for the parallel data.

The 5 million Hindi monolingual sentences were first back-translated to English using a Hindi-English NMT system. This system was trained on the English-Hindi parallel data released in WMT 2014 (Bojar et al., 2014), the IITB parallel corpus (Kunchukuttan et al., 2018), the parallel dataset collected from news (Siripragada et al., 2020), and the PMIndia parallel corpus (Haddow and Kirefu, 2020) (see Table 1).

Data Sources         #sentences
WMT                     273,885
News                    156,344
IITB                  1,561,840
PM India                 56,831
Total                 2,048,900
Remove duplicates     1,464,419
Cleaning*               961,036

Table 1: English-Hindi parallel data statistics. *Removing noisy mixed-language sentences.

We also back-translated the 5 million Marathi monolingual segments using our WIPRO-RIT CONTRASTIVE 1 system, described in more detail in Section 6. For Marathi-Hindi we did not use any back-translation data in our CONTRASTIVE 2 and PRIMARY submissions. In both cases, the 5 million English-Hindi back-translated sentences provide significant (p < 0.01) improvements over CONTRASTIVE 1 (detailed in Section 6).

The released WMT 2014 EN-HI data and the WMT SLT 2020 data were noisy for our purposes, so we applied two cleaning steps (see data statistics in Table 2): (i) the cleaning process described in Pal et al. (2015), and (ii) the Moses (Koehn et al., 2007) corpus cleaning scripts with the minimum and maximum number of tokens set to 1 and 100, respectively.

Parallel          #sentences
News                  12,349
PM India              25,897
Indic WordNet         11,188
Total                 49,434
Filtered*             33,923

Table 2: Data statistics of the released SLT data. *Filtration methods: (i) removing duplicates and (ii) filtering noisy mixed-language sentences.
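For illustration, the deduplication, length filtering, and mixed-language filtering above can be sketched in Python as follows. This is a minimal reconstruction rather than the exact pipeline we ran; in particular, the script-ratio heuristic, its threshold, and all names are illustrative assumptions.

```python
import re

# Character classes used by the illustrative mixed-language filter.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")

def latin_ratio(line):
    """Fraction of alphabetic characters that are Latin, for a line
    expected to be written (mostly) in Devanagari."""
    latin = len(LATIN.findall(line))
    deva = len(DEVANAGARI.findall(line))
    total = latin + deva
    return latin / total if total else 0.0

def clean_parallel(pairs, min_tokens=1, max_tokens=100, max_latin=0.5):
    """Deduplicate sentence pairs, apply a Moses-style 1-100 token
    length window, and drop pairs whose Indic side is mostly
    non-Devanagari (a stand-in for the mixed-language filter)."""
    seen = set()
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # (i) remove duplicates
        seen.add((src, tgt))
        if not (min_tokens <= len(src.split()) <= max_tokens):
            continue  # (ii) length filter, source side
        if not (min_tokens <= len(tgt.split()) <= max_tokens):
            continue  # (ii) length filter, target side
        if latin_ratio(tgt) > max_latin:
            continue  # drop noisy mixed-language sentences
        yield src, tgt
```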

L1 → L2                | Source                                       | Target
Raw HI→MR data         | देश एकल प्रयासों से आगे बढ़ चुके हैं।              | देश आता सामाईक प्रयत्न करत आहेत.
Processed TO_MR data   | TO_MR देश एकल प्रयासों से आगे बढ़ चुके हैं।        | देश आता सामाईक प्रयत्न करत आहेत.
Raw MR→HI data         | देश आता सामाईक प्रयत्न करत आहेत.              | देश एकल प्रयासों से आगे बढ़ चुके हैं।
Processed TO_HI data   | TO_HI देश आता सामाईक प्रयत्न करत आहेत.        | देश एकल प्रयासों से आगे बढ़ चुके हैं।
Raw EN→HI data         | The MoU was signed in February, 2016.        | इस एमओयू पर फरवरी, 2016 में हस्ताक्षर किए गए थे।
Processed TO_HI data   | TO_HI The MoU was signed in February, 2016.  | इस एमओयू पर फरवरी, 2016 में हस्ताक्षर किए गए थे।

Table 3: Multilingual processed data, with TO_XX indicating the target-language token.
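The processed rows in Table 3 are produced by prepending a target-language token to each source sentence, as described in Section 4 below. A minimal Python sketch of this step (the function name is ours; the token spelling follows the table):

```python
def add_target_token(sentences, target_lang):
    """Prepend a target-language token (TO_HI, TO_MR, ...) to every
    source sentence so that one multilingual model knows which
    language to produce, following Johnson et al. (2017)."""
    token = "TO_" + target_lang.upper()
    return [token + " " + s for s in sentences]

# Tagged corpora for all directions are then simply concatenated
# into one training set:
tagged = add_target_token(["देश एकल प्रयासों से आगे बढ़ चुके हैं।"], "mr")
# -> ["TO_MR देश एकल प्रयासों से आगे बढ़ चुके हैं।"]
```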

After cleaning and removing duplicates, we have 1M EN-HI parallel sentences. Next, we perform punctuation normalization, and then we use the Moses tokenizer with the 'no-escape' option to tokenize the English side of the parallel corpus. Finally, we apply true-casing. For Hindi and Marathi, we use the Indic NLP Library¹ (Kunchukuttan, 2020b) for tokenization.

4 Model Architecture

Our model is based on the transformer architecture (Vaswani et al., 2017), which is built solely upon attention mechanisms, completely replacing recurrence and convolutions. The transformer uses positional encoding to encode the input and output sequences, and computes both self- and cross-attention through so-called multi-head attention, which facilitates parallelization. We use multi-head attention to jointly attend to information at different positions from different representation subspaces.

We present a single multilingual NMT system based on the transformer architecture that can translate between multiple languages. To make use of multilingual data within a single NMT model, we perform one simple modification to the source side of the multilingual data: we add an additional token at the beginning of each source sentence to indicate the target language into which the NMT model should translate, as shown in Table 3. We train the model on all the processed multilingual data, consisting of sentence-aligned data from multiple language pairs at once. During inference, we also need to add the aforementioned token to each input source sentence to specify the desired target language.

5 Experiments

In the next sub-sections we describe the experiments we carried out for translating from Hindi to Marathi and from Marathi to Hindi for WIPRO-RIT's WMT 2020 SLT shared task submission.

5.1 Experiment Setup

To handle out-of-vocabulary words and to reduce the vocabulary size, we consider subword units (Sennrich et al., 2016) obtained with byte-pair encoding (BPE) instead of words. In the preprocessing step, instead of learning an explicit mapping between BPEs in English (EN), Hindi (HI), and Marathi (MR), we define BPE tokens by jointly processing all parallel data, thus deriving a single BPE vocabulary. Since HI and MR are similar languages, they naturally share a good fraction of BPE tokens, which reduces the vocabulary size.

We report the evaluation results (computed by the shared task organizers) of our approach on the released test data. BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010), and TER (Snover et al., 2006) are used to evaluate the performance of all participating systems in the shared task.

¹https://github.com/anoopkunchukuttan/indic_nlp_library/
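As a sketch of the joint BPE step, assuming all EN, HI, and MR training text has been concatenated into a single file (file names are placeholders), the subword-nmt package can learn the single 32K-operation vocabulary described above:

```python
# pip install subword-nmt
from subword_nmt.apply_bpe import BPE
from subword_nmt.learn_bpe import learn_bpe

# Learn one joint BPE model over the concatenated EN + HI + MR
# training text, so all three languages share one subword vocabulary.
with open("train.all.txt", encoding="utf-8") as text, \
        open("bpe.codes", "w", encoding="utf-8") as codes:
    learn_bpe(text, codes, num_symbols=32000)  # 32K merge operations

# Apply the shared codes to text in any of the three languages.
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("देश एकल प्रयासों से आगे बढ़ चुके हैं।"))
```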
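The official scores were computed by the organizers; for illustration, BLEU and TER can be reproduced locally with the sacrebleu package (RIBES is not part of sacrebleu and requires the separate RIBES evaluation script). The example strings are placeholders.

```python
# pip install sacrebleu
import sacrebleu

hypotheses = ["देश आता सामाईक प्रयत्न करत आहेत."]    # system output, one string per segment
references = [["देश आता सामाईक प्रयत्न करत आहेत."]]  # one inner list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
print("BLEU = {:.2f}, TER = {:.2f}".format(bleu.score, ter.score))
```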

Parallel Data     #sentences    C1   C2   P
Filtered SLT          33,923     ✓    ✓   ✓
Filtered EN-HI       961,036     ✓    ✓   ✓
BT EN-HI           5 million     ✓    ✓   ✓
BT HI-MR           5 million          ✓   ✓

Table 4: Training data statistics of our submitted systems (C1 = CONTRASTIVE 1, C2 = CONTRASTIVE 2, P = PRIMARY, and BT = back-translated data).

5.2 Hyper-parameter Setup

We follow a similar hyper-parameter setup for all reported systems. All encoders and the decoder are composed of a stack of N = 6 identical layers followed by layer normalization. Each layer consists of two sub-layers with a residual connection (He et al., 2016) around each of them. We apply dropout (Srivastava et al., 2014) to the output of each sub-layer before it is added to the sub-layer input and normalized. Furthermore, dropout is applied to the sums of the word embeddings and the corresponding positional encodings in both the encoder and decoder stacks.

We set all dropout values in the network to 0.1. During training, we employ label smoothing with value ϵ_ls = 0.1. The output dimension produced by all sub-layers and embedding layers is d_model = 512. Each encoder and decoder layer contains a fully connected feed-forward network (FFN) with dimensionality d_model = 512 for the input and output and dimensionality d_ff = 2048 for the inner layers. For the scaled dot-product attention, the input consists of queries and keys of dimension d_k and values of dimension d_v. As multi-head attention parameters, we employ h = 8 parallel attention layers, or heads, and for each of these we use a dimensionality of d_k = d_v = d_model/h = 64. For optimization, we use the Adam optimizer (Kingma and Ba, 2015) with β₁ = 0.9, β₂ = 0.98, and ϵ = 10⁻⁹. The learning rate is varied throughout the training process: it increases for the first warmup_steps = 16,000 training steps and decreases afterwards as described in Vaswani et al. (2017). All remaining hyper-parameters are set analogously to those of the transformer base model. At training time, the batch size is set to 25K tokens, with a maximum sentence length of 256 subwords and a vocabulary size of 32K. After each epoch, the training data is shuffled. During decoding, we perform beam search with a beam size of 4. We use 32K BPE operations to train our BPE models. We use shared embeddings in all our experiments.
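Written out explicitly, the learning-rate schedule referenced above follows the formula of Vaswani et al. (2017), here with our warm-up value as the default; a minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=16000):
    """Learning rate at a given update, following Vaswani et al. (2017):
    linear warm-up for the first `warmup_steps` updates, then decay
    proportional to the inverse square root of the step number."""
    step = max(step, 1)  # guard against step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```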

6 Results

We present the results obtained by our systems for Hindi-Marathi in Table 5 and for Marathi-Hindi in Table 6 in terms of BLEU, RIBES, and TER. We apply our proposed method to train multilingual models in three different configurations. Table 4 shows the training data used for our CONTRASTIVE 1 (C1), CONTRASTIVE 2 (C2), and PRIMARY (P) submissions.

System    BLEU ↑   RIBES ↑   TER ↓
P          16.62     62.45   72.23
C2         15.42     61.02   73.59
C1         13.25     58.51   76.17

Table 5: Results for Hindi to Marathi translation, ranked by BLEU score.

System    BLEU ↑   RIBES ↑   TER ↓
P          24.53     66.23   66.39
C2         22.93     65.89   68.11
C1         22.69     65.01   68.13

Table 6: Results for Marathi to Hindi translation, ranked by BLEU score.

CONTRASTIVE 1 (C1) Our CONTRASTIVE 1 submission is a multilingual single system and does not use any monolingual back-translation data. The system is trained on the released HI-MR and MR-HI parallel data; in addition, we also use the EN-HI parallel data.

CONTRASTIVE 2 (C2) This submission is similar to CONTRASTIVE 1; however, in this case we used the 5M back-translated Marathi-Hindi and 5M back-translated Hindi-Marathi corpora. Source-side back-translated sentences begin with the additional token indicating the target language.

PRIMARY (P) Our primary submission is trained using the same setting as described for the CONTRASTIVE 2 system. The difference is that our primary system is an ensemble of three different CONTRASTIVE 2 systems initialized with three different random seeds.
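The exact combination method for the ensemble is not spelled out above; a common choice in NMT toolkits is to average the member models' next-token distributions at every beam-search step, as in this hedged sketch (next_token_probs is a hypothetical per-model interface, not an API from the paper):

```python
import numpy as np

def ensemble_step(models, state):
    """One decoding step of a probability-level ensemble: average the
    next-token distributions of the member models, then return log
    probabilities for beam search. `models` stands for the three
    CONTRASTIVE 2 systems trained with different random seeds."""
    probs = np.mean([m.next_token_probs(state) for m in models], axis=0)
    return np.log(probs + 1e-12)  # epsilon avoids log(0)
```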

7 Conclusion and Future Work

This paper presented the WIPRO-RIT system submitted to the Similar Language Translation shared task at WMT 2020. We presented the results obtained by our system in translating from Hindi to Marathi and from Marathi to Hindi. Our primary system achieved competitive performance, ranking first in Marathi to Hindi and second in Hindi to Marathi among 22 teams in terms of BLEU score.

In future work, we would like to further explore the similarity between these two languages when translating to other Indo-Aryan languages (e.g. Bengali, Bhojpuri, and Nepali). We expect the models presented in this paper to perform well for other Indo-Aryan languages provided that suitable training data is available. Furthermore, we would like to apply and evaluate our method on the two other groups of languages in the WMT SLT 2020 shared task: Romance languages (Catalan, Portuguese, and Spanish) and South Slavic languages (Croatian, Serbian, and Slovene). Finally, we will be incorporating the translation models presented in this paper into CATaLog, an open-source online CAT tool that provides users with both MT and TM outputs (Nayek et al., 2015; Pal et al., 2016).

Acknowledgments

We would like to thank the WMT 2020 SLT shared task organizers for making the Hindi-Marathi data available. We further thank the anonymous WMT reviewers for their insightful feedback and suggestions.

References

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 Conference on Machine Translation (WMT19). In Proceedings of WMT.

Pushpak Bhattacharyya. 2010. IndoWordNet. In Proceedings of LREC.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. 2014. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of WMT.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. 2016. Findings of the 2016 Conference on Machine Translation. In Proceedings of WMT.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, et al. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of WMT.

Marta R. Costa-jussà. 2017. Why Catalan-Spanish Neural Machine Translation? Analysis, Comparison and Combination with Standard Rule and Phrase-based Technologies. In Proceedings of VarDial.

Marta R. Costa-jussà, Marcos Zampieri, and Santanu Pal. 2018. A Neural Approach to Language Variety Translation. In Proceedings of VarDial.

Mihaela Găman, Dirk Hovy, Radu Tudor Ionescu, Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén, Nikola Ljubešić, Niko Partanen, Christoph Purschke, Yves Scherrer, and Marcos Zampieri. 2020. A Report on the VarDial Evaluation Campaign 2020. In Proceedings of VarDial.

Barry Haddow and Faheem Kirefu. 2020. PMIndia - A Collection of Parallel Corpora of Languages of India. arXiv preprint arXiv:2001.09907.

Salima Harrat, Karima Meftouh, and Kamel Smaili. 2019. Machine Translation for Arabic Dialects (Survey). Information Processing & Management, 56(2):262-273.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of CVPR.

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic Evaluation of Translation Quality for Distant Language Pairs. In Proceedings of EMNLP.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5:339-351.

Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of ICLR.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of ACL.

Anoop Kunchukuttan. 2020a. IndoWordNet Parallel Corpus. https://github.com/anoopkunchukuttan/indowordnet_parallel.

Anoop Kunchukuttan. 2020b. The IndicNLP Library. https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf.

Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2018. The IIT Bombay English-Hindi Parallel Corpus. In Proceedings of LREC.

Surafel M. Lakew, Aliia Erofeeva, and Marcello Federico. 2018. Neural Machine Translation into Language Varieties. arXiv preprint arXiv:1811.01064.

Thazin Myint Oo, Ye Kyaw Thu, and Khin Mar Soe. 2019. Neural Machine Translation between Myanmar (Burmese) and Rakhine (Arakanese). In Proceedings of VarDial.

Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. CATaLog: New Approaches to TM and Post Editing Interfaces. In Proceedings of NLP4TM.

Santanu Pal, Sudip Naskar, and Josef van Genabith. 2015. UdS-Sant: English-German Hybrid Machine Translation System. In Proceedings of WMT.

Santanu Pal, Marcos Zampieri, Sudip Kumar Naskar, Tapas Nayak, Mihaela Vela, and Josef van Genabith. 2016. CATaLog Online: Porting a Post-editing Tool to the Web. In Proceedings of LREC.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL.

Maja Popović, Alberto Poncelas, Marija Brkic, and Andy Way. 2020. Neural Machine Translation for Translating into Croatian and Serbian. In Proceedings of VarDial.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of ACL.

Pamela Shapiro and Kevin Duh. 2019. Comparing Pipelined and Integrated Approaches to Dialectal Arabic Neural Machine Translation. In Proceedings of VarDial.

Shashank Siripragada, Jerin Philip, Vinay P. Namboodiri, and C V Jawahar. 2020. A Multilingual Parallel Corpora Collection Effort for Indian Languages. In Proceedings of LREC.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of AMTA.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929-1958.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Proceedings of NIPS.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Ahmed Ali, Suwon Shon, James Glass, Yves Scherrer, Tanja Samardžić, Nikola Ljubešić, Jörg Tiedemann, Chris van der Lee, Stefan Grondelaers, Nelleke Oostdijk, Dirk Speelman, Antal van den Bosch, Ritesh Kumar, Bornini Lahiri, and Mayank Jain. 2018. Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign. In Proceedings of VarDial.
Marcos Zampieri, Shervin Malmasi, Yves Scherrer, Tanja Samardžić, Francis Tyers, Miikka Silfverberg, Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang, Radu Tudor Ionescu, Andrei Butnaru, and Tommi Jauhiainen. 2019. A Report on the Third VarDial Evaluation Campaign. In Proceedings of VarDial.
